MoocData

Name	Size	Keywords	Download link	Reference
MOOCCube	1.01 GB	MOOC	data	ACL'20
MOOCCube_DS	7 KB	MOOC	data	ACL'20

MOOCCube

This page releases the dataset introduced in the ACL 2020 paper "MOOCCube: A Large-scale Data Repository for NLP applications in MOOCs". See the paper for more details.

FAQ

1. Difference between the pairs in prerequisite-dependency.json and prerequisite-prediction.json.
prerequisite-dependency.json provides manually labeled ground-truth prerequisite pairs between concepts. prerequisite-prediction.json builds on these with the predictions of a neural network model. Its "label" field is -1, 0, or 1, meaning respectively that the pair has no manual label, was manually labeled as unrelated, or was manually labeled as a prerequisite pair. Its "predict" field gives the model's predicted probability that the label is 0 or 1.
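
The label convention above can be sketched in code. This is a minimal illustration, not the authors' loader: the record layout (a dict per pair), the "c1"/"c2" keys, the concept IDs, and the scalar shape of "predict" are all assumptions made for the example. Only the "label" values and their meanings come from the description above.

```python
from collections import defaultdict

# Meaning of each "label" value, as described in the FAQ entry above.
LABEL_MEANING = {
    -1: "unlabeled",         # no manual label for this pair
    0: "not_prerequisite",   # manually labeled as unrelated
    1: "prerequisite",       # manually labeled as a prerequisite pair
}


def split_by_label(records):
    """Group prediction records by the meaning of their "label" value."""
    groups = defaultdict(list)
    for record in records:
        groups[LABEL_MEANING[record["label"]]].append(record)
    return dict(groups)


# Hypothetical records: keys other than "label" and "predict" are illustrative,
# and "predict" is shown as a single float only for simplicity.
sample = [
    {"c1": "K_a", "c2": "K_b", "label": 1, "predict": 0.93},
    {"c1": "K_a", "c2": "K_c", "label": 0, "predict": 0.08},
    {"c1": "K_b", "c2": "K_c", "label": -1, "predict": 0.61},
]

groups = split_by_label(sample)
```

Pairs in the "unlabeled" group are the ones for which only the model prediction is available, so they may deserve separate treatment from the manually checked pairs.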

2. Meaning of the fields in user_video_act.json.
watching_count: number of times the user watched this segment (not necessarily the whole segment each time)
video_duration: total length of the video
video_progress_time: how long the user played the video (including sped-up playback)
video_start_time: the position where the user started watching
video_end_time: the position where the user stopped watching
local_watching_time: the user's actual watching time
local_start_time: the earliest time the user started watching
local_end_time: the latest time the user stopped watching
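
As a rough illustration of how these fields might be combined, the sketch below derives two ratios from a single watching record. Several things here are assumptions rather than documented facts: that all durations and positions share one unit (seconds), that a record is a flat dict with exactly these keys, and that dividing video_progress_time by local_watching_time approximates the playback speed. The function name and the sample values are hypothetical.

```python
def watch_summary(act):
    """Derive two simple ratios from one watching record.

    coverage:       fraction of the video between the start and end positions
    playback_ratio: played video time divided by actual watching time
                    (an assumed proxy for playback speed)
    """
    span = act["video_end_time"] - act["video_start_time"]
    duration = act["video_duration"]
    watched = act["local_watching_time"]
    coverage = span / duration if duration else 0.0
    playback_ratio = act["video_progress_time"] / watched if watched else 0.0
    return {"coverage": coverage, "playback_ratio": playback_ratio}


# Hypothetical record: the user covered the first half of a 600-unit video
# in 200 units of actual watching time.
sample_act = {
    "watching_count": 1,
    "video_duration": 600,
    "video_progress_time": 300,
    "video_start_time": 0,
    "video_end_time": 300,
    "local_watching_time": 200,
}

summary = watch_summary(sample_act)
```

Because watching_count can be greater than one, a span between the start and end positions does not guarantee that every part of it was actually watched; the coverage value is best read as an upper bound.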

What is MOOCCube?

MOOCCube is an open data repository for researchers in natural language processing, knowledge graphs, data mining, and related areas who work on massive open online courses (MOOCs). It contains 706 real online courses, 38,181 teaching videos, 114,563 concepts, and hundreds of thousands of course enrollment and video watching records from 199,199 MOOC users. It also includes a concept graph built from relations between concepts, such as prerequisite and hypernym-hyponym relations, and a supplementary resource library of hundreds of thousands of academic papers related to course concepts.
The descriptions of concepts and entities are extracted from Baidu Baike and Wikipedia. The course data and student activity data come from the real usage environment of XuetangX, one of the largest MOOC websites in China. The academic paper data is provided by Aminer, a large-scale academic search engine that provides comprehensive search and mining services for researcher social networks. The data went through three stages (automated filtering, crowdsourced annotation, and expert annotation) before being assembled into the MOOCCube repository.

Data Composition and Download

MOOCCube contains two parts, the main repository MOOCCube and the special course repository MOOCCube_DS:

Main Repository: The MOOCCube data repository described in the paper. It takes concepts, courses, and student behavior as its three main dimensions and supports multiple ways of combining the data to serve different teaching and research needs. Its architecture and data are described below.
Special Course Repository: MOOCCube_DS is a special component of MOOCCube, finely annotated according to the actual teaching needs of the "Data Structure" course. It contains more information than the course dimension of the main repository, but a smaller volume of data. This project is still being updated, so please stay tuned!

Entities

The MOOCCube dataset contains the following types of entities:

type	prefix of id	important fields	file
concept	K_	name, en, explanation	concept.json
course	C_	name, about, core_id, video_order, video_name, chapter	course.json
paper	P_	title, author, venue, abstract, year, num_citation, ...	paper.json
school	S_	name, about	school.json
teacher	T_	name, about	teacher.json
user	U_	name, course_order, enroll_time	user.json
video	V_	name, duration, start, end, text	video.json
taxonomy	K_T_	name	concept.json

Two courses have the same core_id if their video sets intersect. The name field of user entities is a randomly generated name.
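
The ID prefixes in the table and the core_id note can be sketched as two small helpers. This is an illustrative sketch only: the function names, the "id" key on course records, and the sample IDs are assumptions, while the prefixes and the core_id field come from the table above. Note that "K_T_" has to be tested before "K_", since every taxonomy ID also starts with "K_".

```python
from collections import defaultdict

# Prefixes from the entity table. "K_T_" is listed before "K_" so that the
# longer, more specific prefix is matched first.
PREFIX_TO_TYPE = [
    ("K_T_", "taxonomy"),
    ("K_", "concept"),
    ("C_", "course"),
    ("P_", "paper"),
    ("S_", "school"),
    ("T_", "teacher"),
    ("U_", "user"),
    ("V_", "video"),
]


def entity_type(entity_id):
    """Return the entity type for an ID, or None if no prefix matches."""
    for prefix, name in PREFIX_TO_TYPE:
        if entity_id.startswith(prefix):
            return name
    return None


def group_by_core_id(courses):
    """Map each core_id to the IDs of the courses that share it."""
    groups = defaultdict(list)
    for course in courses:
        groups[course["core_id"]].append(course["id"])
    return dict(groups)


# Hypothetical course records; real records carry more fields.
sample_courses = [
    {"id": "C_course_1", "core_id": "core_A"},
    {"id": "C_course_2", "core_id": "core_A"},
    {"id": "C_course_3", "core_id": "core_B"},
]

course_groups = group_by_core_id(sample_courses)
```

Grouping by core_id is one way to treat reruns or variants of the same course as a single unit, which may matter when splitting data for recommendation experiments.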

Relations

concept-field
concept-paper
course-concept
course-video
parent-son (taxonomy)
prerequisite-dependency
school-course
school-teacher
teacher-course
user-course
user-video
video-concept

Additional information

concept_information: more text data for 105,379 concepts.
user_video_act: video watching behavior of 48,640 users, filtered to users who enrolled in at least 4 courses and watched at least 10 videos.
prerequisite_prediction: prerequisite relation predictions generated by a GCN classifier on a small subset of concepts (700*700 concept pairs).
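
The filtering rule for user_video_act (at least 4 courses and at least 10 videos) could be reproduced along the following lines. This is a sketch under assumptions, not the authors' preprocessing script: it presumes the user-course and user-video relations have already been loaded into dicts that map a user ID to a collection of course or video IDs, and the function name and sample data are hypothetical.

```python
def active_users(user_courses, user_videos, min_courses=4, min_videos=10):
    """Return the sorted IDs of users who meet both activity thresholds."""
    return sorted(
        user
        for user, courses in user_courses.items()
        if len(courses) >= min_courses
        and len(user_videos.get(user, ())) >= min_videos
    )


# Hypothetical relations: U_1 meets both thresholds, U_2 has too few courses,
# and U_3 has no video records at all.
user_courses = {
    "U_1": ["C_1", "C_2", "C_3", "C_4"],
    "U_2": ["C_1"],
    "U_3": ["C_1", "C_2", "C_3", "C_4", "C_5"],
}
user_videos = {
    "U_1": [f"V_{i}" for i in range(10)],
    "U_2": [f"V_{i}" for i in range(12)],
}

kept = active_users(user_courses, user_videos)
```

If the loaded relations can contain repeated IDs, converting each collection to a set before counting would make the thresholds apply to distinct courses and videos.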

MOOCQA

A question answering dataset of 1-hop and multi-hop reasoning questions built from seven types of entities, such as courses and concepts, and the relations between them in the MOOC setting (25,212 1-hop and 28,099 multi-hop questions).

Applications of MOOCCube

MOOCCube can support a range of MOOC-related research topics, including:

Course Recommendation
Student Performance Prediction
Course Concept Extraction
Prerequisite Relation Learning
...

Xiaomu

Xiaomu is an intelligent assistant bot deployed on the main XuetangX MOOC site. It provides teaching support functions such as answering course questions and proactively asking questions. The concept portion of its backend knowledge base is mainly provided by MOOCCube.