Name | Size | Keywords | Download link | Reference
MOOCCube | 1.01 GB | MOOC data | | ACL'20
MOOCCube_DS | 7 KB | MOOC data | | ACL'20


This page publishes the dataset introduced in the ACL 2020 paper "MOOCCube: A Large-scale Data Repository for NLP applications in MOOCs". More details can be found in the paper.


1. Differences between the prerequisite pairs in prerequisite-dependency.json and prerequisite-prediction.json.
prerequisite-dependency.json provides manually labeled ground truth for prerequisite pairs between concepts. prerequisite-prediction.json contains the predictions of a neural network model. Its "label" field takes the value -1, 0, or 1, meaning respectively: no manual label, manually labeled as irrelevant, or manually labeled as a prerequisite-dependency pair. Its "predict" field gives the probability, as estimated by the model, of the label being 0 or 1.
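
As a rough illustration of how the two fields could be used together, the sketch below groups prediction records by their "label" value and picks out unlabeled pairs with a high "predict" score. The record layout (dicts whose "predict" value is a single float), the "pair" key, the sample IDs, and the 0.5 threshold are all assumptions made for the example; check the actual file before relying on them.

```python
from collections import defaultdict

# Meaning of the "label" values, as described above.
LABEL_MEANING = {
    -1: "no manual label",
    0: "manually labeled as irrelevant",
    1: "manually labeled as a prerequisite-dependency pair",
}


def split_by_label(records):
    """Group prediction records by their "label" value."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["label"]].append(rec)
    return groups


def confident_unlabeled(records, threshold=0.5):
    """Return unlabeled records (label == -1) scoring at least `threshold`."""
    return [r for r in records if r["label"] == -1 and r["predict"] >= threshold]


# Hypothetical records; the real file layout and key names may differ.
sample = [
    {"pair": ("K_a", "K_b"), "label": 1, "predict": 0.93},
    {"pair": ("K_a", "K_c"), "label": 0, "predict": 0.08},
    {"pair": ("K_b", "K_c"), "label": -1, "predict": 0.71},
    {"pair": ("K_c", "K_d"), "label": -1, "predict": 0.12},
]

groups = split_by_label(sample)
for label, recs in sorted(groups.items()):
    print(label, LABEL_MEANING[label], len(recs))

candidates = confident_unlabeled(sample)
print([r["pair"] for r in candidates])
```

Treating unlabeled, high-scoring pairs as candidates for later manual review is only one possible use of the file, not something the dataset prescribes.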

2. Meaning of the fields in user_video_act.json.
  • watching_count: number of times the user watched this segment (not necessarily the whole segment each time)
  • video_duration: total length of the video
  • video_progress_time: how long the user played the video (including sped-up playback)
  • video_start_time: the position where the user started watching
  • video_end_time: the position where the user stopped watching
  • local_watching_time: the actual time the user spent watching
  • local_start_time: the earliest time the user started watching
  • local_end_time: the latest time the user stopped watching
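
To make the relationship between these fields concrete, here is a minimal sketch that stores one record in a dataclass and derives two quantities from it. It assumes that all durations and positions are numbers in the same unit (for example seconds), and it interprets video_progress_time divided by local_watching_time as an average playback speed; neither assumption is confirmed by the dataset description, and the sample values are invented.

```python
from dataclasses import dataclass


@dataclass
class WatchRecord:
    """One user-video record, with the fields of user_video_act.json."""
    watching_count: int
    video_duration: float
    video_progress_time: float
    video_start_time: float
    video_end_time: float
    local_watching_time: float
    local_start_time: float
    local_end_time: float


def spanned_fraction(rec):
    """Share of the video between start and end position, clamped to [0, 1]."""
    if rec.video_duration <= 0:
        return 0.0
    span = rec.video_end_time - rec.video_start_time
    return max(0.0, min(1.0, span / rec.video_duration))


def average_speed(rec):
    """Played video time per unit of actual watching time (assumed meaning)."""
    if rec.local_watching_time <= 0:
        return None
    return rec.video_progress_time / rec.local_watching_time


# Invented example values, assuming seconds and numeric timestamps.
rec = WatchRecord(
    watching_count=2,
    video_duration=600.0,
    video_progress_time=450.0,
    video_start_time=0.0,
    video_end_time=300.0,
    local_watching_time=300.0,
    local_start_time=1_500_000_000.0,
    local_end_time=1_500_000_400.0,
)

print(spanned_fraction(rec))
print(average_speed(rec))
```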

What is MOOCCube?

MOOCCube is an open data repository for researchers in natural language processing, knowledge graphs, data mining, and other fields who are interested in massive open online courses (MOOCs). It contains 706 MOOC courses, 38,181 videos, 114,563 concepts, and 199,199 real MOOC users. The repository also contains a large-scale concept graph and related academic papers as additional resources for further use.

The concept graph information is extracted from Baidu Baike and Wikipedia. The course and student activity data come from the production environment of XuetangX, one of the largest MOOC websites in China. The academic resources are provided by AMiner, an academic project that provides comprehensive search and mining services for researcher social networks. The data went through three stages (automatic filtering, crowdsourced annotation, and expert annotation) before being assembled into the MOOCCube repository.

Data Composition and Download

MOOCCube contains two parts, the main repository MOOCCube and the special course repository MOOCCube_DS:

  • Main Repository: the MOOCCube data repository described in the paper. It takes concepts, courses, and student behavior as its three main dimensions and supports multiple ways of combining the data to serve different teaching and research needs. Its data are described below.
  • Special Course Repository: MOOCCube_DS is a special component of MOOCCube that was annotated in finer detail according to the actual teaching needs of the "data structure" course. It contains more information than the course dimension of the main repository, but less data overall. This part of the project is still being updated, so please stay tuned!

Entities

The MOOCCube dataset contains the following types of entities:

type | ID prefix | important fields | file
concept | K_ | name, en, explanation | concept.json
course | C_ | name, about, core_id, video_order, video_name, chapter | course.json
paper | P_ | title, author, venue, abstract, year, num_citation, ... | paper.json
school | S_ | name, about | school.json
teacher | T_ | name, about | teacher.json
user | U_ | name, course_order, enroll_time | user.json
video | V_ | name, duration, start, end, text | video.json
taxonomy | K_T_ | name | concept.json

Two courses have the same core_id if their video sets intersect. The name field of user entities is randomly generated.
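
The ID prefixes and the core_id rule above lend themselves to two small helpers, sketched here. The first maps an ID to its entity type using the prefixes from the table; the second groups course records that share a core_id. The function names, the "id" key on course records, and the sample IDs are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Prefixes from the entity table. "K_T_" comes before "K_" because every
# taxonomy ID also starts with "K_".
PREFIX_TO_TYPE = [
    ("K_T_", "taxonomy"),
    ("K_", "concept"),
    ("C_", "course"),
    ("P_", "paper"),
    ("S_", "school"),
    ("T_", "teacher"),
    ("U_", "user"),
    ("V_", "video"),
]


def entity_type(entity_id):
    """Return the entity type for an ID, or None if no prefix matches."""
    for prefix, name in PREFIX_TO_TYPE:
        if entity_id.startswith(prefix):
            return name
    return None


def group_by_core_id(courses):
    """Map each core_id to the IDs of the courses that carry it."""
    groups = defaultdict(list)
    for course in courses:
        groups[course["core_id"]].append(course["id"])
    return dict(groups)


# Hypothetical IDs and records, for illustration only.
print(entity_type("K_T_example"))
print(entity_type("K_example"))

courses = [
    {"id": "C_course-1", "core_id": "core-a"},
    {"id": "C_course-2", "core_id": "core-a"},
    {"id": "C_course-3", "core_id": "core-b"},
]
print(group_by_core_id(courses))
```

Checking the longer "K_T_" prefix first is the one detail that matters here; a plain dictionary lookup on "K_" would misclassify taxonomy nodes as concepts.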

Relations
  • concept-field
  • concept-paper
  • course-concept
  • course-video
  • parent-son (taxonomy)
  • prerequisite-dependency
  • school-course
  • school-teacher
  • teacher-course
  • user-course
  • user-video
  • video-concept

Additional information
  • concept_information: more text data for 105,379 concepts.
  • user_video_act: video-watching behavior of 48,640 users, filtered to users who enrolled in at least 4 courses and watched at least 10 videos.
  • prerequisite_prediction: prerequisite relation predictions generated by a GCN classifier on a small subset of concepts (700*700 concept pairs).
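
The user filter described for user_video_act can be expressed as a simple threshold check. The sketch below is not the authors' filtering code; it assumes the user-course and user-video relations have already been loaded into dictionaries mapping a user ID to a list of course or video IDs, and all names and sample IDs are hypothetical.

```python
def eligible_users(user_courses, user_videos, min_courses=4, min_videos=10):
    """Return sorted IDs of users with enough distinct courses and videos."""
    return sorted(
        user
        for user, courses in user_courses.items()
        if len(set(courses)) >= min_courses
        and len(set(user_videos.get(user, ()))) >= min_videos
    )


# Hypothetical in-memory relations, for illustration only.
user_courses = {
    "U_1": ["C_1", "C_2", "C_3", "C_4"],
    "U_2": ["C_1", "C_2"],
    "U_3": ["C_1", "C_2", "C_3", "C_4", "C_5"],
}
user_videos = {
    "U_1": [f"V_{i}" for i in range(12)],
    "U_2": [f"V_{i}" for i in range(30)],
    "U_3": [f"V_{i}" for i in range(5)],
}

selected = eligible_users(user_courses, user_videos)
print(selected)
```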


The question answering dataset contains 1-hop and multi-hop questions composed from seven types of entities and their relationships in the MOOC scenario (1-hop: 25,212 questions; multi-hop: 28,099 questions).

Applications of MOOCCube

MOOCCube can provide datasets to support multiple research topics related to MOOCs, including:

  • Course Recommendation
  • Student Performance Prediction
  • Course Concept Extraction
  • Prerequisite Relation Learning
  • ...

Xiaomu

Xiaomu is an intelligent assistant deployed on the main XuetangX site. It provides teaching-support functions such as course question answering and proactive questioning. The concept portion of its background knowledge base is mainly provided by MOOCCube.