MoocData

Name	Size	Keywords	Download link	Reference
Course Concept Extraction	21.6 MB	Concept Extraction, Key-phrase Extraction	data	IJCNLP'17

Course Concept Extraction

This is the complete dataset for the paper "Course Concept Extraction in MOOCs via Embedding-Based Graph Propagation" (IJCNLP 2017).

CSEN, CSZH, EcoEN, and EcoZH are the evaluation datasets described in the paper. All data files are in standard JSON format. Each dataset contains two files: captions and candidates.

1. .captions file
Video captions of the MOOC courses in the dataset; each line represents one video. The text has been tokenized and POS-tagged. For CSZH and EcoZH, we use Ansj (https://github.com/NLPchina/ansjseg) for word segmentation and POS tagging. For CSEN and EcoEN, we use the POS tagger implemented by the Stanford NLP Group (http://nlp.stanford.edu/software/tagger.shtml).

2. .candidates file
Candidate course concepts extracted from the dataset. The "label" field is the human-annotated label for a candidate: "1" marks a course concept and "0" marks a non-concept.
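
As a rough illustration of reading these files, the sketch below parses a file line by line and counts the annotated labels. It assumes each line holds one JSON object (the description above states this only for the captions file; if a file turns out to be a single JSON array, json.load on the whole file would be the right call instead). The only field name it relies on is "label", which comes from the description above; the helper names load_json_lines and count_labels and the file name in the usage comment are hypothetical.

```python
import json
from collections import Counter


def load_json_lines(path):
    """Parse a file holding one JSON value per line, skipping blank lines.

    Assumption: the file is line-delimited JSON, not a single JSON array.
    """
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def count_labels(candidates):
    """Count candidates per "label" value.

    int() accepts both the string form ("1"/"0") and the numeric form (1/0),
    since the exact JSON type of the field is not stated in this README.
    """
    return Counter(int(record["label"]) for record in candidates)


# Usage (hypothetical file name; substitute the path from the download):
# candidates = load_json_lines("CSEN.candidates")
# print(count_labels(candidates))
```

The same loader can be pointed at a .captions file to iterate over videos; the structure of each caption record is not described here, so inspect one parsed line before relying on specific keys.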

You may use the dataset to test your concept or key-phrase extraction model, or for other related tasks.
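
One possible way to score a model against the human-annotated labels is precision, recall, and F1 over the candidate list. The sketch below is a generic implementation of those standard metrics and is not necessarily the evaluation protocol used in the paper. It assumes you have already produced one 0/1 prediction per candidate, in the same order as the gold labels; the function name precision_recall_f1 is hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall, and F1 for the positive class (label 1).

    gold, predicted: equal-length sequences of 0/1 labels aligned by candidate.
    Returns 0.0 for any metric whose denominator would be zero.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == 1 and p == 1)
    false_pos = sum(1 for g, p in pairs if g == 0 and p == 1)
    false_neg = sum(1 for g, p in pairs if g == 1 and p == 0)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy example with made-up labels (not taken from the dataset):
gold_labels = [1, 0, 1, 1, 0]
model_labels = [1, 1, 0, 1, 0]
print(precision_recall_f1(gold_labels, model_labels))
```

Gold labels can be obtained from the "label" field of each candidate, converted to int so that "1" and 1 are treated alike.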