Name | Size | Keywords | Download link | Reference |
---|---|---|---|---|
Course Concept Extraction | 21.6 MB | Concept Extraction, Key-phrase Extraction | data | IJCNLP'17 |
Course Concept Extraction
This is the complete dataset for the paper "Course Concept Extraction in MOOCs via Embedding-Based Graph Propagation" (IJCNLP 2017).
CSEN, CSZH, EcoEN, and EcoZH are the evaluation datasets described in the paper. All data files are in standard JSON format. Each dataset contains two files: captions and candidates.
1. .captions file
Video captions of the MOOC courses in the dataset. Each line represents one video.
The text has been tokenized and POS-tagged.
For CSZH and EcoZH, we use Ansj (https://github.com/NLPchina/ansjseg) for word segmentation and POS tagging.
For CSEN and EcoEN, we use the POS tagger implemented by the Stanford NLP Group (http://nlp.stanford.edu/software/tagger.shtml).
2. .candidates file
Candidate course concepts extracted from the dataset.
The "label" field is the human-annotated label for a candidate.
"1" marks a course concept and "0" marks a non-concept.
You may use the dataset to evaluate your concept or key-phrase extraction model, or for other related tasks.
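As a starting point, the files described above could be read with a short Python sketch like the one below. It is not part of the original release and rests on assumptions: the exact record layout is not documented here, so the reader accepts either a single JSON document or one JSON value per line, and only the "label" field named above is accessed. The helper names `read_json_records` and `label_counts` are hypothetical.

```python
import json
from collections import Counter


def read_json_records(path):
    """Return a list of records from a JSON file.

    Assumption: the file is either one JSON document (for example an
    array of records) or one JSON value per line. Both are tried.
    """
    with open(path, encoding="utf-8") as f:
        text = f.read()
    try:
        data = json.loads(text)
        # A single top-level value is wrapped so callers always get a list.
        return data if isinstance(data, list) else [data]
    except json.JSONDecodeError:
        # Fall back to one JSON value per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]


def label_counts(candidates):
    """Count candidates per "label" value.

    Labels are normalized to strings because it is not stated whether
    they are stored as "1"/"0" or as 1/0. Records without a "label"
    field (or that are not JSON objects) are skipped.
    """
    return Counter(
        str(c["label"])
        for c in candidates
        if isinstance(c, dict) and "label" in c
    )
```

For example, assuming a candidates file named `CSEN.candidates` (the exact file name is a guess), `label_counts(read_json_records("CSEN.candidates"))` would give the number of candidates marked as concepts and non-concepts. The same reader can be pointed at a captions file to get one record per video.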