Data
Download data: ML_final_release.tar.bz2
Data Description
There are several files which contain all the information we have about these logs:
- enrollment_train/test.csv: maps each enrollment ID to a student and a course.
  - enrollment_id: enrollment ID
  - username: student ID
  - course_id: course ID
- log_train/test.csv: the event logs for each enrollment.
  - enrollment_id: enrollment ID
  - time: the time of the event
  - event_source: event source (server or browser)
  - object: the object related to the event (see object.csv)
  - event_type: the type of the event, one of:
    - problem: operations on course problems
    - video: operations on course videos
    - access: accessing other courseware objects
    - wiki: accessing the course wiki
    - discussion: accessing the course forum
    - navigate: navigating to another part of the course
    - page_close: closing the web page
- object.csv: contains information about the course modules. Each course is represented as a tree of modules. For instance, a course contains multiple chapter modules, a chapter contains sequentials, and a sequential contains verticals and videos.
  - course_id: the course to which the module belongs
  - module_id: the ID of a courseware module
  - category: the category of the courseware module
  - children: the IDs of the module's child modules
  - start: the time at which the module was released to students
- sampleSubmission.csv: The required submission file should be a 24109×2 matrix, with no header or other information, like this file. The first column should be the enrollment ID, and the second column your prediction (float or 0/1). The two columns should be separated by a comma. An error will be reported if a submission file is in the wrong format.
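The submission format above can be sketched as a small writer. This is only an illustration, not an official tool: the function name `write_submission` and its `expected_rows` parameter are made up here, and the only facts taken from the description are the row count, the two comma-separated columns, and the absence of a header.

```python
import csv
import io

EXPECTED_ROWS = 24109  # row count stated in the data description


def write_submission(predictions, out, expected_rows=EXPECTED_ROWS):
    """Write (enrollment_id, prediction) pairs as a headerless two-column CSV.

    `predictions` is a list of (enrollment_id, prediction) tuples, where the
    prediction is a float or 0/1. `out` is any writable text file object.
    """
    if len(predictions) != expected_rows:
        raise ValueError(
            f"expected {expected_rows} rows, got {len(predictions)}"
        )
    writer = csv.writer(out, lineterminator="\n")
    for enrollment_id, prediction in predictions:
        # Exactly two columns, separated by a comma, and no header row.
        writer.writerow([enrollment_id, prediction])


# Toy usage with two made-up rows; a real submission keeps the default
# expected_rows and writes to a file opened with open(path, "w", newline="").
buf = io.StringIO()
write_submission([(1, 0.25), (3, 1)], buf, expected_rows=2)
print(buf.getvalue())
```

Checking the row count before writing catches the most likely format error locally instead of at submission time.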
For your convenience, the TAs have provided some basic features that can be used directly to train your models. The features are extracted from log_train.csv. These basic features should only be a starting point, however: feature engineering is an important part of solving real-world problems, so you are strongly encouraged to extract your own features to get better performance.
- sample_train/test_x.csv
  - ID: enrollment ID
  - user_log_num: total number of logs of the user (student) across all courses
  - course_log_num: total number of logs belonging to the course
  - take_course_num: number of courses the user takes
  - take_user_num: number of users who take the course
  - log_num: number of logs belonging to the enrollment
  - (event_source)-(event_type): 9 dimensions, the number of logs for each combination of event_source and event_type (refer to log_train/test.csv)
  - (chapter/sequential/video)_count: 3 dimensions, the number of logs whose object belongs to each of these categories
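As a starting point for your own feature extraction, two of the basic features above (log_num and the (event_source)-(event_type) counts) could be recomputed from the logs roughly as follows. This is a minimal sketch, not the TAs' actual extraction code. It assumes the log file has a header row with the column names listed earlier. The helper name `count_events`, the sample rows, and the time format are invented for illustration.

```python
import csv
import io
from collections import Counter, defaultdict


def count_events(log_rows):
    """Count logs per enrollment, in total and per source-type pair.

    `log_rows` is an iterable of dicts with at least the keys
    enrollment_id, event_source and event_type.
    """
    log_num = Counter()                 # enrollment_id -> number of logs
    pair_counts = defaultdict(Counter)  # enrollment_id -> "source-type" -> count
    for row in log_rows:
        eid = row["enrollment_id"]
        log_num[eid] += 1
        pair_counts[eid][f'{row["event_source"]}-{row["event_type"]}'] += 1
    return log_num, pair_counts


# Made-up sample in the documented column layout (not real data; the
# header row and the time format are assumptions).
SAMPLE_LOG = """enrollment_id,time,event_source,object,event_type
1,2014-06-14T09:38:29,server,obj_a,navigate
1,2014-06-14T09:38:39,server,obj_b,access
1,2014-06-14T09:39:10,browser,obj_c,video
2,2014-06-15T10:00:00,browser,obj_c,page_close
"""

log_num, pair_counts = count_events(csv.DictReader(io.StringIO(SAMPLE_LOG)))
print(log_num["1"], dict(pair_counts["1"]))
```

To run this on the real logs, pass `csv.DictReader(open("log_train.csv", newline=""))` instead of the sample. Note that enrollment IDs stay strings when read this way.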