You can download the data set here: datasets1.tar.gz
The training data contain 8 full days (day 0 to day 7) of data, plus 7 half days covering the mornings of day 8 to day 14. The test data contain the other 7 half days, covering the afternoons of day 8 to day 14. Each row of the training data starts with a timestamp, followed by an advertisement id, the click “truth” (0 for no click and 1 for click), and the associated user features. The user features are binary, and only the indices of the features with value 1 are listed in the row. The test data have the same format as the training data, except that the click “truth” is hidden.
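To make the row layout concrete, here is a minimal parsing sketch for a single training row. It rests on several assumptions that are not stated above: fields are separated by whitespace, the timestamp is an integer, and the advertisement id can be kept as an opaque string. The name `parse_training_row` and the sample row are hypothetical. How the hidden label appears in the test files is not specified, so the sketch covers training rows only.

```python
from typing import List, NamedTuple


class TrainingRow(NamedTuple):
    timestamp: int        # assumed to be an integer timestamp
    ad_id: str            # kept as an opaque string
    click: int            # 0 for no click, 1 for click
    features: List[int]   # indices of the user features whose value is 1


def parse_training_row(line: str) -> TrainingRow:
    """Parse one training row, assuming whitespace-separated fields."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(fields)}")
    click = int(fields[2])
    if click not in (0, 1):
        raise ValueError(f"click label must be 0 or 1, got {click}")
    # Everything after the label is a list of active feature indices.
    features = [int(tok) for tok in fields[3:]]
    return TrainingRow(int(fields[0]), fields[1], click, features)


# Hypothetical row, made up for illustration only.
row = parse_training_row("100 42 1 3 17 256")
```

Because only the active indices are listed, each row is already a sparse representation; it can be passed to a sparse-matrix builder or a hashing scheme without first expanding it to a dense 0/1 vector.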
In practice, you would not be able to peek at the day-9 morning data when predicting for the afternoon of day 8. For this competition, we have decided not to impose such a restriction. However, you are strongly encouraged to check how much your model benefits from “peeking” at future data, and to recommend the best approach based on a realistic (no-peeking) scenario.
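One way to set up the no-peeking comparison is to restrict the training rows to those recorded before the moment of prediction, then train once on that subset and once on everything. The sketch below assumes each row is a tuple whose first element is a comparable timestamp; `rows_available_at` and the toy values are hypothetical.

```python
from typing import List, Sequence, Tuple

Row = Tuple[int, ...]  # assumed layout: the timestamp comes first


def rows_available_at(rows: Sequence[Row], prediction_time: int) -> List[Row]:
    """Keep only rows strictly earlier than prediction_time (no peeking)."""
    return [r for r in rows if r[0] < prediction_time]


# Toy rows of the form (timestamp, click); the values are made up.
toy_rows = [(10, 0), (20, 1), (30, 0), (40, 1)]
realistic = rows_available_at(toy_rows, prediction_time=30)
```

Comparing a model trained on `realistic` with one trained on all rows gives a rough measure of how much the "peeked" data contributes.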
The data sets are derived from the Yahoo! R6B data, which is intended for predicting whether a user will be interested in a given news article. In the interest of fairness, you are not allowed to download the original Yahoo! data at any time. However, you are welcome to read the descriptions of the data.
Submission format:
Each line of the submission file is a floating-point number in [0, 1], and line numbers correspond to instances in the test data; for example, line 10 of the submission file is the answer for the 10th instance in test_half_8. You can refer to the sample submission.
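The following sketch shows one way to turn an ordered list of predictions into submission text with one value per line. The function name `format_submission` and the six-decimal formatting are assumptions; the sample submission remains the authoritative reference for the exact layout.

```python
from typing import Iterable


def format_submission(predictions: Iterable[float]) -> str:
    """Return submission text: one float in [0, 1] per line, in test order."""
    lines = []
    for line_no, value in enumerate(predictions, start=1):
        value = float(value)
        # The chained comparison is also False for NaN, so NaN is rejected.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"line {line_no}: {value} is outside [0, 1]")
        lines.append(f"{value:.6f}\n")
    return "".join(lines)


# Line k of the text corresponds to the k-th test instance.
text = format_submission([0.0, 0.25, 1])
```

The returned string can be written to a file with `open(path, "w").write(text)`. Validating the range before writing catches out-of-range scores early, for example raw margins from a model that have not been mapped to [0, 1].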