Edo Liberty's homepage

Class created given by Edo liberty.

Data Mining

Data Mining is concerned with efficiently extracting statistics, patterns, structures, or meanings from raw data. This task becomes hard when the amount of data is large, which is often the case in modern data sets.

This year the class will focus mostly on the streaming model. In this model the data is presented to the algorithm one item at a time (items can be: integers, strings, documents, sets, matrix entries, etc...). The algorithm must process the stream under sever space limitations. This is a common scenario in modern mining of massive data sets and presents interesting algorithmic challenges.

We will cover among other topics: sampling, finding frequent items, counting distinct elements, general frequency moment estimation, finding frequent item sets, clustering, dimensionality reduction, support vector machines, online convex optimization, linear regression, and matrix approximation.

I highly recommend that students be familiar with probability theory, combinatorics, linear algebra, basic complexity, and data structures. The class will attempt to be self contained but this is not always possible. Moreover, the class is devoted to ideas, algorithms, and proofs. Students who are interested in explicit data mining applications should not register. To quote a student feedback "you think you're going to get an interesting course and you end up getting a load of math".

Class Details

The class takes place 17:00 to 20:00 every Monday.

Undergraduate students:

Assignments (50% of the grade).
Final exam (50% of the grade).

Masters and PhD students:

Assignments (50% of the grade).
Either final exam or Final Project (50% of the grade).
- Should require roughly two weeks' worth of effort.
- Projects can contain both theoretical and experimental elements.

Last Year's Exams

The following are the Moed Alef and Beit exams from the last two years:

Student Projects

You should submit one PDF document containing your project motivation, background, data review, mathematical derivations, experimental results and discussion. You should include any other information you see fit, for example, pseudocode etc.

You are encouraged to use data from Yahoo!'s Webscope. This is a great source for a lot of interesting datasets that Yahoo! occasionally releases. The only thing is that you need to ask for permission to download using your university (.edu) email address. This is simply to identify you as students or researchers. This is strictly procedural and should take not more than a day or so.

The following are a few examples of student projects from previous years. Note that we do not study exactly the same topics every year: project_1 project_2 project_3 project_4 project_5 project_6

The deadline for project submission is May 19th 2013

Discussion Group

http://groups.google.com/group/algorithms-in-data-mining-tau

Class Notes and Materials

Lesson 1: Introduction presentation.
Lesson 1: Probability recap
Lesson 1: Mark and recapture
Lesson 2: The Chernoff bound
Lesson 2: Sampling from streams
Lesson 3: Item frequencies in streams
Lesson 4: Frequency Moment estimation in Streams
Lesson 4: First assignment, due Dec 3rd
Lesson 5: Random projection
Lesson 6: Fast random projection
Lesson 7: Singular Value Decomposition
Lesson 8: Matrix sampling and rank-k approximation
Lesson 8: First assignment possible answers
Lesson 8: Second assignment, Due Dec 31st
Lesson 9: Matrix approximation continued
Lesson 10: k-means clustering
Lesson 11: Nearest neighbor search and the curse of dimensionality
Lesson 12: Approximate nearest neighbor search