Edo Liberty Principal Scientist Amazon (Machine Learing, AWS) New York, NY edo@edoliberty.com |

Data Mining is concerned with efficiently extracting statistics, patterns, structures, or meanings from raw data. This task becomes hard when the amount of data is large, which is often the case in modern data sets.

This year the class will focus mostly on the streaming model. In this model the data is presented to the algorithm one item at a time (items can be: integers, strings, documents, sets, matrix entries, etc...). The algorithm must process the stream under sever space limitations. This is a common scenario in modern mining of massive data sets and presents interesting algorithmic challenges.

We will cover among other topics: sampling, finding frequent items, counting distinct elements, general frequency moment estimation, finding frequent item sets, clustering, dimensionality reduction, support vector machines, online convex optimization, linear regression, and matrix approximation.I highly recommend that students be familiar with probability theory, combinatorics, linear algebra, basic complexity, and data structures. The class will attempt to be self contained but this is not always possible. Moreover, the class is devoted to ideas, algorithms, and proofs. Students who are interested in explicit data mining applications should not register. To quote a student feedback "you think you're going to get an interesting course and you end up getting a load of math".

Undergraduate students:

- Assignments (50% of the grade).
- Final exam (50% of the grade).

- Assignments (50% of the grade).
- Either final exam or Final Project (50% of the grade).
- Should require roughly two weeks' worth of effort.
- Projects can contain both theoretical and experimental elements.

You should submit one PDF document containing your project motivation, background, data review, mathematical derivations, experimental results and discussion. You should include any other information you see fit, for example, pseudocode etc.

You are encouraged to use data from Yahoo!'s Webscope. This is a great source for a lot of interesting datasets that Yahoo! occasionally releases. The only thing is that you need to ask for permission to download using your university (.edu) email address. This is simply to identify you as students or researchers. This is strictly procedural and should take not more than a day or so.

The following are a few examples of student projects from previous years. Note that we do not study exactly the same topics every year: project_1 project_2 project_3 project_4 project_5 project_6

The deadline for project submission is **May 19th 2013**

http://groups.google.com/group/algorithms-in-data-mining-tau

- Lesson 1: Introduction presentation.
- Lesson 1: Probability recap
- Lesson 1: Mark and recapture
- Lesson 2: The Chernoff bound
- Lesson 2: Sampling from streams
- Lesson 3: Item frequencies in streams
- Lesson 4: Frequency Moment estimation in Streams
- Lesson 4: First assignment, due Dec 3rd
- Lesson 5: Random projection
- Lesson 6: Fast random projection
- Lesson 7: Singular Value Decomposition
- Lesson 8: Matrix sampling and rank-k approximation
- Lesson 8: First assignment possible answers
- Lesson 8: Second assignment, Due Dec 31st
- Lesson 9: Matrix approximation continued
- Lesson 10: k-means clustering
- Lesson 11: Nearest neighbor search and the curse of dimensionality
- Lesson 12: Approximate nearest neighbor search