Data Mining is concerned with efficiently extracting statistics, patterns, structures, or meanings from raw data. This task becomes hard when the amount of data is large, which is often the case in modern data sets. This course will survey modern algorithms, concepts, and data structures common to data mining in large data sets. We will try to cover, among other topics: data sampling, finding causal relations and frequent item sets, counting in data streams, ranking and sorting, approximating large matrix operations, dimensionality reduction and efficient searching in high dimensions. We will also discuss modern cluster architectures and computational models.

I recommend that students be familiar with probability theory, basic combinatorics, linear algebra, basic complexity theory, and traditional data structures, at least on at introductory level. The class will attempt to be self contained nonetheless.

Undergraduate students:

- Final exam (50% of the grade).
- Home assignments (50% of the grade).
- A project (for extra credit) is optional for students who are interested.

- No final exam.
- Home assignments (50% of the grade).
- Final project (50% of the grade). Should require roughly a week's worth of work, comparable to learning for and taking the final exam. These project can contain both theoretical and experimental elements.

Lesson 1: Probability recap. Class notes.

Lesson 1: Mark and recapture. Class notes.

Lesson 2: Sampling. Class notes.

Lesson 2: The Chernoff bound. Class notes by John Canny)

Homework 1: i.i.d. sampling from streams. Assignment, solution.

Lesson 3: Dinamic Item Set Counting. Class notes.

Lesson 3: A Simple Algorithm for Finding Frequent Elements in Streams and Bags Class notes.

Lesson 4: Finding Frequent Items in Data Streams. Class notes.

Lesson 4: Notes on Bloom filters vs. bit hashes. Class notes.

Lesson 4: Notes on Count Sketches. Class notes.

Lesson 5: Approximating the frequency moments. Class notes.

Homework 2: Approximate set unions. Assignment, solution.

Lesson 6: Random Projections. Class notes.

Lesson 7: Fast Random Projections

Lesson 8: Singular Value Decomposition (SVD) Class notes.

Lesson 9: the power method. Class notes.

Lesson 9: matrix sampling by Ahlswede and Winter Paper.

Lesson 10: Nearest Neighbor search, KD-trees and other space partitioning solutions.

Lesson 11: Approximate Nearest Neighbor search, Local Sensitive Hashing class notes.

Homework 3: Document duplicate detection Assignment.

Lesson 12: Searching, Inverted indexes and the set Containment problem. class notes.

Lesson 13: Material review. Example test questions.

Lesson 14: Introduction to spectral graph theory Class notes 1, class notes 2, and class notes 3. All by Daniel A. Spielman.

Exam 1: "Moed Alef"

Exam 1: "Moed Alef" answers.

Exam 2: "Moed Beit"

Exam 2: "Moed Beit" answers (comming soon)