Data Mining is concerned with efficiently extracting statistics, patterns, structures, or meanings from raw data. This task becomes hard when the amount of data is large, which is often the case in modern data sets. This course will survey modern algorithms, concepts, and data structures common to data mining in large data sets. We will try to cover, among other topics: data sampling, finding causal relations and frequent item sets, counting in data streams, ranking and sorting, approximating large matrix operations, dimensionality reduction and efficient searching in high dimensions. We will also discuss modern cluster architectures and computational models.

I recommend that students be familiar with probability theory, basic combinatorics, linear algebra, basic complexity theory, and traditional data structures, at least on at introductory level. The class will attempt to be self contained nonetheless.

Undergraduate students:

- Final exam (50% of the grade).
- Home assignments (50% of the grade).
- A project (for extra credit) is optional for students who are interested.

- No final exam.
- Home assignments (50% of the grade).
- Final project (50% of the grade). Should require roughly a week's worth of work, comparable to learning for and taking the final exam. These project can contain both theoretical and experimental elements.

http://groups.google.com/group/algorithms-in-data-mining-tau

Exam 2: "Moed Beit"

Class notes will be posted here as the sememster progresses.

Lesson 1: Introduction
presentation.

Lesson 1: Probability
recap class notes.

Lesson 1: Mark and
recapture class notes.

Lesson 2: Sampling class notes.

Lesson 2: Chernoff
bound class notes by John Canny.

Lesson 3:
Frequent items in streams class notes.

Assignment 1: Size of Trees and
Approximate Histograms. Due Monday Nov 28th

Lesson 4: Frequency
Moments class notes.

Lesson 5: Random
Projections class notes.

Lesson 6: Singular
Value Decomposition.

Assignment 2: Weak Random
Projection. Due Monday Dec 26th. Solutions

Lesson 7: Matrix
Sampling and efficient rank-k approximation

Lesson 7: Roman
Versynin's proof of the matrix Chernoff bound by Ahlswede and Winter

Lesson 8: Metric
Embedding into random trees (Notes from a Metric Embedding class
given by the late Avner Magen)

Lesson 9: Nearest
neighbor search, and kd-trees

Lesson 10: Approximate
nearest neighbor search, Locallity Sensitive Hashing

Lesson
11: k-means
clustering

Assignment 2: Meta-algorithms
and the power method. Due Monday Jan 30th

Lesson 12: Very
short intro to spectral graph theory (partial and temporary class
notes). Based on class
notes by Dan Spielman.

Lesson 13: Review of last year's exam
and answers "Moed Alef".