Data Mining class description

Data Mining is concerned with efficiently extracting statistics, patterns, structures, or meanings from raw data. This task becomes hard when the amount of data is large, which is often the case in modern data sets. This course will survey modern algorithms, concepts, and data structures common to data mining in large data sets. We will try to cover, among other topics: data sampling, finding causal relations and frequent item sets, counting in data streams, ranking and sorting, approximating large matrix operations, dimensionality reduction and efficient searching in high dimensions. We will also discuss modern cluster architectures and computational models.

I recommend that students be familiar with probability theory, basic combinatorics, linear algebra, basic complexity theory, and traditional data structures, at least on at introductory level. The class will attempt to be self contained nonetheless.

Some details...

The class takes place 17:00 to 20:00 every monday.

Undergraduate students:

Master and PhD students:

Discussion Group

Last year's exams

Exam 1: "Moed Alef" (with answers).
Exam 2: "Moed Beit"

Class notes and other materials

Class notes will be posted here as the sememster progresses.

Lesson 1: Introduction presentation.
Lesson 1: Probability recap class notes.
Lesson 1: Mark and recapture class notes.
Lesson 2: Sampling class notes.
Lesson 2: Chernoff bound class notes by John Canny.
Lesson 3: Frequent items in streams class notes.
Assignment 1: Size of Trees and Approximate Histograms. Due Monday Nov 28th
Lesson 4: Frequency Moments class notes.
Lesson 5: Random Projections class notes.
Lesson 6: Singular Value Decomposition.
Assignment 2: Weak Random Projection. Due Monday Dec 26th. Solutions
Lesson 7: Matrix Sampling and efficient rank-k approximation
Lesson 7: Roman Versynin's proof of the matrix Chernoff bound by Ahlswede and Winter
Lesson 8: Metric Embedding into random trees (Notes from a Metric Embedding class given by the late Avner Magen)
Lesson 9: Nearest neighbor search, and kd-trees
Lesson 10: Approximate nearest neighbor search, Locallity Sensitive Hashing
Lesson 11: k-means clustering
Assignment 2: Meta-algorithms and the power method. Due Monday Jan 30th
Lesson 12: Very short intro to spectral graph theory (partial and temporary class notes). Based on class notes by Dan Spielman.
Lesson 13: Review of last year's exam and answers "Moed Alef".