Long Term Memory in AI  Vector Search and Databases
NOTE: COS 597A class times changed for Fall semester 2023. Classes will be held 9am12noon.
Instructors
Overview
Long Term Memory is a foundational capability in the modern AI Stack. At their core, these systems use vector search. Vector search is also a basic tool for systems that manipulate large collections of media like search engines, knowledge bases, content moderation tools, recommendation systems, etc. As such, the discipline lays at the intersection of Artificial Intelligence and Database Management Systems. This course will cover the theoretical foundations and practical implementation of vector search applications, algorithms, and systems. The course will be evaluated with project and inclass presentation.
Contribute
All class materials are intended to be used freely by academics anywhere, students and professors alike. Please contribute in the form of pull requests or by opening issues.
https://github.com/edoliberty/vectorsearchclassnotes
On unixlike systems (e.g. macos) with bibtex and pdflatex available you should be able to run this:
git clone git@github.com:edoliberty/vectorsearchclassnotes.git
cd vectorsearchclassnotes
./build
Syllabus
 9/8  Class 1  Introduction to Vector Search [Matthijs + Edo + Nataly]
 Intro to the course: Topic, Schedule, Project, Grading, …
 Embeddings as an information bottleneck. Instead of learning endtoend, use embeddings as an intermediate representation
 Advantages: scalability, instant updates, and explainability
 Typical volumes of data and scalability. Embeddings are the only way to manage / access large databases
 The embedding contract: the embedding extractor and embedding indexer agree on the meaning of the distance. Separation of concerns.
 The vector space model in information retrieval
 Vector embeddings in machine learning
 Define vector, vector search, ranking, retrieval, recall
 9/15  Class 2  Text embeddings [Matthijs]
 2layer word embeddings. Word2vec and fastText, obtained via a factorization of a cooccurrence matrix. Embedding arithmetic: king + woman  man = queen, (already based on similarity search)
 Sentence embeddings: How to train, masked LM. Properties of sentence embeddings.
 Large Language Models: reasoning as an emerging property of a LM. What happens when the training set = the whole web
 9/22  Class 3  Image embeddings [Matthijs]
 Pixel structures of images. Early works on direct pixel indexing
 Traditional CV models. Global descriptors (GIST). Local descriptors (SIFT and friends)Direct indexing of local descriptors for image matching, local descriptor pooling (Fisher, VLAD)
 Convolutional Neural Nets. Offtheshelf models. Trained specifically (contrastive learning, selfsupervised learning)
 Modern Computer Vision models
 9/29  Class 4  Low Dimensional Vector Search [Edo]
 Vector search problem definition
 kd tree, space partitioning data structures
 Worst case proof for kdtrees
 Probabilistic inequalities. Recap of basic inequalities: Markov, Chernoof, Hoeffding
 Concentration Of Measure phenomena. Orthogonality of random vectors in high dimensions
 Curse of dimensionality and the failure of space partitioning
 10/6  Class 5  Dimensionality Reduction [Edo]
 Singular Value Decomposition (SVD)
 Applications of the SVD
 Rankk approximation in the spectral norm
 Rankk approximation in the Frobenius norm
 Linear regression in the leastsquared loss
 PCA, Optimal squared loss dimension reduction
 Closest orthogonal matrix
 Computing the SVD: The power method
 Randomprojection
 Matrices with normally distributed independent entries
 Fast Random Projections

10/13  No Class  Midterm Examination Week

10/20  No Class  Fall Recess
 10/27  Class 6  Approximate Nearest Neighbor Search [Edo]
 Definition of Approximate Nearest Neighbor Search (ANNS)
 Criteria: Speed / accuracy / memory usage / updateability / index construction time
 Definition of Locality Sensitive Hashing and examples
 The LSH Algorithm
 LSH Analysis, proof of correctness, and asymptotics
 11/3  Class 7  Clustering [Edo]
 Kmeans clustering  mean squared error criterion.
 Lloyd’s Algorithm
 kmeans and PCA
 εnet argument for fixed dimensions
 Sampling based seeding for kmeans
 kmeans++
 The Inverted File Model (IVF)
 11/10  Class 8  Quantization for lossy vector compression This class will take place remotely via zoom, see the edstem message to get the link [Matthijs]
 Python notebook corresponding to the class: Class_08_runbook_for_students.ipynb
 Vector quantization is a topline (directly optimizes the objective)
 Binary quantization and hamming comparison
 Product quantization. Chunked vector quantization. Optimized vector quantization
 Additive quantization. Extension of product quantization. Difficulty in training approximations (Residual quantization, CQ, TQ, LSQ, etc.)
 Cost of coarse quantization vs. inverted list scanning
 11/17  Class 9  Graph based indexes by Guest lecturer Harsha Vardhan Simhadri.
 Early works: hierarchical kmeans
 Neighborhood graphs. How to construct them. Nearest Neighbor Descent
 Greedy search in Neighborhood graphs. That does not work – need long jumps
 HNSW. A practical hierarchical graphbased index
 NSG. Evolving a graph kNN graph

11/24  No Class  Thanksgiving Recess
 12/1  Class 10  Student project and paper presentations [Edo + Nataly]
Project
Class work includes a final project. It will be graded based on
 50%  Project submission
 50%  Inclass presentation
Projects can be in three different flavors
 Theory/Research: propose a new algorithm for a problem we explored in class (or modify an existing one), explain what it achieves, give experimental evidence or a proof for its behavior. If you choose this kind of project you are expected to submit a write up.
 Data Science/AI: create an interesting use case for vector search using Pinecone, explain what data you used, what value your application brings, and what insights you gained. If you choose this kind of project you are expected to submit code (e.g. Jupyter Notebooks) and a writeup of your results and insights.
 Engineering/HPC: adapt or add to FAISS, explain your improvements, show experimental results. If you choose this kind of project you are expected to submit a branch of FAISS for review along with a short writeup of your suggested improvement and experiments.
Project schedule
 11/24  Onepage project proposal approved by the instructors
 12/1  Final project submission
 12/1  Inclass presentation
Some more details
 Project Instructor: Nataly nbrukhim@princeton.edu
 Projects can be worked on individually, in teams of two or at most three students.
 Expect to spend a few hours over the semester on the project proposal. Try to get it approved well ahead of the deadline.
 Expect to spent 35 full days on the project itself (on par with preparing for a final exam)
 In class project project presentation are 5 minutes per student (teams of two students present for 10 minutes. Teams of three, 15 minutes).
Selected Literature
 A fast random sampling algorithm for sparsifying matrices  Arora, Sanjeev and Hazan, Elad and Kale, Satyen  2006
 A Randomized Algorithm for Principal Component Analysis  Vladimir Rokhlin and Arthur Szlam and Mark Tygert  2009
 A search structure based on kd trees for efficient ray tracing  Subramanian, KR and Fussel, DS  1990
 A Short Proof for Gap Independence of Simultaneous Iteration  Edo Liberty  2016
 Accelerating LargeScale Inference with Anisotropic Vector Quantization  Ruiqi Guo and Philip Sun and Erik Lindgren and Quan Geng and David Simcha and Felix Chern and Sanjiv Kumar  2020
 Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 712, 2015, Montreal, Quebec, Canada  2015
 An Algorithm for Online KMeans Clustering  Edo Liberty and Ram Sriharsha and Maxim Sviridenko
 An Almost Optimal Unrestricted Fast JohnsonLindenstrauss Transform  Nir Ailon and Edo Liberty  2011
 An elementary proof of the JohnsonLindenstrauss lemma  S. DasGupta and A. Gupta  1999
 Approximate nearest neighbors and the fast JohnsonLindenstrauss transform  Nir Ailon and Bernard Chazelle  2006
 Billionscale similarity search with GPUs  Jeff Johnson and Matthijs Douze and Herv{'e} J{'e}gou  2017
 Clustering Data Streams: Theory and Practice  Sudipto Guha and Adam Meyerson and Nina Mishra and Rajeev Motwani and Liadan O’Callaghan  2003
 DiskANN: Fast Accurate Billionpoint Nearest Neighbor Search on a Single Node  Jayaram Subramanya, Suhas and Devvrit, Fnu and Simhadri, Harsha Vardhan and Krishnawamy, Ravishankar and Kadekodi, Rohan  2019
 Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs  Yu. A. Malkov and D. A. Yashunin  2018
 Efficient KNearest Neighbor Graph Construction for Generic Similarity Measures  Dong, Wei and Moses, Charikar and Li, Kai  2011
 Even Simpler Deterministic Matrix Sketching  Edo Liberty  2022
 Extensions of Lipschitz mappings into a Hilbert space  W. B. Johnson and J. Lindenstrauss  1984
 Fast Approximate Nearest Neighbor Search With The Navigating Spreadingout Graph  Cong Fu and Chao Xiang and Changxu Wang and Deng Cai  2018
 Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions  Halko, N. and Martinsson, P. G. and Tropp, J. A.  2011
 Invertibility of random matrices: norm of the inverse  Mark Rudelson  2008
 Kmeans clustering via principal component analysis  Chris H. Q. Ding and Xiaofeng He  2004
 kmeans++: the advantages of careful seeding  David Arthur and Sergei Vassilvitskii  2007
 Least squares quantization in pcm  Stuart P. Lloyd  1982
 LSQ++: Lower Running Time and Higher Recall in MultiCodebook Quantization  Martinez, Julieta and Zakhmi, Shobhit and Hoos, Holger H. and Little, James J.  2018
 Multidimensional binary search trees used for associative searching  Bentley, Jon Louis  1975
 NearOptimal Entrywise Sampling for Data Matrices  Achlioptas, Dimitris and Karnin, Zohar S and Liberty, Edo  2013
 Pass Efficient Algorithms for Approximating Large Matrices  Petros Drineas and Ravi Kannan  2003
 Product Quantization for Nearest Neighbor Search  Jegou, Herve and Douze, Matthijs and Schmid, Cordelia  2011
 QuickCSG: Arbitrary and faster boolean combinations of n solids  Douze, Matthijs and Franco, JeanS{'e}bastien and Raffin, Bruno  2015
 Quicker {ADC} : Unlocking the Hidden Potential of Product Quantization With {SIMD  Fabien Andre and AnneMarie Kermarrec and Nicolas Le Scouarnec  2021
 Random Projection Trees and Low Dimensional Manifolds  Dasgupta, Sanjoy and Freund, Yoav  2008
 Randomized Algorithms for LowRank Matrix Factorizations: Sharp Performance Bounds  Witten, Rafi and Cand`{e}s, Emmanuel  2015
 Randomized Block Krylov Methods for Stronger and Faster Approximate Singular Value Decomposition  Cameron Musco and Christopher Musco  2015
 Revisiting Additive Quantization  Julieta Martinez and Joris Clement and Holger H. Hoos and J. Little  2016
 Sampling from large matrices: An approach through geometric functional analysis  Rudelson, Mark and Vershynin, Roman  2007
 Similarity estimation techniques from rounding algorithms  Moses Charikar  2002
 Similarity Search in High Dimensions via Hashing  Aristides Gionis and Piotr Indyk and Rajeev Motwani  1999
 Simple and Deterministic Matrix Sketching  Edo Liberty  2012
 Smaller Coresets for kMedian and kMeans Clustering  S. {HarPeled} and A. Kushal  2005
 Sparser JohnsonLindenstrauss transforms  Daniel M. Kane and Jelani Nelson  2012
 Sparsity Lower Bounds for Dimensionality Reducing Maps  Jelani Nelson and Huy L. Nguyen  2012
 Spectral Relaxation for Kmeans Clustering  Hongyuan Zha and Xiaofeng He and Chris H. Q. Ding and Ming Gu and Horst D. Simon  2001
 Streaming kmeans approximation  Nir Ailon and Ragesh Jaiswal and Claire Monteleoni  2009
 Strong converse for identification via quantum channels  Rudolf Ahlswede and Andreas Winter  2002
 Transformer Memory as a Differentiable Search Index  Yi Tay and Vinh Q. Tran and Mostafa Dehghani and Jianmo Ni and Dara Bahri and Harsh Mehta and Zhen Qin and Kai Hui and Zhe Zhao and Jai Gupta and Tal Schuster and William W. Cohen and Donald Metzler  2022
 Unsupervised Neural Quantization for CompressedDomain Similarity Search  S. Morozov and A. Babenko  2019
 WorstCase Analysis for Region and Partial Region Searches in Multidimensional Binary Search Trees and Balanced Quad Trees  Lee, D. T. and Wong, C. K.  1977