Edo Liberty's homepage

Edo Liberty

edo@edoliberty.com • Google Scholar • Github • Bio • CV • Linkedin

About me

I'm the founder and Chief Scientist at Pinecone.

Before Pinecone I was a Director of Research at AWS and Head of Amazon AI Labs. The lab created algorithms, models, and systems including parts of SageMaker, OpenSearch, Kinesis, QuickSight, Glue, Rekognition, Personalize, and others.

Prior to that I was a Senior Research Director at Yahoo and Head of Yahoo's Research Lab in New York. We worked on building horizontal machine learning platforms and improving applications such as online advertising, web search, security, media recommendation, email abuse prevention, and many more.

I received my B.Sc. in Physics and Computer Science from Tel Aviv University and my Ph.D. in Computer Science from Yale University. After that, I was a postdoctoral fellow at Yale in the Program in Applied Mathematics.

My academic research focuses on mathematical foundations and algorithms dealing with large amounts of data. Topics include streaming algorithms, numerical linear algebra, machine learning, fast dimensionality reduction, clustering, and high dimensional data mining. I taught these topics at Tel Aviv University and at Princeton.

Recent news

Class notes are available for the Princeton course Long Term Memory in AI - Vector Search and Databases, co-instructed with Matthijs Douze.

Some Recorded Talks

[2023] TED talk: Long term Memory in AI
[2021] CMU Database seminar: The Pinecone Vector Database System
[2018] Amazon SageMaker: Infinitely Scalable Machine Learning Algorithms
[2018] AWS re:Invent: Introducing Amazon SageMaker
[2018] Streaming Data Mining: Mergeable Summaries and the DataSketches Library
[2016] Online Data Mining.
[2013] Simple and Deterministic Matrix Sketches
[2011] Fast Random Projection

Keynotes and Tutorials

[Nov 2021] Andy Pavlo's CMU Database Seminar about The Pinecone Vector Database System.
[July 2020] Keynote at San Francisco Data Council
[March 2020] Keynote at the Future of Information and Communication Conference about the Benefits and Challenges of combining Deep Learning and Retrieval Tasks.
[June 2019] Keynote at the Time Series Workshop at ICML about streaming algorithms, Apache DataSketches, and a sneak preview of the new coreset results in machine learning. The slides are found here: Streaming algorithms, Apache DataSketches, and new results on coresets. Thank you Yuyang (Bernie) Wang, Cheng Tang, Qi (Rose) Yu, Scott Yang, and Vitaly Kuznetsov for the invitation and for organizing this great workshop.
[Apr 2019] Keynote at the Southern Data Science Conference about streaming algorithms and the DataSketches library. Thank you Khalifeh AlJadda for the opportunity and for putting together an outstanding conference. Kudos!
[Feb 2019] Keynote at ITA about coresets, discrepancy, and sketches in machine learning together with Dimitris Achlioptas, Ben Recht, and Chris Re. The slides are available here but the paper is not yet published.
[Nov 2018] SageMaker Algorithms at MLConf in San Francisco.
[Oct 2018] SageMaker Algorithms at the first ever Amazon Research day in Haifa. Thank you Yoelle Maarek and Liane Lewin for organizing this awesome event.
[Aug 2018] KDD Keynote on deep learning on AWS and SageMaker Algorithms with Alex Smola.
[Jun 2018] Keynote at TMA conference in Vienna. I talked at the TMA Experts Summit about SageMaker algorithms (presentation). Later, I gave a keynote at the TMA conference about data sketches and mergeable summaries (presentation).
[Jun 2017] Shonan, Japan: Processing Big Data Streams workshop, where I presented my work with Zohar Karnin and Kevin Lang on streaming quantiles. Vladimir Braverman, David Woodruff, and Ke Yi did a wonderful job organizing it.
[May 2017] Amazon posted a blog post called In the Research Spotlight in which they interview me about my career and current efforts in AWS.
Correlation Clustering: from Theory to Practice
KDD 2014 Tutorial [slides] [bib]
Streaming Data Mining
KDD 2012 tutorial on practical algorithms in mining streaming data; with Jelani Nelson.
Fast Random Projections survey and new results,
SODA 2011 and IAS and Yale math seminars 2011.
Video of the talk at IAS available here.

Open Source Code

Apache DataSketches is the leading and most popular open source implementation of streaming algorithms for sketching and summarizing data, such as counting distinct items (like HLL), frequent items (aka top-k), streaming quantiles, and more. It is used by Druid, Spark, Yahoo, AWS, Google, and many more.
Frequent Directions: I have been asked to make some matrix sketching code available for a long time now. So, Mina Ghashami and I made part of our frequent directions git repo public. This code is distributed freely for academic use only. Please feel free to send pull requests.
Streaming Quantiles in Python: I'm excited about resolving one of the longest-standing open problems in the streaming model. We designed an optimal algorithm for finding any approximate quantile of a stream of elements. See also the paper which Zohar Karnin, Kevin Lang, and I posted on Arxiv.
Ezuzah Chrome Extension: Your browser is your door to the internet, so why not hang a Mzuzah? (a digital art piece)
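
The Frequent Directions entry above lends itself to a short illustration. What follows is a minimal NumPy sketch of the basic algorithm as described in the KDD 2013 paper, not the code from the git repo. The function name `frequent_directions` and the parameter `ell` are chosen here for illustration only, and the sketch assumes an even `ell` that is no larger than the row dimension.

```python
import numpy as np


def frequent_directions(rows, ell):
    """Illustrative Frequent Directions sketch (hypothetical helper, not the repo code).

    rows: array-like of shape (n, d), processed one row at a time.
    ell:  number of rows kept in the sketch (assumed even, 2 <= ell <= d).
    Returns an (ell, d) matrix B that approximates the covariance A^T A by B^T B.
    """
    rows = np.asarray(rows, dtype=float)
    d = rows.shape[1]
    assert ell >= 2 and ell % 2 == 0 and ell <= d, "assumption of this sketch"

    sketch = np.zeros((ell, d))
    next_zero = 0  # index of the first all-zero row in the sketch

    for row in rows:
        if next_zero == ell:
            # Sketch is full: rotate with an SVD and shrink all directions.
            _, s, vt = np.linalg.svd(sketch, full_matrices=False)
            delta = s[ell // 2 - 1] ** 2
            s_shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            sketch = s_shrunk[:, None] * vt
            # Rows from index ell/2 - 1 onward are now exactly zero.
            next_zero = ell // 2 - 1
        sketch[next_zero] = row
        next_zero += 1

    return sketch


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((500, 20))
    B = frequent_directions(A, ell=10)
    error = np.linalg.norm(A.T @ A - B.T @ B, 2)
    bound = 2 * np.linalg.norm(A, "fro") ** 2 / 10
    print(f"spectral error {error:.2f} <= bound {bound:.2f}")
```

For an n-by-d input A and the returned ell-by-d sketch B, the guarantee stated in the paper is that the spectral norm of A^T A - B^T B is at most 2 * ||A||_F^2 / ell, which is the quantity the usage example compares against.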

Past Interns

Omri Weinstein - Columbia University
Noa Avigdor-Elgrabli - Yahoo Research
Roy Schwartz - The Technion Institute
Dan Garber - Haifa University
Nikita Ivkin - Amazon AI Labs
Mina Ghashami - Visa Research
Ofir Geri - Stanford University
Yu Bai - Salesforce Research
Nicholas Ryder - OpenAI
Aditya Krishnan - Johns Hopkins University

Academic Work

I often serve on academic review committees for conferences, journals, and grants. Past activities include serving as Area Chair, SPC, PC member, and/or reviewer for KDD, WSDM, COLT, ICML, ESA, WWW, NIPS, SODA, FOCS, SIGIR, AISTATS, and NYCE.

Teaching

COS 597A : Long Term Memory in AI - Vector Search and Databases, Princeton University. Co-instructed with Matthijs Douze

fall 2023

0368-3248-01-Algorithms in Data Mining - Tel Aviv University 2011-2013

The course covered algorithmic tools for data mining massive data sets. It was given as a theory/algorithms class with an emphasis on randomization and streaming.

Talk with my papers

The following is powered by Pinecone Assistant, which is a knowledge-based agent and chat engine.

Selected Papers

An Approximate Algorithm for Maximum Inner Product Search over Streaming Sparse Vectors

Sebastian Bruch, Franco Maria Nardini, Amir Ingber, Edo Liberty

ACM Transactions on Information Systems 2023

tl;dr: This paper explores algorithms and optimizations for hybrid search in vector databases.
Even Simpler Deterministic Matrix Sketching

Edo Liberty

tl;dr: This is a super simple one-line proof of Frequent Directions, the matrix sketching algorithm in my 2013 KDD best paper (see below).
Relative Error Streaming Quantiles

Graham Cormode, Zohar Karnin, Edo Liberty, Justin Thaler, Pavel Veselý

PODS 2021, Best paper award - 2022 ACM SIGMOD Research Highlight Award

tl;dr: This (finally) solves a problem I wanted to solve for years. Namely, how to efficiently sketch quantiles with relative errors. This is critical for large scale performance monitoring, for example.
Amazon SageMaker Elastic Algorithms

Edo Liberty, Zohar Karnin, Bing Xiang, Laurence Rouesnel, Baris Coskun, Ramesh Nallapati, Julio Delgado Mangas, Amir Sadoughi, Yury Astashonok, Piali Das, Can Balioglu, Saswata Chakravarty, Madhav Jha, Philip Gautier, Tim Januschowski, Valentin Flunkert, David Arpin, and Alex Smola.

SIGMOD 2020

tl;dr: The culmination of more than two years of work, this paper describes the algorithms and distributed architecture behind Amazon SageMaker's elastic ML algorithms.
Coresets, Discrepancy, and Sketches in Machine Learning

Zohar Karnin and Edo Liberty

COLT 2019

tl;dr: This ML-theory paper shows that many types of machine learning models have much smaller coresets than those previously known. As a special case of the general result, it resolves the open problem regarding the coreset complexity of Gaussian density estimation.
Optimal Quantile Approximation in Streams

Zohar Karnin, Kevin Lang, Edo Liberty

FOCS 2016

tl;dr: This paper describes the KLL algorithm. It resolves one of the longest-standing and most basic problems in the streaming algorithms literature, namely, optimally approximating ranks and quantiles in streaming data. [slides]
Simple and Deterministic Matrix Sketches

Edo Liberty

KDD 2013, Best paper award

tl;dr: This paper introduced frequent-directions, an incredibly simple, practically efficient, and theoretically optimal algorithm for approximating the covariance of vector streams. [slides], [experimental results], [talk], [bib], [git repo].
Threading Machine Generated Email

Nir Ailon, Zohar Karnin, Edo Liberty, Yoelle Maarek

TechPulse 2012, Best paper award and WSDM 2013

tl;dr: The paper shows how to use sketches to find causality relations between billions of events using trillions of observations. [bib]
An Almost Optimal Unrestricted Fast Johnson-Lindenstrauss Transform

Nir Ailon, Edo Liberty

SODA 2011, Best paper award

tl;dr: The main result of my PhD work on fast dimension reduction. Specifically, matching the optimal target dimension of the Johnson-Lindenstrauss lemma with fast projection algorithms. [bib]

Conference Publications

Projective Clustering Product Quantization

Aditya Krishnan, Edo Liberty

Arxiv 2021
Relative Error Streaming Quantiles

Graham Cormode, Zohar Karnin, Edo Liberty, Justin Thaler, Pavel Veselý

Best paper at PODS 2021
From the lab to production: A case study of session-based recommendations in the home-improvement domain

Pigi Kouki, Ilias Fountalis, Nikolaos Vasiloglou, Xiquan Cui, Edo Liberty, Khalifeh Al Jadda

RecSys 2020
Amazon SageMaker Elastic Algorithms

Edo Liberty, Zohar Karnin, Bing Xiang, Laurence Rouesnel, Baris Coskun, Ramesh Nallapati, Julio Delgado Mangas, Amir Sadoughi, Yury Astashonok, Piali Das, Can Balioglu, Saswata Chakravarty, Madhav Jha, Philip Gautier, Tim Januschowski, Valentin Flunkert, David Arpin, and Alex Smola.

SIGMOD 2020
Streaming Quantiles Algorithms with Small Space and Update Time

Nikita Ivkin, Edo Liberty, Kevin Lang, Zohar Karnin, Vladimir Braverman

Sensors 2022
Coresets, Discrepancy, and Sketches in Machine Learning

Zohar Karnin and Edo Liberty

COLT 2019
Asymmetric Random Projections

Nicholas Ryder, Zohar Karnin, and Edo Liberty

Arxiv 2019
Proxquant: Quantized neural networks via proximal operators

Yu Bai, Yu-Xiang Wang, Edo Liberty

ICLR 2019
A High-Performance Algorithm for Identifying Frequent Items in Data Streams

Daniel Anderson, Pryce Bevan, Kevin Lang, Edo Liberty, Lee Rhodes, Justin Thaler

IMC 2017
Greedy Minimization of Weakly Supermodular Set Functions

Edo Liberty, Maxim Sviridenko

Approx 2017

[slides]

[bib]
Optimal Quantile Approximation in Streams

Zohar Karnin, Kevin Lang, Edo Liberty

FOCS 2016

[slides]
A Short Proof for Gap Independence of Simultaneous Iteration

Edo Liberty

Arxiv 2016

Professor Wenjian Yu of Tsinghua University pointed out that the square was omitted from (1+eps) in equation 2. The proof is still correct after a straightforward correction. This will be corrected in the next version.
Efficient Frequent Directions Algorithm for Sparse Matrices

Mina Ghashami, Edo Liberty, Jeff M. Phillips

KDD 2016
Stratified Sampling meets Machine Learning

Kevin Lang, Edo Liberty, Konstantin Shmakov

ICML 2016 [slides]
Space Lower Bounds for Itemset Frequency Sketches

Edo Liberty, Michael Mitzenmacher, Justin Thaler, Jonathan Ullman

PODS 2016

[bib]
An Algorithm for Online K-Means Clustering

Edo Liberty, Ram Sriharsha, Maxim Sviridenko

ALENEX 2016 [bib]
Online PCA with Spectral Bounds

Zohar Karnin, Edo Liberty

COLT 2015

[bib]

(see also 5 minute video lecture)
Online Principal Component Analysis

Christos Boutsidis, Dan Garber, Zohar Karnin, Edo Liberty

SODA 2014 [bib]
Near-optimal Distributions for Data Matrix Sampling

Dimitris Achlioptas, Zohar Karnin, Edo Liberty

NIPS 2013 [bib]
Simple and Deterministic Matrix Sketches

Edo Liberty (see slides and experimental results in json format)

Also, here is a talk I gave at the Simons Institute about this.

Best paper at KDD 2013 [bib]
See also frequent direction git repo by Mina Ghashami and myself.
Threading Machine Generated Email

Nir Ailon, Zohar Karnin, Edo Liberty, Yoelle Maarek

Best paper at TechPulse 2012 and WSDM 2013 [bib]
Unsupervised SVMs: On the complexity of the Furthest Hyperplane Problem

Zohar Karnin, Edo Liberty, Shachar Lovett, Roy Schwartz, and Omri Weinstein

COLT 2012 [Slides] [bib]
Framework and Algorithms for Network Bucket Testing

Liran Katzir, Edo Liberty, and Oren Somekh

WWW 2012 [bib]
An Almost Optimal Unrestricted Fast Johnson-Lindenstrauss Transform

Nir Ailon, Edo Liberty

Best paper at SODA 2011 [bib]
Improved Approximation Algorithms for Bipartite Correlation Clustering

Nir Ailon, Noa Avigdor-Elgrabli, Edo Liberty, Anke van Zuylen

ESA 2011 [slides] [bib]
Automatically Tagging Email by Leveraging Other Users' Folders

Yehuda Koren, Edo Liberty, Yoelle Maarek, and Roman Sandler

KDD 2011 [bib]
Estimating Sizes of Social Networks via Biased Sampling

Liran Katzir, Edo Liberty, and Oren Somekh

WWW 2011 [bib]
Inverted Index Compression via Online Document Routing

Gal Lavee, Ronny Lempel, Edo Liberty, and Oren Somekh

WWW 2011 [bib]
Correlation Clustering Revisited: The "True" Cost of Error Minimization Problems

Nir Ailon, Edo Liberty

ICALP 2009 [bib]
Dense Fast Random Projections and Lean Walsh Transforms

Edo Liberty, Nir Ailon, Amit Singer

RANDOM 2008 [bib]
Fast Dimension Reduction Using Rademacher Series on Dual BCH Codes

Nir Ailon, Edo Liberty

SODA 2008 [bib]

Journal Publications

An Approximate Algorithm for Maximum Inner Product Search over Streaming Sparse Vectors

Sebastian Bruch, Franco Maria Nardini, Amir Ingber, Edo Liberty

TOIS - ACM Transactions on Information Systems 2023
Frequent Directions: Simple and Deterministic Matrix Sketching

Mina Ghashami, Edo Liberty, Jeff M. Phillips, David P. Woodruff

[bib]
Estimating Sizes of Social Networks via Biased Sampling

Liran Katzir, Edo Liberty, Oren Somekh, Ioana A. Cosma

Journal of Internet Mathematics [bib]
An Almost Optimal Unrestricted Fast Johnson-Lindenstrauss Transform

Nir Ailon, Edo Liberty

Transactions on Algorithms [bib]
Improved Approximation Algorithms for Bipartite Correlation Clustering

Nir Ailon, Noa Avigdor-Elgrabli, Edo Liberty, and Anke van Zuylen

SIAM Journal on Computing [bib]
Unsupervised SVMs: On the complexity of the Furthest Hyperplane Problem

Zohar Karnin, Edo Liberty, Shachar Lovett, Roy Schwartz and Omri Weinstein

JMLR 2012 (Journal of Machine Learning Research) [bib]
Dense Fast Random Projections and Lean Walsh Transforms

Edo Liberty, Nir Ailon, Amit Singer

DCG 2010 (Discrete and Computational Geometry) [bib]
The Mailman algorithm: a note on matrix vector multiplication

Edo Liberty, Steven Zucker

IPL 2009 (Information Processing Letters) [bib]
Fast Dimension Reduction Using Rademacher Series on Dual BCH Codes

Nir Ailon, Edo Liberty

DCG 2008 (Discrete and Computational Geometry) [bib]
A fast randomized algorithm for the approximation of matrices

Edo Liberty, Franco Woolfe, Vladimir Rokhlin, and Mark Tygert

ACHA 2008 (Applied and Computational Harmonic Analysis) [bib]
Randomized algorithms for the low-rank approximation of matrices

Edo Liberty, Franco Woolfe, Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert.

PNAS 2007 (Proceedings of the National Academy of Sciences) [bib]
Electrons and Phonons on the Square Fibonacci Tiling

Roni Ilan, Edo Liberty, Shahar Even-Dar Mandel, and Ron Lifshitz.

Ferroelectrics 2004.

Other manuscripts

Accelerated Dense Random Projections

PhD Thesis. See also Talk slides
Even Simpler Deterministic Matrix Sketching

Edo Liberty

ML Patents

Method And System For Clustering Data Points

Nir Ailon, Edo Liberty, Hari Khalsa
Methods for filtering data and filling in missing data using nonlinear inference

Edo Liberty, Steven Zucker, Yosi Keller, Mauro M. Maggioni, Ronald R. Coifman, Frank Geshwind, and in collaboration with Plain Sight Systems.
Generalized Stratified Sampling

Kevin Lang, Edo Liberty, Konstantin Shmakov
On-line content sampling

KJ Lang, E Liberty, K Shmakov
System and Method for Experimentation and Deployment of Machine Learning Models on Cloud Based Platforms

Edo Liberty, Stefano Stefani, Alexander Smola, Craig Wiley, Steve Loeppky, Tom Faulhaber, Swami Sivasubramanian, Zohar Karnin
Method for post-training Hyperparameter Tuning by training Machine Learning States

Edo Liberty, Zohar Karnin
Autoscaling of Training Machine Learning Jobs on Cloud Infrastructures

Edo Liberty, Stefano Stefani, Swami Sivasubramanian, Zohar Karnin, Tom Faulhaber, Alexander Smola, Craig Wiley, Amir Sadoughi, Dayanand Rangegowda
A system for autoscaling and hosting of ML Models for production inference

Edo Liberty, Stefano Stefani, Steve Loeppky, Craig Wiley, Tom Faulhaber
Online training with delayed feedback with applications to bandwidth-efficient communication over networks

Edo Liberty, Madhav Jha
System Architecture for Container Based Large Scale Machine Learning Platforms

Stefano Stefani, Craig Wiley, Thomas Faulhaber, Alexander Smola, Steven Loeppky, Richard Bice, Edo Liberty, Swaminathan Sivasubramanian, Charles Swan, Taylor Goodhart
Method and Systems for Optimal Graph Synchronization for Distribute Machine Learning

Mu Li, Edo Liberty, Alexander Smola, Leyuan Wang
Machine Learning model-assisted real-time enhancement of audio/video over a network call to significantly lower bandwidth requirements

Madhav Jha, Edo Liberty
Training machine learning models for physical agents and robotic controls with simulations

S Genc, E Liberty
Machine Learning system to remove accent from spoken speech

Edo Liberty, Leo Dirac

E-Mail Patents

Classifying man versus machine generated email

Zohar Karnin, Guy Halawi, David Wajc, Edo Liberty
A System for Email sequence identification

Edo Liberty, Zohar Karnin, Yoelle Maarek, Natalie Aizenberg
Sponsored Apps Marketplace in eMail

Ronny Lempel, Yoelle Maarek, Edward Bortnikov, Edo Liberty
Mining Global Email Folders For Identifying Auto-folders tags

Vishwanath Ramarao, Andrei Broder, Idan Szpektor, Edo Liberty, Yehuda Koren, Mark Risher, and Yoelle Maarek
Email sequence identification

Edo Liberty, Zohar Karnin, Yoelle Maarek
Mailing List Identification and Representation

Zohar Karnin, Michal Aharon, Edo Liberty, Yoelle Maarek
Identification of subject line templates

Zohar Karnin, Edo Liberty, David Wajc, Guy Halawi
Computerized system and method for modifying a message to apply security features to the message’s content

Edo Liberty, Yoelle Maarek
Electronic message composition support method and apparatus

J Tetreault, A Pappu, E Liberty, L Cao, M Liu, E Pavlick, G Tsur, Y Maarek
Mail Lint: Write Better Emails

Joel Tetreault, Aasish Pappu, Edo Liberty, Liangliang Cao, Meizhu Liu, Ellie Tobochnik, Gilad Tzur, Yoelle Maarek

Other Patents

Methods for Displaying Contextually Targeted Content on a Connected Television

Zeev Neumeier, Edo Liberty
Methods for Identifying Video Segments and Displaying Contextually Targeted Content on Connected Televisions

Zeev Neumeier, Edo Liberty
Contest Generation Methods for Daily Fantasy Sports

Justin Thaler, Maxim Sviridenko, Edo Liberty, Prerit Uppal, Ron Belmarch, Jerry Shen
Fantasy Sports Data Analysis for Game Structure Development

Justin Thaler, Maxim Sviridenko, Edo Liberty, Prerit Uppal, Ron Belmarch, Jerry Shen

Table of contents

About me
Recent News
Some Recorded Talks
Keynotes and Tutorials
Open Source Code
Past Interns
Academic Work
Teaching
Talk with my papers
Selected Papers
Conference Publications
Journal Publications
Other Manuscripts
ML Patents
E-Mail Patents
Other Patents
