Edo Liberty

edo@edoliberty.com | Google Scholar | Github | Bio | CV | Linkedin

Recent news

New paper: Congrats to Pinecone's research team on publishing the paper An Approximate Algorithm for Maximum Inner Product Search over Streaming Sparse Vectors in ACM Transactions on Information Systems (TOIS). Research led by the brilliant Sebastian Bruch, Franco Maria Nardini, and Amir Ingber.

New graduate course: I will teach Long Term Memory in AI - Vector Search and Databases at Princeton University this upcoming fall (COS 597A) with Matthijs Douze, the architect and main developer of FAISS.

About me

I'm the founder and CEO of Pinecone, the first vector database for machine learning.

Until April 2019, I was a Director of Research at AWS and Head of Amazon AI Labs. The Lab built cutting-edge machine learning algorithms, systems, and services for AWS customers. We built parts of SageMaker, Kinesis, QuickSight, Amazon ElasticSearch, Glue, Rekognition, DeepRacer, Personalize, Forecast, and other yet-to-be-released services from AWS.

Before AWS, I was a Senior Research Director at Yahoo and Head of Yahoo's Research Lab in New York. We worked on building horizontal machine learning platforms and improving applications such as online advertising, search, security, media recommendation, email abuse prevention, and many more.

I received my B.Sc. in Physics and Computer Science from Tel Aviv University and my Ph.D. in Computer Science from Yale University. After that, I was a Postdoctoral Fellow at Yale in the Program in Applied Mathematics.

My research focuses on mathematical foundations and algorithms for challenges arising in dealing with large amounts of data. Topics include fast dimensionality reduction, clustering, streaming algorithms, machine learning, large scale numerical linear algebra, and high dimensional geometry.

Keynotes and Tutorials

Open Source Code

  • Apache DataSketches is the leading and most popular open source implementation of streaming algorithms for sketching and summarizing data, such as counting distinct items (like HLL), frequent items (aka top-k), streaming quantiles, and more. It is used by Druid, Spark, Yahoo, AWS, Google, and many more.

  • Frequent Directions: I have long been asked to make some matrix sketching code available. So, Mina Ghashami and I made part of our Frequent Directions git repo public. This code is distributed freely for academic use only. Please feel free to send pull requests.

  • Streaming Quantiles in Python: I'm excited about resolving one of the longest-standing open problems in the streaming model. We designed an optimal algorithm for finding any approximate quantile of a stream of elements. See also the paper that Zohar Karnin, Kevin Lang, and I posted on arXiv.

  • Ezuzah Chrome Extension: Your browser is your door to the internet, so why not hang a Mezuzah? (a digital art piece)

Past Interns

Academic Work

I often serve on academic review committees for conferences, journals, and grants. Past activities include being Area Chair, SPC, PC, and/or reviewer for KDD, WSDM, COLT, ICML, ESA, WWW, NIPS, SODA, FOCS, SIGIR, AISTATS, and NYCE.


COS 597A: Long Term Memory in AI - Vector Search and Databases, Princeton University. Co-instructed with Matthijs Douze.

0368-3248-01: Algorithms in Data Mining, Tel Aviv University, 2011-2013

The course covered algorithmic tools for mining massive data sets. It was given as a theory/algorithms class with an emphasis on randomization and streaming.

Selected Papers

  • An Approximate Algorithm for Maximum Inner Product Search over Streaming Sparse Vectors

    Sebastian Bruch, Franco Maria Nardini, Amir Ingber, Edo Liberty

    ACM Transactions on Information Systems 2023

    tl;dr: This paper explores algorithms and optimizations for hybrid search in vector databases.

  • Even Simpler Deterministic Matrix Sketching

    Edo Liberty

    tl;dr: This is a super simple one-line proof of Frequent Directions, the matrix sketching algorithm in my 2013 KDD best paper (see below).

  • Relative Error Streaming Quantiles

    Graham Cormode, Zohar Karnin, Edo Liberty, Justin Thaler, Pavel Veselý

    PODS 2021, Best paper award

    tl;dr: This (finally) solves a problem I wanted to solve for years. Namely, how to efficiently sketch quantiles with relative errors. This is critical for large scale performance monitoring, for example.

  • Amazon SageMaker Elastic Algorithms

    Edo Liberty, Zohar Karnin, Bing Xiang, Laurence Rouesnel, Baris Coskun, Ramesh Nallapati, Julio Delgado Mangas, Amir Sadoughi, Yury Astashonok, Piali Das, Can Balioglu, Saswata Chakravarty, Madhav Jha, Philip Gautier, Tim Januschowski, Valentin Flunkert, David Arpin, and Alex Smola.

    SIGMOD 2020

    tl;dr: The culmination of more than two years of work, this paper describes the algorithms and distributed architecture behind Amazon SageMaker's elastic ML algorithms.

  • Coresets, Discrepancy, and Sketches in Machine Learning

    Zohar Karnin and Edo Liberty

    COLT 2019

    tl;dr: This ML-theory paper shows that many types of machine learning models have much smaller coresets than those previously known. As a special case of the general result, it resolves the open problem regarding the coreset complexity of Gaussian density estimation.

  • Optimal Quantile Approximation in Streams

    Zohar Karnin, Kevin Lang, Edo Liberty

    FOCS 2016

    tl;dr: This paper describes the KLL algorithm. It resolves one of the longest-standing basic problems in the streaming algorithms literature, namely, optimally approximating ranks and quantiles in streaming data. [slides]

  • Simple and Deterministic Matrix Sketches

    Edo Liberty

    KDD 2013, Best paper award

    tl;dr: This paper introduced Frequent Directions, an incredibly simple, practically efficient, and theoretically optimal algorithm for approximating the covariance of vector streams. [slides], [experimental results], [talk], [bib], [git repo].

  • Threading Machine Generated Email

    Nir Ailon, Zohar Karnin, Edo Liberty, Yoelle Maarek

    TechPulse 2012, Best paper award and WSDM 2013

    tl;dr: The paper shows how to use sketches to find causality relations between billions of events using trillions of observations. [bib]

  • An Almost Optimal Unrestricted Fast Johnson-Lindenstrauss Transform

    Nir Ailon, Edo Liberty

    SODA 2011, Best paper award

    tl;dr: The main result of my PhD work on fast dimension reduction. Specifically, matching the optimal target dimension of the Johnson-Lindenstrauss lemma with fast projection algorithms. [bib]

Conference Publications

Journal Publications

ML Patents

  • Method And System For Clustering Data Points

    Nir Ailon, Edo Liberty, Hari Khalsa

  • Methods for filtering data and filling in missing data using nonlinear inference

    Edo Liberty, Steven Zucker, Yosi Keller, Mauro M. Maggioni, Ronald R. Coifman, Frank Geshwind, and in collaboration with Plain Sight Systems.

  • Generalized Stratified Sampling

    Kevin Lang, Edo Liberty, Konstantin Shmakov

  • On-line content sampling

    KJ Lang, E Liberty, K Shmakov

  • System and Method for Experimentation and Deployment of Machine Learning Models on Cloud Based Platforms

    Edo Liberty, Stefano Stefani, Alexander Smola, Craig Wiley, Steve Loeppky, Tom Faulhaber, Swami Sivasubramanian, Zohar Karnin

  • Method for post-training Hyperparameter Tuning by training Machine Learning States

    Edo Liberty, Zohar Karnin

  • Autoscaling of Training Machine Learning Jobs on Cloud Infrastructures

    Edo Liberty, Stefano Stefani, Swami Sivasubramanian, Zohar Karnin, Tom Faulhaber, Alexander Smola, Craig Wiley, Amir Sadoughi, Dayanand Rangegowda

  • A system for autoscaling and hosting of ML Models for production inference

    Edo Liberty, Stefano Stefani, Steve Loeppky, Craig Wiley, Tom Faulhaber

  • Online training with delayed feedback with applications to bandwidth-efficient communication over networks

    Edo Liberty, Madhav Jha

  • System Architecture for Container Based Large Scale Machine Learning Platforms

    Stefano Stefani, Craig Wiley, Thomas Faulhaber, Alexander Smola, Steven Loeppky, Richard Bice, Edo Liberty, Swaminathan Sivasubramanian, Charles Swan, Taylor Goodhart

  • Method and Systems for Optimal Graph Synchronization for Distributed Machine Learning

    Mu Li, Edo Liberty, Alexander Smola, Leyuan Wang

  • Machine Learning model-assisted real-time enhancement of audio/video over a network call to significantly lower bandwidth requirements

    Madhav Jha, Edo Liberty

  • Training machine learning models for physical agents and robotic controls with simulations

    S Genc, E Liberty

  • Machine Learning system to remove accent from spoken speech

    Edo Liberty, Leo Dirac

E-Mail Patents

  • Classifying man versus machine generated email

    Zohar Karnin, Guy Halawi, David Wajc, Edo Liberty

  • A System for Email sequence identification

    Edo Liberty, Zohar Karnin, Yoelle Maarek, Natalie Aizenberg

  • Sponsored Apps Marketplace in eMail

    Ronny Lempel, Yoelle Maarek, Edward Bortnikov, Edo Liberty

  • Mining Global Email Folders For Identifying Auto-folders tags

    Vishwanath Ramarao, Andrei Broder, Idan Szpektor, Edo Liberty, Yehuda Koren, Mark Risher, and Yoelle Maarek

  • Email sequence identification

    Edo Liberty, Zohar Karnin, Yoelle Maarek

  • Mailing List Identification and Representation

    Zohar Karnin, Michal Aharon, Edo Liberty, Yoelle Maarek

  • Identification of subject line templates

    Zohar Karnin, Edo Liberty, David Wajc, Guy Halawi

  • Computerized system and method for modifying a message to apply security features to the message’s content

    Edo Liberty, Yoelle Maarek

  • Electronic message composition support method and apparatus

    J Tetreault, A Pappu, E Liberty, L Cao, M Liu, E Pavlick, G Tsur, Y Maarek

  • Mail Lint: Write Better Emails

    Joel Tetreault, Aasish Pappu, Edo Liberty, Liangliang Cao, Meizhu Liu, Ellie Tobochnik, Gilad Tzur, Yoelle Maarek

Other Patents

  • Methods for Displaying Contextually Targeted Content on a Connected Television

    Zeev Neumeier, Edo Liberty

  • Methods for Identifying Video Segments and Displaying Contextually Targeted Content on Connected Televisions

    Zeev Neumeier, Edo Liberty

  • Contest Generation Methods for Daily Fantasy Sports

    Justin Thaler, Maxim Sviridenko, Edo Liberty, Prerit Uppal, Ron Belmarch, Jerry Shen

  • Fantasy Sports Data Analysis for Game Structure Development

    Justin Thaler, Maxim Sviridenko, Edo Liberty, Prerit Uppal, Ron Belmarch, Jerry Shen