
Director of Data Science | NLP | ML at scale | GCP | AWS | linkedin.com/in/marie-stephen-leo

Machine Learning, Opinion

Long live ANNs for their whopping 380X speedup over sklearn’s KNN while delivering 99.3% similar results.

We’re living through an extinction-level event. No, not COVID-19. I’m talking about the demise of the popular KNN algorithm that is taught in pretty much every Data Science course! Read on to find out what’s replacing this staple of every Data Scientist’s toolkit.

KNN Background

Finding “K” similar items to any given item is widely known in the machine learning community as a “similarity” search or “nearest neighbor” (NN) search. The most widely known NN search algorithm is the K-Nearest Neighbours (KNN) algorithm. In KNN, given a collection of objects like an e-commerce catalog of handphones, we can find a small number…


When in doubt, use data to decide!

Deciding which pre-trained model to use in your Deep Learning task ranks alongside classic dilemmas like what movie to watch on Netflix and what cereal to buy at the supermarket (P.S. buy the one with the least sugar and highest fiber content). This post uses a data-driven approach in Python to find the best Keras pre-trained model for the cats_vs_dogs dataset. The post and the accompanying code will also help you choose the best pre-trained model for your own dataset.

Table of Contents

  1. Background
  2. Criteria for selecting models
  3. Code
  4. Resources

Background

Transfer learning is a technique in…


Setup & Run SQL in Google Colab with just 2 helper functions!

Coding tests are pretty much standard in Data Science interview processes these days. As a Data Science hiring manager, I find a 20–30 min live coding test with some prepared tasks to be effective at identifying candidates who would be successful in the roles that I typically hire for.

Google Colab [Link] is an excellent tool for various offline and live Data Science coding interviews due to its familiar notebook environment and convenient sharing options. But Colab is pretty much limited to Python (and R with some hacks).

In my personal experience, SQL is a vital skill to be a…


~80% of what you need to know about logging in under 5 mins

There comes a time in every production Data Science project when the code base has become complex, and a refactor is necessary to maintain your sanity. Perhaps you want to abstract out commonly used code into Python modules with classes and functions so that it can be reused with a single line instead of copy-pasting the whole block of code multiple times in your project. Whatever your reason, writing informative logging into your program is critical to ensure you can track its operation and troubleshoot it when things inevitably go wrong.
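As a minimal sketch of what informative logging can look like (this uses only Python's standard `logging` module; the helper names `get_logger` and `safe_divide` and the logger name `demo_pipeline` are hypothetical and not taken from the article):

```python
import logging


def get_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """Return a named logger that writes timestamped messages to the console."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Attach a handler only once so repeated calls do not duplicate output.
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s | %(name)s | %(levelname)s | %(message)s")
        )
        logger.addHandler(handler)
    return logger


def safe_divide(a, b, logger):
    """Divide a by b, logging the full traceback instead of crashing."""
    try:
        return a / b
    except ZeroDivisionError:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("Division failed for a=%s, b=%s", a, b)
        return None


logger = get_logger("demo_pipeline")  # hypothetical logger name
logger.info("Starting run")
result = safe_divide(1, 0, logger)
logger.info("Run finished, result=%s", result)
```

Naming the logger per module and guarding against duplicate handlers keeps the output readable once the same helper is imported from several places.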

In this article, I’ll share ~80% of the Python…


What you can do when someone blatantly plagiarises your work on Medium

Publishing Data Science stories on Medium is hard work. It takes weeks (even months) to research interesting topics, architect code in the simplest way possible, and weave it all into an engaging story. For example, last December (2020), I published a story titled “KNN is Dead,” which was the culmination of more than three months of my research into the field and is one of my finest works to date.

Unfortunately, someone had the great (*sarcasm*) idea to take a shortcut and completely plagiarised my story in January 2021, almost word for word. This person had more than 100 followers…


Data Science

Scaling ANNs to “Big” Data Volumes

Docker containers are crucial for Data Science at Scale [Link]. That’s very much the case for Approximate Nearest Neighbors (ANNs) on “big” data too!

Everything must run in a container

Speed and Accuracy (or Recall) are the top two considerations when choosing a Nearest Neighbors or Similarity Search algorithm. In my previous post, KNN is Dead, I showed the tremendous (>300X) speed advantage ANNs have over KNN at comparable accuracy. I’ve also discussed how you can choose the fastest, most accurate ANN algorithm for your own dataset [Link].
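To make "recall" concrete, here is a rough sketch in plain Python. It assumes recall is measured as the fraction of the exact (brute-force) neighbors that an approximate search also returns. The function names and the toy data are illustrative only, and the "approximate" result is faked by hand rather than produced by a real ANN library.

```python
import math
import random


def exact_knn(query, items, k):
    """Brute-force KNN: indices of the k items closest to query (Euclidean)."""
    order = sorted(range(len(items)), key=lambda i: math.dist(query, items[i]))
    return order[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbors that the approximate result recovered."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


random.seed(0)
# Toy data: 200 random 8-dimensional vectors (an assumption for illustration).
items = [[random.random() for _ in range(8)] for _ in range(200)]
query = items[0]

exact = exact_knn(query, items, k=10)

# Hypothetical approximate result: same as exact, but one true neighbor
# is swapped for an item that is not among the exact top 10.
not_a_neighbor = next(i for i in range(len(items)) if i not in exact)
approx = exact[:-1] + [not_a_neighbor]

recall = recall_at_k(approx, exact)
print(f"recall@10 = {recall:.2f}")
```

The brute-force search compares the query against every item, which is what makes exact KNN slow on large collections; an ANN index trades a small loss in recall for skipping most of those comparisons.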

However, sometimes, in addition to speed and accuracy, you also need…


Machine Learning, Opinion

A data-driven approach to choose the fastest, most accurate ANN algorithm on your custom dataset

ANN Background

In my previous post [KNN is Dead!], I compared an ANN algorithm called HNSW with sklearn's KNN and showed that HNSW performs vastly better, with a 380X speedup while delivering 99.3% of the same results.

To make things even more interesting, there are several ANN algorithms, such as:

  1. Spotify’s ANNOY
  2. Google’s ScaNN
  3. Facebook’s Faiss
  4. My personal favorite: Hierarchical Navigable Small World graphs HNSW
  5. and many more

As a data scientist, I am a huge proponent of making data-driven decisions, as I mentioned in How to Choose the Best Keras Pre-Trained Model. So, in this post, I’ll demonstrate a…


Careers, Data Science, Opinion

Opinion on what you could use in data science/analyst interviews instead.

Disclaimer: The opinions in this article are my own and not related to my employer in any way.

Data Science and Data Analytics are some of the hottest jobs on the market going into 2021. The field is so popular and job descriptions so broad that most job openings receive hundreds or even thousands of applicants because most men know they can apply to a position even when they don’t meet 100% of the requirements [Link]. For some reason, women are more conservative [Link].

With so many applications pouring in and Data Science/Analytics being new fields in many of these…


Artificial Intelligence, Opinion

How to explain it to your manager in under 1000 words

Artificial Intelligence is a broad term that encompasses many techniques, all of which enable computers to display some level of intelligence similar to us humans.

General AI

The most popular image of Artificial Intelligence is robots that perform like super-humans at many different tasks. They can fight, fly, and have deeply insightful conversations about virtually any topic. There are many examples of such robots in movies, both good and bad, like the Vision, Wall-E, Terminator, Ultron, etc. Though this is the holy grail of AI research, our current technology is very far from achieving that level of AI, which we call General AI.


Machine Learning

A 10,000-foot view in less than 1000 words

AI is not going to replace managers, but managers who use AI are going to replace those who don’t.

Machine Learning (ML) is one of those heavily used buzzwords that you often hear these days. Most managers want to use it but don’t know where to start or even what it actually means. It may seem mysterious, technical, and intimidating at first. But in this post, I’ll break down what ML is, its applications, how ML is built, and the skills you need to develop ML, all at a very high “management” level.

What is ML?

In the simplest terms, Machine Learning is the ability…

Marie Stephen Leo
