• In the previous post we covered a method that approximates the Jaccard similarity by constructing a signature of the original representation. This allowed us to significantly speed up the process of computing similarities between sets. But remember that the goal is to find all similar items to any given item. This requires to compute the similarities between all pairs of items in the dataset. If we go back to our example, Spotify has about 1.2 million artists on their platform. Which means that to find all similar artists we need to make 1.4 trillion comparisons… ahm … how about no. We’re going to do something different. We’re instead going to use Locality Sensitive Hashing (LSH) to identify candidate pairs and only compute the similarities on those. This will substantially reduce the computational burden.

LSH is a neat method to find similar items without computing similarities between every possible pair. It works by having items that have high similarity be hashed to the same bucket with high probability. This allows us to only measure similarities between items that land in the same bucket rather than comparing every possible pair of items. If two items are hashed to the same bucket, we consider them as candidate pairs and proceed with computing their similarity.

• Suppose you’re an engineer at Spotify and you’re on a mission to create a feature that lets users explore new artists that are similar to the ones they already listen to. The first thing you need to do is represent the artists in such a way that they can be compared to each other. You figure that one obvious way to characterize an artist is by the people that listen to it. You decide that each artist shall be defined as a set of user IDs of people that have listened to that artist at least once. For example, the representation for Miles Davis could be,

$\text{Miles Davis} = \{5, 23533, 2034, 932, ..., 17\}$

The number of elements in the set is the number of users that have listened to Miles Davis at least once. To compute the similarity between artists, we can compare these set representations. Now, with Spotify having more than 271 million users, these sets could be very large (especially for popular artists). It would take forever to compute the similarities, especially since we have to compare every artist to each other. In this post, I’ll introduce a method that can help us speed up this process. We’re going to be converting each set into a smaller representation called a signature, such that the similarities between the sets are well preserved.

• This post is best suited for people who are familiar with linear classifiers. I will also be assuming that the reader is familiar with gradient descent.

• The goal of this post isn’t to be a comprehensive guide about neural networks, but rather an attempt to show an intuitive path going from linear classifiers to a simple neural network.

There are many types of neural networks, each having some advantage over others. In this post, I want to introduce the simplest form of a neural network, a Multilayer Perceptron (MLP). MLPs are a powerful method for approximating functions and it’s a relatively simple model to implement.

Before we delve into MLPs, let’s quickly go over linear classifiers. Given training data as pairs $$(\boldsymbol{x}_i, y_i)$$ where $$\boldsymbol{x}_i \in \mathbb{R}^{d}$$ are datapoints (observations) and $$y_i \in \{0, 1\}$$ are their corresponding class labels, the goal is to learn a vector of weights $$\boldsymbol{w} \in \mathbb{R}^{d}$$ and a bias $$b \in \mathbb{R}$$ such that $$\boldsymbol{w}^T\boldsymbol{x}_{i} + b \ge 0$$ if $$y_{i} = 1$$ and $$\boldsymbol{w}^T\boldsymbol{x}_{i} + b < 0$$ otherwise ($$y_{i} = 0$$). This decision can be summarized as the following step function:

$\text{Prediction} = \begin{cases} 1 & \boldsymbol{w}^T\boldsymbol{x} + b \ge 0 \\ 0 & \text{Otherwise}\\ \end{cases}$

In the case of Logistic Regression the decision function is characterized by the sigmoid function $$\sigma(z) = \frac{1}{1+e^{-z}}$$ where $$z = \boldsymbol{w}^T\boldsymbol{x} + b$$

$\text{Prediction} = \begin{cases} 1 & \sigma(z) \ge \theta \\ 0 & \text{Otherwise}\\ \end{cases}$

Where $$\theta$$ is a threshold that is usually set to be 0.5.