<h3>Some Offline Metrics for Recommender Systems</h3>
<p><em>Giorgi Kvernadze · 2021-09-06</em></p>
<p>Evaluating recommender systems is notoriously tricky as offline measurements don’t always align with online outcomes, but offline metrics nonetheless have an important place in the toolset of a recommender system’s engineer. In this post, I’ll cover some popular offline metrics that are used for evaluating recommender systems.</p>
<!-- more -->
<p>Since the whole point of a recommender system is to aid the user in discovery by reducing the number of items they have to consider, we will assume that our recommender system is only allowed to make a maximum of \(k\) recommendations for each user. We’ll further assume that the recommendations are output as a ranked list, where a higher position implies that the recommender system assigns higher confidence or score to that item.</p>
<p>To give more concrete context to the metric calculations, let’s imagine that we created a music recommendation system and we want to evaluate how well it works on some held-out data. Our evaluation data will be split per user; that is, for each user we’ll have some data that will be fed to the recommender system as input and some hidden data that will be used to evaluate the recommender system’s output.</p>
<p>For each of the defined metrics, I’ll provide a simple Python implementation to make it easy to play around with different values and gain more intuition. These implementations are by no means efficient; they are simply meant to provide more insight.</p>
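<p>To make the later snippets concrete, here’s a tiny, made-up example of what the held-out data for a single user might look like (the song IDs are arbitrary placeholders, not real data):</p>

```python
# Hypothetical held-out data for one user; the IDs are made up for illustration.
relevant = [12, 40, 7, 31]      # songs the user actually listened to (ground truth)
predicted = [40, 3, 12, 99, 7]  # the system's top-5 recommendations, best first
k = 5

# A "hit" is a recommended song that appears in the ground truth
hits = set(relevant).intersection(predicted[:k])
print(sorted(hits))  # [7, 12, 40]
```

All of the metrics below are different ways of scoring how many hits we got and where in the ranking they appeared.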
<h4>Precision and Recall</h4>
<p>Precision and recall are the most popular and probably the most intuitive metrics you can calculate. Recall measures what percentage of the user’s liked items we recommended, while precision measures what percentage of the recommended items were among the user’s liked items.</p>
\[\text{Precision}_{k} = \frac{|\{\text{Liked Items}\} \cap \{\text{Recommended Items}\}|}{k}\]
\[\text{Recall}_{k} = \frac{|\{\text{Liked Items}\} \cap \{\text{Recommended Items}\}|}{|\{\text{Liked Items}\}|}\]
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">precision_k</span><span class="p">(</span><span class="n">relevant</span><span class="p">:</span> <span class="nb">list</span><span class="p">,</span>
<span class="n">predicted</span><span class="p">:</span> <span class="nb">list</span><span class="p">,</span>
<span class="n">k</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-></span> <span class="nb">float</span><span class="p">:</span>
<span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">relevant</span><span class="p">).</span><span class="n">intersection</span><span class="p">(</span><span class="n">predicted</span><span class="p">[:</span><span class="n">k</span><span class="p">]))</span> <span class="o">/</span> <span class="n">k</span>
<span class="k">def</span> <span class="nf">recall_k</span><span class="p">(</span><span class="n">relevant</span><span class="p">:</span> <span class="nb">list</span><span class="p">,</span>
<span class="n">predicted</span><span class="p">:</span> <span class="nb">list</span><span class="p">,</span>
<span class="n">k</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-></span> <span class="nb">float</span><span class="p">:</span>
<span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">relevant</span><span class="p">).</span><span class="n">intersection</span><span class="p">(</span><span class="n">predicted</span><span class="p">[:</span><span class="n">k</span><span class="p">]))</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">relevant</span><span class="p">)</span>
</code></pre></div></div>
<p>Notice that both metrics have the same numerator; you can use that fact to compute precision and recall in one function and share the computed numerator between the two. Here’s a tiny implementation in Python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Tuple</span>

<span class="k">def</span> <span class="nf">precision_recall_k</span><span class="p">(</span><span class="n">relevant</span><span class="p">:</span> <span class="nb">list</span><span class="p">,</span>
<span class="n">predicted</span><span class="p">:</span> <span class="nb">list</span><span class="p">,</span>
<span class="n">k</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-></span> <span class="n">Tuple</span><span class="p">[</span><span class="nb">float</span><span class="p">,</span> <span class="nb">float</span><span class="p">]:</span>
<span class="n">num_hits</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">relevant</span><span class="p">).</span><span class="n">intersection</span><span class="p">(</span><span class="n">predicted</span><span class="p">[:</span><span class="n">k</span><span class="p">]))</span>
<span class="n">precision_at_k</span> <span class="o">=</span> <span class="n">num_hits</span> <span class="o">/</span> <span class="n">k</span>
<span class="n">recall_at_k</span> <span class="o">=</span> <span class="n">num_hits</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">relevant</span><span class="p">)</span>
<span class="k">return</span> <span class="n">precision_at_k</span><span class="p">,</span> <span class="n">recall_at_k</span>
</code></pre></div></div>
<p>Both metrics take values between 0 and 1, where 1 is the best possible value. Note, however, that if the user has liked fewer than \(k\) items, even a perfect system will get a precision less than 1. The same goes for recall: if the user has more than \(k\) liked items, the recall of the best possible system will be less than 1.</p>
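<p>To see that ceiling concretely, here’s a quick sketch with made-up item IDs where the user liked only two items but we must return \(k = 5\) recommendations:</p>

```python
relevant = [1, 2]          # the user liked only two items
perfect = [1, 2, 8, 9, 5]  # a perfect system ranks both liked items first
k = 5

hits = len(set(relevant).intersection(perfect[:k]))
precision_at_5 = hits / k            # 2 / 5 = 0.4, even though the ranking is perfect
recall_at_5 = hits / len(relevant)   # 2 / 2 = 1.0
print(precision_at_5, recall_at_5)   # 0.4 1.0
```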
<h4>F1-score</h4>
<p>Precision and recall can be seen as a trade-off: we can usually increase recall arbitrarily by increasing \(k\), the number of recommended items, but with higher \(k\) the precision usually decreases. On the flip side, reducing \(k\) usually leads to higher precision at the cost of lower recall. It would be great to have a metric that captures both at the same time for a given, fixed \(k\). That’s exactly what the F1-score does: it’s the harmonic mean of precision and recall.</p>
\[\text{F1}_{k} = \frac{2 \cdot \text{Precision}_{k} \cdot \text{Recall}_{k}}{\text{Precision}_{k} + \text{Recall}_{k}}\]
<p>The F1-score is high when both recall and precision are high and is low when either one or both of them are low.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">f1_score</span><span class="p">(</span><span class="n">relevant</span><span class="p">:</span> <span class="nb">list</span><span class="p">,</span>
<span class="n">predicted</span><span class="p">:</span> <span class="nb">list</span><span class="p">,</span>
<span class="n">k</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-></span> <span class="nb">float</span><span class="p">:</span>
<span class="n">precision_at_k</span><span class="p">,</span> <span class="n">recall_at_k</span> <span class="o">=</span> <span class="n">precision_recall_k</span><span class="p">(</span><span class="n">relevant</span><span class="p">,</span>
<span class="n">predicted</span><span class="p">,</span>
<span class="n">k</span><span class="p">)</span>
    <span class="c1"># Guard against division by zero when there are no hits
</span>    <span class="k">if</span> <span class="n">precision_at_k</span> <span class="o">+</span> <span class="n">recall_at_k</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">return</span> <span class="mf">0.</span>
    <span class="k">return</span> <span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">precision_at_k</span> <span class="o">*</span> <span class="n">recall_at_k</span><span class="p">)</span> <span class="o">/</span> \
           <span class="p">(</span><span class="n">precision_at_k</span> <span class="o">+</span> <span class="n">recall_at_k</span><span class="p">)</span>
</code></pre></div></div>
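<p>As a quick sanity check, here’s the F1-score on some made-up data, with the computation inlined so the snippet is self-contained (and with a guard for the case where both precision and recall are zero):</p>

```python
def f1_at_k(relevant: list, predicted: list, k: int) -> float:
    hits = len(set(relevant).intersection(predicted[:k]))
    p, r = hits / k, hits / len(relevant)
    return 0.0 if p + r == 0 else (2 * p * r) / (p + r)

# Two hits out of five recommendations, four liked items in total:
# precision = 0.4, recall = 0.5, F1 = 2 * 0.4 * 0.5 / 0.9 ≈ 0.444
print(f1_at_k([1, 2, 3, 4], [1, 2, 9, 8, 7], k=5))

# No hits at all: both precision and recall are zero, so F1 is defined as zero
print(f1_at_k([1, 2], [8, 9], k=2))  # 0.0
```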
<h4>Average Precision (AP) and Mean Average Precision (MAP)</h4>
<p>One of the downsides of using precision and recall as a metric is the fact that they ignore the order of the recommendations. For instance, let’s imagine we have two different recommender systems with the following outputs for some user:</p>
\[\text{System}_A(\text{user}) = [6, 2, 1, 0, 3]\]
\[\text{System}_B(\text{user}) = [4, 1, 7, 2, 6]\]
<p>And let’s say that the only relevant items for the user are items 2 and 6. The precision and recall for both of the systems are identical, but we’d probably prefer System A, since it ranked the relevant items higher than System B did. One way this difference between the two systems can become apparent is to plot a precision-recall (PR) curve by calculating precision and recall at every cutoff from 1 up to \(k\) for both systems.</p>
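<p>We can verify that precision and recall can’t tell the two systems apart (a self-contained sketch with the computation inlined):</p>

```python
def pr_at_k(relevant: list, predicted: list, k: int) -> tuple:
    hits = len(set(relevant).intersection(predicted[:k]))
    return hits / k, hits / len(relevant)

relevant = [2, 6]
system_a = [6, 2, 1, 0, 3]  # relevant items ranked first
system_b = [4, 1, 7, 2, 6]  # relevant items ranked last

print(pr_at_k(relevant, system_a, k=5))  # (0.4, 1.0)
print(pr_at_k(relevant, system_b, k=5))  # (0.4, 1.0) -- identical scores
```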
<p>Although plots are great for analysis, ideally we want the difference to be comparable using a single number. Luckily, that’s exactly what AP provides. AP is (roughly) an approximation of the area under the PR curve <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. To compute AP, we compute precision at each position that had a relevant recommendation for a user and then take the average.</p>
\[\text{AP}_{k} = \frac{1}{n} \sum_{i=1}^{k} \text{Precision}_{i} \cdot \text{rel}_{i}\]
<p>Where \(\text{rel}_{i}\) equals 1 if the item at position \(i\) is relevant and 0 otherwise, and \(n\) is the number of relevant items among the \(k\) recommendations.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">ap_k</span><span class="p">(</span><span class="n">relevant</span><span class="p">:</span> <span class="nb">list</span><span class="p">,</span>
<span class="n">predicted</span><span class="p">:</span> <span class="nb">list</span><span class="p">,</span>
<span class="n">k</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-></span> <span class="nb">float</span><span class="p">:</span>
<span class="n">relevant_idx</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c1"># Find indices of predicted items that were relevant
</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">item</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">predicted</span><span class="p">[:</span><span class="n">k</span><span class="p">]):</span>
<span class="k">if</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">relevant</span><span class="p">:</span>
<span class="n">relevant_idx</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
<span class="c1"># Compute precision at each index of predicted relevant item
</span> <span class="n">precisions</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">idx</span> <span class="ow">in</span> <span class="n">relevant_idx</span><span class="p">:</span>
<span class="c1"># Using the precision_k function we defined earlier
</span> <span class="n">precision_at_idx</span> <span class="o">=</span> <span class="n">precision_k</span><span class="p">(</span><span class="n">relevant</span><span class="p">,</span> <span class="n">predicted</span><span class="p">,</span> <span class="n">idx</span><span class="p">)</span>
<span class="n">precisions</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">precision_at_idx</span><span class="p">)</span>
    <span class="c1"># If none of the recommendations were relevant, define the AP as zero
</span>    <span class="k">if</span> <span class="ow">not</span> <span class="n">precisions</span><span class="p">:</span>
        <span class="k">return</span> <span class="mf">0.</span>
    <span class="k">return</span> <span class="nb">float</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">precisions</span><span class="p">))</span>
</code></pre></div></div>
<p>Now, the MAP is just the mean of APs across a collection of users.</p>
\[\text{MAP}_{k} = \frac{1}{N} \sum_{j=1}^{N} \text{AP}_{k}(j)\]
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span>

<span class="k">def</span> <span class="nf">map_k</span><span class="p">(</span><span class="n">relevant_batch</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">list</span><span class="p">],</span>
<span class="n">predicted_batch</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">list</span><span class="p">],</span>
<span class="n">k</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-></span> <span class="nb">float</span><span class="p">:</span>
<span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">([</span><span class="n">ap_k</span><span class="p">(</span><span class="n">relevant</span><span class="p">,</span> <span class="n">predicted</span><span class="p">,</span> <span class="n">k</span><span class="p">)</span>
<span class="k">for</span> <span class="n">relevant</span><span class="p">,</span> <span class="n">predicted</span>
<span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">relevant_batch</span><span class="p">,</span> <span class="n">predicted_batch</span><span class="p">)])</span>
</code></pre></div></div>
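<p>Running AP on the two systems from earlier shows that, unlike precision and recall, it rewards System A for ranking the relevant items higher (the AP computation is restated inline so the snippet runs on its own):</p>

```python
def ap_at_k(relevant: list, predicted: list, k: int) -> float:
    hits, precisions = 0, []
    for i, item in enumerate(predicted[:k], start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / i)  # precision at position i
    # If nothing relevant was recommended, define AP as zero
    return sum(precisions) / len(precisions) if precisions else 0.0

relevant = [2, 6]
print(ap_at_k(relevant, [6, 2, 1, 0, 3], k=5))  # (1/1 + 2/2) / 2 = 1.0
print(ap_at_k(relevant, [4, 1, 7, 2, 6], k=5))  # (1/4 + 2/5) / 2 = 0.325
```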
<h4>Reciprocal Rank (RR) and Mean Reciprocal Rank (MRR)</h4>
<p>Different from the previous metrics, reciprocal rank only cares about the rank of the first relevant recommendation. Let \(\text{rank}_{k}(\text{user}_{i})\) be a function that returns the rank of the first relevant item among the \(k\) ranked recommendations for user \(i\). The reciprocal rank (RR) is then defined as:</p>
\[\text{RR}_k = \frac{1}{\text{rank}_{k}(\text{user}_{i})}\]
<p>Note that RR is undefined if the \(k\) recommendations do not contain any relevant items; in such a case we set the RR to zero.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">rr_k</span><span class="p">(</span><span class="n">relevant</span><span class="p">:</span> <span class="nb">list</span><span class="p">,</span>
<span class="n">predicted</span><span class="p">:</span> <span class="nb">list</span><span class="p">,</span>
<span class="n">k</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-></span> <span class="nb">float</span><span class="p">:</span>
<span class="n">rank</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">item</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">predicted</span><span class="p">[:</span><span class="n">k</span><span class="p">]):</span>
<span class="k">if</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">relevant</span><span class="p">:</span>
<span class="n">rank</span> <span class="o">=</span> <span class="n">i</span> <span class="o">+</span> <span class="mi">1</span>
<span class="k">break</span>
<span class="k">return</span> <span class="mf">1.</span> <span class="o">/</span> <span class="n">rank</span> <span class="k">if</span> <span class="n">rank</span> <span class="k">else</span> <span class="mf">0.</span>
</code></pre></div></div>
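<p>A couple of quick calls on made-up data illustrate how RR reacts only to the position of the first hit (the computation is restated inline so the snippet is self-contained):</p>

```python
def rr_at_k(relevant: list, predicted: list, k: int) -> float:
    # Reciprocal of the rank of the first relevant recommendation
    for i, item in enumerate(predicted[:k], start=1):
        if item in relevant:
            return 1.0 / i
    return 0.0  # no relevant item in the top k

relevant = [2, 6]
print(rr_at_k(relevant, [6, 2, 1, 0, 3], k=5))  # first hit at rank 1 -> 1.0
print(rr_at_k(relevant, [4, 1, 7, 2, 6], k=5))  # first hit at rank 4 -> 0.25
print(rr_at_k(relevant, [8, 9, 3, 0, 5], k=5))  # no hits -> 0.0
```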
<p>Now, MRR is just the mean of RRs over a collection of users \(U\):</p>
\[\text{MRR}_k = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{\text{rank}_{k}(\text{user}_{i})} = \frac{1}{N}\sum_{i=1}^{N} \text{RR}_{i}\]
<p>Where \(N\) is the number of users in \(U\). We can reuse the implementation of RR to implement MRR:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">mrr_k</span><span class="p">(</span><span class="n">relevant_batch</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">list</span><span class="p">],</span>
<span class="n">predicted_batch</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">list</span><span class="p">],</span>
<span class="n">k</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-></span> <span class="nb">float</span><span class="p">:</span>
<span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">([</span><span class="n">rr_k</span><span class="p">(</span><span class="n">relevant</span><span class="p">,</span> <span class="n">predicted</span><span class="p">,</span> <span class="n">k</span><span class="p">)</span>
<span class="k">for</span> <span class="n">relevant</span><span class="p">,</span> <span class="n">predicted</span>
<span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">relevant_batch</span><span class="p">,</span> <span class="n">predicted_batch</span><span class="p">)])</span>
</code></pre></div></div>
<h4>Normalized Discounted Cumulative Gain (nDCG)</h4>
<p>So far, all of the metrics we discussed assume that item relevancy is binary, i.e. an item is either relevant to the user or not. But often in real applications, we not only know whether an item is relevant, we also have some information on <em>how</em> relevant the item is to the user. Going back to the music recommender, instead of just looking at which songs the user liked, we could additionally consider the listen counts and use them to measure the degree of relevancy. In this setting, the goal is to recommend the items with the highest degree of relevancy to the user at the highest positions in the recommendation output. To understand whether we’re achieving that, we need a metric like nDCG.</p>
<p>Let \(s_{i}\) be the relevancy score for the item at position \(i\) in our recommended item list. Then the DCG is computed as:</p>
\[DCG_{k} = \sum_{i=1}^{k}\frac{s_{i}}{\log_{2}({i + 1})}\]
<p>We normalize the DCG by dividing it by the best possible DCG score achievable for the given user. We refer to this quantity as the ideal discounted cumulative gain, or IDCG for short. The IDCG is simply the score we would get if we recommended all of the user’s relevant items in descending order of relevancy score. The nDCG is then computed as the ratio of the DCG and the IDCG:</p>
\[nDCG_{k} = \frac{DCG_{k}}{IDCG_{k}}\]
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="k">def</span> <span class="nf">ndcg_k</span><span class="p">(</span><span class="n">relevant</span><span class="p">:</span> <span class="nb">list</span><span class="p">,</span>
<span class="n">relevancy_scores</span><span class="p">:</span> <span class="nb">list</span><span class="p">,</span>
<span class="n">predicted</span><span class="p">:</span> <span class="nb">list</span><span class="p">,</span>
<span class="n">k</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-></span> <span class="nb">float</span><span class="p">:</span>
<span class="c1"># Create a relevancy array for the predicted items
</span> <span class="n">relevancy_scores_dict</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">relevant</span><span class="p">,</span> <span class="n">relevancy_scores</span><span class="p">))</span>
<span class="n">predicted_item_relevancy_scores</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">predicted</span><span class="p">:</span>
<span class="k">if</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">relevant</span><span class="p">:</span>
<span class="n">predicted_item_relevancy_scores</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">relevancy_scores_dict</span><span class="p">[</span><span class="n">item</span><span class="p">])</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">predicted_item_relevancy_scores</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="c1"># Convert it to a ndarray
</span> <span class="n">predicted_item_relevancy_scores</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">predicted_item_relevancy_scores</span><span class="p">)</span>
<span class="c1"># Compute the DCG
</span> <span class="n">dcg</span> <span class="o">=</span> <span class="n">_dcg_k</span><span class="p">(</span><span class="n">predicted_item_relevancy_scores</span><span class="p">,</span> <span class="n">k</span><span class="p">)</span>
<span class="c1"># Compute the ideal DCG
</span> <span class="n">idcg</span> <span class="o">=</span> <span class="n">_dcg_k</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">sort</span><span class="p">(</span><span class="n">relevancy_scores</span><span class="p">)[::</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">k</span><span class="p">)</span>
<span class="c1"># return normalized DCG
</span> <span class="k">return</span> <span class="n">dcg</span> <span class="o">/</span> <span class="n">idcg</span>
<span class="k">def</span> <span class="nf">_dcg_k</span><span class="p">(</span><span class="n">relevancy_scores</span><span class="p">:</span> <span class="nb">list</span><span class="p">,</span> <span class="n">k</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-></span> <span class="nb">float</span><span class="p">:</span>
<span class="n">discounts</span> <span class="o">=</span> <span class="mf">1.0</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="n">log2</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="nb">min</span><span class="p">(</span><span class="n">k</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">relevancy_scores</span><span class="p">))</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="nb">float</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">relevancy_scores</span><span class="p">[:</span><span class="n">k</span><span class="p">]</span> <span class="o">*</span> <span class="n">discounts</span><span class="p">))</span>
</code></pre></div></div>
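<p>Here’s nDCG on a small made-up example that uses listen counts as relevancy scores; the computation is restated compactly so the snippet is self-contained. When the predicted order matches the ideal order, nDCG is exactly 1:</p>

```python
import numpy as np

def ndcg_at_k(relevant: list, scores: list, predicted: list, k: int) -> float:
    score_of = dict(zip(relevant, scores))
    # Relevancy score of each predicted item (0 for items the user never liked)
    gains = np.array([score_of.get(item, 0) for item in predicted[:k]], dtype=float)
    dcg = float(np.sum(gains / np.log2(np.arange(2, len(gains) + 2))))
    # Ideal ordering: the user's scores sorted in descending order
    ideal = np.sort(np.asarray(scores, dtype=float))[::-1][:k]
    idcg = float(np.sum(ideal / np.log2(np.arange(2, len(ideal) + 2))))
    return dcg / idcg

relevant = [10, 20, 30]   # songs the user listened to
listens = [5, 1, 3]       # listen counts used as relevancy scores
print(ndcg_at_k(relevant, listens, [10, 30, 20, 40], k=4))  # ideal order -> 1.0
print(ndcg_at_k(relevant, listens, [20, 10, 30, 40], k=4))  # worse order -> ~0.765
```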
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>For more information, check out <a href="https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html"> Evaluation of ranked retrieval results </a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>

<h3>Locality Sensitive Hashing for MinHash</h3>
<p><em>Giorgi Kvernadze · 2020-06-10</em></p>
<p>In the <a href="https://giorgi.tech/blog/minhashing/">previous post</a> we covered a method that approximates the Jaccard similarity by constructing a signature of the original representation. This allowed us to significantly speed up the process of computing similarities between sets. But remember that the goal is to find all items similar to any given item, which requires computing the similarities between all pairs of items in the dataset. If we go back to our example, Spotify has about 1.2 million artists on their platform, which means that to find all similar artists we would need to make on the order of a trillion comparisons… ahm… how about no. We’re going to do something different. We’re instead going to use Locality Sensitive Hashing (LSH) to identify candidate pairs and only compute the similarities on those. This will substantially reduce the computational time.</p>
<p>LSH is a neat method to find similar items without computing similarities between every possible pair. It works by having items that have high similarity be hashed to the same bucket with high probability. This allows us to only measure similarities between items that land in the same bucket rather than comparing every possible pair of items. If two items are hashed to the same bucket, we consider them as candidate pairs and proceed with computing their similarity.</p>
<!-- more -->
<h3>Banding Technique</h3>
<p>LSH is a broad term that refers to the collection of hashing methods that preserve similarities. In this post we’re going to be discussing one particular such method that efficiently computes candidate pairs for items that are in the form of minhash signatures. It is a pretty easy procedure both algorithmically and conceptually. It uses the intuition that if two items have identical signature parts in some random positions then they’re probably similar. This is the idea we’re going to turn to in order to identify candidate pairs.</p>
<p>To proceed we first need a signature matrix; if you don’t recall how a signature matrix is computed, you can refer to my previous post on <a href="https://giorgi.tech/blog/minhashing/">min hashing</a>. Let’s assume that a signature matrix is provided to us:</p>
<p><img class="center" src="/images/sig.png" width="50%" /></p>
<p>We begin by dividing the signature matrix into \(b\) bands with \(r\) rows each. This means that we are slicing each item’s signature into contiguous, non-overlapping chunks.</p>
<p><img class="center" src="/images/sig_banded.png" /></p>
<p>For each band, we take all of the chunks and hash them individually using some hashing function <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> and we store them into a hash bucket. An important thing to note is that we use a separate hash table for each of the bands; this ensures that we only compare chunks of signatures within the same band rather than across bands. The idea is that if two items land in the same bucket for any of the bands then we consider them as candidates. Using a hashing function rather than directly comparing the items is what allows us to avoid a quadratic number of comparisons.</p>
<p><img class="center" src="/images/lsh.png" /></p>
<p>In this case it looks like we have the following candidate pairs: \((\text{artist}_{3}, \text{artist}_{5})\) and \((\text{artist}_{1}, \text{artist}_{5})\).</p>
<p><em>Note: Although the picture depicts a hash table with only four buckets in reality the number of buckets is usually much larger than the number of items.</em></p>
<p>Here’s a really simple implementation of an LSH for Jaccard similarities:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">defaultdict</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="k">def</span> <span class="nf">minhash_lsh</span><span class="p">(</span><span class="n">sig_matrix</span><span class="p">,</span> <span class="n">num_bands</span><span class="p">):</span>
    <span class="n">num_items</span> <span class="o">=</span> <span class="n">sig_matrix</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">bands</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="n">sig_matrix</span><span class="p">,</span> <span class="n">num_bands</span><span class="p">)</span>
    <span class="n">bands_buckets</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">band</span> <span class="ow">in</span> <span class="n">bands</span><span class="p">:</span>
        <span class="n">items_buckets</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">list</span><span class="p">)</span>
        <span class="n">items</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">hsplit</span><span class="p">(</span><span class="n">band</span><span class="p">,</span> <span class="n">num_items</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">item</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">items</span><span class="p">):</span>
            <span class="n">item</span> <span class="o">=</span> <span class="nb">tuple</span><span class="p">(</span><span class="n">item</span><span class="p">.</span><span class="n">flatten</span><span class="p">().</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">))</span>
            <span class="n">items_buckets</span><span class="p">[</span><span class="n">item</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
        <span class="n">bands_buckets</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">items_buckets</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">bands_buckets</span>
</code></pre></div></div>
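<p>To see the banding procedure end to end, here is a quick sketch that restates the function above in plain Python (with the raw chunk itself standing in for its hash) and then extracts candidate pairs from the buckets. The signature matrix below is made up purely for illustration; it is not the one from the figures.</p>

```python
from collections import defaultdict
from itertools import combinations
import numpy as np

def minhash_lsh(sig_matrix, num_bands):
    # Split the signature matrix into bands and bucket each item's chunk.
    # The tuple of chunk values serves as the bucket key (a stand-in for
    # hashing the chunk).
    num_items = sig_matrix.shape[1]
    bands_buckets = []
    for band in np.split(sig_matrix, num_bands):
        buckets = defaultdict(list)
        for i, chunk in enumerate(np.hsplit(band, num_items)):
            buckets[tuple(chunk.flatten().astype(int))].append(i)
        bands_buckets.append(buckets)
    return bands_buckets

def candidate_pairs(bands_buckets):
    # Any two items that share a bucket in at least one band are candidates.
    pairs = set()
    for buckets in bands_buckets:
        for items in buckets.values():
            pairs.update(combinations(sorted(items), 2))
    return pairs

# A made-up 6-row signature matrix for 4 items, split into 3 bands of 2 rows.
sig = np.array([
    [1, 2, 1, 9],
    [3, 4, 3, 9],
    [5, 6, 7, 5],
    [8, 6, 7, 8],
    [2, 2, 2, 2],
    [0, 1, 0, 1],
])
print(candidate_pairs(minhash_lsh(sig, 3)))  # {(0, 2), (0, 3), (1, 3)}
```

Note that items 0 and 2 agree in two different bands but are only counted once, since the result is a set of pairs.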
<p>Now you may have noticed that \(b\) and \(r\) are parameters that we picked completely arbitrarily. To understand their significance we have to go back a little bit. Recall that the probability of two items having the same min hash value in any of the rows of the signature matrix is equal to the Jaccard similarity of those two items. We can use this fact to compute the probability of these two items being candidate pairs. Let \(s\) be the Jaccard similarity, then:</p>
<ul>
<li>If each band has \(r\) rows, the probability that the signatures agree on the entire band is: \(s^r\)</li>
<li>The complement of this, the probability that they do not agree on the entire band, is \(1 - s^r\)</li>
<li>The probability that the signatures disagree in all of the bands is \((1 - s^r)^b\)</li>
<li>Therefore, the probability that the two items signatures agree in at least one band is \(1 - (1 - s^r)^b\)</li>
</ul>
<p>We have just derived the probability of two items being a candidate pair as a function of \(s\) with parameters \(r\) and \(b\): \(f_{b, r}(s) = 1 - (1 - s^r)^b\). If you plot this function using any \(b\) and \(r\) it will look like an S curve.</p>
<p>For example, let’s plot the function with parameters \(b=2\) and \(r=3\).</p>
<p><img class="center" src="/images/b2r3.png" width="80%" /></p>
<p>As we can see the plot is shifted to the right; this means that in order for two items to be candidates, their similarity has to be high. For example, if two items have a similarity of \(0.5\) they only have a \(0.23\) probability of becoming candidates. If you go back and look at the signature matrix this should make perfect sense: we selected parameters that produce large bands relative to the signature matrix. If we want to make candidates more probable, we can increase \(b\) (decreasing \(r\) accordingly, since \(b \cdot r\) must equal the signature length).</p>
<p><img class="center" src="/images/b3r2.png" width="80%" /></p>
<p>Notice how the plot has shifted to the left. With these parameters, if two items have similarity \(0.5\) there is a \(0.57\) probability of them becoming candidates.</p>
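<p>Both probabilities quoted above can be checked directly; this is just the S-curve formula evaluated at \(s = 0.5\):</p>

```python
def candidate_prob(s, b, r):
    # Probability that two items with Jaccard similarity s share a bucket
    # in at least one of b bands of r rows each: 1 - (1 - s^r)^b
    return 1 - (1 - s ** r) ** b

print(candidate_prob(0.5, b=2, r=3))  # 0.234375, the ~0.23 above
print(candidate_prob(0.5, b=3, r=2))  # 0.578125, the ~0.57 above
```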
<p>In the beginning of the post I mentioned that we would only cover a single instance of an LSH method. The method we described works great for approximating nearest neighbours when your data points are sets, but what if our data points are vectors in some high dimensional space? Luckily, there are methods that work on other types of data. Check out <a href="http://infolab.stanford.edu/~ullman/mmds/ch3.pdf">Chapter 3.6 of Mining Massive Datasets</a> if you want to know more about what LSH is formally and what other techniques there are.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>We can use the built-in hashing function of whatever programming language we’re using. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Giorgi KvernadzeIn the previous post we covered a method that approximates the Jaccard similarity by constructing a signature of the original representation. This allowed us to significantly speed up the process of computing similarities between sets. But remember that the goal is to find all similar items to any given item. This requires computing the similarities between all pairs of items in the dataset. If we go back to our example, Spotify has about 1.2 million artists on their platform. That means that finding all similar artists would take roughly 0.7 trillion pairwise comparisons… ahm … how about no. We’re going to do something different. We’re instead going to use Locality Sensitive Hashing (LSH) to identify candidate pairs and only compute the similarities on those. This will substantially reduce the computational time. LSH is a neat method to find similar items without computing similarities between every possible pair. It works by having items that have high similarity be hashed to the same bucket with high probability. This allows us to only measure similarities between items that land in the same bucket rather than comparing every possible pair of items. If two items are hashed to the same bucket, we consider them as candidate pairs and proceed with computing their similarity.Illustrated Guide to Min Hashing2020-03-10T00:00:00+00:002020-03-10T00:00:00+00:00https://giorgi.tech/blog/minhashing<p>Suppose you’re an engineer at Spotify and you’re on a mission to create a feature that lets users explore new artists that are similar to the ones they already listen to. The first thing you need to do is represent the artists in such a way that they can be compared to each other. You figure that one obvious way to characterize an artist is by the people that listen to it. You decide that each artist shall be defined as a set of user IDs of people that have listened to that artist at least once. For example, the representation for Miles Davis could be,</p>
\[\text{Miles Davis} = \{5, 23533, 2034, 932, ..., 17\}\]
<p>The number of elements in the set is the number of users that have listened to Miles Davis at least once. To compute the similarity between artists, we can compare these set representations. Now, with Spotify having more than 271 million users, these sets could be very large (especially for popular artists). It would take forever to compute the similarities, especially since we have to compare every artist to each other. In this post, I’ll introduce a method that can help us speed up this process. We’re going to be converting each set into a smaller representation called a signature, such that the similarities between the sets are well preserved.</p>
<!-- more -->
<!-- We're going to do something different instead. Instead of representing the artist as a set of all of the users, that listen to it.
In this post, we're going to talk about how to speed up the process of computing the similarities between these sets. -->
<!-- That is, you compute the ratio between the size of the intersection of the sets and the union. This -->
<!-- Suppose we want to cluster similar artists on Spotify. With about 271 million artists, if we assume that each artist has about 1000 songs in their listening history we would need ~4.33 terabytes to represent the entire data. That is a lot! To make things even worse, in order to find the clusters with $$n$$ artists, we need to measure the similarity between all $${n \choose 2} = \frac{n(n - 1)}{2}$$ pairs of artists. With 271 million artists that is an astronomical number of comparisons!
In this post, the goal will be to reduce the size of the representation of each artist while preserving the similarities. This will take care of the memory issue and some of the computational burden. In the next post, we'll talk about what to do with the bigger computational problem of calculating similarities between all of the pairs of artists. -->
<!--
Measuring similarity of objects is one of the most fundamental computations for data mining. Similarity can be used to detect plagiarism, categorize documents, recommend products to customers and there are many many more applications. There are a lot of different ways of defining similarity. In this post I'll be talking about Jaccard similarity and its' approximation. -->
<h3>Toy Example</h3>
<p>I think working with tiny examples to build intuition can be an excellent method for learning. In that spirit, let’s consider a toy example. Assume that we only have 3 artists and a total of 8 users in our dataset.</p>
\[\text{artist}_{1} = \{1, 4, 7\}\]
\[\text{artist}_{2} = \{0, 1, 2, 4, 5, 7\}\]
\[\text{artist}_{3} = \{0, 2, 3, 5, 6\}\]
<h3>Jaccard similarity</h3>
<p>The goal is to find similar artists, so we need a way to measure the similarity between a pair of artists. We will be using the Jaccard similarity, which is defined as the fraction of shared elements between two sets. In our case the sets contain user IDs. All we have to compute is how many users each pair of artists share, divided by the total number of users across both artists. For example, the Jaccard similarity between \(\text{artist}_1\) and \(\text{artist}_2\):</p>
\[J(\text{artist}_{1}, \text{artist}_{2}) = \frac{|\text{artist}_{1} \cap \text{artist}_{2}|}{|\text{artist}_{1} \cup \text{artist}_{2}|} = \frac{|\{1, 4, 7\}|}{|\{0, 1, 2, 4, 5, 7\}|} = \frac{3}{6} = 0.5\]
<p>Similarly for the other pairs, we have:</p>
<center>
$$J(\text{artist}_{2}, \text{artist}_{3}) = \frac{3}{8} = 0.375$$
$$J(\text{artist}_{1}, \text{artist}_{3}) = \frac{0}{8} = 0$$
</center>
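<p>These three values are easy to verify with a few lines of Python, treating each artist as a plain set of user IDs (a quick sketch, not from the original post):</p>

```python
def jaccard(a, b):
    # Jaccard similarity: |intersection| / |union|
    return len(a & b) / len(a | b)

artist_1 = {1, 4, 7}
artist_2 = {0, 1, 2, 4, 5, 7}
artist_3 = {0, 2, 3, 5, 6}

print(jaccard(artist_1, artist_2))  # 0.5
print(jaccard(artist_2, artist_3))  # 0.375
print(jaccard(artist_1, artist_3))  # 0.0
```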
<p>A few key things about the Jaccard similarity:</p>
<ul>
<li>The Jaccard similarity is 0 if the two sets share no elements, and it’s 1 if the two sets are identical. Every other case has values between 0 and 1.</li>
<li>The Jaccard similarity between two sets corresponds to the probability of a randomly selected element from the union of the sets also being in the intersection.</li>
</ul>
<p>Let’s unpack the second one, because it’s definitely the most important thing to know about the Jaccard similarity.</p>
<h3>Intuition behind the Jaccard similarity</h3>
<p>For some people (present company included), visual explanations are easier to grasp than algebraic ones. We’ll briefly shift our view from sets to Venn diagrams. Let’s imagine any two artists as Venn diagrams, the Jaccard similarity is the size of the intersection divided by the size of the union:</p>
<p><img class="center" src="/images/js_venn.png" width="50%" /></p>
<p>Now imagine that I’m throwing darts on the diagrams and I’m horrible at it. I’m so bad that every element on the diagrams has an equal chance of being hit. What’s the chance that I throw a dart and it lands on the intersection? It would be the number of elements in the intersection divided by the total number of elements, which is exactly what the Jaccard similarity is. This implies that the larger the similarity, the higher the probability that we land on the intersection with a random throw.</p>
<p><img class="center" src="/images/venns.png" width="50%" /></p>
<p>Consider another scenario. Suppose you want to know the similarity between two sets, but you can’t see the diagram; you’re blindfolded. However, if you throw a dart, you do get told where it landed. Can you make a good guess about the similarity of the two sets by randomly throwing darts at them? Let’s say after throwing 10 darts we know that 6 of them landed in the intersection. What would you guess the similarity of the two sets is? Let’s say after throwing 40 more darts, we know that 30 of the total 50 throws landed in the intersection. What would your guess be now? Are you more certain about your guess? Why?</p>
<p>Ponder this for a little bit and keep this picture in mind throughout the rest of this post. This is, in essence, the basis for the MinHash algorithm.</p>
<h3>Approximate Jaccard similarity</h3>
<p>In the previous paragraph, we alluded to the fact that it’s possible to approximate the Jaccard similarity between two sets. In order to see why that’s true, we need to rehash some of the things we’ve said, mathematically.</p>
<p>Let’s take \(\text{artist}_1\) and \(\text{artist}_2\) and their union \(\text{artist}_1 \cup \text{artist}_2 = \{0, 1, 2, 4, 5, 7\}\).
Some of the elements in union are also in the intersection, more specifically \(\{1, 4, 7\}\).</p>
<p>Let’s replace the elements with the symbols “+” and “-”, where “+” denotes that an element appears in the intersection and “-” that it does not.</p>
\[\{0, 1, 2, 4, 5, 7\} \rightarrow \{-, +, -, +, -, +\}\]
<p>If every element has an equal probability of being picked, what is the probability of drawing an element that is of type “+”? It’s the number of pluses divided by number of pluses and number of minuses.</p>
\[P(\text{picking a "+"}) = \frac{\text{number of "+"}}{\text{number of "+" and "-"}}\]
<p>The number of “+” corresponds to the number of elements in the intersection, and the number of “+” and “-” together corresponds to the total number of elements, i.e. the size of the union. Therefore,</p>
<div style="font-size: 80%;">
$$P(\text{picking a "+"}) = \frac{\text{number of "+"}}{\text{number of "+" and "-"}} = \frac{|\{1, 4, 7\}|}{|\{0, 1, 2, 4, 5, 7\}|} = J(\text{artist}_1, \text{artist}_2)$$
</div>
<p>What this means is that we can approximate the Jaccard similarity between pairs of artists. Let \(X\) be a random variable such that \(X = 1\) if we draw a plus and \(X = 0\) if we draw a minus. \(X\) is a Bernoulli random variable with \(p=J(\text{artist}_{1}, \text{artist}_{2})\). In order to estimate the similarity, we can estimate \(p\). In this case, we obviously know that \(p=0.5\) since we already computed it, but let’s assume that we don’t know this.</p>
<p>If we repeat the random draw multiple times and keep track of how many times a “+” type came up versus a “-”, we can estimate the parameter \(p\) for \(X\) by maximum likelihood estimation (MLE):</p>
\[\hat{p} = \frac{1}{n} \sum_{i=1}^{n} X_{i} = \hat{J}(\text{artist}_{1}, \text{artist}_{2})\]
<p>Where \(X_{i}\) are our observations and \(n\) is the total number of draws that were made. The larger the number of draws \(n\), the better the estimation will be.</p>
<p>The code below simulates the process 30 times and empirically computes the similarity.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">num_trials</span> <span class="o">=</span> <span class="mi">30</span>
<span class="c1"># Union of artist_1 and artist_2
</span><span class="n">union</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">7</span><span class="p">])</span>
<span class="c1"># Intersection of artist_1 and artist_2
</span><span class="n">intersection</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">7</span><span class="p">])</span>
<span class="c1"># Randomly pick element
</span><span class="n">draws</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">union</span><span class="p">,</span> <span class="n">num_trials</span><span class="p">)</span>
<span class="n">num_intersect</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">count_nonzero</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">isin</span><span class="p">(</span><span class="n">draws</span><span class="p">,</span> <span class="n">intersection</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">num_intersect</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">draws</span><span class="p">))</span>
</code></pre></div></div>
<p>If you run the above code you should get something that is close to \(0.5\), which, as expected, corresponds to the Jaccard similarity between \(\text{artist}_1\) and \(\text{artist}_2\). Play around with the variable <code>num_trials</code>: what happens if you set it to 1? What about 10,000?</p>
<h3>Shuffling and Picking First \(\equiv\) Randomly Picking</h3>
<p>Before we move on, we need to understand one more thing. Randomly selecting an element from a set is the same thing as shuffling the set and picking the first element <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. Everything that we have said above is also true if we, instead of randomly selecting an element, shuffled the set and picked the first element.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">num_trials</span> <span class="o">=</span> <span class="mi">30</span>
<span class="c1"># Union of artist_1 and artist_2
</span><span class="n">union</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">7</span><span class="p">])</span>
<span class="c1"># Intersection of artist_1 and artist_2
</span><span class="n">intersection</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">7</span><span class="p">])</span>
<span class="c1"># Shuffle and pick first element
</span><span class="n">num_intersect</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_trials</span><span class="p">):</span>
    <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">shuffle</span><span class="p">(</span><span class="n">union</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">union</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="ow">in</span> <span class="n">intersection</span><span class="p">:</span>
        <span class="n">num_intersect</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">print</span><span class="p">(</span><span class="n">num_intersect</span><span class="o">/</span><span class="n">num_trials</span><span class="p">)</span>
</code></pre></div></div>
<p>The code above implements the same process that I described before, but instead of randomly picking an element, it is shuffling the elements in the union and picking the first element. If you run this, you should similarly get something that is close to \(0.5\).</p>
<h3>Data Matrix</h3>
<p>We have shown that it’s possible to approximate Jaccard similarity for a pair of artists using randomness but our previous method had a significant issue. We still needed to have the intersection and the union of the sets to estimate the Jaccard similarity, which kind of defeats the whole purpose. We need a way to approximate the similarities without having to compute these sets. We also need to approximate the similarities for all pairs of artists, not just a given pair. In order to do that, we’re going to switch our view of the data from sets to a matrix.</p>
<p><img class="center" src="/images/artist_matrix.png" width="50%" /></p>
<p>The columns represent the artists and the rows represent the user IDs. A given artist has a \(1\) in a particular row if the user with that ID has that artist in their listening history <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.</p>
<h3>Min Hashing</h3>
<p>Going back to our main goal, we want to reduce the size of the representation for each artist while preserving the Jaccard similarities between pairs of artists in the dataset. In more “mathy” terms, we have a data matrix \(D\) that we want to encode in some smaller matrix \(\hat{D}\) called the signature matrix, such that \(J_{pairwise}(D) \approx \hat{J}_{pairwise}(\hat{D})\)</p>
<p>The first algorithm I will be describing is not really practical but it’s a good way to introduce the actual algorithm called MinHash. The whole procedure can be summarized in a sentence: shuffle the rows of the data matrix and for each artist (column) store the ID of the first non-zero element. That’s it!</p>
<p><strong>naive-minhashing</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for k iterations
    shuffle rows
    for each column
        store the ID of the first non-zero element into the signature matrix
</code></pre></div></div>
<p>Let’s go through one iteration of this algorithm:</p>
<p><img src="/images/1iteration.png" alt="" /></p>
<p>We have now reduced each artist to a single number. To compute the Jaccard similarities between the artists we compare the signatures. Let \(h\) be the function that finds and returns the index of the first non-zero element. Then we have:</p>
<center>
$$h(\text{artist}_{1}) = 7$$
$$h(\text{artist}_{2}) = 0$$
$$h(\text{artist}_{3}) = 0$$
</center>
<p>And the Jaccard similarities are estimated as <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>:</p>
\[\hat{J}(\text{artist}_{i}, \text{artist}_{j}) = \unicode{x1D7D9}[h(\text{artist}_{i}) = h(\text{artist}_{j})]\]
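<p>The pseudocode and the estimator can be put together in a short sketch. The names below (<code>naive_minhash</code>, <code>estimate_jaccard</code>) are mine, not from the post; the toy data matrix is rebuilt from the artist sets above.</p>

```python
import numpy as np

def naive_minhash(data, k, seed=0):
    # Repeat k times: shuffle the rows, then for every column record the
    # original row ID of the first non-zero entry.
    rng = np.random.default_rng(seed)
    n_rows, n_cols = data.shape
    sig = np.empty((k, n_cols), dtype=int)
    for t in range(k):
        perm = rng.permutation(n_rows)
        shuffled = data[perm]
        # argmax on a boolean array gives the position of the first True;
        # perm maps that position back to the original row (user) ID.
        first_nonzero = (shuffled != 0).argmax(axis=0)
        sig[t] = perm[first_nonzero]
    return sig

def estimate_jaccard(sig, i, j):
    # Fraction of iterations in which the two columns share a min hash.
    return float(np.mean(sig[:, i] == sig[:, j]))

# Toy data matrix: rows are user IDs 0..7, columns are the three artists.
artists = [{1, 4, 7}, {0, 1, 2, 4, 5, 7}, {0, 2, 3, 5, 6}]
data = np.zeros((8, 3), dtype=int)
for col, users in enumerate(artists):
    for user in users:
        data[user, col] = 1

sig = naive_minhash(data, k=2000)
print(estimate_jaccard(sig, 0, 1))  # ≈ 0.5
print(estimate_jaccard(sig, 0, 2))  # 0.0, the sets are disjoint
```

Because \(\text{artist}_1\) and \(\text{artist}_3\) share no users, their min hashes can never collide, so that estimate is exactly zero no matter how many iterations we run.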
<p><strong>Why would this work?</strong></p>
<p>To understand why this is a reasonable thing to do, we need to recall our previous discussion on approximating the Jaccard similarity. We were drawing elements from the union of two sets at random and checking if that element appeared in the intersection. What we’re doing here might look different, but it’s actually the same thing.</p>
<ul>
<li>We are shuffling the rows (thus bringing in randomness)</li>
<li>By picking the first non-zero element for every artist, we’re always picking an element from the union (for any pair of artists).</li>
<li>By checking if \(h(\text{artist}_{i}) = h(\text{artist}_{j})\) we are checking if the element is in the intersection</li>
<li>And most importantly, the probability of a randomly drawn element being in the intersection is exactly the Jaccard similarity, that is, \(P(h(\text{artist}_{i}) = h(\text{artist}_{j})) = J(\text{artist}_{i}, \text{artist}_{j})\)</li>
</ul>
<p>Let’s go through an example together with sets \(\text{artist}_{1}\) and \(\text{artist}_{2}\). I’ve highlighted the relevant rows using the same definitions for the symbols “+” and “-“ as before. We also have a new symbol, “null”, which marks elements that belong to neither of the selected artists’ sets. The “null” rows can be ignored since they do not contribute to the similarity (the algorithm skips over them).</p>
<p><img class="center" src="/images/artist_matrix_highlighted.png" width="50%" /></p>
<p>If we shuffled the rows what is the probability that the first <strong>non</strong>-“null” row is of type “+”? In other words, after shuffling the rows, if we proceeded from top to bottom while skipping over all “null” rows, what is the probability of seeing a “+” before seeing a “-“?</p>
<p>If we think back to our example with sets, this question should be easy to answer. All we have to realize is that encountering a “+” before a “-“ is exactly the same as randomly drawing a “+” from the union, which we know happens with probability equal to the Jaccard similarity.</p>
\[P(\text{seeing a "+" before "-"}) = \frac{\text{number of "+"}}{\text{number of "+" and "-"}} = J(\text{artist}_{1}, \text{artist}_{2})\]
<p>If the first non-“null” row is of type “+”, that also means that \(h(\text{artist}_{1}) = h(\text{artist}_{2})\), so the above expression is equivalent to saying:</p>
\[P(h(\text{artist}_{1}) = h(\text{artist}_{2})) = J(\text{artist}_{1}, \text{artist}_{2})\]
<p>The same argument holds for any pair of artists. The most important takeaway here is that if the Jaccard similarity between a pair of sets is high, then the probability that \(h(\text{artist}_{i}) = h(\text{artist}_{j})\) is also high. Remember throwing darts at the diagrams? It’s the same intuition here.</p>
<p>Now, going back to our example: with a single trial we get the following estimates.</p>
<!-- So $$Y$$ is a Bernoulli random variable with parameter $$p = J(\text{artist}_{i}, \text{artist}_{j})$$. How can we estimate $$p$$, the same way we did before, by simulating multiple trials and taking the average. -->
<!-- Remember, throwing darts at the diagrams? That's exactly what we're doing here. You can think of this process as throwing a dart on the diagram and then checking if it landed in an intersection. -->
<!-- Going back to our example, we have pairs ($$\text{artist}_1$$, $$\text{artist}_2$$) and ($$\text{artist}_1$$, $$\text{artist}_3$$) having similarity zero since their signatures do not match. The similarity for ($$\text{artist}_2$$, $$\text{artist}_3$$) will be 1 since both have the same signature. -->
<p><img class="center" src="/images/sig1.png" width="50%" /></p>
<p>As you can see, it’s a <em>little</em> off. How can we make it better? Simple: we run more iterations and make the signatures longer. Earlier we introduced a Bernoulli random variable and estimated its parameter by simulating multiple random trials. We can do the exact same thing here. Let \(Y\) be a random variable that has value 1 if \(h(\text{artist}_{i}) = h(\text{artist}_{j})\) and 0 otherwise. \(Y\) is a Bernoulli random variable with \(p = J(\text{artist}_{i}, \text{artist}_{j})\). If we run the algorithm multiple times, thus simulating independent, identically distributed copies \(Y_{1}, Y_{2}, ..., Y_{k}\), we can then estimate the Jaccard similarity as:</p>
<!-- Remember how we approximated the parameter $$p$$ for the random variable $$X$$? It's the exact same thing here. With a signature with length greater than one the estimation for Jaccard similarity is done by taking the average of each element-wise comparison. -->
\[\hat{J}(\text{artist}_{i}, \text{artist}_{j}) = \frac{1}{k}\sum_{m=1}^{k}Y_{m} = \frac{1}{k}\sum_{m=1}^{k} \unicode{x1D7D9}[h_{m}(\text{artist}_{i}) = h_{m}(\text{artist}_{j})]\]
<p>Where \(h_{m}\) is a function that returns the first non-zero index for iteration \(m\).</p>
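<p>Concretely, this estimator just compares two signature columns element-wise and averages the matches. Here’s a small sketch with a made-up signature matrix (the variable names are my own):</p>

```python
import numpy as np

def estimate_jaccard(sig, i, j):
    """Estimate J(artist_i, artist_j) as the fraction of
    signature rows on which the two columns agree."""
    return np.mean(sig[:, i] == sig[:, j])

# Two length-4 signatures that agree on 2 of the 4 rows.
sig = np.array([[0, 0],
                [1, 3],
                [2, 2],
                [5, 4]])
print(estimate_jaccard(sig, 0, 1))  # 0.5
```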
<!-- In the earlier parts of the post, we defined a Bernoulli random variable $$X$$. Do you see similarities to that and what we're doing now? -->
<!-- This is because we're only using a single signature to measure the similarities. This corresponds to only having a single trial in the random experiments we defined previously. As before, the more trials we have, the better the estimation will be.
We've mentioned before that the more random simulations we run the better the approximation will be. In order to have a better approximation, we should run a few more iterations of this process. This would result in a larger signature matrix. -->
<p>The animation below shows the process of going through 3 iterations of this algorithm:</p>
<p><img src="/images/minhashing_permuation_animation.gif" alt="" /></p>
<p>Computing the Jaccard similarities with the larger signature matrix:</p>
<p><img class="center" src="/images/sig3_sims.png" /></p>
<p>That’s much better. It’s still not exactly the same but it’s not too far off. We’ve managed to reduce the number of rows of the matrix from 8 to 3 while preserving the pairwise Jaccard similarities up to some error. To achieve a better accuracy, we could construct an even larger signature matrix, but obviously we would be trading off the size of the representation.</p>
<p>If you want to play around with this algorithm, here’s an implementation in Python using Numpy:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">min_hashing_naive</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">num_iter</span><span class="p">):</span>
    <span class="n">num_artists</span> <span class="o">=</span> <span class="n">data</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">sig</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="n">num_iter</span><span class="p">,</span> <span class="n">num_artists</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">int</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_iter</span><span class="p">):</span>
        <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">shuffle</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_artists</span><span class="p">):</span>
            <span class="n">c</span> <span class="o">=</span> <span class="n">data</span><span class="p">[:,</span> <span class="n">j</span><span class="p">]</span>
            <span class="k">if</span> <span class="n">np</span><span class="p">.</span><span class="nb">any</span><span class="p">(</span><span class="n">c</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">):</span>
                <span class="n">min_hash</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">c</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
                <span class="n">sig</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">min_hash</span>
    <span class="k">return</span> <span class="n">sig</span>
</code></pre></div></div>
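<p>If you want to sanity-check the estimates, the exact Jaccard similarities can be computed directly from the binary matrix. Here’s a small sketch with made-up data:</p>

```python
import numpy as np

def exact_jaccard(data, i, j):
    """Exact Jaccard similarity between artist columns i and j
    of a binary user-by-artist matrix."""
    a, b = data[:, i].astype(bool), data[:, j].astype(bool)
    inter = np.sum(a & b)
    union = np.sum(a | b)
    return inter / union if union > 0 else 0.0

# 4 users (rows) x 3 artists (columns), made up for illustration.
data = np.array([
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
])
print(exact_jaccard(data, 0, 1))  # 2 shared users out of 3 in the union
```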
<h3>MinHash Algorithm</h3>
<p>Shuffling the rows of the data matrix can be infeasible if the matrix is large. In the Spotify example, we would have to shuffle 271 million rows for each iteration of the algorithm. So while the algorithm works conceptually, it is not that useful in practice.</p>
<p>Instead of explicitly shuffling the rows, what we can do is <em>implicitly</em> shuffle them by mapping each row index to some unique integer. There are special functions, called hash functions, that can do exactly that: they map each unique input to some unique output (usually in the same range).</p>
\[h: [n] \rightarrow [n]\]
<!-- For example with 8 rows, the hash function could map them to:
$$[0, 1, 2, 3, 4, 5, 6, 7] \rightarrow [4, 1, 5, 6, 0, 2, 3, 7]$$ -->
<p>Although it’s not necessary for the range of the hash values to be the same as the indices, let’s assume for the sake of this example that it is. Then you can think of these permutations as, <em>where the row would have landed if we actually randomly shuffled the rows</em>. For example if we had some hash function \(h\) and applied it to row index 4:</p>
\[h(4) = 2\]
<p>The way you can interpret this is, the row at position 4 got moved to position 2 after shuffling.</p>
<!-- be generating a random permutation on the indices of the rows. We'll define functions that will take the index of a row as input and will output a random integer such that each row will have a unique integer associated with it. These kinds of functions are called, hash functions.
What we're going to do instead is *implicitly* shuffle the rows by generating a permutation on the indices of the rows. In order to do this, we're going to introduce hash functions. -->
<!-- A hash function $$h$$ will map every index in the row to some unique integer. Although it's not a necessary for the range of the hash values to be the same as the indices, let's assume for the sake of this example that it is. Then you can think of these permutations as, *where the row would have landed if we actually randomly shuffled the rows*. For example with 8 rows, the hash function could map them to:
$$[0, 1, 2, 3, 4, 5, 6, 7] \rightarrow [4, 1, 5, 6, 0, 2, 3, 7]$$ -->
<p>To simulate multiple iterations of implicit shuffling, we’re going to apply multiple distinct hash functions \(h_{1}, h_{2}, ..., h_{k}\) to each row index.</p>
<p><strong>Recipe for generating hash functions</strong></p>
<p>Pick a prime number \(p \ge m\) where \(m\) is the number of rows in the dataset. Then each hash function \(h_{i}\) can be defined as:</p>
\[h_{i}(x) = (a_{i}x + b_{i}) \mod p\]
<p>Where \(a_{i}, b_{i}\) are random integers in the range \([1, p)\) and \([0, p)\), respectively. The input \(x\) to the function is the index of the row. To generate a hash function, all we have to do is pick the parameters \(a\) and \(b\).</p>
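<p>Here’s a quick sketch of this recipe (the function name and the fixed seed are my own choices; picking the prime \(p\) is left to the caller):</p>

```python
import random

def make_hash_functions(k, p, seed=42):
    """Generate k hash functions h_i(x) = (a_i * x + b_i) mod p,
    with a_i drawn from [1, p) and b_i from [0, p)."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, p), rng.randrange(0, p)) for _ in range(k)]
    # Bind a and b as default arguments so each lambda keeps its own pair.
    return [lambda x, a=a, b=b: (a * x + b) % p for a, b in params]

hashes = make_hash_functions(k=3, p=11)
# When p is prime, each h permutes {0, ..., p-1}: no two rows collide.
print(sorted(hashes[0](x) for x in range(11)))  # [0, 1, 2, ..., 10]
```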
<p>For example, let’s define three hash functions: \(h_{1}, h_{2}, h_{3}\)</p>
<center>
$$h_{1}(x) = 7x \mod 11$$
$$h_{2}(x) = (x + 5) \mod 11$$
$$h_{3}(x) = (3x + 1) \mod 11$$
</center>
<p>We’ll be applying these hash functions to the rows of our toy dataset. Since the number of rows \(m = 8\) is not a prime number, we chose \(p = 11\). As I’ve mentioned before, the values of the hash function need not be in the same range as the indices. As long as each index is mapped to a unique value, the range of the values actually makes no difference. If this doesn’t make sense to you, let’s unpack the example we have. In this case, our hash functions will produce values in the range \([0, 10]\). We can imagine expanding our dataset with a bunch of “null” type rows so that we have \(p=11\) rows. We know that the “null” rows don’t change the probability of two artists having the same signature, therefore our estimates should be unaffected.</p>
<!-- The reason for doing this is because we don't want to have collisions, that is we don't want more than one row to map to the same value for a given hash function. -->
<!-- But what this means is that our hash functions will produce values in the range $$[0, 10]$$, which is larger than your set of indices. This will actually end up not making any difference. To see why we can imagine expanding our dataset with a bunch of "null" type rows so that we have $$m=11$$. We know that the "null" rows don't change the probability of two artists having the same signature, therefore having a range bigger than the actual is not going to change anything. -->
<p><img class="center" src="/images/artist_matrix_hash.png" width="50%" /></p>
<p>Since each hash function defines an implicit shuffling order, we can iterate over the rows in that order. As an exercise, iterate over the rows in the order defined by each hash function. For each column (artist), store the index of the first non-zero element. Then, to compute the Jaccard similarities, compare the stored values the same way we did before. <sup id="fnref:min_index_diff" role="doc-noteref"><a href="#fn:min_index_diff" class="footnote" rel="footnote">4</a></sup></p>
<p>The MinHash algorithm is essentially doing the same thing but in a more efficient way by just making a single pass over the rows.</p>
<!-- Now that we have the hash functions, we're finally ready for the MinHash algorithm: -->
<p><strong>MinHash</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Initialize all elements in the signature matrix sig to infinity
for each row r in the dataset:
    compute h_{i}(r) for every hash function h_{1}, h_{2}, ..., h_{k}
    for each non-zero column c in row r:
        for each hash function index i:
            if sig(i, c) > h_{i}(r):
                update sig(i, c) = h_{i}(r)
</code></pre></div></div>
<p>When the algorithm terminates the signature matrix should contain all the minimum hash values for each artist and hash function pair.</p>
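<p>Here’s a sketch of the same single-pass algorithm in Numpy, using the three example hash functions from above on a made-up dataset:</p>

```python
import numpy as np

def minhash(data, hash_funcs):
    """Single-pass MinHash. data is a binary user-by-artist matrix;
    hash_funcs is a list of functions mapping a row index to an integer."""
    num_rows, num_artists = data.shape
    k = len(hash_funcs)
    sig = np.full((k, num_artists), np.inf)
    for r in range(num_rows):
        hashes = [h(r) for h in hash_funcs]   # h_i(r) for every hash function
        for c in np.nonzero(data[r])[0]:      # non-zero columns of row r
            for i in range(k):
                if sig[i, c] > hashes[i]:
                    sig[i, c] = hashes[i]
    # Note: a column with no listeners at all would stay at infinity.
    return sig.astype(int)

# The three example hash functions with p = 11.
h1 = lambda x: (7 * x) % 11
h2 = lambda x: (x + 5) % 11
h3 = lambda x: (3 * x + 1) % 11

# 4 users (rows) x 3 artists (columns), made up for illustration.
data = np.array([
    [1, 0, 0],
    [0, 1, 1],
    [1, 0, 1],
    [0, 1, 0],
])
sig = minhash(data, [h1, h2, h3])
print(sig.tolist())  # [[0, 7, 3], [5, 6, 6], [1, 4, 4]]
```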
<p>The video below is an animation that simulates the algorithm over the toy dataset. Watching it should hopefully clear up any questions you have about how or why the algorithm works.</p>
<div class="video-responsive">
<iframe width="560" height="315" src="https://www.youtube.com/embed/YoVJOlpViog" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
</div>
<p>Now with this algorithm we can reduce all of the artist sets to much smaller signatures. If we use \(k\) hash functions, each artist gets a signature of length \(k\). This means that the time complexity of comparing two artists is now \(O(k)\), which is independent of the size of the original sets.</p>
<h3>Next Steps</h3>
<p>Using the MinHash algorithm, we can reduce the computational complexity of computing similarities between pairs of artists, but there is still one more issue. In order to implement the recommendation feature we still need to compute the similarities between every pair of artists. This is quadratic in running time: if \(n\) is the number of artists, we need to make \({n \choose 2} = \frac{n(n-1)}{2} = O(n^2)\) comparisons. If \(n\) is large, even with parallelization, this will be a horribly slow computation.</p>
<p>We’re in luck because there’s another ingenious method called Locality-sensitive hashing (LSH) that uses the minhash signatures to find candidate pairs. This means that we’ll only have to compute the similarities for the candidates, rather than for every pair. I’ll write about LSH in the next post. Until then, :v:.</p>
<h2>Further reading</h2>
<ul>
<li><a href="https://www.cs.utah.edu/~jeffp/DMBook/L4-Minhash.pdf">Min Hashing</a> - Lecture notes from University of Utah CS 5140 (Data Mining) by Jeff M Phillips. This is where I actually learned about Min Hashing. Answers an important question that I have not addressed in this tutorial: “So how large should we set k so that this gives us an accurate measure?”</li>
<li><a href="http://infolab.stanford.edu/~ullman/mmds/ch3.pdf">Finding Similar Items</a> - Chapter 3 of the book “Mining of Massive Datasets” by Jure Leskovec, Anand Rajaraman and Jeff Ullman. Has some really good exercises that are worth checking out.</li>
</ul>
<hr />
<p>If you have any questions or you see any mistakes, please feel free to use the comment section below.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>I know shuffling a set of elements is meaningless since sets don’t have order but imagine that they do :). <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>In practice this matrix would be very sparse, therefore we wouldn’t store the data in this form, since it would be extremely wasteful. But seeing the data as a matrix will be helpful for conceptualizing the methods that we’re gonna discuss. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>The strange looking one (\(\unicode{x1D7D9}\)) is called the indicator function. It outputs a 1 if the expression inside the brackets evaluates to true, otherwise the output is a 0. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:min_index_diff" role="doc-endnote">
<p>You may notice that the values that get stored are different from what we would store in the naive-min hashing algorithm, will this make any difference? Why? <a href="#fnref:min_index_diff" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<hr />
<h1>Introduction to Neural Networks</h1>
<p><em>Giorgi Kvernadze · 2019-12-19 · https://giorgi.tech/blog/multilayer-perceptron</em></p>
<ul>
<li>
<p><em>This post is best suited for people who are familiar with linear classifiers. I will also be assuming that the reader is familiar with gradient descent.</em></p>
</li>
<li>
<p><em>The goal of this post isn’t to be a comprehensive guide about neural networks, but rather an attempt to show an intuitive path going from linear classifiers to a simple neural network.</em></p>
</li>
</ul>
<p>There are many types of neural networks, each having some advantage over the others. In this post, I want to introduce the simplest form of a neural network, the Multilayer Perceptron (MLP). MLPs are a powerful method for approximating functions and are relatively simple to implement.</p>
<p>Before we delve into MLPs, let’s quickly go over linear classifiers. Given training data as pairs \((\boldsymbol{x}_i, y_i)\) where \(\boldsymbol{x}_i \in \mathbb{R}^{d}\) are datapoints (observations) and \(y_i \in \{0, 1\}\) are their corresponding class labels, the goal is to learn a vector of weights \(\boldsymbol{w} \in \mathbb{R}^{d}\) and a bias \(b \in \mathbb{R}\) such that \(\boldsymbol{w}^T\boldsymbol{x}_{i} + b \ge 0\) if \(y_{i} = 1\) and \(\boldsymbol{w}^T\boldsymbol{x}_{i} + b < 0\) otherwise (\(y_{i} = 0\)). This decision can be summarized as the following step function:</p>
\[\text{Prediction} = \begin{cases}
1 & \boldsymbol{w}^T\boldsymbol{x} + b \ge 0 \\
0 & \text{Otherwise}\\
\end{cases}\]
<p>In the case of Logistic Regression the decision function is characterized by the sigmoid function \(\sigma(z) = \frac{1}{1+e^{-z}}\) where \(z = \boldsymbol{w}^T\boldsymbol{x} + b\)</p>
\[\text{Prediction} = \begin{cases}
1 & \sigma(z) \ge \theta \\
0 & \text{Otherwise}\\
\end{cases}\]
<p>Where \(\theta\) is a threshold that is usually set to be 0.5.</p>
<!-- more -->
<!-- *Note: These are actually just a couple of examples of a zoo of functions that people in deep learning literature refer to as activation functions.* -->
<p>If the dataset is linearly separable, this is all fine since we can always learn \(\boldsymbol{w}\) and \(b\) that separate the data perfectly. We’re in good shape even if the dataset isn’t perfectly linearly separable, i.e. the data points can be separated with a line barring a few noisy observations.</p>
<p><img class="center" src="/images/blobs.png" /></p>
<p>But what can we do if the dataset is highly non-linear? For example, something like this:</p>
<p><img class="center" src="/images/circles.png" /></p>
<p>One thing we could potentially do is to come up with some non-linear transformation function \(\phi(\boldsymbol{x})\), such that applying it renders the data linearly separable. Having this transformation function would allow us to use all the tools we have for linear classification.</p>
<p>For example, in this case, we can see that the data points come from two concentric circles. Using this information we define the following transformation function: \(\phi(\boldsymbol{x}) = [x_1^2, x_2^2]\)</p>
<p>Now we can learn a vector \(\boldsymbol{w}\) and bias \(b\) such that \(\boldsymbol{w}^T\phi(\boldsymbol{x}_{i}) + b \ge 0\) if \(y_{i} = 1\) and \(\boldsymbol{w}^T\phi(\boldsymbol{x}_{i}) + b < 0\) otherwise.</p>
<p><img class="center" src="/images/circles_transformed_clf.png" /></p>
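<p>Here’s a small sketch of this idea on synthetic concentric-circle data (the sampling code and the radius threshold below are my own choices for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def circle(radius, n):
    """Sample n points near a circle of the given radius, with a little noise."""
    angles = rng.uniform(0, 2 * np.pi, n)
    r = radius + rng.normal(0, 0.05, n)
    return np.stack([r * np.cos(angles), r * np.sin(angles)], axis=1)

inner, outer = circle(1.0, 100), circle(2.0, 100)

def phi(X):
    return X ** 2  # phi(x) = [x1^2, x2^2]

# In the transformed space, x1^2 + x2^2 = r^2, so the linear rule
# w = [1, 1], b = -1.5^2 separates the two circles by radius.
w, b = np.array([1.0, 1.0]), -2.25
pred_inner = phi(inner) @ w + b  # should be negative for the inner circle
pred_outer = phi(outer) @ w + b  # should be positive for the outer circle
print((pred_inner < 0).all() and (pred_outer >= 0).all())
```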
<p>This works for this particular case since we know exactly what the data generation process is, but what can we do when the underlying function is not obvious? What if we’re working in high dimensions where we can’t visualize the shape of the dataset? In general, it’s hard to come up with these transformation functions.</p>
<p>Here’s another idea, instead of learning one linear classifier, let’s try to learn three linear classifiers and then combine them to get something like this:</p>
<p><img class="center" src="/images/circles3.png" /></p>
<p>We know how to learn a single linear classifier but how can we learn three linear classifiers that can produce a result like this? The naive approach would be to try to learn them independently using different random initializations and hope that they converge to something like what we want. However, this approach is doomed from the beginning since each classifier will try to fit the whole data while ignoring what the other classifiers are doing. In other words there will be no cooperation since none of the classifiers will be “aware” of each other. This is the opposite of what we want. We want/need the classifiers to work together.</p>
<p>This is where MLPs come in. A simple MLP can actually do both of the aforementioned things. It can learn a non-linear transformation that makes the dataset linearly separable and it can learn multiple linear classifiers that cooperate.</p>
<p>The goal for the next section is to come up with a classifier that can potentially learn how to correctly classify the concentric circles dataset.</p>
<!-- **Neural Networks:**
By far the most common way of introducing neural networks is with the notion of computational graphs. While I do think that computational graphs are an important concept to understand, I do not think that they are the best way to be introduced to neural networks. Instead I will be using concepts that hopefully you the reader are familiar with. These are the essential operations for neural networks: matrix multiplication, non-linear activation functions and function composition. -->
<!-- In general it's better to teach new ideas using concepts and terms that a person is already familiar with, since this allows for the already known things to function as a foundation to be built on, rather than trying build from scratch.
The term 'neural networks' itself is kind of misleading. It creates an image of a brain like structure and it feeds into the whole hype about AI taking over. Neural networks in reality are nothing but a chain of matrix multiplications followed by non-linear functions. -->
<h2>Design</h2>
<h4>Three Linear Classifiers</h4>
<p>Let’s continue our idea of learning multiple linear classifiers. Define three classifiers \((\boldsymbol{w}_{1}, b_1), (\boldsymbol{w}_{2}, b_2)\) and \((\boldsymbol{w}_{3}, b_3)\), where \(\boldsymbol{w}_i \in \mathbb{R}^2\) and \(b_i \in \mathbb{R}\).</p>
<p>Because we want to learn all three jointly, it makes sense to combine them into a single object. Let’s stack all of the classifiers into a single matrix \(\boldsymbol{W}^{3 \times 2}\) and the biases into a vector \(\boldsymbol{b}^{3 \times 1}\), as such:</p>
\[\boldsymbol{W} = \begin{bmatrix}
\boldsymbol{w}^T_{1} \\
\boldsymbol{w}^T_{2} \\
\boldsymbol{w}^T_{3} \\
\end{bmatrix} = \begin{bmatrix}
w_1^{(1)} & w_1^{(2)}\\
w_2^{(1)} & w_2^{(2)}\\
w_3^{(1)} & w_3^{(2)}\\
\end{bmatrix} \boldsymbol{b} = \begin{bmatrix}
b_1 \\
b_2 \\
b_3 \\
\end{bmatrix}\]
<p>Now we need to get a classification decision from each one of the classifiers. We mentioned two types of decision functions in the beginning of the post: the step function and the sigmoid, which is basically a smooth step function. For technical reasons that will become clear in the next section, we’re gonna use the sigmoid function to produce decisions. For each pair \((\boldsymbol{w}_{i}, b_i)\), to get the prediction for a given data point we take \(\sigma(\boldsymbol{w}_{i}^T\boldsymbol{x} + b_i)\). But this is not taking advantage of having everything packed into a matrix. Instead of enumerating the classifiers one by one, we can do everything in one operation.</p>
\[\sigma(\boldsymbol{Wx} + \boldsymbol{b}) = \begin{bmatrix}
\sigma(\boldsymbol{w}_{1}^T\boldsymbol{x} + b_1) \\
\sigma(\boldsymbol{w}_{2}^T\boldsymbol{x} + b_2) \\
\sigma(\boldsymbol{w}_{3}^T\boldsymbol{x} + b_3) \\
\end{bmatrix}\]
<p><em>Note: The \(\sigma\) function for vector valued functions is an element-wise operation.</em></p>
<p>This is great but so far we haven’t really solved anything. We just came up with a neat way to compute the output of all three classifiers given some input. We still need to connect them in order to create “cooperation”.</p>
<h4>The Meta Classifier</h4>
<p>Let’s define another linear classifier but this time instead of taking the data points as input, this classifier will take the outputs of the three classifiers as input and will output a final classification decision. In a way it’s a meta classifier since it classifies using outputs of other classifiers.</p>
<p>Let \(\boldsymbol{h}^{3 \times 1}\) be the output of the previous classifiers, i.e \(\boldsymbol{h} = \sigma(\boldsymbol{Wx} + \boldsymbol{b})\), then the prediction of the meta classifier \((\boldsymbol{w}_{m}, b_{m})\) is defined as: \(\sigma(\boldsymbol{w}_{m}^T\boldsymbol{h} + b_{m})\), where \(\boldsymbol{w}_{m} \in \mathbb{R}^3\) and \(b_{m} \in \mathbb{R}\).</p>
<p>And there it is, we finally have all the components. All three classifiers are connected, we have a way to produce a single prediction using all three of them and there is hope that coordination will happen because of the meta classifier.</p>
<p>Just to recap, the expression below is the function that corresponds to our MLP:</p>
\[\text{MLP}(\boldsymbol{x}; \boldsymbol{w}_{m}, b_{m}, \boldsymbol{W}, \boldsymbol{b}) =\sigma(\boldsymbol{w}_{m}^T\sigma(\boldsymbol{Wx} + \boldsymbol{b}) + b_{m})\]
<p>Everything before the semicolon is the input of the function and everything after are the parameters of the function. Our goal is to learn the parameters.</p>
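<p>As a sketch, this forward pass is just a few lines of Numpy (the weights below are random placeholders, not learned parameters):</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W, b, w_m, b_m):
    """MLP(x) = sigma(w_m^T sigma(Wx + b) + b_m)."""
    h = sigmoid(W @ x + b)          # outputs of the three linear classifiers
    return sigmoid(w_m @ h + b_m)   # the meta classifier's decision

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))   # three classifiers over 2-d inputs
b = rng.normal(size=3)
w_m = rng.normal(size=3)      # the meta classifier
b_m = rng.normal()

x = np.array([0.5, -1.0])
y_hat = mlp_forward(x, W, b, w_m, b_m)
print(y_hat)  # a single value strictly between 0 and 1
```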
<p><strong>Exercise:</strong>
A question that you might have at this point is “why do we need to have a decision function applied to the three linear classifiers, can’t we directly plug the outputs into the meta classifier and produce a decision?”. I’m gonna leave the answer to that as an exercise. Remove all the \(\sigma\) functions and simplify the expression. What do you get? Is it any different from having a single linear classifier?</p>
<h2>Learn</h2>
<p>We have managed to define a simple MLP but we still need a way to learn the parameters of the function. The function is fully differentiable and this is no accident. As I said earlier, we chose to use the sigmoid function instead of the step-function as a decision function because of technical reasons. Well the technical reason is this: differentiability is nice and we like it because it allows us to use gradient based optimization algorithms like gradient descent.</p>
<h4>Loss Function</h4>
<p>Since the function is differentiable, we can define a loss function and then start optimizing with respect to the learnable parameters using gradient descent. Notice that the output of the MLP is a real number between 0 and 1. What we’re essentially doing is modeling the conditional distribution
\(P(y | \boldsymbol{x})\) with a parametrized function \(MLP(\boldsymbol{x}; \theta)\) <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. This means that we can use the principle of maximum likelihood to estimate the parameters.</p>
\[L(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^n -y_i\log\hat{y_i} - (1-y_i)\log(1-\hat{y_i})\]
<p>Where \(\hat{y} = MLP(\boldsymbol{x}; \theta)\). The objective is to minimize \(L(y, \hat{y})\) <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> with respect to the learnable parameters \(\theta\).</p>
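<p>The loss can be sketched like this (the clipping constant is my own addition to avoid taking the log of zero):</p>

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Mean negative log-likelihood for binary labels y and predictions y_hat."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # keep log() finite
    return np.mean(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))

# Made-up labels and predictions for illustration.
y = np.array([1, 0, 1, 0])
y_hat = np.array([0.9, 0.1, 0.8, 0.2])
print(binary_cross_entropy(y, y_hat))
```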
<h4>Optimization</h4>
<p>The plan is to use gradient descent to optimize \(L\). Remember that during gradient descent, we need to take the gradient of the objective at every step of the algorithm (hence the name).</p>
\[\theta \leftarrow \theta - \alpha \nabla_{\theta} L\]
<p>Where \(\alpha\) is the step size (learning rate).</p>
<p>Since \(L\) is a composition function, we will need to use the chain rule (from calculus). Furthermore, \(\theta\) isn’t a single variable, we will be optimizing with respect to 4 different variables \(\boldsymbol{w}_{m}, b_{m}, \boldsymbol{W}, \boldsymbol{b}\). We’re going to need to update each one at every step:</p>
<p>\(\boldsymbol{w}_{m} \leftarrow \boldsymbol{w}_{m} - \alpha \frac{\partial L}{\partial \boldsymbol{w}_{m}}\) <br />
\(b_{m} \leftarrow b_{m} - \alpha \frac{\partial L}{\partial b_{m}}\) <br />
\(\boldsymbol{W} \leftarrow \boldsymbol{W} - \alpha \frac{\partial L}{\partial \boldsymbol{W}}\) <br />
\(\boldsymbol{b} \leftarrow \boldsymbol{b} - \alpha \frac{\partial L}{\partial \boldsymbol{b}}\)</p>
<h4>Derivatives, Derivatives, Derivatives</h4>
<p><em>Skip this section if you don’t care about all of the gory details of computing the partials. Although I do think that it’s a good idea to do this at least once by hand.</em></p>
<p>Now we will need to break down each of the partial derivatives using the chain rule. If we don’t give names to the intermediate values, things will quickly get hairy, so let’s do that first.</p>
<p>\(\boldsymbol{s}_1 = \boldsymbol{Wx} + \boldsymbol{b}\) <br />
\(\boldsymbol{h} = \sigma(\boldsymbol{s}_1)\) <br />
\(s_2 = \boldsymbol{w}^T_{m}\boldsymbol{h} + b_{m}\) <br />
\(\hat{y} = \sigma(s_2)\)</p>
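<p>These four intermediate values are exactly what a forward pass computes. A minimal sketch in NumPy, with randomly initialized parameters and a made-up 2-dimensional input:</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 2)), rng.normal(size=3)  # hidden layer parameters
w_m, b_m = rng.normal(size=3), rng.normal()         # meta-classifier parameters
x = np.array([0.5, -1.0])

s1 = W @ x + b       # pre-activation of the hidden layer
h = sigmoid(s1)      # hidden representation
s2 = w_m @ h + b_m   # pre-activation of the output, a scalar
y_hat = sigmoid(s2)  # prediction in (0, 1)
```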
<p>Before we start the tedious process of taking partial derivatives of a composed function, I want to remind you that the goal is to compute these four partial derivatives: \(\frac{\partial L}{\partial \boldsymbol{w}_{m}}, \frac{\partial L}{\partial b_{m}}, \frac{\partial L}{\partial \boldsymbol{W}}, \frac{\partial L}{\partial \boldsymbol{b}}\). If we have these values, we can use them to update the parameters at each step of gradient descent. Using the chain rule we can write down each of the partial derivatives as a product:</p>
<p>\(\frac{\partial L}{\partial \boldsymbol{w}_{m}} = \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial s_2}\frac{\partial s_2}{\partial \boldsymbol{w}_{m}}\) <br />
\(\frac{\partial L}{\partial b_{m}} = \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial s_2}\frac{\partial s_2}{\partial b_{m}}\) <br />
\(\frac{\partial L}{\partial \boldsymbol{W}} = \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial s_2}\frac{\partial s_2}{\partial \boldsymbol{h}}\frac{\partial \boldsymbol{h}}{\partial \boldsymbol{s}_1}\frac{\partial \boldsymbol{s}_1}{\partial \boldsymbol{W}}\) <br />
\(\frac{\partial L}{\partial \boldsymbol{b}} = \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial s_2}\frac{\partial s_2}{\partial \boldsymbol{h}}\frac{\partial \boldsymbol{h}}{\partial \boldsymbol{s}_1}\frac{\partial \boldsymbol{s}_1}{\partial \boldsymbol{b}}\)</p>
<p>I know this looks complex, but it really isn’t that complicated. All we’re doing is taking a partial derivative of the loss with respect to each of the learnable parameters. Since the loss is a composition of functions, we have to use the chain rule. That’s it.</p>
<p>We can see that \(\frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial s_2}\) is shared among all of them and that \(L, \hat{y}, s_2\) are all scalar variables therefore the derivatives are relatively easy to compute.</p>
<p>\(\frac{\partial L}{\partial \hat{y}} = \frac{\hat{y}-y}{\hat{y}(1-\hat{y})}\) <br />
\(\frac{\partial \hat{y}}{\partial s_2} = \hat{y}(1-\hat{y})\) (Recall that \(\sigma^{'}(z) = (1-\sigma(z))\sigma(z)\))</p>
<p>Hence \(\frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial s_2} = \hat{y}-y\).</p>
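<p>This cancellation is easy to confirm numerically on arbitrary made-up values:</p>

```python
y, y_hat = 1.0, 0.7  # arbitrary label and prediction

dL_dyhat = (y_hat - y) / (y_hat * (1 - y_hat))
dyhat_ds2 = y_hat * (1 - y_hat)

# the denominator cancels, leaving y_hat - y
product = dL_dyhat * dyhat_ds2
```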
<p>Continuing down the chain we get:</p>
<p>\(\frac{\partial s_2}{\partial \boldsymbol{w}_{m}} = \boldsymbol{h}\) <br />
\(\frac{\partial s_2}{\partial b_{m}} = 1\) <br />
\(\frac{\partial s_2}{\partial \boldsymbol{h}} = \boldsymbol{w}_{m}\)</p>
<p>Now, since \(\boldsymbol{h}\) and \(\boldsymbol{s_1}\) are both vectors, the partial \(\frac{\partial \boldsymbol{h}}{\partial \boldsymbol{s_1}}\) will be a matrix; however, it will be diagonal.</p>
\[\frac{\partial \boldsymbol{h}}{\partial \boldsymbol{s_1}} = \text{diag}((\boldsymbol{1} - \boldsymbol{h}) \odot \boldsymbol{h})\]
<p>In the chain, multiplying by this diagonal matrix can be replaced by an element-wise multiplication: \(\odot (\boldsymbol{1} - \boldsymbol{h}) \odot \boldsymbol{h}\)</p>
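<p>The equivalence between multiplying by this diagonal Jacobian and the element-wise shortcut is worth checking on made-up numbers:</p>

```python
import numpy as np

h = np.array([0.2, 0.5, 0.9])   # some hidden activations
v = np.array([1.0, -2.0, 3.0])  # some upstream vector in the chain

via_diag = v @ np.diag((1 - h) * h)  # full diagonal Jacobian
via_elem = v * (1 - h) * h           # element-wise shortcut
```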
<p>The partial derivative \(\frac{\partial \boldsymbol{s_1}}{\partial \boldsymbol{W}}\) is the most complicated to compute. \(\boldsymbol{s_1}\) is a vector and \(\boldsymbol{W}\) is a matrix, so the result of the partial derivative will be a 3-dimensional tensor! Fortunately, we will be able to reduce it to something simpler.</p>
<p>Instead of computing the partial derivative with respect to entire weight matrix, let’s instead take derivatives with respect to each of the classifiers \(\boldsymbol{w_1}, \boldsymbol{w_2},\) and \(\boldsymbol{w_3}\) (these correspond to the rows of \(\boldsymbol{W}\)). Each of these derivatives will be a matrix instead of a tensor.</p>
<p>\(\frac{\partial \boldsymbol{s_1}}{\partial \boldsymbol{w_1}} = \begin{bmatrix}
x_1 && x_2\\
0 && 0 \\
0 && 0 \\
\end{bmatrix}\) <br />
\(\frac{\partial \boldsymbol{s_1}}{\partial \boldsymbol{w_2}} = \begin{bmatrix}
0 && 0\\
x_1 && x_2 \\
0 && 0 \\
\end{bmatrix}\) <br />
\(\frac{\partial \boldsymbol{s_1}}{\partial \boldsymbol{w_3}} = \begin{bmatrix}
0 && 0\\
0 && 0 \\
x_1 && x_2 \\
\end{bmatrix}\)</p>
<p>We know that we’re going to be using these values in a multiplication, and we can use this fact to simplify the expression for the derivative. Let \(\frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial s_2}\frac{\partial s_2}{\partial \boldsymbol{h}}\frac{\partial \boldsymbol{h}}{\partial \boldsymbol{s}_1} = \boldsymbol{\delta}\); then we’ll have</p>
<p>\(\boldsymbol{\delta} \frac{\partial \boldsymbol{s_1}}{\partial \boldsymbol{w_1}} = [\delta_1x_1, \delta_1x_2]\) <br />
\(\boldsymbol{\delta} \frac{\partial \boldsymbol{s_1}}{\partial \boldsymbol{w_2}} = [\delta_2x_1, \delta_2x_2]\) <br />
\(\boldsymbol{\delta} \frac{\partial \boldsymbol{s_1}}{\partial \boldsymbol{w_3}} = [\delta_3x_1, \delta_3x_2]\)</p>
<p>Which implies that \(\boldsymbol{\delta}\frac{\partial \boldsymbol{s_1}}{\partial \boldsymbol{W}} = \begin{bmatrix}
\delta_1x_1 && \delta_1x_2\\
\delta_2x_1 && \delta_2x_2 \\
\delta_3x_1 && \delta_3x_2 \\
\end{bmatrix}\)</p>
<p>We can rewrite this compactly as an <em>outer product</em> between \(\boldsymbol{\delta}\) and \(\boldsymbol{x}\).</p>
\[\frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial s_2}\frac{\partial s_2}{\partial \boldsymbol{h}}\frac{\partial \boldsymbol{h}}{\partial \boldsymbol{s}_1}\frac{\partial \boldsymbol{s_1}}{\partial \boldsymbol{W}} = \boldsymbol{\delta} \otimes \boldsymbol{x}\]
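<p>In NumPy this outer product is a one-liner; each entry of the resulting matrix is \(\delta_i x_j\), matching the matrix written out above (the numbers are made up):</p>

```python
import numpy as np

delta = np.array([0.1, -0.2, 0.3])
x = np.array([2.0, 5.0])

grad_W = np.outer(delta, x)  # shape (3, 2), entry (i, j) is delta[i] * x[j]
```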
<p>And finally,</p>
\[\frac{\partial \boldsymbol{s_1}}{\partial \boldsymbol{b}} = \text{diag}(\boldsymbol{1}) = \boldsymbol{I}\]
<p>Putting everything together:</p>
<p>\(\frac{\partial L}{\partial \boldsymbol{w}_{m}} = (\hat{y} - y)\boldsymbol{h}\) <br />
\(\frac{\partial L}{\partial b_{m}} = \hat{y} - y\) <br />
\(\frac{\partial L}{\partial \boldsymbol{W}} = ((\hat{y} - y)\boldsymbol{w}_{m}\odot (\boldsymbol{1} - \boldsymbol{h}) \odot \boldsymbol{h}) \otimes \boldsymbol{x}\) <br />
\(\frac{\partial L}{\partial \boldsymbol{b}} = ((\hat{y} - y)\boldsymbol{w}_{m}\odot (\boldsymbol{1} - \boldsymbol{h}) \odot \boldsymbol{h})^T\)</p>
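<p>These four expressions translate almost line for line into NumPy. A sketch for a single datapoint, with parameter shapes matching the 2-input, 3-hidden-unit network from earlier:</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients(x, y, W, b, w_m, b_m):
    # forward pass, reusing the intermediate names from the derivation
    h = sigmoid(W @ x + b)
    y_hat = sigmoid(w_m @ h + b_m)

    # backward pass: the four partials in the order derived above
    err = y_hat - y                  # dL/dy_hat * dy_hat/ds2
    delta = err * w_m * (1 - h) * h  # chain shared by W and b
    return err * h, err, np.outer(delta, x), delta
```

<p>A quick finite-difference check on one of the parameters is a good way to catch mistakes in a hand-derived backward pass.</p>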
<p>You may have noticed that all of this is for a single datapoint \(\boldsymbol{x}\); we wouldn’t do this in practice. It is much preferable to compute everything for a batch (or mini-batch) of inputs \(\boldsymbol{X}\), which allows us to update the parameters much more efficiently. I highly recommend you redo all of the computations of the partial derivatives in matrix form.</p>
<p>I’ve purposefully skipped over a lot of the details. I want this block of the post to serve as a reference for your own solutions rather than a complete step-by-step guide. Here are some useful notes that can come in handy if you want to do everything from scratch:</p>
<ul>
<li><a href="http://cs231n.stanford.edu/vecDerivs.pdf">Vector, Matrix, and Tensor Derivatives - Erik Learned-Miller</a></li>
<li><a href="https://web.stanford.edu/class/cs224n/readings/gradient-notes.pdf">Computing Neural Network Gradients - Kevin Clark</a></li>
</ul>
<h2>Results</h2>
<p>Phew! Now that that’s over with, let’s see the results after running gradient descent (1000 iterations with a learning rate of 0.01). Do you remember how we started? We said that if only we had a transformation that could make the dataset linearly separable, then learning would be easy. Well, \(\phi(\boldsymbol{x}) = \sigma(\boldsymbol{Wx} + \boldsymbol{b})\) turns out to be exactly that transformation. This is what the data looks like after applying the learned function:</p>
<p><img class="center" src="/images/projection.png" /></p>
<p>As you can see, the data is now completely linearly separable. In essence, this is what most of learning is when it comes to neural networks. Every neural network whose primary task is classification is trying to learn some kind of transformation of the data so that the data becomes linearly separable. This is a big reason why neural networks became so popular. In the past, people (usually domain experts) spent tremendous effort engineering features to make learning easy. Now a lot of that is handled by (deep) neural networks <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>.</p>
<p>We were also trying to learn multiple linear classifiers. And voilà, these are the three linear classifiers \((\boldsymbol{w}_{1}, b_1), (\boldsymbol{w}_{2}, b_2)\) and \((\boldsymbol{w}_{3}, b_3)\) that are learned:</p>
<p><img class="center" src="/images/hidden_classifiers.png" /></p>
<p>Finally this is what the learned decision boundary looks like in the original space. The colors indicate the predictions of the classifier.</p>
<p><img class="center" src="/images/decision_boundary.png" /></p>
<p>This is awesome, isn’t it? But wait, hold on. While this classifier gets 100% accuracy, it does not represent the true function… with three classifiers, the shape we are learning is a triangle-ish shape. That’s because it’s the only possible shape that captures all the data with three lines. But we know that the actual function is a circle. With four classifiers we can get rectangle-ish shapes, with five a pentagon-ish, and so on. Intuitively, if we add more classifiers, we should get closer to an actual circle. Here’s a progression of the decision boundary going from 5 to 50 classifiers in increments of 5:</p>
<p><img class="center" src="/images/decision_boundaries_progress.png" /></p>
<p>This looks much better. Yet this isn’t really the true function either. Everything in the middle is classified as red, but there will never be any points there. The true function generates points on the boundary of the circle, never inside the circle. Furthermore, the only reason we were able to make this correction was because we’re working in 2 dimensions and we know exactly what the true function is. What do we do if we have a dataset in high dimensions coming from an unknown function? Would we be able to trust the learned classifier even if we get 100% accuracy?</p>
<h2>Jargon</h2>
<p>For the entirety of the post, I have purposefully avoided mentioning neural network lingo that you usually see in the literature. I think some of the terms themselves can bring a lot of confusion to people when they first get introduced to neural networks. However, since the field is set on using these terms, it’s necessary to know them. Let’s go back and put names on some of the things we’ve talked about.</p>
<h3>Activation Functions</h3>
<p>We talked about decision functions. We mentioned the step function and the sigmoid function. The justification for having them was straightforward, since we were talking in the context of classifiers and a classifier has to have a function that produces a prediction. In the context of neural networks, we don’t care about predictions unless it’s the last classifier (the meta-classifier). Every intermediate function can have any form, as long as it’s differentiable.</p>
<p>Because of this, we aren’t constrained to using functions that produce predictions like the sigmoid or the step function. Here are a few others we could have used: Tanh, ReLU, Leaky ReLU, SoftPlus, etc. People refer to these functions as activation functions. The most popular choice in practice is the ReLU activation, defined as \(\text{ReLU}(z)=\max(0, z)\). Activation functions are almost always non-linear; this non-linearity is the reason neural networks are able to learn non-linear functions. When the input is a vector or a matrix, the activation function is applied element-wise.</p>
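<p>A couple of these activations sketched in NumPy, applied element-wise (the 0.01 slope for Leaky ReLU is a common default, not a requirement):</p>

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

z = np.array([-2.0, 0.0, 3.0])
# relu zeroes out negatives; leaky_relu keeps a small negative slope
```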
<h3>Unit (Neuron)</h3>
<p>As we mentioned above, we don’t really need to predict in the intermediate operations, so we probably shouldn’t be calling these functions classifiers. People usually call them neurons or units. I prefer units, since calling them neurons draws a parallel to biological neurons, which they don’t actually resemble. A unit takes the following form:</p>
\[g(\boldsymbol{w}^T\boldsymbol{x} + b) = y\]
<p>Where \(g\) is some (usually non-linear) activation function.</p>
<h3>Layer</h3>
<p>A layer in the context of an MLP is a linear transformation followed by an activation function. A bunch of units together on the same level make a layer. What a level means will become clearer when we see the graphical representation of neural networks.</p>
<p>In this post, we defined a 2 layer MLP.</p>
<ul>
<li>Layer 1: Linear transformation \(\boldsymbol{Wx} + \boldsymbol{b} = \boldsymbol{s}_1 \rightarrow\) activation \(\rightarrow \sigma(\boldsymbol{s}_1) = \boldsymbol{h}\)</li>
<li>Layer 2: Linear transformation \(\boldsymbol{w}_{m}^T\boldsymbol{h} + b_{m} = s_2 \rightarrow\) activation \(\rightarrow \sigma(s_2) = \hat{y}\)</li>
</ul>
<p>People refer to the layers before the last layer as hidden layers. In this case, we only had one hidden layer (Layer 1).</p>
<p><strong>More layers:</strong>
In practice, we usually have many such layers connected to each other, i.e. the output of one becomes the input to the next. Chaining layers like this is the same as function composition. If we define each layer as a function \(f_i(x) = g(\boldsymbol{W}_i\boldsymbol{x} + \boldsymbol{b}_i)\) where \(g\) is some activation function, then an n-layer MLP can be written as the function composition \(MLP(x) = f_n(f_{n-1}(\dots(f_1(x))))\). The depth of a network corresponds to \(n\); a network with depth \(n > 2\) is called deep (this is where the term deep learning comes from). The width of a network corresponds to the number of units in each layer.</p>
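<p>This composition view maps directly onto code: each layer is a function, and the network just applies them in order. A sketch with made-up layer sizes \(2 \rightarrow 4 \rightarrow 3 \rightarrow 1\) and sigmoid activations throughout:</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_layer(W, b):
    return lambda x: sigmoid(W @ x + b)  # f_i(x) = g(W_i x + b_i)

rng = np.random.default_rng(0)
sizes = [2, 4, 3, 1]  # input, two hidden widths, output
layers = [make_layer(rng.normal(size=(m, n)), rng.normal(size=m))
          for n, m in zip(sizes, sizes[1:])]

def mlp(x):
    for f in layers:  # f_n(f_{n-1}(... f_1(x)))
        x = f(x)
    return x

out = mlp(np.array([0.5, -1.0]))  # a single value in (0, 1)
```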
<h3>Graph</h3>
<p>You may have been confused about the fact that MLP is called a neural network. So far we haven’t seen the “network” part. The MLP that we defined can equivalently be represented by a directed acyclic graph (DAG).</p>
<p><img class="center" src="/images/nn.png" /></p>
<p>These kinds of graphs are called computational graphs, and they are just another way to describe a neural network model. They provide a good way to break down a complex computation into its primitive parts.</p>
<p>All of the edges correspond to the weights (parameters) of the model. The nodes represent computation. For example, \(h_1\) represents the following computation:</p>
\[h_{1} = \sigma(\boldsymbol{w}_{1}^T\boldsymbol{x} + b_{1})\]
<p>Edges coming out of nodes labeled with a 1 are the biases.</p>
<p>To make sense of the rest of the edges, let’s highlight a path of a single unit \((\boldsymbol{w_1}, b_{1})\) to the output:</p>
<p><img class="center" src="/images/nn_single.png" /></p>
<p>This representation is useful for computing gradients. If we wanted to take the derivative of the loss with respect to the first unit, the highlighted path tells us that we have to start from the last output and work our way backwards until we reach the desired variables.</p>
<p>In this post we calculated all of the gradients by hand but in practice this is done through the algorithm known as backpropagation. It works by repeatedly applying the chain rule to compute all the gradients.</p>
<p><strong>Forward pass:</strong>
Running through the graph and computing all the values is called the forward pass. It’s called forward pass because we’re traveling from the first layer to the last.</p>
<p><strong>Backward pass:</strong>
Computing the derivatives of the loss with respect to all of the parameters is called the backward pass. Similar to the forward pass, it’s called backward because we traverse the graph starting from the last layer and work our way back.</p>
<h2>Final Words</h2>
<p>I hope this post has provided some insight to you on how neural networks work. It is by no means comprehensive, I have skipped over a lot of details. If you want to continue learning about neural networks, I would recommend the <a href="https://www.deeplearningbook.org/">Deep Learning book by Ian Goodfellow and Yoshua Bengio and Aaron Courville</a> as a good place to start. Here are a few other good resources:</p>
<ul>
<li><a href="https://playground.tensorflow.org">Neural Network Playground</a> - One of the best ways to learn something is to play around with it. The NN playground lets you easily build and train models on various synthetic datasets. Great tool for building intuition.</li>
<li><a href="http://cs231n.github.io/">CS231n: Convolutional Neural Networks for Visual Recognition</a> - Contains excellent notes from Andrej Karpathy, highly recommended.</li>
<li><a href="https://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/">CSC 321: Intro to Neural Networks and Machine Learning</a> - This has more than just neural networks. The lecture slides and notes are really good and it builds up from linear classifiers.</li>
<li><a href="https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi">3Blue1Brown: Neural Networks</a> - One of my all time favorite educational channels. Has some amazing, visual heavy explanations on the concepts behind neural networks.</li>
</ul>
<hr />
<h2>Code</h2>
<p>What’s a tutorial without code, am I right? <a href="https://github.com/colonialjelly/multilayer-perceptron/blob/master/multilayer-perceptron.ipynb">Here</a> is a link to the Jupyter notebook that contains all the code for this post.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>To simplify the notation I’m referring to all of the parameters \(\boldsymbol{w}_{m}, b_{m}, \boldsymbol{W}, \boldsymbol{b}\) with just \(\theta\). <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>What we have written here is the negative log-likelihood. Some people refer to this loss function as binary cross-entropy loss. These are equivalent loss functions, the only difference is the method/assumptions that one uses to arrive at each. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>There are downsides to this, I’ll write a post about this in the future (hopefully). <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Giorgi KvernadzeThis post is best suited for people who are familiar with linear classifiers. I will also be assuming that the reader is familiar with gradient descent. The goal of this post isn’t to be a comprehensive guide about neural networks, but rather an attempt to show an intuitive path going from linear classifiers to a simple neural network. There are many types of neural networks, each having some advantage over others. In this post, I want to introduce the simplest form of a neural network, a Multilayer Perceptron (MLP). MLPs are a powerful method for approximating functions and it’s a relatively simple model to implement. Before we delve into MLPs, let’s quickly go over linear classifiers. Given training data as pairs \((\boldsymbol{x}_i, y_i)\) where \(\boldsymbol{x}_i \in \mathbb{R}^{d}\) are datapoints (observations) and \(y_i \in \{0, 1\}\) are their corresponding class labels, the goal is to learn a vector of weights \(\boldsymbol{w} \in \mathbb{R}^{d}\) and a bias \(b \in \mathbb{R}\) such that \(\boldsymbol{w}^T\boldsymbol{x}_{i} + b \ge 0\) if \(y_{i} = 1\) and \(\boldsymbol{w}^T\boldsymbol{x}_{i} + b < 0\) otherwise (\(y_{i} = 0\)). This decision can be summarized as the following step function: \[\text{Prediction} = \begin{cases} 1 & \boldsymbol{w}^T\boldsymbol{x} + b \ge 0 \\ 0 & \text{Otherwise}\\ \end{cases}\] In the case of Logistic Regression the decision function is characterized by the sigmoid function \(\sigma(z) = \frac{1}{1+e^{-z}}\) where \(z = \boldsymbol{w}^T\boldsymbol{x} + b\) \[\text{Prediction} = \begin{cases} 1 & \sigma(z) \ge \theta \\ 0 & \text{Otherwise}\\ \end{cases}\] Where \(\theta\) is a threshold that is usually set to be 0.5.