<h1>An Introduction to Word Embeddings - Part 2: Problems and Theory</h1>
<p>Jun Lu, Yixuan Hu · 2017-12-14</p>
<p>In the previous <a href="/wordembedapp">post</a>, we introduced what word embeddings are and what they can do. This time, we’ll try to make sense of them. What problem do they solve? How can they help computers understand natural language?</p>
<h2 id="understanding">Understanding</h2>
<p>See if you can guess what the word wapper means from how it’s used in the following two sentences:</p>
<ol>
<li>After running the marathon, I could barely keep my legs from wappering.</li>
<li>Thou’ll not see Stratfort to-night, sir, thy horse is wappered out. (Or perhaps a more modern take: I can’t drive you to Stratford tonight for I’m wappered out).</li>
</ol>
<p>The second example is from the <a href="http://www.oed.com/view/Entry/225584">Oxford English Dictionary</a> entry. If you haven’t guessed, the word was probably more popular in the late-19th century. But it means to shake, especially from fatigue (and it might share the same linguistic roots as to waver).</p>
<p>By now, you likely have a pretty good understanding of what to wapper means, perhaps well enough to create new sentences of your own. Impressively, you probably didn’t need me to explicitly tell you the definition; indeed, how many words in this sentence did you learn by reading a dictionary entry? We learn what words mean from their surrounding contexts.</p>
<p>This implies that even though it appears that the meaning of a word is intrinsic to the word, <strong>some of the meaning of a word also exists in its context</strong>.</p>
<p>Words, like notes on an instrument, do have their individual tones. But it is their relationship with each other—their interplay—that gives way to fuller music. Context enriches meaning.</p>
<p>So, take a context,</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>After running the marathon, I could barely keep my legs from ____________.
</code></pre></div></div>
<p>We should have a sense of what words could fill the blank. Much more likely to appear are words like shaking, trembling, and of course, wappering. Especially compared to nonsense like pickled, big data, or even pickled big data.</p>
<p>In short, the higher probability of appearing in this context corresponds to greater shared meaning. From this, we can deduce the <a href="https://en.wikipedia.org/wiki/Distributional_semantics#Distributional_hypothesis">distributional hypothesis</a>: words that share many contexts tend to have similar meaning.</p>
<p>What does this mean for a computer trying to understand words?</p>
<p>Well, if it can estimate how likely a word is to appear in different contexts, then for most intents and purposes, the computer has learned the meaning of the word.<a href="#he1" class="footnoteRef" id="fnhe1"><sup>1</sup></a> Mathematically, we want to approximate the probability distribution of</p>
<script type="math/tex; mode=display">p( word | context ) \textit{ or } p( context | word ).</script>
<p>Then, the next time the computer sees a specific context $c$, it can just figure out which words have the highest probability of appearing, $p( word | c )$.</p>
<h2 id="challenges">Challenges</h2>
<p>The straightforward and naive approach to approximating the probability distribution is:</p>
<ul>
<li>step 1: obtain a huge training corpus of texts,</li>
<li>step 2: calculate the probability of each <em>(word,context)</em> pair within the corpus.</li>
</ul>
<p>The underlying (bad) assumption? The probability distribution learned from the training corpus will approximate the theoretical distribution over all word-context pairs.</p>
<p>However, if we think about it, the number of contexts is so great that the computer will never see the vast majority of them. That is, many of the probabilities $p( word | c )$ will be computed to be 0, which makes for a terrible approximation.</p>
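To make the naive approach concrete, here is a minimal counting sketch in Python (the toy corpus and the two-word context definition are invented for illustration):

```python
from collections import Counter

# Toy corpus; in practice this would be billions of words.
corpus = ("after the race i could barely keep my legs from shaking "
          "after the race i could barely keep my arms from trembling").split()

# Define a context as the two words preceding a word.
pair_counts = Counter()
context_counts = Counter()
for i in range(2, len(corpus)):
    context = tuple(corpus[i - 2:i])
    pair_counts[(context, corpus[i])] += 1
    context_counts[context] += 1

def p(word, context):
    """Empirical estimate of p(word | context) from raw counts."""
    if context_counts[context] == 0:
        return 0.0  # unseen context: the sparsity problem
    return pair_counts[(context, word)] / context_counts[context]

print(p("shaking", ("legs", "from")))  # 1.0: seen once in this context
print(p("shaking", ("arms", "from")))  # 0.0: never co-occurred
```

Even on this tiny corpus, a perfectly sensible pair like ("arms from", shaking) gets probability 0 simply because it never appeared.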
<p>The problem we’ve run into is the <strong>curse of dimensionality</strong>. The number of possible contexts grows exponentially relative to the size of our vocabulary—when we add a new word to our vocabulary, we more or less multiply the number of contexts we can make.<a href="#he2" class="footnoteRef" id="fnhe2"><sup>2</sup></a></p>
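A quick back-of-the-envelope calculation makes the explosion vivid (the 20-word context and 25,000-word vocabulary match footnote 2):

```python
V, n = 25_000, 20                 # vocabulary size, context length in words
contexts = V ** n                 # number of possible 20-word contexts
growth = (V + 1) ** n - V ** n    # extra contexts gained by adding one word

print(f"{contexts:.1e}")  # ~9.1e+87
print(f"{growth:.1e}")    # ~7.3e+84
```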
<p><img src="/assets/blog/wordembedtheory/wmd-Copy.png" alt="Figure 1. The exponential growth of the number of contexts with respect to the number of words." /></p>
<p>We overcome the curse of dimensionality with word embeddings, otherwise known as <strong>distributed representations of words</strong>. Instead of focusing on words as individual entities to be trained one-by-one, we focus on the attributes or features that words share.</p>
<p>For example, king is a noun, singular and masculine. Of course, many words are masculine singular nouns. But as we add more features, we narrow down on the number of words satisfying each of those qualities.</p>
<p>Eventually, if we consider enough features, the collection of features a word satisfies will be distinct from that of any other word.<a href="#he3" class="footnoteRef" id="fnhe3"><sup>3</sup></a> This lets us uniquely represent words by their features. As a result, we can now train features instead of individual words.<a href="#he4" class="footnoteRef" id="fnhe4"><sup>4</sup></a></p>
<p>This new type of algorithm would learn more along the lines of <em>in this context, nouns having such and such qualities are more likely to appear instead of we’re more likely to see words X, Y, Z</em>. And since many words are nouns, each context teaches the algorithm a little bit about many words at once.</p>
<p>In summary, every word we train actually recalls a whole network of other words. This allows us to overcome the exponential explosion of word-context pairs by training an exponential number of them at a time.<a href="#he5" class="footnoteRef" id="fnhe5"><sup>5</sup></a></p>
<h2 id="a-new-problem">A New Problem</h2>
<p>In theory, representing words by their features can help solve our dimensionality problem. But, how do we implement it? Somehow, we need to be able to turn every word into a unique feature vector, like so:</p>
<p><img src="/assets/blog/wordembedtheory/Vector-Representation-of-Words.png" alt="Figure 2. The feature vector of the word king would be ⟨1,1,1,0,...⟩." /></p>
<p>But a feature like is a word isn’t very helpful; it doesn’t contribute to forming a unique representation. One way to ensure uniqueness is by looking at a whole lot of specific features. Take is the word ‘king’ or is the word ‘gastroenteritis’, for example. That way, every word definitely corresponds to a different feature vector:</p>
<p><img src="/assets/blog/wordembedtheory/Vector-Representation-of-Words-2.png" alt="Figure 3. An inefficent representation defeating the purpose of word embeddings." /></p>
<p>This isn’t a great representation though. Not only is this a very inefficient way to represent words, but it also fails to solve the original dimensionality problem. Although every word still technically recalls a whole network of words, each network contains only one word!</p>
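This degenerate representation is just a one-hot encoding, and a few lines of code show why it carries no similarity information (the three-word vocabulary is made up for the example):

```python
import numpy as np

vocab = ["king", "queen", "gastroenteritis"]
# One feature per word: "is the word X". Each word gets a one-hot vector.
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Every pair of distinct words is orthogonal: dot product 0.
print(one_hot["king"] @ one_hot["queen"])            # 0.0
print(one_hot["king"] @ one_hot["gastroenteritis"])  # 0.0
# So 'king' looks exactly as unrelated to 'queen' as to 'gastroenteritis'.
```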
<p>Constructing the right collection of features is a hard problem. They have to be neither too general nor too specific. The resulting representation of each word using those features should be unique. And we usually want to limit the number of features to somewhere between 100 and 1,000.</p>
<p>Furthermore, even though it’s simpler to think about binary features that take on True/False values, we’ll actually want to allow a spectrum of feature values. In particular, any real value. So, feature vectors are also actually vectors in a real vector space.</p>
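To illustrate why real-valued features help, here is a toy comparison with hand-picked feature values (the features and numbers are invented, not learned):

```python
import numpy as np

# Hypothetical features: [royalty, masculinity, plurality]
vecs = {
    "king":  np.array([0.9,  0.8, 0.0]),
    "queen": np.array([0.9, -0.7, 0.0]),
    "kings": np.array([0.9,  0.8, 1.0]),
}

def cos(u, v):
    """Cosine similarity: 1 for parallel vectors, 0 for orthogonal."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Similarity is now graded: 'king' is closer to 'kings' than to 'queen',
# but related to both, unlike the all-or-nothing one-hot encoding.
print(round(cos(vecs["king"], vecs["kings"]), 2))  # 0.77
print(round(cos(vecs["king"], vecs["queen"]), 2))  # 0.18
```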
<h2 id="a-new-solution">A New Solution</h2>
<p>The solution to feature construction is: don’t. At least not directly.<a href="#he6" class="footnoteRef" id="fnhe6"><sup>6</sup></a></p>
<p>Instead, let’s revisit the probability distributions from before:
<script type="math/tex">p( word | context )\textit{ and } p( context | word ).</script></p>
<p>This time, words and contexts are represented by feature vectors:</p>
<script type="math/tex; mode=display">\textit{word}_i = \langle \theta_{i,1}, \theta_{i,2}, \cdots, \theta_{i,300} \rangle,</script>
<p>which are just collections of numbers. This turns the probability distributions from functions over categorical objects (i.e. individual words) into functions over the numerical variables $\theta_{ij}$. This allows us to bring in a lot of existing analytical tools—in particular, neural networks and other optimization methods.</p>
<p>The short version of the solution: from the above probability distributions, we can calculate the probability of seeing our training corpus, $p(\textit{corpus})$, which had better be relatively large. We just need to find the values for each of the $\theta_{ij}$’s that maximize $p(\textit{corpus})$.</p>
<p>These values for $\theta_{ij}$ give precisely the feature representations for each word, which in turn lets us calculate $p( word | context )$.</p>
<p>Recall that this in theory teaches a computer the meaning of a word!</p>
<h2 id="a-bit-of-math">A Bit of Math</h2>
<p>In this section, I’ll give enough details for the interested reader to go on and understand the literature with a bit more ease.</p>
<p>Recall from previously, we have a collection of probability distributions that are functions over some $\theta_{ij}$’s. The literature refers to these $\theta_{ij}$’s as parameters of the probability distributions $p(w|c)$ and $p(c|w)$. The collection of parameters $\theta_{ij}$ is often denoted by a single $\theta$, and the parametrized distributions by $p(w|c;\theta)$ and $p(c|w;\theta)$.<a href="#he7" class="footnoteRef" id="fnhe7"><sup>7</sup></a></p>
<p>If the goal is to maximize the probability of the training corpus, let’s first write $p(\textit{corpus};\theta)$ in terms of $p(c|w;\theta)$.</p>
<p>There are a few different approaches. But in the simplest, we think of a training corpus as an ordered list of words, $w_1, w_2, \cdots, w_T$. Each word $w_t$ in the corpus has an associated context $C_t$, which is a collection of surrounding words.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\textit{corpus}: w_1 | w_2 | w_3 | \cdots &w_t \cdots | w_{T-2} | w_{T-1} | w_T \\
\textit{context of } w_t: [w_{t-n} \cdots w_{t-1}] \, &w_t \, [w_{t+1} \cdots w_{t+n}]
\end{align*} %]]></script>
<p><strong>Diagram 1</strong>. A corpus is just an ordered list of words. The context $C_t$ of $w_t$ is a collection of words around it.<a href="#he8" class="footnoteRef" id="fnhe8"><sup>8</sup></a></p>
<p>For a given word $w_t$ in the corpus, the probability of seeing another word $w_c$ in its context is $p(w_c | w_t;\theta)$. Therefore, the probability that a word sees all of the surrounding context words $w_c$ in the training corpus is</p>
<script type="math/tex; mode=display">\prod_{w_c \in C_t} p(w_c|w_t;\theta).</script>
<p>To get the total probability of seeing our training corpus, we just take the product over all words in the training corpus. Thus,</p>
<script type="math/tex; mode=display">p(\textit{corpus};\theta) = \prod_{w_t} \prod_{w_c \in C_t} p(w_c|w_t;\theta).</script>
<p>Now that we have the objective function, $f(\theta)=p(\textit{corpus};\theta)$, it’s just a matter of choosing the parameters $\theta$ that maximize $f$.<a href="#he9" class="footnoteRef" id="fnhe9"><sup>9</sup></a> Depending on how the probability distribution is parametrized by $\theta$, this optimization problem can be solved using neural networks. For this reason, this method is also called a <a href="https://en.wikipedia.org/wiki/Language_model#Neural_language_models">neural language model</a> (NLM).</p>
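As a rough illustration of one such parametrization, the sketch below uses a softmax over inner products of feature vectors (a common choice, per footnote 7); the vectors here are random stand-ins for a learned $\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim = 5, 3                       # tiny vocabulary and feature dimension
theta = rng.normal(size=(V, dim))   # one feature vector per word

def p_context_given_word(wc, wt):
    """Softmax parametrization of p(w_c | w_t; theta)."""
    scores = theta @ theta[wt]           # inner product with every word
    exp = np.exp(scores - scores.max())  # numerically stabilized softmax
    return exp[wc] / exp.sum()

# Log-likelihood of a toy corpus: sum over (center word, context word) pairs.
pairs = [(0, 1), (1, 0), (1, 2), (2, 1)]
log_lik = sum(np.log(p_context_given_word(c, t)) for t, c in pairs)
print(log_lik < 0)  # True: a log of probabilities is always negative
```

Training would adjust `theta` to maximize `log_lik`; with a softmax output layer, that optimization is exactly a (tiny) neural language model.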
<p>There are actually more layers of abstraction and bits of brilliance between theory and implementation. While I hope that I’ve managed to give you some understanding of where research is proceeding, the successes of the current word embedding methods are still rather mysterious. The intuition we’ve developed along the way is still, as Goldberg et al. wrote, “very hand-wavy” (Goldberg 2014).</p>
<p>Still, perhaps this can help you have some intuition for what’s going on behind all the math when reading the literature. A lot more has been written on this subject too; you can also take a look below where I list more resources I found useful.</p>
<h2 id="resources">Resources</h2>
<p>I focused mainly on word2vec while researching neural language models. However, do keep in mind that word2vec was just one of the earlier and possibly more famous models. To understand the theory, I quite liked all of the following. Approximately in order of increasing specificity,</p>
<ul>
<li>Radim Řehůřek’s introductory post on word2vec. He also wrote and optimized the word2vec algorithm for Python, which he notes sometimes exceeds the performance of the original C code.</li>
<li>Chris McCormick’s word2vec tutorial series, which goes into much more depth on the actual word2vec algorithm. He writes very clearly, and he also provides a list of resources.</li>
<li>Goldberg and Levy 2014, word2vec Explained, which helped me formulate my explanations above.</li>
<li>Sebastian Ruder’s word embedding series. I found this series comprehensive but really accessible.</li>
<li>Bojanowski 2016, Enriching word vectors with subword information. This paper actually is for Facebook’s fastText (which Mikolov is a part of), but it is based in part on word2vec. I found the explanation of word2vec’s model in Section 3.1 transparent and concise.</li>
<li>Levy 2015, Improving distributional similarity with lessons learned from word embeddings points out that the increased performance of word2vec over previous word embedding models might be a result of “hyperparameter optimizations,” and not necessarily of the algorithm itself.</li>
</ul>
<h2 id="references">References</h2>
<ul>
<li>(Bengio 2003) Bengio, Yoshua, et al. “A neural probabilistic language model.” Journal of machine learning research 3.Feb (2003): 1137-1155.</li>
<li>(Bojanowski 2016) Bojanowski, Piotr, et al. “Enriching word vectors with subword information.” arXiv preprint arXiv:1607.04606 (2016).</li>
<li>(Goldberg 2014) Goldberg, Yoav, and Omer Levy. “word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method.” arXiv preprint arXiv:1402.3722 (2014).</li>
<li>(Goodfellow 2016) Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.</li>
<li>(Hamilton 2016) Hamilton, William L., et al. “Inducing domain-specific sentiment lexicons from unlabeled corpora.” arXiv preprint arXiv:1606.02820 (2016).</li>
<li>(Kusner 2015) Kusner, Matt, et al. “From word embeddings to document distances.” International Conference on Machine Learning. 2015.</li>
<li>(Levy 2015) Levy, Omer, Yoav Goldberg, and Ido Dagan. “Improving distributional similarity with lessons learned from word embeddings.” Transactions of the Association for Computational Linguistics 3 (2015): 211-225.</li>
<li>(Mikolov 2013a) Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. “Linguistic regularities in continuous space word representations.” hlt-Naacl. Vol. 13. 2013.</li>
<li>(Mikolov 2013b) Mikolov, Tomas, et al. “Efficient estimation of word representations in vector space.” arXiv preprint arXiv:1301.3781 (2013).</li>
<li>(Mikolov 2013c) Mikolov, Tomas, et al. “Distributed representations of words and phrases and their compositionality.” Advances in neural information processing systems. 2013.</li>
<li>(Mikolov 2013d) Mikolov, Tomas, Quoc V. Le, and Ilya Sutskever. “Exploiting similarities among languages for machine translation.” arXiv preprint arXiv:1309.4168 (2013).</li>
<li>(Mnih 2012) Mnih, Andriy, and Yee Whye Teh. “A fast and simple algorithm for training neural probabilistic language models.” arXiv preprint arXiv:1206.6426 (2012).</li>
<li>(Rong 2014) Rong, Xin. “word2vec parameter learning explained.” arXiv preprint arXiv:1411.2738 (2014).</li>
</ul>
<h2 id="foot-note">Foot note</h2>
<ol>
<li id="he1">And let's not get into any philosophical considerations of whether the computer really understands the word. Come to think of it, how do I even know you understand a word of what I'm saying? Maybe it's just a matter of serendipity that the string of words I write make sense to you. But here I am really talking about how to oil paint clouds, and you think that I'm talking about machine learning <a href="#fnhe1">↩</a> </li>
<li id="he2">Consider a 20-word context. If we assume that the average English speaker's vocabulary is 25,000 words, then adding 1 word corresponds to an increase of about $7.2 \times 10^{84}$ contexts, which is actually more than the number of atoms in the universe. Of course, most of those contexts wouldn't make any sense. <a href="#fnhe2">↩</a></li>
<li id="he3">The algorithm used by the Google researchers mentioned above assumes 300 features.<a href="#fnhe3">↩</a></li>
<li id="he4">The term distributed representation of words comes from this: we can now represent words by their features, which are shared (i.e. distributed) across all words. We can imagine the representation as a feature vector. For example, it might have a ‘noun bit' that would be set to 1 for nouns and 0 for everything else. This is, however, a bit simplified. Features can take on a spectrum of values, in particular, any real value. So, feature vectors are actually vectors in a real vector space. <a href="#fnhe4">↩</a></li>
<li id="he5">The distributed representations of words "allows each training sentence to inform the model about an exponential number of semantically neighboring sentences," (Bengio 2003). <a href="#fnhe5">↩</a></li>
<li id="he6">This also means that there's probably not a ‘noun bit' in our representation, like in the figures above. There might not be any obvious meaning to each feature. <a href="#fnhe6">↩</a></li>
<li id="he7">The softmax function is often chosen as the ideal probability distribution. <a href="#fnhe7">↩</a></li>
<li id="he8">One can control the algorithm by specifying different hyperparameters: do we care about order of words? How many surrounding words do we consider? And on. <a href="#fnhe8">↩</a></li>
<li id="he9"> <a href="#fnhe9">↩</a></li>
</ol>
<h2 id="remark">Remark</h2>
<p>This blog content is requested by <a href="https://www.linkedin.com/in/gautambay/">Gautam Tambay</a> and edited by Jun Lu.</p>
<h1>An Introduction to Word Embeddings - Part 1: Applications</h1>
<p>2017-12-14</p>
<p>If you already have a solid understanding of word embeddings and are well into your data science career, skip ahead to the <a href="/wordembedtheory">next part</a>!</p>
<p>Human language is <a href="https://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness_of_Mathematics_in_the_Natural_Sciences">unreasonably effective</a> at describing how we relate to the world. With a few, short words, we can convey many ideas and actions with little ambiguity. Well, <a href="http://mentalfloss.com/article/24445/10-amelia-bedelia-isms">mostly</a>.</p>
<p>Because we’re capable of seeing and describing so much complexity, a lot of structure is implicitly encoded into our language. It is no easy task for a computer (or a human, for that matter) to learn natural language, for it entails understanding how we humans observe the world, if not understanding how to observe the world.</p>
<p>For the most part, computers can’t understand natural language. Our programs are still line-by-line instructions telling a computer what to do — they often miss nuance and context. How can you explain sarcasm to a machine?</p>
<p>There’s good news though. There have been some important breakthroughs in natural language processing (NLP), the domain where researchers try to teach computers human language.</p>
<p>Famously, in 2013 Google researchers (Mikolov 2013) found a method that enabled a computer to learn relations between words such as:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">king</span><span class="o">-</span><span class="n">man</span><span class="o">+</span><span class="n">woman</span><span class="err">≈</span><span class="n">queen</span><span class="o">.</span>
</code></pre></div></div>
<p>This method, called word embeddings, has a lot of promise; it might even be able to reveal hidden structure in the world we see. Consider one relation it <a href="http://byterot.blogspot.ch/2015/06/five-crazy-abstractions-my-deep-learning-word2doc-model-just-did-NLP-gensim.html">discovered</a>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">president</span><span class="o">-</span><span class="n">power</span><span class="err">≈</span><span class="n">prime</span> <span class="n">minister</span>
</code></pre></div></div>
<p>Admittedly, this might be one of those specious relations.</p>
<p>Joking aside, it’s worth studying word embeddings for at least two reasons. First, there are a lot of applications made possible by word embeddings. Second, we can learn from the way researchers approached the problem of deciphering natural language for machines.</p>
<p>In Part 1 of this article series, let’s take a look at the first of these reasons.</p>
<h2 id="uses-of-word-embeddings">Uses of Word Embeddings</h2>
<p>There’s no obvious way to usefully compare two words unless we already know what they mean. The goal of word-embedding algorithms is, therefore, <strong>to embed words with meaning based on their similarity or relationship with other words</strong>.</p>
<p>In practice, words are embedded into a real vector space, which comes with notions of distance and angle. We hope that these notions extend to the embedded words in meaningful ways, quantifying relations or similarity between different words. And empirically, they actually do!</p>
<p>For example, the Google algorithm I mentioned above discovered certain nouns are singular/plural or have gender (Mikolov 2013abc):</p>
<p><img src="/assets/blog/wordembedapp/relations-Copy.png" alt="img1" /></p>
<p>They also found a country-capital relationship:</p>
<p><img src="/assets/blog/wordembedapp/country-Copy.png" alt="img2" /></p>
<p>And as further evidence that a word’s meaning can be implied from its relationships with other words, they actually found that the learned structure for one language often correlated to that of another language, perhaps suggesting the possibility for <a href="https://en.wikipedia.org/wiki/Machine_translation">machine translation</a> through word embeddings (Mikolov 2013c):</p>
<p><img src="/assets/blog/wordembedapp/mt-Copy.png" alt="img3" /></p>
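The analogies above reduce to plain vector arithmetic on embeddings. Here is a toy sketch with invented 2-D vectors (real models use hundreds of dimensions):

```python
import numpy as np

# Invented 2-D vectors: dimension 0 ≈ royalty, dimension 1 ≈ gender.
emb = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

target = emb["king"] - emb["man"] + emb["woman"]

# Nearest word (by Euclidean distance) to king - man + woman:
nearest = min(emb, key=lambda w: np.linalg.norm(emb[w] - target))
print(nearest)  # queen
```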
<p>They released their C code as the <a href="https://code.google.com/archive/p/word2vec/">word2vec</a> package, and soon after, others adapted the algorithm for more programming languages. Notably, for <a href="https://radimrehurek.com/gensim/index.html">gensim</a> (Python) and <a href="https://deeplearning4j.org/word2vec">deeplearning4j</a> (Java).</p>
<p>Today, many companies and data scientists have found different ways to incorporate word2vec into their businesses and research. <a href="https://www.slideshare.net/eshvk/spotifys-music-recommendations-lambda-architecture">Spotify</a> uses it to help provide music recommendation. <a href="http://multithreaded.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/">Stitch Fix</a> uses it to recommend clothing. Google is thought to use word2vec in <a href="https://searchengineland.com/faq-all-about-the-new-google-rankbrain-algorithm-234440">RankBrain</a> as part of their search algorithm.</p>
<p>Other researchers are using <a href="https://niksto.com/rankbrain/">word2vec</a> for sentiment analysis, which attempts to identify the emotionality behind the words people use to communicate. For example, one <a href="https://arxiv.org/pdf/1606.02820.pdf">Stanford research group</a> looked at how the same words in different Reddit communities take on different connotations. Here’s an example with the word soft:</p>
<p><img src="/assets/blog/wordembedapp/reddit-Copy.png" alt="img4" /></p>
<p>As you can see, the word “soft” has a negative connotation when you’re talking about sports (you might think of the term “soft players”), while it has a positive connotation when you’re talking about cartoons.</p>
<p>And here are more examples where the computer could analyze the emotional sentiment of the same words across different communities.</p>
<p><img src="/assets/blog/wordembedapp/reddit-spectrum-Copy.png" alt="img5" /></p>
<p>They can even apply the same method over time, following how the word terrific, which meant horrific for the majority of the 20th century, has come to essentially mean great today.</p>
<p><img src="/assets/blog/wordembedapp/terrific-Copy.png" alt="img6" /></p>
<p>As a light-hearted example, one <a href="http://www.pelleg.org/shared/hp/download/fun-facts-wsdm.pdf">research group</a> used word2vec to help them determine whether a fact is surprising or not, so that they could automatically generate trivia facts.</p>
<p>The successes of word2vec have also helped spur on other forms of word embedding—<a href="https://arxiv.org/pdf/1506.02761.pdf">WordRank</a>, Stanford’s <a href="https://nlp.stanford.edu/projects/glove/">GloVe</a>, and Facebook’s <a href="https://research.fb.com/projects/fasttext/">fastText</a>, to name a few major ones.</p>
<p>These algorithms seek to improve on word2vec — they also look at texts through different units: characters, subwords, words, phrases, sentences, documents, and perhaps even units of thought. As a result, they allow us to think about not just word similarity, but also sentence similarity and document similarity—like this paper did (Kusner 2015):</p>
<p><img src="/assets/blog/wordembedapp/wmd-Copy.png" alt="img6" /></p>
<p>Word embeddings <strong>transform human language meaningfully into a form conducive to numerical analysis</strong>. In doing so, they allow computers to explore the wealth of knowledge encoded implicitly into our own ways of speaking. <strong>We’ve barely scratched the surface of that potential</strong>.</p>
<p>Any individual programmer or scholar can use these tools and contribute new knowledge. Many areas of research and industry that could benefit from NLP have yet to be explored. Word embeddings and neural language models are powerful techniques. But perhaps the most powerful aspect of machine learning is its collaborative culture. Many, if not most, of the state-of-the-art methods are open-source, along with their accompanying research.</p>
<p>So, it’s there, if we want to take advantage. Now, the main obstacle is just ourselves. And maybe an expensive GPU.</p>
<p>For the theory behind word embeddings, see <a href="/wordembedtheory">Part 2</a>.</p>
<h2 id="reference">Reference</h2>
<ul>
<li>(Hamilton 2016) Hamilton, William L., et al. “Inducing domain-specific sentiment lexicons from unlabeled corpora.” arXiv preprint arXiv:1606.02820 (2016).</li>
<li>(Kusner 2015) Kusner, Matt, et al. “From word embeddings to document distances.” International Conference on Machine Learning. 2015.</li>
<li>(Mikolov 2013a) Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. “Linguistic regularities in continuous space word representations.” hlt-Naacl. Vol. 13. 2013.</li>
<li>(Mikolov 2013b) Mikolov, Tomas, et al. “Efficient estimation of word representations in vector space.” arXiv preprint arXiv:1301.3781 (2013).</li>
<li>(Mikolov 2013c) Mikolov, Tomas, et al. “Distributed representations of words and phrases and their compositionality.” Advances in neural information processing systems. 2013.</li>
<li>(Mikolov 2013d) Mikolov, Tomas, Quoc V. Le, and Ilya Sutskever. “Exploiting similarities among languages for machine translation.” arXiv preprint arXiv:1309.4168 (2013).</li>
</ul>
<h2 id="remark">Remark</h2>
<p>This blog content is requested by <a href="https://www.linkedin.com/in/gautambay/">Gautam Tambay</a> and edited by <a href="https://github.com/junlulocky">Jun Lu</a>. You can also find this blog at <a href="https://www.springboard.com/blog/introduction-word-embeddings/">Springboard</a>.</p>
<h1>Subspace Embedding</h1>
<p>2017-10-10</p>
<p>Subspace embedding is a powerful tool for simplifying matrix computations and analyzing high-dimensional data, especially sparse matrices.</p>
<h2 id="subspace-embedding">Subspace Embedding</h2>
<p>A random matrix $\Pi \in \mathbb{R}^{m \times n}$ is a $(d, \epsilon, \delta)$-subspace embedding if for every $d$-dimensional subspace $U \subseteq \mathbb{R}^n$, $\forall x \in U$ has,</p>
<script type="math/tex; mode=display">\mathrm{P}(\left| ||\Pi x||_2 - ||x||_2 \right| \leq \epsilon ||x||_2) \geq 1 - \delta</script>
<p>Essentially, the sketch matrix maps any vector $x \in \mathbb{R}^n$ in the span of the columns of $U$ to $\mathbb{R}^m$ and the $l_2$ norm is preserved with high probability.</p>
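As a sanity check, a Gaussian random matrix with entries drawn from $N(0, 1/m)$ is the classic example of such a sketch; the snippet below (with arbitrarily chosen dimensions) verifies norm preservation numerically:

```python
import numpy as np

rng = np.random.default_rng(42)
n, m = 1000, 200
# Gaussian sketch: entries N(0, 1/m), so E[||Pi x||^2] = ||x||^2.
Pi = rng.normal(scale=1 / np.sqrt(m), size=(m, n))

x = rng.normal(size=n)
ratio = np.linalg.norm(Pi @ x) / np.linalg.norm(x)
print(round(float(ratio), 3))  # close to 1: the l2 norm is preserved
```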
<h2 id="matrix-multiplication-via-subspace-embedding">Matrix Multiplication via Subspace Embedding</h2>
<p>Consider a simple problem: given two matrices $A, B \in \mathbb{R}^{n \times d}$, what is the complexity of computing $C = A^{\top} B$? The straightforward algorithm takes $O(nd^2)$. Now we use a subspace embedding to speed this up: the approximate product is simply $C' = (\Pi A)^{\top} (\Pi B)$. We can prove that with probability at least $1 - 3d^2 \epsilon$, <script type="math/tex">\| C' - C \|_F \leq \epsilon \| A \|_F \| B \|_F</script> holds.</p>
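A numerical illustration of the approximate product, using a Gaussian sketch for simplicity (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 2000, 10, 400
A, B = rng.normal(size=(n, d)), rng.normal(size=(n, d))

Pi = rng.normal(scale=1 / np.sqrt(m), size=(m, n))
C_exact = A.T @ B                  # O(n d^2)
C_approx = (Pi @ A).T @ (Pi @ B)   # O(m d^2) once A and B are sketched

err = np.linalg.norm(C_approx - C_exact, "fro")
bound = np.linalg.norm(A, "fro") * np.linalg.norm(B, "fro")
print(err / bound < 0.2)  # True: small relative Frobenius error
```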
<h2 id="least-squares-regression-via-subspace-embedding">Least Squares Regression via Subspace Embedding</h2>
<p>Next, consider a classic problem: least squares regression. The exact least squares regression problem is the following: Given $A \in \mathbb{R}^{n \times d}$ and $b \in \mathbb{R}^n$, solve</p>
<script type="math/tex; mode=display">x^{*} = \arg \min_{x \in R^d } \| Ax - b \|_2 \qquad (1)</script>
<p>It is well-known that the solution is $(A^{\top}A)^{+} A^{\top} b$, where $(A^{\top}A)^{+}$ is the Moore-Penrose pseudoinverse of $A^{\top}A$. It can be calculated via SVD computation, taking $O(n d^2)$ time. However, if we allow approximation, can we decrease the time complexity? We can formalize the question as follows: instead of finding the exact solution $x^{*}$, we would like to find $x' \in \mathbb{R}^d$ such that</p>
<script type="math/tex; mode=display">\| Ax^{*} - b \|_2 \leq \| Ax' - b \|_2 \leq (1 + \Delta) \| Ax^{*} - b \|_2</script>
<p>where $\Delta$ is a small constant.</p>
<p>Suppose there exists a $(d+1, \epsilon, \delta)$-subspace embedding matrix $\Pi$. Can we solve the following problem instead?</p>
<script type="math/tex; mode=display">x' = \arg \min_{x \in R^d } \| \Pi Ax - \Pi b \|_2 \qquad (2)</script>
<p>Proof: By the definition of a $(d+1)$-subspace embedding matrix, the following holds with probability at least $1 - \delta$ for any $x \in \mathbb{R}^d$:</p>
<script type="math/tex; mode=display">\left| \| \Pi [A;b] \cdot [x^{\top}; -1] \|_2 - \| [A;b] \cdot [x^{\top}; -1] \|_2 \right| \leq \epsilon \| [A;b] \cdot [x^{\top}; -1] \|_2 \qquad (3)</script>
<p>For $x’$ is optimum in equation(2), we have</p>
<script type="math/tex; mode=display">\begin{align}
\| \Pi Ax' - \Pi b \|_2 \leq \| \Pi Ax^{*} - \Pi b \|_2
\end{align} \qquad (4)</script>
<p>Setting $x = x^{\star}$ in equation (3), we have</p>
<script type="math/tex; mode=display">\| \Pi Ax^{\star} - \Pi b \|_2 \leq (1 + \epsilon) \|Ax^{\star} - b \|_2 \qquad (5)</script>
<p>Setting $x = x'$ in equation (3), we have</p>
<script type="math/tex; mode=display">\| \Pi Ax' - \Pi b \|_2 \geq (1 - \epsilon) \|Ax' - b \|_2 \qquad (6)</script>
<p>Combining equations (4), (5), and (6), we get</p>
<script type="math/tex; mode=display">(1 - \epsilon) \|Ax' - b \|_2 \leq (1 + \epsilon) \|Ax^{\star} - b \|_2</script>
<p>Taking $\Delta = \frac{2 \epsilon}{1 - \epsilon}$, we conclude that the solution of equation (2) satisfies the desired guarantee:</p>
<script type="math/tex; mode=display">\| Ax^{*} - b \|_2 \leq \| Ax' - b \|_2 \leq (1 + \Delta) \| Ax^{*} - b \|_2</script>
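<p>A minimal NumPy sketch of this approximate regression, again using a Gaussian sketch as a stand-in for $\Pi$ (the problem sizes are illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 20_000, 10, 1_000

A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)  # noisy observations

# exact solution x* = argmin ||Ax - b||_2, O(n d^2)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

# sketched solution x' = argmin ||Pi A x - Pi b||_2, solved on a k x d problem
Pi = rng.standard_normal((k, n)) / np.sqrt(k)
x_sk, *_ = np.linalg.lstsq(Pi @ A, Pi @ b, rcond=None)

r_star = np.linalg.norm(A @ x_star - b)
r_sk = np.linalg.norm(A @ x_sk - b)
print(r_sk / r_star)   # >= 1 by optimality of x*, and close to 1
```

<p>The residual ratio plays the role of $1 + \Delta$ in the guarantee above.</p>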
<p>Up to now, we have seen how to solve the approximate least squares regression problem via subspace embedding. However, one fundamental question arises: how do we construct a subspace embedding matrix? In the following section, we demonstrate that CountSketch is a subspace embedding.</p>
<h2 id="subspace-embedding-via-countsketch">Subspace Embedding Via CountSketch</h2>
<p>A CountSketch matrix $S \in \mathbb{R}^{B \times n}$ is defined as follows: fix the number of buckets $B$, a hash function $h:[n] \rightarrow [B]$, and a sign function $\phi:[n] \rightarrow \{-1, +1\}$. For $r \in [B], a \in [n]$, let</p>
<script type="math/tex; mode=display">% <![CDATA[
S_{ra} = \begin{cases}
\phi(a) & \text{if } h(a) = r \\
0 & \text{otherwise}
\end{cases} %]]></script>
<p>CountSketch Example:</p>
<script type="math/tex; mode=display">% <![CDATA[
\left(
\begin{matrix}
0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 0 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
\end{matrix}
\right) %]]></script>
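<p>Since $S$ has exactly one nonzero per column, it is cheap to generate and to apply. A small NumPy sketch of the construction (built densely here only for readability):</p>

```python
import numpy as np

def countsketch(B, n, rng):
    """Build a CountSketch matrix S in R^{B x n}: column a has a single
    nonzero phi(a) placed in row h(a)."""
    h = rng.integers(0, B, size=n)          # hash h: [n] -> [B]
    phi = rng.choice([-1.0, 1.0], size=n)   # sign phi: [n] -> {-1, +1}
    S = np.zeros((B, n))
    S[h, np.arange(n)] = phi
    return S

rng = np.random.default_rng(2)
S = countsketch(B=5, n=11, rng=rng)
print(S)
assert (np.abs(S).sum(axis=0) == 1).all()   # exactly one ±1 in every column
```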
<p>We can show that for every matrix $U \in \mathbb{R}^{n \times d}$ with orthonormal columns,</p>
<script type="math/tex; mode=display">P(\left| ||\Pi x||_2 - ||x||_2 \right| \leq \epsilon ||x||_2, \forall x \in \text{the column span of }U) > 1 - \delta</script>
<p>Proof:</p>
<p>Since $x$ is in the column span of $U$, we can write $x = Uy$ where $y \in \mathbb{R}^d$. The guarantee</p>
<script type="math/tex; mode=display">(1 - \epsilon)||x||_2^2 \leq ||\Pi x||_2^2 \leq (1+\epsilon)||x||_2^2</script>
<p>is equivalent to</p>
<script type="math/tex; mode=display">(1 - \epsilon) y^{\top} U^{\top} U y \leq y^{\top} U^{\top} \Pi^{\top} \Pi U y \leq (1+\epsilon) y^{\top} U^{\top} U y</script>
<p>Since $U^{\top} U = I$, this becomes</p>
<script type="math/tex; mode=display">(1 - \epsilon) y^{\top} y \leq y^{\top} U^{\top} \Pi^{\top} \Pi U y \leq (1+\epsilon) y^{\top} y</script>
<p>Requiring this for all $y$ is equivalent to</p>
<script type="math/tex; mode=display">|| U^{\top} \Pi^{\top} \Pi U - I ||_2 \leq \epsilon</script>
<p>Since the Frobenius norm upper bounds the spectral norm, it suffices to show that</p>
<script type="math/tex; mode=display">|| U^{\top} \Pi^{\top} \Pi U - I ||_F \leq \epsilon</script>
<p>We can show that (we omit the detailed proof)</p>
<script type="math/tex; mode=display">\mathbf{E}[ ||U^{\top} \Pi^{\top} \Pi U - I ||_F^2 ] \leq \frac{2 d^2}{B}</script>
<p>By Markov’s inequality,</p>
<script type="math/tex; mode=display">\mathrm{P}( ||U^{\top} \Pi^{\top} \Pi U - I ||_F^2
\geq \epsilon^2) \leq \frac{2 d^2}{B \epsilon^2}</script>
<p>Then we can obtain
<script type="math/tex">\mathrm{P}( ||U^{\top} \Pi^{\top} \Pi U - I ||_F
\geq \epsilon) \leq \frac{2 d^2}{B \epsilon^2}</script></p>
<p>Thus</p>
<script type="math/tex; mode=display">\mathrm{P} (\left| ||\Pi x||_2 - ||x||_2 \right| \leq \epsilon ||x||_2) \geq 1 - \frac{2 d^2}{B \epsilon^2}</script>
<p>which implies that CountSketch is a $(d, \epsilon, \frac{2 d^2}{B \epsilon^2})$-subspace embedding. Setting $B = \frac{C d^2}{\epsilon^2}$ for a large enough absolute constant $C$ gives a subspace embedding with large constant probability.</p>
<h2 id="complexity-analysis">Complexity Analysis</h2>
<p>The matrix $\Pi A$ is a $B \times d$ matrix, where $B = \frac{C d^2}{\epsilon^2}$. Thus using the SVD to solve $||\Pi A x - \Pi b||$ takes $\mathrm{poly}(d, \frac{1}{\epsilon})$ time. How much time does it take to form the matrix $\Pi A$ and the vector $\Pi b$? Since every column of $\Pi$ has exactly one nonzero entry, the runtime is proportional to the number of nonzeros in the matrix $A$ and the vector $b$. The overall time is $O(\mathrm{nnz}(A) + \mathrm{poly}(d, \frac{1}{\epsilon}))$. Note that if the matrix is sparse, this is very efficient.</p>
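<p>The $O(\mathrm{nnz})$ claim is easy to see in code: $\Pi A$ can be computed by scatter-adding each row of $A$ into its bucket, without ever materializing $\Pi$. A sketch (the bucket count is an illustrative choice):</p>

```python
import numpy as np

def apply_countsketch(A, B, rng):
    """Compute Pi @ A without materializing Pi: row i of A is scaled by a
    random sign phi(i) and added into bucket h(i), so the cost is
    proportional to the number of nonzero rows/entries of A."""
    n, d = A.shape
    h = rng.integers(0, B, size=n)           # bucket of each row
    phi = rng.choice([-1.0, 1.0], size=n)    # random signs
    PA = np.zeros((B, d))
    np.add.at(PA, h, phi[:, None] * A)       # scatter-add rows into buckets
    return PA

rng = np.random.default_rng(3)
x = rng.standard_normal((100_000, 1))
Px = apply_countsketch(x, B=2_000, rng=rng)
ratio = np.linalg.norm(Px) / np.linalg.norm(x)
print(ratio)   # close to 1: the sketch approximately preserves the norm
```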
<h2 id="reference">Reference</h2>
<ul>
<li>
<p>EPFL Topics in Theoretical Computer Science (Sublinear Algorithm for Big Data Analysis), 2017</p>
</li>
<li>
<p>Xiangrui Meng and Michael W. Mahoney. Low-distortion subspace embeddings in input-sparsity
time and applications to robust linear regression, 2012.</p>
</li>
<li>
<p>Jelani Nelson and Huy L. Nguyen. Osnap: Faster numerical linear algebra algorithms via sparser
subspace embeddings, 2012.</p>
</li>
</ul>
Dimensionality Reduction via JL Lemma and Random Projection2017-10-10T00:00:00+00:00/Dimensionality-Reduction-via-JL-Lemma-and-Random-Projection<p>Nowadays, dimensionality is a serious problem in data analysis, as the huge datasets we encounter today are very sparse and very high-dimensional. Data scientists have long used tools such as principal component analysis (PCA) and independent component analysis (ICA) to project high-dimensional data onto a subspace, but all those techniques rely on computing the eigenvectors of an $n \times n$ matrix, a very expensive operation (e.g., spectral decomposition) for high dimension $n$. Moreover, even though the eigenspace has many important properties, it does not lead to good approximations for many useful measures such as vector norms. Here we discuss another method, random projection, to reduce dimensionality.</p>
<p>In 1984, the mathematicians William B. Johnson and Joram Lindenstrauss introduced and proved the following lemma.</p>
<h2 id="johnson-lindenstrauss-lemma">Johnson-Lindenstrauss lemma</h2>
<p>For any $\epsilon \in (0,\frac{1}{2})$, $\forall x_1, x_2, \dots, x_d \in \mathbb{R}^{n}$, there exists a matrix $M \in \mathbb{R}^{m \times n}$ with $m = O(\frac{1}{\epsilon^2} \log{d})$ such that $\forall 1 \leq i,j \leq d$, we have</p>
<script type="math/tex; mode=display">(1-\epsilon)||x_i - x_j||_2 \leq ||Mx_i - Mx_j||_2 \leq (1+\epsilon)||x_i - x_j||_2</script>
<p>Remark: This lemma states that for any set of vectors $x_1, \dots, x_d$ in $\mathbb{R}^n$, there exists a sketch matrix $M$ which maps $\mathbb{R}^n \rightarrow \mathbb{R}^m$ such that every pairwise Euclidean distance is preserved up to a factor of $1 \pm \epsilon$. The target dimension $m$ does not depend on the original dimension $n$; it depends only on the number of vectors $d$ (and on $\epsilon$).</p>
<p>For a long time, no one could figure out how to construct this sketch matrix.</p>
<h2 id="random-projection">Random Projection</h2>
<p>In 2003, researchers pointed out that this sketch matrix can be created using the Gaussian distribution.</p>
<p>Consider the following matrix $A \in \mathbb{R}^{m \times n}$, where $A_{ij} \sim \mathcal{N}(0,1)$ and all $A_{ij}$ are independent. We claim that this matrix satisfies the statement of JL lemma.</p>
<p>Proof. The sketch has an additional property:
$\forall i, (Ax)_i = \sum_{j=1}^{n} A_{ij} x_j \sim \mathcal{N}(0, ||x||_2^2)$. In other words, the Gaussian distribution is a 2-stable distribution. Then we obtain $||Ax||_2^2 = \sum_{i=1}^{m} y_i^2$, where $y_i \sim \mathcal{N}(0, ||x||_2^2)$. That is to say, $||Ax||_2^2 / ||x||_2^2$ follows a $\chi^2$ (chi-squared) distribution with $m$ degrees of freedom. From the tail bound of the $\chi^2$ distribution, we get</p>
<script type="math/tex; mode=display">% <![CDATA[
P(\left| ||Ax||_2^2 - m||x||_2^2 \right| > \epsilon m||x||_2^2) < \exp(-C \epsilon^2 m) %]]></script>
<p>for a constant $C > 0$.</p>
<p>Fix two indices $i, j$, let $y^{ij} = x_i - x_j$ and $M = \frac{1}{\sqrt{m}} A$, and set $m = \frac{4}{C \epsilon^2} \log{n}$ to get</p>
<script type="math/tex; mode=display">% <![CDATA[
P(\left| ||M y^{ij}||_2^2 - ||y^{ij}||_2^2 \right| > \epsilon ||y^{ij}||_2^2) < \exp(-C \epsilon^2 m) = \frac{1}{n^4} %]]></script>
<p>Take the union bound to obtain,</p>
<script type="math/tex; mode=display">% <![CDATA[
P(\exists i \neq j, \left| ||M y^{ij}||_2^2 - ||y^{ij}||_2^2 \right| > \epsilon ||y^{ij}||_2^2 ) \leq \sum_{i \neq j} P(\left| ||M y^{ij}||_2^2 - ||y^{ij}||_2^2 \right| > \epsilon ||y^{ij}||_2^2 ) < {n \choose 2} \frac{1}{n^4} < \frac{1}{n^2} %]]></script>
<p>Thus,</p>
<script type="math/tex; mode=display">P(\forall i \neq j, \left| ||M (x_i - x_j)||_2^2 - ||x_i - x_j||_2^2 \right| \leq \epsilon ||x_i - x_j||_2^2 ) > 1 - \frac{1}{n^2}</script>
<p>which is the same as the guarantee in the Johnson-Lindenstrauss lemma.</p>
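<p>The whole argument is easy to reproduce numerically. A minimal NumPy sketch, where the target dimension uses an illustrative constant in place of $\frac{4}{C \epsilon^2}$:</p>

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, eps = 10_000, 50, 0.2         # ambient dim, number of points, distortion
m = int(4 / eps**2 * np.log(d))     # target dim O(log(d) / eps^2)

X = rng.standard_normal((d, n))                # d points in R^n
M = rng.standard_normal((m, n)) / np.sqrt(m)   # M = A / sqrt(m), A_ij ~ N(0, 1)
Y = X @ M.T                                    # projected points in R^m

worst = 0.0   # worst pairwise-distance distortion
for i in range(d):
    for j in range(i + 1, d):
        orig = np.linalg.norm(X[i] - X[j])
        proj = np.linalg.norm(Y[i] - Y[j])
        worst = max(worst, abs(proj - orig) / orig)
print(m, worst)   # worst distortion is around eps or below (w.h.p.)
```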
<h2 id="application">Application</h2>
<p>In this or other forms, the JL lemma has been used for a large variety of computational tasks, especially in streaming algorithms, such as:</p>
<ul>
<li>
<p><a href="https://www.stat.berkeley.edu/~mmahoney/f13-stat260-cs294/Lectures/lecture19.pdf">Computing a low-rank approximation to the original matrix A.</a></p>
</li>
<li>
<p><a href="http://web.stanford.edu/class/cs369g/files/lectures/lec16.pdf">Finding nearest neighbors in high-dimensional space.</a></p>
</li>
<li>
<p><a href="https://simons.berkeley.edu/sites/default/files/docs/1768/slidessrivastava1.pdf">Simplify the calculation of the effective resistance in Graph Spectral Sparsification.</a></p>
</li>
<li>
<p><a href="https://people.cs.umass.edu/~mcgregor/papers/12-pods1.pdf">Relates to Graph Sketches.</a></p>
</li>
</ul>
<h2 id="reference">Reference</h2>
<ul>
<li>EPFL Topics in Theoretical Computer Science (Sublinear Algorithm for Big Data Analysis), 2017</li>
<li>EPFL Advanced Algorithm, 2016</li>
<li>Johnson, William B.; Lindenstrauss, Joram (1984). “Extensions of Lipschitz mappings into a Hilbert space”. In Beals, Richard; Beck, Anatole; Bellow, Alexandra; et al. Conference in modern analysis and probability (New Haven, Conn., 1982). Contemporary Mathematics. 26. Providence, RI: American Mathematical Society. pp. 189–206.</li>
<li>Kane, Daniel M.; Nelson, Jelani (2012). “Sparser Johnson-Lindenstrauss Transforms”. Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms. New York: Association for Computing Machinery (ACM).</li>
</ul>
<p>This post is used for study purposes only.</p>
A comparison of distributed machine learning platform2017-09-07T00:00:00+00:00/a-comparison-of-distributed-machine-learning-platform<p>A short summary and comparison of different platforms. Based on <a href="http://muratbuffalo.blogspot.ch/2017/07/a-comparison-of-distributed-machine.html">this blog</a> and (Zhang et al., 2017).</p>
<!-- more -->
<p>We categorize the distributed ML platforms under 3 basic design approaches:</p>
<ol>
<li>basic dataflow</li>
<li>parameter-server model</li>
<li>advanced dataflow.</li>
</ol>
<p>We talk about each approach in brief:</p>
<ul>
<li>using Apache Spark as an example of the basic dataflow approach</li>
<li>PMLS (Petuum) as an example of the parameter-server model</li>
<li>TensorFlow and MXNet as examples of the advanced dataflow model.</li>
</ul>
<h1 id="spark">Spark</h1>
<p>Spark enables in-memory caching of frequently used data and avoids the overhead of writing a lot of intermediate data to disk. For this, Spark leverages Resilient Distributed Datasets (RDDs): read-only, partitioned collections of records distributed across a set of machines. RDDs are collections of objects divided into logical partitions that are stored and processed in memory, with shuffle/overflow to disk.</p>
<p>In Spark, a computation is modeled as a directed acyclic graph (DAG), where each vertex denotes an RDD and each edge denotes an operation on an RDD. On a DAG, an edge E from vertex A to vertex B implies that RDD B is a result of performing operation E on RDD A. There are two kinds of operations: transformations and actions. A transformation (e.g., map, filter, join) performs an operation on an RDD and produces a new RDD; an action (e.g., count, collect) runs a computation on an RDD and returns a value to the driver program.</p>
<p>A typical Spark job performs a couple of transformations on a sequence of RDDs and then applies an action to the latest RDD in the lineage of the whole computation. A Spark application runs multiple jobs in sequence or in parallel.</p>
<p><img src="https://4.bp.blogspot.com/-cN_-PWvDGCs/WX6pgpqlTSI/AAAAAAAAGbw/vp4ttIiQ5jAGmjllTEyMrFq200uDWyalQCK4BGAYYCw/s400/sparkArch.png" alt="" /></p>
<p>A Spark cluster consists of a master and multiple workers. The master is responsible for negotiating resource requests made by the Spark driver program corresponding to the submitted Spark application. Worker processes hold Spark executors (each of which is a JVM instance) that are responsible for executing Spark tasks. The driver contains two scheduler components, the DAG scheduler and the task scheduler. The DAG scheduler is responsible for stage-oriented scheduling, and the task scheduler is responsible for submitting tasks produced by the DAG scheduler to the Spark executors.</p>
<p>The Spark user models the computation as a DAG which transforms &amp; runs actions on RDDs. The DAG is compiled into stages. Unlike the MapReduce framework, which consists of only two computational stages, map and reduce, a Spark job may consist of a DAG of multiple stages. The stages are run in topological order. A stage contains a set of independent tasks which perform computation on partitions of RDDs. These tasks can be executed either in parallel or in a pipelined fashion.</p>
<p><img src="https://4.bp.blogspot.com/-_KxjkVBsznQ/WX6pcFQ7C5I/AAAAAAAAGbo/GYdLBgVqY78ZEllZ971WoHmBAbnDRayAgCK4BGAYYCw/s400/apache.png" alt="" /></p>
<p>Spark defines two types of dependency relation that can capture data dependency among a set of RDDs:</p>
<ul>
<li>Narrow dependency. Narrow dependency means each partition of the parent RDD is used by at most one partition of the child RDD.</li>
<li>Shuffle dependency (wide dependency). Wide dependency means multiple child partitions of RDD may depend on a single parent RDD partition.</li>
</ul>
<p>Narrow dependencies are good for efficient execution, whereas wide dependencies introduce bottlenecks since they disrupt pipelining and require communication intensive shuffle operations.</p>
<h2 id="fault-tolerance">Fault tolerance</h2>
<p>Spark uses the DAG to track the lineage of operations on RDDs. For shuffle dependency, the intermediate records from one stage are materialized on the machines holding parent partitions. This intermediate data is used for simplifying failure recovery. If a task fails, the task will be retried as long as its stage’s parents are still accessible. If some stages that are required are no longer available, the missing partitions will be re-computed in parallel.</p>
<p>Spark is unable to tolerate a scheduler failure of the driver, but this can be addressed by replicating the metadata of the scheduler. The task scheduler monitors the state of running tasks and retries failed tasks. Sometimes, a slow straggler task may drag the progress of a Spark job.</p>
<h2 id="machine-learning-on-spark">Machine learning on Spark</h2>
<p>Spark was designed for general data processing, and not specifically for machine learning. However, using MLlib for Spark, it is possible to do ML on Spark. In the basic setup, Spark stores the model parameters in the driver node, and the workers communicate with the driver to update the parameters after each iteration. For large scale deployments, the model parameters may not fit into the driver and would be maintained as an RDD. This introduces a lot of <strong>overhead</strong>, because a new RDD will need to be created in each iteration to hold the updated model parameters. Updating the model involves shuffling data across machines/disks, which limits the scalability of Spark. This is where the basic dataflow model (the DAG) in Spark falls short: Spark does not support the iterations needed in ML well.</p>
<h1 id="pmls">PMLS</h1>
<p>PMLS was designed specifically for ML with a clean slate. It introduced the parameter-server (PS) abstraction for serving the iteration-intensive ML training process.</p>
<p>In PMLS, a worker process/thread is responsible for requesting up-to-date model parameters and carrying out computation over a partition of data, and a parameter-server thread is responsible for storing and updating model parameters and responding to requests from workers.</p>
<p>Figure below shows the architecture of PMLS.
<img src="https://3.bp.blogspot.com/-cFL80lqWCCo/WX6pk2jzcdI/AAAAAAAAGb4/XFYSzGWsD6UPhrewWEll5w61g-vbYAYYwCK4BGAYYCw/s400/pmlsArch.png" alt="" /></p>
<ul>
<li>The parameter server is implemented as distributed tables. All model parameters are stored via these tables. A PMLS application can register more than one table. These tables are maintained by server threads. Each table consists of multiple rows. Each cell in a row is identified by a column ID and typically stores one parameter. The rows of the tables can be stored across multiple servers on different machines.</li>
<li>Workers are responsible for performing the user-defined computation on the partitioned dataset in each iteration, and need to request up-to-date parameters for their computation. Each worker may contain multiple working threads. There is no communication across workers. Instead, workers only communicate with servers.</li>
<li>“Worker” and “server” are not necessarily physically separated. In fact, server threads co-locate with the worker processes/threads in PMLS.</li>
</ul>
<h2 id="error-tolerance-of-ml-algorithm">Error tolerance of ML algorithms</h2>
<p>PMLS exploits the error-tolerant property of many machine learning algorithms to make a trade-off between efficiency and consistency.</p>
<p>In order to leverage this error-tolerant property, PMLS follows the Stale Synchronous Parallel (SSP) model. In the SSP model, worker threads can proceed without waiting for slow threads.</p>
<blockquote>
<p>Fast threads may carry out computation using stale model parameters. Performing computation on a stale version of the model parameters does cause errors; however, these errors are bounded.</p>
</blockquote>
<p>The communication protocol between workers and servers guarantees that the model parameters a worker thread reads from its local cache are of bounded staleness.</p>
<h2 id="fault-tolerance-1">Fault tolerance</h2>
<p>Fault tolerance in PMLS is achieved by checkpointing the model parameters in the parameter server periodically. To resume from a failure, the whole system restarts from the last checkpoint.</p>
<h2 id="programing-interface">Programming interface</h2>
<p>PMLS is written in C++.</p>
<p>While PMLS has very little overhead, on the negative side, the users of PMLS need to know how to handle computation using relatively low-level APIs.</p>
<h1 id="tensorflow">TensorFlow</h1>
<p>TensorFlow is Google’s second-generation distributed machine learning system, succeeding the parameter-server-based DistBelief.
In TensorFlow the computation is abstracted and represented by a directed graph. But unlike traditional dataflow systems, TensorFlow allows nodes to represent computations that own or update mutable state.</p>
<ul>
<li>Variable: a stateful operation that owns a mutable buffer, and can be used to store model parameters that need to be updated at each iteration.</li>
<li>Node: represents operations, and some operations are control flow operations.</li>
<li>Tensors: values that flow along the directed edges in the TensorFlow graph, represented as arrays of arbitrary dimensionality.
<ul>
<li>An operation can take in one or more tensors and produce a result tensor.</li>
</ul>
</li>
<li>Edge: special edges called control dependencies can be added into TensorFlow’s dataflow graph with no data flowing along such edges.</li>
</ul>
<p>In summary, TensorFlow is a dataflow system that offers mutable state and allows cyclic computation graph, and as such enables training a machine learning algorithm with parameter-server model.</p>
<h2 id="architecture">Architecture</h2>
<p>The TensorFlow runtime consists of three main components: client, master, and worker.</p>
<ul>
<li>client: holds a session where a user can define a computational graph to run. When a client requests the evaluation of a TensorFlow graph via a session object, the request is sent to the master service.</li>
<li>master: schedules the job over one or more workers and coordinates the execution of the computational graph.</li>
<li>worker: Each worker handles requests from the master and schedules the execution of the kernels (The implementation of an operation on a particular device is called a kernel) in the computational graph. The dataflow executor in a worker dispatches the kernels to local devices and runs the kernels in parallel when possible.</li>
</ul>
<h2 id="characteristics">Characteristics</h2>
<h3 id="node-placement">Node Placement</h3>
<p>If multiple devices are involved in a computation, a procedure called node placement is executed in the TensorFlow runtime. TensorFlow uses a cost model to estimate the cost of executing an operation on all available devices (such as CPUs and GPUs) and assigns the operation to a suitable device, subject to implicit or explicit device constraints in the graph.</p>
<h3 id="sub-graph-execution">Sub-graph execution</h3>
<p>TensorFlow supports sub-graph execution. A single round of executing a graph/sub-graph is called a step.</p>
<p>A training application contains two types of jobs: parameter server (ps) jobs and worker jobs. Like data parallelism in PMLS, TensorFlow’s data-parallel training involves multiple tasks in a worker job training the same model on different minibatches of data, updating shared parameters hosted in one or more tasks in a ps job.</p>
<h3 id="a-typical-replicated-training-structure-between-graph-replication">A typical replicated training structure: between-graph replication</h3>
<p><img src="https://1.bp.blogspot.com/-LToYY4Kj2YE/WX6pod_r5pI/AAAAAAAAGcA/Ls-ZWfTebYk_sc3l2pCHRAWv9e6U_eT_gCK4BGAYYCw/s400/tf.png" alt="" /></p>
<p>There is a separate client for each worker task, typically in the same process as the worker task. Each client builds a similar graph containing the parameters (pinned to ps) and a single copy of the compute-intensive part of the computational graph that is pinned to the local task in the worker job.</p>
<p>For example, a compute-intensive part is to compute gradient during each iteration of stochastic gradient descent algorithm.</p>
<p>Users can also specify the consistency model in between-graph replicated training as either synchronous training or asynchronous training:</p>
<ul>
<li>In asynchronous mode, each replica of the graph has an independent training loop that executes without coordination.</li>
<li>In synchronous mode, all of the replicas read the same values for the current parameters, compute gradients in parallel, and then apply them to stateful accumulators, which act as barriers for updating variables.</li>
</ul>
<h2 id="fault-tolerance-2">Fault tolerance</h2>
<p>TensorFlow provides user-controllable checkpointing for fault tolerance via primitive operations: <em>save</em> writes tensors to a checkpoint file, and <em>restore</em> reads tensors from a checkpoint file.
TensorFlow allows customized fault tolerance mechanisms through these primitive operations, which gives users the ability to strike a balance between reliability and checkpointing overhead.</p>
<h1 id="mxnet">MXNet</h1>
<p>Similar to TensorFlow, MXNet is a dataflow system that allows cyclic computation graphs with mutable states, and supports training with parameter server model. Similar to TensorFlow, MXNet provides good support for data-parallelism on multiple CPU/GPU, and also allows model-parallelism to be implemented.
MXNet allows both synchronous and asynchronous training.</p>
<h2 id="characteristics-1">Characteristics</h2>
<p>The figure below illustrates the main components of MXNet. The runtime dependency engine analyzes the dependencies in computation processes and parallelizes the computations that are not dependent. On top of the runtime dependency engine, MXNet has a middle layer for graph and memory optimization.</p>
<p><img src="https://raw.githubusercontent.com/dmlc/dmlc.github.io/master/img/mxnet/system/overview.png" alt="" /></p>
<h2 id="fault-tolerance-3">Fault tolerance</h2>
<p>MXNet supports basic fault tolerance through checkpointing, and provides save and load model operations. The save operation writes the model parameters to the checkpoint file and the load operation reads model parameters from the checkpoint file.</p>
<h1 id="reference">Reference</h1>
<ul>
<li>Zhang, Kuo; Alqahtani, Salem; Demirbas, Murat. ‘A Comparison of Distributed Machine Learning Platforms’, ICCCN, 2017.</li>
</ul>
<p>This post is used for study purposes only.</p>
Bias-variance decomposition in a nutshell2016-12-08T00:00:00+00:00/biasvariance<p>This post aims to give you a nutshell description of the bias-variance decomposition.</p>
<h2 id="basic-setting">Basic setting</h2>
<p>We will derive the key results of the bias-variance decomposition.</p>
<p>Let us assume $f_{true}(x_n)$ is the true model, and the observations are given by:</p>
<p>\[y_n = f_{true}(x_n) + \epsilon_n \qquad (1)\]</p>
<p>where $\epsilon_n$ are i.i.d. with zero mean and variance $\sigma^2$. Note that $f_{true}$ can be nonlinear and $\epsilon_n$ does not have to be Gaussian.</p>
<p>We denote the least-square estimation by</p>
<p>\[f_{lse}(x_{\ast}) = \tilde{x}_{\ast}^T w_{lse} \]</p>
<p>where the tilde symbol means a constant-1 feature has been added to the raw data. For this derivation, we will assume that $x_{\ast}$ is fixed, although it is straightforward to generalize this.</p>
<h2 id="expected-test-error">Expected Test Error</h2>
<p>Bias-variance comes directly out of the test error:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\overline{teErr} &= \mathbb{E}[(\text{observation} - \text{prediction})^2] \qquad (2.1) \\
&= \mathbb{E}_{D_{tr},D_{te}} [(y_\ast - f_{lse})^2] \qquad (2.2)\\
&= \mathbb{E}_{y_\ast, w_{lse}} [(y_\ast - f_{lse})^2] \qquad (2.3)\\
&= \mathbb{E}_{y_\ast, w_{lse}} [(y_\ast - f_{true} + f_{true} - f_{lse})^2] \qquad (2.4) \\
&= \mathbb{E}_{y_\ast}[(y_{\ast} - f_{true})^2] + \mathbb{E}_{w_{lse}} [(f_{lse} - f_{true})^2] \qquad (2.5)\\
&= \sigma^2 + \mathbb{E}_{w_{lse}} [(f_{lse} - \mathbb{E}_{w_{lse}} [f_{lse}] + \mathbb{E}_{w_{lse}} [f_{lse}] - f_{true})^2] \qquad (2.6)\\
&= \sigma^2 + \mathbb{E}_{w_{lse}} [(f_{lse} - \mathbb{E}_{w_{lse}} [f_{lse}])^2] + (\mathbb{E}_{w_{lse}} [f_{lse}] - f_{true})^2 \qquad (2.7)\\
\end{align*} %]]></script>
<p>Here equation (2.2) takes the expectation over the training and test data; the cross term between (2.4) and (2.5) vanishes because the test noise $\epsilon_\ast$ has zero mean and is independent of the training data. The second term in equation (2.7) is called the <strong>predict variance</strong>, and the third term is the square of the <strong>predict bias</strong>. Thus comes the name bias-variance decomposition.</p>
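<p>The decomposition in equation (2.7) can be verified numerically by repeatedly drawing training sets, fitting least squares, and predicting at a fixed test input $x_\ast$. A small simulation sketch; the true function, noise level, and sample sizes are arbitrary choices:</p>

```python
import numpy as np

rng = np.random.default_rng(6)
sigma, x_test = 0.5, 0.8                  # noise std and fixed test input x_*
f_true = lambda x: np.sin(2 * np.pi * x)  # a nonlinear true model

def lse_prediction(n=30):
    """Draw one training set, fit linear least squares, predict at x_test."""
    x = rng.uniform(0, 1, n)
    y = f_true(x) + sigma * rng.standard_normal(n)
    X = np.column_stack([np.ones(n), x])  # prepend the constant-1 feature
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w[0] + w[1] * x_test

preds = np.array([lse_prediction() for _ in range(5_000)])

variance = preds.var()                             # predict variance
bias2 = (preds.mean() - f_true(x_test)) ** 2       # squared predict bias
y_star = f_true(x_test) + sigma * rng.standard_normal(preds.size)
test_err = np.mean((y_star - preds) ** 2)          # empirical E[(y* - f_lse)^2]

print(test_err, sigma**2 + variance + bias2)  # the two sides agree closely
```

<p>The empirical test error and $\sigma^2 + \text{variance} + \text{bias}^2$ agree up to Monte-Carlo error.</p>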
<h2 id="where-does-the-bias-come-from-model-bias-and-estimation-bias">Where does the bias come from? model bias and estimation bias</h2>
<p>As illustrated in the following figure, bias comes from model bias and estimation bias. Model bias comes from the model itself, while estimation bias comes (mainly) from the dataset. Bear in mind that ridge regression increases the estimation bias while reducing the variance (you may need to consult other references for this point).</p>
<p><img src="/assets/imgblog/bias-variance.png" alt="Where does bias come from?" /></p>
<h2 id="references">References</h2>
<ul>
<li>Kevin, Murphy. “Machine Learning: a probabilistic perspective.” (2012).</li>
<li>Bishop, Christopher M. “Pattern recognition.” Machine Learning 128 (2006).</li>
<li>Emtiyaz Khan’s lecture notes on PCML, 2015</li>
</ul>
On Saddle Points: a painless tutorial2016-09-07T00:00:00+00:00/SaddlePoints<p>Are we really stuck in the local minima rather than anything else?</p>
<h2 id="different-types-of-critical-points">Different types of critical points</h2>
<hr />
<div class="fig figcenter fighighlight">
<img src="/assets/blog/updatemethods/minmaxsaddle.png" width="100%" />
<div class="figcaption">
Various Types of Critical Points. Source: Rong Ge's blog.
</div>
</div>
<hr />
<p>To minimize the function \(f:\mathbb{R}^n\to \mathbb{R}\), the most popular approach is to follow the opposite direction of the gradient \(\nabla f(x)\) (for simplicity, all functions we talk about are infinitely differentiable), that is,</p>
<script type="math/tex; mode=display">y = x - \eta \nabla f(x),</script>
<p>Here \(\eta\) is a small step size. This is the <em>gradient descent</em> algorithm.</p>
<p>Whenever the gradient \(\nabla f(x)\) is nonzero, as long as we choose a small enough \(\eta\), the algorithm is guaranteed to make <em>local</em> progress. When the gradient \(\nabla f(x)\) is equal to \(\vec{0}\), the point is called a <strong>critical point</strong>, and gradient descent algorithm will get stuck. For (strongly) convex functions, there is a unique <em>critical point</em> that is also the <em>global minimum</em>.</p>
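<p>For concreteness, a minimal implementation of gradient descent on a strongly convex quadratic; the step size and iteration count are illustrative:</p>

```python
import numpy as np

def gradient_descent(grad, x0, eta=0.05, steps=500):
    """Follow the opposite direction of the gradient with step size eta."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - eta * grad(x)
    return x

# strongly convex example: f(x) = 1/2 x^T diag(1, 10) x, unique minimum at 0
grad = lambda x: np.array([1.0, 10.0]) * x
x_min = gradient_descent(grad, [5.0, 5.0])
print(x_min)   # converges to the global minimum at the origin
```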
<p>However, this is not always the case. All critical points of \( f(x) \) can be further characterized by the curvature of the function in their vicinity, specifically by the eigenvalues of the Hessian matrix. Here I describe three possibilities, as shown in the figure above:</p>
<ul>
<li>If all eigenvalues are non-zero and positive, then the critical point is a local minimum.</li>
<li>If all eigenvalues are non-zero and negative, then the critical point is a local maximum.</li>
<li>If the eigenvalues are non-zero, and both positive and negative eigenvalues exist, then the critical point is a saddle point.</li>
</ul>
<p>The proof of the above three possibilities follows from a reparametrization of the space using the eigenvectors of the Hessian matrix. The Taylor expansion is given by (the first-order term vanishes at a critical point):</p>
<script type="math/tex; mode=display">f(x+\Delta x) = f(x) + \frac{1}{2} (\Delta x)^T \mathbf{H} \Delta x \qquad (1)</script>
<p>Assume \(\mathbf{e_1}, \mathbf{e_2}, \dots, \mathbf{e_n}\) are the eigenvectors and \(\lambda_1, \lambda_2, \dots, \lambda_n\) the corresponding eigenvalues. We can reparametrize the space by:</p>
<script type="math/tex; mode=display">\Delta v = \frac{1}{\sqrt{2}} \begin{bmatrix} \mathbf{e_1}^T\\ \vdots \\ \mathbf{e_n}^T \end{bmatrix} \Delta x</script>
<p>Then combined with Taylor expansion, we can get the following equation:</p>
<script type="math/tex; mode=display">f(x+ \Delta x) = f(x)+\frac{1}{2} \sum_{i=1}^n \lambda_i(\mathbf{e_i}^T \Delta x)^2 = f(x) + \sum_{i=1}^n \lambda_i \Delta \mathbf{v_i}^2</script>
<p>For the proof of the above equation, you may need to look at <a href="https://inst.eecs.berkeley.edu/~ee127a/book/login/l_sym_sed.html">Spectrum Theorem</a>, which is related to the eigenvalues and eigenvectors of symmetric matrices.</p>
<p>From this equation, all the three scenarios for critical points are self-explained.</p>
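<p>These three scenarios translate directly into a small numerical test: compute the eigenvalues of the Hessian at a critical point and inspect their signs. A sketch:</p>

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point from the eigenvalues of its (symmetric) Hessian."""
    eig = np.linalg.eigvalsh(H)
    if np.any(np.abs(eig) < tol):
        return "degenerate"        # a zero eigenvalue: second-order test fails
    if np.all(eig > 0):
        return "local minimum"
    if np.all(eig < 0):
        return "local maximum"
    return "saddle point"          # both positive and negative eigenvalues

# f(x, y) = x^2 - y^2 has a saddle at the origin: Hessian = diag(2, -2)
print(classify_critical_point(np.diag([2.0, -2.0])))   # saddle point
print(classify_critical_point(np.diag([2.0, 2.0])))    # local minimum
```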
<h2 id="first-order-method-to-escape-from-saddle-point">First order method to escape from saddle point</h2>
<p>A <a href="http://www.offconvex.org/2016/03/22/saddlepoints/">post</a> by Rong Ge introduced a first order method to escape from saddle points. He claimed that saddle points are very <em>unstable</em>: if we put a ball on a saddle point and slightly perturb it, the ball is likely to fall to a local minimum. This happens especially when the second order term \(\frac{1}{2} (\Delta x)^T \mathbf{H} \Delta x\) can be significantly smaller than 0 (there is a steep direction in which the function value decreases, and assume we are looking for a local minimum); such a function is called a <em>Strict Saddle Function</em> in Rong Ge’s post. In this case we can use <em>noisy gradient descent</em>:</p>
<blockquote>
<p>\(y = x - \eta \nabla f(x) + \epsilon.\)</p>
</blockquote>
<p>where \(\epsilon\) is a noise vector with mean \(\mathbf{0}\). This is also the basic idea behind <em>stochastic gradient descent</em>, which uses the gradient of a mini-batch rather than the true gradient. However, the drawback of stochastic gradient descent is not the direction of the step but its size along each eigenvector: the step along direction \(\mathbf{e_i}\) is proportional to \(-\lambda_i \Delta \mathbf{v_i}\), so in directions whose eigenvalues have small absolute value, the step is small. To make this concrete, consider an error surface whose curvature differs across directions: a long, narrow valley. The component of the gradient along the base of the valley is very small, while the component perpendicular to the valley walls is quite large, even though we need to move a long distance along the base and only a small distance perpendicular to the walls. This phenomenon is illustrated in the following figure:</p>
<hr />
<div class="fig figcenter fighighlight">
<img src="/assets/blog/updatemethods/without_momentum.png" width="70%" />
<div class="figcaption">
SGD optimization routes
</div>
</div>
<hr />
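<p>A minimal sketch of the noisy update above (my own toy example; the function \(f(x, y) = x^2 - y^2\) and all constants are illustrative, not from Rong Ge’s post):</p>

```python
import numpy as np

def grad(p):
    # Gradient of the toy strict-saddle function f(x, y) = x^2 - y^2.
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

def f(p):
    return p[0] ** 2 - p[1] ** 2

rng = np.random.default_rng(0)
p = np.zeros(2)                     # start exactly at the saddle point
eta = 0.1
for _ in range(100):
    noise = rng.normal(scale=0.01, size=2)
    p = p - eta * grad(p) + noise   # y = x - eta * grad f(x) + epsilon

# Plain gradient descent would be stuck (the gradient at the saddle is zero);
# the noise pushes the iterate onto the negative-curvature direction y,
# along which the function value keeps decreasing.
print(p, f(p))
```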
<p>We normally move by a step that is some constant times the negative gradient, rather than a step of constant length in the direction of the negative gradient. This means that in steep regions (where we must be careful not to make our steps too large) we move quickly, and in shallow regions (where we need big steps) we move slowly.</p>
<h2 id="newton-methods">Newton methods</h2>
<p>For the details of Newton’s method, you can follow the proof in Sam Roweis’s note in the reference list. Newton’s method solves the slowness problem by rescaling the gradient in each direction by the inverse of the corresponding eigenvalue, yielding the step \(-\Delta \mathbf{v_i}\) (because \(\frac{1}{\lambda_i}\mathbf{e_i} = \mathbf{H}^{-1}\mathbf{e_i} \)). However, this approach can move in the wrong direction when the eigenvalue is negative: the Newton step then moves along the eigenvector in a direction <strong>opposite</strong> to the gradient descent step, thus increasing the error.</p>
<p>Following the idea of the Levenberg gradient descent method, we can use damping: we remove negative curvature by adding a constant \(\alpha\) to the diagonal of the Hessian. Informally, \(x^{k+1} = x^{k} - (\mathbf{H}+\alpha \mathbf{I})^{-1} \mathbf{g_k}\). We can view \(\alpha\) as a tradeoff between Newton’s method and gradient descent: when \(\alpha\) is small, the update is closer to Newton’s method; when \(\alpha\) is large, it is closer to gradient descent. In this case the step becomes \(-\frac{\lambda_i}{\lambda_i + \alpha}\Delta \mathbf{v_i}\). The drawback of the damped Newton method is therefore clear: a large damping factor \(\alpha\) can make the step small in many eigen-directions.</p>
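<p>A small sketch of the damped step (my own example; the Hessian, gradient, and \(\alpha\) values are illustrative):</p>

```python
import numpy as np

def damped_newton_step(H, g, alpha):
    # x_{k+1} = x_k - (H + alpha * I)^{-1} g_k, solved without forming the inverse.
    return -np.linalg.solve(H + alpha * np.eye(H.shape[0]), g)

# An indefinite Hessian with eigenvalues 2 and -1, and a gradient g.
H = np.diag([2.0, -1.0])
g = np.array([1.0, 1.0])

# Undamped Newton (alpha = 0) flips the sign along the negative-curvature
# direction: the second component of the step points *with* the gradient.
print(damped_newton_step(H, g, 0.0))   # step (-0.5, 1.0)
# With alpha > |lambda_min| the step is a descent direction again.
print(damped_newton_step(H, g, 2.0))   # step (-0.25, -1.0)
```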
<h2 id="saddle-free-newton-method">Saddle free newton method</h2>
<p>(Dauphin et al., 2014) introduced the saddle-free Newton method, a modified trust-region approach. It minimizes the first-order Taylor expansion, constrained by the distance between the first-order and second-order Taylor expansions. With this constraint, unlike gradient descent, it can move further in directions of low curvature and less in directions of high curvature. I recommend reading the paper thoroughly.</p>
<h2 id="future-post">Future post</h2>
<p>A future post will discuss degenerate critical points, where the Hessian matrix has only positive and zero eigenvalues.</p>
<h2 id="marks">Remarks</h2>
<p>The slides for the talk accompanying this blog post can be found at this <a href="http://www.junlulocky.com/assets/talks/2016onsaddlepoints.pdf">link</a>. Contact me if the link is not working.</p>
<h2 id="references">References</h2>
<ul>
<li><a href="https://inst.eecs.berkeley.edu/~ee127a/book/login/l_sym_sed.html">Berkeley Optimization Models: Spectral Theorem</a></li>
<li>Dauphin, Yann N., et al. <em>Identifying and attacking the saddle point problem in high-dimensional non-convex optimization.</em> Advances in neural information processing systems. 2014.</li>
<li><a href="https://www.cs.nyu.edu/~roweis/notes/lm.pdf">Sam Roweis’s note on Levenberg-Marquardt Optimization</a></li>
<li>Rong Ge, <em>Escaping from Saddle Points</em>, Off the convex path blog, 2016</li>
<li>Benjamin Recht, <em>Saddles Again</em>, Off the convex path blog, 2016</li>
</ul>
Normalizations in Neural Networks2016-08-03T00:00:00+00:00/Normalizations-in-neural-networks<p>This post will introduce some normalization related tricks in neural networks.</p>
<!--more-->
<h2 id="normalization-and-equalization">Normalization and Equalization</h2>
<p>In image processing, “normalization” goes by many other names, such as contrast stretching, histogram stretching, or dynamic range expansion.
Suppose you have an 8-bit grayscale image whose minimum and maximum pixel values are 50 and 180. We can normalize this image to a larger dynamic range, say 0 to 255. After normalization, the previous 50 becomes 0 and 180 becomes 255, and the values in between are scaled according to the following formula:</p>
<p>I_new = (I_old − I_old_min) × (I_new_max − I_new_min) / (I_old_max − I_old_min) + I_new_min</p>
<p><br />
<img src="http://yeephycho.github.io/blog_img/normalization.jpg" alt="Normalization" /></p>
<p>It’s a typical linear transform. Continuing the earlier example, the pixel value 70 becomes (70−50)×(255−0)/(180−50) + 0 = 39, and 130 becomes (130−50)×(255−0)/(180−50) + 0 = 156.
The image above shows an image before and after normalization; the third image shows the effect of another transform called <a href="https://en.wikipedia.org/wiki/Histogram_equalization">histogram equalization</a>. For your information, histogram equalization is different from normalization: normalization does not change the shape of your image’s histogram, but equalization does. Histogram equalization doesn’t care about a pixel’s intensity value itself; what matters is the ranking of that intensity within the whole image.
The maximum intensity of the original image is 238 and the minimum is 70; the transforms were implemented with OpenCV. (OpenCV’s normalization function isn’t the normalization we are talking about; if you want to reproduce the effect, you have to implement it yourself.)
For normalization, the new intensity derives from the new and old maximum and minimum intensities; for equalization, it derives from the intensity’s ranking in the whole image. For example, if an image has 64 pixels and a certain pixel has intensity 90, with 22 pixels of lower intensity and 41 pixels of higher intensity, the new intensity of that pixel after equalization is (22/64) × (255−0) = 87.</p>
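<p>The linear transform above takes only a few lines of NumPy. This is a sketch of my own, not the OpenCV-based code used for the figure:</p>

```python
import numpy as np

def stretch(img, new_min=0, new_max=255):
    # Linear contrast stretching: map [old_min, old_max] onto [new_min, new_max].
    img = img.astype(np.float64)            # avoid uint8 overflow during scaling
    old_min, old_max = img.min(), img.max()
    scaled = (img - old_min) * (new_max - new_min) / (old_max - old_min) + new_min
    return scaled.astype(np.uint8)

img = np.array([[50, 70], [130, 180]], dtype=np.uint8)
print(stretch(img))
# 50 -> 0, 70 -> 39, 130 -> 156, 180 -> 255, matching the worked example above.
```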
<h2 id="simplified-whitening">Simplified Whitening</h2>
<p>The full whitening process is a series of linear transforms that make the data have zero mean, unit variance, and no correlations. There is quite a lot of math behind it, which I won’t go into here. (Editing formulas is quite annoying, you know.)
As mentioned above, the purpose of whitening, ICA (Independent Component Analysis), or sphering is to get rid of the correlations in the raw data. For an image, there is a high chance that adjacent pixels have similar intensities; this similarity over the spatial domain is the so-called correlation, and ICA is a way to reduce it.
Usually, in neural networks we use simplified whitening instead of full ICA, because the computational burden of ICA is too heavy for big data (say, millions of images).
<img src="http://yeephycho.github.io/blog_img/simplified_whitening.jpg" alt="simplified whitening" />
Too tired to explain this formula, maybe later, forgive me…</p>
<p>Let’s presume you have 100 grayscale images to process, each image has a width of 64 and height of 64, conventions are described as follows:</p>
<ul>
<li>First, calculate the mean and standard deviation (square root of variance) across the batch for pixels that have the same x and y coordinates.</li>
<li>Then, for each pixel, subtract the mean and divide by the standard deviation.</li>
</ul>
<p>For example, among the 100 images, collect the intensities at position (0, 0); you will have 100 values. Calculate the mean and standard deviation of these 100 values, then, for each of them, subtract the mean and divide by the standard deviation. Repeat the same process for every other position; in this example, you iterate 64×64 times in total.
After this process, each dimension of the data set has zero mean and unit variance along the batch axis. The similarity has been reduced (in my understanding, the first-order similarity is gone, but higher-order similarity remains; that is where full ICA takes place, to wipe out the higher-order similarity).
With whitening, the network converges faster than without it.</p>
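<p>The per-pixel recipe above vectorizes naturally. A sketch of my own, with random data standing in for the 100 images:</p>

```python
import numpy as np

# 100 grayscale 64x64 images; random data stands in for a real dataset.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(100, 64, 64)).astype(np.float64)

# Per-position statistics across the batch axis, as in the two steps above.
mean = images.mean(axis=0)          # shape (64, 64)
std = images.std(axis=0)            # shape (64, 64)
whitened = (images - mean) / std

# Every pixel position now has zero mean and unit variance across the batch.
print(whitened.mean(axis=0).max(), whitened.std(axis=0).min())
```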
<h2 id="local-constrast-normalization-lcn">Local Contrast Normalization (LCN)</h2>
<p>Related papers are listed below:
<a href="http://journals.plos.org/ploscompbiol/article?id=10.1371%2Fjournal.pcbi.0040027">Why is Real-World Visual Object Recognition Hard?</a>, published in Jan. 2008, where it is called “local input divisive normalization”.
<a href="http://www.cns.nyu.edu/pub/lcv/lyu08b.pdf">Nonlinear Image Representation Using Divisive Normalization</a>, published in Jun. 2008, where it is called “divisive normalization”.
<a href="http://yann.lecun.com/exdb/publis/pdf/jarrett-iccv-09.pdf">What is the Best Multi-Stage Architecture for Object Recognition?</a>, released in 2009, where it is called “local contrast normalization”.
Whitening normalizes the data across dimensions to reduce the correlations among the data; local contrast normalization, whose idea is inspired by computational neuroscience, instead aims to make the features in feature maps more salient.</p>
<p>This (local contrast normalization) module performs local subtraction and division normalizations, enforcing a sort of local competition between adjacent features in a feature map, and between features at the same spatial location in different feature maps.</p>
<p>Local contrast normalization is implemented as follows:</p>
<ul>
<li>First, for each pixel in a feature map, find its adjacent pixels. Let’s say the radius is 1, so there are 8 pixels around the target pixel (do the zero padding if the target is at the edge of the feature map).</li>
<li>Then, compute the mean of these 9 pixels (8 neighbor pixels and the target pixel itself), subtract the mean for each one of the 9 pixels.</li>
<li>Next, compute the standard deviation of these 9 pixels and check whether it is larger than 1. If so, divide the target pixel’s value (after mean subtraction) by the standard deviation; otherwise, keep the target’s value as it is (after mean subtraction).</li>
<li>At last, save the target pixel value to the same spatial position of a blank feature map as the input of the following CNN stages.</li>
</ul>
<p>I typed the following python code to illustrate the math of the LCN:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="o">>>></span> <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">255</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">)))</span> <span class="c"># generate a random 3x3 matrix, the pixel value is ranging from 0 to 255.</span>
<span class="o">>>></span> <span class="n">x</span>
<span class="n">matrix</span><span class="p">([[</span><span class="mi">201</span><span class="p">,</span> <span class="mi">239</span><span class="p">,</span> <span class="mi">77</span><span class="p">],</span> <span class="p">[</span><span class="mi">139</span><span class="p">,</span> <span class="mi">157</span><span class="p">,</span> <span class="mi">23</span><span class="p">],</span> <span class="p">[</span><span class="mi">235</span><span class="p">,</span> <span class="mi">207</span><span class="p">,</span> <span class="mi">173</span><span class="p">]])</span>
<span class="o">>>></span> <span class="n">mean</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">mean</span>
<span class="mf">161.2222222222223</span>
<span class="o">>>></span> <span class="n">x</span> <span class="o">=</span> <span class="n">x</span> <span class="o">-</span> <span class="n">mean</span>
<span class="o">>>></span> <span class="n">x</span>
<span class="n">matrix</span><span class="p">([[</span><span class="mf">39.77777778</span><span class="p">,</span> <span class="mf">77.77777778</span><span class="p">,</span> <span class="o">-</span><span class="mf">84.22222222</span><span class="p">],</span> <span class="p">[</span><span class="o">-</span><span class="mf">22.22222222</span><span class="p">,</span> <span class="o">-</span><span class="mf">4.22222222</span><span class="p">,</span> <span class="o">-</span><span class="mf">138.22222222</span><span class="p">],</span> <span class="p">[</span><span class="mf">73.77777778</span><span class="p">,</span> <span class="mf">45.77777778</span><span class="p">,</span> <span class="mf">11.77777778</span><span class="p">]])</span>
<span class="o">>>></span> <span class="n">std_var</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">var</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">std_var</span>
<span class="mf">68.328906</span><span class="o">...</span>
<span class="o">>>></span> <span class="n">std_var</span> <span class="o">></span> <span class="mi">1</span>
<span class="bp">True</span>
<span class="o">>>></span> <span class="n">LCN_value</span> <span class="o">=</span> <span class="n">x</span><span class="p">[(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)]</span><span class="o">/</span><span class="n">std_var</span>
<span class="o">>>></span> <span class="n">LCN_value</span>
<span class="o">-</span><span class="mf">0.0617926207</span><span class="o">...</span>
</code></pre></div></div>
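<p>The single-pixel computation above extends to a whole feature map. Here is a sketch of my own implementing the four steps with radius 1 and zero padding (the centre pixel reproduces the REPL result above):</p>

```python
import numpy as np

def lcn(fmap, radius=1):
    # Steps 1-4 above: local mean subtraction, then conditional division
    # by the local standard deviation, with zero padding at the borders.
    fmap = fmap.astype(np.float64)
    padded = np.pad(fmap, radius)
    out = np.empty_like(fmap)
    for i in range(fmap.shape[0]):
        for j in range(fmap.shape[1]):
            window = padded[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            centered = fmap[i, j] - window.mean()
            std = window.std()
            out[i, j] = centered / std if std > 1 else centered
    return out

fmap = np.array([[201, 239, 77], [139, 157, 23], [235, 207, 173]])
print(lcn(fmap)[1, 1])   # about -0.0618, as in the REPL session above
```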
<p>Please note that the real process in a neural network does not look exactly like this: the data is usually whitened before being fed to the network, the image usually isn’t randomly generated, and negative values are usually set to zero by ReLU.
Here we presume that each adjacent pixel contributes equally to the contrast normalization, so we compute the plain mean of the 9 pixels; in practice, the weights for each pixel can vary.
We also presume the neighborhood radius is 1 and the image has only one channel, but the radius can be larger or smaller: you can pick 4 adjacent pixels (up, down, left, right), or 24 pixels (radius 2), or arbitrary pixels at arbitrary positions (though the result may look odd).
In the third paper, they introduced divisive normalization into neural networks with a variation: contrast normalization among adjacent feature maps at the same spatial position (say a pixel selects two adjacent feature maps; the neighbor pixel number is then 3×3×3 − 1). In a conv. neural network, the output of a layer may have many feature maps, and LCN can enhance feature responses in some feature maps while restraining them in others.</p>
<h2 id="local-response-normalization-lrn">Local Response Normalization (LRN)</h2>
<p>This concept was introduced in AlexNet; click <a href="http://yeephycho.github.io/2016/07/21/A-reminder-of-algorithms-in-Convolutional-Neural-Networks-and-their-influences-I/">here</a> to learn more.
The local response normalization algorithm was inspired by real neurons and, as the authors said, “bears some resemblance to the local contrast normalization”. The common point is that both want to introduce competition among neuron outputs; the difference is that LRN does not subtract the mean, and the competition happens among the outputs of adjacent kernels in the same layer.
The formula for LRN is as follows:
<img src="http://yeephycho.github.io/blog_img/local_response_normalization.jpg" alt="Local Response Normalization" /></p>
<p><strong><em>a(i, x, y)</em></strong> is the <em>i</em>-th conv. kernel’s output (after ReLU) at position (x, y) in the feature map.
<strong><em>b(i, x, y)</em></strong> is the output of local response normalization, which is also the input to the next layer.
<strong><em>N</em></strong> is the total number of conv. kernels.
<strong><em>n</em></strong> is the number of adjacent conv. kernels to sum over; this number is up to you. In the article they chose n = 5.
<strong><em>k, α, β</em></strong> are hyper-parameters; in the article they chose <strong><em>k = 2, α = 10e-4, β = 0.75</em></strong>.
<img src="http://yeephycho.github.io/blog_img/local_response_normalization_process.jpg" alt="Local Response Normalization illustration" />
Flowchart of Local Response Normalization</p>
<p>I drew the above figure to illustrate the process of LRN in neural network. Just a few tips here:</p>
<ul>
<li>This graph presumes that the <em>i</em> th kernel is not at the edge of the kernel space. If i equals zero or one or last or one to the last, one or two additional zero padding conv. kernels are required.</li>
<li>In the article, n is 5, we presume n/2 is integer division, 5/2 = 2.</li>
<li>Summation of the squares of output of ReLU stands for: for each output of ReLU, compute its square, then, add the 5 squared value together. This process is the summation term of the formula.</li>
<li>I presume the necessary padding is used by the input feature map so that the output feature maps have the same size of the input feature map, if you really care. But this padding may not be quite necessary.</li>
</ul>
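<p>Putting the formula and the tips together, a minimal sketch of my own (shapes and test values are illustrative; α is written 10e-4 as in the hyper-parameters quoted above):</p>

```python
import numpy as np

def lrn(a, n=5, k=2.0, alpha=10e-4, beta=0.75):
    # a: post-ReLU activations with shape (N, height, width), one slice per kernel.
    N = a.shape[0]
    b = np.empty_like(a, dtype=np.float64)
    half = n // 2                                  # integer division: 5 // 2 = 2
    for i in range(N):
        lo, hi = max(0, i - half), min(N - 1, i + half)
        sq_sum = (a[lo:hi + 1] ** 2).sum(axis=0)   # sum of squares over adjacent kernels
        b[i] = a[i] / (k + alpha * sq_sum) ** beta
    return b

# Uniform activations make the result easy to check by hand: every position
# sees the same sum of squares, so every output equals 1 / (k + alpha * 3) ** beta.
a = np.ones((3, 2, 2))
print(lrn(a)[0, 0, 0])
```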
<p>After knowing what LRN is, the next question is: what does the output of LRN look like?
Because LRN happens after ReLU, the inputs are all non-negative. The following graph tries to give you an intuitive understanding of the output of LRN, though you still need to use your imagination.
<img src="http://yeephycho.github.io/blog_img/LRN.png" alt="Local Response Normalization output" /></p>
<p>Note that the x axis represents the summation of the squared outputs of ReLU, ranging from 0 to 1000, and the y axis represents b(i, x, y) divided by a(i, x, y). The hyper-parameters are set to the article’s defaults.
So the real value of b(i, x, y) is the y-axis value multiplied by a(i, x, y); use your imagination here: two different inputs a(i, x, y) pass through this function. Since the slope at the beginning is very steep, small differences among the inputs are significantly enlarged; this is where the competition happens.
The figure was generated by the following python code:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="o">>>></span> <span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="o">>>></span> <span class="k">def</span> <span class="nf">lrn</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="o">...</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">/</span> <span class="p">(</span><span class="mi">2</span> <span class="o">+</span> <span class="p">(</span><span class="mf">10e-4</span><span class="p">)</span> <span class="o">*</span> <span class="n">x</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span> <span class="o">**</span> <span class="mf">0.75</span>
<span class="o">...</span> <span class="k">return</span> <span class="n">y</span>
<span class="o">>>></span> <span class="nb">input</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1000</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">output</span> <span class="o">=</span> <span class="n">lrn</span><span class="p">(</span><span class="nb">input</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">output</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'sum(x^2)'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'1 / (k + a * sum(x^2))'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<h2 id="batch-normalization">Batch Normalization</h2>
<p>I summarized the related paper in <a href="http://yeephycho.github.io/2016/08/02/A-reminder-of-algorithms-in-Convolutional-Neural-Networks-and-their-influences-II/">another blog</a>.
Batch normalization, at first glance, is quite difficult to understand. It truly introduced something new to CNNs: a kind of learnable whitening applied to the inputs of the non-linear activations (ReLUs or sigmoids).
You can view the BN operation (abbreviated op. in the rest of this post) as a simplified whitening of the data in an intermediate layer of the network. In the original paper, I think the BN op. happens after the conv. op. but before the ReLU or sigmoid op.
But BN is not that simple, for two reasons: first, the scale and shift parameters derived from the means and variances (γ and β below) are learned through back-propagation; second, the training is mini-batch training, neither purely online training nor full-batch training. I explain these ideas below.
First, let’s review the simplified whitening formula:
<img src="http://yeephycho.github.io/blog_img/simplified_whitening.jpg" alt="simplified whitening" /></p>
<p>Then, following a similar idea, batch normalization defines two trainable parameters: one associated with the mean, the other with the variance (or its square root, the standard deviation). See the algorithms and formulas on pages 3 and 4 of the <a href="https://arxiv.org/pdf/1502.03167.pdf">original paper</a>.
<img src="http://yeephycho.github.io/blog_img/bn_train.jpg" alt="Batch normalization algorithms" /></p>
<p>Online training means that when you train your network, each time you feed only one instance to the network, calculate the loss at the last layer, and, based on the loss of this single instance, use back-propagation to adjust the network’s parameters. Batch training means you feed all your data to the network, calculate the loss over the whole dataset, and do the BP learning based on the total loss. Mini-batch training means you feed a small part of your training data to the network, calculate the total loss of that part at the last layer, and do the BP learning based on it.
Online training usually suffers from noise: the adjustments are quite noisy. But if training runs on a single-threaded CPU, online training is believed to be the fastest scheme, and you can use a larger learning rate.
Batch training gives a better estimate of the gradient, so the training is less noisy; but batch training should be carefully initialized and the learning rate should be small, so training is believed to be slow.
Mini-batch training is a compromise between online training and batch training. It uses a batch of data to estimate the gradient, so the learning is less noisy. Both batch and mini-batch training can take advantage of parallel computing, such as multi-threading or GPU computing, so they are much faster than single-threaded training.
Batch normalization, of course, uses mini-batch training. For ImageNet classification, the authors chose a batch size of 32: every iteration they feed 32 images to the network to calculate the loss and estimate the gradient. Each image is 224×224 pixels, i.e. 50176 dimensions per channel.
Let’s take an intermediate conv. layer to illustrate the BN op.
<img src="http://yeephycho.github.io/blog_img/bn_process.jpg" alt="Batch normalization process" /></p>
<p>The tricky part is that γ and β are initialized from the batch standard deviations and means, but then trained with respect to the loss of the network through back-propagation.
Why train γ and β instead of using the batch standard deviation and mean directly? You might think the direct route would be a better way to reduce correlations and shifts. The reason is that it can be proved (or observed) that naively subtracting the mean and dividing by the variance does not help the network; take the mean as an example: the bias unit in the network simply makes up for the removed mean.
In my opinion, batch normalization tries to find a balance between simplified whitening and the raw data. As stated in the paper, the initial transform is an identity transform of the data; after γ and β are trained, I believe the transform is no longer an identity. They also say BN is a way to address internal covariate shift, i.e. the shifting distributions of the inputs to different layers. By their account, BN is a significant improvement to the network architecture, which I believe, but I don’t think it can entirely get rid of the distribution shift; as the title of the paper says, it improves the network by “Reducing Internal Covariate Shift”.
One last thing: when using a network trained with BN for inference, a further step involving γ and β is needed; see lines 8–11 of Alg. 2 for how to implement it. The idea is that the trained γ and β in the model need to be combined with the <a href="https://en.wikipedia.org/wiki/Maximum_likelihood_estimation">maximum likelihood estimates</a> of the global variance and mean.</p>
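<p>A sketch of the training-time forward transform from Alg. 1 (my own NumPy version, for fully connected activations of shape (batch, features); the paper applies it per channel for conv. layers):</p>

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # Training-time BN transform: normalize each feature over the
    # mini-batch, then apply the learned scale and shift.
    mu = x.mean(axis=0)                     # mini-batch mean
    var = x.var(axis=0)                     # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
    return gamma * x_hat + beta             # y = gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))   # batch of 32, 4 features
# With gamma = 1 and beta = 0 this reduces to the simplified whitening:
# the output has (approximately) zero mean and unit variance per feature.
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.std(axis=0))
```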
<p><br /></p>
<h2 id="license">License</h2>
<p>The content of this blog itself is licensed under the <a href="https://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution 4.0 International License</a>.
<img src="http://yeephycho.github.io/blog_img/license.jpg" alt="CC-BY-SA LICENCES" /></p>
<p>The containing source code (if applicable) and the source code used to format and display that content is licensed under the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License 2.0</a>.
Copyright [2016] [yeephycho]
Licensed under the Apache License, Version 2.0 (the “License”);
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
<a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License 2.0</a>
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an “AS IS” BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
express or implied. See the License for the specific language
governing permissions and limitations under the License.
<img src="http://yeephycho.github.io/blog_img/APACHE.jpg" alt="APACHE LICENCES" /></p>
The math behind Gradient Descent2016-06-03T00:00:00+00:00/gradientdescentmath<p>This post means to help starters to understand the math behind Gradient Descent (GD).</p>
<h2 id="intuitive-understanding">Intuitive understanding</h2>
<p>An intuitive way to think of gradient descent is to imagine the path of a river originating at the top of a mountain. The goal of gradient descent is exactly what the river strives to achieve: reach the lowest point at the foothill by climbing down the mountain. That’s what your machine learning teacher taught you, right? But do you only understand this picture and use gradient descent naively every time? This post will help you understand the math behind gradient descent.</p>
<h2 id="math-behind-gradient-descent">Math behind Gradient Descent</h2>
<p>Here I define the objective function to be \(L(x,w,b)\), where the input variable \(x\) of \(L\) is \(d\)-dimensional, \(w\) is the weight variable, and \(b\) is the bias variable. Our goal is to find an algorithm that reaches the minimum of \(L(x,w,b)\).</p>
<p>To make this question more precise, let’s think about what happens when we move the ball a small amount \(\Delta x_1\) in the \(x_1\) direction, a small amount \(\Delta x_2\) in the \(x_2\) direction, …, and a small amount \(\Delta x_d\) in the \(x_d\) direction. Calculus tells us that \(L(x,w,b)\) changes as follows:</p>
<script type="math/tex; mode=display">\Delta L \approx \frac{\partial L}{\partial x_1}\Delta x_1 + ... + \frac{\partial L}{\partial x_d}\Delta x_d</script>
<p>In this sense, we need to find a way of choosing \(\Delta x_1\), …, \(\Delta x_d\) so as to make \(\Delta L\) negative; i.e., we make the objective function decrease in order to minimize it.</p>
<ul>
<li>Define \(\Delta x=(\Delta x_1, …, \Delta x_d)^T\) to be the vector of changes in \(x\).</li>
<li>Define \(\nabla L=(\frac{\partial L}{\partial x_1}, …, \frac{\partial L}{\partial x_d})^T\) to be the gradient vector of \(L\).</li>
</ul>
<p>So we can find: <script type="math/tex">\Delta L \approx \frac{\partial L}{\partial x_1}\Delta x_1 + ... + \frac{\partial L}{\partial x_d}\Delta x_d = \nabla L ^T \Delta x</script>. By now, things are becoming easier. Suppose, \(\Delta x=-\eta \nabla L\) (i.e. the step size in gradient descent, where \(\eta\) is the learning rate). Then:</p>
<script type="math/tex; mode=display">\Delta L \approx -\eta \nabla L^T\nabla L = -\eta||\nabla L||_2^2 \leq 0</script>
<p>Now we can see why gradient descent works. We use the following update rule to obtain the next \(x\):</p>
<script type="math/tex; mode=display">x^{k+1} = x^{k} - \eta \nabla L(x^k)</script>
<p>Repeating this update drives the objective function toward its minimum.</p>
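The update rule translates directly into code. Here is a minimal Python sketch; the convex quadratic objective and the learning rate are my own illustrative choices, not from the post:

```python
import numpy as np

# Illustrative convex objective (my choice): L(x) = x1^2 + 3*x2^2,
# whose unique minimizer is the origin.
def grad_L(x):
    return np.array([2 * x[0], 6 * x[1]])

eta = 0.1                      # learning rate
x = np.array([1.0, 2.0])       # starting point x^0

for k in range(200):
    x = x - eta * grad_L(x)    # x^{k+1} = x^k - eta * grad L(x^k)

print(x)  # very close to the minimizer [0, 0]
```

Each coordinate shrinks by a constant factor per iteration here, so the iterates converge geometrically to the minimizer.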
<h2 id="gradient-descent-in-a-convex-problem">Gradient Descent in a convex problem</h2>
<p>Now let us consider gradient descent on a convex problem; gradient descent is usually analyzed in the convex setting, since on non-convex problems it may only reach a local minimum. If the objective function is convex, then \(\nabla L(x^k)^T(x^{k+1}-x^{k})\geq 0\) implies \(L(x^{k+1}) \geq L(x^k)\). This follows from the first-order characterization of convexity, i.e. \(L(x^{k+1}) \geq L(x^k) + \nabla L(x^k)^T(x^{k+1}-x^k)\).</p>
<p>Hence we need \(\nabla L(x^k)^T(x^{k+1}-x^{k})\leq 0\) to make the objective function decrease. In gradient descent, \(\Delta x\) is chosen to be \(-\nabla L(x^k)\). However, there are many other descent methods, such as <strong>steepest descent</strong>, <strong>normalized steepest descent</strong>, the <strong>Newton step</strong> and so on. The main idea shared by all these methods is to ensure \(\nabla L(x^k)^T(x^{k+1}-x^{k})= \nabla L^T \Delta x \leq 0\).</p>
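For the gradient step specifically, the descent condition can be checked in one line: \(\nabla L^T \Delta x = -\eta\|\nabla L\|_2^2\), which is never positive. A quick numerical sketch in Python (the quadratic objective is my own illustrative choice):

```python
import numpy as np

# Gradient of the illustrative quadratic L(x) = x1^2 + 3*x2^2 (my choice)
def grad_L(x):
    return np.array([2 * x[0], 6 * x[1]])

eta = 0.1
x = np.array([1.0, 2.0])

g = grad_L(x)
dx = -eta * g           # the gradient-descent choice of Delta x
descent_value = g @ dx  # equals -eta * ||g||^2, always <= 0

print(descent_value)
```

Any other descent method simply makes a different choice of `dx` while keeping this inner product non-positive.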
<h2 id="further-reading">Further reading</h2>
<p>For further reading on the gradient descent method, please refer to Chapter 8 of (Beck, 2017).</p>
<h2 id="acknowledgement">Acknowledgement</h2>
<p>We would like to thank <a href="https://github.com/Marvinmw">Wei Ma</a> and <a href="https://github.com/yeephycho">Yixuan Hu</a> for checking the details of this blog.</p>
<h2 id="references">References</h2>
<ul>
<li>Michael A. Nielsen, <em>Neural Networks and Deep Learning</em>, Determination Press, 2015</li>
<li>Stephen Boyd, and Lieven Vandenberghe. <em>Convex optimization</em>. Cambridge university press, 2004.</li>
<li>Beck, Amir. <em>First-Order Methods in Optimization</em>. Vol. 25. SIAM, 2017.</li>
</ul>
<h2 id="remarks">Remarks</h2>
<p>Last updated on June 14, 2016</p>
Contributing an article2010-01-01T00:00:00+00:00/contribute<p>If you’re writing an article for this blog, please follow these guidelines.</p>
<p>One of the rewards of switching my website to <a href="http://jekyllrb.com/">Jekyll</a> is the
ability to support <strong>MathJax</strong>, which means I can write LaTeX-like equations that get
nicely displayed in a web browser, like this one \( \sqrt{\frac{n!}{k!(n-k)!}} \) or
this one \( x^2 + y^2 = r^2 \).</p>
<!--more-->
<p><img class="centered" src="http://gastonsanchez.com/images/blog/mathjax_logo.png" /></p>
<h3 id="whats-mathjax">What’s MathJax?</h3>
<p>If you check the MathJax website <a href="http://www.mathjax.org/">(www.mathjax.org)</a>, you’ll see
that it <em>is an open source JavaScript display engine for mathematics that works in all
browsers</em>.</p>
<h3 id="how-to-implement-mathjax-with-jekyll">How to implement MathJax with Jekyll</h3>
<p>I followed the instructions described by Dason Kurkiewicz for
<a href="http://dasonk.github.io/blog/2012/10/09/Using-Jekyll-and-Mathjax/">using Jekyll and Mathjax</a>.</p>
<p>Here are some important details. I had to modify the Ruby library for Markdown in
my <code class="highlighter-rouge">_config.yml</code> file. Now I’m using redcarpet so the corresponding line in the
configuration file is: <code class="highlighter-rouge">markdown: redcarpet</code></p>
<p>To load the MathJax javascript, I added the following lines in my layout <code class="highlighter-rouge">page.html</code>
(located in my folder <code class="highlighter-rouge">_layouts</code>)</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="o"><</span><span class="n">script</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s2">"text/javascript"</span><span class="w">
</span><span class="n">src</span><span class="o">=</span><span class="s2">"http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"</span><span class="o">></span><span class="w">
</span><span class="o"></</span><span class="n">script</span><span class="o">></span></code></pre></figure>
<p>Of course you can choose a different file location in your jekyll layouts.</p>
<h3 id="a-couple-of-examples">A Couple of Examples</h3>
<p>Here’s a short list of examples. To learn more about the details behind MathJax, you can
always check the documentation available at
<a href="http://docs.mathjax.org/en/latest/">http://docs.mathjax.org/en/latest/</a>.</p>
<p>I’m assuming you are familiar with LaTeX. However, you should know that MathJax does not
behave exactly like LaTeX. By default, the <strong>tex2jax</strong> preprocessor defines the
LaTeX math delimiters, which are <code class="highlighter-rouge">\\(...\\)</code> for in-line math, and <code class="highlighter-rouge">\\[...\\]</code> for
displayed equations. It also defines the TeX delimiters <code class="highlighter-rouge">$$...$$</code> for displayed
equations, but it does not define <code class="highlighter-rouge">$...$</code> as in-line math delimiters. Fortunately,
you can change these predefined specifications if you want to do so.</p>
<p>Let’s try a first example. Here’s a dummy equation:</p>
<script type="math/tex; mode=display">a^2 + b^2 = c^2</script>
<p>How do you write such expression? Very simple: using <strong>double dollar</strong> signs</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="o">$$</span><span class="n">a</span><span class="o">^</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">b</span><span class="o">^</span><span class="m">2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">c</span><span class="o">^</span><span class="m">2</span><span class="o">$$</span></code></pre></figure>
<p>To display inline math use <code class="highlighter-rouge">\\( ... \\)</code> like this <code class="highlighter-rouge">\\( sin(x^2) \\)</code> which gets
rendered as \( sin(x^2) \)</p>
<p>Here’s another example using type <code class="highlighter-rouge">\mathsf</code></p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="o">$$</span><span class="w"> </span><span class="err">\</span><span class="n">mathsf</span><span class="p">{</span><span class="n">Data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">PCs</span><span class="p">}</span><span class="w"> </span><span class="err">\</span><span class="n">times</span><span class="w"> </span><span class="err">\</span><span class="n">mathsf</span><span class="p">{</span><span class="n">Loadings</span><span class="p">}</span><span class="w"> </span><span class="o">$$</span></code></pre></figure>
<p>which gets displayed as</p>
<script type="math/tex; mode=display">\mathsf{Data = PCs} \times \mathsf{Loadings}</script>
<p>Or even better:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="err">\\</span><span class="p">[</span><span class="w"> </span><span class="err">\</span><span class="n">mathbf</span><span class="p">{</span><span class="n">X</span><span class="p">}</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="err">\</span><span class="n">mathbf</span><span class="p">{</span><span class="n">Z</span><span class="p">}</span><span class="w"> </span><span class="err">\</span><span class="n">mathbf</span><span class="p">{</span><span class="n">P</span><span class="o">^</span><span class="err">\</span><span class="n">mathsf</span><span class="p">{</span><span class="nb">T</span><span class="p">}}</span><span class="w"> </span><span class="err">\\</span><span class="p">]</span></code></pre></figure>
<p>is displayed as</p>
<p>\[ \mathbf{X} = \mathbf{Z} \mathbf{P^\mathsf{T}} \]</p>
<h2 id="important-notes">Important notes</h2>
<h3 id="1-subscripts">1. Subscripts</h3>
<p>If you want to use subscripts like this \( \mathbf{X}_{n,p} \) you need to escape the
underscores with a backslash like so <code class="highlighter-rouge">\mathbf{X}\_{n,p}</code>:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="o">$$</span><span class="w"> </span><span class="err">\</span><span class="n">mathbf</span><span class="p">{</span><span class="n">X</span><span class="p">}</span><span class="err">\_</span><span class="p">{</span><span class="n">n</span><span class="p">,</span><span class="n">p</span><span class="p">}</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="err">\</span><span class="n">mathbf</span><span class="p">{</span><span class="n">A</span><span class="p">}</span><span class="err">\_</span><span class="p">{</span><span class="n">n</span><span class="p">,</span><span class="n">k</span><span class="p">}</span><span class="w"> </span><span class="err">\</span><span class="n">mathbf</span><span class="p">{</span><span class="n">B</span><span class="p">}</span><span class="err">\_</span><span class="p">{</span><span class="n">k</span><span class="p">,</span><span class="n">p</span><span class="p">}</span><span class="w"> </span><span class="o">$$</span></code></pre></figure>
<p>will be displayed as</p>
<p>\[ \mathbf{X}_{n,p} = \mathbf{A}_{n,k} \mathbf{B}_{k,p} \]</p>
<h3 id="2-vertical-line">2. Vertical line</h3>
<p>If you want to use a vertical line <code class="highlighter-rouge">|</code>, you likewise need to escape it with a backslash like so <code class="highlighter-rouge">\|</code>:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="o">$$</span><span class="w"> </span><span class="err">\</span><span class="o">|</span><span class="err">\</span><span class="o">|</span><span class="n">A</span><span class="err">\</span><span class="o">|</span><span class="err">\</span><span class="o">|</span><span class="w"> </span><span class="o">$$</span></code></pre></figure>
<p>will be displayed as</p>
<p>\[ ||A|| \]</p>