\(\renewcommand{\vec}[1]{\mathbf{#1}}\)

\(\newcommand{\vecv}{\vec{v}}\) \(\newcommand{\vectheta}{\pmb{\theta}}\)

Exploration 4.3: Deep Learning for Language (Part 1): Word Embeddings

In the previous exploration, we discussed how to apply deep learning to vision. Applying it to natural language, however, poses a bigger challenge. You see, unlike vision, language is inherently discrete (words) and symbolic, whereas deep learning is fundamentally connectionist, continuous, and anti-symbolic. So how do we unify them? Well, we first need continuous representations of discrete words, so that deep learning can take them as inputs. In other words, we want to view words like “cat” and “bunny” as vectors in some high-dimensional space, where similar words have similar vectors. These vectors are known as “word embeddings”, a fundamental topic in deep learning-based NLP. Today’s generative AI, such as ChatGPT, is built on this very concept.

Word embeddings are continuous vector representations of words that capture semantic and syntactic relationships between words. In this exploration, we will cover the basics of word embeddings, their properties, and how they can be trained.

From One-Hot to Continuous Representation

Traditionally, in machine learning and NLP, words were represented using one-hot vectors: each word in the vocabulary is assigned a unique index (e.g., in the following example, “Orange” is assigned the index 6257) and is represented as a binary vector with a single 1 in the position corresponding to the word’s index and 0s elsewhere. Converting a word to its one-hot representation is a simple look-up: first, the word’s index is determined from the predefined vocabulary; then a vector of the same length as the vocabulary is created, with the element at the word’s index set to 1 and all other elements set to 0.

Visual demonstration of one-hot encoding: ‘Man’ (5391), ‘Woman’ (9853), ‘King’ (4914), ‘Queen’ (7157), ‘Apple’ (456), and ‘Orange’ (6257). Each word from this vocabulary is assigned a unique index and represented as a high-dimensional vector, indicating its position in the linguistic space. The image highlights the sparsity of one-hot vectors and their key limitation - the inability to capture and represent the inherent semantic relationships between different words.
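To make the look-up concrete, here is a minimal sketch in Python. The dictionary, the assumed vocabulary size of 10,000, and the helper name one_hot are illustrative; the word indices follow the example above (e.g., “Orange” at 6257).

```python
import numpy as np

# Toy vocabulary mapping words to indices (indices follow the example above).
word_to_index = {"Man": 5391, "Woman": 9853, "King": 4914,
                 "Queen": 7157, "Apple": 456, "Orange": 6257}
VOCAB_SIZE = 10000  # assumed vocabulary size for this toy example

def one_hot(word):
    """Return the one-hot vector for `word`: all zeros except a 1 at its index."""
    vec = np.zeros(VOCAB_SIZE)
    vec[word_to_index[word]] = 1.0
    return vec

orange = one_hot("Orange")
print(orange.sum(), orange[6257])            # 1.0 1.0 -> a single 1 at index 6257
print(one_hot("Apple") @ one_hot("Orange"))  # 0.0 -> distinct one-hot vectors are orthogonal
```

Note how the dot product between any two distinct one-hot vectors is zero: the representation itself encodes no notion of similarity, which motivates the continuous embeddings discussed next.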

However, one-hot representations are sparse and do not capture any semantic information about the words. To address this limitation, continuous representations such as word embeddings are used. These dense vector representations are capable of capturing semantic and syntactic relationships between words, making them more suitable for NLP tasks.

Properties of Word Embeddings

Some interesting properties of word embeddings include:

  1. Similar words have similar embeddings: The embeddings of semantically similar words tend to be close together in the vector space. This can be measured using cosine similarity, which computes the cosine of the angle between two vectors. For example, the cosine similarity between the embeddings of the words “cat” and “kitten” would be higher than the cosine similarity between “cat” and “tree”.
Visualizing Learned Word Embeddings: This toy example illustrates how word embeddings transform words from a text corpus into vector space. Here, the 1st dimension strongly correlates with ‘living beings’. Semantically or syntactically similar words cluster together in this space, a feature evident in the top-right quadrant via dimensionality reduction techniques like t-SNE or PCA. The image also highlights how consistent offsets in vector space can reflect shared cultural relationships between different words.
  2. Analogies: Word embeddings can capture analogies, such as “man - woman = king - queen”. This can be illustrated by performing vector arithmetic on the embeddings and finding the closest word in the vector space (both the similarity and analogy computations are sketched in code after this list). For example:

    \[\text{embedding}(\text{"man"}) - \text{embedding}(\text{"woman"}) + \text{embedding}(\text{"queen"}) \approx \text{embedding}(\text{"king"})\]

Unveiling Semantic Relationships with Embeddings: The visualized real embeddings in this image display the geometric relationships embodying semantic connections - gender, verb tense, and country-to-capital relations. Note how capital cities cluster closely with their respective countries. Additionally, countries with similar characteristics are located near each other, yet on a different dimension, revealing the nuanced relationships between countries and their capitals.
  3. Visualizations: Word embeddings can be projected to a lower-dimensional space (e.g., using t-SNE or PCA) for visualization purposes. This can help in understanding the relationships between words and identifying clusters of similar words.
2D Visualization of Word Embedding Space: In this graphical representation, words with similar meanings occupy adjacent locations. Here, ‘similarity’ can be gauged either by Euclidean distance (spatial distance between points) or cosine similarity (angle between two vectors), thus facilitating an intuitive understanding of linguistic relationships.
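Here is a minimal sketch of the similarity and analogy computations referenced in the list above. The toy embedding dictionary is hand-crafted purely for illustration (real embeddings would be learned from a corpus and have hundreds of dimensions); only the cosine formula and the vector arithmetic are the point.

```python
import numpy as np

# Hand-crafted 4-dimensional "embeddings", for illustration only.
embedding = {
    "cat":    np.array([0.90, 0.80, 0.10, 0.00]),
    "kitten": np.array([0.85, 0.90, 0.10, 0.05]),
    "tree":   np.array([0.00, 0.10, 0.90, 0.80]),
    "man":    np.array([0.10, 0.00, 0.00, 0.50]),
    "woman":  np.array([0.10, 0.00, 1.00, 0.50]),
    "king":   np.array([0.10, 1.00, 0.00, 0.50]),
    "queen":  np.array([0.10, 1.00, 1.00, 0.50]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Property 1: similar words have similar embeddings.
print(cosine_similarity(embedding["cat"], embedding["kitten"]))  # high (close to 1)
print(cosine_similarity(embedding["cat"], embedding["tree"]))    # much lower

# Property 2: analogies via vector arithmetic.
target = embedding["man"] - embedding["woman"] + embedding["queen"]
candidates = [w for w in embedding if w not in {"man", "woman", "queen"}]
print(max(candidates, key=lambda w: cosine_similarity(target, embedding[w])))  # "king"
```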

Model and Training

There are several models and algorithms for learning word embeddings, including:

  1. Word2Vec: This is the most famous family of models that learn word embeddings using a shallow neural network. Word2Vec is trained to predict a word given its context or the context given a word:

    1. Predict a word given context (Continuous Bag of Words, CBOW): The CBOW model is trained to predict the (masked) target word \(w_i\) given its context words (e.g., \(w_{i-2}\), \(w_{i-1}\), \(w_{i+1}\), \(w_{i+2}\)). For example, given the sentence “The cat is on the mat”, the model would be trained to predict the word “is” when given the words “The”, “cat”, “on”, and “the”. This idea is very intuitive (you can imagine that most masked words are easy to recover); it was inspired by the Cloze test and in turn inspired later masked language models such as BERT.
    CBOW: predict the target word given context
    CBOW implemented as neural network
    2. Predict the context given a word (Skip-gram): The Skip-gram model is the opposite: it is trained to predict the context words given a target word. For example, given the sentence “The cat is on the mat”, the model would be trained to predict the words “The”, “cat”, “on”, and “the” when given the word “is”. This is certainly more counterintuitive, as the task is much harder. However, precisely because it is trained to do something ridiculously hard, Skip-gram tends to yield slightly higher accuracy than CBOW, at the cost of slower training. (A sketch of how both models generate training pairs follows this list.)
    skip-gram: predict the context given a word
  2. GloVe (Global Vectors for Word Representation): A model that learns word embeddings by factorizing a word co-occurrence matrix. The main idea behind GloVe is that the relationships between words can be encoded in the ratios of their co-occurrence probabilities. GloVe is trained to minimize the difference between the dot product of word vectors and the logarithm of their co-occurrence probabilities (the objective is written out after this list). By doing so, it learns dense vector representations that capture semantic and syntactic information about words.

  3. FastText: An extension of the Word2Vec model that represents words as the sum of their character n-grams. FastText can learn embeddings for out-of-vocabulary words and is more robust to spelling mistakes. FastText can be trained using either the Skip-gram or CBOW architecture, similar to Word2Vec.
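To make the CBOW/Skip-gram distinction from item 1 concrete, here is a sketch of how the two models turn a tokenized sentence into training examples. The function names and window size are illustrative, and real Word2Vec implementations add further tricks (negative sampling or hierarchical softmax, subsampling of frequent words) that are omitted here.

```python
def cbow_pairs(tokens, window=2):
    """(context words -> target word) examples, as used by CBOW."""
    pairs = []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, target))
    return pairs

def skipgram_pairs(tokens, window=2):
    """(target word -> one context word) examples, as used by Skip-gram."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the cat is on the mat".split()
print(cbow_pairs(sentence)[2])       # (['the', 'cat', 'on', 'the'], 'is')
print(skipgram_pairs(sentence)[:4])  # [('the', 'cat'), ('the', 'is'), ('cat', 'the'), ('cat', 'is')]
```

For reference, the GloVe objective mentioned in item 2 can be written as a weighted least-squares loss over the word co-occurrence matrix \(X\) (this is the form from the original GloVe paper, stated in terms of raw counts \(X_{ij}\), with the bias terms absorbing the normalization into probabilities):

\[J = \sum_{i,j=1}^{V} f(X_{ij}) \left( \vec{w}_i^{\top} \tilde{\vec{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2\]

Here \(V\) is the vocabulary size, \(\vec{w}_i\) and \(\tilde{\vec{w}}_j\) are the word and context vectors, \(b_i\) and \(\tilde{b}_j\) are bias terms, and \(f\) is a weighting function that down-weights rare and very frequent co-occurrences.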

To train word embeddings, a large text corpus is required. The text corpus is preprocessed (e.g., tokenization, lowercasing) and fed into the chosen model. The model learns the embeddings by updating its weights using gradient-based optimization algorithms (e.g., stochastic gradient descent).
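As a concrete (and hedged) example of this pipeline, the sketch below uses the gensim library, assuming gensim 4.x, where the dimensionality argument is named vector_size; any comparable toolkit follows the same pattern of preprocessing followed by gradient-based training.

```python
from gensim.models import Word2Vec

# A real corpus would contain millions of sentences; this tiny list is a stand-in.
corpus = [
    ["the", "cat", "is", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # dimensionality of the embeddings
    window=2,         # context window size on each side
    min_count=1,      # keep every word in this toy corpus
    sg=1,             # 1 = Skip-gram, 0 = CBOW
    epochs=50,        # many epochs because the corpus is tiny
)

print(model.wv["cat"].shape)                # (100,) -> the learned embedding for "cat"
print(model.wv.most_similar("cat", topn=3))
```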

Pretrained Word Embeddings

In many cases, it is not necessary to train word embeddings from scratch. There are several pre-trained word embeddings available that can be used directly in NLP tasks or fine-tuned for specific domains. Some popular pre-trained word embeddings include:

  1. Google’s Word2Vec: Pre-trained on the Google News dataset, containing 100 billion words and resulting in 300-dimensional vectors for 3 million words and phrases.

  2. Stanford’s GloVe: Pre-trained on the Common Crawl dataset, containing 840 billion words and resulting in 300-dimensional vectors for 2.2 million words.

  3. Facebook’s FastText: Pre-trained on Wikipedia, containing 16 billion words and resulting in 300-dimensional vectors for 1 million words.
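As a sketch of how such pretrained vectors are typically loaded, the snippet below uses gensim’s downloader module; the identifier "word2vec-google-news-300" is assumed to refer to the Google News vectors described in item 1 (the download is large and happens only on first use).

```python
import gensim.downloader as api

# Load Google's pretrained 300-dimensional Word2Vec vectors.
wv = api.load("word2vec-google-news-300")

print(wv["king"].shape)  # (300,)
# Analogy query: king - man + woman -> typically returns "queen" as the top match.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```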


Using Pretrained Word Embeddings in NLP Tasks

Pretrained word embeddings can be used as a starting point for various NLP tasks, such as text classification, sentiment analysis, and machine translation. They can be used in the following ways:

  1. As input features: The word embeddings can be used as input features for machine learning models, such as neural networks or support vector machines. For example, in a text classification task, you could average the embeddings of all words in a document to obtain a document-level embedding, which can then be used as input to a classifier (see the sketch after this list).

  2. As initialization for fine-tuning: In some cases, it might be beneficial to fine-tune the pretrained embeddings on a specific task or domain. You can initialize the embedding layer of a neural network with the pretrained embeddings and then update the embeddings during training. This can help the model to better capture domain-specific knowledge.

  3. In combination with other embeddings: Pretrained embeddings can be combined with other types of embeddings, such as character-level embeddings or part-of-speech embeddings, to create richer representations for NLP tasks.
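As referenced in item 1, here is a minimal sketch of the “average the embeddings” approach, together with the corresponding way to initialize a trainable embedding layer (item 2). It assumes PyTorch and a gensim KeyedVectors object `wv` such as the one loaded in the previous section; the function and variable names are illustrative.

```python
import numpy as np
import torch
import torch.nn as nn

def document_embedding(tokens, wv):
    """Average the embeddings of in-vocabulary words to get one document-level vector."""
    vectors = [wv[w] for w in tokens if w in wv]
    if not vectors:
        return np.zeros(wv.vector_size)
    return np.mean(vectors, axis=0)

# 1. As input features: feed the averaged vector into any classifier.
doc_vec = document_embedding(["the", "cat", "is", "on", "the", "mat"], wv)

# 2. As initialization for fine-tuning: copy the pretrained matrix into an
#    embedding layer and let gradients update it during training (freeze=False).
embedding_matrix = torch.tensor(wv.vectors, dtype=torch.float32)
embedding_layer = nn.Embedding.from_pretrained(embedding_matrix, freeze=False)
```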

By using pretrained embeddings, you can leverage the knowledge captured from large-scale text corpora and improve the performance of your NLP models.

Videos