
Natural Language Processing 0.0.2: Word Embeddings

1 minute to read
2024-08-01

Word Embeddings

A common way to associate a vector with a word is to use dense word vectors, also called word embeddings. The following is a comparison between one-hot word vectors and word embeddings (a short numeric sketch follows the figure below):

Word vectors obtained via one-hot encoding

  • Binary, sparse, and high-dimensional (same dimensionality as the number of words in the vocabulary).
  • Hard-coded rather than learned from data.

Word Embeddings

  • Low-dimensional floating-point vectors.
  • Learned from data.

(Figure: one-hot word vectors vs. word embeddings)
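
To make the contrast concrete, here is a toy sketch. The vocabulary, words, and random values are illustrative assumptions only; real word embeddings are learned from data, not drawn at random:

import numpy as np

# Toy vocabulary of 4 words (illustrative only)
vocab = {"cat": 0, "dog": 1, "sat": 2, "ran": 3}

# One-hot: binary, sparse, dimensionality equals vocabulary size
one_hot_cat = np.zeros(len(vocab))
one_hot_cat[vocab["cat"]] = 1.0
print(one_hot_cat)  # [1. 0. 0. 0.]

# Embedding: low-dimensional dense floats; random here for illustration,
# whereas real embeddings are learned from data
embedding_dim = 2
embedding_table = np.random.rand(len(vocab), embedding_dim)
print(embedding_table[vocab["cat"]])  # e.g. [0.42 0.17]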

Methods of Obtaining Word Embeddings

Learning Word Embeddings with Embedding Layer

We can use the Embedding layer offered by Keras, which maps integer word indices (conceptually equivalent to one-hot vectors) to dense vector representations.

from keras.layers import Embedding

# input_dim=1000: vocabulary of 1,000 possible word indices
# output_dim=64: each index is mapped to a 64-dimensional vector
embedding_layer = Embedding(input_dim=1000, output_dim=64)

The following picture shows how a word index from a 1,000-word vocabulary is mapped to an N-dimensional vector representation.
(Figure: mapping a word index to a dense vector representation)
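
As a minimal sketch of this mapping in code (the sample indices below are made up for illustration):

import numpy as np
from keras.layers import Embedding

embedding_layer = Embedding(input_dim=1000, output_dim=64)

# A batch of 2 sequences, each 5 word indices long (indices in [0, 1000))
sample = np.array([[4, 20, 7, 3, 0], [15, 2, 9, 8, 1]])
vectors = embedding_layer(sample)
print(vectors.shape)  # (2, 5, 64): every index becomes a 64-dim vector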

Code example

GitHub word embeddings example: link

Using pretrained word embeddings

Instead of learning word embeddings from scratch, we can use pretrained word embeddings. The best-known pretrained word embeddings are word2vec, GloVe, and fastText.

  • word2vec was developed by Tomas Mikolov and his colleagues at Google in 2013.
  • GloVe, which stands for Global Vectors for Word Representation, was developed at Stanford University in 2014. Its vectors were trained on Wikipedia and Common Crawl data.
  • fastText was developed by Facebook in 2016.

Code example

Using GloVe as pretrained word embeddings: link
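
For a rough idea of what such code involves, here is a minimal sketch of loading GloVe vectors into a Keras Embedding layer. The path glove.6B.100d.txt and the tiny word_index mapping are assumptions for illustration; in practice, word_index would come from tokenizing your own corpus:

import numpy as np
from keras.layers import Embedding
from keras.initializers import Constant

embedding_dim = 100

# Parse GloVe's plain-text format: each line is a word followed by its vector.
# "glove.6B.100d.txt" is an assumed local path; download the file separately.
embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *coefs = line.split()
        embeddings_index[word] = np.asarray(coefs, dtype="float32")

# word_index: assumed word-to-index mapping, e.g. from a Keras tokenizer
word_index = {"the": 1, "cat": 2, "sat": 3}
vocab_size = len(word_index) + 1

# Copy pretrained vectors into an embedding matrix; unknown words stay all-zeros
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

# trainable=False freezes the pretrained vectors during training
embedding_layer = Embedding(vocab_size, embedding_dim,
                            embeddings_initializer=Constant(embedding_matrix),
                            trainable=False)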

References

  • François Chollet, Deep Learning with Python, Chapter 6
