Word embeddings

The representation of words as vectors


Word embeddings are a way to represent words as vectors of real numbers.

One-hot encoding

One-hot encoding is a simple method of representing each word in a vocabulary as a vector. The goal is to convert the vocabulary into a format that can be used as input for machine learning models, which typically require numerical data.

In a one-hot encoding, exactly one position in the vector equals 1, and all other positions are 0.

Example: bird = 01000

The length of a one-hot encoding vector is equal to the vocabulary size.

Example

           
dog        1 0 0 0 0
bird       0 1 0 0 0
fish       0 0 1 0 0
deer       0 0 0 1 0
crocodile  0 0 0 0 1

The sample vectors have 5 dimensions because the vocabulary contains 5 words.
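
A minimal sketch in Python (the language is an assumption; the article does not name one) of building such one-hot vectors for the vocabulary above:

  # Build one-hot vectors for a small vocabulary.
  vocabulary = ["dog", "bird", "fish", "deer", "crocodile"]

  def one_hot(word, vocabulary):
      # The vector length equals the vocabulary size.
      vector = [0] * len(vocabulary)
      # Only the position of the word is set to 1.
      vector[vocabulary.index(word)] = 1
      return vector

  print(one_hot("bird", vocabulary))  # [0, 1, 0, 0, 0]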

Challenges

  • Large vocabularies require longer one-hot vectors.
  • One-hot vectors consist mostly of 0s, so operations on them are inefficient.
  • One-hot vectors do not capture word meaning or similarity.

Embedding matrices

The goal of word embeddings is to capture meaning and context. Each word is represented as a dense, multidimensional vector in which every dimension can carry information about the word. As a result, word embedding vectors are much shorter than one-hot vectors.

Example

            pets  mammals  horns
dog            1        1      0
bird           1        0      0
fish           1        0      0
deer           0        1      1
crocodile      0        0      0

The sample vectors have 3 dimensions. Similar concepts have similar vectors.
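
As a sketch of what "similar vectors" means, the following Python snippet (an illustrative assumption, not from the article) compares the toy vectors above with cosine similarity; the all-zero crocodile vector is omitted because it has no direction:

  import math

  # Toy embedding vectors from the table above: [pets, mammals, horns]
  embeddings = {
      "dog":  [1, 1, 0],
      "bird": [1, 0, 0],
      "fish": [1, 0, 0],
      "deer": [0, 1, 1],
  }

  def cosine_similarity(a, b):
      # Cosine of the angle between two vectors: 1.0 means the same direction.
      dot = sum(x * y for x, y in zip(a, b))
      norm_a = math.sqrt(sum(x * x for x in a))
      norm_b = math.sqrt(sum(x * x for x in b))
      return dot / (norm_a * norm_b)

  print(cosine_similarity(embeddings["bird"], embeddings["fish"]))  # 1.0, very similar
  print(cosine_similarity(embeddings["bird"], embeddings["deer"]))  # 0.0, unrelated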

In neural machine translation, embedding matrices are usually learned during training.
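
As an illustration of how a learned embedding matrix is typically used, here is a minimal sketch with PyTorch (the framework is an assumption; the article does not name one). The embedding layer is a lookup table whose rows are updated during training:

  import torch
  import torch.nn as nn

  vocabulary_size = 5   # number of words in the vocabulary
  embedding_dim = 3     # length of each word embedding vector

  # One row per vocabulary word; the weights are initialised randomly
  # and learned during training.
  embedding = nn.Embedding(vocabulary_size, embedding_dim)

  word_index = torch.tensor([1])   # for example, the index of "bird"
  vector = embedding(word_index)   # look up the embedding row for that word
  print(vector.shape)              # torch.Size([1, 3])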

