A data compression algorithm for tokenisation for machine translation
Byte-pair encoding (BPE) is a data compression algorithm. In byte-pair encoding, a byte that does not exist in the data replaces the most common consecutive pair of bytes. Byte-pair encoding was first introduced in the article titled “A new Algorithm for Data Compression” by Philip Gage in the February 1994 edition of C Users Journal.
- Initialise the vocabulary with the frequency of word occurences in the corpus.
- Represent each word in the corpus as a concatenation. Use the special token
</w>to identify the word end.
- Split each word into characters and count the character occurence.
- Look for the most frequent pairing, merge them and perform the same iteration again.
- Repeat step 4 until all the iterations are performed or the token limit size is reached.
- The most frequent byte pair is
aa. It is replaced
Z. The string is now
- The new most frequent byte pair is
ab. It is replaced by
Y. The string is now
ZYis encoded as
Xapplying recursive byte-pair encoding. The string is now
XdXaccannot be compressed further.
To decompress the data, the string characters are replaced in reverse order.
The use of byte-pair encoding ensures that the most common words in the vocabulary are presented as single tokens, whereas the rarest words are broken down into subword tokens. Byte-pair encoding works well for both character-level and word-level representations, and it can manage large corpora. Byte-pair encoding follows a greedy approach where it goes through every possible option at each iteration.
Implementation of byte-pair encoding is slightly modified to fit for subword tokenisation. In subword tokenisation, frequently occurring subwords are merged instead of replaced by another byte.
- The unigram is split up into more frequent subwords.
- The unigram is represented as
['__sow', 'pr', 'ai', 'se', 'd', '__eow']. The
'__eow'markers indicate the start and the end of the tokens.