Byte-pair encoding
A data compression algorithm for tokenisation for machine translation
Byte-pair encoding (BPE) is a data compression algorithm. In byte-pair encoding, a byte that does not exist in the data replaces the most common consecutive pair of bytes. Byte-pair encoding was first introduced in the article titled “A new Algorithm for Data Compression” by Philip Gage in the February 1994 edition of C Users Journal.
Algorithm
- Initialise the vocabulary with the frequency of word occurences in the corpus.
- Represent each word in the corpus as a concatenation. Use the special token
</w>
to identify the word end. - Split each word into characters and count the character occurence.
- Look for the most frequent pairing, merge them and perform the same iteration again.
- Repeat step 4 until all the iterations are performed or the token limit size is reached.
Example
String: aaabdaaabac
- The most frequent byte pair is
aa
. It is replacedZ
. The string is nowZabdZabac
. - The new most frequent byte pair is
ab
. It is replaced byY
. The string is nowZYdZYac
. ZY
is encoded asX
applying recursive byte-pair encoding. The string is nowXdXac
.XdXac
cannot be compressed further.
To decompress the data, the string characters are replaced in reverse order.
Subword tokenisation
The use of byte-pair encoding ensures that the most common words in the vocabulary are presented as single tokens, whereas the rarest words are broken down into subword tokens. Byte-pair encoding works well for both character-level and word-level representations, and it can manage large corpora. Byte-pair encoding follows a greedy approach where it goes through every possible option at each iteration.
Implementation of byte-pair encoding is slightly modified to fit for subword tokenisation. In subword tokenisation, frequently occurring subwords are merged instead of replaced by another byte.
Example
Unigram: praise
- The unigram is split up into more frequent subwords.
- The unigram is represented as
['__sow', 'pr', 'ai', 'se', 'd', '__eow']
. The'__sow'
and'__eow'
markers indicate the start and the end of the tokens.