Byte-pair encoding

A data compression algorithm for tokenisation for machine translation

Byte-pair encoding (BPE) is a data compression algorithm. In byte-pair encoding, a byte that does not exist in the data replaces the most common consecutive pair of bytes. Byte-pair encoding was first introduced in the article titled “A new Algorithm for Data Compression” by Philip Gage in the February 1994 edition of C Users Journal.

Algorithm

Initialise the vocabulary with the frequency of word occurences in the corpus.
Represent each word in the corpus as a concatenation. Use the special token </w> to identify the word end.
Split each word into characters and count the character occurence.
Look for the most frequent pairing, merge them and perform the same iteration again.
Repeat step 4 until all the iterations are performed or the token limit size is reached.

Example

String: aaabdaaabac

The most frequent byte pair is aa. It is replaced Z. The string is now ZabdZabac.
The new most frequent byte pair is ab. It is replaced by Y. The string is now ZYdZYac.
ZY is encoded as X applying recursive byte-pair encoding. The string is now XdXac.
XdXac cannot be compressed further.

To decompress the data, the string characters are replaced in reverse order.

Subword tokenisation

The use of byte-pair encoding ensures that the most common words in the vocabulary are presented as single tokens, whereas the rarest words are broken down into subword tokens. Byte-pair encoding works well for both character-level and word-level representations, and it can manage large corpora. Byte-pair encoding follows a greedy approach where it goes through every possible option at each iteration.

Implementation of byte-pair encoding is slightly modified to fit for subword tokenisation. In subword tokenisation, frequently occurring subwords are merged instead of replaced by another byte.

Example

Unigram: praise

The unigram is split up into more frequent subwords.
The unigram is represented as ['__sow', 'pr', 'ai', 'se', 'd', '__eow']. The '__sow' and '__eow' markers indicate the start and the end of the tokens.