Byte-Pair Encoding (BPE) is a text tokenization technique used in Natural Language Processing. It breaks words down into smaller pieces called subwords. It works by repeatedly finding the most frequent pair of adjacent symbols (initially single characters) in the text and merging it into a new subword, until the vocabulary reaches a desired size. This helps with rare or unknown words, which can be split into smaller parts that the model has already seen during training. Because the vocabulary stays compact, large amounts of text become easier to work with, and the same approach can be applied across a wide variety of languages.
To understand BPE better, it’s important to know its key concepts:
Suppose we have a text corpus with the following four words: "ab", "bc", "bcd" and "cde". We begin by calculating the frequency of each byte (character). The initial vocabulary consists of all the unique characters in the corpus.
Vocabulary = {"a", "b", "c", "d", "e"}
Frequency = {"a": 1, "b": 3, "c": 3, "d": 2, "e": 1}
The most frequent pair is "bc", with a frequency of 2. (The pair "cd" also occurs twice; this example breaks the tie in favor of "bc".)
Merge "bc" to create a new subword unit "bc".
Update the frequency counts of the characters that were absorbed into "bc":
Frequency = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 1, "bc": 2}
Add "bc" to the vocabulary:
Vocabulary = {"a", "b", "c", "d", "e", "bc"}
Repeating these steps until the vocabulary reaches the desired size yields the following subword units: {"a", "b", "c", "d", "e", "bc", "cd", "de", "ab", "bcd", "cde"}.
The words of the corpus can then be written in terms of subword units, for example:
"ab" -> "a" + "b"
"bc" -> "bc"
"bcd" -> "bc" + "d"
"cde" -> "c" + "de"
This representation keeps the vocabulary small while preserving the original content and structure of the text.
We will use defaultdict from collections to manage the frequencies of character pairs and subword units. This makes it easy to store and update the vocabulary counts during the BPE process.
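As a small illustration of why `defaultdict(int)` is convenient for counting, the sketch below tallies the adjacent character pairs of a single word. The variable names are chosen for this example only.

```python
from collections import defaultdict

# Missing keys start at 0, so no explicit initialization is needed.
pair_freq = defaultdict(int)

word = "bcd"
for left, right in zip(word, word[1:]):
    pair_freq[(left, right)] += 1  # no KeyError the first time a pair is seen

print(dict(pair_freq))
```

One thing to keep in mind: merely reading a missing key from a `defaultdict` inserts it with the default value, so `pair_freq.get(key, 0)` is the safer choice when only looking up counts.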
The learn_bpe function learns and returns the most frequent character pairs from the input text, merging them iteratively.
It uses defaultdict(int) so that frequency counts start at zero automatically; the counts are kept in vocab.
We iterate through the vocabulary to find the most frequent adjacent character pair and perform the merge. This process repeats for a defined number of merges (num_merges).
most_frequent = max(vocab, key=lambda x: vocab[x]) finds the pair with the highest frequency, which is then added to the merges list.
In the apply_bpe function, we take a word and apply the previously learned merges. We iterate over the list of merges in reverse order. For each merge, we look for matching adjacent symbols in the word and replace them with the merged symbol.
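The two functions might look roughly like the sketch below. It is not the article's original code but a minimal version of the textbook procedure, written under several assumptions: words are separated by whitespace, each word is wrapped in '<' and '>' boundary markers, pair counts are recomputed after every merge, ties go to the pair seen first, and `merge_pair` is a hypothetical helper. It also applies merges in the order they were learned, which is the usual BPE convention, rather than in reverse as described above. Because of these choices, its results need not match the output listed in this article.

```python
from collections import defaultdict

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated corpus."""
    # Assumption: '<' and '>' mark the start and end of each word.
    words = [["<"] + list(w) + [">"] for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        # Recount adjacent pairs on the current segmentation of every word.
        vocab = defaultdict(int)
        for symbols in words:
            for pair in zip(symbols, symbols[1:]):
                vocab[pair] += 1
        if not vocab:
            break  # nothing left to merge
        most_frequent = max(vocab, key=lambda x: vocab[x])
        merges.append(most_frequent)
        words = [merge_pair(symbols, most_frequent) for symbols in words]
    return merges

def apply_bpe(word, merges):
    """Segment `word` by replaying the learned merges in the order they were learned."""
    symbols = ["<"] + list(word) + [">"]
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

merges = learn_bpe("ab bc bcd cde", num_merges=3)
tokens = apply_bpe("bcd", merges)
print("Learned Merges:", merges)
print("BPE Representation for 'bcd':", tokens)
```

Recounting after every merge means later merges can build on earlier ones (for example, a merge whose left symbol is itself a previously merged unit), which is what lets the vocabulary grow into longer subwords.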
Finally we demonstrate the usage of the learn_bpe and apply_bpe functions with a sample corpus. We first learn the BPE merges from the corpus "ab bc bcd cde" and then apply the learned merges to the word "bcd".
Output:
Learned Merges: [('<', 'b'), ('b', 'c'), ('c', 'd')]
BPE Representation for 'bcd': ['<b', 'cd', '>']
The learned merges are the most common pairs of symbols, which were combined into new subword units during training. The '<' and '>' symbols in the output appear to be boundary markers placed around each word. When BPE is applied to the word "bcd", the learned merges split it into the subwords ['<b', 'cd', '>'].