I have seen that NLP models such as BERT utilize WordPiece for tokenization. In WordPiece, we split tokens like playing into play and ##ing. It is mentioned that this covers a wider spectrum of Out-Of-Vocabulary (OOV) words. Can someone please help explain how WordPiece tokenization is actually done, and how it effectively handles rare/OOV words?
WordPiece and BPE are two similar and commonly used techniques for segmenting words into subword units in NLP tasks. In both cases, the vocabulary is initialized with all the individual characters in the language, and then the most frequent/likely combinations of the symbols in the vocabulary are iteratively added to the vocabulary.
Consider the WordPiece algorithm from the original paper (wording slightly modified by me):
1. Initialize the word unit inventory with all the characters in the text.
2. Build a language model on the training data using the inventory from step 1.
3. Generate a new word unit by combining two units out of the current word inventory to increment the word unit inventory by one. Out of all the possible new word units, choose the one that increases the likelihood on the training data the most when added to the model.
4. Go to step 2 until a predefined limit of word units is reached or the likelihood increase falls below a certain threshold.
The BPE algorithm only differs in Step 3, where it simply chooses the new word unit as the combination of the next most frequently occurring pair among the current set of subword units.
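To make the difference in Step 3 concrete, here is a minimal sketch of the two selection rules. The BPE rule is just the raw pair count. For WordPiece, the source only says "increases the likelihood the most"; the score used below, count(ab) / (count(a) * count(b)), is a commonly described stand-in for that criterion and is an assumption here, not the paper's exact procedure. The function names and the toy word counts are hypothetical.

```python
from collections import Counter

def pair_stats(words):
    """words maps a tuple of symbols to that word's corpus frequency."""
    unit_counts, pair_counts = Counter(), Counter()
    for symbols, freq in words.items():
        for s in symbols:
            unit_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    return unit_counts, pair_counts

def best_pair_bpe(words):
    # BPE: pick the adjacent pair with the highest raw count.
    _, pairs = pair_stats(words)
    return max(pairs, key=pairs.get)

def best_pair_wordpiece_like(words):
    # Assumed likelihood proxy: pair count normalized by the counts of its parts.
    units, pairs = pair_stats(words)
    return max(pairs, key=lambda p: pairs[p] / (units[p[0]] * units[p[1]]))

# Toy corpus: "the" x10, "to" x5, "qu" x2, already split into characters.
toy = {("t", "h", "e"): 10, ("t", "o"): 5, ("q", "u"): 2}
print(best_pair_bpe(toy))             # most frequent pair
print(best_pair_wordpiece_like(toy))  # pair whose parts rarely occur apart
```

On this toy input the count-based rule prefers a pair from the frequent word, while the normalized score prefers q and u, which are rare but never occur apart.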
Example
Input text: she walked . he is a dog walker . i walk
First 3 BPE Merges:
- w a = wa
- l k = lk
- wa lk = walk
So at this stage, your vocabulary includes all the initial characters, along with wa, lk, and walk. You usually do this for a fixed number of merge operations.
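The merge loop can be sketched in a few lines of Python. This is a simplified illustration, not a reference implementation: the function name `learn_bpe` is hypothetical, words are split on whitespace, and ties between equally frequent pairs go to whichever pair was seen first, which is an arbitrary choice.

```python
from collections import Counter

def learn_bpe(text, num_merges):
    # Each word starts as a tuple of single characters, counted by frequency.
    words = Counter(tuple(w) for w in text.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties go to the pair encountered first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing the chosen pair with one merged symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

merges = learn_bpe("she walked . he is a dog walker . i walk", 3)
print(merges)
```

Because w a, a l, and l k are all equally frequent in this text, the intermediate merges depend on tie-breaking: with the rule used here the second merge is wa l rather than l k, but the third merge still produces walk.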
How does it handle rare/OOV words?
Quite simply, OOV words are effectively impossible with such a segmentation method, as long as every character of the word is in the base vocabulary. Any word that does not occur in the vocabulary is broken down into subword units. Similarly, because the number of merges is limited, a rare word will not make it into the vocabulary as a whole unit, so it is split into more frequent subwords.
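As an illustration of how an unseen word gets broken down, here is a minimal sketch of greedy longest-match-first segmentation with the ## continuation prefix, the scheme BERT-style WordPiece tokenizers use at inference time. The function name `segment`, the tiny vocabulary, and the `[UNK]` fallback string are assumptions for the example.

```python
def segment(word, vocab, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches, not even a single character.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical vocabulary; a real one also contains the single characters.
vocab = {"play", "walk", "##ing", "##ed", "##er"}
print(segment("playing", vocab))
print(segment("walker", vocab))
print(segment("xyz", vocab))
```

The last call shows why the base characters matter: a word only falls back to the unknown token when some part of it matches nothing in the vocabulary.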
How does this help?
Imagine that the model sees the word walking. Unless this word occurs at least a few times in the training corpus, the model can't learn to deal with it very well. However, the corpus may contain the words walked, walker, and walks, each occurring only a few times. Without subword segmentation, all these words are treated as completely different words by the model.
However, if these get segmented as walk@@ ing, walk@@ ed, etc., notice that all of them will now have walk@@ in common, which will occur much more frequently during training, so the model may be able to learn more about it.
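A quick count makes the effect visible. The snippet below is only a sketch: `split_on_stem` is a hypothetical stand-in for a learned segmenter, and the word list is made up for the example.

```python
from collections import Counter

words = ["walked", "walker", "walks", "walking", "walked"]

def split_on_stem(word, stem="walk"):
    # Hypothetical segmenter: peel off a known stem and mark it with @@.
    if word.startswith(stem) and len(word) > len(stem):
        return [stem + "@@", word[len(stem):]]
    return [word]

word_counts = Counter(words)
subword_counts = Counter(p for w in words for p in split_on_stem(w))

print(word_counts["walking"])    # how often the full word is seen
print(subword_counts["walk@@"])  # how often the shared subword is seen
```

At the word level walking is seen once, while the shared unit walk@@ is seen in every one of the five tokens.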