How do the max_length, padding and truncation arguments work in HuggingFace's BertTokenizerFast.from_pretrained('bert-base-uncased')?

Deshwal · Dec 11, 2020 · Viewed 9.2k times

I am working on a text classification problem where I want to use the BERT model as the base followed by Dense layers. I want to know how the 3 arguments work. For example, if I have 3 sentences as:

'My name is slim shade and I am an aspiring AI Engineer',
'I am an aspiring AI Engineer',
'My name is Slim'

So what will these 3 arguments do? What I think is as follows:

  1. max_length=5 will keep all the sentences at a length of 5 strictly
  2. padding='max_length' will add a padding of 1 to the third sentence
  3. truncation=True will truncate the first and second sentence so that their length will be strictly 5.

Please correct me if I am wrong.

Below is the code I have used.

! pip install transformers==3.5.1

import torch
from transformers import BertTokenizerFast

text = ['My name is slim shade and I am an aspiring AI Engineer',
        'I am an aspiring AI Engineer',
        'My name is Slim']

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

tokens = tokenizer.batch_encode_plus(text, max_length=5, padding='max_length', truncation=True)

text_seq = torch.tensor(tokens['input_ids'])
text_mask = torch.tensor(tokens['attention_mask'])
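
As a quick sanity check (a sketch, assuming the code above runs as-is), every encoded sentence should come out with exactly 5 token ids, since padding and truncation both target max_length:

print(text_seq.shape)    # expected: torch.Size([3, 5]) -- 3 sentences, each of length 5
print(text_mask.shape)   # expected: torch.Size([3, 5]) -- matching attention masks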

Answer

Ashwin Geet D'Sa · Dec 11, 2020

What you have assumed is almost correct; however, there are a few differences.

max_length=5: max_length specifies the length of the tokenized text. By default, BERT performs word-piece tokenization. For example, the word "playing" can be split into "play" and "##ing" (this may not be exactly how this word is split, but it illustrates word-piece tokenization), after which a [CLS] token is added at the beginning of the sentence and a [SEP] token at the end. Thus, the tokenizer first tokenizes the sentence, truncates it to max_length - 2 tokens (if truncation=True), then prepends [CLS] and appends [SEP], for a total length of max_length.
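
You can see both steps with a minimal sketch, assuming the same 'bert-base-uncased' tokenizer from the question (the exact word pieces depend on the vocabulary):

print(tokenizer.tokenize('My name is Slim'))
# ['my', 'name', 'is', 'slim'] -- lower-cased word pieces, no special tokens yet
print(tokenizer.convert_ids_to_tokens(tokenizer.encode('My name is Slim')))
# ['[CLS]', 'my', 'name', 'is', 'slim', '[SEP]'] -- encode() adds [CLS]/[SEP]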

padding='max_length': In this example it is not very evident that the 3rd sentence will be padded, as its length exceeds 5 after the [CLS] and [SEP] tokens are added. However, if you set max_length to 10, the tokenized text corresponds to [101, 2026, 2171, 2003, 11754, 102, 0, 0, 0, 0], where 101 is the id of the [CLS] token and 102 is the id of the [SEP] token. Thus, the text is padded with zeros up to the length of max_length.
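
As a sketch of the padding side (again assuming the tokenizer from the question; the ids are the ones listed above):

enc = tokenizer('My name is Slim', max_length=10, padding='max_length', truncation=True)
print(enc['input_ids'])
# [101, 2026, 2171, 2003, 11754, 102, 0, 0, 0, 0] -- zero-padded to length 10
print(enc['attention_mask'])
# [1, 1, 1, 1, 1, 1, 0, 0, 0, 0] -- the mask is 0 over the padded positions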

Likewise, truncation=True (note the argument is truncation, not truncate, as in your code) will ensure that max_length is strictly adhered to, i.e., longer sentences are truncated to max_length only if truncation=True.
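
A matching sketch of the truncation side (same assumed tokenizer): the long first sentence is cut down to max_length - 2 = 3 word pieces before [CLS] and [SEP] are added.

enc = tokenizer('My name is slim shade and I am an aspiring AI Engineer',
                max_length=5, truncation=True)
print(tokenizer.convert_ids_to_tokens(enc['input_ids']))
# ['[CLS]', 'my', 'name', 'is', '[SEP]'] -- 3 word pieces plus [CLS]/[SEP]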