I am working on a text classification problem where I want to use the BERT model as the base, followed by dense layers. I want to know how the 3 arguments work. For example, if I have 3 sentences:
'My name is slim shade and I am an aspiring AI Engineer',
'I am an aspiring AI Engineer',
'My name is Slim'
So what will these 3 arguments do? What I think is as follows:
max_length=5 will keep all the sentences strictly to a length of 5.
padding='max_length' will add a padding of 1 to the third sentence.
truncation=True will truncate the first and second sentences so that their length is strictly 5.
Please correct me if I am wrong.
Below is the code I have used. (Note: torch must be imported and text defined for this to run.)
! pip install transformers==3.5.1
import torch
from transformers import BertTokenizerFast

text = ['My name is slim shade and I am an aspiring AI Engineer',
        'I am an aspiring AI Engineer',
        'My name is Slim']

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
tokens = tokenizer.batch_encode_plus(text, max_length=5, padding='max_length', truncation=True)
text_seq = torch.tensor(tokens['input_ids'])
text_mask = torch.tensor(tokens['attention_mask'])
What you have assumed is almost correct; however, there are a few differences.
max_length=5: the max_length specifies the length of the tokenized text. By default, BERT performs word-piece tokenization. For example, the word "playing" can be split into "play" and "##ing" (this may not be exactly how BERT splits it, but it illustrates word-piece tokenization). It then adds a [CLS] token at the beginning of the sentence and a [SEP] token at the end. Thus, it first tokenizes the sentence, truncates it to max_length - 2 (if truncation=True), then prepends [CLS] at the beginning and appends [SEP] at the end (so the total length is max_length).
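To illustrate, here is a minimal sketch of that truncate-then-add-special-tokens order. The encode helper is hypothetical, and whole words stand in for BERT's word-pieces (the real tokenizer splits rarer words into sub-words, so actual counts can differ):

```python
def encode(words, max_length):
    # Truncate to max_length - 2 to leave room for [CLS] and [SEP]
    words = words[:max_length - 2]
    return ["[CLS]"] + words + ["[SEP]"]

sentence = "My name is slim shade and I am an aspiring AI Engineer"
print(encode(sentence.lower().split(), max_length=5))
# ['[CLS]', 'my', 'name', 'is', '[SEP]']
```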
padding='max_length': in this example it is not very evident that the 3rd sentence will be padded, as its length exceeds 5 after the [CLS] and [SEP] tokens are added. However, if you use a max_length of 10, the tokenized text corresponds to [101, 2026, 2171, 2003, 11754, 102, 0, 0, 0, 0], where 101 is the id of [CLS] and 102 is the id of [SEP]. Thus, the text is padded with zeros so that every sequence reaches the length of max_length.
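A rough sketch of the padding step (using the ids quoted above and 0 as BERT's [PAD] id), along with the attention mask that marks real tokens as 1 and padding as 0 — the pad helper is hypothetical, shown only to make the behavior concrete:

```python
def pad(ids, max_length, pad_id=0):
    # Mask: 1 for real tokens, 0 for padding positions
    mask = [1] * len(ids) + [0] * (max_length - len(ids))
    ids = ids + [pad_id] * (max_length - len(ids))
    return ids, mask

ids, mask = pad([101, 2026, 2171, 2003, 11754, 102], max_length=10)
print(ids)   # [101, 2026, 2171, 2003, 11754, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```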
Likewise, truncation=True (note the argument is truncation, not truncate) will ensure that max_length is strictly adhered to, i.e., longer sentences are truncated to max_length only if truncation=True.
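Putting the three arguments together on your 3 sentences, a simplified stand-in for batch_encode_plus behaves as sketched below. This batch_encode helper is hypothetical and uses whole words instead of word-pieces and "[PAD]" instead of id 0, so it only illustrates the truncate/pad logic, not BERT's actual vocabulary:

```python
def batch_encode(texts, max_length):
    out = []
    for t in texts:
        toks = ["[CLS]"] + t.lower().split()[:max_length - 2] + ["[SEP]"]  # truncation=True
        toks += ["[PAD]"] * (max_length - len(toks))                       # padding='max_length'
        out.append(toks)
    return out

rows = batch_encode(['My name is slim shade and I am an aspiring AI Engineer',
                     'I am an aspiring AI Engineer',
                     'My name is Slim'], max_length=5)
for row in rows:
    print(row)
```

Note that at max_length=5 even the 3rd sentence is truncated rather than padded, since its 4 words plus [CLS] and [SEP] exceed 5.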