Data Preprocessing for NLP Pre-training Models (e.g. ELMo, BERT)

Xin · Mar 1, 2019 · Viewed 7.5k times

I plan to train an ELMo or BERT model from scratch based on data I have on hand (notes typed by people). Since the data was typed by many different people, it has problems with spelling, formatting, and inconsistent sentences. After reading the ELMo and BERT papers, I know that both models are trained on large amounts of clean text, such as Wikipedia. I haven't been able to find any processed training samples or any preprocessing tutorial for the ELMo or BERT models. My questions are:

  • Do the BERT and ELMo models have standard data preprocessing steps or standard processed data formats?
  • Given my existing dirty data, is there any way to preprocess it so that the resulting word representations are more accurate?

Answer

Denis Gordeev · Mar 1, 2019

BERT uses WordPiece embeddings, which somewhat help with dirty data: unknown or misspelled words are split into smaller known subword units instead of being mapped to a single unknown token. https://github.com/google/sentencepiece
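
As a minimal sketch of the idea, here is how you might train a subword model with the sentencepiece library linked above. The file name `notes.txt` and the vocabulary size are hypothetical choices, not anything from the BERT code:

```python
import sentencepiece as spm

# Train a subword model on your raw notes, one note per line.
# "notes.txt" is a hypothetical path; vocab_size should be tuned
# to the size of your corpus.
spm.SentencePieceTrainer.train(
    input="notes.txt",
    model_prefix="notes_sp",   # writes notes_sp.model / notes_sp.vocab
    vocab_size=8000,
    model_type="bpe",          # or "unigram"; both are subword models
)

sp = spm.SentencePieceProcessor(model_file="notes_sp.model")
# Misspelled words are split into smaller known pieces rather than
# becoming a single unknown token:
print(sp.encode("patiant complaned of hedache", out_type=str))
```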

Google Research also provides the data preprocessing code used for BERT in their repository. https://github.com/google-research/bert/blob/master/tokenization.py
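
For example, a sketch of using that module directly (it assumes `tokenization.py` from the repository is on your path and that `vocab.txt` comes from one of the released BERT checkpoints; the checkpoint path here is hypothetical):

```python
import tokenization  # tokenization.py from google-research/bert

tokenizer = tokenization.FullTokenizer(
    vocab_file="uncased_L-12_H-768_A-12/vocab.txt",  # from a BERT checkpoint
    do_lower_case=True,
)

tokens = tokenizer.tokenize("patiant complaned of hedache")
# Misspelled words are broken into smaller WordPiece units ("##"-prefixed
# continuation pieces) rather than being mapped straight to [UNK].
ids = tokenizer.convert_tokens_to_ids(tokens)
```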

The default ELMo implementation takes tokens as input (if you provide an untokenized string, it will split it on spaces). Thus spelling correction, deduplication, lemmatization (e.g. with spaCy, https://spacy.io/api/lemmatizer), separating punctuation from tokens, and other standard preprocessing methods may help; see the sketch below.
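
A minimal cleanup sketch along those lines using spaCy, assuming the small English model is installed (`python -m spacy download en_core_web_sm`):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def clean(text: str) -> str:
    doc = nlp(text)
    # Tokenize, drop punctuation, and replace each token with its lemma.
    return " ".join(tok.lemma_ for tok in doc if not tok.is_punct)

print(clean("The patients were complaining about headaches!!!"))
# -> roughly "the patient be complain about headache"
```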

You can check standard ways to preprocess text in the NLTK package, https://www.nltk.org/api/nltk.tokenize.html (for example, the Twitter tokenizer). Be aware that NLTK itself is slow. Many machine learning libraries also provide their own basic preprocessing (https://github.com/facebookresearch/pytext, https://keras.io/preprocessing/text/).
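
The Twitter tokenizer mentioned above is fairly robust to informal, user-typed text, which may suit your notes:

```python
from nltk.tokenize import TweetTokenizer

# reduce_len shortens exaggerated character runs ("soooooo" -> "sooo"),
# which helps normalize informal typing.
tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True)
print(tokenizer.tokenize("Pt c/o headache... soooooo tired!!! :("))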

You may also experiment with providing BPE encodings or character n-grams as the input.
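
Character n-grams in particular can make representations more robust to typos, since a misspelled variant still shares most of its n-grams with the correct spelling. A toy illustration (fastText-style boundary markers are an assumption of this sketch):

```python
def char_ngrams(word: str, n: int = 3) -> list:
    # "<" and ">" mark word boundaries, as in fastText-style n-grams.
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("headache"))  # ['<he', 'hea', 'ead', 'ada', ...]
print(char_ngrams("hedache"))   # shares many n-grams with the correct spelling
```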

It also depends on the amount of data you have: the more data, the smaller the benefit of preprocessing (in my opinion). Given that you want to train ELMo or BERT from scratch, you should have a lot of data.