Data Preprocessing for NLP Pre-training Models (e.g. ELMo, BERT)

Xin · Mar 1, 2019 · Viewed 7.5k times

I plan to train an ELMo or BERT model from scratch based on data I have on hand (notes typed by people). Since the data was typed by many different people, it has problems with spelling, formatting, and inconsistent sentences. After reading the ELMo and BERT papers, I know that both models are trained on large amounts of clean text, such as Wikipedia. I haven't been able to find any processed training samples or any preprocessing tutorial for the ELMo or BERT models. My questions are:

  • Do the BERT and ELMo models have standard data preprocessing steps or standard processed data formats?
  • Given my existing dirty data, is there any way to preprocess it so that the resulting word representations are more accurate?

Answer

Denis Gordeev · Mar 1, 2019

BERT uses WordPiece embeddings, which somewhat help with dirty data: unknown or misspelled words are split into smaller known subword units instead of being mapped to a single unknown token. https://github.com/google/sentencepiece
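
As a minimal sketch of the idea, here is how you might train a subword model with the sentencepiece library linked above. The file name `notes.txt` and the vocabulary size are hypothetical choices, not anything from the BERT code:

```python
import sentencepiece as spm

# Train a subword model on your raw notes, one note per line.
# "notes.txt" is a hypothetical path; vocab_size should be tuned
# to the size of your corpus.
spm.SentencePieceTrainer.train(
    input="notes.txt",
    model_prefix="notes_sp",   # writes notes_sp.model / notes_sp.vocab
    vocab_size=8000,
    model_type="bpe",          # or "unigram"; both are subword models
)

sp = spm.SentencePieceProcessor(model_file="notes_sp.model")
# Misspelled words are split into smaller known pieces rather than
# becoming a single unknown token:
print(sp.encode("patiant complaned of hedache", out_type=str))
```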

Google Research also provides the data preprocessing code used for BERT in their repository. https://github.com/google-research/bert/blob/master/tokenization.py
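
For example, a sketch of using that module directly (it assumes `tokenization.py` from the repository is on your path and that `vocab.txt` comes from one of the released BERT checkpoints; the checkpoint path here is hypothetical):

```python
import tokenization  # tokenization.py from google-research/bert

tokenizer = tokenization.FullTokenizer(
    vocab_file="uncased_L-12_H-768_A-12/vocab.txt",  # from a BERT checkpoint
    do_lower_case=True,
)

tokens = tokenizer.tokenize("patiant complaned of hedache")
# Misspelled words are broken into smaller WordPiece units ("##"-prefixed
# continuation pieces) rather than being mapped straight to [UNK].
ids = tokenizer.convert_tokens_to_ids(tokens)
```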

The default ELMo implementation takes tokens as input (if you provide an untokenized string, it will split it on spaces). Thus spelling correction, deduplication, lemmatization (e.g. with spaCy, https://spacy.io/api/lemmatizer), separating punctuation from tokens, and other standard preprocessing methods may help; see the sketch below.
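
A minimal cleanup sketch along those lines using spaCy, assuming the small English model is installed (`python -m spacy download en_core_web_sm`):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def clean(text: str) -> str:
    doc = nlp(text)
    # Tokenize, drop punctuation, and replace each token with its lemma.
    return " ".join(tok.lemma_ for tok in doc if not tok.is_punct)

print(clean("The patients were complaining about headaches!!!"))
# -> roughly "the patient be complain about headache"
```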

You can check standard ways to preprocess text in the NLTK package, https://www.nltk.org/api/nltk.tokenize.html (for example, the Twitter tokenizer). Be aware that NLTK itself is slow. Many machine learning libraries also provide their own basic preprocessing (https://github.com/facebookresearch/pytext, https://keras.io/preprocessing/text/).
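
The Twitter tokenizer mentioned above is fairly robust to informal, user-typed text, which may suit your notes:

```python
from nltk.tokenize import TweetTokenizer

# reduce_len shortens exaggerated character runs ("soooooo" -> "sooo"),
# which helps normalize informal typing.
tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True)
print(tokenizer.tokenize("Pt c/o headache... soooooo tired!!! :("))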

You may also experiment with providing BPE encodings or character n-grams as the input.
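
Character n-grams in particular can make representations more robust to typos, since a misspelled variant still shares most of its n-grams with the correct spelling. A toy illustration (fastText-style boundary markers are an assumption of this sketch):

```python
def char_ngrams(word: str, n: int = 3) -> list:
    # "<" and ">" mark word boundaries, as in fastText-style n-grams.
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("headache"))  # ['<he', 'hea', 'ead', 'ada', ...]
print(char_ngrams("hedache"))   # shares many n-grams with the correct spelling
```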

It also depends on the amount of data you have: the more data, the smaller the benefit of preprocessing (in my opinion). Given that you want to train ELMo or BERT from scratch, you should have a lot of data.