I plan to train an ELMo or BERT model from scratch based on the data (notes typed by people) I have on hand. The notes were typed by many different people, and there are problems with spelling, formatting, and inconsistent sentences. After reading the ELMo and BERT papers, I know that both models are trained on large amounts of well-formed text, e.g. from Wikipedia. I haven't been able to find any processed training samples or a preprocessing tutorial for ELMo or BERT. My question is: how should I clean and preprocess this kind of data before training?
BERT uses WordPiece embeddings, and subword tokenization of this kind is somewhat robust to dirty data; SentencePiece (https://github.com/google/sentencepiece) is a related open-source subword tokenizer from Google.
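As a minimal sketch (not the exact BERT pipeline), you could train a SentencePiece model directly on your notes; the file name notes.txt and the vocabulary size below are placeholder assumptions you would adapt:

    # Sketch: train a SentencePiece subword model on raw notes
    # (assumes your notes sit in "notes.txt", one note per line;
    #  vocab_size is a guess to tune for your corpus)
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input="notes.txt",        # hypothetical path to your raw notes
        model_prefix="notes_sp",  # writes notes_sp.model / notes_sp.vocab
        vocab_size=8000,          # tune to corpus size
        model_type="unigram",     # or "bpe"
    )

    sp = spm.SentencePieceProcessor(model_file="notes_sp.model")
    print(sp.encode("pt c/o headache, no fevr", out_type=str))
    # misspellings like "fevr" still decompose into known subword pieces

The point of subwords here is that a typo does not produce an out-of-vocabulary token; it just splits into smaller pieces.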
Google Research also provides the data preprocessing used for BERT in their code: https://github.com/google-research/bert/blob/master/tokenization.py
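A hedged usage sketch, assuming you have copied tokenization.py from that repo onto your path and have a WordPiece vocabulary file (vocab.txt here is a placeholder, e.g. from a pretrained checkpoint):

    # Sketch: BERT's reference tokenizer from google-research/bert
    import tokenization  # the file linked above

    tokenizer = tokenization.FullTokenizer(
        vocab_file="vocab.txt", do_lower_case=True)

    tokens = tokenizer.tokenize("Notes typed by differnt people.")
    print(tokens)
    # unknown misspellings fall back to subword pieces or [UNK]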
The default ELMo implementation takes tokens as input (if you provide an untokenized string, it will split it on spaces). Thus spelling correction, deduplication, lemmatization (e.g. with spaCy, https://spacy.io/api/lemmatizer), separating punctuation from tokens, and other standard preprocessing steps may help.
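For instance, a minimal spaCy sketch covering tokenization, punctuation separation, and lemmatization (assumes the small English model is installed via python -m spacy download en_core_web_sm):

    # Sketch: standard preprocessing with spaCy
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The pts were seen twice, notes weren't consistent.")

    # spaCy already separates punctuation from adjacent words;
    # .lemma_ gives the lemmatized form of each token
    tokens = [tok.text for tok in doc if not tok.is_space]
    lemmas = [tok.lemma_ for tok in doc if not tok.is_punct]
    print(tokens)
    print(lemmas)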
You may check standard ways to preprocess text in the NLTK package, https://www.nltk.org/api/nltk.tokenize.html (for example, the Twitter tokenizer, sketched below). Beware that NLTK itself is slow. Many machine learning libraries also provide their own basic preprocessing (https://github.com/facebookresearch/pytext, https://keras.io/preprocessing/text/).
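The Twitter tokenizer is worth a look precisely because it was built for informal, inconsistently formatted text, which sounds close to your notes:

    # Sketch: NLTK's Twitter-aware tokenizer
    from nltk.tokenize import TweetTokenizer

    tk = TweetTokenizer(preserve_case=False, reduce_len=True)
    print(tk.tokenize("pt sooooo tired!!! follow-up in 2wks :-)"))
    # reduce_len=True shortens "sooooo" to "sooo";
    # preserve_case=False lowercases everything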
You may also experiment with feeding BPE encodings or character n-grams as the input.
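Character n-grams are attractive for noisy text because misspelled variants of a word still share most of their n-grams. A toy illustration (the boundary markers and n=3 are just one common convention, not a fixed recipe):

    # Sketch: character n-grams as spelling-robust sub-token features
    def char_ngrams(token, n=3):
        # pad with boundary markers so prefixes/suffixes stay distinct
        padded = f"<{token}>"
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    print(char_ngrams("fever"))  # ['<fe', 'fev', 'eve', 'ver', 'er>']
    print(char_ngrams("fevr"))   # ['<fe', 'fev', 'evr', 'vr>']
    # the misspelled variant still shares most n-grams with the correct one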
It also depends on the amount of data you have: the more data, the smaller the benefit of preprocessing (in my opinion). Given that you want to train ELMo or BERT from scratch, you will need a lot of data anyway.