I am doing a data cleaning exercise in Python, and the text I am cleaning contains Italian words that I would like to remove. I have been searching online to find out whether I can do this in Python with a toolkit like NLTK.
For example, given the text:
"Io andiamo to the beach with my amico."
I would like to be left with:
"to the beach with my"
Does anyone know how this could be done? Any help would be much appreciated.
You can use the words corpus from NLTK:
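import nltk
# One-time setup: nltk.download('words') fetches the word list.
words = set(nltk.corpus.words.words())
sent = "Io andiamo to the beach with my amico."
# Keep tokens found in the English word list; non-alphabetic tokens
# (punctuation, numbers) pass through unchanged.
" ".join(w for w in nltk.wordpunct_tokenize(sent)
         if w.lower() in words or not w.isalpha())
# 'Io to the beach with my .'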
Unfortunately, Io happens to be an English word. In general, it may be hard to decide whether a word is English or not.
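One way to work around false positives like Io is to wrap the filter in a function that also accepts a set of words to exclude. The following is only a minimal sketch under a few assumptions: keep_known_words, its exclude parameter, and toy_vocab are hypothetical names introduced here, and the tiny vocabulary stands in for the NLTK word list so the example runs without downloading any data. The regular expression is the same pattern that wordpunct_tokenize uses.

```python
import re

# Same pattern NLTK's wordpunct_tokenize uses: runs of word characters,
# or runs of punctuation.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")

def keep_known_words(text, vocabulary, exclude=frozenset()):
    """Drop alphabetic tokens that are missing from `vocabulary` or listed in `exclude`.

    Hypothetical helper; the lookup is lowercased, mirroring the answer above.
    """
    kept = []
    for token in TOKEN_RE.findall(text):
        if not token.isalpha():
            kept.append(token)  # punctuation and numbers pass through
        elif token.lower() in vocabulary and token.lower() not in exclude:
            kept.append(token)
    return " ".join(kept)

# Toy vocabulary for illustration only; in practice you would pass
# set(nltk.corpus.words.words()) instead.
toy_vocab = {"io", "to", "the", "beach", "with", "my"}
sent = "Io andiamo to the beach with my amico."

print(keep_known_words(sent, toy_vocab))                  # Io to the beach with my .
print(keep_known_words(sent, toy_vocab, exclude={"io"}))  # to the beach with my .
```

The exclusion set has to be maintained by hand, so this only helps when the overlapping words are known in advance. It does not solve the general problem of deciding whether a word is English.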