Given a text, for example a newspaper article in Catalan, how can I find all the cities mentioned in it?
I have been looking at the NLTK package for Python and have downloaded the Catalan corpus (nltk.corpus.cess_cat).
So far I have installed everything needed via nltk.download(). Here is an example of what I have at the moment:
import nltk
te = nltk.word_tokenize('Tots els gats son de Sant Cugat del Valles.')
nltk.pos_tag(te)
The city is 'Sant Cugat del Valles'. What I get from the output is:
[('Tots', 'NNS'),
('els', 'NNS'),
('gats', 'NNS'),
('son', 'VBP'),
('de', 'IN'),
('Sant', 'NNP'),
('Cugat', 'NNP'),
('del', 'NN'),
('Valles', 'NNP')]
NNP seems to indicate nouns whose first letter is uppercase. Is there a way of getting only places or cities, rather than all proper nouns? Thank you.
You can use the geotext Python library for this.
pip install geotext
is all it takes to install this library. The usage is as simple as:
from geotext import GeoText
places = GeoText("London is a great city")
places.cities
gives the result ['London']
The list of cities covered by this library is not exhaustive, so smaller towns may be missed, but it covers a good number of cities.
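If a place you care about is missing from that list, one possible fallback is to match the text against your own list of place names. The snippet below is only a minimal sketch of that idea, not part of geotext: the CITIES list and the find_cities helper are hypothetical names made up for illustration, and you would need to supply a real list of Catalan place names yourself.

```python
import re

# Hypothetical gazetteer: replace with a real list of place names.
CITIES = ["Sant Cugat del Valles", "Barcelona", "Girona"]

def find_cities(text, cities=CITIES):
    """Return the gazetteer entries found in text, in order of appearance."""
    # Put longer names first so a multi-word name wins over any
    # shorter name it might contain.
    ordered = sorted(cities, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in ordered) + r")\b"
    )
    return pattern.findall(text)

print(find_cities("Tots els gats son de Sant Cugat del Valles."))
# ['Sant Cugat del Valles']
```

This matching is exact and case-sensitive, so spelling variants (for example accented forms such as 'Vallès') would each need their own entry in the list or a normalization step before matching.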