What is the CoNLL data format?

swapna sourav rout · Dec 11, 2014 · Viewed 33.3k times

I am new to text mining. I am using an open-source jar (Mate Parser) that gives me output in CoNLL 2009 format after dependency parsing. I want to use the dependency parsing results for information extraction. I can understand some of the output, but I cannot make sense of the CoNLL data format as a whole. Can anyone help me understand the CoNLL data format? Any pointers would be appreciated.

Answer

dmcc · Dec 11, 2014

There are many different CoNLL formats, since CoNLL runs a different shared task each year. The format for CoNLL 2009 is described in the documentation for that year's shared task. Each line represents a single word as a series of tab-separated fields. An underscore (_) indicates an empty value. Mate-Parser's manual says that it uses the first 12 columns of CoNLL 2009:

ID FORM LEMMA PLEMMA POS PPOS FEAT PFEAT HEAD PHEAD DEPREL PDEPREL
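
To make the line layout concrete, here is a minimal Python sketch that reads such text into one dict per word. It assumes sentences are separated by blank lines (the usual convention in CoNLL files). The function name `parse_conll09` and the sample sentence are made up for illustration; the sample is hand-written, not real Mate-Parser output.

```python
# Names of the first 12 CoNLL 2009 columns, in order.
COLUMNS = ["ID", "FORM", "LEMMA", "PLEMMA", "POS", "PPOS",
           "FEAT", "PFEAT", "HEAD", "PHEAD", "DEPREL", "PDEPREL"]


def parse_conll09(text):
    """Split CoNLL 2009 text into sentences; each token is a dict keyed by column name."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # Blank line: assumed to mark the end of the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        # zip stops at the shorter sequence, so any columns past the 12th are ignored.
        current.append(dict(zip(COLUMNS, fields)))
    if current:
        sentences.append(current)
    return sentences


# Hand-written illustration (hypothetical tags and labels), not real parser output.
SAMPLE = (
    "1\tJohn\tjohn\tjohn\tNNP\tNNP\t_\t_\t2\t2\tSBJ\tSBJ\n"
    "2\tsleeps\tsleep\tsleep\tVBZ\tVBZ\t_\t_\t0\t0\tROOT\tROOT\n"
    "\n"
    "1\tHello\thello\thello\tUH\tUH\t_\t_\t0\t0\tROOT\tROOT\n"
)

sentences = parse_conll09(SAMPLE)
for token in sentences[0]:
    print(token["ID"], token["FORM"], token["PPOS"], token["PHEAD"], token["PDEPREL"])
```

All values are kept as strings here; convert ID and HEAD with `int()` when you need to follow links between words.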

The definitions of some of these columns come from earlier shared tasks (the CoNLL-X format used in 2006 and 2007):

  • ID (index in sentence, starting at 1)
  • FORM (word form itself)
  • LEMMA (word's lemma or stem)
  • POS (part of speech)
  • FEAT (list of morphological features separated by |)
  • HEAD (index of syntactic parent, 0 for ROOT)
  • DEPREL (syntactic relationship between HEAD and this word)
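
For information extraction, the ID, HEAD and DEPREL columns are typically the ones that matter, because together they encode the dependency tree. The following is a rough sketch of turning one sentence into (head word, relation, dependent word) triples. The helper name `dependency_triples`, the example sentence and its relation labels are hypothetical, and the tokens are assumed to already be available as dicts keyed by column name.

```python
def dependency_triples(tokens):
    """Return (head_form, relation, dependent_form) for each token of one sentence."""
    # IDs start at 1, so map each ID to its word form; HEAD 0 refers to the root.
    form_by_id = {int(tok["ID"]): tok["FORM"] for tok in tokens}
    form_by_id[0] = "ROOT"
    return [(form_by_id[int(tok["HEAD"])], tok["DEPREL"], tok["FORM"])
            for tok in tokens]


# Hypothetical sentence; the relation labels are illustrative only.
sentence = [
    {"ID": "1", "FORM": "John", "HEAD": "2", "DEPREL": "SBJ"},
    {"ID": "2", "FORM": "eats", "HEAD": "0", "DEPREL": "ROOT"},
    {"ID": "3", "FORM": "apples", "HEAD": "2", "DEPREL": "OBJ"},
]

triples = dependency_triples(sentence)
for head, relation, dependent in triples:
    print(f"{relation}({head}, {dependent})")
```

Triples like these (for example, a subject and an object attached to the same verb) are a common starting point for simple relation extraction.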

Variants of those columns that start with P (e.g., PPOS rather than POS) indicate that the value was automatically predicted rather than taken from the gold standard.
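
When reading parser output rather than annotated training data, the predicted columns are often the ones that are filled in. One way to handle both cases is a small helper that prefers the predicted value and falls back to the gold one. This is only a sketch: the name `best_value` and the example token are assumptions, and whether a given tool fills the gold or the predicted column should be checked against its actual output.

```python
EMPTY = "_"  # CoNLL marker for an empty value


def best_value(token, column):
    """Return the predicted (P-prefixed) value if present, otherwise the gold value."""
    predicted = token.get("P" + column, EMPTY)
    if predicted != EMPTY:
        return predicted
    return token.get(column, EMPTY)


# Hypothetical token: gold POS is empty, predicted HEAD is empty.
token = {"POS": "_", "PPOS": "NNP", "HEAD": "2", "PHEAD": "_"}

pos = best_value(token, "POS")        # taken from PPOS
head = best_value(token, "HEAD")      # PHEAD is empty, so falls back to HEAD
deprel = best_value(token, "DEPREL")  # neither column is present
print(pos, head, deprel)
```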

Update: There is now also a CoNLL-U data format, which extends the CoNLL-X format.