Clause Extraction using Stanford parser

AMisra picture AMisra · Sep 27, 2014 · Viewed 9.3k times · Source

I have a complex sentence and I need to separate it into main and dependent clause. For example for the sentence
ABC cites the fact that chemical additives are banned in many countries and feels they may be banned in this state too.
The split required

1)ABC cites the fact   
2)chemical additives are banned in many countries   
3)ABC feels they may be banned in this state too.    

I think I could use the Stanford Parser tree or dependencies, but I am not sure how to proceed from here.

The tree

(ROOT
  (S
    (NP (NNP ABC))
    (VP (VBZ cites)
      (NP (DT the) (NN fact))
      (SBAR (IN that)
        (S
          (NP (NN chemical) (NNS additives))
          (VP
            (VP (VBP are)
              (VP (VBN banned)
                (PP (IN in)
                  (NP (JJ many) (NNS countries)))))
            (CC and)
            (VP (VBZ feels)
              (SBAR
                (S
                  (NP (PRP they))
                  (VP (MD may)
                    (VP (VB be)
                      (VP (VBN banned)
                        (PP (IN in)
                          (NP (DT this) (NN state)))
                        (ADVP (RB too))))))))))))
    (. .)))

and the collapsed dependency parse

nsubj(cites-2, ABC-1)  
root(ROOT-0, cites-2)  
det(fact-4, the-3)   
dobj(cites-2, fact-4)  
mark(banned-9, that-5)  
nn(additives-7, chemical-6)  
nsubjpass(banned-9, additives-7)   
nsubj(feels-14, additives-7)   
auxpass(banned-9, are-8)   
ccomp(cites-2, banned-9)   
amod(countries-12, many-11)  
prep_in(banned-9, countries-12)   
ccomp(cites-2, feels-14)    
conj_and(banned-9, feels-14)    
nsubjpass(banned-18, they-15)   
aux(banned-18, may-16)    
auxpass(banned-18, be-17)    
ccomp(feels-14, banned-18)   
det(state-21, this-20)    
prep_in(banned-18, state-21)    
advmod(banned-18, too-22)   

Answer

Chthonic Project picture Chthonic Project · Sep 27, 2014

It is probably better if you primarily use the constituenty-based parse tree, and not the dependencies. The dependencies will be helpful, but only after the main work is done! I am going to explain this towards the end of my answer.

This is because constituency-parse is based on phrase structure grammar, which is the most relevant if you are seeking to extract clauses from a sentence. It can be done using dependencies as well, but in that case, you will essentially be reconstructing the phrase structure -- starting from the root and looking at dependent nodes (e.g. ABC and facts are the nominal subject and direct object of the verb cites, and so on ... ).

It is helpful to visualize the parse tree, however. In your example, the clauses are indicated by the SBAR tag, which is a clause introduced by a (possibly empty) subordinating conjunction. All you need to do is the following:

  1. Identify the non-root clausal nodes in the parse tree
  2. Remove (but retain separately) the subtrees rooted at these clausal nodes from the main tree.
  3. In the main tree (after removal of subtrees in step 2), remove any hanging prepositions, subordinating conjunctions and adverbs.

In step 3, what I mean by "hanging" is that any prepositions, etc. whose dependency has been removed in step 2. E.g., from "ABC cites the fact that", you need to remove the preposition/subordinating-conjunction "that" because its dependent node "banned" was removed in step 2. You will thus have three independent clauses:

  • chemical additives are banned in many countries (SBAR removal in step 2)
  • they may be banned in this state too (SBAR removal in step 2)
  • ABC cites the fact (step 3)

The only issue here is the connection ABC--feels. For this, note that both "banned" and "feels" are complements of the verb "cites", and hence, have the same subject, which is "ABC"! And you're done. When this is done, you will get a fourth clause, "ABC feels", which is something you may or may not want to include in your final result.

For a list of all clausal tags (and, in fact, all Penn Treebank tags), see this list: http://www.surdeanu.info/mihai/teaching/ista555-fall13/readings/PennTreebankConstituents.html

For an online parse-tree visualization, you may want to use the online Berkeley parser demo. It helps a lot in forming a better intuition. Here's the image generated for your example sentence: Berkeley Parser Tree

Caveats

  1. Even the best parsers will not always parse sentences correctly, so keep that in mind.
  2. Additionally, many complex sentences involve right node raising, which is almost never parsed correctly by most parsers.
  3. You may need to modify the algorithm a little if a clause is in passive voice.

Apart from these three pitfalls, the above algorithm should work quite accurately.