As I understand it, distant supervision is the process of specifying the concept that the individual words of a passage, usually a sentence, are trying to convey.
For example, a database maintains the structured relationship concerns(NLP, this sentence).
Our distant supervision system would take as input the sentence: "This is a sentence about NLP."
Based on this sentence it would recognize the entities "NLP" and "this sentence", since as a pre-processing step the sentence would have been passed through a named-entity recognizer.
Since our database has it that "NLP" and "this sentence" are related by the bond of concern(s), it would identify the input sentence as expressing the relationship concerns(NLP, this sentence).
My question is twofold:
1) What is the use of that? Is it that later our system might see a sentence in "the wild", such as "That sentence is about OPP", realize that it has seen something similar before, and thereby infer the novel relationship concerns(OPP, that sentence), based only on the words/individual tokens?
2) Does it take into account the actual words of the sentence? The verb 'is' and the preposition 'about', for instance, realizing (through WordNet or some other hyponymy system) that this is somehow similar to the higher-order concept "concerns"?
Does anyone have some code used to generate a distant supervision system that I could look at, i.e. a system that cross-references a KB, such as Freebase, with a corpus, such as the NYTimes, and produces a distant supervision database? I think that would go a long way toward clarifying my conception of distant supervision.
RE 1) Yes, this is exactly right. In the end, what we want is a classifier that takes as input a piece of text and a pair of entity mentions in that text, and tells us what relation holds between those entities in that sentence. Distant supervision is a way of mocking up the training data for this classifier, using a known knowledge base as an indirect ("distant") source of labels. But the end goal is the same as in most machine learning tasks: generalize to new sentences.
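Since you asked for code, here is a minimal sketch of the labeling step in its vanilla form. It is a toy illustration, not a real Freebase/NYTimes pipeline: the `KB` dictionary, the `CORPUS` list, the function name `distant_label`, and the assumption that a named-entity recognizer has already produced normalized mentions for each sentence are all made up for the example.

```python
# Toy knowledge base: (entity_1, entity_2) -> relation. Contents are illustrative only.
KB = {
    ("NLP", "this sentence"): "concerns",
}

# Toy corpus. Each item pairs a sentence with the entity mentions that an
# (assumed, not shown) NER and normalization step would have produced for it.
CORPUS = [
    ("This is a sentence about NLP.", ["this sentence", "NLP"]),
    ("NLP is hard, and this sentence is short.", ["NLP", "this sentence"]),
    ("That sentence is about OPP.", ["that sentence", "OPP"]),
]


def distant_label(corpus, kb):
    """Label every sentence that mentions a KB entity pair with the KB's relation."""
    examples = []
    for text, mentions in corpus:
        for e1 in mentions:
            for e2 in mentions:
                relation = kb.get((e1, e2))
                if relation is not None:
                    examples.append((text, e1, e2, relation))
    return examples


training_data = distant_label(CORPUS, KB)
for example in training_data:
    print(example)
```

Note that the second sentence gets labeled "concerns" purely because both entities co-occur in it, whether or not the sentence really expresses that relation. That noise is the price of getting training data without manual annotation. The third sentence gets no label, since its entity pair is not in the KB; it is the kind of sentence the trained classifier is meant to handle later.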
RE 2) Certainly! Distant supervision only applies to how the training data is generated [1]. Once you've assumed distant supervision, what you're left with is a corpus of (sentence, relation_for_sentence) pairs, and then you extract all of the usual NLP features on the sentence.
[1] To a first approximation -- there are "distantly supervised" models (like MultiR and MIML-RE) which don't directly generate fake training data, but incorporate the supervision indirectly into the training procedure itself. But, even in these, there is a factor in the latent-variable model that amounts to a per-sentence classification, and it's just that the output variable is latent rather than naively "observed" as in vanilla distant supervision.
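To make the second point concrete, here is a rough sketch of how the actual words enter the picture once you have distantly labeled examples. Everything in it is an assumption for illustration: the `TRAIN` examples, the helper names, the choice of "tokens between the two mentions" as the only feature, and the overlap-based `predict`, which is a stand-in for a properly trained classifier rather than what any real system does.

```python
import re


def tokens_between(sentence, mention_1, mention_2):
    """Return lowercase word tokens lying between the two mentions ([] if either is missing)."""
    lowered = sentence.lower()
    i = lowered.find(mention_1.lower())
    j = lowered.find(mention_2.lower())
    if i < 0 or j < 0:
        return []
    # Order the mentions by position so the span between them is well-defined.
    (start, first), (end, _) = sorted([(i, mention_1), (j, mention_2)])
    return re.findall(r"[a-z']+", lowered[start + len(first):end])


def features(sentence, mention_1, mention_2):
    """Very small feature set: one indicator feature per token between the mentions."""
    return {"between=" + tok for tok in tokens_between(sentence, mention_1, mention_2)}


# Hypothetical distantly labeled examples: (sentence, entity_1, entity_2, relation).
TRAIN = [
    ("This sentence is about NLP.", "NLP", "this sentence", "concerns"),
    ("Alice was born in Paris.", "Alice", "Paris", "born_in"),
]


def predict(sentence, mention_1, mention_2, train):
    """Return the relation of the training example sharing the most features, or None."""
    query = features(sentence, mention_1, mention_2)
    best_relation, best_overlap = None, 0
    for text, e1, e2, relation in train:
        overlap = len(query & features(text, e1, e2))
        if overlap > best_overlap:
            best_relation, best_overlap = relation, overlap
    return best_relation


print(predict("That sentence is about OPP.", "OPP", "that sentence", TRAIN))
```

The new sentence shares the tokens "is" and "about" with the "concerns" training example, so that is the relation returned. In practice you would use a much richer feature set (dependency paths, entity types, and so on) and a real learner, but the role the words play is the same.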