Question 1: In Lucene's SpanNearQuery (or span_near in Elasticsearch), what is the exact meaning of slop? Is it the number of words separating the two matching words, or is it the number of separating words plus 1?
For example, suppose your indexed text is: foo bar biz
Which queries would match this text: "foo biz"~0, "foo biz"~1, "foo biz"~2?
I would expect that the first wouldn't match and the last would. But what about the middle?
Question 2: Now a second, more complex corollary question: how is slop handled if there are more than two search clauses? Is it applied to each pair of clauses or to any pair of clauses?
For example, suppose you construct a SpanNearQuery with three clauses: foo, bar, biz. What slop is needed to match the same indexed text above? I would expect a slop of 2 definitely would, but what about 0 or 1?
Similarly, with the same three-clause query, what slop is needed to match the text foo bar ble biz?
Question 1: Slop is the number of words separating the span clauses, so slop 0 means they are adjacent. In the example I gave, foo and biz are separated by one word (bar), so a slop of 1 or 2 would match and a slop of 0 would not.
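To make the Question 1 case concrete, here is a minimal sketch of the equivalent Elasticsearch span_near request body, built as a Python dict. The field name body and the helper name span_near are placeholders assumed for illustration, and in_order is set to true only to mirror the word order in the example.

```python
import json


def span_near(terms, slop, field="body", in_order=True):
    """Build a span_near query body; `field` is a hypothetical field name."""
    return {
        "query": {
            "span_near": {
                # one span_term clause per search term
                "clauses": [{"span_term": {field: t}} for t in terms],
                "slop": slop,
                "in_order": in_order,
            }
        }
    }


# "foo" and "biz" with at most one unmatched word allowed between them
query = span_near(["foo", "biz"], slop=1)
print(json.dumps(query, indent=2))
```

Changing only the slop argument reproduces the three variants asked about in the question.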
Question 2: When there are more than two span near clauses, slop is not checked between every pair of clauses. It limits the number of unmatched positions inside the whole span, that is, words lying between the first and last matched clause that are not themselves matched by a clause. Two clauses can therefore be further apart than slop words, as long as the unmatched words inside the span total no more than slop.
For the first example in question 2: slop of 0, 1, and 2 would all match. Slop of 0 matches even though foo and biz are not adjacent, because the only word between them, bar, is itself a matched clause: foo is adjacent to bar, and bar is adjacent to biz.
For the second example in question 2: slop of 0 would not match because biz is separated from every other clause by at least one unmatched word (ble). Slop of 1 would match because foo and bar are adjacent (0 slop) and bar and biz are separated by one word (1 slop). It matches even though foo and biz are separated by more than one word, because bar, which lies between them, is a matched clause. Slop of 2 would also match.
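The examples above can be checked with a small model of the slop computation. This is a sketch, not Lucene's code. It assumes single-word clauses that each occur once in the text, in-order matching, one position per whitespace-separated word, and that slop is compared against the number of unmatched positions inside the span (Lucene's documentation describes slop as the maximum number of intervening unmatched positions). The function names are made up for illustration.

```python
def required_slop(text, terms):
    """Smallest slop at which `terms` match `text` in order, or None.

    Simplified model: one position per whitespace-separated word, each
    term occurs at most once, and clauses must appear in the given order.
    """
    words = text.split()
    try:
        positions = [words.index(t) for t in terms]
    except ValueError:
        return None  # some term is missing from the text
    if positions != sorted(positions):
        return None  # terms appear out of order
    span_width = positions[-1] - positions[0] + 1
    # positions inside the span that no clause matched
    return span_width - len(terms)


def matches(text, terms, slop):
    needed = required_slop(text, terms)
    return needed is not None and needed <= slop


print(required_slop("foo bar biz", ["foo", "biz"]))             # 1
print(required_slop("foo bar biz", ["foo", "bar", "biz"]))      # 0
print(required_slop("foo bar ble biz", ["foo", "bar", "biz"]))  # 1
```

Under these assumptions, matches("foo bar ble biz", ["foo", "bar", "biz"], 0) is False and the same call with slop 1 or 2 is True, which agrees with the answers given above.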