Is your feature request related to a problem? Please describe.
I only have one particular query in my system: a BooleanQuery of TermQuerys for the should, must, and must not clauses where only the should clauses contribute to scoring via BM25.
I noticed that a BooleanQuery of TermQuerys implements the Block WAND optimization within BooleanWeight::for_each_pruning when it only has should clauses, but it falls back to the generic for_each_pruning_scorer when there are any must or must not clauses.
Describe the solution you'd like
I have a prototype for our usage of Tantivy within Convex that is able to utilize Block WAND even with must or must not clauses, and I'd be curious whether you would be interested in us upstreaming it!
The high level idea is to move the for_each_pruning API from the weights to the scorers. First, we add an optional method to the Scorer trait:
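A minimal self-contained sketch of what that optional method could look like (the name adjust_threshold comes from later in this issue; the exact signature, the stand-in DocSet/Scorer traits, and the toy VecScorer below are illustrative assumptions, not tantivy's actual API):

```rust
type DocId = u32;
type Score = f32;

const TERMINATED: DocId = u32::MAX;

/// Minimal stand-in for tantivy's DocSet trait.
trait DocSet {
    fn doc(&self) -> DocId;
    fn advance(&mut self) -> DocId;
}

/// The proposed extension: `adjust_threshold` defaults to a no-op, so
/// existing scorers remain correct without any changes.
trait Scorer: DocSet {
    fn score(&mut self) -> Score;

    /// Hint that documents scoring <= `threshold` may be skipped from the
    /// current position onward. Callers must pass monotonically increasing
    /// thresholds over the scorer's lifetime.
    fn adjust_threshold(&mut self, _threshold: Score) {}
}

/// Toy scorer over pre-scored (doc, score) pairs that honors the hint.
struct VecScorer {
    hits: Vec<(DocId, Score)>,
    pos: usize,
    threshold: Score,
}

impl VecScorer {
    fn new(hits: Vec<(DocId, Score)>) -> Self {
        VecScorer { hits, pos: 0, threshold: f32::NEG_INFINITY }
    }
}

impl DocSet for VecScorer {
    fn doc(&self) -> DocId {
        self.hits.get(self.pos).map(|h| h.0).unwrap_or(TERMINATED)
    }
    fn advance(&mut self) -> DocId {
        // Skip past any documents at or below the current threshold.
        self.pos += 1;
        while let Some(&(_, score)) = self.hits.get(self.pos) {
            if score > self.threshold {
                break;
            }
            self.pos += 1;
        }
        self.doc()
    }
}

impl Scorer for VecScorer {
    fn score(&mut self) -> Score {
        self.hits[self.pos].1
    }
    fn adjust_threshold(&mut self, threshold: Score) {
        self.threshold = threshold;
    }
}
```

Raising the threshold between calls to advance lets the scorer prune low-scoring documents internally, without the caller ever seeing them.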
This method mutates the document set, filtering out documents whose score is less than or equal to threshold past the current position. The caller must call this method with monotonically increasing thresholds over the lifetime of the scorer.
Then, the protocol between the collector and the scorer moves from being callback-based (like for_each_pruning_scorer today) to alternating between advancing the document set, updating the collector, and updating the thresholds.
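As a toy sketch of that alternation, here is a top-1 "collector" driving a scorer with the proposed threshold method (ProtoScorer and collect_top1 are invented for illustration; a real top-k collector would maintain a heap):

```rust
type DocId = u32;
type Score = f32;

const TERMINATED: DocId = u32::MAX;

/// Toy scorer over pre-scored (doc, score) pairs; `adjust_threshold`
/// mirrors the optional method proposed above.
struct ProtoScorer {
    hits: Vec<(DocId, Score)>,
    pos: usize,
    threshold: Score,
}

impl ProtoScorer {
    fn new(hits: Vec<(DocId, Score)>) -> Self {
        ProtoScorer { hits, pos: 0, threshold: f32::NEG_INFINITY }
    }
    fn doc(&self) -> DocId {
        self.hits.get(self.pos).map(|h| h.0).unwrap_or(TERMINATED)
    }
    fn score(&self) -> Score {
        self.hits[self.pos].1
    }
    fn adjust_threshold(&mut self, threshold: Score) {
        self.threshold = threshold;
    }
    fn advance(&mut self) -> DocId {
        self.pos += 1;
        while let Some(&(_, s)) = self.hits.get(self.pos) {
            if s > self.threshold {
                break;
            }
            self.pos += 1;
        }
        self.doc()
    }
}

/// The alternating protocol: advance the document set, update the
/// collector's best match, then feed the new threshold back down.
fn collect_top1(scorer: &mut ProtoScorer) -> (DocId, Score) {
    let mut best = (TERMINATED, f32::NEG_INFINITY);
    while scorer.doc() != TERMINATED {
        if scorer.score() > best.1 {
            best = (scorer.doc(), scorer.score());
        }
        // Thresholds only ever increase, satisfying the contract above.
        scorer.adjust_threshold(best.1);
        scorer.advance();
    }
    best
}
```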
Next, we implement a specialized WeakAnd scorer that takes most of the logic from today's block_wand.rs.
```rust
pub struct WeakAnd {
    scorers: Vec<TermScorerWithMaxScore>,
    threshold: Score,
    // Invariant: If `scorers` is nonempty, this is present with the current
    // match.
    current_match: Option<WeakAndMatch>,
}

struct WeakAndMatch {
    len: usize,
    doc: DocId,
    score: Score,
}
```

(Full source: https://gist.github.com/sujayakar/c742f07dd99089f4c529a242c3b5ac9c)
The idea here is that we run the Block WAND algorithm for generating pivot candidates in both DocSet::advance and DocSet::seek. So, a layer above can call WeakAnd::seek on some doc: DocId, and it will then resume where it left off, finding the next pivot after doc that's better than the current threshold.
Then, this approach composes well with query::intersection::Intersection: If we're computing the intersection of WeakAnd with a single TermScorer (which we'll assume has a smaller size_hint), the intersection will...
1. Iterate to the next document on the TermScorer to get a candidate.
2. Advance the WeakAnd to this candidate, returning seek_doc. This step potentially skips over many DocIds in the union if the intersection is more selective.
3. If seek_doc > candidate, advance the TermScorer and retry. In this condition, the WeakAnd having a high threshold potentially lets us skip over many DocIds in the intersection.
4. Otherwise, emit candidate.
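The loop above can be sketched over toy sorted doc-id lists (VecDocSet, its seek semantics, and next_intersection are illustrative assumptions, not tantivy's API; a real WeakAnd::seek would additionally skip pivots that cannot beat the current threshold):

```rust
type DocId = u32;

const TERMINATED: DocId = u32::MAX;

/// Toy doc set over a sorted list of doc ids, standing in for both the
/// TermScorer side and the WeakAnd side of the intersection.
struct VecDocSet {
    docs: Vec<DocId>,
    pos: usize,
}

impl VecDocSet {
    fn new(docs: Vec<DocId>) -> Self {
        VecDocSet { docs, pos: 0 }
    }
    fn doc(&self) -> DocId {
        self.docs.get(self.pos).copied().unwrap_or(TERMINATED)
    }
    fn advance(&mut self) -> DocId {
        self.pos += 1;
        self.doc()
    }
    /// Move to the first document >= `target` and return it.
    fn seek(&mut self, target: DocId) -> DocId {
        while self.doc() < target {
            self.pos += 1;
        }
        self.doc()
    }
}

/// One round of the intersection loop: drive candidates from the (smaller)
/// term side, and let the WeakAnd side either confirm the candidate or
/// skip past it.
fn next_intersection(term: &mut VecDocSet, weak_and: &mut VecDocSet) -> DocId {
    loop {
        let candidate = term.doc();
        if candidate == TERMINATED {
            return TERMINATED;
        }
        let seek_doc = weak_and.seek(candidate);
        if seek_doc == candidate {
            return candidate;
        }
        // seek_doc > candidate: the WeakAnd skipped ahead, so catch up.
        term.seek(seek_doc);
    }
}
```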
A similar approach works for Exclude, which we use for the must not clauses.
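A matching sketch for the Exclude side, again over toy sorted doc-id cursors (Cursor and next_included are assumed names for illustration, not tantivy's API):

```rust
type DocId = u32;

const TERMINATED: DocId = u32::MAX;

/// Toy cursor over a sorted list of doc ids.
struct Cursor {
    docs: Vec<DocId>,
    pos: usize,
}

impl Cursor {
    fn new(docs: Vec<DocId>) -> Self {
        Cursor { docs, pos: 0 }
    }
    fn doc(&self) -> DocId {
        self.docs.get(self.pos).copied().unwrap_or(TERMINATED)
    }
    fn advance(&mut self) -> DocId {
        self.pos += 1;
        self.doc()
    }
    /// Move to the first document >= `target` and return it.
    fn seek(&mut self, target: DocId) -> DocId {
        while self.doc() < target {
            self.pos += 1;
        }
        self.doc()
    }
}

/// Exclude-style composition for must not clauses: yield documents from
/// `included` that do not appear in `excluded`, seeking the excluded set
/// forward lazily rather than iterating it in full.
fn next_included(included: &mut Cursor, excluded: &mut Cursor) -> DocId {
    loop {
        let candidate = included.doc();
        if candidate == TERMINATED {
            return TERMINATED;
        }
        if excluded.seek(candidate) != candidate {
            return candidate;
        }
        // The candidate is excluded: move on.
        included.advance();
    }
}
```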
I haven't tested this approach rigorously yet, and I also haven't evaluated how it performs compared to the current callback-based API. But if you are interested in this work, I can clean it up and start evaluating it. I'd also be curious how this adjust_threshold approach works with other score combiners.