java - What is the beginning and end of a Lucene token stream when parsing a query?
We're trying to implement a custom filter that needs to remember the tokens it has already processed within the same query. We tried overriding the end() and/or reset() methods of TokenFilter, but found that these methods are called between each token.
This is contrary to our expectation that end() and/or reset() would be called at the beginning or end of the token stream representing the whole query. The unexpected behavior can be reproduced with the example code below.
(Solr) schema snippet:

    <fieldType name="text_general" class="solr.TextField">
      <analyzer type="index">
        ...
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="com.foobar.solr.CustomFilterFactory" />
      </analyzer>
    </fieldType>
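Filter implementation:

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class CustomFilter extends TokenFilter {

        private final CharTermAttribute termAttribute = addAttribute(CharTermAttribute.class);

        public CustomFilter(TokenStream in) {
            super(in);
        }

        // Debug implementation: drains the wrapped stream and logs every term it sees.
        @Override
        public boolean incrementToken() throws IOException {
            System.out.println("### increment token pre loop: " + termAttribute.toString());
            while (input.incrementToken()) {
                System.out.println("### increment token looping through input: " + termAttribute.toString());
            }
            return false;
        }

        // Logs when the consumer signals the end of the stream.
        @Override
        public void end() throws IOException {
            System.out.println("### end");
            super.end();
        }

        // Logs when the consumer (re)starts the stream.
        @Override
        public void reset() throws IOException {
            System.out.println("### reset");
            super.reset();
        }
    }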
Log output for the query "foo bar":

    ### reset
    ### increment token pre loop:
    ### increment token looping through input: foo
    ### end
    ### reset
    ### increment token pre loop:
    ### increment token looping through input: bar
    ### end
Why are the end() and reset() methods called for each token instead of once for the complete query?
Edit: Or rather, why is input.incrementToken() returning false after the first token has been processed?
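Since I noticed you're using Solr, you need to understand that the query parser splits the query on whitespace, and that this takes precedence over the analyzer: if you query "foo bar", then "foo" and "bar" are passed separately through the analyzer chain. That is why each term gets its own reset()/end() pair, and why input.incrementToken() returns false after a single token. You can bypass this behavior by making "foo bar" a phrase query, i.e. by adding quotes: \"foo bar\"
Edit: For clarification, a phrase query takes precedence over the query parser's whitespace splitting, and is defined by wrapping a sequence of tokens in quote characters.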
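To make the order of calls concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene or Solr code: the class QuerySplitSketch and its methods parse and analyze are hypothetical names invented for illustration, and the quote handling is deliberately simplistic. It only mimics the behavior described above, where an unquoted query is split on whitespace before analysis and a quoted phrase reaches the analysis chain as one piece of text.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not Lucene or Solr code) of how a query string reaches an analysis chain. */
public class QuerySplitSketch {

    /**
     * Stands in for one pass through the analysis chain: one reset, one entry per
     * whitespace-separated token, one end. The log entries play the role of the
     * println calls in the CustomFilter from the question.
     */
    static void analyze(String text, List<String> log) {
        log.add("reset");
        for (String token : text.trim().split("\\s+")) {
            log.add("token:" + token);
        }
        log.add("end");
    }

    /**
     * Hypothetical parser: a query wrapped in double quotes is analyzed as a single
     * phrase; anything else is split on whitespace first and each clause is analyzed
     * on its own.
     */
    static List<String> parse(String query) {
        List<String> log = new ArrayList<>();
        boolean isPhrase = query.length() >= 2
                && query.startsWith("\"")
                && query.endsWith("\"");
        if (isPhrase) {
            analyze(query.substring(1, query.length() - 1), log);
        } else {
            for (String clause : query.trim().split("\\s+")) {
                analyze(clause, log);
            }
        }
        return log;
    }

    public static void main(String[] args) {
        // prints [reset, token:foo, end, reset, token:bar, end]
        System.out.println(parse("foo bar"));
        // prints [reset, token:foo, token:bar, end]
        System.out.println(parse("\"foo bar\""));
    }
}
```

The first call reproduces the shape of the log in the question: two separate reset/end pairs, each wrapping a single token. The second shows what the filter would see for a phrase query, where both tokens arrive between one reset and one end, so state kept inside the filter can span the whole phrase.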