java - What is the beginning and end of a Lucene token stream when parsing a query?


We're trying to implement a custom filter that needs to remember the tokens it has already processed within the same query. We tried overriding the end() and/or reset() methods of TokenFilter, but found that these methods are called between each token.

This is contrary to our expectation that end() and/or reset() would be called at the beginning or end of the token stream representing the whole query. The unexpected behavior can be reproduced with the example code below.

(Solr) schema snippet:

<fieldType name="text_general" class="solr.TextField">
  <analyzer type="index">
    ...
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="com.foobar.solr.CustomFilterFactory" />
  </analyzer>
</fieldType>

Filter implementation:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CustomFilter extends TokenFilter {
    CharTermAttribute termAttribute = addAttribute(CharTermAttribute.class);

    public CustomFilter(TokenStream in) {
        super(in);
    }

    @Override
    public boolean incrementToken() throws IOException {
        System.out.println("### increment token pre loop: " + termAttribute.toString());

        // Drain the whole upstream stream to log every token it produces.
        while (input.incrementToken()) {
            System.out.println("### increment token looping through input: " + termAttribute.toString());
        }

        // Reproduction only: no token is ever passed downstream.
        return false;
    }

    @Override
    public void end() throws IOException {
        System.out.println("### end");
        super.end();
    }

    @Override
    public void reset() throws IOException {
        System.out.println("### reset");
        super.reset();
    }
}

Log output for the query "foo bar":

### reset
### increment token pre loop:
### increment token looping through input: foo
### end
### reset
### increment token pre loop:
### increment token looping through input: bar
### end

Why are the end() and reset() methods called for each token instead of once for the complete query?

Edit: Or why is input.incrementToken() returning false after the first token has been processed?

Since I noticed you're using Solr, you need to understand that the query parser splits the query on whitespace, and that this takes precedence over the analyzer: if you query "foo bar", then "foo" and "bar" are passed separately through the analyzer chain. You can bypass this behavior by making "foo bar" a phrase query, i.e. by adding quotes: \"foo bar\"

Edit: For clarification, a phrase query takes precedence over the query parser's whitespace splitting, and is defined by wrapping a sequence of tokens inside quote characters.
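
To make the effect on reset() and end() concrete, here is a minimal sketch in plain Java that only simulates the behavior described above. It does not use Lucene or Solr: QuerySplitSimulation, parse and analyze are hypothetical names, and the splitting rule is a deliberate simplification (split on whitespace unless the whole query is wrapped in quotes), not the real query parser logic.

```java
import java.util.ArrayList;
import java.util.List;

public class QuerySplitSimulation {

    /**
     * Hypothetical stand-in for one pass through the analyzer chain.
     * Logs a single reset/end pair around the tokens of the given text.
     */
    static List<String> analyze(String text, List<String> log) {
        log.add("reset");
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            log.add("token:" + token);
            tokens.add(token);
        }
        log.add("end");
        return tokens;
    }

    /**
     * Hypothetical stand-in for the query parser (simplified assumption):
     * a fully quoted query is analyzed as one unit, anything else is split
     * on whitespace first and each piece is analyzed on its own.
     */
    static List<String> parse(String query) {
        List<String> log = new ArrayList<>();
        String q = query.trim();
        boolean quoted = q.length() >= 2 && q.startsWith("\"") && q.endsWith("\"");
        if (quoted) {
            analyze(q.substring(1, q.length() - 1), log);
        } else {
            for (String clause : q.split("\\s+")) {
                analyze(clause, log);
            }
        }
        return log;
    }

    public static void main(String[] args) {
        // Unquoted: one reset/end pair per word, as in the log output above.
        System.out.println(parse("foo bar"));
        // Quoted phrase: both tokens share a single reset/end pair.
        System.out.println(parse("\"foo bar\""));
    }
}
```

In this simulation the unquoted query produces a reset/end pair around each word, while the quoted phrase produces one pair around both tokens. Under that assumption, state kept inside a filter instance only spans multiple tokens when they reach the analyzer as a single stream.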

