postgresql - to_tsvector in simple mode throwing away non-English characters in some setups
On some PostgreSQL installs I am noticing the following happens:
sam=# select '你好 世界'::tsvector;
   tsvector
---------------
 '世界' '你好'
(1 row)

sam=# select to_tsvector('simple', '你好 世界');
 to_tsvector
-------------

(1 row)
even though the database is configured like so:
mba:bin sam$ ./psql -l
                                 List of databases
   Name    | Owner | Encoding |   Collate   |    Ctype    | Access privileges
-----------+-------+----------+-------------+-------------+-------------------
 postgres  | sam   | UTF8     | en_AU.UTF-8 | en_AU.UTF-8 |
 sam       | sam   | UTF8     | en_AU.UTF-8 | en_AU.UTF-8 |
 template0 | sam   | UTF8     | en_AU.UTF-8 | en_AU.UTF-8 | =c/sam           +
           |       |          |             |             | sam=CTc/sam
 template1 | sam   | UTF8     | en_AU.UTF-8 | en_AU.UTF-8 | =c/sam           +
           |       |          |             |             | sam=CTc/sam
(4 rows)
On other, similar setups I am seeing select to_tsvector('simple', '你好 世界');
correctly return the tokens.
How can I diagnose the simple tokeniser and figure out why it is chucking out these letters?
The simplest repro seems to be installing PostgreSQL via Postgres.app. It does not happen when installing PostgreSQL on Ubuntu with the locale set.
Unfortunately, the default parser used by text search depends heavily on how the database was initialized, in particular on lc_collate and the current database object's encoding.
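To check how an existing cluster's databases were initialized, you can query the system catalog directly (a generic diagnostic query, not part of the original post):

-- Show collation, ctype and encoding for every database in the cluster
SELECT datname,
       datcollate,
       datctype,
       pg_encoding_to_char(encoding) AS encoding
FROM pg_database;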
This is due to the inner workings of the default text parser. It is only vaguely documented:
Note: The parser's notion of a "letter" is determined by the database's locale setting, specifically lc_ctype. Words containing only the basic ASCII letters are reported as a separate token type, since it is sometimes useful to distinguish them.
The important part is these comments in the PostgreSQL source code:
/* [...]
 * Notes:
 *   - with multibyte encoding and C-locale isw* function may fail
 *     or give wrong result.
 *   - multibyte encoding and C-locale often are used for
 *     Asian languages.
 *   - if locale is C then we use pgwstr instead of wstr.
 */
And below:
/*
 * any non-ascii symbol with multibyte encoding with C-locale is
 * an alpha character
 */
Consequently, if you want to use the default parser with Chinese, make sure your database is initialized with the C locale and has a multibyte encoding, so that all characters above U+007F will be treated as alpha (including spaces such as the ideographic space U+3000!). Typically, the following initdb call will do what you expect:
initdb --locale=C -E UTF-8
Otherwise, Chinese characters will be skipped and treated as blank.
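If re-running initdb for a whole new cluster is not convenient, a single database with the required properties can also be created from template0 (a minimal sketch; the name newdb is just a placeholder):

-- Per-database locale override, cloned from template0 so the
-- collation/ctype can differ from the cluster default
CREATE DATABASE newdb
  TEMPLATE = template0
  ENCODING = 'UTF8'
  LC_COLLATE = 'C'
  LC_CTYPE = 'C';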
You can check this with the debug function ts_debug. With a database initialized with lc_collate=en_US.UTF-8 or any other configuration for which tokenization fails, you get:
select * from ts_debug('simple', '你好 世界');
 alias |  description  |   token   | dictionaries | dictionary | lexemes
-------+---------------+-----------+--------------+------------+---------
 blank | Space symbols | 你好 世界 | {}           |            |
Conversely, with lc_collate=C and a UTF-8 database (initialized as above), you get the proper result:
select * from ts_debug('simple', '你好 世界');
 alias |    description    | token | dictionaries | dictionary | lexemes
-------+-------------------+-------+--------------+------------+---------
 word  | Word, all letters | 你好  | {simple}     | simple     | {你好}
 blank | Space symbols     |       | {}           |            |
 word  | Word, all letters | 世界  | {simple}     | simple     | {世界}
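As a quick follow-up check on such a C-locale, UTF-8 database, you can also confirm that the lexemes really end up in the tsvector and are matchable (a small sketch, not from the original answer):

-- Both lexemes should appear with positions
SELECT to_tsvector('simple', '你好 世界');

-- A full-text match against one of the words should return true
SELECT to_tsvector('simple', '你好 世界')
       @@ to_tsquery('simple', '世界') AS matches;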
It seems, however, that this means that in order to tokenize Chinese text, the words must already be separated by regular spaces, i.e. the tokenization/segmentation does not happen within PostgreSQL. For that use case, I would suggest using a custom parser. That is especially true if you do not use the other features of PostgreSQL's simple parser, such as tokenizing URLs.
A parser tokenizing only on space characters is easy to implement. In fact, in contrib/test_parser, there is sample code doing just that. This parser will work with whatever locale. There was a buffer overrun bug in this parser that was fixed in 2012, so make sure you use a recent version.
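For illustration, wiring such a parser into a text search configuration looks roughly like the example in the test_parser documentation (assuming the module is built and installed for your PostgreSQL version; testcfg is just an example name):

-- Register the sample space-only parser shipped as test_parser
CREATE EXTENSION test_parser;

-- Build a configuration around it, mapping its word tokens to the simple dictionary
CREATE TEXT SEARCH CONFIGURATION testcfg (PARSER = testparser);
ALTER TEXT SEARCH CONFIGURATION testcfg
  ADD MAPPING FOR word WITH simple;

-- Tokenization now splits purely on whitespace, regardless of locale
SELECT to_tsvector('testcfg', '你好 世界');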