postgresql - to_tsvector in simple mode throwing away non-English characters in some setups
On some PostgreSQL installs I am noticing the following happens:
sam=# select '你好 世界'::tsvector;
   tsvector
---------------
 '世界' '你好'
(1 row)

sam=# select to_tsvector('simple', '你好 世界');
 to_tsvector
-------------

(1 row)
even though the database is configured like so:
mba:bin sam$ ./psql -l
                                 List of databases
   Name    | Owner | Encoding |   Collate   |    Ctype    | Access privileges
-----------+-------+----------+-------------+-------------+-------------------
 postgres  | sam   | UTF8     | en_AU.UTF-8 | en_AU.UTF-8 |
 sam       | sam   | UTF8     | en_AU.UTF-8 | en_AU.UTF-8 |
 template0 | sam   | UTF8     | en_AU.UTF-8 | en_AU.UTF-8 | =c/sam           +
           |       |          |             |             | sam=CTc/sam
 template1 | sam   | UTF8     | en_AU.UTF-8 | en_AU.UTF-8 | =c/sam           +
           |       |          |             |             | sam=CTc/sam
(4 rows)
On other, similar setups I am seeing select to_tsvector('simple', '你好 世界');
correctly return the tokens.
How can I diagnose the simple tokeniser and figure out why it is chucking out these letters?
The simplest repro seems to be installing PostgreSQL via Postgres.app. It does not happen when installing PostgreSQL on Ubuntu with the locale set.
Unfortunately, the default parser used by text search depends heavily on how the database was initialized, in particular on lc_collate and the current database object's encoding.
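To check how an existing cluster's databases were initialized, you can query the system catalog directly (a generic diagnostic query, not part of the original post):

-- Show collation, ctype and encoding for every database in the cluster
SELECT datname,
       datcollate,
       datctype,
       pg_encoding_to_char(encoding) AS encoding
FROM pg_database;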
This is due to the inner workings of the default text parser. It is only vaguely documented:
Note: The parser's notion of a "letter" is determined by the database's locale setting, specifically lc_ctype. Words containing only the basic ASCII letters are reported as a separate token type, since it is sometimes useful to distinguish them.
The important part is these comments in the PostgreSQL source code:
/* [...]
 * Notes:
 *   - with multibyte encoding and C-locale isw* function may fail
 *     or give wrong result.
 *   - multibyte encoding and C-locale often are used for
 *     Asian languages.
 *   - if locale is C then we use pgwstr instead of wstr.
 */
And below:
/*
 * any non-ascii symbol with multibyte encoding with C-locale is
 * an alpha character
 */
Consequently, if you want to use the default parser with Chinese, make sure your database is initialized with the C locale and has a multibyte encoding, so that all characters above U+007F will be treated as alpha (including spaces such as the ideographic space U+3000!). Typically, the following initdb call will do what you expect:
initdb --locale=C -E UTF-8
Otherwise, Chinese characters will be skipped and treated as blank.
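If re-running initdb for a whole new cluster is not convenient, a single database with the required properties can also be created from template0 (a minimal sketch; the name newdb is just a placeholder):

-- Per-database locale override, cloned from template0 so the
-- collation/ctype can differ from the cluster default
CREATE DATABASE newdb
  TEMPLATE = template0
  ENCODING = 'UTF8'
  LC_COLLATE = 'C'
  LC_CTYPE = 'C';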
You can check this with the debug function ts_debug. With a database initialized with lc_collate=en_US.UTF-8 or any other configuration for which tokenization fails, you get:
select * from ts_debug('simple', '你好 世界');
 alias |  description  |   token   | dictionaries | dictionary | lexemes
-------+---------------+-----------+--------------+------------+---------
 blank | Space symbols | 你好 世界 | {}           |            |
Conversely, with lc_collate=C and a UTF-8 database (initialized as above), you get the proper result:
select * from ts_debug('simple', '你好 世界');
 alias |    description    | token | dictionaries | dictionary | lexemes
-------+-------------------+-------+--------------+------------+---------
 word  | Word, all letters | 你好  | {simple}     | simple     | {你好}
 blank | Space symbols     |       | {}           |            |
 word  | Word, all letters | 世界  | {simple}     | simple     | {世界}
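As a quick follow-up check on such a C-locale, UTF-8 database, you can also confirm that the lexemes really end up in the tsvector and are matchable (a small sketch, not from the original answer):

-- Both lexemes should appear with positions
SELECT to_tsvector('simple', '你好 世界');

-- A full-text match against one of the words should return true
SELECT to_tsvector('simple', '你好 世界')
       @@ to_tsquery('simple', '世界') AS matches;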
It seems, however, that this means that in order to tokenize Chinese text, the words must already be separated by regular spaces, i.e. the tokenization/segmentation does not happen within PostgreSQL. For that use case, I would suggest using a custom parser. That is especially true if you do not use the other features of PostgreSQL's simple parser, such as tokenizing URLs.
A parser tokenizing only on space characters is easy to implement. In fact, in contrib/test_parser, there is sample code doing just that. This parser will work with whatever locale. There was a buffer overrun bug in this parser that was fixed in 2012, so make sure you use a recent version.
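For illustration, wiring such a parser into a text search configuration looks roughly like the example in the test_parser documentation (assuming the module is built and installed for your PostgreSQL version; testcfg is just an example name):

-- Register the sample space-only parser shipped as test_parser
CREATE EXTENSION test_parser;

-- Build a configuration around it, mapping its word tokens to the simple dictionary
CREATE TEXT SEARCH CONFIGURATION testcfg (PARSER = testparser);
ALTER TEXT SEARCH CONFIGURATION testcfg
  ADD MAPPING FOR word WITH simple;

-- Tokenization now splits purely on whitespace, regardless of locale
SELECT to_tsvector('testcfg', '你好 世界');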