Python string splitlines() removes certain Unicode control characters
I noticed that Python's standard string method splitlines() removes some crucial Unicode control characters as well. Example:
>>> s1 = u'asdf \n fdsa \x1d asdf'
>>> s1.splitlines()
[u'asdf ', u' fdsa ', u' asdf']
Notice how the "\x1d" character quietly disappears.
It doesn't happen if the string is a Python bytestring, though (without the "u" prefix):
>>> s2 = 'asdf \n fdsa \x1d asdf'
>>> s2.splitlines()
['asdf ', ' fdsa \x1d asdf']
I can't find any information about this in the reference at https://docs.python.org/2.7/library/stdtypes.html#str.splitlines.
Why does this happen? What other characters besides "\x1d" (or unichr(29)) are affected?
I'm using Python 2.7.3 on Ubuntu 12.04 LTS.
This is indeed under-documented; I had to dig through the source code to find it.
The unicodetype_db.h file defines linebreaks as:
case 0x000a: case 0x000b: case 0x000c: case 0x000d: case 0x001c: case 0x001d: case 0x001e: case 0x0085: case 0x2028: case 0x2029:
These are generated from the Unicode database; any codepoint listed in the Unicode standard with the Line_Break property set to BK, CR, LF or NL, or with the bidirectional category set to B (paragraph break), is considered a line break.
From the Unicode data file, version 6 of the standard lists U+001D as a paragraph break:
001D;<control>;Cc;0;B;;;;;N;INFORMATION SEPARATOR THREE;;;;
(The 5th column is the bidirectional category.)
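To see which of these codepoints fall into which category on your own interpreter, you could query the standard unicodedata module. The following is only a rough sketch, written for Python 3, where every str is already a Unicode string (on Python 2 you would use unichr() instead of chr()). The helper names describe_codepoint and splits_line are made up for this illustration, and the reported categories depend on the Unicode version bundled with your Python build.

```python
import unicodedata

# Codepoints copied from the unicodetype_db.h case list quoted above.
LINEBREAK_CODEPOINTS = [0x0A, 0x0B, 0x0C, 0x0D, 0x1C, 0x1D, 0x1E,
                        0x85, 0x2028, 0x2029]


def describe_codepoint(codepoint):
    """Hypothetical helper: (label, general category, bidirectional category)."""
    char = chr(codepoint)  # use unichr() on Python 2
    return ("U+%04X" % codepoint,
            unicodedata.category(char),
            unicodedata.bidirectional(char))


def splits_line(codepoint):
    """Hypothetical helper: True if str.splitlines() breaks on this codepoint."""
    return ("a" + chr(codepoint) + "b").splitlines() == ["a", "b"]


for cp in LINEBREAK_CODEPOINTS:
    print(describe_codepoint(cp), splits_line(cp))
```

For U+001D this reports the general category Cc and the bidirectional category B, matching the data file line above.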
You could use a regular expression if you want to limit which characters to split on:
import re
linebreaks = re.compile(ur'[\n-\r\x85\u2028\u2029]')
linebreaks.split(yourtext)
This would split the text on the same set of linebreaks except for the U+001C, U+001D and U+001E codepoints, the three data structuring control characters.
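One caveat with a plain character class is that a "\r\n" pair is matched as two separate breaks, so you get an empty string between them, and a trailing line break yields a trailing empty string, neither of which splitlines() does. A possible variant that stays closer to splitlines() is sketched below. It is written for Python 3 and is not a drop-in replacement; the names LINEBREAKS and split_lines_keeping_separators are invented for this example.

```python
import re

# "\r\n" comes first so a Windows line ending counts as a single break.
# U+001C, U+001D and U+001E are deliberately left out of the character class.
LINEBREAKS = re.compile(r'\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]')


def split_lines_keeping_separators(text):
    """Hypothetical helper: split on line breaks but keep U+001C..U+001E."""
    lines = LINEBREAKS.split(text)
    # Like str.splitlines(), do not report an empty final line when the
    # text is empty or ends with a line break.
    if lines and lines[-1] == '':
        lines.pop()
    return lines


print(split_lines_keeping_separators('asdf \n fdsa \x1d asdf'))
```

With the string from the question, the "\x1d" character now stays inside the second line instead of disappearing.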