Python string splitlines() removes certain Unicode control characters


I noticed that Python's standard string method splitlines() removes some crucial Unicode control characters as well. Example:

>>> s1 = u'asdf \n fdsa \x1d asdf'
>>> s1.splitlines()
[u'asdf ', u' fdsa ', u' asdf']

Notice how the "\x1d" character quietly disappears.

It doesn't happen if the string s1 is still a Python bytestring, though (without the "u" prefix):

>>> s2 = 'asdf \n fdsa \x1d asdf'
>>> s2.splitlines()
['asdf ', ' fdsa \x1d asdf']

I can't find any information about this in the reference: https://docs.python.org/2.7/library/stdtypes.html#str.splitlines.

Why does this happen? What other characters besides "\x1d" (or unichr(29)) are affected?

I'm using Python 2.7.3 on Ubuntu 12.04 LTS.

This is indeed under-documented; I had to dig through the source code to find it.

The unicodetype_db.h file defines linebreaks as:

case 0x000A:
case 0x000B:
case 0x000C:
case 0x000D:
case 0x001C:
case 0x001D:
case 0x001E:
case 0x0085:
case 0x2028:
case 0x2029:

These are generated from the Unicode database; any codepoint listed in the Unicode standard with the Line_Break property set to BK, CR, LF or NL, or with the bidirectional category set to B (paragraph break), is considered a line break.
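To answer the "which other characters" part empirically, one option is to probe every code point and see whether splitlines() splits on it. The following is only a rough sketch: the helper name `find_line_breaks` and the `limit` parameter are made up for illustration, and the `unichr` shim is there so the same code runs on both Python 2 and Python 3. It only scans the Basic Multilingual Plane by default.

```python
# Python 2/3 compatibility shim: Python 3 has no unichr().
try:
    unichr
except NameError:
    unichr = chr


def find_line_breaks(limit=0x10000):
    """Return the code points below `limit` that splitlines() splits on.

    `find_line_breaks` and `limit` are hypothetical names for this sketch.
    """
    found = []
    for cp in range(limit):
        # Surround the candidate with ordinary letters; if it is treated
        # as a line boundary, the result has more than one element.
        parts = (u'a' + unichr(cp) + u'b').splitlines()
        if len(parts) > 1:
            found.append(cp)
    return found


breaks = find_line_breaks()
print(['0x%04x' % cp for cp in breaks])
```

On a Unicode string this should report the same ten code points as the case list from unicodetype_db.h.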

From the Unicode data file, version 6 of the standard lists U+001D as a paragraph break:

001D;<control>;Cc;0;B;;;;;N;INFORMATION SEPARATOR THREE;;;;

(The 5th column is the bidirectional category.)
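The same property can be checked from Python without opening the data file, using the standard unicodedata module. This is just a quick sketch; the variable name `props` is made up here, and U+001F is included only for contrast, since its bidirectional category is S (segment separator) rather than B.

```python
import unicodedata

# Collect (code point, general category, bidirectional category) for the
# four "information separator" control characters.
props = []
for ch in (u'\x1c', u'\x1d', u'\x1e', u'\x1f'):
    props.append((ord(ch), unicodedata.category(ch), unicodedata.bidirectional(ch)))

for cp, category, bidi in props:
    print('U+%04X  category=%s  bidirectional=%s' % (cp, category, bidi))
```

The three characters with bidirectional category B are the ones splitlines() treats as line breaks; U+001F is left alone.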

You could use a regular expression if you want to limit which characters to split on:

import re

# \n-\r covers U+000A through U+000D; U+2029 is the paragraph separator
linebreaks = re.compile(ur'[\n-\r\x85\u2028\u2029]')
linebreaks.split(yourtext)

This would split your text on the same set of linebreaks except for the U+001C, U+001D and U+001E codepoints, i.e. the three data structuring control characters.
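As a small, self-contained illustration of that approach, here is a variant that also parses on Python 3 (where the ur'' prefix no longer exists, so a plain u'' literal is used). The function name `split_on_linebreaks` is made up for this sketch, and the leading `\r\n` alternative is an added assumption, not part of the pattern above: it makes a CRLF pair count as a single break instead of producing an empty string between the two characters.

```python
import re

# Plain u'' literal: the escapes are resolved by Python itself, so the
# character class contains the real characters U+000A..U+000D, U+0085,
# U+2028 and U+2029. The '\r\n' alternative (an assumption of this
# sketch) is tried first so CRLF is consumed as one separator.
LINEBREAKS = re.compile(u'\r\n|[\n-\r\x85\u2028\u2029]')


def split_on_linebreaks(text):
    """Split on conventional line breaks, leaving U+001C..U+001E untouched."""
    return LINEBREAKS.split(text)


sample = u'asdf \n fdsa \x1d asdf'
result = split_on_linebreaks(sample)
print(result)
```

One difference to keep in mind: unlike splitlines(), re.split() returns a trailing empty string when the text ends with a line break, so you may want to drop that last element.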

