python - Why is readlines() reading much more than the sizehint?

Background

I am parsing large text files (30GB+) in Python 2.7.6. To speed the process up a bit, I am splitting the files into chunks and farming them out to subprocesses using the multiprocessing library. To do this, I iterate over the file in the main process, record the byte positions where I want to split the input file, and pass those byte positions to the subprocesses, which then open the input file and read in their block using file.readlines(chunk_size). However, I'm finding that the chunks read in seem to be much larger (4x) than the sizehint argument.

The question

Why isn't the sizehint being heeded?

Example code

The following code demonstrates my issue:

    import sys

    # set test chunk size to 2KB
    chunk_size = 1024 * 2

    count = 0
    chunk_start = 0
    chunk_list = []

    fi = open('test.txt', 'r')
    while True:
        # increment chunk counter
        count += 1

        # calculate new chunk end, advance file pointer
        chunk_end = chunk_start + chunk_size
        fi.seek(chunk_end)

        # advance file pointer to end of current line so chunks don't have
        # broken lines
        fi.readline()
        chunk_end = fi.tell()

        # record chunk start and stop positions, chunk number
        chunk_list.append((chunk_start, chunk_end, count))

        # advance start to current end
        chunk_start = chunk_end

        # read a line to confirm we're not past the end of the file
        line = fi.readline()
        if not line:
            break

        # reset file pointer to before the last line read
        fi.seek(chunk_end, 0)

    fi.close()

    # This code represents the action taken by the subprocesses, but each
    # subprocess receives one chunk instead of iterating the list of chunks
    # itself.
    with open('test.txt', 'r', 0) as fi:
        # iterate over chunks
        for chunk in chunk_list:
            chunk_start, chunk_end, chunk_num = chunk

            # advance file pointer to chunk start
            fi.seek(chunk_start, 0)

            # print some notes and read in the chunk
            sys.stdout.write("chunk #{0}: size: {1} start {2} real start: {3} stop {4} "
                  .format(chunk_num, chunk_end-chunk_start, chunk_start, fi.tell(), chunk_end))
            chunk = fi.readlines(chunk_end - chunk_start)
            print("real stop: {0}".format(fi.tell()))

            # write the chunk out to a file for examination
            with open('test_chunk{0}'.format(chunk_num), 'w') as fo:
                fo.writelines(chunk)

Results

I ran this code with an input file (test.txt) of 23.3KB, and it produced the following output:

chunk #1: size: 2052 start 0 real start: 0 stop 2052 real stop: 8193
chunk #2: size: 2051 start 2052 real start: 2052 stop 4103 real stop: 10248
chunk #3: size: 2050 start 4103 real start: 4103 stop 6153 real stop: 12298
chunk #4: size: 2050 start 6153 real start: 6153 stop 8203 real stop: 14348
chunk #5: size: 2050 start 8203 real start: 8203 stop 10253 real stop: 16398
chunk #6: size: 2050 start 10253 real start: 10253 stop 12303 real stop: 18448
chunk #7: size: 2050 start 12303 real start: 12303 stop 14353 real stop: 20498
chunk #8: size: 2050 start 14353 real start: 14353 stop 16403 real stop: 22548
chunk #9: size: 2050 start 16403 real start: 16403 stop 18453 real stop: 23893
chunk #10: size: 2050 start 18453 real start: 18453 stop 20503 real stop: 23893
chunk #11: size: 2050 start 20503 real start: 20503 stop 22553 real stop: 23893
chunk #12: size: 2048 start 22553 real start: 22553 stop 24601 real stop: 23893

Each of the chunk sizes reported is ~2KB, all of the start/stop positions line up the way they should, and the real file position reported by fi.tell() seems correct, so I'm fairly sure my chunking algorithm is good. However, the real stop locations show that readlines() is reading much more than the size hint. Also, output files #1 - #8 are 8.0KB, which is much larger than the size hint.

Even if my attempts to break the chunks only on line ends were wrong, readlines() still shouldn't have to read much more than 2KB + one line. Files #9 - #12 get increasingly smaller, which makes sense since the chunk starting points are closer and closer to the end of the file, and readlines() won't read past the end of the file.

Notes

My test input file just has "< line number >\n" printed on each line, 1-5000.
I tried this again with different chunk and input file sizes, with similar results.
The readlines documentation says that the read sizes may be rounded up to the size of an internal buffer, so I've tried opening the files without buffering (as shown), and it made no difference.
I am using this algorithm to split the file because I need to be able to support *.bz2 and *.gz compressed files, and *.gz files give me no way to identify the uncompressed file size without decompressing the file. *.bz2 files don't either, but I can seek 0 bytes from the end of those and use fi.tell() to get the file size. See my related question.
Before the requirement to support compressed files was added, a previous version of the script used os.path.getsize() as the stopping condition on the chunking loop, and readlines seemed to work fine with that method.
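
The seek-to-end trick mentioned above for *.bz2 files can be sketched as follows. This is only a minimal illustration, assuming bz2.BZ2File from the standard library; bz2_uncompressed_size is a hypothetical helper name, not something from the original script. The seek still has to decompress the stream internally, so it is not free on a large file.

```python
import bz2


def bz2_uncompressed_size(path):
    """Return the uncompressed size of a .bz2 file (hypothetical helper)."""
    with bz2.BZ2File(path, 'r') as fi:
        # whence=2 means "relative to the end of the uncompressed stream"
        fi.seek(0, 2)
        return fi.tell()
```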

The buffer that the readlines documentation mentions isn't related to the buffering that the third argument of the open call controls. The buffer is this buffer in file_readlines:

    static PyObject *
    file_readlines(PyFileObject *f, PyObject *args)
    {
        long sizehint = 0;
        PyObject *list = NULL;
        PyObject *line;
        char small_buffer[SMALLCHUNK];

where SMALLCHUNK is defined earlier:

    #if BUFSIZ < 8192
    #define SMALLCHUNK 8192
    #else
    #define SMALLCHUNK BUFSIZ
    #endif
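
As a rough check on where that constant shows up, the sketch below assumes the 8192 branch is in effect and models readlines() as reading at least SMALLCHUNK bytes and then finishing the line it stopped in. That model is a simplification, not a transcription of the C code. For a file holding the numbers 1-5000, one per line, the predicted offsets equal the "real stop" values reported in the question for chunks #1 - #9. The names line_ends and predicted_stop are made up for this illustration.

```python
import bisect

SMALLCHUNK = 8192  # assumption: the "#define SMALLCHUNK 8192" branch applies

# Byte offset just past each line of a file holding "1\n" .. "5000\n".
line_ends = []
pos = 0
for n in range(1, 5001):
    pos += len(str(n)) + 1  # digits plus the newline
    line_ends.append(pos)
file_size = pos  # 23893 bytes, the last "real stop" in the question


def predicted_stop(start):
    """Assumed model: read SMALLCHUNK bytes, then finish the current line."""
    target = start + SMALLCHUNK
    i = bisect.bisect_left(line_ends, target)
    return line_ends[i] if i < len(line_ends) else file_size


# Chunk start offsets copied from the question's output (chunks #1 - #9).
starts = [0, 2052, 4103, 6153, 8203, 10253, 12303, 14353, 16403]
for start in starts:
    print("start {0} predicted stop {1}".format(start, predicted_stop(start)))
```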

I don't know where BUFSIZ comes from, but it looks like you're getting the #define SMALLCHUNK 8192 case. In any case, readlines will never use a buffer smaller than 8 KiB, so you should probably make your chunks bigger than that.
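
If the goal is to get exactly the bytes between two recorded offsets, one alternative is to read() the exact byte count and split it into lines afterwards. This is a different technique from readlines(sizehint) and is not part of the original script. A minimal sketch, assuming the chunk boundaries already fall on line ends (as they do with the chunking loop in the question); read_chunk_lines is a hypothetical helper name:

```python
def read_chunk_lines(path, chunk_start, chunk_end):
    """Return the lines stored in bytes [chunk_start, chunk_end) of path."""
    with open(path, 'rb') as fi:
        fi.seek(chunk_start)
        data = fi.read(chunk_end - chunk_start)
    # True keeps the line endings, as readlines() does
    return data.splitlines(True)
```

In each subprocess this would take the place of the fi.readlines(chunk_end - chunk_start) call. Note that splitlines() also treats a bare \r as a line break, which only matters if the data contains carriage returns.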
