python - Why is readlines() reading much more than the sizehint?
Background
I am parsing large text files (30GB+) in Python 2.7.6. To speed the process up a bit, I am splitting the files into chunks and farming them out to subprocesses using the multiprocessing library. To do this, I iterate over the file in the main process, recording the byte positions where I want to split the input file, and pass those byte positions to the subprocesses, which then open the input file and read in their chunk using file.readlines(chunk_size). However, I'm finding that the chunks read in seem to be much larger (4x) than the sizehint argument.
The question
Why isn't the sizehint being heeded?
Example code
The following code demonstrates the issue:
    import sys

    # set test chunk size to 2KB
    chunk_size = 1024 * 2

    count = 0
    chunk_start = 0
    chunk_list = []

    fi = open('test.txt', 'r')
    while True:
        # increment chunk counter
        count += 1

        # calculate new chunk end, advance file pointer
        chunk_end = chunk_start + chunk_size
        fi.seek(chunk_end)

        # advance file pointer to end of current line so chunks don't have
        # broken lines
        fi.readline()
        chunk_end = fi.tell()

        # record chunk start and stop positions, chunk number
        chunk_list.append((chunk_start, chunk_end, count))

        # advance start to current end
        chunk_start = chunk_end

        # read a line to confirm we're not past the end of the file
        line = fi.readline()
        if not line:
            break

        # reset file pointer from last line read
        fi.seek(chunk_end, 0)

    fi.close()

    # This code represents the action taken by the subprocesses, but each
    # subprocess receives one chunk instead of iterating the list of chunks itself.
    with open('test.txt', 'r', 0) as fi:
        # iterate over chunks
        for chunk in chunk_list:
            chunk_start, chunk_end, chunk_num = chunk

            # advance file pointer to chunk start
            fi.seek(chunk_start, 0)

            # print some notes and read in the chunk
            sys.stdout.write("chunk #{0}: size: {1} start {2} real start: {3} stop {4} "
                             .format(chunk_num, chunk_end - chunk_start,
                                     chunk_start, fi.tell(), chunk_end))
            chunk = fi.readlines(chunk_end - chunk_start)
            print("real stop: {0}".format(fi.tell()))

            # write the chunk out to a file for examination
            with open('test_chunk{0}'.format(chunk_num), 'w') as fo:
                fo.writelines(chunk)

Results
I ran this code with an input file (test.txt) of 23.3KB and it produced the following output:
chunk #1: size: 2052 start 0 real start: 0 stop 2052 real stop: 8193
chunk #2: size: 2051 start 2052 real start: 2052 stop 4103 real stop: 10248
chunk #3: size: 2050 start 4103 real start: 4103 stop 6153 real stop: 12298
chunk #4: size: 2050 start 6153 real start: 6153 stop 8203 real stop: 14348
chunk #5: size: 2050 start 8203 real start: 8203 stop 10253 real stop: 16398
chunk #6: size: 2050 start 10253 real start: 10253 stop 12303 real stop: 18448
chunk #7: size: 2050 start 12303 real start: 12303 stop 14353 real stop: 20498
chunk #8: size: 2050 start 14353 real start: 14353 stop 16403 real stop: 22548
chunk #9: size: 2050 start 16403 real start: 16403 stop 18453 real stop: 23893
chunk #10: size: 2050 start 18453 real start: 18453 stop 20503 real stop: 23893
chunk #11: size: 2050 start 20503 real start: 20503 stop 22553 real stop: 23893
chunk #12: size: 2048 start 22553 real start: 22553 stop 24601 real stop: 23893
Each of the chunk sizes reported is ~2KB, all of the start/stop positions line up the way they should, and the real file positions reported by fi.tell() seem correct, so I'm fairly sure my chunking algorithm is good. However, the real stop locations show that readlines() is reading much more than the size hint. Also, output files #1 - #8 are 8.0KB each, which is much larger than the size hint.
Even if my attempts to break the chunks on line ends were wrong, readlines() still shouldn't have to read more than 2KB + 1 line. Files #9 - #12 get increasingly smaller, which makes sense since the chunk starting points are closer and closer to the end of the file, and readlines() won't read past the end of the file.
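As a quick check of the numbers, the bytes actually consumed by each full read can be computed from the output above by subtracting each chunk's start from its real stop. This is only a small arithmetic sketch; the tuples are copied from the reported output, and the comparison against 8192 is added here for illustration.

```python
# (chunk number, start, real stop) copied from the output above, chunks #1-#8.
reported = [
    (1, 0, 8193), (2, 2052, 10248), (3, 4103, 12298), (4, 6153, 14348),
    (5, 8203, 16398), (6, 10253, 18448), (7, 12303, 20498), (8, 14353, 22548),
]

# Bytes consumed by each readlines() call.
bytes_read = [stop - start for _, start, stop in reported]

for (num, _, _), n in zip(reported, bytes_read):
    print("chunk #%d read %d bytes (%d over 8192)" % (num, n, n - 8192))
```

Every one of these reads lands a few bytes past 8192, which is consistent with the 8.0KB output files: 8192 bytes plus the remainder of the line in progress.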
Notes
- My test input file just has "< line number >\n" printed on each line, 1-5000.
- I tried this again with different chunk and input file sizes, with similar results.
- The readlines documentation says that the read sizes may be rounded up to the size of an internal buffer, so I've tried opening the files without buffering (as shown) and it made no difference.
- I am using this algorithm to split the file because I need to be able to support *.bz2 and *.gz compressed files, and *.gz files give me no way to identify the uncompressed file size without decompressing the file. *.bz2 files don't either, but I can seek 0 bytes from the end of those and use fi.tell() to get the file size. See my related question.
- Before the requirement to support compressed files was added, the previous version of the script used os.path.getsize() as the stopping condition on the chunking loop, and readlines seemed to work fine with that method.
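For reference, the os.path.getsize() variant mentioned in the last note might look roughly like the sketch below. The original version of the script is not shown in the post, so this is a reconstruction under assumptions: chunk_offsets is a hypothetical helper name, the file is opened in binary mode so that offsets are plain bytes, and the script rebuilds a test file like the one described in the first note.

```python
import os
import tempfile


def chunk_offsets(path, chunk_size):
    """Return (start, end) byte offsets whose ends fall on line boundaries.

    Hypothetical helper. It only works for uncompressed files, because it
    relies on os.path.getsize() for the stopping condition.
    """
    file_size = os.path.getsize(path)
    offsets = []
    chunk_start = 0
    with open(path, 'rb') as fi:
        while chunk_start < file_size:
            # Jump ahead one chunk (never past EOF), then finish the
            # line we landed in so no chunk ends mid-line.
            fi.seek(min(chunk_start + chunk_size, file_size))
            fi.readline()
            chunk_end = fi.tell()
            offsets.append((chunk_start, chunk_end))
            chunk_start = chunk_end
    return offsets


# Rebuild a test file like the one described: the numbers 1-5000, one per line.
test_path = os.path.join(tempfile.mkdtemp(), 'test.txt')
with open(test_path, 'wb') as fo:
    for n in range(1, 5001):
        fo.write(('%d\n' % n).encode('ascii'))

offsets = chunk_offsets(test_path, 2048)
for start, end in offsets:
    print("start %d stop %d size %d" % (start, end, end - start))
```

Unlike the loop in the example code, the last chunk here stops at the file size instead of running past the end of the file.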
The buffer that the readlines documentation mentions isn't related to the buffering that the third argument of the open call controls. The buffer is this buffer in file_readlines:

    static PyObject *
    file_readlines(PyFileObject *f, PyObject *args)
    {
        long sizehint = 0;
        PyObject *list = NULL;
        PyObject *line;
        char small_buffer[SMALLCHUNK];

where SMALLCHUNK is defined earlier:

    #if BUFSIZ < 8192
    #define SMALLCHUNK 8192
    #else
    #define SMALLCHUNK BUFSIZ
    #endif

I don't know where BUFSIZ comes from, but it looks like you're getting the #define SMALLCHUNK 8192 case. In any case, readlines will never use a buffer smaller than 8 KiB, so you should probably make your chunks bigger than that.
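If small chunks are still needed, one way to sidestep the rounding is to avoid readlines() altogether. This is an alternative sketched under assumptions, not something proposed above: because the chunk boundaries already sit on line ends, read() can fetch exactly chunk_end - chunk_start bytes and the result can be split into lines afterwards. read_chunk_lines is a hypothetical helper name, and the script rebuilds a test file like the one described in the question.

```python
import os
import tempfile


def read_chunk_lines(path, chunk_start, chunk_end):
    """Return the lines stored in bytes [chunk_start, chunk_end) of path.

    Hypothetical helper. It assumes both offsets fall on line boundaries,
    as the chunking loop in the question arranges.
    """
    # Binary mode keeps seek() offsets and read() sizes in plain bytes.
    with open(path, 'rb') as fi:
        fi.seek(chunk_start)
        data = fi.read(chunk_end - chunk_start)
    # splitlines(True) keeps the trailing newline on each line.
    return data.splitlines(True)


# Rebuild a test file like the one described: the numbers 1-5000, one per line.
test_path = os.path.join(tempfile.mkdtemp(), 'test.txt')
with open(test_path, 'wb') as fo:
    for n in range(1, 5001):
        fo.write(('%d\n' % n).encode('ascii'))

# First chunk from the question's output: bytes 0 to 2052.
lines = read_chunk_lines(test_path, 0, 2052)
bytes_read = sum(len(line) for line in lines)
print("lines: %d, bytes: %d" % (len(lines), bytes_read))
```

Like readlines(), this holds the whole chunk in memory at once, so it suits chunk sizes that comfortably fit in each subprocess.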