Parsing a huge structured file in Python 2.7
I am a newbie in the Python world and in bioinformatics. I am dealing with a 50GB structured file that I need to filter and write out, and I would be grateful for any tips.
The file looks like this (it is called FASTQ format):
    @machinename:~:team1:atcatg     <- 1st line
    atatgacatgacatgaca              <- 2nd line
    +                               <- 3rd line
    asldjfwe!@#$#%$                 <- 4th line
These 4 lines repeat in that order throughout the file; each group of 4 lines is one team (one record). I also have 30 candidate DNA sequences, e.g. atgcat and tttagc.

What I am doing: each candidate DNA sequence is run through the huge file to find whether it is similar to a team's DNA sequence, where "similar" means allowing 1 mismatch (e.g. taaaaa vs aaaata). If a pair is similar or identical, I use a dictionary to store the records so I can write them out later. The key is the candidate DNA sequence; the value is the 4 lines of the record, kept in a list in the original line order.
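A minimal sketch of such a 1-mismatch test (assuming equal-length sequences; the function name within_one_mismatch is just illustrative) might look like:

    def within_one_mismatch(seq1, seq2):
        # True if the sequences have the same length and differ in at
        # most one position -- one plausible reading of "similar"
        if len(seq1) != len(seq2):
            return False
        mismatches = 0
        for a, b in zip(seq1, seq2):
            if a != b:
                mismatches += 1
                if mismatches > 1:
                    return False
        return True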
So here is what I have done:
    def myfunction(str1, str2):
        # returns True if the two sequences are similar
        # (1 mismatch allowed) -- details omitted
        pass

    f = open('hugefile')
    diction = {}
    mylist = ['candidate dna sequence1', 'dna2', 'dna3', 'dna4']  # ~30 candidates
    while True:
        line = f.readline()
        if not line:
            break
        if "machine name" in line:
            team_seq = line.rstrip().split(':')[-1]
            # the other 3 lines of this record
            seq, plus, qual = f.readline(), f.readline(), f.readline()
            for candidate in mylist:
                if myfunction(candidate, team_seq):
                    if candidate not in diction:
                        # chances are the same team DNA repeats, so records
                        # accumulate under one candidate key
                        diction[candidate] = []
                    diction[candidate].append(line)
                    diction[candidate].append(seq)
                    diction[candidate].append(plus)
                    diction[candidate].append(qual)
    f.close()

    wf = open('hugefile' + ".out", 'w')
    for candidate in mylist:  # dna1, dna2, dna3, ...
        wf.write(''.join(diction.get(candidate, [])))
    wf.close()
My function doesn't use global variables (I think that makes it a "happy" function), whereas the dictionary is a global variable that takes in all the data and creates lots of list instances. The code is simple but slow, and a big pain for both CPU and memory. I do use PyPy, though.
So, any tips on writing this out efficiently while keeping the original line order?
I suggest opening the input and output files simultaneously and writing the output as you step through the input. Right now you are reading 50GB into memory and then writing it out, which is both slow and unnecessary.

In pseudocode:
    with open(hugefile) as fin, open(hugefile + ".out", 'w') as fout:
        for line in fin:
            if "machine name" in line:
                # read the following 4 lines from fin into a record
                # process the record
                # write the record to fout
                # the input record is no longer needed -- allow it to
                # be garbage collected
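Fleshing that pseudocode out, a runnable sketch under a few assumptions (each record is exactly 4 lines, the header line starts with '@' and carries the team sequence after the last ':', hugefile holds the input path, and within_one_mismatch() is the 1-mismatch test sketched in the question):

    candidates = ['atgcat', 'tttagc']  # the ~30 candidate sequences

    with open(hugefile) as fin, open(hugefile + ".out", 'w') as fout:
        for header in fin:
            if not header.startswith('@'):
                continue
            # pull in the rest of the 4-line record; assumes the file
            # is well formed (no truncated final record)
            seq = next(fin)
            plus = next(fin)
            qual = next(fin)
            team_seq = header.rstrip('\n').split(':')[-1]
            if any(within_one_mismatch(c, team_seq) for c in candidates):
                # write the record immediately instead of buffering 50GB
                fout.writelines([header, seq, plus, qual])

Note this writes matching records in the order they are encountered rather than grouped by candidate, which is what keeps memory use flat no matter how large the input is.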
As I have outlined it, the previous 4-line records are written as they are encountered and then disposed of. If you do need to refer back to previous records (as your diction.keys() lookup does), keep only the minimum necessary in a set() to cut down the total size of the in-memory data, as in the sketch below.
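For example, a sketch of that idea (seen_teams and is_new_team are illustrative names): tracking which team sequences have already appeared needs only the strings themselves, not whole 4-line records:

    seen_teams = set()

    def is_new_team(team_seq):
        # True only the first time a given team sequence appears
        if team_seq in seen_teams:
            return False
        seen_teams.add(team_seq)
        return True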