Parsing a huge structured file in Python 2.7
I am a newbie in the Python world and in bioinformatics. I am dealing with a 50GB structured file that I need to filter and write out, and I would be grateful for any tips.
The file looks like this (it is called FASTQ format):
    @machinename:~:team1:atcatg     <- 1st line
    atatgacatgacatgaca              <- 2nd line
    +                               <- 3rd line
    asldjfwe!@#$#%$                 <- 4th line
These 4 lines repeat in that order throughout the file; each group of 4 lines is one team (one record). I also have 30 candidate DNA sequences, e.g. atgcat and tttagc.

What I am doing: each candidate DNA sequence is run through the huge file to find whether it is similar to a team's DNA sequence, where "similar" means allowing 1 mismatch (e.g. taaaaa vs aaaata). If a pair is similar or identical, I use a dictionary to store the records so I can write them out later. The key is the candidate DNA sequence; the value is the 4 lines of the record, kept in a list in the original line order.
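A minimal sketch of such a 1-mismatch test (assuming equal-length sequences; the function name within_one_mismatch is just illustrative) might look like:

    def within_one_mismatch(seq1, seq2):
        # True if the sequences have the same length and differ in at
        # most one position -- one plausible reading of "similar"
        if len(seq1) != len(seq2):
            return False
        mismatches = 0
        for a, b in zip(seq1, seq2):
            if a != b:
                mismatches += 1
                if mismatches > 1:
                    return False
        return True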
So here is what I have done:
    def myfunction(str1, str2):
        # returns True if the two sequences are similar
        # (1 mismatch allowed) -- details omitted
        pass

    f = open('hugefile')
    diction = {}
    mylist = ['candidate dna sequence1', 'dna2', 'dna3', 'dna4']  # ~30 candidates
    while True:
        line = f.readline()
        if not line:
            break
        if "machine name" in line:
            team_seq = line.rstrip().split(':')[-1]
            # the other 3 lines of this record
            seq, plus, qual = f.readline(), f.readline(), f.readline()
            for candidate in mylist:
                if myfunction(candidate, team_seq):
                    if candidate not in diction:
                        # chances are the same team DNA repeats, so records
                        # accumulate under one candidate key
                        diction[candidate] = []
                    diction[candidate].append(line)
                    diction[candidate].append(seq)
                    diction[candidate].append(plus)
                    diction[candidate].append(qual)
    f.close()

    wf = open('hugefile' + ".out", 'w')
    for candidate in mylist:  # dna1, dna2, dna3, ...
        wf.write(''.join(diction.get(candidate, [])))
    wf.close()
My function doesn't use global variables (I think that makes it a "happy" function), whereas the dictionary is a global variable that takes in all the data and creates lots of list instances. The code is simple but slow, and a big pain for both CPU and memory. I do use PyPy, though.
So, any tips on writing this out efficiently while keeping the original line order?
I suggest opening the input and output files simultaneously and writing the output as you step through the input. Right now you are reading 50GB into memory and then writing it out, which is both slow and unnecessary.

In pseudocode:
    with open(hugefile) as fin, open(hugefile + ".out", 'w') as fout:
        for line in fin:
            if "machine name" in line:
                # read the following 4 lines from fin into a record
                # process the record
                # write the record to fout
                # the input record is no longer needed -- allow it to
                # be garbage collected
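Fleshing that pseudocode out, a runnable sketch under a few assumptions (each record is exactly 4 lines, the header line starts with '@' and carries the team sequence after the last ':', hugefile holds the input path, and within_one_mismatch() is the 1-mismatch test sketched in the question):

    candidates = ['atgcat', 'tttagc']  # the ~30 candidate sequences

    with open(hugefile) as fin, open(hugefile + ".out", 'w') as fout:
        for header in fin:
            if not header.startswith('@'):
                continue
            # pull in the rest of the 4-line record; assumes the file
            # is well formed (no truncated final record)
            seq = next(fin)
            plus = next(fin)
            qual = next(fin)
            team_seq = header.rstrip('\n').split(':')[-1]
            if any(within_one_mismatch(c, team_seq) for c in candidates):
                # write the record immediately instead of buffering 50GB
                fout.writelines([header, seq, plus, qual])

Note this writes matching records in the order they are encountered rather than grouped by candidate, which is what keeps memory use flat no matter how large the input is.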
As I have outlined it, the previous 4-line records are written as they are encountered and then disposed of. If you do need to refer back to previous records (as your diction.keys() lookup does), keep only the minimum necessary in a set() to cut down the total size of the in-memory data, as in the sketch below.
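For example, a sketch of that idea (seen_teams and is_new_team are illustrative names): tracking which team sequences have already appeared needs only the strings themselves, not whole 4-line records:

    seen_teams = set()

    def is_new_team(team_seq):
        # True only the first time a given team sequence appears
        if team_seq in seen_teams:
            return False
        seen_teams.add(team_seq)
        return True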