large files - XML parser using iterparse 'loses' children -

June 15, 2010

I would appreciate help with the following: I need to read a large XML file and convert it to CSV.

I have two functions that are supposed to do the same thing. One (function1) uses iterparse (because I need to process 2 GB files) and the other (function2) doesn't.

function2 works well with the same XML file (but only up to 150 MB); beyond that size it fails due to memory.

The problem I have is that, despite the fact that the code for function1 does not give errors, it loses some of the children (this is a huge problem!). function2, on the other hand, reads all the children and doesn't lose or fail on any.

Q: Do you see in the code of function1 any reasons why children are being lost (or not read correctly, or ignored)?

Note 1: I have a 50 KB XML sample ready to send in case it is needed.
Note 2: the variable 'nchil_count' counts the number of children.

Code (function1):

    def function1():
        # Function that uses iterparse.
        # Doesn't give errors but loses children. Why?
        # Prints the output to a CSV file, wcel.csv

        from xml.etree.cElementTree import iterparse

        # Raw strings so that backslashes in the Windows paths are not read as escapes
        fname = r"c:\leonardo\input data\xml input data\netactfiles\netact_3g_rnc11_t1.xml"
        # element_list = ["wcel"]

        # delete the contents of the exit file
        open(r"c:\leonardo\input data\xml input data\wcel.csv", 'w').close()

        # open the exit file
        with open(r"c:\leonardo\input data\xml input data\wcel.csv", "a") as exit_file:

            with open(fname) as xml_doc:
                context = iterparse(xml_doc, events=("start", "end"))
                context = iter(context)
                event, root = context.next()

                for event, elem in context:

                    if event == "start" and elem.tag == "{raml20.xsd}managedobject":
                    # if event == "start":
                        if elem.get('class') == 'wcel':
                            print elem.attrib
                            # print elem.tag

                            element = elem.getchildren()
                            nchil_count = 0

                            for child in element:
                                if child.tag == "{raml20.xsd}p":
                                    nchil_count = nchil_count + 1
                                    # print child.tag
                                    # print child.attrib
                                    val = child.text
                                    # print val
                                    val = str(val)
                                    exit_file.write(val + ",")

                            exit_file.write('\n')
                            print nchil_count

                    elif event == "end" and elem.tag == "{raml20.xsd}managedobject":
                        # clear memory
                        root.clear()

        xml_doc.close()
        exit_file.close()

        return ()

Code (function2):

    def function2(xmlfile):
        # Using ElementTree.
        # Successful.
        # Works with files of 150 MB, XML (RAML) RNC export from NetAct (1 RNC only).
        # Fails with huge files due to memory.

        import xml.etree.cElementTree as etree
        import shutil

        # Raw string so that backslashes in the Windows path are not read as escapes
        with open(r"c:\leonardo\input data\xml input data\wcel.csv", "a") as exit_file:

            # populate the values per cell:

            tree = etree.parse(xmlfile)
            for value in tree.getiterator(tag='{raml20.xsd}managedobject'):
                if value.get('class') == 'wcel':
                    print value.attrib

                    element = value.getchildren()
                    nchil_count = 0

                    for child in element:
                        if child.tag == "{raml20.xsd}p":
                            nchil_count = nchil_count + 1
                            # print child.tag
                            # print child.attrib
                            val = child.text
                            # print val

                            val = str(val)
                            exit_file.write(val + ",")

                    exit_file.write('\n')
                    print nchil_count

        exit_file.close()  ## file closing after writing.

        return ()
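One thing that may be worth checking in function1: the ElementTree documentation notes that at a "start" event only the element's attributes are guaranteed, and its children may or may not be present yet. function1 reads the children at "start", so this could account for the missing ones. The following is a minimal sketch, not a drop-in fix, that reads the children at the "end" event instead. It is written for Python 3 with the standard library only; the names iter_wcel_rows and write_csv are hypothetical, and the tag and class names are copied from the post as written.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = "{raml20.xsd}"  # namespace exactly as it appears in the post


def iter_wcel_rows(source, obj_tag=NS + "managedobject",
                   child_tag=NS + "p", wanted_class="wcel"):
    """Yield one list of direct-child texts per matching element."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        # Only at "end" is the element guaranteed to be fully parsed.
        if event == "end" and elem.tag == obj_tag:
            if elem.get("class") == wanted_class:
                yield [child.text or "" for child in elem
                       if child.tag == child_tag]
            root.clear()  # drop already-processed elements to bound memory


def write_csv(source, out):
    """Write one CSV row per matching element; return the number of rows."""
    writer = csv.writer(out)
    count = 0
    for row in iter_wcel_rows(source):
        writer.writerow(row)
        count += 1
    return count


# Small made-up document, only to exercise the sketch.
sample = b"""<raml xmlns="raml20.xsd"><cmData>
<managedobject class="wcel"><p name="a">1</p><p name="b">2</p></managedobject>
<managedobject class="other"><p name="c">3</p></managedobject>
<managedobject class="wcel"><p name="d">4</p><list><p>x</p></list></managedobject>
</cmData></raml>"""

rows = list(iter_wcel_rows(io.BytesIO(sample)))
out = io.StringIO()
written = write_csv(io.BytesIO(sample), out)
```

Using csv.writer rather than manual string concatenation also takes care of quoting values that contain commas.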

I had a similar problem. There were some important differences, though:

I used lxml.etree, not xml.etree (the binary version for Windows, 'lxml-3.4.2-cp34-none-win32.whl', from http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml).
I used iterparse for a specific element, with only the end event active.
I then drilled down into the element with the xpath() method.

But the result was equivalent: some of the nodes were ignored (lost). Nothing in the file explained why. For a given file, it was always the same nodes. When I made a purely technical change (reformatting with xmllint), other nodes were lost.

I reorganized the code (no xpath(), iterparse without the tag argument, both 'start' and 'end' events, controlling the process by the element.tag property value) and found out that sometimes (I don't know when) the process "forgets" the default namespace. I mean, in most cases the value of element.tag was "{namespace uri}tag_name", but in about 2% of cases it was just "tag_name". That's why the element wasn't found by xpath().

I knew that the file had only one default namespace, so I added "{namespace uri}" myself and the file was processed correctly.

There was no problem when a namespace prefix was declared explicitly in the main tag and used in the other tags.

This looks like a bug somewhere in the parsing of large XML files. Perhaps it is not in lxml itself, if you had the same effect with xml.etree?
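The workaround described above could look roughly like the sketch below. The helper name qualify_tag is hypothetical, and "raml20.xsd" is used as the namespace URI only because it appears in the question; substitute the default namespace of your own file. It assumes, as stated above, that the file has a single default namespace.

```python
def qualify_tag(tag, default_ns):
    """Return tag in '{uri}name' form, adding default_ns when it is missing."""
    # lxml reports comments and processing instructions with a non-string
    # tag, so anything that is not a string is passed through untouched.
    if not isinstance(tag, str):
        return tag
    if tag.startswith("{"):
        return tag  # already namespace-qualified
    return "{%s}%s" % (default_ns, tag)


# Both spellings of the same element now compare equal.
DEFAULT_NS = "raml20.xsd"  # assumption: taken from the question's code
wanted = "{raml20.xsd}p"
same_qualified = qualify_tag("{raml20.xsd}p", DEFAULT_NS) == wanted
same_bare = qualify_tag("p", DEFAULT_NS) == wanted
```

Inside the parsing loop, the comparison would then be made against qualify_tag(elem.tag, DEFAULT_NS) instead of elem.tag directly.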
