Large files: XML parser using iterparse 'loses' children
I would appreciate help with the following: I need to read a large XML file and convert it to CSV.
I have two functions that are supposed to do the same thing: one (function1) uses iterparse (because I need to process 2 GB files) and the other (function2) doesn't.
function2 works well on the same kind of XML file (but only up to about 150 MB); beyond that size it fails due to memory.
The problem I have is that, although the code for function1 gives no errors, it loses some of the children (this is a huge problem!). function2, on the other hand, reads all the children and doesn't lose or fail on any of them.
Q: Do you see anything in the code of function1 that would explain why some children are lost (or not read correctly, or ignored)?
Note 1: I have a 50 KB XML sample ready to send in case it is needed.
Note 2: The variable 'nchil_count' counts the number of children.
Code (function1):
    def function1():
        # This function uses iterparse.
        # It doesn't give errors but loses some children. Why?
        # Prints the output to a CSV file, wcel.csv
        from xml.etree.cElementTree import iterparse

        # Raw strings keep the backslashes in the Windows paths literal.
        fname = r"c:\leonardo\input data\xml input data\netactfiles\netact_3g_rnc11_t1.xml"
        # element_list = ["wcel"]

        # Delete the contents of the exit file
        open(r"c:\leonardo\input data\xml input data\wcel.csv", 'w').close()

        # Open the exit file
        with open(r"c:\leonardo\input data\xml input data\wcel.csv", "a") as exit_file:
            with open(fname) as xml_doc:
                context = iterparse(xml_doc, events=("start", "end"))
                context = iter(context)
                event, root = context.next()
                for event, elem in context:
                    if event == "start" and elem.tag == "{raml20.xsd}managedobject":
                        # if event == "start":
                        if elem.get('class') == 'wcel':
                            print elem.attrib
                            # print elem.tag
                            element = elem.getchildren()
                            nchil_count = 0
                            for child in element:
                                if child.tag == "{raml20.xsd}p":
                                    nchil_count = nchil_count + 1
                                    # print child.tag
                                    # print child.attrib
                                    val = child.text
                                    # print val
                                    val = str(val)
                                    exit_file.write(val + ",")
                            exit_file.write('\n')
                            print nchil_count
                    elif event == "end" and elem.tag == "{raml20.xsd}managedobject":
                        # Clear memory
                        root.clear()
        xml_doc.close()
        exit_file.close()
        return ()

Code (function2):
    def function2(xmlfile):
        # Using ElementTree
        # Successful
        # Works with files of 150 MB, XML (RAML) RNC export from NetAct (1 RNC only)
        # Fails with huge files due to memory
        import xml.etree.cElementTree as etree
        import shutil

        # Raw string keeps the backslashes in the Windows path literal.
        with open(r"c:\leonardo\input data\xml input data\wcel.csv", "a") as exit_file:
            # Populate the values per cell:
            tree = etree.parse(xmlfile)
            for value in tree.getiterator(tag='{raml20.xsd}managedobject'):
                if value.get('class') == 'wcel':
                    print value.attrib
                    element = value.getchildren()
                    nchil_count = 0
                    for child in element:
                        if child.tag == "{raml20.xsd}p":
                            nchil_count = nchil_count + 1
                            # print child.tag
                            # print child.attrib
                            val = child.text
                            # print val
                            val = str(val)
                            exit_file.write(val + ",")
                    exit_file.write('\n')
                    print nchil_count
        exit_file.close()  ## File closing after writing.
        return ()

I had a similar problem. There were some important differences, though:
- I used lxml.etree, not xml.etree (the binary version for Windows, 'lxml-3.4.2-cp34-none-win32.whl', from http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml)
- I used iterparse for a specific element, with only the end event active
- I then drilled down into the element using the xpath() method
But the result was equivalent: some of the nodes were ignored (lost). Nothing in the file explained why. For a given file, it was always the same nodes. When I made a purely technical change to the file (reformatting it with xmllint), other nodes were lost.
I reorganized the code (no xpath(), iterparse without the tag argument, both 'start' and 'end' events, controlling the process by the element.tag property value) and found out that sometimes (I don't know when) the process "forgets" the default namespace. I mean, in most cases the value of element.tag was "{namespace uri}tag_name", but in about 2% of cases it was just "tag_name". That's why those elements weren't found by xpath().
I knew that the file had only one default namespace, so I added "{namespace uri}" myself, and then the file was processed correctly.
There was no problem when a namespace prefix was declared explicitly in the main tag and used in the other tags.
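
The workaround described above can be sketched roughly as follows. This is only an illustration, not the code I actually ran: the helper names `qualified` and `count_tags` are made up for the example, the namespace URI is taken from the question's code, and it uses the standard-library xml.etree.ElementTree on Python 3 rather than lxml. It is only safe under the same assumption as above, namely that the file has a single default namespace.

```python
import io
import xml.etree.ElementTree as ET

# Assumption: the file declares exactly one default namespace, with this URI
# (taken from the tags used in the question's code).
DEFAULT_NS = "raml20.xsd"


def qualified(tag, ns=DEFAULT_NS):
    """Return the tag in '{ns}local' form, adding ns only when it is missing."""
    if tag.startswith("{"):
        return tag
    return "{%s}%s" % (ns, tag)


def count_tags(source, local_name, ns=DEFAULT_NS):
    """Count elements named local_name, whether or not the tag carries ns."""
    wanted = qualified(local_name, ns)
    count = 0
    # Compare on "end" events using the normalized tag instead of elem.tag.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if qualified(elem.tag, ns) == wanted:
            count += 1
    return count


if __name__ == "__main__":
    sample = b'<raml xmlns="raml20.xsd"><p>1</p><p>2</p></raml>'
    print(count_tags(io.BytesIO(sample), "p"))
```

Comparing against the normalized tag means an element reported as plain "tag_name" is treated the same as "{namespace uri}tag_name", which is the effect of adding the namespace by hand.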
This looks like a bug somewhere in the parsing of large XML files, and perhaps not in lxml itself, if you had the same effect with xml.etree?
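
One more thing that may be worth checking in function1, independent of the namespace issue: it reads the children on the 'start' event. The ElementTree documentation says that when iterparse emits 'start', only the attributes are guaranteed; the children may or may not have been parsed yet, and a fully populated element is only guaranteed on 'end'. Below is a minimal sketch of the same extraction done on 'end' events. It is written for Python 3 with xml.etree.ElementTree instead of the question's Python 2 cElementTree; the names `wcel_rows` and `write_csv` are hypothetical, it writes an empty field for a `p` without text (where function1 would write "None"), and it clears each finished element instead of the root.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = "{raml20.xsd}"  # namespace prefix exactly as used in the question's code


def wcel_rows(source, cls="wcel"):
    """Yield one list of direct <p> child texts per managedobject of class cls."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        # On "end" the element and all of its children are complete.
        if elem.tag == NS + "managedobject":
            if elem.get("class") == cls:
                yield [p.text or "" for p in elem if p.tag == NS + "p"]
            # Free the children of the finished element to limit memory use.
            elem.clear()


def write_csv(source, out):
    """Write one CSV row per matching element; return the number of rows."""
    writer = csv.writer(out)
    n = 0
    for row in wcel_rows(source):
        writer.writerow(row)
        n += 1
    return n


SAMPLE = (
    b'<raml xmlns="raml20.xsd"><cmData>'
    b'<managedobject class="wcel"><p name="a">1</p><p name="b">2</p>'
    b'<list><p>x</p></list></managedobject>'
    b'<managedobject class="rnc"><p>9</p></managedobject>'
    b'<managedobject class="wcel"><p>3</p></managedobject>'
    b'</cmData></raml>'
)

if __name__ == "__main__":
    out = io.StringIO()
    print(write_csv(io.BytesIO(SAMPLE), out))
    print(out.getvalue())
```

If the per-element child counts from a version like this match those of function2 on a file small enough for both, that would point to the 'start' event rather than the parser as the cause of the lost children.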