Parsing Large XML Using iterparse() Consumes Too Much Memory. Any Alternative?
I am using Python 2.7 with the latest lxml library. I am parsing a large XML file with a very homogeneous structure and millions of elements. I thought lxml's iterparse would not build an internal tree as it parses, but memory usage keeps growing until the process runs out of memory.
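For illustration, this is roughly the naive pattern that exhibits the problem (the file name 'huge.xml' is a placeholder); iterparse still builds the full tree behind the scenes, so memory grows with the file unless elements are freed by hand:

import lxml.etree as ET

# Naive usage: every parsed element stays attached to the growing
# root tree, so memory climbs with file size.
for event, elem in ET.iterparse('huge.xml'):
    print(elem.tag)  # per-element work would go here; nothing is ever freed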
Solution 1:
Try using Liza Daly's fast_iter:
def fast_iter(context, func, args=[], kwargs={}):
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly
    for event, elem in context:
        func(elem, *args, **kwargs)
        # Free the element once it has been processed, then delete
        # preceding siblings that are no longer needed.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context
fast_iter removes elements from the tree after they have been parsed, along with earlier siblings (possibly with other tags) that are no longer needed.
It could be used like this:
import lxml.etree as ET

def process_element(elem):
    ...

context = ET.iterparse(filename, events=('end',), tag=...)
fast_iter(context, process_element)
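If the handler needs extra arguments, the args and kwargs parameters pass them through. A small sketch, assuming the fast_iter above; 'records.xml', the 'record' tag, and collect_text are illustrative names:

import lxml.etree as ET

results = []

def collect_text(elem, out):
    # Hypothetical handler: gather each element's text into a list.
    out.append(elem.text)

context = ET.iterparse('records.xml', events=('end',), tag='record')
fast_iter(context, collect_text, args=[results])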
Solution 2:
I had this problem and solved it with a hint from http://effbot.org/zone/element-iterparse.htm#incremental-parsing:
elems = ET.Element('MyElements')
for event, elem in ET.iterparse(filename):
    if is_needed(elem):  # implement this condition however you like
        elems.append(elem)
    else:
        elem.clear()
This gives you a tree containing only the elements you need, without holding the entire document in memory during parsing.
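A runnable sketch of that pattern, with a hypothetical is_needed that keeps only <item> elements ('data.xml' and the tag name are illustrative):

import lxml.etree as ET

def is_needed(elem):
    # Hypothetical filter: keep only <item> elements.
    return elem.tag == 'item'

elems = ET.Element('MyElements')
for event, elem in ET.iterparse('data.xml'):
    if is_needed(elem):
        elems.append(elem)  # lxml moves the element out of the parse tree
    else:
        elem.clear()

print(ET.tostring(elems))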