{"id":12,"date":"2004-07-21T09:06:27","date_gmt":"2004-07-21T13:06:27","guid":{"rendered":"http:\/\/blogs.law.harvard.edu\/rlucastemp\/2004\/07\/21\/hint-preprocessing-mongo-xml-files"},"modified":"2004-07-21T09:06:27","modified_gmt":"2004-07-21T13:06:27","slug":"hint-preprocessing-mongo-xml-files-for-use-with-xmlsimple","status":"publish","type":"post","link":"https:\/\/archive.blogs.harvard.edu\/rlucastemp\/2004\/07\/21\/hint-preprocessing-mongo-xml-files-for-use-with-xmlsimple\/","title":{"rendered":"[HINT] Preprocessing mongo XML files for use with XML::Simple"},"content":{"rendered":"<p><a name='a37'><\/a><\/p>\n<p>If you are a reasonable Perlista, the first thing you will do when you<br \/>\nhave to do some modest but non-trivial munging of data locked up in XML<br \/>\nis to use XML::Simple.&nbsp; The API is nearly perfect (apart from some<br \/>\ndefaults that could be set more helpfully for strictness) for<br \/>\npurposes of comprehensibility and transparency.<\/p>\n<p>However, if you prototype on a small document, and then try to use your<br \/>\ncode on a much bigger XML document, you will find the drawback:<br \/>\ntree-building is costly, and you may spend the vast majority of your<br \/>\nprogram&#8217;s time parsing the document.&nbsp; One handy solution is to<br \/>\npreprocess your XML: just run XML::Simple&#8217;s XMLin sub, and use<br \/>\nData::Dumper to write the resulting structure out to a file.&nbsp;<br \/>\nWhen you want to use it, you can simply &#8220;eval&#8221; it, for it defines a<br \/>\nnative Perl structure, and you can use the remainder of your code<br \/>\nunchanged.&nbsp; For me this gave a 2x to 10x speedup for certain<br \/>\ndocuments and sizes.<\/p>\n<p>However, now imagine that you have some real torture-test data: 10<br \/>\nMB, heavily nested monstrosities of XML.&nbsp; The Dumper output of the<br \/>\nparsed tree is now approaching 100 MB!&nbsp; Slurping this in and<br 
\/>\nevaling it is now the real problem.<\/p>\n<p>Here&#8217;s an idea: rather than slurping and evaling, try inlining it at<br \/>\nthe compilation stage.&nbsp; That&#8217;s right: concatenate the dump and your<br \/>\nscript and pipe the result to perl, which reads and compiles a program<br \/>\nfrom standard input much more efficiently than a slurp-and-eval does:<\/p>\n<p><span style=\"font-family: courier;\">cat preprocessed_xml.dd myscript.pl | perl<\/span><\/p>\n<p>It&#8217;s somewhat unorthodox, but entirely functional.&nbsp; Combined with<br \/>\njudicious use of gzip, this could be a very efficient way to get<br \/>\nrarely changing XML documents into Perl quickly, which is often very<br \/>\nimportant when doing dev work that requires numerous iterations and in<br \/>\nwhich a minutes-long parse stage would slow progress.<\/p>\n<p><span style=\"font-weight: bold;\">Update:<\/span> It occurred to me that<br \/>\nusing Storable or a Cache::* module might be faster yet.&nbsp; At this<br \/>\npoint, my work proceeds with tolerable speed using Data::Dumper, plus I<br \/>\nlike using Dumper so that I can edit the output structures by hand if<br \/>\nneed be.&nbsp; But perhaps you should try those modules if you need<br \/>\neven better performance, or cringe at the hackishness of concatenating<br \/>\nfiles and piping them to perl.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>If you are a reasonable Perlista, the first thing you will do when you have to do some modest but non-trivial munging of data locked up in XML is to use XML::Simple.&nbsp; The API is nearly perfect (apart from some defaults that could be set more helpfully for strictness) for purposes of comprehensibility 
[&hellip;]<\/p>\n","protected":false},"author":1180,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-12","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/archive.blogs.harvard.edu\/rlucastemp\/wp-json\/wp\/v2\/posts\/12","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/archive.blogs.harvard.edu\/rlucastemp\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/archive.blogs.harvard.edu\/rlucastemp\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/archive.blogs.harvard.edu\/rlucastemp\/wp-json\/wp\/v2\/users\/1180"}],"replies":[{"embeddable":true,"href":"https:\/\/archive.blogs.harvard.edu\/rlucastemp\/wp-json\/wp\/v2\/comments?post=12"}],"version-history":[{"count":0,"href":"https:\/\/archive.blogs.harvard.edu\/rlucastemp\/wp-json\/wp\/v2\/posts\/12\/revisions"}],"wp:attachment":[{"href":"https:\/\/archive.blogs.harvard.edu\/rlucastemp\/wp-json\/wp\/v2\/media?parent=12"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/archive.blogs.harvard.edu\/rlucastemp\/wp-json\/wp\/v2\/categories?post=12"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/archive.blogs.harvard.edu\/rlucastemp\/wp-json\/wp\/v2\/tags?post=12"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}