Splitting large XML files into manageable sections for Hadoop

tags:

hadoop
xml

+3 Q:

Splitting large XML files into manageable sections for Hadoop

Is there an input format class in Hadoop for handling [multiple] large XML files based on their tree structure? I have a set of XML files that all share the same schema, and I need to split them so that each section of data stays whole, rather than having a section broken across splits.

For example the XML file would be:

<root>
  <parent> data </parent>
  <parent> more data</parent>
  <parent> even more data</parent>
</root>

I would define each section as: /root/parent.

What I'm asking is: does Hadoop already include a record reader that does this?

+1 A:

I think the Cloud9 project at UMD might help you with this.

The library provides an XMLInputFormat class, which might be of use.

Also of interest is this page in the Cloud9 documentation which looks at how you can deal with an XML dump of Wikipedia in MapReduce.
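To make the idea concrete, here is a minimal sketch of the tag-scanning approach that a tag-delimited XML record reader could use: skip ahead to a start tag, then buffer everything up to the matching end tag and emit that as one record. It is plain Java with no Hadoop dependency, and the class and method names (TagRecordSplitter, split, readUntilMatch) are made up for illustration; it is not the Cloud9 code. It also assumes the start tag has no attributes and that elements of the same name are not nested.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only: splits a character stream into
// records delimited by a literal start tag and end tag.
class TagRecordSplitter {

    // Reads characters until `match` has been seen in full. When `out` is
    // non-null, every character read is appended to it. Returns false if the
    // input ends before a full match.
    private static boolean readUntilMatch(Reader in, String match, StringBuilder out)
            throws IOException {
        int i = 0; // number of characters of `match` matched so far
        int c;
        while ((c = in.read()) != -1) {
            if (out != null) {
                out.append((char) c);
            }
            if (c == match.charAt(i)) {
                i++;
                if (i == match.length()) {
                    return true;
                }
            } else {
                // Simple restart; sufficient for tags, where '<' only occurs first.
                i = (c == match.charAt(0)) ? 1 : 0;
            }
        }
        return false;
    }

    // Returns each startTag...endTag span as one record, in document order.
    static List<String> split(Reader in, String startTag, String endTag) throws IOException {
        List<String> records = new ArrayList<>();
        while (readUntilMatch(in, startTag, null)) {
            StringBuilder record = new StringBuilder(startTag);
            if (!readUntilMatch(in, endTag, record)) {
                break; // unterminated record at end of input: drop it
            }
            records.add(record.toString());
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        String xml = "<root>\n"
                + "  <parent> data </parent>\n"
                + "  <parent> more data</parent>\n"
                + "  <parent> even more data</parent>\n"
                + "</root>";
        for (String record : split(new StringReader(xml), "<parent>", "</parent>")) {
            System.out.println(record);
        }
    }
}
```

Run against the example file from the question, this prints each of the three parent elements on its own line. In a real Hadoop job the same loop would sit inside a RecordReader and would also have to respect input split boundaries, which this sketch does not attempt.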

Binary Nerd 2010-03-05 21:25:21
