How do I use Elastic MapReduce to run an XSLT transformation on millions of small S3 xml files? | ansaurus

Q:

How do I use Elastic MapReduce to run an XSLT transformation on millions of small XML files stored in S3?

More specifically, is there a somewhat easy streaming solution?

A:

See the Hadoop Streaming FAQ entry "How do I process files, one per map?". Applied to EMR, the approach is:

1. Upload your data to an S3 bucket.
2. Generate a file containing the full s3n:// path to each file, one path per line.
3. Write a mapper script that:
   - Pulls 'mapred_work_output_dir' out of the environment (*)
   - Performs the XSLT transform on the file named by each input line, saving the result to the output directory
4. Write an identity reducer that does nothing.
5. Upload your mapper / reducer scripts to an S3 bucket.
6. Test your script via the AWS EMR console.

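(*) Streaming puts your jobconf in the process's environment, with the dots in each property name replaced by underscores, so mapred.work.output.dir shows up as mapred_work_output_dir. See code here.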
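
The mapper step above could be sketched roughly as follows. This is only an illustration under some assumptions that are not in the original answer: the stylesheet name transform.xsl is hypothetical, the helper names (output_path, transform_file, run) are made up for the example, and it assumes the `hadoop` and `xsltproc` command-line tools are available on the task nodes. Each input line is treated as one s3n:// path.

```python
#!/usr/bin/env python3
"""Streaming mapper sketch: each input line names one XML file to transform."""
import os
import posixpath
import subprocess
import sys

# Hypothetical stylesheet name; it has to be shipped to the task nodes somehow.
STYLESHEET = "transform.xsl"


def output_path(out_dir, s3_path):
    """Name the result after the source file, inside the task's output dir.

    Note: files with the same basename in different folders would collide.
    """
    return posixpath.join(out_dir, posixpath.basename(s3_path))


def transform_file(s3_path, dest):
    """Fetch, transform and store one file.

    Assumption: `hadoop` and `xsltproc` are on the PATH of the task node.
    """
    xml = subprocess.check_output(["hadoop", "fs", "-cat", s3_path])
    # "-" tells xsltproc to read the document from stdin.
    result = subprocess.run(["xsltproc", STYLESHEET, "-"], input=xml,
                            stdout=subprocess.PIPE, check=True).stdout
    # "-put -" tells hadoop fs to read the file content from stdin.
    subprocess.run(["hadoop", "fs", "-put", "-", dest], input=result,
                   check=True)


def run(lines, environ, transform=transform_file):
    """Process every path in `lines`; return how many files were handled."""
    out_dir = environ["mapred_work_output_dir"]
    done = 0
    for line in lines:
        # Some input formats prepend a key and a tab; keep the last field.
        path = line.rstrip("\n").split("\t")[-1].strip()
        if not path:
            continue
        dest = output_path(out_dir, path)
        transform(path, dest)
        # Emit one record per file so the reducer has something to pass on.
        sys.stdout.write("%s\t%s\n" % (path, dest))
        done += 1
    return done


if __name__ == "__main__":
    if "mapred_work_output_dir" in os.environ:
        run(sys.stdin, os.environ)
    else:
        sys.stderr.write("mapred_work_output_dir is not set; "
                         "is this running under Hadoop streaming?\n")
```

Starting a `hadoop fs` process twice per file is slow, so with millions of files it may be worth swapping transform_file for something that reads from S3 and runs the XSLT in-process. The transform parameter of run is there to make that swap (and local testing) easy.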

Ryan Cox 2010-08-11 11:44:16
