ansaurus

Question

Answer 1

+1 A:

I am not convinced that SAX is the best approach for you. There are different ways you could use SAX here, though.

If element order is not guaranteed within certain elements, like ListingDetails, then you need to be proactive.

When you start a ListingDetails, initialize a map as a member variable on the handler. In each subelement, set the appropriate key-value pair in that map. When you finish a ListingDetails, examine the map and explicitly fill in placeholder values, such as nulls, for the missing elements. Assuming you have one ListingDetails per item, save it to a member variable in the handler.

Now, when your item element ends, call a function that writes the CSV line from the map, in the column order you want.

The risk with this approach is corrupted XML. I would strongly consider setting all these variables to null when an item starts, and then checking for errors and reporting them when the item ends.
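The map-based handler described above could be sketched roughly as follows. This is only an illustration under assumptions: the class name `ItemCsvHandler`, the `COLUMNS` order, and the `convert` helper are hypothetical, element names are taken from the question's example, and CSV escaping is left out for brevity.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ItemCsvHandler extends DefaultHandler {
    // Hypothetical column order; adjust to the real document.
    private static final String[] COLUMNS = {"ItemID", "StartTime", "EndTime", "ViewItemURL"};

    private final StringBuilder out = new StringBuilder();
    private final StringBuilder text = new StringBuilder();
    private Map<String, String> values; // non-null only while inside an Item

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("Item".equals(qName)) {
            values = new LinkedHashMap<>(); // fresh state for every item
        }
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (values == null) {
            return; // outside any Item
        }
        if ("Item".equals(qName)) {
            // Write the row in the fixed column order; missing elements become empty fields.
            for (int i = 0; i < COLUMNS.length; i++) {
                if (i > 0) out.append(',');
                String v = values.get(COLUMNS[i]);
                out.append(v == null ? "" : v);
            }
            out.append('\n');
            values = null;
        } else {
            String v = text.toString().trim();
            if (!v.isEmpty()) values.put(qName, v);
        }
        text.setLength(0);
    }

    // Convenience wrapper for small inputs held in a String.
    public static String convert(String xml) throws Exception {
        ItemCsvHandler handler = new ItemCsvHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.out.toString();
    }
}
```

Because the row is assembled only when the Item ends, the order in which StartTime and EndTime appear inside ListingDetails does not matter.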

Uri 2010-07-20 19:06:49

Answer 2

A:

You could use XStream (http://xstream.codehaus.org) or JOX (http://www.wutka.com/jox.html) to parse the XML and convert it to a Java bean. I think you can then convert the beans to CSV automatically.

pabiagioli 2010-07-20 19:13:19

Answer 3

+2 A:

The best way to meet the requirement you describe is to use FreeMarker's XML processing feature, which is easy to use. See the docs.

In this case you will only need the template that will produce a CSV.

An alternative is XMLGen, which takes a very similar approach. Look at its diagram and examples; instead of SQL statements, you will output CSV.

These two similar approaches are not "conventional" but do the job very quickly for your situation, and you don't have to learn XSL (quite hard to master I think).

A. Ionescu 2010-07-20 19:32:43

Answer 4

+2 A:

Here is some code that implements the conversion of the XML to CSV using StAX. Although the XML you gave is only an example, I hope that this shows you how to handle the optional elements.

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.*;

public class App 
{
    public static void main( String[] args ) throws XMLStreamException, FileNotFoundException
    {
        new App().convertXMLToCSV(new BufferedInputStream(new FileInputStream(args[0])), new BufferedOutputStream(new FileOutputStream(args[1])));
    }

    static public final String ROOT = "root";
    static public final String ITEM = "Item";
    static public final String ITEM_ID = "ItemID";
    static public final String ITEM_DETAILS = "ListingDetails";
    static public final String START_TIME = "StartTime";
    static public final String END_TIME = "EndTime";
    static public final String ITEM_URL = "ViewItemURL";
    static public final String AVERAGES = "averages";
    static public final String AVERAGE_TIME = "AverageTime";
    static public final String AVERAGE_PRICE = "AveragePrice";
    static public final String SEPARATOR = ",";

    public void convertXMLToCSV(InputStream in, OutputStream out) throws XMLStreamException
    {
        PrintWriter writer = new PrintWriter(out);
        XMLStreamReader xmlStreamReader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        convertXMLToCSV(xmlStreamReader, writer);
    }

    public void convertXMLToCSV(XMLStreamReader xmlStreamReader, PrintWriter writer) throws XMLStreamException {
        writer.println("ItemID,StartTime,EndTime,ViewItemURL,AverageTime,AveragePrice");
        xmlStreamReader.nextTag();
        xmlStreamReader.require(XMLStreamConstants.START_ELEMENT, null, ROOT);

        while (xmlStreamReader.hasNext()) {
            xmlStreamReader.nextTag();
            if (xmlStreamReader.isEndElement())
                break;

            xmlStreamReader.require(XMLStreamConstants.START_ELEMENT, null, ITEM);
            String itemID = nextValue(xmlStreamReader, ITEM_ID);
            xmlStreamReader.nextTag(); xmlStreamReader.require(XMLStreamConstants.START_ELEMENT, null, ITEM_DETAILS);
            String startTime = nextValue(xmlStreamReader, START_TIME);
            xmlStreamReader.nextTag();
            String averageTime = null;
            String averagePrice = null;

            if (xmlStreamReader.getLocalName().equals(AVERAGES))
            {
                averageTime = nextValue(xmlStreamReader, AVERAGE_TIME);
                averagePrice = nextValue(xmlStreamReader, AVERAGE_PRICE);
                xmlStreamReader.nextTag();
                xmlStreamReader.require(XMLStreamConstants.END_ELEMENT, null, AVERAGES);
                xmlStreamReader.nextTag();
            }
            String endTime = currentValue(xmlStreamReader, END_TIME);
            String url = nextValue(xmlStreamReader,ITEM_URL);
            xmlStreamReader.nextTag(); xmlStreamReader.require(XMLStreamConstants.END_ELEMENT, null, ITEM_DETAILS);
            xmlStreamReader.nextTag(); xmlStreamReader.require(XMLStreamConstants.END_ELEMENT, null, ITEM);

            writer.append(esc(itemID)).append(SEPARATOR)
                    .append(esc(startTime)).append(SEPARATOR)
                    .append(esc(endTime)).append(SEPARATOR)
                    .append(esc(url));
            if (averageTime!=null)
                writer.append(SEPARATOR).append(esc(averageTime)).append(SEPARATOR)
                        .append(esc(averagePrice));
            writer.println();                        
        }

        xmlStreamReader.require(XMLStreamConstants.END_ELEMENT, null, ROOT);
        writer.close();

    }

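    private String esc(String string) {
        // Quote fields that contain a comma, a quote or a line break, doubling any embedded quotes
        if (string.indexOf(',')!=-1 || string.indexOf('"')!=-1 || string.indexOf('\n')!=-1 || string.indexOf('\r')!=-1)
            string = '"'+string.replace("\"", "\"\"")+'"';
        return string;
    }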

    private String nextValue(XMLStreamReader xmlStreamReader, String name) throws XMLStreamException {
        xmlStreamReader.nextTag();
        return currentValue(xmlStreamReader, name);
    }

    private String currentValue(XMLStreamReader xmlStreamReader, String name) throws XMLStreamException {
        xmlStreamReader.require(XMLStreamConstants.START_ELEMENT, null, name);
        String value = "";
        for (;;) {
            int next = xmlStreamReader.next();
            if (next==XMLStreamConstants.CDATA||next==XMLStreamConstants.SPACE||next==XMLStreamConstants.CHARACTERS)
                value += xmlStreamReader.getText();
            else if (next==XMLStreamConstants.END_ELEMENT)
                break;
            // ignore comments, PIs, attributes
        }
        xmlStreamReader.require(XMLStreamConstants.END_ELEMENT, null, name);
        return value.trim();
    }    
}

mdma 2010-07-27 18:25:21

@mdma Thank you for your response. I'm looking for a more generic approach: it should work for any number of nodes at any depth. As in the example XML, one item can have more nodes than the next, so that case needs handling too. Nodes can also share a name but carry different values and attributes, and each of those should become a new CSV column as well.

c0mrade 2010-07-27 21:16:49

Answer 5

+5 A:

The code provided should be considered a sketch rather than the definitive article. I am not an expert on SAX, and the implementation could be improved for better performance, simpler code, etc. That said, SAX should be able to cope with streaming large XML files.

I would approach this problem with 2 passes using the SAX parser. (Incidentally, I would also use a CSV generating library to create the output as this would deal with all the fiddly character escaping that CSV involves but I haven't implemented this in my sketch).

First pass: Establish number of header columns

Second pass: Output CSV

I assume that the XML file is well formed. I assume that we don't have a schema/DTD with a predefined order.

In the first pass I have assumed that a CSV column will be added for every XML element containing text content or for any attribute (I have assumed attributes will contain something!).

The second pass, having established the number of target columns, will do the actual CSV output.
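To illustrate the first pass in isolation: each item yields a count per element or attribute name, and the header keeps, for every name, the highest count seen in any single item. The following is a minimal sketch of that merge step only; the class and method names (`HeaderMerge`, `merge`, `columns`) are hypothetical and not part of the code in this answer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HeaderMerge {
    // For every name, keep the largest count seen in any single item.
    // LinkedHashMap preserves the order in which names were first encountered.
    static Map<String, Integer> merge(List<Map<String, Integer>> perItemCounts) {
        Map<String, Integer> merged = new LinkedHashMap<>();
        for (Map<String, Integer> item : perItemCounts) {
            for (Map.Entry<String, Integer> e : item.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Integer::max);
            }
        }
        return merged;
    }

    // Expand the merged counts into the final header row, repeating a name once per occurrence.
    static List<String> columns(Map<String, Integer> merged) {
        List<String> cols = new ArrayList<>();
        for (Map.Entry<String, Integer> e : merged.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                cols.add(e.getKey());
            }
        }
        return cols;
    }
}
```

An item with one category and another with two categories therefore produce two category columns, and the first item simply leaves the second one empty.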

Based on your example XML my code sketch would produce:

ItemID,StartTime,EndTime,ViewItemURL,AverageTime,category,category,type,type,AveragePrice
4504216603,10:00:10.000Z,10:00:30.000Z,http://url,,,,,,
4504216604,10:30:10.000Z,11:00:10.000Z,http://url,value1,9823,9112,TX,TY,value2

Please note I have used the google collections LinkedHashMultimap as this is helpful when associating multiple values with a single key. I hope you find this useful!

import com.google.common.collect.LinkedHashMultimap;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map.Entry;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class App {

    public static void main(String[] args) throws SAXException, FileNotFoundException, IOException {
        // First pass - to determine headers
        XMLReader xr = XMLReaderFactory.createXMLReader();
        HeaderHandler handler = new HeaderHandler();
        xr.setContentHandler(handler);
        xr.setErrorHandler(handler);
        FileReader r = new FileReader("test1.xml");
        xr.parse(new InputSource(r));

        LinkedHashMap<String, Integer> headers = handler.getHeaders();
        int totalnumberofcolumns = 0;
        for (int headercount : headers.values()) {
            totalnumberofcolumns += headercount;
        }
        String[] columnheaders = new String[totalnumberofcolumns];
        int i = 0;
        for (Entry<String, Integer> entry : headers.entrySet()) {
            for (int j = 0; j < entry.getValue(); j++) {
                columnheaders[i] = entry.getKey();
                i++;
            }
        }
        StringBuilder sb = new StringBuilder();
        for (String h : columnheaders) {
            sb.append(h);
            sb.append(',');
        }
        System.out.println(sb.substring(0, sb.length() - 1));

        // Second pass - collect and output data

        xr = XMLReaderFactory.createXMLReader();

        DataHandler datahandler = new DataHandler();
        datahandler.setHeaderArray(columnheaders);

        xr.setContentHandler(datahandler);
        xr.setErrorHandler(datahandler);
        r = new FileReader("test1.xml");
        xr.parse(new InputSource(r));
    }

    public static class HeaderHandler extends DefaultHandler {

        private String content;
        private String currentElement;
        private boolean insideElement = false;
        private Attributes attribs;
        private LinkedHashMap<String, Integer> itemHeader;
        private LinkedHashMap<String, Integer> accumulativeHeader = new LinkedHashMap<String, Integer>();

        public HeaderHandler() {
            super();
        }

        private LinkedHashMap<String, Integer> getHeaders() {
            return accumulativeHeader;
        }

        private void addItemHeader(String headerName) {
            if (itemHeader.containsKey(headerName)) {
                itemHeader.put(headerName, itemHeader.get(headerName) + 1);
            } else {
                itemHeader.put(headerName, 1);
            }
        }

        @Override
        public void startElement(String uri, String name,
                String qName, Attributes atts) {
            if ("item".equalsIgnoreCase(qName)) {
                itemHeader = new LinkedHashMap<String, Integer>();
            }
            currentElement = qName;
            content = null;
            insideElement = true;
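            // Copy the attributes: the parser may reuse the atts object after this callback returns
            attribs = new org.xml.sax.helpers.AttributesImpl(atts);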
        }

        @Override
        public void endElement(String uri, String name, String qName) {
            if (!"item".equalsIgnoreCase(qName) && !"root".equalsIgnoreCase(qName)) {
                if (content != null && qName.equals(currentElement) && content.trim().length() > 0) {
                    addItemHeader(qName);
                }
                if (attribs != null) {
                    int attsLength = attribs.getLength();
                    if (attsLength > 0) {
                        for (int i = 0; i < attsLength; i++) {
                            String attName = attribs.getLocalName(i);
                            addItemHeader(attName);
                        }
                    }
                }
            }
            if ("item".equalsIgnoreCase(qName)) {
                for (Entry<String, Integer> entry : itemHeader.entrySet()) {
                    String headerName = entry.getKey();
                    Integer count = entry.getValue();
                    //System.out.println(entry.getKey() + ":" + entry.getValue());
                    if (accumulativeHeader.containsKey(headerName)) {
                        if (count > accumulativeHeader.get(headerName)) {
                            accumulativeHeader.put(headerName, count);
                        }
                    } else {
                        accumulativeHeader.put(headerName, count);
                    }
                }
            }
            insideElement = false;
            currentElement = null;
            attribs = null;
        }

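        @Override
        public void characters(char ch[], int start, int length) {
            if (insideElement) {
                // SAX may deliver one text node in several chunks, so accumulate instead of overwriting
                String chunk = new String(ch, start, length);
                content = (content == null) ? chunk : content + chunk;
            }
        }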
    }

    public static class DataHandler extends DefaultHandler {

        private String content;
        private String currentElement;
        private boolean insideElement = false;
        private Attributes attribs;
        private LinkedHashMultimap dataMap;
        private String[] headerArray;

        public DataHandler() {
            super();
        }

        @Override
        public void startElement(String uri, String name,
                String qName, Attributes atts) {
            if ("item".equalsIgnoreCase(qName)) {
                dataMap = LinkedHashMultimap.create();
            }
            currentElement = qName;
            content = null;
            insideElement = true;
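            // Copy the attributes: the parser may reuse the atts object after this callback returns
            attribs = new org.xml.sax.helpers.AttributesImpl(atts);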
        }

        @Override
        public void endElement(String uri, String name, String qName) {
            if (!"item".equalsIgnoreCase(qName) && !"root".equalsIgnoreCase(qName)) {
                if (content != null && qName.equals(currentElement) && content.trim().length() > 0) {
                    dataMap.put(qName, content);
                }
                if (attribs != null) {
                    int attsLength = attribs.getLength();
                    if (attsLength > 0) {
                        for (int i = 0; i < attsLength; i++) {
                            String attName = attribs.getLocalName(i);
                            dataMap.put(attName, attribs.getValue(i));
                        }
                    }
                }
            }
            if ("item".equalsIgnoreCase(qName)) {
                String data[] = new String[headerArray.length];
                int i = 0;
                for (String h : headerArray) {
                    if (dataMap.containsKey(h)) {
                        Object[] values = dataMap.get(h).toArray();
                        data[i] = (String) values[0];
                        if (values.length > 1) {
                            dataMap.removeAll(h);
                            for (int j = 1; j < values.length; j++) {
                                dataMap.put(h, values[j]);
                            }
                        } else {
                            dataMap.removeAll(h);
                        }
                    } else {
                        data[i] = "";
                    }
                    i++;
                }
                StringBuilder sb = new StringBuilder();
                for (String d : data) {
                    sb.append(d);
                    sb.append(',');
                }
                System.out.println(sb.substring(0, sb.length() - 1));
            }
            insideElement = false;
            currentElement = null;
            attribs = null;
        }

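        @Override
        public void characters(char ch[], int start, int length) {
            if (insideElement) {
                // SAX may deliver one text node in several chunks, so accumulate instead of overwriting
                String chunk = new String(ch, start, length);
                content = (content == null) ? chunk : content + chunk;
            }
        }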

        public void setHeaderArray(String[] headerArray) {
            this.headerArray = headerArray;
        }
    }
}

Mark McLaren 2010-07-30 00:13:02

Answer 6

+6 A:

This looks like a good case for using XSL. Given your basic requirements it may be easier to get at the right nodes with XSL as compared to custom parsers or serializers. The benefit would be that your XSL could target "//Item//AverageTime" or whatever nodes you require without worrying about node depth.

UPDATE: The following is the xslt I threw together to make sure this worked as expected.

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" />
<xsl:template match="/">
ItemID,StartTime,EndTime,ViewItemURL,AverageTime,AveragePrice
<xsl:for-each select="//Item">
<!-- .// keeps each lookup inside the current Item; a bare // would always return the first match in the whole document -->
<xsl:value-of select="ItemID"/><xsl:text>,</xsl:text><xsl:value-of select=".//StartTime"/><xsl:text>,</xsl:text><xsl:value-of select=".//EndTime"/><xsl:text>,</xsl:text><xsl:value-of select=".//ViewItemURL"/><xsl:text>,</xsl:text><xsl:value-of select=".//AverageTime"/><xsl:text>,</xsl:text><xsl:value-of select=".//AveragePrice"/><xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>

</xsl:stylesheet>
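For completeness, a stylesheet of this kind can be applied from Java with the JDK's built-in `javax.xml.transform` API. The sketch below is an assumption-laden illustration, not part of the original answer: the class name `XsltToCsv` is hypothetical, the stylesheet is a shortened inline variant with only three columns, and it uses relative paths such as `.//StartTime` so that each value comes from the current Item. Note that the default processor builds the whole document in memory.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltToCsv {
    // Shortened stylesheet for illustration only (three columns, no header row).
    static final String XSL =
          "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:template match=\"/\">"
        + "<xsl:for-each select=\"//Item\">"
        + "<xsl:value-of select=\"ItemID\"/><xsl:text>,</xsl:text>"
        + "<xsl:value-of select=\".//StartTime\"/><xsl:text>,</xsl:text>"
        + "<xsl:value-of select=\".//AverageTime\"/><xsl:text>&#10;</xsl:text>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to an XML string and returns the text output.
    static String transform(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

For real files, `StreamSource` and `StreamResult` also accept `File` or stream arguments, so the same call works without holding the input in a String.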

Robert Diana 2010-07-30 00:42:50

Especially the requirement of "any number of nodes with any depth" should steer one's thoughts towards XSL and "//Item".

f1sh 2010-08-02 10:20:58

XSL would be the perfect choice if this were a small file; however, the DOM for a 1 GB file could take up a huge amount of memory. So I would imagine some sort of specialized streaming XSL would need to be used (this thread already mentioned Saxonica and VTD-XML). See also: http://stackoverflow.com/questions/2301926/xml-process-large-data

Mark McLaren 2010-08-03 09:24:54

That is some interesting information. In that case, a streaming xsl tech would be useful. Thanks for the link Mark.

Robert Diana 2010-08-03 10:43:06

Answer 7

+2 A:

I'm not sure I understand how generic the solution should be. Do you really want to parse a 1 GB file twice for a generic solution? And if you want something generic, why did you skip the <category> element in your example? How many different formats do you need to handle? Do you really not know what the format can be (even if some elements can be omitted)? Can you clarify?

In my experience, it's generally preferable to parse specific files in a specific way (this doesn't exclude using a generic API though). My answer will go in this direction (and I'll update it after the clarification).

If you don't feel comfortable with XML, you could consider using some existing (commercial) libraries, for example Ricebridge XML Manager and CSV Manager. See How to convert CSV into XML and XML into CSV using Java for a full example. The approach is pretty straightforward: you define the data fields using XPath expressions (which is perfect in your case since you can have "extra" elements), parse the file, and then pass the resulting List to the CSV component to generate the CSV file. The API looks simple, the code is tested (the source code of their test cases is available under a BSD-style license), and they claim to support gigabyte-sized files.

You can get a Single Developer license for $170 which is not very expensive compared to developer daily rates.

They offer 30-day trial versions; have a look.
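The idea of defining the data fields as XPath expressions can be illustrated with the JDK's own `javax.xml.xpath` API. To be clear, this is not the Ricebridge API, and the class name `XPathColumns` and the `FIELDS` list are hypothetical. The sketch also loads the whole document into memory, so it only demonstrates the principle and would not suit a gigabyte-sized file as written.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathColumns {
    // One XPath expression per CSV column, evaluated relative to each Item (assumed names).
    static final String[] FIELDS = {"ItemID", ".//StartTime", ".//AveragePrice"};

    static String toCsv(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList items = (NodeList) xpath.evaluate("//Item",
                new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        StringBuilder csv = new StringBuilder();
        for (int i = 0; i < items.getLength(); i++) {
            for (int f = 0; f < FIELDS.length; f++) {
                if (f > 0) csv.append(',');
                // An expression that matches nothing evaluates to an empty string.
                csv.append(xpath.evaluate(FIELDS[f], items.item(i)));
            }
            csv.append('\n');
        }
        return csv.toString();
    }
}
```

Optional elements need no special handling here: a missing AveragePrice simply yields an empty field.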

Another option would be to use Spring Batch. Spring Batch offers everything required to work with XML files as input or output (using StAX and the XML binding framework of your choice) and flat files as input or output. See:

the Spring Batch Documentation
the Samples (especially the trade sample)
A first look at Spring-Batch, part 2

You could also use Smooks to do XML to CSV transformations. See also:

Structured Event Streaming with Smooks

Another option would be to roll your own solution, using a StAX parser or, why not, using VTD-XML and XPath. Have a look at:

Pascal Thivent 2010-07-30 05:37:46

Answer 8

+1 A:

Note that this would be a prime use case for XSLT, except that most XSLT processors read the whole XML file into memory, which is not an option for a file this large. Note, however, that the enterprise version of Saxon can do streaming XSLT processing (if the XSLT script adheres to the restrictions).

You may also want to use an external XSLT processor outside your JVM instead, if applicable. This opens up several more options.

Streaming in Saxon-EE: http://www.saxonica.com/documentation/sourcedocs/serial.html

Thorbjørn Ravn Andersen 2010-08-01 13:00:44

There is also Joost/STX (http://joost.sourceforge.net/), which is an XSLT-like language with some additional constraints for streaming. Since this problem only requires sequential processing of the input, it should fit well into that model.

Steven D. Majewski 2010-08-03 15:29:10

Why just XSLT-_like_ instead of an XSLT subset?

Thorbjørn Ravn Andersen 2010-08-03 15:46:12


Convert XML file to CSV in java
