I'm trying to get only the elements that contain text. Example XML:

<root>
      <Item>
        <ItemID>4504216603</ItemID>
        <ListingDetails>
          <StartTime>10:00:10.000Z</StartTime>
          <EndTime>10:00:30.000Z</EndTime>
          <ViewItemURL>http://url</ViewItemURL>
            ....
           </item> 

It should print

Element Local Name:ItemID
Text:4504216603
Element Local Name:StartTime
Text:10:00:10.000Z
Element Local Name:EndTime
Text:10:00:30.000Z
Element Local Name:ViewItemURL
Text:http://url

This code also prints root, item, etc. Is this even possible? It must be, I just can't find it on Google.

XMLInputFactory inputFactory = XMLInputFactory.newInstance();
InputStream input = new FileInputStream(new File("src/main/resources/file.xml"));
XMLStreamReader xmlStreamReader = inputFactory.createXMLStreamReader(input);

while (xmlStreamReader.hasNext()) {
    int event = xmlStreamReader.next();

    if (event == XMLStreamConstants.START_ELEMENT) {
        System.out.println("Element Local Name:" + xmlStreamReader.getLocalName());
    }

    if (event == XMLStreamConstants.CHARACTERS) {
        if (!xmlStreamReader.getText().trim().equals("")) {
            System.out.println("Text:" + xmlStreamReader.getText().trim());
        }
    }
}

Edit: this is the incorrect output I currently get:

    Element Local Name:root
    Element Local Name:item
    Element Local Name:ItemID
    Text:4504216603
    Element Local Name:ListingDetails
    Element Local Name:StartTime
    Text:10:00:10.000Z
    Element Local Name:EndTime
    Text:10:00:30.000Z
    Element Local Name:ViewItemURL
    Text:http://url

I don't want root and the other nodes that have no text to be printed, just the output I wrote above. Thank you.

A: 

Try this:

while (xmlStreamReader.hasNext()) {
    int event = xmlStreamReader.next();

    if (event == XMLStreamConstants.START_ELEMENT) {
        try {
            String text = xmlStreamReader.getElementText();
            System.out.println("Element Local Name:" + xmlStreamReader.getLocalName());
            System.out.println("Text:" + text);
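        } catch (XMLStreamException e) {
            // getElementText() throws when the element contains child elements;
            // by then the child's START_ELEMENT has already been consumed.
        }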
    }

}
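
As the comments below note, the loop above skips some elements because getElementText() consumes the nested START_ELEMENT. A minimal StAX sketch that avoids this, assuming only leaf elements (elements with no child elements) should be reported: it remembers the most recently opened element, buffers its character data, and emits the pair at the matching END_ELEMENT. The class name `TextElementPrinter` and the method `collect` are made up for this illustration; this is one possible approach, not the poster's code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, named for this example only.
class TextElementPrinter {

    // Returns the output lines for every leaf element that has non-blank text.
    static List<String> collect(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<String> lines = new ArrayList<>();
        String currentName = null;                 // most recently opened element
        StringBuilder text = new StringBuilder();  // text seen since that element opened

        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT: {
                    // A nested element starts: drop whatever was buffered for the parent.
                    currentName = reader.getLocalName();
                    text.setLength(0);
                    break;
                }
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA: {
                    // Text may arrive in several chunks, so buffer it.
                    text.append(reader.getText());
                    break;
                }
                case XMLStreamConstants.END_ELEMENT: {
                    String value = text.toString().trim();
                    // currentName is only non-null when no child element opened in between.
                    if (currentName != null && !value.isEmpty()) {
                        lines.add("Element Local Name:" + currentName);
                        lines.add("Text:" + value);
                    }
                    currentName = null;
                    text.setLength(0);
                    break;
                }
                default:
                    break;
            }
        }
        reader.close();
        return lines;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<root><Item><ItemID>4504216603</ItemID></Item></root>";
        for (String line : collect(xml)) {
            System.out.println(line);
        }
    }
}
```

Note that text sitting next to child elements (mixed content) is deliberately dropped in this sketch; if that matters for your data, the buffering rule would need to change.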

SAX based solution (works):

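import java.io.File;
import java.io.IOException;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Test extends DefaultHandler {

    public static void main(String[] args) throws ParserConfigurationException, IOException, SAXException {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new File("src/file.xml"), new Test());
    }

    // Name of the most recently opened element
    private String currentName;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        currentName = qName;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Note: a parser may deliver the text of one element in several chunks.
        String string = new String(ch, start, length);
        if (hasText(string)) {
            System.out.println(currentName);
            System.out.println(string);
        }
    }

    // True when the string contains something other than whitespace
    private boolean hasText(String string) {
        string = string.trim();
        return string.length() > 0;
    }
}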
Georgy Bolyuba
@Georgy Bolyuba I think I already tried xmlStreamReader.getElementText(), but I didn't store the result in a variable. Could that have caused the problem?
c0mrade
Actually, this solution does not work 100% (just checked). It skips <StartTime>. Implementation "swallows" second START_ELEMENT, I think. The good news is that you can improve it. Check out current impl: http://download.oracle.com/docs/cd/E17409_01/javase/6/docs/api/javax/xml/stream/XMLStreamReader.html#getElementText%28%29 and make a better one :)
Georgy Bolyuba
@Georgy Bolyuba yes, I realized that just now, but I'll still leave you a +1. Hehe, you're funny: "make a better one" :D
c0mrade
Yeah, here is a hint: use SAX
Georgy Bolyuba
@Georgy Bolyuba I'll accept your solution. I also did it using StAX; nice to learn new things.
c0mrade
You should post your StAX solution here
Georgy Bolyuba
@Georgy Bolyuba alright
c0mrade
@Georgy Bolyuba just want to thank you again; this works really well with an OutputStream, much better than the junk I used to have. Is it possible to customize your code to print something before or after the top-level XML element (the one that comes right after the root element and repeats throughout the document)?
c0mrade
You would have to implement endElement and add the logic you want. It is pretty easy to do
Georgy Bolyuba
@Georgy Bolyuba I was reading the SAX documentation and saw that startElement and endElement are called for every XML element. Is it possible to detect when the main element (`<item>` in the question) starts and write something before or after it, rather than before or after every element?
c0mrade
There is no specific method for that. You will have to put some logic into startElement to check if this is your "main" element yourself (like, compare the name). But at this point I would switch to DOM model. If you are planning to add more logic to your code, DOM would be a better option for you.
Georgy Bolyuba
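
A rough sketch of the name-comparison idea from the comment above, kept in SAX: startElement and endElement compare the element name against the repeating "main" element and add a marker line before and after it. The class `ItemMarkerHandler`, the constant `MAIN_ELEMENT` (assumed here to be `Item`), the marker strings, and the `parse` helper are all invented for this illustration. Text is buffered until endElement, since characters() may be called more than once per element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, named for this example only.
class ItemMarkerHandler extends DefaultHandler {

    // Assumption: the repeating top-level element is called "Item".
    private static final String MAIN_ELEMENT = "Item";

    final List<String> lines = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private String currentName;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (MAIN_ELEMENT.equals(qName)) {
            lines.add("--- item start ---");   // written before the main element's content
        }
        currentName = qName;
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);        // buffer; may be called in several chunks
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        // Only leaf elements still match currentName when they close.
        if (qName.equals(currentName) && !value.isEmpty()) {
            lines.add(currentName);
            lines.add(value);
        }
        currentName = null;
        text.setLength(0);
        if (MAIN_ELEMENT.equals(qName)) {
            lines.add("--- item end ---");     // written after the main element's content
        }
    }

    // Convenience wrapper: parses an XML string and returns the collected lines.
    static List<String> parse(String xml) throws Exception {
        ItemMarkerHandler handler = new ItemMarkerHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.lines;
    }
}
```

Collecting into a list instead of printing directly is only a choice for this example; replacing `lines.add(...)` with `System.out.println(...)` gives the same behaviour as the handler in the answer, plus the markers.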
A: 

StAX solution:

Parse the document:

public void parseXML(InputStream xml) {
    try {
        DOMResult result = new DOMResult();
        XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
        XMLEventReader reader = xmlInputFactory.createXMLEventReader(new StreamSource(xml));
        TransformerFactory transFactory = TransformerFactory.newInstance();
        Transformer transformer = transFactory.newTransformer();
        transformer.transform(new StAXSource(reader), result);
        Document document = (Document) result.getNode();

        NodeList startlist = document.getChildNodes();

        processNodeList(startlist);

    } catch (Exception e) {
        System.err.println("Something went wrong, this might help :\n" + e.getMessage());
    }
}

Now all nodes from the document are in a NodeList, so do this next:

private void processNodeList(NodeList nodelist) {
    for (int i = 0; i < nodelist.getLength(); i++) {
        if (nodelist.item(i).getNodeType() == Node.ELEMENT_NODE && (hasValidAttributes(nodelist.item(i)) || hasValidText(nodelist.item(i)))) {
            getNodeNamesAndValues(nodelist.item(i));
        }
        processNodeList(nodelist.item(i).getChildNodes());
    }
}

Then, for each element node with valid text, get its name and value:

public void getNodeNamesAndValues(Node n) {

    String nodeValue = null;
    String nodeName = null;

    if (hasValidText(n)) {
        while (n != null && isWhiteSpace(n.getTextContent()) && StringUtils.isWhitespace(n.getTextContent()) && n.getNodeType() != Node.ELEMENT_NODE) {
            n = n.getFirstChild();
        }

        nodeValue = StringUtils.strip(n.getTextContent());
        nodeName = n.getLocalName();

        System.out.println(nodeName + " " + nodeValue);
    }
}

A bunch of useful methods for checking nodes:

private static boolean hasValidAttributes(Node node) {
    return (node.getAttributes().getLength() > 0);
}

private boolean hasValidText(Node node) {
    String textValue = node.getTextContent();

    // Use isEmpty() for the empty-string check; != on strings only compares references.
    return (textValue != null && !textValue.isEmpty() && !isWhiteSpace(textValue) && !StringUtils.isWhitespace(textValue) && node.hasChildNodes());
}

private boolean isWhiteSpace(String nodeText) {
    // True when the text starts with a whitespace character
    return nodeText.startsWith("\r") || nodeText.startsWith("\t") || nodeText.startsWith("\n") || nodeText.startsWith(" ");
}

I also used StringUtils; if you're using Maven, you can get it by adding this to your pom.xml:

<dependency>
    <groupId>commons-lang</groupId>
    <artifactId>commons-lang</artifactId>
    <version>2.5</version>
</dependency>

This is inefficient if you're reading huge files, but less so if you split them first. This is what I came up with (with Google's help). There are better solutions; this one is mine, and I'm an amateur (for now).

c0mrade
What is the point of using StAX if you then process a DOM model? :)
Georgy Bolyuba
@Georgy Bolyuba I wouldn't know. As I said, I'm not a pro; I googled and found stuff that works.
c0mrade
@Georgy Bolyuba there should be an option to label a post something like "don't do it like this"...
c0mrade
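
On the point raised in the comments about going through StAX only to end up with a DOM: if a DOM is acceptable, it can be built directly with DocumentBuilder. Below is a hedged sketch under that assumption. It treats an element as "having text" when it has no child elements and its trimmed text content is non-empty, which is a simpler rule than the hasValidText check in the answer and may not match it in every case. The class name `DomLeafText` and its methods are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, named for this example only.
class DomLeafText {

    // Parses the XML into a DOM and returns "name text" for each leaf element with text.
    static List<String> collect(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        List<String> lines = new ArrayList<>();
        walk(doc.getDocumentElement(), lines);
        return lines;
    }

    // Depth-first walk in document order.
    private static void walk(Node node, List<String> lines) {
        boolean hasChildElement = false;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasChildElement = true;
                walk(child, lines);
            }
        }
        // Report only elements without child elements whose text is not blank.
        if (!hasChildElement) {
            String text = node.getTextContent().trim();
            if (!text.isEmpty()) {
                lines.add(node.getNodeName() + " " + text);
            }
        }
    }
}
```

As with any DOM approach, the whole document is held in memory, so the remark in the answer about huge files still applies.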