I am working on a project in which I need to extract XML (sitemap) data from a .gz file using Apache Tika (I am new to Tika). The file name is something like sitemap01.xml.gz. I can extract data from a plain text or HTML file, but I don't know how to get the XML out of the .gz file and then extract the metadata and content from that XML. I have searched Google for the past two days.

Do I need to use a DelegatingParser in Tika to extract data from the XML? Please point me to a sample or some articles.

Here is my attempt:

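import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public void parseXml() throws IOException {
    Metadata metadata = new Metadata();
    ContentHandler handler = new BodyContentHandler();
    Parser parser = new AutoDetectParser();
    ParseContext context = new ParseContext();
    // try-with-resources closes the stream even if parsing fails
    try (InputStream stream = this.getClass().getResourceAsStream("sitemap.xml.gz")) {
        // getResourceAsStream returns null when the resource is missing
        if (stream == null) {
            throw new FileNotFoundException("sitemap.xml.gz not found on the classpath");
        }
        parser.parse(stream, handler, metadata, context);
        // Print every metadata field that Tika reported
        for (String name : metadata.names()) {
            System.out.println(name + " : " + metadata.get(name));
        }
        // Print the extracted body text
        System.out.println(handler.toString());
    } catch (IOException | SAXException | TikaException e) {
        e.printStackTrace();
    }
}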
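
For comparison, here is a minimal sketch of the same task without Tika, using only JDK classes: the .gz layer is unwrapped with GZIPInputStream and the resulting XML is read with the built-in DOM parser. This swaps in a different technique and is not Tika's API. The class name SitemapGzSketch, the helper gzip, and the sample URLs are hypothetical and exist only so the example runs without a file; it also assumes the sitemap keeps its URLs in `<loc>` elements.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical class name; JDK-only sketch, no Tika involved.
class SitemapGzSketch {

    // Decompress a gzipped sitemap and return the text of every <loc> element.
    static List<String> extractLocs(InputStream gzStream) throws Exception {
        try (InputStream xml = new GZIPInputStream(gzStream)) {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Sitemaps declare a default namespace, so parse namespace-aware
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(xml);
            // "*" matches <loc> in any namespace
            NodeList locs = doc.getElementsByTagNameNS("*", "loc");
            List<String> urls = new ArrayList<>();
            for (int i = 0; i < locs.getLength(); i++) {
                urls.add(locs.item(i).getTextContent().trim());
            }
            return urls;
        }
    }

    // Hypothetical helper: gzip a string in memory to stand in for a .gz file.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample data in the sitemaps.org format
        String sample = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<url><loc>http://example.com/a</loc></url>"
                + "<url><loc>http://example.com/b</loc></url>"
                + "</urlset>";
        for (String url : extractLocs(new ByteArrayInputStream(gzip(sample)))) {
            System.out.println(url);
        }
    }
}
```

With a real file, the in-memory stream would be replaced by the resource stream from the attempt above. Staying inside Tika, one thing that may be worth trying is registering the parser in the context with context.set(Parser.class, parser) before calling parse, since Tika looks there for the parser to apply to content nested inside a container; whether that alone is enough for a .xml.gz file is an assumption, not something verified here.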