I have some user-defined tags, for example data here , jssj . I have a file (not XML) which contains some data embedded in tags. I need a parser for this which will identify my tags and extract the data in the proper format. E.g.:

<newpage> thix text </newpage>
<tagD>
 <tagA> kk</tagA>
</tagD>

Tags can also have attributes, similar to HTML tags. E.g.:

<mytag height="f" width ="d" > bla bla bla </mytag>
<mytag attribute="val"> bla bla bla</mytag>
+1  A: 

Are these XML tags? If so, look into one of the many Java XML libraries already available. If they're some kind of custom tagging format, then you're just going to have to write it yourself.

Amber
The text is not an XML file.
A: 

For XML tags, use a DOM parser or a SAX parser.

adatapost
They are not XML tags. The text is not an XML file.
Could you post an example?
The text is not an XML file.
I have posted an example.
+2  A: 

You could look at a parser generator like ANTLR.

Unless your tag syntax can be represented with a (simple) regular grammar (in which case you could try to scan the file with regexes), you will need a proper parser. It is actually not very hard to do at all - just the first time tastes like biting bullets...
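
As a rough illustration of the hand-written route, here is a minimal sketch that uses a regular expression only to find individual tags and a stack to track nesting (a regex alone cannot match nested pairs). The class name TagScanner, the Node type, and the tag and attribute syntax it accepts (letters and digits for names, double-quoted attribute values, no self-closing tags) are assumptions made for this example, not something the question specifies.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: scans custom tags with a regex and rebuilds nesting with a stack.
class TagScanner {
    /** One parsed element: name, attributes, direct text, and child elements. */
    static final class Node {
        final String name;
        final Map<String, String> attributes = new LinkedHashMap<>();
        final StringBuilder text = new StringBuilder();
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    // Matches <name attr="v" ...> or </name>; group 1 is "/" for a closing tag.
    private static final Pattern TAG = Pattern.compile(
        "<(/?)([A-Za-z][A-Za-z0-9]*)((?:\\s+[A-Za-z][A-Za-z0-9]*\\s*=\\s*\"[^\"]*\")*)\\s*>");
    // Matches a single attr="value" pair inside an opening tag.
    private static final Pattern ATTR =
        Pattern.compile("([A-Za-z][A-Za-z0-9]*)\\s*=\\s*\"([^\"]*)\"");

    static Node parse(String input) {
        Node root = new Node("#root");          // synthetic container for top-level tags
        Deque<Node> open = new ArrayDeque<>();  // currently open elements, innermost on top
        open.push(root);
        Matcher m = TAG.matcher(input);
        int pos = 0;
        while (m.find()) {
            // Text between the previous tag and this one belongs to the innermost open element.
            open.peek().text.append(input, pos, m.start());
            pos = m.end();
            String name = m.group(2);
            if (m.group(1).isEmpty()) {         // opening tag
                Node node = new Node(name);
                Matcher a = ATTR.matcher(m.group(3));
                while (a.find()) {
                    node.attributes.put(a.group(1), a.group(2));
                }
                open.peek().children.add(node);
                open.push(node);
            } else {                            // closing tag must match the innermost open element
                if (open.size() == 1 || !open.peek().name.equals(name)) {
                    throw new IllegalArgumentException(
                        "Unexpected </" + name + "> at offset " + m.start());
                }
                open.pop();
            }
        }
        open.peek().text.append(input, pos, input.length());
        if (open.size() > 1) {
            throw new IllegalArgumentException("Unclosed <" + open.peek().name + ">");
        }
        return root;
    }

    public static void main(String[] args) {
        Node root = parse("<tagD> <tagA> kk</tagA> </tagD>"
            + "<mytag height=\"f\" width =\"d\" > bla bla bla </mytag>");
        for (Node child : root.children) {
            System.out.println(child.name + " " + child.attributes
                + " '" + child.text.toString().trim() + "'");
        }
    }
}
```

This sketch rejects mismatched closing tags outright; a format that tolerates HTML-style misnesting would need a recovery rule in the closing-tag branch instead of the exception.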

Daren Thomas
A: 

Your example is XML with this modification:

<root>
  <newpage> thix text </newpage>
  <tagD>
    <tagA> kk</tagA>
  </tagD>
</root>

You can use any XML parser you want to parse it.

Edit:

Attributes are a normal part of XML.

<root>
  <newpage> thix text </newpage>
  <tagD>
    <tagA> kk</tagA>
  </tagD>
  <mytag height="f" width ="d" > bla bla bla </mytag>
  <mytag attribute="val"> bla bla bla</mytag>
</root>

Every XML parser can deal with them.
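
In Java, the same idea might look like the sketch below, which wraps the tagged text in a synthetic root element and hands it to the DOM parser that ships with the JDK. The class name WrappedDomExample and the method parseFragment are made up for illustration, and the approach assumes the text is well-formed once wrapped (every tag closed and properly nested).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: treats the tagged text as an XML fragment under a synthetic <root>.
class WrappedDomExample {
    /** Wraps the tagged text in <root>...</root> and parses it as XML. */
    static Document parseFragment(String taggedText) {
        String xml = "<root>" + taggedText + "</root>";
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Parser configuration, I/O, and well-formedness errors all end up here.
            throw new IllegalArgumentException("Input is not well-formed once wrapped", e);
        }
    }

    public static void main(String[] args) {
        String input =
              "<tagD> <tagA> kk</tagA> </tagD>"
            + "<mytag height=\"f\" width =\"d\" > bla bla bla </mytag>"
            + "<mytag attribute=\"val\"> bla bla bla</mytag>";
        Document doc = parseFragment(input);
        NodeList tags = doc.getElementsByTagName("mytag");
        for (int i = 0; i < tags.getLength(); i++) {
            Element e = (Element) tags.item(i);
            // Optional attributes: hasAttribute tells present from absent;
            // getAttribute returns "" when the attribute is missing.
            String width = e.hasAttribute("width") ? e.getAttribute("width") : "(none)";
            System.out.println("width=" + width + " text=" + e.getTextContent().trim());
        }
    }
}
```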

Edit:

If you were able to use Python, you could do something like this:

import lxml.etree

doc = lxml.etree.parse("foo.xml")
print(doc.xpath("//mytag[1]/@width"))
# => ['d']

That's what I call simple.

What can I do if the tag has an optional attribute? I have now given an example of it in the main post.
What if I have something like <root> my name is <b> hhshs</b> </root>
Then it is still XML. The element root would contain mixed content, which is legal.
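
To make the mixed-content case concrete, here is a small sketch using the JDK's DOM parser: text and child elements simply show up as sibling nodes under the parent. The class MixedContentExample and the method describeChildren are invented for this example, as is the "text:" / "element:" labelling of the output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: walks the direct children of the document element.
class MixedContentExample {
    /** Describes each direct child of the root as "text:..." or "element:name=...". */
    static List<String> describeChildren(String xml) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            List<String> parts = new ArrayList<>();
            NodeList children = root.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.TEXT_NODE) {
                    String text = child.getNodeValue().trim();
                    if (!text.isEmpty()) {      // skip whitespace-only text nodes
                        parts.add("text:" + text);
                    }
                } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                    parts.add("element:" + child.getNodeName()
                        + "=" + child.getTextContent().trim());
                }
            }
            return parts;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describeChildren("<root> my name is <b> hhshs</b> </root>"));
    }
}
```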
+2  A: 

You can use JAXB, which was bundled with Java SE 6 through 10 (from Java 11 on it has to be added as a separate dependency). It's quite simple. First you need to create a binding to your XML. The binding provides a mapping between Java objects and the XML.

An example would be:

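import java.util.ArrayList;
import java.util.List;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.bind.annotation.XmlType;

@XmlRootElement(name = "YourRootElement", namespace ="http://someurl.org")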
@XmlAccessorType(XmlAccessType.FIELD)
@XmlType(name = "", propOrder = {
    "intValue",
    "stringArray",
    "stringValue"}
)
public class YourBindingClass {
    protected int intValue;

    @XmlElement(nillable = false)
    protected List<String> stringArray;

    @XmlElement(name = "stringValue", required = true)
    protected String stringValue;

    public int getIntValue() {
        return intValue;
    }

    public void setIntValue(int value) {
        this.intValue = value;
    }

    public List<String> getStringArray() {
        if (stringArray == null) {
            stringArray = new ArrayList<String>();
        }
        return this.stringArray;
    }

    public String getStringValue() {
        return stringValue;
    }

    public void setStringValue(String value) {
        this.stringValue = value;
    }
}

Then, to encode your Java objects into XML, you can use:

YourBindingClass yourBindingClass = ...;
JAXBContext jaxbContext = JAXBContext.newInstance(YourBindingClass.class);
Marshaller marshaller = jaxbContext.createMarshaller();
marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);
marshaller.setProperty(Marshaller.JAXB_FRAGMENT, false);

// If you need to specify a schema
SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema schema = sf.newSchema(new URL("http://www.someurl.org"));
marshaller.setSchema(schema);
// JAXB_SCHEMA_LOCATION takes a String of "namespaceURI schemaURL" pairs, not a boolean
marshaller.setProperty(Marshaller.JAXB_SCHEMA_LOCATION, "http://someurl.org http://www.someurl.org");

ByteArrayOutputStream stream = new ByteArrayOutputStream();
marshaller.marshal(yourBindingClass, stream);
System.out.println(stream);

To parse your XML back to objects:

InputStream resourceAsStream = ... // Your XML, File, etc. 
JAXBContext jaxbContext = JAXBContext.newInstance(YourBindingClass.class);
Unmarshaller unmarshaller = jaxbContext.createUnmarshaller();
Object r = unmarshaller.unmarshal(resourceAsStream);
if (r instanceof YourBindingClass) ...

Example starting from a Java object:

YourBindingClass s = new YourBindingClass();
s.setIntValue(1);
s.setStringValue("a");
s.getStringArray().add("b1");
s.getStringArray().add("b2");

// marshal ...

Result:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns2:YourRootElement xmlns:ns2="http://someurl.org">
    <intValue>1</intValue>
    <stringArray>b1</stringArray>
    <stringArray>b2</stringArray>
    <stringValue>a</stringValue>
</ns2:YourRootElement>

If you don't know the input format, that means you probably don't have an XML schema. If you don't have a schema, you miss out on some of its benefits, such as:

  • It is easier to describe allowable document content
  • It is easier to validate the correctness of data
  • It is easier to define data facets (restrictions on data)
  • It is easier to define data patterns (data formats)
  • It is easier to convert data between different data types

Anyway, the previous code also works with XML that contains 'unknown' tags. However, your XML still has to present the required fields and follow the declared patterns. So the following XML is also valid. The only restriction is that the tag 'stringValue' must be there. Note that 'stringArrayQ' was not previously declared.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns2:YourRootElement xmlns:ns2="http://someurl.org">
         <stringValue>a</stringValue>
         <stringArrayQ>b1</stringArrayQ>
</ns2:YourRootElement>
Bruno Simões
"It's quite simple". Ha!
You are probably right :) I was too enthusiastic. Writing bindings for long XML documents would be really boring. Anyway, you get good performance, at least compared with a DOM parser.
Bruno Simões
I don't know the format of the input file in advance. The input file is not in a particular format. The only thing I know in advance is the set of tags it can contain. For example, tag1 can be nested in tag2 or tag3, or any other combination, like how we can use some HTML tags. HTML e.g. <div> <p> data</p> </div> or <p><div>data </p>