tags:

views:

1097

answers:

5

Hi All,

I am trying to parse an XML file which contains special characters such as "&" using a DOM parser. I am getting the SAXParseException "the reference to entity must end with a delimiter". Is there any way to overcome this exception? I cannot modify the XML file to remove the special characters, because it comes from a different application. Please suggest a way to parse this XML file and get the root element.

Thanks in advance

This is the part of the XML that I am parsing:

<P>EDTA/THAM WASH 
</P>

<P>jhc ^ 72. METER SOLVENT: Meter 21 LITERS of R. O. WATER through the add line into 
FT-250. Start agitator. 
</P>

<P>R. O. WATER &lt;ZLl LITERS </P>

<P>•     NOTE: The following is a tool control operation. The area within 10 feet of any open vessel or container is under tool control. </P>

<P>-af . 73. CHARGE SOLIDS: Remove any unnecessary items from the tool controlled area. Indicate the numbers of each item that will remain in the tool controlled area during the operation in the IN box of the Tool Control Log. </P>

<P>^___y_ a. To minimize the potential for cross contamination, confirm that no other solids are being charged or packaged in adjacent equipment. </P>

<P>kk k WARNING: Wear protective gloves, air jacket and use local exhaust when handling TROMETHAMINE USP (189400) (THAM) (K-l--Irritant!). The THAM may be dusty. </P>

<P>-&lt;&amp;^b .   Charge 2.1 KG of TROMETHAMINE USP (189400) (THAM) into FT-250 through the top. </P>

<P>TROMETHAMINE USP (189400) (THAM) </P>

<P>Scale ID:     / / 7S </P>

<P>LotNo.:   qy/o^yo^ </P>

<P>Gross:    ^ . S </P>

<P>Tare: 10 ,1 </P>

<P>Net:     J^l </P>

<P>Total:   JL'J </P>

<P><Figure ActualText="&T ">

<ImageData src="images/17PT 07009K_img_1.jpg"/>
&amp;T </Figure>
Checked by </P>
+1  A: 

I'm not sure I understand the question. As far as I'm aware, unless you're inside a CDATA, naked & characters without a closing ; are invalid.

If that's not the case for your XML file, then it's invalid, and you'll need to find another way of parsing it, or fixing it before SAX gets a hold of it.

If I'm misunderstanding something here, you should probably post a sample of the actual XML so we can help further.

Update:

It looks like:

Figure ActualText="&T "

is the offending line. Is this section within a CDATA or not? If not, this is not valid XML and you should not expect SAX to be able to handle it.

You'll need to either:

  • change the application that created it; or
  • fix it before it's loaded by SAX (if you can't change that application) to something like "Figure ActualText="&amp;T ""; or
  • find a non-SAX method for parsing.
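The second option (fixing the text before the parser sees it) could look roughly like the sketch below. It is only an illustration: the class and method names are made up, it assumes the whole file fits in a String, and it treats only the five predefined entities and numeric character references as legitimate. It would also rewrite ampersands inside CDATA sections or references to DTD-declared entities, so check whether your files use those.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of any library.
class BareAmpersandFixer {
    // An '&' that does NOT start a predefined entity or a numeric character reference.
    private static final Pattern BARE_AMP =
        Pattern.compile("&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Escape only the stray ampersands; existing references such as &lt; are left alone.
    static String escapeBareAmpersands(String rawXml) {
        return BARE_AMP.matcher(rawXml).replaceAll("&amp;");
    }

    // Repair the text, then hand it to the standard DOM parser.
    static Document parse(String rawXml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            String repaired = escapeBareAmpersands(rawXml);
            return builder.parse(new InputSource(new StringReader(repaired)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is still not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String raw = "<P><Figure ActualText=\"&T \">&amp;T </Figure></P>";
        Document doc = parse(raw);
        System.out.println(doc.getDocumentElement().getNodeName()); // prints "P"
    }
}
```

Because the repair happens on the raw text, the original file and the producing application stay untouched.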
paxdiablo
A: 

find a non-SAX method for parsing.

Can you suggest any other parser that provides this functionality?

Thanks in advance.

sudha
+1  A: 

As a workaround, you can:

  1. Replace all the occurrences of & with &amp; in the original input;
  2. Parse it;
  3. In your code that handles the result, handle the case where you now get escaped characters (e.g. &lt; instead of <).

Depending on the parser you're using, you can also try to find the class responsible for parsing and unescaping &-strings, and see if you can extend it/supply your own resolver. (What I'm saying is very vague, but the specifics depend on the tools you're using.)
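A minimal sketch of the three workaround steps, assuming the input is available as a String. The class and method names are invented for the example, and the unescape step only covers the five predefined entities, not numeric character references.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper illustrating the escape-everything workaround.
class EscapeAllWorkaround {
    // Step 1: escape every '&', including those that already start a valid reference.
    static String escapeAll(String rawXml) {
        return rawXml.replace("&", "&amp;");
    }

    // Step 3: undo the double escaping in text read back from the parsed document.
    // "&amp;" must be handled last, otherwise "&amp;lt;" would wrongly become "<".
    static String unescape(String text) {
        return text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&apos;", "'")
                   .replace("&amp;", "&");
    }

    // Steps 1 to 3 together: returns the text content of the root element.
    static String rootText(String rawXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(escapeAll(rawXml)))); // step 2
            return unescape(doc.getDocumentElement().getTextContent());
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is still not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootText("<P>R. O. WATER &lt;ZLl LITERS &T</P>"));
    }
}
```

One limitation of this approach: if the producer wrote a stray "&" that happens to be followed by something like "lt;", the unescape step cannot tell it apart from a genuine reference.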

Eli Acherkan
+1  A: 

Your input is not well-formed XML. Specifically, you cannot have an '&' character in an attribute value unless it is part of a well-formed character or entity reference.

AFAIK, you have two choices:

  • Write a "not exactly XML" parser yourself. I seriously doubt that you will find an existing one. Any self-respecting XML parser will reject invalid input.
  • Fix whatever is creating this (so-called) XML so that it doesn't put random '&' characters in places where they are not allowed. It's quite simple really. As you are building the XML, replace the '&' character that is not already part of a character reference with '&amp;'
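If the producing application happens to be written in Java (an assumption; the thread does not say), one way to get the escaping right is to let an XML writer do it instead of concatenating strings. The sketch below uses the standard StAX XMLStreamWriter; the class name and the Figure/ActualText example are only illustrative.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical producer-side helper: the writer escapes '&' and '<' itself.
class FigureWriter {
    static String writeFigure(String actualText) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartElement("Figure");
            xml.writeAttribute("ActualText", actualText); // raw value, escaped by the writer
            xml.writeCharacters(actualText);              // raw text, escaped by the writer
            xml.writeEndElement();
            xml.flush();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The raw value contains a bare '&'; the output should carry it as &amp;.
        System.out.println(writeFigure("&T "));
    }
}
```

With this approach the application passes raw values around and never has to decide whether an '&' is already escaped.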
Stephen C
A: 
PSpeed