ansaurus

Question

How do I get the text from an XML file together with its position in the file?

Answer 1

+1 A:

You should not rely on text position in an XML file (whitespace is completely ignored by any sane parser). What you can (and should) do is use XPath to identify the nodes you are interested in, and then extract the text from those nodes. If you're interested in just the text nodes, the query "//text()" will select all of them.

Mike 2009-12-18 08:00:53
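
A minimal sketch of the "//text()" query suggested above, written in Java with the standard javax.xml.xpath API rather than C#. It is not code from the thread: the class name TextNodes and the method extract are hypothetical, chosen only for illustration. Note that it returns the text only, with no position information.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: collects the value of every text node in a document.
class TextNodes {
    static List<String> extract(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // "//text()" selects every text node, at any depth.
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//text()",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeValue());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Prints the text of the <a> and <b> elements.
        System.out.println(extract("<root><a>one</a><b>two</b></root>"));
    }
}
```

Whitespace-only text nodes between elements are also matched by this query, so indented input will likely need filtering.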

The question is not about extracting text nodes; I can already do that with ease. HTML browsers have an InnerText property. I want the text location so that I can modify the text at high speed, and XML parsers can't do what I want to achieve.

Priyank Bolia 2009-12-18 08:19:19

Answer 2

+6 A:

XmlTextReader implements IXmlLineInfo - if you look at the docs for IXmlLineInfo it gives an example of reading an XML file and reporting the location of each node.

EDIT: For those saying it's irrelevant, it may well be irrelevant to the XML - but quite possibly not to a human. If you're trying to tell people where to look in the XML for particular bits, it can be very helpful to report line numbers and positions.

Jon Skeet 2009-12-18 08:02:29

This is good, but it wouldn't solve the problem. I don't want the line number and line position; I am looking for the exact character position in the XML file, and I'm not sure this can provide it.

Priyank Bolia 2009-12-18 08:28:34

@Priyank: No, I'm not sure you can, I'm afraid.

Jon Skeet 2009-12-18 08:40:02

You could work back from line+column to character offset by loading the file as text (decoded using the XmlTextReader.Encoding) and counting newlines.

bobince 2009-12-18 09:11:21
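
The conversion described in the comment above can be sketched as follows. This is a minimal Java sketch, not code from the thread; the loop ports directly to C#. The class name LineColumnOffset and the method toOffset are hypothetical. It assumes that line and column are both 1-based, that the column counts UTF-16 code units, that "\n", "\r\n" and a lone "\r" each count as one line break, and that the text was decoded with the same encoding the reader used.

```java
// Hypothetical helper: maps a (line, column) pair to a character offset.
class LineColumnOffset {
    /**
     * Returns the 0-based character offset of the given 1-based line and
     * 1-based column by scanning the text and counting line breaks.
     */
    static int toOffset(String text, int line, int column) {
        if (line < 1 || column < 1) {
            throw new IllegalArgumentException("line and column are 1-based");
        }
        int offset = 0;
        int currentLine = 1;
        // Advance to the first character of the requested line.
        while (currentLine < line) {
            if (offset >= text.length()) {
                throw new IllegalArgumentException("line is beyond the end of the text");
            }
            char c = text.charAt(offset++);
            if (c == '\n') {
                currentLine++;
            } else if (c == '\r') {
                // Treat "\r\n" as a single line break.
                if (offset < text.length() && text.charAt(offset) == '\n') {
                    offset++;
                }
                currentLine++;
            }
        }
        return offset + column - 1;
    }

    public static void main(String[] args) {
        String xml = "<root>\n  <item>hello</item>\n</root>";
        int offset = toOffset(xml, 2, 9); // line 2, column 9 is the 'h' of "hello"
        System.out.println(xml.substring(offset, offset + 5)); // prints "hello"
    }
}
```

What the reported position points at (the start of the element name, the start of the text, or the end of the event) depends on the parser, so the result may need a fixed adjustment. Scanning from the start on every lookup is linear in the file size; for many lookups, precomputing an array of line start offsets would be the usual refinement.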

Answer 3

A:

You should never rely on the exact position (line and character count) of a node in a flat XML file. Those numbers are irrelevant and not part of the XML standard. Whitespace should be (and will be by any good parser) ignored.

If you really, really want to solve this the way you want to, I suggest you forget about it being XML and parse each line individually using a regexp to find a certain string.
But I do think you are solving the right problem with the wrong tools. If you tell us a bit more about why you are trying to do this you might get better answers.

mizipzor 2009-12-18 08:07:27

And what string would you grep for in HTML? There will be zillions of such strings.

Priyank Bolia 2009-12-18 08:25:54

That of course depends on the site you want to parse. Is speed an issue here?

mizipzor 2009-12-18 08:40:00

Answer 4

A:

The SAX specification for reading XML (which almost all XML tools implement) provides a ContentHandler with a Locator which allows you to get the line and character (column) number.

int getColumnNumber()
    Return the column number where the current document event ends.
int getLineNumber()
    Return the line number where the current document event ends.

(I missed the requirement for C#. The example above is for Java but I will try to find the corresponding C# interface).

The event could be a string of characters.

SAX for .NET is described in: http://saxdotnet.sourceforge.net/

peter.murray.rust 2009-12-18 08:08:25
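
A minimal sketch of how the Locator from Answer 4 might be used, in Java with the JDK's built-in SAX parser. It is not code from the thread: the names TextLocationHandler, Hit and locate are hypothetical. It assumes the parser supplies a Locator (SAX does not require one) and, as the listing above says, the reported position is where the event ends, not where the text starts.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records each non-blank run of character data
// together with the position the Locator reports for that event.
class TextLocationHandler extends DefaultHandler {
    static final class Hit {
        final String text;
        final int line;
        final int column;

        Hit(String text, int line, int column) {
            this.text = text;
            this.line = line;
            this.column = column;
        }
    }

    private Locator locator;
    final List<Hit> hits = new ArrayList<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        // Called by the parser before parsing starts, if it supports locators.
        this.locator = locator;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length);
        if (locator == null || text.trim().isEmpty()) {
            return; // no position available, or whitespace only
        }
        // The parser may deliver one text node in several chunks.
        hits.add(new Hit(text, locator.getLineNumber(), locator.getColumnNumber()));
    }

    static List<Hit> locate(String xml) {
        try {
            TextLocationHandler handler = new TextLocationHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.hits;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        for (Hit h : locate("<root>\n  <item>hello</item>\n</root>")) {
            System.out.println(h.text + " ends at line " + h.line + ", column " + h.column);
        }
    }
}
```

As with IXmlLineInfo in Answer 2, this yields a line and column rather than an absolute character offset.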

The question specifies C#.

Jon Skeet 2009-12-18 08:13:21
