Hi there. I am writing a fairly basic GreaseMonkey script that locates text in a specific element and then uses that text to do things later. The relevant bits of code are as follows:

In the HTML there is a span with the class 'someclass', which contains a small string of text:

<span class="someclass">some text</span>

Then in the JavaScript I am trying to find this class and pull its contents (the 'some text') into a variable using the standard XPath jazz:

document.evaluate("//span[@class='someclass']/text()", document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);

Here's the problem: when I run this on pages where 'some text' is a basic string with basic characters, everything works fine, but when I run it on pages where 'some text' contains entities, it fails. For example, these are all fine and XPath returns the text I want:

<span class="someclass">some text</span>
<span class="someclass">some other text</span>
<span class="someclass">sometext</span>
<span class="someclass">some text 12345</span>

However, this gives me an error:

<span class="someclass">some text&#39;s text</span>

The error returned is:

Error: The expression is not a legal expression.
Source File: file:///blahblahblah.user.js
Line: (JS line i gave above)

I found a few results on here and on Google talking about how XPath has trouble with entities, but they were all doing things like [text() = 'blah &raquo; blah']. In other words, their entities are in the XPath query itself. Mine aren't; they're in the text that I'm trying to return from the XPath query.

Is this the same problem? Is there any easy way around it?

Thanks!

A: 

The problem is that a string literal in an XPath expression must be surrounded by either quotes or apostrophes and must not contain the surrounding character. XPath 1.0 has no escape sequence for either character inside a literal.

A literal string that contains both quotes and apostrophes needs to be transformed (in your case by your JavaScript program) into one that doesn't contain both of these types of characters.

The simplest way to do this is to replace each instance of one of these types of characters with its character entity -- say replace every ' with &#39; and use the ' as surrounding character for the literal string.

A second way is to replace

some text&#39;s text

with the XPath expression:

concat('some text', "'", ' text')
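The second approach can be automated. The following is a minimal sketch, not a definitive implementation: `toXPathLiteral` is a hypothetical helper name (it is not part of any browser API), and it assumes the text has already been read from the page as a plain JavaScript string. It only falls back to concat() when the text contains both kinds of quote character.

```javascript
// Hypothetical helper: turn an arbitrary string into an XPath 1.0 string literal.
function toXPathLiteral(text) {
  // No apostrophe in the text: apostrophes can safely surround it.
  if (!text.includes("'")) {
    return "'" + text + "'";
  }
  // No double quote in the text: double quotes can safely surround it.
  if (!text.includes('"')) {
    return '"' + text + '"';
  }
  // Both kinds present: split on apostrophes, wrap each piece in apostrophes,
  // and rejoin the pieces with a double-quoted apostrophe inside concat().
  const pieces = text.split("'").map(piece => "'" + piece + "'");
  return "concat(" + pieces.join(", \"'\", ") + ")";
}

console.log(toXPathLiteral("some text's text"));
console.log(toXPathLiteral("say \"it's\""));
```

The result can then be spliced into a larger expression, for example "//span[@class='someclass'][. = " + toXPathLiteral(text) + "]", before passing it to document.evaluate.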

Warning: It is not a good idea to use untrusted data to create an XPath expression, because this may result in XPath injection. To avoid XPath injection, if your programming language and function libraries allow it, always compile your XPath expression and run it with the data passed in as parameters.
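The browser's document.evaluate has no argument for variable bindings, so in a userscript one way to keep page text out of the expression entirely is to select the candidate elements first and compare their text in JavaScript. This is only a sketch of that idea: `filterByText` is a hypothetical helper, and it assumes each node exposes a `textContent` property, as DOM elements do.

```javascript
// Hypothetical helper: keep only the nodes whose text equals `wanted` exactly.
// `nodes` can be any array-like collection of objects with a `textContent` property.
function filterByText(nodes, wanted) {
  return Array.from(nodes).filter(node => node.textContent === wanted);
}
```

In a userscript this might be called as filterByText(document.querySelectorAll("span.someclass"), text). Because the text is compared as data and never parsed as XPath, quotes and apostrophes in it need no special handling.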

Dimitre Novatchev
I had been trying to go 'the simple way', but it seems Firefox's XPath implementation (or maybe all of them, I don't know, I'm only pretending to be a programmer...) automatically evaluates character entities, so querying for '&#39;' doesn't actually work, even though that's what's in the HTML. Your second way is how I'll have to do it. Cheers!
ccw
@ccw: The "simple method" will work if you issue a Replace("'", "&#39;"). In the text it *is* actually just an apostrophe: on reading the text, the browser converts any character entity to a single character. You have to do the reverse programmatically.
Dimitre Novatchev