I have some text in an element on my page, and I want to scrape the price from that page without any of the text beside it.

The page contains the price in markup like this:

<span class="discount">now $39.99</span>

How can I filter this and get just "$39.99", using only JavaScript and regular expressions?

This question may be too easy, or may have been asked in another way before, but I know nothing about regular expressions, so I'm asking for your help :).

+3  A: 
document.evaluate("//span[@class='discount']", 
  document, 
  null, 
  XPathResult.ANY_UNORDERED_NODE_TYPE, 
  null).singleNodeValue.textContent.replace("now $", "");

EDIT: This is standard XPath. I'm not sure what kind of explanation you're seeking. For outdated browsers, you will need a third-party library like Sarissa and/or Java-line.

Matthew Flaschen
Don't you think some explanation would help? What is this? What are the requirements to use this?
rudolfson
I am only vaguely familiar with XPath, but here is a rough explanation: // means that we want to match any instances of the following tag in the tree (as opposed to one with a specific parent), span is the tag to match, the stuff between [] are additional constraints, in this case the attribute class must be discount.
Chas. Owens
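Editor's sketch, not part of Matthew Flaschen's answer: the XPath call above throws if no span matches, and it strips the "$" that the question asked to keep. A minimal variant, assuming a hypothetical helper name `discountLabel` and a `doc` argument that behaves like `document`, might look like this. Note that `@class='discount'` only matches when the class attribute is exactly "discount".

```javascript
// Hypothetical helper (not from the thread): returns the text of the first
// <span class="discount"> with the leading "now " removed, or null when
// no such span exists.
function discountLabel(doc) {
  // Same numeric value as XPathResult.FIRST_ORDERED_NODE_TYPE in browsers
  var FIRST_ORDERED_NODE_TYPE = 9;
  var result = doc.evaluate("//span[@class='discount']", doc, null,
                            FIRST_ORDERED_NODE_TYPE, null);
  var node = result.singleNodeValue;
  if (!node) {
    return null; // no matching span on the page
  }
  // Drop only the word "now" so the "$" stays in the result
  return node.textContent.replace(/^\s*now\s+/, "");
}
```

In a browser this would be called as `discountLabel(document)`.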
+4  A: 
<script type="text/javascript">
window.onload = function () {

    // Get all of the elements with class name "discount"
    var elements = document.getElementsByClassName('discount');

    // Loop over each <span class="discount">
    for (var i=0; i < elements.length; i++) {

         // get the text, e.g. "now $39.99"
         var rawText = elements[i].innerHTML;

         // Here's a regular expression to match one or more digits (\d+)
         // followed by a period (\.) and one or more digits again (\d+).
         // match() returns an array of matches, or null if nothing matched.
         var match = rawText.match(/\d+\.\d+/);
         if (!match) continue; // no price in this element, skip it

         // You'll want to make the price a floating point number if you 
         // intend to do any calculations with it.
         var price = parseFloat(match[0]); 

         // Now what do you want to do with the price? I'll just write it out
         // to the console (using FireBug or something similar)
         console.log(price);

    }
}
</script>
Patrick McElhaney
Just because you name the variable spans doesn't mean you're only matching spans...
Matthew Flaschen
Good point, Matthew. Corrected.
Patrick McElhaney
Yes..., but you only corrected the variable name. It still matches non-spans!
Matthew Flaschen
The OP says "I have some text in an element in my page" - it doesn't say that it's just in spans.
nickf
He said, it contains a "price like that", where that clearly is a "span".
Matthew Flaschen
and it is also clearly an "example". you don't know that all the occurrences appear in spans.
nickf
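Editor's sketch for the spans-versus-any-element disagreement above: with `querySelectorAll` the choice is a one-word change in the selector. The helper name `discountPrices` and its `spansOnly` flag are hypothetical, and the regex keeps the "$" because the question asked for "$39.99". Treat it as an illustration under those assumptions, not as either answerer's code.

```javascript
// Hypothetical helper (not from the thread).
// root:      anything with querySelectorAll, e.g. document
// spansOnly: true  -> only <span class="discount">
//            false -> any element with class "discount"
function discountPrices(root, spansOnly) {
  var selector = spansOnly ? "span.discount" : ".discount";
  var nodes = root.querySelectorAll(selector);
  var prices = [];
  for (var i = 0; i < nodes.length; i++) {
    // "$", one or more digits, then an optional decimal part
    var match = nodes[i].textContent.match(/\$\d+(?:\.\d+)?/);
    if (match) {
      prices.push(match[0]); // elements without a price are skipped
    }
  }
  return prices;
}
```

In a browser, `discountPrices(document, true)` would restrict the search to spans, and `discountPrices(document, false)` would accept any element carrying the class.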
+1  A: 

Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.

Patrick McElhaney's and Matthew Flaschen's answers are both good ways to solve the problem.

Chas. Owens
The question asks how to filter using regexes AND JavaScript. You would leverage JavaScript's DOM functions and then apply a regex to the extracted text. A browser is as good an HTML parser as you'll need. There's *definitely* no need to pull in a new parsing library just for this.
nickf
@nickf As you say the browser is a parser. Look at the examples on the page, there are at least two that use the browser as the parser.
Chas. Owens
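Editor's sketch of the point both commenters end up agreeing on: let the browser parse the markup, then run the regex on the text only. The `title` attribute in the sample markup is invented for the illustration, and `firstDollarAmount` is a hypothetical name. The same regex picks the wrong price when run on raw HTML and the right one when run on the element's text.

```javascript
// Hypothetical helper: first "$<digits>[.<digits>]" in a string, or null.
function firstDollarAmount(s) {
  var m = s.match(/\$\d+(?:\.\d+)?/);
  return m ? m[0] : null;
}

// Invented sample: the old price sits in an attribute, the current one in the text.
var rawHtml = '<span class="discount" title="was $59.99">now $39.99</span>';
// What the DOM would give for that element's textContent
var textOnly = "now $39.99";

var fromHtml = firstDollarAmount(rawHtml);  // "$59.99", the old price from the attribute
var fromText = firstDollarAmount(textOnly); // "$39.99", the price actually shown
```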
A: 

As Matthew Flaschen suggested, XPath is a better way to go if you know something about the node structure of the target document (and since you provided an example, you seem to). If you don't know the node structure, regexes are still lousy for parsing XML.

some more resources to kick-start you:

I've also found the Firefox extension combo of DOM Inspector and XPather to be invaluable for deriving and testing XPath expressions on a given page. (If you're using another browser -- well, I don't know.)

Michael Paulukonis