I am looking for a JavaScript regex to search for text ("span", for example) in HTML.

Example:

<div>Lorem span Ipsum dor<a href="blabla">lablala</a> dsad <span>2</span> ... </div>

BUT only the "span" after "Lorem" should be matched, not the <span> tag.
For a second example, if we search for "bla", only the bold text should be matched.

EDIT:

The HTML is obtained via innerHTML. Each match will be wrapped as <span class="x">$text</span> and the result written back to the innerHTML of the same node, all without destroying the other tags.

EDIT2 and My Solution:

I wrote my own search: it scans character by character, using a cache and flags.

Thanks for your help, guys!
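
The asker did not post that code, so the following is only a minimal sketch of what a character-by-character scan with an "inside a tag" flag could look like. The names findInText and wrapInText are hypothetical, and the sketch assumes that the search word contains no angle brackets and that attribute values, comments and scripts contain no literal "<" or ">".

```javascript
// Return the offsets of `word` that fall in text, not inside <...> tags.
function findInText(html, word) {
  const hits = [];
  let inTag = false; // flag: are we currently between "<" and ">"?
  for (let i = 0; i < html.length; i++) {
    const ch = html[i];
    if (ch === '<') { inTag = true; continue; }
    if (ch === '>') { inTag = false; continue; }
    if (!inTag && word && html.startsWith(word, i)) {
      hits.push(i);
      i += word.length - 1; // skip past the match
    }
  }
  return hits;
}

// Wrap every hit in <span class="x">...</span>, leaving tags untouched.
function wrapInText(html, word) {
  let out = '';
  let last = 0;
  for (const at of findInText(html, word)) {
    out += html.slice(last, at) + '<span class="x">' + word + '</span>';
    last = at + word.length;
  }
  return out + html.slice(last);
}

const sample = '<div>Lorem span Ipsum dor<a href="blabla">lablala</a> dsad <span>2</span> ... </div>';
const spanOffsets = findInText(sample, 'span');
const wrapped = wrapInText('<b>span</b> span', 'span');
```

On the sample from the question, only the "span" after "Lorem" is reported; the tag names and the href value are skipped because the flag is set while the scan is inside a tag.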

A: 

If I understand you correctly, you want to search for a word, but only words which are not part of an HTML tag.

I don't have an exact answer for you, but some tools I use for developing regular expressions are this site: http://www.regular-expressions.info/ and this program: http://www.radsoftware.com.au/regexdesigner/

Brandon Montgomery
A: 

This might be impossible in the general case because you would need to count opening and closing tags, which is not possible with regular expressions.

Regex is not a smart solution for handling XML. Instead you should use HTML or XML DOM methods to extract the required information.

If you really want or need to use regular expressions you might try something like the following.

>[^<]*bla[^<]*<

But I am quite sure that this will not work in the general case.
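
To see what this pattern actually returns, here is a small sketch run against the markup from the question. The helper name textRunsContaining is hypothetical, and the word is assumed to contain no regex metacharacters.

```javascript
// Sample markup taken from the question.
const html = '<div>Lorem span Ipsum dor<a href="blabla">lablala</a> dsad <span>2</span> ... </div>';

// Build ">[^<]*WORD[^<]*<" and return every match as a string.
function textRunsContaining(word, source) {
  const rx = new RegExp('>[^<]*' + word + '[^<]*<', 'g');
  return source.match(rx) || [];
}

const blaRuns = textRunsContaining('bla', html);   // the run between <a ...> and </a>
const spanRuns = textRunsContaining('span', html); // the run between <div> and <a ...>
const bareRuns = textRunsContaining('bla', 'bla bla'); // no surrounding tags
```

Two limits show up right away: each match is the whole text run including the surrounding ">" and "<", not just the word, and text that is not enclosed by tags on both sides (the last call) is not found at all.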

Daniel Brückner
+1  A: 

If you've got the HTML in a DOM element, you may use textContent/innerText to grab the text (without any HTML tags):

var getText = function(el) {
    return el.textContent || el.innerText;
};
// usage:
// <div id="myElement"><span>Lorem</span> ipsum <em>dolor</em></div>
alert(getText(document.getElementById('myElement'))); // "Lorem ipsum dolor"
moff
+1  A: 
(?<!\<|/)span

This should give all span occurrences that are not tags. Hope this helped at least a bit :)

Explanation: find every span occurrence that is not preceded by < or /
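
For engines that do support lookbehind (it was added to the language in ECMAScript 2018, after this thread was written), the idea can be tried as follows. This is only a sketch; the helper name notAfterTagStart is hypothetical and the word is assumed to contain no regex metacharacters.

```javascript
// Sample markup taken from the question.
const html = '<div>Lorem span Ipsum dor<a href="blabla">lablala</a> dsad <span>2</span> ... </div>';

// Pattern for WORD that is not directly preceded by "<" or "/".
const notAfterTagStart = (word) => new RegExp('(?<![<\\/])' + word, 'g');

// Offsets of every match.
const spanHits = [...html.matchAll(notAfterTagStart('span'))].map((m) => m.index);
const hrefHits = [...html.matchAll(notAfterTagStart('href'))].map((m) => m.index);
```

For "span" this finds only the occurrence after "Lorem". For "href" it still reports the attribute name inside the <a> tag, because the lookbehind only inspects the single character before the word.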

Peter Perháč
Sorry, but there is no lookbehind in JS: http://www.regular-expressions.info/javascript.html And what about "href", for example?
Stupid2.de
Then try changing your approach. Don't force JavaScript to solve problems it isn't designed to solve. Whatever you're doing, try looking at the task at hand from a different perspective.
Peter Perháč
+2  A: 

You could use dom methods to process every text node.

This method takes a parent node as the first argument and loops through all of its child nodes, processing the text nodes with the function passed as the second argument. That function is where you would operate on the text node's data, to find, replace, or delete the found text, or to wrap it in a 'highlighted' span, for example.

You can also call the function with only the first argument, and it will return an array of text nodes that you can then use to manipulate the text. The array items in that case are nodes, each with its own data, parent and siblings.

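document.deepText = function deepText(hoo, fun) {
    // Collects the text nodes under hoo, or the defined results of fun(node).
    var A = [], tem;
    if (hoo) {
        hoo = hoo.firstChild;
        while (hoo != null) {
            if (hoo.nodeType == 3) { // 3 is Node.TEXT_NODE
                if (fun) {
                    if ((tem = fun(hoo)) !== undefined) {
                        A[A.length] = tem;
                    }
                }
                else A[A.length] = hoo;
            }
            // Recurse by name; arguments.callee throws in strict mode.
            else A = A.concat(deepText(hoo, fun));
            hoo = hoo.nextSibling;
        }
    }
    return A;
};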

//test case

function ucwords(pa, rx){
    var f= function(node){
     var t= node.data;
     if(t && t.search(rx)!=-1){
      node.data= t.replace(rx,function(w){return w.toUpperCase()});
      return node;
     }
     return undefined;
    }
    return document.deepText(pa, f);
}

ucwords(document.body,/\bspan\b/ig)
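
The same text-node approach can also produce the <span class="x"> wrapping asked for in the question. The following is a rough sketch rather than part of the answer above: splitByMatches and highlight are hypothetical names, the search is a plain substring match, and highlight assumes a browser DOM (it uses document.createTreeWalker instead of deepText so that the block stands alone).

```javascript
// Split a string into plain and matching pieces. Pure, so it runs without a DOM.
function splitByMatches(text, word) {
  const pieces = [];
  let from = 0;
  let at = text.indexOf(word, from);
  while (word && at !== -1) {
    if (at > from) pieces.push({ text: text.slice(from, at), hit: false });
    pieces.push({ text: word, hit: true });
    from = at + word.length;
    at = text.indexOf(word, from);
  }
  if (from < text.length) pieces.push({ text: text.slice(from), hit: false });
  return pieces;
}

// Replace each matching text node under root with text plus <span class="x"> wrappers.
function highlight(root, word) {
  const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
  const textNodes = [];
  // Collect first, so the tree is not modified while it is being walked.
  while (walker.nextNode()) textNodes.push(walker.currentNode);
  for (const node of textNodes) {
    const pieces = splitByMatches(node.data, word);
    if (!pieces.some((p) => p.hit)) continue;
    const frag = document.createDocumentFragment();
    for (const p of pieces) {
      if (p.hit) {
        const span = document.createElement('span');
        span.className = 'x';
        span.textContent = p.text;
        frag.appendChild(span);
      } else {
        frag.appendChild(document.createTextNode(p.text));
      }
    }
    node.parentNode.replaceChild(frag, node);
  }
}

// usage (browser only):
// highlight(document.getElementById('myElement'), 'span');
```

Because only text nodes are touched, tag names and attribute values such as href are never candidates, and no innerHTML round trip is needed.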

kennebec
+1  A: 

What you want to do can be done pretty easily with jQuery:

  $("span:contains('blah')")

If you want to do regular expression matching, do what was done in this previous Stack Overflow example:

jQuery Regular Expressions

For a more elegant solution, create a custom selector.

altCognito
+1  A: 
/span(?=[^>]*<)/

In other words, looking ahead from the end of the word "span" there is no closing angle bracket before the next opening angle bracket, so we can't be inside a tag. Supposedly, quoted attribute values can contain closing angle brackets, though I've never seen it done. But, to cover that possibility, you can use this regex:

/span(?=(?:[^>"']+|"[^"]*"|'[^']*')*<)/
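
As a quick check, both patterns can be run against the markup from the question. This is only a sketch; the helper name offsets is hypothetical.

```javascript
// Sample markup taken from the question.
const html = '<div>Lorem span Ipsum dor<a href="blabla">lablala</a> dsad <span>2</span> ... </div>';

// Offsets at which rx matches in source (rx must carry the g flag).
const offsets = (rx, source) => [...source.matchAll(rx)].map((m) => m.index);

const simple = offsets(/span(?=[^>]*<)/g, html);
const quoted = offsets(/span(?=(?:[^>"']+|"[^"]*"|'[^']*')*<)/g, html);
const bla = offsets(/bla(?=[^>]*<)/g, html);

// No "<" follows the word here, so the lookahead fails.
const trailing = offsets(/span(?=[^>]*<)/g, '<b>x</b> span');
```

On this sample both patterns report only the "span" after "Lorem", and the "bla" variant reports only the occurrence inside the link text. The last call shows one limit of the lookahead: text after the final tag has no following "<", so it is never matched.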
Alan Moore