views:

230

answers:

3

Page contents:

aa<b>1;2'3</b>hh<b>aaa</b>..
 .<b>bbb</b>
blabla..

I want to get this result:

1;2'3aaabbb

The tags to match are <b> and </b>.

How can I write this regex in JavaScript? Thanks!

+5  A: 

You cannot parse HTML using regular expressions.

Instead, you should use Javascript's DOM.

For example (using jQuery):

var text = "";
$('<div>' + htmlSource + '</div>')
    .find('b')
    .each(function() { text += $(this).text(); });

I wrap the HTML in a <div> tag to find both nested and non-nested <b> elements.

SLaks
1732348 is SO's 42. It answers a huge number of questions. Upvoting it starts to feel daft, but heck, it won't stop being true any time soon...
David Hedlund
For the record, you cannot **reliably** parse HTML using regular expressions. If certain conditions are met, information can be *extracted* just fine from well-formed (X)HTML with regular expressions.
vladr
I want to use a JavaScript regex to get the result. I don't want to parse the HTML (it's slow). Any other ideas? Thanks :)
Zenofo
@lazyanno, if you are trying to extract information from the page itself, then the HTML has already been parsed by the browser and you don't pay any additional penalty for using the DOM as `SLaks` suggested.
vladr
You cannot do this with a regex. (Unless you want it to mysteriously fail every couple of hours)
SLaks
@Vlad Romascanu, I get this content from an XHR stream. It's not an HTML page and it's not parsed by my browser; it's only a JavaScript variable, so I want to use a regex to get the result.
Zenofo
I used $('<div>'+c+'</div>').find('b') and it works, thanks. I don't know of a better solution, though I still think using a regex directly would be faster.
Zenofo
+1  A: 

Here is an example without a jQuery dependency:

// get all elements with a certain tag name
// (this returns an array-like HTMLCollection, not a real Array)
var b = document.getElementsByTagName("B");

// map() executes a function on each member and builds a new array
// from the function results; the collection has no map() of its own,
// so borrow it from Array.prototype
var text = Array.prototype.map.call(b, function(element) {
  // ...in this case we are interested in the element text
  if (typeof element.textContent != "undefined")
    return element.textContent; // standards compliant browsers
  else
    return element.innerText;   // IE
});

// now that we have an array of strings, we can join it
var result = text.join('');
Tomalak
I don't think his HTML is in the DOM.
SLaks
@SLaks: Hm… He said "page contents:" in his post.
Tomalak
Read his comment to my answer.
SLaks
@SLaks: I see. Hooray for precise question asking.
Tomalak
+1  A: 

Lazyanno,

If and only if:

  1. you have read SLaks's post (as well as the previous article he links to), and
  2. you fully understand the numerous and wondrous ways in which extracting information from HTML using regular expressions can break, and
  3. you are confident that none of these concerns apply in your case (e.g. you can guarantee that your input will never contain nested or mismatched <b>/</b> tags, or occurrences of <b> or </b> within <script>...</script> or comment <!-- .. --> blocks), and
  4. you absolutely and positively want to proceed with regular expression extraction

...then use:

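var str = "aa<b>1;2'3</b>hh<b>aaa</b>..\n.<b>bbb</b>\nblabla..";

// (.*?) is non-greedy, so each match stops at the nearest </b>;
// the g flag lets exec() resume after the previous match on each call
var match, result = "", regex = /<b>(.*?)<\/b>/ig;
while ((match = regex.exec(str)) !== null) { result += match[1]; }

alert(result);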

Produces:

1;2'3aaabbb
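
To make the caveats in the list above concrete, here is a minimal sketch that wraps the same idea in a function and runs it on two inputs. The name `extractBoldText` is hypothetical, and the pattern swaps `.` for `[\s\S]` so that a <b> element spanning a line break is still matched, which is an assumption about what you want rather than part of the code above. The second input shows how nesting silently produces the wrong text instead of raising an error.

```javascript
// Hypothetical helper: concatenates the contents of every <b>...</b> pair.
// Assumes no nested <b> tags, no attributes on <b>, and no <b> inside
// comments or <script> blocks.
function extractBoldText(html) {
  // [\s\S] matches any character, including newlines; *? keeps it non-greedy.
  var regex = /<b>([\s\S]*?)<\/b>/gi;
  var match, result = "";
  while ((match = regex.exec(html)) !== null) {
    result += match[1];
  }
  return result;
}

var flat = extractBoldText("aa<b>1;2'3</b>hh<b>aaa</b>..\n.<b>bbb</b>\nblabla..");
// flat is "1;2'3aaabbb"

var nested = extractBoldText("<b>outer <b>inner</b> tail</b>");
// nested is "outer <b>inner": the first </b> ends the match too early
```

If your input can ever look like the second case, the DOM approach is the safer choice.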
vladr
It's cool! Thank you! :)
Zenofo
@lazyanno, before picking either the regex or the DOM solution on performance grounds, make sure to **time both**: **parse a "representative" string** with both methods several times in a loop, and see what the **actual timing is** on a **variety of browsers**.
vladr
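
One way to follow that advice is a small timing loop. This is only a rough sketch: `timeIt` and `regexExtract` are hypothetical names, the iteration count is arbitrary, and `Date.now()` has millisecond resolution at best, so treat the numbers as approximate. To compare against the DOM approach, pass `timeIt` a function that runs the jQuery or DOM code from the other answers on the same string, in each browser you care about.

```javascript
// Hypothetical helper: calls fn(input) `iterations` times and
// returns the elapsed wall-clock time in milliseconds.
function timeIt(fn, input, iterations) {
  var start = Date.now();
  for (var i = 0; i < iterations; i++) {
    fn(input);
  }
  return Date.now() - start;
}

// The regex extraction from the answer above, wrapped as a function.
function regexExtract(str) {
  var match, result = "", regex = /<b>(.*?)<\/b>/ig;
  while ((match = regex.exec(str)) !== null) { result += match[1]; }
  return result;
}

// Replace this toy input with a string that resembles your real XHR payload.
var sample = "aa<b>1;2'3</b>hh<b>aaa</b>..\n.<b>bbb</b>\nblabla..";
var elapsedMs = timeIt(regexExtract, sample, 10000);
console.log("regex: " + elapsedMs + " ms for 10000 runs");
```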