Hi, I had a quick question regarding RegEx...

I have a string that looks something like the following:

"This was written by <p id="auth">John Doe</p> today!"

What I want to do (with JavaScript) is basically extract the 'John Doe' from any tag with the ID of "auth".

Could anyone shed some light? I'm sorry to ask.

Full story: I am using an XML parser to pass data into variables from a feed. However, there is one tag in the XML document () that contains HTML passed into a string. It looks something like this:

 <item>
  <title>This is a title</title>
  <description>
  "By <p id="auth">John Doe</p> text text text... so on"
  </description>
 </item>

So as you can see, I can't use an HTML/XML parser for that p tag, because it's in a string, not a document.

A: 

Perhaps something like

document.getElementById("auth").innerHTML.replace(/<[^>]+>/g, '')

might work. innerHTML is supported on all modern browsers. (You may omit the replace if you don't care about removing HTML bits from the inner content.)
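
As a rough illustration of what the tag-stripping replace does, here is a minimal sketch that runs on a plain string instead of innerHTML. The name stripTags and the sample string are made up for this example, and the pattern is naive: it assumes no ">" appears inside an attribute value.

```javascript
// Hypothetical helper: remove anything that looks like an HTML tag.
// Naive by design; it does not handle ">" inside attribute values.
function stripTags(html) {
  return html.replace(/<[^>]+>/g, '');
}

var sample = 'John <b>Doe</b>';
var plain = stripTags(sample); // the <b> and </b> tags are removed
```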

If you have jQuery at your disposal, just do

$("#auth").text()
AKX
+1  A: 

No need of regular expressions to do this. Use the DOM instead.

var obj = document.getElementById('auth');
if (obj)
{
    alert(obj.innerHTML);
}

By the way, having multiple ids with the same value on the same page is invalid (and will surely result in odd JS behavior).

If you want several auth elements on the same page, use a class instead of an id. Then you can use something like:

// IIRC getElementsByClassName is new in FF3; you might consider using jQuery to do this in a more "portable" way, but you get the idea...
var objs = document.getElementsByClassName('auth');
if (objs)
{
    for (var i = 0; i < objs.length; i++)
        alert(objs[i].innerHTML); // index the collection (objs), not the undefined obj
}

EDIT: Since you want to parse a string that contains some HTML, you won't be able to use my answer as-is. Will your HTML string contain a whole HTML document? Some part? Valid HTML? Partial (broken) HTML?

AlexV
+1. Furthermore, in the post it is implied that there's more than one element with the ID 'auth'. @Jon McIntosh, perhaps you should be using the className 'auth' instead?
karim79
I don't think this will work if I'm extracting the 'html tags' from a string and not an actual HTML document, will it?
Jon McIntosh
Ooh, you should have mentioned this earlier :) I don't think a regular expression will be a "sane" solution for this problem. Try a parser instead. You could also send the HTML string to a web service that will do the job for you instead of "eating" the processing on the client side (in JS).
AlexV
Updated OP with the details
Jon McIntosh
A: 

What I want to do (with JavaScript) is basically extract the 'John Doe' from any tag with the ID of "auth".

You can't have the same id (auth) for more than one element. An id should be assigned once per element per page.

If, however, you assign a class of auth to elements, you can go about something like this assuming we are dealing with paragraph elements:

// find all paragraphs
var elms = document.getElementsByTagName('p');

for(var i = 0; i < elms.length; i++)
{
  // find elements with class auth
  if (elms[i].getAttribute('class') === 'auth') {
    var el = elms[i];

    // see if any paragraph contains the string
    if (el.innerHTML.indexOf('John Doe') != -1) {
      alert('Found ' + el.innerHTML);
    }
  }
}
Sarfraz
A: 

If the content of the tag contains only text, you could use this:

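function getText (htmlStr, id) {
  // Match an opening tag whose id attribute equals `id`,
  // then capture the text up to the next "<".
  var match = new RegExp ("<[^>]+\\sid\\s*=\\s*([\"'])"
    + id 
    + "\\1[^>]*>([^<]*)<"
  ).exec (htmlStr);
  return match ? match [2] : null; // null when no such tag is found
}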


var htmlStr = "This was written by <p id=\"auth\">John Doe</p> today!";
var id = "auth";
var text = getText (htmlStr, id);
alert (text === "John Doe");
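
If the string can contain several matching tags, or if the id might contain characters that are special in a regular expression, a variant along these lines may help. This is only a sketch under the same assumption as above (the tag content is plain text); escapeRegExp and getAllText are hypothetical names introduced for this example.

```javascript
// Hypothetical helper: escape characters that have special meaning in a regex.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical variant of getText: collects the text of every tag whose
// id attribute equals `id`. Returns an empty array when nothing matches.
function getAllText(htmlStr, id) {
  var re = new RegExp(
    "<[^>]+\\sid\\s*=\\s*([\"'])" + escapeRegExp(id) + "\\1[^>]*>([^<]*)<",
    "g"
  );
  var results = [];
  var match;
  // With the "g" flag, exec resumes after the previous match on each call.
  while ((match = re.exec(htmlStr)) !== null) {
    results.push(match[2]);
  }
  return results;
}
```

Returning an array instead of a single string avoids a null check at the call site: no match simply yields an empty array.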
trinithis
I tried your regex but I don't think it works. For example, "<[^>]+\\sid\\s*=\\s*([\"'])auth\\1[^>]*>([^<]*)<" won't find the content between the tags
Jon McIntosh
Oh, sorry, I wrote `auth` instead of `id` in the example.
trinithis
+3  A: 

Here's a way to get the browser to do the HTML parsing for you:

var string = "This was written by <p id=\"auth\">John Doe</p> today!";

var div = document.createElement("div");

div.innerHTML = string; // get the browser to parse the html

var children = div.getElementsByTagName("*");

for (var i = 0; i < children.length; i++)
{
    if (children[i].id == "auth")
    {
        alert(children[i].textContent);
    }
}

If you use a library like jQuery, you could hide the for loop and replace the use of textContent with something cross-browser.

Douglas
+1 - good job passing off the hard work to a real parser
Ryan Kinal
A: 

Assuming you only have 1 auth per string, you might go with something like this:

var str = "This was written by <p id=\"auth\">John Doe</p> today!",
    p = str.split('<p id="auth">'),
    q = p[1].split('</p>'),
    a = q[0];
alert(a);

Simple enough. Split your string on your paragraph, then split the second part on the paragraph close, and the first part of the result will be your value. Every time.
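
One caveat with the split version is that p[1] is undefined when the string has no matching paragraph, so the second split throws. A guarded sketch using indexOf might look like the following; extractAuth is a hypothetical name, and it still assumes the opening tag is written exactly as <p id="auth">.

```javascript
// Hypothetical helper: returns the text between the first <p id="auth">
// and the next </p>, or null when either marker is missing.
function extractAuth(str) {
  var open = '<p id="auth">';
  var start = str.indexOf(open);
  if (start === -1) return null;   // no opening tag
  start += open.length;            // move past the opening tag
  var end = str.indexOf('</p>', start);
  if (end === -1) return null;     // no closing tag after it
  return str.slice(start, end);
}
```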

Ryan Kinal