ansaurus

Question

Answer 1

+9 A:

document.getElementsByTagName('body')[0].innerHTML will return a string of everything in the body tag. It's not a regex, but I'm not sure why you need one...?

POST QUESTION EDIT:

The XHR object that you used for your AJAX request has responseText and responseXML properties. As long as the response is valid XML, which it probably should be, you can get any tag you want by calling getElementsByTagName on the responseXML document. But if you just want the inner parts of the body, I would do it this way:

var inner = myXHR.responseText.split(/(<body>|<\/body>)/ig)[2];
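
The one-liner above only matches a bare <body> tag. As a rough sketch, assuming the response is an ordinary HTML string, a slightly more tolerant version could look like the following. The helper name extractBodyInner is hypothetical, and this is still string matching rather than parsing, so a "<body" inside a comment or script will fool it.

```javascript
// Hypothetical helper (not from the original answer): returns the text between
// the first <body ...> tag and the last </body> tag, or null if either is missing.
function extractBodyInner(html) {
  // Opening tag, allowing attributes such as class="x".
  var open = /<body(?:\s[^>]*)?>/i.exec(html);
  if (!open) {
    return null;
  }
  var start = open.index + open[0].length;

  // Walk over every closing tag after the opening one and keep the last.
  var closeRe = /<\/body\s*>/ig;
  closeRe.lastIndex = start;
  var end = -1;
  var m;
  while ((m = closeRe.exec(html)) !== null) {
    end = m.index;
  }
  if (end === -1) {
    return null;
  }
  return html.slice(start, end);
}
```

With an XHR object like the one above, this would be called as extractBodyInner(myXHR.responseText).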

geowa4 2009-07-30 17:13:56

+1 for suggesting the right avenue to take... I provided the reasons why this was the right avenue to take in my response.

BenAlabaster 2009-07-30 17:31:11

"need"? It can't even be (sanely) done with a regex.

Svante 2009-07-30 18:31:11

**@Svante**: let's not get into sanity. if we started talking about that, we would realize how crazy you have to be to even look at a damn computer.

geowa4 2009-07-30 18:32:27

Answer 2

+5 A:

Regex isn't the ideal tool for parsing the DOM, as you will see mentioned throughout this site and others. The best way, as suggested by George IV, is to use the JavaScript tools that are better suited to the job: call getElementsByTagName and grab the innerHTML:

var bodyText = document.getElementsByTagName("body")[0].innerHTML;

Edit1: I've not checked it out yet, but Rudisimo suggested a tool that shows a lot of promise: the XRegExp library, an open source and extensible library released under the MIT License. This could potentially be a viable option. I still think the DOM is the better way, but this looks far superior to the standard JavaScript implementation of regex.

Edit2: I recant my previous statements about the Regex engine [for reasons of accuracy] due to the example provided by Gumbo - however absurd the expression might be. I do, however, stand by my opinion that using regex in this instance is an inherently bad way to go and you should reference the DOM using the aforementioned example.

BenAlabaster 2009-07-30 17:30:25

-1 You don’t need a look-behind assertion. JavaScript’s regex has a `i` modifier. And the `.` plus `s` modifier could be replaced by `[\s\S]`, `[\w\W]`, `[\d\D]`, etc.

Gumbo 2009-07-30 18:02:08

@Gumbo can you point me to documentation to support that? I've had problems with this in the past and had no joy, as all the documentation suggests otherwise. Can you post a regex that *would* work so I can test it and verify? Then I can remove this answer as inaccurate.

BenAlabaster 2009-07-30 18:11:15

See https://developer.mozilla.org/en/Core_JavaScript_1.5_Reference/Global_Objects/RegExp

Gumbo 2009-07-30 18:29:53

This is not a limitation of JavaScript's regex engine, it is a fundamental limitation of regular expressions per se.

Svante 2009-07-30 18:33:09

@Svante - I disagree, it took me only a few seconds to knock up an expression of .NET flavour in RegexBuddy that gave me the correct info without any problems at all. If it was an inherent limitation with regular expressions in general then this wouldn't be the case.

BenAlabaster 2009-07-30 20:52:34

Answer 3

A:

There is an alternative fix for the limitation that the dot does not match newlines in JavaScript's built-in RegExp. XRegExp is a powerful open source library released under the permissive MIT License (so it can be used in commercial projects), and it is very compact (2.7KB gzipped).

If you go to the New Flags section, you can see that there is a flag (s) that makes the dot match all characters, including newlines.
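
For comparison, here is a small sketch of the limitation itself and of the library-free workaround Gumbo mentions in the comments on Answer 2, the [\s\S] character class. It uses only the built-in RegExp; the sample string and variable names are made up for illustration.

```javascript
// A body whose contents span several lines.
var html = '<body>\nline one\nline two\n</body>';

// The dot does not match newline characters, so this finds nothing.
var dotMatch = /<body>(.*)<\/body>/i.exec(html);

// [\s\S] matches any character at all, newlines included.
var anyMatch = /<body>([\s\S]*)<\/body>/i.exec(html);

var dotResult = dotMatch ? dotMatch[1] : null;
var anyResult = anyMatch ? anyMatch[1] : null;
```

Here dotResult ends up null, while anyResult holds everything between the two tags.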

Rudisimo 2009-07-30 17:52:53

+1 Fantastic find! Do you know what flavour of regex it implements? Seems very promising at first glance.

BenAlabaster 2009-07-30 17:56:01

Check out the http://xregexp.com/syntax/ section. It gives you an idea of which version it uses based on its Named Capture support, which seems to be .NET's.

Rudisimo 2009-07-30 18:11:33

Answer 4

+1 A:

In general regular expressions are not suitable for parsing. But if you really want to use a regular expression, try this:

/^\s*(?:<(?:!(?:(?:--(?:[^-]+|-[^-])*--)+|\[CDATA\[(?:[^\]]+|](?:[^\]]|][^>]))*\]\]|[^<>]+)|(?!body[\s>])[a-z]+(?:\s*(?:[^<>"']+|"[^"]*"|'[^']*'))*|\/[a-z]+)\s*>|[^<]+)*\s*<body(?:\s*(?:[^<>"']+|"[^"]*"|'[^']*'))*\s*>([\s\S]+)<\/body\s*>/i

As you see, there is no easy way to do that. And I wouldn’t even claim that this is a correct regular expression. But it should take comment tags (<!-- … -->), CDATA tags (<![CDATA[ … ]]>) and normal HTML tags into account.

Good luck while trying to read it.
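
As a usage sketch only, this is how the expression could be applied to a response string. The pattern is copied unchanged from above. The names bodyPattern, sample and inner are illustrative, and the sample document is deliberately simple, so this says nothing about how the pattern copes with real-world markup.

```javascript
// The expression from this answer, copied unchanged.
var bodyPattern = /^\s*(?:<(?:!(?:(?:--(?:[^-]+|-[^-])*--)+|\[CDATA\[(?:[^\]]+|](?:[^\]]|][^>]))*\]\]|[^<>]+)|(?!body[\s>])[a-z]+(?:\s*(?:[^<>"']+|"[^"]*"|'[^']*'))*|\/[a-z]+)\s*>|[^<]+)*\s*<body(?:\s*(?:[^<>"']+|"[^"]*"|'[^']*'))*\s*>([\s\S]+)<\/body\s*>/i;

// A small, well-formed sample document (stands in for myXHR.responseText).
var sample = '<!DOCTYPE html>\n' +
  '<html><head><title>T</title></head>\n' +
  '<body class="x"><p>Hi</p></body></html>';

// Capture group 1 holds the contents of the body element.
var match = bodyPattern.exec(sample);
var inner = match ? match[1] : null;
```

Always check for a null match before reading the capture group, since any markup the pattern does not anticipate makes exec return null.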

Gumbo 2009-07-30 18:47:51

Okay, you got me beat - good job, +1 for sheer tenacity. That expression is ridiculous though. I wouldn't recommend that to my worst enemy over traversing the DOM.

BenAlabaster 2009-07-30 20:54:07

Answer 5

+1 A:

Everybody seems dead set on using regular expressions so I figured I'd go the other way and answer the second query you had.

It is theoretically possible to parse the result of your AJAX as an xmlDocument. There are a few steps you'll likely want to take if you want this to work.

1. Use a library. I recommend jQuery.
2. If you're using a library you must make sure that the mimetype of the response is an xml mimetype!
3. Make sure you test thoroughly in all your target browsers. You will get tripped up.

That being said, I created a quick example on jsbin. It works in both IE and Firefox. Unfortunately, in order to get it to work I had to roll my own XMLHttpRequest object.

View the example source code here

(Seriously though, this code is ugly. It's worth using a library and setting the mime type properly...)

function getXHR() {
 var xmlhttp;
 //Build the request
 if (window.XMLHttpRequest) {
  // code for IE7+, Firefox, Chrome, Opera, Safari
  xmlhttp=new XMLHttpRequest();
 } else if (window.ActiveXObject) {
  // code for IE6, IE5
  xmlhttp=new ActiveXObject("Microsoft.XMLHTTP");
 } else {
  alert("Your browser does not support XMLHTTP!");
 }


 //Override the mime type for firefox so that it returns the 
 //result as an XMLDocument.
 if( xmlhttp.overrideMimeType ) {
  xmlhttp.overrideMimeType('application/xhtml+xml; charset=x-user-defined');
 }

 return xmlhttp;
}

function runVanillaAjax(url,functor)
{
 var xmlhttp = getXHR();
 xmlhttp.onreadystatechange=function() { functor(xmlhttp); };
 xmlhttp.open("GET",url,true);
 xmlhttp.send(null);
}

function vanillaAjaxDone( response ) {
 if(response.readyState==4) {

  //Get the xml document element for IE or firefox
  var xml;
  if ($.browser.msie) {
   xml = new ActiveXObject("Microsoft.XMLDOM");
   xml.async = false;
   xml.loadXML(response.responseText);
  } else {
   xml = response.responseXML.documentElement;
  }

  var textarea = document.getElementById('textarea');
  var bodyTag = xml.getElementsByTagName('body')[0];
  if( $.browser.msie ) {
   textarea.value = bodyTag.text;
  } else {
   textarea.value = bodyTag.textContent;
  }
 }
}

function vanillaAjax() {
 runVanillaAjax('http://jsbin.com/ulevu',vanillaAjaxDone);
}

coderjoe 2009-07-30 21:55:17


Regex to match contents of HTML body
