ansaurus

Question

How can I extract a script tag from some text using Regex?

Answer 1

A:

Look at the accepted answer here:

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

Also the 2nd answer on that page.

Also: http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not

user9876 2010-08-12 12:57:10

Answer 2

+2 A:

You should use the HTML Agility Pack.

For example:

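var doc = new HtmlDocument();
// HtmlDocument has no Parse method; LoadHtml parses an HTML string.
doc.LoadHtml(source);

// Enumerates every <script> element anywhere in the document.
var scripts = doc.DocumentNode.Descendants("script");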

SLaks 2010-08-12 12:57:42

Answer 3

+1 A:

The . does not, by default, match newlines, so you will only get single-line results.

Use RegexOptions.Singleline to fix this. It changes the meaning of . to match any character, including the newline, so you get multi-line matches too.

Don’t get confused by the name. Also don’t confuse it with RegexOptions.Multiline, which is completely different (read the IntelliSense tooltips to find out).

Timwi 2010-08-12 12:57:59
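
A minimal sketch of the Singleline approach described in the answer above. The pattern, the class name ScriptExtractor, and the method name ExtractScriptBodies are assumptions made for illustration; the original question's regex is not shown in this thread. The code sticks to C# 2.0 syntax, since the asker mentions that constraint elsewhere in the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptExtractor
{
    // Hypothetical pattern (not from the original question): lazily captures
    // everything between an opening <script ...> tag and the next </script>.
    private const string Pattern = @"<script\b[^>]*>(.*?)</script\s*>";

    // Returns the body of each script element found in the text.
    // Pass RegexOptions.Singleline so that "." also matches newlines.
    public static string[] ExtractScriptBodies(string html, RegexOptions extraOptions)
    {
        MatchCollection matches =
            Regex.Matches(html, Pattern, RegexOptions.IgnoreCase | extraOptions);

        string[] bodies = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            bodies[i] = matches[i].Groups[1].Value;
        }
        return bodies;
    }
}
```

Called with RegexOptions.None, this finds nothing in a script block that spans several lines, because "." stops at the first newline. Called with RegexOptions.Singleline, the same block is matched in full. RegexOptions.Multiline would not help here: it only changes where ^ and $ match. Like any regex over HTML, the sketch can still be fooled, for example by a literal "</script>" inside a JavaScript string.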

This actually works well, quickly and gives me exactly what I want... I don't like kittens anyway, so I don't really care that much if God kills one because I use Regex.

GenericTypeTea 2010-08-12 13:27:37

@Timwi - Works as expected now, thank you.

GenericTypeTea 2010-08-12 13:45:23

Answer 4

A:

Depending on the quality of your HTML, you may be able to parse it as XML:

var scripts = XDocument.Parse(HTMLSTRING).Descendants("SCRIPT");

Edit: Pre Xml.Linq version:

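XmlDocument xDoc = new XmlDocument();
// LoadXml parses a string; Load expects a file path, URL, or stream.
xDoc.LoadXml(HTMLSTRING);
// XML names are case-sensitive, so this only finds upper-case SCRIPT elements.
XmlNodeList scripts = xDoc.SelectNodes("//*/SCRIPT");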

Note: both of those are untested.

Robin Day 2010-08-12 13:01:04

Unfortunately I'm using C# 2.0 on this project. Looks like it would have been a good solution though.

GenericTypeTea 2010-08-12 13:04:20

You can still use the XmlDocument object. It's just more than one line of code.

Robin Day 2010-08-12 13:05:28

Added. As I say, it's untested, but you should get the idea. The biggest problem you will have is whether your HTML is valid XML or not.

Robin Day 2010-08-12 13:08:55

Yeah, it seems to have issues: "There are multiple root elements." There's a lot of 3rd party crap in this project, namely Infragistics, so quality is a pretty far-fetched idea.

GenericTypeTea 2010-08-12 13:12:48

Downvoted because the question is asking about *HTML*, not *XHTML*. `XDocument.Parse()` will completely fail and throw an exception for everything that isn’t XML, even when it’s valid HTML.

Timwi 2010-08-12 13:25:05

@Timwi: I put a caveat at the top about it depending on the quality of the HTML, and also noted in the comments that the HTML would have to be valid XML. It is an alternative answer showing one way of not using regular expressions. You may well be able to achieve this with Regex, but it is a code smell: there will ALWAYS be a gotcha that will get you later on.

Robin Day 2010-08-12 13:37:21

An XML parser will *completely fail on perfectly valid, high-quality HTML*. It won’t even output anything half-useful: it will just throw.

Timwi 2010-08-12 13:55:55
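
The failure mode discussed in these comments can be sketched as follows. The class name XmlVersusHtml and the method name ParsesAsXml are hypothetical helpers, not part of any answer above; the sketch only relies on XmlDocument.LoadXml throwing XmlException for markup that is not well-formed XML.

```csharp
using System;
using System.Xml;

public static class XmlVersusHtml
{
    // Returns true if the markup loads as XML, false if the parser throws.
    public static bool ParsesAsXml(string markup)
    {
        try
        {
            XmlDocument doc = new XmlDocument();
            doc.LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            // Raised for anything that is not well-formed XML, such as
            // unclosed tags or more than one root element.
            return false;
        }
    }
}
```

Under these assumptions, "<p>line one<br>line two</p>" is rejected because br is never closed, and "<p>a</p><p>b</p>" is rejected because it has multiple root elements, which matches the error quoted above. The XHTML-style "<p>line one<br/>line two</p>" loads without complaint.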
