I don't know Regex very well, and I'm trying to get all of the script tags from some extracted page text. I've tried the following pattern:

<script.*?>.*?</script>

But this doesn't seem to return any script tag that has any code within it. That is, from the following:

<script type="text/javascript" src="Scripts/Scipt1.js"></script>
<script type="text/javascript" src="Scripts/Scipt2.js"></script>

<script type="text/javascript">
   function SomeMethod()
   {

   }
</script>

I'll only get the following results:

<script type="text/javascript" src="Scripts/Scipt1.js"></script>
<script type="text/javascript" src="Scripts/Scipt2.js"></script>

How can I return all 3? (NB. I do want to maintain the outer script tags in the results).

+2  A: 

You should use the HTML Agility Pack.

For example:

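// Requires the HtmlAgilityPack assembly and: using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml(source); // parses the HTML held in the string 'source'

// Every <script> element, whether it has a src attribute or inline code
var scripts = doc.DocumentNode.Descendants("script");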
SLaks
+1  A: 

The . does not, by default, match newlines, so you will only get single-line results.

Use RegexOptions.Singleline to fix this. It changes the meaning of . to match any character, including the newline, so you get multi-line matches too.

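Don’t be misled by the name, and don’t confuse it with RegexOptions.Multiline, which is completely different: it makes ^ and $ match at the start and end of every line rather than only at the start and end of the whole string (the IntelliSense tooltips explain both).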
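
A minimal sketch of that fix, in case it helps. It keeps the pattern from the question and only adds the option; the class and method names (ScriptTagDemo, FindScriptTags) are made up for illustration, and RegexOptions.IgnoreCase is an extra assumption on my part so that <SCRIPT> would match too. It avoids var and LINQ, so it should also compile on C# 2.0.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptTagDemo
{
    // Hypothetical helper: returns every <script ...>...</script> block,
    // outer tags included.
    public static MatchCollection FindScriptTags(string html)
    {
        // Same pattern as in the question; only the options differ.
        // Singleline: '.' also matches '\n', so a tag body may span lines.
        // IgnoreCase: an added assumption so upper-case tags match as well.
        Regex scriptTag = new Regex(
            @"<script.*?>.*?</script>",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        return scriptTag.Matches(html);
    }

    public static void Main()
    {
        // Sample input taken from the question.
        string html =
            "<script type=\"text/javascript\" src=\"Scripts/Scipt1.js\"></script>\n" +
            "<script type=\"text/javascript\" src=\"Scripts/Scipt2.js\"></script>\n" +
            "\n" +
            "<script type=\"text/javascript\">\n" +
            "   function SomeMethod()\n" +
            "   {\n" +
            "\n" +
            "   }\n" +
            "</script>\n";

        MatchCollection matches = FindScriptTags(html);
        Console.WriteLine("Matches found: " + matches.Count);

        foreach (Match m in matches)
        {
            Console.WriteLine("----");
            Console.WriteLine(m.Value);
        }
    }
}
```

On the sample from the question this should find all three tags, the multi-line one included. Without RegexOptions.Singleline the same pattern only finds the first two, because . stops at the first newline inside the third tag.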

Timwi
This actually works well, quickly and gives me exactly what I want... I don't like kittens anyway, so I don't really care that much if God kills one because I use Regex.
GenericTypeTea
@Timwi - Works as expected now, thank you.
GenericTypeTea
A: 

Depending on the quality of your HTML, you may be able to parse it as XML:

var scripts = XDocument.Parse(HTMLSTRING).Descendants("script"); // element names are case-sensitive in XML

Edit: Pre Xml.Linq version:

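XmlDocument xDoc = new XmlDocument();
xDoc.LoadXml(HTMLSTRING); // LoadXml parses a string; Load expects a file name or URL
XmlNodeList scripts = xDoc.SelectNodes("//script"); // element names are case-sensitive in XML

Note: both of those are untested.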

Robin Day
Unfortunately I'm using C# 2.0 on this project. Looks like it would have been a good solution though.
GenericTypeTea
You can still use the XmlDocument object. It's just more than one line of code.
Robin Day
Added. As I say, it's untested, but you should get the idea. The biggest problem you will have, though, is whether your HTML is valid XML or not.
Robin Day
Yeah, it seems to have issues: "There are multiple root elements." There's a lot of 3rd party crap in this project, namely Infragistics, so quality is a pretty far-fetched idea.
GenericTypeTea
Downvoted because the question is asking about *HTML*, not *XHTML*. `XDocument.Parse()` will completely fail and throw an exception for everything that isn’t XML, even when it’s valid HTML.
Timwi
@Timwi: I put a caveat at the top depending on the quality of the HTML. Also noted in comments that the HTML would have to be valid XML. It is an alternative answer showing one way of not using Regular Expressions. You may well be able to achieve this with Regex, however, it is a code smell, there will ALWAYS be a gotcha that will get you later on.
Robin Day
An XML parser will *completely fail on perfectly valid, high-quality HTML*. It won’t even output anything half-useful: it will just throw.
Timwi