views:

905

answers:

7

I'm using this regex to find <script> tags:

<script (.|\n)*>(.|\n)*?</script>

The problem is, it matches the ENTIRE string below, not just each tag separately:

<script src="crap2.js"></script><script src="crap2.js"></script>
+10  A: 

I don't think anything else needs to be said other than http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454.

JSBangs
That is probably the best answer I've seen to any question!
Andy
That’s rather a comment than an answer.
Gumbo
Should be wiki.
Michael Myers
This is a terrible answer. Look, I'm not trying to use regex to parse *XHTML*. I'm trying to match the *string* <script ...></script>. That is perfectly within the capabilities of regex.
JamesBrownIsDead
JamesBrownIsDead, except that you need to account for case, whitespace, HTML comments, strings inside embedded scripts, `<pre>` regions... Parsing HTML is a solved problem.
Svante
Again, I'm not parsing HTML.
JamesBrownIsDead
You *are* parsing HTML. If you weren't, there wouldn't be <script> tags in it.
Carl Smotricz
+8  A: 

You really would be better off using the DOM to process HTML for this reason and all sorts of others.
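
As a rough illustration of the parser-based route, here is a minimal sketch in Python. It assumes the standard library's `html.parser` module, which is an event-based parser rather than a full DOM, and the names `ScriptCollector` and `find_scripts` are made up for this example. The sample input is also an assumption: the question's tag plus a commented-out tag and an uppercase tag.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collects the attributes and inline source of each <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []      # one dict per <script> element found
        self._current = None   # the element currently being read, if any

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names in lowercase, so <SCRIPT> works too.
        if tag == "script":
            self._current = {"attrs": dict(attrs), "source": ""}

    def handle_data(self, data):
        # Text inside a <script> element may arrive in several chunks.
        if self._current is not None:
            self._current["source"] += data

    def handle_endtag(self, tag):
        if tag == "script" and self._current is not None:
            self.scripts.append(self._current)
            self._current = None


def find_scripts(html):
    """Hypothetical helper: return a list of script descriptions."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    return parser.scripts


sample = (
    '<script src="crap2.js"></script>'
    '<!-- <script src="old.js"></script> -->'
    '<SCRIPT>if (a < b) { run(); }</SCRIPT>'
)
scripts = find_scripts(sample)
for script in scripts:
    print(script)
```

The commented-out tag is reported through the parser's comment handler rather than as an element, so it does not end up in `scripts`.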

Andy
Why did this get a downvote? +1
Daniel
I'm not processing HTML.
JamesBrownIsDead
If you're not processing HTML, why did you tag your question as HTML-related?
TrueWill
Because it's HTML-[i]related[/i].
JamesBrownIsDead
+4  A: 

Change your first * to *?

This is the non-greedy 'match all', so it will match the smallest set of characters before the next '>'.
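
To see the difference, here is a small sketch using Python's `re` module (the question does not say which regex engine is in use, so Python is an assumption here). It runs the question's pattern and the suggested lazy variant against the sample string from the question.

```python
import re

html = '<script src="crap2.js"></script><script src="crap2.js"></script>'

# Original pattern: the first quantifier is greedy.
greedy = r'<script (.|\n)*>(.|\n)*?</script>'
# Suggested fix: the first quantifier is lazy (*?).
lazy = r'<script (.|\n)*?>(.|\n)*?</script>'

greedy_matches = [m.group(0) for m in re.finditer(greedy, html)]
lazy_matches = [m.group(0) for m in re.finditer(lazy, html)]

print(len(greedy_matches))  # 1: the whole string is one match
print(len(lazy_matches))    # 2: one match per tag
```

The greedy version runs to the end of the string and backtracks only as far as the last `>` that still allows a `</script>` to follow, which is why it swallows both tags.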

TheSean
While I agree with JS Bangs' link, I'm pretty sure this will fix his problem.
Galen
If someone comes to a gunfight with a dull knife, will sharpening it fix his problem?
Svante
@Svante: yes, as long as there are no bullets :)
TheSean
@TheSean: And I guess with "bullets", you mean things like javascript strings containing '</script>'? Basically, you are *assuming* there are no bullets. But if you value your life: Run if you see a gun pointed at you!
soulmerge
A: 

Try to exclude any '<' from the content:

 <script (.|\n)*>(.|\n|[^<])*?</script>
Pierre
Even if it's technically not valid HTML, people often write code like: `<script>if(a < b) { /* code */ }</script>`
intgr
Good thing I'm not parsing code.
JamesBrownIsDead
You're not excluding `<` from the content with `(.|\n|[^<])*?`. The negated character class will never be reached when a `<` is encountered, since the `.` metacharacter already matches it. In fact, the only character that `[^<]` is going to match is `\r` (carriage return).
Bart Kiers
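
A quick check of that point, sketched with Python's `re` module (the engine is an assumption, and the sample tag is invented for the test). The pattern from the answer still matches content containing `<`, while a pattern that really excludes `<` fails on exactly the kind of inline script mentioned in the earlier comment.

```python
import re

html = '<script type="text/javascript">if (a < b) { run(); }</script>'

# Pattern from the answer: `.` comes first in the alternation and
# already matches '<', so [^<] never gets a chance to reject it.
from_answer = r'<script (.|\n)*>(.|\n|[^<])*?</script>'

# A pattern that genuinely excludes '<' from the content.
strict = r'<script [^>]*>[^<]*</script>'

answer_match = re.search(from_answer, html)
strict_match = re.search(strict, html)

print(answer_match is not None)  # True: '<' in the content is accepted
print(strict_match is None)      # True: 'a < b' makes the strict pattern fail
```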
A: 
<script[\s\S]*?>[\s\S]*?</script>

This matches the most common situations, but it's very important to consider JS Bangs' answer.
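
A short sketch of what "most common situations" can look like in practice, again assuming Python's `re` module. The first input is the string from the question; the commented-out and uppercase inputs are made up to show two cases where the pattern's result may not be what you want.

```python
import re

pattern = re.compile(r'<script[\s\S]*?>[\s\S]*?</script>')

sample = '<script src="crap2.js"></script><script src="crap2.js"></script>'
commented = '<!-- <script src="old.js"></script> -->'
upper = '<SCRIPT SRC="crap2.js"></SCRIPT>'

sample_matches = pattern.findall(sample)        # each tag separately
commented_matches = pattern.findall(commented)  # matches inside a comment too
upper_matches = pattern.findall(upper)          # case-sensitive: no match

# Compiling with re.IGNORECASE handles the uppercase form.
insensitive = re.compile(pattern.pattern, re.IGNORECASE)
upper_insensitive = insensitive.findall(upper)

print(sample_matches)
print(commented_matches)
print(upper_matches)
print(upper_insensitive)
```

The regex has no notion of HTML comments, so it reports the commented-out tag as if it were live markup.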

Rubens Farias
+2  A: 

I'll keep posting links to my previous answers until this question type has been wiped from this planet's surface (hopefully in 10 years or so): Don't use regular expressions for irregular languages like HTML or XML. Use a parser instead.

soulmerge
I'm not parsing a language.
JamesBrownIsDead
Any regular expression you create will match a closing script tag in your javascript, for example, so: Yes, you *are* parsing a language.
soulmerge
Another approach: You are parsing XML, which *is* a language. (or a sub-set of XML - XML documents must have a single root node, which your string doesn't)
soulmerge
+7  A: 

Also see this week's Coding Horror: Parsing Html The Cthulhu Way, inspired by the epic answer by @bobince that @JS Bangs links to.

Bill Karwin
+1: you beat me to it!
Steve Folly