I'm trying to make a regexp that will match numbers, excluding numbers that are part of other words or numbers inside certain html tags. The part for matching numbers works well but I can't figure out how to find the numbers inside the html.

Current code:

//number regexp part
var prefix = '\\b()';//for future use
var baseNumber = '((\\+|-)?([\\d,]+)(?:(\\.)(\\d+))?)';
var SIBaseUnit = 'm|kg|s|A|K|mol|cd';
var SIPrefix = 'Y|Z|E|P|T|G|M|k|h|da|d|c|m|µ|n|p|f|a|z|y';
var SIUnit = '(?:('+SIPrefix+')?('+SIBaseUnit+'))';
var generalSuffix = '(PM|AM|pm|am|in|ft)';
var suffix = '('+SIUnit+'|'+generalSuffix+')?\\b';
var number = '(' + prefix + baseNumber + suffix + ')';

//trying to make it match only when not within tags or inside excluded tags
var htmlBlackList = 'script|style|head';
var htmlStartTag = '<[^(' + htmlBlackList + ')]\\b[^>]*?>';
var reDecimal = new RegExp(htmlStartTag + '[^<]*?' + number + '[^>]*?<');
A: 

The [^...] construct is a negated character class: it only works on single characters, not on compound expressions like (script|style|head). What you want is a negative lookahead, (?!...):

var htmlStartTag = '<(?!(' + htmlBlackList + ')\\b)[^>]*?>';

(?! ... ) means 'not followed by ...' but [^ ... ] means 'a single character not in ...'.
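
To see the difference concretely, here is a small sketch that builds both patterns from the question's htmlBlackList and tries them on a few start tags. The names charClassTag, lookaheadTag and samples are made up for this illustration.

```javascript
var htmlBlackList = 'script|style|head';

// Original attempt: [^(...)] is a negated character class, so it rejects any
// tag whose first character is one of ( s c r i p t | y l e h a d ).
var charClassTag = new RegExp('<[^(' + htmlBlackList + ')]\\b[^>]*?>');

// Negative lookahead: rejects only tags that start with a whole blacklisted word.
var lookaheadTag = new RegExp('<(?!(' + htmlBlackList + ')\\b)[^>]*?>');

var samples = ['<p>', '<b>', '<script>', '<style type="text/css">', '<header>'];
var results = samples.map(function (tag) {
  return {
    tag: tag,
    charClass: charClassTag.test(tag),
    lookahead: lookaheadTag.test(tag)
  };
});
console.log(results);
```

The character-class version refuses <p> simply because 'p' occurs in "script", while the lookahead version accepts <p> and <header> and refuses <script> and <style ...>. Note that a closing tag such as </script> still matches the lookahead version, because the character right after '<' is '/'.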

+1  A: 
<script>
   var htmlFragment = "<script>alert('hi')</script>";
   var style = "<style>.foo { font-size: 14pt }</style>";
   // ...
</script>
<!-- turn off this style for now
  <style> ... </style>
 -->

Good luck getting a regular expression to figure that out.

You're using JavaScript, so I'm guessing you're probably running in a browser, which means you have access to the DOM and, through it, to the browser's very capable HTML parser. Use it.
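
A minimal sketch of what that could look like: once the browser has parsed the markup, walk the resulting tree and hand only the text nodes outside script, style and head to the number-matching code. The helper name forEachTextNode and the EXCLUDED table are made up for this example; the code relies only on the standard nodeType, nodeName and childNodes properties of DOM nodes.

```javascript
// Element names whose contents should be ignored (taken from the question's blacklist).
var EXCLUDED = { SCRIPT: true, STYLE: true, HEAD: true };

// Hypothetical helper: call visit(textNode) for every text node under `node`
// that is not inside an excluded element. Comments are skipped as well,
// because they are neither text nodes nor containers we descend into.
function forEachTextNode(node, visit) {
  var TEXT = 3, ELEMENT = 1, DOCUMENT = 9, FRAGMENT = 11;
  if (node.nodeType === TEXT) {
    visit(node);
    return;
  }
  if (node.nodeType === ELEMENT && EXCLUDED[node.nodeName.toUpperCase()]) {
    return; // do not look inside <script>, <style> or <head>
  }
  if (node.nodeType === ELEMENT || node.nodeType === DOCUMENT || node.nodeType === FRAGMENT) {
    for (var i = 0; i < node.childNodes.length; i++) {
      forEachTextNode(node.childNodes[i], visit);
    }
  }
}
```

In a page this would be called as forEachTextNode(document.body, function (t) { /* run the number regexp on t.nodeValue */ }); the regexp then never sees tags, scripts, styles or comments at all.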

derobert
I was planning to do it inside a Firefox plugin, before the HTML had been rendered, which is why I wouldn't have access to the complete HTML file: it is handed over in parts. Maybe I should rethink things if it's going to be as hard as I think it is.
Annan
I could create DOM nodes from the HTML, find the numbers and process them, then change the DOM back into HTML before passing it back. I wonder how much converting strings into DOM and back costs if it isn't rendered. Even if I managed to get a regexp to work, it would probably be just as inefficient.
Annan
No idea how long it'd take; I'd suggest benchmarking it.
derobert
A: 

I'm trying to make a regexp that will match numbers, excluding numbers that are part of other words or numbers inside certain html tags.

Regex cannot parse HTML. Do not use regex to parse HTML. Do not pass Go. Do not collect £200.

To ‘only match something not-within something else’ you would need a negative lookbehind assertion (“(?<!”), but JavaScript regexps have historically not supported lookbehind (it only arrived with ES2018), and most other regex implementations don't support the complex variable-length lookbehind you'd need to have any hope of matching a context like being inside a tag. Even if you did have variable-length lookbehind, that'd still not reliably parse HTML, because, as previously mentioned many times every day, regex cannot parse HTML.

Use an HTML parser. A browser HTML parser will be able to digest even partial input without complaining.
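
As a rough sketch of that round trip: let the parser build a tree from the chunk, modify the tree, and serialize it again. processChunk and its parameters are names invented here, and doc is assumed to be the browser's document object (it is passed in so the function has no hidden globals). One caveat: the serialized result is normalized markup, so a chunk with unclosed tags will generally not come back byte for byte.

```javascript
// Hypothetical helper: parse an HTML chunk, let `transform` edit the resulting
// tree, then turn the tree back into a string.
// Assumption: `doc` behaves like the browser's `document`.
function processChunk(html, transform, doc) {
  var container = doc.createElement('div');
  container.innerHTML = html;   // the HTML parser copes with partial or sloppy markup
  transform(container);         // e.g. rewrite numbers in text nodes only
  return container.innerHTML;   // serialized (and normalized) markup
}
```

In a browser the call would be processChunk(chunk, someTreeEditingFunction, document), where someTreeEditingFunction stands for whatever walks the tree and touches only the text nodes you care about.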

bobince