I have a small problem with a simple tokenizer regex:

def test_tokenizer_regex_limit
   string = '<p>a</p>' * 400
   tokens = string.scan(/(<\s*tag:.*?\/?>)|((?:[^<]|\<(?!\s*tag:.*?\/?>))+)/)
end

Basically it runs through the text and gets pairs of [ matched_tag , other_text ]. Here's an example: http://rubular.com/r/f88JBjfzFh

It works fine for smaller inputs. If you run it under Ruby 1.8.7 it blows up; 1.9.2 works fine.

Any ideas on how to simplify or improve this? My regex-fu is weak.

A: 

This is a bit simpler, but not by much:

(<[^<]*:[^<]*>)|((?:[^<]|<[^:]*>)+)

(<.*?>|[^<>]+)

tinifni
Kinda, but not really. I need two capture groups, like so: (<.*?>)|([^<>]+), and it's almost there. But it will match '<not_tag>' into the first group. Only tags of the form <tag:something> should go into the first group; everything else should be in the second capture group.
Grocery
I updated my solution, but it's not much better than what you already have. I hope you find your answer!
tinifni
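
For the requirement discussed in the comments above (only <tag:...> tags in the first slot, everything else in the second), one possible sketch is to split on the tag pattern instead of scanning with a repeated alternation group. The names TAG_SPLIT, TAG_ONLY and tokenize are hypothetical, not from the original post. It also assumes [^>]* is an acceptable stand-in for the original .*?\/?> (unlike ., it also matches newlines). Whether this avoids the 1.8.7 failure is untested here; the idea is only that the pattern no longer contains a repeated group with a lookahead inside it.

```ruby
# Hypothetical names; a sketch, not the original poster's method.
# The capturing group makes String#split keep the matched tags in the result.
TAG_SPLIT = /(<\s*tag:[^>]*>)/
TAG_ONLY  = /\A<\s*tag:[^>]*>\z/

def tokenize(string)
  pieces = string.split(TAG_SPLIT).reject(&:empty?)
  # Mimic the shape of the original scan: [matched_tag, other_text],
  # with nil in whichever slot does not apply.
  pieces.map do |piece|
    piece =~ TAG_ONLY ? [piece, nil] : [nil, piece]
  end
end
```

Usage: tokenize('<p>a</p>' * 400) returns a single [nil, text] pair, because <p> and </p> are not <tag:...> tags and so stay in the "other text" slot.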
+1  A: 

This has been answered before and that answer currently has 3,891 votes. It's better than anything I could possibly type here...

RegEx vs HTML.

DigitalRoss
That is such a great response. Thanks for reminding me about it.
Greg