ansaurus

Question

Answer 1

+4 A:

Parsing HTML with regexes is generally considered a Very Bad idea, since HTML does not have a regular grammar. See the list of links to explanations (some from SO) here.

You should instead use a dedicated HTML library, such as this.

Alicia 2010-04-03 15:32:24

Answer 2

+2 A:

To parse HTML with Ruby, use Nokogiri or hpricot.

macek 2010-04-03 15:37:37

I'd definitely use hpricot; it's really easy to use. There is good documentation in the README here: http://github.com/whymirror/hpricot

Jamie 2010-04-03 16:27:19

And I'd definitely use Nokogiri because it was able to handle malformed XML that hpricot puked on. :-) http://nokogiri.org/

Greg 2010-04-04 00:50:04

@Jamie, of the two, I'd recommend Nokogiri, too.

macek 2010-04-04 01:12:18

Answer 3

+2 A:

I didn't read the whole code you posted since it burned my eyes.

<span>.*</span>

This regex matches <span>hello</span> correctly, but it fails on <span>hello</span><span>there</span> and matches the whole string. Remember that the * operator is greedy, so it will match the longest string possible. Making it non-greedy with .*? should make it work.
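
The difference between the greedy and non-greedy forms can be checked with a small sketch like the one below. The sample string and the helper names greedy_contents and lazy_contents are made up for illustration.

```ruby
# Made-up input: two adjacent spans on a single line.
TWO_SPANS = "<span>hello</span><span>there</span>"

# Greedy: .* runs on to the LAST </span> on the line.
def greedy_contents(html)
  html.scan(/<span>(.*)<\/span>/).flatten
end

# Non-greedy: .*? stops at the FIRST </span> it reaches.
def lazy_contents(html)
  html.scan(/<span>(.*?)<\/span>/).flatten
end

p greedy_contents(TWO_SPANS)  # => ["hello</span><span>there"]
p lazy_contents(TWO_SPANS)    # => ["hello", "there"]
```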

However, it's not wise to use regular expressions to parse HTML code.

1- You can't always parse HTML with regex. HTML is not regular.

2- It's very hard to write or maintain regex.

3- It's easy to break the regex by using an input like <a href=""></a>.
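
To illustrate the points above, here is a small sketch of two inputs that defeat even the non-greedy pattern. Both inputs and the helper name first_span_contents are made up for illustration.

```ruby
# Hypothetical helper: contents of the first <span>, using the non-greedy pattern.
def first_span_contents(html)
  html[/<span>(.*?)<\/span>/, 1]
end

# Nested spans: the match ends at the inner </span>, cutting the outer span short.
p first_span_contents("<span>a<span>b</span>c</span>")  # => "a<span>b"

# Any attribute on the tag: the literal "<span>" never appears, so nothing matches.
p first_span_contents('<span class="x">hi</span>')      # => nil
```

Matching nested tags correctly means keeping track of how many are open, which is the part a plain regular expression cannot do.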

tiftik 2010-04-03 15:48:44

Answer 4

+1 A:

(It doesn't appear that the sample HTML you posted actually has any examples of the pattern you're trying to match.)

Alicia is correct that using regex against HTML is generally a bad idea, and it will break down as your requirements become more complex.

That said, your example is pretty simple.

doc.scan(/<span dir=ltr>(.*)<\/span/) do |match|
  puts match
end

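As mentioned, .* is greedy (and I expected to have to account for that), but I was able to match several of these patterns in a single document without one match swallowing the rest. That is most likely because . does not match newlines by default, so each match stays on its own line; scan itself does not change the greedy behavior, and two spans on the same line would still be merged into one match.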
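
A quick way to check this is to run the same pattern over spans on one line and over spans on separate lines. The sketch below uses made-up sample strings and a hypothetical helper name, scan_ltr.

```ruby
# Same pattern as above, wrapped in a hypothetical helper.
def scan_ltr(doc)
  doc.scan(/<span dir=ltr>(.*)<\/span/).flatten
end

# Made-up inputs: the same two spans, on one line and on separate lines.
one_line   = "<span dir=ltr>a</span><span dir=ltr>b</span>"
multi_line = "<span dir=ltr>a</span>\n<span dir=ltr>b</span>"

p scan_ltr(one_line)    # => ["a</span><span dir=ltr>b"]
p scan_ltr(multi_line)  # => ["a", "b"]
```

On the single-line input the greedy .* merges both spans into one match; on the multi-line input the newline stops it, which looks like non-greedy behavior but is not.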

Mike Cargal 2010-04-03 15:49:16
