I am doing some web scraping and am looking for div elements with particular class names and markup.

My objective is to extract everything inside the div that has the class s_specs_box s_box_4.

Could someone please provide a regular expression in .NET terms (i.e., one that can be passed straight into the Regex constructor) that matches one such div (sample given below)?

<div class=\"s_specs_box s_box_4\"><h3>Display</h3><ul><li><strong><span class='s_tooltip_anchor'>Display:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Display</b> - Phone's main display</p></span></strong><ul>\n<li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Type:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Type</b> - Refers to the type of the display. There are four major display types: Greyscale, Black&White, LCD:STN-color and LCD:TFT-color</p></span></strong><ul><li>Color</li></ul>\n</li><li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Technology:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Technology</b> - Refers to the type of the color displays. There are five major types: LCD, TFT, TFD, STN and OLED</p></span></strong><ul><li>Super AMOLED</li></ul>\n</li><li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Size:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Size</b> - Refers to the width and the height of the display</p></span></strong><ul><li><span title='Big display' class=\"s_display_rating s_size_1 s_mr_5\"><span></span></span>480 x 800 pixels</li></ul>\n</li><li class='clear clearfix'><strong>Physical Size:</strong><ul><li>4.00 inches</li></ul>\n</li><li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Colors:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Colors</b> - Shows the number of colors that the display supports</p></span></strong><ul><li>16 777 216</li></ul>\n</li><li class='clear clearfix'><strong>Touch Screen:</strong><ul>\n<li class='clear clearfix'><strong>Type:</strong><ul><li>Capacitive</li></ul>\n</li>\n</ul></li><li class='clear clearfix'><strong>Multi-touch:</strong><ul><li>Yes</li></ul>\n</li><li class='clear clearfix'><strong>Proximity Sensor:</strong><ul><li>Yes</li></ul>\n</li><li class='clear clearfix'><strong>Light sensor:</strong><ul><li>Yes</li></ul>\n</li>\n</ul></li></ul>\n</div>

Thanks in advance,

Vijay

+4  A: 

You cannot parse HTML using regular expressions.

Instead, you should use the HTML Agility Pack in C# or jQuery in JavaScript.

For example:

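// Needs `using System.Linq;` and the HtmlAgilityPack package; pageSource is assumed to hold the page's HTML.
var document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(pageSource);
var html = document.DocumentNode.Descendants("div")
    .First(div => div.GetAttributeValue("class", null) == "s_specs_box s_box_4")
    .InnerHtml;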
SLaks
Really, you *can* parse HTML with regular expressions, but it is not recommended, since your RegEx will fail, and fail often...
davisoa
Obligatory Jamie Zawinski quote: "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."
+1  A: 

OK, if nobody else wants to link this outright for a better description, I will... (although @SLaks really helped you out better than this could)

http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

drachenstern
That's a decent posting, although there are several errors in the comments. It all comes down to "Even if you can/could, you almost never should." I have to parse tens of thousands of HTML pages all the time. And even though I actually *can* parse them with regexes, I assure you I make no such silly attempt. I use HTML::TreeBuilder, and am quite happy with it.
tchrist
Given the need to parse such documents, I would almost surely brute force it, as I've seen too many odd cases, such as `<ul><li>text<li>text</ul>` to worry about matching braces. The best I can assume is that there will be information I will want, and sites like weather.com (to randomly pick one that many people will be familiar with) will be consistently formatted so I can look for key phrases or IDs. Otherwise, I really have no business parsing HTML in the first place so I don't.
drachenstern
A: 

This works for your provided sample data:

string subject = "<div class=\"s_specs_box s_box_4\"><h3>Display</h3><ul><li><strong><span class='s_tooltip_anchor'>Display:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Display</b> - Phone's main display</p></span></strong><ul>\n<li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Type:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Type</b> - Refers to the type of the display. There are four major display types: Greyscale, Black&White, LCD:STN-color and LCD:TFT-color</p></span></strong><ul><li>Color</li></ul>\n</li><li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Technology:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Technology</b> - Refers to the type of the color displays. There are five major types: LCD, TFT, TFD, STN and OLED</p></span></strong><ul><li>Super AMOLED</li></ul>\n</li><li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Size:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Size</b> - Refers to the width and the height of the display</p></span></strong><ul><li><span title='Big display' class=\"s_display_rating s_size_1 s_mr_5\"><span></span></span>480 x 800 pixels</li></ul>\n</li><li class='clear clearfix'><strong>Physical Size:</strong><ul><li>4.00 inches</li></ul>\n</li><li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Colors:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Colors</b> - Shows the number of colors that the display supports</p></span></strong><ul><li>16 777 216</li></ul>\n</li><li class='clear clearfix'><strong>Touch Screen:</strong><ul>\n<li class='clear clearfix'><strong>Type:</strong><ul><li>Capacitive</li></ul>\n</li>\n</ul></li><li class='clear clearfix'><strong>Multi-touch:</strong><ul><li>Yes</li></ul>\n</li><li class='clear clearfix'><strong>Proximity Sensor:</strong><ul><li>Yes</li></ul>\n</li><li class='clear clearfix'><strong>Light sensor:</strong><ul><li>Yes</li></ul>\n</li>\n</ul></li></ul>\n</div>";
Match match = Regex.Match(subject,
    @"<div[^>]+class\s*=\s*""s_specs_box s_box_4""[^>]*>(.*?)<\s*/\s*div\s*>",
    RegexOptions.Singleline);
Console.WriteLine(match.Success);
string result = match.Groups[1].Value;
Console.WriteLine(result);

Disclaimer 1: Don't parse HTML with regex. It is particularly bad at matching nested tags of the same type. If, for example, your main <div> had a <div> child, my code would almost certainly not yield the results you desire. This is not the only problem with using regex to parse HTML, just the first of many.
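
To make the nested-<div> problem from Disclaimer 1 concrete, here is a minimal sketch that runs the same pattern against a made-up input. The string nested is hypothetical and not from the question's page, and the snippet assumes a compiler that accepts top-level statements (C# 9 or later); otherwise wrap it in a Main method.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input: the target <div> now contains a nested <div>.
string nested = "<div class=\"s_specs_box s_box_4\"><div>inner</div>tail</div>";

// Same pattern as in the answer above.
Match m = Regex.Match(nested,
    @"<div[^>]+class\s*=\s*""s_specs_box s_box_4""[^>]*>(.*?)<\s*/\s*div\s*>",
    RegexOptions.Singleline);

// The lazy (.*?) stops at the first </div>, which closes the inner <div>,
// so the capture is cut short and "tail" is lost.
string captured = m.Groups[1].Value;
Console.WriteLine(captured); // prints: <div>inner
```

The match still reports success, so nothing signals that the result is truncated. That is why the output has to be checked by hand whenever this approach is used.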

Disclaimer 2: Don't use regex to parse HTML in production code or with unknown, future inputs. It's kind of OK if you are just going to use it to batch-transform a few dozen HTML files on your hard drive and you are going to verify the results manually. It's not OK to trust it for new, unknown inputs.

Mike Clark