tags:

views:

237

answers:

1

Hello everyone. I began studying Java Regular Expression recently and I found some really intersting task.For example,I now need to dig out "Product Name","Product Description" and "Sellers for this product" out of the following HTML code.(I am sorry for the big chunck of code,but it is very straightforward)

<td class="sr-check">
<input type="checkbox" name="cptitle" value="678560038" /></td>
<td class="sr-image" style="width: 80px;"><a href="/Nikon-D300S-12-3-678560038/prices-html"     class="strictRule" rel="nofollow"><img src="http://img01.static-nextag.com/image/Nikon-D300S-12-3-MP-Digital-SLR-Camera-Body-Black/0/000/006/789/461/678946110.jpg" alt="Nikon D300S 12.3 MP Digital SLR Camera Body - Black" class="imageLink strictRule" height="75" width="75" id="opILink_0" title="Nikon Digital Cameras - Nikon D300S 12.3 MP Digital SLR Camera Body - Black" /></a><div class="breaker">&nbsp;</div></td>
<td class="sr-info">
<div class="sr-info">
<a id="opPNLink_0" class="underline" style="font-size:16px" href="/Nikon-D300S-12-3-678560038  /prices-html" >Nikon D300S 12.3 MP <b>Digital</b> SLR <b>Camera</b> Body - Black</a> <div class="sr-subinfo">
<div class="sr-info-description">SLR - 13.1MP, 12.3MP - 1x Optical Zoom - CompactFlash, SD/MMC Memory Card - 3in.</div>
<div class="rating">
<img src="http://img01.static-nextag.com/imagefiles/stars/stars4_10px.gif" alt="4/5 stars" title="4/5 stars" /> (92 user ratings)</div>
<div style="clear: both;">
<!-- nxtginc=nextag.api.ServerInclude$JSPIncludeWriter(/buyer/ATLSSI.jsp?ptid=678560038&dts=y) -->
<a id="_atl_0" style="" href="http://www.nextag.com/serv/main/buyer/MyPDir.jsp?list=_transCookieList&amp;amp;cmd=add&amp;amp;ptitle=678560038" rel="nofollow">+ Add to Shopping List</a> &nbsp;|&nbsp; 
<!-- endnxtginc -->
<a rel="nofollow" id="mltLink_0" class="mlt-link" href="/Digital-Cameras--zz500001z2z678560038zB2dgz5---html">See More Like This</a>
</div>
<div id="fsLink_0" class="featuredSeller">
<a rel="nofollow" class="featuredSeller" id="opFSLink_0_0" href="/norob/PtitleSeller.jsp?chnl=main&amp;tag=785646073amp;ctx=x%2BN%2Fs9zy56l4u8RXCzALE1jeLesDMzeK09rPQEdK3Yjx395ZzX9cMh9N5JAxjk7xPqF9hjk2ztM5IRXU5nspLubIXYaVzI%2B%2Fg7h1Qz58TzgvrWuNawV8qEIqqSmClArWMq6mpzNRuSlgg2xCXYObNnaIH00iKSUmBawDRvecwbCpAxhXgXoLEiEinTwr3EipComdzxL9UHFYTLoWUToUB5SRSsolQmEJ3mgnnvu83%2FC8W34TGpN9mJo%2BnyAeTkt4&amp;ptitle=678560038"  target="_blank" >Thundercameras</a>:$1,289 &nbsp;
<a rel="nofollow" class="featuredSeller" id="opFSLink_0_1" href="/norob/PtitleSeller.jsp?chnl=main&amp;tag=797076595&amp;ctx=x%2BN%2Fs9zy56l4u8RXCzALE1jeLesDMzeK09rPQEdK3Yjx395ZzX9cMh9N5JAxjk7xPqF9hjk2ztM5IRXU5nspLubIXYaVzI%2B%2Fg7h1Qz58TzgvrWuNawV8qEIqqSmClArWMq6mpzNRuSlgg2xCXYObNrcWLhL%2BhryuAGhXNhYSPE%2BpAxhXgXoLEiEinTwr3EipComdzxL9UHFYTLoWUToUB5SRSsolQmEJ3mgnnvu83%2FC8W34TGpN9mJo%2BnyAeTkt4&amp;ptitle=678560038"  target="_blank" >PhotoVideoSuperStore</a>:$1,269 &nbsp;
<a rel="nofollow" class="featuredSeller" id="opFSLink_0_2" href="/norob/PtitleSeller.jsp?chnl=main&amp;tag=803555293&amp;ctx=x%2BN%2Fs9zy56l4u8RXCzALE1jeLesDMzeK09rPQEdK3Yjx395ZzX9cMh9N5JAxjk7xPqF9hjk2ztM5IRXU5nspLubIXYaVzI%2B%2Fg7h1Qz58TzgvrWuNawV8qEIqqSmClArWMq6mpzNRuSlgg2xCXYObNt06qcvLJ5UQz7S3zKd4urWpAxhXgXoLEiEinTwr3EipComdzxL9UHFYTLoWUToUB5SRSsolQmEJ3mgnnvu83%2FC8W34TGpN9mJo%2BnyAeTkt4&amp;ptitle=678560038"  target="_blank" >Digitalelect</a>:$1,279 &nbsp;</div>

I would think of :

(1) digging out the product name from <td class="sr-image > tag,and using regular expression

exp ="<td><span\\s+class=\"sr-image\"[^>]*>"
          + ".*?</span><a href=\""
          + "([^\"]+)"      
          + "\"[^>]*>"      
          + "([^<]+)" + "</a>.*?</td>";

(2) digging out the product info from the <div class="sr-info-description"> tag.

exp = "<div class="sr-info-description"> [^>]*>"

(3) digging out the Sellers' names from <div id="fsLink_0" class="featuredSeller"> tag.

exp = "<div id="fslink_0" class="featuredSeller[^>]*>"
          + ".*?</span><a rel=\""
          + "([^\"]+)"      
          + "\"[^>]*>"      
          + "([^<]+)" + "</a>.*?</td>";

I am just beginning learing using Java Regular Expression,I would be grateful if you could correct me if I am in the wrong track or my regular expressiona are wrong. Thanks a lot,guys.

+1  A: 

As noted, you should use a parser to interpret the html input.

But I want to answer the question for a regex to extract the product info from a text line like

<div class="sr-info-description">SLR - 13.1MP, 12.3MP - 1x Optical Zoom - CompactFlash, SD/MMC Memory Card - 3in.</div>

Assuming that it's all one line and contains no tags by itself (in which case you absolutely need to use an html parser), the regular expression should look like

<div class="sr-info-description">([^<]*)</div>

Construct a Matcher for your expression, find() it in your input, and then group(1) contains the text within the div tag (while group(0) contains the matched region including the div tag).

Christian Semrau