I have html page source code with img tags like

<p>xyz </p>< img ....... 1 . gif >........<p>xyz</p>
           < img ........ 2 . jpg >..............<p>xyz</p>    
           < img ........ 3 . jpg ><p>xyz</p>
           < img ....... 4 . gif >......<span>xyz</span>

Img tags can contain both JPG and other image formats and can appear in any order in the page source. Now I want a .NET regular expression that gives me the first img tag with a JPG image, like

< img ... 2. jpg >

or the first img tag that is not a GIF image. Basically, I want my regular expression to skip the smiley GIF images.

Please suggest a regular expression.

+1  A: 

Do not parse HTML with RegEx. See here for compelling reasons.

HTML is not a regular language and as such not suitable for parsing with a regular expression.

Use the HTML Agility Pack to parse HTML. It exposes the parsed HTML similarly to XmlDocument and can be queried using XPath.

Oded
This is not parsing. This is searching in text which is fundamentally different.
Stilgar
Hi Oded, thanks for your reply. But I already have .NET code with the input string (the source code), and I am using something like `string pattern = @"<img.*\.(bmp|JPG|jpg|jpeg|jpe|png|tif|tiff).*>"; System.Text.RegularExpressions.Match m = System.Text.RegularExpressions.Regex.Match(input, pattern, System.Text.RegularExpressions.RegexOptions.IgnoreCase | System.Text.RegularExpressions.RegexOptions.Multiline);`
Raj
@Stilgar - this is HTML, and may be quite variable. The HTML Agility Pack will be able to deal with that better than any regex.
Oded
Agreed, but is it possible to get the first jpg img tag using a regular expression? Thanks, Raj
Raj
@Raj - possible does not mean recommended. It is possible to write a very complex GUI app using nothing but notepad and the command line compiler. I wouldn't recommend it though.
Oded
Hi Oded, the code I am referring to is from blogengine 1.6, and there is a bug in it: it picks smileys rather than the first jpg (in a blog post) if a smiley comes before the jpg. They used the regular expression `string pattern = @"<img(.|\n)+?>";`. I just want to change it so that it only picks the first jpg in the blog post. The only other way is to replace/remove all gifs from the input source code. I hope you understand my problem.
Raj
@Oded you make that claim based on what? If you ask me, this is not HTML; this is a string and we're looking for a substring. Regex is fine in this case. A string becomes HTML once you build a tree structure. There is NOTHING in this issue that requires or will even benefit from using the tree structure. Of course, if you need to use the tree structure for anything, Regex will fail you because, as everybody here knows, even Jon Skeet cannot parse HTML with regular expressions.
Stilgar
@Stilgar - first line of the question: `I have html page source code with img tags`. He is posting a fragment of his HTML page. He is **not** simply looking through a bit of text.
Oded
@Oded surely he is since he does not need the tree structure. Without the tree structure HTML is just another string.
Stilgar
@Stilgar - from other comments, this is a blog engine, and the issue is with images on blog entries. This will be unstructured HTML, if anything, so IMHO, unless there are certain constraints on text entry (which were not mentioned), a RegEx is not suitable.
Oded
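To make the parser-based suggestion above concrete, here is a minimal sketch. It uses Python's standard-library `html.parser` purely as a stand-in, so it does not show the HTML Agility Pack API; the class name `FirstNonGifImage` and the helper `first_non_gif_src` are hypothetical names chosen for illustration, and it assumes the image location is in the `src` attribute.

```python
from html.parser import HTMLParser


class FirstNonGifImage(HTMLParser):
    """Remembers the src of the first <img> whose src does not end in .gif."""

    def __init__(self):
        super().__init__()
        self.first_src = None

    def handle_starttag(self, tag, attrs):
        # Ignore non-img tags, and stop looking once a match has been found.
        if tag != "img" or self.first_src is not None:
            return
        src = dict(attrs).get("src") or ""
        if src and not src.lower().endswith(".gif"):
            self.first_src = src


def first_non_gif_src(html):
    """Return the src of the first non-GIF image, or None if there is none."""
    parser = FirstNonGifImage()
    parser.feed(html)
    return parser.first_src


# Made-up sample input modeled on the question.
sample = '<p>xyz</p><img src="1.gif"><p>xyz</p><img src="2.jpg"><img src="3.jpg">'
print(first_non_gif_src(sample))
```

The same idea should carry over to any HTML parser: walk the img elements in document order and take the first one whose source is not a GIF.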
A: 
<.*img[^>]*\.[^>]*jpg[^>]*>
onof
It gives me the wrong result: the match starts at the img tag with the gif image, like < img ....... 1 . gif >........<p>xyz</p> < img ........ 2 . jpg >
Raj
Sorry, the right one is: `<[^>]*img[^>]*\.[^>]*jpg[^>]*>`
onof
Thanks, it's working. Can you please tell me why the expression above does not include the gif image tag, which is the first <img, if it starts parsing from the start of the string?
Raj
[^>] stands for any character other than ">", while the wrong expression I gave you has "." which stands for any character. So the wrong one matches the whole first img tag and the second: it "can't stop on the close tag".
onof
I was making the same mistake :( Anyway, thanks once again.
Raj
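The difference between the two patterns discussed in the comments above can be sketched as follows. Python's `re` module is used here only for illustration; both patterns are copied from this answer, and the sample input and its file names are made up.

```python
import re

# Made-up sample input modeled on the question.
html = '<p>xyz</p><img src="1.gif">...<p>xyz</p><img src="2.jpg">...<img src="3.jpg">'

# First attempt: ".*" can cross ">" characters, so the match may span many tags.
greedy = re.search(r'<.*img[^>]*\.[^>]*jpg[^>]*>', html, re.IGNORECASE)

# Corrected pattern: "[^>]*" cannot cross a ">", so the match stays inside one tag.
bounded = re.search(r'<[^>]*img[^>]*\.[^>]*jpg[^>]*>', html, re.IGNORECASE)

print(greedy.group(0))   # starts at the very first "<" in the input
print(bounded.group(0))  # a single img tag
```

With this input the first pattern swallows everything from the opening `<p>` onward, while the corrected one returns only the first jpg tag.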
A: 

Using regular expressions for parsing or modifying HTML documents is frowned upon. For a one shot operation, you could use

<img\s+[^>]*2\.jpg[^>]*>(</img>)?

to identify image tags containing "2.jpg". If you want to do this more than once, you'd do yourself a favor by using an HTML parser like the HTML Agility Pack. Parsers are much less fragile when confronted with real-world HTML code.

Jens
I can't use a hard-coded image name like 2, and if I don't use one then it includes the gif smileys as well.
Raj
@Raj: How exactly can you identify the image tags you need to find? Is it digit.jpg or anything.gif?
Jens
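If the file name cannot be hard-coded, one possible variation is to reject tags that contain ".gif" instead of naming the wanted image. The pattern below is a sketch, not something proposed in this thread: it relies on a negative lookahead, which .NET regular expressions also support, and it is shown with Python's `re` module and made-up sample input.

```python
import re

# Hypothetical pattern: an img tag whose contents, up to the closing ">",
# do not include ".gif". The lookahead is confined to the tag by "[^>]*".
NON_GIF_IMG = re.compile(r'<img\b(?![^>]*\.gif)[^>]*>', re.IGNORECASE)

# Made-up sample input modeled on the question.
html = '<p>xyz</p><img src="1.gif">...<img src="2.jpg">...<img src="4.gif">'

first = NON_GIF_IMG.search(html)
print(first.group(0) if first else None)
```

Like the other patterns here, this assumes no ">" appears inside an attribute value and would also reject a tag that merely mentions ".gif" in another attribute.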
A: 

If the HTML is valid XHTML, you can also use XPath or XSLT.

The XPath should look like this (sorry, not tested):

//img[not(fn:ends-with(@src, ".gif"))]
codymanix
A: 

How about jQuery?

It makes it easy to find parts of the HTML DOM and change them: $('img[src$=".gif"]').hide();