tags:

views:

1553

answers:

3

I am looking for a regular expression that can extract the src attribute (case-insensitively) from the following HTML snippets in Java.

<html><img src="kk.gif" alt="text"/></html>
<html><img src='kk.gif' alt="text"/></html>
<html><img src = "kk.gif" alt="text"/></html>
+4  A: 

This question comes up a lot here.

Regular expressions are a bad way of handling this problem. Do yourself a favour and use an HTML parser of some kind.

Regexes are flaky for parsing HTML. You'll end up with a complicated expression that behaves unexpectedly in corner cases, and those corner cases will turn up sooner or later.

Edit: If your HTML is that simple then:

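import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Group 1 is the optional opening quote, group 2 the attribute value.
Pattern p = Pattern.compile("src\\s*=\\s*([\"'])?([^ \"']*)");
Matcher m = p.matcher(str);
if (m.find()) {
  String src = m.group(2);
}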
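
As a rough check, a pattern of this shape can be run against the three snippets from the question. This is only a sketch: the class name `SrcRegexDemo` and the helper `extractSrc` are made up for illustration, the quotes inside the Java string are escaped as `\"`, and `Pattern.CASE_INSENSITIVE` is added as an assumption to cover the case-insensitivity the question asks for.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SrcRegexDemo {
    // Optional opening quote in group 1, attribute value in group 2.
    // CASE_INSENSITIVE is an addition for illustration, so SRC= also matches.
    static final Pattern SRC = Pattern.compile(
            "src\\s*=\\s*([\"'])?([^ \"']*)", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the first src value, or null if none is found.
    static String extractSrc(String html) {
        Matcher m = SRC.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String[] snippets = {
            "<html><img src=\"kk.gif\" alt=\"text\"/></html>",
            "<html><img src='kk.gif' alt=\"text\"/></html>",
            "<html><img src = \"kk.gif\" alt=\"text\"/></html>"
        };
        for (String s : snippets) {
            System.out.println(extractSrc(s));
        }
    }
}
```

Note that the pattern looks for `src` anywhere in the input, not only inside an `img` tag, so it is only suitable for snippets as simple as the ones shown.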

And there are any number of Java HTML parsers out there.

cletus
even xpath would be better for this *sigh*
annakata
Saying that without linking to a parser is not really useful.
wds
I agree, but I only have a small snippet in each data element and I process the elements in a loop, so I am not sure whether loading a parser to get the value will be viable from a performance point of view.
Krishna Kumar
@wds, saying _that_ without linking to a parser is also not useful ;). here is a list of open source java parsers: http://java-source.net/open-source/html-parsers
akf
A: 

You mean the src attribute of the img tag? In that case you can go with the following:

<[Ii][Mm][Gg]\\s*([Ss][Rr][Cc]\\s*=\\s*[\"'].*?[\"'])

That should work. The expression src='...' is in parentheses, so it is a capturing group and can be processed separately.
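
When trying an expression like this in Java, it may be worth checking which `Matcher` method is called: `matches()` only succeeds when the whole input matches the pattern, while `find()` looks for the pattern anywhere in the input. The sketch below illustrates that difference under a few assumptions: the class and method names are hypothetical, and the inline flag `(?i)` stands in for the per-letter classes such as `[Ii][Mm][Gg]`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcFindDemo {
    // Same shape as the expression above; (?i) replaces [Ii][Mm][Gg] and [Ss][Rr][Cc].
    static final Pattern IMG_SRC =
            Pattern.compile("(?i)<img\\s*(src\\s*=\\s*[\"'].*?[\"'])");

    // matches() requires the entire input to match the pattern.
    static boolean wholeInputMatches(String html) {
        return IMG_SRC.matcher(html).matches();
    }

    // find() scans for the first occurrence anywhere in the input.
    static String findSrcExpression(String html) {
        Matcher m = IMG_SRC.matcher(html);
        return m.find() ? m.group(1) : null; // group 1 is the whole src='...' expression
    }

    public static void main(String[] args) {
        String html = "<html><IMG src='kk.gif' alt=\"text\"/></html>";
        System.out.println(wholeInputMatches(html));
        System.out.println(findSrcExpression(html));
    }
}
```

As in the original expression, `src` must be the first attribute after `img` for this to match.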

Mnementh
Yes, I need the src attribute from the image, but this expression fails to compile in Java; can you please verify it?
Krishna Kumar
That will work, until somebody uses apostrophes instead of double quotes to delimit the attribute value (src='foo'). Also, your approach would fail if the img tag had other attributes. The complexity involved is fairly high, although you can get most cases right with a good regex. I don't have one handy though.
Jouni Heikniemi
Thanks for the reply; this regex fails to compile in Java with the following error: java.util.regex.PatternSyntaxException: Unclosed group near index 43 <[Ii][Mm][Gg]\s*([Ss][Rr][Cc]\s*=\s*\".*?\" ^
Krishna Kumar
I fixed some problems.
Mnementh
I edited again to include single quotes.
Mnementh
This compiles fine now; but does not return a mathc for src in the following text. <html><img src="kk.t"></html>
Krishna Kumar
What's a mathc?
Mnementh
sorry for the typo; please read it as 'match'.
Krishna Kumar
+1  A: 

One possibility:

String imgRegex = "<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>";

This works if matched case-insensitively. It's a bit of a mess, and deliberately ignores the case where quotes aren't used. Written without the Java string escapes, the pattern is:

<img[^>]+src\s*=\s*['"]([^'"]+)['"][^>]*>

This matches:

  • <img
  • one or more characters that aren't > (i.e. possible other attributes)
  • src
  • optional whitespace
  • =
  • optional whitespace
  • starting delimiter of ' or "
  • image source (which may not include a single or double quote)
  • ending delimiter
  • although the expression can stop here, I then added:
    • zero or more characters that are not > (more possible attributes)
    • > to close the tag

Things to note:

  • If you want to include the src= as well, move the open bracket further left :-)
  • This does not care about delimiter balancing or attribute values without delimiters, and it can also choke on badly-formed attributes (such as attributes that include > or image sources that include ' or ").
  • Parsing HTML with regular expressions like this is non-trivial, and at best a quick hack that works in the majority of cases.
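
Putting the pieces together, a minimal sketch of how this expression might be used from Java to collect every image source in a string. The class name `ImgSrcExtractor` and method `extractAll` are hypothetical, and `Pattern.CASE_INSENSITIVE` is how the "matched case-insensitively" condition is assumed to be met here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcExtractor {
    // The expression from this answer; group 1 captures the image source only.
    static final Pattern IMG = Pattern.compile(
            "<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the src value of every img tag that the pattern matches.
    static List<String> extractAll(String html) {
        List<String> sources = new ArrayList<>();
        Matcher m = IMG.matcher(html);
        while (m.find()) {
            sources.add(m.group(1)); // group(0) would be the whole <img ...> tag
        }
        return sources;
    }

    public static void main(String[] args) {
        String html = "<html><img src=\"kk.gif\" alt=\"text\"/>"
                + "<IMG alt='x' SRC = 'b.png'></html>";
        System.out.println(extractAll(html));
    }
}
```

The limitations listed above still apply: unquoted attribute values are skipped, and the opening and closing delimiters are not checked against each other.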
Dave
Thanks; this returns the match "<img src="kk.t">" for the string <html><img src="kk.t"></html>. Can this expression be changed to get me only "kk.t"? Hope I am not asking for too much ;)
Krishna Kumar
The first submatch should return what you want. See http://java.sun.com/docs/books/tutorial/essential/regex/groups.html for how to access the group. You essentially want to use the `group()` method on your match result, with the argument `1`.
Dave
See the code from cletus above for an example on how to get a captured subgroup -- you just want the argument to `group()` to be `1`.
Dave