ansaurus

Question

Answer 1

A:

Here's the regex you need

title='([a-zA-Z0-9]+)'

but if you're going to be doing a lot more stuff like this, using a parser might make it much more robust and useful.

nickf 2009-01-21 11:33:43

Answer 2

A:

Try this instead:

title=\'[a-zA-Z0-9]*\'

Frederick 2009-01-21 11:34:11

Answer 3

A:

I'm not familiar with PowerGrep, but your regex is incomplete. Try this:

title='[a-zA-Z0-9 ]*'

or better yet:

title='([^']*)'

David Hanak 2009-01-21 11:36:16

Single quotes are allowed: http://www.w3.org/TR/xml/#NT-AttValue

Gumbo 2009-01-21 12:26:27

-1 Single quotes are allowed

cletus 2009-01-21 12:49:20

You are right, guys. I removed the sidenote.

David Hanak 2009-01-21 14:09:10

Sorry but this still doesn't do the job. Lots more characters are valid for attributes, as are entities that you'd want to decode and your example now uses single quotes instead of double quotes. A working regex should work with a matching pair of single or double quotes.

cletus 2009-01-21 23:11:59

I agree that you could use a more complex regex. For example, you can capture the opening quote in a group, (['"]), and use a back reference to it both between the quotes and at the end. I intentionally avoided overcomplicating things, suspecting that the questioner is not really at ease with regular expressions.

David Hanak 2009-01-22 08:45:13
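The back-reference idea from the comment above can be sketched in Python as follows. This is only one possible reading of it: the function name title_values and the sample string are made up for illustration, and the pattern still ignores unquoted values and entity decoding.

```python
import re

# Group 1 captures the opening quote; the final \1 requires the same quote
# to close the value. The lookahead (?!\1) is the back reference "between
# the quotes": it lets the value contain the *other* quote character.
TITLE_RE = re.compile(r"""title=(['"])((?:(?!\1).)*)\1""")

def title_values(text):
    """Return the quoted title values found in text, without the quotes."""
    return [m.group(2) for m in TITLE_RE.finditer(text)]

# Hypothetical input, one value per quote style.
sample = """<a title='Connectivity Framework'> <b title="Mary's lamb">"""
print(title_values(sample))
```

A value that opens with one quote type and never closes with the same type is simply not matched.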

Answer 4

A:

The other answers all give correct changes to the regex, so I'll explain what the issue was with your original.

The square brackets indicate a character class - meaning that the regex will match any character within those brackets. However, like everything else, it will only match it once by default. Just as the regex "s" would match only the first character in "ssss", the regex "[a-zA-Z0-9]" will match only the first character in "Connectivity Framework".

By adding repetition, one can get that character class to match repeatedly. The easiest way to do this is by adding an asterisk after it (which will match 0 or more occurrences). Thus the regex "[a-zA-Z0-9]*" will match as many characters in a row as it can, stopping when it hits a character that is not in that character class (in your case, the space character, since you didn't include that in your brackets).
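A quick way to see the difference is to try the three variants on the string from the question. The snippet below is a small Python sketch; the variable names are arbitrary.

```python
import re

text = "Connectivity Framework"

# Without repetition the class matches exactly one character.
single = re.match(r"[a-zA-Z0-9]", text).group()

# With * it keeps consuming until it meets a character outside the class
# (here, the space).
run = re.match(r"[a-zA-Z0-9]*", text).group()

# Adding the space to the class lets the match cover the whole string.
whole = re.match(r"[a-zA-Z0-9 ]*", text).group()

print(single, run, whole, sep=" | ")
```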

Describing the syntax accurately with a regex can get pretty complex, though. What if someone put a non-alphanumeric character such as an ampersand within the attribute? You could try to capture all input between the quotes by making the character class "anything except a quote character", so "'[^']*'" would usually do the right thing. Often you need to bear escaping in mind as well (e.g. with a string 'Mary\'s lamb' you do actually want to capture the apostrophe in the middle, so a simple "everything but apostrophes" character class won't cut it), though thankfully this is not an issue with XML/HTML according to the specs.

Still, if there is an existing library available that will do the extraction for you, this is likely to be faster and more correct than rolling your own, so I would lean towards that if possible.

Andrzej Doyle 2009-01-21 12:15:28

Answer 5

+4 A:

I'm just not sure how many times the question of regular expression parsing of HTML files has to be asked (and answered with the correct solution of "use a DOM parser"). It comes up every day.

The difficulties are:

In HTML, attribute values can be wrapped in single quotes, double quotes or no quotes at all;
Similar strings can appear in the text of the HTML document itself;
You have to handle escaping correctly; and
You have to cope with malformed HTML (decent parsers are extremely robust to common errors).

So if you cater for all this (and it gets to be a pretty complicated yet still imperfect regex), it's still not 100%.

HTML parsers exist for a reason. Use them.
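To make the parser route concrete, here is a minimal sketch using Python's standard-library html.parser module. The class name TitleCollector, the helper extract_titles and the sample markup are assumptions made for this example; other languages have comparable parsers.

```python
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    """Collects the value of every title attribute seen in a start tag."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs. The parser has already
        # lowercased the names, removed the quotes and decoded entities.
        for name, value in attrs:
            if name == "title" and value is not None:
                self.titles.append(value)

def extract_titles(html):
    parser = TitleCollector()
    parser.feed(html)
    parser.close()
    return parser.titles

# Hypothetical input covering the difficulties listed above: a look-alike
# string in the text, both quote styles, an entity and an unquoted value.
sample = """<p>title='not an attribute'</p>
<a title='Connectivity Framework'>x</a>
<img title="Tom &amp; Jerry's" src=a.png>
<span title=bare>y</span>"""
print(extract_titles(sample))
```

The look-alike string inside the paragraph is reported as text, not as an attribute, so it never reaches the result list.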

cletus 2009-01-21 12:17:12

I believe there used to be a tag for this, but I think someone went and removed it from everything. It *is* rather depressing how frequently this comes up. Sticky? :P

annakata 2009-01-21 12:36:01

We do need something to stop this question coming up multiple times a day. Some kind of FAQ which pops up when a questioner uses the 'regex' and 'html' tags together?

bobince 2009-01-21 14:13:58

There is sometimes a need for fast "parsing"/checking of HTML. DOM parsers are heavy on memory and slow. So if I have to process a ton of data, a regexp is easier and faster. Some restrictions apply, of course.

ReneS 2009-04-16 02:11:16

That's what SAX parsers are for.

cletus 2009-04-16 02:32:20
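As a rough illustration of the SAX suggestion, the sketch below streams a document through Python's xml.sax module and collects title attributes without building a tree. It assumes well-formed XML/XHTML input; the names TitleHandler and sax_titles and the sample document are invented for the example.

```python
import xml.sax

class TitleHandler(xml.sax.ContentHandler):
    """Records title attributes as start-element events arrive."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def startElement(self, name, attrs):
        # attrs behaves like a read-only mapping of attribute names to
        # already-unquoted, entity-decoded values.
        if "title" in attrs:
            self.titles.append(attrs["title"])

def sax_titles(xml_bytes):
    handler = TitleHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.titles

# Hypothetical well-formed input.
doc = b"<root><item title='Connectivity Framework'/><item title=\"A &amp; B\"/><item/></root>"
print(sax_titles(doc))
```

Only the handler's own list is kept in memory, which is the point of the event-driven approach. Real-world tag soup would make this strict XML parser raise an error, so it only fits input known to be well formed.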

Answer 6

A:

I would use this regular expression to get the title attribute values

<[a-z]+[^>]*\s+title\s*=\s*("[^"]*"|'[^']*'|[^\s >]*)

Note that this regex matches the attribute value expression with quotes. So you have to remove them if needed.
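For example, the expression can be applied from Python roughly like this, with the surrounding quotes removed afterwards as the note says. The helper name title_values and the sample markup are assumptions for illustration; the pattern itself is copied unchanged.

```python
import re

# The regex from the answer, unchanged. Group 1 is the attribute value,
# including its quotes when it has any.
ATTR_RE = re.compile(r"""<[a-z]+[^>]*\s+title\s*=\s*("[^"]*"|'[^']*'|[^\s >]*)""")

def title_values(html):
    values = []
    for m in ATTR_RE.finditer(html):
        raw = m.group(1)
        # Strip one matching pair of surrounding quotes, if present.
        if len(raw) >= 2 and raw[0] == raw[-1] and raw[0] in "\"'":
            raw = raw[1:-1]
        values.append(raw)
    return values

# Hypothetical input: single-quoted, double-quoted and unquoted values.
sample = """<a title='Connectivity Framework'> <img src=x title="Mary's lamb"> <span title=bare>"""
print(title_values(sample))
```

Because the pattern only accepts lowercase tag names, uppercase tags such as <A TITLE=...> would need a case-insensitive flag.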

Gumbo 2009-01-21 12:23:37

Answer 7

A:

@David Hanak: thanks, it worked. But can you please tell me how it works?

What does ^' inside the square brackets mean,

and what are the ( ) brackets for?

It would help.

-AD

goldenmean 2009-01-21 12:34:33

^ in the brackets means: "anything but ...", ( ) are used to group things, so you can later get the value that was captured in this group.

wvanbergen 2009-01-21 12:39:51
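In Python terms, the difference between the whole match and the captured group looks like the sketch below. The function name first_title and the sample tags are made up for the example.

```python
import re

# [^']* means "any run of characters that are not a single quote";
# the parentheses make that run available as group 1.
PATTERN = re.compile(r"title='([^']*)'")

def first_title(text):
    m = PATTERN.search(text)
    if m is None:
        return None
    # m.group(0) would be the whole match, including title= and the quotes.
    return m.group(1)

tag = "<a href='x.html' title='Connectivity Framework'>"
print(PATTERN.search(tag).group(0))
print(first_title(tag))
```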

You can also use RegexBuddy, which will tell you what each part of the expression does.

Dror 2009-01-21 14:11:14


Regex to match attributes in HTML?
