Hi, I have a txt file that is actually the HTML source of a web page. Inside that file there are various strings preceded by a "title=" attribute, e.g.

<div id='UWTDivDomains_5_6_2_2'  title='Connectivity Framework'>

I am interested in extracting the text Connectivity Framework and writing it to a separate file.

There are many such tags, each with different text after title= (as in title='some text here which I need to extract'). I want to extract all such instances of the text from the HTML source/txt file and write them to a separate txt file. The text can contain lower-case letters, upper-case letters and numbers only. The length of each text string (in characters) will vary.

I am using PowerGrep for Windows. PowerGrep allows me to search a text file with a regular expression as input. I tried searching with title='[a-zA-Z0-9]

It finds the right places, but it matches only the first character of each string, so only that first character is written to the second txt file, not the whole string.

I want the whole string to be matched and written to the second file.

What is the correct regular expression, or the correct way, to do what I want using PowerGrep?

-AD.

A: 

Here's the regex you need

title='([a-zA-Z0-9]+)'

but if you're going to be doing a lot more stuff like this, using a parser might make it much more robust and useful.

nickf
A: 

Try this instead:

title=\'[a-zA-Z0-9]*\'
Frederick
A: 

I'm not familiar with PowerGrep; however, your regex is incomplete. Try this:

title='[a-zA-Z0-9 ]*'

or better yet:

title='([^']*)'
David Hanak
Single quotes are allowed: http://www.w3.org/TR/xml/#NT-AttValue
Gumbo
-1 Single quotes are allowed
cletus
You are right, guys. I removed the sidenote.
David Hanak
Sorry but this still doesn't do the job. Lots more characters are valid for attributes, as are entities that you'd want to decode and your example now uses single quotes instead of double quotes. A working regex should work with a matching pair of single or double quotes.
cletus
I agree that you could use a more complex regex. For example, you can group the opening quote - (['"]) - and use a back reference both between the quotes and at the end. I intentionally avoided overcomplication, suspecting that the questioner is not really at ease with regular expressions.
David Hanak
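The back-reference idea from the comment above can be sketched as follows. This is only an illustration in Python's re module, not PowerGrep syntax; the sample string is made up, and a lazy `.*?` is used between the quotes instead of a back reference inside the value, which is an assumption about how one might simplify it.

```python
import re

# Group 1 captures the opening quote (single or double).
# \1 then requires the very same quote character to close the value.
# Group 2 is the value; .*? is lazy, so it stops at the first closing quote.
QUOTED_TITLE = re.compile(r"""title=(['"])(.*?)\1""")

# Hypothetical sample with one single-quoted and one double-quoted attribute.
sample = """<div title='Say "hi"'><span title="Mary's lamb">"""

values = [m.group(2) for m in QUOTED_TITLE.finditer(sample)]
print(values)  # ['Say "hi"', "Mary's lamb"]
```

Because the closing quote must equal the opening one, the other kind of quote is free to appear inside the value.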
A: 

The other answers all give correct changes to the regex, so I'll explain what the issue was with your original.

The square brackets indicate a character class - meaning that the regex will match any character within those brackets. However, like everything else, it will only match it once by default. Just as the regex "s" would match only the first character in "ssss", the regex "[a-zA-Z0-9]" will match only the first character in "Connectivity Framework".

By adding repetition, one can get that character class to match repeatedly. The easiest way to do this is by adding an asterisk after it (which will match 0 or more occurrences). Thus the regex "[a-zA-Z0-9]*" will match as many characters in a row as it can, until it hits a character that is not in that character class (in your case, the space character, since you didn't include that in your brackets).
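To make the difference concrete, here is a small sketch in Python's re module (an assumption; PowerGrep has its own regex flavour, although these basic constructs are common to most flavours). It runs three variants of the pattern against the line from the question.

```python
import re

line = "<div id='UWTDivDomains_5_6_2_2'  title='Connectivity Framework'>"

# No quantifier: the character class matches exactly one character.
one = re.search(r"title='[a-zA-Z0-9]", line).group(0)

# With *: the class repeats, but stops at the space, which is not in the class.
many = re.search(r"title='[a-zA-Z0-9]*", line).group(0)

# Adding a space to the class lets the match run up to the closing quote.
full = re.search(r"title='[a-zA-Z0-9 ]*", line).group(0)

print(one)   # title='C
print(many)  # title='Connectivity
print(full)  # title='Connectivity Framework
```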

It can be pretty hard, though, to write a regex that describes the syntax accurately - what if someone put a non-alphanumeric character such as an ampersand within the attribute? You could try to capture all input between the quotes by making the character class "anything except a quote character", so "'[^']*'" would usually do the right thing. Often you need to bear in mind escaping as well (e.g. with a string 'Mary\'s lamb' you do actually want to capture the apostrophe in the middle, so a simple "everything but apostrophes" character class won't cut it), though thankfully this is not an issue with XML/HTML according to the specs.
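A minimal sketch of the "anything except a quote" approach, again in Python rather than PowerGrep. The helper name `extract_titles` and the sample input are hypothetical, and the pattern assumes every title value is wrapped in single quotes, as in the question.

```python
import re

# [^']* means "zero or more characters that are not a single quote".
# The parentheses capture only that part, without title=' or the closing quote.
TITLE_RE = re.compile(r"title='([^']*)'")

def extract_titles(html: str) -> list[str]:
    """Return the text of every single-quoted title attribute in html."""
    return TITLE_RE.findall(html)

# Hypothetical input; the second value contains a space, an ampersand and a digit.
sample = (
    "<div id='a' title='Connectivity Framework'>\n"
    "<div id='b' title='R&D Tools 2'>\n"
)

titles = extract_titles(sample)
print(titles)  # ['Connectivity Framework', 'R&D Tools 2']
```

Writing the result to a second file is then just a matter of joining the list with newlines and saving it under whatever name you choose.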

Still, if there is an existing library available that will do the extraction for you, this is likely to be faster and more correct than rolling your own, so I would lean towards that if possible.

Andrzej Doyle
+4  A: 

I'm just not sure how many times the question of regular expression parsing of HTML files has to be asked (and answered with the correct solution of "use a DOM parser"). It comes up every day.

The difficulties are:

  • In HTML, attribute values can have single quotes, double quotes or even no quotes;
  • Similar strings can appear in the HTML document itself;
  • You have to handle correct escaping; and
  • Malformed HTML (decent parsers are extremely robust to common errors).

So if you cater for all this (and it gets to be a pretty complicated yet still imperfect regex), it's still not 100%.

HTML parsers exist for a reason. Use them.
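As one illustration of the parser route, here is a minimal sketch using Python's standard-library html.parser. The choice of Python, the class name `TitleCollector` and the sample markup are all assumptions made for the example; any decent HTML parser in any language would serve the same purpose.

```python
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    """Collects the value of every title attribute found in a start tag."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name == "title" and value is not None:
                self.titles.append(value)

# Hypothetical input covering single quotes, double quotes with an entity,
# no quotes, and a look-alike string in ordinary text.
sample = """
<div id='UWTDivDomains_5_6_2_2'  title='Connectivity Framework'>
<span title="Mary's &amp; Co">x</span>
<p title=Plain>text with title='not an attribute'</p>
"""

collector = TitleCollector()
collector.feed(sample)
collector.close()
print(collector.titles)  # ['Connectivity Framework', "Mary's & Co", 'Plain']
```

The parser deals with all three quoting styles, decodes the entity, and ignores the title='...' text that sits outside any tag, which are exactly the cases listed above.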

cletus
I believe there used to be a tag for this, but I think someone went and removed it from everything. It *is* rather depressing how frequently this comes up. Sticky? :P
annakata
We do need something to stop this question coming up multiple times a day. Some kind of FAQ which pops up when a questioner uses the 'regex' and 'html' tags together?
bobince
There is sometimes a need for fast "parsing"/checking of HTML. DOM parsers are heavy on memory and slow. So if I have to process a ton of data, a regexp is easier and faster. Some restrictions apply, of course.
ReneS
That's what SAX parsers are for.
cletus
A: 

I would use this regular expression to get the title attribute values

<[a-z]+[^>]*\s+title\s*=\s*("[^"]*"|'[^']*'|[^\s >]*)

Note that this regex matches the attribute value expression with quotes. So you have to remove them if needed.
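A sketch of how the quote removal might look, using the pattern above unchanged in Python's re module. The helper name `title_values` and the sample are hypothetical. Note that `[a-z]+` only matches lower-case tag names unless a case-insensitive flag is added.

```python
import re

# The pattern from the answer, as given; group 1 is the value, quotes included.
ATTR_RE = re.compile(r"""<[a-z]+[^>]*\s+title\s*=\s*("[^"]*"|'[^']*'|[^\s >]*)""")

def title_values(html: str) -> list[str]:
    """Return title attribute values with one pair of surrounding quotes removed."""
    values = []
    for raw in ATTR_RE.findall(html):
        # Strip the quotes only when the value starts and ends with the same quote.
        if len(raw) >= 2 and raw[0] == raw[-1] and raw[0] in "'\"":
            raw = raw[1:-1]
        values.append(raw)
    return values

# Hypothetical input: single-quoted, double-quoted and unquoted values.
sample = """<div id='x'  title='Connectivity Framework'><a title="Home"><p title=Plain>"""

result = title_values(sample)
print(result)  # ['Connectivity Framework', 'Home', 'Plain']
```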

Gumbo
A: 

@David Hanak: thanks, it worked. But can you please tell me how it works?

What does ^' inside the square brackets mean,

and what are the ( ) brackets for?

It would help.

-AD

goldenmean
^ in the brackets means: "anything but ...", ( ) are used to group things, so you can later get the value that was captured in this group.
wvanbergen
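In code, the difference between the whole match and the captured group looks like this (a small sketch in Python's re module; how PowerGrep exposes groups may differ).

```python
import re

m = re.search(r"title='([^']*)'", "<div title='Connectivity Framework'>")

whole = m.group(0)     # everything the pattern matched
captured = m.group(1)  # only what the parentheses captured

print(whole)     # title='Connectivity Framework'
print(captured)  # Connectivity Framework
```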
You can also use RegexBuddy, which will tell you what each part of the expression does.
Dror