views:

389

answers:

12

I'm trying to fetch some HTML from various blogs and I've noticed that different providers use the same tag in different ways.

For example, here are two major providers that write the generator meta tag differently:

Blogger: <meta content='blogger' name='generator'/> (content first, name later and, yes, single quotes!)
WordPress: <meta name="generator" content="WordPress.com" /> (name first, content later)

Is there a way to extract the value of content in all cases (single or double quotes, content before or after name)?

Thank you.

P.S. Although I'm using Java, the answer would probably help more people if it were for regular expressions in general.

+2  A: 

Those differences are not really important according to the XHTML standard.

In other words, they are exactly the same thing.

Likewise, replacing the double quotes with single quotes makes no difference.

The typical way of 'normalizing' an XML document is to parse it using some API that treats the document as its Infoset representation. Both DOM and SAX style APIs work that way.

If you want to parse them by hand (or with a RegEx) you have to replicate all those things in your code and, in my opinion, that's not practical.
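
As a rough illustration of that point, here is a minimal sketch using the JDK's built-in DOM parser. It assumes the input is well-formed XML, which real-world HTML often is not; the class and method names are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class XmlAttributeDemo {
    // Parses a single well-formed XML element and returns one of its attributes.
    // Attribute order and quote style are irrelevant once the text is parsed.
    static String attribute(String xml, String attributeName) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Element root = builder
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            return root.getAttribute(attributeName);
        } catch (Exception e) {
            // Assumption: callers only pass well-formed XML; anything else ends up here.
            throw new IllegalArgumentException("Not well-formed XML: " + xml, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(attribute("<meta content='blogger' name='generator'/>", "content"));
        System.out.println(attribute("<meta name=\"generator\" content=\"WordPress.com\" />", "content"));
    }
}
```

Both tags go through the same call, and the parser hides the differences in quoting and attribute order.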

Sergio Acosta
A: 

@Sergio Yes, but how can I match something that varies this much with a regex, whatever the programming language?

pek
+3  A: 

Actually, you should probably use some sort of HTML parser that lets you inspect each node (and therefore its attributes) in the DOM of the page. I haven't used any of these for a while, so I don't know the pros and cons, but here's a list: http://java-source.net/open-source/html-parsers

martinatime
+12  A: 

The answer is: don't use regular expressions.

Seriously. Use an SGML parser, or an XML parser if you happen to know it's valid XML (probably almost never true). With regular expressions you will absolutely screw up and waste tons of time trying to get it right. Just use what's already available.

Brad Wilson
A: 

@Brad Wilson

That would make sense if he was actually trying to parse and interpret the entire HTML document. However, when looking for certain tags, or a certain attribute within a tag, regex is more than valid for the job. Loading up an SGML parsing library and having it parse the entire document just to look for one type of tag seems to be overkill to me.

Kibbee
+1  A: 

Note: single quotes (and even no quotes, if the value doesn't contain a space) are valid according to the W3C HTML spec. Quote:

By default, SGML requires that all attribute values be delimited using either double quotation marks (ASCII decimal 34) or single quotation marks (ASCII decimal 39)... In certain cases, authors may specify the value of an attribute without any quotation marks.

Also, don't forget that the order of attributes can be reversed and that other attributes can appear in the tag.

Cd-MaN
A: 

OK, since you are looking for a language-agnostic answer, you can try a regex like /<meta\s.*content=.*>/, then take the result and parse out the specific values that you are looking for. I'm by no means a regex expert, so there is probably a better way, but using the tool at http://www.codehouse.com/webmaster_tools/regex/ I matched both of the strings you provided.

martinatime
+1  A: 

You may want to give Java's HTMLEditorKit a shot. It is easy to experiment with to see if the parsing provides what you are looking for.
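
For anyone who wants to try it, here is a minimal sketch built on the Swing HTML parser that ships with the JDK (ParserDelegator plus an HTMLEditorKit.ParserCallback). The class and method names are invented for this example, and the parser targets a fairly old HTML dialect, so treat it as a starting point for experiments rather than a robust solution.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class GeneratorFinder {
    // Returns the content of the first <meta name="generator"> tag, or null if none is found.
    static String findGenerator(String html) {
        final String[] result = new String[1];
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag != HTML.Tag.META || result[0] != null) {
                    return; // not a meta tag, or we already have an answer
                }
                Object name = attrs.getAttribute(HTML.Attribute.NAME);
                Object content = attrs.getAttribute(HTML.Attribute.CONTENT);
                if (name != null && content != null
                        && "generator".equalsIgnoreCase(name.toString())) {
                    result[0] = content.toString();
                }
            }
        };
        try {
            // The last argument (ignoreCharSet = true) keeps the parser from
            // aborting when a meta tag declares a character set.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(findGenerator("<meta content='blogger' name='generator'/>"));
        System.out.println(findGenerator("<meta name=\"generator\" content=\"WordPress.com\" />"));
    }
}
```

The callback only sees parsed attributes, so quote style and attribute order never reach your code.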

Preston
A: 

If you must use regex, here is a regex to get just the content part:

content\s*=\s*['"].*?['"]

returns

content='blogger'

and

content="WordPress.com"

respectively. I'm no regex expert, but it gets those when given your examples in regexpal.

Once you get that you can get everything between the quotes however you choose, be it another regex (which is just immoral at that point) or just looping over the characters.
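
One way to skip that second step, sketched here in Java, is to put the value in a capture group and use a backreference so the closing quote has to match the opening one. The class and method names are hypothetical, and the pattern still assumes the value is quoted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentCapture {
    // Group 1 is the opening quote, group 2 the value; \1 requires the same closing quote.
    private static final Pattern CONTENT =
        Pattern.compile("content\\s*=\\s*(['\"])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    // Returns the first quoted content value in the given text, or null if there is none.
    static String firstContent(String text) {
        Matcher m = CONTENT.matcher(text);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstContent("<meta content='blogger' name='generator'/>"));
        System.out.println(firstContent("<meta name=\"generator\" content=\"WordPress.com\" />"));
    }
}
```

Like the pattern above, this returns the content of whichever meta tag comes first; it does not check the name attribute.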

dwestbrook
A: 

@martinatime and @dwestbrook your regexes work perfectly for the strings I provided, but unfortunately, meta tags are used in more than one situation.

For example, Blogger.com has these in their source:

<meta content='true' name='MSSmartTagsPreventParsing'/>
<meta content='blogger' name='generator'/>

Your regex examples will return the values of both. That's why I mentioned the problem with the name and content being reversed.
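
If a regex really has to do the job, one possible approach is to work in two passes: first find each meta tag, then collect its attributes into a map and check the name before reading the content. The Java sketch below is only that, a sketch: the class and method names are invented, and it assumes no attribute value contains a '>' character. An unquoted value directly followed by "/>" would also pick up the trailing slash.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaRegex {
    // Pass 1: one <meta ...> tag; group 1 is the raw attribute text.
    private static final Pattern META_TAG =
        Pattern.compile("<meta\\b([^>]*)>", Pattern.CASE_INSENSITIVE);

    // Pass 2: name=value, where the value is "double-quoted", 'single-quoted', or bare.
    private static final Pattern ATTRIBUTE = Pattern.compile(
        "([A-Za-z][\\w:-]*)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+))");

    // Returns the content of the first meta tag whose name matches, or null.
    static String metaContent(String html, String wantedName) {
        Matcher tag = META_TAG.matcher(html);
        while (tag.find()) {
            Map<String, String> attrs = new HashMap<>();
            Matcher attr = ATTRIBUTE.matcher(tag.group(1));
            while (attr.find()) {
                // Exactly one of groups 2, 3, 4 is set, depending on the quoting style.
                String value = attr.group(2) != null ? attr.group(2)
                             : attr.group(3) != null ? attr.group(3)
                             : attr.group(4);
                attrs.put(attr.group(1).toLowerCase(Locale.ROOT), value);
            }
            if (wantedName.equalsIgnoreCase(attrs.get("name"))) {
                return attrs.get("content");
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<meta content='true' name='MSSmartTagsPreventParsing'/>\n"
                    + "<meta content='blogger' name='generator'/>";
        System.out.println(metaContent(html, "generator"));
    }
}
```

With the two Blogger lines above, asking for "generator" skips the MSSmartTagsPreventParsing tag and returns blogger, whichever attribute comes first.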

pek
A: 

Agreed, HTML parsing is not a job for regular expressions, especially with DOM parsers freely available.

Chris Marasti-Georg
A: 

If you're using Java, you may want to look at TagSoup, which is a SAX-compliant parser for "[parsing] HTML as it is found in the wild".

Peter Stuifzand