ansaurus

Question

Best way to fetch an "unstandard" HTML tag

Answer 1

+2 A:

Those differences are not really important according to the XHTML standard.

In other words, they are exactly the same thing.

Also, replacing the double quotes with single quotes would give the same result.

The typical way of 'normalizing' an XML document is to parse it using some API that treats the document as its Infoset representation. Both DOM and SAX style APIs work that way.

If you want to parse them by hand (or with a RegEx) you have to replicate all those things in your code and, in my opinion, that's not practical.
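As a rough illustration of that normalization, here is a minimal sketch using Python's standard `html.parser` module. The two sample tags are assumptions modelled on the thread's examples, and `MetaCollector` and `meta_attributes` are hypothetical names chosen for this sketch.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects the attributes of every <meta> tag as a dict."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs with quotes stripped;
        # self-closing tags such as <meta ... /> are routed here as well.
        if tag == "meta":
            self.metas.append(dict(attrs))


def meta_attributes(html):
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return collector.metas


# Assumed sample inputs: different quote style and attribute order.
double_quoted = '<meta name="generator" content="blogger" />'
single_quoted = "<meta content='blogger' name='generator'/>"

print(meta_attributes(double_quoted) == meta_attributes(single_quoted))  # True
```

Once the parser has done its work, quoting style and attribute order no longer show up in the data, which is the point being made above.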

Sergio Acosta 2008-08-28 02:28:16

Answer 2

A:

@Sergio Yes, but how can I parse something that can vary this much using a regex in a programming language?

pek 2008-08-28 02:29:26

Answer 3

+3 A:

Actually, you should probably use some sort of HTML parser where you can inspect each node (and therefore its attributes) in the DOM of the page. I've not used any of these for a while, so I don't know the pros and cons, but here's a list: http://java-source.net/open-source/html-parsers

martinatime 2008-08-28 02:30:42

Answer 4

+12 A:

The answer is: don't use regular expressions.

Seriously. Use an SGML parser, or an XML parser if you happen to know it's valid XML (probably almost never true). With regular expressions you will absolutely screw up and waste tons of time trying to get it right. Just use what's already available.
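To make the XML caveat concrete, here is a small sketch with Python's standard `xml.etree.ElementTree`. Both sample documents and the helper name `generator_from_xhtml` are assumptions made up for illustration. The XML parser handles well-formed markup but rejects ordinary tag soup outright.

```python
import xml.etree.ElementTree as ET


def generator_from_xhtml(markup):
    """Return the content of <meta name="generator">, or None if absent."""
    root = ET.fromstring(markup)  # raises ET.ParseError if not well-formed
    for meta in root.iter("meta"):
        if meta.get("name") == "generator":
            return meta.get("content")
    return None


# Assumed sample: well-formed, so an XML parser accepts it.
well_formed = "<html><head><meta content='blogger' name='generator'/></head></html>"
print(generator_from_xhtml(well_formed))  # blogger

# Assumed sample: unquoted attributes and an unclosed tag are fine in HTML
# but are not well-formed XML.
tag_soup = "<html><head><meta name=generator content=blogger></head></html>"
try:
    generator_from_xhtml(tag_soup)
except ET.ParseError as error:
    print("not well-formed XML:", error)
```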

Brad Wilson 2008-08-28 02:31:40

Answer 5

A:

@Brad Wilson

That would make sense if he was actually trying to parse and interpret the entire HTML document. However, when looking for certain tags, or a certain attribute within a tag, regex is more than valid for the job. Loading up an SGML parsing library and having it parse the entire document just to look for one type of tag seems to be overkill to me.

Kibbee 2008-08-28 02:49:26

Answer 6

+1 A:

Note: single quotes (or even no quotes, if the value contains only letters, digits, hyphens, periods, underscores, and colons) are valid according to the W3C HTML spec. Quote:

By default, SGML requires that all attribute values be delimited using either double quotation marks (ASCII decimal 34) or single quotation marks (ASCII decimal 39)... In certain cases, authors may specify the value of an attribute without any quotation marks.

Also, don't forget that the order of attributes can be reversed and that other attributes can appear in the tag.
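A tolerant HTML parser absorbs all of these variations. The following is a sketch only, again using Python's standard `html.parser`; the sample tags and the names `MetaFinder` and `meta_content` are assumptions invented for this example.

```python
from html.parser import HTMLParser


class MetaFinder(HTMLParser):
    """Remembers the content of the first <meta> whose name matches."""

    def __init__(self, wanted_name):
        super().__init__()
        self.wanted_name = wanted_name
        self.content = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (tag == "meta" and self.content is None
                and attributes.get("name") == self.wanted_name):
            self.content = attributes.get("content")


def meta_content(html, name):
    finder = MetaFinder(name)
    finder.feed(html)
    finder.close()
    return finder.content


# Assumed variations: double quotes, single quotes with reversed order,
# no quotes, and an extra attribute in front.
samples = [
    '<meta name="generator" content="blogger">',
    "<meta content='blogger' name='generator'>",
    "<meta name=generator content=blogger>",
    '<meta lang="en" content="blogger" name="generator">',
]
for sample in samples:
    print(meta_content(sample, "generator"))  # blogger, four times
```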

Cd-MaN 2008-08-28 02:56:03

Answer 7

A:

OK, since you are looking for something language-agnostic, you can try a regex like /<meta\s.*content=.*>/, then take the result and parse out the specific values you are looking for. I'm by no means a regex expert, so there is probably a better way, but using the tool at http://www.codehouse.com/webmaster_tools/regex/ I matched both of the strings you provided.

martinatime 2008-08-28 03:20:22

Answer 8

+1 A:

You may want to give Java's HTMLEditorKit a shot. It is easy to experiment with to see if the parsing provides what you are looking for.

Preston 2008-08-28 03:24:04

Answer 9

A:

If you must use regex, here is a regex to get just the content part:

content\s*=\s*['"].*?['"]

returns

content = "blogger"

and

content='Worpress.com'

respectively. I'm no regex expert, but it gets those when given your examples in regexpal.

Once you get that you can get everything between the quotes however you choose, be it another regex (which is just immoral at that point) or just looping over the characters.
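For what it's worth, the second step can be folded into the same pattern with a capture group. This is a sketch in Python under a couple of assumptions: the surrounding `<meta>` tags are made up around the two fragments quoted above, and `content_values` is a hypothetical helper name. The `\1` backreference is an addition that forces the closing quote to match the opening one.

```python
import re

# Group 1 is the opening quote, group 2 the value, \1 the same quote again.
CONTENT_VALUE = re.compile(r"""content\s*=\s*(['"])(.*?)\1""", re.IGNORECASE)


def content_values(html):
    """Return every quoted content=... value found in the text."""
    return [match.group(2) for match in CONTENT_VALUE.finditer(html)]


# Assumed inputs wrapping the fragments shown in the answer.
print(content_values('<meta name="generator" content = "blogger" />'))  # ['blogger']
print(content_values("<meta name='generator' content='Worpress.com' />"))  # ['Worpress.com']
```

Like the original pattern, this still ignores unquoted values and has no idea which tag the attribute belongs to.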

dwestbrook 2008-08-28 03:38:00

Answer 10

A:

@martinatime and @dwestbrook your regexes work perfectly for the strings I provided but, unfortunately, meta tags are used in more than one situation.

For example, Blogger.com has these in their source:

<meta content='true' name='MSSmartTagsPreventParsing'/>
<meta content='blogger' name='generator'/>

Your regex examples will return the values of both. That's why I mentioned the problem with the name and content being reversed.
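If a regex really is the only option, one way around the ordering problem is to work in two steps: match each whole meta tag first, then pull its attributes into a dictionary and check the name. The sketch below is in Python and is only an illustration; `META_TAG`, `ATTRIBUTE` and `meta_content` are hypothetical names, and the patterns are assumptions that will break on inputs such as a `>` inside an attribute value or an unquoted value directly followed by `/>`.

```python
import re

# Step 1: one whole <meta ...> tag. Step 2: name=value pairs inside it,
# with the value double-quoted, single-quoted, or bare.
META_TAG = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)
ATTRIBUTE = re.compile(r"""([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))""")


def meta_content(html, wanted_name):
    """Return the content of the first meta tag with the given name."""
    for tag in META_TAG.findall(html):
        attributes = {}
        for name, double, single, bare in ATTRIBUTE.findall(tag):
            # Only one of the three value groups is non-empty per match.
            attributes[name.lower()] = double or single or bare
        if attributes.get("name") == wanted_name:
            return attributes.get("content")
    return None


blogger_source = """
<meta content='true' name='MSSmartTagsPreventParsing'/>
<meta content='blogger' name='generator'/>
"""
print(meta_content(blogger_source, "generator"))  # blogger
```

Because the attributes end up in a dictionary, it no longer matters whether name or content comes first.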

pek 2008-08-28 04:29:28

Answer 11

A:

Agreed, HTML parsing is not a job for regular expressions, especially with DOM parsers freely available.

Chris Marasti-Georg 2008-08-28 12:22:56

Answer 12

A:

If you're using Java, you may want to look at TagSoup, which is a SAX-compliant parser for "[parsing] HTML as it is found in the wild".

Peter Stuifzand 2008-08-28 12:53:47
