ansaurus

Question

How to use Regular Expression to match the charset string in HTML?

Answer 1

A:

I tried it in JavaScript, placing your string in a variable and doing a match:

var x = '<meta http-equiv="Content-type" content="text/html;charset=utf-8" />';
var result = x.match(/charset=([a-zA-Z0-9-]+)/);
alert(result[1]);

Zsolti 2010-08-11 12:37:47

Oh no, `<p>A paragraph containing <code>charset=bogus</code>!</p>`.

You 2010-08-11 12:41:15

Well, correct. I considered the string containing only the <meta> tag.

Zsolti 2010-08-12 10:54:05

Answer 2

A:

Don't use regular expressions to parse (X)HTML! Use a proper tool, i.e. an SGML or XML parser. Your code looks like XHTML, so I'd try an XML parser. After getting the attribute from the meta element, however, a regex would be more appropriate. Although just a string split at ; would certainly do the trick (and faster, too).
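A rough sketch of the split-at-; idea in JavaScript, assuming a parser has already handed you the value of the content attribute. The function name charsetFromContentType is made up for illustration; it is not part of any library.

```javascript
// Hypothetical helper: extracts the charset parameter from a
// Content-Type style value such as "text/html;charset=utf-8".
function charsetFromContentType(content) {
  for (const part of content.split(";")) {
    // Each part looks like "text/html" or "charset=utf-8".
    const [name, value] = part.trim().split("=");
    if (name.trim().toLowerCase() === "charset" && value) {
      return value.trim();
    }
  }
  // No charset parameter present.
  return null;
}
```

Returning null for a value without a charset parameter keeps the caller from indexing into a missing result.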

You 2010-08-11 12:40:17

He is not parsing a whole HTML document, just a single line.

Oded 2010-08-11 12:41:57

I don't see that in the original question.

DavidYell 2010-08-11 12:43:16

Doesn't say that anywhere. And the "no regex" rule still applies, even to single lines; (X)HTML is not a regular grammar and can't be parsed using regular expressions.

You 2010-08-11 12:44:19

Answer 3

A:

For PHP:

// preg_match() returns 1 or 0; the captured groups go into the third argument.
$charset = preg_match('/charset=([a-zA-Z0-9-]+)/', $line, $matches) ? $matches[1] : null;

Delan Azabani 2010-08-11 12:53:08

-1, using regexps is not a good idea. See my comment on the answer by @Zsolti.

You 2010-08-11 12:56:33

Answer 4

+1 A:

This regex:

<meta.*?charset=([^"']+)

Should work. Using an XML parser to extract this is overkill.

NullUserException 2010-08-11 12:57:33

Hm... `<meta name="author" value="me"><meta charset="utf-8">`. Give me an HTML-parsing regex, and I shall break it.
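To make the failure concrete, here is a small JavaScript check of the regex from this answer against both tag styles. The variable names are only for illustration.

```javascript
// The pattern from the answer, written as a JavaScript regex literal.
const pattern = /<meta.*?charset=([^"']+)/;

// Short form: the value is quoted directly after "charset=".
const shortForm = '<meta name="author" value="me"><meta charset="utf-8">';
// Long form: charset sits inside the content attribute, unquoted.
const longForm = '<meta http-equiv="Content-type" content="text/html;charset=utf-8" />';

// The group needs at least one non-quote character right after "charset=",
// so the short form produces no match at all.
const shortResult = shortForm.match(pattern); // null
const longResult = longForm.match(pattern);   // longResult[1] is "utf-8"
```

The short form fails because the character following charset= is a quote, which the capturing group excludes.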

You 2010-08-11 13:19:28

@You This is a contrived non-example that would almost never occur in real world usage.

NullUserException 2010-08-11 13:24:21

I am happy with my regex working 99.9% of the time. By the way, you can't always use an XML parser because real world markup is rarely well behaved.

NullUserException 2010-08-11 13:33:49

+1, although I would make the .* a non-capturing group, so as a string literal in C# it would be "\\<meta(?:.*)charset=([^\"']+)". Agree that loading the whole thing into an XML parser would be overkill, and wouldn't be guaranteed to be any more reliable than the regex solution.

John M Gant 2010-08-11 13:41:35

If you're handling XHTML, it *should* be valid XML. Otherwise it's not XHTML. In the case of HTML, an SGML parser will be able to parse it in at least as many cases as this regex will work, if not more.

You 2010-08-11 15:11:47

Answer 5

A:

I tend to agree with @You; however, I'll give you the answer you requested plus some other solutions.

        String meta = "<meta http-equiv=\"Content-type\" content=\"text/html;charset=utf-8\" />";
        // regex: replace the whole tag with the captured charset value
        String charSetRegex = System.Text.RegularExpressions.Regex.Replace(meta,"<meta.*charset=([^\\s'\"]+).*","$1");

        // if meta tag has attributes encapsulated by double quotes
        String charSetDouble = ((meta.Split(new String[] { "charset=" }, StringSplitOptions.None))[1].Split('"'))[0];
        // if meta tag has attributes encapsulated by single quotes
        String charSetSingle = ((meta.Split(new String[] { "charset=" }, StringSplitOptions.None))[1].Split('\''))[0];

Any of the above should work. However, the String.Split calls are dangerous without first checking the length of the resulting array: if the tag contains no "charset=", indexing [1] throws an IndexOutOfRangeException, so you might want to break the expressions out into separate steps.
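The length check could look something like this, sketched in JavaScript rather than C# so it lines up with the first answer. charsetBySplit is a hypothetical name, and the sketch only covers the content="...;charset=..." style of tag.

```javascript
// Hypothetical helper: split-based extraction with a guard, so a tag
// without "charset=" yields null instead of an out-of-range access.
function charsetBySplit(meta) {
  const pieces = meta.split("charset=");
  if (pieces.length < 2) {
    return null; // marker not found
  }
  // Take everything up to the next single or double quote.
  const value = pieces[1].split(/["']/)[0];
  return value === "" ? null : value;
}
```

Note that `<meta charset="utf-8">` also comes back as null here, because the text directly after charset= is a quote and the first piece of the split is empty.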

Brian 2010-08-11 12:58:53

Answer 6

A:

My regex:

<meta[^>]*?charset=([^"'>]*)

My testcase:

<meta http-equiv="Content-type" content="text/html;charset=utf-8" />
<meta name="author" value="me"><!-- Maybe we should have a charset=something meta element? --><meta charset="utf-8">

C#-Code:

using System.Text.RegularExpressions;
string resultString = Regex.Match(sourceString, "<meta[^>]*?charset=([^\"'>]*)").Groups[1].Value;

RegEx-Description:

// <meta[^>]*?charset=([^"'>]*)
// 
// Match the characters "<meta" literally «<meta»
// Match any character that is not a ">" «[^>]*?»
//    Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
// Match the characters "charset=" literally «charset=»
// Match the regular expression below and capture its match into backreference number 1 «([^"'>]*)»
//    Match a single character NOT present in the list ""'>" «[^"'>]*»
//       Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»

Floyd 2010-08-11 14:08:22

I'll break this one too, because I'm bored: `<meta name="title" value="charset=utf-8 — is it really useful?"><meta charset="utf-8">`
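For anyone who wants to see it, a quick JavaScript check of the regex from this answer against that input (variable names are illustrative only):

```javascript
// The pattern from the answer, as a JavaScript regex literal.
const pattern = /<meta[^>]*?charset=([^"'>]*)/;

// The first meta element merely mentions "charset=" inside an attribute value.
const tricky = '<meta name="title" value="charset=utf-8 — is it really useful?"><meta charset="utf-8">';

// The lazy [^>]*? stops at the first "charset=" it can reach, which is the
// one inside value="...", so the group captures the rest of that attribute.
const captured = tricky.match(pattern)[1];
// captured is "utf-8 — is it really useful?"
```

The match never reaches the second meta element, because the first occurrence of charset= already satisfies the pattern.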

You 2010-08-11 15:11:13
