HTML code example:
<meta http-equiv="Content-type" content="text/html;charset=utf-8" />
I want to use a regex to extract the charset information (here, that is "utf-8").
(I'm using C#.)
I tried this in JavaScript by placing your string in a variable and calling match:
var x = '<meta http-equiv="Content-type" content="text/html;charset=utf-8" />';
var result = x.match(/charset=([a-zA-Z0-9-]+)/);
alert(result[1]);
Don't use regular expressions to parse (X)HTML! Use a proper tool, i.e. an SGML or XML parser. Your code looks like XHTML, so I'd try an XML parser. Once you have the content attribute from the meta element, however, a regex is more appropriate. That said, a plain string split at ; would certainly do the trick (and faster, too).
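A minimal C# sketch of that approach, assuming the meta element is well-formed XML (as in the XHTML example from the question). The class and method names here are made up for illustration; XElement.Parse throws an XmlException on markup that is not well-formed, so real-world HTML may need a more tolerant parser.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class MetaCharsetParser
{
    // Hypothetical helper: parses one well-formed <meta ... /> element and
    // returns the charset from its content attribute, or null if there is none.
    public static string GetCharset(string metaElement)
    {
        // Throws XmlException if the input is not well-formed XML.
        XElement meta = XElement.Parse(metaElement);

        // The explicit cast yields null when the attribute is missing.
        string content = (string)meta.Attribute("content");
        if (content == null) return null;

        // content looks like "text/html;charset=utf-8": split at ';' and
        // keep the part that starts with "charset=".
        const string prefix = "charset=";
        return content.Split(';')
            .Select(part => part.Trim())
            .Where(part => part.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            .Select(part => part.Substring(prefix.Length))
            .FirstOrDefault();
    }
}
```

Calling MetaCharsetParser.GetCharset with the meta element from the question should give back "utf-8"; an element without a charset in its content attribute gives null.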
For PHP:
// preg_match returns the match count; the captured groups go into the third argument
preg_match('/charset=([a-zA-Z0-9-]+)/', $line, $matches); $charset = $matches[1];
This regex:
<meta.*?charset=([^"']+)
should work. Using an XML parser to extract this is overkill.
I tend to agree with @You; however, I'll give you the answer you asked for plus some other solutions.
String meta = "<meta http-equiv=\"Content-type\" content=\"text/html;charset=utf-8\" />";
// Option 1: regex replace (returns the input unchanged if the pattern does not match)
String charSet = System.Text.RegularExpressions.Regex.Replace(meta,"<meta.*charset=([^\\s'\"]+).*","$1");
// Option 2: if the meta tag's attributes are enclosed in double quotes
String charSetDq = ((meta.Split(new String[] { "charset=" }, StringSplitOptions.None))[1].Split('"'))[0];
// Option 3: if the meta tag's attributes are enclosed in single quotes
String charSetSq = ((meta.Split(new String[] { "charset=" }, StringSplitOptions.None))[1].Split('\''))[0];
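Any of the above should work. The String.Split versions are dangerous, though, unless you first check the length of the resulting array: if "charset=" does not occur in the input, the array has only one element and indexing [1] throws an IndexOutOfRangeException. You may want to break the expression into separate steps so you can add that check.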
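Broken out into steps with a length check, the split approach could look like the sketch below. The class and method names are hypothetical, and stripping a leading quote and cutting at whitespace or '>' are additions of mine so that one method covers both quote styles.

```csharp
using System;

public static class CharsetSplitter
{
    // Hypothetical helper: returns the charset value, or null if the input
    // contains no "charset=" or the value after it is empty.
    public static string ExtractCharset(string meta)
    {
        string[] parts = meta.Split(new[] { "charset=" }, StringSplitOptions.None);
        if (parts.Length < 2) return null; // "charset=" not found, parts[1] would throw

        // Drop an opening quote if the value itself is quoted (assumption:
        // this covers the <meta charset="utf-8"> form as well).
        string tail = parts[1].TrimStart('"', '\'');

        // Cut at the first quote of either kind, space, or '>'.
        string value = tail.Split('"', '\'', ' ', '>')[0];
        return value.Length > 0 ? value : null;
    }
}
```

With the meta string from the snippet above, CharsetSplitter.ExtractCharset(meta) should return "utf-8", and an input without "charset=" returns null instead of throwing.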
My regex:
<meta[^>]*?charset=([^"'>]*)
My testcase:
<meta http-equiv="Content-type" content="text/html;charset=utf-8" />
<meta name="author" value="me"><!-- Maybe we should have a charset=something meta element? --><meta charset="utf-8">
C#-Code:
using System.Text.RegularExpressions;
string resultString = Regex.Match(sourceString, "<meta[^>]*?charset=([^\"'>]*)").Groups[1].Value;
RegEx-Description:
// <meta[^>]*?charset=([^"'>]*)
//
// Match the characters "<meta" literally «<meta»
// Match any character that is not a ">" «[^>]*?»
//    Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
// Match the characters "charset=" literally «charset=»
// Match the regular expression below and capture its match into backreference number 1 «([^"'>]*)»
//    Match a single character NOT present in the list ""'>" «[^"'>]*»
//       Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
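One thing to watch with the pattern above: for the <meta charset="utf-8"> form, the character right after charset= is a quote, so the match succeeds but group 1 is empty. A possible variant, sketched below, allows an optional opening quote before the capture group and also stops at whitespace. That change, the IgnoreCase option, and the class and method names are my assumptions, not part of the original pattern.

```csharp
using System.Text.RegularExpressions;

public static class MetaCharsetRegex
{
    // Same idea as <meta[^>]*?charset=([^"'>]*), plus an optional opening
    // quote (["']?) before the group and \s excluded from the captured value.
    private static readonly Regex Pattern = new Regex(
        "<meta[^>]*?charset=[\"']?([^\"'>\\s]*)",
        RegexOptions.IgnoreCase);

    // Returns the first charset found in a <meta> tag, or null if there is none.
    public static string Find(string html)
    {
        Match m = Pattern.Match(html);
        // Treat an empty capture the same as no match.
        return m.Success && m.Groups[1].Length > 0 ? m.Groups[1].Value : null;
    }
}
```

Because [^>]*? cannot cross a ">", the "charset=something" text inside the comment in the test case is still skipped, and both meta forms from the test case should yield "utf-8".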