ansaurus

Question

Answer 1

A:

I think your approach is too simplistic. Parsing HTML with regular expressions can be much more difficult than you think. I suggest you take a look at this question.

2009-03-20 23:23:40

Using a third-party framework for such a task would be "much more" too. I know that HTML Agility Pack is quite powerful, but I'll use it only if it's really necessary.

Jaded 2009-03-21 09:31:22

Answer 2

A:

I suggest you take 10 minutes to learn about XSLT at www.w3schools.com. It's very simple and will help you a lot when performing XML transformations on your HTML content.

Chen Kinnrot 2009-03-20 23:49:37

Answer 3

A:

I know that XPath would be a perfect fit for such problem

Quite so. Or any other XML parser-based technique, such as DOM methods.

It's really not a hard thing to learn: stuff your string into the XmlDocument.LoadXml() method, then call SelectNodes() on it with something like '//tagname[@attrname]' to get a list of elements with the unwanted attribute. Peasy.
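A minimal sketch of the XmlDocument approach described above, assuming the input is well-formed XML with no DOCTYPE and no default namespace (a default xmlns would require an XmlNamespaceManager and a prefixed XPath). The class and method names are hypothetical, chosen only for illustration. Note that XPath name tests are case-sensitive, so the tag and attribute names must match the document exactly.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class AttributeStripper   // hypothetical helper, not from the question
{
    // Removes the named attribute from every element with the given tag name.
    // Assumes well-formed XML without a default namespace.
    public static string Strip(string xml, string tag, string attribute)
    {
        var doc = new XmlDocument();
        doc.PreserveWhitespace = true;   // keep the original whitespace between nodes
        doc.LoadXml(xml);

        // Collect the matching elements first, then modify them.
        var targets = new List<XmlElement>();
        foreach (XmlNode node in doc.SelectNodes("//" + tag + "[@" + attribute + "]"))
        {
            targets.Add((XmlElement)node);
        }
        foreach (XmlElement element in targets)
        {
            element.RemoveAttribute(attribute);
        }
        return doc.OuterXml;
    }
}
```

Calling AttributeStripper.Strip(source, "input", "onMouseOver") would return the serialized document with that attribute removed from every input element.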

i'm trying to use Regular Expressions to solve this problem, with no success

What is it with regexes? People keep using them even when they know it's the wrong thing, even though they're frequently unreadable and difficult to get right (as the endless “why doesn't my regex work?” questions demonstrate).

So what's so attractive about the damned things? There are several questions on SO every day about parsing [X][HT]ML with regex, all answered “don't use regex, regex is not powerful enough to parse HTML”. But somehow it never gets through.

I guess it's a mistake in pattern...

Well, the pattern appears to be trying to match entire tags to replace with an empty string, which isn't what you want. Instead you'd want to target just the attribute. Then, to ensure only attributes inside a “<tag ...>” counted, you'd have to use a lookbehind assertion such as “(?<=<tag )”. But you usually can't have a variable-length lookbehind assertion, which you would need to allow other attributes to come between the tag name and the targeted attribute.

Also your ‘\S+’ clause has the potential to gobble up large amounts of unintended content. As you've got well-formed XHTML, you're guaranteed properly quoted attributes, so you don't need that anyway.
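For illustration only: the .NET regex engine is one of the few that does accept variable-length lookbehind, so the attribute-only pattern discussed above can be sketched in C#. This assumes double-quoted attribute values and no '>' inside any attribute value; the class and method names are hypothetical. It also matches the quoted value explicitly instead of using ‘\S+’.

```csharp
using System.Text.RegularExpressions;

public static class LookbehindDemo   // hypothetical helper, not from the question
{
    // Deletes every occurrence of the attribute that sits inside a <tag ...> opening tag.
    // Relies on .NET allowing variable-length lookbehind.
    public static string RemoveAttribute(string source, string tag, string attribute)
    {
        string pattern =
            @"(?<=<" + Regex.Escape(tag) + @"\b[^>]*)"   // preceded by "<tag ..." with no '>' in between
            + @"\s+" + Regex.Escape(attribute)            // whitespace, then the attribute name
            + @"=""[^""]*""";                             // ="value", stopping at the closing quote
        return Regex.Replace(source, pattern, "");
    }
}
```

This still inherits every weakness of regex-based markup handling; it only shows that the lookbehind restriction does not apply in this particular engine.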

But the mistake is not the pattern. It is regex.

bobince 2009-03-20 23:51:01

Regexes are the best solution, for the right problem.

Henk Holterman 2009-03-20 23:53:25

Sure. Regex are useful for many problems. But if the questions on SO are anything to go by — and judging by the amount of real-world coding horror I've seen, they probably are — a majority of regex usage is totally inappropriate.

bobince 2009-03-20 23:57:02

Well... I thought regular expressions were better than something like the following: Source.Substring(Source.IndexOf(Attribute), Attribute.Length + ParameterLength) or similar. Plus, the document I'm working with appears not to be fully XHTML compliant. It has the XML namespace included, but fails validation.

Jaded 2009-03-21 09:40:49

“Validation” is not important for processing it as XML, it only has to be “well-formed”. Otherwise, there are HTML parsers such as the Agility Pack that are still much, much easier than trying to hack out a regex.

bobince 2009-03-21 13:06:02

Answer 4

A:

That's an interesting approach, but like bobince said, you can only process one attribute per match. This regex will match everything up to the attribute you're interested in:

@"(<{0}\b[^>]*?\b){1}=""(?:[^""]*)"""

Then you use "$1" as your replacement string to plug back in everything but the attribute.

This approach requires you to make a separate pass over the string for each of your target tag/attribute pairs, and at the beginning of each pass you have to create and compile the regex. Not very efficient, but if the string isn't too large it should be okay. A much bigger problem is that it won't catch duplicate attributes; if there are two "onmouseover" attributes on a button, you'll only catch the first one.

If I were doing this in C# I would probably use the regex to match the target tag, then use a MatchEvaluator to remove all of the target attributes at once. But seriously, if the string really is well-formed XML, there's no excuse for not using XML-specific tools to process it--this is what XML was invented for.
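A rough sketch of the MatchEvaluator idea, under the assumptions that attribute values are double-quoted and contain no '>' characters. The class name and both patterns are illustrative choices, not a definitive implementation. The outer regex finds each opening tag; the evaluator then strips every copy of the attribute from that tag, so duplicates are handled in one pass.

```csharp
using System.Text.RegularExpressions;

public static class TagAttributeCleaner   // hypothetical helper, not from the question
{
    public static string Process(string source, string tag, string attribute)
    {
        // Outer pattern: one whole opening tag, e.g. <input ...>
        var tagPattern = new Regex("<" + Regex.Escape(tag) + @"\b[^>]*>");

        // Inner pattern: leading whitespace, the attribute name, and its quoted value
        var attrPattern = new Regex(@"\s+" + Regex.Escape(attribute) + @"\s*=\s*""[^""]*""");

        // The lambda acts as the MatchEvaluator: it receives each matched tag
        // and returns it with all occurrences of the attribute removed.
        return tagPattern.Replace(source, m => attrPattern.Replace(m.Value, ""));
    }
}
```

Both patterns are case-sensitive as written; passing RegexOptions.IgnoreCase to the Regex constructors would be one way to relax that.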

Alan Moore 2009-03-21 05:25:09

It seems the closing round bracket of the group is missing (the regex doesn't compile). Fixed expression: @"(<{0}\b[^>]*?\b)({1}=""(?:[^""]*)"")"

Jaded 2009-03-21 09:46:08

And, of course, thanks a lot; your hint is actually what I needed.

Jaded 2009-03-21 10:37:43

Oops. Actually, the opening round bracket just before the {1} shouldn't be there. There's no point capturing the attribute, since all you're doing is deleting it.

Alan Moore 2009-03-21 15:30:08

I keep forgetting we can edit our answers forever on SO. Regex fixed.

Alan Moore 2009-03-22 04:26:16

Answer 5

A:

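So, the rewritten code is:

// Requires: using System.Text.RegularExpressions;
// Removes the first occurrence of Attribute from every <Tag ...> element in Source.
public static string Process(string Source, string Tag, string Attribute)
{
    string pattern = string.Format(@"(<{0}\b[^>]*?\b)({1}=""(?:[^""]*)"")", Tag, Attribute);
    return Regex.Replace(Source, pattern, "$1");
}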

I've tested it, and it works fine.

string before = @"<input type=""text"" name=""Input"" id=""Input"" onMouseOver=""some js to be eliminated1""/>"
  + "\r\n" + @"<input type=""text"" name=""Input2"" id=""Input2"" onMouseOver=""some js to be eliminated2"">"
  + "\r\n" + @"<input type=""text"" name=""Input3"" id=""Input3"" onMouseOver=""some js to be eliminated3"">";   
string after = Process(before, "input", "onMouseOver");
//<input type="text" name="Input" id="Input" />
//<input type="text" name="Input2" id="Input2" >
//<input type="text" name="Input3" id="Input3" >

For now the problem is solved. I'd like to try an XML-based approach, but it seems that before creating an XmlDocument I need to rework the input HTML again, because according to the W3C validator it has errors. It starts as follows:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
 <HTML xmlns="http://www.w3.org/1999/xhtml">
 <HEAD>
 <TITLE>page title</TITLE>

On LoadXml I get a System.Xml.XmlException saying that the '>' character is not acceptable at line 1, position 63. Adding the document type definition URL causes the same exception, but this time it says that '--' is incorrect and '>' was expected.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/strict.dtd">

Any ideas? Or should I let it go?

Jaded 2009-03-21 10:36:02

If it says <HTML> in upper case, it's not XHTML — probably the original legacy-HTML doctype is more appropriate and the ‘xmlns’ is just lies.

bobince 2009-03-21 13:08:05

(And we can't see it from the input posted, but the error about ‘--’ is usually a sign of a broken comment, such as one with ‘--’ inside its body, which is invalid in both HTML and XHTML, but will be handled OK by browsers and the Agility Pack.)

bobince 2009-03-21 13:09:14


C# - Processing html tag attributes