ansaurus

Question

Using C#, how can I manually validate an HTML tag?

Answer 1

+1 A:

Parsing the tag is the hardest part. Since you've already done that, all you have to do now is loop through the attributes and check each one against an array of valid ones. If an attribute isn't valid, check it against a dictionary of commonly misspelt items and replace or delete it as necessary.

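Something similar to:

// Assumes ImgTagAttributes is a List<string> holding the attribute names already parsed from the tag.
// Requires: using System.Collections.Generic; using System.Linq;
String[] ValidItems = {"alt", "src", "width", "height", "align", "border", "hspace", "longdesc", "vspace"};

Dictionary<String, String> MispeltItems = new Dictionary<String, String> { {"al", "alt" } };

// Iterate backwards so RemoveAt does not shift the items still to be visited.
for(int i = ImgTagAttributes.Count - 1; i >= 0; i--)
{
    var element = ImgTagAttributes[i];
    if(!ValidItems.Contains(element))
    {
        if(MispeltItems.ContainsKey(element))
        {
            // Known misspelling: swap in the correct name.
            ImgTagAttributes[i] = MispeltItems[element];
        }
        else
        {
            // Unknown attribute: drop it.
            ImgTagAttributes.RemoveAt(i);
        }
    }
}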

(I wrote this directly in Stack Overflow, so if there are any errors just say; it's only meant to give you the basic idea.)

Blam 2010-10-08 09:51:04

And what if I have an HTML file with 26000 rows, using all the different HTML tags?

Jeff Norman 2010-10-08 09:54:53

Well, you said you have a list of the valid attributes, so that could be done with, say, a Dictionary<string, string[]>, where the key is the tag and the value is its array of valid attributes. Finding misspelt items can't really be done automatically without a list of misspellings, because of ambiguities ("al" could be "alt" or "align", for example). Getting the rules into a C# format would be another thing entirely, which you haven't said enough about, so I'll ignore it for now.
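
A minimal sketch of that per-tag dictionary might look like the following. The class name TagRules, the helper IsValid and the short attribute list for "a" are assumptions made for illustration only, and the table is deliberately incomplete.

```csharp
using System;
using System.Collections.Generic;

public static class TagRules
{
    // Hypothetical, incomplete table: tag name -> attribute names accepted for that tag.
    public static readonly Dictionary<string, string[]> ValidAttributes =
        new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase)
        {
            { "img", new[] { "alt", "src", "width", "height", "align", "border", "hspace", "longdesc", "vspace" } },
            { "a",   new[] { "href", "name", "target", "title" } }
        };

    // True only when the tag is known and the attribute is listed for it.
    public static bool IsValid(string tag, string attribute)
    {
        string[] allowed;
        return ValidAttributes.TryGetValue(tag, out allowed)
            && Array.IndexOf(allowed, attribute.ToLowerInvariant()) >= 0;
    }
}
```

With this table, TagRules.IsValid("img", "src") is true, while "al" on an img tag, or any attribute on a tag missing from the table, is reported as invalid.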

Blam 2010-10-08 10:00:18

Answer 2

+1 A:

You could use LINQ to XML to parse the markup easily:

XElement doc = XElement.Parse(...)

Then correct the wrong attributes using a best-match algorithm against an in-memory dictionary of valid attributes.
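
As a rough C# sketch of the parsing step, the code below walks every element and reports attributes that are not in a caller-supplied list. AttributeScanner and FindUnknownAttributes are hypothetical names. Note that XElement.Parse only accepts well-formed XML, so HTML with unclosed tags or unquoted attribute values would throw an XmlException here.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;

public static class AttributeScanner
{
    // Returns "tag.attribute" for every attribute whose name is not in validAttributes.
    public static List<string> FindUnknownAttributes(string markup, ICollection<string> validAttributes)
    {
        // Throws System.Xml.XmlException when the markup is not well-formed XML.
        XElement root = XElement.Parse(markup);

        var unknown = new List<string>();
        foreach (XElement element in root.DescendantsAndSelf())
        {
            foreach (XAttribute attribute in element.Attributes())
            {
                if (!validAttributes.Contains(attribute.Name.LocalName))
                {
                    unknown.Add(element.Name.LocalName + "." + attribute.Name.LocalName);
                }
            }
        }
        return unknown;
    }
}
```

Each reported name can then be passed to whatever best-match routine is used to pick a replacement.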

Edit: I wrote and tested this simplified best-match algorithm (sorry, it's VB):

Dim validTags() As String =
            {
                "width",
                "height",
                "img"
            }

(simplified, you should create a more structured dictionary with tags and possible attributes for each tag)

Dim maxMatch As Integer = 0
Dim matchedTag As String = Nothing
For Each Tag As String In validTags
    Dim match As Integer = checkMatch(Tag, source)
    If match > maxMatch Then
        maxMatch = match
        matchedTag = Tag
    End If
Next

Debug.WriteLine("matched tag {0} matched % {1}", matchedTag, maxMatch)

The code above calls a method that determines what percentage of the source string matches each valid tag.

Private Function checkMatch(ByVal tag As String, ByVal source As String) As Integer

        If tag = source Then Return 100


        Dim maxPercentage As Integer = 0

        For index As Integer = 0 To tag.Length - 1

            Dim tIndex As Integer = index
            Dim sIndex As Integer = 0
            Dim matchCounter As Integer = 0

            While True
                If tag(tIndex) = source(sIndex) Then
                    matchCounter += 1
                End If

                tIndex += 1
                sIndex += 1

                If tIndex + 1 > tag.Length OrElse sIndex + 1 > source.Length Then
                    Exit While
                End If
            End While

            Dim percentage As Integer = CInt(matchCounter * 100 / Math.Max(tag.Length, source.Length))
            If percentage > maxPercentage Then maxPercentage = percentage
        Next

        Return maxPercentage

    End Function

Given a source string and a tag, the method above finds the best match percentage by comparing the two strings character by character.

Given "widt" as input, it finds "width" as the best match with an 80% match value.
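
Since the question asks for C#, here is a possible port of the VB routine, offered as a sketch rather than a tested translation. The names TagMatcher, CheckMatch and BestMatch are assumptions. It differs from the VB code in two ways: it returns 0 for an empty string instead of throwing, and it uses integer division, which truncates where VB's CInt rounds.

```csharp
using System;

public static class TagMatcher
{
    // Best percentage (0-100) of positions at which source agrees with tag,
    // trying every starting offset inside tag, as the VB version does.
    public static int CheckMatch(string tag, string source)
    {
        if (tag == source) return 100;
        // Guard that the VB code lacks: an empty string cannot match anything.
        if (string.IsNullOrEmpty(tag) || string.IsNullOrEmpty(source)) return 0;

        int longest = Math.Max(tag.Length, source.Length);
        int maxPercentage = 0;

        for (int start = 0; start < tag.Length; start++)
        {
            int matchCounter = 0;
            // Walk both strings in step until either one runs out.
            for (int t = start, s = 0; t < tag.Length && s < source.Length; t++, s++)
            {
                if (tag[t] == source[s]) matchCounter++;
            }

            // Integer division truncates; VB's CInt would round instead.
            int percentage = matchCounter * 100 / longest;
            if (percentage > maxPercentage) maxPercentage = percentage;
        }
        return maxPercentage;
    }

    // Returns the valid name with the highest score, or null when nothing matches at all.
    public static string BestMatch(string[] validNames, string source)
    {
        int maxMatch = 0;
        string matched = null;
        foreach (string name in validNames)
        {
            int match = CheckMatch(name, source);
            if (match > maxMatch)
            {
                maxMatch = match;
                matched = name;
            }
        }
        return matched;
    }
}
```

For example, TagMatcher.BestMatch(new[] { "width", "height", "img" }, "widt") picks "width", and CheckMatch("width", "widt") scores 80.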

vulkanino 2010-10-08 09:59:16

Best-match algorithm?

Jeff Norman 2010-10-08 10:06:58

Yes. You check each tag you find, and each attribute of each tag, against a list of valid names. If a tag or attribute name is not in the list, you count the characters in which it differs from each valid name, e.g. "widt" is different from "width" but it matches at 80%, so you can correct it.

vulkanino 2010-10-08 11:14:57

Can you please give me an example of this algorithm? Thank you.

Jeff Norman 2010-10-08 11:44:22

I've implemented your algorithm in C#, but if I have 'scr' instead of 'src' it gives me a 33% match, and if I have 'sr' instead of 'src' it gives me a 66% match... I think the best result in both cases would be 66%. I don't understand why the result is less than 50% if I omit a char in the middle of the word (or put it in the wrong place)...

Jeff Norman 2010-10-11 11:03:08

Giving 33% for scr is right: you only provided 1 of 3 correct chars, only the "s". The algo doesn't check for permutations of the same chars, and I think this is correct. Should left match felt? I don't think so: they're unrelated words, even though one is an anagram of the other. sr->src is also correct, I think: one word is 2/3 correct. The case you describe, omitting a letter in the middle of the source, is different: my algo compares letters sequentially, so if you have for example "middle" and "mixdle", you get 5 out of 6 matched letters, and that's right. But if you omit a letter...

vulkanino 2010-10-11 11:24:05

... then the comparison blows up. Comparing sanity->saity from the first letter you get 2 out of 6, because the missing letter shifts everything after it. If you want a better algo, you could change it and try to force the two words to be the same length, i.e. for sanity->saity add a letter to the shorter word in different positions: Xsaity, sXaity, saXity, saiXty, saitXy, saityX. The best match is with saXity, where you get 5 out of 6 :)

vulkanino 2010-10-11 11:28:42

I understand now :) ... but now I think adding another letter is not the best idea: it can be anything from a to z, so for a 10-char word (for example) there are 26 letters to add and verify in every position... I think the code will be very slow.

Jeff Norman 2010-10-11 11:46:13

No, you could add a single placeholder letter, like the X I gave in the example. Maybe choose an uncommon char, e.g. a § or a non-printable char. The placeholder only serves to line the two words up for the linear comparison.
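
One way to read the placeholder idea in C# is sketched below; it is an interpretation of these comments, not code from the answer. PlaceholderMatcher and MatchPercent are hypothetical names, and '§' is assumed never to occur in a real tag or attribute name. Because only one fixed character is inserted, a one-letter gap costs just (length + 1) comparisons rather than 26 per position.

```csharp
using System;

public static class PlaceholderMatcher
{
    // Assumed placeholder: any character that cannot occur in a tag or attribute name.
    private const char Placeholder = '\u00A7'; // the section sign

    // Percentage (0-100) of positions that agree once the shorter word has been
    // padded with placeholders to the length of the longer one, in the best place.
    public static int MatchPercent(string tag, string source)
    {
        string longer = tag.Length >= source.Length ? tag : source;
        string shorter = tag.Length >= source.Length ? source : tag;
        if (longer.Length == 0) return 100; // two empty strings are identical

        // Integer division truncates the percentage.
        return BestAlignment(longer, shorter) * 100 / longer.Length;
    }

    // Tries the placeholder at every position of the shorter word and recurses
    // until both words have the same length, then counts equal positions.
    private static int BestAlignment(string longer, string shorter)
    {
        if (shorter.Length == longer.Length)
        {
            int same = 0;
            for (int i = 0; i < longer.Length; i++)
            {
                if (longer[i] == shorter[i]) same++;
            }
            return same;
        }

        int best = 0;
        for (int pos = 0; pos <= shorter.Length; pos++)
        {
            string padded = shorter.Insert(pos, Placeholder.ToString());
            best = Math.Max(best, BestAlignment(longer, padded));
        }
        return best;
    }
}
```

With this version sanity->saity scores 5 out of 6 (83 after truncation), while scr->src stays at 33 because the words already have the same length and no placeholder is inserted. The recursion grows quickly when the lengths differ by several characters, which is acceptable for short attribute names but would need rethinking for long strings.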

vulkanino 2010-10-11 11:55:07
