Hi,

I have for example this image tag:

<img src="http://... .jpg" al="myImage" hhh="aaa" />

and I maintain, for a generic image tag, the list of all valid attributes:

L1=(alt, src, width, height, align, border, hspace, longdesc, vspace)

I am parsing the img tag and I am getting the used attributes like this:

L2=(src, al, hhh)

How can I programmatically validate the image tag, so that the 'al' attribute becomes 'alt' ('alt' is a closer match than 'align', which has many more characters) and the 'hhh' attribute disappears (because no valid attribute resembles it)?

The resulting tag should look like this:

<img src="http://... .jpg" alt="myImage" />

Thanks.

Jeff

+1  A: 

Parsing the tag is the hardest part. Seeing as you've done that, all you have to do now is loop through the attributes and check each one against an array of valid ones. If an attribute isn't valid, check it against a dictionary of commonly misspelt items and replace or delete it as necessary.

Something similar to:

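// Requires: using System.Collections.Generic; using System.Linq;
string[] ValidItems = {"alt", "src", "width", "height", "align", "border", "hspace", "longdesc", "vspace"};

Dictionary<string, string> MispeltItems = new Dictionary<string, string> { {"al", "alt"} };

// ImgTagAttributes is assumed to be a List<string> holding the parsed attribute names.
// Iterate backwards so that RemoveAt does not shift the indices still to be visited.
for (int i = ImgTagAttributes.Count - 1; i >= 0; i--)
{
    var element = ImgTagAttributes[i];
    if (!ValidItems.Contains(element))
    {
        if (MispeltItems.ContainsKey(element))
        {
            // Replace the misspelt name with its known correction.
            ImgTagAttributes[i] = MispeltItems[element];
        }
        else
        {
            // No known correction: drop the attribute.
            ImgTagAttributes.RemoveAt(i);
        }
    }
}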

(I wrote this directly in the Stack Overflow editor, so if there are any errors just say; it's only meant to give you the basic idea.)

Blam
And what if I have an HTML file with 26,000 rows that uses all sorts of different HTML tags?
Jeff Norman
Well, you said you have a list of the valid attributes, so that could be done with, say, a Dictionary<string, string[]>, where the key is the tag and the value is its valid attributes. Finding misspelt items can't really be done automatically without a list of known misspellings, because of ambiguities ("al" could be "alt" or "align", for example). Getting the HTML into a C# structure is another thing entirely, and you haven't said enough about it, so I'll ignore it for now.
Blam
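The per-tag lookup described in the comment above could be sketched roughly as follows (in Python for brevity). The names `VALID_ATTRIBUTES`, `KNOWN_MISSPELLINGS` and `clean_attributes` are hypothetical, the entry for the "a" tag is only an illustrative subset, and "vspace" is assumed to be the intended spelling of the last img attribute.

```python
# Hypothetical lookup tables: tag name -> set of valid attribute names.
VALID_ATTRIBUTES = {
    "img": {"alt", "src", "width", "height", "align", "border",
            "hspace", "longdesc", "vspace"},
    "a": {"href", "name", "target", "title"},  # illustrative subset only
}

# Hand-maintained corrections, needed because "al" alone is ambiguous.
KNOWN_MISSPELLINGS = {"al": "alt"}


def clean_attributes(tag, attributes):
    """Keep valid names, fix known misspellings, drop everything else."""
    valid = VALID_ATTRIBUTES.get(tag, set())  # unknown tag: nothing is valid
    cleaned = []
    for name in attributes:
        if name in valid:
            cleaned.append(name)
        elif KNOWN_MISSPELLINGS.get(name) in valid:
            # The correction only applies if it is valid for this tag.
            cleaned.append(KNOWN_MISSPELLINGS[name])
        # Otherwise the attribute is silently dropped.
    return cleaned


print(clean_attributes("img", ["src", "al", "hhh"]))  # ['src', 'alt']
```

Because each tag is a single dictionary lookup and each attribute a set lookup, the same function can be called for every tag in a large file without rescanning the whole list.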
+1  A: 

You could use LINQ to XML to parse the markup easily:

XElement doc = XElement.Parse(...)

Then correct the wrong attributes by running a best-match algorithm against an in-memory dictionary of valid attributes.

Edit: I wrote and tested this simplified best-match algorithm (sorry, it's VB):

Dim validTags() As String =
            {
                "width",
                "height",
                "img"
            }

(simplified, you should create a more structured dictionary with tags and possible attributes for each tag)

Dim maxMatch As Integer = 0
Dim matchedTag As String = Nothing
For Each Tag As String In validTags
    Dim match As Integer = checkMatch(Tag, source)
    If match > maxMatch Then
        maxMatch = match
        matchedTag = Tag
    End If
Next

Debug.WriteLine("matched tag {0} matched % {1}", matchedTag, maxMatch)

The above code calls a method that determines, as a percentage, how closely the source string matches each valid tag.

Private Function checkMatch(ByVal tag As String, ByVal source As String) As Integer

    ' An exact match needs no further work.
    If tag = source Then Return 100

    ' Guard: indexing into an empty string below would throw.
    If String.IsNullOrEmpty(tag) OrElse String.IsNullOrEmpty(source) Then Return 0

    Dim maxPercentage As Integer = 0

    ' Try aligning the start of source with every position in tag.
    For index As Integer = 0 To tag.Length - 1

        Dim tIndex As Integer = index
        Dim sIndex As Integer = 0
        Dim matchCounter As Integer = 0

        ' Count the positions where both strings hold the same character.
        While tIndex < tag.Length AndAlso sIndex < source.Length
            If tag(tIndex) = source(sIndex) Then
                matchCounter += 1
            End If

            tIndex += 1
            sIndex += 1
        End While

        ' Score relative to the longer of the two strings.
        Dim percentage As Integer = CInt(matchCounter * 100 / Math.Max(tag.Length, source.Length))
        If percentage > maxPercentage Then maxPercentage = percentage
    Next

    Return maxPercentage

End Function

The above method, given a source string and a tag, finds the best match percentage by comparing the two strings character by character, trying each starting position in the tag.

Given "widt" as input, it finds "width" as the best match with an 80% match value.
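To tie this back to the question, here is a minimal Python sketch of the same position-by-position scoring, applied to the img attribute list. The names `check_match`, `best_match` and `fix_attributes`, and the 50% cut-off in `THRESHOLD`, are assumptions made for illustration and are not part of the VB code above; "vspace" is assumed to be the intended spelling of the last attribute.

```python
VALID_ATTRIBUTES = ["alt", "src", "width", "height", "align",
                    "border", "hspace", "longdesc", "vspace"]
THRESHOLD = 50  # assumed cut-off: weaker matches are treated as "no match"


def check_match(tag, source):
    """Return a 0-100 score for how well source lines up with tag."""
    if tag == source:
        return 100
    if not tag or not source:
        return 0
    longest = max(len(tag), len(source))
    best = 0
    for offset in range(len(tag)):
        # zip stops at the end of the shorter sequence, like the While loop.
        matches = sum(t == s for t, s in zip(tag[offset:], source))
        best = max(best, round(matches * 100 / longest))
    return best


def best_match(source, candidates):
    """Return (candidate, score) for the highest-scoring candidate."""
    best_tag, best_score = None, 0
    for tag in candidates:
        score = check_match(tag, source)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag, best_score


def fix_attributes(names):
    """Replace each name by its best match, dropping weak matches."""
    fixed = []
    for name in names:
        tag, score = best_match(name, VALID_ATTRIBUTES)
        if score >= THRESHOLD:
            fixed.append(tag)
    return fixed


print(check_match("width", "widt"))            # 80
print(fix_attributes(["src", "al", "hhh"]))    # ['src', 'alt']
```

With these scores "al" matches "alt" at 67% and "align" at only 40%, so "alt" wins, while "hhh" never scores above 20% and is dropped. The threshold is a tuning choice, not something the answer prescribes.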

vulkanino
Best-match algorithm?
Jeff Norman
Yes: you check each tag you find, and each attribute of each tag, against a list of valid names. If a tag or attribute name is not in the list, you count the characters in which it differs from each valid name, e.g. "widt" is different from "width" but matches at 80%, so you can correct it.
vulkanino
Can you please give me an example of this algorithm? Thank you.
Jeff Norman
I've implemented your algorithm in C#, but if I have 'scr' instead of 'src' it gives me a 33% match, and if I have 'sr' instead of 'src' it gives me a 66% match... I think the best result would be 66% in both cases. I don't understand why, if I omit a char in the middle of the word (or misplace it), the result is less than 50%...
Jeff Norman
Giving 33% for scr is right: you only provided 1/3 correct chars, only the "s". The algorithm doesn't check for permutations of the same chars, and I think this is correct. Should "left" match "felt"? I don't think so; they're unrelated words, even if one is an anagram of the other. Also, sr->src is correct I think: the word is 2/3 correct. The case where you omit a letter in the middle of the source is different: my algorithm compares letters sequentially, so if you have for example "middle" and "mixdle", you get 5 out of 6 matched letters, and that's right. But if you omit a letter...
vulkanino
... then the comparison blows up. With sanity->saity, comparing from the first letter, you get 2 out of 6, because the missing letter shifts everything after it. If you want a better algorithm, you could change it and try to force the two words to be the same length, i.e. sanity->saity would add a letter to the shorter word, in different positions: Xsaity, sXaity, saXity, saiXty, saitXy, saityX. The best match is saXity, which gets you 5 out of 6 :)
vulkanino
I understand now :) ... But now I think adding another letter is not the best idea: it can be anything from a to z, so for a 10-char word (for example) there are 26 letters to add and verify at every position... I think the code will be very slow.
Jeff Norman
No, you could add a placeholder letter, like the X I gave in the example. Maybe choose an uncommon char, e.g. a § or a non-printable char. The placeholder only serves to keep the linear comparison aligned.
vulkanino
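The placeholder idea from the comments above could be sketched like this in Python. It is only a rough illustration under stated assumptions: the names `PLACEHOLDER`, `positional_matches` and `padded_match` are hypothetical, the sketch only pads when the two words differ in length by exactly one character, and it compares from the first letter only.

```python
PLACEHOLDER = "\u00a7"  # "§": assumed never to occur in a real attribute name


def positional_matches(a, b):
    """Count positions at which both strings hold the same character."""
    return sum(x == y for x, y in zip(a, b))


def padded_match(tag, source):
    """Percentage match, padding the shorter word with one placeholder."""
    shorter, longer = sorted((tag, source), key=len)
    if not longer:
        return 100  # both strings are empty
    candidates = [shorter]
    if len(longer) - len(shorter) == 1:
        # Try the placeholder at every position: len(shorter) + 1 candidates.
        candidates = [shorter[:i] + PLACEHOLDER + shorter[i:]
                      for i in range(len(shorter) + 1)]
    best = max(positional_matches(c, longer) for c in candidates)
    return round(best * 100 / len(longer))


print(padded_match("sanity", "saity"))   # 83, i.e. 5 out of 6 via "sa§ity"
print(padded_match("middle", "mixdle"))  # 83, same length so no padding
```

Since a single placeholder stands in for any missing letter, a 10-character word needs only 11 candidate strings rather than one per letter of the alphabet per position, which should keep the cost modest.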