Hey all,

I'm trying to call the HtmlTidy library DLL from C#. There are a few examples floating around on the net but nothing definitive... and I'm having no end of trouble. I'm pretty certain the problem is with the P/Invoke declaration... but danged if I know where I'm going wrong.

I got libtidy.dll from http://www.paehl.com/open_source/?HTML_Tidy_for_Windows, which seems to be a current version.

Here's a console app that demonstrates the problem I'm having:

using System;
using System.Collections.Generic;
using System.Text;
using System.Runtime.InteropServices;

namespace ConsoleApplication5
{
    class Program
    {
        [StructLayout(LayoutKind.Sequential)]
        public struct TidyBuffer
        {
            public IntPtr bp;         // Pointer to bytes
            public uint size;         // # bytes currently in use
            public uint allocated;    // # bytes allocated
            public uint next;         // Offset of current input position
        };

        [DllImport("libtidy.dll")]
        public static extern int tidyBufAlloc(ref TidyBuffer tidyBuffer, uint allocSize);


        static void Main(string[] args)
        {
            Console.WriteLine(CleanHtml("<html><body><p>Hello World!</p></body></html>"));
        }

        static string CleanHtml(string inputHtml)
        {
            byte[] inputArray = Encoding.UTF8.GetBytes(inputHtml);
            byte[] inputArray2 = Encoding.UTF8.GetBytes(inputHtml);

            TidyBuffer tidyBuffer2;
            tidyBuffer2.size = 0;
            tidyBuffer2.allocated = 0;
            tidyBuffer2.next = 0;
            tidyBuffer2.bp = IntPtr.Zero;

            //
            // tidyBufAlloc overwrites inputArray2... why? how? seems like
            // tidyBufAlloc is stomping on the stack a bit too much... but
            // how? I've tried changing the calling convention to cdecl and
            // stdcall but no change.
            //
            Console.WriteLine((inputArray2 == null ? "Array2 null" : "Array2 not null"));
            tidyBufAlloc(ref tidyBuffer2, 65535);
            Console.WriteLine((inputArray2 == null ? "Array2 null" : "Array2 not null"));
            return "did nothing";
        }
    }
}

All in all I'm a bit stumped. Any help would be appreciated!

A: 

Try changing your tidyBufAlloc declaration to:

[DllImport("libtidy.dll", CharSet = CharSet.Ansi)]
private static extern int tidyBufAlloc(ref TidyBuffer Buffer, int allocSize);

Note the CharSet.Ansi addition and the "int allocSize" (instead of uint).

Also, see this sample code for an example of using HTML Tidy in C#.

In your example, if inputHtml is large, say 50K, inputArray and inputArray2 will also be 50K each.

You are then also trying to allocate 65K in the tidyBufAlloc call.

If a pointer is not initialised correctly, it is quite possible that a random .NET heap address is being used, which would overwrite part or all of a seemingly unrelated variable or buffer. It is probably just luck, or the fact that you have already allocated large buffers, that you are not overwriting a code block, which would likely cause an invalid memory access error.

Ash
Doesn't seem to have made any difference. I'd also tried the codejedi sample code... but it seemed to have problems of its own. Note that the current allocation of inputArray/inputArray2 is just for illustrative purposes (i.e. I've boiled it down to a concrete repro... this is not the end code).
Have you tried an earlier version of TidyLib? Could be a recently introduced bug.
Ash
+2  A: 

For what it's worth, we tried Tidy at work and switched to HtmlAgilityPack.

Chris
I'd second using HtmlAgilityPack over Tidy. It's not specifically for pretty printing HTML, but you can easily do that if you need to. Tidy is probably better as a tool you use interactively in your editor, as opposed to embedded in a .NET app.
Ash
I guess I was a bit concerned about whether or not HtmlAgilityPack has a strong background in handling/detecting badly formed HTML. I have hacked together a quick and dirty sanitizer using HtmlAgilityPack, but my feeling is that HtmlAgilityPack could well be the weak spot. Thoughts?
The major feature of HtmlAgilityPack is handling and parsing badly formed HTML. For .NET, I still haven't seen a better option. The full source is freely available, so if you see any weak spot, you're able to fix it (e.g. I've added StringBuilder to my version to improve string concatenation performance). Whenever you plan to run arbitrary HTML through a parser, you need to do lots of testing on real-world data. There is no easy way.
Ash
+2  A: 

You are working with an old definition of the TidyBuffer structure. The new structure is larger, so when you call the allocate method the library writes past the end of your struct and overwrites the stack location for inputArray2. The new definition is:

    [StructLayout(LayoutKind.Sequential)]
    public struct TidyBuffer
    {
        public IntPtr allocator;  // Pointer to custom allocator
        public IntPtr bp;         // Pointer to bytes
        public uint size;         // # bytes currently in use
        public uint allocated;    // # bytes allocated
        public uint next;         // Offset of current input position
    };
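
If you want to see the size mismatch for yourself, here is a minimal sketch that compares the marshaled size of the two layouts. It is only an illustration and does not call libtidy.dll at all; the names TidyBufferLayout, OldTidyBuffer and NewTidyBuffer are made up for this example and are not part of Tidy.

```csharp
using System;
using System.Runtime.InteropServices;

static class TidyBufferLayout
{
    // Hypothetical name: the four-field layout used in the question.
    [StructLayout(LayoutKind.Sequential)]
    public struct OldTidyBuffer
    {
        public IntPtr bp;         // Pointer to bytes
        public uint size;         // # bytes currently in use
        public uint allocated;    // # bytes allocated
        public uint next;         // Offset of current input position
    }

    // Hypothetical name: the layout with the extra allocator pointer in front.
    [StructLayout(LayoutKind.Sequential)]
    public struct NewTidyBuffer
    {
        public IntPtr allocator;  // Pointer to custom allocator
        public IntPtr bp;         // Pointer to bytes
        public uint size;         // # bytes currently in use
        public uint allocated;    // # bytes allocated
        public uint next;         // Offset of current input position
    }

    // Size in bytes that the marshaler hands to native code for each layout.
    public static int OldSize() { return Marshal.SizeOf(typeof(OldTidyBuffer)); }
    public static int NewSize() { return Marshal.SizeOf(typeof(NewTidyBuffer)); }

    static void Main()
    {
        Console.WriteLine("old layout: {0} bytes", OldSize());
        Console.WriteLine("new layout: {0} bytes", NewSize());
        // The shortfall is one pointer: native code that expects the new
        // layout can write this many bytes beyond a struct declared the old way.
        Console.WriteLine("shortfall:  {0} bytes", NewSize() - OldSize());
    }
}
```

The shortfall works out to IntPtr.Size, so it differs between 32-bit and 64-bit processes. Every field after the allocator pointer is also shifted by that amount, which is why values written by the library land in the wrong places with the old declaration.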
Stephen Martin
Thanks a ton! That solved the problem nicely. Sorry for the delay... I got caught up in something else.