I'm trying to clean some HTML with libtidy (in C). The problem is this:

I want to construct a TidyDoc (a tree-like structure) with tidyParseBuffer().

tidyParseFile() works without problems. As for tidyParseBuffer(), I'm sure I read the file correctly and that the TidyBuffer structure I pass to tidyParseBuffer() is correctly filled.

Any ideas?

Here is the code:

    // declarations omitted
    tidyInput = malloc(sizeof(TidyBuffer));
    tidyOutput = malloc(sizeof(TidyBuffer));
    do {
        len = fread(pbInputData, 1, nInputData, h->file);
        tidyBufAttach(tidyInput, (void*)pbInputData, len);
        tidyParseBuffer(h->doc, tidyInput);  // doc is the TidyDoc
    } while (len >= nInputData);
    tidyOptSetBool(h->doc, TidyForceOutput, yes);

    tidySaveFile(h->doc, "C://test.xhtml");

I have simplified the code.

+1  A: 

The problem stems from the fact that you are trying to parse the contents of a file in chunks, reading each chunk into a buffer and calling tidyParseBuffer() for each chunk.

The tidyParseXxx() functions operate by parsing the whole input in a single call, so to do what you want you should take a look at TidyInputSource and tidyParseSource().

Matthew Murdoch
Thanks! That looks like a good idea.
Pierre Guilbert
It's a bit more complicated to set up but it sounds like the implementation of `tidyParseFile()` uses exactly this mechanism.
Matthew Murdoch
[Edit] I spent the day looking at the tidylib code; I'll keep it short: all the `tidyParseXXX()` functions call `TY_(DocParseStream)`, and `DocParseStream()` in turn calls things like `TY_(FreeNode)(doc, ` and `TidyClearMemory(`, which appears to discard whatever was parsed before. So it seems that I must fill in the TidyInputSource struct completely before I call tidyParseSource. It works much like tidyParseBuffer, though tidyParseSource relies on a more user-defined struct. Thanks for the hint about `tidyParseFile()`; it pointed me in the right direction. I guess I have to read the whole file before parsing it.
Pierre Guilbert
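
For anyone landing here later, the "read the whole file first" conclusion could be sketched roughly as follows. This is only a plain C helper, not part of libtidy: the name `read_all` and the 4096-byte starting capacity are assumptions made for illustration. It grows one heap buffer until the stream is exhausted, so the parser can be handed the complete input in a single call.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not a libtidy function): read everything that
 * remains in `fp` into one contiguous heap buffer.
 * On success returns the buffer (the caller frees it) and stores the
 * number of bytes read in *out_len; on failure returns NULL. */
static unsigned char *read_all(FILE *fp, size_t *out_len)
{
    size_t cap = 4096;   /* arbitrary starting capacity */
    size_t len = 0;
    unsigned char *data = malloc(cap);
    if (data == NULL)
        return NULL;

    for (;;) {
        /* Fill whatever space is left in the buffer. */
        size_t n = fread(data + len, 1, cap - len, fp);
        len += n;
        if (len < cap)   /* short read: end of file or read error */
            break;

        /* Buffer is full: double it and keep reading. */
        unsigned char *grown = realloc(data, cap * 2);
        if (grown == NULL) {
            free(data);
            return NULL;
        }
        data = grown;
        cap *= 2;
    }

    if (ferror(fp)) {
        free(data);
        return NULL;
    }
    *out_len = len;
    return data;
}
```

With the whole file in memory, the loop in the question should collapse to one tidyBufAttach() followed by one tidyParseBuffer(). It is probably also worth calling tidyBufInit() on the TidyBuffer before attaching, since malloc() leaves the struct uninitialised, and detaching the buffer with tidyBufDetach() before freeing the data yourself.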