I have a BizTalk project that imports an incoming CSV file and dumps it to a database table. The import works fine, but I only need to keep about 200-300 records from a file with upwards of a million rows. My orchestration discards the unwanted rows, but the problem is that the flat file I'm importing is still 250MB, and when it's converted to XML by a regular flat file pipeline, it takes hours to process and sometimes causes the server to run out of memory.

Is there something I can do to have the custom pipeline itself discard the rows I don't care about? The very first field in each CSV row is one of a few strings, and I only want to keep the rows that start with a certain one.

Thanks for any help you're able to provide.

+4  A: 

A custom pipeline component would certainly be the best solution, but it would need to execute in the Decode stage, before the disassembler component runs.

Making it fully streaming-enabled would be complex (though certainly doable). Depending on the size of the resulting trimmed CSV file, you could instead simply pre-process the entire input file as soon as your custom component runs, and either build the result in memory (in a MemoryStream) if it's small, or write it to a file and return the resulting FileStream to BizTalk to continue processing from there.
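
For illustration, here is a rough sketch of what the Execute method of such a Decode-stage component could look like. The class name, the KeepPrefix property and the CsvFilter.FilterCsv helper (sketched at the end of this thread) are all made up for the example, and the rest of the pipeline-component plumbing (IBaseComponent, IPersistPropertyBag, the component category attributes) is omitted:

    using System.IO;
    using Microsoft.BizTalk.Component.Interop;
    using Microsoft.BizTalk.Message.Interop;

    public class CsvTrimmerComponent : Microsoft.BizTalk.Component.Interop.IComponent
    {
        // Prefix identifying the rows worth keeping (illustrative design-time property).
        public string KeepPrefix { get; set; }

        public IBaseMessage Execute(IPipelineContext pContext, IBaseMessage pInMsg)
        {
            // Grab the raw CSV before the flat-file disassembler ever sees it.
            Stream original = pInMsg.BodyPart.GetOriginalDataStream();

            // Produce a much smaller stream containing only the wanted rows.
            Stream trimmed = CsvFilter.FilterCsv(original, KeepPrefix);

            // Hand the trimmed stream back to BizTalk; the resource tracker
            // disposes it once the pipeline is done with the message.
            pInMsg.BodyPart.Data = trimmed;
            pContext.ResourceTracker.AddResource(trimmed);
            return pInMsg;
        }
    }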

tomasr
I'm fine with processing the whole file up front, while it's still a CSV. I think it's the conversion to XML (and the subsequent parsing of the XmlDocument object that's handed to my orchestration) that's causing the trouble. If the pipeline could trim it to include only the rows I want, the resulting XML document would be under 1MB instead of roughly 250MB.
rwmnau
Then doing it in memory certainly sounds feasible. It really wouldn't be hard: the meat of the component would just be to create a StreamReader on top of the body stream, read it line by line, discard the lines that don't match, write the ones that do into a second stream, and then pass that stream down the line.
tomasr
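
A minimal sketch of the filter tomasr describes, using a hypothetical CsvFilter.FilterCsv helper (the one referenced in the Execute sketch above): read the original body stream line by line, keep only the rows that start with the wanted prefix, and return the result as a small MemoryStream for the flat-file disassembler to consume:

    using System.IO;

    public static class CsvFilter
    {
        public static Stream FilterCsv(Stream source, string keepPrefix)
        {
            var output = new MemoryStream();
            var writer = new StreamWriter(output);   // default encoding: UTF-8 without BOM
            var reader = new StreamReader(source);

            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Drop the vast majority of rows here, long before any XML conversion.
                if (line.StartsWith(keepPrefix))
                {
                    writer.WriteLine(line);
                }
            }

            // Flush but don't dispose the writer, so the MemoryStream stays open,
            // then rewind so the next pipeline stage reads from the start.
            writer.Flush();
            output.Position = 0;
            return output;
        }
    }

With a 250MB input trimmed down to a few hundred matching rows, the resulting stream is tiny, so keeping it in a MemoryStream (as suggested above) is reasonable; for larger results the same loop could write to a temporary FileStream instead.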