Hello All,

I have a task to extract, transform, and import zipped binary files that contain both text data and embedded binary data. The data is relational in nature and needs to be processed into a defined database structure.

Currently I have a single-threaded C# app that grabs all the files from the directory (currently there are 13K files of varying sizes), extracts the data, and inserts it into the database line by line. As you can imagine, this is very slow and unacceptable. Several different parsing routines are used, depending on the header record in the file. There are potentially up to a million rows per file once all the data is extracted to the row level of detail. A follow-on task is to parse those rows into their appropriate tables based on their content, i.e. the textual content has to be parsed further into "buckets" of like data in the database.

That about sums up the big picture. Now for the problem task list.
How do I iterate through a packet of data using SSIS? In the app, the file is decompressed and parsed using streams and byte arrays, and each packet is routed to the required parsing routine based on its header data. There is bit swapping involved as well. Should I wrap the app code in one or more script tasks and let them do the custom processing? The data is separated by year, and the SQL Server tables are partitioned by year as well. I also need to be able to "catch" bad file data, which will most likely be processed by hand.
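To make the packet handling concrete, the routing step can be sketched roughly as follows. This is Python for brevity rather than the actual C# code, and the packet layout (a 1-byte type code followed by a 4-byte big-endian payload length), the type codes, and all function names are assumptions made up for illustration, not the real format. Any bit swapping would live inside the individual parsing routines.

```python
import struct

# Hypothetical packet layout (an assumption, not the real format):
# 1-byte type code, 4-byte big-endian payload length, then the payload.
HEADER = struct.Struct(">BI")


def parse_text(payload):
    """Hypothetical routine for text packets."""
    return ("text", payload.decode("ascii"))


def parse_binary(payload):
    """Hypothetical routine for embedded binary packets."""
    return ("binary", payload.hex())


# Header type code -> parsing routine (codes are made up for illustration).
ROUTINES = {0x01: parse_text, 0x02: parse_binary}


def route_packets(data):
    """Walk a decompressed buffer packet by packet.

    Returns (rows, bad): parsed rows, plus (offset, reason) entries for
    packets that could not be parsed and need review by hand.
    """
    rows, bad = [], []
    offset = 0
    while offset < len(data):
        if len(data) - offset < HEADER.size:
            bad.append((offset, "truncated header"))
            break
        kind, length = HEADER.unpack_from(data, offset)
        start = offset + HEADER.size
        payload = data[start:start + length]
        if len(payload) < length:
            bad.append((offset, "truncated payload"))
            break
        routine = ROUTINES.get(kind)
        if routine is None:
            bad.append((offset, f"unknown type {kind:#04x}"))
        else:
            try:
                rows.append(routine(payload))
            except ValueError as exc:  # includes decode errors
                bad.append((offset, str(exc)))
        offset = start + length
    return rows, bad


# Small usage example with one unknown packet type in the middle.
sample = (
    HEADER.pack(0x01, 5) + b"hello"
    + HEADER.pack(0x09, 2) + b"\x00\x01"
    + HEADER.pack(0x02, 2) + b"\xab\xcd"
)
rows, bad = route_packets(sample)
```

The point of the sketch is the shape of the loop: read a header, pick a routine from it, and set aside anything unparseable instead of aborting the whole file. That is the part I am unsure how to express in SSIS.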
Should I simply load the zipped file into SQL Server as a blob and parse it with T-SQL? Would that be multithreaded if done that way? I am not sure how to do the parsing involved here in T-SQL. Which do you think would be faster?
Potentially, the data that currently arrives as files could come to us via a socket instead. Can SSIS collect that data in real time? How would I go about setting that up?
Processing new files from the directories will become a daily task. I can manage the data once I get it into SQL Server. Getting it there in a timely fashion seems to be the long pole in the tent for me. I would appreciate any comments or suggestions from the group.
Rick