Step 1: Load data via BULK INSERT from a delimited .txt file into Table_1 (no indexes, etc.):
BULK INSERT Table_1
FROM '\\path\to_some_file.txt'
WITH (TABLOCK, FORMATFILE = 'format_file_path.xml');
Via the format file I map the target column data types to avoid further conversions (from char to int, for example).
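For reference, a minimal sketch of such an XML format file, assuming a tab-delimited file with two integer columns (the field IDs, terminators, lengths, and column names are placeholders, not the actual file layout):

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <!-- Fields as they physically appear in the text file (character data) -->
    <FIELD ID="1" xsi:type="CharTerminated" TERMINATOR="\t"   MAX_LENGTH="12"/>
    <FIELD ID="2" xsi:type="CharTerminated" TERMINATOR="\r\n" MAX_LENGTH="12"/>
  </RECORD>
  <ROW>
    <!-- Target columns typed as SQLINT, so conversion happens during the load -->
    <COLUMN SOURCE="1" NAME="col1" xsi:type="SQLINT"/>
    <COLUMN SOURCE="2" NAME="col2" xsi:type="SQLINT"/>
  </ROW>
</BCPFORMAT>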
Step 2: Output the result (perhaps NOT all of the columns from Table_1) into another table, Table_2, keeping only the DISTINCT values from Table_1.
NB! Table_1 holds about 20 million records (per load).
What we have now (simplified example):
SELECT DISTINCT CONVERT(int, col1) AS col1, CONVERT(int, col2) AS col2, col3, ...
INTO Table_2
FROM Table_1;
It takes about 3.5 minutes to process. Could you advise some best practices that may help reduce the processing time and put only UNIQUE records into Table_2?
Thanks in advance!
UPD 1: Sorry for the misunderstanding - I meant that the SELECT DISTINCT query takes 3.5 minutes. The BULK INSERT itself is already well optimized - it loads via 8 threads (8 separate .txt files bulk-inserted into 1 table WITH (TABLOCK)) and imports 20 million records in about 1 minute.
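To illustrate, a sketch of that parallel load pattern (the file name is a placeholder; each of the 8 statements runs on its own connection). With TABLOCK on a heap with no indexes, SQL Server takes a bulk-update (BU) lock, which is what allows several BULK INSERT statements to load the same table concurrently:

-- Connection/thread 1 (threads 2..8 run the same statement with their own file):
BULK INSERT Table_1
FROM '\\path\to_some_file_part1.txt'
WITH (TABLOCK, FORMATFILE = 'format_file_path.xml');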
UPD 2: I tested different approaches (I didn't test SSIS - in our application that approach won't work). The best result came from bulk-inserting the data already in Table_2's format (column order and data types match), which eliminates the data type converts, and then running just a "plain" DISTINCT:
SELECT DISTINCT * INTO Table_2 FROM Table_1;
That gives 70 seconds of processing, so I consider it the best result I can get for now. I also tried a couple of other techniques (an additional ORDER BY, CTE window grouping, etc.) - they were all worse than the "plain" DISTINCT.
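For reference, the "CTE window grouping" variant looked roughly like this (a sketch, assuming col1..col3 form the deduplication key) - it removes duplicates via ROW_NUMBER() but, as noted, ran slower than the plain DISTINCT:

-- Keep one arbitrary row per unique (col1, col2, col3) combination
;WITH numbered AS (
    SELECT col1, col2, col3,
           ROW_NUMBER() OVER (PARTITION BY col1, col2, col3
                              ORDER BY (SELECT NULL)) AS rn
    FROM Table_1
)
SELECT col1, col2, col3
INTO Table_2
FROM numbered
WHERE rn = 1;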
Thanks to everyone for participating!