Saturday, March 23, 2013

SSIS: Remove duplicate rows from input file

We have a requirement in which there are 40,000,000 (forty million) records in the input file. We have to load it into the database after applying a few business-logic transformations. The major concern is removing duplicate rows. Following are the possible solutions.

1. Use the Sort transformation and set the "remove duplicates" property to TRUE. Would take quite long as it is a fully blocking transformation.

2. Use a Script component and compare rows ourselves. I guess it would again behave like a blocking transformation once memory fills up.

3. Dump the data to the DB and then reload it after selecting distinct records. Looks like the best option.
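The Script-component idea in option 2 is essentially a streaming dedup: keep a digest of every row seen so far and pass a row through only the first time it appears. Here is a minimal sketch of that logic in Python (not SSIS's actual C#/VB.NET Script component API; the row data is hypothetical). Note it only blocks on memory for the digest set, not on the whole dataset:

```python
import hashlib

def dedupe(rows):
    # Keep an MD5 digest of every row seen so far; emit a row only
    # the first time its digest appears. Streaming, so rows flow
    # through immediately instead of waiting for a full sort.
    seen = set()
    for row in rows:
        key = hashlib.md5(row.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield row

rows = ["a|1", "b|2", "a|1", "c|3", "b|2"]
print(list(dedupe(rows)))  # ['a|1', 'b|2', 'c|3']
```

For 40 million rows the digest set itself still has to fit in memory (roughly a few GB of hash strings), which is why pushing the work to the database engine, as in option 3, can win.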

Will be back with the stats. Feel free to add your suggestions/stats/comments.

Happy SSISing.. :)

1 comment:

Bhavpreet Singh said...

And here are the stats:
The package with the Sort transformation took more than an hour, whereas the one that dumps data to the DB, selects distinct records, and reinserts them took around 3 minutes.
The Script component would surely take more time than the DB task, as it would again be using buffer memory.
Hope it helps :)