We have a requirement in which the input file contains 40,000,000 (forty million) records. We have to load them into the database after applying a few business-logic transformations. The major concern is removing duplicate rows. Here are the possible approaches:
1. Use a Sort transformation with its "Remove rows with duplicate sort values" option enabled. Likely to take quite long, since Sort is a fully blocking transformation: it must buffer all 40 million rows before it can emit a single one.
2. Use a Script Component and do the comparison ourselves. My first guess was that this would again be fully blocking, but a synchronous component that tracks the keys it has already seen in a HashSet streams rows through as they arrive; the catch is that every distinct key has to fit in memory (see the first sketch after this list).
3. Dump the data to a staging table in the database and reload it after selecting distinct records, e.g. with SELECT DISTINCT or ROW_NUMBER() partitioned by the business key (see the second sketch after this list). Looks like the best option, since the dedup becomes a single set-based operation that the engine can optimize.
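For option 2, here is a minimal sketch of the HashSet-based Script Component. It assumes the component is configured with a synchronous output and an ExclusionGroup (so undirected rows are dropped), and uses two hypothetical business-key columns, CustomerId and OrderDate; substitute your own key columns.

```csharp
using System;
using System.Collections.Generic;

public class ScriptMain : UserComponent
{
    // Keys seen so far. Forty million short keys can occupy a few GB,
    // so this set must fit in the SSIS process's memory.
    private readonly HashSet<string> seenKeys = new HashSet<string>();

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Build the dedup key from the business-key columns (placeholder names).
        string key = Row.CustomerId + "|" + Row.OrderDate.ToString("yyyyMMdd");

        // HashSet<T>.Add returns false when the key is already present,
        // so only the first occurrence of each key is sent downstream.
        if (seenKeys.Add(key))
        {
            Row.DirectRowToOutput0();
        }
    }
}
```

Because the output is synchronous, rows flow through buffer by buffer instead of waiting for the whole input, which is what keeps this approach non-blocking.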
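For option 3, the reload itself is one set-based statement. The sketch below uses ROW_NUMBER() partitioned by the business key, which, unlike SELECT DISTINCT, also keeps exactly one row when non-key columns differ between duplicates. Table, column, and connection names are placeholders; in a real package the statement would normally sit in an Execute SQL Task rather than ad-hoc code.

```csharp
using System.Data.SqlClient;

class StagingDedup
{
    static void Main()
    {
        // Placeholder connection string: point it at the staging database.
        const string connStr = "Server=.;Database=ETL;Integrated Security=true";

        // Keep exactly one row per (CustomerId, OrderDate) business key.
        const string sql = @"
INSERT INTO dbo.Target (CustomerId, OrderDate, Amount)
SELECT CustomerId, OrderDate, Amount
FROM (
    SELECT CustomerId, OrderDate, Amount,
           ROW_NUMBER() OVER (PARTITION BY CustomerId, OrderDate
                              ORDER BY (SELECT NULL)) AS rn
    FROM dbo.Staging
) AS s
WHERE rn = 1;";

        using (var conn = new SqlConnection(connStr))
        using (var cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            cmd.CommandTimeout = 0;  // a 40M-row scan can exceed the 30s default
            cmd.ExecuteNonQuery();
        }
    }
}
```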
I'll be back with the stats. Feel free to add your own suggestions, stats, and comments.
Happy SSISing.. :)