Thursday, June 30, 2011

SQL Bulk Copy with duplicate keys

Given:
- A large XML file containing entries that need to go into my database table. Some rows have the same value for the table's key column (not necessarily completely duplicate rows!)
- SQL table, C# code

Problem:
Copy the entries from the XML file into the SQL table. This is fast using the SqlBulkCopy class in .NET, but bulk copy fails when the input data has duplicate values in the primary key column. I really do not care about keeping the duplicate rows. Just throw them out so I can use bulk copy for the 99% of the data.

Solutions
Online blogs suggest a number of solutions:

- One is to create a temporary table with no duplicate-key constraints, bulk copy everything into it, and then use SQL to copy the temp table into the final table while enforcing the constraints.

- Another way is to find and drop the duplicate rows in code, using hashtables.

- Next, there is DataTable.DefaultView.ToTable(distinct: true, columns), but this only removes fully duplicate rows, not rows that merely share a value in one column.

- There is a way to do it with LINQ, but I didn't even get to read about that.
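For what it's worth, the LINQ route can be sketched roughly as follows: group the rows by the key column and keep the first row of each group. This is a self-contained sketch with made-up sample data; the column names "Symbol" and "Price" are just illustrative.

```csharp
using System;
using System.Data;
using System.Linq;

class Program
{
    static void Main()
    {
        // Made-up sample data: two rows share the key "MSFT".
        var table = new DataTable();
        table.Columns.Add("Symbol", typeof(string));
        table.Columns.Add("Price", typeof(decimal));
        table.Rows.Add("MSFT", 25.1m);
        table.Rows.Add("MSFT", 25.3m);   // duplicate key, different data
        table.Rows.Add("AAPL", 330.0m);

        // Group by the key column and keep the first row of each group.
        var deduped = table.AsEnumerable()
            .GroupBy(r => (string)r["Symbol"])
            .Select(g => g.First())
            .CopyToDataTable();

        Console.WriteLine(deduped.Rows.Count);  // 2
    }
}
```

Like the comparer approach below in spirit, this arbitrarily keeps the first row seen for each key.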

-------
I used a pretty short (imo) and fast way to do it using the IEqualityComparer&lt;T&gt; interface:

// Requires using System.Collections.Generic; and using System.Data;
class RowEqualityComparer : IEqualityComparer&lt;DataRow&gt;
{
    // Two rows are "equal" when they share the same value in the key column.
    public bool Equals(DataRow a, DataRow b)
    {
        return (string)a["Symbol"] == (string)b["Symbol"];
    }

    // GetHashCode must be consistent with Equals, so hash only the key column.
    public int GetHashCode(DataRow row)
    {
        return ((string)row["Symbol"]).GetHashCode();
    }
}

and then simply 

sbc.WriteToServer(
           dst.Tables[table].AsEnumerable().Distinct(
               new RowEqualityComparer()).CopyToDataTable());

instead of 

sbc.WriteToServer(dst.Tables[table]);
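To see the deduplication in action without a database, here is a small self-contained check. The comparer is repeated so the sketch compiles on its own, and the sample rows are made up:

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

// Comparer that treats rows with the same "Symbol" value as duplicates.
class RowEqualityComparer : IEqualityComparer<DataRow>
{
    public bool Equals(DataRow a, DataRow b)
    {
        return (string)a["Symbol"] == (string)b["Symbol"];
    }

    public int GetHashCode(DataRow row)
    {
        return ((string)row["Symbol"]).GetHashCode();
    }
}

class Program
{
    static void Main()
    {
        var table = new DataTable();
        table.Columns.Add("Symbol", typeof(string));
        table.Rows.Add("GOOG");
        table.Rows.Add("GOOG");   // duplicate key that gets dropped
        table.Rows.Add("IBM");

        // Distinct keeps the first occurrence of each key, in order.
        var unique = table.AsEnumerable()
            .Distinct(new RowEqualityComparer())
            .CopyToDataTable();

        Console.WriteLine(unique.Rows.Count);  // 2
    }
}
```

In the real program, `unique` is what gets passed to SqlBulkCopy.WriteToServer.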



3 comments:

  1. Nearly 100K records from a 43 MB XML file, and this code runs in a couple of minutes!

  2. Could you post the links to the sources you found for the other blogs' solutions?

    Also, can you document your sample code to explain what "Symbol" is? I assume it is the primary key of the table that you are scrubbing the duplicate keys out of.

    This seems like it would work on empty tables, but what if you have existing data in a table already, and you need to bulk copy into that table while avoiding duplicates? I think you would need the temp table then. This seems good only for scrubbing the duplicate primary keys out of the DataTable before the bulk copy. Also, what if you cannot simply discard records arbitrarily?

  3. Hey, sorry I didn't see the comment till now. This was some time ago; I remember going through a dozen pages online while solving this and just wanted to summarize. It would have been a good idea to post the links as well, but I really don't remember them now. As I did, you will find plenty of related results just by searching Google.

    Symbol is my primary key (stock ticker symbol). One correction: for Distinct to accept the comparer, it has to implement the generic IEqualityComparer&lt;DataRow&gt; interface, not the non-generic IEqualityComparer. Comparing the key values with == is fine here, since both operands are cast to string and == compares string values (string.Equals works too).

    Well, luckily I don't have those problems, so this way still works perfectly for me. My data comes in complete and quite redundant, so I don't need to worry about throwing out a couple of hundred rows :)
