Hello everyone,

I have a table whose schema is very simple: an ID column as the unique primary key (uniqueidentifier type) and some other nvarchar columns. My current goal is, for 5000 inputs, to calculate which ones are already contained in the table and which are not. The inputs are strings, and I have a C# function which converts a string into a uniqueidentifier (GUID). My logic is: if an ID already exists, then I treat the string as already contained in the table.

My question is: if I need to find out which of the 5000 input strings are already contained in the DB and which are not, what is the most efficient way?

BTW: My current implementation is to convert the string to a GUID using C# code, then invoke/implement a stored procedure which queries whether that ID exists in the database and returns the result back to the C# code.
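Roughly, the stored procedure is just a per-ID existence check, something like this (table and procedure names simplified here):

CREATE PROCEDURE dbo.CheckIdExists
    @Id UNIQUEIDENTIFIER
AS
BEGIN
    -- returns 1 if the ID is already in the table, 0 otherwise
    SELECT CASE WHEN EXISTS (SELECT 1 FROM dbo.MyTable WHERE ID = @Id)
                THEN 1 ELSE 0
           END AS FoundInDB
END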

My working environment: VSTS 2008 + SQL Server 2008 + C# 3.5.

thanks in advance, George

+1  A: 

Try to ensure you end up running only one query - i.e. if your solution consists of running 5000 queries against the database, that'll probably be the biggest consumer of resources for the operation.

If you can insert the 5000 IDs into a temporary table, you could then write a single query to find the ones that don't exist in the database.
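In rough pseudo-SQL (all names here are placeholders), the idea would be:

-- 1. load the 5000 IDs into a temporary table
CREATE TABLE #InputIds (ID UNIQUEIDENTIFIER PRIMARY KEY)
-- ... insert the 5000 IDs here, ideally in one bulk operation ...

-- 2. a single query then finds the ones not already in the big table
SELECT i.ID
FROM   #InputIds i
WHERE  NOT EXISTS (SELECT 1 FROM dbo.BigTable b WHERE b.ID = i.ID)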

Jeffrey Kemp
"Try to ensure you end up running only one query" -- do you mean I loop 5000 times in my C# code to query one by one to record whether they hit in database or not? If it is what you mean, then it is what I am doing.
George2
"If you can insert the 5000 IDs into a temporary table, you could then write a single query to find the ones that don't exist in the database." -- could you show me some pseudo code how to use a single query to find the ones that don't exist in the database?
George2
See Martelli's answer for a good example.
Jeffrey Kemp
Yes, and I mean: make sure that you run one (1) query, once only. Not one query 5000 times! :) So to use the temp table option, I'd expect the solution to involve 1 insert followed by 1 query. I'm not a SQL Server expert though. If it was Oracle I'd run just one query (i.e. no insert, no temp table) and use a bulk in-bind for the 5000 IDs.
Jeffrey Kemp
Sorry for my ignorance. (1) I think the term "temp table" is ambiguous; I would appreciate it if you could show some pseudo code or a URL explaining what you mean by "temp table". (2) Why is the temp table solution better than the 5000-query solution? (3) "If it was Oracle I'd run just one query (i.e. no insert, no temp table) and use a bulk in-bind for the 5000 IDs." -- does SQL Server have a similar capability?
George2
I was deliberately ambiguous because my advice was general in nature and not specific to a particular db. ... Oracle has a feature called "global temporary tables" which look like ordinary tables when writing queries, but the data in them is local to the current session (i.e. usually just kept in memory) and disappears automagically when the session ends. ... I don't know about SQL Server, sorry.
Jeffrey Kemp
The temp table solution is potentially better because instead of parsing and running 5000 queries, the database only has to parse and run 1 query. Of course, this might be offset by the cost of inserting 5000 rows into a temp table; which is why my best option would be to bulk in-bind the values to the query, if possible.
Jeffrey Kemp
Let me confirm your solution to make sure we are on the same page: step 1, bulk insert the queries into a temporary table; step 2, join the temporary table with the big table to find which queries are not already contained? Is that a correct understanding? If yes, is your suggestion to wrap both steps into a single stored procedure?
George2
For the temp table, I am interested in its lifecycle. Do you suggest that each time I have a batch of queries, I create a new temp table and drop it after the join results are retrieved? Or do I just need to keep a single temporary table?
George2
I don't know if you can have a persistent temp table in SQL Server. In Oracle you only need to create the temp table once, and it's available for use by any session thereafter. But again, I stress that the idea of doing a bulk insert + query is the 2nd best option; the 1st option would be to bulk in-bind the values to the query and not use a temp table at all.
Jeffrey Kemp
You have several options. You can't determine which is best without answering a number of important questions first, and most of all you need to have a testing strategy to compare them. You're going a long way down one path that may or may not be optimal. For 5000 records, I doubt it will be.
le dorfier
1. " the 1st option would be to bulk in-bind the values to the query and not use a temp table at all." -- could you let me know whether SQL Server supports this? 2. Session level temp table means per connection level (a bit confused about what means session here)?
George2
Anyway, could you recommend some SQL Server temp-table-related documents/tutorials to read?
George2
+3  A: 

My first instinct would be to pump your 5000 inputs into a single-column temporary table X, possibly index it, and then use:

SELECT X.thecol
FROM #X AS X
JOIN ExistingTable ON ExistingTable.thecol = X.thecol

to get the ones that are present, and (if both sets are needed)

SELECT X.thecol
FROM #X AS X
LEFT JOIN ExistingTable ON ExistingTable.thecol = X.thecol
WHERE ExistingTable.thecol IS NULL

to get the ones that are absent. Worth benchmarking, at least.
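A sketch of how X could be set up (the names and loading method here are assumptions; SqlBulkCopy from the client, or any multi-row insert, would do the actual population):

CREATE TABLE #X (thecol UNIQUEIDENTIFIER NOT NULL PRIMARY KEY)
-- the PRIMARY KEY doubles as the index used by the joins above;
-- fill it with the 5000 GUIDs from the application, e.g.
INSERT INTO #X (thecol) VALUES ('60EAD98F-8A6C-4C22-AF75-000000000000')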

Edit: as requested, here are some good docs & tutorials on temp tables in SQL Server. Bill Graziano has a simple intro covering temp tables, table variables, and global temp tables. Randy Dyess and SQL Master discuss performance issues for and against them (but remember that if you're getting performance problems you do want to benchmark alternatives, not just go on theoretical considerations!-).

MSDN has articles on tempdb (where temp tables are kept) and optimizing its performance.

Alex Martelli
"single-column temporary table X" -- appreciate if you could clarify this. I think this point is important. temporary table you mean creating a physical table or?
George2
CREATE TABLE #X (thecol UNIQUEIDENTIFIER) makes the temporary table (the leading # in the name is what makes it temporary) -- it lasts only as long as the procedure or session that creates it.
Alex Martelli
In my scenario, (1) why is a temporary table better, compared with creating another physical table? (2) For the temporary table, what is your advice on when to create it and when to drop it? I.e., create a new temporary table each time there is a batch of 5000 queries and drop it after the join, or create the temp table only once?
George2
(1) If you share one physical table among all users, your code will have to filter on the rows specific to the current session, which takes time; also, a temp table will probably use fewer resources on the server.
Jeffrey Kemp
So, let me confirm: in my scenario your suggestion is to have a session-scoped temp table if there is no need to share data with other sessions? If yes, could you recommend some SQL Server temp-table-related documents/tutorials to read? I feel from our discussion that I am short on temp table knowledge. :-)
George2
It turns out this answer may have been a bit premature. The real scale of the problem is more like 1 million records. You need to ask more questions before offering answers.
le dorfier
"The real scale of the problem is more like 1 million records." -- it is not requirement in my scenario in near future. Anyway, could you recommend me some SQL Server temp table related documents/tutorials to read?
George2
Combining this temporary-table approach with the MERGE statement marc_s mentions in his answer is probably a very good way to do what you want.
Doliveras
"Combining this method of temporary tables with the MERGE sentence" -- I think merge has the same function as left join, so I either use merge or use left join, I am interested how do you combine them, could you show the code please?
George2
@le dorfier, what a mysterious comment you make -- the question specifically said 5000, so it would have been implicitly offensive to ask "do you really mean what you say?"...!-). Point is, a temp table approach is worth benchmarking, be it with join or merge.
Alex Martelli
Sorry, it sounded like a criticism - it was in the sense I would say it to myself. But George has been off doing a bunch of development work when the time would have been better spent clarifying the situation. Batching may not be a good idea when the primary goal is getting the data online quickly, and we don't even know what the current problem behaviour is - or if there is any yet.
le dorfier
+2  A: 

What do you need to do with those entries that do or don't exist in your table??

Depending on what you need, maybe the new MERGE statement in SQL Server 2008 could fit your bill - update what's already there, insert new stuff, all wrapped neatly into a single SQL statement. Check it out!

Your statement would look something like this:

MERGE INTO 
    (your target table) AS t
USING 
    (your source table, e.g. a temporary table) AS s
ON t.ID = s.ID
WHEN NOT MATCHED THEN  -- row does not exist in base table
  ....(do whatever you need to do)
WHEN MATCHED THEN      -- row exists in base table
  ... (do whatever else you need to do)
;
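As a rough, filled-in sketch (the table and column names here are just guesses - a ProcessedOrders target table and a temp table holding the incoming IDs), it might look like:

MERGE INTO dbo.ProcessedOrders AS t
USING #IncomingOrders AS s          -- temp table with the 5000 incoming IDs
    ON t.ID = s.ID
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ID) VALUES (s.ID)       -- the order was not processed yet
OUTPUT $action, inserted.ID;        -- lists exactly the rows that were new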

To make this really fast, I would load the "new" records from e.g. a TXT or CSV file into a temporary table in SQL Server using BULK INSERT:

BULK INSERT YourTemporaryTable
FROM 'c:\temp\yourimportfile.csv'
WITH 
(
    FIELDTERMINATOR =',',
    ROWTERMINATOR =' |\n'
)

BULK INSERT combined with MERGE should give you the best performance you can get on this planet :-)

Marc

PS: here's a note from TechNet on MERGE performance and why it's faster than individual statements:

In SQL Server 2008, you can perform multiple data manipulation language (DML) operations in a single statement by using the MERGE statement. For example, you may need to synchronize two tables by inserting, updating, or deleting rows in one table based on differences found in the other table. Typically, this is done by executing a stored procedure or batch that contains individual INSERT, UPDATE, and DELETE statements. However, this means that the data in both the source and target tables are evaluated and processed multiple times; at least once for each statement. By using the MERGE statement, you can replace the individual DML statements with a single statement. This can improve query performance because the operations are performed within a single statement, therefore, minimizing the number of times the data in the source and target tables are processed. However, performance gains depend on having correct indexes, joins, and other considerations in place. This topic provides best practice recommendations to help you achieve optimal performance when using the MERGE statement.

marc_s
The usage scenario is: I have a big job/order database which contains already-processed jobs/orders. For each new batch of 5000 order/job requests, I first look up whether the orders/jobs are already processed; if not, I process the un-processed ones. Do you think MERGE is suitable for my scenario?
George2
Yes, absolutely! This is *THE* perfect scenario for MERGE. You have a table with your newly processed jobs, and you then update the base table and e.g. set a flag, or add a row, or whatever it is you need to do.
marc_s
Have you read the reply by Alex about outer join? What are the pros and cons of merge compared with left join?
George2
Yes - that's the way you would have to do it *before* MERGE was around :-) I would assume that using MERGE will be a bit faster, since a lot of work at Microsoft has gone into making sure MERGE is as optimal as possible.
marc_s
1. Your temporary table is a physical table, not a real temporary table beginning with #? Why not use a real temporary table? 2. "I would assume that using MERGE will be a bit faster" -- do you have any documents to prove that?
George2
You can use either a normal table, or a "real" temporary table - really doesn't make a big difference, IMHO.
marc_s
No, no printed documents - but there are several videos, e.g. on Channel 9, about the new SQL Server 2008 features where MS employees mention that a lot of work has been put into making sure MERGE is very performant.
marc_s
Whether I use a real temporary table or not, I need to remove all records before inserting a new batch of queries, correct? Since the temporary table needs to be reused, I am interested in how/when to do the clean-up/reset work.
George2
Yes, absolutely - you need to wipe out the table before importing a new batch of records to process.
marc_s
+3  A: 

Step 1. Make sure you have a problem to solve. Five thousand inserts isn't a lot to insert one at a time in a lot of contexts.

Are you certain that the simplest way possible isn't sufficient? What performance issues have you measured so far?

le dorfier
The number of queries is configurable, and I want to make my solution work for big numbers, like the 1M level.
George2
You'll get different answers for different problems. You need to say 1MM in your question then. You need to be asking about bulk insert strategies. The answer will also be different if you are highly likely or highly unlikely to get matches.
le dorfier
1M is not in the near future; the current scenario is that every 15 minutes there is a batch of 5000 queries. Any advice?
George2
Yes, try one-at-a-time and see how long it takes. If you clump them up in a batch, it can put a big load on the server all at once that it wouldn't experience otherwise. You might also try unbatching entirely and submitting them as you get them. As long as it keeps up, it could be the lightest overall load - which might work for a long time. 5K @ 15 min. is only about 5 per second. As long as you have a single app queueing them in, it could be a reasonable way to level the load overall. The #1 piece of advice is: don't solve problems you don't have yet; do the simplest thing first.
le dorfier
The huge traffic is on-demand queries from another application for order status and details; my design for finding unprocessed orders is meant to avoid impacting those on-demand queries against the big order database. Any advice?
George2
" As long as it keeps up, it could be the lightest overall load - which might work for a long time." -- confused about what do you mean "keeps up" and "lightest overall load", appreciate if you could say in some other words? :-)
George2
What concerns me most here is that there seems to be a strong incentive to get these orders loaded as soon as possible because there are consequences to not having them in the database very quickly. If that's the case, then we need to redefine the problem to be "How can I insert new orders as quickly as possible while dealing with the other requirements?"
le dorfier
I think as you describe it now, the best strategy might be to simply append all orders as they come in (preventing insertion problems) but doing your lookup not just for the order, but for the most recent version of the order.
le dorfier
This also gives you an audit trail that shows what the customer submitted when - which will give them a lot more confidence that you're getting what they are sending, in the order they send it.
le dorfier
+1  A: 

If you want simplicity, since 5000 records is not very many, then from C# just use a loop to generate an INSERT statement for each of the strings you want to add to the table. Wrap each insert in a TRY CATCH block. Send them all up to the server in one shot like this:

BEGIN TRY
INSERT INTO YourTable (theCol, field2, field3)
SELECT 'guid-1', 'value2', 'value3'   -- literal values generated by your C# loop
END TRY BEGIN CATCH END CATCH

BEGIN TRY
INSERT INTO YourTable (theCol, field2, field3)
SELECT 'guid-2', 'value2', 'value3'
END TRY BEGIN CATCH END CATCH

BEGIN TRY
INSERT INTO YourTable (theCol, field2, field3)
SELECT 'guid-3', 'value2', 'value3'
END TRY BEGIN CATCH END CATCH

If you have a unique index or primary key defined on your GUID column, then the duplicate inserts will fail. Checking ahead of time to see whether the record exists just duplicates work that SQL Server is going to do anyway.
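If you want the empty CATCH blocks to swallow only duplicate-key failures and let any other error surface, each block can check the error number, roughly like this (table name and values are placeholders):

BEGIN TRY
    INSERT INTO YourTable (theCol, field2, field3)
    SELECT 'guid-n', 'value2', 'value3'
END TRY
BEGIN CATCH
    -- 2627 = primary key violation, 2601 = duplicate key on a unique index;
    -- swallow those, re-raise anything else (SQL Server 2008 has no THROW)
    IF ERROR_NUMBER() NOT IN (2627, 2601)
    BEGIN
        DECLARE @msg NVARCHAR(2048) = ERROR_MESSAGE()
        RAISERROR(@msg, 16, 1)
    END
END CATCH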

If performance is really important, then consider downloading the 5000 GUIDs to your local station and doing all the analysis locally. Reading 5000 GUIDs should take much less than 1 second. This is simpler than bulk importing into a temp table (which is the only way you will get performance from a temp table) and doing an update using a join to the temp table.

johnnycrash
I think your solution, which detects duplicates by checking whether SQL Server returns an error, is not very reliable; if an insert fails there could be many reasons, a duplicate value being only one of them.
George2
"If performance is really important, then consider downloading the 5000 GUIDS to your local station and doing all the analysis localy." -- what do you mean download? My scenario is I have a big table with several M processed order and each time there is batch input 5000 orders to check againts the several M processed order table to find unprocssed ones.
George2
It wasn't clear that the table in the DB had several million records in it. Add that info to your question! You are correct, there could be other errors when executing the insert statement. You can change my solution to absorb only a key-violation error and throw other errors. Can you tell me this: 1) how long is it taking to import the 5000 records, and 2) how many of the records are already in the table? I ask because if the number of duplicate records is small, say 10-100, then you don't save much by not sending the 10-100 inserts.
johnnycrash
+1  A: 

Definitely do not do it one-by-one.

My preferred solution is to create a stored procedure with one parameter that takes XML in the following format:

<ROOT>
  <MyObject ID="60EAD98F-8A6C-4C22-AF75-000000000000"/>
  <MyObject ID="60EAD98F-8A6C-4C22-AF75-000000000001"/>
  ....
</ROOT>

Then in the procedure, which takes an argument of type NVARCHAR(MAX), you convert it to XML, after which you use it as a table with a single column (let's call it @FilterTable). The stored procedure looks like:

CREATE PROCEDURE dbo.sp_MultipleParams(@FilterXML NVARCHAR(MAX))
AS BEGIN
    SET NOCOUNT ON

    DECLARE @x XML
    SELECT @x = CONVERT(XML, @FilterXML)

    -- table variable (needed because you cannot join directly against the XML)
    DECLARE @FilterTable TABLE (
         "ID" UNIQUEIDENTIFIER
    )

    -- shred the XML into the table variable
    -- important: XML element and attribute names are case-sensitive
    INSERT      @FilterTable
    SELECT      x.value('@ID', 'UNIQUEIDENTIFIER')
    FROM        @x.nodes('/ROOT/MyObject') AS R(x)

    SELECT      o.ID,
                SIGN(SUM(CASE WHEN t.ID IS NULL THEN 0 ELSE 1 END)) AS FoundInDB
    FROM        @FilterTable o
    LEFT JOIN   dbo.MyTable t
            ON  o.ID = t.ID
    GROUP BY    o.ID

END
GO

You run it as:

EXEC sp_MultipleParams '<ROOT><MyObject ID="60EAD98F-8A6C-4C22-AF75-000000000000"/><MyObject ID="60EAD98F-8A6C-4C22-AF75-000000000002"/></ROOT>'

And your results look like:

ID                                   FoundInDB
------------------------------------ -----------
60EAD98F-8A6C-4C22-AF75-000000000000 1
60EAD98F-8A6C-4C22-AF75-000000000002 0
van
I'm confused about your SQL statement. 1. How do you extract the ID attribute from each row of the XML? I did not find such a statement in your reply. 2. Another question: what does this mean exactly -- "SIGN(SUM(CASE WHEN t.ID IS NULL THEN 0 ELSE 1 END))"?
George2
Thanks for your advice. My input is a string array and I do not want to wrap it in XML and add extra overhead. Do you think the solution of bulk inserting into a temporary table and then joining it with the real, big processed-order table makes sense?
George2
SIGN(...) basically returns 0 when no matching rows are found, and 1 when one or more rows are found. In your case the filter is UNIQUE so this is not really required; you can drop the SIGN and leave just SUM(...).
van
Can you explain what "SIGN(SUM(CASE WHEN t.ID IS NULL THEN 0 ELSE 1 END))" means in your code? I think your code is very cool and workable for my scenario, except I am confused about what this sub-expression means.
George2
Alex's solution using a temporary table is viable. But if you think of inserting rows from outside SQL, then I find it not "good" for the following reasons: 1) you break concurrency: 2 users executing your request at the same time will either lock one another or mix up each other's requests, depending on your transaction isolation level. 2) Your SP becomes just a select with a join, which "relies" on data in another table - not nice. ---- And I do not think that passing XML is an overhead: it is easy to create the input string in C#, and SQL handles it rather well.
van
If you do not like XML, you can pass a comma-delimited string and create a UDF that converts it to a one-column table (just google for it - many people used this before SQL Server 2005, because XML handling was slow and the code ugly).
van
I think the blocking/concurrency issue happens for your solution as well: when you do the join, you also block other operations on the processed-order table. The same as with the temporary table solution?
George2
A comment -- your use of SUM(CASE WHEN t.ID IS NULL THEN 0 ELSE 1 END) is a little hard to read for a newcomer like me; I think using WHERE xxx IS NOT NULL is clearer. Any comments? :-)
George2
Readability is important. So go with your solution. My query return 1 or 0 for all the rows. If you need to know only those that ARE there, then you can change the code to "SELECT o.ID FROM @FilterTable o INNER JOIN dbo.MyTable t ON o.ID = t.ID", which will return only those rows that ARE in the database already.
van
I see that Alex suggests an in-process temp table. Then there is no issue with concurrency. But you still have to populate this table somehow. So my question is: how, and what do you pass as the SP parameter? I mentioned concurrency as an issue only when you use a global "a-la temp" table and insert rows from C# directly before you check for existence.
van
To get your XML, just use: string xml = String.Format("<ROOT><MyObject ID='{0}'/></ROOT>", String.Join("'/><MyObject ID='", listOfGuids));
van
+1  A: 

Since you are using SQL Server 2008, you could use table-valued parameters. They are a way to provide a table as a parameter to a stored procedure.

Using ADO.NET you could easily pre-populate a DataTable and pass it as a SqlParameter. Steps you need to perform:

Create a custom Sql Type

CREATE TYPE MyType AS TABLE
(
    UniqueId UNIQUEIDENTIFIER NOT NULL,  -- the GUID key from the question
    [Column] NVARCHAR(255) NOT NULL      -- COLUMN is a reserved word, so bracket it
)

Create a stored procedure which accepts the Type

CREATE PROCEDURE spInsertMyType
@Data MyType READONLY
AS 
xxxx
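For the check in this question, the body (the xxxx above) could, as a sketch with assumed procedure, table and column names, be a simple outer join against the parameter:

CREATE PROCEDURE spCheckMyType
    @Data MyType READONLY
AS
BEGIN
    -- return each incoming ID with a flag indicating whether it already exists
    SELECT  d.UniqueId,
            CASE WHEN t.ID IS NULL THEN 0 ELSE 1 END AS FoundInDB
    FROM    @Data AS d
    LEFT JOIN dbo.MyTable AS t
           ON  t.ID = d.UniqueId
END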

Call using C#

// dataTable is a pre-populated System.Data.DataTable whose columns match MyType
SqlCommand insertCommand = new SqlCommand("spInsertMyType", connection);
insertCommand.CommandType = CommandType.StoredProcedure;
SqlParameter tvpParam =
    insertCommand.Parameters.AddWithValue("@Data", dataTable);
tvpParam.SqlDbType = SqlDbType.Structured;

Links: Table-valued Parameters in Sql 2008

Dries Van Hansewijck
Passing the parameter is the first step; then as the next step you suggest I do a left join?
George2
You could definitely use a left join. In addition, you mentioned using a C# function to convert strings into IDs. Since the table-valued parameter can contain multiple fields, it may be unnecessary to pre-process the data; just send data for all the fields you wish to query.
Dries Van Hansewijck