SQL: Mutually Assured Destruction
I have been playing around with SQL lately, trying to get it to sit and roll over. Sure, I’m familiar with all the normal paradigms; all the usual crud. I can select with the best of them, slicing and dicing data with the efficiency of a ginsu knife. But I’ve never actually been much of a SQL developer, not really big on stored procedures and what not. I know there out there, but have never had the need to build any large systems using them. That’s kind of hard to admit, given my background with data access at Microsoft. But that’s just the thing, I had been so focused on access to the server, API’s, sending queries, receiving data and at best invoking procedures, that I seldom found the need to think about the problem from the other side, building apps from the inside out.
Yet, working on programming languages instead of API’s has given me a different perspective. I now see the server in a different light.
I can illustrate this best by walking through an example I was thinking about many weeks ago. I was working on another translation problem, converting a higher-level abstraction into a database query. In this particular case, what I wanted to do was generate a SQL delete statement. It was the closest match to what I wanted to represent, but I also wanted to make sure that what I was doing was general enough. Two years of designing ObjectSpaces put a lot of funny ideas in my heads; lots of oddball scenarios that unfortunately happen all the time with real data.
Logically, I had information that was one piece like a row of a table, but physically this data was stored in two separate tables. Both of these real tables were related to each other in a one-to-one fashion, each sharing the same primary key. I suppose people do this all the time with O/R layers or even just with views in the database.
So what I wanted to do was delete a logical row from the database, but since there were two base tables involved I would have to actually generate two separate delete statements. So far, not so bad, but then I had this horrible thought. Since I also wanted to retain the set-ness of the SQL delete statement over my logical structure, (I wanted the option to be able to delete many of these things given a constraint), I would have to duplicate this constraint on each actual delete that I generated. This sounded like a good idea at first, but then I realized it could go horribly wrong. If the constraint itself reference information out of both base tables, there would be no way to execute both deletes. Once the first delete executed, the second one would always fail.
Imagine you had a logical view called Customers, which was broken down into two actual tables called Customers1 and Customers2; with things like the name and phone number in the first one and their address stored in the second one.
If you wanted to delete all customers with a specific phone prefix you could write:
delete c2 from Customers1 c1, Customers2 c2 where c1.Id = c2.Id and c1.prefix = @p
delete c1 from Customers1 c1 where c1.prefix = @p
Also, if you wanted to delete all customers from a particular city you could write:
delete c1 from Customers1 c1, Customers2 c2 where c1.Id = c2.Id and c2.city = @p
delete c2 from Customers1 c2 where c2.city = @p
Notice how I had to reverse the order of the deletes here, depending on which columns were used in the constraint. This required me to be particularly clever about what was in the query into order to determine this, but it’s not really the order that was at issue. If I could not figure out the order, I’d be stuck.
I know that this is what triggers were made for, etc, etc, etc. You’ll just have to go with me on this one and believe that a trigger was not a viable solution.
What really struck home was when I took the query to the next logical step. What if I wanted to delete all customers in a specific city with a specific prefix. Now the query actually spans both sides.
delete c2 from Customers1 c1, Customers2 c2 where c1.Id = c2.Id and c1.prefix = @prefix and c2.city = @city
delete c1 from Customers1 c1, Customers2 c2 where c1.Id = c2.Id and c1.prefix = @prefix and c2.city = @city
Now, no matter the order, the pair would never succeed. The first one to execute would get rid of data required to fire the next one. So I had a problem that I needed a solution for. I needed a guaranteed way to have mutually assured destruction in both tables.
I figured there were a few ways to solve this. First, I could always generate a temp table with all the ID’s that needed to be deleted. This would indeed work, but I worried about the overhead of generating that table. What if there were a billion rows?
So I decided the best thing for me to do here was use a cursor. I could create a cursor over the logical view using the constraint and while iterating the cursor delete each one independently. Of course, this would not lead to an atomic delete, but since I was splitting them into two anyway that did not concern me.
declare c cursor for select id from customers1 c1, customers2 c2 where c1.id = c2.id and c1.prefix = @prefix and c2.city = @city
declare @id int
fetch next from c into @id
while @@fetch_status = 0
begin
delete from Customers1 where id = @id
delete from Customers2 where id = @id
fetch next from c into @id
end
close c
deallocate c
Now, I’m not too sure about the perf of this guy relative to a pair of delete statements, but semantically it did exactly what I wanted.
After I was done, I looked at what I had written and the thought suddenly hit me; this is code. I mean it; this stuff is code like IL instructions are code. I had taken this logical set-based idiom and converted it into a for-loop with a couple statements inside. I was doing step-wise processing now.
Of course, it’s always been there in the server just like this for writing stored procedures and functions. People have doing this kind of stuff long before I’d ever touched a computer. But now I was thinking less about the server as a storage dump for data and now more as a data runtime; a data processor if you will.
What I started out wanting to do was translate a delete operation so I could send it remote to the server. I was thinking that the only semantics interesting to translate were queries and updates. Now I’m thinking the unification between normal programming languages and data services should go much deeper. It should involve all of the language.
Maybe I should just translate the whole program and send it lock stock and barrel to the server.
Matt
Comments
- Anonymous
March 30, 2004
The comment has been removed - Anonymous
March 30, 2004
Another thing:
if the database could just do this:
DELETE c1, c2
FROM base c1, derived c2
WHERE (c1.key=c2.key) AND (c1.p1='p1' OR c2.p2='p2')
then you'd be golden and with much better performance.
but mssql won't do this.
the pair of deletes is not good performance-wise, comparatively speaking, if there's a solution that doesn't have to search for the data twice.
the cursor is manageable, but if we presume a .NET-runtime-in-yukon, the performance is notably less than a native procedure.
how much less I have no idea. - Anonymous
March 30, 2004
Hi,
I believe that MSSQL uses worktable mechanism (a temporary table created automatically in the tempdb) to implement that cursor feature.
So, rather than using a cursor, you can use a temp table created in the tempdb to capture the IDs of the rows that needs to be deleted.
There will not be any transaction-log for rows that are inserted / deleted / updated in a temp table in the tempdb.
MSSQL 9 (YUKON) provides table variables. This table variable can be made use of to achieve the required functionality. This will allows us perform set oriented operation rather than record oriented operation.
Thanks
SN - Anonymous
March 30, 2004
Why would you use a cursor?
DECLARE @Ids TABLE (Id INT)
INSERT INTO @Ids SELECT Id FROM customers1 c1, customers2 c2 WHERE c1.id = c2.id AND c1.prefix = @prefix AND c2.city = @city
DELETE FROM Customers1 WHERE id IN (SELECT Id FROM @Ids)
DELETE FROM Customers2 WHERE id IN (SELECT Id FROM @Ids)
Ofcourse there are going to be table scans.. it could be optimised, but surely better than a cursor? - Anonymous
March 30, 2004
Woops. Didn't see the billion rows comment ;P - Anonymous
March 30, 2004
Hmm.. on second thoughts, I stand by the temp table solution.
How many times have you seen an application suddenly have the need to purge billions of rows on a regular basis? I've only seen a couple of databases that had a billion rows, and even then, there was serious red-tape processes involved any time you even wanted to modify a single field. Let alone delete a billion rows.
I say avoid cursors like the plague. Use the temp table. - Anonymous
March 30, 2004
The comment has been removed - Anonymous
March 30, 2004
Interesting info: 1) I was not trying to solve an objectspaces problem; similiar but not that precisely. 2) I'm assuming I cannot modify the schema of the database, create procs, or anything else so everything must be akin to a batch request. 3) thinking about what was the most general solution that would be guaranteed to work in even extreme scenarios, though I admit would not necessarily be the most performant. 4) was not targetting yukon yet; table variables are interesting though, much better than temp tables.
Thanks for the feedback - Anonymous
March 30, 2004
I've never been very good at taking people's word for it, so re: the triggers I must ask whether you also considered that views can now have INSTEAD OF triggers? - Anonymous
March 30, 2004
Again I was not considering triggers for this behavior mapping because I can not assume a particular design or priviledges to change the system at runtime. If all were perfect, then everything would be designed into the substrate, but lacking that what tools are at my disposal to solve these things generally. - Anonymous
March 30, 2004
The comment has been removed - Anonymous
March 31, 2004
The comment has been removed - Anonymous
March 31, 2004
The comment has been removed - Anonymous
March 31, 2004
If there really is meant to be a 1-1 mapping between the two tables, and you can't have this enforced by the DB with triggers or ref integrity or whatever,
you could do something like
-- delete from one of the tables using criteria
-- that spans both tables
delete c2 from Customers1 c1, Customers2 c2 where c1.Id = c2.Id and c1.prefix = @prefix and c2.city = @city
-- clean up all orphans in the other table
delete from Customers1 where id not in
(select id from Customers2) - Anonymous
March 31, 2004
Whoa... a billion rows, eh?
Well then guess what? Sometimes cursors do work 1000% better than complex queries.
I think that TABLE variables are forced to be in-memory.
So if you do some sort of stream of results to a temp table, you have to record a billion new rows in your temp table before you even begin deleting. If you use a cursor, then there's no monster spool before the deletion can commence.
If you're really talking that large of a scale, a lot of rules change.