slow query on large data

RJ 186 Reputation points
2024-12-19T18:50:52.5133333+00:00

Hi there,


-- 3 billion rows of data accumulated over a period of time
CREATE TABLE [dbo].[large_accumulation_table](
	[amt] float(24) NULL,
	[lastdate] smalldatetime NULL,
	[type] [varchar](8) NULL,
	[doc1] [int] NULL,
	[line1] [smallint] NULL,
	[type1] [varchar](12) NULL,
	[doc2] [int] NULL,
	[line2] [smallint] NULL,
	[type2] [varchar](8) NULL
) 

-- 100,000 rows

CREATE TABLE [dbo].[todays_data](
	[amt] float(24) NULL,
	[lastdate] smalldatetime NULL,
	[type] [varchar](8) NULL,
	[doc1] [int] NULL,
	[line1] [smallint] NULL,
	[type1] [varchar](12) NULL,
	[doc2] [int] NULL,
	[line2] [smallint] NULL,
	[type2] [varchar](8) NULL
) 

-- insert only new records; ignore records that already exist
INSERT INTO [dbo].[large_accumulation_table]
SELECT * FROM [dbo].[todays_data]
EXCEPT
SELECT * FROM [dbo].[large_accumulation_table];

Takes way too long to match. Any other, faster way that you all have come across? All columns are involved in the match
to determine whether a record is new or already exists.

Tags: Azure SQL Database, SQL Server, Transact-SQL

Accepted answer
  1. Ketsha 165 Reputation points Microsoft Employee
    2024-12-19T19:01:28.4366667+00:00

    Hello RJ -

    To speed up the process of inserting only new records into a large table, you can consider using a more efficient approach. Here are a few methods that might help:

    Use a LEFT JOIN: Instead of using the EXCEPT clause, you can use a LEFT JOIN to identify new records. This can be more efficient, especially with proper indexing. One caveat for this schema: equality predicates never match NULLs, so with all-nullable columns a LEFT JOIN is not an exact substitute for EXCEPT, and rows containing NULLs will always look new.

    INSERT INTO [dbo].[large_accumulation_table] (amt, lastdate, type, doc1, line1, type1, doc2, line2, type2)
    SELECT t.amt, t.lastdate, t.type, t.doc1, t.line1, t.type1, t.doc2, t.line2, t.type2
    FROM [dbo].[todays_data] t
    LEFT JOIN [dbo].[large_accumulation_table] l
        ON t.amt = l.amt AND t.lastdate = l.lastdate AND t.type = l.type AND t.doc1 = l.doc1 AND t.line1 = l.line1 AND t.type1 = l.type1 AND t.doc2 = l.doc2 AND t.line2 = l.line2 AND t.type2 = l.type2
    WHERE l.amt IS NULL;  -- no join match found, so the row is treated as new (see the NULL caveat above)
    

    Batch Processing: If the dataset is extremely large, consider processing the data in smaller batches. This can help reduce the load on the server and improve performance.

    DECLARE @BatchSize INT = 10000;
    DECLARE @BatchStart INT = 1;
    DECLARE @BatchEnd INT = @BatchSize;
    DECLARE @TotalRows INT;
    
    -- Materialize the row numbers once: re-evaluating ROW_NUMBER() with a
    -- nondeterministic ORDER BY in every iteration could skip or repeat rows.
    SELECT *, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RowNum
    INTO #todays_batches
    FROM [dbo].[todays_data];
    
    SET @TotalRows = @@ROWCOUNT;
    
    WHILE @BatchStart <= @TotalRows
    BEGIN
        INSERT INTO [dbo].[large_accumulation_table] (amt, lastdate, type, doc1, line1, type1, doc2, line2, type2)
        SELECT t.amt, t.lastdate, t.type, t.doc1, t.line1, t.type1, t.doc2, t.line2, t.type2
        FROM #todays_batches t
        LEFT JOIN [dbo].[large_accumulation_table] l
            ON t.amt = l.amt AND t.lastdate = l.lastdate AND t.type = l.type AND t.doc1 = l.doc1 AND t.line1 = l.line1 AND t.type1 = l.type1 AND t.doc2 = l.doc2 AND t.line2 = l.line2 AND t.type2 = l.type2
        WHERE l.amt IS NULL
          AND t.RowNum BETWEEN @BatchStart AND @BatchEnd;
    
        SET @BatchStart = @BatchEnd + 1;
        SET @BatchEnd = @BatchEnd + @BatchSize;
    END;
    
    DROP TABLE #todays_batches;
    

    Use Indexes: Ensure that you have appropriate indexes on the columns used in the join condition. This can significantly improve the performance of the query.
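
    For example, a minimal sketch (the index name and column order are illustrative, and building any index on a 3-billion-row heap is itself a heavy one-time operation):

    -- Hypothetical covering index over all matching columns of the big table,
    -- leading with the presumably most selective columns.
    CREATE INDEX IX_large_accumulation_match
    ON [dbo].[large_accumulation_table]
        (doc1, line1, doc2, line2, lastdate, type, type1, type2, amt);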

    Consider Using HASHBYTES: If the number of columns is large, you can use HASHBYTES to create a hash of the row and compare the hashes instead of comparing each column individually. Note that hashing both sides at query time does not by itself save work; the real gain comes from computing the hash once and storing it, as sketched after the example below.

    INSERT INTO [dbo].[large_accumulation_table] (amt, lastdate, type, doc1, line1, type1, doc2, line2, type2)
    SELECT t.amt, t.lastdate, t.type, t.doc1, t.line1, t.type1, t.doc2, t.line2, t.type2
    FROM [dbo].[todays_data] t
    LEFT JOIN [dbo].[large_accumulation_table] l
        -- A separator keeps values like ('ab','c') and ('a','bc') from hashing identically;
        -- note CONCAT turns NULL into '', so NULL and empty string still collide.
        ON HASHBYTES('SHA2_256', CONCAT(t.amt, '|', t.lastdate, '|', t.type, '|', t.doc1, '|', t.line1, '|', t.type1, '|', t.doc2, '|', t.line2, '|', t.type2))
         = HASHBYTES('SHA2_256', CONCAT(l.amt, '|', l.lastdate, '|', l.type, '|', l.doc1, '|', l.line1, '|', l.type1, '|', l.doc2, '|', l.line2, '|', l.type2))
    WHERE l.amt IS NULL;
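
    A sketch of the stored-hash variant, which is not in the answer above: the column and index names are hypothetical, and PERSISTED requires a deterministic expression, so the date and float conversions are made explicit rather than left to CONCAT's implicit (nondeterministic or precision-losing) defaults.

    ALTER TABLE [dbo].[large_accumulation_table]
    ADD row_hash AS CAST(HASHBYTES('SHA2_256', CONCAT(
            CONVERT(varchar(30), amt, 2), '|',        -- style 2: full float precision
            CONVERT(varchar(20), lastdate, 120), '|', -- explicit style keeps the expression deterministic
            type, '|', doc1, '|', line1, '|', type1, '|', doc2, '|', line2, '|', type2))
        AS varbinary(32)) PERSISTED;

    -- Index the stored hash so the anti-join probes 32-byte keys instead of nine columns.
    CREATE INDEX IX_large_accumulation_row_hash
    ON [dbo].[large_accumulation_table] (row_hash);

    The same expression can then be computed over [dbo].[todays_data] at query time (cheap at 100,000 rows) and joined against row_hash.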
    

    These methods should help improve the performance of your insert operation. Let me know if you need further assistance!


1 additional answer

  1. Erland Sommarskog 114.2K Reputation points MVP
    2024-12-19T22:16:20.8366667+00:00

    That big table looks scary. All columns are nullable, and there is no sign of a key. Are there any indexes? Otherwise it will be slow. Very slow.

    Without much knowledge about the table and the data, one idea is to create a unique clustered index which comprises all columns.
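
    A sketch of such an index (the name and column order are illustrative; the build will fail if the table already holds exact duplicate rows, and on 3 billion rows it is a heavy one-time operation):

    CREATE UNIQUE CLUSTERED INDEX UCX_large_accumulation_all
    ON [dbo].[large_accumulation_table]
        (doc1, line1, doc2, line2, lastdate, type, type1, type2, amt);

    Then you can find the new rows with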

    SELECT *
    FROM newtable new
    WHERE NOT EXISTS (SELECT *
                      FROM   bigtable big
                      -- INTERSECT compares columns with NULL-safe semantics
                      -- (NULL matches NULL), mirroring what EXCEPT does
                      WHERE  EXISTS (SELECT big.* INTERSECT SELECT new.*));
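
    Applied to the tables in the question, the same pattern becomes (a sketch; note that unlike EXCEPT, NOT EXISTS does not remove duplicates that exist within todays_data itself):

    INSERT INTO [dbo].[large_accumulation_table] (amt, lastdate, type, doc1, line1, type1, doc2, line2, type2)
    SELECT new.amt, new.lastdate, new.type, new.doc1, new.line1, new.type1, new.doc2, new.line2, new.type2
    FROM [dbo].[todays_data] new
    WHERE NOT EXISTS (SELECT *
                      FROM   [dbo].[large_accumulation_table] big
                      WHERE  EXISTS (SELECT big.* INTERSECT SELECT new.*));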
    

    But I would still expect poor performance.

    If all columns always have values and could be made NOT NULL, there would be a few more possibilities.

