slow query on large data

RJ 186 Reputation points
2024-12-19T18:50:52.5133333+00:00

Hi there,


-- 3 billion rows of data accumulated over a period of time
CREATE TABLE [dbo].[large_accumulation_table](
	[amt] float(24) NULL,
	[lastdate] smalldatetime NULL,
	[type] [varchar](8) NULL,
	[doc1] [int] NULL,
	[line1] [smallint] NULL,
	[type1] [varchar](12) NULL,
	[doc2] [int] NULL,
	[line2] [smallint] NULL,
	[type2] [varchar](8) NULL
) 

-- 100,000 rows

CREATE TABLE [dbo].[todays_data](
	[amt] float(24) NULL,
	[lastdate] smalldatetime NULL,
	[type] [varchar](8) NULL,
	[doc1] [int] NULL,
	[line1] [smallint] NULL,
	[type1] [varchar](12) NULL,
	[doc2] [int] NULL,
	[line2] [smallint] NULL,
	[type2] [varchar](8) NULL
) 

-- insert only new records; ignore records that already exist
INSERT INTO [dbo].[large_accumulation_table]
SELECT * FROM [dbo].[todays_data]
EXCEPT
SELECT * FROM [dbo].[large_accumulation_table];

Takes way too long to match. Any other, faster way that you all have come across? All columns are involved in the match
to determine whether a record is new or already exists.

Tags: Azure SQL Database, SQL Server, Transact-SQL

Accepted answer
  1. Ketsha 165 Reputation points Microsoft Employee
    2024-12-19T19:01:28.4366667+00:00

    Hello RJ -

    To speed up the process of inserting only new records into a large table, you can consider using a more efficient approach. Here are a few methods that might help:

    Use a LEFT JOIN: Instead of using the EXCEPT clause, you can use a LEFT JOIN to identify new records. This can be more efficient, especially with proper indexing. One caveat for this schema: equality predicates never match NULLs, so with all-nullable columns a LEFT JOIN is not an exact substitute for EXCEPT, and rows containing NULLs will always look new.

    INSERT INTO [dbo].[large_accumulation_table] (amt, lastdate, type, doc1, line1, type1, doc2, line2, type2)
    SELECT t.amt, t.lastdate, t.type, t.doc1, t.line1, t.type1, t.doc2, t.line2, t.type2
    FROM [dbo].[todays_data] t
    LEFT JOIN [dbo].[large_accumulation_table] l
        ON t.amt = l.amt AND t.lastdate = l.lastdate AND t.type = l.type AND t.doc1 = l.doc1 AND t.line1 = l.line1 AND t.type1 = l.type1 AND t.doc2 = l.doc2 AND t.line2 = l.line2 AND t.type2 = l.type2
    WHERE l.amt IS NULL;  -- no join match found, so the row is treated as new (see the NULL caveat above)
    

    Batch Processing: If the dataset is extremely large, consider processing the data in smaller batches. This can help reduce the load on the server and improve performance.

    DECLARE @BatchSize INT = 10000;
    DECLARE @BatchStart INT = 1;
    DECLARE @BatchEnd INT = @BatchSize;
    DECLARE @TotalRows INT;
    
    -- Materialize the row numbers once: re-evaluating ROW_NUMBER() with a
    -- nondeterministic ORDER BY in every iteration could skip or repeat rows.
    SELECT *, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RowNum
    INTO #todays_batches
    FROM [dbo].[todays_data];
    
    SET @TotalRows = @@ROWCOUNT;
    
    WHILE @BatchStart <= @TotalRows
    BEGIN
        INSERT INTO [dbo].[large_accumulation_table] (amt, lastdate, type, doc1, line1, type1, doc2, line2, type2)
        SELECT t.amt, t.lastdate, t.type, t.doc1, t.line1, t.type1, t.doc2, t.line2, t.type2
        FROM #todays_batches t
        LEFT JOIN [dbo].[large_accumulation_table] l
            ON t.amt = l.amt AND t.lastdate = l.lastdate AND t.type = l.type AND t.doc1 = l.doc1 AND t.line1 = l.line1 AND t.type1 = l.type1 AND t.doc2 = l.doc2 AND t.line2 = l.line2 AND t.type2 = l.type2
        WHERE l.amt IS NULL
          AND t.RowNum BETWEEN @BatchStart AND @BatchEnd;
    
        SET @BatchStart = @BatchEnd + 1;
        SET @BatchEnd = @BatchEnd + @BatchSize;
    END;
    
    DROP TABLE #todays_batches;
    

    Use Indexes: Ensure that you have appropriate indexes on the columns used in the join condition. This can significantly improve the performance of the query.
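
    For example, a minimal sketch (the index name and column order are illustrative, and building any index on a 3-billion-row heap is itself a heavy one-time operation):

    -- Hypothetical covering index over all matching columns of the big table,
    -- leading with the presumably most selective columns.
    CREATE INDEX IX_large_accumulation_match
    ON [dbo].[large_accumulation_table]
        (doc1, line1, doc2, line2, lastdate, type, type1, type2, amt);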

    Consider Using HASHBYTES: If the number of columns is large, you can use HASHBYTES to create a hash of the row and compare the hashes instead of comparing each column individually. Note that hashing both sides at query time does not by itself save work; the real gain comes from computing the hash once and storing it, as sketched after the example below.

    INSERT INTO [dbo].[large_accumulation_table] (amt, lastdate, type, doc1, line1, type1, doc2, line2, type2)
    SELECT t.amt, t.lastdate, t.type, t.doc1, t.line1, t.type1, t.doc2, t.line2, t.type2
    FROM [dbo].[todays_data] t
    LEFT JOIN [dbo].[large_accumulation_table] l
        -- A separator keeps values like ('ab','c') and ('a','bc') from hashing identically;
        -- note CONCAT turns NULL into '', so NULL and empty string still collide.
        ON HASHBYTES('SHA2_256', CONCAT(t.amt, '|', t.lastdate, '|', t.type, '|', t.doc1, '|', t.line1, '|', t.type1, '|', t.doc2, '|', t.line2, '|', t.type2))
         = HASHBYTES('SHA2_256', CONCAT(l.amt, '|', l.lastdate, '|', l.type, '|', l.doc1, '|', l.line1, '|', l.type1, '|', l.doc2, '|', l.line2, '|', l.type2))
    WHERE l.amt IS NULL;
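
    A sketch of the stored-hash variant, which is not in the answer above: the column and index names are hypothetical, and PERSISTED requires a deterministic expression, so the date and float conversions are made explicit rather than left to CONCAT's implicit (nondeterministic or precision-losing) defaults.

    ALTER TABLE [dbo].[large_accumulation_table]
    ADD row_hash AS CAST(HASHBYTES('SHA2_256', CONCAT(
            CONVERT(varchar(30), amt, 2), '|',        -- style 2: full float precision
            CONVERT(varchar(20), lastdate, 120), '|', -- explicit style keeps the expression deterministic
            type, '|', doc1, '|', line1, '|', type1, '|', doc2, '|', line2, '|', type2))
        AS varbinary(32)) PERSISTED;

    -- Index the stored hash so the anti-join probes 32-byte keys instead of nine columns.
    CREATE INDEX IX_large_accumulation_row_hash
    ON [dbo].[large_accumulation_table] (row_hash);

    The same expression can then be computed over [dbo].[todays_data] at query time (cheap at 100,000 rows) and joined against row_hash.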
    

    These methods should help improve the performance of your insert operation. Let me know if you need further assistance!


1 additional answer

  1. Erland Sommarskog 114.2K Reputation points MVP
    2024-12-19T22:16:20.8366667+00:00

    That big table looks scary. All columns are nullable, and there is no sign of a key. Are there any indexes? Otherwise it will be slow. Very slow.

    Without much knowledge about the table and the data, one idea is to create a unique clustered index which comprises all columns.
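
    A sketch of such an index (the name and column order are illustrative; the build will fail if the table already holds exact duplicate rows, and on 3 billion rows it is a heavy one-time operation):

    CREATE UNIQUE CLUSTERED INDEX UCX_large_accumulation_all
    ON [dbo].[large_accumulation_table]
        (doc1, line1, doc2, line2, lastdate, type, type1, type2, amt);

    Then you can find the new rows with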

    SELECT *
    FROM newtable new
    WHERE NOT EXISTS (SELECT *
                      FROM   bigtable big
                      -- INTERSECT compares columns with NULL-safe semantics
                      -- (NULL matches NULL), mirroring what EXCEPT does
                      WHERE  EXISTS (SELECT big.* INTERSECT SELECT new.*));
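
    Applied to the tables in the question, the same pattern becomes (a sketch; note that unlike EXCEPT, NOT EXISTS does not remove duplicates that exist within todays_data itself):

    INSERT INTO [dbo].[large_accumulation_table] (amt, lastdate, type, doc1, line1, type1, doc2, line2, type2)
    SELECT new.amt, new.lastdate, new.type, new.doc1, new.line1, new.type1, new.doc2, new.line2, new.type2
    FROM [dbo].[todays_data] new
    WHERE NOT EXISTS (SELECT *
                      FROM   [dbo].[large_accumulation_table] big
                      WHERE  EXISTS (SELECT big.* INTERSECT SELECT new.*));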
    

    But I would still expect poor performance.

    If all columns always have values and could be made NOT NULL, there would be a few more possibilities.

