Select most recent data per record from multiple tables

JamesHarsh-1587 0 Reputation points

I am trying to get the latest data for a record from multiple tables. The data in each table could have been modified at different times, and each table may contain multiple rows per record. I currently use ROW_NUMBER() ordered by ModifiedDateTime to get the latest row per table and then join the results together, but this is very slow when each table contains 1 million+ rows.

Example:

WITH cteTable1 AS (
    SELECT
        Column1,
        Column2,
        ROW_NUMBER() OVER (PARTITION BY Column1 ORDER BY ModifiedDateTime DESC) AS RowNum
    FROM
        dbo.Table1
)
,cteTable2 AS (
    SELECT
        Column1,
        Column2,
        Column3,
        ROW_NUMBER() OVER (PARTITION BY Column1 ORDER BY ModifiedDateTime DESC) AS RowNum
    FROM
        dbo.Table2
)
,cteTable3 AS (
    SELECT
        Column1,
        Column3,
        ROW_NUMBER() OVER (PARTITION BY Column1 ORDER BY ModifiedDateTime DESC) AS RowNum
    FROM
        dbo.Table3
)
SELECT
    *
FROM
    cteTable1 t1
    LEFT OUTER JOIN cteTable2 t2
        ON t1.Column1 = t2.Column1
        AND t2.RowNum = 1
    LEFT OUTER JOIN cteTable3 t3
        ON t2.Column2 = t3.Column1
        AND t3.RowNum = 1
WHERE
    t1.RowNum = 1

This is very inefficient when there could be 10+ joined tables and the tables are large.

  1. Viorel 127K Reputation points

    What is the criterion of efficiency?

  2. JamesHarsh-1587 0 Reputation points

    Currently large tables are taking up to 25 minutes to return results. So 10 minutes or less would be a huge win.

  3. Viorel 127K Reputation points

    There are alternatives to ROW_NUMBER that could be tried.
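
One alternative that could be tried is OUTER APPLY with TOP (1). The comment above does not name a specific technique, so this is only a sketch: it fetches the newest matching row for each outer row and behaves like the LEFT OUTER JOIN in the question when no match exists. Table and column names come from the question; the alias T2Column2 is made up here to avoid a duplicate column name. Whether this beats ROW_NUMBER depends on the data and on having an index that supports the lookup, so compare the execution plans.

```sql
-- Sketch only: assumes dbo.Table1 and dbo.Table2 as described in the question.
WITH cteTable1 AS (
    SELECT Column1, Column2,
           ROW_NUMBER() OVER (PARTITION BY Column1 ORDER BY ModifiedDateTime DESC) AS RowNum
    FROM dbo.Table1
)
SELECT t1.Column1,
       t1.Column2,
       t2.Column2 AS T2Column2,   -- hypothetical alias
       t2.Column3
FROM cteTable1 AS t1
OUTER APPLY (
    -- Newest Table2 row for the current Column1 value
    SELECT TOP (1) x.Column2, x.Column3
    FROM dbo.Table2 AS x
    WHERE x.Column1 = t1.Column1
    ORDER BY x.ModifiedDateTime DESC
) AS t2
WHERE t1.RowNum = 1;
```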



Answer recommended by moderator

Erland Sommarskog 134.7K Reputation points MVP Volunteer Moderator

I think you should still try writing this with temp tables as an experiment, to verify that we are on the right track.

If it has to be views, it could be bundled in a multi-statement function that you call from a view.
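
A rough sketch of what that bundling might look like for one table follows. The function name, view name, and column data types are assumptions for illustration (the thread does not give them), and the PRIMARY KEY assumes Column1 is never NULL. A multi-statement function cannot use temp tables, so the intermediate result goes into the function's return table variable instead. GO is the batch separator used by SSMS and sqlcmd.

```sql
-- Hypothetical names and types; adjust to the real schema.
CREATE FUNCTION dbo.fnLatestTable1 ()
RETURNS @latest TABLE (
    Column1 int NOT NULL PRIMARY KEY,   -- assumed type; assumes no NULLs
    Column2 int NULL                    -- assumed type
)
AS
BEGIN
    -- Materialise the latest row per Column1 once
    INSERT INTO @latest (Column1, Column2)
    SELECT Column1, Column2
    FROM (
        SELECT Column1, Column2,
               ROW_NUMBER() OVER (PARTITION BY Column1 ORDER BY ModifiedDateTime DESC) AS RowNum
        FROM dbo.Table1
    ) AS T
    WHERE RowNum = 1;

    RETURN;
END;
GO

-- The view simply exposes the function's result
CREATE VIEW dbo.vwLatestTable1
AS
SELECT Column1, Column2
FROM dbo.fnLatestTable1();
GO
```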

What you can try is to make sure that all tables have an index on (Column1, ModifiedDateTime DESC). That may give you better performance, but I am not entirely sure.
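
For reference, such indexes could be created roughly as follows. The index names are made up, and the INCLUDE lists are an addition not mentioned above: they assume the queries only read the columns shown in the question, so that the index covers them.

```sql
-- Hypothetical index names; INCLUDE columns are an assumption based on the question's SELECT lists.
CREATE NONCLUSTERED INDEX IX_Table1_Column1_Modified
    ON dbo.Table1 (Column1, ModifiedDateTime DESC)
    INCLUDE (Column2);

CREATE NONCLUSTERED INDEX IX_Table2_Column1_Modified
    ON dbo.Table2 (Column1, ModifiedDateTime DESC)
    INCLUDE (Column2, Column3);

CREATE NONCLUSTERED INDEX IX_Table3_Column1_Modified
    ON dbo.Table3 (Column1, ModifiedDateTime DESC)
    INCLUDE (Column3);
```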


2 additional answers

  1. Erland Sommarskog 134.7K Reputation points MVP Volunteer Moderator

    It may be better to materialise the queries into temp tables like this:

    ; WITH CTE AS (
     SELECT Column1, Column2, 
     ROW_NUMBER() OVER (PARTITION BY Column1 ORDER BY ModifiedDateTime DESC) AS RowNum
     FROM Table1
    )
    SELECT * INTO #temp1 FROM CTE WHERE RowNum = 1
    
    ALTER TABLE #temp1 ADD PRIMARY KEY (Column1)
    

    And then you join these temp tables together.

    1. JamesHarsh-1587 0 Reputation points

      Unfortunately, I am unable to use temp tables as these are contained in views.

    2. Lakshmi Narayana Garikapati 1,330 Reputation points Microsoft External Staff Moderator

      Hi James H,

      I'm following up to check if the response provided below was helpful.

      Please let us know if your issue has been resolved. If you still have any questions or need further assistance, don't hesitate to get in touch; we're happy to continue supporting you.

      We appreciate your participation and thank you for being an active member of the community.

      Best regards,

      Lakshmi

  2. AI answer

    To optimize your query for selecting the most recent data per record from multiple tables, consider the following strategies:

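    1. Indexing: Ensure that each table has an index on the partitioning column followed by ModifiedDateTime, for example (Column1, ModifiedDateTime DESC). An index on ModifiedDateTime alone does not match the PARTITION BY Column1 ORDER BY ModifiedDateTime DESC pattern, whereas the composite index can let SQL Server avoid a sort when computing ROW_NUMBER().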
    2. Temporary Tables: Instead of using Common Table Expressions (CTEs), you might want to use temporary tables to store the results of the latest records from each table. This can reduce the overhead of multiple CTEs and improve performance. Example:
       SELECT Column1, Column2, ModifiedDateTime
       INTO #TempTable1
       FROM (
           SELECT Column1, Column2, ModifiedDateTime,
                  ROW_NUMBER() OVER (PARTITION BY Column1 ORDER BY ModifiedDateTime DESC) AS RowNum
           FROM dbo.Table1
       ) AS T
       WHERE RowNum = 1;

       SELECT Column1, Column2, Column3
       INTO #TempTable2
       FROM (
           SELECT Column1, Column2, Column3, ModifiedDateTime,
                  ROW_NUMBER() OVER (PARTITION BY Column1 ORDER BY ModifiedDateTime DESC) AS RowNum
           FROM dbo.Table2
       ) AS T
       WHERE RowNum = 1;

       SELECT *
       FROM #TempTable1 t1
       LEFT JOIN #TempTable2 t2 ON t1.Column1 = t2.Column1;
      
    3. Batch Processing: If you have many tables to join, consider processing them in batches. This means you can join a few tables at a time and then combine the results in subsequent steps.
    4. Query Optimization: Review the execution plan of your query to identify any bottlenecks. SQL Server Management Studio provides tools to analyze and optimize your queries.
    5. Partitioning: If your tables are very large, consider partitioning them based on the ModifiedDateTime or other relevant columns. This can help SQL Server manage the data more efficiently.

    By implementing these strategies, you should be able to improve the performance of your query significantly.
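
As a small illustration of the execution-plan point, one way to get measurable numbers before and after a change is to turn on I/O and timing statistics around a single step. This is a sketch using the question's Table1; the output appears in the Messages tab in SSMS and can be read alongside the actual execution plan.

```sql
-- Report logical reads and CPU/elapsed time for the statements that follow
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Step under test: latest row per Column1 from one table
SELECT Column1, Column2
FROM (
    SELECT Column1, Column2,
           ROW_NUMBER() OVER (PARTITION BY Column1 ORDER BY ModifiedDateTime DESC) AS RowNum
    FROM dbo.Table1
) AS T
WHERE RowNum = 1;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

Running the same step against each table separately may help show which table dominates the 25-minute runtime.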

