Large DPM databases

Had an on-going issue recently where one of our DPM servers seems to crash over night. Eventually I found an error in the log stating that the DPM server had lost connectivity to the DPMDB database in the SQL instance. I thought this was a little odd because we have three DPM servers which all connect to instances on the same SQL server, and the other two servers had been fine.

A cursory check of the SQL server revealed nothing much, so I started looking at what DPM might have been doing. DPM has a number of housekeeping tasks that it runs to keep its database tables in order, and these run between midnight and 3am depending on the server load and how much each task has to do. It was between these times that we were seeing the crashes, so I started to check the jobs and the tables they look at.

It was quickly apparent that the table tbl_TE_TaskTrail was enormous, occupying nearly 16GB of the database. This seemed strange as one of those overnight processes is meant to clear out any records from this table that have a StoppedDateTime value older than 33 days ago, essentially meaning the table should only hold a little over a month’s worth of data. A quick check of the table running the SQL statement below showed we had a problem with this task:

SELECT TOP 100 * FROM dbo.tbl_TE_TaskTrail ORDER BY CreatedDateTime;

QueryForOldRecords-Cropped

If the clear-up jobs had been running successfully, we shouldn’t have records of jobs going back to 2015… So after discovering this, I ran the following to find out how many rows were older than the prescribed 33 days:

SELECT COUNT(*) FROM dbo.tbl_TE_TaskTrail WHERE StoppedDateTime < (GETDATE() - 33);

It returned a count of nearly 5 million rows… Hurp. No wonder the query was timing out!

So, how to clear this table down and allow DPM to continue on without impacting performance during the day or relying on the scheduled maintenance tasks, which demonstrably hadn’t been running correctly. Well, you can run the following SQL block against your DPMDB which will process the table in blocks of 50,000 at a time. It will delete rows associated with the tasks in the TaskTrail database first, before finally clearing the records from TaskTrail. Some of the tables might not have associated records in them, so don’t worry if some of the delete statements return results to the effect that they haven’t removed any rows, this is normal.

USE DPMDB_MY_DPM_DB -- Change this to the name of your DPM database
GO

DECLARE @GCTill DATETIME = GETUTCDATE() - 33
DECLARE @TempTable TABLE (TID GUID)
INSERT INTO @TempTable (TID) (SELECT TOP 50000 TaskID FROM dbo.tbl_TE_TaskTrail WHERE StoppedDateTime < @GCTill AND dbo.tbl_TE_TaskTrail.TaskID NOT IN (SELECT taskID FROM tbl_AM_AgentTask_Alerts) AND dbo.tbl_TE_TaskTrail.TaskId NOT IN (SELECT taskID FROM tbl_MM_MediaRequiredAlert))

DELETE FROM dbo.tbl_RM_RecoveryTrail_RecoverableObjects WHERE TaskID IN (SELECT TID FROM @TempTable)
DELETE FROM dbo.tbl_AM_AgentDeploymentTrail WHERE TaskID IN (SELECT TID FROM @TempTable)
DELETE FROM dbo.tbl_ARM_TaskTrail WHERE TaskID IN (SELECT TID FROM @TempTable)
DELETE FROM dbo.tbl_CM_InquiryResult WHERE TaskID IN (SELECT TID FROM @TempTable)
DELETE FROM dbo.tbl_MM_MediaRequiredAlert WHERE TaskID IN (SELECT TID FROM @TempTable)
DELETE FROM dbo.tbl_MM_Task WHERE TaskID IN (SELECT TID FROM @TempTable)
DELETE FROM dbo.tbl_MM_TaskTrail WHERE TaskID IN (SELECT TID FROM @TempTable)
DELETE FROM dbo.tbl_PRM_CloudRecoveryPointTrail WHERE TaskID IN (SELECT TID FROM @TempTable)
DELETE FROM dbo.tbl_PRM_ReferencedTaskTrail WHERE TaskID IN (SELECT TID FROM @TempTable)
DELETE FROM dbo.tbl_RM_CandidateDatasetsForSCAssociation WHERE TaskID IN (SELECT TID FROM @TempTable)
DELETE FROM dbo.tbl_RM_RecoveryTrail WHERE TaskID IN (SELECT TID FROM @TempTable)
DELETE FROM dbo.tbl_RM_ReplicaTrail WHERE TaskID IN (SELECT TID FROM @TempTable)
DELETE FROM dbo.tbl_RM_ShadowCopyTrail WHERE TaskID IN (SELECT TID FROM @TempTable)
DELETE FROM dbo.tbl_TE_TaskError WHERE TaskID IN (SELECT TID FROM @TempTable)
DELETE FROM dbo.tbl_TE_TaskTrail WHERE TaskID IN (SELECT TID FROM @TempTable)

So, how long does this take to run? I find on my SQL server (which also hosts two other DPM instances) I was processing 1000 records a minute, so a 50,000 record block would take five minutes to process. If your server is beefier, you can increase the number in the INSERT INTO statement to process more records at a time.

Hope this helps everyone out there.

Share and Enjoy:
  • Digg
  • StumbleUpon
  • Technorati
  • del.icio.us
  • Twitter
  • blogmarks
  • HackerNews
  • Tumblr
  • Posterous
  • email
Bookmark the permalink. Follow any comments here with the RSS feed for this post.
Post a comment or leave a trackback: Trackback URL.

10 Responses to Large DPM databases

  1. Markus

    Hi!
    I try to run your script to remove some 200k entries from the TaskTrail table. Unfortunately I get the following error:

    (500 row(s) affected)

    (0 row(s) affected)

    (0 row(s) affected)

    (0 row(s) affected)

    (0 row(s) affected)

    (0 row(s) affected)

    (0 row(s) affected)
    Msg 547, Level 16, State 0, Line 14
    The DELETE statement conflicted with the REFERENCE constraint “fk__MM_TaskTrailMediaInfo__TE_TaskTrail__TaskId”. The conflict occurred in database “DPMDB_SG_DPM1”, table “dbo.tbl_MM_TaskTrailMediaInfo”, column ‘TaskId’.
    The statement has been terminated.

    (0 row(s) affected)

    (0 row(s) affected)

    (0 row(s) affected)

    (0 row(s) affected)

    (0 row(s) affected)

    (0 row(s) affected)

    (0 row(s) affected)
    Msg 547, Level 16, State 0, Line 22
    The DELETE statement conflicted with the REFERENCE constraint “fk__MM_TaskTrail__TE_TaskTrail__TaskId”. The conflict occurred in database “DPMDB_SG_DPM1”, table “dbo.tbl_MM_TaskTrail”, column ‘TaskId’.
    The statement has been terminated.

    Any idea how I can get around the blocking contraints? DPM is 2016 with UR4

    Thanks for your help!

    Markus

    • Markus

      I added the dbo.tbl_MM_TaskTrailMediaInfo to the list of entries to delete. That fixed the error.

      USE DPMDB — Change this to the name of your DPM database
      GO

      DECLARE @GCTill DATETIME = GETUTCDATE() – 33
      DECLARE @TempTable TABLE (TID GUID)
      INSERT INTO @TempTable (TID) (SELECT TOP 50000 TaskID FROM dbo.tbl_TE_TaskTrail WHERE StoppedDateTime < @GCTill AND dbo.tbl_TE_TaskTrail.TaskID NOT IN (SELECT taskID FROM tbl_AM_AgentTask_Alerts) AND dbo.tbl_TE_TaskTrail.TaskId NOT IN (SELECT taskID FROM tbl_MM_MediaRequiredAlert))

      DELETE FROM dbo.tbl_RM_RecoveryTrail_RecoverableObjects WHERE TaskID IN (SELECT TID FROM @TempTable)
      DELETE FROM dbo.tbl_AM_AgentDeploymentTrail WHERE TaskID IN (SELECT TID FROM @TempTable)
      DELETE FROM dbo.tbl_ARM_TaskTrail WHERE TaskID IN (SELECT TID FROM @TempTable)
      DELETE FROM dbo.tbl_CM_InquiryResult WHERE TaskID IN (SELECT TID FROM @TempTable)
      DELETE FROM dbo.tbl_MM_MediaRequiredAlert WHERE TaskID IN (SELECT TID FROM @TempTable)
      DELETE FROM dbo.tbl_MM_TaskTrailMediaInfo WHERE TaskID IN (SELECT TID FROM @TempTable)
      DELETE FROM dbo.tbl_MM_Task WHERE TaskID IN (SELECT TID FROM @TempTable)
      DELETE FROM dbo.tbl_MM_TaskTrail WHERE TaskID IN (SELECT TID FROM @TempTable)
      DELETE FROM dbo.tbl_PRM_CloudRecoveryPointTrail WHERE TaskID IN (SELECT TID FROM @TempTable)
      DELETE FROM dbo.tbl_PRM_ReferencedTaskTrail WHERE TaskID IN (SELECT TID FROM @TempTable)
      DELETE FROM dbo.tbl_RM_CandidateDatasetsForSCAssociation WHERE TaskID IN (SELECT TID FROM @TempTable)
      DELETE FROM dbo.tbl_RM_RecoveryTrail WHERE TaskID IN (SELECT TID FROM @TempTable)
      DELETE FROM dbo.tbl_RM_ReplicaTrail WHERE TaskID IN (SELECT TID FROM @TempTable)
      DELETE FROM dbo.tbl_RM_ShadowCopyTrail WHERE TaskID IN (SELECT TID FROM @TempTable)
      DELETE FROM dbo.tbl_TE_TaskError WHERE TaskID IN (SELECT TID FROM @TempTable)
      DELETE FROM dbo.tbl_TE_TaskTrail WHERE TaskID IN (SELECT TID FROM @TempTable)

      Best,
      Markus

      • Hi Markus,

        The exact tables that are used will depend on the types of jobs that you’re running on the server, so you may have to add more tables into the checks. Many thanks for the tip on the addition of the extra table, I’ll keep a lookout for that myself in the future.

        In other news, I’ve got a Prem Support call open with Microsoft at the moment regarding 2016 CU4 – we have noticed that it is failing to prune expired disk-based replicas in its overnight jobs and manually running PowerShell commands to remove them causes a service crash. It seems that when removing disk/volume replicas, a SQL stored procedure is being called with three arguments instead of 2, so there should be a patch coming out soon… Removal of Exchange / SQL and other replicas isn’t affected…

  2. Markus

    One more hint: We took notice of the problem when our dpm console stopped working and froze due to the high latency of the sqlserver. Before finding your blog, I ran the database tuning advisor to see if additional idices could be helpful.
    The plan cache based tuning of the dpmdb delivered three recomendations within 5 minutes. I applied the recomendation and that brought the console back to life immediately. I found no negative effects of the additional index.

    • Cool, cheers for the tip! I have a SQL scheduled task that rebuilds any indexes in the database that are over 30% fragmented and have more than 1000 pages to keep the database in good condition. Seems strange to me that a database-heavy application like DPM wouldn’t have it’s own database maintenance tasks like this…

      • Markus

        Yup, rebuilding indexes is also part of my regular SQL-server maintenance.
        The good thing with the tuning assistant is that it creates new indexes that the database designer have not thought of or that were not needed because of the low number of database-entries. I my case that saved my day, as I could resume the backups in spite of the 250.000 outdated log entries in the database.

        • It’s a good job you spotted it sooner than I did… Took me quite a while to remove the 5 million records from the one database. Another had 9 million rows to remove…

          On another note, I had a call from MS today who advised that the issue I posted about separately (http://www.radionowhere.org.uk/wordpress/index.php/2017/11/pruning-dpm-replicas/): they have advised (as I thought) its a bug involved in removing some volume backups which causes a service crash and they’re working on a hotfix for it. You can use the script in that post (or any PowerShell command to remove recovery points) to remove the recovery points, however this may cause a service crash but the recovery point will still be removed. Keep watching this space for news on the hotfix…

Leave a Reply

Your email address will not be published. Required fields are marked *

This blog is kept spam free by WP-SpamFree.