So a while back we were seeing some very strange behavior with our SQL Replication. Every once and a while for no apparent reason the log reader would just slow down the pulling of records from a single publisher database. Our replication was setup with a single publisher, and a single distributor with over a dozen publications all being sent to a couple of different subscribers.
At random times we would see the latency for all the publications for a single database start to climb, eventually being a few hours behind for no apparent reason. Looking in the normal places didn’t lead me to much. I looked at some execution plans, and saw a couple of performance issues there (with the Microsoft code) so I threw a couple of new indexes onto the MSlogreader_agents and MSsubscriptions tables (see below) and I also made a couple of tweaks to the sp_MSset_syncstate procedure to fix some of the pathetic code which I found within the procedure (you’ll also find this below).
This helped a little, but it didn’t solve the problem. What did was when I queried the sys.dm_os_waiting_tasks dynamic management view. This showed a large number of processes with a wait_type of TRACEWRITE, and these were waiting long enough that blocking was actually starting to pop up (very sporadically making it very hard to see). A query look at sys.traces told me that there were three traces running against the server. I knew that I didn’t have one running, so I took the session_id values which were shown in sys.traces and looked in sys.dm_exec_sessions for those session IDs to find out who needed to be kicked in the junk. Turns out that the traces were being run by Quest Software’s Spotlight for SQL Server Enterprise’s Diagnostic Server (the program_name column read “Quest Diagnostic Server (Trace)”).
So I logged into the diagnostic server’s via RDP, and opened Spotlight . Then edited the properties for the server which is our distributor. Then I opened the SQL Analysis window, and disabled the SQL Analysis for this server. Pretty much as soon as I clicked OK through the windows the TRACEWRITE locks went away, and the latency went from 2 hours down to 0.
This just goes to show, just how careful that you have to be when using SQL Profiler (or any sort of tracing) against your database server.
P.S. If you decide to make these changes to your distributor keep in mind that they may cause anything or everything to break, including patches, etc. that you try and install against the SQL Server engine. These changes were made for a distributor running SQL Server 2008 R1 build 10.0.1600, use against another build at your own risk. That said, here’s the code.
USE distribution GO CREATE INDEX IX_sp_MSget_new_errorid ON dbo.MSrepl_errors (id) WITH (FILLFACTOR=100) GO CREATE INDEX IX_sp_MSadd_logreader_history ON dbo.MSlogreader_agents (id) include (name, publication) GO CREATE NONCLUSTERED INDEX IX_sp_MSset_syncstate ON MSsubscriptions (publisher_id, publisher_db, article_id, subscription_seqno) include (publication_id) with (fillfactor=80) GO CREATE NONCLUSTERED INDEX IX_sp_MSset_syncstate2 ON MSsubscriptions (publisher_id, publication_id, sync_type, status, ss_cplt_seqno, publisher_db) include (article_id, agent_id) WITH (FILLFACTOR=90, DROP_EXISTING=ON) GO ALTER procedure sp_MSset_syncstate @publisher_id smallint, @publisher_db sysname, @article_id int, @sync_state int, @xact_seqno varbinary(16) as set nocount on declare @publication_id int select top 1 @publication_id = s.publication_id from MSsubscriptions s where s.publisher_id = @publisher_id and s.publisher_db = @publisher_db and s.article_id = @article_id and s.subscription_seqno < @xact_seqno if @publication_id is not null begin if( @sync_state = 1 ) begin if not exists( select * from MSsync_states where publisher_id = @publisher_id and publisher_db = @publisher_db and publication_id = @publication_id ) begin insert into MSsync_states( publisher_id, publisher_db, publication_id ) values( @publisher_id, @publisher_db, @publication_id ) end end else if @sync_state = 0 begin delete MSsync_states where publisher_id = @publisher_id and publisher_db = @publisher_db and publication_id = @publication_id -- activate the subscription(s) so the distribution agent can start processing declare @automatic int declare @active int declare @initiated int select @automatic = 1 select @active = 2 select @initiated = 3 -- set status to active, ss_cplt_seqno = commit LSN of xact containing -- syncdone token. -- -- VERY IMPORTANT: We can only do this because we know that the publisher -- tables are locked in the same transaction that writes the SYNCDONE token. -- If the tables were NOT locked, we could get into a situation where data -- in the table was changed and committed between the time the SYNCDONE token was -- written and the time the SYNCDONE xact was committed. This would cause the -- logreader to replicate the xact with no compensation records, but the advance -- of the ss_cplt_seqno would cause the dist to skip that command since only commands -- with the snapshot bit set will be processed if they are <= ss_cplt_seqno. -- update MSsubscriptions set status = @active, subscription_time = getdate(), ss_cplt_seqno = @xact_seqno where publisher_id = @publisher_id and publisher_db = @publisher_db and publication_id = @publication_id and sync_type = @automatic and status = @initiated and ss_cplt_seqno <= @xact_seqno OPTION (OPTIMIZE FOR (@automatic=1, @initiated=3, @publisher_id UNKNOWN, @publisher_db UNKNOWN, @xact_seqno UNKNOWN)) end end GO