Database Administration
sql-server availability-groups distributed-availability-groups
Updated Fri, 02 Sep 2022 04:18:20 GMT

What affects log_send_rate in Normal or Distributed AG other than OS resources?


I am observing low log_send_rate in my distributed AG setup. I understand that AG uses log stream and so I assume it should not have anything to do with data, but I was wondering if this has anything to do with data it is transmitting also and not only OS resources (Network, I/O)?

Basic metrics for consideration:

  • SQL Server 2019-CU16
  • Source RAM 1.5 TB, 48 CPU <> Destination RAM 128 GB, 48 CPU - Does difference in memory play any part here?
  • Both servers are in same DC, ping latency is <1ms. Destination server is VM.
  • ROBOCOPY test shows file transfer rate of ~100 MB/s
  • When high transaction log generation activities (like index maintenance or creation) is sent over to other replica - it transfers with rate of maximum 20 MB/s (Which is not expected). This is when log_send_queue piles up.
  • REDO rate on other side is good, no REDO queue piling up there.

On source AG I don't see anything for 'Bytes Sent to Transport/Sec' counter so I cannot determine if that is bottleneck or not.

Please suggest if I missed anything that I should have included.




Solution

Thanks a lot @sqL_handLe and @Josh Darnell for your comments on this.

But actual cause in this particular issue was byte sector size missmatch on source (512) and destination (4096).

While checking which stage of AG was slowing down the process using aglatency-report-tool (because for distributed AG I can't generate latency report from SSMS), I figured it was at destination and not source!

AGLatencyReportBefore

Checked Error log on destination and it was filled with IO misaligned errors. (I know I should have looked at this earlier :P)

"There have been 43983616 misaligned log IOs which required falling back to synchronous IO. The current IO is on file F:\Log\mydb_log.ldf."

According to this article - this could mean

What Does This Message Mean?

SQL Server storage engine logic detects the disk Sector Size and will align the Transaction Log Files metadata and internal boundaries to match the Sector Size (either 512 or 4096 bytes). The error message 9012 is generated when SQL Server has detected that Log Entries have been written assuming a Sector Size different to the Sector Size found on the current SQL Server instance. This can happen in scenarios such as the below:

  1. Log Shipping from a Production server with one disk sector size to a DR server with a larger sector size

  2. AlwaysOn or Database Mirroring from a server with one disk sector size to a DR server with a larger sector size

  3. When adding a 4K Native SSD card to hold the Transaction Log on a server where the Transaction Log was created conventional SAN disk with 512 sector size

To ensure consistency SQL Server may switch from performing asynchronous IO to synchronous IO. This can have an adverse performance impact if technologies like synchronous AlwaysOn or Mirroring are used.

In this scenario it is possible that log writes on an AlwaysOn replica or DBM secondary are switched to Synchronous IO. This could lead to higher than expected replication delays between the Primary and Replica(s)/Secondary

So I took some help from articles (MS tech community & KB3009974) to reach to the conclusion of adding TF 1800 to startup parameter on source, restarted SQL services and log_send_rate just boosted upto 200 MB/s.

AGLatencyReportAfter

Of course, The points you have mentioned have been very helpful to me reaching to the root cause of this issue and I thank you for that!





Comments (2)

  • +0 – Nice troubleshooting with the latency tool! Glad you tracked it down. — Jul 27, 2022 at 16:36  
  • +0 – Nice work! and great graphs :-) — Jul 27, 2022 at 18:26