I am observing low log_send_rate in my distributed AG setup. I understand that AG uses log stream and so I assume it should not have anything to do with data, but I was wondering if this has anything to do with data it is transmitting also and not only OS resources (Network, I/O)?
Basic metrics for consideration:
On source AG I don't see anything for 'Bytes Sent to Transport/Sec' counter so I cannot determine if that is bottleneck or not.
Please suggest if I missed anything that I should have included.
Thanks a lot @sqL_handLe and @Josh Darnell for your comments on this.
But actual cause in this particular issue was byte sector size missmatch on source (512) and destination (4096).
While checking which stage of AG was slowing down the process using aglatency-report-tool (because for distributed AG I can't generate latency report from SSMS), I figured it was at destination and not source!
Checked Error log on destination and it was filled with IO misaligned errors. (I know I should have looked at this earlier :P)
"There have been 43983616 misaligned log IOs which required falling back to synchronous IO. The current IO is on file F:\Log\mydb_log.ldf."
According to this article - this could mean
What Does This Message Mean?
SQL Server storage engine logic detects the disk Sector Size and will align the Transaction Log Files metadata and internal boundaries to match the Sector Size (either 512 or 4096 bytes). The error message 9012 is generated when SQL Server has detected that Log Entries have been written assuming a Sector Size different to the Sector Size found on the current SQL Server instance. This can happen in scenarios such as the below:
Log Shipping from a Production server with one disk sector size to a DR server with a larger sector size
AlwaysOn or Database Mirroring from a server with one disk sector size to a DR server with a larger sector size
When adding a 4K Native SSD card to hold the Transaction Log on a server where the Transaction Log was created conventional SAN disk with 512 sector size
To ensure consistency SQL Server may switch from performing asynchronous IO to synchronous IO. This can have an adverse performance impact if technologies like synchronous AlwaysOn or Mirroring are used.
In this scenario it is possible that log writes on an AlwaysOn replica or DBM secondary are switched to Synchronous IO. This could lead to higher than expected replication delays between the Primary and Replica(s)/Secondary
So I took some help from articles (MS tech community & KB3009974) to reach to the conclusion of adding TF 1800 to startup parameter on source, restarted SQL services and log_send_rate just boosted upto 200 MB/s.
Of course, The points you have mentioned have been very helpful to me reaching to the root cause of this issue and I thank you for that!
External links referenced by this document: