Tuesday, February 18, 2020

Sitecore 9.1 Reference Data database HIGH DTU AZURE PAAS - Sitecore KB 595419 - DTU FIRE




Hello Sitecore Engineers!

I recently worked with a 9.1 deployment that was affected by this issue, so I wanted to take a moment to discuss Sitecore KB 595419 and what the problem actually looked like in the wild in my scenario.

Here's the official KB from Sitecore on this:

https://kb.sitecore.net/articles/595419

In the KB, Sitecore mentions that the issue "Might Degrade Site Performance".

What this actually meant when it occurred was a 10-18 second TTFB (time to first byte) when loading the site homepage. This was easy to see in Chrome Developer Tools. Initially it looked a little like a CDN issue; however, when I tested by loading the content directly from the origin nodes, bypassing the CDN, the 10-18 second delay was still there.
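If you want to repeat that CDN-versus-origin comparison outside the browser, here is a minimal sketch in Python using only the standard library. The function name measure_ttfb and the hostnames in the usage comment are my own placeholders, not anything from the KB, and the number it reports will not match the Chrome "Waiting (TTFB)" figure exactly. Treat it as a rough cross-check rather than a benchmark.

```python
import http.client
import time
from urllib.parse import urlsplit


def measure_ttfb(url: str, timeout: float = 30.0) -> float:
    """Return the seconds between sending a GET and receiving the first body byte.

    The TCP/TLS connection is opened before the timer starts, so connection
    setup is excluded from the result.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    try:
        conn.connect()  # connect first so only the request/response wait is timed
        start = time.perf_counter()
        conn.request("GET", path)
        response = conn.getresponse()  # returns once the status line and headers arrive
        response.read(1)  # pull the first byte of the body
        return time.perf_counter() - start
    finally:
        conn.close()


# Hypothetical usage: compare the public (CDN) hostname with an origin node.
# Both hostnames below are placeholders for your own environment.
#   print(measure_ttfb("https://www.example.com/"))
#   print(measure_ttfb("https://origin.example.com/"))
```

If both numbers come back in the same 10-18 second range, the CDN is probably not the culprit and the delay is coming from the origin.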

I was actually led to the right troubleshooting path by looking at a stack trace taken from the CD App Service.

In the stack trace there were MANY references to sitecore.analytics.blahblahblah, but there was one important reference to sitecore.xdb.referencedata.client.



After seeing sitecore.xdb.referencedata.client in the trace, I looked at the Reference Data database and saw that it was pegged at 100% DTU usage. After finding the KB, I applied Sitecore.Support.312397.sql and restarted the Reference Data App Service. The 10-18 second delay on the homepage disappeared and the problem was resolved.

Apparently this issue can present itself with different symptoms. In my case the DB was completely pegged; however, if you look at other references like this one, you will see that it can look very different DTU usage wise.

So a few things to pass on regarding this issue:

I recommend that Sitecore production environments be instrumented with an APM solution. Azure Application Insights will work, but I personally prefer AppDynamics and New Relic.

If you don't have APM and you are working in an Azure PaaS environment, it's going to be important to know your Azure App Service diagnostic tools. Specifically, knowing how to collect a Profiler trace is extremely important. Read up on that HERE
 


To make problems like this easier to spot, I would suggest creating DTU alert rules that fire when DTU usage stays above 90% over the last 15 minutes, for every database within the Sitecore resource group in question.

Here is a great way to do that in PowerShell:  LINK  (Thanks Georgi!)
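To make the rule itself concrete, here is a small Python sketch of the check I have in mind. It is only an illustration of the logic, not the PowerShell from the link and not how Azure Monitor evaluates alerts internally. It assumes you have samples shaped like the rows of the sys.dm_db_resource_stats view in Azure SQL (avg_cpu_percent, avg_data_io_percent, avg_log_write_percent), that DTU percentage is taken as the highest of those three values, and that "above 90% for the last 15 minutes" means the average over the window. The names ResourceSample, dtu_percent and should_alert are my own.

```python
from statistics import mean
from typing import Iterable, NamedTuple


class ResourceSample(NamedTuple):
    """One utilization sample, modeled on a row of sys.dm_db_resource_stats."""
    avg_cpu_percent: float
    avg_data_io_percent: float
    avg_log_write_percent: float


def dtu_percent(sample: ResourceSample) -> float:
    """Assumption: DTU usage is the highest of the CPU, data IO and log write percentages."""
    return max(
        sample.avg_cpu_percent,
        sample.avg_data_io_percent,
        sample.avg_log_write_percent,
    )


def should_alert(samples: Iterable[ResourceSample], threshold: float = 90.0) -> bool:
    """Return True when the average DTU percentage over the window exceeds the threshold.

    The caller is expected to pass only the samples from the window of
    interest, for example the last 15 minutes. An empty window never alerts.
    """
    values = [dtu_percent(s) for s in samples]
    return bool(values) and mean(values) > threshold


# A database pegged on data IO for the whole window should trip the rule.
pegged_window = [ResourceSample(35.0, 100.0, 12.0)] * 60
print(should_alert(pegged_window))
```

Averaging over the window keeps a single short spike from paging anyone, while a database that stays pegged the way mine was will trip the rule quickly.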



I hope this helps someone out there!

Colin Cooper
Sitecore Engineer
