They’ll swap your current licenses for another tool, and you just start paying them maintenance from here on out. I’ve got great news – other monitoring vendors will take trade-ins. I know – you think you don’t have a budget or approval to get a better tool.
They want to teach you how to get the most out of the tool because after all, if you can’t figure out how to use it, you’re probably going to switch tools. Your monitoring vendor wants to help! You can call their support and ask for a consulting engagement where they take control of your monitoring tool, walk you through it, and explain what they’re seeing in the tool. You have two options: learn to use your tool better, or get a better tool. Odds are, you don’t have a Level 4 tool, so when these emergencies happen, you’re going to be doing your own diagnostics using your monitoring tool. Don’t just tell me the disk is filling up – that’s a Wall of Numbers approach – tell me why.
If a log file or TempDB is full and growing, why? When someone or a broken app leaves a transaction open, I’ve got a ticking time bomb in terms of disk space. Ironically, the low-end monitoring tools show huge lock waits, but when you drill down, they only show the blocked queries experiencing the LCK* waits, leading DBAs down the completely wrong troubleshooting path. Who’s the lead blocker in a blocking chain? If you open the dashboard when queries are piled up waiting on locks, it should be immediately obvious who the problem is. (This happened to yet another client last month – their storage was actively corrupting their database, and yet their monitoring tool was happily singing along about CXPACKET being the biggest problem.) I’m not asking for it to assess my RPO/RTO and make sure my backups match – just at least make it obvious that my backups aren’t working.Īre databases corrupt? If there’s rows in _pages, sys.dm_hadr_auto_page_repair, or sys.dm_db_mirroring_auto_page_repair, we have a serious problem. There’s nothing more important to me as a data professional. If failed Agent jobs are gonna cause me to lose my own job, then the tool needs to make this blatantly obvious. Have databases not been backed up recently? When I open the dashboard under its default configurations, this needs to be a full-blown emergency. It’s easy to find out: on a dev box, trigger a few common SQL Server emergencies, and see how your monitoring tool surfaces that problem. When you open the dashboard, you don’t have to ask if you have a problem – the tool tells you what the problem is, and guides you to a solution, leveling up your skills along the way.
Level 4: Mentoring and Root Cause Analysis – tools that use this huge volume of data to actually tell you if your SQL Server is having a problem right now (or not!), and if so, what the root cause is.However, it’s still just a wall of numbers that you need to interpret: for example, if you’re seeing 100% CXPACKET waits, is it even a problem – or are we talking 6 seconds of CXPACKET waits per hour? For performance tuning, tools like this typically start with a wait stats dashboard, and then let you drill down into the specific query causing your highest wait type. Level 3: Advanced DMV Wall of Numbers – if it’s in a DMV, you get it here, including Availability Groups diagnostics, execution plan cache, and index utilization.A good example of this is New Relic’s SQL Server agent, and a good example of user questions for these tools are, “Why is Page Life Expectancy going up and down regularly in a consistent sawtooth pattern?” Level 2: Basic DMV Wall of Numbers – adds on a bunch of DMV queries for things like wait stats, but typically not the queries from the plan cache that are causing a given wait type.There’s nothing wrong with this tier – if you’re an advanced sysadmin, all you want is a metrics repository so you can do your own analysis, and you get exactly that here. It’s not SQL Server specific at all, just a bunch of Perfmon and WMI counters.
It auto-scales every graph so that you always see peaks and valleys, making it even harder to diagnose problems. Level 1: Inexpensive Wall of Numbers – a tool that shows you a big dashboard of numbers, but has absolutely no guidance whatsoever on what’s a normal range for that number.I think of monitoring tools in terms of maturity tiers: In many of these cases, the problem isn’t the people: the problem is that they bought a tool that doesn’t do what they think it’ll do. In our work as consultants, we see a lot of shops that have a monitoring tool, and yet they’re still not able to get to the root cause of a SQL Server performance or reliability issue.