Many moons ago I wrote a manifesto of sorts for my team at work. I’ve sent it out to the team and continue to talk about it at work. Personally, I’ve also continued to think about and mature the ideas. Figured it’s time to publish externally, because, why not. My career focus is fixing Exchange and I’ve written most of the following around Exchange, but the methods could be applied to any job where troubleshooting is the core competency.
Differential Diagnosis is the medical study and application of defined methods to distinguish between different diseases or conditions based on the combination of presented symptoms and results of diagnostic testing. If you’ve ever watched House, this is the method they describe and utilize on the show. In KevLand ™, differential diagnosis methods are applied directly to troubleshooting Exchange / Information Technology issues. Defining troubleshooting this way will normalize and improve engineer performance.
- Problem || Exchange / IT troubleshooting is an art form lacking a dominant standard process or written method.
- Answer || The medical field has a dominant and formally documented process to diagnose and treat conditions.
Most of us who are effective troubleshooters have organically developed our troubleshooting methods without a proper outline or understanding of what the method is. We just do what works and fix things. What I’m proposing is a defined method utilizing differential diagnosis as a basis to teach and perform troubleshooting of Exchange / IT issues.
I feel that we can become more effective at a skill when we clearly understand it vs. just going through the motions as we always have. My core job is fixing things. I troubleshoot and fix complicated issues for a living, I like being efficient at my job, and I’m always working to improve.
How do you troubleshoot with Differential Diagnosis? At its core it’s a top-down approach where you discover and understand the problem, list possible causes, prioritize the cause list from the most urgent to the least urgent, and finally eliminate causes with logic and testing. The version of differential diagnosis I use is composed of four core areas, with a heavy emphasis on the problem statement created in the Discovery phase. The four areas are as follows:
- Discovery || Obtain all information about the patient and symptoms. Create a list of symptoms and relevant patient history. In Exchange gather logs, gather data, create a problem statement, and ask more questions as needed when data is not discoverable in logs. Trust but verify everything the customer tells you. Make the problem statement as finite and precise as possible.
- Causes || List all possible causes for the symptoms presented. Doctors are encouraged to set bias aside and work with Occam’s Razor in mind (the simplest answer is normally the correct one). List causes for the problem statement created in step 1.
- Prioritize || Prioritize the list of conditions based on urgency of diagnosis, i.e. the conditions most likely to cause severe harm to the patient in the shortest time need to be at the top of the list to be eliminated first. Focus on the SLA-impacting issues from the list and eliminate them as quickly as possible.
- Eliminate and Solve || Eliminate or treat possible causes, beginning with the most impactful and dangerous condition first. Fix the problem, solve all of the things, rule the world, save the cheerleader, win the Powerball.
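For the code-minded, the four areas above can be sketched as a small Python loop. This is only an illustration under assumptions: the `Cause` record, the `sla_impact` score, the `is_cause` check functions, and the sample problem statement are all hypothetical names and data made up for the sketch, not part of any Exchange tooling.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Cause:
    name: str
    sla_impact: int               # hypothetical score: higher means more urgent
    is_cause: Callable[[], bool]  # hypothetical test that proves or eliminates


def diagnose(problem_statement: str, causes: List[Cause]) -> Optional[Cause]:
    """Prioritize the cause list, then eliminate causes one at a time."""
    # Discovery must come first: refuse to work without a problem statement.
    if not problem_statement.strip():
        raise ValueError("write the problem statement first")
    # Prioritize: most urgent (highest SLA impact) first.
    ordered = sorted(causes, key=lambda c: c.sla_impact, reverse=True)
    # Eliminate and Solve: test each cause until one is proven.
    for cause in ordered:
        if cause.is_cause():
            return cause
    return None  # nothing proven: go back to Discovery and revise the statement


# Made-up example data: Discovery gives the statement, Causes gives the list.
statement = "Outlook clients in one site cannot connect since last night's patch."
candidates = [
    Cause("client profile corruption", sla_impact=1, is_cause=lambda: False),
    Cause("expired certificate", sla_impact=3, is_cause=lambda: False),
    Cause("service not restarted after patch", sla_impact=2, is_cause=lambda: True),
]
found = diagnose(statement, candidates)
print(found.name if found else "no cause proven")
```

In real life the `is_cause` checks are the log reviews, traces, and repro attempts, and a `None` result is the signal that the problem statement needs another pass.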
Skipping some steps
Occasionally you’ll be presented with an issue where you don’t know or cannot come up with causes. In those situations, you work with the problem statement and you’ll most likely end up on the internet searching for the error. Chances are strong you’ll find other people who have run into similar cases. For these cases, determining and validating causes will be more of a one-at-a-time deal vs. creating a big list. You’ll progress through causes as you find them: test a possible cause, find another cause, test the next cause, find the winner, solve the problem, receive the medal from a hot royal dignitary. For all other issues, where you have the understanding and the issue is complicated enough, it helps to create the problem statement, list the possible causes, then validate things.
Doctors utilize a number of classifications during the discovery process to narrow down the affected systems, such as identification of the affected body systems (nervous, endocrine, respiratory, etc.), areas of pathological processes (the VINDICATE’M mnemonic: Vascular, Metabolic, Autoimmune, etc.) and the emergency medical paramedic’s ABCDE mnemonic. ABCDE is worked with the rule that you must deal with each step before moving to the next. A – ensure the airway is clear. Until the airway is clear, do not move on. Once the airway is clear, B – breathing: verify the patient is breathing, and so on down the line.
Doctors use visual, auditory, olfactory, and physical senses, plus talking to the patient, during initial discovery. They then use diagnostic tools and tests to prove or disprove possible diagnoses. In IT we have notes and attachments to tickets, words from the customer, screen shares, and a possible repro for our initial diagnosis. Our diagnostic tools are server log files, Fiddler traces, client logs, Netmon, pictures, event logs, Perfmon, MRTG, router logs, ETL traces, WinDbg, etc.
Discovery tends to be where problems are missed. Overlooking a small detail the patient is experiencing will limit future troubleshooting. Did you listen to the heart? Did you hear if it fluttered or did you miss that in your haste to keep going? The end result of your discovery phase should be a well written problem statement clearly defining what the problem is.
Problem statements are my pet passion at work. I’ve written PowerPoints and given training on them, I speak from my pulpit about them, and it’s the first thing I create whenever I touch an issue. I find it challenging to comprehend how engineers can attempt to deal with an issue without clearly understanding what the issue is; makes me shake my head.
A problem statement should be as close as possible to Twitter length (140 characters) and contain all details relevant to the issue, obtained directly from the customer and validated by any means during discovery. The image around here someplace is of a slide I used in my how-to-write-a-problem-statement class. It’s my create-a-problem-statement-with-Mad-Libs method of getting people started.
You can see the customer provided a limited subset of data when they opened their issue. After some discovery we were able to expand the problem statement to include an actionable concept. Key to the problem statement is that it is a living bit of words based on knowledge at a point in time. As you test, validate, and discover, the problem statement can and will change. Be willing to change it and let go of it if you need to.
A problem statement should be composed of most of the following:
- Clearly state the problem – We cannot read minds, we cannot assume information accurately.
- Define where the issue is occurring – For Exchange specifically: on a mobile device (what kind), Outlook version, in OWA as well as other clients, does the Remote Connectivity Analyzer replicate the issue, are all users in the same site, forest, hybrid, on-prem, cloud, etc.
- Audience of the problem should not need customer-specific knowledge – Don’t assume another engineer will know your customer’s MDM solution or firewall configs. If that data is relevant it should be in the problem statement or the supplemental supporting data.
- How do we replicate the issue, if we can?
- Put in huge glowing red flags when the customer mentions them – Things like “we patched last night” or “we changed DNS servers this morning.”
- Problem statement should be as short as it can be. Think Twitter post.
- Fewer words are much better than more words – Every word should have meaning and value.
- Two sentences max for the problem statement – If there is additional data (NDRs, logs, errors, etc.), add it as a supplement.
- Modify the problem statement as your understanding matures
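A few of the checklist items above are mechanical enough to check with a script. The following Python sketch is a rough aid only: the function name and the `RED_FLAGS` word list are my own assumptions for illustration, and the sentence counting is deliberately crude.

```python
import re
from typing import List

MAX_CHARS = 140      # "Twitter length" from the text
MAX_SENTENCES = 2    # "two sentences max"
# Hypothetical red-flag words; extend the list for your own environment.
RED_FLAGS = ("patched", "changed", "upgraded", "migrated")


def review_problem_statement(statement: str) -> List[str]:
    """Return review notes; an empty list means nothing stood out."""
    text = statement.strip()
    if not text:
        return ["problem statement is empty"]
    notes = []
    if len(text) > MAX_CHARS:
        notes.append(f"{len(text)} characters; aim for {MAX_CHARS} or fewer")
    # Rough sentence count: split on ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"[.!?]+\s+", text) if s]
    if len(sentences) > MAX_SENTENCES:
        notes.append(f"{len(sentences)} sentences; aim for {MAX_SENTENCES} or fewer")
    lowered = text.lower()
    for flag in RED_FLAGS:
        if flag in lowered:
            notes.append(f"red flag to highlight: '{flag}'")
    return notes


draft = "Users cannot send mail. It started after we patched last night."
print(review_problem_statement(draft))
```

A script like this cannot judge whether the statement is clear or accurate. It only catches length creep and reminds you to call out the red flags.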
The following is an example of a problem statement I wrote recently for a month-old issue that did not mirror how the case was progressing:
Problem statement || Customer is using the .NET System.Management.Automation namespace to call the local server version of PowerShell on a Windows 2008 R2 SP1 server to execute PowerShell commands against Exchange 2016 and randomly receives the PowerShell error “Encountered an internal error in the SSL library”
After creating an amazing problem statement, the next step is listing all possible affected components and causes for the problem statement, considering the Service Level Agreement (SLA) impact of each possible cause. SLA is the Exchange version of life-affecting trauma or mortality on the medical side. Again, if you have ever watched House, this is where he brought all of his doctor minions into a room and wrote out possible causes on a marker board, then dispatched them to break into someone’s house or run tests so they could circle or cross out each item on the list.
With Exchange we can directly categorize issues based on the affected components and dependencies. The goal is not to list every conceivable cause, only every likely cause. Remember Occam’s razor: among competing hypotheses, the one with the fewest assumptions should be selected. Devotion to the most likely causes hopefully prevents engineers from doing silly troubleshooting, like checking delivery queues when clients fail to connect with Outlook. Avoid silly, and try to be expansive within each component and dependency. E-mail “disappearing” is likely Outlook / customer caused, but it may just as likely be a transport rule, or email retention, or an IMAP account, an admin assistant, or countless other things.
The following are probable causes for the above problem statement:
Possible causes || SSL offloading by a customer network device, Layer 7 manipulation of the /PowerShell app pool on the customer network, incorrect PowerShell versions on the customer side, network connectivity, SSL error on an Exchange server randomly hitting the customer
In medical terms you prioritize the list of conditions based on urgency of diagnosis, i.e. conditions more likely to cause severe harm to the patient in the shortest time need to be at the top of the list to be eliminated first. If a patient arrives presenting severe bleeding from the head, complaining of foot pain, and asking for a blanket because they are cold, you should deal with the bleeding and head trauma before getting them a blanket. In Exchange and IT-focused disciplines, eliminate the broadest, highest dollar cost causes in the list as quickly as possible.
In the above example you’d check SSL certs on the Exchange servers first because that could have the widest impact. You’d then start working the problem from the customer network angle.
Eliminate and Solve
Take the prioritized list of causes you’ve created and test them with the intent to eliminate or prove them as the cause of the problem. Sometimes this will involve research so you can modify settings that affect the cause, reboot things, fail over, and fix the problem. Other times it might be asking the customer for more information, or it could be looking for items in logs that will provide data about the issue.
I like to boil it down to what needs to be rebooted, or who changed what. These two seem to cover most issues I fix.