TLDR || I work support for Office 365 on a special team with unlimited cosmic powers. I have a small toolbox of powerful tools. Read more to learn about those tools and how I use them.
What are your top troubleshooting tools at work?
At work my core tools are few, and sometimes they are misunderstood by both support and customers. That frustration has led me here to vent = ] There are only a few scenarios where each given tool is useful; anything outside of those, in my mind, turns using the tools into busy work and gives me sad face. Along those lines, below is my personal toolbox for troubleshooting Office 365 connectivity issues, and my personal opinions on how those tools should and should not be used – if you’re asked to use a tool for a different reason, you might want to ask why.
TIMBER – Outlook Advanced Logging
Timber is an internal-only Microsoft tool used to read Outlook advanced logging ETL files. ETL files trace / log the calls / ROPs Outlook makes. Being an internal tool, me mentioning it does not help the internet much. Best I can share is what I use it for, to give you an idea of why you would want to provide an ETL file to Microsoft when asked. Outlook logging is a great tool to use to troubleshoot Autodiscover, Free Busy, Connectivity, Performance, Authentication, alternative mailbox access, delegate things, calendaring things, ETC. For the most part, if you’re asked for an ETL trace you might as well provide one, because it’s good data we can use for all kinds of things.
Netmon – Packet Capture
Yes I am old school, Yes I know Netmon is not supported anymore / no dev is working on it anymore- ETC. Windows 10 reminds me every time I install a new build that Netmon is not supported, but dammit I am almost 40, get off my lawn I don’t want to change what I know very well. I guess I should use Message Analyzer, but I am mentally a teenager at heart and I cannot say Analyzer without stressing the first four letters (ANAL – yzer.) and giggling a bit. In my resistance to moving to analyzer, I’ve been known to use Wireshark as a packet capture and reviewing tool.
Whichever tool is being used for packets, the purpose of the packet capture is to view raw traffic from the viewpoint of the device where the packets are being captured. Most commonly we obtain a packet capture from a client machine experiencing connectivity issues to the Office 365 service. Connectivity is about it, because our core service access clients (web browser, Outlook, Skype, and mobile devices) use HTTPS SSL encrypted protocols to talk to the service, so it’s not possible to view the actual things going over the wire. To see the packets, Netmon; to see inside of the packets, Fiddler. When asking for a packet capture we want to look for the following things in the data you provide back:
- Duplicate and Retransmits – We look for duplicate, and fast or slow retransmitted packets at a critical enough mass to indicate a connectivity issue. Critical enough mass you say? Yes One or two is meh whatever, I don’t care much, but blocks of 10-20 packets in a row says something be wrong here and we should dig deeper.
- Out of Order – Packets out of order indicate something in the middle is messing with the packets. In Exchange land the middle would most likely be a WAN accelerator device like a Riverbed, or an HTTP proxy like Bluecoat. These devices drop some packets, compress other packets, modify some, borrow from cache for others, give others higher or lower priority, ETC (muck about with packets). The devices are generally aware of the Office 365 protocol stack at the application level and have tools to deal specifically with, say, Outlook traffic. Sometimes the devices work, sometimes they don’t. We don’t officially support or not support WAN optimization devices. We do, however, reserve the right to ask customers to bypass and or turn them off as part of troubleshooting – https://support.microsoft.com/en-us/kb/2690045
- DNS resolution – DNS calls are plain text and can be read in a capture without magic decoder rings. I look for the A record and CNAME query and response packets to work out what IP the client believes the Office 365 service lives on. For multitenant customers, everything should point to outlook.com, and then Global DNS magic at Microsoft works out the best datacenters to respond back to the client with. For dedicated customers, the customer has a CNAME vanity record of their own FQDN that points to an A record specific to the company’s resources in our datacenters. DNS is key to finding resources, and knowing which IPs the names resolved to helps when reading a packet capture that the tool has not broken down by application.
- SYN ACK FIN RST (Three-way Hand Shake) – Looking for a proper handshake when a socket is established, and then a proper teardown at the end. Check out this ancient, still relevant KB explaining the handshake – https://support.microsoft.com/en-us/kb/172983 Errors here indicate stuff like: something in the middle doing odd things with the packets, a server telling a client to bugger off for some reason, low layer connectivity issues, ETC. More investigation is needed here.
- Time – How long does it take the packets to make a round trip – if that number be large then customers normally have a Sad face.
- PAC files – PAC files are things proxy servers send to help configure a client to use a proxy server. I want to use Wireshark here to find the packet, follow the conversation, and put it all together for me. Some PAC files take more bits than fit in a single packet, and I’m too lazy to build them on my own. With the whole PAC file, I can see the entire list of the bypass addresses the proxy server is configuring the client to honour. My desire here is to see all of my Office 365 services listed for bypass. When I see that I am happy. When I don’t see it, I get sad face and figure that’s part of the issue.
- Synced Captures – Here be the difficult to do, here be Dragons, here be massive time and coordination commitment. I see these as last resort captures we should try to avoid unless we HAVE TO. To capture synced, we need the client to capture on their machine or device, we need to point the device or machine directly at a single network device or server on our side, and then we capture at the same time on our side on the thing they are pointed at. The result of this complicated dance is a complete picture of the entire conversation. We can see the packet sent from the client, see how or if it is received by the server, then see how the server responded back and what the client heard of that response.
Most common finding here is a device in the middle doing something wrong. Next challenge- we need to fix that device; never a simple task.
For me the synced capture is the most frustrating use of the packet capture, because the most common root cause is a client device and I feel like I’m supplying network support to the customer because A. Outlook is the best network monitoring tool ever, and B. no one else seems to understand or own the entire end to end picture. You have groups doing Client / Desktop, LAN, WAN, Exchange, Wireless, security, VPN, AD; all of the things are siloed into groups, and it feels to me like no one gets or owns the whole picture outside of some of us old school Exchange peeps with wide foundations.
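The "critical enough mass" rule of thumb for duplicates and retransmits from the list above can be sketched in a few lines. This is a minimal sketch, not how Netmon or Wireshark do it: it assumes you have already exported one true/false flag per packet (retransmit or duplicate, yes or no) from whatever capture tool you use, and the function names and the threshold of 10 are my own picks based on the "blocks of 10-20 packets in a row" guidance.

```python
from itertools import groupby


def longest_retransmit_run(flags):
    """Return the length of the longest consecutive run of True flags.

    `flags` is a hypothetical per-packet list: True means the capture tool
    marked that packet as a duplicate or retransmit.
    """
    runs = (sum(1 for _ in group) for is_retx, group in groupby(flags) if is_retx)
    return max(runs, default=0)


def worth_digging(flags, threshold=10):
    """One or two retransmits is meh; a block of `threshold` or more is not."""
    return longest_retransmit_run(flags) >= threshold


if __name__ == "__main__":
    # Made-up capture summary: a couple of stray retransmits, then a block of 12.
    sample = [False, True, False, False, True, True] + [False] * 5 + [True] * 12
    print(longest_retransmit_run(sample), worth_digging(sample))
```

The point of looking at runs rather than a raw count is that a long capture can have plenty of scattered retransmits and still be healthy.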
Outside of Raw connectivity and DNS I can’t come up with many uses for Netmon because 95% of the packets are encrypted unless you do some odd decryption certificate dark magic you should never do without a smart blue hat person holding your hands. I am not that smart person.
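For the DNS part, working out what IP the client believes the service lives on is mostly a matter of following CNAME answers until you hit an A record. Here is a rough sketch of that, assuming the DNS answers have already been pulled out of the capture into simple (name, type, value) tuples. The record names and addresses below are invented placeholders, not real Office 365 records.

```python
def resolve_chain(records, name, max_hops=10):
    """Follow CNAME answers from `name` until A records turn up.

    `records` is a hypothetical list of (name, rtype, value) tuples copied
    out of the DNS response packets. Returns (chain_of_names, addresses).
    """
    current = name.lower().rstrip(".")
    chain = [current]
    for _ in range(max_hops):  # guard against CNAME loops
        matches = [(t, v) for n, t, v in records if n.lower().rstrip(".") == current]
        addresses = [v for t, v in matches if t == "A"]
        if addresses:
            return chain, addresses
        targets = [v for t, v in matches if t == "CNAME"]
        if not targets:
            break
        current = targets[0].lower().rstrip(".")
        chain.append(current)
    return chain, []


if __name__ == "__main__":
    # Placeholder answers standing in for what a capture might contain.
    answers = [
        ("mail.contoso.example", "CNAME", "service.example.net"),
        ("service.example.net", "A", "192.0.2.10"),
    ]
    print(resolve_chain(answers, "mail.contoso.example"))
```

If the chain ends with no addresses, or ends somewhere you did not expect, that is the hint to go look at the customer's DNS before anything else.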
Fiddler – HTTPS capture and Decrypt tool
My go to tool for all things OWA and EWS is Fiddler. Fiddler shows me the full decrypted conversation, and it provides server hints I can take back to the service and use to find server side logs. Fiddler is an external tool that started as an internal tool. The OEM dev worked at Microsoft when he first dropped Fiddler, then left and took it with him, and has continued to make the tool more awesome ever since. When we ask for Fiddler we’re asking the customer to download Fiddler and then configure it to decrypt packets.
Fiddler sets itself up as an HTTPS proxy on the local client via Windows / IE proxy settings. For the most part we’re only going to capture packets from products configured to respect those settings. IE being the main client for OWA, it respects itself. There are methods to configure Outlook, ActiveSync and Remote PowerShell to use the Fiddler proxy, but those are complicated and don’t often help. They should be reserved for odd edge cases and have some hefty justification behind the requests. When provided with a Fiddler trace we’re looking for the following things:
- Verbose Errors – OWA does this bugger of a thing: it tells the customer Sad face in the visible UI, and hides the real verbose error deep in the response. We use Fiddler to find an error packet and view the JSON or RAW response to see the full verbose error. The verbose error provides a better idea of what actually happened – plus it’s generally filled with informative nerd words we can find in cases and or source code to tell us what really happened and or changed to cause the Sad face the customer saw.
- X-Headers – In the header data of a response from Office 365 we respond back with the frontend and backend server names used for this packet and or customer session. Knowing the server and being able to math the time to UTC, we have a huge hint we can take back to the service and use to look at service side logs. We have <big number> of servers. Narrowing a search to the ones involved in a conversation is not always a simple task. Exchange 2013 and 2016 made a huge change here, and most logs are on the Mailbox server. Notice I said MOST here.
- JSON – I love me looking at the JSON packets, because they are packets containing data the server is replying back with in a nice readable format. So much simpler to read than RAW HEX – unless your name is Wes, then you read raw HEX and packets for breakfast.
- Authentication – Fiddler exposes authentication packets and such between client and servers, which can tell us all kinds of things about cannot connect, or login, or cross forest, or share, or ETC issues.
- FreeBusy – Free busy is all about EWS calls, yes, even inside of OWA. You can see OWA freebusy calls in Fiddler. You don’t see Outlook freebusy calls in Fiddler; for that we have Timber and Outlook advanced logging. Plus Outlook needs Autodiscover working for FreeBusy to work correctly, so you start troubleshooting there. When we troubleshoot free busy we want to remove as many variables as possible from the scenario, so we have the customer try in OWA running directly on a server, and we use Fiddler to see everything.
- Page Renders – Customer says X does not show up when they look at X page. Where I might have asked for a screen shot in the past as proof to show me the error, now I ask for Fiddler because it shows me so much more. Fiddler shows me everything before, during, and after the error. I’m like Cookie Monster of old asking for cookies when I ask for Fiddler after the customer says OWA. KEV WANT FIDDLER
- Cookies – I’ve solved some tickets looking at cookies before – not the monster kind but the tracking kind.
- A few other – A few other random things not coming to me right now and or I have not come up with yet because always learning.
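The X-Headers trick from the list above (grab the server names, math the time to UTC) is easy to script once the response headers are saved out of Fiddler. A minimal sketch follows. It assumes the headers are in a plain dict and that the server-name headers start with "X-" and end in "Server"; the header names and values in the sample are placeholders I made up for illustration, so check them against what your own trace actually shows.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def server_hints(headers):
    """Pull server-name style X- headers and the response time in UTC.

    `headers` is a hypothetical dict of response headers copied from a trace.
    Returns (hints, when_utc); when_utc is None if there is no Date header.
    """
    lowered = {key.lower(): value for key, value in headers.items()}
    hints = {
        key: value
        for key, value in lowered.items()
        if key.startswith("x-") and key.endswith("server")
    }
    when_utc = None
    if "date" in lowered:
        when = parsedate_to_datetime(lowered["date"])
        if when.tzinfo is None:
            # A Date value without a usable offset is treated as UTC here.
            when = when.replace(tzinfo=timezone.utc)
        when_utc = when.astimezone(timezone.utc)
    return hints, when_utc


if __name__ == "__main__":
    sample = {
        "X-FEServer": "FRONTEND01",  # placeholder server names
        "X-BEServer": "BACKEND02",
        "Date": "Tue, 05 Apr 2016 07:30:00 -0700",
        "Content-Type": "application/json",
    }
    print(server_hints(sample))
```

Server name plus a UTC timestamp is usually enough to narrow a log search down from <big number> of servers to one or two.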
Microsoft Remote Connectivity Analyzer – https://testconnectivity.microsoft.com
The Remote Connectivity Analyzer, in my head it’s called TestExchangeConnectivity.com, is a set of synthetic client tests we provide for customers to test connectivity to on-premises and or the Office 365 service. The key output from the tests tells us what happens when a client tries to use the protocol being tested – Autodiscover, ActiveSync, EWS, IMAP, and the message header analyzer are my big uses here.
Server Side Log Files
We log metadata about almost every connection a client makes to a server. The same things you log on premises, we log in the cloud. I look here for HTTP 200, 401, 500, 403, ETC in the calls, for errors if we logged them, and for budget words to determine if the customer is hitting budget thresholds causing odd errors.
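Counting the HTTP status codes in a pile of log lines is the sort of first pass I mean. Below is a small sketch of it. The log layout (a CSV with time_utc, server, and status columns) and the sample rows are invented for illustration and are not the real service log format.

```python
import csv
import io
from collections import Counter


def tally_statuses(log_text):
    """Count HTTP status codes in a CSV log with a 'status' column (assumed layout)."""
    reader = csv.DictReader(io.StringIO(log_text))
    return Counter(row["status"] for row in reader)


def error_share(counts):
    """Fraction of calls that came back 4xx or 5xx."""
    total = sum(counts.values())
    errors = sum(n for status, n in counts.items() if status.startswith(("4", "5")))
    return errors / total if total else 0.0


if __name__ == "__main__":
    # Made-up rows in a made-up format.
    sample_log = "\n".join([
        "time_utc,server,status",
        "2016-04-05T14:30:00Z,BACKEND02,200",
        "2016-04-05T14:30:02Z,BACKEND02,401",
        "2016-04-05T14:30:03Z,BACKEND02,200",
        "2016-04-05T14:30:09Z,BACKEND02,500",
    ])
    counts = tally_statuses(sample_log)
    print(counts.most_common(), error_share(counts))
```

A 401 or two during authentication is normal, so the share of errors is only a starting point for deciding which calls to read closely.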
Wrap up
Those, boys and girls, are my most used tools in troubleshooting Office 365 issues. My job is asking for data and hoping it tells me something. I hope the data is something I can take and search on to find more information in internal knowledge bases, in logs on servers, or, most often, to trace the error to the lines of code in source code to work out what really happened. For the most part I’ve given up searching the internet for the errors my team sees, because we’re often the first ones seeing the issues, and the only place to find the answer is in the source code.
I’m not sure I could even Exchange anymore without access to source code. I’ve been using source code access to look up my answers for so long I’ve forgotten how to look elsewhere.
Pictures
It’s my personal mission in blogging to add pictures to every post, but I can’t come up with any relevant picture that does not break NDA. I like my job so I respect my NDA. How about a few random pictures from the last week that have nothing to do with this post.
Maddex and I built a Lego Trevi Fountain last week
Larger kids and I built a mega tower out of Mega blocks yesterday morning
If you think my writing is a mess, check out my lab. Maddex and I soldered up a solar timer thing to move a wheel and drop balls down a spiral thing. It sits on the window near me at work now. Maddex loved the soldering.
We found a pair of Hulk hands at the goodwill outlet last week. It’s not simple doing things with Hulk hands
Even harder to do things with Hulk hands on your feet
Our new Eurovan needed a console the driver could reach without bending down to the floor – I built one.