The “Lean” way of resolving application performance issues at the network layer.

    In general, there are two types of network incidents: outage incidents and performance incidents. Troubleshooting an outage in the datacenter is quite straightforward. Analyzing a performance incident is much trickier. The network team’s opinion on most incidents is generally based on ping and traceroute results, but these two tools are of little use in diagnosing performance-related issues. Imagine that the application team calls you: a major application has started behaving aberrantly in the last 24 hours. Transactions are failing, and the team has handed you a bulky application-layer log full of errors. Ping and traceroute can only reveal latency problems or a node outage; this issue calls for a detailed analysis of the TCP transactions. What is the best way to start?