What we have discovered over the last 30 years of troubleshooting networks is that broadcast storms are among the most difficult problems to track down. Network administrators spend dozens of hours chasing chatty printers, rogue switches, and network cables plugged into a switch at both ends. Several times a year we see major network outages that cost tens of thousands of dollars to troubleshoot and remediate, not to mention the downtime incurred while something like this is resolved. And when we review the troubleshooting steps taken, they usually revolve around logging into a switch in a specific IDF or closet and manually running pings across the network.
Last year one of our local school districts reported that one of its main district offices could not log into the network or access network resources. The customer was new to us and was in the process of transitioning from a previous vendor. As we began troubleshooting, we noticed that this one department was accessing resources at another campus connected via fiber several miles away. Once we eliminated that uplink between campuses, the district office stopped seeing the issue. The connection was not heavily utilized after the migration to the cloud, so we thought we had resolved the problem. Several hours later, one school was having more issues than the others. This redirected our attention from the district office to that school. As soon as we entered the MDF, we could tell we had found the right school and that we had a broadcast storm on our hands. The switch was an aggregate core with about nine IDFs plugged into it.
One of the common signs of a broadcast storm is that the port lights on all the switches blink in unison.
So, by process of elimination, we began manually unplugging each IDF, waiting five minutes each time to see whether the switch would regain normal activity and the high latency would return to normal. After six or so IDFs were disconnected, we found the IDF in question. By removing the uplinks one by one, we were able to pinpoint the specific switch in the stack. Once again we plugged into the switch and began pinging the default gateway and an internet resource. Then, painstakingly, we removed each cable in the problematic switch one by one until we determined which port was causing the loop. And lo and behold, the end user had plugged a cable from the wall jack into another wall jack, creating the loopback. By this point the issue had consumed several engineers and several days' worth of labor.
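The "unplug and watch" step above is easier with a continuous latency readout running while you pull cables. As a minimal sketch (not the wirespeeds tooling, and the hosts/ports are hypothetical), the probe below measures TCP connect time to a target such as the default gateway's management interface; during a storm the samples spike or time out, and they snap back to normal the moment the looped uplink is disconnected:

```python
import socket
import time

def connect_latency_ms(host, port, timeout=2.0):
    """TCP connect time in milliseconds, or None if unreachable/timed out."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return None

def watch(host, port, interval=1.0, count=10):
    """Probe repeatedly and collect samples. During a broadcast storm the
    samples are high or None; they recover once the loop is removed."""
    samples = []
    for _ in range(count):
        samples.append(connect_latency_ms(host, port))
        time.sleep(interval)
    return samples
```

Running `watch("10.0.0.1", 22)` from a laptop in the MDF while a colleague pulls uplinks gives immediate feedback on which disconnect stopped the storm, instead of waiting five minutes per IDF and eyeballing switch LEDs.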
At wirespeeds, our core mission is network and end-user performance. Broadcast storms are difficult to track down and usually cause issues across the entire LAN and WAN. They negatively affect all users across all applications, even SaaS applications that are not on your network, and they are usually run to ground by troubleshooting with ping. That is why one of wirespeeds' core functions is to ping LAN, WAN, and internet resources so you can determine which location, which IDF, and which switch could be affected. Baselining performance metrics and comparing them across switches can reveal a lot about your network. Our goal is to provide actionable metrics about the end-user experience across all network endpoints, so you can have peace of mind that your network is running optimally and your end users are able to access all LAN, WAN, and SaaS-based applications.
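The baseline-and-compare idea can be sketched in a few lines. This is an illustration only, with hypothetical target names and a simple threshold rule rather than anything from the wirespeeds product: each probe target keeps a history of normal latencies, and anything currently exceeding several times its baseline (or unreachable) gets flagged. The telltale signature of a broadcast storm is most targets flagging at once, while a single bad uplink flags only one IDF:

```python
from statistics import mean

def flag_anomalies(baseline, current, factor=3.0):
    """baseline: {target: [historical latencies in ms]}
    current:  {target: latest latency in ms, or None if unreachable}
    Returns the targets whose latency exceeds baseline average * factor,
    plus any that are unreachable."""
    flagged = []
    for target, history in baseline.items():
        now = current.get(target)
        if now is None or now > mean(history) * factor:
            flagged.append(target)
    return flagged
```

With baselines for every IDF and the core, one storm-free week of data is enough to make the comparison meaningful: a storm lights up nearly the whole list, which points you at the aggregate switch rather than any single closet.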