Any network engineer (or engineer-to-be) who has seen a broadcast storm firsthand can tell you that it’s one of the worst things that can happen to your network. Broadcast storms come in different types and have different causes, and depending on the devices in use, the effects differ too.
The most common cause is an end user connecting a hub to the company network, and then by some mistake connecting that hub back to another switchport in the same network. The loop this creates will catch all frames passing by and keep them circulating. But it does not have to be a hub: the same can happen when both Ethernet ports of an IP phone are connected to a switch, or when a computer is plugged into a port while still connected to the company wireless and its network cards have been set to bridging mode. This is not so far-fetched: end users sometimes put laptop network cards into bridging mode to provide wireless to multiple laptops in a hotel room, for example.
Enough blaming the end users. Sometimes beginning network engineers somehow think it’s a good idea to disable spanning-tree. The brightest of them do it in a production environment. Another problem is connecting an access port of one VLAN to an access port of another VLAN, or having a native VLAN mismatch on a trunk link. Since this involves two VLANs, if a loop is accidentally created, spanning-tree protocol can’t always correctly figure out how to stop it, and the loop persists.
The consequences can differ. If it’s a user VLAN on your access switches and the problem is noticed fast, it may just be unicast frames stuck in a loop, affecting only the switch CPU, generating MAC address flaps (because the frame’s source address constantly pops up on different switchports), and eventually congesting bandwidth. But sooner or later, one of the devices is going to send a broadcast frame. And since a typical Windows computer asks for ARP information every two minutes, it’s going to be sooner. That’s when the fun starts. A broadcast frame is flooded out of every switchport. So is a unicast frame with an unknown destination, but a broadcast frame is processed in software by every end device, whereas a unicast frame not destined for the device is dropped by the network card. Processing each broadcast takes CPU cycles, and a storm of them (read: millions) can take a computer down in seconds.
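Those MAC address flaps are often the first visible symptom, so they are worth watching for. Here is a minimal Python sketch of the idea, assuming you already have a list of `(timestamp, mac, port)` observations from somewhere (parsed syslog, polled MAC tables, whatever you use). The function name, the input format, and the thresholds are made up for illustration and do not correspond to any vendor feature.

```python
from collections import defaultdict


def find_mac_flaps(observations, window_s=10.0, min_moves=3):
    """Return the set of MACs that changed port at least `min_moves`
    times within any `window_s`-second window.

    `observations` is an iterable of (timestamp_seconds, mac, port)
    tuples. This input format is an assumption made for this sketch.
    """
    last_port = {}                 # mac -> port where it was last seen
    moves = defaultdict(list)      # mac -> timestamps of recent port changes
    flapping = set()

    for ts, mac, port in sorted(observations):
        prev = last_port.get(mac)
        if prev is not None and prev != port:
            times = moves[mac]
            times.append(ts)
            # Forget port changes that fell out of the time window.
            while times and ts - times[0] > window_s:
                times.pop(0)
            if len(times) >= min_moves:
                flapping.add(mac)
        last_port[mac] = port

    return flapping


if __name__ == "__main__":
    sample = [
        (0.0, "aa:aa:aa:aa:aa:aa", "Gi0/1"),
        (0.5, "aa:aa:aa:aa:aa:aa", "Gi0/2"),
        (1.0, "aa:aa:aa:aa:aa:aa", "Gi0/1"),
        (1.5, "aa:aa:aa:aa:aa:aa", "Gi0/2"),
        (0.0, "bb:bb:bb:bb:bb:bb", "Gi0/3"),
        (30.0, "bb:bb:bb:bb:bb:bb", "Gi0/4"),  # a single move, not a flap
    ]
    print(find_mac_flaps(sample))
```

A single move (a laptop changing desks) is normal, which is why the sketch only flags addresses that bounce between ports several times in a short window.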
The situation is worse in a virtualized environment: a network card on a hypervisor platform works in promiscuous mode, accepting every frame and passing it on to the software stack. This means any storm, even one made up of unknown unicasts, will make the hypervisor suffer 100% CPU load. Broadcasts are passed on to the guest operating systems by default, so the problem only gets magnified.
And last: if the switches reach 100% CPU, which does take a while but can happen if the loop goes unnoticed, chain reactions may follow. Frames may be discarded on different VLANs, BPDUs may not be forwarded, causing even more spanning-tree instabilities, and so on. In the core of your network, a broadcast storm has the potential of bringing down the entire network up to the access layers.
So when dealing with Layer 2 networks, take your time to secure them. Configure BPDU Guard and Loop Guard, disable unused ports to prevent accidental loops, check for native VLAN consistency, monitor network device CPU usage, and keep Layer 2 domains small if possible. Large Layer 2 domains certainly have advantages, but it’s unwise to stretch them across WAN links, for example.
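The native VLAN check in particular is easy to automate once you have an inventory of your trunk links. The following Python sketch assumes such an inventory exists as pairs of link ends. The `TrunkEnd` type, the function name, and the sample data are hypothetical, and how you collect the data (configuration backups, a discovery protocol, a spreadsheet) is up to you.

```python
from typing import NamedTuple


class TrunkEnd(NamedTuple):
    """One side of a trunk link (hypothetical inventory record)."""
    switch: str
    port: str
    native_vlan: int


def native_vlan_mismatches(links):
    """Return the (end_a, end_b) pairs whose native VLANs differ."""
    return [(a, b) for a, b in links if a.native_vlan != b.native_vlan]


if __name__ == "__main__":
    inventory = [
        (TrunkEnd("core1", "Te1/1", 999), TrunkEnd("access1", "Gi0/49", 999)),
        (TrunkEnd("core1", "Te1/2", 999), TrunkEnd("access2", "Gi0/49", 1)),
    ]
    for a, b in native_vlan_mismatches(inventory):
        print(f"Native VLAN mismatch: {a.switch} {a.port} (VLAN {a.native_vlan}) "
              f"<-> {b.switch} {b.port} (VLAN {b.native_vlan})")
```

Running a check like this after every change window is cheap, and it catches the kind of mismatch that spanning-tree may not sort out for you.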
But what if one does happen anyway? Well, there are no commands to stop it as far as I know, so the only thing that helps is finding the loop fast and unplugging the devices causing it.
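One way to narrow down the search is to compare per-port broadcast counters taken a few seconds apart and look at the ports with the highest rates. The Python sketch below shows only the arithmetic, assuming you have two counter snapshots as plain dictionaries. The function name and the sample numbers are invented for illustration, and counter wrap-around is ignored.

```python
def busiest_broadcast_ports(before, after, interval_s, top=3):
    """Rank ports by broadcast frames per second between two snapshots.

    `before` and `after` map port name -> cumulative broadcast frame
    count. This snapshot format is an assumption made for this sketch.
    Ports missing from either snapshot are skipped.
    """
    rates = {}
    for port, count in after.items():
        if port in before:
            rates[port] = (count - before[port]) / interval_s
    # Highest rate first, limited to the `top` entries.
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)[:top]


if __name__ == "__main__":
    t0 = {"Gi0/1": 1_000, "Gi0/2": 500, "Gi0/3": 200}
    t1 = {"Gi0/1": 1_100, "Gi0/2": 2_000_500, "Gi0/3": 1_500_200}
    for port, rate in busiest_broadcast_ports(t0, t1, interval_s=10, top=2):
        print(f"{port}: {rate:.0f} broadcasts/s")
```

The ports at the top of the list are the first candidates to trace and unplug. Following the highest rates from switch to switch usually leads toward the loop.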
Have you ever witnessed (or caused) a broadcast storm? Share your stories in the comments.