What 'Network Stability' Implies
My thoughts on what to do and not do if you have a goal of operating a stable network.
Several weeks ago, I tweeted...
As a network engineer, perhaps your most important responsibility is network stability. Much is implied by the word "stability" @ecbanks
What was I getting at when I said that, “Much is implied by the word stability”?
No science experiments or nerd knob twisting. I’ve worked on several networks where it seemed like every feature some previous engineer read about in a certification book was enabled. A production network is not your lab. Don’t turn on (or off) a feature unless you have a specific business reason to do so.
Fix bad designs. Most networks have problems. Those problems will, if ignored, lead to unscheduled downtime. One of your jobs as an engineer is to anticipate the weaknesses in a network and, making the most of your allotted budget and available equipment, engineer those weaknesses out.
Standard, simple, replicable design, supported by documentation. Networks that are built the same from pod to pod, closet to closet, and site to site tend to be more stable. Standardized designs have been standardized because they work. They tend to be as simple as possible, but no simpler. Standardized, simple designs lend themselves to being copied. Documenting standard designs gives other engineers the opportunity to enforce consistency across the network landscape.
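As a rough illustration of enforcing a documented standard, here is a minimal Python sketch that reports which lines of a standard template are absent from a device's configuration. The function name and the idea of comparing whole lines are my assumptions for illustration; real configs usually need smarter, context-aware comparison.

```python
def missing_standard_lines(standard_text, device_config_text):
    """Return standard lines (stripped) that do not appear in the device config.

    Hypothetical helper: compares whole lines only and ignores ordering and
    configuration hierarchy.
    """
    present = {line.strip() for line in device_config_text.splitlines()}
    missing = []
    for line in standard_text.splitlines():
        wanted = line.strip()
        if wanted and wanted not in present:
            missing.append(wanted)
    return missing


if __name__ == "__main__":
    # Made-up example text, not taken from any real device.
    standard = "no ip http server\nservice password-encryption\n"
    device = "hostname closet-3\nservice password-encryption\n"
    for line in missing_standard_lines(standard, device):
        print("missing:", line)
```

Run against every device on a schedule, even a crude check like this makes drift from the standard visible before it becomes an outage.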
No cowboys. As an engineer working on a production network, you are not a special snowflake. Fall in line. Build the network according to the standard. Stop making it up as you go like a gunslinging anti-hero. Networks don’t stay up if you make it up. Stable networks are the result of thoughtful design pondered over time and carefully considered in the context of the business and the rest of the IT stack.
A minimum of changes. Stop changing things whenever you’re in the mood. Every time you’re about to commit a change, ask yourself if it’s necessary. If yes, is this the appropriate time? Change windows exist for a reason. Do the right thing at the right time to mitigate the risk inherent in change.
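To make the "is this the appropriate time?" question concrete, here is a small Python sketch that checks whether a timestamp falls inside an agreed change window. The window format (weekday, start, end) and the function name are assumptions of mine, and the sketch does not handle windows that cross midnight.

```python
from datetime import datetime, time


def in_change_window(now, windows):
    """Return True if `now` falls inside any (weekday, start, end) window.

    weekday follows datetime.weekday(): 0 is Monday, 6 is Sunday.
    Hypothetical helper: windows must start and end on the same day.
    """
    for weekday, start, end in windows:
        if now.weekday() == weekday and start <= now.time() < end:
            return True
    return False


if __name__ == "__main__":
    # Assumed example: the business has agreed to Saturday, 22:00 to 23:59.
    agreed = [(5, time(22, 0), time(23, 59))]
    print(in_change_window(datetime.now(), agreed))
```

A check like this could gate a deployment script, so that committing a change outside the window takes a deliberate override rather than a passing mood.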
Capacity monitoring and planning. A stable network is one that carries all of its traffic with consistent end-to-end latency and very little packet loss. Congested links are a form of network instability, because they result in undelivered traffic. Keeping ahead of bottlenecks is key to long-term network stability.
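Keeping ahead of bottlenecks starts with knowing how full each link is. The sketch below computes average utilization from two samples of an interface octet counter, the way you might after polling a device twice. The function names and the 70% planning threshold are my assumptions, not a standard; pick a threshold that suits your traffic.

```python
def link_utilization(octets_start, octets_end, interval_s, speed_bps, counter_bits=64):
    """Average utilization (0.0 to 1.0) between two octet-counter samples.

    The modulo handles a single counter wrap between samples.
    """
    if interval_s <= 0 or speed_bps <= 0:
        raise ValueError("interval_s and speed_bps must be positive")
    delta_octets = (octets_end - octets_start) % (2 ** counter_bits)
    return (delta_octets * 8) / (interval_s * speed_bps)


def needs_attention(utilization, threshold=0.7):
    """Flag links at or above an assumed planning threshold."""
    return utilization >= threshold


if __name__ == "__main__":
    # Assumed example: 1 Gbps link sampled 300 seconds apart.
    util = link_utilization(0, 18_750_000_000, 300, 1_000_000_000)
    print(f"{util:.0%}", needs_attention(util))
```

Averages over long intervals hide short bursts, so a link can drop packets well below the threshold; treat this as a trend indicator, not proof of health.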
Wizard-like knowledge of how changes impact traffic. While companies like Veriflow and Forward Networks get up to speed verifying changes against network models, the best network modeler is often the engineer making the change. You should understand the protocols you’re shooting from your keyboard laser pistol so thoroughly that you can predict what’s going to happen when you start making “pew pew” noises. If every new command results in a surprise, you’re going to cause downtime without meaning to.
Obvious security holes filled. Read a hardening guide and apply the bits you can. Change default credentials. Turn off unused services. Avoid silly things like “public” and “private” SNMP community strings, remnants from a bygone era. If you leave doors open, you’re going to get owned. Once the doors are shut, make sure the windows are latched. Stable networks make an attacker work to compromise the infrastructure.
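Checking for the "public" and "private" community strings is easy to automate. Here is a minimal Python sketch that scans configuration text for them. It assumes Cisco IOS-style `snmp-server community <string>` lines; other platforms use different syntax, and the function name is hypothetical.

```python
import re

DEFAULT_COMMUNITIES = {"public", "private"}

# Assumes IOS-style syntax: "snmp-server community <string> [RO|RW] ..."
_COMMUNITY_RE = re.compile(r"^\s*snmp-server community (\S+)", re.IGNORECASE)


def find_default_communities(config_text):
    """Return (line_number, line) pairs that use a default community string."""
    findings = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        match = _COMMUNITY_RE.match(line)
        if match and match.group(1).lower() in DEFAULT_COMMUNITIES:
            findings.append((lineno, line.strip()))
    return findings


if __name__ == "__main__":
    # Made-up example config.
    sample = "hostname edge1\nsnmp-server community public RO\n"
    for lineno, line in find_default_communities(sample):
        print(f"line {lineno}: {line}")
```

This only catches one obvious hole. It is no substitute for working through a full hardening guide.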
Bug-free code. Don’t run the latest release just because it’s out. New code can, and often will, introduce bugs. If the code you run now does what you need and has no known serious security vulnerabilities, stick with it. When possible, prefer maintenance trains and vendor-recommended releases over code still dripping with developer sweat.
If I Had To Pick One Thing...
I’d pick “a minimum of changes” as the single greatest factor in network stability. Leave the network alone if it isn’t broken. When the network design does need to change, plan the change with excruciating caution, and schedule the change for a time that the business agrees to.
Never make changes on the sly, hoping no one notices. That can be hard when you’re an introverted type like I am, especially when you just want to get something done.
However, I’ve learned that no matter how capable you are or how unlikely the change is to cause a disruption, accidents happen. A mid-day outage is an introvert’s worst nightmare, especially if you’re the cause.