Bad Network Design – Availability of a system is mainly measured with two parameters: mean time between failures (MTBF) and mean time to repair (MTTR).
MTBF is calculated as the average time between failures of a system. MTTR is the average time required to repair a failed component (a link, node, or device in networking terms).
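To make the relationship concrete, here is a minimal sketch of the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The function names and the sample numbers are mine, chosen only for illustration.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours, ignoring leap years


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as a fraction between 0 and 1."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(avail: float) -> float:
    """Expected downtime per year for a given availability fraction."""
    return (1.0 - avail) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Illustrative numbers only: a failure every 999 hours, 1 hour to repair.
    a = availability(mtbf_hours=999, mttr_hours=1)
    print(f"availability: {a:.4%}")
    print(f"expected downtime per year: {annual_downtime_hours(a):.2f} hours")
```

Notice that you can improve availability from either side: make failures rarer (raise MTBF) or make repairs faster (lower MTTR).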
Operator mistakes are widely seen as a major source of system failures. Thus, although it is not used on its own to calculate the availability of a system, mean time between mistakes (MTBM) is a commonly used term among network engineers.
Most failures are attributed to human error; estimates range between 70 and 80 percent. How can so many people be so incompetent?
Actually, they are not! It’s a design problem.
If the percentage were 5 or 10 percent, then I could believe that people were at fault. But when the percentage is so high, clearly other factors must be involved.
When a system fails, the investigation should continue until the single, underlying cause is found. This is called root cause analysis. I shared a framework in this article for proper root cause analysis.
Imagine that an operator at a service provider misconfigures the VRF route target information for an MPLS Layer 3 VPN and, as a result, it impacts another customer.
This is always seen as an operator mistake, but it is not. The system shouldn’t have allowed the operator to take that action in the first place. It could be fixed in many ways.
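One of those ways is a pre-deployment check in the provisioning workflow. The sketch below is hypothetical: it assumes the provider keeps a registry that maps each route target to the single customer that owns it (the `RT_OWNER` dictionary, the customer names, and the route target values are all invented for illustration). A real deployment with shared-service or extranet VPNs would need a richer model than one owner per route target.

```python
# Hypothetical registry: route target -> owning customer.
RT_OWNER = {
    "65000:100": "customer-a",
    "65000:200": "customer-b",
}


def find_rt_violations(customer, route_targets, rt_owner):
    """Return a list of reasons why the requested route targets are unsafe.

    An empty list means the change can go ahead.
    """
    violations = []
    for rt in route_targets:
        owner = rt_owner.get(rt)
        if owner is None:
            violations.append(f"{rt}: not registered")
        elif owner != customer:
            violations.append(f"{rt}: belongs to {owner}")
    return violations


if __name__ == "__main__":
    # The operator mistypes customer-b's route target on customer-a's VRF.
    problems = find_rt_violations("customer-a", ["65000:100", "65000:200"], RT_OWNER)
    if problems:
        print("Change rejected:")
        for p in problems:
            print("  " + p)
```

The point is that the typo is caught by the system before it reaches the router, instead of being discovered by the other customer.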
Network operators are expected to stay alert for several hours and to handle many tasks at the same time (multitasking). Most of the time they are interrupted by other activities and people.
But none of this is an excuse for poor network design.
In a hub-and-spoke deployment, for example, if an operator adding a couple of spoke sites causes an entire network meltdown, this is a design problem, not an operator mistake. You should increase the hub capacity by scaling up or scaling out!
Before a critical configuration is applied, the system can request verification from the operator. You know that when you try to reboot a router, it always asks for verification, right?
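The same idea can be applied in a configuration tool that sits in front of the devices. This is only a sketch: the list of critical commands is an assumption I made up for the example, and `confirm` and `send` are placeholder callables standing in for whatever prompt and transport your tooling actually uses.

```python
# Hypothetical list of command prefixes that deserve a second look.
CRITICAL_PREFIXES = ("reload", "no router bgp", "clear ip bgp *")


def is_critical(command: str) -> bool:
    """True if the command starts with one of the critical prefixes."""
    return command.strip().lower().startswith(CRITICAL_PREFIXES)


def apply_command(command, confirm, send) -> bool:
    """Send the command, asking for confirmation first if it is critical.

    confirm(command) returns True or False; send(command) pushes the command.
    Returns True if the command was sent, False if the operator declined.
    """
    if is_critical(command) and not confirm(command):
        return False
    send(command)
    return True


if __name__ == "__main__":
    def ask(cmd):
        return input(f"'{cmd}' is a critical command. Proceed? [y/N] ").lower() == "y"

    # print stands in for the real transport in this sketch.
    apply_command("reload", confirm=ask, send=print)
```

Ordinary commands pass straight through, so the operator is only interrupted when the blast radius is large.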
Even a knowledgeable operator may try to redistribute the entire BGP table into the internal IGP domain for the reasons I stated above, and the design shouldn’t allow this. (Many implementations don’t allow it, though.)
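Where the platform itself offers no such protection, a guard in the change workflow can play the same role. The sketch below is an assumption-heavy illustration, not any vendor's behavior: it refuses a BGP-to-IGP redistribution request unless a route filter is attached and the number of prefixes that would be injected stays under a limit. The function name and the limit of 500 are hypothetical.

```python
def check_redistribution(candidate_prefixes, has_route_filter, max_prefixes=500):
    """Decide whether a BGP-to-IGP redistribution request is acceptable.

    Returns a (allowed, reason) tuple. The default limit is an arbitrary
    example value; pick one that matches what your IGP can carry.
    """
    if not has_route_filter:
        return (False, "redistribution without a route filter is not allowed")
    count = len(candidate_prefixes)
    if count > max_prefixes:
        return (False, f"{count} prefixes exceeds the limit of {max_prefixes}")
    return (True, "ok")


if __name__ == "__main__":
    # A handful of filtered prefixes is fine.
    print(check_redistribution(["10.1.0.0/16", "10.2.0.0/16"], has_route_filter=True))
    # An unfiltered request is rejected regardless of size.
    print(check_redistribution(["10.1.0.0/16"], has_route_filter=False))
```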
When a problem arises, people tend to blame themselves: “Sorry, that’s my fault, that’s my problem, I did it.” But when someone says, “It was my fault, I knew better,” this is not a valid analysis of the problem.
If the system lets you make the error, it is badly designed.
Look at these examples.
We all love a balcony, and this is a nice house too, but can you use the balcony?
In fact, it is a very nice design and looks awesome, but does it serve its purpose? If it isn’t meant to be used for sitting and having a drink, then that’s okay, so know the purpose of your design.