Orhan Ergun 5 Comments

Bad Network Design – The availability of a system is mainly measured with two parameters: mean time between failures (MTBF) and mean time to repair (MTTR).

MTBF is the average time between failures of a system. MTTR is the average time required to repair a failed component (a link, node, or device in networking terms).
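The two parameters combine into the usual steady-state availability figure, MTBF / (MTBF + MTTR). Here is a minimal Python sketch of that arithmetic; the function names and the sample numbers are illustrative assumptions, not measurements from a real network.

```python
HOURS_PER_YEAR = 8760  # 365 days, ignoring leap years


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_per_year_hours(avail: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Hypothetical link: fails about once a year and takes 4 hours to repair.
a = availability(mtbf_hours=8756.0, mttr_hours=4.0)
print(f"availability: {a:.5f}")
print(f"downtime per year: {downtime_per_year_hours(a):.2f} hours")
```

Halving MTTR in this sketch halves the yearly downtime, which is why repair time matters as much as failure frequency.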

Operator mistakes are widely seen as the source of system failures. Thus, although it is not used on its own to calculate the availability of a system, mean time between mistakes (MTBM) is a commonly used term among network engineers.

Most failures are caused by human error; estimates range from 70 to 80 percent. How can so many people be so incompetent?

Actually, they are not! It’s a design problem.

If the percentage were 5 or 10 percent, then I could believe that people were at fault. But when the percentage is so high, clearly other factors must be involved.

When a system fails, the investigation should continue until the single, underlying cause is found; this is called root cause analysis. I shared a framework in this article for proper root cause analysis.

Imagine that an operator at a service provider configures the VRF and route target information incorrectly for an MPLS Layer 3 VPN, and as a result it impacts another customer.

This is always seen as an operator mistake, but it is not. The system should not have allowed the operator to take that action in the first place. It could be fixed in many ways.
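One possible fix is a pre-check that compares the route targets in a proposed VRF change against an ownership inventory before anything reaches a router. The sketch below is only an illustration: `VrfChange`, `RT_OWNERS`, the customer names, and the route target values are all hypothetical, and it is not the API of any real provisioning system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VrfChange:
    """A proposed VRF configuration for one customer (hypothetical model)."""
    customer: str
    vrf_name: str
    import_rts: frozenset
    export_rts: frozenset


# Hypothetical inventory: which customer owns each route target.
RT_OWNERS = {
    "65000:100": "customer-a",
    "65000:200": "customer-b",
}


def validate_vrf_change(change, rt_owners):
    """Return a list of error strings; an empty list means the change may proceed."""
    errors = []
    for direction, rts in (("import", change.import_rts),
                           ("export", change.export_rts)):
        for rt in sorted(rts):
            owner = rt_owners.get(rt)
            if owner is None:
                errors.append(f"{direction} RT {rt} is not registered to any customer")
            elif owner != change.customer:
                errors.append(f"{direction} RT {rt} belongs to {owner}, not {change.customer}")
    return errors


# The operator mistypes the import route target for customer-a.
bad = VrfChange("customer-a", "CUST-A",
                frozenset({"65000:200"}), frozenset({"65000:100"}))
for problem in validate_vrf_change(bad, RT_OWNERS):
    print("REJECTED:", problem)
```

A real design would also need an explicit exception list for intentional sharing such as extranet or shared-services VPNs, which this sketch would reject.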

Network operators are expected to stay alert for several hours and to handle many tasks at the same time. Most of the time they are interrupted by other activities and people.

But none of these is an excuse for a poor network design.

In a hub-and-spoke deployment, for example, if adding a couple of spoke sites causes an entire network meltdown, this is a design problem, not an operator mistake. You should increase the hub capacity by scaling up or scaling out!

Before applying a critical configuration, the system can request verification from the operator. You know that when you try to reboot a router, it always asks for confirmation, right?
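The same confirmation step can be put in front of any change that a tool pushes to the network. Here is a minimal sketch, assuming a hypothetical list of command prefixes that count as critical and an `apply` callback that stands in for whatever actually sends a command to the device.

```python
# Hypothetical prefixes; a real deployment would maintain its own list.
CRITICAL_PREFIXES = ("reload", "no router", "redistribute")


def is_critical(command: str) -> bool:
    """True if the command starts with one of the critical prefixes."""
    return command.strip().lower().startswith(CRITICAL_PREFIXES)


def confirm_and_apply(commands, apply, ask=input) -> bool:
    """Apply the commands, asking the operator first if any of them is critical.

    Returns True if the commands were applied, False if the operator declined.
    Nothing is applied when the operator declines.
    """
    critical = [c for c in commands if is_critical(c)]
    if critical:
        answer = ask(f"Critical command(s) {critical}. Type 'yes' to proceed: ")
        if answer.strip().lower() != "yes":
            return False
    for command in commands:
        apply(command)
    return True


# Usage with a stand-in for the device session: the operator declines.
sent = []
applied = confirm_and_apply(["redistribute bgp 65000"], sent.append,
                            ask=lambda prompt: "no")
print(applied, sent)
```

Passing `ask` as a parameter keeps the prompt testable and lets the same check be wired to a ticketing or approval system instead of a terminal.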

Even a knowledgeable operator may try to redistribute the entire BGP table into the internal IGP domain for the reasons I stated above; the design shouldn’t allow this. (Many implementations don’t allow it, though.)
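One way a design can refuse this is a hard ceiling on how many prefixes may enter the IGP through redistribution. The following is a rough sketch of that guard; the limit of 500 and the names `guard_redistribution` and `RedistributionRefused` are assumptions made for illustration, not vendor behaviour.

```python
import ipaddress


class RedistributionRefused(Exception):
    """Raised when a redistribution would exceed the configured ceiling."""


def guard_redistribution(bgp_prefixes, igp_limit=500) -> int:
    """Return the number of distinct prefixes, or refuse if above igp_limit.

    Prefixes are parsed with ipaddress.ip_network, so malformed entries
    raise ValueError before anything is counted.
    """
    networks = {ipaddress.ip_network(p) for p in bgp_prefixes}
    if len(networks) > igp_limit:
        raise RedistributionRefused(
            f"{len(networks)} prefixes exceed the IGP limit of {igp_limit}"
        )
    return len(networks)


# A handful of customer prefixes passes; a full table would not.
print(guard_redistribution(["10.0.0.0/24", "10.0.1.0/24"]))
```

With a check like this in the provisioning path, an unfiltered redistribution fails loudly before it reaches the IGP instead of melting it down.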

When a problem arises, people tend to blame themselves: “Sorry, that’s my fault, that’s my problem, I did it.” But when someone says, “It was my fault, I knew better,” this is not a valid analysis of the problem.

If the system lets you make the error, it is badly designed.

Look at these examples.

[Image: fix the network design]

We all love a balcony, and this is a nice house too, but can you use the balcony?

 

[Image: design should have purpose]

 

In fact, it is a very nice design and looks awesome, but does it serve its purpose? If it won’t be used to drink something, then that’s okay. So know the purpose of your design.

 
  • kamlesh

    Many times it is human error (typos), but a design flaw can be difficult for an operator to highlight. Many times the operator just gets a script or a set of commands to execute, with very little or no knowledge about the change. You said the system (device) should allow or authorise the operator before effecting the change. Can you shed some light on that? Any example to elaborate would be appreciated.

    • There is a draft, for example: in order to configure a VPN for a customer, the operator should get an acknowledgement for a token. The token goes to the customer, and if they approve, the operator can make the change.

      You can extend this idea, or many other ideas, to different technologies, even to protocols. There are lots of protocol problems and inefficiencies which I should probably write about.

  • Are there any tools to monitor how much bandwidth a specific IP address is using?

    • Can you ask this on the ASK/SHARE page? Thanks :)

  • While I agree that the system should be able to detect and report potential problems and misconfigurations, isn’t that also what engineers are meant for and paid for?

    Also, industry-standard practices such as a change management process are usually a good way to solve this problem to a good extent, since a change reviewer and a change manager would ask useful questions and someone will most likely point out the errors.

    Putting templates in place for recurring changes could be another way of solving the same problem, and doing a POC on test equipment before production changes could be another.

    At the end of the day, it’s going to be hard for humans or automation systems to anticipate every possible combination of errors.