All Smoke and Mirrors by Michael Kowal – Part 1

…Without The Smoke


Working from home too often has turned you, a senior network design engineer, into a recluse with diminishing social skills. Too much time cooped up indoors and working on projects has caused you to feel a little stir crazy, so it is time for a change of scenery. You decide you need to get out of the home office for a change and get some work done at one of the company’s local offices. Not even five minutes into unpacking your laptop and settling in at the local office, a junior engineer (affectionately called “Junior” because he always appears to be lost) approaches you with a request for help. Well Striker, it looks like you picked the wrong day to quit working from home.


Junior explains that he has been shadowing a senior design engineer (a peer of yours) who has been working on a design for ACME Corporation’s two new data centers. The design is to be presented to the CTO of ACME Corp. today; however, there is one major problem: the lead senior design engineer has disappeared under mysterious circumstances. Junior is now stuck with two design solutions and, as per the CTO’s request, must choose one design to present to the customer. Junior asks you to help him better understand the differences between the two designs so that he can choose one to present to the CTO.


Junior shows you the following two designs as well as his design notes:


Design #1:



Design #2:



Design Notes:

  • Design #1 shows a full mesh of connectivity between each leaf node and all spine nodes.
  • Design #2 introduces a transit node that is used solely to interconnect the data centers for East-West traffic flows.
  • Both leaf-and-spine designs contain servers that connect into a leaf node at 20 GE (2 x 10 GE EtherChannel). The links between the leaf and spine nodes are 100 GE.
  • BGP is used for routing. eBGP is used between the leaf and spine nodes. At each data center site, the leaf layer will use its own BGP ASN while the spine uses a single, common ASN. VLANs at the leaf nodes are redistributed into BGP. In Design #2, the Transit nodes are placed in their own BGP ASN.


At first glance, your gut reaction is that you would probably choose neither of these designs. Even more disturbing, Junior has asked you to evaluate design options without first discussing the requirements. All too often, you’ve seen junior engineers behave like ‘Pavlov’s Dog’: launching into product discussions upon hearing a couple of key words from customers without fully understanding their requirements. This irks you to no end, so you decide to politely ask Junior to review the requirements.


Junior explains the following requirements:

  • Today, ACME’s applications connect directly into ACME’s core network devices as opposed to a dedicated data center infrastructure. This design violates ACME’s new security policy which states that business applications must be placed into a dedicated environment so that physical access to the equipment can be audited.
  • ACME Corporation is building two new, redundant data centers to provide high availability for their multi-tiered applications. The application tiers will be distributed across the two new data centers that are a few miles apart with direct line-of-sight visibility.
  • There is an extremely high amount of synchronization traffic between the application tiers. If the synchronization traffic experiences any type of unreliable delivery (delay, jitter, etc.), the applications can reset the synchronization process which negatively impacts user experience.
  • Cost is a critical factor that needs to be controlled, especially OPEX. Operational complexity must be kept to a minimum. ACME would rather spend more in CAPEX to ensure lower OPEX.
  • In the future, one or more data center sites could be added as the applications continue to grow. The design must be able to scale without major impacts to operational complexity.


You decide to help Junior by quickly breaking down a few high level differences in hopes that you’ll be left alone. Your comments are as follows:

  • Design #1 offers the highest level of redundancy, as a link or node failure would not have as great an impact as in Design #2. If both transit nodes were to fail in Design #2, then all east-west traffic between the data centers would fail.
  • Although Design #1 requires double the number of 100GbE transceivers, you are not certain that their cost will outweigh that of the additional transit nodes in Design #2. You make the argument that if the customer were to add additional leaf nodes, they would spend less on transceivers with Design #2 than with Design #1.
  • One of the major deficiencies of Design #2 is that East-West traffic between data centers must transit the Transit nodes first.
  • Assuming more data centers are to be added in the future, Design #1 scales the worst: every new leaf node needs a point-to-point link to every spine node, so the number of links grows with the product of leaf and spine counts.
  • In Design #1, we can argue that, as more data centers are added in the future, the reduced number of autonomous systems to keep track of would help simplify troubleshooting east-west traffic flows.
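
The scaling comments above can be made concrete with a rough link count. The following is only a sketch under stated assumptions: the node counts are hypothetical (the designs do not give any), Design #1 is assumed to connect every leaf to every spine at every site, and Design #2 is assumed to connect every spine at every site to every transit node. The names Fabric, design1_links and design2_links are made up for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fabric:
    """Hypothetical sizing inputs; none of these numbers come from the designs."""
    sites: int               # number of data centers
    leaves_per_site: int
    spines_per_site: int
    transit_nodes: int = 2   # only used by Design #2


def design1_links(f: Fabric) -> dict:
    """Design #1 (assumed): every leaf connects to every spine at every site."""
    leaves = f.sites * f.leaves_per_site
    spines = f.sites * f.spines_per_site
    total = leaves * spines
    # Links that stay inside a site: each leaf to its local spines only.
    local = f.sites * f.leaves_per_site * f.spines_per_site
    return {"total": total, "inter_site": total - local}


def design2_links(f: Fabric) -> dict:
    """Design #2 (assumed): leaves use local spines; spines connect to all transit nodes."""
    local = f.sites * f.leaves_per_site * f.spines_per_site
    inter_site = f.sites * f.spines_per_site * f.transit_nodes
    return {"total": local + inter_site, "inter_site": inter_site}


if __name__ == "__main__":
    # Compare both designs as sites are added, with 8 leaves and 4 spines per site.
    for sites in (2, 3, 4):
        f = Fabric(sites=sites, leaves_per_site=8, spines_per_site=4)
        print(sites, "sites:", design1_links(f), design2_links(f))
```

Under these assumptions, the inter-site link count in Design #1 grows with the square of the number of sites, while in Design #2 it grows linearly. Real counts depend on the actual diagrams and on where the transit nodes sit.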


Junior is now able to whiteboard a chart based on your high-level break-down between the two design choices:

                          Design #1   Design #2
Highest Redundancy            X
Lowest OPEX                   X
Optimal Traffic Flows         X
Most Scalable                             X


Junior concludes that both designs have their pros and cons and decides to present both to ACME Corp. He asks if you could attend the meeting with him at ACME’s headquarters to help position the designs; however, you politely decline his offer due to the insurmountable pressure of your current workload. You settle for joining the meeting via conference call because it won’t require your full attention.


After joining the conference call, you are free to do other work, as it sounds like Junior has everything under control. That is, of course, until Junior asks you a question, which you didn’t hear and must ask him to repeat. Junior briefly recaps: the CTO felt that Design #2’s use of transit nodes posed too much risk for inter-data center connectivity. Design #1 meets most of his requirements except for the extremely high number of interconnects required between data centers. Furthermore, while the additional optics are not an issue, leasing additional fibers would be cost prohibitive and would reduce the scalability of the design. ACME’s CTO defers to your expert advice.


Which solution do you suggest?

a) Modify Design #2 by upgrading the 100 GE interfaces between transit and spine nodes to 400 GE interfaces to reduce the risk of a throughput bottleneck between data centers.

b) Suggest a 3-stage Clos Fabric as opposed to the current 2-tier design to improve scalability, reduce complexity, and increase availability.

[Figure: Clos Fabric]

c) Modify Design #1 to use DWDM between data centers to reduce costs in a scalable way.

d) Modify Design #2 by changing from a pure eBGP design to a VXLAN-based overlay using an MP-BGP EVPN control plane to reduce complexity when scaling to multiple data centers.

e) Modify Design #1 to use multi-gigabit wireless backhaul in licensed frequency bands to supplement the constrained fiber, reducing costs and increasing availability.

f) Suggest that they forego local data centers and split their applications across two different cloud providers to reduce cost and complexity while offloading high availability and scalability to the cloud providers.


Take your pick and make sure you are able to justify it. In Part 2, I will go over the solution I’d choose, and why.


About the Author



For the past 12 years, Michael Kowal has been involved with carrier routing and optical designs and architectures at Cisco. He currently works with national research institutions, regional service providers, large-scale government & higher education customers to help educate, design, and architect Evolved Programmable Networks. Michael's technology focus within the Public Sector includes: BGP, LISP, Segment Routing, IPv6 and DWDM.

Michael currently holds a CCDE and a CCIE in the Routing & Switching, Service Provider, and Voice tracks. Michael also holds a Master's in Electrical Engineering from Stevens Institute of Technology.

