VXLAN

The authors of "Building Data Centers with VXLAN BGP EVPN: A Cisco NX-OS Perspective" and "A Modern, Open, and Scalable Fabric: VXLAN EVPN" submit the following guest post.

 

Following the discussion around "the Magic of Super-spines and RFC7938", we have been asked several times about the viability of eBGP as the underlay protocol in a VXLAN EVPN fabric. Good, challenge accepted! Let's dig deeper into what it actually means to use eBGP as the single control plane for both underlay and overlay in a VXLAN EVPN network, and where the use of separate control planes, an IGP (OSPF, IS-IS, etc.) for the underlay and iBGP for the overlay, has its elegance.


 

Before going into the details of the control plane, let's first define the topology and focus on the deployment model. Assume we have four spines and numerous leafs with symmetric connectivity, as shown in Figure 1. Every leaf connects to all spines; we call that a folded Clos, a Fat-Tree, or simply a spine/leaf topology.

[Image: VXLAN_Fig1.png]

 

Figure 1 - Spine/Leaf Topology

Now that we have the topology defined, let's discuss the considerations around Autonomous Systems (AS). With eBGP, we clearly need to define more than one AS.

 

There are two designs for defining the AS of the nodes in a spine/leaf topology (see Figure 2).

  • The first design is called Multi-AS. In this design, each leaf has its own unique AS, while all the spines share a common AS that is different from all the leaf AS numbers. This approach is pretty simple, as we basically rebuild an Internet-like topology where a path starts in a source AS and traverses one or more intermediate AS on the way to the destination AS.

  • The second design is called Dual-AS and, as implied by the name, it implements a two-AS model: one AS for all leafs and one AS for all spines.

[Image: VXLAN_Fig2.png]

 

Figure 2 – Multi-AS vs. Dual-AS

The Dual-AS model, however, breaks how eBGP normally works: the AS-path would start and end in the same AS, which violates the BGP AS-path loop-prevention mechanism, so a leaf would reject routes originated by another leaf. To overcome the resulting communication issues, BGP needs to be told to tolerate this AS-path violation. One option is to relax the AS-path loop prevention on the spine with "as-override"; a second option is to configure the leaf with "allowas-in". Either option yields the same result, but through different techniques. With "as-override", the spine replaces the remote (leaf) AS in the AS-path with its own AS before advertising the route onward. With "allowas-in", the leaf accepts its own AS in the AS-path as many times as configured.
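To make the two options concrete, below is a minimal sketch in the style of the examples later in this post (the neighbor addresses and AS numbers are illustrative and match those used further down):

Option 1, "as-override" on the spine (AS 65521) towards the leafs:

router bgp 65521
  neighbor 10.121.31.31
    remote-as 65522
    address-family ipv4 unicast
      as-override

Option 2, "allowas-in" on the leaf (AS 65522) towards the spines:

router bgp 65522
  neighbor 10.121.31.121
    remote-as 65521
    address-family ipv4 unicast
      allowas-in 1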

 

Additional considerations come into play when leaf pairs are implemented with a multi-chassis EtherChannel technology like vPC/MC-LAG. A vPC pair of leafs represents a single logical leaf from a redundancy perspective and requires both leafs in the pair to be part of the same AS. To ensure that routes are advertised equally from the spine to both leafs sharing that AS, the "disable-peer-as-check" command needs to be enabled; it allows both leafs in the same AS to receive each other's updates, which is desirable so that this information stays synchronized. "disable-peer-as-check" is applied on the spine for the IPv4 Unicast peering towards both leafs that are part of the vPC/MC-LAG construct. In designs where each leaf has its own AS, even within a vPC/MC-LAG construct, "disable-peer-as-check" is not required.
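For illustration, below is a minimal sketch of "disable-peer-as-check" on the spine towards a vPC pair of leafs sharing AS 65522 (the two neighbor addresses are illustrative). Without this knob, the spine would not advertise a route learned from one vPC member to the other member, because the peer AS already appears as the first AS in the AS-path:

router bgp 65521
  neighbor 10.121.31.31
    remote-as 65522
    address-family ipv4 unicast
      disable-peer-as-check
  neighbor 10.121.32.31
    remote-as 65522
    address-family ipv4 unicast
      disable-peer-as-check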


A question to consider: isn't BGP simple? :-)


From a configuration perspective, there is not much of a difference between the two approaches, Multi-AS or Dual-AS. It mainly comes down to whether AS-path loop prevention needs to be relaxed (Dual-AS) or not (Multi-AS). Depending on the chosen AS design, specific considerations arise around the IP addressing schema as well as the peering configuration.


In the world of eBGP, each device configures its peering devices as neighbors by specifying their IP addresses. In order to avoid slow peer failure detection, it is a best practice to peer from physical interface to physical interface. This way, a "lights-out" event such as a cable failure brings down the eBGP neighborship immediately, should the direct link fail. It can be debated whether the source interface needs to be specified when configuring the peering; the idea is that doing so ties the peering to both the IP layer and the interface layer. Finally, it is worth noting that deploying Bidirectional Forwarding Detection (BFD) provides additional peace of mind and helps in brownout conditions.
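As a minimal sketch of such a physical-interface peering on a leaf, with BFD enabled (the interface, the /31 addressing, and the AS numbers are illustrative; "feature bfd" is assumed to be enabled globally):

interface Ethernet1/53
  no switchport
  ip address 10.1.1.0/31 (the spine side would use 10.1.1.1/31)

router bgp 65522
  neighbor 10.1.1.1
    remote-as 65521
    update-source Ethernet1/53
    bfd
    address-family ipv4 unicast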

 

Going back to Figure 2, each leaf connects to four spines. This results in the configuration of four physical interfaces and four point-to-point IP subnets (/31 or /30), and four eBGP peerings, one per spine. To actually use all four paths with Equal Cost Multi-Path (ECMP), it is important to remember to configure BGP with "maximum-paths" support, as by default BGP selects only a single best path to reach each remote destination.

 

Last but not least, the IP addresses of the loopback interfaces configured on the nodes need to be advertised into the BGP-based underlay. As will be clarified later, these loopback interfaces are used to enable the overlay control and data planes. The first option is to use network statements to achieve this advertisement. The second and preferred option is to perform redistribution with a route-map; this way, the global BGP configuration does not need to be touched every time a new IP address needs to be advertised.

 

Example Route-Map with Loopback for Redistribution (referenced as RMAP-REDIST-DIRECT in the BGP configurations below):

route-map RMAP-REDIST-DIRECT permit 10
  match tag 12345

interface loopback0
  ip address 10.101.101.31/32 tag 12345

 

Below is an example of the global BGP configuration required on the spine and leaf, as well as the specific peering configuration between them. Remember, the spine will have a repetitive neighbor statement for each leaf that connects to it; similarly, every leaf will have a neighbor statement for every spine. To simplify the configuration, peer templates can be implemented, in which case only the neighbor IP and AS parameters must be defined per neighbor (a template sketch follows the examples below). As previously mentioned, it is also recommended to add the source interface for the peering; this way we ensure the BGP peering uses the correct egress interface.

 

Example Underlay configuration from spine to leaf:

router bgp 65521
  router-id 10.101.101.121
  address-family ipv4 unicast
    redistribute direct route-map RMAP-REDIST-DIRECT
    maximum-paths 4
  neighbor 10.121.31.31
    remote-as 65522
    update-source Ethernet1/49
    address-family ipv4 unicast
      disable-peer-as-check (only for Dual-AS design or vPC leaf pairs sharing an AS)
      as-override (only for Dual-AS design)

 

Example Underlay configuration from leaf to spine:

router bgp 65522
  router-id 10.101.101.31
  address-family ipv4 unicast
    redistribute direct route-map RMAP-REDIST-DIRECT
    maximum-paths 4
  neighbor 10.121.31.121
    remote-as 65521
    update-source Ethernet1/53
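As mentioned earlier, a peer template can factor out the settings shared by all of these peerings. Below is a minimal sketch under that assumption, using a hypothetical template name UNDERLAY-PEERING; the per-neighbor IP address, remote AS, and source interface still have to be defined individually, and the "bfd" line assumes that "feature bfd" is enabled globally:

router bgp 65522
  template peer UNDERLAY-PEERING
    bfd
    address-family ipv4 unicast
  neighbor 10.121.31.121
    inherit peer UNDERLAY-PEERING
    remote-as 65521
    update-source Ethernet1/53

The same mechanism is used later in this post for the overlay peering, where a template named OVERLAY-PEERING is assumed.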

 

Alright, we have built a solid underlay network with four ECMP paths from every leaf to every other leaf, and we have full reachability from every leaf to every spine, direct or indirect. We achieved these attributes with physical-interface to physical-interface peering in the underlay to ensure fast neighbor failure detection, with or without BFD. The next task is to build an overlay control plane with VXLAN BGP EVPN on top of the BGP underlay transport network. The overlay control plane is an address-family in BGP referred to as "L2VPN EVPN"; this address-family is configured under the same BGP instance that was configured for the underlay.

 

Let's assume first that the EVPN address-family is added to the same physical-interface to physical-interface peering implemented for the underlay network. In addition, let's take an example where 50,000 overlay prefixes need to be exchanged between leafs via the "reflection" function performed by every spine. Is it ideal for a leaf to re-evaluate 150,000 overlay paths (50,000 prefixes from each of the three remaining spines) in the best-path calculation if a single underlay link goes down? Remember that BGP receives every prefix from every neighbor, evaluates them against the best path, and installs the winner (or winners) into the forwarding table. Is it really desirable to create this churn in the overlay control plane when one of four available paths goes down?

 

The answer is most likely "NO", as one is interested in stability. Thus the strong recommendation is to build separate BGP peerings for the IPv4 Unicast (underlay) and L2VPN EVPN (overlay) address-families. Instead of using physical interfaces for the peering, as was done in the underlay, loopback interfaces are configured to source the overlay peering. A loopback interface is configured on the leaf as well as the spine to create a new neighbor configuration for the overlay BGP control plane. For this peering to come up, the loopback interface IP addresses (/32) have to be advertised in the underlay control plane, as discussed in the previous section.

 

The new neighbor configuration is built on the premise that the overlay peering should never go down following an underlay failure event (unless, of course, connectivity to all the spines is lost). Remember that plain BGP neighbor failure detection is very slow, with a hold time of up to 180 seconds (excluding the use of BFD). This "slowness" is actually an advantage for the overlay control plane: during an underlay network event, the overlay control-plane peering will not go down. While the underlay is re-converging, the loopback-to-loopback peering for the overlay control plane finds an alternate path among the available ones, and the peering sessions stay up. This is similar to the approach that has been used for decades in the MPLS world.

 

One additional detail: since the eBGP peering is loopback to loopback, the TTL has to be raised by specifying the "ebgp-multihop" configuration for the overlay peering; by default, eBGP packets are sent with a TTL of 1, so the multi-hop session would otherwise never come up.

 

For the overlay control plane, there is much more to discuss than just the peering configuration. While the peering is important, it is critical to realize that the functionality we are actually using is not traditional eBGP!

 

In VXLAN, the encapsulation is performed by the VTEP closest to the source and carried all the way to the VTEP closest to the destination. In our spine/leaf topology, this means the traffic goes leaf to leaf, while the spines are just IP routers that forward traffic based on the outer IP header. Because we are using eBGP, the default next-hop behavior would implicitly make the spine the next-hop. Hence, the leaf would try to establish the VXLAN adjacency to the spine, where most likely a VTEP function is not available or configured. Even if there was a VTEP, it would probably not have the capability to terminate and re-encapsulate traffic towards the leaf VTEP where the destination endpoint is attached. In short, we have to disable the default eBGP next-hop behavior and keep the next-hop unchanged, by applying the command "set ip next-hop unchanged" in a route-map on every spine-to-leaf peering.

 

No, we are not done yet! When configuring iBGP for the overlay control plane, it is common to deploy a pair of Route-Reflectors (RRs) on the spines for better scalability and to simplify the peering configuration between the leafs. However, the concept of an RR does not exist with eBGP. In order to achieve a similar behavior, we need to ensure that the routes received by a spine in the EVPN address-family are in turn sent to every neighbor (leaf). This is not the default behavior, since a VPN route received from an eBGP neighbor is only forwarded to other peers when that VPN is locally configured. The goal is not to create all the VPNs (aka VRFs and Networks) on the spine, but instead to let the spines forward the VPN routes even without this local VRF/Network configuration. This is achieved by leveraging the "retain route-target all" command for the respective address-family (L2VPN EVPN), as highlighted in the configuration sample below.

 

Example Overlay configuration from spine to leaf:

router bgp 65521
  router-id 10.101.101.121
  address-family ipv4 unicast
    redistribute direct route-map RMAP-REDIST-DIRECT
    maximum-paths 4
  address-family l2vpn evpn
    retain route-target all
  template peer OVERLAY-PEERING
    update-source loopback0
    ebgp-multihop 5
    address-family l2vpn evpn
      send-community
      send-community extended
      route-map UNCHANGED out
  neighbor 10.101.101.31
    inherit peer OVERLAY-PEERING
    remote-as 65522
    address-family l2vpn evpn
      disable-peer-as-check (only for Dual-AS design)

route-map UNCHANGED permit 10
  set ip next-hop unchanged

 

Example Overlay configuration from leaf to spine:

router bgp 65522
  router-id 10.101.101.31
  address-family ipv4 unicast
    redistribute direct route-map RMAP-REDIST-DIRECT
    maximum-paths 4
  neighbor 10.101.101.121
    remote-as 65521
    update-source loopback0
    ebgp-multihop 5
    address-family l2vpn evpn
      allowas-in 3 (only for Dual-AS design, as an alternative to as-override on the spine)
      send-community
      send-community extended

 

So what was achieved by tuning all these BGP configuration knobs for the overlay control plane? Essentially, the result is iBGP-like behavior; the only difference is the multiple-AS configuration. Functionality-wise, we are not really using eBGP for the overlay control plane at all!

 

There are further considerations to be taken into account when using a Multi-AS or Dual-AS design. Some considerations are around operational aspects, others around the function of automated macros.

 

But wait, there is more! When using the automated Route-Target macro for EVPN ("route-target both auto evpn"), the format uses the AS number as the leading 2-byte value, followed by a 4-byte field that represents the VNI (the Layer-2 identifier for a Network and the Layer-3 identifier for a VRF) in our VXLAN case. If we use the Multi-AS design, where every leaf has a different AS, the Route-Target (RT) would clearly mismatch between export and import, and as a result no prefixes would be installed on the respective leafs. To overcome this, there is the option to use static Route-Targets, or to rewrite the AS portion of the Route-Target where necessary with the "rewrite-evpn-rt-asn" option in the L2VPN EVPN address-family.
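Below is a minimal sketch of the rewrite option, shown here on a leaf's overlay peering towards the spine (the neighbor address and AS numbers match the earlier examples). The idea is that the AS portion of the route-targets carried in incoming EVPN updates is rewritten to the local AS, so that the auto-derived import Route-Targets match again:

router bgp 65522
  neighbor 10.101.101.121
    remote-as 65521
    address-family l2vpn evpn
      rewrite-evpn-rt-asn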

 

On the general operational side, the number of touch points is clearly extensive when we look at all the configuration knobs and peerings required for the underlay and overlay setup presented so far. We are in a single BGP instance with multiple address-families (AF), so making a mistake in the wrong AF can have a catastrophic impact, as either the underlay and/or the overlay is affected. Not infrequently we see people adding a "no" in front of a command, but executing it in the wrong context of the configuration hierarchy.

 

Setting the eBGP terminology aside, the deployed configuration creates a hybrid solution where certain address-families behave like eBGP (IPv4 Unicast) and others have been converted to behave like iBGP (L2VPN EVPN). Here we go, the incarnation of a hybrid!

 

As a consequence of all the considerations and caveats presented so far, we think that the use of an IGP for the underlay control plane and iBGP for the overlay control plane is the more elegant approach when compared with deploying "eBGP" for both control planes. All the bells and whistles required to make eBGP behave like iBGP for the overlay control plane create quite a bit of complexity in configuration and operation, with a questionable value-add. Even the use of eBGP for the underlay alone requires AS-check considerations. Given all of the above, and the simplicity of configuration and operation in the IGP/iBGP case, we would always vote for the simpler approach first. There may be a misconception that eBGP as an underlay converges faster in failure scenarios than an IGP, but this is not true, especially once BFD is taken into consideration. Where appropriate, a true eBGP/eBGP approach can be chosen that provides true underlay and overlay separation; this is what is employed in VXLAN EVPN Multi-Site deployments, where multiple sites or fabrics need to be interconnected. But just because one has a hammer (eBGP), it doesn't mean everything looks like a nail!

 

Special thanks to Renato Ramalho Fischer for his contributions to this Blog and his continued work on BGP.

 

Note: for more information on this please refer to the following link:

VXLAN Innovations - VXLAN EVPN Multi-Site: Part 2 of 2

https://blogs.cisco.com/datacenter/vxlan-innovations-vxlan-evpn-multi-site-part-2-of-2

 

In addition, enjoy an extended read about Building Data Centers with VXLAN BGP EVPN from a Cisco NX-OS perspective:

 

http://www.ciscopress.com/store/building-data-centers-with-vxlan-bgp-evpn-a-cisco-nx-9781587144677

 

Authored by:

Max Ardica - Principal Engineer | Max on LinkedIn

David Jansen - Distinguished Systems Engineer | David Jansen (@ccie5952) on Twitter | David Jansen on LinkedIn

Shyam Kapadia - Principal Engineer | Shyam on LinkedIn

Lukas Krattiger - Principal Engineer | Lukas Krattiger (@CCIE21921) on Twitter | Lukas Krattiger on LinkedIn

 

Watch Lukas and David present on the Fundamentals of Multi-Tenancy in VXLAN.