8 Replies Latest reply: Jun 20, 2018 8:11 AM by Micheline

    VTEPs

    Jeffrey

      I have a question about VTEPs when building the physical network for an NSX implementation of VXLAN with unicast. When the VTEPs are on the same subnet, do they need layer 2 connectivity? Is it a requirement for host 1 and host 2 to have a layer 2 connection? I can't imagine it working any other way.

      vteps-pic.jpg

         
        • 1. Re: VTEPs
          Micheline

          Dear Jeffrey--I'm a little bit confused by your topology pic.  I am going to assume that you mean that Host1 is attached to VTEP 10.11, Host2 to VTEP 10.12, and Host3 to VTEP 20.20.

           

          For VXLAN, L2 connectivity is achieved using L2VNIs.  That is, VTEP 10.11 and VTEP 10.12 would both be configured with the same VNI.  On the NVE interface you'd have some configuration like this, where VNI 10001 was mapped to the VLAN associated with 10.10.10.0/24 and VNI 10002 was mapped to the VLAN associated with 10.10.20.0/24:

           

          interface nve1
            no shutdown
            host-reachability protocol bgp
            source-interface loopback0
            member vni 10001
              mcast-group 224.4.4.4
            member vni 10002
              mcast-group 224.4.4.4


          Assuming you have your VXLAN fabric up and operational, you only need to have L2 connectivity from the host to its own VTEP.  VTEP-to-VTEP is actually VXLAN over the L3 underlay.  Let's say we want to send packets from Host1 to Host2, both in the same subnet. 


          1. Host1 knows Host2 is in the same subnet, so Host1 looks up Host2's MAC address in its ARP cache.  Assuming it does not have an entry, it'll ARP for Host2's MAC.
          2. Host1's ARP reaches VTEP1.  If you have VXLAN BGP EVPN configured properly, VTEP1 will proxy for Host2 and answer Host1's ARP.  That is because VTEP2 will have already advertised that it has a route to Host2 when Host2 connected to VTEP2 and announced itself with a gratuitous ARP (GARP).
          3. Host1 gets VTEP1's proxy ARP reply back and builds an Ethernet frame as usual, addressed to Host2's IP and MAC address.  The frame is sent out and reaches VTEP1.
          4. VTEP1 looks at the incoming frame and sees that it has Host2's MAC address mapped to VTEP2's IP address.  VTEP1 encapsulates the original frame in a VXLAN header plus outer UDP, IP, and Ethernet headers, with outer SIP = VTEP1's IP and outer DIP = VTEP2's IP.  The outer source MAC is VTEP1's MAC and the outer destination MAC is the MAC of the next hop.
          5. The packet with the VXLAN header is delivered to VTEP2 per normal IP forwarding rules.
          6. Once the packet arrives at VTEP2, VTEP2 strips off the outer headers and the VXLAN header to expose the original frame that Host1 constructed for Host2.
          7. Now the original Ethernet frame destined for Host2 is delivered per usual Ethernet rules.
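
Steps 4 through 6 can be sketched in a few lines of Python. This is only an illustration of the 8-byte VXLAN header layout described in RFC 7348 (a flags byte of 0x08 followed by a 24-bit VNI), not how a Nexus VTEP is implemented. The function names and the sample frame bytes are made up for the example, and the outer Ethernet, IP, and UDP headers are left out for brevity.

```python
import struct
from typing import Tuple

VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag from RFC 7348: the VNI field is valid


def vxlan_encap(inner_frame: bytes, vni: int) -> bytes:
    """Prepend the 8-byte VXLAN header to an intact Ethernet frame."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + 24 reserved bits.
    # Word 2: VNI (24 bits) + 8 reserved bits.
    header = struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)
    return header + inner_frame


def vxlan_decap(payload: bytes) -> Tuple[int, bytes]:
    """Strip the VXLAN header and return (vni, original frame)."""
    word1, word2 = struct.unpack("!II", payload[:8])
    if not (word1 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag not set")
    return word2 >> 8, payload[8:]


# Hypothetical frame from Host1 to Host2 (placeholder bytes, not a real capture).
frame = bytes.fromhex("005056aabbcc" "005056ddeeff" "0800") + b"payload"
packet = vxlan_encap(frame, vni=10001)   # what VTEP1 does in step 4
vni, recovered = vxlan_decap(packet)     # what VTEP2 does in step 6
```

The point to notice is that the original frame comes out of `vxlan_decap` byte for byte as it went in; the underlay never has to look inside it.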

           

          However, if you have no L3VNI (and distributed anycast gateway) configured, Host1 and Host2 cannot reach Host3. 


          Does this help?  MM



          • 2. Re: VTEPs
            Jeffrey

            Micheline, thanks for attempting to clear things up!  What I don't get is step 4: outer SIP = VTEP1's IP and outer DIP = VTEP2's IP. I don't quite understand how traffic can be routed over layer 3 when the VTEPs' SIP and DIP are on the same subnet. With my R&S background I'd expect this traffic to need layer 2, but I'm thinking maybe Nexus magic makes this work?

            • 3. Re: VTEPs
              Luke

              If the VTEPs are on the same layer3 segment, they will work.

               

              It's just like having two workstations on the same layer-3 segment. They still use layer-3 to communicate, even though there's no router in between.
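
To make the "same segment" point concrete, here is a small Python sketch of the decision any IP stack makes before sending: if the destination is inside the sender's own subnet it is on-link and is reached directly (after ARP), otherwise the packet is handed to a gateway. The VTEP addresses are a guess at what the 10.11, 10.12, and 20.20 labels in the picture stand for, and the gateway address is invented.

```python
import ipaddress


def next_hop(src_if: str, dst: str, gateway: str) -> str:
    """Return the IP the sender must resolve at L2 in order to reach dst.

    src_if is the sender's interface in CIDR form, e.g. "10.10.10.11/24".
    """
    interface = ipaddress.ip_interface(src_if)
    dst_ip = ipaddress.ip_address(dst)
    if dst_ip in interface.network:
        return str(dst_ip)  # on-link: ARP for the destination itself
    return gateway          # off-link: ARP for the gateway, which routes the packet


# Assumed addressing, loosely based on the topology picture.
same_subnet = next_hop("10.10.10.11/24", "10.10.10.12", gateway="10.10.10.1")
other_subnet = next_hop("10.10.10.11/24", "10.10.20.20", gateway="10.10.10.1")
```

Either way the VXLAN packet is an ordinary IP packet; only the next hop differs.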

               

              Here's an oversimplification: A VTEP is just a virtual interface, like a loopback. Connectivity between VTEPs follows the same rules as connectivity between loopback interfaces. They're simply layer-3 interfaces.

               

              The magic is where the VTEP encapsulates and decapsulates VXLAN traffic. This is a similar principle to how a VTI configured with GRE will work.

               

              I hope that helps, and doesn't make it more confusing!

              • 4. Re: VTEPs
                Jeffrey

                Thanks for the help, Luke. I think your first sentence gets right at what I was asking. By hosts I meant the ESXi boxes, and that they would need layer 2 connectivity over Nexus to communicate with the VTEPs when the VTEPs are on the same subnet. I think this is exactly what you meant by same layer 3 segment.

                • 5. Re: VTEPs
                  Micheline

                  Hello Jeffrey--The VTEPs don't need to be in the same subnet.  They only have to have a route to each other.  If you wanted to, you could probably statically configure the underlay (it wouldn't be very resilient, but it is possible).  The only L2 space is between the host and its VTEP.  If the hosts are directly attached to the VTEP, as in, for example, a UCS FI directly connected to the VTEP, you don't even need to run STP.

                   

                  When a switch gets a frame that is intra-subnet traffic, it does a MAC lookup.  Usually, the MAC lookup results in a port that the switch just sends the traffic out.  In the case of a VXLAN VTEP, the MAC lookup might result in the IP address of a remote VTEP.  When that happens, the VTEP knows to encapsulate the Ethernet frame intact in an additional VXLAN header, with the remote VTEP IP address as the outer DIP.  From the time the VXLAN packet leaves the source VTEP to the time that it arrives at the destination VTEP, normal rules of IP/L3 forwarding apply.  The intervening routers won't even see the payload.  This means that what would normally be Ethernet-only traffic can span any distance so long as there is IP routing from VTEP to VTEP.
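
As a rough sketch of that lookup in Python: the table maps a destination MAC either to a local port or to a remote VTEP. The class names, MAC addresses, port name, and VTEP IP below are all invented for illustration, and a real VTEP does this in hardware.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class LocalPort:
    name: str


@dataclass(frozen=True)
class RemoteVtep:
    ip: str
    vni: int


Entry = Union[LocalPort, RemoteVtep]


def forward(mac_table: Dict[str, Entry], dst_mac: str) -> str:
    """Describe what a VTEP does with a frame addressed to dst_mac."""
    entry = mac_table.get(dst_mac)
    if entry is None:
        return "unknown unicast: handled by the flood (BUM) path"
    if isinstance(entry, LocalPort):
        return f"switch out {entry.name}"
    return f"encapsulate in VNI {entry.vni}, outer DIP {entry.ip}"


# Hypothetical table on VTEP1: Host1 is local, Host2 sits behind VTEP2.
table: Dict[str, Entry] = {
    "0050.56aa.0001": LocalPort("Ethernet1/1"),
    "0050.56aa.0002": RemoteVtep(ip="10.10.10.12", vni=10001),
}
local_result = forward(table, "0050.56aa.0001")
remote_result = forward(table, "0050.56aa.0002")
```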

                   

                  Why VXLAN?  Why not normal L2 Ethernet switching?  Because normal L2 Ethernet switching has some significant disadvantages. 

                  • Normal L2 supports only about 4,000 VLANs (the 12-bit VLAN ID allows 4,094 usable IDs).
                  • STP normally shuts down half or more of the network's capacity in order to prevent looping.
                  • L2 does not support ECMP, and convergence with STP can be slow.
                  • Enlarging the L2 space means enlarging the area vulnerable to broadcast storms or other broadcast-based catastrophes.
                  • Ethernet frames live forever.  They have no TTL to age them out. 

                   

                  VXLAN addresses all of these disadvantages.  It offers the ability to segregate traffic into about 16 million VNIs (the VNI is a 24-bit field).  It doesn't rely on STP at all.  Since it's based on L3 routing, it can take advantage of ECMP and TTL.  Finally, it reduces L2 failure domains to discrete small spaces between each host and its VTEP.
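
The difference in scale comes straight from the width of the two ID fields. A quick back-of-the-envelope check in Python (the variable names are just for this example):

```python
VLAN_ID_BITS = 12   # 802.1Q VLAN ID field
VNI_BITS = 24       # VXLAN Network Identifier field (RFC 7348)

vlan_ids = 2 ** VLAN_ID_BITS      # 4096 values, of which 0 and 4095 are reserved
usable_vlans = vlan_ids - 2
vni_values = 2 ** VNI_BITS

print(usable_vlans)  # 4094
print(vni_values)    # 16777216
```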

                   

                  Does this answer your question?  MM

                  • 6. Re: VTEPs
                    Ivan Biasi

                    Hi all

                     

                    In short:

                    When VXLAN is implemented at the hypervisor layer (as with NSX on ESXi hosts), the underlay network doesn't necessarily need to be a flat L2.

                    Because of the significant improvements VXLAN offers, the underlay is commonly implemented as L3, with VTEP IP addresses in different subnets.

                    With that approach the underlay transport can be a "commodity network", implemented without any special features, which needs to offer just:

                    - full IP reachability from any VTEP to the others;

                    - MTU >= 1550 (better 1600 bytes), because VXLAN encapsulation adds about 50 bytes of headers;

                    - multicast support (not for all implementations, and not necessarily for NSX).
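
The 1550 figure can be reproduced with simple arithmetic. The sketch below assumes a standard 1500-byte inner IP MTU and an IPv4 underlay; the header sizes are the usual fixed sizes for Ethernet, VXLAN, UDP, and IPv4 without options, and the function name is made up.

```python
# Header sizes in bytes. The inner frame's Ethernet header travels inside the
# outer IP packet, so it counts against the underlay's IP MTU.
INNER_ETHERNET = 14   # untagged; an 802.1Q tag adds 4 more
VXLAN_HEADER = 8
OUTER_UDP = 8
OUTER_IPV4 = 20       # would be 40 on an IPv6 underlay


def required_underlay_mtu(inner_ip_mtu: int, inner_vlan_tag: bool = False) -> int:
    """Smallest underlay IP MTU that carries an inner packet without fragmenting."""
    overhead = INNER_ETHERNET + VXLAN_HEADER + OUTER_UDP + OUTER_IPV4
    if inner_vlan_tag:
        overhead += 4
    return inner_ip_mtu + overhead


standard = required_underlay_mtu(1500)                      # 1550
tagged = required_underlay_mtu(1500, inner_vlan_tag=True)   # 1554
```

Setting the underlay to 1600 leaves headroom for cases like the tagged one.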

                     

                    HTH, regards

                    I.

                    • 7. Re: VTEPs
                      Jeffrey

                      Micheline, I was overlooking the basics. If you inject a /32 route, hosts can consistently communicate with each other and the spine can always get to the right VTEP!

                      • 8. Re: VTEPs
                        Micheline

                        Actually Jeffrey, VXLAN is already one step ahead of you.  VXLAN BGP EVPN automatically generates two routes per host.  One is a MAC-only route.  The other is a MAC and IP route, where the IP route is a /32.  Both of these routes also include the VTEP IP address of the host's VTEP and the VNI with which they are associated.  It looks like this:

                         

                        pod4-leaf-1# sh bgp l2 evpn vni 10012
                        BGP routing table information for VRF default, address family L2VPN EVPN
                        BGP table version is 74, local router ID is 4.0.0.101
                        Status: s-suppressed, x-deleted, S-stale, d-dampened, h-history, *-valid, >-best
                        Path type: i-internal, e-external, c-confed, l-local, a-aggregate, r-redist, I-injected
                        Origin codes: i - IGP, e - EGP, ? - incomplete, | - multipath, & - backup

                           Network            Next Hop            Metric     LocPrf     Weight Path
                        Route Distinguisher: 4.4.1.1:33179    (L2VNI 10012)
                        *>l[2]:[0]:[0]:[48]:[0050.5688.cbec]:[0]:[0.0.0.0]/216
                                              4.4.1.100                         100      32768 i
                        *>l[2]:[0]:[0]:[48]:[0050.5688.cbec]:[32]:[172.24.20.102]/272
                                              4.4.1.100                         100      32768 i

                         

                         

                        In the output you can see the host MAC address (0050.5688.cbec) and the host's corresponding IP address (172.24.20.102).  The [32] between the MAC and the IP indicates the mask length.  You can see that these routes are associated with VNI 10012, and that VNI 10012 is an L2VNI.  You can also see that this host's next hop is 4.4.1.100.  The next hop is the VTEP to which this host is attached, which is the remote VTEP from the point of view of every other VTEP.
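
If you want to pull those fields out of the output programmatically, a small Python sketch like the following would do it. The field names are mine; `mac_len`, `mac`, `ip_len`, and `ip` follow the description above, while `esi` and `eth_tag` are my reading of the second and third bracketed fields and should be treated as an assumption.

```python
import re
from typing import Dict

FIELD_RE = re.compile(r"\[([^\]]*)\]")


def parse_type2(route: str) -> Dict[str, str]:
    """Split an NX-OS style EVPN Type 2 route key into named fields."""
    key, _, total_bits = route.partition("/")
    fields = FIELD_RE.findall(key)
    if len(fields) != 7 or fields[0] != "2":
        raise ValueError(f"not a Type 2 route: {route!r}")
    # "esi" and "eth_tag" are assumed names for the 2nd and 3rd fields.
    names = ["route_type", "esi", "eth_tag", "mac_len", "mac", "ip_len", "ip"]
    parsed = dict(zip(names, fields))
    parsed["key_bits"] = total_bits  # the number after the slash
    return parsed


mac_only = parse_type2("[2]:[0]:[0]:[48]:[0050.5688.cbec]:[0]:[0.0.0.0]/216")
mac_ip = parse_type2("[2]:[0]:[0]:[48]:[0050.5688.cbec]:[32]:[172.24.20.102]/272")
```

The MAC-only route shows up with an IP length of 0, and the MAC and IP route with an IP length of 32.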

                         

                        These routes are considered Type 2 routes.  (That's the first set of square brackets.)  Type 5 routes are the more familiar subnet-only routes.  They carry only an IP prefix, and the mask length is something less than 32.  A Type 5 route looks like this:

                         

                         

                        pod4-leaf-1# sh bgp l2vpn evpn vni-id 10003
                        BGP routing table information for VRF default, address family L2VPN EVPN
                        BGP table version is 113, local router ID is 4.0.0.101
                        Status: s-suppressed, x-deleted, S-stale, d-dampened, h-history, *-valid, >-best
                        Path type: i-internal, e-external, c-confed, l-local, a-aggregate, r-redist, I-injected
                        Origin codes: i - IGP, e - EGP, ? - incomplete, | - multipath, & - backup

                           Network            Next Hop            Metric     LocPrf     Weight Path
                        Route Distinguisher: 4.0.0.101:3    (L3VNI 10003)
                        *>i[2]:[0]:[0]:[48]:[0050.5688.7637]:[32]:[172.24.1.103]/272
                                              4.4.1.3                           100          0 i
                        * i[5]:[0]:[0]:[24]:[172.24.1.0]:[0.0.0.0]/224
                                              4.4.1.3                           100          0 i
                        *>l                   4.4.1.100                         100      32768 i
                        * i[5]:[0]:[0]:[24]:[172.24.2.0]:[0.0.0.0]/224
                                              4.4.1.3                           100          0 i
                        *>l                   4.4.1.100                         100      32768 i

                         

                         

                        Here is an L3VNI, and you can see that the first route is a Type 2 route for 172.24.1.103/32, a host.  This route also includes the host's MAC address.  Since VNI 10003 is an L3VNI, this host route is here because this host sent traffic to a host in another subnet/VLAN/VNI.  The remaining two routes are Type 5 routes generated by the L3VNI itself.  See how different they are?  They are IP-only routes, and these routes are /24s.  (Here the mask precedes the IP address.)  Also, you can see that there are two next hop IP addresses: 4.4.1.3 and 4.4.1.100.  These IP addresses correspond to the VTEPs which have been configured with this VNI.  If you intend to redistribute out of the VXLAN fabric, it's these routes that you want to inject out.
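
To see the host route versus subnet route distinction in code, here is a Python sketch that extracts the IP prefix from either route type, using the field positions described above (mask after the MAC for Type 2, mask before the address for Type 5). The function name is invented, and the meaning of the last Type 5 field is not something I rely on here.

```python
import ipaddress
import re
from typing import Union

FIELD_RE = re.compile(r"\[([^\]]*)\]")

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


def route_prefix(route: str) -> Network:
    """Return the IP prefix carried by a Type 2 (MAC+IP) or Type 5 route key."""
    fields = FIELD_RE.findall(route.partition("/")[0])
    route_type = fields[0]
    if route_type == "2":
        # Type 2: ... [mac_len]:[mac]:[ip_len]:[ip]
        ip_len, ip = fields[5], fields[6]
    elif route_type == "5":
        # Type 5: ... [ip_len]:[ip]:[last field], so the mask precedes the address
        ip_len, ip = fields[3], fields[4]
    else:
        raise ValueError(f"unsupported route type {route_type}")
    return ipaddress.ip_network(f"{ip}/{ip_len}")


host = route_prefix("[2]:[0]:[0]:[48]:[0050.5688.7637]:[32]:[172.24.1.103]/272")
subnet = route_prefix("[5]:[0]:[0]:[24]:[172.24.1.0]:[0.0.0.0]/224")
host_in_subnet = host.subnet_of(subnet)
```

Here the Type 2 host route 172.24.1.103/32 falls inside the Type 5 prefix 172.24.1.0/24, which is what you would expect for a host in that subnet.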

                         

                        Finally, my very first blog was on reading VXLAN show commands.  I would modestly recommend it as a read: "How's about a little VXLAN ...just for fun?"

                         

                        HTH, MM