It's a common fact that we tend to ignore things that are taken for granted as we assume their presence to be constant over time. This applies undeniably to many aspects of our lives, including work. In this post I intend to describe the history and drivers behind Cisco Express Forwarding (CEF), its inner moving parts, how it carries out its tasks and why we should not take this feature for granted.
As we cannot -or should not- jump right into the deepest part of the pool without getting our feet wet first, level setting will help us to be on the same page. Let's begin by warming up the engines before the race begins.
Setting the stage: Switching process
Before putting CEF under the magnifying glass and exploring its clockwork, it is advantageous to begin with a proper examination of the packet forwarding process on a router. The process can be -generally- depicted in the illustration below.
Figure 1 - Forwarding process in a router
As represented above, there are several steps need to be taken in order for a router to forward a packet. The process can be described as follows:
- The router receives a frame, and to make sure it can work with it, it checks the received frame check sequence. If any mismatches are found, the frame is discarded.
- If the FCS check is passed, the router then checks the Ethernet type field for the packet type and extracts the packet.
- Once the packet has been extracted, the following actions depends on the IP version of the packet: If it is IPv4, the header checksum is verified; a packet with a wrong checksum is discarded. If the packet is IPv6, this check is skipped as an IPv6 packet header does not contain a checksum.
- The router checks if the destination address of the packet matches any of the IP addresses configured on its interfaces in “up, line protocol up” state. If there is a match, the router is the final destination of the packet and it processes the packet locally.
- If the destination address did not match with of the “up, line protocol up” interfaces configured in the router, then the routing game begins. But in order for that to happen, the TTL must be higher than 1. The router checks if the TTL is higher than 1. If it is, it can be routed; if it does not contain a value higher than 1 then it is discarded and its sender notified with an ICMP Time Exceeded message.
- Time for routing! - The router checks its routing table and looks for the longest prefix match for the destination address in the packet.
- Once the next hop and the outgoing interface have been found in the routing table, the router needs the data link layer information of the next hop to build a new frame and be able to deliver the packet. It relies on mechanisms like ARP -there are many others- to map Layer3 to Layer2 information.
- As the packet will be routed out, the TTL is reduced by one, and as a consequence, the contents of the IP header have changed (TTL), therefore the IPv4 checksum is recomputed.
- Finally, the routed encapsulates the processed packet inside the newly created data link header and trailer. BAM! We have got a new frame to deliver right out of the oven and we are ready to send it out!
After a brief examination of the forwarding process, we can recognize that there are many steps involved in moving a packet between two points in a network. Its beneficial to understand it so we can comprehend what was done to improve it and the reasons behind it.
Where do we start? Or better said: When?
Everything began in a not-so-distant past right around the corner. Hop on the time machine and let's travel back in time to the late 80s and early 90s, where there were no beefy routers, fancy line cards and hardware accelerated operations. Everything was a matter of wisely using processing power and saving CPU cycles for a rainy day.
In the late 80's, process switching was the first and simplest mechanism used to accomplish packet delivery in networks. The forwarding process was performed by the router’s CPU in a per-packet basis from the moment the packet was received on an inbound interface until its forwarding was made via an outgoing interface. Very simply put, the process switching was nothing else than executing the 9-step process outlined above again and again for every received packet individually, with the procedure being implemented as a standalone running process of the network operating system run on the router.
In the process described above, the router would spend most of the time in a couple of steps, first sifting through the routing table for the destination address in the packet, finding the next hop and the outgoing interface (possibly through recursion), and then looking up the physical address of the next hop (mechanisms like ARP to map Layer 3 to Layer 2 addresses were key in this step). Improving the throughput and lowering the packet routing and switching latency therefore aimed at making these processes as efficient as possible.
The process of routing a packet involves small but important changes to its header information. From a 10000-foot view, looking at a packet fully encapsulated in a frame before and after being forwarded by a router, the most important change lies in the data link header. Building the new frame header is commonly referred to as frame rewrite, as it describes what essentially happens: rewriting the frame header based on known information about the packet’s next hop and how to forward it further. Looking up the necessary information to perform a frame rewrite is one of the most resource-intensive tasks of the whole forwarding process, and having to perform it each time a packet was received from scratch was inefficient.
All these steps executed for every single packet were burdensome and became a process difficult to maintain over time when throughput in networks was increasing. Scale and speed took their toll when demands were ramping up and routers became bottlenecks in the networks not able to accommodate growth.
“Congratulations, it's a…” fast switching? - It was born!
One of common approaches used in computers systems to improve an algorithm’s efficiency is caching: Making use of a previously calculated request, and storing it to serve the data faster for a later request with the same set of inputs. Fast switching allowed packets to follow the “route once, switch many times” paradigm, making use of an “on-demand caching” where the first packet of a request was process-switched and lookup results stored in cache, and then used for future packets, therefore reducing computing time and aiming to optimize resource consumption. Whenever a packet arrived, its destination address would first be looked up in the cache. If the cache contained a match, the forwarding information stored there would be used to route the packet right away. If not, the packet would be process switched, and the results added to the cache for future use.
What would this cache contain? The data stored in the cache would include the destination IP address, next hop information, and data link header information that would be written into the frame before sending it out through the outgoing interface. Any subsequent packets to the same destination would already have the complete forwarding information handy, without the need of process-switching them.
Although Fast Switching was an undisputed improvement, it also had significant downsides: If the router suddenly received a large burst of packets for destinations that were not yet in the cache, the cache would need to be built for all of them. The router would effectively regress to process-based switching for these packets, hindering the advantages fast switching could bring. Furthermore, whenever the routing table or the Layer3-to-Layer2 mapping tables were updated, the affected entries in the fast cache had to be removed since they became outdated. Initial packets for the affected destinations would therefore again need to go through process switching. In addition, load sharing in Fast Switching was rather limited: When a Fast Cache entry for a new destination was created, the router remembered the exact outgoing interface for the packet. All other packets to the same destination would hit the same Fast Cache entry and so would be sent out the same outgoing interface that was determined when the first packet was process switched.
Eventually, it became evident that while fast switching was an improvement for the previous situation, it was still not able to keep up with the rapid pace at which networks were evolving. By 1996, a new mechanism was created to overcome fast switching limitations and bring improvements to packet switching networks.
Cisco Express Forwarding
Caching and reusing computed results is a very effective approach to immediately improve an algorithm’s performance. However, creating the cache contents may still be computationally intensive, and that was one of Fast Switching weak points. Eventually, to squeeze even more performance from software-based routers, and even make it simpler to implement this process in hardware, it was necessary to rethink the entire process of routing packets. To understand how limitations were addressed, we have to grasp where the room for improvement was in the way the forwarding process was executed. Among all tasks, the most complex and resource intensive are: Finding the best path for destination networks -including outbound interface- and building a new data link layer header, better known as frame rewrite.
Routing table lookup
Let us revisit the traits of process switching. To find the best path to the destination network of a packet, the router had to go through its routing table, or Routing Information Base (RIB), and find the best matching route for the packet’s destination address. The lookup rule was the longest prefix match - finding the entry in the RIB that, compared to other RIB entries, matched the packet’s destination IP address in the longest run of bits, starting with the highest bit. One sweep through RIB may not have been enough, however. If the RIB entry only contained the next hop address but not the outgoing interface, the sweep had to be repeated - this time, for the next hop address itself. This process had to be repeated until the router found a RIB entry that ultimately pointed to a directly attached network. This lookup process could take several iterations until its completion, and even when the outgoing interface and the ultimate next hop IP address were finally known, the routing table still did not contain frame rewrite information for the packet. Only after constructing the new frame header using the ARP table, or any other table that maps the next hop’s IP address onto its Layer 2 address, the packet could finally be sent out. It is important to recall that this process had to be repeated for every packet.
How to improve this? Note how the RIB lookups are inefficient: The routing table needs to be sorted according to the prefix lengths so that all networks with the prefix length of /32 are always at the top, then followed by all networks with the prefix length of /31, then by all networks with the prefix length of /30, and so on, till finally, a default route (the only possible route with the prefix length of /0) would be placed at the very bottom of the table. Performing a lookup in this table meant traversing the table from the top, entry by entry, computing the binary AND between the packet’s destination address and the netmask in the RIB entry, and comparing the result with the network address in the entry, stopping at the first match. Since the table is assumed to be organized from the most specific network entries to the least specific ones, the first match when traversing the table from top to bottom is automatically the longest prefix match, too. In the worst case, the router needs to inspect every single entry in the routing table before finding a match - just imagine having 500,000 RIB entries, and receiving a packet whose only matching route is the default route at the bottom of the table. And as if this was not enough, if the best matching entry does not store the information about the outgoing interface, just the next hop (a typical case with static routes or BGP-learned routes), the lookup needs to be repeated recursively for the next hop address to find the outgoing interface. The routing table is a component made to build and store reachability information, not truly optimized for lookups. To optimize it, we need to change its linear structure to something more efficient to perform lookups faster, and to get rid of the recursion. A structure meeting these requirements, truly optimized to perform fast lookups, is separate from the RIB although it is populated by its contents, and is called the Forwarding Information Base (FIB).
The FIB is a database built dynamically using the information contained in the routing table: destination prefixes copied and their next hops recursively resolved. Why does this make sense? The goal is to find the ultimate forwarding information for a packet in a single lookup.
But what makes the FIB different from the RIB? As opposed to the RIB, whose purpose is to build and store reachability information, the FIB is built and ordered in a way that allows it to optimize fast retrieval of information and longest prefix match lookup. Its structure is called trie, a tree-like searchable data structure comprised of nodes and leaves that focuses on fast data retrieval.
Figure 2 - Example of a Trie
The trie is composed of a root node, child nodes and keys. The root is the entry point where we start to traverse the tree, and it has pointers or links to other nodes called child nodes. As opposed to common trees in which a particular node stores the complete key, a trie stores a key (a word, or a number) by splitting it into letters or digits, and creating a node for each letter or digit on progressively deeper levels of the tree. The complete key is therefore looked up letter by letter, or digit by digit - note the prefix-oriented approach!
Every node in the trie can be thought to contain a flag indicating whether the word created by following the path from the root down to and including this node already represents a valid complete word. This flag is called the terminal flag, and if the flag is set, the node is called a terminal node. A terminal node is not necessarily a leaf node located at the end of a trie branch - it can very well be an internal node, too. As an example, in Figure 2, in the branch representing the word SHELLS, there are three valid words and thus three terminal nodes: SHE, SHELL, SHELLS, with the terminal nodes highlighted. Note how the terminal nodes in the SHELLS branch outline the prefixes of this word that are complete words in their own right - SHE, SHELL, SHELLS. Each terminal node in Figure 2 has a non-zero number put next to it; we will discuss the meaning of this number in a few moments.
Performing a lookup in a trie follows a simple algorithm, using the term current node to denote the node we have just reached during our walk in the trie:
- Set the current node to the root node.
- If the current node is a terminal node, remember it, forgetting any previously found terminal nodes.
- Attempt to descend from the current node to its child node that matches the next letter or digit in the key we are looking up.
- If there is no such child node, STOP. The most recently found terminal node represents the longest prefix match for the key. Otherwise, make the child node the new current node, and return to Step 2.
Let’s see how this algorithm would work if trying to find the word SHELTER:
Figure 3 - Looking up a string in a trie
- We start at the root node as the current node (Step 1). The root node is not a terminal node (it does not have a non-zero value shown in Figure 3), so we do not keep a special track of it (Step 2). Taking the first unprocessed letter from the word SHELTER, we attempt to find a child node of the current node for the letter S (Step 3), and since it exists, we descend into it and make it the new current node (Step 4, returning to Step 2).
- The current node (for letter S) is not a terminal node, so we do not keep a special track of it (Step 2). Taking the first unprocessed letter from the word SHELTER, we attempt to find a child node of the current node for the letter H (Step 3), and since it exists, we descend into it and make it the new current node (Step 4, returning to Step 2).
- The current node (for letter H) is not a terminal node, so we do not keep a special track of it (Step 2). Taking the first unprocessed letter from the word SHELTER, we attempt to find a child node of the current node for the letter E (Step 3), and since it exists, we descend into it and make it the new current node (Step 4, returning to Step 2).
- The current node (for letter E) is a terminal node, so we remember it (Step 2). Taking the first unprocessed letter from the word SHELTER, we attempt to find a child node of the current node for the letter L (Step 3), and since it exists, we descend into it and make it the new current node (Step 4, returning to Step 2).
- The current node (for letter L) is not a terminal node, so we do not keep a special track of it (Step 2). Taking the first unprocessed letter from the word SHELTER, we attempt to find a child node of the current node for the letter T (Step 3). However, such child node does not exist. The lookup therefore ends here (Step 4), and the longest prefix match produced by the lookup is the word SHE that ends in the most recently found terminal node.
Figure 4 - Finding a terminal node
So far, we have not commented on the meaning of the numbers indicated at the terminal nodes. Truth is, they can be anything - they are the actual values we are trying to access by looking them up using the keys. For our purposes, they will represent pointers to objects (values) stored somewhere in memory that are indexed by the keys in the trie. In fact, some documents insist that a trie must not contain the values associated with the keys in the nodes themselves, rather, it must only point to them. This is not a correct description of a trie; however, many trie implementations indeed put only pointers to the values into the nodes, rather than having the values put directly into the nodes, since this has many advantages from the software engineering perspective.
How is this applied to networking and prefix lookup? With packet destination IP addresses and a classic routing table, we are always trying to find the best matching route in the routing table using the longest prefix match rule. With a trie, the idea is to store the prefix of every known destination network from the RIB in the FIB in the bit-by-bit fashion, one trie level for one bit of the prefix. Since an IPv4 prefix is at most 32 bits long, the depth of the trie will be at most 32 levels (33 if counting the root node as well). As a result, performing the longest prefix match lookup with a packet’s destination IP address will take at most 32 comparisons in the trie, no matter how many distinct prefixes are stored in it. This is a vast improvement over a linear routing table where the number of comparisons was proportional to the number of prefixes stored. The actual lookup in the trie would be done exactly in the same way: We will follow our path from the root down to the bottom as we did before, but this time, we will use binary numbers instead of letters, matching one bit at a time. How would it look if we are matching the bits in an octet? Making a proper arrangement, we obtain the structure represented in Figure 5 below.
For the sake of brevity, the figure represents bit matching between the first and fourth bit of an octet. The key we are looking for would be the binary expression 11100001. If we transform it into a decimal expression, our number turns out to be 225. In Figure 5, we assume that the trie contains a perfect match for all 8 bits; however, it should be simple to imagine how the lookup would be performed if the trie only contained a shorter prefix, for example, only the leading bits 1110.
Figure 5 - Looking up bits in a trie
But what happens if the network does not exist in the tree at all and there is no terminal node to be found in the path? The clever answer is: the root can be a terminal node, too, in which case it would represent the default route. All those lookups for destination networks which we cannot find in the tree will start in the root first. If there is no other terminal node as we branch deeper into the tree, the most recently found terminal node would be the root - which is the case for a default route! Clever, huh?!
Now that we have found the longest prefix match in our trie, what exactly occurs then? Recall that our goal was to find the best matching route for the destination of the packet to know how to forward it. In turn, knowing how to forward means knowing the outgoing interface and the Layer2 address of the next hop - the frame rewrite we have mentioned earlier. This forwarding information is the actual value we are trying to find in the trie, using the longest matching prefix as the lookup key. This forwarding information can be stored right in the terminal nodes of the trie, but because of many advantages, it is better to store this information in a standalone table, and have the terminal nodes just point to the elements of this table.
Ladies and Gentlemen, let's take a step sideways to introduce all of you to The Adjacency Table
A router can know about thousands or hundreds of thousands of destinations in its RIB/FIB, but will have only a small number of neighbors. Necessarily, many RIB/FIB entries will use the same forwarding information - the same next hop, the same outgoing interface, the same frame rewrite. In previous switching mechanisms, the forwarding information was either not stored, and therefore had to be gathered for every routed packet, or was cached (in a per-destination basis).
A smart router can think ahead, however. With the RIB populated, it already knows the IP addresses of individual next hops. The router can start preparing the complete forwarding information - outgoing interface, frame rewrite - even before the packets start flowing. All the necessary information is already available: The next hops are known from the RIB and their Layer2 addresses can be obtained through mechanisms that map Layer3 to Layer2 addresses (ARP and others similar tables). The complete forwarding information compiled from this data would then be placed into a standalone database called the adjacency table.
The adjacency table is a table holding, unsurprisingly, the list of all adjacencies known by the router. In this sense, an adjacency is the complete forwarding information for a directly connected neighbor: The outgoing interface toward the neighbor, and the complete Layer2 frame header that can be used to send packets to that neighbor. There may be other additional or management information included, but the outbound interface and the frame rewrite are the bare minimum.
To illustrate what the adjacency table contains, consider the following topology: Routers R1, R2 and R3 are adjacent to each other, and also R3 is adjacent to 3 end user desktops. There is a subinterface between R2 and R3 and an untagged interface between R3 and the end hosts.
Lets focus only in R3 for the sake of brevity. As explained before, the adjacency table would contain information from other tables already prepared so packets can be swiftly forwarded.
Identifying the interfaces in R3 we obtain their MAC addresses:
R3# show int Ethernet2/5 | in address
Hardware is AmdP2, address is ca03.20a4.003d (bia ca03.20a4.003d)
Internet address is 192.168.3.254/24
R3# show int Ethernet1/3 | in address
Hardware is AmdP2, address is ca03.20a4.001f (bia ca03.20a4.001f)
Internet address is 10.10.13.3/24
R3# show int Ethernet2/3 | in address
Hardware is AmdP2, address is ca03.20a4.003b (bia ca03.20a4.003b)
Internet address is 10.10.23.3/24
What about the ARP table?
R3# show ip arp | ex -
Protocol Address Age (min) Hardware Addr Type Interface
Internet 10.10.13.1 12 ca01.4a40.001f ARPA Ethernet1/3
Internet 10.10.23.2 12 ca02.2904.003b ARPA Ethernet2/3
Internet 192.168.3.1 183 0050.7966.6800 ARPA Ethernet2/5
Internet 192.168.3.2 192 0050.7966.6801 ARPA Ethernet2/5
Internet 192.168.3.3 216 0050.7966.6802 ARPA Ethernet2/5
Internet 192.168.23.2 19 ca02.2904.003b ARPA Ethernet2/3.100
The entries marked in red are routers, and the ones marked in blue are the end hosts. What would the adjacency table contain if we were to route packets from R3 to one of the end hosts?
R3# show adjacency detail
Protocol Interface Address
<omitted for brevity>
IP Ethernet2/5 192.168.3.3(7)
0 packets, 0 bytes
sourced in sev-epoch 0
Encap length 14
The information provided by the output above corresponds to a complete pre-built data link layer information, just ready for the router to use it.
Let’s decompose that huge set of numbers to understand it better.
005079666802 - Destination MAC address - PC3’s MAC address
CA0320A4003D - Source MAC address - Eth2/5’s MAC address
0800 - Ethertype of the packet
Now, what would happen if the communication is through a subinterface? What would be available to us in the adjacency table?
The good news is: we will see even the VLAN tag! Lets check it out!
R3#show adjacency detail
Protocol Interface Address
<omitted for brevity>
IP Ethernet2/3.100 192.168.23.2(7)
0 packets, 0 bytes
sourced in sev-epoch 0
Encap length 18
Decomposing it again:
CA022904003B - Destination MAC address - R2 Ethernet 2/3‘s MAC address
CA0320A4003B - Source MAC address - R3 Ethernet 2/3‘s MAC address
8100 - Ethertype of the 802.1Q header
0 - Class of Service (0) and Discard Eligibility Indicator (0)
064 - VLAN Tag - 100 (064 in Hex = 100 in decimal)
0800 - Ethertype of the original L2 header
As displayed previously, the adjacency table would contain just what the router needs to forward packets. Previous cases covered untagged and tagged interfaces and their pre-built frames ready to pick up in the adjacency table. What would be there prepared for the router is there is a tunnel interface enabled? Looks like there is some homework to do
The FIB and the adjacency table are the key components of Cisco Express Forwarding, and now we know their purpose: The FIB contains all known IP prefixes from the RIB, organized for rapid lookups. A lookup in the FIB using the packet’s destination IP address as a key produces a pointer into the adjacency table that contains pre-built frame headers for individual next hops, so that they can be instantly applied to the packet, and off it flies!
As a concluding remark: FIB can be implemented both in software and in hardware. The software FIB is implemented using the trie data structure we have discussed. In hardware, the FIB is implemented using Ternary Content Addressable Memory ASICs (TCAM) - but that is perhaps for another blog
Thorough this blog post, we have learned about CEF, a mechanism to enhance packet forwarding, aimed to optimize tasks which are resource intensive within the forwarding process. CEF is composed of two elements: The FIB and the adjacency table. This blog post was meant to explore their essence and inner mechanisms and to unveil their logic and benefits without diving into bells and whistles. However, there is much more to visit and detail about it! Following blogs can cover optimization methods and features to enhance performance and squeeze the most of it. Any thoughts, comments or feedback are welcome!
Paluch Peter, Kocharians Narbik - Chapter 6 , pp. 267-282.
Stringfield Nakia , White Russ, McKee Stacia - Chapters 2 and 3 , pp. 51-101.