OAM on NPUs – Not Only To Off-load the CPU
Last week, I wrote a blog post on how critical Operations, Administration and Maintenance (OAM) is in Carrier Ethernet and IP/MPLS networks. These functions are essential for running a smooth and scalable service provider business. This post, in turn, will highlight one critical implementation aspect of these functions in modern switch/router designs. How do you divide the tasks between the network processor and the control host CPU?
In the good old TDM world, OAM features were built in and had a dedicated out-of-band channel for communication. In the packet-based world, however, network OAM protocols often run alongside user traffic and therefore compete for link and switching resources. The new, and still evolving, standards for packet OAM have to be thought through during the systemization phase of the line card so that the feature set gets the backing it needs from the hardware. What bandwidth, packet rates and features have to be supported? How can OAM be given processing guarantees without sacrificing user data?
OAM doesn’t come for free. To monitor a single link, the line card has to generate hundreds of CCM packets (the Ethernet term) or Hello packets (the MPLS term) every second. On the receiving side, the line card has to read and count the incoming OAM packets. The data plane has to detect when the link is lost (indicated by three consecutive lost packets) so that it can trigger a protection switching mechanism and routing protocol convergence in the control plane.
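The receive-side check can be sketched in a few lines of Python. This is only an illustration of the "three consecutive lost packets" rule, not how any particular NPU implements it; the `LinkMonitor` class, its method names and the microsecond timestamps are all assumptions made for the example.

```python
from dataclasses import dataclass

LOSS_THRESHOLD = 3  # consecutive missed packets before the link is declared lost


@dataclass
class LinkMonitor:
    """Hypothetical per-link state for continuity-check monitoring."""
    interval_us: int        # expected spacing between OAM packets, in microseconds
    last_rx_us: int = 0     # timestamp of the most recent OAM packet
    link_up: bool = True

    def on_packet(self, now_us: int) -> None:
        """Record an incoming OAM packet; any packet restores the link."""
        self.last_rx_us = now_us
        self.link_up = True

    def missed_intervals(self, now_us: int) -> int:
        """Number of whole packet intervals that passed without a packet."""
        return (now_us - self.last_rx_us) // self.interval_us

    def poll(self, now_us: int) -> bool:
        """Return True only on the poll that takes the link from up to down."""
        if self.link_up and self.missed_intervals(now_us) >= LOSS_THRESHOLD:
            self.link_up = False
            return True  # the caller would trigger protection switching here
        return False
```

In a real line card this check runs for every monitored link at the packet interval, which is where the processing cost comes from.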
Link OAM traffic can add up to significant packet rates, particularly in service edge routers that may be responsible for terminating and managing hundreds of thousands of virtual connections. And this is just for link monitoring, which is costly from a processing perspective but still a rather basic service in the broader OAM world. Add to this the traffic for per-customer performance monitoring of services like voice and video, and the amount of OAM processing rises to several gigabits per second. This was one of the reasons why system vendors started looking at network processors to support OAM a few years back: they were simply running out of steam on the host CPU. And while performance continues to be a major reason for OAM support in network processors, I believe there is a more fundamental reason why OAM should run on the NPU.
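To get a feel for how quickly link monitoring adds up, here is a back-of-the-envelope calculation in Python. The inputs are illustrative assumptions, not measurements: 100,000 monitored connections, 100 packets per second each (a 10 ms interval), and a round 100 bytes per OAM frame. The `oam_load` helper is a hypothetical name used only for this sketch.

```python
def oam_load(sessions: int, packets_per_second: int, frame_bytes: int) -> tuple[int, int]:
    """Return (aggregate packets/s, aggregate bits/s) in one direction."""
    pps = sessions * packets_per_second
    bps = pps * frame_bytes * 8
    return pps, bps


# Illustrative inputs only (assumptions, see the text above).
pps, bps = oam_load(sessions=100_000, packets_per_second=100, frame_bytes=100)
print(f"{pps / 1e6:.1f} Mpps, {bps / 1e9:.1f} Gbit/s")  # prints "10.0 Mpps, 8.0 Gbit/s"
```

With these assumed numbers, link monitoring alone accounts for millions of packets per second in each direction, before any performance monitoring is added on top.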
If you run status check services in the control plane rather than in the data plane, the OAM design will eventually end up intertwined with, and dependent on, the control plane itself. While OAM plays a key role for the control and management planes, the planes should ultimately be autonomous from each other. When you check the connectivity between two points in the network, you want your test packet to be forwarded exactly as if it were part of the user traffic. This is also required by IEEE 802.1ag for link OAM. The packet should travel across the network using the tables stored in the data plane, not the mirrored copies held in the control plane. If there are inconsistencies between the two, OAM won’t provide the correct data. It is therefore critical to implement OAM functions in the data plane itself. That is, OAM should run on the NPU rather than on the CPU.
In my next post, I will delve deeper into the importance of programmability for OAM functions.
