Network packets in dataplane memory: the Snabb approach

How do you represent an Ethernet packet in a network dataplane application, such as a Unix kernel or software firewall?

I don’t know how to answer this definitively. The optimal design for a new application will need to match your experience, your choice of programming language, your target operating system kernel, CPU microarchitecture, IO interfaces, and much more besides.

Instead let me just tell you how we represent an Ethernet packet in Snabb, a novel dataplane with these key characteristics:

  • Performance of ~250 cycles/packet, ~10Gbps/core, ~100Gbps/CPU.
  • Built-in device drivers for 10G/25G/40G/100G Ethernet NICs.
  • Completely written in Lua, including the device drivers and fast inner loops.
  • Just-in-time compiled for Linux/x86 including some bespoke machine code assembled in Lua at runtime.

This design was thrashed out to fit these requirements between myself, Alex Gall, Andy Wingo, Diego Pino, Javier Guerra Giraldez, Katerina Barone-Adesi, Max Rottenkolber, Nikolay Nikolaev and others over a couple of years.

Overview

The layout of a packet in memory is very simple. The packet data is stored in a fixed-length array. The only metadata is the packet length.

struct packet {
    uint16_t      length;
    unsigned char data[10240];
};

That part is simple. But packets also have to follow some tricky rules:

  • Packets have a single owner. Internal sends transfer ownership. (If you need a copy then make one.)
  • Packets are allocated in contiguous physical memory.
  • Packet pointer address bits have special properties:
    • Bits 44-46 contain the tag value 0x5.
    • Bits 0-43 contain the physical memory address that corresponds to this virtual address.
    • Bits 6-17 contain entropy.
    • Bits 0-7 contain 0x80 in freshly-allocated packets.

Crazy sounding list, right? But these peculiar rules combine to support the way that Snabb does JIT compilation, hardware DMA, multi-process parallelism, and CPU micro-architecture optimization.

JIT compilation

Snabb uses tracing just-in-time compilation for fast-path traffic processing. The most important requirement for effective trace compilation is simple and consistent control flow. Straight-line code massively outperforms complicated and unpredictable control flow with this compiler.

Storing the whole packet in a linear array of bytes hits a sweet-spot for the JIT. Compilation would be much less effective with more complicated data structures like Linux sk_buff, BSD mbuf, or DPDK rte_mbuf that chain variable-sized buffers together and include extensive metadata that must be respected and maintained at each processing step.

(Straight-line code actually hits a sweet spot for both the JIT and also the CPU.)

DMA

Storing each packet in contiguous physical memory makes device drivers simple and efficient. Transmitting or receiving a packet requires exactly one DMA descriptor. The buffers all have the same ample size. Jumbo frames are not a special case.

The address encoding scheme makes translation between virtual and physical addresses simple and efficient. Convert a packet virtual address to a NIC DMA address with (AND 0x0FFFFFFFFFFF) to remove the tag. Recover the packet pointer from a DMA receive descriptor by restoring the tag with (OR 0x500000000000). Each is a single bitwise operation.
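
Expressed as C (struct packet being the one from the overview), the two conversions might look roughly like this. The constant and function names are mine, not Snabb's, and Snabb of course does this from Lua:

#include <stdint.h>

#define PACKET_TAG 0x500000000000ULL   /* tag 0x5 in bits 44-46 */
#define PHYS_MASK  0x0FFFFFFFFFFFULL   /* physical address in bits 0-43 */

/* Packet pointer -> DMA address for a transmit or receive descriptor. */
static inline uint64_t packet_physical(struct packet *p)
{
    return (uint64_t)(uintptr_t)p & PHYS_MASK;
}

/* DMA address from a receive descriptor -> packet pointer. */
static inline struct packet *packet_virtual(uint64_t phys)
{
    return (struct packet *)(uintptr_t)(phys | PACKET_TAG);
}

A driver can fill a transmit descriptor with packet_physical(p) and p->length and be done; nothing is chained and no translation table is consulted.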

The cost of this simplicity is more complex memory allocation. Contiguous physical memory requires allocation of “huge pages” (2MB or 1GB memory blocks) instead of ordinary 4KB pages that would be too small for jumbo frames. The address encoding scheme requires dancing around Linux kernel limitations on the remapping of huge pages. On balance this was well worth the effort.
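
For the curious, here is a very rough sketch of that allocation dance for a single 2MB page. resolve_physical() is a hypothetical helper that reads the page's physical address out of /proc/self/pagemap; the real code uses file-backed shared mappings (so that other processes can attach, as described in the next section), works around the huge-page remapping restrictions just mentioned, and checks for errors.

#define _GNU_SOURCE
#include <stdint.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE (2ULL * 1024 * 1024)
#define PACKET_TAG     0x500000000000ULL

/* Hypothetical: read the physical address from /proc/self/pagemap. */
uint64_t resolve_physical(void *virt);

/* Allocate one 2MB huge page and move it to its canonical tagged
 * virtual address: physical address OR 0x500000000000. */
void *allocate_dma_page(void)
{
    void *tmp = mmap(NULL, HUGE_PAGE_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    uint64_t phys = resolve_physical(tmp);
    return mremap(tmp, HUGE_PAGE_SIZE, HUGE_PAGE_SIZE,
                  MREMAP_MAYMOVE | MREMAP_FIXED,
                  (void *)(uintptr_t)(phys | PACKET_TAG));
}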

Multi-process parallelism

Snabb allows raw pointers to packet buffers to be freely shared between separate Unix processes. Any packet allocated by any process is automatically valid in every other cooperating process. It doesn’t matter which process originally allocated the packet or how the pointer was obtained.

Handy, eh?

This works because each packet always has the same canonical virtual address derived from its immutable physical memory address. The virtual address 0x500000123400 always refers to the packet buffer with physical address 0x123400. This packet will always have the same address in any Snabb process.

There is one tricky part: creating the address space mappings for each packet in each process. We take a lazy, just-in-time approach (sketched in code after this list):

  • Establish a SIGSEGV signal handler to trap access to unmapped addresses.
  • Identify accesses to unmapped addresses in packet memory, e.g. 0x500000123400.
  • Calculate the physical address of the 2MB/1GB huge page containing the packet, e.g. 0x000000000000.
  • Calculate the filename on ramdisk that represents file-backed shared memory for this huge page, e.g. /var/run/snabb/$pid/group/0x000000000000.dma. This file will only exist if the memory has indeed been allocated by another Snabb process in the same application group.
  • If the file is found then just-in-time map the huge page at the appropriate address and suppress the SIGSEGV. Or, if the file is not found, allow the SIGSEGV to cause a process crash.
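
A compressed sketch of that handler might look like the following, with a hypothetical dma_page_filename() helper standing in for the path construction; the real handler has more checks and bookkeeping:

#include <fcntl.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define PACKET_TAG     0x500000000000ULL
#define HUGE_PAGE_SIZE (2ULL * 1024 * 1024)

/* Hypothetical: build "/var/run/snabb/$pid/group/0x....dma" for a huge page. */
const char *dma_page_filename(uint64_t physical_page);

static void dma_segv_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    uint64_t addr = (uint64_t)(uintptr_t)si->si_addr;
    uint64_t page = (addr & ~PACKET_TAG) & ~(HUGE_PAGE_SIZE - 1);
    int fd = -1;
    if (((addr >> 44) & 0x7) == 0x5)                 /* tagged packet memory? */
        fd = open(dma_page_filename(page), O_RDWR);
    if (fd < 0) {
        signal(SIGSEGV, SIG_DFL);                    /* genuine fault: crash */
        return;
    }
    /* Map the shared huge page at its canonical tagged address; returning
     * from the handler retries the faulting instruction, which now succeeds. */
    mmap((void *)(uintptr_t)(page | PACKET_TAG), HUGE_PAGE_SIZE,
         PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
    close(fd);
}

static void install_dma_segv_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = dma_segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
}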

(Process crash? Those are recoverable and safeguards are in place to prevent actively referenced DMA memory from being returned to the kernel prematurely. But that is a different story!)

Reserved “headroom” for header adjustment

Often it is desirable to efficiently move the beginning of a packet in memory, for example to add or remove encapsulation such as an 802.1Q VLAN header by resizing the Ethernet header in place rather than moving the entire payload around. Our packet structure does not include any explicit offset field for this purpose but it does provide an implicit one.

The reason that bits 0-7 of a freshly allocated packet pointer are 0x80 is that packets are first allocated on a 256-byte aligned address and then shifted by 128 bytes to create “headroom.”

If you need more space at the beginning of your packet, for example to insert encapsulation headers, then you can make room by adjusting the low bits of the packet pointer down towards 0x00. If you need less space, to delete encapsulation, you can increment the pointer towards 0xFF.

If you need more space than this (the adjustment would take the pointer outside the 0x00 to 0xFF headroom offset range) then you can't use the headroom trick and must memmove() instead; otherwise you would stray into memory belonging to an adjacent packet.
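
In C terms the trick might look something like this. It is an illustrative sketch against the struct packet from the overview, not Snabb's actual API (which is Lua), and the function names are made up:

#include <stdint.h>

/* Make room for 'bytes' of new header in the headroom, e.g. to insert a
 * 4-byte 802.1Q tag. The payload stays where it is; only the pointer and
 * the length field move. Returns NULL if the headroom is exhausted and
 * the caller must memmove() instead. */
static struct packet *packet_prepend(struct packet *p, unsigned bytes)
{
    unsigned offset = (uintptr_t)p & 0xFF;      /* 0x80 when freshly allocated */
    if (bytes > offset)
        return NULL;                            /* would cross into the previous packet */
    uint16_t length = p->length;
    struct packet *q = (struct packet *)((uintptr_t)p - bytes);
    q->length = (uint16_t)(length + bytes);     /* caller now fills q->data[0..bytes) */
    return q;
}

/* Drop 'bytes' of header from the front, e.g. to strip encapsulation. */
static struct packet *packet_strip(struct packet *p, unsigned bytes)
{
    unsigned offset = (uintptr_t)p & 0xFF;
    if (offset + bytes > 0xFF)
        return NULL;                            /* out of headroom range: memmove() */
    uint16_t length = p->length;
    struct packet *q = (struct packet *)((uintptr_t)p + bytes);
    q->length = (uint16_t)(length - bytes);     /* payload now begins at q->data */
    return q;
}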

Entropy

One pitfall of “just-so” address space allocation schemes is creating “conflict misses” by depriving the CPU cache of entropy. The CPU hashes precious few of the bits in each address to choose which hardware cache slots that data may occupy. If packet addresses all share the same pattern in those bits then they contend for a small number of cache slots and throttle performance.
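
As a small illustration, consider a set-associative cache with 64-byte lines and, say, 4096 sets (numbers invented for the example). The set is chosen by the address bits just above the line offset, roughly the bits 6-17 mentioned earlier:

#include <stdint.h>

#define LINE_SHIFT 6       /* 64-byte cache lines */
#define NSETS      4096    /* illustrative set count */

/* The cache set is selected by address bits just above the line offset. */
static unsigned cache_set(uint64_t addr)
{
    return (addr >> LINE_SHIFT) & (NSETS - 1);
}

If many packets share identical values in those bits, cache_set() returns the same few indices for all of them, so they compete for a handful of sets instead of spreading across the whole cache.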

This is crucial for Snabb because we operate on batches of around 64 packets at a time and we push each batch through a series of tight processing loops. If the packets lack entropy in the key address bits then they continually evict each other from cache during processing.

For this reason Snabb is careful to preserve enough entropy in addresses of manually-allocated objects that are frequently referenced during processing.

(This issue is most critical for counter objects. Those are allocated as file-backed shared memory objects and are therefore each assigned an individual 4KB memory page by the kernel, with an address whose low 12 bits are all zero.)