Network packets in dataplane memory: the Snabb approach
How do you represent an Ethernet packet in a network dataplane application, such as a Unix kernel or software firewall?
I don’t know how to answer this definitively. The optimal design for a new application will need to match your experience, your choice of programming language, your target operating system kernel, CPU microarchitecture, IO interfaces, and much more besides.
Instead let me just tell you how we represent an Ethernet packet in Snabb, a novel dataplane with these key characteristics:
- Performance of ~250 cycles/packet, ~10Gbps/core, ~100Gbps/CPU.
- Built-in device drivers for 10G/25G/40G/100G Ethernet NICs.
- Completely written in Lua, including the device drivers and fast inner loops.
- Just-in-time compiled for Linux/x86 including some bespoke machine code assembled in Lua at runtime.
This design was thrashed out to fit these requirements between myself, Alex Gall, Andy Wingo, Diego Pino, Javier Guerra Giraldez, Katerina Barone-Adesi, Max Rottenkolber, Nikolay Nikolaev and others over a couple of years.
Overview
The layout of a packet in memory is very simple. The packet data is stored in a fixed-length array. The only metadata is the packet length.
struct packet {
  uint16_t length;
  unsigned char data[10240];
};
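Because Snabb is written in Lua, this struct is declared through LuaJIT's FFI and manipulated directly as C data. Here is a sketch of what that looks like (illustrative code, not a verbatim excerpt of the Snabb source):

local ffi = require("ffi")

-- Declare the packet layout so Lua code can read and write it as C data.
ffi.cdef[[
struct packet {
  uint16_t length;
  unsigned char data[10240];
};
]]

-- Fill a packet from a Lua string. (Real Snabb packets are allocated
-- from a freelist in DMA-able memory, not from the Lua heap as here.)
local function fill (p, payload)
  p.length = #payload
  ffi.copy(p.data, payload, #payload)
end

local p = ffi.new("struct packet")
fill(p, "example payload")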
That part is simple. But packets also have to follow some tricky rules:
- Packets have a single owner. Internal sends transfer ownership. (If you need a copy then make one.)
- Packets are allocated in contiguous physical memory.
- Packet pointer address bits have special properties:
  - Bits 44-46 contain the tag value 0x5.
  - Bits 0-43 contain the physical memory address that corresponds to this virtual address.
  - Bits 6-17 contain entropy.
  - Bits 0-7 contain 0x80 in freshly-allocated packets.
Crazy sounding list, right? But these peculiar rules combine to support the way that Snabb does JIT compilation, hardware DMA, multi-process parallelism, and CPU micro-architecture optimization.
JIT compilation
Snabb uses tracing just-in-time compilation for fast-path traffic processing. The most important requirement for effective trace compilation is simple and consistent control flow. Straight-line code massively outperforms complicated and unpredictable control flow with this compiler.
Storing the whole packet in a linear array of bytes hits a sweet-spot for the JIT. Compilation would be much less effective with more complicated data structures like Linux sk_buff, BSD mbuf, or DPDK rte_mbuf that chain variable-sized buffers together and include extensive metadata that must be respected and maintained at each processing step.
(Straight-line code actually hits a sweet spot for both the JIT and the CPU.)
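To make that concrete, here is the flavour of code this layout encourages: header fields are plain loads at fixed offsets from the flat data array, which the tracing JIT turns into straight-line machine code. (A sketch reusing the struct packet declaration above; not actual Snabb code.)

local bit = require("bit")

-- Read the Ethertype: bytes 12-13 of the Ethernet header, big-endian.
local function ethertype (p)
  return bit.bor(bit.lshift(p.data[12], 8), p.data[13])
end

local function is_ipv4 (p)
  return ethertype(p) == 0x0800
end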
DMA
Storing each packet in contiguous physical memory makes device drivers simple and efficient. Transmitting or receiving a packet requires exactly one DMA descriptor. The buffers all have the same ample size. Jumbo frames are not a special case.
The address encoding scheme makes translation between virtual and physical addresses simple and efficient. Convert a packet virtual address to a NIC DMA address with (AND 0x0FFFFFFFFFFF) to remove the tag. Recover the packet pointer from a DMA receive descriptor by restoring the tag with (OR 0x500000000000). Just single bitwise operations.
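In Lua those two operations look roughly like this (a sketch with illustrative names; assumes the struct packet declaration above and LuaJIT 2.1, whose bit operations accept 64-bit cdata integers):

local ffi = require("ffi")
local bit = require("bit")

local TAG  = 0x500000000000ULL   -- tag value 0x5 in address bits 44-46
local MASK = 0x0FFFFFFFFFFFULL   -- physical address in bits 0-43

-- Packet pointer -> physical address for a transmit descriptor.
local function virtual_to_physical (p)
  return bit.band(ffi.cast("uint64_t", p), MASK)
end

-- Physical address from a receive descriptor -> packet pointer.
local function physical_to_virtual (phys)
  return ffi.cast("struct packet *", bit.bor(phys, TAG))
end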
The cost of this simplicity is more complex memory allocation. Contiguous physical memory requires allocation of “huge pages” (2MB or 1GB memory blocks) instead of ordinary 4KB pages that would be too small for jumbo frames. The address encoding scheme requires dancing around Linux kernel limitations on the remapping of huge pages. On balance this was well worth the effort.
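The other piece of the dance is discovering the physical address of a huge page after it has been allocated. One standard way to do this on Linux is to read /proc/self/pagemap, which records the physical frame backing each virtual page of the process. A sketch of that lookup (requires root; not necessarily the exact mechanism Snabb uses):

local ffi = require("ffi")
local bit = require("bit")

-- Resolve the physical address backing a virtual address by reading
-- /proc/self/pagemap: one 64-bit entry per 4KB virtual page, with the
-- page frame number in bits 0-54 when the page is present.
local function resolve_physical (virt_addr)
  local page_size = 4096
  local virt = ffi.cast("uint64_t", virt_addr)
  local index = tonumber(virt / page_size)
  local f = assert(io.open("/proc/self/pagemap", "r"))
  assert(f:seek("set", index * 8))
  local entry_bytes = assert(f:read(8))
  f:close()
  local entry = ffi.new("uint64_t[1]")
  ffi.copy(entry, entry_bytes, 8)
  local pfn = bit.band(entry[0], 0x7FFFFFFFFFFFFFULL)
  return pfn * page_size + virt % page_size
end

Once the physical address is known, the huge page can be remapped at its tagged canonical virtual address so that the single AND/OR operations above always hold.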
Multi-process parallelism
Snabb allows raw pointers to packet buffers to be freely shared between separate Unix processes. Any packet allocated by any process is automatically valid in every other cooperating process. It doesn’t matter which process originally allocated the packet or how the pointer was obtained.
Handy, eh?
This works because each packet always has the same canonical virtual address derived from its immutable physical memory address. The virtual address 0x500000123400 always refers to the packet buffer with physical address 0x123400. This packet will always have the same address in any Snabb process.
There is one tricky part: creating the address space mappings for each packet in each process. We take a lazy, just-in-time approach:
- Establish a SIGSEGV signal handler to trap access to unmapped addresses.
- Identify accesses to unmapped addresses in packet memory e.g. 0x500000123400.
- Calculate the physical address of the 2MB/1GB huge page containing the packet e.g. 0x00000000100000.
- Calculate the filename on ramdisk that represents file-backed shared memory for this huge page e.g. /var/run/snabb/$pid/group/0x00000000100000.dma. This file will only exist if the memory has indeed been allocated by another Snabb process in the same application group.
- If the file is found then just-in-time map the huge page at the appropriate address and suppress the SIGSEGV. Or, if the file is not found, allow the SIGSEGV to cause a process crash.
(Process crash? Those are recoverable and safeguards are in place to prevent actively referenced DMA memory from being returned to the kernel prematurely. But that is a different story!)
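The address arithmetic behind the middle steps is, once again, just masking and rounding. A sketch (illustrative names; assumes 2MB huge pages):

local ffi = require("ffi")
local bit = require("bit")

local HUGE_PAGE_SIZE = 2 * 1024 * 1024   -- assuming 2MB huge pages
local PHYS_MASK = 0x0FFFFFFFFFFFULL      -- physical address bits 0-43

-- Physical base address of the huge page containing a faulting packet
-- address. The backing file on ramdisk is named after this value, e.g.
-- /var/run/snabb/$pid/group/0x00000000100000.dma, and is mmap()ed at
-- the tagged canonical address if it exists.
local function huge_page_base (fault_addr)
  local phys = bit.band(ffi.cast("uint64_t", fault_addr), PHYS_MASK)
  return phys - phys % HUGE_PAGE_SIZE
end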
Reserved “headroom” for header adjustment
Often it is desirable to efficiently move the beginning of a packet in memory. For example, to add or remove encapsulation such as an 802.1Q VLAN header by locally resizing the ethernet header and not moving the entire payload around. Our packet structure does not include any explicit offset field for this purpose but it does provide an implicit one.
The reason that bits 0-7 of a freshly allocated packet pointer are 0x80 is that packets are first allocated on a 256-byte aligned address and then shifted by 128 bytes to create “headroom.”
If you need more space at the beginning of your packet, for example to insert encapsulation headers, then you can make room by adjusting the low bits of the packet pointer down towards 0x00. If you need less space, to delete encapsulation, you can increment the pointer towards 0xFF.
If you need more space than this – your adjustment would move outside the 0x00 to 0xFF headroom offset range – then you can’t use the headroom trick and must memmove() instead. Otherwise you would move into the memory belonging to an adjacent packet.
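Here is a sketch of that pointer adjustment (illustrative names rather than Snabb's actual packet API; it reuses the struct packet declaration from above):

local ffi = require("ffi")
local bit = require("bit")

-- Move the start of the packet by `delta` bytes within its 256-byte
-- aligned allocation: negative to make room for new headers, positive
-- to strip headers. Returns nil if the adjustment would leave the
-- 0x00-0xFF headroom range, in which case the caller must memmove().
local function shift_start (p, delta)
  local offset = tonumber(bit.band(ffi.cast("uint64_t", p), 0xFFULL)) + delta
  if offset < 0 or offset > 0xFF then
    return nil
  end
  local new_p = ffi.cast("struct packet *", ffi.cast("char *", p) + delta)
  new_p.length = p.length - delta
  return new_p
end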
Entropy
One pitfall of “just-so” address space allocation schemes is creating “conflict misses” by depriving the CPU cache of entropy. CPUs hash precious few of the bits in each address to choose which hardware cache lines the data may occupy. If packet addresses have the same patterns in those bits then they will create contention for a small number of cache slots and throttle performance.
This is crucial for Snabb because we operate on batches of around 64 packets at a time and we push each batch through a series of tight processing loops. If the packets lack entropy in the key address bits then they continually evict each other from cache during processing.
For this reason Snabb is careful to preserve enough entropy in addresses of manually-allocated objects that are frequently referenced during processing.
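To make the arithmetic concrete: a typical L1 data cache has 64-byte lines and 64 sets, so the set is chosen purely from address bits 6-11. A sketch of that index calculation (cache geometry varies by CPU; this only illustrates why entropy in the low address bits matters):

local ffi = require("ffi")
local bit = require("bit")

-- Which set of a 64-set, 64-byte-line L1 cache an address maps to:
-- bits 6-11 of the address.
local function l1_set (addr)
  return tonumber(bit.band(bit.rshift(ffi.cast("uint64_t", addr), 6), 63))
end

-- Addresses with identical low bits collide: for example, objects that
-- each sit at the start of their own 4KB page all map to set 0.
assert(l1_set(0x1000) == 0 and l1_set(0x2000) == 0 and l1_set(0x3000) == 0)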
(This issue is most critical for counter objects. Those are allocated as file-backed shared memory objects and therefore each assigned an individual 4KB memory page by the kernel with an address ending in bits 0000.00000000.)