
Linux NAPI-compliant network device driver

by A. Calderone 
[Advanced Level]

This article describes NAPI, the architecture of Linux network device drivers designed to support new-generation network adapters.

New API (NAPI) is a packet processing mechanism designed to better support drivers for "fast" network adapters, such as Gigabit Ethernet adapters. It was introduced in version 2.6 of the kernel and subsequently backported to 2.4.x.
The legacy network device driver architecture is not suited to devices capable of generating thousands of interrupts per second, which can in turn starve the rest of the system.
Some devices offer advanced interrupt coalescing features, that is, the ability to signal several packets with a single interrupt as a form of interrupt mitigation.
Without NAPI, and therefore without substantial support from the kernel, these features would have to be implemented entirely in the driver, combined with preemption mechanisms (based, for example, on timer interrupts) and even with polling methods run outside the interrupt service routine that handles packet reception (i.e. kernel threads, tasklets, etc.).
As we shall see, the new model adds features to the network driver subsystem that effectively support interrupt mitigation and packet throttling and, more importantly, allow the kernel to distribute the load among devices through a round-robin policy.
The NAPI additions to the kernel do not break backward compatibility.

Non-NAPI frame reception

We will discuss the packet processing mechanisms for frame reception inside the kernel, without claiming to be exhaustive.
Knowing these mechanisms is necessary to understand the differences between the two models: NAPI and legacy.
In the legacy model (shown schematically in Figure 1), frames are delivered to the protocol stack through the function netif_rx, which is normally invoked in interrupt handler context (during the reception of frames).
A variant of this function, netif_rx_ni, can be used outside interrupt context.

Fig. 1 - Non-NAPI frame reception

The function netif_rx puts packets received from the device (wrapped in socket buffers) in the system receive queue (input_pkt_queue), provided the queue length is not larger than netdev_max_backlog. This and other related parameters are exported in the /proc file system (under /proc/sys/net/core) and can be used for tuning purposes.
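The admission test just described can be sketched in plain userspace C. This is only an illustrative model, not kernel code: the names backlog_queue, backlog_enqueue and max_backlog are hypothetical stand-ins for input_pkt_queue, netif_rx and netdev_max_backlog, and only the queue length is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-CPU input_pkt_queue: only its length matters here. */
struct backlog_queue {
    size_t qlen;     /* frames currently queued */
    size_t dropped;  /* frames discarded because of congestion */
};

enum { RX_SUCCESS = 0, RX_DROP = 1 };

/*
 * Models the admission test described in the text: a frame is queued
 * only while the queue length is not larger than max_backlog (the role
 * played by netdev_max_backlog); otherwise it is dropped.
 */
int backlog_enqueue(struct backlog_queue *q, size_t max_backlog)
{
    assert(q != NULL);

    if (q->qlen <= max_backlog) {
        q->qlen++;
        return RX_SUCCESS;
    }
    q->dropped++;
    return RX_DROP;
}
```

With max_backlog set to 2, five consecutive calls on an empty queue accept three frames and drop two, because the test uses "not larger than" rather than "smaller than".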

Listing 1 - Definition of softnet_data structure
/*
 * Incoming packets are placed on per-cpu queues so that
 * no locking is needed.
 */
struct softnet_data
{
    struct net_device   *output_queue;
    struct sk_buff_head input_pkt_queue;
    struct list_head    poll_list;
    struct sk_buff      *completion_queue;
    struct net_device   backlog_dev;
#ifdef CONFIG_NET_DMA
    struct dma_chan     *net_dma;
#endif
};
The input_pkt_queue is a field of the softnet_data structure (Listing 1), defined in the file netdevice.h (http://lxr.linux.no/source/include/linux/netdevice.h#L616).
When the received frame is not discarded because input_pkt_queue is congested, it is processed by the softirq NET_RX_SOFTIRQ, scheduled through the function netif_rx_schedule, which netif_rx invokes internally.
The softirq NET_RX_SOFTIRQ is implemented in the routine net_rx_action.
For the moment, it is enough to say that this function takes frames from input_pkt_queue and delivers them to the protocol stack so that they can be processed.

NAPI frame reception
In the new model (shown schematically in Figure 2), when an interrupt signals newly received packets, the driver informs the networking subsystem that new frames are available (rather than processing them immediately), so that the subsystem can consume them through an appropriate polling method, outside the execution context of the ISR.

Fig. 2 - NAPI frame reception

Devices that support NAPI must therefore satisfy some prerequisites.
The driver can no longer use the packet input queue, and the device itself must be able to accumulate incoming frames, buffering those received while its interrupts are disabled.
This logic reduces the occurrence of interrupts and delegates to the device the task of discarding frames during bursts, avoiding saturation of the Linux queues.
From the implementation point of view, the parts that differ substantially from the old model are the interrupt routine and the new poll method (of the net_device structure), defined as follows:

int (*poll)(struct net_device *dev, int *budget);

In addition, two new attributes of type int, quota and weight, are part of the net_device structure. They are used to implement preemption in the polling cycle (as we will see later).
In the NAPI model, the interrupt routine delegates the delivery of frames to the protocol stack to the poll method.
In other words, its work is reduced to disabling receive interrupts on the device (which will continue to accumulate incoming frames), acknowledging the interrupt in the device-specific way, and then scheduling (using the routine netif_rx_schedule) the softirq NET_RX_SOFTIRQ, which is associated with the function net_rx_action.
Devices waiting to be polled are placed on a list (poll_list) by the netif_rx_schedule routine, which takes as its argument a pointer to the net_device instance.
The poll_list is scanned during the execution of the softirq NET_RX_SOFTIRQ by the routine net_rx_action, which invokes the poll() method of each driver; this method finally wraps the packets in socket buffers and hands them to the protocol stack.
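To see why this hand-off mitigates interrupts, consider the following userspace sketch. It is a toy model under stated assumptions: sim_nic, sim_isr and sim_burst are hypothetical names, the adapter buffer is unbounded, and no poll runs while the burst arrives.

```c
#include <assert.h>

/* Hypothetical userspace model of a NAPI-capable adapter (not a kernel type). */
struct sim_nic {
    int rx_irq_enabled;  /* 1 while the adapter may raise receive interrupts */
    int scheduled;       /* 1 while the device waits in the poll list */
    int buffered;        /* frames accumulated in the adapter's own buffer */
};

/* Plays the role of the interrupt routine: mask the IRQ, ask to be polled. */
void sim_isr(struct sim_nic *nic)
{
    nic->rx_irq_enabled = 0;
    nic->scheduled = 1;      /* stands in for netif_rx_schedule() */
}

/*
 * Feeds a burst of frames to the adapter before any poll runs and
 * returns how many interrupts were actually serviced.
 */
int sim_burst(struct sim_nic *nic, int frames)
{
    int interrupts = 0;
    int i;

    assert(frames >= 0);

    for (i = 0; i < frames; i++) {
        nic->buffered++;             /* the adapter keeps accumulating */
        if (nic->rx_irq_enabled) {
            interrupts++;
            sim_isr(nic);
        }
    }
    return interrupts;
}
```

In this model a burst of 1000 frames on an idle adapter costs a single interrupt: the first frame triggers the routine, and the remaining 999 are simply buffered until the poll method drains them.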
Going into more detail, net_rx_action performs the following operations:
1) Retrieves the reference to poll_list for the current processor.
2) Saves the jiffies value in the start_time variable.
3) Sets the budget (passed by reference) to the initial value of netdev_budget (configurable using /proc/sys/net/core/netdev_budget).
4) For each device in the poll_list, as long as the budget is not exhausted and no more than one jiffy has elapsed since start_time:
4.a) if the quota is positive, invokes the poll method of the device (i.e. of its net_device instance); otherwise adds the weight to the quota and puts the device back in poll_list;
4.a.1) if the poll() method returns a non-zero value, restores the quota from the weight attribute and puts the device back in poll_list;
4.a.2) if the poll() method returns zero, assumes that the device has been removed from the poll list (it is no longer in polling state).
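The steps above can be sketched as a small userspace simulation. It is not the kernel implementation: sim_dev, sim_poll and sim_rx_action are hypothetical names, the poll list is approximated by an array scanned circularly, and the one-jiffy time limit of step 4 is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of a device waiting in poll_list. */
struct sim_dev {
    int weight;     /* refill value for quota, must be > 0 */
    int quota;      /* frames the device may still deliver */
    int pending;    /* frames buffered on the simulated adapter */
    int delivered;  /* frames handed to the protocol stack */
    int in_list;    /* 1 while the device is in polling state */
};

/* Simulated poll(): deliver up to min(quota, *budget) pending frames. */
int sim_poll(struct sim_dev *d, int *budget)
{
    int n = d->quota < *budget ? d->quota : *budget;

    if (n > d->pending)
        n = d->pending;
    d->pending -= n;
    d->delivered += n;
    d->quota -= n;
    *budget -= n;

    if (d->pending == 0) {
        d->in_list = 0;   /* role of netif_rx_complete() */
        return 0;
    }
    return 1;             /* more work left: stay in the list */
}

/* One softirq run (steps 3 and 4). Returns the unused budget. */
int sim_rx_action(struct sim_dev *devs, size_t ndevs, int budget)
{
    size_t i = 0, k, active = 0;

    for (k = 0; k < ndevs; k++) {
        assert(devs[k].weight > 0);
        active += (size_t)devs[k].in_list;
    }

    while (active > 0 && budget > 0) {
        struct sim_dev *d = &devs[i];

        i = (i + 1) % ndevs;
        if (!d->in_list)
            continue;

        if (d->quota <= 0 || sim_poll(d, &budget)) {
            /* Steps 4.a / 4.a.1: refill the quota, go to the tail. */
            if (d->quota < 0)
                d->quota += d->weight;
            else
                d->quota = d->weight;
        } else {
            active--;     /* step 4.a.2: device left the list */
        }
    }
    return budget;
}
```

For example, with a device of weight 16 holding 40 frames and a device of weight 64 holding 100 frames, a budget of 300 drains both devices in alternating turns and leaves 160 units of budget unused; with a budget of 50 the run stops early and both devices remain in the list for the next softirq.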
A reference to the budget and the pointer to the net_device structure are passed to the poll() method.
This function should decrease the budget by the number of frames processed. The frames taken from the device buffer are wrapped in socket buffers and delivered to the protocol stack through the function netif_receive_skb.
The preemption policy based on the budget variable is combined with one based on the quota mechanism: the poll method is responsible for checking how many frames it may deliver to the kernel, according to the maximum quota allocated to the device. When the quota runs out, no further packets should be forwarded to the kernel, so that another device waiting in poll_list can be polled. For this reason the poll() function has to reduce the quota by the number of packets processed by the driver, just as it does for the budget.
If the driver exhausts its quota before it has delivered all queued frames, the poll method must stop and return a non-zero value.
Once all received packets have been delivered to the protocol stack, the driver must re-enable interrupts on the device and stop polling by invoking the system function netif_rx_complete (which removes the device from poll_list); it must then return zero to the calling function (which we know to be net_rx_action).
Another important attribute of the net_device structure, as we have seen, is weight, which is used to restore the quota between invocations of poll().
It goes without saying that the weight field should always be initialized with a strictly positive value.
Typically this value is between 16 and 32 for Fast Ethernet, while Gigabit adapters use higher values (typically 64).
Looking at the implementation of net_rx_action, we can observe that if a driver consumes more than its quota (which is semantically permissible), it must wait longer in subsequent iterations before being polled again; the higher the driver's weight, the sooner that happens.
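The effect of overspending can be worked out with a few lines of C. The sketch assumes the refill rule used by net_rx_action in the kernels discussed here (a negative quota is increased by weight, a non-negative one is reset to weight); skipped_passes is a hypothetical helper name, not a kernel function.

```c
#include <assert.h>

/*
 * Given the quota left after a poll() that overspent (a value <= 0)
 * and the device weight, returns how many passes of the polling loop
 * skip the device before its quota is positive again.
 */
int skipped_passes(int quota_after_poll, int weight)
{
    int quota = quota_after_poll;
    int skipped = 0;

    assert(weight > 0);

    /* Refill applied right after the poll that overspent. */
    quota = (quota < 0) ? quota + weight : weight;

    /* Each later pass that finds quota <= 0 refills it without polling. */
    while (quota <= 0) {
        skipped++;
        quota = (quota < 0) ? quota + weight : weight;
    }
    return skipped;
}
```

A driver that ends a poll with a quota of -100 is skipped six times with a weight of 16, once with a weight of 64, and not at all with a weight of 128.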
Listing 2 shows pseudo-code for the receive interrupt routine and the poll method of an imaginary sample device driver.

Listing 2 - Pseudo-code for the interrupt routine and poll() implementation of the sample driver
static irqreturn_t sample_netdev_intr(int irq, void *dev)
{
    struct net_device *netdev = dev;
    struct nic *nic = netdev_priv(netdev);

    /* No interrupt pending on this device */
    if (!nic->irq_pending())
        return IRQ_NONE;

    /* Ack interrupt(s) */
    nic->ack_irq();

    /* Mask further interrupts; the device keeps accumulating frames */
    nic->disable_irq();

    /* Put the device in poll_list and schedule NET_RX_SOFTIRQ */
    netif_rx_schedule(netdev);

    return IRQ_HANDLED;
}

static int sample_netdev_poll(struct net_device *netdev, int *budget)
{
    struct nic *nic = netdev_priv(netdev);
    unsigned int work_to_do = min(netdev->quota, *budget);
    unsigned int work_done = 0;

    /* Deliver at most work_to_do frames to the protocol stack */
    nic->announce(&work_done, work_to_do);

    /* If no Rx announce was done, exit polling state. */
    if (work_done == 0 || !netif_running(netdev)) {
        netif_rx_complete(netdev);
        nic->enable_irq();
        return 0;
    }

    /* Account for the frames delivered in this round */
    *budget -= work_done;
    netdev->quota -= work_done;

    return 1;
}
Figures 3 and 4 show the sequence diagrams of the calls made during the main phase of packet reception, for the non-NAPI and the NAPI-compliant model respectively.

Fig. 3 - Sequence diagram related to the non-NAPI model

Fig. 4 - Sequence diagram related to the NAPI model

Integration of non-NAPI drivers with the new architecture

Having explained the packet processing mechanism of the NAPI model, we complete our description by looking at how it has been integrated with the non-NAPI model.
The softnet_data structure discussed earlier has an attribute called backlog_dev.
This attribute is an instance of the net_device structure. When the networking subsystem is initialized (specifically in the routine net_dev_init), the poll() pointer of backlog_dev is set to the address of the internal function process_backlog. This function is called in the softirq NET_RX_SOFTIRQ like the poll() method of any NAPI-compliant device in the poll_list, as described above.
The parameter passed to this function is simply a pointer to backlog_dev, which belongs to the softnet_data instance of the CPU that executes the softirq.
The process_backlog() function extracts from input_pkt_queue the frames inserted by netif_rx and delivers them to the protocol stack by calling netif_receive_skb (as the poll() method of a generic NAPI-compliant driver does). In this case the maximum number of frames that can be delivered also depends on the quota, initialized from the global parameter weight_p (settable in the /proc file system), and on the duration of the function itself, which is limited to no more than one tick of the jiffies counter.
When the quota or the socket buffers to be processed are exhausted, process_backlog returns control to the calling function which, as should be clear by now, is net_rx_action.
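The behaviour of process_backlog can be illustrated with a userspace sketch. All names here (sim_queue, sim_queue_push, sim_process_backlog) are hypothetical, frames are plain integers, delivery is reduced to counting, and the one-tick time limit is not modelled.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_QUEUE_CAP 1024

/* Hypothetical FIFO standing in for input_pkt_queue. */
struct sim_queue {
    int frames[SIM_QUEUE_CAP];
    size_t head;
    size_t len;
};

/* Role of netif_rx: append a frame, or fail when the queue is full. */
int sim_queue_push(struct sim_queue *q, int frame)
{
    if (q->len == SIM_QUEUE_CAP)
        return -1;
    q->frames[(q->head + q->len) % SIM_QUEUE_CAP] = frame;
    q->len++;
    return 0;
}

/*
 * Role of process_backlog: dequeue and "deliver" frames until the queue
 * is empty or min(*quota, *budget) frames have been handled.
 * Returns 0 when the queue was found empty (polling can stop),
 * non-zero when the limit was reached first.
 */
int sim_process_backlog(struct sim_queue *q, int *quota, int *budget)
{
    int limit = *quota < *budget ? *quota : *budget;
    int work = 0;
    int more = 1;

    assert(limit > 0);

    for (;;) {
        if (q->len == 0) {
            more = 0;
            break;
        }
        q->head = (q->head + 1) % SIM_QUEUE_CAP;  /* dequeue one frame */
        q->len--;
        work++;                                   /* netif_receive_skb() */
        if (work >= limit)
            break;
    }

    *quota -= work;
    *budget -= work;
    return more;
}
```

With ten queued frames and a quota of 6, the first call delivers six frames and returns non-zero; after the quota is refilled, a second call delivers the remaining four and returns zero, which is the signal to leave the poll list.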
Having mentioned the /proc file system several times, it seems appropriate to point out that the settings for the device driver (in particular the parameters of NAPI drivers) are exported in the sysfs directory /sys/class/net.
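Since these tunables are exposed as small text files, they can be read with ordinary file I/O. The helper below is a minimal sketch: read_int_tunable is a hypothetical name, and it assumes the pseudo file holds a single decimal integer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Reads a single integer from a /proc or /sys pseudo file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
int read_int_tunable(const char *path, long *value)
{
    FILE *f;
    int ok;

    assert(path != NULL && value != NULL);

    f = fopen(path, "r");
    if (f == NULL)
        return -1;

    ok = (fscanf(f, "%ld", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

On a Linux system, calling read_int_tunable("/proc/sys/net/core/netdev_budget", &v) or passing "/proc/sys/net/core/netdev_max_backlog" would retrieve the two parameters mentioned earlier; on systems where those files are absent, the function simply returns -1.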