Perf Device Driver - xy-torrent’s diary

Summary of the changes and new features merged in the Linux Kernel during the 2.6.32 development cycle.

These are some examples of using the perf Linux profiler, which has also been called Performance Counters for Linux (PCL), Linux perf events (LPE), or perf_events.

Performance Evaluation of VMXNET3 Virtual Network Device (VMware, Inc.): The VMXNET3 driver is NAPI compliant on Linux guests. NAPI is an interrupt mitigation mechanism.

Summary of the changes and new features merged in the Linux kernel during the 3.8 development cycle.

Performance in Network Drivers

Supporting Receive Side Throttle

Minimizing send and receive path length

Although the send and receive paths differ from driver to driver, there are some general rules for performance optimizations:

Optimize for the common paths. The Kernprof.exe tool, provided with the MSDN and IDW builds of Windows, extracts the needed profiling information. Look at the routines that consume the most CPU cycles and try to reduce either how often they are called or the time spent in them.

Reduce time spent in DPC so that the network adapter driver does not use excessive system resources, which would cause overall system performance to suffer.

Make sure that debugging code is not compiled into the final released version of the driver; this avoids executing excess code.

Partitioning data and code to minimize sharing across processors

Partitioning is needed to minimize shared data and code across processors. Partitioning helps reduce system bus utilization and improves the effectiveness of processor cache. To minimize sharing, driver writers should consider the following:

Implement the driver as a deserialized miniport as described in Deserialized NDIS Miniport Drivers.

Use per-processor data structures to reduce global and shared data access. This allows you to keep statistic counters without synchronization, which reduces the code path length and increases performance. For vital statistics, have per-processor counters that are added together at query time. If you must have a global counter, use interlocked operations instead of spin locks to manipulate the counter. See Using Locking Mechanisms Properly below for information about how to avoid using spin locks.

To facilitate this, KeGetCurrentProcessorNumberEx can be used to determine the current processor. To determine the number of processors when allocating per-processor data structures, KeQueryGroupAffinity can be used.

The total number of bits set in the affinity mask indicates the number of active processors in the system. Drivers should not assume that all the set bits in the mask are contiguous, because processors might not be consecutively numbered in future releases of the operating system. The number of processors in an SMP machine is a zero-based value.
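As a minimal sketch of the point about non-contiguous masks, the following portable C counts the set bits of a 64-bit mask instead of assuming that processors 0..N-1 are present. The helper names are hypothetical, and the mask is passed in as a plain integer; in a driver it would come from the affinity information returned by the kernel.

```c
#include <stdint.h>

/* Hypothetical helper: number of active processors described by a
 * 64-bit affinity mask. Works whether or not the set bits are contiguous. */
unsigned count_active_processors(uint64_t mask)
{
    unsigned count = 0;
    while (mask != 0) {
        mask &= mask - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Hypothetical helper: highest processor index present in the mask,
 * or -1 for an empty mask. Per-processor arrays indexed by processor
 * number need (highest index + 1) slots, which can exceed the count. */
int highest_processor_index(uint64_t mask)
{
    int index = -1;
    while (mask != 0) {
        mask >>= 1;
        index++;
    }
    return index;
}
```

For a mask such as 0x2D (processors 0, 2, 3 and 5), the count is 4 but an array indexed by processor number needs 6 slots.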

If your driver maintains per-processor data, you can use the KeQueryGroupAffinity function to reduce cache-line contention.
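The per-processor counter idea can be sketched in portable C as follows. This is an illustration under stated assumptions, not driver code: MAX_PROCESSORS and CACHE_LINE_SIZE are assumed constants, all type and function names are hypothetical, and the processor number is passed in by the caller, whereas a driver would obtain it from KeGetCurrentProcessorNumberEx while running at a level where it cannot migrate to another processor.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PROCESSORS  64   /* assumption: upper bound on processors */
#define CACHE_LINE_SIZE 64   /* assumption: typical cache-line size */

/* One counter per processor, padded so that each slot occupies its own
 * cache line and updates on different processors do not contend. */
struct per_cpu_counter {
    uint64_t value;
    char pad[CACHE_LINE_SIZE - sizeof(uint64_t)];
};

struct packet_stats {
    struct per_cpu_counter rx_packets[MAX_PROCESSORS];
};

/* Update path: touches only the slot of the current processor,
 * so no lock or interlocked operation is used here. */
void stats_count_rx(struct packet_stats *stats, unsigned cpu)
{
    stats->rx_packets[cpu % MAX_PROCESSORS].value++;
}

/* Query path: the per-processor values are added together on demand. */
uint64_t stats_query_rx(const struct packet_stats *stats)
{
    uint64_t total = 0;
    for (size_t i = 0; i < MAX_PROCESSORS; i++)
        total += stats->rx_packets[i].value;
    return total;
}
```

The cost moves from the hot update path to the rare query path. A query that runs while other processors are updating yields only an approximate snapshot, which is usually acceptable for statistics.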

Avoiding false sharing

False sharing occurs when processors request shared variables that are independent from each other. However, because the variables are on the same cache line, they are shared among the processors. In such situations, the cache line will travel back and forth between processors for every access to any of the variables in it, causing an increase in cache flushes and reloads. This increases the system bus utilization and reduces overall system performance.

To avoid false sharing, align important data structures such as spin locks, buffer queue headers, and singly linked lists to cache-line boundaries by using NdisGetSharedDataAlignment.
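A rough illustration of cache-line alignment in standard C11 is shown below. It assumes a fixed 64-byte line (CACHE_LINE_SIZE), whereas a driver would use the value reported at run time by NdisGetSharedDataAlignment; the structure and field names are hypothetical.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64   /* assumption: a driver would query this at run time */

/* State used by the send path: the lock and the queue it protects. */
struct send_state {
    long  lock;
    void *queue_head;
};

/* State used by the receive path. */
struct receive_state {
    long  lock;
    void *queue_head;
};

/* Each group starts on its own cache line, so a processor working on the
 * send path does not invalidate the line used by the receive path. */
struct adapter_state {
    alignas(CACHE_LINE_SIZE) struct send_state    send;
    alignas(CACHE_LINE_SIZE) struct receive_state receive;
};

/* Returns nonzero if two byte offsets fall on the same cache line. */
int on_same_cache_line(size_t offset_a, size_t offset_b)
{
    return offset_a / CACHE_LINE_SIZE == offset_b / CACHE_LINE_SIZE;
}
```

A lock and the data it guards are deliberately kept together, since they are accessed together; only independent groups are separated.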

Using locking mechanisms properly

Spin locks can reduce performance if not used properly. Drivers should minimize their use of spin locks by using interlocked operations wherever possible. In some cases, however, a spin lock is still the best choice. For example, if a driver already acquires a spin lock while handling the reference count for the number of packets that have not been indicated back to the driver, it is not necessary to use an interlocked operation as well. For more information, see Synchronization and Notification in Network Drivers.
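The trade-off can be sketched with C11 atomics, used here as a portable stand-in for the Windows interlocked operations (the stand-in is an assumption, and the type and function names are hypothetical). A single counter is updated without taking any lock, and each call returns the new value, in the same way as InterlockedIncrement and InterlockedDecrement.

```c
#include <stdatomic.h>

/* Hypothetical count of packets handed up the stack and not yet returned. */
struct packet_refs {
    atomic_long outstanding;
};

void refs_init(struct packet_refs *refs)
{
    atomic_init(&refs->outstanding, 0);
}

/* Lock-free increment; returns the new value. */
long refs_acquire(struct packet_refs *refs)
{
    return atomic_fetch_add(&refs->outstanding, 1) + 1;
}

/* Lock-free decrement; returns the new value. A result of 0 means
 * every outstanding packet has been returned. */
long refs_release(struct packet_refs *refs)
{
    return atomic_fetch_sub(&refs->outstanding, 1) - 1;
}
```

This only pays off when the counter is the sole piece of shared state being touched. If the surrounding code already holds a spin lock, a plain increment under that lock is sufficient.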

Here are some tips for using locking mechanisms effectively:

Use NDIS singly-linked list functions such as the following for managing resource pools.

1. Prominent features in Linux 3.8

1.1. Ext4 embeds very small files in the inode

Every file in Ext4 has a corresponding inode which stores various information about the file (size, creation date, owner, etc.); users can see that information with the stat(1) command. But the inode doesn't store the actual data, it just holds information about where the data is placed.

The size used by each inode is predetermined at mkfs.ext4(8) time, and defaults to 256 bytes. But the space isn't always used entirely (despite small extended attributes making use of it), and there are millions of inodes in a typical file system, so some space is wasted. At the same time, at least one data block (typically 4KB) is always allocated for file data, even if the file only uses a few bytes. And there is an extra seek involved in reading these few bytes, because the data blocks aren't allocated contiguously to the inodes.

Ext4 has added support for storing very small files in the unused inode space. With this feature the unused inode space gets some use, a data block isn't allocated for the file, and reading these small files is faster, because once the inode has been read, the data is already available without extra disk seeks. Some simple tests show that with a linux-3.0 vanilla source tree, the new system can save more than 1% of disk space. For a sample /usr directory, it saved more than 3% of space. Performance for small files is also improved. The maximum size of files that can be inlined can be tuned indirectly by increasing the inode size (the -I option of mkfs.ext4(8)): the bigger the inode, the bigger the files that can be inlined. But if the workload doesn't make extensive use of small files, the extra inode space will be wasted.
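The gap between a file's size and the space allocated for it can be observed with stat(2), the system call behind stat(1). The sketch below is an illustration only: the struct and function names are hypothetical, and it assumes that st_blocks is counted in 512-byte units, as it is on Linux.

```c
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical result type: logical size versus space allocated on disk. */
struct file_usage {
    long long size_bytes;       /* what the file contains */
    long long allocated_bytes;  /* what the filesystem reserved for it */
};

/* Fills *out for the given path. Returns 0 on success, -1 on error. */
int get_file_usage(const char *path, struct file_usage *out)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;

    out->size_bytes = (long long)st.st_size;
    /* Assumption: st_blocks is in 512-byte units (true on Linux). */
    out->allocated_bytes = (long long)st.st_blocks * 512;
    return 0;
}
```

For a file of a few bytes on a filesystem without inline data, allocated_bytes is typically a whole block (for example 4096) while size_bytes stays tiny. With inline data enabled, such a file may report no allocated blocks at all.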

Code: commit 1, 2, 3, 4, 5, 6, 7, 8

1.2. Btrfs fast device replacement

As a filesystem that spans multiple devices, Btrfs can remove a disk easily, for example when you want to shrink your storage pool, or when a device is failing and you want to replace it:

btrfs device add new_disk mountpoint

btrfs device delete old_disk mountpoint

But the process is not as fast as it could be. Btrfs has added an explicit device replacement operation which is much faster:

btrfs replace start old_disk new_disk mountpoint

The copy usually takes place at 90% of the available platter speed if no additional disk I/O is ongoing during the copy operation. The operation runs on a live, mounted filesystem: it does not require unmounting it or stopping active tasks, and it is safe to crash or lose power during the operation, because the process resumes at the next mount. It's also possible to use the command btrfs replace status to check the status of the operation, or btrfs replace cancel to cancel it. The userspace patches for the btrfs program can be found at git://btrfs.giantdisaster.de/git/btrfs-progs.

Code: commit 1, 2, 3, 4, 5, 6

1.3. F2FS, an SSD-friendly file system

F2FS is a new experimental file system, contributed by Samsung, optimized for flash memory storage devices. Linux has several file systems targeted at flash devices (logfs, jffs2, ubifs), but they are designed for native flash devices that expose the flash storage directly to the computer. Many of the commonly used flash storage devices (SSD disks) aren't native flash devices. Instead, they have an FTL (flash translation layer) that emulates a block-based device and hides the true nature of flash memory. This makes it possible to use the existing block storage stacks and file systems on those devices. These file systems have made some optimizations to work better with SSDs, such as trimming. But the filesystem formats themselves haven't changed to optimize for them.

F2FS is a filesystem for SSDs that tries to keep in mind the existence of the flash translation layer, and tries to make good use of it. For more details about the design choices made by F2FS, the LWN article on the subject is recommended reading.

Code: fs/f2fs

1.4. User namespace support completed

Per-process namespaces allow processes to have different views of several resources. For example, a process might see one set of mountpoints, PID numbers, and network stack state, while a process in another namespace sees a different set. Per-process namespace support has been in development for many years: the command unshare(1), available in modern Linux distros, can start a process with the mount, UTS, IPC or network namespaces unshared from its parent; and systemd uses mount namespaces for the ReadWriteDirectories, ReadOnlyDirectories and InaccessibleDirectories unit configuration options, and for systemd-nspawn. But the use of namespaces was limited to root.

This release adds the ability for unprivileged users to use per-process namespaces safely. The resources with namespace support available are filesystem mount points, UTS, IPC, PIDs, and the network stack.

For more details about the Linux namespace support, what they are, how they work, details about the API and some example programs, you should read the article series from LWN.