Patrick McHardy: memory mapped netlink and nfnetlink_queue

Patrick McHardy presents his work on a modification of netlink and nfnetlink_queue that uses memory mapping.

One of the problems of netlink is that it uses regular socket I/O, so data needs to be copied to the socket buffer data areas before being sent. This hurts performance.

The basic concept of memory mapped netlink is to use a shared memory area which is accessible to both kernel and userspace. A ring buffer is set up and, instead of copying the data, the kernel just moves a pointer to the correct memory area and userspace reads the data from there.
It is necessary to synchronize kernel and userspace to avoid reading an area that does not hold valid data. This is done by tracking the ownership of each area.
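
The ownership idea can be illustrated with a small userspace-only model. This is a sketch under assumptions, not the actual kernel interface: the `Ring` and `Frame` classes and the `KERNEL`/`USER` states are hypothetical names chosen for the example. The producer may only fill a frame it owns, and the consumer may only read a frame that has been handed over to it.

```python
from dataclasses import dataclass

# Hypothetical ownership states: who is allowed to touch a frame right now.
KERNEL, USER = "kernel", "user"


@dataclass
class Frame:
    owner: str = KERNEL
    payload: bytes = b""


class Ring:
    """Toy model of a shared ring where each frame carries an owner flag."""

    def __init__(self, size):
        self.frames = [Frame() for _ in range(size)]
        self.head = 0  # next frame the producer (kernel side) fills
        self.tail = 0  # next frame the consumer (userspace side) reads

    def produce(self, data):
        frame = self.frames[self.head]
        if frame.owner != KERNEL:
            return False  # ring full: the consumer has not released this frame
        frame.payload = data
        frame.owner = USER  # hand the frame over to the consumer
        self.head = (self.head + 1) % len(self.frames)
        return True

    def consume(self):
        frame = self.frames[self.tail]
        if frame.owner != USER:
            return None  # nothing valid to read yet
        data = frame.payload
        frame.owner = KERNEL  # give the frame back to the producer
        self.tail = (self.tail + 1) % len(self.frames)
        return data
```

In this model, reading a frame that is still owned by the producer is simply refused, which is the "non significant area" case the ownership flag is meant to prevent.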

There is an RX and a TX ring, so it is also possible to send packets (or issue verdicts) via the TX ring. There are few advantages on the TX side, apart from the possibility to batch verdicts by issuing several of them in one send message.

Backward compatibility with subsystems that do not support this new system is ensured by falling back to standard copying and message sending and receiving.

Ordering of messages was a difficult problem to solve: reading from the ring depends on the allocation time in the ring and not on the arrival time of the packet. It is thus possible to have unordered packets in the ring. To fix this, userspace can specify that it cares about ordering, and the kernel will then do the reception of the packet and the copy atomically.

Multicast is currently not supported. The synchronisation of data across clients is a big issue and most of the solutions would have bad performance.

Userspace support has been done in libmnl. As usual with Patrick, the API looks clean and adding support for it in

Testing has only been done on a loopback interface because Patrick did not have access to a 10Gbit test bed. This is a bad test case because copying on loopback is less expensive, and thus performance measurements on real NICs should give better results.
Anyway, the performance impact is significant: between 200% and 300% bandwidth increase depending on the packet size.

There are currently no known bugs and the submission to netdev should occur soon.

Jesper Dangaard Brouer: IPTV-analyzer

Jesper presents his IPTV analyser, now called IPTV-analyzer.

He started the project when encountering problems in the IPTV system of the company he works for. Proprietary analysers exist but they are expensive, and the equipment they tested was not able to show the burstiness directly. To fix this, he started using wireshark and added a burstiness detector to it. It was not enough because pcap was not scaling well enough, so they decided to build their own probe. One of the decisive points was the 192000€ needed to buy the necessary probes.

Being an ISP with a custom set-top box has some advantages: they were able to deploy the probe on the box to get data from the client side.

The project is made of three parts:

  • Kernel module: parse MTS flow and extract data
  • Collect daemon: get the data in userspace via /proc polling
  • Web interface: display the statistics and browse

Jesper is looking for help on the web interface. He is not a web developer and needs help on that point.

IPTV-analyzer has been open sourced and the sources are available on github.

Holger Eitzenberger: experiences from making Network Stack Multicore

Holger wants to describe his experience of switching from single-core to multicore systems at Astaro Sophos.
They used:

  • RPS: Receive packet steering
  • RFS: Receive flow steering
  • XPS: Transmit packet steering
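
As an illustration of what configuring RPS involves, the kernel exposes a per-queue CPU bitmap in sysfs (`rps_cpus` under each `rx-N` queue directory). The sketch below only builds the hexadecimal mask and the path; the helper names `cpu_mask` and `rps_cpus_path` are made up for this example, and the plain hex form shown here assumes a machine with fewer than 32 CPUs.

```python
def cpu_mask(cpus):
    """Return the hexadecimal bitmap with one bit set per CPU number."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def rps_cpus_path(dev, rx_queue):
    """Return the sysfs file holding the RPS CPU bitmap of one RX queue."""
    return f"/sys/class/net/{dev}/queues/rx-{rx_queue}/rps_cpus"


# Example: steer packets received on eth0 queue 0 to CPUs 0-3.
# Writing the file requires root, so it is only shown as a comment:
#   with open(rps_cpus_path("eth0", 0), "w") as f:
#       f.write(cpu_mask([0, 1, 2, 3]))
```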

They are using a 2.6.32 kernel and had to backport the code, but this was quite easy because the code is self-contained. irqbalance is not RPS and XPS aware and is known to degrade performance. Holger therefore decided to start a new project.

It is named (for the moment) irqd. It is able to differentiate hardware multiqueue and RPS, and it uses a netlink interface to communicate.

The information used to determine how to dispatch the load is spread over a lot of files, and that was one of the difficulties.

What seems strange is that the defaults for multiqueue and RPS/XPS are not good and clearly fail the “choose sane defaults” principle.

The work on irqd is still in progress: it is working, but there is currently no configuration file, so it cannot be easily tuned. It is available on github and Holger encourages everyone to have a look and try to improve the software.

Sanket Shah: An alternate way to use IPSet framework for increasing firewall throughput

When matching with iptables, the sequential testing of the rules is costly. By using ipset, it is possible to limit the number of matches by using sets.

For their use case, they decided to use the connection mark to determine the fate of the packet. It is used to jump to the correct chain. This logic, combined with a connectionmark set type they have developed, leads to a filtering system with a really limited number of rules. In fact, this meant switching from something like 10000 rules to one single rule. Ipset is doing all the classification work. The performance increase is huge: on the test system, it goes from a bandwidth of 256Mb with iptables to a bandwidth of 1.8Gb with their system.
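
The gain comes from replacing a sequential walk over the rules with a single set lookup that yields a mark. The following is only a toy model of that idea, not their implementation: the functions `linear_match` and `set_match`, the host list and the `chainN` names are all invented for the illustration.

```python
import ipaddress


def linear_match(rules, src):
    """Sequential evaluation: one test per rule, like a long iptables chain."""
    addr = ipaddress.ip_address(src)
    tests = 0
    for network, chain in rules:
        tests += 1
        if addr in network:
            return chain, tests
    return "default", tests


def set_match(mark_set, chains, src):
    """One hash lookup gives the mark, and the mark selects the chain."""
    mark = mark_set.get(ipaddress.ip_address(src))
    return chains.get(mark, "default")


# 10000 hypothetical hosts, each assigned to one of four chains.
hosts = [ipaddress.ip_address(f"10.0.{i // 250}.{i % 250 + 1}") for i in range(10000)]
rules = [(ipaddress.ip_network(f"{h}/32"), f"chain{i % 4}") for i, h in enumerate(hosts)]
mark_set = {h: i % 4 for i, h in enumerate(hosts)}
chains = {m: f"chain{m}" for m in range(4)}
```

With this data, a packet from the last host costs 10000 tests in `linear_match`, while `set_match` reaches the same chain with one dictionary lookup.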

József Kadlecsik: ipset status

Ipset is now included in the kernel, and that is the main ipset event of the past year. József recommends using version 6.8, which is included in kernel 3.1. If your kernel is older, using a separately compiled ipset is recommended.

Bugfixes aside, a lot of new features have been introduced since version 6.0. It is possible to list the sets defined on a system without getting their whole content, which is useful when big sets have been defined.

A new hash:net,iface set type has been introduced.

Possible extensions:

  • tc filter support: to use ipset in traffic shaping
  • ipset state replication: it would be interesting, but then all iptables matches with state should be replicated

This last point is a really interesting problem: there is a lot of data that could be exchanged because the state of a match changes, and it is difficult to find an identifier to use when doing the replication (to tell where this match takes place).

Another possible extension is to add match support in the SET target. It would help to deal with the problem of overlapping sets. For example, we could say:

ipset new foo hash:net
ipset add foo 192.168.1.1 --drop
ipset add foo 192.168.1.0/24 [--accept]
iptables -A ... -j SET --match-set foo dir \
           --match-accept chain1 \
           --match-drop chain2
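
One way to read the proposal above is that each set entry carries its own verdict and that the most specific entry wins when entries overlap. That resolution rule is an assumption made for this sketch, and the `lookup` function is hypothetical, it is not part of ipset.

```python
import ipaddress


def lookup(entries, addr):
    """Return the verdict of the most specific entry containing addr, or None."""
    ip = ipaddress.ip_address(addr)
    best = None
    for network, verdict in entries:
        # Assumption: the longest prefix wins when entries overlap.
        if ip in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, verdict)
    return best[1] if best else None


# Same content as the proposed set "foo": one host dropped inside an accepted /24.
foo = [
    (ipaddress.ip_network("192.168.1.1/32"), "drop"),
    (ipaddress.ip_network("192.168.1.0/24"), "accept"),
]
```

Here `lookup(foo, "192.168.1.1")` resolves to the host entry, while any other address of the /24 falls back to the network entry.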

József would like to refresh his great test paper, but performance testing is a real problem because it requires at least 50 quad-core computers and running all the needed tests could take as much as 18 days.

Eric Leblond: degree of freedom offered by connection tracking helpers

I gave a small presentation about a study I’ve made on connection tracking helpers. The slides are here: nfws_helper_freedom

The discussion following the talk was interesting. The main subject was automatic testing of the connection tracking helpers (as well as testing the other components). Pablo Neira Ayuso came up with the idea of injecting packets into the kernel via a mechanism similar to NFQUEUE. It would then be easy to replay traffic. An extended discussion about the subject should take place during the week.

Samir Bellabes: userspace security for network syscalls – snet

Snet is an LSM module which handles network access. It is composed of a kernel part, a library and a tool.

In the kernel, events are generated for a protocol and a syscall, for example tcp and listen. It is then possible, through a ticket system, to decide if a process has the right to trigger the event. For example, you can state that firefox can open connections to the outside. A netlink protocol is used to communicate with userspace. It is thus possible to take the decision in userspace by issuing a ticket and sending it to the kernel.
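
A rough model of such a ticket table is sketched below. It is not snet's actual data structure or API: the `TicketTable` class, its method names, the verdict strings and the time-to-live on tickets are assumptions made for the illustration. The point it shows is that a stored ticket answers the question locally, and a default verdict applies when no valid ticket exists.

```python
import time


class TicketTable:
    """Toy ticket cache keyed by (process, protocol, syscall)."""

    def __init__(self, default_verdict="deny"):
        self.default_verdict = default_verdict
        self.tickets = {}

    def grant(self, process, protocol, syscall, verdict, ttl, now=None):
        # Assumption: a ticket is valid for `ttl` seconds after it is issued.
        now = time.monotonic() if now is None else now
        self.tickets[(process, protocol, syscall)] = (verdict, now + ttl)

    def decide(self, process, protocol, syscall, now=None):
        now = time.monotonic() if now is None else now
        ticket = self.tickets.get((process, protocol, syscall))
        if ticket and now < ticket[1]:
            return ticket[0]  # a valid ticket answers without asking userspace
        return self.default_verdict  # no valid ticket: fall back to the default


table = TicketTable(default_verdict="deny")
table.grant("firefox", "tcp", "connect", "allow", ttl=60, now=0)
```

With this table, `table.decide("firefox", "tcp", "connect", now=10)` returns the ticket's verdict, while an unknown event or an expired ticket gets the default one.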

snet-tool is used to manage events and provide an interface to the user.

If the computer is also running a firewall, there are two layers of filtering, and there is for the moment no way to say: this software is listening on a port and has the right to, so the firewall should be open. A way to fix this would be to create expectations in the connection tracking to dynamically open the flow.

A crash of the userspace part was discussed. Some mechanisms have been set up, like applying a default policy if userspace does not answer after a configurable delay. And the use of tickets makes it possible to have a working system without userspace running.