At times it is necessary to flush UNREPLIED connection tracking entries for connectionless protocols when NAT rules are involved. This is the case, for example, when an IPsec or PPP connection comes up. Without the flush, connections are not NATed correctly because the topology change has not been taken into account.
Doing this in userspace with conntrack-tools took a long time, up to minutes on some setups. They therefore decided to move it into kernel space, where it now takes milliseconds instead of minutes.
Holger wants to know if somebody has another solution to this problem (or if someone sees a generic use for their feature).
The discussion showed that the slowness came from the fact that conntrack-tools forces you to delete connections one by one. Other points were raised, such as the idea that connection tracking could in some way react to the topology change itself. The discussion is planned to continue on the way back to the hotel.
There is a fixed maximum number of connection tracking entries. When the maximum is reached, new connections are simply dropped. The default maximum is ridiculously low, for instance using about 20 MB on a computer with 12 GB of memory.
The kernel syslog message "nf_conntrack: table full, dropping packet" is not accurate, because the packets simply have no state as far as conntrack is concerned. They usually get blocked by rules that drop invalid packets, but an adapted ruleset could let them through.
Another problem is that adjusting the connection tracking table size does not change the hash size. This results in longer lookups, because conntrack often has to walk through a list of entries in a hash bucket.
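As a rough illustration of why this matters, the average length of the list in each hash bucket is the number of entries divided by the number of buckets. The sketch below uses made-up figures chosen for the example, not kernel defaults.

```python
def average_chain_length(entries: int, buckets: int) -> float:
    """Average number of conntrack entries stored per hash bucket."""
    if buckets <= 0:
        raise ValueError("buckets must be positive")
    return entries / buckets


# Hypothetical figures: the maximum number of entries is raised 16 times
# while the hash size is left untouched.
BUCKETS = 16384
before = average_chain_length(65536, BUCKETS)
after = average_chain_length(1048576, BUCKETS)
print(before, after)
```

With these assumed numbers a lookup in a full table walks lists that are 16 times longer, unless the hash size is raised along with the maximum.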
Running out of entries is mostly due to connections at the end of their life. Because the timeouts are long, the number of such entries can be large. Lowering the timeouts when the connection tracking table is almost full can help relieve the pressure. Changing the parameters automatically is conceivable, but finding a correct logic is not easy.
Destroying unimportant connection tracking entries could really help, but a suitable logic has to be found. Adjusting timeouts dynamically requires a full scan of the list, which is really costly. The algorithm also has to be resistant to DoS attacks. Finding a generic strategy is difficult. Pablo proposes trying a userspace solution. It could be used to experiment with different policies, and it could also use information taken from other subsystems and/or from a configuration file.
Samir suggests lowering nf_conntrack_tcp_timeout_syn_sent when under attack. This would make the bad entries disappear faster.
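To make the idea of a userspace policy more concrete, here is a minimal sketch of one possible rule: keep the timeout unchanged until the table is 80% full, then shrink it linearly down to a floor. The function name, the threshold, the floor and the starting value of 120 seconds are all assumptions made for the example; nothing like this was presented at the workshop.

```python
def scaled_timeout(base: int, used: int, maximum: int,
                   threshold_pct: int = 80, floor: int = 5) -> int:
    """Return a timeout in seconds that shrinks as the table fills up.

    Below threshold_pct percent of fill the base timeout is kept. Above it,
    the timeout decreases linearly and reaches floor when the table is full.
    """
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    fill_pct = min(100, used * 100 // maximum)
    if fill_pct <= threshold_pct:
        return base
    span = 100 - threshold_pct      # width of the "pressure" zone
    over = fill_pct - threshold_pct  # how far into that zone we are
    return base - (base - floor) * over // span


# Hypothetical usage with a 120 second starting value.
for used in (50, 90, 100):
    print(used, scaled_timeout(120, used, 100))
```

A real policy would also have to decide how to apply the new value and how to resist an attacker who deliberately fills the table, which is exactly the hard part mentioned above.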
Jan starts his presentation by talking about his "Distro Availability Matrix of Netfilter tech" page. It lists the software and its versions across a lot of distributions.
The next subject is maintaining translations of the iptables man page. The team is international and could translate the man pages into a few languages, but the question is how to find volunteers for the long term. Jan is fine with taking charge of synchronizing the translations. Volunteers for translation are welcome.
Then Jan starts a discussion about his work on Xtables2. The point discussed is switching iptables to netlink. The issue is that iptables commands can be huge and the size of a netlink packet is limited, so this has to be solved. One possibility is to use continuation messages, which netlink supports, but cutting the message in the right place is not easy. During the discussion, clarifications emerge on how to build huge netlink messages.
The last subject is the maintenance of Netfilter. David Miller posted a message on netdev complaining about the Netfilter maintainers. Patrick and Pablo are currently working on setting up a git tree that they can share. This should help speed up the maintainers' reaction time. As he does a lot of work on iptables, Jan will soon have an account on Netfilter so that he can push patches to the official iptables git tree.
Reverse path filtering is currently only implemented for IPv4. Eric Leblond sent a patch to add support for IPv6, but it was refused by David Miller who, among other points, wanted to get rid of rp_filter and would like to see it in the Netfilter code.
The reverse path filter implementation is a single function called fib_validate_source. At first sight the problem seems relatively simple: just swap source and destination, then look up the output interface. If it matches the incoming interface, the packet is OK.
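The check described above can be sketched in a few lines of userspace code. This is only a toy model of the principle with a hypothetical routing table and invented function names; it is not how fib_validate_source is written, and it ignores the special cases and multipath routing.

```python
import ipaddress

# Hypothetical routing table: (prefix, output interface).
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "eth0"),
    (ipaddress.ip_network("192.168.1.0/24"), "eth1"),
    (ipaddress.ip_network("2001:db8:1::/64"), "eth1"),
]


def route_lookup(destination: str):
    """Longest-prefix match; return the output interface or None."""
    addr = ipaddress.ip_address(destination)
    best = None
    for net, iface in ROUTES:
        if addr.version == net.version and addr in net:
            if best is None or net.prefixlen > best[0].prefixlen:
                best = (net, iface)
    return best[1] if best else None


def rp_filter_ok(source: str, in_iface: str) -> bool:
    """Strict reverse path check: route back to the source address and
    accept only if the reply would leave through the incoming interface."""
    return route_lookup(source) == in_iface


print(rp_filter_ok("192.168.1.10", "eth1"))  # arrives on the expected side
print(rp_filter_ok("192.168.1.10", "eth0"))  # spoofed from the outside
```

Note that the same function handles IPv4 and IPv6 addresses here only because the toy table mixes both; in the kernel the two families have separate routing code.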
But the API is not that simple and implementing it in Netfilter is not easy. For example, in PRE_ROUTING we don’t have the output interface and cannot obtain it easily with simple Netfilter functions. An implementation using standard functions from the routing code is thus necessary. But there is still an issue with multipath routing in IPv4. Florian then tried a second implementation which mimics the behaviour of fib_validate_source.
Some implementation questions are discussed, mainly about how to handle special cases. Patrick proposes modifying the code in PREROUTING to be able to access all interfaces. This would then allow a more Netfilter-based implementation.
Regarding userspace syntax, this is a match, so a specific iptables rule will have to be added to benefit from the functionality.
I just gave a presentation to explain why reverse path filtering has to be implemented carefully in IPv4 and IPv6.
More to come later.
Patrick McHardy presents his work on a modification of netlink and nfnetlink_queue to use memory mapping.
One of the problems of netlink is that it uses regular socket I/O, so data needs to be copied to the socket buffer data areas before being sent. This is a problem for performance.
The basic concept of memory-mapped netlink is to use a shared memory area that is accessible to both the kernel and userspace. A ring buffer is set up and, instead of copying the data, we just move a pointer to the correct memory area and userspace reads it.
The kernel and userspace have to be synchronized to avoid reading an area that does not contain valid data. This is done by giving each area an owner.
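The ownership idea can be modelled with a small simulation. This is only a sketch of the principle: the class, the state names and the single-threaded structure are assumptions made for the example, and the real implementation works on shared memory frames rather than Python lists.

```python
KERNEL, USER = "kernel", "user"  # hypothetical ownership states


class Ring:
    """Toy ring buffer where each frame is owned by one side at a time."""

    def __init__(self, size: int):
        self.owner = [KERNEL] * size
        self.data = [None] * size
        self.head = 0  # next frame the kernel side fills
        self.tail = 0  # next frame the userspace side reads

    def produce(self, message) -> bool:
        """Kernel side: fill the next frame, then hand it over to userspace."""
        if self.owner[self.head] != KERNEL:
            return False  # ring full, the caller must fall back to a copy
        self.data[self.head] = message
        self.owner[self.head] = USER
        self.head = (self.head + 1) % len(self.owner)
        return True

    def consume(self):
        """Userspace side: read the next frame, then give it back."""
        if self.owner[self.tail] != USER:
            return None  # nothing valid to read yet
        message = self.data[self.tail]
        self.owner[self.tail] = KERNEL
        self.tail = (self.tail + 1) % len(self.owner)
        return message


ring = Ring(2)
ring.produce("packet 1")
print(ring.consume())
```

Because a reader only touches frames it owns, it never sees a frame that the other side is still filling.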
There is an RX ring and a TX ring, so it is also possible to send packets (or issue verdicts) via the TX ring. There are few advantages on the TX side, apart from the possibility of batching verdicts by issuing several of them in one send message.
Backward compatibility with subsystems that do not support this new system is ensured by falling back to standard copying and message sending and receiving.
Message ordering was a difficult problem to solve: the read order in the ring depends on the allocation time in the ring and not on the arrival time of the packet. It is thus possible to have out-of-order packets in the ring. To fix this, userspace can specify that it cares about ordering, and the kernel will then do the reception of the packet and the copy atomically.
Multicast is currently not supported. Synchronizing data across clients is a big issue and most of the solutions would perform badly.
Userspace support has been done in libmnl. As usual with Patrick, the API looks clean and adding support for it in
Testing has only been done on a loopback interface because Patrick did not have access to a 10Gbit test bed. This is an unfavourable test case because copying on loopback is less expensive, so performance measurements on real NICs should give better results.
Anyway, the performance impact is significant: a bandwidth increase of between 200% and 300%, depending on the packet size.
There are currently no known bugs and the submission to netdev should occur soon.
Jesper presents his IPTV analyser, now called IPTV-analyser.
He started the project after encountering problems in the IPTV system of the company he works for. Proprietary analysers exist but they are expensive, and the equipment they tested was not able to show burstiness directly. To fix this, he started with wireshark and added a burstiness detector to it. That was not enough because pcap did not scale well enough, so they decided to build their own probe. One of the decisive points was the 192000€ required to buy the necessary probes.
Being an ISP with a custom set-top box has some advantages: they were able to deploy the probe on the box to get data from the client side.
The project is made of three parts:
- Kernel module: parses the MTS flow and extracts data
- Collect daemon: gets the data in userspace by polling /proc
- Web interface: displays the statistics and lets you browse them
Jesper is looking for help with the web interface. He is not a web developer and needs help on that point.
IPTV-analyser has been open sourced and the source is available on github.
Holger wants to describe his experience of switching from single-core systems to multicore systems at
- RPS: Receive Packet Steering
- RFS: Receive Flow Steering
- XPS: Transmit Packet Steering
They are using a 2.6.32 kernel and had to backport the code, but this was quite easy because the code is self-contained. irqbalance is not RPS and XPS aware and is known to degrade performance. Holger therefore decided to start a new project.
It is named (for the moment) irqd. It is able to differentiate hardware multiqueue from RPS, and it uses a netlink interface to communicate.
The information used to determine how to dispatch the load is spread over a lot of files, and that was one of the difficulties.
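For RPS, one of those files is the per-queue rps_cpus entry in sysfs (under /sys/class/net/<device>/queues/rx-<n>/), which expects a hexadecimal CPU bitmask. As a small illustration of what a tool has to compute, the helper below builds such a mask from a list of CPU numbers. The function name is invented for the example and this is not taken from irqd; for simplicity it also does not handle the comma-separated groups the kernel uses for masks wider than 32 bits.

```python
def cpu_mask(cpus) -> str:
    """Return the hexadecimal bitmask with one bit set per CPU number."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must not be negative")
        mask |= 1 << cpu
    return format(mask, "x")


# Hypothetical example: steer a queue to CPUs 1 and 3.
print(cpu_mask([1, 3]))
```

Writing the resulting string to the rps_cpus file of a receive queue requires root privileges and is left out of the sketch.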
What seems strange is that the defaults for multiqueue and RPS/XPS are not good and clearly fail the “choose sane defaults” principle.
The work on irqd is still in progress. It works, but there is currently no configuration file, so it cannot be tuned easily. It is available on github, and Holger encourages everyone to have a look and try to improve the software.
When matching with iptables, testing the rules sequentially is costly. With ipset it is possible to limit the number of matches by using sets.
For their use case, they decided to use the connection mark to determine the fate of the packet. It is used to jump to the correct chain. This logic, combined with a connection mark set type they have developed, leads to a filtering system with a really limited number of rules. In fact, it meant switching from something like 10000 rules to one single rule, with ipset doing all the classification work. The performance increase is huge: on the test system, bandwidth goes from 256Mb with iptables to 1.8Gb with their system.
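The difference between the two approaches can be illustrated with a toy model: a list walked in order stands for the chain of rules, and a dictionary stands for the set. The data and function names are invented for the example, and the model says nothing about the actual kernel data structures.

```python
def classify_linear(rules, addr):
    """Walk the rules in order, as a chain of iptables rules would.

    Returns the mark and the number of rules that had to be tested."""
    for tests, (rule_addr, mark) in enumerate(rules, start=1):
        if rule_addr == addr:
            return mark, tests
    return None, len(rules)


def classify_set(table, addr):
    """Single hashed lookup, standing in for one rule backed by a set."""
    return table.get(addr)


# Hypothetical ruleset: 10000 addresses, each mapped to its own mark.
rules = [(f"10.0.{i // 256}.{i % 256}", i) for i in range(10000)]
table = dict(rules)

print(classify_linear(rules, "10.0.39.15"))
print(classify_set(table, "10.0.39.15"))
```

In the worst case the linear walk tests every rule before finding the mark, while the set lookup cost does not grow with the number of entries in the same way.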
Ipset is now included in the kernel, and that is the main ipset event of the past year. József recommends using version 6.8, which is included in kernel 3.1. If your kernel is older, using a separately compiled ipset is recommended.
Bugfixes aside, a lot of new features have been introduced since version 6.0. It is now possible to list the sets defined on a system without dumping their content, which is useful when big sets have been defined.
A new hash:net,iface set type has been introduced.
- tc filter support: to use ipset in traffic shaping
- ipset state replication: it would be interesting, but the state of every iptables match that keeps state would have to be replicated
This last point is a really interesting problem: a lot of data would have to be exchanged because the state of the matches changes, and it is difficult to find an identifier to use for the replication (to tell where the match takes place).
Another possible extension is to add match support to the SET target. It would help to deal with the problem of overlapping sets. For example, we could say:
ipset new foo hash:net
ipset add foo 192.168.1.1 --drop
ipset add foo 192.168.1.0/24 [--accept]
iptables -A ... -j SET --match-set foo dir \
József would like to refresh his great test paper, but performance testing is a real problem because it requires at least 50 quad-core computers and running all the needed tests could take as much as 18 days.