Since the Netfilter workshop is a developer conference, I decided to present an introduction to the Coccinelle tool. Coccinelle is a program matching and transformation engine for the C language which is used in many places, among them the Linux kernel. It is able to perform clever modifications of C code. If you have ever had to modify multiple source files following an API change, I invite you to have a look at the slides or my Coccinelle for the newbie page. I also presented my coccigrep tool, which is an easy-to-use semantic grep.
The slides are available: nfws_coccinelle
Jesper’s IPTables::libiptc is a Perl module which allows you to modify Netfilter rules from Perl. He is the maintainer and the module is available on CPAN. It currently supports iptables up to 1.4.10 (version 0.51 of IPTables::libiptc).
It dynamically loads xtables.so and libiptc.so to access iptables features. It is fast because it does not suffer from the iptables limitation of applying modifications one by one. Performance is quite good: it takes only 16 seconds to generate and load an 80000-rule ruleset, compared to the 42 hours that direct iptables calls would take.
Jesper would like to have a complete iptables library giving access to all functions, and in particular to the do_command() function. One interesting thing for him would be to have access to the test command.
Pablo does not want the team to guarantee that libiptc will not break its API or ABI. As it is already exported, it is not possible to make it private again. As the part Jesper is interested in is tied to the user command, there should not be any API break. Thus exporting the function seems OK.
As next work, Jesper wishes to publish a wrapper module, IPTables::Interface, and to move it to CPAN, maybe inside the IPTables::libiptc module.
Patrick presented work that aims at getting rid of the second tuple in connection tracking. This second tuple is only necessary when NAT is used. The idea is not new, but at the time ct-extensions were not available and thus it was not possible to add the tuple only when needed. Patrick has done most of the work but there is still a missing point, which is the hash function. It has to be symmetrical:
hash_func(src, dst) = hash_func(dst, src), and it must be very fast to avoid slowing down conntrack.
If this point is fixed, it will be possible to get rid of the second tuple for all non-NATed connection tracking entries.
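The symmetry requirement can be illustrated with a minimal sketch. It is not Patrick's implementation: the function name, the (ip, port) endpoint format and the use of CRC32 as a stand-in hash are all assumptions made for illustration. The idea shown is simply to put the two endpoints in a canonical order before hashing, so that both directions of a flow land in the same bucket.

```python
import ipaddress
import zlib


def endpoint_bytes(endpoint):
    """Encode a hypothetical (ip, port) endpoint as fixed-size bytes."""
    ip, port = endpoint
    return int(ipaddress.ip_address(ip)).to_bytes(16, "big") + port.to_bytes(2, "big")


def symmetric_hash(src, dst, buckets=16384):
    """Return the same bucket for (src, dst) and (dst, src).

    The endpoints are sorted first, so the hashed key does not depend on
    the direction of the packet. CRC32 is only a stand-in hash here.
    """
    low, high = sorted((endpoint_bytes(src), endpoint_bytes(dst)))
    return zlib.crc32(low + high) % buckets


client = ("192.0.2.1", 40000)
server = ("198.51.100.7", 80)
print(symmetric_hash(client, server) == symmetric_hash(server, client))
```

The sort adds a comparison on every lookup, which is exactly the kind of cost the speed requirement is about; a real kernel implementation would need something cheaper.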
We have been ignoring the fact that NAT could be of some interest for IPv6 during the last 5 years. IPv6 will not fix everything and it may be time to reconsider NAT. There are some reasons for that:
- Dynamic IPv6 prefixes: some ISPs decide not to give fixed addresses to people
- Server load balancing, DMZ
- Uplink balancing (multi-homing): this is one of the most important reasons. IPv6 clients can handle multiple addresses, but you may not want your users to choose their Internet uplink.
IPv6 NAT has been available in OpenBSD for some years now. It is also available on FreeBSD when using pf. Cisco IOS has no IPv6 NAT support.
The Linux status is quite complicated. There are at least three implementations, and there is even an official one that comes from Linux Virtual Server.
NAT66 (RFC 6296) is now available. There is no port translation and the mapping must be checksum-neutral (changing the prefix must not change the checksum).
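To make "checksum-neutral" concrete, here is a small sketch of a prefix translation for /48 prefixes. The function names are hypothetical and this is only an illustration of the principle: when the three prefix words are replaced, a fourth 16-bit word of the address is adjusted so that the one's complement sum of the whole address, and therefore any transport checksum computed over it, stays the same.

```python
import ipaddress


def fold16(value):
    """Fold carries back in, as one's complement 16-bit addition does."""
    while value > 0xFFFF:
        value = (value & 0xFFFF) + (value >> 16)
    return value


def words(address_int):
    """Split a 128-bit address into eight 16-bit words."""
    return [(address_int >> (112 - 16 * i)) & 0xFFFF for i in range(8)]


def ones_sum(word_list):
    return fold16(sum(word_list))


def translate_48(address, internal_prefix, external_prefix):
    """Rewrite a /48 prefix and adjust word 3 to keep the checksum unchanged."""
    w = words(int(ipaddress.IPv6Address(address)))
    old = words(int(ipaddress.IPv6Network(internal_prefix).network_address))[:3]
    new = words(int(ipaddress.IPv6Network(external_prefix).network_address))[:3]
    if w[:3] != old:
        raise ValueError("address is not inside the internal prefix")
    # adjustment = sum(old prefix) - sum(new prefix), in one's complement
    adjustment = fold16(ones_sum(old) + (0xFFFF ^ ones_sum(new)))
    w[3] = fold16(w[3] + adjustment)
    if w[3] == 0xFFFF:
        w[3] = 0  # 0xFFFF and 0x0000 both mean zero; use the canonical form
    w[:3] = new
    result = 0
    for word in w:
        result = (result << 16) | word
    return str(ipaddress.IPv6Address(result))


print(translate_48("fd01:203:405:1::1234", "fd01:203:405::/48", "2001:db8:1::/48"))
```

Because only the address changes and its contribution to the checksum is preserved, the translator does not have to touch TCP or UDP headers, which is why no port translation is needed.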
Ulrich proposes some choices for integration into Netfilter:
- No IPv6 NAT
- NAT66 ip6tables target (with or without conntrack dependency)
- Make nf_nat protocol independent and move it to net/netfilter (let admins decide if they want 1:1 or n:1)
- Any other solutions?
The main discussion was about the impact of the change introduced by IPv6 NAT. Nobody seemed to be against the introduction of the feature, and it was finally agreed to add IPv6 NAT to Netfilter. The remaining point is who will do the job.
Pablo presented his work on protocol classification. As you may not have guessed, nfgrep does not use regular expressions but a descriptive language.
The basic architecture is the following:
- A layer-7 filter is developed in userspace
- The filter is passed to a tool that generates byte-code
- The tool loads the byte-code into the kernel via nfnetlink
- The kernel does the classification
- The nfgrep match can then be used to select or mark the flow
In userspace, nfgrep and libnfgrep can be used to interact with the system. There is also an nfgrep-test tool to validate filters before sending them.
Pablo started working with BPF, but developing filters with it was hard: getting a simple field could take something like 10 lines. He looked at existing languages such as Lua, but they offer too many features and are not dedicated to this task.
By linking the data to the connection tracking entry, it is possible to store stateful information. Multiple pieces of information can be attached, so it is possible to have multiple matches.
The language is simple and contains only a few keywords. One of its benefits is the ability to match in multiple steps to ensure the matching is accurate. The image below is a description of the HTTP protocol:
Filters can be chained. It would thus be possible to detect HTTP and then to detect an HTTP subprotocol.
It is not currently possible to put the information about the detected protocol inside something like nfnetlink_queue, but this could be added and would provide very interesting classification information to an IPS like Suricata.
TCP segmentation is still an open issue, as it could defeat the matching.
The code should be released in the following days.
The Cyberoam team presented their work on active-active clustering. They have built a two-node active-active setup, with a primary and an auxiliary system. The primary takes care of load balancing. The setup uses virtual MAC addresses.
To avoid split-brain problems, the primary takes all decisions by always handling the SYN packet. It also transfers the NAT information and marks to the auxiliary via a module called ipt_SYNDATA, which is placed in PREROUTING.
Another problem they needed to fix was ARP resolution. There must be only one request and one answer. For that, they developed an arptables extension which makes the primary send all requests and transfer the answers over the dedicated link between the two nodes.
At times it is necessary to flush UNREPLIED connection tracking entries for connectionless protocols if NAT rules are involved. For example, this is the case when an IPsec or a PPP connection comes up. Without this flush, the connections are not correctly NATed because the topology change has not been taken into account.
Doing this in userspace with the conntrack-tools took a long time, up to minutes on some setups. They thus decided to do it in kernel space, where it now takes only milliseconds.
Holger wanted to know if somebody has another solution for this problem (or if someone sees a generic use for their feature).
The discussion showed that the slowness is explained by the fact that conntrack-tools force you to delete connections one by one. Other points were discussed, such as the idea that connection tracking could in some way react to this topology change. The discussion is planned to continue on the way back to the hotel.
There is a fixed number of connection tracking entries. When the maximum is reached, new connections are simply dropped. The default maximum size is ridiculously low, for example using 20 MB on a computer with 12 GB of memory.
The kernel syslog message
"nf_conntrack: table full, dropping packet" is not correct, because the packets simply have no state with regard to conntrack. Usually they get blocked by rules dropping invalid packets, but an adapted ruleset could let them go through.
Another problem is that adjusting the connection tracking size does not change the hash size. This results in longer searches, because conntrack often has to walk through a list.
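A quick back-of-the-envelope calculation shows why this matters. The numbers below are purely illustrative (they are not defaults quoted in the talk), and the sketch assumes entries are spread evenly over the hash buckets.

```python
def average_chain_length(entries, buckets):
    """Average number of entries per hash bucket, assuming an even spread."""
    if buckets <= 0:
        raise ValueError("buckets must be positive")
    return entries / buckets


# Hypothetical values: the hash keeps 16384 buckets while the table
# limit is raised from 65536 to 1048576 entries.
print(average_chain_length(65536, 16384))
print(average_chain_length(1048576, 16384))
```

With a full table, each lookup walks a list that is on average 16 times longer after the change (64 entries instead of 4), unless the hash size is increased as well.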
Most of the time, running out of entries is due to connections at the end of their life. As the timeouts are long, the number of such entries can be large. Lowering the timeouts when the connection tracking table is almost full can help to release the pressure. An automatic change of the parameters could be considered, but finding a correct logic is not easy.
Destroying non-important connection tracking entries is something that could really help, but a suitable logic has to be found. Adjusting timeouts dynamically requires a full scan of the list, which is really costly. The algorithm also has to be resistant to DoS attacks. Finding a generic strategy is difficult. Pablo proposed to try a userspace solution. This could be used to experiment with different policies, and it could also use information taken from other subsystems and/or from a configuration file.
Samir suggested lowering
nf_conntrack_tcp_timeout_syn_sent when under attack. This could make the bad entries disappear faster.
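As an illustration of what such a userspace policy could look like, here is a minimal sketch. It assumes the usual sysctl files under /proc/sys/net/netfilter (nf_conntrack_count, nf_conntrack_max, nf_conntrack_tcp_timeout_syn_sent); the 90% threshold and the reduced timeout of 20 seconds are hypothetical values chosen for the example, not recommendations from the workshop.

```python
from pathlib import Path

PROC_DIR = Path("/proc/sys/net/netfilter")


def read_int(name, base=PROC_DIR):
    """Read one integer sysctl value from the given directory."""
    return int((Path(base) / name).read_text().strip())


def choose_syn_sent_timeout(count, maximum, normal=120, reduced=20, threshold=0.9):
    """Pick the SYN_SENT timeout (seconds) from the table fill ratio.

    'reduced' and 'threshold' are hypothetical policy parameters.
    """
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return reduced if count / maximum >= threshold else normal


def apply_policy(base=PROC_DIR, **policy):
    """Read the table usage and write the chosen timeout (needs root on /proc)."""
    count = read_int("nf_conntrack_count", base)
    maximum = read_int("nf_conntrack_max", base)
    timeout = choose_syn_sent_timeout(count, maximum, **policy)
    (Path(base) / "nf_conntrack_tcp_timeout_syn_sent").write_text(f"{timeout}\n")
    return timeout


print(choose_syn_sent_timeout(950, 1000))
print(choose_syn_sent_timeout(100, 1000))
```

Keeping the decision in a pure function makes it easy to experiment with other policies, which is the point of doing this in userspace. Note that a new timeout only affects entries whose timer is set afterwards; existing entries keep their current timer.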
Jan started his presentation by talking about his Distro Availability Matrix of Netfilter tech page. It lists the software and the versions shipped in a lot of distributions.
The next subject was maintaining translations of the iptables man page. The team is international and could translate the man pages into a few languages. But the question is about finding volunteers for the long term. Jan is fine with taking charge of synchronizing the translations. Any volunteer for translation is welcome.
Then Jan started a discussion about his work on Xtables2. The point discussed was switching iptables to netlink. The issue is that iptables rulesets are huge and the size of a netlink packet is limited. One possibility is to use continuation messages, which are supported by netlink. But cutting the message in the correct place is not easy. During the discussion, clarifications emerged on how to build huge netlink messages.
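The splitting problem can be sketched as follows. This is not Xtables2 code: the function name and the representation of rules as opaque byte strings are assumptions made for illustration. The sketch packs rules greedily into messages and only cuts at rule boundaries. In real netlink, the parts of a multipart message carry the NLM_F_MULTI flag and the sequence ends with an NLMSG_DONE message; that framing is left out here.

```python
def split_rules(rules, max_payload):
    """Group rule blobs into batches that each fit in one message.

    A rule is never split: a new batch is started whenever adding the
    next rule would exceed max_payload bytes.
    """
    batches, current, size = [], [], 0
    for rule in rules:
        if len(rule) > max_payload:
            raise ValueError("a single rule does not fit in one message")
        if current and size + len(rule) > max_payload:
            batches.append(current)
            current, size = [], 0
        current.append(rule)
        size += len(rule)
    if current:
        batches.append(current)
    return batches


rules = [b"a" * 40, b"b" * 40, b"c" * 40]
print([len(batch) for batch in split_rules(rules, 100)])
```

The hard cases are the ones this sketch rejects or ignores: a single rule larger than one message, and the need to apply all parts atomically.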
The last subject was the maintenance of Netfilter. David Miller posted a message on netdev complaining about the Netfilter maintainers. Patrick and Pablo are currently working on setting up a git tree that they could share. This should help to speed up the reaction of the maintainers. As he does a lot of work on iptables, Jan will soon have an account on the Netfilter infrastructure to be able to push patches to the official iptables git tree.
Reverse path filtering is currently only implemented for IPv4. Eric Leblond sent a patch to add support for IPv6, but it was refused by David Miller who, among other points, wants to get rid of rp_filter and would like to see it in the Netfilter code.
The reverse path filter implementation is a single function called fib_validate_source. At first glance, the problem seems relatively simple: reverse source and destination, then look up the output interface. If it matches the incoming interface, the packet is OK.
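The principle can be shown with a toy model. The routing table, interface names and function names below are hypothetical, and the lookup is a naive longest-prefix match rather than anything resembling fib_validate_source; the sketch only illustrates the "route back to the source and compare interfaces" idea.

```python
import ipaddress

# Hypothetical routing table: (prefix, output interface)
ROUTES = [
    ("10.0.0.0/8", "eth1"),
    ("192.0.2.0/24", "eth2"),
    ("0.0.0.0/0", "eth0"),
]


def lookup_output_interface(destination, routes):
    """Naive longest-prefix match returning the output interface, or None."""
    address = ipaddress.ip_address(destination)
    best = None
    for prefix, interface in routes:
        network = ipaddress.ip_network(prefix)
        if address in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, interface)
    return best[1] if best else None


def reverse_path_ok(source, incoming_interface, routes):
    """Accept a packet only if a reply to its source would leave via the same interface."""
    return lookup_output_interface(source, routes) == incoming_interface


print(reverse_path_ok("10.1.2.3", "eth1", ROUTES))
print(reverse_path_ok("10.1.2.3", "eth0", ROUTES))
```

This model returns a single interface per destination, so it cannot express multipath routing, where several output interfaces are valid for the same source.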
But the API is not that simple and implementing it in Netfilter is not easy. For example, in PRE_ROUTING we don’t have the output interface and thus cannot get it easily with simple Netfilter functions. An implementation using standard functions from the routing code is thus necessary. But there is still an issue with multipath routing in IPv4. Florian then tried a second implementation which mimics the behaviour of fib_validate_source.
Some implementation questions were discussed, mainly about how to handle special cases. Patrick proposed to modify the code in PREROUTING to be able to access all interfaces. This would then allow a more Netfilter-based implementation.
Regarding the userspace syntax, this is a match, so a specific iptables rule will have to be added to benefit from the functionality.