Netfilter and the NAT of ICMP error messages

The problem

I’ve recently been working for a customer who needed consulting about some unexplained Netfilter behaviour related to ICMP error messages. He authorized me to share the result of my study, and I thank him for making this blog entry possible.
His problem was that one of his firewalls uses a private interconnection with the border router, and he did not manage to NAT all outgoing ICMP error messages.

The simplified network diagram is the following:

The DMZ is in a private network. The router has a route to the public network via the firewall, and the public network addresses do not exist anywhere as real addresses.
The firewall has a set of DNAT rules to redirect each public IP to the matching private IP:

iptables -A PREROUTING -t nat -d 1.2.3.X -j DNAT --to 192.168.1.X

The interconnection between the router and the firewall uses a private network, let’s say 192.168.42.0/24 with 192.168.42.1 for the firewall. The interface eth0 is the one used as the interconnection interface.

On the firewall, some filtering rules reject part of the FORWARD traffic:

iptables -A FORWARD -d 192.168.1.X -j REJECT
iptables -A FORWARD -d 192.168.1.Y -j REJECT

The issue is related to the ICMP unreachable messages. When someone from the Internet (behind the router) sends a packet to 192.168.1.X or 192.168.1.Y, then:

  • If 192.168.1.X is NATed, the ICMP unreachable message is emitted and seen as coming from 1.2.3.X on eth0.
  • If 192.168.1.Y is not NATed, the ICMP unreachable message is emitted and seen as coming from 192.168.42.1 on eth0.

So a packet going to 192.168.1.Y results in an ICMP message which is not routed by the router because of its private source IP.

To fix the issue, the customer added a Source NAT rule to translate all packets coming from 192.168.42.1 to 1.2.3.1:

iptables -A POSTROUTING -t nat -p icmp -s 192.168.42.1 -o eth0 -j SNAT --to 1.2.3.1

But this rule has no effect on the ICMP unreachable messages.

Explanations

For packets going to X or Y, an ICMP message is sent. Internally, the same function (called icmp_send) is used to send the ICMP error message. This is a standard function and, as such, it uses the best possible local source address. In our case the best address is 192.168.42.1, because the packet has to get back out through eth0.
At this stage, there is no difference between the two ICMP packets and the result should be the same.
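The source address selection can be observed with the ip tool. This is only an illustration with the addresses used in this article, and the sender address is a placeholder:

# The "src" field of the output is the local address the kernel prefers for this
# destination, i.e. the address icmp_send will use (192.168.42.1 in our setup).
ip route get <sender-ip>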

But if nothing more is done, the packet to X will result in a packet going back to the original source and containing internal IP information: the packet has been NATed, so the original packet data embedded in the ICMP message contains 192.168.1.X and not the public IP. This is a real problem, as it leaks private information to the outside.

Fortunately, the packets are handled differently thanks to the ICMP error connection tracking module. It looks inside the payload of the ICMP error message to check whether it belongs to an existing connection. If a connection is found, the ICMP packet is marked as RELATED to the original connection. Once this is done, the ICMP NAT helper makes the reverse transformation so that the packet sent to the network contains only public information. For the packet to X, the source address of the ICMP message and the addresses in its payload are rewritten to the public IP address. This explains the difference between the ICMP error messages triggered by the packets sent to X and to Y.

But this does not explain why the NAT rule inserted by the customer did not work. In fact, the answer was already given: the ICMP packet is marked as belonging to a connection RELATED to the original one. Being in the RELATED state, it will not traverse the NAT rules in POSTROUTING, as only packets whose connection is in state NEW are sent to the nat table.

This analysis can be validated using marking and logging. If we log a packet which belongs to a RELATED connection, and if we are sure that the original connection is the one we are tracing, then our hypothesis is validated. Matching a RELATED connection is easy with the filter “-m conntrack --ctstate RELATED”. To prove that the packet is RELATED to the original connection, we can use the fact that RELATED connections inherit the connection mark of the originating connection. Thus, if we set a connection mark with the CONNMARK target, we will be able to match it on the ICMP error message. The following rules implement this:

iptables -t mangle -A PREROUTING -d 1.2.3.4 -j CONNMARK --set-mark 1
iptables -A OUTPUT -t mangle -m state --state RELATED -m connmark --mark 1 -j LOG

And it logs an ICMP error message when we try to reach 1.2.3.4.
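The connection mark can also be checked directly in the conntrack table, assuming a conntrack tool recent enough to filter on marks:

# List conntrack entries carrying the mark set by the CONNMARK rule above
conntrack -L --mark 1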

Other debug methods

Using conntrack

The conntrack utility can be used to display connection tracking events by using the -E flag:

# conntrack -E
    [NEW] tcp      6 120 SYN_SENT src=192.168.1.12 dst=91.121.96.152 sport=53398 dport=22 [UNREPLIED] src=91.121.96.152 dst=192.168.1.12 sport=22 dport=53398
 [UPDATE] tcp      6 60 SYN_RECV src=192.168.1.12 dst=91.121.96.152 sport=53398 dport=22 src=91.121.96.152 dst=192.168.1.12 sport=22 dport=53398
 [UPDATE] tcp      6 432000 ESTABLISHED src=192.168.1.12 dst=91.121.96.152 sport=53398 dport=22 src=91.121.96.152 dst=192.168.1.12 sport=22 dport=53398 [ASSURED]

This can be really useful to see what transformations are made by the connection tracking system. But it does not help in our case, because the ICMP error message does not trigger any connection creation and therefore no event.

Using TRACE target

The TRACE target is a really useful tool: it allows you to see which rules are matched by a packet. Its usage is really simple. For example, if we want to trace all ICMP traffic coming to the box:

iptables -A PREROUTING -t raw  -p icmp -j TRACE

On our test system, the result was the following:

[ 5281.733217] TRACE: raw:PREROUTING:policy:2 IN=eth0 OUT= MAC=08:00:27:a9:f5:30:0a:00:27:00:00:00:08:00 SRC=192.168.56.1 DST=1.2.3.4 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=ICMP TYPE=8 CODE=0 ID=12114 SEQ=1
[ 5281.737057] TRACE: nat:PREROUTING:rule:1 IN=eth0 OUT= MAC=08:00:27:a9:f5:30:0a:00:27:00:00:00:08:00 SRC=192.168.56.1 DST=1.2.3.4 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=ICMP TYPE=8 CODE=0 ID=12114 SEQ=1
[ 5281.737057] TRACE: nat:PREROUTING:rule:2 IN=eth0 OUT= MAC=08:00:27:a9:f5:30:0a:00:27:00:00:00:08:00 SRC=192.168.56.1 DST=1.2.3.4 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=ICMP TYPE=8 CODE=0 ID=12114 SEQ=1
[ 5281.737057] TRACE: filter:FORWARD:rule:1 IN=eth0 OUT=eth1 MAC=08:00:27:a9:f5:30:0a:00:27:00:00:00:08:00 SRC=192.168.56.1 DST=192.168.42.4 LEN=84 TOS=0x00 PREC=0x00 TTL=63 ID=0 DF PROTO=ICMP TYPE=8 CODE=0 ID=12114 SEQ=1

In the raw table, in PREROUTING, the policy is applied (here ACCEPT). In nat PREROUTING the first rule matches (a mark rule) and the second one matches too. Finally, in FORWARD, the first rule matches (here the REJECT rule). TRACE only follows the initial packet and thus does not display any information about the ICMP error message.

Conclusion

So Netfilter’s behavior is correct when it translates back the elements initially transformed by NAT. The surprising part comes from the fact that the NAT rules in POSTROUTING are never reached, but this is needed to avoid complicated issues caused by multiple transformations. Regarding the interconnection with the router, you should really use a public network if you want your ICMP error messages to be seen on the Internet.

A month in the life of Debian in 2000 and 2012

Visualizing Debian packages upload

The Ultimate Debian Database provides a way to get information about all package uploads to the Debian repositories across time. After a discussion with Lucas Nussbaum at Distro Recipes, he made a webpage available giving access to the package upload history in a gource-compatible file format.

Using this, I was able to create videos of Debian’s evolution over time. I’ve generated two videos showing one month of package uploads in 2000 and, for comparison, one month in 2012.

The first video is really peaceful, even if the lack of activity causes gource to do some jumps in time:

The second video is made with exactly the same time scale, and the rhythm is completely crazy:

More info about video generation

The raw data is the following: udd.gource.log.bz2. I’ve transformed it to add section information to the package names using the following script:

[python]
#!/usr/bin/python

import fileinput
import apt

# APT cache used to resolve each package to its section
cache = apt.Cache()

for line in fileinput.input():
    [date, user, mode, package] = line.split("|")
    pack = package.rstrip()

    # Skip entries without an uploader
    if len(user) == 0:
        continue

    try:
        pkg = cache[pack]  # Access the Package object via python-apt
        package = pkg.section + "/" + pack
    except KeyError:
        # Package unknown to the local cache: file it under "undef"
        package = "undef" + "/" + pack

    print date + "|" + user + "|" + mode + "|" + package
[/python]

The result is the following file: udd.gource-section.log.bz2. Once extracted, it can be visualized with gource:

gource --log-format custom  udd.gource-section.log

The next step was to extract the uploads at the start (in 2000) and the latest uploads (in 2012). I simply used tail and head to do so (a rough sketch of this step is given after the command below). The generation of the videos was made following the indications given on the gource website:

gource -1920x1080 -o - udd.gource-end.log | ffmpeg -y -r 60 -f image2pipe -vcodec ppm -i - -vcodec libvpx -b 13000K debian-2012.webm
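For reference, the extraction step mentioned above could look roughly like this; the line counts are placeholders and the name of the 2000 extract is an assumption (only udd.gource-end.log appears in the command above):

# Keep the first lines (uploads from 2000) and the last lines (uploads from 2012)
head -n 50000 udd.gource-section.log > udd.gource-begin.log
tail -n 50000 udd.gource-section.log > udd.gource-end.log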

WiFi interface and suricata AF_PACKET IPS mode

An unusual setup can lead to surprises

On the 5th of December 2012, I set up Suricata in AF_PACKET IPS mode between a WiFi interface and an Ethernet interface. The result was surprising, as it led to a crash after some time.

The issue was linked to the defrag option of AF_PACKET fanout. I proposed a patch on December 7th, 2012, and after a discussion with David Miller and Johannes Berg, Johannes proposed a better patch which was included in the official tree. So the problem is fixed for kernels greater than or equal to 3.7.

Affected kernels

Here’s the list of affected kernels:

  • All kernels prior to 3.2.36
  • All 3.3.x kernels
  • All 3.4.x kernels prior to 3.4.25
  • All 3.5.x kernels prior to 3.5.7.3
  • All 3.6.x kernels prior to 3.6.11

Workaround in Suricata

If you can’t update to an unaffected kernel, you can set defrag to no in the af-packet configuration to avoid the issue:

af-packet:
  - interface: wlan0
    # In some fragmentation case, the hash can not be computed. If "defrag" is set
    # to yes, the kernel will do the needed defragmentation before sending the packets.
    defrag: no
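For completeness, here is a minimal sketch of the corresponding AF_PACKET IPS pairing between the WiFi and Ethernet interfaces, using the standard copy-mode/copy-iface options from the Suricata documentation (the Ethernet interface name is an assumption):

af-packet:
  - interface: wlan0
    copy-mode: ips      # accepted packets are sent out on the peer interface
    copy-iface: eth0
    defrag: no
  - interface: eth0
    copy-mode: ips
    copy-iface: wlan0
    defrag: no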

Jan Engelhardt, “Merge Me”

Xtables2

Xtables2 removes the different tables that exist in current Netfilter. If a rule only applies to a specific type of traffic (think of the owner id match, for example), then it simply does not match for other traffic.

One of the interests of having a single table is that it is possible to easily update the ruleset with a single atomic swap.

Custom chains can still be created by hand, as they are very useful to factorize rules.

To avoid performance issues, rule counters are disabled by default and can be activated per rule.

Discussion

There is a discussion between Patrick McHardy and Jan Engelhardt about the blob usage in xtables2. Jan’s point is that a blob should provide better CPU cache usage than dynamically allocated structures. Patrick points out that this makes the state swap difficult, and it turns out that the state of matches which need one is not stored in the blob but in allocated memory. So for Patrick this breaks the initial idea.

Jan restarts the discussion that was interrupted on the mailing list. Patrick and Pablo insist on the lack of jumps in the current xtables2 implementation, and they ask about the availability of some of the interesting features of nftables. Jan argues that this can be done and that there is already good documentation for xtables2 but not for nftables.

This continues in a sterile discussion where Jan argues he can develop the features if all parties agree on xtables2’s initial patches. He is answered that the really interesting thing is the features.

Eric Dumazet makes an interesting point by asking for benchmarks: with numbers, he would have some real information on which to base a decision. Holger Eitzenberger asks both parties to agree on complete testing modalities.

Victor Julien asked Jan if backward compatibility would be provided, and Jan answered that this could be done but that it is not important to him. David Miller answered that backward compatibility is exactly what is important for users.

Jan agrees that xtables2 does not currently address the code redundancy issue, which is severe in the different matches and which is fixed by nftables. He talked about using nftables code…

Performance benchmarking is almost agreed upon and may occur soon, but in its current state xtables2 cannot compete with nftables on features.

NFWS group photo

Top starting from left:
Jan Engelhardt, Tomasz Bursztyka, Daniel Borkmann, Julien Vehent, Holger Eitzenberger, Victor Julien, Eric Leblond, Eric Dumazet, Nicolas Dichtel, David Miller, S. Park

Bottom starting from left:
Martin Topholm, Jesper Sander Lindgren, Pablo Neira Ayuso, Simon Horman, Jozsef Kadlecsik, Jesper Dangaard Brouer, Patrick McHardy, Thomas Graf

Tomasz Bursztyka, ConnMan usage of Netfilter

Introduction

ConnMan is a network manager which supports a lot of different layers, from Ethernet and WiFi to NFC and link sharing.

It features automatic link switching and allows you to select your preferred type of connection. The communication with the UI is event-based, so it is easy to implement, as only a few window types are needed.

Discussion

David Miller pointed out that DHCP clients very often put the interface in promiscuous mode, which is not a good idea, as it is like having a tcpdump started on every laptop. As ConnMan has its own implementation, they could maybe take this into account and improve the situation. This is in fact already the case, as its DHCP client uses an alternate method.

Jozsef Kadlecsik, ipset status

Tc interaction

tc interaction has been contributed by Florian Westphal. It is thus now possible to use a set match to differentiate the QoS or routing of packets. This opens a wide area for experimentation.
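As a hedged sketch of what this enables (the set name and classid are assumptions), packets whose source address belongs to a set can be steered to a dedicated class with the ipset ematch:

# Send packets whose source address is in the "bulk" set to class 1:10
tc filter add dev eth0 parent 1: protocol ip basic match ipset(bulk src) classid 1:10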

Packet and byte counters

This is a fairly large rewrite of the set elements and extensions which adds packet and byte counters to the elements.

The syntax has been updated:

ipset add <setname> <entry> packets n bytes m

It is also possible to do checks on the counters! For example, ipset will be able to match on a set and refine the selection by specifying the number of packets that must have been seen before matching. Counters can also be updated by the set match.
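A small hedged sketch of how this could be used once released, with the syntax that later shipped in ipset and the iptables set match (set name and threshold are made up):

# Create a set with the counter extension and add an entry with initial counters
ipset create trackers hash:ip counters
ipset add trackers 192.0.2.1 packets 0 bytes 0
# Only act once more than 10 packets from this entry have been seen
iptables -A INPUT -m set --match-set trackers src --packets-gt 10 -j DROP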

This will offer some really interesting possibilities to rule writers.

This work is currently in progress and should be released really soon.

Discussion

nftables contains a set system which serves almost the same purpose, but there are some differences between them. An nftables set can be used to parameterize a rule: for example, a set can be built from interface:port pairs and a DNAT rule can be written which picks the destination port to NAT to depending on the interface.
So it is possible to see ipset sets as performance-specialized sets that could be implemented and used by nftables.
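For illustration only, here is what such a data-driven rule looks like in the nft syntax that shipped later (addresses and ports are made up); an anonymous map picks the DNAT target depending on a packet field:

# nft script snippet: the DNAT destination is taken from a map keyed on the port
add rule ip nat prerouting dnat to tcp dport map { 80 : 192.168.1.100, 8888 : 192.168.1.101 }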

Jozsef Kadlecsik is talking about extending conntrackd to synchronize ipset, and maybe nftables rules. This would allow hosts to be fully synchronized. The attendees found this really interesting, and it should be done and provided as an option.

Pablo Neira Ayuso, nftables strikes back

Introduction

This is a new kernel packet filtering framework. The only change is on the iptables side: the Netfilter hooks, the connection tracking system and NAT are unchanged.
It provides backward compatibility. nftables was released in March 2009 by Patrick McHardy. It has been revived in the preceding months by Pablo Neira Ayuso and other hackers.

Architecture

It uses a pseudo-state machine in kernel-space which is similar to BPF:

  • 4 general purpose registers (128 bits long each) + 1 verdict register
  • an instruction set (which can be extended)

Here’s an example of the existing instructions:

reg = pkt.payload[offset, len]
reg = cmp(reg1, reg2, EQ)
reg = byteorder(reg1, NTOH)
reg = pkt.meta(mark)
reg = lookup(set, reg1)
reg = ct(reg1, state)
reg = lookup(set, data)

Using this instruction set, complex matches can be developed from userspace. For example, you can check whether the source and destination ports are equal by doing two pkt.payload operations and then a comparison:

reg1 = pkt.payload[offset_src_port, len]
reg2 = pkt.payload[offset_dst_port, len]
reg = cmp(reg1, reg2, EQ)

This was strictly impossible to do with iptables without coding a C kernel module.
C modules still need to be implemented when a match cannot be expressed with the existing instructions (for example, the hashlimit module cannot be expressed with the current instructions). In this case, the way to do it will be to provide a new instruction that allows more combinations in the future.

nftables shows great flexibility. The rule counters are now optional, and they come in two flavours: a system-wide counter with a lock (and thus a performance impact) and a per-cpu counter. You can choose which type you want when writing the rule.
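As a purely illustrative sketch in the nft syntax that shipped later (chain and ports are made up), only rules that carry the counter statement pay the counting cost:

# nft script snippet: the first rule has no counter, the second one does
add rule ip filter input tcp dport 22 accept
add rule ip filter input tcp dport 80 counter accept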

Nftables operation

All communications are made through Netlink messages, and this is the main change from a user perspective. This change makes incremental updates possible. iptables used a different approach: it fetched the whole ruleset from the kernel in binary format, made its update to this binary blob and sent it back to the kernel. So the size of the ruleset has a big impact on update performance.

Rules have a generation mask attached, encoding whether the rule is currently active, will be active in the next generation or will be destroyed in the next generation. Adding rules starts a transaction and sends netlink messages adding or deleting rules. Then the commit operation asks the kernel to switch the ruleset.
This allows the active ruleset to change at once, without any transition state where only a part of the rules is deployed.
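As a hedged sketch with the nft tool that shipped later, the whole file below is committed as a single transaction, so the new ruleset replaces the old one atomically (the table and rule content are only an example):

cat > ruleset.nft << 'EOF'
flush ruleset
table ip filter {
    chain forward {
        type filter hook forward priority 0;
        ct state established,related accept
    }
}
EOF
nft -f ruleset.nft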

Impact of the changes for users

The iptables backward compatibility will be provided, so there is no impact on users. xtables is a binary which uses the same syntax as iptables and which is provided by nftables. Linking the xtables binary to iptables and ip6tables will be enough to ensure compatibility of existing scripts.

Users will have more and more interest in switching due to new features that will be possible in nftables but not with iptables.

A high-level library will be provided to handle rules. It will allow easy rule writing and will have an official public API.

The main issue in the migration process will be for frontends and tools which use the IPTC library, which is not officially public but is used by some. As this library is an internal iptables library relying on the binary exchange, it will not work on a kernel using nftables. So applications using IPTC will need to switch to the upcoming nftables high-level library.

The discussion between the attendees also came to the conclusion that providing an easy-to-use search operation in the library would be a good way to allow the different firewall frontends to improve. By updating their logic to do a search and then add a rule only if needed, they would avoid conflicting with each other, as currently happens for example with NetworkManager, which destroys any masquerading policy in place by blindly inserting a masquerading rule on top when it thinks it needs one.
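With today’s iptables, the same “search then add” logic can already be approximated with the -C (check) option; the addresses and interface below are made up for illustration:

# Append the masquerading rule only if an identical rule is not already present
iptables -t nat -C POSTROUTING -s 192.168.1.0/24 -o eth0 -j MASQUERADE 2>/dev/null \
  || iptables -t nat -A POSTROUTING -s 192.168.1.0/24 -o eth0 -j MASQUERADE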

Release date

A lot of work is still needed, but there is a good chance the project will be proposed upstream before the end of 2013. The API needs to be carefully reviewed to be sure it will remain correct over the long term. The atomic rule swapping system also needs a good review.

More and more manpower is coming to the project, and with luck it may be ready in around 6 months.

Simon Horman, MPLS Enlightened Open vSwitch

Open vSwitch is a multi-layer switch. It is designed to enable network automation through programmatic extension, while still supporting standard management interfaces and protocols.

OpenFlow is a management protocol supported by Open vSwitch. OpenFlow has basic support for MPLS: it features a minimal set of operations to configure MPLS correctly.
OpenFlow MPLS support is partially implemented in Open vSwitch, but there are some difficulties.

Some of the operations update L3+ parameters such as the TTL. These must be updated in the same manner in the MPLS header and in the packet header, and this is quite complicated, as it supposes decoding the packet below MPLS. But the MPLS header does not include the encapsulated ethertype, so it is almost impossible to access the packet structure correctly.

A possible solution is to reinject the packet after modification, so that each layer is modified in a separate step. This is currently a work in progress.

Victor Julien, Suricata and Netfilter

Suricata and Netfilter could be better friends, as they do some common work such as decoding packets and maintaining a flow table.

In IPS mode, Suricata receives raw packets from libnetfilter_queue. It has to parse these packets, but this kind of work has already been done by the kernel, so it should be possible to avoid duplicating it.

In fact, Netfilter’s own parsing work is limited, as it only deals with the IP header structures. Patrick McHardy proposed that Netfilter could share the offsets it has already computed, but this is not the most costly part anyway.

The flow discussion was more interesting, because conntrack really does work similar to what Suricata does. Using the CT extension of libnetfilter_queue, Suricata will be able to get access to all the conntrack information, and at first glance it seems to contain almost all the information needed. So it should be possible to remove the flow engine from Suricata. The garbage collection operation would not be necessary, as Suricata would get the information via the conntrack destroy events.

Jozsef Kadlecsik proposed to use Tproxy to redirect flows and provide a “socket” stream instead of individual packets to Suricata. This would change Suricata a lot but could provide an interesting alternative mode.