Netfilter – Page 2 – To Linux and beyond !

Investigation on an attack tool used in China

Log analysis experiment

I’ve been playing lately with logstash using data from the ulogd JSON output plugin and the Suricata full JSON output as well as standard system logs.

Ulogd is getting Netfilter firewall logs from Linux kernel and is writing them in JSON format. Suricata is doing the same with alert and other traces. Logstash is getting both log as well as sytem log. This allows to create some dashboard with information coming from multiple sources. If you want to know how to configure ulogd for JSON output check this post. For suricata, you can have a look at this one.

Ulogd output is really new and I was experimenting with it in Kibana. When adding some custom graphs, I’ve observed some strange things and decided to investigate.

Displaying TCP window

TCP window size at the start of the connection is not defined in the RFC. So every OSes have choozen their own default value. It was thus looking interesting to display TCP window to be able to find some strange behavior. With the new ulogd JSON plugin, the window size information is available in the tcp.window key. So, after doing a query on tcp.syn:1 to only get TCP syn packet, I was able to graph the TCP window size of SYN packets.

Most of the TCP window sizes are well-known and correspond to standard operating systems:

65535 is or MacOSX or some MS Windows OS.
14600 is used by some Linux.

The first uncommon value is 16384. Graph are clickable on Kibana, so I was at one click of some interesting information.

First information when looking at dashboard after selection TCP syn packet with a window size of 16384 was the fact, it was only ssh scanning:

Second information is the fact that, according to geoip, all IPs are chinese:

A SSH scanning software

When looking at the details of the attempt made on my IP, there was something interesting:

For all hosts, all requests are done with the same source port (6000). This is not possible to do that with a standard ssh client where the source port is by default choosen by the operating system. So or we have a custom standard software that perform a bind operation to port 6000 at socket creation. This is possible and one advantage would be to be easily authorized through a firewall if the country had one. Or we could have a software developped with low level (RAW) sockets for performance reason. This would allow a faster scanning of the internet by skipping OS TCP connection handling. There is a lot of posts regarding the usage of port 6000 as source for some scanning but I did not find any really interesting information in them.

On suricata side, most of the source IPs are referenced in ET compromised rules:

Analysing my SSH logs, I did not see any trace of ssh bruteforce coming from source port 6000. But when selecting an IP, I’ve got trace of brute force from at least one of the IP:

These attackers seems to really love the root account. In fact, I did not manage to find any trace of attempts for user different than root for IP address that are using the port 6000.

Getting back to my ulogd dashboard, I’ve displayed more info about the used scanning sequence:

The host scans the box using a scanner using raw socket, then it attacks with a few minutes later with SSH bruteforce tool. The bruteforce tool has a TCP window size at start of 65535. It indicates that a separated software is used for scanning. So we should have an queueing mechanism between the scanner and the bruteforce tool. This may explains the duration between the scan and the bruteforce. Regarding TCP window size value, 65535 seems to indicate a Windows server (which is coherent with TTL value).

Looking at the scanner traffic

Capturing a sample traffic did not give to much information. This is a scanner sending a SYN and cleanly sending a reset when it got the SYN, ACK:

14:27:54.982273 IP (tos 0x0, ttl 103, id 256, offset 0, flags [none], proto TCP (6), length 40)
    218.2.22.118.6000 > 192.168.1.19.22: Flags [S], cksum 0xa525 (correct), seq 9764864, win 16384, length 0
14:27:54.982314 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 44)
    192.168.1.19.22 > 218.2.22.118.6000: Flags [S.], cksum 0xeee2 (correct), seq 2707606274, ack 9764865, win 29200, options [mss 1460], length 0
14:27:55.340992 IP (tos 0x0, ttl 111, id 14032, offset 0, flags [none], proto TCP (6), length 40)
    218.2.22.118.6000 > 192.168.1.19.22: Flags [R], cksum 0xe48c (correct), seq 9764865, win 0, length 0

But it seems the RST packet after the SYN, ACK is not well crafted:

More info on SSH bruteforce tool

Knowing the the behavior was scanning from 6000 and starting a normal scanning, I’ve focused the Suricata dashboard on one IP to see if I had some more information:

One single IP in the list of the scanning host is triggering multiple alerts. The event table confirmed this:

Studying the geographical repartition of the Libssh alert, it appears there is used in other countries than China:

So, libssh is not a discriminatory element of the attacks.

Conclusion

A custom attack tool has been been deployed on some Chinese IPs. This is a combination of a SSH scanner based on RAW socket and a SSH bruteforce tool. It tries to gain access to the root account of system via the ssh service. On an organisational level, it is possible there is a Chinese initiative trying to get the low-hanging fruit (system with ssh root account protected by password) or maybe it is just a some organization using some compromised Chinese IPs to try to get control other more boxes.

Why you will love nftables

Linux 3.13 is out

Linux 3.13 is out bringing among other thing the first official release of nftables. nftables is the project that aims to replace the existing {ip,ip6,arp,eb}tables framework aka iptables.
nftables version in Linux 3.13 is not yet complete. Some important features are missing and will be introduced in the following Linux versions.
It is already usable in most cases but a complete support (read nftables at a better level than iptables) should be available in Linux 3.15.

nftables comes with a new command line tool named nft. nft is the successor of iptables and derivatives (ip6tables, arptables). And it has a completely different syntax.
Yes, if you are used to iptables, that’s a shock. But there is a compatibility layer that allow you to use iptables even if filtering is done with nftables in kernel.

There is only really few documentation available for now. You can find my nftables quick howto and there is some other initiatives that should be made public soon.

Some command line examples

Multiple targets on one line

Suppose you want to log and drop a packet with iptables, you had to write two rules. One for drop and one for logging:

iptables -A FORWARD -p tcp --dport 22 -j LOG
iptables -A FORWARD -p tcp --dport 22 -j DROP

With nft, you can combined both targets:

nft add rule filter forward tcp dport 22 log drop

Easy set creation

Suppose you want to allow packets for different ports and allow different icmpv6 types. With iptables, you need to use something like:

ip6tables -A INPUT -p tcp -m multiport --dports 23,80,443 -j ACCEPT
ip6tables -A INPUT -p icmpv6 --icmpv6-type neighbor-solicitation -j ACCEPT
ip6tables -A INPUT -p icmpv6 --icmpv6-type echo-request -j ACCEPT
ip6tables -A INPUT -p icmpv6 --icmpv6-type router-advertisement -j ACCEPT
ip6tables -A INPUT -p icmpv6 --icmpv6-type neighbor-advertisement -j ACCEPT

With nft, sets can be use on any element in a rule:

nft add rule ip6 filter input tcp dport {telnet, http, https} accept
nft add rule ip6 filter input icmpv6 type { nd-neighbor-solicit, echo-request, nd-router-advert, nd-neighbor-advert } accept

It is easier to write and it is more efficient on filtering side as there is only one rule added for each protocol.

You can also use named set to be able to make them evolve other time:

# nft -i # use interactive mode
nft> add set global ipv4_ad { type ipv4_address;}
nft> add element global ipv4_ad { 192.168.1.4, 192.168.1.5 }
nft> add rule ip global filter ip saddr @ipv4_ad drop

And later when a new bad boy is detected:

# nft -i
nft> add element global ipv4_ad { 192.168.3.4 }

Mapping

One advanced feature of nftables is mapping. It is possible to use to different type of data and to link them.
For example, we can associate iface and a dedicated rule set (stored in a chain and created before). In the example, the chains are named low_sec and high_sec:

# nft -i
nft> add map filter jump_map { type ifindex : verdict; }
nft> add element filter jump_map { eth0 : jump low_sec; }
nft> add element filter jump_map { eth1 : jump high_sec; }
nft> add rule filter input iif vmap @jump_map

Now, let’s say you have a new dynamic interface ppp1, it is easy to setup filtering for it. Simply add it in the jump_map mapping:

nft> add element filter jump_map { ppp1 : jump low_sec; }

On administration and kernel side

More speed at update

Adding a rule in iptables was getting dramatically slower with the number of rules and that’s explained why script using iptables call are taking a long time to complete. This is not anymore with nftables which is using atomic and fast operation to update rule sets.

Less kernel update

With iptables, each match or target was requiring a kernel module. So, you had to recompile kernel in case you forgot something or want to use something new.
this is not anymore the case with nftables. In nftables, most work is done in userspace and kernel only knows some basic instruction (filtering is implemented in a pseudo-state machine).
For example, icmpv6 support has been achieved via a simple patch of the nft tool.
This type of modification in iptables would have required kernel and iptables upgrade.

What’s new in ulogd 2.0.3

New features in ulogd 2.0.3 release

Database framework update

ulogd 2.0.3 implements two new optional modes for database connections:

backlog system to avoid event loss in case of database downtime
running mode where acquisition is made in one thread and queries to databases are made in separate threads to reduce latency in the treatment of kernel messages

These two modes are described below.

Postgresql update

Postgresql output plugin was only offering a small subset of Postgresql connection-related options.
It is now possible to use the connstring to use all possible parameters of libpq param keywords. If set, this variable has precedence on other variables.

One interest of connstring is to be able to use a SSL-encrypted connection to the database by using the sslmode keyword:

connstring="host=localhost port=4321 dbname=nulog user=nupik password=changeme sslmode=verify-full sslcert=/etc/ssl/pgsql-cert.pem sslkey=/etc/ssl/pgsql-key.pem sslrootcert==/etc/ssl/pgsql-rootcert.pem"

Event loss prevention

ulogd 2.0.3 implements a backlog system for all database output plugins using the abstraction framework for database connection. At the writing of this article, this is MySQL, PostgreSQL and DBI. Memory will be dedicated to store the queries that can not be run because of an unavailability of the database. Once the database is back, the queries are played in order.

To activate this mode, you need to set the backlog_memcap value in the database definition.

[mysql1]
db="nulog"
...
procedure="INSERT_PACKET_FULL"
backlog_memcap=1000000
backlog_oneshot_requests=10

Set backlog_memcap to the size of memory that will be allocated to store events in memory if data is temporary down. The variable backlog_oneshot_requests is storing the number of queries to process at once before reading a kernel message.

Multithreaded database output

If the ring buffer mode is active, a thread will be created for each stack involving the configured database. It will connect to the database and execute the queries.
The idea is to try to avoid buffer overrun by minimizing the time requested to treat kernel message. Doing synchronous SQL request, as it was made before was causing a delay which could cause some messages to be lost in case of burst from kernel side. With this new mode, the time to process kernel message is equal to the time of the formatting of the query.

To activate this mode, you need to set ring_buffer_size to a value superior to 1.
The value stores the number of SQL requests to keep in the ring buffer.

[pgsql1]
db="nulog"
...
procedure="INSERT_PACKET_FULL"
ring_buffer_size=1000

The ring_buffer_size has precedence on the backlog_memcap value. And backlog will be disabled if the ring buffer is active as ring buffer also provide packet loss prevention. ring_buffer_size is the maximum number of queries to keep in memory.

Talk about nftables at Kernel Recipes 2013

I’ve just gave a talk about nftables, the iptables successor, at Kernel Recipes 2013. You can find the slides here:
2013_kernel_recipes_nftables

A description of the talk as well as slides and video are available on Kernel Recipes website

Here’s the video of my talk:

I’ve presented a video of nftables source code evolution:

The video has been generated with gource. Git history of various components have been merged and the file path has been prefixed with project name.

Some ulogd db improvements

Some new features

I’ve just pushed to ulogd tree a series of patches. They bring two major improvements to database handling:

Backlog system: temporary store SQL query in memory if database is down.
Ring buffer system: a special mode with a thread to read data from kernel and a thread to do the SQL query.

The first mode is attended for preventing data loss when database is temporary down. The second one is an attempt to improve performance and the resistance to netlink buffer overrun problem.
The modification has been done in the database abstraction layer and it is thus available in MySQL, PostgreSQL and DBI.

Backlog system

To activate this mode, you need to set the backlog_memcap value in the database definition.

[mysql1]
db="nulog"
...
procedure="INSERT_PACKET_FULL"
backlog_memcap=1000000
backlog_oneshot_requests=10

The backlog system will prevent data loss by storing queries in memory instead of executing them. The waiting queries will be run in order when the connection to the database is restored.

Ring buffer setup

To activate this mode, you need to set ring_buffer_size to a value superior to 1.
The value stores the number of SQL requests to keep in the ring buffer.

[pgsql1]
db="nulog"
...
procedure="INSERT_PACKET_FULL"
ring_buffer_size=1000

The ring_buffer_size has precedence on the backlog_memcap value. And backlog will be disabled if the ring buffer is active.

Conclusion

Feel free to test it and don’t hesitate to provide some feedback!

Netfilter and the NAT of ICMP error messages

The problem

I’ve been recently working for a customer which needed consultancy because of some unexplained Netfilter behaviors related to ICMP error messages. He authorizes me to share the result of my study and I thank him for making this blog entry possible.
His problem was that one of his firewalls is using a private interconnexion with their border router and the customer did not manage to NAT all outgoing ICMP error messages.

The simplified network diagram is the following:

The DMZ is in a private network. The router has a route to the public network via the firewall and the public network address do not exists.
The firewall has set of DNAT rules to redirect a public IP to the matching private IP:

iptables -A PREROUTING -t nat -d 1.2.3.X -j DNAT --to 192.168.1.X

The interconnection between the router and firewall is made using a private network. Let’s say 192.168.42.0/24 and 192.168.42.1 for the firewall. The interface eth0 is the one used as interconnection interface.

On the firewall, some filtering rules reject some FORWARD traffic:

iptables -A FORWARD -d 192.168.1.X -j REJECT
iptables -A FORWARD -d 192.168.1.Y -j REJECT

The issue is related with the ICMP unreachable messages. When someone from internet (behind the router) is sending a packet to 192.168.1.X or 192.168.1.Y then:

If 192.168.1.X is NATed then the ICMP unreachable message is emitted and seen as coming from 1.2.3.X on eth0.
If 192.168.1.Y is not NATed then the ICMP unreachable message is emitted and seen as coming from 192.168.42.1 on eth0.

So, a packet going to 192.168.1.Y results in a ICMP message which is not routed by the router due to the private IP.

To fix the issue, the customer has added a Source NAT rules to translate all packet coming from 192.168.42.1 to 1.2.3.1:

iptables -A POSTROUTING -t nat -p icmp -s 192.168.42.1 -o eth0 -j SNAT --to 1.2.3.1

But this rules has no effect on the ICMP unreachable message.

Explanations

In the case of packets going to X or Y, an ICMP message is sent. Internally the same function (called icmp_send) is used for to send the icmp error message. This is a standard function and
as such, it uses the best local source address possible. In our case the best address is 192.168.42.1 because the packet has to get back through eth0.
At current stage, there is no difference between the two ICMP packets and the result should be the same.

But if nothing is done, the packet to X will result in a packet going to the original source and containing the internal IP information: the packet has been NATed so we have 192.168.1.X and not the public IP in the original packet data contained in the ICMP message. This is a real problem as this will leak private information to the outside.

Hopefully, the packets are handled differently due to the ICMP error connection tracking module. It searches in the payload part of the ICMP error message if it belongs to existing connection. If a connection is found, the IMCP packet is marked as RELATED to the original connection. Once this is done, the ICMP nat helper makes the reverse transformation to send to the network a packet containing only public information. For packet to X, the source addresses of the ICMP messages and payload are modified to the public IP address. This explains the difference between the ICMP error message sent because of packet sent to X or sent to Y.

But this does not explain why the NAT rules inserted by the customer did not work. In fact, the response was already made: the ICMP packet is marked as belonging to a connection related to the original one. Being in a RELATED state, it will not cross the NAT in POSTROUTING as only packet with a connection in state NEW are sent to the nat tables.

The validation of this study can be done by using marking and logging. If we log a packet which belong to a RELATED connection and if we are sure that the original connection is the one we are tracing then our hypothesis is validated. Getting a RELATED connection is easy with the filter: “-m conntrack –ctstate RELATED”. To prove that the packet is RELATED to the original connection, we have to use the fact that RELATED connection inherit of the connection mark of the originating connection. Thus, if we set a connection mark with the CONNMARK target, we will be able to match it in the ICMP error message. The following rules implement this:

iptables -t mangle -A PREROUTING -d 1.2.3.4 -j CONNMARK --set-mark 1
iptables -A OUTPUT -t mangle -m state --state RELATED -m connmark --mark  1 -j LOG

And it logs an ICMP error message when we try to reach 1.2.3.4.

Other debug methods

Using conntrack

The conntrack utils can be used to display connection tracking events by using the -E flag:

# conntrack -E
    [NEW] tcp      6 120 SYN_SENT src=192.168.1.12 dst=91.121.96.152 sport=53398 dport=22 [UNREPLIED] src=91.121.96.152 dst=192.168.1.12 sport=22 dport=53398
 [UPDATE] tcp      6 60 SYN_RECV src=192.168.1.12 dst=91.121.96.152 sport=53398 dport=22 src=91.121.96.152 dst=192.168.1.12 sport=22 dport=53398
 [UPDATE] tcp      6 432000 ESTABLISHED src=192.168.1.12 dst=91.121.96.152 sport=53398 dport=22 src=91.121.96.152 dst=192.168.1.12 sport=22 dport=53398 [ASSURED]

This can be really useful to see what transformation are made by the connection tracking system. But this does not work in your case because the icmp message does not trigger any connection creation and so no event.

Using TRACE target

The TRACE target is a really useful tool. It allows you to see which rules are matched by a packet. It’s usage is really simple. For example, if we want to trace all ICMP traffic coming to the box:

iptables -A PREROUTING -t raw  -p icmp -j TRACE

In our test system, the result was the following:

[ 5281.733217] TRACE: raw:PREROUTING:policy:2 IN=eth0 OUT= MAC=08:00:27:a9:f5:30:0a:00:27:00:00:00:08:00 SRC=192.168.56.1 DST=1.2.3.4 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=ICMP TYPE=8 CODE=0 ID=12114 SEQ=1
[ 5281.737057] TRACE: nat:PREROUTING:rule:1 IN=eth0 OUT= MAC=08:00:27:a9:f5:30:0a:00:27:00:00:00:08:00 SRC=192.168.56.1 DST=1.2.3.4 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=ICMP TYPE=8 CODE=0 ID=12114 SEQ=1
[ 5281.737057] TRACE: nat:PREROUTING:rule:2 IN=eth0 OUT= MAC=08:00:27:a9:f5:30:0a:00:27:00:00:00:08:00 SRC=192.168.56.1 DST=1.2.3.4 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=ICMP TYPE=8 CODE=0 ID=12114 SEQ=1
[ 5281.737057] TRACE: filter:FORWARD:rule:1 IN=eth0 OUT=eth1 MAC=08:00:27:a9:f5:30:0a:00:27:00:00:00:08:00 SRC=192.168.56.1 DST=192.168.42.4 LEN=84 TOS=0x00 PREC=0x00 TTL=63 ID=0 DF PROTO=ICMP TYPE=8 CODE=0 ID=12114 SEQ=1

In the raw TABLE, in PREROUTING the policy is applied (here ACCEPT). In nat PREROUTING the first rule is matching (a mark rule) and the second one is matching too. Finally in FORWARD, the first rule is matching (here the REJECT rule). TRACE is only following the initial packet and thus does not display any information about the ICMP error message.

Conclusion

So Netfilter’s behavior is correct when it translate back the elements initially transformed by NAT. The surprising part comes from the fact that the NAT rules in POSTROUTING are not reach. But this is needed to avoid any complicated issue by doing multiple transformation. Regarding interconnexion with router, you should really use a public network if you want your ICMP error messages to be seen on Internet.

Jan Engelhardt, â€œMerge Meâ€

Xtables2

xtables 2 suppress the different tables that exits in current Netfilter. If a rule only apply to a specific type of traffic (read owner id match per-example) then it just don’t match.

One of the interest to have one single table is that it is possible to easily update the ruleset by just doing a single atomic swap.

Manual chains can be created by hand as there are very useful to create factorized rules.

To avoid performance issue, rules counter are disabled by default and can be activated on rules.

Discussion

There is a discussion between Patrick McHardy and Jan Engelhardt over the blob usage in xtables2. The point of Jan is that blob should provide a better CPU cache usage than a dynamically allocated structure. Patrick points out that this will make the state swap difficult and it occurs that the state of matches needing one are not stored in the blob but in allocated memory. So for Patrick it breaks the initial idea.

Jan restarts the discussion that was interrupted on the ML. Patrick and Pablo insist on the lack of jump in current xtables2 implementation. And they ask for the availability of some interesting features of nftables. Jan argues that this can be done and that there is already a good documentation of xtables2 but not for nftables.

This continues in a sterile discussion where Jan argues he can develop the features if all parts agreed on xtables2’s starting patches. He’s answered that the real interesting things is the features.

Eric Dumazet makes an interesting point on asking for benchmarks. If he’s got numbers he will have some real information to decide. Holger Eitzenberger is asking for both party to agree on complete testing modality.

Victor Julien asked Jan if backward compatibility will be provided and Jan is answering that this could be done but that it is not important to him. David Miller answered that backward compatibility is the thing that is important for users.

Jan agree that xtables2 does not currently take into account the code redundancy issue that is severe in the different matches and that is fixed by nftables. He talked about using nftables code…

Performance benchmarking is almost agreed and may occurs soon but in its current state xtables2 can not compete to nftables features.

NFWS group photo

Top starting from left:
Jan Engelhardt, Tomasz Bursztyka, Daniel Borkmann, Julien Vehent, Holger Eitzenberger, Victor Julien, Eric Leblond, Eric Dumazet, Nicolas Dichtel, David Miller, S. Park

Bottom starting from left:
Martin Topholm, Jesper Sander Lindgren, Pablo Neira Ayuso, Simon Horman, Jozsef Kadlecsik, Jesper Dangaard Brouer, Patrick McHardy, Thomas Graf

Tomasz Bursztyka, connMan usage of Netfilter

Introduction

connMan is a network manager which has support for a lot of different layers from ethernet and WiFi to NFC and link sharing.

It features automatic link switch and allow you to select your preferred type of support. The communication with UI is event based so it is easy to do as only a few windows type are needed.

Discussion

David Miller pointed out the fact that DHCP client is really often putting the interface in promiscuous mode and this is not a good idea as it is like having a tcpdump started on every laptop. As connMann does ahave its own implementation, they could maybe take this into account and improved the situation. This is in fact already the case as the DHCP client is using an alternate method.

Jozsef Kadlecsik, ipset status

Tc interaction

tc interaction has been contributed by Florian Westphal. It is thus now possible to use a set match to differentiate Qos or routing of packet. This opens a wide area for experimentation.

Packet and byte counters

This is a fairly larger rewriting of set element and extensions which adds packets and bytes counters to the element.

The syntax has been updated:

ipset add   packets n bytes m

It is also possible to do check on counters !! For example, ipset will be able to do a match on a set and to refine the selection by specifying the number of packets we must have seen before matching. Counters can also be updated in the set match.

This will permit some really interesting possibilities to the rules writers.

This work is currently in progress and should be released really soon.

Discussion

nftables is containing a set system which is serving almost the same target. But there is some difference in them. nftables set can be used to parameter a rules: a set can be made with a list of interface: port and a DNAT rules can be written where it will choose the destination port to NAT to depending on the interface.
So it is possible to define ipset set as performance specialized set that could be implemented and used by nftables.

Jozsef Kadlecsik is speaking about the extension of conntrackd to ipset and maybe nftables rules. This would allow to fully synchronized hosts. It is found to be really interesting by the attendees and should be do and provided as an option.