Network – To Linux and beyond !

Speeding up scapy packets sending

Sending packets with scapy

I’m currently writing some code based on scapy. This code reads data from a possibly huge file and sends a packet for each line in the file, using the information the line contains.
So the code contains a simple loop and uses sendp() because the frame must be sent at layer 2.

     def run(self):
         filedesc = open(self.filename, 'r')
         # loop on read line
         for line in filedesc:
             # Build and send packet
             sendp(pkt, iface = self.iface, verbose = verbose)
             # Inter packet treatment

The performance of this approach is a bit disappointing. For 18 packets, we’ve got:

    real    0m2.437s
    user    0m0.056s
    sys     0m0.012s

If we strace the code, the explanation is quite obvious:

socket(PF_PACKET, SOCK_RAW, 768)        = 4
setsockopt(4, SOL_SOCKET, SO_RCVBUF, [0], 4) = 0
select(5, [4], [], [], {0, 0})          = 0 (Timeout)
ioctl(4, SIOCGIFINDEX, {ifr_name="lo", ifr_index=1}) = 0
bind(4, {sa_family=AF_PACKET, proto=0x03, if1, pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
setsockopt(4, SOL_SOCKET, SO_RCVBUF, [1073741824], 4) = 0
setsockopt(4, SOL_SOCKET, SO_SNDBUF, [1073741824], 4) = 0
getsockname(4, {sa_family=AF_PACKET, proto=0x03, if1, pkttype=PACKET_HOST, addr(6)={772, 000000000000}, [18]) = 0
ioctl(4, SIOCGIFNAME, {ifr_index=1, ifr_name="lo"}) = 0
sendto(4, "\377\377\377\377\377\377\0\0\0\0\0\0\10\0E\0\0S}0@\0*\6\265\373\307;\224\24\300\250"..., 97, 0, NULL, 0) = 97
select(0, NULL, NULL, NULL, {0, 0})     = 0 (Timeout)
close(4)                                = 0
socket(PF_PACKET, SOCK_RAW, 768)        = 4
setsockopt(4, SOL_SOCKET, SO_RCVBUF, [0], 4) = 0
select(5, [4], [], [], {0, 0})          = 0 (Timeout)
ioctl(4, SIOCGIFINDEX, {ifr_name="lo", ifr_index=1}) = 0
bind(4, {sa_family=AF_PACKET, proto=0x03, if1, pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
setsockopt(4, SOL_SOCKET, SO_RCVBUF, [1073741824], 4) = 0
setsockopt(4, SOL_SOCKET, SO_SNDBUF, [1073741824], 4) = 0
getsockname(4, {sa_family=AF_PACKET, proto=0x03, if1, pkttype=PACKET_HOST, addr(6)={772, 000000000000}, [18]) = 0
ioctl(4, SIOCGIFNAME, {ifr_index=1, ifr_name="lo"}) = 0
sendto(4, "\377\377\377\377\377\377\0\0\0\0\0\0\10\0E\0\0004}1@\0*\6\266\31\307;\224\24\300\250"..., 66, 0, NULL, 0) = 66
select(0, NULL, NULL, NULL, {0, 0})     = 0 (Timeout)
close(4)                                = 0

For each packet, a new socket is opened, bound and closed, and this takes ages.

Speeding up the sending

To speed up the sending, one solution is to build a list of packets and to send that list via a sendp() call.

     def run(self):
         filedesc = open(self.filename, 'r')
         pkt_list = []
         # loop on read line
         for line in filedesc:
             # Build packet
             pkt_list.append(pkt)
         sendp(pkt_list, iface = self.iface, verbose = verbose)

This is not possible in our case because of the inter-packet treatment we have to do.
So the best way is to reuse the socket. This can be done easily once you’ve read the documentation^W code:

@@ -27,6 +27,7 @@ class replay:
     def run(self):
         # open filename
         filedesc = open(self.filename, 'r')
+        s = conf.L2socket(iface=self.iface)
         # loop on read line
         for line in filedesc:
             # Build and send packet
-            sendp(pkt, iface = self.iface, verbose = verbose)
+            s.send(pkt)

The idea is to create a socket via the function used in sendp() and to use the send() function of the object to send packets.

With that modification, the performance is far better:

    real    0m0.108s
    user    0m0.064s
    sys     0m0.004s

I’m not a scapy expert so ping me if there is a better way to do this.
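
For reference, the socket reuse pattern can be sketched as a standalone function. This is only a sketch under assumptions: replay_lines, build_packet, open_socket and between_packets are hypothetical names that do not appear in the original code, and the socket object is only assumed to offer send() and close(), as the object returned by conf.L2socket() does.

```python
def replay_lines(lines, build_packet, open_socket, between_packets=None):
    """Send one packet per input line over a single, reused socket.

    All parameter names are illustrative:
    - build_packet(line) returns the frame to send for one line
    - open_socket() returns an object with send() and close()
    - between_packets(line), if given, runs after each send
    """
    sock = open_socket()              # opened once, not once per packet
    sent = 0
    try:
        for line in lines:
            sock.send(build_packet(line.rstrip("\n")))
            sent += 1
            if between_packets is not None:
                between_packets(line)  # the inter-packet treatment
    finally:
        sock.close()                  # always release the socket
    return sent
```

With scapy, open_socket would presumably be something like lambda: conf.L2socket(iface=self.iface), and lines the opened file object.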

A bit of fun with IPv6 setup

While doing some tests on Suricata, I needed to set up a small IPv6 network. The setup is simple: a laptop is connected via Ethernet to a desktop, and the desktop hosts a VirtualBox system.
This way, the desktop can act as a router with the laptop on eth0 and the VirtualBox guest on vboxnet0.

To set up the desktop/router, I’ve used:

ip a a 4::1/64 dev eth0
ip a a 2::1/64 dev vboxnet0
echo "1">/proc/sys/net/ipv6/conf/all/forwarding

To set up the laptop, which already has a public IPv6 address on wlan0, I’ve done:

ip a a 4::4/64 dev wlan0
ip -6 r a 2::2/128 via 4::1 src 4::2 metric 128

Almost the same thing on the VirtualBox guest:

ip a a 2::2/64 dev eth0
ip -6 r a default via 2::1

This setup should be enough, but when I tried the following from the laptop:

ping6 2::2

I got a failure.

I then checked the routing on the laptop:

# ip r g 2::2
2::2 via 4::1 dev wlan0  src 2a01:e35:1394:5bd0:f8b3:5a98:2715:6c8d  metric 128

A public IPv6 address is used as source address and this is confirmed by a tcpdump on the desktop:

# tcpdump -i eth0 icmp6 -nv
10:54:48.841761 IP6 (hlim 64, next-header ICMPv6 (58) payload length: 64) 2a01:e35:1394:5bd0:f8b3:5a98:2715:6c8d > 4::1: [icmp6 sum ok] ICMP6, echo request, seq 11

And the desktop does not know how to reach this IP address because it does not have a public IPv6 address.

On the laptop, I’ve dumped wlan0 config to check the address:

# ip a l dev wlan0
3: wlan0:  mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether c4:85:08:33:c4:c8 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.137/24 brd 192.168.1.255 scope global wlan0
       valid_lft forever preferred_lft forever
    inet6 4::4/64 scope global
       valid_lft forever preferred_lft forever
    inet6 2a01:e35:1434:5bd0:f8b3:5a98:2715:6c8d/64 scope global temporary dynamic
       valid_lft 86251sec preferred_lft 84589sec
    inet6 2a01:e35:1434:5bd0:c685:8ff:fe33:c4c8/64 scope global dynamic
       valid_lft 86251sec preferred_lft 86251sec
    inet6 fe80::c685:8ff:fe33:c4c8/64 scope link
       valid_lft forever preferred_lft forever

And, yes, 2a01:e35:1394:5bd0:f8b3:5a98:2715:6c8d is a temporary dynamic IPv6 address, which is used by default for outgoing traffic (and brings a little privacy).

Deleting the address did fix the ping issue:

# ip a d 2a01:e35:1394:5bd0:f8b3:5a98:2715:6c8d/64 dev wlan0
# ping6 2::2
PING 2::2(2::2) 56 data bytes
64 bytes from 2::2: icmp_seq=1 ttl=63 time=5.47 ms

And getting the route did confirm the fix was working:

# ip r g 2::2
2::2 via 4::1 dev wlan0  src 4::4  metric 128

All that to say that it can be useful to deactivate temporary IPv6 addresses before setting up a test network:

echo "0" > /proc/sys/net/ipv6/conf/wlan0/use_tempaddr
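
The check done by hand with "ip r g" can also be scripted. The following is a minimal sketch, assuming the one-line output format shown above; route_source and uses_prefix are hypothetical helper names, and the text passed to them would have to come from running the command yourself.

```python
import ipaddress

def route_source(route_line):
    """Return the address following 'src' in one line of route output, or None."""
    fields = route_line.split()
    if "src" not in fields:
        return None
    index = fields.index("src") + 1
    if index >= len(fields):
        return None
    return ipaddress.ip_address(fields[index])

def uses_prefix(route_line, prefix):
    """True if the selected source address belongs to the given prefix."""
    source = route_source(route_line)
    return source is not None and source in ipaddress.ip_network(prefix)

# The two routes seen in this post, before and after deleting the address.
before = "2::2 via 4::1 dev wlan0  src 2a01:e35:1394:5bd0:f8b3:5a98:2715:6c8d  metric 128"
after = "2::2 via 4::1 dev wlan0  src 4::4  metric 128"
print(uses_prefix(before, "4::/64"), uses_prefix(after, "4::/64"))
```

Such a check makes it obvious when a temporary address, and not the test address, has been picked as source.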

Talk about nftables at Kernel Recipes 2013

I’ve just given a talk about nftables, the iptables successor, at Kernel Recipes 2013. You can find the slides here:
2013_kernel_recipes_nftables

A description of the talk as well as the slides and video are available on the Kernel Recipes website.

Here’s the video of my talk:

I’ve presented a video of nftables source code evolution:

The video has been generated with gource. The Git histories of the various components have been merged and each file path has been prefixed with the project name.

Using tc with IPv6 and IPv4

The first news is that it works! It is possible to use tc to set up QoS on IPv6, but the filters have to be updated.

While working on adding IPv6 support to lagfactory, I found out by reading the tc sources, and specifically ll_proto.c, that the keyword to use for IPv6 is ipv6. Please read that file if you need to find the keyword for another protocol.
So to send packets with Netfilter mark 5000 to a specific queue, one can use:

/sbin/tc filter add dev vboxnet0 protocol ipv6 parent 1:0 prio 3 handle 5000 fw flowid 1:3

All would have been simple if I had not been trying to support both IPv6 and IPv4. My first try was to simply do:

${TC} filter add dev ${IIF} protocol ip parent 1:0 prio 3 handle 5000 fw flowid 1:3
${TC} filter add dev ${IIF} protocol ipv6 parent 1:0 prio 3 handle 5000 fw flowid 1:3

But the result was this beautiful message:

RTNETLINK answers: Invalid argument
We have an error talking to the kernel

Please note the second message, which warns you that you talked too rudely to the kernel and that it has just kicked you out of the room.

The fix is simple. In fact, you cannot use the same prio twice on the same parent for two different protocols. So this works:

${TC} filter add dev ${IIF} protocol ip parent 1:0 prio 3 handle 5000 fw flowid 1:3
${TC} filter add dev ${IIF} protocol ipv6 parent 1:0 prio 4 handle 5000 fw flowid 1:3
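
When the commands are generated by a script, the rule "one prio per protocol" can be enforced mechanically. Here is a minimal sketch, assuming the same parent, handle and flowid as above; fw_filter_commands and its parameters are hypothetical names, and the function only builds the command strings without running them.

```python
def fw_filter_commands(dev, mark, flowid, protocols=("ip", "ipv6"),
                       parent="1:0", first_prio=3, tc="/sbin/tc"):
    """Build one 'tc filter add' command per protocol, each with its own prio."""
    commands = []
    for offset, protocol in enumerate(protocols):
        prio = first_prio + offset  # never reuse a prio for another protocol
        commands.append(
            f"{tc} filter add dev {dev} protocol {protocol} parent {parent} "
            f"prio {prio} handle {mark} fw flowid {flowid}"
        )
    return commands

for command in fw_filter_commands("vboxnet0", 5000, "1:3"):
    print(command)
```

The printed lines can then be reviewed or passed to a shell.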

Netfilter and the NAT of ICMP error messages

The problem

I’ve recently been working for a customer who needed consultancy because of some unexplained Netfilter behaviors related to ICMP error messages. He authorized me to share the result of my study and I thank him for making this blog entry possible.
His problem was that one of his firewalls uses a private interconnection with the border router, and the customer did not manage to NAT all outgoing ICMP error messages.

The simplified network diagram is the following:

The DMZ is in a private network. The router has a route to the public network via the firewall, and the public network addresses do not really exist.
The firewall has a set of DNAT rules to redirect a public IP to the matching private IP:

iptables -A PREROUTING -t nat -d 1.2.3.X -j DNAT --to 192.168.1.X

The interconnection between the router and the firewall is made using a private network. Let’s say 192.168.42.0/24, with 192.168.42.1 for the firewall. The interface eth0 is the one used as the interconnection interface.

On the firewall, some filtering rules reject some FORWARD traffic:

iptables -A FORWARD -d 192.168.1.X -j REJECT
iptables -A FORWARD -d 192.168.1.Y -j REJECT

The issue is related to the ICMP unreachable messages. When someone from the Internet (behind the router) sends a packet to 192.168.1.X or 192.168.1.Y, then:

If 192.168.1.X is NATed then the ICMP unreachable message is emitted and seen as coming from 1.2.3.X on eth0.
If 192.168.1.Y is not NATed then the ICMP unreachable message is emitted and seen as coming from 192.168.42.1 on eth0.

So, a packet going to 192.168.1.Y results in an ICMP message which is not routed by the router because of its private source IP.

To fix the issue, the customer added a source NAT rule to translate all ICMP packets coming from 192.168.42.1 to 1.2.3.1:

iptables -A POSTROUTING -t nat -p icmp -s 192.168.42.1 -o eth0 -j SNAT --to 1.2.3.1

But this rule has no effect on the ICMP unreachable message.

Explanations

In the case of packets going to X or Y, an ICMP message is sent. Internally, the same function (called icmp_send) is used to send the ICMP error message. This is a standard function and, as such, it uses the best local source address possible. In our case the best address is 192.168.42.1 because the packet has to get back through eth0.
At this stage, there is no difference between the two ICMP packets and the result should be the same.

But if nothing is done, the packet to X will result in a packet going to the original source and containing the internal IP information: the packet has been NATed so we have 192.168.1.X and not the public IP in the original packet data contained in the ICMP message. This is a real problem as this will leak private information to the outside.

Fortunately, the packets are handled differently thanks to the ICMP error connection tracking module. It searches the payload part of the ICMP error message to check whether it belongs to an existing connection. If a connection is found, the ICMP packet is marked as RELATED to the original connection. Once this is done, the ICMP NAT helper applies the reverse transformation so that the packet sent to the network contains only public information. For a packet to X, the source address of the ICMP message and the addresses in its payload are modified to the public IP address. This explains the difference between the ICMP error messages sent because of a packet to X and because of a packet to Y.
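
The effect of this reverse transformation can be illustrated with a toy model. This is not the kernel code, only a sketch under assumptions: IcmpError and reverse_nat are hypothetical names, the addresses 1.2.3.10 and 192.168.1.10 are made-up instances of the X case, 192.168.1.20 stands for Y, and a plain dictionary stands in for the connection tracking table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IcmpError:
    src: str         # source address of the ICMP error itself
    quoted_dst: str  # destination of the offending packet quoted in the payload

def reverse_nat(error, dnat_by_private):
    """Toy model: undo DNAT on an ICMP error if a tracked connection matches.

    dnat_by_private maps a private destination to the public address that
    was translated to it (a stand-in for the conntrack lookup).
    """
    public = dnat_by_private.get(error.quoted_dst)
    if public is None:
        return error  # nothing was NATed: the message leaves unchanged
    # Both the outer source and the quoted destination become public.
    return IcmpError(src=public, quoted_dst=public)

tracked = {"192.168.1.10": "1.2.3.10"}  # hypothetical DNATed host (case X)
to_x = reverse_nat(IcmpError("192.168.42.1", "192.168.1.10"), tracked)
to_y = reverse_nat(IcmpError("192.168.42.1", "192.168.1.20"), tracked)
print(to_x.src, to_y.src)
```

In this model the error for the NATed host leaves with the public address, while the other one keeps 192.168.42.1 as source, which matches the two behaviors listed earlier.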

But this does not explain why the NAT rule inserted by the customer did not work. In fact, the answer has already been given: the ICMP packet is marked as belonging to a connection related to the original one. Being in the RELATED state, it will not cross the nat table in POSTROUTING, as only packets with a connection in state NEW are sent to the nat tables.

This study can be validated by using marking and logging. If we log a packet which belongs to a RELATED connection, and if we are sure that the original connection is the one we are tracing, then our hypothesis is validated. Matching a RELATED connection is easy with the filter “-m conntrack --ctstate RELATED”. To prove that the packet is RELATED to the original connection, we have to use the fact that a RELATED connection inherits the connection mark of the originating connection. Thus, if we set a connection mark with the CONNMARK target, we will be able to match it in the ICMP error message. The following rules implement this:

iptables -t mangle -A PREROUTING -d 1.2.3.4 -j CONNMARK --set-mark 1
iptables -A OUTPUT -t mangle -m state --state RELATED -m connmark --mark  1 -j LOG

And it logs an ICMP error message when we try to reach 1.2.3.4.

Other debug methods

Using conntrack

The conntrack utils can be used to display connection tracking events by using the -E flag:

# conntrack -E
    [NEW] tcp      6 120 SYN_SENT src=192.168.1.12 dst=91.121.96.152 sport=53398 dport=22 [UNREPLIED] src=91.121.96.152 dst=192.168.1.12 sport=22 dport=53398
 [UPDATE] tcp      6 60 SYN_RECV src=192.168.1.12 dst=91.121.96.152 sport=53398 dport=22 src=91.121.96.152 dst=192.168.1.12 sport=22 dport=53398
 [UPDATE] tcp      6 432000 ESTABLISHED src=192.168.1.12 dst=91.121.96.152 sport=53398 dport=22 src=91.121.96.152 dst=192.168.1.12 sport=22 dport=53398 [ASSURED]

This can be really useful to see what transformations are made by the connection tracking system. But this does not work in our case because the ICMP message does not trigger any connection creation and so no event.
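
When following many events, it can help to turn each line into a structure. Here is a minimal sketch, assuming the line format shown in the output above; parse_conntrack_event is a hypothetical helper, not part of the conntrack tools.

```python
import re

EVENT_RE = re.compile(r"^\s*\[(\w+)\]\s+(\w+)\s+(\d+)\s+(.*)$")

def parse_conntrack_event(line):
    """Split one 'conntrack -E' line into event, state, tuples and flags."""
    match = EVENT_RE.match(line)
    if match is None:
        return None
    event, proto, _protonum, rest = match.groups()
    original, reply, flags = {}, {}, []
    timeout = state = None
    for token in rest.split():
        if "=" in token:
            key, value = token.split("=", 1)
            # The first src/dst/sport/dport set is the original direction,
            # the repeated keys describe the reply direction.
            (reply if key in original else original)[key] = value
        elif token.startswith("["):
            flags.append(token.strip("[]"))
        elif token.isdigit() and timeout is None:
            timeout = int(token)
        else:
            state = token
    return {"event": event, "proto": proto, "timeout": timeout,
            "state": state, "original": original, "reply": reply,
            "flags": flags}

sample = ("    [NEW] tcp      6 120 SYN_SENT src=192.168.1.12 dst=91.121.96.152 "
          "sport=53398 dport=22 [UNREPLIED] src=91.121.96.152 dst=192.168.1.12 "
          "sport=22 dport=53398")
print(parse_conntrack_event(sample))
```

Comparing the original and reply tuples of an entry shows directly which addresses were rewritten by NAT.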

Using TRACE target

The TRACE target is a really useful tool. It allows you to see which rules are matched by a packet. Its usage is really simple. For example, if we want to trace all ICMP traffic coming to the box:

iptables -A PREROUTING -t raw  -p icmp -j TRACE

In our test system, the result was the following:

[ 5281.733217] TRACE: raw:PREROUTING:policy:2 IN=eth0 OUT= MAC=08:00:27:a9:f5:30:0a:00:27:00:00:00:08:00 SRC=192.168.56.1 DST=1.2.3.4 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=ICMP TYPE=8 CODE=0 ID=12114 SEQ=1
[ 5281.737057] TRACE: nat:PREROUTING:rule:1 IN=eth0 OUT= MAC=08:00:27:a9:f5:30:0a:00:27:00:00:00:08:00 SRC=192.168.56.1 DST=1.2.3.4 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=ICMP TYPE=8 CODE=0 ID=12114 SEQ=1
[ 5281.737057] TRACE: nat:PREROUTING:rule:2 IN=eth0 OUT= MAC=08:00:27:a9:f5:30:0a:00:27:00:00:00:08:00 SRC=192.168.56.1 DST=1.2.3.4 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=ICMP TYPE=8 CODE=0 ID=12114 SEQ=1
[ 5281.737057] TRACE: filter:FORWARD:rule:1 IN=eth0 OUT=eth1 MAC=08:00:27:a9:f5:30:0a:00:27:00:00:00:08:00 SRC=192.168.56.1 DST=192.168.42.4 LEN=84 TOS=0x00 PREC=0x00 TTL=63 ID=0 DF PROTO=ICMP TYPE=8 CODE=0 ID=12114 SEQ=1

In the raw table, in PREROUTING, the policy is applied (here ACCEPT). In nat PREROUTING, the first rule matches (a mark rule) and the second one matches too. Finally in FORWARD, the first rule matches (here the REJECT rule). TRACE only follows the initial packet and thus does not display any information about the ICMP error message.
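
Reading such a trace is easier once the path is extracted from each line. A minimal sketch, assuming the kernel log format shown above; parse_trace is a hypothetical helper name.

```python
import re

TRACE_RE = re.compile(r"TRACE: (\w+):([\w-]+):(\w+):(\d+) ")

def parse_trace(line):
    """Extract table, chain, match type, rule number and addresses."""
    match = TRACE_RE.search(line)
    if match is None:
        return None
    table, chain, kind, number = match.groups()
    # Remaining KEY=VALUE tokens; flags without '=' (like DF) are skipped.
    fields = dict(token.split("=", 1)
                  for token in line[match.end():].split() if "=" in token)
    return {"table": table, "chain": chain, "kind": kind,
            "number": int(number), "in": fields.get("IN"),
            "out": fields.get("OUT"), "src": fields.get("SRC"),
            "dst": fields.get("DST")}

sample = ("[ 5281.737057] TRACE: filter:FORWARD:rule:1 IN=eth0 OUT=eth1 "
          "MAC=08:00:27:a9:f5:30:0a:00:27:00:00:00:08:00 SRC=192.168.56.1 "
          "DST=192.168.42.4 LEN=84 TOS=0x00 PREC=0x00 TTL=63 ID=0 DF "
          "PROTO=ICMP TYPE=8 CODE=0 ID=12114 SEQ=1")
print(parse_trace(sample))
```

Applied to every line of the log, this gives the sequence of tables and chains crossed by the packet, and shows where the destination address changes.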

Conclusion

So Netfilter’s behavior is correct when it translates back the elements initially transformed by NAT. The surprising part comes from the fact that the NAT rules in POSTROUTING are not reached. But this is needed to avoid the complicated issues that multiple transformations would cause. Regarding the interconnection with the router, you should really use a public network if you want your ICMP error messages to be seen on the Internet.

Tomasz Bursztyka, connMan usage of Netfilter

Introduction

connMan is a network manager which has support for a lot of different layers from ethernet and WiFi to NFC and link sharing.

It features automatic link switching and allows you to select your preferred type of medium. The communication with the UI is event based, so a UI is easy to write, as only a few types of windows are needed.

Discussion

David Miller pointed out that DHCP clients really often put the interface in promiscuous mode, and that this is not a good idea as it is like having a tcpdump started on every laptop. As connMan has its own implementation, they could maybe take this into account and improve the situation. This is in fact already the case, as its DHCP client uses an alternate method.

Simon Horman, MPLS Enlightened Open vSwitch

Open vSwitch is a multi-layer switch. It is designed to enable network automation through programmatic extension, while still supporting standard management interfaces and protocols.

OpenFlow is a management protocol that is supported by Open vSwitch. OpenFlow has basic support for MPLS: it features a minimal set of operations that makes it possible to configure MPLS correctly.
OpenFlow MPLS support is partially implemented in Open vSwitch but there are some difficulties.

Some of the operations update L3+ parameters like the TTL. These must be updated in the same manner in the MPLS header and in the packet header. This is quite complicated, as it requires decoding the packet below MPLS. But the MPLS header does not include the encapsulated EtherType, so it is almost impossible to access the packet structure correctly.

A possible solution is to reinject the packet after modification to modify layer by layer in each step. This is currently a work in progress.

David Miller: routing cache is dead, now what ?

The routing cache maintained a list of routing decisions. It was a hash table which was highly dynamic and changed with the traffic. One of the major problems was the garbage collector. Another severe issue was the possibility of DoS using the increase

The routing cache was removed in Linux 3.6 after a two-year effort by David and the other Linux kernel developers. The global cache has been removed and some of the stored information has been moved to more specific places like sockets.

There were a lot of side effects following this big transformation. On the user side, there is no more “neighbour cache overflow” thanks to the synchronized sizes of the routing and neighbour tables.

Metrics were stored in the routing cache entries, which have disappeared. So it has been necessary to introduce a separate TCP metrics cache. A netlink interface is available to update, delete and add entries in the cache.

Another side effect of these modifications is that, on TCP sockets, xt_owner could be used on input sockets, but the code needs to be updated.

On the security side, reverse path filtering has been updated. When activated, it causes up to two extra FIB lookups. But when deactivated, there is now no overhead at all.

Fabio Massimo Di Nitto: Kronosnet.org

Kronosnet is an “I conceived it when drunk but it works well” VPN implementation. It uses an Ethernet TAP for the VPN, to provide a layer 2 VPN. To avoid reinventing the wheel, it delegates most of the work to the kernel. It supports multilink and redundancy of servers. On the multilink side, 8 links can be set up per host to help redundancy.

One of the uses of this project is the creation of private networks in the cloud, as it can easily be set up to provide redundancy and connections for a lot of clients (64k simultaneous clients), and because a layer 2 VPN is really useful for this type of usage.

Configuration is made via a CLI comparable to the one of classical routers.

Fabio ran a demonstration on 4 servers and showed that cutting a link has no impact on a ping flood, thanks to the multilink system.

Daniel Borkmann: Packets Sockets, BPF and Netsniff-NG

PF_PACKET introduction

PF_PACKET gives access to raw packets inside Linux. It is used by libpcap and by other projects like Suricata.
PF_PACKET performance can be improved via dedicated features:

Zero-copy RX/TX
Socket clustering
Linux socket filtering (BPF)

The BPF architecture looks like a small virtual machine with registers and memory stores. It has different instructions, and the kernel has its own extensions to access values such as the CPU number or the VLAN tag.
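
To give an idea of what such a virtual machine does, here is a toy interpreter. It is only a sketch under loud assumptions: the tuple-based instructions and the names run_filter and IPV4_ONLY are made up for this illustration, this is not the real BPF instruction encoding, and only an accumulator is modelled (no index register, no scratch memory, no kernel extensions).

```python
def run_filter(program, packet):
    """Run a toy BPF-like program; return bytes to keep (0 means drop)."""
    acc = 0  # accumulator register
    pc = 0   # program counter
    while pc < len(program):
        op, *args = program[pc]
        pc += 1
        if op == "ldh":
            # Load a 16-bit big-endian value from the packet at an offset.
            (offset,) = args
            if offset + 2 > len(packet):
                return 0  # a load outside the packet ends the filter
            acc = int.from_bytes(packet[offset:offset + 2], "big")
        elif op == "jeq":
            # Skip jump_true or jump_false instructions depending on the test.
            value, jump_true, jump_false = args
            pc += jump_true if acc == value else jump_false
        elif op == "ret":
            return args[0]
        else:
            raise ValueError(f"unknown instruction {op!r}")
    return 0

# Accept only frames whose EtherType (bytes 12-13) is IPv4.
IPV4_ONLY = [
    ("ldh", 12),
    ("jeq", 0x0800, 0, 1),
    ("ret", 0xFFFF),  # keep up to 65535 bytes
    ("ret", 0),       # drop
]

ipv4_frame = bytes(12) + b"\x08\x00" + bytes(20)
ipv6_frame = bytes(12) + b"\x86\xdd" + bytes(40)
print(run_filter(IPV4_ONLY, ipv4_frame), run_filter(IPV4_ONLY, ipv6_frame))
```

The structure is the interesting part: a filter is a short program that loads fields from the packet, compares them and returns how much of the packet to keep.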

Netsniff-NG

Netsniff-ng is a set of minimal tools:

netsniff-ng, a high-performance zero-copy analyzer, pcap capturing and replaying tool
trafgen, a high-performance zero-copy network traffic generator
mausezahn, a packet generator and analyzer for HW/SW appliances with a Cisco-CLI
bpfc, a Berkeley Packet Filter (BPF) compiler with Linux extensions
ifpps, a top-like kernel networking and system statistics tool
flowtop, a top-like netfilter connection tracking tool
curvetun, a lightweight multiuser IP tunnel based on elliptic curve cryptography
astraceroute, an autonomous system (AS) trace route utility

netsniff-ng can be used to capture with advanced options, like using a dedicated CPU or using a BPF filter compiled via bpfc instead of a tcpdump-like expression.

trafgen can be used to generate high-speed traffic, using multiple CPUs and complex configurations, even including fuzzing.
A description of the packet can be given where each element is built using different functions.
It can even be combined with tc (for example netem to simulate specific network condition).

Future work includes finding a way to use multiple cores efficiently with packet_fanout and disk writing.
