Playing a bit with vim macros

During a recent coding session, I had to modify a signature file for Suricata. The file looked like this:

alert pkthdr any any -> any any (msg:"SURICATA ICMPv4 unknown code"; decode-event:icmpv4.unknown_code; sid:2200024; rev:1;)
alert pkthdr any any -> any any (msg:"SURICATA ICMPv4 truncated packet"; decode-event:icmpv4.ipv4_trunc_pkt; sid:2200025; rev:1;)
alert pkthdr any any -> any any (msg:"SURICATA ICMPv4 unknown version"; decode-event:icmpv4.ipv4_unknown_ver; sid:2200026; rev:1;)
alert pkthdr any any -> any any (msg:"SURICATA ICMPv6 packet too small"; decode-event:icmpv6.pkt_too_small; sid:2200027; rev:1;)
alert pkthdr any any -> any any (msg:"SURICATA ICMPv6 unknown type"; decode-event:icmpv6.unknown_type; sid:2200028; rev:1;)
alert pkthdr any any -> any any (msg:"SURICATA ICMPv6 unknown code"; decode-event:icmpv6.unknown_code; sid:2200029; rev:1;)

The modification was to decrease the sid number by 24 for each signature.

At first, I did not know how to subtract 24 from a number. The only operation I knew was CTRL+x (deadly enemy of CTRL+a), which decreases the next number to the right of the cursor by 1. I then thought of the count prefix. In a perfect world, it would work with multi-key commands. And it did! Typing:

24CTRL+x

decreases the next number on the current line by 24.

All that was needed now was to record a macro and apply it to all lines. Starting to record a macro into register a is simply done with the command qa. My macro was the following:

0 # go to start of the line
/sid # search sid
24CTRL+X # decrease sid number
Arrow down # go to next line

followed by ESC then q to stop the macro recording.
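
Put together, the whole recording session looks roughly like this (my own recap of the steps above, writing the down arrow as j):

qa          # start recording into register a
0           # go to start of the line
/sid ENTER  # search sid
24CTRL+x    # decrease sid number by 24
j           # go to next line
q           # stop recording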

Applying the macro is done via the @a command, and by typing 42@a, I could apply it to the whole file (42 being the number of lines to change).

In fact this solution is too complicated because you must know how many lines you want to change. It is possible to apply a macro b to a visual selection by using:

:normal @b

Please note you must not hit ESC before typing :. By removing the down-arrow step from macro a, I was able to apply my modification to the selected lines easily.
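
For example, with the trimmed macro still in register a, you select the lines with V and type (vim automatically inserts the '<,'> range when you press : with an active visual selection):

:'<,'>normal @a

The macro is then run once on each selected line.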

This last tip is taken from the Vim Tips Wiki.

About Suricata performance boost between 1.0 and 1.1beta2

Discovering the performance boost

When doing some coding on both the 1.0 and 1.1 branches of Suricata, I noticed that there was a huge performance improvement of the 1.1 branch over the 1.0 branch. The parsing of a given real-life pcap file took 200 seconds with 1.0 but only 30 seconds with 1.1. This boost was huge and I decided to double-check and study how such a performance gain was possible and how it was obtained.

A git bisection showed me that the performance improvement was achieved in at least two main steps. I then decided to do a more systematic study by iterating over the revisions and each time running the same test with the same basic, untuned configuration:

suricata -c ~eric/builds/suricata/etc/suricata.yaml  -r benches/sandnet.pcap

and storing the log output.
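
To give an idea of the method, the measurement loop looked roughly like the following sketch (the tag names, log directory and build steps are hypothetical; only the suricata invocation is the real one used for the tests):

for rev in $(git rev-list --reverse suricata-1.0.2..suricata-1.1beta2); do
    git checkout -q "$rev"
    ./autogen.sh && ./configure && make && make install
    /usr/bin/time -o "logs/$rev.time" \
        suricata -c ~eric/builds/suricata/etc/suricata.yaml -r benches/sandnet.pcap \
        > "logs/$rev.log" 2>&1
done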

Graphing the improvements

The following graph shows the evolution of the processing time per commit between Suricata 1.0.2 and Suricata 1.1beta2:

It is impressive to see that the improvements are concentrated over a really short period. In terms of commit date, almost everything happened between December 1st and December 9th.

The following graph shows the same data with a zoom on the critical period:

One can see that there are two big steps and a last, less noticeable phase.

Identifying the commits

The first big step in the improvement is due to commit c61c68fd:

commit c61c68fd365bf2274325bb77c8092dfd20f6ca87
Author: Anoop Saldanha
Date:   Wed Dec 1 13:50:34 2010 +0530

    mpm and fast pattern support for http_header. Also support relative modifiers for http_header

This commit more than doubled the previous performance.

The second step is a commit which also doubles performance. It is again by Anoop Saldanha:

commit 72b0fcf4197761292342254e07a8284ba04169f0
Author: Anoop Saldanha
Date:   Tue Dec 7 16:22:59 2010 +0530

    modify detection engine to carry out uri mpm run before build match array if alproto is http and if sgh has atleast one sig with uri mpm set

Other improvements were made a few hours later by Anoop, who achieved a further 20% improvement with:

commit b140ed1c9c395d7014564ce077e4dd3b4ae5304e
Author: Anoop Saldanha
Date:   Tue Dec 7 19:22:06 2010 +0530

    modify detection engine to run hhd mpm before building the match array

The motivation for this development was that the developers knew the matching on http_headers was not optimal because it used a single-pattern search algorithm. By switching to a multi-pattern match algorithm, they knew it would do a great prefiltering job and increase the speed. Here is Victor Julien's comment and explanation:

We knew that at the time we inspected the http_headers and a few other buffers for each potential signature match over and over again using a single pattern search algorithm. We already knew this was inefficient, so moving to a multi-pattern match algorithm that would prefilter the signatures made a lot of sense even without benching it.

Finally, two days later, there is a series of two commits which brings another 20-30% improvement:

commit 8bd6a38318838632b019171b9710b217771e4133
Author: Anoop Saldanha
Date:   Thu Dec 9 17:37:34 2010 +0530

    support relative pcre for http header. All pcre processing for http header moved to hhd engine

commit 2b781f00d7ec118690b0e94405d80f0ff918c871
Author: Anoop Saldanha
Date:   Thu Dec 9 12:33:40 2010 +0530

    support relative pcre for client body. All pcre processing for client body moved to hcbd engine

Conclusion

It appears that all the improvements are linked to modifications of the HTTP handling. Working hard on improving the HTTP features has led to an impressive performance boost. Thanks a lot to Anoop for this awesome work. As HTTP now makes up most of the traffic on the Internet, this is really good news for Suricata users!

Upgrading Galaxy S from Android 2.1 to 2.3.3 under Linux

After some time lost trying in vain to get Kies (of Death) from Samsung or Odin working under VirtualBox, I found out about the existence of Heimdall. This software has been developed to flash firmware onto Samsung Galaxy S devices.

It worked quite easily. The upgrade procedure only requires downloading some files and, in my case, some usage of the tar command.

The command line was long but simple:
heimdall flash --pit s1_odin_20100512.pit --factoryfs factoryfs.rfs \
--cache cache.rfs --dbdata dbdata.rfs --param param.lfs \
--kernel zImage --modem modem.bin \
--primary-boot boot.bin --secondary-boot Sbl.bin \
--verbose

A GUI named heimdall-frontend is available for people who do not like the command line.

Here is a list of problems I encountered during this update:

  • Going to download mode was not possible on the phone: I had to use adb
  • adb is 32-bit and I had to install 32-bit libs: aptitude install ia32-libs
  • heimdall uses a device /dev/ttyACM0 which is read/write for the dialout group (and I was not in that group)
  • I had to chain the command adb reboot download with the heimdall command to get the upgrade started (see the command after this list)
  • The first restart was blocked at the "Galaxy S" display; I ran adb reboot recovery to return to normal behaviour
  • I had no data connection (3G) after the upgrade: restoring the default APN configuration fixed the issue
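
For the record, the chaining looked roughly like this (a sketch; the sleep duration is arbitrary, just giving the phone time to enter download mode):

adb reboot download && sleep 10 && \
heimdall flash --pit s1_odin_20100512.pit --factoryfs factoryfs.rfs \
  --cache cache.rfs --dbdata dbdata.rfs --param param.lfs \
  --kernel zImage --modem modem.bin \
  --primary-boot boot.bin --secondary-boot Sbl.bin --verbose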

If we omit these little points, the upgrade procedure went fine. Heimdall was very efficient compared to all the crap I tried to use on Windows. Thanks a lot to its developers for their work!

IPv6 privacy extensions on Linux

IPv6 global address

The global address is used in IPv6 to communicate with the outside world. It is thus the one used as the source for any communication and, in a way, identifies you on the Internet.

Below is a dump of an interface configuration:

eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:22:15:64:42:bd brd ff:ff:ff:ff:ff:ff
    inet6 2a01:f123:1234:5bd0:222:15ff:fe64:42bd/64 scope global dynamic 
       valid_lft 86314sec preferred_lft 86314sec
    inet6 fe80::222:15ff:fe64:42bd/64 scope link 
       valid_lft forever preferred_lft forever

The global address here is 2a01:f123:1234:5bd0:222:15ff:fe64:42bd/64. It is built by taking the prefix and adding an identifier derived from the hardware address. For example, here the hardware address is 00:22:15:64:42:bd and the global IPv6 address ends with 22:15ff:fe64:42bd.

It is thus easy to go from the IPv6 global address to the hardware address. To fix this issue and increase the privacy of network users, privacy extensions have been developed.

Privacy extensions

RFC 3041 describes how to build and use temporary addresses that will be used as the source address for connections to the outside world.

To activate this feature, you simply have to modify an entry in /proc. For example, to activate the feature on eth0, you can do:

echo "2">/proc/sys/net/ipv6/conff/eth0/use_tempaddr

The usage of the option is detailed in the must-read ip-sysctl.txt file:

use_tempaddr - INTEGER
        Preference for Privacy Extensions (RFC3041).
          <= 0 : disable Privacy Extensions
          == 1 : enable Privacy Extensions, but prefer public
                 addresses over temporary addresses.
          >  1 : enable Privacy Extensions and prefer temporary
                 addresses over public addresses.
        Default:  0 (for most devices)
                 -1 (for point-to-point devices and loopback devices)

After a network restart (a simple ifdown/ifup of the interface is enough), the output of the ip a command looks like this:

eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:22:15:64:42:bd brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.129/24 brd 192.168.1.255 scope global eth0
    inet6 2a01:f123:1234:5bd0:21f1:f624:d2b8:3702/64 scope global temporary dynamic 
       valid_lft 86314sec preferred_lft 2914sec
    inet6 2a01:f123:1234:5bd0:222:15ff:fe64:42bd/64 scope global dynamic 
       valid_lft 86314sec preferred_lft 86314sec
    inet6 fe80::222:15ff:fe64:42bd/64 scope link 
       valid_lft forever preferred_lft forever

A new temporary address has been added. After preferred_lft seconds, it becomes deprecated and a new address is added:

eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:22:15:64:42:bd brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.129/24 brd 192.168.1.255 scope global eth0
    inet6 2a01:f123:1234:5bd0:55c3:7efd:93d1:5057/64 scope global temporary dynamic 
       valid_lft 85009sec preferred_lft 1672sec
    inet6 2a01:f123:1234:5bd0:21f1:f624:d2b8:3702/64 scope global temporary deprecated dynamic 
       valid_lft 82077sec preferred_lft 0sec
    inet6 2a01:f123:1234:5bd0:222:15ff:fe64:42bd/64 scope global dynamic 
       valid_lft 86398sec preferred_lft 86398sec
    inet6 fe80::222:15ff:fe64:42bd/64 scope link 
       valid_lft forever preferred_lft forever

The deprecated address is removed when its valid_lft counter reaches zero.

Some more tuning

The default lifetime of a preferred address is one day. This can be changed by modifying the temp_prefered_lft variable.

For example, you can add to sysctl.conf:

net.ipv6.conf.eth0.temp_prefered_lft = 7200

The default validity length of the addresses can be changed via the temp_valid_lft variable.

The max_desync_factor variable sets the maximum random time to wait before requesting a new address. It is used to avoid all computers on a network asking for an address at the same time.
One side effect is that if you set the preferred or valid lifetime to a low value, max_desync_factor must also be decreased. If not, there will be long periods without a temporary address.

If temp_prefered_lft is several times lower than temp_valid_lft, deprecated addresses will accumulate. To avoid overloading the kernel, a maximum number of addresses is enforced.
Equal to 16 by default, it can be changed by setting the max_addresses sysctl variable.
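
For example, a hypothetical sysctl.conf excerpt combining these variables (the values are purely illustrative):

net.ipv6.conf.eth0.use_tempaddr = 2
net.ipv6.conf.eth0.temp_prefered_lft = 7200
net.ipv6.conf.eth0.temp_valid_lft = 86400
net.ipv6.conf.eth0.max_desync_factor = 600
net.ipv6.conf.eth0.max_addresses = 16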

Known issues and problems

As the temporary address is used for connections to the outside and has a limited duration, some long-lived connections (think ssh) will be cut when the temporary address is removed.

I’ve also observed a problem when the maximum number of addresses is reached:

ipv6_create_tempaddr(): retry temporary address regeneration.
ipv6_create_tempaddr(): regeneration time exceeded. disabled temporary address support.

The result was that temporary address support was disabled and the standard global address was used again. When setting temp_prefered_lft to 3600 and keeping temp_valid_lft at its default value, the problem is easily reproduced.

Conclusion

The support of IPv6 privacy extensions is decent, but the lack of link with existing connections can cause some services to be disrupted. An easy-to-use per-software selection of the source address would be really interesting to avoid these problems.

Joining the OISF coding staff

My collaboration with OISF has been announced today. It is an honor for me to join this excellent team on this wonderful project. I've taken a lot of pleasure contributing to the project over the past months and I'm sure the start of an official collaboration will lead to good things. The challenge is high and I will do my best to be worthy of the trust.

A big thanks to all the people who congratulated me on this nomination.

Building Suricata under OpenBSD

Suricata 1.1beta2 has added OpenBSD to the list of supported operating systems. I'm a total newbie to OpenBSD, so excuse the lack of respect for OpenBSD standards and usages in this documentation.

Here are the different steps I used to finalize the port, starting from a fresh install of OpenBSD.

If you want to use sources taken from git, you will need to install the build tools:

pkg_add git libtool

automake and autoconf need to be installed too. For OpenBSD 4.8, one can run:

pkg_add autoconf-2.61p3 automake-1.10.3

For OpenBSD 5.0 or 5.1, one can run:

pkg_add autoconf-2.61p3 automake-1.10.3p3

For OpenBSD 5.2:

pkg_add autoconf-2.61p3 automake-1.10.3p6

Autoconf 2.61 is known to work; some other versions trigger a compilation failure.

Then you can simply clone the repository and run autogen:

git clone git://phalanx.openinfosecfoundation.org/oisf.git
cd oisf
export AUTOCONF_VERSION=2.61
export AUTOMAKE_VERSION=1.10
./autogen.sh

Before running configure, you need to add the dependencies:

pkg_add gcc pcre libyaml libmagic libnet-1.1.2.1p0

Now, we’re almost done and we can run configure:

CPPFLAGS="-I/usr/local/include" CFLAGS="-L/usr/local/lib" ./configure --prefix=/opt/suricata

You can now run make and make install to build and install Suricata.
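
For the record, that is simply the standard pair of commands (installation may require root privileges):

make
make install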

Some new features of IPS mode in Suricata 1.1beta2

The Suricata IDS/IPS has native support for Netfilter queues. This brings IPS functionalities to users running Suricata on Linux.

Suricata 1.1beta2 introduces a lot of new features related to the NFQ mode.

New stream inline mode

One of the main improvements of the Suricata IPS mode is the new stream engine dedicated to inline operation. Victor Julien has a great blog post about it.

Multiqueue support

Suricata can now be started on multiple queues by repeating the -q option with different queue identifiers on the command line. The following syntax:

suricata -q 0 -q 1 -c /etc/suricata.yaml

will start Suricata listening to Netfilter queues 0 and 1.

The option has been added to improve the performance of Suricata in NFQ mode. One of the observed limitations is the number of packets per second that can be sent to a single queue. By being able to specify multiple queues, it is possible to increase performance. The biggest impact is on multicore systems, where it adds scalability.

This feature can be used with the queue-balance option of the NFQUEUE target, which was added in Linux 2.6.31. When using --queue-balance x:x+n instead of --queue-num, packets are balanced across the given queues and packets belonging to the same connection are put into the same nfqueue.
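
For instance, a hypothetical rule matching the two-queue Suricata invocation above could be (the exact chain and matches are up to your setup):

iptables -I FORWARD -j NFQUEUE --queue-balance 0:1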

NFQ mode setting

Presentation

One of the difficulties of IPS usage is building an adapted firewall ruleset. Following my blog post on this issue, I've decided to implement most of the existing modes in Suricata.
When running in NFQ inline mode, it is now possible to use a simulated non-terminal NFQUEUE verdict by using the 'repeat' mode.
This permits sending all needed packets to Suricata via a rule like:

iptables -I FORWARD -m mark ! --mark $MARK/$MASK -j NFQUEUE

And below it, you can have your standard filtering ruleset.

Configuration and other details

To activate this mode, you need to set mode to ‘repeat’ in the nfq section.

nfq:
  mode: repeat
  repeat_mark: 1
  repeat_mask: 1

This option issues the NF_REPEAT verdict instead of a standard NF_ACCEPT verdict. The effect is that the packet is sent back to the start of the table where the queuing decision was taken. As Suricata has set the mark $repeat_mark (with respect to the mask $repeat_mask) on the packet, when the packet reaches the iptables NFQUEUE rule again it no longer matches because of the mark, and the ruleset following this rule is used.
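
A minimal hypothetical example putting this together with the configuration above (repeat_mark and repeat_mask of 1, i.e. 0x1/0x1; the filtering rules are just placeholders):

iptables -I FORWARD 1 -m mark ! --mark 0x1/0x1 -j NFQUEUE
iptables -A FORWARD -p tcp --dport 22 -j ACCEPT
iptables -A FORWARD -j DROP

A packet first hits the NFQUEUE rule, gets verdicted NF_REPEAT by Suricata with the mark set, and then traverses the normal filtering policy on its second pass.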

The ‘mode’ option in the nfq section can have two other values:

  • accept (the default, simple mode)
  • route (which is a little bit tricky)

The idea behind the route option is that a program using NFQ can issue a verdict that sends the packet to another queue. It is thus possible to chain software using NFQ.
To activate this option, you have to use the following syntax:

nfq:
  mode: route
  route_queue: 2

There are not many uses for this option. One mad mind could think of chaining Suricata and Snort in IPS mode 😉

The nfq_set_mark keyword

Effect and usage of nfq_set_mark

A new keyword, nfq_set_mark, has been added to the rule options.
This is an enhancement of the NFQ mode. If a packet matches a rule using nfq_set_mark in NFQ mode, it is marked with the mark/mask specified in the option during the verdict.

The usage is really simple. In any rule, you can add the nfq_set_mark keyword to specify the mark to put on the packet and the mask to apply:

pass tcp any any -> any 80 (msg:"TCP port 80"; sid:91000004; nfq_set_mark:0x80/0x80; rev:1;)

That's great, but what can I do with it?

If you are not familiar with advanced networking capabilities, you may be wondering how this can be used. The Netfilter mark can be used to modify how a packet is handled by the network stack, i.e. routing, quality of service, Netfilter. The concept is thus the following: you can use Suricata to detect a suspect packet and decide to change the way it is handled by the network stack.
Thanks to the CONNMARK target, you can modify the way all packets of the connection are handled.
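
For instance, a hypothetical mangle-table setup propagating the mark set by the nfq_set_mark example above (0x80/0x80) to the whole connection could be:

# restore a previously saved connection mark onto each packet
iptables -t mangle -A PREROUTING -j CONNMARK --restore-mark
# save the packet mark (possibly set by Suricata) into the connection
iptables -t mangle -A POSTROUTING -j CONNMARK --save-mark

Later rules can then match on the mark (-m mark --mark 0x80/0x80) to apply QoS, logging or routing decisions.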

Among the possibilities:

  • Degrade QoS for a connection when a suspect packet has been seen
  • Trigger Netfilter logging of all subsequent packets of the connection (logging could be done in pcap for instance)
  • Change routing to send the traffic to dedicated equipment

More about Suricata multithread performance

Following my preceding post on Suricata multithread performance, I've decided to continue working on the subject.

By using the perf tool, I found out that when the number of detect threads increased, more and more time was spent in a spin lock. One possible explanation is that the default running mode for pcap files (RunModeFilePcapAuto) is not optimal. The single decode thread takes some time to process the packets and is not fast enough to feed the multiple detect threads. This triggers a lot of waiting and an increase in CPU usage. Following a discussion with Victor Julien, I decided to give a try to an alternate run mode for working on pcap files, RunModeFilePcapAutoFp.

The architecture of this run mode is different. One thread is in charge of reading the file, and the processing of packets is done in a pool of threads (from decode to output). Increasing the decoding power and reducing the decode/detect imbalance should bring some scalability.
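
As a side note, more recent Suricata versions let you list and select run modes directly from the command line; a hypothetical invocation would be (the exact options may not exist in the development snapshot discussed here):

suricata --list-runmodes
suricata --runmode autofp -c suricata.yaml -r benches/sandnet.pcap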

The following graph compares the Auto mode and the FP mode on the test system described in the previous post (a 24-thread server parsing a 6.1 GB pcap file). It displays the number of packets per second as a function of the number of threads:

The performance difference is really interesting. The FP mode shows an increase in performance with the number of threads. This is far better than the Auto run mode, where performance decreases as the number of threads grows.

As pointed out in a discussion on the OISF-users mailing list, multithread tuning has a real impact on performance. The results of the tests I've done are significant, but they only apply to the parsing of a big pcap file. You will have to tune Suricata to see how to get the best out of it on your system.

Optimizing Suricata on multicore CPUs

The Suricata IDS/IPS architecture makes heavy use of multithreading. In almost every runmode (PCAP, PCAP file, NFQ, …) it is possible to set the number of threads that are used for detection. This is the most CPU-intensive task, as it detects alerts by checking packets against the signatures. The number of threads is configured by setting a ratio which decides how many threads to run per available CPU (the detect_thread_ratio variable).
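
As an illustration, the ratio is set in suricata.yaml roughly like this (a sketch; the exact place and spelling of the variable may differ between versions):

threading:
  detect_thread_ratio: 1.0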

A discussion with Florian Westphal at NFWS 2010 convinced me that, for performance, it was necessary to tune the thread-to-CPU mapping with more granularity. I thus decided to improve this part of the Suricata code and recently submitted a series of patches that enable fine-grained settings.

I’ve been able to run some tests on a decent server:

processor : 23
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU X5650 @ 2.67GHz

This is a dual 6-core CPU system with hyperthreading activated and something like 12 GB of memory. From a Linux point of view, it has 24 identical processors.

I was not able to produce a decent testing environment (read: able to inject traffic), so all tests have been made on the parsing of a 6.1 GB pcap file captured on a real network.

I first tested my modifications and soon arrived at the conclusion that limiting the number of threads was a good idea. Following Victor Julien's suggestion, I played with the detect_thread_ratio variable to see its influence on performance. Here's the graph of performance relative to the ratio for the given server:

It seems that the 0.125 value (which corresponds to 3 threads) is the best value on this server.

If we launch the test with ratio 0.125 more than once, we can see that the run time varies between 50s and 64s with a mean of 59s. This means the variation is about 30% and the run time cannot easily be predicted.

It is now time to look at the results we can obtain by tuning the affinity. I've set up the following affinity configuration, which runs 3 detect threads on selected CPUs (for the record, 3 = 0.125 * 24):

  cpu_affinity:
    - management_cpu_set:
        cpu: [ 0 ]  # include only these cpus in affinity settings
    - receive_cpu_set:
        cpu: [ 1 ]  # include only these cpus in affinity settings
    - decode_cpu_set:
        cpu: [ "2" ]
        mode: "balanced"
    - stream_cpu_set:
        cpu: [ "0-4" ]
    - detect_cpu_set:
        #cpu: [ 6, 7, 8 ]
        cpu: [ 6, 8, 10 ]
        #cpu: [ 6, 12, 18 ]
        mode: "exclusive" # run detect threads in these cpus
        threads: 3
        prio:
           low: [ "0-4" ]
           medium: [ "5-23" ]
           default: "medium"
    - verdict_cpu_set:
        cpu: [ 0 ]
        prio:
           default: "high"
    - reject_cpu_set:
        cpu: [ 0 ]
        prio:
           default: "low"
    - output_cpu_set:
        cpu: [ "0" ]
        prio:
           default: "medium"

and I've played with the cpu variable of the detect_cpu_set setting. By using CPU sets coherent with the hardware architecture, we manage to get results with very small variation between runs:

  • All threads, including the receive one, on the same CPU while avoiding hyperthreading: 50-52s (detect threads running on 6,8,10)
  • All threads on the same CPU without avoiding hyperthreading: 60-62s (detect threads running on 6,7,8)
  • All threads on the same hardware CPU: 55-57s (avoiding hyperthreading, 4 threads) (4, 6, 8, 10 on detect)
  • Read and detect on different CPUs: 61-65s (detect threads on 6,12,18)

Thus we stabilize the best performance by remaining on the same hardware CPU and avoiding hyperthreaded sibling CPUs. This also explains the variation seen in the first tests: the only setting there was the number of threads, so we could end up with any of these setups (same or different hardware CPU, two threads running on a hyperthreaded core), or even flip between them during one test. That's why performance varied a lot between test runs.

The next step for me will be to run the perf tool on Suricata to estimate where the bottleneck is. More info to come!