Following my preceding post on suricata multithread performance I’ve decided to continue to work on the subject.
By using perf-tool, I found out that when the number of detect threads was increasing, more and more time was used in a spin lock. One of the possible explanation is that the default running mode for pcap file (RunModeFilePcapAuto) is not optimal. The only decode thread take some time to treat the packets and he is not fast enough to send data to the multiple detect threads. This is triggering a lot of wait and a CPU usage increase. Following a discussion with Victor Julien, I decide to give a try to an alternate run mode for working on pcap file, RunModeFilePcapAutoFp.
The architecture of this run mode is different. A thread is in charge of the reading of the file and the treatment of packets is done in a pool of threads (from decode to output). The augmentation of the power of decoding and the limitation of the ratio decode/detect would possibly bring some scalability.
The following graph is a comparison of the Auto mode and the FP mode on the test system described in the previous post (A 24 threads/core server parses a 6.1Go pcap file). It displays the number of packets per second in function of the number of threads:
The performance difference is really interesting. The FP mode shows a increase of the performance with the number of threads. This is far better than the Auto run mode where performance decrease with the number of threads.
As pointed out in a discussion on the OISF-users mailing list, multithread tuning has a real impact on performance. The result of the tests I’ve done are significant but they only apply to the parsing of a big pcap file. You will have to tune Suricata to see how to take the best of it on your system.
Maybe the best would actually to have a tool running the bench automatically to choose the best option for you upon Suricata startup. This way I wouldn’t bother with perftool setup and get results right away.
I fully agree. Suricata could use the treatment time of the packet in each phase to determine how to organize the threads. Really interesting but this look tricky 🙂
I think technically it wouldn’t be hard to add/remove threads at runtime… I think the challenge will be to determine if it’s necessary…
Anyway, good stuff Eric!