Introduction
Perf is a great tool to analyse performances on Linux boxes. For example, perf top will give you this type of output on a box running Suricata on a high speed network:
Events: 32K cycles 28.41% suricata [.] SCACSearch 19.86% libc-2.15.so [.] tolower 17.83% suricata [.] SigMatchSignaturesBuildMatchArray 6.11% suricata [.] SigMatchSignaturesBuildMatchArrayAddSignature 2.06% suricata [.] tolower@plt 1.70% libpthread-2.15.so [.] pthread_mutex_trylock 1.17% suricata [.] StreamTcpGetFlowState 1.10% libc-2.15.so [.] __memcpy_ssse3_back 0.90% libpthread-2.15.so [.] pthread_mutex_lock
The functions are sorted by CPU consumption. Using arrow key it is possible to jump into the annotated code to see where most CPU cycles are used.
This is really useful but in the case of a function like pthread_mutex_trylock, the interesting part is to be able to find where this function is called.
Getting function call graph in perf
This stack overflow question lead me to the solution.
I’ve started to build suricata with the -fno-omit-frame-pointer option:
./configure --enable-pfring --enable-luajit CFLAGS="-fno-omit-frame-pointer" make make install
Once suricata was restarted (with pid being 9366), I was then able to record the data:
sudo perf record -a --call-graph -p 9366
Extracting the call graph was then possible by running:
sudo perf report --call-graph --stdio
The result is a huge detailed report. For example, here’s the part on pthread_mutex_lock:
0.94% Suricata-Main libpthread-2.15.so [.] pthread_mutex_lock | --- pthread_mutex_lock | |--48.69%-- FlowHandlePacket | | | |--53.04%-- DecodeUDP | | | | | |--95.84%-- DecodeIPV4 | | | | | | | |--99.97%-- DecodeVLAN | | | | DecodeEthernet | | | | DecodePfring | | | | TmThreadsSlotVarRun | | | | TmThreadsSlotProcessPkt | | | | ReceivePfringLoop | | | | TmThreadsSlotPktAcqLoop | | | | start_thread | | | --0.03%-- [...] | | | | | --4.16%-- DecodeIPV6 | | | | | |--97.59%-- DecodeTunnel | | | | | | | |--99.18%-- DecodeTeredo | | | | DecodeUDP | | | | DecodeIPV4 | | | | DecodeVLAN | | | | DecodeEthernet | | | | DecodePfring | | | | TmThreadsSlotVarRun | | | | TmThreadsSlotProcessPkt | | | | ReceivePfringLoop | | | | TmThreadsSlotPktAcqLoop | | | | start_thread | | | | | | | --0.82%-- DecodeIPV4 | | | DecodeVLAN | | | DecodeEthernet | | | DecodePfring | | | TmThreadsSlotVarRun | | | TmThreadsSlotProcessPkt | | | ReceivePfringLoop | | | TmThreadsSlotPktAcqLoop | | | start_thread | | | | | --2.41%-- DecodeIPV6 | | DecodeTunnel | | DecodeTeredo | | DecodeUDP | | DecodeIPV4 | | DecodeVLAN | | DecodeEthernet | | DecodePfring | | TmThreadsSlotVarRun | | TmThreadsSlotProcessPkt | | ReceivePfringLoop | | TmThreadsSlotPktAcqLoop | | start_thread