I’m currently working on a script that parses Suricata EVE log files and try to detect if some fields in the log are present in a list of bad patterns. So the script has two parts which are reading the log file and searching for the string in a list of strings. This list can be big with a target of around 20000 strings.
Note: This post may seem trivial for real Python developers but as I did not manage to find any documentation on this here is this blog post.
The testsFor my test I have used a 653Mo log file containing 896077 lines. Reading this JSON formatted file is taking 5.0s. As my list of strings was around 3000 elements so far below targeted size, a thumb rules was saying that I will be ok if script stayed below 6 seconds with the matching added. First test was a simple Python style inclusion test with the hostname being put in a list:
if event['http']['hostname'] in hostname_list:For that test, the result was 9.5s so not awful but a bit over my expectation. Just to check I have run a test with a C-like implementation:
for elt in hostname_list: if elt == target: # we have a winnerResult was a nice surprise, … for Python, with a execution time of 20.20s.
I was beginning to fear some development to be able to reach the speed I needed and I gave a last try. As I was taking care of match, I can transform my list of strings in a Python set thus only getting unique elements. So I have run the test using:
if event['http']['hostname'] in set(hostname_list):
Result was an amazing execution time of 5.20s. Only 0.20s were used to check data against my set of strings.
Python set required elements to be hashable. It is needed because internal implementation is using dictionary. So looking for an element in a set is equivalent to look for an element in a hash table. And this is really faster than searching in a list where there is no real magic possible.
So if you only care about match and if your elements are hashable then use Python set to test for existence of a object in your set.