Coccinelle for the newbie

Coccinelle is a “program matching and transformation engine which provides the language SmPL (Semantic Patch Language) for specifying desired matches and transformations in C code”. From a user's point of view, it is a supercharged sed for C. Coccinelle knows C and thus has the intelligence needed to see past C formatting and to handle things better than you would have done by hand.

This article describes my own experience with Coccinelle. I have tried to collect here everything that I found useful to know. For a more complete and classical presentation, just go to the Coccinelle website.

My experience with Coccinelle started when I was planning some heavy modifications to Suricata.

The global picture

Using Coccinelle means:

  • writing a semantic patch, which looks like a normal patch but uses the SmPL grammar described on the Coccinelle website
  • running spatch on the files to change, with the semantic patch as a parameter

spatch -sp_file packet.cocci -in_place detect.c
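When the same semantic patch has to be applied to several files, the command above can be generated rather than typed. The following is only a sketch: it builds the argument lists using the two options shown above and executes nothing. The helper name spatch_command and the second file name decode.c are made up for the illustration.

```python
import shlex


def spatch_command(semantic_patch, source_file, in_place=True):
    """Build the argument list for one spatch run (nothing is executed)."""
    args = ["spatch", "-sp_file", semantic_patch]
    if in_place:
        # -in_place asks spatch to modify the source file directly
        args.append("-in_place")
    args.append(source_file)
    return args


# decode.c is a hypothetical second file to patch
commands = [spatch_command("packet.cocci", name) for name in ("detect.c", "decode.c")]
for cmd in commands:
    print(shlex.join(cmd))
```

Each list can then be handed to subprocess.run() once the printed commands look right.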

The basics

The first thing to do is to understand the syntax of a semantic patch. It starts with the parameter definitions and continues with the transformation. The parameter block is delimited by @@, and you can put the name of the transformation rule between the first pair of @. Here’s a small example:
[diff]
@rule1@
identifier p;
identifier func;
@@
func(...) {

Packet p;

- &(p)
+ p

}
[/diff]

The transformation rule is named rule1. It operates on two identifiers, p and func. An identifier stands for the name of a C object (a variable or a function, for example) and matches wherever the patch body uses it. Here, func is any function and p is any variable declared as a Packet inside that function.
With that in mind, you can now easily read the semantic patch:

For any Packet variable (identified by p) inside a function, replace every &p with a plain p

Coccinelle knows C

In the previous semantic patch, &(p) matches both &p and &(p) because the two forms are equivalent in the C language.
Coccinelle goes much further, as the following example shows:
[diff]
@rule2@
identifier p;
identifier func;
@@
func(...) {
<...
- Packet p;
+ Packet *p = SCMalloc(SIZE_OF_PACKET);
+ if (p == NULL) return 0;

+ SCFree(p);
return ...;
...>
}
[/diff]
This semantic patch replaces any Packet variable declared inside a function with a pointer to a Packet, which is allocated right at the declaration.
A NULL check is added as well.
But the interesting part is:
[diff]
+ SCFree(p);
return ...;
[/diff]
This says: add an SCFree of the packet before each return. This is a nice way to ensure that the memory is freed when leaving the function.

This last transformation can lead to the following result:
[diff]
- ret = Unified2Alert(&tv, &p, data, &pq, NULL);
- if(ret == TM_ECODE_FAILED)
+ }
+ ret = Unified2Alert(&tv, p, data, &pq, NULL);
+ if(ret == TM_ECODE_FAILED) {
+ SCFree(p);
return 0;
+ }
ret = Unified2AlertThreadDeinit(&tv, data);
- if(ret == -1)
+ if(ret == -1) {
+ SCFree(p);
return 0;
+ }
[/diff]
It is interesting to notice that every if that was not using braces now has a pair of them. Coccinelle is not a super sed; it really knows C.
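To see why a purely textual tool struggles here, the small Python sketch below applies sed-like substitutions to the Unified2Alert line from the result above. It only illustrates the limits of text replacement and is not how Coccinelle works internally. The memset line with its size argument is a made-up example.

```python
import re

line = "ret = Unified2Alert(&tv, &p, data, &pq, NULL);"

# Naive textual substitution: it also corrupts &pq, which merely starts with "&p"
naive = line.replace("&p", "p")
print(naive)

# A word-boundary regex avoids that particular trap
bounded = re.sub(r"&p\b", "p", line)
print(bounded)

# But it does not know that &(p) means the same thing as &p in C
# (hypothetical line, "size" is a placeholder)
other = "memset(&(p), 0, size);"
untouched = re.sub(r"&p\b", "p", other)
print(untouched)
```

A regular expression can be refined further, but it still has no notion of the type of p or of the function it is declared in, which is exactly what the semantic patch relies on.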

Some tips

Here are some points about Coccinelle's behaviour:

  • Coccinelle will only identify a variable once
  • Coccinelle will only add a line once when a line addition is requested

To understand, let’s reuse the rule2 previously defined. If our source file contains:
[C]
Packet p1;
Packet p2;
[/C]
Then Coccinelle will only do the substitution once. To operate on every Packet, we have to use <... ...> to define a block within which the pattern can be matched and modified more than once. This gives the following patch:
[diff]
@rule2@
identifier p;
identifier func;
@@
func(...) {
<...
- Packet p;
+ Packet *p = SCMalloc(SIZE_OF_PACKET);
+ if (p == NULL) return 0;

+ SCFree(p);
return ...;
...>
}
[/diff]
But here we will trigger an error:

Fatal error: exception Failure("rule2: already tagged token:

This is because there are two Packet variables in the code, so two SCFree() calls have to be added.
As mentioned before, Coccinelle assumes it will add a line only once. We therefore need to tell it that we are prepared to add the SCFree() line more than once. This is simply done by adding an extra + at the start of the line:
[diff]
@rule2@
identifier p;
identifier func;
@@
func(...) {
<...
- Packet p;
+ Packet *p = SCMalloc(SIZE_OF_PACKET);
++ if (p == NULL) return 0;

++ SCFree(p);
return ...;
...>
}
[/diff]

The result

Here is a slightly improved version of the semantic patch that I used to patch Suricata:
[diff]
@rule0@
identifier p;
identifier func;
typedef uint8_t;
typedef Packet;
@@
func(...) {
<...
Packet p;

- memset(&p, 0, ...);
+ memset(&p, 0, SIZE_OF_PACKET);
+ p->pkt = (uint8_t *)(p + 1);
...>
}

@rule1@
identifier p;
identifier func;
identifier fdl;
@@
func(...) {
<...
Packet p;

(
- p.fdl
+ p->fdl
|
- &(p)
+ p
)
...>
}

@rule2@
identifier p;
identifier func;
statement S;
@@
func(...) {
<...
Packet
- p
+ *p = SCMalloc(SIZE_OF_PACKET)
;
++ if (p == NULL) return 0;
S

++ SCFree(p);
return ...;
...>
}
[/diff]

Code testing

I’ve started implementing some code checking based on Coccinelle's ability to use Python. The idea is to detect invalid uses of internal conventions and APIs. My first target was to develop a check that, if a Packet is initialised via SCMalloc(), the pkt field is correctly set and not zeroed afterwards. I wrote the following semantic patch to do so:

@pktset@
typedef Packet;
identifier p;
identifier func;
position p0;
@@
func(...) { 
...
Packet *p@p0 = SCMalloc(...);
...
}

@pktdata@
identifier pktset.p;
identifier pktset.func;
position p1;
@@
func(...) { 
...
Packet *p = SCMalloc(...);
...
(
p@p1->pkt =  ...
)
...
}

@pktzero@
identifier pktset.p;
identifier pktset.func;
position p2;
@@
func(...) { 
...
Packet *p = SCMalloc(...);
...
(
memset(p@p2, 0, ...)
)
...
}

@script:python depends on !pktdata@
p0 << pktset.p0;
@@
print("%s: No pkt setting for Packet* allocated at %s" % (p0[0].file, p0[0].line))

@script:python depends on pktset && pktdata && pktzero@
p0 << pktset.p0;
p1 << pktdata.p1;
p2 << pktzero.p2;
@@
if int(p1[0].line) <= int(p2[0].line):
    print("%s: Packet data set at %s but zeroed at %s" % (p1[0].file, p1[0].line, p2[0].line))

To do so, I've used positions. Each rule tries to find a specific pattern and stores the associated position. Once this is done, Coccinelle runs the scripts.
The first script handles the case where pkt is not set:

@script:python depends on !pktdata@
p0 << pktset.p0;
@@
print("%s: No pkt setting for Packet* allocated at %s" % (p0[0].file, p0[0].line))

It is executed if and only if the pktdata rule, which looks for the setting of pkt, has not matched (depends on !pktdata). In this case, p0 is set (we have an SCMalloc()) and the script prints a message warning about the missing assignment to pkt.
The second script handles the case of memset() usage:

@script:python depends on pktset && pktdata && pktzero@
p0 << pktset.p0;
p1 << pktdata.p1;
p2 << pktzero.p2;
@@
if int(p1[0].line) <= int(p2[0].line):
    print("%s: Packet data set at %s but zeroed at %s" % (p1[0].file, p1[0].line, p2[0].line))

Here, pktset and pktdata have matched and the code also uses memset() on p (hence the depends on clause). In this case, we compare the lines of the matches p1 and p2. If the p1 line is not greater than the p2 line, then the structure has been zeroed after pkt was set.
An example run on a deliberately broken file gives the following result:

eric@ice-age:~/git/oisf/src (suric)$ spatch -sp_file ../qa/coccinelle/init-packet.cocci -out_place detect.c
init_defs_builtins: /usr/share/coccinelle/standard.h
HANDLING: detect.c
detect.c: No pkt setting for Packet* allocated at 7722
detect.c: Packet data set at 5026 but zeroed at 5027
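The line comparison done by the second script can be tried outside of Coccinelle. The sketch below is a stand-alone approximation: check_order is a made-up helper, and the line numbers are passed as strings on the assumption that this is how the position objects expose them, which would explain the int() conversion in the script.

```python
def check_order(filename, set_line, zero_line):
    """Return a warning when the memset() line is not before the pkt assignment line."""
    # Convert to int first: compared as strings, "999" would sort after "1000"
    if int(set_line) <= int(zero_line):
        return "%s: Packet data set at %s but zeroed at %s" % (filename, set_line, zero_line)
    return None


print(check_order("detect.c", "5026", "5027"))   # pkt set, then zeroed: warning
print(check_order("detect.c", "5027", "5026"))   # zeroed first, then set: None
print("999" <= "1000", int("999") <= int("1000"))  # string vs integer comparison
```

Comparing line numbers is only a heuristic: it says nothing about the actual execution order when the two statements sit in different branches.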

Enhanced version

Here's another version of the semantic patch, written by following Julia Lawall's suggestions (see comments). Her idea is to detect, in a first rule, an SCMalloc() that is not followed by any setting of pkt, and then to add another rule which detects an assignment followed by a memset. Her comments are largely reproduced here.
[diff]
@nodata forall@
Packet *p;
expression E1,E2;
position p0;
@@
...
p@p0 = SCMalloc(...)
... when != p->pkt = E1
? p=E2

@script:python@
p0 << nodata.p0;
@@
print("%s: Packet allocated at l. %s but not initialized" % (p0[0].file, p0[0].line))
[/diff]
This rule is designated forall so that it checks that there is no setting of the pkt field along all control-flow paths, until (optionally, as specified by the ?) p is reassigned. The other option is exists, but in this case, that is probably not what is wanted, because there could be some error handling code paths in which it is reasonable not to initialize the pkt field. p is declared as a pointer to a Packet; Coccinelle parses the headers and is able to find the type and its definition. This is needed to handle the p->pkt construction correctly.
[diff]
@overwritten exists@
expression p,E;
position p0,p1,p2;
@@

p@p0 = SCMalloc(...)
...
p->pkt@p1 = E
...
memset@p2(p,0,...)
[/diff]
This rule is designated exists, because we want to give an error message if there exists any such execution path. If there is a danger of p being reinitialized in the ... part, one could add when != p = E1 and when != p = E2, respectively for expression metavariables E1 and E2.
[diff]
@script:python@
p0 << overwritten.p0;
p1 << overwritten.p1;
p2 << overwritten.p2;
@@
print("%s: Packet data set at l. %s but zeroed at l. %s" % (p1[0].file, p1[0].line, p2[0].line))
[/diff]
This last part simply prints the result when an invalid pattern has been found. However, this is not enough, as we do not check functions where multiple Packet variables are defined. To fix this, we can simply change the first rule to use the <+... ...+> construction:
[diff]
@nodata forall@
Packet *p;
expression E1,E2;
position p0;
@@
<+...
p@p0 = SCMalloc(...)
... when != p->pkt = E1
? p=E2
...+>
[/diff]
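The difference between the forall and exists annotations used in the rules above can be pictured with a toy model. In the sketch below, every control-flow path after the SCMalloc() call is reduced to a list of event names. This is only a mental model with made-up event names and helper functions; it is not how Coccinelle is implemented.

```python
# Toy model: each path is the list of events seen after the SCMalloc() call.
# The event names are invented for this illustration.
SET_PKT = "set_pkt"


def matches_forall(paths):
    """forall: every path must reach its end without setting pkt."""
    return all(SET_PKT not in path for path in paths)


def matches_exists(paths):
    """exists: a single path without a pkt setting is enough."""
    return any(SET_PKT not in path for path in paths)


# Function that bails out early on an error path but sets pkt otherwise
with_error_path = [["error", "return"], [SET_PKT, "return"]]
# Function that never sets pkt at all
never_set = [["error", "return"], ["use", "return"]]

for name, paths in (("with_error_path", with_error_path), ("never_set", never_set)):
    print(name, "forall:", matches_forall(paths), "exists:", matches_exists(paths))
```

In this model, exists would flag the function with a legitimate error path, while forall only flags the function that never sets pkt, which is the behaviour wanted for the nodata rule.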

Suppressing some function calls

SCMalloc is a Suricata function which does the allocation and displays an alert message in case of failure (only in the init phase). It is thus not necessary to call a logging function after an allocation failure. Some of the developers are not aware of this, and redundant messages are present in the code. I therefore wrote the following semantic patch to get rid of them:
[diff]
@rule1@
expression p;
expression E;
identifier sclog =~ "SCLog.*$";
@@
p = SCMalloc(...);
... when != (p = E);
if (p == NULL) {
- sclog(...);
...
}
[/diff]
This semantic patch uses regular expression matching: all logging functions (SCLogInfo for example) start with SCLog. The patch thus detects when a variable is allocated and then tested for NULL. If the conditional block starts with a logging call, that call gets suppressed. The construction using p = E limits the substitution to cases where p is not reassigned between the SCMalloc and the equality test.
The patch uses an expression, which is broader than an identifier p construction. This seems minor, but the patch will also modify code using expressions like mpm_ctx->ctx, where the identifier version would only match when mpm_ctx itself is set.

Using regular expressions to filter on identifiers

If you want to apply a substitution to a subset of functions, for example, you can use Coccinelle's filtering ability. For example, to apply a transformation only to functions whose name matches Copy, you can use:
[diff]
@@
identifier func =~ "Copy";
typedef Packet;
Packet *p;
@@
- p->pkt
+ p->ext_pkt
[/diff]
The =~ operator takes a regular expression as its argument.
If you want to exclude some identifiers from a rule, you can use the !=~ operator. For example, to skip a specific function, here PacketCopy, one can use:
[diff]
@@
identifier func !=~ "^PacketCopy$";
@@
[/diff]
The special characters of the regular expression need to be escaped. Thus, for a list of words, you have to use:
[diff]
identifier func =~ "^\(sprintf\|strcat\|strcpy\)$";
[/diff]
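The effect of these three filters can be previewed with Python's re module. This is only an approximation: it assumes that =~ behaves like an unanchored search, which is consistent with the use of ^ and $ above, and Python's syntax differs in that grouping and alternation are written ( | ) without backslashes. The names PacketCopyData and CopyBuffer are invented for the test list.

```python
import re

# Candidate function names; PacketCopyData and CopyBuffer are hypothetical
names = ["PacketCopy", "PacketCopyData", "CopyBuffer", "strcpy", "strncpy", "sprintf"]

# Equivalent of: identifier func =~ "Copy";
contains_copy = [n for n in names if re.search(r"Copy", n)]

# Equivalent of: identifier func !=~ "^PacketCopy$";
not_packetcopy = [n for n in names if not re.search(r"^PacketCopy$", n)]

# Equivalent of the word list, written in Python's regex syntax
word_list = [n for n in names if re.search(r"^(sprintf|strcat|strcpy)$", n)]

print(contains_copy)
print(not_packetcopy)
print(word_list)
```

Without the ^ and $ anchors, the word list would also select strncpy-like names that merely contain one of the words, so anchoring matters as soon as an exact name is wanted.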

19 thoughts on “Coccinelle for the newbie”

  1. Here is another solution for the code testing example. Comments are interspersed.

    @nodata forall@
    expression p,E1,E2;
    position p0;
    @@

    p@p0 = SCMalloc(…)
    … when != p->data = E1
    ? p=E2

    This rule is designated forall so that it checks that there is no setting of the data field along all control-flow paths, until (optionally, as specified by the ?) p is reassigned. The other option is exists, but in this case, that is probably not what is wanted, because there could be some error handling code paths in which it is reasonable not to initialize the data field.

    p is also declared as an expression rather than an identifier. This is a bit more general, and will capture the case of a variable initialization as well.

    @script:python@
    p0 <data@p1 = E

    memset@p2(p,0,…)

    This rule is designated exists, because we want to give an error message if there exists any such execution path. If there is a danger of p being reinitialized in the … part, one could add when != p = E1 and when != p = E2, respectively for expression metavariables E1 and E2.

    @script:python@
    p0 << overwritten.p0;
    p1 << overwritten.p1;
    p2 << overwritten.p2;
    @@
    print "%s: Packet data set at %s but zeroed at %s" % (p1[0].file, p1[0].line, p2[0].line)

  2. Oops, the middle part of my reply got eaten…

    @nodata forall@
    expression p,E1,E2;
    position p0;
    @@

    p@p0 = SCMalloc(…)
    … when != p->data = E1
    ? p=E2

    @script:python@
    p0 <data@p1 = E

    memset@p2(p,0,…)

    @script:python@
    p0 << overwritten.p0;
    p1 << overwritten.p1;
    p2 << overwritten.p2;
    @@
    print "%s: Packet data set at %s but zeroed at %s" % (p1[0].file, p1[0].line, p2[0].line)

  3. Eaten again, sorry. There should be a rule overwritten just above the last python rule, which is defined as follows. There is also a python rule in the middle to print an error message, but that seems to cause problems for the blog software…

    @overwritten exists@
    expression p,E;
    position p0,p1,p2;
    @@

    p@p0 = SCMalloc(…)

    p->data@p1 = E

    memset@p2(p,0,…)

  4. Hi, I have started using Coccinelle recently and your article really helped me understanding the software. Thank You!
    However, it seems that in the new version of Coccinelle, the syntax for using regular expressions has changed a bit. The ~= syntax does not work anymore; it has been changed to =~ (the symbols are swapped). So please update the article so that no other newbie sits scratching his head! 😀

  5. Hello, I am confused at and . Could you tell me the difference between them? Thank you so much.

  6. Last message didn’t appear well. I mean…I am confused at “” and “”

  8. The code formatting in “code testing” section is currently messed up.
