CIL
Reviewed by: Kevin Coogan and Jon Myers
Abstract
CIL refers to the C Intermediate Language, and a set of tools for manipulating and analysing that language. The tool converts C source code into CIL, analyzes and transforms it according to the users input, and outputs new CIL source. The CIL language is, according to its authors, a "highly structured, 'clean' subset of C," and thus can be compiled without further modification by compilers such as gcc and MSVC. Many standard algorithms are implemented in the included library, and the code includes the ability to implement customized analyses and transformations using the OCaML programming language.
Introduction
Basics
As any of us familiar with compilers know, the choice of an
intermediate representation for our language is an important one. CIL
is a unique language for representing C, because it is a simple subset
of C.
This is difficult to comprehend because we usually envision source
languages as ASCII text and intermediate representations as structured
data. CIL is a source language and an intermediate representation;
the CIL tools provide the capacity to convert conventional C to CIL as
a structured, annotated in-memory syntax tree, and to convert said CIL
syntax trees out to ASCII text. Further, CIL is simple enough that it
is easy to learn the relationship between the tree representation and
the equivalent text representation.
This approach has several benefits. First, all of us know the C
language, so using C as an intermediate representation we remove an
additional barrier to doing compiler-style transformations. Second,
it allows us to take advantage of existing infrastructure, because it
allows us to generate an intermediate representation of the program,
modify the program, and output it as a plain-text file which can then
be read and compiled to executable code by our native compiler of
choice.
CIL will also be useful to us because it will allow us to do our
obfuscations on C code with less hassle (as CIL is simpler than "full"
C) and because the CIL software package includes useful tools for
easily annotating, analyzing, and modifying programs which are
represented with CIL.
Note that this page will deal mainly with CIL as it applies to linux
and gcc, though of course it could be applied to any platform and
compiler which use standard C.
The CIL software package was written by Matt Haren, George Necula,
and Ben Libit. It is maintained as an open source project
on SourceForge.net, and can be
found
at http://sourceforge.net/projects/cil.
Matt Haren and George Necula are listed as project administrators.
CIL was written in the OCaML programming language, and can be extended
by writing OCaML modules.
The project was opened in May 2005, and while it is still accessible on SourceForge, the last update to the downloadable source is dated February 5, 2007. This update is listed as "Production/Stable."
How CIL Works
The primary means for using CIL is the CIL driver called cilly. cilly is a perl script that takes the place of the standard gcc or MSVC compiler. There is also a perl script named patcher that creates modified copies of system include files for use with cilly, so that these system files can be included in the analyses.
To understand the cilly driver and the process that it performs, one can consider the pipeline analogy. Each stage of the pipeline requires a certain type of input, does some work, and outputs some results in a form that can used by the next stage.
Stage 1 of cilly takes C source code files and performs pre-processing on them. The resulting output of this stage is C source in a single file, with #define and macros substituted in place in the code.
Stage 2 performs a series of transformations on the C source code (see Internals below), and subsequently fills in the OCaML data structures. These data structures essentially amount to an Abstract Syntax Tree representation of the original source code, though in fact they contain additional information not required for AST's, but used internally by CIL. The output of this stage is an OCaML object of type file.
Stage 3 is the Analysis and Transformation stage. This stage takes the OCaML file object, performs whatever analyses and transformations were requested on the command line, and outputs another file object. In fact, each transformation or analysis can be thought of as a sub-stage that takes a file object and returns a (possibly different) file object. This allows for complete customization of the analysis and transformation to be performed.
Stage 4 is the output stage. This stage takes the AST data structures in the file object and writes out to text files the text representation of the CIL program contained in them. The result is the new program written in CIL source code.
Stage 5 is the final stage and performs compilation. The cilly script calls the appropriate compiler -- gcc, MSVC, etc... -- on the CIL source code file generated in stage 4, passing any compiler arguments that were included on the command line. Since CIL is a subset of C, this source should compile in all cases.
Installation
Download
The source code can be downloaded by going to the CIL SourceForge Page and clicking the Download link. Or by clicking here to get it directly. This is the latest, stable production version of the code.
The latest working version can be obtained through the subversion site as follows:
svn co svn://hal.cs.berkeley.edu/home/svn/projects/trunk/cil
Build and Installation
Installation of the tool is fairly straightforward. There are two
requirements for installation of CIL. First, a Unix shell environment
such as bash or tcsh. In linux, of course, these come included with
the distribution. To install the tool in Windows, one will need to
install cygwin. The second
requirement is the installation
of OCaML, version 3.08 or higher.
(Download and installation of OCaML is outside the scope of this page,
and is handled elsewhere). Lastly, we will need an up-to-date version
of Perl to run some of the utility scripts. Once these requirements
are met, installation should be straight forward. Simply run the
configure script from the source code top level directory as
follows:
/home/me/cil> ./configure
where "/home/me/cil>" is just a generic linux-like prompt. Once the
configure script has run, simply make the code:
/home/me/cil> make
At this point, the perl scripts cilly and patcher
will be located in the distribution "bin" directory. In the above
example, this would be the "/home/me/cil/bin" directory. The scripts
can be used directly from here, or if the user has root access, they
can be installed with:
/home/me/cil> make install
Usage
Using cilly, the CIL driver
Using CIL, or more specifically, using cilly from the command line is very much like using gcc or MSVC. Here is a trivial example that uses cilly, but does not do any transformations or analysis.
Assume we have a file test.c that looks like this:
#include <stdio.h>
int main() {
     int n = 0;
     int m = 1;
     if(m>0) {
         printf("%d\n",m);
     } else {
         printf("%d\n",n);
     }
     return (0);
}
and is stored in the CIL source top level directory. Ordinarily, to compile this program with gcc and create an executable called test, you would execute something like the following:
/home/me/cil> gcc -o test test.c
Compiling with CIL is just as easy:
/home/me/cil> bin/cilly -o test test.c
will compile the code, create an executable named test, and produce output something similar to the following:
/home/me/cil/obj/x86_LINUX/cilly.asm.exe --out /tmp/cil-xTkATlhH.cil.c /tmp/cil-30UGVLqf.i
gcc -D_GNUCC -E /tmp/cil-xTkATlhH.cil.c -o /tmp/cil-mzZLdASc.cil.i
gcc -D_GNUCC -c -o /tmp/cil-b2GQAYzb.o /tmp/cil-mzZLdASc.cil.i
gcc -D_GNUCC -o test11 /tmp/cil-b2GQAYzb.o
Notice the same command line arguments are passed to CIL as to gcc. Also notice the files created in /tmp. These are temporary files used by CIL to transform then compile the original C source. They are deleted immediately after cilly finishes compiling, but we can look at them by passing the CIL specific argument "--save-temps" to cilly on the command line:
/home/me/cil> bin/cilly --save-temps -o test test.c
will produce file test.cil.c among others, which contains the
transformed CIL source code that is sent to GCC. Here is how it looks:
/* Generated by CIL v. 1.3.6 */
/* print_CIL_Input is true */
#line 337 "/usr/include/stdio.h"
extern int printf(char const * __restrict __format , ...) ;
#line 4 "test.c"
int main(void)
{ int n ;
int m ;
{
#line 5
n = 0;
#line 6
m = 1;
#line 7
if (m > 0) {
#line 8
printf((char const * __restrict )"%d\n", m);
} else {
#line 10
printf((char const * __restrict )"%d\n", n);
}
#line 12
return (0);
}
}
This example
also demonstrates the use of cilly specific command line
arguments in conjunction with gcc command line arguments. To get a
list of available CIL arguments, use the following:
/home/me/cil> bin/cilly --help
The above example is trivial. The authors of the tool claim that the tool has been used to build large projects like the linux kernel, and Gimp. We used the strace open source project from SourceForge to demonstrate cilly's usefulness on a moderate sized project. This tool is a command line executable that takes another executable as argument. It runs the argument executable and outputs information about what system calls were made. Ordinarily, to build strace, one would use the following:
/home/me/strace-4.5.18> ./configure
/home/me/strace-4.5.18> make
The strace configure script, like many configure scripts, looks at the "CC" environment variable to determine which compiler to use to build the tool. We were able to successfully build the strace project with the following commands:
/home/me/strace-4.5.18> setenv CC "/home/me/cil/bin/cilly"
/home/me/strace-4.5.18> ./configure
/home/me/strace-4.5.18> make
The test.c and strace examples demonstrate the basic use of cilly, but it doesn't do anything interesting. Now, we change the file test.c such that the main function has multiple return statements, and we call cilly with the "--dooneRet" argument. This argument is part of the CIL built-in library of functions (see Internals below for more detail). It transforms the function code such that there is only one return statement. This makes some analyses easier to perform. Here is the modified test.c file:
#include <stdio.h>
int main() {
     int n = 0;
     int m = 1;
     if(m>0) {
         printf("%d\n",m);
         return(1);
     } else {
         printf("%d\n",n);
         return(2);
     }
     return (0);
}
Our previous use of cilly:
/home/me/cil> bin/cilly --save-temps --noPrintLn -o test test.c
will produce the following snippet of code from the test.cil.c temp file:
if (m > 0) {
   printf((char const * __restrict )"%d\n", m);
   return (1);
} else {
   printf((char const * __restrict )"%d\n", n);
   return (2);
}
return (0);
(Note: the "--noPrintLn" option simply tells cilly not to print the line number of the original file where each CIL line came from. This makes the temp file less cluttered and easier to read)
However, if we run cilly with the "--dooneRet" command line option, we get the following code:
if (m > 0) {
   printf((char const * __restrict )"%d\n", m);
   __retres3 = 1;
   goto return_label;
} else {
   printf((char const * __restrict )"%d\n", n);
   __retres3 = 2;
   goto return_label;
}
__retres3 = 0;
return_label: /* CIL Label */
return (__retres3);
Note that the return statements in the if blocks have been replaced with temporary variables and goto statements to mimic the behavior of the original code.
Extending cilly with Your Own Features
As a final example, we consider the Make CFG module included in the
CIL library. When executed, this module doesn't actually do any code
transformations of its own, but it does fill in predefined some OCaML
fields that hold lists of predecessors and successors for each
statement. That is, it is building the Control Flow Graph for the
file. This is information that can be used by other modules later.
Here we present an example module which we add to cilly.
Modules can be called from the command line, as in previous examples,
or they can be called within a custom program. Consider the following
module written in OCaML and saved as mycfg.ml:
open Pretty
open Cil
open Cfg
open List
open Dominators
module E = Errormsg
module H = Hashtbl
class printCFGVisitor = object
  inherit nopCilVisitor
  method vstmt (s:stmt) =
   ignore(Pretty.printf "STATEMENT: %a\n" d_stmt s);
   let rec printStmtList = function
    [] -> Pretty.printf "\n"
    | x :: rest -> ignore(Pretty.printf "\n-->%a" d_stmt (x));
     printStmtList rest;
  in
  ignore(Pretty.printf "__________preds_____________\n");
  ignore(printStmtList s.preds);
  ignore(Pretty.printf "__________end preds_____________\n");
  ignore(Pretty.printf "__________succs_____________\n");
  ignore(printStmtList s.succs);
  ignore(Pretty.printf "__________end succs_____________\n");
  DoChildren
end
let feature : featureDescr =
  { fd_name = "mycfg";
  fd_enabled = ref false;
  fd_description = "output cfg result to stdout";
  fd_extraopt = [];
  fd_doit =
  (function (f: file) ->
   Cfg.computeFileCFG f;
   let eVisitor = new printCFGVisitor in
   visitCilFileSameGlobals eVisitor f);
  fd_post_check = true;
  }
As you can see from the example above, most interaction
with cilly is done through a visitor object. Above,
you can see that we created a new class called printCFGVisitor. When
we call visitCilFileSameGlobals (or, if we expect to change
globals within the file, visitCilFile with a visitor and a
file as parameters, our visitor visits every node in the abstract
syntax tree in a depth-first search and "appropriate" functions are
called based on the type of the node (for an instruction
node, vinst is called, for a function declaration
node vfundec is called, etc.)
The visitor functions should return either DoChildren in
order to continue visiting the node's children (the default content of
each function), SkipChildren to avoid visiting the children,
or ChangeTo followed by the contents of a new node which is
to replace this one. In the above example, we use DoChildren
to make sure that all possible stmt nodes are visited.
Before this custom module can be used, it is necessary to rerun the configure script to tell CIL that a new module exists, then to re-make the source code.
/home/me/cil> ./configure EXTRASRCDIRS=/home/me/cil/myModules EXTRAFEATURES=mycfg
/home/me/cil> make
After building the source, the command line option "--domycfg" will be listed when "--help" is called, and available as a command line argument. The topic of building custom modules is covered more extensively in the OCaML Review Page. The reader is encouraged to review this material.
Important to this discussion is the explicit call to CFG.computeFileCFG (fifth line from the bottom of the code). This call would be equivalent to calling cilly with the "--domakeCFG" command line option. Once this call is made, the predecessor and successor information is available for use. The simple "mycfg" only prints this information out, but of course it could be used to do more extensive analysis. The following can be used to run the "mycfg" module on the code:
/home/me/cil> bin/cilly --noPrintLn --domycfg -o test test.c
And a snippet of the output:
STATEMENT: return (2);
__________preds_____________
-->printf((char const * __restrict )"%d\n", n);
__________end preds_____________
__________succs_____________
__________end succs_____________
The "return(2)" statement has only one possible predecessor, which is the print statement immediately before it. As a return statement in main, it does not have any successors.
Here is another example of a module to extend cilly's functionality.
This is a very crude obfuscation tool which creates a new global
variable named "totally_not_guessable" in the file, then inserts the
following before each "real" instruction from the input file:
printf("");
totally_not_guessable++;
Here is the code:
open Pretty
open Cil
module E = Errormsg
module H = Hashtbl
class obfuscateVisitor (globalVar : varinfo) = object
inherit nopCilVisitor
val printfFun =
let fdec = emptyFunction "printf" in
fdec.svar.vtype <- TFun(intType,
Some [ ("prio", intType, []);
("format", charConstPtrType, []) ],
true, []);
fdec
method vinst (i : instr ) : instr list visitAction =
ChangeTo
[ Call((None), (Lval(Var(printfFun.svar),NoOffset)),
[mkString ""], locUnknown);
Set(
(Var(globalVar), NoOffset),
BinOp(PlusA,
Lval(Var(globalVar), NoOffset),
one, Cil.intType),
locUnknown);
i]
end
let feature : featureDescr =
{ fd_name = "obfuscate";
fd_enabled = ref false;
fd_description = "make the code less pleasant";
fd_extraopt = [];
fd_doit =
(function (f: file) ->
let myGlobalVar = makeGlobalVar "totally_not_guessable" Cil.intType in
f.globals <- GVarDecl(myGlobalVar, locUnknown)::f.globals;
let oVisitor = new obfuscateVisitor(myGlobalVar) in
visitCilFile oVisitor f);
fd_post_check = true;
}
Internals
Code Transformations
As previously discussed, CIL is a subset of the C programming language. In order for the CIL tool to generate CIL code, it must transform the original C source into this subset. There are many transformations that are performed. An extensive, but incomplete, list is available in the CIL Users Manual.
Included Internal Libraries
One of the important features of CIL is a fairly robust library of included modules. The modules perform many of the standard static analyses familiar to analysis community at large.
- Control-Flow Graphs -- builds control flow graph for whole files, or selected functions.
- Data Flow Analysis -- framework for standard forward and backward data flow analysis that requires only transfer functions as input.
- Dominators -- calculates immediate dominators.
- Points-to Analysis -- performs points-to analysis using either the Olf (One Level Flow) or Golf (Generalized One Level Flow) algorithms.
- StackGuard -- modifies the program to maintain a separate return address stack, to defend against stack smashing attacks.
- Heapify -- transforms the code to move all dangerous arrays to the heap.
- One Return -- transforms the code to force all functions to have only one return statement.
- Partial Evaluation and Constant Folding -- performs constant folding (simplification of constant expressions), simplification of if statements whose expressions evaluate to a constant, and call elimination for empty functions and functions that return a constant.
- Reaching Definitions -- standard reaching definitions analysis.
- Available Expressions -- standard available expressions analysis.
- Liveness Analysis -- standard liveness analysis.
- Dead Code Elimination -- eliminates code that can never be called or whose results are never used.
- Simple Memory Operations -- transforms the code so that all lvalues have at most one memory reference.
- Simple Three-Address Code -- transforms code into three-address code.
- Converting C to C++ -- outputs valid C++ code. The results of this module may not be 100% correct.
Evaluation
The CIL manual mentions two specific cases where code transformations
may not work precisely as intended. First, when CIL transforms C
source, it explicitly adds function prototypes that are called before
they are defined. However, the manual states that CIL is incapable of
adding prototypes for functions that are used, but are neither
declared nor defined. Second, CIL separates variable declarations
from assignments. Thus, statements like the following:
const int i = 1;
would be transformed into:
int i;
i = 0;
and as a consequence, the const operator must be dropped.
However, this seems unlikely to be problematic in a real world
programming situation. The manual also makes a more general statement
about not guaranteeing behavior for constructs that are undefined by
the C standard.
Consider, for example, the following (poorly-written) program:
#include <stdio.h>
#include <stdlib.h>
int main() {
int i = 10;
printf("%i\n", i++ + i);
}
cilly will remove the side effects as it converts to CIL, resulting in this code:
i = 10;
tmp = i;
i ++;
printf((char const * __restrict )"%i\n", tmp + i);
Ergo, this program now prints 21, whereas if we build with GCC, we get 20.
However, our own experience is that CIL is very robust, as evidenced
by our ability to compile strace and we have not found any
reasonable code examples that behave differently under C, including
our build of the strace project described in an earlier section.
Use of the tool itself is straight forward, and we have found that
extending the tool with custom modules to be fairly easy. For those
users not familiar with the OCaML programming language, there will be
a fairly steep learning curve to get started and become comfortable
with syntax, and especially with the concepts in a strongly typed
functional programming language.
Additionally, the data structures used by CIL to represent the
Abstract Syntax Tree can be quite confusing. There doesn't seem to be
a clear hierarchy of types here. For example, some stmt
(statement) types such as an if statement can have
multiple Block's. But the Block type is defined as
a list of attributes, and a list of stmt's. This circular
definition is, at the very least, counter-intuitive, and the
documentation of the internal representations is insufficient to fully
understand the structure, or the intent behind it.
Screenshots
As use of the tool is strictly command line, no screen shots were necessary for this discussion.
References