CIL <title> <center> <h1> CIL </h1> <br> <h3>Reviewed by: Kevin Coogan and Jon Myers</h3> </center> <h2>Abstract</h2> CIL refers to the <u>C</u> <u>I</u>ntermediate <u>L</u>anguage, and a set of tools for manipulating and analysing that language. The tool converts C source code into CIL, analyzes and transforms it according to the users input, and outputs new CIL source. The CIL language is, according to its authors, a "highly structured, 'clean' subset of C," and thus can be compiled without further modification by compilers such as gcc and MSVC. Many standard algorithms are implemented in the included library, and the code includes the ability to implement customized analyses and transformations using the OCaML programming language. <h2>Introduction</h2> <h3>Basics</h3> <p> As any of us familiar with compilers know, the choice of an intermediate representation for our language is an important one. CIL is a unique language for representing C, because it is a simple subset of C. </p> <p> This is difficult to comprehend because we usually envision source languages as ASCII text and intermediate representations as structured data. CIL is a source language <em>and</em> an intermediate representation; the CIL tools provide the capacity to convert conventional C to CIL as a structured, annotated in-memory syntax tree, and to convert said CIL syntax trees out to ASCII text. Further, CIL is simple enough that it is easy to learn the relationship between the tree representation and the equivalent text representation. </p> <p> This approach has several benefits. First, all of us know the C language, so using C as an intermediate representation we remove an additional barrier to doing compiler-style transformations. Second, it allows us to take advantage of existing infrastructure, because it allows us to generate an intermediate representation of the program, modify the program, and output it as a plain-text file which can then be read and compiled to executable code by our native compiler of choice. </p> <p> CIL will also be useful to us because it will allow us to do our obfuscations on C code with less hassle (as CIL is simpler than "full" C) and because the CIL software package includes useful tools for easily annotating, analyzing, and modifying programs which are represented with CIL. </p> <p> Note that this page will deal mainly with CIL as it applies to linux and gcc, though of course it could be applied to any platform and compiler which use standard C.<br> </p> <br>The CIL software package was written by Matt Haren, George Necula, and Ben Libit. It is maintained as an open source project on <a href="http://sourceforge.net">SourceForge.net</a>, and can be found at <a href="http://sourceforge.net/projects/cil">http://sourceforge.net/projects/cil</a>. Matt Haren and George Necula are listed as project administrators. CIL was written in the OCaML programming language, and can be extended by writing OCaML modules.<br> <br> The project was opened in May 2005, and while it is still accessible on SourceForge, the last update to the downloadable source is dated February 5, 2007. This update is listed as "Production/Stable."<br> <h3>How CIL Works</h3> The primary means for using CIL is the CIL driver called <tt>cilly</tt>. <tt>cilly</tt> is a perl script that takes the place of the standard gcc or MSVC compiler. There is also a perl script named <tt>patcher</tt> that creates modified copies of system include files for use with <tt>cilly</tt>, so that these system files can be included in the analyses.<br> <br> To understand the <tt>cilly</tt> driver and the process that it performs, one can consider the pipeline analogy. Each stage of the pipeline requires a certain type of input, does some work, and outputs some results in a form that can used by the next stage.<br> <br> <img src="cil-illustration.png"> <br> <br> Stage 1 of <tt>cilly</tt> takes C source code files and performs pre-processing on them. The resulting output of this stage is C source in a single file, with <tt>#define</tt> and macros substituted in place in the code.<br> <br> Stage 2 performs a series of transformations on the C source code (see Internals below), and subsequently fills in the OCaML data structures. These data structures essentially amount to an Abstract Syntax Tree representation of the original source code, though in fact they contain additional information not required for AST's, but used internally by CIL. The output of this stage is an OCaML object of type <tt>file</tt>.<br> <br> Stage 3 is the Analysis and Transformation stage. This stage takes the OCaML <tt>file</tt> object, performs whatever analyses and transformations were requested on the command line, and outputs another <tt>file</tt> object. In fact, each transformation or analysis can be thought of as a sub-stage that takes a <tt>file</tt> object and returns a (possibly different) <tt>file</tt> object. This allows for complete customization of the analysis and transformation to be performed.<br> <br> Stage 4 is the output stage. This stage takes the AST data structures in the <tt>file</tt> object and writes out to text files the text representation of the CIL program contained in them. The result is the new program written in CIL source code. Stage 5 is the final stage and performs compilation. The <tt>cilly</tt> script calls the appropriate compiler -- gcc, MSVC, etc... -- on the CIL source code file generated in stage 4, passing any compiler arguments that were included on the command line. Since CIL is a subset of C, this source should compile in all cases. <h2>Installation</h2> <h3>Download</h3> The source code can be downloaded by going to the <a href="http://sourceforge.net/projects/cil">CIL SourceForge Page</a> and clicking the Download link. Or by clicking <a href="http://sourceforge.net/project/platformdownload.php?group_id=138953">here</a> to get it directly. This is the latest, stable production version of the code.<br> <br> The latest working version can be obtained through the subversion site as follows:<br> <br> <tt>svn co svn://hal.cs.berkeley.edu/home/svn/projects/trunk/cil</tt><br> <br> <h3>Build and Installation</h3> Installation of the tool is fairly straightforward. There are two requirements for installation of CIL. First, a Unix shell environment such as bash or tcsh. In linux, of course, these come included with the distribution. To install the tool in Windows, one will need to install <a href="http://www.cygwin.com/">cygwin</a>. The second requirement is the installation of <a href="http://caml.inria.fr/">OCaML</a>, version 3.08 or higher. (Download and installation of OCaML is outside the scope of this page, and is handled elsewhere). Lastly, we will need an up-to-date version of Perl to run some of the utility scripts. Once these requirements are met, installation should be straight forward. Simply run the configure script from the source code top level directory as follows:<br> <br> <tt> /home/me/cil> ./configure</tt><br> <br> where "/home/me/cil>" is just a generic linux-like prompt. Once the configure script has run, simply make the code:<br> <br> <tt> /home/me/cil> make</tt><br> <br> At this point, the perl scripts <tt>cilly</tt> and <tt>patcher</tt> will be located in the distribution "bin" directory. In the above example, this would be the "/home/me/cil/bin" directory. The scripts can be used directly from here, or if the user has root access, they can be installed with:<br> <br> <tt> /home/me/cil> make install</tt><br> <br> <h2>Usage</h2> <h3>Using cilly, the CIL driver</h3> Using CIL, or more specifically, using <tt>cilly</tt> from the command line is very much like using gcc or MSVC. Here is a trivial example that uses <tt>cilly</tt>, but does not do any transformations or analysis.<br> <br> Assume we have a file <tt>test.c</tt> that looks like this:<br> <br> <tt> #include <stdio.h><br> <br> int main() {<br>      int n = 0;<br>      int m = 1;<br>      if(m>0) {<br>          printf("%d\n",m);<br>      } else {<br>          printf("%d\n",n);<br>      }<br>      return (0);<br> }<br> </tt><br> and is stored in the CIL source top level directory. Ordinarily, to compile this program with gcc and create an executable called <tt>test</tt>, you would execute something like the following:<br> <br> <tt> /home/me/cil> gcc -o test test.c</tt><br> <br> Compiling with CIL is just as easy:<br> <br> <tt> /home/me/cil> bin/cilly -o test test.c</tt><br> <br> will compile the code, create an executable named <tt>test</tt>, and produce output something similar to the following:<br> <br> <tt> /home/me/cil/obj/x86_LINUX/cilly.asm.exe --out /tmp/cil-xTkATlhH.cil.c /tmp/cil-30UGVLqf.i<br> gcc -D_GNUCC -E /tmp/cil-xTkATlhH.cil.c -o /tmp/cil-mzZLdASc.cil.i<br> gcc -D_GNUCC -c -o /tmp/cil-b2GQAYzb.o /tmp/cil-mzZLdASc.cil.i<br> gcc -D_GNUCC -o test11 /tmp/cil-b2GQAYzb.o<br> </tt><br> <br> Notice the same command line arguments are passed to CIL as to gcc. Also notice the files created in <tt>/tmp</tt>. These are temporary files used by CIL to transform then compile the original C source. They are deleted immediately after <tt>cilly</tt> finishes compiling, but we can look at them by passing the CIL specific argument "--save-temps" to <tt>cilly</tt> on the command line:<br> <br> <tt> /home/me/cil> bin/cilly --save-temps -o test test.c</tt><br> <br> will produce file <tt>test.cil.c</tt> among others, which contains the transformed CIL source code that is sent to GCC. Here is how it looks: <pre> /* Generated by CIL v. 1.3.6 */ /* print_CIL_Input is true */ #line 337 "/usr/include/stdio.h" extern int printf(char const * __restrict __format , ...) ; #line 4 "test.c" int main(void) { int n ; int m ; { #line 5 n = 0; #line 6 m = 1; #line 7 if (m > 0) { #line 8 printf((char const * __restrict )"%d\n", m); } else { #line 10 printf((char const * __restrict )"%d\n", n); } #line 12 return (0); } } </pre> <br><br> This example also demonstrates the use of <tt>cilly</tt> specific command line arguments in conjunction with gcc command line arguments. To get a list of available CIL arguments, use the following:<br> <br> <tt> /home/me/cil> bin/cilly --help</tt><br> <br> The above example is trivial. The authors of the tool claim that the tool has been used to build large projects like the linux kernel, and Gimp. We used the <a href="http://sourceforge.net/projects/strace/">strace</a> open source project from SourceForge to demonstrate <tt>cilly</tt>'s usefulness on a moderate sized project. This tool is a command line executable that takes another executable as argument. It runs the argument executable and outputs information about what system calls were made. Ordinarily, to build <tt>strace</tt>, one would use the following:<br> <br> <tt> /home/me/strace-4.5.18> ./configure<br> /home/me/strace-4.5.18> make<br> <br> </tt> The <tt>strace</tt> configure script, like many configure scripts, looks at the "CC" environment variable to determine which compiler to use to build the tool. We were able to successfully build the <tt>strace</tt> project with the following commands:<br> <br> <tt> /home/me/strace-4.5.18> setenv CC "/home/me/cil/bin/cilly"<br> /home/me/strace-4.5.18> ./configure<br> /home/me/strace-4.5.18> make<br> </tt> <br> The <tt>test.c</tt> and <tt>strace</tt> examples demonstrate the basic use of <tt>cilly</tt>, but it doesn't do anything interesting. Now, we change the file <tt>test.c</tt> such that the main function has multiple return statements, and we call <tt>cilly</tt> with the "--dooneRet" argument. This argument is part of the CIL built-in library of functions (see Internals below for more detail). It transforms the function code such that there is only one return statement. This makes some analyses easier to perform. Here is the modified <tt>test.c</tt> file:<br> <br> <tt> #include <stdio.h><br> <br> int main() {<br>      int n = 0;<br>      int m = 1;<br>      if(m>0) {<br>          printf("%d\n",m);<br>          return(1);<br>      } else {<br>          printf("%d\n",n);<br>          return(2);<br>      }<br>      return (0);<br> }<br> </tt><br> Our previous use of <tt>cilly</tt>:<br> <br> <tt> /home/me/cil> bin/cilly --save-temps --noPrintLn -o test test.c<br> <br> </tt> will produce the following snippet of code from the <tt>test.cil.c</tt> temp file:<br> <br> <tt> if (m > 0) {<br>    printf((char const * __restrict )"%d\n", m);<br>    return (1);<br> } else {<br>    printf((char const * __restrict )"%d\n", n);<br>    return (2);<br> }<br> return (0);<br> <br> </tt> (Note: the "--noPrintLn" option simply tells <tt>cilly</tt> not to print the line number of the original file where each CIL line came from. This makes the temp file less cluttered and easier to read)<br> <br> However, if we run <tt>cilly</tt> with the "--dooneRet" command line option, we get the following code:<br> <br> <tt> if (m > 0) {<br>    printf((char const * __restrict )"%d\n", m);<br>    __retres3 = 1;<br>    goto return_label;<br> } else {<br>    printf((char const * __restrict )"%d\n", n);<br>    __retres3 = 2;<br>    goto return_label;<br> }<br> __retres3 = 0;<br> return_label: /* CIL Label */ <br> return (__retres3);<br> <br> </tt> Note that the return statements in the <tt>if</tt> blocks have been replaced with temporary variables and goto statements to mimic the behavior of the original code.<br> <br> <h3>Extending cilly with Your Own Features</h3> As a final example, we consider the Make CFG module included in the CIL library. When executed, this module doesn't actually do any code transformations of its own, but it does fill in predefined some OCaML fields that hold lists of predecessors and successors for each statement. That is, it is building the Control Flow Graph for the file. This is information that can be used by other modules later. Here we present an example module which we add to <tt>cilly</tt>.<br> <br> Modules can be called from the command line, as in previous examples, or they can be called within a custom program. Consider the following module written in OCaML and saved as <tt>mycfg.ml</tt>:<br> <br> <tt> open Pretty<br> open Cil<br> open Cfg<br> open List<br> open Dominators<br> module E = Errormsg<br> module H = Hashtbl<br> <br> class printCFGVisitor = object<br>   inherit nopCilVisitor<br>   method vstmt (s:stmt) =<br>    ignore(Pretty.printf "STATEMENT: %a\n" d_stmt s);<br>    let rec printStmtList = function<br>     [] -> Pretty.printf "\n"<br>     | x :: rest -> ignore(Pretty.printf "\n-->%a" d_stmt (x)); <br>      printStmtList rest;<br>   in<br>   ignore(Pretty.printf "__________preds_____________\n");<br>   ignore(printStmtList s.preds);<br>   ignore(Pretty.printf "__________end preds_____________\n");<br>   ignore(Pretty.printf "__________succs_____________\n");<br>   ignore(printStmtList s.succs);<br>   ignore(Pretty.printf "__________end succs_____________\n");<br> <br>   DoChildren<br> <br> end<br> <br> let feature : featureDescr = <br>   { fd_name = "mycfg";<br>   fd_enabled = ref false;<br>   fd_description = "output cfg result to stdout";<br>   fd_extraopt = [];<br>   fd_doit = <br>   (function (f: file) -><br>    Cfg.computeFileCFG f;<br>    let eVisitor = new printCFGVisitor in <br>    visitCilFileSameGlobals eVisitor f);<br>   fd_post_check = true;<br>   }<br> <br> </tt> <p> As you can see from the example above, most interaction with <tt>cilly</tt> is done through a <em>visitor</em> object. Above, you can see that we created a new class called printCFGVisitor. When we call <tt>visitCilFileSameGlobals</tt> (or, if we expect to change globals within the file, <tt>visitCilFile</tt> with a visitor and a file as parameters, our visitor visits every node in the abstract syntax tree in a depth-first search and "appropriate" functions are called based on the type of the node (for an instruction node, <tt>vinst</tt> is called, for a function declaration node <tt>vfundec</tt> is called, etc.) </p> <p> The visitor functions should return either <tt>DoChildren</tt> in order to continue visiting the node's children (the default content of each function), <tt>SkipChildren</tt> to avoid visiting the children, or <tt>ChangeTo</tt> followed by the contents of a new node which is to replace this one. In the above example, we use <tt>DoChildren</tt> to make sure that all possible <tt>stmt</tt> nodes are visited. </p> Before this custom module can be used, it is necessary to rerun the configure script to tell CIL that a new module exists, then to re-make the source code.<br> <br> <tt> /home/me/cil> ./configure EXTRASRCDIRS=/home/me/cil/myModules EXTRAFEATURES=mycfg<br> /home/me/cil> make<br> <br> </tt> After building the source, the command line option "--domycfg" will be listed when "--help" is called, and available as a command line argument. The topic of building custom modules is covered more extensively in the <a href="../ocaml/index.html">OCaML Review Page</a>. The reader is encouraged to review this material.<br> <br> Important to this discussion is the explicit call to <tt>CFG.computeFileCFG</tt> (fifth line from the bottom of the code). This call would be equivalent to calling <tt>cilly</tt> with the "--domakeCFG" command line option. Once this call is made, the predecessor and successor information is available for use. The simple "mycfg" only prints this information out, but of course it could be used to do more extensive analysis. The following can be used to run the "mycfg" module on the code:<br> <br> <tt> /home/me/cil> bin/cilly --noPrintLn --domycfg -o test test.c<br> <br> </tt> And a snippet of the output:<br> <br> <tt> STATEMENT: return (2);<br> __________preds_____________<br> <br> -->printf((char const * __restrict )"%d\n", n);<br> __________end preds_____________<br> __________succs_____________<br> <br> __________end succs_____________<br> <br> </tt> The "return(2)" statement has only one possible predecessor, which is the print statement immediately before it. As a return statement in main, it does not have any successors. <p> Here is another example of a module to extend cilly's functionality. This is a very crude obfuscation tool which creates a new global variable named "totally_not_guessable" in the file, then inserts the following before each "real" instruction from the input file: <br> <pre> printf(""); totally_not_guessable++; </pre> <br> </p> <p> Here is the code: </p> <pre> open Pretty open Cil module E = Errormsg module H = Hashtbl class obfuscateVisitor (globalVar : varinfo) = object inherit nopCilVisitor val printfFun = let fdec = emptyFunction "printf" in fdec.svar.vtype <- TFun(intType, Some [ ("prio", intType, []); ("format", charConstPtrType, []) ], true, []); fdec method vinst (i : instr ) : instr list visitAction = ChangeTo [ Call((None), (Lval(Var(printfFun.svar),NoOffset)), [mkString ""], locUnknown); Set( (Var(globalVar), NoOffset), BinOp(PlusA, Lval(Var(globalVar), NoOffset), one, Cil.intType), locUnknown); i] end let feature : featureDescr = { fd_name = "obfuscate"; fd_enabled = ref false; fd_description = "make the code less pleasant"; fd_extraopt = []; fd_doit = (function (f: file) -> let myGlobalVar = makeGlobalVar "totally_not_guessable" Cil.intType in f.globals <- GVarDecl(myGlobalVar, locUnknown)::f.globals; let oVisitor = new obfuscateVisitor(myGlobalVar) in visitCilFile oVisitor f); fd_post_check = true; } </pre> <h2>Internals</h2> <h3>Code Transformations</h3> As previously discussed, CIL is a subset of the C programming language. In order for the CIL tool to generate CIL code, it must transform the original C source into this subset. There are many transformations that are performed. An extensive, but incomplete, list is available in the <a href="http://manju.cs.berkeley.edu/cil/">CIL Users Manual</a>.<br> <h3>Included Internal Libraries</h3> One of the important features of CIL is a fairly robust library of included modules. The modules perform many of the standard static analyses familiar to analysis community at large.<br> <br> <ul> <li><b>Control-Flow Graphs</b> -- builds control flow graph for whole files, or selected functions.</li><br> <br> <li><b>Data Flow Analysis</b> -- framework for standard forward and backward data flow analysis that requires only transfer functions as input.</li><br> <br> <li><b>Dominators</b> -- calculates immediate dominators.</li><br> <br> <li><b>Points-to Analysis</b> -- performs points-to analysis using either the Olf (One Level Flow) or Golf (Generalized One Level Flow) algorithms.</li><br> <br> <li><b>StackGuard</b> -- modifies the program to maintain a separate return address stack, to defend against stack smashing attacks.</li><br> <br> <li><b>Heapify</b> -- transforms the code to move all dangerous arrays to the heap.</li><br> <br> <li><b>One Return</b> -- transforms the code to force all functions to have only one return statement.</li><br> <br> <li><b>Partial Evaluation and Constant Folding</b> -- performs constant folding (simplification of constant expressions), simplification of if statements whose expressions evaluate to a constant, and call elimination for empty functions and functions that return a constant.</li><br> <br> <li><b>Reaching Definitions</b> -- standard reaching definitions analysis.</li><br> <br> <li><b>Available Expressions</b> -- standard available expressions analysis.</li><br> <br> <li><b>Liveness Analysis</b> -- standard liveness analysis.</li><br> <br> <li><b>Dead Code Elimination</b> -- eliminates code that can never be called or whose results are never used.</li><br> <br> <li><b>Simple Memory Operations</b> -- transforms the code so that all lvalues have at most one memory reference.</li><br> <br> <li><b>Simple Three-Address Code</b> -- transforms code into three-address code.</li><br> <br> <li><b>Converting C to C++</b> -- outputs valid C++ code. The results of this module may not be 100% correct.</li><br> </ul> <h2>Evaluation</h2> <p> The CIL manual mentions two specific cases where code transformations may not work precisely as intended. First, when CIL transforms C source, it explicitly adds function prototypes that are called before they are defined. However, the manual states that CIL is incapable of adding prototypes for functions that are used, but are neither declared nor defined. Second, CIL separates variable declarations from assignments. Thus, statements like the following:<br> <br> <tt> const int i = 1;</tt><br> <br> would be transformed into:<br> <br> <tt> int i;</tt><br> <tt> i = 0;</tt><br> <br> and as a consequence, the <tt>const</tt> operator must be dropped. However, this seems unlikely to be problematic in a real world programming situation. The manual also makes a more general statement about not guaranteeing behavior for constructs that are undefined by the C standard. </p> <p> Consider, for example, the following (poorly-written) program: <pre> #include <stdio.h> #include <stdlib.h> int main() { int i = 10; printf("%i\n", i++ + i); } </pre> <BR> cilly will remove the side effects as it converts to CIL, resulting in this code: <BR> <pre> i = 10; tmp = i; i ++; printf((char const * __restrict )"%i\n", tmp + i); </pre> <br> Ergo, this program now prints 21, whereas if we build with GCC, we get 20. </p> <br> However, our own experience is that CIL is very robust, as evidenced by our ability to compile <tt>strace</tt> and we have not found any reasonable code examples that behave differently under C, including our build of the strace project described in an earlier section.<br> <br> Use of the tool itself is straight forward, and we have found that extending the tool with custom modules to be fairly easy. For those users not familiar with the OCaML programming language, there will be a fairly steep learning curve to get started and become comfortable with syntax, and especially with the concepts in a strongly typed functional programming language.<br> <br> Additionally, the data structures used by CIL to represent the Abstract Syntax Tree can be quite confusing. There doesn't seem to be a clear hierarchy of types here. For example, some <tt>stmt</tt> (statement) types such as an <tt>if</tt> statement can have multiple <tt>Block</tt>'s. But the <tt>Block</tt> type is defined as a list of attributes, and a list of <tt>stmt</tt>'s. This circular definition is, at the very least, counter-intuitive, and the documentation of the internal representations is insufficient to fully understand the structure, or the intent behind it.<br> <h2>Screenshots</h2> As use of the tool is strictly command line, no screen shots were necessary for this discussion.<br> <h2>References</h2> <ul> <li> <a href="http://sourceforge.net/projects/cil">CIL SourceForge Page</a> and <a href="http://sourceforge.net/project/platformdownload.php?group_id=138953">Download link</a><br> <li> <a href="http://manju.cs.berkeley.edu/cil/">CIL Users Manual</a><br> <li> <a href="http://www.springerlink.com/content/4ehv355j1g39n84q/">CIL: Intermediate Language and Tools for Analysis and Transformation of C Programs</a><br> <li> <a href="http://caml.inria.fr/">OCaML Official Site</a><br> <li> <a href="../ocaml/index.html">OCaML Review Page</a><br> </ul> </body> </html>