Using CIL to analyze libvirt

Downloads

Introduction

CIL is a library for doing static analysis on large C projects. It is interesting because unlike other tools around, it has actually been applied to large projects in the real world (such as the Linux kernel), it supports most gcc extensions, C99, and many subtleties of the C language. Another thing which makes it interesting is that it is distributed under a free license (specifically, BSD).

In this document I describe how I used it to analyze libvirt.

Static analysis

Static analysis is the process of loading the source code for a program into memory, then analyzing or transforming it in various ways. The most common programs around which do static analysis are compilers: they both analyze the program for warnings, and transform it (optimization, conversion to machine code).

Compilers such as gcc nowadays do sophisticated analyses on programs -- reachability, dataflow, pointer aliasing, and so on. However they are not customizable. You can't stop the compiler during its analysis phase and add your own constraints. Some of the things you might want to add are:

Static analysis versus testing

Static analysis is concerned with the program at compile time. Testing observes the program's behaviour at runtime. As such static analysis complements testing, because each can find bugs in the code that the other misses.

Writing static analysis rules

Because the rules you want to apply are likely to be quite specific to your C program or library, you will necessarily end up having to write up these rules in some sort of programming language. For the Stanford Checker, rules were written in a special language called "MC". For CIL you have to write them up in OCaml.

C++

CIL does not support C++. There are other solutions which work for C++, including Olmar / Elsa / Elkhound.

However ...

Static analysis of C code is useful, but if you are thinking of starting a new project, choose a sensible language! Don't start new projects in C or C++, or dynamic languages like Perl and Python which leave errors in code for customers to find.

Coverity & the Stanford Checker

A leading commercial (and very proprietary) tool for static analysis of C programs is Coverity, which started life as the Stanford Checker. Early in 2001 postings started to appear on LKML detailing bugs in the Linux kernel which had been found through static analysis (for example: [1] [2] [3] and more background).

About CIL

It is highly recommended that you read the first few chapters of the CIL documentation.

CIL reads C source code after it has gone through the C preprocessor. It performs some transformations on the code to make analysis simpler, and it turns the C source code into memory structures which describe the name, type and contents of every variable and function definition (and everything else) in the program.

CIL provides a drop-in compatible replacement for gcc called cilly which you can use in Makefiles (eg. make CC=cilly). It also provides library functions that you can use to import preprocessed C files and iterate over their structural contents.

CIL provides libraries for doing common transformations and analyses on programs. Some of the more important ones are:

Note two things: Firstly compilers like GCC already do these sorts of transformations and analyses, however not in a way that you can use or extend. Secondly these are only preludes to writing your own program-specific analysis programs. Typically you will use the library to load the source and do at least CFG analysis on it, before you start to do your own analysis.

Loading a test program into CIL

For working code, please refer to the Downloads section at the top of this document.

Before embarking on libvirt, I wrote a few lines of test code in C, split across a couple of files, and loaded those into CIL using the library functions. I preprocessed them using /bin/cpp first, as required by CIL. The OCaml code to load the C code is just:

let files = [ "test1.i"; "main.i" ]
let files = List.map (fun filename -> (Frontc.parse filename) ()) files

The two files are then merged together in memory (kind of like linking), and unused prototypes and definitions are removed. The unused stuff mainly comes from libc header files.

let file = Mergecil.merge files "test"
let () = Rmtmps.removeUnusedTemps file

Finally I call the CIL function to perform control flow graph (CFG) analysis which will later allow me to produce some pretty pictures of the functions:

let () = Cfg.computeFileCFG file

Now as a test I wrote some analysis code. This code simply iterates over the top level expressions in the program, finds function definitions, and prints out the control flow graphs of the functions as GraphViz dot-files. It ignores anything which isn't a function definition:

  List.iter (
    function
    | GType _ -> ()			(* typedef *)
    | GCompTag _ -> ()			(* struct/union *)
    | GCompTagDecl _ -> ()		(* forward prototype of struct/union *)
    | GEnumTag _ -> ()			(* enum *)
    | GEnumTagDecl _ -> ()		(* forward prototype of enum *)
    | GVarDecl _ -> ()			(* variable/function prototype *)
    | GVar _ -> ()			(* variable definition *)
    | GFun (fundec, loc) ->		(* function definition *)

	printf "%s:%d: %s has %d arg(s)\n"
	  loc.file loc.line fundec.svar.vname (List.length fundec.sformals);

	(* Write control flow graph to a GraphViz dot-file: *)
	Cfg.printCfgFilename (fundec.svar.vname ^ ".dot") fundec

    | GAsm _ -> ()			(* global asm statement *)
    | GPragma _ -> ()			(* toplevel #pragma *)
    | GText _ -> ()			(* verbatim text, comments *)
  ) file.globals;

When you run this program (make run-ciltest1), you see:

test1.c:11: test1_add has 2 arg(s)
test1.c:22: test1_inc has 1 arg(s)
test1.c:36: test1_print has 1 arg(s)
test1.c:48: test1_last_error has 0 arg(s)
test1.c:56: test1_print_last_error has 0 arg(s)
main.c:10: main has 0 arg(s)

and here are some of the GraphViz dot-files rendered next to their corresponding functions:

Control flow graph Function
int
test1_inc (int a)
{
  if (a >= 0)
    return a+1;
  else
    return -1;
}
void
test1_print_last_error (void)
{
  const char *err = last_err;
  last_err = NULL;
  if (err)
    fprintf (stderr, "test1: %s\n", err);
}
int
test1_add (int a, int b)
{
  if (a >= 0 && b >= 0)
    return a+b;
  else {
    last_err = "Negative value";
    return -1;
  }
}

In the corresponding graph for this function we see two if nodes. This is because CIL has simplified the if (...&&...) expression into two if statements, to make automatic analysis easier. The code is equivalent to the original, and C compilers commonly perform the same step.

Loading libvirt into CIL

For working code, please refer to the Downloads section at the top of this document.

Before we can perform any analysis, we first have to load the libvirt C code into CIL. I found it simpler to avoid the cilly drop-in replacement for gcc, and instead just compile libvirt like this:

make CC="gcc -save-temps"

This builds libvirt in the ordinary way, but leaves around the preprocessed C files (*.i), and these can be directly loaded into CIL using the library functions.

There are some fragments of shell script in the Makefile which are just there to build libvirt with the special CC variable, find the list of *.i files, and get the list of exported symbols (the entry points into libvirt). Interested readers can consult the Makefile rule run-cilimportvirt for details.

In any case, once these housekeeping tasks are done, the whole of libvirt was loaded into CIL, barring one problem: CIL doesn't yet implement C99 _Bool type, so I had to modify libvirt's build to add -D_Bool=unsigned. Still, a single change isn't bad considering that libvirt is a large body of code written by a whole variety of authors which uses C99 and gcc extensions throughout.

Control flow graphs of real functions in libvirt are considerably more complicated than my test functions. For example on the right is the CFG of function xenHypervisorMakeCapabilitiesXML. The Postscript file produced by GraphViz actually crashes most file viewers.

Simple analysis — extract exported function types

As a simple example of a static analysis, I used CIL along with my list of exported function names to extract the argument type(s), return type and point of definition of each exported function. The output looks like this:

virterror.c:70: virGetLastError exported function with 0 arg(s)
  virErrorPtr (void)
virterror.c:88: virCopyLastError exported function with 1 arg(s)
  int (virErrorPtr to )
virterror.c:105: virResetError exported function with 1 arg(s)
  void (virErrorPtr err )
virterror.c:126: virResetLastError exported function with 0 arg(s)
  void (void)

[...]

libvirt.c:56: virInitialize exported function with 0 arg(s)
  int (void)
libvirt.c:336: virGetVersion exported function with 3 arg(s)
  int (unsigned long *libVer , char const   *type , unsigned long *typeVer )
libvirt.c:443: virConnectOpen exported function with 1 arg(s)
  virConnectPtr (char const   *name )
libvirt.c:461: virConnectOpenReadOnly exported function with 1 arg(s)
  virConnectPtr (char const   *name )
libvirt.c:478: virConnectClose exported function with 1 arg(s)
  int (virConnectPtr conn )
libvirt.c:504: virConnectGetType exported function with 1 arg(s)
  char const   *(virConnectPtr conn )
libvirt.c:534: virConnectGetVersion exported function with 2 arg(s)
  int (virConnectPtr conn , unsigned long *hvVer )
libvirt.c:566: virConnectGetHostname exported function with 1 arg(s)
  char *(virConnectPtr conn )

[etc]

In total this analysis required 61 lines of code to implement, although much of that was comments and the boilerplate required to load the *.i files.


rjones AT redhat DOT com

$Id: index.html,v 1.5 2008/03/31 12:00:29 rjones Exp $