What Does Obfuscated Software Look Like?

Obfuscated software looks very different from the "normal" programs we usually use. This is deliberate: while ease of understanding is an important quality in conventional programming, the whole point of obfuscating software is to make the code difficult to understand and analyze. To do this, creators of obfuscated software systematically violate the guidelines programmers are taught to follow. The Lynx project aims to develop automated tools that combine static and dynamic program analysis techniques to analyze such code and tease out the underlying logic of the computation. This page shows some examples of such obfuscated code.

Web-delivered Malware: "Drive-By Downloads"

By far the most common way to get malware on one's computer is through a process called drive-by downloading, where unsuspecting web users get infected simply by opening an infected web page. In many cases, the infected page belongs to a legitimate website that has been hacked; in other cases, the bad guys set up their own website and try to trick people into visiting it.

As an example of this, some time ago I got the email message shown below: "Please open the below URL to access PAYROLL REPORTS."

Of course, this email is suspicious for all sorts of reasons: why am I being sent payroll reports? With bad grammar? From an email address that makes no sense? Linking to a website in Germany? But we wanted to know what was at the other end of that link. When we downloaded that page, we found a short snippet of obfuscated JavaScript code:

It's very difficult to make any sense of this code, which is exactly the point of code obfuscation. When we deobfuscated this code using our tools, it turned out to point to a web page that contained even more obfuscated code:

We used our analysis tools to penetrate this layer of obfuscation as well. This led us to the actual attack code, which exploited a vulnerability in Adobe Reader using a malicious PDF file loaded from a website in China:
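The snippets from this attack are not reproduced in this text. As a rough illustration of what "layers" of obfuscation means, the sketch below wraps a harmless string in two encoding layers and then peels them off again. Everything in it is an assumption made for illustration: the `wrap`, `unwrap`, and `peel` helpers, the byte-shift-plus-Base64 scheme, the keys, and the placeholder URL are hypothetical and are not taken from the malware described above or from the Lynx tools.

```python
import base64

def wrap(payload: str, key: int) -> str:
    """Add one layer: shift every byte by `key`, then Base64-encode."""
    shifted = bytes((b + key) % 256 for b in payload.encode("utf-8"))
    return base64.b64encode(shifted).decode("ascii")

def unwrap(blob: str, key: int) -> str:
    """Remove one layer: Base64-decode, then undo the byte shift."""
    shifted = base64.b64decode(blob)
    return bytes((b - key) % 256 for b in shifted).decode("utf-8")

def peel(blob: str, keys: list[int]) -> str:
    """Remove several layers, outermost first."""
    for key in keys:
        blob = unwrap(blob, key)
    return blob

# Hypothetical, harmless stand-in for "what the page really points to".
inner = "http://example.invalid/next-stage"

layer1 = wrap(inner, 7)    # first layer of encoding
layer2 = wrap(layer1, 13)  # second layer, applied on top of the first

if __name__ == "__main__":
    print("what a reader sees:", layer2)
    print("after peeling     :", peel(layer2, [13, 7]))
```

In this toy the keys are known in advance, which is what makes peeling trivial. Real obfuscated pages do not hand over their decoding steps so conveniently, which is why recovering each layer takes analysis tools rather than a two-line loop.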

Emulation-Based Obfuscation

This is an obfuscation technique where the program to be executed is encoded in terms of a virtual machine (VM) instruction set that is interpreted using a VM emulator. VM emulators can be stacked several levels deep, where the emulator for one VM is itself encoded in terms of the instruction set for a second VM, and so on. Emulation-based obfuscation presents a number of challenges for reverse engineering because examining the instructions in the program exposes only the logic of the emulator, not the logic of the original program.
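To make the idea concrete, here is a minimal sketch of a single level of emulation, written for illustration only. The opcode set, the bytecode encoding, and the names `run` and `FACTORIAL` are all hypothetical; they are far simpler than what a commercial tool generates and do not reflect any real obfuscator's design. The point is that the executable code is just the dispatch loop in `run`, while the factorial logic lives in a list of numbers.

```python
# Hypothetical toy VM: the opcodes and their encoding are invented here.
PUSH, LOAD, STORE, MUL, SUB, JZ, JMP, HALT = range(8)

# Bytecode for: acc = 1; while n != 0: acc = acc * n; n = n - 1
# Variable slots: 0 holds n, 1 holds acc.
FACTORIAL = [
    PUSH, 1, STORE, 1,                  #  0: acc = 1
    LOAD, 0, JZ, 24,                    #  4: if n == 0, jump to HALT
    LOAD, 1, LOAD, 0, MUL, STORE, 1,    #  8: acc = acc * n
    LOAD, 0, PUSH, 1, SUB, STORE, 0,    # 15: n = n - 1
    JMP, 4,                             # 22: back to the loop test
    HALT,                               # 24: result is in slot 1
]

def run(code, n):
    """Interpret `code` with input n; return the value left in slot 1."""
    slots = [n, 0]
    stack = []
    pc = 0
    while True:
        op = code[pc]
        if op == PUSH:
            stack.append(code[pc + 1])
            pc += 2
        elif op == LOAD:
            stack.append(slots[code[pc + 1]])
            pc += 2
        elif op == STORE:
            slots[code[pc + 1]] = stack.pop()
            pc += 2
        elif op == MUL:
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
            pc += 1
        elif op == SUB:
            b, a = stack.pop(), stack.pop()
            stack.append(a - b)
            pc += 1
        elif op == JZ:
            pc = code[pc + 1] if stack.pop() == 0 else pc + 2
        elif op == JMP:
            pc = code[pc + 1]
        elif op == HALT:
            return slots[1]
        else:
            raise ValueError(f"unknown opcode {op} at {pc}")

if __name__ == "__main__":
    print(run(FACTORIAL, 5))
```

Someone reading only `run` learns how this VM fetches and dispatches instructions, but nothing about factorials. Stacking a second level would mean expressing `run` itself as bytecode for another interpreter.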

This picture shows (part of) the control flow graph of a simple iterative factorial computation that has been obfuscated, using a software obfuscation tool called Themida, to hide the program's logic. Blue nodes represent computations influenced by input values; red nodes are computations that affect the output generated by the program. What started out as a very simple iterative computation, involving a few arithmetic operations and a straightforward flow of values from the input to the output, has become smeared over a large amount of code, and the logical connections between different parts of this computation are difficult to discern.
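The page does not say how the blue and red markings are computed. One simple way to approximate them, shown here purely as an assumption, is graph reachability over data-flow edges: nodes reachable forward from the inputs are "influenced by input", and nodes that can reach the output backward are "affect the output". The `FLOW` graph and its node names below are hypothetical and hand-written; they are not derived from the Themida example.

```python
from collections import deque

def reachable(graph, starts):
    """Return every node reachable from `starts` by following edges."""
    seen = set(starts)
    queue = deque(starts)
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def reverse(graph):
    """Return the same graph with every edge flipped."""
    rev = {}
    for src, dsts in graph.items():
        rev.setdefault(src, [])
        for dst in dsts:
            rev.setdefault(dst, []).append(src)
    return rev

# Hypothetical data-flow edges: "a -> b" means b uses a value produced by a.
FLOW = {
    "read_n": ["mul", "dec"],      # the input feeds the multiply and the decrement
    "init_acc": ["mul"],           # the constant 1 feeds the multiply
    "dec": ["mul"],                # the decremented n feeds later multiplies
    "mul": ["print_acc"],          # the product is what gets printed
    "vm_fetch": ["vm_dispatch"],   # emulator bookkeeping, unrelated to the data
    "vm_dispatch": ["vm_fetch"],
}

input_influenced = reachable(FLOW, ["read_n"])            # "blue" nodes
affects_output = reachable(reverse(FLOW), ["print_acc"])  # "red" nodes
core = input_influenced & affects_output                  # on an input-to-output path

if __name__ == "__main__":
    print("influenced by input:", sorted(input_influenced))
    print("affects the output :", sorted(affects_output))
    print("both               :", sorted(core))
```

In this toy graph the emulator bookkeeping nodes fall outside both sets, which hints at why tracking flows from input to output is a useful way to separate the original computation from the machinery wrapped around it.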

The Themida-obfuscated code shown above uses a single level of emulation. The next picture shows the control flow graph of a matrix multiplication program that has been obfuscated to have two levels of such emulation. Again, the structure of the computation has been changed dramatically and it's difficult to discern the logic of the underlying computation: