What Does Obfuscated Software Look Like?
Obfuscated software looks very different from the "normal"
programs we usually use. This is deliberate: while ease of
understanding is an important quality in conventional programming,
the whole point of obfuscating software is to make the code difficult
to understand and analyze. To do this, creators of obfuscated
software systematically violate the guidelines for writing clear code that
programmers are taught to follow. The Lynx project aims to develop automated
tools that use a combination of static and dynamic program analysis
techniques to analyze such code and tease out the underlying logic of the
computation. This page shows some examples of such obfuscated code.
Web-delivered Malware: "Drive-By Downloads"
By far the most common way for malware to get onto a computer is through
a process called
drive-by downloading, where unsuspecting
web users get infected simply from opening an infected web page.
In many cases, the infected page belongs to a legitimate website
that has been hacked; in other cases, the bad guys set up
their own website and try to trick people into visiting it.
As an example of this, some time ago I got the email message
shown below: "Please open the below URL to access PAYROLL
REPORTS."
Of course, this email is suspicious for all sorts of reasons: why am I being
sent payroll records, with bad grammar, from an email address that makes no
sense, linking to a website in Germany? But we wanted to know
what was at the other end of that link. When we downloaded that page, we
found a short snippet of obfuscated JavaScript code:
It's very difficult to make any sense of this code, which is exactly the
point of code obfuscation. When we deobfuscated this code using our
tools, it turned out to point to a web page that contained even more obfuscated code:
We used our analysis tools to penetrate this layer of obfuscation as well.
This led us to the actual attack code, which exploited a vulnerability in Adobe
Reader using a malicious PDF file loaded from a website in China:
Emulation-Based Obfuscation
This is an obfuscation technique where the program to be executed
is encoded in terms of a virtual machine (VM) instruction set that is
interpreted using a VM emulator. VM emulators can be stacked several
levels deep, where the emulator for one VM is itself encoded in terms
of the instruction set for a second VM, and so on. Emulation-based
obfuscation presents a number of challenges for reverse engineering
because examining the instructions in the program exposes only the
logic of the emulator, not the logic of the program being emulated.
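To make the idea concrete, here is a minimal sketch of a single level of emulation, written in Python. It is purely illustrative: the opcode names, the `run` function, and the `FACTORIAL` bytecode are made up for this example and are not related to the virtual machines that real obfuscation tools generate. The sketch assumes a non-negative integer input.

```python
# Toy stack-machine emulator. The instruction set is hypothetical and
# exists only for this illustration.
PUSH, LOAD, STORE, MUL, SUB, JZ, JMP, HALT = range(8)


def run(bytecode, n):
    """Interpret `bytecode` with the input n in slot 0; return slot 1."""
    slots = [n, 1]   # slot 0: input/counter, slot 1: accumulator
    stack = []
    pc = 0           # virtual program counter
    while True:
        op = bytecode[pc]
        if op == PUSH:       # push an immediate constant
            stack.append(bytecode[pc + 1])
            pc += 2
        elif op == LOAD:     # push the value of a slot
            stack.append(slots[bytecode[pc + 1]])
            pc += 2
        elif op == STORE:    # pop a value into a slot
            slots[bytecode[pc + 1]] = stack.pop()
            pc += 2
        elif op == MUL:
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
            pc += 1
        elif op == SUB:
            b, a = stack.pop(), stack.pop()
            stack.append(a - b)
            pc += 1
        elif op == JZ:       # jump if the popped value is zero
            pc = bytecode[pc + 1] if stack.pop() == 0 else pc + 2
        elif op == JMP:      # unconditional jump
            pc = bytecode[pc + 1]
        elif op == HALT:
            return slots[1]
        else:
            raise ValueError(f"unknown opcode {op} at {pc}")


# Iterative factorial, encoded for the toy VM above.
FACTORIAL = [
    LOAD, 0, JZ, 20,       #  0: if n == 0, go to HALT
    LOAD, 1, LOAD, 0, MUL, #  4: acc * n
    STORE, 1,              #  9: acc = acc * n
    LOAD, 0, PUSH, 1, SUB, # 11: n - 1
    STORE, 0,              # 16: n = n - 1
    JMP, 0,                # 18: loop back to the test
    HALT,                  # 20
]

if __name__ == "__main__":
    print(run(FACTORIAL, 5))
```

Someone reading only `run` sees a generic fetch-decode-dispatch loop. The fact that the program computes a factorial lives entirely in the `FACTORIAL` list, which looks like plain data. Stacking a second level would mean encoding `run` itself as bytecode for another emulator.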
This picture shows (part of) the control flow graph of a simple iterative
factorial computation that has been obfuscated, using a software
obfuscation tool called Themida, to hide the program's logic.
Blue nodes represent computations
influenced by input values;
red nodes are computations that affect the
output generated by the program.
What started out as a very simple iterative computation, involving a few
arithmetic operations and a straightforward flow of values from
the input to the output, has become smeared over a large amount of code,
and the logical connections between different parts of the computation are
difficult to discern.
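The two node colors correspond to two standard questions one can ask of an execution trace: which operations are influenced by the input, and which operations affect the output? The Python sketch below answers both for a tiny hand-written trace. It is a simplified illustration, not a description of the Lynx tools: the trace format, the `forward_taint` and `backward_slice` functions, and the location names are all assumptions made for this example, and only data dependences are tracked (control dependences are ignored).

```python
def forward_taint(trace, inputs):
    """Indices of trace entries whose result depends on an input location."""
    tainted = set(inputs)
    hits = []
    for i, (dest, srcs) in enumerate(trace):
        if tainted & set(srcs):
            tainted.add(dest)
            hits.append(i)
        else:
            # dest is overwritten with a value that does not depend on input
            tainted.discard(dest)
    return hits


def backward_slice(trace, outputs):
    """Indices of trace entries whose result flows to an output location."""
    needed = set(outputs)
    hits = []
    for i in range(len(trace) - 1, -1, -1):
        dest, srcs = trace[i]
        if dest in needed:
            needed.discard(dest)   # this entry defines the needed value
            needed.update(srcs)    # so its sources become needed instead
            hits.append(i)
    return sorted(hits)


# Each entry is (destination, [sources]). This hypothetical trace mixes a
# two-step factorial with emulator bookkeeping on a virtual program counter.
TRACE = [
    ("acc", []),            # 0: acc = 1
    ("vpc", []),            # 1: vpc = 0          (bookkeeping)
    ("acc", ["acc", "n"]),  # 2: acc = acc * n
    ("vpc", ["vpc"]),       # 3: vpc = vpc + 1    (bookkeeping)
    ("n",   ["n"]),         # 4: n = n - 1
    ("acc", ["acc", "n"]),  # 5: acc = acc * n
    ("vpc", ["vpc"]),       # 6: vpc = vpc + 1    (bookkeeping)
    ("out", ["acc"]),       # 7: out = acc
]

if __name__ == "__main__":
    print("influenced by input:", forward_taint(TRACE, {"n"}))
    print("affects output:     ", backward_slice(TRACE, {"out"}))
```

In this toy trace the bookkeeping entries (1, 3, and 6) fall in neither set, which is the kind of separation that helps when the interesting computation is buried in emulator code.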
The Themida-obfuscated code shown above uses a single level of emulation.
The next picture shows the control flow graph of a matrix multiplication
program that has been obfuscated to have two levels of such emulation.
Again, the structure of the computation has been changed dramatically,
and it's difficult to discern the logic of the underlying computation: