I've been doing some heavy reading on reverse engineering lately, mostly because it will help me advance career-wise in this company (I also got an offer doing more IDS researchy-stuff somewhere else, but I'll consider that later, I think), and well, because it's interesting. There's a surprising lack of good material on reverse engineering that's not in the form of a poorly written case study, as I lamented in an email to someone far more knowledgeable than I. I found this paper and emailed the obviously much-smarter author an embarrassingly silly fan letter -- O wow! This is cool! Thank you! -- because it actually discussed general techniques associated with reversing, and it was well written (yay, no broken English!).
As an exercise, I spent some time today (yesterday? I'm on night shift, nothing makes sense) reversing a binary a friend provided. Basically, it's a program he wrote that strips or mangles some essential information in ELF headers so that binutils stuff can't digest the file (e.g., it mangles the ELF version and zeroes the section header size so it looks like the section headers are missing, which gdb/objdump/etc HATES), yet the binary still executes correctly. There is a secret flag to the program, the author informed me, and he gave me the binary so I could reverse it.
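To make the mangling a bit more concrete, here is a rough Python sketch that reads a 32-bit little-endian ELF header and flags the kinds of fields described above. The field layout is the standard Elf32_Ehdr; the function names and the exact checks are my own illustration, and I'm assuming "ELF version" means the e_version field (it could just as well be the version byte inside e_ident).

```python
import struct

# Little-endian Elf32_Ehdr: 16-byte e_ident followed by 13 fixed fields (52 bytes).
EHDR_FMT = "<16sHHIIIIIHHHHHH"
EHDR_FIELDS = ("e_ident", "e_type", "e_machine", "e_version", "e_entry",
               "e_phoff", "e_shoff", "e_flags", "e_ehsize", "e_phentsize",
               "e_phnum", "e_shentsize", "e_shnum", "e_shstrndx")

def parse_ehdr(data):
    """Unpack the first 52 bytes of a file into a dict of header fields."""
    size = struct.calcsize(EHDR_FMT)
    return dict(zip(EHDR_FIELDS, struct.unpack(EHDR_FMT, data[:size])))

def suspicious_fields(hdr):
    """Return human-readable complaints about header values tools may reject."""
    problems = []
    if hdr["e_ident"][:4] != b"\x7fELF":
        problems.append("bad magic")
    if hdr["e_version"] != 1:              # EV_CURRENT is 1
        problems.append("e_version is not EV_CURRENT")
    if hdr["e_shentsize"] != 40:           # sizeof(Elf32_Shdr) is 40
        problems.append("e_shentsize is not 40")
    if hdr["e_shoff"] == 0 or hdr["e_shnum"] == 0:
        problems.append("no section header table advertised")
    return problems
```

Feeding it the first 52 bytes of a file (for example, open(path, "rb").read(52)) and printing suspicious_fields(parse_ehdr(...)) gives a quick idea of what the tools are choking on. The e_entry and e_phoff fields are untouched by this kind of mangling, which is why the kernel can still run the thing.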
Unfortunately, the program was applied to itself, so I was given a section-stripped, ELF-header-mangled binary that was horribly ptrace-unfriendly. I dunno about you, but reversing something without the use of a debugger is painful, for me at least. It wouldn't even load properly in IDA Pro [as an ELF binary], so my option there was to load it as a raw binary and map it to the correct virtual addresses later on.
I looked at other methods. Of course, the obvious way of cheating would be to loop through all values a-z && A-Z for the correct flag, heh. But I did not do that! No, instead:
some trickery I explored...
a) ltrace to view getopt() arguments, snag the flag there
or
write my own getopt() which does nothing but print the arguments to getopt, and LD_PRELOAD it
Oh, this would have been a nice, clean solution had the binary been dynamically linked, but no, it was statically linked. Actually, it took me a while to figure out definitively how it was linked, heh, since neither 'ldd' nor 'file' understood it. If you look at the readelf -a output, however, you get "There is no dynamic segment in this file." Um, you could probably also check /proc/<pid>/maps to see if any libraries were mapped into its address space.
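The "no dynamic segment" check can also be done by hand, since the program headers survive the mangling (the kernel needs them). Below is a small Python sketch, assuming a 32-bit little-endian ELF; the function names are made up for illustration. It just lists the program header types and looks for PT_DYNAMIC or PT_INTERP.

```python
import struct

PT_DYNAMIC = 2   # dynamic linking information
PT_INTERP = 3    # path to the dynamic loader

def program_header_types(data):
    """Return the p_type of every Elf32 program header in the file image."""
    (e_phoff,) = struct.unpack_from("<I", data, 28)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 42)
    return [struct.unpack_from("<I", data, e_phoff + i * e_phentsize)[0]
            for i in range(e_phnum)]

def looks_static(data):
    """True if the file has neither a dynamic segment nor an interpreter."""
    types = program_header_types(data)
    return PT_DYNAMIC not in types and PT_INTERP not in types
```

If looks_static() comes back True, tricks like ltrace or an LD_PRELOAD shim have nothing to hook, because there are no library calls going through the dynamic loader.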
b) fix the header using elfsh
I was informed of this cool library/tool, elfsh, for manipulating ELF objects. Basically it provides a scripted interface (and a C API, for those inclined) to modify header fields, insert program headers/section headers, fix up the section header table, all kinds of neat stuff. I compiled and ran elfsh against the binary and was able to modify the appropriate header fields to non-bogus values and fix up the SHT right and proper; however, it had some... erm... issues saving the actual output to disk. Although the output displayed properly in the interactive shell, it saved some completely bogus header instead. Impatient as I am, I didn't bother debugging it (although, upon later inspection, it appears the authors recommend using an older version for things like fixing the SHT, because the latest one is known to be buggy. heh). My idea was to fix up the ELF headers so the binary would be understandable by gdb/objdump/IDA Pro, but, heh, no dice.
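For the header fields alone (not the section header table), the same kind of fix-up can be sketched in a few lines of Python. To be clear, this is not how elfsh works and it does not rebuild a stripped SHT. It only overwrites e_version and e_shentsize in a copy of the file, assuming a 32-bit little-endian ELF. The function name and default values are my own.

```python
import struct

def patch_ehdr(data, e_version=1, e_shentsize=40):
    """Return a copy of an ELF32 image with two header fields overwritten.

    e_version lives at offset 20 (4 bytes) and e_shentsize at offset 46
    (2 bytes) in Elf32_Ehdr. Everything else is left untouched.
    """
    if len(data) < 52 or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF32 image")
    out = bytearray(data)
    struct.pack_into("<I", out, 20, e_version)
    struct.pack_into("<H", out, 46, e_shentsize)
    return bytes(out)
```

Writing the result to a new file and running readelf -h on it is a cheap way to see whether the tools get any further. If the section headers themselves are gone, they will still complain, just about something else.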
c) Manual analysis using IDA
First, I struggled with loading it at the proper virtual addresses. IDA didn't understand the ELF header (and crashed rather abruptly when asked to disassemble the file as an ELF), so I loaded it as a flat binary, chose my entry point (as plucked from the ELF header), and began disassembly there. Except, um, wait, it loaded the whole file into one flat segment, and I had to map the binary to the appropriate virtual addresses, with the image starting at 0x08048000, so that the .text and .data segments landed in the right places and looked something like this:
08048000-080aa000 r-xp 00000000 08:05 3899739 /home/jen/test .text!
080aa000-080ab000 rw-p 00062000 08:05 3899739 /home/jen/test .data!
080ab000-080ad000 rwxp 00000000 00:00 0 .bss
bfffd000-c0000000 rwxp ffffe000 00:00 0 stack
Not going to get into the details of that here, except to say that, given my fuzzy knowledge of segmentation and addressing schemes and my novice knowledge of IDA, it wasn't easy. Heh.
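The mapping itself comes straight out of the PT_LOAD program headers: each one says which chunk of the file goes at which virtual address. Here is a minimal Python sketch of that lookup, again assuming a 32-bit little-endian ELF; the helper names are hypothetical.

```python
import struct

PT_LOAD = 1

def load_segments(data):
    """Return (file_offset, vaddr, filesz, memsz) for each PT_LOAD header."""
    (e_phoff,) = struct.unpack_from("<I", data, 28)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 42)
    segs = []
    for i in range(e_phnum):
        # Elf32_Phdr starts with p_type, p_offset, p_vaddr, p_paddr, p_filesz, p_memsz
        p_type, p_offset, p_vaddr, _p_paddr, p_filesz, p_memsz = struct.unpack_from(
            "<6I", data, e_phoff + i * e_phentsize)
        if p_type == PT_LOAD:
            segs.append((p_offset, p_vaddr, p_filesz, p_memsz))
    return segs

def offset_to_vaddr(segs, offset):
    """Translate a file offset to a virtual address, or None if unmapped."""
    for p_offset, p_vaddr, p_filesz, _p_memsz in segs:
        if p_offset <= offset < p_offset + p_filesz:
            return p_vaddr + (offset - p_offset)
    return None
```

With the segment list in hand, anything found at a raw file offset (a string, a function) can be translated to the address the code actually references. Where memsz is larger than filesz, the extra space is the zero-filled .bss.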
After much (about 45 minutes' worth of) poring over the binary manually, successfully identifying such functions as __libc_start_main() and getopt() and marking some other strings, I basically, through sheer luck, ran into what looked like the 3rd argument string to getopt (not accidentally placed adjacent to other constant strings used by the getopt loop, I imagine). I tried it; it worked. The flag basically 'fixes' the header of the mangled ELF binary so it's now grokable by binutils, yay. OK, so, it felt kind of wrong. But then again, much of this stuff I think is luck and insight into where to look... sheer investigator's kismet, I think...
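In hindsight, the lucky find could be made a little less lucky by searching the raw bytes for strings that are shaped like a getopt option string: option letters, each optionally followed by one or two colons. A rough Python sketch follows. The heuristics (no repeated letters, only letters, digits and colons) are my own guesses and will happily report ordinary short words too, so treat the output as a list of candidates to eyeball.

```python
import re

def printable_strings(data, min_len=2):
    """Yield (offset, text) for each run of printable ASCII, like strings(1)."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    for m in re.finditer(pattern, data):
        yield m.start(), m.group().decode("ascii")

# Optional leading '+', '-' or ':' and then letters/digits, each with up to two colons.
OPTSTRING_RE = re.compile(r"[+\-:]?(?:[A-Za-z0-9]:{0,2})+")

def optstring_candidates(data):
    """Return (offset, text) pairs that could plausibly be getopt optstrings."""
    out = []
    for offset, text in printable_strings(data):
        if not OPTSTRING_RE.fullmatch(text):
            continue
        letters = [c for c in text if c.isalnum()]
        if len(letters) == len(set(letters)):   # an option letter appears once
            out.append((offset, text))
    return out
```

Candidates that sit right next to usage or error message strings are the interesting ones, for the same reason the real one turned out to be.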
Next project: analyze some Windows malware. Perhaps my recent foray into Win32 programming will help me here...
readreadreadreadreadread
----
What I'd like to see in the way of literature is a paper that covers general techniques for finding main() (or WinMain, etc., as appropriate) for the different loaders/compilers. As far as I can figure out so far, there are a few ways to do it. Look at how the compiler generates entry point code and find the odd (non-library) calls. Look for calls to library functions likely to be in a main() loop (e.g., getopt()). Look for the pushing of the environment variables, then argv, then argc (for unix). Of course, you have to make a reasonable guess as to what compiler they are using, etc. But there must be some other tricks...?
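As one concrete instance of the "look at the entry point code" idea: on 32-bit x86 Linux with gcc and glibc (non-PIE), the _start stub typically pushes the address of main as an immediate right before calling __libc_start_main. So the first "push imm32" that is immediately followed by a "call rel32" near the entry point is a decent guess for main. The Python sketch below does exactly that byte-pattern match and nothing smarter. It is not a disassembler, the function name is made up, and other compilers or libcs will lay this out differently.

```python
import struct

PUSH_IMM32 = 0x68   # push imm32
CALL_REL32 = 0xE8   # call rel32

def main_from_start_stub(code, max_scan=64):
    """Guess main()'s address from the bytes at the ELF entry point.

    Looks for 'push imm32' directly followed by 'call rel32' within the
    first max_scan bytes and returns the pushed immediate, or None.
    """
    window = code[:max_scan]
    for i in range(len(window) - 5):
        if window[i] == PUSH_IMM32 and window[i + 5] == CALL_REL32:
            return struct.unpack_from("<I", window, i + 1)[0]
    return None
```

The code argument is the file contents starting at the file offset that corresponds to e_entry. If the guess lands on something that looks like a function prologue and calls getopt() soon after, it is probably right.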
---
I got a gmail account in the past few weeks and it's super cool, for a webmail acct. Yay, google mail!