Binary anatomy [REV]
Table of Contents
The program contains machine code (instructions) and the data (variables and constants). It was decided to organize this in binary files, so all the code and data belonging to particular program.
Compilation process
Preprocessing
gcc -E -P compilation_example.c
-E - Stop after the preprocessing stage; do not run the compiler proper. The output is in the form of preprocessed source code, which is sent to the standard output.
-P - Inhibit generation of linemarkers in the output from the preprocessor. This might be useful when running the preprocessor on something that is not C code, and will be sent to a program which might be confused by the linemarkers.
Compilation
The high level code compiles into the assembly first, not into the direct machine code. It was done like that because it’s easier to compile every other language into one assembly, dedicated to the particular processor and only then translate it into the machine code.
gcc -S -masm=intel compilation_example.c
cat compilation_example.s
Assembly
The output on this stage is object file or the module.
gcc -c compilation_example.c
file compilation_example.o
ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped
Relocatable - file can be moved around without breaking any assumptions in the code. It’s a label that we’re dealing with an object file.
The assembler doesn’t know the memory addresses of the other object files. Object files can be moved in any order to assemble a complete binary exec. Think of it as a brick of the whole binary house.
Object file can have links to functions on other object files or libraries. The reference is a symbolic link.
Linking
Assemble a binary file from the object files. Resolve all symbolic links. Assign particular memory address.
Static libraries (.a) are copied into the binary.
Dynamic libraries (.so) are mot and symbol links are not resolved until the binary would not be loaded into the memory to be executed.
Symbol links for the dynamic libraries would be resolved by interpreter like
file a.out
... interpreter /lib64/ld-linux-x86-64.so.2 ..
Disassembling Process
Disassemble and object file
Show read-only data
objdump -sj .rodata compilation_example.o
Disassemble code in the Intel syntax
objdump -M intel -d compilation_example.o
As an example we see that we have
f: bf 00 00 00 00 mov edi, 0x0
14: e8 00 00 00 00 call 19 <main+0x19>
We have here mov 0 into edi - this should be symbolic link to the rodata, and main+19 - should be puts. Linker will link them appropriately during the assigning memory addresses.
You can check relocating symbols with the next command:
readelf --relocs compilation_example.o
Relocation section '.rela.text' at offset 0x210 contains 2 entries:
Offset Info Type Sym. Value Sym. Name + Addend
➊ 000000000010 00050000000a R_X86_64_32 0000000000000000 .rodata + 0
➋ 000000000015 000a00000002 R_X86_64_PC32 0000000000000000 puts - 4
...
Offset is equal to instruction offset + 1 -> to overwrite the operand, not the opcode.
Disassemble binary file
objdump -M intel -d a.out
Loading and Executing a binary
Here I’ve just listed a high level steps for my shallow understanding ;)
- OS: setting a new process and virtual address space
- OS: maps interpreter (IP) into the process VM
- Kernel: transfer control to the IP
- IP: loads the binary into the virtual address space
- IP: parses the binary to find which dynamic libs to use
- IP: maps DL into VM and does all relocations (well not all, only necessary)
- IP: lookups the entry point and transfers control to it
Questions
Q. How symbolic link in the object file looks like?
A. Firstly you can check how references to the dynamic libraries are look like
strip --strip-all a.out
readelf --syms a.out
With stripping all the symbols, only references to dynamic libs left.
Q. Stripped and non stripped binaries?
A. Not stripped binaries - binaries that has human readable symbols. You can check them with
readelf --syms a.out
To strip the binary
strip --strip-all a.out
What was new to me?
- Relocatable thing - that the object files contains unresolved symbolic links to the read-only data and other functions. How does it looks like
- How to strip the binary
- readelf command and it’s parameters
- Binary should be parsed with interpret first
- Interpret is the one who load the dynamic library and the binary
Reference: Practical binary analysis