Introduction to X86 Assembly for Reverse Engineering

Binary analysis centers on the examination of binaries. But what exactly is a binary? A binary program is a self-contained file containing executable binary code (machine instructions) and data (variables, constants). Modern computers perform computations using the binary numerical system, where all numbers are strings of ones and zeros, and the machine code they execute is called binary code. Understanding the low-level language into which software is translated is vital for binary analysis.
For malware analysis, basic static analysis of file sections, strings, and imports is often not enough; you also need to reverse engineer a program’s assembly code. Disassembly and reverse engineering are at the core of deep static analysis of malware samples, and basic proficiency in reading disassembled x86 code is easier to acquire than it might seem. The more comfortable you are reading assembly, the more comfortable you will be with malware analysis tasks. All software on a platform eventually runs as processor instructions, so an analyst who can read those instructions can deconstruct any program to a good approximation.
The Compilation Process: From High-Level to X86 Assembly
Binaries are produced through compilation, which translates human-readable source code (like C or C++) into machine code that a processor can execute. The output of the compilation phase is assembly code, in a reasonably human-readable form with symbolic information intact. This design decision allows a single assembler to handle the final translation of assembly to machine code for various programming languages, rather than requiring compilers to directly emit machine code for each.
Typically, each source file corresponds to one assembly file, and each assembly file corresponds to one object file. Object files contain machine instructions that are, in principle, executable by the processor, but they must still be linked into a complete binary before they can run. On an x86-64 machine, compilers like gcc emit x86-64 code by default.
Understanding X86 Assembly Language
Assembly language is the lowest-level human-readable programming language for a given architecture, mapping closely to the binary instructions. Because of this close mapping, it can often be recovered from a malware binary with the right tools. There are two major dialects of x86 assembly: Intel and AT&T. Intel syntax is generally used in binary analysis and in official Intel documentation, while AT&T syntax, the default for GNU tools such as gcc and objdump, reverses the operand order.
In Intel syntax, an x86 assembly instruction generally has the form mnemonic destination, source. The mnemonic is a human-readable representation of a machine instruction, and source and destination are its operands. x86 uses variable-length instructions, ranging from 1 byte up to 15 bytes. An encoded instruction consists of optional prefixes, an opcode, and zero or more operands. The opcode is the main designator for the instruction type. Immediates are constant integer operands hardcoded directly into the instruction.
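To make the variable-length encoding concrete, here is a minimal Python sketch that lists a few hand-assembled 32-bit encodings and extracts the immediate from one of them. The names SAMPLES, imm32, and describe are illustrative only and do not come from any library; the byte sequences are standard 32-bit x86 encodings.

```python
import struct

# Hand-assembled 32-bit x86 encodings, paired with their Intel-syntax text.
SAMPLES = [
    (bytes.fromhex("90"), "nop"),
    (bytes.fromhex("55"), "push ebp"),
    (bytes.fromhex("89e5"), "mov ebp, esp"),       # opcode + ModRM byte
    (bytes.fromhex("83c001"), "add eax, 1"),       # opcode + ModRM + imm8
    (bytes.fromhex("b82a000000"), "mov eax, 0x2a"),  # opcode + imm32
]


def imm32(encoding):
    """Return the little-endian 32-bit immediate that follows a 1-byte opcode."""
    return struct.unpack_from("<I", encoding, 1)[0]


def describe(samples):
    """Return (length, hex bytes, text) rows for each sample encoding."""
    return [
        (len(raw), " ".join(f"{b:02x}" for b in raw), text)
        for raw, text in samples
    ]


if __name__ == "__main__":
    for length, hexbytes, text in describe(SAMPLES):
        print(f"{length} byte(s)  {hexbytes:<16} {text}")
    print("immediate of the last sample:", hex(imm32(SAMPLES[-1][0])))
```

The immediate is stored least-significant byte first, which is why the bytes 2a 00 00 00 decode to 0x2a.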
Key X86 Assembly Concepts:

  • CPU Registers: These are small, fast data storage units located on the CPU itself, making register access orders of magnitude faster than memory access. They are used for core computational operations, arithmetic, condition testing, and storing program status.
    ◦ General-Purpose Registers (GPRs): On a 32-bit system, these typically include EAX (Accumulator), EBX (Base), ECX (Counter), EDX (Data), ESP (Stack Pointer), EBP (Base Pointer), ESI (Source Index), and EDI (Destination Index). On x86-64, these registers are widened to 64 bits (RAX, RBX, and so on) and eight more are added, R8 through R15.
    ◦ Instruction Pointer (EIP/RIP): This crucial register contains the memory address of the next instruction to execute. The CPU updates it automatically, and it cannot be written directly with an instruction like mov; it changes only through control-flow instructions such as jmp, call, and ret. Gaining control of it, for example by overwriting a saved return address, is a key goal in buffer-overflow exploits.
    ◦ Flags Register (EFLAGS/RFLAGS): This register holds status flags such as the zero flag (ZF) and the overflow flag (OF), which record whether the last operation yielded zero or resulted in an overflow. Conditional jumps test these flags.
  • The Stack: A memory region reserved for storing data related to function calls. It’s used to pass arguments, allocate local variables, and remember the return address after a function finishes executing.
    ◦ The x86 stack grows downward in memory.
    ◦ The push instruction decrements the stack pointer (ESP/RSP) and then stores a value at the new top of the stack.
    ◦ The pop instruction reads the value at the top of the stack into its destination operand, usually a register, and then increments the stack pointer.
  • Common Instructions:
    ◦ Data Transfer:
    ▪ mov dst, src: Copies the value from src to dst.
    ▪ push src: Pushes src onto the stack.
    ▪ pop dst: Pops a value from the stack into dst.
    ◦ Arithmetic:
    ▪ add dst, src: Adds src to dst and stores the result in dst.
    ▪ sub dst, src: Subtracts src from dst and stores the result in dst.
    ▪ inc reg: Increments the value in reg by 1.
    ▪ dec reg: Decrements the value in reg by 1.
    ◦ Control Flow:
    ▪ jmp target: Unconditionally jumps to target.
    ▪ call func: Calls a function, pushing the return address onto the stack.
    ▪ ret: Returns from a function, popping the return address from the stack.
    ▪ cmp op1, op2: Compares two operands by computing op1 - op2, discarding the result and setting the flags accordingly.
    ▪ jne target: Jumps if not equal, based on flags set by a previous comparison.
  • Control Flow Graphs (CFGs): Programs have a network structure due to conditional and unconditional branches. A basic block is a sequence of instructions where the first instruction is the only entry point and the last instruction is the only exit point. CFGs visualize how basic blocks relate to and flow into one another.
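The register, stack, and instruction concepts above can be tied together with a toy interpreter. This is a minimal sketch under several assumptions, not a faithful x86 emulator: instructions are Python tuples rather than encoded bytes, jump and call targets are list indices rather than memory addresses, only the zero flag is modeled, and the names run and PROGRAM are hypothetical.

```python
MASK = 0xFFFFFFFF  # keep register values within 32 bits


def run(program, max_steps=10_000):
    """Interpret a tiny x86-like instruction subset; returns the final registers."""
    regs = {"eax": 0, "ebx": 0, "ecx": 0, "edx": 0, "esp": 0x1000}
    mem = {}    # sparse memory: address -> 32-bit value
    zf = False  # zero flag
    eip = 0     # index of the next instruction to execute

    def value(operand):
        return regs[operand] if isinstance(operand, str) else operand

    def push(v):
        regs["esp"] = (regs["esp"] - 4) & MASK  # stack grows downward
        mem[regs["esp"]] = v & MASK

    def pop():
        v = mem[regs["esp"]]
        regs["esp"] = (regs["esp"] + 4) & MASK
        return v

    for _ in range(max_steps):
        op, *args = program[eip]
        eip += 1  # eip now points at the following instruction
        if op == "hlt":
            break
        elif op == "mov":
            regs[args[0]] = value(args[1]) & MASK
        elif op in ("add", "sub", "inc", "dec"):
            rhs = value(args[1]) if op in ("add", "sub") else 1
            sign = 1 if op in ("add", "inc") else -1
            regs[args[0]] = (regs[args[0]] + sign * rhs) & MASK
            zf = regs[args[0]] == 0
        elif op == "cmp":
            zf = ((value(args[0]) - value(args[1])) & MASK) == 0
        elif op == "push":
            push(value(args[0]))
        elif op == "pop":
            regs[args[0]] = pop()
        elif op == "jmp":
            eip = args[0]
        elif op == "jne":
            if not zf:
                eip = args[0]
        elif op == "call":
            push(eip)       # return address: the instruction after the call
            eip = args[0]
        elif op == "ret":
            eip = pop()
        else:
            raise ValueError(f"unsupported instruction: {op}")
    return regs


PROGRAM = [
    ("mov", "ecx", 3),      # 0: loop counter
    ("mov", "eax", 0),      # 1: accumulator
    ("add", "eax", "ecx"),  # 2: loop body (a basic block starts here)
    ("dec", "ecx"),         # 3
    ("cmp", "ecx", 0),      # 4
    ("jne", 2),             # 5: back edge to the loop body
    ("push", "eax"),        # 6: save the sum on the stack
    ("call", 10),           # 7: pushes return index 8
    ("pop", "ebx"),         # 8: restore the saved sum into ebx
    ("hlt",),               # 9
    ("inc", "eax"),         # 10: callee body
    ("ret",),               # 11
]

if __name__ == "__main__":
    print(run(PROGRAM))
```

The loop sums 3 + 2 + 1 into eax, the callee increments it to 7, and the saved sum of 6 is popped into ebx. The stack pointer returns to its starting value because every push is matched by a pop or a ret.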
Disassembly for Reverse Engineering
Disassembly is the process of translating malware’s binary code into valid x86 assembly language. This is a crucial step for deep static analysis.
There are two major approaches to static disassembly:
  1. Linear Disassembly: This conceptually simple approach iterates through all code segments in a binary, decoding bytes consecutively. Tools like objdump often use this. Its limitation is that not all bytes may be instructions; inline data can be misinterpreted as code, leading to invalid opcodes or desynchronization, especially on dense, variable-length ISAs like x86.
  2. Recursive Disassembly: This method is sensitive to control flow and starts from known entry points (like the main entry point or function symbols), then follows control flow instructions. It’s less susceptible to being fooled by inline data. Tools like IDA Pro and custom disassemblers built with Capstone use this approach. However, it may miss instructions if they are only reachable via indirect control flows that cannot be resolved statically.
Dynamic disassembly, also known as execution tracing, logs each executed instruction as the binary runs. Its main drawback is the “code coverage problem”: it only sees the instructions that are actually executed during a specific run, not all possible instructions.
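The difference between the two static approaches can be sketched on a small byte sequence that contains inline data. This is a toy illustration, not how objdump or IDA Pro are implemented: decode understands only four real 32-bit encodings (nop, ret, jmp rel8, mov eax, imm32), and the names decode, linear_sweep, recursive_descent, and CODE are hypothetical.

```python
import struct

# jmp over two data bytes, then the real code: mov eax, 0x2a ; ret
CODE = bytes.fromhex("eb02 b811 b82a000000 c3")


def decode(code, off):
    """Toy decoder. Returns (length, text, branch_target, falls_through)."""
    op = code[off]
    if op == 0x90:
        return 1, "nop", None, True
    if op == 0xC3:
        return 1, "ret", None, False
    if op == 0xEB and off + 2 <= len(code):
        rel = struct.unpack_from("<b", code, off + 1)[0]
        target = off + 2 + rel  # relative to the end of the instruction
        return 2, f"jmp 0x{target:x}", target, False
    if op == 0xB8 and off + 5 <= len(code):
        imm = struct.unpack_from("<I", code, off + 1)[0]
        return 5, f"mov eax, 0x{imm:x}", None, True
    return 1, f"(bad) 0x{op:02x}", None, False


def linear_sweep(code):
    """Decode every byte consecutively, ignoring control flow."""
    out, off = [], 0
    while off < len(code):
        length, text, _, _ = decode(code, off)
        out.append((off, text))
        off += length
    return out


def recursive_descent(code, entry=0):
    """Decode only what is reachable from the entry point."""
    seen, worklist = {}, [entry]
    while worklist:
        off = worklist.pop()
        while 0 <= off < len(code) and off not in seen:
            length, text, target, falls_through = decode(code, off)
            seen[off] = text
            if target is not None:
                worklist.append(target)
            if not falls_through:
                break
            off += length
    return sorted(seen.items())


if __name__ == "__main__":
    print("linear:   ", linear_sweep(CODE))
    print("recursive:", recursive_descent(CODE))
```

The linear sweep decodes the data byte at offset 2 as the start of a mov, swallows the first bytes of the real instruction at offset 4, and never recovers it. The recursive pass follows the jmp and finds the real mov and ret, but it would miss any code reachable only through an indirect jump.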
Tools for Disassembly:
  • objdump: A simple, easy-to-use disassembler included with most Linux distributions, often used for linear disassembly.
  • IDA Pro: The de facto industry-standard recursive disassembler, offering an interactive graphical interface and extensive features for manual reverse engineering.
  • Capstone: A free, open-source disassembly framework providing a lightweight, multi-architecture API for building custom disassembly tools. It offers detailed inspection of disassembled instructions.
  • pefile and capstone (Python Libraries): Used together to disassemble 32-bit x86 binary code from PE files, common in malware analysis.
Challenges in Reverse Engineering X86 Binaries:
Malware authors frequently employ anti-disassembly techniques to thwart reverse engineers. These include obfuscation, such as encryption, polymorphism, and metamorphism. Instruction overlapping is a particularly effective obfuscation technique on x86, where instructions can vary in length and the processor doesn’t enforce strict alignment, allowing one instruction to partially or completely overlap with another.
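A well-known small case of instruction overlapping is a short jump that lands inside its own encoding. The sketch below decodes the same four bytes from two different offsets to show the shared byte. It assumes 32-bit mode (where 0x48 is dec eax) and uses a toy decoder that knows only the three encodings involved; decode, covered, and CODE are hypothetical names.

```python
# 32-bit x86: "jmp short -1" transfers control to its own second byte.
CODE = bytes.fromhex("ebffc048")


def decode(code, off):
    """Toy decoder for the three encodings used here; returns (length, text)."""
    op = code[off]
    if op == 0xEB:  # jmp rel8
        rel = code[off + 1] - 256 if code[off + 1] >= 128 else code[off + 1]
        return 2, f"jmp 0x{off + 2 + rel:x}"
    if op == 0xFF and code[off + 1] == 0xC0:  # FF /0 with ModRM C0
        return 2, "inc eax"
    if op == 0x48:  # dec eax in 32-bit mode
        return 1, "dec eax"
    raise ValueError(f"unsupported opcode 0x{op:02x} at offset {off}")


def covered(code, off):
    """Return the instruction text and the set of byte offsets it occupies."""
    length, text = decode(code, off)
    return text, set(range(off, off + length))


if __name__ == "__main__":
    first_text, first_bytes = covered(CODE, 0)
    second_text, second_bytes = covered(CODE, 1)
    print(first_text, sorted(first_bytes))
    print(second_text, sorted(second_bytes))
    print("shared offsets:", sorted(first_bytes & second_bytes))
```

The byte 0xFF at offset 1 is both the jump displacement and the opcode of inc eax. A disassembler that assigns each byte to exactly one instruction can show only one of the two views.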
Furthermore, production-ready binaries often have symbolic information stripped to reduce file sizes and prevent reverse engineering, especially in malware or proprietary software. This means analysts frequently deal with stripped binaries, making it challenging to identify functions and data, which might have only automatically generated names or addresses instead of symbolic ones.
Key Takeaways and Steps for Reverse Engineering
Understanding x86 assembly is paramount for anyone involved in binary analysis, especially for malware analysis and vulnerability research.

  1. Master X86 Assembly Fundamentals: Dedicate time to understanding the core concepts of x86 assembly.
    ◦ Registers: Know the general-purpose registers (EAX, EBX, ECX, EDX, ESP, EBP, ESI, EDI on 32-bit systems, plus R8-R15 on 64-bit) and their common uses. Crucially, understand the role of the Instruction Pointer (EIP/RIP), which holds the address of the next instruction to execute. Also, be aware of the Flags Register (EFLAGS/RFLAGS), which tracks operation results.
    ◦ Instructions: Familiarize yourself with common instruction mnemonics, especially those for data transfer (mov, push, pop), arithmetic (add, sub, inc, dec), and control flow (jmp, call, ret, cmp, jne). Recognize their purpose and effect on registers and memory. Instructions on x86 are variable-length, from 1 to 15 bytes.
    ◦ The Stack: Understand how the stack functions as a memory region reserved for function calls, used to pass arguments, allocate local variables, and store return addresses. Remember that the x86 stack grows downward in memory, with push decrementing the stack pointer (ESP/RSP) and pop incrementing it.
    ◦ Control Flow: Be able to identify basic blocks (sequences of instructions with a single entry and exit point) and trace program execution flow using concepts like jmp, call, ret, and conditional jumps. Visualizing these relationships often involves Control Flow Graphs (CFGs).
  2. Understand the Compilation Process: Recognize that human-readable source code (like C or C++) is translated into assembly language, which is then further translated into machine code that a processor can execute. This helps in seeing the correspondence between high-level code and the resulting assembly.
  3. Learn Binary Formats: Gain familiarity with common binary executable file formats.
    ◦ Executable and Linkable Format (ELF) is the default binary format on Linux-based systems, used for executables, object files, shared libraries, and core dumps. Key components include an executable header, program headers, and sections (like .text for code, .data, .bss, and .rodata for data).
    ◦ Portable Executable (PE) is the file format used by most Windows programs (e.g., .exe, .dll files). Important sections in PE include .text (executable code), .idata (import information, including the Import Address Table, or IAT, which lists the dynamically linked libraries and functions the program uses), .rsrc (resources like strings and images), .data, .rdata, and .reloc.
  4. Practice Disassembly Techniques: Experiment with different approaches to translating binary code into assembly language.
    ◦ Linear Disassembly: This method decodes bytes consecutively but can be misled by inline data. Tools like objdump often use this approach.
    ◦ Recursive Disassembly: This method is sensitive to control flow, starting from known entry points and following branches. It is less prone to misinterpreting data as code and is the basis for advanced disassemblers like IDA Pro.
    ◦ Dynamic Disassembly (Execution Tracing): Logs executed instructions as a binary runs, but only covers the paths actually taken during execution.
  5. Be Aware of Anti-Analysis Techniques: Recognize that malware and proprietary software often employ techniques to hinder reverse engineering.
    ◦ Stripping Symbolic Information: Production-ready binaries frequently have symbolic information removed to reduce file sizes and prevent reverse engineering, making it challenging to identify functions and data.
    ◦ Code Obfuscation: Malware authors use techniques like encryption, polymorphism, metamorphism, and instruction overlapping (where one instruction partially or completely overlaps with another) to thwart analysis.
  6. Hands-on Application: The best way to learn is by doing. Practice disassembling simple programs, modifying them, and then moving to more complex scenarios like analyzing malware samples.
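As a first hands-on exercise with the binary formats from step 3, the following minimal sketch reads the first 20 bytes of an ELF header using only the Python standard library. The field offsets and constant values follow the ELF specification; parse_elf_header and SAMPLE are hypothetical names, and SAMPLE is a synthetic header prefix built for the example rather than a real binary.

```python
import struct

ELF_CLASS = {1: "ELF32", 2: "ELF64"}
ELF_DATA = {1: "little-endian", 2: "big-endian"}
ELF_TYPE = {1: "relocatable", 2: "executable", 3: "shared object", 4: "core"}
ELF_MACHINE = {3: "x86", 62: "x86-64"}


def parse_elf_header(blob):
    """Decode class, byte order, type, and machine from an ELF header prefix."""
    if len(blob) < 20 or blob[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    ei_class, ei_data = blob[4], blob[5]
    endian = ">" if ei_data == 2 else "<"
    # e_type and e_machine are 16-bit fields right after the 16-byte e_ident.
    e_type, e_machine = struct.unpack_from(endian + "HH", blob, 16)
    return {
        "class": ELF_CLASS.get(ei_class, f"unknown ({ei_class})"),
        "data": ELF_DATA.get(ei_data, f"unknown ({ei_data})"),
        "type": ELF_TYPE.get(e_type, f"unknown ({e_type})"),
        "machine": ELF_MACHINE.get(e_machine, f"unknown ({e_machine})"),
    }


# Synthetic prefix: magic, ELF64, little-endian, version 1, padding,
# then e_type = 2 (executable) and e_machine = 62 (x86-64).
SAMPLE = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8) + struct.pack("<HH", 2, 62)

if __name__ == "__main__":
    print(parse_elf_header(SAMPLE))
```

To try it on a real file, read the first 20 bytes of any ELF binary on a Linux system and pass them to parse_elf_header, then compare the result with the output of readelf -h.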
Essential Resources for X86 Assembly & Reverse Engineering
The following tools and references are useful for mastering x86 assembly and reverse engineering:
  • Disassemblers and Debuggers:
    ◦ IDA Pro: The de facto industry-standard recursive disassembler for Windows, Linux, and macOS. It offers an interactive graphical interface and extensive features for manual reverse engineering, with many advanced analysis techniques relying on its core disassembly functionality.
    ◦ objdump: A simple, easy-to-use command-line disassembler included with most Linux distributions, often used for linear disassembly.
    ◦ OllyDbg: A powerful, full-featured, assembler-level analyzing debugger that runs on Windows, useful for binary analysis.
  • Disassembly and Binary Analysis Frameworks/Libraries:
    ◦ Capstone: A free, open-source, multi-architecture disassembly framework designed for a simple, lightweight API. It has bindings for C/C++ and Python, and can recover virtually all relevant details of disassembled instructions, making it excellent for building custom disassembly tools.
    ◦ distorm3: An open-source disassembly API for x86 code, aiming at fast disassembly.
    ◦ pefile: A popular Python library used to dissect Portable Executable (PE) files, which are common in Windows malware analysis.
    ◦ libbfd (Binary File Descriptor library): Provides a common interface for reading and parsing all popular binary formats, including ELF and PE files for x86/x86-64 machines. It’s useful for building binary loaders.
    ◦ libelf: A popular open-source library for manipulating the contents of ELF binaries, useful when implementing your own binary analysis tools.
  • Static Analysis Utilities:
    ◦ strings: A utility that extracts printable character strings from binaries, which can provide vital clues about a file’s functionality or origin, including strings stored in the .rsrc section of PE files.
    ◦ readelf: A command-line tool for displaying information about ELF files, including symbols and sections.
    ◦ xxd: A hex viewer that allows you to view the raw bytes of binaries in hexadecimal format.
  • Dynamic Analysis and Instrumentation Tools:
    ◦ Intel Pin: A powerful binary instrumentation framework that allows you to analyze and modify binaries at runtime. Many advanced binary analysis platforms, like dynamic taint analysis tools, are built on Pin.
    ◦ libdft: An open-source Dynamic Taint Analysis (DTA) library, built on Intel Pin, that enables tracking the flow of tainted data from sources to sinks within a program’s memory. It supports byte-granularity taint-tracking and multiple taint colors.
    ◦ Triton: A dynamic binary analysis framework that supports symbolic execution and taint analysis, allowing you to reason about program states and find vulnerabilities like buffer overflows.
  • Reference Materials and Datasets:
    ◦ Intel Architecture Software Developer’s Manuals: Essential for detailed information about low-level programming of Intel processors.
    ◦ MSDN API libraries: A valuable resource for understanding the Windows API (Win32 API) functions and their internal details.
    ◦ Sysinternals Tools: A suite of utilities for Windows (e.g., Process Explorer, HandleEx, TCPView, RegMon, FileMon) that provide real-time monitoring and in-depth analysis of processes, files, registry, and network activity. RegMon and FileMon have since been superseded by Process Monitor.
    ◦ Online Documentation: Websites like https://intelxed.github.io/ref-manual/ for a list of x86 instruction classes.
    ◦ Malware Datasets: Practical experience can be gained by analyzing real-world malware samples, such as the APT1 dataset and various benign and malicious HTML files for neural network training. A VirtualBox Ubuntu instance with preloaded data and code is provided for convenience.
    ◦ Other Books/Publications: Resources like “The IDA Pro Book” for detailed reverse engineering and papers on dynamic taint analysis and symbolic execution can deepen your understanding.
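The idea behind the strings utility listed above is simple enough to sketch in a few lines. The following is a rough approximation that handles ASCII only, not a reimplementation of the real tool; extract_strings and SAMPLE are hypothetical names, and the sample bytes are made up for the example.

```python
import re


def extract_strings(blob, min_len=4):
    """Return runs of at least min_len printable ASCII characters in blob."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.group().decode("ascii") for m in re.finditer(pattern, blob)]


# Made-up bytes mixing non-printable data with two embedded strings.
SAMPLE = b"\x00\x01MZ\x90\x00http://example.test/payload\x00\xff\xfecmd.exe\x00ab\x00"

if __name__ == "__main__":
    for s in extract_strings(SAMPLE):
        print(s)
```

The minimum length of 4 filters out short accidental runs such as "MZ" and "ab". Lowering min_len surfaces more candidates at the cost of more noise.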
