Learning Note: From Source File to Executable File

This article is a summary of Chapter 8 of:

Yazawa, H. (2015). 程序是怎么跑起来的[How Program Works] (L. Fengjun, Trans.). People's Posts and Telecommunications Press

Source Code and Native Code

Code written in a programming language is called source code, and the files that store it are source files. Before the CPU can execute it, source code must be translated into machine language, also known as native code. Fig 1 shows an example of native code, displayed in hexadecimal, from an EXE file:

Fig 1. Native code represented in hexadecimal

A compiler is a program that translates source code into native code. Which compiler you need depends not only on the programming language but also on the type of CPU, since different types of CPU have different native code. A cross compiler generates native code for a CPU type different from the one the compiler itself runs on.

Fig 2. There are different compilers for different CPU instruction sets. Image reproduced from (Yazawa, 2015)

From Object File to Executable File

The compiler translates a source file (.c in C) into an object file (.obj on Windows). However, an object file is not yet executable; it must first be linked. Linking is the operation that connects the object file with the external libraries it uses, and it is done by the linkage editor (linker). For example, if an object file calls functions defined in other files, the files containing those functions must be linked with it to form a complete EXE file.

Even if the object file does not use any external programs, it still has to be linked with the startup code (called the loader here) that sets up the program's execution environment. We also do not need to name every external file the object file uses during linking. Instead, a library file (.lib on Windows) bundles multiple object files into a single file; we link against the library directly, and the linkage editor picks out the specific object files it needs. Functions that are stored in library files rather than written in the source file are called standard functions.

To illustrate the process with an example, assume a source file called sample.c that calls two external functions, sprintf() (in cw32.lib) and MessageBox() (in import32.lib). We use Windows and the Borland C++ compiler.

  1. Compile the source file into the object file sample.obj: bcc32 -W -c sample.c

  2. Link the object file with the loader and the library files: ilink32 -Tpe -c -x -aa c0w32.obj sample.obj, sample.exe,, import32.lib cw32.lib (Here c0w32.obj is the loader, and import32.lib and cw32.lib are the external libraries used. The empty slot between the two commas is where a map file name would go. The result is the executable file sample.exe.)

DLL and Import Library

The Windows operating system provides many APIs for programs. These APIs are normally stored in DLL (Dynamic Link Library) files, which are loaded into a program dynamically at runtime. In the example above, MessageBox() is actually a Windows API function stored in user32.dll. The import32.lib library is in charge of connecting the DLL file with the object file: instead of the function's code, it holds the information needed to find the function in the DLL. Such library files are called import libraries.

Static link libraries store the actual object files of the library functions, which are combined with the program's object file when the executable file is generated. In the case above, cw32.lib is a static link library.

Fig 3. The whole process of invoking static link libraries and DLLs. Image reproduced from (Yazawa, 2015)

The Memory Allocated for a Program during Execution

EXE files contain virtual memory addresses for variables and functions. At runtime, these virtual addresses are converted to physical addresses. The information needed for this conversion is called relocation information (rendered here as reconfiguration information) and is added by the linkage editor. It records the address of each variable and function relative to a starting address. At runtime, variables and functions are each loaded into consecutive regions of memory, so the address of any variable or function can be computed as the starting address plus its relative address.

Fig 4. The structure of an EXE file. Image reproduced from (Yazawa, 2015)

At runtime, two storage areas for program data, called the stack and the heap, are also allocated in memory. The stack stores the local variables of functions, and the heap stores data and objects that the program allocates while it runs. Both areas are allocated at runtime, so neither is part of the EXE file.

For the stack, releasing memory is handled automatically by code the compiler generates. For the heap, however, allocation and release are not automatic. Languages like C and C++ require programmers to allocate and release heap memory manually. If data stored in the heap is not properly released, a memory leak occurs. Languages like Java and C# use garbage collection algorithms to automatically release heap memory held by data and objects that are no longer used.

Fig 5. The memory used by a program during its runtime. Image reproduced from (Yazawa, 2015)
