CS107E Guide - Binary Utilities (Binutils)

Download as pdf or txt
Download as pdf or txt
You are on page 1of 3

Guide: Binary Utilities (binutils)

Written by Pat Hanrahan

This document describes binutils (binary utilities). These are utility programs used for compiling, inspecting, and
transforming binaries (machine code files). Some of them you've probably heard of or used before. Other, more
specialized ones, might be new to you. Learning these tools and what they can do will greatly improve your productivity
as a programmer. They'll let you quickly and easily figure out what is wrong and why with a large number of otherwise
very difficult problems.

This guide refers to each tool with its short name (e.g., ld , as ). Typically, on a UNIX/Linux/Darwin machine, these
names refer to the native tools, tools for that computer. For example, gcc on a myth machine is a version of the Gnu C
Compiler for Linux on an x86 processor. Since we're using a laptop to cross-compile for the Pi, you want to use the binary
utilities that are designed to run on your machine but compile for the Pi. These tools are prefixed with riscv64-
unknown-elf . So you want to use riscv64-unknown-elf-ld not ld .

Core utilities
This section discusses the core utilities that you use very often: as , ld , nm , size , strings , strip , objcopy ,
objdump , and ar .

as
This is the assembler. It takes assembly code (human readable text of machine code) and turns it into a binary that a
processor can execute. When you invoke gcc and tell it to generate a binary (e.g., an object file), gcc first compiles the C
code to assembly then invokes the assembler to generate machine code. The output of as is a file format called ELF, or
Executable and Linkable Format. Almost all executables you run in Linux, for example, are ELF. ELF includes not only
machine code (the E part) but also symbols, names that provide information on where different variables and functions
reside (see nm and objdump below). When you compile a library, for example, the resulting ELF file has not only the
library's code, but also a set of symbols that say where the code for each library function starts. That way, when you
compile a program that invokes the library function, binutils can find where it is and link the two files together correctly
(see ld below).

The most common use of as is as follows:

$ as code.s -o code.o

This command instructs the assembler to read assembly code from the input file code.s and to write the machine code
to the output file code.o . If you do not include -o code.o then it will default to a output file named to a.out .

Just as the input to as might not be a executable program, the output of as might not be executable. For example, a
library is a set of functions that other programs can call, but does not have any entry point ( main() function). It can also
be that the compiled assembly code references functions that are in another binary file. For example, your program that
calls a library, when compiled into machine code, does not have the actual instructions of those library calls. The tool that
takes multiple binary files and links them together is ld , described below.

ld
The linker. ld takes object files, libraries, and other binary files as input and links them together into another binary file.
For example, when you have a library written in multiple source files, you typically compile each source file
independently into an ELF object file, then link all of those files together into a larger ELF object file for the library. When
you compile a program that uses the library, you compile your program code into one or more object files, then link those
against the library, producing a final executable.

The linker operates by resolving symbols. The name of each of your functions and global variables is a symbol. Symbols
come in two forms: a definition that associates a name with its data (e.g. initializing a global variable or the body of a
function) and a declaration/use of that name. There must be exactly one definition for a symbol, but there can be
multiple declarations/use of the symbol. For example, when you define a variable int a = 17; in global scope, this
creates a symbol for a , which establishes at what memory address a resides and sets its initial value to 17. Any use of
the global a must be resolved to refer to this one copy of the variable. Similarly with a function named f() . There will
be one definition of f() (perhaps in a library). Every other place where you call the function f() it generates an
unresolved reference. It is the job of the linker to resolve all such references to use the one shared copy. When the linker
links your program, it sees that f is an unresolved symbol, sees where f is defined in the library, then fixes the binary
code in your program so that when it calls f it jumps to where f exists.

The basic use for ld is as follows:

$ ld code1.o code1.o -o code.bin

This takes two object files, links them together, and outputs the result as code.bin .

When you invoke GCC without the -c option, it typically runs ld as its last step. For example,

$ gcc -o test test.c

is mostly equivalent (I'm leaving out a lot of options that are added by default) to:

$ gcc -S test.S test.c # Generate assembly


$ as -o test.o test.S # Generate object file
$ ld -o test test.o

The most commonly used options for ld are -l and -L . The -l option tells ld to link a library, which is basically
just an object file. But libraries are typically stable code that you don't update often and are used by many programs, so
they live in separate directories for that purpose (e.g, /usr/lib ). The -L option tells ld what directories to look for
libraries in. So, for example,

$ ld -o test test.o -lgcc -L/usr/lib

tells ld to link a file name libgcc.a (or libgcc.so , but we will not be using shared objects in this class, so don't
worry about them) which it should search for not only in the local directory but also /usr/lib . You can pass multiple -
l and -L options. For example, gcc will typically automatically pass -lgcc and -L/usr/lib options when it
invokes ld , which are in addition to any other library loads or library search path entries you add.

nm
Displays the symbol table of a binary file. It takes one or more file names as parameters and outputs their symbol tables
in a format like this:

$ nm file.o
00010074 B __bss_end
00010070 B __bss_start
00010070 T _text_end
U _start
00010070 B a
00008040 T f
00008000 T main

The first column states the address of the symbol. For a function, this is the first instruction to jump to when you call the
function. For a variable, this is its location in memory. The second column states what kind of symbol it is. The types are:

1. T : Text. This is a symbol for executable code (e.g., a function).


2. D : Data. This is a symbol for a variable that has an initializer (e.g., int a = 5; ).
3. B : Data. This is a symbol for a variable that doesn't have an initializer (e.g., int a; ).
4. U : Undefined. This a symbol the objects needs to be linked. It's defined elsewhere, and hopefully the linker will
resolve it later.
5. N : Symbol used for debugging.
6. A : Absolute. Later linking will not change this symbol.

So in the above example, the object file defines a function f , a function main , and a variable a . The other symbols
(generally, symbols that start with _ ) are generated by the compiler for bookkeeping and linking.

size
Lists the size of sections (and total size) of object files. Can be invoked on multiple files at once simply by listing the
desired files, which produces output like this:

$ size test.o test2.o


text data bss dec hex filename
80 24 32 136 88 test.o
72 0 0 72 48 test2.o

Under text you see the size of the actual machine code that makes up your program. Similarly to the symbol types
listed under nm above, the data section is the size of storing your initialized global variables, and bss is the size of
storing your uninitialized global variables. The dec and hex numbers indicate the total size (sum of text, data, & bss) in
decimal and hexadecimal, respectively.

The options for the size command are mostly to change the format of the output. For example, to see the section sizes in
hexadecimal, use the -x option ( size -x test.o ). You can also specify the file type if it is not automatically
recognized. For example if you want to know the size of a raw binary file, you could use

$ size --target=binary test.bin

strings
Prints text strings embedded in the input file. strings is useful for searching binary files, which are not readable using
a text editor. For example, if you wanted to search an object file for a particular string, you could call strings and pipe the
result to grep, like this:

$ strings test.o | grep "my string"

By default, strings looks for strings of at least 4 printable characters (followed by a NUL character indicating the end
of a string). To set a minimum string length other than 4, use the -n option. For example,

$ strings -n 6 test.o

looks for strings of at least 6 characters.

strip
Removes the symbol table from an object file. The symbol table has information about each symbol by name, including
size, type, and address, see ld above. By stripping out symbol tables and debug information, strip decreases the size
of object files.

strip modifies the input file rather than creating a new, stripped output file. Compare the file before and after you
strip it using nm (above). You will find that the symbol information has been removed. All of the code/data for the
symbols remains in the binary, but there is no longer a "legend" that identifies which symbol is where.

objcopy
Transforms binary objects between different formats. For example, you can use objcopy to transform an ELF executable
(which has symbols and all kinds of other information) into a simple binary. The principal options to objcopy are -O
and -I , which specify the output and input formats. Example formats are elf32-bigarm, ihex, and binary. Sometimes
objcopy can tell what the format is and so doesn't need to be told explicitly (e.g., ELF). In its most basic use, objcopy
just makes a copy of the file. For example,

$ objcopy main main2

creates a simple of main in main2. In contrast,

$ objcopy main -O binary main.bin

takes main as input (an ELF file), transforms it into a raw binary file and outputs that raw binary as main.bin.
objdump
Displays information about object files. To use objdump , you must specify at least one of the many options, which
indicate what type of information you would like to view. There are many different options, check out man objdump to
see what it can show.

A very useful option is the -d option ( -d is for disassemble), which allows you to view the assembly instructions
associated with the executable part of the binary file:

$ objdump -d test.o

ar
Allows you to create, modify, and extract archives. Archives are single files holding collections of other (usually binary)
files, similar to a zip or tar file. Archives are named with a .a extension and are usually used to hold libraries. The linker
(see ld above) is often used to link to functions in these archive library files.

ar has options for you to create an archive, add or remove files from an existing archive, and extract files from an
archive. To create an archive from object files, use the cr or crs options, followed by your object files. The c option
means it will not warn you that it needs to create the library (since that is what you are trying to do). The r option says
to insert the new files (or replace existing ones). If you specify the s option, the archive maintains an index to all
symbols defined in files in the archive to allow for quicker linking to the library functions.

For example,

$ ar cr libtest.a test.o test2.o

makes the library archive file libtest.a containing test.o and test2.o. Then you can link to this library by specifying -
ltest in the ld command (see ld above).

For more options on how to modify archives, see man ar or ar --help .

CS107e Winter 2024 · Site generated 2024-03-01 17:09


Core utilities
as
ld
nm
size
strings
strip
objcopy
objdump
ar

You might also like