Bulk Extractor
bulk_extractor
The bulk_extractor is a program that extracts email addresses, credit card numbers, URLs, and other types of
information from any kind of digital evidence. The program operates on disk images in raw, split-raw, EnCase
E01, and AFF formats, but it has also been used productively on sessionized TCP/IP traffic, memory
dumps, and archives of files downloaded from the Internet. The program can also analyze media
directly connected to the analyst's computer, for example, through a write blocker. The data to be analyzed are
divided into pages that are separately processed by one or more scanners. Identified features are stored in
feature files, a simple line-based format containing extracted features, their location, and local context.
The bulk_extractor detects and optimistically decompresses data compressed with ZIP, gzip, and Microsoft's
Xpress Block Memory Compression algorithm. This has proven useful in recovering email addresses from within
fragments of corrupted Windows hibernation files.
The bulk_extractor gets its speed through the use of GNU flex (The Flex Project, 2008), which allows
multiple regular expressions to be compiled into a single finite state machine, and multi-threading (§ 4.4),
which allows multiple pages to be analyzed at the same time on different cores.
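The single-pass idea can be sketched in Python. This is only an illustration: Python's re module is a backtracking engine rather than a flex-generated state machine, and the two patterns below are simplified assumptions, not the rules bulk_extractor uses. The sketch joins several patterns into one alternation so that one pass over a buffer reports every feature type.

```python
import re

# Hypothetical, deliberately simplified patterns (assumptions for this sketch).
PATTERNS = {
    "email": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+",
    "url": r"https?://[^\s<>]+",
}

# One alternation with a named group per feature type, so a single pass
# over the text finds all feature types at once.
COMBINED = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in PATTERNS.items())
)


def scan(text):
    """Yield (feature_type, offset, feature) for each match in text."""
    for match in COMBINED.finditer(text):
        yield match.lastgroup, match.start(), match.group()


sample = "mail alice@example.com or see http://example.com/page today"
features = list(scan(sample))
```

Each match carries the name of the group that fired, which plays the role of deciding which feature file a hit belongs to.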
After the features have been extracted, bulk_extractor builds a histogram of email addresses, Google search
terms, and other extracted features. Stop lists can remove features not relevant to a case.
The remainder of this section introduces bulk_extractor with a typical case and then presents the program's
overall design; the following section (§ 4) discusses the current implementation; Section 5 presents
approaches for validation.
Use case
The bulk_extractor is designed to be used in the early part of an investigation involving digital media. A typical
case might involve the analysis of 20 laptops and desktops seized from suspected members of a child
exploitation group. Each piece of subject media is connected to an analyst workstation with a write-blocker
and directly processed with bulk_extractor. (Time to initiate: approximately 5 min per machine.) The
bulk_extractor runs in a batch, unattended operation. (Time to process: between 1 and 8 h per piece of media,
depending on size and complexity of the subject data.)
Each running instance of bulk_extractor creates a directory where the program's output is stored. The
output consists of one or more feature files. Each feature file is a text file that contains the location of each
feature found, the feature itself, and the feature surrounded by its local context (typically 16 characters from
either side of the feature). Typical feature files are email.txt for email addresses, url.txt for URLs, aes.txt for
AES keys, and so on. Some of the information that is present in the feature files originated in compressed data
on the subject disk. For example, many URLs and email addresses were present in browser cache entries
compressed with the gzip compression algorithm. Because it was compressed, the data would not be evident
simply by running the Unix strings program or by a manual examination of the disk sectors.
When the extraction phase is finished, each instance of bulk_extractor reads the feature files and creates a
feature histogram for each file. Post-processing also extracts a histogram of popular search terms.
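A minimal sketch of this post-processing step, assuming each feature line holds tab-separated fields with the feature in the second position and that lines starting with '#' are comments (the comment convention is an assumption here). The sample lines are made up.

```python
from collections import Counter


def feature_histogram(lines):
    """Count how often each feature (second tab-separated field) occurs."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and (assumed) '#' comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts.most_common()  # most frequent features first


# Made-up feature lines: offset <TAB> feature <TAB> context
example = [
    "1024\talice@example.com\tTo: alice@example.com__\n",
    "4096\tbob@example.com\tFrom: bob@example.com__\n",
    "9000\talice@example.com\tcc alice@example.com;\n",
]
histogram = feature_histogram(example)
```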
When the program finishes, the examiner may manually review the histogram files:
The list of email addresses provides the examiner with a quick report of individuals that may have some
connection to the drive. Email addresses can appear on a drive for many reasons: they can be in email
messages, or in the web cache from webmail, or in a web cache because they appeared at the bottom of a
news article that was being read, or even because they were in a Microsoft Office document.
However, experience has shown that the most common email addresses on a drive are typically present
because they were in multiple email or webmail messages, and that is usually because they are associated
with either the drive's primary user or one of that person's correspondents.
Search terms can be used as an indicator of the computer user's intent.
The presence, frequency and number of credit card numbers can be used to infer if the drive had a large
number of credit card numbers (an indication of either credit card processing or credit card fraud), or a high
frequency of very few credit card numbers (an indicator of frequent e-commerce).
Tools such as grep can be used to scan the list of URLs to extract identifiers for Facebook, Microsoft Live, and
other online services. Lists of identifiers can also be uploaded to other systems for correlation with law
enforcement databases.
Although the feature files can be directly entered into evidence or used as the basis of a formal report, more
commonly the examiner will use the data to inform a more detailed analysis of the media with a traditional
forensic tool such as EnCase or FTK. In this manner, bulk_extractor is used for triage, that is, to prioritize
analysis based on the content of the media itself, rather than the circumstances surrounding the media's
capture. Because it is easy to use and readily integrated with existing forensic processes, bulk_extractor has
been adopted by a number of law enforcement organizations, and its use is growing.
Requirements study
Between 2003 and 2005 a prototype bulk media analysis tool was developed by the author to assist in an
unrelated investigation. Experience showed the tool to be significantly faster than file-based systems and
to allow easy answers to questions of interest early in an investigation, such as "Does this drive contain sensitive
information?", "What search terms were used?" and "What is the most common email address on the drive?"
Because the prototype did not map to an existing forensic tool category, a series of unstructured interviews
were held with local, state and federal law enforcement (LE) forensic examiners to determine if there was a
need for this new kind of tool. In total, approximately 20 interviews took place between 2005 and 2008.
Although it may seem that the tool development described here was the result of documented needs stated
by LE, this was not the case. LE practitioners interviewed at the time were generally pleased with their
then-current tools, which seemed quite powerful and had required considerable effort to master. Examiners merely
wanted their existing tools to run faster; they were not looking for tools that implemented fundamentally new
approaches. Indeed, at the beginning of the interviews, several LE practitioners and trainers spoke derisively of
the desire to create a so-called "get evidence" button. Such a button could not be created, these practitioners
asserted, because a computer would never be able to make sense of all the information left behind on a digital
storage device and arrange it in a manner that was consistent with the objectives of a case.
It was only after seeing some of the initial results of the early prototype that some analysts became
enthusiastic about the work and requested that the tool be further developed to extract specific types of
information, including:
Email addresses
Credit card numbers, including track 2 information
Search terms (extracted from URLs)
Phone numbers
GPS coordinates
EXIF (Exchangeable Image File Format) information from JPEG images
A list of all words present on the disk, for use in password cracking
Interviewees also provided a number of operational requirements:
Run on Windows, Linux and Macintosh-based systems
Operate on raw disk images, split-raw volumes, EnCase E01 evidence containers, and AFF evidence
containers
Perform batch analysis with no user input
Allow users to provide additional regular expressions for searches
Automatically extract features from compressed data
Run as fast as the physical drive or storage system could deliver data
Identify the specific files in the file system that are the source of the extracted features
Produce output as an easy-to-use text file
Never crash
The interviews revealed that the primary need for such a tool was triage: to prioritize which pieces of digital
evidence should be analyzed first, and to identify specific email addresses for follow-up investigation. Final
analysis, however, would typically be performed with an approved tool.
Forensic scanners, feature extractors and optimistic decompression
The bulk_extractor employs multiple scanners that run sequentially on raw digital evidence. These scanners
are provided with a buffer to analyze (initially corresponding to a 16 MiB page of data read from the disk
image), the location or path of the buffer's first byte, and a mechanism for recording extracted features.
Special logic is used to handle features that span across buffer borders (§ 4.3). All buffers are processed by all
scanners until there are no more buffers to analyze. At this point the program performs post-processing and
finally exits.
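The page loop described above might look roughly like the following sketch. The two toy scanners, the tiny page size, and the function names are all assumptions made for illustration. The special handling of features that cross a page border is deliberately left out, so the sample data keeps every feature inside one page.

```python
import io
import re

PAGE_SIZE = 16  # tiny for illustration; the text describes 16 MiB pages


def scan_digits(buf, base, record):
    """Toy scanner: record runs of four or more ASCII digits."""
    for m in re.finditer(rb"[0-9]{4,}", buf):
        record("digits", base + m.start(), m.group())


def scan_upper(buf, base, record):
    """Toy scanner: record runs of three or more uppercase letters."""
    for m in re.finditer(rb"[A-Z]{3,}", buf):
        record("upper", base + m.start(), m.group())


def process(stream, scanners):
    """Give every page to every scanner until the data is exhausted."""
    features = []

    def record(name, offset, feature):
        features.append((name, offset, feature))

    base = 0  # location of the current buffer's first byte
    while True:
        buf = stream.read(PAGE_SIZE)
        if not buf:
            break
        for scanner in scanners:
            scanner(buf, base, record)
        base += len(buf)
    return features


image = io.BytesIO(b"ab 1234 cd EFG  xx 987654 HIJK z")
features = process(image, [scan_digits, scan_upper])
```

Each scanner receives exactly the three things named in the text: a buffer, the location of its first byte, and a callback for recording what it finds.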
There are two types of scanners. Basic scanners are limited to analyzing the buffer and recording what they
find. An example is the email scanner scan_email, which can find email addresses, RFC822 headers, and other
recognizable strings likely to be in email messages.
Recursive scanners, as their name implies, can decode data and pass it back to the buffer processor for re-
analysis. An example is scan_zip, which detects the components of ZIP files,
records the ZIP header as an XML feature, decompresses the data, and passes the decompressed data back to
the buffer processor. Most of bulk_extractor's recursive scanners are optimistic. That is, they scan the entire
buffer for data that can be decompressed or otherwise transformed and, if they find it, they transform it as
appropriate. In addition to decompressing, bulk_extractor uses optimistic transformations for BASE64
decoding, PDF text extraction, and other encodings. Optimistic decoding produces significantly higher recall
rates than approaches that only decode data from specifically recognized file formats.
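As an illustration of optimistic decoding, the sketch below tries gzip decompression at every offset in a buffer where the gzip magic bytes appear, regardless of any file system structure. The function name is hypothetical and the approach is a simplification, not bulk_extractor's own code.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b\x08"  # gzip header: ID1, ID2, deflate method


def optimistic_gunzip(buf):
    """Yield (offset, data) for every offset where gzip decoding succeeds."""
    pos = buf.find(GZIP_MAGIC)
    while pos != -1:
        decoder = zlib.decompressobj(wbits=31)  # 31 selects gzip framing
        try:
            # A truncated stream still yields whatever could be decoded.
            data = decoder.decompress(buf[pos:])
        except zlib.error:
            data = b""  # looked like gzip but was not; move on
        if data:
            yield pos, data
        pos = buf.find(GZIP_MAGIC, pos + 1)


# A gzip stream buried between unrelated bytes, as in a raw disk page.
page = b"\x00" * 10 + gzip.compress(b"user@example.com", mtime=0) + b"\xff" * 10
hits = list(optimistic_gunzip(page))
```

In a recursive scanner, each decoded region would then be handed back to the buffer processor so that all scanners can examine it.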
The speed of the forensic tool is obviously affected by the use of additional scanners; the degree of the
impact depends on the data being analyzed. A disk image that contains no compressed data will be processed
more slowly merely because the tool scans for compressed data; in testing this degradation is not significant.
Significant amounts of compressed data, in contrast, will significantly slow processing, especially if
compressed data is contained within other compressed regions.
Forensic programs that recursively process compressed data must guard against decompression
bombs: files that, when fully decompressed, extend to many terabytes, petabytes, or more (Aera Network
Security, 2009). The bulk_extractor implements three defenses against compression bombs. First, only a
configurable portion of each compressed stream is decompressed. Second, the page processor will not call the
recursive scanners when the depth reaches a configurable limit (by default, five recursions). Finally, the tool
computes the cryptographic hash for each compressed region prior to decompression; regions that have the
same hash are only decompressed once.
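The three defenses can be sketched together as follows. The output cap, the choice of SHA-1 as the hash, and the function name are assumptions for this example; only the default depth of five comes from the text.

```python
import hashlib
import zlib

MAX_OUTPUT = 1 << 20  # hypothetical cap on bytes decompressed per stream
MAX_DEPTH = 5         # recursion limit; the text gives five as the default

seen_hashes = set()   # hashes of compressed regions already processed


def guarded_inflate(compressed, depth):
    """Decompress a zlib stream only if all three defenses allow it."""
    if depth >= MAX_DEPTH:
        return None  # defense 2: too deeply nested
    digest = hashlib.sha1(compressed).digest()  # hash choice is an assumption
    if digest in seen_hashes:
        return None  # defense 3: identical region already decompressed
    seen_hashes.add(digest)
    decoder = zlib.decompressobj()
    # defense 1: max_length bounds the output no matter how far the
    # stream would expand
    return decoder.decompress(compressed, MAX_OUTPUT)


bomb = zlib.compress(b"\x00" * (8 * MAX_OUTPUT))
first = guarded_inflate(bomb, depth=0)
again = guarded_inflate(bomb, depth=0)
too_deep = guarded_inflate(zlib.compress(b"x"), depth=MAX_DEPTH)
```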
Feature files
Analysts requested that the tool provide output as a simple text file that could be viewed with an editor or
processed with other scripting tools. This request is met by bulk_extractor's feature file format, a
tab-delimited text file containing the offset where each feature was found, the feature itself, and a configurable
number of bytes that precede and follow the feature in the evidence file. Feature files are not sorted but are
loosely ordered. The order is loose because it is determined, in part, by the execution order of the multiple
threads. As a result, running bulk_extractor twice on the same subject media will likely result in feature files
that contain the same lines, but for which the lines appear in a different order. (Sorting the lines during
processing would require either additional memory or stalling one or more threads, both unacceptable
solutions.)
When it is necessary to report multiple values associated with each extracted feature, the second and/or third
fields of the feature file can be replaced with an XML fragment. For example, the JPEG scanner uses a block of
XML to report all of the fields associated with EXIF structures found within embedded JPEGs.
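A short sketch of reading this format, assuming three tab-separated fields per line and '#' as a comment marker (the latter is an assumption). Because line order depends on thread scheduling, the comparison of two runs sorts the lines first. All names and sample lines are made up.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    path: str      # offset where the feature was found
    feature: str   # the extracted feature (or an XML fragment)
    context: str   # bytes surrounding the feature, if present


def parse_feature_line(line):
    """Split one tab-delimited feature line; return None for non-data lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None  # assumed convention: '#' marks comment lines
    fields = line.split("\t", 2)
    if len(fields) < 2:
        return None
    context = fields[2] if len(fields) == 3 else ""
    return Feature(fields[0], fields[1], context)


def same_features(lines_a, lines_b):
    """Compare two runs while ignoring the thread-dependent line order."""
    return sorted(lines_a) == sorted(lines_b)


run_1 = ["512\ta@example.com\t__a@example.com__\n",
         "900\tb@example.com\t: b@example.com>_\n"]
run_2 = list(reversed(run_1))
parsed = parse_feature_line(run_1[0])
```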
Describing the location of an extracted feature can be challenging when using tools that have the ability to
extract information from within compressed objects, because it is also necessary to document how the data
must be decompressed or otherwise decoded.
There are at least five potential sources of compressed data on a hard drive:
1. Many web browsers download data from web servers with gzip compression and persist the compressed
stream directly to the web cache. (The percentage of web servers employing compression increased from
less than 5% to 30% between 2003 and 2010 (Port80 Software, 2010) because compression significantly
increases web performance (Pierzchala, 2006; Srinivasan, 2003).)
2. NTFS file compression may result in disk sectors that contain compressed data. The most commonly
compressed files are Windows restore points, as the operating system compresses these automatically.
However, users may choose to have any file compressed.
3. Windows hibernation files frequently contain forensically important information. Complicating access to this
file is Microsoft's use of a proprietary compression algorithm called Xpress (Suiche, 2008) and the fact that
Windows overwrites the beginning of the hibernation file when the operating system resumes from
hibernation. Also, the hibernation file's location on the hard drive moves as
a result of NTFS defragmentation operations (Beverly et al., 2011); thus, any software that hopes to recover
features from hibernation files must be able to decompress incomplete hibernation file fragments.
4. Files are increasingly bundled together and distributed as ZIP, RAR, or .tar.gz archives for convenience and to
decrease bandwidth requirements. These files are frequently written to a hard drive. If deleted, one of the
components may be overwritten while the others remain.
5. The .docx and .pptx file formats used by Microsoft Office store content as compressed XML files in ZIP
archives (Garfinkel and Migletz, 2009).
Consider a message containing a set of credit card numbers viewed using a webmail service. If the web
client and server both support HTTP compression, the web page will most likely be downloaded as a
gzip-compressed stream; both Firefox and Internet Explorer will store the compressed stream in the browser
cache. If the computer is suspended, the web browser memory may be compressed with Xpress and stored in
the hibernation file. It is not enough simply to report where the credit card numbers are found on the subject's
disk, because looking at the disk sectors with a hex editor will not show human-readable strings: it is also
necessary to explain how the data must be transformed to make them intelligible.
To recap: today's practice for describing the location of a feature extracted from a disk image is to report
the sector number or offset where the evidence is found. The evidence can then be examined with a tool such
as a hex editor to verify the existence of the feature. This approach simply does not work when the feature
resides within compressed data: examining the sector with a hex editor merely shows binary data. The
forensic path, introduced here, provides a clear, concise and unambiguous way to describe both the location
of the extracted features and the specific decoding operations that need to be executed in order to recover
the data.
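To make the idea concrete, the sketch below assumes a path notation of the form offset-DECODER-offset, for example "10-GZIP-5": seek to byte 10 of the image, apply gzip decoding, then look 5 bytes into the decoded data. The notation, the decoder table, and the function name are assumptions for illustration; the sample image is fabricated.

```python
import gzip
import zlib

# Hypothetical decoder table; only gzip is sketched here.
DECODERS = {
    "GZIP": lambda data: zlib.decompressobj(wbits=31).decompress(data),
}


def resolve(image, path, length):
    """Follow a path such as '10-GZIP-5' and return `length` bytes found there."""
    data = image
    for step in path.split("-"):
        if step.isdigit():
            data = data[int(step):]      # numeric step: seek to an offset
        else:
            data = DECODERS[step](data)  # named step: apply a decoding
    return data[:length]


image = b"\x00" * 10 + gzip.compress(b"card 4111111111111111 end") + b"\x00" * 6
recovered = resolve(image, "10-GZIP-5", 16)
```

A second examiner can replay the same steps with independent tools, which a bare sector number would not allow when the feature lies inside compressed data.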
Histogram processing
Frequency distribution histograms can be of significant use in forensic investigations (Garfinkel, 2006). For
example, a frequency histogram of email addresses found on a hard drive readily identifies the drive's primary
user and that person's primary contacts.
Figure: Diagram showing an overview of the bulk_extractor architecture. Thread 0 reads data from a physical disk,
disk image, or individual files and puts data into buffer structures called sbufs. These buffers are processed
sequentially by the scanners operating in the thread pool. Some of the scanners (e.g. zip, pdf and hiberfile)
are recursive; they create new sbufs, which are in turn processed by all of the scanners. Features that are
extracted are stored in feature files which are, in turn, processed by the histogram processor into histogram
files. A graphical user interface (GUI), not described in this article, allows the resultant feature files to be
browsed at the conclusion of the processing.

Figure: Two excerpts from a feature file generated by processing the disk image ubnist1.gen3.E01 (Garfinkel et al.,
2009). The first column is the forensic path within the evidence file; the second column is the extracted email
address; the third column is the email address in context (unprintable characters are represented as
underbars). These email addresses are extracted from executables found within the Linux operating system
and as a result do not constitute private information or human subject data.

Histogram generation is integrated with the feature recording system so that histograms can be created for
any feature or feature substring at the conclusion of media processing. For example, the regular expression
below extracts search terms provided to Google, Yahoo, and other popular search engines; the keyword
substring is extracted with the regular expression parenthesis operator:
search.*[?&/;fF][pq]=([^&/]+)    (1)
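A sketch of how such a pattern could drive a search-term histogram in Python. The pattern is written here on the assumption that the captured group is the text following "q=" or "p=" up to the next "&" or "/"; the URLs are invented, and URL decoding is added so that "bulk+extractor" and "bulk%20extractor" count as the same term.

```python
import re
from collections import Counter
from urllib.parse import unquote_plus

# Assumed form of the pattern: the group captures the text after "q=" or "p=".
SEARCH_RE = re.compile(r"search.*[?&/;fF][pq]=([^&/]+)")


def search_terms(urls):
    """Return a histogram of decoded search terms found in the URLs."""
    counts = Counter()
    for url in urls:
        match = SEARCH_RE.search(url)
        if match:
            counts[unquote_plus(match.group(1))] += 1
    return counts.most_common()


urls = [
    "http://www.google.com/search?hl=en&q=bulk+extractor",
    "http://search.yahoo.com/search?p=disk%20forensics",
    "http://www.google.com/search?q=bulk+extractor&start=10",
    "http://example.com/index.html",
]
terms = search_terms(urls)
```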
Experience has shown that a histogram of the extracted search terms dramatically improves their usefulness
to the investigator, since items of import tend to be present multiple times on the subject media.
Histograms of search terms are particularly useful when conducting an investigation, as they can reveal the
intent of the computer's user (§ 6.1). Individuals frequently engage in repeated searches for items of interest.
However, the tool explicitly does not suppress low-density information, since it may be quite valuable in some
cases. (§ 3.7 discusses approaches for weighting features.)
For example, at the 2008 murder trial of Neil Entwistle, prosecutors introduced evidence that Entwistle had
performed Internet searches for murder techniques just three days before his wife and child were found dead
(Associated Press). The ability of bulk_extractor to identify, extract and make histograms of search terms has
been used in court with some success (§ 6.1).
Once the search terms are extracted, bulk_extractor creates a histogram of the extracted terms.
Critical to using bulk_extractor's reports in court is the fact that the feature file clearly identifies the physical
location on the media from which the search terms were recovered; this location allows the evidence to be
rapidly located and re-analyzed using other tools.
Context-sensitive stop lists
Many of the email addresses, phone numbers and other identifiers on a hard drive are distributed as part of
the operating system or application programs. For example, previous work identified the email address
mazrob@panix.com as being part of the Windows 95 Utopia Sound Scheme (Garfinkel, 2006). One way to
suppress these common features is to weigh each feature by its inverse corpus frequency, a novel application
of the well-known TF-IDF approach used in information retrieval (Jones, 1972).
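One way to read the weighting idea is the usual TF-IDF form: multiply a feature's count on the subject drive by the logarithm of the corpus size divided by the number of corpus drives that contain it. The sketch below follows that reading; the exact formula, the function name, and all numbers are assumptions for illustration.

```python
import math


def weighted_features(drive_counts, corpus_doc_freq, corpus_size):
    """Weight each feature count by log(corpus_size / drives containing it)."""
    weights = {}
    for feature, count in drive_counts.items():
        # A feature never seen in the corpus is treated as seen on one drive.
        seen_on = max(1, corpus_doc_freq.get(feature, 0))
        weights[feature] = count * math.log(corpus_size / seen_on)
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)


# Made-up numbers: the address shipped with the OS appears on every drive.
drive = {"mazrob@panix.com": 40, "suspect@example.com": 12}
corpus = {"mazrob@panix.com": 1000, "suspect@example.com": 1}
ranking = weighted_features(drive, corpus, corpus_size=1000)
```

An address present on every drive in the corpus receives a weight of zero however often it occurs, while a rare address rises to the top of the list.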
For reasons not anticipated in Garfinkel (2006), it is not possible for many organizations to create a single
list of all
Bulk Extractor Scanners: Where They Output and What They Do
Scanner name | Output feature file | Recovery description / archival usefulness
bulk | bulk.txt | Performs a bulk data scan.
wordlist | wordlist.txt, wordlist_*.txt | A list of all words extracted from the disk, useful for password cracking or to discover if an author ever used a specific term (including in deleted/hidden files). Note that the words this scanner can access depend on which other scanners are on; to include words in .zip files, for example, you'd need to have the "zip" scanner enabled.
xor | - | XOR is a technique for obfuscating data, often used to conceal sensitive data and code within malicious files and programs [4]; this scanner searches for data hidden by XOR.
accounts | telephone.txt, ccn.txt, ccn_track2.txt, pii.txt | Credit card numbers, credit card track 2 information (the magnetic stripe data track read by ATMs and credit card checkers [5]), phone numbers, and other formatted numbers. Useful for tracking how a device's user(s) conducted business.
aes | aes_keys.txt | AES key schedules in memory (AES key schedules expand a short key into a number of separate round keys [6]).
base16 | hex.txt | BASE16 coding, aka hexadecimal or hex code (includes MD5 codes embedded in the data). The primary use of hexadecimal notation is a human-friendly representation of binary-coded values in computing and digital electronics. Hexadecimal is also commonly used to represent computer memory addresses [7].
base64 | - | BASE64 coding. Base64 is a group of similar binary-to-text encoding schemes that represent binary data in an ASCII string format by translating it into a radix-64 representation [8].
elf | elf.txt | Linux Executable and Linkable Format (formerly called Extensible Linking Format). A common standard file format for executables, object code, shared libraries, and core dumps [9].
email | email.txt, rfc822.txt, domain.txt, ether.txt, url.txt | Discovers RFC822 email headers, HTTP cookies, hostnames, IP addresses, email addresses, and URLs. Useful for recreating email correspondence on a device.
exif | exif.txt, gps.txt, jpeg_carved.txt | EXIF, or Exchangeable Image File Format, is a standard that specifies the formats for images, sound, and ancillary tags used by digital cameras (including smartphones), scanners and other systems handling image and sound files recorded by digital cameras; it includes .JPG, .TIFF, and .WAV [10]. This scanner finds EXIF structures in JPEGs and video segments (and carves JPEG files); the feature file contains all of the EXIF fields, expanded as XML records.
find | find.txt | Returns the results of specific regular expression search requests. A regular expression is a way of searching for patterns in strings of characters; RegexOne.com offers a good basic tutorial on writing regular expressions to create extremely specific searches.
GPS | gps.txt | Garmin-formatted XML containing GPS (global positioning system, i.e. location mapping) coordinates.
gzip | - | Files compressed with the gzip algorithm (such as browser cache entries and HTTP streams) and ZLIB-compressed gzip streams.
hiber | - | Windows hibernation file fragments. Windows "hibernate mode" saves a copy of everything in your PC's memory (RAM) onto your hard disk before it shuts down [11].
json | json.txt | JavaScript Object Notation (JSON, a text-based open standard designed for human-readable data interchange [12]) objects downloaded from web servers, as well as JSON-like objects found in source code.
kml | kml.txt | KML files (carved). KML is Keyhole Markup Language, an XML notation for expressing geographic annotation and visualization within Internet-based, two-dimensional maps and three-dimensional Earth browsers [13].
net | ip.txt, ether.txt | Finds IP and TCP packets (types of network packets, formatted units of data carried by a packet-switched network [14]) in virtual memory, and creates libpcap files (the libpcap file format is the main capture file format used in TcpDump/WinDump, Wireshark/TShark, snort, and many other networking tools [15]).
pdf | - | Text from PDF files.
rar | rar.txt, unrar_carved.txt | RAR components in unencrypted archives are decompressed and processed. Encrypted RAR files are carved. RAR is a proprietary archive file format that supports data compression, error recovery and file spanning [16].
vCard | vcard.txt | vCard recovery. vCard is a file format standard for electronic business cards [17].
windirs | windirs.txt | Windows FAT32 and NTFS directory entries.
winpe | winpe.txt | Windows Portable Executable (PE) files (.exe and .dll files, annotated with an MD5 hash of the first 4k).
winprefetch | winprefetch.txt | Windows prefetch files and file fragments. Each time you turn on your computer, Windows keeps track of the way your computer starts and which programs you commonly open; Windows saves this information as a number of small files in the prefetch folder [19].
zip | zip.txt, unzip_carved.txt | A file containing information regarding every ZIP file component found on the media. This is exceptionally useful as ZIP files contain internal structure and ZIP is increasingly the compound file format of choice for a variety of products such as Microsoft Office. Also finds zlib-compressed regions.