Malware Analysis
Additional information:
• Make sure you have a computer with sufficient horsepower to run the Project's VM. The minimum RAM
required for this project is 4 GB for the VM and 8 GB on your host.
Setup (0 points)
Please note that the file is over 13GB, so it will take some time to download. Do not wait until the
last minute to download it. Do it right away!
2. Students need an x86 (Intel) machine to properly run the project virtual machine (please see the Ed
Discussion post regarding VM Troubleshooting if you have any issues with your
virtual machine).
3. Import the OVA into VirtualBox. Note that this may be easiest to do by double-clicking the OVA file and
letting the file association open VirtualBox.
o Password: debian
o unzip Desktop/malware_analysis_reports.zip -d
Desktop/malware_reports
6. The five malware files will be located in the “malware_reports” directory on the Desktop.
Hint: Look at the API/system call sequence under each process generated by the malware sample and
determine what the malware is doing. Note that each JoeSandbox report may contain multiple processes with
many different system call sequences. If any of the behaviors are seen (or attempted, but not necessarily
successful) in any process in the report, then that malware has attempted that behavior. This is, of course, not
completely practical, as legitimate applications may perform the same actions in a benign fashion. We are not
concerned with differentiating the two in this assignment, but it is some food for thought.
Clarification for attempted: We mean by “attempted” that a specific action was attempted but failed. By
“specific” we mean that it is clear which action is attempted. If you have a registry key, for instance, that is
unambiguous (like, say, it is used only to set a startup option), but it fails to change the key, that is an attempt
for our purposes. But if you have a more generic registry key that governs multiple settings, we don’t know for
sure which key or keys it is attacking and so the action would not count as an “attempt”.
You will find that the same API function name can end with either a W or an A. This is standard practice in
the Windows API, and this document explains the difference (either one could in theory be present in the wild):
https://docs.microsoft.com/en-us/windows/desktop/intl/unicode-in-the-windows-api
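When comparing call sequences across reports, it can help to treat the A and W variants of a function as the same call. The following is a minimal sketch, not part of the assignment tooling. The helper name base_api_name and its rule (strip a trailing A or W only when the preceding character is lowercase) are assumptions made for illustration, and the heuristic may misfire on unusual names.

```python
def base_api_name(name: str) -> str:
    """Strip a trailing ANSI ("A") or wide ("W") suffix from an API name.

    Heuristic (an assumption, not an official rule): only strip when the
    character before the suffix is lowercase, as in CreateFileW.
    """
    if len(name) > 2 and name[-1] in "AW" and name[-2].islower():
        return name[:-1]
    return name


# Hypothetical call names, used only to show the effect of the helper.
calls = ["CreateFileW", "CreateFileA", "RegSetValueExW", "NtClose"]
print(sorted({base_api_name(c) for c in calls}))
# prints ['CreateFile', 'NtClose', 'RegSetValueEx']
```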
For each of the following questions, mark which of the malware exhibit the identified behavior:
I. Keylogger attempt
R. Uses loops or otherwise needless repetitions of commands, such as Pings, used to delay malware
execution and potentially exceed time thresholds of automated analysis environments.
S. Attempts to override the domain name system (DNS) for a domain on a specific machine.
DELIVERABLE: Your deliverable for this part of the assignment will be your final JSON file with your answers
to the 20 questions.
{
"behavior01": "",
"behavior02": "",
"behavior03": "",
"behavior04": "",
"behavior05": "",
"behavior06": "",
"behavior07": "",
"behavior08": "",
"behavior09": "",
"behavior10": "",
"behavior11": "",
"behavior12": "",
"behavior13": "",
"behavior14": "",
"behavior15": "",
"behavior16": "",
"behavior17": "",
"behavior18": "",
"behavior19": "",
"behavior20": ""
}
{
"behavior01": "2,4",
"behavior02": "3",
"behavior03": "1,2,3,4,5",
"behavior04": "4",
.
.
.
}
The naming of the submission file is not important, as long as it is JSON (“submission.json” is an example). You
will have 20 attempts to submit your answers. If you attempt to make more submissions than the limit, your grade
will be a ZERO for this Phase. You will be able to choose your best submission of the 20 manually in Gradescope.
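Because submissions are limited, it may be worth generating the JSON programmatically rather than editing it by hand. The sketch below builds the 20-key layout shown above from a dictionary of answers. The function name build_submission, the assumption that samples are numbered 1 to 5, and the choice to leave unanswered behaviors as empty strings are illustrative only and are not requirements stated by the assignment.

```python
import json

NUM_BEHAVIORS = 20               # behavior01 .. behavior20, per the template above
VALID_SAMPLES = {1, 2, 3, 4, 5}  # assumption: the five samples are numbered 1-5


def build_submission(answers):
    """Map {behavior_number: iterable of sample numbers} to the JSON layout."""
    submission = {}
    for i in range(1, NUM_BEHAVIORS + 1):
        samples = sorted(set(answers.get(i, ())))
        unknown = set(samples) - VALID_SAMPLES
        if unknown:
            raise ValueError(f"behavior{i:02d}: unknown samples {sorted(unknown)}")
        submission[f"behavior{i:02d}"] = ",".join(str(s) for s in samples)
    return submission


# Answers copied from the example above; they are not real solutions.
example = build_submission({1: [2, 4], 2: [3], 3: [1, 2, 3, 4, 5], 4: [4]})
print(json.dumps(example, indent=2))
```

Writing the result with json.dump guarantees well-formed JSON, which removes one way of wasting an attempt.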
Background
In this phase you will learn how to apply Machine Learning concepts to malware classification. You’ll be given
a dataset of malware samples. Using Malheur, the software used for clustering malware in this project, you’ll
run an unsupervised learning clustering algorithm in order to classify them by behavior.
2. Pay special attention to the malheur configuration parameters. You are given a sample malheur configuration
file as a starting point, and you will be asked to work with these parameters (not necessarily all of
them) during this assignment. Some of these parameters should remain unchanged, however, because
changing them would undermine the assignment’s objectives. Unless you know exactly what
you are doing, do not change the values of:
Parameter               Value
generic.input_format    "text"
generic.event_delim     ";"
generic.state_dir       "./malheur_state"
classify.max_dist       1.00
All other values can be changed in the configuration file. Refer to the malheur manual for specifics
on each configuration parameter.
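Since several experiments may involve editing the configuration, a quick check that the four fixed values are still intact can prevent accidental breakage. The following is a rough sketch only. It assumes the configuration file groups settings as "section = { key = value; };" with one setting per line, which should be verified against the sample file you were given. The names PROTECTED, parse_settings and changed_protected are hypothetical.

```python
import re

# Values from the table above that should stay unchanged.
PROTECTED = {
    "generic.input_format": "text",
    "generic.event_delim": ";",
    "generic.state_dir": "./malheur_state",
    "classify.max_dist": 1.0,
}
SECTION = re.compile(r'^\s*(\w+)\s*=\s*\{')
SETTING = re.compile(r'^\s*(\w+)\s*=\s*("[^"]*"|[^;\s]+)\s*;')


def parse_settings(text):
    """Collect 'section.key' -> raw value (assumed one setting per line)."""
    settings, section = {}, None
    for line in text.splitlines():
        opened = SECTION.match(line)
        if opened:
            section = opened.group(1)
            continue
        setting = SETTING.match(line)
        if setting and section:
            settings[f"{section}.{setting.group(1)}"] = setting.group(2).strip('"')
        elif "}" in line:
            section = None
    return settings


def changed_protected(text):
    """Return the protected keys whose value is missing or different."""
    settings = parse_settings(text)
    changed = []
    for key, expected in PROTECTED.items():
        actual = settings.get(key)
        if isinstance(expected, float):
            try:
                same = actual is not None and float(actual) == expected
            except ValueError:
                same = False
        else:
            same = actual == expected
        if not same:
            changed.append(key)
    return changed


# Hypothetical config text in the assumed layout.
sample = '''
generic = {
    input_format = "text";
    event_delim = ";";
    state_dir = "./malheur_state";
};
classify = {
    max_dist = 1.00;
};
'''
print(changed_protected(sample))  # prints []
```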
“This parameter specifies the length of n-grams. If the events in the reports are not sequential, this
parameter should be set to 1. In all other cases, it determines the length of event sequences to be mapped
to the vector space, so called n-grams.”
Malheur manual
The malware behavior is encoded by listing all API calls in sequential order (see
Understanding the dataset below). However, if a parameter value other than the default
provided to you gives better performance, you may use that value instead.
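To make the idea of n-grams concrete, the sketch below counts the n-grams in one semicolon-delimited event list. It only illustrates what an n-gram of events is. How malheur actually maps n-grams into its vector space is not shown here, and the function name event_ngrams and the sample report are made up for the example.

```python
from collections import Counter


def event_ngrams(report: str, n: int, delim: str = ";"):
    """Return a Counter of n-grams (tuples of n consecutive events)."""
    events = [e for e in report.split(delim) if e]
    return Counter(tuple(events[i:i + n]) for i in range(len(events) - n + 1))


# Hypothetical report with five events.
report = "NtOpenFile;NtReadFile;NtClose;NtOpenFile;NtReadFile"

print(event_ngrams(report, 1))  # single events: order is ignored
print(event_ngrams(report, 2))  # pairs: ("NtOpenFile", "NtReadFile") appears twice
```

With n = 1 only the frequency of each call matters. Larger values of n capture short call sequences, at the cost of many more distinct features.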
2. There are many kinds of clustering used in data analysis. Malheur, the software used for clustering
malware in this project, uses a hierarchical clustering technique. Hierarchical clustering begins by setting
each entry in the dataset to be its own cluster. Then, it combines the closest two clusters into one new
cluster. It repeats this until either a target number of clusters is reached, or the distance between the
closest two clusters exceeds a target amount.
3. To further help you understand hierarchical clustering, we provide an example below. This example uses 2-
d data to help you visualize what clustering is actually doing. The process of clustering malware samples is
similar, but the data has many more features.
4. Let’s say we’re given the following data-set. It consists of x-y coordinate data that, when plotted, looks
like:
We do this until the closest two clusters are fairly far away from one another. The exact minimum
distance between the final clusters can usually be set as a parameter to the clustering algorithm. This gives
us the clusters we were expecting:
5. This is a general example of hierarchical clustering. In order to cluster malware, malheur extends this
method, as you will see in the next section. In order to complete this assignment, you will have to develop
a deeper understanding of what Malheur is and how it performs this clustering. We encourage you to
further research Machine Learning and clustering methods, beyond this section, in order to better
understand the next section.
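The merging procedure described in this section can be sketched in a few lines for 2-d points. This is a generic illustration and not malheur's implementation. The use of complete linkage (the distance between two clusters is the distance between their two farthest members), the function names and the sample points are all assumptions chosen for the example.

```python
import math


def cluster_distance(a, b):
    """Complete linkage: distance between the two farthest members."""
    return max(math.dist(p, q) for p in a for q in b)


def agglomerate(points, max_dist):
    """Merge the two closest clusters until none are within max_dist."""
    clusters = [[p] for p in points]          # every point starts alone
    while len(clusters) > 1:
        pairs = [(cluster_distance(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        dist, i, j = min(pairs)
        if dist > max_dist:                   # closest pair is too far apart
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Two visually obvious groups of hypothetical x-y points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(agglomerate(points, max_dist=3.0))
# prints [[(1, 1), (1, 2), (2, 1)], [(8, 8), (8, 9), (9, 8)]]
```

Lowering max_dist leaves more, smaller clusters, while raising it eventually merges everything into one. The same trade-off applies when tuning the distance parameters for malware reports.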
2. Our dataset is based on information extracted from Cuckoo malware behavior reports.
3. Cuckoo malware behavior reports have a section called ‘behavior’. In this section of the report, you can
find all API Calls that were part of the malware behavior during Cuckoo’s analysis.
4. The way we chose to encode malware behavior was by simply listing all API Calls, in the order they
happened, as indicated by the report.
5. More specifically, we build a list of API call names by extracting the contents of Cuckoo JSON report
fields such as ‘behavior->process->[X]->calls->[Y]->api’, where X is the process index number and Y is
the process API call index number.
6. In the end, every feature file will be a semicolon separated list of strings, where each string is an API call
name, and those API call names are displayed in the order they have actually been called by the malware
during Cuckoo analysis.
7. Each sample in our dataset is labeled to indicate which malware family it belongs to. To gather labels
and assign them to our samples, we used a malware labeling tool called AVClass
(https://github.com/malicialab/avclass) along with data from the Virus Total website
(https://www.virustotal.com/). Because of the way malheur recognizes labels, we have placed them in each
dataset sample’s file name as the file extension, while the main part of the file name is the SHA256 hash
code for the original malware binary. For example, take the feature file named as:
0bc19b9304d5c409b9f480a9121c8c8abcef2f3a595ed6b2758daeb2d679b74a.dinwod
(Since the file extension for this malware sample is ‘dinwod’, it belongs to malware family Dinwod.)
8. You can use the hash code that forms the file name to research extra information about each individual
dataset sample. Several malware research websites provide complete malware reports indexed by
hash code. Since we have used data from Virus Total for our labels, it’s probably a good place to start.
You can use the URL https://www.virustotal.com/gui/file/<HASH_CODE> to directly open a malware
detection report in Virus Total for the hash code <HASH_CODE>.
o For example, to gather extra information about a dataset sample with the following file
name:
0bc19b9304d5c409b9f480a9121c8c8abcef2f3a595ed6b2758daeb2d679b74a.dinwod
o All you have to do is remove the label (dinwod) from the file name and use the hash to
craft and visit the following URL:
https://www.virustotal.com/gui/file/0bc19b9304d5c409b9f480a9121c8c8abcef2f3a595ed6b2758daeb2d679b74a
9. All dataset files are located in the “dataset” folder and they are distributed in two groups: “training” and
“testing”.
o The “training” folder contains files with features extracted from various malware samples
based on the strategy we discussed previously. You will use those files to “train” your
machine learning model using clustering techniques.
o The “testing” folder is much smaller and contains the files you will use to test the model
you will eventually create.
o We will discuss the training and testing phases of your assignment in more detail later.
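The encoding and naming conventions described in this section can be summarized in code. The sketch below is illustrative only and is not the script used to build the dataset. The key names follow the report path quoted above and may need adjusting for a real report, and the tiny report and the function names are made up for the example.

```python
VT_URL = "https://www.virustotal.com/gui/file/{}"


def encode_calls(report: dict, delim: str = ";") -> str:
    """List every API call name, in order, across all processes."""
    names = []
    # Key names follow the path quoted above; adjust if your report differs.
    for process in report["behavior"]["process"]:
        for call in process["calls"]:
            names.append(call["api"])
    return delim.join(names)


def split_sample_name(file_name: str):
    """Split '<sha256>.<label>' into (hash, label)."""
    sha256, label = file_name.rsplit(".", 1)
    return sha256, label


def virustotal_url(file_name: str) -> str:
    """Build the Virus Total report URL from a dataset file name."""
    return VT_URL.format(split_sample_name(file_name)[0])


# Hypothetical, heavily truncated report structure.
tiny_report = {"behavior": {"process": [
    {"calls": [{"api": "NtOpenFile"}, {"api": "NtClose"}]},
    {"calls": [{"api": "RegOpenKeyExW"}]},
]}}
print(encode_calls(tiny_report))  # prints NtOpenFile;NtClose;RegOpenKeyExW

name = "0bc19b9304d5c409b9f480a9121c8c8abcef2f3a595ed6b2758daeb2d679b74a.dinwod"
print(split_sample_name(name)[1])  # prints dinwod
print(virustotal_url(name))
```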
Setup (0 points)
We have already provided the malheur binaries, datasets, and example configuration files. Once
you are inside the Project 2 VM, just enter the avml directory on the VM desktop using a terminal:
$ cd /home/debian/Desktop/avml/
1. This command tells malheur to analyze the dataset contained in dataset/training/ (11,000+ samples) in
order to cluster all its samples into groups of similar malware. Since malheur is using the dataset to learn
how to group malware families, we are using a special subset called the training dataset.
2. In this phase, malheur will output quality assessment information. More specifically, it will show you three
main measurements: precision, recall, and f-score. Take this opportunity to research them a little.
These measurements all describe how good our clustering has been. Malheur computes them by
looking at each sample’s label. If you take a look at the contents of the directory ‘dataset/training/’, you will
see lots of files with different file extensions. Those file extensions are the labels for each
sample, and malheur extracts them to see whether each cluster actually groups similar samples (similar labels)
and how many samples that should have been part of the group were left out. This analysis is eventually
expressed numerically as precision, recall, and f-score.
3. This command tells malheur to use the clusters it has built in the training phase (using samples in
dataset/training/) to classify all samples inside dataset/testing. Our objective is to decide whether the model
generated in the training phase is good or bad. That is one of the reasons all samples in the testing dataset
used to be part of the training dataset but were randomly selected and set aside. They are selected this way
so that testing data represents reality while minimizing bias towards the training data. Think of it as if it were
a student preparing for an exam: the end goal is succeeding in life, but the exam will give us a rough idea of
whether that will actually happen. In the same way, malheur “studies” (analyzes) and “learns” (builds a model)
from the training dataset (training phase) so it can succeed in real life (classify malware samples in the wild).
Nevertheless, malheur will have to pass the exam (testing phase) to make sure it’s ready. Would you put
the exact same questions from the study material in the exam? I hope not! Exam questions should be unique
and, at the same time, representative of the real-life skills students are supposed to learn. Welcome to the
testing phase!
4. In this phase, malheur will also output precision, recall, and f-score. You should think of those
measurements as malheur’s final grade. During this phase, malheur isn’t being tested against the same
exact data used to train it. It now uses a different dataset where all testing instances were statically chosen
to adequately represent all training data, even though it is not the same data used in the training phase.
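To build intuition for the three measurements, the sketch below computes precision, recall and f-score for a clustering from known labels. It uses one common definition for clusterings: precision rewards clusters dominated by a single label, and recall rewards labels concentrated in a single cluster. Whether this matches malheur's internal computation exactly is an assumption, and the function name, the label "other" and the sample assignment are made up for the example.

```python
from collections import Counter, defaultdict


def clustering_scores(labels, clusters):
    """Precision, recall and f-score of a clustering against known labels.

    labels[i] is the family of sample i, clusters[i] the cluster it landed in.
    """
    n = len(labels)
    by_cluster = defaultdict(Counter)   # cluster -> label counts
    by_label = defaultdict(Counter)     # label -> cluster counts
    for label, cluster in zip(labels, clusters):
        by_cluster[cluster][label] += 1
        by_label[label][cluster] += 1
    # Precision: share of samples that carry their cluster's dominant label.
    precision = sum(max(c.values()) for c in by_cluster.values()) / n
    # Recall: share of samples that sit in their label's dominant cluster.
    recall = sum(max(c.values()) for c in by_label.values()) / n
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical labels and cluster assignments for six samples.
labels = ["dinwod", "dinwod", "dinwod", "other", "other", "other"]
clusters = [0, 0, 1, 1, 2, 2]
p, r, f = clustering_scores(labels, clusters)
print(f"precision={p:.3f} recall={r:.3f} f-score={f:.3f}")
# prints precision=0.833 recall=0.667 f-score=0.741
```

Many small clusters tend to push precision up and recall down, while a few large clusters do the opposite. The f-score balances the two, which is why it is used as the target for this task.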
2. After you have reached the 70% f-score goal, use the same model you have trained malheur with to classify
each of your original Project 2 malware samples. For that, we have provided you with feature files for all
Project 2 malware samples, containing features extracted from their Cuckoo JSON reports. Those feature files are
located inside the directory “subjects” in the main assignment path (/home/debian/Desktop/avml/subjects).
3. In order to use your newly created malheur model to classify the given malware samples, make sure you
are inside the assignment main directory, and run the following command:
o Your goal is to have all Project 2 samples properly classified (none should be labeled
‘rejected’) using your model with classify.max_dist set to 1 (as mentioned above, this is
required).
o If you can’t achieve those classification results for Project 2 malware samples, try
revisiting your training and testing phases with different parameters until you can comply
with everything.
o Also, you can take a look at Virus Total reports for the prototypes that malheur has
assigned to your subjects. Seeing the full malware behavior might give you a better idea
about the direction you should take.
4. By achieving all of the task’s goals, you will have built a tool that is able to classify unknown executable
binary samples (no prior knowledge needed) into potential malware families. It is your very own AV software!
5. TASK GOALS: You have two main goals for this task:
o GOAL 1: the first goal is to achieve 70% f-score during the testing phase only (you should
NOT consider f-score results for other phases for the first goal, just the testing phase).
o GOAL 2: the second goal is to classify all Project 2 malware samples with maximum
distance of 1 (none should be labeled ‘rejected’) as indicated above.
Goal 1 is a prerequisite for Goal 2. Therefore, any results for Goal 2 are only considered valid if your
6. DELIVERABLE: Your deliverable for this part of the assignment will be your final malheur
configuration file (config.mlw), the one you have used to achieve all goals (70% f-score during the testing
phase and Project 2 malware sample classification with a maximum distance of 1). The file needs to be
named exactly “config.mlw” for credit. There should be only one configuration file for both objectives.