Algorithmic Approaches to Match Degraded
Land Impressions
Eric Hare, Heike Hofmann, Alicia Carriquiry
10/05/2017
Abstract
Bullet matching is a process used to determine whether two bullets may have been fired
from the same gun barrel. Historically, this has been a manual process performed by trained
forensic examiners. Recent work however has shown that it is possible to add statistical
validity and objectivity to the procedure. In this paper, we build upon the algorithms
explored in Automatic Matching of Bullet Lands (Hare, Hofmann, and Carriquiry 2016) by
formalizing and defining a set of features, computed on pairs of bullet lands, which can be
used in machine learning models to assess the probability of a match. We then use these
features to perform an analysis of the two Hamby (Hamby, Brundage, and Thorpe 2009)
bullet sets (Set 252 and Set 44), to assess the presence of microscope operator effects in
scanning. We also take some first steps to address the issue of degraded bullet lands, and
provide a range of degradation at which the matching algorithm still performs well. Finally,
we discuss generalizing land to land comparisons to full bullet comparisons as would be used
for this procedure in a criminal justice situation.
1 Background
Intense scrutiny has been focused on the process of bullet matching in recent years (e.g.,
Giannelli 2011). Bullet matching, the process of determining whether two bullets could have
been fired from the same gun barrel, has traditionally been performed without meaningful
determination of error rates or statistical assessments of uncertainty (National Research
Council 2009). There have been some attempts towards developing mathematical and
statistical approaches to bullet matching. One such attempt was the definition of CMS, the
Consecutively Matching Striae (Biasotti 1959), with a cutoff of six to separate matches from
non-matches. Still, rigorous assessments of the applicability of such cutoffs have not to this
point been described (President’s Council of Advisors on Science and Technology 2016).
Recently, several authors have addressed these well-known shortcomings. Focusing on firing
pin impressions and breech faces, Riva and Champod (2014) have described an automated
algorithm using 3D images that enables comparison between pairs of exemplars. Other
examples of work in this and related areas include Petraco and Chan (2012), W. Chu et al.
(2011), T. Vorburger et al. (2011), and others. In our approach to this problem, Automatic
Matching of Bullet Lands, we used the Hamby 252 set (Hamby, Brundage, and Thorpe 2009)
to train and develop a random forest in order to provide a matching probability for two
bullet lands (Hare, Hofmann, and Carriquiry 2016). While the algorithm had a very strong
performance on this set, some limitations were immediately clear. For instance, performance
was assessed only on this single set of 35 bullets fired from a consecutively manufactured set
of only ten known and 15 unknown gun barrels. Each of these bullets was part of a controlled
study, and the full lands were available for matching. While there were some data quality
issues, this was still a near ideal test case for the algorithm.
Real world applications of bullet matching often involve the recovery of fragments of bullets
from the crime scene (National Research Council 2004). Traditional features used in forensic
examination work well for a full land, but there has been less investigation into their
performance in the case of a fragmented land. For example, the CMS is naturally limited by
the portion of the land that can be recovered and varies across manufacturers (W. Chu et al.
2011).
In this paper, we take steps to address these and other concerns. Specifically, we begin by
reviewing features from the literature, computed on pairs of bullet lands, and presenting
some of our own features. We propose an approach to standardize the features, to account
for the fact that only a portion of the land impression may be recovered from the crime
scene. With the standardized features, we tackle two issues that were not addressed in Hare,
Hofmann, and Carriquiry (2016). The first is the effect of the microscope operator on the
resulting images and consequent algorithm performance. The second issue has to do with
the robustness of the land matching algorithm in Hare, Hofmann, and Carriquiry (2016)
relative to the degree of degradation of the questioned land impression. Finally, we describe
some of the initial steps toward generalizing a matching algorithm based on land-to-land
comparisons, to one based on bullet-to-bullet comparisons, as would be of interest in a real
world application of these ideas.
2 Feature Standardization
To start, we introduce a standardized version of each of the features used in the matching
routine proposed by Hare, Hofmann, and Carriquiry (2016). These features are computed on
aligned pairs of bullet land impressions rather than on individual lands. This enables us, for
instance, to compute the number of matching striae between two lands. We generalize the
definitions of these features to account for the possibility that we may be handling degraded
bullet lands, where only fragments can be recovered. The definition of each feature is given
below, where f (t) represents the height values of the first profile at position t along the
signature, and g(t) the height values of the second. An indication of whether the feature is
new since Hare, Hofmann, and Carriquiry (2016) is also given:
• ccf (%) is the maximum value of the cross-correlation function evaluated at the optimal
alignment. The cross-correlation function is defined as C(τ) = ∫_{−∞}^{∞} f(t)g(t + τ) dt,
where τ represents the lag of the second signature (T. Vorburger et al. 2011).
• rough_cor (new) (%) quantifies the correlation between the two signatures after performing
a second LOESS smoothing stage and then subtracting the result from the original signatures.
This attempts to model the roughness of the surface after removing structure such as waviness.
• lag (mm) is the optimal lag for the ccf value.
• D (mm) is the Euclidean vertical distance between the height values of the aligned
signatures, defined as D² = (1/#t) Σ_t [f(t) − g(t)]². This is a measure of the total
variation between two functions (Clarkson and Adams 1933).
• sd_D (mm) provides the standard deviation of the values of D from above.
• signature_length (mm) is the overall length of the smaller of the two aligned signatures.
• overlap (new) (%) provides the percentage of the two signatures that overlap after the
alignment stage.
• matches (per mm) is the number of matching peaks/valleys (striae) per millimeter of the
overlapping portion of the aligned signatures.
• mismatches (per mm) is the number of mismatching peaks/valleys (striae) per millimeter
of the overlapping portion of the aligned signatures.
• cms (per mm) is the number of consecutively matching peaks/valleys (striae) per millimeter
of the overlapping portion of the aligned signatures (Biasotti 1959; Wei Chu et al. 2013).
• non_cms (per mm) is the number of consecutively mismatching peaks/valleys (striae) per
millimeter of the overlapping portion of the aligned signatures.
• sum_peaks (per mm) is the sum of the average heights of matched striae.
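As a concrete illustration, the optimal-alignment step behind the ccf and lag features can be sketched with a discrete version of C(τ): slide one signature past the other and keep the lag that maximizes the correlation of the overlapping samples. This is a minimal sketch, not the authors' implementation; the normalization (Pearson correlation of the overlap) and the function names are illustrative choices.

```python
import numpy as np

def ccf_max(f, g, max_lag):
    """Return the best correlation between two signatures over a range of
    integer lags, together with the lag achieving it. Illustrative sketch:
    the sign convention for the lag and the Pearson normalization are
    assumptions, not the published implementation."""
    best_corr, best_lag = -1.0, 0
    for lag in range(-max_lag, max_lag + 1):
        # Shift one signature relative to the other and keep the overlap.
        if lag < 0:
            a, b = f[:lag], g[-lag:]
        elif lag > 0:
            a, b = f[lag:], g[:-lag]
        else:
            a, b = f, g
        n = min(len(a), len(b))
        a, b = a[:n], b[:n]
        if n < 3 or a.std() == 0 or b.std() == 0:
            continue  # degenerate overlap: skip
        corr = np.corrcoef(a, b)[0, 1]  # Pearson correlation of overlap
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return best_corr, best_lag
```

On two signatures that are shifted copies of one another, the returned lag recovers the shift and the correlation approaches one.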
The features that are expressed on a per millimeter level are intended to support the degraded
land case, as discussed earlier. Note that the computation differs slightly depending on the
feature. For example, to standardize the number of matches, we first count the raw number
of matching striae, and then divide this number by the length of the overlapping region of
the two lands (overlap from above). In most cases, the overlapping region will be very close
to the length of the smaller signature. But depending on the alignment, this may not always
be true. This ensures that we do not punish a particular cross-comparison for having a
smaller region in which matches could occur. On the other hand, the number of mismatches
is divided by the total length of the two aligned signatures, since mismatched striae can occur
even in the non-overlapping region of the two signatures.
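The per-millimeter standardization described above amounts to a simple division, with the denominator depending on the feature. A minimal sketch, assuming the raw counts have already been obtained; the dictionary keys and the treatment of cms (divided by the overlap length, like matches) are illustrative:

```python
def per_mm_features(counts, overlap_mm, total_mm):
    """Standardize raw striae counts to per-millimeter features. As in the
    text: counts that can only occur in the overlapping region (matches,
    cms) are divided by the overlap length, while mismatches are divided
    by the total length of the two aligned signatures."""
    return {
        "matches": counts["matches"] / overlap_mm,
        "cms": counts["cms"] / overlap_mm,
        "mismatches": counts["mismatches"] / total_mm,
    }
```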
The rough_cor or Roughness Correlation is derived by performing a second smoothing step,
and subtracting the result from the original signatures. This creates a new signature which
eliminates some of the overall structure, allowing global deformations to have less of an
influence on the model output. Where the roughness correlation is most useful is in a scenario
like Figure 1. This figure shows the alignment of profile 40977 with 47600. The top panel
shows the smoothed signatures. The middle panel overlays a LOESS fit to the average of the
two signatures. Finally, to derive the roughness correlation, this LOESS is subtracted from
the original signature to create a new set of roughness residuals, which are then given in the
bottom panel. Note that these two profiles do not match, yet the ccf is 0.7724. The roughness
correlation (-0.0324) correctly indicates the lack of matching. The roughness correlation acts
as a check against false positives, which can arise when there are significant deformations in
the overall structure, as is the case with both of these profiles.
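The roughness correlation computation can be sketched as follows. This is an illustrative reconstruction: a moving-average smoother stands in for the second-stage LOESS fit, the trend is estimated from the average of the two signatures (as in the description of Figure 1), and the window size and names are assumptions.

```python
import numpy as np

def rough_cor(sig1, sig2, window=51):
    """Sketch of the roughness correlation: estimate the large-scale
    structure (waviness) from the average of the two aligned signatures,
    subtract it from each, and correlate the remaining roughness
    residuals. A moving average stands in for LOESS here."""
    def smooth(x):
        kernel = np.ones(window) / window
        pad = window // 2
        xp = np.pad(x, pad, mode="edge")  # pad so output keeps its length
        return np.convolve(xp, kernel, mode="valid")
    trend = smooth((sig1 + sig2) / 2.0)   # shared large-scale structure
    r1 = sig1 - trend                     # roughness residuals
    r2 = sig2 - trend
    return np.corrcoef(r1, r2)[0, 1]
```

Two non-matching signatures that share a strong deformation pattern can have a high plain correlation, yet a roughness correlation near zero, which is exactly the false-positive check described above.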
Figure 1: Alignment of profile 40977 with 47600. The top panel shows the smoothed signatures.
The middle panel overlays a LOESS fit to the average of the two signatures. Finally, to derive
the roughness correlation, this LOESS is subtracted from the original signature to create a
new set of roughness residuals, which are then given in the bottom panel. Note that these
two profiles do not match, yet the ccf is 0.7724. The roughness correlation (-0.0324) correctly
indicates the lack of matching.
In a typical comparison between two profiles, such as in Figure 2, the roughness correlation
does not meaningfully impact the matching probability given the presence of the ccf in the
model. In this figure, we see the alignment of profile 8752 with profile 136676. In this case,
the waviness or the deformation pattern in the signatures is less pronounced, and hence the
resulting roughness signature resembles the original signature more closely. These profiles
match, and both ccf (0.6891) and rough_cor (0.7980) provide values indicative of matching.
Figure 2: Alignment of profile 8752 with profile 136676. In this case, the waviness or the
deformation pattern in the signatures is relatively minor, and hence the resulting roughness
signature resembles the original signature more closely. These profiles match, and both ccf
(0.6891) and rough_cor (0.7980) provide values indicative of matching.
We can observe the distributions of both CCF and the Roughness Correlation side by side,
differentiating between known matches and known non-matches. Figure 3 displays this as
an empirical CDF plot. It can be seen that the separation of known matches and known
non-matches along both the CCF and the Roughness Correlation is quite strong and follows
similar distributions (the known non-matches are relatively symmetric, while the known
matches are very skewed left). However, some known non-matches with CCF values that
would typically be indicative of a match have relatively lower values for the Roughness
Correlation, which indicates that this feature could provide some added value when it comes
to discriminating between matches and non-matches.
Figure 3: Empirical CDFs of the Roughness Correlation compared to the CCF for known
matches and known non-matches. It can be seen that the distribution of each feature for the
known non-matches is quite symmetric, while the distribution for each feature for the known
matches is skewed left.
3 Model Training
Using these features, we can train a randomForest (Liaw and Wiener 2002) model which
attempts to predict whether two lands match given the values of the features. There are
currently three bullet studies included in the NIST Ballistics Toolmark Research Database:
Hamby (Set 252), Hamby (Set 44), and Cary. For purposes of the analysis we
describe in this paper, we exclude the Cary bullets from consideration, because the study
was designed to assess the persistence of striation markings over a series of fires from the
same barrel. Thus, every Cary bullet is a known match to every other Cary bullet. Hence,
we will consider Hamby (Set 252) and Hamby (Set 44) only. This leaves us a total of 83,028
land-to-land comparisons, of which 1208 are among known matching land impressions and
81,820 are among known non-matching land impressions.
We can now train the forest using the features we defined earlier. Using the caret package
(Jed Wing et al. 2016), we perform the following partitioning scheme. Out of the 50 barrels
in total (ten knowns and fifteen unknowns from each of the two Hamby sets), we hold out ten
barrels at random as a testing set, and use the remaining 40 to train the model. We repeat this
procedure ten times and average the confusion matrix in order to assess the model accuracy
with different holdout samples. Table 1 displays the results in the form of a confusion matrix
on the test set, averaged over these ten independent random forests trained on ten random
barrel subsets. It can be seen that false positives are exceedingly rare, but false negatives
occur more frequently (approximately 21 false negative land-to-land comparisons on the test
set, compared with an average of fewer than two false positives).
Result            Count
False Negative     20.6
False Positive      2.0
True Negative    3716.7
True Positive      56.6
Table 1: The average confusion matrix for the 10 random forests. It can be seen that false
positives are exceedingly rare, but false negatives occur more frequently.
These results suggest that our algorithm is too conservative in predicting a match when
in fact the bullets were fired from the same gun barrel. We can break down the confusion
matrix by the study from which each of the two land impressions originated. Table 2 shows
the average confusion matrix for the 10 random forests, broken down by study. It can be
seen that Hamby252 to Hamby252 comparisons exhibit the fewest errors, while Hamby44
to Hamby44 comparisons exhibit the most errors on average. This intuitively makes some
sense given the potential presence of scanner operator effects, which we address further in
this section.
Study               False Negative  False Positive  True Negative  True Positive
Hamby252_Hamby252   0.38%           0.02%           97.56%         2.04%
Hamby252_Hamby44    0.45%           0.05%           98.30%         1.19%
Hamby44_Hamby44     0.88%           0.09%           97.68%         1.35%
Table 2: The average confusion matrix for the 10 random forests, broken down by study.
It can be seen that Hamby252 to Hamby252 comparisons exhibit the fewest errors, while
Hamby44 to Hamby44 comparisons exhibit the most on average.
4 Feature Robustness
Our goal is to assess the robustness of the previously defined features as it pertains to our
bullet matching routines. This goal is both a backward looking assessment of our previous
results for full land-to-land comparisons, and a forward looking one to help support the
case that these can be used in the degraded land case. As a first stage to assessing this
robustness, we produce parallel coordinate plots of the various features based on true positive,
true negative, false positive, and false negative land-to-land matches. Figure 4 displays these
plots. The means of the true positive and the true negative groups are shown as thick blue
lines in the two panels. The dashed lines represent individual land-to-land comparisons,
with errors highlighted in red. It can be seen that the few false positives tend to have an
anomalously high ccf or matches, while the false negatives exhibit a lot of variability,
though they also tend to have a high ccf value.
Figure 4: Parallel coordinate plot of the features based on the random forest confusion matrix
for true and false positives (above), and true and false negatives (below). False positives
tend to have some feature anomalously high, while false negatives exhibit quite a spread,
sometimes having very large values of CCF or CMS, for example.
4.1 Operator Effects
We attempt to quantify the effect of the study on the matching probability by fitting a new
random forest designed to predict the study from which the scans came, based
on the derived features. It should be noted that the two sets of scans were performed by two
trained professionals at NIST, and therefore this analysis is intended to shed light on how
slightly different operating procedures for the scans may lead to varying results in algorithms
derived from these scans. Ideally, if the assumption of independence between lands holds
across different operators, this forest should have poor performance: the set of derived
features should be relatively consistent among known matches and known non-matches
regardless of the study since the Hamby data in both sets originated from the same gun
barrels.
Table 3 shows the confusion matrix, with column proportions, for the random forest with
study as the response. It can be seen that indeed the random forest performs poorly, as
hoped, indicating that a simple model to predict the study using the features available is not
enough to detect the operator effects.
Prediction \ Actual   Hamby252_Hamby252  Hamby252_Hamby44  Hamby44_Hamby44
Hamby252_Hamby252      9.93%              8.47%            11.30%
Hamby252_Hamby44      81.24%             80.82%            78.27%
Hamby44_Hamby44        8.83%             10.70%            10.43%
Table 3: Confusion Matrix (Column Proportions) for the random forest with study as the
response. It can be seen that the random forest performs poorly, as hoped, indicating that a
simple model to predict the study using the features available is not enough to detect the
operator effects.
Figure 5 shows the distributions of the land-to-land features, faceted by whether the lands are
known to be fired from the same gun barrel, across different study to study comparisons. The
distributions among the known non-matches seem relatively consistent across study based
on visual inspection. On the other hand, among known matches, Hamby252 to Hamby252
comparisons exhibit more pronounced features, including a higher average ccf, higher number
of matches, and higher value of sum_peaks.
Though visual inspection clearly shows differences, we can more formally assess the differences
between distributions with a Kolmogorov-Smirnov test. Table 4 gives the results of pairwise
tests, for each feature, between different set comparisons, and between known matches
compared with known non-matches. Although most of the tests are significant, looking at the
raw values of the D statistic suggests that the largest effect sizes do in fact occur in comparisons
involving two Hamby252 lands, as the visual inspection of the boxplots also suggested.
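For reference, the D statistic reported here is the maximum vertical distance between the two empirical CDFs. A minimal sketch (p-values are omitted, since they require the sampling distribution of D):

```python
import bisect

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov D statistic: the maximum vertical
    distance between the empirical CDFs of samples x and y, evaluated at
    every observed data point."""
    xs, ys = sorted(x), sorted(y)
    def ecdf(sample, t):
        # fraction of sample values <= t
        return bisect.bisect_right(sample, t) / len(sample)
    return max(abs(ecdf(xs, t) - ecdf(ys, t)) for t in xs + ys)
```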
Figure 5: Distribution of the features, facetted by match, for different study to study
comparisons of lands.
set1       set2      feature     KM p-value  KM D    KNM p-value  KNM D
H252_H252  H252_H44  ccf         < 0.0001    0.2723  0.0001       0.0189
H252_H252  H252_H44  cms         < 0.0001    0.1751  < 0.0001     0.0245
H252_H252  H252_H44  D           < 0.0001    0.2567  < 0.0001     0.1049
H252_H252  H252_H44  matches     < 0.0001    0.1933  < 0.0001     0.0327
H252_H252  H252_H44  mismatches  < 0.0001    0.2015  0.3537       0.0079
H252_H252  H252_H44  overlap     0.0492      0.0984  < 0.0001     0.0276
H252_H252  H252_H44  rough_cor   < 0.0001    0.2647  < 0.0001     0.0970
H252_H252  H252_H44  sum_peaks   0.0008      0.1426  0.0015       0.0162
H252_H252  H44_H44   ccf         < 0.0001    0.2160  < 0.0001     0.0257
H252_H252  H44_H44   cms         < 0.0001    0.2515  < 0.0001     0.0467
H252_H252  H44_H44   D           < 0.0001    0.2342  < 0.0001     0.1946
H252_H252  H44_H44   matches     < 0.0001    0.2770  < 0.0001     0.0713
H252_H252  H44_H44   mismatches  < 0.0001    0.2505  0.0414       0.0138
H252_H252  H44_H44   overlap     0.2432      0.0906  < 0.0001     0.0408
H252_H252  H44_H44   rough_cor   < 0.0001    0.2242  < 0.0001     0.1718
H252_H252  H44_H44   sum_peaks   0.0001      0.1926  < 0.0001     0.0289
H252_H44   H44_H44   ccf         0.1149      0.0883  < 0.0001     0.0259
H252_H44   H44_H44   cms         0.1110      0.0888  < 0.0001     0.0262
H252_H44   H44_H44   D           0.2923      0.0724  < 0.0001     0.0906
H252_H44   H44_H44   matches     0.0603      0.0977  < 0.0001     0.0423
H252_H44   H44_H44   mismatches  0.3301      0.0700  0.1633       0.0096
H252_H44   H44_H44   overlap     0.8231      0.0465  0.0001       0.0190
H252_H44   H44_H44   rough_cor   0.2671      0.0741  < 0.0001     0.0769
H252_H44   H44_H44   sum_peaks   0.0470      0.1011  0.0060       0.0147
Table 4: Results for the pairwise Kolmogorov-Smirnov distributional tests (p-values and D
statistics for known matches, KM, and known non-matches, KNM).
These results strongly suggest the need for controlling for more effects when performing the
analysis. Specifically, microscope operator effects resulting in variations in scan quality and
scan parameters seem to play a role in the ultimate performance of the matching algorithm.
Land to land comparisons from Hamby252 consistently result in more pronounced expression
of features among known matches, and therefore result in higher accuracy in the random
forest. Rigorous procedures to ensure scan quality and consistency across operators need to
be in place to minimize the effect of the study and ensure that the assumption of land to
land independence is satisfied.
Another way to demonstrate the study/operator effect is by observing the distribution of our
algorithm’s ideal cross section by study. Figure 6 gives the distributions of the ideal cross
sections by study. It can be seen that the Hamby44 ideal cross sections are more likely to be
close to the base of the bullet when compared to the position of the ideal cross sections in
Hamby252.
Indeed, another Kolmogorov-Smirnov test confirms a significant difference in the distributions
of these values (D = 0.6239, p < 0.0001). This result strongly suggests that the operator
Figure 6: Distributions of the ideal cross sections by study. It can be seen that the Hamby44
ideal cross sections are much more likely to be close to the base of the bullet compared to
Hamby252.
effect in the bullet scanning procedure must be taken into account in order to assume pairwise
independence of bullet land scans between Hamby sets 252 and 44.
4.2 Degraded Lands
We now turn our attention to matching degraded bullet lands, in which only fragments of
the land can be recovered. Because the NIST database currently contains only full bullet
lands, we artificially degrade bullets under some simplifying assumptions. Essentially, we
delete portions of lands to simulate the situation where we only recover a fragment from the
crime scene. We simulate various levels of degradation from the left, right, and middle of the
land impression. We vary the proportion of the land impression that is recovered, between
100% (no degradation) and 25% (significant degradation). For example, a left-fixed 75%
scenario implies that the left hand portion of the land was recovered, and the 25% rightmost
portion was lost. We do this by subsetting the signatures. Note that this is a simplified
scenario because the signatures themselves are somewhat dependent on the data that are
missing, owing to the properties of the LOESS smoother.
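The subsetting itself can be sketched as follows, with keep-proportion and fixed-side arguments mirroring the left-, right-, and middle-fixed scenarios described above (the function and argument names are illustrative):

```python
def degrade(signature, proportion, fixed="left"):
    """Simulate a degraded land by keeping only a fraction of the
    signature: the leftmost portion ("left"), the rightmost portion
    ("right"), or a centered window ("middle")."""
    n = len(signature)
    keep = int(round(n * proportion))
    if fixed == "left":
        return signature[:keep]
    if fixed == "right":
        return signature[n - keep:]
    if fixed == "middle":
        start = (n - keep) // 2
        return signature[start:start + keep]
    raise ValueError(f"unknown fixed side: {fixed}")
```

For example, a left-fixed 75% scenario keeps the first three quarters of the signature and discards the rightmost quarter.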
Figure 8 gives the sensitivity (true positive rate) and specificity (true negative rate) of the
random forest predictions for given levels of degradation. It can be seen that the sensitivity
drops a bit until 50% of the land is available and then rises again. This occurs because the
algorithm begins producing more positive predictions in general, likely as the result of the ccf
being arbitrarily higher for known non-matches due to the small signature. On the other hand,
the specificity drops dramatically for left, middle, and right fixed degraded lands when less
than 50% of the land impression is available for examination. For a more in-depth exploration
of the matching probabilities, Figure 7 provides histograms of the matching probability by
degradation level and by known match versus known non-match categories. The matching
probabilities suffer when compared with the probabilities obtained from comparisons between
full land impressions in all cases. The jump seems to be most noticeable beginning at about
25% degradation (75% land recovered), and the algorithm struggles beyond 50%.
Figure 9 gives feature expression for known matches, as a function of the proportion of land
impression recovered. It is immediately obvious that the variability in feature expression
is large when only a small fraction of the land is recovered, such as 25%. For instance,
sum_peaks and cms both drop, while D rises. Interestingly, some of the features are better
expressed for the middle-fixed case. Overall, feature expression remains relatively consistent
as long as we recover 50% or more of the land impression. Figure 10 shows the feature
expression for known non-matches by comparison. The non-matches do not exhibit the same
pattern of better feature expression for the middle-fixed case, except perhaps for very low
degradation levels. However, feature expression rises as the land proportion decreases,
which explains why the random forest begins predicting more positives, raising the sensitivity
but drastically lowering the specificity.
To come full circle, we now attempt to match a particular land which exhibits bad tank
rash. Figure 11 provides an image of the surface of this land impression. Due to the tank
rash, this particular land impression was originally excluded from consideration (see (Hare,
Figure 7: Histograms of matching probability, facetted by the degradation level and known
match versus known non-match.
Figure 8: Sensitivity and specificity of the random forest for given levels of degradation. It
can be seen that both metrics decline as a function of the land proportion, except for the
sensitivity, which rises for very low levels of the land proportion due to an increase in the
amount of positive predictions.
Figure 9: Feature expression for known matches, as a function of land proportion. It can be
seen that when we fix the middle portion of the bullet land, the features tend to be better
expressed.
Figure 10: Feature expression for known non-matches, as a function of land proportion. As
expected by the fact that the false positive rate increases as the land proportion decreases,
so too does the feature expression. However, unlike for the known matches, fixing the middle
portion of the land does not seem to lead to more expressed features, except perhaps for very
low degradation levels.
Hofmann, and Carriquiry 2016)) based on our subjective assessment of the quality of the land.
However, it appears that approximately half of the bullet land remains relatively unaffected.
We extract a signature from the unaffected half and attempt to match this signature to its
full known match.
Figure 11: Land 4 of Bullet 2, from Barrel 9 of Hamby Set 252. It can be seen that this
particular land exhibits some major tank rash on the left half.
Table 5 shows the values of the features, after extracting only the last 50% of the Hamby
Barrel 9 Bullet 2, 4th land (and hence, simulating a right-fixed 50% degraded scenario),
compared with a feature comparison between both full lands (and hence, including the tank
rash striae). The features are derived in a comparison with its known match, the complete
Bullet 1 third land fired from Barrel 9. The features, including the ccf and the matches, are
expressed enough to (barely) indicate a match in the case of the degraded bullet. Using the
pre-trained random forest, the predicted matching probability is 52%. This is encouraging in
that attempting to match the full bullet land, by comparison, yields a matching probability
of just 0.67%. This is due to the relatively higher values of the ccf, cms, and matches for the
degraded comparison, and suggests that the feature standardization is working as intended.
5 From Lands to Bullets
Another area that deserves more study is the question of generalizing these algorithms to
matching entire bullets rather than individual lands, as would be done in a criminal justice
application. One such approach is to recognize that (at least for the Hamby bullets) there
should be six matching pairs of lands for any two bullets that were fired from the same
Feature      Degraded Land  Full Land
ccf          0.6004         0.4442
rough_cor    0.3671         0.1633
D            0.0018         0.0023
overlap      0.9968         0.9968
matches      10.2236        5.6275
mismatches   7.5949         5.0713
cms          9.2013         4.6043
non_cms      6.5823         2.5357
sum_peaks    12.0020        6.3148
matchprob    0.5200         0.0067
Table 5: Features extracted for a comparison of the full Hamby Barrel 9 Bullet 1 Land 3,
with a left-fixed 50 percent degraded portion of Hamby Barrel 9 Bullet 2 Land 4. These two
lands are known matches, and indeed the random forest does predict a match.
gun barrel. Therefore, for each pair of bullets, we can extract the six highest matching
probabilities and average them. If we do so, we obtain a clear separation between the scores
obtained when matching bullets known to be matches and those obtained from known
non-matches, as shown in Figure 12. No known match has a score below 50%, while all
known non-matches have scores below 10%.
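The naive score just described can be sketched in a few lines; the 6 × 6 matrix of land-to-land probabilities below is illustrative only, not taken from our data:

```python
import numpy as np

def naive_bullet_score(probs):
    """Average of the six highest land-to-land matching probabilities
    in the matrix of random forest outputs for one bullet pair."""
    flat = np.sort(probs, axis=None)  # ascending sort of all 36 entries
    return flat[-6:].mean()

# Illustrative matrix: one strong probability per land of bullet 1,
# mimicking a known-match pair with aligned lands.
probs = np.full((6, 6), 0.01)
for i in range(6):
    probs[i, (i + 2) % 6] = 0.95

print(round(naive_bullet_score(probs), 2))  # -> 0.95
```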
We can improve on this approach by exploiting the rotation of the bullet to compute a score.
Under the assumption of land to land independence, we can define the probability that two
bullets match (M) as one minus the probability that the two bullets do not match (NM).
Exploiting the idea that when two bullets do not match, none of the individual lands match
either, we can write the matching probability as the probability that at least one land pair in
the matrix matches. Specifically,
P(M) = 1 − P(NM)
     = 1 − P(NM1) × P(NM2) × · · · × P(NM6)
     = 1 − (1 − P(M1)) × (1 − P(M2)) × · · · × (1 − P(M6))
where M is the event that the two bullets match, NM is the event that they do not match,
M1, M2, . . . , M6 are the events that land pairs one through six match, and NM1, NM2,
. . . , NM6 are the events that land pairs one through six do not match. However, to compute
this probability, we need to know the alignment of the two
sets of lands. Fortunately, the consistent rotation of the bullet permits this. For instance, if
we knew that land 1 of bullet 1 matches land 4 of bullet 2, then we would immediately know
that land 2 of bullet 1 matches land 5 of bullet 2, land 3 of bullet 1 matches land 6 of bullet
2, and so on. Hence, we can look across the six diagonals of the 6 × 6 matrix containing match
probabilities. Table 6 gives an example of the matrix of matching probabilities between two
sets of six lands from bullets that are known matches. The matching diagonal is clear based
on the high probabilities (cell (1, 3), cell (2, 4), cell (3, 5), etc.) although it can be seen that
Figure 12: Score distributions for the naive approach to bullet matching, for known matches
and known non-matches.
one of the six comparisons has a relatively lower matching probability. This procedure is
based on the Sequence Average Maximum (SAM) procedure used by Sensofar (2017) in their
bullet matching software SensoMatch. A similar approach using the cross-correlation maximum
was first proposed by W. Chu et al. (2010). Compared to that approach, ours uses
random-forest-based probabilities rather than correlation values, allowing elements of
probability theory to help determine the resulting bullet match probability.
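Under the land-to-land independence assumption, each of the six rotational alignments corresponds to one cyclic diagonal of the matrix, and the formula above yields a score for each. A minimal sketch, with an illustrative matrix and helper names of our own choosing:

```python
import numpy as np

def cyclic_diagonal(probs, offset):
    """Probabilities for one rotational alignment: land i of bullet 1
    paired with land (i + offset) mod 6 of bullet 2."""
    n = probs.shape[0]
    return np.array([probs[i, (i + offset) % n] for i in range(n)])

def alignment_score(diag):
    """P(M) = 1 - (1 - P(M1)) x ... x (1 - P(M6))."""
    return 1.0 - np.prod(1.0 - diag)

# Illustrative matrix: the true alignment sits on the offset-2 diagonal.
probs = np.full((6, 6), 0.01)
for i in range(6):
    probs[i, (i + 2) % 6] = 0.9

scores = [alignment_score(cyclic_diagonal(probs, k)) for k in range(6)]
print(int(np.argmax(scores)))  # -> 2, the offset-2 diagonal wins
```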
profile1_id    45604    46104    46601    47069    47600    48069
42594         0.0000   0.0000   1.0000   0.0000   0.0000   0.0000
43063         0.0000   0.0000   0.0000   1.0000   0.0067   0.0000
43581         0.0000   0.0000   0.0000   0.0000   0.8433   0.0000
44211         0.0000   0.0000   0.0133   0.0000   0.0000   0.6700
44568         1.0000   0.0000   0.0033   0.0000   0.0000   0.0000
45070         0.0000   1.0000   0.0000   0.0000   0.0000   0.0000
Table 6: Matrix of matching probabilities between two sets of six lands from bullets that are
known matches.
We now describe four methods for deriving a score from this matrix; the results are shown
in Figure 13. In Method 1, we derive a score by computing the bullet matching probability
on each of the six matrix diagonals using the previously defined formula, under the
assumption of land-to-land independence. Finally, we take the maximum
score obtained out of the six results as the final matching score for a bullet pair. After doing
so, we can plot the scores for known matches and known non-matches separately. It can be
seen that the known matches all have scores of around 100%, while no non-match achieves a
score of above 30%, and hence this procedure provides perfect discrimination between all
pairs of bullets between and within the two Hamby datasets.
Method 2, on the other hand, flips this procedure around by assuming that a match occurs
if and only if all six lands match. As it turns out, this does not discriminate quite as well:
every known bullet non-match achieves a score of about zero, but so do about 15
known bullet matches. This method performs poorly because our matching algorithm exhibits
a larger false negative rate than the rate of false positives. Multiplying the probabilities
together compounds the issue of false negatives and leads to some misidentification of matching
bullets.
Method 3 is a hybrid of these two approaches, where we average the probabilities along
the diagonal rather than multiplying those probabilities. Now, we once again differentiate
the two groups well with no known non-match achieving a score above 10%, and no known
match with a score below 40%.
One more approach to generating bullet matching scores, which we call Method 4, exploits
the SAM procedure on individual features. For each diagonal in the 6 × 6 matrix, we
can compute an average value for each feature in our model. This yields six sets of feature
values for all six diagonals. We can then feed all six sets of features into the random forest in
order to obtain a matching probability for each, taking the highest resulting probability to
locate the diagonal and thus identify land to land alignment. It can be seen that while this
procedure does discriminate well, it yields some false negatives (matching bullets that our
forest identifies as a non-match).
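Methods 1 through 3 can be sketched side by side on the same illustrative matrix (Method 4 is omitted, since it requires the underlying feature values rather than the probabilities alone); the helper names and numbers are ours, not the paper's implementation:

```python
import numpy as np

def diagonals(probs):
    """All six cyclic diagonals of the 6 x 6 probability matrix."""
    n = probs.shape[0]
    return [np.array([probs[i, (i + k) % n] for i in range(n)])
            for k in range(n)]

def method1(probs):
    """At least one land pair matches: max over alignments of 1 - prod(1 - p)."""
    return max(1.0 - np.prod(1.0 - d) for d in diagonals(probs))

def method2(probs):
    """All six land pairs match: max over alignments of prod(p)."""
    return max(np.prod(d) for d in diagonals(probs))

def method3(probs):
    """Average the probabilities along the best-aligned diagonal."""
    return max(d.mean() for d in diagonals(probs))

# Illustrative known-match matrix with one weak land pair,
# analogous to the 0.67 cell in Table 6.
probs = np.full((6, 6), 0.01)
for i in range(6):
    probs[i, (i + 2) % 6] = 0.9
probs[0, 2] = 0.1  # the weak pair sits on the matching diagonal

# Method 2 is dragged down by the single weak pair, as in the text.
print(round(method1(probs), 3),
      round(method2(probs), 3),
      round(method3(probs), 3))  # -> 1.0 0.059 0.767
```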
6 Conclusion
In this paper, we have introduced a set of robust features that can be used to train bullet
matching models. We have used these features to train a random forest and assess its
out-of-sample accuracy. In doing so, we noted strong evidence of operator effects that resulted
in differences in the quality of the microscope scans. These effects were noted despite the
experience of the individuals conducting the scans, which implies that such effects could quite
likely be more pronounced when scans are done by less experienced operators, or with fewer
standard operating procedures in place.
While these effects were clearly identified, the best approach to account for them in practice
is less clear. In the ideal case, bullets fired from a particular gun barrel should yield surface
scans that are of identical quality and properties, regardless of the operator performing the
scan. To achieve this, rigorous standards may need to be put in place with regards to the
alignment of the bullet under the objective, and the procedure used to scan the bullet surface.
To appropriately design a set of best practices requires more research. For instance, because
of the significant difference between the placement of the ideal cross section across the two
Figure 13: Distribution of matching scores using four methods. Method 1 assumes a match
if at least one pair of lands match. Method 2 assumes a match if all pairs of lands match.
Method 3 averages the probabilities instead of multiplying them. Finally, Method 4 uses a
SAM procedure on the feature values for known matches compared to known non-matches.
While these methods have various levels of discriminatory power, and rely on slightly different
assumptions, they do show clear and significant separation between matching bullets and
non-matching bullets in general.
studies, a best practice may specify the margin from the edge of the objective at which the
bullet can be placed.
We began exploring the robustness of the matching algorithm proposed in Hare, Hofmann,
and Carriquiry (2016) to land degradation. As suspected, the algorithm's performance
declines as the degree of degradation increases. However, there is a relatively clear threshold
around 50%: if at least 50% of the land is recovered, the algorithm still performs reasonably
well, while below that point the accuracy with which we can compare land impressions is
low.
Finally, one pleasing conclusion from these results is that generalizing them from land-to-land
comparisons to full bullet comparisons appears to work quite well. Depending
on the assumptions made, the out of sample accuracy of bullet to bullet comparisons can
range from nearly perfect to perfect. This result is encouraging in that real world use of these
algorithms would be done on the bullet level, assuming enough of the bullet was recovered to
make these procedures possible.
As we have stated before, the lack of 3D images of bullets available in the public domain
limits the extent to which these algorithms can be tested and validated. The degraded
land simulation itself may be too simplistic and not faithfully represent realistic scenarios.
However, as more data are collected, we can continue to update, train, and test the matching
algorithm in order to improve its performance in real datasets.
7 Acknowledgment
We wish to acknowledge the anonymous reviewers, whose quick yet thorough and immensely
helpful reviews helped us improve this manuscript. We greatly appreciate the extra effort
put forth for us.
This research was partially funded by the Center for Statistics and Applications in Forensic
Evidence (CSAFE) through Cooperative Agreement #70NANB15H176 between NIST and
Iowa State University, which includes activities carried out at Carnegie Mellon University,
University of California Irvine, and University of Virginia.
References
Biasotti, Alfred A. 1959. “A Statistical Study of the Individual Characteristics of Fired
Bullets.” Journal of Forensic Sciences 4 (1): 34–50.
Chu, W., J. Song, T. Vorburger, R. Thompson, and R. Silver. 2011. “Selecting Valid
Correlation Areas for Automated Bullet Identification System Based on Striation Detection.”
Journal of Research of the National Institute of Standards and Technology 116 (3): 649.
Chu, W., J. Song, T. Vorburger, J. Yen, S. Ballou, and B. Bachrach. 2010. “Pilot study of
automated bullet signature identification based on topography measurements and correlations.”
J. Forensic Sci. 55 (2): 341–47.
Chu, Wei, Robert M Thompson, John Song, and Theodore V Vorburger. 2013. “Automatic
identification of bullet signatures based on consecutive matching striae (CMS) criteria.”
Forensic Science International 231 (1–3): 137–41.
Clarkson, James A, and C Raymond Adams. 1933. “On Definitions of Bounded Variation
for Functions of Two Variables.” Transactions of the American Mathematical Society 35 (4).
JSTOR: 824–54.
Giannelli, Paul C. 2011. “Ballistics Evidence Under Fire.” Criminal Justice 25 (4): 50–51.
Hamby, James E., David J. Brundage, and James W. Thorpe. 2009. “The Identification of
Bullets Fired from 10 Consecutively Rifled 9mm Ruger Pistol Barrels: A Research Project
Involving 507 Participants from 20 Countries.” AFTE Journal 41 (2): 99–110.
Hare, E., H. Hofmann, and A. Carriquiry. 2016. “Automatic Matching of Bullet Lands.”
ArXiv E-Prints, January.
Kuhn, Max, with contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer,
Allan Engelhardt, Tony Cooper, Zachary Mayer, et al. 2016. caret: Classification and
Regression Training. https://CRAN.R-project.org/package=caret.
Liaw, Andy, and Matthew Wiener. 2002. “Classification and Regression by RandomForest.”
R News 2 (3): 18–22. http://CRAN.R-project.org/doc/Rnews/.
National Research Council. 2004. Forensic Analysis: Weighing Bullet Lead Evidence. National
Academies Press.
———. 2009. Strengthening Forensic Science in the United States: A Path Forward. Washington, DC: The National Academies Press. doi:10.17226/12589.
Petraco, Nicholas, and Helen Chan. 2012. Application of Machine Learning to Toolmarks:
Statistically Based Methods for Impression Pattern Comparisons. Mannheim, Germany:
Bibliographisches Institut AG.
President’s Council of Advisors on Science and Technology. 2016. “Report on Forensic
Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods.”
https://www.whitehouse.gov/sites/default/files/microsites/ostp/PCAST/pcast_forensic_
science_report_final.pdf.
Riva, Fabiano, and Christophe Champod. 2014. “Automatic Comparison and Evaluation of
Impressions Left by a Firearm on Fired Cartridge Cases.” Journal of Forensic Sciences 59
(3): 637–47. doi:10.1111/1556-4029.12382.
Sensofar. 2017. SensoMATCH Bullet Comparison Software.
Vorburger, T.V., J.-F. Song, W. Chu, L. Ma, S.H. Bui, A. Zheng, and T.B. Renegar. 2011.
“Applications of Cross-Correlation Functions.” Wear 271 (3–4): 529–33.
doi:10.1016/j.wear.2010.03.030.