COFFEE QUALITY CLASSIFICATION

SPECIALTY COFFEE QUALITY CLASSIFICATION WITH FEATURE IMPORTANCE
committee
dr. Grzegorz Chrupała
dr. Marijn van Wingerden
location
Tilburg University
School of Humanities and Digital Sciences
Department of Cognitive Science &
Artificial Intelligence
Tilburg, The Netherlands
date
June 25, 2021
acknowledgments
I want to thank my family for their support in my studies throughout
the years.
I want to thank dr. Robert McKeon for sharing his passion for coffee and
the Sweet Maria’s dataset. Finally, I want to thank my supervisor dr.
Grzegorz Chrupała for the wise guidance provided in the writing of this
thesis.
contents
1 Introduction
  1.1 Context
  1.2 Project Relevance
  1.3 Project Goal
  1.4 Research Questions
2 Related Work
  2.1 Similar Research
  2.2 Feature importance
  2.3 Coffee Quality Prediction and Classification
3 Experimental Setup
  3.1 Software
  3.2 Data Description
    3.2.1 Coffee Quality Institute
    3.2.2 Sweet Maria’s Coffee
  3.3 Exploratory Data Analysis
    3.3.1 Coffee Quality Institute
    3.3.2 Sweet Maria’s Coffee
    3.3.3 Compatibility and discrepancy
  3.4 Preprocessing
    3.4.1 Coffee Quality Institute
    3.4.2 Sweet Maria’s Coffee
  3.5 Description of feature importance algorithms
    3.5.1 Filter Methods
    3.5.2 Wrapper Methods
    3.5.3 Embedded Methods
    3.5.4 Hierarchical Clustering
  3.6 Evaluation Algorithm: K Nearest Neighbors
    3.6.1 Subset Evaluation
Abstract
Coffee is one of the most traded commodities in the world, and the
largest producers of the black bean are developing countries; farmers
in these countries can benefit greatly from the high demand for coffee,
all the more now that specialty coffee enjoys an en vogue status. The
price of coffee is established in online auctions, where bidders rely on
the reputation of a crop, and this prestige is built upon coffee cupping
scores. The main focus of this thesis is to study the features used in
coffee evaluation in order to understand their importance, and to explore
the possibility of including environmental features in the assessment of
coffee quality. The most important features of coffee are the taste
variables: flavor is the most important for the classification of specialty
coffee, while the texture-based features come last. Despite some features
being more important than others, all of the features contribute
substantially to the classification task.
1 introduction
1.1 Context
Coffee has been rated as the second most traded commodity in the world
market, and even though that claim is inaccurate, coffee is still one of
the most traded agricultural products in the world (Mussatto, Machado,
Martins, & Teixeira, 2011). The coffee economy is worth more than $12.2
billion, and about 142 million people rely on the agriculture of coffee, as
stated by Reis Guimarães, Carlos dos Santos, Montagnana Vicente Leme,
and da Silva Azevedo (2020). The coffee economy has had a positive social
and physical impact on the communities where coffee is harvested, not only
allowing big traders to profit through a free-trade market, but also allowing
inhabitants of the countryside to raise their standard of living, according to
Lashermes, Andrade, and Etienne (2008). Farmers still need to adapt to
current retailers’ requests, since the coffee industry is in constant mutation,
driven not only by environmental factors such as global warming, but also
by the growing demand for high-quality coffee (Wollni & Zeller, 2007).
According to Lannigan (2020), this demand is due to a market trend in
which connoisseur consumption is now the rule: consumers now research
the quality of their products, looking not only at the ecological production
process of the goods but also at the final quality. It is therefore pertinent for
coffee evaluation to study the importance of the variables and their
interactions, in order to deliver a conscious rating of taste.
Component Analysis (PCA); the subset was used for training a Support
Vector Machine that classified with 99% accuracy.
A similar approach was taken by Rodríguez et al. (2010); they used an
electronic nose (with gas chromatography detection) with 8 different
sensors for detecting quality defects in coffee. A Multilayer Perceptron
Neural Network with PCA preprocessing was trained, and this approach
was able to classify with 100% accuracy. A later work by Chang et al. (2021)
used the same approach, but in this case to classify the flavor of quality
coffee among 9 different classes; they used a Support Vector Machine with
a Radial Basis Function (RBF) kernel and a Residual Neural Network, the
first reaching 78.91% accuracy and the second 78.79%.
Nonetheless, none of the previous studies considered the magnitude of each
feature’s influence on the classification task; understanding their importance
is therefore necessary to contribute to evaluation methods when no electronic
device is available, since electronic noses are deemed expensive and, so far,
not easily available to the everyday coffee grader or farmer. To understand
feature importance in the coffee cupping scores, different machine learning
algorithms can be used: filter methods, wrapper methods and embedded
methods. Each method has different advantages and difficulties that need
to be assessed with respect to the type of data to analyze.
Since the intention of this project is to understand the features that play
a major role in the classification of coffee quality, the following research
questions will be answered:
• RQ2 Can feature selection yield a better result than using the whole data
set for specialty coffee classification?
• RQ3 Does including features other than taste itself in specialty coffee
classification increase performance of the model?
2 related work
ture selection method could create a subset leading to higher accuracy;
however, there are flaws in the research. The first problem is that wrapper
methods are prone to overfitting, and no measure to quantify or remedy
overfitting is presented in the paper. The second issue is the lack of detail
about the subset, so the individual contribution of each feature to the final
wine quality scores cannot be evaluated, nor is the information needed to
replicate the study available.
A different approach can be seen in the research of Laughter and
Omari (2020), who extracted features by employing PCA. The classifiers
used were a Multilayer Neural Network with SGD and ADAM as optimizers
(the optimizers are highlighted because those algorithms can reduce the risk
of overfitting), a Decision Tree and a Support Vector Machine (SVM); the
model with the highest accuracy score for classification was a Random
Forest with 72.4% accuracy. None of the previous works discussed the
influence of imbalanced data when selecting or extracting features; this is an
issue when conducting research with the UCI Wine dataset, as low quality
wines are underrepresented in the data. In contrast to the previous studies,
the investigation of Hu, Xi, Mohammed, and Miao (2016) employed the
Synthetic Minority Oversampling Technique (SMOTE) for balancing the
data. AdaBoost, Decision Tree and Random Forest were used as classifiers;
the latter performed best, reaching 94.6% accuracy on the test set. The work
of Hu et al. (2016) suggested the SMOTE technique, which is also used
in this thesis, as according to Fernández, García, Herrera, and Chawla
(2018) SMOTE is the robust industry standard for oversampling the
minority class.
On the other hand, research on feature importance in coffee taste quality
has been scarce in academia; this can mainly be attributed to current trends
in the research field, where investigations focus on the use of neural
networks without a prior treatment of features. There are studies in the
direction of feature engineering, but such research projects focus on
dimensionality reduction by feature extraction. In light of the existing
literature, it is necessary to analyze feature importance research outside the
beverage quality field.
For instance, the paper of Valko and Hauskrecht (2010) exposes which
features of clinical data influence physicians’ decisions about ordering
different laboratory tests. In that research a multivariate analysis provided
an understanding of the main features that influenced the classification
tasks: all of the variables were used for training an SVM, and the AUC was
then evaluated for each variable. With those variables three different subsets
were evaluated: (1) the top 1 feature; (2) the top 3 features; and (3) the top
30 features. In general the model with just the top feature reached up to
87.46% precision; adding all of the variables to the model improved the
precision of the prediction in certain cases, but never by more than 10%.
The greedy selection method was useful for that research; however, it did
not allow studying feature dependencies and it limited the research to an
explanatory account of which features contribute most to the SVM’s
precision.
On the other hand, the research of Ginsburg, Lee, Ali, and Madabhushi
(2016) investigated feature importance in nonlinear embeddings (FINE)
and applied this method to quantitative histomorphometry (QH), a field
that uses pathology images for disease prediction or outcome. The approach
used in that investigation is based on feature extraction: kPCA was used for
feature extraction, the resulting eigenvectors were used for training a model,
an SVM and an RF were used as classifiers, and the AUC metric allowed
reviewing which eigenvectors contributed most to the classification task.
After determining the importance of each eigenvector, the same process was
executed with each feature in each of the most important eigenvectors. Their
trained FINE model achieved a higher accuracy than the baseline models
with Fisher Scores (AUC: 0.53-0.67) and Gini Impurity (AUC: 0.58-0.75);
depending on the pathology, the model with FINE reported an AUC between
0.74 and 0.93. This study used dimensionality reduction in order to build an
accurate predictive model that avoided the curse of dimensionality, which is
a great risk in certain types of datasets; in this case they had 4 medical image
datasets, one of which contained 140 vectors and 2343 different variables.
Although the predictive capacity of the model is high, the relevance of each
of the variables is not stated explicitly in the paper. This lack of specificity
may be understandable, as the most important features could number more
than 100; however, the biggest limitation is the use of eigenvectors, as those
may make it impossible to account for interaction effects among variables.
The authors of that paper also acknowledge that filter, wrapper or ensemble
methods could produce feature subsets with similar classification results.
In contrast, just a few research papers use existing databases with a clear
methodology to classify specialty coffee. The research of Suarez-Peña,
Lobaton-García, Rodríguez-Molano, and Rodriguez-Vazquez (2020) used
variables other than the tasting of the coffee itself; with a database of 56
vectors and a 10-fold cross-validation stratification of the data, a 4-layer
Neural Network was trained for classification and was able to classify
quality coffee with 81% accuracy, while another model, an SVM with an
RBF kernel, reached 88% accuracy. However, there is a lack of rigor: the
size of the database does not allow a test set fully independent of the
training data, and the 34 variables used for training amount to more than
half of the cases available to the model, which could lead to an overfitted
model. On the other hand, Yuriko and I Dewa (2020) used the Coffee
Quality Institute database and built a General Regression Neural Network
for prediction; in it: "The model’s performance is measured with MSE and
MAE with the best MSE value of 0.097 and MAE value 0.245" (Yuriko &
I Dewa, 2020, p. 189).
The previous studies use complex classification methods and different types
of feature selection methods for prediction and classification. Given the
complex nature of the available datasets and the methodology evidenced in
the scientific works mentioned above, this thesis project will use 3 different
families of feature importance methods: filter, wrapper and embedded
methods. Approaching the problem with different methodologies should
lead to similar results, or at least to results that allow an understanding of
the coffee tasting phenomenon.
3 experimental setup
3.1 Software
To conduct the data analysis and fit the machine learning models, this
thesis project used Jupyter Notebook with Python (version 3), together
with the following libraries:
3.2 Data Description
For the study of feature importance for coffee quality classification, two
data sets have been analyzed; both were created with a web scraping tool.
Features Variance
method_Natural / Dry 0.13
method_Other 0.01
method_Pulped natural / honey 0.01
method_Semi-washed / Semi-pulped 0.03
method_Washed / Wet 0.22
color_Blue-Green 0.04
color_Bluish-Green 0.06
color_Green 0.17
Species 0.03
Aroma 0.17
Flavor 0.21
Aftertaste 0.22
Acidity 0.18
Body 0.17
Balance 0.22
Uniformity 0.54
clean_cup 1.10
Sweetness 0.55
cupper_points 0.28
Moisture 0.00
cat1_def 12.30
quakers 0.63
cat2_def 38.50
altitude_low_meters 222020.67
altitude_high_meters 241116.72
The Sweet Maria’s coffee dataset also has scores that range between 1 and
10; it is therefore useful to have an overview of the variance:
Features Variance
Fragrance 0.13
Aroma 0.11
Brightness 0.18
Flavor 0.11
Body 0.10
Finish 0.13
Sweetness 0.05
Clean_cup 0.19
Complexity 0.29
Uniformity 0.08
Cuppers_correction 1.14
Quality 0.12
• Fragrance / Aroma
• Flavor
• Aftertaste / Finish
• Body
• Sweetness
• Clean Cup
• Uniformity
• Brightness / Acidity
• Wet Aroma: The SCAA protocol evaluates the Wet Aroma inside the
Fragrance and Aroma aspect as one main variable; Sweet Maria’s
evaluates the wet aroma separately from the Fragrance. It is partially
related to the SCAA Aroma, but it is scored apart from it.
3.4 Preprocessing
random forest that cannot handle such imbalance, therefore the treatment
of such values is relevant.
Due to the small size of the data set, the algorithm chosen for oversampling
the low quality coffee is the Synthetic Minority Oversampling Technique
(SMOTE); it has disadvantages, as it does not consider the neighboring
data of other classes and it may create noise, but the advantages outweigh
the disadvantages.
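A minimal sketch of this oversampling step is given below, assuming the imbalanced-learn package; the toy data generated here is only a hypothetical stand-in for the preprocessed feature matrix and quality labels.

```python
# Sketch of the SMOTE oversampling step (assumes imbalanced-learn is installed;
# the toy X and y stand in for the real preprocessed features and quality labels).
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, weights=[0.9, 0.1],
                           random_state=42)
print("class counts before:", Counter(y))

# SMOTE synthesizes new minority-class vectors by interpolating between a
# minority sample and its nearest minority-class neighbors.
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print("class counts after:", Counter(y_resampled))
```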
Finally, the categorical variables Color (3 different settings) and Processing
Method (5 different methods) were encoded as dummy variables. At the end
of preprocessing, 8 additional variables had been created, for a total of 2286
vectors and 26 variables.
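A small sketch of this encoding step, assuming pandas and hypothetical column names "Color" and "Processing Method":

```python
import pandas as pd

# Hypothetical frame with the two categorical columns described above.
df = pd.DataFrame({
    "Color": ["Green", "Blue-Green", "Green"],
    "Processing Method": ["Washed / Wet", "Natural / Dry", "Other"],
})

# Dummy (one-hot) encoding; on the full data set the 3 color categories and
# 5 processing methods yield the 8 additional binary variables mentioned above.
encoded = pd.get_dummies(df, columns=["Color", "Processing Method"],
                         prefix=["color", "method"])
print(encoded.columns.tolist())
```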
In this project the filter method used is the Fisher Score, which is calculated
for each variable in relation to the target; the main formula is the following:
F(X^j) = \frac{\sum_{k=1}^{c} n_k \,(\mu_k^j - \mu^j)^2}{(\sigma^j)^2}    (1)
It is mainly used for the classification of binary classes. According to the
formula, the Fisher Score is defined by the mean (µ) and the standard
deviation (σ) of all the vectors for each jth feature, where n_k is the number
of samples in class k and c the number of classes (Gu, Li, & Han, 2012); in
short, it measures the distance between the class means for each feature,
divided by their variance.
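As an illustration of equation (1), a small sketch of the per-feature Fisher score with NumPy; the toy arrays are hypothetical.

```python
import numpy as np

def fisher_score(X, y):
    """Fisher score per feature: between-class spread of the class means
    divided by the feature's overall variance, as in equation (1)."""
    classes, counts = np.unique(y, return_counts=True)
    overall_mean = X.mean(axis=0)
    numerator = np.zeros(X.shape[1])
    for cls, n_k in zip(classes, counts):
        class_mean = X[y == cls].mean(axis=0)
        numerator += n_k * (class_mean - overall_mean) ** 2
    return numerator / X.var(axis=0)

# Toy example: feature 0 separates the two classes, so its score is higher.
X = np.array([[1.0, 5.0], [1.2, 5.2], [3.0, 5.1], [3.1, 4.9]])
y = np.array([0, 0, 1, 1])
print(fisher_score(X, y))
```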
Here i indexes the iterations over j, which represents each of the features in
the data set, and s is the accuracy score reached by the trained model. For
each feature, the data in feature j is shuffled (corrupted) k times; in each
iteration a new accuracy score is calculated and subtracted from the initial
score. The final importance is computed from the loss in accuracy caused
by shuffling each feature, and the features are ranked so that the feature
whose shuffling produced the largest decrease in accuracy is the most
important.
The feature permutation algorithm has only one main parameter: the
classifier used to determine the accuracy loss per permuted variable. In this
research the random forest was chosen as that classifier, the main reason
being that it uses subsets to train each tree, which reduces the risk of a
biased classification. The random forest classified with an accuracy of 98%
on the test set, so the same model is used for the feature permutation, with
the following parameters: ’bootstrap’: True, ’max_depth’: 100,
’max_features’: ’auto’, ’min_samples_leaf’: 2, ’min_samples_split’: 2,
’n_estimators’: 100.
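A condensed sketch of this step, assuming scikit-learn; the toy data, the train/test split and the feature_names list are hypothetical stand-ins, while the forest parameters follow the list above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the preprocessed, oversampled coffee data.
X, y = make_classification(n_samples=600, n_features=26, n_informative=8,
                           random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=42)

# Random forest with the parameters listed above
# ("auto" in the thesis's list; "sqrt" is the equivalent for classifiers).
rf = RandomForestClassifier(bootstrap=True, max_depth=100, max_features="sqrt",
                            min_samples_leaf=2, min_samples_split=2,
                            n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Shuffle each feature in turn and record the average drop in test accuracy.
result = permutation_importance(rf, X_test, y_test, n_repeats=10,
                                scoring="accuracy", random_state=42)
for idx in result.importances_mean.argsort()[::-1][:10]:
    print(feature_names[idx], round(result.importances_mean[idx], 4))
```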
where
\Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert W \rVert^2    (4)
Omega contains the number of tree leaves (T), and gamma (γ) is the penalty
value for pruning the nodes; to this, half of lambda (λ) times the squared
leaf weights W is added. In the case of XGB, the pruning of the trees
replaces nodes that do not contribute to improving classification at the
leaves (C. Chen et al., 2020).
XGBoost splits a tree up to the max_depth specified in the parameters and
prunes backwards until the loss improvement falls below a threshold,
keeping only the leaves whose contribution was positive. For training the
data with XGBoost, the following parameter choices were made (a minimal
sketch of this configuration follows the list):
• max_depth [4] was left at that value, as the accuracy was the same
with higher values and a deeper tree would overfit.
• gamma [1]: grid search suggested this value; it specifies the minimum
loss reduction required to make a split, and the higher the number,
the more conservative the model is.
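A minimal sketch of this configuration, assuming the xgboost package; the toy data and the feature_names list are hypothetical stand-ins for the preprocessed data set.

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Hypothetical stand-in for the preprocessed coffee data (26 variables).
X, y = make_classification(n_samples=600, n_features=26, n_informative=8,
                           random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Embedded method: the fitted booster's own importances rank the features.
xgb = XGBClassifier(max_depth=4,   # deeper trees did not improve accuracy
                    gamma=1,       # minimum loss reduction to make a split
                    n_estimators=100)
xgb.fit(X, y)

for name, score in sorted(zip(feature_names, xgb.feature_importances_),
                          key=lambda pair: pair[1], reverse=True)[:10]:
    print(name, round(score, 4))
```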
• Linkage Function:
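The description of the hierarchical clustering and its linkage function is truncated here. As a rough sketch only, assuming SciPy and Ward's linkage over a correlation-based feature distance (the distance definition, the cut threshold and the toy data are illustrative assumptions, not the thesis's exact settings), the clustering of features could look like this:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(42)

# Hypothetical stand-in for the feature matrix (columns = coffee variables).
feature_names = [f"feature_{i}" for i in range(8)]
X = rng.normal(size=(200, len(feature_names)))

# Ward's linkage over a correlation-based feature distance:
# highly correlated features end up in the same cluster.
corr = np.corrcoef(X, rowvar=False)
distance = 1.0 - np.abs(corr)
np.fill_diagonal(distance, 0.0)
Z = linkage(squareform(distance, checks=False), method="ward")

# Cut the dendrogram and list the features in each cluster.
clusters = fcluster(Z, t=0.8, criterion="distance")
for cluster_id in np.unique(clusters):
    members = [f for f, c in zip(feature_names, clusters) if c == cluster_id]
    print(cluster_id, members)
```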
3.6 Evaluation Algorithm: K Nearest Neighbors
In principle, this thesis project has used other classifiers to analyze the
feature importance of each of the variables in the datasets: XGBoost uses
decision trees to classify, and for the feature permutation and the
hierarchical clustering another random forest was used to test the importance
of each chosen and/or permuted feature. Nonetheless, it is necessary to test
each of the selected feature sets compiled across the different feature
importance and selection models. The reason for choosing kNN as the test
algorithm for the selected features is mainly that it requires only one
hyperparameter to be set in order to classify. Since the other algorithms
should have chosen the most important features for classification, the
algorithm should classify with high accuracy.
One disadvantage of the kNN algorithm is that it cannot handle high
dimensional datasets; however, the datasets for this thesis do not contain a
massive number of features and were shrunk into subsets of the important
features, so this disadvantage is not an inconvenience for this research
project.
kNN, as mentioned, is the lazy classifier used for the model evaluation task;
lazy means that it does not build a model beforehand in order to evaluate
new cases, but treats every case as a new model. It is based on a distance
metric and a voting system (K); the most common distance in the kNN
algorithm is the Euclidean distance, also called the Pythagorean distance,
represented by:
d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}    (7)
The algorithm computes the distance between two points q and p, whose
coordinates q_i and p_i place each vector in an n-dimensional space.
Traditionally the distance was computed in a two dimensional plane, but in
practice it can be calculated between two points in any number of
dimensions. For the purpose of distance estimation in this project, each
dimension is a variable in the datasets and each coffee sample is placed by
the algorithm in an n-dimensional space; the test vector is then classified by
checking which are its closest neighbors from the training subset. A single
neighbor would be enough to determine a class for a case, by finding the
smallest distance between the test vector and a training vector; however, a
higher number of neighbors (k) can be used, which can increase the accuracy
at the risk of overfitting. The algorithm uses a majority voting system to
decide the label for the case; it is therefore common practice to use odd
numbers of neighbors, so that there is always a deciding vote and no ties.
There are different parameters that can be tuned in the kNN algorithm, and
Bhatia and Vandana (2010) suggested trying several different settings for
classification; therefore three different numbers of neighbors are used,
k = [3, 5, 7], while the algorithm used for calculating the distances is
’brute’, which recomputes the distance between points every time; it is not
efficient in terms of processing, but the cost remains imperceptible given
the size of the data sets used for this thesis.
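A brief sketch of this evaluation setup, assuming scikit-learn; the toy data and the train/test split stand in for one of the selected feature subsets.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in for one of the selected top-10 feature subsets.
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=42)

for k in (3, 5, 7):  # odd neighbor counts avoid ties in the majority vote
    knn = KNeighborsClassifier(n_neighbors=k, algorithm="brute",
                               metric="euclidean")
    knn.fit(X_train, y_train)
    print(f"k={k}: accuracy={accuracy_score(y_test, knn.predict(X_test)):.4f}")
```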
• 10-Fold Cross validation: the embedded method uses this data split;
since XGBoost uses additive trees with optimizers, using all the
information for training should avoid bias and find the features with
the largest magnitude in the classification.
The filter method does not need any data split, as it does not use a classifier
to find the correlations between the features and the target; for that reason
all the data is used.
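A minimal sketch of that split, assuming scikit-learn and the xgboost package; the data is a hypothetical stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# Hypothetical stand-in data; every fold is seen during training across the CV.
X, y = make_classification(n_samples=600, n_features=26, random_state=42)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(XGBClassifier(max_depth=4, gamma=1), X, y,
                         cv=cv, scoring="accuracy")
print(round(scores.mean(), 4), round(scores.std(), 4))
```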
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}    (8)

\text{Precision} = \frac{TP}{TP + FP}    (9)

\text{Recall} = \frac{TP}{TP + FN}    (10)

F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,TP}{2\,TP + FP + FN}    (11)
Sensitivity, Specificity and AUC are also commonly used metrics for
evaluating a model. For this thesis, Accuracy was selected for evaluating
all of the models; it is broadly used and easy to comprehend, and its only
disadvantage is that it does not properly reflect a model’s performance
when there are imbalanced classes in a dataframe (Hossin & Sulaiman,
2015); however, this has been addressed in the preprocessing of the data
sets.
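For completeness, the metrics in equations (8)-(11) can be computed with scikit-learn; a small sketch with hypothetical labels and predictions:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical predictions for a handful of test samples (1 = specialty coffee).
y_test = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1       :", f1_score(y_test, y_pred))
```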
3.9 Repository
4 results
The rankings used for evaluation are based on the 10 most important
features; feature ablation was then performed on each of the subsets. There
is a rationale behind the use of 10 features: specialty coffee quality is
already determined by 10 features, so checking more variables for
classification would go against the intention of scrutinizing the performance
after dimensionality reduction.
The different feature importance methods (filter, wrapper and embedded)
were executed on the datasets to determine whether there was a considerable
difference between the rankings created with the different methods.
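A compact sketch of this ablation loop, assuming scikit-learn, pandas DataFrames for the splits, and a hypothetical list ranking ordered from the most to the least important feature:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in for a top-10 subset; `ranking` lists its columns from
# the most to the least important feature according to one of the methods.
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           random_state=42)
ranking = [f"feature_{i}" for i in range(10)]
X = pd.DataFrame(X, columns=ranking)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=42)

remaining = list(ranking)
while len(remaining) > 1:
    knn = KNeighborsClassifier(n_neighbors=5, algorithm="brute")
    knn.fit(X_train[remaining], y_train)
    acc = accuracy_score(y_test, knn.predict(X_test[remaining]))
    print(len(remaining), "features:", round(acc, 4))
    remaining.pop(0)  # ablate the most important remaining feature next
```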
Table 6: Top 10 features according to the filter, wrapper and embedded methods
(F-Score, Feature Permutation and XGBoost, respectively).
hand, using the subset of 10 variables led to a better accuracy of 85.5%
with three neighbors, and dropping each variable in the ranking during
classification did not show a consistent effect on the accuracy: when
dropping altitude high meters the accuracy increased, while dropping flavor
decreased the accuracy by 42.3%; further details can be found in
Appendix C (page 55).
On the other hand, the XGBoost algorithm found that the 10 most important
variables all belong to the coffee cupping scores; its ranking does not match
any of the positions assigned by the Fisher score feature importance.
According to the accuracy scores in the evaluation of the subset built with
the XGBoost algorithm, using the whole data set of 26 variables creates a
lot of noise that cannot be handled by the kNN algorithm; when using the
top 10 features and five neighbors, the accuracy was 99.56% on the test set,
and dropping eight of the most important variables produced a steady
decrease of the metrics down to 83.4%. However, running the ablation
backwards, dropping from the least important to the most important,
exposed the importance of "clean cup", "aftertaste" and "flavor".
Figure 3 shows that after dropping 7 variables the accuracy does not change
when ablating clean cup, cupper points and aftertaste. Aftertaste and flavor
hold 88% of the accuracy reached by the kNN, while Body and Sweetness
hold 86% of the accuracy. This shows that the performance of kNN for
classification using the 9th and 10th most important features yields a result
only 2% lower than using flavor and aftertaste.
Above, the accuracy attributed to each of the features can be seen,
according to the classifier used for the feature permutation importance and
the permutation importance itself. The random forest trained for the feature
permutation importance yielded a feature importance similar to the XGBoost
subset; however, the impurity-based feature importance differs greatly from
the other algorithms used for feature importance. Both XGBoost and
Feature Permutation Importance placed flavor as the most important feature
for specialty coffee quality classification. Despite this commonality on the
first feature, the other features varied without an established pattern across
the different models; so far it is XGBoost that produced the better ranking
of the features. The feature permutation importance algorithm selected as
important a feature that does not belong to the coffee cupping scores, the
category 2 defects, which could help answer one of the research questions;
however, when testing the subset of 10 variables it performed on average
with a smaller accuracy (97.96%) than the subset chosen by the XGBoost
feature importance (98.09%). Besides, the classifier used for evaluation
showed a steady decline in accuracy on the XGBoost subset, while the
subset created by the feature permutation importance showed an increase
in performance when dropping variables from the least important to the
most important. Further information, with accuracy scores for specialty
coffee using the XGBoost feature importance and the Feature Permutation
Importance, can be found in Appendix D (page 56) and Appendix E
(page 58), respectively.
Table 7: Ranking of top 10 feature importance for specialty coffee excluding the
coffee cupping scores
The subsets are similar in the rankings of the 10 most important features.
The specifics of the evaluation of both subsets can be found in tables
18 (XGBoost) and 23 (Feature Permutation). The model with all of the
variables of the subset had a lower accuracy in comparison to the model
with coffee cupping scores (XGBoost subset: 76.42%, Feature Permutation
subset: 75.69%). The ’category type 2 defects’ and ’moisture’ are the most
important variables for both models, which otherwise only share the 7th
(’Processing Method: Washed/Wet’) and 10th (’Processing Method: Other’)
positions in the ranking. The remaining positions of the rankings hold no
consistency in order between the two subsets.
Table 8: Ranking of top 10 feature importance for specialty coffee excluding the
proxy features ’Aftertaste’ and ’Balance’
The two proxy variables that contain other variables are ’Aftertaste’ and
’Balance’; those were excluded in order to evaluate the subsets. Both
subsets contain the same 10 variables: 8 of them are coffee cupping scores,
while the other 2 are extrinsic factors of the coffee bean, ’Moisture’ and
’Category type 2 defects’. The evaluation of the subsets led in both cases
to a lower accuracy; tables 19 and 24 expose a loss of 1% of accuracy in
both cases (XGBoost subset: 98.40%, Feature Permutation subset: 98.18%).
The analysis of the Sweet Maria’s Coffee data set also used all of the
methods described in the previous section, but no extrinsic variables were
taken into account. The feature permutation and XGBoost feature importance
While all of the rankings show that flavor is the most important feature for
specialty coffee, there was no clear agreement on the rest of the most
important variables, as no common pattern was established. In the
evaluation of the rankings with the kNN algorithm, feature ablation in the
order of the rankings, from the least important to the most important and
in reverse, proved that the feature order chosen by XGBoost performed
better than the one from the feature permutation algorithm: while the
ablation of 9 variables from the most important to the least important
showed an average decline in accuracy of 43.12%, dropping from the least
important to the most important only showed a decrease in accuracy of
2.14%, meaning that even the last two variables hold 87.15% of the
accuracy. The complete data is available in the Appendix (page 61).
Figure 6: Unstable decrease in accuracy during feature ablation of the ranking
created with feature permutation importance.
As both of the previous sections only considered the coffee cupping fea-
tures, the hierarchical clustering aimed to determine coffee quality by
adding other extrinsic variables.
Subset 1 | Subset 2
method_Natural / Dry | method_Other
method_Washed / Wet | color_Green
Aroma | Flavor
Uniformity | clean_cup
Moisture | cat1_def
altitude_low_meters | altitude_high_meters
The random forest classifier scored 95% accuracy on the test set using all
the variables of the subsets; however, when the subsets were evaluated with
kNN, the performance with the full subset was lower for each of them, with
81.73% accuracy for subset two and 78.03% for subset one. A feature
ablation on each of the subsets yielded a larger accuracy when features were
dropped arbitrarily.
The intention of the feature ablation was not to evaluate the most important
feature of the subset, but to test the selected features in a different
environment. An increase in model performance can be seen in the graphic
above when "altitude high meters" and "category 1 defects" were dropped
for the kNN classification.
5 discussion
The results showed that specialty coffee classification can be highly biased
in different aspects. The taste fingerprint of a coffee should be reflected in
the coffee cupping, as many scientists have exposed previously: altitude,
moisture or processing method should influence the coffee taste, as shown
by Bravo-Moncayo et al. (2020), J. Li et al. (2019) and Spence and
Carvalho (2019); however, none of this was reflected in the coffee feature
importance. In the first place, this can be explained by the way the quality
label is created: a linear model uses the sum of all cupping scores, this sum
makes up a final score between 1-100 or 1-110, and it determines the type
of specialty coffee evaluated. In the second place, there is a strong impact
of symbolic influence on the coffee rating: the trademark and geographic
origin of the coffee can bias the coffee evaluation. The research of Traore
et al. (2018) proved that not only tasting but also pricing relies on the
symbolic assumptions attached to the coffee; e.g. a robusta coffee from
Ethiopia or an arabica coffee from Colombia will be rated and priced better,
due to those symbolic assumptions, than a coffee harvested in Morocco.
The validity of the coffee cupping scores in the international trade market
should be reconsidered, as those scores may represent the preference of the
Q grader and not reflect the actual objective features of the coffee, as
mentioned above; a deeper explanation of these assumptions can be found
in the following paragraphs:
5.1 RQ1
Features Position
Flavor 1
Aftertaste 2
cupper points 3
clean cup 4
Balance 5
Uniformity 6
Aroma 7
Acidity 8
Sweetness 9
Body 10
5.2 RQ2
As for the second research question, the results confirm that classification
with fewer variables of the data set led to a smaller accuracy; this applies
when using only the coffee cupping scores, as the feature ablation showed
a decrease in performance even when discarding the least important
features. However, the feasibility of a coffee evaluation method that
evaluates coffee not only with the scores but also with other variables was
checked directly with two approaches: the use of the kNN algorithm with
all of the variables (including extrinsic information) showed 84% accuracy,
and a basic random forest (used as the classifier in the feature permutation)
outperformed the kNN and reached 98% on the test set using all of the
variables. This outcome reflects the phenomenon exposed in the scientific
literature: kNN is very sensitive to noise and does not perform well with
high dimensional data (Kouiroukidis & Evangelidis, 2011). However, none
of the models described above reached the 99% accuracy obtained with
only the cupping scores.
As exposed in tables 18 and 23 in the Appendix, the subset models without
the coffee proxy variables (’aftertaste’ and ’balance’) had a lower
performance in comparison to the model with the cupping scores; however,
the difference was minimal and it supports the use of feature selection. Due
to the correlated nature of the features in the SCA coffee cupping method,
feature selection could remove redundancy; in this case the redundancy is
not directly reflected in the accuracy scores, but the evaluation method
should re-examine whether variables such as ’aftertaste’ and ’balance’
should still be included in the coffee cupping scores, as more variables that
rely on human perception could bias a coffee evaluation at a deeper level.
5.3 RQ3
Finally, the third research question was addressed with the hierarchical
clustering and Ward’s linkage; the feature subsets performed with a
maximum of 81% accuracy with the kNN, but when the extrinsic variables
were ablated the accuracy increased. Perhaps the random forest initially
used for classification tended to overfit, as impurity-based algorithms such
as the random forest can be biased in the presence of high cardinality
features, as exposed by Zhou and Hooker (2021). The use of XGBoost and
the feature permutation importance showed that other variables, such as
category type one and two defects, moisture and crop altitude, could
contribute to the model; in fact, the subset that excluded the coffee cupping
scores performed with up to 76.42% accuracy,
conclusion
references
Aich, S., Al-Absi, A. A., Lee Hui, K., & Sain, M. (2019). Prediction of
quality for different type of wine based on different feature sets using
supervised machine learning techniques. In 2019 21st international
conference on advanced communication technology (icact) (p. 1122-1127).
doi: 10.23919/ICACT.2019.8702017
Bertrand, B., Vaast, P., Alpizar, E., Etienne, H., Davrieux, F., & Charmetant,
P. (2006, sep). Comparison of bean biochemical composition and
beverage quality of Arabica hybrids involving Sudanese-Ethiopian
origins with traditional varieties at various elevations in Central
America. Tree Physiology, 26(9), 1239–1248. doi: 10.1093/treephys/
26.9.1239
Bhatia, N., & Vandana. (2010, July). Survey of Nearest Neighbor Techniques.
arXiv e-prints, arXiv:1007.0085.
Bravo-Moncayo, L., Reinoso-Carvalho, F., & Velasco, C. (2020). The effects
of noise control in coffee tasting experiences. Food Quality and Pref-
erence, 86, 104020. Retrieved from https://www.sciencedirect.com/
science/article/pii/S0950329320302895 doi: https://doi.org/
10.1016/j.foodqual.2020.104020
Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O.,
. . . Varoquaux, G. (2013). API design for machine learning software:
experiences from the scikit-learn project. In Ecml pkdd workshop:
Languages for data mining and machine learning (pp. 108–122).
Carvalho, F. M., Moksunova, V., & Spence, C. (2020). Cup texture
influences taste and tactile judgments in the evaluation of specialty
coffee. Food Quality and Preference, 81, 103841. Retrieved from
https://www.sciencedirect.com/science/article/pii/S0950329319306652
doi: https://doi.org/10.1016/j.foodqual.2019.103841
Carvalho, F. M., & Spence, C. (2018). The shape of the cup influences aroma,
taste, and hedonic judgements of specialty coffee. Food Quality and
Preference, 68, 315-321. Retrieved from https://www.sciencedirect
.com/science/article/pii/S0950329318300855 doi: https://doi
.org/10.1016/j.foodqual.2018.04.003
Chambers, E., & Koppel, K. (2013). Associations of volatile compounds with
sensory aroma and flavor: The complex nature of flavor. Molecules,
18(5), 4887–4905. Retrieved from https://www.mdpi.com/1420-3049/
18/5/4887 doi: 10.3390/molecules18054887
Chandrashekar, G., & Sahin, F. (2014). A survey on feature selection
methods. Computers & Electrical Engineering, 40(1), 16–28. Retrieved
from https://www.sciencedirect.com/science/
Ginsburg, S. B., Lee, G., Ali, S., & Madabhushi, A. (2016). Feature im-
portance in nonlinear embeddings (fine): Applications in digital
pathology. IEEE Transactions on Medical Imaging, 35(1), 76-88. doi:
10.1109/TMI.2015.2456188
Gu, Q., Li, Z., & Han, J. (2012). Generalized fisher score for feature selection.
arXiv preprint arXiv:1202.3725.
Gupta, Y. (2018). Selection of important features and predicting wine qual-
ity using machine learning techniques. Procedia Computer Science, 125,
305-312. Retrieved from https://www.sciencedirect.com/science/
article/pii/S1877050917328053 (The 6th International Conference
on Smart Computing and Communications) doi: https://doi.org/
10.1016/j.procs.2017.12.041
Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P.,
Cournapeau, D., . . . Oliphant, T. E. (2020). Array programming with
NumPy. Nature, 585(7825), 357–362. Retrieved from https://doi
.org/10.1038/s41586-020-2649-2 doi: 10.1038/s41586-020-2649-2
Hossin, M., & Sulaiman, M. (2015). A review on evaluation metrics for
data classification evaluations. International Journal of Data Mining &
Knowledge Management Process, 5(2), 1.
Hu, G., Xi, T., Mohammed, F., & Miao, H. (2016). Classification of wine
quality with imbalanced data. In 2016 ieee international conference
on industrial technology (icit) (p. 1712-1217). doi: 10.1109/ICIT.2016
.7475021
Hua, J., Tembe, W. D., & Dougherty, E. R. (2009). Performance of
feature-selection methods in the classification of high-dimension data.
Pattern Recognition, 42(3), 409–424. Retrieved from
https://www.sciencedirect.com/science/article/pii/S0031320308003142
doi: https://doi.org/10.1016/j.patcog.2008.08.001
Hunter, J. D. (2007). Matplotlib: A 2d graphics environment. Computing in
Science & Engineering, 9(3), 90–95. doi: 10.1109/MCSE.2007.55
International Coffee Organization. (2021, March). World coffee con-
sumption. https://www.ico.org/prices/new-consumption-table
.pdf. Retrieved April 7, 2021, from https://www.ico.org/prices/
new-consumption-table.pdf (Last checked on April 09, 2021)
Kluyver, T., Ragan-Kelley, B., Pérez, F., Granger, B., Bussonnier, M., Frederic,
J., . . . Willing, C. (2016). Jupyter notebooks – a publishing format for
reproducible computational workflows. In F. Loizides & B. Schmidt
(Eds.), Positioning and power in academic publishing: Players, agents and
agendas (p. 87 - 90).
Waskom, M., Botvinnik, O., O’Kane, D., Hobson, P., Lukauskas, S.,
Gemperline, D. C., . . . Qalieh, A. (2017, September). seaborn: v0.8.1
(September 2017). Zenodo. Retrieved from https://doi.org/10.5281/
zenodo.883859 doi: 10.5281/zenodo.883859
Wikström, N., Kainulainen, K., Razafimandimbison, S. G., Smedmark,
J. E. E., & Bremer, B. (2015, may). A Revised Time Tree of the Asterids:
Establishing a Temporal Framework For Evolutionary Studies of the
Coffee Family (Rubiaceae). PLOS ONE, 10(5), e0126690. Retrieved
from https://doi.org/10.1371/journal.pone.0126690
Wollni, M., & Zeller, M. (2007). Do farmers benefit from participating
in specialty markets and cooperatives? the case of coffee marketing
in costa rica. Agricultural Economics, 37(2-3), 243-248. Retrieved
from https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1574
-0862.2007.00270.x doi: https://doi.org/10.1111/j.1574-0862.2007
.00270.x
Worku, M., Duchateau, L., & Boeckx, P. (2016, oct). Reproducibility of coffee
quality cupping scores delivered by cupping centers in Ethiopia.
Journal of Sensory Studies, 31(5), 423–429. Retrieved from https://doi
.org/10.1111/joss.12226 doi: https://doi.org/10.1111/joss.12226
Yuriko, C., & I Dewa, M. B. A. D. (2020). Specialty coffee cupping score
prediction with general regression neural network (grnn). JELIKU (Ju-
rnal Elektronik Ilmu Komputer Udayana), 9(2), 185–190. Retrieved from
https://ojs.unud.ac.id/index.php/JLK/article/view/64485
Zhou, Z., & Hooker, G. (2021, 01). Unbiased measurement of feature
importance in tree-based methods. ACM Transactions on Knowledge
Discovery from Data, 15, 1-21. doi: 10.1145/3429445
6 appendix
appendix a
• Body | The quality of Body is based upon the tactile feeling of the
liquid in the mouth, especially as perceived between the tongue
and roof of the mouth. Most samples with heavy Body may also
receive a high score in terms of quality due to the presence of brew
colloids and sucrose. Some samples with lighter Body may also have
a pleasant feeling in the mouth, however. Coffees expected to be high
in Body, such as a Sumatra coffee, or coffees expected to be low in
Body, such as a Mexican coffee, can receive equally high preference
scores although their intensity rankings will be quite different.
• Wet Aroma – Refers to the smell of wet coffee grinds that form a
crust at the top of the glass after adding hot water.
• Finish – The lingering tastes or emerging tastes that come after the
mouth is cleared. Often called “aftertaste”.
• Clean Cup – This does not literally refer to dirt on the coffee; it is just
about flavor. Raw, funky coffees that are “unclean” in the flavor can
also be quite desirable, such as a wet-hulled Indonesia coffee from
Sumatra, or dry-processed Ethiopia and Yemeni types.
appendix b
Figure 10: Pearson correlation heatmap of CQI dataset after categorical variables
encoding
appendix c
Full set Uniformity clean cup Aroma Altitude Low Sweetness cupper points washed-pulped Flavor
7 Neighbors 81.00 81.66 81.66 80.57 91.48 88.65 90.83 90.61 52.40
5 Neighbors 82.75 82.75 81.66 80.79 92.14 89.30 91.70 91.27 52.84
3 Neighbors 85.15 84.28 82.53 82.10 93.01 87.99 92.36 91.70 52.84
mean 82.97 82.90 81.95 81.15 92.21 88.65 91.63 91.19 52.69
SE 0.85 0.54 0.21 0.34 0.31 0.27 0.31 0.22 0.10
Table 13: Accuracy of classification of specialty coffee quality with kNN and
Fisher scores using top 10 features
Full set Uniformity clean_cup Aroma Altitude Low Sweetness cupper points washed-pulped Flavor
7 Neighbors 81.66 80.35 81.22 81.00 79.26 78.60 78.38 77.95 76.86
5 Neighbors 81.88 81.44 81.22 80.79 79.91 79.26 78.60 78.60 77.95
3 Neighbors 84.06 84.06 82.97 83.19 80.35 79.69 78.38 78.38 78.17
mean 82.53 81.95 81.80 81.66 79.84 79.18 78.46 78.31 77.66
SE 0.54 0.78 0.41 0.54 0.22 0.22 0.05 0.14 0.29
Table 14: Accuracy of classification of specialty coffee quality with kNN and
Fisher scores using all features
appendix d
Features Position
Flavor 1
Aftertaste 2
cupper_points 3
clean_cup 4
Balance 5
Uniformity 6
Aroma 7
Acidity 8
Sweetness 9
Body 10
cat2_def 11
Moisture 12
Species 13
method_Natural / Dry 14
altitude_low_meters 15
color_Green 16
method_Washed / Wet 17
altitude_high_meters 18
cat1_def 19
method_Other 20
color_Blue-Green 21
color_Bluish-Green 22
quakers 23
method_Semi-washed / Semi-pulped 24
method_Pulped natural / honey 25
Full set Flavor Aftertaste cupper points clean cup Balance Uniformity Aroma Acidity
7 Neighbors 98.69 98.25 97.82 96.72 95.63 92.58 87.55 86.90 88.43
5 Neighbors 99.56 98.91 97.82 96.94 95.63 92.79 88.21 89.30 88.43
3 Neighbors 98.91 99.13 98.03 97.38 96.72 92.79 89.52 89.74 83.41
mean 99.05 98.76 97.89 97.02 96.00 92.72 88.43 88.65 86.75
SE 0.19 0.19 0.05 0.14 0.26 0.05 0.41 0.62 1.18
Table 16: Accuracy of classification of specialty coffee quality with kNN and
XGB importance top 10 features feature ablation from most important to least
important
Full set Body Sweetness Acidity Aroma Uniformity Balance clean cup cupper points
7 Neighbors 98.69 98.47 98.03 97.60 97.38 94.54 94.98 87.55 87.55
5 Neighbors 99.56 98.47 98.25 97.38 96.94 94.98 94.76 86.68 88.65
3 Neighbors 98.91 98.69 98.47 97.38 97.82 95.41 95.20 88.65 88.65
mean 99.05 98.54 98.25 97.45 97.38 94.98 94.98 87.63 88.28
SE 0.19 0.05 0.09 0.05 0.18 0.18 0.09 0.40 0.26
Table 17: Accuracy of classification of specialty coffee quality with kNN and XGB
importance top 10 features and feature ablation from least important to most
important
Full set cat2 def Moisture Species Natural / Dry altitude low Green Washed / Wet altitude high
7 Neighbors 76.42 75.33 72.93 72.93 71.83 70.09 70.52 69.43 63.10
5 Neighbors 75.76 76.20 73.36 73.58 73.58 71.40 70.09 69.00 65.94
3 Neighbors 77.07 77.29 73.80 73.80 74.02 72.05 70.96 67.47 65.72
mean 76.42 76.27 73.36 73.44 73.14 71.18 70.52 68.63 64.92
SE 0.27 0.40 0.18 0.19 0.47 0.41 0.18 0.42 0.64
Table 18: Accuracy of classification of specialty coffee quality with kNN and XGB
importance without coffee cupping scores top 10 features ablation
Full set Flavor cupper points clean cup Uniformity Aroma Acidity Sweetness Body
7 Neighbors 97.60 96.94 93.67 93.01 90.83 88.86 86.68 84.93 83.62
5 Neighbors 98.69 97.38 94.10 93.01 90.83 90.17 87.12 86.03 84.72
3 Neighbors 98.91 98.03 94.76 93.89 89.74 89.30 89.08 86.03 80.13
mean 98.40 97.45 94.18 93.30 90.47 89.45 87.63 85.66 82.82
SE 0.29 0.22 0.22 0.21 0.26 0.27 0.52 0.26 0.98
Table 19: Accuracy of classification of specialty coffee quality with kNN and XGB
importance with coffee cupping scores top 10 features ablation and excluding
proxy features: ’Aftertaste’ and ’Balance’
appendix e
Full set Flavor Uniformity clean cup Aftertaste cupper points Balance Acidity Aroma
7 Neighbors 97.82 97.38 96.29 91.92 91.05 87.77 85.15 85.37 86.03
5 Neighbors 97.82 97.82 96.29 92.58 92.14 89.08 85.81 86.46 86.03
3 Neighbors 98.25 98.25 96.29 93.01 91.92 90.39 86.90 87.34 87.55
mean 97.96 97.82 96.29 92.50 91.70 89.08 85.95 86.39 86.54
SE 0.10 0.18 0.00 0.22 0.24 0.53 0.36 0.40 0.36
Table 21: Accuracy of classification of specialty coffee quality with feature permu-
tation importance and top 10 features from most important to least important
Full set cat2 def Body Aroma Acidity Balance cupper points Aftertaste clean cup
7 Neighbors 97.82 98.03 98.03 98.03 97.38 96.29 96.29 95.63 94.98
5 Neighbors 97.82 98.69 98.25 98.25 96.94 96.51 96.72 95.20 95.20
3 Neighbors 98.25 98.25 98.47 98.25 97.82 97.60 96.51 94.76 94.76
mean 97.96 98.33 98.25 98.18 97.38 96.80 96.51 95.20 94.98
SE 0.10 0.14 0.09 0.05 0.18 0.29 0.09 0.18 0.09
Table 22: Accuracy of classification of specialty coffee quality with feature permu-
tation importance and top 10 features from least important to most important
Full set cat2 def Moisture altitude high altitude low Natural / Dry cat1 def Washed / Wet washed/pulped
7 Neighbors 75.11 74.67 70.74 70.52 65.94 65.72 56.55 56.55 48.69
5 Neighbors 75.33 75.11 73.80 70.52 66.16 65.28 56.55 56.99 49.13
3 Neighbors 76.64 77.07 72.49 70.74 66.59 66.16 53.28 57.86 49.78
mean 75.69 75.62 72.34 70.60 66.23 65.72 55.46 57.13 49.20
SE 0.34 0.52 0.63 0.05 0.14 0.18 0.77 0.27 0.22
Table 23: Accuracy of classification of specialty coffee quality with kNN and
feature permutation importance with coffee cupping scores top 10 features ablation
and excluding Coffee cupping scores.
Full set Flavor Uniformity clean cup cupper points Acidity Aroma Body cat2 def
7 Neighbors 97.82 96.51 95.20 92.79 89.30 89.08 88.65 84.93 82.75
5 Neighbors 98.25 96.94 95.63 93.23 89.96 89.08 89.08 83.41 79.69
3 Neighbors 98.47 98.03 95.20 93.67 91.48 90.17 89.30 85.37 83.19
mean 98.18 97.16 95.34 93.23 90.25 89.45 89.01 84.57 81.88
SE 0.14 0.32 0.10 0.18 0.46 0.26 0.14 0.42 0.78
Table 24: Accuracy of classification of specialty coffee quality with kNN and
feature permutation importance with coffee cupping scores top 10 features ablation
and excluding proxy features: ’Aftertaste’ and ’Balance’.
appendix f
Full set Flavor Fragrance Aroma Finish Complexity Sweetness Uniformity Brightness Clean cup
7 Neighbors 96.33 96.33 96.33 97.25 96.33 93.58 92.66 88.99 81.65 52.29
5 Neighbors 96.33 96.33 96.33 96.33 96.33 94.50 92.66 89.91 82.57 55.96
3 Neighbors 97.25 97.25 96.33 96.33 96.33 95.41 93.58 90.83 81.65 52.29
mean 96.64 96.64 96.33 96.64 96.33 94.50 92.97 89.91 81.96 53.52
SE 0.22 0.22 0.00 0.22 0.00 0.37 0.22 0.37 0.22 0.86
Full set Body Cup correction Clean cup Brightness Uniformity Sweetness Complexity Finish Aroma
7 Neighbors 96.33 96.33 95.41 95.41 96.33 96.33 97.25 96.33 94.50 94.50
5 Neighbors 96.33 96.33 97.25 97.25 96.33 96.33 97.25 96.33 94.50 95.41
3 Neighbors 96.33 97.25 97.25 98.17 97.25 96.33 97.25 96.33 95.41 93.58
mean 96.33 96.64 96.64 96.94 96.64 96.33 97.25 96.33 94.80 94.50
SE 0.00 0.22 0.43 0.57 0.22 0.00 0.00 0.00 0.22 0.37
appendix g
Full set Flavor Finish Complexity Fragrance Clean cup Uniformity Brightness Body Cuppers correction
7 Neighbors 97.25 96.33 97.25 96.33 95.41 95.41 94.50 90.83 89.91 88.99
5 Neighbors 97.25 96.33 97.25 97.25 96.33 95.41 95.41 88.07 89.91 86.24
3 Neighbors 97.25 97.25 98.17 98.17 96.33 94.50 95.41 87.16 90.83 86.24
mean 97.25 96.64 97.55 97.25 96.02 95.11 95.11 88.69 90.21 87.16
SE 0.00 0.22 0.22 0.37 0.22 0.22 0.22 0.78 0.22 0.65
Full set Aroma Sweetness Cuppers correction Body Brightness Uniformity Clean cup Fragrance Complexity
7 Neighbors 96.33 96.33 96.33 96.33 96.33 96.33 96.33 97.25 96.33 92.66
5 Neighbors 96.33 96.33 96.33 96.33 96.33 96.33 96.33 97.25 97.25 91.74
3 Neighbors 96.33 97.25 96.33 96.33 96.33 96.33 97.25 97.25 97.25 92.66
mean 96.33 96.64 96.33 96.33 96.33 96.33 96.64 97.25 96.94 92.35
SE 0.00 0.22 0.00 0.00 0.00 0.00 0.22 0.00 0.22 0.22
appendix h