PLOS ONE

RESEARCH ARTICLE

Deep learning-based idiomatic expression recognition for the Amharic language

Demeke Endalie1*, Getamesay Haile1, Wondmagegn Taye2

1 Faculty of Computing and Informatics, Jimma Institute of Technology, Jimma, Ethiopia, 2 Faculty of Civil and Environmental Engineering, Jimma Institute of Technology, Jimma, Ethiopia

* demeke.endalie@ju.edu.et

Abstract

Idiomatic expressions are built into all languages and are common in ordinary conversation. Idioms are difficult to understand because their meaning cannot be deduced directly from their source words. Previous studies have reported that idiomatic expressions affect many natural language processing tasks in the Amharic language. However, most natural language processing models used with the Amharic language, such as machine translation, semantic analysis, sentiment analysis, information retrieval, question answering, and next-word prediction, do not consider idiomatic expressions. In this paper, we therefore propose a convolutional neural network (CNN) with a FastText embedding model for detecting idioms in Amharic text. We collected 1700 idiomatic and 1600 non-idiomatic expressions from Amharic books to test the proposed model's performance, and the proposed model was evaluated on this dataset. We employed an 80/10/10 splitting ratio to train, validate, and test the proposed idiomatic recognition model. The proposed model's learning accuracy across the training dataset is 98%, and the model achieves 80% accuracy on the testing dataset. We compared the proposed model to machine learning models such as K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Random Forest classifiers. According to the experimental results, the proposed model produces promising results.

OPEN ACCESS

Citation: Endalie D, Haile G, Taye W (2023) Deep learning-based idiomatic expression recognition for the Amharic language. PLoS ONE 18(12): e0295339. https://doi.org/10.1371/journal.pone.0295339

Editor: Michael Flor, Educational Testing Service (ETS), UNITED STATES

Received: May 12, 2023; Accepted: November 20, 2023; Published: December 14, 2023

Peer Review History: PLOS recognizes the benefits of transparency in the peer review process; therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. The editorial history of this article is available here: https://doi.org/10.1371/journal.pone.0295339

Copyright: © 2023 Endalie et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement: This research work's dataset and source code are publicly available on GitHub at https://github.com/demekeendalie/Idiomatic-expression-.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

1. Introduction

In recent years, the development of deep learning in neural networks has improved performance in many natural language processing (NLP) tasks. In natural language processing, neural networks are used to develop machine translation, speech recognition, text generation, text mining, and named entity recognition.

An idiomatic expression is a phrase whose meaning may differ from the combination of the literal meanings of its composing words; the meaning of an idiom cannot be interpreted directly from the meanings of the words that construct it [1]. Idiomatic expressions are an important part of all natural languages [2]. Amharic is one of the languages grouped under the Semitic language family [3] and has more than 4000 idiomatic expressions. Detecting this type of expression in Amharic text helps individuals who are not familiar with the language. For example, the expression "ፊቱን ጣለዉ" (he drops his face) could be taken literally to mean that someone drops his face somewhere, but its actual meaning is "he becomes sad."
2. Related works
Idiomatic expression recognition from a given text plays an important role in implementing
tasks such as machine translation, speech recognition, sentiment analysis, and dialog systems
within the respective language. Idiom token classification involves determining if a phrase is
literal or idiomatic [4]. Salton et al. [4] used skip-thought vectors to create distributed representations with predictive features. Skip-thought vectors are generated using an encoder-decoder model: after receiving a training sentence, the encoder creates a vector. Encoders include uni-skip, bi-skip, and combine-skip; uni-skip reads the text from beginning to end, bi-skip concatenates forward and backward results, and combine-skip concatenates the two vectors. Their classifiers perform competitively using only the target phrase as input and are less dependent on discourse context, so this approach can be used to train a competitive general idiom token classifier.
The study [5] uses a computational search approach to examine idiomatic language identification in non-native English writings. Idioms are often employed in English as a Foreign Language (EFL) essays, and a search method that considers their syntactic and lexical flexibility enhances recall by 30%, at the cost of more false positives.
Fazly et al. [6] used the connection between idiomaticity and (in)flexibility to create statistical measures that automatically distinguish idiomatic from literal verb-plus-noun combinations. Verb-noun idiomatic combinations (VNICs) differ in flexibility but contrast with compositional phrases, which are more lexically productive and have a wider range of syntactic forms. Lexical and syntactic flexibility can thus be used as partial indicators of semantic analyzability and idiomaticity.
An algorithm is proposed for the automatic classification of idiomatic and literal expres-
sions [7]. It hypothesizes that high-ranking words in a text segment are less likely to be part of
an idiomatic expression. The algorithm uses Latent Dirichlet Allocation (LDA) to extract top-
ics from paragraphs containing idioms and literals. Idiomatic expressions are treated as
semantic outliers, and outlier detection is used to distinguish idioms from literals using local
semantic contexts.
In the study of [8], the authors proposed an idiomatic expression detection method based on the assumption that idioms and their literal counterparts do not occur in the same contexts. Their model first computes the inner product of context word vectors with the vector representing a target expression. Because literal vectors predict local contexts well, their inner product with contexts should be greater than that of idiomatic ones; this distinguishes literals from idioms. In word vector space, the method also computes literal and idiomatic scatter (covariance) matrices from local contexts. Because the scatter matrices represent context distributions, the authors used the Frobenius norm to calculate their difference.
The work of [9] presents a generalized model for determining whether an idiom is used fig-
uratively or literally based on the concept of semantic compatibility. They examine continuous
bag-of-words (CBOW’s) limitations regarding semantic compatibility measurement and pro-
pose a novel semantic compatibility model based on CBOW training for idiom usage recogni-
tion. Experiments on two benchmark idiom usage corpora reveal that the proposed
generalized model outperforms state-of-the-art per-idiom models at the time.
In [10], a model for detecting idiomatic phrases in written text was proposed. This paper
presents a binary classification approach for identifying idioms at the sentence level, offering
insights into contexts and unique properties. The authors aim to improve detection rates using
textual cohesion and compositionality measures. Textual cohesion refers to the grammatical
and semantic relationships between phrases or pieces of a text that contribute to unity and
coherence. Textual compositionality measurements, on the other hand, assess the composi-
tional aspects of a text. They used principal component analysis for idiom detection, linear discriminant analysis for discriminant subspace generation, and three nearest neighbor classifiers to obtain accuracy. They also analyzed the advantages and disadvantages of each technique, an analysis broader in scope than previous idiom identification algorithms.
Idiomatic expressions hamper language learning proficiency and degrade NLP task performance [11, 12]. However, to the best of our knowledge, no Amharic natural language processing model considers idiomatic expressions. This inspired us to create an Amharic idiomatic phrase identification system based on deep learning. This study focuses on constructing a CNN that uses the FastText model to detect the presence of idiomatic terms in an Amharic text. The overall contributions of the study are summarized as follows:
1. We prepare a general-purpose Amharic idiomatic expression dataset that other studies can use in the future.
2. We propose a deep learning model incorporating a CNN with FastText to recognize idioms in Amharic texts.
3. We evaluate the performance of the proposed recognition model with various evaluation metrics.
The remainder of the paper is structured as follows. Section 3 presents the learning models used in this work. Section 4 presents the comprehensive methodology of the proposed work in detail. Section 5 presents the experimental results and a discussion of them. Finally, section 6 concludes the paper.

3. Learning model
3.1. Convolutional neural network
We need a learning model to determine whether a particular phrase is idiomatic or not. A convolutional neural network (CNN) is an advanced neural network model that discovers patterns and relationships between data items based on their relative positions [13]. A CNN can automatically learn effective feature representations from massive amounts of text by exploiting the 1D structure (word order) in its convolutional layers: it captures local relationships among neighboring words within context windows, and, by using pooling layers, it extracts global features. A CNN is thus a neural network made up of several convolutional and pooling layers.
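To make the convolution-over-word-order idea concrete, the following minimal sketch (ours, not from the paper) runs a 1D convolution and max pooling over a toy sequence of word vectors; the filter count and kernel size here are illustrative choices, not the paper's configuration.

```python
import numpy as np
import tensorflow as tf

seq_len, embed_dim = 4, 300   # a 4-token phrase, 300-d word vectors
phrase = np.random.rand(1, seq_len, embed_dim).astype("float32")

model = tf.keras.Sequential([
    # Each filter slides over 2-word context windows, learning local patterns.
    tf.keras.layers.Conv1D(64, kernel_size=2, activation="relu",
                           input_shape=(seq_len, embed_dim)),
    # Max pooling keeps each filter's strongest response as a global feature.
    tf.keras.layers.GlobalMaxPooling1D(),
])
features = model(phrase)
print(features.shape)         # (1, 64)
```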

3.2. K-Nearest Neighbor classifier


K-Nearest-Neighbors is a basic yet effective non-parametric supervised classification tech-
nique. The KNN classifier is the most common pattern recognition classifier because of its
effective performance, efficient outputs, and simplicity. It is frequently utilized in pattern rec-
ognition, machine learning, text classification, data mining, object identification, and various
other domains [14]. The KNN method classifies by analogy, which means that it compares the
unknown data point to the training data points to which it is comparable. The Euclidean dis-
tance is used to calculate similarity. The attribute values are adjusted to prevent bigger range
characteristics from outweighing smaller range ones. In KNN classification, the unknown pat-
tern is assigned the most predominant class amongst the classes of its nearest neighbors. In the
event of a tie between two classes for the pattern, the class with the minimum average distance

PLOS ONE | https://doi.org/10.1371/journal.pone.0295339 December 14, 2023 3 / 14


PLOS ONE Idiomatic expression recognition

Fig 1. Support Vector.


https://doi.org/10.1371/journal.pone.0295339.g001

to the unknown pattern is assigned. A global distance function based on individual attributes
can be calculated by combining several local distance functions based on distance [15].
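As an illustration of the classification rule just described, here is a minimal sketch (ours, not the study's code): majority vote among the k nearest neighbors by Euclidean distance, with ties broken by the minimum average distance to the unknown pattern.

```python
import numpy as np
from collections import defaultdict

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]               # indices of the k nearest points
    votes, dist_sums = defaultdict(int), defaultdict(float)
    for i in nearest:
        votes[y_train[i]] += 1
        dist_sums[y_train[i]] += dists[i]
    best = max(votes.values())
    tied = [c for c, v in votes.items() if v == best]
    # Tie-break: class with the minimum average distance to the unknown pattern.
    return min(tied, key=lambda c: dist_sums[c] / votes[c])

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.2, 0.1])))    # -> 0
```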

3.3. Support Vector Machine


Support Vector Machines and Kernel methods have found a natural and effective coexistence
since their introduction in the early 90s. SVMs use kernels for learning linear predictors in
high-dimensional feature spaces [16]. The objective of the SVM algorithm is to find a hyperplane in N-dimensional space (where N is the number of features) that distinctly classifies the data points. Hyperplanes are decision boundaries that help classify the data points: data points on either side of the hyperplane can be attributed to different classes. The dimension of the hyperplane depends on the number of features. If the number of input features is two, the hyperplane is just a line; if the number of input features is three, the hyperplane becomes a two-dimensional plane. Fig 1 shows a sample decision boundary separation.

Fig 1. Support Vector. https://doi.org/10.1371/journal.pone.0295339.g001
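To make the hyperplane idea concrete, a minimal sketch (ours, not from the paper) fits a linear SVM on two-dimensional toy data, so the learned hyperplane w·x + b = 0 is a line.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]        # hyperplane parameters
print(w, b)                                   # boundary: w[0]*x1 + w[1]*x2 + b = 0
print(clf.predict([[0.5, 0.5], [3.5, 3.5]]))  # points on either side -> different classes
```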

3.4. Random Forest classifiers


A random forest is a technique used in modeling predictions and behavior analysis and is built
on decision trees. It contains many decision trees, each representing a distinct instance of the
classification of data input into the random forest. The random forest technique considers the
instances individually, taking the one with the majority of votes as the selected prediction [17].
A random forest generates a set of decision trees. To achieve diversity among the base decision trees, random forests use a randomization approach, which works well with bagging or random subspace methods [18]. To generate each tree in the random forest, the following points are considered: if the number of records in the training set is N, then N records are sampled at random with replacement from the original data. This bootstrap sample becomes the training set for growing the tree. If there are M input variables, a number m << M is selected; at each node, m variables are randomly selected from the M, and the best split over these m attributes is used to split the node. The value of m remains constant during forest growth, and each tree is grown to its maximum extent.
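The following sketch (ours, using only NumPy) illustrates the two randomization steps just described: bootstrap sampling of the N training records and random selection of m out of M variables at a node. The choice m = sqrt(M) is a common convention, not something the text specifies.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 10                            # records and input variables
X = rng.normal(size=(N, M))

boot_idx = rng.integers(0, N, size=N)   # N records sampled with replacement
X_boot = X[boot_idx]                    # bootstrap training set for one tree

m = int(np.sqrt(M))                     # m << M; sqrt(M) is a common choice
node_features = rng.choice(M, size=m, replace=False)  # variables tried at one node
print(boot_idx, node_features)
```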

4. Materials and methods


FastText is an open-source text representation and categorization framework developed by
Facebook’s AI Research (FAIR) team. Using the FastText word vector representation, we gen-
erate a vector for each word in the corpus that can be directly fed into any learning algorithm.
The goal of this study is to develop a deep learning model that uses FastText to detect the pres-
ence of idiomatic words in an Amharic text. Fig 2 below depicts the proposed idiomatic
expression recognition architecture for the Amharic language. Pre-processing, word embedding, and learning modules are the components of the proposed automatic idiomatic expression identification system, which covers tasks ranging from data gathering to evaluation.

Fig 2. The architecture of the proposed idiomatic expression recognition system. https://doi.org/10.1371/journal.pone.0295339.g002

4.1. Dataset
This study concentrates on detecting Amharic idiom types, that is, detecting whether a phrase is usually an idiom or usually literal. The dataset for the study was gathered and processed with this major goal in mind. It was collected from two Amharic books, "የአማርኛ ፈሊጦች" (Idiomatic Expressions in Amharic) and "ፍቅር እስከ መቃብር" (Love up to the Grave) [3, 19], whose authors had already annotated the expressions as idiomatic or literal. The idiomatic terms used in this study were hand-picked and cross-checked from these published books.
Most idiomatic expressions in our source books have two to four tokens, so the dataset contains only idiomatic expressions of that length. There are more than four thousand idioms in the Amharic language. In this study, we used 3300 isolated Amharic phrases to train and evaluate the proposed model: 1700 were idiomatic expressions, and 1600 were phrases containing terms found in idiomatic expressions but not used idiomatically. Both the idiomatic and the non-idiomatic phrases were collected from the aforementioned publications.
Let us use an example to clarify the distinction between idiomatic and non-idiomatic phrases in the dataset. "እግረ ደረቅ ነው" (igire derek'i newi) is an idiomatic phrase: interpreting each word literally yields "the feet are dry," but the actual meaning of the phrase is "unlucky." "እግረ አባጣ ነው" (igirē ābat'a newi) is a non-idiomatic phrase meaning "he has elephantiasis"; this is the only meaning that can be derived from the literal meanings of its words. Although both phrases use the same word, "እግረ" (leg), the first phrase does not refer to the word's real meaning, whereas the second phrase does. We use text processing activities such as stop word removal to eliminate terms such as "ነው" (newi) that have no bearing on the meaning of the statement.
Idiomatic phrases were labeled zero, and non-idiomatic phrases were labeled one in the dataset. Table 1 shows the distribution of idiomatic expression lengths in the collection.

Table 1. The dataset's idiomatic expression length distribution.

Token count in the idiomatic expression | Total number of idiomatic expressions of the given length
Two tokens (words) | 1620
Three tokens (words) | 70
Four tokens (words) | 10

The dataset's non-idiomatic phrases are also two to four tokens long. After collecting the data from the books, we apply the following preprocessing modules to clean the data and make the learning phase as easy as possible.

https://doi.org/10.1371/journal.pone.0295339.t001
i. Normalization. The Amharic writing system has different letters that are read with the same pronunciation, and there are no rules that distinguish their meanings. As a result, these letters may represent the same concept or the same name of an object in Amharic. For instance, the word "power" can be written as ሃይል፣ ሀይል፣ ሐይል፣ ሓይል፣ ኀይል፣ ኃይል; these six forms do not differ in meaning but use different letters with the same phonetics. The normalization of Amharic characters is similar to that of other Semitic languages, such as Arabic and Hebrew [20]. Such variant spellings inflate the number of features extracted for processing or analysis. To remove this redundancy, we normalize characters with the same pronunciation to one canonical letter, as shown in Table 2 below [21]. Normalization thus reduces the number of distinct features in the gathered dataset.

Table 2. Normalization of characters having the same pronunciation.

Canonical character | Characters with the same pronunciation as the canonical character
ሀ (hā) | ሃ፣ ኃ፣ ኀ፣ ሐ፣ ሓ
ሰ (še) | ሠ
አ (ā) | ኣ፣ ዐ፣ ዓ
ጸ (ts'e) | ፀ
ው (wu) | ዉ

https://doi.org/10.1371/journal.pone.0295339.t002
ii. Stemming. Stemming is the process of reducing inflected words to their stem, base, or root form. Amharic is one of the morphologically rich Semitic languages [22]; different words can share the same stem, so stemming helps reduce the size of the feature space for processing. In this study, we used the HornMorpho stemmer developed by Michael Gasser [23]. HornMorpho is a Python library developed to analyze three Ethiopian languages: Amharic, Afan Oromo, and Tigrigna.
iii. Remove stop words. In Amharic, common words such as "ሁሉ፣ እስከ፣ ነዉ፣ ሆነ" that carry little weight in text processing tasks are called stop words. In this study, stop words are terms that do not contribute to the semantics of a given idiomatic phrase but fill out the grammatical structure of the idiomatic statement. Stop words are eliminated to save the computational time wasted in processing them. Amharic does not have a well-prepared list of stop words, so we removed the stop words prepared by [21]. In addition to stop word removal, we also replace numerals with their names in alphabetic characters ("ፊደል"), because the pre-trained FastText-based word embedding employs only Amharic alphabetic characters; to obtain feature vectors for numbers, the numbers must therefore be translated to their alphabetic names. For example, in "2 አይን" (two eyes), 2 is changed to "ሁለት" (two), producing "ሁለት አይን". This replacement is done by keeping a map of the key-value relation between each digit and its alphabetic description.
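The following sketch (ours, not the study's code) shows one way to implement that digit-to-name map; the Amharic digit names used are the standard ones, but the exact mapping and the handling of multi-digit numbers in the study are assumptions.

```python
import re

# Standard Amharic names for the ten digits (assumed mapping).
DIGIT_NAMES = {"0": "ዜሮ", "1": "አንድ", "2": "ሁለት", "3": "ሶስት", "4": "አራት",
               "5": "አምስት", "6": "ስድስት", "7": "ሰባት", "8": "ስምንት", "9": "ዘጠኝ"}

def replace_digits(text: str) -> str:
    # Replace each digit with its alphabetic Amharic name.
    return re.sub(r"\d", lambda m: DIGIT_NAMES[m.group()], text)

print(replace_digits("2 አይን"))   # -> "ሁለት አይን"
```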

4.2. Text representation


Texts must be encoded before they can be passed as input to machine learning and deep learning models [24]. One text encoding algorithm that changes a given text into a vector is word2vec, a set of neural network models used to represent words in a vector space. Words with similar contexts are clustered together, while words without contextual similarity lie far apart in the vector space. However, word2vec fails to generate vectors for words that are not in the training vocabulary.
FastText is one of the state-of-the-art word embedding models developed by Facebook, which provides pre-trained FastText embedding models for 157 languages, including Amharic. FastText's strength is that it can create a vector for a given term even if it is not in the training vocabulary; this is achieved by considering the character-level n-grams of the term. Each expression's embedding is constructed using the pre-trained FastText embedding. The embedding is then saved as a matrix in which the rows correspond to the expressions in the dataset and the columns correspond to the embedding dimension used by the embedding technique. The embedding is built before splitting the dataset into training and testing sets.
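As an illustration, here is a minimal sketch (ours) of building such an embedding matrix with the official fasttext package and Facebook's published pre-trained Amharic model (cc.am.300.bin); get_sentence_vector returns one 300-d vector per phrase, and subword n-grams cover out-of-vocabulary words.

```python
import numpy as np
import fasttext

# Pre-trained Amharic model from the FastText 157-language release.
ft = fasttext.load_model("cc.am.300.bin")

phrases = ["ፊቱን ጣለዉ", "እግረ ደረቅ ነው"]        # example phrases from this paper
X = np.vstack([ft.get_sentence_vector(p) for p in phrases])
print(X.shape)                                # (number of phrases, 300)
```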

5. Results and discussions


All experiments are carried out in a Windows 10 environment on a machine equipped with a
Core i7 processor and 16 GB of RAM. The accuracy, precision, recall, and f1-score are used to
assess the performance of the models used in this study. The formulas used to calculate them
are shown in Table 3 below.

Table 3. Performance evaluation metrics.

Evaluation metric | Formula
Accuracy | accuracy = (Tp + Tn) / (Tp + Tn + Fp + Fn)
Precision | precision = Tp / (Tp + Fp)
Recall | recall = Tp / (Tp + Fn)
F1-score | f1-score = 2 × (recall × precision) / (recall + precision)

https://doi.org/10.1371/journal.pone.0295339.t003
Where Tp denotes true positive, Tn denotes true negative, Fp denotes false positive, and Fn
denotes false negative. The experimental configurations required to construct the proposed
Amharic idiomatic expression recognition system are determined during experimentation
using grid search-based adjustment.

5.1. Training and validating the model

We divided the data with a split ratio of 80%, 10%, and 10% for training, validating, and testing the proposed model, respectively. We did not employ k-fold cross-validation to assess the performance of the proposed model because the dataset is small, and holding out a portion of it for model evaluation indirectly reduces the amount of data available for training [25]. We used unique idiomatic and non-idiomatic terms (those not included in the training set) to validate and evaluate the proposed idiomatic recognition model. We tuned the hyperparameters using a grid search strategy to train the proposed CNN model. The training, validating, and testing sets are separated after producing the FastText embedding of each expression in the dataset. There are 1360, 168, and 172 instances of idiomatic expressions used in training, validating, and testing, respectively; the corresponding non-idiomatic counts are 1280, 160, and 160. The values of the hyperparameters used in this study are shown in Table 4 below.

Table 4. Hyperparameter values of the CNN model.

Hyperparameter | Value
Embedding dimension | 300
Number of filters | 265
Batch size | 16
Dropout | 0.5
Activation | Sigmoid
Optimizer | Adam
Epochs | 100
Loss | Binary cross entropy

https://doi.org/10.1371/journal.pone.0295339.t004
To feed the training, validating, and testing data to the CNN model, we first create three matrices with dimensions of (number of training samples, 300), (number of validating samples, 300), and (number of testing samples, 300). The dimension 300 is chosen because it must equal the embedding dimension of the CNN, as listed in Table 4 above. Next, a loop fills these matrices with the FastText embedding vectors of the idiomatic and non-idiomatic words. Following that, we train the CNN model by supplying the training data matrix and training labels, and we validate its performance by passing the validation set's matrix, using the hyperparameter values shown in Table 4.
The model’s training accuracy and loss are then displayed in Figs 3 and 4 below after the
model has been trained using the above-mentioned parameters and training dataset. Since the
training accuracy grows as the number of epochs increases, the model learns from the data
well. In addition, as the number of epochs rises, the training loss declines. This shows that the
model picks up on idiomatic expression features from the training set.

5.2. Testing the model


With the testing dataset and the evaluation metrics listed in Table 3 above, we assess the effectiveness of the proposed Amharic idiomatic expression recognition model. Fig 5 below shows the experimental results of how well the proposed scheme performed in terms of accuracy, precision, recall, and f1-measure.

Fig 5. Evaluation of the performance of the proposed model with accuracy, precision, recall, and f1-score. https://doi.org/10.1371/journal.pone.0295339.g005
As shown in Fig 5 above, the proposed Amharic idiomatic expression recognition system, which makes use of a CNN with FastText word embedding, achieved an accuracy, precision, recall, and f1-score of 80%, 70%, 77.78%, and 73.68%, respectively. The 80% accuracy means that the proposed approach correctly detected 266 of the 332 idiomatic and non-idiomatic phrases in the testing dataset and failed to recognize the remaining 66 testing instances. The results show that deep learning with FastText embedding can detect Amharic idiomatic expressions in a text with better results across multiple quality criteria. This means that researchers and application developers can utilize deep learning-based idiomatic expression identification models to improve the performance of their models or applications.

5.3. Comparison of the performance of the model with other models


We must consider two factors to justify that a model works effectively [26]: 1) examining the model's numerical output, and 2) contrasting its performance with that of
other models applied to the same dataset by other studies. As a result, we contrasted the new
model’s performance with some of the machine learning models employed in earlier studies
[27]. We compare the proposed model against KNN, SVM, and Random Forest classifiers.
The vectors generated by the pre-trained FastText retain the meaning of each idiomatic and non-idiomatic phrase. These vectors for the training and testing sets, along with their labels, are passed to the machine learning algorithms employed in this study, as explained in section 4.2.
The hyperparameter values of these learning models were adjusted using a grid search-based
tuning approach. Table 5 shows the optimal hyperparameter values found using this grid-
searching approach.
Table 6 below shows the detection accuracy of these learning models using the hyperparameter values from Table 5, and it compares their detection accuracy against the proposed deep learning model with FastText-based word embedding for Amharic idiom detection. The values 72%, 68%, and 76% indicate the percentage of the testing set correctly detected by the Random Forest, KNN, and SVM models, respectively. For the aforementioned machine learning algorithms to label a given phrase as idiomatic or non-idiomatic, we must first use FastText to create a vector of the phrase with the same vector dimension as the one used for training.
All the results in Table 6 are produced with the same dataset and the same word embedding model, FastText. The results show that the proposed algorithm improves detection accuracy by 8%, 12%, and 4% compared to Random Forest, KNN, and SVM, respectively. This is because (1) employing FastText to generate word vectors yields important features, and (2) text features processed by the CNN better represent high-level characteristics of the given idiomatic and non-idiomatic phrases.

Table 5. The optimal hyperparameter values for the KNN, SVM, and Random Forest learning models.

Learning algorithm | Optimal hyperparameter values
KNN | n_neighbors = 2, weights = 'uniform', defaults for other parameters
SVM | kernel = 'rbf', C = 1.0, gamma = 1, defaults for other parameters
Random Forest classifier | n_estimators = 300, class_weight = None, defaults for other parameters

https://doi.org/10.1371/journal.pone.0295339.t005
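For reference, a minimal sketch (ours) instantiating the three baselines with the Table 5 values; X_train, y_train, X_test, y_test are assumed to be the FastText phrase matrices and labels described earlier.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Baselines configured with the optimal values from Table 5.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=2, weights="uniform"),
    "SVM": SVC(kernel="rbf", C=1.0, gamma=1),
    "Random Forest": RandomForestClassifier(n_estimators=300, class_weight=None),
}
# for name, clf in models.items():
#     clf.fit(X_train, y_train)
#     print(name, clf.score(X_test, y_test))
```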


Table 6. Comparison of the proposed model with SVM, KNN, and Random Forest.

Model | Accuracy
Random Forest | 72%
KNN | 68%
SVM | 76%
Proposed model | 80%
https://doi.org/10.1371/journal.pone.0295339.t006

In addition to this, we compared the performance of the proposed idiomatic recognition model (CNN with FastText embedding) with other word representations, namely Term Frequency-Inverse Document Frequency (TF-IDF) and one-hot encoding vectors. Here, TF = (number of repetitions of a word in a document) / (number of words in the document), and IDF = log((number of documents) / (number of documents containing the word)). The TF-IDF is calculated for each of the 3300 Amharic phrases used in this study. To produce our dataset's TF-IDF matrix, we first create a list of unique words and then compute the TF-IDF of each word in each phrase. The generated matrix has dimensions equal to the total number of phrases in the dataset by the total number of unique words. We append the label of each phrase to the end of its row, divide this matrix into training and testing sets, and train each of the models used in this study. The results are shown in Table 7 below.

Table 7. Comparison of different word vector representations.

Model | Word embedding | Recognition accuracy
CNN | FastText | 80%
CNN | TF-IDF | 74%
CNN | One-hot encoding | 71.3%

https://doi.org/10.1371/journal.pone.0295339.t007
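A minimal sketch (ours) of this TF-IDF representation using scikit-learn follows; note that TfidfVectorizer applies a smoothed IDF, so its values differ slightly from the plain formula above, and the two phrases shown are placeholders for the study's 3300 phrases.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

phrases = ["ፊቱን ጣለዉ", "እግረ ደረቅ ነው"]     # placeholder phrases
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(phrases)  # (number of phrases, number of unique words)
print(X_tfidf.shape)
```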
According to the results in Table 7 above, the CNN with FastText is more effective at identifying idioms in the Amharic language. The FastText embedding preserves the contextual meaning of every phrase in the dataset; therefore, it outperforms both the TF-IDF and one-hot encoding approaches. This is because the features of idiomatic expressions in the Amharic language are captured better with the help of FastText's embedding [28]. By combining the benefits of a CNN with the pre-trained FastText embedding model, the proposed approach detected Amharic idioms with higher accuracy than the other machine learning models and word vector representations. Even though the proposed model performed well in detecting idioms, it should be supplemented with more idiomatic expressions annotated by Amharic linguistic specialists to improve its performance further. The enlarged model can then be used in Amharic machine translation, question answering, and sentiment analysis models or applications.

6. Conclusion
Different NLP models are now being developed for the Amharic language without considering
idiomatic expressions. Models that do not take idiomatic recognition into account may pro-
duce incorrect results since the actual meaning of the expression differs from the meaning of
each word that makes up the expression. Idioms are one of the most fascinating and difficult
aspects of Amharic vocabulary. Machine learning algorithms do not process text as input, so
they require encoding of texts into another format. We produced a vector of each word used
in this study using pre-trained FastText word embedding as part of this encoding. The
experimental findings show that compared to models utilized in this study, the proposed CNN
with the FastText embedding model is more effective at detecting Amharic idioms. The pro-
posed approach can, therefore, be applied to natural language processing tasks requiring the
detection of idiomatic expressions, such as machine translation, sentiment analysis, and ques-
tion-answering systems. Potentially, the model’s performance could be improved by training
on more data. In the future, other datasets from Amharic holy books, such as the Amharic
Bible, will be included in the aforementioned model to improve its performance. Furthermore,
we propose to use this model as a component in Amharic machine translation.

Supporting information
S1 Appendix. Amharic stop words eliminated in this study, with the English translation of each stop word.
(PDF)

Acknowledgments
The authors would like to thank the Jimma Institute of Technology for supporting them
through different resources. The authors would like to thank Jimma University for its support
during the research work.

Author Contributions
Conceptualization: Demeke Endalie.
Data curation: Demeke Endalie, Wondmagegn Taye.
Formal analysis: Demeke Endalie.
Funding acquisition: Demeke Endalie.
Investigation: Demeke Endalie.
Methodology: Demeke Endalie.
Resources: Demeke Endalie.
Software: Demeke Endalie.
Supervision: Getamesay Haile, Wondmagegn Taye.
Validation: Demeke Endalie.
Visualization: Demeke Endalie.
Writing – original draft: Demeke Endalie, Wondmagegn Taye.
Writing – review & editing: Demeke Endalie, Getamesay Haile, Wondmagegn Taye.

References
1. Titone Debra A., Lovseth Kyle, Kasparian Kristina, Tiv Mehrgol, "Are figurative interpretations of idioms directly retrieved, compositionally built, or both? Evidence from eye movement measures of reading," Canadian Journal of Experimental Psychology, vol. 73, no. 4, pp. 216–230, 2019. https://doi.org/10.1037/cep0000175 PMID: 31192627
2. Yağiz Oktay, "Language, Culture, Idioms, and Their Relationship with the Foreign Language," Journal
of Language Teaching and Research, vol. 4, no. 4, pp. 953–957, 2013.
3. Dagnachew Amsalu, Worku Akililu, የአማርኛ ፈሊጦች Idiomatic expressions in Amharic, Addis Ababa,
Ethiopia: Kuraz Publishing Agency, 1993.

4. Salton Giancarlo, Ross Robert, Kelleher John, "Idiom Token Classification using Sentential Distributed
Semantics," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers), Berlin, Germany, 2016.
5. Flor Michael, Klebanov Beata Beigman, "Catching Idiomatic Expressions in EFL Essays," in Proceed-
ings of the Workshop on Figurative Language Processing, New Orleans, Louisiana, 2018.
6. Fazly Afsaneh, Stevenson Suzanne, "Automatically Constructing a Lexicon of Verb Phrase Idiomatic
Combinations," in 11th Conference of the European Chapter of the Association for Computational Lin-
guistics, Trento, Italy, 2006.
7. Peng Jing, Feldman Anna, Vylomova Ekaterina, "Classifying Idiomatic and Literal Expressions Using
Topic Models and Intensity of Emotions," in Proceedings of the 2014 Conference on Empirical Methods
in Natural Language Processing (EMNLP), Doha, Qatar, 2014.
8. Peng Jing, Feldman Anna, "Automatic Idiom Recognition with Word Embeddings," Information Man-
agement and Big Data, vol. 656, p. 17–29, 2016.
9. Liu Changsheng, Hwa Rebecca, "A Generalized Idiom Usage Recognition Model Based on Semantic
Compatibility," in The Thirty-Third AAAI Conference on Artificial Intelligence, Hawaii, USA., 2019.
10. Feldman Anna, Peng Jing, "Automatic Detection of Idiomatic Clauses," Computational Linguistics and
Intelligent Text Processing, vol. 7816, p. 435–446, 2013.
11. Thyab Rana Abid, "The Necessity of Idiomatic Expressions to English Language Learners," International Journal of English and Literature, vol. 7, no. 7, pp. 106–111, 2016.
12. Zeng Ziheng, Bhat Suma, "Idiomatic Expression Identification using Semantic Compatibility," Transac-
tions of the Association for Computational Linguistics, vol. 9, p. 1546–1562, 2021.
13. Yamashita Rikiya, Nishio Mizuho, Do Richard Kinh Gian, Togashi Kaori, "Convolutional neural networks: an overview and application in radiology," Insights into Imaging, vol. 9, pp. 611–629, 2018. https://doi.org/10.1007/s13244-018-0639-9 PMID: 29934920
14. García-Laencina Pedro J., Sancho-Gómez José-Luis, Figueiras-Vidal Aníbal R., Verleysen Michel, "K nearest neighbours with mutual information for simultaneous classification and missing data imputation," Neurocomputing, vol. 72, no. 7–9, pp. 1483–1493, 2009.
15. Hota Soudamini, Pathak Sudhir, "KNN classifier based approach for multi-class sentiment analysis of
twitter data," International Journal of Engineering & Technology, vol. 7, no. 3, pp. 1372–1375, 2018.
16. Kok Zhi Hong, Shariff Abdul Rashid Mohamed, Alfatni Meftah Salem M., Khairunniza-Bejo Siti, "Support Vector Machine in Precision Agriculture: A review," Computers and Electronics in Agriculture, vol. 191, 2021.
17. Kulkarni Vrushali Y., Sinha Pradeep K., "Random Forest Classifiers: A Survey and Future Research Directions," International Journal of Advanced Computing, vol. 36, no. 1, pp. 1144–1153, 2013.
18. Panhalkar Archana R., Doye Dharmpal D., "A novel approach to build accurate and diverse decision
tree forest," Evolutionary Intelligence, vol. 15, p. 439–453, 2022. https://doi.org/10.1007/s12065-020-
00519-0 PMID: 33425041
19. Alemayehu Haddis, ፍቅር እስከ መቃብር (Love up to the grave), Addis Ababa, Ethiopia: Mega Publishing Agency, 2004.
20. Hegazi Mohamed Osman, Al-Dossari Yasser, Al-Yahy Abdullah, Al-Sumari Abdulaziz, Hilal Anwer, "Preprocessing Arabic text on social media," Heliyon, vol. 7, no. 2, p. e06191, 2021. https://doi.org/10.1016/j.heliyon.2021.e06191 PMID: 33644469
21. Endalie Demeke, Haile Getamesay, Abebe Wondmagegn Taye, "Feature selection by integrating docu-
ment frequency with genetic algorithm for Amharic news document classification," PeerJ Computer Sci-
ence, vol. 8, p. e961, 2022. https://doi.org/10.7717/peerj-cs.961 PMID: 35634124
22. Tachbelie Martha Yifiru, Menzel Wolfgang, "Amharic Part-of-Speech Tagger for Factored Language Modeling," in International Conference RANLP, Borovets, Bulgaria, 2009.
23. Gasser Michael, "HornMorpho: a system for morphological processing of Amharic, Oromo, and Tigrinya," in Conference on Human Language Technology for Development, Alexandria, Egypt, 2011.
24. Wang Haitao, He Jie, Zhang Xiaohong, Liu Shufen, "A Short Text Classification Method Based on N-Gram and CNN," Chinese Journal of Electronics, vol. 29, no. 2, pp. 248–254, 2020.
25. Berrar Daniel, "Cross-Validation," Encyclopedia of Bioinformatics and Computational Biology, vol.
1, pp. 542–545, 2019.
26. Valk Ton Van der, Van Driel Jan H., Vos Wobbe De, "Common Characteristics of Models in Present-
day Scientific Practice," Research in Science Education, vol. 37, no. 4, pp. 469–488, 2007.
27. Saigal Pooja, Khanna Vaibhav, "Multi-category news classification using Support Vector Machine
based classifiers," SN Applied Sciences, vol. 2, no. 3, pp. 458–468, 2020.

28. Athiwaratkun Ben, Wilson Andrew, Anandkumar Anima, "Probabilistic FastText for Multi-Sense Word
Embeddings," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguis-
tics, Melbourne, Australia, 2018.
