Journal Pone 0295339
RESEARCH ARTICLE
* demeke.endalie@ju.edu.et
Abstract
Idiomatic expressions are built into all languages and are common in ordinary conversation.
Idioms are difficult to understand because they cannot be deduced directly from the source
word. Previous studies reported that idiomatic expressions affect many natural language
processing (NLP) tasks in the Amharic language. However, most natural language processing
models used with the Amharic language, such as machine translation, semantic analysis,
sentiment analysis, information retrieval, question answering, and next-word prediction, do
not consider idiomatic expressions. As a result, in this paper, we proposed a convolutional
neural network (CNN) with a FastText embedding model for detecting idioms in an Amharic
text. We collected 1700 idiomatic and 1600 non-idiomatic expressions from Amharic books
to test the proposed model’s performance. The proposed model is then evaluated using this
dataset. We employed an 80/10/10 splitting ratio to train, validate, and test the proposed
idiomatic recognition model. The proposed model’s learning accuracy across the training
dataset is 98%, and the model achieves 80% accuracy on the testing dataset. We compared
the proposed model to machine learning models like K-Nearest Neighbor (KNN), Support
Vector Machine (SVM), and Random Forest classifiers. According to the experimental
results, the proposed model produces promising results.
OPEN ACCESS
Citation: Endalie D, Haile G, Taye W (2023) Deep learning-based idiomatic expression recognition for
the Amharic language. PLoS ONE 18(12): e0295339. https://doi.org/10.1371/journal.pone.0295339
Editor: Michael Flor, Educational Testing Service (ETS), UNITED STATES
Received: May 12, 2023
Accepted: November 20, 2023
Published: December 14, 2023
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
“ፊቱን ጣለዉ” (he drops his face) can be directly taken as the guy drops his face somewhere,
but the actual meaning is “he becomes sad.”
2. Related works
Idiomatic expression recognition from a given text plays an important role in implementing
tasks such as machine translation, speech recognition, sentiment analysis, and dialog systems
within the respective language. Idiom token classification involves determining if a phrase is
literal or idiomatic [4]. Salton et al. [4] used Skip-Thought Vectors to create distributed
representations with predictive features. Skip-thought vectors are generated using an
encoder-decoder model: after receiving the training sentence, the encoder creates a vector.
The encoder variants are uni-skip, bi-skip, and combine-skip. Uni-skip reads the text from
beginning to end, bi-skip concatenates forward and backward results, and combine-skip
concatenates the uni-skip and bi-skip vectors. The resulting classifiers perform competitively,
using only the target phrase as input, and are less dependent on discourse context. This
approach can be used to train a competitive general idiom token classifier.
The study [5] uses a computational search approach to examine idiomatic language
identification in non-native English writings. Idioms are often employed in English as a
Foreign Language (EFL) essays, and a search method that considers their syntactic and lexical
flexibility enhances recall by 30% but also increases false positives.
Fazly and Stevenson [6] used the connection between idiomaticity and (in)flexibility to create
statistical measures to automatically distinguish idiomatic from literal verb plus noun
combinations. Verb-noun idiomatic combinations (VNICs) differ in flexibility but contrast with
compositional phrases, which are more lexically productive and have a wider range of syntactic
forms. Lexical and syntactic flexibility can be used as partial indicators of semantic
analyzability and idiomaticity.
An algorithm is proposed for the automatic classification of idiomatic and literal expres-
sions [7]. It hypothesizes that high-ranking words in a text segment are less likely to be part of
an idiomatic expression. The algorithm uses Latent Dirichlet Allocation (LDA) to extract top-
ics from paragraphs containing idioms and literals. Idiomatic expressions are treated as
semantic outliers, and outlier detection is used to distinguish idioms from literals using local
semantic contexts.
In [8], the authors proposed an idiomatic expression detection method based
on the assumption that idioms and their literal counterparts do not occur in the same contexts.
Their model first computes the inner product of the context word vectors with the vector
representing a target expression. Because literal vectors predict local contexts well, their inner
product with contexts should be greater than that of idiomatic ones. This distinguishes literals
from idioms. The model also computes literal and idiomatic scatter (covariance) matrices
from local contexts in word vector space. Because the scatter matrices represent context
distributions, the authors used the Frobenius norm to calculate their difference.
The work of [9] presents a generalized model for determining whether an idiom is used fig-
uratively or literally based on the concept of semantic compatibility. They examine continuous
bag-of-words (CBOW’s) limitations regarding semantic compatibility measurement and pro-
pose a novel semantic compatibility model based on CBOW training for idiom usage recogni-
tion. Experiments on two benchmark idiom usage corpora reveal that the proposed
generalized model outperforms state-of-the-art per-idiom models at the time.
In [10], a model for detecting idiomatic phrases in written text was proposed. This paper
presents a binary classification approach for identifying idioms at the sentence level, offering
insights into contexts and unique properties. The authors aim to improve detection rates using
textual cohesion and compositionality measures. Textual cohesion refers to the grammatical
and semantic relationships between phrases or pieces of a text that contribute to unity and
coherence. Textual compositionality measurements, on the other hand, assess the composi-
tional aspects of a text. They used principal component analysis for idiom detection, linear dis-
criminant analysis for discriminant subspace generation, and three nearest neighbor classifiers
to obtain accuracy. They also analyzed the advantages and disadvantages of each technique,
which are broader than previous idiom identification algorithms.
Idiomatic expressions in a language have a detrimental impact on language learning
proficiency and NLP task performance [11, 12]. However, to the best of our knowledge,
no Amharic natural language processing model considers idiomatic expressions. This inspired
us to create an Amharic idiomatic phrase identification system based on deep learning. This
study focuses on constructing a CNN using the FastText model to detect the presence of
idiomatic terms in an Amharic text. The overall contributions of the study are summarized as
follows:
1. Prepare a general-purpose Amharic idiomatic expression dataset that can be used by other
studies in the future.
2. Propose a deep learning model incorporating CNN with FastText to recognize idioms
from Amharic texts.
3. Evaluate the performance of the proposed recognition model with various evaluation
metrics.
The remainder of the paper is structured as follows. Section 2 presents the state-of-the-art
learning models. Section 3 presents the comprehensive methodology of the planned work in detail.
Section 4 presents the experimental results and a discussion of them. Finally, section 5 is
the conclusion.
3. Learning model
3.1. Convolutional neural network
We need a learning model to determine whether a particular phrase is idiomatic or not. A
convolutional neural network is an advanced neural network model that discovers patterns and
relationships between data items based on their relative positions [13]. A CNN can automatically
learn effective feature representations from massive text using the 1D structure (word order) in
the convolutional layer. It captures local relationships among neighboring words in terms of
context windows, and by using pooling layers, it extracts global features. A CNN is a neural
network made up of several convolutional and pooling layers.
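The convolution-then-pooling idea described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the 2-dimensional toy embeddings, the single hand-written filter, and the function names `conv1d` and `global_max_pool` are assumptions made only for illustration.

```python
def conv1d(sequence, kernel, bias=0.0):
    """Slide a (window x dim) kernel over a (length x dim) sequence, then apply ReLU."""
    window = len(kernel)
    features = []
    for start in range(len(sequence) - window + 1):
        total = bias
        for offset in range(window):
            # Dot product between one kernel row and one token embedding.
            total += sum(w * x for w, x in zip(kernel[offset], sequence[start + offset]))
        features.append(max(0.0, total))  # ReLU activation
    return features


def global_max_pool(features):
    """Keep the strongest response of the filter over all window positions."""
    return max(features)


# Toy phrase: 3 tokens with 2-dimensional embeddings (the paper uses 300 dimensions).
phrase = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# One filter covering a context window of 2 neighboring tokens.
kernel = [[1.0, 0.0], [0.0, 1.0]]

feature_map = conv1d(phrase, kernel)
pooled = global_max_pool(feature_map)
print(feature_map, pooled)
```

Each entry of `feature_map` summarizes one context window of neighboring tokens, and the pooled value is a single global feature for the whole phrase. A real model would learn many such filters.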
to the unknown pattern is assigned. A global distance function based on individual attributes
can be calculated by combining several local distance functions based on distance [15].
embedding, and learning modules are all components of the proposed automatic idiomatic
expression identification system. The tasks in the proposed model range from data gathering
to evaluation.
4.1. Dataset
This study concentrated on detecting Amharic idiom types. This means detecting whether a
phrase is usually an idiom or usually literal. The dataset for the study was gathered and
processed with this major goal in mind. The dataset was gathered from two
Amharic books, “የአማርኛ ፈሊጦች” (idiomatic expressions in Amharic) and “ፍቅር እስከ
መቃብር” (love up to the grave) [3, 19]. The book authors had already annotated the data as
idiomatic or literal. The idiomatic terms used in this study were hand-picked and cross-checked
from published books.
Most idiomatic expressions in our source books have 2 to 4 tokens, so the dataset contains
only idiomatic expressions of 2 to 4 tokens. There are more than four thousand idioms in the
Amharic language. In this study, we used 3300 isolated Amharic idiomatic and non-idiomatic
expressions to train and evaluate the proposed model. Of the 3300 Amharic phrases, 1700
were idiomatic expressions, and 1600 were phrases containing terms found in idiomatic
expressions but not used as idioms. The idiomatic phrases were compiled from the
aforementioned publications and are easily readable in the books themselves. The 1600
non-idiomatic phrases were collected from the same publications.
Let us use an example to clarify the distinction between idiomatic and non-idiomatic
phrases in the dataset. “እግረ ደረቅ ነው” (igire derek’i newi) is an idiomatic phrase that
means “the feet are dry” when each word is interpreted literally, but the actual
meaning of the phrase is “unlucky.” “እግረ አባጣ ነው” (igirē ābat’a newi) is a non-idiomatic
phrase that means “he has elephantiasis.” This is the only meaning that can be derived from
the literal meanings of the words in it. Although both phrases use the same
word, “እግረ,” which means leg, the first phrase does not refer to its real meaning, leg, whereas
the second phrase does. We use text processing activities such as stop word removal to remove
terms such as “ነው” (newi) that have no bearing on the meaning of the statement.
Idiomatic phrases were labeled zero, and non-idiomatic phrases were labeled one in the
dataset. Table 1 shows the distribution of the length of idiomatic statements in the collection.
The dataset’s non-idiomatic clauses have a length of two up to four tokens.
https://doi.org/10.1371/journal.pone.0295339.t001
After collecting the data from books, we apply the following preprocessing modules to clean up
the data and make the learning phase as easy as possible.
i. Normalization. The Amharic writing system has different letters that are read
with the same pronunciation, and there are no rules to distinguish their meanings. As a result,
in Amharic, these letters may represent the same concept or name of an object. For instance,
the word “power” can be written as ሃይል፣ ሀይል፣ ሐይል፣ ሓይል፣ ኀይል፣ ኃይል. These six spellings do
not differ in meaning but use different letters with the same phonetics. This increases the
number of features extracted for processing or analysis. The normalization of Amharic
characters is similar to that of other Semitic languages, such as Arabic and Hebrew [20].
To overcome this redundancy, we normalize characters with the same pronunciation to one
canonical letter, as shown in Table 2 [21]. Normalization aims to reduce the number of
distinct features in the gathered dataset.
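A minimal sketch of this kind of character normalization is given below, using the six spellings of “power” from the text. The choice of ሀ as the canonical letter and the names `HOMOPHONE_MAP` and `normalize` are assumptions for illustration; the mapping actually used in the study is the one in its Table 2, which covers more letter families than this example.

```python
# Assumed canonical form: every variant of the "h" sound in the example maps to ሀ.
# The study's real mapping is given in its Table 2 and is not reproduced here.
HOMOPHONE_MAP = str.maketrans({
    "ሃ": "ሀ",
    "ሐ": "ሀ",
    "ሓ": "ሀ",
    "ኀ": "ሀ",
    "ኃ": "ሀ",
})


def normalize(text):
    """Replace homophone letters with their canonical letter."""
    return text.translate(HOMOPHONE_MAP)


# The six spellings of "power" listed in the text.
variants = ["ሃይል", "ሀይል", "ሐይል", "ሓይል", "ኀይል", "ኃይል"]
normalized = {normalize(word) for word in variants}
print(normalized)
```

After normalization the six spellings collapse to a single form, so they contribute one feature instead of six.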
ii. Stemming. Stemming is the process of reducing inflected words to their stem, base, or
root form. Amharic is one of the morphologically rich Semitic languages [22]. Different surface
forms can share the same stem, so stemming helps reduce the size of the feature space for
processing. In this study, we used the HornMorpho stemmer developed by Michael Gasser [23].
HornMorpho is a Python library developed to analyze three Ethiopian languages: Amharic,
Afan Oromo, and Tigrigna.
iii. Remove stop words. In Amharic, common words such as “ሁሉ፣ እስከ፣ ነዉ፣ ሆነ” and
others that carry little weight in text processing tasks are called stop words. In this study,
stop words are terms that do not contribute to the semantics of the given idiomatic phrase but
are used to fill the grammatical structure of the idiomatic statement. Stop words are eliminated
to save the computational time wasted in processing them. Amharic does not have a well-prepared
list of stop words, so we remove the stop words prepared by [21]. In addition to stop word
removal, we also replace numbers with their names in alphabetic characters (“ፊደል”). The
pre-trained FastText-based word embedding covers only Amharic alphabetic characters, so
numbers must be translated to their alphabetic names before their feature vectors can be
obtained. For example, in “2 አይን” (two eyes), 2 is changed to “ሁለት” (two), producing
“ሁለት አይን”. This replacement is done by keeping a key-value map between digits and the
alphabetic name of each digit.
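The two steps in this paragraph, stop word removal and digit replacement through a key-value map, can be sketched as follows. The stop word set contains only the words quoted in the text (the study's full list comes from [21]), and the digit map is a partial, illustrative one: only the entry for 2 appears in the paper. The names `STOP_WORDS`, `DIGIT_NAMES`, and `preprocess` are hypothetical.

```python
# Stop words quoted in the text; the study's full list comes from [21].
STOP_WORDS = {"ሁሉ", "እስከ", "ነዉ", "ሆነ", "ነው"}

# Illustrative, partial digit-to-name map. Only "2" -> "ሁለት" is given in the paper.
DIGIT_NAMES = {"1": "አንድ", "2": "ሁለት", "3": "ሶስት"}


def preprocess(phrase):
    """Replace single-digit tokens by their names, then drop stop words."""
    kept = []
    for token in phrase.split():
        token = DIGIT_NAMES.get(token, token)  # digit -> alphabetic name
        if token not in STOP_WORDS:
            kept.append(token)
    return " ".join(kept)


print(preprocess("2 አይን"))
print(preprocess("እግረ ደረቅ ነው"))
```

This sketch handles only whitespace-separated single-digit tokens; multi-digit numbers would need a number-to-words routine that the paper does not describe.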
https://doi.org/10.1371/journal.pone.0295339.t003
because the dataset is small, and a portion of it is used for model evaluation, indirectly
reducing the amount of data available for training the model [25]. We used unique idiomatic and
non-idiomatic terms (those not included in the training set) to validate and evaluate the
proposed idiomatic recognition model. We tune the hyperparameters of the proposed CNN model
using a grid search strategy. The training, validation, and testing sets are separated after
producing the word vectors of each expression in the dataset. There are 1360, 168, and 172
instances of idiomatic expressions used in training, validation, and testing, respectively. The
non-idiomatic expression training, validation, and testing examples are 1280, 160, and 160,
respectively. The values of the hyperparameters used in this study are shown in Table 4 below.
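The sketch below shows one simple way to produce an 80/10/10 split per class. It is an assumption about the procedure, not the authors' code: the function name `split_80_10_10` and the fixed seed are hypothetical. Note that an exact split of the 1700 idiomatic phrases gives 1360/170/170, whereas the paper reports 1360/168/172, so the authors' actual splitting procedure differs slightly from this one.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle once with a fixed seed, then cut into train/validation/test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10   # 80% for training
    n_val = len(items) // 10         # 10% for validation, the rest for testing
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Integer IDs stand in for the 1700 idiomatic and 1600 non-idiomatic phrases.
idiom_train, idiom_val, idiom_test = split_80_10_10(range(1700))
literal_train, literal_val, literal_test = split_80_10_10(range(1600))
print(len(idiom_train), len(idiom_val), len(idiom_test))
print(len(literal_train), len(literal_val), len(literal_test))
```

Splitting each class separately keeps the class proportions the same in all three sets, and slicing one shuffled list guarantees that no phrase appears in more than one set.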
To feed the training, validation, and testing data to the CNN model, we first create three
matrices for the training, validation, and testing sets, with dimensions (number of
training samples, 300), (number of validation samples, 300), and (number of testing
samples, 300), respectively. The number 300 is chosen because the second dimension of these
matrices must equal the embedding dimension of the CNN, as given in Table 4.
Next, we loop over the samples and fill each matrix with the FastText embedding vectors of the
idiomatic and non-idiomatic phrases. Following that, we train the CNN model by
supplying the training data matrix and training labels and validate its performance by
passing the validation set’s matrix, using the hyperparameter values shown in Table 4.
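The matrix-filling loop can be sketched as below. Several things here are assumptions: the toy lookup table `toy_vectors` stands in for the pre-trained FastText model, the dimension is 4 instead of 300 to keep the example readable, and averaging word vectors to obtain one vector per phrase is an assumed pooling choice that the paper does not state.

```python
EMBEDDING_DIM = 4  # toy value; the paper uses 300

# Hypothetical toy lookup standing in for a pre-trained FastText model.
toy_vectors = {
    "እግረ": [1.0, 0.0, 0.0, 0.0],
    "ደረቅ": [0.0, 1.0, 0.0, 0.0],
    "አባጣ": [0.0, 0.0, 1.0, 0.0],
}


def phrase_vector(phrase):
    """Average the word vectors of a non-empty phrase (assumed pooling)."""
    rows = [toy_vectors.get(word, [0.0] * EMBEDDING_DIM) for word in phrase.split()]
    return [sum(column) / len(rows) for column in zip(*rows)]


def build_matrix(phrases):
    """Create a (number of samples, EMBEDDING_DIM) matrix and fill it row by row."""
    matrix = [[0.0] * EMBEDDING_DIM for _ in phrases]
    for i, phrase in enumerate(phrases):
        matrix[i] = phrase_vector(phrase)
    return matrix


train_matrix = build_matrix(["እግረ ደረቅ", "እግረ አባጣ"])
print(len(train_matrix), len(train_matrix[0]))
```

The same `build_matrix` call would be repeated for the validation and testing phrases, giving three matrices that share the second dimension.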
Figs 3 and 4 show the model’s training accuracy and loss after the model has been trained
using the above-mentioned parameters and training dataset. The training accuracy grows as the
number of epochs increases, which indicates that the model learns from the data well. In
addition, as the number of epochs rises, the training loss declines. This shows that the
model picks up on idiomatic expression features from the training set.
results across multiple quality criteria. This means that researchers or application developers
can utilize deep learning-based idiomatic expression identification models to improve the
performance of their models or applications.
Fig 5. Evaluation of the performance of the proposed model with accuracy, precision, recall, and f1-score.
https://doi.org/10.1371/journal.pone.0295339.g005
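The four metrics named in Fig 5 can be computed from true and predicted labels as in the sketch below. The labels are toy values, not results from the paper, and the function name `classification_metrics` is hypothetical. Treating label 0 as the positive class follows the paper's convention that idiomatic phrases are labeled zero.

```python
def classification_metrics(y_true, y_pred, positive=0):
    """Return accuracy, precision, recall, and F1-score for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# Toy labels only: 0 = idiomatic, 1 = non-idiomatic.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]
accuracy, precision, recall, f1 = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)
```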
other models applied to the same dataset by other studies. As a result, we contrasted the new
model’s performance with some of the machine learning models employed in earlier studies
[27]. We compare the proposed model against KNN, SVM, and Random Forest classifiers.
The vectors generated by the pre-trained FastText model retain the meaning of each idiomatic
and non-idiomatic phrase. These vectors for the training and testing sets, along with their
labels, are passed to the machine learning algorithms employed in this study, as explained in
section 4.1. The hyperparameter values of these learning models were adjusted using a grid
search-based tuning approach. Table 5 shows the optimal hyperparameter values found using this
grid search approach.
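Grid search tries every combination of candidate hyperparameter values and keeps the one with the best validation score. The sketch below illustrates this with a tiny hand-written KNN classifier on toy one-dimensional data. The parameter names `n_neighbors` and `weights` mirror Table 5, but the candidate grid, the data, and the helper functions are assumptions for illustration and not the authors' setup.

```python
from collections import Counter
from itertools import product


def knn_predict(train, query, n_neighbors, weights):
    """Classify `query` by a (possibly distance-weighted) vote of its nearest neighbors."""
    scored = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)) ** 0.5, label) for x, label in train
    )
    votes = Counter()
    for distance, label in scored[:n_neighbors]:
        votes[label] += 1.0 if weights == "uniform" else 1.0 / (distance + 1e-9)
    return votes.most_common(1)[0][0]


def validation_accuracy(train, val, **params):
    hits = sum(knn_predict(train, x, **params) == label for x, label in val)
    return hits / len(val)


def grid_search(train, val, grid):
    """Evaluate every combination in `grid`; keep the first one with the best score."""
    names = sorted(grid)
    best_params, best_score = None, -1.0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = validation_accuracy(train, val, **params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy data: (feature vector, label) pairs.
train = [([0.0], 0), ([0.1], 0), ([0.2], 0), ([1.0], 1), ([1.1], 1), ([1.2], 1)]
val = [([0.05], 0), ([1.05], 1)]
grid = {"n_neighbors": [1, 3, 5], "weights": ["uniform", "distance"]}

best_params, best_score = grid_search(train, val, grid)
print(best_params, best_score)
```

On this easy toy data every combination classifies the validation set correctly, so the first combination tried is returned. On real data the scores differ and the search picks the best-scoring combination.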
Table 6 below shows the detection accuracy of these learning models using the hyperparameter
values from Table 5. In addition, Table 6 is used to compare the detection accuracy of the
different learning models against the proposed deep learning model with FastText-based word
embedding for Amharic idiom detection. The values 72%, 68%, and 76% indicate the
percentage of the testing set correctly detected by the Random Forest, KNN, and SVM models,
respectively. To determine whether the aforementioned machine learning algorithms
appropriately label a given phrase as idiomatic or non-idiomatic, we must first use
FastText to create a vector of the phrase with the same dimension as the one used for
training.
All the results shown in Table 6 were produced with the same dataset and the
same word embedding model, FastText. The results show that the proposed algorithm
improved detection accuracy by 8, 12, and 4 percentage points compared to Random
Forest, KNN, and SVM, respectively. This is because (1) employing FastText to generate word
vectors can yield important features, and (2) text features processed using a CNN can better
represent high-level characteristics in the given idiomatic and non-idiomatic phrases.
Table 5. The optimum values of hyperparameters for KNN, SVM, and Random Forest learning models.
Learning algorithm         Optimal hyperparameter values
KNN                        n_neighbors = 2; weights = ‘uniform’; the default for other parameters
SVM                        Kernel = ‘rbf’; C = 1.0; Gamma = 1; the default for other parameters
Random Forest classifier   n-estimators = 300; Class_weight = ‘none’; the default for other parameters
https://doi.org/10.1371/journal.pone.0295339.t005
Table 6. Comparison of the proposed model with SVM, KNN, Random Forest.
Model            Accuracy
Random Forest    72%
KNN              68%
SVM              76%
Proposed model   80%
https://doi.org/10.1371/journal.pone.0295339.t006
6. Conclusion
Different NLP models are now being developed for the Amharic language without considering
idiomatic expressions. Models that do not take idiomatic recognition into account may
produce incorrect results since the actual meaning of the expression differs from the meaning of
each word that makes up the expression. Idioms are one of the most fascinating and difficult
aspects of Amharic vocabulary. Machine learning algorithms cannot process raw text as input, so
texts must be encoded into a numeric format. As part of this encoding, we produced a vector
for each word used in this study using pre-trained FastText word embedding. The
experimental findings show that, compared to the other models evaluated in this study, the
proposed CNN with the FastText embedding model is more effective at detecting Amharic idioms.
The proposed approach can, therefore, be applied to natural language processing tasks requiring
the detection of idiomatic expressions, such as machine translation, sentiment analysis, and
question-answering systems. The model’s performance could potentially be improved by training
on more data. In the future, other datasets from Amharic holy books, such as the Amharic
Bible, will be used to train the model to improve its performance. Furthermore,
we propose to use this model as a component in Amharic machine translation.
Supporting information
S1 Appendix. Amharic stop words eliminated in this study. The English translation of
the stop words eliminated in this study.
(PDF)
Acknowledgments
The authors would like to thank the Jimma Institute of Technology for supporting them
through different resources. The authors would like to thank Jimma University for its support
during the research work.
Author Contributions
Conceptualization: Demeke Endalie.
Data curation: Demeke Endalie, Wondmagegn Taye.
Formal analysis: Demeke Endalie.
Funding acquisition: Demeke Endalie.
Investigation: Demeke Endalie.
Methodology: Demeke Endalie.
Resources: Demeke Endalie.
Software: Demeke Endalie.
Supervision: Getamesay Haile, Wondmagegn Taye.
Validation: Demeke Endalie.
Visualization: Demeke Endalie.
Writing – original draft: Demeke Endalie, Wondmagegn Taye.
Writing – review & editing: Demeke Endalie, Getamesay Haile, Wondmagegn Taye.
References
1. Titone Debra A, Lovseth Kyle, Kasparian Kristina, Tiv Mehrgol, "Are figurative interpretations of idioms directly retrieved, compositionally built, or both? Evidence from eye movement measures of reading," Canadian Journal of Experimental Psychology, vol. 73, no. 4, pp. 216–230, 2019. https://doi.org/10.1037/cep0000175 PMID: 31192627
2. Yağiz Oktay, "Language, Culture, Idioms, and Their Relationship with the Foreign Language," Journal of Language Teaching and Research, vol. 4, no. 4, pp. 953–957, 2013.
3. Dagnachew Amsalu, Worku Akililu, የአማርኛ ፈሊጦች (Idiomatic expressions in Amharic), Addis Ababa, Ethiopia: Kuraz Publishing Agency, 1993.
4. Salton Giancarlo, Ross Robert, Kelleher John, "Idiom Token Classification using Sentential Distributed Semantics," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 2016.
5. Flor Michael, Klebanov Beata Beigman, "Catching Idiomatic Expressions in EFL Essays," in Proceedings of the Workshop on Figurative Language Processing, New Orleans, Louisiana, 2018.
6. Fazly Afsaneh, Stevenson Suzanne, "Automatically Constructing a Lexicon of Verb Phrase Idiomatic Combinations," in 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy, 2006.
7. Peng Jing, Feldman Anna, Vylomova Ekaterina, "Classifying Idiomatic and Literal Expressions Using Topic Models and Intensity of Emotions," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 2014.
8. Peng Jing, Feldman Anna, "Automatic Idiom Recognition with Word Embeddings," Information Management and Big Data, vol. 656, pp. 17–29, 2016.
9. Liu Changsheng, Hwa Rebecca, "A Generalized Idiom Usage Recognition Model Based on Semantic Compatibility," in The Thirty-Third AAAI Conference on Artificial Intelligence, Hawaii, USA, 2019.
10. Feldman Anna, Peng Jing, "Automatic Detection of Idiomatic Clauses," Computational Linguistics and Intelligent Text Processing, vol. 7816, pp. 435–446, 2013.
11. Thyab Rana Abid, "The Necessity of idiomatic expressions to English Language learners," International Journal of English and Literature, vol. 7, no. 7, pp. 106–111, 2016.
12. Zeng Ziheng, Bhat Suma, "Idiomatic Expression Identification using Semantic Compatibility," Transactions of the Association for Computational Linguistics, vol. 9, pp. 1546–1562, 2021.
13. Yamashita Rikiya, Nishio Mizuho, Do Richard Kinh Gian, Togashi Kaori, "Convolutional neural networks: an overview and application in radiology," Insights into Imaging, vol. 9, pp. 611–629, 2018. https://doi.org/10.1007/s13244-018-0639-9 PMID: 29934920
14. García-Laencina Pedro J., Sancho-Gómez José-Luis, Figueiras-Vidal Aníbal R., Verleysen Michel, "K nearest neighbours with mutual information for simultaneous classification and missing data imputation," Neurocomputing, vol. 72, no. 7–9, pp. 1483–1493, 2009.
15. Hota Soudamini, Pathak Sudhir, "KNN classifier based approach for multi-class sentiment analysis of twitter data," International Journal of Engineering & Technology, vol. 7, no. 3, pp. 1372–1375, 2018.
16. Kok Zhi Hong, Shariff Abdul Rashid Mohamed, Alfatni Meftah Salem M., Khairunniza-Bejo Siti, "Support Vector Machine in Precision Agriculture: A review," Computers and Electronics in Agriculture, vol. 191, 2021.
17. Kulkarni Vrushali Y, Sinha Pradeep K, "Random Forest Classifiers: A Survey and Future Research Directions," International Journal of Advanced Computing, vol. 36, no. 1, pp. 1144–1153, 2013.
18. Panhalkar Archana R., Doye Dharmpal D., "A novel approach to build accurate and diverse decision tree forest," Evolutionary Intelligence, vol. 15, pp. 439–453, 2022. https://doi.org/10.1007/s12065-020-00519-0 PMID: 33425041
19. Alemayehu Haddis, ፍቅር እስከ መቃብር (Love up to the grave), Addis Ababa, Ethiopia: Mega Publishing Agency, 2004.
20. Hegazi Mohamed Osman, Al-Dossari Yasser, Al-Yahy Abdullah, Al-Sumari Abdulaziz, Hilal Anwer, "Preprocessing Arabic text on social media," Heliyon, vol. 7, no. 2, p. e06191, 2021. https://doi.org/10.1016/j.heliyon.2021.e06191 PMID: 33644469
21. Endalie Demeke, Haile Getamesay, Abebe Wondmagegn Taye, "Feature selection by integrating document frequency with genetic algorithm for Amharic news document classification," PeerJ Computer Science, vol. 8, p. e961, 2022. https://doi.org/10.7717/peerj-cs.961 PMID: 35634124
22. Tachbelie Martha Yifiru, Menzel Wolfgang, "Amharic Part-of-Speech Tagger for Factored Language Modeling," in International Conference RANLP, Borovets, Bulgaria, 2009.
23. Gasser Michael, "HornMorpho: a system for morphological processing of Amharic, Oromo, and Tigrinya," in Conference on Human Language Technology for Development, Alexandria, Egypt, 2011.
24. Wang Haitao, He Jie, Zhang Xiaohong, Liu Shufen, "A Short Text Classification Method Based on N-Gram and CNN," Chinese Journal of Electronics, vol. 29, no. 2, pp. 248–254, 2020.
25. Berrar Daniel, "Cross-Validation," Encyclopedia of Bioinformatics and Computational Biology, vol. 1, pp. 542–545, 2019.
26. Valk Ton Van der, Van Driel Jan H., Vos Wobbe De, "Common Characteristics of Models in Present-day Scientific Practice," Research in Science Education, vol. 37, no. 4, pp. 469–488, 2007.
27. Saigal Pooja, Khanna Vaibhav, "Multi-category news classification using Support Vector Machine based classifiers," SN Applied Sciences, vol. 2, no. 3, pp. 458–468, 2020.
28. Athiwaratkun Ben, Wilson Andrew, Anandkumar Anima, "Probabilistic FastText for Multi-Sense Word Embeddings," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 2018.