END CRP

News article classification
Sunday, February 16, 2025
Written by Abdullah Obaid

Under Supervised Dr Bassam AlKasasbeh
21110383@HTU.EDU.JO / abdallah.obaid@outlook.com
1
Bassam.AlKasasbeh@htu.edu.jo
1
Al Hussein Technical University, Amman 11831, Jordan
Contents
1.Introduction..............................................................................................................................................3
My research questions is :-..................................................................................................................4
My objectives is:-.................................................................................................................................4
Sections:-.............................................................................................................................................4
Gaps:-..................................................................................................................................................5
2.Related Work............................................................................................................................................5
3.Data collection and Description..............................................................................................................10
3.1 Primary Data...............................................................................................................................11
3.2 Secondary Data.........................................................................................................................12
4.Research Approach and Methodologies.................................................................................................13
4.1 Onion Model....................................................................................................................................14
4.1.1 Philosophy................................................................................................................................14
4.1.2 Theory Development Approach................................................................................................15
4.1.3 Methodological Choice.............................................................................................................16
4.1.4 Research Strategy.....................................................................................................................16
4.1.5 Time Horizons...........................................................................................................................17
4.1.6 Techniques and Procedures......................................................................................................18
4.2 Research Methodology....................................................................................................................19
5.Results and Discussion............................................................................................................................23
6.Conclusion and Recommendations........................................................................................................24
7.Reflections..............................................................................................................................................24
7.1 Selected Research Methodology.................................................................................................24
7.2 Alternative Research Methodologies...........................................................................................24
7.3 Recommended Actions and Future Considerations.....................................................................24
7.4 Recommended Methodology......................................................................................................24
8.References..............................................................................................................................................24
2
‫َّل‬
‫ِبْس ِم ال ـِه الَّر ْح َمٰـِن الَّر ِح يِم‬
Abstract
Keywords: First Keyword, Second Keyword, Third Keyword.
1.Introduction
W e are now in 2025 and modern technologies are developing very rapidly. With the development
of these technologies, our data has developed, so data has become widely spread, especially in
news sites and social media. Therefore, with this tremendous growth in digital content, classifying text
data has become very important in this era, especially in the media field and the spread of news,
because every minute we have new news in this world and it often spreads at a tremendous speed to
reach all the inhabitants of this planet. Article classification aims to assign specific categories to all
articles based on their content. This content may be, for example, in politics, sports, technology, etc., and
classification can be used for a variety of purposes, including: -
Recommendations: - Customizing users' favorite digital content based on some articles they have
previously read
News filtering: - Determining a person's priorities and displaying news that matches their interests and
ideas
Improving people's experience on news sites: - Improving access to news and reducing the spread of
news that does not interest the person
Directing content to specific people: - Ensuring that people get information that interests them
Reducing time Necessary for searching for news: - Through the process of filtering and classifying news,
the time required to search for news that interests the individual is reduced
and so on, as there are many benefits and uses that can be used in classifying content
Human language is characterized by being ambiguous in origin, and there are some words that are in the
same sentence, the meaning of which is implicit and ambiguous, and there are also some sentences that
depend on the context or grammar in their sentence, so researchers and scholars in the field of language
and the field of artificial intelligence have emerged in this world and have brought us its essence, which
is natural language processing (NLP) and other methodologies that can help us make our own
classification. Natural language processing has emerged as the cornerstone of understanding and
processing human language, and over the years, we have reached this advanced development of
machine learning models and models specialized in deep learning, which has led to bridging the gap
between the human understanding system (human language) and automated and electronic systems,
and after years of research and development, little by little, only where in the beginning the classification
relied on primitive or relatively simple techniques such as Rule-based systems and statistical learning
These methods were effective in the past for small-scale problems that do not require multiple
operations until there were other problems facing them, such as the difficulty of expanding and
3
generalizing them to all applications or using them in a large and highly diverse data set. Machine
learning appeared to us as a transitional stage for the process of machine learning and text classification,
as it provided us with large models capable of learning from data without specific rules that oblige them
to do so. We started first with machine learning, which was a real revolution and a rocket launch to what
came after it, which was deep learning and learning from electronic neural networks such as ANN's,
CNN's, RNN's, and also learning from transformers and using them to produce some tasks related to
producing and processing texts such as BERT and GPT. We reached what we are today in terms of great
development in electronic classification processes. In this research, I will try to address some methods
and techniques in the field of NLP and provide good content in this paper in the field of text
classification, as I will use data (BBC Articles Dataset with Extra Features) which consists of 2127 rows
and taken from the BBC News Network We have five types of classifications which are either in the field
of sports or in the field of business or technology or entertainment and finally in the field of politics
The production of this research has scientific and economic implications in the media and digital content
industry and beyond, as it can enhance the accurate and effective classification of sentences, articles and
news headlines from providing content tailored to people and improving the ability to quickly access
information and improve the management of text data whether this is through individuals or through
institutions. In addition, this research helps in the ongoing dialogue about what are the best practices
and emerging trends in the world of data science, artificial intelligence and natural language processing.
My research questions is :-
1. How do businesses and institutions use the trained models for their applications?
The paper (ERNIE and Multi-Feature Fusion for News Topic Classification) discussed the advantages
of ERNIE But it did not explicitly say how it will be used in real life and did not focus on its deployment
in business in general
2. Can traditional methods such as TF-IDF with machine learning models outperform pre-trained
models such as BERT for the BBC dataset?
The paper (ERNIE and Multi-Feature Fusion for News Topic Classification) showed that pre-trained
models such as BERT and ERNIE can outperform other models But without testing this in real life and
directly on the same dataset.
My objectives is:-
1- Obj1: Based on ("ERNIE and Multi-Feature Fusion for News Topic Classification"), we will compare the
performance of a pre-trained model such as BERT with our performance to find out whether traditional
methods are better than pre-trained models for this specific case.
2- Obj2: To understand what fields and organizations that will benefit from a successful model to match their
needs.
Sections:-
This research consists of eight sections, which are:
1- Introduction
2- related work
3- data collection and description
4- research approach at methodology
5- results and discussions
6- conclusion and recommendations
7- reflections
8- references
4
Gaps:-
Simplified applications of using artificial intelligence in the real world and in commercial uses in
particular, where a problem came in the papers in that they lacked broad applications in practical life for
the use of artificial intelligence and the use of classification in business, as these papers focused on
improving the accuracy of classification and giving us new discoveries and studies in this principle, but
they lacked practical implementation such as the common uses in companies and organizations for
artificial intelligence in the process of classifying news, comparing different models, as these papers
lacked finding a clear comparison on the same data in different papers and under the same
circumstances, which left us with unanswered questions about some of the processes that may occur on
the same data in different research and different papers. It was also noted that there are generally no
reasons that tell us why this news was classified into a specific category, which left a clear gap that
hinders understanding the classification of this model for texts.
2.Related Work
In this first reference that I have reviewed, This study focused on the process of classifying news in
Indonesian using artificial intelligence tools. The dataset included about 5000 records, which were taken
from the website cnnindonesia.com and were divided into several categories such as technology, sports,
5
and health. There were pre-processing steps such as removing unwanted words, lemmatization, and TF-
IDF was used to select the data that would enter the model. The results showed that Multinomial Naive
Bayes achieved a percentage of 98.4%. It was also fast, taking only 0.702 seconds. Compared to other
models, it was the best, although SVM was good, but it took a very long time to process the data. This
research focused on the importance of pre-processing and also how to choose features when you have
data that is text. The image below (which was taken from the same paper) shows us the methodology
that their project followed, as they started with the data, which was of the text type, then moved to Pre-
processing process, then selecting features, then selecting the appropriate model for them, and finally
we have the result, The paper does not talk about whether the data is balanced or not or if the article
was classified based on whether the article was long or short. Also, the study lacked the use of the
originally learned ampings. They only used TFIDF..[1]
In this second reference that I studied, it was talking about classifying news articles using machine
learning, where it used data that included approximately 210,000 records that were taken from the
Kagen website, and this data was collected between 2000 2012 to 2022. Several algorithms were used
and their results were studied. Text mining techniques were also used, as well as some techniques that
worked to convert unstructured data into structured data, such as TF-IDF. The percentages that were
achieved in their experiments were as follows: Random Forest achieved the highest accuracy at 91.94%,
followed by K-Neighbor Classifier (91.27%), Gaussian Naive Bayes (88.25%), and Decision Tree (79.19%) ,
In the image below (taken from the same paper), the methodology that they followed appears, where
they first selected the data, then prepared it, and also worked on converting the texts into vectors, such
as the vector in the TF-IDF, then they By choosing the features that interest them and finally putting
them on their models, One of the things that we lack in this paper is that it consists of news headlines
only, which means that they are very short, meaning that the model has learned on short sentences, and
this leads to a decrease in the model’s performance when applied to a large data set, and the text is rich
in words and also longer. This study relied mainly on using TFIDF only without using any other word
6
embedding.[2]
In this three references that I studied, it focuses on the challenges of classifying news texts with a study
of the length of the text and also its feature extraction. Also, a new model was revealed, called DCLSTM-
MLP, which consists of a process of merging CNN, LSTM, and MLP. This new model came to work on
learning the relationships specific to the text more and more accurately, as the accuracy rate came as
follows: DCLSTM-MLP: 94.82%, MLP: 88.76%, Text-CNN: 92.46%, Text-LSTM: 92.35% and CNN-MLP:
93.68%, and it obtained the highest percentage among them all, What I noticed in this paper is that
firstly the DCLSTM-MLP model can be complex and computationally expensive and requires very high
resources to run. Also, this paper has limited evaluation metrics as the focus was on Accuracy and recall
and other critical matters such as F1-score were not mentioned.[3]
In this fourth reference that I studied, this paper focuses on the process of improving the classification
process for Bengali news by processing unbalanced data and also taking advantage of some modern
techniques in the field of artificial intelligence such as BERT and SMOT, where the data was made
balanced by using one of the techniques called Random Under-Sampling (RUS) and also using SMOTE,
where after preparing the data, many models were used, whether they were regular models or deep
learning models, and they were as follows (Logistic Regression, Decision Tree, Stochastic Gradient
Descent, ANN, CNN, and BERT), where BERT achieved the highest result obtained, which was 99.04% on
the balanced data set, and it obtained a percentage of 72.23% on the unbalanced data set,Some notes
on this paper are the size of the data, as there is difficulty in finding many Bengali texts and putting them
in the dataset even after using some techniques such as SMOT, RUS, this may affect our model. Also,
using the BERT model requires very high computational resources. Finally, the ability to interpret, as
techniques such as LIME AND SHAP were used to interpret the model’s predictions, but this increased
the complexity of searching in the sentence itself.. [4]
7
In this fifth reference that I studied, it provides us with the classification of news on the Internet using
artificial intelligence, but with a focus on computational efficiency and reducing the complexity of each
model. I also suggest in this study using the SVM model with improving it by changing some
hyperparameters, because SVM is not used much in the process of classifying texts. Therefore, in this
paper, a comparison was made between it and some other models such as SGD, RF, LR, KNN, NB, and the
only thing that the superiority here came to SVM with a percentage that reached 85.16%, but it was the
slowest of the existing methods that were used. The research also used some techniques such as
removing noise and removing unwanted words, as well as stemming, and tokenization, The image below
(which was taken from the same paper) shows us the methodology that their project followed , Where
they started with me taking the data and making the Data pre-processing, then they worked on dividing
the data into two parts, one for testing and one for training, then they made feature selection for the
data that is specific to The train then worked on training the model and then they did their own
evaluation, One of the notes on this paper is the need for high computational efficiency due to their use
of SVM and also the lack of use of modern techniques in the process of converting words into vectors,
using only TFIDF..[5]
In this sixth reference that I have reviewed ,they used machine learning to supervise in order to classify
the type of articles into some specific types such as politics, sports, and entertainment. A data set of
75,000 articles was used and collected from the HuffPost website. Pre-processing was applied, which
included some steps such as removing unnecessary words and also why the encoder. Then the data was
separated into two parts with a ratio of 30/70. Cross-validation was also applied to ensure the guarantee
of strong evaluation. Some classifiers were tested and the percentages were as follows: NB got a
percentage estimated at 93%. In second place came LR with a percentage of 81%. In third place came
SVM with a percentage of 76%. In last place came KNN with a percentage estimated at 72%, The image
below (which was taken from the same paper) shows us the News article classification process that their
project followed.
They first started by collecting data through a website, then they converted it to News Documents, then
they downloaded the data after their own process, then they divided it and converted it to vectors for
each article they had, then they trained their model and predicted the news class, and finally they did an
8
evaluation of their model, One of the problems I saw in this paper is the incorrect distribution of the
data. Although there are 75,000 articles, the distribution is not equal for some categories, such as
politics, which is very common in the data. Also, one of the strange things is the lack of using specific
word embeddings, or perhaps they did not mention them explicitly in the article. They only said that
when they pre-processed the data, they used the NLTK library. Finally, they did not use all the metrics,
such as F1-score.[6]
In this seventh source, the research paper talks about something called ERNIE, which is known as
(Enhanced Representation through Knowledge Integration) with a multi-feature fusion, where most of
the time the methods face some problems in classifying texts. ERNIE has the power to overcome some of
the problems that occur during the process of classifying texts, such as the inability to capture some of
the meanings in the text itself. ERNIE is a pre-trained language model that overcomes these problems by
working to hide some words (like the mask on the BERT) and dealing with them in specific ways. It works
to extract the main features in texts, contexts and important sentences and also provides attention to it,
which includes a greater and broader understanding of the data. This model was tested on a wide range
of data such as BBCNews, where it achieved an accuracy rate of up to 97.47%, and all the metric results
were significantly high. As for another set of data, the accuracy rate was 98.31%. It is also worth noting
that the multi-feature fusion technique greatly helped this model classify topics. News by merging some
diverse features that are related to each other in a unified framework, this work emphasizes the
importance of taking advantage of modern methods such as ERNIE along with feature merging
technology to ensure high accuracy in classification , ERNIE makes huge improvements in text
classification processes, but it has several drawbacks, or we can consider them as challenges. First, it
requires very large data for training, high quality and free of errors. It also requires high computational
resources, as it requires much higher computational challenges than machine learning models, which
makes it less useful in people’s hands.[7]
9
we will classify news as to whether it is fake or not. We have not strayed too far from the topic, as the
topic is still the process of classifying news articles.
In this eighth reference that I have reviewed ,NLP was applied to classify fake news and articles spread
on the Internet. A dataset called LIAR-PLUS was used, which contains 14,787 classified phrases. Some
techniques were used to help classify whether the information is fake or not, such as sentiment analysis,
capturing news from fake sources, and also analyzing word frequencies and context, etc. The steps
dedicated to pre-processing were also applied, such as removing some symbols and removing unwanted
words. Several types of models were used, such as SVM, LR and NB. However, the SVM model obtained
the highest percentage, which was 92%, and the other metrics were all superior in SVM over the rest of
the models.[8]
In this ninth reference that I have reviewed in this paper, a study was conducted on fake news related to
the Corona Covid-19 disease using models for machine and deep learning, where the data, which
consisted of real and fake news, was processed using text processing and also placed on specialized
models such as Naive Bayes, Random Forest, Logistic Regression, CNN, LSTM, and BiLSTM, where CNN
and BiLSTM achieved the highest accuracy of 97%. At the end of the research, the risks of spreading fake
news in the health field are emphasized, especially during the Corona pandemic, and they explained the
effectiveness of their models and the effectiveness of artificial intelligence in the process of detecting
incorrect or false texts.
In the image behind (taken from the same paper), the methodology
that they followed
They first started by taking the data related to fake news, which was
primary data, meaning they collected them, then started with the
steps of pre-processing the data, then cleaning the text, then
converting it to a vector to enter the model, then testing several
models on it and coming out with the best result, then working on a
model for prediction and producing the best result, then the data for
the test comes and a prediction is made on it, and in the end we have
the sentiment of fake news of test set[9]
3.Data collection and Description

Define the primary and secondary data and describe the differences between them.
Primary data:- It is a type of data that is collected by the person himself for the first time
and is the information and data that needs direct answers. This method is considered
expensive as analyses are used using large human and technical resources. Data is collected
through tests, observations, questionnaires, etc.[10]
10
Secondary data:- It is the data that has been made, implemented and collected by other
researchers who have searched for it to reach an answer to their questions and it was not
collected by the questioner himself. It can be in several forms such as statistics that are
made by organizations and government agencies such as population censuses, books,
newspaper articles whether paper or on websites, reports, etc. This method is simple and
inexpensive in terms of research and material, but its main drawback is that it may not be
very accurate or may not meet the main purpose of the research that is used later.[10]
3.1 Primary Data

The way I collected the primary data was through interviews, where I conducted a personal interview
with two people who are experts in the field of natural language processing and text classification. Their
questions were open-ended questions and one to two closed questions. The questions were simple and
easy to understand for the people I asked, and all their answers will be analyzed.
The search and finding process was good and not complicated, given that the people are in the place
where I study. The process of contacting them and booking an appointment with them was smooth and
simple. Choosing the questions and making them simple and easy was difficult, so that I could obtain
accurate information in the field of news classification.
I chose my primary data process to be interviews because it gives me greater knowledge and allows me
to conduct a dialogue and discussion with the people I had been talking to. These interviews were
effective and gave us the main idea from them and conveyed the required information in a better and
greater way, especially since most of the answers were not finished, which helped us understand the
biggest topics. Which were raised when conducting interviews with them and this experience was
unique because it was the first time I had done this type of information gathering in this way, also these
interviews were very effective in the research and answered questions that were raised in this research
Merits:
1- Good delivery of information: - The interviews allowed me to conduct a detailed and comprehensive
collection of data, which conveyed ideas in a greater and better way
2- Clarifying the questions: - There was a question that was not clearly understood, so the interviews made it
possible for me to clarify the question in a good way, which gave a better answer and a more accurate
answer to the question
3- Continuous modification: - When going through an interview, there may be a change in the questions that
I can ask, and this is what actually happened, as I changed one of the questions during the interview that I
conducted with them
Limits:
1- Few interviewers: - Because of conducting interviews, the target segment (the people we are interviewing)
may be small, which makes it difficult for us to generalize the results
2- Relying on quality: - In The interview method depends on the person asking the questions. Since it is one
of the few times I have done interviews, I have faced a problem with interview skills, how to ask questions,
and also how to listen well to them.
11
3- Difficulty of analysis:- This method is more difficult to analyze than other methods, for example (surveys).
It does not give us charts immediately, but people must analyze the answers in detail and more accurately.
3.2 Secondary Data

I used the BBC classification data (this is the link for the data that I had picked from Kaggle
https://www.kaggle.com/datasets/jacopoferretti/bbc-articles-dataset (it is similar data that
paper number 7 used (ERNIE and Multi-Feature Fusion for News Topic Classificatio) [7] )),
It contain from 7 columns and 2127 rows , the data is related for a BBC News , each text
classified by the content for there class like (sport , politic ……etc)
The columns
Text:- It is the full text of the news article
Label:- It is the category that the news article classify based on
no_sentences:- It is the number of sentences on the article
Flesch Reading Ease Score:- It is a score shows how the text is easy to read
Dale-Chall Readability Score:- It is another measure shows how the words ore difficult to
read on the article
text_rank_summary & lsa_summary:- They are summaries created by some algorithms.
Justify my chose:-
The purpose of using this dataset is because of the great similarity in the content of this
data and the content of one of the papers that I read, which was titled (ERNIE and Multi-
Feature Fusion for News Topic Classification), where they used one of the models called
ERNIE and data like this was tested on it and in order to make a comparison between the
results that I will come out with and the results they obtained using this data, I found that
this is the closest data that can be obtained in addition to the great convergence in the
number of rows, in addition to the fact that this data is very suitable for conducting
research in the field of natural language processing due to the large size of the texts in it
and the presence of many features that can be used in the search process and improving
the results and coming out with the best result in the automatic classification process and
the process of summarizing texts. This data also contained several features that can help
researchers in the field of natural language processing by using it and giving them results
that may be good and using some modern methods in text summarization processes such
as using TextRank and also using LSA to make a summary For the text, which can help in the
process of research, investigation and comparison between different summarization
techniques if the researcher needs it
Using this data set is very useful because it provides us with readings that have been given
to us to study and analyze texts, such as the number of sentences in the text and the ease
and difficulty of reading sentences and words in each article, as this saves the researcher
the analysis process related to text analysis, through which we can see the data better and
more accurately as well, which makes it useful for research applications in various fields in
the world of natural language processing
The Merits and Limits
12
Merits:-
1- Different features:- This data provides some features that can help researchers in
some operations such as providing the degree of difficulty and ease of words in each
text and also providing some summaries that were created automatically, which
helps to focus on evaluating the model without the need to conduct extensive data
analysis.
2- Balanced data:- Most of the data categories are equal, which means that there is no
category that dominates all categories, which helps to increase the accuracy of the
data.
3- The data is ready:- Because it is of the secondary data type, I do not need to collect
it myself, as it is pre-collected.
4- Diversity of texts:- There are several categories into which the texts have been
divided, which helps to cover many of our news articles. This means that the data is
very useful for natural language processing applications to a large extent.
Limits:-
1- Lack of metrics in the data:- The data depends on three types of metrics such as
no_sentences, Dale-Chall Readability and Flesch Reading Ease Score, and these
may not constitute a comprehensive and complete analysis. For texts
2- Potential bias:- Because the data was collected from only one source, which is the
BBC, this (possibly) leads to bias towards news articles issued by the BBC, which
can limit the possibility of machine learning on the data
3- Automated summaries:- If the researcher needs summaries, they may not be as

accurate as what humans summarize
4.Research Approach and Methodologies

In this section, I will talk in detail about the research methodology that I used in this study, which
included collecting primary data and then how to analyze secondary data. The process of collecting
primary data was done by conducting interviews with experts in the field of natural language processing.
These interviews were aimed at collecting insights from specialists in the field of artificial intelligence
and natural language processing and benefiting from their experiences in this field, in addition to asking
them some questions that aimed to reach an answer to them in this research, such as future
improvements, benefits and challenges that media institutions may face in using artificial intelligence
and the classification processes of the texts used by them.
Then, the secondary data was analyzed, as a data set collected from the BBC news website was used to
evaluate the performance of our classification models. Some pre-processing techniques were applied,
transforming the text form and training the models, which were models divided into two parts: Machine
Learning Models and also pre-trained models such as BERT. In the end, we used the metrics that we
measured the performance of the models with and checked whether there was overfitting in the data or
not. Combining the research objectives collected from the interviews (which were the primary data) with
the data that was analysed (the primary data that was used from the BBC) and then drawing a diagram
that shows how these two approaches work.
13
4.1 Onion Model
It is a model that was discovered and presented by Saunders, Lewis and Thornhill, these
researchers, in one of their books. The aim of this model was to help students in their
academic stages to write a thesis in an organized and clear methodological manner. It has
several layers, which are: Philosophy, Theory Development Approach, Methodological
Choice ,Research Strategy, Time Horizons and Techniques and Procedures [12]
Figure 1: The structure of the Onion Research Model [13].
4.1.1 Philosophy
This layer tests the reality of the assumptions that the research begins to study, as this layer forms
the cornerstone in shaping the research method in the research conducted by the researcher, as
this layer directly affects the form and content of the research as a whole. This layer studies the
assumption methodologies that the researcher performs, whether through analysis or through
social experiments or points of view or interactions and human experiences. Some of the
methodologies that are in this layer are:
1- Positivism:-It is the process of identifying some laws and patterns on which confirmatory
tests and analysis of topics are conducted. This method is based on looking at the different
assumptions in all points of view, while also focusing on observing some phenomena that
can be seen with the eye and some measurable evidence. OR It focuses on topics that are
realistic (seeing facts and facts) and extracting them so that it is not possible for two people
to disagree on this fact. For example, water consists of two atoms of oxygen and one atom of
hydrogen. This is a fact that two people cannot disagree on. This (positivism) states that
knowledge can only be acquired when conducting research. Experimentalism, which uses the
process of measurement and observation, and emphasizes the establishment of some laws
and the identification of general patterns through tests and analyses, with a focus on some
phenomena that can be observed and measured completely.
14
2- Interpretivism:-It is the subjective nature of research, which refers to the methodology of
assumption, meaning that through human experiences and the work of some cultural contexts
and human social interactions, we have some experiences and some interpretations.
3- Realism:-It is the research theory that focuses on the fact that the research focus is
independent of human capabilities, whether the human principle agrees on it or not, whether
the human mind can understand it or not, it is a reality greater than human capabilities.
4- Pragmatism:-It is more between positivism and interpretivism, and this assumption looks at
different points of view in the processes of conducting experiments and analysis processes
For my research (News text classification), I used Positivism because it was the most
suitable for my research, because I am dealing with a set of data (BBC news dataset)
and this data includes several ready-made metrics such as the number of sentences,
word difficulty, sentence ease, etc. The main goal of my research is to compare the
performance of the ERNIE model with models that will use traditional TF-IDF.
In my research, as I mentioned, I used Positivism because this philosophy emphasizes
observation and measurement, which is what their study based on evaluating models
needs. If I want to make comparisons between different methods, I must use
measurable measures to make these comparisons and to make sure which models are
better in relation to the measures used. This philosophy is suitable for me. It is possible
to use several measures that can help us confirm which models are better, such as
Confusion matrix, which can help us extract some measures that can be used to infer
whether the model is efficient for this task or not.
I also relied on Pragmatism to discover how institutions and companies can use text
classification using artificial intelligence and natural language processing.
4.1.2 Theory Development Approach

This layer contains the strategies that will work on developing, testing and understanding
theories in research, whether it is starting with collecting data, looking at theories, or collecting
and studying observations. This layer contains three research strategies, which are:
1- Deduction:- is the process of deducing and understanding theories or hypotheses, and then the
process of testing all hypotheses is done by collecting observations and data on them to reach a
specific goal about one of the specific topics.
2- Induction: - It is the process of observation whereby a person notices something and wants to
apply it in a broader form and builds theories in order to generalize these observations or patterns
that he observed. In other words, it begins with observation and ends with generating and
establishing theories about the specific topic.
3- Abduction: - The focus here is on developing theories and hypotheses about some specific
observations that were observed and ends with prediction processes.
In my research, I used Deduction because it is the best approach that can be followed, and the
reason is that we already have a pre-defined hypothesis, which was that ERNIE performance
15
and would it be better to use TF-IDF with machine learning Models are better or not, and we
can also test the results that we will get through measurable measures and results, such as the
measures determined by the confusion matrix.
4.1.3 Methodological Choice

This layer is concerned with the methods, how and forms of data that will be collected and the
methods by which it will be analyzed later.
1- Mono method quantitative:- It is a method of using only one technique to collect quantitative
results through experiments
2- Mono method qualitative:- It is a method of using only one technique to collect qualitative
3- Multi method quantitative:- It is a method of using several multiple techniques to collect
quantitative results through experiments
4- Multi method qualitative:- It is a method of using several techniques to collect qualitative
5- Mixed method simple:- It is a method of using qualitative and quantitative techniques to better
solve the research problem and to collect results through experiments
6- Mixed method complex:- It is the use of qualitative and quantitative techniques to better solve
the research problem with the use of complex methods to collect data, analyze and extract
information better
In this research, I used a mixed method simple approach, which combines a quantitative and
qualitative method in order to conduct a comprehensive analysis of news text classification
using natural language processing
I used The quantitative to evaluate the performance of the machine learning model that I used
and also used some numerical statistical measures such as accuracy, which helps provide clear
and measurable measures of the performance of this model with the data that I studied, which
also helps in comparing between the machine learning that I used and ERNIE.
I also used the qualitative by conducting some personal interviews with people specialized in
natural language processing who are very familiar with how to classify texts. These interviews
came with the aim of understanding how companies supported by artificial intelligence can use
the process of classifying news texts and helping them with it and discovering the benefits and
impact of artificial intelligence in it and also identifying some of the challenges that may hinder
this process.
4.1.4 Research Strategy

This layer is concerned with the path that will be taken to collect and analyze data
1- Experiment: It is the process of manipulating and changing variables, I hope tests and seeing the
results
2- Survey: It is the process of collecting a set of data from people through one of the techniques
such as questionnaires, which helps in the process of analyzing people's opinions using them
3- Archival Research: It is the process of analyzing data and verifying the data found in old records
4- Case Study: It is the process of studying a case or problem within the aspects related to this
problem
5- Ethnography: It is the process of observing and recording data for people's lives or people's
culture, and this is done through close monitoring of them
16
6- Action Research: It works on using strategies, planning and cooperation between people in order
to solve one of the practical problems that you face
7- Grounded Theory: It is the process of creating and developing some theories derived from data
in general (whether historical or modern)
8- Narrative inquiry: It is the process of discovering problems, identifying them and solving them if
possible by discovering them through stories and experiences from people Different
In my research, I used a combination of archival research and case study to find answers to the
questions I asked
Archival Research
I first studied some data that helped me understand how to classify texts and news articles
based on the speech in them, which helped the research to develop more in my knowledge
in this aspect. I also used data issued by the BBC, which is data that has been made available
to the public. This data will help us evaluate our model and train it on this data
Case Study
I conducted some interviews with specialists in the field of natural language processing to
discover how news classification based on texts will be used in companies and how
companies and institutions will adopt these models that help in classification. These
interviews will provide us with a strategic understanding based on real-world benefit from
the model provided to them and how it will help companies classify their newsAfter
combining these techniques, this matter enhanced my research and enhanced my
knowledge, as the combination of these two techniques led to enhancing the research and
answering the questions better
4.1.5 Time Horizons

This stage focuses on the time period that the research takes, and here and about
1- Cross-sectional: It is the process of collecting, analyzing and studying data within a specific
time frame such as interviews and surveys
2- Longitudinal: It is the process of collecting, analyzing and studying data within a long and
extended time frame such as an experiment
In my research, I used the Cross-sectional approach because this study focuses on analyzing
data and studying ideas within a specific time frame and not extended over a long period of
time, as I will analyze the BBC data set on Python code, which is non-continuous data, and I also
conducted some interviews with specialists in the field of natural language processing, so I
followed this time frame
4.1.6 Techniques and Procedures

In this section, I will explain the techniques, steps, and stages that I took in order to
collect, analyze, and interpret data and information in my research.
17
1- Data collection through interviews: - I conducted semi-structured interviews (some
questions were open-ended and others were closed-ended) with experts in the field
of artificial intelligence, especially in the field of natural language processing.
These interviews came to achieve the goals that I wanted to answer some questions
related to his research, such as how to use artificial intelligence to classify news,
what are the benefits that may occur when using it, the expected challenges when
using it, where will artificial intelligence reach in the future in companies, and use
cases. The questions were semi-structured questions, meaning that most of the
questions were open and others were closed. The number of questions was 11
questions. These questions were carefully formulated so that I could extract
detailed insights from them. As I mentioned, the questions came to the heart of the
subject, which is the use of artificial intelligence in news organizations. The
interviews allowed me the flexibility to ask questions and clarify questions that
were not widely understood by people before answering. I also recorded all the
interviews after taking (Approval from the people I interviewed) so that I could
listen to it again and also so that I would not lose or forget any question or answer
that I had asked during the interview, which also helped me to analyze the interview
correctly and not lose any information, but one of the problems that I faced was the
limited number of people I interviewed.
2- Secondary data analysis: The code (of Python type) that I used and implemented
was of great help in analyzing the data previously collected from real data issued by
the BBC news agency, where the goal was to analyze using libraries in Python and
then train and evaluate the performance of all the models that we have, whether they
were pre-trained or regular machine learning models. The process of analysis,
processing and training was as follows:
A. 1- Pre-processing the text: - Here the process of preparing the data and
removing impurities and defects from it takes place so that this data is
correct and useful to us and does not negatively affect the model and it
came as follows
I. Removing punctuation marks
II. Converting words to lowercase letters
III. Returning words to their origin
IV. Removing stop words
V. Separating the sentence and cutting it into words
Thus, we have completed the process of preparing the text before we

perform the feature engineering process and converting its form, and in this
way all the data came in a good and regular form and does not affect the
performance of the model
B. Feature engineering: - Where the use of TF-IDF came in order to convert

and change the form of the data into a digital form so that it can enter our
models (this comes because the model does not receive words, but only
receives numbers, so this feature came in order to convert the text into a
digital matrix that can enter the model)
C. Data division: - The data set was also divided in a way that ensures a large
training process, with the selection of data for training, and the percentage
came to 80% For training and 20% for testing
D. Cross validation:- This technique was tested to know whether this data set
has overfitting or not (meaning that when training the data, we are given a
very high result in the training data and bad in the test data) so because of
the high accuracy that the models get, I used cross validation and this came
18
to make sure that the data does not have overfitting and thus it turned out
that the data is sound and has no problems
E. Exploratory data analysis (EDA):- Where I analyzed this data and

extracted some visuals that could help us understand the nature of this data
and I also made visuals before and after I did the pre-text processing to
know what are the most frequently repeated words in each class before and
after the processing process
F. Training and testing:- I trained this data on several models and also tested
them. These models were the following
I. Regular machine learning using TF-IDF (logistic regression,
random forest, SVM )
II. Pre-trained the model and it was BERT
G. Metrics: - In order to compare the performance of the models, we had to

use some metrics on which we determined whether these models were
efficient or not, and I used precision recall f1-score accuracy
After doing all these steps of dividing and transforming and making sure
whether the data was correct or not, then training and finally measuring, all
of this came to make tests to know the effectiveness of the pre-trained
models in front of the small self like this compared to the normal machine
learning models that were used, and after training, the efficiency of each
model was measured and its results were verified more
The combination of primary data and secondary data was very useful to make a comprehensive
answer to our questions, as the interviews (primary data) provided realistic questions and
answers on the ground, while the secondary data (on which the model was trained) allowed for
conducting experiments and tests and proving the validity of the hypotheses and questions that
I raised, and these methods worked together to address The questions we have
4.2 Research Methodology

In this research, I used two strategies to collect data and then integrate them into the research
to verify, confirm and find the answer to our questions. They were the use of interviews and
then my use of the Python code that I did to analyze the secondary data.
A) I will start first with the methods of implementing the interviews, which I did through these
steps.
19
In each of these steps, which were for the purpose of creating and organizing an interview
to answer some questions, each step had a significant impact on the result and on the
research as a whole
1- Designing the Interview Questions:- The questions were carefully designed with the
research objectives and focused on the following (these are not the questions, but what the
questions focused on)
A. How companies adopt artificial intelligence
B. Benefits of using artificial intelligence in companies
C. Some challenges and limitations
D. Future improvements
These questions were clearly chosen and in clear language, so that when the
question is asked, it is understood by the people I interviewed correctly and
smoothly, because any error in the formulation of the question may give us an
answer that is not related to the research we want to present. All the words came
in a language that technical people and specialists in the field of natural language
processing can understand in a simple and smooth way. The questions were
designed to be 11 questions, most of which are open-ended, with one to two
closed-ended questions.
2- Selecting the people:- I selected 2 experts in the field of artificial intelligence, especially
natural language processing, specialists and workers in it, as it was easy to find them because
they are present at my university. I went to one of them and asked him to conduct an
interview and he gave me an appointment. As for the other, I sent him an email and he set an
appointment for the interview with me.
20
3- Making the Interviews:- I went to them at the appointed time for the interview and it was
time to start the interview process, but Before that, I got their approval to record the
interviews so that I could listen to them later and make sure that I didn't forget any
information. I used the questions that I wrote and the interviews were smooth and the
questions were understandable except for one question that I had to clarify in one of the
interviews, but the interviews were good. The questions were understandable. All the
answers were in the required format. 11 questions were used in these interviews.
4- Analyzing Responses:- After recording the interviews, I listened to them several times to
analyze the answers of the two people I interviewed and I determined the answers for each
question.
5- Integrating with Research Progress:- Finally, after I analyzed and collected data from the
experts, I will integrate these ideas with the analysis results from the secondary data. This is
to ensure the validity of the research performance and to find a clear and final answer to what
is required from the research
Interviews are an effective strategy in the process of gathering insightful and sensitive
information from experts. I have collected important information that I would not have reached
if I had not met with NLP experts. Through discussions with them, I gained an important point
of view about the possibility of companies using artificial intelligence that can classify news and
how artificial intelligence will be integrated into the business of media agencies. The experts
reached the light on several aspects such as the introduction of artificial intelligence media
centers in their work during the next five years, in addition to their shedding a large part of the
restrictions related to the use of artificial intelligence in these organizations. It is very important
for organizations to adopt artificial intelligence, especially the presence of some advanced
technologies that can improve their work and benefit from the infrastructure of this generation
of students that can help them create structures of code with which they can achieve high
profits and save some expenses.
B) Now moving on to the code that was used to collect data, enhance the search and answer
the questions. It is a Python code that was created to predict the type of news that is present,
whether it is political, business, sports, technology or entertainment. A data set from BBC News
was used.
Below is the structure of the secondary data process:
21
1- Data Collection ( Secondary ): This data was collected from the BBC News website, which
contains 2127 rows. This data was available on the famous data collection website Kaggel. It also
contained other features such as the ease of reading the text, sentences in the text, and the
difficulty of the words in this text. It also contains ready-made summaries to implement the data
on them, as they used a technique called Text rank and LSA to implement these features (but we
will not use it. We will use the original text). This data set was chosen for several factors. First, it
was classified into several articles for more than one category. It contains five types of articles.
Second, the quality of the linguistic data set is good and from a good source, which is the BBC.
Third, this data set contains features that can help the researcher in the process of analyzing the
data and extracting some features from it. Fourth, some similar research was done on the same
data or similar data, which confirms the quality of this data.
2- Text Preprocessing: The text processing step is one of the important steps that work To make
the text correct and error-free before converting it and entering it into the model, at this stage
several pre-processing steps were used, including:
A. Removing punctuation marks: - Have all punctuation marks been removed from the text
B. Making the text lowercase: - This helps the model deal with all similar words in the same
way without any difference between whether they are in capital letters or if they are in
lowercase letters
C. Lemmatization: - Returning words to their original forms
D. Removing stop words: Removing stop words is an important and common thing in the field
of natural language processing. This comes so that the model can train on the correct words.
E. Tokenization: Dividing the text into words
3- EDA: In order to gain a comprehensive and broad understanding of the data, we conduct
exploratory data analysis, where I used a set of charts to understand the data, such as the
distribution of categories, where we found that the distribution is almost equal for each category
of articles. I also analyzed the relationships between the readable scores and the text categories.
This enabled me to see the hidden aspects of the data and analyze them more and better from
within the data (I also made two plots after removing the stop words to see what words
significantly affect each category)
22
4- Feature Engineering: My feature engineering was in some steps, the most prominent of which
are:
a. I did not enter a set of columns into the model because we did not need them in
the study, and we were satisfied with only the columns that contain the text
itself
b. Data conversion: - This was in the form of TF-IDF, where after pre-processing
I converted the data in the form of TF-IDF Matrix, which is a technique that
gives us the important words within each slice of data. The goal of conversion
to this form came because the model cannot take data in the form of words, but
only takes it in the form of numbers. This conversion came to fill this gap so
that the data can enter the model and the training and testing process occurs.
5- Data Splitting: - I divided the data into two groups, one for training and one for testing. The
distribution came at a rate of 80% to the training self and 20% to the testing data. This gives the
model a large training process on a large part of the data, while also leaving a small part of the
invisible data for the model to be tested on.
6- Modeling: - The modeling process was using machine learning models such as logistic
regression, random forests, and also support vector machines, and another pre-trained model,
which is BERT. This comes so that we can train the data on Several types of automated models,
whether pre-trained or other models
7- Evaluation:- The process of evaluating models comes to know what is the best model that can be
dealt with in the case of this data of ours, and I measured these models through the following
methods
I. Accuracy
II. Precision
III. Recall
IV. F1-score:
These are some degrees that can measure the effectiveness of this model, but I did not end here
yet. After I saw that the percentage was very high in most values, I made sure whether the model
was overfitting or not, and I used a technique called Cross validation. The results came out good
in all stages of Cross validation. In this way, I made sure that the data was sound and good to deal
with all our models.
8- Integrating with Research Progress:- In the end, the information obtained from the interviews
was integrated with the data that was analyzed using secondary data. This is in order to solve all
the questions raised in the research and find the best answers to these questions.
The BBC data set was used and a comprehensive code was created for it on Python. This
method was effective and good for evaluating the performance of all the models that were
used and studying the data structure and what it contains of secrets. It included pre-
processing and also EDA, and then it was separated. Data, transforming it and merging it
with models to give us the final result that we evaluated through the measurement tools
used.
23
5.Results and Discussion
Discuss the results. Include all tables and figures.
Explain how the results meet the research question and objectives.
Describe merits and limits of the analysis.
6.Conclusion and Recommendations

State the aim of the study and summarize the findings.
Based on the findings, discuss the recommendations that you have.
Propose future work (optional).
7.Reflections
Avoid generalization and focus on personal development and the research journey in a
critical and objective way.
7.1 Selected Research Methodology
Reflection of the research process.
Reflection on the merits, limitations, and potential pitfalls of the selected methods.
7.2 Alternative Research Methodologies
Alternative research methodologies in view of outcomes.
Lessons learned in view of outcomes.
7.3 Recommended Actions and Future Considerations
Use reflection to inform future considerations.
7.4 Recommended Methodology
Updated version of paper Methodology (flowchart, block diagram) with discussion.
8.References
Wongso, R., Luwinda, F.A., Trisnajaya, B.C., Rusli, O. and Rudy (2017). News Article Text
Classification in Indonesian Language. Procedia Computer Science, 116, pp.137–143.
doi:https://doi.org/10.1016/j.procs.2017.10.039.[1]
Agarwal, J., Christa, S., Aditya Pai H, M. Anand Kumar and Prasad, G. (2023). Machine Learning
Application for News Text Classification.
doi:https://doi.org/10.1109/confluence56041.2023.10048856.[2]
24
‌
Zhang, M. (2021). Applications of Deep Learning in News Text Classification. Scientific
Programming, 2021, pp.1–9. doi:https://doi.org/10.1155/2021/6095354.[3]
‌
Khan Md. Hasib, Nurul Akter Towhid, Kazi Omar Faruk, Jubayer Al Mahmud and Mridha, M.F.
(2023). Strategies for enhancing the performance of news article classification in Bangla:
Handling imbalance and interpretation. 125, pp.106688–106688.
doi:https://doi.org/10.1016/j.engappai.2023.106688.[4]
‌
Daud, S., Ullah, M., Rehman, A., Saba, T., Damaševičius, R. and Sattar, A. (2023). Topic
Classification of Online News Articles Using Optimized Machine Learning Models. Computers,
12(1), p.16. doi:https://doi.org/10.3390/computers12010016.[5]
‌
Ahmed, J. and Ahmed, M. (2021). ONLINE NEWS CLASSIFICATION USING MACHINE LEARNING
TECHNIQUES. IIUM Engineering Journal, 22(2), pp.210–225.
doi:https://doi.org/10.31436/iiumej.v22i2.1662.[6]
‌
Chen, W., Liu, B. and Guan, W. (2023). ERNIE and Multi-Feature Fusion for News Topic
Classification. Artificial Intelligence and Applications.
doi:https://doi.org/10.47852/bonviewaia32021743.[7]
‌
Mehta, D., Patel, M., Dangi, A., Patwa, N., Patel, Z., Jain, R., Shah, P. and Suthar, B.
(2024). Exploring the Efficacy of Natural Language Processing and Supervised Learning in the
Classification of Fake News Articles. [online] Ssrn.com. Available at:
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4721587.[8]
‌
Bangyal, W.H., Qasim, R., Rehman, N. ur, Ahmad, Z., Dar, H., Rukhsar, L., Aman, Z. and Ahmad, J.
(2021). Detection of Fake News Text Classification on COVID-19 Using Deep Learning
Approaches. Computational and Mathematical Methods in Medicine, 2021, pp.1–14.
doi:https://doi.org/10.1155/2021/5514220.[9]
25
Byjus (2019). Top 6 Difference Between Primary Data And Secondary Data. [online] BYJUS.
Available at: https://byjus.com/commerce/difference-between-primary-data-and-secondary-
data/. [10]
Ferretti, J. (2023). BBC Articles Dataset with Extra Features. [online] Kaggle.com. Available at:
https://www.kaggle.com/datasets/jacopoferretti/bbc-articles-dataset [Accessed 16 Jan. 2025].
[11]
15writers (2019). Understanding the Research Onion. [online] 15 Writers. Available at:
https://15writers.com/research-onion/. [12]
A. Melnikovas, “Adapting Research Onion Model for Futures Studies,” https:

//www.semanticscholar.org/paper/Towards-an-Explicit-Research-Methodology-% 3A-Adapting-
Melnikovas/dccf54a5a4312ceb2261e1989cda01f73989d735, 2018, [Online; accessed 10-Jan-
2024].[13]
26

END CRP

Uploaded by

Document Informationclick to expand document information

Copyright:

Available Formats

END CRP

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

END CRP

Uploaded by

Copyright:

Available Formats

News article classification

Sunday, February 16, 2025

Written by Abdullah Obaid

Keywords: First Keyword, Second Keyword, Third Keyword.

3.Data collection and Description

3.1 Primary Data

3.2 Secondary Data

3- Automated summaries:- If the researcher needs summaries, they may not be as

4.Research Approach and Methodologies

Figure 1: The structure of the Onion Research Model [13].

4.1.2 Theory Development Approach

4.1.3 Methodological Choice

4.1.4 Research Strategy

4.1.5 Time Horizons

4.1.6 Techniques and Procedures

Thus, we have completed the process of preparing the text before we

B. Feature engineering: - Where the use of TF-IDF came in order to convert

E. Exploratory data analysis (EDA):- Where I analyzed this data and

G. Metrics: - In order to compare the performance of the models, we had to

4.2 Research Methodology

6.Conclusion and Recommendations

A. Melnikovas, “Adapting Research Onion Model for Futures Studies,” https:

You might also like