
Graduation Project Report

Telecommunications Engineering Cycle

Topic:

Document Search And Classification Desktop


Application

Realized by:
Naimi Firas
Academic Supervisor:
Mrs. Messaoudi Houyem
Professional Supervisor:
Mr. Kais Chaabouni

Work proposed and fulfilled in collaboration with :

Audis Services

Supervisors’ signatures
ACKNOWLEDGEMENTS

The success of this project is the result of the effective guidance of the Audis team, who took the time to share their expertise and knowledge with me. My sincere appreciation goes to my Audis supervisor, Mr. Kais Chaabouni, for his guidance, advice, assistance and good humor. I would like to thank my pedagogic advisor, Mrs. Messaoudi Houyem, for her collaboration, and I acknowledge with much appreciation the honorable jury members for taking the time to examine my modest work. Additional thanks go to all those who encouraged me throughout this period.

Abstract

The rise of big data with the advancement of technology leads to an ever-increasing demand for a personalized search engine to search the huge amount of data residing in personal computers. A desktop search engine is used to search files or data on a user's personal system. It provides an efficient way of searching and retrieving the desired data and information. In this project, we present a search and classification desktop application that is capable of fast indexing and searching for documents located on a personal computer (PC) and of classifying them into categories according to their content using a Support Vector Machine (SVM) machine learning classification model.
Keywords: Desktop search engine, indexing, search, IR, IRS, TF-IDF, ML,
SVM, automatic text classification, vector representation.


Résumé — L’essor du big data et les progrès de la technologie entraînent une demande
croissante pour un moteur de recherche personnalisé permettant de rechercher les énormes
quantités de données résidant dans les ordinateurs personnels. Un moteur de recherche de
bureau est utilisé pour rechercher des fichiers ou des données dans les systèmes personnels
d’un utilisateur. Il comprend un moyen efficace de rechercher et de récupérer les données
et les informations souhaitées. Dans ce projet, nous présentons une application de bureau
de recherche et de classification qui est capable d’indexer et de rechercher rapidement des
documents situés dans un ordinateur et de les classer selon des catégories en fonction de
leur contenu à l’aide d’un modèle d’apprentissage automatique de classification Support
Vector Machine(SVM).
Mots clés : Moteur de recherche de bureau, index, recherche, RI, SRI, TF-IDF, ML,
SVM, classification automatique de texte, représentation vectorielle.
Contents

Introduction 1

1 Internship Context 3
1.1 Host Institution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Project Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 Study of existing search text and document search . . . . . . . . . . 5
1.2.2.1 Recoll . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.2.2 Copernic Desktop . . . . . . . . . . . . . . . . . . . . . . 5
1.2.3 Project Presentation . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Comparison between existing tools . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Methodology Adopted . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Theory and Background 10


2.1 Information Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.2 Information retrieval systems . . . . . . . . . . . . . . . . . . . . . 11
2.1.2.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.2.2 Document and query concepts . . . . . . . . . . . . . . . . 11
2.1.2.3 Main phase of information retrieval system . . . . . . . . . 12
2.1.3 Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.3.1 Indexing Advantage . . . . . . . . . . . . . . . . . . . . . 12
2.1.3.2 Methods for document indexing . . . . . . . . . . . . . . . 12
2.1.3.3 Inverted Index . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.4 Language pre-processing . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.4.1 Tokenisation . . . . . . . . . . . . . . . . . . . . . . . . . 14


2.1.4.2 Filtration of the stopwords . . . . . . . . . . . . . . . . . . 14


2.1.4.3 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Natural Language Processing (NLP) . . . . . . . . . . . . . . . . . . . . . 15
2.2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.2 Key Application Areas of NLP . . . . . . . . . . . . . . . . . . . . 15
2.2.3 Typical text pre-processing tasks in NLP . . . . . . . . . . . . . . . 16
2.3 Multi-Class Text Classification . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.2 Support Vector Machine SVM . . . . . . . . . . . . . . . . . . . . . 19
2.3.3 K Nearest Neighbors KNN . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.4 Multinomial Naïve Bayes . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.5 Multinomial Logistic Regression . . . . . . . . . . . . . . . . . . . . 20
2.3.6 Gradient Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.7 Performance Measurement . . . . . . . . . . . . . . . . . . . . . . . 21

3 Methodology 24
3.1 Indexing and Searching –Whoosh . . . . . . . . . . . . . . . . . . . . . . . 24
3.1.1 Whoosh overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1.2 Indexing process . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.2.1 Parsing documents . . . . . . . . . . . . . . . . . . . . . . 26
3.1.2.2 Creating Indexed Data . . . . . . . . . . . . . . . . . . . . 26
3.1.2.3 Searching query . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Categorization and classification . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.2 Input data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.3.1 Creation of the initial dataset . . . . . . . . . . . . . . . . 29
3.2.3.2 Exploratory Data Analysis . . . . . . . . . . . . . . . . . . 30
3.2.3.3 Feature Engineering . . . . . . . . . . . . . . . . . . . . . 31
3.2.3.4 Performance Measurement . . . . . . . . . . . . . . . . . . 34
3.2.3.5 Best Model Selection . . . . . . . . . . . . . . . . . . . . . 35
3.2.3.6 Model Interpretation . . . . . . . . . . . . . . . . . . . . . 36
3.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4 Achievement 38
4.1 Work Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.1.1 Hardware Environment . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.1.2 Software Environment . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2 Achieved Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.1 Programming language and framework . . . . . . . . . . . . . . . . 41
4.2.1.1 Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.1.2 PYQT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.1.3 The model/view architecture : MV Model . . . . . . . . . 41
4.2.2 Results: Presentation of the application desktop Interface . . . . . . 42

Conclusion 47
List of Figures

1.1 Audis Services company logo . . . . . . . . . . . . . . . . . . . . . . . . . 3


1.2 Audis Services’ clients . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Recoll logo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Copernic Desktop logo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Waterfall model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.1 Architecture of the information retrieval system . . . . . . . . . . . . . . . 11


2.2 Word level Inverted Index . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Stemming and lemmatization . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 NLP pre-processing example . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5 Random Forest Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.6 Multiclass Classification Using SVM . . . . . . . . . . . . . . . . . . . . . 19
2.7 KNN Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.8 Bayes Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.9 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.1 Whoosh Text Based Search Engine . . . . . . . . . . . . . . . . . . . . . . 25


3.2 Whoosh index format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Whoosh Indexing process . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 Whoosh Index searching process . . . . . . . . . . . . . . . . . . . . . . . 28
3.5 Dataset Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.6 Label Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.7 Classification Models Accuracy . . . . . . . . . . . . . . . . . . . . . . . . 35
3.8 SVM model Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.9 SVM model Classification report . . . . . . . . . . . . . . . . . . . . . . . 36

4.1 Anaconda Logo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39


4.2 Spyder Logo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40


4.3 Jupyter Notebook Logo . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4 PYQT MV architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.5 User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.6 Search Scope Pane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.7 Size and Date Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.8 Document types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.9 Document Categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.10 Search field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.11 Result Pane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.12 Document Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
List of Tables

1.1 Existing tools comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

4.1 Hardware environment Tools . . . . . . . . . . . . . . . . . . . . . . . . . . 38

Acronyms / Abbreviations

IR Information retrieval

IRS Information Retrieval System

ML Machine Learning

SVM Support Vector Machine

UI User Interface

PC Personal Computer

TFIDF Term Frequency–Inverse Document Frequency

General Introduction

The rise of big data with advancement in technology leads to an ever-increasing demand
for a personalized search engine to search the huge amount of data residing in personal
computers. Data has grown by multiple orders of magnitude in different dimensions, and it has become a major key in any business decision made by companies. The information is
mainly stored on computers, but the role of computers has changed in the last decades.
We no longer use computers for just their raw computing abilities: they also serve as
communication devices, multimedia players, and media storage devices. Those uses require
the ability to quickly find a specific piece of data. For that, companies aim to have desktop tools that search this data quickly using efficient algorithms. In other words, companies must find data in their personal computers' storage rapidly by using performant applications. Efficient software searches data with low time complexity and optimal resource usage. However, there are some problems with currently developed apps:
- Existing efficient apps are expensive. In other words, tools that give us all the features we need cost a lot of money.
- Desktop apps are sometimes not open source. As a result, if we need to develop a custom feature and integrate it into the existing solution, this is often impossible.
- A further problem arises because an increasing portion of the information is non-textual in format. Tools must offer the possibility to search by image on personal computers.
- Currently used tools cannot classify files by field. We sometimes want to search in business files only, so a file classification technique is needed. This is crucial for Audis' business.
As a result, Audis has chosen to develop its own solution based on well-known efficient algorithms. It does so for several reasons, such as the need for an internal, inexpensive, open-source solution that offers developers the possibility to add custom features as needed. In addition, search by image and file classification techniques must be integrated in the application we want to develop. For that, we need a desktop application that:
- Uses efficient algorithms in the indexing, search and classification processes
- Can be extended by adding custom features
- Integrates file classification and image search techniques.

This manuscript illustrates the work done in four chapters, organized as follows:
• The first chapter presents the general framework of the project, describes the problem studied and presents a critical study of existing solutions, highlighting their limits. It also introduces the proposed solution to achieve the targeted objectives as well as the work methodology adopted for its realization.
• The second chapter presents the concepts related to the project.
• The third chapter explains in detail the methods used to build and implement our application.
• The last chapter presents the realization part of the project, concretizing the work with detailed screenshots.
• Finally, a general conclusion presents the results of this project and prospects for possible improvements.
Chapter 1

Internship Context

Introduction
This chapter gives a broad overview of the internship and the circumstances under which
the project has been accomplished. First of all, I will present the hosting company and its
fields of business. After that, I’ll concentrate on the problem statement and the project
presentation. Finally, I will discuss the methods used to meet the project’s objectives.

1.1 Host Institution


Audis Services [1] is a consulting, services and IT engineering company. It is an expert in
the field of new technologies and computer science. Its mission is to design and implement
scalable IT solutions and services adapted to the needs of its customers. Since its creation
in 2013, Audis Services has continued to develop its business expertise. With highly qualified staff, Audis Services guides its clients in adopting a new approach to evolving their application solutions, using a reliable and proven methodology.

Figure 1.1: Audis Services company logo


[1]


Audis Services has many well-known clients around the world. Figure 1.2 shows the
distribution of Audis Services’ clients.

Figure 1.2: Audis Services’ clients


[2]

Audis presents itself in five areas of expertise:


- Consulting
- Outsourcing
- Technical assistance
- NearShore
- Data Science

1.2 Project Overview


In this section, we present the problem statement, followed by an explanation of the motivation that led to this project. We then present the project itself as well as the methodology used to complete it.

1.2.1 Problem statement


As a consulting, services and IT engineering company, Audis has employees who deal with a large amount of data on their personal computers. The need for a tool to get information from these files is obvious. This tool responds to a number of problems:
- Existing searching tools are expensive. Despite costing money, they don’t include all
the needed features.
- A large number of existing tools are not open source, which makes them unusable when we want to add custom features. Even when they are open source, most of the projects have been abandoned. Hence, there are no bug fixes, no maintenance and no updates.
- Existing tools include neither searching for words in images nor the possibility of retrieving files by the class they belong to.
These problems led to the development of an internal tool to search for words in personal computers' storage and to integrate classification and search-by-image techniques.

1.2.2 Study of existing text and document search tools


1.2.2.1 Recoll

Recoll is a personal text search tool for Unix and Linux. It is based on the powerful Xapian [ref] indexing engine, for which it offers a rich and easy-to-use Qt graphical interface. Recoll is free open-source software, the source code of which is available under the GPL license.

Figure 1.3: Recoll logo


[3]

Recoll characteristics:
- Runs on most Unix based systems.
- Handles most common document types, messages and their attachments.
- Powerful search functions, with Boolean expressions, phrases and proximity, wildcards,
filtering on file types or location.

1.2.2.2 Copernic Desktop

Copernic Desktop Search is constantly evolving to find more and more content on your
PC, while maintaining its primary goal: to find your files and emails quickly. Its intuitive
interface and well-organized results will eliminate wasted time searching for your PC’s
content.

Figure 1.4: Copernic Desktop logo


[4]

Copernic Characteristics:
• Deep searches: Copernic can index and search over 150 file types and emails—on
desktops, network, virtual desktop and even the cloud!
• Fast results: Once a search index is created, take advantage of fast search results that
appear on the screen
• Robust Security: Whether you use your Desktop or Server Search solution, your data
will always remain extremely secure.

1.2.3 Project Presentation


Our project is an implementation of an intelligent search and classification desktop application. The main features of this application are:
- Indexing a storage location and searching for a word in it.
- Classification of the files in the storage by category (business, entertainment, politics, sport, technology).
Characteristics:
Our application has these characteristics:
- Open source
- Searches for a word in a folder of files quickly using efficient algorithms
- Contains new features such as searching for a word in an image
- Quick results
- An easy-to-use interface
- Works on Windows and Unix operating systems

1.3 Comparison between existing tools


Table 1.1 illustrates the differences between the existing tools and our solution. The comparison is based on several criteria: whether the tool is open source, whether it is free, and whether it contains all the needed features.

Criterion                   Our Application   Recoll   Copernic Desktop
Open-source                 Yes               Yes      No
Search in the cloud         No                No       No
Search a word in an image   Yes               No       No
Classification technique    Yes               No       No
Operating system            Unix, Windows     Unix     Windows
Fast                        Yes               Yes      Yes
Add custom features         Yes               Yes      No

Table 1.1: Existing tools comparison

According to our project requirements, we need a system which enables us to classify files in a specific storage location and to detect the occurrence of a word in a given image. It also has to work on both Unix and Windows. These requirements cannot be met by the existing tools, so we have to develop our own.

1.4 Methodology Adopted


I have chosen the "Waterfall model". This model is straightforward and simple to use and understand. It consists of six steps, as figure 1.5 shows:

Figure 1.5: Waterfall model


[5]

These steps are presented as follows:


1. Requirements: this stage lays out the specifications of the solution that will be designed and implemented.
2. Design: a crucial stage in the development of any product. This step specifies the physical and logical architecture to be used to achieve the project's goal.
3. Implementation: using inputs from the system design, the system is first developed in small programs known as units, which are integrated in the next phase.
4. Integration and Testing: after each unit has been tested, all of the units produced during the implementation phase are combined into a system. Testing is done to ensure that the client does not have any issues installing the software.
5. Deployment: once all of the required phases have been completed, the software can be used.
6. Maintenance: once the program is deployed, it should be updated on a regular basis and any errors should be corrected. This stage is in charge of that activity.

The maintenance phase is not included in this project.

Conclusion
In this chapter, I have presented the host company where this project was completed. Then I described the general context of the project, as well as the problem statement and project presentation. Finally, I discussed the methodology used to complete the project. The next chapter presents the preliminary research and theoretical background.
Chapter 2

Theory and Background

Introduction
This chapter clarifies the relevant concepts related to our project. First of all, we will define Information Retrieval and its aspects. Then we will look at the indexing and search process. Finally, we will present the state of the art of AI-based techniques, such as Natural Language Processing and machine learning classification models, that give good results for the categorization task.

2.1 Information Retrieval


Information retrieval is a field historically linked to information science, which has always
been concerned with establishing representations of documents in order to retrieve informa-
tion through the construction of indexes. Computer science has allowed the development
of tools to process information and establish the representation of documents at the time
of their indexing, as well as to search for information.

2.1.1 Definitions
Several definitions of information retrieval have emerged over the years; we quote the following three:
- Definition 1: Information retrieval is the set of techniques allowing one to select, from a collection of documents, those that are likely to meet the user's needs.


- Definition 2: Information retrieval is a branch of computer science that is concerned with the acquisition, organization, storage, retrieval and selection of information.
- Definition 3: Information retrieval is a research discipline that integrates models
and techniques whose goal is to facilitate access to relevant information for a user.

2.1.2 Information retrieval systems


2.1.2.1 Definitions

An Information Retrieval System (IRS) is a computer system that, given a query, returns from a set of documents those whose contents best correspond to the user's information need.

Figure 2.1: Architecture of the information retrieval system


[6]

2.1.2.2 Document and query concepts

Document
The document represents the elementary container of information, exploitable and accessible by the IRS. A document can be a text, a web page, an image, a video, etc. In our context, we concentrate on text-based documents (PDF, Word, PPT, TXT, images, etc.).
Query
A query is the expression of the user’s need for information.

2.1.2.3 Main phase of information retrieval system

The fundamental objective of an IR process is to select the documents "closest" to the user's information need, described by a query. This leads to the two main phases of the process: indexing and query/document matching.

2.1.3 Indexing
Text indexing is the process of extracting statistics considered important from a text in
order to reflect the information provided and/or to allow quick searches on its content. Text
indexing processes can be conducted on nearly any type of textual information, including
source code for computer programs, DNA or protein databases, and textual data storage,
in addition to plain language texts.

2.1.3.1 Indexing Advantage

The goal of storing an index is to improve speed and performance when searching for
relevant content. Without an index, the search engine would have to go through each
document in the corpus, which would take a long time and a lot of computer power.
For example, a sequential scan of every word in 10,000 huge documents could take hours,
whereas an index of 10,000 documents can be searched in moments.

2.1.3.2 Methods for document indexing

There are numerous indexing methods available; here are the two most popular:
Full-Text Indexes
Full-text indexes are simple to create. The system examines every word of the document
and produces an index of each term and its location as part of this procedure. Full-text
indexes need a lot of storage space, despite the fact that they are easier to process.
Field-Based Indexes
Field-based indexes are a quick and easy approach to find information in a database.
This type of indexing allows the user to look for specific information about each document.
The field could be a date, time, or any other designated area, for example.
2.1. Information Retrieval 13

2.1.3.3 Inverted Index

The inverted index is a database index that stores a mapping from content, such as words or numbers, to its locations in a database, a document, or a set of documents. An inverted index is used to enable quick full-text searches. There are two categories of inverted indexes:
a) A record-level inverted index contains a list of document references for each term.
b) A word-level inverted index additionally includes the positions of each word inside a document. The latter provides more capability, but it necessitates more processing power and storage space.

In our work we focus on the word-level inverted index. To better understand how it works, let's review a simple example with the following figure 2.2:

Figure 2.2: Word level Inverted Index


[7]
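As a complement to the figure, here is a minimal, illustrative Python sketch of a word-level inverted index built from a toy two-document corpus. It is for explanation only; it is not the structure Whoosh builds internally.

    from collections import defaultdict

    def build_inverted_index(docs):
        """Map each term to {doc_id: [positions of the term in that document]}."""
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, text in docs.items():
            for position, term in enumerate(text.lower().split()):
                index[term][doc_id].append(position)
        return index

    docs = {1: "never give up", 2: "give it up give it up"}
    index = build_inverted_index(docs)
    print(dict(index["give"]))   # {1: [1], 2: [0, 3]}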

2.1.4 Language pre-processing


In this section we will present the set of linguistic preprocessing leading to the constitution
of the inverted index and the documentary representation from a given document collection.

2.1.4.1 Tokenisation

Tokenization is the process of breaking down a large chunk of text into smaller tokens that can be words, characters, or subwords. Tokenization can thus be classified into three categories: word, character, and subword (n-gram characters) tokenization. The
most often used tokenization algorithm is word tokenization. It divides a chunk of text
into distinct words using a delimiter. Different word-level tokens are created depending on
the delimiters.
Considering the following sentence: "Never give up".
The sentence’s tokenization yields three tokens: Never/give/up.

2.1.4.2 Filtration of the stopwords

Stopwords are words that tend to appear often in all documents in a collection and don't provide information about a document's content. In other words, they carry little semantic significance. In English, examples are the terms "of," "the," "for," etc.

2.1.4.3 Normalization

Normalization consists in reducing an inflected word to its canonical form. There are two types of normalization: stemming and lemmatization.
a) Lemmatization is the process of transforming a word into its dictionary form, such as "reading" => "read," "finds" => "find," and "thought" => "think."
b) Stemming is the process of converting a word into its root form by removing the word's ending. It is similar to lemmatization, but it can't handle irregular verbs. It can, however, handle words that aren't in the dictionary.
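To illustrate the difference, here is a small sketch using NLTK (it assumes the WordNet corpus has already been downloaded; the exact tools used elsewhere in this project may differ):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    for word in ["studies", "reading", "thought"]:
        print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos="v"))
    # e.g. "studies" stems to "studi" (not a real word) but lemmatizes to "study";
    # "thought" lemmatizes to "think" while the stemmer leaves it unchanged.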

Figure 2.3: Stemming and lemmatization


[8]

2.2 Natural Language Processing (NLP)

2.2.1 Definition
Natural language processing (NLP) is a branch of linguistics, computer science, and arti-
ficial intelligence that studies how computers interact with human language, particularly
how to develop an algorithm that can process and evaluate huge amounts of natural lan-
guage data. This enables computers to comprehend both the content of documents and
the language’s internal contextual nuances. NLP technology is capable of accurately ex-
tracting information and meanings from documents, as well as categorizing and organizing
the documents themselves. [2]

2.2.2 Key Application Areas of NLP


— Search: This entails identifying specific terms within a text. It allows you to search for keywords in a document, do a contextual search for synonyms, and discover misspelled words or related entities, among other things.


— Machine translation: This refers to the process of translating one natural language into
another while maintaining the meaning and producing fluent writing as a result.
— Summarization: NLP systems can be used to construct a short version of articles and
long texts that contains only the most important information, including primary points
and essential ideas.
— Named-Entity Recognition (NER): NER is a technique for identifying, extracting, and
categorizing entities. It entails extracting the names of various elements such as persons
and places and categorizing them into specified classifications.
— Text classification: The NLP algorithm is programmed to categorize texts based on
certain characteristics such as subject, document type, and time.
— Sentiment analysis: This is a type of text classification in which the NLP model deter-
mines whether the text is positive, negative, or neutral.
— Answering queries: An automated question answering system analyzes unstructured
data from articles, social media, newsfeeds or medical records by extracting the needed in-
formation elements, analyzing it, and using the relevant part to answer the question using
a set of natural language processing (NLP) methods.

2.2.3 Typical text pre-processing tasks in NLP


In order to efficiently turn the text into a format that can be used by models and other
activities, most NLP systems contain a text pre-processing pipeline and tasks that should
be processed.

• Sentence Segmentation: It entails separating the text into individual sentences. Coding a sentence segmentation model may be as simple as splitting the text whenever a punctuation mark appears. Newer NLP pipelines typically use more complex algorithms that work even when a document isn't correctly organized.

• Tokenisation

• Stemming and lemmatization

• Stopword removal

Figure 2.4 illustrates NLP pre-processing techniques on sentences.



Figure 2.4: NLP pre-processing example


[9]

2.3 Multi-Class Text Classification


ML approaches are divided into two categories:

• Supervised learning: the model is trained on annotated data and is required to infer knowledge from the input features and map them to an output class. The main goal is to be able, by the end of the learning process, to correctly predict the outputs of new instances.

• Unsupervised learning: In contrast with supervised learning, unsupervised learning has no specific output targets or evaluations to handle. Based on the input features, the main goal is to gain insight into the overall statistical structure of the input data.

Text classification is therefore a supervised machine learning problem where the dataset is labelled. There are different algorithms to solve classification problems. To obtain the best result and the most accurate prediction, we have to test different ML models to see which one best matches the data and captures the relationships between the points and their labels. We'll give a quick explanation of the logic behind each model.

2.3.1 Random Forest


Random forest is a supervised learning algorithm. It can be used for both classification and regression, and it is a very flexible and easy-to-use algorithm. It works by creating a large number of decision trees during training and then aggregating the votes from the different decision trees to decide. It is said that the more trees there are, the more robust the forest is.

Figure 2.5: Random Forest Architecture


[10]

2.3.2 Support Vector Machine SVM


Support Vector Machine is a supervised machine learning technique that is mostly used to solve classification problems. The classification is carried out by determining the hyperplanes that best distinguish the classes, maximizing the separation margin between data points according to their pre-defined labels or classes.

Figure 2.6: Multiclass Classification Using SVM


[11]

2.3.3 K Nearest Neighbors KNN


The K nearest neighbors algorithm is a non-parametric method used for classification and regression. In both cases, the input consists of the K closest training instances in the feature space.

Figure 2.7: KNN Classification


[12]

2.3.4 Multinomial Naïve Bayes


Bayes' theorem is used in Naive Bayes to predict the category of a given sample (with the strong assumption that each feature is independent of the others). It is a probabilistic classifier, which means it uses Bayes' theorem to determine the probability of each category, then outputs the category with the highest probability.
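For reference, the theorem illustrated in Figure 2.8, written for a class c and a document d whose features t1, ..., tn are assumed independent, takes the standard form:

    P(c | d) = P(d | c) · P(c) / P(d)
    P(d | c) ≈ P(t1 | c) · P(t2 | c) · ... · P(tn | c)

The classifier outputs the class c with the largest P(c | d).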

2.3.5 Multinomial Logistic Regression


Multinomial Logistic Regression (MLR) is a classification approach that extends logistic regression to multiclass problems. Logistic regression is a classification algorithm used when the response variable is categorical. Its idea is to find a relationship between the features and the probability of a particular outcome. For example, when predicting whether a student passes or fails an exam based on the number of hours spent studying, the response variable has two values: pass and fail.

Figure 2.8: Bayes Theorem

2.3.6 Gradient Boosting


Gradient boosting is a machine learning approach for regression, classification, and other
problems that generates a prediction model by combining various weak predictors, typically
Decision Trees.

2.3.7 Performance Measurement


There are numerous metrics that may be utilized to gain insight into the model perfor-
mance while dealing with classification challenges. Here are a few examples:
Confusion matrix: The confusion matrix is an N x N matrix used to evaluate the performance of a classification model, with N being the number of target classes. The matrix compares the actual values with those predicted by the model.

Figure 2.9: Confusion Matrix



Accuracy: The accuracy metric calculates the proportion of correct predictions to the
total number of occurrences assessed.

Precision: Precision is a metric for determining how many positive patterns are properly
predicted from the total predicted patterns.

Recall: recall is a metric for determining the ratio of positive patterns which are properly
classified.

F1-Score: If we need to strike a compromise between precision and recall, F1 Score could
be a preferable metric to employ.
In fact, it is the harmonic mean of precision and recall. Therefore, it provides a good
estimate of the overall quality of a model.

TP, TN, FP and FN are respectively :

• TP/True Positive: You predicted positive and it’s true.

• TN/True Negative: You predicted negative and it’s true.

• FP/False Positive : You predicted positive and it’s false.

• FN/False Negative: You predicted negative and it’s false.
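For completeness, in terms of these four counts the metrics above are computed with the standard formulas:

    Accuracy  = (TP + TN) / (TP + TN + FP + FN)
    Precision = TP / (TP + FP)
    Recall    = TP / (TP + FN)
    F1-score  = 2 · (Precision · Recall) / (Precision + Recall)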



Conclusion
In this chapter we first presented information retrieval as a research discipline that integrates models and techniques, such as indexing and searching, whose goal is to facilitate access to relevant information for a user. Secondly, we explained AI-based techniques for classification. We started by introducing NLP, its application areas and common text preprocessing tasks. Then, we presented different ML classification models and how their performance is measured. In the following chapter, we explain the details of the methods used to build and implement our application.
Chapter 3

Methodology

Introduction
This chapter is dedicated to defining the methods used to reach the project goals. The work is divided into two main steps: indexing and searching files, and building classification models, followed by selecting the best model after performance evaluation.

3.1 Indexing and Searching –Whoosh

Introduction
To search for relevant content in text-based documents quickly and obtain good results, we have to use a full-text search engine tool. It should have efficient and precise search algorithms to collect, parse and store data in an index that facilitates fast and accurate information retrieval. In our case, we adopted Python as the programming language, and the suitable tool we can use is the Whoosh library.

3.1.1 Whoosh overview


Whoosh is a fast full-text indexing and searching library implemented in pure Python.
It is a library of classes and functions for indexing text and then searching the index.
Programmers can use it to easily add search functionality to their applications and websites.
Whoosh is a highly sophisticated yet simple-to-use open-source system for building a text search engine project. It provides a Python-based text search engine library with powerful indexing and querying functions. Because of these powerful functions, it is widely used by those who want to develop their own search engines.
Whoosh consists of two components that make up a search engine, indexing and searching.
Firstly, all the text and metadata extracted from documents, originating from different text
sources like images, Docx or PDF files, are indexed in order to produce a common format.
Preparing for a common format makes the search process convenient as the documents are
processed by an Analyzer and turned into tokens to be actually indexed.
Secondly, when a user inputs some attributes in the query, Whoosh parses the query with
Query Parser and creates search criteria which are used to run for the Query object against
the index.
Finally, the items of data that meet the search criteria are returned as Document objects to the user. This process is described in Figure 3.1 below.

Figure 3.1: Whoosh Text Based Search Engine

3.1.2 Indexing process


The first step of indexing is to scan all available documents and parse their content and metadata. In our case, the document types that are difficult to deal with and need special focus while parsing are PDFs and images.

3.1.2.1 Parsing documents

1. PDF parsing
PDF is probably one of the most commonly used document formats in offices. It stands for Portable Document Format and was developed by Adobe. The main goal was to be able to exchange information platform-independently while preserving and protecting the content and layout of a document. This results in PDFs being hard to edit and difficult to extract information from, which does not mean it is impossible. There are different tools with different methodologies and functionalities available in Python for PDF text extraction, such as PyPDF2, PyMuPDF and PDFMiner. Both PDFMiner and PyPDF2 are pure Python libraries. In contrast, PyMuPDF is based on MuPDF, a lightweight but extensive PDF viewer. This has huge advantages when it comes to handling difficult PDFs, and it claims to be significantly faster than PDFMiner and PyPDF2 in various tasks. For these reasons, we have chosen it.
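As an illustration, a minimal text-extraction sketch with PyMuPDF (recent versions expose it as the fitz module) might look as follows; the file path is a placeholder and the exact parsing code used in the application may differ:

    import fitz  # PyMuPDF

    def extract_pdf_text(path):
        """Concatenate the plain text of every page of a PDF file."""
        with fitz.open(path) as doc:
            return "\n".join(page.get_text() for page in doc)

    text = extract_pdf_text("example.pdf")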

2. Image parsing - Optical character recognition processing (OCR)


Extracting text from an image is a really hard task for a computer. In fact, the extraction process involves detection, localisation, tracking, extraction, enhancement, and recognition of the text in a given image. To give our Python program character recognition capabilities, we make use of the Python library pytesseract. Python-tesseract is an optical character recognition (OCR) tool for Python: it recognizes and "reads" the text embedded in images.
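A minimal OCR sketch with pytesseract and Pillow could look like this; it assumes the Tesseract engine is installed and on the PATH, and the image path is a placeholder:

    from PIL import Image
    import pytesseract

    # Run Tesseract OCR on an image file and return the recognized text
    text = pytesseract.image_to_string(Image.open("scanned_page.png"))
    print(text)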

3.1.2.2 Creating Indexed Data

After extracting all the text and metadata from documents, originating from different text
sources like images, Docx or PDF files, they should be indexed in order to produce a
common format. Preparing for a common format makes the search process convenient,
easy, and fast. Therefore, the “schema” of the index has to be defined.
The schema defines the list of fields to be indexed or stored for each text file, similar to how we define a schema for a database. A field is a piece of information for each document in the index, such as its title or text content. Indexing a field means it can be searched; it is also returned with the results if declared with the argument stored=True in the schema. In our case, the schema includes fields such as title, content, path, date, size and type.

Figure 3.2: Whoosh index format

The indexing process of the Whoosh search engine is described as follows:

Figure 3.3: Whoosh Indexing process
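The following is a minimal sketch of such a schema and of adding one document to the index with Whoosh; the field definitions mirror the fields listed above but may differ from the exact ones used in the application:

    import os
    from whoosh.fields import Schema, TEXT, ID, DATETIME, NUMERIC
    from whoosh.index import create_in

    # Schema with the fields mentioned above (title, content, path, date, size, type)
    schema = Schema(
        title=TEXT(stored=True),
        content=TEXT,                       # indexed but not stored
        path=ID(stored=True, unique=True),
        date=DATETIME(stored=True),
        size=NUMERIC(stored=True),
        type=ID(stored=True),
    )

    os.makedirs("indexdir", exist_ok=True)
    ix = create_in("indexdir", schema)      # create a new index in that directory

    writer = ix.writer()
    writer.add_document(title="Report", content="extracted text of the document...",
                        path="/docs/report.pdf", size=2048, type="pdf")
    writer.commit()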

3.1.2.3 Searching query

Once documents have been added to an index and a user sends a search query, Whoosh runs the query against the index. The following figure describes the searching process. Firstly, the query is parsed and converted to plain-text terms, which are then processed by the search engine's standard analyzer, whose operations include discarding punctuation, removing accents, lowercasing, removing stopwords, stemming and lemmatization. Afterwards, the search engine looks up the information in the different segments of the index and returns the list of results ordered by score. In fact, each document is ranked according to a scoring function.
There are quite a few types of scoring function supported by whoosh.
• Frequency: It simply returns the number of occurrences of the terms in the document. It does not perform any normalization or weighting.

• Tf-Idf scores: It returns tf * idf scores of each document

• BM25F scoring: It is the default ranking function used by Whoosh. BM stands for best matching. It is based on tf-idf along with a number of factors such as the length of the document in words and the average length of documents in the collection.

• Cosine scoring: It is useful for finding documents similar to your search query.

Figure 3.4: Whoosh Index searching process
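A minimal sketch of this query step, assuming the index directory created earlier and a placeholder query string, could be:

    from whoosh.index import open_dir
    from whoosh.qparser import QueryParser

    ix = open_dir("indexdir")
    with ix.searcher() as searcher:                    # BM25F ranking by default
        query = QueryParser("content", ix.schema).parse("machine learning")
        for hit in searcher.search(query, limit=10):
            print(hit["title"], hit["path"], hit.score)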

In this part, we have clarified the steps of indexing and searching with the Whoosh library and explained its functionalities. In the next step, we introduce the classification model adopted in our work.

3.2 Categorization and classification


3.2.1 Introduction
Once the document content is extracted, it is passed to a supervised machine learning
classification model that is able to predict the category of a given text.
This can be considered a text classification problem. Text classification is a common application of natural language processing (NLP), used in a wide variety of problems.

3.2.2 Input data


The dataset used in this project is the BBC News Raw Dataset. It can be downloaded
from: http://mlg.ucd.ie/datasets/bbc.html
It contains 2,225 documents from the BBC news website, dated 2004-2005, corresponding to stories in five topical areas:

• Business

• Entertainment

• Politics

• Sport

• Tech

3.2.3 Methodology
3.2.3.1 Creation of the initial dataset

The goal of this stage is to create a dataset in which each row represents a single document, with its name, content, and category stored in the columns.

3.2.3.2 Exploratory Data Analysis

When creating a classification model, one of our key concerns is whether the different
classes are balanced, meaning that each class is represented in the dataset in roughly equal proportions.
For example, if there are two classes and 95% of observations belong to one of them, a bad
classifier that always outputs the majority class would have 95% accuracy, despite failing
all minority class predictions.
There are numerous approaches to dealing with datasets that are unbalanced. To obtain
a more balanced dataset, one first strategy is to undersample the majority class and over-
sample the minority class. Another method is to use error metrics other than accuracy,
such as precision, recall, or F1-score.
By looking at our data, we can get the percentage of observations that belong to each class:

Figure 3.5: Dataset Distribution

We see that the classes are roughly balanced, so there will be no undersampling or oversampling. We will, however, employ precision and recall to evaluate model performance.

3.2.3.3 Feature Engineering

Feature engineering is an important part of any intelligent system's development. As Andrew Ng says:
“Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied
machine learning’ is basically feature engineering.”
Feature engineering is the process of changing data into features that may be used as inputs
to machine learning models, with the goal of enhancing model performance.
There are numerous methods for extracting features that reflect text data when working
with it. We’ll go over some of the most prevalent approaches before deciding which one is
best for us.

1. Text representation There are several methods that we can use to represent a text in our corpus:

(a) Word Count Vectors Every column represents a term from the corpus, and
each cell represents the frequency count of each term in each document.

(b) TF–IDF Vectors

Being:

• t: term (i.e. a word in a document)
• d: document
• TF(t, d): term frequency (i.e. how many times the term t appears in the document d)
• N: number of documents in the corpus
• DF(t): number of documents in the corpus containing the term t
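With these notations, the classic form of the TF-IDF weight of a term t in a document d is:

    tf-idf(t, d) = TF(t, d) × log(N / DF(t))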

The TF-IDF value rises in proportion to the number of times a word appears in a document and is offset by the number of documents in the corpus that contain the term. These two methods (Word Count Vectors and TF-IDF Vectors) are called Bag of Words methods, since they ignore the order of the words in a sentence.
The methods that follow are more advanced since they preserve the order of the
words and their lexical considerations in some way.

(c) Word Embeddings The position of a word within the vector space is deter-
mined by the words that surround it when it is used. Word embeddings can be
used with transfer learning models that have already been trained.

(d) Text-based or NLP-based features We can manually add any feature that we think would help us distinguish between categories (for example, word density, number of letters or words, etc.).
We can also employ NLP-based features such as Part of Speech models to deter-
mine whether a word is a noun or a verb, and then apply the PoS tag frequency
distribution.

(e) Topic Models In what is known as topic modeling, methods such as Latent
Dirichlet Allocation attempt to represent every topic by a probabilistic distri-
bution over words.
To represent the documents in our corpus, we utilized TF-IDF vectors, for the
following reasons:

• TF-IDF is a straightforward model that produces excellent results in this domain
• Creating TF-IDF features is a quick operation
• We can fine-tune the feature generation method (see the following paragraph) to minimize overfitting problems.

In addition to the previous reasons for choosing the TF-IDF method, there are other advantages to creating the features this way; for example, we can choose some parameters:

• N-gram range: we can consider unigrams, bigrams, trigrams, etc.



• Maximum/Minimum Document Frequency: We can ignore terms with a document frequency strictly higher/lower than the given threshold when generating the vocabulary.
• Maximum features: we can choose the top N features across the corpus,
ranked by term frequency.

We have chosen the following parameters:

• N-gram range: (1,2)


• Maximum DF: 100%
• Minimum DF: 10
• Maximum features: 300

We anticipate that bigrams will aid in the improvement of our model’s perfor-
mance by taking into account words that frequently appear together in doc-
uments. We picked a Minimum DF of 10 to eliminate extremely rare words
that appear in less than 10 documents, and a Maximum DF of 100 percent to
ensure that no other terms are missed. We chose 300 as the maximum number
of features because we want to avoid overfitting, which is frequently caused by
a large number of features compared to the amount of training data.
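A plausible sketch of this feature-generation step with scikit-learn, using the parameter values listed above, is shown below; the training and test text series are placeholders for the cleaned document contents:

    from sklearn.feature_extraction.text import TfidfVectorizer

    tfidf = TfidfVectorizer(
        ngram_range=(1, 2),   # unigrams and bigrams
        max_df=1.0,           # Maximum DF: 100%
        min_df=10,            # Minimum DF: 10 documents
        max_features=300,     # top 300 features ranked by term frequency
    )
    features_train = tfidf.fit_transform(train_texts).toarray()
    features_test = tfidf.transform(test_texts).toarray()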

2. Text cleaning
Before creating any feature from the raw text, we must perform a cleaning process to ensure no distortions are introduced to the model. We have followed these steps:

• Special character cleaning: special characters must be removed from the text.

• Upcase/downcase: “Book” and “book” should have the same predicting power. For that reason, each word is downcased.

• Punctuation signs: characters like “?”, “!”, “;” have been removed.

• Possessive pronouns: “Trump” and “Trump’s” should have the same predicting power.

• Stemming or Lemmatization: stemming is the practice of reducing derivative words to their base. Lemmatization is the process of reducing a word to its lemma. The main difference between the two approaches is that lemmatization returns existing words, whereas stemming returns the root, which may or may not be a word.
• Stop words: stop words have been removed.
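A minimal sketch of these cleaning steps with NLTK (assuming the stopwords and WordNet resources are already downloaded; the exact cleaning pipeline of the project may differ) could be:

    import re
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))

    def clean_text(text):
        text = text.replace("\r", " ").replace("\n", " ")   # special characters
        text = text.lower()                                  # downcase
        text = re.sub(r"'s\b", "", text)                     # possessives
        text = re.sub(r"[^a-z\s]", " ", text)                # punctuation signs
        tokens = [lemmatizer.lemmatize(tok, pos="v")         # lemmatization
                  for tok in text.split() if tok not in stop_words]
        return " ".join(tokens)

    print(clean_text("Trump's team, reading the report!"))   # trump team read report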

3. Label coding
To make a prediction, machine learning models need numeric information and labels.
As a result, we’ll need to build a dictionary to map each label to a number ID.

Figure 3.6: Label Coding
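For illustration, the mapping can be a simple Python dictionary applied to the category column of the dataset; the column names and the exact codes used in the project (those of Figure 3.6) may differ:

    category_codes = {
        "business": 0,
        "entertainment": 1,
        "politics": 2,
        "sport": 3,
        "tech": 4,
    }
    # df is the dataset DataFrame built earlier, with a 'Category' column
    df["Category_Code"] = df["Category"].map(category_codes)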

4. Train – test split


To validate the quality of our models when predicting unknown data, we need to put aside a test set. We picked a random split, with 85% of the observations making up the training set and 15% making up the test set.
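A minimal sketch of this split with scikit-learn (column names and the random seed are illustrative) is:

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        df["Content_Parsed"],      # cleaned document text
        df["Category_Code"],       # numeric labels from the previous step
        test_size=0.15,
        random_state=8,
    )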

3.2.3.4 Performance Measurement

There are numerous metrics that may be utilized to gain insight into the model perfor-
mance while dealing with classification challenges like accuracy, recall, precision and F1-
score. These metrics have a wide range of applications and are commonly used in binary
classification.
When dealing with multiclass classification, however, they become more difficult to com-
pute and interpret.
Furthermore, we simply want documents to be predicted accurately. As a result, whether our classifier is more specific or more sensitive is irrelevant to us as long as it accurately classifies as many documents as possible.

Therefore, we have studied the accuracy when comparing models and when choosing the
best hyperparameters. In the first case, we have calculated the accuracy on both training
and test sets so as to detect overfit models.
After that, we obtained the confusion matrix and classification report for each model (which computes precision, recall, and F1-score for all classes) so that we could better understand its behavior.

3.2.3.5 Best Model Selection

Figure 3.7: Classification Models Accuracy

In general, we obtain good accuracy values for each model. We can see that the Gradient Boosting, Logistic Regression, and Random Forest models are overfit because they have a high training set accuracy but a lower test set accuracy, so we discard them. The SVM classifier is chosen over the other models because it has the highest test set accuracy, which is very close to its training set accuracy. Figures 3.8 and 3.9 below show the confusion matrix and classification report of the SVM model.

Figure 3.8: SVM model Confusion Matrix

Figure 3.9: SVM model Classification report
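As an illustration of this comparison step, a sketch of fitting the SVM on the TF-IDF features and computing the reported metrics with scikit-learn could look as follows; the hyperparameters shown are defaults, not necessarily the tuned ones used for Figure 3.7:

    from sklearn import svm
    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

    svc = svm.SVC(probability=True)      # probability=True enables predict_proba later
    svc.fit(features_train, y_train)

    print("Training accuracy:", accuracy_score(y_train, svc.predict(features_train)))
    print("Test accuracy:", accuracy_score(y_test, svc.predict(features_test)))
    print(confusion_matrix(y_test, svc.predict(features_test)))
    print(classification_report(y_test, svc.predict(features_test)))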

3.2.3.6 Model Interpretation

At this point we have selected the SVM as our preferred model for making predictions, since it gave us the best results.
However, we find that the model struggles with articles that do not clearly belong to a unique class and cannot classify texts that do not fit into any of the classes.
As a result, we can set a threshold using the following logic: if the highest conditional probability is less than the threshold, no predicted label is assigned to the item; if it is higher, the corresponding label is assigned. We have fixed this threshold at 65%.
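A minimal sketch of this thresholding logic is shown below; the value 0.65 corresponds to the 65% threshold, and the sentinel -1 meaning "no label" is an illustrative choice:

    import numpy as np

    def predict_with_threshold(model, features, threshold=0.65):
        probabilities = model.predict_proba(features)
        best = np.argmax(probabilities, axis=1)
        confident = probabilities[np.arange(len(best)), best] >= threshold
        return np.where(confident, best, -1)   # -1 means "no predicted label"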

3.3 Conclusion
This chapter summarized the indexing, searching and classification tasks. The indexing and search process is carried out with the Whoosh library, which contains numerous features and affords quick and accurate retrieval of information. Furthermore, we clarified the document classification task: preparing and parsing the data, creating features from it, training several classification models and evaluating their performance in order to select the one that gives the best accuracy and efficiency.
In the next chapter, we will discuss the achievements of our project and the process of implementing and building our solution.
Chapter 4

Achievement

Introduction
After presenting the design of the project, I focus, in this last chapter, on the presentation of
the work carried out. I thus begin by introducing the hardware and software development
environment used for the implementation of the solution. Next, I present screenshots
illustrating the work done.

4.1 Work Environment


In this section, I present the environment on which the solution was developed. I first
introduce the hardware environment and then the software tools used to carry out the
project.

4.1.1 Hardware Environment

Type                HP laptop
CPU                 Intel(R) Core(TM) i5-5200U @ 2.20 GHz
RAM / Hard Disk     12.0 GB / 512 GB SSD
GPU                 Nvidia GeForce 620M
Operating System    Microsoft Windows 10 Pro

Table 4.1: Hardware environment Tools


4.1.2 Software Environment


In order to implement the solution, I used different software tools and technologies:

• Anaconda
Anaconda is a free and open-source distribution of the Python programming language; it aims to provide data science utilities with over 100 Python packages and its own package manager [9].
The distribution includes packages that are compatible with any operating system (Windows, Linux, macOS).
It is used for data science, machine learning, large-scale data processing, predictive analytics, etc.

Figure 4.1: Anaconda Logo


[13]

• Spyder
Spyder, the Scientific Python Development Environment, is a powerful scientific environment written in Python, for Python, and designed by and for scientists, engineers and data analysts.
It offers a unique combination of the advanced analysis, debugging and profiling features of a full-featured development tool with the data mining, interactive execution, deep inspection and superb visualization capabilities of a scientific software package.

Figure 4.2: Spyder Logo


[14]

• Jupyter Notebook Jupyter Notebook is a web-based environment for interactive computing built around notebook documents.
It allows working on anything related to data science: machine learning, data cleaning and transformation, data visualization, and much more.

Figure 4.3: Jupyter Notebook Logo


[15]

4.2 Achieved Work


4.2.1 Programming language and framework
4.2.1.1 Python

Python is an open-source, high-level interpreted language and offers an excellent approach to object-oriented programming. It is one of the languages most used by data scientists for various projects and applications. Python provides great functionality to handle mathematics, statistics and scientific functions.
That's why we chose it as our programming language, especially since it is the most suitable for deploying the intelligent part, which is also developed with Python using different libraries (spaCy, gensim, pandas, NumPy, SciPy, math, ...).

4.2.1.2 PYQT

For the development part, we used PyQt. PyQt is a library that lets you use the Qt GUI
framework from Python. Qt itself is written in C++; by using it from Python, you can build
applications much more quickly without sacrificing much of the speed of C++.
PyQt has the advantages of flexibility, rapid development and a clean, pragmatic design,
and it facilitates the integration of the AI part into our project.
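To illustrate how little code a PyQt window requires, the following sketch builds a minimal search window with a text field above a result list; the SearchWindow class and its layout are hypothetical simplifications, not the real application's code.

```python
import sys
from PyQt5.QtWidgets import (QApplication, QMainWindow, QWidget,
                             QVBoxLayout, QLineEdit, QListView)

class SearchWindow(QMainWindow):
    """Hypothetical, simplified main window: a search field above a result list."""
    def __init__(self):
        super().__init__()
        self.setWindowTitle("Document Search")
        central = QWidget()
        layout = QVBoxLayout(central)
        self.search_field = QLineEdit()
        self.search_field.setPlaceholderText("Type your query...")
        self.result_view = QListView()
        layout.addWidget(self.search_field)
        layout.addWidget(self.result_view)
        self.setCentralWidget(central)

if __name__ == "__main__":
    app = QApplication(sys.argv)
    window = SearchWindow()
    window.show()
    sys.exit(app.exec_())
```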

4.2.1.3 The Model/View architecture: MV model

Model-View-Controller (MVC) is a design pattern originating from Smalltalk that is often
used when building user interfaces.
MVC consists of three kinds of objects: the Model is the application object, the View
is its screen presentation, and the Controller defines the way the user interface reacts to
user input. In Qt there is a concept of a Model, much like the model in the MVC pattern,
which exposes the data to the view (the data could come from an API or from storage)
and contains all the business logic the application needs to do whatever it was designed
to do. There is also a concept of a View, again much the same as in the MVC pattern,
which describes how to present the data to the end user and is where we capture user
input. In fact, the Model View (MV) of Qt is exactly like the Model and View in the MVC
pattern, minus the need for a Controller.
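To make this concrete, here is a minimal sketch of a Qt list model exposing search results to a QListView; everything beyond the Qt API itself (the ResultListModel class and its set_results helper) is a hypothetical simplification.

```python
from PyQt5.QtCore import QAbstractListModel, QModelIndex, Qt

class ResultListModel(QAbstractListModel):
    """Hypothetical model exposing a list of result file paths to a QListView."""
    def __init__(self, results=None):
        super().__init__()
        self._results = list(results or [])

    def rowCount(self, parent=QModelIndex()):
        return len(self._results)

    def data(self, index, role=Qt.DisplayRole):
        if index.isValid() and role == Qt.DisplayRole:
            return self._results[index.row()]
        return None

    def set_results(self, results):
        """Replace the displayed results and notify the attached views."""
        self.beginResetModel()
        self._results = list(results)
        self.endResetModel()

# Usage (with the result_view of the window sketch above):
#   model = ResultListModel(["C:/docs/report.pdf", "C:/docs/notes.txt"])
#   window.result_view.setModel(model)
```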

Figure 4.4: PyQt MV architecture

4.2.2 Results: Presentation of the application desktop Interface


In this section, we will present the application through a collection of screenshots. The
user interface will be provided in a sequence that demonstrates the entry point of the
application and includes generic functionalities.
When the application is launched, the following window is displayed:

Figure 4.5: User Interface

Figure 4.5 shows the user interface. It illustrates the different functions that help us
search for the files we need.

Figure 4.6: Search Scope Pane

As figure 4.6 indicates, the first step consists in indexing the folder in which we
want to search for documents. In fact, each index corresponds to a searchable
location on the computer.
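To give an idea of how such an index is built with Whoosh, here is a minimal sketch; the schema fields (path, content, size, mtime) and the index directory name are assumptions made for the example, not necessarily the application's exact schema.

```python
import os, datetime
from whoosh.index import create_in
from whoosh.fields import Schema, ID, TEXT, NUMERIC, DATETIME

# Assumed schema: one Whoosh document per file, holding its path, extracted text,
# size in bytes and last-modification date.
schema = Schema(path=ID(stored=True, unique=True),
                content=TEXT(stored=True),
                size=NUMERIC(stored=True),
                mtime=DATETIME(stored=True))

index_dir = "indexdir"            # hypothetical on-disk location of the index
if not os.path.exists(index_dir):
    os.mkdir(index_dir)
ix = create_in(index_dir, schema)

# Each file found in the selected folder is added as one document.
writer = ix.writer()
writer.add_document(path="C:/docs/report.txt",
                    content="Example extracted text of the file...",
                    size=1234,
                    mtime=datetime.datetime.now())
writer.commit()
```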

Figure 4.7: Size and Date Filter

Figure 4.7 shows the file size and modification date filter. The user can set a size or
date range according to his needs.
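Assuming the size and mtime fields sketched above, such filters can be expressed in Whoosh as range queries combined with the user's text query; the bounds below are illustrative, not the application's exact code.

```python
import datetime
from whoosh.query import And, Term, NumericRange, DateRange

# Text query for "project", restricted to files between 1 KB and 10 MB in size
# that were modified during 2021 (the bounds are only examples).
query = And([
    Term("content", "project"),
    NumericRange("size", 1024, 10 * 1024 * 1024),
    DateRange("mtime",
              datetime.datetime(2021, 1, 1),
              datetime.datetime(2021, 12, 31)),
])

# with ix.searcher() as searcher:   # ix: the index opened earlier
#     results = searcher.search(query)
```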

Figure 4.8: Document types

The user is also allowed to choose the types of documents that he wants to search for,
as figure 4.8 shows. The supported document formats are the following (a text-extraction
sketch is given after the list):

• HTML (html, xhtml, ...)

• Image (JPG, PNG)

• Microsoft Office (doc, docx, ppt, pptx, ...)

• Portable Document Format (PDF)

• Plain Text (customizable extensions)
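Indexing these heterogeneous formats requires extracting plain text from each file first. The following is a rough, illustrative sketch of such a per-format dispatch; the pypdf and python-docx libraries named here are assumptions made for the example, not necessarily the application's dependencies.

```python
from pathlib import Path

def extract_text(path):
    """Rough sketch of per-format text extraction before indexing.
    The pypdf and python-docx libraries are assumptions made for this example."""
    ext = Path(path).suffix.lower()
    if ext in {".txt", ".html", ".xhtml"}:
        return Path(path).read_text(errors="ignore")
    if ext == ".pdf":
        from pypdf import PdfReader                      # assumed dependency
        return "\n".join(page.extract_text() or ""
                         for page in PdfReader(path).pages)
    if ext == ".docx":
        from docx import Document                        # python-docx, assumed dependency
        return "\n".join(p.text for p in Document(path).paragraphs)
    return ""                                            # other formats handled elsewhere
```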

Figure 4.9: Document Categories

According to figure 4.9, the user can also narrow his search by choosing the main topic
of the document. Five classes are available: Business, Technology, Sport, Politics and
Entertainment.
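One simple way to support this filter, assuming the predicted category is stored as an extra field of the Whoosh schema (an assumption for illustration, not necessarily the application's exact design), is to combine the user's text query with a Term query on that field:

```python
from whoosh.query import And, Term

# Restrict the text search to documents previously classified as "sport";
# the "category" field is assumed to hold the SVM's predicted label.
query = And([Term("content", "transfer"),
             Term("category", "sport")])
```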

Figure 4.10: Search field

Figure 4.10 shows the search field where the user enters the words to search for.
Whoosh gives us multiple choices when searching. Its major features are:
Boolean operators:

• AND operator: AND is the default relation between terms, so writing "word AND
project" is the same as writing "word project"

• OR operator

• NOT operator

Inexact terms:

• Wildcard expressions, with ? representing a single character and * representing
any number of characters

• Fuzzy queries, which tolerate misspellings and find words similar to a given word.
For example, searching for house may turn up documents containing words like
houses or horse, etc.
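A minimal sketch of how these query features are used through Whoosh's QueryParser follows; the field and index names come from the earlier indexing sketch and are assumptions, not the application's exact code.

```python
from whoosh.index import open_dir
from whoosh.qparser import QueryParser, FuzzyTermPlugin

ix = open_dir("indexdir")                       # index built earlier (assumed location)
parser = QueryParser("content", schema=ix.schema)
parser.add_plugin(FuzzyTermPlugin())            # enables fuzzy syntax such as house~

with ix.searcher() as searcher:
    # Boolean, wildcard and fuzzy syntax are all handled by the parser:
    #   "word AND project", "netw?rk", "doc*", "house~"
    query = parser.parse("word AND project")
    results = searcher.search(query, limit=20)
    for hit in results:
        print(hit["path"])
```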

Figure 4.11: Result Pane



After entering the query, as shown in figure 4.11, the result pane displays the search results.
These are the files that contain the words the user entered in the search field.

Figure 4.12: Document Content

When we click on one of the files that appear in the result pane, a window opens and
displays the text of the selected file. As figure 4.12 shows, the searched words are
highlighted.
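Whoosh can produce such highlighted excerpts directly from the stored content field; a short sketch under the same assumptions as the indexing example above:

```python
from whoosh.index import open_dir
from whoosh.qparser import QueryParser

ix = open_dir("indexdir")
with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse("project")
    results = searcher.search(query)
    for hit in results:
        # highlights() returns excerpts of the stored text with the matched
        # terms wrapped in HTML tags, ready to be displayed in the UI.
        print(hit["path"])
        print(hit.highlights("content"))
```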

Conclusion
In this chapter, we outlined the work environment: both the hardware and software
components were presented. Furthermore, a detailed presentation of the implementation
was provided.
Conclusion

This report outlines the work accomplished as part of a graduation internship project: a
Document Search and Classification Desktop Application developed at Audis Services. It is
a desktop application that performs full-text search on a computer, can instantly find files
containing text in different formats, and classifies them into classes using a machine
learning algorithm.

We started by presenting our project description, which includes the host organization, the
problem statement and the proposed solution. Then we focused on the theoretical background
necessary to understand the relevant concepts related to our project. We discovered the
information retrieval (IR) domain and its concepts, and then the indexing process, which is
of major importance for search engine applications since it enables fast and accurate
information retrieval. In our project, we used the Whoosh Python library for indexing and
searching.

After that, we presented the supervised machine learning algorithms used for multi-class
classification. We then trained these models and evaluated the performance of each of them,
and chose the SVM model since it gives the best accuracy.

Finally, we ended our work with the achievement phase, in which we presented the hardware
and software technologies used in the process of building the solution, in addition to
screenshots showing the application we developed with its main functionalities. During this
internship, we learned to implement a solution from scratch using good coding and testing
practices, and it is fair to state that we were able to overcome all of the difficulties
that occurred.

Despite the technical constraints and challenges we faced, we achieved the objectives and
met the requirements of the application.

In later stages, we intend to make the application able to:

• Search for files located on connected computers on the same network.

• Search for data stored in cloud storage.


Bibliography

[1] Ian H. Witten, Alistair Moffat, Timothy C. Bell, "Managing Gigabytes: Compressing
and Indexing Documents and Images", May 1999.

[2] Zhiwang Cen, Jungang Xu, Jian Sun, "SoDesktop: a Desktop Search Engine",
International Conference on Communication Systems and Network Technologies, 2012.

[3] Rujia Gao, "Application of Full Text Search Engine Based on Lucene", Advances in
Internet of Things, January 2012.

[4] "Whoosh Documentation", https://whoosh.readthedocs.io/en/latest

Netography

[1] https://www.inov-tech.fr/ [Accessed 6 August 2020]
[2] https://www.inov-tech.fr/ [Accessed 6 August 2020]
[3] https://www.lesbonscomptes.com/recoll/
[4] https://copernic.com/en/desktop/
[5] https://merehead.com/blog/how-to-use-jira-for-project-management/
[6] https://devopedia.org/information-retrieval
[7] http://mocilas.github.io/2015/11/18/Python-Inverted-Index-for-dummies/
[8] https://medium.com/geekculture/introduction-to-stemming-and-lemmatization-nlp-3b7617d84e65
[9] https://medium.com/swlh/nlp-preprocessing-task-c5e6c0837a15
[10] https://www.researchgate.net/publication/337407116_Pre-evacuation_Time_Estimation
[11] https://towardsdatascience.com/
[12] https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761
[13] https://mrmint.fr/installer-environnement-python-machine-learning-anaconda
[14] https://www.researchgate.net/figure/Python-in-Spyder-environment-Figure-2-R-Studio-for-KNN-classification_fig1_342263126
[15] https://technology.amis.nl/data-analytics/quickest-way-to-try-out-jupyter-notebook-zero-install-3-cli-commands-and-5-minutes-to-action/

Annex A: Use Case Diagram

Annex B: System Sequence Diagram

