Rapport Firas Naimi
Topic: Document Search and Classification Desktop Application
Realized by:
Naimi Firas
Academic Supervisor:
Mrs. Messaoudi Houyem
Professional Supervisor:
Mr. Kais Chaabouni
Audis Services
Supervisors’ signatures
ACKNOWLEDGEMENTS
The success of this project is the result of the effective guidance of the Audis team, who took the time to share their expertise and knowledge with me. My sincere appreciation goes to my Audis supervisor, Mr. Kais Chaabouni, for his guidance, advice, assistance and good humor. I would like to thank my academic advisor, Mrs. Messaoudi Houyem, for her collaboration, and I acknowledge with much appreciation the honorable jury members for taking the time to examine my modest work. Additional thanks go to all those who encouraged me throughout this period.
Abstract
The rise of big data and the advancement of technology lead to an ever-increasing demand for a personalized search engine able to search the huge amount of data residing in personal computers. A desktop search engine is used to search files or data on a user's personal system. It provides an efficient way of searching for and retrieving the desired data and information. In this project, we present a desktop search and classification application that is capable of quickly indexing and searching documents located on a personal computer (PC) and of classifying them into categories according to their content using a Support Vector Machine (SVM) machine learning classification model.
Keywords: Desktop search engine, index, search, IR, IRS, TF-IDF, ML, SVM, automatic text classification, vector representation.
Résumé — L’essor du big data et les progrès de la technologie entraînent une demande
croissante pour un moteur de recherche personnalisé permettant de rechercher les énormes
quantités de données résidant dans les ordinateurs personnels. Un moteur de recherche de
bureau est utilisé pour rechercher des fichiers ou des données dans les systèmes personnels
d’un utilisateur. Il comprend un moyen efficace de rechercher et de récupérer les données
et les informations souhaitées. Dans ce projet, nous présentons une application de bureau
de recherche et de classification qui est capable d’indexer et de rechercher rapidement des
documents situés dans un ordinateur et de les classer selon des catégories en fonction de
leur contenu à l’aide d’un modèle d’apprentissage automatique de classification Support
Vector Machine (SVM).
Mots clés : Moteur de recherche de bureau, index, recherche, RI, SRI, TF-IDF, ML, SVM, classification automatique de texte, représentation vectorielle.
Contents
Introduction
1 Internship Context
1.1 Host Institution
1.2 Project Overview
1.2.1 Problem statement
1.2.2 Study of existing text and document search tools
1.2.2.1 Recoll
1.2.2.2 Copernic Desktop
1.2.3 Project Presentation
1.3 Comparison between existing tools
1.4 Methodology Adopted
3 Methodology
3.1 Indexing and Searching – Whoosh
3.1.1 Whoosh overview
3.1.2 Indexing process
3.1.2.1 Parsing documents
3.1.2.2 Creating Indexed Data
3.1.2.3 Searching query
3.2 Categorization and classification
3.2.1 Introduction
3.2.2 Input data
3.2.3 Methodology
3.2.3.1 Creation of the initial dataset
3.2.3.2 Exploratory Data Analysis
3.2.3.3 Feature Engineering
3.2.3.4 Performance Measurement
3.2.3.5 Best Model Selection
3.2.3.6 Model Interpretation
3.3 Conclusion
4 Achievement
4.1 Work Environment
4.1.1 Hardware Environment
4.1.2 Software Environment
4.2 Achieved Work
4.2.1 Programming language and framework
4.2.1.1 Python
4.2.1.2 PyQt
4.2.1.3 The model/view architecture: MV Model
4.2.2 Results: Presentation of the desktop application interface
Conclusion
Acronyms / Abbreviations
IR Information Retrieval
IRS Information Retrieval System
ML Machine Learning
NLP Natural Language Processing
PC Personal Computer
SVM Support Vector Machine
TF-IDF Term Frequency–Inverse Document Frequency
UI User Interface
General Introduction
The rise of big data and the advancement of technology lead to an ever-increasing demand for a personalized search engine able to search the huge amount of data residing in personal computers. Data has grown by multiple orders of magnitude in different dimensions and has become a major key in any business decision made by companies. Information is mainly stored on computers, but the role of computers has changed over the last decades. We no longer use computers for just their raw computing abilities: they also serve as communication devices, multimedia players, and media storage devices. Those uses require the ability to quickly find a specific piece of data. For that reason, companies aim to have desktop tools that search data quickly using efficient algorithms. In other words, companies must be able to find data in their personal computers' storage rapidly by using performant applications. Efficient software is a tool that searches data with low time complexity and optimal resource usage. However, current applications suffer from several problems:
- Existing efficient apps are expensive. In other words, tools that offer all the features we need cost a lot of money.
- Desktop apps are often not open source. As a result, if we need to develop a custom feature and integrate it into the existing solution, this may not be possible.
- A further problem arises when an increasing portion of the information is non-textual. Tools must offer the possibility to search by image on personal computers.
- The tools currently used cannot classify files by field. We sometimes want to search business files only, so a file classification technique is needed. This is crucial for Audis' business.
As a result, Audis has chosen to develop its own solution based on well-known efficient algorithms. It does this for many reasons, such as the need for an open-source solution which is internal, cheap, and offers developers the possibility to add custom features based on their needs. In addition, search by image and file classification techniques must be integrated into the application we want to develop. For that, we need a desktop application that:
- Uses efficient algorithms in the search, indexing and classification processes
- Can be extended by adding custom features
- Integrates file classification and image search techniques.
This manuscript illustrates the work done in four chapters, ordered as follows:
• The first chapter presents the general framework of the project, describes the problem studied and presents a critical study of existing solutions, highlighting their limits. It also introduces the proposed solution to achieve the targeted objectives as well as the work methodology adopted for its realization.
• The second chapter presents the concepts related to the project.
• The third chapter explains the details of the methods used to design and implement our application.
• The last chapter presents the realization part of the project in order to concretize the work through detailed screenshots.
• In closing, a general conclusion presents the results of this project and possible prospects for improvement.
Chapter 1
Internship Context
Introduction
This chapter gives a broad overview of the internship and the circumstances under which
the project has been accomplished. First of all, I will present the hosting company and its
fields of business. After that, I’ll concentrate on the problem statement and the project
presentation. Finally, I will discuss the methods used to meet the project’s objectives.
Audis Services has many well-known clients around the world. Figure 1.2 shows the
distribution of Audis Services’ clients.
when we want to add custom features. Even if they are open source, most of these projects are abandoned; hence, there are no bug fixes, no maintenance and no updates.
- Existing tools do not include searching for words in images or the possibility to retrieve files by the class to which they belong.
These problems lead to the development of an internal tool to search for words in personal computers' storage and to integrate the classification and search-by-image techniques.
Recoll is a personal text search tool for Unix and Linux. It is based on the powerful Xapian [ref] indexing engine, for which it offers an easy-to-use, rich QT graphical interface. Recoll is free open-source software whose source code is available under the GPL license.
Recoll characteristics:
- Runs on most Unix-based systems.
- Handles most common document types, messages and their attachments.
- Powerful search functions, with Boolean expressions, phrases and proximity, wildcards, and filtering on file types or location.
Copernic Desktop Search is constantly evolving to find more and more content on your PC, while maintaining its primary goal: finding your files and emails quickly. Its intuitive interface and well-organized results eliminate time wasted searching for your PC's content.
Copernic Characteristics:
• Deep searches: Copernic can index and search over 150 file types and emails—on
desktops, network, virtual desktop and even the cloud!
• Fast results: Once a search index is created, take advantage of fast search results that appear on the screen.
• Robust Security: Whether you use your Desktop or Server Search solution, your data
will always remain extremely secure.
In our project requirements, we need a system which enables us to classify files in a given storage location and to detect the occurrence of a word in a given image. It is also important that it works on both Unix and Windows. These requirements cannot be guaranteed with the existing tools; for that reason, we have to develop our own tool.
Conclusion
In this chapter, I have presented the host company where this project was carried out. Then I described the general context of the project, as well as the problem statement and the project presentation. Finally, I discussed the methodology used to complete the work. I will start the preliminary background study in the next chapter.
Chapter 2
Introduction
This background and theory chapter clarifies the relevant concepts related to our project. First of all, we will define Information Retrieval and its aspects. Then we will explain the indexing and searching processes. Finally, we will present the state of the art of AI-based techniques, such as Natural Language Processing and the classification machine learning models that give good results for the categorization task.
2.1.1 Definitions
Several definitions of information retrieval have emerged over the years; we quote in this context the following definitions:
- Definition 1: Information retrieval is considered as the set of techniques allowing the selection, from a collection of documents, of those that are likely to meet the user's needs.
An Information Retrieval System (IRS) is a computer system that returns, from a set of documents, those whose contents best correspond to a user's information need, expressed as a query.
Document
The document is the elementary container of information, exploitable by and accessible to the IRS. A document can be a text, a web page, an image, a video, etc. In our context, we concentrate on text-based documents (PDF, Word, PPT, TXT, images, etc.).
Query
A query is the expression of the user’s need for information.
2.1.3 Indexing
Text indexing is the process of extracting statistics considered important from a text in
order to reflect the information provided and/or to allow quick searches on its content. Text
indexing processes can be conducted on nearly any type of textual information, including
source code for computer programs, DNA or protein databases, and textual data storage,
in addition to plain language texts.
The goal of storing an index is to improve speed and performance when searching for
relevant content. Without an index, the search engine would have to go through each
document in the corpus, which would take a long time and a lot of computer power.
For example, a sequential scan of every word in 10,000 huge documents could take hours,
whereas an index of 10,000 documents can be searched in moments.
There are numerous indexing methods available; here are the two most popular:
Full-Text Indexes
Full-text indexes are simple to create. The system examines every word of the document
and produces an index of each term and its location as part of this procedure. Full-text
indexes need a lot of storage space, despite the fact that they are easier to process.
Field-Based Indexes
Field-based indexes are a quick and easy approach to find information in a database.
This type of indexing allows the user to look for specific information about each document.
The field could be a date, time, or any other designated area, for example.
The inverted index is a database index that stores a mapping from content, such as words or integers, to its locations in a database, a document, or a set of documents. An inverted index is used to enable fast full-text searches. There are two categories of inverted indexes:
a) A record-level inverted index contains a list of document references for each term.
b) A word-level inverted index additionally includes the positions of each word inside a document. The latter provides more capability, but it requires more processing power and storage space.
In our work we will focus on the word-level inverted index. To better understand how it works, let us review the simple example in figure 2.2:
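To complement figure 2.2, the idea of a word-level inverted index can be sketched in a few lines of Python. This is only an illustrative sketch with a made-up two-document corpus, not the index structure actually used by our application:

from collections import defaultdict

def build_word_level_index(documents):
    """Build a word-level inverted index: term -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, token in enumerate(text.lower().split()):
            index[token][doc_id].append(position)
    return index

docs = {1: "never give up", 2: "give it up and never look back"}
index = build_word_level_index(docs)
print(dict(index["never"]))   # {1: [0], 2: [4]} -> document ids and word positions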
2.1.4.1 Tokenisation
Tokenization is the process of breaking down a large chunk of text into smaller tokens that can be words, characters, or subwords. Tokenization can thus be classified into three categories: word, character, and subword (n-gram) tokenization. The most often used tokenization algorithm is word tokenization. It divides a chunk of text into distinct words using a delimiter; different word-level tokens are created depending on the delimiter.
Considering the following sentence: "Never give up".
The sentence’s tokenization yields three tokens: Never/give/up.
2.1.4.2 Stop words
Stopwords are words that tend to appear often in all documents of a collection and do not provide information about a document's content. In other words, they have no semantic significance. In English, for example, these include the terms "of", "the", "for", etc.
2.1.4.3 Normalization
Normalization consists of reducing an inflected word to its canonical form. There are two types of normalization: stemming and lemmatization.
a) Lemmatization is the process of transforming a word into its dictionary form, such as "reading" => "read", "finds" => "find", and "thought" => "think".
b) Stemming is the process of converting a word into its root form by removing the word's ending. It is similar to lemmatization, but it cannot handle irregular verbs. It can, however, handle words that are not in the dictionary.
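As an illustration of these preprocessing steps, the following sketch uses the NLTK library (an assumption made for the example; the report does not prescribe a specific library) to tokenize a sentence, remove stopwords, then apply stemming and lemmatization:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# one-time downloads; resource names may vary slightly across NLTK versions
nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

tokens = nltk.word_tokenize("He found the documents while reading".lower())   # tokenization
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]                           # stopword removal

print([PorterStemmer().stem(t) for t in tokens])                    # stemming: 'documents' -> 'document'
print([WordNetLemmatizer().lemmatize(t, pos="v") for t in tokens])  # lemmatization: 'found' -> 'find'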
2.2.1 Definition
Natural language processing (NLP) is a branch of linguistics, computer science, and arti-
ficial intelligence that studies how computers interact with human language, particularly
how to develop an algorithm that can process and evaluate huge amounts of natural lan-
guage data. This enables computers to comprehend both the content of documents and
the language’s internal contextual nuances. NLP technology is capable of accurately ex-
tracting information and meanings from documents, as well as categorizing and organizing
the documents themselves. [2]
• Sentence Segmentation: This entails separating the text into individual sentences. Coding a sentence segmentation model may be as simple as splitting the text whenever a punctuation mark appears. Newer NLP pipelines typically use more complex algorithms that work even when a document is not correctly organized.
• Tokenisation
• Stopword removal
• Supervised learning: the model is trained on annotated data and is required to infer knowledge from the input features and map them to an output class. The main goal is to be able, by the end of the learning process, to correctly predict the outputs of new instances.
Therefore, text classification is a supervised machine learning problem where the dataset is labelled. There are different algorithms to solve a classification problem. To obtain the best result and the most accurate predictions, we have to test different ML models to see which one best matches the data and captures the relationships between the points and their labels. We will give a quick explanation of the logic behind each model.
Accuracy: The accuracy metric calculates the proportion of correct predictions over the total number of instances evaluated.
Precision: Precision measures how many of the patterns predicted as positive are actually positive.
Recall: Recall measures the proportion of positive patterns that are correctly classified.
F1-score: If we need to strike a balance between precision and recall, the F1-score can be a preferable metric to employ. It is the harmonic mean of precision and recall and therefore provides a good estimate of the overall quality of a model.
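In terms of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), these metrics correspond to the standard formulas:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 × (Precision × Recall) / (Precision + Recall)

In the multi-class setting, precision, recall and F1-score are computed per class and then averaged (for example with macro or weighted averaging).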
Conclusion
In this chapter we have first presented information retrieval as a research discipline that integrates models and techniques, such as indexing and searching, whose goal is to facilitate a user's access to relevant information. Secondly, we explained AI-based techniques for classification: we started by introducing NLP, its application areas and common text preprocessing tasks, and we then presented different ML classification models and their performance. In the following chapter, we are going to explain the details of the methods used to design and implement our application.
Chapter 3
Methodology
Introduction
This chapter defines the methods used to reach the project goals. The work is divided into two main steps: indexing and searching files, and preprocessing and training classification models, followed by the selection of the best model after performance evaluation.
Introduction
To search for relevant content in text-based documents quickly and obtain good results, we have to use a full-text search engine tool. It should provide efficient and precise search algorithms to collect, parse and store data in an index, in order to facilitate fast and accurate information retrieval. In our case, we adopted Python as the programming language; the suitable tool that we can use is therefore the Whoosh library.
Whoosh is a fast, pure-Python search engine library that provides powerful indexing and querying functions. Because of these powerful functions, it is widely used by users who want to develop their own search engines.
Whoosh consists of the two components that make up a search engine: indexing and searching.
First, all the text and metadata extracted from documents originating from different sources, such as images, Docx or PDF files, are indexed in order to produce a common format. Preparing a common format makes the search process convenient, as the documents are processed by an analyzer and turned into tokens before actually being indexed.
Secondly, when a user enters a query, Whoosh parses it with its query parser and creates search criteria, which are used to run the Query object against the index.
Finally, the items of data that meet the search criteria are returned to the user as Document objects. This process is described in figure 3.5 below.
1. PDF parsing
PDF is probably one of the most commonly used document formats in most offices. It stands for Portable Document Format and was developed by Adobe. The main goal was to be able to exchange information platform-independently while preserving and protecting the content and layout of a document. As a result, PDFs are hard to edit and it is difficult to extract information from them, although not impossible. There are different tools with different methodologies and functionalities available in Python for PDF text extraction, such as PyPDF2, PyMuPDF and PDFMiner. Both PDFMiner and PyPDF2 are pure Python libraries. In contrast, PyMuPDF is based on MuPDF, a lightweight but extensive PDF viewer. This is a huge advantage when it comes to handling difficult PDFs, and it claims to be significantly faster than PDFMiner and PyPDF2 in various tasks. For these reasons, we have chosen it.
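A minimal sketch of how PyMuPDF can be used for this step is given below; the file name is a placeholder and this is not the application's full parsing code:

import fitz  # PyMuPDF

def extract_pdf_text(path):
    """Return the plain text of every page of a PDF, concatenated."""
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)

text = extract_pdf_text("example.pdf")   # placeholder file name
print(text[:200])                        # first characters of the extracted text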
After extracting all the text and metadata from documents originating from different sources, such as images, Docx or PDF files, they should be indexed in order to produce a common format. Preparing a common format makes the search process convenient, easy and fast. Therefore, the "schema" of the index has to be defined.
The schema defines the list of fields to be indexed or stored for each text file, similar to the way a schema is defined for a database. A field is a piece of information about each document in the index, such as its title or text content. Indexing a field means it can be searched, and it is also returned with the results if it is declared with the argument (stored=True) in the schema. In our case, the schema includes fields such as title, content, path, date, size and type.
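A minimal sketch of such a schema and of adding one document to the index with Whoosh could look as follows; the index directory and the field values are illustrative, not taken from the application:

import os
from whoosh.fields import Schema, TEXT, ID, DATETIME, NUMERIC
from whoosh.index import create_in

schema = Schema(
    title=TEXT(stored=True),
    content=TEXT(stored=True),
    path=ID(stored=True, unique=True),
    date=DATETIME(stored=True),
    size=NUMERIC(stored=True),
    type=ID(stored=True),
)

os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)            # create the index on disk
writer = ix.writer()
writer.add_document(title="example", content="sample text extracted from the document",
                    path="C:/docs/example.pdf", type="pdf")
writer.commit()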
When a document has been added to the index and a user sends a search query, Whoosh runs the query against the index to determine which documents match. The following figure describes the searching process. First, the query is parsed and converted into a list of plain-text terms, which are then processed by the standard analyzer of the search engine and scored against the index. Among the available scoring options:
• Cosine scoring: it is useful for finding documents similar to your search query.
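A minimal search sketch with Whoosh, assuming the index created in the previous sketch (the query string is illustrative; by default Whoosh ranks results with BM25F, and other weightings can be passed to the searcher):

from whoosh.index import open_dir
from whoosh.qparser import QueryParser

ix = open_dir("indexdir")                                  # open the existing index
with ix.searcher() as searcher:                            # default scoring: BM25F
    query = QueryParser("content", ix.schema).parse("search engine")
    for hit in searcher.search(query, limit=10):
        print(hit["path"], hit.score)                      # stored field + relevance score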
In this part, we have clarified the steps of indexing and searching with the Whoosh library and explained its functionalities. In the next step, we will introduce the classification model adopted in our work.
• Business
• Entertainment
• Politics
• Sport
• Tech
3.2.3 Methodology
3.2.3.1 Creation of the initial dataset
The goal of this stage is to create a dataset in which each row represents a single document, with its name, content and category stored in the columns.
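Assuming, purely for illustration, that the raw documents are stored in one folder per category (a layout the report does not specify), the dataset could be assembled with pandas as follows; the last line also prints the class proportions used in the balance check discussed below:

import os
import pandas as pd

rows = []
for category in ["business", "entertainment", "politics", "sport", "tech"]:
    folder = os.path.join("dataset", category)                  # hypothetical folder layout
    for file_name in os.listdir(folder):
        with open(os.path.join(folder, file_name), encoding="utf-8", errors="ignore") as f:
            rows.append({"File_Name": file_name, "Content": f.read(), "Category": category})

df = pd.DataFrame(rows)
print(df["Category"].value_counts(normalize=True))              # percentage of observations per class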
When creating a classification model, one of our key concerns is whether the different classes are balanced, that is, whether each class is represented in the dataset in roughly equal proportions.
For example, if there are two classes and 95% of the observations belong to one of them, a bad classifier that always outputs the majority class would reach 95% accuracy, despite failing on all minority-class predictions.
There are numerous approaches to dealing with unbalanced datasets. A first strategy to obtain a more balanced dataset is to undersample the majority class and oversample the minority class. Another is to use error metrics other than accuracy, such as precision, recall, or the F1-score.
By looking at our data, we can get the percentage of observations that belong to each class:
We observe that the classes are roughly balanced, therefore no undersampling or oversampling will be performed. We will, however, use precision and recall to evaluate model performance.
1. Text representation
There are several methods that we can use to represent the texts in our corpus:
(a) Word Count Vectors: every column represents a term from the corpus, and each cell represents the frequency count of that term in each document.
(b) TF-IDF Vectors: each cell holds the TF-IDF weight of a term in a document, defined as TF-IDF(t, d) = TF(t, d) × IDF(t), where TF(t, d) is the frequency of term t in document d and IDF(t) = log(N / DF(t)), N being the number of documents in the corpus and DF(t) the number of documents containing t. The TF-IDF value thus rises in proportion to the number of times a word appears in a document and is offset by the number of documents in the corpus that contain the term. These two methods (Word Count Vectors and TF-IDF Vectors) are called Bag of Words methods, since they ignore the order of the words in a sentence.
The methods that follow are more advanced since they preserve the order of the
words and their lexical considerations in some way.
(c) Word Embeddings: the position of a word within the vector space is learned from the words that surround it when it is used. Word embeddings can be used with transfer learning models that have already been trained.
(d) Text-based or NLP-based features: we can manually add any feature that we think would help us distinguish between categories (for example, word density, number of letters or words, etc.).
We can also employ NLP-based features such as Part of Speech models to deter-
mine whether a word is a noun or a verb, and then apply the PoS tag frequency
distribution.
(e) Topic Models In what is known as topic modeling, methods such as Latent
Dirichlet Allocation attempt to represent every topic by a probabilistic distri-
bution over words.
To represent the documents in our corpus, we use TF-IDF vectors, for the following reasons: we anticipate that bigrams will help improve our model's performance by taking into account words that frequently appear together in documents; we picked a minimum DF of 10 to eliminate extremely rare words that appear in fewer than 10 documents, and a maximum DF of 100 percent to ensure that no other terms are missed; and we chose 300 as the maximum number of features because we want to avoid the overfitting frequently caused by a large number of features compared to the amount of training data. A sketch of these settings is given after this list.
2. Text cleaning
Before creating any feature from the raw text, we must perform a cleaning process to ensure that no distortions are introduced to the model. We followed these steps:
• Punctuation signs: characters like “?”, “!”, “;” have been removed.
• Possessive pronouns: "Trump" and "Trump's" should have the same predictive power.
3. Label coding
To make a prediction, machine learning models need numeric information and labels.
As a result, we’ll need to build a dictionary to map each label to a number ID.
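Continuing the hypothetical DataFrame from the earlier sketch, the feature-engineering choices described above (unigrams and bigrams, minimum DF of 10, maximum DF of 100%, at most 300 features) and the label coding could be expressed with scikit-learn as follows; the variable names and the train/test split parameters are illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df["Content"], df["Category"], test_size=0.2, random_state=8)

# unigrams + bigrams, minimum DF = 10, maximum DF = 100 %, at most 300 features
tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=10, max_df=1.0, max_features=300)
features_train = tfidf.fit_transform(X_train).toarray()
features_test = tfidf.transform(X_test).toarray()

# label coding: map each category name to a numeric id
category_codes = {"business": 0, "entertainment": 1, "politics": 2, "sport": 3, "tech": 4}
labels_train = y_train.map(category_codes)
labels_test = y_test.map(category_codes)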
There are numerous metrics that may be used to gain insight into model performance when dealing with classification problems, such as accuracy, recall, precision and F1-score. These metrics have a wide range of applications and are commonly used in binary classification. When dealing with multiclass classification, however, they become more difficult to compute and interpret.
Furthermore, we simply want documents to be predicted accurately. As a result, whether our classifier is more specific or more sensitive is irrelevant to us, as long as it accurately classifies as many documents as possible.
Therefore, we studied accuracy when comparing models and when choosing the best hyperparameters. In the first case, we calculated the accuracy on both the training and test sets so as to detect overfit models. After that, we obtained the confusion matrix and the classification report for each model (which computes precision, recall and F1-score for all classes) so that we could better understand their behavior.
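A sketch of this comparison with scikit-learn is shown below, continuing the previous sketch; the hyperparameters are illustrative defaults, not the values actually tuned for the project:

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

models = {
    "SVM": SVC(probability=True, random_state=8),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=8),
    "Gradient Boosting": GradientBoostingClassifier(random_state=8),
}

for name, model in models.items():
    model.fit(features_train, labels_train)
    train_acc = accuracy_score(labels_train, model.predict(features_train))
    test_acc = accuracy_score(labels_test, model.predict(features_test))
    print(f"{name}: train accuracy {train_acc:.3f}, test accuracy {test_acc:.3f}")

# confusion matrix and per-class report for the retained model
best = models["SVM"]
print(confusion_matrix(labels_test, best.predict(features_test)))
print(classification_report(labels_test, best.predict(features_test)))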
In general, we obtain good accuracy values for each model. We can see that the Gradient Boosting, Logistic Regression and Random Forest models are overfit, because they have a high training set accuracy but a low test set accuracy, therefore we discard them. The SVM classifier is chosen over the other models because it has the highest test set accuracy, which is very close to its training set accuracy. The following figures show the confusion matrix and the classification report of the SVM model.
At this point we have selected SVM as our preferred model for making predictions, since it gave us the best results. However, we find that the model fails to classify articles that do not clearly belong to a unique class and cannot reject text that does not fit into any of the classes.
As a result, we can set a threshold with the following logic: if the highest conditional probability is lower than the threshold, no predicted label is assigned to the item; if it is higher, the corresponding label is assigned. We set this threshold to 65%.
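Continuing the earlier sketches, this thresholding logic could be implemented as follows; predict_proba requires the SVM to have been trained with probability estimates enabled, as in the comparison sketch above:

import numpy as np

THRESHOLD = 0.65   # minimum conditional probability required to assign a label

def predict_with_threshold(model, features, id_to_category):
    """Return the predicted category, or None when the model is not confident enough."""
    probabilities = model.predict_proba(features)
    results = []
    for row in probabilities:
        top = int(np.argmax(row))
        results.append(id_to_category[top] if row[top] >= THRESHOLD else None)
    return results

id_to_category = {v: k for k, v in category_codes.items()}
print(predict_with_threshold(best, features_test[:5], id_to_category))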
3.3 Conclusion
This chapter summarized the indexing, searching and classification tasks. The indexing and search process is carried out with the Whoosh library, which offers numerous features and affords fast searching and retrieval of information. Furthermore, we clarified the document classification task: preparing and parsing the data, creating features from it, training several classification models and evaluating their performance in order to select the one that gives the best accuracy and efficiency.
In the next chapter, we will discuss the achievements of our project and the process of implementing and building our solution.
Chapter 4
Achievement
Introduction
After presenting the design of the project, I focus, in this last chapter, on the presentation of
the work carried out. I thus begin by introducing the hardware and software development
environment used for the implementation of the solution. Next, I present screenshots
illustrating the work done.
Type: HP laptop
CPU: Intel(R) Core(TM) i5-5200U @ 2.20 GHz
RAM / Hard Disk: 12 GB / 512 GB SSD
GPU: Nvidia GeForce 620M
Operating System: Microsoft Windows 10 Pro
• Anaconda
Anaconda is a free and open-source distribution of the Python programming language. It aims to provide data science utilities with over 100 Python packages and its own package manager [9]. The distribution includes packages that are compatible with any operating system (Windows, Linux, macOS). It is used for data science, machine learning, large-scale data processing, predictive analytics, etc.
• Spyder
Spyder, the Scientific Python Development Environment, is a powerful scientific environment written in Python, for Python, and designed by and for scientists, engineers and data analysts. It offers a unique combination of the advanced analysis, debugging and profiling features of a full-featured development tool with the data exploration, interactive execution, deep inspection and superb visualization capabilities of a scientific software package.
Python is an open-source, high-level interpreted language and offers an excellent approach to object-oriented programming. It is one of the languages most used by data scientists for various projects and applications. Python provides great functionality to handle mathematics, statistics and scientific functions.
That is why we chose it as our programming language, especially since it is the most suitable for deploying the intelligent part, which is also developed in Python using different libraries (spaCy, gensim, pandas, NumPy, SciPy, math, etc.).
4.2.1.2 PyQt
For the development part, we used PyQt. PyQt is a library that lets you use the Qt GUI framework from Python. Qt itself is written in C++; by using it from Python, you can build applications much more quickly without sacrificing much of the speed of C++. PyQt has the advantages of flexibility, rapid development and a clean, pragmatic design, and it facilitates the integration of the AI part of our project.
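As a minimal illustration of PyQt and its model/view approach (assuming PyQt5; the listed file names are placeholders and this is not the application's actual interface code):

import sys
from PyQt5.QtCore import QStringListModel
from PyQt5.QtWidgets import QApplication, QListView

app = QApplication(sys.argv)

# model/view separation: the model holds the result data,
# the view only displays whatever the model contains
model = QStringListModel(["report.pdf", "notes.docx", "budget.xlsx"])   # placeholder results
view = QListView()
view.setModel(model)
view.setWindowTitle("Search results")
view.show()

sys.exit(app.exec_())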
Figure 4.5 shows the user interface. It illustrates the different functions that help us search for the files we need.
As figure 4.6 indicates, the first step consists in indexing the folder in which we want to search for documents. In fact, each index corresponds to a searchable location on the computer.
Figure 4.7 shows the file size and modified-date filter. The user can set a size or date range according to his needs.
The user is also allowed to choose the types of documents he wants to search for, as figure 4.8 shows. The supported document formats are:
According to figure 4.9, the user can also refine his search by choosing the main topic of the document. Five classes are available: Business, Technology, Sport, Politics and Entertainment.
Figure 4.10 below shows the search field where the user enters the word to search for. Whoosh gives us multiple choices when searching. Its major features are:
Boolean operators:
• AND operator: AND is the default relation between terms, so writing "work AND project" is the same as writing "work project".
• OR operator
• NOT operator
Inexact terms:
• Fuzzy queries ignore misspellings and find words that are similar to a given word. For example, searching for house will turn up documents containing words like houses, horse, etc.
After submitting the query, as shown in figure 4.11, the result pane displays the search results. These are the files that contain the word the user entered in the search field. When we click on one of the files that appear in the result pane, a display window opens and shows the text of the selected file. As figure 4.12 shows, the words searched for are highlighted.
Conclusion
In this chapter, we outlined the work environment: both the hardware and software com-
ponents were presented. Furthermore, a detailed explanation of the implementation was
introduced.
Conclusion
This report outlines the work accomplished as part of a graduate internship project: a Document Search and Classification Desktop Application developed at Audis Services. It is a desktop application that performs full-text search on a computer, can instantly find files containing a given text across different file types, and classifies them into classes using a machine learning algorithm.
We started by presenting our project description, which includes the hosting organization, the problem statement and the proposed solution. Then we focused on the theoretical background necessary to understand the relevant concepts related to our project. We explored the information retrieval (IR) domain and its concepts, and then discovered the indexing process, which is of major importance for search engine applications since it facilitates fast and accurate information retrieval. In our project, we used the Whoosh Python library for indexing and searching.
After that, we presented the supervised machine learning algorithms used for multi-class classification. We then trained these models and evaluated the performance of each of them. We chose the SVM model since it gives us the best accuracy.
Finally, we ended our work with the achievement phase, in which we presented the hardware and software technologies used in building the solution, in addition to screenshots showing the application I developed with its main functionalities. During this internship, we learned to implement a solution from scratch using good coding and testing practices, and it is fair to state that we were able to overcome all of the difficulties that occurred.
Regardless of the technical constraints and challenges faced, we achieved the objectives and met the requirements of the application. In the later stages, we intend to make the application able to:
[1] Ian H. Witten, Alistair Moffat, Timothy C. Bell, "Managing Gigabytes: Compressing and Indexing Documents and Images", May 1999.
[2] Zhiwang Cen, Jungang Xu, Jian Sun, "SoDesktop: A Desktop Search Engine", International Conference on Communication Systems and Network Technologies, 2012.
[3] Rujia Gao, "Application of Full Text Search Engine Based on Lucene", Advances in Internet of Things, January 2012.
Netography
Annex A: Use Case Diagram
Annex B: System Sequence Diagram