Disease Inspection Identification For Food Using Machine Learning Algorithms
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE ENGINEERING
KALLAM HARANADHAREDDY INSTITUTE OF TECHNOLOGY
AN ISO 9001:2015 CERTIFIED INSTITUTION ACCREDITED BY NBA &
NAAC WITH ‘A’ GRADE
(APPROVED BY AICTE, AFFILIATED TO JNTUK, KAKINADA)
CERTIFICATE
This is to certify that the project work entitled "DISEASE INSPECTION IDENTIFICATION FOR
FOOD USING MACHINE LEARNING ALGORITHMS", being submitted by Venkata Reddy
Pulagm (178X1A0582), Bharath Kumar Sunkara (178X1A0599), Thota Mahesh (178X1A05A7), and
Vishnu Teja (178X1A05B6) in partial fulfilment of the requirements for the award of the degree of
Bachelor of Technology in Computer Science Engineering at Kallam Haranadhareddy Institute of
Technology, is a bonafide record of the work carried out by them.
External Examiner
DECLARATION
This is a record of bonafide work carried out by us and the results embodied in
this project have not been reproduced or copied from any source. The results
embodied in this project have not been submitted to any other university for the
award of any other degree.
ACKNOWLEDGEMENT
We are profoundly grateful and express our deep sense of gratitude and respect towards our
honourable chairman, our grandfather Sri KALLAM HARANADHA REDDY, Chairman of the
Kallam Group, for his precious support to the college.
We are thankful to Dr. M. UMA SANKAR REDDY, Director, KHIT, GUNTUR for his
encouragement and support for the completion of the project.
We are much thankful to Dr. B. SIVA BASIVI REDDY, Principal, KHIT, GUNTUR, for his
support throughout the completion of the project.
We are greatly indebted to Dr. K. VENKATA SUBBA REDDY, Professor & Head,
Department of Computer Science and Engineering, KHIT, GUNTUR, for providing the laboratory
facilities to the fullest extent as and when required, and also for giving us the opportunity to carry
out the project work in the college.
We are also thankful to our Project Coordinator Mr. N. Md. Jubair Basha.
We extend our deep sense of gratitude to our Internal Guide Dr. Md. Sirajuddin, and to the other
faculty members and support staff, for their valuable suggestions, guidance and constructive ideas at
each and every step, which were indeed of great help towards the successful completion of our
project.
ABSTRACT
Suitable nutritional diets have been widely recognized as important measures to prevent and control
non-communicable diseases (NCDs). However, there is currently little research on which nutritional
ingredients in food are beneficial to the rehabilitation of NCDs. In this paper, we thoroughly analyzed
the relationship between nutritional ingredients and diseases by using data mining methods. First,
more than 10 diseases were selected and we collected the recommended food ingredients for each
disease. Then, referring to Indian food nutrition data, we proposed an improved system using Random
Forest, Decision Tree, Gaussian Naïve Bayes and KNN algorithms to find out which nutritional
ingredients can exert positive effects on diseases, using rough sets to select the core ingredients. To the
best of our knowledge, this is the first study to discuss the relationship between nutritional ingredients
in food and diseases through machine learning based on a dataset from India. The experiments on
real-life data show that our method based on machine learning improves the performance compared
with the traditional CNN approach, with a highest accuracy of 0.97. Additionally, for some common
diseases such as acne, angina, cardiovascular and ovarian diseases, stroke, tooth decay, asthma, liver
disease, oral cancers, hypertension and kidney stones, our work is able to predict the disease based on
the first three nutritional ingredients in food that can benefit the rehabilitation of those diseases. These
experimental results demonstrate the effectiveness of applying machine learning to the selection of
nutritional ingredients in food for disease analysis.
TABLE OF CONTENTS
1.1 Introduction
CHAPTER 4: SOFTWARE REQUIREMENT SPECIFICATION
5.1 Python
5.6 Variables
5.9 Datasets
6.1 Introduction
6.2 Normalization
6.5 UML Diagrams
CHAPTER 8: CONCLUSION
CHAPTER - 1
1. INTRODUCTION
NCDs are chronic diseases, which are mainly caused by occupational and environmental factors, lifestyles
and behaviours, and include obesity, diabetes, hypertension, tumours and other diseases. According to the
Global Status Report on Non-communicable Diseases issued by the WHO, the annual death toll from NCDs
keeps rising, which has caused a serious economic burden to the world. About 40 million people die
from NCDs each year, which is equivalent to 70% of the global death toll. Statistics on Chinese residents'
chronic diseases and nutrition show that the number of patients suffering from NCDs in China is
higher than in any other country in the world, and the current prevalence rate has risen sharply. In
addition, the population aged 60 or over in China has reached 230 million, and about two-thirds of them are
suffering from NCDs according to the official statistics. Therefore, relevant departments in each country,
and especially in India, such as medical colleges, hospitals and disease research centres, are all concerned about
NCDs. Suitable nutritional diets play an important role in maintaining health and preventing the occurrence
of NCDs. With the gradual recognition of this concept, India has also repositioned the impact of food on
health. However, research on the nutritional ingredients in food that are conducive to the rehabilitation of
diseases, carried out using machine learning, is still rare in India. At present, India has just begun the IT
(Information Technology) construction of smart health-care. Most studies on the relationship between nutritional
ingredients in food and diseases are still carried out through expensive precision instruments or long-term
clinical trials. In addition, there are also many prevention reports, but they study only one or a few diseases.
In India, studying the relationship between nutritional ingredients and diseases using data mining is immature.
Most doctors only recommend specific foods to patients suffering from NCDs, without giving any relevant
nutrition information, especially about the nutritional ingredients in food. The solutions for NCDs require
interdisciplinary knowledge. In the era of big data, data mining has become an essential way of discovering
new knowledge in various fields, especially in disease prediction and accurate health-care (AHC). It has
become a core support for preventive medicine, basic medicine and clinical medicine research. With respect
to disease analysis through the mining of nutritional ingredients in food, we mainly make the following
contributions: (i) we extracted data related to Chinese diseases, and the corresponding recommended food and
taboo food for each disease, from as many medical and official websites as possible, to create a valuable
knowledge base that is available online; (ii) we applied machine learning to find out which nutritional
ingredients in food can exert positive effects on diseases; (iii) since the data in this paper is continuous and
has no decision attributes, we proposed machine learning models such as random forest, decision tree, KNN
and Gaussian naïve Bayes, which can better select the corresponding core ingredients from the positive
nutritional ingredients in food. The structure of this paper is organized as follows. Section II reviews the
related work in the field of disease analysis and machine learning. Section III describes the specific data
mining algorithms used in this paper, the reasons why we selected them, and two evaluation indexes.
Section IV elaborates the data, experimental results and analysis in detail. Section V presents a discussion
of the methods. Finally, some conclusions and potential future research directions are also discussed.
Problem Statement:
The existing system for performing disease analysis using the CNN approach has low accuracy and
high complexity. To avoid these problems, our proposed system uses different machine learning
models, such as random forest, decision tree, KNN and Gaussian naïve Bayes, for the analysis, which
gives results with better accuracy and efficiency.
CHAPTER – 2
2.REQUIREMENTS:
• RAM: 4GB
• Processor: Intel i3
• Software : Anaconda
• Jupyter IDE
CHAPTER - 3
3. SYSTEM ANALYSIS
The following Python libraries are used in this project:
1. Numpy
2. Pandas
3. Matplotlib
4. Scikit –learn
1. Numpy:
NumPy is the fundamental package for numerical computing in Python. Its main object is the
homogeneous multidimensional array, and it provides tools for working with these arrays.
2. Pandas:
Pandas is an open-source library providing high-performance, easy-to-use data structures and data
analysis tools for Python. Python with Pandas is used in a wide range of fields, including academic and
commercial domains such as finance, economics, statistics and analytics.
3. Matplotlib:
Matplotlib is a plotting library that produces high-quality figures. For simple plotting, the pyplot
module provides a MATLAB-like interface, particularly when combined with IPython. For the power
user, you have full control of line styles, font properties, axes properties, etc., via an object-oriented
interface or via a set of functions familiar to MATLAB users.
4. Scikit-learn:
Scikit-learn is a free machine learning library for Python. It provides implementations of classification
algorithms such as random forest, decision tree, k-nearest neighbours and naïve Bayes, and is built on
NumPy, SciPy and Matplotlib.
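As an illustration of how these four libraries work together in this project, the following is a minimal sketch; the file name food_nutrition.csv and the column name Disease are placeholders rather than the actual project files:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# pandas: load the dataset (placeholder file name)
df = pd.read_csv('food_nutrition.csv')

# numpy arrays of nutritional ingredient features and disease labels (placeholder column)
X = df.drop('Disease', axis=1).values
y = df['Disease'].values

# scikit-learn: split the data and train a classifier
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestClassifier()
model.fit(x_train, y_train)
print('Test accuracy:', model.score(x_test, y_test))

# matplotlib: quick bar chart of how many samples each disease has
pd.Series(y).value_counts().plot(kind='bar')
plt.show()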
3.2 EXISTING SYSTEM:
The existing system only provides a novel approach which can estimate the nutritional
ingredients of food items by analyzing an input image of the food item. This system works with
different deep learning techniques and models to obtain accurate nutritional components.
Conversely, these models, which use images as input, are unstable at certain times and require
advanced techniques to predict the output. The complexity of this model is higher, and it is time
consuming.
3.3 PROPOSED SYSTEM:
In the proposed system, we can identify the disease that a person may be affected by due to the lack
of certain ingredients in the body. To avoid this problem, we recommend food according to the
body's intake, based on the type of food consumed, the minerals, and the amount of food (in grams)
that the human body consumes. However, the earlier studies are basically carried out through
long-term clinical trials, which just recommend food for certain specific diseases, and they seldom
study the relationship between nutritional ingredients and disease inspection using machine
learning techniques.
Inputs:
Importing all the required packages, such as numpy, pandas, matplotlib, scikit-learn, and the
required machine learning algorithm packages.
Outputs:
Preprocessing the imported data frame by imputing nulls with the related information.
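A small sketch of this input/output step, assuming a pandas data frame loaded from a CSV file (the file name dataset.csv is a placeholder):

import numpy as np
import pandas as pd

df = pd.read_csv('dataset.csv')       # import the data frame
print(df.isnull().sum())              # count the nulls in each column

# impute nulls with related information: the mode for text columns,
# the column mean for numeric columns
for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        df[col] = df[col].fillna(df[col].mean())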
3.5 PROCESS MODELS USED WITH JUSTIFICATION
SDLC stands for Software Development Life Cycle. A Software Development Life Cycle is
essentially a series of steps, or phases, that provide a model for the development and lifecycle
management of an application or piece of software. SDLC is the process consisting of a series of
planned activities to develop or alter the software products.
The intent of an SDLC process is to help produce a product that is cost-efficient, effective,
and of high quality. Once an application is created, the SDLC maps the proper
deployment and decommissioning of the software once it becomes a legacy system.
The SDLC methodology usually contains the following stages: Analysis (requirements
and design), construction, testing, release, and maintenance (response). Veracode makes
it possible to integrate automated security testing into the SDLC process through use of
its cloud-based platform.
1. Requirements Gathering:
2. Analysis:
In this phase, based upon the client requirements, we prepare a document called the
"High Level Design Document". It contains the abstract and the functional requirements.
3. Design:
Since it is difficult for all the members to understand the High Level Design Document,
we use a "Low Level Design Document" which can be understood easily. To design this
document we use UML (Unified Modelling Language). In this we have use case,
sequence, collaboration and other diagrams.
4. Coding:
5. Testing:
After development, we have to check whether the client requirements are
satisfied or not. If not, we go back and develop again.
6. Implementation:
In the testing phase, if the client requirements are satisfied, we go for implementation,
i.e. we need to deploy the application on some server.
7. Maintenance:
After deployment, if at all any problems come from the client side, we provide support
and maintenance for the application.
DESIGN PRINCIPLES & METHODOLOGY:
The fundamental difference between OOA and OOD is that the former models
the problem domain, leading to an understanding and specification of the problem,
while the latter models the solution to the problem. That is, analysis deals with the
problem domain, while design deals with the solution domain. However, OOAD
subsumed the solution domain representation. That is, the solution domain
representation, created by OOD, generally contains much of the representation
created by OOA. The separating line is a matter of perception, and different people
have different views on it. The lack of clear separation between analysis and design
can also be considered one of the strong points of the object oriented approach; the
transition from analysis to design is "seamless". This is also the main reason why, in OOAD
methods, analysis and design are both performed together.
The main difference between OOA and OOD, due to the different domains of
modeling, is in the type of objects that come out of the analysis and design process.
Features of OOAD:
• All objects can be represented graphically including the relation between them.
• All Key Participants in the system will be represented as actors and the actions done by
them will be represented as use cases.
• A typical use case is nothing but a systematic flow of a series of events which can be well
described using sequence diagrams, and each event can be described diagrammatically by
activity as well as state chart diagrams.
• So the entire system can be well described using the OOAD model, hence this model is
chosen as the SDLC model.
Preliminary investigation examines project feasibility, the likelihood the system will be
useful to the organization. The main objective of the feasibility study is to test the Technical,
Operational and Economical feasibility for adding new modules and debugging old running
systems. All systems are feasible if they have unlimited resources and infinite time. There are
three aspects in the feasibility study portion of the preliminary investigation:
● Technical Feasibility
● Operational Feasibility
● Economical Feasibility
A system that can be developed technically, and that will be used if installed, must still be a
good investment for the organization. In economic feasibility, the development cost of
creating the system is evaluated against the ultimate benefit derived from the new system.
Financial benefits must equal or exceed the costs.
The system is economically feasible. It does not require any additional hardware or
software. Since the interface for this system is developed using the existing resources and
technologies available, there is nominal expenditure, and economic feasibility is certain.
Proposed projects are beneficial only if they can be turned into information systems
that will meet the organization's operating requirements. Operational feasibility aspects of
the project are to be taken as an important part of the project implementation. Some of the
important issues raised to test the operational feasibility of a project include the
following:
The well-planned design would ensure the optimal utilization of the computer resources
and would help in the improvement of performance status.
3.9 TECHNICAL FEASIBILITY
The technical issues usually raised during the feasibility stage of the investigation include
the following:
CHAPTER - 4
4. SOFTWARE REQUIREMENT SPECIFICATION
PURPOSE
In software engineering, the same meanings of requirements apply, except that the focus
of interest is the software itself.
4.1 FUNCTIONAL REQUIREMENTS
The functional requirements of the system are:
• Data analysis
• Data preprocessing
• Model building
• Prediction
4.2 NON FUNCTIONAL REQUIREMENTS
Introduction to Flask
Flask is a Web development framework that saves you time and makes
Web development a joy. Using Flask, you can build and maintain high quality Web
applications with minimal fuss. At its best, Web development is an exciting, creative act; at
its worst, it can be a repetitive, frustrating nuisance. Flask lets you focus on the fun stuff —
the crux of your Web application — while easing the pain of the repetitive bits. In doing so, it
provides high-level abstractions of common Web development patterns, shortcuts for frequent
programming tasks, and clear conventions for how to solve problems. At the same time, Flask
tries to stay out of your way, letting you work outside the scope of the framework as needed.
The goal of this book is to make you a Flask expert. The focus is twofold. First, we explain, in
depth, what Flask does and how to build Web applications with it. Second, we discuss higher-
level concepts where appropriate, answering the question “How can I apply these tools
effectively in my own projects?” By reading this book, you’ll learn the skills needed to
develop powerful Web sites quickly, with code that is clean and easy to maintain.
Flask is a prominent member of a new generation of Web frameworks. So what exactly does
that term mean? To answer that question, let’s consider the design of a Web application
written using the Common Gateway Interface (CGI) standard, a popular way to write Web
applications circa 1998. In those days, when you wrote a CGI application, you did everything
yourself — the equivalent of baking a cake from scratch. For example, here’s a simple CGI
script, written in Python, that displays the ten most recently published books from a database:
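As an illustration only, such a script might look roughly like the following sketch; the database file books.db, its books table and its columns are hypothetical and are not part of this project:

#!/usr/bin/env python
# Illustrative CGI sketch only: list the ten most recently published books.
# The database, table and column names below are hypothetical.
import sqlite3

print("Content-Type: text/html\n")
print("<html><head><title>Books</title></head><body>")
print("<h1>The 10 most recently published books</h1><ul>")

connection = sqlite3.connect("books.db")
cursor = connection.cursor()
cursor.execute("SELECT name FROM books ORDER BY pub_date DESC LIMIT 10")
for (name,) in cursor.fetchall():
    print("<li>%s</li>" % name)
print("</ul></body></html>")

connection.close()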
With a one-off dynamic page such as this one, the write-it-from-scratch approach
isn’t necessarily bad. For one thing, this code is simple to comprehend — even a
novice developer can read these 16 lines of Python and understand all it does, from
start to finish. There’s nothing else to learn; no other code to read. It’s also simple to
deploy: just save this code in a file called latestbooks.cgi, upload that file to a Web
server, and visit that page with a browser. But as a Web application grows beyond
the trivial, this approach breaks down, and you face a number of problems:
Should a developer really have to worry about printing the “Content-Type” line and
remembering to close the database connection? This sort of boilerplate reduces
programmer productivity and introduces opportunities for mistakes. These setup-
and teardown-related tasks would best be handled by some common infrastructure.
CHAPTER : 5
5. LANGUAGES OF IMPLEMENTATION
5.1 Python
What Is A Script?
Up to this point, I have concentrated on the interactive programming capability of Python. This is
a very useful capability that allows you to type in a program and to have it executed immediately
in an interactive mode.
Basically, a script is a text file containing the statements that comprise a Python program. Once
you have created the script, you can execute it over and over without having to retype it each
time.
Perhaps, more importantly, you can make different versions of the script by modifying the
statements from one file to the next using a text editor. Then you can execute each of the
individual versions. In this way, it is easy to create different programs with a minimum amount of
typing.
Just about any text editor will suffice for creating Python script files. You can use Microsoft Notepad,
Microsoft WordPad, Microsoft Word, or just about any word processor if you want to.
Script:
Scripts are distinct from the core code of the application, which is usually written in a different
language, and are often created or at least modified by the end-user. Scripts are often interpreted
from source code or bytecode, whereas the applications they control are traditionally compiled to
native machine code.
Program:
The program has an executable form that the computer can use directly to execute the
instructions.
The same program also exists in its human-readable source code form, from which
executable programs are derived (e.g., compiled).
Python
What is Python?
Chances are you are asking yourself this. You may have found this book because
you want to learn to program but don't know anything about programming
languages. Or you may have heard of programming languages like C, C++, C#,
or Java and want to know what Python is and how it compares to those "big name"
languages. Hopefully I can explain it for you.
Python concepts
If you're not interested in the hows and whys of Python, feel free to skip to the
next chapter. In this chapter I will try to explain to the reader why I think Python is
one of the best languages available and why it's a great one to start programming
with.
• Python is Interpreted − Python is processed at runtime by the interpreter. You do
not need to compile your program before executing it. This is similar to PERL and PHP.
• Python is Interactive − You can actually sit at a Python prompt and interact
with the interpreter directly to write your programs.
Python was developed by Guido van Rossum in the late eighties and early nineties
at the National Research Institute for Mathematics and Computer Science in the
Netherlands. Python is derived from many other languages, including ABC,
Modula-3, C, C++, Algol-68, SmallTalk, and Unix shell and other scripting
languages.
Python is copyrighted. Like Perl, Python source code is freely available; it is distributed
under the Python Software Foundation License, an open-source licence.
• Easy-to-learn − Python has few keywords, simple structure, and a clearly defined
syntax. This allows the student to pick up the language quickly.
• Easy-to-read − Python code is more clearly defined and visible to the eyes.
• Easy-to-maintain − Python's source code is fairly easy to maintain.
• A broad standard library − The bulk of Python's library is very portable and
cross-platform compatible on UNIX, Windows, and Macintosh.
• Interactive Mode − Python has support for an interactive mode which allows interactive
testing and debugging of snippets of code.
• Portable − Python can run on a wide variety of hardware platforms and has the
same interface on all platforms.
• Extendable − You can add low-level modules to the Python interpreter. These
modules enable programmers to add to or customize their tools to be more
efficient.
• Scalable − Python provides a better structure and support for large programs
than shell scripting.
Apart from the above-mentioned features, Python has a big list of good features, a few
of which are listed below −
• It provides very high-level dynamic data types and supports dynamic type checking.
• It supports automatic garbage collection.
• It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.
Types
Python is a dynamically typed language. Many other languages are statically typed,
such as C/C++ and Java. A statically typed language requires the programmer to
explicitly tell the computer what type of "thing" each data value is.
For example, in C if you had a variable that was to contain the price of something,
you would have to declare the variable as a “float” type.
This tells the compiler that the only data that can be used for that variable must be
a floating point number, i.e. a number with a decimal point.
Python, however, doesn’t require this. You simply give your variables names and
assign values to them. The interpreter takes care of keeping track of what kinds of
objects your program is using. This also means that you can change the size of the
values as you develop the program. Say you have another decimal number (a.k.a. a
floating point number) you need in your program.
With a static typed language, you have to decide the memory size the variable can
take when you first initialize that variable. A double is a floating point value that
can handle a much larger number than a normal float (the actual memory sizes
depend on the operating environment). If you declare a variable to be a float but
later on assign a value that is too big to it, your program will fail; you will have to
go back and change that variable to be a double. With Python, it doesn’t matter. You
simply give it whatever number you want and Python will take care of manipulating
it as needed. It even works for derived values.
For example, say you are dividing two numbers. One is a floating point number and
one is an integer. Python realizes that it’s more accurate to keep track of decimals so
it automatically calculates the result as a floating point number.
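A short example of this behaviour:

# The same name can refer to different kinds of objects; Python keeps track of the type.
x = 5             # an integer
x = x / 2         # now a float (2.5); no declaration or resizing was needed
x = "hello"       # now a string
print(type(x))    # <class 'str'>

# Mixing a float and an integer gives a float result automatically.
print(7 / 2)      # 3.5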
5.6 Variables
Variables are nothing but reserved memory locations to store values. This means
that when you create a variable you reserve some space in memory.
Based on the data type of a variable, the interpreter allocates memory and decides
what can be stored in the reserved memory. Therefore, by assigning different data
types to variables, you can store integers, decimals or characters in these variables.
The data stored in memory can be of many types. For example, a person's age is stored as a
numeric value and his or her address is stored as alphanumeric characters. Python
has various standard data types that are used to define the operations possible on
them and the storage method for each of them.
• Numbers
• String
• List
• Tuple
• Dictionary
Python Numbers
Number data types store numeric values. Number objects are created when you
assign a value to them.
Python Strings
Strings in Python are identified as a contiguous set of characters represented within
quotation marks.
Python Lists
Lists are the most versatile of Python's compound data types. A list contains items
separated by commas and enclosed within square brackets ([]). To some extent, lists
are similar to arrays in C. One difference between them is that all the items
belonging to a list can be of different data type.
The values stored in a list can be accessed using the slice operator ([ ] and [:]) with
indexes starting at 0 in the beginning of the list and working their way to end -1. The
plus (+) sign is the list concatenation operator, and the asterisk (*) is the repetition
operator.
Python Tuples
A tuple is another sequence data type that is similar to the list. A tuple consists of a
number of values separated by commas. Unlike lists, however, tuples are enclosed
within parentheses.
The main differences between lists and tuples are: Lists are enclosed in brackets ( [
] ) and their elements and size can be changed, while tuples are enclosed in
parentheses ( ( ) ) and cannot be updated. Tuples can be thought of as read-only
lists.
Python Dictionary
Python's dictionaries are a kind of hash table. They work like associative arrays
or hashes found in Perl and consist of key-value pairs. A dictionary key can be
almost any Python type, but keys are usually numbers or strings. Values, on the
other hand, can be any arbitrary Python object.
Dictionaries are enclosed by curly braces ({ }) and values can be assigned and
accessed using square braces ([]).
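A short example showing these standard data types:

age = 21                                    # number
name = "spinach"                            # string
nutrients = ["iron", "calcium", 2.7]        # list: items of different types, mutable
point = (3, 4)                              # tuple: like a list but read-only
food = {"name": "spinach", "iron_mg": 2.7}  # dictionary: key-value pairs

nutrients[0] = "fibre"            # lists can be updated in place
print(nutrients + ["vitamin A"])  # + concatenates, * repeats
print(point[0], food["iron_mg"])  # indexing and key access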
The normal mode is the mode where the scripted and finished .py files are run in
the Python interpreter.
Interactive mode is a command line shell which gives immediate feedback for each
statement, while running previously fed statements in active memory. As new lines
are fed into the interpreter, the fed program is evaluated both in part and in whole
20 Python libraries
1. Requests. The most famous HTTP library, written by Kenneth Reitz. It's a must-have
for every Python developer.
6. BeautifulSoup. I know it's slow, but this XML and HTML parsing library is very useful
for beginners.
9. SciPy. When we talk about NumPy then we have to talk about SciPy. It is a library of
algorithms and mathematical tools for Python and has caused many scientists to switch
from Ruby to Python.
11. Pygame. Which developer does not like to play games and develop them? This library
will help you achieve your goal of 2D game development.
12. Pyglet. A 3D animation and game creation engine. This is the engine in which the
famous Python port of Minecraft was made.
15. Scapy. A packet sniffer and analyzer for Python, made in Python.
17. nltk. Natural Language Toolkit – I realize most people won't be using this one, but it's
generic enough. It is a very useful library if you want to manipulate strings, but its
capacity is beyond that. Do check it out.
20. IPython. I just can't stress enough how useful this tool is. It is a Python prompt on
steroids. It has completion, history, shell capabilities, and a lot more. Make sure that
you take a look at it.
Numpy:
NumPy’s main object is the homogeneous multidimensional array. It is a table of elements
(usually numbers), all of the same type, indexed by a tuple of positive integers. In NumPy
dimensions are called axes. The number of axes is rank.
It offers MATLAB-like capabilities within Python.
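A brief example of a NumPy array, its axes and its MATLAB-like vectorised arithmetic:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])  # a rank-2 (two-axis) array, all elements of the same type
print(a.ndim)    # 2 axes
print(a.shape)   # (2, 3)
print(a.mean())  # 3.5; operations apply to the whole table at once
print(a * 2)     # element-wise arithmetic, as in MATLAB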
Matplotlib
• High-quality plotting library.
DataSets
The DataSet object is similar to the ADO Recordset object, but more
powerful, and with one other important distinction: the DataSet is always
disconnected. The DataSet object represents a cache of data, with database-
like structures such as tables, columns, relationships, and constraints.
However, though a DataSet can and does behave much like a database, it is
important to remember that DataSet objects do not interact directly with
databases, or other source data. This allows the developer to work with a
programming model that is always consistent, regardless of where the source
data resides. Data coming from a database, an XML file, from code, or user
input can all be placed into DataSet objects. Then, as changes are made to
the DataSet they can be tracked and verified before updating the source data.
The GetChanges method of the DataSet object actually creates a second
DataSet that contains only the changes to the data. This DataSet is then used
by a DataAdapter (or other objects) to update the original data source.
The DataSet has many XML characteristics, including the ability to produce and
consume XML data and XML schemas. XML schemas can be used to describe
schemas interchanged via WebServices. In fact, a DataSet with a schema can
actually be compiled for type safety and statement completion.
CHAPTER : 6
6. SYSTEM DESIGN
6.1 INTRODUCTION
Software design sits at the technical kernel of the software engineering process and is applied
regardless of the development paradigm and area of application. Design is the first step in the
development phase for any engineered product or system. The designer’s goal is to produce a
model or representation of an entity that will later be built. Once system requirements
have been specified and analyzed, system design is the first of the three technical activities
(design, code and test) that are required to build and verify the software.
The importance of design can be stated with a single word: "Quality". Design is the place where
quality is fostered in software development. Design provides us with representations of
software that can be assessed for quality. Design is the only way that we can accurately translate a
customer's view into a finished software product or system. Software design serves as the
foundation for all the software engineering steps that follow. Without a strong design we risk
building an unstable system, one that will be difficult to test and whose quality cannot be
assessed until the last stage.
6.2 NORMALIZATION
Normalization is the process of organizing the data in a database so as to avoid the problems that
can arise due to data redundancy, i.e. repetition of data in the database, to maintain data integrity,
and to handle the problems that can arise due to insertion, update and deletion anomalies.
Insertion anomaly: Inability to add data to the database due to the absence of other data.
Deletion anomaly: Unintended loss of data due to deletion of other data.
Update anomaly: Data inconsistency resulting from data redundancy and partial update.
Normal Forms: These are the rules for structuring relations that eliminate anomalies.
A relation is said to be in first normal form if the values in the relation are atomic for every
attribute in the relation. By this we mean simply that no attribute value can be a set of values
or, as it is sometimes expressed, a repeating group.
A relation is said to be in second normal form if it is in first normal form and it satisfies the
following rule:
3) Every non-key attribute is fully functionally dependent on the full set of the primary key.
Transitive Dependency: If two non-key attributes depend on each other as well as on the
primary key, then they are said to be transitively dependent.
The above normalization principles were applied to decompose the data into multiple tables,
thereby ensuring that the data is maintained in a consistent state.
6.3 E – R DIAGRAMS
• The relations within the system are structured through a conceptual ER diagram, which not only
specifies the existential entities but also the standard relations through which the system exists
and the cardinalities that are necessary for the system state to continue.
• The Entity Relationship Diagram (ERD) depicts the relationship between the data objects. The
ERD is the notation that is used to conduct the data modeling activity; the attributes of each data
object noted in the ERD can be described using a data object description.
• The set of primary components that are identified by the ERD are data objects, attributes,
relationships and various type indicators.
The primary purpose of the ERD is to represent data objects and their relationships.
6.4 DATA FLOW DIAGRAMS
A data flow diagram is a graphical tool used to describe and analyze the movement of data
through a system. These are the central tool and the basis from which the other components
are developed. The transformation of data from input to output, through processes, may be
described logically and independently of the physical components associated with the system.
These are known as logical data flow diagrams. The physical data flow diagrams show the
actual implementation and movement of data between people, departments and workstations. A
full description of a system actually consists of a set of data flow diagrams. The data flow
diagrams are developed using two familiar notations, Yourdon and Gane & Sarson. Each
component in a DFD is labeled with a descriptive name. Processes are further identified with a
number that will be used for identification purposes. The development of DFDs is done in
several levels. Each process in a lower-level diagram can be broken down into a more detailed
DFD in the next level. The top-level diagram is often called the context diagram. It consists of a
single process, which plays a vital role in studying the current system. The process in the
context-level diagram is exploded into other processes at the first-level DFD.
The idea behind the explosion of a process into more processes is that the understanding at
one level of detail is exploded into greater detail at the next level. This is done until no further
explosion is necessary and an adequate amount of detail is described for the analyst to understand
the process.
Larry Constantine first developed the DFD as a way of expressing system requirements in a
graphical form; this led to the modular design.
A DFD, also known as a "bubble chart", has the purpose of clarifying system requirements
and identifying major transformations that will become programs in system design. So it is the
starting point of the design, down to the lowest level of detail. A DFD consists of a series of
bubbles joined by data flows in the system.
1. The DFD shows the flow of data, not of control: loops and decisions are control considerations
and do not appear on a DFD.
2. The DFD does not indicate the time factor involved in any process, i.e. whether the data flow
takes place daily, weekly, monthly or yearly.
TYPES OF DATA FLOW DIAGRAMS
1. Current Physical
2. Current Logical
3. New Logical
4. New Physical
CURRENT PHYSICAL:
In the current physical DFD, process labels include the names of people or their positions, or the
names of the computer systems that might provide some of the overall system processing; the
labels include an identification of the technology used to process the data. Similarly, data flows
and data stores are often labelled with the names of the actual physical media on which data are
stored, such as file folders, computer files, business forms or computer tapes.
CURRENT LOGICAL:
The physical aspects of the system are removed as much as possible so that the current system
is reduced to its essence: the data and the processes that transform them, regardless of their
actual physical form.
NEW LOGICAL:
This is exactly like the current logical model if the user were completely happy with the
functionality of the current system but had problems with how it was implemented. Typically,
the new logical model will differ from the current logical model by having additional functions,
obsolete functions removed, and inefficient flows reorganized.
NEW PHYSICAL:
The new physical represents only the physical implementation of the new system.
6.5 UML Diagrams
6.5.1 Use Case Diagram
[Use case diagram: the user interacts with use cases for Data Understanding, Predictive Learning, Model Building, Data Analytics (EDA), Trained Dataset, Model Evaluation, Particular Data and Train/Test Split over the Dataset.]
EXPLANATION:
The primary purpose of a use case diagram is to show which system functions are performed for which
actor, and the roles of the actors in the system can be depicted. The above diagram consists of the user as an
actor; each actor plays a particular role to achieve the concept.
6.5.2 Class Diagram
[Class diagram: a Model Evaluation class with a dataset attribute and the traineddataset() and particulardata() methods.]
EXPLANATION
This class diagram represents how the classes, with their attributes and methods, are linked together to
perform the verification with security. The above diagram shows the different classes involved in our project.
6.5.3 Object Diagram
[Object diagram: object instances involved in Model Evaluation.]
EXPLANATION:
The above diagram describes the flow of objects between the classes. It is a diagram that shows a complete
or partial view of the structure of a modelled system. This object diagram represents how the classes, with
their attributes and methods, are linked together to perform the verification with security.
6.5.4 Component Diagram
[Component diagram: Trained Dataset and Particular Data components.]
EXPLANATION:
A component provides the set of required interfaces that a component realizes or implements. These are the
static diagrams of the Unified Modelling Language. Component diagrams are used to represent the working
and behaviour of the various components of a system.
6.5.5 Deployment Diagram
[Deployment diagram: nodes for Data Understanding, Data Analysis (EDA), Train Data, Model Building and Model Evaluation.]
EXPLANATION:
A UML deployment diagram is a diagram that shows the configuration of run-time processing nodes and the
components that live on them. Deployment diagrams are a kind of structure diagram used in modelling the
physical aspects of an object-oriented system. They are regularly used to model the static deployment view
of a system.
6.5.6 STATE DIAGRAM
[State diagram: states include Dataset, Split Data, Model-Building Phase, Machine Learning and Particular Data.]
EXPLANATION:
State diagrams are a loosely defined diagram technique used to show workflows of stepwise activities and
actions, with support for choice, iteration and concurrency. State diagrams require that the system described
is composed of a finite number of states; sometimes this is indeed the case, while at other times it is a
reasonable abstraction. Many forms of state diagrams exist, which differ slightly and have different semantics.
6.5.7 Sequence Diagram
[Sequence diagram: messages include Datasets Transfers, Datasets, Training Data, Predictive Learning, Machine Learning, Trained Dataset, Split Data and Particular Data.]
EXPLANATION:
UML sequence diagrams are interaction diagrams that detail how operations are carried out. They capture the
interaction between objects in the context of a collaboration. Sequence diagrams are time-focused, and they
show the order of the interaction visually by using the vertical axis of the diagram to represent time: what
messages are sent and when.
6.5.8 Collaboration Diagram
[Collaboration diagram: numbered messages (1: Datasets Transfers, 2: Datasets, 3: Training Data, 4: Predictive Learning, 5: Machine Learning, 6: Trained Dataset, 8: Split Data, 9: Particular Data) exchanged between Data Understanding, Dataset, Data Analysis (EDA), Model Building and Model Evaluation.]
EXPLANATION:
Collaboration diagrams are used to show how objects interact to perform the behaviour of a particular use
case, or a part of a use case. Along with sequence diagrams, collaboration diagrams are used by designers to
define and clarify the roles of the objects that perform a particular flow of events of a use case. They are the
primary source of information used to determine class responsibilities and interfaces.
6.5.9 Activity Diagram
[Activity diagram: workflow of activities starting from the Dataset.]
EXPLANATION:
Activity diagrams are a loosely defined diagram technique for showing workflows of stepwise activities and
actions, with support for choice, iteration and concurrency. In UML, activity diagrams can be used to
describe the business and operational step-by-step workflows of components in a system. UML activity
diagrams could potentially model the internal logic of a complex operation. In many ways, UML activity
diagrams are the object-oriented equivalent of flowcharts and data flow diagrams (DFDs) from structured
development.
6.5.10 System Architecture
[System architecture: DATASET → EXPLORATORY DATA ANALYTICS → TRAIN/TEST SPLIT → MODEL BUILDING / HYPERPARAMETER TUNING → MODEL EVALUATION → RESULT]
CHAPTER : 7
7. IMPLEMENTATION
7.1 Data Collection
We collected a phishing websites dataset from the Kaggle website. It consists
of a mix of phishing and legitimate URL features. The dataset has 11055 rows
and 31 columns.
7.2 Exploratory Data Analysis
We loaded the dataset into the Python IDE with the help of the pandas package
and checked whether there are any missing values in the data. We found that
there are no missing values in the data, and we removed an unwanted column
for our process. After removing the unwanted column, below are the columns
left in our dataset.
Figure: 7.2.1
We tried to analyze the obtained data and found the following observations:
Figure: 7.2.2
From the above count plot, which is plotted with the help of the seaborn
package, we can observe the count of values of the target variable.
Below is the plot, drawn with the help of the matplotlib package, to find out
the correlation among the features.
7.3 Train/Test Split
From the analysis we clearly found that our data represents a classification problem,
where the target variable is Result.
X = data.drop('Result', axis=1)
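A minimal sketch of how this split could be completed, assuming the loaded data frame is named data as in the line above (the 20% test fraction matches the listing in Chapter 7; random_state is an optional addition for reproducibility):

from sklearn.model_selection import train_test_split

y = data['Result']                 # target variable
# X holds the remaining feature columns (defined above)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(x_train.shape, x_test.shape)  # roughly 80% / 20% of the rows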
Figure: 7.3.1
Figure: 7.3.2
7.4.1 Logistic Regression
Logistic regression is named for the function used at the core of the method, the logistic function. The
logistic function, also called the sigmoid function was developed by
statisticians to describe properties of population growth in ecology, rising
quickly and maxing out at the carrying capacity of the environment. It’s an
S-shaped curve that can take any real-valued number and map it into a
value between 0 and 1, but never exactly at those limits.
1 / (1 + e^-value)
Where e is the base of the natural logarithms (Euler’s number or the EXP()
function in your spreadsheet) and value is the actual numerical value that
you want to transform. Below is a plot of the numbers between -5 and 5
transformed into the range 0 and 1 using the logistic function.
For example, if we are modeling people's sex as male or female from their height, the model can be
written as the probability of male given a person's height:
P(sex=male|height)
Written another way, we are modeling the probability that an input (X)
belongs to the default class (Y=1), we can write this formally as:
P(X) = P(Y=1|X)
We’re predicting probabilities? I thought logistic regression was a classification
algorithm?
Note that the probability prediction must be transformed into binary values
(0 or 1) in order to actually make a class prediction. Logistic
regression is a linear method, but the predictions are transformed using the
logistic function. The impact of this is that we can no longer understand the
predictions as a linear combination of the inputs as we can with linear
regression, for example, continuing on from above, the model can be stated
as:
ln(p(X) / (1 - p(X))) = b0 + b1 * X
We can move the exponent back to the right and write it as:
odds = e^(b0 + b1 * X)
We trained the logistic regression model with the help of the training split and tested it with the
test split.
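A minimal sketch of this step with scikit-learn, continuing from the train/test split above (max_iter is raised here only so the solver converges; it is not taken from the project listing):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

lr = LogisticRegression(max_iter=1000)
lr.fit(x_train, y_train)
y_pred = lr.predict(x_test)

print(accuracy_score(y_test, y_pred))         # overall accuracy
print(confusion_matrix(y_test, y_pred))       # phishing vs non-phishing counts
print(classification_report(y_test, y_pred))  # precision, recall, F1-score, support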
Accuracy score - 0.92
Confusion Matrix:
Phishing    Non-Phishing
Table:7.4.1.1
Classification Report:
Precision Recall F1-Score Support
Phishing    0.94    0.92    0.93    901
Table:7.4.1.2
7.4.2 Random Forest
Random forest is a supervised learning algorithm. It can be used both for classification and
regression. It is also a flexible and easy-to-use algorithm. A forest is comprised of trees, and
it is said that the more trees it has, the more robust a forest is. Random forest creates
decision trees on randomly selected data samples, gets a prediction from each tree and
selects the best solution by means of voting. It also provides a pretty good indicator of
feature importance.
Advantages:
• Random forest is considered a highly accurate and robust method because of the number
of decision trees participating in the process.
• It does not suffer from the overfitting problem. The main reason is that it takes the average
of all the predictions, which cancels out the biases.
• Random forests can also handle missing values. There are two ways to handle these: using
median values to replace continuous variables, and computing the proximity-weighted
average of missing values.
It works in four steps:
1. Select random samples from a given dataset.
2. Construct a decision tree for each sample and get a prediction result from each decision
tree.
3. Perform a vote for each predicted result.
4. Select the prediction result with the most votes as the final prediction.
We trained the random forest model with the help of the training split and tested it with the test split.
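A brief sketch of this step, with the hyper-parameters left at their scikit-learn defaults:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

rf = RandomForestClassifier(n_estimators=100)  # 100 trees vote on each prediction
rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))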
Accuracy score - 0.96
Confusion Matrix:
Phishing Non-Phishing
Table:7.4.2.1
Classification Report:
Precision Recall F1-Score Support
Non-Phishing 0.95 0.96 0.95 854
Table:7.4.2.2
7.4.3 Decision Tree
A decision tree is a flowchart-like tree structure where an internal node represents a feature,
a branch represents a decision rule, and each leaf node represents the outcome. The
topmost node in a decision tree is known as the root node. It learns to partition on the basis
of the attribute values. It partitions the tree in a recursive manner, called recursive partitioning.
This flowchart-like structure helps you in decision making. Its visualization, like a
flowchart diagram, easily mimics human-level thinking. That is why decision trees are easy
to understand and interpret.
1. Select the best attribute using Attribute Selection Measures (ASM) to split the records.
2. Make that attribute a decision node and break the dataset into smaller subsets.
3. Start tree building by repeating this process recursively for each child until one of the
conditions matches:
o All the tuples belong to the same attribute value.
Pros
It requires less data preprocessing from the user; for example, there is no need to
normalize columns.
It can be used for feature engineering, such as predicting missing values, and is suitable
for variable selection.
The decision tree has no assumptions about distribution because of the non-parametric
nature of the algorithm.
Cons
A small variation (or variance) in the data can result in a different decision tree. This can
be reduced by bagging and boosting algorithms.
Decision trees are biased with imbalanced datasets, so it is recommended to balance
out the dataset before creating the decision tree.
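A brief sketch of a decision tree classifier on the same split; the criterion and depth shown are the scikit-learn defaults, not values taken from the project:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

dt = DecisionTreeClassifier(criterion='gini', max_depth=None)  # limit max_depth to reduce variance
dt.fit(x_train, y_train)
y_pred = dt.predict(x_test)
print(accuracy_score(y_test, y_pred))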
Confusion Matrix:
Phishing Non-Phishing
Classification Report:
Table:7.4.3.1
Table:7.4.3.2
7.4.4 Naive Bayes
The naive Bayes classifier is a generative model for classification. Before the advent of
deep learning and its easy-to-use libraries, the naive Bayes classifier was one of the most
widely deployed classifiers for machine learning applications. Despite its simplicity, the
naive Bayes classifier performs quite well in many applications.
A naive Bayes classifier is a probabilistic machine learning model that is used for
classification tasks. The crux of the classifier is based on the Bayes theorem.
Bayes Theorem:
P(A|B) = P(B|A) * P(A) / P(B)
Using Bayes theorem, we can find the probability of A happening given that B has
occurred. Here, B is the evidence and A is the hypothesis. The assumption made here is that
the predictors/features are independent, that is, the presence of one particular feature does not
affect another. Hence it is called naive.
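A minimal sketch using the Gaussian naive Bayes implementation from scikit-learn:

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

nb = GaussianNB()            # assumes each feature is normally distributed within a class
nb.fit(x_train, y_train)
y_pred = nb.predict(x_test)
print(accuracy_score(y_test, y_pred))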
Confusion Matrix:
Phishing Non-Phishing
Table:7.4.4.1
Classification Report:
Precision Recall F1-Score Support
Macro avg 0.80 0.64 0.60 1755
Table:7.4.4.2
7.4.5 Support Vector Machine
Support Vector Machine, or SVM, is one of the most popular supervised learning algorithms,
and it is used for classification as well as regression problems. However, it is primarily used
for classification problems in machine learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put a new data point in the
correct category in the future. This best decision boundary is called a hyperplane. SVM
chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases
are called support vectors, and hence the algorithm is termed a Support Vector Machine.
Types of SVM
Linear SVM: Linear SVM is used for linearly separable data, which means that if a
dataset can be classified into two classes by using a single straight line, then such data
is termed linearly separable data, and the classifier used is called a Linear SVM
classifier.
Hyperplane: There can be multiple lines or decision boundaries to segregate the classes in n-
dimensional space, but we need to find out the best decision boundary that helps to classify
the data points. This best boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the features present in the dataset, which means
if there are 2 features then hyperplane will be a straight line. And if there are 3 features, then
hyperplane will be a 2-dimension plane.
We always create a hyperplane that has a maximum margin, which means the maximum
distance between the data points.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position
of the hyperplane are termed Support Vectors. Since these vectors support the hyperplane,
they are called support vectors.
Advantages
SVM classifiers offer good accuracy and perform faster prediction compared to the naïve Bayes
algorithm. They also use less memory because they use a subset of the training points in the
decision phase. SVM works well with a clear margin of separation and with high-dimensional
space.
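A short sketch of this classifier; SVC() with no arguments, as used in the Chapter 7 listing, defaults to the RBF kernel, while kernel='linear' would give a Linear SVM classifier:

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

svm = SVC(kernel='rbf')      # default kernel; use kernel='linear' for a Linear SVM
svm.fit(x_train, y_train)
y_pred = svm.predict(x_test)
print(accuracy_score(y_test, y_pred))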
Confusion Matrix:
Phishing Non-Phishing
Table:7.4.5.1
Classification Report:
Precision Recall F1-Score Support
Table:7.4.5.2
7.4.6 K-Nearest Neighbours (KNN)
The KNN algorithm assumes that similar things exist in close proximity. In other words,
similar things are near to each other. KNN captures the idea of similarity (sometimes
called distance, proximity, or closeness) with some mathematics we might have learned in
our childhood: calculating the distance between points on a graph.
Advantages
3. The algorithm is versatile. It can be used for classification, regression, and search.
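A minimal sketch; k (n_neighbors) is set to 5 here purely for illustration:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=5)  # vote among the 5 closest training points
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
print(accuracy_score(y_test, y_pred))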
Accuracy score - 0.928
Confusion Matrix:
Phishing Non-Phishing
Table:7.4.6.1
Classification Report:
Precision Recall F1-Score Support
Table:7.4.6.2
7.4.7 XGBoost
XGBoost is a powerful machine learning algorithm, especially where speed and accuracy
are concerned. XGBoost (eXtreme Gradient Boosting) is an advanced implementation
of the gradient boosting algorithm.
ADVANTAGES
1. Regularization:
• Standard GBM has no regularization, unlike XGBoost; this regularization helps
XGBoost to reduce overfitting.
High Flexibility
Tree Pruning:
• A GBM would stop splitting a node when it encounters a negative loss in the
split. Thus it is more of a greedy algorithm.
• XGBoost, on the other hand, makes splits up to the max_depth specified and then
starts pruning the tree backwards, removing splits beyond which there is no
positive gain.
• Another advantage is that sometimes a split of negative loss say -2 may be
followed by a split of positive loss +10. GBM would stop as it encounters -2.
But XGBoost will go deeper and it will see a combined effect of +8 of the split
and keep both.
Built-in Cross-Validation
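The Chapter 7 listing uses scikit-learn's GradientBoostingClassifier; the xgboost package offers a similar interface, sketched below with illustrative parameter values that are not taken from the project:

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# note: XGBClassifier expects class labels encoded as 0..n-1; encode the target first if needed
xgb = XGBClassifier(n_estimators=100, max_depth=6, learning_rate=0.1, reg_lambda=1.0)
xgb.fit(x_train, y_train)
y_pred = xgb.predict(x_test)
print(accuracy_score(y_test, y_pred))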
Confusion Matrix:
Phishing Non-Phishing
Table:7.4.7.1
Classification Report:
Precision Recall F1-Score Support
Macro avg 0.94 0.94 0.94 1755
Table:7.4.7.2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
df = pd.read_csv('C:/Users/k.anusha/Documents/phishing/dataset.csv')
df.head()
df.describe()
df.isnull().sum()
df.dtypes
sns.countplot(x='Result',data=df)
x=df.drop(['Result','index'],axis=1)
y=df['Result']
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2)
# Logistic Regression
# Accuracy Score
from sklearn.metrics import accuracy_score
a1=accuracy_score(y_test,y_pred)
a1
# Classification Report
# Decision Tree Classifier
Output:
# Random Forest
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
import math
MSE = np.square(np.subtract(y_test, y_pred)).mean()
RMSE3 = math.sqrt(MSE)
print("Root Mean Square Error:\n", RMSE3)
rf.predict([[-1,0,-1,1,-1,-1,1,1,-1,1,1,-1,1,0,0,-1,-1,-1,0,1,1,1,1,1,1,1,-1,1,-1,-1]])
Output:
# Support Vector Machine
from sklearn.svm import SVC
sv = SVC()
sv.fit(x_train, y_train)
y_pred = sv.predict(x_test)
from sklearn.metrics import accuracy_score
a4 = accuracy_score(y_test, y_pred)
a4
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
MSE = np.square(np.subtract(y_test, y_pred)).mean()
RMSE4 = math.sqrt(MSE)
print("Root Mean Square Error:\n", RMSE4)
sv.predict([[-1,-1,-1,1,-1,1,-1,1,1,-1,-1,1,1,1,1,-1,-1,-1,-1,1,-1,1,-1,-1,-1,1,-1,1,-1,-1]])
Output:
# Naive Bayes
# Gradient Boosting Algorithm
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier()
gb.fit(x_train, y_train)
y_pred = gb.predict(x_test)
from sklearn.metrics import accuracy_score
a6 = accuracy_score(y_test, y_pred)
a6
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
MSE = np.square(np.subtract(y_test, y_pred)).mean()
RMSE6 = math.sqrt(MSE)
print("Root Mean Square Error:\n", RMSE6)
gb.predict([[1,0,-1,1,1,-1,1,1,-1,1,1,1,1,0,0,-1,1,1,0,-1,1,-1,1,-1,-1,0,-1,1,1,1]])
#K-Nearest Neighbors
# Accuracy levels for various algorithms
sns.barplot(x='Algorithm', y='Accuracy', data=df1)
plt.xticks(rotation=90)
plt.title('Comparison of Accuracy Levels for various algorithms')
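The data frame df1 used above is not shown in the listings. A plausible way to build it from the accuracy scores computed earlier (a1 for logistic regression, a4 for SVM, a6 for gradient boosting; the scores of the remaining classifiers would be added in the same way) is sketched below:

import pandas as pd

scores = {
    'Logistic Regression': a1,
    'Support Vector Machine': a4,
    'Gradient Boosting': a6,
    # accuracy scores of the other classifiers would be added here in the same way
}
df1 = pd.DataFrame({'Algorithm': list(scores.keys()),
                    'Accuracy': list(scores.values())})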
CHAPTER - 8
8. CONCLUSION
The present project is aimed at the classification of phishing websites based on their features.
For this, we have taken the phishing dataset collected from the UCI machine learning
repository, and we built our model with seven different classifiers, such as SVC, Naïve Bayes,
XGB Classifier, Random Forest, K-Nearest Neighbours and Decision Tree, and we obtained
good accuracy scores. There is scope to enhance this work further: if we can obtain more data,
our project will be much more effective and we can get very good results. For this we need
API integrations to get the data of different websites.