A Project Report on

DISEASE INSPECTION IDENTIFICATION FOR
FOOD USING
MACHINE LEARNING ALGORITHMS
Submitted in partial fulfillment of the requirements for the award of the degree in

BACHELOR OF TECHNOLOGY
IN

COMPUTER SCIENCE AND ENGINEERING


BY
P Venkata Reddy <178X1A0582>
S Bharath Kumar <178X1A0599>
T Mahesh <178X1A05A7>
G Vishnu Teja <178X1A05B6>
Under the esteemed guidance of

G. Mahesh Reddy, Assistant Professor

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


KALLAM HARANADHAREDDY INSTITUTE OF TECHNOLOGY
AN ISO 9001:2015 CERTIFIED INSTITUTION
ACCREDITED BY NBA & NAAC WITH ‘A’ GRADE
(APPROVED BY AICTE, AFFILIATED TO JNTUK, KAKINADA)

NH-5, CHOWDAVARAM, GUNTUR – 522019


2017 – 2021

KALLAM HARANADHAREDDY INSTITUTE OF TECHNOLOGY
AN ISO 9001:2015 CERTIFIED INSTITUTION ACCREDITED BY NBA &
NAAC WITH ‘A’ GRADE
(APPROVED BY AICTE, AFFILIATED TO JNTUK, KAKINADA)

NH-5, CHOWDAVARAM, GUNTUR-522019

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE
This is to certify that the project work entitled "DISEASE INSPECTION IDENTIFICATION FOR
FOOD USING MACHINE LEARNING ALGORITHMS" being submitted by Venkata Reddy
Pulagam (178X1A0582), Bharath Kumar Sunkara (178X1A0599), Thota Mahesh (178X1A05A7), and
Vishnu Teja Ganimisetty (178X1A05B6) in partial fulfilment of the requirements for the award of the degree of
Bachelor of Technology in Computer Science and Engineering at Kallam Haranadhareddy Institute of Technology
is a bonafide record of the work carried out by them.

Internal Guide Head of the Department


G. Mahesh Reddy Dr. K. V. Subba Reddy
Asst. Professor Professor & HOD

External Examiner

DECLARATION

We, Venkata Reddy Pulagam (178X1A0582), Bharath Kumar Sunkara
(178X1A0599), Mahesh Thota (178X1A05A7), and Vishnu Teja
Ganimisetty (178X1A05B6), hereby declare that the project report titled
"DISEASE INSPECTION IDENTIFICATION FOR FOOD USING
MACHINE LEARNING ALGORITHMS", prepared under the guidance of
G. Mahesh Reddy, is submitted in partial fulfillment of the requirements
for the award of the degree of Bachelor of Technology in Computer Science and
Engineering.

This is a record of bonafide work carried out by us and the results embodied in
this project have not been reproduced or copied from any source. The results
embodied in this project have not been submitted to any other university for the
award of any other degree.

Venkata Reddy P <178X1A0582>


Bharath Kumar S <178X1A0599>
Mahesh T <178X1A05A7>
Vishnu Teja G <178X1A05B6>

ACKNOWLEDGEMENT
We are profoundly grateful and express our deep sense of gratitude and respect towards our
honorable chairman and grandfather, Sri KALLAM HARANADHA REDDY, Chairman of
the Kallam Group, for his precious support to the college.

We are thankful to Dr. M. UMA SANKAR REDDY, Director, KHIT, GUNTUR for his
encouragement and support for the completion of the project.

We are much thankful to Dr. B. SIVA BASIVI REDDY, Principal, KHIT, GUNTUR for his
support throughout the completion of the project.

We are greatly indebted to Dr. K. VENKATA SUBBA REDDY, Professor & Head,
Department of Computer Science and Engineering, KHIT, GUNTUR for providing the laboratory
facilities to the fullest extent as and when required and also for giving us the opportunity to carry
out the project work in the college.

We are also thankful to our Project Coordinators Mr. N. Md. Jubair Basha and

Mr. P. LAKSHMIKANTH who helped us in each step of our Project.

We extend our deep sense of gratitude to our Internal Guide, Dr. Md. Sirajuddin, and the other
Faculty Members & Support Staff for their valuable suggestions, guidance and constructive ideas at
each and every step, which were indeed of great help towards the successful completion of our
project.

Venkata Reddy P <178X1A0582>


Bharath Kumar S <178X1A0599>
Mahesh T <178X1A05A7>
Vishnu Teja G <178X1A05B6>

ABSTRACT
Suitable nutritional diets have been widely recognized as important measures to prevent and control
non-communicable diseases (NCDs). However, there is little research on nutritional ingredients in
food now, which are beneficial to the rehabilitation of NCDs. In this paper, we profoundly analyzed
the relationship between nutritional ingredients and diseases by using data mining methods. First,
more than 10 diseases were obtained and we collected the recommended food ingredients for each
disease. Then, referring to Indian food nutrition data, we proposed an improved system using Random
Forest, Decision Tree, Gaussian Naïve Bayes and KNN algorithms to find out which nutritional
ingredients can exert positive effects on diseases, using rough sets to select the core ingredients. To the best of our
knowledge, this is among the first studies to discuss the relationship between nutritional ingredients in food
and diseases through machine learning based on a dataset from India. The experiments on real-life data
show that our method based on machine learning improves the performance compared with the
traditional CNN approach, with the highest accuracy of 0.97. Additionally, for some common
diseases such as acne, angina, cardiovascular disease, ovarian disease, stroke, tooth decay, asthma, liver disease, oral
cancers, hypertension and kidney stones, our work is able to predict the disease based on the first
three nutritional ingredients in food that can benefit the rehabilitation of those diseases. These
experimental results demonstrate the effectiveness of applying machine learning in selecting
nutritional ingredients in food for disease analysis.

TABLE OF CONTENTS

TITLE Page No.

CHAPTER 1: INTRODUCTION 1-2

1.1 Introduction 1

1.2 Purpose of the System 1

1.3 Problem Statement 1

1.4 Solution of Problem Statement 2

CHAPTER 2: REQUIREMENTS 3-3

2.1 Hardware Requirements 3

2.2 Software Requirements 3

CHAPTER 3: SYSTEM ANALYSIS 4-12

3.1 Study of System 4

3.2 Existing System 6

3.3 Proposed System 6

3.4 Input and Output 7

3.5 Process Models Used with Justification 8

3.6 Feasibility Study 10

3.7 Economic Feasibility 10

3.8 Operational Feasibility 11

3.9 Technical Feasibility 12

CHAPTER 4: SOFTWARE REQUIREMENT SPECIFICATION 13-15

4.1 Functional Requirements 14

4.2 Non-Functional Requirements 15

CHAPTER 5: LANGUAGES OF IMPLEMENTATION 17-26

5.1 Python 17

5.2 Difference Between a Script and a Program 17

5.3 History of Python 19

5.4 Python features 19

5.5 Dynamic vs Static 21

5.6 Variables 21

5.7 Standard Data Types 22

5.8 Different Modules in python 23

5.9 Datasets 26

CHAPTER 6: SYSTEM DESIGN 27-40

6.1 Introduction 27

6.2 Normalization 27

6.3 ER- Diagram 29

6.4 Data Flow Diagram 29

6.5 UML Diagrams 32-40

6.5.1 Use case Diagram 32

6.5.2 Class Diagram 33

6.5.3 Object Diagram 33

6.5.4 Component Diagram 34

6.5.5 Deployment Diagram 35

6.5.6 State Diagram 36

6.5.7 Sequence Diagram 37

6.5.8 Collaboration Diagram 38

6.5.9 Activity Diagram 39

6.5.10 System Architecture 40

CHAPTER 7: IMPLEMENTATION 41-59

7.1 Data Collection 41

7.2 Data Analysis 42

7.3 Data Processing 43

7.4 Modeling 44-56

7.4.1 Logistic Regression 44

7.4.2 Random Forest Classifier 47

7.4.3 Decision Tree Classifier 49

7.4.4 Naive Bayes Classifier 51

7.4.5 Support Vector Machine 53


7.4.6 K-Nearest Neighbors 55

7.4.7 XGB Classifier 56

7.5 Coding and Execution 59

CHAPTER 8: CONCLUSION 71

CHAPTER - 1

1. INTRODUCTION

NCDs are chronic diseases, which are mainly caused by occupational and environmental factors, lifestyles
and behaviors, and include obesity, diabetes, hypertension, tumors and other diseases. According to the
Global Status Report on Non-communicable Diseases issued by the WHO, the annual death toll from NCDs
keeps rising, which has placed a serious economic burden on the world. About 40 million people die
from NCDs each year, which is equivalent to 70% of the global death toll. Statistics on Chinese residents'
chronic disease and nutrition show that the number of patients suffering from NCDs in China is higher
than in any other country in the world, and the prevalence rate continues to rise sharply. In
addition, the population aged 60 or over in China has reached 230 million and about two-thirds of them are
suffering from NCDs according to official statistics. Therefore, relevant departments in each country,
especially in India, such as medical colleges, hospitals and disease research centers, are all concerned about
NCDs. Suitable nutritional diets play an important role in maintaining health and preventing the occurrence
of NCDs. With the gradual recognition of this concept, India has also repositioned the impact of food on
health. However, research on the nutritional ingredients in food that are conducive to the rehabilitation of
diseases via machine learning is still rare in India. At present, India has just begun the IT (Information
Technology) construction of smart health-care. Most studies on the relationship between nutritional
ingredients in food and diseases are still through expensive precision instruments or long-term clinical trials.
In addition, there are also many prevention reports, but they studied only one or several diseases. In India,
studying the relationship between nutritional ingredients and diseases using data mining is immature. Most
doctors only recommend the specific food to patients suffering from NCDs, without giving any relevant
nutrition information, especially about nutritional ingredients in food. The solutions for NCDs require
interdisciplinary knowledge. In the era of big data, data mining has become an essential way of discovering
new knowledge in various fields, especially in disease prediction and accurate health-care (AHC). It has
become a core support for preventive medicine, basic medicine and clinical medicine research. With respect
to the disease analysis through the mining of nutritional ingredients in food, we mainly make the following
contributions: (i) we extracted data related to Chinese diseases and the corresponding recommended and taboo
food for each disease, as far as possible, from medical and official websites to create a valuable knowledge
base that is available online; (ii) we applied machine learning to find out which nutritional ingredients in food can
exert positive effects on diseases; (iii) the data in this paper is continuous and has no decision attributes; to
address this problem, we proposed machine learning models such as random forest, decision tree, KNN and
Gaussian Naïve Bayes, which can better select the corresponding core ingredients from the positive nutritional
ingredients in food. The structure of this paper is organized as follows: Section II reviews the related work in
the field of disease analysis and machine learning; the subsequent sections describe the specific data mining
algorithms used in this paper, the reasons why we selected them, and the two evaluation indexes; elaborate the
data, experimental results and analysis in detail; and present a discussion of the methods. Some conclusions and
potential future research directions are also discussed.

Problem Statement:
The existing system, which performs disease analysis using a CNN approach, has low accuracy and
high complexity. To avoid these problems, our proposed system uses different machine learning
models such as random forest, decision tree, KNN and Gaussian Naïve Bayes for the analysis, which gives
results with better accuracy and efficiency.

CHAPTER – 2

2. REQUIREMENTS

2.1 Hardware Requirements:

• RAM: 4GB

• Processor: Intel i3

• Hard Disk: 120GB

2.2 Software Requirements:

• OS: Windows or Linux

• Software : Anaconda

• Jupyter IDE

• Language : Python Scripting

CHAPTER - 3

3 SYSTEM ANALYSIS

3.1 STUDY OF THE SYSTEM

1. Numpy
2. Pandas
3. Matplotlib
4. Scikit –learn

1. NumPy:

NumPy is a general-purpose array-processing package. It provides a high-performance
multidimensional array object, and tools for working with these arrays. It is the fundamental
package for scientific computing with Python. It contains various features, including these
important ones:

• A powerful N-Dimensional array object.

• Sophisticated (broadcasting) functions.

• Tools for integrating C/C++ and Fortran code.

• Useful linear algebra, Fourier transform, and random number capabilities
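
As a brief, illustrative sketch of these features (the array values below are made up, not project data), a NumPy array can be created and manipulated as follows:

import numpy as np

# Create a 2-D array (two food items, three nutrient values each)
nutrients = np.array([[2.5, 0.3, 1.1],
                      [4.0, 0.8, 0.6]])

scaled = nutrients * 100              # broadcasting a scalar over the whole array
print(scaled.shape)                   # (2, 3)
print(np.mean(nutrients, axis=0))     # column-wise means
print(np.fft.fft(nutrients[0]))       # Fourier transform of the first row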

2. Pandas

Pandas is an open-source Python library providing high-performance data manipulation
and analysis tools using its powerful data structures. Python was previously used mainly for data
munging and preparation and had very little to offer for data analysis; Pandas solved
this problem. Using Pandas, we can accomplish the five typical steps in the processing and
analysis of data, regardless of the origin of the data: load, prepare, manipulate, model, and
analyze. Python with Pandas is used in a wide range of fields, including academic and
commercial domains such as finance, economics, statistics and analytics.
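
A minimal sketch of these steps is shown below; the file name and column name are placeholders, not the project's actual files:

import pandas as pd

# Load: read a CSV file into a DataFrame (file name is a placeholder)
df = pd.read_csv("food_nutrition.csv")

# Prepare: inspect the data and impute missing numeric values
print(df.info())
df = df.fillna(df.mean(numeric_only=True))

# Manipulate/analyze: filter rows and summarise (column name is a placeholder)
high_protein = df[df["protein"] > 10]
print(high_protein.describe())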
3. Matplotlib

Matplotlib is a Python 2D plotting library which produces publication quality figures in a


variety of hardcopy formats and interactive environments across platforms. Matplotlib can be
used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application
servers, and four graphical user interface toolkits. Matplotlib tries to make easy things easy
and hard things possible. You can generate plots, histograms, power spectra, bar charts, error
charts, scatter plots, etc., with just a few lines of code. For examples, see the sample plots
and thumbnail gallery.

For simple plotting, the pyplot module provides a MATLAB-like interface, particularly
when combined with IPython. For the power user, you have full control of line styles, font
properties, axes properties, etc., via an object-oriented interface or via a set of functions
familiar to MATLAB users.
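
For instance, a bar chart comparing model accuracies can be produced with a few lines (the values here are illustrative only, not experimental results):

import matplotlib.pyplot as plt

models = ["Decision Tree", "KNN", "Naive Bayes", "Random Forest"]
accuracies = [0.91, 0.89, 0.88, 0.97]     # illustrative values

plt.figure(figsize=(6, 4))
plt.bar(models, accuracies, color="steelblue")
plt.ylabel("Accuracy")
plt.title("Comparison of classifiers")
plt.show()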

4. Scikit – learn

Scikit-learn provides a range of supervised and unsupervised learning algorithms via a
consistent interface in Python. It is licensed under a permissive simplified BSD license and is
distributed with many Linux distributions, encouraging academic and commercial use. The
library is built upon SciPy (Scientific Python), which must be installed before you can use
scikit-learn. This stack includes:

• NumPy: Base n-dimensional array package


• SciPy: Fundamental library for scientific computing
• Matplotlib: Comprehensive 2D/3D plotting
• IPython: Enhanced interactive console
• Sympy: Symbolic mathematics
• Pandas: Data structures and analysis
• Extensions or modules for SciPy are conventionally named SciKits. As such, the
module that provides learning algorithms is named scikit-learn.
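
The consistent estimator interface means that most models follow the same fit/predict pattern; a small sketch on scikit-learn's built-in Iris toy dataset (not the project dataset) looks like this:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load a built-in toy dataset and split it for training and testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a classifier and report its accuracy on the held-out data
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))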

3.2 EXISTING SYSTEM:

The existing system only provides a novel system that can estimate the nutritional
ingredients of food items by analyzing an input image of the food item. This system relies on
different deep learning techniques and models for the accuracy of the resulting nutritional
components. However, models that use images as input are unstable at certain times and
require advanced techniques to predict the output. The complexity of this model is higher
and it is time-consuming.

3.3. PROPOSED SYSTEM:

In the proposed system, we identify the diseases that a person may be affected by due to the lack
of certain ingredients in the body. To address this problem, we recommend food according to the
body's intake, based on the type of food consumed, its minerals, and the amount of food that the
human body consumes (in grams). Earlier studies were basically carried out
through long-term clinical trials, which only recommend food for certain specific diseases and
seldom study the relationship between nutritional ingredients and disease inspection using
machine learning techniques.
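
As an illustrative sketch only (the feature matrix X of nutritional ingredients and the disease labels y are assumed to be prepared elsewhere in the project), the four proposed classifiers could be trained and compared as follows:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def compare_models(X, y):
    # Hold out 20% of the data for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    models = {
        "Random Forest": RandomForestClassifier(random_state=42),
        "Decision Tree": DecisionTreeClassifier(random_state=42),
        "Gaussian Naive Bayes": GaussianNB(),
        "KNN": KNeighborsClassifier(n_neighbors=5),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(name, accuracy_score(y_test, model.predict(X_test)))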

3.4 INPUT AND OUTPUT

The following are the project's inputs and outputs.

Inputs:

 Importing all the required packages such as NumPy, pandas, Matplotlib, scikit-learn
and the required machine learning algorithm packages.

 Setting the dimensions of the visualization graphs.

 Downloading and importing the dataset and converting it to a data frame.

Outputs:

 Preprocessing the imported data frame by imputing nulls with the related
information.

 Displaying the cleaned outputs.

 After applying the machine learning algorithms, the system gives good results and
visualization plots.
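
A hedged sketch of this input/output flow is given below; the file name and the imputation rules are placeholders rather than the project's exact code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (10, 6)          # dimensions of the visualization graphs

# Placeholder file name for the downloaded dataset
df = pd.read_csv("disease_nutrition_dataset.csv")

# Impute nulls: mode for categorical columns, mean for numeric columns
for col in df.columns:
    if df[col].dtype == "object":
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        df[col] = df[col].fillna(df[col].mean())

print(df.isnull().sum())                          # cleaned output: no remaining nulls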

3.5 PROCESS MODELS USED WITH JUSTIFICATION

SDLC stands for Software Development Life Cycle. A Software Development Life Cycle is
essentially a series of steps, or phases, that provide a model for the development and lifecycle
management of an application or piece of software. SDLC is the process consisting of a series of
planned activities to develop or alter the software products.

Benefits of the SDLC Process

 The intent of an SDLC process is to help produce a product that is cost-efficient, effective,
and of high quality. Once an application is created, the SDLC maps the proper
deployment and decommissioning of the software once it becomes legacy software.

 The SDLC methodology usually contains the following stages: Analysis (requirements
and design), construction, testing, release, and maintenance (response). Veracode makes
it possible to integrate automated security testing into the SDLC process through use of
its cloud-based platform.

1. Requirements Gathering:

In this phase we gather all the requirements from the client,
i.e. what the client's expected inputs, outputs, etc. are.

2. Analysis:
In this phase, based upon the client requirements, we prepare a
document called the "High Level Design Document". It contains the
abstract and the functional requirements.

3. Design:
Since the High Level Design Document is difficult for all the members
to understand easily, we also prepare a "Low Level Design
Document". To design this document we use UML (Unified
Modelling Language). In this we have use case, sequence,
collaboration and other diagrams.

4. Coding:

In this phase we develop the coding module by module. After


developing all the modules, we integrate them.

5. Testing:
After developing, we have to check whether the client requirements are
satisfied or not. If not, we go back to development.

6. Implementation:
In the testing phase, if the client requirements are satisfied, we go for implementation,
i.e. we need to deploy the application on some server.

7. Maintenance:

After deployment, if any problems come from the client side, we provide support and
maintenance for the application.

DESIGN PRINCIPLES & METHODOLOGY:

Object Oriented Analysis And Design

When Object orientation is used in analysis as well as design, the boundary


between OOA and OOD is blurred. This is particularly true in methods that
combine analysis and design. One reason for this blurring is the similarity of basic
constructs (i.e.,objects and classes) that are used in OOA and OOD. Though there is
no agreement about what parts of the object-oriented development process belong
to analysis and what parts to design, there is some general agreement about the
domains of the two activities.

The fundamental difference between OOA and OOD is that the former models
the problem domain, leading to an understanding and specification of the problem,
while the latter models the solution to the problem. That is, analysis deals with the
problem domain, while design deals with the solution domain. However, OOAD
subsumed the solution domain representation. That is, the solution domain
representation, created by OOD, generally contains much of the representation
created by OOA. The separating line is a matter of perception, and different people
have different views on it. The lack of clear separation between analysis and design
can also be considered one of the strong points of the object-oriented approach; the
transition from analysis to design is "seamless". This is also the main reason for OOAD
methods, in which analysis and design are both performed.

The main difference between OOA and OOD, due to the different domains of
modeling, is in the type of objects that come out of the analysis and design process.

Features of OOAD:

• It uses objects as the building blocks of the application rather than functions.

• All objects can be represented graphically including the relation between them.

• All Key Participants in the system will be represented as actors and the actions done by
them will be represented as use cases.

• A typical use case is nothing but a systematic flow of a series of events which can be well
described using sequence diagrams, and each event can be described diagrammatically by
activity as well as state chart diagrams.

• So the entire system can be well described using the OOAD model, hence this model is
chosen as the SDLC model.

3.6 FEASIBILITY Study

Preliminary investigation examines project feasibility, the likelihood that the system will be
useful to the organization. The main objective of the feasibility study is to test the technical,
operational and economical feasibility of adding new modules and debugging old running
systems. All systems are feasible if they have unlimited resources and infinite time. There are
three aspects in the feasibility study portion of the preliminary investigation:

● Technical Feasibility
● Operational Feasibility
● Economical Feasibility

3.7 ECONOMIC FEASIBILITY

A system that can be developed technically and that will be used if installed must still be a
good investment for the organization. In economic feasibility, the development cost of
creating the system is evaluated against the ultimate benefit derived from the new system.
Financial benefits must equal or exceed the costs.

The system is economically feasible. It does not require any additional hardware or
software. Since the interface for this system is developed using the existing resources and
technologies available at NIC, there is only nominal expenditure, and economic feasibility is
certain.

3.8 OPERATIONAL FEASIBILITY

Proposed projects are beneficial only if they can be turned into an information system
that will meet the organization's operating requirements. Operational feasibility aspects of
the project are to be taken as an important part of the project implementation. Some of the
important issues raised to test the operational feasibility of a project include the
following:

● Is there sufficient support for the management from the users?


● Will the system be used and work properly if it is being developed and
implemented?
● Will there be any resistance from the user that will undermine the possible
application benefits?

The well-planned design would ensure the optimal utilization of the computer resources
and would help in the improvement of performance status.

This system is targeted to be in accordance with the above-mentioned issues. Beforehand,


the management issues and user requirements have been taken into consideration. So there is
no question of resistance from the users that can undermine the possible application benefits.

3.9 TECHNICAL FEASIBILITY

The technical issue usually raised during the feasibility stage of the investigation includes
the following:

● Does the necessary technology exist to do what is suggested?


● Does the proposed equipment have the technical capacity to hold the data required
to use the new system?
● Will the proposed system provide adequate response to inquiries, regardless of the
number or location of users?
● Can the system be upgraded if developed?
● Are there technical guarantees of accuracy, reliability, ease of access and data
security?

Earlier no system existed to cater to the needs of ‘Secure Infrastructure Implementation


System’. The current system developed is technically feasible. It is a web based user interface
for audit workflow at NIC-CSD. Thus it provides easy access to the users. The database’s
purpose is to create, establish and maintain a workflow among various entities in order to
facilitate all concerned users in their various capacities or roles. Permission to the users would
be granted based on the roles specified. Therefore, it provides the technical guarantee of
accuracy, reliability and security. The software and hardware requirements for the development of
this project are not many and are already available in-house at NIC or are available for free as
open source. The work for the project is done with the current equipment and existing
software technology. Necessary bandwidth exists for providing fast feedback to the users
irrespective of the number of users using the system.

CHAPTER - 4

4. SOFTWARE REQUIREMENT SPECIFICATION

A Software Requirements Specification (SRS) – a requirements specification for a


software system – is a complete description of the behavior of a system to be developed. It
includes a set of use cases that describe all the interactions the users will have with the
software. In addition to use cases, the SRS also contains non-functional requirements. Non-
functional requirements are requirements which impose constraints on the design or
implementation (such as performance engineering requirements, quality standards, or design
constraints).

System requirements specification: A structured collection of information that embodies


the requirements of a system. A business analyst, sometimes titled system analyst, is
responsible for analyzing the business needs of their clients and stakeholders to help identify
business problems and propose solutions. Within the systems development life cycle domain, the
business analyst typically performs a liaison function between the business side of an enterprise and the
information technology department or external service providers. Projects are subject to three
sorts of requirements:

● Business requirements describe in business terms what must be delivered or


accomplished to provide value.
● Product requirements describe properties of a system or product (which could be
one of Several ways to accomplish a set of business requirements).
● Process requirements describe activities performed by the developing
organization. For instance, process requirements could specify specific
methodologies that must be followed, and constraints that the organization must
obey.
Product and process requirements are closely linked. Process requirements often specify
the activities that will be performed to satisfy a product requirement. For example, a
maximum development cost requirement (a process requirement) may be imposed to help
achieve a maximum sales price requirement (a product requirement); a requirement that the
product be maintainable is often addressed by imposing requirements to follow particular
development styles.

PURPOSE

In systems engineering, a requirement can be a description of what a system must do,


referred to as a Functional Requirement. This type of requirement specifies something that
the delivered system must be able to do. Another type of requirement specifies something
about the system itself, and how well it performs its functions. Such requirements are often
called Non-functional requirements, or 'performance requirements' or 'quality of service
requirements.' Examples of such requirements include usability, availability, reliability,
supportability, testability and maintainability.

A collection of requirements define the characteristics or features of the desired system. A


'good' list of requirements as far as possible avoids saying how the system should implement
the requirements, leaving such decisions to the system designer. Specifying how the system
should be implemented is called "implementation bias" or "solution engineering". However,
implementation constraints on the solution may validly be expressed by the future owner, for
example for required interfaces to external systems; for interoperability with other systems;
and for commonality (e.g. of user interfaces) with other owned products.

In software engineering, the same meanings of requirements apply, except that the focus
of interest is the software itself.

4.1 FUNCTIONAL REQUIREMENTS


• Load data

• Data analysis

• Data preprocessing

• Model building

• Prediction

4.2NON FUNCTIONAL REQUIREMENTS

1. Secure access of confidential data (user’s details). SSL can be used.


2. 24 X 7 availability.
3. Better component design to get better performance at peak time.
4. A flexible, service-based architecture is highly desirable for future extension.

Introduction to Flask

Flask is a Web development framework that saves you time and makes
Web development a joy. Using Flask, you can build and maintain high-quality Web
applications with minimal fuss. At its best, Web development is an exciting, creative act; at
its worst, it can be a repetitive, frustrating nuisance. Flask lets you focus on the fun stuff
(the crux of your Web application) while easing the pain of the repetitive bits. In doing so, it
provides high-level abstractions of common Web development patterns, shortcuts for frequent
programming tasks, and clear conventions for how to solve problems. At the same time, Flask
tries to stay out of your way, letting you work outside the scope of the framework as needed.
The focus of this section is twofold. First, we explain, in depth, what Flask does and how to
build Web applications with it. Second, we discuss higher-level concepts where appropriate,
answering the question "How can I apply these tools effectively in my own projects?" These
tools help develop powerful Web sites quickly, with code that is clean and easy to maintain.
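
As a small, hedged illustration of this "minimal fuss" claim (the route and message are placeholders, not the project's actual endpoints), a complete Flask application can be as short as:

from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    # Placeholder response; a real view would render the prediction form
    return "Disease inspection service is running."

if __name__ == "__main__":
    app.run(debug=True)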

What Is a Web Framework?

Flask is a prominent member of a new generation of Web frameworks. So what exactly does
that term mean? To answer that question, let’s consider the design of a Web application
written using the Common Gateway Interface (CGI) standard, a popular way to write Web
applications circa 1998. In those days, when you wrote a CGI application, you did everything
yourself — the equivalent of baking a cake from scratch. For example, here’s a simple CGI
script, written in Python, that displays the ten most recently published books from a database:
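
The original listing is not reproduced here; the following is a sketch of the kind of CGI script being described, with illustrative database credentials, table and column names:

#!/usr/bin/env python
import MySQLdb

# Print the HTTP header and the start of the page
print("Content-Type: text/html\n")
print("<html><head><title>Books</title></head><body>")
print("<h1>The 10 most recently published books</h1>")
print("<ul>")

# Illustrative credentials and schema
connection = MySQLdb.connect(user="me", passwd="letmein", db="my_db")
cursor = connection.cursor()
cursor.execute("SELECT name FROM books ORDER BY pub_date DESC LIMIT 10")
for row in cursor.fetchall():
    print("<li>%s</li>" % row[0])

print("</ul></body></html>")
connection.close()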

With a one-off dynamic page such as this one, the write-it-from-scratch approach
isn’t necessarily bad. For one thing, this code is simple to comprehend — even a
novice developer can read these 16 lines of Python and understand all it does, from
start to finish. There’s nothing else to learn; no other code to read. It’s also simple to
deploy: just save this code in a file called latestbooks.cgi, upload that file to a Web
server, and visit that page with a browser. But as a Web application grows beyond
the trivial, this approach breaks down, and you face a number of problems:

Should a developer really have to worry about printing the “Content-Type” line and
remembering to close the database connection? This sort of boilerplate reduces
programmer productivity and introduces opportunities for mistakes. These setup-
and teardown-related tasks would best be handled by some common infrastructure.

CHAPTER - 5
5.LANGUAGES OF IMPLEMENTATION
5.1 Python

What Is A Script?

Up to this point, I have concentrated on the interactive programming capability of Python. This is
a very useful capability that allows you to type in a program and have it executed immediately
in an interactive mode.

Scripts are reusable

Basically, a script is a text file containing the statements that comprise a Python program. Once
you have created the script, you can execute it over and over without having to retype it each
time.

Scripts are editable

Perhaps, more importantly, you can make different versions of the script by modifying the
statements from one file to the next using a text editor. Then you can execute each of the
individual versions. In this way, it is easy to create different programs with a minimum amount of
typing.

You will need a text editor

Just about any text editor will suffice for creating Python script files. You can use Microsoft Notepad,
Microsoft WordPad, Microsoft Word, or just about any word processor if you want to.

5.2 Difference between a script and a program

Script:

Scripts are distinct from the core code of the application, which is usually written in a different
language, and are often created or at least modified by the end-user. Scripts are often interpreted
from source code or bytecode, whereas the applications they control are traditionally compiled to
native machine code.

Program:

The program has an executable form that the computer can use directly to execute the
instructions.

The same program also exists in a human-readable source code form, from which
the executable form is derived (e.g., by compilation).

Python

What is Python?

Chances are you are asking yourself this. You may have come to this section because
you want to learn to program but don’t know anything about programming
languages. Or you may have heard of programming languages like C, C++, C#,
or Java and want to know what Python is and how it compares to “big name”
languages. Hopefully this section can explain it for you.

Python concepts

If you’re not interested in the hows and whys of Python, feel free to skip to the
next section. This section tries to explain why Python is
one of the best languages available and why it is a great one to start programming
with.

Open source general-purpose language.

• Object Oriented, Procedural, Functional

• Easy to interface with C/ObjC/Java/Fortran

• Easy-ish to interface with C++ (via SWIG)

• Great interactive environment

Python is a high-level, interpreted, interactive and object-oriented scripting


language. Python is designed to be highly readable. It uses English keywords
frequently, whereas other languages use punctuation, and it has fewer syntactical
constructions than other languages.

• Python is Interpreted − Python is processed at runtime by the interpreter. You do
not need to compile your program before executing it. This is similar to PERL and PHP.

• Python is Interactive − You can actually sit at a Python prompt and interact
with the interpreter directly to write your programs.

• Python is Object-Oriented − Python supports Object-Oriented style or


technique of programming that encapsulates code within objects.

• Python is a Beginner's Language − Python is a great language for the beginner-level


programmers and supports the development of a wide range of applications from
simple text processing to WWW browsers to games.

5.3 History of Python

Python was developed by Guido van Rossum in the late eighties and early nineties
at the National Research Institute for Mathematics and Computer Science in the
Netherlands. Python is derived from many other languages, including ABC,
Modula-3, C, C++, Algol-68, SmallTalk, and Unix shell and other scripting
languages.

Python is copyrighted. Like Perl, Python source code is now available under the
GNU General Public License (GPL).

Python is now maintained by a core development team at the institute, although


Guido van Rossum still holds a vital role in directing its progress.

5.4 Python Features

Python's features include −

• Easy-to-learn − Python has few keywords, simple structure, and a clearly defined
syntax. This allows the student to pick up the language quickly.

• Easy-to-read − Python code is more clearly defined and visible to the eyes.

• Easy-to-maintain − Python's source code is fairly easy to maintain.

• A broad standard library − The bulk of Python's library is very portable and
cross-platform compatible on UNIX, Windows, and Macintosh.

• Interactive Mode − Python has support for an interactive mode which allows interactive
testing and debugging of snippets of code.

• Portable − Python can run on a wide variety of hardware platforms and has the
same interface on all platforms.

• Extendable − You can add low-level modules to the Python interpreter. These
modules enable programmers to add to or customize their tools to be more
efficient.

• Databases − Python provides interfaces to all major commercial databases.

• GUI Programming − Python supports GUI applications that can be created and ported
to many system calls, libraries and windowing systems, such as Windows MFC,
Macintosh, and the X Window System of Unix.

• Scalable − Python provides a better structure and support for large programs
than shell scripting.

Apart from the above-mentioned features, Python has a big list of good features, few
are listed below −

• It supports functional and structured programming methods as well as OOP.

• It can be used as a scripting language or can be compiled to byte-code for building large
applications.

• It provides very high-level dynamic data types and supports dynamic type checking.

• It supports automatic garbage collection.

• It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.

5.5 Dynamic vs Static Types

Python is a dynamically typed language. Many other languages are statically typed,
such as C/C++ and Java. A statically typed language requires the programmer to
explicitly tell the computer what type of “thing” each data value is.

For example, in C if you had a variable that was to contain the price of something,
you would have to declare the variable as a “float” type.
This tells the compiler that the only data that can be used for that variable must be
a floating point number, i.e. a number with a decimal point.

Python, however, doesn’t require this. You simply give your variables names and
assign values to them. The interpreter takes care of keeping track of what kinds of
objects your program is using. This also means that you can change the size of the
values as you develop the program. Say you have another decimal number (a.k.a. a
floating point number) you need in your program.

With a static typed language, you have to decide the memory size the variable can
take when you first initialize that variable. A double is a floating point value that
can handle a much larger number than a normal float (the actual memory sizes
depend on the operating environment). If you declare a variable to be a float but
later on assign a value that is too big to it, your program will fail; you will have to
go back and change that variable to be a double. With Python, it doesn’t matter. You
simply give it whatever number you want and Python will take care of manipulating
it as needed. It even works for derived values.

For example, say you are dividing two numbers. One is a floating point number and
one is an integer. Python realizes that it’s more accurate to keep track of decimals so
it automatically calculates the result as a floating point number
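
A small example of this behaviour (the values are chosen arbitrarily):

price = 10            # bound to an int; no type declaration needed
price = 10.75         # later rebound to a float without changing any declaration

result = 7.5 / 2      # a float divided by an int
print(result)         # 3.75  -- Python keeps track of the decimals automatically
print(type(result))   # <class 'float'>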

5.6 Variables

Variables are nothing but reserved memory locations to store values. This means
that when you create a variable you reserve some space in memory.

Based on the data type of a variable, the interpreter allocates memory and decides
what can be stored in the reserved memory. Therefore, by assigning different data
types to variables, you can store integers, decimals or characters in these variables.

5.7 Standard Data Types

The data stored in memory can be of many types. For example, a person's age is stored as a
numeric value and his or her address is stored as alphanumeric characters. Python
has various standard data types that are used to define the operations possible on
them and the storage method for each of them.

Python has five standard data types −

• Numbers

• String

• List

• Tuple

• Dictionary

Python Numbers

Number data types store numeric values. Number objects are created when you
assign a value to them

Python Strings

Strings in Python are identified as a contiguous set of characters represented within
quotation marks. Python allows either pairs of single or double quotes. Subsets
of strings can be taken using the slice operator ([ ] and [:]) with indexes starting at 0
at the beginning of the string and working their way to -1 at the end.

Python Lists

Lists are the most versatile of Python's compound data types. A list contains items
separated by commas and enclosed within square brackets ([]). To some extent, lists
are similar to arrays in C. One difference between them is that all the items
belonging to a list can be of different data type.

The values stored in a list can be accessed using the slice operator ([ ] and [:]) with
indexes starting at 0 at the beginning of the list and working their way to -1 at the end. The
plus (+) sign is the list concatenation operator, and the asterisk (*) is the repetition
operator.

Python Tuples

A tuple is another sequence data type that is similar to the list. A tuple consists of a
number of values separated by commas. Unlike lists, however, tuples are enclosed
within parentheses.
The main differences between lists and tuples are: Lists are enclosed in brackets ( [
] ) and their elements and size can be changed, while tuples are enclosed in
parentheses ( ( ) ) and cannot be updated. Tuples can be thought of as read-only
lists.

Python Dictionary

Python's dictionaries are a kind of hash table. They work like associative arrays
or hashes found in Perl and consist of key-value pairs. A dictionary key can be
almost any Python type, but keys are usually numbers or strings. Values, on the other
hand, can be any arbitrary Python object.

Dictionaries are enclosed by curly braces ({ }) and values can be assigned and
accessed using square braces ([]).
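
A short example touching each of the five standard types (the values are arbitrary):

age = 25                                    # number
name = "Python"                             # string
print(name[0], name[1:4])                   # slicing: 'P', 'yth'

foods = ["rice", "dal", "milk"]             # list: mutable
foods[0] = "wheat"
print(foods + ["egg"], foods * 2)           # concatenation (+) and repetition (*)

rgb = (255, 128, 0)                         # tuple: like a read-only list
# rgb[0] = 0                                # would raise a TypeError

nutrition = {"protein": 3.5, "fat": 1.2}    # dictionary: key-value pairs
nutrition["carbs"] = 12.0
print(nutrition["protein"], list(nutrition.keys()))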

5.8 Different Modes in Python

Python has two basic modes: normal and interactive.

The normal mode is the mode where the scripted and finished .py files are run in
the Python interpreter.
Interactive mode is a command line shell which gives immediate feedback for each
statement, while running previously fed statements in active memory. As new lines
are fed into the interpreter, the fed program is evaluated both in part and in whole

20 Python libraries

1. Requests. The most famous HTTP library, written by Kenneth Reitz. It's a must-have
for every Python developer.

2. Scrapy. If you are involved in web scraping then this is a must-have library for you.
After using this library you won't use any other.

3. wxPython. A GUI toolkit for Python. I have primarily used it in place of tkinter. You
will really love it.

4. Pillow. A friendly fork of PIL (Python Imaging Library). It is more user-friendly than
PIL and is a must-have for anyone who works with images.

5. SQLAlchemy. A database library. Many love it and many hate it. The choice is yours.

6. BeautifulSoup. I know it's slow, but this XML and HTML parsing library is very
useful for beginners.

7. Twisted. The most important tool for any network application developer. It has a very
beautiful API and is used by a lot of famous Python developers.

8. NumPy. How can we leave out this very important library? It provides some advanced
math functionality to Python.

9. SciPy. When we talk about NumPy then we have to talk about SciPy. It is a library of
algorithms and mathematical tools for Python and has caused many scientists to switch
from Ruby to Python.

10. matplotlib. A numerical plotting library. It is very useful for any data scientist or any
data analyzer.

11. Pygame. Which developer does not like to play games and develop them? This library
will help you achieve your goal of 2D game development.

12. Pyglet. A 3D animation and game creation engine. This is the engine in which the
famous Python port of Minecraft was made.

13. PyQt. A GUI toolkit for Python. It is my second choice after wxPython for developing
GUIs for my Python scripts.

14. PyGTK. Another Python GUI library. It is the same library in which the famous
BitTorrent client was created.

15. Scapy. A packet sniffer and analyzer for Python, made in Python.

16. pywin32. A Python library which provides some useful methods and classes for
interacting with Windows.

17. nltk. Natural Language Toolkit – I realize most people won't be using this one, but it's
generic enough. It is a very useful library if you want to manipulate strings, but its
capacity is beyond that. Do check it out.

18. nose. A testing framework for Python. It is used by millions of Python developers. It is
a must-have if you do test-driven development.

19. SymPy. SymPy can do algebraic evaluation, differentiation, expansion, complex
numbers, etc. It is contained in a pure Python distribution.

20. IPython. I just can't stress enough how useful this tool is. It is a Python prompt on
steroids. It has completion, history, shell capabilities, and a lot more. Make sure that
you take a look at it.

Numpy:
NumPy’s main object is the homogeneous multidimensional array. It is a table of elements
(usually numbers), all of the same type, indexed by a tuple of positive integers. In NumPy
dimensions are called axes. The number of axes is rank.
NumPy offers MATLAB-ish capabilities within Python:

• Fast array operations

• 2D arrays, multi-D arrays, linear algebra, etc.
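
For instance, in the short sketch below the array has two axes (rank 2) of lengths 3 and 4:

import numpy as np

a = np.arange(12).reshape(3, 4)   # 2-D array: rank 2, axes of length 3 and 4
print(a.ndim)                     # 2   (number of axes, i.e. the rank)
print(a.shape)                    # (3, 4)
print(a.sum(axis=0))              # fast array operation: sum down each column
print(a.T @ a)                    # a simple linear-algebra product (4 x 4 result)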

Matplotlib

• High-quality plotting library.

5.9 DATA SETS

DataSets

The DataSet object is similar to the ADO Recordset object, but more
powerful, and with one other important distinction: the DataSet is always
disconnected. The DataSet object represents a cache of data, with database-
like structures such as tables, columns, relationships, and constraints.
However, though a DataSet can and does behave much like a database, it is
important to remember that DataSet objects do not interact directly with
databases, or other source data. This allows the developer to work with a
programming model that is always consistent, regardless of where the source
data resides. Data coming from a database, an XML file, from code, or user
input can all be placed into DataSet objects. Then, as changes are made to
the DataSet they can be tracked and verified before updating the source data.
The GetChanges method of the DataSet object actually creates a second
DataSet that contains only the changes to the data. This DataSet is then used
by a DataAdapter (or other objects) to update the original data source.
The DataSet has many XML characteristics, including the ability to produce and
consume XML data and XML schemas. XML schemas can be used to describe
schemas interchanged via WebServices. In fact, a DataSet with a schema can
actually be compiled for type safety and statement completion.

CHAPTER - 6

6. SYSTEM DESIGN

6.1 INTRODUCTION

Software design sits at the technical kernel of the software engineering process and is applied
regardless of the development paradigm and area of application. Design is the first step in the
development phase for any engineered product or system. The designer’s goal is to produce a
model or representation of an entity that will later be built. Once system requirements
have been specified and analyzed, system design is the first of the three technical activities
(design, code and test) that are required to build and verify software.

The importance can be stated with a single word “Quality”. Design is the place where
quality is fostered in software development. Design provides us with representations of
software whose quality we can assess. Design is the only way that we can accurately translate a
customer’s view into a finished software product or system. Software design serves as a
foundation for all the software engineering steps that follow. Without a strong design we risk
building an unstable system – one that will be difficult to test, one whose quality cannot be
assessed until the last stage.

During design, progressive refinements of the data structure, program structure, and
procedural details are developed, reviewed and documented. System design can be viewed
from either a technical or a project management perspective. From the technical point of view,
design comprises four activities: architectural design, data structure design, interface
design and procedural design.

6.2 NORMALIZATION

It is a process of converting a relation to a standard form. The process is used to handle
the problems that can arise due to data redundancy, i.e. repetition of data in the database, to
maintain data integrity, and to handle problems that can arise due to insertion, updation and
deletion anomalies.

Decomposing is the process of splitting relations into multiple relations to eliminate
anomalies and maintain data integrity. To do this we use normal forms, or rules for
structuring relations.

Insertion anomaly: Inability to add data to the database due to the absence of other data.

Deletion anomaly: Unintended loss of data due to the deletion of other data.

Update anomaly: Data inconsistency resulting from data redundancy and partial updates.

Normal Forms: These are the rules for structuring relations that eliminate anomalies.

FIRST NORMAL FORM:

A relation is said to be in first normal form if the values in the relation are atomic for every
attribute in the relation. By this we mean simply that no attribute value can be a set of values
or, as it is sometimes expressed, a repeating group.

SECOND NORMAL FORM:

A relation is said to be in second normal form if it is in first normal form and it satisfies
any one of the following rules:

1) The primary key is not a composite primary key.

2) No non-key attributes are present.

3) Every non-key attribute is fully functionally dependent on the full set of the primary key.

THIRD NORMAL FORM:

A relation is said to be in third normal form if there exist no transitive dependencies.

Transitive Dependency: If two non-key attributes depend on each other as well as on the
primary key, then they are said to be transitively dependent.
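
For example, in a relation STUDENT(Roll_No, Dept, HOD), Roll_No determines Dept and Dept determines HOD, so HOD depends on the primary key only transitively; moving (Dept, HOD) into a separate relation removes this dependency and brings the design into third normal form.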

The above normalization principles were applied to decompose the data into multiple tables,
thereby allowing the data to be maintained in a consistent state.

6.3 E – R DIAGRAMS

• The relations within the system are structured through a conceptual ER diagram, which not only
specifies the existential entities but also the standard relationships through which the system exists
and the cardinalities that are necessary for the system state to continue.

• The Entity Relationship Diagram (ERD) depicts the relationships between the data objects. The
ERD is the notation that is used to conduct the data modeling activity; the attributes of each data
object noted in the ERD can be described using a data object description.

• The set of primary components that are identified by the ERD are:

◆ Data objects
◆ Relationships
◆ Attributes
◆ Various types of indicators.

The primary purpose of the ERD is to represent data objects and their relationships.
6.4 DATA FLOW DIAGRAMS

A data flow diagram is a graphical tool used to describe and analyze the movement of data
through a system. These are the central tool and the basis from which the other components
are developed. The transformation of data from input to output, through processes, may be
described logically and independently of the physical components associated with the system.
These are known as the logical data flow diagrams. The physical data flow diagrams show the
actual implementation and movement of data between people, departments and workstations. A
full description of a system actually consists of a set of data flow diagrams. The data flow
diagrams are developed using two familiar notations: Yourdon and Gane & Sarson. Each
component in a DFD is labeled with a descriptive name. A process is further identified with a
number that will be used for identification purposes. The development of DFDs is done in
several levels. Each process in the lower-level diagrams can be broken down into a more detailed
DFD in the next level. The top-level diagram is often called the context diagram. It consists of a
single process bubble, which plays a vital role in studying the current system. The process in the
context-level diagram is exploded into other processes at the first-level DFD.

The idea behind the explosion of a process into more processes is that understanding at
one level of detail is exploded into greater detail at the next level. This is done until no further
explosion is necessary and an adequate amount of detail is described for the analyst to understand
the process.

Larry Constantine first developed the DFD as a way of expressing system requirements in a
graphical form; this led to modular design.

A DFD, also known as a “bubble chart”, has the purpose of clarifying system requirements
and identifying major transformations that will become programs in system design. So it is the
starting point of the design down to the lowest level of detail. A DFD consists of a series of bubbles
joined by data flows in the system.

SALIENT FEATURES OF DFDs

1. The DFD shows the flow of data, not of control; loops and decisions are control considerations
and do not appear on a DFD.

2. The DFD does not indicate the time factor involved in any process, i.e. whether the data flow
takes place daily, weekly, monthly or yearly.

3. The sequence of events is not brought out on the DFD.

TYPES OF DATA FLOW DIAGRAMS
1. Current Physical

2. Current Logical

3. New Logical

4. New Physical

CURRENT PHYSICAL:

In the current physical DFD, process labels include the names of people or their positions or the
names of computer systems that might provide some of the overall system processing; the label
includes an identification of the technology used to process the data. Similarly, data flows and
data stores are often labeled with the names of the actual physical media on which data are
stored, such as file folders, computer files, business forms or computer tapes.

CURRENT LOGICAL:

The physical aspects of the system are removed as much as possible so that the current system
is reduced to its essence: the data and the processes that transform them, regardless of
actual physical form.

NEW LOGICAL:

This is exactly like the current logical model if the user were completely happy with the
functionality of the current system but had problems with how it was implemented. Typically
the new logical model will differ from the current logical model by having additional functions,
obsolete functions removed, and inefficient flows recognized.

NEW PHYSICAL:

The new physical represents only the physical implementation of the new system.
6.5 UML Diagrams

6.5.1 Use case diagram

[Use case diagram: the user interacts with use cases covering Dataset, Data Understanding, Data Analytics (EDA), Predictive Learning, Model Building, Model Evaluation, Trained Dataset, Train/Test Split and Particular Data.]

EXPLANATION:

The primary purpose of a use case diagram is to show which system functions are performed for which
actor. The roles of the actors in the system can be depicted. The above diagram has the user as an
actor; each actor plays a specific role in achieving the concept.

6.5.2 Class Diagram

[Class diagram: classes Data Understanding, Data Analysis (EDA), Train Data, Model Building and Model Evaluation, with attributes such as analysis, datamodel, dataprocess, tuningprocess and dataset, and operations such as datasetstransfers(), datasets(), trainingdata(), modelbuildingphase(), splitdata(), predictivelearning(), machinelearning(), traineddataset() and particulardata().]

EXPLANATION
The class diagram shows how the classes, with their attributes and methods, are linked together to
perform the required operations. The diagram above shows the different classes involved in our project.

6.5.3 Object Diagram


[Object diagram: objects Data Understanding, Data Analysis (EDA), Train Data, Model Building and Model Evaluation.]

EXPLANATION:
The above diagram shows the flow of objects between the classes. An object diagram shows a complete or
partial view of the structure of a modelled system, and represents how the classes, with their attributes
and methods, are linked together to perform the required processing.
6.5.4 Component Diagram

[Component diagram: components Data Understanding, Datasets Transfers, Deep Learning, Model-Building Phase, Datasets, Data Analysis (EDA), Train Data, Training Data, Split Data, Model Evaluation, Predictive Learning, Model Building, Trained Dataset and Particular Data.]

EXPLANATION:
A component provides the set of required interfaces that it realises or implements. Component diagrams are
static diagrams of the Unified Modeling Language, and they are used to represent the working and behaviour
of the various components of a system.

6.5.5 Deployment Diagram

[Deployment diagram: nodes Data Understanding, Data Analysis (EDA), Train Data, Model Building and Model Evaluation.]

EXPLANATION:
A UML deployment diagram is a diagram that shows the configuration of run-time processing nodes and the
components that live on them. Deployment diagrams are a kind of structure diagram used in modelling the
physical aspects of an object-oriented system. They are often used to model the static deployment view of
a system.

6.5.6 STATE DIAGRAM

[State diagram: states Dataset, Data Understanding, Data Analysis (EDA), Train Data, Model Building and Model Evaluation, with transitions Datasets Transfers, Datasets, Training Data, Predictive Learning, Machine Learning, Model-Building Phase, Split Data, Trained Dataset and Particular Data.]

EXPLANATION:
State diagrams are a loosely defined diagram used to show workflows of stepwise activities and actions,
with support for choice, iteration and concurrency. State diagrams require that the system described is
composed of a finite number of states; sometimes this is indeed the case, while at other times it is a
reasonable abstraction. Many forms of state diagram exist, which differ slightly and have different semantics.

6.5.7 Sequence Diagram

[Sequence diagram: lifelines Data Understanding, Data Analysis (EDA), Train Data, Model Building, Model Evaluation and Dataset, with messages Datasets Transfers, Datasets, Training Data, Predictive Learning, Machine Learning, Trained Dataset, Model Building Phase, Split Data and Particular Data.]

EXPLANATION:
UML sequence diagrams are interaction diagrams that detail how operations are carried out. They capture the
interaction between objects in the context of a collaboration. Sequence diagrams are time-focused, and they
show the order of the interaction visually by using the vertical axis of the diagram to represent time:
which messages are sent and when.

6.5.8 Collaboration Diagram

[Collaboration diagram: objects Data Understanding, Train Data, Model Building, Model Evaluation, Data Analysis (EDA) and Dataset, exchanging the numbered messages 1: Datasets Transfers, 2: Datasets, 3: Training Data, 4: Predictive Learning, 5: Machine Learning, 6: Trained Dataset, 7: Model Building Phase, 8: Split Data and 9: Particular Data.]

EXPLANATION:
Collaboration diagrams are used to show how objects interact to perform the behaviour of a particular use
case, or a part of a use case. Along with sequence diagrams, collaborations are used by designers to define
and clarify the roles of the objects that perform a particular flow of events of a use case. They are the
primary source of information used to determine class responsibilities and interfaces.

6.5.9 Activity Diagram

[Activity diagram: activities Dataset, Data Understanding, Data Analysis (EDA), Train Data, Model Building and Model Evaluation, with flows Datasets Transfers, Datasets, Training Data, Trained Dataset, Predictive Learning, Model-Building Phase, Split Data, Machine Learning and Particular Data.]

EXPLANATION:
Activity diagrams are a loosely defined diagram used to show workflows of stepwise activities and actions,
with support for choice, iteration and concurrency. In UML, activity diagrams can be used to describe the
business and operational step-by-step workflows of components in a system. UML activity diagrams can also
model the internal logic of a complex operation. In many ways, UML activity diagrams are the object-oriented
equivalent of flow charts and data flow diagrams (DFDs) from structured development.

6.5.10 System Architecture

[System architecture: DATASET → EXPLORATORY DATA ANALYTICS → TRAIN/TEST SPLIT → MODEL BUILDING / HYPER-PARAMETER TUNING → MODEL EVALUATION → RESULT.]

CHAPTER : 7
7.Implementation
7.1 Data Collection
We collected the phishing websites dataset from the Kaggle website. It consists
of a mix of phishing and legitimate URL features. The dataset has 11,055 rows
and 31 columns.

7.2 Exploratory Data Analysis
We loaded the dataset into a Python IDE with the help of the pandas package
and checked whether there are any missing values in the data. We found that
there are no missing values in the data, and we removed an unwanted column
for our process. After removing the unwanted column, below are the columns
left in our dataset.

Figure: 7.2.1

We analysed the obtained data and arrived at the following observations:

Figure: 7.2.2
From the above count plot, which is plotted with the help of the seaborn
package, we can observe the count of values of the target variable.

Below is the plot, drawn with the help of the matplotlib package, to find
out the correlation among the features.

From the analysis we clearly found that our problem is a classification
problem, where the target variable is Result.

7.3 Data Processing


We have to prepare the data for the algorithms for training and testing purposes.
With the help of the scikit-learn package, we split the data into 70% for
training and 30% for testing.

7.3.1 Data Splitting


# Separate the target variable (y) and the feature matrix (X)
y = data['Result']
X = data.drop('Result', axis=1)

Figure: 7.3.1

Figure: 7.3.2

# To split into train and test sets
from sklearn.model_selection import train_test_split

# Split X and y into train and test sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
7.4 Modeling

7.4.1Logistic Regression

Logistic regression is another technique borrowed by machine learning


from the field of statistics. It is the go-to method for binary classification
problems (problems with two class values). Logistic regression is named

for the function used at the core of the method, the logistic function. The

logistic function, also called the sigmoid function, was developed by
statisticians to describe properties of population growth in ecology, rising
quickly and maxing out at the carrying capacity of the environment. It is an
S-shaped curve that can take any real-valued number and map it into a
value between 0 and 1, but never exactly at those limits.

1 / (1 + e^-value)
Where e is the base of the natural logarithms (Euler’s number or the EXP()
function in your spreadsheet) and value is the actual numerical value that
you want to transform. Below is a plot of the numbers between -5 and 5
transformed into the range 0 and 1 using the logistic function.
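Since the plot itself is not reproduced here, the following is a minimal sketch (using the numpy and matplotlib packages already imported in Section 7.5) of how such a curve can be drawn:

import numpy as np
import matplotlib.pyplot as plt

# Map values between -5 and 5 into the range (0, 1) with the logistic (sigmoid) function
values = np.linspace(-5, 5, 100)
sigmoid = 1 / (1 + np.exp(-values))

plt.plot(values, sigmoid)
plt.xlabel('value')
plt.ylabel('1 / (1 + e^-value)')
plt.title('Logistic (sigmoid) function')
plt.show()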

Logistic regression uses an equation as the representation, very much like


linear regression. Input values (x) are combined linearly using weights or
coefficient values to predict an output value (y). A key difference from
linear regression is that the output value being modeled is a binary value (0
or 1) rather than a numeric value.

Below is an example logistic regression equation:


y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
Where y is the predicted output, b0 is the bias or intercept term and b1 is
the coefficient for the single input value (x). Each column in your input
data has an associated b coefficient (a constant real value) that must be
learned from your training data. The actual representation of the model that
you would store in memory or in a file is the coefficients in the equation
(the beta values, or b's). Logistic regression models the probability of the
default class (e.g. the first class).
For example, if we are modeling people’s sex as male or female from their
height, then the first class could be male and the logistic regression model
could be written as the probability of male given a person’s height, or more
formally:

P(sex=male|height)
Written another way, we are modeling the probability that an input (X)

belongs to the default class (Y=1), we can write this formally as:

P(X) = P(Y=1|X)
We are predicting probabilities? Is logistic regression not a classification
algorithm? Note that the probability prediction must be transformed into binary
values (0 or 1) in order to actually make a class prediction. Logistic
regression is a linear method, but the predictions are transformed using the
logistic function. The impact of this is that we can no longer understand the
predictions as a linear combination of the inputs as we can with linear
regression. For example, continuing on from above, the model can be stated
as:

p(X) = e^(b0 + b1*X) / (1 + e^(b0 + b1*X))

ln(p(X) / (1 - p(X))) = b0 + b1*X

We can move the exponent back to the right and write it as:
odds = e^(b0 + b1 * X)
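As an illustrative numeric example (values chosen only for illustration, not taken from our data), take b0 = 0, b1 = 1 and X = 2:

p(X) = e^2 / (1 + e^2) ≈ 7.389 / 8.389 ≈ 0.881
odds = e^2 ≈ 7.389

So an input of 2 would be assigned a probability of about 0.88 for the default class.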
We trained the logistic regression model on the training split and tested it on the
test split.
Accuracy score - 0.92
Confusion Matrix:
              Phishing    Non-Phishing

Phishing      827(TP)     74(FN)

Non-Phishing  57(FP)      797(TN)

Table:7.4.1.1
Classification Report:
              Precision   Recall   F1-Score   Support

Phishing      0.94        0.92     0.93       901

Non-Phishing  0.92        0.93     0.92       854

Accuracy                           0.93       1755

Macro avg     0.93        0.93     0.93       1755

Weighted avg  0.93        0.93     0.93       1755

Table:7.4.1.2

7.4.2 Random Forest Classifier

Random forest is a supervised learning algorithm. It can be used both for classification and
regression, and it is also a very flexible and easy-to-use algorithm. A forest is comprised of
trees, and it is said that the more trees it has, the more robust a forest is. Random forest
creates decision trees on randomly selected data samples, gets a prediction from each tree
and selects the best solution by means of voting. It also provides a pretty good indicator
of feature importance.
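As a minimal sketch of that feature-importance indicator (assuming the X_train and y_train splits from Section 7.3; the parameter values shown are illustrative, not tuned):

from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Fit a random forest on the training split
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Rank the features by their importance scores
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))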

It technically is an ensemble method (based on the divide-and-conquer approach) of


decision trees generated on a randomly split dataset. This collection of decision tree
classifiers is also known as the forest. The individual decision trees are generated using an
attribute selection indicator such as information gain, gain ratio, and Gini index for each
attribute. Each tree depends on an independent random sample. In a classification
problem, each tree votes and the most popular class is chosen as the final result. In the
case of regression, the average of all the tree outputs is considered as the final result. It is
simpler and more powerful compared to the other non-linear classification algorithms.

Advantages:

• Random forest is considered as a highly accurate and robust method because of the
number of decision trees participating in the process.

• It does not suffer from the overfitting problem. The main reason is that it takes the
average of all the predictions, which cancels out the biases.

• The algorithm can be used in both classification and regression problems.

• Random forests can also handle missing values. There are two ways to handle these: using median
values to replace continuous variables, and computing the proximity-weighted average of missing
values.
It works in four steps:
1. Select random samples from a given dataset.
2. Construct a decision tree for each sample and get a prediction result from each decision
tree.
3. Perform a vote for each predicted result.
4. Select the prediction result with the most votes as the final prediction.

We trained the random forest model on the training split and tested it on the test split.
Accuracy score - 0.96

Confusion Matrix:
Phishing Non-Phishing

Phishing 858(TP) 43(FN)

Non-Phishing 35(FP) 819(TN)

Table:7.4.2.1

Classification Report:
Precision Recall F1-Score Support

Phishing 0.96 0.95 0.96 901

Non-Phishing 0.95 0.96 0.95 854

Accuracy 0.96 1755

Macro avg 0.96 0.96 0.96 1755

Weighted avg 0.96 0.96 0.96 1755

Table:7.4.2.2

7.4.3 Decision Tree Classifier:

A decision tree is a flowchart-like tree structure where an internal node represents a feature,
a branch represents a decision rule, and each leaf node represents the outcome. The
topmost node in a decision tree is known as the root node. The tree learns to partition on the
basis of attribute values, and it partitions in a recursive manner called recursive partitioning.
This flowchart-like structure helps you in decision making. Its visualization is like a
flowchart diagram, which easily mimics human-level thinking. That is why decision
trees are easy to understand and interpret.
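Because a fitted tree really is such a flowchart, it can be drawn directly. Below is a minimal sketch (assuming the train split from Section 7.3; max_depth=3 is an illustrative choice to keep the drawing readable):

from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# Fit a shallow tree so the flowchart stays readable
dt = DecisionTreeClassifier(max_depth=3)
dt.fit(X_train, y_train)

# Each internal node is a feature test, each leaf a predicted class
plt.figure(figsize=(12, 6))
plot_tree(dt, feature_names=list(X_train.columns), class_names=[str(c) for c in dt.classes_], filled=True)
plt.show()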

Decision Tree is a white box type of ML algorithm. It shares internal decision-making


logic, which is not available in the black box type of algorithms such as Neural Network.
Its training time is faster compared to the neural network algorithm. The time complexity
of decision trees is a function of the number of records and number of attributes in the
given data. The decision tree is a distribution-free or non-parametric method, which does
not depend upon probability distribution assumptions. Decision trees can handle high-dimensional
data with good accuracy.

The basic idea behind any decision tree algorithm is as follows:

1. Select the best attribute using Attribute Selection Measures (ASM) to split the
records.

2. Make that attribute a decision node and break the dataset into smaller subsets.

3. Start tree building by repeating this process recursively for each child until one of
the following conditions is met:
o All the tuples belong to the same attribute value.

o There are no more remaining attributes.

o There are no more instances.

Pros

Decision trees are easy to interpret and visualize.

It can easily capture Non-linear patterns.

It requires less data preprocessing from the user; for example, there is no need to
normalize columns.

It can be used for feature engineering such as predicting missing values, suitable for
variable selection.

The decision tree has no assumptions about distribution because of the non-parametric
nature of the algorithm.

Cons

Sensitive to noisy data. It can overfit noisy data.

A small variation (or variance) in the data can result in a different decision tree. This
can be reduced by bagging and boosting algorithms.

Decision trees are biased with imbalanced datasets, so it is recommended to balance
out the dataset before creating the decision tree.

Accuracy score - 0.935

Confusion Matrix:
Phishing Non-Phishing

Phishing 849(TP) 52(FN)

Non-Phishing 62(FP) 792(TN)

Table:7.4.3.1

Classification Report:

Precision Recall F1-Score Support

Phishing 0.93 0.94 0.94 901

Non-Phishing 0.94 0.93 0.93 854

Accuracy 0.94 1755

Macro avg 0.94 0.93 0.93 1755

Weighted avg 0.94 0.94 0.94 1755

Table:7.4.3.2

7.4.4 Naïve Bayes Classifier

The naive Bayes classifier is a generative model for classification. Before the advent of
deep learning and its easy-to-use libraries, the Naive Bayes classifier was one of the
widely deployed classifiers for machine learning applications. Despite its simplicity, the
naive Bayes classifier performs quite well in many applications.

A Naive Bayes classifier is a probabilistic machine learning model that is used for
classification tasks. The crux of the classifier is based on Bayes' theorem.

Bayes Theorem:

Using Bayes theorem, we can find the probability of A happening, given that B has
occurred. Here, B is the evidence and A is the hypothesis. The assumption made here is that
the predictors/features are independent; that is, the presence of one particular feature does not
affect the other. Hence it is called naive.
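In symbols, the theorem described above is (a standard statement, added here since the original figure is not reproduced):

P(A|B) = P(B|A) * P(A) / P(B)

where P(A|B) is the probability of the hypothesis A given the evidence B.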

Accuracy score - 0.65

Confusion Matrix:
Phishing Non-Phishing

Phishing 901(TP) 0(FN)

Non-Phishing 607(FP) 247(TN)

Table:7.4.4.1

Classification Report:
Precision Recall F1-Score Support

Phishing 0.60 1.00 0.75 901

Non-Phishing 1.00 0.29 0.45 854

Accuracy 0.65 1755

Macro avg 0.80 0.64 0.60 1755

Weighted avg 0.79 0.65 0.60 1755

Table:7.4.4.2

7.4.5 Support Vector Machine (SVM)

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used
for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put new data points in the
correct category in the future. This best decision boundary is called a hyperplane. SVM
chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases
are called support vectors, and hence the algorithm is termed a Support Vector Machine.

Types of SVM

SVM can be of two types:

Linear SVM: Linear SVM is used for linearly separable data, which means that if a
dataset can be classified into two classes by using a single straight line, then such data
is termed linearly separable data, and the classifier used is called a Linear SVM
classifier.

Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means
that if a dataset cannot be classified by using a straight line, then such data is
termed non-linear data, and the classifier used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in
n-dimensional space, but we need to find out the best decision boundary that helps to classify
the data points. This best boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the features present in the dataset: if there are
2 features, then the hyperplane will be a straight line, and if there are 3 features, then the
hyperplane will be a two-dimensional plane.

We always create a hyperplane that has a maximum margin, which means the maximum
distance between the data points.

Support Vectors:

The data points or vectors that are the closest to the hyperplane, and which affect the position
of the hyperplane, are termed support vectors. Since these vectors support the hyperplane,
they are called support vectors.

Advantages
SVM Classifiers offer good accuracy and perform faster prediction compared to Naïve Bayes
algorithm. They also use less memory because they use a subset of training points in the
decision phase. SVM works well with a clear margin of separation and with high dimensional
space.
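A minimal sketch of the two SVM types described above (assuming the train/test splits from Section 7.3; the kernel names are standard scikit-learn options):

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Linear SVM: a straight (hyperplane) decision boundary
linear_svm = SVC(kernel='linear')
linear_svm.fit(X_train, y_train)
print('Linear kernel accuracy:', accuracy_score(y_test, linear_svm.predict(X_test)))

# Non-linear SVM: the RBF kernel allows a curved decision boundary
rbf_svm = SVC(kernel='rbf')
rbf_svm.fit(X_train, y_train)
print('RBF kernel accuracy:', accuracy_score(y_test, rbf_svm.predict(X_test)))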

Accuracy score - 0.943

Confusion Matrix:
Phishing Non-Phishing

Phishing 838(TP) 63(FN)

Non-Phishing 37(FP) 817(TN)

Table:7.4.5.1

Classification Report:
Precision Recall F1-Score Support

Phishing 0.96 0.93 0.94 901

Non-Phishing 0.93 0.94 0.94 854

Accuracy 0.94 1755

Macro avg 0.94 0.94 0.94 1755

Weighted avg 0.94 0.94 0.94 1755

Table:7.4.5.2

7.4.6 K-Nearest Neighbors (KNN)

The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised


machine learning algorithm that can be used to solve both classification and regression
problems. A supervised machine learning algorithm (as opposed to an unsupervised
machine learning algorithm) is one that relies on labeled input data to learn a function that
produces an appropriate output when given new unlabeled data.

The KNN algorithm assumes that similar things exist in close proximity. In other words,
similar things are near to each other. KNN captures the idea of similarity (sometimes
called distance, proximity, or closeness) with some mathematics we might have learned in
our childhood— calculating the distance between points on a graph.
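As a small illustration of that idea (a sketch only; Euclidean distance is the default metric used by scikit-learn's KNeighborsClassifier):

import numpy as np

# Euclidean distance between two feature vectors: the 'closeness' KNN relies on
def euclidean_distance(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return np.sqrt(np.sum((a - b) ** 2))

# Distance between two small example points
print(euclidean_distance([1, 0, 1], [1, 1, -1]))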

Advantages

1. The algorithm is simple and easy to implement.

2. There’s no need to build a model, tune several parameters, or make additional


assumptions.

3. The algorithm is versatile. It can be used for classification, regression, and search.
Accuracy score - 0.928
Confusion Matrix:
Phishing Non-Phishing

Phishing 855(TP) 46(FN)

Non-Phishing 79(FP) 775(TN)

Table:7.4.6.1

Classification Report:
Precision Recall F1-Score Support

Phishing 0.92 0.95 0.93 901

Non-Phishing 0.94 0.91 0.93 854

Accuracy 0.93 1755

Macro avg 0.93 0.93 0.93 1755

Weighted avg 0.93 0.93 0.93 1755

Table:7.4.6.2

7.4.7 XGB Classifier:

XGBoost is a powerful machine learning algorithm especially where speed and accuracy
are concerned. XGBoost (eXtreme Gradient Boosting) is an advanced implementation
of gradient boosting algorithm.
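A minimal usage sketch (assuming the xgboost package is installed and the train/test splits from Section 7.3; the hyper-parameter values shown are illustrative, not tuned):

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# Gradient-boosted trees: each new tree corrects the errors of the previous ones
# Note: recent xgboost versions expect class labels 0/1, so a -1/1 target may need re-encoding first
xgb_model = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
xgb_model.fit(X_train, y_train)
print('XGBoost accuracy:', accuracy_score(y_test, xgb_model.predict(X_test)))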

ADVANTAGES

1. Regularization:

• Standard GBM has no regularization like XGBoost; the regularization therefore also helps
XGBoost to reduce overfitting.

• In fact, XGBoost is also known as a ‘regularized boosting’ technique.


2. Parallel Processing:
• XGBoost implements parallel processing and is blazingly faster as compared
to GBM.
• XGBoost also supports implementation on Hadoop.

3. High Flexibility:

• XGBoost allows users to define custom optimization objectives and


evaluation criteria.
• This adds a whole new dimension to the model and there is no limit to what we
can do.

4. Handling Missing Values:

• XGBoost has an in-built routine to handle missing values.


• The user is required to supply a different value than other observations and pass
that as a parameter. XGBoost tries different things as it encounters a missing
value on each node and learns which path to take for missing values in
future.

5. Tree Pruning:

• A GBM would stop splitting a node when it encounters a negative loss in the
split. Thus it is more of a greedy algorithm.
• XGBoost, on the other hand, makes splits up to the max_depth specified and then
starts pruning the tree backwards, removing splits beyond which there is no
positive gain.

• Another advantage is that sometimes a split of negative loss say -2 may be
followed by a split of positive loss +10. GBM would stop as it encounters -2.
But XGBoost will go deeper and it will see a combined effect of +8 of the split
and keep both.

6. Built-in Cross-Validation:

• XGBoost allows the user to run a cross-validation at each iteration of the boosting
process, and thus it is easy to get the exact optimum number of boosting
iterations in a single run.

• This is unlike GBM, where we have to run a grid search and only limited
values can be tested.

Accuracy score - 0.94

Confusion Matrix:
Phishing Non-Phishing

Phishing 836(TP) 65(FN)

Non-Phishing 39(FP) 815(TN)

Table:7.4.7.1

Classification Report:
Precision Recall F1-Score Support

Phishing 0.96 0.93 0.94 901

Non-Phishing 0.93 0.95 0.94 854

Accuracy 0.94 1755

Macro avg 0.94 0.94 0.94 1755

Weighted avg 0.94 0.94 0.94 1755

Table:7.4.7.2

7.5 Coding and Execution

# Import the required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Load the dataset and inspect it
df = pd.read_csv('C:/Users/k.anusha/Documents/phishing/dataset.csv')
df.head()
df.describe()
df.isnull().sum()   # check for missing values
df.dtypes

# Count plot of the target variable
sns.countplot(x='Result', data=df)

# Separate features and target, then split into train and test sets
x = df.drop(['Result', 'index'], axis=1)
y = df['Result']
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Logistic Regression

from sklearn.linear_model import LogisticRegression


lr=LogisticRegression()
lr.fit(x_train,y_train)
x_test
y_pred=lr.predict(x_test)
y_pred

# Accuracy Score

from sklearn.metrics import accuracy_score
a1=accuracy_score(y_test,y_pred)
a1

# Classification Report

from sklearn.metrics import classification_report


print(classification_report(y_test,y_pred))
import math
MSE = np.square(np.subtract(y_test,y_pred)).mean()
RMSE1 = math.sqrt(MSE)
print("Root Mean Square Error:\n",RMSE1)
x.keys()
lr.predict([[1,0,1,1,1,-1,-1,-1,-1,1,1,-1,1,0,-1,-1,-1,-1,0,1,1,1,1,1,-1,1,-1,1,0,-1]])

# Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier


tree=DecisionTreeClassifier()
tree.fit(x_train,y_train)
y_pred=tree.predict(x_test)
from sklearn.metrics import accuracy_score
a2=accuracy_score(y_test,y_pred)
a2
MSE = np.square(np.subtract(y_test,y_pred)).mean()
RMSE2 = math.sqrt(MSE)
print("Root Mean Square Error:\n",RMSE2)
tree.predict([[1,0,-1,1,1,-1,1,1,-1,1,1,1,1,0,0,-1,1,1,0,-1,1,-1,1,-1,-1,0,-1,1,1,1]])

Output:

# Random Forest

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)

from sklearn.metrics import accuracy_score
a3 = accuracy_score(y_test, y_pred)
a3

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

MSE = np.square(np.subtract(y_test, y_pred)).mean()
RMSE3 = math.sqrt(MSE)
print("Root Mean Square Error:\n", RMSE3)

rf.predict([[-1,0,-1,1,-1,-1,1,1,-1,1,1,-1,1,0,0,-1,-1,-1,0,1,1,1,1,1,1,1,-1,1,-1,-1]])

Output:

#SupportVector Machine

from sklearn.svm import SVC

sv = SVC()
sv.fit(x_train, y_train)
y_pred = sv.predict(x_test)

from sklearn.metrics import accuracy_score
a4 = accuracy_score(y_test, y_pred)
a4

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

MSE = np.square(np.subtract(y_test, y_pred)).mean()
RMSE4 = math.sqrt(MSE)
print("Root Mean Square Error:\n", RMSE4)

sv.predict([[-1,-1,-1,1,-1,1,-1,1,1,-1,-1,1,1,1,1,-1,-1,-1,-1,1,-1,1,-1,-1,-1,1,-1,1,-1,-1]])

Output:

#Navie Bayes

from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(x_train, y_train)
y_pred = nb.predict(x_test)   # predict with the Naive Bayes model

from sklearn.metrics import accuracy_score
a5 = accuracy_score(y_test, y_pred)
a5

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

MSE = np.square(np.subtract(y_test, y_pred)).mean()
RMSE5 = math.sqrt(MSE)
print("Root Mean Square Error:\n", RMSE5)

nb.predict([[-1,1,1,1,-1,-1,-1,-1,-1,1,1,-1,1,-1,1,-1,-1,-1,0,1,1,1,1,-1,-1,-1,-1,1,1,-1]])

#Gradient Boosting Algorithm
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier()
gb.fit(x_train, y_train)
y_pred = gb.predict(x_test)   # predict with the gradient boosting model

from sklearn.metrics import accuracy_score
a6 = accuracy_score(y_test, y_pred)
a6

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

MSE = np.square(np.subtract(y_test, y_pred)).mean()
RMSE6 = math.sqrt(MSE)
print("Root Mean Square Error:\n", RMSE6)

gb.predict([[1,0,-1,1,1,-1,1,1,-1,1,1,1,1,0,0,-1,1,1,0,-1,1,-1,1,-1,-1,0,-1,1,1,1]])

#K-Nearest Neighbors

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)   # predict with the KNN model

from sklearn.metrics import accuracy_score
a7 = accuracy_score(y_test, y_pred)
a7

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

MSE = np.square(np.subtract(y_test, y_pred)).mean()
RMSE7 = math.sqrt(MSE)
print("Root Mean Square Error:\n", RMSE7)

knn.predict([[-1,1,1,1,-1,-1,-1,-1,-1,1,1,-1,1,-1,1,-1,-1,-1,0,1,1,1,1,-1,-1,-1,-1,1,1,-1]])

# Accuracy levels for various algorithms

# df1 is a DataFrame holding the algorithm names and their accuracy scores (a1 to a7)
sns.barplot(x='Algorithm', y='Accuracy', data=df1)
plt.xticks(rotation=90)
plt.title('Comparison of Accuracy Levels for various algorithms')
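For reference, since the report does not show how df1 is assembled, a minimal sketch of one way to build it (the algorithm labels here are illustrative) might look like:

df1 = pd.DataFrame({
    'Algorithm': ['Logistic Regression', 'Decision Tree', 'Random Forest', 'SVM',
                  'Naive Bayes', 'Gradient Boosting', 'KNN'],
    'Accuracy': [a1, a2, a3, a4, a5, a6, a7]
})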

CONCLUSION

The present project is aimed at the classification of phishing websites based on their features.
For that we have taken the phishing dataset collected from the UCI machine learning
repository, and we built our model with seven different classifiers: Logistic Regression, SVC,
Naïve Bayes, XGB Classifier, Random Forest, K-Nearest Neighbours and Decision Tree, and we
obtained good accuracy scores. There is scope to enhance this work further: if we can obtain
more data, our project will be much more effective and we can get very good results. For this
we need API integrations to get the data of different websites.

