
COMPETENCY-BASED LEARNING MATERIAL

FOR
BIG DATA, DATA ANALYTICS AND DATA SCIENCE

(STUDENT GUIDE)

(IT SECTOR)

Finance Division, Ministry of Finance


Government of the People’s Republic of Bangladesh
Table of Contents

Copyright 3

How to Use this Competency-based Learning Material 4

List of Icons 5

Module 1: Interpret Data and Business Domain 6

Module 2: Demonstrate Statistical Concepts 50

Module 3: Demonstrate Programming Skills for Data Science 106

Module 4: Prepare and Visualise Data 203

Module 5: Build, Validate and Deploy Model 253

Module 6: Demonstrate Understanding on Big Data 343

Copyright

The Competency-based Learning Material (Student Guide) for Big Data, Data Analytics and Data Science
is a document, aligned to its applicable competency standard, for providing training consistent with the
requirements of industry in order for individuals who graduated through the established standard via
competency-based assessment to be suitably qualified for a relevant job.

This document is owned by the Finance Division of the Ministry of Finance of the People’s Republic of
Bangladesh, developed under the Skills for Employment Investment Program (SEIP).

Public and private institutions may use the information contained in this competency-based learning
material for activities benefitting Bangladesh.

Other interested parties must obtain permission from the owner of this document for reproduction of
information in any manner, in whole or in part, of this Competency-based Learning Material, in English or
other language.

This document is available from:

Skills for Employment Investment Program (SEIP) Project


Finance Division
Ministry of Finance
Probashi Kallyan Bhaban (Level – 16)
71-72 Old Elephant Road
Eskaton Garden, Dhaka 1000
Telephone: +8802 551 38598-9 (PABX), +8802 551 38753-5
Facsimile: +8802 551 38752
Website: www.seip-fd.gov.bd

How to Use this Competency-based Learning Material

Welcome to the competency-based learning material for Big Data, Data Analytics and Data Science for
use in the IT sector. These modules contain training materials and learning activities for you to complete in
order to become competent and qualified for relevant job roles in Big Data, Data Analytics and Data Science.
There are six (6) modules that make up this course, which together comprise the skills, knowledge and
attitudes required to become a skilled worker:

Module 1: Interpret Data and Business Domain


Module 2: Demonstrate Statistical Concepts
Module 3: Demonstrate Programming Skills for Data Science
Module 4: Prepare and Visualise Data
Module 5: Build, Validate and Deploy Model
Module 6: Demonstrate Understanding on Big Data

As a learner, you will be required to complete a series of activities in order to achieve each learning
outcome of the module. These activities may be completed as part of structured classroom activities or
simulated workplace demonstrations.
These activities will also require you to complete associated learning and practice activities in order to gain
the skills and knowledge needed to achieve the learning outcomes. You should refer to Learning Activity
pages of each module to know the sequence of learning tasks and the appropriate resources to use for
each task.
This page will serve as the road map towards the achievement of competence. If you read the Information
Sheets, these will give you an understanding of the work, and why things are done the way they are. Once
you have finished reading the Information Sheets, you will then be required to complete the Self-Check
Quizzes.
The self-check quizzes follow the Information Sheets in this learning guide. Completing the self-check
quizzes will help you know how you are progressing. To check your knowledge after completion of the
Self-Check Quizzes, you can review the Answer Key at the end of each module.
You are required to complete all activities as directed in the Learning Activity and Information Sheet.
This is where you will apply your newly acquired knowledge while developing new skills. When working,
high emphasis should be laid on safety requirements. You will be encouraged to raise relevant queries or
ask the facilitator for assistance as required.
When you have completed all the tasks required in this learning guide, formal assessment will be scheduled
to officially evaluate if you have achieved competency of the specified learning outcomes and are ready for
the next task.

List of Icons

Icon Name

Module content

Learning outcomes

Performance criteria

Contents

Assessment criteria

Resources required

Information sheet

Self-check Quiz

Answer key

Activity

Video reference

Learner job sheet

Assessment plan

Review of competency

Module 1: Interpret Data and Business Domain

MODULE CONTENT

Module Descriptor: This unit covers the knowledge, skills and attitudes required to Interpret
data and the business domain. It specifically includes demonstrating
understanding on the domain of data science, interpreting concepts of
data analytics, characterizing a business problem, formulating a
business problem as a hypothesis question and using methodologies in
executing data science project cycles.

Nominal Duration: 20 hours

LEARNING OUTCOMES:

Upon completion of the module, the trainee should be able to:

1.1 Demonstrate understanding on the domain of data science.


1.2 Interpret concepts of data analytics.
1.3 Characterise a business problem.
1.4 Formulate a business problem as a hypothesis question.
1.5 Use methodologies in executing data science project cycle.

PERFORMANCE CRITERIA:

1. Data science is defined.


2. Scopes of data science are interpreted.
3. Benefits of using data science are articulated.
4. Roles of different occupations of data science are described.
5. Tools and technologies related to data science are described.
6. Data analytics processes are described.
7. Services of data analytics platforms are interpreted and their applications are explained.
8. End-to-end perspective of data products is interpreted.
9. Ways of working in a cross-functional team are explained.
10. Agile principles are interpreted.
11. Ability to understand a business problem is demonstrated.
12. Business problems are converted into quantifiable form.
13. Cross-functional stakeholders are identified for a business problem.
14. Collaboration with stakeholders is arranged to identify quantifiable improvements.

15. Key business indicators and target improvement metrics are identified.
16. Research questions with associated hypotheses are constructed from business problems.
17. Types of data needed to test the hypotheses are determined.
18. Hypotheses to be tested are aligned with business value.
19. Application of the scientific method to data science business problems is demonstrated.
20. Cross-industry standard process for data mining (CRISP-DM methodology) is described.
21. Data pipelining is explained.
22. Application of an experimental approach for finding insights and solutions is explained.
23. Application of the scientific method and the CRISP-DM methodology is followed when setting
up new data science projects.

Learning Outcome 1.1 – Demonstrate understanding on the
domain of data science

Contents:

▪ Data science
▪ Scopes of data science
▪ Roles of different occupations of data science
▪ Tools and technologies related to data science

Assessment criteria:

1. Data science is defined.


2. Scopes of data science are interpreted.
3. Benefits of using data science are articulated.
4. Roles of different occupations of data science are described.
5. Tools and technologies related to data science are described.

Resources required:

Students/trainees must be provided with the following resources:


▪ Workplace (Computer with internet connection).

LEARNING ACTIVITY 1.1

Learning Activity Resources/Special Instructions/References

Demonstrate understanding on the domain of data science.
▪ Information Sheet: 1.1
▪ Self-Check: 1.1
▪ Answer Key: 1.1

INFORMATION SHEET 1.1

Learning Objective: to Demonstrate understanding on the domain of data science.

● Data Science:
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and
systems to extract knowledge and insights from noisy, structured, and unstructured data. Data
science is related to data mining, machine learning, and big data. The goal of Data Science is to
turn data tombs into “golden nuggets” of knowledge.

● Scopes of data science


A recent survey revealed that around 97,000 data analytics positions are vacant in India
due to a lack of skilled professionals. The use of data analytics in almost every industry has
contributed to a sharp increase of 45% in the total jobs related to data science last year. The
growing demand for data scientists will give an idea about the scope of Data Science. Some of the
scopes of data science are illustrated in this section.

Business application:
Data about customers can reveal details about their habits, demographic characteristics,
preferences, aspirations, and more. With so many potential sources of customer data, a
foundational understanding of data science can help make sense of it. For instance, data is generated about a
customer each time they visit a website, add an item to their cart, complete a purchase, open an
email, or engage with a social media post. After ensuring the data from each source is accurate,
one needs to combine it in a process called data wrangling. This might involve matching a
customer’s email address to their credit card information, social media handles, and purchase
identifications. By aggregating the data, they can draw conclusions and identify trends in their
behaviours.
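As a rough illustration of the data-wrangling step described above, the short Python (pandas) sketch below joins two hypothetical customer data sources on a shared email column and aggregates the result. The column names and values are invented for demonstration only.

# A minimal sketch of data wrangling with pandas; names and values are illustrative.
import pandas as pd

# Hypothetical purchase records and social-media engagement records
purchases = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "a@example.com"],
    "amount": [25.0, 40.0, 15.0],
})
social = pd.DataFrame({
    "email": ["a@example.com", "b@example.com"],
    "platform": ["twitter", "facebook"],
})

# Combine the two sources on the shared customer identifier (email here)
combined = purchases.merge(social, on="email", how="left")

# Aggregate to identify trends in behaviour, e.g. total spend per customer
summary = combined.groupby("email")["amount"].sum().reset_index()
print(summary)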

Industrial application
Finance
Data analysis is helping financial institutions to engage with customers more meaningfully
by understanding their transactional patterns. The data of
transactions available to banks are used in risk and fraud management. The advent of data
science has led to better management of every client’s personal information. Banks are
beginning to understand the importance of collating and utilizing not only the debit and
credit transactions but also purchase history and patterns, mode of communication,
Internet banking data, social media, and mobile phone usage.
Energy
Data scientists help in cutting costs, reducing risks, optimizing investments, and improving
equipment maintenance. They use predictive models to monitor compressors, which, in
turn, can reduce the number of downtime days. Regarding the (data science) tools used in
extracting and evaluating data, it can range from Oracle, Hadoop, NoSQL, Python, and
various other software and solutions that can manipulate and analyze large datasets.

Manufacturing
Often referred to as industry 4.0 (with the introduction of robotization and automation as
the 4th industrial revolution), the manufacturing industry keeps growing in need of data
scientists where they can apply their knowledge of broad data management solutions
through quality assurance, tracking defects, and increasing the quality of supplier relations.
Similar to the energy industry, utilizing preventive maintenance to troubleshoot potential
future equipment issues is another focus where data scientists can find good usage of their
skills.

Pharmaceuticals
Connected to human health, the pharma industry has also emerged as an industry where
data science is increasing its application. For example, a pharmaceutical company can
utilize data science to ensure a more stable approach for planning clinical trials.

Research and Development:


A machine learning scientist researches new data manipulating approaches and designs new
algorithms to be used. They are often a part of the R&D department, and their work usually leads
to research papers. Their work is closer to academia yet in an industry setting. Job role titles that
can be used to describe machine learning scientists are Research Scientist or Research Engineer.

Risk management
Data science is used to increase the security of a business and protect sensitive information. For example,
banks use complex machine-learning algorithms to detect fraud based on
deviations from a user’s typical financial activities. These algorithms can catch fraud faster and with
greater accuracy than humans, simply because of the sheer volume of data generated every day.
Even if anyone doesn't work at a bank, algorithms can be used to protect sensitive information
through the process of encryption. Learning about data privacy can ensure that a company doesn’t
misuse or share customers’ sensitive information, including credit card details, medical information,
Social Security numbers, and contact information. “As organizations become more and more data-
centric, the need for the ethical treatment of individual data becomes equally urgent,” Tingley says
in Data Science Principles. It’s the combination of algorithms and human judgement that can move
businesses closer to a higher level of security and ethical use of data.

E-commerce
E-commerce and retail are some of the most relevant industries that require data analysis at the
largest level. The effective implementation of data analysis will help the e-commerce organizations
to predict the purchases, profits, losses, and even manipulate customers into buying things by
tracking their behavior. Retail brands analyze customer profiles and based on the results they
market the relevant products to push the customer towards purchasing.

Telecom
Today, advancements in technology have brought the world closer. People are able to connect
with their loved ones and other people sitting far away from them in just a few seconds. With the
increasing connectivity, the data is also increasing. With our daily calls, messages, etc, we are
generating a huge amount of data. So it is no surprise to know that Data Science in the Telecom
Industry is helping to handle such a large amount of data.

Product Optimization
Providing the best-suited products according to the needs of the customers is a very
important concern for any industry. The Telecom Industry is using Data Science to perform
the real-time analysis of customer data for improving their products. Various factors like
the customers’ usage, feedback, etc are taken into consideration for coming up with new
products that will benefit the customers as well as the industry.
Increased Network Security
One of the biggest concerns of the Telecom Industry is to ensure the security of the
networks. Data Science helps them to identify the problems. It also helps them to analyze
the previous data and make predictions about any problem or
complications that might appear in the near future. This analysis helps them to take suitable
actions for any problem before it has severe consequences.
Predictive Analytics
The Telecom industry has to manage and maintain a large number of devices that are
continuously running all the time. The Telecommunication sector performs predictive
analytics on the data collected by their devices for gaining valuable insights. These insights
help them in making some smarter data-driven decisions for becoming faster and better.
Medical
Electronic medical records, billing, clinical systems, and different sources produce huge volumes
of data every day. This presents a valuable opportunity for healthcare providers to ensure better
patient care powered by actionable insights from previous patient data.

Public sector
Big data is a rapidly evolving field that offers significant opportunities when explored and applied in
order to uncover deep insights into the data. Private organizations and the public sector are
adopting big data analytics applications at a rapid rate, and data is now an integral part of critical
decision-making. Public sector transactions, employment, education, manufacturing, and
agriculture, to name a few, generate a large amount of publicly available information.

Media and Entertainment


Consumers now expect rich media in different formats as and when they want it on a variety of
devices. Collecting, analyzing, and utilizing these consumer insights is now a challenge that data
science is stepping in to tackle. Data science is being used to leverage social media and mobile
content and understand real-time, media content usage patterns. With data science techniques,
companies can better create content for different target audiences, measure content performance,
and recommend on-demand content. For example, Spotify, the on-demand music streaming
service, uses Hadoop big data analytics to collect and analyze data from its millions of users to
provide better music recommendations to individual users.

Education
One challenge in the education industry where data science and analytics can help is to incorporate
data from different vendors and sources and use them on platforms not designed for varying data.
For example, the University of Tasmania with over 26,000 students has developed a learning and
management system that can track when a student logs into the system, the overall progress of
the student, and how much time is spent on different pages, among other things.
Big data can also be used to measure teachers’ effectiveness by fine-tuning teachers’ performance
by measuring subject matter, student numbers, student aspirations, student demographics, and
many other variables.

Digital marketing
Today, the entire sphere of digital marketing is powered by data science and artificial intelligence.
While Google can be considered a significant player, advertisers are not far behind. The
advertisements that most of us see in our feeds today are based on our browsing patterns and are
a reflection of what we like in a real sense. Data science makes it possible for advertisers to
understand the trends based on the data they collect in terms of browsing history or any other
personal data. This has helped advertisers reap more benefits out of digital advertisements as
compared to traditional advertisements.

● Benefits of using data science


Data Science is not as intimidating as it may seem. In reality, one of the main functions of a data scientist is
to study and structure business data so that more accurate insights can be extracted from it.
There are many benefits of Data Science to a business. Some of the benefits are illustrated in this
section.

Increases business predictability


When a company invests in structuring its data, it can work with what we call predictive analysis.
With the help of the data scientist, it is possible to use technologies such as Machine Learning and

Artificial Intelligence to work with the data that the company has and, in this way, carry out more
precise analyses of what is to come.
Ensures real-time intelligence
The data scientist can work with Robotic Process Automation (RPA) professionals to identify the
different data sources of their business and create automated dashboards, which search all this
data in real-time in an integrated manner.

Favours the marketing and sales area


Data-driven Marketing is a universal term nowadays. The reason is simple: only with data, we can
offer solutions, communications, and products that are genuinely in line with customer
expectations.

Improves data security


One of the benefits of Data Science is the work done in the area of data security. In that sense,
there is a world of possibilities. Data scientists work on fraud prevention systems, for example,
to keep the company’s customers safer. They also study recurring patterns of
behaviour in a company’s systems to identify possible architectural flaws.

Helps interpret complex data


Data Science is a great solution when we want to cross different data to understand the business
and the market better. Depending on the tools we use to collect data, we can mix data from
“physical” and virtual sources for better visualization.

Facilitates the decision-making process


Of course, from what we have presented so far, you should already imagine that one of the benefits
of Data Science is improving the decision-making process. This is because we can create tools to
view data in real-time, allowing more agility for business managers. This is done both by
dashboards and by the projections that are possible with the data scientist’s treatment of data.

● Roles of different occupations of data science


Data Scientist
Let’s start with the most general role, data scientist. A data scientist will deal with all aspects of the
project. Starting from the business side to data collecting and analyzing, and finally visualizing and
presenting. A data scientist knows a bit of everything; every step of the project, because of that,
they can offer better insights on the best solutions for a specific project and uncover patterns and
trends. Moreover, they will be in charge of researching and developing new algorithms and
approaches. Often, in big companies, team leaders in charge of people with specialized skills are
data scientists; their skill set allows them to oversee a project and guide it from start to finish.
Data Analyst
Data analysts are responsible for different tasks such as visualizing, transforming and manipulating
the data. Sometimes they are also responsible for web analytics tracking and A/B testing analysis.
Since data analysts are in charge of visualization, they are often in charge of preparing the data for
communication with the project's business side by preparing reports that effectively show the trends
and insights gathered from their analysis.
Data Engineer
Data engineers are responsible for designing, building, and maintaining data pipelines. They need
to test ecosystems for the businesses and prepare them for data scientists to run their algorithms.
Data engineers also work on batch processing of collected data and match its format to the stored
data. In short, they make sure that the data is ready to be processed and analyzed.

Machine Learning Engineer


Machine learning engineers are in very high demand today. They need to be very familiar with
various machine learning algorithms like clustering, categorization, and classification and must stay up
to date with the latest research advances in the field. To perform their job properly, machine
learning engineers need to have strong statistics and programming skills in addition to some
knowledge of the fundamentals of software engineering.

Data Architect
Data architects have some common responsibilities with data engineers. They both need to ensure
that the data is well-formatted and accessible for data scientists and analysts and improve the data
pipelines' performance. In addition to that, data architects need to design and create new database
systems that match the requirements of a specific business model and job requirements.
Business Intelligence Developer
Business Intelligence developers — also called BI developers — are in charge of designing and
developing strategies that allow business users to find the information they need to make decisions
quickly and efficiently. Aside from that, they also need to be very
comfortable using new BI tools or designing custom ones that provide analytics and business
insights to understand their systems better.

Database Administrator
A database administrator (DBA) is the information technician responsible for directing or performing
all activities related to maintaining a successful database environment. A DBA makes sure an
organization's database and its related applications operate functionally and efficiently.

● Tools and technologies related to data science


Technologies for Data Science

Artificial Intelligence
Artificial Intelligence or AI has been around for quite a long time. It has been used to make
interaction with technology and collecting customer data easier over the decades. Due to
its high processing speed and data access, it is now deeply rooted in our routine lifestyle.
From voice and language recognition, such as Alexa and Siri, to predictive analytics and
driverless cars, artificial intelligence is growing at a fast rate by bringing innovation,
providing a competitive edge to businesses, and changing the way how companies operate
today.

Cloud Services
As humongous data is generated daily, it becomes a challenge to find solutions for low-
cost storage and cheap power. This is where cloud computing and services come as a
saviour. Cloud services aim at storing large amounts of data for a low cost to efficiently
tackle the issues encountered regarding storage in data science.

Augmented Reality and Virtual Reality Systems


AR stands for Augmented Reality, whereas VR stands for Virtual Reality. This technology
has already caught the attention of individuals and businesses all around the world.
Augmented reality and virtual reality aim at enhancing the interactions between humans
and machines. They automate data insights with the help of machine learning and Natural
Language Processing (NLP), which facilitates data scientists and analysts in finding
patterns and generating shareable smart data.

Internet Of Things (IOT)


IoT refers to a network of various objects such as people or devices that have unique IP
addresses and an internet connection. These objects are designed in such a way to
communicate with each other with the help of internet access. Sensors and smart meters,
among others, are a few boons of the IoT, and data scientists intend to develop this
technology further to be able to use it in predictive analytics.

Big Data
Big Data refers to humongous amounts of data that may be either structured or
unstructured. These sets of data are too large to be quickly processed with the help of
traditional techniques, and hence advanced techniques need to be employed for the same.
Big Data boasts of technologies such as dark data migration and strong cybersecurity,
which would not have been possible without it. Smart bots are also a result of processing
big data to analyze the necessary information.

Automated Machine Learning


Automated Machine Learning is also called AutoML and has now become a buzzword. It
is now being recognized as an aid to developing better models for machine learning.
According to Gartner, more than 40 percent of tasks in data science were expected to be
automated by the year 2020.

Tools of data science
PYTHON
The Data Science tools and technologies are not limited to databases and frameworks.
Choosing the right programming language for Data Science is of utmost importance.
Python is used by a lot of data scientists for web scraping. Python offers various libraries
designed explicitly for Data Science operations. Various mathematical, statistical, and
scientific calculations can be performed with Python. Some of the widely used Python
libraries for Data Science are NumPy, SciPy, Matplotlib, Pandas, Keras, etc.
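The short sketch below illustrates, with made-up numbers, how a few of the libraries named above (NumPy, Pandas and Matplotlib) can be combined for a basic calculation, summary, and plot.

# A small illustration of NumPy, Pandas and Matplotlib; the data values are invented.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Basic statistical calculation with NumPy
sales = np.array([120, 135, 150, 160, 158, 170])
print("mean:", sales.mean(), "std:", sales.std())

# Tabular manipulation with Pandas
df = pd.DataFrame({"month": range(1, 7), "sales": sales})
print(df.describe())

# Simple visualisation with Matplotlib
plt.plot(df["month"], df["sales"], marker="o")
plt.xlabel("Month")
plt.ylabel("Sales")
plt.title("Monthly sales (illustrative data)")
plt.show()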

R
R provides a scalable software environment for statistical analysis and is one of the many
popular programming languages used in the Data Science sector. Data clustering and
classification can be performed in less time using R. Various statistical models can be
created using R, supporting both linear and nonlinear modelling.

EXCEL
Part of Microsoft’s Office tools, Excel is one of the best tools for Data Science freshers. It
also helps in understanding the basics of Data Science before moving into high-end
analytics. It is one of the essential tools used by data scientists for data visualization. Excel
represents the data in a simple way using rows and columns to be understood even by
non-technical users.

TABLEAU
Tableau is a data visualization tool that assists in decision-making and data analysis. You
can represent data visually in less time with Tableau so that everyone can understand it.
Advanced data analytics problems can be solved in less time using Tableau. You don’t
have to worry about setting up the data while using Tableau and can stay focused on rich
insights.

POWERBI
PowerBI is also one of the essential tools of Data Science integrated with business
intelligence. You can combine it with other Microsoft Data Science tools for performing
data visualization. You can generate rich and insightful reports from a given dataset using
PowerBI. Users can also create their data analytics dashboard using PowerBI.
The incoherent sets of data can be turned into coherent sets using PowerBI. You can
develop a logically consistent dataset that will generate rich insights using PowerBI. One
can generate eye-catching visual reports using PowerBI that can be understood by non-
technical professionals too.

TENSORFLOW
TensorFlow is widely used with various new-age technologies like Data Science, Machine
Learning, Artificial Intelligence, etc. TensorFlow is a Python library that can be used for building
and training Data Science models.
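As a hedged illustration only, the following sketch shows the general shape of building and training a small model with TensorFlow's Keras API (assuming TensorFlow 2.x is installed); the data is random and the architecture is arbitrary.

# A minimal sketch of building and training a Keras model; data and layers are illustrative.
import numpy as np
import tensorflow as tf

X = np.random.rand(100, 4).astype("float32")
y = (X.sum(axis=1) > 2.0).astype("float32")   # hypothetical binary target

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, verbose=0)
print(model.evaluate(X, y, verbose=0))        # [loss, accuracy]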

PyTorch
PyTorch is an open-source deep learning framework that’s known for its flexibility and ease
of use. This is enabled in part by its compatibility with the popular Python high-level
programming language favoured by machine learning developers and data scientists. It is
the work of developers at Facebook AI Research and several other labs. The framework
combines the efficient and flexible GPU-accelerated backend libraries from Torch with an
intuitive Python frontend that focuses on rapid prototyping, readable code, and support for
the widest possible variety of deep learning models.
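Under the same caveat, the sketch below shows a minimal PyTorch example: a small network, one forward pass, and one optimisation step on random, illustrative data.

# A minimal PyTorch sketch: a small feed-forward network and one training step.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X = torch.rand(16, 4)          # a batch of 16 samples with 4 features (illustrative)
y = torch.rand(16, 1)          # target values (illustrative)

pred = model(X)                # forward pass
loss = loss_fn(pred, y)        # compute the error
optimizer.zero_grad()
loss.backward()                # back-propagate gradients
optimizer.step()               # update the weights
print("loss:", loss.item())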

MONGODB
MongoDB is a high-performance database and is one of the top Data Science tools in the
market. One can store large volumes of data in a collection (MongoDB documents) offered
by MongoDB. It provides all the capabilities of SQL and supports dynamic queries.
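A minimal sketch of storing and querying documents through the pymongo driver is shown below; the connection string, database and collection names are assumptions for illustration, and a running MongoDB server is required.

# A minimal sketch of inserting and querying MongoDB documents with pymongo.
# The connection string and names ("shop", "orders") are illustrative assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["shop"]
orders = db["orders"]

# Insert a document (MongoDB stores schemaless JSON-like documents)
orders.insert_one({"customer": "a@example.com", "amount": 25.0})

# Dynamic query: find all orders above a threshold
for doc in orders.find({"amount": {"$gt": 10}}):
    print(doc)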

APACHE SPARK
Apache Spark is designed for performing Data Science calculations with low latency.
Based on the Hadoop MapReduce, Apache Spark can handle interactive queries and
stream processing. It has become one of the best Data Science tools in the market due to

its in-memory cluster computing. Its in-memory computing can increase the processing
speed significantly.
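The following PySpark sketch illustrates the idea of in-memory processing on a small, invented dataset; it assumes the pyspark package is installed and runs on a local session.

# A minimal PySpark sketch: create a DataFrame in memory and aggregate it.
# Column names and values are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

data = [("finance", 100), ("finance", 250), ("energy", 300)]
df = spark.createDataFrame(data, ["sector", "value"])

# Aggregate across the (local) cluster
df.groupBy("sector").sum("value").show()

spark.stop()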

APACHE HADOOP
Apache Hadoop is an open-source software widely used for the parallel processing of data.
Any large file is distributed/split into chunks and then handed over to various nodes. The
clusters of nodes are then used for parallel processing by Hadoop. Hadoop consists of a
distributed file system responsible for dividing the data into chunks and distributing it to
various nodes. Besides the Hadoop File Distribution System, many other Hadoop
components are used to parallelly process data, such as Hadoop YARN, Hadoop
MapReduce, and Hadoop Common.

YARN
YARN stands for “Yet Another Resource Negotiator”. It was introduced in Hadoop 2.0 to
remove the bottleneck on Job Tracker which was present in Hadoop 1.0. YARN was
described as a “Redesigned Resource Manager” at the time of its launching, but it has now
evolved to be known as a large-scale distributed operating system used for Big Data
processing.

APACHE KAFKA
Apache Kafka is a distributed messaging system used to transfer large volumes of data
from one application to another. Real-time data pipelines can be constructed in less time
using Apache Kafka. Known for its fault tolerance and scalability, Kafka will provide you
with zero data loss while transferring data over applications.
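As an illustration, the sketch below publishes one message with the kafka-python client; the broker address ("localhost:9092") and topic name ("transactions") are assumptions, and a reachable Kafka broker is required.

# A minimal sketch of publishing a message to Kafka with the kafka-python client.
# Broker address and topic name are assumptions for illustration.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Send a record to a hypothetical "transactions" topic
producer.send("transactions", {"customer": "a@example.com", "amount": 25.0})
producer.flush()  # block until all buffered records are delivered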

APACHE FLUME
Apache Flume is a reliable and distributed system for collecting, aggregating, and moving
massive quantities of log data. It has a simple yet flexible architecture based on streaming
data flows. Apache Flume is used to collect log data present in log files from web servers
and aggregate it into HDFS for analysis.

IBM Watson
IBM Watson Studio empowers data scientists, developers, and analysts to build, run and
manage AI models, and optimize decisions anywhere on IBM Cloud Pak for Data. It helps unite
teams, automate AI lifecycles, and speed time to value on an open multi-cloud architecture.

AWS
AWS provides a mature big data architecture with services covering the entire data
processing pipeline — from ingestion through treatment and pre-processing, ETL, querying
and analysis, to visualization and dashboarding. AWS helps to manage big data
seamlessly and effortlessly, without having to set up complex infrastructure or deploy
software solutions like Spark or Hadoop.

AZURE
Azure is Microsoft’s well-known cloud platform, competing with the Google Cloud and
Amazon Web Services. Microsoft Azure gives the freedom to build, manage, and deploy
applications on a massive global network. Azure provides over 100 services that enable
us to do everything from running our existing applications on virtual machines to exploring
new software paradigms such as intelligent bots and mixed reality.

Oracle Cloud Infrastructure


Oracle Cloud Infrastructure is a set of complementary cloud services that enable you to
build and run a wide range of applications and services in a highly available hosted
environment. Oracle Cloud Infrastructure (OCI) offers high-performance compute
capabilities (as physical hardware instances) and storage capacity in a flexible overlay
virtual network that is securely accessible from your on-premises network.

GCP
Google Cloud Platform (GCP) operates like other public cloud providers. It provides virtual
machines and hardware, housing them in a regional data center. The regions are then
divided into separate zones where data is stored. This allows resources to be housed near
your physical location. It also prevents failures and latency. In addition, there are global,
regional, and zonal resources.

Individual Activity:
▪ Show the benefits of Data Science.

SELF-CHECK QUIZ 1.1

Check your understanding by answering the following questions:

Write the correct answer for the following questions.

1. Define Data Science.

2. Why is Data Science so important?

3. How does Data Science help the telecom industry?

4. Describe the technologies of data science.

LEARNING OUTCOME 1.2 - Interpret Concepts of Data Analytics

Contents:

▪ Data analytics processes


▪ Services of data analytics platforms
▪ End-to-end perspective of data products
▪ Ways of working in a cross-functional team
▪ Agile principles

Assessment criteria:

1. Data analytics processes are described.


2. Services of data analytics platform are interpreted and their applications are explained.
3. End-to-end perspective of data products is interpreted.
4. Ways of working in a cross-functional team are explained.
5. Agile principles are interpreted.

Resources required:

Students/trainees must be provided with the following resources:

▪ Workplace (Computer with internet connection).

LEARNING ACTIVITY 1.2

Learning Activity Resources/Special Instructions/References

Interpret concepts of data analytics
▪ Information Sheet: 1.2
▪ Self-Check: 1.2
▪ Answer Key: 1.2

INFORMATION SHEET 1.2

Learning objective: to Interpret concepts of data analytics

● Data analytics processes

Data analytics is the science of analyzing raw datasets in order to derive a conclusion regarding
the information they hold. It enables us to discover patterns in the raw data and draw valuable
information from them. Like any scientific discipline, data analysis follows a rigorous step-by-step
process. Each stage requires different skills and know-how. To get meaningful insights, though, it’s
important to understand the process as a whole. An underlying framework is invaluable for
producing results that stand up to scrutiny.

Defining The Problem Statements:


The first step in any data analysis process is to define objectives. Sometimes this is also
called a ‘problem statement’. Defining an objective means coming up with a hypothesis and
figuring out how to test it.

Gathering Requirements
Before designing or developing a business intelligence tool, one must complete the
requirements gathering process. This involves collecting knowledge, information, goals,
and challenges to understand what is necessary for effective and impactful models or
projects.

Data Collection:
After establishing the objective and gathering requirements, one needs to create a strategy
for collecting and aggregating the appropriate data. A key part of this is determining which data
is needed. This might be quantitative (numeric) data, e.g. sales figures, or qualitative
(descriptive) data, such as customer reviews.

Data Cleansing
Once the data collection process is completed, the next step is to get it ready for analysis.
This means cleaning, or ‘scrubbing’, the data, which is crucial in making sure that the data you are
working with is of high quality. Key data cleaning tasks include:
Removing major errors, duplicates, and outliers—all of which are inevitable problems when
aggregating data from numerous sources.
Removing unwanted data points—extracting irrelevant observations that have no bearing
on intended analysis.
Filling in major gaps—it may be noticed that important data are missing. Once gaps are
identified, one can go about filling them.
A good data analyst will spend around 70-90% of their time cleaning their data. This might
sound excessive. But focusing on the wrong data points (or analyzing erroneous data) will
severely impact results.
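A brief pandas sketch of these cleaning tasks is given below; the DataFrame, column names and thresholds are illustrative only.

# A minimal data-cleaning sketch with pandas; values and thresholds are invented.
import pandas as pd

df = pd.DataFrame({
    "customer": ["a", "b", "b", "c", "d"],
    "amount": [25.0, 40.0, 40.0, None, 10_000.0],
})

df = df.drop_duplicates()                                   # remove duplicate rows
df["amount"] = df["amount"].fillna(df["amount"].median())   # fill major gaps
df = df[df["amount"] < df["amount"].quantile(0.99)]         # drop extreme outliers
print(df)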

Carrying out an exploratory analysis


Another thing many data analysts do (alongside cleaning data) is to carry out an
exploratory analysis. This helps identify initial trends and characteristics, and can even
refine your hypothesis. It refers to the critical process of performing initial investigations on
data so as to discover patterns, spot anomalies, test hypotheses, and check assumptions
with the help of summary statistics and graphical representations.
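The short sketch below shows a typical first pass of exploratory analysis with pandas (summary statistics, category counts, and a simple group comparison) on invented data.

# A brief exploratory-analysis sketch; data and column names are illustrative.
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "south", "north", "east"],
    "sales": [120, 95, 180, 60],
})

print(df.describe())                         # summary statistics for numeric columns
print(df["region"].value_counts())           # frequency of each category
print(df.groupby("region")["sales"].mean())  # spot initial trends by group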

Data Interpretation:
Data interpretation is the process of reviewing data through some predefined processes
which will help assign some meaning to the data and arrive at a relevant conclusion. It
involves taking the results of data analysis, making inferences about the relations studied, and
using them to draw conclusions.

Data Visualisation:

Data visualization helps to tell stories by curating data into a form easier to understand,
highlighting the trends and outliers. A good visualization tells a story, removing the noise
from data and highlighting the useful information. The most important thing that data
visualization does is discover the trends in data. After all, it is much easier to observe
data trends when all the data is laid out in front of you in a visual form as compared to data
in a table.
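As a small illustration, the sketch below plots invented monthly figures with Matplotlib; a spike that would be easy to miss in a table stands out immediately in the chart.

# A minimal visualisation sketch; the revenue values are invented.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
revenue = [100, 110, 108, 140, 240, 150]   # the May value is an obvious outlier

plt.bar(months, revenue)
plt.ylabel("Revenue")
plt.title("Monthly revenue (illustrative data)")
plt.show()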

Pattern Recognition:
Pattern recognition is a data analysis method that uses machine learning algorithms to
automatically recognize patterns and regularities in data. This data can be anything from
text and images to sounds or other definable qualities. Pattern recognition systems can
recognize familiar patterns quickly and accurately.
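A minimal pattern-recognition sketch is shown below: k-means clustering with scikit-learn automatically groups unlabelled points into the two clusters they form; the points are invented for demonstration.

# A minimal pattern-recognition sketch using k-means clustering; points are illustrative.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # one group of points
                   [8.0, 8.0], [8.1, 7.9], [7.8, 8.2]])  # another group of points

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(model.labels_)           # cluster assignment for each point
print(model.cluster_centers_)  # the discovered pattern centres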

● Services of data analytics platform

Microsoft Power BI
Power BI is an end-to-end analytics solution that has the huge advantage of being
immediately familiar to most information workers, due to the way it slots into the Office 365
ecosystem. However, it only comes as standard with enterprise licenses. Power BI has
been around for nearly ten years now and has become the analytics workhorse of choice
for thousands of organizations, but its most recent editions have placed it firmly at the head
of the pack thanks to comprehensive and constantly evolving automation and
augmentation capabilities.

Oracle Analytics Cloud


Oracle – the original king of the hill when it comes to databases – has in recent years
revamped and relaunched its product and service offerings to fit with the cloud-and-AI era.
Its natural language capabilities are among the most sophisticated in the field, accepting
queries in more than 28 languages – more than any other platform.
Oracle is also pushing hard on the concept of an autonomous database. This means using
machine learning algorithms to carry out many of the functions that would previously have
required organizations to employ an expensive human database administrator to carry out.
This includes data management, security updates, and performance tuning.

IBM Cognos Analytics


Cognos puts AI at the forefront of IBM’s own end-to-end analytics solution, enabling users
to ask and receive answers to their queries in natural language. This means that rather
than simply giving you graphs and charts to look at, it can explain what each one means,
and point you towards the insights it thinks you should be getting. It also uses a high level
of automation for its data cleansing and preparation functions, meaning the AI will
automatically spot and clean up bad data, remove duplicate information, or highlight areas
where something is missing. As with Microsoft's solution, it can be run entirely in the cloud
or installed locally on-premises, depending on your needs and the requirements of the data
you’re working with.

ThoughtSpot
ThoughtSpot is another comprehensive analytics suite that lets you query datasets in
natural language and emphasizes a friendly, pick-up-and-play approach to analytics. It
incorporates UI features that will be familiar to anyone who is used to social media – such
as autonomously curated feeds providing real-time insights into what is going on with your
data. This was a great move as social networks have evolved to become extremely efficient
at generating user engagement, and adapting their innovation to fit an analytics platform
has proven to be a winning formula. Its AI-powered assistant, SpotIQ, uses machine
learning to understand what a user is thinking and make suggestions, pointing out insights
that may have been overlooked or suggesting alternative methods.

Qlik
Qlik is another key player that has made confident moves into machine learning-driven
automation, most apparent in its Associative Engine that lets users see connections
between important data points before they make a single query. Another advantage is that
Qlik’s Data Literacy Project initiative is baked into the platform, which aims to ease the pain

of introducing analytics tools across traditionally non-technical or non-data-savvy
workforces.

Apache Spark
Spark is a mature open-source platform that has been around for six years and has
become incredibly popular during that time. That means there is a rich ecosystem of
extensions and plugins, making it up to just about any enterprise analytics task, such as
its MLib machine learning library. It also has a huge community of users and vendors
offering support and assistance, so its applications are adaptable to workforces of differing
levels of IT skill, and as you would expect it integrates easily with other Apache projects
such as Hadoop.

Sisense
Sisense is another solution that has grown in popularity and developed a reputation as a
market leader. It excels at allowing users to create collaborative working environments
where they can slice, dice, and analyze data as a team, using its Crowd Accelerated BI
features. Data can be lifted in from just about any source due to its "API-first" ethos, and
its powerful but user-friendly web browser interface simplifies the process of getting
started.

Talend
Talend is another very popular platform that has increased its automation capabilities in
line with current trends towards machine learning and smart computing. It carries out
automated data quality and compliance operations in the background to provide its users
with faster access to better quality insights. It is also open source, meaning there’s a strong
community of users to learn from and work with, and it’s simple to find example tools and
templates for just about any job you might need to do.

Salesforce Einstein Analytics


Gartner’s analysts this year rank Salesforce’s Einstein engine as having the strongest
capabilities when it comes to automated analytics. There are questions about how the
company will integrate Tableau's technology, which it acquired last year, with its existing
and hugely popular cloud analytics tools. Salesforce practically invented data-driven
marketing, and its customer relationship management (CRM) tools have been industry-
standard for years. Today, it is continuing to live up to its reputation as a leader of
innovation, enabling a level of automation that competitors are struggling to match.

SAS Viya
SAS continues to offer one of the world's most popular BI platforms. Trusted by thousands
of companies worldwide, SAS has built out its visualization capabilities with the release of
the Visual Analytics component and worked to enhance its automation capabilities. It's
designed to let users keep their entire analytics workflow on one unified platform.

● End to end perspective of data products:

In data science, not every project is successful, and there are many reasons behind such failures.
The following process for a data science project may help to achieve better results:

Scope the solution

The first step is to scope the solution. Sometimes it is very hard to find the balance between
expectations and pricing. My recommendation is to drive any decisions by looking at the real need.
This would allow putting aside the “nice to haves” and prioritizing the “mandatory features”.

The Proof Of Concept (POC) phase


The objective of this phase is to demonstrate that the project is viable and can bring value. To
optimize the process in this phase, you have to consider which factors to treat as constraints.

The constraints triangle

Basically, in a POC phase, I cannot constrain the time, the resources, and the accuracy at the
same time. Time means the amount of time I can spend on this project. Resources are the number
of persons allocated to this project. Accuracy means the accuracy of the model. Theoretically, you
can only constrain two of them. In reality, resources are more or less always constrained. Maybe
some companies can allocate an “infinite” number of resources on a project. I have never been in
such a situation. If you relax accuracy, it corresponds to a “hackathon mode” or “spike”. It is useful
when you want to investigate an idea, most probably, for a short time. The most realistic choice is
to relax time. Some clients have a hard time working in this mode because they have deadlines to
respect and they might have the feeling that the POC will never end. But, the trick is to communicate
progress to the client in an efficient way. For that reason, the Agile methodology is a good fit. This
paradigm was invented for software development, but it also adapts well to the context of a data
science project.

Productization

The machine learning code is a small piece of the entire solution. It is important to have a clear
picture of the additional steps needed for productization. Productization requires the support of data
engineers and DevOps.
Hopefully, the POC is done and we met the success criterion. Good! Now we need to productize
the pipeline. There are two main ways to proceed:
Hands-off: The results of the POC are translated into specifications. A team of software
developers takes over, rewrites the entire code, and productizes the pipeline.
Push to production: My code is reused “as is” and, with the help of data engineers and
developers, we build an infrastructure around it.
Both approaches have pros and cons. The hands-off approach will make sure that the solution is
more stable. Because a data scientist is not always an experienced developer (no offense here),
productization of the code is often in the hands of other people. This is often the case in a large
company. One detail: in this situation, the ownership of the code no longer belongs to the data
scientist, so fixing bugs is done by the software development team.

Maintaining and improving the pipeline

Now the pipeline is in production. However, it is important to create a process to capture the
feedback of the operator to continuously improve the solution. For instance, the operator may use
the product daily but propose some improvements to the algorithm: instead of
predicting whether a failure will occur, he may prefer a score or a probability of failure. Deriving a
score from the prediction would require only a minor change in the pipeline, namely returning the
probability instead of the predicted class. So there is a need to set up a feedback process as soon as
the first version of the pipeline is accessible. Even after the product is released, this continuous
improvement remains. To have a good interaction with the
user, it is important to improve and release as fast as possible. For that reason, avoid the hands-off
strategy, so that improvements can be made and feedback gathered faster. The feedback loop is also important to
capture the decisions made by the operator. When decisions don’t match the prediction of the
pipeline, it means we can improve the pipeline by incorporating the “reasons” behind the operator’s
decision. Over time, the role of the operator will change and get simpler. One consequence is
the ability to scale his capabilities. Ultimately, a predictive pipeline can become a prescriptive
pipeline: the prediction is then no longer interpreted by the operator but is directly connected to
the action taken.
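As a hedged sketch of the kind of change described above (returning a probability rather than a hard class), the example below uses scikit-learn's predict_proba on random, illustrative data; the feature and label definitions are assumptions, not the actual pipeline.

# Returning a failure probability instead of a predicted class; data is random/illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((200, 5))                    # hypothetical sensor features
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # hypothetical failure label

clf = LogisticRegression().fit(X, y)

new_reading = rng.random((1, 5))
print("predicted class:", clf.predict(new_reading)[0])
print("failure probability:", clf.predict_proba(new_reading)[0, 1])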

Company reorganization

It is important to understand that introducing an AI solution in a company, will disrupt the usual way
the employee would work. Sometimes, the employees’ role changes and additional skills might be
needed. For example, if one introduces a chatbot to help the customer support team, this team
might be reduced or reallocated to other tasks. The solution might also cause some business
reorientation in the client company. I have seen some companies changing their strategy from
selling products to selling services because of the new solution. This is often not anticipated by the
client. Sometimes, it can cause the project to stop for a while. It has happened to me that a company
would have been so affected by the algorithm that the reorganization and the integration of
the solution into their system would have required much more investment.

Cross-functional team

Organizational isolation isn’t conducive to success. For example, if a company is launching a new
product, R&D, marketing, sales, and distribution must all work together to bring it to market. Or if
your employer wants to improve the customer experience, sales, distribution, and customer support
need to be aligned around this common goal.
And that’s precisely where cross-functional teams come in. A cross-functional team is a working
group that’s comprised of professionals from a variety of different areas within an organization.
Oftentimes, these types of teams are temporary units that are set up to accomplish a specific goal.

● Ways of working in cross-functional teams


Because cross-functional teams are usually quickly assembled and have a limited time in which
to achieve an objective, it’s important to know what’s expected of you and how to interact with the
other team members. Keep the following pointers in mind:

● Look to the team leader for direction. Oftentimes, cross-functional teams appoint a leader who’s
responsible for coordinating tasks so you keep moving forward towards your goal. Even if you’re
more used to working in a matrixed environment, it’s advisable to listen to the team leader, since
he or she will likely have a better overview of the team members’ individual specializations and
responsibilities.

● Try to understand everyone’s priorities. Everyone on the team will be looking for their own
outcomes. It’s important to know where this specific project ranks in their lists of priorities to have
a better idea of the kind of time and effort everyone will invest.

● Communicate clearly and respectfully. Whether you know anybody on the team or not, it’s critical
to communicate clearly and respectfully. Otherwise, a lack of communication or a
miscommunication could stall the project you’re working on.

● Ask for resources when necessary. If you don’t have all the resources you need, inform your team
leader as soon as possible. You can’t be expected to perform well if you don’t have all the tools
you require.

● Accept accountability. In a cross-functional team, you’ll likely be the only expert in your specific
field. That means you must be accountable for the aspect of the project that pertains to your field.
Make sure you understand what’s being asked of you and give it your best effort.
Working in a cross-functional team can be interesting and invigorating. What’s more, if you keep
these pointers in mind, it’s an experience that can help you advance your career.

● Agile Principles

Agile teams are known to be highly efficient at getting work done. Because Agile teams share a
collaborative culture, efficiencies tend to have a ripple effect. Looking for an Agile solution provides
insights into delivery trends to remove bottlenecks and adapt workflow processes for improved
productivity.

There are 12 predefined Agile principles that can help businesses streamline their product-
development cycles and achieve better results through a flexible, reactive system. These principles
are briefly explained in this section.

Satisfy Customers Through Early & Continuous Delivery


The original formulation of the first of the Agile principles says, "our highest priority is to satisfy the
customer through early and continuous delivery of valuable software". However, it is perfectly
applicable in areas outside of software development.

As you can see, customer satisfaction sits on top of the 12 principles. Early and continuous delivery
increases the likelihood of meeting customers' demands and contributes to the generation of faster
ROI.

By applying this concept, you will increase your process's agility and respond to changes in a timely
fashion. On the other hand, your customers will be happier because they will get the value they are
paying for more frequently. Also, they will be able to provide you with feedback early on, so you will
be able to decrease the likelihood of making significant changes later in the process.

Welcome Changing Requirements Even Late in the Project


Still, if need be, change requests should be most welcome even at the late stages of project
execution. The original text of the second of the Agile principles states that your team needs to
"welcome changing requirements, even late in development. Agile processes harness change for
the customer's competitive advantage".

In traditional project management, any late-stage changes are taken with a grain of salt as this
usually means scope creep and thus higher costs. In Agile, however, teams aim to embrace
uncertainty and acknowledge that even a late change can still bear a lot of value to the end
customer. Due to the nature of Agile's iterative process, teams shouldn't have a problem
responding to those changes in a timely fashion.

Deliver Value Frequently


The third Agile project management principle originally states, "deliver working software frequently,
from a couple of weeks to a couple of months, with a preference to the shorter timescale". Its prime
goal is to reduce the batch sizes that you use to process work.

This principle became necessary due to the extensive amounts of documentation that were part of
the planning process in software development at the end of the 20th century. Logically, by taking it
to heart, you will reduce the time frame for which you are planning and spend more time working
on your projects. In other words, your team will be able to plan in a more agile way.

Break the Silos of Your Project
Agile relies on cross-functional teams to make communication easier between the different
stakeholders in the project. As the original text states, "business people and developers must work
together daily throughout the project".

In a knowledge work context that is not explicitly related to software development, you can easily
change the word "developers" to "engineers" or "designers" or whatever best suits your situation.
The goal is to create a synchronization between the people who create value and those who plan
or sell it. This way, you can make internal collaboration seamless and improve your process
performance.

Build Projects Around Motivated Individuals


The logic behind the fifth of the Agile principles is that by reducing micromanagement and
empowering motivated team members, projects will be completed faster and with better quality.

Like the original text following the Agile manifesto states, you need to "build projects around
motivated individuals. Give them the environment and support they need, and trust them to get the
job done".

The second sentence of this principle is especially important. If you don't trust your team and keep
even the tiniest decisions in your company centralized, you will only hinder your team's
engagement. As a result, individuals will never feel a sense of belonging to the purpose that a given
project is trying to fulfill, and you won't get the most out of their potential.

The Most Effective Way of Communication is Face-to-face


"The most efficient and effective method of conveying information to and within a development
team is face-to-face conversation."

In 2001, this principle was spot on. By communicating in person, you reduce the time between
asking a question and receiving an answer. However, in the modern work environment where
teams collaborate across the globe, it provides a severe limitation.

Thankfully, with the development of technology, you can broaden this Agile principle from strictly face-to-face to "synchronous", or otherwise direct, communication. As long as you have a way to quickly reach your team and discuss work matters without bouncing emails back and forth for days, you are good to go.

Working Software is the Primary Measure of Progress


The 7th of the Agile core principles is pretty straightforward. It doesn't matter how many working hours you've invested in your project, how many bugs you managed to fix, or how many lines of code your team has written. If the result of your work is not the way your customer expects it to be, you are in trouble.

Maintain a Sustainable Working Pace


The precise formulation of this principle is "Agile processes promote sustainable development. The
sponsors, developers, and users should be able to maintain a constant pace indefinitely."

Logically, when putting Agile into practice, your goal is to avoid overburdening people and to optimize the way you work so you can frequently deliver to the market and respond to change without requiring personal heroics from your team.

Continuous Excellence Enhances Agility


As stated by the Agile Manifesto founders, "continuous attention to technical excellence and good
design enhances agility". In a development context, this principle allows teams to create not just
working software but also a stable product of high quality.

As a result, changes to the code will be less likely to introduce bugs and malfunctions.
Still, the 9th of the Agile management principles is applicable in every industry. When you maintain
operational excellence, you will have less trouble reacting to changes and maintaining agility.

Simplicity is Essential
This principle's original content can be a bit confusing as it states, "Simplicity–the art of maximizing
the amount of work not done–is essential". Yet, it is very practical.

If you can do something in a simple way, why waste time complicating it? Your customers are not
paying for the amount of effort you invest. They are buying a solution to a specific problem that
they have. Keep that in mind, when implementing Agile and avoid doing something just for the sake
of doing it.

Self-organizing Teams Generate Most Value


Once again, we realize that when provided with freedom, motivated teams generate the most value
for the customer. When discussing this principle, the 17 fathers of Agile stated that "the best
architectures, requirements, and designs emerge from self-organizing teams".

If you have to push your team and "drive them forward", maybe you are not ready for Agile, or you need to make some changes to your leadership style.

Regularly Reflect and Adjust Your Way of Work to Boost Effectiveness


Finally, we've come to the last of the Agile management principles. It is related to evaluating your
performance and identifying room for improvement. The long version of the principle states: "At regular intervals, the team reflects on how to become more effective, then tunes and adjusts its behavior accordingly".

By doing this, you will be able to experiment and improve your performance continuously. If things
don't go as you've planned, you can discuss what went wrong and adjust to get back on track.

There are different Agile methods, but Agile itself is not a methodology or a framework. It is a set
of values and principles. This is why it is incredibly flexible and can be applied by different
organizations. However, to make a successful transformation, you need to have the necessary
foundation. Implementing the 12 Agile principles is precisely how you build it.

Individual Activity:
▪ Show the working process of a cross-functional team.

SELF-CHECK QUIZ 1.2

Check your understanding by answering the following questions:

1. Explain the end-to-end perspective of data products.

2. What is an Agile principle?

LEARNING OUTCOME 1.3 - Characterize a business problem

Contents:

● business problem
● Cross-functional stakeholders
● Key business indicators

Assessment criteria:

1. Ability to understand a business problem is demonstrated.


2. Business problem is converted into quantifiable form.
3. Cross-functional stakeholders are identified for a business problem.
4. Collaboration with stakeholders is arranged to identify quantifiable improvements.
5. Key business indicators and target improvement metrics are identified.

Resources required:

Students/trainees must be provided with the following resources:

▪ Workplace (computer with internet connection).

LEARNING ACTIVITY 1.3

Learning Activity Resources/Special Instructions/References


Characterize a business problem
▪ Information Sheets: 1.3
▪ Self-Check: 1.3
▪ Answer Key: 1.3

INFORMATION SHEET 1.3

Learning objective: to characterize a business problem.

● Ability to understand a business problem

The most common methodologies used for Advanced Analytics projects start with a step called
Problem Statement or Problem Shaping. This is a process of identifying the problem we want to
solve and the business benefits we want to obtain.
A key question at this stage is whether a problem is a Data Science problem at all and, if so, what kind. There is great value in being able to translate a business idea or question into a clearly formulated problem statement, and in being able to effectively communicate whether or not that problem can be solved by applying appropriate Machine Learning algorithms. A good data science problem should be specific and conclusive.

● Qualitative to quantitative business problem:

Consider the following mini-scenarios:

During a regular weekday lunch, as you are discussing how everybody’s weekend was,
one of your colleagues mentions she watched a particular movie that you have also been
wanting to watch. To know her feedback on the movie, you ask her – “Hey, was the
movie’s direction up to the mark?”
You bump into a colleague in the hallway who you haven’t seen for a couple of weeks.
She mentions she just returned from a popular international destination vacation. To know
more about the destination, you ask her – “Wow! Is it really as exotic as they show in the
magazines?”
Your roommate got a new video game that he has been playing nonstop for a few hours.
When he takes a break, you ask him – “Is the game really that cool?”
Did you find any of these questions ‘artificial’? Do re-read the scenarios and take a few
seconds to think through. Most of us would find these questions to be perfectly natural!

What would certainly be artificial though is asking questions like:

‘Hey, was the movie direction 3.5-out-of-5?’, or


‘Is the vacation destination 8 on a scale of 1-to-10?’, or
‘Is the video game in the top 10 percentile of all the video games?’
In most scenarios, we express our tasks in qualitative terms. This is true about business
requirements as well.

Isn’t it more likely that the initial client ask will be “Build us a landing page which is
aesthetically pleasing yet informative” versus “we need a landing page which is rated at
least 8.5-out-of-10 by 1000 random visitors to our website on visual-appeal, navigability
and product-information parameters”?

On the other hand, systems are built and evaluated based on exact quantitative requirements. For example, the database query has to return in less than 30 milliseconds, the website has to fully load in less than 3 seconds on a typical 10 Mbps connection, and so on.
This gap between qualitative business requirements and quantitative machine requirements is exacerbated when it comes to data-driven products. Some of the resulting challenges are described here.

A typical business requirement for a data-driven product could be “develop an optimal


digital marketing strategy to reach the likely target customer population”. Converting this
to a quantifiable requirement has several non-trivial challenges. Some of these are:

How do we define ‘optimal’: Do we focus more on precision or more on recall? Do we


focus more on accuracy (is the approached customer segment really our target customer
segment or not)? Or do we focus more on efficiency (how quickly do we make a go/no-go
decision once the customer segment is exposed to our algorithm)?
How do we actually evaluate if we have met the optimal criteria? And if not, how much of
a gap exists?
To define customers ‘similar’ to our target population, we need to agree on a set of N
dimensions that will be used for computing this similarity:

Patterns in the browsing history


Patterns in e-shopping
Patterns in user-provided meta-data, and so on. Or do we need to devise a few other dimensions?
After that, we need to critically evaluate whether all the relevant data exists in an accessible
format. If not, are there ways to infer at least parts of it?

● Cross Functional Stakeholders and their Collaboration

Analytics teams focused on detecting meaningful business insights may overlook the need to
effectively communicate those insights to their cross-functional partners who can use those
recommendations to improve the business. Part of the DoorDash Analytics team’s success comes
from its ability to communicate actionable insights to key stakeholders, not just identify and
measure them. Many analytics teams that don’t emphasise communication let insights slip through
the cracks when executives don’t understand recommendations or their business impact.

To combat this common problem, analytics teams need to understand the strategies used to ensure an analytics insight is not overlooked. This can be done by employing a number of communications best practices designed to identify the business decision makers who can act on the insights and to explain the recommendation directly, in a way that addresses their interests clearly and concisely with supporting analytics and visuals.

Teams that can communicate effectively using these best practices benefit from the virtuous cycle
of generating good insights, where emphasizing clear communication ensures focus on finding a
clear direction and being actionable. The process of articulating key insights and formulating
recommendations can serve as a forcing function to make data analysis more focused and more
likely to be successful in driving business impact.

Here are the best practices that the DoorDash Analytics team uses to emphasise communication,
clarify our thinking, and ensure no actionable insights are overlooked.


Analytics communications best practices

While there is no silver bullet to guarantee effective communication, adhering to some best
practices can help data scientists present their insights effectively and drive business impact, while
getting 1% better every day, one of our core pillars at DoorDash. The best practices laid out below
describe techniques that can help a data scientist’s communication by focusing on presenting what
the audience really needs to know in a way they will understand, and avoiding common
communication pitfalls which may distract from the insight and related recommendations.

Use a TL;DR to clearly communicate what matters

Clearly communicating the business benefits of an analytics insight is important to capture the
attention of key stakeholders so they will consider the recommendations that are supported by the
data. The better analytics teams are at communicating effectively, the more time they can spend
measuring insights. Part of perfecting this art of communication is ensuring that all communications capture the intended audience's attention and put them on the path to wanting to quickly learn more.

To grab the reader’s attention and highlight an insight’s relevance to the business, we often include
a TL;DR at the beginning of every analysis. The TL;DR (short for “Too Long; Didn’t Read”) is a
clear, concise summary of the content (often one line) that frames key insights in the context of
impact on key business metrics.

While the analytics work that produced the insight may be highly complex, key takeaways and
recommendations can usually be distilled down to a few sentences. Even if the TL;DR was the
analysis’ conclusion, it should still kick off communication. If writing a few sentences to summarise
the key insight and why it matters to the audience is challenging for a data scientist, that should
send the signal that the subject matter is not currently understood well enough to communicate
with key stakeholders and should be worked on further.

Identify your audience and speak in their language


Ensuring that analytics insights improve the business means actually sharing the insights with key
stakeholders who can enact a recommendation. While sharing insights with influencers may seem
helpful, sharing insights with audiences that can’t enact recommendations will not directly ensure
insights translate into business improvements. Being laser-focused on speaking to the right
audience can increase the pace of execution significantly since working directly with decision
makers speeds up the pace of making business decisions.

After identifying the audience for the new insight, tailoring communication to them will increase the
likelihood that the recommendation will be convincing. In order to speak directly to the kinds of
business stakeholders that will likely be the intended audience, it's important to try and understand
who they are and their priorities. Typically, business decision makers are very busy with a lot of
priorities competing for their attention, which is especially true in startups and fast-growing
companies. Therefore, connecting the new insights and recommendations to the existing goals and
objectives of the target audience is one of the easiest ways to grab and hold their attention. A brief
explanation of why the insight matters, framed in terms of potential impact on the audience’s key
performance metrics, is a concise way of highlighting the value and relevance of an insight to their
performance success.

Use simple data visualisations to support written communications

When communicating data-driven insights, data visualisation can be a very useful tool since a
picture is worth 1,000 words. However, data visualisations should not be seen as a replacement
for the written communication of insights. Even though data visualisations take a leading role in
explaining insights they still require interpretation to be fully understood.
When utilising visualisations, avoid confusing the audience. Presenting unnecessarily complex
visualisations can distract from the key insight and make the overall communication of an insight
less effective. This often occurs because analysts have a bias towards using the data visualisation
technique that helped discover the insight, which might not be the best way to communicate the
insight to every audience.

Avoid extraneous trivia that distracts from the narrative


In an effort to appear data-driven, many presentations and documents include a laundry list of
metrics presented without context, which have little informational value to the audience. Even
summaries are sometimes inundated with numbers. Data presented without narrative can
overwhelm even the most data-savvy audience and make it difficult to extract a coherent story. Any
insight which is not actionable is trivia. Knowing trivia is fun but can easily turn into a distraction
and fog up the general message and recommendations that should be delivered.

Leverage a structured communication strategy


A structured communication strategy goes a long way in driving alignment with the audience.
Consider a three step communication strategy. The first step involves ‘telling’ the audience the
subject of the talk, then actually ‘telling’ them, and then summarizing what they were just ‘told’. This
communication style is most relevant for a meeting with cross-functional participants because
analytics insights and recommendations can oftentimes get granular or technical, making it harder
for all the stakeholders to successfully follow along. Therefore, it is important to summarize the
agenda upfront and recap the conclusions at the end of the meeting.

● Collaboration with Stakeholders

Already Discussed in Section 3.3

● Key Business Indicators

This section sets out the business benefits of performance measurement and target-setting. It
shows you how to choose which Key Performance Indicators (KPIs) to measure and suggests
examples in a number of key business areas. It also highlights the main points to bear in mind
when setting targets for your business.

Performance measurement

Knowing how the different areas of your business are performing is valuable information in its own
right, but a good measurement system will also let you examine the triggers for any changes in
performance. This puts you in a better position to manage your performance proactively.
One of the key challenges with performance management is selecting what to measure. The priority
here is to focus on quantifiable factors that are clearly linked to the drivers of success in your
business and your sector. These are known as key performance indicators (KPIs).
For example, if your business succeeds or fails on the quality of its customer service, then that's
what you need to measure - through, for example, the number of complaints received and so on.

Benefits of target-setting
If you've identified the key areas that drive your business performance and found a way to measure
them, then a natural next step is to start setting performance targets to give everyone in your
business a clear sense of what they should be aiming for.

Strategic visions can be difficult to communicate, but by breaking your top level objectives down
into smaller concrete targets you'll make it easier to manage the process of delivering them. In this
way, targets form a crucial link between strategy and day-to-day operations.

Measurement of your financial performance


Getting on top of financial measures of your performance is an important part of running a growing
business.
It will be much easier to invest and manage for growth if you understand how to drill into your
management accounts to find out what's working for your business and to identify possible
opportunities for future expansion.

Measuring your profitability


Most growing businesses ultimately target increased profits, so it's important to know how to
measure profitability. The key standard measures are listed below, followed by a short worked example:

Gross profit margin - this measures how much money is made after direct costs of sales
have been taken into account, or the contribution as it is also known.
Operating margin - the operating margin lies between the gross and net measures of
profitability. Overheads are taken into account, but interest and tax payments are not. For
this reason, it is also known as the EBIT (earnings before interest and taxes) margin.
Net profit margin - this is a much narrower measure of profits, as it takes all costs into
account, not just direct ones. So all overheads, as well as interest and tax payments, are
included in the profit calculation.
Return on capital employed (ROCE) - this calculates net profit as a percentage of the
total capital employed in a business. This allows you to see how well the money invested
in your business is performing compared to other investments you could make with it, like
putting it in the bank.
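Putting the measures above into numbers, here is a minimal sketch with entirely hypothetical figures; only the ratio definitions themselves are standard.

# Hypothetical figures used only to illustrate the standard profitability measures.
revenue = 1_000_000
cost_of_sales = 600_000        # direct costs
overheads = 250_000            # operating expenses
interest_and_tax = 50_000
capital_employed = 800_000

gross_profit = revenue - cost_of_sales
operating_profit = gross_profit - overheads        # also known as EBIT
net_profit = operating_profit - interest_and_tax

print(f"Gross profit margin: {gross_profit / revenue:.1%}")          # 40.0%
print(f"Operating margin:    {operating_profit / revenue:.1%}")      # 15.0%
print(f"Net profit margin:   {net_profit / revenue:.1%}")            # 10.0%
print(f"ROCE:                {net_profit / capital_employed:.1%}")   # 12.5%
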
Other key accounting ratios

There are a number of other commonly used accounting ratios, such as liquidity and efficiency ratios, that provide useful measures of business performance.

Measurement of your customers


Finding and retaining customers is a crucial task for every business. So when looking for areas of
your business to start measuring and analysing, it's worth asking yourself if you know as much as
possible about your clientele.

Widen your focus beyond current customers


Selling more to existing customers might be the easiest way of increasing sales, but most
businesses aiming for significant growth will need to find ways of reaching new groups of
customers.
So knowing more about sections of the market you haven't yet tapped is crucial.

Measurement of your employees


As your business grows the number of people you employ is likely to increase. To keep on top of
how your staff are doing, you may need to find slightly more formal ways of measuring their
performance.

Measuring through meetings and appraisals
Informal meetings and more formal appraisals provide a very practical and direct way of
monitoring and encouraging the progress of individual employees. They allow frank
exchanges of views by both sides and they can also be used to drive up productivity and
performance through setting employee targets and measuring progress towards achieving
them.

Regular staff meetings can also be a very useful way of keeping tabs on wider
developments across your business. These meetings often give an early indicator of
important concerns or developments that might otherwise take some time to come to the
attention of your management team.

Quantitative measurement of employee performance

Looking at employee performance from a financial perspective can be a very valuable


management tool. At the level of reporting for the overall business, the most commonly-
used measures are sales per employee, contribution per employee and profit per
employee.

Expressing employee performance quantitatively is easier for some sectors and for some
types of worker. For example, it should be quite easy to see what kind of sales an individual
salesperson has generated, or how many units manufacturing employees produce per
hour at work.

But with a bit more effort, these kinds of measures can be applied in almost any business
or sector. For example, using timesheets to assess how many hours an employee devotes
each month to different projects or customers under their responsibility gives you a way of
assessing what the most profitable use of their time is.
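As a quick illustration of the per-employee measures mentioned above, the following sketch uses made-up annual figures for a small business.

# Hypothetical annual figures; only the per-employee calculations are the point here.
total_sales = 5_000_000
total_contribution = 2_000_000   # sales minus direct costs
total_profit = 600_000
headcount = 25

print("Sales per employee:       ", total_sales / headcount)         # 200000.0
print("Contribution per employee:", total_contribution / headcount)  # 80000.0
print("Profit per employee:      ", total_profit / headcount)        # 24000.0
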

Individual Activity:
▪ Convert a business problem into a quantifiable form.

SELF-CHECK QUIZ 1.3

Check your understanding by answering the following questions:

1. Describe the impact of cross functional stakeholder in a data science business problem.

LEARNING OUTCOME 1.4 – Formulate Business Problem as a
Hypothesis Question

Contents:

▪ Research questions with associated hypotheses


▪ Types of data

Assessment criteria:

1. Research questions with associated hypotheses are constructed from business problem.
2. Types of data needed to test the hypotheses are determined.
3. Hypotheses to be tested are aligned with business value.

Resources required:

Students/trainees must be provided with the following resources:

Workplace (Computer with internet connection).

LEARNING ACTIVITY 1.4

Learning Activity Resources/Special Instructions/References


Formulate Business ▪ Information Sheets: 1.4
Problem as a ▪ Self-Check: 1.4
Hypothesis Question ▪ Answer Key: 1.4

INFORMATION SHEET 1.4

Learning Objective: to formulate business problem as a hypothesis question

● Research questions with associated hypotheses are constructed from business problem

The first step towards problem-solving in data science projects isn't building machine learning models. That distinction belongs to hypothesis generation, the step where we combine our problem-solving skills with our business intuition. It is a truly crucial step in ensuring a successful data science project. Understanding the problem statement with good domain knowledge is important, and formulating hypotheses will further expose you to newer ideas for problem-solving.

Hypothesis Generation
Hypothesis generation is an educated "guess" about the various factors that may be impacting the business problem that needs to be solved using machine learning. When framing a hypothesis, the data scientist should not already know its outcome; the guess is made before looking at the evidence.
“A hypothesis may be simply defined as a guess. A scientific hypothesis is an intelligent guess.” –
Isaac Asimov
Hypothesis generation is a crucial step in any data science project. If you skip this or skim through
this, the likelihood of the project failing increases exponentially.
Here are 5 key reasons why hypothesis generation is so important in data science:

● Hypothesis generation helps in comprehending the business problem as we dive deep into the various factors affecting our target variable.
● You get a much better idea of the major factors responsible for solving the problem.
● It identifies the data that needs to be collected from various sources, which is key to converting your business problem into a data science problem.
● It improves your domain knowledge if you are new to the domain, as you spend time understanding the problem.
● It helps you approach the problem in a structured manner.

● Type of data needed to test the hypotheses are determined

Hypothesis Generation Based On Various Factors

Distance/Speed based Features


As a running example, consider the problem of predicting taxi trip duration. Let us try to come up with a formula that relates to trip duration and would help us generate various hypotheses for the problem:
TIME = DISTANCE / SPEED
Distance and speed play an important role in predicting the trip duration. We can notice that the
trip duration is directly proportional to the distance traveled and inversely proportional to the speed
of the taxi. Using this we can come up with a hypothesis based on distance and speed.
● Distance: More the distance traveled by the taxi, the more will be the trip duration.
● Interior drop point: Drop points to congested or interior lanes could result in an increase in
trip duration
● Speed: Higher the speed, the lower the trip duration
Features based on Car

Cars come in various types, sizes and brands, and these features could be vital not only for the safety of the passengers but also for the trip duration. Let us now generate a few hypotheses based on the features of the car.
● Condition of the car: Good conditioned cars are unlikely to have breakdown issues and
could have a lower trip duration
● Car Size: Small-sized cars (Hatchback) may have a lower trip duration and larger-sized
cars (XUV) may have higher trip duration based on the size of the car and congestion in
the city

Type of the Trip


Trip types can be different based on trip vendors – it could be an outstation trip, single or pool rides.
Let us now define a hypothesis based on the type of trip used.
● Pool Car: Trips with pooling can lead to higher trip duration as the car reaches multiple
places before reaching your assigned destination

Features based on Driver Details


A driver is an important person when it comes to commute time. Various factors about the driver can help in understanding the reason behind trip duration, and here are a few hypotheses based on this.
● Age of driver: Older drivers could be more careful and could contribute to higher trip
duration
● Gender: Female drivers are likely to drive slowly and could contribute to higher trip duration
● Driver experience: Drivers with very less driving experience can cause higher trip duration
● Medical condition: Drivers with a medical condition can contribute to higher trip duration
Passenger details
Passengers can influence the trip duration knowingly or unknowingly. We often come across passengers requesting drivers to increase the speed because they are getting late, and there are other factors we can hypothesize about.
● Age of passengers: Senior citizens as passengers may contribute to higher trip duration
as drivers tend to go slow in trips involving senior citizens
● Medical conditions or pregnancy: Passengers with medical conditions contribute to a
longer trip duration
● Emergency: Passengers with an emergency could contribute to a shorter trip duration
● Passenger count: Higher passenger count leads to shorter duration trips due to congestion
in seating

Date-Time Features
The day and time of the week are important as New York is a busy city and could be highly
congested during office hours or weekdays. Let us now generate a few hypotheses on the date
and time-based features.
Pickup Day:
● Weekends could contribute to more outstation trips and could have a higher trip duration
● Weekdays tend to have higher trip duration due to high traffic
● If the pickup day falls on a holiday then the trip duration may be shorter
● If the pickup day falls on a festive week then the trip duration could be lower due to lesser
traffic
Time:
● Early morning trips have a lesser trip duration due to lesser traffic
● Evening trips have a higher trip duration due to peak hours
Road-based Features
Roads are of different types and the condition of the road or obstructions in the road are factors
that can’t be ignored. Let’s form some hypotheses based on these factors.
● Condition of the road: The duration of the trip is more if the condition of the road is bad
● Road type: Trips in concrete roads tend to have a lower trip duration
● Strike on the road: Strikes carried out on roads in the direction of the trip causes the trip
duration to increase

Weather Based Features


Weather can change at any time and could possibly impact the commute if the weather turns bad.
Hence, this is an important feature to consider in our hypothesis.
● Weather at the start of the trip: Rainy weather condition contributes to a higher trip duration
● Hypotheses to be tested are aligned with business value

Hypothesis testing is a common statistical tool used in research and data science to support the certainty of findings. The aim of testing is to answer how probable it is that an apparent effect was detected by chance, given a random data sample. This section provides a detailed explanation of the key concepts in Frequentist hypothesis testing, using problems from the business domain as examples.

What is a hypothesis?

A hypothesis is often described as an “educated guess” about a specific parameter or population.


Once it is defined, one can collect data to determine whether it provides enough evidence that the
hypothesis is true.

Hypothesis testing

In hypothesis testing, two mutually exclusive statements about a parameter or population


(hypotheses) are evaluated to decide which statement is best supported by sample data.

Parameters and statistics

In statistics, a parameter is a description of a population, while a statistic describes a small portion


of a population (sample). For example, if you ask everyone in your class (population) about their
average height, you receive a parameter, a true description about the population since everyone
was asked. If you now want to guess the average height of people in your grade (population) using
the information you have from your class (sample), this information turns into a statistic.
Hypothesis tests including a specific parameter are called parametric tests. In parametric tests, the
population is assumed to have a normal distribution (e.g., the height of people in a class).

Non-parametric tests

In contrast, non-parametric tests (also distribution-free tests) are used when parameters of a
population cannot be assumed to be normally distributed. For example, the price of diamonds
seems exponentially distributed (below right). Non-parametric doesn’t mean that you do not know
anything about a population but rather that it is not normally distributed.

(Figure: left, an example of normally distributed data; right, an example of a non-normal data distribution.)
For simplicity, I will focus on parametric tests in the following, with a few mentions on
where to look further if a normal distribution cannot be assumed.
Real-world examples
An often-used example to explain hypothesis tests is the fair coin example. It is an excellent
way to explain the basic concepts of a test, but it is also very abstract. More tangible examples of possible hypotheses in business could be:

Hypothesis 1: Average order value has increased since last financial year
Parameter: Mean order value
Test type: one-sample, parametric test (assuming the order value follows a normal
distribution)

Hypothesis 2: Investing in A brings a higher return than investing in B


Parameter: Difference in mean return
Test type: two-sample, parametric test, also AB test (assuming the return follows a normal
distribution)

Hypothesis 3: The new user interface converts more users into customers than the expected
30%
Parameter: none

Test type: one-sample, non-parametric test (assuming number of customers is not normally
distributed)
One-sample, two-sample, or more-sample test
When testing hypotheses, a distinction is made between one-sample, two-sample and multi-sample tests. This is not to be confused with one- and two-sided tests, which we will cover
later. In a one-sample test, a sample (average order value this year) is compared to a known
value (average order value of last year). In a two-sample test, two samples (investment A
and B) are compared to each other.
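As a hedged illustration of a two-sample (A/B) test such as Hypothesis 2 above, the sketch below compares the mean returns of two investments with an independent two-sample t-test from SciPy. The return figures are invented, and the significance level and p-value used at the end are explained in the steps that follow.

# Hedged sketch of a two-sample (A/B) t-test; the return figures below are invented.
from scipy import stats

returns_a = [5.1, 4.8, 6.0, 5.5, 4.9, 5.7, 5.3, 5.8]   # % returns of investment A
returns_b = [4.2, 4.6, 4.4, 4.9, 4.1, 4.7, 4.3, 4.5]   # % returns of investment B

# equal_var=False gives Welch's t-test, which is more robust to unequal variances and sample sizes
t_stat, p_value = stats.ttest_ind(returns_a, returns_b, equal_var=False)

alpha = 0.05   # chosen significance level (see the alpha discussion below)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value <= alpha:
    print("Reject H0: the mean returns of A and B appear to differ.")
else:
    print("Fail to reject H0: no significant difference detected.")
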

Basic Steps of a hypothesis test


Several steps are used to test a hypothesis and verify its significance. In the following, I will
explain each step in detail and will use the examples from above to explain all concepts.

1. Null & Alternative hypothesis

The null and alternative hypotheses are the two mutually exclusive statements about a
parameter or population mentioned in the introduction. The null hypothesis (often
abbreviated as H0) claims that there is no effect or no difference. The alternative hypothesis
(often abbreviated as H1 or HA) is what you want to prove. Using one of the examples
from above:
H0: There is no difference in the mean return from A and B, or the difference between A
and B is zero.
H1: There is a difference in the mean return from A and B, or the difference between A and B is not zero.
One-sided and two-sided (one-tailed and two-tailed) tests
The example hypotheses above describe a so-called two-tailed test. In a two-tailed test, you
are testing in both directions, meaning it is tested whether the mean return from A is
significantly greater and significantly less than the mean return from B.
In a one-tailed test, you are testing in one direction, meaning it is tested either if the mean
return from A is significantly greater or significantly less than the mean return from B. In
this case, the alternative hypothesis would change to:
H1: The mean return of A is greater than the mean return of B. OR
H1: The mean return of A is lower than the mean return of B.

2. Selection of an appropriate test statistic


To test your claims, you need to decide on the right test or test statistic. Often discussed
tests are the t-test, z-test, or F-test, which all assume a normal distribution. However, in
business, a normal distribution often cannot be assumed. Therefore, I will briefly explain
the main concepts you need to know to find the proper test for your hypothesis.
Test statistic
Parametric or non-parametric test, each test has a test statistic. A test statistic is a numerical
summary of a sample. It is a random variable as it is derived from a random sample. In
hypothesis tests, it compares the sample statistic to the expected result of the null
hypothesis. The selection of the test statistic is dependent on:
Parametric vs. non-parametric
Number of samples (one, two, multiple)
Discrete (e.g. number of customers) or continuous variable (e.g. order value)
Let’s assume that the mean average order value (AOV) in your web shop used to be $20. After hiring a new web designer with promising skills, the AOV increased to $22. You want to test whether the mean AOV has significantly increased:
Parameter: mean AOV (continuous variable, assumed to be normally distributed)
Sample statistic: $22 (one sample)

Expected value: $20
Test statistic: t-score
Test: one-sample t-test

3. Selection of the appropriate significance level


When testing hypotheses, we cannot always test it on the whole population but only on
randomly selected data samples. Can we, therefore, say that our conclusions are always
100% true for the population? Not really. There are two types of errors that we can make:
Type I error: Rejecting the null hypothesis when it is true.
Type II error: Accepting the null hypothesis when it is false.
Alpha is the probability of the type I error and the chance of making a mistake by rejecting
the null hypothesis when it is true. The lower the alpha, the better. It is, therefore, used as
a threshold to make decisions. Before starting a hypothesis test, you generally pick an error
level you are willing to accept. For example, you are willing to accept a 5% chance that
you’re mistaken when you reject the null hypothesis.
But, wouldn’t I always want to be 100% confident that I didn’t make a mistake, so alpha =
0%?
Power of a test
Yes, this is where it gets problematic. Because next to alpha, we also have beta, the
probability of the type II error. 1-beta is the probability of not making a Type II error and is defined as the power of a test. The lower the beta, the higher the power. Naturally, you would like to keep both errors as low as possible. However, it is essential to note that both errors somewhat work against each other: assume you want to minimise the Type I error, that is, the mistake of rejecting the null hypothesis when it is true. Then, the easiest way would be to
just always accept it. But this would then work directly against the type II error, namely
accepting it when it is not true.
Therefore, commonly used significance (alpha) levels 0.01, 0.05, or 0.10 serve as a good
balance and should be determined before data collection. Note here that in a two-tailed test,
the alpha level is split in half and applied to both sides of the sampling distribution of a
statistic.

(Figure: left, a sampling distribution with a rejection region on one side only; right, a sampling distribution with rejection regions on both sides.)
4. Data collection
To run a hypothesis test, we need a portion of the true population of interest, a random
sample. The sample should be randomly selected to avoid any bias or undesirable effects.
The question about the optimal sample size is not an easy one to answer. Generally, it is safe to say: the more data, the better. However, there are cases where this is hard to achieve due to budget or time constraints, or just the nature of the data. There are several formulas available that help find the right sample size:
Cochran's sample size formula
Slovin's formula
Moreover, some tests can be used with small sample sizes, such as the t-test, or non-parametric tests, which generally need smaller sample sizes as they do not require a normal distribution.
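As a hedged illustration of the two formulas named above, the sketch below computes sample sizes with Cochran's formula (z-score, expected proportion and margin of error) and Slovin's formula (population size and margin of error). The chosen z, p, e and N values are examples, not requirements.

# Illustrative sample-size calculations; z, p, e and N below are example choices only.
import math

def cochran(z: float, p: float, e: float) -> float:
    # Cochran's formula: n0 = z^2 * p * (1 - p) / e^2
    return (z ** 2) * p * (1 - p) / (e ** 2)

def slovin(N: int, e: float) -> float:
    # Slovin's formula: n = N / (1 + N * e^2)
    return N / (1 + N * e ** 2)

# 95% confidence (z ~ 1.96), assumed proportion 0.5, 5% margin of error
print("Cochran:", math.ceil(cochran(1.96, 0.5, 0.05)))   # about 385 respondents

# Known population of 10,000 and a 5% margin of error
print("Slovin: ", math.ceil(slovin(10_000, 0.05)))       # about 385 respondents
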
Sample size in two-sample or multi-sample tests
When conducting a two-sample or multi-sample test, be aware that your chosen test may
require a similar sample size unless it is robust to different sample sizes. For example, a
test like the t-test may not be appropriate anymore as an unequal sample size can affect the
Type 1 error. In this case, it is best to search for a robust alternative (e.g., Welch’s t-test).
5. Calculation of the test statistics and the p-value

Once the data is collected, the chosen test statistic and the corresponding p-value can be
calculated. Both values can be used to make your final decision on inference and are
retrieved from the probability distribution from the test statistic (also sampling
distribution).
How to calculate the test statistic?
You can calculate the test statistic traditionally using its formula (can be found online), or
through statistical software like SPSS or using R/python. For one of our examples from
before, assuming a sample size (n) of 20 and a sample standard deviation (s) of 1.5, our
test statistic is:

t = (x̄ − μ0) / (s / √n) = (22 − 20) / (1.5 / √20) ≈ 5.96 (one-sample t-test)
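The same calculation can be reproduced in Python. The sketch below recomputes the t-score from the summary statistics of the AOV example (sample mean $22, hypothesised mean $20, s = 1.5, n = 20) and derives a one-tailed p-value from the t-distribution; treat it as an illustrative sketch rather than a prescribed implementation.

# Recomputing the one-sample t-statistic for the AOV example from its summary statistics.
import math
from scipy import stats

sample_mean = 22.0   # observed mean AOV
mu0 = 20.0           # mean AOV under the null hypothesis
s = 1.5              # sample standard deviation
n = 20               # sample size

t_stat = (sample_mean - mu0) / (s / math.sqrt(n))
# One-tailed p-value: probability of a t at least this large under H0, with n - 1 degrees of freedom
p_value = stats.t.sf(t_stat, df=n - 1)

print(f"t = {t_stat:.2f}")          # roughly 5.96
print(f"p-value = {p_value:.6f}")   # far below common alpha levels such as 0.05
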
The p-value
The p-value (short for probability value) is the most critically viewed number in statistics.
It is defined as the probability of receiving a result at least as extreme as the one observed,
given that the null hypothesis is true. My favourite resource of describing the p-value in
simple words is by Cassie Kozyrkov:
The p-value tells you, given the evidence that you have (data), if the null hypothesis looks
ridiculous or not […] The lower the p-value, the more ridiculous the null hypothesis looks.
The p-value is a value between 0% and 100% and can be retrieved from the null hypothesis,
sampling distribution, and the data. Generally, it is calculated with the help of statistical
software or reading off a distribution table with set parameters (degrees of freedom, alpha
level etc.). Distribution tables with the most common parameters can be found online for
most test statistics, like t-score, chi-squared score, or Wilcoxon-rank-sum.
6. Decision
To decide on inference, either the test statistic is compared to a critical value (critical value
approach), or the p-value is compared to the alpha-level (p-value approach).
Critical value
The critical value splits the sampling distribution into a “rejection region” and “acceptance
region”. If the test statistic is greater than the critical value, then the null hypothesis is
rejected in favour of the alternative hypothesis with a confidence level of 1-alpha. If the
test statistic is smaller than the critical value, the null hypothesis is not rejected. Critical
values are found with the sampling distribution and the alpha-level. However, a more
common approach for making a test decision is the p-value approach.
P-value vs. alpha level
Given your alpha level, if the p <= alpha, the null hypothesis is rejected in favour of the
alternative hypothesis with confidence level 1-alpha. If the p-value is greater than the
alpha-level, the null hypothesis is accepted.
Summary
In hypothesis testing, two mutually exclusive statements about a population are tested
using a random data sample. It comprises many concepts and steps that greatly impact the
results, like formulating the hypotheses or selecting the test statistic, alpha-level, and
sample size.

Individual Activity:
▪ Construct Hypothesis for a business problem

SELF-CHECK QUIZ 1.4

Check your understanding by answering the following questions:

1. Describe hypothesis testing for a business problem.

LEARNING OUTCOME 1.5 -Use Methodologies in Executing Data Science Project
Cycle

Contents:

● Application of scientific method to data science


● CRISP-DM methodology
● Data pipelining
● Application of experimental approach for finding insights and solutions

Assessment criteria:

1. Application of scientific method to data science business problems are demonstrated.


2. Cross-industry standard process for data mining (CRISP-DM methodology) is described.
3. Data pipelining is explained.
4. Application of experimental approach for finding insights and solutions are explained.
5. Application of the scientific method and the CRISP-DM methodology are followed during setting
up new data science project.

Resources required:

Students/trainees must be provided with the following resources:

● Workplace (Computer with internet connection).

LEARNING ACTIVITY 1.5

Learning Activity Resources/Special Instructions/References


Use methodologies in executing data ▪ Information Sheets: 1.5
science project cycle ▪ Self-Check: 1.5
▪ Answer Key: 1.5

INFORMATION SHEET 1.5

Learning objective: to use methodologies in executing data science project cycle.

● Application of scientific method


The scientific method is a procedure that has characterised the natural sciences since at least the 17th century; it consists of a series of systematic steps which ultimately aim to either validate or reject a statement (a hypothesis).
The Phases of the Scientific Method
The steps go something like this:
● Observe: Make an observation
● Question: Ask questions about the observation, gather information
● Hypothesise: Form a hypothesis — a statement that attempts to explain the observation,
make some predictions based on this hypothesis
● Test: Test the hypothesis (and predictions) using a reproducible experiment
● Conclude: Analyse the results and draw conclusions, thereby accepting or rejecting the
hypothesis
● Redo: The experiment should be reproduced enough times to ensure no inconsistency
between observations/results and theory.

● CRISP-DM methodology
The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a process model with six
phases that naturally describes the data science life cycle. It’s like a set of guardrails to help you
plan, organise, and implement your data science (or machine learning) projects.

I. Business Understanding
The Business Understanding phase focuses on understanding the objectives and
requirements of the project. Aside from the third task, the three other tasks in this phase
are foundational project management activities that are universal to most projects:

1. Determine business objectives:


You should first “thoroughly understand, from a business perspective, what the
customer really wants to accomplish.” (CRISP-DM Guide) and then define
business success criteria.
2. Assess situation:
Determine resources availability, project requirements, assess risks and
contingencies, and conduct a cost-benefit analysis.
3. Determine data mining goals:
In addition to defining the business objectives, you should also define what success looks like from
a technical data mining perspective.
4. Produce project plan: Select technologies and tools and define detailed plans for each project phase.
While many teams hurry through this phase, establishing a strong business understanding is like
building the foundation of a house – absolutely essential.

II. Data Understanding


Next is the Data Understanding phase. Adding to the foundation of Business
Understanding, it drives the focus to identify, collect, and analyse the data sets that can
help you accomplish the project goals. This phase also has four tasks, illustrated with a short sketch after the list:

1. Collect initial data: Acquire the necessary data and (if necessary) load it into your
analysis tool.
2. Describe data: Examine the data and document its surface properties like data
format, number of records, or field identities.
3. Explore data: Dig deeper into the data. Query it, visualise it, and identify
relationships among the data.
4. Verify data quality: How clean/dirty is the data? Document any quality issues.
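A minimal pandas sketch of these four tasks might look like the following; the file name and column names are hypothetical placeholders, not part of CRISP-DM itself.

# Hedged sketch of the Data Understanding tasks with pandas; file and columns are hypothetical.
import pandas as pd

# 1. Collect initial data: load the data set into the analysis tool
df = pd.read_csv("orders.csv")                        # hypothetical source file

# 2. Describe data: surface properties such as size, columns and types
print(df.shape)
print(df.dtypes)

# 3. Explore data: summary statistics and simple relationships
print(df.describe())
print(df.groupby("region")["order_value"].mean())     # hypothetical columns

# 4. Verify data quality: missing values and duplicate records
print(df.isna().sum())
print("Duplicate rows:", df.duplicated().sum())
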

III. Data Preparation
This phase, which is often referred to as “data munging”, prepares the final data set(s) for
modelling. It has five tasks, illustrated with a short sketch after the list:
1. Select data: Determine which data sets will be used and document reasons for
inclusion/exclusion.
2. Clean data: Often this is the lengthiest task. Without it, you’ll likely fall victim to
garbage-in, garbage-out. A common practice during this task is to correct, impute,
or remove erroneous values.
3. Construct data: Derive new attributes that will be helpful. For example, derive
someone’s body mass index from height and weight fields.
4. Integrate data: Create new data sets by combining data from multiple sources.
5. Format data: Re-format data as necessary. For example, you might convert string
values that store numbers to numeric values so that you can perform mathematical
operations.
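Continuing the hypothetical pandas example, the sketch below touches each of the five tasks, including the body mass index construction mentioned above; the files and columns are placeholders.

# Hedged sketch of the Data Preparation tasks; files and columns are hypothetical.
import pandas as pd

people = pd.read_csv("people.csv")       # hypothetical: id, height_cm, weight_kg
salaries = pd.read_csv("salaries.csv")   # hypothetical second source: id, salary (stored as text)

# Select and clean data: keep only rows with usable measurements
people = people.dropna(subset=["height_cm", "weight_kg"])

# Construct data: derive body mass index from height and weight
people["bmi"] = people["weight_kg"] / (people["height_cm"] / 100) ** 2

# Integrate data: combine the two sources into one data set
merged = people.merge(salaries, on="id", how="left")

# Format data: convert a string field that stores numbers into a numeric type
merged["salary"] = pd.to_numeric(merged["salary"], errors="coerce")
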

IV. Modelling
Here you’ll likely build and assess various models based on several different modelling
techniques. This phase has four tasks:

1. Select modelling techniques: Determine which algorithms to try (e.g. regression, neural
net).
2. Generate test design: Pending your modelling approach, you might need to split the data
into training, test, and validation sets.
3. Build model: As glamorous as this might sound, this might just be executing a few lines of
code like “reg = LinearRegression().fit(X, y)”.
4. Assess model: Generally, multiple models are competing against each other, and the data
scientist needs to interpret the model results based on domain knowledge, the pre-defined
success criteria, and the test design.
Although the CRISP-DM guide suggests to “iterate model building and assessment until you
strongly believe that you have found the best model(s)”, in practice teams should continue iterating
until they find a “good enough” model, proceed through the CRISP-DM lifecycle, then further
improve the model in future iterations.
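Expanding the one-liner quoted above into a minimal, hypothetical sketch of the Modelling tasks (with synthetic data standing in for the prepared data set):

# Hedged sketch of the Modelling tasks with scikit-learn; the data here is synthetic.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for the data set produced by the Data Preparation phase
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

# Generate test design: split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build model: fit the selected modelling technique
reg = LinearRegression().fit(X_train, y_train)

# Assess model: score predictions on held-out data against the pre-defined success criteria
print("MAE on test set:", mean_absolute_error(y_test, reg.predict(X_test)))
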

V. Evaluation
Whereas the Assess Model task of the Modelling phase focuses on technical model
assessment, the Evaluation phase looks more broadly at which model best meets the
business and what to do next. This phase has three tasks:
● Evaluate results: Do the models meet the business success criteria? Which one(s)
should we approve for the business?
● Review process: Review the work accomplished. Was anything overlooked? Were
all steps properly executed? Summarise findings and correct anything if needed.
● Determine next steps: Based on the previous three tasks, determine whether to
proceed to deployment, iterate further, or initiate new projects.
VI. Deployment
A model is not particularly useful unless the customer can access its results. The
complexity of this phase varies widely. This final phase has four tasks:

● Plan deployment: Develop and document a plan for deploying the model.
● Plan monitoring and maintenance: Develop a thorough monitoring and
maintenance plan to avoid issues during the operational phase (or post-project
phase) of a model.
● Produce final report: The project team documents a summary of the project which
might include a final presentation of data mining results.
● Review project: Conduct a project retrospective about what went well, what could
have been better, and how to improve in the future.
Your organisation’s work might not end there. As a project framework, CRISP-DM does
not outline what to do after the project (also known as “operations”). But if the model is
going to production, be sure you maintain the model in production. Constant monitoring
and occasional model tuning is often required.

● Data pipeline
We know what pipelines are: large pipe systems that carry resources from one location to another over long distances. We usually hear about pipelines in the context of oil or natural gas. They’re
fast, efficient ways of moving large quantities of material from one point to another.
Data pipelines operate on the same principle; only they deal with information rather than liquids or
gases. Data pipelines are a sequence of data processing steps, many of them accomplished with
special software. The pipeline defines how, what, and where the data is collected. Data pipelining
automates data extraction, transformation, validation, and combination, then loads it for further
analysis and visualisation. The entire pipeline provides speed from one end to the other by
eliminating errors and neutralising bottlenecks or latency.

Incidentally, big data pipelines exist as well. Big data is characterised by the five V’s (variety,
volume, velocity, veracity, and value). Big data pipelines are scalable pipelines designed to handle one or more of big data’s “v” characteristics, even recognizing and processing the data in different formats, such as structured, unstructured, and semi-structured.

Data Pipeline Architecture


We define data pipeline architecture as the complete system designed to capture, organise, and
dispatch data used for accurate, actionable insights. The architecture exists to provide the best
laid-out design to manage all data events, making analysis, reporting, and usage easier.
Data analysts and engineers apply pipeline architecture to allow data to improve business
intelligence (BI) and analytics, and targeted functionality. Business intelligence and analytics use
data to acquire insight and efficiency in real-time information and trends.
Data-enabled functionality covers crucial subjects such as customer journeys, target customer
behaviour, robotic process automation, and user experiences.

We break down data pipeline architecture into a series of parts and processes, including:

● Sources
This part is where it all begins, where the information comes from. This stage
potentially involves different sources, such as application APIs, the cloud,
relational databases, NoSQL, and Apache Hadoop.

● Joins
Data from different sources are often combined as it travels through the pipeline. Joins list
the criteria and logic for how this data comes together.
● Extraction
Data analysts may want certain specific data found in larger fields, like an area code in a
telephone number contact field. Sometimes, a business needs multiple values assembled
or extracted.

● Standardisation
Say you have some data listed in miles and other data in kilometres. Standardisation
ensures all data follows the same measurement units and is presented in an acceptable
size, font, and colour.

● Correction
If you have data, then you will have errors. It could be something as simple as a zip code
that doesn’t exist or a confusing acronym. The correction phase also removes corrupt
records.

● Loads
Once the data is cleaned up, it's loaded into the proper analysis system, usually a data
warehouse, another relational database, or a Hadoop framework.
● Automation
Data pipelines employ the automation process either continuously or on a schedule. The
automation process handles error detection, status reports, and monitoring.
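
To make these stages concrete, the following is a minimal, hypothetical sketch of a pipeline in Python. The file names, column names and the miles-to-kilometres rule are illustrative assumptions only, not part of any particular platform; a production pipeline would normally rely on dedicated tooling for scheduling, monitoring and error reporting.

import csv

KM_PER_MILE = 1.60934  # assumed standardisation rule: convert miles to kilometres

def extract(path):
    # Source: read raw records from a CSV file (assumed input format).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(records):
    # Standardisation and correction: unify units and drop corrupt rows.
    clean = []
    for row in records:
        try:
            distance_km = float(row["distance_miles"]) * KM_PER_MILE
        except (KeyError, ValueError):
            continue  # correction: skip records with missing or invalid values
        clean.append({"city": row.get("city", "").strip().title(),
                      "distance_km": round(distance_km, 2)})
    return clean

def load(records, out_path):
    # Load: write the cleaned records to the analysis target (here, another CSV).
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["city", "distance_km"])
        writer.writeheader()
        writer.writerows(records)

if __name__ == "__main__":
    load(transform(extract("trips_raw.csv")), "trips_clean.csv")

Scheduling a script like this with a tool such as cron, or with a workflow orchestrator, would cover the automation stage described above.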

● Application of an experimental approach for finding insights and solutions – discussed below.

● Application of the scientific method and the CRISP-DM methodology when setting up a new data
science project – discussed below.

Individual Activity:
▪ Elaborate on the methodology for a data science project.

SELF-CHECK QUIZ 1.5

Check your understanding by answering the following questions:

1. What is CRISP-DM?

2. Write down the CRISP-DM methodology for data mining.

LEARNER JOB SHEET 1

Qualification:

Learning unit: INTERPRET DATA AND BUSINESS DOMAIN
Learner name:

Personal protective equipment (PPE): Mask

Materials: Computer and Internet Connection

Tools and equipment:

Performance criteria: 1. Data science is defined.


2. Scopes of data science are interpreted.
3. Benefits of using data science are articulated.
4. Roles of different occupations of data science are described.
5. Tools and technologies related to data science are described.
6. Data analytics processes are described.
7. Services of data analytics platforms are interpreted and their
applications are explained.
8. End-to-end perspective of data products is interpreted.
9. Ways of working in a cross-functional team are explained.
10. Agile principles are interpreted.
11. Ability to understand a business problem is demonstrated.
12. Business problems are converted into quantifiable form.
13. Cross-functional stakeholders are identified for a business problem.
14. Collaboration with stakeholders is arranged to identify quantifiable
improvements.
15. Key business indicators and target improvement metrics are
identified.
16. Research questions with associated hypotheses are constructed
from business problems.
17. Types of data needed to test the hypotheses are determined.
18. Hypotheses to be tested are aligned with business value.
19. Application of the scientific method to data science business problems is demonstrated.
20. Cross-industry standard process for data mining (CRISP-DM
methodology) is described.
21. Data pipelining is explained.
22. Application of an experimental approach for finding insights and solutions is explained.
23. Application of the scientific method and the CRISP-DM methodology
are followed during setting up new data science projects.

Measurement:

Notes:

Procedure: 1. Collect a computer.
2. Connect to the Wi-Fi network.
3. Connect the router.

Learner signature: Date:

Assessor signature: Date:

Quality Assurer
Date:
signature:

Assessor remarks:

Feedback:

ANSWER KEYS

ANSWER KEY 1.1

1 Data science is the field of study that combines domain expertise, programming skills, and knowledge of
mathematics and statistics to extract meaningful insights from data.

2. Data Science enables companies to efficiently understand gigantic data from multiple sources and
derive valuable insights to make smarter data-driven decisions. Data Science is widely used in various
industry domains, including marketing, healthcare, finance, banking, policy work, and more.

3. Data Science is helping the Telecom Industry to predict what customers might need in the future based
on their usage of different services. Recommendation Engines are the biggest example of targeted
marketing. The customers are always attracted to better and cheaper services.
4. A data scientist is someone who makes value out of data. Such a person proactively fetches information
from various sources and analyses it to better understand how the business performs, and builds AI tools
that automate certain processes within the company.

ANSWER KEY 1.2

1. End-to-end describes a process that takes a system or service from beginning to end and delivers a
complete functional solution, usually without needing to obtain anything from a third party.

2. The four core values of Agile software development as stated by the Agile Manifesto are:
● individuals and interactions over processes and tools;
● working software over comprehensive documentation;
● customer collaboration over contract negotiation; and
● responding to change over following a plan.

ANSWER KEY 1.3

1. Any team that deals with data science has to be cross-functional. In other words, it has to cover a
whole stack of the solutions it writes. In a normal infrastructure there should be a DevOps engineer, data
scientist, data engineer and a product developer writing the web app and/or mobile app.

ANSWER KEY 1.4

1. Hypothesis testing is an act in statistics whereby an analyst tests an assumption regarding a population
parameter. The methodology employed by the analyst depends on the nature of the data used and the
reason for the analysis. Hypothesis testing is used to assess the plausibility of a hypothesis by using
sample data.

ANSWER KEY 1.5

1. CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining, is an industry-proven
way to guide your data mining efforts. As a methodology, it includes descriptions of the typical phases of
a project, the tasks involved with each phase, and an explanation of the relationships between these
tasks.

Module 2: Demonstrate Statistical Concepts

MODULE CONTENT

Module Descriptor: This unit covers the knowledge, skills and attitudes required to
demonstrate knowledge on statistical concepts. It specifically includes
interpreting probability rules and probability distributions, demonstrating
understanding of descriptive statistics, interpreting sampling and sampling
distributions, interpreting inferential statistics and interpreting regression
models.
Nominal Duration: 60 hours

LEARNING OUTCOMES:

Upon completion of the module, the trainee should be able to:

2.1. Interpret probability rules and probability distributions


2.2. Demonstrate understanding on descriptive statistics
2.3. Interpret sampling and sampling distributions
2.4. Interpret inferential statistics
2.5. Interpret regression model

PERFORMANCE CRITERIA:

1. Fundamental rules of probability are explained.


2. Rules for Conditional probability and independence are described.
3. Bayes' rule is interpreted.
4. Common continuous and discrete probability distributions are described.
5. Z-score and standard normal distribution are interpreted.
6. Proportions are computed using z-table.
7. Probabilities are computed using normal distribution.
8. Types of data and data measurement scales are described.
9. Measures of central tendency are explained.
10. Measures of dispersion are explained.
11. Measures of association are explained.
12. Mean, median, mode, variance and standard deviation are calculated from sample problems.
13. Sampling methods are described
14. Biases in sampling are interpreted and corrective measures are explained.
15. Sampling distribution and its characteristics are explained.
16. Central limit theorem is interpreted.
17. Confidence interval is explained.
18. Hypothesis testing is interpreted.
19. Hypothesis test is performed using critical value and p value approach.
20. Type-I and Type-II errors are interpreted.
21. Inference for comparing means (ANOVA) is explained.
22. Inference for comparing means (ANOVA) is explained.
23. Simple linear regression and its underlying assumptions are explained.
24. Techniques for testing and validating assumptions of regression are demonstrated.
25. Impact of multicollinearity and heteroscedasticity are explained.
26. Simple and Multivariate linear regression models are used to predict numeric values.
27. Logistic regression is explained.
Learning Outcome 2.1- Interpret probability rules and probability
distributions

Contents:

▪ Fundamental rules of probability.


▪ Rules for Conditional probability.
▪ Bayes' rule.
▪ Common continuous and discrete probability distributions.
▪ Z-score and standard normal distribution.
▪ z-table.
▪ Normal distribution.

Assessment criteria:

1. Fundamental rules of probability are explained.


2. Rules for Conditional probability and independence are described.
3. Bayes rule is interpreted.
4. Common continuous and discrete probability distributions are described.
5. Z-score and standard normal distribution are interpreted.
6. Proportions are computed using z-table.
7. Probabilities are computed using normal distribution.

Resources required:

Students/trainees must be provided with the following resources:


▪ Workplace (Computer and internet connection).

LEARNING ACTIVITY 2.1

Learning Activity Resources/Special Instructions/References


Interpret probability rules and probability ▪ Information Sheet: 2.1
distributions ▪ Self-Check: 2.1
▪ Answer Key: 2.1

INFORMATION SHEET 2.1

Learning Objective: Interpret probability rules and probability distributions

● Fundamental rules of probability are explained :

Basic properties of probability

For an event A, its probability is defined as P(A). The probability of an event A is calculated by the
following formula:

P(A) = (Number of favourable outcomes) / (Total number of outcomes)

● If there is no ambiguity in the occurrence of an event, then the probability of such an event
is equal to 1. In other words, the probability of a certain event is 1.
● If an event has no chances of occurring, then its probability is 0.
● The probability of an event A is depicted by a number P(A) in such a way that 0 ≤ P(A) ≤ 1.
In short, a probability always lies between 0 and 1.

Example: Find the probability of getting an even number when a die is tossed.
Solution:
S = {1, 2, 3, 4, 5, 6}
Favourable events = { 2,4,6}
Number of favourable event = 3
Total number of outcomes = 6
Hence probability, P = 3/6 = ½

Sum of probabilities
The sum of probabilities of an event and its complementary event is 1.
P(A) + P(A’) = 1.
Example:
Marital status can be categorised into: never married, married, widowed or
divorced. According to Infoplease.com, the following are the probabilities of those marital
status categories for adults in the United States (data from 2000):

Solution:
Probability of Never married ,P(A) = 0.239
Probability of married ,P(B) = 0.595
Probability of widowed ,P(C) = 0.068
Probability of divorced ,P(D)= ?
According to the sum of probabilities, the probability of a randomly chosen U.S. adult being
divorced is:
P(D)= 1- (P(A)+P(B)+P(C)) = 0.098

Probability of a complement
If A is an event, then the probability of A is equal to 1 minus the probability of the complement of
A, A’.
P(not A) = 1 – P(A)
That is, the probability that an event does not occur is 1 minus the probability that it does occur.

Example: The probability that event A happens in one trial is 0.5. What is the probability that
event A happens at least once in three independent trials?
Solution:
Here P(A) = 0.5 and P(A′) = 0.5.
The probability that A does not happen at all = (0.5)³
Thus, the required probability = 1 − (0.5)³ = 0.875

Probabilities Involving Multiple Events


We will often be interested in finding probabilities involving multiple events such as,
P(A or B) = P(event A occurs or event B occurs)
P(A and B)= P(both event A occurs and event B occurs)
A common issue with terminology relates to how we usually think of “or” in our daily life. For
example, when a parent says to his or her child in a toy store “Do you want toy A or toy B?”, this
means that the child is going to get only one toy and he or she has to choose between them.
Getting both toys is usually not an option.

Addition Rule for Disjoint Events


If A and B are disjoint events, then P(A or B) = P(A) + P(B).
This can be represented in a Venn diagram as:
P(A∪B)=P(A)+P(B).

Example: What is the probability of a die showing 2 or 5?

Solution:
P(2) = 1/6 and P(5) = 1/6
P(2 or 5) = P(2) + P(5) = 1/6 + 1/6 = 2/6 = 1/3
The probability of a die showing 2 or 5 is 1/3.

General addition rule


If A and B are two events in a probability experiment, then the probability that either one of the
events will occur is:
P(A or B)=P(A)+P(B)−P(A and B)
This can be represented in a Venn diagram as:
P(A∪B)=P(A)+P(B)−P(A∩B)

Example: If you take out a single card from a regular pack of cards, what is the probability that the
card is either an ace or a spade?

Solution:
Let X be the event of picking an ace and Y be the event of picking a spade.
P(X) = 4/52
P(Y) = 13/52
The two events are not mutually exclusive, as there is one favourable outcome in which the card
can be both an ace and a spade.
P(X and Y) = 1/52
P(X or Y) = 4/52 + 13/52 − 1/52 = 16/52 = 4/13

● Rules for conditional probability and independence:

Multiplication rule for independent events


The multiplication rule for independent events relates the probabilities of two events to the
probability that they both occur. In order to use the rule, we need to have the probabilities of each
of the independent events. Given these events, the multiplication rule states the probability that
both events occur is found by multiplying the probabilities of each event.
Using a Venn diagram, we can visualise “A and B,” which is represented by the overlap between
events A and B:

Denote events A and B and the probabilities of each by P(A) and P(B). If A and B are independent
events, then:
P(A and B) = P(both event A occurs and event B occurs)
= P(A) x P(B)

Example: All human blood can be typed as O, A, B or AB. In addition, the frequency of the
occurrence of these blood types varies by ethnic and racial groups.
According to Stanford University’s Blood Centre (bloodcenter.stanford.edu), these are the
probabilities of human blood types in the United States:

Two people are selected simultaneously and at random from all people in the United States.
What is the probability that both have blood type O?
Let, O1= “person 1 has blood type O” and
O2= “person 2 has blood type O”
We need to find P(O1 and O2)
Since they were chosen simultaneously and at random, the blood type of one has no effect on the
blood type of the other. Therefore, O1 and O2 are independent then:
P(O1 and O2) = P(O1) * P(O2) = 0.44 * 0.44 = 0.1936.

Conditional Probability

The conditional probability of an event B is the probability that the event will occur given the
knowledge that an event A has already occurred. This probability is written P(B|A), notation for the
probability of B given A. In the case where events A and B are independent (where event A has
no effect on the probability of event B), the conditional probability of event B given event A is simply
the probability of event B, that is P(B).
If events A and B are not independent, then the probability of the intersection of A and B (the
probability that both events occur) is defined by
P(A and B) = P(A)P(B|A).
From this definition, the conditional probability P(B|A) is easily obtained by dividing by P(A):
P(B|A) = P(A∩B) / P(A), where P(A) ≠ 0
And that for A given B:
P(A|B) = P(A∩B) / P(B), where P(B) ≠ 0

Example: In a group of 100 sports car buyers, 40 bought alarm systems, 30 purchased bucket
seats, and 20 purchased an alarm system and bucket seats. If a car buyer chosen at random
bought an alarm system, what is the probability they also bought bucket seats?

Solution:
Step 1: Figure out P(A). It’s given in the question as 40%, or 0.4.
Step 2: Figure out P(A∩B). This is the intersection of A and B: both happening
together. It’s given in the question 20 out of 100 buyers, or 0.2.
Step 3: Insert your answers into the formula:
P(B|A) = P(A∩B) / P(A) = 0.2 / 0.4 = 0.5.
The probability that a buyer bought bucket seats, given that they purchased an alarm system, is
50%.

General Multiplication Rule


The multiplication rule is a way to find the probability of two events happening at the same
time. The general multiplication rule formula is: P(A ∩ B) = P(A) P(B|A).
Example:
A bag contains 6 black marbles and 4 blue marbles. Two marbles are drawn from the bag, without
replacement. What is the probability that both marbles are blue?

Step 1: Label your events A and B. Let A be the event that marble 1 is blue and let B be the event
that marble 2 is blue.
Step 2: Figure out the probability of A. There are ten marbles in the bag, so the probability of
drawing a blue marble is 4/10.
Step 3: Figure out the probability of B given A. There are nine marbles left in the bag, so the
probability of choosing a blue marble, P(B|A), is 3/9.
Step 4: Multiply Step 2 and Step 3 together: (4/10) × (3/9) = 2/15.
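
A quick way to sanity-check this kind of calculation is to simulate the experiment. The short Python sketch below (an illustration, not part of the worked example) repeatedly draws two marbles without replacement and estimates the probability that both are blue; the estimate should come out close to 2/15 ≈ 0.133.

import random

bag = ["black"] * 6 + ["blue"] * 4
trials = 100_000
both_blue = 0

for _ in range(trials):
    draw = random.sample(bag, 2)      # two draws without replacement
    if draw == ["blue", "blue"]:
        both_blue += 1

print(both_blue / trials)             # roughly 0.133, i.e. about 2/15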
● Bayes theorem :

In statistics and probability theory, the Bayes’ theorem (also known as the Bayes’ rule) is a
mathematical formula used to determine the conditional probability of events. Essentially, the
Bayes’ theorem describes the probability of an event based on prior knowledge of the conditions
that might be relevant to the event.
The theorem is named after English statistician Thomas Bayes, who discovered the formula in
1763. It is considered the foundation of the special statistical inference approach called the Bayes’
inference.
The Bayes' theorem is expressed in the following formula:

P(A|B) = [P(B|A) × P(A)] / P(B)

Where:
● P(A|B) – the probability of event A occurring, given event B has occurred
● P(B|A) – the probability of event B occurring, given event A has occurred
● P(A) – the probability of event A
● P(B) – the probability of event B

Example of Bayes theorem:


Finding out a patient’s probability of having liver disease if they are an alcoholic. “Being an
alcoholic” is the test (kind of like a litmus test) for liver disease.
A could mean the event “Patient has liver disease.” Past data tells you that 10% of patients
entering your clinic have liver disease. P(A) = 0.10.
B could mean the litmus test that “Patient is an alcoholic.” Five percent of the clinic’s
patients are alcoholics. P(B) = 0.05.
You might also know that among those patients diagnosed with liver disease, 7% are alcoholics.
This is your B|A: the probability that a patient is alcoholic, given that they have liver disease, is
7%.
Bayes’ theorem tells you:
P(A|B) = (0.07 * 0.1)/0.05 = 0.14
In other words, if the patient is an alcoholic, their chances of having liver disease is 0.14 (14%).
This is a large increase from the 10% suggested by past data. But it’s still unlikely that any
particular patient has liver disease.
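
The calculation above is simple enough to express as a small helper function; the numbers plugged in below are just the figures from the example, and the function itself is only an illustrative sketch of Bayes' theorem.

def bayes(p_b_given_a, p_a, p_b):
    # Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
    return p_b_given_a * p_a / p_b

# P(liver disease | alcoholic) using the example's figures
print(bayes(p_b_given_a=0.07, p_a=0.10, p_b=0.05))  # approximately 0.14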

● Continuous probability distribution and discrete probability distribution:

Continuous probability distribution:

These distributions model the probabilities of random variables that can take any value in a
continuous range. For example, a random variable X that represents the weights of citizens in a
town can take any value, such as 34.5 or 47.7.
For continuous distributions, the function is called a probability density function (PDF).
Examples: Normal, Student's t, Chi-square, Exponential, etc.
Normal Distribution
This is the most commonly discussed distribution and most often found in the real world.
Many continuous distributions often reach normal distribution given a large enough
sample. This has two parameters namely mean and standard deviation.
This distribution has many interesting properties. The mean has the highest probability
and all other values are distributed equally on either side of the mean in a symmetric
fashion. The standard normal distribution is a special case where the mean is 0 and the
standard deviation of 1.

It also follows the empirical formula that 68% of the values are 1 standard deviation away,
95% percent of them are 2 standard deviations away, and 99.7% are 3 standard deviations
away from the mean. This property is greatly useful when designing hypothesis tests
(https://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/).
The PDF is given by,

f(x) = (1 / (σ√(2π))) · exp(−(1/2) · ((x − μ)/σ)²)

where μ is the mean of the random variable X and σ is the standard deviation.

Student’s T Distribution
The student’s t distribution is similar to the normal distribution. The difference is that the
tails of the distribution are thicker. This is used when the sample size is small and the
population variance is not known. This distribution is defined by the degrees of freedom(p)
which is calculated as the sample size minus 1(n – 1).
As the sample size increases, degrees of freedom increase as the t-distribution
approaches the normal distribution and the tails become narrower and the curve gets
closer to the mean. This distribution is used to test estimates of the population mean when
the sample size is less than 30 and population variance is unknown. The sample
variance/standard deviation is used to calculate the t-value.

The PDF is given by,

f(t) = [Γ((p + 1)/2) / (√(pπ) · Γ(p/2))] · (1 + t²/p)^(−(p + 1)/2)

where p is the degrees of freedom and Γ is the gamma function.
The t-statistic used in hypothesis testing is calculated as follows,

t = (x̄ − μ) / (s / √n)

where x̄ is the sample mean, μ is the population mean and s is the sample standard deviation.

Chi-square Distribution
This distribution is equal to the sum of squares of p normal random variables. p is the
number of degrees of freedom. Like the t-distribution, as the degrees of freedom increase,
the distribution gradually approaches the normal distribution. Below is a chi-square
distribution with three degrees of freedom.

The PDF is given by,

f(x) = [x^(p/2 − 1) · e^(−x/2)] / [2^(p/2) · Γ(p/2)]

where p is the degrees of freedom and Γ is the gamma function.


The chi-square value is calculated as follows:

χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ

where Oᵢ is the observed value and Eᵢ represents the expected value. This is used in
hypothesis testing to draw inferences about the population variance of normal
distributions.

Discrete probability distributions

These distributions model the probabilities of random variables that can have discrete values as
outcomes. For example, the possible values for the random variable X that represents the number
of heads that can occur when a coin is tossed twice are the set {0, 1, 2} and not any value from 0
to 2 like 0.1 or 1.6. The function is called a Probability Mass function (PMF) for discrete
distributions.

Examples: Bernoulli, Binomial, Negative Binomial, Hypergeometric, etc.

There are many discrete probability distributions to be used in different scenarios. Here, some
discrete distributions are described:
Bernoulli Distribution
This distribution is generated when we perform an experiment once and it has only two
possible outcomes – success and failure. The trials of this type are called Bernoulli trials,
which form the basis for many distributions discussed below. Let p be the probability of
success and 1 – p is the probability of failure.
The PMF is given as,

P(X = x) = p^x · (1 − p)^(1 − x), for x ∈ {0, 1}

One example of this would be flipping a coin once: p is the probability of getting a head
and 1 − p is the probability of getting a tail. Note that success and failure are
subjective and are defined by us depending on the context.

Binomial Distribution

This is generated for random variables with only two possible outcomes. Let p denote the
probability of an event is a success which implies 1 – p is the probability of the event being
a failure. Performing the experiment repeatedly and plotting the probability each time gives
us the Binomial distribution.
The most common example given for Binomial distribution is that of flipping a coin n
number of times and calculating the probabilities of getting a particular number of heads.
More real-world examples include the number of successful sales calls for a company or
whether a drug works for a disease or not.
The PMF is given as,

P(X = x) = C(n, x) · p^x · (1 − p)^(n − x)
where p is the probability of success, n is the number of trials and x is the number of times
we obtain a success.

Multinomial Distribution
In the above distributions, there are only two possible outcomes – success and failure.
The multinomial distribution, however, describes the random variables with many possible
outcomes. This is also sometimes referred to as categorical distribution as each possible
outcome is treated as a separate category. Consider the scenario of playing a game n
number of times. Multinomial distribution helps us to determine the combined probability
that player 1 will win x1 times, player 2 will win x2 times and player k wins xk times.
The PMF is given as,

P(X₁ = x₁, X₂ = x₂, ..., X_k = x_k) = [n! / (x₁! x₂! ... x_k!)] · p₁^x₁ · p₂^x₂ · ... · p_k^x_k

where n is the number of trials and p₁, ..., p_k denote the probabilities of the outcomes
x₁, ..., x_k respectively.

Poisson Distribution
This distribution describes the events that occur in a fixed interval of time or space. An
example might make this clear. Consider the case of the number of calls received by a
customer care centre per hour. We can estimate the average number of calls per hour but
we cannot determine the exact number and the exact time at which there is a call. Each
occurrence of an event is independent of the other occurrences.
The PMF is given as,

P(X = x) = (e^(−λ) · λ^x) / x!
where λ is the average number of times the event has occurred in a certain period of time,
x is the desired outcome and e is the Euler’s number.
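
Assuming the scipy library is available, the PMFs above can be evaluated directly instead of being coded by hand; the parameter values below are arbitrary illustrations.

from scipy.stats import binom, poisson

# Binomial: probability of exactly 3 heads in 10 fair coin flips
print(binom.pmf(3, n=10, p=0.5))   # about 0.117

# Poisson: probability of exactly 2 calls in an hour when the average is 5
print(poisson.pmf(2, mu=5))        # about 0.084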

● Z-score and standard normal distribution:

Z Score

Simply put, a z-score (also called a standard score) gives you an idea of how far from the mean a
data point is. But more technically it’s a measure of how many standard deviations below or above
the population mean a raw score is.
A z-score can be placed on a normal distribution curve. Z-scores range from -3 standard deviations
(which would fall to the far left of the normal distribution curve) up to +3 standard deviations (which
would fall to the far right of the normal distribution curve). In order to use a z-score, you need to
know the mean μ and also the population standard deviation σ.

Z Score Formulas
The Z Score Formula: One Sample

The basic z score formula for a sample is:
z = (x – μ) / σ
For example, let’s say you have a test score of 190. The test has a mean (μ) of 150 and a standard
deviation (σ) of 25. Assuming a normal distribution, your z score would be:
z = (x – μ) / σ
= (190 – 150) / 25 = 1.6.

The z score tells you how many standard deviations from the mean your score is. In this example,
your score is 1.6 standard deviations above the mean.

zᵢ = (xᵢ − x̄) / s

You may also see the z score formula written as shown above. This is exactly the same formula as
z = (x − μ) / σ, except that x̄ (the sample mean) is used instead of μ (the population mean) and s (the
sample standard deviation) is used instead of σ (the population standard deviation). However, the
steps for solving it are exactly the same.

Z Score Formula: Standard Error of the Mean


When you have multiple samples and want to describe the standard deviation of those
sample means (the standard error), you would use this z score formula:
z = (x – μ) / (σ / √n)
This z-score will tell you how many standard errors there are between the sample mean
and the population mean.
Example: In general, the mean height of women is 65″ with a standard deviation of 3.5″.
What is the probability of finding a random sample of 50 women with a mean height of
70″, assuming the heights are normally distributed?
z = (x – μ) / (σ / √n)
= (70 – 65) / (3.5/√50) = 5 / 0.495 = 10.1
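
Both formulas are straightforward to encode; the small Python sketch below simply reproduces the two worked examples.

from math import sqrt

def z_score(x, mu, sigma):
    # z-score of a single observation
    return (x - mu) / sigma

def z_score_of_mean(x_bar, mu, sigma, n):
    # z-score of a sample mean, using the standard error sigma / sqrt(n)
    return (x_bar - mu) / (sigma / sqrt(n))

print(z_score(190, 150, 25))              # 1.6
print(z_score_of_mean(70, 65, 3.5, 50))   # about 10.1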

Standard Normal Distribution


The standard normal distribution, also called the z-distribution, is a special normal distribution
where the mean is 0 and the standard deviation is 1. Any normal distribution can be standardised
by converting its values into z-scores. Z-scores tell you how many standard deviations from the
mean each value lies.

Converting a normal distribution into a z-distribution allows you to calculate the probability of
certain values occurring and to compare different data sets.

● Proportions are computed using Z-table:

A z-table, also called the standard normal table, is a mathematical table that allows us to know the
percentage of values below (to the left) a z-score in a standard normal distribution (SND).

Fig: A standard normal distribution (SND).


A z-score, also known as a standard score, indicates the number of standard deviations a raw
score lays above or below the mean. When the mean of the z-score is calculated it is always 0,
and the standard deviation (variance) is always in increments of 1.
A z-score table shows the percentage of values (usually a decimal figure) to the left of a given z-
score on a standard normal distribution.
For example, imagine our Z-score value is 1.09. First, look at the left side column of the z-table to
find the value corresponding to one decimal place of the z-score (e.g. whole number and the first
digit after the decimal point).
In this case it is 1.0. Then, we look up a remaining number across the table (on the top) which is
0.09 in our example.

Figure 2. Using a z-score table to calculate the proportion (%) of the SND to the left of the z-score.
The corresponding area is 0.8621 which translates into 86.21% of the standard normal distribution
being below (or to the left) of the z-score.

Figure: The proportion (%) of the SND to the left of the z-score.

Properties of Z table:
The standard normal distribution table provides the probability that a normally distributed random
variable Z, with mean equal to 0 and variance equal to 1, is less than or equal to z. It does this for
positive values of z only (i.e., z-values on the right-hand side of the mean). What this means in
practice is that if someone asks you to find the probability of a value being less than a specific,
positive z-value, you can simply look that value up in the table. We call this area Φ. Thus, for this
table, P(Z < a) = Φ(a), where a is positive. Diagrammatically, the probability of Z less than 'a' being
Φ(a), as determined from the standard normal distribution table, is shown below:
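
Instead of reading the table, the same proportion can be obtained from the standard normal cumulative distribution function Φ. A quick check with scipy (assuming it is installed) reproduces the 0.8621 figure used above.

from scipy.stats import norm

print(norm.cdf(1.09))       # about 0.8621, the area to the left of z = 1.09
print(1 - norm.cdf(1.09))   # about 0.1379, the area to the right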

● Calculate the probability of Normal distribution:

Normal Distributions Calculations


This section will show you how to calculate the probability (area under the curve) of a standard
normal distribution. It will first show you how to interpret a Standard Normal Distribution Table. It
will then show you how to calculate the:
● probability less than a z-value
● probability greater than a z-value
● probability between z-values
● probability outside two z-values.

How to Use the Standard Normal Distribution Table


The most common form of standard normal distribution table that you see is a table similar
to the one below :

Figure: The Standard Normal Distribution Table

Probability less than a z-value


P(Z < –a)

As explained above, the standard normal distribution table only provides the probability
for values less than a positive z-value (i.e., z-values on the right-hand side of the mean).
So how do we calculate the probability below a negative z-value (as illustrated below)?

We start by remembering that the standard normal distribution has a total area (probability)
equal to 1 and it is also symmetrical about the mean. Thus, we can do the following to
calculate negative z-values: we need to appreciate that the area under the curve covered
by P(Z > a) is the same as the probability less than –a {P(Z < –a)} as illustrated below:
Making this connection is very important because from the standard normal distribution
table, we can calculate the probability less than 'a', as 'a' is now a positive value. Imposing
P(Z < a) on the above graph is illustrated below:

From the above illustration, and from our knowledge that the area under the standard
normal distribution is equal to 1, we can conclude that the two areas add up to 1. We can,
therefore, make the following statements:
Φ(a) + Φ(–a) = 1
∴ Φ(–a) = 1 – Φ(a)
Thus, we know that to find a value less than a negative z-value we use the following
equation:
Φ(–a) = 1 – Φ(a), e.g. Φ(–1.43) = 1 – Φ(1.43)

Probability greater than a z-value


P(Z > a)
The probability of P(Z > a) is: 1 – Φ(a). To understand the reasoning behind this look at
the illustration below:

You know Φ(a) and you know that the total area under the standard normal curve is 1 so
by mathematical deduction: P(Z > a) is: 1 - Φ(a).

P(Z > –a)


The probability of P(Z > –a) is P(Z < a), which is Φ(a). To understand this we need to
appreciate the symmetry of the standard normal distribution curve. We are trying to find
the area below, but by reflecting the area around the centre line (mean) we get the
following:

Notice that this is the same size area as the area we are looking for, only we already know
this area, as we can get it straight from the standard normal distribution table: it is P(Z <
a). Therefore, the P(Z > –a) is P(Z < a), which is Φ(a).

Probability between z-values


You are wanting to solve the following:

The key requirement to solve the probability between z-values is to understand that the
probability between z-values is the difference between the probability of the greatest z-
value and the lowest z-value:
P(a < Z < b) = P(Z < b) – P(Z < a)
which is illustrated below:

P(a < Z < b)


The probability of P(a < Z < b) is calculated as follows.
First separate the terms as the difference between z-scores:
P(a < Z < b) = P(Z < b) – P( Z < a) (explained in the section above)

Then express these as their respective probabilities under the standard normal distribution
curve:

P(Z < b) – P(Z < a) = Φ(b) – Φ(a).

Therefore, P(a < Z < b) = Φ(b) – Φ(a), where a and b are positive.

P(–a < Z < b)


The probability of P(–a < Z < b) is illustrated below:

First separate the terms as the difference between z-scores:

P(–a < Z < b) = P(Z < b) – P(Z < –a)

Then express these as their respective probabilities under the standard normal distribution
curve:

P(Z < b) – P(Z < –a) = Φ(b) – Φ(–a)


= Φ(b) – {1 – Φ(a)}, using Φ(–a) = 1 – Φ(a) as explained above.
∴ P(–a < Z < b) = Φ(b) – {1 – Φ(a)}, where a and b are positive.

P(–a < Z < –b)


The probability of P(–a < Z < –b) is illustrated below:

First separate the terms as the difference between z-scores:


P(–a < Z < –b) = P(Z < –b) – P( Z < –a)
Then express these as their respective probabilities under the standard normal distribution
curve:
P(Z < –b) – P(Z < –a) = Φ(–b) – Φ(–a)
= {1 – Φ(b)} – {1 – Φ(a)}, using Φ(–a) = 1 – Φ(a)
= 1 – Φ(b) – 1 + Φ(a)
= Φ(a) – Φ(b)
The above calculations can also be seen clearly in the diagram below:

Notice that the reflection results in a and b "swapping positions".


Probability outside of a range of z-values
An illustration of this type of problem is found below:

To solve these types of problems, you simply need to work out each separate area under
the standard normal distribution curve and then add the probabilities together. This will
give you the total probability.
When the lower z-value is negative (–a, with a positive) and the upper z-value b is positive (as above),
the total probability is:

P(Z < –a) + P(Z > b) = Φ(–a) + {1 – Φ(b)}, using P(Z > b) = 1 – Φ(b) as explained above
= {1 – Φ(a)} + {1 – Φ(b)}, using P(Z < –a) = 1 – Φ(a) as explained above
= 1 – Φ(a) + 1 – Φ(b)
= 2 – Φ(a) – Φ(b)

When both z-values are negative (–a and –b), as illustrated below:

The total probability is:

P(Z < –a) + P(Z > –b) = Φ(–a) + Φ(b), using P(Z > –b) = Φ(b) as explained above
= {1 – Φ(a)} + Φ(b), using P(Z < –a) = 1 – Φ(a) as explained above
= 1 + Φ(b) – Φ(a)

When a and b are positive as illustrated below:

The total probability is:


P(Z < a) + P(Z > b) = Φ(a) + {1 – Φ(b)}, using P(Z > b) = 1 – Φ(b) as explained above
= 1 + Φ(a) – Φ(b)
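
These identities can be checked numerically with the standard normal CDF; the z-values 0.75 and 1.43 below are arbitrary illustrations, and scipy is assumed to be available.

from scipy.stats import norm

a, b = 0.75, 1.43

# P(a < Z < b) = Φ(b) − Φ(a)
print(norm.cdf(b) - norm.cdf(a))

# P(−a < Z < b) = Φ(b) − {1 − Φ(a)}
print(norm.cdf(b) - (1 - norm.cdf(a)))

# P(Z < −a) + P(Z > b) = 2 − Φ(a) − Φ(b); both lines below print the same value
print(norm.cdf(-a) + (1 - norm.cdf(b)))
print(2 - norm.cdf(a) - norm.cdf(b))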

Individual Activity:
● Discuss Z-score and standard normal distribution.

SELF-CHECK QUIZ 2.1

Check your understanding by answering the following questions:

Write the correct answer for the following questions.

1. What are the fundamental rules of probability?

2. Which rules are most important to conditional probability and independence?

3. What is conditional probability?

4. What is Z-Score?

LEARNING OUTCOME 2.2- Demonstrate understanding on
descriptive statistics

Contents:

▪ Types of data and data measurement scales.


▪ Central tendency.
▪ Measures of dispersion.
▪ Measures of association.
▪ Mean, median, mode, variance and standard deviation.

Assessment criteria:

1. Types of data and data measurement scales are described.


2. Measures of central tendency are explained.
3. Measures of dispersion are explained.
4. Measures of association are explained.
5. Mean, median, mode, variance and standard deviation are calculated from sample problems.

Resources required:

Students/trainees must be provided with the following resources:

▪ Workplace (Computer and Internet connection)

LEARNING ACTIVITY 2.2

Learning Activity Resources/Special Instructions/References


Demonstrate understanding on descriptive ▪ Information Sheets: 2.2
statistics ▪ Self-Check: 2.2
▪ Answer Key: 2.2

INFORMATION SHEET 2.2

Learning objective: Demonstrate understanding on descriptive statistics

● Types of data and data measurement scales are described:
Types of data:
Qualitative data:
Qualitative data is information that cannot be measured in the form of numbers. It is also known
as categorical data. It normally comprises words and narratives, and we label it with names. It
conveys information about the qualities of things in the data. The outcome of qualitative data
analysis can take the form of highlighted key words, extracted data, and elaborated ideas. For example:
Hair colour – black, brown, red
Opinion – agree, disagree, neutral

Types of qualitative data:

Nominal data:
Nominal means “relating to names”. The nominal data are symbols or names of things.
Each value represents some kind of category, code, or states on. The values do not have
any meaningful order.
Example: Suppose that hair colour and marital status are two attributes describing person
objects. In our application, possible values for hair colour are black, brown, blond, red,
auburn, grey, and white. The attribute marital status can take on the values single,
married, divorced, and widowed. Both hair colour and marital status are nominal . Another
example of nominal data is occupation, with the values teacher, dentist, programmer,
farmer, and so on.

Ordinal Data
An ordinal data is the data with possible values that have a meaningful order or ranking
among them, but the magnitude between successive values is not known.

Example: Suppose that drink size corresponds to the size of drinks available at a fast-
food restaurant. It has three possible values: small, medium, and large. The values have
a meaningful sequence (which corresponds to increasing drink size); however, we cannot
tell from the values how much bigger, say, a large is than a medium. Other examples
of ordinal data include grade (e.g., A+, A, A−, B+, and so on) and professional rank.
Professional ranks can be enumerated in a sequential order: for example, assistant,
associate, and full for professors, and private, private first class, specialist, corporal,
and sergeant for army ranks.

Binary Data
A binary data is a nominal attribute with only two categories or states: 0 or 1, where 0
typically means that the attribute is absent, and 1 means that it is present. Binary data are
referred to as Boolean if the two states correspond to true and false.

Example: Given the attribute smoker describing a patient object, 1 indicates that the
patient smokes, while 0 indicates that the patient does not. Similarly, suppose the patient
undergoes a medical test that has two possible outcomes. The attribute medical test is
binary, where a value of 1 means the result of the test for the patient is positive, while 0
means the result is negative.

Numeric Data:
Numerical data, also known as quantitative, is a data type expressed in numbers rather
than natural language. Numerical data differentiates itself from other number form data
types with its ability to carry out arithmetic operations with these numbers. Types of
numeric data:

Discrete data

Discrete data is a count that involves integers — only a limited number of values is
possible. This type of data cannot be subdivided into different parts. Discrete data includes
discrete variables that are finite, numeric, countable, and non-negative integers. In many
cases, discrete data can be prefixed with “the number of”. For example:
The number of students who have attended the class;
● The number of customers who have bought different products;
● The number of groceries people are purchasing every day;

Continuous data
Continuous data is considered the complete opposite of discrete data. It’s the type of
numerical data that refers to the unspecified number of possible measurements between
two presumed points. The numbers of continuous data are not always clean and integers,
as they are usually collected from very precise measurements. Measuring a particular
subject is allowing for creating a defined range to collect more data. Variables in
continuous data sets often carry decimal points, with the number stretching out as far as
possible. Typically, it changes over time. It can have completely different values at
different time intervals, which might not always be whole numbers. Here are some
examples:
● The weather temperature;
● The wind speed;
● The weight of the kids;

Data measurement scales:


By understanding the scale of the measurement of their data, data scientists can determine the
kind of statistical test to perform.

Nominal scale of measurement


The nominal scale of measurement defines the identity property of data. This scale has
certain characteristics, but doesn’t have any form of numerical meaning. The data can be
placed into categories but can’t be multiplied, divided, added or subtracted from one
another. It’s also not possible to measure the difference between data points.
Examples of nominal data include eye colour and country of birth. Nominal data can be
broken down again into three categories:
Nominal with order: Some nominal data can be sub-categorised in order, such as “cold,
warm, hot and very hot.”
Nominal without order: Nominal data can also be sub-categorised as nominal without
order, such as male and female.
Dichotomous: Dichotomous data is defined by having only two categories or levels, such
as “yes’ and ‘no’.

Ordinal scale of measurement


The ordinal scale defines data that is placed in a specific order. While each value is
ranked, there’s no information that specifies what differentiates the categories from each
other. These values can’t be added to or subtracted from.
An example of this kind of data would include satisfaction data points in a survey, where
‘one = happy, two = neutral, and three = unhappy.’ Where someone finished in a race also
describes ordinal data. While first place, second place or third place shows what order the
runners finished in, it doesn’t specify how far the first-place finisher was in front of the
second-place finisher.

Interval scale of measurement


The interval scale contains properties of nominal and ordered data, but the difference
between data points can be quantified. This type of data shows both the order of the
variables and the exact differences between the variables. They can be added to or
subtracted from each other, but not multiplied or divided. For example, 40 degrees is not
20 degrees multiplied by two.
This scale is also characterised by the fact that zero is simply another value on the scale rather
than an absence of the quantity being measured; the interval scale has no 'true zero'. For example,
if you measure temperature in degrees, zero is itself a temperature.
Data points on the interval scale have the same difference between them. The difference
on the scale between 10 and 20 degrees is the same between 20 and 30 degrees. This
scale is used to quantify the difference between variables, whereas the other two scales

are used to describe qualitative values only. Other examples of interval scales include the
year a car was made or the months of the year.

Ratio scale of measurement


Ratio scales of measurement include properties from all four scales of measurement. The
data is nominal and defined by an identity, can be classified in order, contains intervals
and can be broken down into exact values. Weight, height and distance are all examples
of ratio variables. Data in the ratio scale can be added, subtracted, divided and multiplied.
Ratio scales also differ from interval scales in that the scale has a ‘true zero’. The number
zero means that the data has no value point. An example of this is height or weight, as
someone cannot be zero centimetres tall or weigh zero kilos – or be negative centimetres
or negative kilos. Examples of the use of this scale are calculating shares or sales. Of all
types of data on the scales of measurement, data scientists can do the most with ratio
data points.

● Central Tendency:
A measure of central tendency is a single value that attempts to describe a set of data by
identifying the central position within that set of data. As such, measures of central tendency
are sometimes called measures of central location. They are also classed as summary
statistics. The mean (often called the average) is most likely the measure of central tendency
that you are most familiar with, but there are others, such as the median and the mode.
The mean, median and mode are all valid measures of central tendency, but under different
conditions, some measures of central tendency become more appropriate to use than others.
In the following sections, we will look at the mean, mode and median, and learn how to
calculate them and under what conditions they are most appropriate to be used.

Mean (Arithmetic)
The mean (or average) is the most popular and well known measure of central tendency. It can
be used with both discrete and continuous data, although its use is most often with continuous
data. The mean is equal to the sum of all the values in the data set divided by the number of values
in the data set. So, if we have n values in a data set and they have values 𝑥1 , 𝑥2 , . . . 𝑥𝑛 the sample
mean, usually denoted by x̄ (pronounced "x bar"), is:

x̄ = (x₁ + x₂ + ... + xₙ) / n

This formula is usually written in a slightly different manner using the Greek capital letter Σ,
pronounced "sigma", which means "sum of...":

x̄ = Σx / n
The above formula refers to the sample mean. So, why have we called it a sample mean? This is
because, in statistics, samples and populations have very different meanings and these
differences are very important, even if, in the case of the mean, they are calculated in the same
way. To acknowledge that we are calculating the population mean and not the sample mean, we
use the Greek lower case letter "mu", denoted as μ:

μ = Σx / n
The mean is essentially a model of a data set: it is the single value that best represents the data.
Note, however, that the mean is often not one of the actual values observed in the data set. One of
its important properties is that it minimises error in the prediction of any one value in the
data set; that is, it is the value that produces the lowest amount of error from all other values in
the data set.
An important property of the mean is that it includes every value in your data set as part of the
calculation. In addition, the mean is the only measure of central tendency where the sum of the
deviations of each value from the mean is always zero.

Median
The median is the middle score for a set of data that has been arranged in order of magnitude.
The median is less affected by outliers and skewed data. In order to calculate the median, suppose
we have the data below:

65 55 89 56 35 14 56 55 87 45 92
We first need to rearrange that data into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89 92
Our median mark is the middle mark - in this case, 56. It is the middle
mark because there are 5 scores before it and 5 scores after it. This works fine when you have
an odd number of scores, but what happens when you have an even number of scores? What
if you had only 10 scores? Well, you simply have to take the middle two scores and average
the result. So, if we look at the example below:
65 55 89 56 35 14 56 55 87 45
We again rearrange that data into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89
Only now we have to take the 5th and 6th score in our data set and average them to get a
median of 55.5.
Mode
The mode is the most frequent score in our data set. On a histogram it represents the highest
bar in a bar chart or histogram. You can, therefore, sometimes consider the mode as being
the most popular option. An example of a mode is presented below:


Normally, the mode is used for categorical data where we wish to know which is the most
common category, as illustrated below:


We can see above that the most common form of transport, in this particular data set, is the
bus. However, one of the problems with the mode is that it is not unique, so it leaves us with
problems when we have two or more values that share the highest frequency, such as below:


We are now stuck as to which mode best describes the central tendency of the data. This is
particularly problematic when we have continuous data because we are more likely not to have
any one value that is more frequent than the other. For example, consider measuring 30
peoples' weight (to the nearest 0.1 kg). How likely is it that we will find two or more people with
exactly the same weight (e.g., 67.4 kg)? The answer is probably very unlikely - many people
might be close, but with such a small sample (30 people) and a large range of possible
weights, you are unlikely to find two people with exactly the same weight; that is, to the nearest
0.1 kg. This is why the mode is very rarely used with continuous data.
Another problem with the mode is that it will not provide us with a very good measure of central
tendency when the most common mark is far away from the rest of the data in the data set,
as depicted in the diagram below:

In the above diagram the mode has a value of 2. We can clearly see, however, that the mode
is not representative of the data, which is mostly concentrated around the 20 to 30 value range.
To use the mode to describe the central tendency of this data set would be misleading.

Skewness
If the values of a specific independent variable (feature) are skewed, then, depending on the model,
the skewness may violate model assumptions or may reduce the interpretability of feature
importance. In statistics, skewness is the degree of asymmetry observed in a probability
distribution, i.e. how far it deviates from the symmetrical normal distribution (bell curve) for a
given set of data.
The normal distribution is the reference point for skewness. In a normal distribution, data is
symmetrically distributed, and a symmetrical distribution has zero skewness because all
measures of central tendency lie in the middle.

When data is symmetrically distributed, the left-hand side, and right-hand side, contain the
same number of observations. (If the dataset has 90 values, then the left-hand side has 45
observations, and the right-hand side has 45 observations.) But what if the data is not
symmetrically distributed? Such data is called asymmetrical data, and that is when skewness
comes into the picture.

Kurtosis
Kurtosis refers to the degree to which outliers are present in a distribution. It is a statistical
measure of whether the data is heavy-tailed or light-tailed relative to a normal distribution.


In finance, kurtosis is used as a measure of financial risk. A large kurtosis is associated with a high
level of risk for an investment because it indicates that there are high probabilities of extremely
large and extremely small returns. On the other hand, a small kurtosis signals a moderate level of
risk because the probabilities of extreme returns are relatively low.
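
Assuming scipy is installed, skewness and kurtosis can be computed directly from a sample; note that scipy's kurtosis function returns excess kurtosis (0 for a normal distribution) by default. The list of returns below is an arbitrary illustration.

from scipy.stats import skew, kurtosis

returns = [0.02, 0.01, -0.01, 0.03, 0.00, -0.02, 0.15, 0.01, -0.01, 0.02]

print(skew(returns))       # positive here: the long tail is on the right
print(kurtosis(returns))   # excess kurtosis; large values indicate heavy tails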

Measures of Dispersion:
In statistics, the measures of dispersion help to interpret the variability of data i.e. to know
how homogeneous or heterogeneous the data is. In simple terms, it shows how squeezed
or scattered the variable is.

Absolute Measure of Dispersion
An absolute measure of dispersion contains the same unit as the original data set. Absolute
dispersion method expresses the variations in terms of the average of deviations of
observations like standard or mean deviations. It includes range, standard deviation,
quartile deviation, etc.
The types of absolute measures of dispersion are:

Range: It is simply the difference between the maximum value and the minimum value
given in a data set. Example: 1, 3,5, 6, 7 => Range = 7 -1= 6

Variance: Subtract the mean from each value in the data set, square each of these differences,
add the squares, and finally divide by the total number of values in the data set.
Variance: σ² = Σ(X − μ)² / N

Standard Deviation: The square root of the variance is known as the standard deviation,
i.e. S.D. = √(σ²) = σ.
Quartiles and Quartile Deviation: The quartiles are values that divide a list of numbers
into quarters. The quartile deviation is half of the distance between the third and the first
quartile.
Mean and Mean Deviation: The average of numbers is known as the mean and the
arithmetic mean of the absolute deviations of the observations from a measure of central
tendency is known as the mean deviation (also called mean absolute deviation).
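
Python's built-in statistics module covers these summary measures directly. The sketch below uses the eleven marks from the median example earlier; multimode requires Python 3.8 or later.

import statistics

data = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]

print(statistics.mean(data))        # 59 (= 649 / 11)
print(statistics.median(data))      # 56, the middle value
print(statistics.multimode(data))   # [55, 56]; both values appear twice
print(statistics.pvariance(data))   # population variance (divides by N)
print(statistics.variance(data))    # sample variance (divides by N - 1)
print(statistics.stdev(data))       # sample standard deviation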

● Measures of association:
Covariance:
In mathematics and statistics, covariance is a measure of the relationship between two random
variables. The metric evaluates how much – to what extent – the variables change together. In
other words, it is essentially a measure of the variance between two variables. However, the metric
does not assess the dependency between variables.
Unlike the correlation coefficient, covariance is measured in units. The units are computed by
multiplying the units of the two variables. The covariance can take any positive or negative value.
The values are interpreted as follows:

Positive covariance: Indicates that two variables tend to move in the same direction.
Negative covariance: Reveals that two variables tend to move in inverse directions.

In finance, the concept is primarily used in portfolio theory. One of its most common applications
in portfolio theory is the diversification method, using the covariance between assets in a portfolio.
By choosing assets that do not exhibit a high positive covariance with each other, the unsystematic
risk can be partially eliminated.


Formula for Covariance


The covariance formula is similar to the formula for correlation and deals with the calculation of
data points from the average value in a dataset. For example, the covariance between two random
variables X and Y can be calculated using the following formula (for a population):

cov(X, Y) = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / n

For a sample covariance, the formula is slightly adjusted:

Cov(X, Y) = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1)
Where:

● Xi – the values of the X-variable


● Yᵢ – the values of the Y-variable
● X̄ – the mean (average) of the X-variable
● Ȳ – the mean (average) of the Y-variable
● n – the number of data points

Correlation
Correlation is used to test relationships between quantitative variables or categorical variables. In
other words, it’s a measure of how things are related. The study of how variables are correlated is
called correlation analysis.

Some examples of data that have a high correlation:

Your caloric intake and your weight.


Your eye colour and your relatives’ eye colours.
The amount of time you study and your GPA.
Some examples of data that have a low correlation (or none at all):

Your sexual preference and the type of cereal you eat.


A dog’s name and the type of dog biscuit they prefer.
The cost of a car wash and how long it takes to buy a soda inside the station.

Correlations are useful because if you can find out what relationship variables have, you can make
predictions about future behaviour. Knowing what the future holds is very important in the social
sciences like government and healthcare. Businesses also use these statistics for budgets and
business plans.

Correlation Coefficient
A correlation coefficient is a way to put a value to the relationship. Correlation coefficients have a
value of between -1 and 1. A “0” means there is no relationship between the variables at all, while
-1 or 1 means that there is a perfect negative or positive correlation (negative or positive correlation
here refers to the type of graph the relationship will produce).
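
With numpy (assumed available), the sample covariance and the correlation coefficient can be computed directly; the two small arrays below are arbitrary illustrations.

import numpy as np

x = np.array([2.1, 2.5, 3.6, 4.0, 4.8])
y = np.array([8.0, 10.0, 12.0, 14.0, 16.0])

cov_matrix = np.cov(x, y)         # 2x2 matrix; the off-diagonal entry is the sample covariance
corr_matrix = np.corrcoef(x, y)   # 2x2 matrix; the off-diagonal entry is the correlation coefficient

print(cov_matrix[0, 1])
print(corr_matrix[0, 1])          # close to +1: x and y tend to move together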

● Mean, median, mode, variance and standard deviation calculations from sample problems:

Example 1: Calculate the mean, median and mode from the following table:

Payment delay Frequency

9.5-12.5 3
12.5-15.5 14
15.5-18.5 23
18.5-21.5 12
21.5-24.5 8
24.5-27.5 4
27.5-30.5 1

Total 65

Solution:
The grouped distribution was formed as follows:
Payment delay    Frequency (fᵢ)    Mid-value (xᵢ)    Product (fᵢxᵢ)    Cumulative frequency

9.5-12.5 3 11 33 3

12.5-15.5 14 14 196 17

15.5-18.5 23 17 391 40

18.5-21.5 12 20 240 52

21.5-24.5 8 23 184 60

24.5-27.5 4 26 104 64

27.5-30.5 1 29 29 65

Total 65 - 1177 -
The arithmetic mean is

x̄ = Σfᵢxᵢ / Σfᵢ = 1177/65 = 18.11
To calculate the median, we employ the cumulative frequency column to locate the middle-most value and use the formula:

Median = lₘ + (h/fₘ)(n/2 − F(m−1))

Here, lₘ = 15.5, h = 3, fₘ = 23, F(m−1) = 17, so that

Median = 15.5 + (3/23)(65/2 − 17) = 17.52
The mode is straightforward to compute. We use the following formula:

Mode = l₀ + h · Δ₁/(Δ₁ + Δ₂)

From the table above, l₀ = 15.5, h = 3, Δ₁ = 23 − 14 = 9, Δ₂ = 23 − 12 = 11, so that

Mode = 15.5 + 3(9/(9 + 11)) = 16.85.

Example 2: Compute the variance and standard deviation for the following frequency distribution:

x: 3 5 7 8 9

f: 2 3 2 2 1
Solution:
The following table illustrates the computation of variance from the above distribution.

xᵢ     fᵢ     fᵢxᵢ     fᵢxᵢ²

3 2 6 18

5 3 15 75

7 2 14 98

8 2 16 128

9 1 9 81

Here n = Σfᵢ = 10, Σfᵢxᵢ = 60 and Σfᵢxᵢ² = 400, so the sample variance is

s² = [nΣfᵢxᵢ² − (Σfᵢxᵢ)²] / [n(n − 1)] = [10(400) − (60)²] / [10(10 − 1)] = 4.44

and the standard deviation is s = √4.44 ≈ 2.11.
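
The hand calculation above can be cross-checked in Python. The sketch below (assuming NumPy and the standard-library statistics module) expands the frequency table of Example 2 into raw observations and computes the mean, median, mode, sample variance and standard deviation.

import numpy as np
from statistics import median, mode

# Expand the frequency table of Example 2 into raw observations
values = [3, 5, 7, 8, 9]
freqs = [2, 3, 2, 2, 1]
data = np.repeat(values, freqs)          # [3, 3, 5, 5, 5, 7, 7, 8, 8, 9]

print("mean     :", data.mean())          # 6.0
print("median   :", median(data))         # 6.0
print("mode     :", mode(data))           # 5
print("variance :", data.var(ddof=1))     # ~4.44 (sample variance, n - 1 denominator)
print("std dev  :", data.std(ddof=1))     # ~2.11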

Individual Activity:
● Explain mean, median, mode, variance and standard deviation with examples.

SELF-CHECK QUIZ 2.2

Check your understanding by answering the following questions:

1. Describe the dimensions of data quality.

2. What are measures of dispersion?

LEARNING OUTCOME 2.3 – Interpret sampling and
sampling distributions

Contents:

▪ Sampling methods.
▪ Sampling distribution and its characteristics.
▪ Central limit theorem.

Assessment criteria:

1. Sampling methods are described.


2. Biases in sampling are interpreted and corrective measures are explained.
3. Sampling distribution and its characteristics are explained.
4. Central limit theorem is interpreted.

Resources required:

Students/trainees must be provided with the following resources:

● Workplace (Computer and Internet connection)

ACTIVITY : 2.3

Learning Activity Resources/Special Instructions/References


Interpret sampling and ▪ Information Sheets: 2.3
sampling distributions ▪ Self-Check: 2.3
▪ Answer Key: 2.3

INFORMATION SHEET 2.3

Learning Objective: to Interpret sampling and sampling distributions

● Sampling methods:

In a statistical study, sampling methods refer to how we select members from the population to be
in the study. If a sample isn't randomly selected, it will probably be biassed in some way and the
data may not be representative of the population.
There are two types of sampling methods:

Probability sampling:
Probability sampling involves random selection, allowing you to make strong statistical inferences
about the whole group.

Non-probability sampling:
Non-probability sampling involves non-random selection based on convenience or other criteria,
allowing you to easily collect data.

Probability sampling methods


Probability sampling means that every member of the population has a chance of being selected.
It is mainly used in quantitative research. If you want to produce results that are representative of
the whole population, probability sampling techniques are the most valid choice.
There are four main types of probability samples:

Simple random sampling


In a simple random sample, every member of the population has an equal chance of being
selected. Your sampling frame should include the whole population.
To conduct this type of sampling, you can use tools like random number generators or
other techniques that are based entirely on chance.
Example: To select a simple random sample of 100 employees of Company X, assign a
number to every employee in the company database from 1 to 1000, and use a random
number generator to select 100 numbers.

Systematic sampling
Systematic sampling is similar to simple random sampling, but it is usually slightly easier
to conduct. Every member of the population is listed with a number, but instead of
randomly generating numbers, individuals are chosen at regular intervals.
Example: All employees of the company are listed in alphabetical order. From the first 10
numbers, you randomly select a starting point: number 6. From number 6 onwards, every
10th person on the list is selected (6, 16, 26, 36, and so on), and you end up with a sample
of 100 people.
If you use this technique, it is important to make sure that there is no hidden pattern in the
list that might skew the sample. For example, if the HR database groups employees by
team, and team members are listed in order of seniority, there is a risk that your interval
might skip over people in junior roles, resulting in a sample that is skewed towards senior
employees.

Stratified sampling
Stratified sampling involves dividing the population into subpopulations that may differ in
important ways. It allows you to draw more precise conclusions by ensuring that every
subgroup is properly represented in the sample. To use this sampling method, you divide
the population into subgroups (called strata) based on the relevant characteristics (e.g.
gender, age range, income bracket, job role). Based on the overall proportions of the
population, you calculate how many people should be sampled from each subgroup. Then
you use random or systematic sampling to select a sample from each subgroup.
Example: The company has 800 female employees and 200 male employees. You want
to ensure that the sample reflects the gender balance of the company, so you sort the
population into two strata based on gender. Then you use random sampling on each
group, selecting 80 women and 20 men, which gives you a representative sample of 100
people.

Cluster sampling
Cluster sampling also involves dividing the population into subgroups, but each subgroup
should have similar characteristics to the whole sample. Instead of sampling individuals
from each subgroup, you randomly select entire subgroups. If it is practically possible, you
might include every individual from each sampled cluster. If the clusters themselves are
large, you can also sample individuals from within each cluster using one of the techniques
above. This is called multistage sampling. This method is good for dealing with large and
dispersed populations, but there is more risk of error in the sample, as there could be
substantial differences between clusters. It’s difficult to guarantee that the sampled
clusters are really representative of the whole population.
Example: The company has offices in 10 cities across the country (all with roughly the
same number of employees in similar roles). You don’t have the capacity to travel to every
office to collect your data, so you use random sampling to select 3 offices – these are your
clusters.
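
As a rough illustration of the probability sampling methods above, the following Python sketch uses pandas on a hypothetical employee table of 1,000 people (800 women, 200 men); the column names, random seed and sample sizes are arbitrary choices for the example.

import pandas as pd

# Hypothetical sampling frame: 1,000 employees with a gender column for stratification
employees = pd.DataFrame({
    "id": range(1, 1001),
    "gender": ["F"] * 800 + ["M"] * 200,
})

# Simple random sampling: 100 employees, each with an equal chance of selection
simple_random = employees.sample(n=100, random_state=1)

# Systematic sampling: a random start in the first 10 rows, then every 10th row
start = 5                               # e.g. a randomly chosen starting point
systematic = employees.iloc[start::10]  # rows 6, 16, 26, ... (100 rows in total)

# Stratified sampling: 10% from each gender stratum preserves the 80/20 balance
stratified = (employees.groupby("gender", group_keys=False)
                       .apply(lambda g: g.sample(frac=0.10, random_state=1)))

print(len(simple_random), len(systematic), stratified["gender"].value_counts().to_dict())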

Non-probability sampling methods


In a non-probability sample, individuals are selected based on non-random criteria, and not every
individual has a chance of being included. This type of sample is easier and cheaper to access,
but it has a higher risk of sampling bias. That means the inferences you can make about the
population are weaker than with probability samples, and your conclusions may be more limited.
If you use a non-probability sample, you should still aim to make it as representative of the
population as possible.
Non-probability sampling techniques are often used in exploratory and qualitative research. In
these types of research, the aim is not to test a hypothesis about a broad population, but to develop
an initial understanding of a small or under-researched population.
Convenience sampling
A convenience sample simply includes the individuals who happen to be most accessible
to the researcher.This is an easy and inexpensive way to gather initial data, but there is
no way to tell if the sample is representative of the population, so it can’t produce
generalizable results.

Example: You are researching opinions about student support services in your university,
so after each of your classes, you ask your fellow students to complete a survey on the
topic. This is a convenient way to gather data, but as you only surveyed students taking
the same classes as you at the same level, the sample is not representative of all the
students at your university.

Voluntary response sampling


Similar to a convenience sample, a voluntary response sample is mainly based on ease
of access. Instead of the researcher choosing participants and directly contacting them,
people volunteer themselves (e.g. by responding to a public online survey). Voluntary
response samples are always at least somewhat biassed, as some people will inherently
be more likely to volunteer than others.

Example: You send out the survey to all students at your university and a lot of students
decide to complete it. This can certainly give you some insight into the topic, but the people
who responded are more likely to be those who have strong opinions about the student
support services, so you can’t be sure that their opinions are representative of all students.

Purposive sampling
This type of sampling, also known as judgement sampling, involves the researcher using
their expertise to select a sample that is most useful to the purposes of the research. It is
often used in qualitative research, where the researcher wants to gain detailed knowledge
about a specific phenomenon rather than make statistical inferences, or where the
population is very small and specific. An effective purposive sample must have clear
criteria and rationale for inclusion.

Example: You want to know more about the opinions and experiences of disabled
students at your university, so you purposefully select a number of students with different

support needs in order to gather a varied range of data on their experiences with student
services.

Snowball sampling
If the population is hard to access, snowball sampling can be used to recruit participants
via other participants. The number of people you have access to “snowballs” as you get
in contact with more people.

Example: You are researching experiences of homelessness in your city. Since there is
no list of all homeless people in the city, probability sampling isn’t possible. You meet one
person who agrees to participate in the research, and she puts you in contact with other
homeless people that she knows in the area.

● Biases in sampling are interpreted and corrective measures are explained:

Sampling Bias:
We can define sample selection bias, or sampling bias, as a kind of bias caused by choosing and
using non-random data for your statistical analysis. In survey or research sampling, bias is usually
the tendency or propensity of a specific sample statistic to overestimate or underestimate a
particular population parameter. Sampling bias can exist because of a flaw in your sample
selection process. As a result, you exclude a subset of your data systematically because of a
specific attribute. It is worth noting that the risk of sampling bias is present in nearly all elements
of both quantitative and qualitative surveys. This is why it may find its source easily in the survey
creator as well as the respondents.
Ideally, you have to select all of your survey participants in a random manner. However, in practice,
it can be hard to do a random selection of participants due to constraints such as cost and
respondent availability. Even if you do not do a randomised data collection, it is crucial to be aware
of the potential biases that could be present in your data. If you are aware of these biases, you
can take them into account in the analysis to do bias correction and better understand the
population that your data represents.

Types of Sampling Bias


Undercoverage
Undercoverage bias happens when you inadequately represent some members of your
population in the sample. One of the classic examples of undercoverage bias is the
popular Literary Digest survey, predicting that Mr. Alfred Landon would defeat Mr. Franklin
Roosevelt in the crucial presidential election of 1936. This research survey sample was
adversely affected by the undercoverage of many low-income voters in the country, who
were Democrats.

Observer Bias
Observer bias occurs when researchers subconsciously project their expectations on the
research. Did you know that this bias may come in several forms?
Some examples are unintentionally influencing your participants during surveys and
interviews or engaging in cherry-picking by focusing on some specific statistics that tend
to support your hypothesis instead of those that do not.

Self-Selection/Voluntary Response Bias


Self-selection bias (or volunteer/voluntary response bias) occurs when the research
participants exercise control over the decision to participate in the study. A great example
of this is call-in radio or TV shows soliciting audience participation in various types of
surveys often on controversial and hot topics, such as abortion, gun control or affirmative
action. Those individuals that choose to participate in the study are likely to share some
characteristics that distinguish them from the ones that choose not to participate. For
instance, people who usually have substantial knowledge or strong opinions might be
more likely to spend more time answering a research survey than people who don’t. As
a result, your sample will not represent your entire population and often over represent
people with strong opinions
Survivorship Bias
Another common bias in research is survivorship bias. Note that it occurs when a sample
concentrates on subjects who passed the selection criteria or process and ignores

subjects who didn’t pass the selection process. Survivorship bias can produce overly
optimistic results or findings from a study. For instance, if you use the record of existing
companies or organisations as the indicator of the overall business climate, you will ignore
the companies that failed and hence no longer exist.

Recall Bias
Recall bias is a common error in interview and survey situations. This happens when a
respondent fails to remember things correctly. You should know that it is not about good
or poor memory–human beings have, by default, a selective memory.
One way to avoid some of the implications of recall bias is by collecting information when
a respondent’s memory is fresh.

Exclusion Bias
This bias results from excluding specific groups from your sample, such as the exclusion
of subjects that have migrated recently into the study area. It is worth noting that excluding
subjects or participants that move out of the relevant study area can affect your study’s
validity

● Sampling distribution and its characteristics are explained:

A sampling distribution is a probability distribution of a statistic obtained from a large number of
samples drawn from a specific population. The sampling distribution of a given population is the
distribution of frequencies of a range of different outcomes that could possibly occur for a statistic
of a population.
In statistics, a population is the entire pool from which a statistical sample is drawn. A population
may refer to an entire group of people, objects, events, hospital visits, or measurements. A
population can thus be said to be an aggregate observation of subjects grouped together by a
common feature.
● A sampling distribution is a statistic that is arrived at through repeated sampling from a
larger population.
● It describes a range of possible outcomes of a statistic, such as the mean or mode of some
variable, as it truly exists in a population.
● The majority of data analysed by researchers are actually drawn from samples, and not
populations.

● Central limit theorem is interpreted:

The central limit theorem (CLT) is a statistical premise that, given a sufficiently large sample size from a
population with a finite level of variance, the mean of all sampled variables from the same population will be
approximately equal to the mean of the whole population. Furthermore, the distribution of the sample means
approximates a normal distribution, and the sample variances become approximately equal to the variance of
the population as the sample size gets larger, in line with the law of large numbers. Although this concept was
first developed by Abraham de Moivre in 1733, it was not formalised until 1930, when the noted Hungarian
mathematician George Pólya dubbed it the central limit theorem.
● The central limit theorem (CLT) states that the distribution of sample means approximates
a normal distribution as the sample size gets larger, regardless of the population's
distribution.
● Sample sizes equal to or greater than 30 are often considered sufficient for the CLT to
hold.
● A key aspect of CLT is that the average of the sample means and standard deviations will
equal the population mean and standard deviation.
● A sufficiently large sample size can predict the characteristics of a population more
accurately.
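
A small simulation can make these points concrete. The sketch below (assuming NumPy) draws 5,000 samples of size 30 from a clearly non-normal (exponential) population and shows that the sample means centre on the population mean with a spread close to σ/√n; the population and the sizes are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)

# A population that is clearly not normal: exponential with mean 1
population = rng.exponential(scale=1.0, size=100_000)

# Draw many samples of size n and record each sample mean
n, n_samples = 30, 5_000
sample_means = rng.choice(population, size=(n_samples, n)).mean(axis=1)

print("population mean      :", population.mean())
print("mean of sample means :", sample_means.mean())       # close to the population mean
print("sd of sample means   :", sample_means.std(ddof=1))  # close to sigma / sqrt(n)
print("sigma / sqrt(n)      :", population.std() / np.sqrt(n))

A histogram of sample_means would look roughly bell-shaped even though the population itself is heavily skewed.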

Individual Activity:
● Discuss sampling biases and its corrective measures.

SELF-CHECK QUIZ 2.3

Check your understanding by answering the following questions:

Write the appropriate/correct answer of the following:

1. Write down all sampling methods.

2. What is the central limit theorem?

LEARNING OUTCOME 2.4 – Interpret inferential statistics

Contents:

▪ Confidence interval.
▪ Hypothesis testing.
▪ Type-I and Type-II errors.
▪ Inference for comparing means (ANOVA).
▪ Non-parametric tests.

Assessment criteria:

1. Confidence interval is explained.


2. Hypothesis testing is interpreted.
3. Hypothesis test is performed using critical value and p-value approach.
4. Type-I and Type-II errors are interpreted.
5. Inference for comparing means (ANOVA) is explained.
6. Non-parametric tests are explained.

Resources required:

Students/trainees must be provided with the following resources:

Workplace (Computer and Internet connection)

LEARNING ACTIVITY 2.4

Learning Activity Resources/Special Instructions/References


Interpret inferential statistics ▪ Information Sheets: 2.4
▪ Self-Check: 2.4
▪ Answer Key: 2.4

INFORMATION SHEET 2.4

Learning Objective: to Interpret inferential statistics

● Confidence interval is explained:

A confidence interval is the mean of your estimate plus and minus the variation in that estimate.
This is the range of values you expect your estimate to fall between if you redo your test, within a
certain level of confidence. Confidence, in statistics, is another way to describe probability. For
example, if you construct a confidence interval with a 95% confidence level, you are confident that
95 out of 100 times the estimate will fall between the upper and lower values specified by the
confidence interval. Your desired confidence level is usually one minus the alpha (α) value you
used in your statistical test:
Confidence level = 1 − α
So if you use an alpha value of 0.05 for statistical significance, then your confidence level
would be 1 − 0.05 = 0.95, or 95%.
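
As a brief illustration, the Python sketch below (assuming SciPy) computes a 95% confidence interval for a sample mean using the t-distribution; the sample itself is simulated, so all numbers are hypothetical.

import numpy as np
from scipy import stats

# Hypothetical sample of 40 measurements
rng = np.random.default_rng(42)
sample = rng.normal(loc=50, scale=8, size=40)

mean = sample.mean()
sem = stats.sem(sample)                  # standard error of the mean
# 95% confidence interval for the mean using the t-distribution (sigma unknown)
lower, upper = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)

print(f"mean = {mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")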

● Hypothesis testing is interpreted:

Hypothesis testing is an act in statistics whereby an analyst tests an assumption regarding a
population parameter. The methodology employed by the analyst depends on the nature of the
data used and the reason for the analysis. Hypothesis testing is used to assess the plausibility of
a hypothesis by using sample data. Such data may come from a larger population, or from a data-
generating process. The word "population" will be used for both of these cases in the following
descriptions.
● Hypothesis testing is used to assess the plausibility of a hypothesis by using sample data.
● The test provides evidence concerning the plausibility of the hypothesis, given the data.
● Statistical analysts test a hypothesis by measuring and examining a random sample of the
population being analysed.

How Hypothesis Testing Works


In hypothesis testing, an analyst tests a statistical sample, with the goal of providing evidence on
the plausibility of the null hypothesis. Statistical analysts test a hypothesis by measuring and
examining a random sample of the population being analysed. All analysts use a random
population sample to test two different hypotheses: the null hypothesis and the alternative
hypothesis. The null hypothesis is usually a hypothesis of equality between population parameters;
e.g., a null hypothesis may state that the population mean return is equal to zero. The alternative
hypothesis is effectively the opposite of a null hypothesis (e.g., the population mean return is not
equal to zero). Thus, they are mutually exclusive, and only one can be true. However, one of the
two hypotheses will always be true.
Four Steps of Hypothesis Testing
All hypotheses are tested using a four-step process:
1. The first step is for the analyst to state the two hypotheses so that only one can be right.
2. The next step is to formulate an analysis plan, which outlines how the data will be
evaluated.
3. The third step is to carry out the plan and physically analyse the sample data.
4. The fourth and final step is to analyse the results and either reject the null hypothesis, or
state that the null hypothesis is plausible, given the data.

● Hypothesis test is performed using critical value and p-value approach:

Test About Proportions:


Let us consider the parameter p of population proportion. For instance, we might want to know the
proportion of males within a total population of adults when we conduct a survey. A test of
proportion will assess whether or not a sample from a population represents the true proportion
from the entire population.

Test About one Mean:


When you test a single mean, you’re comparing the mean value to some other hypothesised value.
Which test you run depends on whether you know the population standard deviation (σ) or not. If you
know the value of σ, the sample mean follows a normal distribution: use a one-sample z-test. The
z-test uses a formula to find a z-score, which you compare against a critical value found in a z-table.
The formula is:

Z = (x̄ − μ₀) / (σ/√n)
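
A minimal sketch of this one-sample z-test in Python is shown below; the sample values, μ₀ = 100 and σ = 15 are hypothetical, and the two-sided p-value is read from the standard normal distribution.

import numpy as np
from scipy import stats

# One-sample z-test sketch: is the mean of this sample different from mu0 = 100,
# assuming the population standard deviation sigma = 15 is known?
sample = np.array([102, 98, 110, 105, 99, 104, 108, 101, 107, 103])
mu0, sigma = 100, 15

z = (sample.mean() - mu0) / (sigma / np.sqrt(len(sample)))
p_value = 2 * (1 - stats.norm.cdf(abs(z)))   # two-sided p-value

print(f"z = {z:.3f}, p-value = {p_value:.3f}")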
Test of equality of two means
In this lesson, we'll continue our investigation of hypothesis testing. In this case, we'll focus
our attention on a hypothesis test for the difference in two population means μ1−μ2 for two
situations:
● a hypothesis test based on the t-distribution, known as the pooled two-sample t-test,
for μ1−μ2 when the (unknown) population variances σX2 and σY2 are equal
● a hypothesis test based on the t-distribution, known as Welch's t-test, for μ1−μ2
when the (unknown) population variances σX2 and σY2 are not equal
Of course, because population variances are generally not known, there is no way of being 100%
sure that the population variances are equal or not equal. In order to be able to determine,
therefore, which of the two hypothesis tests we should use, we'll need to make some assumptions
about the equality of the variances based on our previous knowledge of the populations we're
studying.

Test of variances:
A test of two variances hypothesis test determines if two variances are the same. The distribution
for the hypothesis test is the F distribution with two different degrees of freedom.
Assumptions:
● The populations from which the two samples are drawn are normally distributed.
● The two populations are independent of each other.

Critical value approach


By applying the critical value approach it is determined whether or not the observed test statistic
is more extreme than a defined critical value. Therefore the observed test statistic (calculated on
the basis of sample data) is compared to the critical value, some kind of cutoff value. If the test
statistic is more extreme than the critical value, the null hypothesis is rejected. If the test statistic
is not as extreme as the critical value, the null hypothesis is not rejected. The critical value is
computed based on the given significance level α and the type of probability distribution of the
idealised model. The critical value divides the area under the probability distribution curve in the
rejection region(s) and in the non-rejection region. The following three figures show a right tailed
test, a left tailed test, and a two-sided test. The idealised model in the figures, and thus 𝐻0 , is
described by a bell-shaped normal probability curve. In a two-sided test the null hypothesis is
rejected if the test statistic is either too small or too large. Thus the rejection region for such a test
consists of two parts: one on the left and one on the right.

For a left-tailed test, the null hypothesis is rejected if the test statistic is too small. Thus, the
rejection region for such a test consists of one part, which is left from the centre.

For a right-tailed test, the null hypothesis is rejected if the test statistic is too large. Thus, the
rejection region for such a test consists of one part, which is right from the centre.

p-value approach
For the p-value approach, the likelihood (p-value) of the numerical value of the test statistic
is compared to the specified significance level (α) of the hypothesis test. The p-value
corresponds to the probability of observing sample data at least as extreme as the actually
obtained test statistic. Small p-values provide evidence against the null hypothesis. The
smaller (closer to 0) the p-value, the stronger is the evidence against the null hypothesis. If
the p-value is less than or equal to the specified significance level α, the null hypothesis is
rejected; otherwise, the null hypothesis is not rejected. In other words, if p≤α, reject 𝐻0 ;
otherwise, if p>α do not reject𝐻0 . In consequence, by knowing the p-value any desired level of
significance may be assessed. For example, if the p-value of a hypothesis test is 0.01, the null
hypothesis can be rejected at any significance level larger than or equal to 0.01. It is not rejected
at any significance level smaller than 0.01. Thus, the p-value is commonly used to evaluate the
strength of the evidence against the null hypothesis without reference to significance level.
The following table provides guidelines for using the p-value to assess the evidence against the
null hypothesis (Weiss, 2010).
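
To show that the two approaches lead to the same decision, the sketch below (assuming SciPy) runs a two-sided one-sample t-test at α = 0.05 on made-up data and applies both the critical-value rule and the p-value rule.

import numpy as np
from scipy import stats

# Two-sided one-sample t-test at alpha = 0.05, illustrating both decision rules
sample = np.array([2.9, 3.4, 3.1, 3.8, 3.3, 3.0, 3.6, 3.2])
mu0, alpha = 3.0, 0.05

t_stat, p_value = stats.ttest_1samp(sample, popmean=mu0)
t_crit = stats.t.ppf(1 - alpha / 2, df=len(sample) - 1)  # two-sided critical value

print(f"t = {t_stat:.3f}, critical value = ±{t_crit:.3f}, p-value = {p_value:.3f}")
print("critical-value rule:", "reject H0" if abs(t_stat) > t_crit else "do not reject H0")
print("p-value rule       :", "reject H0" if p_value <= alpha else "do not reject H0")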

● Type I and Type II errors:

Just like a judge’s conclusion, an investigator’s conclusion may be wrong. Sometimes, by chance
alone, a sample is not representative of the population. Thus the results in the sample do not
reflect reality in the population, and the random error leads to an erroneous inference. A type I
error (false-positive) occurs if an investigator rejects a null hypothesis that is actually true in the
population; a type II error (false-negative) occurs if the investigator fails to reject a null hypothesis
that is actually false in the population. Although type I and type II errors can never be avoided
entirely, the investigator can reduce their likelihood by increasing the sample size (the larger the
sample, the lesser is the likelihood that it will differ substantially from the population).
False-positive and false-negative results can also occur because of bias (observer, instrument,
recall, etc.). (Errors due to bias, however, are not referred to as type I and type II errors.) Such
errors are troublesome, since they may be difficult to detect and cannot usually be quantified

● Inference for comparing means (ANOVA) is explained:

Sometimes we want to compare means across many groups. We might initially think to do pairwise
comparisons. For example, if there were three groups, we might be tempted to compare the first
mean with the second, then with the third, and then finally compare the second and third means
for a total of three comparisons. However, this strategy can be treacherous. If we have many
groups and do many comparisons, it is likely that we will eventually find a difference just by chance,
even if there is no difference in the populations. Instead, we should apply a holistic test to check
whether there is evidence that at least one pair groups are in fact different, and this is where
ANOVA saves the day.
In this section, we will learn a new method called analysis of variance (ANOVA) and a new test
statistic called an F-statistic (which we will introduce in our discussion of mathematical models).
ANOVA uses a single hypothesis test to check whether the means across many groups are equal:
● H₀: The mean outcome is the same across all groups. In statistical notation, μ₁ = μ₂ = ... = μₖ,
where μᵢ represents the mean of the outcome for observations in category i.
● Hₐ: At least one mean is different.

Generally we must check three conditions on the data before performing ANOVA:
● The observations are independent within and between groups,
● The responses within each group are nearly normal, and
● The variability across the groups is about equal.
When these three conditions are met, we may perform an ANOVA to determine whether the data
provide convincing evidence against the null hypothesis that all the μi are equal. Strong evidence
favouring the alternative hypothesis in ANOVA is described by unusually large differences among
the group means. We will soon learn that assessing the variability of the group means relative to
the variability among individual observations within each group is key to ANOVA’s success.
Example: College departments commonly run multiple sections of the same introductory course
each semester because of high demand. Consider a statistics department that runs three sections
of an introductory statistics course. We might like to determine whether there are substantial
differences in first exam scores in these three classes (Section A, Section B, and Section C).
Describe appropriate hypotheses to determine whether there are any differences between the
three classes.
The hypotheses may be written in the following form:
● 𝐻0 :The average score is identical in all sections, 𝜇𝐴 = 𝜇𝐵 = 𝜇𝐶 . Assuming each class is
equally difficult, the observed difference in the exam scores is due to chance.
● 𝐻𝐴 :The average score varies by class. We would reject the null hypothesis in favour of the
alternative hypothesis if there were larger differences among the class averages than what
we might expect from chance alone.
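
An ANOVA for an example like this can be run directly with SciPy’s f_oneway, as sketched below; the three lists of exam scores are invented purely for illustration.

from scipy import stats

# Hypothetical first-exam scores from three sections of the same course
section_a = [78, 85, 92, 70, 88, 81]
section_b = [75, 80, 68, 72, 79, 74]
section_c = [83, 90, 86, 95, 88, 84]

# One-way ANOVA: H0 says the three section means are equal
f_stat, p_value = stats.f_oneway(section_a, section_b, section_c)
print(f"F = {f_stat:.3f}, p-value = {p_value:.4f}")
# A small p-value (e.g. below 0.05) is evidence that at least one section mean differs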

● Non-parametric tests are explained:

Non-parametric tests are the mathematical methods used in statistical hypothesis testing, which
do not make assumptions about the frequency distribution of variables that are to be evaluated.
The non-parametric experiment is used when there is skewed data, and it comprises techniques
that do not depend on data pertaining to any particular distribution. The word non-parametric does
not mean that these models do not have any parameters. The fact is, the characteristics and
number of parameters are pretty flexible and not predefined. Therefore, these models are called
distribution-free models.

Wilcoxon Signed-Rank Test


Wilcoxon signed-rank test is used to compare the continuous outcome in the two matched
samples or the paired samples.
Null hypothesis, 𝐻0 : Median difference should be zero.
Test statistic: The test statistic W, is defined as the smaller of W+ or W- .
Where W+ and W- are the sums of the positive and the negative ranks of the different
scores.

Decision Rule: Reject the null hypothesis if the test statistic, W is less than or equal to
the critical value from the table.

Kruskal Wallis Test


Kruskal Wallis test is used to compare the continuous outcome in greater than two
independent samples.
Null hypothesis, 𝐻0 : K Population medians are equal.
Test statistic: If N is the total sample size, k is the number of comparison groups, Rj is
the sum of the ranks in the jth group and nj is the sample size in the jth group, then the
test statistic H is given by:

H = [12 / (N(N + 1))] Σⱼ₌₁ᵏ (Rⱼ² / nⱼ) − 3(N + 1)

Decision Rule: Reject the null hypothesis 𝐻0 if H ≥ critical value

Mann Whitney U Test


Mann Whitney U test is used to compare the continuous outcomes in the two independent
samples.
Null hypothesis, H0: The two populations should be equal.
Test statistic: If R1 and R2 are the sum of the ranks in group 1 and group 2 respectively,
then the test statistic “U” is the smaller of:

U₁ = n₁n₂ + n₁(n₁ + 1)/2 − R₁

U₂ = n₁n₂ + n₂(n₂ + 1)/2 − R₂
Decision Rule: Reject the null hypothesis if the test statistic, U is less than or equal to
critical value from the table.

Spearman rank correlation


The Spearman's rank-order correlation is the nonparametric version of the Pearson
product-moment correlation. Spearman's correlation coefficient, (ρ, also signified by r s)
measures the strength and direction of association between two ranked variables.

There are two methods to calculate Spearman's correlation, depending on whether: (1)
your data does not have tied ranks or (2) your data has tied ranks. The formula for when
there are no tied ranks is:

ρ = 1 − 6Σdᵢ² / (n(n² − 1))

where dᵢ = difference in paired ranks and n = number of cases. The formula to use when
there are tied ranks is:

ρ = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / √(Σᵢ(xᵢ − x̄)² Σᵢ(yᵢ − ȳ)²)

where i indexes the paired scores.
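
For reference, all four of these non-parametric procedures are available in SciPy. The sketch below applies them to simulated data; the group sizes and distributions are arbitrary illustrative choices.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
before, after = rng.normal(50, 5, 20), rng.normal(52, 5, 20)        # paired measurements
g1, g2, g3 = rng.normal(10, 2, 15), rng.normal(11, 2, 15), rng.normal(13, 2, 15)

print(stats.wilcoxon(before, after))   # Wilcoxon signed-rank test (paired samples)
print(stats.mannwhitneyu(g1, g2))      # Mann-Whitney U test (two independent samples)
print(stats.kruskal(g1, g2, g3))       # Kruskal-Wallis test (more than two groups)
print(stats.spearmanr(before, after))  # Spearman rank correlation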

Individual Activity:
● Discuss hypothesis testing and Type-I and Type-II errors.

SELF-CHECK QUIZ 2.4

Check your understanding by answering the following questions:

Write the appropriate/correct answer of the following:

1. What is hypothesis testing?

2. What is the p-value approach?

LEARNING OUTCOME 2.5 – Interpret regression models

Contents:

▪ Simple linear regression.


▪ Techniques for testing and validating assumptions of regression.
▪ Impact of multicollinearity and heteroscedasticity.
▪ Simple and Multivariate linear regression models.
▪ Logistic regression.

Assessment criteria:

1. Simple linear regression and its underlying assumptions are explained.


2. Techniques for testing and validating assumptions of regression are demonstrated.
3. Impact of multicollinearity and heteroscedasticity are explained.
4. Simple and Multivariate linear regression models are used to predict numeric values.
5. Logistic regression is explained.

Resources required:

Students/trainees must be provided with the following resources:

Workplace (Computer and Internet connection)

LEARNING ACTIVITY 2.5

Resources/Special
Learning Activity
Instructions/References
Interpret regression models
● Information Sheets: 2.5
● Self-Check: 2.5
● Answer Key: 2.5

INFORMATION SHEET 2.5

Learning Objective: to Interpret regression models

● Simple linear regression and its underlying assumptions are explained.

You’re probably familiar with plotting line graphs with one X axis and one Y axis. The X
variable is sometimes called the independent variable and the Y variable is called the
dependent variable. Simple linear regression plots one independent variable X against one
dependent variable Y. Technically, in regression analysis, the independent variable is
usually called the predictor variable and the dependent variable is called the criterion
variable. However, many people just call them the independent and dependent variables.
More advanced regression techniques (like multiple regression) use multiple independent
variables.

Regression analysis can result in linear or nonlinear graphs. A linear regression is where
the relationships between your variables can be described with a straight line. Non-linear
regressions produce curved lines.

Fig: Simple linear regression for the amount of rainfall per year.

Regression analysis is almost always performed by a computer program, as the equations are
extremely time-consuming to perform by hand. There is an important technical difference between
linear and nonlinear regression that becomes more significant in advanced study of regression.

● Techniques for testing and validating assumptions of regression are
demonstrated:
Regression is a parametric approach. ‘Parametric’ means it makes assumptions about data for
the purpose of analysis. Due to its parametric side, regression is restrictive in nature. It fails to
deliver good results with data sets which don't fulfil its assumptions. Therefore, for a successful
regression analysis, it’s essential to validate these assumptions.
So, how would you check (validate) if a data set follows all regression assumptions? You check it
using the regression plots (explained below) along with some statistical tests.
Let’s look at the important assumptions in regression analysis:
● There should be a linear and additive relationship between dependent (response) variable
and independent (predictor) variable(s). A linear relationship suggests that a change in
response Y due to one unit change in X¹ is constant, regardless of the value of X¹. An
additive relationship suggests that the effect of X¹ on Y is independent of other variables.
● There should be no correlation between the residual (error) terms. Absence of this
phenomenon is known as Autocorrelation.
● The independent variables should not be correlated. Absence of this phenomenon is
known as multicollinearity.
● The error terms must have constant variance. This phenomenon is known as
homoscedasticity. The presence of non-constant variance is referred to as
heteroskedasticity.
● The error terms must be normally distributed.

What if these assumptions get violated ?


Let’s dive into specific assumptions and learn about their outcomes (if violated):
Linear and Additive: If you fit a linear model to a non-linear, non-additive data set, the regression
algorithm would fail to capture the trend mathematically, thus resulting in an inefficient model. Also,
this will result in erroneous predictions on an unseen data set.
How to check: Look for residual vs fitted value plots (explained below). Also, you can include
polynomial terms ( 𝑋, 𝑋 2 , 𝑋 3 ) in your model to capture the non-linear effect.
Autocorrelation: The presence of correlation in error terms drastically reduces model’s accuracy.
This usually occurs in time series models where the next instant is dependent on the previous
instant. If the error terms are correlated, the estimated standard errors tend to underestimate the
true standard error.
If this happens, it causes confidence intervals and prediction intervals to be narrower. Narrower
confidence interval means that a 95% confidence interval would have a lesser probability than
0.95 that it would contain the actual value of coefficients. Let’s understand narrow prediction
intervals with an example:
For example, the least square coefficient of X¹ is 15.02 and its standard error is 2.08 (without
autocorrelation). But in the presence of autocorrelation, the standard error reduces to 1.20. As a
result, the prediction interval narrows down to (13.82, 16.22) from (12.94, 17.10).
Also, lower standard errors would cause the associated p-values to be lower than actual. This will
make us incorrectly conclude a parameter to be statistically significant.
How to check: Look for Durbin – Watson (DW) statistics. It must lie between 0 and 4. If DW = 2,
implies no autocorrelation, 0 < DW < 2 implies positive autocorrelation while 2 < DW < 4 indicates
negative autocorrelation. Also, you can see residual vs time plot and look for the seasonal or
correlated pattern in residual values.

Multicollinearity: This phenomenon exists when the independent variables are found to be
moderately or highly correlated. In a model with correlated variables, it becomes a tough task to
figure out the true relationship of a predictors with response variables. In other words, it becomes
difficult to find out which variable is actually contributing to predict the response variable.
Another point, with presence of correlated predictors, the standard errors tend to increase. And,
with large standard errors, the confidence interval becomes wider leading to less precise estimates
of slope parameters.
Also, when predictors are correlated, the estimated regression coefficient of a correlated variable
depends on which other predictors are available in the model. If this happens, you’ll end up with
an incorrect conclusion that a variable strongly / weakly affects the target variable. Since, even if
you drop one correlated variable from the model, its estimated regression coefficients would
change. That’s not good!

How to check: You can use scatter plots to visualise correlation effects among variables. You can
also use the VIF factor. A VIF value <= 4 suggests no multicollinearity, whereas a value of >= 10
implies serious multicollinearity. Above all, a correlation table should also solve the purpose.

Heteroskedasticity: The presence of non-constant variance in the error terms results in
heteroskedasticity. Generally, non-constant variance arises in the presence of outliers or extreme
leverage values. These values receive too much weight, thereby disproportionately
influencing the model’s performance. When this phenomenon occurs, the confidence interval for
out of sample prediction tends to be unrealistically wide or narrow.
How to check: You can look at residual vs fitted values plot. If heteroskedasticity exists, the plot
would exhibit a funnel shape pattern (shown in next section). Also, you can use Breusch-Pagan /
Cook – Weisberg test or White general test to detect this phenomenon.
Normal Distribution of error terms: If the error terms are non- normally distributed, confidence
intervals may become too wide or narrow. Once the confidence interval becomes unstable, it leads
to difficulty in estimating coefficients based on minimization of least squares. Presence of non –
normal distribution suggests that there are a few unusual data points which must be studied closely
to make a better model.
How to check: You can look at the QQ plot (shown below). You can also perform statistical tests
of normality such as Kolmogorov-Smirnov test, Shapiro-Wilk test.
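
A sketch of how these checks might be run in Python with statsmodels and SciPy is given below on simulated data; the predictors, response and seed are all hypothetical, and the diagnostics shown (Durbin-Watson, Breusch-Pagan, Shapiro-Wilk, VIF) are the standard statsmodels/SciPy functions.

import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical data: two (mildly correlated) predictors and a response
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.5 * x1 + rng.normal(scale=0.8, size=200)
y = 3 + 2 * x1 - x2 + rng.normal(size=200)

X = sm.add_constant(np.column_stack([x1, x2]))
resid = sm.OLS(y, X).fit().resid

print("Durbin-Watson   :", durbin_watson(resid))                    # ~2 means no autocorrelation
print("Breusch-Pagan p :", het_breuschpagan(resid, X)[1])           # small p -> heteroskedasticity
print("Shapiro-Wilk p  :", stats.shapiro(resid).pvalue)             # small p -> non-normal residuals
print("VIF (x1, x2)    :", [variance_inflation_factor(X, i) for i in (1, 2)])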

● Impact of multicollinearity and heteroscedasticity are explained:

Multicollinearity:
This phenomenon exists when the independent variables are found to be moderately or highly
correlated. In a model with correlated variables, it becomes a tough task to figure out the true
relationship of a predictors with response variables. In other words, it becomes difficult to find out which
variable is actually contributing to predict the response variable.
Another point, with presence of correlated predictors, the standard errors tend to increase. And, with
large standard errors, the confidence interval becomes wider leading to less precise estimates of slope
parameters.
Also, when predictors are correlated, the estimated regression coefficient of a correlated variable
depends on which other predictors are available in the model. If this happens, you’ll end up with an
incorrect conclusion that a variable strongly / weakly affects the target variable. Since, even if you drop
one correlated variable from the model, its estimated regression coefficients would change. That’s not
good!
How to check: You can use scatter plots to visualise correlation effects among variables. You can
also use the VIF factor. A VIF value <= 4 suggests no multicollinearity, whereas a value of >= 10
implies serious multicollinearity. Above all, a correlation table should also solve the purpose.

Heteroskedasticity:
The presence of non-constant variance in the error terms results in heteroskedasticity. Generally, non-
constant variance arises in the presence of outliers or extreme leverage values. These values receive
too much weight, thereby disproportionately influencing the model’s performance. When this
phenomenon occurs, the confidence interval for out of sample prediction tends to be unrealistically
wide or narrow.
How to check: You can look at residual vs fitted values plot. If heteroskedasticity exists, the plot would
exhibit a funnel shape pattern (shown in next section). Also, you can use Breusch-Pagan / Cook –
Weisberg test or White general test to detect this phenomenon.

● Simple and Multivariate linear regression models are used to predict numeric
values:

Simple Linear Regression


Simple linear regression is used to estimate the relationship between two quantitative variables.
You can use simple linear regression when you want to know:
How strong the relationship is between two variables (e.g. the relationship between rainfall and
soil erosion).
The value of the dependent variable at a certain value of the independent variable (e.g. the amount
of soil erosion at a certain level of rainfall).
Example: You are a social researcher interested in the relationship between income and
happiness. You survey 500 people whose incomes range from $15k to $75k and ask them to rank
their happiness on a scale from 1 to 10.
Your independent variable (income) and dependent variable (happiness) are both quantitative, so
you can do a regression analysis to see if there is a linear relationship between them.
Assumptions of simple linear regression
Simple linear regression is a parametric test, meaning that it makes certain assumptions about the
data. These assumptions are:

Homogeneity of variance (homoscedasticity): the size of the error in our prediction doesn’t
change significantly across the values of the independent variable.
Independence of observations: the observations in the dataset were collected using statistically
valid sampling methods, and there are no hidden relationships among observations.
Normality: The data follows a normal distribution.

Linear regression makes one additional assumption:


The relationship between the independent and dependent variable is linear: the line of best
fit through the data points is a straight line (rather than a curve or some sort of grouping factor).

Simple linear regression formula


The formula for a simple linear regression is:
𝑦 = 𝛽0 + 𝛽1 𝑋 + 𝜀
● y is the predicted value of the dependent variable (y) for any given value of the independent
variable (x).
● 𝛽0 is the intercept, the predicted value of y when the x is 0.
● 𝛽1 is the regression coefficient – how much we expect y to change as x increases.
● X is the independent variable ( the variable we expect is influencing y).
● 𝜀 is the error of the estimate, or how much variation there is in our estimate of the
regression coefficient.
Linear regression finds the line of best fit through your data by searching for the regression
coefficient (𝛽1 ) that minimises the total error ( 𝜀 ) of the model.
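
As a short illustration, the sketch below fits a simple linear regression with scikit-learn to a made-up income/happiness dataset and reads off β₀ and β₁; the numbers are invented and do not come from the survey described above.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: income (in $1000s) and a 1-10 happiness score
income = np.array([[15], [25], [35], [45], [55], [65], [75]])
happiness = np.array([2.5, 3.8, 4.6, 5.9, 6.8, 7.7, 8.9])

model = LinearRegression().fit(income, happiness)
print("intercept (beta_0):", model.intercept_)
print("slope (beta_1)    :", model.coef_[0])
print("prediction at 50  :", model.predict([[50]])[0])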

Multivariate Linear Regression


This is quite similar to the simple linear regression model we have discussed previously, but with
multiple independent variables contributing to the dependent variable and hence multiple
coefficients to determine and complex computation due to the added variables. Jumping straight
into the equation of multivariate linear regression,

Yᵢ = α + β₁xᵢ(1) + β₂xᵢ(2) + ... + βₙxᵢ(n)
Yi is the estimate of ith component of dependent variable y, where we have n independent
variables and xij denotes the ith component of the jth independent variable/feature. Similarly cost
function is as follows,

E(α, β₁, β₂, ..., βₙ) = (1/2m) Σᵢ₌₁ᵐ (yᵢ − Yᵢ)²

where we have m data points in training data and y is the observed data of dependent variables. As
per the formulation of the equation or the cost function, it is a pretty straightforward generalisation
of simple linear regression. But computing the parameters is the matter of interest here.

Computing parameters
Generally, when it comes to multivariate linear regression, we don't throw in all the independent
variables at a time and start minimising the error function. First one should focus on selecting the
best possible independent variables that contribute well to the dependent variable. For this, we go
on and construct a correlation matrix for all the independent variables and the dependent variable
from the observed data. The correlation value gives us an idea about which variable is significant
and by what factor. From this matrix we pick independent variables in decreasing order of
correlation value and run the regression model to estimate the coefficients by minimising the error
function. We stop when there is no prominent improvement in the estimation function by inclusion
of the next independent feature. This method can still get complicated when there are large no.of
independent features that have significant contribution in deciding our dependent variable. Let's
discuss the normal method first which is similar to the one we used in univariate linear regression.

Normal Equation
Now let us talk in terms of matrices, as it is easier that way. As discussed before, if we have n
independent variables in our training data, a 0th entry with a value of 1 is added to each vector of
independent variables (this entry multiplies the constant term α). So, X is as follows:

X = [X₁, X₂, ..., Xₘ]

Xᵢ contains the entries corresponding to each feature of the ith training example. So, matrix X has m
rows and n+1 columns (the 0th column is all 1s and the rest hold one independent variable each).
Y = [Y₁, Y₂, ..., Yₘ]

and coefficient matrix C,

C = [α, β₁, ..., βₙ]

and our final equation for our hypothesis is

Y = XC
To calculate the coefficients, we need n+1 equations, and we get them from the minimising
condition of the error function. Equating the partial derivative of E(α, β₁, β₂, ..., βₙ) with respect to
each of the coefficients to 0 gives a system of n+1 equations. Solving these is a complicated step
and gives the following nice result for matrix C:

C = (XᵀX)⁻¹Xᵀy
where y is the matrix of the observed values of dependent variables.
This method works well when n is reasonably small (roughly up to a few hundred features). As n
grows large, computing the matrix inverse and the multiplications becomes very time-consuming,
and iterative methods such as gradient descent are usually preferred for data with a large number
of features.
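
A minimal NumPy sketch of the normal equation on simulated data is shown below; it solves (XᵀX)C = Xᵀy with a linear solver rather than forming the inverse explicitly, which is numerically safer but otherwise equivalent to the formula above.

import numpy as np

# Normal-equation sketch: estimate C for a small simulated dataset
rng = np.random.default_rng(1)
m, n = 50, 3                                  # 50 observations, 3 features
features = rng.normal(size=(m, n))
y = 4 + features @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.1, size=m)

X = np.column_stack([np.ones(m), features])   # prepend the column of 1s for the constant term
C = np.linalg.solve(X.T @ X, X.T @ y)         # solve (X^T X) C = X^T y

print("estimated [alpha, beta_1, beta_2, beta_3]:", np.round(C, 3))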

● Logistic regression is explained:

Logistic regression streamlines the mathematics for measuring the impact of multiple variables
(e.g., age, gender, ad placement) with a given outcome (e.g., click-through or ignore). The
resulting models can help tease apart the relative effectiveness of various interventions for
different categories of people, such as young/old or male/female.
Logistic models can also transform raw data streams to create features for other types of AI and
machine learning techniques. In fact, logistic regression is one of the commonly used algorithms
in machine learning for binary classification problems, which are problems with two class values,
including predictions such as "this or that," "yes or no," and "A or B."
Logistic regression can also estimate the probabilities of events, including determining a
relationship between features and the probabilities of outcomes. That is, it can be used for
classification by creating a model that correlates the hours studied with the likelihood the student
passes or fails. On the flip side, the same model could be used for predicting whether a particular
student will pass or fail when the number of hours studied is provided as a feature and the variable
for the response has two values: pass and fail.
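
Continuing the pass/fail example, the sketch below fits a logistic regression with scikit-learn to a small hypothetical hours-studied dataset and reads off predicted pass probabilities; the data are invented for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical hours-studied vs. pass/fail data (1 = pass, 0 = fail)
hours = np.array([[0.5], [1], [1.5], [2], [2.5], [3], [3.5], [4], [4.5], [5]])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(hours, passed)
print("P(pass | 2 hours):", clf.predict_proba([[2]])[0, 1])
print("P(pass | 4 hours):", clf.predict_proba([[4]])[0, 1])
print("predicted class for 3 hours:", clf.predict([[3]])[0])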

Logistic regression applications in business
Organisations use insights from logistic regression outputs to enhance their business strategy for
achieving business goals such as reducing expenses or losses and increasing ROI in marketing
campaigns.
An e-commerce company that mails expensive promotional offers to customers, for example,
would like to know whether a particular customer is likely to respond to the offers or not: i.e.,
whether that consumer will be a "responder" or a "non-responder." In marketing, this is called
propensity to respond modelling.
Likewise, a credit card company will develop a model to help it predict if a customer is going to
default on its credit card based on such characteristics as annual income, monthly credit card
payments and the number of defaults. In banking parlance, this is known as default propensity
modelling.

Logistic regression use cases


Logistic regression has become particularly popular in online advertising, enabling marketers to
predict the likelihood of specific website users who will click on particular advertisements as a yes
or no percentage.
Logistic regression can also be used in the following areas:
● in healthcare to identify risk factors for diseases and plan preventive measures;
● in drug research to tease apart the effectiveness of medicines on health outcomes
across age, gender and ethnicity;
● in weather forecasting apps to predict snowfall and weather conditions;
● in political polls to determine if voters will vote for a particular candidate;
● in insurance to predict the chances that a policyholder will die before the policy's term
expires based on specific criteria, such as gender, age and physical examination; and
● in banking to predict the chances that a loan applicant will default on a loan or not,
based on annual income, past defaults and past debts.

Advantages and disadvantages of logistic regression

The main advantage of logistic regression is that it is much easier to set up and train than other
machine learning and AI applications.Another advantage is that it is one of the most efficient
algorithms when the different outcomes or distinctions represented by the data are linearly
separable. This means that you can draw a straight line separating the results of a logistic
regression calculation.One of the biggest attractions of logistic regression for statisticians is that it
can help reveal the interrelationships between different variables and their impact on outcomes.
This could quickly determine when two variables are positively or negatively correlated, such as
the finding cited above that more studying tends to be correlated with higher test outcomes. But it
is important to note that other techniques like causal AI are required to make the leap from
correlation to causation.

Logistic regression tools

Logistic regression calculations were a laborious and time-consuming task before the advent of
modern computers. Now, modern statistical analytics tools such as SPSS and SAS include logistic
regression capabilities as an essential feature.
Also, data science programming languages and frameworks built on R and Python include
numerous ways of performing logistic regression and weaving the results into other algorithms.
There are also various tools and techniques for doing logistic regression analysis on top of Excel.
Managers should also consider other data preparation and management tools as part of significant
data science democratisation efforts. For example, data warehouses and data lakes can help
organise larger datasets for analysis. Data catalogue tools can help surface any quality or usability
issues associated with logistic regression. Data science platforms can help analytics leaders
create appropriate guardrails to simplify the broader use of logistic regression across the
enterprise.
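
As a concrete illustration of the Python route mentioned above, here is a minimal sketch using scikit-learn. The hours-studied data, variable names and the 2.2-hour query are made up purely for illustration; they are not taken from any real study.

# A minimal logistic regression sketch with scikit-learn (illustrative data only)
import numpy as np
from sklearn.linear_model import LogisticRegression

hours_studied = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
passed_exam = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # 0 = fail, 1 = pass

model = LogisticRegression()
model.fit(hours_studied, passed_exam)

# Probability of passing for a student who studied 2.2 hours, and the predicted class
print(model.predict_proba([[2.2]])[0][1])
print(model.predict([[2.2]]))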
Individual Activity:
Demonstrate logistic regression, and explain the impact of multicollinearity and heteroscedasticity.

100
SELF-CHECK QUIZ 2.5

Check your understanding by answering the following questions:

Write the appropriate/correct answer of the following:

1. What is Simple Linear Regression?

2. What is the difference between simple and multivariate linear regression?

101
LEARNER JOB SHEET 2

Qualification: 2 Years Experience in IT Sector

Learning unit: DEMONSTRATE STATISTICAL CONCEPTS


Learner name:

Personal protective
equipment (PPE):

Materials: Computer and Internet connection

Tools and equipment:

Performance criteria: 1. Fundamental rules of probability are explained.
2. Rules for conditional probability and independence are described.
3. Bayes' rule is interpreted.
4. Common continuous and discrete probability distributions are described.
5. Z-score and standard normal distribution are interpreted.
6. Proportions are computed using z-table.
7. Probabilities are computed using normal distribution.
8. Types of data and data measurement scales are described.
9. Measures of central tendency are explained.
10. Measures of dispersion are explained.
11. Mean, median, mode, variance and standard deviation are calculated from sample problems.
12. Sampling methods are described.
13. Biases in sampling are interpreted and corrective measures are explained.
14. Sampling distribution and its characteristics are explained.
15. Central limit theorem is interpreted.
16. Confidence interval is explained.
17. Hypothesis testing is interpreted.
18. Hypothesis test is performed using critical value and p-value approach.
19. Type-I and Type-II errors are interpreted.
20. Inference for comparing means (ANOVA) is explained.
21. Simple linear regression and its underlying assumptions are explained.
22. Techniques for testing and validating assumptions of regression are demonstrated.
23. Impact of multicollinearity and heteroscedasticity is explained.
24. Simple and multivariate linear regression models are used to predict numeric values.
25. Logistic regression is explained.

Measurement:

Notes:

102
Procedure: 1. Connect the computer to the internet.
2. Connect the router to the internet.

Learner signature: Date:

Assessor signature: Date:

Quality Assurer
Date:
signature:

Assessor remarks:

Feedback:

103
ANSWER KEYS

ANSWER KEY 2.1

1. If A and B are two events defined on a sample space, then: P(A AND B) = P(B)P(A|B). (The probability
of A given B equals the probability of A and B divided by the probability of B.) If A and B are
independent, then P(A|B) = P(A).

2. Rule of Multiplication The probability that Events A and B both occur is equal to the probability that
Event A occurs times the probability that Event B occurs, given that A has occurred.
3. Conditional probability is known as the possibility of an event or outcome happening, based on the
existence of a previous event or outcome. It is calculated by multiplying the probability of the preceding
event by the renewed probability of the succeeding, or conditional, event.
4. A Z-score is a numerical measurement that describes a value's relationship to the mean of a group of
values. Z-score is measured in terms of standard deviations from the mean. If a Z-score is 0, it indicates
that the data point's score is identical to the mean score.

ANSWER KEY 2.2

1. Data quality is assessed across six dimensions: accuracy, completeness, consistency, timeliness, validity, and
uniqueness.

2. Measures of dispersion describe the spread of the data. They include the range, interquartile range,
standard deviation and variance. The range is given as the smallest and largest observations. This is the
simplest measure of variability.

ANSWER KEY 2.3

1. Five Basic Sampling Methods


● Simple Random.
● Convenience.
● Systematic.
● Cluster.
● Stratified.
2. The central limit theorem states that if you have a population with mean μ and standard deviation σ and
take sufficiently large random samples from the population with replacement, then the distribution of the
sample means will be approximately normally distributed.

ANSWER KEY 2.4

1. Hypothesis testing is a form of statistical inference that uses data from a sample to draw conclusions
about a population parameter or a population probability distribution.
2. The p-value approach to hypothesis testing uses the calculated probability to determine whether there
is evidence to reject the null hypothesis. The null hypothesis, also known as the “conjecture,” is the initial
claim about a population (or data-generating process).

104
ANSWER KEY 2.5

1. Simple linear regression is a regression model that estimates the relationship between one independent
variable and one dependent variable using a straight line. Both variables should be quantitative.
2. SLR examines the relationship between the dependent variable and a single independent variable. MLR
examines the relationship between the dependent variable and multiple independent variables.

105
Module 3: Demonstrate Programming Skills For Data Science

MODULE CONTENT

Module Descriptor: This unit covers the knowledge, skills and attitudes required to
demonstrate programming skills for data science. It specifically
includes working with database management system, working with
Python, using Pandas and NumPy libraries and using python to
implement descriptive and inferential statistics.

Nominal Duration: 60 hours

LEARNING OUTCOMES:

Upon completion of the module, the trainee should be able to:

3.1 Work with database management system.


3.2 Work with python.
3.3 Use Pandas and NumPy libraries.
3.4 Use python to implement descriptive and inferential statistics.

PERFORMANCE CRITERIA:

1. Suitable Database Management System (DBMS) is set up according to the organizational
requirements.
2. SQL commands and Logical Operators are written to access the database.
3. Different types of JOINs are written to combine data from multiple sources.
4. Aggregate functions are used to extract basic information about data and transform data
according to analysis requirements.
5. Subqueries are written as required.
6. Window functions and partitioning are used to complete complex tasks.
7. Appropriate Anaconda/ Miniconda version is set up for using Python for data analysis.
8. Suitable Python IDE is selected for data analysis.
9. Logical statements are created using Python data types and structures, Python operators
and variables.
10. Syntax, whitespace and style guidelines are implemented.
11. Conditional statements and loops are used for multiple iteration and decision making.
12. Custom functions and lambda expressions are defined in code.
13. Modules in Python Standard Libraries and third-party libraries are used.
14. Git commands are demonstrated to use with python scripts/ Jupyter notebooks.
15. Objects in Pandas Series and DataFrames are created, accessed and modified.
16. CSV, JSON, XML and XLS files are read using Pandas.
17. Multidimensional NumPy arrays (ndarrays) are created, accessed, modified and sorted.

106
18. Slicing, Boolean indexing and set operations are performed to select or change subset of
ndarray.
19. Element-wise operations are done on ndarrays.
20. Usage of Python Scipy library is demonstrated.
21. Mean, median, mode, standard deviation, percentiles, skewness and kurtosis are calculated
using python.
22. Python code is used to test hypotheses.
23. Correlations are measured.
24. Continuous variable is predicted using regression and regression assumptions are validated.

107
Learning Outcome 3.1 – Work with database management
systems.

Contents:

● Database Management System (DBMS).


● SQL commands and Logical Operators.
● Different types of JOINs.
● Aggregate functions.
● Subqueries.

Assessment criteria:

1. Suitable Database Management System (DBMS) is set up according to the organizational
requirements.
2. SQL commands and Logical Operators are written to access the database.
3. Different types of JOINs are written to combine data from multiple sources.
4. Aggregate functions are used to extract basic information about data and transform data
according to analysis requirements.
5. Subqueries are written as required.
6. Window functions and partitioning are used to complete complex tasks.

Resources required:

Students/trainees must be provided with the following resources:


▪ Workplace (Computer, Internet connection).

LEARNING ACTIVITY 3.1

Learning Activity Resources/Special Instructions/References


Work with database management systems. ▪ Information Sheet: 3.1
▪ Self-Check: 3.1
▪ Answer Key: 3.1

108
INFORMATION SHEET 3.1

Learning Objective: Work with database management system.

● Database Management System:


Oracle Database
Oracle Database is the first database designed for enterprise grid computing, the most flexible and
cost-effective way to manage information and applications. Enterprise grid computing creates large
pools of industry-standard, modular storage and servers. With this architecture, each new system
can be rapidly provisioned from the pool of components.

Microsoft SQL server


Microsoft SQL Server is a relational database management system (RDBMS) that
supports a wide variety of transaction processing, business intelligence and analytics
applications in corporate IT environments. Microsoft SQL Server is one of the three
market-leading database technologies, along with Oracle Database and IBM's DB2. Like
other RDBMS software, Microsoft SQL Server is built on top of SQL, a standardised
programming language that database administrators (DBAs) and other IT professionals
use to manage databases and query the data they contain. SQL Server is tied to Transact-
SQL (T-SQL), an implementation of SQL from Microsoft that adds a set of proprietary
programming extensions to the standard language.

PostgreSQL
PostgreSQL is a powerful, open source object-relational database system that uses and
extends the SQL language combined with many features that safely store and scale the
most complicated data workloads. The origins of PostgreSQL date back to 1986, as part
of the POSTGRES project at the University of California, Berkeley, and the core platform
has seen more than 30 years of active development.
PostgreSQL comes with many features aimed to help developers build applications,
administrators to protect data integrity and build fault-tolerant environments, and help you
manage your data no matter how big or small the dataset. In addition to being free and
open source, PostgreSQL is highly extensible. For example, you can define your own data
types, build out custom functions, even write code from different programming languages
without recompiling your database.

MySQL Database
MySQL is an open-source, fast, reliable, and flexible relational database management
system, typically used with PHP. This chapter is a hands-on chapter about MySQL.
What is MySQL:
● MySQL is a database system used for developing web-based software applications.
● MySQL is used for both small and large applications.
● MySQL is a relational database management system (RDBMS).
● MySQL is fast, reliable, flexible and easy to use.
● MySQL supports standard SQL (Structured Query Language).
● MySQL is free to download and use.
● MySQL was developed by Michael Widenius and David Axmark in 1994.
● MySQL is presently developed, distributed, and supported by Oracle Corporation.
● MySQL is written in C and C++.

Installing MySQL:
Installing on Windows
Installing MySQL on Windows is relatively simple. You only need to download the
MySQL installation package for the Windows version and run the installer.
Under Windows, you can download the community edition from
http://dev.mysql.com/downloads/ (select the version of MySQL Community Server for the
respective platform that you need). WAMP or XAMPP can also be installed; they come with the
MySQL database.

109
Installing on Debian/Linux
Under Ubuntu, you can install MySQL with the following command

sudo apt-get install mysql-server
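
Once the server is installed, it can also be reached from Python. The snippet below is a minimal connectivity check; it assumes the mysql-connector-python package has been installed (for example with pip) and that the host, user and password shown, which are placeholders, match your local setup.

# Minimal MySQL connectivity check from Python (credentials are placeholders)
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="root", password="your_password")
cursor = conn.cursor()
cursor.execute("SHOW DATABASES")
for (database_name,) in cursor:
    print(database_name)
conn.close()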

NoSQL:
NoSQL databases ("not only SQL") are non-tabular databases and store data differently
than relational tables. NoSQL databases come in a variety of types based on their data
model. The main types are document, key-value, wide-column, and graph. They provide
flexible schemas and scale easily with large amounts of data and high user loads.
NoSQL database features
Each NoSQL database has its own unique features. At a high level, many NoSQL
databases have the following features:

● Flexible schemas
● Horizontal scaling
● Fast queries due to the data model
● Ease of use for developers.
Types of NoSQL databases
Over time, four major types of NoSQL databases emerged: document
databases, key-value databases, wide-column stores, and graph databases.

● Document databases store data in documents similar to JSON (JavaScript Object


Notation) objects. Each document contains pairs of fields and values. The values can
typically be a variety of types including things like strings, numbers, booleans, arrays, or
objects.
● Key-value databases are a simpler type of database where each item contains keys and
values.
● Wide-column stores store data in tables, rows, and dynamic columns.
● Graph databases store data in nodes and edges. Nodes typically store information about
people, places, and things, while edges store information about the relationships between
the nodes.

MongoDB:
MongoDB is a document-oriented NoSQL database used for high volume data storage.
Instead of using tables and rows as in the traditional relational databases, MongoDB
makes use of collections and documents. Documents consist of key-value pairs which are
the basic unit of data in MongoDB. Collections contain sets of documents and function as the
equivalent of relational database tables. MongoDB is a database that came
to light around the mid-2000s.
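
As a quick sketch of how documents and collections are used, the example below inserts and retrieves one document with the pymongo driver. It assumes a MongoDB server is running locally on the default port and that pymongo has been installed; the database, collection and field names are arbitrary.

# Storing and retrieving a document with pymongo (assumes a local MongoDB server)
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["training_db"]          # database is created lazily on first write
students = db["students"]           # a collection of documents

# A document is just a set of key-value pairs
students.insert_one({"name": "Rahim", "course": "Data Science", "score": 87})

# Retrieve a document by matching one of its fields
print(students.find_one({"name": "Rahim"}))
client.close()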

● SQL commands and Logical Operators are written to access the database.

SQL commands are divided into four subgroups, DDL, DML, DCL, and TCL.

DDL:
DDL is the short name of Data Definition Language, which deals with database schemas and
descriptions of how the data should reside in the database.
● CREATE - to create a database and its objects like (table, index, views, store
procedure, function, and triggers)
● ALTER - alters the structure of the existing database
● DROP - delete objects from the database

110
● TRUNCATE - remove all records from a table, including all space allocated for
the records
● COMMENT - add comments to the data dictionary
● RENAME - rename an object
DML:
DML is the short name of Data Manipulation Language which deals with data manipulation and
includes the most common SQL statements such as SELECT, INSERT, UPDATE, DELETE, etc., and it
is used to store, modify, retrieve, delete and update data in a database.

● SELECT - retrieve data from a database


● INSERT - insert data into a table
● UPDATE - updates existing data within a table
● DELETE - delete records from a database table
● MERGE - UPSERT operation (insert or update)
● CALL - call a PL/SQL or Java subprogram
● EXPLAIN PLAN - interpretation of the data access path
● LOCK TABLE - concurrency Control
DCL:
DCL is the short name of Data Control Language which includes commands such as GRANT and
is mostly concerned with rights, permissions and other controls of the database system.
● GRANT - allow users access privileges to the database
● REVOKE - withdraw users access privileges given by using the GRANT command
TCL
TCL is the short name of Transaction Control Language which deals with a transaction within a
database.
COMMIT - commits a transaction
ROLLBACK - rolls back a transaction in case an error occurs
SAVEPOINT - sets a point within a transaction to which you can later roll back
SET TRANSACTION - specifies characteristics of the transaction
Example of using SQL commands:
SELECT
MySQL SELECT statement is used to fetch data from a database table.
Syntax
SELECT column_name(s) FROM table_name

UPDATE
The UPDATE statement is used to modify data in a table.
Syntax
UPDATE table_name
SET column=value, column1=value1,...
WHERE someColumn=someValue

DELETE
The DELETE FROM statement is used to delete data from a database table.
Syntax
DELETE FROM tableName
WHERE someColumn = someValue

INSERT
MySQL Query statement "INSERT INTO" is used to insert new records in a table.
Syntax
INSERT INTO table_name (column, column1, column2, column3, ...)
VALUES (value, value1, value2, value3 ...)

ALTER

111
The ALTER TABLE statement is used to add, delete, or modify columns in an existing
table. The ALTER TABLE statement is also used to add and drop various constraints on
an existing table.

Syntax
ALTER TABLE table_name
ADD column_name datatype;

CREATE
The CREATE TABLE statement is used to create a new table in a database.
Syntax
CREATE TABLE table_name (
column1 datatype,
column2 datatype,
column3 datatype,
....
);

DROP
The DROP DATABASE statement is used to drop an existing SQL database.
DROP DATABASE databasename

The DROP TABLE statement is used to drop an existing table in a database.


DROP TABLE table_name

Logical Operators :
Like Operator:
The LIKE operator is used in a WHERE clause to search for a specified pattern in a column.
There are two wildcards often used in conjunction with the LIKE operator:
● The percent sign (%) represents zero, one, or multiple characters
● The underscore sign (_) represents one, single character
The percent sign and the underscore can also be used in combinations. Some of them
are illustrated below.

Like Operator Description

WHERE CustomerName LIKE 'a%' Finds any values that start with "a"

WHERE CustomerName LIKE '%a' Finds any values that end with "a"

WHERE CustomerName LIKE 'a_%' Finds any values that start with "a" and are
at least 2 characters in length

AND, OR:
The WHERE clause can be combined with AND, OR, and NOT operators.
The AND and OR operators are used to filter records based on more than one condition:
The AND operator displays a record if all the conditions separated by AND are TRUE.
The OR operator displays a record if any of the conditions separated by OR is TRUE.
The NOT operator displays a record if the condition(s) is NOT TRUE.
AND Syntax
SELECT column1, column2, ...
FROM table_name
WHERE condition1 AND condition2 AND condition3 ...;

112
OR Syntax

SELECT column1, column2, ...


FROM table_name
WHERE condition1 OR condition2 OR condition3 ...;
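
The commands and operators above can be practised directly from Python with the built-in sqlite3 module, which needs no separate database server. The sketch below uses a made-up customers table; SQLite's dialect differs slightly from MySQL's, but these particular statements behave the same way.

# Trying out basic SQL commands and logical operators with Python's built-in sqlite3
import sqlite3

conn = sqlite3.connect(":memory:")      # a temporary, in-memory database
cur = conn.cursor()

cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.executemany("INSERT INTO customers (name, city) VALUES (?, ?)",
                [("Anika", "Dhaka"), ("Arif", "Chattogram"), ("Babul", "Dhaka")])

# SELECT with a WHERE clause combining AND and the LIKE wildcard '%'
cur.execute("SELECT name, city FROM customers WHERE city = 'Dhaka' AND name LIKE 'A%'")
print(cur.fetchall())                   # [('Anika', 'Dhaka')]

# UPDATE and DELETE follow the syntax shown above
cur.execute("UPDATE customers SET city = 'Sylhet' WHERE name = 'Arif'")
cur.execute("DELETE FROM customers WHERE name = 'Babul'")
conn.commit()
conn.close()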

● Different types of JOINs

Inner Join:
The INNER JOIN keyword selects records that have matching values in both tables.
Syntax
SELECT column_name(s)
FROM table1
INNER JOIN table2
ON table1.column_name = table2.column_name;

Outer join:

Left Join
The LEFT JOIN keyword returns all records from the left table (table1), and the matching records
from the right table (table2). The result is 0 records from the right side, if there is no match.

SELECT column_name(s)
FROM table1
LEFT JOIN table2
ON table1.column_name = table2.column_name;

Right Join
The RIGHT JOIN keyword returns all records from the right table (table2), and the matching
records from the left table (table1). The result is 0 records from the left side, if there is no match.

SELECT column_name(s)
FROM table1
RIGHT JOIN table2
ON table1.column_name = table2.column_name;

Full Join/Full Outer Join


The FULL OUTER JOIN keyword returns all records when there is a match in left (table1) or right
(table2) table records.

● Aggregate Function
Count
The COUNT() function returns the number of rows that matches a specified criterion.
Syntax

SELECT COUNT(column_name)
FROM table_name

113
WHERE condition;
SUM
The SUM() function returns the total sum of a numeric column.
SELECT SUM(column_name)
FROM table_name
WHERE condition;
MIN
The MIN() function returns the smallest value of the selected column.
Syntax

SELECT MIN(column_name)
FROM table_name
WHERE condition;
MAX
The MAX() function returns the largest value of the selected column.

SELECT MAX(column_name)
FROM table_name
WHERE condition

CASE
The CASE statement goes through conditions and returns a value when the first condition is met
(like an if-then-else statement). So, once a condition is true, it will stop reading and return the
result. If no conditions are true, it returns the value in the ELSE clause.
If there is no ELSE part and no conditions are true, it returns NULL.

CASE
WHEN condition1 THEN result1
WHEN condition2 THEN result2
WHEN conditionN THEN resultN
ELSE result
END;
DATE
MySQL comes with the following data types for storing a date or a date/time value in the database:
● DATE - format YYYY-MM-DD
● DATETIME - format: YYYY-MM-DD HH:MI:SS
● TIMESTAMP - format: YYYY-MM-DD HH:MI:SS
● YEAR - format YYYY or YY
SQL Server comes with the following data types for storing a date or a date/time value in the
database:
● DATE - format YYYY-MM-DD
● DATETIME - format: YYYY-MM-DD HH:MI:SS
● SMALLDATETIME - format: YYYY-MM-DD HH:MI:SS
● TIMESTAMP - format: a unique number
Note: The date types are chosen for a column when you create a new table in your database!
ISNULL
It is not possible to test for NULL values with comparison operators, such as =, <, or <>.
We will have to use the IS NULL and IS NOT NULL operators instead.
114
IS NULL Syntax
SELECT column_names
FROM table_name
WHERE column_name IS NULL;
IS NOT NULL Syntax

SELECT column_names
FROM table_name
WHERE column_name IS NOT NULL;
COALESCE
Return the first non-null value in a list:

SELECT COALESCE(NULL, NULL, NULL, 'W3Schools.com', NULL, 'Example.com');

● Subqueries:

● A subquery is a SQL query nested inside a larger query.


● A subquery may occur in :
○ - A SELECT clause
○ - A FROM clause
○ - A WHERE clause
● The subquery can be nested inside a SELECT, INSERT, UPDATE, or DELETE statement or inside
another subquery.
● A subquery is usually added within the WHERE Clause of another SQL SELECT statement.
● You can use the comparison operators, such as >, <, or =. The comparison operator can also be a
multiple-row operator, such as IN, ANY, or ALL.
● A subquery is also called an inner query or inner select, while the statement containing a subquery
is also called an outer query or outer select.
● The inner query executes first before its parent query so that the results of an inner query can be
passed to the outer query.

You can use a subquery in a SELECT, INSERT, DELETE, or UPDATE statement to perform the
following task.
● Compare an expression to the result of the query.
● Determine If an expression is included in the results of the query.
● Check whether the query selects any rows.
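
Putting JOINs, aggregate functions and subqueries together, the sketch below again uses Python's built-in sqlite3 module with two made-up tables.

# INNER JOIN, aggregate function with GROUP BY, and a subquery in a WHERE clause
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary REAL, dept_id INTEGER)")
cur.executemany("INSERT INTO departments (id, name) VALUES (?, ?)", [(1, "Sales"), (2, "IT")])
cur.executemany("INSERT INTO employees (name, salary, dept_id) VALUES (?, ?, ?)",
                [("Karim", 40000, 1), ("Salma", 55000, 2), ("Tania", 60000, 2)])

# Number of employees and average salary per department
cur.execute("""SELECT d.name, COUNT(*), AVG(e.salary)
               FROM employees e
               INNER JOIN departments d ON e.dept_id = d.id
               GROUP BY d.name""")
print(cur.fetchall())

# Subquery: employees earning more than the overall average salary
cur.execute("SELECT name FROM employees WHERE salary > (SELECT AVG(salary) FROM employees)")
print(cur.fetchall())
conn.close()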

● Window functions and partitioning are used to complete complex tasks

Window functions apply aggregate and ranking functions over a particular window (set of rows).
The OVER clause is used with window functions to define that window. The OVER clause does two things:
● Partitions rows to form a set of rows (the PARTITION BY clause is used).
● Orders rows within those partitions into a particular order (the ORDER BY clause is used).

Basic Syntax:

115
SELECT column_name1,
window_function(column_name2)
OVER([PARTITION BY column_name1] [ORDER BY column_name3]) AS new_column
FROM table_name;

window_function = any aggregate or ranking function
column_name1 = column to be selected
column_name2 = column on which the window function is applied
column_name3 = column on whose basis the partitioning of rows is done
new_column = name of the new column
table_name = name of the table

Ranking Window Functions :


Ranking functions are, RANK(), DENSE_RANK(), ROW_NUMBER()
● RANK() –
As the name suggests, the rank function assigns a rank to all the rows within every partition.
Rank is assigned such that rank 1 is given to the first row and rows having the same value are
assigned the same rank. For the next rank after two equal rank values, one rank value is
skipped.

● DENSE_RANK() –
It assigns rank to each row within the partition. Just like the rank function, the first row is
assigned rank 1 and rows having the same value have the same rank. The difference between
RANK() and DENSE_RANK() is that in DENSE_RANK(), for the next rank after two of the same
rank, consecutive integers are used, no rank is skipped.

● ROW_NUMBER() –
It assigns consecutive integers to all the rows within the partition. Within a partition, no two
rows can have the same row number.
SQL Lag Function:
The LAG() function allows access to a value stored in a different row above the current
row. The row above may be adjacent or some number of rows above, as sorted by a specified
column or set of columns. LAG() takes three arguments: the name of the column or an expression
from which the value is obtained, the number of rows to skip (offset) above, and the default value
to be returned if the stored value obtained from the row above is empty. Only the first argument is
required. The third argument (default value) is allowed only if you specify the second argument,
the offset.
As with other window functions, LAG() requires the OVER clause. It can take optional parameters,
which we will explain later. With LAG(), you must specify an ORDER BY in the OVER clause, with
a column or a list of columns by which the rows should be sorted.
LAG(expression [,offset[,default_value]]) OVER(ORDER BY columns)

LEAD Function
LEAD() is similar to LAG(). Whereas LAG() accesses a value stored in a row above, LEAD()
accesses a value stored in a row below.
The syntax of LEAD () is just like that of LAG ():

116
LEAD(expression [,offset[,default_value]]) OVER(ORDER BY columns)

Just like LAG(), the LEAD() function takes three arguments: the name of a column or an expression, the
offset to be skipped below, and the default value to be returned if the stored value obtained from the row
below is empty. Only the first argument is required. The third argument, the default value, can be specified
only if you specify the second argument, the offset.
Just like LAG(), LEAD() is a window function and requires an OVER clause. And as with LAG(), LEAD()
must be accompanied by an ORDER BY in the OVER clause.

NTILE
NTILE() function in SQL Server is a window function that distributes rows of an ordered partition into a pre-
defined number of roughly equal groups. It assigns each group a number starting from 1.
The NTILE() function assigns to every row the number of the group to which the row belongs.
NTILE(number_expression) OVER (
[PARTITION BY partition_expression ]
ORDER BY sort_expression [ASC | DESC]
)
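
Window functions can also be tried from Python through the built-in sqlite3 module, provided the underlying SQLite library is version 3.25 or newer (true of most recent Python installations). The sales table below is made up for illustration.

# RANK() and LAG() with PARTITION BY and ORDER BY (requires SQLite 3.25+)
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (region TEXT, month INTEGER, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [("North", 1, 100), ("North", 2, 150), ("South", 1, 90), ("South", 2, 120)])

cur.execute("""SELECT region, month, amount,
                      RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk,
                      LAG(amount) OVER (PARTITION BY region ORDER BY month) AS prev_amount
               FROM sales""")
for row in cur.fetchall():
    print(row)
conn.close()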

117
Individual Activity:
● Discuss database management system and Python.

SELF-CHECK QUIZ 3.1

Check your understanding by answering the following questions:

Write the correct answer for the following questions.

1. What is a Database Management System?

2. What are the most important SQL commands?

3. What are the different types of JOINs?

4. What is an aggregate function?

118
LEARNING OUTCOME 3.2 - Work with python

Contents:

▪ Anaconda/ Miniconda.
▪ Python IDE.
▪ Logical statements.
▪ Syntax, whitespace and style guidelines.
▪ Conditional statements and loops.
▪ Custom functions and lambda expressions.
▪ Modules in Python Standard Libraries and third-party libraries.
▪ Git commands.

Assessment criteria:

1. Appropriate Anaconda/ Miniconda version is set up for using Python for data analysis.
2. Suitable Python IDE is selected for data analysis.
3. Logical statements are created using Python data types and structures, Python operators and
variables.
4. Syntax, whitespace and style guidelines are implemented.
5. Conditional statements and loops are used for multiple iteration and decision making.
6. Custom functions and lambda expressions are defined in code.
7. Modules in Python Standard Libraries and third-party libraries are used.
8. Git commands are demonstrated to use with python scripts/ Jupyter notebooks.

Resources required:

Students/trainees must be provided with the following resources:

▪ Workplace (Computer and Internet connection)

LEARNING ACTIVITY 3.2

Learning Activity Resources/Special Instructions/References


Work with python ▪ Information Sheets: 3.2
▪ Self-Check: 3.2
▪ Answer Key: 3.2

119
INFORMATION SHEET 3.2

Learning Objective: to Work with python

● Anaconda
Anaconda is the data science platform for data scientists, IT professionals and business leaders
of tomorrow. It is a distribution of Python, R, etc. With more than 300 packages for data science,
it becomes one of the best platforms for any project. In this section, we will
discuss how we can use Anaconda for Python programming.

Introduction To Anaconda
Anaconda is an open-source distribution for python and R. It is used for data science, machine
learning, deep learning, etc. With the availability of more than 300 libraries for data science, it
becomes fairly optimal for any programmer to work on anaconda for data science.

Anaconda helps in simplified package management and deployment. Anaconda comes with a
wide variety of tools to easily collect data from various sources using various machine learning
and AI algorithms. It helps in getting an easily manageable environment setup which can deploy
any project with the click of a single button.
Now that we know what anaconda is, let’s try to understand how we can install anaconda and set
up an environment to work on our systems.

Installation And Setup


To install anaconda go to https://www.anaconda.com/distribution/.

Choose a version suitable for you and click on download. Once you complete the download, open
the setup.

120
Follow the instructions in the setup. Don’t forget to click on add anaconda to my path environment
variable. After the installation is complete, you will get a window like shown in the image below.

After finishing the installation, open anaconda prompt and type jupyter notebook.

121
You will see a window like shown in the image below.

Now that we know how to use anaconda for python, let's take a look at how we can install various
libraries in anaconda for any project.
How To Install Python Libraries In Anaconda?
Open anaconda prompt and check if the library is already installed or not.

Since there is no module named numpy present, we will run the following command to install numpy.

122
You will get the window shown in the image once you complete the installation.

Once you have installed a library, just try to import the module again for assurance.

As you can see, there is no error that we got in the beginning, so this is how we can install various
libraries in anaconda.
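
For reference, the command typed at the Anaconda Prompt in the steps above is typically conda install numpy (pip install numpy also works inside the same environment); the exact command may differ slightly depending on your setup. A quick check from Python confirms that the library is available:

# Verify the installation by importing the library and printing its version
import numpy
print(numpy.__version__)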

123
Anaconda Navigator

Anaconda Navigator is a desktop GUI that comes with the anaconda distribution. It allows us to
launch applications and manage conda packages and environments without using command-line
commands.

● IDE for Python:

Spyder IDE
It is always necessary to have interactive environments to create software applications and this fact
becomes very important when you work in the fields of Data Science, engineering, and scientific
research. The Python Spyder IDE has been created for the same purpose. In this article, you will be
learning how to install and make use of Spyder or the Scientific Python and Development IDE.

What is Python Spyder IDE?


Spyder is an open-source cross-platform IDE. The Python Spyder IDE is written completely in
Python. It is designed by scientists and is exclusively for scientists, data analysts, and engineers. It is
also known as the Scientific Python Development IDE and has a huge set of remarkable features
which are discussed below.
124
Features of Spyder
Some of the remarkable features of Spyder are:
● Customizable Syntax Highlighting
● Availability of breakpoints (debugging and conditional breakpoints)
● Interactive execution which allows you to run line, file, cell, etc.
● Run configurations for working directory selections, command-line options, current/
dedicated/ external console, etc
● Can clear variables automatically ( or enter debugging )
● Navigation through cells, functions, blocks, etc can be achieved through the Outline Explorer
● It provides real-time code introspection (The ability to examine what functions, keywords, and
classes are, what they are doing and what information they contain)
● Automatic colon insertion after if, while, etc
● Supports all the IPython magic commands
● Inline display for graphics produced using Matplotlib
● Also provides features such as help, file explorer, find files, etc

Python Spyder IDE Installation ( Installing with Anaconda — Recommended)


The Python Spyder IDE comes as a default implementation along with Anaconda Python
distribution. This is not just the recommended method but also the easiest one. Follow the
steps given below to install the Python Spyder IDE:
● Go to the official Anaconda website using the following link: https://www.anaconda.com
● Click on the Download option on the top right as shown below:
● Choose the version that is suitable for your OS and click on Download.

● Once the installer is downloaded, you can see a dialog box for the Setup. Complete the
Setup and click on Finish as described earlier.
● Then, search for Anaconda Navigator in the search bar of your system and launch Spyder.
Once launched, you will see a screen similar to the one below:

125
Creating a file/ Starting a Project:
● To start a new file: File->New File
● For creating a new project: Projects->New Project
Writing the code:
Writing code in Spyder becomes very easy with its multi-language code editor and a number of powerful
tools. As mentioned earlier, the editor has features such as syntax highlighting, real-time analysis of
code, style analysis, on-demand completion, etc. When you write your code, you will also notice that it
gives a clear call stack for methods suggesting all the arguments that can be used along with that
method.
Take a look at the example below:

In the above example, you can notice that the editor is showing the complete syntax of the print
function. Not just this, in case you have made an error in any line, you will be notified
about it before the line number with a message describing what the issue is. Take a look
at the image below:

126
To run any file, you can select the Run option and click on run. Once executed, the output will be
visible on the Console as shown in the image below:

Variable Explorer:
The Variable Explorer shows all the global objects references such as modules, variables, methods,
etc of the current IPython Console. Not just this, you can also interact with these using
various GUI based editors.

127
Pycharm
PyCharm is a cross-platform editor developed by JetBrains. Pycharm provides all the tools you
need for productive Python development.
Below are the detailed steps for installing Python and PyCharm
How to Install Python IDE
Below is a step by step process on how to download and install Python on Windows:
Step 1) To download and install Python, visit the official website of Python
https://www.python.org/downloads/ and choose your version. We have chosen Python version
3.6.3

Step 2) Once the download is completed, run the .exe file to install Python. Now click on Install
Now.

128
Step 3) You can see Python installing at this point.

Step 4) When it finishes, you can see a screen that says the Setup was successful. Now click on
“Close”.

129
How to Install Pycharm
Here is a step by step process on how to download and install Pycharm IDE on Windows:
Step 1: To download PyCharm visit the website https://www.jetbrains.com/pycharm/download/ and Click
the “DOWNLOAD” link under the Community Section.

Step 2 : Once the download is complete, run the exe for install PyCharm. The setup wizard should
have started. Click “Next”.

Step 3: On the next screen, Change the installation path if required. Click “Next”.

130
Step 4: On the next screen, you can create a desktop shortcut if you want and click on
“Next”.

Step 5: Choose the start menu folder. Keep selecting JetBrains and click on “Install”.

131
Step 6: Wait for the installation to finish.

Step 7: Once installation finished, you should receive a message screen that PyCharm
is installed. If you want to go ahead and run it, click the “Run PyCharm Community
Edition” box first and click “Finish”.

132
Step 8: After you click on “Finish,” the Following screen will appear.

Jupyter Notebook
JupyterLab is the latest web-based interactive development environment for notebooks, code,
and data. Its flexible interface allows users to configure and arrange workflows in data science,
scientific computing, computational journalism, and machine learning. A modular design invites
extensions to expand and enrich functionality.

Kaggle
Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine
learning practitioners. Kaggle allows users to find and publish data sets, explore and build
models in a web-based data-science environment, work with other data scientists and machine
learning engineers, and enter competitions to solve data science challenges.
Kaggle got its start in 2010 by offering machine learning competitions and now also offers a
public data platform, a cloud-based workbench for data science, and Artificial Intelligence
education.

133
Google Colab:
Colaboratory, or “Colab” for short, is a product from Google Research. Colab allows anybody to
write and execute arbitrary python code through the browser, and is especially well suited to
machine learning, data analysis and education. More technically, Colab is a hosted Jupyter
notebook service that requires no setup to use, while providing access free of charge to computing
resources including GPUs.

Atom
Atom is a free and open-source text and source code editor for macOS, Linux, and Microsoft
Windows with support for plug-ins written in JavaScript, and embedded Git Control. Developed by
GitHub, Atom is a desktop application built using web technologies. Most of the extending
packages have free software licenses and are community-built and maintained. Atom is based on
Electron (formerly known as Atom Shell) a framework that enables cross-platform desktop
applications using Chromium and Node.js. Atom was initially written in CoffeeScript and Less, but
much of it has been converted to JavaScript.

VScode
Visual Studio Code is a source-code editor made by Microsoft for Windows, Linux and macOS.
Features include support for debugging, syntax highlighting, intelligent code completion, snippets,
code refactoring, and embedded Git. Users can change the theme, keyboard shortcuts,
preferences, and install extensions that add additional functionality.
In the Stack Overflow 2021 Developer Survey, Visual Studio Code was ranked the most popular
developer environment tool, with 70% of 82,000 respondents reporting that they use it.

● Logical Statements Using Python:

Python Data Types


Data types are the classification or categorization of data items. A data type represents the kind of
value a variable holds and tells what operations can be performed on that data. Since everything
is an object in Python programming, data types are actually classes and variables are
instances (objects) of these classes. Python has the following data types built in by default,
in these categories:

Text Type: str
Numeric Types: int, float, complex
Sequence Types: list, tuple, range
Mapping Type: dict
Set Types: set, frozenset
Boolean Type: bool
Binary Types: bytes, bytearray, memoryview

Integers:
This value is represented by int class. It contains positive or negative whole numbers
(without fraction or decimal). In Python there is no limit to how long an integer value can
be.

x = 20

134
#display x:
print(x)

#display the data type of x:


print(type(x))

20
<class 'int'>

Floats:
This value is represented by the float class. It is a real number with floating point
representation. It is specified by a decimal point. Optionally, the character e or E followed
by a positive or negative integer may be appended to specify scientific notation.

x = 20.5

#display x:
print(x)

#display the data type of x:


print(type(x))

Output

20.5
<class 'float'>

Boolean
Data type with one of the two built-in values, True or False. Boolean objects that are equal
to True are truthy (true), and those equal to False are falsy (false). But non-Boolean
objects can be evaluated in Boolean context as well and determined to be true or false. It
is denoted by the class bool.Note – True and False with capital ‘T’ and ‘F’ are valid
booleans otherwise python will throw an error.

x = True

#display x:
print(x)

#display the data type of x:


print(type(x))

Output

True
<class 'bool'>

135
Strings
In Python, Strings are arrays of bytes representing Unicode characters. A string is a
collection of one or more characters put in a single quote, double-quote or triple quote. In
python there is no character data type, a character is a string of length one. It is
represented by the str class.

x = "Hello World"

#display x:
print(x)

#display the data type of x:


print(type(x))

Output

Hello World
<class 'str'>

Python Data Structure


Lists
Lists are just like arrays declared in other languages: an ordered collection of
data. Lists are very flexible, as the items in a list do not need to be of the same type.

Input

x = list(("apple", "banana", "cherry"))

#display x:
print(x)

#display the data type of x:


print(type(x))

Output

['apple', 'banana', 'cherry']


<class 'list'>

Tuples
Just like a list, a tuple is also an ordered collection of Python objects. The only difference
between a tuple and a list is that tuples are immutable, i.e. tuples cannot be modified after they are
created. It is represented by the tuple class.

x = tuple(("apple", "banana", "cherry"))

#display x:
print(x)

136
#display the data type of x:
print(type(x))

Output

('apple', 'banana', 'cherry')


<class 'tuple'>

SET
In Python, a set is an unordered collection of data that is iterable, mutable and has no
duplicate elements. The order of elements in a set is undefined though it may consist of
various elements.

x = set(("apple", "banana", "cherry"))

#display x:
print(x)

#display the data type of x:


print(type(x))

Output

{'cherry', 'banana', 'apple'}


<class 'set'>

Dictionary
A dictionary in Python is an unordered collection of data values, used to store data
values like a map. Unlike other data types that hold only a single value as an element, a
dictionary holds key:value pairs, which makes it more optimized for lookups.
Within each pair, the key is separated from its value by a colon (:), whereas the
pairs themselves are separated by commas.

x = {"name" : "John", "age" : 36}

#display x:
print(x)

#display the data type of x:


print(type(x))

Output

{'name': 'John', 'age': 36}


<class 'dict'>

137
Python Operators:
Python Operators in general are used to perform operations on values and variables.
These are standard symbols used for the purpose of logical and arithmetic operations. In this
section, we will look into different types of Python operators.

Arithmetic operators are used to perform mathematical operations like addition, subtraction,
multiplication, and division.

Operator Description Syntax

+ Addition: adds two operands x + y
- Subtraction: subtracts two operands x - y
/ Division (float): divides the first operand by the second x / y
* Multiplication: multiplies two operands x * y
// Division (floor): divides the first operand by the second x // y
% Modulus: returns the remainder when the first operand is divided by the second x % y
** Power: returns the first operand raised to the power of the second x ** y

Comparison Operators
Comparison of Relational operators compares the values. It either returns True or False according
to the condition.

Operator Description Syntax

> Greater than: True if the left operand is greater than the right x > y
< Less than: True if the left operand is less than the right x < y
== Equal to: True if both operands are equal x == y
!= Not equal to: True if operands are not equal x != y
>= Greater than or equal to: True if the left operand is greater than or equal to the right x >= y
<= Less than or equal to: True if the left operand is less than or equal to the right x <= y

Logical Operators
Logical operators perform Logical AND, Logical OR, and Logical NOT operations. It is used to
combine conditional statements.

Operator Description Syntax

and Logical AND: True if both the operands are true x and y

or Logical OR: True if either of the operands is true x or y

138
not Logical NOT: True if the operand is false not x

Assignment Operators
Assignment operators are used to assign values to the variables.

Operator Description Syntax

= Assign value of right side of expression to left side operand x = y + z
+= Add right-side operand with left-side operand and then assign to left operand a += b
-= Subtract right operand from left operand and then assign to left operand a -= b
*= Multiply right operand with left operand and then assign to left operand a *= b
/= Divide left operand with right operand and then assign to left operand a /= b
%= Take modulus using left and right operands and assign the result to left operand a %= b
//= Divide left operand with right operand and then assign the value (floor) to left operand a //= b
**= Calculate exponent (raise power) value using operands and assign value to left operand a **= b
&= Perform bitwise AND on operands and assign value to left operand a &= b
|= Perform bitwise OR on operands and assign value to left operand a |= b
^= Perform bitwise XOR on operands and assign value to left operand a ^= b
<<= Perform bitwise left shift on operands and assign value to left operand a <<= b
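
A short, made-up example showing several of the operators from the tables above in action:

# Arithmetic, comparison, logical and assignment operators in action
x, y = 7, 3

print(x + y, x - y, x * y)      # 10 4 21
print(x / y)                    # 2.3333333333333335 (float division)
print(x // y, x % y, x ** y)    # 2 1 343

print(x > y, x == y, x != y)    # True False True

is_positive = x > 0
is_odd = x % 2 == 1
print(is_positive and is_odd)   # True
print(not is_positive)          # False

total = 10
total += x                      # same as total = total + x
print(total)                    # 17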

● Syntax, whitespace and style guidelines:


Syntax
Python syntax can be executed by writing directly in the Command Line. Or by creating a python
file on the server, using the .py file extension, and running it in the Command Line.
Lines and Indentation
Python provides no braces to indicate blocks of code for class and function definitions or flow
control. Blocks of code are denoted by line indentation, which is rigidly enforced. The number
of spaces in the indentation is variable, but all statements within the block must be indented
the same amount. For example, the following block is valid −

if True:
    print("True")
else:
    print("False")

However, the following block generates an error because the indentation is inconsistent −

if True:
print("True")
else:
    print("False")

Whitespace
Except at the beginning of a logical line or in string literals, the whitespace characters
space, tab and form feed can be used interchangeably to separate tokens. Whitespace is needed
between two tokens only if their concatenation could otherwise be interpreted as a different token
(e.g., ab is one token, but a b is two tokens)
Whitespace is used to denote blocks. In other languages curly brackets ({ and }) are common.
When you indent, it becomes a child of the previous line. In addition to the indentation, the parent
also has a colon following it.
im_a_parent:
    im_a_child:
        im_a_grandchild
    im_another_child:

● Conditional statements and Looping:

Conditional Statements
Python has six conditional statements that are used in decision-making:
● If statement.
● If-else statement.
● If…elif…else statement.
● Nested if statement.
● Shorthand if statement.
● Shorthand if-else statement.
If Statement:
The If statement is the most fundamental decision-making statement, in which the code
is executed based on whether it meets the specified condition. It has a code body that
only executes if the condition in the if statement is true. The statement can be a single
line or a block of code.

Example
Input:
num = 5
if num > 0:
    print(num, "is a positive number.")
    print("This statement is true.")

Output
5 is a positive number.
This statement is true.

If Else Statement:
This statement is used when both the true and false parts of a given condition are
specified to be executed. When the condition is true, the statement inside the if block is
executed; if the condition is false, the statement outside the if block is executed.
Example

140
Input
num = 5
if num >= 0:
    print("Positive or Zero")
else:
    print("Negative number")

Output

Positive or Zero

If…elif…else statement:
In this case, the if condition is evaluated first. If it is false, the elif condition will be
evaluated; if that is also false, the else block will be executed.
Example for better understanding:
We will check if the number is positive, negative, or zero.

num = 7
if num > 0:
    print("Positive number")
elif num == 0:
    print("Zero")
else:
    print("Negative number")

output
Positive number

Nested IF Statement:
A Nested IF statement is one in which an If statement is nestled inside another If
statement. This is used when a variable must be processed more than once. If, If-else,
and If…elif…else statements can be used in the program. In Nested If statements, the
indentation (whitespace at the beginning) to determine the scope of each statement
should take precedence.
Example

141
INPUT

num = 8
if num >= 0:
    if num == 0:
        print("zero")
    else:
        print("Positive number")
else:
    print("Negative number")

OUTPUT
Positive number

Shorthand if statement:
Shorthand if statement is used when only one statement needs to be executed inside
the if block. This statement can be mentioned in the same line which holds the If
statement.
The Short Hand if statement in Python has the following syntax:
if condition: statement
Example for better understanding:

INPUT
i = 15
# One line if statement
if i > 11: print("i is greater than 11")

OUTPUT
The output of the program: "i is greater than 11"

ShortHand if-else statement:


It is used to mention If-else statements in one line in which there is only one
statement to execute in both if and else blocks. In simple words, If you have only
one statement to execute, one for if, and one for else, you can put it all on the same
line.

Examples for better understanding:


# single line if-else statement
a = 3
b = 5
print("A") if a > b else print("B")

Output: B

Loop
Python has two primitive loop commands:
● while loops
● for loops
While Loop:
In Python, a while loop is used to execute a block of statements repeatedly as long as a
given condition is true. When the condition becomes false, the line
immediately after the loop in the program is executed. All the statements indented
by the same number of character spaces after a programming construct are
considered to be part of a single block of code. Python uses indentation as its
method of grouping statements.
Example:
# Python program to illustrate
# while loop
count = 0
while (count < 3):
    count = count + 1
    print("Hello World")

Output:
Hello World
Hello World
Hello World
for in Loop:
For loops are used for sequential traversal. For example: traversing a list or string
or array, etc. In Python, there is no C-style for loop, i.e., for (i=0; i<n; i++). There
is a "for in" loop, which is similar to the for-each loop in other languages. Let us learn
how to use the for in loop for sequential traversals.
It can be used to iterate over a range and iterators.
Example:
# Python program to illustrate
# Iterating over range 0 to n-1
n = 4
for i in range(0, n):
    print(i)
Output:
0
1
2
3

● Custom function and lambda expression:

User defined Functions:


What are user-defined functions in Python?
Functions that we define ourselves to do certain specific tasks are referred to as user-defined
functions. The way in which we define and call functions in Python has already been discussed.

143
Functions that readily come with Python are called built-in functions. If we use functions written
by others in the form of a library, it can be termed as library functions.

All the other functions that we write on our own fall under user-defined functions. So, our user-
defined function could be a library function to someone else.

Advantages of user-defined functions


● User-defined functions help to decompose a large program into small segments which
makes the program easy to understand, maintain and debug.
● If repeated code occurs in a program. Function can be used to include those codes and
execute when needed by calling that function.
● Programmers working on large projects can divide the workload by making different
functions.

# Program to illustrate
# the use of user-defined functions

def add_numbers(x, y):
    sum = x + y
    return sum

num1 = 5
num2 = 6

print("The sum is", add_numbers(num1, num2))

Output
The sum is 11

Lambda Function:
Python and other languages like Java, C#, and even C++ have had lambda functions added to
their syntax, whereas languages like LISP or the ML family of languages, Haskell, OCaml, and
F#, use lambdas as a core concept. Python lambdas are little, anonymous functions, subject to a
more restrictive but more concise syntax than regular Python functions.
Example
Here are a few examples to give you an appetite for some Python code, functional style. The
identity function, a function that returns its argument, is expressed with a standard Python
function definition using the keyword def as follows:

>>> def identity(x):
...     return x

identity() takes an argument x and returns it upon invocation.


In contrast, if you use a Python lambda construction, you get the following:

>>> lambda x: x

You can write a slightly more elaborated example, a function that adds 1 to an argument, as
follows:

>>> lambda x: x + 1

Because a lambda function is an expression, it can be named. Therefore you could write the previous code
as follows:
144
>>> add_one = lambda x: x + 1
>>> add_one(2)
3
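
Lambdas are most often used where a small, throwaway function is needed as an argument, for example as the key of sorted() or with map(). The list below is made up for illustration.

# Using lambda expressions as arguments to built-in functions
scores = [("Rina", 72), ("Jamal", 91), ("Asha", 85)]

# Sort the (name, score) tuples by score, highest first
print(sorted(scores, key=lambda pair: pair[1], reverse=True))

# Apply a small transformation to every element with map()
print(list(map(lambda n: n * 2, [1, 2, 3])))    # [2, 4, 6]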

● Modules in python:
TensorFlow:
The first in the list of python libraries for data science is TensorFlow. TensorFlow is a
library for high-performance numerical computations with around 35,000 comments and a vibrant
community of around 1,500 contributors. It’s used across various scientific fields. TensorFlow is
basically a framework for defining and running computations that involve tensors, which are
partially defined computational objects that eventually produce a value.
Features:
● Better computational graph visualisations
● Reduces error by 50 to 60 percent in neural machine learning
● Parallel computing to execute complex models
● Seamless library management backed by Google
● Quicker updates and frequent new releases to provide you with the latest features
TensorFlow is particularly useful for the following applications:
● Speech and image recognition
● Text-based applications
● Time-series analysis
● Video detection
SciPy

SciPy (Scientific Python) is another free and open-source Python library for data science that is
extensively used for high-level computations. SciPy has around 19,000 comments on GitHub and
an active community of about 600 contributors. It’s extensively used for scientific and technical
computations, because it extends NumPy and provides many user-friendly and efficient routines
for scientific calculations.
Features:
● Collection of algorithms and functions built on the NumPy extension of Python
● High-level commands for data manipulation and visualization
● Multidimensional image processing with the SciPy ndimage submodule
● Includes built-in functions for solving differential equations
Applications
● Multidimensional image operations
● Solving differential equations and the Fourier transform
● Optimization algorithms
● Linear algebra
NumPy

NumPy (Numerical Python) is the fundamental package for numerical computation in Python; it
contains a powerful N-dimensional array object. It has around 18,000 comments on GitHub and
an active community of 700 contributors. It’s a general-purpose array-processing package that
provides high-performance multidimensional objects called arrays and tools for working with them.
NumPy also addresses the slowness problem partly by providing these multidimensional arrays
as well as providing functions and operators that operate efficiently on these arrays.
Features:
● Provides fast, precompiled functions for numerical routines
● Array-oriented computing for better efficiency
● Supports an object-oriented approach
● Compact and faster computations with vectorization
Applications:
● Extensively used in data analysis
● Creates powerful N-dimensional array
● Forms the base of other libraries, such as SciPy and scikit-learn
● Replacement of MATLAB when used with SciPy and matplotlib
Pandas

145
Pandas (Python data analysis) is a must in the data science life cycle. It is the most popular and
widely used Python library for data science, along with NumPy and matplotlib. With around 17,000
comments on GitHub and an active community of 1,200 contributors, it is heavily used for data
analysis and cleaning. Pandas provides fast, flexible data structures, such as DataFrames,
which are designed to work with structured data very easily and intuitively.
Features:
● Eloquent syntax and rich functionalities that gives you the freedom to deal with missing
data
● Enables you to create your own function and run it across a series of data
● High-level abstraction
● Contains high-level data structures and manipulation tools
Applications:
● General data wrangling and data cleaning
● ETL (extract, transform, load) jobs for data transformation and data storage, as it has
excellent support for loading CSV files into its data frame format
● Used in a variety of academic and commercial areas, including statistics, finance and
neuroscience
● Time-series-specific functionality, such as date range generation, moving window, linear
regression and date shifting.
Matplotlib

Matplotlib has powerful yet beautiful visualisations. It’s a plotting library for Python with around
26,000 comments on GitHub and a very vibrant community of about 700 contributors. Because of
the graphs and plots that it produces, it’s extensively used for data visualisation. It also provides
an object-oriented API, which can be used to embed those plots into applications.
Features:
● Usable as a MATLAB replacement, with the advantage of being free and open source
● Supports dozens of backends and output types, which means you can use it regardless
of which operating system you’re using or which output format you wish to use
● Can be driven through a MATLAB-like interface (pyplot) or through its object-oriented API,
whichever is cleaner for your workflow
● Low memory consumption and better runtime behaviour
Applications:
● Correlation analysis of variables
● Visualise 95 percent confidence intervals of the models
● Outlier detection using a scatter plot etc.
● Visualise the distribution of data to gain instant insights
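A minimal sketch of the typical plotting workflow with matplotlib's pyplot interface (the data points are arbitrary examples):

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

plt.plot(x, y, marker='o')   # line plot with markers
plt.xlabel('x')
plt.ylabel('x squared')
plt.title('A simple matplotlib plot')
plt.show()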
Keras

Similar to TensorFlow, Keras is another popular library that is used extensively for deep learning
and neural network modules. Keras supports both the TensorFlow and Theano backends, so it is
a good option if you don’t want to dive into the details of TensorFlow.
Features:
● Keras provides several prelabeled datasets that can be imported and loaded directly.
● It contains various implemented layers and parameters that can be used for construction,
configuration, training, and evaluation of neural networks
Applications:
● One of the most significant applications of Keras is its collection of deep learning models that
come with pretrained weights. You can use these models directly to make predictions or to
extract features, without creating or training a new model of your own.
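As a hedged sketch of how a small neural network is typically assembled with Keras (the layer sizes and the binary-classification setup are arbitrary examples, and the TensorFlow backend is assumed to be installed):

from tensorflow import keras

# a small fully connected network for binary classification
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()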

● Git Commands:

Git Config:

The git config command is a convenience function that is used to set Git configuration
values on a global or local project level. These configuration levels correspond to .gitconfig
text files. Executing git config will modify a configuration text file. We'll be covering
common configuration settings like email, username, and editor. We'll discuss Git aliases,

which allow you to create shortcuts for frequently used Git operations. Becoming familiar
with git config and the various Git configuration settings will help you create a powerful,
customised Git workflow.
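For example, the identity settings and aliases mentioned above are usually set with commands like the following (the name, email and alias values are placeholders):

git config --global user.name "Your Name"
git config --global user.email "you@example.com"
git config --global alias.st status     # 'git st' becomes a shortcut for 'git status'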

Git init:

The git init command creates a new Git repository. It can be used to convert an existing,
unversioned project to a Git repository or initialise a new, empty repository. Most other Git
commands are not available outside of an initialised repository, so this is usually the first
command you'll run in a new project.
Executing git init creates a .git subdirectory in the current working directory, which contains
all of the necessary Git metadata for the new repository. This metadata includes
subdirectories for objects, refs, and template files. A HEAD file is also created which points
to the currently checked out commit.
Aside from the .git directory, in the root directory of the project, an existing project remains
unaltered (unlike SVN, Git doesn't require a .git subdirectory in every subdirectory).
By default, git init will initialise the Git configuration to the .git subdirectory path. The
subdirectory path can be modified and customised if you would like it to live elsewhere.
You can set the $GIT_DIR environment variable to a custom path and git init will initialise
the Git configuration files there. Additionally you can pass the --separate-git-dir argument
for the same result. A common use case for a separate .git subdirectory is to keep your
system configuration "dotfiles" (.bashrc, .vimrc, etc.) in the home directory while keeping
the .git folder elsewhere.

Git add:

The git add command adds a change in the working directory to the staging area. It tells
Git that you want to include updates to a particular file in the next commit. However, git
add doesn't really affect the repository in any significant way—changes are not actually
recorded until you run git commit. In conjunction with these commands, you'll also need
git status to view the state of the working directory and the staging area.

Git commit:

The git commit command is one of the core primary functions of Git. Prior use of the git
add command is required to select the changes that will be staged for the next commit.
Then git commit is used to create a snapshot of the staged changes along a timeline of a
Git project's history. Learn more about git add usage on the accompanying page. The git
status command can be used to explore the state of the staging area and pending commit.
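A typical staging-and-committing sequence looks like the following (report.py is just a placeholder file name):

git add report.py          # stage a change in the working directory
git status                 # inspect what is staged and what is not
git commit -m "Describe the change in a short message"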

Git push:

git push is most commonly used to publish and upload local changes to a central
repository. After a local repository has been modified a push is executed to share the
modifications with remote team members.

For example, when your local main has progressed past the central repository's main, you
publish the changes by running git push origin main. Notice how git push is essentially the
same as running git merge main from inside the remote repository.

Git pull:

You can think of git pull as Git's version of svn update. It’s an easy way to synchronise
your local repository with upstream changes. The pulling process happens in two steps: a
fetch followed by a merge.

You start out thinking your repository is synchronised, but then git fetch reveals that
origin's version of main has progressed since you last checked it. Then git merge
immediately integrates the remote main into the local one.

Git checkout:

Git checkout works hand-in-hand with git branch. The git branch command can be used
to create a new branch. When you want to start a new feature, you create a new branch
off main using git branch new_branch. Once created you can then use git checkout
new_branch to switch to that branch. Additionally, the git checkout command accepts a
-b argument that acts as a convenience method, creating the new branch and
immediately switching to it. You can work on multiple features in a single repository by
switching between them with git checkout.
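For example (the branch names below are placeholders):

git branch new_feature            # create the branch
git checkout new_feature          # switch to it
git checkout -b another_feature   # or create and switch in one step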

Git merge:

Merging is Git's way of putting a forked history back together again. The git merge
command lets you take the independent lines of development created by git branch and
integrate them into a single branch.
Note that all of the commands presented below merge into the current branch. The current
branch will be updated to reflect the merge, but the target branch will be completely
unaffected. Again, this means that git merge is often used in conjunction with git checkout
for selecting the current branch and git branch -d for deleting the obsolete target branch.
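A common feature-branch merge therefore looks like the following (branch names are placeholders):

git checkout main             # select the branch to merge into
git merge new_feature         # merge the feature branch into main
git branch -d new_feature     # delete the now-merged branch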

Individual Activity:
● Show Python data types and structures, Python operators, Git commands.

SELF-CHECK QUIZ 3.2

Check your understanding by answering the following questions:

Write the correct answer for the following questions.

1. What is Anaconda, and which distribution is appropriate to install for data science work?


2. Which features are most important for a Python IDE?
3. What are Python data types and structures, python operators?
4. What is syntax, whitespace and style?
5. What are conditional statements and loops?
6. How are lambda expressions defined in code?

LEARNING OUTCOME 3.3 - Use Pandas and NumPy
libraries

Contents:

● Objects in Pandas Series and DataFrames.


● CSV, JSON, XML and XLS files.
● Multidimensional NumPy arrays (ndarrays).
● Slicing, Boolean indexing and set operations.
● Element-wise operations.

Assessment criteria:

1. Objects in Pandas Series and DataFrames are created, accessed and modified.
2. CSV, JSON, XML and XLS files are read using Pandas.
3. Multidimensional NumPy arrays (ndarrays) are created, accessed, modified and sorted.
4. Slicing, Boolean indexing and set operations are performed to select or change subset of ndarray.
5. Element-wise operations are done on ndarrays.

Resources required:

Students/trainees must be provided with the following resources:

▪ Workplace (Computer and Internet connection)

LEARNING ACTIVITY 3.3

Learning Activity Resources/Special Instructions/References


Use Pandas and NumPy libraries ▪ Information Sheets: 3.3
▪ Self-Check: 3.3
▪ Answer Key: 3.3

INFORMATION SHEET 3.3

Learning Objective: to use Pandas and NumPy libraries


● Pandas:

Pandas
Pandas is a fast, powerful, flexible and easy-to-use open-source data analysis and manipulation
tool, built on top of the Python programming language.

Pandas Series Object


A Pandas series is a one-dimensional array of indexed data. It can be created from a list
or array as follows
Example :

import numpy as np
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])


print(data)

Output :
0 0.25
1 0.50
2 0.75
3 1.00
dtype: float64

print(data.values)

Output : ([ 0.25, 0.5 , 0.75, 1. ])

As we see in the output, the series wraps both a sequence of values and a sequence of
indices, which we can access with the values and index attributes. The values are simply
a familiar NumPy array.
The Index is an array-like object of type pd.index which we'll discuss in more detail
momentarily.
print(data.index)

Output : RangeIndex(start=0, stop=4, step=1)

Like with a NumPy array, data can be accessed by the associated index via the familiar
Python square-bracket notation

print(data[1])

output : 0.5

print(data[1:3])

Output :

1 0.50
2 0.75
dtype: float64

Pandas DataFrame Object


In the real world, a Pandas DataFrame is usually created by loading a dataset from existing
storage, such as a SQL database, a CSV file or an Excel file. A Pandas DataFrame can also be
created from lists, from a dictionary, from a list of dictionaries, etc.

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows
and columns. We can store any number of datasets in a DataFrame and perform many operations
on them, such as arithmetic operations, column/row selection and column/row addition.

Creating an empty dataframe :

The most basic DataFrame is an empty DataFrame, which is created just by calling the
DataFrame constructor.

# import pandas as pd
import pandas as pd

# Calling DataFrame constructor


df = pd.DataFrame()

print(df)

Output :

Empty DataFrame
Columns: []
Index: []

Create a DataFrame Using List :


DataFrame can be created using a single list or a list of lists.

# import pandas as pd
import pandas as pd

# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is',
'portal', 'for', 'Geeks']

# Calling DataFrame constructor on list


df = pd.DataFrame(lst)
print(df)

Output :

        0
0   Geeks
1     For
2   Geeks
3      is
4  portal
5     for
6   Geeks
Creating DataFrame from dict of ndarray / lists:
To create a DataFrame from a dict of ndarrays/lists, all the ndarrays must be of the same length.
If an index is passed, then the length of the index should be equal to the length of the arrays. If
no index is passed, then by default the index will be range(n), where n is the array length.

# Python code demonstrate creating


# DataFrame from dict ndarray / lists
# By default addresses.

import pandas as pd

# initialize data of lists.


data = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}

# Create DataFrame
df = pd.DataFrame(data)

# Print the output.


print(df)

Output :

    Name  Age
0    Tom   20
1   nick   21
2  krish   19
3   jack   18
Create Pandas DataFrame from lists using dictionary:
Creating pandas data-frame from lists using a dictionary can be achieved in different
ways. We can create pandas dataframe from lists using a dictionary using
pandas.DataFrame. With this method in Pandas we can transform a dictionary of lists to
a dataframe.

# importing pandas as pd
import pandas as pd

# dictionary of lists
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
'degree': ["MBA", "BCA", "M.Tech", "MBA"],
'score':[90, 40, 80, 98]}

df = pd.DataFrame(dict)

print(df)

Output :

     name  degree  score
0  aparna     MBA     90
1  pankaj     BCA     40
2  sudhir  M.Tech     80
3   Geeku     MBA     98
How to modify values in a Pandas DataFrame?


As part of our data wrangling process, we are often required to modify data previously
acquired from a csv, text, json, API, database or other data source.

Replace existing data in Pandas DataFrames


We’ll look into several cases:

1. Replacing values in an entire DF.
2. Updating values in specific cells by index
3. Changing values in an entire DF row
4. Replace cells content according to condition
5. Set values for an entire column / series.
Creating the data
Let’s define a simple survey DataFrame:

# Import DA packages

import pandas as pd
import numpy as np

# Create test Data

survey_dict = {
'language': ['Python', 'Java', 'Haskell', 'Go', 'C++'],
'salary': [120,85,95,80,90],
'num_candidates': [18,22,34,10, np.nan]
}

# Initialize the survey DataFrame


survey_df = pd.DataFrame(survey_dict)
print(survey_df)

Output :

language salary num_candidates

0 Python 120 18.0

1 Java 85 22.0

2 Haskell 95 34.0

3 Go 80 10.0

4 C++ 90 NaN

Set cell values in the entire DF using replace()


We’ll use the DataFrame replace method to modify DF sales according to their value. In
the example we’ll replace the null value in the last row. Note that we could accomplish
the same result with the more elegant fillna() method.

# Import DA packages

import pandas as pd
import numpy as np

# Create test Data

survey_dict = {
'language': ['Python', 'Java', 'Haskell', 'Go', 'C++'],
'salary': [120,85,95,80,90],
'num_candidates': [18,22,34,10, np.nan]
}

# Initialize the survey DataFrame


survey_df = pd.DataFrame(survey_dict)

survey_df.replace(to_replace=np.nan, value=17, inplace=True)


print(survey_df)

Output :

language salary num_candidates

0 Python 120 18.0

1 Java 85 22.0

2 Haskell 95 34.0

3 Go 80 10.0

4 C++ 90 17.0

Change value of cell content by index


To pick a specific row index to be modified, we’ll use the iloc indexer. Note that we could
also use the loc indexer to update the cell by row/column label.

# Import DA packages

import pandas as pd
import numpy as np

# Create test Data

survey_dict = {
'language': ['Python', 'Java', 'Haskell', 'Go', 'C++'],
'salary': [120,85,95,80,90],
'num_candidates': [18,22,34,10, np.nan]
}

# Initialize the survey DataFrame


survey_df = pd.DataFrame(survey_dict)

# replace() on a row returns a modified copy of that row
row_0 = survey_df.iloc[0].replace(to_replace=120, value=130)

print(row_0)

Output:

language Python
salary 130
num_candidates 18.0
Name: 0, dtype: object

Modify multiple cells in a DataFrame row


Similar to before, but this time we’ll pass a list of values to replace and their respective
replacements:

# Import DA packages

import pandas as pd
import numpy as np

# Create test Data

survey_dict = {
'language': ['Python', 'Java', 'Haskell', 'Go', 'C++'],
'salary': [120,85,95,80,90],
'num_candidates': [18,22,34,10, np.nan]
}

# Initialize the survey DataFrame


survey_df = pd.DataFrame(survey_dict)

row_0 = survey_df.loc[0].replace(to_replace=[120, 18], value=[130, 20])

print(row_0)

Update cells based on conditions
In reality, we’ll update our data based on specific conditions. Here’s an example of how
to update cells based on a condition. Let’s assume that we would like to update the salary
figures in our data so that the minimum salary will be $90/hour.
We’ll first slice the DataFrame and find the relevant rows to update:

cond = survey_df['salary'] < 90

We’ll then pass the rows and columns labels to be updated into the loc indexer:

# Import DA packages

import pandas as pd
import numpy as np

# Create test Data

survey_dict = {
'language': ['Python', 'Java', 'Haskell', 'Go', 'C++'],
'salary': [120,85,95,80,90],
'num_candidates': [18,22,34,10, np.nan]
}

# Initialize the survey DataFrame


survey_df = pd.DataFrame(survey_dict)

cond = survey_df['salary'] < 90
survey_df.loc[cond, 'salary'] = 90
print(survey_df)

Here’s our output:

  language  salary  num_candidates

0   Python     120            18.0

1     Java      90            22.0

2  Haskell      95            34.0

3       Go      90            10.0

4      C++      90             NaN

Replace values for an entire column


Let’s now assume that we would like to modify the num_candidates figure for all the DF
entries. That’s fairly easy:

# Import DA packages

import pandas as pd
import numpy as np

# Create test Data

survey_dict = {
'language': ['Python', 'Java', 'Haskell', 'Go', 'C++'],
'salary': [120,85,95,80,90],
'num_candidates': [18,22,34,10, np.nan]
}

# Initialise the survey DataFrame


survey_df = pd.DataFrame(survey_dict)

survey_df['num_candidates'] = 25
print(survey_df)

● Common File Formats in Pandas– CSV, JSON, XML, and XLS

CSV:
Ah, the good old CSV format. A CSV (or Comma Separated Value) file is the most common type
of file that a data scientist will ever work with. These files use a “,” as a delimiter to separate the
values and each row in a CSV file is a data record.

These are useful to transfer data from one application to another and is probably the reason why
they are so commonplace in the world of data science.
The Pandas library makes it very easy to read CSV files using the read_csv() function:

Create a new csv file :


file_name.csv

ID, Name, Age


1, X, 22
2, Y, 25
3, Z, 21

# import pandas
import pandas as pd

# read csv file into a DataFrame


df = pd.read_csv(r'path/file_name.csv')
# display DataFrame
print(df)

Output :
ID Name Age
0 1 X 22
1 2 Y 25

2 3 Z 21

But CSV can run into problems if the values contain commas. This can be overcome by using
different delimiters to separate information in the file, like ‘\t’ or ‘;’, etc. These can also be
imported with the read_csv() function by specifying the delimiter in the parameter value as
shown below while reading a TSV (Tab Separated Values) file:

import pandas as pd
df = pd.read_csv(r'.path/file_name.txt',delimiter='\t')
print(df)

Output :
ID, Name, Age
0 1, X, 22
1 2, Y, 25
2 3, Z, 21

Working with JSON Files in Python


JSON (JavaScript Object Notation) files are a lightweight, human-readable way to store and
exchange data. They are easy for machines to parse and generate, and they are based on the
JavaScript programming language.
JSON files store data within {} similar to how a dictionary stores it in Python. But their major
benefit is that they are language-independent, meaning they can be used with any programming
language – be it Python, C or even Java!
Python provides a json module to read JSON files. You can read JSON files just like simple text
files. However, the read function, in this case, is replaced by json.load() function that returns a
JSON dictionary.
Once you have done that, you can easily convert it into a Pandas dataframe using the
pandas.DataFrame() function
Create A json File :
{"person1": {"name": "X", "age": "20"}, "person2": {"name": "y","age": "21"}, "person3":
{"name": "z", "age":"30"}}

Python code :
import json
import pandas as pd

# open json file


with open('path/file_name.json','r') as file:
data = json.load(file)

# json dictionary
print(type(data))

# loading into a DataFrame


df_json = pd.DataFrame(data)
print(df_json)

Output :
<class 'dict'>
person1 person2 person3

name X y z
age 20 21 30

But you can even load the JSON file directly into a dataframe using the pandas.read_json()
function as shown below:

Python code :
# reading directly into a DataFrame using pd.read_json()
path = 'path/file_name.json'
df = pd.read_json(path)
print(df)

Output :
     person1 person2 person3
name       X       y       z
age       20      21      30

Reading XML file with Pandas :

XML (Extensible Markup Language) is a markup language used to store structured data. The
Pandas data analysis library provides functions to read/write data for most of the file types.
Let's have a look at a few ways to read XML data and put it in a Pandas DataFrame
Save the following XML in a file called file_name.xml :

<?xml version='1.0' encoding='utf-8'?>


<data xmlns="http://example.com">
<row>
<Id>035</Id>
<Name>Mr. X</Name>
<Age>22</Age>
</row>
<row>
<Id>036</Id>
<Name>Mr. Y</Name>
<Age>23</Age>
</row>
<row>
<Id>037</Id>
<Name>Mr. Z</Name>
<Age>25</Age>
</row>
</data>
The lxml library is a Python binding for the C libraries libxml2 and libxslt. It also extends the
native ElementTree module. As this is a third-party module, you'll need to install it with pip like
this

pip install lxml

import pandas as pd

df = pd.read_xml(r'file_name.xml')

# display DataFrame
print(df)

Output :

Id Name Age
0 35 Mr. X 22
1 36 Mr. Y 23
2 37 Mr. Z 25

Reading XLS file with Pandas :

An XLS file is a spreadsheet file created by Microsoft Excel or exported by another spreadsheet
program, such as OpenOffice Calc or Apple Numbers. It contains one or more worksheets,
which store and display data in a table format. XLS files may also store mathematical functions,
charts, styles, and formatting
As this is a third-party module, you'll need to install it with pip like this

pip install openpyxl

XLS file

Name     Age
Mr. X    22
Mr. Y    23
Mr. Z    24

Python Code :
# import pandas
import pandas as pd
# read XLS file into a DataFrame
df = pd.read_excel(r'file_name.xlsx')
#display DataFrame
print(df)

Output :

Name Age
0 Mr.X 22
1 Mr.Y 23
2 Mr.Z 24

● Multidimensional NumPy arrays (ndarrays):


NdArary :

ndarray is the abbreviation of n-dimensional array, in other words a multidimensional array.
An ndarray is an array object representing a multidimensional, homogeneous array of fixed-size
items.

Let’s take a look at the specific meaning of each attribute through example codes.
>>> import numpy as np
>>> a = np.array([1, 2, 3])

We need to import NumPy library and create a new 1-D array. You could check its data type and
the data type of its element.
>>> type(a)
numpy.ndarray
>>> a.dtype
dtype('int64')

Let’s create a new 2-D array and then check its attributes.
>>> b = np.array([[4, 5, 6], [7, 8, 9]])
>>> b
array([[4, 5, 6],
[7, 8, 9]])
>>> b.T # get the transpose of b
array([[4, 7],
[5, 8],
[6, 9]])
>>> b # b keeps unmodified
array([[4, 5, 6],
[7, 8, 9]])
>>> a.size # a has 3 elements
3
>>> b.size # b has 6 elements
6
>>> a.itemsize # The size of element in a. The data type here is int64 - 8 bytes
8
>>> b.nbytes # check how many bytes in b. It is 48, where 6x8 = 48
48
>>> b.shape # The shape of b
(2, 3)
>>> b.ndim # The number of dimensions of b
2

Ndarray Attributes

Let’s list the attributes of ndarray.

Attributes    Description

T             Transpose. When the array is 1-D, the original array is returned.

data          A Python buffer object that points to the starting position of the data in the array.

dtype         The data type of the elements contained in the ndarray.

flags         Information about how the ndarray data is stored in memory (memory layout).

flat          An iterator that converts the ndarray to a one-dimensional array.

imag          The imaginary part of the ndarray data.

real          The real part of the ndarray data.

size          The number of elements contained in the ndarray.

itemsize      The size of each element in bytes.

nbytes        The total memory (in bytes) occupied by the ndarray.

ndim          The number of dimensions contained in the ndarray.

shape         The shape of the ndarray (the result is a tuple).

strides       The number of bytes required to move to the next adjacent element in each dimension, represented as a tuple.

ctypes        An object used for interfacing with the ctypes module.

base          The object on which the ndarray is based (whose memory is being referenced).

NumPy Sorting Arrays(ndarrays) :

import numpy as np
arr = np.array([[3, 2, 4], [5, 0, 1]])
print(np.sort(arr))

Output :

[[2 3 4]
[0 1 5]]

● Slicing, Boolean indexing and set operations are performed to select or change a subset of an ndarray

NumPy Array Slicing :

Slicing arrays
Slicing in python means taking elements from one given index to another given index.
We pass a slice instead of an index like this: [start:end].
We can also define the step, like this: [start:end:step].
If we don't pass start, it's considered 0.
If we don't pass end, it's considered the length of the array in that dimension.
If we don't pass step, it's considered 1.

Example 1 :
Slice elements from index 1 to index 5 from the following array:
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

print(arr[1:5])
Output :

[2 3 4 5]

Example 2 :
Slice elements from index 4 to the end of the array:
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

print(arr[4:])

Output:
[5 6 7]

Example 3 :
Slice elements from the beginning to index 4 (not included):
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

print(arr[:4])

Output:

[1 2 3 4]

Example 4 :
Slice from the index 3 from the end to index 1 from the end:
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

print(arr[-3:-1])
Output:

[5 6]

STEP
Use the step value to determine the step of the slicing:
Example
Return every other element from index 1 to index 5:

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

print(arr[1:5:2])

Output :

[2 4]

Slicing 2-D Arrays

Example
From the second element, slice elements from index 1 to index 4 (not included):

import numpy as np

arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

print(arr[1, 1:4])

Output :
[7 8 9]

NumPy Boolean Indexing :

You can index specific values from a NumPy array using another NumPy array of Boolean
values on one axis to specify the indices you want to access. For example, to access the

second and third values of array a = np.array([4, 6, 8]), you can use the expression
a[np.array([False, True, True])] using the Boolean array as an indexing mask.

import numpy as np
# 1D Boolean Indexing
a = np.array([4, 6, 8])
b = np.array([False, True, True])
print(a[b])
'''
[6 8]
'''

Output :

[6 8]

# 2D Boolean Indexing
a = np.array([[1, 2, 3],
[4, 5, 6]])
b = np.array([[True, False, False],
[False, False, True]])
print(a[b])
'''
[1 6]
'''

Output :

[1 6]
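In practice, the Boolean mask is usually produced by a condition on the array itself rather than written out by hand. A minimal sketch:

import numpy as np

a = np.array([4, 6, 8])
mask = a > 5          # Boolean array: [False  True  True]
print(a[mask])        # [6 8]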

NumPy Set Operations


What is a Set
A set in mathematics is a collection of unique elements.
Sets are used for operations involving frequent intersection, union and difference
operations.
Create Sets in NumPy
We can use NumPy's unique() method to find unique elements from any array.
E.g. create a set array, but remember that the set arrays should only be 1-D
arrays.
Example
Convert following array with repeated elements to a set:

import numpy as np

arr = np.array([1, 1, 1, 2, 3, 4, 5, 5, 6, 7])


x = np.unique(arr)
print(x)

Output :

[1 2 3 4 5 6 7]

Finding Union
To find the unique values of two arrays, use the union1d() method.

Example
Find union of the following two set arrays:

import numpy as np
arr1 = np.array([1, 2, 3, 4])
arr2 = np.array([3, 4, 5, 6])
newarr = np.union1d(arr1, arr2)
print(newarr)
Output :

[1 2 3 4 5 6]

Finding Intersection
To find only the values that are present in both arrays, use the intersect1d()
method.
Example
Find intersection of the following two set arrays:
import numpy as np
arr1 = np.array([1, 2, 3, 4])
arr2 = np.array([3, 4, 5, 6])
newarr = np.intersect1d(arr1, arr2, assume_unique=True)
print(newarr)
Output :

[3 4]
Note: the intersect1d() method takes an optional argument assume_unique, which
if set to True can speed up computation. It should always be set to True when
dealing with sets.
Finding Difference
To find only the values in the first set that is NOT present in the seconds set, use
the setdiff1d() method.
Example
Find the difference of the set1 from set2:

import numpy as np
set1 = np.array([1, 2, 3, 4])
set2 = np.array([3, 4, 5, 6])
newarr = np.setdiff1d(set1, set2, assume_unique=True)
print(newarr)
Output :

[1 2]

Note: the setdiff1d() method takes an optional argument assume_unique, which if
set to True can speed up computation. It should always be set to True when
dealing with sets.
Finding Symmetric Difference
To find only the values that are NOT present in BOTH sets, use the setxor1d()
method.
Example
Find the symmetric difference of the set1 and set2:

import numpy as np
set1 = np.array([1, 2, 3, 4])
set2 = np.array([3, 4, 5, 6])
newarr = np.setxor1d(set1, set2, assume_unique=True)
print(newarr)
Output :

[1 2 5 6]

Element-wise operations on ndarrays


Shown in 3.4

Individual Activity:
▪ Objects in Pandas Series and DataFrames are created, accessed and modified.
▪ Show CSV, JSON, XML and XLS.

SELF-CHECK QUIZ 3.3

Check your understanding by answering the following questions:

Write the correct answer for the following questions.

1. How are Pandas Series and DataFrames created?

2. How are CSV, JSON, XML and XLS files read using Pandas?

LEARNING OUTCOME 3.4 - Use python to implement
descriptive and inferential statistics

Contents:

▪ Python Scipy library.


▪ Mean, median, mode, standard deviation, percentiles, skewness and kurtosis.
▪ Correlations.
▪ Continuous variables.

Assessment criteria:

1. Usage of Python Scipy library is demonstrated.


2. Mean, median, mode, standard deviation, percentiles, skewness and kurtosis are calculated using
python.
3. Python code is used to test hypothesis.
4. Correlations are measured.
5. Continuous variable is predicted using regression and regression assumptions are validated.

Resources required:

Students/trainees must be provided with the following resources:

▪ Workplace ( Computer and Internet connection)

LEARNING ACTIVITY 3.4

Learning Activity Resources/Special Instructions/References


Use python to implement descriptive and
▪ Information Sheets: 3.4
inferential statistics
▪ Self-Check: 3.4
▪ Answer Key: 3.4

INFORMATION SHEET 3.4

Learning Objective: to Use python to implement descriptive and inferential statistics.

● Python Scipy library:

SciPy is a scientific computation library that uses NumPy underneath. SciPy stands for Scientific
Python. It provides more utility functions for optimization, stats and signal processing. Like
NumPy, SciPy is open source, so we can use it freely. SciPy was created by NumPy's creator
Travis Oliphant.

Basic Functionality
By default, all the NumPy functions have been available through the SciPy namespace.
There is no need to import the NumPy functions explicitly when SciPy is imported. The main object of NumPy is the homogeneous multidimensional array. It is a table of
main object of NumPy is the homogeneous multidimensional array. It is a table of
elements (usually numbers), all of the same type, indexed by a tuple of positive integers.
In NumPy, dimensions are called axes. The number of axes is called the rank.
Now, let us revise the basic functionality of Vectors and Matrices in NumPy. As SciPy is
built on top of NumPy arrays, understanding of NumPy basics is necessary. As most
parts of linear algebra deal with matrices only.
NumPy Vector
A Vector can be created in multiple ways. Some of them are described below.
Converting Python array-like objects to NumPy
Let us consider the following example.
import numpy as np
lst = [1, 2, 3, 4]
arr = np.array(lst)
print(arr)
The output of the above program will be as follows.
[1 2 3 4]

Intrinsic NumPy Array Creation


NumPy has built-in functions for creating arrays from scratch. Some of these functions
are explained below.
Using zeros()
The zeros(shape) function will create an array filled with 0 values with the specified
shape. The default dtype is float64. Let us consider the following Example:
import numpy as np
print(np.zeros((2, 3)))
The output of the above program will be as follows.
array([[ 0., 0., 0.],
[ 0., 0., 0.]])

Using ones()
The ones(shape) function will create an array filled with 1 values. It is identical to zeros in
all the other respects. Let us consider the following example.
import numpy as np
print(np.ones((2, 3)))
The output of the above program will be as follows.
array([[ 1., 1., 1.],
[ 1., 1., 1.]])

Using arange()
The arange() function will create arrays with regularly incrementing values. Let us
consider the following example.
import numpy as np
print(np.arange(7))
The above program will generate the following output.

array([0, 1, 2, 3, 4, 5, 6])

Defining the data type of the values


Let us consider the following example.
import numpy as np
arr = np.arange(2, 10, dtype=float)
print(arr)
print("Array Data Type :", arr.dtype)
The above program will generate the following output.
[ 2. 3. 4. 5. 6. 7. 8. 9.]
Array Data Type : float64

Using linspace()
The linspace() function will create arrays with a specified number of elements, which will
be spaced equally between the specified beginning and end values. Let us consider the
following example.
import numpy as np
print(np.linspace(1., 4., 6))
The above program will generate the following output.
array([ 1. , 1.6, 2.2, 2.8, 3.4, 4. ])

Matrix
A matrix is a specialised 2-D array that retains its 2-D nature through operations. It has
certain special operators, such as * (matrix multiplication) and ** (matrix power). Let us
consider the following example.
import numpy as np
print(np.matrix('1 2; 3 4'))
The above program will generate the following output.
matrix([[1, 2],
[3, 4]])

Conjugate Transpose of Matrix


This feature returns the (complex) conjugate transpose of self. Let us consider the
following example.
import numpy as np
mat = np.matrix('1 2; 3 4')
print(mat.H)
The above program will generate the following output.
matrix([[1, 3],
[2, 4]])

Transpose of Matrix
This feature returns the transpose of self. Let us consider the following example.
import numpy as np
mat = np.matrix('1 2; 3 4')
print(mat.T)
The above program will generate the following output.
matrix([[1, 3],
[2, 4]])

When we transpose a matrix, we make a new matrix whose rows are the columns of the
original. A conjugate transposition, on the other hand, interchanges the row and the
column index for each matrix element. The inverse of a matrix is a matrix that, if
multiplied with the original matrix, results in an identity matrix.

Cluster
K-means clustering is a method for finding clusters and cluster centers in a set of
unlabelled data. Intuitively, we might think of a cluster as comprising a group of
data points, whose inter-point distances are small compared with the distances to

points outside of the cluster. Given an initial set of K centers, the K-means algorithm
iterates the following two steps −
● For each center, the subset of training points (its cluster) that is closer to it than to any
other center is identified.
● The mean of each feature for the data points in each cluster is computed, and this
mean vector becomes the new center for that cluster.
These two steps are iterated until the centers no longer move or the assignments no
longer change. Then, a new point x can be assigned to the cluster of the closest
prototype. The SciPy library provides a good implementation of the K-Means algorithm
through the cluster package. Let us understand how to use it.
K-Means Implementation in SciPy
We will understand how to implement K-Means in SciPy.
Import K-Means
We will see the implementation and usage of each imported function.
from scipy.cluster.vq import kmeans, vq, whiten

Data generation
We have to simulate some data to explore the clustering.
from numpy import vstack,array
from numpy.random import rand

# data generation with three features


data = vstack((rand(100,3) + array([.5,.5,.5]),rand(100,3)))
Now, we have to check for data. The above program will generate the following output.
array([[ 1.48598868e+00, 8.17445796e-01, 1.00834051e+00],
[ 8.45299768e-01, 1.35450732e+00, 8.66323621e-01],
[ 1.27725864e+00, 1.00622682e+00, 8.43735610e-01],
…………….

Normalise a group of observations on a per feature basis. Before running K-Means, it is beneficial to
rescale each feature dimension of the observation set with whitening. Each feature is divided by its
standard deviation across all observations to give it unit variance.
Whiten the data
We have to use the following code to whiten the data.
# whitening of data
data = whiten(data)

Compute K-Means with Three Clusters


Let us now compute K-Means with three clusters using the following code.
# computing K-Means with K = 3 (3 clusters)
centroids,_ = kmeans(data,3)

The above code performs K-Means on a set of observation vectors forming K clusters. The K-Means
algorithm adjusts the centroids until sufficient progress cannot be made, i.e. the change in distortion,
since the last iteration is less than some threshold. Here, we can observe the centroid of the cluster by
printing the centroids variable using the code given below.
print(centroids)
The above code will generate the following output.
[[ 2.26034702 1.43924335 1.3697022 ]
 [ 2.63788572 2.81446462 2.85163854]
 [ 0.73507256 1.30801855 1.44477558]]

Assign each value to a cluster by using the code given below.


# assign each sample to a cluster
clx,_ = vq(data,centroids)
The vq function compares each observation vector in the ‘M’ by ‘N’ obs array with the centroids and
assigns the observation to the closest cluster. It returns the cluster of each observation and the
distortion. We can check the distortion as well. Let us check the cluster of each observation using the
following code.
# check clusters of observation
print(clx)
The above code will generate the following output.
array([1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 2, 0, 2, 0, 1, 1, 1,
0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0,
0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 2, 2, 2, 2, 0, 0,
2, 2, 2, 1, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)

The distinct values 0, 1, 2 of the above array indicate the clusters.

Constants
SciPy constants package provides a wide range of constants, which are used in the general scientific
area.
SciPy Constants Package
The scipy.constants package provides various constants. We have to import the required constant and
use them as per the requirement. Let us see how these constant variables are imported and used.
To start with, let us compare the ‘pi’ value by considering the following example.
#Compare the pi constant from both packages
import scipy.constants
import math

print("sciPy - pi = %.16f" % scipy.constants.pi)
print("math - pi = %.16f" % math.pi)
The above program will generate the following output.
sciPy - pi = 3.1415926535897931
math - pi = 3.1415926535897931

List of Constants Available


The following tables describe in brief the various constants.
Mathematical Constants
Sr. No. Constant Description

1 pi pi

2 golden Golden Ratio

Physical Constants
The following table lists the most commonly used physical constants.
Sr. No.   Constant                   Description

1         c                          Speed of light in vacuum
2         speed_of_light             Speed of light in vacuum
3         h                          Planck constant
4         Planck                     Planck constant h
5         G                          Newton's gravitational constant
6         e                          Elementary charge
7         R                          Molar gas constant
8         Avogadro                   Avogadro constant
9         k                          Boltzmann constant
10        electron_mass (OR) m_e     Electron mass
11        proton_mass (OR) m_p       Proton mass
12        neutron_mass (OR) m_n      Neutron mass

Units
The following table has the list of SI units.
Sr. No. Unit Value

1 milli 0.001

2 micro 1e-06

3 kilo 1000

These units range from yotta, zetta, exa, peta, tera … kilo, hecto, … nano, pico, … to zepto.
Other Important Constants
The following table lists other important constants used in SciPy.
Sr. No. Unit Value

1 gram 0.001 kg

2 atomic mass Atomic mass constant

3 degree Degree in radians

4 minute One minute in seconds

5 day One day in seconds

6 inch One inch in meters

7 micron One micron in meters

8 light_year One light-year in meters

9 atm Standard atmosphere in pascals

10 acre One acre in square meters

11 liter One liter in cubic meters

12 gallon One gallon in cubic meters

13 kmh Kilometers per hour in meters per seconds

14 degree_Fahrenheit One Fahrenheit in kelvins

15 eV One electron volt in joules

16 hp One horsepower in watts

17 dyn One dyne in newtons

18 lambda2nu Convert wavelength to optical frequency

Remembering all of these is a bit tough. The easy way to find which key corresponds to which constant is the
scipy.constants.find() method. Let us consider the following example.
import scipy.constants
res = scipy.constants.find("alpha particle mass")
print(res)
The above program will generate the following output.
[
'alpha particle mass',
'alpha particle mass energy equivalent',
'alpha particle mass energy equivalent in MeV',
'alpha particle mass in u',
'electron to alpha particle mass ratio'
]

This method returns the list of keys, else nothing if the keyword does not match.

Linear Algebra

SciPy is built using the optimized ATLAS LAPACK and BLAS libraries. It has very fast linear algebra
capabilities. All of these linear algebra routines expect an object that can be converted into a two-
dimensional array. The output of these routines is also a two-dimensional array.
SciPy.linalg vs NumPy.linalg
A scipy.linalg contains all the functions that are in numpy.linalg. Additionally, scipy.linalg also has some
other advanced functions that are not in numpy.linalg. Another advantage of using scipy.linalg over

numpy.linalg is that it is always compiled with BLAS/LAPACK support, while for NumPy this is optional.
Therefore, the SciPy version might be faster depending on how NumPy was installed.
Linear Equations
The scipy.linalg.solve feature solves the linear equation a * x + b * y = Z, for the unknown x, y values.
As an example, assume that it is desired to solve the following simultaneous equations.
x + 3y + 5z = 10
2x + 5y + z = 8
2x + 3y + 8z = 3
To solve the above equation for the x, y, z values, we can find the solution vector using a matrix inverse
as shown below.
$$\begin{bmatrix} x\\ y\\ z \end{bmatrix} = \begin{bmatrix} 1 & 3 & 5\\ 2 & 5 & 1\\ 2 & 3 & 8
\end{bmatrix}^{-1} \begin{bmatrix} 10\\ 8\\ 3 \end{bmatrix} = \frac{1}{25} \begin{bmatrix} -232\\ 129\\ 19
\end{bmatrix} = \begin{bmatrix} -9.28\\ 5.16\\ 0.76 \end{bmatrix}.$$
However, it is better to use the linalg.solve command, which can be faster and more numerically stable.
The solve function takes two inputs ‘a’ and ‘b’ in which ‘a’ represents the coefficients and ‘b’ represents
the respective right hand side value and returns the solution array.
Let us consider the following example.
#importing the scipy and numpy packages
from scipy import linalg
import numpy as np

#Declaring the numpy arrays


a = np.array([[3, 2, 0], [1, -1, 0], [0, 5, 1]])
b = np.array([2, 4, -1])

#Passing the values to the solve function


x = linalg.solve(a, b)

#printing the result array


print(x)
The above program will generate the following output.
array([ 2., -2., 9.])

Finding a Determinant
The determinant of a square matrix A is often denoted as |A| and is a quantity often used in linear
algebra. In SciPy, this is computed using the det() function. It takes a matrix as input and returns a scalar
value.
Let us consider the following example.
#importing the scipy and numpy packages
from scipy import linalg
import numpy as np

#Declaring the numpy array


A = np.array([[1,2],[3,4]])

#Passing the values to the det function


x = linalg.det(A)

#printing the result


print(x)
The above program will generate the following output.
-2.0

Eigenvalues and Eigenvectors


The eigenvalue-eigenvector problem is one of the most commonly employed linear algebra
operations. We can find the Eigen values (λ) and the corresponding Eigen vectors (v) of a square
matrix (A) by considering the following relation −
Av = λv
scipy.linalg.eig computes the eigenvalues from an ordinary or generalized eigenvalue problem. This
function returns the Eigen values and the Eigen vectors.

Let us consider the following example.
#importing the scipy and numpy packages
from scipy import linalg
import numpy as np

#Declaring the numpy array


A = np.array([[1,2],[3,4]])

#Passing the values to the eig function


l, v = linalg.eig(A)

#printing the result for eigen values


print(l)

#printing the result for eigen vectors


print(v)
The above program will generate the following output.
array([-0.37228132+0.j, 5.37228132+0.j]) #--Eigen Values
array([[-0.82456484, -0.41597356], #--Eigen Vectors
[ 0.56576746, -0.90937671]])

Singular Value Decomposition


A Singular Value Decomposition (SVD) can be thought of as an extension of the eigenvalue problem to
matrices that are not square.
The scipy.linalg.svd factorizes the matrix ‘a’ into two unitary matrices ‘U’ and ‘Vh’ and a 1-D array ‘s’ of
singular values (real, non-negative) such that a == U*S*Vh, where ‘S’ is a suitably shaped matrix of
zeros with the main diagonal ‘s’.
Let us consider the following example.
#importing the scipy and numpy packages
from scipy import linalg
import numpy as np

#Declaring the numpy array


a = np.random.randn(3, 2) + 1.j*np.random.randn(3, 2)

#Passing the values to the eig function


U, s, Vh = linalg.svd(a)

# printing the result


print(U, s, Vh)
The above program will generate the following output.
(
array([
[ 0.54828424-0.23329795j, -0.38465728+0.01566714j,
-0.18764355+0.67936712j],
[-0.27123194-0.5327436j , -0.57080163-0.00266155j,
-0.39868941-0.39729416j],
[ 0.34443818+0.4110186j , -0.47972716+0.54390586j,
0.25028608-0.35186815j]
]),

array([ 3.25745379, 1.16150607]),

array([
[-0.35312444+0.j , 0.32400401+0.87768134j],
[-0.93557636+0.j , -0.12229224-0.33127251j]
])
)

● Use python to implement descriptive and inferential statistics

Mean, Median, and Mode:

What can we learn from looking at a group of numbers?


In Machine Learning (and in mathematics) there are often three values that interest us:
● Mean - The average value
● Median - The midpoint value
● Mode - The most common value
Example: We have registered the speed of 13 cars:
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
What is the average, the middle, or the most common speed value?

Mean

The mean value is the average value.


To calculate the mean, find the sum of all values, and divide the sum by the number of values:
(99+86+87+88+111+86+103+87+94+78+77+85+86) / 13 = 89.77
Example
Use the NumPy mean() method to find the average speed:

import numpy

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

x = numpy.mean(speed)

print(x)

Output :
89.76923076923077

Median

The median value is the value in the middle, after you have sorted all the values:
77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103, 111
It is important that the numbers are sorted before you can find the median.
Example
Use the NumPy median() method to find the middle value:

import numpy

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

x = numpy.median(speed)

print(x)

Output :
87.0

If there are two numbers in the middle, divide the sum of those numbers by two.

77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103

(86 + 87) / 2 = 86.5

Example
Using the NumPy module:
import numpy

speed = [99,86,87,88,86,103,87,94,78,77,85,86]

x = numpy.median(speed)

print(x)

Output :
86.5

Mode

The Mode value is the value that appears the most number of times:
99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86 = 86

Example
Use the SciPy mode() method to find the number that appears the most:

from scipy import stats

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

x = stats.mode(speed)

print(x)
The mode() method returns a ModeResult object that contains the mode number (86), and count
(how many times the mode number appeared (3)).
ModeResult(mode=array([86]), count=array([3]))

Standard Deviation

● Standard deviation is a number that describes how spread out the values are.
● A low standard deviation means that most of the numbers are close to the mean
(average) value.
● A high standard deviation means that the values are spread out over a wider range.
Example
Use the NumPy std() method to find the standard deviation:
import numpy

speed = [32,111,138,28,59,77,97]

x = numpy.std(speed)

print(x)
Output :
37.84501153334721

Percentiles

Percentiles are used in statistics to give you a number that describes the value that a given
percent of the values are lower than.
Example: Let's say we have an array of the ages of all the people that lives in a street.
ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
What is the 75th percentile? The answer is 43, meaning that 75% of the people are 43 or younger.
The NumPy module has a method for finding the specified percentile
Example
Use the NumPy percentile() method to find the percentiles:

import numpy

ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]

x = numpy.percentile(ages, 75)

print(x)

Output :
43.0

Example
What is the age that 90% of the people are younger than?

import numpy

ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]

x = numpy.percentile(ages, 90)

print(x)

Output :
61.0

Skewness & Kurtosis

In statistics, skewness and kurtosis are two ways to measure the shape of a distribution.
Skewness is a measure of the asymmetry of a distribution. This value can be positive or
negative.
● A negative skew indicates that the tail is on the left side of the distribution, which extends
towards more negative values.
● A positive skew indicates that the tail is on the right side of the distribution, which extends
towards more positive values.
● A value of zero indicates that there is no skewness in the distribution at all, meaning the
distribution is perfectly symmetrical.

Kurtosis is a measure of whether or not a distribution is heavy-tailed or light-tailed relative to a


normal distribution.
● The kurtosis of a normal distribution is 3.
● If a given distribution has a kurtosis less than 3, it is said to be platykurtic, which means it
tends to produce fewer and less extreme outliers than the normal distribution.
● If a given distribution has a kurtosis greater than 3, it is said to be leptokurtic, which
means it tends to produce more outliers than the normal distribution.

Example: Skewness & Kurtosis in Python

Suppose we have the following dataset:
data = [88, 85, 82, 97, 67, 77, 74, 86, 81, 95, 77, 88, 85, 76, 81]
To calculate the sample skewness and sample kurtosis of this dataset, we can use the skew()
and kurtosis() functions from the scipy.stats library with the following syntax:

● skew(array of values, bias=False)

● kurtosis(array of values, bias=False)

We use the argument bias=False to calculate the sample skewness and kurtosis as opposed to
the population skewness and kurtosis.

Here is how to use these functions for our particular dataset:


from scipy import stats

data = [88, 85, 82, 97, 67, 77, 74, 86, 81, 95, 77, 88, 85, 76, 81]

#calculate sample skewness


result_skew = stats.skew(data, bias=False)

print(result_skew)

#calculate sample kurtosis

result_kurtosis = stats.kurtosis(data, bias=False)


print(result_kurtosis)

Output :

0.11815715154945083
0.0326966578855933

● Python code is used to test Hypothesis:

Hypothesis testing
Hypothesis testing is a statistical method that is used in making statistical decisions using
experimental data. Hypothesis Testing is basically an assumption that we make about the
population parameter.

Why do we use it ?
Hypothesis testing is an essential procedure in statistics. A hypothesis test evaluates two mutually
exclusive statements about a population to determine which statement is best supported by the
sample data. When we say that a finding is statistically significant, it’s thanks to a hypothesis test.

What are the basics of hypothesis testing?

Fig: Normal Curve images with different mean and variance

The basics of hypothesis testing are normalisation and standard normalisation. All our hypothesis
tests revolve around these two concepts. Let's look at them.

Fig: Standardised Normal curve image and separation on data in percentage in each section.

At first glance the two figures may look similar, but there is an important difference. In the first
figure there are several normal curves, each of which can have a different mean and variance. In
the second figure the distribution has been standardised, so the mean is always 0 and the
variance is always 1. The concept of the z-score comes into the picture when we use
standardised normal data.

Normal Distribution -

A variable is said to be normally distributed or have a normal distribution if its distribution has the
shape of a normal curve — a special bell-shaped curve. … The graph of a normal distribution is
called the normal curve, which has all of the following properties: 1. The mean, median, and mode
are equal.
$$x_{new} = \frac{x - x_{min}}{x_{max} - x_{min}}$$

Standardised Normal Distribution

A standard normal distribution is a normal distribution with mean 0 and standard deviation 1.
$$x_{new} = \frac{x - \mu}{\sigma}$$
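A minimal sketch of both transformations in Python (the sample values are arbitrary):

import numpy as np

x = np.array([10., 20., 30., 40., 50.])

# min-max normalisation: rescale to the range [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# standardisation (z-scores): mean 0, standard deviation 1
x_std = (x - x.mean()) / x.std()

print(x_norm)
print(x_std)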

What are the important parameters of hypothesis testing?

Null hypothesis :- In inferential statistics, the null hypothesis is a general statement or default
position that there is no relationship between two measured phenomena, or no association among
groups.
In other words, it is a basic assumption made based on domain or problem knowledge.
Example : a company's production is = 50 units per day, etc.

Alternative hypothesis :

The alternative hypothesis is the hypothesis used in hypothesis testing that is contrary to the null
hypothesis. It is usually taken to be that the observations are the result of a real effect (with some
amount of chance variation superposed)
Example : a company's production is != 50 units per day, etc.

Fig: Null and Alternate hypothesis.

Level of significance: Refers to the degree of significance with which we accept or reject the null hypothesis.
Since 100% accuracy is not possible when accepting or rejecting a hypothesis, we select a level of
significance, usually 5%.
It is normally denoted by alpha (α) and is generally 0.05 or 5%, which means your
output should be 95% confident to give a similar kind of result in each sample.
Type I error: When we reject the null hypothesis, although that hypothesis was true. Type I error is denoted
by alpha. In hypothesis testing, the normal curve that shows the critical region is called the alpha region
Type II errors: When we accept the null hypothesis but it is false. Type II errors are denoted by beta. In
Hypothesis testing, the normal curve that shows the acceptance region is called the beta region.

One-tailed test :- A test of a statistical hypothesis, where the region of rejection is on only one side of the
sampling distribution, is called a one-tailed test.
Example :- testing whether a college has ≥ 4000 students, or whether ≤ 80% of organisations have adopted data science.

Two-tailed test :- A two-tailed test is a statistical test in which the critical area of a distribution is two-sided
and tests whether a sample is greater than or less than a certain range of values. If the sample being
tested falls into either of the critical areas, the alternative hypothesis is accepted instead of the null
hypothesis.
Example : testing whether a college has != 4000 students, or whether != 80% of organisations have adopted data science.

Fig: One-tailed and two-tailed test regions.
P-value :- The P value, or calculated probability, is the probability of finding the observed, or more extreme,
results when the null hypothesis (H0) of a study question is true — the definition of ‘extreme’ depends on
how the hypothesis is being tested.
If your P value is less than the chosen significance level then you reject the null hypothesis i.e. accept that
your sample gives reasonable evidence to support the alternative hypothesis. It does NOT imply a
“meaningful” or “important” difference; that is for you to decide when considering the real-world relevance
of your result.
Example : You have a coin and you don’t know whether it is fair or tricky, so let’s define the null and alternative
hypotheses.
H0 : the coin is a fair coin.
H1 : the coin is a tricky coin, with alpha = 5% or 0.05.
Now let’s toss the coin and calculate the p-value (probability value).
Toss the coin a 1st time and the result is tails; the p-value is 50% (as heads and tails have equal probability).
Toss the coin a 2nd time and the result is tails again; now the p-value is 50/2 = 25%.
Similarly, after 6 consecutive tails the p-value is about 1.5%. We set our significance level at 5% (95% confidence),
and the p-value has now fallen below that level, i.e. our null hypothesis does not hold, so we reject it and
conclude that the coin is a tricky coin, which it actually is.
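The p-value in this coin example can be checked with a few lines of Python (a small sketch; 0.5 is the probability of tails under the null hypothesis of a fair coin):

# probability of observing 6 tails in a row if the coin is fair
p_value = 0.5 ** 6
print(p_value)          # 0.015625

alpha = 0.05            # chosen significance level
if p_value < alpha:
    print("reject the null hypothesis - the coin looks tricky")
else:
    print("fail to reject the null hypothesis - the coin looks fair")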

Degree of freedom :- Suppose you have a data set with 10 values. If you're not estimating anything, each
value can take on any number; each value is completely free to vary. But suppose you want to test the
population mean with a sample of 10 values, using a 1-sample t test. You now have a constraint — the
estimation of the mean. What is that constraint, exactly? By definition of the mean, the following relationship
must hold: the sum of all values in the data must equal n x mean, where n is the number of values in the data set.
So if a data set has 10 values, the sum of the 10 values must equal the mean x 10. If the mean of the 10
values is 3.5 (you could pick any number), this constraint requires that the sum of the 10 values must equal
10 x 3.5 = 35.
With that constraint, the first value in the data set is free to vary. Whatever value it is, it's still possible for
the sum of all 10 numbers to be 35. The second value is also free to vary, because whatever
value you choose, it still allows for the possibility that the sum of all the values is 35.

Now Let’s see some of widely used hypothesis testing type :-


1. T Test ( Student T test)
2. Z Test

3. ANOVA Test
4. Chi-Square Test
T-Test: A t-test is a type of inferential statistic which is used to determine if there is a significant difference
between the means of two groups which may be related in certain features. It is mostly used when the
data sets, like a set of data recorded as the outcome of flipping a coin 100 times, would follow a normal
distribution and may have unknown variances. The t-test is used as a hypothesis testing tool, which allows
testing of an assumption applicable to a population.

The t-test has 2 types:

1. one-sample t-test
2. two-sample t-test

One sample t-test : The One Sample t Test determines whether the sample mean is statistically different
from a known or hypothesised population mean. The One Sample t Test is a parametric test.
Example: you have 10 ages and you are checking whether the average age is 30 or not.

from scipy.stats import ttest_1samp
import numpy as np

ages = np.genfromtxt("ages.csv")
print(ages)
ages_mean = np.mean(ages)
print(ages_mean)
tset, pval = ttest_1samp(ages, 30)
print("p-values", pval)
if pval < 0.05:    # alpha value is 0.05 or 5%
    print("we are rejecting the null hypothesis")
else:
    print("we are accepting the null hypothesis")

Two sampled T-test :-The Independent Samples t Test or 2-sample t-test compares the means of two
independent groups in order to determine whether there is statistical evidence that the associated
population means are significantly different. The Independent Samples t Test is a parametric test. This
test is also known as: Independent t Test.
Example: is there any significant difference between the week1 and week2 data?

from scipy.stats import ttest_ind


import numpy as np

week1 = np.genfromtxt("week1.csv", delimiter=",")


week2 = np.genfromtxt("week2.csv", delimiter=",")

print(week1)
print("week2 data :-\n")
print(week2)
week1_mean = np.mean(week1)
week2_mean = np.mean(week2)

print("week1 mean value:",week1_mean)


print("week2 mean value:",week2_mean)

week1_std = np.std(week1)
week2_std = np.std(week2)

print("week1 std value:",week1_std)


print("week2 std value:",week2_std)

ttest, pval = ttest_ind(week1, week2)
print("p-value", pval)

if pval < 0.05:
    print("we reject the null hypothesis")
else:
    print("we accept the null hypothesis")

When can we run a Z Test?


● Your sample size is greater than 30. Otherwise, use a t test.
● Data points should be independent from each other. In other words, one data point isn’t
related or doesn’t affect another data point.
● Your data should be normally distributed. However, for large sample sizes (over 30) this
doesn’t always matter.
● Your data should be randomly selected from a population, where each item has an equal
chance of being selected.
● Sample sizes should be equal if at all possible.

Example: again we use a z-test, this time for blood pressure data with a hypothesised mean of 156 (the
Python code is below).

one-sample Z test.
import pandas as pd
from scipy import stats
from statsmodels.stats import weightstats as stests

# df is assumed to be a DataFrame already loaded with the blood pressure data,
# e.g. df = pd.read_csv("blood_pressure.csv") containing 'bp_before' and 'bp_after' columns
ztest, pval = stests.ztest(df['bp_before'], x2=None, value=156)
print(float(pval))

if pval < 0.05:
    print("reject null hypothesis")
else:
    print("accept null hypothesis")

Two-sample Z test
In a two-sample z-test, similar to the t-test, we check two independent data groups and
decide whether the sample means of the two groups are equal or not.
H0: the difference between the two group means is 0
H1: the difference between the two group means is not 0
Example: we compare the blood pressure data before and after treatment (code in Python below).

ztest, pval1 = stests.ztest(df['bp_before'], x2=df['bp_after'], value=0, alternative='two-sided')
print(float(pval1))

if pval1 < 0.05:
    print("reject null hypothesis")
else:
    print("accept null hypothesis")

ANOVA (F-TEST)
The t-test works well when dealing with two groups, but sometimes we want to compare more than
two groups at the same time. For example, if we wanted to test whether voter age differs based
on some categorical variable like race, we would have to compare the means of each level or group of the
variable. We could carry out a separate t-test for each pair of groups, but when you conduct many

tests you increase the chances of false positives. The analysis of variance or ANOVA is a statistical
inference test that lets you compare multiple groups at the same time.
F = Between group variability / Within group variability

Fig: F-Test or Anova concept image


Unlike the z and t-distributions, the F-distribution does not have any negative values because between and
within-group variability are always positive due to squaring each deviation.
One Way F-test(Anova)
It tells whether two or more groups are similar or not based on their mean similarity and f-score.
Example: there are 3 different categories of plants with their weights, and we need to check whether all 3
groups are similar or not (code in Python below).

df_anova = pd.read_csv('PlantGrowth.csv')
df_anova = df_anova[['weight','group']]

grps = pd.unique(df_anova.group.values)
d_data = {grp:df_anova['weight'][df_anova.group == grp] for grp in grps}

F, p = stats.f_oneway(d_data['ctrl'], d_data['trt1'], d_data['trt2'])

print("p-value for significance is: ", p)

if p < 0.05:
    print("reject null hypothesis")
else:
    print("accept null hypothesis")

Two Way F-test :

Two way F-test is an extension of 1-way f-test, it is used when we have 2 independent variables
and 2+ groups. 2-way F-test does not tell which variable is dominant. If we need to check individual
significance then Post-hoc testing needs to be performed.
Now let's take a look at the grand mean crop yield (the mean crop yield not split by any sub-group), as
well as the mean crop yield by each factor, and by the factors grouped together.

import statsmodels.api as sm
from statsmodels.formula.api import ols

df_anova2 = pd.read_csv("https://raw.githubusercontent.com/Opensourcefordatascience/Data-sets/master/crop_yield.csv")

model = ols('Yield ~ C(Fert)*C(Water)', df_anova2).fit()


print(f"Overall model F({model.df_model: .0f},{model.df_resid: .0f}) = {model.fvalue: .3f}, p =
{model.f_pvalue: .4f}")

res = sm.stats.anova_lm(model, typ= 2)


res

Chi-Square Test
The test is applied when you have two categorical variables from a single population. It is used to
determine whether there is a significant association between the two variables.
For example, in an election survey, voters might be classified by gender (male or female) and voting
preference (Democrat, Republican, or Independent). We could use a chi-square test for
independence to determine whether gender is related to voting preference.

df_chi = pd.read_csv('chi-test.csv')
contingency_table=pd.crosstab(df_chi["Gender"],df_chi["Shopping?"])
print('contingency_table :-\n',contingency_table)

#Observed Values
Observed_Values = contingency_table.values
print("Observed Values :-\n",Observed_Values)
b=stats.chi2_contingency(contingency_table)
Expected_Values = b[3]
print("Expected Values :-\n",Expected_Values)

no_of_rows=len(contingency_table.iloc[0:2,0])
no_of_columns=len(contingency_table.iloc[0,0:2])
ddof=(no_of_rows-1)*(no_of_columns-1)
print("Degree of Freedom:-",ddof)
alpha = 0.05

from scipy.stats import chi2


chi_square=sum([(o-e)**2./e for o,e in zip(Observed_Values,Expected_Values)])
chi_square_statistic=chi_square[0]+chi_square[1]
print("chi-square statistic:-",chi_square_statistic)

critical_value=chi2.ppf(q=1-alpha,df=ddof)
print('critical_value:',critical_value)

#p-value
p_value=1-chi2.cdf(x=chi_square_statistic,df=ddof)

print('p-value:',p_value)
print('Significance level: ',alpha)
print('Degree of Freedom: ',ddof)
print('chi-square statistic:',chi_square_statistic)
print('critical_value:',critical_value)

print('p-value:',p_value)

if chi_square_statistic >= critical_value:
    print("Reject H0, there is a relationship between the 2 categorical variables")
else:
    print("Retain H0, there is no relationship between the 2 categorical variables")

if p_value <= alpha:
    print("Reject H0, there is a relationship between the 2 categorical variables")
else:
    print("Retain H0, there is no relationship between the 2 categorical variables")

● Correlations are measured:


Correlation
Variables within a dataset can be related for lots of reasons.
For example:
● One variable could cause or depend on the values of another variable.
● One variable could be lightly associated with another variable.
● Two variables could depend on a third unknown variable.
Pearson’s Correlation
The Pearson correlation coefficient (named for Karl Pearson) can be used to summarise the
strength of the linear relationship between two data samples.The Pearson’s correlation coefficient
is calculated as the covariance of the two variables divided by the product of the standard deviation
of each data sample. It is the normalisation of the covariance between the two variables to give
an interpretable score. The pearsonr() SciPy function can be used to calculate the Pearson’s
correlation coefficient between two data samples with the same length.

from numpy.random import randn


from numpy.random import seed
from scipy.stats import pearsonr

# seed random number generator


seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)

# calculate Pearson's correlation


corr, _ = pearsonr(data1, data2)
print("Pearson's correlation: %.3f" % corr)

Output :
Pearson's correlation: 0.888

Spearman's Correlation
Two variables may be related by a nonlinear relationship, such that the relationship is stronger or
weaker across the distribution of the variables. Further, the two variables being considered may have a
non-Gaussian distribution. In such cases Spearman's rank correlation can be used: it summarises the
strength of the monotonic relationship by computing the correlation on the ranked values, and is available
through the spearmanr() SciPy function.

# calculate the spearman's correlation between two variables


from numpy.random import randn
from numpy.random import seed

from scipy.stats import spearmanr

# seed random number generator


seed(1)

# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)

# calculate spearman's correlation


corr, _ = spearmanr(data1, data2)
print("Spearman's correlation: %.3f" % corr)
Output :
Spearman's correlation: 0.872

● Continuous variable is predicted using regression and regression assumptions are validated:

Continuous variable is predicted using regression:


How are common factors affecting the price of houses?
We saw the common locations, and now we are going to see a few common factors affecting the
prices of the houses, and if they do, then by how much.
Let us start with: is the price affected by the living area of the house or not?

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import mpl_toolkits

data = pd.read_csv("kc_house_data.csv")
data['bedrooms'].value_counts().plot(kind='bar')
plt.show()

plt.scatter(data.price, data.sqft_living)
plt.title('Price vs Square Feet')
plt.show()

plt.scatter(data.price, data.long)
plt.title("Price vs location of area")
plt.show()

Fig: Price vs Square feet and Price vs Longitude

The plot we used above is called a scatter plot. A scatter plot helps us see how our data points
are scattered and is usually used for two variables. From the first figure we can see that the larger
the living area, the higher the price. Though the data is concentrated towards a particular price zone,
the data points seem to lie in a linear direction. Thanks to the scatter
plot we can also see some irregularities: the house with the highest square footage was sold for
very little. Maybe there is another factor, or the data may simply be wrong. The second figure
tells us about the location of the houses in terms of longitude, and it gives us quite an interesting
observation: the range -122.2 to -122.4 sells houses at much higher prices.

Fig: Similarly we compare other factors

We can see more factors affecting the price

Fig: Total sqft including basement vs price and waterfront vs price

Fig: Floors vs Price and Condition vs Price

Which location by zip code is pricey ?


As we can see from all the above representations, many factors affect the price of a house, like square
footage, which increases the price, and even location, which influences the price.
Now that we are familiar with all these representations and can tell our own story, let us move on and
create a model which will predict the price of the house based upon the other factors, such as
square feet, waterfront, etc. We are going to see what linear regression is and how we do it.

Linear Regression:
In easy words, linear regression is a model in statistics which helps us predict the future based upon the
past relationship of variables. So when you see that your scatter plot has data points placed linearly,
you know regression can help you!

Regression works on the line equation, y = mx + c; a trend line is fitted through the data points to predict
the outcome.

Fig: Fitting line on the basis of scatter

The variable we are predicting is called the criterion variable and is referred to as Y. The variable
we are basing our predictions on is called the predictor variable and is referred to as X. When
there is only one predictor variable, the prediction method is called simple regression, and if
multiple predictor variables are present, it is called multiple regression.
Let's look at the code; the steps are described below, and a sketch of the same workflow follows the list.

Fig: Linear regression on the data to predict prices

We use train data and test data: train data to train our machine, and test data to see whether it has learnt
the data well or not. Before anything, remember that the machine is the student, the training data is the
syllabus and the test data is the exam. We see how much the machine has scored and, if it scores well,
the model is successful.
So what did we do? Let's go step by step.
1. We import our dependencies; for linear regression we use sklearn (a Python machine learning
library) and import LinearRegression from it.
2. We then initialise LinearRegression to a variable reg.
3. Now we know that prices are to be predicted, hence we set the labels (output) as the price
column, and we also convert dates to 1's and 0's so that they don't influence our data
much. We use 0 for houses which are new, that is, built after 2014.
4. We import another dependency to split our data into train and test sets.
5. We make 90% of the data our train data and 10% our test data, and randomise the
splitting of the data by using random_state.
6. So now we have train data, test data and labels for both, letting us fit our train and test
data into a linear regression model.
7. After fitting our data to the model we can check the score of our data, i.e. the prediction;
in this case the score is 73%.
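The original material showed this code only as a screenshot, which is not reproduced here. The following is
a minimal sketch of the workflow in steps 1-7, assuming the kc_house_data.csv file loaded earlier and its
usual column names (id, date, price); it is an illustration, not the exact original code.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = pd.read_csv("kc_house_data.csv")

# Step 3: the label (output) is the price column; the date string is reduced to a simple 1/0 flag
# so that it does not unduly influence the data (0 for entries dated in 2015, 1 otherwise)
labels = data['price']
data['date'] = [0 if str(d).startswith('2015') else 1 for d in data['date']]
features = data.drop(['id', 'price'], axis=1)    # 'id' carries no predictive information

# Steps 2, 4 and 5: initialise the model and split the data 90% train / 10% test
reg = LinearRegression()
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.10, random_state=2)

# Steps 6 and 7: fit on the training data and score on the test data
reg.fit(x_train, y_train)
print("R^2 score on the test data:", reg.score(x_test, y_test))    # the text above reports about 73%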

Individual Activity:
▪ Demonstrate usage of objects in the Python SciPy library.
▪ Show how Python code is used to test hypotheses.

SELF-CHECK QUIZ 3.4

Check your understanding by answering the following questions:

Write the correct answer for the following questions.

1. What is the Python Scipy library?

2. How is a continuous variable predicted using regression, and how are the regression assumptions validated?

LEARNER JOB SHEET 3

Qualification: 2 Years Experience in IT Sector

Learning unit: Demonstrate programming skills for data science.

Learner name:

Personal protective
equipment (PPE):

Materials: Computer and Internet connection

Tools and equipment:

Performance criteria: 1. A Suitable Database Management System (DBMS) is set up


according to the organizational requirements.
2. SQL commands and Logical Operators are written to access
the database.
3. Different types of JOINs are written to combine data from
multiple sources.
4. Aggregate functions are used to extract basic information
about data and transform data according to analysis
requirements.
5. Subqueries are written as required.
6. Window functions and partitioning are used to complete
complex tasks.
7. Appropriate Anaconda/ Miniconda version is set up for using
Python for data analysis.
8. Suitable Python IDE is selected for data analysis.
9. Logical statements are created using Python data types and
structures, Python operators and variables.
10. Syntax, whitespace and style guidelines are implemented.
11. Conditional statements and loops are used for multiple
iteration and decision making.
12. Custom functions and lambda expressions are defined in
code.
13. Modules in Python Standard Libraries and third-party libraries
are used.
14. Git commands are demonstrated to use with python scripts/
Jupyter notebooks.
15. Objects in Pandas Series and DataFrames are created,
accessed and modified.
16. CSV, JSON, XML and XLS files are read using Pandas.
17. Multidimensional NumPy arrays (ndarrays) are created,
accessed, modified and sorted.
18. Slicing, Boolean indexing and set operations are performed to
select or change subsets of ndarray.
19. Element-wise operations are done on ndarrays.
20. Usage of Python Scipy library is demonstrated.
21. Mean, median, mode, standard deviation, percentiles,
skewness and kurtosis are calculated using python.
22. Python code is used to test hypotheses.
23. Correlations are measured.
24. Continuous variables are predicted using regression and
regression assumptions are validated.

Measurement:

Notes:

Procedure: 3. Connect computers with internet connection.


4. Connect router with internet.

Learner signature: Date:

Assessor signature: Date:

Quality Assurer
Date:
signature:

Assessor remarks:

Feedback:

ANSWER KEYS

ANSWER KEY 3.1

1. A database management system (or DBMS) is essentially nothing more than a computerized data-
keeping system. Users of the system are given facilities to perform several kinds of operations on such a
system for either manipulation of the data in the database or the management of the database structure
itself

2. Rules for SQL commands


A. SQL commands can be written on multiple line.
B. Clauses are generally placed on separate lines for readability, though this is not
necessary.
C. Tabulation (indentation) can be used.
D. Command words cannot be split across lines.
E. SQL commands are not case sensitive.
F. SQL commands are enrolled with SQL prompt.
3. Different Types of SQL JOINs
(INNER) JOIN: Returns records that have matching values in both tables. LEFT (OUTER) JOIN : Returns
all records from the left table, and the matched records from the right table. RIGHT (OUTER) JOIN :
Returns all records from the right table, and the matched records from the left table.
4. To aggregate means to combine or summarise values. In SQL, aggregate functions combine values
from many rows into a single result; for example, adding the individual amounts of candy bars sold to
find the total.

ANSWER KEY 3.2

1. Anaconda is an open-source distribution for python and R. It is used for data science, machine
learning, deep learning, etc. With the availability of more than 300 libraries for data science, it becomes
fairly optimal for any programmer to work on anaconda for data science.

2. PyCharm. One of the best (and only) full-featured, dedicated IDEs for Python is PyCharm.
Available in both paid (Professional) and free open-source (Community) editions, PyCharm installs
quickly and easily on Windows, Mac OS X, and Linux platforms. Out of the box, PyCharm supports
Python development directly.
3. The basic Python data structures in Python include list, set, tuples, and dictionary. Each of the
data structures is unique in its own way. Data structures are “containers” that organize and group
data according to type. The data structures differ based on mutability and order.
4. Whitespace in Python refers to spaces, tabs and newlines. Python uses indentation (leading whitespace)
to define code blocks, so consistent whitespace and style guidelines (such as PEP 8) are important for
readable, working code.
5. Loops in Python are used for multiple iteration. A block of loop-body statements is executed repeatedly
until the loop condition becomes false or the sequence is exhausted. Loops in Python are mainly of 2
types: for loops and while loops.
6. A lambda expression in Python is a small anonymous function defined with the lambda keyword. It can
take any number of arguments but contains only a single expression, and it is often used where a short
function is needed temporarily, for example as an argument to map(), filter() or sorted().

7. The git pull command is used to get updates from the remote repo. This command is a combination
of git fetch and git merge, which means that when we use git pull, it gets the updates from the remote
repository (git fetch) and immediately applies the latest changes in your local repository (git merge).

ANSWER KEY 3.3

1. Series is a type of list in pandas which can take integer values, string values, double values and more.
But in Pandas Series we return an object in the form of list, having index starting from 0 to n, Where n is
the length of values in series.
The Pandas Series data structure is a one-dimensional labelled array. It is the primary building block for a
DataFrame, making up its rows and columns.
2. Pandas: How to Read and Write Files
A. Installing Pandas.
B. Preparing Data.
C. Using the Pandas read_csv() and .to_csv() Functions. Write a CSV File. ...
D. Using Pandas to Write and Read Excel Files. Write an Excel File. ...
E. Understanding the Pandas IO API. Write Files. ...
F. Working With Different File Types. CSV Files. ...
G. Working With Big Data. ...
H. Conclusion.

ANSWER KEY 3.4

1. SciPy is a scientific computation library that uses NumPy underneath. SciPy stands for Scientific Python.
It provides more utility functions for optimization, stats and signal processing. Like NumPy, SciPy is open
source so we can use it freely. SciPy was created by NumPy's creator Travis Olliphant.
2. Assumptions in Regression
● There should be a linear and additive relationship between dependent (response)
variable and independent (predictor) variable(s). ...
● There should be no correlation between the residual (error) terms. ...
● The independent variables should not be correlated. ...
● The error terms must have constant variance.

Module 4: Prepare And Visualise Data

MODULE CONTENT: This module covers identifying and collecting data; manipulating, transforming and
cleaning data; conducting exploratory data analysis (EDA); and visualising and reporting data.

Module Descriptor: This unit covers the knowledge, skills, and attitudes required to
prepare and visualize data. It specifically includes identifying, collecting, manipulating,
transforming and cleaning data; conducting exploratory data analysis (EDA) and visualizing
and reporting data.

Nominal Duration: 40 hours

LEARNING OUTCOMES:

Upon completion of the module, the trainee should be able to:

4.1. Identify and collect data


4.2. Manipulate, transform and clean data
4.3. Conduct Exploratory Data Analysis (EDA)
4.4. Visualize and report data

PERFORMANCE CRITERIA:

1. Data sources are identified that are relevant to the business problem.
2. Data is loaded from multiple data sources.
3. Complex data is loaded using appropriate data acquisition techniques.
4. Data is stored in the required format.
5. Cases which require data transformations are identified.
6. Impacts of imbalanced data are explained.
7. Feature selection techniques are applied.
8. Data transformation is performed.
9. Slicing, indexing, sub-setting, merging and joining datasets are performed.
10. Facts, Dimensions and schemas are identified and explained.
11. Techniques are applied to handle missing values.
12. Outliers are identified, visualized and dealt with.
13. Fully usable dataset is constructed by cleaning and transforming data.
14. Steps of EDA are described.
15. Appropriate features/ variables are identified for the analysis
16. Dataset is parsed by cleaning, treating missing values & outliers and transforming
as required.
17. Bi-variate and multivariate analysis are performed to find associations between
variables.
18. Variables and relationships are visualized using various visualization methods
suitable for various data types.
19. Python visualization libraries (matplotlib, plotly) or Power BI or Tableau are used to
plot charts and graphs.
20. Plots are analyzed to identify important patterns.
21. Reports are generated using Power BI or Tableau.

Learning Outcome 4.1- Identify and collect data:

Contents:

● Data and Data sources are identified that are relevant to the business problem.
● Complex data.
● Data storing.

Assessment criteria:

1. Data sources are identified that are relevant to the business problem.
2. Data is loaded from multiple data sources.
3. Complex data is loaded using appropriate data acquisition techniques.
4. Data is stored in the required format.

Resources required:

Students/trainees must be provided with the following resources:


▪ Workplace (Computer and internet connection).

LEARNING ACTIVITY 4.1

Learning Activity Resources/Special Instructions/References


Identify and collect data ▪ Information Sheet: 4.1
▪ Self-Check: 4.1
▪ Answer Key: 4.1

INFORMATION SHEET 4.1

Learning Objective: Identify data sources relevant to the business problem, load data from multiple
sources and store it in the required format.

● Identify and collect data:

Database
A database is an organized collection of data, stored so that it can be easily accessed and managed. You
can organize data into tables, rows and columns, and index it to make it easier to find relevant
information. Database handlers create a database in such a way that only one set of software
programs provides access to the data for all the users. The main purpose of the database is to operate on
a large amount of information by storing, retrieving, and managing data. There are many dynamic
websites on the World Wide Web nowadays which are handled through databases. For example,
a model that checks the availability of rooms in a hotel is an example of a dynamic website that
uses a database. There are many databases available, like MySQL, Sybase, Oracle, MongoDB,
Informix, PostgreSQL, SQL Server, etc.
Modern databases are managed by a database management system (DBMS). SQL or Structured
Query Language is used to operate on the data stored in a database. SQL depends on relational
algebra and tuple relational calculus. A cylindrical structure is used to display the image of a
database.
Flat File
A flat file database is also known as a text database. It is the simplest type of database, used to store data
in a plain text file (or a basic spreadsheet such as MS Excel). Flat file databases were developed by IBM in
the early 1970s. In a flat file database, each line of the plain text file holds only one record. These records
are separated using delimiters, such as tabs and commas. The advantage of a flat-file database is that it is
easy to understand and helps us to sort the results easily.

Web Services
A Web Service can be defined in the following ways:
● It is a client-server application or application component for communication.
● The method of communication between two devices over the network.
● It is a software system for interoperable machine to machine communication.
● It is a collection of standards or protocols for exchanging information between two devices
or application
Cloud Data
Over 400 million SaaS data sets remained siloed globally, isolated in cloud data storage and on-
premise data centers. The Data Cloud eliminates these silos, allowing you to seamlessly unify,
analyze, share, and even monetize your data.The Data Cloud allows organizations to unify and
connect to a single copy of all of their data with ease. The result is an ecosystem of thousands of
businesses and organizations connecting to not only their own data, but also connecting to each
other by effortlessly sharing and consuming shared data and data services. The Data Cloud makes
vast and growing quantities of valuable data connected, accessible, and available.

Multidimensional
A multidimensional model views data in the form of a data-cube. A data cube enables data to be
modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
The dimensions are the perspectives or entities concerning which an organization keeps records.
For example, a shop may create a sales data warehouse to keep records of the store's sales for
the dimensions time, item, and location. These dimensions allow the shop to keep track of things,
for example, monthly sales of items and the locations at which the items were sold. Each
dimension has a table related to it, called a dimensional table, which describes the dimension
further. For example, a dimensional table for an item may contain the attributes item_name, brand,
and type.

Large databases from distributed frameworks like Hadoop
This is an open-source batch processing framework that can be used for the distributed storage
and processing of big data sets. Hadoop relies on computer clusters and modules that have been
designed with the assumption that hardware will inevitably fail, and those failures should be
automatically handled by the framework.

There are four main modules within Hadoop. Hadoop Common is where the libraries and utilities
needed by other Hadoop modules reside. The Hadoop Distributed File System (HDFS) is the
distributed file system that stores the data. Hadoop YARN (Yet Another Resource Negotiator) is
the resource management platform that manages the computing resources in clusters, and
handles the scheduling of users’ applications. The Hadoop MapReduce involves the
implementation of the MapReduce programming model for large-scale data processing. Hadoop
operates by splitting files into large blocks of data and then distributing those datasets across the
nodes in a cluster. It then transfers code into the nodes, for processing data in parallel. The idea
of data locality, meaning that tasks are performed on the node that stores the data, allows the
datasets to be processed more efficiently and more quickly. Hadoop can be used within a
traditional onsite data center, as well as through the cloud.

Meta Data
Metadata is data about data. In other words, it's information that's used to describe the data that's
contained in something like a web page, document, or file. Another way to think of metadata is as
a short explanation or summary of what the data is.
A simple example of metadata for a document might include a collection of information like the
author, file size, the date the document was created, and keywords to describe the document.
Metadata for a music file might include the artist's name, the album, and the year it was
released. For computer files, metadata can be stored within the file itself or elsewhere, as is the
case with some EPUB book files that keep metadata in an associated ANNOT file. Metadata
represents behind-the-scenes information that's used everywhere, by every industry, in multiple
ways. It's ubiquitous in information systems, social media, websites, software, music services, and
online retailing. Metadata can be created manually to pick and choose what's included, but it can
also be generated automatically based on the data.

● Data is loaded from multiple data sources:

Pandas is a Python Data Analysis Library that has cemented its place in the Data Science world.
Articles on the internet about top Python libraries for Data Science include Pandas as one of its
favourites. Pandas library offers several functions that can speed up data wrangling and
exploratory data analysis processes. However, the first step for any Data Science project is to
import data, and here also Pandas library has some great functions to offer. This article shows
ways to import data into Pandas from different data sources.
In 2008, Wes McKinney started developing Pandas library to fulfill the need for robust data
analytics and data manipulation software. The vision behind Pandas was to offer a toolkit for data
analysis that is free, powerful, flexible, fast, and easy to use. In 2009, Pandas became an open-
source library and became an integral part of the Data Science community. Two core Python
libraries, i.e., matplotlib and NumPy, are the two pillars of Pandas. Hence, it offers an easy way to
access NumPy and matplotlib functions with few lines of code. Also, Pandas offer several functions
to read and write data from/to various data sources. This article shows importing data from the
following data sources into Pandas.

Five ways to load data using python is discussed here


● Manual function
● loadtxt function
● genfromtxt function
● read_csv function
● Pickle
Imports
We will use Numpy, Pandas, and Pickle packages so import them.

import numpy as np

import pickle

import pandas as pd

Manual Function
This is the most difficult approach, as you have to design a custom function that loads the data for you.
You have to work with Python's normal file-handling concepts and use them to read a .csv file.
def load_csv(filepath):
    data = []
    col = []
    checkcol = False
    with open(filepath) as f:
        for val in f.readlines():
            val = val.replace("\n", "")
            val = val.split(',')
            if checkcol is False:
                col = val
                checkcol = True
            else:
                data.append(val)
    df = pd.DataFrame(data=data, columns=col)
    return df

Numpy.loadtxt function
This is a built-in function in Numpy, a famous numerical library in Python. It is a really simple
function to load the data. It is very useful for reading data which is of the same datatype.
When data is more complex, it is hard to read using this function, but when files are easy and
simple, this function is really powerful.

df = np.loadtxt('convertcsv.csv', delimiter=',')

Numpy.genfromtxt function
Here the column titles would be read as rows; to make them column titles, we have to add another
parameter, names, and set it to True so that the first row is taken as the column titles.

data = np.genfromtxt('100 Sales Records.csv', delimiter=',', dtype=None, names = True)

Pandas.read_csv()
Pandas is a very popular data manipulation library, and it is very commonly used. One of its very
important and mature functions is read_csv() which can read any .csv file very easily and help us
manipulate it. This function is very popular due to its ease of use.

>>> pdDf = pd.read_csv('100 Sales Record.csv')


>>> pdDf.head()

Pickle
When your data is not in a good, human-readable format, you can use pickle to save it in a binary
format. Then you can easily reload it using the pickle library.

with open('test.pkl', 'wb') as f:
    pickle.dump(pdDf, f)

This will create a new file, test.pkl, which contains our pdDf DataFrame from the Pandas example above.
Now to open it using pickle, we just have to use pickle.load function.

with open("test.pkl", "rb") as f:


d4 = pickle.load(f)

208
d4.head()

Data acquisition techniques - discussed in the Big Data section.

Data storing formats - discussed in the Big Data section.

Individual Activity:
● Show Exploratory Data Analysis.
● Identify and collect data

SELF-CHECK QUIZ 4.1

Check your understanding by answering the following questions:

Write the correct answer for the following questions.

1. What does it mean to load data?

2. Write down multiple sources from which data can be loaded.

LEARNING OUTCOME 4.2 - Manipulate, transform and
clean data

Content:

▪ Cases which require data transformations.


▪ Impacts of imbalanced data.
▪ Feature selection techniques.
▪ Data transformation.
▪ Slicing, indexing, sub-setting, merging and joining datasets.
▪ Facts, Dimensions and schemas.
▪ Outliers.

Assessment criteria:

1. Cases which require data transformations are identified.


2. Impacts of imbalanced data are explained.
3. Feature selection techniques are applied.
4. Data transformation is performed.
5. Slicing, indexing, sub-setting, merging and joining datasets are performed.
6. Facts, Dimensions and schemas are identified and explained.
7. Techniques are applied to handle missing values.
8. Outliers are identified, visualized and dealt with.
9. Fully usable dataset is constructed by cleaning and transforming data.

Resources required:

Students/trainees must be provided with the following resources:

▪ Workplace ( Computer and Internet connection)

LEARNING ACTIVITY 4.2

Learning Activity Resources/Special Instructions/References


Manipulate, transform and clean data ▪ Information Sheets: 4.2
▪ Self-Check: 4.2
▪ Answer Key: 4.2

INFORMATION SHEET 4.2

Learning objective: Manipulate, transform and clean data.

● Cases which require data transformations are identified:

Data Preprocessing:

Today’s real-world databases are highly susceptible to noisy, missing, and inconsistent data due
to their typically huge size (often several gigabytes or more) and their likely origin from multiple,
heterogeneous sources. Low-quality data will lead to low-quality mining results. “How can the data
be preprocessed in order to help improve the quality of the data and, consequently, of the mining
results? How can the data be preprocessed so as to improve the efficiency and ease of the mining
process?”
There are several data preprocessing techniques. Data cleaning can be applied to remove noise
and correct inconsistencies in data. Data integration merges data from multiple sources into a
coherent data store such as a data warehouse. Data reduction can reduce data size by, for
instance, aggregating, eliminating redundant features, or clustering. Data transformations (e.g.,
normalization) may be applied, where data are scaled to fall within a smaller range like 0.0 to 1.0.
This can improve the accuracy and efficiency of mining algorithms involving distance
measurements. These techniques are not mutually exclusive; they may work together. For
example, data cleaning can involve transformations to correct wrong data, such as by transforming
all entries for a date field to a common format.

Data Transformation Strategies Overview


In data transformation, the data are transformed or consolidated into forms appropriate for mining.
Strategies for data transformation include the following:
● Smoothing,which works to remove noise from the data.Techniques include binning,
regression, and clustering.
● Attribute construction (or feature construction), where new attributes are con- structed
and added from the given set of attributes to help the mining process.
● Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual
total amounts. This step is typically used in constructing a data cube for data analysis at
multiple abstraction levels.
● Normalization,where the attribute data are scaled so as to fall within a smaller range,
such as −1.0 to 1.0, or 0.0 to 1.0.
● Discretization,where the raw values of a numeric attribute(e.g.,age)are replaced by
interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The
labels, in turn, can be recursively organized into higher-level concepts, resulting in a
concept hierarchy for the numeric attribute.

Common Data transformation techniques


Mapping:
Data mapping is a way to organize various bits of data into a manageable and easy-to-understand
system. This system matches data fields with target fields while in storage.
Simply put, not all data goes by the same organizational standards. They may refer to a phone
number in as many different ways as you can think of. Data mapping recognizes phone numbers
for what they are and puts them all in the same field rather than having them drift around by other
names.
With this technique, we're able to take the organized data and put a bigger picture together. You
can find out where most of your target audience lives, learn what sorts of things they have in
common and even figure out a few controversies that you shouldn't touch on.
Armed with this information, your business can make smarter decisions and spend less money
while spinning your products and services to your audience.

Encoding:
In the field of data science, before going for the modeling, data preparation is a mandatory task.
There are various tasks we require to perform in the data preparation. Encoding categorical data
is one of such tasks which is considered crucial. As we know, most of the data in real life come
with categorical string values and most of the machine learning models work with integer values
only and some with other different values which can be understandable for the model. All models
basically perform mathematical operations which can be performed using different tools and
techniques. But the harsh truth is that mathematics is totally dependent on numbers. So in short
we can say most of the models require numbers as the data, not strings or not anything else and
these numbers can be float or integer.
Encoding categorical data is a process of converting categorical data into an integer format so that
the data with converted categorical values can be provided to the models to give and improve the
predictions. Two common approaches are label encoding, where each category is mapped to an integer
code, and one-hot encoding, where each category gets its own 0/1 column.
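A minimal sketch of both approaches using pandas, with a small hypothetical DataFrame containing a
'city' column; it is only an illustration, not code from the original text.

import pandas as pd

df = pd.DataFrame({'city': ['Dhaka', 'Chattogram', 'Dhaka', 'Sylhet'],
                   'sales': [120, 90, 150, 60]})

# Label encoding: map each category to an integer code
df['city_code'] = df['city'].astype('category').cat.codes

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(df['city'], prefix='city')
df = pd.concat([df, one_hot], axis=1)

print(df)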

Normalization and standardization:

Normalization and standardization are not the same things. Standardization, interestingly, refers
to setting the mean to zero and the standard deviation to one. Normalization in machine learning
is the process of translating data into the range [0, 1] (or any other range) or simply transforming
data onto the unit sphere.
Some machine learning algorithms benefit from normalization and standardization, particularly
when Euclidean distance is used. For example, if one of the variables in the K-Nearest Neighbor,
KNN, is in the 1000s and the other is in the 0.1s, the first variable will dominate the distance rather
strongly. In this scenario, normalization and standardization might be beneficial.
An instance of standardization is when a machine learning method is utilized and the data is
assumed to come from a normal distribution. One example is linear discriminant analysis or LDA.
When using linear models and interpreting their coefficients as variable importance, normalization
and standardization come in handy. If one of the variables has a value in the 100s and the other
has a value in the 0.01s, the coefficient discovered by Logistic Regression for the first variable will
most likely be significantly bigger than the coefficient produced by Logistic Regression for the
second variable.
This does not reveal whether the first variable is more essential or not, but it does illustrate that
this coefficient must be large to compensate for the variable’s scale. Normalization and
standardization change the coordinate system so that all variables have the same scale, making
linear model coefficients understandable.
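As a sketch (not from the original text), the two transformations can be written directly with NumPy on a
hypothetical set of feature values; scikit-learn's MinMaxScaler and StandardScaler implement the same ideas.

import numpy as np

x = np.array([5.0, 10.0, 15.0, 20.0, 100.0])    # hypothetical feature values

# Normalization (min-max scaling): translate the data into the range [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization: set the mean to zero and the standard deviation to one
x_std = (x - x.mean()) / x.std()

print("normalized:", x_norm)
print("standardized:", x_std)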

Data Cleaning

Real-world data tends to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing)
routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct
inconsistencies in the data. In this section, you will study basic methods for data cleaning: ways of
handling missing values, data smoothing techniques, and approaches to data cleaning as a process.

Missing Values: Imagine that you need to analyze sales and customer data. You note that many
tuples have no recorded value for several attributes such as customer income. How can you go
about filling in the missing values for this attribute? Let’s look at the following methods.

Ignore the tuple: This is usually done when the class label is missing (assuming the mining task
involves classification). This method is not very effective, unless the tuple contains several
attributes with missing values. It is especially poor when the percentage of missing values per
attribute varies considerably. By ignoring the tuple, we do not make use of the remaining attributes’
values in the tuple. Such data could have been useful to the task at hand.

Fill in the missing value manually: In general, this approach is time consuming and may not be
feasible given a large data set with many missing values.

Use a global constant to fill in the missing value: Replace all missing attribute values by the
same constant such as a label like “Unknown” or −∞. If missing values are replaced by, say,

“Unknown,” then the mining program may mistakenly think that they form an interesting
concept, since they all have a value in common—that of “Unknown.” Hence, although this
method is simple, it is not foolproof.

Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the
missing value: For normal (symmetric) data distributions, the mean can be used, while skewed
data distributions should employ the median. For example, suppose that the data distribution
regarding the income of customers is symmetric and that the mean income is $56,000. Use this
value to replace the missing value for income.

Use the attribute mean or median for all samples belonging to the same class as the given
tuple: For example, if classifying customers according to credit risk, we may replace the missing
value with the mean income value for customers in the same credit risk category as that of the
given tuple. If the data distribution for a given class is skewed, the median value is a better choice.

Use the most probable value to fill in the missing value: This may be determined with
regression, inference-based tools using a Bayesian formalism, or decision tree induction. For
example, using the other customer attributes in your data set, you may construct a decision tree
to predict the missing values for income.
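A short sketch of the simpler strategies above using pandas, with a hypothetical income column; the
column name and values are illustrative only.

import pandas as pd
import numpy as np

df = pd.DataFrame({'income': [45000, 56000, np.nan, 61000, np.nan, 52000]})

df_dropped = df.dropna()                                    # ignore the tuple (drop rows with missing values)
df_mean = df.fillna({'income': df['income'].mean()})        # fill with the mean (symmetric distributions)
df_median = df.fillna({'income': df['income'].median()})    # fill with the median (skewed distributions)
df_const = df.fillna({'income': -1})                        # fill with a global constant / sentinel value

print(df_mean)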

Noisy Data: Noise is a random error or variance in a measured variable. Given a numeric attribute
such as, say, price, how can we “smooth” out the data to remove the noise? Let’s look at the
following data smoothing techniques.

Binning: Binning methods smooth a sorted data value by consulting its "neighborhood," that is,
the values around it. The sorted values are distributed into a number of “buckets,” or bins. Because
binning methods consult the neighborhood of values, they perform local smoothing. Figure
illustrates some binning techniques. In this example, the data for price are first sorted and then
partitioned into equal-frequency bins of size 3 (i.e., each bin contains three values). In smoothing
by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean
of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the
value 9.

Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the
bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are
identified as the bin boundaries. Each bin value is then replaced by the closest boundary value. In
general, the larger the width, the greater the effect of the smoothing.
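A sketch of equal-frequency binning and smoothing by bin means with pandas. The text above mentions
the values 4, 8 and 15 in the first bin; the remaining values here are assumed for illustration, and pd.qcut
performs the equal-frequency partitioning.

import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Partition into 3 equal-frequency bins (three values per bin)
bins = pd.qcut(prices, q=3, labels=False)

# Smoothing by bin means: replace each value with the mean of its bin
smoothed = prices.groupby(bins).transform('mean')

print(pd.DataFrame({'price': prices, 'bin': bins, 'smoothed': smoothed}))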
Regression: Data smoothing can also be done by regression, a technique that conforms data
values to a function. Linear regression involves finding the "best" line to fit two attributes (or

variables) so that one attribute can be used to predict the other. Multiple linear regression is an
extension of linear regression, where more than two attributes are involved and the data are fit to
a multidimensional surface.
Outlier analysis: Outliers may be detected by clustering, for example, where similar values are
organized into groups, or “clusters.” Intuitively, values that fall outside of the set of clusters may
be considered outliers.

● Impacts of imbalanced data are explained:

Imbalanced data refers to those types of datasets where the target class has an uneven
distribution of observations, i.e one class label has a very high number of observations and
the other has a very low number of observations. We can better understand it with an example.
Let’s assume that XYZ is a bank that issues a credit card to its customers. Now the bank is
concerned that some fraudulent transactions are going on, and when the bank checks its data
it finds that for every 2000 transactions only 30 cases of fraud are recorded. So the
number of frauds per 100 transactions is less than 2%, or we can say more than 98% of
transactions are "No Fraud" in nature. Here, the class "No Fraud" is called the majority class,
and the much smaller "Fraud" class is called the minority class. The impact is that a model trained on
such data can reach a high accuracy simply by always predicting the majority class, while failing to
detect the minority class, which is usually the class we actually care about.
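A quick sketch of how such an imbalance shows up when inspecting a dataset with pandas; the label name
'is_fraud' and the counts are hypothetical, mirroring the bank example above.

import numpy as np
import pandas as pd

# Hypothetical transaction labels: 1970 genuine and 30 fraudulent out of 2000
labels = pd.Series([0] * 1970 + [1] * 30, name='is_fraud')

print(labels.value_counts())                   # absolute counts per class
print(labels.value_counts(normalize=True))     # class proportions (about 98.5% vs 1.5%)

# A naive "model" that always predicts the majority class still looks highly accurate
always_no_fraud = np.zeros(len(labels))
print("accuracy of always predicting 'No Fraud':", (always_no_fraud == labels).mean())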

● Feature selection techniques are applied:

Filter Methods:

Filter methods are generally used as a preprocessing step. The selection of features is
independent of any machine learning algorithms. Instead, features are selected on the
basis of their scores in various statistical tests for their correlation with the outcome
variable.The correlation is a subjective term here. For basic guidance, you can refer to the
following table for defining correlation coefficients.

Feature\Response Continuous Categorical

Continuous Pearson’s Correlation LDA

Categorical Anova Chi-Square

● Pearson’s Correlation: It is used as a measure for quantifying linear dependence


between two continuous variables X and Y. Its value varies from -1 to +1.
Pearson’s correlation is given as:
ρ(X, Y) = cov(X, Y) / (σX σY)
● LDA: Linear discriminant analysis is used to find a linear combination of features
that characterizes or separates two or more classes (or levels) of a categorical
variable.
● ANOVA: ANOVA stands for Analysis of variance. It is similar to LDA except for the
fact that it is operated using one or more categorical independent features and one
continuous dependent feature. It provides a statistical test of whether the means
of several groups are equal or not.
● Chi-Square: It is a statistical test applied to the groups of categorical features to
evaluate the likelihood of correlation or association between them using their
frequency distribution.

● One thing that should be kept in mind is that filter methods do not remove
multicollinearity. So, you must deal with multicollinearity of features as well before
training models for your data.
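As an illustration of the filter approach (not code from the original text), the sketch below scores each
feature with the ANOVA F-test and keeps the top k, using scikit-learn's SelectKBest on the built-in iris
dataset.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score each feature against the class labels and keep the 2 best
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("F-scores per feature:", selector.scores_)
print("shape before / after selection:", X.shape, X_selected.shape)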

Wrapper Methods:

In wrapper methods, we try to use a subset of features and train a model using them.
Based on the inferences that we draw from the previous model, we decide to add or
remove features from your subset. The problem is essentially reduced to a search
problem. These methods are usually computationally very expensive.
Some common examples of wrapper methods are forward feature selection, backward
feature elimination, recursive feature elimination, etc.
● Forward Selection: Forward selection is an iterative method in which we start with
having no feature in the model. In each iteration, we keep adding the feature which
best improves our model till an addition of a new variable does not improve the
performance of the model.
● Backward Elimination: In backward elimination, we start with all the features and
remove the least significant feature at each iteration which improves the
performance of the model. We repeat this until no improvement is observed on
removal of features.
● Recursive Feature elimination: It is a greedy optimization algorithm which aims to
find the best performing feature subset. It repeatedly creates models and keeps
aside the best or the worst performing feature at each iteration. It constructs the
next model with the left features until all the features are exhausted. It then ranks
the features based on the order of their elimination.
One of the best ways for implementing feature selection with wrapper methods is to use
the Boruta package that finds the importance of a feature by creating shadow features.
It works in the following steps:
1. Firstly, it adds randomness to the given data set by creating shuffled copies of all
features (which are called shadow features).
2. Then, it trains a random forest classifier on the extended data set and applies a
feature importance measure (the default is Mean Decrease Accuracy) to evaluate
the importance of each feature where higher means more important.
3. At every iteration, it checks whether a real feature has a higher importance than
the best of its shadow features (i.e. whether the feature has a higher Z-score than
the maximum Z-score of its shadow features) and constantly removes features
which are deemed highly unimportant.
4. Finally, the algorithm stops either when all features get confirmed or rejected or it
reaches a specified limit of random forest runs.
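A minimal sketch of recursive feature elimination, one of the wrapper methods described above, using
scikit-learn's RFE with a logistic regression estimator on the iris dataset; the dataset and estimator are
illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Repeatedly fit the model and drop the weakest feature until only 2 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)

print("selected-feature mask:", rfe.support_)
print("feature ranking (1 = selected):", rfe.ranking_)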

Embedded Methods:

Embedded methods combine the qualities of filter and wrapper methods. They are implemented
by algorithms that have their own built-in feature selection methods.

Some of the most popular examples of these methods are LASSO and RIDGE regression
which have inbuilt penalization functions to reduce overfitting.

● Lasso regression performs L1 regularization which adds a penalty equivalent to


absolute value of the magnitude of coefficients.
● Ridge regression performs L2 regularization which adds a penalty equivalent to
the square of the magnitude of coefficients.
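A sketch of the embedded approach with L1 regularization: Lasso shrinks some coefficients to exactly
zero, so the features with non-zero coefficients are effectively the selected ones. The diabetes dataset and
the alpha value are assumptions made for illustration.

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)    # put the features on a common scale first

lasso = Lasso(alpha=0.5)
lasso.fit(X, y)

# Features whose coefficients were shrunk to exactly zero are effectively dropped
for i, coef in enumerate(lasso.coef_):
    status = "kept" if coef != 0 else "dropped"
    print(f"feature {i}: coefficient = {coef:.3f} ({status})")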
Hybrid Methods:
Feature selection has become the focus of research areas of applications with datasets
of thousands of variables. In this study we present a hybrid feature selection (HFS) method
that adopts both filter and wrapper models of feature subset selection. In the first stage of
the feature selection, we use the filter model to rank the features by the mutual information
(MI) between each feature and each class, and then choose k highest relevant features
to the classes. In the second stage, we complete a wrapper model based feature selection
algorithm, which uses the Shapley value to evaluate the contribution of features to the
classification task in a feature subset. Experimental results show clearly that the HFS
method obtains better classification performance than a solo Shapley-value-based or a solo
MI-based feature selection method.
Information Gains:
Information Gain, or IG for short, measures the reduction in entropy or surprise by splitting
a dataset according to a given value of a random variable.A larger information gain
suggests a lower entropy group or groups of samples, and hence less surprise.
You might recall that information quantifies how surprising an event is in bits. Lower
probability events have more information, higher probability events have less information.
Entropy quantifies how much information there is in a random variable, or more specifically
its probability distribution. A skewed distribution has a low entropy, whereas a distribution
where events have equal probability has a larger entropy.
In information theory, we like to describe the “surprise” of an event. Low probability events
are more surprising and therefore carry a larger amount of information, whereas probability
distributions where the events are equally likely have the largest entropy.
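A small sketch computing entropy and the information gain of a split directly with NumPy; the class counts
used here are made up for illustration.

import numpy as np

def entropy(counts):
    # Shannon entropy (in bits) of a vector of class counts
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Hypothetical parent node: 10 positive and 10 negative samples
parent = [10, 10]
# Splitting the dataset on some feature produces two child groups
left, right = [8, 2], [2, 8]

n = sum(parent)
weighted_child_entropy = (sum(left) / n) * entropy(left) + (sum(right) / n) * entropy(right)
information_gain = entropy(parent) - weighted_child_entropy
print(f"information gain of the split: {information_gain:.3f} bits")    # about 0.278 bits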
Gini Index:
This paper presents an improved Gini-Index algorithm to correct feature-selection bias in
text classification. Gini-Index has been used as a split measure for choosing the most
appropriate splitting attribute in decision tree. Recently, an improved Gini-Index algorithm
for feature selection, designed for text categorization and based on Gini-Index theory, was
introduced, and it has proved to be better than the other methods. However, we found that
the Gini-Index still shows a feature selection bias in text classification, specifically for

unbalanced datasets having a huge number of features. The feature selection bias of the
Gini-Index in feature selection is shown in three ways: 1) the Gini values of low-frequency
features are low (on purity measure) overall, irrespective of the distribution of features
among classes, 2) for high-frequency features, the Gini values are always relatively high
and 3) for specific features belonging to large classes, the Gini values are relatively lower
than those belonging to small classes. Therefore, to correct that bias and improve feature
selection in text classification using Gini-Index, we propose an improved Gini-Index (I-GI)
algorithm with three reformulated Gini-Index expressions. In the present study, we used
global dimensionality reduction (DR) and local DR to measure the goodness of features
in feature selections. In experimental results for the I-GI algorithm, we obtained unbiased
feature values and eliminated many irrelevant general features while retaining many
specific features. Furthermore, we could improve the overall classification performances
when we used the local DR method. The total averages of the classification performance
were increased by 19.4%, 15.9%, 3.3%, 2.8% and 2.9% (kNN) in Micro-F1, 14%, 9.8%,
9.2%, 3.5% and 4.3% (SVM) in Micro-F1, 20%, 16.9%, 2.8%, 3.6% and 3.1% (kNN) in
Macro-F1, 16.3%, 14%, 7.1%, 4.4%, 6.3% (SVM) in Macro-F1, compared with tf*idf, χ2,
Information Gain, Odds Ratio and the existing Gini-Index methods according to each
classifier.
Chi Square:
Chi-square is used when the feature is categorical and the target variable is (or can be treated as) categorical. It measures the degree of association between two categorical variables. If both variables are numeric, we can use Pearson's product-moment correlation; if the attribute is numerical and there are two classes we can use a t-test, and if there are more than two classes we can use ANOVA.
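A minimal sketch of chi-square based feature scoring with scikit-learn is shown below (the iris dataset is used here only as a convenient non-negative example):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Score each non-negative feature against the class labels and keep the best two.
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print("Chi-square scores:", selector.scores_.round(2))
print("Selected feature indices:", selector.get_support(indices=True))
print("Reduced shape:", X_new.shape)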
Outlier:
● An outlier is an observation of a data point that lies an abnormal distance from
other values in a given population. (odd man out)
○ Like in the following data point (Age)
■ 18,22,45,67,89,125,30
● An outlier is an object(s) that deviates significantly from the rest of the object
collection.
○ List of Cities
■ New York, Los Angeles, London, France, Delhi, Chennai
● It is an abnormal observation found during the data analysis stage: a data point that lies far away from the other values.
○ List of Animals
■ cat, fox, rabbit, fish
● An outlier is an observation that diverges from well-structured data.
● The root cause for the Outlier can be an error in measurement or data collection
error.
● Quick ways to handle outliers (a short sketch using the 1.5 × IQR rule follows this list):
○ An outlier can either be a mistake or just genuine variance, as in the examples above.
○ If we find it is due to a mistake, we can ignore (remove) it.
○ If we find it is due to genuine variance in the data, we can keep it and work with it.
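A minimal sketch of flagging outliers with the 1.5 × IQR rule is given below (the age list is hypothetical):

import numpy as np

ages = np.array([18, 22, 25, 27, 29, 30, 31, 120])   # 120 is the suspicious value

q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = ages[(ages < lower) | (ages > upper)]
print("Acceptable range:", lower, "to", upper)
print("Outliers:", outliers)                          # [120]

Whether such a point is removed, capped or kept then depends on whether it is a data-entry mistake or genuine variance.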

4.4: described in 4.1.
4.5: described before.
2.6: Big Data part.
2.7, 2.8 & 2.9
Individual Activity:
● Discuss Slicing, indexing, sub-setting, merging and joining datasets.

SELF-CHECK QUIZ 4.2

Check your understanding by answering the following questions:

1. What is imbalanced data?

2. Describe the filter methods of feature selection.

LEARNING OUTCOME 4.3 – Conduct Exploratory Data
Analysis (EDA)

Contents:

● Steps of EDA.
● Statistical characteristics of the variables.
● Bi-variate and multivariate analysis.

Assessment criteria:

1. Steps of EDA are described.


2. Appropriate features/ variables are identified for the analysis.
3. Statistical characteristics of the variables are analyzed.
4. Dataset is parsed by cleaning, treating missing values & outliers and transforming as
required.
5. Bi-variate and multivariate analysis are performed to find associations between variables.

Resources required:

Students/trainees must be provided with the following resources:

Workplace (Computer and Internet connection)

LEARNING ACTIVITY 4.3

Learning Activity Resources/Special Instructions/References


Conduct Exploratory Data ▪ Information Sheets: 4.3
Analysis (EDA) ▪ Self-Check: 4.3
▪ Answer Key: 4.3

INFORMATION SHEET 4.3

Learning Objective: to conduct Exploratory Data Analysis (EDA).

● 4.3.1, 4.3.2, 4.3.3, 4.3.4, 4.3.5 Conducting Exploratory Data Analysis:

Exploratory Data Analysis is a data analytics process to understand the data in depth and learn
the different data characteristics, often with visual means. This allows you to get a better feel of
your data and find useful patterns in it.
It is crucial to understand it in depth before you perform data analysis and run your data through
an algorithm. You need to know the patterns in your data and determine which variables are
important and which do not play a significant role in the output. Further, some variables may have
correlations with other variables. You also need to recognize errors in your data.
All of this can be done with Exploratory Data Analysis. It helps you gather insights and make better
sense of the data, and removes irregularities and unnecessary values from data.
Steps Involved in Exploratory Data Analysis

1. Data Collection
Data collection is an essential part of exploratory data analysis. It refers to the process of finding
and loading data into our system. Good, reliable data can be found on various public sites or
bought from private organizations. Some reliable sites for data collection are Kaggle, Github,
Machine Learning Repository, etc.
The data depicted below represents the housing dataset that is available on Kaggle. It contains
information on houses and the price that they were sold for.

Figure: Housing dataset

2. Data Cleaning
Data cleaning refers to the process of removing unwanted variables and values from your dataset
and getting rid of any irregularities in it. Such anomalies can disproportionately skew the data and
hence adversely affect the results. Some steps that can be done to clean data are:
● Removing missing values, outliers, and unnecessary rows/ columns.
● Re-indexing and reformatting our data.
Now, it's time to clean the housing dataset. You first need to check the number of missing values in each column and the percentage of missing values each column contributes.

Figure: Finding Missing Values
Next, drop the columns which are missing more than 15% of their data. Variables such as 'PoolQC', 'MiscFeature' and 'Alley' are missing a significant chunk of their data, so they are removed.

Figure: Dropping Missing Values
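A minimal sketch of these two steps with pandas is shown below (the file name train.csv and the 15% threshold are assumptions based on the description above):

import pandas as pd

df = pd.read_csv("train.csv")   # hypothetical file name for the housing dataset

# Count and percentage of missing values per column, largest first.
missing = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum() / len(df)).sort_values(ascending=False)
print(pd.concat([missing, percent], axis=1, keys=["Total", "Percent"]).head(10))

# Drop every column where more than 15% of the values are missing.
df = df.drop(columns=percent[percent > 0.15].index)
print(df.shape)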


Your final dataset after cleaning looks as shown below. You now have only 63 columns of
importance.

Figure: Final Dataset


3. Univariate Analysis
In Univariate Analysis, you analyze data of just one variable. A variable in your dataset refers to a
single feature/ column. You can do this either with graphical or non-graphical means by finding
specific mathematical values in the data. Some visual methods include:
● Histograms: Bar plots in which the frequency of data is represented with rectangle bars.
● Box-plots: Here the information is represented in the form of boxes.
Let's make a histogram out of our SalePrice column.

Figure: Data Distribution in our Dataset

From the above graph, you can say that the graph deviates from the normal and is positively
skewed. Now, find the Skewness and Kurtosis of the graph.

Figure: Skewness and Kurtosis in your data
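A minimal sketch of this step, assuming the cleaned housing data is in a DataFrame called df, might look like:

import matplotlib.pyplot as plt

df["SalePrice"].hist(bins=50)          # histogram of the target variable
plt.xlabel("SalePrice")
plt.ylabel("Frequency")
plt.show()

print("Skewness:", df["SalePrice"].skew())
print("Kurtosis:", df["SalePrice"].kurt())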


To understand exactly which values are outliers, you need to establish a threshold. To do this, you have to standardize the data. After standardization, the data has a mean of 0 and a standard deviation of 1.

Figure: Standardizing data

The above figure shows that the lower-range values fall in a similar range and are not too far from 0, while the higher-range values lie far from 0. You cannot consider all of them outliers, but you have to be careful with the last two values, which are above 7.

4. Bivariate Analysis
Here, you use two variables and compare them. This way, you can find how one feature affects
the other. It is done with scatter plots, which plot individual data points or correlation matrices that
plot the correlation in hues. You can also use boxplots.
Let's plot a scatter plot of the greater living area and Sales prices. Here, you can see that most of
the values follow the same trend and are concentrated around one point, except for two isolated
values at the very top. These are probably the data points with values above 7.

Figure: Scatterplot

Now, delete the last two values as they are outliers.

Figure : Deleting Outliers
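A minimal sketch of this bivariate step, using the Kaggle column name GrLivArea for the greater living area and the same df assumption, is:

import matplotlib.pyplot as plt

df.plot.scatter(x="GrLivArea", y="SalePrice")
plt.show()

# Drop the two largest living-area observations, which sit far from the main trend.
df = df.drop(df.sort_values(by="GrLivArea", ascending=False).head(2).index)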

Now, plot a scatter plot of the Basement area vs. the Sales Price and see their relationship. Again,
you can see that the greater the basement area, the more the sales price.

Figure: Scatterplot

Moving ahead, plot a boxplot of the Sales Price with Overall Quality. The overall quality feature is
categorical here. It falls in the range of 1 to 10. Here, you can see the increase in sales price as
the quality increases. The rise looks a bit like an exponential curve.

Figure: Boxplot

Market Analysis With Exploratory Data Analysis


Now, perform Exploratory Data Analysis on market analysis data. You start by importing all
necessary modules.

Figure: Importing necessary modules

Then, you read in the data as a pandas data frame.

Figure: Market Analysis Data

You can see here that the dataset is not formatted correctly. The first two rows contain the actual
column names, and the column names are just arbitrary values.

Importing Data
To overcome the skewed rows, import your data by skipping the first two rows. This will make sure
that your column names are populated correctly.

Figure: Importing Market Analysis Data
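A minimal sketch of this import (the file name is a placeholder) is:

import pandas as pd

# Skip the first two malformed rows so the real header row becomes the column names.
df = pd.read_csv("market_analysis.csv", skiprows=2)
print(df.head())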

The dataset is imported correctly now. The column names are in the correct row, and you’ve
dropped the arbitrary data.
The above data was collected while taking a survey. Different information about the survey takers,
like their occupation, salary, if they have taken a loan, age, etc, is given. You will use exploratory
data analysis to find patterns in this data and find correlations between columns. You will also
perform basic data cleaning steps.
Data Cleaning
The next step that you need to do is data cleaning. Let us drop the customer id column as it is just
the row numbers, but indexed at 1. Also, split the ‘jobedu’ column into two. One column for the job
and one for the education field. After splitting the columns, you can drop the ‘jobedu’ column as it
is of no use anymore.

Figure: Cleaning Market Analysis Data
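A minimal sketch of these cleaning steps (the column names customerid and jobedu, and the comma separator, are assumptions based on the description) is:

df = df.drop(columns=["customerid"])          # just row numbers, so not useful

# 'jobedu' holds values such as "management,tertiary"; split it into two columns.
df["job"] = df["jobedu"].str.split(",").str[0]
df["education"] = df["jobedu"].str.split(",").str[1]
df = df.drop(columns=["jobedu"])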

This is what the dataset looks like now.

Figure: Market Analysis Data


Missing Values
The data has some missing values in its columns. There are three major categories of missing
values:
● MCAR (Missing completely at random): These are values that are randomly missing
and do not depend on any other values.
● MAR (Missing at random): These values are dependent on some additional features.
● MNAR (Missing not at random): There is a reason behind why these values are missing.
Let’s check the columns which have missing values.

Figure: Missing values


There is nothing you can do about the missing age values. So, drop all rows which do not have
the age values.

Figure: Missing age values
Now, coming to the month column, you can fill in the missing values by finding the most commonly
occurring month and filling it in place of the missing values. You see the mode of the month column
to get the most commonly occurring values and fill in the missing values using the fillna function.

Figure: Filling in missing month values
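A minimal sketch of these two treatments is:

df = df[~df["age"].isnull()]            # drop rows where age is missing

month_mode = df["month"].mode()[0]      # the most frequently occurring month
df["month"] = df["month"].fillna(month_mode)

print(df.isnull().sum())                # remaining missing values per column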

Check to see the number of missing values left in your data.

Figure: Missing values

Finally, only the response column has missing values. You cannot do anything about these values.
If the user hasn't filled in the response, you cannot auto-generate them. So you drop these values.

Figure: Dropping Missing response values

Finally, you can see that the data is clean. You can now start finding the outliers in the data.
Handling Outliers
There are two types of outliers:
● Univariate outliers: Univariate outliers are the data points whose values lie outside the
expected range of values. Here, only a single variable is being considered.
● Multivariate outliers: These outliers are dependent on the correlation between two
variables. While plotting data, one variable may not lie beyond the expected range, but
when you plot the same variable with some other variable, these values may lie far from
the expected value.
Univariate Analysis
Now, consider the different jobs that you have data on. Plotting the job column as a bar graph in
ascending order of the number of people who work in that job tells us the most popular jobs in the
market. To ensure that they lie in the same range and are comparable, normalize the data.

Figure: Plotting the number of people performing a certain job


Moving on, plot a pie chart to compare the education qualifications of the people in the survey.
Almost half of the people have only a secondary school education and one-fourth have a tertiary
education.

Figure: Plotting the education qualification of people

Bivariate Analysis
Bivariate analysis is of three main types:
1. Numeric-Numeric Analysis
When both the variables being compared have numeric data, the analysis is said to be Numeric-
Numeric Analysis. To compare two numeric columns, you can use scatter plots, pair plots, and
correlation matrices.
Scatter Plot
A scatter plot is used to represent every data point in the graph. It shows how the data of one
column fluctuates according to the corresponding data points in another column. Plot a scatterplot
between different individuals' salaries and bank balances and the balance and age of individuals.

Figure: Plotting a scatter plot of Salary vs. Balance

By looking at the above plot, it can be said that regardless of the salary of individuals, bank balances range from 0 to 250,000, and the majority of the people have a bank balance below 40k.

Figure: Plotting a scatter plot of Balance vs Age


From the above graph, you can derive the conclusion that the average balance of people,
regardless of age, is around 25,000. This is the average balance, irrespective of age and salary.
Pair Plot
Pair plots are used to compare multiple variables at the same time. They plot a scatter plot of all
input variables against each other. This helps save space and lets us compare various variables
at the same time. Let's plot the pair for salary, balance, and age.

Figure: Plotting a pairplot

The below figures show the pair plots for salary, balance, and age. Each variable is plotted against
the others on both the x and y-axis.

Figure: Pairplots of salary, balance, and age

Correlation Matrix
A correlation matrix is used to see the correlation between different variables. How correlated two
variables are is determined by the correlation coefficient. The below table shows the correlation
between salary, age, and balance. Correlation tells you how strongly one variable is associated with another. This helps us judge how changes in one variable tend to be accompanied by changes in the other variables.

Figure: Correlation matrix between salary, balance, and age
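A minimal sketch of computing and drawing this matrix (column names taken from the description) is:

import seaborn as sns
import matplotlib.pyplot as plt

corr = df[["salary", "balance", "age"]].corr()
sns.heatmap(corr, annot=True, cmap="Blues")
plt.show()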

The above matrix tells us that balance, age, and salary have a high correlation coefficient and
affect each other. Age and salary have a lower correlation coefficient.
2. Numeric - Categorical Analysis
When one variable is of numeric type and another is a categorical variable, then you perform
numeric-categorical analysis.
You can use the groupby function to arrange the data into similar groups. Rows that have the
same value in a particular column will be arranged in a group together. This way, you can see the
numerical occurrences of a certain category across a column. You can groupby values and find
their mean.

Figure: Groupby of response with respect to salary
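A minimal sketch of this grouping is:

print(df.groupby("response")["salary"].mean())     # average salary per response
print(df.groupby("response")["salary"].median())   # median salary per response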

The above values tell you the average salary of the people who have responded with yes and no
in the response column.
You can also find the middle value of salary or the median value of the people who have responded
with yes and no in our survey.

Figure: Median of grouped of response with respect to salary


You can also plot the box plot of response vs salary. A boxplot will show you the range of values
that fall under a certain category.

Figure: Boxplot of response with respect to salary

The above plot tells you that the salary range of people who said no on the survey is between 20k
- 70k with a median salary of 60k, while the salary range of people who replied with yes on the
survey was between 50k - 100k with a median salary of 60K.
3. Categorical — Categorical Analysis
When both the variables contain categorical data, you perform categorical-categorical analysis.
First, convert the categorical response column into a numerical column with 1 corresponding to a
positive response and 0 corresponding to a negative response.

Figure: Changing categorical to numerical values
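A minimal sketch of this step (the column names marital and loan are assumptions) is:

df["response_flag"] = df["response"].map({"yes": 1, "no": 0})

print(df.groupby("marital")["response_flag"].mean())   # response rate by marital status
print(df.groupby("loan")["response_flag"].mean())      # response rate by loan status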

Now, plot the marital status of people with the response rate. The below figure tells you the mean
number of people who responded with yes to the survey and their marital status.

Figure: Changing categorical to numerical values

Also plot the mean loan status with respect to the response rate.

Figure: Changing categorical to numerical values

You can conclude that people who have taken a loan are more likely to respond with a no on the
survey.

Individual Activity:
● Show Bi-variate and multivariate analysis.

SELF-CHECK QUIZ 4.3

Check your understanding by answering the following questions:

Write the appropriate/correct answer of the following:

1. What is a scatter plot?


2. Describe bivariate analysis.

LEARNING OUTCOME 4.4 – Visualize and report data

Contents:

▪ Variables and relationships.


▪ Python visualization libraries (matplotlib, plotly) or Power BI or Tableau.
▪ Plots.
▪ Reports.

Assessment criteria:

1. Variables and relationships are visualized using various visualization methods suitable for various
data types.
2. Python visualization libraries (matplotlib, plotly) or Power BI or Tableau are used to plot charts and
graphs.
3. Plots are analyzed to identify important patterns.
4. Reports are generated using Power BI or Tableau.

Resources required:

Students/trainees must be provided with the following resources:

Workplace (Computer and Internet connection)

LEARNING ACTIVITY 4.4

Resources/Special
Learning Activity
Instructions/References
Visualize and report data
● Information Sheets: 4.4
● Self-Check: 4.4
● Answer Key: 4.4

INFORMATION SHEET 4.4

Learning Objective: to visualize and report data.

Variables and relationships are visualized using various visualization methods suitable
for various data types:

Data visualization is the representation of data through use of common graphics, such as charts, plots,
infographics, and even animations. These visual displays of information communicate complex data
relationships and data-driven insights in a way that is easy to understand.
Data visualization can be utilized for a variety of purposes, and it's important to note that it is not only reserved for use by data teams. Management also leverages it to convey organizational structure and
hierarchy while data analysts and data scientists use it to discover and explain patterns and trends.
Harvard Business Review categorizes data visualization into four key purposes:
idea generation, idea illustration, visual discovery, and everyday dataviz. We’ll delve deeper into these
below:
Data visualization is commonly used to spur idea generation across teams. They are frequently leveraged
during brainstorming or Design Thinking sessions at the start of a project by supporting the collection of
different perspectives and highlighting the common concerns of the collective. While these visualizations
are usually unpolished and unrefined, they help set the foundation within the project to ensure that the
team is aligned on the problem that they’re looking to address for key stakeholders.
Data visualization for idea illustration assists in conveying an idea, such as a tactic or process. It is
commonly used in learning settings, such as tutorials, certification courses, centers of excellence, but it
can also be used to represent organization structures or processes, facilitating communication between
the right individuals for specific tasks. Project managers frequently use Gantt charts and waterfall charts
to illustrate workflows.
Visual discovery and every day data viz are more closely aligned with data teams. While visual discovery
helps data analysts, data scientists, and other data professionals identify patterns and trends within a
dataset, every day data viz supports the subsequent storytelling after a new insight has been found. Data
visualization is a critical step in the data science process, helping teams and individuals convey data more
effectively to colleagues and decision makers. However, it’s important to remember that it is a skillset that
can and should extend beyond your core analytics team.

Histogram:
A histogram is a graphical representation of the distribution of a dataset. Although its appearance is
similar to that of a standard bar graph, instead of making comparisons between different items or
categories or showing trends over time, a histogram is a plot that lets you show the underlying
frequency distribution or the probability distribution of a single continuous numerical variable.
Let me clarify that a probability distribution indicates all the possible values that a certain random
variable can take plus a summary of probabilities for those values. Also to put it simply, a continuous
numerical variable is the one that can take on an unlimited number of values within a range or interval.
For example: height, weight, age, temperature.

Boxplot:
Boxplot is used to show the distribution of a variable. The box plot is a standardized way of displaying
the distribution of data based on the five-number summary: minimum, first quartile, median, third
quartile, and maximum.
Here, we’ll plot a Boxplot for checking the distribution of Sepal Length.


Figure: boxplot for univariate data using seaborn.
Also, A box plot shows the distribution of quantitative data in a way that facilitates comparisons
between variables or across levels of a categorical variable.
Here, we’ll plot Boxplot to compare the distribution of Sepal Length for each level of Species.


Figure: boxplot for bivariate data using seaborn.
We can also plot a Boxplot for the entire dataset with Horizontal orientation.


Figure: boxplot for the entire dataset using seaborn.
So, we can observe that all the plots represent the distribution of the dataset through its quartiles, along with the maximum and minimum values, while the dots outside the box represent outliers.

Scatterplot:
This is used to find a relationship in bivariate data. It is most commonly used to find correlations
between two continuous variables. Here, we’ll see a scatter plot for Petal Length and Petal Width using
matplotlib.


Figure: Scatter Plot using matplotlib
We can notice that the relationship between the two variables is linear and positive.

We used plt.title to add a title to the plot, plt.xlabel to add a label for the x-axis and similarly plt.ylabel to add a label for the y-axis. There are plenty of such options that can be useful for adding or modifying plot elements; you can refer to the matplotlib documentation for a complete guide.
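A minimal sketch of such a scatter plot, using the iris dataset bundled with seaborn, is:

import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")

plt.scatter(iris["petal_length"], iris["petal_width"])
plt.title("Petal Length vs Petal Width")
plt.xlabel("Petal Length (cm)")
plt.ylabel("Petal Width (cm)")
plt.show()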

Pie Chart:
Pie Chart is a type of plot which is used to represent the proportion of each category in categorical
data. The whole pie is divided into slices which are equal to the number of categories.

Figure: Pie chart using matplotlib


The three slices in the above chart represent the three categories of species. We have used explode to separate the three slices. Similar to a histogram, the three slices have different colors, which represent each of the categories uniquely.
Pareto chart:
A Pareto chart is a bar graph. The lengths of the bars represent frequency or cost (time or money),
and are arranged with longest bars on the left and the shortest to the right. In this way the chart visually
depicts which situations are more significant.

Q-Q Plot:
The quantile-quantile (q-q) plot is a graphical technique for determining if two data sets come from
populations with a common distribution.
A q-q plot is a plot of the quantiles of the first data set against the quantiles of the second data set. By
a quantile, we mean the fraction (or percent) of points below the given value. That is, the 0.3 (or 30%)
quantile is the point at which 30% of the data fall below and 70% fall above that value.
A 45-degree reference line is also plotted. If the two sets come from a population with the same
distribution, the points should fall approximately along this reference line. The greater the departure
from this reference line, the greater the evidence for the conclusion that the two data sets have come
from populations with different distributions.
The advantages of the q-q plot are:
The sample sizes do not need to be equal.
Many distributional aspects can be simultaneously tested. For example, shifts in location, shifts in
scale, changes in symmetry, and the presence of outliers can all be detected from this plot. For
example, if the two data sets come from populations whose distributions differ only by a shift in location,
the points should lie along a straight line that is displaced either up or down from the 45-degree
reference line.

● Python visualization libraries (matplotlib, plotly) or Power BI or Tableau are used
to plot charts and graphs:

Matplotlib:
matplotlib is the O.G. of Python data visualization libraries. Despite being over a decade old, it's
still the most widely used library for plotting in the Python community. It was designed to closely
resemble MATLAB, a proprietary programming language developed in the 1980s.
Because matplotlib was the first Python data visualization library, many other libraries are built on
top of it or designed to work in tandem with it during analysis. Some libraries like pandas and
Seaborn are “wrappers” over matplotlib. They allow you to access a number of matplotlib’s
methods with less code.
While matplotlib is good for getting a sense of the data, it's not very useful for creating publication-
quality charts quickly and easily. As Chris Moffitt points out in his overview of Python visualization
tools, matplotlib “is extremely powerful but with that power comes complexity.”
matplotlib has long been criticized for its default styles, which have a distinct 1990s feel. Its current
release of matplotlib 3.4.3 still reflects this style.

Plotly:

You might know Plotly as an online platform for data visualization, but did you also know you can
access its capabilities from a Python notebook? Like Bokeh, Plotly's forte is making interactive
plots, but it offers some charts you won't find in most libraries, like contour plots, dendrograms,
and 3D charts.
Plots are analyzed to identify important patterns:
● Comprehend- gain a basic understanding after reading the story over
● Interpret- dig deeper into the details of the story
● Draw Conclusions- taking what you learned from steps 1 and 2 and drawing analytical
conclusions
Let's read Aesop's fable about the wolf in sheep's clothing and then analyze the plot.
There once was a hungry wolf wanting to eat sheep. One particular shepherd and his dogs
watched over the sheep so well the wolf couldn't steal any. One day, the wolf found the skin of a
sheep and put it on. He went in among the sheep, deceiving the shepherd and dogs. The lamb
that belonged to the sheep, whose skin the wolf was wearing, began to follow the wolf. The wolf
led the little lamb apart from the flock and ate her. He continued for some time deceiving everyone
and enjoying being well-fed.
Comprehend
Now let's comprehend the fable by understanding the protagonist and the basic plot. This step is
very general and basic.
Protagonist: Who is the protagonist? What is the protagonist's purpose?
The protagonist is the wolf. The wolf wants to eat sheep without being discovered.
Comprehend the Plot: What is the basic plot line?
The hungry wolf deceives everyone by dressing like a sheep.
Interpret
In the interpretation step, we examine obstacles, the climax, and the resolution of the plot. This is
taking the comprehend step deeper and digging into the analysis of different parts of the story.
Creating a timeline gives us a good visual of the plot.

Obstacles: What obstacles stand in the way of the protagonist's purpose? Who is the antagonist?
What character flaws does the protagonist have?
The protagonist is the wolf because he is the leading character. He is an antihero because he
does not stand for good morals. His obstacle is that the shepherd and the dogs are watching after
the sheep. The antagonists would be the shepherd and the dogs because they oppose the
protagonist. His flaw is that he looks like a wolf.
Climax: What moment in the story is the most intense?
The most intense moment is when the wolf deceives the first lamb.

Power BI:

When creating Power BI reports for clients, there are two icons in the Visualization pane that I
always just skip over – the Python Visual and the R Visual. I’m not familiar with Python, and even
less with R, and the standard and custom Power BI visuals have been enough so far – why would
I need to use them?
However, over the weekend, curiosity got the better of me, and turns out, it’s pretty easy to get
started.
A step-by-step tutorial for creating a correlation heatmap
Annie Leung (WARDY IT Solutions, Australia) & Janni Leung (The University of Queensland, Australia), 2020
This is a step-by-step tutorial of how to use Python in Power BI. We used the example dataset
“Admission_Predict.csv” to create a correlation heatmap in Power BI, using Python’s pandas,
matplotlib and seaborn library.
Pre-requisites
To create Python Visualizations in Power BI, you will need:
● Power BI Desktop installed
● Python installed
● The example dataset Admission_Predict.csv
● Basic Power BI and Python skills
I’m using a Windows 10 machine for this exercise, instructions may differ using a different
operating system.
The Cheat Sheet
To summarize, this is what we are doing to create a Python visualization in Power BI:
● Install Python libraries
● Load “Admission_Predict.csv” dataset into Power BI
● Create the Python Visual
● Write Python visual script into the Python script editor
How did I do it?
Install Python Libraries
For this exercise, we will need to install the Python libraries pandas, matplotlib and seaborn. To
do so:
Open command prompt with “Run as administrator”.

Run the following commands separately to install the required libraries:


pip install pandas
pip install matplotlib
pip install seaborn

If pip is not recognised, it means either Python has not been installed, or the Python’s path has
not been added to your system’s environment variables.

Use the following command to find out where your python installation is, and add it to your
environment variables.
where python

Once the required libraries are installed, we are ready to create the Power BI report.
Load “Admission_Predict.csv” dataset into Power BI
Open Power BI Desktop and select Get Data.

Choose Text/CSV click Connect.

Browse to your “Admission_Predict.csv” dataset, and click Open.

Preview the file and select Load to import the dataset into your Power BI file.

We are now ready to create the Python Visual.
Create the Python Visual
In the Visualization pane, click on the PY icon to create a Python Visual.

You may be prompted to Enable script visuals – click on Enable.

With the Python Visual selected, click on each of the fields in the Fields pane except for “Serial No.”. This
will add all these elements into the Python Visual.

Ensure that each of the fields are not summarized.

By adding these fields into the Values pane of the Python Visual, Power BI has automatically created a pandas dataframe called "dataset", which is ready to be used in the Python script editor.
Write Python visual script into the Python script editor
We are creating a correlation heatmap using Python to analyse the graduate admissions dataset.
Ensure the Python Visual is selected, and copy and paste the following script into the Python script editor,
under “# Paste or type your script code here”:
# Paste or type your script code here:
import seaborn as sns
import matplotlib.pyplot as plt

corr_matrix = dataset.corr()  # create a correlation matrix using the .corr function
sns.heatmap(corr_matrix, annot=True)  # create a seaborn heatmap using the correlation matrix, with the values showing as the annotation
plt.show()  # this is required by Power BI to show the visual

Click on the “Run scripts” button, and the Heatmap should render onto your Power BI page.

And that’s pretty much it! You can now adapt this Python script to create any visualizations that are possible
with Python.

4.4.3, 4.4.4 & 4.4.5: described already.

Individual Activity:
● Discuss visualization methods and power BI.

SELF-CHECK QUIZ 4.4

Check your understanding by answering the following questions:

Write the appropriate/correct answer of the following:

1. What is Power BI?

2. Describe the Power BI process.

LEARNER JOB SHEET 4

Qualification: 2 Years Experience in IT Sector

Learning unit:
PREPARE AND VISUALIZE DATA

Learner name:

Personal protective
equipment (PPE):

Materials: Computer and Internet connection

Tools and equipment:

Performance criteria: 1. Data sources are identified that are relevant to the business
problem.
2. Data is loaded from multiple data sources.
3. Complex data is loaded using appropriate data acquisition
techniques.
4. Data is stored in the required format.
5. Cases which require data transformations are identified.
6. Impacts of imbalanced data are explained.
7. Feature selection techniques are applied.
8. Data transformation is performed.
9. Slicing, indexing, sub-setting, merging and joining datasets
are performed.
10. Facts, Dimensions and schemas are identified and explained.
11. Techniques are applied to handle missing values.
12. Outliers are identified, visualized and dealt with.
13. Fully usable dataset is constructed by cleaning and
transforming data.
14. Steps of EDA are described.
15. Appropriate features/ variables are identified for the analysis
16. Dataset is parsed by cleaning, treating missing values &
outliers and transforming as required.
17. Bi-variate and multivariate analysis are performed to find
associations between variables.
18. Variables and relationships are visualized using various
visualization methods suitable for various data types.
19. Python visualization libraries (matplotlib, plotly) or Power BI or
Tableau are used to plot charts and graphs.
20. Plots are analyzed to identify important patterns.
21. Reports are generated using Power BI or Tableau.

Measurement:

Notes:

Procedure: 5. Connect computers with internet connection.
6. Connect router with internet.

Learner signature: Date:

Assessor signature: Date:

Quality Assurer
Date:
signature:

Assessor remarks:

Feedback:

ANSWER KEYS

ANSWER KEY 4.1

1. Data loading is the process of copying and loading data or data sets from a source file, folder or
application to a database or similar application. It is usually implemented by copying digital data from a
source and pasting or loading the data to a data storage or processing utility.

2. Using multiple sources of data means accessing a variety of types and kinds of data from more than one
place. Some common data sources found within education include state assessments, school-site
information, such as student cumulative files, and of course classroom-based sources.

ANSWER KEY 4.2

1.Imbalanced data refers to those types of datasets where the target class has an uneven distribution of
observations, i.e one class label has a very high number of observations and the other has a very low
number of observations

2. Filter methods select features using statistical measures of the relationship between each feature and the target variable, for example correlation coefficients, chi-square, mutual information/information gain or ANOVA scores, independently of any machine learning model. Features are ranked by these scores and only the top-ranked features are kept before model training.

ANSWER KEY 4.3

1. A scatter plot is a set of points plotted on a horizontal and vertical axes. Scatter plots are important in
statistics because they can show the extent of correlation, if any, between the values of observed
quantities or phenomena (called variables).
2. Bivariate analysis is a kind of statistical analysis when two variables are observed against each other.
One of the variables will be dependent and the other is independent. The variables are denoted by X and
Y. The changes are analyzed between the two variables to understand to what extent the change has
occurred.
ANSWER KEY 4.4

1. Power BI service is a secure Microsoft hosted cloud service that lets users view dashboards, reports, and Power BI apps — a type of content that combines related dashboards and reports — using a web browser or via mobile apps for Windows, iOS, and Android.
2. The overall process of Power BI is ETL, which stands for extract, transform and load. The data is first extracted from the data sources in Power BI. The data is then transformed into the required format. Once you are done with shaping the data, you can load the data in the Data Catalog.

Module 5: Build, Validate And Deploy Model

MODULE CONTENT

Module Descriptor: This unit covers the knowledge, skills, and attitudes required to build, validate and deploy models. It specifically includes demonstrating understanding of modelling techniques, performing data modelling, carrying out data validation and carrying out model deployment.

Nominal Duration: 60 hours

LEARNING OUTCOMES:

Upon completion of the module, the trainee should be able to:

5.1 Demonstrate understanding of modeling techniques


5.2 Perform Data Modeling.
5.3 Carry Out data Validation.
5.4 Carry Out Model Deployment.

PERFORMANCE CRITERIA:

1. Machine learning algorithms are interpreted.


2. Ensemble learning techniques are explained.
3. Potential models are selected based on the available data, data distributions and
goals of the project.
4. Feature selection techniques are applied.
5. Dimension reduction techniques are applied.
6. Supervised and unsupervised models are tested on the dataset.
7. Usages of Scikit-learn library to build models are demonstrated.
8. Cross validation techniques are applied on the dataset to validate model.
9. Machine learning model's performance is evaluated using performance evaluation
factors
10. Report is prepared with findings and conclusions for the data science/ business
audience.
11. DJANGO and REST API are interpreted.
12. Project is evaluated and drawbacks are incorporated and rectified.

Learning Outcome 5.1 – Demonstrate understanding of
modeling techniques

Contents:

● Data Modelling
● Data Validation
● Model Deployment

Assessment criteria:

1. Machine learning algorithms are interpreted.


2. Ensemble learning techniques are explained.

Resources required:

Students/trainees must be provided with the following resources:


▪ Workplace (Computer, Internet connection).

LEARNING ACTIVITY 5.1

Learning Activity Resources/Special Instructions/References


BUILD, VALIDATE AND DEPLOY MODEL ▪ Information Sheet: 5.1
▪ Self-Check: 5.1
▪ Answer Key: 5.1

INFORMATION SHEET 5.1

Learning Objective: Demonstrate understanding of modelling techniques

● Machine learning algorithms are interpreted:

What is machine learning?

Machine learning is a branch of artificial intelligence (AI) and computer science which focuses on
the use of data and algorithms to imitate the way that humans learn, gradually improving its
accuracy.

How Does Machine Learning Work?

Similar to how the human brain gains knowledge and understanding, machine learning relies on
input, such as training data or knowledge graphs, to understand entities, domains and the
connections between them. With entities defined, deep learning can begin.
The machine learning process begins with observations or data, such as examples, direct
experience or instruction. It looks for patterns in data so it can later make inferences based on
the examples provided. The primary aim of ML is to allow computers to learn autonomously
without human intervention or assistance and adjust actions accordingly.

What is a machine learning algorithm?

A machine learning algorithm is the method by which the AI system conducts its task, generally
predicting output values from given input data. The two main processes of machine learning
algorithms are classification and regression.

Regression
The term regression is used when you try to find the relationship between variables.
In Machine Learning, and in statistical modeling, that relationship is used to predict the outcome
of future events.

Linear Regression:

Linear regression uses the relationship between the data-points to draw a straight line through all
them.
This line can be used to predict future values.

In Machine Learning, predicting the future is very important.

How Does it Work?


Python has methods for finding a relationship between data points and drawing a line of linear regression. We will show you how to use these methods instead of going through the mathematical formula.
In the example below, the x-axis represents age, and the y-axis represents speed. We have
registered the age and speed of 13 cars as they were passing a tollbooth. Let us see if the data
we collected could be used in a linear regression:

Example:

Start by drawing a scatter plot :

import matplotlib.pyplot as plt


x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
plt.scatter(x, y)
plt.show()

Output: a scatter plot of the age and speed data points is displayed.
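To actually fit the regression line rather than only plotting the points, one minimal sketch (using SciPy's linregress on the same lists) is:

import matplotlib.pyplot as plt
from scipy import stats

x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

slope, intercept, r, p, std_err = stats.linregress(x, y)
line = [slope * xi + intercept for xi in x]

plt.scatter(x, y)               # the original data points
plt.plot(x, line, color="red")  # the fitted regression line
plt.show()

print("Predicted speed of a 10 year old car:", slope * 10 + intercept)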

What is Logistic Regression ?

Logistic Regression is a Machine Learning classification algorithm that is used to predict the
probability of a categorical dependent variable. In logistic regression, the dependent variable is a
binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.)

Logistic Regression Assumptions

● Binary logistic regression requires the dependent variable to be binary.

● For a binary regression, the factor level 1 of the dependent variable should

represent the desired outcome.

● Only the meaningful variables should be included.

● The independent variables should be independent of each other. That is, the model

should have little or no multicollinearity.

● The independent variables are linearly related to the log odds.

● Logistic regression requires quite large sample sizes.

Keeping the above assumptions in mind, let’s look at our dataset.

Data

The dataset can be downloaded from here.

https://raw.githubusercontent.com/madmashup/targeted-marketing-predictive-
engine/master/banking.csv

import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
plt.rc("font", size=14)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import seaborn as sns
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)
data = pd.read_csv('banking.csv', header=0)  # the file downloaded from the link above
data = data.dropna()
# print(data.shape)
# print(list(data.columns))
print(data.head())
data['education'].unique()
data['y'].value_counts()
sns.countplot(x='y', data=data, palette='hls')
plt.savefig('count_plot')  # save the figure before plt.show() clears it
plt.show()

Output:

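The imports above already include LogisticRegression and train_test_split; a minimal sketch of actually fitting the classifier (the one-hot encoding step is an assumption, since the categorical columns must be numeric first) is:

X = pd.get_dummies(data.drop('y', axis=1), drop_first=True)
y = data['y']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

logreg = LogisticRegression(max_iter=1000)   # larger max_iter to help convergence
logreg.fit(X_train, y_train)

print("Test accuracy:", logreg.score(X_test, y_test))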
Decision Tree Algorithm

Decision Tree:

In this chapter we will show you how to make a "Decision Tree". A Decision Tree is a Flow Chart,
and can help you make decisions based on previous experience.

In the example, a person will try to decide if he/she should go to a comedy show or not.

Luckily our example person has registered every time there was a comedy show in town, and
registered some information about the comedian, and also registered if he/she went or not.

Age Experience Rank Nationality Go

36 10 9 UK NO

42 12 4 USA NO

23 4 6 N NO

52 4 4 USA NO

43 21 8 USA YES

44 14 5 UK NO

66 3 7 N YES

35 14 9 UK YES

52 13 7 N YES

35 5 9 N YES

24 3 5 USA NO

18 3 7 UK YES

45 9 9 UK YES

Now, based on this data set, Python can create a decision tree that can be used to decide if any new show is worth attending.

How Does it Work?

First, import the modules you need, and read the dataset with pandas:

import pandas
from sklearn import tree
import pydotplus
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
import matplotlib.image as pltimg
df = pandas.read_csv("shows.csv")
print(df)

Age Experience Rank Nationality Go


0 36 10 9 UK NO
1 42 12 4 USA NO
2 23 4 6 N NO
3 52 4 4 USA NO
4 43 21 8 USA YES
5 44 14 5 UK NO
6 66 3 7 N YES
7 35 14 9 UK YES
8 52 13 7 N YES
9 35 5 9 N YES
10 24 3 5 USA NO
11 18 3 7 UK YES
12 45 9 9 UK YES
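A minimal sketch of the usual continuation of this example (mapping the text columns to numbers and fitting the tree) is:

# Map the categorical columns to numeric codes.
d = {'UK': 0, 'USA': 1, 'N': 2}
df['Nationality'] = df['Nationality'].map(d)
d = {'YES': 1, 'NO': 0}
df['Go'] = df['Go'].map(d)

features = ['Age', 'Experience', 'Rank', 'Nationality']
X = df[features]
y = df['Go']

dtree = DecisionTreeClassifier()
dtree.fit(X, y)

# Predict for a 40 year old comedian with 10 years of experience, rank 7,
# nationality code 1 (USA in the mapping above).
print(dtree.predict([[40, 10, 7, 1]]))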

Support Vector Machines:

Generally, Support Vector Machines are considered a classification approach, but they can be employed in both classification and regression problems. They can easily handle multiple
continuous and categorical variables. SVM constructs a hyperplane in multidimensional space to
separate different classes. SVM generates optimal hyperplane in an iterative manner, which is
used to minimize an error. The core idea of SVM is to find a maximum marginal
hyperplane(MMH) that best divides the dataset into classes.

Support Vectors

Support vectors are the data points, which are closest to the hyperplane. These points will define
the separating line better by calculating margins. These points are more relevant to the
construction of the classifier.

Hyperplane

A hyperplane is a decision plane which separates between a set of objects having different class
memberships.

Margin

A margin is the gap between the two lines drawn through the closest points of each class. It is calculated as the perpendicular distance from the separating line to the support vectors (the closest points). A larger margin between the classes is considered a good margin; a smaller margin is a bad margin.

How does SVM work?

The main objective is to segregate the given dataset in the best possible way. The distance
between the hyperplane and the nearest points on either side is known as the margin. The objective is to select a hyperplane
with the maximum possible margin between support vectors in the given dataset. SVM searches
for the maximum marginal hyperplane in the following steps:

1. Generate hyperplanes which segregates the classes in the best way. Left-hand side
figure showing three hyperplanes black, blue and orange. Here, the blue and orange
have higher classification error, but the black is separating the two classes correctly.
2. Select the right hyperplane with the maximum segregation from the either nearest data
points as shown in the right-hand side figure.

#Import scikit-learn dataset library
from sklearn import datasets
#Load dataset
cancer = datasets.load_breast_cancer()
print("Features: ", cancer.feature_names)
# print the label type of cancer('malignant' 'benign')
print("Labels: ", cancer.target_names)
cancer.data.shape
print(cancer.data[0:5])
print(cancer.target)

Features: ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
'mean smoothness' 'mean compactness' 'mean concavity'
'mean concave points' 'mean symmetry' 'mean fractal dimension'
'radius error' 'texture error' 'perimeter error' 'area error'
'smoothness error' 'compactness error' 'concavity error'
'concave points error' 'symmetry error' 'fractal dimension error'
'worst radius' 'worst texture' 'worst perimeter' 'worst area'
'worst smoothness' 'worst compactness' 'worst concavity'
'worst concave points' 'worst symmetry' 'worst fractal dimension']
Labels: ['malignant' 'benign']
[[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
4.601e-01 1.189e-01]
[2.057e+01 1.777e+01 1.329e+02 1.326e+03 8.474e-02 7.864e-02 8.690e-02
7.017e-02 1.812e-01 5.667e-02 5.435e-01 7.339e-01 3.398e+00 7.408e+01
5.225e-03 1.308e-02 1.860e-02 1.340e-02 1.389e-02 3.532e-03 2.499e+01
2.341e+01 1.588e+02 1.956e+03 1.238e-01 1.866e-01 2.416e-01 1.860e-01
2.750e-01 8.902e-02]
[1.969e+01 2.125e+01 1.300e+02 1.203e+03 1.096e-01 1.599e-01 1.974e-01
1.279e-01 2.069e-01 5.999e-02 7.456e-01 7.869e-01 4.585e+00 9.403e+01
6.150e-03 4.006e-02 3.832e-02 2.058e-02 2.250e-02 4.571e-03 2.357e+01
2.553e+01 1.525e+02 1.709e+03 1.444e-01 4.245e-01 4.504e-01 2.430e-01
3.613e-01 8.758e-02]
[1.142e+01 2.038e+01 7.758e+01 3.861e+02 1.425e-01 2.839e-01 2.414e-01
1.052e-01 2.597e-01 9.744e-02 4.956e-01 1.156e+00 3.445e+00 2.723e+01
9.110e-03 7.458e-02 5.661e-02 1.867e-02 5.963e-02 9.208e-03 1.491e+01
2.650e+01 9.887e+01 5.677e+02 2.098e-01 8.663e-01 6.869e-01 2.575e-01
6.638e-01 1.730e-01]
[2.029e+01 1.434e+01 1.351e+02 1.297e+03 1.003e-01 1.328e-01 1.980e-01
1.043e-01 1.809e-01 5.883e-02 7.572e-01 7.813e-01 5.438e+00 9.444e+01
1.149e-02 2.461e-02 5.688e-02 1.885e-02 1.756e-02 5.115e-03 2.254e+01
1.667e+01 1.522e+02 1.575e+03 1.374e-01 2.050e-01 4.000e-01 1.625e-01
2.364e-01 7.678e-02]]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
... (remaining rows of 0 and 1 target labels) ...
1 1 1 1 1 1 1 0 0 0 0 0 0 1]
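A minimal sketch of actually training and evaluating an SVM classifier on this dataset (continuing from the cancer object loaded above) is:

from sklearn.model_selection import train_test_split
from sklearn import svm, metrics

X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=109)

clf = svm.SVC(kernel='linear')   # SVM classifier with a linear kernel
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", metrics.accuracy_score(y_test, y_pred))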

Naive Bayes:
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’
theorem with the “naive” assumption of conditional independence between every pair of features
given the value of the class variable. Bayes’ theorem states the following relationship, given
class variable y and dependent feature vector x1 through xn :
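The relationship referred to is the standard form of Bayes' theorem:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}$$

With the naive conditional independence assumption this simplifies to

$$P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$$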

Since P(x1, …, xn) is constant given the input, we can use Maximum A Posteriori (MAP) estimation to estimate P(y) and P(xi | y); the former is then the relative frequency of class y in the training set.
The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of P(xi | y).
In spite of their apparently over-simplified assumptions, naive Bayes classifiers have worked
quite well in many real-world situations, famously document classification and spam filtering.
They require a small amount of training data to estimate the necessary parameters. (For
theoretical reasons why naive Bayes works well, and on which types of data it does, see the
references below.)
Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated
methods. The decoupling of the class conditional feature distributions means that each
distribution can be independently estimated as a one dimensional distribution. This in turn helps
to alleviate problems stemming from the curse of dimensionality.
On the flip side, although naive Bayes is known as a decent classifier, it is known to be a bad
estimator, so the probability outputs from predict_proba are not to be taken too seriously.
References:
H. Zhang (2004). The optimality of Naive Bayes. Proc. FLAIRS.
Gaussian Naive Bayes
GaussianNB implements the Gaussian Naive Bayes algorithm for classification. The likelihood of
the features is assumed to be Gaussian:
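The assumed Gaussian likelihood (the standard form used by GaussianNB) is:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$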

The parameters σy and μy are estimated using maximum likelihood.

from sklearn.datasets import load_iris


from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)

print("Number of mislabeled points out of a total %d points : %d"
% (X_test.shape[0], (y_test != y_pred).sum()))

Output :

Number of mislabeled points out of a total 75 points : 4

k-nearest neighbor algorithm:


This algorithm is used to solve classification problems. The K-nearest neighbor (K-NN) algorithm essentially creates an imaginary boundary to classify the data; when new data points come in, the algorithm assigns them according to which side of the boundary they lie nearest to.
A larger k value therefore means smoother curves of separation, resulting in less complex models, whereas a smaller k value tends to overfit the data, resulting in more complex models.
Example of the k-nearest neighbor algorithm

# Import necessary modules


from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Loading data
irisData = load_iris()

# Create feature and target arrays


X = irisData.data
y = irisData.target

# Split into training and test set


X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=7)

knn.fit(X_train, y_train)

# Predict on dataset which model has not seen before


print(knn.predict(X_test))

Output :

[1 0 2 1 1 0 1 2 2 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]

In the example shown above following steps are performed:

1. The k-nearest neighbor algorithm is imported from the scikit-learn package.

2. Create feature and target variables.

3. Split data into training and test data.

4. Generate a k-NN model using neighbors value.

5. Train or fit the data into the model.

6. Predict the future.

We have seen how we can use K-NN algorithm to solve the supervised machine learning
problem. But how to measure the accuracy of the model?

Consider an example shown below where we predicted the performance of the above model:

# Import necessary modules


from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Loading data
irisData = load_iris()

# Create feature and target arrays


X = irisData.data
y = irisData.target

# Split into training and test set


X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size = 0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=7)

knn.fit(X_train, y_train)

# Calculate the accuracy of the model


print(knn.score(X_test, y_test))

Output :

0.9666666666666667

Model Accuracy:
So far so good. But how do we decide the right k value for the dataset? Obviously, we need to be familiar with the data to get a range of expected k values, but to get the exact k value we need to test the model for each candidate k value. Refer to the example shown below.

# Import necessary modules


from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import numpy as np
import matplotlib.pyplot as plt

irisData = load_iris()

# Create feature and target arrays


X = irisData.data
y = irisData.target

# Split into training and test set


X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)

neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# Loop over K values


for i, k in enumerate(neighbors):
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)

# Compute training and test data accuracy


train_accuracy[i] = knn.score(X_train, y_train)
test_accuracy[i] = knn.score(X_test, y_test)

# Generate plot
plt.plot(neighbors, test_accuracy, label='Testing dataset Accuracy')
plt.plot(neighbors, train_accuracy, label='Training dataset Accuracy')

plt.legend()
plt.xlabel('n_neighbors')
plt.ylabel('Accuracy')
plt.show()

Output :

Random Forest :

Random forest is a type of supervised machine learning algorithm based on ensemble learning.
Ensemble learning is a type of learning where you join different types of algorithms, or the same algorithm multiple times, to form a more powerful prediction model. The random forest algorithm combines multiple algorithms of the same type, i.e. multiple decision trees, resulting in a forest of trees, hence the name "Random Forest". The random forest algorithm can be used for both regression and classification tasks.

How the Random Forest Algorithm Works


The following are the basic steps involved in performing the random forest algorithm:

1. Pick N random records from the dataset.


2. Build a decision tree based on these N records.
3. Choose the number of trees you want in your algorithm and repeat steps 1 and 2.

4. In case of a regression problem, for a new record, each tree in the forest predicts a value
for Y (output). The final value can be calculated by taking the average of all the values
predicted by all the trees in forest. Or, in case of a classification problem, each tree in the
forest predicts the category to which the new record belongs. Finally, the new record is
assigned to the category that wins the majority vote.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
import numpy as np
from sklearn import metrics

dataset = pd.read_csv('petrol_consumption.csv')
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


# Scale the features before fitting the model (note: tree-based models are
# insensitive to feature scaling, so the reported errors are unchanged).
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))


print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Output :

Mean Absolute Error: 51.76500000000001


Mean Squared Error: 4216.166749999999
Root Mean Squared Error: 64.93201637097064

What is a Convolutional Neural Network (CNN)?

In deep learning, a convolutional neural network (CNN/ConvNet) is a class of deep neural
networks, most commonly applied to analysing visual imagery.

Image classification

Convolutional neural networks are often used for image classification. By recognising valuable
features, a CNN can identify different objects in images. This ability makes them useful in
medicine, for example for MRI diagnostics. CNNs can also be used in agriculture: the networks
receive images from satellites such as Landsat and can use this information to classify land based
on its level of cultivation. This data can then be used for making predictions about soil fertility or
for developing a strategy for the optimal use of farmland. Hand-written digit recognition is also one
of the earliest uses of CNNs in computer vision.

Object detection

Self-driving cars, AI-powered surveillance systems, and smart homes often use CNNs to identify
and mark objects. A CNN can identify objects in photos and in real time, and classify and label
them. This is how an automated vehicle finds its way around other cars and pedestrians, and how
smart homes recognise the owner's face among all others.

Audio visual matching

YouTube, Netflix, and other video streaming services use audio visual matching to improve their
platforms. Sometimes the user’s requests can be very specific, for example, ‘movies about
zombies in space’, but the search engine should satisfy even such exotic requests.

Object reconstruction

You can use CNN for 3D modelling of real objects in the digital space. Today there are CNN
models that create 3D face models based on just one image. Similar technologies can be used
for creating digital twins, which are useful in architecture, biotech, and manufacturing.

Speech recognition

Even though CNNs are often used to work with images, that is not their only possible use.
ConvNets can help with speech recognition and natural language processing; for example,
Facebook's speech recognition technology is based on convolutional neural networks.

Generative adversarial network(GAN )

What is a Generative Adversarial Network (GAN)?

GANs are about creating, like drawing a portrait or composing a symphony. This is hard compared
to other deep learning tasks: it is much easier to identify a Monet painting than to paint one,
whether by computers or by people. But it brings us closer to understanding intelligence, and the
idea has led to thousands of GAN research papers in recent years. In game development, many
production artists are hired to create animation, and some of those tasks are routine. By applying
automation with GANs, we may one day focus on the creative side rather than repeating routine
tasks daily. This section covers the GAN concept and its algorithm.

Generator and discriminator

What does GAN do?

The main focus of a GAN (Generative Adversarial Network) is to generate data from scratch,
mostly images, although other domains such as music have also been explored. The scope of
application is far bigger than this: for example, a GAN can generate a zebra image from a horse
image, and in reinforcement learning it can help a robot learn much faster.

Unsupervised Learning
Unsupervised learning refers to machine learning algorithms that search for previously unknown
patterns within a data set containing no labeled responses and without human supervision. The
most prominent methods of unsupervised learning are cluster analysis and principal component
analysis.

Preparing data for Unsupervised Learning

For our example, we'll use the Iris dataset to make predictions. The dataset contains a set of 150
records under four attributes — petal length, petal width, sepal length, sepal width, and three iris
classes: setosa, virginica and versicolor. We'll feed the four features of our flower to the
unsupervised algorithm and it will predict which class the iris belongs to.
We use the scikit-learn library in Python to load the Iris dataset and matplotlib for data
visualization. Below is the code snippet for exploring the dataset.

# Importing Modules
from sklearn import datasets
import matplotlib.pyplot as plt

# Loading dataset
iris_df = datasets.load_iris()

# Available methods on dataset


print(dir(iris_df))

# Features
print(iris_df.feature_names)

# Targets
print(iris_df.target)

# Target Names
print(iris_df.target_names)
label = {0: 'red', 1: 'blue', 2: 'green'}

# Dataset Slicing
x_axis = iris_df.data[:, 0] # Sepal Length
y_axis = iris_df.data[:, 2] # Petal Length

# Plotting
plt.scatter(x_axis, y_axis, c=iris_df.target)
plt.show()

Output :

['DESCR', 'data', 'feature_names', 'target', 'target_names']


['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
['setosa' 'versicolor' 'virginica']

VIOLET: SETOSA, GREEN: VERSICOLOR, YELLOW: VIRGINICA

What is K-Means Algorithm?

K-Means Clustering is an unsupervised learning algorithm which groups an unlabeled dataset
into different clusters. Here K defines the number of pre-defined clusters that need to be created
in the process; if K=2 there will be two clusters, for K=3 there will be three clusters, and so on.

Working of K-Means Algorithm


The following stages will help us understand how the K-Means clustering technique works-

● Step 1: First, we need to provide the number of clusters, K, that need to be


generated by this algorithm.
● Step 2: Next, choose K data points at random and assign each to a cluster.
Briefly, categorize the data based on the number of data points.
● Step 3: The cluster centroids will now be computed.
● Step 4: Iterate the steps below until the centroids stabilise, i.e. the assignment of
data points to clusters no longer changes.
● 4.1 Compute the sum of squared distances between the data points and the
centroids.
● 4.2 Assign each data point to the cluster whose centroid is closest to it.
● 4.3 Recompute the centroids of the clusters by averaging all of the data points in
each cluster.

# Importing Modules
from sklearn import datasets
from sklearn.cluster import KMeans

# Loading dataset
iris_df = datasets.load_iris()

# Declaring Model
model = KMeans(n_clusters=3)

# Fitting Model
model.fit(iris_df.data)

# Predicting a single input


predicted_label = model.predict([[7.2, 3.5, 0.8, 1.6]])

# Prediction on the entire data


all_predictions = model.predict(iris_df.data)

# Printing Predictions
print(predicted_label)
print(all_predictions)

Output :

[1]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 2 2 2 0 2 2 2 2
 2 2 0 0 2 2 2 2 0 2 0 2 0 2 2 0 0 2 2 2 2 2 0 2 2 2 2 0 2 2 2 0 2 2 2 0 2
2 0]

Association Rule Learning

Association rule learning is an unsupervised learning technique that checks for the dependency
of one data item on another and maps the items accordingly, so that the relationship can be
exploited (for example, to increase profit). It tries to find interesting relations or associations
among the variables of a dataset, and it is based on different rules for discovering those relations
in the database.

How does Association Rule Learning work?

Association rule learning works on the concept of if-then statements, such as "if A then B".

Here the "if" element is called the antecedent, and the "then" part is called the consequent. A
relationship in which we can find an association between two single items is known as single
cardinality. Rule learning is all about creating rules, and as the number of items increases, the
cardinality increases accordingly. So, to measure the associations between thousands of data
items, several metrics are used. These metrics are given below:

● Support
● Confidence
● Lift

Let's understand each of them:

Support

Support is the frequency of an item, i.e. how frequently an itemset appears in the dataset. It is
defined as the fraction of the transactions T that contain the itemset X. For an itemset X and a set
of transactions T, it can be written as:

Support(X) = freq(X) / (total number of transactions T)

Confidence

Confidence indicates how often the rule has been found to be true, i.e. how often items X and Y
occur together in the dataset relative to the transactions in which X occurs. It is the ratio of the
number of transactions that contain both X and Y to the number of transactions that contain X:

Confidence(X -> Y) = freq(X and Y) / freq(X)

Lift

Lift measures the strength of a rule and is defined as:

Lift(X -> Y) = Support(X and Y) / (Support(X) x Support(Y))

It is the ratio of the observed support to the support expected if X and Y were independent of
each other. It can fall into three ranges of values:

● If Lift = 1: the occurrence of the antecedent and the consequent are independent of
each other.

● If Lift > 1: the two itemsets are positively dependent on each other, i.e. they occur
together more often than expected.

● If Lift < 1: one item is a substitute for the other, which means one item has a negative
effect on the occurrence of the other.
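To make these metrics concrete, here is a minimal sketch that computes support, confidence and lift for a single rule on a small, made-up list of transactions (the transactions and the rule {bread} -> {butter} are illustrative assumptions, not taken from a real dataset):

# Minimal sketch: support, confidence and lift for one rule on toy transactions.
transactions = [
    {'bread', 'butter', 'milk'},
    {'bread', 'butter'},
    {'bread', 'jam'},
    {'butter', 'milk'},
    {'bread', 'butter', 'jam'},
]

antecedent = {'bread'}   # X
consequent = {'butter'}  # Y

n = len(transactions)
freq_x = sum(1 for t in transactions if antecedent <= t)
freq_y = sum(1 for t in transactions if consequent <= t)
freq_xy = sum(1 for t in transactions if (antecedent | consequent) <= t)

support_xy = freq_xy / n                            # Support(X -> Y)
confidence = freq_xy / freq_x                       # Confidence(X -> Y)
lift = support_xy / ((freq_x / n) * (freq_y / n))   # Lift(X -> Y)

print(support_xy, confidence, lift)

For these five transactions the rule has support 0.6, confidence 0.75 and lift about 0.94, i.e. bread and butter occur together slightly less often than independence would predict.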

Types of Association Rule Learning

Association rule learning can be divided into three main algorithms:


Apriori Algorithm

This algorithm uses frequent itemsets to generate association rules. It is designed to work on
databases that contain transactions. It uses a breadth-first search and a hash tree to count
itemsets efficiently.

It is mainly used for market basket analysis and helps to understand the products that can be
bought together. It can also be used in the healthcare field to find drug reactions for patients.

Eclat Algorithm

Eclat stands for Equivalence Class Transformation. This algorithm uses a depth-first search
technique to find frequent itemsets in a transaction database. It generally executes faster than
the Apriori algorithm.

F-P Growth Algorithm

The FP-Growth algorithm stands for Frequent Pattern Growth, and it is an improved version of the
Apriori algorithm. It represents the database in the form of a tree structure known as a frequent
pattern tree (FP-tree). The purpose of this tree is to extract the most frequent patterns.

Applications of Association Rule Learning

It has various applications in machine learning and data mining. Below are some popular
applications of association rule learning:

● Market Basket Analysis: It is one of the popular examples and applications of


association rule mining. This technique is commonly used by big retailers to determine
the association between items.

● Medical Diagnosis: Association rules can assist diagnosis and treatment, as they help in
identifying the probability of illness for a particular disease.

● Protein Sequence: The association rules help in determining the synthesis of artificial
Proteins.

● It is also used for the Catalog Design and Loss-leader Analysis and many more other
applications.

What is reinforcement learning?

Reinforcement Learning is a machine learning paradigm in which a learning algorithm is trained
not on preset data but rather through a feedback (reward) system. These algorithms are touted
as the future of machine learning, as they eliminate the cost of collecting and cleaning the data.

What is Q-Learning?

Q-Learning is a basic form of Reinforcement Learning which uses Q-values (also called action
values) to iteratively improve the behavior of the learning agent.

import numpy as np
import pylab as pl
import networkx as nx

edges = [(0, 1), (1, 5), (5, 6), (5, 4), (1, 2),
(1, 3), (9, 10), (2, 4), (0, 6), (6, 7),
(8, 9), (7, 8), (1, 7), (3, 9)]

goal = 10
G = nx.Graph()
G.add_edges_from(edges)
pos = nx.spring_layout(G)
nx.draw_networkx_nodes(G, pos)
nx.draw_networkx_edges(G, pos)
nx.draw_networkx_labels(G, pos)
pl.show()

Output: a drawing of the state graph defined by the edges above.
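The code above only builds and draws the state graph. As a complement, here is a minimal, illustrative Q-learning loop on that same graph; the reward scheme (100 for reaching the goal, 0 for any other move), the discount factor, the learning rate and the number of iterations are assumptions made for demonstration and are not part of the original example.

import numpy as np

# Reuse the edges and the goal state defined above.
edges = [(0, 1), (1, 5), (5, 6), (5, 4), (1, 2),
         (1, 3), (9, 10), (2, 4), (0, 6), (6, 7),
         (8, 9), (7, 8), (1, 7), (3, 9)]
goal = 10
n_states = 11

# Reward matrix: -1 marks "no edge", 0 an ordinary move, 100 a move into the goal.
R = np.full((n_states, n_states), -1.0)
for a, b in edges:
    R[a, b] = 100.0 if b == goal else 0.0
    R[b, a] = 100.0 if a == goal else 0.0

Q = np.zeros((n_states, n_states))
gamma, alpha = 0.8, 1.0          # discount factor and learning rate (assumed)
rng = np.random.default_rng(0)

for _ in range(500):             # each iteration performs one random transition
    state = rng.integers(n_states)
    actions = np.where(R[state] >= 0)[0]      # neighbouring states
    action = rng.choice(actions)
    # Q-learning update: Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*max Q(s', .))
    Q[state, action] = (1 - alpha) * Q[state, action] + alpha * (
        R[state, action] + gamma * Q[action].max())

print(np.round(Q, 1))            # learned action-value table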

What is the temporal difference (TD) learning algorithm?

Temporal difference (TD) learning is an approach to learning how to predict a quantity that
depends on future values of a given signal. The name TD derives from its use of changes, or
differences, in predictions over successive time steps to drive the learning process.
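As a worked illustration, the core of TD(0) is the update V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]. The tiny sketch below applies one such update; the states, reward, step size and discount factor are made-up values used only to show the arithmetic.

# One TD(0) update: move the value of the current state towards the TD target
# r + gamma * V(next_state). All numbers here are illustrative.
alpha, gamma = 0.1, 0.9                # step size and discount factor (assumed)
V = {'A': 0.0, 'B': 0.5}               # current value estimates (assumed)

state, next_state, reward = 'A', 'B', 1.0
td_target = reward + gamma * V[next_state]    # 1.0 + 0.9 * 0.5 = 1.45
td_error = td_target - V[state]               # 1.45 - 0.0 = 1.45
V[state] = V[state] + alpha * td_error

print(V['A'])   # 0.145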

Ensemble learning:

Ensemble learning is a general meta approach to machine learning that seeks better predictive
performance by combining the predictions from multiple models.
Although there are a seemingly unlimited number of ensembles that you can develop for your predictive
modeling problem, there are three methods that dominate the field of ensemble learning. So much so,
that rather than algorithms per se, each is a field of study that has spawned many more specialized
methods.
The three main classes of ensemble learning methods are bagging, stacking, and boosting, and it is
important to both have a detailed understanding of each method and to consider them on your predictive
modeling project.
Stacking

What is stacking ?

In stacking, an algorithm takes the outputs of sub-models as input and attempts to learn
how to best combine the input predictions to make a better output prediction.

Fig 2. Stacking algorithm. The number of weak learners in the stack is variable.

How does stacking work?

1. We split the training data into K folds, just like K-fold cross-validation.

2. A base model is fitted on K-1 parts, and predictions are made for the Kth part.

3. We repeat this for each part of the training data.

4. The base model is then fitted on the whole training data set to calculate its
performance on the test set.

5. We repeat the last three steps for the other base models.

6. The predictions made on the training set are used as features for the second-level
model.

7. The second-level model is used to make a prediction on the test set (a minimal
scikit-learn sketch follows).
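scikit-learn wraps this procedure in a StackingClassifier; a minimal sketch on the Iris data is shown below (the choice of base learners, the logistic-regression meta learner and cv=5 are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Level-0 (base) models feed their cross-validated predictions to the level-1 (meta) model.
stack = StackingClassifier(
    estimators=[('dt', DecisionTreeClassifier(random_state=0)),
                ('knn', KNeighborsClassifier(n_neighbors=5))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5)

stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))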

Blending

Blending is also an ensemble technique that can help us to improve performance and
increase accuracy. It follows the same approach as stacking but uses only a holdout
(validation) set taken from the training set to make predictions. In other words, unlike stacking,
the predictions are made on the holdout set only. The holdout set and the predictions on it are
used to build a meta model, which is then run on the test set. A minimal sketch of the blending
process is shown below:
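This sketch assumes a simple 60/20/20 train/holdout/test split and two base models chosen only for illustration:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
# Split into training, holdout (validation) and test sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_hold, X_test, y_hold, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

base_models = [DecisionTreeClassifier(random_state=0),
               KNeighborsClassifier(n_neighbors=5)]

# Base models are fitted on the training set only; their predictions on the
# holdout set become the features of the meta model (no K-fold, unlike stacking).
hold_preds, test_preds = [], []
for model in base_models:
    model.fit(X_train, y_train)
    hold_preds.append(model.predict(X_hold))
    test_preds.append(model.predict(X_test))

meta_model = LogisticRegression(max_iter=1000)
meta_model.fit(np.column_stack(hold_preds), y_hold)
print(meta_model.score(np.column_stack(test_preds), y_test))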

What Is Bagging in Machine Learning?

Bagging, also known as bootstrap aggregating, is an ensemble learning technique that helps to
improve the performance and accuracy of machine learning algorithms. It is used to deal with the
bias-variance trade-off and reduces the variance of a prediction model. Bagging helps to avoid
overfitting and is used for both regression and classification models, most commonly with
decision tree algorithms.
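A minimal scikit-learn sketch of bagging decision trees (the dataset and the number of estimators are illustrative choices):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Each tree is trained on a bootstrap sample of the training set;
# the ensemble prediction is the majority vote of the trees.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=1)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))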

What is AdaBoost Algorithm?

The AdaBoost algorithm, short for Adaptive Boosting, is a boosting technique used as an
ensemble method in machine learning. It is called adaptive boosting because the weights are
re-assigned to each instance at every iteration, with higher weights assigned to incorrectly
classified instances.
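A minimal sketch with scikit-learn's AdaBoostClassifier (the dataset and hyperparameters are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Weak learners (decision stumps by default) are added one at a time;
# samples misclassified so far receive higher weights in the next round.
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=1)
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))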

What is Gradient Boosting?

Gradient boosting is a machine learning technique used in regression and classification tasks,
among others. It gives a prediction model in the form of an ensemble of weak prediction models,
which are typically decision trees.

How does Gradient Boosting Work?

Gradient boosting works in an interesting way. We have learned that gradient boosting combines
multiple models, but so far we have not said why.

Every decision tree in the ensemble learns something and makes predictions. The prediction of
the first tree is usually a rough one, so there can be a large amount of error; in machine learning,
the error can be measured as the difference between the actual value of a data point and the
predicted value.

To build an optimal predictive model, these differences must be reduced. Each subsequent tree
therefore takes the predictions (and residual errors) of the previous trees and tries to reduce those
errors, while also learning patterns from the data itself. In this way, with every consecutive tree
the result becomes more refined.

Hence the errors are significantly reduced by the end, and the ensemble captures more structure
than a single model would: with one dataset and multiple models we can achieve better results
than with one dataset and one model.

Models that do not achieve great accuracy on their own are called weak learners. The intuition
behind boosting is to combine these weak learners and gradually increase accuracy by reducing
the errors.

The only concern one might raise here is speed: with multiple models, does it still perform fast
enough? In practice the speed of gradient boosting is about average.

import pandas as pd
from sklearn.metrics import classification_report
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

df = pd.DataFrame(load_breast_cancer()['data'],

columns=load_breast_cancer()['feature_names'])

df['y'] = load_breast_cancer()['target']

print(df.head(5))

X,y = df.drop('y',axis=1),df.y

test_size = 0.30 # taking 70:30 training and test set

seed = 7 # Random number seed for repeatability of the code

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size,


random_state=seed)
gradient_booster = GradientBoostingClassifier(learning_rate=0.1)
gradient_booster.fit(X_train,y_train)
print(classification_report(y_train,gradient_booster.predict(X_train)))

Output :

mean radius mean texture ... worst fractal dimension y

0 17.99 10.38 ... 0.11890 0
1 20.57 17.77 ... 0.08902 0
2 19.69 21.25 ... 0.08758 0
3 11.42 20.38 ... 0.17300 0
4 20.29 14.34 ... 0.07678 0

[5 rows x 31 columns]
precision recall f1-score support

0 1.00 1.00 1.00 157


1 1.00 1.00 1.00 241

accuracy 1.00 398


macro avg 1.00 1.00 1.00 398
weighted avg 1.00 1.00 1.00 398

XGBoost :

What is XGBoost ?

XGBoost is an algorithm that has recently been dominating applied machine learning
and Kaggle competitions for structured or tabular data.
XGBoost is an implementation of gradient boosted decision trees designed for speed
and performance
What Algorithm Does XGBoost Use?

The XGBoost library implements the gradient boosting decision tree algorithm.
This algorithm goes by lots of different names such as gradient boosting, multiple
additive regression trees, stochastic gradient boosting or gradient boosting machines.
Boosting is an ensemble technique where new models are added to correct the errors
made by existing models. Models are added sequentially until no further improvements
can be made. A popular example is the AdaBoost algorithm that weights data points that
are hard to predict.
Gradient boosting is an approach where new models are created that predict the
residuals or errors of prior models and then added together to make the final prediction.
It is called gradient boosting because it uses a gradient descent algorithm to minimize
the loss when adding new models.
This approach supports both regression and classification predictive modeling problems.
For more on boosting and gradient boosting, see Trevor Hastie’s talk on Gradient
Boosting Machine Learning.
Using XGBoost in Python

import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
boston = load_boston()
# print(boston.keys())
# print(boston.data.shape)
# print(boston.feature_names)
# print(boston.DESCR)
# print(data.head())

data = pd.DataFrame(boston.data)
data.columns = boston.feature_names
data['PRICE'] = boston.target

# print(data.info())
print(data.describe())
Output :

CRIM ZN INDUS ... B LSTAT PRICE


count 506.000000 506.000000 506.000000 ... 506.000000 506.000000
506.000000
mean 3.613524 11.363636 11.136779 ... 356.674032 12.653063 22.532806
std 8.601545 23.322453 6.860353 ... 91.294864 7.141062 9.197104
min 0.006320 0.000000 0.460000 ... 0.320000 1.730000 5.000000
25% 0.082045 0.000000 5.190000 ... 375.377500 6.950000 17.025000
50% 0.256510 0.000000 9.690000 ... 391.440000 11.360000 21.200000
75% 3.677083 12.500000 18.100000 ... 396.225000 16.955000 25.000000
max 88.976200 100.000000 27.740000 ... 396.900000 37.970000 50.000000

[8 rows x 14 columns]
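The snippet above only loads and describes the data (note that load_boston has been removed from recent versions of scikit-learn, so the data may need to be obtained from another source). Assuming the xgboost package is installed, a minimal continuation that actually fits a boosted model on the `data` DataFrame built above could look like this; the hyperparameters are illustrative choices:

import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Continue from the `data` DataFrame built above.
X, y = data.drop('PRICE', axis=1), data['PRICE']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Gradient boosted trees via the scikit-learn style wrapper.
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100,
                          learning_rate=0.1, max_depth=4, random_state=123)
xg_reg.fit(X_train, y_train)
preds = xg_reg.predict(X_test)

print('RMSE:', np.sqrt(mean_squared_error(y_test, preds)))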

Individual Activity:
▪ Machine learning algorithms are interpreted.
▪ Ensemble learning techniques are explained.

SELF-CHECK QUIZ 5.1

Check your understanding by answering the following questions:

Write the correct answer for the following questions.

1. Which machine learning algorithms were interpreted in this section? Briefly describe one of them.

2. Which techniques are most important in ensemble learning?

Learning Outcome 5.2- Perform Data Modelling

Contents:

▪ Potential models.
▪ Feature selection techniques.
▪ Dimension reduction techniques.
▪ Supervised and unsupervised models.
▪ Scikit-learn library.

Assessment criteria:

1. Potential models are selected based on the available data, data distributions and goals of
the project.
2. Feature selection techniques are applied.
3. Dimension reduction techniques are applied.
4. Supervised and unsupervised models are tested on the dataset.
5. Usages of Scikit-learn library to build models are demonstrated.

Resources required:

Students/trainees must be provided with the following resources:


▪ Workplace (Computer and internet connection).

LEARNING ACTIVITY 5.2

Learning Activity Resources/Special Instructions/References


Perform Data Modelling ▪ Information Sheet: 5.2
▪ Self-Check: 5.2
▪ Answer Key: 5.2

INFORMATION SHEET 5.2

Learning Objective: Exploring Classifiers with Python Scikit-learn — Iris Dataset


Step-by-step guide on how you can build your classifier in Python

For a moment, imagine that you are not a flower expert (if you are an expert, good for you!). Can you
distinguish between three different species of iris — setosa, versicolor, and virginica?

BUT what if we have a dataset that contains instances of these species, with measurements of their sepals
and petals? In other words, can we learn anything from this dataset that would help us distinguish between
the three species?
Dataset
In this blog post, I will explore the Iris dataset from the UCI Machine Learning Repository. Excerpted from
its website, it is said to be “perhaps the best known database to be found in the pattern recognition
literature” [1]. In addition, Jason Brownlee who started the community of Machine Learning Mastery called
it the “Hello World” of machine learning [2].
I would recommend this dataset to anyone who is a beginner in data science and is eager to build their
first ML model. See below for some of the nice characteristics of this dataset:
● 150 samples, with 4 attributes (same units, all numeric)
● Balanced class distribution (50 samples for each class)
● No missing data

As you can see, these characteristics can help minimize the time you need to spend in the data preparation
process so you can focus on building the ML model. It is NOT that the preparation stage is not important.
On the contrary, this process is so important that it can be too time-consuming for some beginners that
they may overwhelm themselves before getting to the model development stage.
As an example, the popular dataset House Prices: Advanced Regression Techniques from Kaggle has
about 80 features and more than 20% of them contain some level of missing data. In that case, you might
need to spend some time understanding the attributes and imputing missing values.
Now hopefully your confidence level (no stats pun intended) is relatively high. Here are some resources
on data wrangling that you can read through as you work on more complex datasets and tasks:
Dimensionality reduction, Imbalanced classification, Feature engineering, and Imputation.

Objectives
There are two questions that we want to be able to answer after exploring this dataset, which are quite
typical in most classification problems:
1. Prediction — given new data points, how accurately can the model predict their classes
(species)?
2. Inference — Which predictor(s) can effectively help with the predictions?

A Few Words on Classification


Classification is a type of supervised machine learning problem where the target (response) variable is
categorical. Given the training data, which contains the known label, the classifier approximates a mapping
function (f) from the input variables (X) to output variables (Y). For more sources on classification, see
Chapter 3 in An Introduction to Statistical Learning, Andrew Ng’s Machine Learning Course (Week 3), and
Simplilearn’s tutorial on Classification.

Now it’s time to write some code! See my Github page for my full Python code (written in Jupyter
Notebook).

Import Libraries and Load Dataset


First, we need to import some libraries: pandas (loading dataset), numpy (matrix manipulation), matplotlib
and seaborn (visualisation), and sklearn (building classifiers). Make sure they are installed already before
importing them (guide on installing packages here).

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from pandas.plotting import parallel_coordinates
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

To load the dataset, we can use the read_csv function from pandas (my code also includes the option of
loading through url).
data = pd.read_csv('data.csv')

After we load the data, we can take a look at the first couple of rows through the head function:
data.head(5)

Note: all four measurements are in centimetres.

Numerical Summary
First, let’s look at a numerical summary of each attribute through describe:
data.describe()

We can also check the class distribution using groupby and size:
data.groupby('species').size()

We can see that each class has the same number of instances.

Train-Test Split
Now, we can split the dataset into a training set and a test set. In general, we should also have a validation
set, which is used to evaluate the performance of each classifier and fine-tune the model parameters in
order to determine the best model. The test set is mainly used for reporting purposes. However, due to the
small size of this dataset, we can simplify this process by using the test set to serve the purpose of the
validation set.
In addition, I used a stratified hold-out approach to estimate model accuracy. The other approach is to do
cross-validation to reduce bias and variances.
train, test = train_test_split(data, test_size = 0.4, stratify = data['species'], random_state = 42)
Note: The general rule of thumb is have 20–30% of dataset as the test set. Due to the small size of this
dataset, I chose 40% to ensure there are enough data points to test the model performance.

Exploratory Data Analysis


After we split the dataset, we can go ahead to explore the training data. Both matplotlib and seaborn have
great plotting tools then we can use for visualization.
Let’s first create some univariate plots, through a histogram for each feature:
n_bins = 10
fig, axs = plt.subplots(2, 2)
axs[0,0].hist(train['sepal_length'], bins = n_bins);
axs[0,0].set_title('Sepal Length');
axs[0,1].hist(train['sepal_width'], bins = n_bins);
axs[0,1].set_title('Sepal Width');
axs[1,0].hist(train['petal_length'], bins = n_bins);
axs[1,0].set_title('Petal Length');
axs[1,1].hist(train['petal_width'], bins = n_bins);
axs[1,1].set_title('Petal Width');
# add some spacing between subplots
fig.tight_layout(pad=1.0);

Note that for both petal_length and petal_width, there seems to be a group of data points that have smaller
values than the others, suggesting that there might be different groups in this data.
Next, let’s try some side-by-side box plots:

fig, axs = plt.subplots(2, 2)


fn = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
cn = ['setosa', 'versicolor', 'virginica']
sns.boxplot(x = 'species', y = 'sepal_length', data = train, order = cn, ax = axs[0,0]);
sns.boxplot(x = 'species', y = 'sepal_width', data = train, order = cn, ax = axs[0,1]);
sns.boxplot(x = 'species', y = 'petal_length', data = train, order = cn, ax = axs[1,0]);
sns.boxplot(x = 'species', y = 'petal_width', data = train, order = cn, ax = axs[1,1]);
# add some spacing between subplots
fig.tight_layout(pad=1.0);

Side-by-side box plots


The two plots at the bottom suggest that that group of data points we saw earlier are setosas. Their petal
measurements are smaller and less spread-out than those of the other two species as well. Comparing
the other two species, versicolor has lower values than virginica on average.
Violin plot is another type of visualization, which combines the benefit of both histogram and box plot:

sns.violinplot(x="species", y="petal_length", data=train, size=5, order = cn, palette = 'colorblind');

Violin plot for petal_length
Now we can make scatterplots of all-paired attributes by using seaborn’s pairplot function:
sns.pairplot(train, hue="species", height = 2, palette = 'colorblind');

Figure: Scatterplots of all-paired attributes

Note that some variables seem to be highly correlated, e.g., petal_length and petal_width. In addition, the
petal measurements separate the different species better than the sepal ones.
Next, let’s make a correlation matrix to quantitatively examine the relationship between variables:
corrmat = train.corr()
sns.heatmap(corrmat, annot = True, square = True);

Correlation Matrix
The main takeaway is that the petal measurements have highly positive correlation, while the sepal ones
are uncorrelated. Note that the petal features also have relatively high correlation with sepal_length, but
not with sepal_width.
Another cool visualisation tool is parallel coordinate plot, which represents each sample as a line.
parallel_coordinates(train, "species", color = ['blue', 'red', 'green']);

Parallel coordinate plot


As we have seen before, petal measurements can separate species better than the sepal ones.

Build Classifiers
Now we are ready to build some classifiers (woo-hoo!)
To make our lives easier, let’s separate out the class label and features first:
X_train = train[['sepal_length','sepal_width','petal_length','petal_width']]
y_train = train.species
X_test = test[['sepal_length','sepal_width','petal_length','petal_width']]
y_test = test.species

Classification Tree
The first classifier that comes to my mind is a discriminative classification model called classification trees
(read more here). The reason is that we get to see the classification rules and it is easy to interpret.
Let’s build one using sklearn (documentation), with a maximum depth of 3, and we can check its accuracy
on the test data:
mod_dt = DecisionTreeClassifier(max_depth = 3, random_state = 1)
mod_dt.fit(X_train,y_train)

prediction = mod_dt.predict(X_test)
print('The accuracy of the Decision Tree is', "{:.3f}".format(metrics.accuracy_score(prediction, y_test)))
--------------------------------------------------------------------
The accuracy of the Decision Tree is 0.983.
This decision tree predicts 98.3% of the test data correctly. One nice thing about this model is that you can
see the importance of each predictor through its feature_importances_ attribute:

mod_dt.feature_importances_
--------------------------------------------------------------------
array([0. , 0. , 0.42430866, 0.57569134])

From the output and based on the indices of the four features, we know that the first two features (sepal
measurements) are of no importance, and only the petal ones are used to build this tree.
Another nice thing about the decision tree is that we can visualize the classification rules through plot_tree:

plt.figure(figsize = (10,8))
plot_tree(mod_dt, feature_names = fn, class_names = cn, filled = True);

Classification rules from this tree (for each split, left ->yes, right ->no)
Apart from each rule (e.g. the first criterion is petal_width ≤ 0.7), we can also see the Gini index
(impurity measure) at each split, assigned class, etc. Note that all terminal nodes are pure besides
the two “light purple” boxes at the bottom. We can be less confident regarding instances in those two
categories.
To demonstrate how easy it is to classify new data points, say a new instance has a petal length of 4.5cm
and a petal width of 1.5cm, then we can predict it to be versicolor following the rules.
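To see this in code, here is a minimal sketch using the mod_dt tree fitted above (the sepal values below are placeholders, since this tree ignores them):

# Order of features: sepal_length, sepal_width, petal_length, petal_width
new_flower = [[5.8, 3.0, 4.5, 1.5]]
print(mod_dt.predict(new_flower))   # expected output: ['versicolor']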
Since only the petal features are being used, we can visualise the decision boundary and plot the test data
in 2D:

Out of the 60 data points, 59 are correctly classified. Another way to show the prediction results is through
a confusion matrix:
disp = metrics.plot_confusion_matrix(mod_dt, X_test, y_test,
display_labels=cn,
cmap=plt.cm.Blues,
normalize=None)
disp.ax_.set_title('Decision Tree Confusion matrix, without normalization');

Through this matrix, we see that there is one versicolor which we predict to be virginica.
One downside to building a single tree is its instability, which can be improved through ensemble
techniques such as random forests, boosting, etc. For now, let’s move on to the next model.

Gaussian Naive Bayes Classifier


One of the most popular classification models is Naive Bayes. It contains the word “Naive” because it has
a key assumption of class-conditional independence, which means that given the class, each feature’s
value is assumed to be independent of that of any other feature (read more here).
We know that it is clearly not the case here, evidenced by the high correlation between the petal features.
Let’s examine the test accuracy using this model to see whether this assumption is robust:
The accuracy of the Gaussian Naive Bayes Classifier on test data is 0.933

What about the result if we only use the petal features:
The accuracy of the Gaussian Naive Bayes Classifier with 2 predictors on test data is 0.950
Interestingly, using only two features results in more correctly classified points, suggesting possibility of
over-fitting when using all features. Seems that our Naive Bayes classifier did a decent job.

Linear Discriminant Analysis (LDA)


If we use multivariate Gaussian distribution to calculate the class conditional density instead of taking a
product of univariate Gaussian distribution (used in Naive Bayes), we would then get a LDA model (read
more here). The key assumption of LDA is that the covariances are equal among classes. We can examine
the test accuracy using all features and only petal features:
The accuracy of the LDA Classifier on test data is 0.983
The accuracy of the LDA Classifier with two predictors on test data is 0.933

Using all features boosts the test accuracy of our LDA model.
To visualise the decision boundary in 2D, we can use our LDA model with only petals and also plot the
test data:

Four test points are misclassified: three virginica and one versicolor. Now suppose we want to classify
new data points with this model; we can simply plot the point on this graph and predict according to the
coloured region it falls in.

Quadratic Discriminant Analysis (QDA)


The difference between LDA and QDA is that QDA does NOT assume the covariances to be equal across
classes, and it is called “quadratic” because the decision boundary is a quadratic function.

The accuracy of the QDA Classifier is 0.983


The accuracy of the QDA Classifier with two predictors is 0.967

It has the same accuracy with LDA in the case of all features, and it performs slightly better when only
using petals.
Similarly, let’s plot the decision boundary for QDA (model with only petals):

K Nearest Neighbours (K-NN)


Now, let's switch gears a little and take a look at a non-parametric method called KNN (read
more here). It is a popular model since it is relatively simple and easy to implement. However, we need to
be aware of the curse of dimensionality when the number of features gets large.
Let’s plot the test accuracy with different choices of K:

We can see that the accuracy is highest (about 0.965) when K is 3, or between 7 and 10. Compared to the
previous models, it is less straightforward to classify new data points since we would need to look at its K
closest neighbours in four-dimensional space.

Other Models

I also explored other models such as logistic regression, support vector machine classifier, etc. See my
code on Github for details.
Note that the SVC (with linear kernel) achieved a test accuracy of 100%!
We should be pretty confident now since most of our models performed better than 95% accuracy.

Learning Objective: Carry Out Data Validation

Cross-validation techniques are applied on the dataset to validate the model:

Cross-validation is a technique in which we train our model using a subset of the data-set and then
evaluate it using the complementary subset of the data-set.
The three steps involved in cross-validation are as follows :
1. Reserve some portion of sample data-set.
2. Using the rest data-set, train the model.
3. Test the model using the reserve portion of the data-set.
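scikit-learn automates these three steps; a minimal sketch with cross_val_score (the model and the number of folds are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=42)

# Each of the 5 folds is held out once as the test portion while the model is trained on the rest.
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())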

Validation
In this method, we perform training on 50% of the given data-set and the remaining 50% is used for
testing. The major drawback of this method is that, because we train on only 50% of the data, the
remaining 50% may contain important information which the model never sees while training, i.e. the
method can give a higher bias.

LOOCV (Leave One Out Cross Validation)


In this method, we perform training on the whole data-set except for a single data-point, which is held out
for testing, and we iterate this over every data-point. It has advantages as well as disadvantages.
An advantage of using this method is that we make use of all data points, and hence it has low bias.
The major drawback of this method is that it leads to higher variation in the testing estimate, as we test
against a single data point each time; if that data point is an outlier, it can lead to higher variation. Another
drawback is that it takes a lot of execution time, since it iterates as many times as there are data points.

K-Fold Cross Validation


In this method, we split the data-set into k subsets (known as folds), then we perform training on
k-1 of the subsets and leave one subset out for the evaluation of the trained model. We iterate k
times, with a different subset reserved for testing each time.
Note:
A commonly suggested value is k = 10. A lower value of k makes the procedure closer to the
simple hold-out validation approach, while a higher value of k approaches the LOOCV method.

Example
The diagram below shows an example of the training subsets and evaluation subsets generated
in k-fold cross-validation. Here, we have 25 instances in total. In the first iteration we use the first
20 percent of the data for evaluation and the remaining 80 percent for training (instances [1-5] for
testing and [6-25] for training), while in the second iteration we use the second subset of 20
percent for evaluation and the remaining data for training (instances [6-10] for testing and [1-5]
plus [11-25] for training), and so on.

Total instances: 25
Value of k : 5

No. Iteration Training set observations Testing set observations


1 [5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [0 1 2 3 4]
2 [0 1 2 3 4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [5 6 7 8 9]
3 [0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22 23 24] [10 11 12 13 14]
4 [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 22 23 24] [15 16 17 18 19]
5 [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19] [20 21 22 23 24]
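The fold indices above can be reproduced with scikit-learn's KFold; the sketch below assumes 25 dummy instances indexed 0 to 24:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(25)        # 25 dummy instances, indexed 0..24
kf = KFold(n_splits=5)   # no shuffling, so the folds are consecutive blocks

for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(i, train_idx, test_idx)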

Bagging
We now take an intuitive look at how bagging works as a method of increasing accuracy. Suppose
that you are a patient and would like to have a diagnosis made based on your symptoms. Instead
of asking one doctor, you may choose to ask several. If a certain diagnosis occurs more than any
other, you may choose this as the final or best diagnosis. That is, the final diagnosis is made based
on a majority vote, where each doctor gets an equal vote. Now replace each doctor by a classifier,
and you have the basic idea behind bagging. Intuitively, a majority vote made by a large group of
doctors may be more reliable than a majority vote made by a small group.

Given a set, D, of d tuples, bagging works as follows. For iteration i (i = 1, 2, ..., k), a training set,
Di, of d tuples is sampled with replacement from the original set of tuples, D. Note that the term
bagging stands for bootstrap aggregation: each training set is a bootstrap sample. Because
sampling with replacement is used, some of the original tuples of D may not be included in Di,
whereas others may occur more than once. A classifier model, Mi, is learned for each training set,
Di. To classify an unknown tuple, X, each classifier, Mi, returns its class prediction, which counts
as one vote. The bagged classifier, M*, counts the votes and assigns the class with the most votes
to X. Bagging can be applied to the prediction of continuous values by taking the average of the
values predicted by all the trees for a given test tuple. The algorithm is summarised below.

The bagged classifier often has significantly greater accuracy than a single classifier derived from
D, the original training data. It will not be considerably worse, and it is more robust to the effects
of noisy data.

Algorithm:
Bagging. The bagging algorithm—creates an ensemble of classification models for a learning
scheme where each model gives an equally weighted prediction.
Input:
● D, a set of d training tuples;
● k, the number of models in the ensemble;
● a classification learning scheme (decision tree algorithm, naïve Bayesian, etc.).

Output: The ensemble—a composite model, M∗.

Method:
(1) for i = 1 to k do // create k models:
(2) create bootstrap sample, Di , by sampling D with replacement;
(3) use Di and the learning scheme to derive a model, Mi ;
(4) endfor
To use the ensemble to classify a tuple, X:
let each of the k models classify X and return the majority vote;
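A minimal Python rendering of this pseudocode, using bootstrap resampling and a majority vote (the base learner, the dataset and k are illustrative choices):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

k = 15                    # number of models in the ensemble
models = []
for i in range(k):
    # (2) create bootstrap sample D_i by sampling D with replacement
    X_i, y_i = resample(X_train, y_train, replace=True, random_state=i)
    # (3) use D_i and the learning scheme to derive a model M_i
    models.append(DecisionTreeClassifier(random_state=i).fit(X_i, y_i))

# To classify X_test: each model votes and the majority class wins.
votes = np.array([m.predict(X_test) for m in models])               # shape (k, n_test)
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print((majority == y_test).mean())                                  # ensemble accuracy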
Bootstrap
Unlike the accuracy estimation methods just mentioned, the bootstrap method samples the
given training tuples uniformly with replacement. That is, each time a tuple is selected, it is
equally likely to be selected again and re-added to the training set. For instance, imagine a
machine that randomly selects tuples for our training set. In sampling with replacement, the
machine is allowed to select the same tuple more than once.

There are several bootstrap methods. A commonly used one is the .632 bootstrap, which works
as follows. Suppose we are given a data set of d tuples. The data set is sampled d times, with
replacement, resulting in a bootstrap sample or training set of d samples. It is very likely that
some of the original data tuples will occur more than once in this sample. The data tuples that
did not make it into the training set end up forming the test set. Suppose we were to try this out
several times. As it turns out, on average, 63.2% of the original data tuples will end up in the
bootstrap sample, and the remaining 36.8% will form the test set (hence the name, .632
bootstrap).

"Where does the figure, 63.2%, come from?" Each tuple has a probability of 1/d of being
selected, so the probability of not being chosen is (1 - 1/d). We have to select d times, so the
probability that a tuple will not be chosen during this whole time is (1 - 1/d)^d. If d is large, this
probability approaches e^(-1) ≈ 0.368, where e is the base of natural logarithms (e ≈ 2.718).
Thus, 36.8% of the tuples will not be selected for training and thereby end up in the test set, and
the remaining 63.2% will form the training set.

We can repeat the sampling procedure k times, where in each iteration we use the current test
set to obtain an accuracy estimate of the model obtained from the current bootstrap sample.
The overall accuracy of the model, M, is then estimated as

Acc(M) = (1/k) * sum_{i=1}^{k} [ 0.632 * Acc(M_i)_test_set + 0.368 * Acc(M_i)_train_set ]

where Acc(M_i)_test_set is the accuracy of the model obtained with bootstrap sample i when it is
applied to test set i, and Acc(M_i)_train_set is the accuracy of the model obtained with bootstrap
sample i when it is applied to the original set of data tuples. Bootstrapping tends to be overly
optimistic. It works best with small data sets.
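A quick numerical check of this probability (the values of d are chosen arbitrarily):

import math

# Probability that a given tuple is NOT chosen in d draws with replacement: (1 - 1/d)^d
for d in (10, 100, 1000, 100000):
    print(d, (1 - 1/d) ** d)
print('e^-1 =', math.exp(-1))   # about 0.368, so roughly 63.2% of tuples are chosen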

● Performance Evaluation Factors:

Confusion Matrix
A confusion matrix is a summary of prediction results on a classification problem.
The number of correct and incorrect predictions are summarized with count values and broken
down by each class. This is the key to the confusion matrix.
How to Calculate a Confusion Matrix
Below is the process for calculating a confusion Matrix.
● You need a test dataset or a validation dataset with expected outcome values.
● Make a prediction for each row in your test dataset.
● From the expected outcomes and predictions count:
○ The number of correct predictions for each class.

○ The number of incorrect predictions for each class, organized by the class
that was predicted.

Example Confusion Matrix in Python with scikit-learn


The scikit-learn library for machine learning in Python can calculate a confusion matrix.
Given an array or list of expected values and a list of predictions from your machine learning
model, the confusion_matrix() function will calculate a confusion matrix and return the result as an
array

# Example of a confusion matrix in Python


from sklearn.metrics import confusion_matrix

expected = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]
results = confusion_matrix(expected, predicted)
print(results)

Output :

[[4 2]
[1 3]]

Gain Chart and Lift Chart


The gain chart and lift chart are two measures that are used for Measuring the benefits of using
the model and are used in business contexts such as target marketing. It’s not just restricted to
marketing analysis. It can also be used in other domains such as risk modelling, supply chain
analytics, etc. In other words, Gain and Lift charts are two approaches used while solving
classification problems with imbalanced data sets

Example: In target marketing or marketing campaigns, the customer response to campaigns is
usually very low (in many cases fewer than 1% of customers respond). The organisation incurs a
cost for each customer contact and hence would like to minimise the cost of the marketing
campaign while still achieving the desired response level from the customers.

Gain and lift charts are measures, often used with logistic regression, that help organisations
understand the benefits of using the model, so that better and more efficient decisions can be made.

The gain and lift chart is obtained using the following steps:
1. Predict the probability Y = 1 (positive) using the LR model and arrange the observation
in the decreasing order of predicted probability [i.e., P(Y = 1)].
2. Divide the data sets into deciles. Calculate the number of positives (Y = 1) in each
decile and the cumulative number of positives up to a decile.
3. Gain is the ratio between the cumulative number of positive observations up to a decile
to the total number of positive observations in the data. The gain chart is a chart drawn
between the gain on the vertical axis and the decile on the horizontal axis.
Gain = (Cumulative number of positive observations up to decile i) / (Total number of positive observations in the data)

4. Lift is the ratio of the number of positive observations up to decile i using the model to
the expected number of positives up to that decile i based on a random model. Lift chart
is the chart between the lift on the vertical axis and the corresponding decile on the
horizontal axis.
Lift = (Cumulative number of positive observations up to decile i using the ML model) / (Cumulative number of positive observations up to decile i using a random model)

Gain Chart Calculation:

The ratio of the cumulative number of positive responses up to a decile to the total number of
positive responses in the data.

Gain Chart:

Lift Chart Calculation:

Ratio of the number of positive responses up to decile i using the model to the expected number
of positives up to that decile i based on a random model
Lift Chart:

● Cumulative gains and lift charts are visual aids for measuring model performance.

● Both charts consist of Lift Curve (In Lift Chart) / Gain Chart (In Gain Chart) and Baseline
(Blue Line for Lift, Orange Line for Gain).
● The Greater the area between the Lift / Gain and Baseline, the Better the model.

Gini Coefficient
The Gini coefficient (Gini index or Gini ratio) is a statistical measure of inequality, originally used in
economics to measure the dispersion of income or the distribution of wealth among the members
of a population. In model evaluation, the Gini coefficient is commonly computed from the area
under the ROC curve as Gini = 2 x AUC - 1, and it summarises how well the model separates the
two classes.
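A minimal sketch of computing the Gini coefficient from the AUC (the labels and scores below are toy values used only for illustration):

from sklearn.metrics import roc_auc_score

# Toy true labels and predicted probabilities (illustrative values only).
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.3]

auc = roc_auc_score(y_true, y_prob)
gini = 2 * auc - 1
print(auc, gini)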
Accuracy

Accuracy can be defined as the ratio of the number of correctly classified cases to the total
number of cases under evaluation. The best value of accuracy is 1 and the worst value is 0.

In python, the following code calculates the accuracy of the machine learning model.

from sklearn.datasets import load_breast_cancer


from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

# choose a binary classification problem


data = load_breast_cancer()
# develop predictors X and target y dataframes
X = pd.DataFrame(data['data'], columns=data['feature_names'])
y = abs(pd.Series(data['target']) - 1)
# split data into train and test set in 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# build a RF model with default parameters
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)
preds = model.predict(X_test)

plt.figure(figsize=(7, 7))
y.value_counts().plot.pie(ylabel=' ', autopct='%0.1f%%')
plt.title(f'0 - Not cancerous (negative)\n 1 - Cancerous (positive) ', size=14, c='green')
plt.tight_layout();
plt.show()

accuracy = metrics.accuracy_score(y_test, preds)


print(accuracy)

Output :
0.956140350877193

It gives 0.956 as output. However, care should be taken while using accuracy as a metric, because
it can give misleading results for data with unbalanced classes. Our data here is moderately
imbalanced (roughly 63% of one class and 37% of the other), so the accuracy score should not be
relied on alone.
Precision
Precision can be defined with respect to either of the classes. The precision of the positive class is
intuitively the ability of the classifier not to label as positive a sample that is negative. The precision
of the negative class is intuitively the ability of the classifier not to label as negative a sample that
is positive. The best value of precision is 1 and the worst value is 0.

In Python, precision can be calculated using the code,

from sklearn.datasets import load_breast_cancer


from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from sklearn import metrics
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

# choose a binary classification problem


data = load_breast_cancer()
# develop predictors X and target y dataframes
X = pd.DataFrame(data['data'], columns=data['feature_names'])
y = abs(pd.Series(data['target']) - 1)
# split data into train and test set in 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# build a RF model with default parameters
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)
preds = model.predict(X_test)

plt.figure(figsize=(7, 7))
y.value_counts().plot.pie(ylabel=' ', autopct='%0.1f%%')
plt.title(f'0 - Not cancerous (negative)\n 1 - Cancerous (positive) ', size=14, c='green')
plt.tight_layout();
plt.show()

precision_positive = metrics.precision_score(y_test, preds, pos_label=1)


precision_negative = metrics.precision_score(y_test, preds, pos_label=0)
print(precision_positive, precision_negative )

Output :

which gives (1.000, 0.935) as output.

Recall
Recall can also be defined with respect to either of the classes. Recall of positive class is also
termed sensitivity and is defined as the ratio of the True Positive to the number of actual positive
cases. It can intuitively be expressed as the ability of the classifier to capture all the positive cases.
It is also called the True Positive Rate (TPR).

Recall of negative class is also termed specificity and is defined as the ratio of the True Negative
to the number of actual negative cases. It can intuitively be expressed as the ability of the classifier
to capture all the negative cases. It is also called True Negative Rate (TNR).

In python, sensitivity and specificity can be calculated as

from sklearn.datasets import load_breast_cancer


from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

# choose a binary classification problem


data = load_breast_cancer()
# develop predictors X and target y dataframes
X = pd.DataFrame(data['data'], columns=data['feature_names'])
y = abs(pd.Series(data['target']) - 1)
# split data into train and test set in 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# build a RF model with default parameters
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)
preds = model.predict(X_test)

plt.figure(figsize=(7, 7))
y.value_counts().plot.pie(ylabel=' ', autopct='%0.1f%%')
plt.title(f'0 - Not cancerous (negative)\n 1 - Cancerous (positive) ', size=14, c='green')
plt.tight_layout();
plt.show()

recall_sensitivity = metrics.recall_score(y_test, preds, pos_label=1)


recall_specificity = metrics.recall_score(y_test, preds, pos_label=0)
print(recall_sensitivity, recall_specificity)

Output :

which gives (0.881, 1.000) as output. The best value of recall is 1 and the worst value is 0.

ROC and AUC score


ROC is short for Receiver Operating Characteristic curve, which helps determine the optimum
threshold value for classification. The threshold is the floating-point value that forms the boundary
between the two classes: in our model, any predicted probability above the threshold is classified
as class 1 and anything below it as class 0.
The ROC curve is visualised as a plot, and the area under it, known as AUC, is used as a metric
to evaluate the classification model. The curve is drawn with the false positive rate on the x-axis
and the true positive rate on the y-axis. The best value of AUC is 1 and the worst is 0; an AUC of
0.5 is generally considered the baseline of a classification model, equivalent to random guessing.
In Python, the ROC curve can be plotted by calculating the true positive rate and the false positive
rate at a series of threshold values, gradually varying the threshold from 0 to 1.

from sklearn.datasets import load_breast_cancer


from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

# choose a binary classification problem


data = load_breast_cancer()
# develop predictors X and target y dataframes
X = pd.DataFrame(data['data'], columns=data['feature_names'])

y = abs(pd.Series(data['target'])-1)
# split data into train and test set in 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=1)
# build a RF model with default parameters
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)
preds = model.predict(X_test)

sns.set_style('darkgrid')
preds_train = model.predict(X_train)
# calculate prediction probability
prob_train = np.squeeze(model.predict_proba(X_train)[:,1].reshape(1,-1))
prob_test = np.squeeze(model.predict_proba(X_test)[:,1].reshape(1,-1))
# false positive rate, true positive rate, thresholds
fpr1, tpr1, thresholds1 = metrics.roc_curve(y_test, prob_test)
fpr2, tpr2, thresholds2 = metrics.roc_curve(y_train, prob_train)
# auc score
auc1 = metrics.auc(fpr1, tpr1)
auc2 = metrics.auc(fpr2, tpr2)
plt.figure(figsize=(8,8))
# plot auc
plt.plot(fpr1, tpr1, color='blue', label='Test ROC curve area = %0.2f'%auc1)
plt.plot(fpr2, tpr2, color='green', label='Train ROC curve area = %0.2f'%auc2)
plt.plot([0,1],[0,1], 'r--')
plt.xlim([-0.1, 1.1])
plt.ylim([-0.1, 1.1])
plt.xlabel('False Positive Rate', size=14)
plt.ylabel('True Positive Rate', size=14)
plt.legend(loc='lower right')
plt.show()

Tuning ROC to find the optimum threshold value: the following Python code finds the right threshold (cut-off) value.
Precision-Recall Curve
To find the best threshold value based on the trade-off between precision and recall,
precision_recall_curve is drawn.

from sklearn.datasets import load_breast_cancer


from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

# choose a binary classification problem


data = load_breast_cancer()
# develop predictors X and target y dataframes
X = pd.DataFrame(data['data'], columns=data['feature_names'])
y = abs(pd.Series(data['target'])-1)
# split data into train and test set in 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=1)
# build a RF model with default parameters
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)
preds = model.predict(X_test)

sns.set_style('darkgrid')
preds_train = model.predict(X_train)
# calculate prediction probability
prob_train = np.squeeze(model.predict_proba(X_train)[:,1].reshape(1,-1))
prob_test = np.squeeze(model.predict_proba(X_test)[:,1].reshape(1,-1))
# false positive rate, true positive rate, thresholds
fpr1, tpr1, thresholds1 = metrics.roc_curve(y_test, prob_test)
fpr2, tpr2, thresholds2 = metrics.roc_curve(y_train, prob_train)
# auc score
auc1 = metrics.auc(fpr1, tpr1)
auc2 = metrics.auc(fpr2, tpr2)
plt.figure(figsize=(8,8))

# creating index
i = np.arange(len(tpr1))
# extracting roc values against different thresholds
roc = pd.DataFrame({'fpr':fpr1, 'tpr':tpr1, 'tf':(tpr1-1+fpr1), 'thresholds':thresholds1}, index=i)
# top 5 best roc occurrences
roc.iloc[(roc.tf-0).abs().argsort()[:5]]

pre, rec, thr = metrics.precision_recall_curve(y_test, prob_test)


plt.figure(figsize=(8, 4))
plt.plot(thr, pre[:-1], label='precision')
plt.plot(thr, rec[1:], label='recall')
plt.xlabel('Threshold')
plt.title('Precision & Recall vs Threshold', c='r', size=16)
plt.legend()
plt.show()

Trade-off performed by our random forest model between Precision and Recall can be visualized
using the following codes:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

# choose a binary classification problem


data = load_breast_cancer()
# develop predictors X and target y dataframes
X = pd.DataFrame(data['data'], columns=data['feature_names'])
y = abs(pd.Series(data['target'])-1)
# split data into train and test set in 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=1)
# build a RF model with default parameters
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)
preds = model.predict(X_test)

sns.set_style('darkgrid')
preds_train = model.predict(X_train)
# calculate prediction probability
prob_train = np.squeeze(model.predict_proba(X_train)[:,1].reshape(1,-1))
prob_test = np.squeeze(model.predict_proba(X_test)[:,1].reshape(1,-1))
# false positive rate, true positive rate, thresholds
fpr1, tpr1, thresholds1 = metrics.roc_curve(y_test, prob_test)
fpr2, tpr2, thresholds2 = metrics.roc_curve(y_train, prob_train)
# auc score
auc1 = metrics.auc(fpr1, tpr1)
auc2 = metrics.auc(fpr2, tpr2)
plt.figure(figsize=(8,8))

# creating index
i = np.arange(len(tpr1))
# extracting roc values against different thresholds
roc = pd.DataFrame({'fpr':fpr1, 'tpr':tpr1, 'tf':(tpr1-1+fpr1), 'thresholds':thresholds1}, index=i)
# top 5 best roc occurrences
roc.iloc[(roc.tf-0).abs().argsort()[:5]]

pre, rec, thr = metrics.precision_recall_curve(y_test, prob_test)


fig, ax = plt.subplots(1, 1, figsize=(8, 8))

plt.plot(thr, pre[:-1], label='precision')
plt.plot(thr, rec[1:], label='recall')
plt.xlabel('Threshold')
plt.title('Precision & Recall vs Threshold', c='r', size=16)
plt.legend()
plt.show()

R-Squared and Adjusted R-Squared Score

R-squared, often written R2, is the proportion of the variance in the response variable that can be explained by the predictor variables in a linear regression model.

Adjusted R-squared is a modified version of R-squared that adjusts for the number of predictors in a regression model. It is calculated as:
Adjusted R2 = 1 – [(1-R2)*(n-1)/(n-k-1)]
where:
● R2: The R2 of the model
● n: The number of observations
● k: The number of predictor variables
Since R2 always increases as you add more predictors to a model, adjusted R2 can serve as a
metric that tells you how useful a model is, adjusted for the number of predictors in a model.

Example 1: Calculate Adjusted R-Squared with sklearn


The following code shows how to fit a multiple linear regression model and calculate the adjusted
R-squared of the model using sklearn:
from sklearn.linear_model import LinearRegression
import pandas as pd

#define URL where dataset is located


url = "https://raw.githubusercontent.com/Statology/Python-Guides/main/mtcars.csv"

#read in data

data = pd.read_csv(url)

#fit regression model


model = LinearRegression()
X, y = data[["mpg", "wt", "drat", "qsec"]], data.hp
model.fit(X, y)

#display adjusted R-squared


print(1 - (1-model.score(X, y))*(len(y)-1)/(len(y)-X.shape[1]-1))
Output :

0.7787005290062521

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors).
Residuals are a measure of how far from the regression line data points are; RMSE is a measure
of how spread out these residuals are. In other words, it tells you how concentrated the data is
around the line of best fit. Root mean square error is commonly used in climatology, forecasting,
and regression analysis to verify experimental results.
The formula is:

RMSE = √( mean( (f − o)² ) )

Where:
● f = forecasts (expected values or unknown results),
● o = observed values (known results),
and the mean is taken over all of the squared differences. The same formula can be written with the following, slightly different, notation (Barnston, 1992):

RMSE = √( Σ (z_fi − z_oi)² / N )

Where:
● Σ = summation (“add up”)
● (z_fi − z_oi)² = differences, squared
● N = sample size.
You can use whichever formula you feel most comfortable with, as they both do the same thing. If
you don’t like formulas, you can find the RMSE by:
1. Squaring the residuals.
2. Finding the average of the squared residuals.
3. Taking the square root of the result.
That said, this can be a lot of calculation, depending on how large your data set is. A shortcut to
finding the root mean square error is:

RMSE = SDy × √(1 − r²)

Where SDy is the standard deviation of Y and r is the correlation coefficient between the observed and forecast values.

When standardised observations and forecasts are used as RMSE inputs, there is a direct
relationship with the correlation coefficient. For example, if the correlation coefficient is 1, the
RMSE will be 0, because all of the points lie on the regression line (and therefore there are no
errors).
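
A minimal sketch of computing RMSE in Python (with hypothetical observed and forecast values; the three-step recipe above and scikit-learn give the same result):

import numpy as np
from sklearn.metrics import mean_squared_error

# hypothetical observed values (o) and forecasts (f)
observed = np.array([2.5, 3.7, 4.1, 5.0, 6.3])
forecast = np.array([2.7, 3.5, 4.4, 4.8, 6.0])

# step by step: square the residuals, average them, take the square root
rmse_manual = np.sqrt(np.mean((forecast - observed) ** 2))

# the same value via scikit-learn
rmse_sklearn = np.sqrt(mean_squared_error(observed, forecast))
print(rmse_manual, rmse_sklearn)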

Elbow Method for optimal value of k in KMeans :


A fundamental step for any unsupervised algorithm is to determine the optimal number of clusters
into which the data may be clustered. The Elbow Method is one of the most popular methods to
determine this optimal value of k.
We now demonstrate this method with the K-Means clustering technique using the scikit-learn library for Python.

from sklearn.cluster import KMeans
from sklearn import metrics
from scipy.spatial.distance import cdist
import numpy as np
import matplotlib.pyplot as plt

# Creating the data


x1 = np.array([3, 1, 1, 2, 1, 6, 6, 6, 5, 6, 7, 8, 9, 8, 9, 9, 8])
x2 = np.array([5, 4, 5, 6, 5, 8, 6, 7, 6, 7, 1, 2, 1, 2, 3, 2, 3])
X = np.array(list(zip(x1, x2))).reshape(len(x1), 2)

# Visualizing the data


plt.plot()
plt.xlim([0, 10])
plt.ylim([0, 10])
plt.title('Dataset')
plt.scatter(x1, x2)
plt.show()
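
The snippet above only creates and visualises the data; a minimal continuation (assuming the same X, KMeans and plt already defined above) that computes the inertia for a range of k values and plots the elbow might look like this:

# fit KMeans for k = 1..9 and record the inertia
# (the sum of squared distances of samples to their closest cluster centre)
inertias = []
K = range(1, 10)
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=1, n_init=10).fit(X)
    inertias.append(kmeans.inertia_)

# the "elbow" is the value of k after which inertia stops dropping sharply
plt.plot(K, inertias, 'bx-')
plt.xlabel('Values of k')
plt.ylabel('Inertia')
plt.title('The Elbow Method using Inertia')
plt.show()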

Individual Activity:
▪ Identify types of dimension reduction techniques.
▪ Show the usages of the Scikit-learn library.

SELF-CHECK QUIZ 5.2

Check your understanding by answering the following questions:

Write the correct answer for the following questions.

1. What are Dimension reduction techniques?

2. Which tools are most important in the Scikit-learn library?

Learning Outcome 5.3- Carry Out data Validation

Contents:

▪ Cross validation techniques.


▪ Machine learning model's.

Assessment criteria:

1. Cross validation techniques are applied on the dataset to validate model.


2. Machine learning model's performance is evaluated using performance evaluation factors
3. Report is prepared with findings and conclusions for the data science/ business audience.

Resources required:

Students/trainees must be provided with the following resources:

▪ Workplace (Computer and internet connection).

LEARNING ACTIVITY 5.3

Learning Activity Resources/Special Instructions/References


Carry Out data Validation ▪ Information Sheet: 5.3
▪ Self-Check: 5.3
▪ Answer Key: 5.3

INFORMATION SHEET 5.3

Learning Objective: Carry Out Data Validation

Cross validation techniques

Cross-validation is a technique in which we train our model using the subset of the data-set and then
evaluate using the complementary subset of the data-set.
The three steps involved in cross-validation are as follows :
1. Reserve some portion of the sample data-set.
2. Train the model using the rest of the data-set.
3. Test the model using the reserved portion of the data-set.

Validation
In this method, we train on 50% of the given data-set and use the remaining 50% for testing. The major drawback of this method is that, since we train on only 50% of the dataset, the remaining 50% may contain important information that the model never sees while training, i.e. higher bias.
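
A minimal sketch of this 50:50 hold-out split with scikit-learn (assuming predictors X and target y such as those built in the earlier examples):

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# reserve 50% of the data for testing and train on the remaining 50%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
model = RandomForestClassifier(random_state=1).fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))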

LOOCV (Leave One Out Cross Validation)


In this method, we train on the whole data-set except one data point, test on that left-out point, and iterate so that every data point is left out once. It has advantages as well as disadvantages. An advantage of using this method is that we make use of all data points, hence it has low bias. The major drawback is that it leads to higher variation in the testing of the model, because we test against a single data point each time; if that data point is an outlier it can lead to higher variation. Another drawback is that it takes a lot of execution time, as it iterates 'the number of data points' times.
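
A hedged sketch of LOOCV with scikit-learn (assuming X and y as before; note that one model is trained per data point, so this can be slow on large datasets):

from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression
import numpy as np

# one fold per data point: n models are trained in total
loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=loo, scoring='accuracy')
print(np.mean(scores))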

K-Fold Cross Validation


In this method, we split the data-set into k subsets (known as folds), train on k−1 of the subsets and leave one subset out for evaluating the trained model. We iterate k times, reserving a different subset for testing each time.
Note:
A value of k = 10 is commonly suggested; a lower value of k moves the method towards simple validation, while a higher value of k moves it towards the LOOCV method.

Example
The example below shows the training subsets and evaluation subsets generated in k-fold cross-validation. Here, we have 25 instances in total. In the first iteration we use the first 20 percent of the data for evaluation and the remaining 80 percent for training (instances [1-5] for testing and [6-25] for training), while in the second iteration we use the second subset of 20 percent for evaluation and the remaining four subsets for training (instances [6-10] for testing and [1-5 and 11-25] for training), and so on.

Total instances: 25
Value of k : 5

No. Iteration Training set observations Testing set observations


1 [5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [0 1 2 3 4]
2 [0 1 2 3 4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [5 6 7 8 9]
3 [0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22 23 24] [10 11 12 13 14]
4 [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 22 23 24] [15 16 17 18 19]
5 [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19] [20 21 22 23 24]
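
The same fold indices can be generated with scikit-learn's KFold; a minimal sketch for the 25-instance, k = 5 example above:

from sklearn.model_selection import KFold
import numpy as np

X_demo = np.arange(25)      # 25 instances, indexed 0..24
kf = KFold(n_splits=5)      # k = 5 folds, taken sequentially (no shuffling)

for i, (train_idx, test_idx) in enumerate(kf.split(X_demo), start=1):
    print(i, train_idx, test_idx)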

Bagging
We now take an intuitive look at how bagging works as a method of increasing accuracy.
Suppose that you are a patient and would like to have a diagnosis made based on your
symptoms. Instead of asking one doctor, you may choose to ask several. If a certain diagnosis
occurs more than any other, you may choose this as the final or best diagnosis. That is, the final
diagnosis is made based on a majority vote, where each doctor gets an equal vote. Now replace
each doctor by a classifier, and you have the basic idea behind bagging. Intuitively, a majority
vote made by a large group of doctors may be more reliable than a majority vote made by a
small group.

Given a set, D, of d tuples, bagging works as follows. For iteration i (i = 1, 2, ..., k), a training set, Di, of d tuples is sampled with replacement from the original set of tuples, D. Note that the term bagging stands for bootstrap aggregation. Each training set is a bootstrap sample, as described in the Bootstrap section below. Because sampling with replacement is used, some of the original tuples of D may not be included in Di, whereas others may occur more than once. A classifier model, Mi, is learned for each training set, Di. To classify an unknown tuple, X, each classifier, Mi, returns its class prediction, which counts as one vote. The bagged classifier, M∗, counts the votes and assigns the class with the most votes to X. Bagging can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple. The algorithm is summarised below. The bagged classifier often has significantly greater accuracy than a single classifier derived from D, the original training data. It will not be considerably worse and is more robust to the effects of noisy data and overfitting.

Algorithm:
Bagging. The bagging algorithm—creates an ensemble of classification models for a learning
scheme where each model gives an equally weighted prediction.
Input:
● D, a set of d training tuples;
● k, the number of models in the ensemble;

● a classification learning scheme (decision tree algorithm, naïve Bayesian, etc.).
Output: The ensemble—a composite model, M∗.

Method:
(1) for i = 1 to k do // create k models:
(2) create bootstrap sample, Di , by sampling D with replacement;
(3) use Di and the learning scheme to derive a model, Mi ;
(4) endfor
To use the ensemble to classify a tuple, X:
let each of the k models classify X and return the majority vote;
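
scikit-learn implements this idea in its BaggingClassifier; a minimal sketch (assuming X_train, X_test, y_train and y_test as produced in the earlier examples) is:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# k = 10 decision trees, each trained on a bootstrap sample of the training data;
# their class predictions are combined by voting
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                           bootstrap=True, random_state=1)
bagged.fit(X_train, y_train)
print(accuracy_score(y_test, bagged.predict(X_test)))
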
Bootstrap
Unlike the accuracy estimation methods just mentioned, the bootstrap method samples the
given training tuples uniformly with replacement. That is, each time a tuple is selected, it is
equally likely to be selected again and re-added to the training set. For instance, imagine a
machine that randomly selects tuples for our training set. In sampling with replacement, the
machine is allowed to select the same tuple more than once. There are several bootstrap
methods. A commonly used one is the .632 bootstrap, which works as follows. Suppose we
are given a data set of d tuples. The data set is sampled d times, with replacement, resulting
in a bootstrap sample or training set of d samples. It is very likely that some of the original
data tuples will occur more than once in this sample. The data tuples that did not make it
into the training set end up forming the test set. Suppose we were to try this out several
times. As it turns out, on average, 63.2% of the original data tuples will end up in the
bootstrap sample, and the remaining 36.8% will form the test set (hence, the name, .632
bootstrap). “Where does the figure, 63.2%, come from?” Each tuple has a probability of 1/d
of being selected, so the probability of not being chosen is (1 − 1/d). We have to select d
times, so the probability that a tuple will not be chosen during this whole time is (1 − 1/d)^d. If d is large, the probability approaches e^(−1) = 0.368. Thus, 36.8% of tuples will not be
selected for training and thereby end up in the test set, and the remaining 63.2% will form
the training set. We can repeat the sampling procedure k times, where in each iteration, we
use the current test set to obtain an accuracy estimate of the model obtained from the
current bootstrap sample. The overall accuracy of the model, M, is then estimated as

Acc(M) = (1/k) Σ_{i=1}^{k} ( 0.632 × Acc(Mi)test_set + 0.368 × Acc(Mi)train_set )

where the Acc(Mi)test set is the accuracy of the model obtained with bootstrap sample i when it
is applied to test set i. Acc(Mi)train set is the accuracy of the model obtained with bootstrap
sample i when it is applied to the original set of data tuples. Bootstrapping tends to be overly
optimistic. It works best with small data sets.
e is the base of natural logarithms, that is, e ≈ 2.718.
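
The sampling-with-replacement step can be illustrated with scikit-learn's resample utility; in a bootstrap sample of size d, the fraction of distinct tuples approaches the 63.2% figure discussed above (a sketch with a hypothetical data set of d = 1000 indices):

import numpy as np
from sklearn.utils import resample

d = 1000
indices = np.arange(d)

# draw a bootstrap sample of size d with replacement
boot = resample(indices, replace=True, n_samples=d, random_state=1)

# tuples that never made it into the bootstrap sample form the test set
test = np.setdiff1d(indices, boot)

print(len(np.unique(boot)) / d)   # roughly 0.632
print(len(test) / d)              # roughly 0.368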

● Performance Evaluation Factors

Confusion Matrix
A confusion matrix is a summary of prediction results on a classification problem.
The number of correct and incorrect predictions are summarized with count values and broken
down by each class. This is the key to the confusion matrix.
How to Calculate a Confusion Matrix
Below is the process for calculating a confusion Matrix.
● You need a test dataset or a validation dataset with expected outcome values.
● Make a prediction for each row in your test dataset.
● From the expected outcomes and predictions count:

○ The number of correct predictions for each class.
○ The number of incorrect predictions for each class, organized by the
class that was predicted.

Example Confusion Matrix in Python with scikit-learn


The scikit-learn library for machine learning in Python can calculate a confusion matrix.
Given an array or list of expected values and a list of predictions from your machine learning
model, the confusion_matrix() function will calculate a confusion matrix and return the result as
an array

# Example of a confusion matrix in Python


from sklearn.metrics import confusion_matrix

expected = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]
results = confusion_matrix(expected, predicted)
print(results)

Output :

[[4 2]
[1 3]]

Gain Chart and Lift Chart


The gain chart and lift chart are two measures used to quantify the benefit of using a model, typically in business contexts such as target marketing. They are not restricted to marketing analysis and can also be used in other domains such as risk modelling, supply chain analytics, etc. In other words, gain and lift charts are two approaches used while solving classification problems with imbalanced data sets.

Example: In target marketing or marketing campaigns, the customer responses to campaigns


are usually very low (in many cases less than 1% of customers respond). The organisation incurs a cost for each customer contact and hence would like to minimise the cost of the marketing campaign while at the same time achieving the desired response level from the customers.

The gain chart and lift chart are measures, often used with logistic regression, that help organisations understand the benefit of using the model so that better and more efficient decisions can be made.

The gain and lift charts are obtained using the following steps:
1. Predict the probability Y = 1 (positive) using the LR model and arrange the observations in decreasing order of predicted probability [i.e., P(Y = 1)].
2. Divide the data set into deciles. Calculate the number of positives (Y = 1) in each decile and the cumulative number of positives up to each decile.
3. Gain is the ratio of the cumulative number of positive observations up to decile i to the total number of positive observations in the data. The gain chart plots the gain on the vertical axis against the decile on the horizontal axis.

Gain = (Cumulative number of positive observations up to decile i) / (Total number of positive observations in the data)

4. Lift is the ratio of the cumulative number of positive observations up to decile i using the model to the expected number of positives up to that decile i based on a random model. The lift chart plots the lift on the vertical axis against the corresponding decile on the horizontal axis.

Lift = (Cumulative number of positive observations up to decile i using the ML model) / (Cumulative number of positive observations up to decile i using a random model)
Gain Chart Calculation:

Ratio between the cumulative number of positive response up to a decile to the total number of
positive responses in the data

Gain Chart:

Lift Chart Calculation:

Ratio of the number of positive responses up to decile i using the model to the expected number
of positives up to that decile i based on a random model
Lift Chart:

● Cumulative gains and lift charts are visual aids for measuring model performance.

● Both charts consist of the lift curve (in the lift chart) / gain curve (in the gain chart) and a baseline (blue line for lift, orange line for gain).
● The greater the area between the lift/gain curve and the baseline, the better the model.
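
A hedged sketch of the decile calculation with pandas (assuming y_test holds the actual labels and prob_test the predicted probabilities, as in the random forest examples elsewhere in this sheet):

import numpy as np
import pandas as pd

# sort observations by predicted probability, highest first
df = pd.DataFrame({'actual': np.asarray(y_test), 'prob': prob_test})
df = df.sort_values('prob', ascending=False).reset_index(drop=True)

# split into 10 deciles and count positives per decile
df['decile'] = pd.qcut(np.arange(len(df)), 10, labels=False) + 1
positives = df.groupby('decile')['actual'].sum()

cum_positives = positives.cumsum()
gain = cum_positives / df['actual'].sum()        # cumulative gain per decile
lift = gain.values / (np.arange(1, 11) / 10)     # gain relative to a random model
print(pd.DataFrame({'decile': np.arange(1, 11), 'gain': gain.values, 'lift': lift}))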

Gini Coefficient
The Gini coefficient (Gini index or Gini ratio) was originally introduced as a statistical measure of economic inequality, measuring the dispersion of income or the distribution of wealth among the members of a population. In model evaluation, the Gini coefficient measures how well a classifier separates the classes and is commonly computed from the area under the ROC curve as Gini = 2 × AUC − 1.
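
A minimal sketch of computing the Gini coefficient for a classifier from its AUC (reusing y_test and the predicted probabilities prob_test from the random forest examples):

from sklearn import metrics

# Gini coefficient derived from the area under the ROC curve
auc = metrics.roc_auc_score(y_test, prob_test)
gini = 2 * auc - 1
print(gini)
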
Accuracy

Accuracy is defined as the ratio of the number of correctly classified cases to the total number of cases under evaluation. The best value of accuracy is 1 and the worst value is 0.

In python, the following code calculates the accuracy of the machine learning model.

from sklearn.datasets import load_breast_cancer


from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

# choose a binary classification problem


data = load_breast_cancer()
# develop predictors X and target y dataframes
X = pd.DataFrame(data['data'], columns=data['feature_names'])
y = abs(pd.Series(data['target']) - 1)
# split data into train and test set in 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# build a RF model with default parameters
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)
preds = model.predict(X_test)

plt.figure(figsize=(7, 7))
y.value_counts().plot.pie(ylabel=' ', autopct='%0.1f%%')
plt.title(f'0 - Not cancerous (negative)\n 1 - Cancerous (positive) ', size=14, c='green')
plt.tight_layout();
plt.show()

accuracy = metrics.accuracy_score(y_test, preds)


print(accuracy)

Output :
0.956140350877193

It gives 0.956 as output. However, care should be taken while using accuracy as a metric because it gives biased results for data with unbalanced classes. Since our data is imbalanced, the accuracy score may be a biased one!
Precision
Precision can be defined with respect to either of the classes. The precision of the positive class is intuitively the ability of the classifier not to label as positive a sample that is negative. The precision of the negative class is intuitively the ability of the classifier not to label as negative a sample that is positive. The best value of precision is 1 and the worst value is 0.

In Python, precision can be calculated using the code,

from sklearn.datasets import load_breast_cancer


from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from sklearn import metrics
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

# choose a binary classification problem


data = load_breast_cancer()
# develop predictors X and target y dataframes
X = pd.DataFrame(data['data'], columns=data['feature_names'])
y = abs(pd.Series(data['target']) - 1)
# split data into train and test set in 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# build a RF model with default parameters
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)
preds = model.predict(X_test)

plt.figure(figsize=(7, 7))
y.value_counts().plot.pie(ylabel=' ', autopct='%0.1f%%')
plt.title(f'0 - Not cancerous (negative)\n 1 - Cancerous (positive) ', size=14, c='green')
plt.tight_layout();
plt.show()

precision_positive = metrics.precision_score(y_test, preds, pos_label=1)


precision_negative = metrics.precision_score(y_test, preds, pos_label=0)
print(precision_positive, precision_negative )

Output: (1.000, 0.935)

Recall
Recall can also be defined with respect to either of the classes. Recall of positive class is also
termed sensitivity and is defined as the ratio of the True Positive to the number of actual positive
cases. It can intuitively be expressed as the ability of the classifier to capture all the positive
cases. It is also called the True Positive Rate (TPR).

Recall of negative class is also termed specificity and is defined as the ratio of the True Negative
to the number of actual negative cases. It can intuitively be expressed as the ability of the
classifier to capture all the negative cases. It is also called True Negative Rate (TNR).

In python, sensitivity and specificity can be calculated as

from sklearn.datasets import load_breast_cancer


from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

# choose a binary classification problem


data = load_breast_cancer()
# develop predictors X and target y dataframes
X = pd.DataFrame(data['data'], columns=data['feature_names'])
y = abs(pd.Series(data['target']) - 1)
# split data into train and test set in 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# build a RF model with default parameters
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)
preds = model.predict(X_test)

plt.figure(figsize=(7, 7))
y.value_counts().plot.pie(ylabel=' ', autopct='%0.1f%%')
plt.title(f'0 - Not cancerous (negative)\n 1 - Cancerous (positive) ', size=14, c='green')
plt.tight_layout();
plt.show()

recall_sensitivity = metrics.recall_score(y_test, preds, pos_label=1)


recall_specificity = metrics.recall_score(y_test, preds, pos_label=0)
print(recall_sensitivity, recall_specificity)

Output: (0.881, 1.000)

The best value of recall is 1 and the worst value is 0.

ROC and AUC score


ROC stands for Receiver Operating Characteristic curve, which helps determine the optimum threshold value for classification. The threshold is a floating-point value that forms the boundary between the two classes: in our model, any predicted output above the threshold is classified as class 1 and anything below it is classified as class 0.
The ROC curve is visualised as a plot, and the area under the ROC curve, known as AUC, is used as a metric to evaluate the classification model. The ROC curve is drawn by taking the false positive rate on the x-axis and the true positive rate on the y-axis. The best value of AUC is 1 and the worst value is 0. However, an AUC of 0.5 is generally considered the bottom reference for a classification model.
In python, ROC can be plotted by calculating the true positive rate and false-positive rate. The
values are calculated in steps by changing the threshold value from 0 to 1 gradually.

from sklearn.datasets import load_breast_cancer


from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

# choose a binary classification problem


data = load_breast_cancer()
# develop predictors X and target y dataframes
X = pd.DataFrame(data['data'], columns=data['feature_names'])

y = abs(pd.Series(data['target'])-1)
# split data into train and test set in 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=1)
# build a RF model with default parameters
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)
preds = model.predict(X_test)

sns.set_style('darkgrid')
preds_train = model.predict(X_train)
# calculate prediction probability
prob_train = np.squeeze(model.predict_proba(X_train)[:,1].reshape(1,-1))
prob_test = np.squeeze(model.predict_proba(X_test)[:,1].reshape(1,-1))
# false positive rate, true positive rate, thresholds
fpr1, tpr1, thresholds1 = metrics.roc_curve(y_test, prob_test)
fpr2, tpr2, thresholds2 = metrics.roc_curve(y_train, prob_train)
# auc score
auc1 = metrics.auc(fpr1, tpr1)
auc2 = metrics.auc(fpr2, tpr2)
plt.figure(figsize=(8,8))
# plot auc
plt.plot(fpr1, tpr1, color='blue', label='Test ROC curve area = %0.2f'%auc1)
plt.plot(fpr2, tpr2, color='green', label='Train ROC curve area = %0.2f'%auc2)
plt.plot([0,1],[0,1], 'r--')
plt.xlim([-0.1, 1.1])
plt.ylim([-0.1, 1.1])
plt.xlabel('False Positive Rate', size=14)
plt.ylabel('True Positive Rate', size=14)
plt.legend(loc='lower right')
plt.show()

Tuning ROC to find the optimum threshold value: the following Python code finds the right threshold (cut-off) value.
Precision-Recall Curve
To find the best threshold value based on the trade-off between precision and recall,
precision_recall_curve is drawn.

from sklearn.datasets import load_breast_cancer


from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

# choose a binary classification problem


data = load_breast_cancer()
# develop predictors X and target y dataframes
X = pd.DataFrame(data['data'], columns=data['feature_names'])
y = abs(pd.Series(data['target'])-1)
# split data into train and test set in 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=1)
# build a RF model with default parameters
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)
preds = model.predict(X_test)

sns.set_style('darkgrid')
preds_train = model.predict(X_train)
# calculate prediction probability
prob_train = np.squeeze(model.predict_proba(X_train)[:,1].reshape(1,-1))
prob_test = np.squeeze(model.predict_proba(X_test)[:,1].reshape(1,-1))
# false positive rate, true positive rate, thresholds
fpr1, tpr1, thresholds1 = metrics.roc_curve(y_test, prob_test)
fpr2, tpr2, thresholds2 = metrics.roc_curve(y_train, prob_train)
# auc score
auc1 = metrics.auc(fpr1, tpr1)
auc2 = metrics.auc(fpr2, tpr2)
plt.figure(figsize=(8,8))

# creating index
i = np.arange(len(tpr1))
# extracting roc values against different thresholds
roc = pd.DataFrame({'fpr':fpr1, 'tpr':tpr1, 'tf':(tpr1-1+fpr1), 'thresholds':thresholds1}, index=i)
# top 5 best roc occurrences
roc.iloc[(roc.tf-0).abs().argsort()[:5]]

pre, rec, thr = metrics.precision_recall_curve(y_test, prob_test)


plt.figure(figsize=(8, 4))
plt.plot(thr, pre[:-1], label='precision')
plt.plot(thr, rec[1:], label='recall')
plt.xlabel('Threshold')
plt.title('Precision & Recall vs Threshold', c='r', size=16)
plt.legend()
plt.show()

Trade-off performed by our random forest model between Precision and Recall can be visualized
using the following codes:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

# choose a binary classification problem


data = load_breast_cancer()
# develop predictors X and target y dataframes
X = pd.DataFrame(data['data'], columns=data['feature_names'])
y = abs(pd.Series(data['target'])-1)
# split data into train and test set in 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=1)
# build a RF model with default parameters
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)
preds = model.predict(X_test)

sns.set_style('darkgrid')
preds_train = model.predict(X_train)
# calculate prediction probability
prob_train = np.squeeze(model.predict_proba(X_train)[:,1].reshape(1,-1))
prob_test = np.squeeze(model.predict_proba(X_test)[:,1].reshape(1,-1))
# false positive rate, true positive rate, thresholds
fpr1, tpr1, thresholds1 = metrics.roc_curve(y_test, prob_test)
fpr2, tpr2, thresholds2 = metrics.roc_curve(y_train, prob_train)
# auc score
auc1 = metrics.auc(fpr1, tpr1)
auc2 = metrics.auc(fpr2, tpr2)
plt.figure(figsize=(8,8))

# creating index
i = np.arange(len(tpr1))
# extracting roc values against different thresholds
roc = pd.DataFrame({'fpr':fpr1, 'tpr':tpr1, 'tf':(tpr1-1+fpr1), 'thresholds':thresholds1}, index=i)
# top 5 best roc occurrences
roc.iloc[(roc.tf-0).abs().argsort()[:5]]

pre, rec, thr = metrics.precision_recall_curve(y_test, prob_test)


fig, ax = plt.subplots(1, 1, figsize=(8, 8))

plt.plot(thr, pre[:-1], label='precision')
plt.plot(thr, rec[1:], label='recall')
plt.xlabel('Threshold')
plt.title('Precision & Recall vs Threshold', c='r', size=16)
plt.legend()
plt.show()

R-Squared and Adjusted R-Squared Score

R-squared, often written R2, is the proportion of the variance in the response variable that can be explained by the predictor variables in a linear regression model.

Adjusted R-squared is a modified version of R-squared that adjusts for the number of predictors in a regression model. It is calculated as:
Adjusted R2 = 1 – [(1-R2)*(n-1)/(n-k-1)]
where:
● R2: The R2 of the model
● n: The number of observations
● k: The number of predictor variables
Since R2 always increases as you add more predictors to a model, adjusted R2 can serve as a
metric that tells you how useful a model is, adjusted for the number of predictors in a model.

Example 1: Calculate Adjusted R-Squared with sklearn


The following code shows how to fit a multiple linear regression model and calculate the adjusted
R-squared of the model using sklearn:
from sklearn.linear_model import LinearRegression
import pandas as pd

#define URL where dataset is located


url = "https://raw.githubusercontent.com/Statology/Python-Guides/main/mtcars.csv"

#read in data

data = pd.read_csv(url)

#fit regression model


model = LinearRegression()
X, y = data[["mpg", "wt", "drat", "qsec"]], data.hp
model.fit(X, y)

#display adjusted R-squared


print(1 - (1-model.score(X, y))*(len(y)-1)/(len(y)-X.shape[1]-1))
Output :

0.7787005290062521

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors).
Residuals are a measure of how far from the regression line data points are; RMSE is a measure
of how spread out these residuals are. In other words, it tells you how concentrated the data is
around the line of best fit. Root mean square error is commonly used in climatology, forecasting,
and regression analysis to verify experimental results.
The formula is:

RMSE = √( mean( (f − o)² ) )

Where:
● f = forecasts (expected values or unknown results),
● o = observed values (known results),
and the mean is taken over all of the squared differences. The same formula can be written with the following, slightly different, notation (Barnston, 1992):

RMSE = √( Σ (z_fi − z_oi)² / N )

Where:
● Σ = summation (“add up”)
● (z_fi − z_oi)² = differences, squared
● N = sample size.
You can use whichever formula you feel most comfortable with, as they both do the same thing.
If you don’t like formulas, you can find the RMSE by:
1. Squaring the residuals.
2. Finding the average of the squared residuals.
3. Taking the square root of the result.
That said, this can be a lot of calculation, depending on how large your data set is. A shortcut to
finding the root mean square error is:

RMSE = SDy × √(1 − r²)

Where SDy is the standard deviation of Y and r is the correlation coefficient between the observed and forecast values.

When standardised observations and forecasts are used as RMSE inputs, there is a direct
relationship with the correlation coefficient. For example, if the correlation coefficient is 1, the
RMSE will be 0, because all of the points lie on the regression line (and therefore there are no
errors).

Elbow Method for optimal value of k in KMeans :


A fundamental step for any unsupervised algorithm is to determine the optimal number of
clusters into which the data may be clustered. The Elbow Method is one of the most popular
methods to determine this optimal value of k.
We now demonstrate this method with the K-Means clustering technique using the scikit-learn library for Python.

from sklearn.cluster import KMeans
from sklearn import metrics
from scipy.spatial.distance import cdist
import numpy as np
import matplotlib.pyplot as plt

# Creating the data


x1 = np.array([3, 1, 1, 2, 1, 6, 6, 6, 5, 6, 7, 8, 9, 8, 9, 9, 8])
x2 = np.array([5, 4, 5, 6, 5, 8, 6, 7, 6, 7, 1, 2, 1, 2, 3, 2, 3])
X = np.array(list(zip(x1, x2))).reshape(len(x1), 2)

# Visualizing the data


plt.plot()
plt.xlim([0, 10])
plt.ylim([0, 10])
plt.title('Dataset')
plt.scatter(x1, x2)
plt.show()

Individual Activity:
▪ Apply cross validation techniques.
▪ Show the performance evaluation factors.

SELF-CHECK QUIZ 5.3

Check your understanding by answering the following questions:

Write the correct answer for the following questions.

1. What are Cross validation techniques?

2. Which factors are most important when evaluating a machine learning model's performance?

3. What is conditional probability?

Learning Outcome 5.4- Carry Out Model Deployment

Contents:

▪ DJANGO and REST API

Assessment criteria:

1. DJANGO and REST API are interpreted.


2. DJANGO and REST API are used for deploying model.
3. Project is evaluated and drawbacks are incorporated and rectified.

Resources required:

Students/trainees must be provided with the following resources:


▪ Workplace (actual or simulated), Computer and internet connection

LEARNING ACTIVITY 5.4

Learning Activity Resources/Special Instructions/References


Carry Out Model
▪ Information Sheet: 5.4
Deployment
▪ Self-Check: 5.4
▪ Answer Key: 5.4

INFORMATION SHEET 5.4

Django and REST API are interpreted :

Django
Django is a high-level Python web framework that encourages rapid development and clean,
pragmatic design. Built by experienced developers, it takes care of much of the hassle of web
development, so you can focus on writing your app without needing to reinvent the wheel.
● It’s free and open source.
● Ridiculously fast.
● Django was designed to help developers take applications from concept to completion as
quickly as possible.
● Reassuringly secure.
● Django takes security seriously and helps developers avoid many common security
mistakes.
● Exceedingly scalable.
● Some of the busiest sites on the web leverage Django’s ability to quickly and flexibly scale.
REST APIs
REST APIs provide a flexible, lightweight way to integrate applications, and have emerged as the
most common method for connecting components in microservices architectures.
An API, or application programming interface, is a set of rules that define how applications or
devices can connect to and communicate with each other. A REST API is an API that conforms to
the design principles of the REST, or representational state transfer, architectural style. For this reason, REST APIs are sometimes referred to as RESTful APIs.
First defined in 2000 by computer scientist Dr. Roy Fielding in his doctoral dissertation, REST
provides a relatively high level of flexibility and freedom for developers. This flexibility is just one
reason why REST APIs have emerged as a common method for connecting components and
applications in a microservices architecture.

Django and Rest API are used for deploying model

There is a rise in the use of Machine Learning applications for business, and you will find a lot of Machine Learning models running online commercially. A number of machine learning models run behind every search engine; you will find them inside Google Translate, Apple's Siri and Facebook's facial recognition algorithms. So how are they deployed on the web?
If you have so far worked with machine learning models locally, just applying ML algorithms on
datasets and making predictions, you should know how to deploy them on the web. In this tutorial,
I will walk you through different steps to build and deploy a machine learning model using Django
and REST API, let’s dive deep into it!

Django
Django is a high-level Python framework that lets you build robust and scalable web applications.
This is the most popular framework available in python. It follows the MVT or Model-View-Template
pattern. It is closely related to other MVC frameworks like Ruby on Rails and Laravel. In the MVC
framework, the view and model parts are controlled by the Controller but in Django, the tasks of a
controller are handled implicitly by the framework itself.
Django lets you create a number of applications under a single project. The application has all the
functionalities to work independently. The app is considered as a package that you can reuse in
other projects without making any major changes. This is the greatest advantage of using Django
for building web applications.

Django REST Framework

Django REST framework is a wonderful toolkit for developing robust web APIs using Django and
Python. It gives an easy way to serialise the data and provide it to other applications. It is like a
door between the database and the program which handles querying the database and formatting
of the data. Generally, it uses JSON to format the data.
Machine learning models are mostly written in Python and run locally in a Jupyter notebook or
similar IDEs. But when you need to productionize your model that means you make it available on
the web, you can do this by one of the following-
● Hard code the ML model in the web applications. This is the easiest way to deploy ML
models like simple linear regression or random forest classification on the web. But it has
a lot of drawbacks if you are trying to implement some complex models like Neural
Networks.
● The most efficient way is to provide an interface that will communicate between the ML
model and the web interface. It will feed data to the model, and the model will process it independently. After getting the prediction, this interface will take it back to the web application's end. For this, we can use REST APIs, WebSockets, or RPC.
Using Django REST frameworks, we can build powerful APIs for our machine learning models.
Which will let us handle all the data retrieving tasks without any hassle.

Deploy your First Machine Learning Model using Django and Rest API:
If you have read the above words or know them before, I think you are determined to go with me to learn
how to deploy your first ML project on the web.
In the following sections, we are going to build a simple ML model and web API in Django. Let’s do that!
Here are the steps you need to deploy a machine learning model-
1. Build a Machine Learning Model
2. Install Django, Django REST Framework and Other Dependencies
3. Create a New Django Project
4. Create a New Django App
5. Create a Django Model
6. Update urls.py File
7. Create a Superuser in Django
8. Create a Form in Django
9. Build The REST API
10. Update View in Django
11. Update the App's URL
Build a Machine Learning Model
Before going into production, we need a machine learning model to start with. You can take any
machine learning model to deploy. In this article, we are going to focus more on deployment rather
than building a complete machine learning model. So, I took a simple machine learning model to
deploy.
The model is built upon a simple dataset where we need to predict whether a customer would buy a car based on her age and salary. For that, I will build a simple Support Vector Machine classifier to make predictions upon the dataset.

import numpy as np
import pandas as pd
dataset = pd.read_csv('Social_Network_Ads.csv')
dataset["Gender"] = 1 if "Male" else 2
X = dataset.iloc[:, 1:4]
y = dataset.iloc[:, 4]

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state =
1000)

from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

from sklearn.svm import SVC


classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

from sklearn.metrics import confusion_matrix


cm = confusion_matrix(y_test, y_pred)
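
The deployment steps below load two pickled files, Scaler.sav and Prediction.sav; a minimal sketch of saving them right after training (using the scaler and classifier objects defined above, with the file names assumed later in the views) is:

import pickle

# persist the fitted scaler and classifier so the Django view can load them later
pickle.dump(scaler, open('Scaler.sav', 'wb'))
pickle.dump(classifier, open('Prediction.sav', 'wb'))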

Install Django, Django REST Framework and Other Dependencies


Before starting development, you should create a virtual environment on your computer. This makes it easier to manage the development process, so it is recommended to use a virtual environment.
If you are using the Anaconda platform, go to the anaconda prompt, and write the following-
conda create -n DjangoML python=3.7

This will create a virtual environment. Now, activate the environment-


conda activate DjangoML

After activating the environment, install all the requirements for our project.
pip install numpy pandas scikit-learn django djangorestframework

This will install all the dependencies into your virtual environment (pickle is part of the Python standard library, so it does not need to be installed separately).

Create a New Django Project


In Django, the first step is to create a project which will contain the applications(Django lets you
build different applications under a single project). This is super easy and can be created with a
single command.
In the command line, go to the specific directory where you want to create the project. Then write
the following command-
django-admin startproject DeployML

With this, you will get a Django project containing all the important files you need to build your
applications. The project structure should look like this-

Create a New Django App


Django lets you build many apps under a single project. An app is a complete web application
containing all the necessary files and code to run independently from other apps. All you need to do is create an app, register it in the project and change some other settings to make it run. You can use apps from other projects too.
Type the following command to create a new app in the project-
django-admin startapp DjangoAPI

This will create a Django app inside the project. Remember rest_framework is itself an app to
Django.
Now, go to the settings.py file and register both the rest_framework and your created app in the
INSTALLED_APPS section.

Create a Django Model


The most important part of our project is to create a database where we can keep and retrieve the
data. This database will take care of all the data users provide through the web interface. Upon
this data, our machine learning model will make predictions. This data can be used in the future to
continuously improve our ML model.
In Django, we can do it simply by making a model. A model is a class in python where we will
create the necessary fields to take data from the users. With the specified fields in the model, a
similar table will be created in your database. The fields will be the names of the features of our
dataset. To build a model identical to our dataset, write the following code in the models.py file of your app-

from django.db import models


class Customer(models.Model):
    GENDER_CHOICES = (('Male', 'Male'), ('Female', 'Female'))
    gender = models.CharField(max_length=6, choices=GENDER_CHOICES)
    age = models.IntegerField()
    salary = models.IntegerField()

    def __str__(self):
        return self.gender

Here, the Customer is the required model to make our database where gender, age, and salary
will represent the features of our dataset.
Now, we need to migrate this model as a table in our database. SQLite is the default database in Django, but Django also supports other databases such as PostgreSQL, MySQL, MariaDB, Oracle, and so on. You can use any of these databases for your project.
You need to write two different commands to migrate the tables. Type the following commands for
that-
python manage.py makemigrations
python manage.py migrate

This will create a table named Customers into your database.


You need to register this model to the admin.py file to make it work.

from django.contrib import admin


from .models import Customer

admin.site.register(Customer)

Update urls.py File

Django comes with a default urls.py file in the project. This file keeps the URLs you need to access the different web pages or applications you build under the project. Django does not provide a urls.py file for apps; you need to create that file for every application you use under your project. In the app-specific urls.py file, the URLs to access different parts/web pages of an app are listed. Remember, you need to update both urls.py files.
In the project's urls.py file, write the following-

from django.contrib import admin


from django.urls import path, include

urlpatterns = [
    path('admin/', admin.site.urls),
    path('', include('DjangoAPI.urls')),
]

Create a Superuser in Django


Now, we need to create a user account as an admin to access and control our databases and
other pages. In Django, it is made easier with the following command-
python manage.py createsuperuser

This will require you to give your email address and set a password. Do exactly what it says and
create a superuser account in your web application.
After creating a superuser account, you can now check the table and edit it through the admin site.

Create a Form in Django


In our project, we need to collect information from the users, run the ML model into the collected
data, and show the output to the user. If we want to collect data from the users, we need to build
a form structure in HTML. In Django, the process of creating a form can be done simply with the
Form class. This class is very similar in structure to a Django model. It simplifies all the complicated tasks of managing forms manually by yourself. With this class, you can prepare the HTML template for displaying the form, render the data, return the data to the server, validate and clean up the data, and then save or pass the data on for further processing.
Now, we will build a simple form to collect data for our project. Create a forms.py file into the
DjangoAPI app directory and write the following-

from django import forms


from .models import Customer

class CustomerForm(forms.ModelForm):
    class Meta:
        model = Customer
        fields = "__all__"

    gender = forms.TypedChoiceField(choices=[('Male', 'Male'), ('Female', 'Female')])
    age = forms.IntegerField()
    salary = forms.IntegerField()

This code will create a form that you can use further for different purposes.
We need to create a simple HTML file to show our form to the user. In your templates folder, create
a form.html file for showing the form.

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Document</title>
</head>
<body>

<form method='POST'>
{% csrf_token %}
{{ form.as_p }}
<button type="submit" value="submit">Submit</button>
</form>
</body>
</html>

This HTML form will be used to collect information.


Then we need another HTML file to show the status after submitting the form. Make a status.html
file in your DjangoApi/templates folder.

<!DOCTYPE html>
{% load static %}
<html lang="en">
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Status</title>
</head>
<body>
<h2>Buying Status: {{data}}</h2>
</body>
</html>

Build The REST API


As we have discussed earlier, we will use a REST API to transfer data between the model and the
database. Using the Django-REST framework we can build an API in no time!
In order to allow our model to understand and work with the data, we need to first convert them
into native Python data types that we can easily render into JSON or XML. In Django, we can use
serializers to convert complex data like querysets and model instances into native Python data types and vice versa. It is similar to the model and form classes provided by the framework.
The Django REST framework provides serializer classes for building your own serializers. Create a file named serializer.py and start editing it like the following.

from rest_framework import serializers


from .models import Customer

class CustomerSerializers(serializers.ModelSerializer):
    class Meta:
        model = Customer
        fields = '__all__'

This will do all the tasks regarding data conversion. We need to set the URL for the API; this will be done later when we update the app's urls.py file.

Update View in Django


So far we have built most of the necessary things to make our model work. Now, it's time to do the
most crucial part of our project, updating the views.
In Django, a view is a Python function that takes the web requests of the site and returns web responses. The responses can be anything; in this project we need to redirect the user to the form, collect the data from it, process it, and show the result to the user. All of these things will be done in the view.
Go to the views.py file and update it like the following-

from .forms import CustomerForm
from rest_framework import viewsets
from rest_framework.decorators import api_view
from django.core import serializers
from rest_framework.response import Response
from rest_framework import status
from django.http import JsonResponse
from rest_framework.parsers import JSONParser
from .models import Customer
from .serializer import CustomerSerializers

import pickle
import json
import numpy as np
from sklearn import preprocessing
import pandas as pd
from django.shortcuts import render, redirect
from django.contrib import messages

class CustomerView(viewsets.ModelViewSet):
    queryset = Customer.objects.all()
    serializer_class = CustomerSerializers


def status(df):
    try:
        # load the scaler and model pickled earlier (update these paths for your project)
        scaler = pickle.load(open("/Users/HP-k/DeployML/DjangoAPI/Scaler.sav", 'rb'))
        model = pickle.load(open("/Users/HP-k/DeployML/DjangoAPI/Prediction.sav", 'rb'))
        X = scaler.transform(df)
        y_pred = model.predict(X)
        y_pred = (y_pred > 0.80)
        result = "Yes" if y_pred else "No"
        return result
    except ValueError as e:
        return Response(e.args[0], status.HTTP_400_BAD_REQUEST)


def FormView(request):
    if request.method == 'POST':
        form = CustomerForm(request.POST or None)

        if form.is_valid():
            Gender = form.cleaned_data['gender']
            Age = form.cleaned_data['age']
            EstimatedSalary = form.cleaned_data['salary']
            df = pd.DataFrame({'gender': [Gender], 'age': [Age], 'salary': [EstimatedSalary]})
            # encode gender the same way as in training (Male -> 1, Female -> 2)
            df["gender"] = 1 if Gender == "Male" else 2
            result = status(df)
            return render(request, 'status.html', {"data": result})

    form = CustomerForm()
    return render(request, 'form.html', {'form': form})

Note: copy the Scaler.sav and Prediction.sav files into your DjangoAPI folder and update the paths in the status function to match your project path.

Update the App's URL


To access all the different parts of our Django app, we need to specify the URLs of the app. First, create a urls.py file under the DjangoAPI app and update the URLs like the following-

from django.contrib import admin
from django.urls import path, include
from . import views
from rest_framework import routers

router = routers.DefaultRouter()
# Register the viewset so the Customer API is reachable under /api/customers/
router.register(r'customers', views.CustomerView)

urlpatterns = [
    path('api/', include(router.urls)),
    path('form/', views.FormView, name='form'),
]

Now, we are all set to collect data from the user, pass them to the model by the REST API, and
process them using the model we pickled earlier.
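
For completeness, the app's URL file also has to be included from the project-level urls.py; otherwise
Django never reaches the routes above. A minimal sketch (assuming the project is named DjangoAPI and
the app DjangoApi, as in the paths used earlier) could look like this:

# DjangoAPI/urls.py (project level) -- a sketch; adjust the names to your own project
from django.contrib import admin
from django.urls import path, include

urlpatterns = [
    path('admin/', admin.site.urls),
    path('', include('DjangoApi.url')),   # pulls in the app's url.py shown above
]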

Individual Activity:
▪ DJANGO and REST API are interpreted
▪ Project is evaluated and drawbacks are incorporated and rectified

SELF-CHECK QUIZ 5.4

Check your understanding by answering the following questions:

Write the correct answer for the following questions.

1. What is DJANGO?

2. What is REST API?

LEARNER JOB SHEET 5

Qualification: 2 Years Experience in IT Sector

Learning unit: BUILD, VALIDATE AND DEPLOY MODEL

Learner name:

Personal protective
equipment (PPE):

Materials: Computer and Internet connection

Tools and equipment:

Performance criteria: 1. Machine learning algorithms are interpreted.


2. Ensemble learning techniques are explained.
3. Potential models are selected based on the available data,
data distributions and goals of the project.
4. Feature selection techniques are applied.
5. Dimension reduction techniques are applied.
6. Supervised and unsupervised models are tested on the
dataset.
7. Usages of Scikit-learn library to build models are
demonstrated.
8. Cross validation techniques are applied on the dataset to
validate model.
9. Machine learning model's performance is evaluated using
performance evaluation factors
10. Report is prepared with findings and conclusions for the
data science/ business audience.
11. DJANGO and REST API are interpreted.
12. Project is evaluated and drawbacks are incorporated and
rectified.

Measurement:

Notes:

Procedure: 1. Connect the computer to the internet connection.


2. Connect the router to the internet.

Learner signature: Date:

Assessor signature: Date:

Quality Assurer signature: Date:

Assessor remarks:

Feedback:

ANSWER KEYS

ANSWER KEY 5.1

1. A machine learning algorithm is the method by which the AI system conducts its task, generally
predicting output values from given input data. The two main processes of machine learning algorithms
are classification and regression.

2. The general principle of an ensemble method is to combine the predictions of several models built with
a given learning algorithm in order to improve robustness over a single model.

ANSWER KEY 5.2

1.Dimensionality reduction refers to techniques that reduce the number of input variables in a dataset.
More input features often make a predictive modeling task more challenging to model, more generally
referred to as the curse of dimensionality.
2. Scikit-learn is probably the most useful library for machine learning in Python. The sklearn
library contains a lot of efficient tools for machine learning and statistical modeling including
classification, regression, clustering and dimensionality reduction.

ANSWER KEY 5.3

1. Cross-validation, also referred to as an out-of-sample technique, is an essential element of a data
science project. It is a resampling procedure used to evaluate machine learning models and
assess how the model will perform on an independent test dataset.
2. Looking at the big picture is the foremost and most important step in machine learning, yet
many beginners skip it entirely and thereby limit their understanding of the purpose of machine
learning in their project.
3. Conditional probability is the possibility of an event or outcome happening based on the
existence of a previous event or outcome. It is calculated by multiplying the probability of the
preceding event by the updated probability of the succeeding, or conditional, event.

ANSWER KEY 5.4

1. Django is a high-level Python web framework that enables rapid development of secure and
maintainable websites. Built by experienced developers, Django takes care of much of the hassle of web
development, so you can focus on writing your app without needing to reinvent the wheel.
2. A REST API is an application programming interface that follows the REST (Representational State
Transfer) architectural style for communication over HTTP. One of its key advantages is flexibility: data is
not tied to resources or methods, so REST can handle multiple types of calls, return different data formats
and even change structurally with the correct implementation of hypermedia.

Module 6: Demonstrate Understanding On Big Data

MODULE CONTENT

Module Descriptor: This unit covers the knowledge, skills, and attitudes required to
demonstrate understanding on big data. It specifically includes
interpreting big data, interpreting big data ecosystems and
demonstrating skills on big data platforms.

Nominal Duration: 60 hours

LEARNING OUTCOMES:

Upon completion of the module, the trainee should be able to:

6.1 Interpret big data


6.2 Interpret big data ecosystems
6.3 Demonstrate skills on big data platforms.

PERFORMANCE CRITERIA:

1. Big data is explained.


2. Usage of big data in different organizations are described.
3. Major applications of distributed and cloud platforms are explained.
4. Cloud-based solutions are identified.
5. Big data ecosystems are interpreted.
6. Major components of a big data ecosystem are described.
7. Components of the Hadoop ecosystem are identified.
8. Components of the Spark ecosystem are identified.
9. Spark framework is used to complete a small big-data project.
10. Big data projects are evaluated and drawbacks are incorporated and rectified.

Learning Outcome 6.1 – Interpret big data

Contents:

● Big data
● Usage of big data.
● Distributed and cloud platforms
● Cloud-based solutions.

Assessment criteria:

1. Big data is explained.


2. Usage of big data in different organizations are described.
3. Major applications of distributed and cloud platforms are explained.
4. Cloud-based solutions are identified.

Resources required:

Students/trainees must be provided with the following resources:


▪ Workplace (actual or simulated), Computer, Internet connection.

LEARNING ACTIVITY 6.1

Learning Activity Resources/Special Instructions/References


DEMONSTRATE UNDERSTANDING ON ▪ Information Sheet: 6.1
BIG DATA ▪ Self-Check: 6.1
▪ Answer Key: 6.1

INFORMATION SHEET 6.1

Learning Objective: Interpret big data

Big data is explained:

Big Data is a collection of data that is huge in volume, yet growing exponentially with time. It is data of
such large size and complexity that no traditional data management tool can store or process it
efficiently. In other words, big data is still data, just at an enormous scale.

Types Of Big Data

Following are the types of Big Data:


1. Structured
2. Unstructured
3. Semi-structured

Structured

Any data that can be stored, accessed and processed in a fixed format is termed 'structured' data. Over
time, computer science has become very good at developing techniques for working with this kind of data
(where the format is well known in advance) and deriving value out of it. However, problems arise when
the size of such data grows to a huge extent, with typical sizes now in the range of multiple zettabytes.

Unstructured

Any data whose form or structure is unknown is classified as unstructured data. In addition to being huge
in size, unstructured data poses multiple challenges when it comes to processing it for value. A typical
example of unstructured data is a heterogeneous data source containing a combination of simple text
files, images, videos, etc. Organizations today have a wealth of data available to them, but unfortunately
they often do not know how to derive value out of it, since this data is in its raw, unstructured form.

Semi-structured

Semi-structured data can contain both forms of data. It appears structured in form, but it is not actually
defined by, for example, a table definition in a relational DBMS. A typical example of semi-structured data
is data represented in an XML file.
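
To make the idea concrete, the short Python sketch below parses a small, made-up XML record: the tags
describe the data (so it is not raw text), but there is no fixed relational schema behind it.

# Illustrative only: a self-describing, semi-structured XML record parsed in Python
import xml.etree.ElementTree as ET

record = """
<employee>
    <name>Rahim</name>
    <department>Data Science</department>
    <skills>
        <skill>Python</skill>
        <skill>Spark</skill>
    </skills>
</employee>
"""

root = ET.fromstring(record)
print(root.find("name").text)                 # Rahim
print([s.text for s in root.iter("skill")])   # ['Python', 'Spark']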

Usage of big data in different organizations are described:

Big data challenges the current IT landscape while promising more competitive and efficient
contributions to business organizations. What big data can contribute is what organizations have wanted
for a long time. This section presents the nature of big data and how organizations can advance their
systems with big data technologies. By improving the efficiency and effectiveness of organizations,
people can enjoy the more convenient life that information technology makes possible.
The nature of big data can be understood by studying its definition, the data hierarchy and sources of big
data, and its prominent characteristics, databases and processes. There are five organizational domains
where big data can create value, based on the size of the potential: health care, manufacturing, public
sector administration, retail, and global personal location data. Many other organizations also require big
data solutions, such as scientific research (e.g. astronomical observatories, weather prediction) with huge
amounts of data.
Major applications of distributed and cloud platforms are explained:

Big data applications can help companies to make better business decisions by analyzing large volumes
of data and discovering hidden patterns. These data sets might be from social media, data captured by
sensors, website logs, customer feedbacks, etc. Organizations are spending huge amounts on big data
applications to discover hidden patterns, unknown associations, market style, consumer preferences, and
other valuable business information [4]. The following are domains where big data can be applied:

● health care,
● media and entertainment,
● IoT,
● manufacturing, and
● government.

Health care
Big data systems have brought significant improvements to the healthcare domain through personalized
medicine and prescriptive analytics. Researchers analyze the data to determine the best treatment for a
particular disease, the side effects of drugs, health-risk forecasts, and so on. Mobile health applications
and wearable devices are causing the available data to grow at an exponential rate. It is possible to
predict a disease outbreak by mapping healthcare data onto geographical data; once predicted, the
outbreak can be contained and plans made to eradicate the disease.

Media and entertainment


The media and entertainment industries are creating, advertising, and distributing their content using new
business models. This is due to customer requirements to view digital content from any location and at
any time. The introduction of online TV shows, Netflix channels, etc. is proving that new customers are not
only interested in watching TV but are interested in accessing data from any location. The media houses
are targeting audiences by predicting what they would like to see, how to target the ads, content
monetization, etc. Big data systems are thus increasing the revenues of such media houses by analyzing
viewer patterns.

Internet of Things
IoT devices generate continuous data and send them to a server on a daily basis. These data are mined
to provide the interconnectivity of devices. This mapping can be put to good use by government agencies
and also a range of companies to increase their competence. IoT is finding applications in smart irrigation
systems, traffic management, crowd management, etc.

Manufacturing
Predictive manufacturing can help to increase efficiency by producing more goods by minimizing the
downtime of machines. This involves a massive quantity of data for such industries. Sophisticated
forecasting tools follow an organized process to explore valuable information for these data. The following
are the some of the major advantages of employing big data applications in manufacturing.

Government

By adopting big data systems, the government can attain efficiencies in terms of cost, output, and novelty.
Since the same data set is used in many applications, many departments can work in association with
each other. Government plays an important role in innovation by acting in all these domains.

Big data applications can be applied in each and every field. Some of the major areas where big data finds
applications include:

● agriculture,
● aviation,
● cyber security and intelligence,
● crime prediction and prevention,
● e-commerce,
● fake news detection,
● fraud detection,
● pharmaceutical drug evaluation,
● scientific research,
● weather forecasting, and
● tax compliance.

Cloud-based solutions are identified:

Cloud-based solutions (or 'the cloud' for short) stand for the on-demand delivery of computing resources
over the Internet. On a pay-for-use basis, you can get access to as many resources as you need, such as
storage space, software and applications, networks, and other on-demand services.

AWS Instances Types:

AWS instance types are grouped together into families with several subcategories in each family. These
subcategories are based on the hardware on which they’re run, such as the number of virtual CPUs,
memory (RAM), storage volume, and bandwidth capacity into and out of the instances.

AWS instance types should be selected based on the CPU and memory needs of different workloads and
the network resources required. It’s important to choose the correct AWS instance types as there are
considerable price differences between the different families of AWS instances and the different AWS
instance types within those families. For example, instances with extreme memory and CPU capacities
can be very expensive. Therefore, it’s important to provision each instance appropriately at the point of
deployment and monitor utilization thereafter.

Amazon Redshift:

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start
with just a few hundred gigabytes of data and scale to a petabyte or more. This enables you to use your
data to acquire new insights for your business and customers.

The first step to create a data warehouse is to launch a set of nodes, called an Amazon Redshift cluster.
After you provision your cluster, you can upload your data set and then perform data analysis queries.
Regardless of the size of the data set, Amazon Redshift offers fast query performance using the same
SQL-based tools and business intelligence applications that you use today.
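
Because Redshift speaks the PostgreSQL wire protocol, it can be queried with ordinary SQL tooling. The
sketch below uses the psycopg2 driver; the endpoint, database, user and password are placeholders, and
the sales table is assumed to already exist in the cluster.

# Sketch: querying an Amazon Redshift cluster with psycopg2 (placeholder credentials)
import psycopg2

conn = psycopg2.connect(
    host="examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com",  # placeholder endpoint
    port=5439,                 # Redshift's default port
    dbname="dev",
    user="awsuser",
    password="my_password",
)
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM sales;")   # assumes a 'sales' table exists
    print(cur.fetchone())
conn.close()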

Amazon Sagemaker:

Amazon SageMaker is a fully managed machine learning service. With SageMaker, data scientists and
developers can quickly and easily build and train machine learning models, and then directly deploy them
into a production-ready hosted environment. It provides an integrated Jupyter authoring notebook instance
for easy access to your data sources for exploration and analysis, so you don't have to manage servers.
It also provides common machine learning algorithms that are optimized to run efficiently against extremely
large data in a distributed environment. With native support for bring-your-own-algorithms and frameworks,
SageMaker offers flexible distributed training options that adjust to your specific workflows. Deploy a model
into a secure and scalable environment by launching it with a few clicks from SageMaker Studio or the
SageMaker console. Training and hosting are billed by minutes of usage, with no minimum fees and no
upfront commitments.

Azure Data Factory :

Azure Data Factory is Azure's cloud ETL service for scale-out serverless data integration and data
transformation. It offers a code-free UI for intuitive authoring and single-pane-of-glass monitoring and
management. You can also lift and shift existing SSIS packages to Azure and run them with full
compatibility in ADF. SSIS Integration Runtime offers a fully managed service, so you don't have to worry
about infrastructure management.

Azure Synapse:

Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise
data warehousing, and big data analytics. It gives you the freedom to query data on your terms, using
either serverless or dedicated options—at scale. Azure Synapse brings these worlds together with a unified
experience to ingest, explore, prepare, transform, manage, and serve data for immediate BI and machine
learning needs.

GCP BigQuery:

BigQuery is a fully managed enterprise data warehouse that helps you manage and analyze your data
with built-in features like machine learning, geospatial analysis, and business intelligence. BigQuery's
serverless architecture lets you use SQL queries to answer your organization's biggest questions with zero
infrastructure management. BigQuery's scalable, distributed analysis engine lets you query terabytes in
seconds and petabytes in minutes.
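
As a small illustration, the sketch below runs a SQL query against one of Google's public datasets using
the google-cloud-bigquery client library (it assumes the library is installed and GCP credentials are
configured).

# Sketch: running a SQL query on BigQuery from Python
from google.cloud import bigquery

client = bigquery.Client()   # uses your configured GCP credentials and project

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""

for row in client.query(query).result():   # result() waits for the query job to finish
    print(row.name, row.total)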

Individual Activity:
▪ Big data is explained.
▪ Cloud-based solutions are identified.

SELF-CHECK QUIZ 6.1

Check your understanding by answering the following questions:

Write the correct answer for the following questions.

1. What is Big data?

2. What are the key benefits of cloud-based solutions?

Learning Outcome 6.2- Interpret big data ecosystems

Contents:

▪ Big data ecosystems and its components.

Assessment criteria:

1. Big data ecosystems are interpreted.


2. Major components of a big data ecosystem are described.

Resources required:

Students/trainees must be provided with the following resources:


▪ Workplace (Computer and internet connection).

LEARNING ACTIVITY 6.2

Learning Activity Resources/Special Instructions/References


Interpret big data ecosystems ▪ Information Sheet: 6.2
▪ Self-Check: 6.2
▪ Answer Key: 6.2

Information Sheet: 6.2

Learning Objective: Interpret big data ecosystems

Big data ecosystems are interpreted:

The term ecosystem is defined in the scientific literature as a complex network of interconnected systems.
While in the past corporations dealt with static, centrally stored data collected from various sources, with
the birth of the web and cloud services, cloud computing is rapidly overtaking the traditional in-house
system as a reliable, scalable and cost-effective IT solution. Thus, large datasets – log files, social media
sentiments, click-streams – are no longer expected to reside within a central server or in a fixed place
in the cloud. To handle such copious amounts of data, advanced analytical tools are needed that can
process and store billions of bytes of real-time data, with hundreds of thousands of transactions per
second. Hence, the goal of this section is to introduce definitions, methods, tools, frameworks and
solutions for big data processing, starting from information extraction, via knowledge processing and
knowledge representation, to storage, visualization, sense-making, and practical applications.

Components of the Big Data Ecosystem


In order to depict the information processing flow in just a few phases, in Fig. 1, from left to right, we have
divided the processing workflow into three layers:
● Data sources;
● Data management (integration, storage and processing);
● Data analytics, Business intelligence (BI) and knowledge discovery (KD).

Such partition will allow the authors of this book to discuss big data topics from different perspectives. For
computer scientists and engineers, big data poses problems of data storage and management,
communication, and computation. For data scientists and statisticians responsible for machine learning
models development, the issues are how to get usable information out of datasets that are too huge and
complex for many traditional or classical methods to handle. From an organizational viewpoint, business
analysts are expected to select and deploy analytics services and solutions that contribute mostly to the
organizational strategic goals, for instance, taking into consideration a framework for measuring the
organizational performance.

Data Sources. In a modern data ecosystem, the data sources layer is composed of both private and public
data sources – see the left side of Fig. 2. The corporate data originates from internal systems, cloud-based
systems, as well as external data provided from partners and third parties. Within a modern data
architecture, any type of data can be acquired and stored; however, the most challenging task is to capture
the heterogeneous datasets from various service providers. In order to allow developers to create new
applications on top of open datasets (see examples below), machine-readable formats are needed. As
such, XML and JSON have quickly become the de facto format for the web and mobile applications due
to their ease of integration into browser technologies and server technologies that support Javascript.
Once the data has been acquired, the interlinking of diverse data sources is quite a complex and
challenging process, especially for the acquired unstructured data. That is the reason why semantic
technologies and Linked Data principles have become popular over the last decade. Using Linked Data
principles and a set of agreed vocabularies for a domain, the input data is modeled in the form of resources,
while the existing relationships are modeled as a set of (named) relationships between resources. In order
to represent the knowledge of a specific domain, conceptual schemas are applied (also called ontologies).
Automatic procedures are used to map the data to the target ontology, while standard languages are used
to represent the mappings. Furthermore, in order to unify the knowledge representation and data
processing, standardized hierarchical and multilingual schemas are used called taxonomies. Over the last
decade, thousands of data repositories emerged on the web that companies can use to improve their
products and/or processes. The public data sources (statistics, trends, conversations, images, videos,
audios, and podcasts, for instance from Google Trends, Twitter, Instagram, and others) provide real-time
information and on-demand insights that enable businesses to analyse user interactions, draw patterns
and conclusions. IoT devices have also created significant challenges in many industries and enabled the
development of new business models. However, one of the main challenges associated with these
repositories is automatically understanding the underlying structures and patterns of the data. Such an
understanding is a prerequisite to the application of advanced analytics to the retrieved data [143].
Examples of Open Data Sources from different domains are:

● Facebook Graph API, curated by Facebook, is the primary way for apps to read and write to the
Facebook social graph. It is essentially a representation of all information on Facebook now and in
the past.
● Open Corporates is one of the largest open databases of companies in the world and holds
hundreds of millions of datasets in essentially any country.
● Global Financial Data’s API is recommended for analysts who require large amounts of data for
broad research needs. It enables researchers to study the interaction between different data series,
sectors, and genres of data. The API supports R and Python so that the data can be directly
uploaded to the target application.
● Open Street Map is a map of the world, created by people and free to use under an open license. It
powers map data on thousands of websites, mobile apps, and hardware devices.
● The National Centers for Environmental Information (NCEI) is responsible for hosting and
providing access to one of the most significant archives on Earth, with comprehensive oceanic,
atmospheric, and geophysical data.
● DBPedia is a semantic version of Wikipedia. It has helped companies like Apple, Google, and IBM
to support artificial intelligence projects. DBpedia is at the center of the Linked Data cloud
presented in Fig. 2, top-right quadrant.

Data Management. As data becomes increasingly available (from social media, web logs, IoT sensors,
etc.), the challenge of managing (selecting, combining, storing) and analyzing large and growing data sets
is growing more urgent. From a data analytics point of view, this means that data processing has to be
designed with the diversity and scalability requirements of the targeted data analytics applications in
mind. In modern settings, data acquisition via near real-time data streams, in addition to batch loads, is
managed by different automated processes (Fig. 2, top-left quadrant, presents an example of monitoring
and control of electric power facilities with the Supervisory, Control and Data Acquisition Systems
developed by the Mihajlo Pupin Institute). The novel architecture [471] is 'flexible enough to support
different service levels as well as optimal algorithms and techniques for the different query workloads'
[426].

Over the last two decades, the emerging challenges in the design of end-to-end data processing pipelines
were addressed by computer scientists and software providers in the following ways:

● In addition to operational database management systems (present on the market since 1970s),
different NoSQL stores appeared that lack adherence to the time-honored SQL principles of ACID
(atomicity, consistency, isolation, durability), see Table 3.
● Cloud computing emerged as a paradigm that focuses on sharing data and computations over a
scalable network of nodes including end user computers, data centers (see Fig. 2, bottom-left
quadrant), and web services [23].
● The Data Lake concept as a new storage architecture was promoted where raw data can be stored
regardless of source, structure and (usually) size. The data warehousing approach (based on a
repository of structured, filtered data that has already been processed for a specific purpose) is
thus perceived as outdated as it creates certain issues with respect to data integration and the
addition of new data sources.

The wide availability of big data also means that there are many quality issues that need to be dealt with
before using such data. For instance, data inherently contains a lot of noise and uncertainty or is
compromised because of sensor malfunctioning or interferences, which may result in missing or conflicting
data. Therefore, quality assessment approaches and methods applicable in open big data ecosystems
have been developed [481].

Furthermore, in order to ensure interoperability between different processes and interconnected systems,
the semantic representation of data sources/processes was introduced where a knowledge graph, from
one side, meaningfully describes the data pipeline, and from the other, is used to generate new knowledge.
Big Data, Standards and Interoperability
Interoperability remains a major burden for the developers of the big data ecosystem. In its EU 2030 vision,
the European Union has set out the creation of an internal single market through a standardised system

of laws that apply in all member states and a single European data [85] space – a genuine single market
for data where businesses have easy access to an almost infinite amount of high-quality industrial data.
The vision is also supported by the EU Rolling Plan for ICT Standardisation [86] that identifies 170 actions
organised around five priority domains—5G, cloud, cybersecurity, big data and Internet of Things. In order
to enable broad data integration, data exchange and interoperability with the overall goal of fostering
innovation based on data, standardisation at different levels (such as metadata schemata, data
representation formats and licensing conditions of open data) is needed. This refers to all types of
(multilingual) data, including both structured and unstructured data, and data from different domains as
diverse as geospatial data, statistical data, weather data, public sector information (PSI) and research
data, to name just a few.

In the domain of big data, five different actions have been requested that also involve the following
standardization organizations:

● CEN, the European Committee for Standardization, to support and assist the standardisation
process and to coordinate with the relevant W3C groups on preventing incompatible changes and
on the conditions for availability of the standard(s). The work will be in particular focused on the
interoperability needs of data portals in Europe while providing semantic interoperability with other
applications on the basis of reuse of established controlled vocabularies (e.g. EuroVoc) and
mappings to existing metadata vocabularies (e.g. SDMX, INSPIRE metadata, Dublin Core, etc.);
● CENELEC (the European Committee for Electrotechnical Standardization) in particular in relation
to personal data management and the protection of individuals’ fundamental rights;
● ETSI (the European Telecommunications Standards Institute) to coordinate stakeholders and
produce a detailed map of the necessary standards (e.g. for security, interoperability, data
portability and reversibility) and together with CEN to work on various standardisation deliverables
needed for the completion of the rationalised framework of e-signatures standards;
● IEEE has a series of new standards projects related to big data (mobile health, energy-efficient
processing, personal agency and privacy) as well as pre-standardisation activities on big data and
open data;
● ISO/IEC JTC1, WG 9—Big Data, formed in November 2014 in relation to requirements, use
cases, vocabulary and a reference architecture for big data;
● OASIS, in relation to querying and sharing data across disparate applications and multiple
stakeholders for reuse in enterprise, cloud, and mobile devices. Specification development in the
OASIS OData TC builds on the core OData Protocol V4 released in 2014 and addresses additional
requirements identified as extensions in four directional white papers: data aggregation, temporal
data, JSON documents, and XML documents as streams;
● OGC, the Open Geospatial Consortium defines and maintains standards for location-based, spatio-
temporal data and services. The work includes, for instance, schema allowing descriptions of
spatio-temporal sensors, images, simulations, and statistics data (such as “datacubes”), a modular
suite of standards for Web services allowing ingestion, extraction, fusion, and (with the web
coverage processing service (WCPS) component standard) analytics of massive spatio-temporal
data like satellite and climate archives. OGC also contributes to the INSPIRE project;
● W3C, the W3C Semantic Web Activity Group has accepted numerous Web technologies as
standards or recommendations for building semantic applications including RDF (Resource
Description Framework) as a general-purpose language; RDF Schema as a meta-language or
vocabulary to define properties and classes of RDF resources; SPARQL as a standard language
for querying RDF data: OWL, Web Ontology Language for effective reasoning. More about
semantic standards can be found in [223].

Major components of a big data ecosystem are described:

The Big Data Architecture Framework (BDAF) is proposed to address all aspects of the Big Data
Ecosystem and includes the following components: data sources, data management, data analytics,
business intelligence (BI), and knowledge discovery (KD).

Data sources:
A data source, in the context of computer science and computer applications, is the location where data
that is being used comes from. In a database management system, the primary data source is the
database, which can be located in a disk or a remote server. The data source for a computer program can
be a file, a data sheet, a spreadsheet, an XML file or even hard-coded data within the program.

Individual Activity:
▪ Big data ecosystems.
▪ Major components of a big data ecosystem.

SELF-CHECK QUIZ 6.2

Check your understanding by answering the following questions:

Write the correct answer for the following questions.

1. What are Big data ecosystems?

2. What are the major components of a big data ecosystem?

Learning Outcome 6.3- Demonstrate skills on big data
platforms

Contents:

▪ Hadoop ecosystem.
▪ Spark ecosystem.
▪ Spark framework.
▪ Big data projects.

Assessment criteria:

1. Components of Hadoop ecosystem are identified.


2. Components of Spark ecosystem are identified.
3. Spark framework is used to complete a small big-data project.
4. Big data project is evaluated and drawbacks are incorporated and rectified.

Resources required:

Students/trainees must be provided with the following resources:


▪ Workplace (Computer and internet connection).

LEARNING ACTIVITY 6.3

Learning Activity Resources/Special Instructions/References


Demonstrate skills on big data platforms ▪ Information Sheet: 6.3
▪ Self-Check: 6.3
▪ Answer Key: 6.3

INFORMATION SHEET 6.3

Learning Objective: Demonstrate skills on big data platforms

Components of Hadoop ecosystem are identified

Hadoop Ecosystem is a platform or a suite which provides various services to solve the big data
problems. It includes Apache projects and various commercial tools and solutions. There are four major
elements of Hadoop i.e. HDFS, MapReduce, YARN, and Hadoop Common. Most of the tools or
solutions are used to supplement or support these major elements. All these tools work collectively to
provide services such as absorption, analysis, storage and maintenance of data etc.
Hadoop Distributed File System:

● HDFS is the primary or major component of the Hadoop ecosystem and is responsible for
storing large data sets of structured or unstructured data across various nodes and thereby
maintaining the metadata in the form of log files.

● HDFS consists of two core components i.e.

1. Name node

2. Data Node

● Name Node is the prime node; it contains metadata (data about data) and requires
comparatively fewer resources than the data nodes that store the actual data. These data
nodes are commodity hardware in the distributed environment, which undoubtedly makes
Hadoop cost effective.

● HDFS maintains all the coordination between the clusters and hardware, thus working at the
heart of the system.
Yet Another Resource Negotiator:

● As the name implies, YARN (Yet Another Resource Negotiator) helps to manage the
resources across the clusters. In short, it performs scheduling and resource
allocation for the Hadoop system.

● Consists of three major components i.e.

1. Resource Manager

2. Nodes Manager

3. Application Manager

● Resource manager has the privilege of allocating resources for the applications in a system
whereas Node managers work on the allocation of resources such as CPU, memory,
bandwidth per machine and later on acknowledges the resource manager. Application
manager works as an interface between the resource manager and node manager and
performs negotiations as per the requirement of the two.
MapReduce:

● By making the use of distributed and parallel algorithms, MapReduce makes it possible to
carry over the processing’s logic and helps to write applications which transform big data sets
into a manageable one.

● MapReduce makes use of two functions, Map() and Reduce(), whose tasks are:

1. Map() performs sorting and filtering of the data, thereby organizing it into groups.
Map generates a key-value pair based result which is later processed by the
Reduce() method.

2. Reduce(), as the name suggests, does the summarization by aggregating the
mapped data. In simple terms, Reduce() takes the output generated by Map() as
input and combines those tuples into a smaller set of tuples. (A conceptual
word-count sketch in plain Python follows below.)
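
The sketch below is purely conceptual Python (not Hadoop code): it expresses the classic word-count
example as a map phase that emits (word, 1) pairs, a shuffle/sort step, and a reduce phase that sums the
values per key.

# Conceptual word count: map -> shuffle/sort -> reduce, in plain Python
from itertools import groupby
from operator import itemgetter

lines = ["big data is big", "data is valuable"]

# Map: emit a (key, value) pair for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle/sort: group the pairs by key, as the framework does between the two phases
mapped.sort(key=itemgetter(0))

# Reduce: aggregate the values for each key
counts = {key: sum(value for _, value in group)
          for key, group in groupby(mapped, key=itemgetter(0))}
print(counts)   # {'big': 2, 'data': 2, 'is': 2, 'valuable': 1}
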
Hadoop common:
The Hadoop Common package is considered as the base/core of the framework as it provides essential
services and basic processes such as abstraction of the underlying operating system and its file system.
Hadoop Common also contains the necessary Java Archive (JAR) files and scripts required to start
Hadoop. The Hadoop Common package also provides source code and documentation, as well as a
contribution section that includes different projects from the Hadoop Community.

Components of spark ecosystem

The components of the Spark ecosystem are under active development, and new contributions are
made regularly. Primarily, the Spark ecosystem comprises the following components:

1. Shark (SQL)
2. Spark Streaming (Streaming)
3. MLLib (Machine Learning)
4. GraphX (Graph Computation)
5. Spark Core

These components are built on top of the Spark Core engine, which allows writing raw Spark programs
and Scala programs and launching them, as well as Java programs; all of these are ultimately executed
by the Spark Core engine. In addition, various other projects have emerged quickly and efficiently on
top of it.

Spark Core:

● The Spark Core is the heart of Spark and performs the core functionality.
● It holds the components for task scheduling, fault recovery, interacting with storage systems and
memory management.

Shark (SQL):

● The Spark SQL is built on the top of Spark Core. It provides support for structured data.

● It allows querying the data via SQL (Structured Query Language) as well as the Apache Hive
variant of SQL, called HQL (Hive Query Language).
● It supports JDBC and ODBC connections that establish a relation between Java objects and
existing databases, data warehouses and business intelligence tools.
● It also supports various sources of data like Hive tables, Parquet, and JSON. (A short PySpark SQL example follows this list.)
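
A minimal Spark SQL sketch is shown below. It assumes a local PySpark installation; the two data rows
are made up purely for illustration.

# Sketch: registering a DataFrame as a SQL table and querying it with Spark SQL
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

df = spark.createDataFrame(
    [("Karim", 34, "paid"), ("Fatema", 28, "free")],
    ["name", "age", "level"],
)

df.createOrReplaceTempView("users")   # expose the DataFrame to the SQL engine
spark.sql("SELECT level, COUNT(*) AS n FROM users GROUP BY level").show()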

Spark Streaming and Structured Streaming:

● Spark Streaming is a Spark component that supports scalable and fault-tolerant processing of
streaming data.
● It uses Spark Core's fast scheduling capability to perform streaming analytics.
● It accepts data in mini-batches and performs RDD transformations on that data.
● Its design ensures that the applications written for streaming data can be reused to analyze
batches of historical data with little modification.
● The log files generated by web servers can be considered a real-time example of a data
stream. (A minimal Structured Streaming sketch follows this list.)
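
The following is a minimal Structured Streaming sketch (assuming a local PySpark installation). It uses
the built-in "rate" source, which simply generates timestamped rows, so no external stream is needed to
try it out.

# Sketch: a running count over a built-in streaming source, printed to the console
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

query = (stream.groupBy().count()          # running count of all rows received so far
               .writeStream
               .outputMode("complete")
               .format("console")
               .start())

query.awaitTermination(10)   # let it run for about 10 seconds
query.stop()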

Machine learning library:

● The MLlib is a Machine Learning library that contains various machine learning algorithms.
● These include correlations and hypothesis testing, classification and regression, clustering, and
principal component analysis.
● It is nine times faster than the disk-based implementation used by Apache Mahout.

GraphX:

● The GraphX is a library that is used to manipulate graphs and perform graph-parallel
computations.
● It facilitates the creation of a directed graph with arbitrary properties attached to each vertex and
edge.
● To manipulate graphs, it supports various fundamental operators like subgraph, join Vertices, and
aggregate Messages.

Basic project of Big data using spark

An Example to Predict Customer Churn

Apache Spark has become arguably the most popular tool for analyzing large data sets. As my capstone
project for Udacity’s Data Science Nanodegree, I’ll demonstrate the use of Spark for scalable data
manipulation and machine learning. Context-wise, we use the user log data from a fictitious music
streaming company, Sparkify, to predict which customers are at risk to churn.

The full data set is 12GB. We'll first analyze a mini subset (128MB) and build classification models using
Spark Dataframe, Spark SQL, and Spark ML APIs in local mode through the python interface API,
PySpark. Then we’ll deploy a Spark cluster on AWS to run the models on the full 12GB of data.
Hereafter, we assume that Spark and PySpark are installed (a tutorial for installing PySpark).

Set up a Spark session

Before we are able to read csv, json, or xml data into Spark dataframes, a Spark session needs to be set
up. A Spark session is a unified entry point for Spark applications from Spark 2.0. Note that prior to
Spark 2.0, various Spark contexts are needed to interact with Spark’s different functionalities (a good
Medium article on this).

# Set up a SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("capstone").getOrCreate()

The Dataset

# Load data and show basic data shape


path = "mini_sparkify_event_data.json"
df = spark.read.json(path)

Now that the mini Sparkify user log data set is in the Spark dataframe format, we can do some initial
exploration to get familiar with the data. Spark dataframe and Spark SQL modules have methods such
as the following: select(), filter(), where(), groupBy(), sort(), dropDuplicates(), count(), avg(), max(), min().
They also have Window functions that are useful for basic analysis (see documentation for syntax). To
summarize the data:

1. The dataset has 286500 rows and 18 columns.

2. It spans the time period from 2018-9-30 to 2018-12-02.

3. It records each event a user performed during this time period as a row.

4. Column definitions are as follows:

-- artist (string): artist's name for a song

-- auth (string): Logged Out | Cancelled | Guest | Logged In
-- firstName (string): user's firstname
-- gender (string): Female | Male
-- itemInSession (long) : number of items in a session
-- lastName (string): user's lastname
-- length (double): a song's length in seconds
-- level (string): paid | free
-- location (string): city and state of the user
-- method (string): HTTP method
-- page (string): which page a user is on at an event
-- registration (long): timestamp of user registration
-- sessionId (long): the Id of the session a user is in at an event
-- song (string): song name
-- status(long): 307 | 404 | 200
-- ts (long): timestamp at each event
-- userAgent (string) :
-- userId (string): user ID

5. Some of these columns probably are not very useful for prediction, such as firstName, lastName,
method, and userAgent. Categorical features need to be encoded, such as gender and level. Some
numerical features will be useful for engineering aggregated behavior features, such as itemInSession,
length, page visits, etc.

6. Class is imbalanced; we need to consider stratified sampling when we split training test data. We also
should consider the f1 score over the accuracy for our model evaluation metrics.

7. For models, we will try Logistic Regression, Decision Tree, Random Forest, and Gradient Boosted
Trees.

With these initial thoughts, let’s proceed with handling missing values.

Dealing with Missing Value

The seaborn heatmap is a good way to show where missing values are located in the dataset and
whether data is missing in some systematic way.

# Let's take a look at where the missing values are located.


plt.figure(figsize=(18,6))
sns.heatmap(df.toPandas().isnull(),cbar=False)

Note (1): From the heatmap, we can see that firstName, lastName, gender, location, userAgent, and
registration are missing in the same rows. We can infer that these missing values come from users that
are not registered. Usually, users who are not registered would not have a user ID. We'll explore this
further.

df.select('userId').filter('registration is null').show(3)

It turns out that the userId column actually has missing values, but they are coded as just empty strings
instead of as "NaN". The number of such null values matches the number of missing rows in registration.
Since these records do not even have userId information, we'll go ahead and delete them.

Note (2): Also, artist, length, and song are missing in the same rows. These records do not have
song-related information. We'll explore what pages users are on for these rows.

print(df_pd[df_pd.artist.isnull()]['page'].value_counts())
print(df_pd[df_pd.artist.isnull()==False]['page'].value_counts())
Thumbs Up 12551
Home 10082
Add to Playlist 6526
Add Friend 4277
Roll Advert 3933
Logout 3226
Thumbs Down 2546
Downgrade 2055
Settings 1514
Help 1454
Upgrade 499
About 495
Save Settings 310
Error 252

Submit Upgrade 159
Submit Downgrade 63
Cancel 52
Cancellation Confirmation 52
Name: page, dtype: int64
NextSong 228108
Name: page, dtype: int64

Feature Engineering and EDA

Based on intuition and domain knowledge, we decide not to include the columns for firstName,
lastName, method, and userAgent in our first-pass modeling for now, since these variables probably do
not affect our prediction. We also decided not to include the artist, location, song, and status for now.
This leaves us with the following columns:

-- gender (string): Female | Male


-- itemInSession (long) : number of items in a session
-- length (double): a song's length in seconds
-- level (string): paid | free
-- page (string): which page a user is on at an event
-- registration (long): timestamp of user registration
-- sessionId (long): the Id of the session a user is in at an event
-- ts (long): timestamp ateach event
-- userId (string): user ID

1) Define Churned User: We can see that there is approximately a 1:3 class imbalance.

# Imports assumed for this snippet
from pyspark.sql import Window
from pyspark.sql.functions import udf, sum as Fsum
from pyspark.sql.types import IntegerType

# Flag the "Cancellation Confirmation" events as churn
flag_cancellation = udf(lambda x: 1 if x == "Cancellation Confirmation" else 0, IntegerType())
df = df.withColumn("churn", flag_cancellation("page"))

# Create the cross-sectional data that we'll use in analysis and modelling
w1 = Window.partitionBy('userId')
df_user = df.select('userId', 'churn', 'gender', 'level') \
    .withColumn('churned_user', Fsum('churn').over(w1)) \
    .dropDuplicates(['userId']).drop('churn')
df_user.groupby('churned_user').count().show()
+------------+-----+
|churned_user|count|
+------------+-----+
| 0| 173|
| 1| 52|
+------------+-----+

2) Categorical features: For categorical features, we first need to do label encoding (simply converting
each value to a number). Depending on the machine learning model, we may need to further encode
these numbers as dummy variables (e.g., one-hot encoding). In Spark, StringIndexer does the label
encoding part:

indexer = StringIndexer(inputCol="gender",outputCol="genderIndex")
df_user = indexer.fit(df_user).transform(df_user)
indexer = StringIndexer(inputCol="level",outputCol="levelIndex")
df_user = indexer.fit(df_user).transform(df_user)
df_user.show(3)
+------+------+-----+------------+-----------+----------+
|userId|gender|level|churned_user|genderIndex|levelIndex|
+------+------+-----+------------+-----------+----------+
|100010| F| free| 0| 1.0| 0.0|
|200002| M| free| 0| 0.0| 0.0|
| 125| M| free| 1| 0.0| 0.0|
+------+------+-----+------------+-----------+----------+
only showing top 3 rows

Let’s take a look at how gender and level are related to churn. By looking at the simple statistics, it
seems that a larger percentage of male users tend to churn than female users and that a larger
percentage of paid users tend to churn than free users.

df_user.groupby('genderIndex').avg('churned_user').show()
+-----------+-------------------+
|genderIndex| avg(churned_user)|
+-----------+-------------------+
| 0.0| 0.2644628099173554|
| 1.0|0.19230769230769232|
+-----------+-------------------+
df_user.groupby('churned_user').avg('levelIndex').show()
+------------+-------------------+
|churned_user| avg(levelIndex)|
+------------+-------------------+
| 0|0.23121387283236994|
| 1|0.15384615384615385|
+------------+-------------------+

Since we will utilize Logistic Regression and SVM classifier, we will need to convert label encoding to
dummy variables. OneHotEncoderEstimator() does this part:

encoder = OneHotEncoderEstimator(inputCols=["genderIndex", "levelIndex"],
                                 outputCols=["genderVector", "levelVector"])
model = encoder.fit(df_user)
df_user = model.transform(df_user)
df_user.select('userId','genderVector','levelVector').show(3)
+------+-------------+-------------+
|userId| genderVector| levelVector|
+------+-------------+-------------+
|100010| (1,[],[])|(1,[0],[1.0])|
|200002|(1,[0],[1.0])|(1,[0],[1.0])|
| 125|(1,[0],[1.0])|(1,[0],[1.0])|
+------+-------------+-------------+
only showing top 3 rows

The output columns of OneHotEncoderEstimator() are not the same as sklearn's output. Instead of binary
values, it gives the sparse vector format shown in the above code snippet.

3) General activity aggregates: Based on the columns for sessionId, song, artist, length, and
registration, we generate aggregated features including the following (a PySpark sketch of these
aggregations follows the list):

● numSessions (number of sessions a user had during this period)

● numSongs (number of different songs a user listened to)

● numArtists (number of different artists a user listened to)

● playTime (total time of playing songs measured in seconds)

● activeDays (number of days since a user registered)
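
A sketch of how these aggregates could be computed with the PySpark DataFrame API is shown below.
The column names follow the dataset described above; the exact feature set and join strategy are a matter
of choice.

# Sketch: engineering the per-user activity aggregates described above
from pyspark.sql import functions as F

df_activity = df.groupby("userId").agg(
    F.countDistinct("sessionId").alias("numSessions"),
    F.countDistinct("song").alias("numSongs"),
    F.countDistinct("artist").alias("numArtists"),
    F.sum("length").alias("playTime"),
    # ts and registration are millisecond timestamps, so divide to get days
    ((F.max("ts") - F.max("registration")) / (1000 * 60 * 60 * 24)).alias("activeDays"),
)

df_user = df_user.join(df_activity, on="userId")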

4) Page visits aggregates: Based on the page column, we generate aggregated page-visit behavior
features that count how many times a user visited each type of page during the period.

w2 = Window.partitionBy('userId','page')
columns = [str(row.page) for row in df.select('page')\
.dropDuplicates().sort('page').collect()]
df_pageVisits = df.select('userId','page')\
.withColumn('pageVisits',count('userId').over(w2))\
.groupby('userId')\
.pivot('page',columns)\
.mean('pageVisits')
df_pageVisits = df_pageVisits.na.fill(0).drop('Cancel', 'Cancellation Confirmation')  # Spark's drop takes column names, not a list with axis

5) Check for multicollinearity: Tree-based models would not be affected by multicollinearity, but, since
we are also testing linear models (logistic regression and svm), we’ll go ahead and remove highly
correlated features.

6) Vector Assembling and Feature Scaling: In Spark, machine learning models require features to be
a vector type. The VectorAssembler() method converts all the feature columns into one vector, as shown
in the following code snippet.

# Vector Assembler
cols = df_inuse.drop('userID','churned_user').columns
assembler=VectorAssembler(inputCols=cols,outputCol='feature_vector')
df_inuse=assembler.transform(df_inuse).select('userId','churned_user','feature_vector')
df_inuse.take(1)
[Row(userId='100010', churned_user=0, feature_vector=SparseVector(13, {0: 1.0, 2: 52.0, 6: 2.0,
7: 7.0, 8: 11.4259, 9: 1.0, 12: 1.0}))]
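
The scaling step itself is not shown in the original snippet. A minimal sketch using Spark ML's
StandardScaler, which would produce the features column that appears in the output below (the original
run also appears to rename the churned_user column to label at this point), could look like this:

# Sketch: standard-scale the assembled feature vector
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="feature_vector", outputCol="features", withStd=True)
df_inuse_scaled = scaler.fit(df_inuse).transform(df_inuse)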

The scaled data looks like this:

df_inuse_scaled.take(1)
[Row(userId='100010', label=0, feature_vector=SparseVector(13, {0: 1.0, 2: 52.0, 6: 2.0, 7: 7.0, 8:
11.4259, 9: 1.0, 12: 1.0}), features=SparseVector(13, {0: 0.3205, 2: 2.413, 6: 0.7817, 7: 0.4779, 8:
0.3488, 9: 2.0013, 12: 2.4356}))]

7) Split data into training and testing sets:

ratio = 0.8
train = df_inuse_scaled.sampleBy('churned_user', fractions={0: ratio, 1: ratio}, seed=42)
test = df_inuse_scaled.subtract(train)

Model Selection
We will compare five baseline models: Logistic Regression, Linear SVM Classifier, Decision Tree,
Random Forests, and Gradient Boosted Tree Classifier

lr = LogisticRegression()
svc = LinearSVC()
dtc = DecisionTreeClassifier()
rfc = RandomForestClassifier()
gbt = GBTClassifier()

The ParamGridBuilder() class can be used to construct a grid of hyper-parameters to search over.
However, since the purpose here is to show Spark's ML methods, we will not do an in-depth tuning of the
models here.

# this line will keep the default hyper-parameters of a model
paramGrid = ParamGridBuilder().build()

# to search over more parameters, we can use the .addGrid() method, for example:
paramGrid = ParamGridBuilder()\
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .addGrid(lr.fitIntercept, [False, True])\
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
    .build()

We'll define an evaluation function to run through all five classification models and output their
cross-validation average metrics (f1).

def evaluate(model_name, train, test):
    evaluator = MulticlassClassificationEvaluator(metricName='f1')
    paramGrid = ParamGridBuilder().build()
    crossval = CrossValidator(estimator=model_name,
                              evaluator=evaluator,
                              estimatorParamMaps=paramGrid,
                              numFolds=3)
    cvModel = crossval.fit(train)
    cvModel_metrics = cvModel.avgMetrics
    transformed_data = cvModel.transform(test)
    test_metrics = evaluator.evaluate(transformed_data)
    return (cvModel_metrics, test_metrics)

Finally, the performance of the five baseline models is shown in the following code snippet. As we can
see, the f1 scores for all of the models are unsatisfactory; we certainly need finer tuning to search for
optimized hyper-parameters for these models! However, if we were to choose from these baseline
models, the cross-validation f1 score should be the criterion, in which case the LinearSVC model would
be the model of choice. (Note that the test score is worse than the score on the training data, indicating
over-fitting.)

model_names = [lr, svc, dtc, rfc, gbt]
for model in model_names:
    a = evaluate(model, train, test)
    print(model, a)
LogisticRegression ([0.6705811320138374], 0.6320191158900836)
LinearSVC ([0.6765153189823112], 0.6320191158900836)
DecisionTreeClassifier ([0.6382104034150818], 0.684376432033105)
RandomForestClassifier ([0.666026954511646], 0.6682863679086347)
GBTClassifier([0.6525712756381464], 0.6576482830385015)

Deploy on Cloud (AWS)

To run the model on the full 12GB of data on AWS, we'll use basically the same code, except that the
plotting portion done by Pandas is removed. One point is worth noting though: following the Spark
course's instructions on configuring the cluster in the Nanodegree's extracurricular material, I was able to
run the code; however, the session goes inactive after a while. This is likely due to insufficient Spark
driver memory. Therefore, we need to go with the advanced option when configuring the cluster and
increase the driver memory.

Individual Activity:
▪ Identify the components of the Hadoop ecosystem.
▪ Identify the components of the Spark ecosystem.

SELF-CHECK QUIZ 6.3

Check your understanding by answering the following questions:

Write the correct answer for the following questions.

1. What are the components of the Spark ecosystem?

2. What are the components of the Hadoop ecosystem?

LEARNER JOB SHEET 6

Qualification: 2 Years Experience in IT Sector

Learning unit: DEMONSTRATE UNDERSTANDING ON BIG DATA

Learner name:

Personal protective
equipment (PPE):

Materials: Computer and Internet connection

Tools and equipment:

Performance criteria:
1. Big data is explained.
2. Usage of big data in different organizations are described.
3. Major applications of distributed and cloud platforms are
explained.
4. Cloud-based solutions are identified.
5. Big data ecosystems are interpreted.
6. Major components of a big data ecosystem are described.
7. Components of the Hadoop ecosystem are identified.
8. Components of the Spark ecosystem are identified.
9. Spark framework is used to complete a small big-data
project.
10. Big data projects are evaluated and drawbacks are
incorporated and rectified.

Measurement:

Notes:

Procedure: 1. Connect the computer to the internet connection.


2. Connect the router to the internet.

Learner signature: Date:

Assessor signature: Date:

Quality Assurer signature: Date:

Assessor remarks:

Feedback:

ANSWER KEYS

ANSWER KEY 6.1

1. The definition of big data is data that contains greater variety, arriving in increasing volumes and with
more velocity. This is also known as the three Vs. Put simply, big data is larger, more complex data sets,
especially from new data sources.

2. It enables us to run software programs without installing them on our computers; it enables us to store
and access our multimedia content via the internet, it enables us to develop and test programs without
necessarily having servers and so on

ANSWER KEY 6.2

1. A big data ecosystem is the combination of massive functional components with various enabling
tools. Its capabilities are not only about computing and storing big data, but also about the advantages of
its systematic platform and the potential of big data analytics.
2. The Big Data Architecture Framework (BDAF) is proposed to address all aspects of the Big Data
Ecosystem and includes the following components: Big Data Infrastructure, Big Data Analytics, Data
structures and models, Big Data Lifecycle Management, Big Data Security.

ANSWER KEY 6.3

1. The Apache Spark ecosystem is an open-source distributed cluster-computing framework. Spark is a


data processing engine developed to provide faster and easier analytics than Hadoop MapReduce.
2. There are three components of Hadoop: Hadoop HDFS - Hadoop Distributed File System (HDFS) is
the storage unit. Hadoop MapReduce - Hadoop MapReduce is the processing unit. Hadoop YARN - Yet
another Resource Negotiator (YARN) is a resource management unit.

