Notes - Unit 1 - Exploratory Data Analysis
EDA Fundamentals
Introduction
➢ Data encompasses a collection of discrete objects, numbers, words, events, facts,
measurements, observations, or even descriptions of things.
➢ Such data is collected and stored by every event or process occurring in several disciplines,
including biology, economics, engineering, marketing, and others.
➢ Processing such data elicits useful information and processing such information generates
useful knowledge.
➢ Exploratory Data Analysis enables generating meaningful and useful information from such
data.
➢ Exploratory Data Analysis (EDA) is a process of examining the available dataset to discover
patterns, spot anomalies, test hypotheses, and check assumptions using statistical measures.
➢ Primary aim of EDA is to examine what data can tell us before actually going through formal
modeling or hypothesis formulation.
Understanding data science
➢ Data science involves cross-disciplinary knowledge from
computer science, data, statistics, and mathematics.
➢ There are several phases of data analysis, including
1. Data requirements
2. Data collection
3. Data processing
4. Data cleaning
5. Exploratory data analysis
6. Modeling and algorithms
7. Data product and communication
➢ These phases are similar to those of the CRoss-Industry Standard Process for Data Mining
(CRISP-DM) framework used in data mining.
1. Data requirements
• There can be various sources of data for an organization.
• It is important to comprehend what type of data is required for the organization to
be collected, curated, and stored.
• For example, an application tracking the sleeping pattern of patients suffering from
dementia requires several types of sensors' data storage, such as sleep data, heart rate
from the patient, electro-dermal activities, and user activities patterns.
• All of these data points are required to correctly diagnose the mental state of the
person.
• Hence, these are mandatory requirements for the application.
• It is also required to categorize the data, numerical or categorical, and the format of
storage and dissemination.
2. Data collection
• Data collected from several sources must be stored in the
correct format and transferred to the right personnel within the organization.
3. Data processing
• Preprocessing involves pre-curating the dataset before the actual analysis: placing
the data under the right tables, structuring it, and exporting it in the correct format.
4. Data cleaning
• Data cleaning deals with incomplete, incorrect, or inaccurate records, which must be
checked and then corrected or removed before analysis.
5. Exploratory data analysis
• This is the stage where the cleaned dataset is examined using statistical measures to
discover patterns and check assumptions.
6. Modeling and algorithms
• Models or mathematical formulas can express the relationship among variables, such as
correlation or causation.
• These models or equations involve one or more variables
that depend on other variables to cause an event.
• For example, when buying pens, the total price of pens (Total)
= price for one pen (UnitPrice) * the number of pens bought
(Quantity). Hence, our model would be Total = UnitPrice *
Quantity. Here, the total price depends on the unit price, so the
total price is referred to as the dependent variable and the unit
price as the independent variable.
• In general, a model always describes the relationship
between independent and dependent variables.
• Inferential statistics deals with quantifying relationships
between particular variables.
• The Judd model describes the relationship between data and a model
as Data = Model + Error.
7. Data Product
• Any computer software that uses data as inputs, produces
outputs, and provides feedback based on the output to control the
environment is referred to as a data product.
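The pen-pricing model above (Total = UnitPrice * Quantity) can be sketched as a one-line Python function; the names are illustrative:

```python
def total_price(unit_price, quantity):
    # Total (dependent variable) is computed from UnitPrice and
    # Quantity (independent variables)
    return unit_price * quantity

print(total_price(2.5, 4))  # 10.0
```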
Steps in EDA
1. Problem definition
• It is essential to define the business problem to be solved
before trying to extract useful insight from the data.
• The problem definition works as the driving force for a data
analysis plan execution.
• The main tasks involved in problem definition are defining the main
objective of the analysis, defining the main deliverables, outlining the
roles and responsibilities, obtaining the current status of the data,
defining the timetable, and performing a cost/benefit analysis.
2. Data preparation
• This step involves defining the sources of data, cleaning the dataset,
deleting non-relevant data, and transforming the data into the format
required for analysis.
3. Data analysis
• The main techniques applied to the prepared data include
o summary tables
o graphs
o descriptive statistics
o inferential statistics
o correlation statistics
o searching
o grouping
o mathematical models
4. Development and representation of the results
• This step involves presenting the dataset to the target
audience in the form of graphs, summary tables, maps, and
diagrams.
• This is also an essential step as the result analyzed from
the dataset should be interpretable by the business
stakeholders, which is one of the major goals of EDA.
• Most of the graphical analysis techniques include
o scatter plots
o character plots
o histograms
o box plots
o residual plots
o mean plots
Making Sense of Data
➢ It is crucial to identify the type of data under analysis.
➢ Different disciplines store different kinds of data for different purposes.
➢ Example: medical researchers store patients' data, universities store students' and teachers' data,
and real estate industries store house and building datasets.
➢ A dataset contains many observations about a particular object.
➢ For instance, a dataset about patients in a hospital can contain many observations.
➢ A patient can be described by a
o patient identifier (ID)
o name
o address
o weight
o date of birth
o email
o gender
➢ Each of these features that describes a patient is a variable.
➢ Each observation can have a specific value for each of these variables.
➢ For example, a patient can have the following:
o PATIENT_ID = 1001
o Name = Yoshmi Mukhiya
o Address = Mannsverk 61, 5094, Bergen, Norway
o Date of birth = 10th July 2018
o Email = yoshmimukhiya@gmail.com
o Weight = 10
o Gender = Female
➢ These datasets are stored in hospitals and are presented for analysis.
➢ Most of this data is stored in some sort of database management system in tables/schema.
Table for storing patient information
➢ The table contains five observations (001, 002, 003, 004, 005).
➢ Each observation is described by seven variables
(PatientID, name, address, dob, email, gender, and weight).
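A dataset like this can be sketched as a list of observations, each mapping variable names to values. The first record uses the example values above; the second is a hypothetical placeholder:

```python
# Each dict is one observation (row); each key is a variable (column).
patients = [
    {"PatientID": 1001, "Name": "Yoshmi Mukhiya", "Weight": 10, "Gender": "Female"},
    {"PatientID": 1002, "Name": "Patient B", "Weight": 72, "Gender": "Male"},  # hypothetical
]

variables = list(patients[0].keys())
print(variables)            # the variables describing each patient
print(len(patients), "observations")
```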
Types of datasets
➢ Most datasets broadly fall into two groups—numerical data and categorical data.
Numerical data
➢ This data has a sense of measurement involved in it
➢ For example, a person's age, height, weight, blood pressure, heart rate, temperature, number
of teeth, number of bones, and the number of family members.
➢ This data is often referred to as quantitative data in statistics.
➢ The numerical dataset can be either discrete or continuous types.
Discrete data
➢ This is data that is countable and its values can be listed.
➢ For example, the number of heads in 200 coin flips can take any value from 0 to
200, a finite set of cases.
➢ A variable that represents a discrete dataset is referred to as a discrete variable.
➢ The discrete variable takes a fixed number of distinct values.
➢ Example:
o The Country variable can have values such as Nepal, India, Norway, and Japan.
o The Rank variable of a student in a classroom can take values from 1, 2, 3, 4, 5, and so on.
Continuous data
➢ A variable that can have an infinite number of numerical values within a specific range
is classified as continuous data.
➢ A variable describing continuous data is a continuous variable.
➢ Continuous data can follow an interval scale or a ratio scale of measurement.
➢ Example:
o The temperature of a city
o The weight of a person (the Weight variable is a continuous variable)
Categorical data
➢ This type of data represents the characteristics of an object
➢ Examples: gender, marital status, type of address, or categories of the movies.
➢ This data is often referred to as qualitative datasets in statistics.
Measurement scales
➢ There are four different types of measurement scales in statistics: nominal, ordinal, interval,
and ratio.
➢ These scales are used more in academic disciplines than in industry.
➢ Understanding the type of data is required to understand
o what type of computation could be performed
o what type of model should fit the dataset
o what type of visualization can be generated
➢ Need for classifying data as nominal or ordinal: while analyzing datasets, the decision to
generate a pie chart, a bar chart, or a histogram is based on whether the variable is nominal or
ordinal.
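One way to make that decision explicit is a small helper function; the mapping below is a simplification for illustration only:

```python
def suggest_plot(scale):
    """Suggest a basic plot type for a variable's measurement scale."""
    if scale == "nominal":
        return "pie chart or bar chart"   # labels only, no order
    if scale == "ordinal":
        return "bar chart"                # order matters, spacing does not
    if scale in ("interval", "ratio"):
        return "histogram"                # numeric values can be binned
    raise ValueError(f"unknown scale: {scale}")

print(suggest_plot("ordinal"))  # bar chart
```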
Nominal
➢ These are used for labeling variables without any quantitative value. The scales are
generally referred to as labels.
➢ These scales are mutually exclusive and do not carry any numerical importance.
➢ Examples:
1. What is your gender?
o Male
o Female
o Third gender/Non-binary
o I prefer not to answer
o Other
2. The languages that are spoken in a particular country
3. Biological species
4. Parts of speech in grammar (noun, pronoun, adjective, and
so on)
5. Taxonomic ranks in biology (Archea, Bacteria, and Eukarya)
➢ Nominal scales are considered qualitative scales and the measurements that are taken using
qualitative scales are considered qualitative data.
➢ Numbers used as labels have no concrete numerical value or meaning, and no form of
arithmetic calculation can be made on nominal measures.
➢ Example: the following can be measured in the case of a nominal dataset:
o Frequency, the rate at which a label occurs over a period of time
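Label frequencies for a nominal variable can be counted directly; a minimal sketch using Python's collections.Counter with hypothetical language labels:

```python
from collections import Counter

# Hypothetical nominal observations (language labels)
labels = ["Nepali", "English", "Nepali", "Hindi", "English", "Nepali"]

freq = Counter(labels)
print(freq.most_common(1))  # [('Nepali', 3)]
```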
Ordinal
➢ In ordinal scales, the order of the values is significant, but the differences between the
values are not.
➢ Example: a survey question whose answer is scaled down to five different ordinal values:
Strongly Agree, Agree, Neutral, Disagree, and Strongly Disagree. Scales like these are
referred to as Likert scales.
➢ To make it easier, consider ordinal scales as an order of ranking (1st, 2nd, 3rd, 4th, and so
on).
➢ The median is a valid measure of central tendency for ordinal data; however, the mean
(average) is not.
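This can be illustrated by coding Likert responses as ranks and summarizing them with the median (the coding below is a common convention, assumed for illustration):

```python
import statistics

# Hypothetical Likert responses, coded 1 (Strongly Disagree) .. 5 (Strongly Agree)
scale = {"Strongly Disagree": 1, "Disagree": 2, "Neutral": 3,
         "Agree": 4, "Strongly Agree": 5}
responses = ["Agree", "Neutral", "Agree", "Strongly Agree", "Disagree"]

codes = [scale[r] for r in responses]
# The median rank is meaningful; averaging the codes would treat the
# spacing between categories as equal, which ordinal data does not support.
print(statistics.median(codes))  # 4
```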
Interval
➢ In interval scales, both the order and the exact differences between the values are significant.
➢ Interval scales are widely used in statistics; on such data, central tendency can be measured
by the mean, median, or mode, and spread by the standard deviation.
➢ Interval scales have no absolute zero point.
➢ Examples:
o temperature measured in Celsius or Fahrenheit
o location in Cartesian coordinates
o direction measured in degrees from magnetic north
Ratio
➢ Ratio scales contain order, exact values, and absolute zero.
➢ They are used in descriptive and inferential statistics.
➢ These scales provide numerous possibilities for statistical analysis.
➢ Mathematical operations, the measure of central tendencies, and the measure of dispersion
and coefficient of variation can also be computed from such scales.
➢ Examples: the measures of energy, mass, length, duration, electrical energy, plane angle, and
volume.
Summary of the data types and scale measures:
o Nominal and ordinal scales describe categorical (qualitative) data.
o Interval and ratio scales describe numerical (quantitative) data.
Comparing EDA with classical and Bayesian analysis
Classical data analysis approach
➢ In the classical approach, the problem definition and data collection steps are followed by
model development, and then by analysis and communication of results.
Exploratory data analysis approach
➢ This approach follows the same steps as classical data analysis, except that the model
imposition and the data analysis steps are swapped.
➢ The main focus is on the data, its structure, outliers, models, and visualizations.
➢ EDA does not impose any deterministic or probabilistic models on the data.
Bayesian data analysis approach
➢ This approach incorporates prior probability distribution knowledge into the analysis steps.
➢ Prior probability distribution of any quantity expresses the belief about that particular
quantity before considering some evidence.
Three different approaches for data analysis
NumPy
➢ NumPy is a Python library.
➢ NumPy is short for "Numerical Python".
➢ NumPy is used for working with arrays.
➢ It also has functions for working in the domains of linear algebra, Fourier transforms, and
matrices.
Why use NumPy?
➢ In Python, lists serve the purpose of arrays, but they are slow to process.
➢ NumPy provides an array object that is up to 50x faster than traditional Python
lists.
➢ The array object in NumPy is called ndarray, and it provides a lot of support functions.
➢ Arrays are very frequently used in data science
Importing numpy
import numpy as np
Defining a 1D array
my1DArray = np.array([1, 8, 27, 64])
print(my1DArray)
Array of ones
ones = np.ones((3, 4))
print(ones)
Array of zeros
zeros = np.zeros((2, 3, 4), dtype=np.int16)
print(zeros)
Empty array (values are uninitialized, not necessarily zero)
emptyArray = np.empty((3, 2))
print(emptyArray)
Full array
fullArray = np.full((2, 2), 7)
print(fullArray)
Broadcasting
Broadcasting is a mechanism that permits NumPy to operate with arrays of different shapes
when performing arithmetic operations.
Rule 1: Two dimensions are compatible when they are equal.
Create a two-dimensional array
A = np.ones((6, 8))
Shape of A
print(A.shape)
Create another array
B = np.random.random((6, 8))
Shape of B
print(B.shape)
Sum of A and B; here both arrays have the same shape.
print(A + B)
Rule 2: Two dimensions are also compatible when one of the dimensions of the array is 1.
Initialize `x`
x = np.ones((3, 4))
print(x)
Check shape of `x`
print(x.shape)
Initialize `y`
y = np.arange(4)
print(y)
Check shape of `y`
print(y.shape)
Subtract `x` and `y`
print(x - y)
Rule 3: Arrays can be broadcast together if they are compatible in all dimensions.
x = np.ones((6, 8))
y = np.random.random((2, 1, 8))
print(x + y)
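The three rules can be checked programmatically; np.broadcast_shapes reports the result shape without allocating any arrays:

```python
import numpy as np

x = np.ones((6, 8))
y = np.random.random((2, 1, 8))

# Shapes are aligned from the right: (6, 8) vs (2, 1, 8)
#   last dim:    8 vs 8 -> equal, compatible (Rule 1)
#   middle dim:  6 vs 1 -> one is 1, stretched to 6 (Rule 2)
#   leading dim: missing vs 2 -> missing dims are treated as 1
print(np.broadcast_shapes(x.shape, y.shape))  # (2, 6, 8)
print((x + y).shape)                          # (2, 6, 8)
```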
NumPy mathematics
➢ NumPy provides element-wise mathematical functions, such as np.add, np.subtract,
np.multiply, np.divide, and np.remainder, that operate on whole arrays at once.
Pandas
➢ Pandas is a Python library for data manipulation and analysis, built on top of NumPy.
Loading a dataset (the column names of the UCI Adult dataset must be defined first):
import pandas as pd
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'sex',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', names=columns)
df.head(10)
SciPy
➢ SciPy is a scientific computation library that
uses NumPy underneath.
➢ SciPy stands for Scientific Python.
➢ It provides additional utility functions for optimization, statistics, and signal processing.
➢ Like NumPy, SciPy is open source so we can use it freely.
➢ SciPy has optimized and added functions that are frequently used in NumPy and Data
Science.
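As a small illustration of SciPy's statistics utilities (assuming SciPy is installed), scipy.stats.describe summarizes a sample in one call; the sample values are arbitrary:

```python
from scipy import stats

sample = [2, 4, 4, 4, 5, 5, 7, 9]
res = stats.describe(sample)
print(res.nobs)    # 8
print(res.mean)    # 5.0
print(res.minmax)  # (2, 9)
```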
Matplotlib
➢ Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility.
➢ It provides a huge library of customizable plots, along with a comprehensive set of backends.
➢ It can be utilized to create professional reporting applications, interactive analytical
applications, complex dashboard applications, web/GUI applications, embedded views, and
many more.
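A minimal sketch of a Matplotlib plot saved to a file (using the non-interactive Agg backend so no display is required; the data and filename are arbitrary):

```python
import matplotlib
matplotlib.use("Agg")  # render to files without needing a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0, 1, 4, 9], label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.savefig("squares.png")  # write the figure to an image file
```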