
23AD42C- DATA EXPLORATION AND VISUALIZATION

UNIT I EXPLORATORY DATA ANALYSIS

EDA Fundamentals
Introduction
➢ Data encompasses a collection of discrete objects, numbers, words, events, facts,
measurements, observations, or even descriptions of things.
➢ Such data is collected and stored during every event or process occurring in several disciplines, including biology, economics, engineering, marketing, and others.
➢ Processing such data elicits useful information and processing such information generates
useful knowledge.
➢ Exploratory Data Analysis enables generating meaningful and useful information from such
data.
➢ Exploratory Data Analysis (EDA) is a process of examining the available dataset to discover
patterns, spot anomalies, test hypotheses, and check assumptions using statistical measures.
➢ The primary aim of EDA is to examine what the data can tell us before going through formal modeling or hypothesis formulation.
Understanding data science
➢ Data science involves cross-disciplinary knowledge from
computer science, data, statistics, and mathematics.
➢ There are several phases of data analysis, including:
1. Data requirements
2. Data collection
3. Data processing
4. Data cleaning
5. Exploratory data analysis
6. Modeling and algorithms
7. Data product and communication
➢ These phases are similar to the CRoss-Industry Standard Process for data mining (CRISP-DM) framework.

1. Data requirements
• There can be various sources of data for an organization.
• It is important to comprehend what type of data the organization requires to be collected, curated, and stored.
• For example, an application tracking the sleeping patterns of patients suffering from dementia requires several types of sensor data to be stored, such as sleep data, the patient's heart rate, electrodermal activity, and user activity patterns.
• All of these data points are required to correctly diagnose the mental state of the person.
• Hence, these are mandatory requirements for the application.
• It is also necessary to categorize the data as numerical or categorical, and to define the format for storage and dissemination.
2. Data collection
• Data collected from several sources must be stored in the correct format and transferred to the right information technology personnel within a company.
• Data can be collected from several objects during several events using different types of sensors and storage tools.
3. Data processing
• Preprocessing involves pre-curating (selecting and organizing) the dataset before the actual analysis.
• Common tasks involve correctly exporting the dataset, placing it under the right tables, structuring it, and exporting it in the correct format.

4. Data cleaning
• Preprocessed data is still not ready for detailed analysis.
• It must be correctly transformed for an incompleteness check, duplicates check, error
check, and missing value check.
• This stage involves responsibilities such as matching the correct record, finding
inaccuracies in the dataset, understanding the overall data quality, removing duplicate
items, and filling in the missing values.
• Data cleaning is dependent on the types of data under study.
• Hence, it is essential for data scientists or EDA experts to comprehend different types
of datasets.
• An example of data cleaning is using outlier detection methods for quantitative data cleaning, as sketched below.
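
One way to sketch this (the notes do not prescribe a specific method; the IQR rule and the sample weights below are illustrative assumptions):

import numpy as np

# Hypothetical weight measurements; 150 is a suspicious entry
weights = np.array([61, 63, 62, 60, 64, 150, 59, 62])
q1, q3 = np.percentile(weights, [25, 75])
iqr = q3 - q1
# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(weights[(weights < lower) | (weights > upper)])  # -> [150]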
5. EDA
• Exploratory data analysis is the stage where the message contained in the data is actually understood.
• Several types of data transformation techniques might be required during the process of exploration.
6. Modeling and algorithms
• Generalized models or mathematical formulas represent or exhibit relationships among different variables, such as correlation or causation.
• These models or equations involve one or more variables that depend on other variables to cause an event.
• For example, when buying pens, the total price of pens (Total) = price of one pen (UnitPrice) * number of pens bought (Quantity). Hence, our model is Total = UnitPrice * Quantity. Here, the total price depends on the unit price, so the total price is referred to as the dependent variable and the unit price as an independent variable.
• In general, a model always describes the relationship between independent and dependent variables.
• Inferential statistics deals with quantifying relationships
between particular variables.
• The Judd model describes the relationship between data, a model, and error as: Data = Model + Error.
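
A minimal sketch of the pen-pricing model above as a Python function (the function name is illustrative):

# Total = UnitPrice * Quantity; Total is the dependent variable
def total_price(unit_price, quantity):
    return unit_price * quantity

print(total_price(unit_price=2.5, quantity=4))  # -> 10.0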
7. Data Product
• Any computer software that uses data as inputs, produces outputs, and provides feedback based on the output to control the environment is referred to as a data product.
• A data product is generally based on a model developed during data analysis.
• Example: a recommendation model that takes a user's purchase history as input and recommends a related item that the user is highly likely to buy.
8. Communication
• This stage deals with disseminating the results to end stakeholders so that they can use the results for business intelligence.
• One of the most notable steps in this stage is data visualization.
• Visualization deals with information relay techniques such as tables, charts, summary diagrams, and bar charts to show the analyzed results.

The significance of EDA
➢ Different fields of science, economics, engineering, and marketing accumulate and store
data primarily in electronic databases.
➢ Appropriate and well-established decisions should be made using the data collected.
➢ It is practically impossible to make sense of datasets containing more than a handful of
data points without the help of computer programs.
➢ To draw insights from the collected data and to make further decisions, data mining is performed, which includes distinct analysis processes.
➢ Exploratory data analysis is the key and first exercise in data mining.
➢ It allows us to visualize data to understand it as well as to create hypotheses (ideas) for
further analysis.
➢ The exploratory analysis centers around creating a synopsis of data or insights for the
next steps in a data mining project.
➢ EDA actually reveals the ground truth about the content without making any underlying
assumptions.
➢ Hence, data scientists use this process to actually understand what type of modeling and
hypotheses can be created.
➢ Key components of exploratory data analysis include summarizing data, statistical
analysis, and visualization of data.
➢ Python provides expert tools for exploratory analysis
• pandas for summarizing
• scipy, along with others, for statistical analysis
• matplotlib and plotly for visualizations
Steps in EDA
The four different steps involved in exploratory data analysis are:
1. Problem Definition
2. Data Preparation
3. Data Analysis
4. Development and Representation of the Results

1. Problem Definition
• It is essential to define the business problem to be solved before trying to extract useful insights from the data.
• The problem definition works as the driving force for executing the data analysis plan.
• The main tasks involved in problem definition are
o defining the main objective of the analysis
o defining the main deliverables
o outlining the main roles and responsibilities
o obtaining the current status of the data
o defining the timetable, and
o performing cost/benefit analysis
• Based on the problem definition, an execution plan can be created.
2. Data Preparation
• This step involves methods for preparing the dataset before the actual analysis.
• This step involves
o defining the sources of data
o defining data schemas and tables
o understanding the main characteristics of the data
o cleaning the dataset
o deleting non-relevant datasets
o transforming the data
o dividing the data into required chunks for analysis
3. Data Analysis
• This is one of the most crucial steps; it deals with descriptive statistics and analysis of the data.
• The main tasks involve
o summarizing the data
o finding hidden correlations and relationships among the data
o developing predictive models
o evaluating the models
o calculating the accuracies
➢ Some of the techniques used for data summarization are (a sketch follows this list):
o summary tables
o graphs
o descriptive statistics
o inferential statistics
o correlation statistics
o searching
o grouping
o mathematical models
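
A minimal sketch of several of these techniques in pandas, using a small hypothetical dataset:

import pandas as pd

df = pd.DataFrame({'age': [23, 35, 31, 40, 28],
                   'income': [40000, 62000, 55000, 71000, 48000]})

print(df.describe())   # descriptive statistics as a summary table
print(df.corr())       # correlation statistics
print(df.groupby(df['age'] > 30)['income'].mean())  # grouping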
4. Development and Representation of the Results
• This step involves presenting the dataset to the target audience in the form of graphs, summary tables, maps, and diagrams.
• This is also an essential step, as the results analyzed from the dataset should be interpretable by the business stakeholders, which is one of the major goals of EDA.
• Most of the graphical analysis techniques include
o scatter plots
o character plots
o histograms
o box plots
o residual plots
o mean plots
Making Sense of Data
➢ It is crucial to identify the type of data under analysis.
➢ Different disciplines store different kinds of data for different purposes.
➢ Example: medical researchers store patients' data, universities store students' and teachers' data, and the real estate industry stores house and building datasets.
➢ A dataset contains many observations about a particular object.
➢ For instance, a dataset about patients in a hospital can contain many observations.
➢ A patient can be described by a
o patient identifier (ID)
o name
o address
o weight
o date of birth
o email
o gender
➢ Each of these features that describes a patient is a variable.
➢ Each observation can have a specific value for each of these variables.
➢ For example, a patient can have the following:
PATIENT_ID = 1001
Name = Yoshmi Mukhiya
Address = Mannsverk 61, 5094, Bergen, Norway
Date of birth = 10th July 2018
Email = yoshmimukhiya@gmail.com
Weight = 10
Gender = Female
➢ These datasets are stored in hospitals and are presented for analysis.
➢ Most of this data is stored in some sort of database management system in tables/schema.
Table for storing patient information

➢ The table contains five observations (001, 002, 003, 004, 005).
➢ Each observation describes variables
(PatientID, name, address, dob, email, gender, and weight).
Types of datasets
➢ Most datasets broadly fall into two groups—numerical data and categorical data.

Numerical data
➢ This data has a sense of measurement involved in it
➢ For example, a person's age, height, weight, blood pressure, heart rate, temperature, number
of teeth, number of bones, and the number of family members.
➢ This data is often referred to as quantitative data in statistics.
➢ The numerical dataset can be either discrete or continuous types.

Discrete data
➢ This is data that is countable and its values can be listed.
➢ For example, if we flip a coin, the number of heads in 200 coin flips can take values from 0
to 200 (finite) cases.
➢ A variable that represents a discrete dataset is referred to as a discrete variable.
➢ The discrete variable takes a fixed number of distinct values.
➢ Example:
o The Country variable can have values such as Nepal, India, Norway, and Japan.
o The Rank variable of a student in a classroom can take values from 1, 2, 3, 4, 5, and so on.
Continuous data
➢ A variable that can have an infinite number of numerical values within a specific range
is classified as continuous data.
➢ A variable describing continuous data is a continuous variable.
➢ Continuous data can follow an interval scale of measure or a ratio scale of measure.
➢ Example:
o The temperature of a city
o The weight variable is a continuous variable
Categorical data
➢ This type of data represents the characteristics of an object
➢ Examples: gender, marital status, type of address, or categories of the movies.
➢ This data is often referred to as qualitative datasets in statistics.
➢ Examples of categorical data
o Gender (Male, Female, Other, or Unknown)
o Marital Status (Annulled, Divorced, Interlocutory, Legally Separated, Married, Polygamous, Never Married, Domestic Partner, Unmarried, Widowed, or Unknown)
o Movie genres (Action, Adventure, Comedy, Crime, Drama, Fantasy, Historical, Horror, Mystery, Philosophical, Political, Romance, Saga, Satire, Science Fiction, Social, Thriller, Urban, or Western)
o Blood type (A, B, AB, or O)
o Types of drugs (Stimulants, Depressants, Hallucinogens, Dissociatives, Opioids, Inhalants, or Cannabis)
➢ A variable describing categorical data is referred to as a categorical variable.
➢ These types of variables can have a limited number of values.
Types of categorical variables
Binary categorical variable
➢ This type of variable can take exactly two values.
➢ It is also referred to as a dichotomous variable.
➢ Example: while creating an experiment, the result is either success or failure.
Polytomous variables
➢ This type can take more than two possible values.
➢ Example: marital status can have several values, such as divorced, legally separated, married, never married, unmarried, widowed, etc.
➢ Most categorical datasets follow either nominal or ordinal measurement scales.

Measurement scales
➢ There are four different types of measurement scales in statistics: nominal, ordinal, interval,
and ratio.
➢ These scales are used mostly in academic and research settings.
➢ Understanding the type of data is required to understand
o what type of computation could be performed
o what type of model should fit the dataset
o what type of visualization can be generated
➢ Need for classifying data as nominal or ordinal: while analyzing datasets, the decision to generate a pie chart, bar chart, or histogram is made based on whether the data is nominal or ordinal.

Nominal
➢ These are used for labeling variables without any quantitative value. The scales are
generally referred to as labels.
➢ These scales are mutually exclusive and do not carry any numerical importance.
➢ Examples:
1. What is your gender?

o Male
o Female
o Third gender/Non-binary
o I prefer not to answer
o Other
2. The languages that are spoken in a particular country
3. Biological species
4. Parts of speech in grammar (noun, pronoun, adjective, and so on)
5. Taxonomic domains in biology (Archaea, Bacteria, and Eukarya)

➢ Nominal scales are considered qualitative scales and the measurements that are taken using
qualitative scales are considered qualitative data.
➢ Numbers used as labels have no concrete numerical value or meaning.
➢ No form of arithmetic calculation can be made on nominal measures.
➢ Example: the following can be measured in the case of a nominal dataset (a sketch follows this list):
• Frequency is the rate at which a label occurs within the dataset over a period of time.
• Proportion can be calculated by dividing the frequency by the total number of events.
• Then, the percentage of each proportion is computed.
• To visualize the nominal dataset, either a pie chart or a bar chart can be used.
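
A minimal sketch of these measures in pandas, assuming a hypothetical Series of gender labels:

import pandas as pd

gender = pd.Series(['Male', 'Female', 'Female', 'Male', 'Female',
                    'Third gender/Non-binary', 'Female'])

frequency = gender.value_counts()          # how often each label occurs
proportion = frequency / frequency.sum()   # frequency divided by total events
percentage = proportion * 100              # percentage of each proportion
print(percentage)
# A bar chart of the frequencies would be a suitable visualization here.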
Ordinal
➢ The main difference between the ordinal and nominal scales is the order.
➢ In ordinal scales, the order of the values is a significant factor.
➢ The Likert scale uses a variation of an ordinal scale.
➢ Example of ordinal scale using the Likert scale:
WordPress is making content managers' lives easier. How do you feel about this
statement?
➢ The answer to the question is scaled down to five different ordinal values: Strongly Agree, Agree, Neutral, Disagree, and Strongly Disagree. Scales like these are referred to as Likert scales.
➢ To make it easier, consider ordinal scales as an order of ranking (1st, 2nd, 3rd, 4th, and so
on).
➢ The median is allowed as the measure of central tendency; however, the average is not permitted, as the sketch below illustrates.
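
A minimal sketch, assuming hypothetical Likert responses coded 1-5 (1 = Strongly Disagree, ..., 5 = Strongly Agree):

import numpy as np

responses = np.array([5, 4, 4, 3, 5, 2, 4, 1, 3, 4])
print(np.median(responses))  # the median is a valid measure for ordinal data
# np.mean(responses) would also run, but the average is not a meaningful
# statistic on an ordinal scale, as noted above.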
Interval
➢ Both the order and exact differences between the values are significant.
➢ Interval scales are widely used in statistics.
➢ Examples:
o The measures of central tendency (mean, median, and mode) and standard deviation can be computed on interval data.
o Location in Cartesian coordinates and direction measured in degrees from magnetic north.

Ratio
➢ Ratio scales contain order, exact values, and absolute zero.
➢ They are used in descriptive and inferential statistics.
➢ These scales provide numerous possibilities for statistical analysis.
➢ Mathematical operations, the measure of central tendencies, and the measure of dispersion
and coefficient of variation can also be computed from such scales.
➢ Examples: the measure of energy, mass, length, duration, electrical energy, plane angle, and volume.
Summary of the data types and scale measures:
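
A standard version of this summary, reconstructed here from the properties described above (the original table is not reproduced in these notes):

Scale      Order  Exact difference  Absolute zero  Example
Nominal    No     No                No             Gender, blood type
Ordinal    Yes    No                No             Likert responses, ranks
Interval   Yes    Yes               No             Temperature in Celsius, direction
Ratio      Yes    Yes               Yes            Mass, length, duration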
Comparing EDA with classical and Bayesian analysis

Several approaches to data analysis:
➢ Classical data analysis
➢ Exploratory data analysis approach
➢ Bayesian data analysis approach
Classical data analysis
➢ This approach includes the problem definition and data collection step followed by model
development, which is followed by analysis and result communication.
Exploratory data analysis approach

➢ This approach follows the same sequence as classical data analysis, except that the model imposition and data analysis steps are swapped: analysis precedes model development.
➢ The main focus is on the data, its structure, outliers, models, and visualizations.
➢ EDA does not impose any deterministic or probabilistic models on the data.
Bayesian data analysis approach
➢ This approach incorporates prior probability distribution knowledge into the analysis steps.
➢ Prior probability distribution of any quantity expresses the belief about that particular
quantity before considering some evidence.
Three different approaches for data analysis:
➢ It is difficult to estimate which model is best for data analysis.
➢ All of them have their own paradigms and are suitable for different types of data analysis.
Software tools available for EDA
➢ Python
• an open-source programming language widely used in data analysis, data mining,
and data science
➢ R programming language
• an open-source programming language that is widely utilized in statistical
computation and graphical data analysis
➢ Weka
• an open-source data mining package that involves several EDA tools and
algorithms
➢ KNIME
• an open-source tool for data analysis, based on Eclipse
Python tools and packages

NumPy
➢ NumPy is a Python library.
➢ NumPy is short for "Numerical Python".
➢ NumPy is used for working with arrays.
➢ It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.
Why use NumPy?
➢ In Python, lists serve the purpose of arrays, but they are slow to process.
➢ NumPy provides an array object that is up to 50x faster than traditional Python
lists.
➢ The array object in NumPy is called ndarray, and it provides a lot of support functions.
➢ Arrays are very frequently used in data science; a rough speed comparison is sketched below.
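
A rough micro-benchmark sketch of this claim (the exact speedup varies by machine and operation):

import time
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_array = np.arange(n)

start = time.perf_counter()
doubled = [v * 2 for v in py_list]   # element-wise on a Python list
print('list:   ', time.perf_counter() - start)

start = time.perf_counter()
doubled = np_array * 2               # vectorized ndarray operation
print('ndarray:', time.perf_counter() - start)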

Basic operations of EDA using the NumPy library

Importing numpy

import numpy as np

Creating different types of numpy arrays

# Defining and printing a 1D array
my1DArray = np.array([1, 8, 27, 64])
print(my1DArray)

# Defining and printing a 2D array
my2DArray = np.array([[1, 2, 3, 4], [2, 4, 9, 16], [4, 8, 18, 32]])
print(my2DArray)

# Defining and printing a 3D array
my3Darray = np.array([[[1, 2, 3, 4], [5, 6, 7, 8]],
                      [[1, 2, 3, 4], [9, 10, 11, 12]]])
print(my3Darray)
Displaying basic information, such as the data type, shape, size, and strides of a NumPy array

# Print out the memory address
print(my2DArray.data)

# Print the shape of the array
print(my2DArray.shape)

# Print the data type of the array
print(my2DArray.dtype)

# Print the strides of the array
print(my2DArray.strides)

Creating an array using built-in NumPy functions

# Array of ones
ones = np.ones((3, 4))
print(ones)

# Array of zeros
zeros = np.zeros((2, 3, 4), dtype=np.int16)
print(zeros)

# Empty (uninitialized) array
emptyArray = np.empty((3, 2))
print(emptyArray)

# Full array (every element set to 7)
fullArray = np.full((2, 2), 7)
print(fullArray)

# Array of evenly spaced values from 10 to 25 in steps of 5
evenSpacedArray = np.arange(10, 25, 5)
print(evenSpacedArray)
NumPy arrays and file operations

# Save a NumPy array into a file
x = np.arange(0.0, 50.0, 1.0)
np.savetxt('data.out', x, delimiter=',')

# Load a NumPy array from a text file
z = np.loadtxt('data.out', unpack=True)
print(z)

# Load a NumPy array using the genfromtxt method
my_array2 = np.genfromtxt('data.out', skip_header=1, filling_values=-999)
print(my_array2)

Inspecting NumPy arrays

import numpy as np

my2DArray = np.array([[1, 2, 3, 4], [2, 4, 9, 16], [4, 8, 18, 32]])

# Print the number of `my2DArray`'s dimensions
print(my2DArray.ndim)

# Print the number of `my2DArray`'s elements
print(my2DArray.size)

# Print information about `my2DArray`'s memory layout
print(my2DArray.flags)

# Print the length of one array element in bytes
print(my2DArray.itemsize)

# Print the total bytes consumed by `my2DArray`'s elements
print(my2DArray.nbytes)

Broadcasting
Broadcasting is a mechanism that permits NumPy to operate with arrays of different shapes
when performing arithmetic operations.

Rule 1: Two dimensions are compatible if they are equal

# Create a two-dimensional array
A = np.ones((6, 8))
# Shape of A
print(A.shape)
# Create another array of the same shape
B = np.random.random((6, 8))
# Shape of B
print(B.shape)
# Sum of A and B; here the shapes of both arrays are the same
print(A + B)

Rule 2: Two dimensions are also compatible when one of the dimensions of the array is 1

# Initialize `x`
x = np.ones((3, 4))
print(x)
# Check the shape of `x`
print(x.shape)
# Initialize `y`
y = np.arange(4)
print(y)
# Check the shape of `y`
print(y.shape)
# Subtract `y` from `x`; `y` is broadcast across the rows of `x`
print(x - y)

Rule 3: Arrays can be broadcast together if they are compatible in all dimensions

x = np.ones((6, 8))
y = np.random.random((2, 1, 8))
# The result is broadcast to shape (2, 6, 8)
print(x + y)

NumPy mathematics

Basic operations (+, -, *, /, %)

x = np.array([[1, 2, 3], [2, 3, 4]])
y = np.array([[1, 4, 9], [2, 3, -2]])

# Add two arrays
add = np.add(x, y)
print(add)

# Subtract two arrays
sub = np.subtract(x, y)
print(sub)

# Multiply two arrays element-wise
mul = np.multiply(x, y)
print(mul)

# Divide x by y
div = np.divide(x, y)
print(div)

# Calculate the element-wise remainder of x and y
rem = np.remainder(x, y)
print(rem)

Creating a subset and slicing an array using an index

x = np.array([10, 20, 30, 40, 50])

# Select items at index 0 and 1
print(x[0:2])

# Select items in rows 0 and 1, column 1, of a 2D array
y = np.array([[1, 2, 3, 4], [9, 10, 11, 12]])
print(y[0:2, 1])

# Specifying a condition: select items greater than or equal to 2
biggerThan2 = (y >= 2)
print(y[biggerThan2])
Pandas
➢ Pandas is a Python library used for working with data sets.
➢ It has functions for analyzing, cleaning, exploring, and manipulating data.
Why use Pandas?
➢ Pandas allows us to analyze big data and draw conclusions based on statistical theories.
➢ Pandas can clean messy data sets, and make them readable and relevant.
➢ Relevant data is very important in data science.
What can Pandas do?
➢ Pandas gives answers about the data, such as:
• Is there a correlation between two or more columns?
• What is the average value?
• What is the max value?
• What is the min value?
➢ Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data; a sketch follows.
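
A minimal sketch of these questions in pandas, using a small hypothetical dataset:

import pandas as pd

df = pd.DataFrame({'hours_studied': [2, 4, 6, 8, None],
                   'score': [55, 60, 75, 85, 90]})

print(df['hours_studied'].corr(df['score']))  # correlation between two columns
print(df['score'].mean())                     # average value
print(df['score'].max())                      # max value
print(df['score'].min())                      # min value
print(df.dropna())                            # drop rows with empty (NaN) values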
Creating a dataframe from a dictionary

import pandas as pd

mydataset = {
    'cars': ["BMW", "Volvo", "Ford"],
    'passings': [3, 7, 2]
}
myvar = pd.DataFrame(mydataset)
print(myvar)

Creating a dataframe from a list of dictionaries

import pandas as pd

dict_df = [{'A': 'Apple', 'B': 'Ball'},
           {'A': 'Aeroplane', 'B': 'Bat', 'C': 'Cat'}]
dict_df = pd.DataFrame(dict_df)
print(dict_df)

Creating a dataframe from a Series

import pandas as pd
import numpy as np

series_df = pd.DataFrame({
    'A': range(1, 5),
    'B': pd.Timestamp('20190526'),
    'C': pd.Series(5, index=list(range(4)), dtype='float64'),
    'D': np.array([3] * 4, dtype='int64'),
    'E': pd.Categorical(["Depression", "Social Anxiety",
                         "Bipolar Disorder", "Eating Disorder"]),
    'F': 'Mental health',
    'G': 'is challenging'
})
print(series_df)

Creating a dataframe from ndarrays

import pandas as pd
import numpy as np

sdf = {
    'County': ['Østfold', 'Hordaland', 'Oslo', 'Hedmark', 'Oppland', 'Buskerud'],
    'ISO-Code': [1, 2, 3, 4, 5, 6],
    'Area': [4180.69, 4917.94, 454.07, 27397.76, 25192.10, 14910.94],
    'Administrative centre': ["Sarpsborg", "Oslo", "City of Oslo",
                              "Hamar", "Lillehammer", "Drammen"]
}
sdf = pd.DataFrame(sdf)
print(sdf)
Loading a dataset from an external source into a pandas DataFrame

import pandas as pd
import numpy as np

columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
           'marital_status', 'occupation', 'relationship', 'ethnicity',
           'gender', 'capital_gain', 'capital_loss', 'hours_per_week',
           'country_of_origin', 'income']

df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
                 names=columns)
df.head(10)

Selecting rows and columns in a dataframe

# Select a single row
df.iloc[10]

# Select the first 10 rows
df.iloc[0:10]

# Select a range of rows
df.iloc[10:15]

# Select the last 2 rows
df.iloc[-2:]

# Select every other row in columns 3-5
df.iloc[::2, 3:5].head()

Combining NumPy and pandas to create a dataframe

import pandas as pd
import numpy as np

np.random.seed(24)
dFrame = pd.DataFrame({'F': np.linspace(1, 10, 10)})
dFrame = pd.concat([dFrame, pd.DataFrame(np.random.randn(10, 5),
                                         columns=list('EDCBA'))],
                   axis=1)
dFrame.iloc[0, 2] = np.nan
dFrame

SciPy
➢ SciPy is a scientific computation library that
uses NumPy underneath.
➢ SciPy stands for Scientific Python.
➢ It provides more utility functions for optimization, statistics, and signal processing.
➢ Like NumPy, SciPy is open source so we can use it freely.
➢ SciPy has optimized and added functions that are frequently used in NumPy and data science; a small example follows.
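
A minimal sketch of SciPy's statistics utilities, using hypothetical sample data:

import numpy as np
from scipy import stats

sample = np.array([4.1, 5.2, 6.3, 4.8, 5.5, 5.9, 6.1, 4.7])
print(stats.describe(sample))  # count, min/max, mean, variance, skewness, kurtosis
# A one-sample t-test of the mean against a hypothesized value of 5.0
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print(t_stat, p_value)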
Matplotlib
➢ Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility.
➢ It provides a huge library of customizable plots, along with a comprehensive set of backends.
➢ It can be utilized to create professional reporting applications, interactive analytical applications, complex dashboard applications, web/GUI applications, embedded views, and many more. A minimal example follows.
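
A minimal sketch of a histogram, one of the most common first plots in EDA (the data here is synthetic):

import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(loc=170, scale=10, size=500)  # hypothetical heights
plt.hist(data, bins=20, edgecolor='black')
plt.xlabel('Height (cm)')
plt.ylabel('Frequency')
plt.title('Distribution of heights')
plt.show()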
