Notes - Unit 1 - Exploratory Data Analysis
EDA Fundamentals
Introduction
➢ Data encompasses a collection of discrete objects, numbers, words, events, facts,
measurements, observations, or even descriptions of things.
➢ Such data is collected and stored by every event or process occurring in several disciplines,
including biology, economics, engineering, marketing, and others.
➢ Processing such data elicits useful information and processing such information generates
useful knowledge.
➢ Exploratory Data Analysis enables generating meaningful and useful information from such
data.
➢ Exploratory Data Analysis (EDA) is a process of examining the available dataset to discover
patterns, spot anomalies, test hypotheses, and check assumptions using statistical measures.
➢ Primary aim of EDA is to examine what data can tell us before actually going through formal
modeling or hypothesis formulation.
Understanding data science
➢ Data science involves cross-disciplinary knowledge from
computer science, data, statistics, and mathematics.
➢ There are several phases of data analysis, including
1. Data requirements
2. Data collection
3. Data processing
4. Data cleaning
5. Exploratory data analysis
6. Modeling and algorithms
7. Data product and communication
➢ These phases are similar to those of the CRoss-Industry Standard Process for Data Mining
(CRISP-DM) framework used in data mining.
1. Data requirements
• There can be various sources of data for an organization.
• It is important to comprehend what type of data is required for the organization to
be collected, curated, and stored.
• For example, an application tracking the sleeping pattern of patients suffering from
dementia requires several types of sensors' data storage, such as sleep data, heart rate
from the patient, electro-dermal activities, and user activities patterns.
• All of these data points are required to correctly diagnose the mental state of the
person.
• Hence, these are mandatory requirements for the application.
• It is also required to categorize the data, numerical or categorical, and the format of
storage and dissemination.
2. Data collection
• Data collected from several sources must be stored in the
correct format and transferred to the right personnel within the organization.
3. Data processing
• Preprocessing involves pre-curating the dataset before the actual analysis: placing
the data under the right tables, structuring it, and exporting it in the correct format.
4. Data cleaning
• Data cleaning deals with incomplete, incorrect, or inaccurate records, which must be
checked and then corrected or removed before analysis.
5. Exploratory data analysis
• This is the stage where the cleaned dataset is examined using statistical measures to
discover patterns and check assumptions.
6. Modeling and algorithms
• Models or mathematical formulas can express the relationship among variables, such as
correlation or causation.
• These models or equations involve one or more variables
that depend on other variables to cause an event.
• For example, when buying pens, the total price of pens (Total)
= price for one pen (UnitPrice) * the number of pens bought
(Quantity). Hence, our model would be Total = UnitPrice *
Quantity. Here, the total price depends on the unit price, so the
total price is referred to as the dependent variable and the unit
price as the independent variable.
• In general, a model always describes the relationship
between independent and dependent variables.
• Inferential statistics deals with quantifying relationships
between particular variables.
• The Judd model describes the relationship between data and a model
as Data = Model + Error.
7. Data Product
• Any computer software that uses data as inputs, produces
outputs, and provides feedback based on the output to control the
environment is referred to as a data product.
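The pen-pricing model above (Total = UnitPrice * Quantity) can be sketched as a one-line Python function; the names are illustrative:

```python
def total_price(unit_price, quantity):
    # Total (dependent variable) is computed from UnitPrice and
    # Quantity (independent variables)
    return unit_price * quantity

print(total_price(2.5, 4))  # 10.0
```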
Steps in EDA
1. Problem definition
• It is essential to define the business problem to be solved
before trying to extract useful insight from the data.
• The problem definition works as the driving force for a data
analysis plan execution.
• The main tasks involved in problem definition are defining the main
objective of the analysis, defining the main deliverables, outlining the
roles and responsibilities, obtaining the current status of the data,
defining the timetable, and performing a cost/benefit analysis.
2. Data preparation
• This step involves defining the sources of data, cleaning the dataset,
deleting non-relevant data, and transforming the data into the format
required for analysis.
3. Data analysis
• The main techniques applied to the prepared data include
o summary tables
o graphs
o descriptive statistics
o inferential statistics
o correlation statistics
o searching
o grouping
o mathematical models
4. Development and representation of the results
• This step involves presenting the dataset to the target
audience in the form of graphs, summary tables, maps, and
diagrams.
• This is also an essential step as the result analyzed from
the dataset should be interpretable by the business
stakeholders, which is one of the major goals of EDA.
• Most of the graphical analysis techniques include
o scatter plots
o character plots
o histograms
o box plots
o residual plots
o mean plots
Making Sense of Data
➢ It is crucial to identify the type of data under analysis.
➢ Different disciplines store different kinds of data for different purposes.
➢ Example: medical researchers store patients' data, universities store students' and teachers' data,
and real estate industries store house and building datasets.
➢ A dataset contains many observations about a particular object.
➢ For instance, a dataset about patients in a hospital can contain many observations.
➢ A patient can be described by a
o patient identifier (ID)
o name
o address
o weight
o date of birth
o email
o gender
➢ Each of these features that describes a patient is a variable.
➢ Each observation can have a specific value for each of these variables.
➢ For example, a patient can have the following:
o PATIENT_ID = 1001
o Name = Yoshmi Mukhiya
o Address = Mannsverk 61, 5094, Bergen, Norway
o Date of birth = 10th July 2018
o Email = yoshmimukhiya@gmail.com
o Weight = 10
o Gender = Female
➢ These datasets are stored in hospitals and are presented for analysis.
➢ Most of this data is stored in some sort of database management system in tables/schema.
Table for storing patient information
➢ The table contains five observations (001, 002, 003, 004, 005).
➢ Each observation is described by seven variables
(PatientID, name, address, dob, email, gender, and weight).
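A dataset like this can be sketched as a list of observations, each mapping variable names to values. The first record uses the example values above; the second is a hypothetical placeholder:

```python
# Each dict is one observation (row); each key is a variable (column).
patients = [
    {"PatientID": 1001, "Name": "Yoshmi Mukhiya", "Weight": 10, "Gender": "Female"},
    {"PatientID": 1002, "Name": "Patient B", "Weight": 72, "Gender": "Male"},  # hypothetical
]

variables = list(patients[0].keys())
print(variables)            # the variables describing each patient
print(len(patients), "observations")
```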
Types of datasets
➢ Most datasets broadly fall into two groups—numerical data and categorical data.
Numerical data
➢ This data has a sense of measurement involved in it
➢ For example, a person's age, height, weight, blood pressure, heart rate, temperature, number
of teeth, number of bones, and the number of family members.
➢ This data is often referred to as quantitative data in statistics.
➢ The numerical dataset can be either discrete or continuous types.
Discrete data
➢ This is data that is countable and its values can be listed.
➢ For example, the number of heads in 200 coin flips can take any value from 0 to
200, a finite set of cases.
➢ A variable that represents a discrete dataset is referred to as a discrete variable.
➢ The discrete variable takes a fixed number of distinct values.
➢ Example:
o The Country variable can have values such as Nepal, India, Norway, and Japan.
o The Rank variable of a student in a classroom can take values from 1, 2, 3, 4, 5, and so on.
Continuous data
➢ A variable that can have an infinite number of numerical values within a specific range
is classified as continuous data.
➢ A variable describing continuous data is a continuous variable.
➢ Continuous data can follow an interval scale or a ratio scale of measurement.
➢ Example:
o The temperature of a city
o The weight of a person (the Weight variable is a continuous variable)
Categorical data
➢ This type of data represents the characteristics of an object
➢ Examples: gender, marital status, type of address, or categories of the movies.
➢ This data is often referred to as qualitative datasets in statistics.
Measurement scales
➢ There are four different types of measurement scales in statistics: nominal, ordinal, interval,
and ratio.
➢ These scales are used more in academic disciplines than in industry.
➢ Understanding the type of data is required to understand
o what type of computation could be performed
o what type of model should fit the dataset
o what type of visualization can be generated
➢ Need for classifying data as nominal or ordinal: while analyzing datasets, the decision to
generate a pie chart, a bar chart, or a histogram is based on whether the variable is nominal or
ordinal.
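One way to make that decision explicit is a small helper function; the mapping below is a simplification for illustration only:

```python
def suggest_plot(scale):
    """Suggest a basic plot type for a variable's measurement scale."""
    if scale == "nominal":
        return "pie chart or bar chart"   # labels only, no order
    if scale == "ordinal":
        return "bar chart"                # order matters, spacing does not
    if scale in ("interval", "ratio"):
        return "histogram"                # numeric values can be binned
    raise ValueError(f"unknown scale: {scale}")

print(suggest_plot("ordinal"))  # bar chart
```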
Nominal
➢ These are used for labeling variables without any quantitative value. The scales are
generally referred to as labels.
➢ These scales are mutually exclusive and do not carry any numerical importance.
➢ Examples:
1. What is your gender?
o Male
o Female
o Third gender/Non-binary
o I prefer not to answer
o Other
2. The languages that are spoken in a particular country
3. Biological species
4. Parts of speech in grammar (noun, pronoun, adjective, and
so on)
5. Taxonomic ranks in biology (Archea, Bacteria, and Eukarya)
➢ Nominal scales are considered qualitative scales and the measurements that are taken using
qualitative scales are considered qualitative data.
➢ Numbers used as labels have no concrete numerical value or meaning, and no form of
arithmetic calculation can be made on nominal measures.
➢ Example: the following can be measured in the case of a nominal dataset:
o Frequency, the rate at which a label occurs over a period of time
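Label frequencies for a nominal variable can be counted directly; a minimal sketch using Python's collections.Counter with hypothetical language labels:

```python
from collections import Counter

# Hypothetical nominal observations (language labels)
labels = ["Nepali", "English", "Nepali", "Hindi", "English", "Nepali"]

freq = Counter(labels)
print(freq.most_common(1))  # [('Nepali', 3)]
```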
Ordinal
➢ In ordinal scales, the order of the values is significant, but the differences between the
values are not.
➢ Example: a survey question whose answer is scaled down to five different ordinal values:
Strongly Agree, Agree, Neutral, Disagree, and Strongly Disagree. Scales like these are
referred to as Likert scales.
➢ To make it easier, consider ordinal scales as an order of ranking (1st, 2nd, 3rd, 4th, and so
on).
➢ The median is a valid measure of central tendency for ordinal data; however, the mean
(average) is not.
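This can be illustrated by coding Likert responses as ranks and summarizing them with the median (the coding below is a common convention, assumed for illustration):

```python
import statistics

# Hypothetical Likert responses, coded 1 (Strongly Disagree) .. 5 (Strongly Agree)
scale = {"Strongly Disagree": 1, "Disagree": 2, "Neutral": 3,
         "Agree": 4, "Strongly Agree": 5}
responses = ["Agree", "Neutral", "Agree", "Strongly Agree", "Disagree"]

codes = [scale[r] for r in responses]
# The median rank is meaningful; averaging the codes would treat the
# spacing between categories as equal, which ordinal data does not support.
print(statistics.median(codes))  # 4
```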
Interval
➢ In interval scales, both the order and the exact differences between the values are significant.
➢ Interval scales are widely used in statistics; on such data, central tendency can be measured
by the mean, median, or mode, and spread by the standard deviation.
➢ Interval scales have no absolute zero point.
➢ Examples:
o temperature measured in Celsius or Fahrenheit
o location in Cartesian coordinates
o direction measured in degrees from magnetic north
Ratio
➢ Ratio scales contain order, exact values, and absolute zero.
➢ They are used in descriptive and inferential statistics.
➢ These scales provide numerous possibilities for statistical analysis.
➢ Mathematical operations, the measure of central tendencies, and the measure of dispersion
and coefficient of variation can also be computed from such scales.
➢ Examples: the measures of energy, mass, length, duration, electrical energy, plane angle, and
volume.
Summary of the data types and scale measures:
o Nominal and ordinal scales describe categorical (qualitative) data.
o Interval and ratio scales describe numerical (quantitative) data.
Comparing EDA with classical and Bayesian analysis
Classical data analysis approach
➢ In the classical approach, the problem definition and data collection steps are followed by
model development, and then by analysis and communication of results.
Exploratory data analysis approach
➢ This approach follows the same steps as classical data analysis, except that the model
imposition and the data analysis steps are swapped.
➢ The main focus is on the data, its structure, outliers, models, and visualizations.
➢ EDA does not impose any deterministic or probabilistic models on the data.
Bayesian data analysis approach
➢ This approach incorporates prior probability distribution knowledge into the analysis steps.
➢ Prior probability distribution of any quantity expresses the belief about that particular
quantity before considering some evidence.
Three different approaches for data analysis
NumPy
➢ NumPy is a Python library.
➢ NumPy is short for "Numerical Python".
➢ NumPy is used for working with arrays.
➢ It also has functions for working in the domains of linear algebra, Fourier transforms, and
matrices.
Why use NumPy?
➢ In Python, lists serve the purpose of arrays, but they are slow to process.
➢ NumPy provides an array object that is up to 50x faster than traditional Python
lists.
➢ The array object in NumPy is called ndarray, and it provides a lot of support functions.
➢ Arrays are very frequently used in data science
Importing numpy
import numpy as np
Defining a 1D array
my1DArray = np.array([1, 8, 27, 64])
print(my1DArray)
Array of ones
ones = np.ones((3, 4))
print(ones)
Array of zeros
zeros = np.zeros((2, 3, 4), dtype=np.int16)
print(zeros)
Empty array (values are uninitialized, not necessarily zero)
emptyArray = np.empty((3, 2))
print(emptyArray)
Full array
fullArray = np.full((2, 2), 7)
print(fullArray)
Broadcasting
Broadcasting is a mechanism that permits NumPy to operate with arrays of different shapes
when performing arithmetic operations.
Rule 1: Two dimensions are compatible when they are equal.
Create a two-dimensional array
A = np.ones((6, 8))
Shape of A
print(A.shape)
Create another array
B = np.random.random((6, 8))
Shape of B
print(B.shape)
Sum of A and B; here both arrays have the same shape.
print(A + B)
Rule 2: Two dimensions are also compatible when one of the dimensions of the array is 1.
Initialize `x`
x = np.ones((3, 4))
print(x)
Check shape of `x`
print(x.shape)
Initialize `y`
y = np.arange(4)
print(y)
Check shape of `y`
print(y.shape)
Subtract `x` and `y`
print(x - y)
Rule 3: Arrays can be broadcast together if they are compatible in all dimensions.
x = np.ones((6, 8))
y = np.random.random((2, 1, 8))
print(x + y)
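The three rules can be checked programmatically; np.broadcast_shapes reports the result shape without allocating any arrays:

```python
import numpy as np

x = np.ones((6, 8))
y = np.random.random((2, 1, 8))

# Shapes are aligned from the right: (6, 8) vs (2, 1, 8)
#   last dim:    8 vs 8 -> equal, compatible (Rule 1)
#   middle dim:  6 vs 1 -> one is 1, stretched to 6 (Rule 2)
#   leading dim: missing vs 2 -> missing dims are treated as 1
print(np.broadcast_shapes(x.shape, y.shape))  # (2, 6, 8)
print((x + y).shape)                          # (2, 6, 8)
```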
NumPy mathematics
➢ NumPy provides element-wise mathematical functions, such as np.add, np.subtract,
np.multiply, np.divide, and np.remainder, that operate on whole arrays at once.
Pandas
➢ Pandas is a Python library for data manipulation and analysis, built on top of NumPy.
Loading a dataset (the column names of the UCI Adult dataset must be defined first):
import pandas as pd
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'sex',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', names=columns)
df.head(10)
SciPy
➢ SciPy is a scientific computation library that
uses NumPy underneath.
➢ SciPy stands for Scientific Python.
➢ It provides additional utility functions for optimization, statistics, and signal processing.
➢ Like NumPy, SciPy is open source so we can use it freely.
➢ SciPy has optimized and added functions that are frequently used in NumPy and Data
Science.
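As a small illustration of SciPy's statistics utilities (assuming SciPy is installed), scipy.stats.describe summarizes a sample in one call; the sample values are arbitrary:

```python
from scipy import stats

sample = [2, 4, 4, 4, 5, 5, 7, 9]
res = stats.describe(sample)
print(res.nobs)    # 8
print(res.mean)    # 5.0
print(res.minmax)  # (2, 9)
```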
Matplotlib
➢ Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility.
➢ It provides a huge library of customizable plots, along with a comprehensive set of backends.
➢ It can be utilized to create professional reporting applications, interactive analytical
applications, complex dashboard applications, web/GUI applications, embedded views, and
many more.
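A minimal sketch of a Matplotlib plot saved to a file (using the non-interactive Agg backend so no display is required; the data and filename are arbitrary):

```python
import matplotlib
matplotlib.use("Agg")  # render to files without needing a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0, 1, 4, 9], label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.savefig("squares.png")  # write the figure to an image file
```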