Unit – III: Exploratory Data Analytics


Descriptive Statistics – Mean, Standard Deviation, Skewness and Kurtosis – Box Plots – Pivot
Table – Heat Map – Correlation Statistics – ANOVA

Descriptive Statistics
Mean, Standard Deviation, Skewness, and Kurtosis
Descriptive statistics provide simple summaries about the sample and the measures. These
summaries help to understand and describe the main features of a collection of data. Key measures
include mean, standard deviation, skewness, and kurtosis. Understanding these metrics is essential
for data analysis and interpretation.

1. Mean

The mean is the average of a set of numbers and provides a central value for the dataset. It is
calculated by summing all the values and dividing by the number of values.

The formula for the mean is:

μ = (1/N) Σ x_i

Parameters:

 μ: The mean of the dataset.
 N: The number of data points.
 x_i: Each individual data point in the dataset.

Example: Calculating Mean in Python

import numpy as np

data = [10, 20, 30, 40, 50]


mean = np.mean(data)
print("Mean:", mean)

Output:

Mean: 30.0

2. Standard Deviation
Standard deviation measures the dispersion or spread of a set of values. It indicates how much
the values deviate from the mean.

The formula for the standard deviation is:

σ = √( (1/N) Σ (x_i − μ)² )

Parameters:

 N: The number of data points.
 x_i: Each individual data point in the dataset.
 μ: The mean of the dataset.
 σ: The standard deviation of the dataset.

Example: Calculating Standard Deviation in Python

import numpy as np

data = [10, 20, 30, 40, 50]


std_dev = np.std(data)
print("Standard Deviation:", std_dev)

Output:

Standard Deviation: 14.142135623730951

Normal Distribution:

When working with numerical data, we need to understand how the values are spread. In an ideal
world, data would be distributed symmetrically around the center of all scores, so that if we drew a
vertical line through the center of the distribution, both sides would look the same. This so-called
normal distribution is characterized by a bell-shaped curve.
Two ways in which a distribution can deviate from normal are:
• Lack of symmetry (called skew)
• Pointiness (called kurtosis)

3. Skewness

Skewness measures the asymmetry of the data distribution. It indicates whether the data points
are skewed to the left (negative skew) or right (positive skew).

The formula for skewness is:

Skewness = (1/N) Σ ((x_i − μ) / σ)³

Parameters:

 N: The number of data points.
 x_i: Each individual data point in the dataset.
 μ: The mean of the dataset.
 σ: The standard deviation of the dataset.

Example: Calculating Skewness in Python

from scipy.stats import skew

data = [10, 20, 30, 40, 50]


skewness = skew(data)
print("Skewness:", skewness)

Output:
Skewness: 0.0

Positively Skewed Distribution

• Positively skewed, also known as right-skewed.


• The majority of data points are concentrated on the left side.
• The tail of the distribution extends towards the right (higher values).
• The mean is greater than the median.

Characteristics:

1. Asymmetrical distribution
2. Long tail on the right side
3. Mean > Median > Mode

Examples:

Income distribution: Most people earn lower incomes, while a few individuals earn very high
incomes.
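
The relationship Mean > Median can be checked numerically. Below is a small added sketch (not part of the original notes) that draws a right-skewed sample from an exponential distribution, a common simple model for incomes, and confirms that the skewness is positive and the mean exceeds the median.

import numpy as np
from scipy.stats import skew

# Right-skewed sample: exponential "incomes" (arbitrary units)
rng = np.random.default_rng(0)
incomes = rng.exponential(scale=30000, size=10000)

print("Mean:", np.mean(incomes))        # pulled upward by the long right tail
print("Median:", np.median(incomes))    # smaller than the mean
print("Skewness:", skew(incomes))       # positive, close to 2 for an exponential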

Negatively Skewed Distribution

• Negatively skewed distribution, also known as a left-skewed distribution


• Most of the data points are clustered on the right side (higher values)
• The tail on the left side is longer than the tail on the right side.

Characteristics:

 The distribution is asymmetrical, with a longer tail on the left side.
 Mode > Median > Mean

Examples:

Ages of people in a retirement community (most people are older, but there are a few younger
individuals)
4. Kurtosis

Kurtosis measures the "tailedness" of the data distribution. High kurtosis means more of the
variance is due to infrequent extreme deviations, as opposed to frequent modestly sized
deviations.

The formula for kurtosis is:

Kurtosis = (1/N) Σ ((x_i − μ) / σ)⁴

A normal distribution has a kurtosis of 3; scipy.stats.kurtosis reports the excess kurtosis (kurtosis minus 3) by default, which is 0 for a normal distribution.

Parameters:

 N: The number of data points.
 x_i: Each individual data point in the dataset.
 μ: The mean of the dataset.
 σ: The standard deviation of the dataset.

There are three types of kurtosis:

1. Low kurtosis (platykurtic): A flat, broad curve with thin tails. Think of a flat, shallow
bowl.
2. Medium kurtosis (mesokurtic): A standard bell curve with average tailedness. Think of a
regular bell.
3. High kurtosis (leptokurtic): A tall, narrow curve with fat tails. Think of a sharp, pointed
hat.
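
To see the three cases side by side, the following added sketch (an illustration, not part of the original example) compares the excess kurtosis reported by SciPy for a uniform sample (platykurtic), a normal sample (mesokurtic), and a Laplace sample (leptokurtic).

import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
uniform_sample = rng.uniform(-1, 1, 100000)   # flat, broad, thin tails
normal_sample = rng.normal(0, 1, 100000)      # standard bell curve
laplace_sample = rng.laplace(0, 1, 100000)    # sharp peak, fat tails

# scipy.stats.kurtosis returns excess kurtosis (0 for a normal distribution)
print("Uniform (platykurtic):", kurtosis(uniform_sample))   # about -1.2
print("Normal (mesokurtic):", kurtosis(normal_sample))      # about 0
print("Laplace (leptokurtic):", kurtosis(laplace_sample))   # about 3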
Example: Calculating Kurtosis in Python

from scipy.stats import kurtosis

data = [10, 20, 30, 40, 50]


kurt = kurtosis(data)
print("Kurtosis:", kurt)

Output:

Kurtosis: -1.3

Practical Applications

Understanding these descriptive statistics is critical for several practical applications:

 Mean: Provides a quick snapshot of the central tendency of data.


 Standard Deviation: Helps in assessing the risk in finance and the consistency in
manufacturing.
 Skewness: Important in fields like finance where the asymmetry of returns can indicate
potential risk.
 Kurtosis: Useful in detecting outliers and understanding the distribution of data in
various scientific fields.

Box Plot
A box plot, also known as a box-and-whisker plot, is a graphical representation of the distribution
of a dataset that highlights the median, quartiles, and potential outliers. It provides a visual
summary of the data's central tendency, spread, and symmetry.
Components of a Box Plot
1. Minimum: The smallest data point excluding outliers.
2. First Quartile (Q1): The median of the lower half of the dataset (25th percentile).
3. Median (Q2): The middle value of the dataset (50th percentile).
4. Third Quartile (Q3): The median of the upper half of the dataset (75th percentile).
5. Maximum: The largest data point excluding outliers.
6. Interquartile Range (IQR): The range between the first and third quartiles (Q3 - Q1).
7. Whiskers: Lines extending from the quartiles to the minimum and maximum values,
representing the range of the data.
8. Outliers: Data points that fall outside 1.5 times the IQR above Q3 or below Q1.
Creating a Box Plot
To create a box plot, follow these steps:
1. Order the Data: Arrange the data points in ascending order.
2. Calculate Quartiles: Determine Q1, Q2 (median), and Q3.
3. Compute IQR: Subtract Q1 from Q3 to get the IQR.
4. Determine Whiskers: Identify the smallest and largest values within 1.5 times the IQR
from Q1 and Q3, respectively.
5. Identify Outliers: Any data points outside the whiskers are considered outliers.
6. Draw the Box and Whiskers: Draw a box from Q1 to Q3, a line inside the box at the
median, and whiskers from the box to the minimum and maximum values within the
whisker range. Plot outliers as individual points.

Example
Given the dataset: 2, 5, 7, 8, 10, 12, 15, 18, 20, 22
1. Order the Data: Already ordered.
2. Quartiles:
o Q1 (median of the lower half 2, 5, 7, 8, 10): 7
o Median (Q2): (10 + 12) / 2 = 11
o Q3 (median of the upper half 12, 15, 18, 20, 22): 18
3. IQR: 18 - 7 = 11
4. Whiskers:
o Minimum: 2 (no value below Q1 - 1.5*IQR = -9.5)
o Maximum: 22 (no value above Q3 + 1.5*IQR = 34.5)
5. Outliers: No outliers in this dataset.
6. Draw:
o Box from 7 to 18
o Median at 11
o Whiskers from 2 to 22
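
The same quantities can be computed programmatically. The short added sketch below uses NumPy to obtain the quartiles, IQR, and whisker limits; note that NumPy's default linear-interpolation quartiles (7.25 and 17.25 for this dataset) differ slightly from the median-of-halves values used in the hand calculation above, because several quartile conventions exist.

import numpy as np

data = np.array([2, 5, 7, 8, 10, 12, 15, 18, 20, 22])

# Quartiles using NumPy's default (linear interpolation) method
q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Whisker limits and outliers under the 1.5 * IQR rule
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = data[(data < lower_limit) | (data > upper_limit)]

print("Q1:", q1, "Median:", q2, "Q3:", q3, "IQR:", iqr)
print("Whisker limits:", lower_limit, "to", upper_limit)
print("Outliers:", outliers)   # empty array: no outliers in this dataset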
Interpretation
 Central Tendency: The median line within the box represents the central value of the
dataset.
 Spread: The width of the box indicates the spread of the middle 50% of the data.
 Symmetry: If the median line is centered within the box, the data is symmetrically
distributed. Otherwise, it indicates skewness.
 Outliers: Points outside the whiskers show potential outliers, which may require further
investigation.
Python Implementation
Using Python's matplotlib library to create a box plot:
import matplotlib.pyplot as plt
# Example data
data = [2, 5, 7, 8, 10, 12, 15, 18, 20, 22]
# Create box plot
plt.boxplot(data)
# Add title and labels
plt.title('Box Plot Example')
plt.xlabel('Dataset')
plt.ylabel('Values')

# Show plot
plt.show()
PIVOT TABLE
A pivot table is a powerful data analysis tool used to summarize, analyze, explore, and present
summary data. Pivot tables are particularly useful for handling large datasets and can dynamically
reorganize data to present meaningful insights.
Components of a Pivot Table
1. Rows: Categories or labels that are displayed in the rows of the pivot table.
2. Columns: Categories or labels that are displayed in the columns of the pivot table.
3. Values: Data points that are summarized and displayed within the table cells. Typically,
these are numeric data that can be aggregated.
4. Filters: Criteria used to include or exclude specific data from the pivot table.

Creating a Pivot Table


1. Select the Data: Choose the data range that you want to analyze.
2. Insert Pivot Table: In spreadsheet software like Excel, use the "Insert Pivot Table"
function.
3. Configure Rows and Columns: Drag fields to the Rows and Columns areas to organize
the data.
4. Summarize Values: Drag fields to the Values area to perform aggregations like sum,
average, count, etc.
5. Apply Filters: Use filters to focus on specific data subsets.
Example
Consider the following dataset:

| Date       | Product | Sales |
|------------|---------|-------|
| 2024-01-01 | A       | 100   |
| 2024-01-01 | B       | 150   |
| 2024-01-02 | A       | 200   |
| 2024-01-02 | B       | 250   |
To create a pivot table that shows the total sales for each product by date:
1. Rows: Date
2. Columns: Product
3. Values: Sum of Sales
The resulting pivot table:

| Date       | A   | B   |
|------------|-----|-----|
| 2024-01-01 | 100 | 150 |
| 2024-01-02 | 200 | 250 |
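
The same pivot table can be produced directly in pandas; the sketch below is an added illustration using the four rows above.

import pandas as pd

df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'Product': ['A', 'B', 'A', 'B'],
    'Sales': [100, 150, 200, 250]
})

# Rows = Date, Columns = Product, Values = sum of Sales
pivot = pd.pivot_table(df, values='Sales', index='Date', columns='Product', aggfunc='sum')
print(pivot)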

Pivot Table Parameters


1. data
 Description: The DataFrame to be used for creating the pivot table.
 Example: df_large
2. values
 Description: Column(s) to aggregate.
 Example: values='Sales'
3. index
 Description: Column(s) to set as the index (rows) of the pivot table.
 Example: index='Date'
4. columns
 Description: Column(s) to set as the columns of the pivot table.
 Example: columns='Product'
5. aggfunc
 Description: Function to use for aggregating the data. Common options include np.sum,
np.mean, 'count', etc.
 Example: aggfunc=np.sum
6. fill_value
 Description: Value to replace missing values with. Useful for ensuring that the pivot
table has no NaN values.
 Example: fill_value=0
7. margins
 Description: Add all rows/columns (subtotals). Useful for showing totals along rows and
columns.
 Example: margins=True
8. margins_name
 Description: Name of the row/column that will contain the totals when margins=True.
 Example: margins_name='Total'
9. dropna
 Description: Do not include columns whose entries are all NaN.
 Example: dropna=True
10. sort
 Description: If True, sort the DataFrame.
 Example: sort=True
Example Code
import pandas as pd
import numpy as np

# Generate a larger dataset


np.random.seed(42)
dates = pd.date_range(start='2024-01-01', end='2024-01-10', freq='D')
products = ['A', 'B', 'C', 'D']
data = {
'Date': np.random.choice(dates, 100),
'Product': np.random.choice(products, 100),
'Sales': np.random.randint(50, 300, 100)
}

# Create DataFrame
df_large = pd.DataFrame(data)

# Create Pivot Table
pivot_table_large = pd.pivot_table(df_large, values='Sales', index='Date',
columns='Product', aggfunc=np.sum, fill_value=0)

# Display the pivot table
print(pivot_table_large)
Explanation of the Example
 Data Generation:
o np.random.seed(42): Ensures reproducibility of random numbers.
o dates: Creates a date range from '2024-01-01' to '2024-01-10'.
o products: Defines a list of product categories.
o data: Generates random sales data for the products over the given dates.
 Create DataFrame:
o df_large = pd.DataFrame(data): Converts the generated data into a DataFrame.
 Create Pivot Table:
o pd.pivot_table(): Function used to create the pivot table.
o values='Sales': Specifies that the 'Sales' column is to be aggregated.
o index='Date': Sets the 'Date' column as the rows.
o columns='Product': Sets the 'Product' column as the columns.
o aggfunc=np.sum: Aggregates the sales data using the sum function.
o fill_value=0: Replaces any missing values with 0.
 Display Pivot Table:
o print(pivot_table_large): Prints the resulting pivot table.
Additional Parameters (not used in the example)
 dropna: If True, do not include columns whose entries are all NaN.
 margins: If True, add all rows/columns (subtotals).
 margins_name: Name of the row/column that will contain the totals when
margins=True.
 sort: If True, sort the DataFrame.
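
As an added illustration of these remaining parameters, the following sketch rebuilds the pivot table with subtotals; it assumes df_large, pd, and np from the example code above.

# Pivot table with a 'Total' row and column
pivot_with_totals = pd.pivot_table(
    df_large, values='Sales', index='Date', columns='Product',
    aggfunc=np.sum, fill_value=0,
    margins=True, margins_name='Total',   # add subtotals along rows and columns
    dropna=True, sort=True
)
print(pivot_with_totals)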

The output of the pivot table created from the larger dataset:

Date          A     B     C     D
2024-01-01    522    88   662     0
2024-01-02    209   523   353   689
2024-01-03    178   244  1096   153
2024-01-04    396   100   471   601
2024-01-05    563   586   593   273
2024-01-06      0   338   484     0
2024-01-07    353   266   658   838
2024-01-08   1561   779     0   424
2024-01-09    179   515  1063   860
2024-01-10    826   377   686   436

This pivot table summarizes the sales data for products A, B, C, and D over the given dates.

HEAT MAP
A heat map is a graphical representation of data where individual values are represented by
colors. Heat maps are used to visualize data density, patterns, and variations across a dataset.
Creating a Heat Map
1. Prepare Data: Ensure the data is in a matrix format, with rows and columns representing
different dimensions.
2. Choose a Color Scheme: Select a gradient color scheme to represent the range of values.
3. Generate Heat Map: Use visualization tools or libraries (e.g., Python's Seaborn or
Matplotlib) to create the heat map.
Example
Consider the pivot table data from the previous example. We can create a heat map to visualize
the sales data:
1. Data Matrix:
| Date       | A   | B   |
|------------|-----|-----|
| 2024-01-01 | 100 | 150 |
| 2024-01-02 | 200 | 250 |
2. Color Scheme: Use a gradient from light blue (low values) to dark blue (high values).
3. Generate Heat Map: Use Python's Seaborn library.

Example Code
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Example data
data = {
'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
'Product': ['A', 'B', 'A', 'B'],
'Sales': [100, 150, 200, 250]
}

# Create DataFrame
df = pd.DataFrame(data)

# Pivot Table (for the small example above)
pivot_table = df.pivot(index='Date', columns='Product', values='Sales')

# Generate Heat Map using pivot_table_large, the pivot table created in the earlier example
plt.figure(figsize=(10, 8))
sns.heatmap(pivot_table_large, annot=True, cmap='Blues', fmt='d')
# Add title and labels
plt.title('Sales Heat Map for Products A, B, C, and D')
plt.xlabel('Product')
plt.ylabel('Date')

# Show plot
plt.show()

Here is the heat map generated from the larger dataset's pivot table. It visually represents the sales data
for products A, B, C, and D over the given dates, using colors to indicate the magnitude of sales.

Explanation of the Example


 plt.figure(figsize=(10, 8)): Sets the size of the figure to 10 inches wide and 8 inches tall.
 sns.heatmap(pivot_table_large, annot=True, cmap='Blues', fmt='d'): Creates the heat map
using the pivot_table_large data, annotates the cells with the data values, uses the 'Blues'
colormap, and formats the annotations as integers.
 plt.title('Sales Heat Map for Products A, B, C, and D'): Sets the title of the heat map.
 plt.xlabel('Product'): Labels the x-axis as 'Product'.
 plt.ylabel('Date'): Labels the y-axis as 'Date'.
 plt.show(): Displays the heat map
Heat Map Parameters
1. data
 Description: The data to be plotted. In this case, it's the pivot table generated from the
sales data.
 Example: pivot_table_large
2. annot
 Description: When set to True, the heat map will display the data values in each cell.
 Example: annot=True
3. cmap
 Description: Specifies the color map to use for the heat map. This determines the range
of colors used to represent the data values.
 Example: cmap='Blues'
 Other Options: 'viridis', 'plasma', 'inferno', 'magma', 'cividis', 'coolwarm', etc.
4. fmt
 Description: String formatting code to use when adding annotations. It specifies the
format of the text annotations.
 Example: fmt='d' (displays data as integers)
 Other Options: 'f' (for floating-point numbers), etc.
5. figsize
 Description: Specifies the size of the figure in inches. This is set using
plt.figure(figsize=(width, height)).
 Example: plt.figure(figsize=(10, 8))
6. title
 Description: The title of the heat map. Set using plt.title().
 Example: plt.title('Sales Heat Map for Products A, B, C, and D')
7. xlabel and ylabel
 Description: Labels for the x-axis and y-axis, respectively. Set using plt.xlabel() and
plt.ylabel().
 Example:
o plt.xlabel('Product')
o plt.ylabel('Date')
Additional Parameters (not used in the example)
 linewidths: Width of the lines that will divide each cell.
 linecolor: Color of the lines that will divide each cell.
 square: If True, set the Axes aspect to “equal” so each cell will be square-shaped.
 vmin and vmax: Values to anchor the colormap, otherwise they are inferred from the
data and other keyword arguments.
 center: Value at which to center the colormap when plotting divergently. Useful with
vmin and vmax.
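
The following added sketch shows these options in use; it assumes pivot_table_large and the imports from the earlier heat map example.

# Heat map using the additional styling parameters
plt.figure(figsize=(10, 8))
sns.heatmap(
    pivot_table_large,
    cmap='coolwarm',                      # diverging colormap
    linewidths=0.5, linecolor='white',    # thin white lines between cells
    square=True,                          # square-shaped cells
    vmin=0, vmax=1600,                    # anchor the colour scale explicitly
    center=800                            # value at which to centre the colormap
)
plt.title('Sales Heat Map with Custom Styling')
plt.show()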

CORRELATION STATISTICS
Correlation statistics measure the strength and direction of the relationship between two variables.
It quantifies how one variable changes with respect to another.
Types of Correlation
1. Positive Correlation: Both variables increase or decrease together.
2. Negative Correlation: One variable increases while the other decreases.
3. No Correlation: No discernible relationship between the variables.
Correlation Coefficient
The correlation coefficient (denoted as r) is a numerical measure of the correlation between two
variables. It ranges from -1 to 1.

Calculating Correlation
 Pearson Correlation Coefficient: Measures linear correlation between two variables.

 Spearman's Rank Correlation: Measures the strength and direction of association
between two ranked variables.
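
For reference, the Pearson correlation coefficient for paired observations (x_i, y_i) is computed as (standard formula, added here for completeness):

r = Σ (x_i − x̄)(y_i − ȳ) / √( Σ (x_i − x̄)² · Σ (y_i − ȳ)² )

where x̄ and ȳ are the means of X and Y, and the sums run over all paired observations.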
Example in Python
1. Using Python's pandas library to calculate Pearson correlation:
import pandas as pd

# Example data
data = {'X': [1, 2, 3, 4, 5], 'Y': [2, 4, 5, 4, 5]}
df = pd.DataFrame(data)

# Calculate Pearson correlation


correlation = df['X'].corr(df['Y'])
print('Pearson Correlation Coefficient:', correlation)

2. The code for calculating Spearman's Rank Correlation using Python:


import pandas as pd
from scipy.stats import spearmanr

# Example data
data = {'X': [1, 2, 3, 4, 5], 'Y': [2, 3, 8, 9, 10]}
df = pd.DataFrame(data)

# Calculate Spearman's Rank Correlation


correlation, p_value = spearmanr(df['X'], df['Y'])
print('Spearman Rank Correlation Coefficient:', correlation)
print('P-Value:', p_value)
Explanation
 Data Preparation:
o data = {'X': [1, 2, 3, 4, 5], 'Y': [2, 3, 8, 9, 10]}: Defines two variables X and Y
with sample data points.
o df = pd.DataFrame(data): Creates a DataFrame from the data.
 Spearman's Rank Correlation Calculation:
o spearmanr(df['X'], df['Y']): Computes Spearman's rank correlation coefficient and
the associated p-value for the two variables X and Y.
o correlation: The Spearman rank correlation coefficient.
o p_value: The p-value for testing non-correlation.
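
In exploratory analysis, correlation statistics and heat maps are often combined: the pairwise correlation matrix of a DataFrame can be visualised with seaborn. The sketch below is an added illustration using made-up columns.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative data with three numeric columns
df = pd.DataFrame({
    'X': [1, 2, 3, 4, 5],
    'Y': [2, 4, 5, 4, 5],
    'Z': [10, 9, 7, 6, 4]
})

# Pairwise Pearson correlations as a matrix
corr_matrix = df.corr()

# Visualise the correlation matrix as a heat map
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, center=0)
plt.title('Correlation Matrix Heat Map')
plt.show()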
