Unit 3
Descriptive Statistics
Mean, Standard Deviation, Skewness, and Kurtosis
Descriptive statistics provide simple summaries about the sample and the measures. These
summaries help to understand and describe the main features of a collection of data. Key measures
include mean, standard deviation, skewness, and kurtosis. Understanding these metrics is essential
for data analysis and interpretation.
1. Mean
The mean is the average of a set of numbers and provides a central value for the dataset. It is
calculated by summing all the values and dividing by the number of values.
Parameters:
import numpy as np
Output:
Mean: 30.0
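A minimal sketch of the calculation. The dataset below is an assumption (it is not shown in these notes); it was chosen because it reproduces the mean of 30.0 reported above.

```python
import numpy as np

# Assumed sample data (not given in the notes); its mean is 30.0
data = [10, 20, 30, 40, 50]

# Mean = sum of the values divided by the number of values
mean_manual = sum(data) / len(data)

# NumPy computes the same quantity
mean_numpy = np.mean(data)

print("Mean:", mean_numpy)
```

Both approaches give the same result; np.mean is simply more convenient for large arrays.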
2. Standard Deviation
Standard deviation measures the dispersion or spread of a set of values. It indicates how much
the values deviate from the mean.
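As a rough illustration, the sketch below computes both the population and the sample standard deviation. The dataset is an assumption chosen for illustration and does not come from these notes.

```python
import numpy as np

# Assumed sample data (not given in the notes)
data = [10, 20, 30, 40, 50]

# Population standard deviation: squared deviations are averaged over n
# (this is NumPy's default, ddof=0)
std_population = np.std(data)

# Sample standard deviation: squared deviations are divided by n - 1 (ddof=1)
std_sample = np.std(data, ddof=1)

print("Population standard deviation:", std_population)
print("Sample standard deviation:", std_sample)
```

The sample version is slightly larger because it divides by n - 1; it is the usual choice when the data is a sample drawn from a larger population.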
Parameters:
import numpy as np
Output:
Normal Distribution:
When working with numerical data, we need to understand how the values are spread.
In an ideal world, data would be distributed symmetrically around the center of all scores, so that if we
drew a vertical line through the center of the distribution, both sides would look the same. This so-called
normal distribution is characterized by a bell-shaped curve.
Two ways in which a distribution can deviate from normal are:
• Lack of symmetry (called skew)
• Pointiness (called kurtosis).
3. Skewness
Skewness measures the asymmetry of the data distribution. It indicates whether the data points
are skewed to the left (negative skew) or right (positive skew).
Parameters:
Output:
Skewness: 0.0
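A minimal sketch of how such a value can be obtained, using the moment-based definition of skewness (third central moment divided by the second central moment raised to the power 1.5). The dataset is an assumption; it is symmetric, so it reproduces the skewness of 0.0 shown above. scipy.stats.skew computes the same quantity with its default settings, but only NumPy is used here.

```python
import numpy as np

# Assumed sample data (not given in the notes); symmetric around 30
data = np.array([10, 20, 30, 40, 50], dtype=float)

# Deviations of each value from the mean
deviations = data - data.mean()

m2 = np.mean(deviations ** 2)  # second central moment (population variance)
m3 = np.mean(deviations ** 3)  # third central moment

# Moment-based (population) skewness
skewness = m3 / m2 ** 1.5

print("Skewness:", skewness)
```

A positive result would indicate a longer right tail, and a negative result a longer left tail.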
Positive Skew (Right-Skewed)
Characteristics:
1. Asymmetrical distribution
2. Long tail on the right side
3. Mean > Median > Mode
Examples:
Income distribution: Most people earn lower incomes, while a few individuals earn very high
incomes.
Negative Skew (Left-Skewed)
Characteristics:
1. Asymmetrical distribution
2. Long tail on the left side
3. Mean < Median < Mode
Examples:
Ages of people in a retirement community (most people are older, but there are a few younger
individuals).
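The ordering of mean, median, and mode in skewed data can be checked on small samples. The two datasets below are hypothetical and were made up only to mirror the income and retirement-community examples; they are not real data.

```python
import statistics

# Hypothetical right-skewed sample: incomes (in thousands) with one very high earner
incomes = [20, 25, 25, 30, 35, 40, 200]

# Hypothetical left-skewed sample: ages in a retirement community with one younger resident
ages = [30, 65, 70, 72, 75, 75, 78]

for name, values in [("incomes", incomes), ("ages", ages)]:
    mean = statistics.mean(values)
    median = statistics.median(values)
    mode = statistics.mode(values)
    print(f"{name}: mean={mean:.2f}, median={median}, mode={mode}")
```

In the income sample the single large value pulls the mean above the median and mode; in the age sample the single small value pulls the mean below them. This ordering is a common rule of thumb rather than a guarantee for every dataset.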
4. Kurtosis
Kurtosis measures the "tailedness" of the data distribution. High kurtosis means more of the
variance is due to infrequent extreme deviations, as opposed to frequent modestly sized
deviations.
Types of Kurtosis:
1. Low kurtosis (platykurtic): A flat, broad curve with thin tails. Think of a flat, shallow
bowl.
2. Medium kurtosis (mesokurtic): A standard bell curve with average tailedness. Think of a
regular bell.
3. High kurtosis (leptokurtic): A tall, narrow curve with fat tails. Think of a sharp, pointed
hat.
Example: Calculating Kurtosis in Python
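A minimal sketch using the moment-based definition of excess kurtosis (fourth central moment divided by the squared second central moment, minus 3, so that a normal distribution scores 0). The dataset is an assumption; it was chosen because it reproduces the value of -1.3 reported in the output below. scipy.stats.kurtosis returns the same quantity with its default settings, but only NumPy is used here.

```python
import numpy as np

# Assumed sample data (not given in the notes)
data = np.array([10, 20, 30, 40, 50], dtype=float)

# Deviations of each value from the mean
deviations = data - data.mean()

m2 = np.mean(deviations ** 2)  # second central moment (population variance)
m4 = np.mean(deviations ** 4)  # fourth central moment

# Excess kurtosis: subtracting 3 makes the normal distribution the zero point
excess_kurtosis = m4 / m2 ** 2 - 3

print("Kurtosis:", round(excess_kurtosis, 2))
```

A negative value like this one indicates a platykurtic shape, with thinner tails than a normal distribution.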
Output:
Kurtosis: -1.3
Practical Applications
Box Plot
A box plot, also known as a box-and-whisker plot, is a graphical representation of the distribution
of a dataset that highlights the median, quartiles, and potential outliers. It provides a visual
summary of the data's central tendency, spread, and symmetry.
Components of a Box Plot
1. Minimum: The smallest data point excluding outliers.
2. First Quartile (Q1): The median of the lower half of the dataset (25th percentile).
3. Median (Q2): The middle value of the dataset (50th percentile).
4. Third Quartile (Q3): The median of the upper half of the dataset (75th percentile).
5. Maximum: The largest data point excluding outliers.
6. Interquartile Range (IQR): The range between the first and third quartiles (Q3 - Q1).
7. Whiskers: Lines extending from the quartiles to the minimum and maximum values,
representing the range of the data.
8. Outliers: Data points that lie more than 1.5 times the IQR above Q3 or below Q1.
Creating a Box Plot
To create a box plot, follow these steps:
1. Order the Data: Arrange the data points in ascending order.
2. Calculate Quartiles: Determine Q1, Q2 (median), and Q3.
3. Compute IQR: Subtract Q1 from Q3 to get the IQR.
4. Determine Whiskers: Identify the smallest and largest values within 1.5 times the IQR
from Q1 and Q3, respectively.
5. Identify Outliers: Any data points outside the whiskers are considered outliers.
6. Draw the Box and Whiskers: Draw a box from Q1 to Q3, a line inside the box at the
median, and whiskers from the box to the minimum and maximum values within the
whisker range. Plot outliers as individual points.
Example
Given the dataset: 2, 5, 7, 8, 10, 12, 15, 18, 20, 22
1. Order the Data: Already ordered.
2. Quartiles:
o Q1: median of the lower half (2, 5, 7, 8, 10) = 7
o Median (Q2): (10 + 12) / 2 = 11
o Q3: median of the upper half (12, 15, 18, 20, 22) = 18
3. IQR: 18 - 7 = 11
4. Whiskers:
o Minimum: 2 (no value below Q1 - 1.5*IQR = -9.5)
o Maximum: 22 (no value above Q3 + 1.5*IQR = 34.5)
5. Outliers: No outliers in this dataset.
6. Draw:
o Box from 7 to 18
o Median at 11
o Whiskers from 2 to 22
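The steps can be sketched in plain Python, taking Q1 and Q3 as the medians of the lower and upper halves of the data, as in the component definitions earlier. The handling of an odd number of values (leaving the middle value out of both halves) is an assumption, since the notes do not specify it.

```python
import statistics

data = [2, 5, 7, 8, 10, 12, 15, 18, 20, 22]

# Step 1: order the data
values = sorted(data)
n = len(values)

# Step 2: quartiles as medians of the lower and upper halves
# (assumption: for an odd n, the middle value is left out of both halves)
lower_half = values[: n // 2]
upper_half = values[(n + 1) // 2:]
q1 = statistics.median(lower_half)
q2 = statistics.median(values)
q3 = statistics.median(upper_half)

# Step 3: interquartile range
iqr = q3 - q1

# Step 4: fences, and whisker ends as the most extreme values inside the fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
whisker_low = min(v for v in values if v >= lower_fence)
whisker_high = max(v for v in values if v <= upper_fence)

# Step 5: anything beyond the fences is an outlier
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
print(f"Whiskers from {whisker_low} to {whisker_high}, outliers: {outliers}")
```

Plotting libraries may compute quartiles by interpolation instead, so a drawn box can differ slightly from these hand-calculated values. For this dataset, NumPy's default linear interpolation gives 7.25 and 17.25 rather than 7 and 18.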
Interpretation
• Central Tendency: The median line within the box represents the central value of the
dataset.
• Spread: The length of the box (the IQR) indicates the spread of the middle 50% of the data.
• Symmetry: If the median line is centered within the box, the data is roughly symmetrically
distributed. Otherwise, it indicates skewness.
• Outliers: Points outside the whiskers show potential outliers, which may require further
investigation.
Python Implementation
Using Python's matplotlib library to create a box plot:
import matplotlib.pyplot as plt
# Example data
data = [2, 5, 7, 8, 10, 12, 15, 18, 20, 22]
# Create box plot
plt.boxplot(data)
# Add title and labels
plt.title('Box Plot Example')
plt.xlabel('Dataset')
plt.ylabel('Values')
# Show plot
plt.show()
PIVOT TABLE
A pivot table is a powerful data analysis tool used to summarize, analyze, explore, and present
summary data. Pivot tables are particularly useful for handling large datasets and can dynamically
reorganize data to present meaningful insights.
Components of a Pivot Table
1. Rows: Categories or labels that are displayed in the rows of the pivot table.
2. Columns: Categories or labels that are displayed in the columns of the pivot table.
3. Values: Data points that are summarized and displayed within the table cells. Typically,
these are numeric data that can be aggregated.
4. Filters: Criteria used to include or exclude specific data from the pivot table.
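The four components map directly onto the arguments of a pandas pivot table. The sketch below is one possible illustration; it reuses the small Date/Product/Sales dataset that appears in the heat map example later in these notes, and the choice of sum as the aggregation function is an assumption.

```python
import pandas as pd

# Long-format sales records: one row per (date, product) pair
data = {
    'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'Product': ['A', 'B', 'A', 'B'],
    'Sales': [100, 150, 200, 250],
}
df = pd.DataFrame(data)

# Rows = Date, Columns = Product, Values = Sales (aggregated by sum)
pivot = pd.pivot_table(df, index='Date', columns='Product',
                       values='Sales', aggfunc='sum')

# Filter: restrict the source rows to one date before pivoting
filtered = pd.pivot_table(df[df['Date'] == '2024-01-02'], index='Date',
                          columns='Product', values='Sales', aggfunc='sum')

print(pivot)
print(filtered)
```

Each cell of the result holds the aggregated sales for one date and one product.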
| Date       | A   | B   |
|------------|-----|-----|
| 2024-01-01 | 100 | 150 |
| 2024-01-02 | 200 | 250 |
# Create DataFrame
df_large = pd.DataFrame(data)
The output of the pivot table created from the larger dataset:
Date A B C D
This pivot table summarizes the sales data for products A, B, C, and D over the given dates.
HEAT MAP
A heat map is a graphical representation of data where individual values are represented by
colors. Heat maps are used to visualize data density, patterns, and variations across a dataset.
Creating a Heat Map
1. Prepare Data: Ensure the data is in a matrix format, with rows and columns representing
different dimensions.
2. Choose a Color Scheme: Select a gradient color scheme to represent the range of values.
3. Generate Heat Map: Use visualization tools or libraries (e.g., Python's Seaborn or
Matplotlib) to create the heat map.
Example
Consider the pivot table data from the previous example. We can create a heat map to visualize
the sales data:
1. Data Matrix:
| Date       | A   | B   |
|------------|-----|-----|
| 2024-01-01 | 100 | 150 |
| 2024-01-02 | 200 | 250 |
2. Color Scheme: Use a gradient from light blue (low values) to dark blue (high values).
3. Generate Heat Map: Use Python's Seaborn library.
Example Code
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Example data
data = {
'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
'Product': ['A', 'B', 'A', 'B'],
'Sales': [100, 150, 200, 250]
}
# Create DataFrame
df = pd.DataFrame(data)
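# Pivot table: dates as rows, products as columns, sales as values
pivot_table = df.pivot(index='Date', columns='Product', values='Sales')
# Draw the heat map with a light-to-dark blue gradient and the values shown in each cell
sns.heatmap(pivot_table, annot=True, fmt='g', cmap='Blues')
plt.title('Sales Heat Map')
# Show plot
plt.show()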
Here is the heat map generated from the larger dataset's pivot table. It visually represents the sales data
for products A, B, C, and D over the given dates, using colors to indicate the magnitude of sales.
CORRELATION STATISTICS
Correlation statistics measure the strength and direction of the relationship between two variables.
They quantify how one variable changes with respect to another.
Types of Correlation
1. Positive Correlation: Both variables increase or decrease together.
2. Negative Correlation: One variable increases while the other decreases.
3. No Correlation: No discernible relationship between the variables.
Correlation Coefficient
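The correlation coefficient (denoted as r) is a numerical measure of the correlation between two
variables. It ranges from -1 to 1: a value of -1 indicates a perfect negative relationship, 1 a perfect
positive relationship, and 0 no linear relationship.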
Calculating Correlation
• Pearson Correlation Coefficient: Measures the linear correlation between two variables.
• Spearman's Rank Correlation: Measures the strength and direction of the monotonic association
between two variables, based on the ranks of their values.
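To make the difference between the two measures concrete, the sketch below implements both from their definitions in plain Python. The function names are illustrative, not part of any library, and the X/Y values are the ones used in the pandas examples in these notes. Spearman's coefficient is computed here as the Pearson coefficient of the ranks, with tied values sharing their average rank.

```python
import math

def pearson(x, y):
    """Pearson r: covariance divided by the product of the standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """Rank from 1 (smallest); tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rank correlation: Pearson r applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

x = [1, 2, 3, 4, 5]
print("Pearson: ", pearson(x, [2, 4, 5, 4, 5]))
print("Pearson: ", pearson(x, [2, 3, 8, 9, 10]))
print("Spearman:", spearman(x, [2, 3, 8, 9, 10]))
```

For the second dataset Y always increases with X, so the Spearman coefficient is exactly 1 even though the relationship is not perfectly linear and the Pearson coefficient is below 1.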
Example in Python
1. Using Python's pandas library to calculate Pearson correlation:
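import pandas as pd
# Example data
data = {'X': [1, 2, 3, 4, 5], 'Y': [2, 4, 5, 4, 5]}
df = pd.DataFrame(data)
# Pearson correlation between X and Y (pandas' default method)
correlation = df['X'].corr(df['Y'])
print("Pearson correlation:", correlation)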
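2. Using Python's pandas library to calculate Spearman's rank correlation:
# Example data
data = {'X': [1, 2, 3, 4, 5], 'Y': [2, 3, 8, 9, 10]}
df = pd.DataFrame(data)
spearman_corr = df.corr(method='spearman').loc['X', 'Y']
print("Spearman correlation:", spearman_corr)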