Chapter 1 SAIDS


PROGRAM STRUCTURE FOR THIRD YEAR

UNIVERSITY OF MUMBAI (With Effect from 2022-2023)

Semester V

Teaching Scheme and Credits Assigned

Course Code   Course Name                             Teaching Scheme (Contact Hours)   Credits Assigned
                                                      Theory   Pract.                   Theory   Pract.   Total
CSC501        Computer Network                        3        --                       3        --       3
CSC502        Web Computing                           3        --                       3        --       3
CSC503        Artificial Intelligence                 3        --                       3        --       3
CSC504        Data Warehousing & Mining               3        --                       3        --       3
CSDLO501X     Department Level Optional Course-1      3        --                       3        --       3
CSL501        Web Computing and Network Lab           --       2                        --       1        1
CSL502        Artificial Intelligence Lab             --       2                        --       1        1
CSL503        Data Warehousing & Mining Lab           --       2                        --       1        1
CSL504        Business Communication and Ethics-II    --       2*+2                     --       2        2
CSM501        Mini Project: 2A                        --       4$                       --       2        2
              Total                                   15       14                       15       07       22

Examination Scheme

Course Code   Course Name                             Theory                                               Term    Pract.   Total
                                                      Internal Assessment    End Sem    Exam Duration      Work    & Oral
                                                      Test1   Test2   Avg    Exam       (in Hrs)
CSC501        Computer Network                        20      20      20     80         3                  --      --       100
CSC502        Web Computing                           20      20      20     80         3                  --      --       100
CSC503        Artificial Intelligence                 20      20      20     80         3                  --      --       100
CSC504        Data Warehousing & Mining               20      20      20     80         3                  --      --       100
CSDLO501X     Department Level Optional Course-1      20      20      20     80         3                  --      --       100
CSL501        Web Computing and Network Lab           --      --      --     --         --                 25      25       50
CSL502        Artificial Intelligence Lab             --      --      --     --         --                 25      25       50
CSL503        Data Warehousing & Mining Lab           --      --      --     --         --                 25      25       50
CSL504        Business Communication and Ethics-II    --      --      --     --         --                 50      --       50
CSM501        Mini Project: 2A                        --      --      --     --         --                 25      25       50
              Total                                   --      --      100    400        --                 150     100      750

* Theory class to be conducted for the full class, and $ indicates workload of the learner (not faculty); students can
form groups with a minimum of 2 (two) and not more than 4 (four) members. Faculty load: 1 hour per week per four groups.
Course Code Course Name Credit

CSDLO5011 Statistics for Artificial Intelligence Data Science 03

Prerequisite: C Programming

Course Objectives: The course aims:


1 To Perform exploratory analysis on the datasets
2 To Understand the various distribution and sampling
3 To Perform Hypothesis Testing on datasets
4 To Explore different techniques for Summarizing Data
5 To Perform The Analysis of Variance
6 To Explore Linear Least Squares
Course Outcomes: Learner will be able to
1 Illustrate Exploratory Data Analysis
2 Describe Data and Sampling Distributions
3 Solve Statistical Experiments and Significance Testing
4 Demonstrate Summarizing Data
5 Interpret the Analysis of Variance
6 Use Linear Least Squares

Prerequisite: Discrete Structures and Graph Theory

Module Detailed Content Hours


1 Exploratory Data Analysis (5 Hours)
1.1 Elements of Structured Data, Further Reading, Rectangular Data, Data Frames and
Indexes, Nonrectangular Data Structures, Estimates of Location, Mean, Median and
Robust Estimates, Estimates of Variability, Standard Deviation and Related Estimates,
Estimates Based on Percentiles, Exploring the Data Distribution, Percentiles and
Boxplots, Frequency Tables and Histograms, Density Plots and Estimates.
1.2 Exploring Binary and Categorical Data, Mode, Expected Value, Probability,
Correlation, Scatterplots, Exploring Two or More Variables, Hexagonal Binning
and Contours (Plotting Numeric Versus Numeric Data), Two Categorical Variables,
Categorical and Numeric Data, Visualizing Multiple Variables.
2 Data and Sampling Distribution (6 Hours)

2.1 Random Sampling and Sample Bias, Bias, Random Selection, Size Versus Quality,
Sample Mean Versus Population Mean, Selection Bias, Regression to the Mean,
Sampling Distribution of a Statistic, Central Limit Theorem, Standard Error,
The Bootstrap, Resampling Versus Bootstrapping.
2.2 Confidence Intervals, Normal Distribution, Standard Normal and QQ-Plots,
Long-Tailed Distributions, Student's t-Distribution, Binomial Distribution, Chi-Square
Distribution, F-Distribution, Poisson and Related Distributions, Poisson Distributions,
Exponential Distribution, Estimating the Failure Rate, Weibull Distribution.

Self Study: Problems in distributions.


3 Statistical Experiments and Significance Testing (8 Hours)
3.1 A/B Testing, Hypothesis Tests, The Null Hypothesis, Alternative Hypothesis, One-Way
Versus Two-Way Hypothesis Tests, Resampling, Permutation Test, Example: Web
Stickiness, Exhaustive and Bootstrap Permutation Tests, Permutation Tests: The Bottom
Line for Data Science, Statistical Significance and p-Values, p-Value, Alpha, Type 1 and
Type 2 Errors.
3.2 Data Science and p-Values, t-Tests, Multiple Testing, Degrees of Freedom, ANOVA,
F-Statistic, Two-Way ANOVA, Chi-Square Test, Chi-Square Test: A Resampling
Approach, Chi-Square Test: Statistical Theory, Fisher's Exact Test, Relevance for Data
Science, Multi-Arm Bandit Algorithm, Power and Sample Size, Sample Size.

Self Study: Testing of hypothesis using any statistical tool.


4 Summarizing Data (6 Hours)
4.1 Methods Based on the Cumulative Distribution Function, The Empirical Cumulative
Distribution Function, The Survival Function, Quantile-Quantile Plots, Histograms,
Density Curves, and Stem-and-Leaf Plots, Measures of Location.
4.2 The Arithmetic Mean, The Median, The Trimmed Mean, M Estimates, Comparison of
Location Estimates, Estimating Variability of Location Estimates by the Bootstrap,
Measures of Dispersion, Boxplots, Exploring Relationships with Scatterplots.

Self Study: Perform data summarization using any statistical tool.


5 The Analysis of Variance (6 Hours)
5.1 The One-Way Layout, Normal Theory; the F Test, The Problem of Multiple
Comparisons, A Nonparametric Method—The Kruskal-Wallis Test, The Two-Way
Layout, Additive Parametrization, Normal Theory for the Two-Way Layout,
Randomized Block Designs, A Nonparametric Method—Friedman's Test.

6 Linear Least Squares (8 Hours)

6.1 Simple Linear Regression, Statistical Properties of the Estimated Slope and Intercept,
Assessing the Fit, Correlation and Regression, The Matrix Approach to Linear Least
Squares, Statistical Properties of Least Squares Estimates, Vector-Valued Random
Variables, Mean and Covariance of Least Squares Estimates, Estimation of σ²,
Residuals and Standardized Residuals, Inference about β, Multiple Linear
Regression—An Example, Conditional Inference, Unconditional Inference, and the
Bootstrap, Local Linear Smoothing.

Self Study: Create a linear regression model for a dataset and display the error
measures; choose a dataset with categorical data and apply a linear regression model.

Textbooks:
1 Bruce, Peter, and Andrew Bruce. Practical Statistics for Data Scientists: 50 Essential Concepts. O'Reilly Media,
2017.
2 Rice, John A. Mathematical Statistics and Data Analysis. University of California, Berkeley; Thomson Higher
Education.
References:
1 Dodge, Yadolah, ed. Statistical Data Analysis and Inference. Elsevier, 2014.
2 Ismay, Chester, and Albert Y. Kim. Statistical Inference via Data Science: A Modern Dive into R and the
Tidyverse. CRC Press, 2019.
3 Milton, J. S., and Arnold, J. C. Introduction to Probability and Statistics. Tata McGraw Hill, 4th Edition,
2007.
4 Johnson, R. A., and Gupta, C. B. Miller and Freund's Probability and Statistics for Engineers. Pearson
Education, Asia, 7th Edition, 2007.
5 Chandrasekaran, A., and Kavitha, G. Probability, Statistics, Random Processes and Queuing Theory. Dhanam
Publications, 2014.
Statistics for Artificial Intelligence Data Science
Chapter 1 Exploratory Data Analysis
Q. What is Statistics?

Statistics is a branch of mathematics that deals with collecting, organizing, analyzing,


interpreting, and presenting data. It provides methods and techniques to make sense of
numerical data, allowing us to make informed decisions and draw conclusions from the
information available.

In essence, statistics involves:

1. Data Collection: Gathering data through various methods such as surveys,


experiments, or observations.
2. Data Organization: Structuring and arranging the collected data in a meaningful
way, often using tables, graphs, or charts.
3. Data Analysis: Applying statistical methods to analyze the data, which may include
calculating averages, measures of variability, correlations, and performing hypothesis
tests.
4. Interpretation: Drawing conclusions and making inferences based on the analyzed
data, while considering the limitations and uncertainties.
5. Presentation: Communicating the findings effectively through reports, visualizations,
or presentations.

Statistics is used in various fields such as science, business, economics, social sciences,
engineering, and medicine, among others, to quantify uncertainty, support decision-making,
and explore relationships within data.

Q. What is Data Science?

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to
extract knowledge and insights from structured and unstructured data. It combines aspects of
mathematics, statistics, computer science, domain knowledge, and information science to
uncover patterns, trends, and correlations that can be used to make decisions and predictions.

Key components of data science include:

1. Data Collection: Gathering data from various sources, which can include databases,
data lakes, sensors, social media, and more.
2. Data Cleaning and Preparation: Processing and transforming raw data into a usable
format, handling missing values, outliers, and ensuring data quality.
3. Exploratory Data Analysis (EDA): Exploring and visualizing data to understand its
characteristics, identify patterns, trends, and relationships.
4. Data Modeling and Machine Learning: Applying statistical models, machine
learning algorithms, and computational techniques to build predictive models and
make data-driven decisions.
5. Data Interpretation and Visualization: Analyzing model results, interpreting
findings, and communicating insights through visualizations and reports.
6. Deployment and Monitoring: Implementing data-driven solutions in real-world
applications, and monitoring their performance over time to ensure they remain
effective.

Data science techniques are used across various industries and disciplines, including
business, healthcare, finance, marketing, and more. It plays a crucial role in enabling
organizations to derive value from large volumes of data, automate processes, improve
decision-making, and gain a competitive edge.

Q. Elements of Structured Data?


Structured data refers to data that is organized and formatted according to a predefined
schema or model, typically stored in databases or structured file formats. This type of data is
characterized by its orderly arrangement into rows (records) and columns (fields or
attributes), where each column represents a specific data attribute and each row represents a
single instance or record.

Elements of structured data refer to the fundamental components that make up data organized
in a structured format, typically within a database or a structured file format like CSV. These
elements include:

1. Fields or Attributes:
o Fields are individual pieces of information that describe a specific aspect or
characteristic of an entity (e.g., person, product, transaction).
o Attributes define the properties of each field, such as data type (integer, string,
date), length, and format.
2. Records or Rows:
o Records represent individual instances or entries within a dataset.
o Each record contains a collection of related fields that together describe a
single entity or event.
3. Tables or Entities:
o Tables organize data into a structured format, where each table represents a
collection of related records.
o Entities refer to the conceptual representations of objects or concepts within a
data model, typically represented by tables in a relational database.
4. Schema:
o A schema defines the structure of the data within a database or dataset.
o It specifies the names of tables, the fields within each table, their data types,
and any constraints or rules that govern the data.
5. Keys:
o Keys are attributes or combinations of attributes that uniquely identify records
within a table.
o Primary keys uniquely identify each record in a table, while foreign keys
establish relationships between tables.
6. Relationships:
o Relationships define how data entities are connected or related to each other
within a database.
o Relationships are established through keys (e.g., primary keys and foreign
keys) that link tables and enable data retrieval and manipulation across related
entities.
7. Constraints:
o Constraints enforce rules and restrictions on the data to ensure data integrity
and consistency.
o Examples include unique constraints (ensuring no duplicate values in a
column), not null constraints (requiring a field to have a value), and referential
integrity constraints (ensuring relationships between tables remain valid).
8. Query Language:
o Structured data is queried and manipulated using structured query languages
(SQL), which provide standardized syntax and commands for interacting with
databases.
o SQL allows users to retrieve, insert, update, and delete data based on specified
criteria and relationships defined in the schema.

These elements collectively define the structured nature of data, facilitating efficient storage,
retrieval, manipulation, and analysis within databases and other structured data formats. They
are essential for ensuring data consistency, accuracy, and usability across various applications
and industries.

NOTE for extra knowledge:

CSV (Comma-Separated Values) format is a plain-text format used for representing tabular
data. In CSV files, each line represents a row of data, and each field within that row is
separated by a comma (,). This format is commonly used because it is simple, lightweight,
and widely supported by various applications and programming languages.

Key characteristics of CSV format include:

1. Structure: Data is organized into rows and columns, similar to a spreadsheet or a


database table.
2. Delimiter: Fields are separated by commas (,). However, other delimiters like
semicolons (;) or tabs (\t) can also be used depending on regional conventions or
specific requirements.
3. Text Qualification: Fields can optionally be enclosed within double quotes ("),
especially when they contain special characters (like commas or line breaks) or when
they are empty.
4. Header Row: The first row of a CSV file is often used to specify column names,
known as the header row.
5. Encoding: CSV files are typically encoded using UTF-8 encoding to support
international characters and ensure compatibility across different platforms.

CSV files are widely used for exchanging data between different applications,
importing/exporting data into/from databases, and for storing structured data that needs to be
easily readable and manipulated by both humans and machines. They are commonly
generated by spreadsheet software like Microsoft Excel or Google Sheets and can be
processed programmatically using various programming languages such as Python and Java.
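For example, a short Python sketch using the pandas library can read a CSV file into a tabular structure and inspect it; the file name below is purely illustrative.

import pandas as pd

# Read a hypothetical CSV file into a DataFrame; pandas uses the first row as the
# header and infers a data type for each column.
df = pd.read_csv("students.csv", encoding="utf-8")

print(df.head())    # first five rows
print(df.dtypes)    # inferred data type of each column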

Q. What is Rectangular Data?

Rectangular data refers to a structured format where data is organized in rows and columns,
forming a grid-like structure similar to a table or a matrix. This format is also known as
tabular data or two-dimensional data, and it is commonly used in databases, spreadsheets,
CSV files, and data frames in programming languages like Python and R.

Key characteristics of rectangular data include:

1. Rows and Columns: Data is organized into rows (also known as records or
observations) and columns (also known as fields or variables). Each row represents a
single entity or observation, and each column represents a specific attribute or
characteristic of that entity.
2. Tabular Structure: The data is structured in a tabular format where rows and
columns intersect to form cells. Each cell contains a single data value corresponding
to a specific row and column intersection.
3. Homogeneous Format: Rectangular data typically assumes a homogeneous format,
meaning that all rows have the same set of columns (variables) with consistent data
types. This consistency allows for straightforward manipulation and analysis of data.
4. Header Row: Rectangular data often includes a header row at the top that defines the
names of each column or variable. This header row provides context and labels for the
data stored in each column.
5. Structured Querying: Rectangular data can be queried and manipulated using
structured query languages (SQL) in databases or through data manipulation tools and
libraries in programming languages like pandas in Python or data.table in R.

Examples of rectangular data include:

• Database Tables: In relational databases, tables are structured as rectangular data


where each row represents a record and each column represents an attribute or field.
• Spreadsheets: Excel spreadsheets and Google Sheets organize data in rows and
columns, making them examples of rectangular data.
• CSV Files: Comma-Separated Values (CSV) files store tabular data where each line
corresponds to a row and each comma-separated value corresponds to a column in the
dataset.

Data Frames:

• Definition: A data frame is a two-dimensional labeled data structure with rows and
columns, similar to a table in a relational database or a spreadsheet. It is a core data
structure used for data manipulation and analysis in statistical computing and data
science.
• Characteristics:
o Tabular Structure: Data frames organize data into rows and columns, where
each column represents a variable or feature, and each row represents an
observation or record.
o Column Names: Each column in a data frame has a name or label, which
allows for easy reference and manipulation of specific variables.
o Homogeneous Columns: All columns in a data frame typically have the same
length and contain data of the same type (e.g., integers, strings, dates).
o Flexible Data Types: Data frames can accommodate various data types within
columns, including numerical data, categorical data, dates, and text.
• Usage:
o Data frames are widely used for data exploration, cleaning, transformation,
analysis, and visualization tasks.
o They provide a convenient way to handle and manipulate structured data,
making it easier to perform operations like filtering rows, selecting columns,
joining datasets, and computing summary statistics.
• Examples:
o In Python, data frames are commonly created and manipulated using the
pandas library (import pandas as pd). Operations such as reading data from
CSV files, performing data aggregation, and plotting data are efficiently
handled with pandas data frames.

Indexes:

• Definition: An index in the context of data frames refers to a structure that allows for
fast lookup, retrieval, and manipulation of data. It provides a way to uniquely identify
each row in a data frame.
• Characteristics:
o Unique Identification: Each row in a data frame can be associated with a
unique index value, typically integers starting from 0 to n-1 (where n is the
number of rows).
o Immutable: Index values are immutable (cannot be changed), which ensures
consistency and integrity when performing operations that rely on row
identification.
o Labeling: Indexes can also be labeled with meaningful identifiers (e.g.,
timestamps, alphanumeric codes) instead of numeric values.
• Usage:
o Indexes facilitate efficient data retrieval and manipulation operations such as
slicing, selecting subsets of data, merging data frames, and aligning data in
mathematical operations.
o They play a crucial role in maintaining the integrity of data relationships
across different data frames and when performing operations that involve
joining, grouping, or reshaping data.
• Examples:
o In pandas, the index of a data frame can be explicitly specified when creating
the data frame or can be automatically generated. Operations like setting a
column as the index (df.set_index('column_name')) or resetting the index
(df.reset_index()) are common in data manipulation workflows.

In summary, data frames and indexes are foundational concepts in data analysis and
manipulation, providing structured ways to organize, access, and process data efficiently
across different programming languages and environments.
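As a small illustration of these ideas, the following sketch (with made-up column names and values) builds a pandas data frame, sets a column as its index for label-based lookup, and then restores the default integer index.

import pandas as pd

# Build a small data frame from a dictionary of columns (illustrative data).
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meera"],
    "score": [82, 74, 91],
})

df = df.set_index("name")     # use the 'name' column as the row index
print(df.loc["Ravi"])         # label-based lookup of a single row

df = df.reset_index()         # restore the default integer index (0 .. n-1)
print(df)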
EXTRA
Pandas is a powerful open-source data analysis and manipulation library for Python. It provides high-
performance, easy-to-use data structures and data analysis tools for working with structured (tabular)
data.
Nonrectangular data structures

Nonrectangular data structures refer to data formats that do not conform to the traditional
tabular (rows and columns) structure typical of relational databases or spreadsheets. These
data structures are more flexible and can accommodate varying data shapes and nested
hierarchies. Here are some common examples of nonrectangular data structures:

1. Hierarchical Data Structures:

• JSON (JavaScript Object Notation):


o Definition: JSON is a lightweight data-interchange format that is easy for
humans to read and write, and easy for machines to parse and generate.
o Characteristics: JSON data is organized in key-value pairs and supports
nested structures, arrays, and objects. It is widely used for transmitting data
between a server and a web application.
• XML (eXtensible Markup Language):
o Definition: XML is a markup language that defines a set of rules for encoding
documents in a format that is both human-readable and machine-readable.
o Characteristics: XML documents use tags to define elements and their
attributes, allowing for hierarchical structures and nested data. It is commonly
used for data exchange and configuration files.
o Example:

2. Key-Value Stores:

• Redis:
o Definition: Redis is an open-source, in-memory data structure store used as a
database, cache, and message broker.
o Characteristics: Redis stores data as key-value pairs, where values can be
complex data structures such as lists, sets, sorted sets, hashes, and more. It is
known for its high performance and flexibility.

3. Graph-Based Data Structures:

• Graph Databases (e.g., Neo4j):


o Definition: Graph databases are designed to treat the relationships between
data as equally important to the data itself.
o Characteristics: Data is stored in nodes (entities) and edges (relationships),
allowing for complex, interconnected data models. Graph databases excel in
scenarios where relationships between entities are central to the analysis.

4. Document-Oriented Databases:

• MongoDB:
o Definition: MongoDB is a NoSQL document-oriented database that stores
data in JSON-like documents.
o Characteristics: Each document can have a different structure, and fields can
vary from document to document. MongoDB supports nested structures,
arrays, and complex data types, making it suitable for flexible and scalable
applications.

5. Time-Series Databases:

• InfluxDB:
o Definition: InfluxDB is an open-source time series database designed for
high-performance handling of time-stamped data.
o Characteristics: It stores data in a schema-less fashion, optimized for
querying and analyzing time-series data points. Time-series databases are
crucial for applications like financial data analysis.

Estimates of Location

Estimates of location, also known as measures of central tendency, are statistical measures
that describe the central or typical value in a dataset. These measures help to summarize the
location of the data distribution and provide insights into its central tendency. Here are three
common estimates of location along with examples:

1. Mean (Arithmetic Average):

• Definition: The mean is the sum of all values in a dataset divided by the number of
observations. It represents the typical value or average of the dataset.
• Example: Suppose we have the exam scores of five students in a class: 70, 80, 90, 85, 75.
• Calculation: Mean = (70 + 80 + 90 + 85 + 75) / 5 = 400 / 5 = 80.

2. Median:

• Definition: The median is the middle value in a sorted dataset. If the dataset has an
odd number of observations, the median is the middle value. If the dataset has an even
number of observations, the median is the average of the two middle values.
• Example: Consider the following dataset of ages:
• Ages: 25, 30, 35, 40, 45

Since there are 5 observations, the median is the third value (35).

3. Mode:

• Definition: The mode is the value that appears most frequently in a dataset. It
represents the most common value or values.
• Example: Consider a dataset of exam grades:
Grades: A, B, B, C, A, A, B, A
The mode in this dataset is "A", as it appears more frequently than any other grade.

Interpretation and Use:


• Mean: Useful for describing the average value of a dataset, especially when the data
is normally distributed or symmetric.
• Median: Provides a measure of central tendency that is robust to outliers and skewed
distributions.
• Mode: Helps identify the most frequent or popular value in categorical or discrete
datasets.

These estimates of location provide valuable insights into the characteristics of a dataset and
are essential tools in descriptive statistics for summarizing data distributions.
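As a quick illustration, Python's built-in statistics module can compute all three estimates of location; a minimal sketch using the small datasets from the examples above:

from statistics import mean, median, mode

ages = [25, 30, 35, 40, 45]                        # ages from the median example
grades = ["A", "B", "B", "C", "A", "A", "B", "A"]  # grades from the mode example

print(mean(ages))     # 35 -> arithmetic average
print(median(ages))   # 35 -> middle value of the sorted data
print(mode(grades))   # 'A' -> most frequent value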

Q. Robust Estimates

Robust estimates of location and dispersion are statistical measures that are less sensitive to
outliers or deviations from the typical pattern in a dataset. They are particularly useful when
dealing with data that may contain extreme values or when the underlying distribution is not
strictly normal. Here are two robust estimates commonly used:

1. Median Absolute Deviation (MAD):

• Definition: MAD is a robust measure of dispersion that describes the variability of
data points around the median. It is less affected by outliers than the standard
deviation.
• Formula: MAD = median( |xi - median(x)| ), i.e. the median of the absolute deviations
of the data points from the median of the dataset.
• Example: For the values 18, 19, 20, 21, 22, 23, 25, the median is 21, the absolute
deviations are 3, 2, 1, 0, 1, 2, 4, and their median (the MAD) is 2.
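A minimal Python sketch of this calculation using NumPy, with the same seven values (unsorted), might look like:

import numpy as np

data = np.array([20, 22, 19, 25, 18, 23, 21])   # same seven values, unsorted

med = np.median(data)                            # 21.0
mad = np.median(np.abs(data - med))              # median of the absolute deviations
print(mad)                                       # 2.0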

Estimates of Variability
Estimates of variability in statistics refer to measures that describe how spread out or dispersed a set
of data points are. There are several common measures used to quantify variability:

1. Range: The range is the simplest measure of variability and is calculated as the
difference between the maximum and minimum values in a dataset. It gives an idea of
the spread of the entire dataset.
2. Variance: The variance measures the average squared deviation of each data point
from the mean of the dataset. It provides a measure of the spread of data points
around the mean. However, the variance is in squared units of the original data.
3. Standard Deviation: The standard deviation is the square root of the variance. It
represents the typical distance between each data point and the mean. Standard
deviation is often preferred over variance because it is in the same units as the original
data.
4. Interquartile Range (IQR): The IQR is a measure of variability based on dividing
the data into quartiles. It is the difference between the 75th percentile (Q3) and the
25th percentile (Q1). IQR is less sensitive to outliers compared to range, variance, and
standard deviation.
5. Coefficient of Variation: The coefficient of variation (CV) is a relative measure of
variability. It is calculated as the standard deviation divided by the mean, expressed as
a percentage. CV allows for comparison of variability between datasets with different
units or scales.
Each of these measures provides different insights into the variability of data. The choice of
which measure to use depends on the nature of the data and the specific question being
addressed in the analysis.

1. Range:
o Example: Consider the following dataset representing the daily temperatures (in
degrees Celsius) in a city over a week: 20, 22, 19, 25, 18, 23, 21.
o Calculation: Range = Maximum value - Minimum value
▪ Maximum value = 25
▪ Minimum value = 18
▪ Range = 25 - 18 = 7
o Interpretation: The range of temperatures over the week is 7 degrees Celsius,
indicating the extent of variability in daily temperatures.
2. Variance:
o Example: Let's use the same dataset of daily temperatures.
o Calculation: The mean temperature is (20 + 22 + 19 + 25 + 18 + 23 + 21) / 7 = 148 / 7 ≈ 21.14.
The squared deviations from the mean sum to about 34.86, so the (population) variance is
34.86 / 7 ≈ 4.98, in squared degrees Celsius.
3. Standard Deviation:
o Example: Using the same dataset of daily temperatures.
o Calculation: Standard deviation = √variance = √4.98 ≈ 2.23 degrees Celsius.
4. Interquartile Range (IQR):
o Example: Consider the following dataset representing the scores of a class in a math
test: 65, 70, 72, 75, 80, 85, 90, 95.
o Calculation:
▪ First Quartile (Q1) = 71 (the median of the lower half of the data)
▪ Third Quartile (Q3) = 87.5 (the median of the upper half of the data)
▪ IQR = Q3 - Q1 = 87.5 - 71 = 16.5

To calculate the interquartile range (IQR) for the dataset 65,70,72,75,80,85,90,95, follow
these steps:

1. Arrange the data in ascending order: 65,70,72,75,80,85,90,95


2. Identify the first quartile (Q1):
o Q1 is the median of the lower half of the data.
o In this dataset, the lower half is 65,70,72,75.
o Q1 = median of 65,70,72,75
o Q1 = (70 + 72) / 2 = 71
3. Identify the third quartile (Q3):
o Q3 is the median of the upper half of the data.
o In this dataset, the upper half is 80,85,90,95
o Q3 = median of 80,85,90,95
o Q3 = (85 + 90) / 2 = 87.5
4. Calculate the interquartile range (IQR):
o IQR = Q3 - Q1
o IQR = 87.5−71=16.5
o Interpretation: The interquartile range of scores is 16.5, which represents the spread
of the middle 50% of scores, making it less sensitive to extreme values (outliers).

5. Coefficient of Variation (CV):


o Example: Consider two datasets representing the heights of students in two different
classes:
▪ Class A: Mean height = 160 cm, Standard Deviation = 5 cm
▪ Class B: Mean height = 170 cm, Standard Deviation = 10 cm
o Calculation:
▪ CV for Class A = (Standard Deviation / Mean) × 100 = (5 / 160) × 100 ≈ 3.13%
▪ CV for Class B = (10 / 170) × 100 ≈ 5.88%
o Interpretation: Class A has a coefficient of variation of approximately 3.13%,
indicating relatively low variability in heights compared to the mean, while Class B
has a coefficient of variation of approximately 5.88%, indicating higher variability in
heights relative to the mean.

These examples illustrate how different measures of variability can be calculated and
interpreted using real-world datasets. Each measure provides valuable insights into the spread
or dispersion of data points within a dataset.
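These measures can also be computed directly in Python; a minimal sketch with NumPy, using the temperature dataset from the example above:

import numpy as np

temps = np.array([20, 22, 19, 25, 18, 23, 21])    # daily temperatures from the example

print(temps.max() - temps.min())                   # range = 7
print(np.var(temps))                               # population variance, about 4.98
print(np.std(temps))                               # population standard deviation, about 2.23
q1, q3 = np.percentile(temps, [25, 75])            # quartiles (NumPy interpolates by default)
print(q3 - q1)                                     # interquartile range
print(np.std(temps) / np.mean(temps) * 100)        # coefficient of variation, in percent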

Standard Deviation and Related Estimates

The standard deviation is a measure of the amount of variation or dispersion of a set of
values. It indicates how much each value in a dataset deviates from the mean (average) value
of the dataset. For a sample x1, ..., xn with mean x̄, the sample standard deviation is
s = √( Σ(xi - x̄)² / (n - 1) ), while the population version divides by n instead of n - 1.

Example Calculation: For the temperature data above (20, 22, 19, 25, 18, 23, 21), the mean is
about 21.14 and the population standard deviation is √4.98 ≈ 2.23 degrees Celsius.

Estimates Based on Percentiles


Estimates based on percentiles involve using statistical data to understand and interpret
different parts of a distribution. Here’s how percentiles work and how estimates can be
derived from them:

1. Understanding Percentiles: Percentiles divide a dataset into hundred equal parts. For
example, the 50th percentile (also known as the median) divides the data into two
halves, with half the data points below and half above this value.
2. Using Percentiles for Estimates:
o Central Tendency: The median (50th percentile) is often used as a measure of
central tendency when the data is skewed or has outliers. It gives an estimate
of the middle value of the dataset.
o Spread or Variability: Percentiles can also help estimate the spread of the
data. For instance, the interquartile range (difference between the 75th and
25th percentiles) can provide a measure of how spread out the middle portion
of the data is.
o Outliers and Extremes: Percentiles can be useful for identifying outliers or
extreme values in the dataset. For example, the 95th percentile indicates the
value below which 95% of the data falls.
3. Estimating Values:
o Estimating Specific Percentiles: If you know the percentile you're interested
in (e.g., the 90th percentile), you can estimate the corresponding value in the
dataset. This can be useful in various applications such as finance (estimating
high-percentile risk) or healthcare (estimating high-percentile patient wait
times).
o Forecasting and Planning: Businesses often use percentile estimates for
forecasting demand or planning resources. For example, estimating the 95th
percentile of sales volumes can help ensure sufficient inventory levels during
peak periods.
4. Practical Considerations:
o Data Quality: The accuracy of percentile estimates depends on the quality
and representativeness of the data sample.
o Interpretation: Percentiles should be interpreted in context. For example, the
25th percentile may indicate a low value, but it doesn’t necessarily mean it’s a
"low" value in an absolute sense without understanding the distribution of the
data.

In summary, estimates based on percentiles provide valuable insights into different aspects of
a dataset, from central tendency to variability and extremes. They are widely used in
statistics, finance, healthcare, and other fields where understanding distribution
characteristics is important for decision-making.let's walk through an example to illustrate
how estimates based on percentiles work:

Example: Estimating Wait Times at a Hospital

Imagine you're analyzing wait times for patients at a hospital's emergency department. You
have data on the time each patient waits before being seen by a doctor. Here’s how you can
use percentiles to estimate and interpret different aspects of these wait times:

1. Dataset: You have wait-time data (in minutes) for 100 patients: [10, 15, 20, 25, 30, …, 120].
Percentile Calculation:
o Median (50th Percentile): To find the median, sort the data and find the
middle value. Here, with 100 data points, the median is the average of the 50th
and 51st values:

Median = (50th value + 51st value) / 2 = (50 + 55) / 2 = 52.5

This means half of the patients waited 52.5 minutes or less before being seen.

o 75th Percentile: To find the 75th percentile, 75% of the patients waited this
amount of time or less. With 100 data points, the 75th percentile corresponds
to the 75th value:
75th Percentile = 75th value = 75 minutes
This indicates that 75% of patients waited 75 minutes or less.
o 90th Percentile: To find the 90th percentile, 90% of the patients waited this
amount of time or less. With 100 data points, the 90th percentile corresponds
to the 90th value:
90th Percentile = 90th value = 90 minutes
This means 90% of patients waited 90 minutes or less.

2. Estimates and Interpretation:


o Median: The median wait time of 52.5 minutes gives you an estimate of the
typical wait time for patients. It’s useful for understanding the central
tendency of wait times.
o 75th Percentile: Knowing that 75% of patients are seen within 75 minutes
helps you gauge how long most patients wait before their turn.
o 90th Percentile: The 90th percentile of 90 minutes is the wait time that only
10% of patients exceed. This is crucial for identifying outliers or unusually long
wait times.
3. Application:
o Hospital Management: Using these percentiles, hospital administrators can
plan staffing levels to ensure that the majority of patients are seen within a
reasonable time frame (e.g., within the median or 75th percentile wait time).
o Patient Expectations: Communicating these percentiles to patients can
manage expectations and improve patient satisfaction, as they have a clearer
understanding of potential wait times.

In this example, percentiles provide actionable insights into the distribution of wait times,
helping stakeholders make informed decisions and improve service delivery in the hospital
setting.
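A small Python sketch of such a percentile analysis could look like the following; the wait times are synthetic values generated only for illustration.

import numpy as np

# Synthetic wait times in minutes for 100 patients (illustrative only).
rng = np.random.default_rng(0)
wait_times = rng.integers(5, 121, size=100)

p50, p75, p90 = np.percentile(wait_times, [50, 75, 90])
print("Median (50th percentile):", p50, "minutes")
print("75th percentile:", p75, "minutes")
print("90th percentile:", p90, "minutes")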

Exploring the Distribution of Data

Exploring the distribution of data involves analyzing its shape, central tendency, spread, and
any patterns or outliers present.
Here’s a step-by-step guide on how to explore data distribution:

Steps to Explore Data Distribution

1. Visual Inspection:
o Histogram: Plot a histogram to visualize the frequency distribution of the data. This
provides an overview of how data points are distributed across different bins or
intervals.
o Box Plot: Construct a box plot (box-and-whisker plot) to identify the median,
quartiles, and potential outliers in the data. This plot gives a quick snapshot of the
data's spread and central tendency.
2. Measure of Central Tendency:
o Mean: Calculate the arithmetic mean to understand the average value of the dataset.
This is sensitive to outliers.
o Median: Find the median to determine the middle value of the dataset, which is less
affected by extreme values compared to the mean.
o Mode: Identify the mode, or most frequent value, if applicable, to understand the
peak of the distribution.
3. Measure of Spread:
o Range: Compute the range (difference between the maximum and minimum values)
to get an idea of the spread of the data.
o Interquartile Range (IQR): Calculate the IQR (difference between the 75th and
25th percentiles) to measure the spread of the middle 50% of the data, which is
resistant to outliers.
o Standard Deviation: Compute the standard deviation to quantify the average
deviation of data points from the mean. It provides a measure of how spread out the
data is.
4. Shape of the Distribution:
o Skewness: Assess the skewness of the distribution to understand its symmetry. A
skewness value close to zero indicates a symmetric distribution, while positive or
negative values indicate skewness towards the right (positive) or left (negative).
o Kurtosis: Evaluate the kurtosis to understand the peakedness or flatness of the
distribution relative to a normal distribution. High kurtosis indicates a more peaked
distribution, while low kurtosis indicates a flatter distribution.
5. Identifying Outliers:
o Box Plot: Use the box plot to identify potential outliers, which are data points that
significantly differ from other observations in the dataset.
o Z-Score: Calculate the Z-score of data points to identify outliers based on their
deviation from the mean.
6. Data Transformation:
o Normalization: If the data is not normally distributed, consider transforming it (e.g.,
logarithmic transformation) to achieve normality, which can be useful for certain
statistical analyses.
7. Contextual Understanding:
o Consider the context of your data and the underlying processes generating it.
Real-world data often exhibit complex patterns that might not fit simple
theoretical distributions.

When exploring data distributions, it's important to combine multiple approaches for a
comprehensive understanding and to ensure robustness in subsequent analyses or modelling
efforts.

Example Scenario: Analyzing Exam Scores

Let's apply these steps to a hypothetical dataset of exam scores (out of 100):

1. Visual Inspection:
o Plot a histogram to visualize the distribution of scores.
o Construct a box plot to identify the median, quartiles, and any outliers.
2. Measure of Central Tendency:
o Calculate the mean, median, and mode of the scores to understand their central
values.
3. Measure of Spread:
o Compute the range, IQR, and standard deviation to assess how spread out the scores
are.
4. Shape of the Distribution:
o Evaluate skewness and kurtosis to understand the shape of the score distribution.
5. Identifying Outliers:
o Use the box plot and Z-score calculations to identify any outliers in the dataset.
6. Data Transformation:
o Consider transforming the scores if the distribution is highly skewed or not normal.
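A compact Python sketch of these steps, using pandas and SciPy on a made-up list of exam scores, might look like this:

import pandas as pd
from scipy.stats import skew, kurtosis, zscore

# Hypothetical exam scores out of 100, for illustration only.
scores = pd.Series([45, 56, 62, 67, 70, 72, 75, 78, 80, 82, 85, 88, 91, 95, 98])

print(scores.describe())              # count, mean, std, min, quartiles, max
print("Skewness:", skew(scores))      # symmetry of the distribution
print("Kurtosis:", kurtosis(scores))  # peakedness relative to a normal distribution
print("Possible outliers:", scores[abs(zscore(scores)) > 3].tolist())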

Histograms Questions with Solutions


Question 1:

The below histogram shows the weekly wages of workers at a


construction site:
Answer the following questions:

(i) How many workers get wages of ₹ 60-70?

(ii) Construct a frequency distribution table.

(iii) What is the cumulative frequency for the class 50-60?

Solution:

(i) 16 workers

(ii)

Weekly Wages in ₹ Number of Workers


30-40 10
40-50 20
50-60 40
60-70 16
70-80 8
80-90 6
(iii) The cumulative frequency for the class 50-60 = 10 + 20 + 40 = 70.
BOX PLOT

Definition
The method to summarize a set of data that is measured using an interval scale is
called a box and whisker plot. These are maximum used for data analysis. We use
these types of graphs or graphical representation to know:

• Distribution Shape
• Central Value of it
• Variability of it

A box plot is a chart that shows data from a five-number summary including one of
the measures of central tendency. It does not show the distribution in particular as
much as a stem and leaf plot or histogram does. But it is primarily used to indicate a
distribution is skewed or not and if there are potential unusual observations (also
called outliers) present in the data set. Boxplots are also very beneficial when large
numbers of data sets are involved or compared.

In simple words, we can define the box plot in terms of descriptive statistics related
concepts. That means box or whiskers plot is a method used for depicting groups of
numerical data through their quartiles graphically. These may also have some lines
extending from the boxes or whiskers which indicates the variability outside the lower
and upper quartiles, hence the terms box-and-whisker plot and box-and-whisker
diagram. Outliers can be indicated as individual points.

It helps to find out how much the data values vary or spread out with the help of
graphs. As we need more information than just knowing the measures of central
tendency, this is where the box plot helps. This also takes less space. It is also a
type of pictorial representation of data.

Since the centre, spread, and overall range are immediately apparent, boxplots make it
easy to compare distributions.

Parts of Box Plots


A box plot shows the minimum, maximum, first quartile, third quartile, median, and any
outliers, as described below.
Minimum: The minimum value in the given dataset

First Quartile (Q1): The first quartile is the median of the lower half of the data set.

Median: The median is the middle value of the dataset, which divides the given
dataset into two equal parts. The median is considered as the second quartile.

Third Quartile (Q3): The third quartile is the median of the upper half of the data.

Maximum: The maximum value in the given dataset.

Apart from these five terms, the other terms used in the box plot are:

Interquartile Range (IQR): The difference between the third quartile and first
quartile is known as the interquartile range. (i.e.) IQR = Q3-Q1

Outlier: The data that falls on the far left or right side of the ordered data is tested to
be the outliers. Generally, the outliers fall more than the specified distance from the
first and third quartile.

(i.e.) Outliers are values greater than Q3 + (1.5 × IQR) or less than Q1 - (1.5 × IQR).
Boxplot Distribution
The box plot distribution will explain how tightly the data is grouped, how the data is
skewed, and also about the symmetry of data.

Positively Skewed: If the distance from the median to the maximum is greater than
the distance from the median to the minimum, then the box plot is positively skewed.

Negatively Skewed: If the distance from the median to minimum is greater than the
distance from the median to the maximum, then the box plot is negatively skewed.

Symmetric: The box plot is said to be symmetric if the median is equidistant from
the maximum and minimum values.

Box Plot Chart


In a box and whisker plot:

• the ends of the box are the upper and lower quartiles so that the box crosses the
interquartile range
• a vertical line inside the box marks the median
• the two lines outside the box are the whiskers extending to the highest and lowest
observations.

Applications
It is used to know:

• The outliers and their values


• Symmetry of Data
• Tight grouping of data
• Data skewness – if, in which direction and how

Box Plot Example


Example:

Find the maximum, minimum, median, first quartile, third quartile for the given data
set: 23, 42, 12, 10, 15, 14, 9.

Solution:

Given: 23, 42, 12, 10, 15, 14, 9.

Arrange the given dataset in ascending order.

9, 10, 12, 14, 15, 23, 42

Hence,

Minimum = 9

Maximum = 42

Median = 14

First Quartile = 10 (Middle value of 9, 10, 12 is 10)

Third Quartile = 23 (Middle value of 15, 23, 42 is 23).
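The same five-number summary can be cross-checked in Python; note that NumPy's default percentile interpolation gives slightly different quartiles than the "middle value of each half" convention used above, so this is only an approximate check.

import numpy as np
import matplotlib.pyplot as plt

data = [23, 42, 12, 10, 15, 14, 9]

print(min(data), max(data), np.median(data))   # 9, 42, 14.0

# NumPy interpolates between values, so it reports Q1 = 11.0 and Q3 = 19.0 here,
# not the Q1 = 10 and Q3 = 23 obtained with the middle-of-each-half convention.
q1, q3 = np.percentile(data, [25, 75])
print(q1, q3, q3 - q1)

plt.boxplot(data)                              # draw the box-and-whisker plot
plt.title("Box plot of the example dataset")
plt.show()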

PERCENTILE

Let's consider the following dataset of scores: scores = [72, 85, 68, 90, 78, 82, 88, 94, 76, 81]

Steps to Calculate the 25th Percentile (Q1):

1. Sort the Dataset: First, sort the scores in ascending order.


Sorted scores: 68,72,76,78,81,82,85,88,90,94

2. Identify the Position: Use the formula to find the position of the 25th percentile:

Position = (25 / 100) × (number of data points + 1) = 0.25 × 11 = 2.75

Here, 2.75 represents the exact position in the sorted dataset where the 25th percentile falls.
Since 2.75 is not a whole number, one simple convention is to round up (take the ceiling) to
position 3, so the 25th percentile is the 3rd value in the sorted list, which is 76.
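The same (n + 1) convention can be coded directly; note that libraries such as NumPy interpolate between values by default, so np.percentile(scores, 25) returns 76.5 rather than 76.

import math

scores = [72, 85, 68, 90, 78, 82, 88, 94, 76, 81]
scores.sort()                              # [68, 72, 76, 78, 81, 82, 85, 88, 90, 94]

position = 0.25 * (len(scores) + 1)        # 2.75, using the (n + 1) position formula
q1 = scores[math.ceil(position) - 1]       # ceiling gives the 3rd value (index 2)
print(q1)                                  # 76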

Frequency tables and histograms are two common tools used in statistics to summarize and
visualize the distribution of a dataset. Let's explore each of them in detail:

Frequency Table

A frequency table is a summary of the data showing the number of occurrences (frequency)
of each distinct value or range of values in a dataset. It's particularly useful for categorical
and discrete numerical data. Here’s how you construct a frequency table:

1. Identify Data Values: List down all unique values present in the dataset.
2. Count Frequencies: Count how many times each unique value appears in the dataset.
3. Organize Data: Arrange the values in ascending or descending order along with their
corresponding frequencies.

Example of a Frequency Table

Consider the following dataset representing the number of books read by students in a month:

{2,3,1,4,2,3,5,1,3,2,4,3,2,3,4,2,1,3,4,5}

• Values: 1, 2, 3, 4, 5
• Frequencies:
o 1: 3 times
o 2: 5 times
o 3: 6 times
o 4: 4 times
o 5: 2 times
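A frequency table like this can be produced in a couple of lines of Python, for example with collections.Counter (a pandas value_counts() call would work equally well):

from collections import Counter

books = [2, 3, 1, 4, 2, 3, 5, 1, 3, 2, 4, 3, 2, 3, 4, 2, 1, 3, 4, 5]

freq = Counter(books)
for value in sorted(freq):                 # print values in ascending order
    print(value, freq[value])
# Output: 1 3, 2 5, 3 6, 4 4, 5 2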

Histogram

A histogram is a graphical representation of the distribution of numerical data. It consists of


bars that represent the frequency of data points falling into specific intervals (bins) of values.
Here's how you create a histogram:

1. Determine Number of Bins: Divide the range of the data into intervals (bins). The
number of bins depends on the dataset size and the desired level of detail.
2. Count Data Points: Count how many data points fall into each bin.
3. Plot the Histogram: Draw a bar for each bin where the height of the bar represents
the frequency of data points in that bin.

Histogram Graph

A histogram graph is a bar-graph representation of data. It represents a range of
outcomes as columns along the x-axis, while the y-axis shows the number of
occurrences (the count) of the data values falling in each column. It is one of the
easiest ways to visualize data distributions. Let us understand the histogram graph
by plotting one for the example given below.

Uncle Bruno owns a garden with 30 black cherry trees. Each tree is of a
different height. The height of the trees (in inches): 61, 63, 64, 66, 68, 69,
71, 71.5, 72, 72.5, 73, 73.5, 74, 74.5, 76, 76.2, 76.5, 77, 77.5, 78, 78.5, 79,
79.2, 80, 81, 82, 83, 84, 85, 87. We can group the data as follows in
a frequency distribution table by setting a range:

Height Range (in inches) Number of Trees (Frequency)

61 - 65 3

66 - 70 3

71 - 75 8

76 - 80 10

81 - 85 5

86 - 90 1

This data can be now shown using a histogram. We need to make sure that
while plotting a histogram, there shouldn’t be any gaps between the bars.
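A sketch in Python with matplotlib can reproduce this histogram; the bin edges below are shifted by 0.5 so that each class of the frequency table (61-65, 66-70, and so on) falls entirely inside one bar.

import matplotlib.pyplot as plt

heights = [61, 63, 64, 66, 68, 69, 71, 71.5, 72, 72.5, 73, 73.5, 74, 74.5,
           76, 76.2, 76.5, 77, 77.5, 78, 78.5, 79, 79.2, 80, 81, 82, 83, 84, 85, 87]

# Bin edges chosen so each class of the table maps to exactly one bar.
bins = [60.5, 65.5, 70.5, 75.5, 80.5, 85.5, 90.5]

plt.hist(heights, bins=bins, edgecolor="black")   # adjacent bars, no gaps
plt.xlabel("Height (inches)")
plt.ylabel("Number of trees")
plt.title("Heights of 30 black cherry trees")
plt.show()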
How to Make a Histogram?

The process of making a histogram using the given data is described


below:

• Step 1: Choose a suitable scale to represent weights on the horizontal axis.


• Step 2: Choose a suitable scale to represent the frequencies on the vertical axis.
• Step 3: Then draw the bars corresponding to each of the given weights using their frequencies.

Example: Construct a histogram for the following frequency distribution


table that describes the frequencies of weights of 25 students in a class.

Weights (in lbs) Frequency (Number of students)

65 - 70 4

70 - 75 10

75 - 80 8

80 - 85 4

Steps to draw a histogram:


• Step 1: On the horizontal axis, we can choose the scale to be 1 unit = 11 lb. Since the weights in
the table start from 65, not from 0, we give a break on the X-axis.
• Step 2: On the vertical axis, the frequencies are varying from 4 to 10. Thus, we choose the scale
to be 1 unit = 2.
• Step 3: Then draw the bars corresponding to each of the given weights using their frequencies.

Density Plots and Estimates.

Density plots and density estimates are valuable tools in statistics and data visualization for
understanding the distribution of data. Here's a detailed explanation of density plots and
estimates:

Density Plot

A density plot is a graphical representation of the distribution of data. It displays the


probability density function of a continuous variable. Here are key points about density plots:

• Smooth Representation: Unlike histograms, which use bins to represent data


frequencies, density plots provide a smooth curve that estimates the probability
density of the data points.
• Kernel Density Estimation (KDE): The most common method to construct a density
plot is through kernel density estimation. KDE places a kernel (often Gaussian) at
each data point and sums these to create a smooth curve that approximates the
underlying distribution.
• Interpretation: The height of the density plot at any point represents the estimated
probability density of the data at that point. The total area under the curve equals 1,
representing the total probability of the dataset.
• Visualization: Density plots are effective for visualizing the shape, spread, and
skewness of data distributions, especially when histograms may not provide enough
detail due to binning choices.

Density Estimate

Density estimation is the broader statistical process of estimating the probability density
function of a random variable from a sample of data points. Key aspects of density estimation
include:

• Methods: There are several methods for density estimation:


o Histogram-based Estimation: Divides the data into bins and estimates
density within each bin.
o Kernel Density Estimation (KDE): Smooths data points using kernels to
estimate the density function.
o Parametric Estimation: Assumes a specific parametric form (e.g., normal
distribution) and estimates its parameters (mean, variance) from the data.
• Non-parametric vs Parametric: Non-parametric methods like KDE do not assume a
specific distribution form, making them versatile for a wide range of data types and
distributions.
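A minimal KDE sketch in Python using SciPy's gaussian_kde on a synthetic sample (purely illustrative) is shown below.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Synthetic sample for illustration: 200 values from a normal distribution.
rng = np.random.default_rng(42)
sample = rng.normal(loc=70, scale=10, size=200)

kde = gaussian_kde(sample)                       # Gaussian kernel density estimate
xs = np.linspace(sample.min(), sample.max(), 200)

plt.hist(sample, bins=20, density=True, alpha=0.4, label="Histogram")
plt.plot(xs, kde(xs), label="KDE")               # smooth curve; total area under it is 1
plt.legend()
plt.show()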
Section 1.2
Exploring Binary and Categorical Data
Exploring binary and categorical data involves understanding the characteristics and
distributions of these types of variables. Here's an overview of how to approach exploring
and analyzing binary and categorical data:

1. Understanding Binary Data

Binary data consists of variables that can take on only two values, typically represented as 0
and 1 (or "yes" and "no", "true" and "false", etc.). Examples include:

• Gender (male/female)
• Response to a yes/no survey question
• Presence/absence of a characteristic or event

Exploratory Techniques:

• Frequency Distribution: Counting the occurrences of each value (0s and 1s).
• Proportion Calculation: Calculating the proportion of 1s (or 0s) in the dataset.
• Visualization: Bar plots or pie charts can visually represent the distribution of binary
variables.

2. Understanding Categorical Data

Categorical data can take on a limited number of distinct values or categories. These values
are often qualitative and do not have a natural ordering. Examples include:

• Marital status (single, married, divorced)


• Education level (high school, college, graduate)
• Type of vehicle (sedan, SUV, truck)

Exploratory Techniques:

• Frequency Distribution: Counting the occurrences of each category.


• Proportion Calculation: Calculating the proportion of each category relative to the
total number of observations.
• Visualization: Bar plots, pie charts, and histograms are commonly used to visualize
categorical data distributions.

Key Analysis Techniques:

• Cross-tabulation: Analyzing relationships between two categorical variables using


contingency tables.
• Chi-square test: Testing for independence between categorical variables.
• Mode: Identifying the most frequent category in a categorical variable.
Example Analysis:

Let's consider an example dataset where we have binary and categorical variables related to
customer satisfaction with a product:

• Binary Variable: Purchased (1 if purchased, 0 if not)


• Categorical Variable: Feedback (values could be 'positive', 'neutral', 'negative')

Analysis Steps:

1. Binary Variable (Purchased):


o Compute the proportion of customers who purchased the product.
o Visualize with a bar plot or pie chart.
2. Categorical Variable (Feedback):
o Calculate the frequency of each feedback category (positive, neutral, negative).
o Visualize with a bar plot to see the distribution of feedback.
3. Cross-tabulation:
o Create a contingency table to analyze the relationship between Purchased and
Feedback.
o Calculate percentages to see if there's a pattern (e.g., do customers who give positive
feedback tend to purchase more?).
4. Chi-square test:
o Perform a chi-square test of independence to determine if there is a statistically
significant association between Purchased and Feedback.
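A minimal Python sketch of these steps, using a tiny made-up customer table (a real chi-square test needs far more observations than this), could be:

import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical customer data; column names and values are illustrative only.
df = pd.DataFrame({
    "Purchased": [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
    "Feedback":  ["positive", "neutral", "positive", "negative", "neutral",
                  "positive", "negative", "positive", "neutral", "negative"],
})

print(df["Purchased"].mean())            # proportion of customers who purchased
print(df["Feedback"].value_counts())     # frequency of each feedback category

table = pd.crosstab(df["Purchased"], df["Feedback"])   # contingency table
chi2, p, dof, expected = chi2_contingency(table)
print(table)
print("chi-square:", chi2, "p-value:", p)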

Conclusion:

Exploring binary and categorical data involves summarizing their distributions, visualizing
their patterns, and analyzing relationships between variables. These techniques help in
gaining insights into the characteristics and behaviors described by such data types, essential
for making informed decisions and drawing meaningful conclusions in various fields of study
and business applications.

Mode

The mode of a dataset is the value that appears most frequently. If there are multiple modes
(where multiple values have the same highest frequency), the dataset is considered
multimodal.

Example: Consider the following dataset representing the number of goals scored by a
football team in 10 matches: {2, 1, 3, 2, 4, 2, 1, 3, 2, 3}. To find the mode:

• Count the frequency of each number:


o 1 occurs 2 times
o 2 occurs 4 times
o 3 occurs 3 times
o 4 occurs 1 time

Therefore, the mode of this dataset is 2 because it appears most frequently.


Expected Value (Mean)

The expected value (mean) of a random variable X is the long-run average value of
repetitions of the experiment it represents.

Example: For a fair six-sided die, each face 1 to 6 occurs with probability 1/6, so
E(X) = (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5.

Probability

Probability P(A)of an event A is a measure of the likelihood that the event will occur. It is
always between 0 (impossible) and 1 (certain).

Example: Consider flipping a fair coin. The probability of getting heads is P(Heads) = 1/2, and the
probability of getting tails is P(Tails) = 1/2.

Summary

• Mode: The mode is the most frequent value in a dataset.


• Expected Value (Mean): The expected value E(X) is the mean or average value of a
random variable X, calculated as a weighted average of all possible outcomes.
• Probability: Probability P(A) of an event A is a measure of the likelihood of that
event occurring.

These concepts are fundamental in statistics and probability theory, providing tools to
describe, analyze, and predict outcomes in various scenarios.

Correlation in Statistics
This section shows how to calculate and interpret correlation coefficients for ordinal
and interval level scales. Methods of correlation summarize the relationship between
two variables in a single number called the correlation coefficient. The correlation
coefficient is usually represented using the symbol r, and it ranges from -1 to +1.

A correlation coefficient quite close to 0, but either positive or negative, implies little
or no relationship between the two variables. A correlation coefficient close to plus 1
means a positive relationship between the two variables, with increases in one of the
variables being associated with increases in the other variable.

A correlation coefficient close to -1 indicates a negative relationship between two


variables, with an increase in one of the variables being associated with a decrease
in the other variable. A correlation coefficient can be produced for ordinal, interval or
ratio level variables, but has little meaning for variables which are measured on a
scale which is no more than nominal.
For ordinal scales, the correlation coefficient can be calculated by using Spearman’s
rho. For interval or ratio level scales, the most commonly used correlation coefficient
is Pearson’s r, ordinarily referred to as simply the correlation coefficient.

What Does Correlation Measure?

In statistics, Correlation studies and measures the direction and extent of


relationship among variables, so correlation measures co-variation, not
causation. Therefore, we should never interpret correlation as implying a cause-and-
effect relation. For example, if there is a correlation between two variables X and
Y, then when the value of one variable changes in one direction, the value of the other
variable is found to change either in the same direction (i.e. a positive change) or in
the opposite direction (i.e. a negative change). Furthermore, if
the correlation exists, it is linear, i.e. we can represent the relative movement of the
two variables by drawing a straight line on graph paper.

Correlation Coefficient
The correlation coefficient, r, is a summary measure that describes the extent of the
statistical relationship between two interval or ratio level variables. The correlation
coefficient is scaled so that it is always between -1 and +1. When r is close to 0,
there is little relationship between the variables; the farther r is from 0, in either the
positive or negative direction, the greater the relationship between the two variables.

The two variables are often given the symbols X and Y. In order to illustrate how the
two variables are related, the values of X and Y are pictured by drawing the scatter
diagram, graphing combinations of the two variables. The scatter diagram is given
first, and then the method of determining Pearson's r is presented. In the following
examples, relatively small sample sizes are used; later, data from larger samples are
given.

Scatter Diagram
A scatter diagram is a diagram that shows the values of two variables X and Y, along
with the way in which these two variables relate to each other. The values of variable
X are given along the horizontal axis, with the values of the variable Y given on the
vertical axis.
Later, when the regression model is used, one of the variables is defined as an
independent variable, and the other is defined as a dependent variable. In
regression, the independent variable X is considered to have some effect or
influence on the dependent variable Y. Correlation methods are symmetric with
respect to the two variables, with no indication of causation or direction of influence
being part of the statistical consideration. A scatter diagram is given in the following
example. The same example is later used to determine the correlation coefficient.

Types of Correlation
The scatter plot explains the correlation between the two attributes or variables. It
represents how closely the two variables are connected. There can be three such
situations to see the relation between the two variables –

• Positive Correlation – when the values of the two variables move in the same direction so
that an increase/decrease in the value of one variable is followed by an
increase/decrease in the value of the other variable.
• Negative Correlation – when the values of the two variables move in the opposite
direction so that an increase/decrease in the value of one variable is followed by a
decrease/increase in the value of the other variable.
• No Correlation – when there is no linear dependence or no relation between the two
variables.
Correlation Formula
Correlation shows the relation between two variables, and the correlation coefficient
measures the strength of that relation. To compare two datasets, we use the
correlation formulas below.

Pearson Correlation Coefficient Formula


The most commonly used formula is the Pearson correlation coefficient, which measures the
linear dependency between two data sets. The value of the coefficient lies between -1 and
+1: a coefficient of zero means the data are considered unrelated, +1 indicates a perfect
positive correlation, and -1 a perfect negative correlation. The formula is:

r = [nΣxy − (Σx)(Σy)] / √([nΣx² − (Σx)²] [nΣy² − (Σy)²])

Where n = Quantity of Information (number of paired observations)

Σx = Total of the First Variable Values

Σy = Total of the Second Variable Values

Σxy = Sum of the Products of the First and Second Values

Σx² = Sum of the Squares of the First Values

Σy² = Sum of the Squares of the Second Values
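
As an illustrative sketch (the function name and sample data below are hypothetical, not from the text), the formula above can be implemented directly from the listed sums:

import math

def pearson_r(x, y):
    """Pearson correlation coefficient via the sum-based formula above."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Small hypothetical check: perfectly linear data gives r = 1.0
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0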

Linear Correlation Coefficient Formula


The linear correlation coefficient is Pearson's r itself, so its formula is the same
sum-based expression given above.

Sample Correlation Coefficient Formula


The formula is given by:

rxy = Sxy / (Sx Sy)

Where Sx and Sy are the sample standard deviations, and Sxy is the sample
covariance.

Population Correlation Coefficient Formula


The population correlation coefficient uses σx and σy as the population standard
deviations and σxy as the population covariance.
ρxy = σxy / (σx σy)
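
A minimal sketch (assuming NumPy is available; the arrays below are hypothetical) showing that the covariance/standard-deviation form agrees with the sum-based formula:

import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])
y = np.array([2.0, 1.0, 5.0, 9.0])

# Sample covariance and sample standard deviations (ddof=1 for both)
s_xy = np.cov(x, y, ddof=1)[0, 1]
s_x, s_y = np.std(x, ddof=1), np.std(y, ddof=1)
print(s_xy / (s_x * s_y))        # r_xy = S_xy / (S_x * S_y)

# Cross-check against NumPy's built-in correlation matrix
print(np.corrcoef(x, y)[0, 1])   # same value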



Correlation Example
Years of Education and Age of Entry to Labour Force: Table 1 gives the number of
years of formal education (X) and the age of entry into the labour force (Y) for 12
males from the Regina Labour Force Survey. Both variables are measured in years,
a ratio level of measurement, which is the highest level of measurement. All of the
males are aged close to 30, so most of them are likely to have completed their
formal education.

Respondent Number    Years of Education, X    Age of Entry into Labour Force, Y
1                    10                       16
2                    12                       17
3                    15                       18
4                    8                        15
5                    20                       18
6                    17                       22
7                    12                       19
8                    15                       22
9                    12                       18
10                   10                       15
11                   8                        18
12                   10                       16

Table 1. Years of Education and Age of Entry into Labour Force for 12 Regina Males
Since most males enter the labour force soon after they leave formal schooling, a
close relationship between these two variables is expected. By looking through the
table, it can be seen that those respondents who obtained more years of schooling
generally entered the labour force at an older age. The mean years of schooling are
x̄ = 12.4 years and the mean age of entry into the labour force is ȳ= 17.8, a
difference of 5.4 years.

This difference roughly reflects the age of entry into formal schooling, that is, age five
or six. It can also be seen that the relationship between years of schooling and age of
entry into the labour force is not perfect. Respondent 11, for example, has only 8 years
of schooling but did not enter the labour force until the age of 18. In contrast,
respondent 5 has 20 years of schooling but entered the labour force at the age of 18.
The scatter diagram provides a quick way of examining the relationship between X and Y.
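
As a hedged illustration (not part of the original worked solution), the pearson_r function sketched earlier, or NumPy's np.corrcoef, can be applied to the Table 1 data to obtain the correlation coefficient for this example:

import numpy as np

# Table 1: years of education (X) and age of entry into labour force (Y)
x = [10, 12, 15, 8, 20, 17, 12, 15, 12, 10, 8, 10]
y = [16, 17, 18, 15, 18, 22, 19, 22, 18, 15, 18, 16]

print(round(np.mean(x), 1), round(np.mean(y), 1))  # 12.4 and 17.8, as in the text
print(np.corrcoef(x, y)[0, 1])                     # Pearson's r for the 12 respondents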
What is a Scatterplot?

A scatterplot is a graph that displays the values of two variables as points on a Cartesian
plane. Each point represents the values of the two variables for a single observation in your
data set. The horizontal axis (x-axis) typically represents one variable, and the vertical axis
(y-axis) represents the other variable.

Purpose of Scatterplots:

1. Identifying Relationships: Scatterplots help you visually identify and understand the
relationship between the two variables. The pattern of points on the plot can indicate
whether there is a positive, negative, or no relationship between the variables.
2. Detecting Outliers: Outlying points that do not fit the general pattern of the data can
be easily spotted on a scatterplot. These outliers may be important for understanding
the data distribution or for indicating unusual observations.
3. Examining Patterns: Scatterplots can reveal underlying patterns such as clusters of
points, trends, or any deviations from expected relationships between variables.

Interpreting Scatterplots:

• Positive Relationship: When the points on the scatterplot generally rise from left to
right, this indicates a positive relationship between the variables.
• Negative Relationship: Conversely, if the points on the scatterplot generally fall
from left to right, this indicates a negative relationship between the variables.
• No Relationship: If the points on the scatterplot appear randomly scattered with no
discernible pattern, this suggests there is no relationship between the variables.

Example:

Imagine you have a data set that includes the number of hours studied and the exam scores of
students. By creating a scatterplot with hours studied on the x-axis and exam scores on the y-
axis, you can quickly see if there’s a relationship between the amount of study time and exam
performance. If there is a positive relationship, you would expect to see points clustering in a
generally upward direction.
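
A minimal sketch of such a scatterplot, assuming matplotlib is available; the study-hours and score values below are hypothetical, used only to illustrate the plot:

import matplotlib.pyplot as plt

# Hypothetical data: hours studied and corresponding exam scores
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 64, 70, 74, 79, 85]

plt.scatter(hours, scores)
plt.xlabel("Hours studied")   # x-axis: first variable
plt.ylabel("Exam score")      # y-axis: second variable
plt.title("Hours Studied vs Exam Score")
plt.show()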

Practical Use:

• Correlation Analysis: Scatterplots are often used in correlation analysis to determine
the strength and direction of the relationship between two variables.
• Model Checking: In regression analysis, scatterplots can help you check assumptions
such as linearity and homoscedasticity (equal variance).

Creating a Scatterplot:

To create a scatterplot, you typically:

1. Choose your variables: Decide which two variables you want to compare.
2. Plot the points: Plot each data point on the graph, with one variable on the x-axis and
the other on the y-axis.
3. Interpret the plot: Analyze the pattern of points to draw conclusions about the
relationship between the variables.

In summary, scatterplots are essential in statistics for exploring and visualizing relationships
between variables, making them a powerful tool for both exploratory data analysis and
hypothesis testing.

Hexagonal binning and contour plots are effective techniques for visualizing the
relationship between two numerical variables. They provide insights into data density and
patterns that may not be apparent in traditional scatterplots. Let's explore both methods and
how they can be used to plot numeric versus numerical data.
Hexagonal Binning
Hexagonal binning divides the data space into hexagonal bins, counts the number of
observations in each bin, and then represents this count using color intensity or shading. It's
particularly useful when dealing with a large number of data points where traditional
scatterplots may suffer from overplotting.

Comparison and Usage


• Hexagonal Binning: Useful for large datasets, as it effectively summarizes data
density and reduces overplotting.

• Contour Plots: Useful for identifying patterns and trends in data distributions,
especially when relationships are non-linear or complex.

Both hexagonal binning and contour plots provide complementary ways to visualize numeric
versus numerical data, offering insights into data density and patterns that aid in exploratory
data analysis and hypothesis generation.
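A brief sketch of both plots, assuming matplotlib, NumPy, and SciPy are available; the simulated data and parameter values are illustrative only. The contour plot here draws lines of equal estimated density, like a topographic map of the point cloud:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Simulated correlated data to stand in for two numerical variables
rng = np.random.default_rng(42)
x = rng.normal(0, 1, 5000)
y = 0.6 * x + rng.normal(0, 1, 5000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Hexagonal binning: colour intensity encodes the count per hexagon
hb = ax1.hexbin(x, y, gridsize=30, cmap="viridis")
fig.colorbar(hb, ax=ax1, label="count")
ax1.set_title("Hexagonal binning")

# Contour plot of a kernel density estimate evaluated on a regular grid
kde = gaussian_kde(np.vstack([x, y]))
xi, yi = np.mgrid[x.min():x.max():100j, y.min():y.max():100j]
zi = kde(np.vstack([xi.ravel(), yi.ravel()])).reshape(xi.shape)
ax2.contour(xi, yi, zi, levels=10, cmap="viridis")
ax2.set_title("Density contours")

plt.show()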

Adjust the parameters such as gridsize for hexagonal binning and cmap for colormaps to
suit your data characteristics and visualization preferences. These plots are versatile and can
be customized further to enhance clarity and interpretability based on your specific analytical
needs.

Two Categorical Variables, Categorical and Numeric Data


When visualizing relationships involving two categorical variables or categorical and
numeric data, different types of plots and techniques can be employed. Let's explore how we
can effectively visualize these scenarios using examples.

1. Visualizing Two Categorical Variables


When you have two categorical variables, you typically want to understand the distribution or
relationship between these variables. Here are some common visualization techniques:
a. Stacked Bar Chart:
A stacked bar chart is useful for showing the composition of one categorical variable across
different categories of another categorical variable.
b. Grouped Bar Chart:
A grouped bar chart compares the values of one categorical variable across different
categories of another categorical variable side-by-side.
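
An illustrative sketch using pandas (the DataFrame and column names below are hypothetical): a crosstab of the two categorical variables can be drawn as either a stacked or a grouped bar chart.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical survey data with two categorical variables
df = pd.DataFrame({
    "department": ["CS", "CS", "IT", "IT", "CS", "IT", "CS", "IT"],
    "result":     ["Pass", "Fail", "Pass", "Pass", "Pass", "Fail", "Pass", "Pass"],
})

counts = pd.crosstab(df["department"], df["result"])

counts.plot(kind="bar", stacked=True, title="Stacked bar chart")   # composition per department
counts.plot(kind="bar", stacked=False, title="Grouped bar chart")  # side-by-side comparison
plt.show()
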
2. Categorical and Numeric Data
When visualizing relationships between a categorical variable and a numeric variable, box
plots and violin plots are commonly used to show the distribution of the numeric variable
across different categories of the categorical variable.
a. Box Plot:
A box plot summarizes the distribution of a numeric variable for different categories of a
categorical variable, showing key statistics such as median, quartiles, and outliers.
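
A minimal sketch of a box plot for a numeric variable across categories, again with hypothetical data; pandas' built-in boxplot (which wraps matplotlib) is assumed:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: exam scores (numeric) grouped by section (categorical)
df = pd.DataFrame({
    "section": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "score":   [65, 70, 72, 58, 61, 66, 80, 85, 78],
})

# One box per category: median, quartiles, whiskers and outliers
df.boxplot(column="score", by="section")
plt.suptitle("")                 # remove pandas' automatic super-title
plt.title("Exam score by section")
plt.show()
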
Theory Questions on Module 1 for Self Study

Q.1. Explain Elements of Structured Data.
Q.2. Explain Rectangular Data.
Q.3. Explain Data Frames and Indexes.
Q.4. Explain Nonrectangular Data Structures.
Q.5. Explain Estimates of Location.
Q.6. Explain Robust Estimates and Estimates of Variability.
Q.7. Explain Standard Deviation and Related Estimates.
Q.8. Explain Estimates Based on Percentiles.
Q.9. Explain Exploring the Data Distribution.
Q.10. Explain Boxplots, Frequency Tables and Histograms.
Q.11. Explain Density Plots and Estimates.
Q.12. Explain Exploring Binary and Categorical Data, and Mode.
Q.13. Explain Correlation and Scatterplots.
Q.14. Explain Exploring Two or More Variables.
Q.15. Explain Hexagonal Binning and Contours (Plotting Numeric Versus Numerical Data).
Q.16. Explain Two Categorical Variables.
Q.17. Explain Categorical and Numeric Data, and Visualizing Multiple Variables.
