Chapter 1 SAIDS
* Theory classes are to be conducted for the full class, and $ indicates the workload of the learner (not faculty). Students can
form groups with a minimum of 2 (two) and not more than 4 (four) members. Faculty load: 1 hour per week per four groups.
Course Code Course Name Credit
Prerequisite: C Programming
2.1 Random Sampling and Sample Bias, Bias, Random Selection, Size Versus
Quality, Sample Mean Versus Population Mean, Selection Bias, Regression to the
Mean, Sampling Distribution of a Statistic, Central Limit Theorem, Standard Error,
The Bootstrap, Resampling Versus Bootstrapping.
2.2 Confidence Intervals, Normal Distribution, Standard Normal and QQ-Plots,
Long-Tailed Distributions, Student's t-Distribution, Binomial Distribution, Chi-
Square Distribution, F-Distribution, Poisson and Related Distributions,
Exponential Distribution, Estimating the Failure Rate, Weibull Distribution.
Self-Study: Create a linear regression model for a dataset and display the error
measures; choose a dataset with categorical data and apply a linear regression
model.
Textbooks:
1. Bruce, Peter, and Andrew Bruce. Practical Statistics for Data Scientists: 50 Essential Concepts. O'Reilly Media,
2017.
2. Rice, John A. Mathematical Statistics and Data Analysis. University of California, Berkeley; Thomson Higher
Education.
References:
1 Dodge, Yadolah, ed. Statistical data analysis and inference. Elsevier, 2014.
2 Ismay, Chester, and Albert Y. Kim. Statistical Inference via Data Science: A Modern Dive into R and the
Tidyverse. CRC Press, 2019.
3. Milton, J. S., and Arnold, J. C. Introduction to Probability and Statistics. Tata McGraw Hill, 4th Edition,
2007.
4. Johnson, R. A., and Gupta, C. B. Miller and Freund's Probability and Statistics for Engineers. Pearson
Education, Asia, 7th Edition, 2007.
5. Chandrasekaran, A., and Kavitha, G. Probability, Statistics, Random Processes and Queuing Theory. Dhanam
Publications, 2014.
Statistics for Artificial Intelligence Data Science
Chapter 1 Exploratory Data Analysis
Q. What is Statistics?
Statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data.
It is used in fields such as science, business, economics, the social sciences, engineering, and
medicine to quantify uncertainty, support decision-making, and explore relationships within
data.
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms,
and systems to extract knowledge and insights from structured and unstructured data. It
combines aspects of mathematics, statistics, computer science, domain knowledge, and
information science to uncover patterns, trends, and correlations that can be used to make
decisions and predictions. A typical data science workflow includes the following stages:
1. Data Collection: Gathering data from various sources, which can include databases,
data lakes, sensors, social media, and more.
2. Data Cleaning and Preparation: Processing and transforming raw data into a usable
format, handling missing values, outliers, and ensuring data quality.
3. Exploratory Data Analysis (EDA): Exploring and visualizing data to understand its
characteristics, identify patterns, trends, and relationships.
4. Data Modeling and Machine Learning: Applying statistical models, machine
learning algorithms, and computational techniques to build predictive models and
make data-driven decisions.
5. Data Interpretation and Visualization: Analyzing model results, interpreting
findings, and communicating insights through visualizations and reports.
6. Deployment and Monitoring: Implementing data-driven solutions in real-world
applications, and monitoring their performance over time to ensure they remain
effective.
Data science techniques are used across various industries and disciplines, including
business, healthcare, finance, marketing, and more. It plays a crucial role in enabling
organizations to derive value from large volumes of data, automate processes, improve
decision-making, and gain a competitive edge.
Elements of structured data refer to the fundamental components that make up data organized
in a structured format, typically within a database or a structured file format like CSV. These
elements include:
1. Fields or Attributes:
o Fields are individual pieces of information that describe a specific aspect or
characteristic of an entity (e.g., person, product, transaction).
o Attributes define the properties of each field, such as data type (integer, string,
date), length, and format.
2. Records or Rows:
o Records represent individual instances or entries within a dataset.
o Each record contains a collection of related fields that together describe a
single entity or event.
3. Tables or Entities:
o Tables organize data into a structured format, where each table represents a
collection of related records.
o Entities refer to the conceptual representations of objects or concepts within a
data model, typically represented by tables in a relational database.
4. Schema:
o A schema defines the structure of the data within a database or dataset.
o It specifies the names of tables, the fields within each table, their data types,
and any constraints or rules that govern the data.
5. Keys:
o Keys are attributes or combinations of attributes that uniquely identify records
within a table.
o Primary keys uniquely identify each record in a table, while foreign keys
establish relationships between tables.
6. Relationships:
o Relationships define how data entities are connected or related to each other
within a database.
o Relationships are established through keys (e.g., primary keys and foreign
keys) that link tables and enable data retrieval and manipulation across related
entities.
7. Constraints:
o Constraints enforce rules and restrictions on the data to ensure data integrity
and consistency.
o Examples include unique constraints (ensuring no duplicate values in a
column), not null constraints (requiring a field to have a value), and referential
integrity constraints (ensuring relationships between tables remain valid).
8. Query Language:
o Structured data is queried and manipulated using structured query languages
(SQL), which provide standardized syntax and commands for interacting with
databases.
o SQL allows users to retrieve, insert, update, and delete data based on specified
criteria and relationships defined in the schema.
These elements collectively define the structured nature of data, facilitating efficient storage,
retrieval, manipulation, and analysis within databases and other structured data formats. They
are essential for ensuring data consistency, accuracy, and usability across various applications
and industries.
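To make the query-language point above concrete, here is a minimal sketch (not from the source text) using Python's built-in sqlite3 module; the students table, its columns, and the values are assumptions for illustration only.

import sqlite3

conn = sqlite3.connect(":memory:")   # temporary in-memory database
cur = conn.cursor()

# Schema: table name, fields, data types, and a primary-key constraint
cur.execute("""
    CREATE TABLE students (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        marks INTEGER
    )
""")

# Insert records (rows)
cur.executemany("INSERT INTO students VALUES (?, ?, ?)",
                [(1, "Asha", 82), (2, "Ravi", 74), (3, "Meena", 91)])

# Retrieve rows that satisfy a condition
cur.execute("SELECT name, marks FROM students WHERE marks >= 80")
print(cur.fetchall())                # [('Asha', 82), ('Meena', 91)]

# Update and delete based on the primary key
cur.execute("UPDATE students SET marks = 78 WHERE id = 2")
cur.execute("DELETE FROM students WHERE id = 3")
conn.commit()
conn.close()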
CSV (Comma-Separated Values) format is a plain-text format used for representing tabular
data. In CSV files, each line represents a row of data, and each field within that row is
separated by a comma (,). This format is commonly used because it is simple, lightweight,
and widely supported by various applications and programming languages.
CSV files are widely used for exchanging data between different applications,
importing/exporting data into/from databases, and for storing structured data that needs to be
easily readable and manipulated by both humans and machines. They are commonly
generated by spreadsheet software like Microsoft Excel or Google Sheets and can be
processed programmatically using various programming languages such as Python, Java.
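As a quick illustration, the following is a minimal sketch of reading and writing a CSV file in Python; the file name sales.csv and its columns product and amount are hypothetical.

import csv
import pandas as pd

# Standard library: each row is returned as a dict keyed by the header row
with open("sales.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row["product"], row["amount"])

# pandas: the whole file becomes a DataFrame (rectangular data)
df = pd.read_csv("sales.csv")
print(df.head())                          # first five rows
df.to_csv("sales_copy.csv", index=False)  # write the data back out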
Rectangular data refers to a structured format where data is organized in rows and columns,
forming a grid-like structure similar to a table or a matrix. This format is also known as
tabular data or two-dimensional data, and it is commonly used in databases, spreadsheets,
CSV files, and data frames in programming languages like Python and R.
1. Rows and Columns: Data is organized into rows (also known as records or
observations) and columns (also known as fields or variables). Each row represents a
single entity or observation, and each column represents a specific attribute or
characteristic of that entity.
2. Tabular Structure: The data is structured in a tabular format where rows and
columns intersect to form cells. Each cell contains a single data value corresponding
to a specific row and column intersection.
3. Homogeneous Format: Rectangular data typically assumes a homogeneous format,
meaning that all rows have the same set of columns (variables) with consistent data
types. This consistency allows for straightforward manipulation and analysis of data.
4. Header Row: Rectangular data often includes a header row at the top that defines the
names of each column or variable. This header row provides context and labels for the
data stored in each column.
5. Structured Querying: Rectangular data can be queried and manipulated using
structured query languages (SQL) in databases or through data manipulation tools and
libraries in programming languages like pandas in Python or data.table in R.
Data Frames:
• Definition: A data frame is a two-dimensional labeled data structure with rows and
columns, similar to a table in a relational database or a spreadsheet. It is a core data
structure used for data manipulation and analysis in statistical computing and data
science.
• Characteristics:
o Tabular Structure: Data frames organize data into rows and columns, where
each column represents a variable or feature, and each row represents an
observation or record.
o Column Names: Each column in a data frame has a name or label, which
allows for easy reference and manipulation of specific variables.
o Homogeneous Columns: All columns in a data frame typically have the same
length and contain data of the same type (e.g., integers, strings, dates).
o Flexible Data Types: Data frames can accommodate various data types within
columns, including numerical data, categorical data, dates, and text.
• Usage:
o Data frames are widely used for data exploration, cleaning, transformation,
analysis, and visualization tasks.
o They provide a convenient way to handle and manipulate structured data,
making it easier to perform operations like filtering rows, selecting columns,
joining datasets, and computing summary statistics.
• Examples:
o In Python, data frames are commonly created and manipulated using the
pandas library (import pandas as pd). Operations such as reading data from
CSV files, performing data aggregation, and plotting data are efficiently
handled with pandas data frames.
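For instance, here is a minimal sketch (with made-up data) of creating and manipulating a pandas data frame:

import pandas as pd

# Create a data frame: each column is a variable, each row an observation
df = pd.DataFrame({
    "name":  ["Asha", "Ravi", "Meena", "John"],
    "dept":  ["CS", "IT", "CS", "IT"],
    "marks": [82, 74, 91, 68],
})

print(df["marks"].mean())                  # column selection + summary statistic
print(df[df["marks"] >= 80])               # row filtering
print(df.groupby("dept")["marks"].mean())  # aggregation by category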
Indexes:
• Definition: An index in the context of data frames refers to a structure that allows for
fast lookup, retrieval, and manipulation of data. It provides a way to uniquely identify
each row in a data frame.
• Characteristics:
o Unique Identification: Each row in a data frame can be associated with a
unique index value, typically integers starting from 0 to n-1 (where n is the
number of rows).
o Immutable: Index values are immutable (cannot be changed), which ensures
consistency and integrity when performing operations that rely on row
identification.
o Labeling: Indexes can also be labeled with meaningful identifiers (e.g.,
timestamps, alphanumeric codes) instead of numeric values.
• Usage:
o Indexes facilitate efficient data retrieval and manipulation operations such as
slicing, selecting subsets of data, merging data frames, and aligning data in
mathematical operations.
o They play a crucial role in maintaining the integrity of data relationships
across different data frames and when performing operations that involve
joining, grouping, or reshaping data.
• Examples:
o In pandas, the index of a data frame can be explicitly specified when creating
the data frame or can be automatically generated. Operations like setting a
column as the index (df.set_index('column_name')) or resetting the index
(df.reset_index()) are common in data manipulation workflows.
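A minimal sketch (made-up data) showing the default integer index, a labeled index, and resetting it:

import pandas as pd

df = pd.DataFrame({"student_id": ["S01", "S02", "S03"],
                   "marks": [82, 74, 91]})

print(df.index)                  # default RangeIndex: 0, 1, 2

df = df.set_index("student_id")  # use a meaningful label as the index
print(df.loc["S02"])             # fast label-based lookup

df = df.reset_index()            # back to the default integer index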
In summary, data frames and indexes are foundational concepts in data analysis and
manipulation, providing structured ways to organize, access, and process data efficiently
across different programming languages and environments.
EXTRA
Pandas is a powerful open-source data analysis and manipulation library for Python. It provides high-
performance, easy-to-use data structures and data analysis tools for working with structured (tabular)
data.
Nonrectangular data structures
Nonrectangular data structures refer to data formats that do not conform to the traditional
tabular (rows and columns) structure typical of relational databases or spreadsheets. These
data structures are more flexible and can accommodate varying data shapes and nested
hierarchies. Here are some common examples of nonrectangular data structures:
1. Key-Value Stores:
• Redis:
o Definition: Redis is an open-source, in-memory data structure store used as a
database, cache, and message broker.
o Characteristics: Redis stores data as key-value pairs, where values can be
complex data structures such as lists, sets, sorted sets, hashes, and more. It is
known for its high performance and flexibility.
2. Document-Oriented Databases:
• MongoDB:
o Definition: MongoDB is a NoSQL document-oriented database that stores
data in JSON-like documents.
o Characteristics: Each document can have a different structure, and fields can
vary from document to document. MongoDB supports nested structures,
arrays, and complex data types, making it suitable for flexible and scalable
applications.
3. Time-Series Databases:
• InfluxDB:
o Definition: InfluxDB is an open-source time series database designed for
high-performance handling of time-stamped data.
o Characteristics: It stores data in a schema-less fashion, optimized for
querying and analyzing time-series data points. Time-series databases are
crucial for applications like financial data analysis.
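To illustrate the contrast with rectangular data, here is a minimal sketch of the kind of nested, JSON-like record a document store such as MongoDB holds; the order document and its fields are hypothetical.

import json

order = {
    "order_id": "A1001",
    "customer": {"name": "Asha", "city": "Mumbai"},
    "items": [
        {"sku": "P-12", "qty": 2, "price": 499.0},
        {"sku": "P-77", "qty": 1, "price": 1299.0},
    ],
    "delivered": False,
}

print(json.dumps(order, indent=2))   # serialize the nested structure to JSON text

# Fields and nesting can vary from document to document, unlike fixed columns
total = sum(item["qty"] * item["price"] for item in order["items"])
print("Order total:", total)         # 2*499 + 1*1299 = 2297.0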
Estimates of Location
Estimates of location, also known as measures of central tendency, are statistical measures
that describe the central or typical value in a dataset. These measures help to summarize the
location of the data distribution and provide insights into its central tendency. Here are three
common estimates of location along with examples:
1. Mean:
• Definition: The mean is the sum of all values in a dataset divided by the number of
observations. It represents the typical value or average of the dataset.
• Example: Suppose five students in a class score 70, 80, 90, 85, and 75 on an exam.
• Calculation: Mean = (70 + 80 + 90 + 85 + 75) / 5 = 400 / 5 = 80.
2. Median:
• Definition: The median is the middle value in a sorted dataset. If the dataset has an
odd number of observations, the median is the middle value. If the dataset has an even
number of observations, the median is the average of the two middle values.
• Example: Consider the following dataset of ages:
• Ages: 25, 30, 35, 40, 45
Since there are 5 observations, the median is the third value (35).
3. Mode:
• Definition: The mode is the value that appears most frequently in a dataset. It
represents the most common value or values.
• Example: Consider a dataset of exam grades:
Grades: A, B, B, C, A, A, B, A
The mode in this dataset is "A", as it appears more frequently than any other grade.
These estimates of location provide valuable insights into the characteristics of a dataset and
are essential tools in descriptive statistics for summarizing data distributions.
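A minimal sketch computing the three estimates of location with Python's standard statistics module, reusing the small example datasets given above:

import statistics

scores = [70, 80, 90, 85, 75]   # exam scores from the mean example
ages = [25, 30, 35, 40, 45]     # ages from the median example
grades = ["A", "B", "B", "C", "A", "A", "B", "A"]   # grades from the mode example

print(statistics.mean(scores))   # 80
print(statistics.median(ages))   # 35
print(statistics.mode(grades))   # 'A'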
Q. Robust Estimates
Robust estimates of location and dispersion are statistical measures that are less sensitive to
outliers or deviations from the typical pattern in a dataset. They are particularly useful when
dealing with data that may contain extreme values or when the underlying distribution is not
strictly normal. Two robust estimates commonly used are the median, a robust estimate of
location that depends only on the middle of the sorted data, and the median absolute deviation
(MAD), a robust estimate of dispersion defined as the median of the absolute deviations of the
data from their median. The trimmed mean, which averages the data after discarding a fixed
percentage of the smallest and largest values, is another widely used robust estimate of location.
Estimates of Variability
Estimates of variability in statistics refer to measures that describe how spread out or dispersed a set
of data points are. There are several common measures used to quantify variability:
1. Range: The range is the simplest measure of variability and is calculated as the
difference between the maximum and minimum values in a dataset. It gives an idea of
the spread of the entire dataset.
2. Variance: The variance measures the average squared deviation of each data point
from the mean of the dataset. It provides a measure of the spread of data points
around the mean. However, the variance is in squared units of the original data.
3. Standard Deviation: The standard deviation is the square root of the variance. It
represents the typical distance between each data point and the mean. Standard
deviation is often preferred over variance because it is in the same units as the original
data.
4. Interquartile Range (IQR): The IQR is a measure of variability based on dividing
the data into quartiles. It is the difference between the 75th percentile (Q3) and the
25th percentile (Q1). IQR is less sensitive to outliers compared to range, variance, and
standard deviation.
5. Coefficient of Variation: The coefficient of variation (CV) is a relative measure of
variability. It is calculated as the standard deviation divided by the mean, expressed as
a percentage. CV allows for comparison of variability between datasets with different
units or scales.
Each of these measures provides different insights into the variability of data. The choice of
which measure to use depends on the nature of the data and the specific question being
addressed in the analysis.
1. Range:
o Example: Consider the following dataset representing the daily temperatures (in
degrees Celsius) in a city over a week: 20,22,19,25,18,23,21 ,
o Calculation: Range = Maximum value - Minimum value
▪ Maximum value = 25
▪ Minimum value = 18
▪ Range = 25 - 18 = 7
o Interpretation: The range of temperatures over the week is 7 degrees Celsius,
indicating the extent of variability in daily temperatures.
2. Variance:
o Example: Let's use the same dataset of daily temperatures.
o Calculation: the mean temperature is 148 / 7 ≈ 21.14 °C; the sum of squared deviations
from the mean is ≈ 34.86, so the sample variance is 34.86 / (7 − 1) ≈ 5.81.
3. Standard Deviation:
o Example: Using the same dataset of daily temperatures.
o Calculation: the standard deviation is the square root of the variance, √5.81 ≈ 2.41
degrees Celsius.
4. Interquartile Range (IQR):
o Example: Consider the following dataset representing the scores of a class in a math
test: 65, 70, 72, 75, 80, 85, 90, 95.
o Calculation:
▪ First Quartile (Q1) = 71 (the median of the lower half 65, 70, 72, 75)
▪ Third Quartile (Q3) = 87.5 (the median of the upper half 80, 85, 90, 95)
▪ IQR = Q3 - Q1 = 87.5 - 71 = 16.5
These examples illustrate how different measures of variability can be calculated and
interpreted using real-world datasets. Each measure provides valuable insights into the spread
or dispersion of data points within a dataset.
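The same measures can be computed in Python; the following is a minimal sketch using the standard statistics module, with the IQR computed by the same median-of-each-half rule as the worked example above.

import statistics

temps = [20, 22, 19, 25, 18, 23, 21]
print(max(temps) - min(temps))     # range = 7
print(statistics.variance(temps))  # sample variance ≈ 5.81
print(statistics.stdev(temps))     # sample standard deviation ≈ 2.41

scores = [65, 70, 72, 75, 80, 85, 90, 95]   # already sorted
lower, upper = scores[:4], scores[4:]
q1 = statistics.median(lower)      # 71.0
q3 = statistics.median(upper)      # 87.5
print(q3 - q1)                     # IQR = 16.5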
Estimates Based on Percentiles
1. Understanding Percentiles: Percentiles divide a dataset into hundred equal parts. For
example, the 50th percentile (also known as the median) divides the data into two
halves, with half the data points below and half above this value.
2. Using Percentiles for Estimates:
o Central Tendency: The median (50th percentile) is often used as a measure of
central tendency when the data is skewed or has outliers. It gives an estimate
of the middle value of the dataset.
o Spread or Variability: Percentiles can also help estimate the spread of the
data. For instance, the interquartile range (difference between the 75th and
25th percentiles) can provide a measure of how spread out the middle portion
of the data is.
o Outliers and Extremes: Percentiles can be useful for identifying outliers or
extreme values in the dataset. For example, the 95th percentile indicates the
value below which 95% of the data falls.
3. Estimating Values:
o Estimating Specific Percentiles: If you know the percentile you're interested
in (e.g., the 90th percentile), you can estimate the corresponding value in the
dataset. This can be useful in various applications such as finance (estimating
high-percentile risk) or healthcare (estimating high-percentile patient wait
times).
o Forecasting and Planning: Businesses often use percentile estimates for
forecasting demand or planning resources. For example, estimating the 95th
percentile of sales volumes can help ensure sufficient inventory levels during
peak periods.
4. Practical Considerations:
o Data Quality: The accuracy of percentile estimates depends on the quality
and representativeness of the data sample.
o Interpretation: Percentiles should be interpreted in context. For example, the
25th percentile may indicate a low value, but it doesn’t necessarily mean it’s a
"low" value in an absolute sense without understanding the distribution of the
data.
In summary, estimates based on percentiles provide valuable insights into different aspects of
a dataset, from central tendency to variability and extremes. They are widely used in
statistics, finance, healthcare, and other fields where understanding distribution
characteristics is important for decision-making.let's walk through an example to illustrate
how estimates based on percentiles work:
Imagine you're analyzing wait times for patients at a hospital's emergency department. You
have data on the time each patient waits before being seen by a doctor. Here’s how you can
use percentiles to estimate and interpret different aspects of these wait times:
1. Dataset: You have wait time data for 100 patients (in minutes): [10, 15, 20, 25, 30, …, 120].
2. Percentile Calculation:
o Median (50th Percentile): To find the median, sort the data and find the
middle value. Here, with 100 data points, the median is the average of the 50th
and 51st sorted values.
o 75th Percentile: To find the 75th percentile, 75% of the patients waited this
amount of time or less. With 100 data points, the 75th percentile corresponds
to the 75th value:
75th Percentile=75th value=75 minutes
This indicates that 75% of patients waited 75 minutes or less.
o 90th Percentile: To find the 90th percentile, 90% of the patients waited this
amount of time or less. With 100 data points, the 90th percentile corresponds
to the 90th value:
90th Percentile=90th value=90 minutes
This means 90% of patients waited 90 minutes or less.
In this example, percentiles provide actionable insights into the distribution of wait times,
helping stakeholders make informed decisions and improve service delivery in the hospital
setting.
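A minimal sketch of estimating such percentiles with NumPy; the 20 wait times below are hypothetical stand-ins for the hospital data.

import numpy as np

waits = np.array([12, 18, 22, 25, 30, 34, 41, 45, 52, 60,
                  63, 70, 75, 78, 85, 90, 95, 110, 115, 120])

print(np.percentile(waits, 50))   # median wait time
print(np.percentile(waits, 75))   # 75% of patients waited this long or less
print(np.percentile(waits, 90))   # a planning value for staffing decisions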
Exploring a data distribution involves analyzing its shape, central tendency, spread, and any
patterns or outliers present. Here's a step-by-step guide on how to explore a data distribution:
1. Visual Inspection:
o Histogram: Plot a histogram to visualize the frequency distribution of the data. This
provides an overview of how data points are distributed across different bins or
intervals.
o Box Plot: Construct a box plot (box-and-whisker plot) to identify the median,
quartiles, and potential outliers in the data. This plot gives a quick snapshot of the
data's spread and central tendency.
2. Measure of Central Tendency:
o Mean: Calculate the arithmetic mean to understand the average value of the dataset.
This is sensitive to outliers.
o Median: Find the median to determine the middle value of the dataset, which is less
affected by extreme values compared to the mean.
o Mode: Identify the mode, or most frequent value, if applicable, to understand the
peak of the distribution.
3. Measure of Spread:
o Range: Compute the range (difference between the maximum and minimum values)
to get an idea of the spread of the data.
o Interquartile Range (IQR): Calculate the IQR (difference between the 75th and
25th percentiles) to measure the spread of the middle 50% of the data, which is
resistant to outliers.
o Standard Deviation: Compute the standard deviation to quantify the average
deviation of data points from the mean. It provides a measure of how spread out the
data is.
4. Shape of the Distribution:
o Skewness: Assess the skewness of the distribution to understand its symmetry. A
skewness value close to zero indicates a symmetric distribution, while positive or
negative values indicate skewness towards the right (positive) or left (negative).
o Kurtosis: Evaluate the kurtosis to understand the peakedness or flatness of the
distribution relative to a normal distribution. High kurtosis indicates a more peaked
distribution, while low kurtosis indicates a flatter distribution.
5. Identifying Outliers:
o Box Plot: Use the box plot to identify potential outliers, which are data points that
significantly differ from other observations in the dataset.
o Z-Score: Calculate the Z-score of data points to identify outliers based on their
deviation from the mean.
6. Data Transformation:
o Normalization: If the data is not normally distributed, consider transforming it (e.g.,
logarithmic transformation) to achieve normality, which can be useful for certain
statistical analyses.
7. Contextual Understanding:
o Consider the context of your data and the underlying processes generating it.
Real-world data often exhibit complex patterns that might not fit simple
theoretical distributions.
When exploring data distributions, it's important to combine multiple approaches for a
comprehensive understanding and to ensure robustness in subsequent analyses or modelling
efforts.
Let's apply these steps to a hypothetical dataset of exam scores (out of 100):
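The following is a minimal sketch of that walkthrough, assuming a made-up set of 20 scores and using pandas; the numbers are illustrative, not data from the text.

import pandas as pd

scores = pd.Series([45, 52, 58, 61, 64, 66, 68, 70, 72, 73,
                    75, 76, 78, 80, 82, 85, 88, 91, 95, 98])

print(scores.describe())   # count, mean, std, min, quartiles, max
print(scores.median())     # central tendency resistant to outliers
print(scores.skew())       # symmetry of the distribution
print(scores.kurt())       # peakedness relative to a normal distribution

# Flag potential outliers with the 1.5 * IQR rule
iqr = scores.quantile(0.75) - scores.quantile(0.25)
low = scores.quantile(0.25) - 1.5 * iqr
high = scores.quantile(0.75) + 1.5 * iqr
print(scores[(scores < low) | (scores > high)])   # potential outliers, if any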
Box Plot (Box and Whisker Plot)
Definition
The method to summarize a set of data that is measured using an interval scale is
called a box and whisker plot. Box plots are among the most widely used tools in data
analysis. We use this type of graphical representation to understand:
• Distribution Shape
• Central Value of it
• Variability of it
A box plot is a chart that shows data from a five-number summary including one of
the measures of central tendency. It does not show the distribution in particular as
much as a stem and leaf plot or histogram does. But it is primarily used to indicate a
distribution is skewed or not and if there are potential unusual observations (also
called outliers) present in the data set. Boxplots are also very beneficial when large
numbers of data sets are involved or compared.
In simple words, we can define the box plot in terms of descriptive statistics related
concepts. That means box or whiskers plot is a method used for depicting groups of
numerical data through their quartiles graphically. These may also have some lines
extending from the boxes or whiskers which indicates the variability outside the lower
and upper quartiles, hence the terms box-and-whisker plot and box-and-whisker
diagram. Outliers can be indicated as individual points.
It helps to find out how much the data values vary or spread out with the help of
graphs. As we need more information than just knowing the measures of central
tendency, this is where the box plot helps. This also takes less space. It is also a
type of pictorial representation of data.
Since the centre, spread and overall range are immediately apparent, box plots make it
easy to compare distributions.
A box plot is built from the five-number summary:
Minimum: the smallest data value (excluding any outliers).
First Quartile (Q1): The first quartile is the median of the lower half of the data set.
Median: The median is the middle value of the dataset, which divides the given
dataset into two equal parts. The median is considered the second quartile.
Third Quartile (Q3): The third quartile is the median of the upper half of the data.
Maximum: the largest data value (excluding any outliers).
Apart from these five terms, the other terms used in the box plot are:
Interquartile Range (IQR): The difference between the third quartile and first
quartile is known as the interquartile range. (i.e.) IQR = Q3-Q1
Outlier: The data that falls on the far left or right side of the ordered data is tested to
be the outliers. Generally, the outliers fall more than the specified distance from the
first and third quartile.
(i.e.) Outliers are values greater than Q3 + (1.5 × IQR) or less than Q1 − (1.5 × IQR).
Boxplot Distribution
The box plot distribution will explain how tightly the data is grouped, how the data is
skewed, and also about the symmetry of data.
Positively Skewed: If the distance from the median to the maximum is greater than
the distance from the median to the minimum, then the box plot is positively skewed.
Negatively Skewed: If the distance from the median to minimum is greater than the
distance from the median to the maximum, then the box plot is negatively skewed.
Symmetric: The box plot is said to be symmetric if the median is equidistant from
the maximum and minimum values.
• the ends of the box are the upper and lower quartiles so that the box crosses the
interquartile range
• a vertical line inside the box marks the median
• the two lines outside the box are the whiskers extending to the highest and lowest
observations.
Applications
Box plots are used to compare distributions, judge skewness, and flag potential outliers at a glance.
Example: Find the maximum, minimum, median, first quartile and third quartile for the given data
set: 23, 42, 12, 10, 15, 14, 9.
Solution:
Arranging the data in ascending order: 9, 10, 12, 14, 15, 23, 42.
Hence,
Minimum = 9
Maximum = 42
Median = 14 (the middle value)
First Quartile (Q1) = 10 (the median of the lower half 9, 10, 12)
Third Quartile (Q3) = 23 (the median of the upper half 15, 23, 42)
PERCENTILE
Example: to find the 25th percentile of a sorted dataset with n = 11 values, first identify the
position of that percentile:
Position = 0.25 × 11 = 2.75
Here, 2.75 is the position in the sorted dataset where the 25th percentile falls. Since we need a
whole-number rank, we round 2.75 up to 3, so the 25th percentile is the third value in the
sorted dataset.
Frequency tables and histograms are two common tools used in statistics to summarize and
visualize the distribution of a dataset. Let's explore each of them in detail:
Frequency Table
A frequency table is a summary of the data showing the number of occurrences (frequency)
of each distinct value or range of values in a dataset. It's particularly useful for categorical
and discrete numerical data. Here’s how you construct a frequency table:
1. Identify Data Values: List down all unique values present in the dataset.
2. Count Frequencies: Count how many times each unique value appears in the dataset.
3. Organize Data: Arrange the values in ascending or descending order along with their
corresponding frequencies.
Consider the following dataset representing the number of books read by students in a month:
{2,3,1,4,2,3,5,1,3,2,4,3,2,3,4,2,1,3,4,5}
• Values: 1, 2, 3, 4, 5
• Frequencies:
o 1: 3 times
o 2: 5 times
o 3: 6 times
o 4: 4 times
o 5: 2 times
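A minimal sketch building this frequency table with Python's collections.Counter (pandas value_counts would work equally well):

from collections import Counter

books = [2, 3, 1, 4, 2, 3, 5, 1, 3, 2, 4, 3, 2, 3, 4, 2, 1, 3, 4, 5]
freq = Counter(books)

for value in sorted(freq):
    print(value, freq[value])
# 1 3
# 2 5
# 3 6
# 4 4
# 5 2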
Histogram
A histogram represents the frequency distribution of numerical data by grouping values into
intervals (bins) and drawing a bar for each bin. To construct a histogram:
1. Determine Number of Bins: Divide the range of the data into intervals (bins). The
number of bins depends on the dataset size and the desired level of detail.
2. Count Data Points: Count how many data points fall into each bin.
3. Plot the Histogram: Draw a bar for each bin where the height of the bar represents
the frequency of data points in that bin.
Histogram Graph
Uncle Bruno owns a garden with 30 black cherry trees. Each tree is of a
different height. The height of the trees (in inches): 61, 63, 64, 66, 68, 69,
71, 71.5, 72, 72.5, 73, 73.5, 74, 74.5, 76, 76.2, 76.5, 77, 77.5, 78, 78.5, 79,
79.2, 80, 81, 82, 83, 84, 85, 87. We can group the data as follows in
a frequency distribution table by setting a range:
Height range (inches)    Number of trees
61 - 65                  3
66 - 70                  3
71 - 75                  8
76 - 80                  10
81 - 85                  5
86 - 90                  1
This data can be now shown using a histogram. We need to make sure that
while plotting a histogram, there shouldn’t be any gaps between the bars.
How to Make a Histogram?
Choose class intervals that cover the full range of the data, count how many observations fall
into each interval, and draw one bar per interval with its height equal to that count, leaving no
gaps between adjacent bars. For example, a frequency table such as
Class interval    Frequency
65 - 70           4
70 - 75           10
75 - 80           8
80 - 85           4
would be drawn as four adjacent bars of heights 4, 10, 8 and 4.
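As a sketch, the cherry-tree heights above can be plotted as a histogram with matplotlib; the bin edges are chosen at the half-inch marks so the bar counts match the frequency table.

import matplotlib.pyplot as plt

heights = [61, 63, 64, 66, 68, 69, 71, 71.5, 72, 72.5, 73, 73.5, 74, 74.5,
           76, 76.2, 76.5, 77, 77.5, 78, 78.5, 79, 79.2, 80, 81, 82, 83,
           84, 85, 87]

bins = [60.5, 65.5, 70.5, 75.5, 80.5, 85.5, 90.5]   # class boundaries
plt.hist(heights, bins=bins, edgecolor="black")     # adjacent bars, no gaps
plt.xlabel("Height (inches)")
plt.ylabel("Number of trees")
plt.title("Heights of 30 black cherry trees")
plt.show()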
Density plots and density estimates are valuable tools in statistics and data visualization for
understanding the distribution of data. Here's a detailed explanation of density plots and
estimates:
Density Plot
A density plot is a smoothed version of a histogram: instead of counting observations in
discrete bins, it estimates a continuous curve (usually by kernel density estimation) showing
where values are concentrated. The area under the curve equals 1, so the vertical axis shows
estimated probability density rather than raw counts.
Density Estimate
Density estimation is the broader statistical process of estimating the probability density
function of a random variable from a sample of data points. Key aspects of density estimation
include:
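A minimal sketch (with simulated data) of a density plot: a histogram with a kernel density estimate overlaid, using scipy's gaussian_kde.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

data = np.random.default_rng(0).normal(loc=70, scale=10, size=200)

# Histogram scaled to a density so both curves share the same vertical axis
plt.hist(data, bins=20, density=True, alpha=0.4, edgecolor="black")

kde = gaussian_kde(data)                 # smooth estimate of the density
xs = np.linspace(data.min(), data.max(), 200)
plt.plot(xs, kde(xs))
plt.xlabel("Value")
plt.ylabel("Estimated density")
plt.show()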
Binary data consists of variables that can take on only two values, typically represented as 0
and 1 (or "yes" and "no", "true" and "false", etc.). Examples include:
• Gender (male/female)
• Response to a yes/no survey question
• Presence/absence of a characteristic or event
Exploratory Techniques:
• Frequency Distribution: Counting the occurrences of each value (0s and 1s).
• Proportion Calculation: Calculating the proportion of 1s (or 0s) in the dataset.
• Visualization: Bar plots or pie charts can visually represent the distribution of binary
variables.
Categorical data can take on a limited number of distinct values or categories. These values
are often qualitative and do not have a natural ordering. Examples include product categories,
blood groups, city names, and customer segments.
Exploratory Techniques:
• Frequency Counts: Tabulating how many observations fall into each category.
• Proportions: Expressing each category's count as a share of the total.
• Visualization: Bar charts for single variables, and cross-tabulations (contingency tables)
for examining relationships between two categorical variables.
Let's consider an example dataset where we have binary and categorical variables related to
customer satisfaction with a product:
Analysis Steps:
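A minimal sketch of these steps, assuming a hypothetical dataset with a binary column satisfied (1 = satisfied, 0 = not) and a categorical column region:

import pandas as pd

df = pd.DataFrame({
    "satisfied": [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
    "region":    ["North", "South", "North", "East", "South",
                  "East", "North", "East", "South", "North"],
})

print(df["satisfied"].value_counts())              # frequency distribution of 0s and 1s
print(df["satisfied"].mean())                      # proportion satisfied (0.7 here)
print(df["region"].value_counts())                 # counts per category
print(pd.crosstab(df["region"], df["satisfied"]))  # relationship between the two

df["region"].value_counts().plot(kind="bar")       # simple visualization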
Conclusion:
Exploring binary and categorical data involves summarizing their distributions, visualizing
their patterns, and analyzing relationships between variables. These techniques help in
gaining insights into the characteristics and behaviors described by such data types, essential
for making informed decisions and drawing meaningful conclusions in various fields of study
and business applications.
Mode
The mode of a dataset is the value that appears most frequently. If there are multiple modes
(where multiple values have the same highest frequency), the dataset is considered
multimodal.
Example: Consider the following dataset representing the number of goals scored by a
football team in 10 matches: {2, 1, 3, 2, 4, 2, 1, 3, 2, 3}. To find the mode, count how often
each value occurs: 2 occurs four times, 3 occurs three times, 1 occurs twice, and 4 occurs
once, so the mode is 2.
Expected Value
The expected value (mean) of a random variable X is the long-run average value over many
repetitions of the experiment it represents.
Example: for a fair six-sided die, E(X) = (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5.
Probability
Probability P(A) of an event A is a measure of the likelihood that the event will occur. It is
always between 0 (impossible) and 1 (certain).
Example: Consider flipping a fair coin. The probability of getting heads is P(Heads) = 1/2 and
the probability of getting tails is P(Tails) = 1/2.
Summary
These concepts are fundamental in statistics and probability theory, providing tools to
describe, analyze, and predict outcomes in various scenarios.
Correlation in Statistics
This section shows how to calculate and interpret correlation coefficients for ordinal
and interval level scales. Methods of correlation summarize the relationship between
two variables in a single number called the correlation coefficient. The correlation
coefficient is usually represented using the symbol r, and it ranges from -1 to +1.
A correlation coefficient quite close to 0, whether positive or negative, implies little
or no relationship between the two variables. A correlation coefficient close to +1
means a strong positive relationship, with increases in one variable associated with
increases in the other, while a coefficient close to −1 means a strong negative
relationship, with increases in one variable associated with decreases in the other.
Correlation Coefficient
The correlation coefficient, r, is a summary measure that describes the extent of the
statistical relationship between two interval or ratio level variables. The correlation
coefficient is scaled so that it is always between -1 and +1. When r is close to 0 this
means that there is little relationship between the variables and the farther away
from 0 r is, in either the positive or negative direction, the greater the relationship
between the two variables.
The two variables are often given the symbols X and Y. In order to illustrate how the
two variables are related, the values of X and Y are pictured by drawing the scatter
diagram, graphing combinations of the two variables. The scatter diagram is given
first, and then the method of determining Pearson’s r is presented. From the
following examples, relatively small sample sizes are given. Later, data from larger
samples are given.
Scatter Diagram
A scatter diagram is a diagram that shows the values of two variables X and Y, along
with the way in which these two variables relate to each other. The values of variable
X are given along the horizontal axis, with the values of the variable Y given on the
vertical axis.
Later, when the regression model is used, one of the variables is defined as an
independent variable, and the other is defined as a dependent variable. In
regression, the independent variable X is considered to have some effect or
influence on the dependent variable Y. Correlation methods are symmetric with
respect to the two variables, with no indication of causation or direction of influence
being part of the statistical consideration. A scatter diagram is given in the following
example. The same example is later used to determine the correlation coefficient.
Types of Correlation
The scatter plot explains the correlation between the two attributes or variables. It
represents how closely the two variables are connected. There can be three such
situations to see the relation between the two variables –
• Positive Correlation – when the values of the two variables move in the same direction so
that an increase/decrease in the value of one variable is followed by an
increase/decrease in the value of the other variable.
• Negative Correlation – when the values of the two variables move in the opposite
direction so that an increase/decrease in the value of one variable is followed by
decrease/increase in the value of the other variable.
• No Correlation – when there is no linear dependence or no relation between the two
variables.
Correlation Formula
Correlation shows the relation between two variables. Correlation coefficient shows
the measure of correlation. To compare two datasets, we use the correlation
formulas.
rxy = Sxy / (Sx Sy)
where Sx and Sy are the sample standard deviations of X and Y, and Sxy is the sample
covariance of X and Y.
Correlation Example
Years of Education and Age of Entry to Labour Force. Table 1 gives the number of
years of formal education (X) and the age of entry into the labour force (Y) for 12
males from the Regina Labour Force Survey. Both variables are measured in years,
a ratio level of measurement and the highest level of measurement. All of the males
are aged close to 30, so that most of these males are likely to have completed their
formal education.
Respondent    Years of Education (X)    Age of Entry into Labour Force (Y)
1             10                        16
2             12                        17
3             15                        18
4             8                         15
5             20                        18
6             17                        22
7             12                        19
8             15                        22
9             12                        18
10            10                        15
11            8                         18
12            10                        16
Table 1. Years of Education and Age of Entry into Labour Force for 12 Regina
Males
Since most males enter the labour force soon after they leave formal schooling, a
close relationship between these two variables is expected. By looking through the
table, it can be seen that those respondents who obtained more years of schooling
generally entered the labour force at an older age. The mean years of schooling are
x̄ = 12.4 years and the mean age of entry into the labour force is ȳ= 17.8, a
difference of 5.4 years.
This difference roughly reflects the age of entry into formal schooling,
that is, age five or six. It can be seen through that the relationship
between years of schooling and age of entry into the labour force is
not perfect. Respondent 11, for example, has only 8 years of
schooling but did not enter the labour force until the age of 18. In
contrast, respondent 5 has 20 years of schooling but entered the
labour force at the age of 18. The scatter diagram provides a quick
way of examining the relationship between X and Y.
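As a cross-check, Pearson's r for the Table 1 data can be computed directly from the formula rxy = Sxy / (Sx Sy) given above; the minimal NumPy sketch below prints a value of roughly 0.62.

import numpy as np

x = np.array([10, 12, 15, 8, 20, 17, 12, 15, 12, 10, 8, 10])    # years of education
y = np.array([16, 17, 18, 15, 18, 22, 19, 22, 18, 15, 18, 16])  # age of entry

s_xy = np.cov(x, y, ddof=1)[0, 1]            # sample covariance Sxy
r = s_xy / (x.std(ddof=1) * y.std(ddof=1))   # divide by Sx and Sy
print(round(r, 2))                           # roughly 0.62

print(np.corrcoef(x, y)[0, 1])               # the same value, computed directly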
What is a Scatterplot?
A scatterplot is a graph that displays the values of two variables as points on a Cartesian
plane. Each point represents the values of the two variables for a single observation in your
data set. The horizontal axis (x-axis) typically represents one variable, and the vertical axis
(y-axis) represents the other variable.
Purpose of Scatterplots:
1. Identifying Relationships: Scatterplots help you visually identify and understand the
relationship between the two variables. The pattern of points on the plot can indicate
whether there is a positive, negative, or no relationship between the variables.
2. Detecting Outliers: Outlying points that do not fit the general pattern of the data can
be easily spotted on a scatterplot. These outliers may be important for understanding
the data distribution or for indicating unusual observations.
3. Examining Patterns: Scatterplots can reveal underlying patterns such as clusters of
points, trends, or any deviations from expected relationships between variables.
Interpreting Scatterplots:
• Positive Relationship: When the points on the scatterplot generally rise from left to
right, this indicates a positive relationship between the variables.
• Negative Relationship: Conversely, if the points on the scatterplot generally fall
from left to right, this indicates a negative relationship between the variables.
• No Relationship: If the points on the scatterplot appear randomly scattered with no
discernible pattern, this suggests there is no relationship between the variables.
Example:
Imagine you have a data set that includes the number of hours studied and the exam scores of
students. By creating a scatterplot with hours studied on the x-axis and exam scores on the y-
axis, you can quickly see if there’s a relationship between the amount of study time and exam
performance. If there is a positive relationship, you would expect to see points clustering in a
generally upward direction.
Practical Use: Scatterplots are routinely used in exploratory data analysis and as a first check
before fitting a regression model, since they show whether a roughly linear relationship is
plausible and whether outliers might distort it.
Creating a Scatterplot:
1. Choose your variables: Decide which two variables you want to compare.
2. Plot the points: Plot each data point on the graph, with one variable on the x-axis and
the other on the y-axis.
3. Interpret the plot: Analyze the pattern of points to draw conclusions about the
relationship between the variables.
In summary, scatterplots are essential in statistics for exploring and visualizing relationships
between variables, making them a powerful tool for both exploratory data analysis and
hypothesis testing.
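A minimal sketch (with made-up study data) of drawing such a scatterplot in matplotlib:

import matplotlib.pyplot as plt

hours = [1, 2, 2.5, 3, 4, 4.5, 5, 6, 7, 8]       # hours studied (x-axis)
score = [52, 55, 60, 58, 65, 70, 72, 78, 85, 90]  # exam scores (y-axis)

plt.scatter(hours, score)
plt.xlabel("Hours studied")
plt.ylabel("Exam score")
plt.title("Study time versus exam performance")
plt.show()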
Hexagonal binning and contour plots are effective techniques for visualizing the
relationship between two numerical variables. They provide insights into data density and
patterns that may not be apparent in traditional scatterplots. Let's explore both methods and
how they can be used to plot one numerical variable against another.
Hexagonal Binning
Hexagonal binning divides the data space into hexagonal bins, counts the number of
observations in each bin, and then represents this count using color intensity or shading. It's
particularly useful when dealing with a large number of data points where traditional
scatterplots may suffer from overplotting.
Contour Plots
Contour plots estimate the two-dimensional density of the data and draw contour lines
connecting regions of equal density, much like elevation lines on a map. They are useful for
identifying patterns and trends in data distributions, especially when relationships are
non-linear or complex.
Both hexagonal binning and contour plots provide complementary ways to visualize one
numerical variable against another, offering insights into data density and patterns that aid in
exploratory data analysis and hypothesis generation.
Adjust the parameters such as gridsize for hexagonal binning and cmap for colormaps to
suit your data characteristics and visualization preferences. These plots are versatile and can
be customized further to enhance clarity and interpretability based on your specific analytical
needs.
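A minimal sketch (with simulated data) showing both plots side by side; gridsize and cmap below are the matplotlib hexbin parameters referred to above, and the contours are drawn from a kernel density estimate.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
x = rng.normal(size=5000)
y = 0.6 * x + rng.normal(scale=0.8, size=5000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Hexagonal binning: colour intensity shows how many points fall in each hexagon
hb = ax1.hexbin(x, y, gridsize=30, cmap="Blues")
fig.colorbar(hb, ax=ax1, label="count")
ax1.set_title("Hexagonal binning")

# Contour plot of an estimated two-dimensional density
kde = gaussian_kde(np.vstack([x, y]))
xs, ys = np.meshgrid(np.linspace(x.min(), x.max(), 100),
                     np.linspace(y.min(), y.max(), 100))
zs = kde(np.vstack([xs.ravel(), ys.ravel()])).reshape(xs.shape)
ax2.contour(xs, ys, zs)
ax2.set_title("Density contours")

plt.show()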