FDSA Unit-2
The frequency of a value is the number of times it occurs in a dataset. A frequency distribution is the pattern of frequencies
of a variable. It’s the number of times each possible value of a variable occurs in a dataset.
A gardener sets up a bird feeder in their backyard. To help decide how much and what type of birdseed to buy, they
record the bird species that visit the feeder. Over the course of one morning, the following birds visit their feeder:
2.1 Frequency distributions
A sociologist conducted a survey of 20 adults. She wants to report the frequency distribution of the ages of the survey
respondents. The respondents were the following ages in years:
52, 34, 32, 29, 63, 40, 46, 54, 36, 36, 24, 19, 45, 20, 28, 29, 38, 33, 49, 37
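A minimal sketch of how the ages above could be tallied into a grouped frequency distribution. The class-interval width of 10 years is an assumption made for illustration; the survey itself does not prescribe one.

```python
from collections import Counter

ages = [52, 34, 32, 29, 63, 40, 46, 54, 36, 36,
        24, 19, 45, 20, 28, 29, 38, 33, 49, 37]

WIDTH = 10  # assumed class-interval width (not given in the notes)

# Map each age to the lower bound of its interval, e.g. 34 -> 30
counts = Counter((age // WIDTH) * WIDTH for age in ages)

# Frequency table as (interval label, frequency) pairs, lowest interval first
freq_table = [(f"{low}-{low + WIDTH - 1}", counts[low]) for low in sorted(counts)]

for label, freq in freq_table:
    print(f"{label}: {freq}")
```

The frequencies should add up to the number of respondents (20), which is a quick way to check the table.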
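import matplotlib.pyplot as plt
# Sample data; 90 lies far from the other values
x = [60, 64, 66, 70, 72, 72, 74, 90]
# Box plot: points beyond the whiskers are drawn as separate markers
plt.boxplot(x)
plt.show()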
● Variability in measurement: Maybe the instrument used to collect data wasn't calibrated properly.
● Novel data: It could represent a genuine new discovery that doesn't fit the current understanding.
● Error: Sometimes, mistakes happen during data collection or analysis.
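Before deciding which of these causes applies, it helps to identify the outliers numerically. The sketch below applies the 1.5 × IQR rule to the box-plot data; the rule is a common convention and is assumed here because it matches the default whisker length of a matplotlib box plot.

```python
import statistics

x = [60, 64, 66, 70, 72, 72, 74, 90]

# Quartiles with linear interpolation between data points
q1, _median, q3 = statistics.quantiles(x, n=4, method="inclusive")
iqr = q3 - q1

# Fences: values outside [lower_fence, upper_fence] are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in x if v < lower_fence or v > upper_fence]
print("Q1 =", q1, "Q3 =", q3, "IQR =", iqr, "outliers =", outliers)
```

A flagged value is only a candidate outlier; whether it is an error or a genuine observation still has to be judged from context.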
2.3 Interpreting distributions
A data distribution shows how data points are spread across the possible values of a variable.
It helps us understand and predict data, make informed decisions, and choose the right analysis methods.
Types of Distributions
1. Discrete Distributions: These describe data that can only take on certain distinct values, like the number of heads in 5 coin
flips (0, 1, 2, 3, 4, or 5). Some common discrete distributions include:
Bernoulli distribution: Describes the probability of success or failure in a single trial (e.g., coin toss).
Binomial distribution: Describes the probability of getting a certain number of successes in a fixed number of trials
(e.g., getting 3 heads in 5 coin flips).
Poisson distribution: Describes the probability of a certain number of events occurring in a fixed interval of time or
space (e.g., the number of customers arriving at a store in an hour).
2. Continuous Distributions: These describe data that can take on any value within a continuous range, like the height of
people (any value between 0 and, say, 3 meters). Some common continuous distributions include:
Normal distribution (bell curve): Describes data that is symmetric and bell-shaped, with most values clustered
around the mean (e.g., heights of people).
Uniform distribution: Describes data where all values within a certain range are equally likely (e.g., random numbers
between 0 and 1).
Exponential distribution: Describes the time between events in a Poisson process (e.g., the time between customer
arrivals at a store).
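The discrete examples above can be checked with a few lines of code. This is a minimal sketch using the standard binomial and Poisson probability formulas; the arrival rate of 4 customers per hour is a hypothetical value chosen only for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(exactly k events in an interval whose average event count is lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Binomial: probability of 3 heads in 5 fair coin flips
p_three_heads = binomial_pmf(3, 5, 0.5)

# Bernoulli is the special case n = 1 (a single coin toss)
p_single_head = binomial_pmf(1, 1, 0.5)

# Poisson: probability of exactly 2 arrivals in an hour,
# assuming a hypothetical average of 4 arrivals per hour
p_two_arrivals = poisson_pmf(2, 4)

print(p_three_heads, p_single_head, round(p_two_arrivals, 4))
```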
The normal distribution is commonly used in machine learning and data science.
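Types of Skewness
The types of skewness are positive (right-skewed: a longer tail on the right), negative (left-skewed: a longer
tail on the left), and zero (symmetric).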
Problem: A bakery analyzed its daily bread sales for the past week. They found the following:
We can't calculate the exact skewness value due to missing information about the standard deviation.
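To show why the standard deviation is needed, here is a minimal sketch of Pearson's second coefficient of skewness, 3 × (mean - median) / standard deviation. Since the bakery figures are not reproduced in these notes, the sketch uses the survey ages from section 2.1 instead; the choice of this particular skewness measure is an assumption.

```python
import statistics

ages = [52, 34, 32, 29, 63, 40, 46, 54, 36, 36,
        24, 19, 45, 20, 28, 29, 38, 33, 49, 37]

mean = statistics.mean(ages)
median = statistics.median(ages)
sd = statistics.stdev(ages)  # sample standard deviation

# Pearson's second skewness coefficient: the sign gives the direction of skew
skewness = 3 * (mean - median) / sd

direction = "positive" if skewness > 0 else "negative" if skewness < 0 else "zero"
print(f"mean={mean}, median={median}, skewness is {direction}")
```

Without the standard deviation only the sign of the skew can be read off, by comparing the mean with the median.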
2.4 Graphs
Graph Data Science is an analytics and machine learning (ML) solution that analyzes relationships in
data to improve predictions and discover insights.
There are various types of graphs used in data science:
1. Line Graph:
2. Bar Graph:
4. Histogram:
5. Pie Chart:
7. Heatmap:
8. Network Graph:
Ex: The mean, or the average, is calculated by adding all the figures within the data set and then
dividing by the number of figures within the set. For example, the sum of the following data set is 20:
(2, 3, 4, 5, 6). The mean is 4 (20/5).
Let's use the following example data set of exam scores: 78, 82, 85, 88, 90, 92, 95, 98, 100.
Range:
The range is the difference between the highest and lowest values in the data set.
For the exam scores, the range is 100 - 78 = 22.
1. Variance is the average squared deviation of all data points from the mean. It tells you how
widely the data points are spread around the mean, in squared units of the original data.
2. Standard deviation is the square root of the variance. It gives you a measure of the spread of
data in the same units as the original data, making it easier to interpret than the variance.
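The three measures can be computed for the exam scores as follows. This is a minimal sketch that treats the scores as a whole population (dividing by n), which matches the "average squared deviation" definition above; dividing by n - 1 for a sample would give slightly larger values.

```python
import statistics

scores = [78, 82, 85, 88, 90, 92, 95, 98, 100]

# Range: highest value minus lowest value
score_range = max(scores) - min(scores)

# Population variance: mean of squared deviations from the mean
variance = statistics.pvariance(scores)

# Population standard deviation: square root of the variance
std_dev = statistics.pstdev(scores)

print("range:", score_range)
print("variance:", round(variance, 2))
print("standard deviation:", round(std_dev, 2))
```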
2.7 Variability for qualitative and ranked data
Example: Variability in normal distributions
You are investigating the amounts of time spent on phones daily by different groups of people.
formula
2.8 Normal Distributions
Example: Calculate the value of the probability density function of the normal distribution for
x = 3, μ = 4 and σ = 2.
Solution: Given, variable, x = 3
Mean, μ = 4 and
Standard deviation, σ = 2
By the formula for the probability density of the normal distribution, we can write:
f(x) = (1 / (σ√(2π))) · e^(-(x - μ)² / (2σ²))
f(3) = (1 / (2√(2π))) · e^(-(3 - 4)² / (2 · 2²)) = 0.1995 · e^(-0.125) ≈ 0.1760
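The density for this example can be checked numerically. The sketch below writes out the normal density formula directly and compares it with Python's built-in `statistics.NormalDist`; the function name `normal_pdf` is introduced here only for illustration.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Normal density: 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2 * sigma^2))."""
    coefficient = 1 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coefficient * math.exp(exponent)

density = normal_pdf(3, 4, 2)

# Cross-check against the standard library implementation
library_density = NormalDist(mu=4, sigma=2).pdf(3)

print(round(density, 4), round(library_density, 4))
```

Note that this is a density, not a probability: for a continuous variable, probabilities come from areas under the curve.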
Problem 2: The speeds of cars on a motorway are measured using a radar unit. The speeds are normally
distributed with a mean of 90 km/hr and a standard deviation of 10 km/hr. What is the probability that a car
selected at random is moving at more than 100 km/hr?
Problem 3: A factory produces widgets, and the weights of the widgets are normally distributed with a mean
of 100 grams and a standard deviation of 5 grams. What percentage of widgets fall within 5 grams of the
mean (between 95 and 105 grams)? What is the probability that a randomly chosen widget weighs less than
80 grams?
Problem 4: A company sells running shoes. They know that the shoe size for their target market is normally
distributed with an average size of 9 (US) and a standard deviation of 1.5. They recently received a shipment
of 1000 new shoes. How many shoes can they expect to be larger than size 11 (US)? They want to offer a
discount on shoes that are unlikely to sell due to size. Up to what shoe size should they discount, if
they want to target the bottom 10% of shoe sizes?
(Ans 1: z = (11 - 9) / 1.5 ≈ 1.33. The area below z = 1.33 is 0.9082, so the area above it is 1 - 0.9082 = 0.0918,
and 0.0918 * 1000 = approximately 92 shoes.)
(Ans 2: The z-score with a cumulative area of 0.10 (10%) is approximately -1.28. Converting the z-score back to
shoe size: -1.28 * 1.5 + 9 = 7.08, so sizes of about 7 and below.)
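Problems 2 to 4 can be worked with the cumulative distribution function instead of a z-table. The sketch below uses `statistics.NormalDist` from the Python standard library; answers taken from a printed z-table with rounded z-scores may differ slightly in the last digit.

```python
from statistics import NormalDist

# Problem 2: car speeds ~ Normal(mean=90, sd=10); P(speed > 100)
speeds = NormalDist(mu=90, sigma=10)
p_over_100 = 1 - speeds.cdf(100)

# Problem 3: widget weights ~ Normal(mean=100, sd=5)
weights = NormalDist(mu=100, sigma=5)
p_within_5g = weights.cdf(105) - weights.cdf(95)  # between 95 g and 105 g
p_under_80g = weights.cdf(80)                     # four sd below the mean

# Problem 4: shoe sizes ~ Normal(mean=9, sd=1.5), shipment of 1000 shoes
sizes = NormalDist(mu=9, sigma=1.5)
expected_over_11 = 1000 * (1 - sizes.cdf(11))     # expected count above size 11
bottom_10_cutoff = sizes.inv_cdf(0.10)            # size below which 10% fall

print(f"P(speed > 100)      = {p_over_100:.4f}")
print(f"P(95 < weight < 105) = {p_within_5g:.4f}")
print(f"P(weight < 80)      = {p_under_80g:.6f}")
print(f"Shoes above size 11 = {expected_over_11:.0f}")
print(f"Bottom-10% cutoff   = {bottom_10_cutoff:.2f}")
```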
2.10 Correlation and Scatter Plots
Correlation is a statistical measure of the strength and direction of the relationship between two variables.
Ex: comparing the dance moves of two friends, Alice and Bob, to see how closely they move together.
2.10 Correlation ( -1 ≤ r ≤ 1 )
Example: Determine the correlation coefficient for the following data
Find a?
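The data table for this example is not reproduced in these notes, so the sketch below uses a small hypothetical data set purely for illustration. It computes the Pearson correlation coefficient r from the sum of products and the sums of squares; the function name `pearson_r` is an assumption of this sketch.

```python
import math

# Hypothetical paired observations, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def pearson_r(xs, ys):
    """Pearson correlation: SP_xy / sqrt(SS_x * SS_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    ss_x = sum((a - mean_x) ** 2 for a in xs)
    ss_y = sum((b - mean_y) ** 2 for b in ys)
    return sp_xy / math.sqrt(ss_x * ss_y)

r = pearson_r(x, y)
print(round(r, 4))
```

Whatever the data, the result always lies between -1 and 1, with the sign giving the direction of the relationship.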
2.13 Standard error of estimate (Sigma)
Find the sum of the squared errors (SSE)
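A minimal sketch of the calculation, using hypothetical data because the original table is not reproduced here. It fits a least-squares line y' = a + b·x, sums the squared errors, and takes the standard error of estimate as sqrt(SSE / (n - 2)); the n - 2 divisor is an assumption, since some texts divide by n.

```python
import math

# Hypothetical paired observations, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope (b) and intercept (a)
sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
ss_x = sum((xi - mean_x) ** 2 for xi in x)
b = sp_xy / ss_x
a = mean_y - b * mean_x

# Predicted values and sum of squared errors (SSE)
predicted = [a + b * xi for xi in x]
sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))

# Standard error of estimate (assumed divisor: n - 2)
std_error = math.sqrt(sse / (n - 2))

print(f"a = {a:.2f}, b = {b:.2f}, SSE = {sse:.2f}, standard error = {std_error:.4f}")
```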
2.14 Interpretation of r²
The coefficient of determination is a number between 0 and 1 that measures how well a
statistical model predicts an outcome.
In this example we get R² = 0.74, which means the model explains about 74% of the variance in the
outcome, so the predicted values are reasonably close to the actual values.
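One common way to compute the coefficient of determination is R² = 1 - SSE / SST, where SSE is the sum of squared prediction errors and SST is the total sum of squares around the mean. The sketch below uses hypothetical actual and predicted values, so it does not reproduce the 0.74 figure; the function name `r_squared` is introduced only for illustration.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SSE / SST."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - sse / sst

# Hypothetical values, for illustration only
actual = [2, 4, 5, 4, 5]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]

r2 = r_squared(actual, predicted)
print(round(r2, 2))
```

A value of 1 means the predictions match the actual values exactly, while a value near 0 means the model does no better than always predicting the mean.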
2.14 Multiple regression equations
Multiple linear regression refers to a statistical technique that uses two or more independent variables
to predict the outcome of a dependent variable.
Example Problem:
Formula: y = b0 + b1x1 + b2x2
b1 = 3.148
b2 = -1.656
b0 = -6.867
Fitted equation: y = -6.867 + 3.148x1 - 1.656x2
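Once the coefficients are known, predicting y is a direct substitution into the formula. The sketch below uses the coefficients from this example; the input values x1 = 10 and x2 = 5 are hypothetical, since the original data table is not reproduced here.

```python
# Coefficients from the example problem
B0 = -6.867  # intercept
B1 = 3.148   # coefficient of x1
B2 = -1.656  # coefficient of x2

def predict(x1, x2):
    """Evaluate y = b0 + b1*x1 + b2*x2 for one observation."""
    return B0 + B1 * x1 + B2 * x2

# Hypothetical inputs, for illustration only
y_hat = predict(10, 5)
print(round(y_hat, 3))
```

Each coefficient gives the expected change in y for a one-unit increase in that variable while the other variable is held constant.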
2.14 Regression toward the mean
Regression toward the mean is a common statistical phenomenon that describes how extreme values
(either very high or very low) tend to move closer to the average (mean) in subsequent measurements.
Example
● A basketball player has an abnormally high number of points in one game. Regression to the
mean suggests their scoring average over the season will likely be closer to their typical
performance, not as high as this single game.