2 Descriptive Analytics
Descriptive Analytics
• Descriptive analytics is the starting point of analytics-based solutions
to problems. It helps in understanding the data and provides direction for
predictive and prescriptive analytics. Business Intelligence (BI), which
largely involves creating reports and business dashboards that lead to
actionable insights, is essentially a descriptive analytics exercise.
DATA TYPES AND SCALES
• Structured and Unstructured Data
• Any data that is not originally in matrix form with rows and
columns is unstructured data. Examples include e-mails, click streams,
textual data, images (photos and images generated by medical
devices), log data, and videos. Machine-generated data such as
satellite images, magnetic resonance imaging (MRI),
electrocardiogram (ECG), and thermography are a few further examples of
unstructured data.
• Structured Data
• Cross-sectional, Time Series, and Panel Data
1. Cross-Sectional Data: Data collected on many variables of interest at the same
time or over the same duration is called cross-sectional data. For example, consider
data on movies during the year 2017, such as budget, box-office collection, actors,
directors, and genre.
2. Time Series Data: Data collected for a single variable, such as demand for
smartphones, over several time intervals (weekly, monthly, etc.) is called
time series data.
3. Panel Data: Data collected on several variables (multiple dimensions) over
several time intervals is called panel data (also known as longitudinal data). An
example of panel data is data collected on variables such as gross domestic product
(GDP), Gini index, and unemployment rate for several countries over several years.
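The three data types differ mainly in shape, which a small sketch can make concrete. The values, movie names, and country names below are made up purely for illustration, and the helper functions `panel_to_time_series` and `panel_to_cross_section` are hypothetical names, not part of any library.

```python
# Hypothetical, made-up numbers used only to show the shape of each data type.

# Cross-sectional: many variables at one point in time (one record per movie).
cross_sectional = [
    {"movie": "A", "budget": 50, "box_office": 120, "genre": "drama"},
    {"movie": "B", "budget": 80, "box_office": 95, "genre": "action"},
]

# Time series: a single variable observed over successive periods.
time_series = {"2017-01": 410, "2017-02": 435, "2017-03": 428}

# Panel: several variables for several entities over several periods,
# keyed here by (entity, year).
panel = {
    ("Country X", 2015): {"gdp": 1.9, "unemployment": 5.1},
    ("Country X", 2016): {"gdp": 2.1, "unemployment": 4.9},
    ("Country Y", 2015): {"gdp": 0.8, "unemployment": 7.4},
    ("Country Y", 2016): {"gdp": 0.9, "unemployment": 7.0},
}


def panel_to_time_series(panel_data, entity, variable):
    """Fix one entity and one variable: the panel reduces to a time series."""
    return {
        year: row[variable]
        for (name, year), row in sorted(panel_data.items())
        if name == entity
    }


def panel_to_cross_section(panel_data, year):
    """Fix one period: the panel reduces to a cross-section."""
    return {name: row for (name, y), row in panel_data.items() if y == year}


print(panel_to_time_series(panel, "Country X", "gdp"))
print(panel_to_cross_section(panel, 2016))
```

Seen this way, panel data contains both of the other types: slicing by entity gives a time series, and slicing by period gives a cross-section.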
TYPES OF DATA MEASUREMENT SCALES
• Nominal Scale (Qualitative Data)
Ordinal scale refers to a variable whose values are drawn from an ordered set and
recorded in order of magnitude. For example, many surveys use a Likert scale, which
is finite (usually a 5-point scale) and for which the data collector has defined
the order of preference.
Interval scale refers to a variable whose values are chosen from an interval set.
Variables such as temperature measured in centigrade (°C) or intelligence quotient (IQ)
score are examples of interval scale. On an interval scale, differences are meaningful
but ratios are not, because the zero point is arbitrary.
Any variable for which ratios can be computed and are meaningful is on a
ratio scale. Most variables are of this type; for example, demand for a product,
market share of a brand, sales, salary, and so on. If Ms Hawai Sundari’s salary is
40,000 per month and Ms Dawai Sundari’s salary is 90,000 per
month, then Dawai Sundari earns 2.25 times the salary of
Hawai Sundari.
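The difference between interval and ratio scales can be checked with a little arithmetic. The sketch below reuses the two salaries from the text. The two temperatures (10 °C and 20 °C) are assumed values chosen for illustration, and `celsius_to_kelvin` is a helper written for this example.

```python
def celsius_to_kelvin(celsius):
    """Convert to Kelvin, a temperature scale with a true zero."""
    return celsius + 273.15


# Ratio scale: salary has a true zero, so the ratio is meaningful.
salary_hawai, salary_dawai = 40_000, 90_000
salary_ratio = salary_dawai / salary_hawai  # 2.25

# Interval scale: Celsius has an arbitrary zero, so the ratio is not meaningful.
temp_a_c, temp_b_c = 10.0, 20.0  # assumed illustrative values
naive_ratio = temp_b_c / temp_a_c  # 2.0, but 20 °C is not "twice as hot"
kelvin_ratio = celsius_to_kelvin(temp_b_c) / celsius_to_kelvin(temp_a_c)

# Differences on an interval scale remain meaningful.
difference_c = temp_b_c - temp_a_c  # 10.0 degrees

print(salary_ratio, naive_ratio, round(kelvin_ratio, 4), difference_c)
```

The Celsius ratio of 2.0 shrinks to roughly 1.035 once the same temperatures are expressed on a scale with a true zero, which is why ratios of interval-scale values should not be interpreted.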
POPULATION AND SAMPLE
• Population is the set of all possible observations (often
called cases, records, subjects, or data points) for a
given context of the problem. The size of the population can
be very large in many cases. For example, in
2014, close to 834.08 million people were eligible to vote in
the Indian general elections (Source: Election
Commission of India).
• Population (also known as the universal set) is the set of all possible data
for a given context, whereas a sample is a subset taken from the
population. In many analytical problems, we make inferences about the
population based on the sample data. There are many challenges in
sampling (the process of selecting observations from the population).
An incorrectly drawn sample may result in bias and incorrect inferences
about the population.
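The effect of a poorly chosen sample can be sketched with a toy population. Everything here is an assumption made for illustration: the population is simply the integers 1 to 10,000, the sample size is 500, and the "biased" sample deliberately takes only the largest values.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Toy population (assumed for illustration): the integers 1..10000.
population = list(range(1, 10_001))
population_mean = sum(population) / len(population)

# Simple random sample: every observation is equally likely to be chosen.
random_sample = random.sample(population, 500)
random_sample_mean = sum(random_sample) / len(random_sample)

# Biased sample: only the 500 largest observations are selected.
biased_sample = sorted(population)[-500:]
biased_sample_mean = sum(biased_sample) / len(biased_sample)

print("population mean:   ", population_mean)
print("random sample mean:", random_sample_mean)
print("biased sample mean:", biased_sample_mean)
```

The random sample's mean lands near the population mean, while the biased sample's mean is far from it, so any inference drawn from the biased sample would be misleading.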
MEASURES OF CENTRAL TENDENCY
• Mean (or Average) Value
• Mean (or Average) Value with frequency
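As a minimal sketch of the two calculations named above: the mean is the sum of the observations divided by their count, and when the data are given as values with frequencies, each value is weighted by how often it occurs. The data and the function names `mean` and `mean_from_frequencies` are assumptions made for this example.

```python
def mean(values):
    """Arithmetic mean: sum of the observations divided by their count."""
    return sum(values) / len(values)


def mean_from_frequencies(value_counts):
    """Mean of data given as {value: frequency}."""
    total = sum(value * count for value, count in value_counts.items())
    n = sum(value_counts.values())
    return total / n


# Made-up data: the same observations in raw and in frequency form.
raw = [2, 2, 3, 3, 3, 5]
frequencies = {2: 2, 3: 3, 5: 1}

print(mean(raw))                           # 3.0
print(mean_from_frequencies(frequencies))  # 3.0
```

Both forms give the same answer because the frequency table is only a compressed way of writing the raw observations.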
• Median (or Mid) Value
• With n = 7 observations, (n + 1)/2 = 8/2 = 4. Thus the median is the 4th value
in the data after arranging them in increasing order; in this case it is 260.
• The 5th and 6th observations are 240000 and 240000 and the average is
240000. Thus, the median salary for the data in Table 2.1 is 240000.
• The median is much more stable than the mean; that is, adding a
new observation may not change the median significantly. However,
the drawback of the median is that, unlike the mean, it is not
calculated using all the values in the data. We are simply looking for
the midpoint instead of using the actual values of the data.
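The odd and even cases of the median, and its stability relative to the mean, can be sketched as follows. All data values are made up for illustration (they are not the values from Table 2.1), and `median` is a helper written for this example.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # the (n + 1)/2-th value, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2


# Odd n (7 made-up values): the median is the 4th value after sorting.
odd_data = [245, 230, 260, 295, 250, 280, 270]
print(median(odd_data))  # 260

# Even n (4 made-up values): the median is the average of the 2nd and 3rd values.
even_data = [10, 20, 30, 40]
print(median(even_data))  # 25.0

# Stability: one extreme observation moves the mean far more than the median.
data = [10, 20, 30, 40, 50]
with_outlier = data + [1000]
print(sum(data) / len(data), median(data))
print(sum(with_outlier) / len(with_outlier), median(with_outlier))
```

In the last comparison the mean jumps from 30 to about 191.7 after the outlier is added, while the median only moves from 30 to 35.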
• Mode
Mode is the most frequently occurring value in the data set. For example, in the
data ‘salary’ in Table 2.1, the value 240000 appears three times and is the mode,
since all other values are observed only once.
In Microsoft Excel, the function ‘Mode(array)’ can be used for calculating the mode.
Mode is the only measure of central tendency that is valid for qualitative
(nominal) data, since the mean and median of nominal data are meaningless.
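A minimal sketch of the mode calculation, using Python's standard `collections.Counter`. The helper name `modes` and the sample data are assumptions made for this example. The function returns a list because a data set can have more than one most frequent value.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


# Nominal (qualitative) data: the mode is still well defined.
colours = ["red", "blue", "red", "green"]
print(modes(colours))  # ['red']

# Numeric data with two equally frequent values.
scores = [1, 1, 2, 2, 3]
print(modes(scores))  # [1, 2]
```

The first call works on labels that cannot be averaged or ordered, which illustrates why the mode is the only one of the three measures that applies to nominal data.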
MEASURES OF VARIATION
1. Range
2. Variance
3. Standard Deviation
1. Range
Range is the difference between the maximum and minimum values of
the data. It captures the spread of the data.
In the data in Table 2.4, the range = 102 – 2 = 100.
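The range calculation can be sketched in a few lines. The data below are made up (only the minimum of 2 and maximum of 102 echo the example above; the other values are assumptions), and `value_range` is a helper written for this example.

```python
def value_range(values):
    """Range: largest observation minus smallest observation."""
    return max(values) - min(values)


data = [2, 15, 40, 102]  # made-up values with minimum 2 and maximum 102
print(value_range(data))       # 100
print(value_range(data[:-1]))  # 38 once the largest value is dropped
```

Because the range depends only on the two extreme observations, removing a single extreme value changes it sharply, here from 100 to 38.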
2. Variance
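Variance is a measure of the variability of the data around the mean value. The
variance of a population, $\sigma^2$, is calculated using
$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^2$
where $X_i$ are the observations, $\mu$ is the population mean, and $n$ is the
number of observations.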
2. Sample Variance
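In the case of a sample, the sample variance ($S^2$) is calculated using
$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$
where $\bar{X}$ is the sample mean; that is, the sum of squared deviations is
divided by (n − 1) instead of n.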
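The only difference between the two variance calculations is the divisor, which a short sketch makes visible. The data set is made up for illustration, and the function names are assumptions; Python's standard `statistics` module provides `pvariance` and `variance` for the same two quantities.

```python
import math


def population_variance(values):
    """Average squared deviation from the mean, dividing by n."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Squared deviations from the sample mean, dividing by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values; mean is 5, squared deviations sum to 32

print(population_variance(data))             # 4.0  (32 / 8)
print(sample_variance(data))                 # 32 / 7, about 4.571
print(math.sqrt(population_variance(data)))  # 2.0, the population standard deviation
```

Dividing by (n − 1) always gives a slightly larger value than dividing by n, and the gap shrinks as the number of observations grows.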
Degrees of freedom