Introduction to Data Analysis

Definition of data analysis

Data analysis is the process of evaluating data using analytical and
logical reasoning to examine each component of the data
provided. This form of analysis is just one of the many
steps that must be completed when conducting a research
experiment. Data from various sources is gathered,
reviewed, and then analyzed to form some sort of finding
or conclusion.
Types of data analysis

 Descriptive Analysis
Descriptive analysis is an important first step in conducting statistical analyses. It gives you an idea of the
distribution of your data, helps you detect outliers and typos, and enables you to identify associations among variables,
thus preparing you for further statistical analysis.
 Predictive Analysis
Predictive analytics uses historical data to predict future events. Typically, historical data is used to build a mathematical
model that captures important trends. That predictive model is then used on current data to predict what will happen next,
or to suggest actions to take for optimal outcomes.
 Prescriptive Analytics
Prescriptive Analytics is the area of data analytics that focuses on finding the best course of action in a scenario given the
available data.
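
The three types of analysis above can be sketched on one small data set. This is only an illustration under stated assumptions: the monthly sales figures, the candidate order quantities, and the cost weights are all hypothetical, and a straight-line fit stands in for whatever predictive model a real project would use.

```python
import statistics

# Hypothetical history: month number and units sold in that month.
months = [1, 2, 3, 4, 5, 6]
units = [10.0, 12.0, 13.0, 15.0, 18.0, 19.0]

# Descriptive: summarize what has already happened.
summary = {
    "mean": statistics.mean(units),
    "median": statistics.median(units),
    "stdev": statistics.stdev(units),
    "min": min(units),
    "max": max(units),
}

# Predictive: fit a least-squares line to the history and extrapolate.
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(months, units)
forecast = intercept + slope * 7  # predicted units for month 7

# Prescriptive: pick the action with the lowest expected cost.
# Assumed costs: 1 per unit overstocked, 3 per unit of unmet demand.
def cost(order_quantity, demand):
    overstock = max(order_quantity - demand, 0.0)
    shortage = max(demand - order_quantity, 0.0)
    return 1.0 * overstock + 3.0 * shortage

candidate_orders = [15, 20, 25]
best_order = min(candidate_orders, key=lambda q: cost(q, forecast))

print(summary)
print("forecast for month 7:", round(forecast, 2))
print("recommended order quantity:", best_order)
```

The descriptive step only summarizes, the predictive step produces a number about the future, and the prescriptive step turns that number into a recommended action.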
Analysis life Cycle

1. Problem identification
2. Hypothesis formulation
3. Data collection
4. Data exploration/preparation
5. Model building
6. Model validation and evaluation

1. Problem identification
- A problem is a situation that is judged to need correcting or solving.
- A problem can be identified through:
i) comparative/benchmarking studies
ii) performance reporting
iii) asking some basic questions:
a) Who is affected by the problem?
b) What will happen if the problem is not solved?
c) When and where does the problem occur?
d) Why is the problem occurring?
e) How are people currently handling the problem?

2. Hypothesis formulation
i) Frame the questions that need to be answered.
ii) Develop a comprehensive list of all possible issues related to the problem.
iii) Reduce the list by eliminating duplicates and combining overlapping issues.
iv) Use consensus building to narrow the list down to the major issues.

3. Data collection
i) Using data that has already been collected by others
ii) Systematically selecting and observing characteristics of people, objects, and events
iii) Orally questioning respondents, either individually or as a group
iv) Collecting data from answers that respondents provide in written form

4. Data Exploration
i) Importing data
ii) Variable Identification
iii) Data Cleaning
iv) Summarizing data
v) Selecting subset of data
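
The five exploration steps above can be sketched with the Python standard library alone. The inline CSV text, its column names, and the cleaning rules (drop exact duplicates, drop rows with a missing age) are assumptions made for illustration; real projects would pick rules that suit their data.

```python
import csv
import io
import statistics

# Hypothetical raw data with one missing age and one duplicated row.
RAW = """name,age,city
Ana,34,Lima
Ben,,Oslo
Cy,29,Lima
Ana,34,Lima
Dee,41,Oslo
"""

# i) Importing data
rows = list(csv.DictReader(io.StringIO(RAW)))

# ii) Variable identification: which columns do we have?
variables = list(rows[0].keys())

# iii) Data cleaning: skip exact duplicates and rows with no age,
#      and convert age from text to an integer.
seen = set()
clean = []
for row in rows:
    key = tuple(row.items())
    if key in seen or row["age"] == "":
        continue
    seen.add(key)
    clean.append({**row, "age": int(row["age"])})

# iv) Summarizing data
mean_age = statistics.mean(r["age"] for r in clean)

# v) Selecting a subset of the data
lima = [r for r in clean if r["city"] == "Lima"]

print("variables:", variables)
print("rows kept:", len(clean), "of", len(rows))
print("mean age:", round(mean_age, 2))
print("Lima subset:", [r["name"] for r in lima])
```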
5. Model Building
 Building a model is a highly iterative process because there is no such thing as a final,
perfect solution.
 Many machine learning and statistical techniques are available in traditional technology.

 The entire group of individuals is called the population.

 For example, a researcher may be interested in the
relation between class size (variable 1) and
academic performance (variable 2) for the
population of third-grade children.
Sample

 Usually populations are so large that a researcher
cannot examine the entire group. Therefore, a
sample is selected to represent the population in a
research study. The goal is to use the results
obtained from the sample to help answer questions
about the population.
A census is a list of all individuals in a
population along with certain characteristics of
each individual.

A pilot study is a study done before the actual fieldwork is
carried out. It is also used to test questionnaires
and to improve them in terms of flow, question design,
language, and clarity.

A sample survey, on the other hand, involves a subgroup (or
sample) of the population being chosen and questioned on a set
of topics. The results of the sample survey are usually used to
make inferences about the larger population.
A sample of size n from a population of size N
is obtained through simple random sampling
if every possible sample of size n has an
equally likely chance of occurring. The sample
is then called a simple random sample.
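
A minimal sketch of simple random sampling, assuming a hypothetical population of N = 100 numbered individuals and a sample of size n = 10. The helper name simple_random_sample and the seed are illustrative choices, not part of the definition.

```python
import math
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct individuals so that every subset of size n is equally likely."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(population, n)

population = list(range(1, 101))  # hypothetical IDs 1..100, so N = 100
sample = simple_random_sample(population, 10, seed=42)

print("sample:", sorted(sample))
# Number of distinct samples of size n that could have been drawn.
print("possible samples:", math.comb(len(population), len(sample)))
```

Sampling is done without replacement, so no individual can appear twice in the same sample.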

 The individual measurements or scores obtained for a
research participant will be identified by the letter X (or X
and Y if there are multiple scores for each individual).
 The number of scores in a data set will be identified by N
for a population or n for a sample.
 Summing a set of values is a common operation in
statistics and has its own notation. The Greek letter sigma,
Σ, will be used to stand for "the sum of." For example, ΣX
identifies the sum of the scores.
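
The notation can be tried out on a small set of scores. The four values below are hypothetical. The last two lines go one step beyond the text to show that the sum of squared scores and the squared sum are different quantities.

```python
# Hypothetical sample of scores, so the count is written n (not N).
X = [3, 1, 7, 4]

n = len(X)                                # number of scores in the sample
sum_x = sum(X)                            # ΣX: add up the scores
sum_of_squares = sum(x ** 2 for x in X)   # ΣX²: square each score, then add
square_of_sum = sum(X) ** 2               # (ΣX)²: add first, then square

print(n, sum_x, sum_of_squares, square_of_sum)  # 4 15 75 225
```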
EXAMPLE Parameter versus Statistic

Suppose the percentage of all students on your campus who have a
job is 84.9%. This value represents a parameter because it is a
numerical summary of a population.
Suppose a sample of 250 students is obtained, and from this
sample we find that 86.3% have a job. This value represents a
statistic because it is a numerical summary based on a sample.
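
The distinction can be simulated. The sketch below assumes a hypothetical campus of exactly 1,000 students, 849 of whom have a job, so the parameter is 84.9% as in the example. The statistic depends on which 250 students happen to be drawn, so it will usually differ a little from the parameter and change with the seed.

```python
import random

# Hypothetical population: 1 = has a job, 0 = does not.
N = 1000
population = [1] * 849 + [0] * 151

# Parameter: numerical summary of the whole population.
parameter = sum(population) / N

# Statistic: the same summary computed from a sample of n = 250.
rng = random.Random(0)  # arbitrary seed, for repeatability only
sample = rng.sample(population, 250)
statistic = sum(sample) / len(sample)

print(f"parameter: {parameter:.1%}")
print(f"statistic: {statistic:.1%}")
```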
Data

 The measurements obtained in a research study are
called the data.
 The goal of statistics is to help researchers
organize and interpret the data.
Some Characteristics of Data
 Not all data is the same. There are some limitations as to
what can and cannot be done with a data set, depending
on the characteristics of the data
 Some key characteristics that must be considered are:
 Continuous vs. Discrete
 Grouped vs. Individual
 Scale of Measurement

 A variable is a characteristic or condition that can
change or take on different values.
 Most research begins with a general question about
the relationship between two variables for a
specific group of individuals.
Variables are the characteristics of the individuals
within the population

Key Point: Variables vary. Consider the variable
height. If all individuals had the same height, then
obtaining the height of one individual would be
sufficient to know the heights of all individuals. Of
course, this is not the case. As researchers, we wish to
identify the factors that influence variability.
Types of Variables

 Variables can be classified as discrete or continuous.

 Discrete variables (such as class size) consist of indivisible categories, and continuous
variables (such as time or weight) are infinitely divisible into whatever units a researcher
may choose. For example, time can be measured to the nearest minute, second,
half-second, etc.
Qualitative or categorical variables allow for
classification of individuals based on some attribute or
characteristic.

Quantitative variables provide numerical measures of
individuals. Arithmetic operations such as addition and
subtraction can be performed on the values of a
quantitative variable and provide meaningful results.
EXAMPLE Distinguishing between Qualitative and Quantitative Variables

Researcher Elisabeth Kvaavik and others studied factors that affect the eating habits of
adults in their mid-thirties. (Source: Kvaavik E, et al. Psychological explanatorys
of eating habits among adults in their mid-30’s (2005) International Journal of
Behavioral Nutrition and Physical Activity (2)9.) Classify each of the following
variables considered in the study as qualitative or quantitative.
a. Nationality: qualitative
b. Number of children: quantitative
c. Household income in the previous year: quantitative
d. Level of education: qualitative
e. Daily intake of whole grains (measured in grams per day): quantitative
A discrete variable is a quantitative variable that either has a finite number
of possible values or a countable number of possible values. The term
“countable” means the values result from counting such as 0, 1, 2, 3, and so
on.

A continuous variable is a quantitative variable that has an infinite
number of possible values it can take on and can be measured to any
desired level of accuracy, e.g., 1, 1.43, and 3.1415926 are all acceptable values.
Geographic examples: distance, tree height, amount of precipitation, etc.
EXAMPLE Distinguishing between Discrete and Continuous Variables

Researcher Elisabeth Kvaavik and others studied factors that affect the eating habits of
adults in their mid-thirties. (Source: Kvaavik E, et al. Psychological explanatorys
of eating habits among adults in their mid-30’s (2005) International Journal of
Behavioral Nutrition and Physical Activity (2)9.) Classify each of the following
quantitative variables considered in the study as discrete or continuous.

a. Number of children: discrete (the value results from counting)
b. Household income in the previous year: discrete (counted in units of currency, although
income is often treated as continuous in practice)
c. Daily intake of whole grains (measured in grams per day): continuous
Measuring Variables
 To establish relationships between variables, researchers must observe the variables
and record their observations. This requires that the variables be measured.
 The process of measuring a variable requires a set of categories called a scale of
measurement and a process that classifies each individual into one category.
Scales of Measurement

 The data used in statistical analyses can be divided into
four types:
1. The Nominal Scale
2. The Ordinal Scale
3. The Interval Scale
4. The Ratio Scale
As we progress through these scales, the types of data they describe
have increasing information content.
The Nominal Scale
 Nominal scale data are data that can simply be
broken down into categories, i.e., having to do with
names or types
 Dichotomous or binary nominal data have just two
types, e.g., yes/no, female/male, is/is not, hot/cold
 Multichotomous data have more than two types, e.g.,
vegetation types, soil types, counties, eye color, etc.
 Not a scale in the sense that categories cannot be
ranked or ordered (no greater/less than)
The Ordinal Scale
 Ordinal scale data can be categorized AND can be
placed in an order, i.e., categories can be assigned a
relative importance and ranked such that
numerical category values have a meaningful order
 e.g., star-system restaurant rankings:
5 stars > 4 stars, 4 stars > 3 stars, 5 stars > 2 stars
 BUT ordinal data still are not scalar in the sense that
differences between categories do not have a
quantitative meaning
 i.e., a 5-star restaurant is not necessarily superior to a 4-star restaurant by
the same amount as a 4-star restaurant is to a 3-star restaurant
The Interval Scale
 Interval scale data take the notion of ranking items in
order one step further, since the distances between
adjacent points on the scale are equal
 For instance, the Fahrenheit scale is an interval scale,
since each degree is equal but there is no absolute zero
 This means that although we can add and subtract
degrees (100° is 10° warmer than 90°), we cannot
multiply values or create ratios (100° is not twice as
warm as 50°)
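
One way to see why ratios fail on an interval scale is to re-express the same temperatures in Celsius, which is also an interval scale with a different arbitrary zero. The sketch below uses the standard Fahrenheit-to-Celsius conversion. Equal differences stay equal after conversion, but the apparent ratio changes, so "twice as warm" has no fixed meaning.

```python
def fahrenheit_to_celsius(f):
    """Standard conversion between the two interval scales."""
    return (f - 32.0) * 5.0 / 9.0

# Differences are meaningful: two 10-degree gaps in Fahrenheit
# remain equal to each other when expressed in Celsius.
gap_high = fahrenheit_to_celsius(100) - fahrenheit_to_celsius(90)
gap_low = fahrenheit_to_celsius(60) - fahrenheit_to_celsius(50)

# Ratios are not: the "ratio" of 100 to 50 depends on the scale used.
ratio_f = 100 / 50
ratio_c = fahrenheit_to_celsius(100) / fahrenheit_to_celsius(50)

print("gaps in Celsius:", round(gap_high, 3), round(gap_low, 3))
print("ratio in Fahrenheit:", ratio_f)
print("ratio in Celsius:", round(ratio_c, 3))
```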
The Ratio Scale

 Similar to the interval scale, but with the addition of
having a meaningful zero value, which allows us to
compare values using multiplication and division
operations, e.g., precipitation, weights, heights, etc.
 e.g., rain – We can say that 2 inches of rain is twice as
much rain as 1 inch of rain because this is a ratio scale
 e.g., age – a 100-year old person is indeed twice as old
as a 50-year old one