2/ Organizing and Visualizing Variables: DCOVA
Data visualisation is all about organizing and representing data in a way that allows us to
more easily perceive trends and relationships, and effectively communicate otherwise
complex information. We do this to understand information better and, ultimately, to drive
better business decisions.
DCOVA
To properly apply statistics, you should follow a framework to minimize possible
errors.
Define the data you want to study in order to solve a problem or meet an objective (e.g.
study sales data in order to solve the problem of advertising expenditure).
Collect the data from appropriate sources.
Organise the data collected by developing tables.
Visualise the data by developing figures/charts.
Analyse the data collected to reach conclusions and present results.
A summary table tallies the frequencies or percentages of items in a set of categories so that
you can see differences between categories.
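
As a minimal sketch of how a summary table can be built, the Python below tallies a categorical variable into frequencies and percentages. The function name summary_table and the payment-method responses are made up for illustration; they do not come from these notes.

```python
from collections import Counter

def summary_table(values):
    """Return {category: (frequency, percentage)} for a categorical variable."""
    counts = Counter(values)
    total = len(values)
    return {category: (n, 100 * n / total) for category, n in counts.items()}

# Hypothetical responses, for illustration only.
payment_methods = ["card", "cash", "card", "transfer",
                   "card", "cash", "card", "card"]

table = summary_table(payment_methods)
for category, (freq, pct) in sorted(table.items()):
    print(f"{category:<10}{freq:>5}{pct:>8.1f}%")
```

Keeping the frequency and the percentage side by side makes the differences between categories easy to read off.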
A Contingency Table Helps Organize Two or More Categorical Variables
Used to study patterns that may exist between the responses of two or more
categorical variables.
Cross tabulates or tallies jointly the responses of the categorical variables.
For two variables the tallies for one variable are located in the rows and the tallies
for the second variable are located in the columns.
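
The cross-tabulation described above can be sketched in a few lines of Python. This is only one possible approach: the function name contingency_table and the two example variables (fund type and risk level) are assumptions made for illustration.

```python
from collections import Counter

def contingency_table(row_values, col_values):
    """Jointly tally two categorical variables.

    Rows hold the categories of the first variable, columns the second.
    """
    if len(row_values) != len(col_values):
        raise ValueError("each observation needs a response for both variables")
    joint = Counter(zip(row_values, col_values))
    row_levels = sorted(set(row_values))
    col_levels = sorted(set(col_values))
    return {r: {c: joint[(r, c)] for c in col_levels} for r in row_levels}

# Hypothetical paired responses, for illustration only.
fund_type = ["growth", "value", "growth", "growth", "value", "value"]
risk_level = ["high", "low", "high", "low", "low", "high"]

crosstab = contingency_table(fund_type, risk_level)
for row, cells in crosstab.items():
    print(row, cells)
```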
An ordered array is a sequence of data, in rank order, from the smallest value to the largest
value. Shows range (minimum value to maximum value). May help identify outliers (unusual
observations).
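
A short Python sketch of an ordered array, using made-up values: sorting puts the data in rank order, the first and last entries give the range, and an unusually large value stands out at the end.

```python
# Hypothetical observations, for illustration only.
data = [34, 12, 27, 95, 18, 21, 30]

ordered = sorted(data)                 # smallest value to largest value
data_range = ordered[-1] - ordered[0]  # maximum minus minimum

print(ordered)
print("range:", data_range)
```

In this made-up data set the value 95 sits well apart from the rest, which is the kind of observation an ordered array helps to flag for a closer look.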
The frequency distribution is a summary table in which the data are arranged into
numerically ordered classes.
You must give attention to selecting the appropriate number of class groupings for the
table, determining a suitable width of a class grouping, and establishing the
boundaries of each class grouping to avoid overlapping.
The number of classes depends on the number of values in the data. With a larger
number of values, typically there are more classes. In general, a frequency distribution
should have at least 5 but no more than 15 classes.
To determine the width of a class interval, you divide the range (Highest value–
Lowest value) of the data by the number of class groupings desired.
Frequency Distribution
It condenses the raw data into a more useful form. It allows for a quick visual interpretation
of the data. It enables the determination of the major characteristics of the data set including
where the data are concentrated / clustered.
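
The class-width rule and the tallying into classes can be sketched as follows. This is a minimal illustration under stated assumptions: the function name frequency_distribution is hypothetical, the data are made up, classes are taken as equal-width intervals that include their lower boundary, and the highest value is placed in the last class. In practice the width is usually rounded to a convenient number.

```python
def frequency_distribution(values, num_classes):
    """Tally numerical values into equal-width, non-overlapping classes."""
    lowest, highest = min(values), max(values)
    # Width of a class interval = range / number of class groupings.
    width = (highest - lowest) / num_classes
    if width == 0:
        raise ValueError("all values are identical; no classes can be formed")
    counts = [0] * num_classes
    for v in values:
        # Each class includes its lower boundary; the maximum value
        # is assigned to the last class (an assumption of this sketch).
        index = min(int((v - lowest) / width), num_classes - 1)
        counts[index] += 1
    boundaries = [lowest + i * width for i in range(num_classes + 1)]
    return width, boundaries, counts

# Hypothetical data, for illustration only.
values = [10, 14, 22, 25, 27, 31, 38, 44, 52, 60]
width, boundaries, counts = frequency_distribution(values, num_classes=5)

for lower, upper, count in zip(boundaries, boundaries[1:], counts):
    print(f"{lower:>5.0f} to under {upper:<5.0f}{count:>3}")
```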
The bar chart visualizes a categorical variable as a series of bars. The length of each bar
represents either the frequency or percentage of values for each category. Each bar is
separated by a space called a gap.
The pie chart is a circle broken up into slices that represent categories. The size of each slice
of the pie varies according to the percentage in each category.
The doughnut chart is the outer part of a circle broken up into pieces that represent
categories. The size of each piece of the doughnut varies according to the percentage in each
category.
The side-by-side bar chart represents the data from a contingency table.
Stem-and-Leaf Display
A simple way to see how the data are distributed and where concentrations of data exist.
METHOD: Separate the sorted data series into leading digits (the stems) and the trailing
digits (the leaves).
A stem-and-leaf display organizes data into groups (called stems) so that the values within
each group (the leaves) branch out to the right on each row.
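
The method above can be sketched in Python. The sketch assumes non-negative whole numbers, with the last digit as the leaf and all preceding digits as the stem; the function name stem_and_leaf and the data are made up for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (leading digits) to its sorted leaves (last digit).

    Assumes non-negative integers.
    """
    display = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 27 -> stem 2, leaf 7
        display[stem].append(leaf)
    return dict(display)

# Hypothetical data, for illustration only.
ages = [21, 38, 24, 30, 26, 24, 41, 27, 32]

display = stem_and_leaf(ages)
for stem in sorted(display):
    leaves = "".join(str(leaf) for leaf in display[stem])
    print(f"{stem} | {leaves}")
```

The longest row shows where the data are concentrated.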
The Histogram
A vertical bar chart of the data in a frequency distribution is called a histogram.
In a histogram there are no gaps between adjacent bars.
The class boundaries (or class midpoints) are shown on the
horizontal axis. The vertical axis shows the frequency or
percentage.
The Polygon
A percentage polygon is formed by having the midpoint of each class represent
the data in that class and then connecting the sequence of midpoints at their
respective class percentages.
The cumulative percentage polygon, or ogive, displays the variable of interest
along the X axis, and the cumulative percentages along the Y axis.
Polygons are useful when there are two or more groups to compare.
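
The points plotted in a percentage polygon and a cumulative percentage polygon can be computed as in the sketch below. The function name polygon_points, the class boundaries, and the counts are assumptions made for illustration; the code only computes the values and leaves the drawing to whatever charting tool is in use.

```python
from itertools import accumulate

def polygon_points(boundaries, counts):
    """Return class midpoints, class percentages, and cumulative percentages."""
    total = sum(counts)
    midpoints = [(lower + upper) / 2
                 for lower, upper in zip(boundaries, boundaries[1:])]
    percentages = [100 * c / total for c in counts]
    # Cumulative percentage through the end of each class.
    cumulative = [100 * c / total for c in accumulate(counts)]
    return midpoints, percentages, cumulative

# Hypothetical frequency distribution, for illustration only.
boundaries = [10, 20, 30, 40, 50, 60]
counts = [2, 3, 2, 1, 2]

midpoints, percentages, cumulative = polygon_points(boundaries, counts)
for m, p, c in zip(midpoints, percentages, cumulative):
    print(f"midpoint {m:>5.1f}  {p:>5.1f}%  cumulative {c:>5.1f}%")
```

Working in percentages rather than raw frequencies is what makes groups of different sizes comparable on the same axes.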
In a time-series plot, the numeric variable is measured on the vertical axis and the time
period is measured on the horizontal axis.
Presentation issues that can undercut the usefulness of methods from this
chapter.