Data Visualization Tools Module
In the world of Big Data, data visualization tools and technologies are essential to
analyze massive amounts of information and make data-driven decisions. Data
visualization is the graphical representation of information and data. By using visual
elements like charts, graphs, and maps, data visualization tools provide an accessible
way to see and understand trends, outliers, and patterns in data.
What is Data Visualization?
Data visualization tools offer new approaches that dramatically improve the ability to
grasp information hidden in large volumes of business data. The primary advantages
of data visualization to decision makers and their organizations are as follows:
1) Enhanced Assimilation of Business Information
Data visualization enables users to receive vast amounts of information
regarding operational and business conditions. Data visualization allows decision
makers to see connections between multi-dimensional data sets and provides new
ways to interpret data through the use of graphs, charts, infographics and other rich
graphical representations.
DATA TRANSFORMATION
Data comes in many forms, such as text, numbers, images and videos, and it is often
incomplete. For example, a customer details form may have a few fields left empty.
Such data is known as missing data. In most cases, data may be missing,
unstructured, or lacking a regular structure. In data visualization, the data must
therefore be cleaned before processing to make it fit for further use.
Data cleansing has a long history in databases and is a key step in the process known
as extract, transform, load (ETL), commonly used in data warehouses and shown in
figure 2.1. Data is extracted from one or more sources; transformed into its proper
format and structure, including cleansing of the data; and finally loaded into a final
target location,
such as a single database or file, which can be used for business analytics and data
visualization.
1) Extraction
The first step of the ETL process is extraction. In this step, data is extracted from
various source systems, which can be in various formats such as relational databases,
SQL, XML and flat files, into the staging area. It is important to extract the data
into the staging area first, and not directly into the data warehouse, because the
extracted data is in various formats and may also be corrupted. Loading it directly
into the data warehouse may therefore damage the warehouse, and rollback will be
much more difficult. This makes extraction one of the most important steps of the
ETL process.
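The extraction step can be sketched roughly as follows. This is only an illustration under stated assumptions: the two sources (a flat file and an XML feed), the field names, and the list used as a staging area are all hypothetical, and a real pipeline would read files or query source systems instead of using inline strings.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical source data; a real pipeline would read files or query databases.
CSV_SOURCE = "id,name,country\n1,Asha,U.S.A\n2,Ben,\n"
XML_SOURCE = "<customers><customer id='3' name='Chen' country='America'/></customers>"


def extract_csv(text):
    """Read a flat file into a list of dictionaries, one per record."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def extract_xml(text):
    """Read customer elements into the same dictionary shape."""
    root = ET.fromstring(text)
    return [dict(node.attrib) for node in root.iter("customer")]


def extract_to_staging(sources):
    """Run each extractor; keep good records in staging and note failures."""
    staging, failures = [], []
    for name, extractor, payload in sources:
        try:
            staging.extend(extractor(payload))
        except (csv.Error, ET.ParseError) as err:
            # A corrupted source is recorded here and never reaches the warehouse.
            failures.append((name, str(err)))
    return staging, failures


staging, failures = extract_to_staging([
    ("flat_file", extract_csv, CSV_SOURCE),
    ("xml_feed", extract_xml, XML_SOURCE),
    ("broken_feed", extract_xml, "<customers><customer"),  # deliberately malformed
])
```

Because the malformed feed fails inside the staging step, the records from the other two sources are still available and nothing has been written to the warehouse yet.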
2) Transformation
The second step of the ETL process is transformation. In this step, a set of rules or
functions is applied to the extracted data to convert it into a single standard
format.
It may involve the following processes/tasks:
Filtering – loading only certain attributes into the data warehouse.
Cleaning – filling up the NULL values with some default values, mapping
U.S.A, United States and America into USA, etc.
Joining – joining multiple attributes into one.
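The three tasks above can be illustrated with a short sketch. The record layout, the list of attributes to keep, and the default value "UNKNOWN" are assumptions made for the example; only the country mapping comes from the text.

```python
# Hypothetical staged records; field names are illustrative only.
staged = [
    {"id": "1", "first": "Asha", "last": "Rao", "country": "U.S.A", "fax": "n/a"},
    {"id": "2", "first": "Ben", "last": "Lee", "country": None, "fax": ""},
    {"id": "3", "first": "Chen", "last": "Wu", "country": "America", "fax": ""},
]

KEEP = ("id", "first", "last", "country")  # filtering: attributes to load
COUNTRY_MAP = {"U.S.A": "USA", "United States": "USA", "America": "USA"}
DEFAULT_COUNTRY = "UNKNOWN"  # assumed default value for NULL countries


def transform(record):
    """Apply filtering, cleaning and joining to one staged record."""
    row = {key: record.get(key) for key in KEEP}         # filtering
    country = row["country"] or DEFAULT_COUNTRY          # cleaning: fill NULL
    row["country"] = COUNTRY_MAP.get(country, country)   # cleaning: standardize
    row["full_name"] = f"{row.pop('first')} {row.pop('last')}"  # joining
    return row


transformed = [transform(record) for record in staged]
```

After this step every record has the same three fields (id, country, full_name), which is the "single standard format" the loading step expects.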
3) Loading
The third and final step of the ETL process is loading. In this step, the transformed
data is finally loaded into the data warehouse. Sometimes the data warehouse is
updated very frequently, and sometimes loading is done at longer but regular
intervals. The rate and period of loading depend solely on the requirements and
vary from system to system.
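A minimal sketch of the loading step is given below. It assumes an in-memory SQLite database as a stand-in for the data warehouse and a hypothetical customers table; the "insert or replace" rule is just one possible way to handle repeated loads, not a method prescribed by this module.

```python
import sqlite3

# An in-memory SQLite database stands in for the data warehouse (an assumption).
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, full_name TEXT, country TEXT)"
)


def load(rows):
    """Insert new rows and overwrite rows whose id already exists."""
    with warehouse:  # commits on success, rolls back if an error is raised
        warehouse.executemany(
            "INSERT OR REPLACE INTO customers (id, full_name, country) "
            "VALUES (:id, :full_name, :country)",
            rows,
        )


# A first load, then a later load that updates one record and adds another.
load([{"id": 1, "full_name": "Asha Rao", "country": "USA"}])
load([
    {"id": 1, "full_name": "Asha Rao-Lee", "country": "USA"},
    {"id": 2, "full_name": "Ben Lee", "country": "UNKNOWN"},
])

loaded = warehouse.execute(
    "SELECT id, full_name FROM customers ORDER BY id"
).fetchall()
```

How often load() is called (every few minutes or once a week) is a scheduling decision outside this sketch and depends on the requirements of the system.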
value (metric).
*Stacked bar chart comparing consumer spending across different categories for different
generations
*Overlapping bar chart comparing branch efficiency across locations in terms of people and
profits
2) PIE CHART
Pie charts are extensively used in presentations and offices. Pie Charts help show
proportions and percentages between categories, by dividing a circle into proportional
segments. Each arc length represents a proportion of each category, while the full circle
represents the total sum of all the data, equal to 100%. Pie Charts are ideal for giving the
reader a quick idea of the proportional distribution of the data.
One major disadvantage to using pie charts is that they cannot show more than a few
values, because as the number of values shown increases, the size of each segment/slice
becomes smaller. This makes them unsuitable for large amounts of data.
*Management in U.S. Manufacturing: How many key performance indicators were monitored at
this establishment?
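The arithmetic behind a pie chart can be sketched as follows: each category's share of the total becomes both a percentage and an angle out of 360 degrees. The category counts are hypothetical and serve only to show the calculation.

```python
def pie_slices(counts):
    """Return (label, percent, angle_in_degrees) for each category."""
    total = sum(counts.values())
    return [
        (label, 100 * value / total, 360 * value / total)
        for label, value in counts.items()
    ]


# Hypothetical category counts, for illustration only.
slices = pie_slices({"A": 50, "B": 30, "C": 20})

# With twenty equal categories, every slice shrinks to the same small angle.
many_slices = pie_slices({f"c{i}": 1 for i in range(20)})
```

The second call illustrates the disadvantage noted above: as the number of values grows, each slice gets a smaller angle and becomes harder to read.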
3) DATA TABLES
Data tables display the data in a grid of rows and columns. Each column represents
a dimension or metric, while each row is one record of the data. Tables
automatically summarize the data. Each row in the table displays the summary for
each unique combination of the dimensions included in the table definition. Each
metric in the table is summarized according to the aggregation type for that metric
(sum, average, count, etc.).
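The summarizing behaviour described above can be sketched in a few lines. The records, the dimension names (region, product) and the metric name (sales) are hypothetical; the function only illustrates the idea of one row per unique combination of dimensions, not how any particular tool implements it.

```python
from collections import defaultdict

# Hypothetical records: region and product are dimensions, sales is a metric.
records = [
    {"region": "North", "product": "Tea", "sales": 120},
    {"region": "North", "product": "Tea", "sales": 80},
    {"region": "North", "product": "Coffee", "sales": 200},
    {"region": "South", "product": "Tea", "sales": 50},
]

AGGREGATIONS = {
    "sum": sum,
    "count": len,
    "average": lambda values: sum(values) / len(values),
}


def summarize(rows, dimensions, metric, aggregation):
    """Group rows by their dimension values and aggregate the metric per group."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        groups[key].append(row[metric])
    combine = AGGREGATIONS[aggregation]
    return {key: combine(values) for key, values in groups.items()}


table_sum = summarize(records, ("region", "product"), "sales", "sum")
table_avg = summarize(records, ("region", "product"), "sales", "average")
```

The two North/Tea records collapse into a single table row, and the value shown in that row depends on the aggregation type chosen for the metric.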
Scatter charts can be used to look for relationships between variables. These
charts show the data as points or circles on a graph using X (horizontal) and Y
(vertical) axes. Scatter charts can include a trend line that shows how the
variables in the chart are related. They tend to be used more frequently in
scientific fields. Though infrequent, there are use cases for scatter charts in the
business world as well.
To focus primarily on those cases where cost per mile is above average, a slightly
modified scatter chart can be designed, as shown in figure 2.17.
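As a rough sketch of the two ideas above, the code below fits a least-squares trend line and then keeps only the points whose cost per mile is above average. The mileage and cost figures are hypothetical, and ordinary least squares is assumed here as the trend-line method; the figure in this module may have been produced differently.

```python
def trend_line(xs, ys):
    """Least-squares slope and intercept for the line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: miles driven and cost per mile (in cents) for five vehicles.
miles = [100, 200, 300, 400, 500]
cents_per_mile = [90, 80, 70, 60, 50]

slope, intercept = trend_line(miles, cents_per_mile)

# Keep only the points whose cost per mile is above the average.
average_cost = sum(cents_per_mile) / len(cents_per_mile)
above_average = [
    (m, c) for m, c in zip(miles, cents_per_mile) if c > average_cost
]
```

A negative slope in this made-up data would mean that cost per mile falls as mileage rises; the filtered list contains the points a modified chart would highlight.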
Time series analysis is a statistical technique used to record and analyze data points
over a period of time, such as daily, monthly, yearly, etc. A time series chart is the
graphical representation of the time series data across the interval period.
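The sketch below shows one way to prepare time series data for charting: daily observations are summed into monthly totals, which are then smoothed. The observations are hypothetical, and the moving average is an addition not mentioned in the text, included only as one common way to make a trend easier to see.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily observations as (date, value) pairs.
daily = [
    (date(2024, 1, 5), 10),
    (date(2024, 1, 20), 14),
    (date(2024, 2, 3), 9),
    (date(2024, 2, 18), 15),
    (date(2024, 3, 7), 20),
]


def monthly_totals(observations):
    """Sum values per (year, month), returned in chronological order."""
    totals = defaultdict(int)
    for day, value in observations:
        totals[(day.year, day.month)] += value
    return sorted(totals.items())


def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


series = monthly_totals(daily)
smoothed = moving_average([total for _, total in series], window=2)
```

The list in series holds the points that would be plotted against the time axis, one per month.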
HYPOTHESIS VS PREDICTION
A prediction is a statement that uses existing data to forecast future events. Predictions
can be types of guesses, but they usually come directly from observations.
For example, if a delivery driver comes to your house every day at 2 p.m. for four days
in a row, you might predict that the driver will come the following day at the same
time. Based on your previous observations, your prediction is a likely foretelling of
future behavior.
Here are some example scenarios that can help you better understand hypotheses and
predictions:
Diet example
A teenager notices that a change in their diet has made their skin more oily and prone to
breakouts. They make the following hypothesis and prediction:
Prediction: If I eat healthier food, then my skin will produce less oil.
In this scenario, the independent variable is the person’s diet, and the dependent
variable is their skin. To test their hypothesis, the teenager can change the independent
variable and record the differences this makes on the dependent variable.
Prediction: If tomorrow is sunny and nice, I’ll make more money than I did on
Tuesday.
In this scenario, the weather is the independent variable, and lemonade sales are the
dependent variable. Although she can't control the weather, the girl can test her
hypothesis by recording the temperature and her sales each day to see if she
can establish a correlation that supports her prediction.
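Checking for such a correlation can be sketched with the Pearson correlation coefficient, which ranges from -1 to 1. The temperature and sales figures below are hypothetical, and Pearson's coefficient is an assumed choice of measure; the text does not name one.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical daily records: temperature in degrees Celsius and cups sold.
temperature_c = [18, 22, 25, 29, 31]
cups_sold = [12, 15, 21, 26, 30]

r = pearson(temperature_c, cups_sold)
```

A value of r close to 1 would support the prediction that warmer days bring higher sales, although a correlation on its own does not show that the weather causes the change.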
Gardener example
A gardener notices that when he plants his tomato plants next to marigolds, fewer nematodes
affect the roots of his crops. He creates the following hypothesis and prediction:
Hypothesis: Marigolds are a good companion crop for tomatoes because they reduce
nematodes.
Prediction: If I plant marigolds next to my tomatoes, then I can produce more tomatoes.
In this scenario, the marigolds are the independent variable, and the tomato plants are
the dependent variable. The gardener plants marigolds near some of his tomatoes and
leaves others without a companion crop. To test his hypothesis, he records the
outcomes for the dependent variable to see if his prediction holds true.
DATA ANALYTICS
Data analytics is the process of analyzing data sets in order to draw conclusions
about the information they contain, increasingly with the aid of specialized
software and systems.
The techniques and processes of data analytics have been automated into
mechanical processes and algorithms that work over raw data for human
consumption.
Data visualizations can take on multiple formats and can represent a diversity of
information types and combinations, all of which can impact your ability to understand
what is being represented.
Sentence starters are one way to scaffold students' interpretation of data visuals.
Sentence starters provide a focal point for students to begin writing (or saying) an
interpretation of the data they are viewing in graphical form.
Sentence starters can range in their cognitive demand, moving from identifying
information and patterns in the graph to generating comparisons, predictions,
and hypotheses.
Sentence starters teachers can provide students include:
Source: Figure 3 in Boden Institute, University of Sydney 2014. Evidence Brief Obesity: Sugar-
Sweetened Beverages, Obesity and Health. Australian National Preventive Health Agency,
Canberra.
Hypothesis Formulation Statements
o A different pattern in the graph is that energy drinks go down for 14- to
16-year-olds.
o A reason for this pattern might be because they prefer drinking other drinks.
o The data that most stood out to me was that sports drinks were drunk more than
soft drinks.
Sample Interpretation:
____________________________________________________________________
Hypothesis Statement:
____________________________________________________________________
Prediction:
____________________________________________________________________
Source: Manning, M., Smith, C., & Mazerolle, P. (2013). The estimated societal costs of alcohol misuse in
Australia. Trends and Issues in Crime and Criminal Justice no. 454. Canberra: Australian Institute of Criminology
Sample Interpretation
This graph shows the estimated societal costs of alcohol misuse in Australia. The total estimated
cost exceeds $14 billion. The largest cost relates to productivity, which accounted for 42.1% or
$6.046 billion. Traffic accidents comprised 25.5% or a quarter of the costs ($3.662 billion).
Alcohol misuse had the least cost to the health system, costing $1.686 billion.
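The figures in this interpretation can be cross-checked with a little arithmetic: dividing a cost by its share of the total gives the total that the pair implies. The sketch below uses only the numbers quoted above; the variable names are illustrative.

```python
# Figures quoted in the sample interpretation (costs in billions of dollars).
productivity_cost, productivity_share = 6.046, 0.421
traffic_cost, traffic_share = 3.662, 0.255
health_cost = 1.686

# Each cost divided by its share gives an implied total cost.
total_from_productivity = productivity_cost / productivity_share
total_from_traffic = traffic_cost / traffic_share

# The share of the health system cost, which the interpretation does not state.
health_share = health_cost / total_from_productivity
```

Both implied totals should come out slightly above $14 billion and agree closely with each other, which is consistent with the statement that the total exceeds $14 billion.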
Hypothesis?
Prediction?
Sample Interpretation
This graph shows the number of notified, laboratory-confirmed cases of influenza in
Victoria from 2011 to 2014. Each year, there is a spike in confirmed cases, which begins in June
and lasts until October. This coincides with winter, when people are more likely to be spending
time indoors. The number of cases during the winter spike has also increased each
year. In 2011, the peak number of cases was around 800, while in 2014 the peak number
was just over 3000.
Hypothesis?
Prediction?
Hypothesis?
Prediction?