DataVisualization 1

DATA VISUALIZATION

Visualization:
The process that transforms raw data into meaningful information and
insights through a visual representation.

Data Visualization:
- It is the graphical representation of information and data.
- It is a mapping between the original data (numeric data) and graphic elements
(lines, points).
- It uses visual elements (charts, graphs, maps).
Advantages of Visualization:

1. Simplifies Complex Data: Makes it easier to interpret and understand complex datasets
by presenting information in a clear and concise manner.
2. Enhances Communication: Improves communication by allowing information to be
shared quickly and effectively across diverse audiences.
3. Engages Audience: Visual elements are more engaging and can capture the attention of
the audience better than text or tables alone.
4. Improves Analytical Efficiency: Visual tools can help analysts quickly identify key
insights, reducing the time needed to analyze data.
5. Quick Insights: Allows for faster identification of patterns, trends and outliers.
6. Error Detection: Makes it easier to spot errors in the data that affect analysis.
7. Increases Productivity: Reduces time spent on data analysis and interpretation by making
data insights more immediately apparent.
Introduction to Exploratory Data Analysis:
- The process of examining and understanding the data and extracting insights
from it.
- The process of investigating the dataset to discover patterns and
anomalies and to form hypotheses based on an understanding of the dataset.
- EDA involves generating summary statistics for the numerical data in the
dataset and creating various graphical representations to understand
the data more easily.
- EDA refers to the critical process of performing initial investigations on
data so as to discover patterns and check assumptions with the help of
summary statistics and graphical representations.
EDA involves a combination of the following methods:
• Univariate visualization of, and summary statistics for, each field in the raw dataset.
• Bivariate visualization and summary statistics for assessing the relationship between
each variable in the dataset and the target variable of interest.
• Multivariate visualizations to understand interactions between different fields in
the data.
• Dimensionality reduction to understand the fields in the data that account for the
most variance between observations and to allow processing of a reduced volume of
data.
• Clustering of similar observations in the dataset into differentiated groupings;
by collapsing the data into a few small groups, patterns of behaviour can
be more easily identified.
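The summary-statistics step of univariate EDA can be sketched with Python's standard library alone. The `sales` values below are hypothetical and chosen only for illustration; they include one unusually large value to show how the mean and median can disagree.

```python
import statistics

# Hypothetical sample: one numeric field from a raw dataset.
sales = [12, 15, 11, 14, 13, 15, 90]

# Basic summary statistics for a single (univariate) field.
summary = {
    "count": len(sales),
    "min": min(sales),
    "max": max(sales),
    "mean": statistics.mean(sales),
    "median": statistics.median(sales),
    "stdev": statistics.stdev(sales),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

Here the mean is pulled well above the median by the single large value, which is the kind of anomaly that EDA is meant to surface before any modelling is done.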
DATA VISUALIZATION & VISUAL CODING:

- Data visualization has the power to illustrate complex data
relationships and patterns with the help of simple designs consisting of
lines, shapes and colors.
- Visual encoding is used to map data into visual structures, thereby
building an image on the screen.
Data Visualization can help to:
1. Identify Outliers in Data: Data visualization makes it easy to spot outliers,
the data points that look different from the rest.
Ex. In a chart, an outlier might be a dot that is far away from the other dots, which makes it
quick to see.
2. Enhance Collaboration: Advanced visualization tools make it easier for teams to
go through reports collaboratively for quick decision making.
3. Make Business Analysis Easy: Correct data visualization techniques support sales
prediction, product promotion and the analysis of customer behaviour.
4. Improve Response Time
5. Achieve Greater Simplicity
6. Visualize Patterns More Easily
VISUAL ENCODING:
- Translating the data into visual elements on a chart or map through
position, shape, size, symbol and color.
- It is the way in which data is mapped into visual structures, upon which we
build the images on a screen.
What is the visualization graph supposed to display?

Distribution, Relationship, Comparison, Connection, Composition, Location

Role of data visualization → Visualization graphs (possible illustrative data)
- Distribution → Scatter chart, 3D area chart, Histogram
- Relationship → Bubble chart, Scatter chart
- Comparison → Bar chart, Line chart, Column chart, Area chart
- Composition → Pie chart, Waterfall chart, Stacked column chart, Stacked area chart
- Location → Bubble map, Connection map
- Connection → Matrix chart, Tube map…
To represent data that involves 3 or more variables, retinal variables (the visual
properties of a mark) play a major role. For example:
1. Shape: Circles, ovals, diamonds and rectangles may signify different types of data and are
easily recognized by the eye because of their distinct look.
2. Size: Used for quantitative data, as a smaller size indicates a lower value while a bigger
size indicates a higher value.
3. Color: Saturation decides the intensity of a color and can be used to differentiate visual
elements from their surroundings by displaying different scales of value.
4. Orientation: Vertical, horizontal or slanted orientation helps in signifying data trends such as
an upward or downward trend.
5. Texture: Shows differentiation among data and is mainly used for data comparison.
6. Angle: Provides a sense of proportion, and this characteristic can help the
analyst or data scientist make better data comparisons.
SOFTWARE USED FOR DATA VISUALIZATION:
1. Tableau:
- It is a visual analytics platform. It easily connects to, visualizes and shares data with
a seamless experience from desktop to mobile.
- Features: DB integration, drag-and-drop interface, email integration, dashboard creation,
mobile friendly.
2. QlikView:
- A data discovery and visualization tool that focuses on providing deep insights through
associative data models, allowing users to explore data in a non-linear way.
- Features: Personalized data search, script building, role-based access.
3. Microsoft Power BI:
- Provides access to on-site and in-cloud data through a centralized data
access hub, enabling users to create and share interactive reports and dashboards with ease.
- Features: Unlimited connectivity
DATA VISUALIZATION LIBRARIES:
- Python has a rich set of libraries for practically every kind of data visualization.
- Using a particular library for a specific task helps the user complete the task more
easily and accurately.
- Each visualization library has its own specification.
- Python libraries → matplotlib, seaborn, ggplot, plotly, geoplotlib, gleam.
Matplotlib Library:
- Used for plotting 2D data visualizations.
- Used to generate simple yet powerful visualizations.
- Mainly used for creating plots that can be zoomed in on a section of the plot.
- It was one of the first data visualization libraries developed for Python.
- Can be used to make visualization types such as scatter plots, bar charts,
histograms, line plots, pie charts and stem plots.
- The library allows easy use of labels, axes, titles, grids and other graphic requirements
with customizable values and text.
Seaborn Library:
- This library builds on the power of the matplotlib library to create artistic charts
with very few lines of code.
- It offers creative styles and rich color palettes, which make
visualization plots more attractive and modern.
- It is a popular data visualization library built on top of Matplotlib.
- Many visualization graphs today are plotted in seaborn rather than matplotlib
because of its rich color palettes and graphic styles.
ggplot Library:
- This library is based on ggplot2, an R plotting system, and its concepts are based on the
Grammar of Graphics.
- It builds plots from layers of components, which makes its way of plotting graphs
different from matplotlib.
- This library is integrated with pandas and is mainly used for creating very simple
graphics.
- It is not designed for highly customized graphics. It offers a simpler
method of plotting with less complexity.
Bokeh Library:
- Mainly used to create interactive, web-ready plots, which can be easily output as
HTML documents, JSON objects or interactive web applications.
- Advantage → it supports real-time data streaming.
- Provides methods for creating common charts such as histograms, bar plots and box plots.
- Also handles very fine details of a graph, such as an individual dot of a scatter plot.
- With it, the user can control and define every element of the chart without
relying on any default values and designs.
- Allows you to build charts and graphs that you can zoom and hover over, making it
easy to explore data directly in a web browser.
plotly Library:
- It is an online platform for data visualization.
- It can be used to make interactive plots that are hard to produce with many other Python
libraries.
- Such plots include dendrograms, contour plots and 3D charts.
- We can also create → area charts, bar charts, box plots, histograms, polar charts,
bubble charts.
- Graphs are saved not as images but as JSON, because of which graphs can be
opened and viewed with other applications such as R, MATLAB etc.

pygal Library:
- It creates interactive plots that can be embedded in the web browser.
- It can also output charts as SVG (Scalable Vector Graphics).
- All chart types are packed into methods, which makes it easy to create an
attractive chart in a few lines of code.
- To create a bar chart → import the pygal library and assign the value
of pygal.Bar() to a variable.
Geoplotlib Library:
- It is a toolbox for designing maps and plotting geographical data.
- Map types created are heatmaps, dot-density maps and choropleths.
- The library is mainly used for drawing maps, as it is designed specifically for
creating map graphics.
BASIC DATA VISUALIZATION TOOLS
Histogram:
- Graphical display of data using bars of different heights.
- Shows an accurate representation of the distribution of numeric data.
- A histogram uses a ‘bin’ for each set or range of values to be counted.
- To make a histogram with matplotlib, we can use the plt.hist() function.
- The first argument is the numeric data and the second argument is the number of bins
(the default number of bins is 10).
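As a minimal sketch of the plt.hist() call described above: the `scores` list is hypothetical data, and the choice of 5 bins is arbitrary. The Agg backend line is an assumption added so the sketch runs without a display; it can be dropped in an interactive session.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt

# Hypothetical sample: exam scores for 20 students.
scores = [35, 42, 48, 51, 55, 58, 60, 62, 64, 65,
          67, 68, 70, 72, 75, 78, 81, 85, 90, 96]

# First argument: the numeric data. bins: number of equal-width ranges
# (it can also be passed positionally as the second argument).
counts, bin_edges, patches = plt.hist(scores, bins=5, edgecolor="black")

plt.xlabel("Score")
plt.ylabel("Number of students")
plt.title("Distribution of exam scores")
# In an interactive session, plt.show() would display the figure here.
```

plt.hist() returns the count in each bin and the bin edges, so the same call can be used to inspect the distribution numerically as well as visually.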
Bar Charts / Graphs:
- Used for comparing quantities of different categories or groups.
- The values of each category are represented with the help of bars.
- They can be configured with vertical or horizontal bars representing the values.
- The major difference between a bar chart and a histogram → there are gaps between the bars in a
bar chart, but in a histogram the bars are placed adjacent to each other.
- A histogram displays the frequency of numerical data, while a bar chart uses bars to compare
different categories of data.
- A histogram is used for quantitative data and a bar chart is used for qualitative data.
- bar() displays the bars vertically and barh() displays them horizontally.
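The difference between bar() and barh() can be sketched side by side. The category names and values below are hypothetical, and the Agg backend line is an assumption so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt

# Hypothetical categorical data: units sold per product.
categories = ["Apples", "Bananas", "Cherries"]
units_sold = [30, 45, 12]

fig, (ax_vertical, ax_horizontal) = plt.subplots(1, 2, figsize=(8, 3))

# bar(): categories along the x-axis, bar height encodes the value.
vertical_bars = ax_vertical.bar(categories, units_sold)
ax_vertical.set_ylabel("Units sold")

# barh(): categories along the y-axis, bar width encodes the value.
horizontal_bars = ax_horizontal.barh(categories, units_sold)
ax_horizontal.set_xlabel("Units sold")
```

Horizontal bars tend to be the more readable choice when category labels are long.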
Line Plot:
- It is a two-dimensional plot of values connected in order.
- In a line chart, values are displayed in an ordered manner and connected.
- Frequently used to show trends and analyze how the data has changed over time.
- To make a line plot with matplotlib we call plt.plot().
- First argument → data on the horizontal axis, second argument → data on the vertical axis.
- To display the plot, we need to call the plt.show() function.
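A minimal sketch of the plt.plot() call, using hypothetical monthly revenue figures. The Agg backend line is an assumption so the sketch runs without a display; in an interactive session it can be removed and plt.show() called at the end.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt

# Hypothetical time series: revenue per month.
months = [1, 2, 3, 4, 5, 6]
revenue = [10.2, 11.5, 11.1, 12.8, 13.4, 15.0]

# First argument: horizontal axis. Second argument: vertical axis.
lines = plt.plot(months, revenue, marker="o")

plt.xlabel("Month")
plt.ylabel("Revenue")
plt.title("Revenue over time")
# In an interactive session, plt.show() would display the figure here.
```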
Scatter Plot:
- A two-dimensional chart showing a comparison of two variables scattered across
two axes.
- A scatter plot is also known as an XY chart.
- A scatter plot shows all individual data points. Here they aren’t connected with
lines.
- Used to display trends and the correlation between two variables.
- To make a scatter plot → plt.scatter() function (1st argument → horizontal axis and
2nd argument → vertical axis).
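A minimal sketch of plt.scatter(), using hypothetical data on study hours and exam scores. The Agg backend line is an assumption so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt

# Hypothetical paired observations: one point per student.
hours_studied = [1, 2, 3, 4, 5, 6, 7, 8]
exam_scores = [52, 55, 61, 64, 70, 74, 79, 85]

# First argument: horizontal axis. Second argument: vertical axis.
points = plt.scatter(hours_studied, exam_scores)

plt.xlabel("Hours studied")
plt.ylabel("Exam score")
plt.title("Exam score vs. hours studied")
```

Each pair of values becomes one unconnected marker, so an upward drift of the markers from left to right suggests a positive relationship between the two variables.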
Pie-chart:
- Shows the proportion or percentage of a data element in a circular format.
- A circular chart split into slices based on the values of the data.
- Slices represent “part-of-whole” data.
- The overall sum of the pie is 100%.
- Widely used in the business world.
- Slices are difficult to compare precisely, so in that case we use a bar chart.
Donut Chart:
- It is an extension of the pie chart.
- The center of a donut chart is empty, leaving room to showcase additional data/metrics.
- A donut chart is more space efficient than a pie chart.
- The inner space is used to display a percentage or any other information related to it.
- Since the slices are not pie shaped, the analyst can focus on arc lengths rather
than on slice areas.
Basic Visualization Rules:
➢Choose an appropriate plot type.
➢If there are several options, compare them and choose the best one.
➢Once you have chosen the type of plot, label your axes.
➢Add a title to make the plot more informative.
➢Add labels for different categories.
➢In some cases, use color and marker size to make the plot more
informative.
Specialized Data Visualization Tools:
Box Plot:
- Commonly used in business and professional settings and extensively in data science.
- Used to show the distribution of two or more data elements in a summarized
manner.
- Gives a good indication of how the values in the data are spread out.
- The median is the value that separates the higher half of the data from the lower half.
- There can be 2 middle numbers (in the case of an even number of values), so add those
two numbers and divide by 2.
Ex. 1,2,5,6,8,9 → (5+6)/2 = 5.5
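The median rule above can be checked with Python's standard library. The quartiles are included as a sketch of the other values a box plot summarizes; note that quartile conventions differ between tools, so the figures drawn by a plotting library may not match `statistics.quantiles` exactly.

```python
import statistics

values = [1, 2, 5, 6, 8, 9]

# Even number of values: the median is the mean of the two middle numbers.
median = statistics.median(values)  # (5 + 6) / 2 = 5.5

# Quartiles split the sorted data into four parts; the second quartile
# is the median. This uses the function's default "exclusive" method.
q1, q2, q3 = statistics.quantiles(values, n=4)

print("median:", median)
print("quartiles:", q1, q2, q3)
```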
Bubble Plots:
- It is a variation of the scatter chart in which the data points are replaced with bubbles
and an additional dimension of the data is represented by the size of the bubbles.
- It is a scatter plot to which a 3rd dimension is added.
- We need 3 numerical variables → (1st represented by the X-axis, 2nd represented by the
Y-axis, 3rd by the dot size).
- While plotting, ensure smaller dots stay visible when they overlap with larger
dots, either by placing smaller dots above larger dots or by making the larger
dots transparent.
- Typically used to compare and
show relationships between categorised circles.
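A bubble plot can be sketched with plt.scatter() by passing the third variable as the marker size. The data below is hypothetical, and the sketch applies both visibility tips at once: larger bubbles are drawn first and all bubbles are semi-transparent. The Agg backend line is an assumption so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt

# Hypothetical data: (ad spend, revenue, number of customers) per product.
products = [(10, 40, 300), (20, 55, 1200), (30, 50, 150), (40, 90, 800)]

# Sort by the third variable, largest first, so that smaller bubbles
# are drawn later and end up on top of larger ones.
products_by_size = sorted(products, key=lambda p: p[2], reverse=True)

x = [p[0] for p in products_by_size]
y = [p[1] for p in products_by_size]
sizes = [p[2] for p in products_by_size]

# s: marker area in points squared. alpha: transparency of the bubbles.
bubbles = plt.scatter(x, y, s=sizes, alpha=0.5)

plt.xlabel("Ad spend")
plt.ylabel("Revenue")
plt.title("Revenue vs. ad spend (bubble size = customers)")
```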
Heat Map:
- It is a tool to show the magnitude of data elements using colors.
- The intensity of the colors is shown in a two-dimensional manner, showing how closely
the elements are correlated.
- A heat map uses color the way a bar graph uses
height and width.
- It visualizes data through variations in coloring.
- Useful for cross-examining multivariate data by placing variables in the
rows and columns and coloring the cells within the table.
- Colors play a major role in displaying a heatmap.
- The most commonly used color scheme is the warm-to-cool color scheme
(warm → high-value data points, cool → low-value data points).
- Heatmaps can measure a website’s performance, simplify numerical data and help
understand visitors’ thinking.
- Netflix is often given as an example of a digital business that uses heatmaps.
Dendrogram:
- A diagram that shows the hierarchical relationship between objects.
- Commonly created as an output of hierarchical clustering.
- Its main use is to work out the best way to allocate objects to clusters.
- A dendrogram is a diagram representing a tree.
- Mainly used for the visual representation of hierarchical clustering, to illustrate the
arrangement of the various clusters.
Venn Diagram:
- It is also called a primary diagram.
- It shows all possible logical relationships between a
finite collection of different sets.
- Each set is represented by a circle.
- Groups usually overlap; the size of the overlap represents the intersection
between the groups.
- A Venn diagram is most commonly used to:
1. Visually organize information to quickly grasp the relationships between
datasets and identify differences or commonalities.
2. Compare and contrast two or more choices to identify the overlapping
elements and clearly see how they fare against each other.
3. Find correlations and predict probabilities when comparing datasets.
Treemap:
- A visualization that displays hierarchically organized data as a set of nested
rectangles, where the sizes and colors of the rectangles are proportional to the values of
the data points they represent.
- It is a technique for large, hierarchical data sets.
- It is a tool that can be used to break down the relationships between multiple
variables in the data.
3D scatter Plots:
- It is becoming increasingly common to visualize 3D data by adding a third
dimension to a scatter plot.
- 3D scatter plots are used to plot data points on three axes in an attempt to
show the relationship between three variables.
- Each row in the data table is represented by a marker whose position depends on its
values in the columns set on the X, Y and Z axes.
- A 4th variable can be set to correspond to the color or size of the markers.
- The relationship between different variables is called correlation.
- If the markers are close to forming a straight line in any direction in the
three-dimensional space of the 3D scatter plot, the correlation between the corresponding
variables is high.
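The "close to a straight line" idea can be made numeric with the Pearson correlation coefficient. The notes do not name a specific measure, so Pearson is an assumption here. The sketch below implements it with the standard library only; `pearson` is a hypothetical helper name and the three variables are made-up data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

# Hypothetical columns placed on the X, Y and Z axes.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]   # lies exactly on a straight line with x
z = [5, 3, 6, 2, 4]    # no clear linear pattern with x

print("x vs y:", pearson(x, y))
print("x vs z:", pearson(x, z))
```

A coefficient close to +1 or -1 corresponds to markers lying near a straight line, while a value near 0 corresponds to a diffuse cloud of markers.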
ADVANCED DATA VISUALIZATION TOOL: WORD CLOUDS
- A word cloud is a text visualization that displays the most
used words in a text at sizes from small to large, according to how often each appears.
- How a word cloud works: the more often a specific word appears in a source of textual data,
the bigger and bolder it appears in the word cloud.
- A word cloud is a collection or cluster of words depicted in different sizes.
- It is often used for aesthetic purposes to depict categorical data.
Limitations of word clouds:
1. Lack of context: Word clouds show the frequency of words but don't provide
context, so it's hard to understand how words are used or their sentiment
(positive or negative).
2. Misleading Emphasis: Common but unimportant words (e.g., "the," "and") may
dominate the cloud if not filtered out, leading to misleading conclusions.
3. No Deep Insights: They provide a quick overview but don't offer detailed insights or
analysis.
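The word frequencies behind a word cloud, and the "misleading emphasis" limitation, can be sketched with the standard library. The sample text, the stop-word list and the font-size formula are all hypothetical choices made for illustration; a real word-cloud tool would handle layout and scaling itself.

```python
import re
from collections import Counter

# Hypothetical source text.
text = ("The chart shows the data, and the data shows the trend "
        "in the sales data.")

# Hypothetical stop-word list: common words that carry little meaning.
STOP_WORDS = {"the", "and", "in", "a", "of", "to"}

words = re.findall(r"[a-z]+", text.lower())

# Without filtering, a common word dominates the counts.
raw_counts = Counter(words)

# With filtering, the meaningful words come to the front.
filtered_counts = Counter(w for w in words if w not in STOP_WORDS)

# Illustrative scaling only: more frequent words get a larger font size.
font_sizes = {word: 10 + 6 * count for word, count in filtered_counts.items()}

print(raw_counts.most_common(3))
print(filtered_counts.most_common(3))
```

Before filtering, "the" is the most frequent word; after filtering, "data" takes its place, which is the word a reader would actually expect to see largest.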
VISUALIZATION OF GEOSPATIAL DATA:
- Geospatial data consists of numeric data that denotes the geographic coordinates
(latitude, longitude and elevation) of the geographical location of an object
such as a building.
- Geodata gives information about the location, size, area and shape of a physical
object.
- Maps are the primary focus of geospatial data visualizations.
- Geospatial visualizations highlight the physical connection between data points.
- Python is well suited to dealing with geospatial data, and many libraries
have been developed in Python for this purpose.
Standard Python libraries commonly used for geospatial data are:
1. shapely Python Library: Used to create geometric objects such as a square, a
polygon or even a point. Geometric calculations such as finding an area or
an intersection can also be handled by this library.
2. geopandas Python Library: More powerful than the shapely library. It can not only
create geometric objects but also read and write vector file formats.
3. gdal (Geospatial Data Abstraction Library): Originally written in C and often
used with geospatial data. It presents a single abstract data model to calling
applications for all supported formats.
4. fiona Library: Used for reading and writing vector file formats such as shapefiles.
5. rasterio Library: Used to handle raster data; it also handles transformation of
coordinate reference systems. This library uses the matplotlib library to plot data for
analysis.
Three tools used in Python for creating geospatial data visualizations:
Choropleth Map, Bubble Map, Connection Map

1. Choropleth Map:
- These maps contain areas that are shaded or patterned in proportion to the statistical
variable being displayed on the map.
- The map contains partitioned geographical regions or areas.
The areas are colored, shaded or patterned
in relation to a data variable.
- It is a filled map.
There are mainly two elements required to build a choropleth map:
• A shape file that gives the boundaries of every zone to be represented on the map.
• A data frame that gives the value of each zone.
- The data variable uses a color progression to represent itself in each region of the map.
Connection Maps:
- Drawn by connecting points placed on a map with straight or curved lines.
- Connection maps show the connections between several positions on a map.
- They are used to display a network combined with geographical data.
- Two positions on the map are connected via lines and the positions are marked by
circles.
- A connection map can be created in Python by using the
basemap library. Each route indicates the shortest
route between two positions.
Data Visualization Types:
(category of data, description, data visualization graphs)

Temporal data
- Description: data visualizations that are linear and one-dimensional.
- Graphs: scatter plot, line chart, time series sequence.

Hierarchical data
- Description: data visualizations having ordered groups within larger groups, each group
or cluster of information flowing from a single point of origin.
- Graphs: dendrogram, ring charts, tree maps.

Network data
- Description: data visualizations showing relationships among data within a network.
- Graphs: matrix chart, node-link diagram, word cloud, tube map.

Multidimensional data
- Description: data visualizations having multiple dimensions.
- Graphs: scatter plot, pie chart, Venn diagram, histogram.

Multivariate data
- Description: data visualizations having more than 2 variables to be studied under a
single observation.
- Graphs: 3D scatter plot, scatter plot matrix.

Geospatial data
- Description: data visualizations that relate to real-life physical locations, overlaying
familiar maps with different data points.
- Graphs: bubble map, choropleth map, heat map.
