DataVisualization 1
Data Visualization
Visualization:
The process of transforming raw data into meaningful information and
insights through a visual representation.
Data Visualization:
- It is the graphical representation of information and data.
- It maps the original (numeric) data to graphic elements
(lines, points).
- It uses visual elements (charts, graphs, maps).
Advantages of Visualization:
1. Simplifies Complex Data: Visuals make complex datasets easier to interpret and
understand by presenting information clearly and concisely.
2. Enhances Communication: Visuals allow information to be
shared quickly and effectively across diverse audiences.
3. Engages Audience: Visual elements are more engaging and can capture the attention of
the audience better than text or tables alone.
4. Improves Analytical Efficiency: Visual tools can help analysts quickly identify key
insights, reducing the time needed to analyze data.
5. Quick Insights: Allows faster identification of patterns, trends, and outliers.
6. Error Detection: Makes it easier to spot errors in the data that could affect analysis.
7. Increases Productivity: Reduces the time spent on data analysis and interpretation by
making insights more immediately apparent.
Introduction to Exploratory Data Analysis (EDA):
- The process of examining the data to understand it and extract insights
from it.
- The process of investigating the dataset to discover patterns and
anomalies, and to form hypotheses based on that understanding.
- EDA involves generating summary statistics for the numerical data in the
dataset and creating various graphical representations to understand
the data more easily.
- EDA is the critical process of performing an initial investigation of the
data to discover patterns and check assumptions with the help of
summary statistics and graphical representations.
EDA involves a combination of the following methods:
• Univariate visualization of, and summary statistics for, each field in the raw dataset.
• Bivariate visualization and summary statistics for assessing the relationship between
each variable in the dataset and the target variable of interest.
• Multivariate visualizations to understand interactions between different fields in
the data.
• Dimensionality reduction to identify the fields that account for the
most variance between observations and to allow processing of a reduced volume
of data.
• Clustering of similar observations in the dataset into differentiated groupings;
by collapsing the data into a few small groups, patterns of behaviour can
be more easily identified.
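The first two methods above can be sketched with the standard library alone. This is a minimal illustration, not a full EDA workflow: the toy `hours` and `score` fields and the helper names `summarize` and `pearson` are hypothetical and chosen only for this example.

```python
import statistics

# Hypothetical toy dataset: one feature (hours studied) and one target (exam score).
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
score = [52.0, 55.0, 61.0, 64.0, 70.0, 76.0]


def summarize(values):
    """Univariate summary statistics for one numeric field."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def pearson(xs, ys):
    """Bivariate summary: Pearson correlation between a field and the target."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


hours_summary = summarize(hours)
r = pearson(hours, score)
print(hours_summary)
print(f"correlation(hours, score) = {r:.3f}")
```

In practice the same numbers usually come from a data-frame library; the point here is only what "summary statistics" and "relationship with the target" mean.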
DATA VISUALIZATION & VISUAL CODING:
Dendrograms:-
pygal Library:
- It creates interactive plots that can be embedded in a web browser.
- It can also output charts as SVG (Scalable Vector Graphics).
- Each chart type is available through its own constructor, which makes it easy to create an
attractive chart in a few lines of code.
- To create a bar chart, import the pygal library and assign pygal.Bar() to a
variable.
Geoplotlib Library:
- It is a toolbox for designing maps and plotting geographical data.
- Map types it can create include heatmaps, dot-density maps, and choropleths.
- The library is used mainly for drawing maps; it is one of the few Python libraries
designed specifically for map graphics.
BASIC DATA VISUALIZATION TOOLS
Histogram:
- A graphical display of data using bars of different heights.
- Shows the distribution of numeric data.
- A histogram groups values into 'bins', each covering a range of values; the height of a
bar shows how many values fall in that bin.
- To make a histogram with matplotlib, use the plt.hist() function.
- The first argument is the numeric data and the second argument is the number of bins
(the default number of bins is 10).
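A minimal sketch of the plt.hist() call described above. The data are made-up random values, and the Agg backend line is an assumption so that the sketch runs without a display.

```python
import random

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

random.seed(0)  # reproducible made-up data
data = [random.gauss(50, 10) for _ in range(200)]

# First argument: the numeric data; second argument: the number of bins.
counts, edges, patches = plt.hist(data, 10)

plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Distribution of 200 sample values")
```

plt.hist() returns the per-bin counts and the bin edges, so the binning can be inspected as well as drawn. With 10 bins there are 11 edges.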
Bar Charts / Graphs:
- Used for comparing quantities across different categories or groups.
- The value of each category is represented by a bar.
- Bars can be drawn vertically or horizontally.
- The major difference between a bar chart and a histogram: there are gaps between bars in a
bar chart, but in a histogram the bars are placed adjacent to each other.
- A histogram displays the frequency of numerical data; a bar chart uses bars to compare
different categories of data.
- A histogram is used for quantitative data and a bar chart for qualitative data.
- bar() draws the bars vertically and barh() draws them horizontally.
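A short sketch comparing matplotlib's bar() and barh() on the same data. The category names and values are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical categorical data: units sold per product.
categories = ["Apples", "Bananas", "Cherries"]
values = [12, 30, 21]

fig, (ax_vertical, ax_horizontal) = plt.subplots(1, 2, figsize=(8, 3))

vertical = ax_vertical.bar(categories, values)       # bars grow upward
ax_vertical.set_ylabel("Units sold")

horizontal = ax_horizontal.barh(categories, values)  # bars grow to the right
ax_horizontal.set_xlabel("Units sold")

fig.suptitle("bar() versus barh()")
```

In the vertical chart the value is the bar height. In the horizontal chart the same value becomes the bar width.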
Line Plot:
- A two-dimensional plot of values connected in order.
- In a line chart, values are displayed in order and connected by line segments.
- Frequently used to show trends and analyze how data has changed over time.
- To make a line plot with matplotlib, call plt.plot().
- The first argument is the data for the horizontal axis; the second is the data for the
vertical axis.
- To display the plot, call the plt.show() function.
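A minimal plt.plot() sketch with invented monthly figures. The Agg backend is set only so that the example runs without a display; with an interactive backend, finish with plt.show().

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical time series: sales per month.
months = [1, 2, 3, 4, 5, 6]
sales = [10, 12, 9, 15, 18, 17]

# First argument: horizontal axis; second argument: vertical axis.
lines = plt.plot(months, sales, marker="o")

plt.xlabel("Month")
plt.ylabel("Sales")
plt.title("Sales over time")
```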
Scatter Plot:
- A two-dimensional chart that compares two variables by plotting points across
two axes.
- A scatter plot is also known as an XY chart.
- A scatter plot shows all individual data points; they are not connected with
lines.
- Used to display trends and the correlation between two variables.
- To make a scatter plot, use the plt.scatter() function (first argument: horizontal axis;
second argument: vertical axis).
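A minimal plt.scatter() sketch. The height and weight values are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical paired measurements.
heights_cm = [150, 158, 163, 170, 175, 182, 188]
weights_kg = [50, 55, 61, 66, 72, 80, 85]

# First argument: horizontal axis; second argument: vertical axis.
points = plt.scatter(heights_cm, weights_kg)

plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.title("Height versus weight")
```

Each pair becomes one unconnected marker, so an upward drift of the markers suggests a positive correlation.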
Pie-chart:
- Shows the proportion or percentage of each data element in a circular format.
- The circle is split into slices based on the values of the data.
- Each slice represents a part of the whole.
- The slices sum to 100%.
- Widely used in the business world.
- Slices are hard to compare precisely; when comparison matters, use a bar chart instead.
Donut Chart:
- It is an extension of the pie chart.
- The center of a donut chart is empty, leaving room to show additional data or metrics.
- A donut chart is more space-efficient than a pie chart.
- The inner space can display a percentage or any other related information.
- Since the slices are not pie-shaped, the analyst can focus on arc lengths rather
than slice areas.
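A sketch of a pie chart and a donut chart drawn from the same made-up budget shares. One common way to get a donut in matplotlib is to give the wedges a width smaller than the radius; the labels and values are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical parts of a whole (they add up to 100).
labels = ["Rent", "Food", "Travel", "Other"]
shares = [40, 30, 20, 10]

fig, (ax_pie, ax_donut) = plt.subplots(1, 2, figsize=(8, 4))

# Pie chart: full slices, with the percentage printed on each one.
ax_pie.pie(shares, labels=labels, autopct="%1.0f%%")
ax_pie.set_title("Pie chart")

# Donut chart: same data, but each wedge is only 0.4 of the radius wide.
ax_donut.pie(shares, labels=labels, wedgeprops={"width": 0.4})
ax_donut.text(0, 0, "Total\n100%", ha="center", va="center")  # use the empty centre
ax_donut.set_title("Donut chart")

wedges = ax_donut.patches  # the wedge patches drawn on the donut axes
```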
Basic Visualization Rules:
➢Choose an appropriate plot type.
➢If there are several options, compare them and choose the best one.
➢Once you have chosen the type of plot, label your axes.
➢Add a title to make the plot more informative.
➢Add labels for different categories.
➢Where it helps, use color and marker size to make the plot more
informative.
Specialized Data Visualization Tools:
Box Plot:
- Commonly used in business and professional settings, and extensively in data science.
- Used to show the distribution of two or more data elements in a summarized
manner.
- Gives a good indication of how the values in the data are spread out.
- The median is the value that separates the higher half of the data from the lower half.
- With an even number of values there are two middle numbers; add those
two numbers and divide by 2.
Ex. 1,2,5,6,8,9 → (5+6)/2 = 5.5
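The worked example can be checked with Python's statistics module. The quartiles are included because a box plot summarizes them together with the median; the "inclusive" method is an assumption, since the notes do not say which quartile convention is used.

```python
import statistics

values = [1, 2, 5, 6, 8, 9]

# Even number of values: the median is the mean of the two middle numbers.
median = statistics.median(values)
print(median)  # 5.5

# Quartiles mark the edges of the box in a box plot.
# Different conventions give slightly different Q1/Q3; "inclusive" is one of them.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the box
print(q1, q2, q3, iqr)
```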
Bubble Plots:
- A variation of the scatter chart in which the data points are replaced with bubbles
and an additional dimension of the data is represented by the size of the bubbles.
- It is a scatter plot with a third dimension added.
- It needs three numerical variables: the first is represented by the X-axis, the second by
the Y-axis, and the third by the dot size.
- While plotting, ensure smaller dots remain visible when they overlap with larger
dots, either by placing smaller dots above larger dots or by making the larger
dots transparent.
- Typically used to compare and
show relationships between categorised circles.
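A bubble plot can be sketched with plt.scatter() by passing the third variable as the marker size. The example applies both visibility tricks from the notes: larger bubbles are drawn first so smaller ones land on top, and all bubbles are semi-transparent. The data are invented.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical data: three numeric variables per observation.
x = [1, 2, 3, 4, 5]
y = [10, 14, 9, 16, 12]
size = [300, 1200, 80, 600, 2000]  # third variable, used as marker area in points^2

# Markers are drawn in the order given, so sort by size (largest first)
# to keep the smaller bubbles on top of the larger ones.
order = sorted(range(len(size)), key=lambda i: size[i], reverse=True)
xs = [x[i] for i in order]
ys = [y[i] for i in order]
sizes = [size[i] for i in order]

bubbles = plt.scatter(xs, ys, s=sizes, alpha=0.5)  # alpha makes bubbles see-through

plt.xlabel("First variable")
plt.ylabel("Second variable")
plt.title("Bubble plot: size encodes a third variable")
```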
Heat Map:
- A tool that shows the magnitude of data elements using colors.
- Color intensity is laid out in two dimensions, showing how closely
the elements are correlated.
- A heat map is a data visualization technique that uses color the way a bar graph uses
height and width.
- It visualizes data through variations in coloring.
- Useful for cross-examining multivariate data by placing variables in the
rows and columns and coloring the cells within the table.
- Color plays a major role in a heat map.
- The most commonly used color scheme is warm-to-cool
(warm → high-value data points, cool → low-value data points).
- On websites, heat maps are used to measure page performance, simplify numerical data,
and understand visitors' thinking.
- Netflix is a commonly cited example of a digital business that uses heat maps.
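A sketch of a correlation heat map: variables go in both the rows and the columns, and each cell is colored by the correlation between the pair. The data are synthetic, the variable names are hypothetical, and "coolwarm" is just one warm-to-cool colormap that matplotlib provides.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

names = ["A", "B", "C"]  # hypothetical variable names

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3))
# Make column B depend strongly on column A so the map has something to show.
data[:, 1] = 0.8 * data[:, 0] + 0.2 * data[:, 1]

# Correlation matrix: one row and one column per variable.
corr = np.corrcoef(data, rowvar=False)

fig, ax = plt.subplots()
image = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)  # warm = high, cool = low
ax.set_xticks(range(len(names)))
ax.set_xticklabels(names)
ax.set_yticks(range(len(names)))
ax.set_yticklabels(names)
fig.colorbar(image, ax=ax, label="Correlation")
ax.set_title("Correlation heat map")
```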
Dendrogram:
- A diagram that shows the hierarchical relationship between objects.
- Commonly created as an output of hierarchical clustering.
- Its main use is to work out the best way to allocate objects to clusters.
- A dendrogram is a diagram representing a tree.
- Mainly used as a visual representation of hierarchical clustering to illustrate the
arrangement of the clusters.
Venn Diagram:
- It is also called a primary diagram.
- It shows all possible logical relationships between a
finite collection of different sets.
- Each set is represented by a circle.
- The circles usually overlap; the overlapping region represents the intersection
between the groups.
- Venn diagrams are most commonly used to:
1. Visually organize information to quickly grasp the relationships between
datasets and identify differences or commonalities.
2. Compare and contrast two or more choices to identify the overlapping
elements and clearly see how they fare against each other.
3. Find correlations and predict probabilities when comparing datasets.
Treemap:
- A visualization that displays hierarchically organized data as a set of nested
rectangles whose sizes and colors are proportional to the values of
the data points they represent.
- It is a technique for large, hierarchical data sets.
- It is a tool that can be used to break down the relationships between multiple
variables in the data.
3D Scatter Plots:
- It is becoming increasingly common to visualize 3D data by adding a third
dimension to a scatter plot.
- 3D scatter plots place data points on three axes to
show the relationship between three variables.
- Each row in the data table is represented by a marker whose position depends on its
values in the columns assigned to the X, Y, and Z axes.
- A fourth variable can be mapped to the color or size of the markers.
- The relationship between different variables is called correlation.
- If the markers lie close to a straight line in any direction in the
three-dimensional space of the 3D scatter plot, the correlation between the corresponding
variables is high.
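A sketch of a 3D scatter plot with matplotlib, with a fourth variable mapped to marker color. It assumes a reasonably recent matplotlib in which projection="3d" is available without extra imports; all values are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical data table: each row has values for X, Y, Z and a fourth variable.
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, 12]
z = [1.1, 2.0, 2.9, 4.2, 5.0, 6.1]
fourth = [10, 20, 15, 30, 25, 40]  # shown as marker color

fig = plt.figure()
ax = fig.add_subplot(projection="3d")

markers = ax.scatter(x, y, z, c=fourth, cmap="viridis")
ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.set_zlabel("Z")
fig.colorbar(markers, ax=ax, label="Fourth variable")
```

Here the markers lie close to a straight line through the 3D space, which is the pattern the notes associate with high correlation.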
ADVANCED DATA VISUALIZATION TOOL: WORD CLOUDS
- A word cloud is a visualization that displays the most
used words in a text, sized from small to large according to how often each appears.
- How it works: the more often a specific word appears in a source of textual data,
the bigger and bolder it appears in the word cloud.
- A word cloud is a collection or cluster of words depicted in different sizes.
- It is often used for aesthetic purposes to depict categorical data.
Limitations of word clouds:
1. Lack of context: Word clouds show the frequency of words but don't provide
context, so it's hard to understand how words are used or their sentiment
(positive or negative).
2. Misleading Emphasis: Common but unimportant words (e.g., "the," "and") may
dominate the cloud if not filtered out, leading to misleading conclusions.
3. No Deep Insights: Word clouds provide a quick overview but don't offer detailed
insights or analysis.
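The counting step behind a word cloud, including the stop-word filtering mentioned in limitation 2, can be sketched with the standard library. The stop-word set, the sample sentence, and the font-size formula are all hypothetical; a real word-cloud tool would also handle layout and drawing.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "and", "a", "of", "to", "in", "is"}

text = "The data and the chart: data tells the story, and the chart shows the data."

# Lower-case the text and split it into words.
words = re.findall(r"[a-z']+", text.lower())

# Count only the words that are not stop words.
freq = Counter(word for word in words if word not in STOP_WORDS)

# Map each count to a font size: the most frequent word gets the largest size.
max_count = max(freq.values())
font_size = {word: 10 + 30 * count / max_count for word, count in freq.items()}

print(freq.most_common(3))
print(font_size)
```

Without the filter, "the" (five occurrences) would outrank "data" (three) and dominate the cloud.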
VISUALIZATION OF GEOSPATIAL DATA:
- Geospatial data consists of numeric data that gives the position of an object, such as
a building, in a geographic coordinate system (latitude, longitude, and elevation).
- Geodata gives information about the location, size, area, and shape of a physical
object.
- Maps are the primary focus of geospatial data visualizations.
- Geospatial visualizations highlight the physical connections between data points.
- Python is well suited to working with geospatial data, and many libraries
have been developed for this purpose.
Standard Python libraries commonly used for geospatial data are:
1. shapely: used to create geometric objects such as squares,
polygons, or points. It also handles geometric calculations such as finding an area or
an intersection.
2. geopandas: more powerful than shapely. It not only
works with geometric objects but also reads and writes vector file formats.
3. gdal (Geospatial Data Abstraction Library): written in C/C++ and often
used with geospatial data. It presents a single abstract data model to calling
applications for all supported formats.
4. fiona: used to read and write vector file formats such as shapefiles.
5. rasterio: used to handle raster data; it also handles transformation of
coordinate reference systems. It uses matplotlib to plot data for
analysis.
Three map types used in Python for visualizing geospatial data:
Choropleth Map, Bubble Map, Connection Map
1. Choropleth Map:
- These maps contain areas that are shaded or patterned in proportion to the statistical
variable being displayed on the map.
- The map is partitioned into geographical regions or areas.
The areas are distinguished by colors, shades, or patterns
in relation to a data variable.
- It is a filled map.
There are two main elements required to build a choropleth map:
• A shape file that gives the boundaries of every zone to be represented on the map.
• A data frame that gives the value for each zone.
- The data variable is represented by a color progression across the regions of the map.
Connection Maps:
- Drawn by connecting points placed on a map with straight or curved lines.
- Connection maps show the connections between several positions on a map.
- They are used to display a network combined with geographical data.
- Two positions on the map are connected by a line, and the positions are marked by
circles.
- A connection map can be created in Python using the
basemap library. Each route indicates the shortest
route between two positions.
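On a globe, the shortest route between two positions follows a great circle, which is why such routes often appear curved on a flat map. The sketch below computes the great-circle distance with the haversine formula. It is a standalone illustration, not part of basemap: the function name, the Earth radius of 6371 km, and the approximate city coordinates are assumptions.

```python
import math


def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine distance between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


# Approximate coordinates, used only for illustration.
london = (51.5074, -0.1278)
new_york = (40.7128, -74.0060)

distance = great_circle_km(*london, *new_york)
print(f"London to New York: about {distance:.0f} km")
```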
Data Visualization Types:
(Each entry gives the category of data, its description, and the data visualization graphs used for it.)
Temporal Data
- Description: Data visualizations that are linear and one-dimensional.
- Graphs: Scatter plot, Line chart, Time series sequence
Hierarchical Data
- Description: Data visualizations having ordered groups within larger groups, each group
or cluster of information flowing from a single point of origin.
- Graphs: Dendrogram, Ring charts, Tree maps
Network Data
- Description: Data visualizations showing relationships among data within a network.
- Graphs: Matrix chart, Node-link diagram, Word cloud, Tube map