Exploring and Visualizing Data
Stephen F Elston | Principal Consultant, Quantia Analytics, LLC
Module Outline
• Exploring Data
• Visualizing Data
Exploring Data
• Introduction to R and Python for Data Science
• Working with Data Frames in R and Python
• Working with Data Frames in Azure ML
• Working with Metadata
Data Frames
• Available in R and Python Pandas
– Map to and from Azure ML tables
• Rectangular tables
– Each column of one type
Column1 Column2 … ColumnN
1 ABC … 12.2
2 XYZ … 13.1
3 ABC … 12.8
4 XYZ … 10.9
5 ABC … 3.75
• Common Tasks:
– Subsetting by rows and columns
– Logical filtering of rows and columns
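A minimal pandas sketch of these common tasks. The frame below is rebuilt by hand from the example table on this slide; the variable names are assumptions made for illustration.

```python
import pandas as pd

# Hand-built stand-in for the slide's example table (middle columns omitted).
frame = pd.DataFrame({
    "Column1": [1, 2, 3, 4, 5],
    "Column2": ["ABC", "XYZ", "ABC", "XYZ", "ABC"],
    "ColumnN": [12.2, 13.1, 12.8, 10.9, 3.75],
})

# Subsetting by rows (by position) and by columns (by label).
first_three = frame.iloc[:3]
two_columns = frame[["Column2", "ColumnN"]]

# Logical filtering: keep the rows where a Boolean condition is True.
abc_rows = frame[frame["Column2"] == "ABC"]
large_abc = frame[(frame["Column2"] == "ABC") & (frame["ColumnN"] > 10)]
```

Conditions are combined with & and | rather than the Python keywords and/or, and each condition needs its own parentheses.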
Dplyr
library(dplyr)
Col1 Col2 Col3
2012 14 45
2013 13 76
2013 34 65
2014 23 47
dir <- "C:/data"  # forward slash: a lone backslash in an R string is read as an escape
file <- "values.csv"
path <- file.path(dir, file)
frame1 <- read.csv(path, header=TRUE, stringsAsFactors = FALSE)
Col1 Col2 Col3
2012 14 45
2013 13 76
2013 34 65
2014 23 47
frame1 <- filter(frame1, Col1 == 2013)
Col1 Col3
2012 45
2013 76
2013 65
2014 47
frame1 <- select(frame1, Col1, Col3)
Col1 Col2 Col3 Col4
2012 14 45 59
2013 13 76 89
2013 34 65 99
2014 23 47 70
frame1 <- mutate(frame1, Col4 = Col2 + Col3)
Other useful dplyr verbs include:
frame1 <- group_by(frame1, Col1)
frame1 <- distinct(frame1, Col1)
frame1 <- sample_frac(frame1, 0.5)
frame1 <- sample_n(frame1, 500)
frame1 <- summarize(frame1, m1 = mean(Col1))
Col1 Col2 Col3 Col4
2013 13 76 89
2013 34 65 99
frame1 <- frame1 %>%
  filter(Col1 == 2013) %>%
  mutate(Col4 = Col2 + Col3)
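For comparison, a rough pandas analogue of the filter-then-mutate chain above, written as a method chain. This is a sketch rather than a feature-for-feature equivalent of dplyr, and the frame is rebuilt inline from the example table instead of being read from values.csv.

```python
import pandas as pd

# Example table from the slides, built by hand.
frame1 = pd.DataFrame({
    "Col1": [2012, 2013, 2013, 2014],
    "Col2": [14, 13, 34, 23],
    "Col3": [45, 76, 65, 47],
})

result = (
    frame1
    .query("Col1 == 2013")                          # roughly filter(Col1 == 2013)
    .assign(Col4=lambda d: d["Col2"] + d["Col3"])   # roughly mutate(Col4 = Col2 + Col3)
)
```

Each step returns a new frame, so frame1 itself is left unchanged until the result is assigned back.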
Pandas
Col1 Col2 Col3
2012 14 45
2013 13 76
2013 34 65
2014 23 47
import pandas as pd
import os
data_dir = r"c:\data"  # raw string so the backslash is not treated as an escape
file = "values.csv"
path = os.path.join(data_dir, file)
frame1 = pd.read_csv(path)
Col1 Col2 Col3
2012 14 45
2013 13 76
2013 34 65
2014 23 47
frame1 = frame1["Col2"]
Col1 Col2 Col3
2012 14 45
2013 13 76
2013 34 65
2014 23 47
frame1 = frame1[["Col1", "Col2"]]
Col1 Col2 Col3
2012 14 45
2013 13 76
2013 34 65
2014 23 47
frame1 = frame1[1:3:1]
Col1 Col2 Col3
2012 14 45
2013 13 76
2013 34 65
2014 23 47
frame1 = frame1[:3]
Col1 Col2 Col3
2012 14 45
2013 13 76
2013 34 65
2014 23 47
frame1 = frame1["Col2"][1:2]
       Col1      Col2      Col3
count  4         4         4
mean   2013      21        58.25
std    0.816497  9.763879  14.863266
min    2012      13        45
25%    2012.75   13.75     46.5
…
frame1 = frame1.describe()
Col1 Col2 Col3 Col4
2012 14 45 59
2013 13 76 89
2013 34 65 99
2014 23 47 70
frame1["Col4"] = frame1["Col2"] + frame1["Col3"]
Col1 Col2 Col3
2012 14 45
2013 13 76
2013 34 65
2014 23 47
frame1.drop("Col3", axis=1, inplace=True)
Other Useful Methods
isnull()
groupby(key|expression, axis)
copy()
where(Boolean)
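A short sketch of these four methods on the example table. The missing value is introduced here purely for the demonstration and is not part of the original data.

```python
import numpy as np
import pandas as pd

frame1 = pd.DataFrame({
    "Col1": [2012, 2013, 2013, 2014],
    "Col2": [14.0, np.nan, 34.0, 23.0],   # one missing value added for the demo
})

missing = frame1["Col2"].isnull()     # Boolean mask marking missing entries
backup = frame1.copy()                # independent copy of the frame

# where() keeps values where the condition is True and substitutes elsewhere.
frame1["Col2"] = frame1["Col2"].where(~missing, 0.0)

counts = frame1.groupby("Col1").size()   # number of rows for each Col1 value
```

Taking the copy before the assignment means backup still holds the missing value, which is a simple way to keep the raw data available.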
Other Operations
pandas.DataFrame.apply(function, axis)
pandas.Series.map(function | dictionary | series)
pandas.DataFrame.applymap(function)
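The sketch below shows one plausible use of each operation. The lambdas and the year labels are made up for illustration. Newer pandas releases offer DataFrame.map for elementwise work and deprecate applymap, so the example uses whichever one the installed version provides.

```python
import pandas as pd

frame1 = pd.DataFrame({"Col2": [14, 13, 34, 23], "Col3": [45, 76, 65, 47]})

# apply: the function receives a whole column (axis=0) or a whole row (axis=1).
col_range = frame1.apply(lambda col: col.max() - col.min(), axis=0)
row_total = frame1.apply(lambda row: row["Col2"] + row["Col3"], axis=1)

# Series.map: accepts a function, a dict, or a Series used as a lookup table.
years = pd.Series([2012, 2013, 2013, 2014])
labels = years.map({2012: "early", 2013: "mid", 2014: "late"})
doubled = frame1["Col2"].map(lambda v: v * 2)

# Elementwise over every cell of the frame.
elementwise = frame1.map if hasattr(frame1, "map") else frame1.applymap
squared = elementwise(lambda v: v ** 2)
```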
Col1 Col2 Col3
2012 14 45
2013 47 141
2014 23 47
frame1 = frame1.groupby("Col1").sum()
R Data Frames in Azure ML
Diagram: Azure ML Dataset → Azure ML Table → Execute R Script (input ports 1 and 2) → Data Frame
frame1 <- maml.mapInputPort(1)
frame2 <- maml.mapInputPort(2)
source("src/myScript.R")
print("Hello world")
maml.mapOutputPort("frame1")
R Device Port
Python Data Frames in Azure ML
Diagram: Azure ML Dataset → Azure ML Table → Execute Python Script (input ports 1 and 2) → Data Frame
def azureml_main(frame1, frame2):
    import myModule as mm
    print("Hello world")
    return frame1
Device Port
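Because azureml_main is an ordinary Python function, it can be exercised outside Azure ML with a small test frame before the script is uploaded. The sketch below assumes the two-input signature shown above; the function body and the test frame are hypothetical.

```python
import pandas as pd

def azureml_main(frame1, frame2=None):
    # Hypothetical body: add a derived column without modifying the input.
    out = frame1.copy()
    out["Col4"] = out["Col2"] + out["Col3"]
    print("Hello world")   # mirrors the print call in the slide's example
    return out

# Local check with a hand-built stand-in for the Azure ML input table.
test_frame = pd.DataFrame({"Col2": [14, 13], "Col3": [45, 76]})
result = azureml_main(test_frame)
```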
Data Types and Metadata
Stephen F Elston | Principal Consultant, Quantia Analytics, LLC
Chapter Overview
• Data types
• Continuous and discrete values
• Categorical variables
• Azure ML tools
• Quantization of continuous variables
Azure ML Table Data Types
• Numeric: Floating Point
• Numeric: Integer
• Boolean
• String
• Categorical
• Date-time
• Time-Span
• Image
Data type is metadata
Continuous vs discrete variables
• Continuous variables can take on any value within the resolution of the measurement
– Temperature
– Distance
– Weight
• Discrete variables take only a fixed set of values
– Number of people
– Number of wheels on a vehicle
Categorical variables
• Categories are metadata
• Too many categories can lead to problems
– Not enough data per category
– Too many dimensions in a model
• Often need to combine categories
– Reduce number of categories
– Group like categories
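One way to combine categories in pandas is a lookup table that maps each original level to a broader group. The vehicle types and the grouping below are invented for illustration.

```python
import pandas as pd

# Hypothetical categorical column with several sparse levels.
vehicles = pd.Series(["sedan", "coupe", "pickup", "van", "sedan", "suv", "pickup"])

# Group like categories into a smaller set via a lookup table.
groups = {"sedan": "car", "coupe": "car",
          "pickup": "truck", "van": "truck", "suv": "truck"}
combined = vehicles.map(groups).astype("category")

levels_before = vehicles.nunique()
levels_after = combined.nunique()
```

Any level missing from the lookup table would come back as NaN, so it is worth checking combined.isnull() after a mapping like this.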
Continuous vs categorical variables
The Azure ML Metadata Editor
• Metadata includes:
– Data type
– Categories of categorical data
– Field type: feature, label, etc.
– Column name
• Editor enables manipulation of metadata
Quantizing Continuous Variables
• Convert continuous variable to categorical
• Bin values into categories
– Small, medium, large
– Hot, cold
– Income groups
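A minimal sketch of binning with pandas.cut. The values, bin edges, and labels are arbitrary choices for illustration; in practice the edges would come from domain knowledge or from the distribution of the data.

```python
import pandas as pd

sizes = pd.Series([3.0, 12.0, 18.0, 24.5, 40.0])   # hypothetical continuous values

# Each bin includes its right edge by default: (0, 10], (10, 25], (25, 100].
binned = pd.cut(sizes, bins=[0, 10, 25, 100], labels=["small", "medium", "large"])
```

The result is a categorical Series, so it can be used directly wherever a categorical variable is expected.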
Visualizing Data
Overview
• Exploratory data analysis through visualization
• The R ggplot2 package
• Python pandas plotting and the matplotlib package
Exploratory data analysis
• Explore the data with visualization
• Understand the relationships in the data
• Create multiple views of data
• Aesthetics to project multiple dimensions
• Conditioning to project multiple dimensions
• Understand sources of model errors
John Tukey, Exploratory Data Analysis, 1977, Addison-Wesley
Views of data
• Relationships in data can be complex
• Data exploration requires multiple views
• Views reveal different aspects of the relationships
• Different plots highlight different relationships
Different plots for different views
• Scatter
• Scatter plot matrix
• Line plots
• Bar plots
• Histograms
• Box plots
• Violin plots
• Q-Q plots
Aesthetics for visualization
• Allow projection of additional dimensions
• But don't overdo it!
• Color
• Shape
• Size
• Transparency
• Aesthetics specific to plot type
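The aesthetics listed above can be sketched with matplotlib directly. The data, the color and marker choices, and the column names are all assumptions made for this example; the Agg backend is selected only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")          # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: x/y position plus a category and a magnitude to project.
data = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2.1, 3.9, 6.2, 7.8, 10.1, 12.2],
    "kind": ["a", "b", "a", "b", "a", "b"],
    "weight": [10, 40, 20, 80, 30, 60],
})

fig, ax = plt.subplots(figsize=(6, 6))
styles = {"a": ("tab:blue", "o"), "b": ("tab:orange", "^")}   # color and shape per category
for kind, group in data.groupby("kind"):
    color, marker = styles[kind]
    ax.scatter(group["x"], group["y"], c=color, marker=marker,
               s=group["weight"], alpha=0.3, label=kind)      # size by value, transparency
ax.legend(title="kind")
n_layers = len(ax.collections)   # one scatter layer per category
plt.close(fig)
```

Here four dimensions (x, y, kind, weight) share one panel; adding more aesthetics than this tends to make the plot hard to read.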
Scatter plot
Scatter plot (larger point size)
Scatter plot (+ color by category)
Scatter plot (+ shape by category)
Scatter plot (+ alpha = 0.3)
Scatter plot matrix
Line plot
Bar Plot - unordered
Bar Plot - ordered
Histogram
Box Plot (group by category)
Violin Plot (group by category)
Q-Q Normal Plot
Conditioned Plots
Conditioned plots
• How can you project multiple dimensions?
• Analogous to conditional probability: p(d | g)
• Plots of subsets (group by)
• Also known as faceted plots
William S. Cleveland, Visualizing Data, 1993, Hobart Press
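Conditioning can be sketched in matplotlib by drawing one panel per level of the grouping variable, with shared axes so the panels are comparable. The data and the column names x, y, and g are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")          # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

data = pd.DataFrame({
    "x": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "y": [1, 2, 3, 3, 2, 1, 2, 2, 2],
    "g": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
})

levels = sorted(data["g"].unique())
fig, axes = plt.subplots(1, len(levels), figsize=(9, 3), sharex=True, sharey=True)
for ax, level in zip(axes, levels):
    subset = data[data["g"] == level]      # condition on g == level
    ax.scatter(subset["x"], subset["y"])
    ax.set_title(f"g = {level}")
titles = [ax.get_title() for ax in axes]
plt.close(fig)
```

Each panel shows the relationship between x and y given one value of g, which is the plotting counterpart of p(d | g).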
Conditioned plots (faceting)
One conditioning variable
Conditioned plots (faceting)
With two dimensions of conditioning
Conditioning (faceting)
With scatter plot
Conditioning (faceting)
With two conditioning categorical variables
Conditioning (faceting)
With three conditioning categorical variables
Another view
Different views reveal different relationships
Introduction to ggplot2
Overview of ggplot2
• Produces presentation quality charts
• Uses grammar of graphics
• Operators define graphics properties
• Operators chained to create complex plots
The Grammar of Graphics
1. Import library
library(ggplot2)
2. Chain methods to define plot
ggplot(dataframe, aes(x = xcol, y = ycol, by = opt)) +
  geom_plottype(arguments)
3. Add attributes to chain
  + xlab("X label") + ylab("Y label") + ggtitle("Title") +
  other_properties()
ggplot2 Types
geom_bar
geom_boxplot
geom_histogram
geom_line
geom_point
stat_smooth
stat_hexbin
ggplot2 Options and Aesthetics
facet_grid()
xlab(), ylab()
ggtitle()
shape
color
alpha
size
Execute R Script
Inputs: Azure ML Tables (ports 1 and 2), zip file
myFrame <- maml.mapInputPort(1)  # port number is 1 or 2
source("src/myScript.R")
maml.mapOutputPort("myFrame")
Outputs: Azure ML Table, R Device Port (plots)
Introduction to pandas plotting and matplotlib
Python plotting
• matplotlib underpins plotting in Python
e.g. matplotlib.pyplot
• pandas.DataFrame.plot built on matplotlib.pyplot
• Other libraries built on matplotlib
• For some plot types, or for more control, use matplotlib.pyplot directly
Pandas Plotting
1. Import libraries
import matplotlib.pyplot as plt
2. Define and clear a figure
fig1 = plt.figure(figsize=(9, 9))
fig1.clf()
3. Define one or more axes
ax = fig1.gca()
4. Apply plot method
frame1.plot(kind = 'someType', ax = ax, ….)
5. Save figure
fig1.savefig('scatter2.png')
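The five steps can be put together as a runnable sketch. The frame is the example table from the earlier slides, the scatter plot type is an arbitrary choice, and the file is written to a temporary directory; the Agg backend is an assumption that lets the code run without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")          # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

frame1 = pd.DataFrame({
    "Col1": [2012, 2013, 2013, 2014],
    "Col2": [14, 13, 34, 23],
    "Col3": [45, 76, 65, 47],
})

fig1 = plt.figure(figsize=(9, 9))    # 2. define a figure
fig1.clf()                           #    and clear it
ax = fig1.gca()                      # 3. define the axes to draw on
frame1.plot(kind="scatter", x="Col2", y="Col3", ax=ax)   # 4. apply the plot method
out_path = os.path.join(tempfile.mkdtemp(), "scatter2.png")
fig1.savefig(out_path)               # 5. save the figure
plt.close(fig1)
```

Passing ax explicitly keeps the pandas plot on the figure that is later saved, rather than on whichever figure happens to be current.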
Python Plotting in Azure ML
def azureml_main(frame1):
    import matplotlib.pyplot as plt  ## Import libraries
    fig1 = plt.figure(figsize=(9, 9))  ## Define a figure
    fig1.clf()  ## Clear the current figure
    ax = fig1.gca()  ## Define axis to plot
    frame1.plot(kind = 'someType', ax = ax, ….)
    fig1.savefig('scatter2.png')  ## Save figure in a file for output
    return frame1  ## Must return a Pandas dataframe
Types for pandas.DataFrame.plot()
• ‘line’ : line plot (default)
• ‘bar’ : vertical bar plot
• ‘barh’ : horizontal bar plot
• ‘kde’ or ‘density’: Kernel Density Estimation plot
• ‘scatter’ : scatter plot
Options and Aesthetics for pandas.DataFrame.plot()
• ax – pyplot axis
• x, y – coordinates
• color – line or symbol color
• s – size by value
• shape
• alpha – transparency
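A sketch of these options on a scatter plot. Using Col3 to drive the point size is an arbitrary choice for illustration, and the marker keyword is passed through to matplotlib to set the symbol shape.

```python
import matplotlib
matplotlib.use("Agg")          # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

frame1 = pd.DataFrame({
    "Col1": [2012, 2013, 2013, 2014],
    "Col2": [14, 13, 34, 23],
    "Col3": [45, 76, 65, 47],
})

fig, ax = plt.subplots()
frame1.plot(
    kind="scatter", x="Col2", y="Col3", ax=ax,
    color="darkblue",                   # symbol color
    s=frame1["Col3"].to_numpy() * 4,    # size by value (scaling factor is arbitrary)
    marker="^",                         # symbol shape
    alpha=0.3,                          # transparency
)
alpha_used = ax.collections[0].get_alpha()
n_layers = len(ax.collections)
plt.close(fig)
```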
Execute Python Script
Inputs: Azure ML Tables, zip file
def azureml_main(inFrame1, inFrame2):
    import my_package
    fig.savefig('fig.png')
    return myFrame
Outputs: Azure ML Table, Python Device Port
©2014 Microsoft Corporation. All rights reserved.