Tackling Big Data Using MATLAB
Alka Nair
Application Engineer
Access
Preprocess, Exploration & Model Development
Scale up & Integrate with Production Systems
2
Case study: Predict Air Quality
My Weather Page
www.myweather.com/stats.html
Factors Affecting Air Quality
• Temperature
• Pressure
• Relative Humidity
• Dew Point
• Wind speed
• Wind direction
• Ozone
• CO
• NO2
• SO2
3
4
Building Machine Learning Models with Big Data
5
Challenges in Modeling and Deploying Big Data Applications
Access
Preprocess, Exploration & Model Development
Scale up & Integrate with Production Systems
7
Building machine learning models with big data
Access
Preprocess, Exploration & Model Development
Scale up & Integrate with Production Systems
8
Access and Manage Big Data
Different Data Types
▪ Text
▪ Images
Different Data Sources
• Hadoop Distributed File System (HDFS)
• Amazon S3
• Windows Azure Blob
Different Applications
• MapReduce
• Image Segmentation
• Image Classification
Datastores
9
Datastore
[Diagram: Datastore; Single Machine Memory; Process; Cluster of Machines Memory]
10
Air Quality Data on Local Folder
11
Accessing and Processing different types of data
12
You have 1 TB of data you’ve never seen before. How do you
access this data?
13
Historical files are on HDFS and real time data are available
through an API
• Temperature
• Pressure
• Relative Humidity
• Dew Point
• Wind Speed
• Wind Direction
• Ozone
• CO
• NO2
• SO2
14
Access air quality data using datastore
15
Preview the data and adjust properties to best represent the
data of interest
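A minimal sketch of this step, assuming the data is stored as CSV files. The folder, file name, column names, and values below are hypothetical placeholders created only so that the example runs on its own. It uses `datastore`, `preview`, and `read`, plus the `SelectedVariableNames` and `TreatAsMissing` properties of a tabular text datastore.

```matlab
% Create a small sample CSV file so the sketch is self-contained.
% The column names and values are hypothetical placeholders.
dataDir = tempname;
mkdir(dataDir);
T = table([21.5; 19.0; 25.2], [1012; 1009; 1015], [48; 61; NaN], ...
    'VariableNames', {'Temperature', 'Pressure', 'Ozone'});
writetable(T, fullfile(dataDir, 'airquality_01.csv'));

% Point a datastore at every CSV file in the folder. No data is read yet.
ds = datastore(fullfile(dataDir, '*.csv'));

% Look at the first few rows without loading the whole data set.
previewRows = preview(ds);
disp(previewRows)

% Adjust properties so the datastore returns only the data of interest.
ds.SelectedVariableNames = {'Temperature', 'Ozone'};
ds.TreatAsMissing = 'NA';   % treat the text NA as a missing value

% Read one chunk of data with the adjusted settings.
chunk = read(ds);
```

Because the properties are set on the datastore rather than on the data, the same settings apply to every file the datastore covers.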
16
Access data from anywhere with minimal changes
Local disk
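One way to read "minimal changes" is that only the location string differs between storage systems. The sketch below runs against a temporary local folder; the server, bucket, container, and account names in the commented-out lines are hypothetical, and remote locations also need credentials and environment settings that are not shown here.

```matlab
% Local folder (runnable): create one sample file, then read it.
localDir = tempname;
mkdir(localDir);
writetable(table([1; 2; 3], 'VariableNames', {'Ozone'}), ...
    fullfile(localDir, 'sample.csv'));
location = fullfile(localDir, '*.csv');

% Hypothetical remote locations; uncomment one to switch sources.
% location = 'hdfs://myserver:8020/data/airquality/*.csv';
% location = 's3://mybucket/airquality/*.csv';
% location = 'wasbs://mycontainer@myaccount.blob.core.windows.net/airquality/*.csv';

% The rest of the code does not change with the location.
ds = datastore(location);
data = readall(ds);
```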
17
Datastores enable big data workflows
Deep Learning
18
Datastores enable big data workflows
Predictive
Maintenance
19
Datastores enable big data workflows
Fleet
Analytics
20
Datastores: Access Big Data with Minimal Changes
Different Data Types
▪ Text
▪ Images
▪ Spreadsheet
▪ Custom File Formats
Different Data Sources
• Hadoop Distributed File System (HDFS)
• Amazon S3
• Windows Azure Blob Storage
• Relational Database
• HDFS on Hortonworks or Cloudera
Different Applications
• MapReduce
• Image Segmentation
• Image Classification
• Denoising Images
• Predictive Maintenance
21
Building machine learning models with big data
Access
Preprocess, Exploration & Model Development
Scale up & Integrate with Production Systems
22
You have 1 TB of data you’ve never seen before. How do you
visualize and process the data?
23
Use tall arrays to work with the data like any MATLAB array
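A minimal sketch of the idea, assuming a folder of CSV files. The sample file and its column names are hypothetical and are created here only so the example runs. `mapreducer(0)` keeps execution in the local MATLAB session; it is an optional choice for this sketch, not a requirement.

```matlab
% Run tall array calculations in the local MATLAB session.
mapreducer(0);

% Create hypothetical sample data so the sketch is self-contained.
dataDir = tempname;
mkdir(dataDir);
T = table([21.5; 19.0; 25.2; 23.1], [48; 61; NaN; 55], ...
    'VariableNames', {'Temperature', 'Ozone'});
writetable(T, fullfile(dataDir, 'airquality.csv'));

% Wrap the datastore in a tall table.
ds = datastore(fullfile(dataDir, '*.csv'));
tt = tall(ds);

% Use ordinary table indexing and array functions on the tall table.
warmDays = tt(tt.Temperature > 20, :);
meanOzone = mean(warmDays.Ozone, 'omitnan');

% Nothing has been computed so far; gather triggers the evaluation.
meanOzone = gather(meanOzone);
```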
24
▪ Introduction to Tall Arrays
25
Tall arrays
Challenge
▪ Data doesn’t fit into memory
[Diagram: Single Machine Memory; Cluster of Machines Memory]
26
Tall arrays (new in R2016b)
[Diagram: tall array; Single Machine Memory; Process]
summary(tt)
max(tt.EndTime - tt.StartTime)
27
Tall arrays
[Diagram: tall array; Single Machine Memory; Process]
28
Use a Spark-enabled Hadoop cluster and MATLAB
29
It’s easy to run MATLAB code on Spark + Hadoop
Spark Connection
Hadoop Access
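A configuration sketch of the two pieces named above, assuming Parallel Computing Toolbox and a cluster that has already been set up for MATLAB. The install folders and the HDFS path are placeholders, so this fragment does not run as-is outside such an environment.

```matlab
% Hypothetical install locations; replace with the paths on your cluster.
setenv('HADOOP_HOME', '/path/to/hadoop/install');
setenv('SPARK_HOME', '/path/to/spark/install');

% Spark connection: make the Hadoop cluster the execution environment.
cluster = parallel.cluster.Hadoop;
mapreducer(cluster);

% Hadoop access: point the datastore at files on HDFS (placeholder path).
ds = datastore('hdfs:///path/to/airquality/*.csv');
tt = tall(ds);
```

Code written against `tt` stays the same whether the tall array is evaluated locally or on the cluster.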
30
MATLAB Documentation for
31
Summary for tall arrays
Local disk, Shared folders, Databases
Run on Compute Clusters or Spark + Hadoop (HDFS), for large scale analysis
ozone
33
Execution model makes operations more efficient on big data
tt : tall array
▪ Deferred evaluation
– Commands are not executed right away
– Operations are added to a queue
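Deferred evaluation can be seen in a small sketch. The tall array here is built from an in-memory vector purely for illustration; real use would start from a datastore.

```matlab
% Run tall array calculations in the local MATLAB session.
mapreducer(0);

% Build a small tall array from in-memory data (illustration only).
x = tall((1:10)');

% These commands are queued; no pass over the data happens yet.
sumOfSquares = sum(x .^ 2);
largest = max(x);
stillDeferred = istall(sumOfSquares);   % true: the result is unevaluated

% gather evaluates both queued results together.
[sumOfSquares, largest] = gather(sumOfSquares, largest);
```

Requesting several results in one `gather` call lets MATLAB combine the queued operations and reduce the number of passes through the data.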
34
Execution model makes operations more efficient on big data
35
✓ Introduction to Tall Arrays
36
Explore Big Data with Tall Visualizations
plot
scatter
binscatter
histogram
histogram2
ksdensity
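A short sketch of two of the listed functions on tall data. The variables are synthetic and hypothetical, generated in memory so that the example is self-contained; with real data the tall table would come from a datastore.

```matlab
% Run tall array calculations in the local MATLAB session.
mapreducer(0);

% Synthetic, hypothetical data for illustration only.
rng(0);
temperature = 15 + 10 * rand(1000, 1);
ozone = 30 + 2 * temperature + 5 * randn(1000, 1);
tt = tall(table(temperature, ozone));

% Distribution of a single tall variable.
figure('Visible', 'off');
h = histogram(tt.ozone);

% Binned scatter plot of two tall variables.
figure('Visible', 'off');
b = binscatter(tt.temperature, tt.ozone);
```

Both plots summarize the data in bins, which is why they remain usable when the number of points is too large to draw individually.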
37
Explore Big Data with Tall Visualizations
38
Get a summary of the data
tt – tall table
39
Use data types to best represent the data
40
Managing Big and Messy Time-stamped Data
41
Use the results of explorations to help make decisions
- Synchronize to daily data
- By location
42
Synchronize all data to daily times
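A minimal sketch of synchronizing two sources to daily times with `synchronize`, assuming daily means are the aggregation of interest. The timetables, variable names, and values are hypothetical and held in memory; `synchronize` also accepts tall timetables, with limitations described in its documentation.

```matlab
% Hypothetical weather readings every 6 hours over two days.
t1 = datetime(2017, 1, 1, 0, 0, 0) + hours(0:6:42)';
weather = timetable(t1, (10:17)', 'VariableNames', {'Temperature'});

% Hypothetical pollution readings every 12 hours, offset by 3 hours.
t2 = datetime(2017, 1, 1, 3, 0, 0) + hours(0:12:36)';
pollution = timetable(t2, [40; 50; 60; 70], 'VariableNames', {'Ozone'});

% Put both sources on a common daily time vector, averaging within each day.
daily = synchronize(weather, pollution, 'daily', 'mean');
disp(daily)
```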
43
Clean messy data using common preprocessing functions
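A sketch of a typical cleaning sequence using `standardizeMissing`, `fillmissing`, and `rmmissing`. The table and the sentinel value -999 are hypothetical. The example uses an in-memory table; these functions also accept tall arrays, though not every option is supported there.

```matlab
% Hypothetical messy data: -999 is a sentinel code, NaN marks a gap.
T = table([21.5; -999; 25.2; NaN; 23.1], [48; 61; NaN; 55; 52], ...
    'VariableNames', {'Temperature', 'Ozone'});

% Convert the sentinel code to a standard missing value (NaN).
T = standardizeMissing(T, -999);

% Fill temperature gaps by linear interpolation between neighbors.
T.Temperature = fillmissing(T.Temperature, 'linear');

% Drop any rows that still contain a missing value.
T = rmmissing(T);
```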
44
Use familiar MATLAB functions on tall arrays
46
Save preprocessed data
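A sketch of saving a tall table with `write` and reading it back through a datastore, assuming the default output format. The data and output folder are hypothetical; the folder is a fresh temporary location so the example does not overwrite anything.

```matlab
% Run tall array calculations in the local MATLAB session.
mapreducer(0);

% Hypothetical preprocessed data, built in memory for illustration.
tt = tall(table([21.5; 19.0; 25.2], [48; 61; 55], ...
    'VariableNames', {'Temperature', 'Ozone'}));

% Write the tall table to a new folder; this triggers evaluation.
outDir = tempname;
write(outDir, tt);

% Later, recreate a tall table from the saved files.
ds = datastore(outDir);
restored = gather(tall(ds));
```

Saving the result means later sessions can start from the cleaned data instead of repeating the preprocessing.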
47
✓ Introduction to Tall Arrays
48
Predict air quality
Regression Classification
49
How do you know which model to use?
50
Use apps for model exploration on a subset of data
Air Quality Index Air Quality Label
51
Validate and Compare Machine Learning Models
52
Validate and Compare Machine Learning Models
53
Validate and Compare Machine Learning Models
54
Validate and Compare Machine Learning Models
55
Scale up with tall machine learning models
▪ Linear Regression (fitlm)
▪ Logistic & Generalized Linear Regression (fitglm)
▪ Discriminant Analysis Classification (fitcdiscr)
▪ K-means Clustering (kmeans)
▪ Principal Component Analysis (pca)
▪ Partition for Cross Validation (cvpartition)
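As an illustration of one entry in the list above, the sketch below fits a linear regression with `fitlm` on tall inputs. It assumes Statistics and Machine Learning Toolbox is installed. The predictors, response, and true coefficients are synthetic and hypothetical; they are not results from the air quality data.

```matlab
% Run tall array calculations in the local MATLAB session.
mapreducer(0);

% Synthetic, hypothetical data with known coefficients [5; 2; -0.5].
rng(0);
temperature = 15 + 10 * rand(500, 1);
humidity = 40 + 30 * rand(500, 1);
ozone = 5 + 2 * temperature - 0.5 * humidity + 0.1 * randn(500, 1);

% Convert to tall arrays; real use would start from a datastore.
tX = tall([temperature, humidity]);
ty = tall(ozone);

% Same function and syntax as for in-memory data.
mdl = fitlm(tX, ty);
coeffs = mdl.Coefficients.Estimate;   % intercept, temperature, humidity
disp(coeffs)
```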
56
Training Machine Learning Model against Spark for Air Quality
Classification
57
Train and validate with tall data for Air Quality Index Prediction
58
Select the most important features
59
✓ Introduction to Tall Arrays
61
Building machine learning models with big data
Access
Preprocess, Exploration & Model Development
Scale up & Integrate with Production Systems
62
63
Predict air quality for given location
Current Weather
My Weather Page
www.myweather.com/stats.html
Location: 01760
Temperature: 32F
Humidity: 76%
Wind: SSW 13 mph
[Diagram: MATLAB Runtime; Embedded Hardware]
65
Package and test MATLAB code
66
67
Package and test MATLAB code
68
Call MATLAB in production environment
AirQual.ctf
69
MATLAB Production Server
▪ Server software
– Manages packaged MATLAB programs and worker pool
▪ Lightweight client libraries
[Diagram: Enterprise Application; MATLAB Runtime]
70
MATLAB for Modeling and Deploying Big Data Applications
Access
Easily access data however/wherever it is stored using Datastore
Preprocess, Exploration & Model Development
Prototype and easily scale up algorithms to Big Data platforms using the familiar MATLAB syntax with Tall Arrays
Scale up & Integrate with Production Systems
Seamless integration with Enterprise level systems using MATLAB Production Server
71
How do you get started?
▪ Try Tall Array Based Processing on Your Own Set of Big Data
https://in.mathworks.com/help/matlab/examples/analyze-big-data-in-matlab-using-tall-arrays.html
Other Resources
mathworks.com/big-data
mathworks.com/machine-learning eBook
72
MathWorks Training Offerings
http://www.mathworks.com/services/training/
73
Speaker Details
Email: Alka.Nair@mathworks.in
LinkedIn: https://www.linkedin.com/in/alka-nair-1820501a/
Contact MathWorks India
Products/Training Enquiry Booth
Call: 080-6632-6000
Email: info@mathworks.in
74