Tackling Big Data Using MATLAB

Alka Nair
Application Engineer

© 2015 The MathWorks, Inc.


1
Building Machine Learning Models with Big Data

Access
Preprocess, Exploration & Model Development
Scale up & Integrate with Production Systems

2
Case study: Predict Air Quality
My Weather Page
www.myweather.com/stats.html

Factors Affecting Air Quality
• Temperature
• Pressure
• Relative Humidity
• Dew Point
• Wind speed
• Wind direction
• Ozone
• CO
• NO2
• SO2

3
4
Building Machine Learning Models with Big Data

Access
Preprocess, Exploration & Model Development
Scale up & Integrate with Production Systems

5
Challenges in Modeling and Deploying Big Data Applications
Access
▪ Distributed Data Storage
▪ Different Data Sources & Types
Managing Different APIs for Data Sources and Data Formats

Preprocess, Exploration & Model Development
▪ Preprocessing and Visualizing Big Data
▪ Parallelizing Jobs and Scaling up Computations to Cluster
▪ Rewriting Algorithms to Use Big Data Platforms
▪ Parallelizing Code to Scale up to Use Cluster and Cloud Compute

Scale up & Integrate with Production Systems
▪ Enterprise level deployment
Overhead in Moving the Algorithm to Production
6
Wouldn’t it be nice if you could:
▪ Easily access data however it is stored

▪ Prototype algorithms quickly using small data sets

▪ Scale up to big data sets running on large clusters

▪ Using the same intuitive MATLAB syntax you are used to

7
Building machine learning models with big data
Access
Preprocess, Exploration & Model Development
Scale up & Integrate with Production Systems

8
Access and Manage Big Data
Different Data Types
▪ Text
▪ Images
▪ Spreadsheet
▪ Custom File Formats

Different Data Sources
• Hadoop Distributed File System (HDFS)
• Amazon S3
• Windows Azure Blob Storage
• Relational Database
• HDFS on Hortonworks or Cloudera

Different Applications
• MapReduce
• Image Segmentation
• Image Classification
• Denoising Images
• Predictive Maintenance

Datastores
9
Datastore
[Diagram: One or more files; Datastore; Process; Single Machine Memory; Cluster of Machines Memory]

10
Air Quality Data on Local Folder

11
Accessing and Processing different types of data

TabularTextDatastore: Text files containing column-oriented data, including CSV files
ImageDatastore: Image files (image collection), including formats that are supported by imread such as JPEG and PNG
SpreadsheetDatastore: Spreadsheet files with a supported Excel® format such as .xlsx
MDFDatastore: Datastore for collection of MDF files
Custom Datastore: Datastore for custom or proprietary format
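
As a minimal sketch of the datastore types listed above, the following creates a datastore over a small CSV file and reads it chunk by chunk. The folder, file name, column names, and values are made up for illustration and are not taken from the air quality data set.

```matlab
% Write a tiny CSV file so the example is self-contained (illustrative data).
folder = tempname;
mkdir(folder);
T = table([21.5; 19.0; 23.1], [0.031; 0.027; 0.040], ...
    'VariableNames', {'Temperature', 'Ozone'});
writetable(T, fullfile(folder, 'sample.csv'));

% For CSV files, datastore returns a TabularTextDatastore.
ds = datastore(fullfile(folder, '*.csv'));
disp(class(ds))

% Read the data one chunk at a time until the datastore is exhausted.
nRows = 0;
while hasdata(ds)
    chunk = read(ds);
    nRows = nRows + height(chunk);
end
```

The same read loop applies whether the folder holds one file or many, which is what lets a datastore stand in for data that does not fit in memory.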

12
You have 1 TB of data you’ve never seen before. How do you
access this data?

13
Historical files are on HDFS and real time data are available
through an API

• Temperature
• Pressure
• Relative Humidity
• Dew Point
• Wind Speed
• Wind Direction
• Ozone
• CO
• NO2
• SO2

14
Access air quality data using datastore

15
Preview the data and adjust properties to best represent the
data of interest
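
One possible way to preview a datastore and adjust its properties is sketched below. The file contents, the 'NA' missing-value marker, and the column names are assumptions chosen for illustration.

```matlab
% Create a small CSV file with a missing value written as 'NA' (illustrative).
folder = tempname;
mkdir(folder);
fid = fopen(fullfile(folder, 'aq.csv'), 'w');
fprintf(fid, 'Site,Temperature,Ozone\n');
fprintf(fid, 'A,21.5,0.031\nB,NA,0.027\nA,23.1,0.040\n');
fclose(fid);

% Treat 'NA' in numeric columns as missing (NaN).
ds = tabularTextDatastore(folder, 'TreatAsMissing', 'NA');

% Keep only the columns of interest and read them as floating point.
ds.SelectedVariableNames = {'Temperature', 'Ozone'};
ds.SelectedFormats = {'%f', '%f'};

% preview returns the first few rows without reading the whole data set.
p = preview(ds);
disp(p)
```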

16
Access data from anywhere with minimal changes

Local disk

17
Datastores enable big data workflows
Deep Learning

18
Datastores enable big data workflows
Predictive
Maintenance

19
Datastores enable big data workflows
Fleet
Analytics

20
Datastores: Access Big Data with Minimal Changes

Different Data Types ✓
▪ Text
▪ Images
▪ Spreadsheet
▪ Custom File Formats

Different Data Sources ✓
• Hadoop Distributed File System (HDFS)
• Amazon S3
• Windows Azure Blob Storage
• Relational Database
• HDFS on Hortonworks or Cloudera

Different Applications ✓
• MapReduce
• Image Segmentation
• Image Classification
• Denoising Images
• Predictive Maintenance

21
Building machine learning models with big data
Access
Preprocess, Exploration & Model Development
Scale up & Integrate with Production Systems

22
You have 1TB of data you’ve never seen before. How do you
visualize and process the data?

23
Use tall arrays to work with the data like any MATLAB array

24
▪ Introduction to Tall Arrays

▪ Tall Arrays for Big Data Visualization and Preprocessing

▪ Machine Learning for Big Data Using Tall Arrays

25
Tall arrays

▪ Data is in one or more files
▪ Files stacked vertically
▪ Typically tabular data

Challenge
▪ Data doesn’t fit into memory (even cluster memory)
▪ Takes a lot of time for even simple operations on data

26
Tall arrays (new R2016b)

▪ Create tall table from datastore

    ds = datastore('*.csv')
    tt = tall(ds)

▪ Operate on whole tall table just like ordinary table

    summary(tt)
    max(tt.EndTime - tt.StartTime)

[Diagram: Datastore; tall array; Process; Single Machine Memory; Cluster of Machines Memory]
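
The two steps on this slide can be tried end to end with a sketch like the one below. The StartTime and EndTime values are invented, and `mapreducer(0)` is used here only so the example runs in the local MATLAB session without a parallel pool.

```matlab
% Two small CSV files stand in for a large collection of files (illustrative).
folder = tempname;
mkdir(folder);
T = table((1:6)', [2; 4; 6; 8; 10; 12], ...
    'VariableNames', {'StartTime', 'EndTime'});
writetable(T(1:3, :), fullfile(folder, 'part1.csv'));
writetable(T(4:6, :), fullfile(folder, 'part2.csv'));

% Evaluate tall arrays in the local session.
mapreducer(0);

% Create a tall table from the datastore.
ds = datastore(fullfile(folder, '*.csv'));
tt = tall(ds);

% The expression is only queued here; gather runs it and returns the value.
longest = max(tt.EndTime - tt.StartTime);
longest = gather(longest);
```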

27
tall arrays

▪ With Parallel Computing Toolbox, process several “chunks” at once
▪ Can scale up to clusters with MATLAB Distributed Computing Server

[Diagram: tall array; Process; Single Machine Memory; Cluster of Machines Memory]

28
Use a Spark-enabled Hadoop cluster and MATLAB

Support for many other platforms through reference architectures

29
It’s easy to run MATLAB code on Spark + Hadoop

Spark Connection

Cluster Config for Spark

Hadoop Access
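
A hedged sketch of how the execution environment for tall arrays might be selected is shown below. The `mode` labels are made up for this example, and the Hadoop and Spark install paths are placeholders that must be replaced with real locations. The 'pool' branch assumes Parallel Computing Toolbox is installed, and the 'spark' branch assumes a configured Spark-enabled Hadoop cluster.

```matlab
mode = 'local';   % 'local', 'pool', or 'spark' (labels invented for this sketch)

switch mode
    case 'local'
        % Run tall array calculations in the local MATLAB session.
        mr = mapreducer(0);
    case 'pool'
        % Use the current parallel pool (requires Parallel Computing Toolbox).
        mr = mapreducer(gcp);
    case 'spark'
        % Placeholder install locations; replace with the real paths.
        setenv('HADOOP_HOME', '/path/to/hadoop/install');
        setenv('SPARK_HOME', '/path/to/spark/install');
        cluster = parallel.cluster.Hadoop;
        mr = mapreducer(cluster);
end
```

Code that creates and operates on tall arrays after this point does not need to change when `mode` changes; only the environment it runs in does.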

30
MATLAB Documentation for

31
Summary for tall arrays

Local disk, Shared folders, Databases or Spark + Hadoop (HDFS), for large scale analysis

Process out-of-memory data on your Desktop to explore, analyze, gain insights and to develop analytics
Use Parallel Computing Toolbox for increased performance

Run on Compute Clusters: MATLAB Distributed Computing Server, Spark+Hadoop

Develop your code locally using Tall Arrays or MapReduce only once
Use the same code to scale up to cluster
32
Create a tall array for each datastore

ozone
33
Execution model makes operations more efficient on big data

tt : tall array
▪ Deferred evaluation
– Commands are not executed right away
– Operations are added to a queue

▪ Execution triggers include:
– gather function
– summary function
– Machine learning models
– Plotting
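
Deferred evaluation can be seen in a small sketch such as the following. The numbers are arbitrary, and an in-memory array is converted to a tall array only to keep the example self-contained.

```matlab
% Evaluate tall arrays in the local session.
mapreducer(0);

% Convert a small in-memory vector to a tall array (illustrative values).
x = tall([4; 8; 15; 16; 23; 42]);

% These operations are queued, not executed.
m = mean(x);
s = sum(x);
wasDeferred = istall(m);   % true: m is still an unevaluated tall scalar

% A single gather call evaluates both queued results together.
[m, s] = gather(m, s);
```

Requesting both results in one `gather` call lets MATLAB combine the queued operations, so the data is not read once per result.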

34
Execution model makes operations more efficient on big data

Unnecessary results are not computed

35
✓ Introduction to Tall Arrays

▪ Tall Arrays for Big Data Visualization and Preprocessing

▪ Machine Learning for Big Data Using Tall Arrays

36
Explore Big Data with Tall Visualizations

plot
scatter
binscatter
histogram
histogram2
ksdensity

37
Explore Big Data with Tall Visualizations

38
Get a summary of the data

tt – tall table

39
Use data types to best represent the data

40
Managing Big and Messy Time-stamped Data

41
Use the results of explorations to help make decisions

- Synchronize to daily data
- By location

42
Synchronize all data to daily times

43
Clean messy data using common preprocessing functions
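
A minimal cleaning sketch on a tall table is given below, assuming that missing readings are encoded as -999 or NaN. That encoding and the data values are assumptions for illustration, not a description of the actual air quality files.

```matlab
% Evaluate tall arrays in the local session.
mapreducer(0);

% Small table with a sentinel value (-999) and a NaN (illustrative data).
raw = table([21.5; -999; 23.1; 19.8], [0.031; 0.027; NaN; 0.040], ...
    'VariableNames', {'Temperature', 'Ozone'});
tt = tall(raw);

% Convert the sentinel value to the standard missing indicator.
tt = standardizeMissing(tt, -999);

% Drop every row that contains a missing value.
tt = rmmissing(tt);

clean = gather(tt);
```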

44
Use familiar MATLAB functions on tall arrays

Functions Supported with Tall Arrays


45
You don’t need to leave MATLAB to monitor large jobs

46
Save preprocessed data
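
One way to save a tall array and load it back later is sketched here with the tall `write` function. The output folder is a temporary location chosen for the example, and the data is invented.

```matlab
% Evaluate tall arrays in the local session.
mapreducer(0);

% A small tall table stands in for the preprocessed data (illustrative).
tt = tall(table((1:5)', 'VariableNames', {'Ozone'}));

% Write the tall table to a new folder as a set of files.
outFolder = tempname;
write(outFolder, tt);

% Later, point a datastore at the folder to get the data back as a tall table.
ds2 = datastore(outFolder);
tt2 = tall(ds2);
n = gather(height(tt2));
```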

47
✓ Introduction to Tall Arrays

✓ Tall Arrays for Big Data Visualization and Preprocessing

▪ Machine Learning for Big Data Using Tall Arrays

48
Predict air quality

Air Quality Index: Regression
Air Quality Label: Classification
49
How do you know which model to use?

▪ Try them all ☺

50
Use apps for model exploration on a subset of data
Air Quality Index: Regression Learner
Air Quality Label: Classification Learner

51
Validate and Compare Machine Learning Models

52
Validate and Compare Machine Learning Models

53
Validate and Compare Machine Learning Models

54
Validate and Compare Machine Learning Models

55
Scale up with tall machine learning models
▪ Linear Regression (fitlm)
▪ Logistic & Generalized Linear Regression (fitglm)
▪ Discriminant Analysis Classification (fitcdiscr)
▪ K-means Clustering (kmeans)
▪ Principal Component Analysis (pca)
▪ Partition for Cross Validation (cvpartition)

▪ Linear Support Vector Machine (SVM) Classification (fitclinear)


▪ Naïve Bayes Classification (fitcnb)
▪ Random Forest Ensemble Classification (TreeBagger)
▪ Lasso Linear Regression (lasso)

▪ Linear Support Vector Machine (SVM) Regression (fitrlinear)


▪ Single Classification Decision Tree (fitctree)
▪ Linear SVM Classification with Random Kernel Expansion (fitckernel)
▪ Gaussian Kernel Regression (fitrkernel)
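
As an illustration of one function from the list above, the sketch below fits a linear regression with `fitlm` on a tall table. The data is synthetic: the predictor names, the coefficients 20, 2 and -3, and the noise level are all invented for this example and say nothing about real air quality behavior. It requires Statistics and Machine Learning Toolbox.

```matlab
% Evaluate tall arrays in the local session.
mapreducer(0);

% Synthetic data with a known linear relationship (illustrative only).
rng(0);
n = 500;
Temperature = 15 + 10 * rand(n, 1);
WindSpeed = 5 * rand(n, 1);
AQI = 20 + 2 * Temperature - 3 * WindSpeed + 0.1 * randn(n, 1);
tt = tall(table(Temperature, WindSpeed, AQI));

% Fit the model directly on the tall table.
mdl = fitlm(tt, 'AQI ~ Temperature + WindSpeed');
coeffs = mdl.Coefficients.Estimate;   % intercept, Temperature, WindSpeed

% Predict for the first few rows and bring the result into memory.
pred = gather(predict(mdl, head(tt, 5)));
```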

56
Training Machine Learning Model against Spark for Air Quality
Classification

57
Train and validate with tall data for Air Quality Index Prediction

58
Select the most important features

59
✓ Introduction to Tall Arrays

✓ Tall Arrays for Big Data Visualization and Preprocessing

✓ Machine Learning for Big Data Using Tall Arrays

61
Building machine learning models with big data
Access
Preprocess, Exploration & Model Development
Scale up & Integrate with Production Systems

62
63
Predict air quality for given location

Current Weather

My Weather Page
www.myweather.com/stats.html

Your Weather Conditions
Get weather conditions for your area.
Location: 01760
Temperature: 32F
Humidity: 76%
Wind: SSW 13 mph

MATLAB Runtime

Use MATLAB model running on Spark in Python web framework
64
Integrate analytics with systems

Embedded Hardware
C, C++; HDL; PLC; GPU

Enterprise Systems
Standalone Application; Excel Add-in; Hadoop/Spark; C/C++; Java; Python; .NET; MATLAB Production Server

MATLAB Runtime

65
Package and test MATLAB code

66
67
Package and test MATLAB code

68
Call MATLAB in production environment

AirQual.ctf

69
MATLAB Production Server

▪ Server software
– Manages packaged MATLAB programs and worker pool

▪ MATLAB Runtime libraries
– Single server can use runtimes from different releases

▪ RESTful JSON interface

▪ Lightweight client libraries
– C/C++, .NET, Python, and Java

[Diagram: Enterprise Application (MPS Client Library); Applications/Database Servers (RESTful JSON); MATLAB Production Server (Request Broker & Program Manager); MATLAB Runtime]
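
The RESTful JSON interface can be exercised from any HTTP client. The sketch below builds a request body in MATLAB. The host, port, and the function name `predictAirQuality` are hypothetical; only the archive name AirQual comes from the slides. The input values reuse the temperature, humidity, and wind readings shown on the weather page slide, and the actual request is left commented out because it needs a running server.

```matlab
% Inputs for the deployed function: temperature (F), humidity (%), wind (mph).
features = [32, 76, 13];

% Request body: number of outputs and the list of input arguments.
payload = struct('nargout', 1, 'rhs', {{features}});
body = jsonencode(payload);
disp(body)

% Hypothetical endpoint of the form http://host:port/archive/function.
url = 'http://localhost:9910/AirQual/predictAirQuality';

% Uncomment when a server with the deployed archive is available:
% opts = weboptions('MediaType', 'application/json');
% response = webwrite(url, body, opts);
```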

70
MATLAB for Modeling and Deploying Big Data Applications
Access
▪ Distributed Data Storage
▪ Different Data Sources & Types
Easily Access Data however/wherever it is stored using Datastore

Preprocess, Exploration & Model Development
▪ Preprocessing and Visualizing Big Data
▪ Parallelizing Jobs and Scaling up Computations to Cluster
Prototype and easily scale up algorithms to Big Data platforms using the familiar MATLAB Syntax with Tall Arrays

Scale up & Integrate with Production Systems
▪ Enterprise level deployment
Seamless integration with Enterprise level systems using MATLAB Production Server
71
How do you get started?
▪ Try Tall Array Based Processing on Your Own Set of Big Data

▪ Refer to the example mentioned below to get started:

https://in.mathworks.com/help/matlab/examples/analyze-big-data-in-matlab-using-tall-arrays.html

Other Resources
mathworks.com/big-data

mathworks.com/machine-learning eBook

72
MathWorks Training Offerings

http://www.mathworks.com/services/training/

73
Speaker Details
Email: Alka.Nair@mathworks.in
LinkedIn: https://www.linkedin.com/in/alka-nair-1820501a/

Contact MathWorks India
Products/Training Enquiry Booth
Call: 080-6632-6000
Email: info@mathworks.in

• Share your experience with MATLAB & Simulink on Social Media


▪ Use #MATLABEXPO

• Share your session feedback:


Please fill in your feedback for this session in the feedback form

74
