Spatial Big Data: Joe Niemi
Joe Niemi
Contents
1) Introduction
- what is Spatial Big Data?
- motivation
- use cases
2) Cloud partitioning
3) PAIRS (A scalable Spatial Big Data analytics platform)
4) AQWA (Adaptive Query-Workload-Aware partitioning of Spatial Big Data)
5) Summary
Spatial Data
● All types of data objects or elements that have geographical information present
1. Introduction
Spatial Data
Raster data
Vector data
Graph data
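A minimal sketch of how the three kinds of spatial data could be held in memory. All names and values here (`raster`, `Point`, `road_graph`, the sample numbers) are illustrative assumptions, not taken from any specific GIS library.

```python
from dataclasses import dataclass

# Raster: a regular grid of cells, each holding one value (e.g. elevation).
# Illustrative values only.
raster = [
    [12.0, 12.5, 13.1],
    [11.8, 12.2, 12.9],
]

# Vector: discrete geometries described by coordinates.
@dataclass(frozen=True)
class Point:
    lon: float
    lat: float

# A polygon as a closed ring of points (first point equals last point).
polygon = [Point(0.0, 0.0), Point(1.0, 0.0), Point(1.0, 1.0), Point(0.0, 0.0)]

# Graph: nodes and weighted edges, e.g. a road network with segment lengths.
road_graph = {
    "A": {"B": 1.2, "C": 2.5},
    "B": {"A": 1.2, "C": 0.9},
    "C": {"A": 2.5, "B": 0.9},
}

def cell_value(grid, row, col):
    """Look up the value stored in one raster cell."""
    return grid[row][col]

def neighbours(graph, node):
    """Return the nodes directly connected to `node`, sorted by name."""
    return sorted(graph[node])
```

For example, `cell_value(raster, 1, 2)` reads a single raster cell and `neighbours(road_graph, "A")` lists the road segments leaving node A.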
Topological Coverage
Types of Spatial Big Data
● Speed every minute for every road-segment
● GPS trace data from cell-phones
● Engine measurements of fuel consumption (can be estimated from fuel levels, distance travelled and engine idling from engine RPM)
● Greenhouse gas emissions
Motivation
SBD or GIS (Geographic Information System) helps with
From ArcGIS: “Just about every problem and situation has a location aspect.”
Use cases for Spatial Big Data
1) Eco routing
2) Tracking Endangered Species
3) Better crop production, reducing costs
4) Detecting extreme events
Eco routing
● Next generation routing service
○ avoids congestion
○ reduces idling at red lights
○ avoids left turns
● Estimate: in 2020, about $600 billion saved annually in fuel and time
● Takes into account various datasets
○ real-time and historic traffic data of engine measurements
○ speed-limits
○ road types
○ “rush hour vs non-rush hour”
Tracking endangered species
2013: 970 studies over 250 contributors, 41,170 tracks and 61 million locations
Better crop production
“If you can grow crops fast in these circumstances, query for similar places”
Detecting extreme events
● Earthquakes
● Wildfires
● Flooding
● Other calamities
How to detect
Future
● New datasets -> need to rapidly integrate new datasets and algorithms
● Easy to collect: sensors (or sensor networks) are becoming more and more common (Internet of Things)
Features of Spatial Big Data
● Data access depends on the local time of day where the data is used
● Changes dynamically
● Recent Spatial Big Data is usually being generated at a very high speed
Challenges of Spatial Big Data
1) Retaining computational efficiency
2) Storing Spatial Big Data into the cloud
3) Applying new data to Spatial Big Data or changing old data => repartitioning is needed
Cloud partitioning of Spatial Big Data
● If partitions are not being accessed, servers remain idle and the user is still
charged.
● Most of the existing partitioning approaches co-locate frequently accessed data
together to minimize distributed transactions
● Cloud providers often offer time-based pricing models -> users are getting
charged even when servers idle or have low CPU usage
2. Cloud partitioning
Bad example: partitioning of Spatial Big Data
5 servers store data in Europe, 5 servers store data in USA
Good example: partitioning of Spatial Big Data
10 servers store data with diverse access patterns to minimize server idle-time
=> Main drawback: Lag or latency problems due to data communication cost
We need a cache for servers in Europe to contain frequently accessed data partitions in USA and vice versa
Good example: partitioning of Spatial Big Data
6 servers store data with diverse access patterns to minimize server idle-time
=> Main drawback: Lag or latency problems due to data communication cost
We need a cache for servers in Europe to contain frequently accessed data partitions in USA and vice versa
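The difference between the bad and good examples can be sketched by counting idle server-hours per day. The active hours below (`EUROPE_ACTIVE`, `USA_ACTIVE`) are made-up assumptions chosen only to show the effect, and a server counts as idle in an hour when none of the data it stores is accessed.

```python
# Hypothetical hours of the day (UTC) in which each region's data is accessed.
EUROPE_ACTIVE = set(range(7, 19))             # 07:00-18:59
USA_ACTIVE = {h % 24 for h in range(14, 26)}  # 14:00-01:59

ACTIVE = {"europe": EUROPE_ACTIVE, "usa": USA_ACTIVE}

def idle_server_hours(servers):
    """Sum, over all servers, the hours per day in which the server is idle.

    `servers` is a list where each entry lists the regions stored on that server.
    """
    idle = 0
    for regions in servers:
        busy = set().union(*(ACTIVE[r] for r in regions))
        idle += 24 - len(busy)
    return idle

# Bad example: 5 servers store only Europe data, 5 store only USA data.
region_based = [["europe"]] * 5 + [["usa"]] * 5

# Good examples: every server stores data with diverse access patterns.
mixed_10 = [["europe", "usa"]] * 10
mixed_6 = [["europe", "usa"]] * 6

print(idle_server_hours(region_based))  # 120 of 240 server-hours idle
print(idle_server_hours(mixed_10))      # 50 of 240 server-hours idle
print(idle_server_hours(mixed_6))       # 30 of 144 server-hours idle
```

Under time-based pricing every idle server-hour is still billed, so mixing access patterns and using fewer servers both reduce the paid-for idle time. The sketch ignores the communication cost named as the main drawback.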
Efficient partitioning method
1) Split dataset to partitions based on spatial proximity
A flatness metric is used to find the best possible pair. It shows how diverse the access patterns are.
A tabu search algorithm is used that takes into account the history of moves and prevents non-improving moves from happening
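A rough sketch of pairing partitions by flatness. Two assumptions: the `flatness` definition here (mean load divided by peak load, 1.0 meaning perfectly flat) is a stand-in and not necessarily the metric used in the cited work, and a simple greedy search replaces tabu search to keep the example short. Partition names and access counts are invented.

```python
from itertools import combinations

def flatness(series):
    """Assumed metric: mean / peak of an access series (1.0 = perfectly flat)."""
    peak = max(series)
    return (sum(series) / len(series)) / peak if peak else 1.0

def combine(a, b):
    """Access pattern of a server that stores both partitions."""
    return [x + y for x, y in zip(a, b)]

def best_pairs(access):
    """Greedily pick the pair with the flattest combined pattern, then repeat."""
    remaining = dict(access)
    pairs = []
    while len(remaining) >= 2:
        a, b = max(
            combinations(sorted(remaining), 2),
            key=lambda p: flatness(combine(remaining[p[0]], remaining[p[1]])),
        )
        pairs.append((a, b))
        del remaining[a], remaining[b]
    return pairs

# Hypothetical access counts in four time slots of a day.
access = {
    "paris":  [9, 9, 1, 1],
    "berlin": [8, 8, 2, 2],
    "nyc":    [1, 1, 9, 9],
    "la":     [2, 2, 8, 8],
}

print(best_pairs(access))  # [('berlin', 'la'), ('nyc', 'paris')]
```

Partitions that are close in space (paris, berlin) peak at the same time, so the flattest pairs combine partitions from different regions.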
An easier way to maximize server utilization
On Amazon, based on user-defined rules, scale down to a cheaper server if CPU usage is less than 40 percent
● does not take server idle-time into account (users still have to pay for the cheapest server)
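A toy version of such a threshold rule, to show why it does not remove idle cost. This is not the Amazon API: `TIERS`, `HOURLY_PRICE` and `choose_instance` are hypothetical names, and the prices are invented.

```python
# Hypothetical server sizes, cheapest first, with made-up hourly prices.
TIERS = ["small", "medium", "large"]
HOURLY_PRICE = {"small": 0.02, "medium": 0.04, "large": 0.08}

def choose_instance(cpu_percent, current, tiers=TIERS, threshold=40.0):
    """Step down one tier when CPU usage is below the threshold."""
    i = tiers.index(current)
    if cpu_percent < threshold and i > 0:
        return tiers[i - 1]
    return current

def cost_of_idle_day(start, hours=24):
    """Total price of a fully idle server (0 percent CPU) re-evaluated every hour."""
    current, total = start, 0.0
    for _ in range(hours):
        total += HOURLY_PRICE[current]
        current = choose_instance(0.0, current)
    return total

print(choose_instance(10.0, "large"))  # medium
print(choose_instance(0.0, "small"))   # small: there is no cheaper tier
print(cost_of_idle_day("large") > 0)   # True: an idle server is still billed
```

The rule only moves the server to the cheapest tier; it never brings the charge for an idle server to zero.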
PAIRS is a cloud service deployed on top of Hadoop and HBase
1) Initialization
2) Query Execution
3) Data Acquisition
4) Repartitioning
“Partitioned areas that are queried with high frequency need to be partitioned much
more often in comparison to other less queried areas”
3) How to efficiently determine the best split?
● query workload of spatial big data can change and you should react to it
● new data applied on hourly / daily basis
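One way to reason about the best split, sketched in one dimension. The cost model is an assumption made for this example: the cost of a partition is the number of points it holds times the number of queries that overlap it, so heavily queried areas benefit from finer partitions. The function names, the data and the candidate split positions are all hypothetical.

```python
def cost(points, queries, lo, hi):
    """Assumed cost of partition [lo, hi): points inside * overlapping queries."""
    n_points = sum(lo <= p < hi for p in points)
    n_queries = sum(q_lo < hi and q_hi > lo for q_lo, q_hi in queries)
    return n_points * n_queries

def best_split(points, queries, lo, hi, candidates):
    """Pick the candidate position that minimises the summed cost of both halves."""
    return min(
        candidates,
        key=lambda s: cost(points, queries, lo, s) + cost(points, queries, s, hi),
    )

# Hypothetical data: points spread evenly, queries concentrated on the left.
points = [1, 2, 3, 4, 5, 6, 7, 8, 9]
queries = [(0.5, 2.5), (1.0, 3.0), (1.5, 3.5)]

print(cost(points, queries, 0, 10))                      # 27 (unsplit)
print(best_split(points, queries, 0, 10, [2, 4, 6, 8]))  # 4
```

Splitting at 4 isolates the queried region: the left part costs 3 points * 3 queries = 9 and the right part costs 0, since no query touches it. When the query workload changes or new data arrives, the same evaluation can be repeated to decide whether to repartition.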
Lee, Jae-Gil and Kang, Minseo. Geospatial big data: challenges and opportunities. 2015.
Klein, Levente J and Marianno, Fernando J and Albrecht, Conrad M and Freitag, Marcus and Lu, Siyuan and Hinds, Nigel and Shao, Xiaoyan and Bermudez Rodriguez, Sergio and Hamann, Hendrik F. PAIRS: A scalable geo-spatial data analytics platform. 2015.
Akdogan, Afsin and Indrakanti, Saratchandra and Demiryurek, Ugur and Shahabi, Cyrus. Cost-efficient partitioning of spatial data on cloud. 2015.
Aly, Ahmed M and Mahmood, Ahmed R and Hassan, Mohamed S and Aref, Walid G and Ouzzani, Mourad and Elmeleegy, Hazem and Qadah, Thamir. AQWA: adaptive query workload aware partitioning of big spatial data. 2015.
Questions?