Impala and BigQuery

This document compares and contrasts Impala and BigQuery, two database systems for querying large datasets. Impala is an open source massively parallel processing (MPP) query engine that runs directly on Hadoop data, without requiring MapReduce. BigQuery is a database service by Google that uses their Dremel technology to enable fast queries on terabytes of data within seconds. The document provides details on the architectures and capabilities of each system.

Impala and BigQuery

By David Gruzman

BigDataCraft.com
BigQuery is Google's database service based
on Dremel. BigQuery is hosted by Google.
Impala is an open source database inspired by the
Dremel paper. Impala is part of the Cloudera
Hadoop distribution.
Today's agenda
Overview of Dremel as a technology
Overview of BigQuery
A few words about Impala
DG Mediamind use case
Deeper insights into Impala
Conclusions
Q&A
Why Dremel?
Google was the first to have MapReduce
Google was the first to face MapReduce's main problem:
latency. The problem also propagated to
engines built on top of MapReduce.
It is logical that Google was the first to
approach it by developing real-time query
capability for big data.
How Dremel is used at Google
Dremel is not a replacement for MapReduce
or Tenzing but complements them. (Tenzing is
Google's Hive)
An analyst can make many fast queries using
Dremel
After getting a good idea of what is needed, run
slow MapReduce (or SQL based on
MapReduce) to get precise results
Why Dremel is unique
Dremel, with BigQuery built on top of it, is
probably the only interactive big data query engine
today.
I mean that it is the only engine capable of producing
results over terabytes of data in seconds!
The main idea (my guess) is to harness a huge
cluster of machines for a single query.
Dremel as a technology
Novel hierarchical columnar format.
LLVM-based code generation.
Distributed aggregation tree.
In-situ data processing (inside the storage).
Dremel : Aggregation tree
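The aggregation tree idea can be sketched in a few lines. This is a minimal illustration, not Dremel's implementation: the function names, the `fanout` parameter, and the sample shards are all assumptions made for the example. Each leaf computes a partial SUM ... GROUP BY over its own data, and every level above only merges the partial results of its children.

```python
from collections import defaultdict


def leaf_aggregate(rows):
    """Leaf server: partial SUM(value) GROUP BY key over its local rows."""
    partial = defaultdict(int)
    for key, value in rows:
        partial[key] += value
    return dict(partial)


def merge_partials(partials):
    """Intermediate or root server: merge the partial sums of its children."""
    merged = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            merged[key] += value
    return dict(merged)


def aggregate_tree(shards, fanout=2):
    """Aggregate bottom-up through a tree with the given fanout (hypothetical)."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = [leaf_aggregate(shard) for shard in shards]
    while len(level) > 1:
        level = [merge_partials(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0] if level else {}


# Four shards stand in for four leaf servers holding (country, clicks) rows.
shards = [
    [("us", 3), ("de", 1)],
    [("us", 2), ("fr", 5)],
    [("de", 4)],
    [("fr", 1), ("us", 1)],
]
result = aggregate_tree(shards)
print(result)
```

Only small partial results travel up the tree, which is why this shape suits queries with a big data reduction.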
Dremel : Nested columnar format
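A simplified sketch of why a columnar layout helps with nested data: each field path gets its own value list, so a query touching one field reads one column. The `stripe` helper and the sample records are hypothetical. The real Dremel format also stores repetition and definition levels with each value so that records can be reassembled; that part is deliberately left out here.

```python
def stripe(records):
    """Split nested records into one value list per dotted field path."""
    columns = {}

    def walk(node, path):
        if isinstance(node, dict):
            for name, child in node.items():
                walk(child, f"{path}.{name}" if path else name)
        elif isinstance(node, list):
            for item in node:
                walk(item, path)
        else:
            columns.setdefault(path, []).append(node)

    for record in records:
        walk(record, "")
    return columns


records = [
    {"user": "a", "clicks": [{"url": "x", "ms": 120}, {"url": "y", "ms": 80}]},
    {"user": "b", "clicks": [{"url": "x", "ms": 200}]},
]
columns = stripe(records)

# An aggregation over one nested field scans a single column only.
total_ms = sum(columns["clicks.ms"])
print(sorted(columns), total_ms)
```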
BigQuery
Service built by Google on top of the Dremel
engine
The only (known to me) query engine as a service
working with big data.
Query time does not depend on data size
BigQuery main capabilities
Aggregations
Join of a big table to a small table.
Join of two big tables (recently added)
Hierarchical data format. It makes pre-
aggregations cheaper.
Main limitations
Small result size
Intermediate results should not exceed memory
size.
No external tables
Why BigQuery is not popular
So, why is BigQuery not popular?
Data is not created in Google's cloud. It is hard
and not practical to move big data. It is heavy,
after all.
Google tends to change its APIs. BigQuery has also
changed during the last few years. It is hard to build
a business on it.
Many companies in Internet-related businesses
are wary of sharing data with Google.
It is expensive. At $35 per TB, bills can reach thousands
of dollars per day.
Dremel
At the same time, it is good
technically
I got references from a company doing serious
testing
Martin Fowler's company also tested it and
gave very good feedback.
Question to all of you
Why did your organization decide not to use
Google's BigQuery?
Where can we find Impala
Impala
What is Impala
Massively parallel processing (MPP) database
engine, developed by Cloudera.
Integrated into the Hadoop stack on the same level
as MapReduce, and not above it (as Hive and
Pig are)

[Diagram: HDFS at the bottom; MapReduce and Impala side by side on top of HDFS; Hive and Pig on top of MapReduce]
Why Impala
Data has gravity
Today a lot of data lives in HDFS
It is not practical to move big data
It is practical to bring the engine to the data
At the same time, MapReduce is not a must
Impala processes data in the Hadoop cluster without
using MapReduce
MapReduce bypass
Several other modern database engines have also
realized the opportunity to bypass MapReduce
and work directly with HDFS.
They take various approaches.

MapReduce bypass
Existing MPP databases, like Greenplum,
store their external tables in HDFS

MapReduce bypass
Jethrodata stores data in its own format on
HDFS and also works with it without the MR layer.
It has a proprietary format which enables
full indexing of the data together with columnar
efficiency. For highly selective queries
this approach has serious advantages.

Use case from DG
I think it will be a typical case in the future
DG is using Hadoop and Hive
Evaluating Impala to do part of the work more
efficiently.
After their case presentation we will get back to
discussing Impala insights
Again, Impala has a different place
than Pig and Hive
[Diagram: HDFS at the bottom; MapReduce and Impala side by side on top of HDFS; Hive and Pig on top of MapReduce]
Impala architecture
Impala: Dremel traces
LLVM code generation
It is really fast
C++ as the implementation language (not Java...)
Simple query engine. It actually does only things
which can be done in memory.
Broadcast join algorithm is implemented
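A broadcast join can be sketched as follows. This is a conceptual illustration, not Impala's code: the function name, the row dictionaries, and the sample tables are assumptions. The small table is turned into an in-memory hash table that every node receives, and each node then probes it with its own slice of the big table, so the big table never moves over the network.

```python
def broadcast_join(big_partitions, small_table, big_key, small_key):
    """Inner-join each partition of the big table against a broadcast hash table."""
    # Build the hash table once; in a real cluster a copy is sent to every node.
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[small_key], []).append(row)

    joined = []
    for partition in big_partitions:      # each partition stands in for one node
        for row in partition:
            for match in lookup.get(row[big_key], []):
                joined.append({**row, **match})
    return joined


clicks = [
    [{"click_id": 1, "country_id": 10}, {"click_id": 2, "country_id": 20}],
    [{"click_id": 3, "country_id": 10}, {"click_id": 4, "country_id": 30}],
]
countries = [
    {"country_id": 10, "name": "US"},
    {"country_id": 20, "name": "DE"},
]
rows = broadcast_join(clicks, countries, "country_id", "country_id")
for row in rows:
    print(row)
```

The approach only works while the small table fits in memory on every node, which matches the "things which can be done in memory" remark above.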
LLVM code generation
Assume you want to write custom code for a
specific query. It will be super efficient
Code generation automates this process for
each query
We actually need to super-optimize the inner loop
doing filtering (WHERE) and GROUP BY.
LLVM enables us to compile into native code in a
fraction of a second
LLVM enables us to use new CPU capabilities
like SSE in a portable way.
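The idea of per-query code generation can be shown with a small stand-in. This sketch generates Python source rather than LLVM IR, so it produces Python bytecode, not native code; `generate_query` and its parameters are hypothetical names for the example. What it does show is the principle: column names and constants are baked into the inner loop instead of being interpreted for every row.

```python
def generate_query(filter_col, threshold, group_col, sum_col):
    """Emit and compile source specialized for one filter + GROUP BY query."""
    lines = [
        "def run(rows):",
        "    totals = {}",
        "    for row in rows:",
        f"        if row[{filter_col!r}] > {threshold!r}:",
        f"            key = row[{group_col!r}]",
        f"            totals[key] = totals.get(key, 0) + row[{sum_col!r}]",
        "    return totals",
    ]
    source = "\n".join(lines)
    namespace = {}
    exec(source, namespace)   # compile the generated function
    return namespace["run"], source


rows = [
    {"flag": "A", "qty": 5, "price": 10},
    {"flag": "B", "qty": 1, "price": 7},
    {"flag": "A", "qty": 9, "price": 4},
    {"flag": "B", "qty": 8, "price": 3},
]

# Roughly: SELECT flag, SUM(price) FROM rows WHERE qty > 2 GROUP BY flag
run, source = generate_query("qty", 2, "flag", "price")
totals = run(rows)
print(source)
print(totals)
```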
Why is code generation interesting?
If you develop your own engine, or some piece of
code responsible for processing serious data
volumes, code generation may give you an order of
magnitude boost.
I have had cases where the use of such technology
was game changing
Impala: Hive traces
While Dremel converts data into its own format,
Impala supports multiple formats. It is a kind of
schema on read.
Impala shares the metastore with Hive, which
enables very simple adoption
Internally Impala has a well-defined way to add
new formats
Impala unique things
Impala's format adapters, called scanners, have
predicate pushdown capability.
Probably the only open source MPP engine
Today we do not have any other means to run
hundreds of CPU cores in one query efficiently
without an expensive license.
Hive gives us the same, but not efficiently.
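Predicate pushdown in a scanner can be illustrated with a toy CSV reader. This is not Impala's scanner interface; `scan_csv`, its parameters, and the sample data are assumptions for the sketch. The point is that the filter runs inside the scanner while rows are being read, so rejected rows and unneeded columns never reach the rest of the query engine.

```python
import csv
import io


def scan_csv(text, columns, predicate=None):
    """Yield only the requested columns of rows that pass the predicate."""
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        # The predicate is evaluated here, at the source, not by the caller.
        if predicate is None or predicate(row):
            yield {name: row[name] for name in columns}


data = "id,country,amount\n1,us,10\n2,de,20\n3,us,30\n"

# Roughly: SELECT id, amount FROM data WHERE country = 'us'
rows = list(scan_csv(data, ["id", "amount"], lambda r: r["country"] == "us"))
print(rows)
```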
Impala vs MPP
It usually takes many years to create an MPP
database.
There are serious simplifications:
The data is read only
There is actually no DBMS, only a query
engine.
No serious resource management, only
measurement (all over the code).

Impala: Hive killer?
Not so quickly.
Hive does things Impala cannot do yet, like
joins between several big tables.
Hive has convenient Java UDFs, while Impala does
not
Impala does not have intra-query (mid-query) fault
tolerance.
At the same time, MapReduce is not a good
framework for a database engine
Impala data formats
There are scanners for the following formats:
RCFile
Parquet (columnar format inspired by Dremel)
CSV
Avro
SequenceFile
Impala future
Will get closer to other MPP engines
Support more formats
More advanced scheduling and resource
management
Basic benchmark
TPC-H, Q1, SF=10
4 EC2 large instances
4 seconds, while Hive takes about 1 minute.

This number means a GROUP BY speed of about
235 MB/sec per core.

Impala price per TB
1 large instance costs $0.24 per hour
The cluster costs $0.96 per hour.
Cost of 1 second: $0.96 / 3600
Such a cluster processes 1.75 GB per second
So the cost of processing 1 TB is about $0.15
That is about 230 times cheaper than BigQuery ($35 per TB)
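The arithmetic on this slide can be reproduced step by step. The prices and throughput are the ones quoted in the deck (4 large instances at $0.24 per hour, 1.75 GB per second, BigQuery at $35 per TB); treating 1 TB as 1000 GB is an assumption of this sketch. With these inputs the Impala cost comes out at roughly $0.15 per TB and the ratio to BigQuery at roughly 230.

```python
INSTANCE_PRICE_PER_HOUR = 0.24   # USD, one EC2 large instance (from the slides)
INSTANCES = 4                    # cluster size used in the benchmark
THROUGHPUT_GB_PER_SEC = 1.75     # cluster throughput quoted on the slide
BIGQUERY_PRICE_PER_TB = 35.0     # USD per TB, as quoted earlier in the deck
GB_PER_TB = 1000                 # assumption: decimal terabyte

cluster_per_hour = INSTANCE_PRICE_PER_HOUR * INSTANCES
cluster_per_sec = cluster_per_hour / 3600
seconds_per_tb = GB_PER_TB / THROUGHPUT_GB_PER_SEC
impala_per_tb = cluster_per_sec * seconds_per_tb
ratio = BIGQUERY_PRICE_PER_TB / impala_per_tb

print(f"cluster cost per hour:   ${cluster_per_hour:.2f}")
print(f"seconds to process 1 TB: {seconds_per_tb:.0f}")
print(f"Impala cost per TB:      ${impala_per_tb:.3f}")
print(f"BigQuery / Impala ratio: {ratio:.0f}")
```

The comparison covers compute time only; it ignores storage, idle cluster hours, and administration.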
Performance - summary
It is fast when data reduction is big
It is fast when data is hot.
It should benefit from fast storage / SSD. My
measurements show about 200 MB/sec per
core of GROUP BY processing
Always at least 10 times faster than Hive

What about clouds?
Impala in the cloud is not elastic
To be elastic we need to create a cluster when
we need it.
Even if we accept hourly resolution, storage
will be a problem
S3 will not give us hundreds of MB per second
per instance
Data stored in the local file system is transient
Impala - conclusions
It is the first time I can remember that we can get our
hands on a free MPP database.
There is no risk in trying it side by side with Hive
It is possible to offload part of the work to
Impala and do the rest with Hive
It is part of the Cloudera Hadoop distribution
and is easily installed with Cloudera Manager
Materials used
Benchmarks
http://www.slideshare.net/sudabon/performance-evaluation-of-cloudera-impala-20121208-15536323
https://amplab.cs.berkeley.edu/benchmark/
Architecture
http://www.slideshare.net/scottleber/impala-19176906
https://cloud.google.com/files/BigQueryTechnicalWP.pdf
POC
http://martinfowler.com/articles/bigQueryPOC.html

Materials used - comparisons
To Hive: http://www.quora.com/Cloudera/Does-Cloudera-Impala-have-any-drawbacks-when-compared-with-Hive
To Vertica: http://www.quora.com/Cloudera-Impala/How-does-Cloudera-Impala-compare-to-Vertica
To Dremel: http://www.quora.com/Cloudera-Impala/How-does-Clouderas-Impala-compare-to-Googles-Dremel
Thank you!!!
Special thanks to
Faina Kamenetsky, who helped set up clusters
on Amazon.

BigDataCraft.com
We are a boutique consulting company
Our services are:
On paper POC
On hardware POC
Architecture / Design reviews
Custom integrations and bug fixing
Impala - Flow
