HOW IS BIG DATA HANDLED IN KAGGLE?
17CP006-LEENANCI PARMAR
17CP012-DHRUVI LAD
INTRODUCTION TO BIGQUERY
• BigQuery is a cloud-based data warehouse from Google that lets users query
and analyze large amounts of read-only data. Using a SQL-like syntax,
BigQuery runs queries on billions of rows of data in a matter of seconds.
• It is an iPaaS (integration platform-as-a-service) that supports any combination
of on-premises data, cloud data, and application integration scenarios.
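To give a flavor of that SQL-like syntax, here is a hypothetical Standard SQL query over a public dataset, written as a Python string the way it would be submitted through the BigQuery client (the table and column names are illustrative, not taken from the slides):

```python
# A hypothetical Standard SQL query against a BigQuery public dataset.
# Table and column names below are examples, chosen for illustration.
query = """
    SELECT author, COUNT(1) AS num_comments
    FROM `bigquery-public-data.hacker_news.comments`
    GROUP BY author
    ORDER BY num_comments DESC
    LIMIT 10
"""

# With credentials available, this string would be submitted via the
# google-cloud-bigquery client, roughly:
#   client = bigquery.Client()
#   results = client.query(query).result()
print("SELECT" in query and "bigquery-public-data" in query)
```

Only the query text is sent to the service; BigQuery plans and runs it across its own infrastructure, which is how a few seconds can suffice for billions of rows.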
FEATURES OF BIGQUERY
• The main component of BigQuery is the Dremel query engine.
• Huge amounts of unstructured data, such as images, videos, log files,
and books, are present, and all of this data needs to be queried.
MapReduce was designed for this.
• However, MapReduce's batch-processing approach made it less than ideal
for interactive querying.
• Dremel, on the other hand, can perform interactive querying on
billions of records in seconds.
ARCHITECTURE OF BIGQUERY
DREMEL'S FEATURES AND CHARACTERISTICS:
• Tree architecture:
• It uses tree architecture, which means that it treats a query as an
execution tree.
• Execution trees break an SQL query into pieces and then reassemble the
results for faster performance. Slots (or leaves) read billions of rows of
data and perform computations on them while the mixers (or branches)
aggregate the results.
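The slot/mixer division of labor above can be mimicked in a small sketch: "slots" scan their shard of the rows and compute partial aggregates, and a "mixer" combines them. This is a toy model of the execution tree, not Dremel's actual implementation:

```python
# Toy model of Dremel's execution tree: slots (leaves) compute partial
# results over their shard of the rows; a mixer (branch) aggregates them.
from typing import List

def slot(rows: List[int]) -> int:
    """A leaf: scans its shard and computes a partial sum."""
    return sum(rows)

def mixer(partials: List[int]) -> int:
    """A branch: aggregates the partial results from the slots below it."""
    return sum(partials)

# Split the rows (here just ten, standing in for billions) across three slots.
shards = [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]
partials = [slot(shard) for shard in shards]
total = mixer(partials)
print(total)  # 55, the same answer as summing all rows at once
```

Because each slot works independently on its shard, the leaves can run in parallel, which is where the speedup over a single sequential scan comes from.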
• Columnar databases:
• Another reason for its incredibly fast performance is its use of a columnar data
storage format instead of the traditional row-based storage.
• Columnar databases allow for better compression due to the homogenous nature
of data stored within columns. In this design, only the required columns are pulled
out, making it an ideal choice for huge databases with billions of rows.
• Data sorting and aggregation operations are also easier with columnar databases
when compared to relational databases. This makes columnar databases more
suitable for intensive data analysis and the parallel processing approach employed
in Dremel’s tree architecture.
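The column-pruning benefit described above can be illustrated with a minimal sketch: the same rows stored row-wise and column-wise, where a query touching one column only has to read that column's homogeneous array:

```python
# The same small table in a row-based layout and a columnar layout.
rows = [
    {"id": 1, "country": "IN", "amount": 100},
    {"id": 2, "country": "IN", "amount": 250},
    {"id": 3, "country": "US", "amount": 175},
]

# Columnar layout: one homogeneous list per column.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# A query over a single column ("amount") reads only that array,
# instead of pulling every field of every row.
total_amount = sum(columns["amount"])
print(total_amount)  # 525

# Homogeneous columns also compress well; e.g. a run-length encoding
# of the "country" column would be [("IN", 2), ("US", 1)].
```

This is only a sketch of the storage idea; BigQuery's actual columnar format adds encodings and metadata on top of this basic layout.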
• Nested data storage:
• Join-based queries can be time-consuming in normalized databases,
and this challenge only gets worse in large databases.
• So Dremel opts for a different approach and permits the storage
of nested or repeated data using the RECORD data type.
• This feature gives Dremel the capability to maintain relationships
between data inside a table. Nested data can be loaded from JSON files
or other source formats into tables.
• Columnar and nested data storage are ideal for querying semi-
structured and unstructured data, which constitute an important part
of the big data universe.
• Repetition level: the level of nesting in the field path at which the repetition occurs.
• Definition level: how many optional/repeated fields in the field path are actually defined.
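For the simplest case, a single repeated field at the top level, these two levels can be illustrated with a hand-rolled column-shredding sketch. This is a simplification of Dremel's record-shredding algorithm for one hypothetical repeated field named tags, not the general algorithm:

```python
# Shred a top-level repeated field ("tags") into a column of
# (value, repetition level, definition level) triples.
# Simplified illustration for one repeated field at depth 1.
def shred_tags(records):
    column = []
    for record in records:
        tags = record.get("tags", [])
        if not tags:
            # Field absent: a NULL entry, definition level 0.
            column.append((None, 0, 0))
        else:
            for i, value in enumerate(tags):
                # Repetition level 0 starts a new record; 1 means the
                # value repeats at depth 1 within the same record.
                # Definition level 1: the repeated field is defined.
                column.append((value, 0 if i == 0 else 1, 1))
    return column

records = [{"tags": ["a", "b"]}, {"tags": []}]
print(shred_tags(records))
# [('a', 0, 1), ('b', 1, 1), (None, 0, 0)]
```

The levels let the reader reassemble the original nested records from a flat column: a repetition level of 0 marks a record boundary, and a definition level of 0 marks an absent value.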
IMPLEMENTATION IN KAGGLE:
• To load a dataset, you first need to generate a dataset reference to point BQ to it.
• When working with BQ from Kaggle, the project name is always bigquery-public-data.
• The method "client.dataset" is named as if it returns a dataset, but it actually
gives us a dataset reference.
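The reference-versus-dataset distinction can be sketched locally without credentials. The class below is a toy stand-in for the client, not the google-cloud-bigquery API; the roughly equivalent real calls are shown in the comments, and hacker_news is just an example dataset name:

```python
# Toy sketch of the reference-vs-dataset distinction. With the real client
# (credentials required) the equivalent calls would be roughly:
#   client = bigquery.Client()
#   ref = client.dataset("hacker_news", project="bigquery-public-data")
#   dataset = client.get_dataset(ref)   # only now is anything fetched
class DatasetReference:
    """A lightweight pointer: just names, no data fetched yet."""
    def __init__(self, project, dataset_id):
        self.project = project
        self.dataset_id = dataset_id

class FakeClient:
    def dataset(self, dataset_id, project):
        # Named as if it returns a dataset, but it only builds a reference.
        return DatasetReference(project, dataset_id)

    def get_dataset(self, ref):
        # Only this call would actually contact the service.
        return {"id": f"{ref.project}.{ref.dataset_id}", "tables": []}

client = FakeClient()
ref = client.dataset("hacker_news", project="bigquery-public-data")
dataset = client.get_dataset(ref)
print(dataset["id"])  # bigquery-public-data.hacker_news
```

Separating the cheap reference from the actual fetch means you can name tables and datasets freely and pay the network cost only when you ask for real metadata or rows.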
COMPARISON BETWEEN THE TWO
• BigQuery: a query service for large datasets.
• MapReduce: a programming model for processing large datasets.
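The distinction can be made concrete with a sketch: in a MapReduce-style program the programmer writes the map and reduce steps explicitly, whereas a query service takes a declarative request (approximated here by a one-line aggregation over the same data):

```python
from collections import defaultdict

words = ["big", "data", "big", "query", "data", "big"]

# MapReduce style: the programmer spells out the processing model.
mapped = [(w, 1) for w in words]          # map: emit (key, 1) pairs
reduced = defaultdict(int)
for key, value in mapped:                 # reduce: sum values per key
    reduced[key] += value

# Query-service style: declare the result you want, not how to compute it
# (a local stand-in for: SELECT word, COUNT(1) ... GROUP BY word).
counts = {w: words.count(w) for w in set(words)}

print(dict(reduced) == counts)  # True: same result, different models
```

Both arrive at the same word counts; the difference is who owns the execution strategy, the programmer (MapReduce) or the engine (a query service like BigQuery).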