
How Is Bigdata Handled in Kaggle?: 17Cp006-Leenanci Parmar 17CP012-DHRUVI LAD

Kaggle handles big data by supporting BigQuery, Google's cloud-based data warehouse. BigQuery uses the Dremel query engine to perform interactive queries on billions of records in seconds using a columnar database and nested data storage. To load a dataset on Kaggle, users first generate a BigQuery dataset reference to point to the data. This allows for fast, ad-hoc analysis of large datasets on Kaggle using BigQuery.

Uploaded by

Darshan Tank

HOW IS BIG DATA HANDLED IN KAGGLE?

17CP006-LEENANCI PARMAR
17CP012-DHRUVI LAD
INTRODUCTION TO KAGGLE

• Kaggle is a crowdsourced platform for data analysis competitions.
• Businesses bring their data problems, and Kaggle hosts them as competitions.
• Data scientists and programmers compete to come up with the best solution.
• Google acquired Kaggle in 2017.
HOW KAGGLE WORKS?

• Kaggle prepares the data and a description of the problem.
• Participants compete against each other to produce the best models. Work is shared publicly through Kaggle Kernels.
• Submissions are made through Kaggle Kernels, by manual upload, or through the Kaggle API.
• Thus, a main function of Kaggle is to make the data publicly available.
HOW KAGGLE WORKS? (CONTINUED)

• Kaggle supports a variety of dataset publication formats, including CSV, JSON, and SQLite.
• In addition to these file-based datasets, Kaggle also supports BigQuery.
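The three file formats named above can all be read with the Python standard library. The following is a minimal sketch, assuming tiny in-memory sample data (the `people` table and its columns are hypothetical, not taken from any Kaggle dataset); real datasets would be read from files instead.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample data standing in for files from a dataset.
CSV_TEXT = "id,name\n1,alice\n2,bob\n"
JSON_TEXT = '[{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]'


def load_csv(text):
    """Parse CSV text into a list of dicts (all values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def load_json(text):
    """Parse JSON text into Python objects (types are preserved)."""
    return json.loads(text)


def load_sqlite(rows):
    """Store rows in an in-memory SQLite database and read them back with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO people VALUES (?, ?)", rows)
    result = conn.execute("SELECT id, name FROM people ORDER BY id").fetchall()
    conn.close()
    return result


csv_rows = load_csv(CSV_TEXT)
json_rows = load_json(JSON_TEXT)
sqlite_rows = load_sqlite([(r["id"], r["name"]) for r in json_rows])
print(csv_rows[0], json_rows[0], sqlite_rows[0])
```

Note that CSV carries no type information, so `id` comes back as a string there, while JSON and SQLite return it as an integer.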
WHAT IS BIGQUERY?

• BigQuery is a cloud-based data warehouse from Google that lets users query and analyze large amounts of read-only data. Using a SQL-like syntax, BigQuery runs queries on billions of rows of data in a matter of seconds.
• It is offered as a fully managed, serverless platform-as-a-service: Google runs the infrastructure, and users only load and query their data.
FEATURES OF BIGQUERY
• The main component of BigQuery is the Dremel query engine.
• Huge amounts of unstructured data exist, such as images, videos, log files, and books.
• All of this data needs to be queried. MapReduce was designed for this purpose.
• However, its batch-processing approach makes it less than ideal for instant querying.
• Dremel, on the other hand, can perform interactive queries on billions of records in seconds.
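To make the batch-processing model mentioned above concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases, using word counting as the job. It only illustrates the programming model; the function names are illustrative, and a real MapReduce system distributes each phase across many machines, which is where the batch latency comes from.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run the three phases in sequence over the whole input."""
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle_phase(pairs))


counts = word_count(["big data on Kaggle", "big queries on big tables"])
print(counts)
```

The whole input must pass through all three phases before any result is available, which is why this model suits long-running jobs better than ad hoc questions.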
ARCHITECTURE OF BIGQUERY
DREMEL'S FEATURES AND CHARACTERISTICS:

• Tree architecture:
• Dremel uses a tree architecture, which means that it treats a query as an execution tree.
• Execution trees break an SQL query into pieces and then reassemble the results for faster performance. Slots (the leaves) read billions of rows of data and perform computations on them, while the mixers (the branches) aggregate the results.
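The leaf-and-mixer idea described above can be sketched as a small simulation. This is a toy model under stated assumptions, not Dremel's implementation: the partitions, the `threshold` filter, and the `fan_in` parameter are all hypothetical, and the "query" is simply a count and sum of values above the threshold.

```python
def leaf_scan(partition, threshold):
    """Leaf (slot): scan one partition and return a partial (count, total)."""
    matching = [value for value in partition if value > threshold]
    return (len(matching), sum(matching))


def mixer_combine(partials):
    """Mixer (branch): merge partial aggregates coming from its children."""
    count = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return (count, total)


def run_query(partitions, threshold, fan_in=2):
    """Evaluate the query bottom-up: leaves first, then levels of mixers."""
    level = [leaf_scan(partition, threshold) for partition in partitions]
    if not level:
        return (0, 0)
    # Each pass merges groups of `fan_in` partial results into one,
    # until a single result remains at the root.
    while len(level) > 1:
        level = [
            mixer_combine(level[i:i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
    return level[0]


partitions = [[1, 5, 9], [2, 8], [7, 3, 10], [4, 6]]
count, total = run_query(partitions, threshold=4)
print(count, total)
```

Because each leaf touches only its own partition and the mixers see only small partial results, the leaf scans could run in parallel on separate machines.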
• Columnar databases:
• Another reason for its very fast performance is its use of a columnar data storage format instead of traditional row-based storage.
• Columnar databases allow for better compression because the data stored within a column is homogeneous. In this design, only the columns a query needs are read, which makes it an ideal choice for huge databases with billions of rows.
• Sorting and aggregation operations are also easier with columnar storage than with row-oriented storage. This makes columnar databases more suitable for intensive data analysis and for the parallel processing approach employed in Dremel's tree architecture.
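The two claims above (only the needed columns are read, and homogeneous columns compress well) can be illustrated with a small sketch. The sample rows are hypothetical, and run-length encoding is used here only as one simple example of a compression scheme that benefits from columnar layout; it is not presented as the scheme BigQuery uses.

```python
# Hypothetical row-oriented data: each record holds every field.
ROWS = [
    {"user": "a", "country": "IN", "score": 10},
    {"user": "b", "country": "IN", "score": 20},
    {"user": "c", "country": "IN", "score": 30},
    {"user": "d", "country": "US", "score": 40},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Compress consecutive repeats into (value, run_length) pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return [tuple(pair) for pair in encoded]


columns = to_columns(ROWS)
# An aggregate over "score" touches a single column, not whole rows.
total_score = sum(columns["score"])
# A column of similar values collapses into a few runs.
country_rle = run_length_encode(columns["country"])
print(total_score, country_rle)
```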
• Nested data storage:
• Join-based queries can be time-consuming in normalized databases, and this challenge only gets worse in large databases.
• Dremel therefore takes a different approach and permits the storage of nested or repeated data using the RECORD data type.
• This feature lets Dremel maintain relationships between data inside a single table. Nested data can be loaded into tables from JSON files or other source formats.
• Columnar and nested data storage are well suited to querying semi-structured and unstructured data, which constitute an important part of the big data universe.
• Repetition level: the level of nesting in the field path at which the repetition is happening.
• Definition level: how many optional or repeated fields in the field path have been defined.
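The repetition and definition levels defined above can be computed by hand for a small nested record. The sketch below assumes a hypothetical schema in which `Name` is a repeated record, `Language` is a repeated record inside it, and `Code` is a required field, so the column path is `Name.Language.Code` with a maximum definition level of 2. It handles only this one path and is meant as an illustration, not a general column-striping algorithm.

```python
def stripe_codes(record):
    """Return (value, repetition_level, definition_level) triples
    for the column path Name.Language.Code of one record."""
    names = record.get("Name", [])
    if not names:
        # Nothing on the path is defined at all.
        return [(None, 0, 0)]
    out = []
    for i, name in enumerate(names):
        # The first Name starts a new record (r=0); later ones repeat at level 1.
        r_name = 0 if i == 0 else 1
        languages = name.get("Language", [])
        if not languages:
            # Name is defined but Language is missing: d=1, value is NULL.
            out.append((None, r_name, 1))
            continue
        for j, language in enumerate(languages):
            # The first Language inherits the Name's level; later ones repeat at level 2.
            r = r_name if j == 0 else 2
            out.append((language["Code"], r, 2))
    return out


R1 = {
    "Name": [
        {"Language": [{"Code": "en-us"}, {"Code": "en"}]},
        {},  # a Name with no Language entries
        {"Language": [{"Code": "en-gb"}]},
    ]
}
R2 = {"Name": [{}]}

for triple in stripe_codes(R1) + stripe_codes(R2):
    print(triple)
```

The NULL entries matter: they record where a `Name` had no `Language`, so the original nesting can be rebuilt from the flat column alone.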
IMPLEMENTATION IN KAGGLE:

• To load a dataset, you first need to generate a dataset reference that points BigQuery (BQ) to it.
• When working with BigQuery public datasets from Kaggle, the project name is always bigquery-public-data.
• The method "client.dataset" is named as if it returns a dataset, but it actually returns a dataset reference.
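The steps above might look roughly like the following with the `google-cloud-bigquery` Python client. This is a sketch under assumptions: the helper names `qualified_table_id` and `preview_table` are hypothetical, the dataset and table names in the usage comment are illustrative, and running `preview_table` requires the client library plus valid credentials (as provided inside a Kaggle notebook).

```python
PUBLIC_PROJECT = "bigquery-public-data"


def qualified_table_id(dataset_name, table_name, project=PUBLIC_PROJECT):
    """Build the fully qualified 'project.dataset.table' identifier."""
    return f"{project}.{dataset_name}.{table_name}"


def preview_table(dataset_name, table_name, max_rows=5):
    """Fetch dataset metadata and the first few rows of one table."""
    # Imported lazily so the helper above works without the library installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    # Despite its name, client.dataset() returns a DatasetReference.
    # Newer library versions deprecate it in favour of
    # bigquery.DatasetReference(project, dataset_id).
    dataset_ref = client.dataset(dataset_name, project=PUBLIC_PROJECT)
    # An API request is needed to turn the reference into an actual Dataset.
    dataset = client.get_dataset(dataset_ref)
    table = client.get_table(dataset_ref.table(table_name))
    rows = client.list_rows(table, max_results=max_rows)
    return dataset, [dict(row.items()) for row in rows]


# Example call (needs credentials and network access, so it is not run here):
#   dataset, rows = preview_table("hacker_news", "full")
print(qualified_table_id("hacker_news", "full"))
```

Keeping the reference separate from the fetched dataset means no request is sent until the data is actually needed.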
COMPARISON BETWEEN THE TWO

BigQuery
• Query service for large datasets
• Ad hoc, trial-and-error interactive querying of large datasets for quick analysis and troubleshooting
• Very fast response

MapReduce
• Programming model for processing large datasets
• Batch processing of large datasets for time-consuming data conversion or aggregation
• Not very fast (takes minutes to days)
WHY BIGQUERY?
• Analyzing data becomes a faster process with BigQuery, even on very large datasets. BigQuery is also built in several tiers that support massively scalable data storage and query processing.
• This turns the user's workflow into a more seamless process, replacing the previously fragmented practice in which data storage, querying, cleaning, and analysis took place across several tools and platforms.
• BigQuery ML is a set of extensions to the SQL language that lets users create, train, and evaluate machine learning models and their predictive performance easily, within minutes.
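As an illustration of the SQL extensions mentioned above, the sketch below builds BigQuery ML `CREATE MODEL` and `ML.EVALUATE` statements as Python strings. The helper names, the model and table identifiers, and the column names are all hypothetical; the statements are only constructed here, not executed, and string interpolation like this should not be used with untrusted input.

```python
def create_model_sql(model, source_table, label, features, model_type="linear_reg"):
    """Build a BigQuery ML statement that trains a model from a SELECT."""
    columns = ", ".join(list(features) + [label])
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (model_type = '{model_type}', input_label_cols = ['{label}']) AS\n"
        f"SELECT {columns}\n"
        f"FROM `{source_table}`"
    )


def evaluate_model_sql(model):
    """Build a statement that returns evaluation metrics for a trained model."""
    return f"SELECT * FROM ML.EVALUATE(MODEL `{model}`)"


# Hypothetical identifiers, for illustration only.
train_sql = create_model_sql(
    model="my_dataset.fare_model",
    source_table="my_dataset.trips",
    label="fare",
    features=["distance_km", "duration_min"],
)
eval_sql = evaluate_model_sql("my_dataset.fare_model")
print(train_sql)
print(eval_sql)
```

Both statements would be submitted like any other query, so training and evaluation stay inside the same SQL workflow as the analysis itself.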
REFERENCES:
• https://www.kaggle.com/dansbecker/getting-started-with-sql-and-bigquery
• https://towardsdatascience.com/want-to-use-bigquery-read-this-fab36822830
• https://www.kaggle.com/docs/datasets
