HOW IS BIG DATA HANDLED IN KAGGLE?
17CP006-LEENANCI PARMAR
17CP012-DHRUVI LAD
INTRODUCTION TO BIGQUERY
• BigQuery is a cloud-based data warehouse from Google that lets users query
and analyze large amounts of read-only data. Using a SQL-like syntax,
BigQuery runs queries on billions of rows of data in a matter of seconds.
• It is an iPaaS (integration platform-as-a-service) that supports any combination
of on-premises data, cloud data, and application integration scenarios.
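To give a flavor of that SQL-like syntax, here is a hypothetical Standard SQL query over a public dataset, written as a Python string the way it would be submitted through the BigQuery client (the table and column names are illustrative, not taken from the slides):

```python
# A hypothetical Standard SQL query against a BigQuery public dataset.
# Table and column names below are examples, chosen for illustration.
query = """
    SELECT author, COUNT(1) AS num_comments
    FROM `bigquery-public-data.hacker_news.comments`
    GROUP BY author
    ORDER BY num_comments DESC
    LIMIT 10
"""

# With credentials available, this string would be submitted via the
# google-cloud-bigquery client, roughly:
#   client = bigquery.Client()
#   results = client.query(query).result()
print("SELECT" in query and "bigquery-public-data" in query)
```

Only the query text is sent to the service; BigQuery plans and runs it across its own infrastructure, which is how a few seconds can suffice for billions of rows.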
FEATURES OF BIGQUERY
• The main component of BigQuery is the Dremel query engine.
• Huge amounts of unstructured data, such as images, videos, log files,
and books, are present, and all of this data needs to be queried.
MapReduce was designed for this.
• However, MapReduce's batch-processing approach made it less than ideal
for interactive querying.
• Dremel, on the other hand, can perform interactive querying on
billions of records in seconds.
ARCHITECTURE OF BIGQUERY
DREMEL'S FEATURES AND CHARACTERISTICS:
• Tree architecture:
• It uses tree architecture, which means that it treats a query as an
execution tree.
• Execution trees break an SQL query into pieces and then reassemble the
results for faster performance. Slots (or leaves) read billions of rows of
data and perform computations on them while the mixers (or branches)
aggregate the results.
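The slot/mixer division of labor above can be mimicked in a small sketch: "slots" scan their shard of the rows and compute partial aggregates, and a "mixer" combines them. This is a toy model of the execution tree, not Dremel's actual implementation:

```python
# Toy model of Dremel's execution tree: slots (leaves) compute partial
# results over their shard of the rows; a mixer (branch) aggregates them.
from typing import List

def slot(rows: List[int]) -> int:
    """A leaf: scans its shard and computes a partial sum."""
    return sum(rows)

def mixer(partials: List[int]) -> int:
    """A branch: aggregates the partial results from the slots below it."""
    return sum(partials)

# Split the rows (here just ten, standing in for billions) across three slots.
shards = [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]
partials = [slot(shard) for shard in shards]
total = mixer(partials)
print(total)  # 55, the same answer as summing all rows at once
```

Because each slot works independently on its shard, the leaves can run in parallel, which is where the speedup over a single sequential scan comes from.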
• Columnar databases:
• Another reason for its incredibly fast performance is its use of a columnar data
storage format instead of the traditional row-based storage.
• Columnar databases allow for better compression due to the homogenous nature
of data stored within columns. In this design, only the required columns are pulled
out, making it an ideal choice for huge databases with billions of rows.
• Data sorting and aggregation operations are also easier with columnar databases
when compared to relational databases. This makes columnar databases more
suitable for intensive data analysis and the parallel processing approach employed
in Dremel’s tree architecture.
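The column-pruning benefit described above can be illustrated with a minimal sketch: the same rows stored row-wise and column-wise, where a query touching one column only has to read that column's homogeneous array:

```python
# The same small table in a row-based layout and a columnar layout.
rows = [
    {"id": 1, "country": "IN", "amount": 100},
    {"id": 2, "country": "IN", "amount": 250},
    {"id": 3, "country": "US", "amount": 175},
]

# Columnar layout: one homogeneous list per column.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# A query over a single column ("amount") reads only that array,
# instead of pulling every field of every row.
total_amount = sum(columns["amount"])
print(total_amount)  # 525

# Homogeneous columns also compress well; e.g. a run-length encoding
# of the "country" column would be [("IN", 2), ("US", 1)].
```

This is only a sketch of the storage idea; BigQuery's actual columnar format adds encodings and metadata on top of this basic layout.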
• Nested data storage:
• Join-based queries can be time-consuming in normalized databases,
and this challenge only gets worse in large databases.
• So Dremel opts for a different approach and permits the storage
of nested or repeated data using the RECORD data type.
• This feature gives Dremel the capability to maintain relationships
between data inside a table. Nested data can be loaded from JSON files
or other source formats into tables.
• Columnar and nested data storage are ideal for querying semi-
structured and unstructured data, which constitute an important part
of the big data universe.
• Repetition level: the level of nesting in the field path at which the repetition occurs.
• Definition level: how many optional/repeated fields in the field path are actually defined.
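For the simplest case, a single repeated field at the top level, these two levels can be illustrated with a hand-rolled column-shredding sketch. This is a simplification of Dremel's record-shredding algorithm for one hypothetical repeated field named tags, not the general algorithm:

```python
# Shred a top-level repeated field ("tags") into a column of
# (value, repetition level, definition level) triples.
# Simplified illustration for one repeated field at depth 1.
def shred_tags(records):
    column = []
    for record in records:
        tags = record.get("tags", [])
        if not tags:
            # Field absent: a NULL entry, definition level 0.
            column.append((None, 0, 0))
        else:
            for i, value in enumerate(tags):
                # Repetition level 0 starts a new record; 1 means the
                # value repeats at depth 1 within the same record.
                # Definition level 1: the repeated field is defined.
                column.append((value, 0 if i == 0 else 1, 1))
    return column

records = [{"tags": ["a", "b"]}, {"tags": []}]
print(shred_tags(records))
# [('a', 0, 1), ('b', 1, 1), (None, 0, 0)]
```

The levels let the reader reassemble the original nested records from a flat column: a repetition level of 0 marks a record boundary, and a definition level of 0 marks an absent value.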
IMPLEMENTATION IN KAGGLE:
• To load a dataset, you first need to generate a dataset reference to point BQ to it.
• When working with BQ from Kaggle, the project name is always bigquery-public-data.
• The method "client.dataset" is named as if it returns a dataset, but it actually
gives us a dataset reference.
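The reference-versus-dataset distinction can be sketched locally without credentials. The class below is a toy stand-in for the client, not the google-cloud-bigquery API; the roughly equivalent real calls are shown in the comments, and hacker_news is just an example dataset name:

```python
# Toy sketch of the reference-vs-dataset distinction. With the real client
# (credentials required) the equivalent calls would be roughly:
#   client = bigquery.Client()
#   ref = client.dataset("hacker_news", project="bigquery-public-data")
#   dataset = client.get_dataset(ref)   # only now is anything fetched
class DatasetReference:
    """A lightweight pointer: just names, no data fetched yet."""
    def __init__(self, project, dataset_id):
        self.project = project
        self.dataset_id = dataset_id

class FakeClient:
    def dataset(self, dataset_id, project):
        # Named as if it returns a dataset, but it only builds a reference.
        return DatasetReference(project, dataset_id)

    def get_dataset(self, ref):
        # Only this call would actually contact the service.
        return {"id": f"{ref.project}.{ref.dataset_id}", "tables": []}

client = FakeClient()
ref = client.dataset("hacker_news", project="bigquery-public-data")
dataset = client.get_dataset(ref)
print(dataset["id"])  # bigquery-public-data.hacker_news
```

Separating the cheap reference from the actual fetch means you can name tables and datasets freely and pay the network cost only when you ask for real metadata or rows.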
COMPARISON BETWEEN THE TWO
• BigQuery: a query service for large datasets.
• MapReduce: a programming model for processing large datasets.
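The distinction can be made concrete with a sketch: in a MapReduce-style program the programmer writes the map and reduce steps explicitly, whereas a query service takes a declarative request (approximated here by a one-line aggregation over the same data):

```python
from collections import defaultdict

words = ["big", "data", "big", "query", "data", "big"]

# MapReduce style: the programmer spells out the processing model.
mapped = [(w, 1) for w in words]          # map: emit (key, 1) pairs
reduced = defaultdict(int)
for key, value in mapped:                 # reduce: sum values per key
    reduced[key] += value

# Query-service style: declare the result you want, not how to compute it
# (a local stand-in for: SELECT word, COUNT(1) ... GROUP BY word).
counts = {w: words.count(w) for w in set(words)}

print(dict(reduced) == counts)  # True: same result, different models
```

Both arrive at the same word counts; the difference is who owns the execution strategy, the programmer (MapReduce) or the engine (a query service like BigQuery).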