gcloud dataflow jobs run iotflow \
--gcs-location gs://dataflow-templates-us-west1/latest/PubSub_to_BigQuery \
--region us-west1 \
--worker-machine-type e2-medium \
--staging-location gs://qwiklabs-gcp-01-102801ff5577/temp \
--parameters inputTopic=projects/pubsub-public-data/topics/taxirides-realtime,outputTableSpec=qwiklabs-gcp-01-102801ff5577:taxirides.realtime
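Once the templated job is running, you can spot-check that rows are arriving in the output table. The following is a minimal verification sketch, not part of the lab files; it assumes the google-cloud-bigquery client library is installed and reuses the project, dataset, and table names from the command above.

# Hypothetical verification snippet (assumption: google-cloud-bigquery is
# installed and you are authenticated against the lab project).
from google.cloud import bigquery

client = bigquery.Client(project='qwiklabs-gcp-01-102801ff5577')

# Count how many taxi ride points the streaming job has written so far.
query = """
    SELECT COUNT(*) AS ride_points
    FROM `qwiklabs-gcp-01-102801ff5577.taxirides.realtime`
"""
for row in client.query(query).result():
    print(f'rows written so far: {row.ride_points}')

data_ingestion.py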
import argparse
import logging
import re
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
class DataIngestion:
"""A helper class which contains the logic to translate the file into
a format BigQuery will accept."""
def parse_method(self, string_input):
"""This method translates a single line of comma separated values to a
dictionary which can be loaded into BigQuery.
Args:
string_input: A comma separated list of values in the form of
state_abbreviation,gender,year,name,count_of_babies,dataset_created_date
Example string_input: KS,F,1923,Dorothy,654,11/28/2016
Returns:
A dict mapping BigQuery column names as keys to the corresponding value
            parsed from string_input. In this example, the data is not
            transformed, and remains in the same format as the CSV.
example output:
{
'state': 'KS',
'gender': 'F',
'year': '1923',
'name': 'Dorothy',
'number': '654',
'created_date': '11/28/2016'
}
"""
# Strip out carriage return, newline and quote characters.
values = re.split(",",
re.sub('\r\n', '', re.sub('"', '', string_input)))
row = dict(
zip(('state', 'gender', 'year', 'name', 'number', 'created_date'),
values))
return row
def run(argv=None):
"""The main function which creates the pipeline and runs it."""
parser = argparse.ArgumentParser()
# This defaults to the lake dataset in your BigQuery project. You'll have
# to create the lake dataset yourself using this command:
# bq mk lake
    parser.add_argument('--output',
                        dest='output',
                        required=False,
                        help='Output BQ table to write results to.',
                        default='lake.usa_names')

    # The input file to read. This defaults to a small 10-line sample file in
    # a public Cloud Storage bucket, which is useful for development.
    parser.add_argument('--input',
                        dest='input',
                        required=False,
                        help='Input file to read. This can be a local file or '
                             'a file in a Google Cloud Storage bucket.',
                        default='gs://spls/gsp290/data_files/head_usa_names.csv')

    # Parse arguments from the command line.
    known_args, pipeline_args = parser.parse_known_args(argv)

    # DataIngestion holds the logic for translating each CSV line into a
    # BigQuery row.
    data_ingestion = DataIngestion()
# Initiate the pipeline using the pipeline arguments passed in from the
# command line. This includes information such as the project ID and
# where Dataflow should store temp files.
p = beam.Pipeline(options=PipelineOptions(pipeline_args))
(p
# Read the file. This is the source of the pipeline. All further
# processing starts with lines read from the file. We use the input
# argument from the command line. We also skip the first line which is a
# header row.
| 'Read from a File' >> beam.io.ReadFromText(known_args.input,
skip_header_lines=1)
# This stage of the pipeline translates from a CSV file single row
# input as a string, to a dictionary object consumable by BigQuery.
# It refers to a function we have written. This function will
# be run in parallel on different workers using input from the
# previous stage of the pipeline.
| 'String To BigQuery Row' >>
beam.Map(lambda s: data_ingestion.parse_method(s))
| 'Write to BigQuery' >> beam.io.Write(
beam.io.BigQuerySink(
# The table name is a required argument for the BigQuery sink.
# In this case we use the value passed in from the command line.
known_args.output,
# Here we use the simplest way of defining a schema:
# fieldName:fieldType
schema='state:STRING,gender:STRING,year:STRING,name:STRING,'
'number:STRING,created_date:STRING',
# Creates the table in BigQuery if it does not yet exist.
create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
# Deletes all data in the BigQuery table before writing.
write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)))
p.run().wait_until_finish()
if __name__ == '__main__':
logging.getLogger().setLevel(logging.INFO)
run()
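Before submitting this pipeline to Dataflow, you can sanity-check the parsing logic on its own. The sketch below is a quick local check, not part of the lab files; it assumes the code above is saved as data_ingestion.py and simply feeds parse_method the example line from its docstring.

# Hypothetical local check (assumption: the pipeline above is saved as
# data_ingestion.py in the current directory).
from data_ingestion import DataIngestion

ingestion = DataIngestion()
row = ingestion.parse_method('KS,F,1923,Dorothy,654,11/28/2016')
assert row == {
    'state': 'KS',
    'gender': 'F',
    'year': '1923',
    'name': 'Dorothy',
    'number': '654',
    'created_date': '11/28/2016',
}
print(row)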
data_transformation.py
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""data_transformation.py is a Dataflow pipeline which reads a file and writes
its contents to a BigQuery table.

This example reads a json schema of the intended output into BigQuery,
and transforms the date data to match the format BigQuery expects.
"""
import argparse
import csv
import logging
import os

import apache_beam as beam
from apache_beam.io.gcp.bigquery import parse_table_schema_from_json
from apache_beam.options.pipeline_options import PipelineOptions
class DataTransformation:
"""A helper class which contains the logic to translate the file into a
format BigQuery will accept."""
def __init__(self):
dir_path = os.path.dirname(os.path.realpath(__file__))
self.schema_str = ''
        # Here we read the output schema from a json file. This is used to
        # specify the types of data we are writing to BigQuery.
        schema_file = os.path.join(dir_path, 'resources',
                                   'usa_names_year_as_date.json')
        with open(schema_file) as f:
            data = f.read()
            # Wrapping the schema in fields is required for the BigQuery API.
            self.schema_str = '{"fields": ' + data + '}'
    def parse_method(self, string_input):
        """This method translates a single line of comma separated values to a
        dictionary which can be loaded into BigQuery.

        Args:
string_input: A comma separated list of values in the form of
state_abbreviation,gender,year,name,count_of_babies,dataset_created_date
example string_input: KS,F,1923,Dorothy,654,11/28/2016
Returns:
            A dict mapping BigQuery column names as keys to the corresponding
            value parsed from string_input. In this example the year is
            reformatted into BigQuery's DATE format; the other values remain
            in the same format as the CSV.
example output:
{'state': 'KS',
'gender': 'F',
'year': '1923-01-01', <- This is the BigQuery date format.
'name': 'Dorothy',
'number': '654',
'created_date': '11/28/2016'
}
"""
        # Parse the output schema so we can inspect the type of each field.
        schema = parse_table_schema_from_json(self.schema_str)
        field_map = [f for f in schema.fields]

        # Use a CSV reader, which strips out return characters and quote
        # characters for us.
        reader = csv.reader(string_input.split('\n'))
        for csv_row in reader:
            # Our source data only contains the year, so default the month
            # and day to January 1st.
            month = '01'
            day = '01'
            # The year comes from our source data.
            year = csv_row[2]

            row = {}
            i = 0
            # Iterate over the values from our csv file, applying any
            # transformation logic.
            for value in csv_row:
                # If the schema indicates this field is a date format, we must
                # transform the date from the source data into a format that
                # BigQuery can understand.
                if field_map[i].type == 'DATE':
                    # Format the date to YYYY-MM-DD format which BigQuery
                    # accepts.
                    value = '-'.join((year, month, day))
                row[field_map[i].name] = value
                i += 1
            return row
def run(argv=None):
"""The main function which creates the pipeline and runs it."""
parser = argparse.ArgumentParser()
# Here we add some specific command line arguments we expect. Specifically
# we have the input file to load and the output table to write to.
parser.add_argument(
'--input', dest='input', required=False,
help='Input file to read. This can be a local file or '
'a file in a Google Storage Bucket.',
# This example file contains a total of only 10 lines.
# It is useful for developing on a small set of data
default='gs://spls/gsp290/data_files/head_usa_names.csv')
# This defaults to the temp dataset in your BigQuery project. You'll have
# to create the temp dataset yourself using bq mk temp
parser.add_argument('--output', dest='output', required=False,
help='Output BQ table to write results to.',
                        default='lake.usa_names_transformed')

    # Parse arguments from the command line.
    known_args, pipeline_args = parser.parse_known_args(argv)

    # DataTransformation holds the logic for translating each CSV line into a
    # BigQuery row, including the date transformation.
    data_ingestion = DataTransformation()
# Initiate the pipeline using the pipeline arguments passed in from the
# command line. This includes information like where Dataflow should
# store temp files, and what the project id is.
p = beam.Pipeline(options=PipelineOptions(pipeline_args))
schema = parse_table_schema_from_json(data_ingestion.schema_str)
(p
# Read the file. This is the source of the pipeline. All further
# processing starts with lines read from the file. We use the input
# argument from the command line. We also skip the first line which is a
# header row.
| 'Read From Text' >> beam.io.ReadFromText(known_args.input,
skip_header_lines=1)
# This stage of the pipeline translates from a CSV file single row
# input as a string, to a dictionary object consumable by BigQuery.
# It refers to a function we have written. This function will
# be run in parallel on different workers using input from the
# previous stage of the pipeline.
| 'String to BigQuery Row' >> beam.Map(lambda s:
data_ingestion.parse_method(s))
| 'Write to BigQuery' >> beam.io.Write(
beam.io.BigQuerySink(
# The table name is a required argument for the BigQuery sink.
# In this case we use the value passed in from the command line.
known_args.output,
# Here we use the JSON schema read in from a JSON file.
            # Specifying the schema allows the API to create the table
            # correctly if it does not yet exist.
schema=schema,
# Creates the table in BigQuery if it does not yet exist.
create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
# Deletes all data in the BigQuery table before writing.
write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)))
p.run().wait_until_finish()
if __name__ == '__main__':
logging.getLogger().setLevel(logging.INFO)
run()
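As with the ingestion pipeline, the date transformation can be checked locally before launching a Dataflow job. The sketch below is not part of the lab files; it assumes the code above is saved as data_transformation.py with the lab's resources/usa_names_year_as_date.json next to it and apache-beam installed, and it confirms that the year column comes out in BigQuery's DATE format.

# Hypothetical local check (assumption: data_transformation.py and its
# resources/ directory are present, and apache-beam is installed).
from data_transformation import DataTransformation

transform = DataTransformation()
row = transform.parse_method('KS,F,1923,Dorothy,654,11/28/2016')

# The schema marks the year column as DATE, so it should be expanded
# to a full YYYY-MM-DD value.
assert row['year'] == '1923-01-01'
print(row)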