Skip to content

Commit 466f558

Browse files
authored
[Dataflow] Add transformation sample (GoogleCloudPlatform#1131)
* Add transformation example code * Update README.md to reflect java-docs-samples * Add sample to parent pom.xml * minor edits * adding license * Update copyright header in log4j * Fix issue with README command and fixed typo and changed compression * Update example files to follow schema correctly
1 parent b683925 commit 466f558

File tree

12 files changed

+866
-0
lines changed

12 files changed

+866
-0
lines changed

dataflow/transforms/README.md

Lines changed: 125 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,125 @@
1+
# Data Format Transformations using Cloud Dataflow and Apache Beam
2+
3+
Utility transforms to transform from one file format to another for a large number of files using
4+
[Apache Beam][apache_beam] running on [Google Cloud Dataflow][dataflow].
5+
6+
The transformations supported by this utility are:
7+
- CSV to Avro
8+
- Avro to CSV
9+
10+
## Setup
11+
12+
Setup instructions assume you have an active Google Cloud Project and with an associated billing account.
13+
The following instructions will help you prepare your development environment.
14+
15+
1. Install [Cloud SDK][cloud_sdk].
16+
1. Setup Cloud SDK
17+
18+
gcloud init
19+
20+
21+
1. Select your Google Cloud Project if not already selected
22+
23+
gcloud config set project [PROJECT_ID]
24+
25+
1. Clone repository
26+
27+
git clone https://github.com/GoogleCloudPlatform/java-docs-samples.git
28+
29+
1. Navigate to the sample code directory
30+
31+
cd dataflow/transforms
32+
33+
## Grant required permissions
34+
35+
The examples are configured for Cloud Dataflow which run on Google Compute Engine.
36+
The Compute Engine default service account requires the permissions
37+
`storage.objects.create`, `storage.objects.get`, and `storage.objects.create` to read and write
38+
objects in your Google Cloud Storage bucket IAM policy.
39+
40+
Learn more about [Cloud Storage IAM Roles][storage_iam_roles] and [Bucket-level IAM][bucket_iam].
41+
42+
The following steps are optional if:
43+
44+
* If the project you use to run these Dataflow transformations also own the buckets used to read/write objects.
45+
* If the bucket you're reading data from is public, e.g., allUsers are granted `roles/storage.objectViewer` viewer.
46+
47+
1. Get the Compute Engine default service account using the following gcloud command:
48+
49+
gcloud compute project-info describe
50+
51+
The default service can be found next to `defaultServiceAccount:` in response after running the command.
52+
53+
1. Grant the `roles/storage.objectViewer` role to the bucket to get and list objects from a Dataflow job:
54+
55+
gsutil --debug iam ch serviceAccount:[COMPUTE_DEFAULT_SERVICE_ACCOUNT]:objectViewer gs://[BUCKET_NAME]
56+
57+
* Replace `[COMPUTE_DEFAULT_SERVICE_ACCOUNT]` with the Compute Engine default service account.
58+
* Replace `[BUCKET_NAME]` with the bucket you use to read your input data.
59+
60+
1. Grant the `roles/storage.objectCreator` role to the bucket to create objects on output from a Dataflow job:
61+
62+
gsutil --debug iam ch serviceAccount:[COMPUTE_DEFAULT_SERVICE_ACCOUNT]:objectCreator gs://[BUCKET_NAME]
63+
64+
* Replace `[COMPUTE_DEFAULT_SERVICE_ACCOUNT]` with the Compute Engine default service account.
65+
* Replace `[BUCKET_NAME]` with the bucket you use to read your input data.
66+
67+
1. If the bucket contains both input and output data, grant the `roles/storage.objectAdmin` role to the default service
68+
account using the gsutil:
69+
70+
gsutil --debug iam ch serviceAccount:[COMPUTE_DEFAULT_SERVICE_ACCOUNT]:objectAdmin gs://[BUCKET_NAME]
71+
72+
* Replace `[COMPUTE_DEFAULT_SERVICE_ACCOUNT]` with the Compute Engine default service account.
73+
* Replace `[BUCKET_NAME]` with the bucket you use to read and write your input and output data respectively.
74+
75+
76+
## Using transformations
77+
78+
### Avro to CSV transformation
79+
80+
To transform Avro formatted files to Csv use the following command:
81+
82+
```bash
83+
# Example
84+
85+
mvn compile exec:java -Dexec.mainClass=com.example.AvroToCsv \
86+
-Dexec.args="--avroSchema=gs://bucket/schema.avsc --inputFile=gs://bucket/*.avro --output=gs://bucket/output --runner=Dataflow"
87+
```
88+
89+
Full description of options can be found by using the following command:
90+
91+
```bash
92+
mvn compile exec:java -Dexec.mainClass=org.solution.example.AvroToCsv -Dexec.args="--help=org.solution.example.SampleOptions"
93+
```
94+
95+
### CSV to Avro transformation
96+
97+
To transform CSV formatted files without a header to Avro use the following command:
98+
99+
```bash
100+
# Example
101+
102+
mvn compile exec:java -Dexec.mainClass=com.example.CsvToAvro \
103+
-Dexec.args="--avroSchema=gs://bucket/schema.avsc --inputFile=gs://bucket/*.csv --output=gs://bucket/output --runner=Dataflow"
104+
```
105+
106+
Full description of options can be found by using the following command:
107+
108+
```bash
109+
mvn compile exec:java -Dexec.mainClass=org.solution.example.CsvToAvro -Dexec.args="--help=org.solution.example.SampleOptions"
110+
```
111+
112+
Existing example does not support headers in a CSV files.
113+
114+
## Run Tests
115+
116+
Tests can be run locally using the DirectRunner.
117+
118+
119+
mvn verify
120+
121+
[storage_iam_roles]: https://cloud.google.com/storage/docs/access-control/iam-roles
122+
[bucket_iam]: https://cloud.google.com/storage/docs/access-control/iam
123+
[cloud_sdk]: https://cloud.google.com/sdk/docs/
124+
[dataflow]: https://cloud.google.com/dataflow/docs/
125+
[apache_beam]: https://beam.apache.org/
260 Bytes
Binary file not shown.
Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,2 @@
1+
frank,natividad,10
2+
Karthi,thyagarajan,10
Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,14 @@
1+
{
2+
"type": "record",
3+
"name": "User",
4+
"fields": [{
5+
"name": "first_name",
6+
"type": "string"
7+
}, {
8+
"name": "last_name",
9+
"type": "string"
10+
}, {
11+
"name": "age",
12+
"type": "int"
13+
}]
14+
}

dataflow/transforms/pom.xml

Lines changed: 188 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,188 @@
1+
<?xml version="1.0" encoding="UTF-8"?>
2+
<!--
3+
Copyright 2018 Google LLC
4+
5+
Licensed under the Apache License, Version 2.0 (the "License");
6+
you may not use this file except in compliance with the License.
7+
You may obtain a copy of the License at
8+
9+
http://www.apache.org/licenses/LICENSE-2.0
10+
11+
Unless required by applicable law or agreed to in writing, software
12+
distributed under the License is distributed on an "AS IS" BASIS,
13+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14+
See the License for the specific language governing permissions and
15+
limitations under the License.
16+
-->
17+
<project xmlns="http://maven.apache.org/POM/4.0.0"
18+
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
19+
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
20+
<modelVersion>4.0.0</modelVersion>
21+
22+
<groupId>com.example</groupId>
23+
<artifactId>format-transforms</artifactId>
24+
<version>1.0-SNAPSHOT</version>
25+
26+
<packaging>jar</packaging>
27+
28+
<parent>
29+
<groupId>com.google.cloud.samples</groupId>
30+
<artifactId>shared-configuration</artifactId>
31+
<version>1.0.9</version>
32+
</parent>
33+
34+
<properties>
35+
<beam.version>2.4.0</beam.version>
36+
37+
<google-clients.version>1.22.0</google-clients.version>
38+
<hamcrest.version>1.3</hamcrest.version>
39+
<junit.version>4.12</junit.version>
40+
<maven.compiler.source>1.8</maven.compiler.source>
41+
<maven.compiler.target>1.8</maven.compiler.target>
42+
<maven-exec-plugin.version>1.6.0</maven-exec-plugin.version>
43+
<maven-jar-plugin.version>3.0.2</maven-jar-plugin.version>
44+
<maven-shade-plugin.version>3.0.0</maven-shade-plugin.version>
45+
<slf4j.version>1.7.25</slf4j.version>
46+
<surefire-plugin.version>2.20</surefire-plugin.version>
47+
</properties>
48+
49+
<build>
50+
<plugins>
51+
<plugin>
52+
<groupId>org.apache.maven.plugins</groupId>
53+
<artifactId>maven-surefire-plugin</artifactId>
54+
<version>${surefire-plugin.version}</version>
55+
<configuration>
56+
<parallel>all</parallel>
57+
<threadCount>4</threadCount>
58+
<redirectTestOutputToFile>true</redirectTestOutputToFile>
59+
</configuration>
60+
<dependencies>
61+
<dependency>
62+
<groupId>org.apache.maven.surefire</groupId>
63+
<artifactId>surefire-junit47</artifactId>
64+
<version>${surefire-plugin.version}</version>
65+
</dependency>
66+
</dependencies>
67+
</plugin>
68+
69+
<!-- Ensure that the Maven jar plugin runs before the Maven
70+
shade plugin by listing the plugin higher within the file. -->
71+
<plugin>
72+
<groupId>org.apache.maven.plugins</groupId>
73+
<artifactId>maven-jar-plugin</artifactId>
74+
<version>${maven-jar-plugin.version}</version>
75+
</plugin>
76+
77+
<!--
78+
Configures `mvn package` to produce a bundled jar ("fat jar") for runners
79+
that require this for job submission to a cluster.
80+
-->
81+
<plugin>
82+
<groupId>org.apache.maven.plugins</groupId>
83+
<artifactId>maven-shade-plugin</artifactId>
84+
<version>${maven-shade-plugin.version}</version>
85+
<executions>
86+
<execution>
87+
<phase>package</phase>
88+
<goals>
89+
<goal>shade</goal>
90+
</goals>
91+
<configuration>
92+
<finalName>${project.artifactId}-bundled-${project.version}</finalName>
93+
<filters>
94+
<filter>
95+
<artifact>*:*</artifact>
96+
<excludes>
97+
<exclude>META-INF/LICENSE</exclude>
98+
<exclude>META-INF/*.SF</exclude>
99+
<exclude>META-INF/*.DSA</exclude>
100+
<exclude>META-INF/*.RSA</exclude>
101+
</excludes>
102+
</filter>
103+
</filters>
104+
<transformers>
105+
<transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
106+
</transformers>
107+
</configuration>
108+
</execution>
109+
</executions>
110+
</plugin>
111+
</plugins>
112+
113+
<pluginManagement>
114+
<plugins>
115+
<plugin>
116+
<groupId>org.codehaus.mojo</groupId>
117+
<artifactId>exec-maven-plugin</artifactId>
118+
<version>${maven-exec-plugin.version}</version>
119+
<configuration>
120+
<cleanupDaemonThreads>false</cleanupDaemonThreads>
121+
</configuration>
122+
</plugin>
123+
</plugins>
124+
</pluginManagement>
125+
</build>
126+
127+
<dependencies>
128+
<dependency>
129+
<groupId>org.apache.beam</groupId>
130+
<artifactId>beam-sdks-java-extensions-google-cloud-platform-core</artifactId>
131+
<version>${beam.version}</version>
132+
</dependency>
133+
134+
<dependency>
135+
<groupId>org.apache.beam</groupId>
136+
<artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
137+
<version>${beam.version}</version>
138+
</dependency>
139+
140+
<!-- Adds a dependency on the Beam SDK. -->
141+
<dependency>
142+
<groupId>org.apache.beam</groupId>
143+
<artifactId>beam-sdks-java-core</artifactId>
144+
<version>${beam.version}</version>
145+
</dependency>
146+
147+
<!-- Adds a dependency on the Beam Google Cloud Platform IO module. -->
148+
<dependency>
149+
<groupId>org.apache.beam</groupId>
150+
<artifactId>beam-sdks-java-io-google-cloud-platform</artifactId>
151+
<version>${beam.version}</version>
152+
</dependency>
153+
154+
<dependency>
155+
<groupId>org.slf4j</groupId>
156+
<artifactId>slf4j-api</artifactId>
157+
<version>${slf4j.version}</version>
158+
</dependency>
159+
160+
<dependency>
161+
<groupId>org.slf4j</groupId>
162+
<artifactId>slf4j-log4j12</artifactId>
163+
<version>${slf4j.version}</version>
164+
</dependency>
165+
166+
<!-- The DirectRunner is needed for unit tests. -->
167+
<dependency>
168+
<groupId>org.apache.beam</groupId>
169+
<artifactId>beam-runners-direct-java</artifactId>
170+
<version>${beam.version}</version>
171+
</dependency>
172+
<dependency>
173+
<groupId>org.mockito</groupId>
174+
<artifactId>mockito-core</artifactId>
175+
<version>2.18.0</version>
176+
</dependency>
177+
<dependency>
178+
<groupId>junit</groupId>
179+
<artifactId>junit</artifactId>
180+
<version>${junit.version}</version>
181+
</dependency>
182+
<dependency>
183+
<groupId>org.hamcrest</groupId>
184+
<artifactId>hamcrest-all</artifactId>
185+
<version>${hamcrest.version}</version>
186+
</dependency>
187+
</dependencies>
188+
</project>

0 commit comments

Comments
 (0)