Commit 785fcb8

feat: add a Cloud Bigtable fraud-detection terraform code (GoogleCloudPlatform#7254)
This is the first of three PRs that introduce a Cloud Bigtable fraud-detection use-case example. This PR adds the following:

1) The Terraform code that builds and destroys the infrastructure.
2) A pre-trained ML model that detects fraudulent credit card transactions.
3) The datasets used for training the ML model.
4) A simple Java test that builds the infrastructure.

Future PRs will introduce the Dataflow jobs for preloading Cloud Bigtable and running the streaming fraud-detection pipeline.
1 parent 16dc417 commit 785fcb8

18 files changed: +226787 −0 lines
Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
# Credit card fraud detection using Cloud Bigtable

This sample application builds a fast and scalable fraud detection system that uses Cloud Bigtable as its feature store. The feature store holds demographic information (customer IDs, addresses, etc.) and historical transactions. To determine whether a transaction is fraudulent, the system queries the feature store for the customer's demographic information and transaction history.
Cloud Bigtable is a great fit for a feature store for the following reasons:

1. **Scalable:** Cloud Bigtable can handle petabytes of data, allowing the fraud detection service to scale to many customers.

2. **Fast:** Its very low latency helps in this use case because the system needs to identify fraudulent transactions as quickly as possible.

3. **Managed service:** Cloud Bigtable provides this speed and scale in a fully managed service, with maintenance features such as seamless scaling and replication as well as integrations with popular big data tools like Hadoop, Dataflow, and Dataproc.
## System design

![Fraud detection design](fraud-detection-design.svg)
**1. Input/Output Cloud Pub/Sub topics:** Real-time transactions arrive at the Cloud Pub/Sub input topic, and results are sent to the Cloud Pub/Sub output topic.

**2. ML Model:** The component that estimates the probability that a transaction is fraudulent. This sample application provides a pre-trained ML model and hosts it on Vertex AI ([see the ML Model section](#ml-model)).

**3. Cloud Bigtable as a Feature Store:** Cloud Bigtable stores customers' demographic and historical data. The Dataflow pipeline queries Cloud Bigtable in real time and aggregates customers' demographic and historical data.

**4. Dataflow Pipeline:** The streaming pipeline that orchestrates the whole operation. It reads transaction details from the Cloud Pub/Sub input topic, queries Cloud Bigtable to build a feature vector that is sent to the ML model, and lastly writes the output to the Cloud Pub/Sub output topic (see the sketch after this list).

**5. Data warehouse (BigQuery, Spark, etc.):** This component stores the full history of all transactions queried by the system and runs batch jobs for continuously retraining the ML model. Note that this component is outside the scope of this sample application, since a pre-trained ML model is provided for simplicity.

The infrastructure is defined with Terraform; all of the components listed above are declared in **terraform/main.tf**.
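To make step 4 concrete, here is a minimal Apache Beam (Java) sketch of the pipeline shape. It is not the actual Dataflow job, which a follow-up PR introduces; the topic paths and the `FraudDetectionFn` body are illustrative placeholders.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class FraudDetectionPipelineSketch {

  // Placeholder DoFn: in the real pipeline this would look up features in
  // Cloud Bigtable, call the ML model, and emit the annotated transaction.
  static class FraudDetectionFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(@Element String transaction, OutputReceiver<String> out) {
      // 1. Read the customer row from Cloud Bigtable (the feature store).
      // 2. Build the feature vector and call the hosted ML model.
      // 3. Emit the transaction together with the predicted fraud probability.
      out.output(transaction);
    }
  }

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    pipeline
        .apply("ReadTransactions",
            PubsubIO.readStrings().fromTopic("projects/my-project/topics/input-topic"))
        .apply("ScoreTransactions", ParDo.of(new FraudDetectionFn()))
        .apply("WriteResults",
            PubsubIO.writeStrings().to("projects/my-project/topics/output-topic"));
    pipeline.run();
  }
}
```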
## Datasets

This sample application uses the [Sparkov Data Generation Simulator](https://github.com/namebrandon/Sparkov_Data_Generation) to generate the datasets used for training the ML model and for testing it.

The directory **terraform/datasets/training_data** stores the datasets used for training the ML model. A pre-trained ML model comes with this sample application, but a custom ML model can be trained as well.

The directory **terraform/datasets/testing_data** stores the datasets that can be used for testing the ML model. The ML model was never trained on these transactions. Two testing datasets are provided: one containing fraudulent transactions, and another containing legitimate transactions.
37+
## Cloud Bigtable
38+
39+
### Schema design
40+
41+
Cloud Bigtable stores data in tables, each of which is a sorted key/value map. The table is composed of rows, each of which typically describes a single entity, and columns, which contain individual values for each row. Each row/column intersection can contain multiple cells. Each cell contains a unique timestamped version of the data for that row and column.
42+
43+
This design uses a single table to store all customers' information following [table design best practices.](https://cloud.google.com/bigtable/docs/schema-design#tables) The table is structured as follows:
| row key   |     demographics column family     | historical transactions column family |
|-----------|:----------------------------------:|---------------------------------------:|
| user_id 1 | Customer's demographic information |          Transaction details at time 10 |
|           |                                    |           Transaction details at time 7 |
|           |                                    |           Transaction details at time 4 |
|           |                                    |                                     ... |
| user_id 2 | Customer's demographic information |           Transaction details at time 8 |
|           |                                    |           Transaction details at time 7 |
|           |                                    |                                     ... |

**Row Key:** The row key is the unique user ID.

**Timestamps:** Cloud Bigtable's native cell timestamps are used rather than appending the timestamp to the row key as a suffix, which keeps one row per user (see the write sketch below).
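As a minimal sketch of this layout, the following snippet writes one transaction cell with an explicit native timestamp. The project, instance, table, and column-family names are illustrative assumptions, not values taken from the Terraform code.

```java
import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.models.RowMutation;

public class WriteTransactionSketch {
  public static void main(String[] args) throws Exception {
    // Assumed project/instance IDs for illustration only.
    try (BigtableDataClient client = BigtableDataClient.create("my-project", "my-instance")) {
      long timestampMicros = System.currentTimeMillis() * 1_000L;
      RowMutation mutation =
          RowMutation.create("fraud-detection-table", "user_id_1")
              // The cell timestamp records when the transaction happened, so
              // no timestamp suffix is needed in the row key.
              .setCell("history", "transaction", timestampMicros, "{\"amount\": 42.0}");
      client.mutateRow(mutation);
    }
  }
}
```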
### Column families

The data is split across two column families. Using multiple column families allows different garbage collection policies (see the Garbage Collection Policy section below) and groups data that is often queried together.

**Demographics Column Family:** This column family contains the customers' demographic data. Usually, each customer has one value for each column in this column family.

**History Column Family:** This column family contains the user's historical transactions. The Dataflow pipeline aggregates the data in this column family and sends it, along with the demographics data, to the ML model (see the read sketch below).
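A minimal read sketch, assuming the same illustrative names as above: it fetches a single customer's row but keeps only the history column family, which is roughly the lookup the pipeline performs before calling the ML model.

```java
import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.models.Filters;
import com.google.cloud.bigtable.data.v2.models.Row;
import com.google.cloud.bigtable.data.v2.models.RowCell;

public class ReadFeaturesSketch {
  public static void main(String[] args) throws Exception {
    try (BigtableDataClient client = BigtableDataClient.create("my-project", "my-instance")) {
      // Server-side filter: only cells from the history column family.
      Filters.Filter historyOnly = Filters.FILTERS.family().exactMatch("history");
      Row row = client.readRow("fraud-detection-table", "user_id_1", historyOnly);
      if (row != null) {
        for (RowCell cell : row.getCells()) {
          System.out.println(cell.getTimestamp() + ": " + cell.getValue().toStringUtf8());
        }
      }
    }
  }
}
```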
### Cloud Bigtable configurations
70+
71+
**Number of nodes**
72+
73+
The Terraform code creates a Cloud Bigtable instance that has 1 node. This is a configurable number based on the amount of data and the volume of traffic received by the system. Moreover, Cloud Bigtable supports [autoscaling](https://cloud.google.com/bigtable/docs/autoscaling) where the number of nodes is dynamically selected based on the current system load.
74+
75+
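As a hedged sketch of what enabling autoscaling could look like with the Java admin client (the instance name, cluster name, and target numbers below are assumptions):

```java
import com.google.cloud.bigtable.admin.v2.BigtableInstanceAdminClient;
import com.google.cloud.bigtable.admin.v2.models.ClusterAutoscalingConfig;

public class AutoscalingSketch {
  public static void main(String[] args) throws Exception {
    try (BigtableInstanceAdminClient admin = BigtableInstanceAdminClient.create("my-project")) {
      // Scale between 1 and 5 nodes, targeting 60% average CPU utilization.
      admin.updateClusterAutoscalingConfig(
          ClusterAutoscalingConfig.of("my-instance", "my-cluster")
              .setMinNodes(1)
              .setMaxNodes(5)
              .setCpuUtilizationTargetPercent(60));
    }
  }
}
```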
**Garbage Collection Policy**

The current Terraform code does not set any garbage collection policies. However, it could be beneficial for this use case to set one for the History column family, since the ML model does not need to read a customer's entire history. For example, you could set a policy that deletes all transactions older than `N` months while keeping at least the `M` most recent transactions. The Demographics column family could have a policy that keeps no more than one value in each column. To learn more, see [Types of garbage collection](https://cloud.google.com/bigtable/docs/garbage-collection#types). A sketch of both policies follows.
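A minimal sketch of both policies using the Java admin client, assuming the illustrative names from earlier and `N` = 90 days, `M` = 10 transactions. The intersection rule deletes a history cell only when it is both older than 90 days and beyond the 10 most recent versions, which keeps at least the last 10 transactions.

```java
import static com.google.cloud.bigtable.admin.v2.models.GCRules.GCRULES;

import com.google.cloud.bigtable.admin.v2.BigtableTableAdminClient;
import com.google.cloud.bigtable.admin.v2.models.ModifyColumnFamiliesRequest;
import java.util.concurrent.TimeUnit;

public class GarbageCollectionSketch {
  public static void main(String[] args) throws Exception {
    try (BigtableTableAdminClient admin =
        BigtableTableAdminClient.create("my-project", "my-instance")) {
      admin.modifyFamilies(
          ModifyColumnFamiliesRequest.of("fraud-detection-table")
              // History: delete a cell only when it is BOTH older than ~3
              // months AND beyond the 10 most recent versions.
              .updateFamily(
                  "history",
                  GCRULES.intersection()
                      .rule(GCRULES.maxAge(90, TimeUnit.DAYS))
                      .rule(GCRULES.maxVersions(10)))
              // Demographics: keep only the latest value in each column.
              .updateFamily("demographics", GCRULES.maxVersions(1)));
    }
  }
}
```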
**Replication**

The current Cloud Bigtable instance configuration does not enable replication. However, to improve availability and lower latency for transactions in different regions, the table can be replicated across multiple zones. This makes the system eventually consistent, but for a use case like fraud detection, eventual consistency usually works well. To learn more, see [Cloud Bigtable replication use cases](https://cloud.google.com/bigtable/docs/replication-overview#use-cases). The sketch below shows one way to route requests across replicated clusters.
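As one possible sketch (the instance and profile names are assumptions): once replication is enabled, an app profile with multi-cluster routing lets Cloud Bigtable route each request to the nearest available cluster and fail over automatically.

```java
import com.google.cloud.bigtable.admin.v2.BigtableInstanceAdminClient;
import com.google.cloud.bigtable.admin.v2.models.AppProfile.MultiClusterRoutingPolicy;
import com.google.cloud.bigtable.admin.v2.models.CreateAppProfileRequest;

public class ReplicationRoutingSketch {
  public static void main(String[] args) throws Exception {
    try (BigtableInstanceAdminClient admin = BigtableInstanceAdminClient.create("my-project")) {
      admin.createAppProfile(
          CreateAppProfileRequest.of("my-instance", "fraud-detection-profile")
              .setDescription("Route to the nearest available cluster")
              // Multi-cluster routing trades strong consistency for
              // availability, which suits fraud detection.
              .setRoutingPolicy(MultiClusterRoutingPolicy.of()));
    }
  }
}
```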
## ML Model

This sample application provides a pre-trained Boosted Trees Classifier ML model that uses parameters similar to those described in [How to build a serverless real-time credit card fraud detection solution](https://cloud.google.com/blog/products/data-analytics/how-to-build-a-fraud-detection-solution).

The ML model is located at **terraform/model**. The sketch below shows how the pipeline might call the model once it is deployed to a Vertex AI endpoint.
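This is a hedged sketch only; the project, region, endpoint ID, and feature names below are assumptions, not values from this PR.

```java
import com.google.cloud.aiplatform.v1.EndpointName;
import com.google.cloud.aiplatform.v1.PredictResponse;
import com.google.cloud.aiplatform.v1.PredictionServiceClient;
import com.google.cloud.aiplatform.v1.PredictionServiceSettings;
import com.google.protobuf.Value;
import com.google.protobuf.util.JsonFormat;
import java.util.Collections;

public class PredictSketch {
  public static void main(String[] args) throws Exception {
    PredictionServiceSettings settings =
        PredictionServiceSettings.newBuilder()
            .setEndpoint("us-central1-aiplatform.googleapis.com:443")
            .build();
    try (PredictionServiceClient client = PredictionServiceClient.create(settings)) {
      EndpointName endpoint = EndpointName.of("my-project", "us-central1", "1234567890");
      // The feature vector built from Cloud Bigtable, encoded as a JSON struct
      // (feature names here are illustrative).
      Value.Builder instance = Value.newBuilder();
      JsonFormat.parser().merge("{\"amt\": 42.0, \"city_pop\": 100000}", instance);
      PredictResponse response =
          client.predict(
              endpoint, Collections.singletonList(instance.build()), Value.getDefaultInstance());
      System.out.println(response);
    }
  }
}
```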

bigtable/use-cases/fraudDetection/fraud-detection-design.svg

Lines changed: 4 additions & 0 deletions
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.example</groupId>
  <artifactId>fraudDetection</artifactId>
  <version>1.0-SNAPSHOT</version>

  <properties>
    <maven.compiler.source>8</maven.compiler.source>
    <maven.compiler.target>8</maven.compiler.target>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.13.2</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>
Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
/*
 * Copyright 2022 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import static org.junit.Assert.assertNotNull;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class FraudDetectionTestUtil {

  // Make sure that the given value was populated by running Terraform.
  public static void requireVar(String value) {
    assertNotNull(value);
  }

  // Make sure that the required environment variables are set before running the tests.
  public static String requireEnv(String varName) {
    String value = System.getenv(varName);
    assertNotNull(
        String.format("Environment variable '%s' is required to perform these tests.", varName),
        value);
    return value;
  }

  // Parse the Terraform output and populate the variables needed for testing.
  private static void parseTerraformOutput(Process terraformProcess) throws IOException {
    try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(terraformProcess.getInputStream()))) {
      // Process the Terraform output line by line.
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
        if (line.contains("pubsub_input_topic = ")) {
          StreamingPipelineTest.pubsubInputTopic = line.split("\"")[1];
        } else if (line.contains("pubsub_output_topic = ")) {
          StreamingPipelineTest.pubsubOutputTopic = line.split("\"")[1];
        } else if (line.contains("pubsub_output_subscription = ")) {
          StreamingPipelineTest.pubsubOutputSubscription = line.split("\"")[1];
        } else if (line.contains("gcs_bucket = ")) {
          StreamingPipelineTest.gcsBucket = line.split("\"")[1];
        } else if (line.contains("cbt_instance = ")) {
          StreamingPipelineTest.cbtInstanceID = line.split("\"")[1];
        } else if (line.contains("cbt_table = ")) {
          StreamingPipelineTest.cbtTableID = line.split("\"")[1];
        }
      }
    }
  }

  // Run a command, stream its output through the Terraform parser, and return the exit code.
  public static int runCommand(String command) throws IOException, InterruptedException {
    Process process = new ProcessBuilder(command.split(" ")).start();
    parseTerraformOutput(process);
    // Wait for the process to finish running and return the exit code.
    return process.waitFor();
  }
}
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
/*
 * Copyright 2022 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import static org.junit.Assert.assertEquals;

import java.io.IOException;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;

public class StreamingPipelineTest {

  // The following variables are populated automatically by running Terraform.
  static String cbtInstanceID;
  static String cbtTableID;
  static String gcsBucket;
  static String pubsubInputTopic;
  static String pubsubOutputTopic;
  static String pubsubOutputSubscription;
  private static String projectID;

  @BeforeClass
  public static void beforeClass() throws InterruptedException, IOException {
    projectID = FraudDetectionTestUtil.requireEnv("GOOGLE_CLOUD_PROJECT");
    System.out.println("Project id = " + projectID);
    // Run Terraform to populate all variables necessary for testing, and assert
    // that the exit code is 0 (no errors).
    assertEquals(0, FraudDetectionTestUtil.runCommand("terraform -chdir=terraform/ init"));
    assertEquals(
        0,
        FraudDetectionTestUtil.runCommand(
            "terraform -chdir=terraform/ apply -auto-approve -var=project_id=" + projectID));
  }

  @AfterClass
  public static void afterClass() throws IOException, InterruptedException {
    // Destroy all the resources we built before testing.
    assertEquals(
        0,
        FraudDetectionTestUtil.runCommand(
            "terraform -chdir=terraform/ destroy -auto-approve -var=project_id=" + projectID));
  }

  // Assert that the variables exported by Terraform are not null.
  @Test
  public void testTerraformSetup() {
    FraudDetectionTestUtil.requireVar(pubsubInputTopic);
    FraudDetectionTestUtil.requireVar(pubsubOutputTopic);
    FraudDetectionTestUtil.requireVar(pubsubOutputSubscription);
    FraudDetectionTestUtil.requireVar(gcsBucket);
    FraudDetectionTestUtil.requireVar(cbtInstanceID);
    FraudDetectionTestUtil.requireVar(cbtTableID);

    System.out.println("pubsubInputTopic= " + pubsubInputTopic);
    System.out.println("pubsubOutputTopic= " + pubsubOutputTopic);
    System.out.println("pubsubOutputSubscription= " + pubsubOutputSubscription);
    System.out.println("gcsBucket= " + gcsBucket);
    System.out.println("cbtInstanceID= " + cbtInstanceID);
    System.out.println("cbtTableID= " + cbtTableID);
  }
}
