Inference Detector Hosting Design - Complete Version
Overview
Amazon CodeGuru Reviewer is a public AWS service that uses program analysis and machine learning to perform automated code
reviews. The Inference System is one of the core components of the service: it runs inference on a customer's pull request or on-demand
jobs, extracts intermediate representation (IR) from the code artifacts provided by the customer, runs the Guru Rules Engine on the
IR, generates recommendations, and surfaces results to different providers. With our launch at re:Invent 2019 and GA in OP2
2020, we built the inference service to support the MVP use cases of pull request inference and on-demand
inference for GitHub, CodeCommit, Bitbucket, and GitHub Enterprise repositories.
The CodeGuru Reviewer team has been constantly adding more types of detectors, covering more best-practice areas, and
integrating additional frameworks to advance recommendation quality and increase coverage. As of this writing, the Inference
System has been running the CodeGuru Java detector framework, which comprises 13 major categories, including 76 distinct rules
(63 for external production) and a total of 1695 micro rules. With the upcoming feature of supporting the Python language, and the
integration with DragonGlass (a code analysis engine that runs over compiler output and produces recommendations), more
detector frameworks are expected to come in. With growing customer traffic and an increasing number of detector frameworks,
scaling and development velocity will become a concern. We need to further extend the service so that it can efficiently run these
detectors and rules without impacting performance, and cleanly manage such a large catalog of detectors.
This document describes the design for extending the existing Inference System to support the increasing number of detectors and
provides solutions for managing detectors. The idea is to split the inference detectors into smaller modules, where each module
runs in its own process and communicates through lightweight mechanisms. There will be a bare minimum of centralized
management of these detector services, and individual detector services can be written in different programming languages and
use different program analysis technologies.
Glossary
● Detector framework: A detector framework comprises a set of rules that run program analysis on customer code artifacts
and generate recommendations. Example detector frameworks are the DragonGlass detector framework, the MuGraph Java
detector framework, and the MuGraph Python detector framework.
● Rule: A rule is an analyzer that uses one tool to do program analysis and produce recommendations. The tool could be
Soot, GQL, cfn-lint, or any other technology. Each rule focuses on one specific area of detection. Some example
rules are: S3 security best practices (DG), the missing pagination rule (MuGraph Java), and the deadlock rule (MuGraph Java). A rule
can contain a list of micro rules that share the same analyzing logic.
● Micro rule: A micro rule is a rule that shares the same analyzing logic as its parent rule but is more granular. An example
is the polling-to-waiters rule: the same logic applies to different services (S3, Kinesis, CloudWatch, CloudFront, and
more), and these micro rules share the same analyzing code.
● IR: intermediate representation of customer code.
Use cases
● [P0] As the recommendation engine, I can take in the IR from feature extraction and get recommendations from
DragonGlass detectors.
● [P0] As a CodeGuru team member, I can view a list of DragonGlass detectors and corresponding versions that are running
in each stage and region.
● [P0] As a CodeGuru team member, I can dynamically enable/disable a DragonGlass detector in any stage without interfering
with other detectors.
● [P0] As a CodeGuru team member, I can deploy/rollback a DragonGlass detector independently without interfering with other
detectors.
● [P1] As a CodeGuru team member, I can view a list of all CodeGuru detectors and corresponding versions that are running
in each stage and region.
● [P1] As a CodeGuru team member, I can enable/disable any CodeGuru detector in any stage without interfering with other
detectors.
● [P2] As a CodeGuru team member, I can deploy/rollback any CodeGuru detector independently.
Design tenets
● Security: It securely processes customer code artifacts to provide recommendations.
● VM level isolation: It can process each request with VM level isolation.
● Multi-Framework: It supports detectors from different analysis frameworks (MuGraph based, Soot framework), which are
added as plugins without interrupting or impacting other platform detectors.
● Scalable: It supports executing and managing an increasing number of detectors securely with low latency.
● Custom Detectors: It supports adding custom recommendations using customer defined detectors.
Out of scope
● This document doesn’t cover support for multiple languages. Language extension will be covered in a separate
project, Multi-language support for CodeGuru Reviewer.
● This document doesn’t cover extracting the intermediate representation (IR) from source code. This part will be covered
in Code artifact system design.
● This document doesn’t cover the Artifact Builder migration to an external service, but the design will be extensible to support Artifact Builder
integration. (Artifact Builder is a service that the CodeGuru Reviewer team uses to handle and process internal code
review requests from GoodCop partners.) [Work in Progress] Artifact Builder integration with CodeGuru-Reviewer
● This document doesn’t cover the training of detectors.
Related work
Overview of solution
With the expansion of the service to support more languages, and more distinct detector frameworks onboarding to CodeGuru
Reviewer, the current inference architecture is not scaling well. The current infrastructure lacks extensibility because a single coupled
workflow handles metadata service interaction with the database, IR generation, detector execution, and publishing comments to
source providers, which makes it hard for detectors to be tested independently and have their outputs validated. As the inference
service scales, it’s important to provide a platform where detector frameworks can integrate/onboard easily and roll out detectors
smoothly without impacting the functionality the service already provides. The flexibility to deploy detector analysis improvements
independently is also missing from the current infrastructure.
In this doc, we propose a hybrid model of single-tenancy and multi-tenancy architecture to host detector services.
The choice between single-tenancy and multi-tenancy depends on the detector framework details and its security level. Each detector
framework runs as a service with its own choice of underlying compute engine (for example Lambda, EC2, ECS, Fargate) and
processes requests as they come in. When an inference request comes in, the workflow loads the detector configurations, analyzes
the available code artifacts (provided by the IR service), and maps them to a list of detectors to execute. This list of detectors will be
executed in parallel. Detector executions will run in a private VPC with no public internet access, and traffic will go through
PrivateLink endpoints to talk to related AWS services. The security isolation level for each request will stay at VM-level isolation
to provide a secure experience for customers' code analysis.

Each detector framework also has its own docker image to allow independent iteration of development and deployment. Changing
detector hosting from an on-demand task to a self-running service is expected to bring a latency improvement of 60-120 seconds by
saving the instance warm-up time (ENI provisioning, docker image pull and installation, metadata loading). Detector configurations
will switch from static configuration to dynamic configuration using AWS AppConfig to make it easier to enable/disable a rule in
different stages and avoid operational churn; the configuration update time will shrink from multiple days of work to within seconds.
This secure dynamic configuration also offers syntax validation and version history management to provide a seamless, secure
configuration update experience. Each detector service will auto scale its number of workers based on the SQS request queue size
and message age in the queue.
Q1-Q3 2021
In 2021, we will migrate the existing MUGraph Java detector and Python detector to the above infrastructure. Those detectors
will also run as services. The plan is to have them running in multi-tenancy mode with the following security mitigations:
1) avoid any unnecessary disk usage and keep everything in memory to avoid leftover data; 2) add
encryption/decryption for any data that absolutely needs to go through disk; 3) utilize RAMFS as the disk caching mechanism.
The current inference ECS Fargate container is responsible for cloning the repository, fetching code artifacts, executing detectors,
and generating recommendations. As the service scales and more detectors are introduced by both the CodeGuru team and
internal Amazon teams, the single container is no longer scalable. Detectors from different sources can depend on different code
artifacts and can also run with different environment requirements. Splitting into separate workflows with more modularized
functions brings the following benefits.
1. Scaling: it allows the service team to expand the workflow individually without coupling into one monolithic workflow.
2. Deployment: it allows the service team to roll back or roll forward a function without impacting other parts.
4. Fault tolerance: it allows the system to be more available and limits the blast radius of errors.
4. Testability: it allows the tests to be more granular and focused without mimicking the whole workflow experience.
This doc proposes separating the FetchCodeArtifacts workflow and the DetectorExecution workflow.
The FetchCodeArtifacts workflow design will be covered in the code artifact system design. The DetectorExecution workflow is responsible
for taking the input code artifacts, running all related detectors, and generating recommendations.
Detectors will run in parallel in the detector execution workflow on different execution branches using the
Step Functions Parallel state. Each branch will fetch related configurations and code artifacts to build the manifest, run the detectors,
and fetch recommendations from the storage choice of each detector service.
1. Run detectors: this step takes the output of the PrepareDetectorManifest step and invokes detector execution. See details
in Detector hosting infrastructure.
2. FetchRecommendations: this step fetches the recommendations generated by the corresponding detector from its choice of
storage, validates the output formats, aggregates them together, and stores them in the inference storage database for later
operations to surface recommendations to the customer.
WSD link
DETECTOR HOSTING INFRASTRUCTURE
Each detector will run as a service using its choice of compute engine. For the DragonGlass detector, we will run it
in AWS ECS Fargate. Each Fargate task will be long running and will be preloaded with the detector’s docker image, required
dependencies, and metadata. When an inference request comes in, a task instance picks up the request job, does program
analysis, produces recommendations, and saves the result to a dedicated S3 bucket. Once the request is done, the task instance will
be cleaned up and terminated.
Q: What level of isolation should the service provide in processing customer requests?
Inference Detector Hosting will provide VM-level isolation. Each customer request will be processed on one VM at a time.
VM-level isolation provides complete isolation from the host operating system and other VMs. Container-level isolation is not
sufficient, as an attacker could potentially break out of the container and gain access to the instance or OS that is
hosting the container.
Virtual machine vs. container:
● Operating system: A virtual machine runs a complete operating system including the kernel, thus requiring more system resources (CPU, memory, and storage). A container runs the user mode portion of an operating system, and can be tailored to contain just the needed services for your app, using fewer system resources.
● Guest compatibility: A virtual machine runs just about any operating system inside the virtual machine. A container runs on the same operating system version as the host.
We opted for this option for DragonGlass detectors because these detectors currently run directly on the customer’s build artifacts.
This increases the security risks, especially given that we are integrating with DragonGlass’s framework for the first time in our
production service. DragonGlass uses the Soot framework, which is a third-party framework that runs on build artifacts on disk.
We looked at various options to make it multi-tenant - encrypting build artifacts, running them off of memory, separating the step of
generating the IR (Jimple) into CodeArtifactService - however, each of these options presented a risk to launching the DragonGlass
integration. To keep it simple, we opted for single tenancy, which addresses all the security concerns.
Scalability: The concurrent task limit caps the total number of requests that the Inference System can process simultaneously, and
since each task runs longer than in a multi-tenancy architecture (because of the boot time), this option could be less scalable than
multi-tenancy. The concurrent task limit varies by region; detector services will be set up to run in different AWS accounts under
different ECS clusters to allow a bigger limit.
Q: How does the service achieve one request per instance per time to provide VM level isolation?
Solution 1: application load balancer
One approach is to define our own Application Load Balancer to distribute incoming traffic to available task instances.
The Application Load Balancer takes requests from clients and distributes them across targets in the target group. Load balancing is a
synchronous communication between clients and backend servers: if a request is not distributed to a target within the timeout, the
request times out and gets aborted. This won’t work well for our use case.
Q: Is this a two-way door? Can we easily switch from single-tenancy to sequential-tenancy or multi-tenancy?
Yes and yes. This is a two-way door, and we can easily switch to sequential-tenancy or multi-tenancy if requirements
change. With this worker-poller setup, at any point that we want to switch to a different tenancy mode, all we need to do is
update the polling strategy, as the sketch below illustrates.
● For single-tenancy, the worker polls one request at a time; once the request is done, the instance shuts down to guarantee
no instance gets reused for different requests.
● For sequential-tenancy, the worker polls one request at a time; once the request is done, the instance makes sure that all left-
over resources are deleted/cleaned and the instance state is reset, ready to process the next request.
● For multi-tenancy, the worker polls multiple requests at a time, and those requests run in parallel on the same instance.
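A minimal sketch of the worker-poller idea, assuming illustrative names (DetectorWorkerPoller, runDetectors, an environment variable for the queue URL) and the AWS SDK for Java v1; it is not the production implementation, only an illustration of how the tenancy mode reduces to a polling strategy.

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

import java.util.List;

// Minimal sketch: the tenancy mode is just a different polling strategy.
public class DetectorWorkerPoller {

    enum TenancyMode { SINGLE, SEQUENTIAL, MULTI }

    private static final String QUEUE_URL = System.getenv("DETECTOR_REQUEST_QUEUE_URL"); // assumed env var

    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        TenancyMode mode = TenancyMode.SINGLE;

        boolean keepPolling = true;
        while (keepPolling) {
            // Multi-tenancy polls a batch; single/sequential tenancy poll one request at a time.
            int batchSize = (mode == TenancyMode.MULTI) ? 10 : 1;
            List<Message> messages = sqs.receiveMessage(new ReceiveMessageRequest(QUEUE_URL)
                    .withMaxNumberOfMessages(batchSize)
                    .withWaitTimeSeconds(20)) // long polling
                    .getMessages();

            // MULTI mode would process these messages in parallel; shown sequentially for brevity.
            for (Message message : messages) {
                runDetectors(message.getBody());                 // placeholder analysis entry point
                sqs.deleteMessage(QUEUE_URL, message.getReceiptHandle());
            }

            if (mode == TenancyMode.SINGLE && !messages.isEmpty()) {
                keepPolling = false;  // single-tenancy: shut the instance down after one request
            } else if (mode == TenancyMode.SEQUENTIAL) {
                resetInstanceState(); // sequential-tenancy: wipe left-over resources before the next poll
            }
        }
    }

    private static void runDetectors(String requestBody) { /* placeholder for detector execution */ }

    private static void resetInstanceState() { /* placeholder for cleanup between requests */ }
}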
DrawIO link
Q: Why choose the callback pattern with SQS over an activity worker?
● Scalability
○ Solution 1: a Step Functions activity scales up to 200 TPS. Per the Step Functions team, this is a scalability limit; it starts to
get slow when exceeding this limit. An activity is an only-once matching: when a worker calls GetActivityTask, Step
Functions matches the activity to the worker call. If the ECS worker dies, we won’t be able to get this token again. See
more details in the Slack discussion thread with the Step Functions team.
○ Solution 2: for SQS, standard queues support a nearly unlimited number of API calls per second, per API
action. FIFO queues support up to 3,000 transactions per second, per API method, with batching.
The detector worker will poll from the SQS queue. For a duplicate message, the detector can check its corresponding storage and
verify whether recommendations have already been generated; if they have, it discards the message. A sketch of this callback pattern
with the idempotency check is shown below.
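A minimal sketch of the callback pattern, assuming the Step Functions task is started with .waitForTaskToken and places a message on the SQS queue whose body carries a taskToken field along with the detector execution request; the field names, environment variables, and the S3 result location are assumptions for illustration.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
import com.amazonaws.services.stepfunctions.AWSStepFunctions;
import com.amazonaws.services.stepfunctions.AWSStepFunctionsClientBuilder;
import com.amazonaws.services.stepfunctions.model.SendTaskFailureRequest;
import com.amazonaws.services.stepfunctions.model.SendTaskSuccessRequest;
import com.amazonaws.util.json.Jackson;
import com.fasterxml.jackson.databind.JsonNode;

public class CallbackPatternWorker {

    private static final String QUEUE_URL = System.getenv("DETECTOR_REQUEST_QUEUE_URL"); // assumed
    private static final String RESULT_BUCKET = System.getenv("RECOMMENDATION_BUCKET");  // assumed

    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        AWSStepFunctions stepFunctions = AWSStepFunctionsClientBuilder.defaultClient();

        for (Message message : sqs.receiveMessage(new ReceiveMessageRequest(QUEUE_URL)
                .withMaxNumberOfMessages(1)
                .withWaitTimeSeconds(20)).getMessages()) {

            JsonNode request = Jackson.jsonNodeOf(message.getBody());
            String taskToken = request.get("taskToken").textValue();   // assumed field name
            String resultKey = request.get("resultS3Key").textValue(); // assumed field name

            try {
                // Idempotency check: if a duplicate delivery already produced recommendations, skip the work.
                if (!s3.doesObjectExist(RESULT_BUCKET, resultKey)) {
                    runDetectorsAndStoreRecommendations(request, RESULT_BUCKET, resultKey);
                }
                // Callback: report success to Step Functions using the task token carried in the message.
                stepFunctions.sendTaskSuccess(new SendTaskSuccessRequest()
                        .withTaskToken(taskToken)
                        .withOutput("{\"resultS3Key\":\"" + resultKey + "\"}"));
            } catch (Exception e) {
                stepFunctions.sendTaskFailure(new SendTaskFailureRequest().withTaskToken(taskToken));
            } finally {
                sqs.deleteMessage(QUEUE_URL, message.getReceiptHandle());
            }
        }
    }

    private static void runDetectorsAndStoreRecommendations(JsonNode request, String bucket, String key) {
        // placeholder for detector execution and writing results to S3
    }
}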
[DrawIO link]
(Recommended) Option 2: multiple rule execution requests per detector framework per inference request
This approach submits multiple SQS messages per detector framework for each inference request. These requests are polled by
ECS Fargate containers, and each message runs a subset of rules under the detector framework. For a single job, there can be
multiple containers executing different rules to generate recommendations, as in the example messages below, followed by a sketch of the fan-out.
// message 1
{
"inferenceJobIdentifier": {
"accountId": "01234567890",
"jobName": "job1",
"jobType": "OnDemand"
},
"codeArtifacts": [
{
"type": "binaryJar",
"s3Bucket": "code-artifact/job1",
"s3Key": "job1-binary-jar.tar.gz"
}
],
"detectorFramework": "DragonGlass",
"ruleName": "KmsBestPractices"
}
// message 2
{
"inferenceJobIdentifier": {
"accountId": "01234567890",
"jobName": "job1",
"jobType": "OnDemand"
},
"codeArtifacts": [
{
"type": "binaryJar",
"s3Bucket": "code-artifact/job1",
"s3Key": "job1-binary-jar.tar.gz"
}
],
"detectorFramework": "DragonGlass",
"ruleName": "TlsBestPractices"
}
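For illustration, the fan-out could be produced by the workflow step submitting one message per enabled rule; the class name DetectorRuleFanOut and the queue URL are assumptions, and the message body mirrors the example messages above.

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.SendMessageRequest;

import java.util.Arrays;
import java.util.List;

// Illustrative fan-out: one SQS message per enabled rule of a detector framework for one inference job.
public class DetectorRuleFanOut {

    private static final String DETECTOR_QUEUE_URL = System.getenv("DRAGONGLASS_QUEUE_URL"); // assumed

    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        List<String> enabledRules = Arrays.asList("KmsBestPractices", "TlsBestPractices");

        for (String ruleName : enabledRules) {
            // In the real workflow the body would be built from the PrepareDetectorManifest output
            // rather than hard-coded; the fields mirror the example messages above.
            String body = "{"
                    + "\"inferenceJobIdentifier\": {\"accountId\": \"01234567890\", \"jobName\": \"job1\", \"jobType\": \"OnDemand\"},"
                    + "\"codeArtifacts\": [{\"type\": \"binaryJar\", \"s3Bucket\": \"code-artifact/job1\", \"s3Key\": \"job1-binary-jar.tar.gz\"}],"
                    + "\"detectorFramework\": \"DragonGlass\","
                    + "\"ruleName\": \"" + ruleName + "\""
                    + "}";
            sqs.sendMessage(new SendMessageRequest(DETECTOR_QUEUE_URL, body));
        }
    }
}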
[DrawIO link]
Recommendations will be stored in each detector service’s own choice of storage solution. The detector service is
responsible for granting the CodeGuru inference system permission to retrieve the generated recommendations from its choice
of storage. The recommendation format should follow the CodeGuru inference system’s format; see the example format below.
{
"recommendations": [
{
"repoName": "repo1",
"filePath": "src/foo/bar/def.java",
"startLine": 1,
"endLine": 2,
"comment": "comment1",
"detectorId": "aws-best-practice",
"confidenceScore": 1 // optional field
},
{
"repoName": "repo1",
"filePath": "src/foo/bar/abc.java",
"startLine": 10,
"endLine": 11,
"comment": "comment2",
"detectorId": "concurrency",
"confidenceScore": 0.8 // optional field
}
]
}
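As a sketch of the format validation that the FetchRecommendations step performs before aggregation; the required fields are inferred from the example format above, and the class name RecommendationValidator is a placeholder.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.ArrayList;
import java.util.List;

// Sketch: validate that detector output matches the expected recommendation format before storing it.
public class RecommendationValidator {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static List<JsonNode> validRecommendations(String detectorOutputJson) throws Exception {
        JsonNode root = MAPPER.readTree(detectorOutputJson);
        List<JsonNode> valid = new ArrayList<>();
        for (JsonNode rec : root.path("recommendations")) {
            // Required fields per the example format above; confidenceScore stays optional.
            boolean hasRequiredFields = rec.hasNonNull("repoName")
                    && rec.hasNonNull("filePath")
                    && rec.hasNonNull("startLine")
                    && rec.hasNonNull("endLine")
                    && rec.hasNonNull("comment")
                    && rec.hasNonNull("detectorId");
            boolean sensibleLines = rec.path("startLine").asInt(-1) >= 1
                    && rec.path("endLine").asInt(-1) >= rec.path("startLine").asInt(-1);
            if (hasRequiredFields && sensibleLines) {
                valid.add(rec);
            }
            // Malformed entries are dropped here; the publish-comment Lambda performs a second validation pass.
        }
        return valid;
    }
}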
For CodeGuru-owned detector services, recommendations will be encrypted and stored in an S3 bucket. The detector service running in
ECS Fargate talks to S3 through PrivateLink endpoints to store results.
NETWORK INFRASTRUCTURE
Detectors will run in private subnets with no public internet access. All network traffic will go through PrivateLink endpoints
to connect the service VPC to AWS services. This setup keeps the detector hosting servers secure, as they are not publicly
accessible. Fargate instances will also have limited IAM roles and permissions attached, allowing access only to the buckets they
need.
Each detector framework will have its own docker image. Docker images will be built in independent pipelines and stored in ECR.
This allows detectors to be deployed independently from each other without interruptions.
The ECS Fargate service is deployed using rolling updates. The service scheduler replaces the currently running containers with the latest
version. When a service deployment is made, Fargate fetches the image specified in the task definition from ECR and creates
docker containers automatically. Depending on the configuration, Fargate will attempt to keep a certain percentage of old tasks
around until the new tasks can serve traffic; this percentage can be configured using the minimum healthy percent option. The
maximum percent option can be used to limit the number of running tasks (old + new) during a service deployment.
Once the desired number of tasks, as specified in the configuration, is achieved, the old tasks are automatically terminated. A sketch of this deployment configuration is shown below.
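For illustration, the rolling-update behavior described above can be tuned through the ECS deployment configuration; the sketch below uses the AWS SDK for Java v1, and the cluster and service names are placeholders.

import com.amazonaws.services.ecs.AmazonECS;
import com.amazonaws.services.ecs.AmazonECSClientBuilder;
import com.amazonaws.services.ecs.model.DeploymentConfiguration;
import com.amazonaws.services.ecs.model.UpdateServiceRequest;

public class DetectorServiceDeploymentConfig {

    public static void main(String[] args) {
        AmazonECS ecs = AmazonECSClientBuilder.defaultClient();

        // Keep 100% of old tasks serving traffic until new tasks are healthy, and allow up to
        // 200% of the desired count to run during the rolling update (old + new tasks).
        ecs.updateService(new UpdateServiceRequest()
                .withCluster("dragonglass-detector-cluster")   // placeholder cluster name
                .withService("dragonglass-detector-service")   // placeholder service name
                .withDeploymentConfiguration(new DeploymentConfiguration()
                        .withMinimumHealthyPercent(100)
                        .withMaximumPercent(200)));
    }
}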
Reference:
1. https://docs.aws.amazon.com/AmazonECS/latest/developerguide/update-service.html
2. https://docs.aws.amazon.com/AmazonECS/latest/developerguide/deployment-type-ecs.html
● Underlying framework - Detector micro services should be separated based on the underlying core analysis frameworks. If a
detector relies on 3rd-party libraries or tools to perform analysis, from a security perspective it’s recommended to keep
different tools separate so that they don't interfere with each other.
● Complexity - For detector services that are growing and have an increasing number of rules, detector owners should
consider the resource utilization and memory consumption. For detector services running in containers, as more rules or
analyzers get added, the docker image size will increase. We should keep in mind to split the service when the docker image
grows too big.
● Ownership - For detectors that are owned by different teams, it’s recommended to keep each team’s detector service as
an independent service to allow operations to be decoupled from each other.
● Latency - Each detector service should be able to keep latency under 5 minutes for p99 jobs. This latency should cover
from the time that an execution request comes in to the time that the detector stores recommendations to the storage system.
● Cost - For lightweight detector services, a cost estimate is required when considering whether to combine into an existing
detector service or split into another detector service.
● IRs - Detectors can be language agnostic and infer on multiple different IRs. Each micro detector service should define
the required IRs to run its analysis, and it can require one or multiple IRs.
○ Examples:
■ DG: bytecode (java) + source code
■ MUGraph Java: source code (java) + mugraph java
■ MUGraph Python: source code (python) + mugraph python
■ CloudFormation: cloudformation template (yaml/json)
An example:
Suppose we are introducing a new DragonGlass detector that analyzes Python bytecode. We should first consider
what underlying tool or framework the new detector is using. If it’s using Soot, which is the same as the DG detector's Java bytecode
analysis, we should consider incorporating it into the existing docker image and utilizing the same detector framework. When integrating
into an existing detector framework, a benchmark is required to understand what the new detector needs in terms of compute resources (disk
usage, memory usage). If it’s too heavy to run in one container, we should also consider splitting it into a different service. If
the detector uses a completely different analysis tool, we should consider starting a new detector service. When there are
doubts or questions, always reach out to the CodeGuru Reviewer data plane team to present the use case and work together to
make the best choice.
DETECTOR MANAGEMENT
For the existing inference Java detectors, we currently use brazil config to enable or disable a rule and to load rule configuration,
thread count settings, and memory allocation. This static way of managing all the configurations is not scalable or operationally
friendly. There are use cases where a config change needs to take effect within seconds (example: after a detector deployment,
one single rule is failing while other rules have no issues, and we need to disable that rule to allow the other rules to run while debugging), or
where a config needs to be tweaked several times (example: confidence threshold adjustment). Also, in our current setup, each configuration
update leads to a docker image rebuild. Dynamic configuration will help mitigate these problems. An example configuration document is shown below, followed by a sketch of how a detector service could consume it.
{
"Detectors": [
{
"name": "DragonGlassDetector",
"status": "active",
"rules":[
{
"name": "KMS best practices",
"status": "active",
"config": "s3://dgdetector/config/kms-best-practice/config-v1.json"
},
{
"name": "Javax.Crypto Best Practices",
"status": "active",
"config": "s3://dgdetector/config/javax-crypto-best-practice/config-v2.json"
},
{
"name": "TLS Best Practices",
"status": "active"
},
{
"name": "AMI Best Practices",
"status": "inactive"
}
]
},
{
"name": "MuGraphJavaDetector",
"status": "active",
"rules":[
{
"name": "code-clone",
"status": "active",
"config": "s3://java-detector/config/codeclone/config.json"
},
{
"name": "input-validation",
"status": "active",
"config": "s3://java-detector/config/javax-crypto-best-practice/config-v2.jso
}
]
},
{
"name": "MuGraphPythonDetector",
"status": "inactive",
"rules":[]
}
]
}
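A minimal sketch of how a detector service could consume this configuration; the raw JSON bytes would be fetched from AWS AppConfig at runtime, and the class name DetectorConfigLoader and the framework name argument are illustrative.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.ArrayList;
import java.util.List;

// Sketch: parse the dynamic detector configuration and return the active rules for one framework.
public class DetectorConfigLoader {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static List<String> activeRules(String configJson, String detectorName) throws Exception {
        JsonNode root = MAPPER.readTree(configJson);
        List<String> activeRules = new ArrayList<>();
        for (JsonNode detector : root.path("Detectors")) {
            if (!detectorName.equals(detector.path("name").asText())
                    || !"active".equals(detector.path("status").asText())) {
                continue; // skip other frameworks and inactive detectors
            }
            for (JsonNode rule : detector.path("rules")) {
                if ("active".equals(rule.path("status").asText())) {
                    // Each active rule may also carry an S3 location for its own config in the "config" field.
                    activeRules.add(rule.path("name").asText());
                }
            }
        }
        return activeRules;
    }
}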
1. ECS with Fargate: Each task that uses the Fargate launch type has its own isolation boundary and does not share the
underlying kernel, CPU resources, memory resources, or elastic network interface with another task.
2. Step Function: a serverless function orchestrator that makes it easy to sequence AWS Lambda functions and multiple
AWS services into business-critical applications. Inference core workflow is based on Step Function state machine.
3. Lambda: an event-driven, serverless computing platform.
4. IAM: credential management.
5. S3: a storage service.
6. VPC: virtual private cloud that allows detectors to run in isolated sections of the AWS cloud
7. PrivateLink: it allows isolated containers in the private VPC to access services hosted on AWS and keeps network traffic within
the AWS network.
SQS limit
1. Message size: The minimum message size is 1 byte (1 character). The maximum is 262,144 bytes (256 KB).
2. Message visibility timeout: The default visibility timeout for a message is 30 seconds. The minimum is 0 seconds. The
maximum is 12 hours.
3. Message throughput: Standard queues support a nearly unlimited number of API calls per second, per API action
(SendMessage, ReceiveMessage, or DeleteMessage).
4. More details: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-quotas.html
Security considerations
Q: What type of data will this system handle?
Customer code artifacts and intermediate representations will be pulled into the environment for detector execution.
Scaling considerations
The detector hosting system will auto scale the number of running tasks based on changes in traffic volume. The detector service will
be configured with a minimum and a maximum instance count. Below are proposed solutions for scaling up and down; Solution
2 is recommended.
Solution 1: Use a worker utilization metric to scale up and down. Utilization = active worker count / total worker count. We can set an
upper threshold and a lower threshold: when utilization is low, scale down the number of workers; when utilization is high, scale up to
add workers.
(Recommended) Solution 2: Use a target tracking scaling policy to scale up and down based on SQS metrics.
ApproximateNumberOfMessagesVisible is the number of messages available for retrieval from the queue. An alarm can
be set up to auto scale based on this queue depth metric; a backlog-per-worker sketch follows the list of solutions below.
Solution 3: Use scheduled scaling to scale based on a time schedule. For the CodeGuru Reviewer service, the traffic pattern
is more requests on weekdays than on weekends, and more during the day than at night.
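For intuition, target tracking on queue depth effectively keeps the backlog per worker near a target value; the sketch below shows that calculation directly, with the target backlog, worker bounds, and queue URL as assumptions, and the CloudWatch/auto scaling wiring omitted.

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;

import java.util.Collections;

// Sketch: compute the desired worker count from SQS queue depth, the calculation a
// target tracking policy on queue depth approximates.
public class QueueDepthScalingCalculator {

    private static final String QUEUE_URL = System.getenv("DETECTOR_REQUEST_QUEUE_URL"); // assumed
    private static final int TARGET_BACKLOG_PER_WORKER = 5;  // assumed target value
    private static final int MIN_WORKERS = 2;                // assumed minimum instance count
    private static final int MAX_WORKERS = 200;              // assumed maximum instance count

    public static int desiredWorkerCount(AmazonSQS sqs) {
        // ApproximateNumberOfMessages corresponds to the ApproximateNumberOfMessagesVisible metric.
        String depth = sqs.getQueueAttributes(QUEUE_URL,
                Collections.singletonList("ApproximateNumberOfMessages"))
                .getAttributes()
                .get("ApproximateNumberOfMessages");
        int backlog = Integer.parseInt(depth);

        int desired = (int) Math.ceil((double) backlog / TARGET_BACKLOG_PER_WORKER);
        return Math.max(MIN_WORKERS, Math.min(MAX_WORKERS, desired));
    }

    public static void main(String[] args) {
        System.out.println("Desired workers: " + desiredWorkerCount(AmazonSQSClientBuilder.defaultClient()));
    }
}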
For an ECS service, the maximum number of tasks per service is 2,000 by default. This is adjustable. We will also mitigate this
limit by distributing detector services into separate accounts and individual clusters.
Availability considerations
In-flight jobs can be interrupted when a task is stopped, for example during deployments or scale-in. This can be mitigated by utilizing
the container timeouts configuration. ECS provides the stopTimeout configuration, which is the time to wait before the container is
forcefully killed. The max stop timeout value is 120 seconds. If a job finishes within 2 minutes, this won’t be an issue. Based on current
Java detector latency, p99 jobs finish analysis within 1.19 minutes (excluding ECS Fargate task start-up time). Setting the shutdown
timeout to the max value will mitigate the impact for p99 jobs.
In addition, the worker can monitor for the SIGTERM signal sent when the instance is about to shut down. When a worker receives this
signal, it shouldn’t pick up new requests; see the sketch below.
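A minimal sketch of that graceful-shutdown behavior, using a JVM shutdown hook to flip a flag the polling loop checks; the class and method names are illustrative.

import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: stop polling for new detector requests once the task receives a shutdown signal.
public class GracefulShutdownPoller {

    private static final AtomicBoolean shuttingDown = new AtomicBoolean(false);

    public static void main(String[] args) {
        // ECS sends SIGTERM before stopping the task; the JVM runs shutdown hooks on SIGTERM.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> shuttingDown.set(true)));

        while (!shuttingDown.get()) {
            // Poll the SQS request queue and process at most one request per iteration (omitted).
            pollAndProcessOneRequest();
        }
        // In-flight work is allowed to finish within the stopTimeout window before the container is killed.
    }

    private static void pollAndProcessOneRequest() { /* placeholder */ }
}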
Stale hosts
After each deployment, it’s possible that the deployment completes successfully but some instances are still running the old
revision. This shouldn’t be an issue for ECS Fargate, as Fargate manages tasks by deployment ID rather than by ASG; all tasks
under old deployments will eventually be torn down.
Dependency outage
● ECS outage: if ECS has an outage, the workers won’t run normally and can’t process detector execution
requests. Requests will stay in the SQS queue. Once the ECS outage recovers, workers will come back online and process the requests
from the SQS queue.
● Step Functions outage: if Step Functions has an outage and, in the worst-case scenario, workflows can’t be triggered or start
any detector executions, JobSweeper will restart those jobs once the outage is recovered.
● S3 outage: if S3 has an outage and the ECS task is unable to save the generated result to the S3 bucket, it will first retry the S3 put
object operation (see the sketch below). If the retries time out, the detector execution request is considered failed; the message gets put back into the
SQS queue and waits to be retried.
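A minimal sketch of the retry-then-fail behavior for storing results to S3; the retry count and backoff are assumptions, and the AWS SDK also performs its own retries under the hood.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

// Sketch: retry the S3 put a few times; if all attempts fail, surface the error so the
// SQS message becomes visible again and the request is retried later.
public class RecommendationResultWriter {

    private static final int MAX_ATTEMPTS = 3;          // assumed retry budget
    private static final long BACKOFF_MILLIS = 2_000L;  // assumed backoff between attempts

    public static void storeResult(AmazonS3 s3, String bucket, String key, String recommendationsJson)
            throws InterruptedException {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                s3.putObject(bucket, key, recommendationsJson);
                return; // success
            } catch (Exception e) {
                if (attempt == MAX_ATTEMPTS) {
                    // Do not delete the SQS message; it will reappear after the visibility timeout.
                    throw new RuntimeException("Failed to store recommendations after retries", e);
                }
                Thread.sleep(BACKOFF_MILLIS * attempt); // simple linear backoff
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        storeResult(AmazonS3ClientBuilder.defaultClient(), "example-bucket", "job1/recommendations.json", "{}");
    }
}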
Operational considerations
● What resources and region specific components are we setting up and how can we automate it?
○ Are we creating new IAM roles/users?
■ yes, it will be done with a CloudFormation template.
○ Do we need to subscribe to SQS queues owned by other services?
■ no, we will own the SQS queues.
● Can the system be configured dynamically at runtime? What changes can be made? How?
○ Yes, the detector workflow can be configured dynamically using secure dynamic configurations.
○ Changes can be made to enable or disable a detector.
○ A new ops tool will be introduced to allow us to update configurations dynamically.
● How are you going to monitor your system's health? What alarms are you going to set up? What metrics do we need?
○ Alarm on SQS DLQ size > 0
○ Alarm on Fargate CPU utilization, memory utilization and heartbeat
○ Alarm on StepFunctions detector execution workflow latency
● What SOPs do we need to set up?
○ An SOP will be set up to onboard new detector frameworks to this detector hosting system.
● Is the system customer-facing?
○ No
Development plan
MILESTONES
(10/01) - Milestone M1: DragonGlass detector up and running for internal launch (ArtifactBuilder) - 8w
1. Set up separate accounts, pipeline, lpt package for hosting DG detectors and deploying changes - 2w
a. Accounts
b. Pipeline
c. VPC
d. Security group
e. ECS infra
2. Service integrated with DG framework - 2w
a. Build docker image - 1w
b. Run DragonGlass in Fargate container as single tenancy with on-demand requests
3. Create S3 bucket for DG hosting service to store recommendation results - 1w
a. is encryption needed for internal launch (?) - yes
4. Convert DragonGlass detections to CodeGuru recommendation format - 1w
a. 1w for DataPlane team - fetching source code for recommendation mapping and validation
b. 1w for DragonGlass team (confirm if DG has committed to it)
5. Integration and test with the new StepFunctions w/f that drk@ is setting up - 1w
6. End to end test for A/B experience - 1w
1. Lambda to fetch DG recommendations from DG storage and store to S3 bucket for A/B to read - 1w
2. Integration with IR service to fetch artifacts from IR - 2w
3. Run same DG job request in multiple child jobs - 2 w
a. Can we follow up with Martin to close this out?
4. Status management for detector running - 1w
5. SQS queue set up for detector request - 0.5 w
6. Docker image for shared logic, polling from SQS - 2w
7. Auto scaling strategy, setup - 2w
8. Error handling - 1w
9. Integration test - 2w
10. End to end test and bug fixes - 1w
11. DG detector recommendation doc writer review - 0.5w
12. Benchmark performance test - 1w
13. Load test - 1w (post 11/15)
14. Pen test support - 1w
15. Address security findings - (?)
16. (TBD) AppConfig for detector configurations - 2w
DEPLOYMENT STRATEGY
No special considerations, as there will be LPTs for all new pipelines. Simply extending the pipeline and a regular deployment will
be needed.
RLA CHANGES
Testing plan
● Each detector framework should have its own unit tests and integration tests in their own pipelines.
● Each detector framework will also have their own regression test to validate that there is no performance downgrade.
● Canary will run continuously to test end to end process in every region.
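For reference, the sketch below illustrates the Step Functions activity worker approach discussed as Solution 1 above: an activity worker long-polls GetActivityTask and reports results with SendTaskSuccess/SendTaskFailure.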
import com.amazonaws.ClientConfiguration;
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.stepfunctions.AWSStepFunctions;
import com.amazonaws.services.stepfunctions.AWSStepFunctionsClientBuilder;
import com.amazonaws.services.stepfunctions.model.GetActivityTaskRequest;
import com.amazonaws.services.stepfunctions.model.GetActivityTaskResult;
import com.amazonaws.services.stepfunctions.model.SendTaskFailureRequest;
import com.amazonaws.services.stepfunctions.model.SendTaskSuccessRequest;
import com.amazonaws.util.json.Jackson;
import com.fasterxml.jackson.databind.JsonNode;
import java.util.concurrent.TimeUnit;

public class DetectorActivityWorker {

    // The activity ARN and the detector analysis entry point are supplied by the detector service.
    private final String activityArn;
    private final DetectorActivities detectorActivities;
    private final AWSStepFunctions client;

    public DetectorActivityWorker(String activityArn, DetectorActivities detectorActivities) {
        this.activityArn = activityArn;
        this.detectorActivities = detectorActivities;
        // GetActivityTask long-polls for up to 60 seconds, so the socket timeout must exceed that.
        this.client = AWSStepFunctionsClientBuilder.standard()
                .withRegion(Regions.US_EAST_1) // region is illustrative
                .withCredentials(new DefaultAWSCredentialsProviderChain())
                .withClientConfiguration(new ClientConfiguration()
                        .withSocketTimeout((int) TimeUnit.SECONDS.toMillis(70)))
                .build();
    }

    public void poll() throws InterruptedException {
        while (true) {
            // Long-poll Step Functions for the next detector execution task.
            GetActivityTaskResult getActivityTaskResult = client.getActivityTask(
                    new GetActivityTaskRequest().withActivityArn(activityArn));
            if (getActivityTaskResult.getTaskToken() != null) {
                try {
                    JsonNode json = Jackson.jsonNodeOf(getActivityTaskResult.getInput());
                    String result = detectorActivities.runAnalysis(json.get("job").textValue());
                    client.sendTaskSuccess(new SendTaskSuccessRequest()
                            .withOutput(result)
                            .withTaskToken(getActivityTaskResult.getTaskToken()));
                } catch (Exception e) {
                    client.sendTaskFailure(new SendTaskFailureRequest()
                            .withTaskToken(getActivityTaskResult.getTaskToken()));
                }
            } else {
                // No task available; back off briefly before polling again.
                Thread.sleep(1000);
            }
        }
    }

    // Assumed interface for the detector's analysis entry point.
    public interface DetectorActivities {
        String runAnalysis(String job);
    }
}
For each detector framework that onboards to the CodeGuru inference service, a security review is required to make sure that
customer data is not persisted in non-volatile memory or storage.
Q: What are the security risks and mitigations with this multi-tenancy architecture setup?
Threat: Customer-A's job runs in worker instance-1, the job artifacts remain in memory, and customer-B (an attacker)
chains vulnerabilities to get to that memory.
Mitigation: No data can be exfiltrated out of the environment. All the worker instances run in private subnets of a private VPC
with no public internet access. There is no internet gateway attached to the security groups. Each instance will also have a strict
IAM role and policy attached, and the instance role can only access the inference service account for storing recommendations.
Threat: Customer-A (an attacker) is able to manipulate memory or compromise the OS in such a way that when subsequent
customer jobs run, they are able to tamper with the results of those jobs.
Mitigation:
1. The detector service will run in a JVM process, and we rely on Java memory management to manage memory
securely. This includes enforcing runtime constraints through the use of the JVM, a security manager that sandboxes
untrusted code from the rest of the operating system, and a suite of security APIs that Java developers can utilize.
2. The result of the job run is the recommendations for the code artifacts. In the detector service, we will validate the
recommendations before storing them to our service S3 buckets. Suppose customer-A (an attacker) tampered with the result of a
subsequent customer-B job: the service will check that the recommendation format matches what we are expecting. If it’s
gibberish, it will get caught and dropped, and this worker instance will be marked as unhealthy and shut down. If it
isn't caught, the results will be stored to our service S3 bucket; they are not directly published to customer-facing
services. In the detector service, we are not doing any direct manipulation of, or communication with, customer repositories or
source code providers, so an attacker wouldn't have direct access to any other customer's resources. If tampered results are
stored to the S3 bucket, the workflow will move on to the publish comment lambda, where we again run recommendation
validation to further validate the comment.
Detector hosting - Design Review Notes
Update
DrawIO link