Inference Detector Hosting Design - Complete Version
Overview

Amazon CodeGuru Reviewer is a public AWS service that uses program analysis and machine learning to perform automated code
reviews. The Inference System is one of the core components of the service: it infers on customers' pull requests or on-demand
jobs, extracts intermediate representation (IR) from the code artifacts provided by the customer, runs the Guru Rules Engine on the
IR, generates recommendations, and surfaces results to different providers. With our launch at re:Invent 2019 and GA in OP2
2020, we built the inference service to support the MVP use cases of pull request inferences and on-demand
inferences for GitHub, CodeCommit, Bitbucket and GitHub Enterprise repositories.

The CodeGuru Reviewer team has been constantly adding more types of detectors, covering more areas of best
practices, and integrating additional frameworks to improve recommendation quality and increase coverage. As of this
writing, the Inference System runs the CodeGuru Java detector framework, which comprises 13 major
categories, 76 distinct rules (63 in external production), and a total of 1695 micro rules. With the upcoming
feature of supporting the Python language, and the integration with DragonGlass (a code analysis engine that runs over compiler
output and produces recommendations), more detector frameworks are expected to come in. With growing
customer traffic and an increasing number of detector frameworks, scaling and development velocity become a concern. We
need to extend the service so that it can efficiently run these detectors and rules without impacting performance, and
cleanly manage such a large catalog of detectors.

This document describes the design for extending the existing Inference System to support the increasing number of detectors and
provides solutions for managing detectors. The idea is to split the inference detectors into smaller modules, where each module
runs in its own process and communicates via lightweight mechanisms. There will be a bare minimum of centralized
management of these detector services, and individual detector services can be written in different programming languages and
use different program analysis technologies.

Glossary
● Detector framework: A detector framework comprises a set of rules that run program analysis on customer code artifacts
and generate recommendations. Example detector frameworks are the DragonGlass detector framework, the MuGraph Java
detector framework, and the MuGraph Python detector framework.
● Rule: A rule is an analyzer that uses one tool to do program analysis and produce recommendations. The tool could be
Soot, GQL, cfn-lint or any other technology. Each rule focuses on one specific area of detection. Some example
rules are: S3 security best practices (DG), missing pagination rule (MuGraph Java), deadlock rule (MuGraph Java). A rule
can contain a list of micro rules that share the same analyzing logic.
● Micro rule: A micro rule is a rule that shares the same analyzing logic with its parent rule, but is more granular. An example
is the polling-to-waiters rule: the same logic applies to different services (S3, Kinesis, CloudWatch, CloudFront and
more), and these micro rules share the same analyzing code.
● IR: intermediate representation of customer code.

Goals and objectives


Timeline: P0 = re:Invent 2020, P1 = Q1-Q2 2021, P2 = Q3-Q4 2021

● [P0] As the recommendation engine, I can take in the IR from feature extraction and get recommendations from
DragonGlass detectors.
● [P0] As a CodeGuru team member, I can view a list of DragonGlass detectors and their corresponding versions running
in each stage and region.
● [P0] As a CodeGuru team member, I can dynamically enable/disable a DragonGlass detector in any stage without interfering
with other detectors.
● [P0] As a CodeGuru team member, I can deploy/rollback a DragonGlass detector independently without interfering with other
detectors.
● [P1] As a CodeGuru team member, I can view a list of all CodeGuru detectors and their corresponding versions running
in each stage and region.
● [P1] As a CodeGuru team member, I can enable/disable any CodeGuru detector in any stage without interfering with other
detectors.
● [P2] As a CodeGuru team member, I can deploy/rollback any CodeGuru detector independently.

Design tenets
● Security: It securely processes customer code artifacts to provide recommendations.
● VM level isolation: It can process each request with VM level isolation.
● Multi-Framework: It supports detectors from different analysis frameworks (MuGraph based, Soot framework), which are
added as plugins without interrupting or impacting other platform detectors.
● Scalable: It supports executing and managing an increasing number of detectors securely with low latency.
● Custom Detectors: It supports adding custom recommendations using customer defined detectors.

Out of scope

● This document doesn't cover support for multiple languages. Language extension will be covered in a separate
project: Multi-language support for CodeGuru Reviewer.
● This document doesn’t cover extracting the intermediate representation (IR) from source code. This part will be covered
in Code artifact system design.
● This document doesn't cover the Artifact Builder migration to an external service, but the design will be extensible to support
Artifact Builder integration. (Artifact Builder is a service that the CodeGuru Reviewer team uses to handle and process internal
code review requests using GoodCop partners.) [Work in Progress] Artifact Builder integration with CodeGuru-Reviewer
● This document doesn’t cover the training of detectors.

Related work

1. CodeGuru related work


a. High level inference platform CodeGuru Reviewer Next Generation Inference Platform
b. Intermediate representation handling [DO NOT REVIEW - V1] Code artifact system design
c. Multi-language support for CodeGuru Reviewer
d. Detector hosting - single-tenancy vs multi-tenancy
2. DragonGlass related docs
a. DG System Design
b. Dragonglass detectors
c. Automata documentation: https://w.amazon.com/bin/view/ARG/Dragonglass/AutomataDocs/
3. Related designs for AWS team model hosting
a. NLP Amazon Comprehend Synchronous Inference Model Hosting: https://w.amazon.com/bin/view/Aws/ml/NLP/ModelHosting/
b. Amazon Transcribe (Project wolverine) Model Hosting: https://w.amazon.com/bin/view/Project_wolverine/Design/PIIRedaction/ModelHosting
c. https://w.amazon.com/bin/view/A9_mxnet_inference_engine_using_MMS/
d. https://w.amazon.com/bin/view/RBS_Tech/Sherlock/PlatformProjects/ModelHosting/Onboarding/
e. Security Review Documentation for Application Autoscaling: Adaptor as a Service
f. https://w.amazon.com/bin/view/AmazonLex/NLU/
4. ECS Fargate Roadrunner project - image caching
a. PRFAQ https://amazon.awsapps.com/workdocs/index.html#/document/dd40c621aff584fa178f03612768bffe9147110fe32d87ca860a08503e9e3783
b. benchmarks: RoadRunner Benchmarks
5. Step functions tasks polling
a. FBACommons-StepFunctions
b. StepFunctionsActivityWorkerGuiceModules
c. https://docs.aws.amazon.com/cdk/api/latest/docs/@aws-cdk_aws-ecs-patterns.QueueProcessingFargateService.html#initializer
6. Auto scaling
a. https://w.amazon.com/index.php/Lambda/Proposals/SQS_Poller_Autoscaling
b. https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-using-sqs-queue.html
c. https://sage.amazon.com/posts/935021
d. https://list-archive.amazon.com/messages/10488211
e. https://issues.amazon.com/issues/ECS-11687
7. Docker image
a. https://w.amazon.com/bin/view/AmazonTranslate/Deployments/Release201903Round11Docker/
b. https://w.amazon.com/bin/view/AmazonTranslate/Design/HostingReArchitecture/
c. https://w.amazon.com/bin/view/Users/Asle/DockerOverview/
d. https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
8. Security
a. Secure Data Handling in Native AWS https://w.amazon.com/index.php/InfoSec/AWS/DataHandling
b. https://www.aristotle.a2z.com/recommendations/220
c. https://www.aristotle.a2z.com/recommendations/177
d. https://www.aristotle.a2z.com/implementations/23

Overview of solution
With the expansion of the service to support more languages, and more distinct detector frameworks onboarding to CodeGuru
Reviewer, the current inference architecture does not scale well. The current infrastructure lacks extensibility: the workflow
couples metadata-service database interaction, IR generation, detector execution and publishing comments to source providers,
which makes it hard for detectors to be tested independently and to validate outputs. As the inference service scales, it is
important to provide a platform where detector frameworks can integrate/onboard easily and roll out detectors smoothly without
impacting the existing functionality the service provides. The flexibility of deploying detector analysis improvements
independently is also missing from the current infrastructure.

In this doc, we propose a hybrid model of single-tenancy and multi-tenancy architecture to host detector services.
The choice between single-tenancy and multi-tenancy depends on the details of each detector framework and its security posture.
Each detector framework runs as a service with its own choice of underlying compute engine (for example: Lambda, EC2, ECS, Fargate) and
processes requests as they come in. When an inference request arrives, the workflow loads detector configurations, analyzes
the available code artifacts (provided by the IR service) and maps them to a list of detectors to execute. This list of detectors is executed
in parallel. Detector executions run in a private VPC with no public internet access, and traffic goes through
PrivateLink endpoints to talk to related AWS services. The security isolation level for each request stays at VM-level isolation
to provide a secure experience for customers' code analysis. Each detector framework also has its own docker image to allow
independent iterations of development and deployment. Changing detector hosting from on-demand tasks to a long-running
service is expected to yield a latency improvement of 60-120 seconds by saving the instance warm-up time (ENI provisioning, docker
image pull and installation, metadata loading). Detector configurations will switch from static configuration to dynamic
configuration using AWS AppConfig, allowing a rule to be enabled or disabled in different stages more easily and avoiding operational
churn. The configuration update time will shrink from multiple days of work to seconds. This secure dynamic configuration
also offers syntax validation and version history management to provide a seamless, secure configuration update experience.
Each detector service will auto scale its number of workers based on the SQS request queue size and message age.

See the diagram below for an overview. DrawIO diagram link.


Re:invent plan
For re:Invent this year, we are adopting this new infrastructure design for the DragonGlass detector service. A new Step Functions
workflow will be created to handle DragonGlass detector requests. The DragonGlass detector will run as a service and process requests
in single-tenancy mode. Detector execution requests will be placed in an SQS queue specific to the DragonGlass detector, and each
message will contain a unique task token for the Step Functions workflow to track task status. ECS Fargate will run as a
service to poll messages from the SQS queue and execute DragonGlass rules to produce recommendations and store them in an S3 bucket.
This Fargate service will auto scale the number of tasks based on SQS message count and message age. While a request is being
processed in ECS Fargate, the container will send heartbeats to the Step Functions workflow to show that the task is not stuck. Once
a request is done processing, a success or failure result will be sent back to the Step Functions workflow.

Q1-Q3 2021
In 2021, we will migrate the existing MuGraph Java detector and Python detector to the above infrastructure. Those detectors
will also run as services. The plan is to have them run in multi-tenancy mode with the following security mitigations:
1) avoid any unnecessary disk usage and keep everything in memory so that no leftover data remains; 2) add
encryption/decryption for any data that absolutely needs to go through disk; 3) utilize RAMFS as the disk caching mechanism.

Architectural and component-level design


ARCHITECTURE DIAGRAM

Existing inference workflow diagram
Proposed new inference workflow diagram

Q: Why split into different workflows?

The current inference ECS Fargate container is responsible for cloning the repository, fetching code artifacts, executing detectors
and generating recommendations. With the scaling of the service and more detectors introduced by both the CodeGuru team and
internal Amazon teams, the single container is no longer scalable. Detectors from different sources can depend on different code
artifacts, and can also run with different environment requirements. Splitting into separate workflows with more modularized
functions brings the following benefits.
1. Scaling: it allows the service team to expand each workflow individually without coupling everything into one monolithic workflow.
2. Deployment: it allows the service team to roll back or roll forward a function without impacting other parts.
3. Fault tolerance: it makes the system more available and limits the blast radius of errors.
4. Testability: it allows tests to be more granular and focused without mimicking the whole workflow experience.

This doc proposes separate workflows: a FetchCodeArtifacts workflow and a DetectorExecution workflow.
The FetchCodeArtifacts workflow design will be covered in the code artifact system design. The DetectorExecution workflow is responsible
for taking the input code artifacts, running all related detectors, and generating recommendations.

DETECTOR EXECUTION WORKFLOW

Detectors will run in parallel in the detector execution workflow on different execution branches using the
Step Functions Parallel state. Each branch fetches the related configuration and code artifacts to build a manifest, runs the detector,
and fetches recommendations from that detector's choice of storage.

1. PrepareDetectorManifest: this step runs on a Lambda function (a minimal handler sketch follows this list).

a. It first checks the detector configuration to determine whether the given detector is enabled. If disabled, this branch
ends.
b. It takes the input job and calls the IR service to fetch the list of required code artifacts and their corresponding storage
paths for the given detector type. (Code artifact detailed types are not final yet; we will work with the IR service design to
provide more details.)

// Sample PrepareDetectorManifest input


{
    "inferenceJobIdentifier": {
        "accountId": "01234567890",
        "jobName": "job1",
        "jobType": "OnDemand"
    }
}

// Sample PrepareDetectorManifest output for DragonGlass detector

{
    "inferenceJobIdentifier": {
        "accountId": "01234567890",
        "jobName": "job1",
        "jobType": "OnDemand"
    },
    "codeArtifacts": [
        {
            "type": "binaryJar",
            "s3Bucket": "code-artifact/job1",
            "s3Key": "job1-binary-jar.tar.gz"
        }
    ]
}

// Sample PrepareDetectorManifest output for Java detector
{
    "inferenceJobIdentifier": {
        "accountId": "01234567890",
        "jobName": "job1",
        "jobType": "OnDemand"
    },
    "codeArtifacts": [
        {
            "type": "sourceCodeJava",
            "s3Bucket": "code-artifact/job1",
            "s3Key": "job1-source-code-java.tar.gz"
        },
        {
            "type": "muGraphJava",
            "s3Bucket": "code-artifact/job1",
            "s3Key": "job1-mugraph-java.tar.gz"
        }
    ]
}

2. Run detectors: this step takes the output of the PrepareDetectorManifest step and invokes detector execution. See details
in Detector hosting infrastructure.
3. FetchRecommendations: this step fetches the recommendations generated by the corresponding detector from its choice of
storage, validates the output format, aggregates the results and stores them in the inference storage database for later
operations to surface recommendations to the customer.
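
To make the PrepareDetectorManifest step concrete, here is a minimal sketch of the Lambda handler for the DragonGlass branch. It is a sketch under assumptions, not the final interface: the two private helpers stand in for the AppConfig lookup and the IR service call, and the class, method and field names are illustrative.

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the PrepareDetectorManifest Lambda for the DragonGlass branch.
public class PrepareDetectorManifestHandler
        implements RequestHandler<Map<String, Object>, Map<String, Object>> {

    private static final String DETECTOR_FRAMEWORK = "DragonGlass";

    @Override
    public Map<String, Object> handleRequest(Map<String, Object> input, Context context) {
        @SuppressWarnings("unchecked")
        Map<String, Object> jobIdentifier = (Map<String, Object>) input.get("inferenceJobIdentifier");

        // (a) Check dynamic configuration; if the detector is disabled, this branch ends here.
        if (!isDetectorEnabled(DETECTOR_FRAMEWORK)) {
            return Collections.singletonMap("detectorEnabled", false);
        }

        // (b) Ask the IR service which code artifacts this detector needs and where they are stored.
        List<Map<String, String>> codeArtifacts = fetchCodeArtifacts(jobIdentifier, DETECTOR_FRAMEWORK);

        Map<String, Object> manifest = new HashMap<>();
        manifest.put("inferenceJobIdentifier", jobIdentifier);
        manifest.put("detectorFramework", DETECTOR_FRAMEWORK);
        manifest.put("codeArtifacts", codeArtifacts);
        return manifest;
    }

    // Placeholder: in the real handler this reads the detector configuration from AWS AppConfig.
    private boolean isDetectorEnabled(String detectorFramework) {
        return true;
    }

    // Placeholder: in the real handler this calls the IR service to resolve artifact types and S3 paths.
    private List<Map<String, String>> fetchCodeArtifacts(Map<String, Object> jobIdentifier,
                                                         String detectorFramework) {
        Map<String, String> artifact = new HashMap<>();
        artifact.put("type", "binaryJar");
        artifact.put("s3Bucket", "code-artifact/job1");
        artifact.put("s3Key", "job1-binary-jar.tar.gz");
        return Collections.singletonList(artifact);
    }
}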

WSD link
DETECTOR HOSTING INFRASTRUCTURE

Each detector will run as a service using its choice of compute engine. For the DragonGlass detector, we will run it
on AWS ECS Fargate. Each Fargate task will be long running and will be preloaded with the detector's docker image, required
dependencies and metadata. When an inference request comes in, a task instance picks up the request job, does program
analysis, produces recommendations and saves the result to a dedicated S3 bucket. Once the request is done, the task instance is
cleaned up and terminated.

Q: What underlying compute engine will detector hosting service use?


Short answer: AWS ECS Fargate for the DragonGlass detector.
Long answer: We considered the following compute engines: ECS with Fargate, ECS with EC2, SageMaker endpoint,
SageMaker batch transform, and Lambda.

ECS with Fargate
● Pros: 1. Serverless hosting; no need to maintain hosts/hardware. 2. The existing inference infrastructure is built on Fargate, so the transition is easy.
● Cons: 1. Higher cost (for 4 vCPU / 16GB memory, Fargate monthly cost is ~$167 per instance versus ~$138 per instance for EC2, around 20% higher). 2. Limited instance type choices (memory: 30GB, CPU: 4 vCPU, disk: 20GB). 3. Fewer customization options.

ECS with EC2
● Pros: 1. Complete flexibility in terms of hardware options. 2. Lower cost compared with Fargate. 3. More customization options regarding instance type and storage.
● Cons: 1. Need to maintain our own hardware. 2. High operational effort.

SageMaker endpoint
● Pros: 1. Managed endpoint hosting support for ML-trained models. 2. Managed solution for hosting models trained through SageMaker; no worries about updating/patching AMI images or updating kernels.
● Cons: 1. SageMaker endpoint has a hard limit of 60 seconds for the endpoint timeout, and a request can run longer than that limit (critical for our use case). 2. SageMaker only serves as a platform; it doesn't provide the per-request VM isolation level we are looking for, so we would need to define our own load balancer to distribute requests.

SageMaker batch transform
● Pros: Allows a processing time longer than 60 seconds.
● Cons: High latency. SageMaker batch transform doesn't maintain a fleet of warm instances; for each request, it starts an EC2 instance from scratch (critical for our use case).

Lambda
● Pros: 1. Lightweight serverless hosting. 2. On-demand launching has low latency.
● Cons: 1. Memory limit of 3008MB. 2. Timeout of 15 minutes.

Q: What level of isolation should the service provide in processing customer requests?
Inference Detector Hosting will provide VM-level isolation. One customer request is processed on one VM at a time.
VM-level isolation provides complete isolation from the host operating system and other VMs. Container-level isolation is not
sufficient, as an attacker could break out of the container and gain access to the instance or OS that is hosting the container.

Feature comparison: virtual machine vs. container

● Isolation: A VM provides complete isolation from the host operating system and other VMs, which is useful when a strong
security boundary is critical, such as hosting apps from competing companies on the same server or cluster. A container
typically provides lightweight isolation from the host and other containers, but doesn't provide as strong a security boundary
as a VM.
● Operating system: A VM runs a complete operating system including the kernel, thus requiring more system resources (CPU,
memory, and storage). A container runs the user-mode portion of an operating system, and can be tailored to contain just the
services needed for your app, using fewer system resources.
● Guest compatibility: A VM runs just about any operating system inside the virtual machine. A container runs on the same
operating system version as the host.

Q: Do detectors process requests in multi-tenancy mode or single-tenancy mode?

Short answer: For the DragonGlass detector re:Invent launch, it will run in single-tenancy mode. This means that each
ECS Fargate task processes only one customer request at a time. The task is terminated after the request is done, and every
request is picked up on a fresh new task. Requests don't share any common VM, container, disk or memory resources.

We opted for this option for DragonGlass detectors because these detectors currently run directly on the customer's build artifacts.
This increases the security risks, especially given that we are integrating with DragonGlass's framework in our
production service for the first time. DragonGlass uses the Soot framework, which is a third-party framework and runs on build artifacts
on disk. We looked at various options to make it multi-tenant - encrypting build artifacts, running them out of memory, separating the
step of generating the IR (Jimple) into the CodeArtifactService - however each of these options presented a risk to the DragonGlass
integration launch. To keep it simple, we opted for single tenancy, which addresses all the security concerns.

Long answer: See more details in Detector hosting - single-tenancy vs multi-tenancy.

Single-tenancy task vs. multi-tenancy service

● Security
○ Single-tenancy: One short-lived task container processes one customer job, which prevents any risk of one request tampering
with another customer's request.
○ Multi-tenancy: With the detector running as a service, multiple requests are processed in one container over time. Code remains
or a failure to clean up a previous job could pose the risk of tampering with following requests. This can be mitigated by using
the JVM for memory management, using encryption and decryption for any data written to disk, and potentially leveraging RAMFS
to use memory as the file system.
● Availability
○ Single-tenancy: Containers are launched on demand in flight, so there is a higher chance of encountering issues when
ECS/ECR/S3 availability drops.
○ Multi-tenancy: Instances are always running and ready to serve requests. Compared to launching new containers at run time,
this approach provides higher availability and has less chance of errors when booting up instances.
● Cost
○ Single-tenancy: Containers are launched on demand. With Fargate, you only pay for the time when compute resources are in use.
○ Multi-tenancy: Since the service is always running, we pay for the entire container running time, including idle time. Based on a
rough estimation, it could be 2 times the single-tenancy cost. The actual cost can vary based on the auto-scaling strategy. See
cost estimation here.
● Latency
○ Single-tenancy: Since containers are launched in flight, there is additional latency waiting for the container to be provisioned.
A Fargate task container goes through ENI provisioning, docker image pulling and installation. There is also warm-up time for the
service to get started. This boot time adds to the overall processing time; the current CodeGuru inference container takes 158
seconds on average to start the analysis.
○ Multi-tenancy: Running detectors as services saves the boot time needed for task containers to be ready to process requests.
● Scalability
○ Single-tenancy: The ECS Fargate scheduler only supports a task launch rate of about 2 TPS, and the concurrent task limit caps
the total number of requests the Inference System can process simultaneously. Because of the boot time, each task runs longer
than in the multi-tenancy architecture, so it could be less scalable.
○ Multi-tenancy: The concurrent task limit varies by region. Detector services will be set up to run in different AWS accounts
under different ECS clusters to allow a bigger limit.
● Maintainability
○ Single-tenancy: No need to maintain a fleet. Each task is on demand and short running.
○ Multi-tenancy: We will need to define our own auto-scaling strategy and job scheduling/placement strategy to place requests in
distinct containers.
● Deployment
○ Single-tenancy: New changes are deployed by uploading a new docker image to ECR; all subsequent tasks launch using the new
docker image. Running tasks keep running until they are done, so there are no interruptions.
○ Multi-tenancy: During deployment, there could be potentially increased latency in processing jobs.

Q: How do we address the cons of single-tenancy mode?
Latency: On-demand ECS Fargate tasks are launched in flight, and there is additional latency (from ENI provisioning,
docker image pulling/installation) before the container is ready to process a request. The current CodeGuru MuGraph Java detector's
average boot time is 158 seconds. Instead of running as an on-demand task, the DragonGlass detector will run as a long-running
service in ECS Fargate and process requests as they come in. By switching to service mode, the task will already be running and
ready to serve traffic when a request arrives. To achieve single-tenancy, each instance will process only one request at a time and
still provide VM-level isolation.
Scalability and Availability: Each detector will run in a separate AWS service account. As the detector runs as a
service, a corresponding auto scaling strategy will be attached to ensure the fleet is sized to handle the volume of
requests. There will be ready instances to process requests, and hence a higher tolerance for availability drops in
ECS/ECR/S3.

Q: How does the service achieve one request per instance at a time to provide VM-level isolation?
Solution 1: application load balancer
One approach is to define our own application load balancer to distribute incoming traffic to available task instances.
An application load balancer takes requests from clients and distributes them across targets in a target group. Load balancing is
synchronous communication between clients and backend servers. If a request is not distributed to a target in time, the
request will time out and be aborted. This won't work well for our use case.

(Recommended) Solution 2: worker-poller

For each inference job, the Step Functions detector execution workflow submits a request for each detector execution.
Workers run in ECS Fargate instances. Each instance has one worker running, which keeps polling from the
request queue. The worker polls one request at a time. For each request, the detector executes, produces recommendations and
stores them in an S3 bucket. Once the request is done, the instance is terminated and does not process any further requests. When an
instance dies, ECS Fargate spins up a new instance to replace it. If a message fails to finish processing, the worker
notifies Step Functions that the request has failed.

Q: Is this a two-way door? Can we easily switch from single-tenancy to sequential-tenancy or multi-tenancy?
Yes and yes. This is a two-way door, and we can easily switch to sequential-tenancy or multi-tenancy if requirements
change. With this worker-poller setup, at any point that we want to switch to a different tenancy mode, all we need to do is
update the polling strategy (see the sketch after this list).

● For single-tenancy, the worker polls one request at a time; once the request is done, the instance shuts down to guarantee
no instance is reused for different requests.
● For sequential-tenancy, the worker polls one request at a time; once the request is done, the instance makes sure that all
leftover resources are deleted/cleaned and its state is reset, and it is then ready to process the next request.
● For multi-tenancy, the worker polls multiple requests at a time, and those requests run in parallel on the same instance.
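
To illustrate how small the tenancy switch is, the sketch below hides the three modes behind one polling strategy. It is illustrative only; TenancyMode, the class name and the placeholder methods are assumptions, not the final implementation.

import java.util.Collections;
import java.util.List;

// Illustrative sketch: the tenancy mode changes nothing but the polling strategy of the worker.
public class DetectorWorker {

    enum TenancyMode { SINGLE, SEQUENTIAL, MULTI }

    private final TenancyMode mode;

    DetectorWorker(TenancyMode mode) {
        this.mode = mode;
    }

    void run() {
        do {
            // SINGLE and SEQUENTIAL poll one request at a time; MULTI polls a batch.
            int batchSize = (mode == TenancyMode.MULTI) ? 10 : 1;
            List<String> messages = pollMessages(batchSize);
            messages.parallelStream().forEach(this::process);

            if (mode == TenancyMode.SEQUENTIAL) {
                cleanUpLeftoverArtifacts(); // reset instance state before the next request
            }
        } while (mode != TenancyMode.SINGLE); // SINGLE: one request, then the instance shuts down
    }

    // Placeholders for SQS polling, detector execution and cleanup.
    private List<String> pollMessages(int batchSize) { return Collections.emptyList(); }
    private void process(String message) { }
    private void cleanUpLeftoverArtifacts() { }
}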

Q: How to achieve the worker-poller setup?

There are two proposals for the worker-poller setup.

Solution 1: Step Functions activities worker

Activities are a Step Functions feature that enables you to have a task in a state machine where the work is performed by a
worker that can be hosted on any compute platform (ECS, Lambda, EC2 and more). Step Functions provides APIs for creating
and listing activities, requesting a task, and managing the flow of your state machine based on the results of your worker.
For our use case, when an inference job request comes in, the detector execution workflow calls the CreateActivity API for each
detector execution request. The detector service on ECS Fargate hosts an activity worker for each task container. The
worker calls the Step Functions GetActivityTask API for new tasks and is configured to process one request at a time.
When the worker finishes processing the request, it reports its success or failure back to the Step Functions
workflow using SendTaskSuccess or SendTaskFailure.
Reference: https://docs.aws.amazon.com/step-functions/latest/dg/concepts-activities.html
See POC here: Proof of concept - step function activities worker: Detector hosting - Appendices and Supplementary
Information

DrawIO link here

(Recommended) Solution 2: Step Functions callback pattern with SQS

Reference: https://docs.aws.amazon.com/step-functions/latest/dg/callback-task-sample-sqs.html
For an inference job request, for each detector execution, Step Functions publishes a message that includes a task token to the
detector's corresponding queue. Step Functions then pauses, waiting for that token to be returned. The ECS Fargate detector worker polls
messages from each detector's SQS queue. After processing each message, the ECS worker calls SendTaskSuccess or
SendTaskFailure with that same task token. When the task token is received, the Step Functions workflow continues.

DrawIO link
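
A minimal sketch of the recommended callback worker is below, using the AWS SDK for Java v1 SQS and Step Functions clients. The queue URL, the taskToken field name in the message body and the runDetector call are assumptions for illustration; the real field layout is whatever the state machine places in the message. It polls one message at a time, sends a heartbeat, and returns the task token via SendTaskSuccess or SendTaskFailure.

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
import com.amazonaws.services.stepfunctions.AWSStepFunctions;
import com.amazonaws.services.stepfunctions.AWSStepFunctionsClientBuilder;
import com.amazonaws.services.stepfunctions.model.SendTaskFailureRequest;
import com.amazonaws.services.stepfunctions.model.SendTaskHeartbeatRequest;
import com.amazonaws.services.stepfunctions.model.SendTaskSuccessRequest;
import com.amazonaws.util.json.Jackson;
import com.fasterxml.jackson.databind.JsonNode;

// Sketch of the callback-pattern worker. Queue URL and runDetector are placeholders.
public class CallbackDetectorWorker {

    private static final String QUEUE_URL =
            "https://sqs.us-west-2.amazonaws.com/123456789012/dg-detector-queue"; // placeholder

    private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
    private final AWSStepFunctions sfn = AWSStepFunctionsClientBuilder.defaultClient();

    public void pollOnce() {
        // Single-tenancy: poll at most one message, long-poll for up to 20 seconds.
        ReceiveMessageRequest receive = new ReceiveMessageRequest()
                .withQueueUrl(QUEUE_URL)
                .withMaxNumberOfMessages(1)
                .withWaitTimeSeconds(20);

        for (Message message : sqs.receiveMessage(receive).getMessages()) {
            JsonNode body = Jackson.jsonNodeOf(message.getBody());
            String taskToken = body.get("taskToken").asText(); // assumes the workflow places the token here
            try {
                // A heartbeat tells Step Functions the task is still alive (in the real worker
                // this runs on a timer while the analysis is in progress).
                sfn.sendTaskHeartbeat(new SendTaskHeartbeatRequest().withTaskToken(taskToken));

                String resultKey = runDetector(body); // placeholder for the detector analysis container call

                sfn.sendTaskSuccess(new SendTaskSuccessRequest()
                        .withTaskToken(taskToken)
                        .withOutput("{\"recommendationsS3Key\": \"" + resultKey + "\"}"));
                sqs.deleteMessage(QUEUE_URL, message.getReceiptHandle());
            } catch (Exception e) {
                // Leave the message on the queue (it becomes visible again after the visibility
                // timeout) and tell the workflow this execution failed.
                sfn.sendTaskFailure(new SendTaskFailureRequest()
                        .withTaskToken(taskToken)
                        .withError("DetectorExecutionFailure")
                        .withCause(e.getMessage()));
            }
        }
    }

    private String runDetector(JsonNode request) {
        return "recommendations/job1/dragonglass.json"; // placeholder
    }
}
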
Q: Why choose the callback pattern with SQS over an activities worker?

● Scalability
○ Solution 1: a Step Functions activity scales up to 200 TPS. Per the Step Functions team, this is a scalability limit, and it starts
to get slow beyond it. Activity matching is only-once: when a worker calls GetActivityTask, Step Functions matches the
activity to that worker call. If the ECS worker dies, we won't be able to get this token again. See
more details in the Slack discussion thread with the Step Functions team.
○ Solution 2: for SQS, standard queues support a nearly unlimited number of API calls per second, per API
action. FIFO queues support up to 3,000 transactions per second, per API method with batching.

Q: What happens if the SQS queue contains a duplicate message?

The detector worker polls from the SQS queue. For a duplicate message, the detector can check its corresponding storage and
validate whether recommendations have already been generated. If they have, the message is discarded.
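
A sketch of that idempotency check is below, assuming the detector stores recommendations in S3; the bucket and key layout are placeholders, not the final storage scheme.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

// Sketch of the duplicate-message check against the detector's own storage.
public class DuplicateRequestCheck {

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // Returns true when recommendations for this job were already written, in which case
    // the duplicate SQS message can simply be discarded.
    public boolean alreadyProcessed(String recommendationBucket, String accountId, String jobName) {
        String key = String.format("recommendations/%s/%s/dragonglass.json", accountId, jobName); // placeholder layout
        return s3.doesObjectExist(recommendationBucket, key);
    }
}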

Q: What does an SQS message contain?

Option 1: one detector framework request per inference request

This approach submits one SQS message for each inference request per detector framework. The request gets polled by a worker
ECS Fargate container, and all rules under the detector framework are executed in the same container.

// Sample SQS message for DG detector analysis


{
    "inferenceJobIdentifier": {
        "accountId": "01234567890",
        "jobName": "job1",
        "jobType": "OnDemand"
    },
    "codeArtifacts": [
        {
            "type": "binaryJar",
            "s3Bucket": "code-artifact/job1",
            "s3Key": "job1-binary-jar.tar.gz"
        }
    ],
    "detectorFramework": "DragonGlass"
}

[DrawIO link]

(Recommended) Option 2: multiple rule execution requests per detector framework request per inference request
This approach submits multiple SQS messages for each inference request per detector framework. These requests get polled by
ECS Fargate containers, and each message runs a subset of rules under the detector framework. For one single job, there can be
multiple containers executing different rules to generate recommendations (a fan-out sketch follows the message samples below).

// Sample SQS message for DG detector analysis

// message 1
{
    "inferenceJobIdentifier": {
        "accountId": "01234567890",
        "jobName": "job1",
        "jobType": "OnDemand"
    },
    "codeArtifacts": [
        {
            "type": "binaryJar",
            "s3Bucket": "code-artifact/job1",
            "s3Key": "job1-binary-jar.tar.gz"
        }
    ],
    "detectorFramework": "DragonGlass",
    "ruleName": "KmsBestPractices"
}
// message 2
{
    "inferenceJobIdentifier": {
        "accountId": "01234567890",
        "jobName": "job1",
        "jobType": "OnDemand"
    },
    "codeArtifacts": [
        {
            "type": "binaryJar",
            "s3Bucket": "code-artifact/job1",
            "s3Key": "job1-binary-jar.tar.gz"
        }
    ],
    "detectorFramework": "DragonGlass",
    "ruleName": "TlsBestPractices"
}

[DrawIO link]
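
The sketch below illustrates the Option 2 fan-out: one message body per enabled rule. It only builds the payloads; in the actual workflow Step Functions attaches the task token and publishes each message via the callback pattern. Class and method names are illustrative.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import com.amazonaws.util.json.Jackson;

// Illustrative sketch of the per-rule fan-out under Option 2.
public class RuleRequestFanOut {

    public List<String> buildRuleMessages(Map<String, Object> manifest, List<String> enabledRules) {
        List<String> messageBodies = new ArrayList<>();
        for (String ruleName : enabledRules) {
            // Copy the shared manifest (job identifier, code artifacts, framework) per rule.
            Map<String, Object> message = new HashMap<>(manifest);
            message.put("ruleName", ruleName); // e.g. "KmsBestPractices", "TlsBestPractices"
            messageBodies.add(Jackson.toJsonString(message));
        }
        return messageBodies;
    }
}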

Q: Where will the recommendations generated by detector services be stored?

Recommendations will be stored in each detector service's own choice of storage. The detector service is
responsible for granting the CodeGuru inference system access to retrieve the generated recommendations from that
storage. The recommendation format should follow the CodeGuru inference system's format. See the example format below.

{
    "recommendations": [
        {
            "repoName": "repo1",
            "filePath": "src/foo/bar/def.java",
            "startLine": 1,
            "endLine": 2,
            "comment": "comment1",
            "detectorId": "aws-best-practice",
            "confidenceScore": 1 // optional field
        },
        {
            "repoName": "repo1",
            "filePath": "src/foo/bar/abc.java",
            "startLine": 10,
            "endLine": 11,
            "comment": "comment2",
            "detectorId": "concurrency",
            "confidenceScore": 0.8 // optional field
        }
    ]
}

For CodeGuru-owned detector services, recommendations will be encrypted and stored in an S3 bucket. The detector service running in
ECS Fargate talks to S3 through PrivateLink endpoints to store results.
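
The format validation mentioned in the FetchRecommendations step could look like the sketch below; the field names follow the example format above, while the class name and the endLine/startLine sanity check are assumptions, not the final validator.

import com.amazonaws.util.json.Jackson;
import com.fasterxml.jackson.databind.JsonNode;

// Sketch of the recommendation format validation applied before results are aggregated and stored.
public class RecommendationValidator {

    private static final String[] REQUIRED_FIELDS =
            {"repoName", "filePath", "startLine", "endLine", "comment", "detectorId"};

    public boolean isValid(String recommendationsJson) {
        JsonNode root = Jackson.jsonNodeOf(recommendationsJson);
        JsonNode recommendations = root.get("recommendations");
        if (recommendations == null || !recommendations.isArray()) {
            return false; // malformed output is dropped and the worker is flagged as unhealthy
        }
        for (JsonNode recommendation : recommendations) {
            for (String field : REQUIRED_FIELDS) {
                if (!recommendation.hasNonNull(field)) {
                    return false;
                }
            }
            // Basic sanity check on the line range.
            if (recommendation.get("startLine").asInt() > recommendation.get("endLine").asInt()) {
                return false;
            }
        }
        return true;
    }
}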

Q: How will this setup impact latency? [work in progress]

● What is the typical latency for Amazon SQS?


○ Typical latencies for SendMessage, ReceiveMessage, and DeleteMessage API requests are in the tens or low
hundreds of milliseconds.
● What is the latency of invoking an AWS Lambda function in response to an event?
○ AWS Lambda is designed to process events within milliseconds. Latency will be higher immediately after a Lambda
function is created, updated, or if it has not been used recently.

NETWORK INFRASTRUCTURE

Detectors will run in private subnets with no public internet access. All network traffic will go through PrivateLink endpoints
to connect the service VPC to AWS services. This setup keeps detector hosting servers secure, as they are not publicly
accessible. Fargate instances will also have limited IAM roles and permissions attached, allowing access only to the buckets
they need.

DrawIO diagram link


DETECTOR DEPLOYMENT

Each detector framework will have its own docker image. Docker images will be built in independent pipelines and stored in ECR.
This allows detectors to be deployed independently from each other without interruptions.

Q: What is a docker image?


A Docker image is an inert, immutable file that is built from a base image.

Q: How will docker image be built?


Solution 1: build one docker image
DrawIO link
(Recommended) Solution 2: use two docker images with the sidecar pattern
Build the shared logic into a shared image, and give each detector framework a separate image for detector-specific logic.
Each Fargate task will have two containers running. One container is the application container, which runs the worker and
keeps polling new requests from the SQS queue. The other container is the detector analysis container, which infers on the
request. The two containers run on the same host and communicate using local endpoints.
DrawIO link
Solution 2 is recommended, as it allows us to deploy common logic without updating detector-specific images. This comes
in handy when there is a hot fix to underlying shared logic for pulling code artifacts or storing results. Using two docker
images allows the deployment to go faster without updating all detector images. It also provides more modular containers, where
a modular container can be plugged into new services with minimal changes.

Q: How will detector services be deployed?

Deploy the ECS Fargate service using a rolling update. The service scheduler will replace currently running containers with the latest
version. When a service deployment is made, Fargate fetches the image specified in the task definition from ECR and creates
docker containers automatically. Depending on the configuration, Fargate will attempt to keep a certain percentage of old tasks
around until the new tasks can serve traffic. This percentage can be configured using the minimum healthy percent option. The
maximum percent option can be used to limit the number of running tasks (old + new) during a service deployment.
Once the desired number of tasks, as specified in the configuration, is achieved, the old tasks are automatically terminated.
Reference:

1. https://docs.aws.amazon.com/AmazonECS/latest/developerguide/update-service.html
2. https://docs.aws.amazon.com/AmazonECS/latest/developerguide/deployment-type-ecs.html

Q: How will detector services be split?


It depends on the use case. Here are the tenets that drive detector framework separation. Please follow these tenets
and use your best judgment.

● Underlying framework - Detector micro services should be separated based on their underlying core analysis frameworks. If
a detector relies on third-party libraries or tools to perform analysis, from a security perspective it's recommended to keep
different tools separate from each other to avoid them interfering with each other.
● Complexity - For detector services that are growing and have an increasing number of rules, detector owners should
consider resource utilization and memory consumption. For detector services running in containers, as more rules or
analyzers get added, the docker image size will increase. We should keep in mind to split the service when the docker image
grows too big.
● Ownership - For detectors owned by different teams, it's recommended to keep each team's detector service as
an independent service so that operations are decoupled from each other.
● Latency - Each detector service should keep latency under 5 minutes for p99 jobs. This latency covers the time
from when an execution request comes in to when the detector stores recommendations in the storage system.
● Cost - For lightweight detector services, a cost estimate is required when considering whether to combine with an existing
detector service or split into another detector service.
● IRs - Detectors can be language agnostic and infer on multiple different IRs. Each micro detector service should define
the required IRs to run its analysis; it can have one or multiple IRs.
○ Examples:
■ DG: bytecode (java) + source code
■ MUGraph Java: source code (java) + mugraph java
■ MUGraph Python: source code (python) + mugraph python
■ CloudFormation: cloudformation template (yaml/json)

An example:
Suppose we are introducing a new DragonGlass detector that analyzes Python bytecode. We should first consider
what underlying tool or framework the new detector is using. If it's using Soot, which is the same as the DG detector's Java bytecode
analysis, we should consider incorporating it into the existing docker image and utilizing the same detector framework. When integrating
into an existing detector framework, a benchmark is required to understand what the new detector needs in terms of compute resources
(disk usage, memory usage). If it's getting too heavy to run in one container, we should also consider splitting into different services.
If the detector uses a completely different analysis tool, we should consider starting a new detector service. When there are
doubts or questions, always reach out to the CodeGuru Reviewer data plane team to present the use case and work together to
make the best choice.

DETECTOR MANAGEMENT

In the existing inference Java detectors, we are currently using brazil config to enable or disable a rule, load rule configuration,
set thread count, and set memory allocation. This static way of managing configuration is not scalable or operationally
friendly. There are use cases where a config change needs to take effect within seconds (example: after a detector deployment,
one single rule is failing while other rules have no issues, and we need to disable that rule to allow the other rules to run while
debugging), or a config needs to be tweaked several times (example: confidence threshold adjustment). Also, in our current setup,
each configuration update leads to a docker image rebuild. Dynamic configuration will help mitigate these problems.

Reference: PoA Talk: Dynamic Configuration 


We propose using AWS AppConfig to manage detector configurations. AWS AppConfig is the externalized version of Safe
Dynamic Config (SDC). It allows the service to provide validation logic to ensure the configuration data is syntactically and
semantically correct according to our definitions before making it available to the application. It also gives the ability to deploy
configuration changes over a defined time period while monitoring the application, so that we can catch errors and roll back the
changes if needed to help minimize the impact to users.

Example configuration file

{
"Detectors": [
{
"name": "DragonGlassDetector",
"status": "active",
"rules":[
{
"name": "KMS best practices",
"status": "active",
"config": "s3://dgdetector/config/kms-best-practice/config-v1.json"
},
{
"name": "Javax.Crypto Best Practices",
"status": "active",
"config": "s3://dgdetector/config/javax-crypto-best-practice/config-v2.json"
},
{
"name": "TLS Best Practices",
"status": "active"
},
{
"name": "AMI Best Practices",
"status": "inactive"
}
]
},
{
"name": "MuGraphJavaDetector",
"status": "active",
"rules":[
{
"name": "code-clone",
"status": "active",
"config": "s3://java-detector/config/codeclone/config.json"
},
{
"name": "input-validation",
"status": "active",
"config": "s3://java-detector/config/javax-crypto-best-practice/config-v2.jso
}
]
},
{
"name": "MuGraphPythonDetector",
"status": "inactive",
"rules":[]
}
]
}
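
As one way the worker could consume the configuration above, here is a hedged sketch using the AWS SDK for Java v1 AppConfig GetConfiguration API. The application, environment, profile and client names are placeholders, and the caching behavior is an assumption of how we might use the configuration version to avoid re-downloading unchanged config.

import java.nio.charset.StandardCharsets;

import com.amazonaws.services.appconfig.AmazonAppConfig;
import com.amazonaws.services.appconfig.AmazonAppConfigClientBuilder;
import com.amazonaws.services.appconfig.model.GetConfigurationRequest;
import com.amazonaws.services.appconfig.model.GetConfigurationResult;
import com.amazonaws.util.json.Jackson;
import com.fasterxml.jackson.databind.JsonNode;

// Sketch of a detector-configuration lookup against AWS AppConfig.
public class DetectorConfigClient {

    private final AmazonAppConfig appConfig = AmazonAppConfigClientBuilder.defaultClient();
    private String cachedVersion; // passed back so unchanged config is not re-downloaded
    private JsonNode cachedConfig;

    public boolean isRuleEnabled(String detectorName, String ruleName) {
        refresh();
        for (JsonNode detector : cachedConfig.get("Detectors")) {
            if (detectorName.equals(detector.get("name").asText())
                    && "active".equals(detector.get("status").asText())) {
                for (JsonNode rule : detector.get("rules")) {
                    if (ruleName.equals(rule.get("name").asText())) {
                        return "active".equals(rule.get("status").asText());
                    }
                }
            }
        }
        return false;
    }

    private void refresh() {
        GetConfigurationResult result = appConfig.getConfiguration(new GetConfigurationRequest()
                .withApplication("CodeGuruReviewerInference")   // placeholder names
                .withEnvironment("prod")
                .withConfiguration("detector-config")
                .withClientId("detector-worker-1")
                .withClientConfigurationVersion(cachedVersion));
        // An empty content payload means the cached version is still current.
        if (result.getContent() != null && result.getContent().remaining() > 0) {
            cachedConfig = Jackson.jsonNodeOf(
                    StandardCharsets.UTF_8.decode(result.getContent()).toString());
        }
        cachedVersion = result.getConfigurationVersion();
    }
}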

DEPENDENCIES AND CONSUMERS

1. ECS with Fargate: Each task that uses the Fargate launch type has its own isolation boundary and does not share the
underlying kernel, CPU resources, memory resources, or elastic network interface with another task.
2. Step Functions: a serverless function orchestrator that makes it easy to sequence AWS Lambda functions and multiple
AWS services into business-critical applications. The inference core workflow is based on a Step Functions state machine.
3. Lambda: an event-driven, serverless computing platform.
4. IAM: credential management.
5. S3: a storage service.
6. VPC: virtual private cloud that allows detectors to run in isolated sections of the AWS cloud
7. PrivateLink: it allows isolated container in private VPC to access services hosted on AWS and keep network traffic within
the AWS network.

Restrictions, limitations, and constraints


Step Functions limit

1. Maximum execution time: 1 year


2. Maximum execution history size: 25000 events
3. Maximum activity pollers per Amazon Resource Name (ARN): 1,000 pollers calling GetActivityTask per ARN.
Exceeding this quota results in this error: "The maximum number of workers concurrently polling for activity tasks has been
reached."
4. Maximum input or result data size for a task, state, or execution: 32,768 characters. This quota affects tasks (activity or
Lambda function), state or execution result data, and input data when scheduling a task, entering a state, or starting an
execution.
5. More details: https://docs.aws.amazon.com/step-functions/latest/dg/limits.html

ECS Fargate limit

1. Maximum memory: 30GB


2. Maximum CPU: 4vCPU
3. Maximum ephemeral disk space: 20GB
4. Maximum containers per task definition: 10
5. Task definition size limit: 10 KiB
6. Maximum number of tasks per service: 2000
7. More details: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-quotas.html

SQS limit

1. Message size: The minimum message size is 1 byte (1 character). The maximum is 262,144 bytes (256 KB).
2. Message visibility timeout: The default visibility timeout for a message is 30 seconds. The minimum is 0 seconds. The
maximum is 12 hours.
3. Message throughput: Standard queues support a nearly unlimited number of API calls per second, per API action
(SendMessage, ReceiveMessage, or DeleteMessage).
4. More details: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-quotas.html

Security considerations
Q: What type of data will this system handle?
Customer code artifacts and intermediate representations will be pulled into the environment for detector execution.

1. Source code: customer source code


2. MuGraph: an internal graph data structure that represents the customer's code with rich context. It represents programs at the
statement and expression levels, capturing both control and data flows between program elements.
3. Build artifacts: compiled jar files provided by the IR service

Detectors will also generate recommendations for the customer's request.

Q: Where will the data be stored and accessed?


Customer code artifact data will be pulled from the IR service's secure storage and deleted from the environment once
detector execution is done. The IR service stores all data in S3 with encryption enforced by bucket policy. The detector hosting
container pulls data from S3 using an IAM role for access and KMS for encryption/decryption. All data is deleted once
detector execution is done. If the host fails to delete the temp data even after retries, a stop signal is sent to shut down the
instance to make sure the same instance does not process any more request jobs.
Generated recommendations will be stored in each detector framework's choice of storage. For CodeGuru Reviewer owned
detectors, the recommendations will be encrypted and stored in an S3 bucket. The S3 bucket will also be partitioned by timestamp,
account id and job id.

Q: Who can access your service?


Internal Amazon CodeGuru Data Plane team dev member.

Q: How are they authenticated and authorized?


Single sign-on solution with Amazon Federate.

Scaling considerations

HOW DOES THE DETECTOR HOSTING SYSTEM SCALE?

The detector hosting system will auto scale the number of running tasks based on changes in traffic volume. The detector service will
be configured with a minimum instance count and a maximum instance count. Below are the proposed solutions for scaling up and down;
Solution 2 is recommended.

Solution 1: Use a worker utilization metric to scale up and down. Utilization = active worker count / total worker count. We can set
an upper threshold and a lower threshold: when utilization is low, scale down the worker count; when utilization is high, scale up for
additional workers.

(Recommended) Solution 2: Use a target tracking scaling policy to scale up and down based on SQS metrics.
ApproximateNumberOfMessagesVisible is the number of messages available for retrieval from the queue. An alarm can
be set up to auto scale based on this queue depth metric (see the sketch below).
Solution 3: Use scheduled scaling to scale based on a time schedule. For the CodeGuru Reviewer service, the traffic pattern
shows more requests on weekdays than weekends, and more during daytime than nighttime.
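
One possible shape of the recommended Solution 2, sketched with the AWS SDK for Java v1 Application Auto Scaling client: register the ECS service as a scalable target and attach a target tracking policy on the queue depth metric. The cluster, service and queue names, the capacity bounds and the target value are all placeholders to be tuned per detector.

import com.amazonaws.services.applicationautoscaling.AWSApplicationAutoScaling;
import com.amazonaws.services.applicationautoscaling.AWSApplicationAutoScalingClientBuilder;
import com.amazonaws.services.applicationautoscaling.model.CustomizedMetricSpecification;
import com.amazonaws.services.applicationautoscaling.model.MetricDimension;
import com.amazonaws.services.applicationautoscaling.model.PutScalingPolicyRequest;
import com.amazonaws.services.applicationautoscaling.model.RegisterScalableTargetRequest;
import com.amazonaws.services.applicationautoscaling.model.TargetTrackingScalingPolicyConfiguration;

// Sketch of a target tracking policy on ApproximateNumberOfMessagesVisible for a detector service.
public class DetectorServiceAutoScaling {

    public static void main(String[] args) {
        AWSApplicationAutoScaling autoScaling = AWSApplicationAutoScalingClientBuilder.defaultClient();
        String resourceId = "service/dg-detector-cluster/dg-detector-service"; // placeholder

        // Bound the fleet between a minimum and maximum task count.
        autoScaling.registerScalableTarget(new RegisterScalableTargetRequest()
                .withServiceNamespace("ecs")
                .withResourceId(resourceId)
                .withScalableDimension("ecs:service:DesiredCount")
                .withMinCapacity(2)
                .withMaxCapacity(200));

        // Track the detector queue depth: scale out as the backlog grows, scale in as it drains.
        autoScaling.putScalingPolicy(new PutScalingPolicyRequest()
                .withPolicyName("dg-detector-queue-depth-tracking")
                .withServiceNamespace("ecs")
                .withResourceId(resourceId)
                .withScalableDimension("ecs:service:DesiredCount")
                .withPolicyType("TargetTrackingScaling")
                .withTargetTrackingScalingPolicyConfiguration(new TargetTrackingScalingPolicyConfiguration()
                        .withTargetValue(5.0) // placeholder: target backlog per task
                        .withCustomizedMetricSpecification(new CustomizedMetricSpecification()
                                .withNamespace("AWS/SQS")
                                .withMetricName("ApproximateNumberOfMessagesVisible")
                                .withDimensions(new MetricDimension()
                                        .withName("QueueName")
                                        .withValue("dg-detector-queue")) // placeholder
                                .withStatistic("Average"))));
    }
}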

WHAT ARE THE SCALING BOTTLENECKS?

For an ECS service, the maximum number of tasks per service is 2000 by default. This is adjustable. We will also mitigate this
limit by distributing detector services into separate accounts and individual clusters.

WHAT DEGREE OF SCALING CAN THIS DESIGN ACCOMMODATE?

Availability considerations

Potential latency during deployment


During deployment, there could be increased latency in processing detector execution requests. When a deployment is triggered,
ECS instances get a stop signal. If an instance worker is in the middle of processing a request and the instance shuts down
before processing is done, the request message is put back on the SQS queue and picked up by another
worker. This churn of instance updates could lead to increased latency for those job requests during deployment.

This can be mitigated using the container timeouts configuration. ECS provides a stopTimeout setting, which is the time
duration to wait before the container is forcefully killed. The maximum stop timeout value is 120 seconds. If a job finishes within 2
minutes, this won't be an issue. Based on current Java detector latency, p99 jobs finish analysis within 1.19 minutes (excluding
ECS Fargate task start-up time). Setting the shutdown timeout to the maximum value will mitigate the impact for p99 jobs.

In addition, the worker can monitor the SIGTERM signal sent when an instance is about to shut down. When a worker receives this
signal, it shouldn't pick up new requests (see the sketch below).
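
A minimal sketch of that drain-on-shutdown behavior is below, assuming a JVM shutdown hook is triggered by the SIGTERM ECS sends; the class name, the 110-second wait and pollAndProcessOneRequest are placeholders.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of graceful shutdown: on SIGTERM the worker stops polling new requests and is given
// up to the stopTimeout window (max 120s) to finish the request in flight.
public class GracefulShutdown {

    private static final AtomicBoolean STOPPING = new AtomicBoolean(false);
    private static final CountDownLatch DRAINED = new CountDownLatch(1);

    public static void main(String[] args) throws InterruptedException {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            STOPPING.set(true);                          // stop picking up new requests
            try {
                DRAINED.await(110, TimeUnit.SECONDS);    // let the in-flight request finish
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
        }));

        while (!STOPPING.get()) {
            pollAndProcessOneRequest(); // placeholder for the SQS poll + detector execution
        }
        DRAINED.countDown();
    }

    private static void pollAndProcessOneRequest() throws InterruptedException {
        Thread.sleep(1000); // placeholder
    }
}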

Stale hosts
After each deployment, it's possible that the deployment completes successfully but some instances are still running the old
revision. This isn't an issue for ECS Fargate, as Fargate manages tasks under a deployment ID rather than an ASG. All tasks
under old deployments will eventually be torn down.

Source provider outage


If a source provider has an outage, for example a GitHub webhook outage, there will be a burst of
incoming requests after recovery. Inference will see increased traffic over a short time period, and detector services will auto scale
the number of running tasks based on the request SQS queue size and the age of the oldest message. Once the burst traffic is
processed, the ECS service will scale down the number of tasks and process normal traffic.

Dependency outage
● ECS outage: if ECS has an outage, the workers wouldn’t be running normally and can’t process detector execution
requests. Requests will stay in SQS queue. Once ECS outage recovers, workers will be back online and process requests
from SQS queue.
● Step Functions outage: if StepFunctions has an outage, and in worst case scenario, workflows can’t be triggered or start
any detector executions. JobSweeper will be restarting those jobs once outage is recovered.
● S3 outage: if S3 has an outage and ECS task is unable to save generated result to S3 bucket. It will first retry S3 put
object operation. If retry times out, detector execution request will be considered as failed. The message gets put back to
SQS queue and waits for retry.

Operational considerations
● What resources and region specific components are we setting up and how can we automate it?
○ Are we creating new IAM roles/users?
■ Yes, it will be done with a CloudFormation template.
○ Do we need to subscribe to SQS queues owned by other services?
■ No, we will own the SQS queues.
● Can the system be configured dynamically at runtime? What changes can be made? How?
○ Yes, the detector workflow can be configured dynamically using secure dynamic configuration.
○ Changes can be made to enable or disable a detector.
○ A new ops tool will be introduced to allow us to update configurations dynamically.
● How are you going to monitor your system's health? What alarms are you going to setup? What metrics do we need?
○ Alarm on SQS DLQ size > 0
○ Alarm on Fargate CPU utilization, memory utilization and heartbeat
○ Alarm on StepFunctions detector execution workflow latency
● What SOPs do we need to set up?
○ An SOP will be set up to onboard new detector frameworks to this detector hosting system.
● Is the system customer-facing?
○ No

Development plan
MILESTONES

(10/01) - Milestone M1: DragonGlass detector up and running for internal launch (ArtifactBuilder) - 8w

1. Set up separate accounts, pipeline, lpt package for hosting DG detectors and deploying changes - 2w
a. Accounts
b. Pipeline
c. VPC
d. Security group
e. ECS infra
2. Service integrated with DG framework - 2w
a. Build docker image - 1w
b. Run DragonGlass in Fargate container as single tenancy with on-demand requests
3. Create S3 bucket for DG hosting service to store recommendation results - 1w
a. is encryption needed for internal launch (?) - yes
4. Convert DragonGlass detections to CodeGuru recommendation format - 1w
a. 1w for DataPlane team - fetching source code for recommendation mapping and validation
b. 1w for DragonGlass team (confirm if DG has committed to it)
5. Integration and test with the new StepFunctions w/f that drk@ is setting up - 1w
6. End to end test for A/B experience - 1w

(11/15) - Milestone M2: Re:invent launch ready [20 weeks]

1. Lambda to fetch DG recommendations from DG storage and store to S3 bucket for A/B to read - 1w
2. Integration with IR service to fetch artifacts from IR - 2w
3. Run same DG job request in multiple child jobs - 2w
a. Can we follow up with Martin to close this out?
4. Status management for detector running - 1w
5. SQS queue set up for detector request - 0.5 w
6. Docker image for shared logic, polling from SQS - 2w
7. Auto scaling strategy, setup - 2w
8. Error handling - 1w
9. Integration test - 2w
10. End to end test and bug fixes - 1w
11. DG detector recommendation doc writer review - 0.5w
12. Benchmark performance test - 1w
13. Load test - 1w (post 11/15)
14. Pen test support - 1w
15. Address security findings - (?)
16. (TBD) AppConfig for detector configurations - 2w

DEPLOYMENT STRATEGY

No special considerations, as there will be LPTs for all new pipelines. Simply extending the pipeline and a regular deployment will
be needed.

RLA CHANGES

RLA requires changes to LPTs to extend regions and create alarms.

Testing plan

AUTOMATED INTEGRATION AND UNIT TESTS

● Each detector framework should have its own unit tests and integration tests in their own pipelines.
● Each detector framework will also have its own regression tests to validate that there is no performance degradation.
● Canary will run continuously to test end to end process in every region.

Appendices and Supplementary Information


PROOF OF CONCEPT - STEP FUNCTION ACTIVITIES WORKER

import com.amazonaws.ClientConfiguration;
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.stepfunctions.AWSStepFunctions;
import com.amazonaws.services.stepfunctions.AWSStepFunctionsClientBuilder;
import com.amazonaws.services.stepfunctions.model.GetActivityTaskRequest;
import com.amazonaws.services.stepfunctions.model.GetActivityTaskResult;
import com.amazonaws.services.stepfunctions.model.SendTaskFailureRequest;
import com.amazonaws.services.stepfunctions.model.SendTaskSuccessRequest;
import com.amazonaws.util.json.Jackson;
import com.fasterxml.jackson.databind.JsonNode;

import java.util.concurrent.TimeUnit;

public class DetectorActivities {

    private static final String ACTIVITY_ARN =
            "arn:aws:states:us-west-2:013991436161:activity:detector-analysis-activity";

    // Placeholder for the real detector analysis.
    public String runAnalysis(final String request) throws Exception {
        System.out.println("Processing request: " + request);
        return "{\"Processing request\": \"" + request + "\"}";
    }

    public static void main(final String[] args) throws Exception {
        DetectorActivities detectorActivities = new DetectorActivities();
        // GetActivityTask is a long poll that can last up to 60 seconds, so the socket
        // timeout must be longer than that.
        ClientConfiguration clientConfiguration = new ClientConfiguration();
        clientConfiguration.setSocketTimeout((int) TimeUnit.SECONDS.toMillis(70));

        AWSStepFunctions client = AWSStepFunctionsClientBuilder.standard()
                .withRegion(Regions.US_WEST_2)
                .withCredentials(new DefaultAWSCredentialsProviderChain())
                .withClientConfiguration(clientConfiguration)
                .build();

        while (true) {
            // Poll Step Functions for the next activity task.
            GetActivityTaskResult getActivityTaskResult = client.getActivityTask(
                    new GetActivityTaskRequest().withActivityArn(ACTIVITY_ARN));

            if (getActivityTaskResult.getTaskToken() != null) {
                try {
                    JsonNode json = Jackson.jsonNodeOf(getActivityTaskResult.getInput());
                    String result = detectorActivities.runAnalysis(json.get("job").textValue());
                    client.sendTaskSuccess(new SendTaskSuccessRequest()
                            .withOutput(result)
                            .withTaskToken(getActivityTaskResult.getTaskToken()));
                } catch (Exception e) {
                    client.sendTaskFailure(new SendTaskFailureRequest()
                            .withTaskToken(getActivityTaskResult.getTaskToken()));
                }
            } else {
                // No task available; back off briefly before polling again.
                Thread.sleep(1000);
            }
        }
    }
}

MULTI-TENANCY SECURITY CONSIDERATIONS

Q: How does the multi-tenancy architecture clean up?

When a multi-tenancy instance processes a request, it pulls the related code artifacts to disk, runs the analysis, generates
recommendations and stores them to its choice of storage. Once the request is done, the worker deletes all code
artifacts used for the previous job, deletes any temp files generated by the analyzers, and makes sure no customer code
segments remain in memory or cache. If any of the deletions fail even after retries, the instance sends a signal to shut itself
down and stops processing any following requests, which provides assurance that the machine handling requests is
always in a known good state. In addition, customer code artifacts should remain only in volatile memory.
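
A hedged sketch of that per-job cleanup is below; the workspace path, retry count and shutDownInstance behavior are placeholders standing in for the stop signal described above.

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Sketch of the per-job cleanup for the multi-tenancy (sequential) mode.
public class JobCleanup {

    public void cleanUpAfterJob(Path jobWorkspace) {
        for (int attempt = 1; attempt <= 3; attempt++) {
            try {
                deleteRecursively(jobWorkspace);
                return; // workspace is clean; the worker may accept the next request
            } catch (IOException | UncheckedIOException e) {
                // retry; transient failures (e.g. files still held open) may resolve
            }
        }
        // Deletion kept failing: refuse further requests so no customer data can leak across jobs.
        shutDownInstance();
    }

    private void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(root)) {
            // Delete children before parents.
            paths.sorted(Comparator.reverseOrder()).forEach(path -> {
                try {
                    Files.delete(path);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    private void shutDownInstance() {
        System.exit(1); // placeholder: exiting lets ECS replace this task
    }
}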

For each detector framework onboarding to the CodeGuru inference service, a security review is required to make sure that
customer data is not placed in non-volatile memory or storage.

Q: What are the security risks and mitigations with this multi-tenancy architecture setup?

Threat: Customer A's job runs on worker instance 1, the job artifacts remain in memory, and customer B (attacker)
chains vulnerabilities to get to that memory.
Mitigation: No data can be exfiltrated out of the environment. All worker instances run in private VPC subnets
with no public internet access. There is no internet gateway attached to the security groups. Each instance also has a strict
IAM role and policy attached, and the instance role can only access the inference service account for storing recommendations.

Threat: Customer A (attacker) is able to manipulate memory or compromise the OS in such a way that when subsequent
customer jobs run, they are able to tamper with the results of those jobs.
Mitigation:

1. The detector service runs in a JVM process, and we rely on Java memory management to manage memory
securely. This includes enforcing runtime constraints through the use of the JVM, a security manager that sandboxes
untrusted code from the rest of the operating system, and a suite of security APIs that Java developers can utilize.
2. The result of the job run is the recommendations for the code artifacts. In the detector service, we validate the
recommendations before storing them to our service S3 buckets. Suppose customer A (attacker) tampered with the result of a
subsequent customer B job: the service checks that the recommendation format matches what we are expecting. If it's
gibberish, it gets caught and dropped, and this worker instance is marked as unhealthy and shut down. If it
isn't caught, results are stored to our service S3 bucket; they are not directly published to customer-facing
services. In the detector service, we do not do any direct manipulation of or communication with customer repositories or
source code providers, so an attacker wouldn't have direct access to any other customer's resources. If tampered results are
stored to the S3 bucket, the workflow moves on to the publish-comment Lambda, where we do an additional
recommendation validation check to further validate the comment.

Design Review Notes

Detector hosting - Design Review Notes: Detector hosting - Design Review Notes

Update

DrawIO link
