
The 6 Cloud Infrastructure Challenges
Every B2B SaaS Must Tackle


Contents

Introduction
Chapter 1: Security and Compliance
Security Best Practices
Secure Cloud Accounts
Defense-in-Depth
Securing Your Data
Identity and Access Management
Implement DevSecOps in the Pipeline
Perimeter Security
Audit Tooling
Getting Compliant
Chapter 2: Cost Optimization
Cost Impact
Optimize Your Process
Three Phases of Optimization
1. Analysis
2. Optimization
3. Operation
Bonus: Resource Tagging
Chapter 3: Cloud Automation
The Benefits of Automation
Process Reliability
Change Management
Disaster Recovery
Indirect Benefits
Managing Infrastructure as Code
Tools for Automation
Kubernetes Operators and Automation
Managing Multi-tenant Apps
Chapter 4: Productivity through CI/CD
Impact
Pipeline Must-Haves
Automation
Steps
Metrics
Representative Local Environments
Application and Infrastructure as Code
Security and Compliance
Testing Must-Haves
Static Code Analysis
Unit Testing
Automated Smoke Testing
Code Coverage
Automated Deployments
Continuous Delivery on Kubernetes
CI/CD Tools
Chapter 5: Observability
Understanding Your System
Visualizing Your Metrics
Implementing Observability
Monitoring
Metrics
Log Centralization
Tracing
Observability and Metrics Checklist
Observability Tools
Chapter 6: Availability and Scalability
You Can’t Buy Availability
Understand Your Metrics
Backups and Disaster Recovery
Autoscaling and Managed Databases
Scalability is a Must-Have
Testing
Criteria for Autoscaling
Business Continuity and Disaster Recovery
Business Continuity Plan
Conclusion
Introduction

There are six challenges every B2B Startup and ScaleUp must
tackle when it comes to their Cloud Infrastructure and DevOps
methodologies: Security and Compliance, Cost Optimization,
Cloud Automation, Productivity through CI/CD, Observability, and
Availability and Scalability. These challenges encompass everything
that keeps a B2B Software-as-a-Service (SaaS) up and running.

Over the years at Flugel.it, we’ve seen many clients struggle with
one or more of these areas. They may be scaling up and taking on
Fortune 500 customers for the first time, requiring them to update
their security and compliance processes, or they may be moving
from legacy infrastructure to a fully automated one in the cloud.
However, while these areas may not be the focus of their business, companies must manage those needs; otherwise, the stability of their business, user perception, data security, costs, and productivity will be greatly affected.

Given the many challenges facing our clients’ businesses, we’ve decided to compile our knowledge into this guide to better serve you.

We recommend you read the entire book, but each chapter can be read independently or referred back to when you run into a problem in a specific area. With this book in your toolkit, your cloud infrastructure will be safe, secure, productive, cost-effective, and ready to grow efficiently along with your business.

Chapter 1:
Security and Compliance


Protecting your company’s assets is paramount in today’s world; this means taking security and compliance seriously, and it’s why we are tackling this challenge first. Over 1,000 data breaches occurred in the US in 2020, and 158 million people were affected by data exposures caused by lax security practices. Security breaches can also harm your brand value and reputation, making companies reluctant to do business with you.

While there are significant differences between security and compliance, they are often considered jointly. Security relates to a set of measures, tools, and processes put in place to protect company assets. In contrast, compliance mainly focuses on alignment with regulations, standards, and/or best practices. Furthermore, compliance is mandatory in the B2B SaaS world. You will struggle to gain larger clients if you lack compliance, and insurance won’t cover errors and omissions.

Security Best Practices

In setting up a trusted, secure cloud infrastructure, we recommend the following best practices:

Secure Cloud Accounts

Begin by segregating workloads by account based on their function or data sensitivity requirements. The networking layout in each account must be organized in at least two subnets: public and private. The public subnet is strictly for services that require exposure, and everything else must be in the private network. For example, your databases and API services must be deployed in the private subnet, with the API exposed to the public through a load balancer that runs in the public subnet.

Most importantly, when you need to perform administrative actions, access the resources and internal services in your company’s private network using a Virtual Private Network (VPN) or Secure Shell (SSH).

Defense-in-Depth

A single network boundary can become an entry point into your infrastructure, so we recommend adopting the defense-in-depth concept. This gives you multiple layers of security mechanisms and controls to protect the network, enforcing strict data confidentiality, integrity, and availability. To implement defense-in-depth, you can layer together a series of different defenses, such as firewalls, malware scanners, intrusion detection systems, data encryption, and integrity auditing solutions. With these multiple layers embedded in the system, you effectively close the gaps created by relying on a single security solution.

Securing Your Data

Data must be encrypted in transit and at rest using the tools provided by your cloud platform. Load balancers with TLS certificates provide security for endpoints exposed to customers. One of the most efficient encryption options in the cloud technology space is AWS EBS volume encryption, which implements at-rest encryption.

In addition to being encrypted, data must also be classified. In order to understand the risks around your data and the level of security, authentication, and access control it requires, you must apply proper classification. Classification ensures that data is used for the right purpose and only by approved users.
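To make the idea concrete, a classification scheme can be expressed as a mapping from sensitivity level to the controls that level requires. This is a minimal sketch; the level names and control names are hypothetical examples, not a formal standard:

```python
# Hypothetical classification levels mapped to the controls they require.
# Level and control names are illustrative, not a formal standard.
REQUIRED_CONTROLS = {
    "public": set(),
    "internal": {"access_control"},
    "confidential": {"access_control", "encryption_at_rest"},
    "restricted": {"access_control", "encryption_at_rest",
                   "encryption_in_transit", "audit_logging"},
}

def missing_controls(level: str, applied: set[str]) -> set[str]:
    """Return the controls a dataset at `level` still lacks."""
    return REQUIRED_CONTROLS[level] - applied
```

Calling `missing_controls("confidential", {"access_control"})` would flag the missing at-rest encryption, which a compliance check could then report.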

Identity and Access Management

Identity and access management are critical parts of an information security program. You will want to ensure that only authorized users and components can access your resources, and only in the manner clearly defined by your organization.

Critically, it is essential to use Multi-Factor Authentication (MFA) for every part of the system and make password complexity mandatory at your organization.

Also, you should apply the principle of least privilege: give users the minimum access needed to do their jobs, restricting access to functions they do not require. Finally, use secure secret management, enable AWS Security Hub if you are on AWS, and use IAM roles, service accounts, or their equivalents.
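As a small illustration of auditing for least privilege, the sketch below flags policy statements that allow wildcard actions. It assumes the AWS IAM JSON policy document layout (`Statement`, `Effect`, `Action`) but is a simplified check, not a full policy analyzer:

```python
def overly_broad_statements(policy: dict) -> list[int]:
    """Return indexes of Allow statements that grant wildcard actions.

    `policy` follows the AWS IAM JSON policy layout (Version/Statement).
    This is a simplified audit sketch, not a complete policy analyzer.
    """
    issues = []
    for i, stmt in enumerate(policy.get("Statement", [])):
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]  # "Action" may be a string or a list
        too_broad = any(a == "*" or a.endswith(":*") for a in actions)
        if stmt.get("Effect") == "Allow" and too_broad:
            issues.append(i)
    return issues
```

Running such a check in CI lets you catch `s3:*`-style grants before they reach production.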


Implement DevSecOps in the Pipeline

Security is not solely for your data or access management; you will want every step of your development cycle to be secure. It is crucial to implement security measures as early as possible in the development lifecycle. This is often referred to as DevSecOps. Below are a few of the steps that must be in a security pipeline:

Software composition analysis (SCA) or dependency scanning: SCA tools perform automated scans of an application’s code base, including related artifacts such as containers and registries, to identify all open source components, their license compliance data, and any security vulnerabilities.

Static Application Security Testing (SAST): This inspects the source code for vulnerabilities or bad practices. Examples include the OWASP Top 10, but checks are not limited to those.

Dynamic Application Security Testing (DAST): When your application is up and running, DAST will scan the application for vulnerabilities.

Container scanning: If you are packaging containers (currently the most common packaging format in the cloud), scan built images to detect security vulnerabilities in the images or third-party dependencies used to build them.


Image signing: Sign your container images to confirm that they have not been modified before they are deployed.

Pen testing: Include regular penetration testing in your pipeline, or conduct it outside the pipeline.
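At its core, dependency scanning is a lookup of your pinned dependencies against a vulnerability feed. The sketch below illustrates the idea with a hard-coded, hypothetical feed (`KNOWN_VULNERABILITIES`, with placeholder advisory IDs); real SCA tools query maintained vulnerability databases:

```python
# Placeholder advisory feed: (package, version) -> advisory IDs.
# Both the package names and the advisory IDs are made up for illustration;
# real SCA tools pull this data from maintained vulnerability databases.
KNOWN_VULNERABILITIES = {
    ("acme-http", "1.2.0"): ["EXAMPLE-2023-0001"],
    ("acme-yaml", "3.0.1"): ["EXAMPLE-2023-0042"],
}

def scan_dependencies(deps: dict[str, str]) -> dict[str, list[str]]:
    """Map each vulnerable dependency name to its advisory IDs."""
    findings = {}
    for name, version in deps.items():
        advisories = KNOWN_VULNERABILITIES.get((name, version))
        if advisories:
            findings[name] = advisories
    return findings
```

A pipeline step would fail the build whenever `scan_dependencies` returns a non-empty result.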

Perimeter Security

To improve perimeter security, we highly recommend using the tooling offered by cloud providers. Some tools are offered by the cloud providers themselves (AWS, Azure, Google Cloud), but companies like Cloudflare and Fastly provide tooling as well.

Some tools to consider:


WAF (like AWS WAF or Cloudflare WAF): Protects traffic before it enters your application. It blocks request patterns such as SQL injection and similar attacks that could exploit issues in your application. You should still handle these in the application itself, but a WAF offers an additional layer of protection.

AWS GuardDuty or AWS Shield: GuardDuty protects your entire AWS account by detecting and reporting abnormal activity. AWS Shield protects your application from DDoS (Distributed Denial of Service) attacks and minimizes latency and downtime with automatic inline mitigations and always-on detection.
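Conceptually, a WAF rule is a pattern match applied to an incoming request before it reaches your application. The toy filter below shows the idea with two illustrative regexes; production WAFs rely on curated, frequently updated rule sets, so treat this only as a sketch of the mechanism:

```python
import re

# Naive request filters for illustration only; real WAFs use curated,
# frequently updated rule sets, not two hand-written regexes.
SUSPICIOUS_PATTERNS = [
    re.compile(r"(?i)\bunion\b.+\bselect\b"),  # classic SQL-injection shape
    re.compile(r"(?i)<script\b"),              # reflected XSS attempt
]

def should_block(query_string: str) -> bool:
    """Return True if the query string matches a known-bad pattern."""
    return any(p.search(query_string) for p in SUSPICIOUS_PATTERNS)
```

As the text notes, this layering does not replace input validation in the application; it only reduces what reaches it.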


Audit Tooling
Your company will want to document and organize your security
records, audits, alerts, and other critical information.

If you are using AWS, Security Hub provides a single place where you can visualize your security alerts from AWS tools.

Use audit trails to store and catalog events of actions taken on your cloud resources. AWS CloudTrail, GCP Cloud Logging, and Azure Log Analytics are tools you can use to check the activity performed in your cloud provider accounts.

Getting Compliant

Compliance with well-known certifications like ISO 27001 and SOC 2 will be required sooner or later. ISO 27001 is a standard that focuses on developing an Information Security Management System (ISMS). It requires conducting a risk assessment and implementing security controls. SOC 2, on the other hand, is a more flexible set of audit reports showing your level of conformity to a set of defined criteria. The criteria cover security, processing integrity, availability, privacy, and confidentiality; only security is mandatory for certification.

To start getting compliant, you can use the CCM from the CSA to demonstrate to your clients that you can implement a formal approach to security. While they are expensive and time-consuming, certifications demonstrate your company’s maturity and knowledge to your clients.

The Cloud Security Alliance (CSA) is the leading organization committed to defining the best practices used to establish a secure and protected cloud computing environment. CSA will provide you with free tools, guides, and checklists to self-assess your organization’s maturity against any of the market standards.

CSA has been evolving its methodology since 2009, adding more coverage with each version, up to and including the current version 4.0 of the CCM (Cloud Controls Matrix). Once completed, the outcome is a maturity level that can be cross-referenced with other frameworks for the market standard of your choice.

When it comes to Security and Compliance, it is never too soon to start improving your protocols. Your assets will be more secure, and your clients will have confidence in your expertise. Once your company is able to stay secure and compliant, you will be ready to move on to the next challenge.


Chapter 2:
Cost Optimization


When you consider the growing demand for cloud services, paired with increased availability and variety, it’s no surprise the costs incurred by businesses are sure to rise. In other words, cost optimization is a challenge our clients face. Some may have even started migrating more services to the cloud to take advantage of the cloud’s budget flexibility. In fact, cost optimization is the main reason for 47% of enterprises’ cloud migration.

Moving to the cloud can improve the bottom line for your company. Keep in mind, however, that it will not eliminate all your costs. By most estimates, about a third of a company’s IT budget goes towards cloud services. To optimize your costs across the board, you must first understand and learn how best to control them in the cloud. The goal is not to reduce spending but to optimize spending. As your business grows and you acquire more clients, your costs will most likely increase. However, if those costs are optimized, profits will undoubtedly improve.

Cost Impact

The cloud can offer unlimited scalability. It can also lower your IT costs by only charging for what you use. At the same time, it is essential to verify that you are only paying for what you actually need. Up to 70% of cloud costs are potentially wasted, either due to poor adoption or underutilized features. It’s critical for your business to understand the impact of the cloud on your bottom line.


Start analyzing the impact of the cloud by defining a business metric that will help you understand costs. This is key. Examples of metrics you could track include cost per pageview, cost per client, etc. Once you have a clear picture of the metric or metrics you need to measure, you can create a cost report to help you understand how and what you are spending on the cloud.
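The arithmetic behind such unit metrics is simple: divide total spend by each business metric over the same period. A minimal sketch, with hypothetical metric names:

```python
def unit_costs(total_cost: float, usage: dict[str, int]) -> dict[str, float]:
    """Cost per unit of each business metric (e.g., per client, per pageview).

    Metrics with a zero count are skipped to avoid division by zero.
    """
    return {metric: total_cost / count
            for metric, count in usage.items() if count}
```

For example, `unit_costs(12000.0, {"clients": 40})` yields a cost of 300.0 per client, a figure you can then track month over month in your cost report.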

Optimize Your Process

Now that you have defined your metric, it is time to start optimizing. The report you have created will track current usage as well as its evolution. There are several best practices you can put into place to get optimization underway.

Three Phases of Optimization

Before you begin, you will want to break the optimization process
down into three phases:

1. Analysis
The first phase of cost optimization is analysis. During this phase,
work towards understanding your current spending and focus on
reporting. You will want to compare current resource utilization
vs. current expenditures to get a sense of how your resources are
being used.

2. Optimization
At this point, you will apply the various types of optimization. These

CHAPTER 2: COST OPTIMIZATION 16


include financial, infrastructural, and application-based optimization.

Financial Optimization
All cloud providers offer different financial options to facilitate cost optimization. AWS provides Reserved Instances and Savings Plans, which allow you to pay upfront and therefore reduce costs. Other providers offer financial incentives, too: GCP provides committed use discounts, and Azure has reserved instance discounts.

Spot Instances and Preemptible Instances

If you are looking for bigger cost savings, you may want to consider using spot and preemptible instances. This is when cloud service providers offer their unused capacity at a very low cost; they could save you up to 90%. This is very similar to financial optimization; however, it requires some work on your infrastructure and applications to support them. These instances could disappear at any moment, so you will want to prepare your platform in case such an event occurs. They are an excellent option for testing and development environments and for some production workloads.

Rightsizing
When you rightsize, you analyze your computing services and modify them to the required size. You will want to use the right instance type and size for your workloads. This requires monitoring and metrics collection to review capacity versus current use, and adjustments to the size of your infrastructure. To use rightsizing, automation must be in place.
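A rightsizing recommendation boils down to comparing observed utilization against instance capacity plus headroom. The sketch below assumes a hypothetical size catalog (`SIZE_VCPUS`) and looks only at average CPU; real rightsizing also weighs memory, network, and utilization peaks:

```python
# Hypothetical size catalog: size name -> vCPU count.
SIZE_VCPUS = {"large": 2, "xlarge": 4, "2xlarge": 8}

def recommend_size(current: str, avg_cpu_pct: float,
                   headroom: float = 0.3) -> str:
    """Pick the smallest size covering observed CPU use plus headroom."""
    used_vcpus = SIZE_VCPUS[current] * avg_cpu_pct / 100
    needed = used_vcpus * (1 + headroom)
    for name, vcpus in sorted(SIZE_VCPUS.items(), key=lambda kv: kv[1]):
        if vcpus >= needed:
            return name
    return current  # nothing larger in the catalog; keep the current size
```

An instance averaging 20% CPU on a `2xlarge` would be recommended down to an `xlarge` under these assumptions, halving its compute cost.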


Autoscaling
Autoscaling is used to increase your computing capacity horizontally to meet increasing demand. It helps you save money because you will have the right resources in place based on your current load. For example, when you use autoscaling, you do not have idle instances up and running when demand is low.

Turn On/Off
You do not need to run all your resources all the time. Testing environments and manual UAT environments are typically used during specific days and hours, so you can turn them on and off on command. You may want to consider self-service environments; these are good options for developers and QA. Your teams can launch environments for specific needs and destroy them later. Automation is key here: with it, you can create, destroy, start, and stop on demand, and use things only when you need them.
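The on/off decision itself can be as simple as a schedule check that your automation evaluates before starting or stopping an environment. A minimal sketch, with the business-hours window as assumed parameters:

```python
from datetime import datetime

def should_run(now: datetime, on_hour: int = 8, off_hour: int = 20,
               weekdays_only: bool = True) -> bool:
    """Decide whether a non-production environment should be up right now.

    The 8:00-20:00 weekday window is an assumed example schedule.
    """
    if weekdays_only and now.weekday() >= 5:  # 5, 6 = Saturday, Sunday
        return False
    return on_hour <= now.hour < off_hour
```

A scheduled job could call this every few minutes and start or stop the environment whenever the answer changes.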

Automation involves using on/off functions and stopping resources that you are not using during specific periods of the week, month, etc. Lastly, you will want to look at applications via architecture reviews, serverless, microservices, multi-tenancy, cacheability, assets optimization, and storage utilization.

3. Operation
When you begin this phase of the optimization process, you will
continuously monitor and keep the optimizations up-to-date. This
is not a one-time activity. Cost optimization is an ongoing process.


Your company will want to conduct regular cost reviews to help
measure and predict your needs.

Bonus: Resource Tagging

Tagging is an additional tool that will help you understand your costs and generate reports. Tag each resource with user-defined keys to make it easy to search for and manage resources based on criteria such as owner, environment, or purpose. You can use cost allocation tags to track your costs at a detailed level. This will make your reports easy and efficient to use.
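Once resources carry tags, cost allocation is a group-by over the tag of interest. The sketch below assumes a simple resource record shape (`monthly_cost` plus a `tags` dict); billing exports from real providers are richer, but the aggregation is the same idea:

```python
from collections import defaultdict

def cost_by_tag(resources: list[dict], tag_key: str) -> dict[str, float]:
    """Sum monthly cost per value of one tag; untagged spend is surfaced too."""
    totals: dict[str, float] = defaultdict(float)
    for resource in resources:
        value = resource.get("tags", {}).get(tag_key, "untagged")
        totals[value] += resource["monthly_cost"]
    return dict(totals)
```

Surfacing an explicit "untagged" bucket is deliberate: it shows exactly how much spend your tagging policy has not yet accounted for.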

Cost Optimization is one of the most common challenges our clients face. As more and more businesses migrate their operations and services to the cloud as a means to optimize business costs, they will want to take a closer look at spending in the cloud. It’s vital to put tools in place to further optimize predicted cloud costs. Doing so means you will be able to dynamically respond to your cloud needs and scale on command. The time to take action and optimize cloud costs is now.


Chapter 3:
Cloud Automation


Infrastructure is everywhere. Keeping up with the growing complexity and size of infrastructure at your organization is no easy task. Doing a poor job of keeping up can result in costly delays to updates, patches, or resource delivery. It could result in even worse problems in the future. That’s why Cloud Automation is one of the toughest challenges our clients face.

To regain visibility and control over their infrastructure, many businesses and IT teams rely on automated cloud infrastructure, which allows servers and services to be provisioned with code instead of physical or manual configurations. As of 2019, 68% of businesses had cloud-based systems in place. Whether your business is already cloud-native or considering a move from a legacy infrastructure to a cloud infrastructure that uses automation, there are many things you can learn about Cloud Automation.

The Benefits of Automation

When automating your cloud infrastructure, you essentially want to implement “Infrastructure as Code.” In the past, IT teams would manually configure their infrastructure. Using Infrastructure as Code, or IaC, the infrastructure takes the shape of a code file. There are many direct and indirect benefits to doing this.

Process Reliability

First and foremost, automating your infrastructure or implementing IaC reduces human error. The speed, control, and process reliability you gain will allow for the creation of new environments or deployments for multiple purposes.

Change Management

By versioning all the IaC files like you would application source code, you will have full traceability: a clear picture of who did what and when. Another benefit of IaC is process enforcement and documentation. The only documentation you will need outside of the configuration files will be for the architecture of the infrastructure. This is the “WHAT” of the infrastructure. The “HOW” is defined in the IaC.

Disaster Recovery

Legacy recovery strategies may not address the entire scenario when disaster strikes. By tracking your work across a development lifecycle, you can rebuild your infrastructure instead of restoring it, in the same or another geographical location. When a disaster does happen, you will be able to rebuild quickly and consistently.

Indirect Benefits

Two indirect benefits of IaC are right-sizing for cost optimization and more straightforward and reliable ways to update for compliance and security standards. There is no doubt about it; IaC lowers infrastructure management costs. When your organization uses the cloud, it eliminates many hardware, employee, and physical space costs. Security and compliance updates become easier to implement.


Managing Infrastructure as Code

Managing Infrastructure as Code can be tricky even for the most experienced teams. One of the biggest challenges lies in how to manage the code. Your code needs structure, organization, many tests, and code deduplication. While IaC is more consistent and reliable than a manual process, there are still a few best practices you and your team need to be aware of:

1. Composable Infrastructure: A pre-cloud concept that can be moved to the cloud. With this strategy, you combine IaC modules (e.g., Terraform modules) instead of connecting boxes. Each module manages one of the components of your infrastructure (network, databases, clusters, etc.).

2. CI/CD: Infrastructure code must be managed like application code. You need pipelines with automated tests and deployments.

3. Version Locking: All your infrastructure and its modules must be versioned. The versions you are applying in different places must be tracked in Git to maintain the reliability of the process.

4. Document: But do not over-document. Each module can be understood by reading the code. Document what your modules do and their inputs and outputs; you don’t need to document how they work.


Tools for Automation

Many tools can be used for cloud automation. Below we list a few
you can use in your cloud infrastructure.

Terraform
When you want to predictably and safely build, update,
and improve your infrastructure, Terraform is your go-to
open-source IaC tool.

AWS CloudFormation
With this tool, you can model a collection of AWS and third-party resources, then provision and manage them throughout their life cycles.

Terragrunt
This tool is used to manage Terraform code. It helps to manage Terraform modules and states.

Terratest
This is a Go library that manages automated tests for Infrastructure as Code.

Kubernetes Operators and Automation

Kubernetes operators were introduced as an implementation of the Infrastructure as Software concept. By using them, you can abstract the deployment of applications and services in a Kubernetes cluster.


Operators have domain- and application-specific knowledge that allows them to automate some tasks and extend the Kubernetes API. By doing so, operators provide users with new ways to manage and orchestrate objects and abstract their complexity. In short, an operator provides you with an API to specify the desired state while taking care of the details needed to reach that state.

Kubernetes operators can leverage two concepts to abstract complex orchestration: custom resources and custom controllers. Custom resources extend the API, adding new kinds of objects. While a resource is an endpoint in Kubernetes, a custom resource is one not shipped with Kubernetes but installed on a cluster after its creation. Custom controllers manage custom and non-custom objects according to the information stored in the custom resource. This allows you to extend behavior without having to alter the Kubernetes code.
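The heart of any controller is a reconcile loop: compare the desired state declared in the custom resource with the actual state of the cluster, and compute the actions that close the gap. The sketch below models this with plain dictionaries of replica counts; a real operator would issue Kubernetes API calls instead of returning a plan:

```python
def reconcile(desired: dict[str, int], actual: dict[str, int]) -> list[tuple]:
    """Compute the actions needed to converge actual state toward desired.

    Keys are workload names and values are replica counts. A real
    controller would issue Kubernetes API calls instead of returning a plan.
    """
    actions = []
    for name in sorted(desired):
        want, have = desired[name], actual.get(name, 0)
        if have < want:
            actions.append(("scale_up", name, want - have))
        elif have > want:
            actions.append(("scale_down", name, have - want))
    for name in sorted(actual.keys() - desired.keys()):
        actions.append(("delete", name))  # no longer declared; remove it
    return actions
```

Running this comparison repeatedly, on every change to the desired or actual state, is what lets an operator keep converging the cluster without human intervention.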

Managing Multi-tenant Apps

With Kubernetes, you can fully automate the deployment and operation of your workloads. When it comes to multi-tenant apps, your operators must be smart enough to manage several tasks. All in all, there are many things a Kubernetes operator can do for multi-tenant apps, including but not limited to:

Managing one or more containers per tenant
Initializing a database or another service for each new tenant
Setting up backups for new tenants


By design, operators have all of the considerations necessary for any specific Kubernetes application. They will ensure that every part of its lifecycle is integrated right into the framework and is ready for use when needed. Thus, operators are an essential piece of software for automation.

Despite the challenges, cloud infrastructure automation with IaC and Kubernetes will make the lives of your IT team much easier and save your organization money in the process. Gone are the days of manually configuring your infrastructure. Your teams will spend less time on manual tasks and more time creating and innovating. Your company will benefit from the speed, efficiency, and reliability of an automated infrastructure, not to mention the cost optimization and innovation it enables.


Chapter 4:
Productivity through CI/CD


While managing B2B Software-as-a-Service (SaaS) products has never been faster or easier, organizations working with big brands and/or regulated industries may still run into productivity challenges. You, like many of our clients, might be one of these organizations. If your productivity is not consistently up to par, your software pipelines may be to blame.

The Software Development Lifecycle (SDLC) methodology helps address these problems. It encourages accelerating and streamlining the process via automated pipelines and Continuous Integration and Continuous Delivery, better known as CI/CD. By integrating these philosophies into the development process, organizations reduce the time it takes to implement changes, test changes early and often, and increase productivity.

Impact

Productivity can have a massive impact on a business. Getting a great product to market in a timely fashion is critical. Testing software can take time, and doing it manually can slow this process down, delaying time-to-market and impairing revenue potential.

Once a product does make it to market, it must be reliable and up-to-date. Keeping SaaS products updated requires constant innovation, and that innovation must be achieved and deployed as quickly as possible. Without innovation, a product may not be able to integrate with other technologies or may present additional problems and delays for customers. Furthermore, an organization must have a good Mean Time To Resolution (MTTR) so that clients know that the company will respond quickly to bugs or other problems. A company also should address any security issues and disclose software vulnerabilities promptly. A strong feedback loop between a company and its clients increases customer satisfaction.

These issues are often addressed manually. Again, this slows down the development workflow and could harm the business relationship. Pipeline automation prevents these negative consequences by enforcing processes to improve the quality of releases and security, reducing the impact of bugs.

An automated pipeline has many advantages and can improve your software development process in several key ways, including by:

Reducing the number of manual steps to deploy
Allowing for metrics
Reducing the amount of effort to deploy (and subsequently roll back)
Reducing developer complexity
Providing version control, tracking, and tracing

Pipeline Must-Haves

Automation

Fully automating all pipelines is the first task in improving productivity. It may be tempting to keep some manual control or manual gates, but these should only be used at an approval stage.


Steps

The steps that must be present in the pipeline are:

Build
Test
Package
Deploy

The entire pipeline should also have the following “substeps” within the testing step: fail-fast stages, security scans, code coverage, static analysis, syntax analysis, functional (e2e) tests, and performance tests.
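Structurally, the pipeline described above is an ordered list of steps executed with fail-fast semantics. A minimal sketch, where each step is a callable supplied by your CI system (the handler names here are placeholders):

```python
from typing import Callable

# The four mandatory pipeline stages, executed in order.
PIPELINE_STEPS = ["build", "test", "package", "deploy"]

def run_pipeline(handlers: dict[str, Callable[[], bool]]) -> str:
    """Run each step in order, failing fast on the first step that fails."""
    for step in PIPELINE_STEPS:
        if not handlers[step]():
            return f"failed: {step}"
    return "success"
```

Real CI systems express the same structure declaratively (stages in a YAML file), but the fail-fast ordering is identical.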

Metrics

Tracking metrics in the software development pipeline will help determine the productivity of the development team. Metrics developed by the DevOps Research and Assessment (DORA) team at Google include:

Deployment frequency
Lead time for changes
Change failure rate
Time to restore a service

Other metrics you may want to look at are the incident rate, rollback rate, cycle time, availability, and feature velocity.
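Two of the DORA metrics fall out of a simple deployment log. The sketch below assumes each record carries a `failed` flag; lead time and time-to-restore would additionally need commit and incident timestamps:

```python
def deployment_frequency(deploys: list[dict], days: int) -> float:
    """Deployments per day over the observation window."""
    return len(deploys) / days

def change_failure_rate(deploys: list[dict]) -> float:
    """Fraction of deployments that caused a failure in production."""
    if not deploys:
        return 0.0
    return sum(1 for d in deploys if d["failed"]) / len(deploys)
```

Computing these from your pipeline's own records, rather than estimating them, is what makes the metrics trustworthy enough to act on.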

Representative Local Environments

Developers should first test in their local environments, followed by testing in an automated CI environment. The setup used to run a piece of code in a developer’s local environment should be as similar as possible to the one used in the CI environment. This allows the development team to find bugs early and more effectively.

Application and Infrastructure as Code

When developing a SaaS application using CI/CD methods, application software is not the only code that must be in the pipeline. The same is needed for Infrastructure as Code (IaC). The IaC must be appropriately organized into modules, which must be “unit tested” like any software library. The deployment of IaC must be automated in the pipeline as well.

Security and Compliance

While security and compliance are two different entities, a CI/CD pipeline can automate security reviews, helping an organization make better-informed decisions about what data and code go into the different environments. Automating this process also ensures that an organization can stay compliant while developing and helps a development team stay productive by minimizing the time spent on security and compliance issues.

Testing Must-Haves

While there are several stages of a CI/CD pipeline, testing is a must-have. Testing continuously ensures that quality applications and code are being delivered to users. A development team will typically have more than one development and testing environment to test and review changes to the application.


Static Code Analysis

This tests the application’s source code by scanning it for patterns that could impact the code’s quality, reliability, and security. It reports practices that could open a security hole in your application.

It’s a great first step to execute because it applies the ‘fail fast’ pattern: the test runs quickly and can detect errors that could generate issues later in the pipeline. Finding those issues later in the pipeline could take minutes or even hours in some cases, so it’s much better to run static code analysis first.
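A tiny static check can be built directly on Python's `ast` module: parse the source and walk the tree, looking for patterns you consider risky. The two flagged builtins below are just an illustrative rule set; real analyzers ship hundreds of rules:

```python
import ast

# A tiny illustrative rule set; real static analyzers ship hundreds of rules.
RISKY_BUILTINS = {"eval", "exec"}

def find_risky_calls(source: str) -> list[tuple[int, str]]:
    """Return (line number, function name) for calls to known-risky builtins."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in RISKY_BUILTINS):
            findings.append((node.lineno, node.func.id))
    return findings
```

Because this only parses the code and never runs it, it is cheap enough to execute on every commit, which is exactly the fail-fast property the pipeline needs.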

Unit Testing

Once your code is complete, you can run unit tests. These tests execute different functions of your application individually to detect specific mistakes: they call each function with specific inputs and validate that the outputs match expectations.
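A minimal example in Python: `apply_discount` is a hypothetical pricing helper, and the test calls it with specific inputs and validates the expected outputs, including the error path.

```python
# A minimal unit test: exercise one function in isolation with known
# inputs and assert on the expected outputs.

def apply_discount(price: float, percent: float) -> float:
    """Hypothetical pricing helper used by the application."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

def test_apply_discount():
    assert apply_discount(100.0, 20) == 80.0
    assert apply_discount(19.99, 0) == 19.99
    try:
        apply_discount(10.0, 150)
    except ValueError:
        pass  # the invalid input was rejected, as expected
    else:
        raise AssertionError("expected ValueError for invalid percent")

test_apply_discount()
```

In a real project a runner such as pytest would discover and execute tests like this one on every commit.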

Automated Smoke Testing

This is the test you will run when your application is up and running in a test environment. Automated tests perform actions just like a real user in order to detect functional errors. Smoke tests validate a shallow but broad slice of the application as a whole, detecting specific errors such as failed connections to databases, cache services, etc.
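The sketch below shows the shape of an automated smoke test in Python. To stay self-contained it starts a stand-in HTTP service locally; in a real pipeline the URL would point at the freshly deployed test environment, and the `/health` path is an assumption.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in service so the smoke test below has something to hit.
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep test output quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

def smoke_test() -> bool:
    """Verify the deployed service answers on its health endpoint."""
    url = f"http://127.0.0.1:{port}/health"
    with urllib.request.urlopen(url, timeout=5) as r:
        return r.status == 200 and r.read() == b"ok"

result = smoke_test()
server.shutdown()
```

A failing smoke test at this stage catches broken wiring (database, cache, configuration) before any user-facing traffic is routed to the build.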

Code Coverage

After running so many tests on your code, you will want to know
how well your tests are actually testing your code. You may also
want to know if you have enough tests in place. Code coverage
can answer these questions. It will determine which code state-
ments have been executed during a test run and which have not.
Code coverage will point out which parts of the code may not have
been adequately tested and will require more testing.
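Coverage tools work by recording which lines execute during a test run. The toy tracker below, built on Python’s `sys.settrace`, shows the mechanism: after calling `classify(5)`, the negative branch is reported as missed. Real projects would use a tool such as coverage.py instead.

```python
import sys

# A toy line-coverage tracker: record which lines of `classify` ran.
executed: set[int] = set()

def tracer(frame, event, arg):
    if event == "line" and frame.f_code.co_name == "classify":
        # Store line numbers relative to the function's `def` line.
        executed.add(frame.f_lineno - classify.__code__.co_firstlineno)
    return tracer

def classify(n: int) -> str:
    if n < 0:
        return "negative"
    return "non-negative"

sys.settrace(tracer)
classify(5)            # exercises only the non-negative branch
sys.settrace(None)

total_lines = {1, 2, 3}          # relative offsets of the function body
missed = total_lines - executed  # lines no test has exercised
```

The `missed` set is exactly what a coverage report highlights: code that will require more testing.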

Automated Deployments

Your CI/CD pipelines must automatically deploy every code update to the testing environments. They must also be ready to deploy to production or staging environments in just one click.

Continuous Delivery on Kubernetes

Kubernetes is a platform that is now widely used to run software applications. As stated earlier in Chapter 3, you can fully automate the deployment and running of your workloads with Kubernetes. While Continuous Delivery can be implemented anywhere, Kubernetes makes the process faster and more dynamic.

Kubernetes can deploy very quickly and allows your engineering teams to have increased agility. Below are a few of the best practices you can follow when using CD on Kubernetes.¹
¹ https://cloud.google.com/architecture/addressing-continuous-delivery-challenges-in-a-kubernetes-world
https://harness.io/blog/devops/kubernetes-ci-cd-best-practices/

Use Helm to package and deploy your Kubernetes applications.
Use a tool that implements the GitOps pattern.
Configure the GitOps tool to roll back automatically in case of
application errors.
Use one cluster per environment. Isolate production from
staging and development.
Build health checks into your applications.
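The last item deserves a sketch. A common convention, assumed here, is to expose separate liveness and readiness signals: liveness says the process is alive, while readiness only turns green once dependencies are reachable, so a rollout (or an automatic GitOps rollback) can gate on it.

```python
# Sketch of application health checks: liveness vs. readiness.
# The dependency names are illustrative.

class Health:
    def __init__(self):
        # Tracks whether each dependency has been connected yet.
        self.dependencies = {"database": False, "cache": False}

    def liveness(self) -> int:
        return 200  # the process is running

    def readiness(self) -> int:
        # Ready only once every dependency is reachable.
        return 200 if all(self.dependencies.values()) else 503

h = Health()
before = h.readiness()  # dependencies not yet connected
h.dependencies.update(database=True, cache=True)
after = h.readiness()
```

Kubernetes probes would call these two endpoints on a schedule and withhold traffic from a pod that is alive but not yet ready.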

CI/CD Tools

There are many tools you can use to implement CI/CD pipelines. These tools will help you build, test, and deploy your applications. They range from the open-source Jenkins to paid solutions like CircleCI.

Jenkins
An open-source and, more importantly, free automation server to build, test, and deploy your applications.
CircleCI
A hosted CI/CD platform with enterprise solutions.
GitHub Actions
Another hosted CI/CD platform with enterprise solutions.
SonarQube
Continuously inspects your code and ensures its quality with automatic reviews.
ArgoCD
This tool continuously monitors your running applications against the desired state declared in Git and is implemented as a Kubernetes controller.

Flux CD
Another Kubernetes solution, this tool will keep your Kubernetes clusters in sync.

Overcoming the productivity challenge has countless benefits, ranging from faster time to market to increased customer satisfaction. An SDLC framework and, more specifically, CI/CD and an automated pipeline are essential. Integrating these frameworks and tools adequately takes time, effort, and expertise. Once this is achieved, however, improvements will resonate throughout your organization.

Chapter 5:
Observability

At some point, you’re going to have problems with your system, application, and/or infrastructure. Once you have your cloud infrastructure and pipeline under control, you will want to understand what’s happening in the entire system through Observability. The goal when addressing this challenge is to prevent as many issues as possible, but the reality is that problems are unavoidable. What’s most important is detecting problems before they impact your business or users. This is when monitoring and, more importantly, Observability come into play.

Monitoring is something you and your team do actively. You will monitor your system to detect problems, which might mean running tests to check the availability and performance of your system. On the other hand, Observability is a property of your system that uses outputs to understand what is going on inside. If your system doesn’t externalize its internal state, no amount of monitoring will help you detect specific problems in time. You must not only know what is happening in your system but also why it is happening.

Understanding Your System

Using Observability has advantages beyond simply understanding what’s going on in your system, application, and infrastructure. Some problems can be detected early, before they are noticeable. Internal latency, too many locks in the database, or any problem on the backend may become visible to your users if you do not take action.

Detection and communication are key here. You want to detect problems before your users do, and you will also want to let them know if an issue will affect them. Letting the world know that you are actively monitoring and troubleshooting any given problem will help put your customers at ease. You will be able to debug and troubleshoot outages and service degradations, tackle bugs, detect unauthorized activity, and more. Observability also helps you understand the uptime of your SaaS and the quality of service your users receive.

Visualizing Your Metrics

Metrics are a crucial component of Observability and business. They help us improve our application and infrastructure. Searching for metrics that help you understand how your system works is an ongoing process. For example, latency or responsiveness can be determined with metrics. When observability and metrics are applied across the board, they must show that you are addressing all the challenges of managing your infrastructure properly: availability, productivity, costs, security, compliance, and scalability. Dashboards and correct visualizations are critical.

Implementing Observability

As discussed above, when you start implementing Observability, you will look at metrics and a combination of monitoring, log centralization, and tracing.

Monitoring

Run active checks to verify the availability of critical components.

Metrics

Collect data points to count and assess errors, load, and other variables from all the components required to run your application. You will obtain your metrics from your application and required services, infrastructure, pipelines, costs, and security incidents.

Log Centralization

Record information about events in your system in a central location.

Tracing

Track and identify problems in requests crossing through your system.
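As a small illustration of the metrics component, the Python sketch below counts HTTP responses by status and renders them in the Prometheus text exposition format that scrapers expect. The metric and label names are illustrative.

```python
from collections import Counter

# Count responses by HTTP status, then render the counts in the
# Prometheus text exposition format.
requests = Counter()

def observe(status: int) -> None:
    requests[status] += 1

def render() -> str:
    lines = ["# TYPE http_requests_total counter"]
    for status, count in sorted(requests.items()):
        lines.append(f'http_requests_total{{status="{status}"}} {count}')
    return "\n".join(lines)

for s in (200, 200, 500):
    observe(s)

output = render()
```

A metrics endpoint serving this text is all a Prometheus server needs to start scraping, graphing, and alerting on your error rate.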

Observability and Metrics Checklist

At Flugel, we’ve put together an Observability and metrics checklist to help you detect problems before your clients notice them.

Place logs in a centralized place, and use a visualization tool to review and query them.
Detect and count error strings in logs, and display them in a dashboard.

Collect metrics at the operating system level, service level, and CSP level.
Group metrics to correlate technical metrics with business metrics.
Create a dashboard providing metric data to business stakeholders. This dashboard must display costs, security, availability, response time, and other metrics impacting the business.
Track HTTP 5xx metrics in all the environments, and set alerts for production.
Monitor TLS certificates.
Monitor exposed endpoints with HTTP checks.
Configure alerts properly to avoid alert spamming, and define appropriate methods to control this problem. Events that can be detected from metrics but don’t require immediate action should not be treated as alerts and must be recorded in an event log.
Provide distributed tracing.

By putting this checklist in place, you will begin to understand how well Observability and monitoring are working in your organization. Over time, you will gain insight into the health and performance of your products, process, and people.

Observability Tools

There are many popular open-source Observability tools. We’ve listed a few below.

Prometheus
This is a monitoring system and time-series database. It gives you a dimensional data model as well as a robust and powerful query language.
Grafana
This is a multi-platform solution that gives you time-series
analytics. A significant benefit of Grafana is that if you do not
want your data streamed over a vendor’s cloud, it can be
deployed on-prem.

Vector.dev
If you are looking for speed, this tool is your answer. Vector
is very fast and collects data end-to-end in your Observability
pipeline.

Elastic Stack
This tool provides log management and analytics in noisy and
distributed environments. It also scales as your data grows.

Cloud service providers also offer many Observability tools. A few you may want to consider are:

AWS CloudWatch
If you are using AWS services, you may want to consider AWS
CloudWatch. This tool collects and correlates data across all
AWS products.

Azure Monitor
Working with your cloud and on-premises environments, Azure Monitor is Microsoft’s solution for data and analytics in the cloud.

Cloud Monitoring
This tool is Google’s cloud operations suite. For those run-
ning applications on Google Cloud, this tool will let you track
and analyze your data.

Once appropriately implemented and visualized, metrics impact the performance and quality of your business, and you will be well on your way to improving your Observability.

Chapter 6:
Availability and Scalability

Many of our clients are in the process of growing their business. If you are doing the same, then you have most likely learned that meeting customer demand and being consistently available to your customers are two of the most important things your business must do. Furthermore, you probably already know that the cloud enables both. In fact, availability and scalability are among the main reasons businesses are moving to the cloud. It was anticipated that 94% of internet workloads would run in the cloud by 2021.

In short, being available and meeting demand refer to availability and scalability, respectively. While these terms speak to different concepts, you won’t have high availability (HA) without scalability. You need both in order to have the right resources in place when you and/or your users need them. So what are availability and scalability?

Availability is associated with reliability, fault tolerance, and disaster recovery. It means your services are always ready when needed. It is key to the perception your users have of your service. They want to be confident your services will be reliable and not cause them costly downtime. Infrastructure will fail. It isn’t possible to ensure that everything works 100% of the time. It’s vital for your business to reduce the impact of failures and the MTTR (Mean Time To Recovery).

Scalability, on the other hand, is about growing your service to meet demand or decommissioning services when demand decreases. Demand could be cyclical, seasonal, related to some special event, etc. The main benefits include adapting to changes in workload or demand and being able to optimize costs, because you can dynamically provision just what you need.

You Can’t Buy Availability

Your process must integrate availability, and your application must collaborate with it. Availability must be part of your infrastructure and/or applications from the very beginning of development.

Understand Your Metrics

Metrics will show you whether or not your infrastructure or application has high availability. Review your data and check for redundancies. The metrics you will want to monitor are:

Uptime
RTO (Recovery Time Objective)
RPO (Recovery Point Objective)
SLI (Service Level Indicator)
SLO (Service Level Objective)
SLA (Service Level Agreement)
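A short worked example of how these metrics relate: an SLO target implies an error budget, and measured downtime is checked against it. The numbers below are illustrative.

```python
# Turn an SLO target into a monthly error budget and compare it
# against measured downtime. Figures are illustrative.

def error_budget_minutes(slo: float, window_minutes: float) -> float:
    """Allowed downtime for a given SLO over a time window."""
    return window_minutes * (1 - slo)

MONTH = 30 * 24 * 60                         # minutes in a 30-day window
budget = error_budget_minutes(0.999, MONTH)  # "three nines" => 43.2 min
measured_downtime = 25.0                     # minutes, hypothetical

slo_met = measured_downtime <= budget
```

Tracking the remaining budget through the month tells you when to slow feature work and invest in reliability instead.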

Backups and Disaster Recovery

Disaster recovery is how you manage backups and infrastructure to rebuild in other regions if needed. Your backups must be in multiple locations. Backups, RTO, and RPO must match business needs. All your infrastructure must be redundant and multi-AZ. You should be able to use automation to rebuild your infrastructure in another region.
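Matching backups to the business RPO can itself be automated. The Python sketch below checks that the newest backup is no older than the RPO allows; the timestamps stand in for a real backup inventory, and the four-hour RPO is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical business requirement: lose at most 4 hours of data.
RPO = timedelta(hours=4)

def rpo_satisfied(backup_times: list[datetime], now: datetime) -> bool:
    """True if the newest backup is recent enough to meet the RPO."""
    if not backup_times:
        return False
    newest = max(backup_times)
    return now - newest <= RPO

now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
backups = [now - timedelta(hours=9), now - timedelta(hours=3)]

ok = rpo_satisfied(backups, now)                      # fresh backup exists
stale = rpo_satisfied([now - timedelta(hours=9)], now)  # too old
```

Run on a schedule, a check like this turns “backups must match the RPO” from a policy statement into an alert.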

Autoscaling and Managed Databases

Autoscaling not only supports additional load; it can also assist HA. A managed database such as AWS RDS also provides HA, in addition to fault tolerance and Point-in-Time Recovery. Object storage like S3 can be protected with Glacier backups, delete-protection flags, versioning, etc.

Scalability is a Must-Have

There is no one way to build a scalable infrastructure. Options you may want to consider include auto-scaling by adding or removing resources (e.g., AWS Auto Scaling groups) and caching via CDNs, application caches, or database caches. What is most important is that your infrastructure and applications are scalability-friendly. This means Cloud-Native apps must be stateless to ensure that adding and removing instances is trivial, and sessions must be stored in a central/shared place.

Testing

Scalability needs testing. You need to do load testing. Simulate expected load (or even more than expected) to understand how your auto-scaling works. Testing will also help you understand where your application or platform crashes in case of unexpected load.
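The skeleton of a load test is simple, as the Python sketch below shows: fire many concurrent requests and look at the latency distribution. To stay self-contained, a sleeping function stands in for the HTTP call you would make against your test environment; dedicated tools (k6, Locust, etc.) build on the same idea.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(_: int) -> float:
    """Stand-in for an HTTP call; returns the observed latency."""
    start = time.perf_counter()
    time.sleep(0.01)  # simulated service processing time
    return time.perf_counter() - start

# Fire 100 "requests" with 20 concurrent workers and collect latencies.
with ThreadPoolExecutor(max_workers=20) as pool:
    latencies = list(pool.map(fake_request, range(100)))

p95 = sorted(latencies)[int(0.95 * len(latencies))]
```

Watching the p95 (and your autoscaler) while ramping the request count up reveals both how scaling reacts and where the platform finally breaks.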

Criteria for Autoscaling

There are multiple criteria for autoscaling, including CPU, memory, or a custom metric of your application. For example, in a machine learning application, some events are stored in Kafka to be processed, and the cluster scales in and out based on the number of messages queued.
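The queue-based example can be sketched as a scaling rule in Python. The capacity figure and replica bounds are illustrative; in practice a component such as the Kubernetes HPA with external metrics, or KEDA, evaluates a rule like this for you.

```python
import math

# Illustrative bounds and per-replica processing capacity.
MIN_REPLICAS, MAX_REPLICAS = 2, 20
MESSAGES_PER_REPLICA = 100

def desired_replicas(queue_depth: int) -> int:
    """Pick a replica count from queue depth, clamped to the bounds."""
    wanted = math.ceil(queue_depth / MESSAGES_PER_REPLICA)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, wanted))
```

The lower bound keeps a baseline of capacity for HA, while the upper bound caps cost when the queue spikes.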

The metrics you use for auto-scaling, and how you create the scalable infrastructure, are up to you and your business. Building availability and scalability into your infrastructure and applications from day one is essential for your clients and customers to trust and use your products.

Business Continuity and Disaster Recovery

Knowing which DR method to apply depends on your Business Continuity Plan (BCP) and the required SLAs. Every service that plays a critical role in your business should be considered, to avoid any event that might impact your business over time and, more importantly, the users’ perception of your organization. One example is third-party issue-tracker services. If Atlassian goes down, it may not directly impact your users, but the service desk will be down, and you will not be able to accept support requests for some time.

Business Continuity Plan

To protect your company from all threats, we recommend putting in place a Business Continuity Plan. This is a process that creates a system of prevention and recovery. It is something some organizations ignore at their own risk. A BCP not only puts in place a plan for DR, but it also helps you understand your organization as a whole. You need to understand everything that impacts your business to be able to work around potential issues, communicate with your users, and decide what actions to take in case of a disaster.

Conclusion

Operating a tech business in the cloud is not for the faint of heart. There are many moving pieces and parts to consider. Managing the six challenge areas we’ve laid out in this ebook often requires a robust team of individuals, including a Cloud Architect, Cloud and DevOps Engineers, and Compliance and Security engineers. These roles are often backed by a Lead Cloud Architect and a CISO. You may also want to include SMEs (Subject Matter Experts) to address technology-specific issues, a 24x7 Cloud monitoring and incident management team, and a Project Manager (PM) to coordinate all the efforts. Hiring people takes time and must be done regularly to guarantee the availability of resources and the quality of output.

Ever-evolving technologies will require your team to be continuously trained and kept up to speed in these areas. Most of all, you will require documentation, best practices, and a codebase to start tackling these problems fast and efficiently. While it’s no easy task, it can be done.

At Flugel.it, we wanted to share with you everything you and your team need to know and highlight what you need to have under control when operating in the cloud. From Security and Compliance to BCPs, we’ve covered the main challenges facing many B2B Startups and Scaleups when it comes to controlling and operating their cloud infrastructure. To help you tackle these problems, we’ve included as many recommendations and tools as possible.

https://flugel.it

infradevs@flugel.it
