the_six_challenges_b2b_saas
The Cloud Infrastructure
Challenges Every B2B SaaS
Must Tackle
Contents
Introduction
Chapter 1: Security and Compliance
Security Best Practices
Secure Cloud Accounts
Defense-in-depth
Securing Your Data
Identity and Access Management
Implement DevSecOps in the Pipeline
Perimeter Security
Audit Tooling
Getting Compliant
Chapter 2: Cost Optimization
Cost Impact
Optimize Your Process
Three Phases of Optimization
1. Analysis
2. Optimization
3. Operation
Bonus: Resource Tagging
Chapter 3: Cloud Automation
The Benefits of Automation
Process Reliability
Change Management
Disaster Recovery
Indirect Benefits
Managing Infrastructure as Code
Tools for Automation
Kubernetes Operators and Automation
Managing Multi-tenant Apps
Chapter 4: Productivity through CI/CD
Impact
Pipeline Must-Haves
Automation
Steps
Metrics
Representative Local Environments
Application and Infrastructure as Code
Security and Compliance
Testing Must-Haves
Static Code Analysis
Unit Testing
Automated Smoke Testing
Code Coverage
Automated Deployments
Continuous Delivery on Kubernetes
CI/CD Tools
Chapter 5: Observability
Understanding Your System
Visualizing Your Metrics
Implementing Observability
Monitoring
Metrics
Log Centralization
Tracing
Observability and Metrics Checklist
Observability Tools
Chapter 6: Availability and Scalability
You Can’t Buy Availability
Understand Your Metrics
Backups and Disaster Recovery
Autoscaling and Managed Databases
Scalability is a Must-Have
Testing
Criteria for Autoscaling
Business Continuity and Disaster Recovery
Business Continuity Plan
Conclusion
Introduction
There are six challenges every B2B startup and scale-up must
tackle when it comes to their cloud infrastructure and DevOps
methodologies: Security and Compliance, Cost Optimization,
Cloud Automation, Productivity through CI/CD, Observability, and
Availability and Scalability. These challenges encompass everything
that keeps a B2B Software-as-a-Service (SaaS) product up and running.
Over the years at Flugel.it, we’ve seen many clients struggle with
one or more of these areas. They may be scaling up and taking on
Fortune 500 customers for the first time, requiring them to update
their security and compliance processes, or they may be moving
from legacy infrastructure to a fully automated one in the cloud.
These areas may not be the focus of their business, but companies
must manage them anyway; otherwise, the stability of their business,
user perception, data security, costs, and productivity will suffer.
Given the many challenges facing our clients’ businesses, we’ve
decided to compile our knowledge into this guide to better serve you.
We recommend you read the entire book, but each chapter can
be read independently or referred back to when you run into a
problem in a specific area. With this book in your toolkit, your
cloud infrastructure will be more secure, more productive, more
cost-effective, and ready to grow efficiently along with your business.
Chapter 1:
Security and Compliance
Defense-in-depth
Also, you should apply the Principle of Least Privilege: give users only
the minimum access they need to do their jobs, and restrict access
to functions they do not require. Finally, use secure secret
management, enable AWS Security Hub if you are on AWS, and
use IAM roles, service accounts, or the equivalent.
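As a rough illustration of least privilege, the sketch below builds an AWS IAM policy document that grants read-only access to a single S3 prefix, plus a small check that flags wildcard permissions. The bucket name, prefix, and helper function names are hypothetical; treat it as a starting point rather than a complete policy.

```python
import json

def least_privilege_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy document granting read-only access to one S3 prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadOnlyTenantObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            }
        ],
    }

def has_wildcard_actions(policy: dict) -> bool:
    """Flag Allow statements that grant every action ("*") or a whole service ("s3:*")."""
    for statement in policy["Statement"]:
        actions = statement["Action"]
        if isinstance(actions, str):
            actions = [actions]
        if statement["Effect"] == "Allow" and any(
            action == "*" or action.endswith(":*") for action in actions
        ):
            return True
    return False

# Hypothetical bucket and prefix, used only for illustration.
policy = least_privilege_policy("example-reports-bucket", "tenant-a")
print(json.dumps(policy, indent=2))
print("wildcards found:", has_wildcard_actions(policy))
```

A check like `has_wildcard_actions` could run in your pipeline to catch overly broad policies before they are deployed.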
Perimeter Security
Getting Compliant
To start getting compliant, you can use the CCM from the Cloud Security
Alliance (CSA) to demonstrate to your clients that you can implement a formal approach
to security. While they are expensive and time-consuming, certifi-
The CSA has been evolving its methodology since 2009, expanding
coverage with each version, including the current
version, 4.0, of the CCM (Cloud Controls Matrix). Once completed, the
outcome is a maturity level that can be cross-referenced with
other frameworks for the market standard of your choice.
Chapter 2:
Cost Optimization
Moving to the cloud can improve your company’s bottom line.
Keep in mind, however, that it will not eliminate all your costs. By most
estimates, about a third of a company’s IT budget goes towards cloud
services. To optimize your costs across the board, you must first
understand how best to control them in the cloud. The
goal is not to reduce spending but to optimize it. As your
business grows and you acquire more clients, your costs will most
likely increase. However, if those costs are optimized, profits
should improve as well.
Cost Impact
The cloud can offer virtually unlimited scalability. It can also lower your IT
costs by charging only for what you use. At the same time, it is
essential to remember that you pay for everything you provision,
whether or not you actually use it. Up
to 70% of cloud costs are potentially wasted, either due to poor
adoption or underutilized features. It’s critical for your business to
understand the impact of the cloud on your bottom line.
Now that you have defined your metric, it is time to start optimizing.
The report you have created will track current usage as well
as its evolution. There are several best practices you can put into
place to get optimization underway.
Before you begin, you will want to break the optimization process
down into three phases:
1. Analysis
The first phase of cost optimization is analysis. During this phase,
work towards understanding your current spending and focus on
reporting. You will want to compare current resource utilization
vs. current expenditures to get a sense of how your resources are
being used.
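The comparison of utilization against expenditures can be sketched in a few lines. The resource names, costs, utilization figures, and the 20% threshold below are all hypothetical; in practice the inputs would come from your provider's billing and monitoring data.

```python
from dataclasses import dataclass

@dataclass
class ResourceUsage:
    name: str
    monthly_cost: float      # USD, hypothetical figures
    avg_utilization: float   # 0.0 to 1.0, e.g. average CPU utilization

def wasted_spend(resources, threshold=0.2):
    """Return (name, estimated idle cost) for underutilized resources, costliest first."""
    report = [
        (r.name, round(r.monthly_cost * (1 - r.avg_utilization), 2))
        for r in resources
        if r.avg_utilization < threshold
    ]
    return sorted(report, key=lambda item: item[1], reverse=True)

usage = [
    ResourceUsage("api-server", 400.0, 0.65),
    ResourceUsage("staging-db", 300.0, 0.10),
    ResourceUsage("batch-worker", 150.0, 0.05),
]

for name, idle_cost in wasted_spend(usage):
    print(f"{name}: roughly ${idle_cost} per month pays for idle capacity")
```

Sorting by estimated idle cost puts the largest potential savings at the top of the report.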
2. Optimization
At this point, you will apply the various types of optimization. These
Financial Optimization
All cloud providers offer financial options to facilitate
cost optimization. AWS provides Reserved Instances and
Savings Plans, which lower your rates in exchange for a one- or
three-year commitment, with the option to pay upfront.
Other providers offer similar incentives: GCP provides
committed use discounts, and Azure offers Reservations.
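Whether a commitment pays off depends on how much of the term the resource actually runs. The arithmetic below is a minimal sketch using made-up hourly prices; it assumes the committed rate is billed for every hour of the year, used or not, while the on-demand rate is billed only for hours used.

```python
HOURS_PER_YEAR = 8760

def break_even_utilization(on_demand_hourly: float, committed_hourly: float) -> float:
    """Fraction of the year a resource must run for the commitment to pay off."""
    return committed_hourly / on_demand_hourly

def yearly_savings(on_demand_hourly: float, committed_hourly: float, utilization: float) -> float:
    """Positive when the commitment is cheaper than on-demand at this utilization."""
    on_demand_cost = on_demand_hourly * HOURS_PER_YEAR * utilization
    committed_cost = committed_hourly * HOURS_PER_YEAR  # billed whether used or not
    return on_demand_cost - committed_cost

# Hypothetical prices in USD per hour.
on_demand, committed = 0.10, 0.06

print("break-even utilization:", break_even_utilization(on_demand, committed))
print("savings at 100% usage:", yearly_savings(on_demand, committed, 1.0))
print("savings at 40% usage:", yearly_savings(on_demand, committed, 0.4))
```

With these example prices, a resource that runs less than about 60% of the time is cheaper on demand, which is why commitments suit steady workloads better than environments you switch off.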
Rightsizing
When you rightsize, you analyze your computing services and
modify them to the required size. You will want to use the
right instance type and size for your workloads. This requires
monitoring and metrics collection to compare capacity with
current use, followed by adjustments to the size of your
infrastructure. To rightsize continuously, automation must be in place.
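A rightsizing rule can be as simple as the sketch below: take a high percentile of observed CPU utilization and pick the smallest size that keeps that load under a target. The size table, the 60% target, and the sample data are assumptions for illustration; real decisions usually weigh memory, network, and storage as well.

```python
import math

# Hypothetical instance sizes mapped to vCPU counts.
SIZES = {"small": 2, "medium": 4, "large": 8, "xlarge": 16}

def recommend_size(current_size: str, cpu_samples, target_utilization: float = 0.6) -> str:
    """Pick the smallest size that keeps 95th-percentile CPU load at or under the target."""
    ordered = sorted(cpu_samples)
    # Nearest-rank 95th percentile, so a single spike does not drive the decision.
    p95 = ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)]
    needed_vcpus = SIZES[current_size] * p95 / target_utilization
    for name, vcpus in sorted(SIZES.items(), key=lambda item: item[1]):
        if vcpus >= needed_vcpus:
            return name
    return max(SIZES, key=SIZES.get)  # nothing is big enough: use the largest size

# Utilization samples (0.0 to 1.0) for an instance currently sized "large".
samples = [0.10, 0.12, 0.15, 0.20, 0.11, 0.09, 0.14, 0.13, 0.10, 0.12]
print("recommended size:", recommend_size("large", samples))
```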
Turn On/Off
You do not need to run all your resources all the time. Testing
environments and manual UAT environments are typically
used only during specific days and hours, so you can turn them
on and off on command. You may also want to consider
self-service environments, which are a good option for devs and QA:
your teams can launch environments for specific needs and
destroy them later. Automation is key here. With it, you can
create, destroy, start, and stop resources on demand and use them
only when you need them.
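The scheduling decision behind turning environments on and off can be sketched as follows. The working days and hours are hypothetical, and the function only decides what to do; actually starting or stopping resources would go through your cloud provider's API or your IaC tooling.

```python
from datetime import datetime, time

# Hypothetical schedule: weekdays from 08:00 to 19:00 local time.
WORK_DAYS = {0, 1, 2, 3, 4}            # Monday=0 ... Friday=4
START, STOP = time(8, 0), time(19, 0)

def should_be_running(now: datetime) -> bool:
    """True when a test environment is expected to be available."""
    return now.weekday() in WORK_DAYS and START <= now.time() < STOP

def plan_action(now: datetime, currently_running: bool) -> str:
    """Decide whether a scheduler should start, stop, or leave the environment alone."""
    wanted = should_be_running(now)
    if wanted and not currently_running:
        return "start"
    if not wanted and currently_running:
        return "stop"
    return "none"

print(plan_action(datetime(2024, 1, 8, 9, 0), currently_running=False))   # a Monday morning
print(plan_action(datetime(2024, 1, 6, 9, 0), currently_running=True))    # a Saturday morning
```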
3. Operation
In this phase of the optimization process, you continuously
monitor your spending and keep the optimizations up to date. This
is not a one-time activity. Cost optimization is an ongoing process.
Chapter 3:
Cloud Automation
Process Reliability
Change Management
By versioning all the Infrastructure as Code (IaC) files as you would
application source code, you get full traceability: a clear picture of
who did what and when. Another benefit of IaC is process enforcement
and documentation. The only documentation you will need outside of the
configuration files is for the architecture of the
infrastructure. This is the “WHAT” of the infrastructure. The “HOW”
is defined in the IaC.
Disaster Recovery
Indirect Benefits
Two indirect benefits of IaC are rightsizing for cost optimization and
simpler, more reliable updates for compliance
and security standards. IaC lowers infrastructure
management costs, and security and compliance updates become
easier to implement. Using the cloud also eliminates many
hardware, staffing, and physical space costs.
Many tools can be used for cloud automation. Below we list a few
you can use in your cloud infrastructure.
Terraform
When you want to predictably and safely build, update,
and improve your infrastructure, Terraform is your go-to
open-source IaC tool.
AWS CloudFormation
With this tool, you can model a collection of AWS and
third-party resources, then provision and manage them
throughout their life cycles.
Terragrunt
This tool is a wrapper used to manage Terraform code. It helps
to manage Terraform modules and state.
Terratest
This is a Go library for writing automated tests for
infrastructure code.
Chapter 4:
Productivity through CI/CD
Impact
Pipeline Must-Haves
Automation
Steps: Build, Test, Package, Deploy
Within the testing step, the pipeline should also include the following
“substeps”: fail-fast stages, security scans, code coverage, static
analysis, syntax analysis, functional (e2e) tests, and performance tests.
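The fail-fast idea can be sketched as an ordered list of stages where the first failure stops everything after it. The stage names echo the substeps above, but the checks are placeholder functions; a real pipeline would invoke your actual linters, scanners, and test runners.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]

def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order, stop at the first failure, and return the stages that ran."""
    executed = []
    for name, check in stages:
        executed.append(name)
        if not check():
            print(f"stage '{name}' failed, skipping the remaining stages")
            break
    return executed

# Placeholder checks: cheap, fast stages go first so errors surface early.
stages = [
    ("static analysis", lambda: True),
    ("unit tests", lambda: False),          # simulate a failing stage
    ("functional (e2e) tests", lambda: True),
    ("performance tests", lambda: True),
]

ran = run_pipeline(stages)
print("stages executed:", ran)
```

Ordering stages from fastest to slowest means a simple mistake never has to wait behind a long end-to-end run.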
Metrics
Other metrics you may want to look at are the incident rate, rollback
rate, cycle time, availability, and feature velocity.
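Two of these metrics, rollback rate and cycle time, can be computed from a simple deployment log. The record structure below is an assumption made for illustration, and cycle time is taken here as the time from commit to deployment; your pipeline may define or record it differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    rolled_back: bool

def rollback_rate(deployments) -> float:
    """Share of deployments that had to be rolled back."""
    return sum(d.rolled_back for d in deployments) / len(deployments)

def average_cycle_time(deployments) -> timedelta:
    """Mean time from commit to deployment."""
    total = sum((d.deployed_at - d.committed_at for d in deployments), timedelta())
    return total / len(deployments)

# Hypothetical deployment history.
start = datetime(2024, 1, 8, 9, 0)
history = [
    Deployment(start, start + timedelta(hours=2), False),
    Deployment(start, start + timedelta(hours=4), False),
    Deployment(start, start + timedelta(hours=6), True),
    Deployment(start, start + timedelta(hours=8), False),
]

print("rollback rate:", rollback_rate(history))
print("average cycle time:", average_cycle_time(history))
```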
Testing Must-Haves
Static code analysis is a great first step to execute because it applies the
‘fail fast’ pattern. It runs quickly and can detect errors that would
otherwise generate issues later in the pipeline, where finding them
could take minutes or even hours in some cases.
It’s much better to do a static code analysis first.
Unit Testing
Once your code is completed, you can run unit tests. These tests
execute different functions of your application individually to
detect specific mistakes. Unit tests call functions with specific
inputs and check that the outputs match what is expected.
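A minimal sketch of that idea using Python's built-in unittest module is shown below. The `apply_discount` function is a made-up example standing in for a function from your own application.

```python
import unittest

def apply_discount(price: float, percent: float) -> float:
    """Hypothetical application function under test."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

class ApplyDiscountTest(unittest.TestCase):
    def test_known_input_gives_expected_output(self):
        # A specific input with a known, validated output.
        self.assertEqual(apply_discount(200.0, 25), 150.0)

    def test_invalid_percent_is_rejected(self):
        # Bad input should fail loudly instead of returning a wrong price.
        with self.assertRaises(ValueError):
            apply_discount(200.0, 120)

# Run the test case programmatically; a CI job would normally use a test runner command.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ApplyDiscountTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("all tests passed:", result.wasSuccessful())
```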
Automated smoke tests are what you run once your application is up and
running in a test environment. They perform
automated actions just like a real user to detect functional
errors. Smoke tests validate a subset of the application
as a whole in order to detect specific errors, such as failed connections to
databases, cache services, etc.
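A very small smoke test for dependency connectivity might look like the sketch below. It only checks that a TCP connection can be opened; the dependency name is hypothetical, and a local listening socket stands in for a real database so the example runs on its own.

```python
import socket

def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def smoke_test(dependencies: dict) -> dict:
    """Check every dependency and report which ones are reachable."""
    return {name: can_connect(host, port) for name, (host, port) in dependencies.items()}

# Stand-in for a real database: a local socket listening on a free port.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]

results = smoke_test({"database": ("127.0.0.1", port)})
listener.close()
print(results)
```

In a real environment, the dictionary would list the hosts and ports of your databases, caches, and other services, and the pipeline would fail if any entry came back False.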
After running so many tests on your code, you will want to know
how well your tests are actually testing your code. You may also
want to know if you have enough tests in place. Code coverage
can answer these questions. It will determine which code statements
have been executed during a test run and which have not.
Code coverage will point out which parts of the code may not have
been adequately tested and will require more testing.
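To make the mechanism concrete, here is a toy line-coverage tracer built on Python's `sys.settrace`. It is only a sketch of how executed statements can be recorded; real projects would use a dedicated coverage tool such as coverage.py. The `classify` function and the two "tests" are made up for the example.

```python
import sys

def classify(n: int) -> str:
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    return "positive"

def executed_lines(func, *args):
    """Run func(*args) and return the executed line offsets, relative to its def line."""
    code = func.__code__
    hit = set()

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None                  # do not trace other functions
        if event == "line":
            hit.add(frame.f_lineno - code.co_firstlineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)           # restore whatever tracer was active before
    return hit

# Two hypothetical "tests": neither exercises the n == 0 branch.
covered = executed_lines(classify, 5) | executed_lines(classify, -1)
body = set(range(1, 6))                  # the five body lines of classify
missed = sorted(body - covered)
print(f"coverage: {len(covered & body)}/{len(body)} lines, missed offsets: {missed}")
```

The missed offset points at the `return "zero"` line, which tells you a test for the zero case is still needed.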
Automated Deployments
Kubernetes can deploy very quickly and give your engineering
teams increased agility. Below are a few of the best practices
you can follow when using CD on Kubernetes [1].
[1] https://cloud.google.com/architecture/addressing-continuous-delivery-challenges-in-a-kubernetes-world
https://harness.io/blog/devops/kubernetes-ci-cd-best-practices/
CI/CD Tools
There are many tools you can use to implement CI/CD pipelines.
These tools will help you build, test, and deploy your applications. They
range from the open-source Jenkins to paid solutions like CircleCI.
Jenkins
An open-source and, more importantly, free automation server
to build, test, and deploy your applications.
CircleCI
A hosted CI/CD platform with enterprise solutions.
GitHub Actions
Another hosted CI/CD platform with enterprise solutions.
SonarQube
Continuously inspects your code and ensures its quality with
automatic reviews.
Argo CD
This continuous delivery tool is implemented as a Kubernetes
controller. It continuously monitors your running applications
and compares their live state against the desired state defined in Git.
Chapter 5:
Observability
At some point, you’re going to have problems with your system,
application, and/or infrastructure. Once you have your cloud
infrastructure and pipeline under control, you will want to understand
what’s happening in the entire system through Observability.
The goal when addressing this challenge is to prevent as many issues
as possible, but the reality is that problems are unavoidable.
What’s most important is detecting problems before they impact
your business or users. This is when monitoring and, more importantly,
Observability come into play.
Detection and communication are key here. You want to detect
problems before your users do, and you will also want to let them
know if an issue will affect them. Letting the world know that you
are actively monitoring and troubleshooting any given problem
will help put your customers at ease. With Observability in place, you
will be able to debug and troubleshoot outages, service degradations,
bugs, unauthorized activity, and more. It also helps you understand the
uptime of your SaaS and the quality of service your users receive.
Implementing Observability
Monitoring
Metrics
Collect data points to count and assess errors, load, and other variables
from all the components required to run your application.
You will obtain your metrics from your application and required
services, infrastructure, pipelines, costs, and security incidents.
Log Centralization
Tracing
- Collect metrics at the operating system level, service level,
  and cloud service provider (CSP) level.
- Group metrics to correlate technical metrics with business
  metrics.
- Create a dashboard that presents metric data to
  business stakeholders. This dashboard must display costs,
  security, availability, response time, and other metrics
  impacting the business.
- Track HTTP 5xx metrics in all environments, and set
  alerts for production.
- Monitor TLS certificates.
- Monitor exposed endpoints with HTTP checks.
- Configure alerts properly to avoid alert spamming, and
  define methods to control this problem. Events that
  can be detected from metrics but don’t require immediate
  action should not be treated as alerts; record them
  in an event log instead.
- Provide distributed tracing.
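Two items from the checklist above, tracking HTTP 5xx responses and separating alerts from events, can be sketched together. The thresholds and environment names are assumptions chosen for illustration, not recommended values.

```python
from collections import Counter

def error_rate(status_codes) -> float:
    """Share of responses with an HTTP 5xx status."""
    counts = Counter(code // 100 for code in status_codes)
    total = sum(counts.values())
    return counts[5] / total if total else 0.0

def classify_signal(rate: float, environment: str,
                    alert_threshold: float = 0.05,
                    event_threshold: float = 0.02) -> str:
    """Page someone only for production; log everything else that is noteworthy."""
    if environment == "production" and rate >= alert_threshold:
        return "alert"   # needs immediate action
    if rate >= event_threshold:
        return "event"   # worth recording, but nobody gets paged
    return "ok"

# Hypothetical samples of recent response codes.
busy = [200] * 90 + [500] * 6 + [503] * 4
calm = [200] * 97 + [502] * 3

print(classify_signal(error_rate(busy), "production"))
print(classify_signal(error_rate(busy), "staging"))
print(classify_signal(error_rate(calm), "production"))
```

Keeping the "event" category separate is one way to control alert spamming: the data is still recorded, but only actionable production issues interrupt anyone.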
Observability Tools
Prometheus
This is a monitoring system and time-series database. It gives
you a dimensional data model as well as robust and powerful
queries.
Grafana
This is a multi-platform solution that gives you time-series
analytics. A significant benefit of Grafana is that if you do not
want your data streamed over a vendor’s cloud, it can be
deployed on-prem.
Vector
If you are looking for speed, this tool is your answer. Vector
(vector.dev) is very fast and collects data end-to-end in your
Observability pipeline.
Elastic Stack
This tool provides log management and analytics in noisy and
distributed environments. It also scales as your data grows.
AWS CloudWatch
If you are using AWS services, you may want to consider AWS
CloudWatch. This tool collects and correlates data across all
AWS products.
Azure Monitor
Working with both your cloud and on-premises environments,
Azure Monitor is Microsoft’s solution for data and analytics in
the cloud.
Cloud Monitoring
This tool is part of Google Cloud’s operations suite. For those
running applications on Google Cloud, this tool will let you track
and analyze your data.
Chapter 6:
Availability and Scalability
Uptime
RTO (Recovery Time Objective)
RPO (Recovery Point Objective)
SLI (Service Level Indicator)
SLO (Service Level Objective)
SLA (Service Level Agreement)
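To see how these metrics relate, the sketch below converts an availability SLO into the downtime it allows over a period, and converts measured downtime back into an availability figure (an SLI). The 99.9% target and the 30-day window are example values only.

```python
from datetime import timedelta

def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Downtime budget implied by an availability SLO over a time window."""
    return window * (1 - slo_percent / 100)

def availability(window: timedelta, downtime: timedelta) -> float:
    """Measured availability (an SLI) as a percentage of the window."""
    return 100 * (1 - downtime / window)

month = timedelta(days=30)

budget = allowed_downtime(99.9, month)
print("downtime budget for a 99.9% SLO:", budget)

measured = availability(month, timedelta(hours=1))
print(f"availability after one hour of downtime: {measured:.3f}%")
```

A 99.9% target over 30 days leaves roughly 43 minutes of downtime, so a single one-hour outage already breaks it. Numbers like these help you set an RTO that is realistic for the SLA you sign.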
Scalability is a Must-Have
Testing
The metrics you use for autoscaling and the way you build your scalable
infrastructure are up to you and your business. Building availability
and scalability into your infrastructure and applications from
day one is essential for your clients and customers to trust and
use your products.
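Whatever metric you choose, a common autoscaling rule is proportional: scale the replica count by the ratio of the observed metric to its target. The sketch below follows the formula documented for the Kubernetes Horizontal Pod Autoscaler; the replica limits and metric values are hypothetical.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 2, max_replicas: int = 20) -> int:
    """Proportional scaling: ceil(current * observed / target), clamped to the limits."""
    wanted = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, wanted))

# Example: average CPU utilization per replica (percent) against a 60% target.
print(desired_replicas(4, current_metric=90, target_metric=60))    # load above target
print(desired_replicas(4, current_metric=10, target_metric=60))    # load well below target
print(desired_replicas(10, current_metric=300, target_metric=60))  # spike beyond the maximum
```

The minimum keeps enough capacity for availability during quiet periods, and the maximum caps the cost of an unexpected spike.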
Conclusion
Operating a tech business in the cloud is not for the faint of heart.
There are many moving parts to consider. Managing
the six challenge areas we’ve laid out in this ebook often requires a
robust team, including a Cloud Architect, Cloud and
DevOps Engineers, and Compliance and Security Engineers. These
roles are often backed by a Lead Cloud Architect and a CISO. You
may also want to include SMEs (Subject Matter Experts) to address
technology-specific issues, a 24x7 cloud monitoring and incident
management team, and a Project Manager (PM) to coordinate all
the efforts. Hiring people takes time and must be done regularly to
guarantee the availability of resources and the quality of output.
https://flugel.it
infradevs@flugel.it