vSphere Design Pocketbook 2.0
Blog Edition
Full-Sized Design Considerations for Your Software-Defined Data Center
Brad Hedlund
Duncan Epping
Cormac Hogan
William Lam
Michael Webster
Josh Odgers
And many others...
Contents
Chapter 1  Host Configuration
    Percentage Based Admission Control Gives Lower VM Restart Guarantee?  (Duncan Epping)
Chapter 2  Cluster and vCenter Design
Chapter 3  Storage Configuration
Chapter 4  Network and Security Design
Chapter 5  VM Configuration
    Considerations When Migrating VMs Between vCenter Servers  (William Lam)
Chapter 6  Application Landscape
Chapter 7  Words of Wisdom
Foreword
The VMware community is a special one. In my many years of IT,
I have not seen a more active, passionate, and engaged group of
people.
The amount of time and effort people invest in supporting and
building the VMware community is astonishing, with amazing
initiatives every day like VCAP and VCDX study groups, podcasts,
vBrownBag webinars, vBeers, and unique challenges like the one
offered by the team of virtualdesignmaster.com.
Because the VMware community loves to share advice, in 2013 I
created the vSphere Design Pocketbook 1.0. Experts were challenged
to create very focused messages that were no longer than a
single tweet (i.e. 140 characters). The book ended up being a
tremendous hit, with PernixData distributing over 10,000 copies!
This year, I wanted to give a bigger canvas to contributors, allowing
them to submit recommendations, hints, and tips without the
character limit. In essence, I wanted to tap into the blogosphere and
re-purpose as much of the great content as possible.
Many of you were eager to submit. To that end, I received lots of
great content, and I want to thank everyone who participated. But,
in the spirit of creating a manageable sized pocketbook, we could
not publish everything. Some tough choices had to be made.
Below is the final result. I am pleased to introduce the vSphere Design Pocketbook 2.0 Blog Edition. It showcases some of the best articles from the virtualization community over the past year, covering everything from new technology introductions to highly detailed VM configuration recommendations and IT infrastructure design considerations. This was an extremely fun project for me.
Chapter 1
Host Configuration
3. You can power on virtual machines until you are out of slots; because a high reservation is set, you will be severely limited!
Now you can imagine that Host Failures can be on the very safe side. If you have one reservation set, the math will be done with that reservation. This means that a single 10GB reservation will impact how many VMs you can power on before HA screams that it is out of resources. But at least you are guaranteed you can power them on, right? Well yes, but realistically speaking people disable Admission Control at this point, as that single 10GB reservation allows you to power on just a couple of VMs (16, to be precise).
But that beats Percentage Based, right? Because if I have a lot of VMs, who says my VM with a 10GB reservation can be restarted?
First of all, if there are no unreserved resources available on any given host to start this virtual machine, then vSphere HA will ask vSphere DRS to defragment the cluster. As HA Admission Control had already accepted this virtual machine to begin with, chances are fairly high that DRS can solve the fragmentation.
Also, as the percentage-based admission control policy uses reservations AND memory overhead, how many virtual machines do you need to have powered on before your VM with a 10GB memory reservation is denied power-on? It would mean that none of the hosts has 10GB of unreserved memory available. That is not very likely, as it means you would need to power on hundreds of VMs, probably far too many for your environment to ever perform properly. So the chances of hitting this scenario are extremely small.
Conclusion
Although theoretically possible, it is very unlikely you will end up in a situation where one or multiple virtual machines cannot be restarted when using the Percentage Based Admission Control policy.
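For reference, a minimal PowerCLI sketch of switching a cluster to the percentage-based admission control policy through the vSphere API is shown below. The cluster name and the 25% values are illustrative assumptions, not settings from the article.

# Switch an HA cluster to percentage-based admission control (25% CPU / 25% memory)
$cluster = Get-Cluster -Name "Prod-Cluster"
$policy = New-Object VMware.Vim.ClusterFailoverResourcesAdmissionControlPolicy
$policy.CpuFailoverResourcesPercent = 25
$policy.MemoryFailoverResourcesPercent = 25
$spec = New-Object VMware.Vim.ClusterConfigSpecEx
$spec.DasConfig = New-Object VMware.Vim.ClusterDasConfigInfo
$spec.DasConfig.AdmissionControlPolicy = $policy
# Second argument $true means "modify" the existing cluster configuration
$cluster.ExtensionData.ReconfigureComputeResource_Task($spec, $true)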
What is a Slot?
A slot is the minimum amount of CPU and memory resources
required for a single VM in an ESXi cluster. Slot size is an important concept because it affects admission control.
A VMware ESXi cluster needs a way to determine how many
resources need to be available in the event of a host failure. This
slot calculation gives the cluster a way to reserve the right amount
of resources.
We take the more restrictive (lower) of the CPU and memory slot counts to determine the number of virtual machines that can be started up under admission control. Therefore, we could safely start 384 machines on these ESXi hosts, have one fail, and have the other host start all of them.
(I should mention that it's unlikely that you could get 384 VMs on one of these hosts. That would be a great consolidation ratio.)
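If you want to see the slot size HA has calculated, a small PowerCLI sketch like the one below can pull it from the vSphere API. It assumes a cluster named "Prod-Cluster" that uses the Host Failures admission control policy (only then does the call return slot information).

# Retrieve HA slot information for a cluster
$cluster = Get-Cluster -Name "Prod-Cluster"
$dasInfo = $cluster.ExtensionData.RetrieveDasAdvancedRuntimeInfo()
$dasInfo.SlotInfo        # slot size: NumVcpus, CpuMHz, MemoryMB
$dasInfo.TotalSlots      # total slots in the cluster
$dasInfo.UsedSlots       # slots currently consumed by powered-on VMs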
Problem Scenario
What if you have a single large VM with a reservation, but the rest
of the virtual machines are relatively small?
Let's look at the same environment, but this time let's make the larger VM have a reservation on it.
Summary
If you're in a situation where you think you need to add extra ESXi hosts to your cluster because you can't power on virtual machines without exceeding your admission control rules, take a look at your slot sizes first. It may save you some money on a host you don't really need.
Solution:
Apply some CLI commands to force ESXi into understanding that your drive is really an SSD. Then reconfigure your Host Cache.
Instructions:
Look up the name of the disk and its naa.xxxxxx number in the VMware GUI. In our example, we found that the disks that are not properly showing as SSD are:
Dell Serial Attached SCSI Disk (naa.600508e0000000002edc6d0e4e3bae0e) local SSD
DGC Fibre Channel Disk (naa.60060160a89128005a6304b3d121e111) SAN-attached SSD
Check in the GUI that both show up as non-SSD type.
SSH to ESXi host. Each ESXi host will require you to look up the
unique disk names and perform the commands below separately,
once per host.
Type the following commands, and find the NAA numbers of your
disks.
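A minimal sketch of finding the NAA numbers from the ESXi shell (the grep pattern is only illustrative):

# List all storage devices and pick out the identifier, display name and SSD status
esxcli storage core device list | grep -E "^naa|Display Name:|Is SSD:"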
Now we will add a rule to enable SSD on those 2 disks. Make sure
to specify your own NAA number when typing the commands.
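A minimal sketch of the add-rule commands, using the two NAA numbers from the example above. VMW_SATP_LOCAL fits the local disk; the SATP shown for the SAN-attached disk is an assumption, so substitute whichever SATP actually claims that device.

# Tag the local SSD
esxcli storage nmp satp rule add --satp=VMW_SATP_LOCAL --device=naa.600508e0000000002edc6d0e4e3bae0e --option=enable_ssd
# Tag the SAN-attached SSD (VMW_SATP_ALUA_CX is an assumption for this DGC device)
esxcli storage nmp satp rule add --satp=VMW_SATP_ALUA_CX --device=naa.60060160a89128005a6304b3d121e111 --option=enable_ssd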
Next, we will check to see that the commands took effect for the
2 disks.
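A minimal sketch of that check:

esxcli storage nmp satp rule list | grep enable_ssd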
Name              Device                                   Rule Group   Claim Options
VMW_SATP_LOCAL    naa.600508e0000000002edc6d0e4e3bae0e     user         enable_ssd
...               naa.60060160a89128005a6304b3d121e111     user         enable_ssd
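A minimal sketch of the reclaim step that the next paragraph refers to, run once per device:

esxcli storage core claiming reclaim -d naa.600508e0000000002edc6d0e4e3bae0e
esxcli storage core claiming reclaim -d naa.60060160a89128005a6304b3d121e111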
If you get an error message from the reclaim command, that's OK. It takes time for the reclaim command to work.
You can check in the CLI by running the command below and
looking for Is SSD: false
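A minimal sketch of that check, per device:

esxcli storage core device list -d naa.600508e0000000002edc6d0e4e3bae0e | grep "Is SSD"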
If it still does NOT say SSD, you need to wait. Eventually, the change takes effect and the device displays as SSD in both the CLI and the GUI.
Chapter 2
Cluster and vCenter Design
Strict security requirements that necessitate an enhanced separation between workloads. For example, some organizations will not tolerate DMZ and production workloads co-existing on the same host or cluster.
King of Monster VMs may need to rule their own island.
Extremely resource intensive workloads may perform best,
and impact other workloads the least, if isolated on their own
cluster. This could be especially true if there are significant vCPU
allocation variations among your workloads. Make sure that your
performance testing bears this out before considering it.
Licensing, which is the most common reason to consider
Island Clusters. You may want to constrain OS licensing by
building clusters dedicated to particular platforms, such as
one cluster for Windows Server and another for your favourite
licensed Linux distribution. Application licensing constraints
tied to server resources are a favourite.
The keen reader has likely already figured out that the Management Cluster mentioned earlier is a type of Island Cluster. In that case a separate cluster was created for both resource and security reasons. If those reasons didn't exist, then those workloads would be placed in the Production Cluster along with the other production workloads.
Now that we have an idea of the types of things to consider when
deciding to design an Island Cluster, how do we determine what
that cluster should look like?
There may be several reasons for a setup like this: perhaps you want your VCSA to be available on a management VLAN but reach ESXi hosts on another VLAN without having routing in place between the segmented networks, or you just want to play around with it like I am in this lab environment.
Disclaimer:
Is this supported by VMware? Probably not, but I simply don't know. Caveat emptor, and all that jazz.
vCenter Design
A lot of debate exists when designing a vCenter Server: whether to deploy it as a virtual machine or to build a physical server for it. The decision varies across enterprises, as each entity has its own needs and its own requirements.
And here goes.
Solutions?
So what to do?
Supported by VMware:
1. We used to have vCenter Server Heartbeat, but its end of availability was announced in June 2014 and sadly there is no replacement for it yet. On the other hand, you can still buy a suite from Neverfail (this is what vCenter Server Heartbeat was based on) called the IT Continuity suite, but I am sure it will cost a fortune.
2. You can use vSphere Replication to replicate the vCenter Server to a local host, but this is not real-time high availability and you must power on the vCenter Server manually. You may also use any other supported replication software for that matter.
3. You can schedule a PowerCLI script to clone the vCenter Server virtual machine on a daily basis (see the sketch after this list).
4. Rely on backups in case of disaster to restore the whole virtual machine.
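For option 3, a minimal PowerCLI sketch of such a clone job might look like the following. The VM, host and datastore names are assumptions; in practice you would run this from a scheduled task and also prune older clones.

# Clone the vCenter Server VM to a dated copy (names are examples)
Connect-VIServer -Server vcenter01.lab.local
$source = Get-VM -Name "vcenter01"
New-VM -Name ("vcenter01-clone-" + (Get-Date -Format "yyyyMMdd")) -VM $source -VMHost (Get-VMHost -Name "esxi01.lab.local") -Datastore (Get-Datastore -Name "Backup-DS")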
Unsupported by VMware:
Finally, you can go the unsupported way, something that I have been working on lately based on Windows Server 2012 Failover Cluster Manager. Here is a diagram that explains the idea behind this solution:
How to start?
Before we start, we have to ask ourselves some questions:
1. Do I want a highly available (HA) setup?
2. Do I want to use self-signed SSL certificates or do I want to use a PKI environment (public or private)?
3. Which database do I want to use?
4. Which type of load balancer do I want to use?
So let us assume that we want to set up a management cluster with the following:
2 SSO servers in HA mode on separate virtual machines
2 vCenter servers load balanced on separate virtual machines (in this scenario we have two separate vSphere clusters)
2 Web Clients load balanced on separate virtual machines
2 MS SQL database servers in HA mode on separate virtual machines
Some rules that we have to stick to:
Each vCenter server has its own Inventory Service
SSO can only be in active/passive mode
Each vCenter server has its own VUM server.
Design
When we apply these rules we get the following design.
So why not place them with the Web Clients? What if you get a third vCenter server? Then you have a problem, because this vCenter server also needs an Inventory Service and you can only run one Inventory Service per host.
Single Sign-on
Single Sign-On (SSO) came with vSphere 5.1 and is your central user authentication mechanism. Single Sign-On can have its own user database or connect to (multiple) other authentication services like Microsoft Active Directory. Therefore we don't want only one SSO server: if it fails, nobody can authenticate to vCenter, including VUM or other VMware servers such as VMware vShield. SSO can be configured in HA mode, but you have to have a load balancer.
Load Balancer
Most VMware services can be load balanced, but they can't do it by themselves. You have to make use of a third-party solution. This can be either a software or a hardware load balancer. Make sure you are aware of the functionality you need, then pick your load balancer.
Conclusion
In a small environment you can perfectly well combine multiple VMware services (vCenter, SSO, Inventory Service and Web Client) on the same host.
As your environment grows, more and more services depend on your VMware environment. In case of a complete power-down situation you first want to start your VMware management cluster. This gives you the option to start the other services for your production environment in a controlled way.
Make sure you create a solid design before you start. Talk to your customer and stakeholders. Those are the guys paying for it!
Chapter 3
Storage Configuration
And as expected, when we look via the ESXi host at how much
space is consumed on the VMFS volume on which the VMDK
resides, we can see that there is indeed 10GB of space consumed
by the flat files for these disks. This is before we do anything in the
Guest OS.
OK. So now I will add 2 additional 15GB thin disks to the VM.
And now when I examine the actual space consumed on the VMFS,
I see that these flat files are not consuming any space yet.
After the disks have been initialized (not formatted), I can now
see that they have begun to consume some space on the VMFS
volume. Since VMFS-5 blocks are allocated in 1MB chunks, one
block is needed for initialization.
Now we are ready to actually format the drive. In this first test, I
am going to use the Quick Format option to initialize the volume.
A quick format is the default, and it is selected automatically. This
is basically clearing the table of contents on the drive, and not
touching any of the data blocks.
OK. It would appear that only 91MB of the drive that was formatted using the quick format option has been consumed by that operation. So our thin provisioned datastore is still providing some value here. Let's now proceed with formatting my other 15GB volume using a full format option (i.e. uncheck the quick format option).
The first thing you will notice is that the formatting takes a lot
longer, and is a gradual process. The difference here is that the
data blocks are also being zeroed out.
When this format is complete, I check the amount of space consumed on the VMFS volume once again. Now you can see that the
whole of my thin provisioned VMDK is consuming its full amount of
allocated space, negating its thin provisioning properties.
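If you want to reproduce this check from the ESXi shell, a minimal sketch follows (the datastore and VM paths are assumptions). ls reports the provisioned size of the flat file, while du reports the blocks actually allocated on the VMFS volume.

# Provisioned size of the thin disk, as presented to the guest
ls -lh /vmfs/volumes/datastore1/testvm/testvm_1-flat.vmdk
# Space actually consumed on the VMFS volume
du -h /vmfs/volumes/datastore1/testvm/testvm_1-flat.vmdk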
Assumptions
1. vSphere 5.0 or greater (to enable use of Datastore Heartbeating)
2. Converged Network used for IP storage is highly available.
Motivation
1. Minimize the chance of a false positive isolation response
2. Ensure that, in the event the storage is unavailable, virtual machines are promptly shut down to allow HA to restart VMs on hosts which may not be impacted by IP storage connectivity issues.
3. Minimize impact on the applications/data and downtime.
Architectural Decision
Configure the following:
das.usedefaultisolationaddress to FALSE
das.isolationaddress1: iSCSI/NFS target 1, e.g. 192.168.1.10
das.isolationaddress2: iSCSI/NFS target 2, e.g. 192.168.2.10
Utilize Datastore Heartbeating with multiple datastores (manually selected or automatic).
Configure Host Isolation Response to: Power off.
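A minimal PowerCLI sketch of setting these advanced options is shown below; the cluster name is an assumption and the addresses are the example values above. Datastore heartbeating and the isolation response itself are configured in the cluster's HA settings.

$cluster = Get-Cluster -Name "NFS-Cluster"
New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name "das.usedefaultisolationaddress" -Value "false" -Confirm:$false
New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name "das.isolationaddress1" -Value "192.168.1.10" -Confirm:$false
New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name "das.isolationaddress2" -Value "192.168.2.10" -Confirm:$false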
Justification
1. In the event the iSCSI or NFS targets cannot be reached,
all datastores will be inaccessible which prevents VMs from
functioning normally.
2. If the storage is inaccessible, VMs cannot be shut down gracefully; therefore selecting Power off as the isolation response prevents the VM from being delayed by the 300-second shutdown time-out before it is powered off, thus improving the achievable recovery time.
3. In the event the isolation response is triggered and the
isolation does not impact all hosts within the cluster, the VM
will be restarted by HA onto a surviving host without delay.
Implications
1. In the unlikely event the storage connectivity outage is longer than 30 seconds for vSphere 5.0 environments (or 60 seconds for vSphere 5.1 onward) but LESS than the I/O time-out within the guest (default 60 seconds for Windows), the VM will be powered off (ungracefully shut down) unnecessarily, as it could have continued to run without suffering I/O time-outs, and the storage would have been restored before the guest OS time-out was reached.
Alternatives
1. Set Host isolation response to Leave Powered On
2. Do not use Datastore heartbeating
3. Use the default isolation address
Part of the problem that happened was the size of the RDM changed
(increased size) but the snapshot pointed to the wrong smaller size.
However, even without any changes to the storage, a corrupted
snapshot chain can happen during an out-of-space situation.
I have intentionally introduced a drive geometry mismatch in my test VM below; note that the value after RW in the snapshot TEST-RDM_1-00003.vmdk is 1 less than the value in the base disk TEST-RDM_1.vmdk.
The fix was to follow the entire chain of snapshots and ensure
everything was consistent. Start with the most current snap in the
chain. The parentCID value must be equal to the CID value in
the next snapshot in the chain. The next snapshot in the chain is
listed in the parentFileNameHint. So TEST-RDM_1-00003.vmdk
is looking for a ParentCID value of 72861eac, and it expects to see
that in the file TEST-RDM_1.vmdk.
If you open up TEST-RDM_1.vmdk, you see a CID value of 72861eac; this is correct. You also see an RW value of 23068672. Since this file is the base RDM, this is the correct value. The value in the snapshot is incorrect, so you have to go back and change it to match. All snapshots in the chain must match in the same way.
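To make the relationship concrete, here is a simplified sketch of the two descriptor files for this example. Only the relevant lines are shown; the snapshot's own CID and the extent types are illustrative.

# TEST-RDM_1-00003.vmdk (snapshot descriptor)
CID=11111111                        # this snapshot's own CID (illustrative)
parentCID=72861eac                  # must equal the CID inside TEST-RDM_1.vmdk
parentFileNameHint="TEST-RDM_1.vmdk"
RW 23068672 VMFSSPARSE "TEST-RDM_1-00003-delta.vmdk"   # the RW value must match the base disk

# TEST-RDM_1.vmdk (base RDM descriptor)
CID=72861eac
parentCID=ffffffff                  # a base disk has no parent
RW 23068672 VMFSRDM "TEST-RDM_1-rdm.vmdk"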
Chapter 4
Network and Security
Design
Server Virtualization
Applications are composed of both Compute and Network resources. It doesn't make sense to have one without the other; it is a symbiotic relationship. And for the last decade, one half of that relationship (Compute) has been light years ahead of the other (Network). Compute and Network form a symbiotic relationship lacking any symmetry.
For example, it's possible to deploy the Compute of an application (virtual servers) within seconds, through powerful automation enabled by software on general purpose hardware: Server Virtualization. The virtual network, on the other hand, is still largely provisioned by hand.
What is Virtualization?
Virtualization is the basic act of decoupling an infrastructure service from the physical assets on which that service operates. The service we want to consume (such as Compute, or Network) is not described on, identified by, or strictly associated with any physical asset. Instead, the service is described in a data structure and handled entirely in software.
Packet forwarding is not the point of friction in provisioning applications. Current generation physical switches do this quite well with dense line-rate 10/40/100G silicon and standard IP protocols (OSPF, BGP). Packet forwarding is not the problem. The problem addressed by network virtualization is the manual deployment of network policy, features, and services constructing the network architecture viewed by the application's compute resources (virtual machines).
Network Virtualization
Network Virtualization reproduces the L2-L7 network services necessary to deploy the application's virtual network at the same software virtualization layer hosting the application's virtual machines: the hypervisor kernel and its programmable virtual switch.
Similar to how server virtualization reproduces vCPU, vRAM, and vNIC, Network Virtualization software reproduces logical switches, logical routers (L2-L3), logical load balancers, logical firewalls (L4-L7), and more, assembled in any arbitrary topology, thereby presenting the virtual compute with a complete L2-L7 virtual network topology.
All of the feature configuration necessary to provision the
application's virtual network can now be provisioned at the
software virtual switch layer through APIs. No CLI configuration
per application is necessary in the physical network. The physical
network provides the common packet forwarding substrate. The
programmable software virtual switch layer provides the complete
virtual network feature set for each application, with isolation and
multi-tenancy.
Let's assume you lose a single uplink from I/O module A. This situation is depicted below.
Here, the pool deployment with NIOC enabled is on the left, and the same pool deployed without NIOC is on the right. As you can see, the traffic on the right looks chaotic, and that's what degraded connectivity to the extent that the deployment failed soon after (highlighted by the third arrow). Traffic on the left with NIOC enabled has consistent bandwidth for each type, resulting in solid lines, and allows the cloning process to complete successfully.
Another screenshot shows the effect when NIOC is enabled part
way through the cloning operation:
Here again, you can see how enabling NIOC has a positive effect on traffic flow. Initially, NIOC is not enabled, so the traffic is irregular and chaotic, but enabling it part way through causes all traffic types to start taking their fair share of the interfaces, which is consistent and evident from the nice solid lines.
Moral of the story: Enable NIOC on all your distributed switches,
wherever possible.
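A minimal PowerCLI sketch of turning NIOC on for a distributed switch is shown below. The switch name is an assumption, and the call goes through the underlying vSphere API object since older PowerCLI releases have no dedicated cmdlet for this.

# Enable Network I/O Control on a vSphere Distributed Switch
$vds = Get-VDSwitch -Name "dvSwitch01"
$vds.ExtensionData.EnableNetworkResourceManagement($true)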
Host hardware
Network switches
Routers
Storage
Racks
Power distribution
Data center

Host hardware: CPU, RAM, network interfaces, storage adapters, power supplies
Network switches: ASIC groups, host connections, uplinks, power supplies
Storage system: host connections, power supplies
Rack infrastructure
Cooling
As you can see there are many components, and when designing for failure you can take the endeavor as far as you want to go. I will attempt to cover some of the redundancy scenarios in the sections below.
Just like anything else, an ASIC can fail. While I have not had it happen often, it has happened a handful of times over the years. When an ASIC fails, all the ports it backs go with it. Imagine having 6 virtualization hosts each plugged into 6 adjacent switch ports and an ASIC failure causing a loss of connectivity across all 6 hosts at once.
Physical Resilience
Another item to consider is the physical location of your devices, whether they be network switches or the hosts themselves. Consolidating hardware into a single rack will save space but comes with potential risk. For example, having all of the virtualization hosts in one rack might also mean that two PDUs are shared by all hosts. Spreading them out across multiple racks, with multiple PDUs and multiple upstream PDUs in the data center, can add resilience. Similarly, spreading the switches side by side across two racks might be better than having them in the same rack, for the same reason.
Wrapping up
Putting all the pieces together, the following is what a redundant networking design for a virtual infrastructure might look like. Connectivity and configuration options vary, so this is one example. Capabilities such as switch stacking and loop prevention technologies can influence design choices. Detail such as spreading connections across ASICs is not shown but is implied.
Hardware
Server
The virtual switches can have the following settings configured for
redundancy:
Beacon probing enabled to detect upstream switch or link
failures.
Link-state tracking configured on the upstream switches where supported.
Multiple vSwitch uplinks using one port from each adapter.
Multiple uplinks from different physical switches for added redundancy.
Other notes
The core switches might be connected via vPC, or some other
virtual switching or stacking capability, so that they present to the
downstream switches as a single switch to prevent looping and
spanning tree shutting down a path. Each switch closest to the server
is uplinked to two separate core switches using multiple links such
that one or more connections go to one core switch while the other
connections go to the other switch. This ensures connectivity is
retained to the downstream switches and to the virtualization hosts
in the event of a core switch failure (or one or more links). If you use
multiple VDCs (Virtual Device Contexts) then be sure to do this for
each VDC.
Summary
As you can see, you can take designing redundancy as far as you want to. Budget certainly has an impact when it comes to being able to buy one switch instead of two, for example, but there are many
other factors to think about such as mixing and matching makes
and models (for driver/firmware redundancy for example) and also
simpler things such as how uplinks and downlinks are configured.
Take some time to really think about any design end to end and
plan for failure accordingly. It is much easier to do so when you are
putting together requirements versus after you have already begun
the purchasing and implementation process - or even after the fact.
Twitter: @virtualbacon
Chapter 5
VM Configuration
As you can clearly see from the above graph, PVSCSI shows the best relative performance, with the highest IOPS and lowest latency. It also had the lowest CPU usage. During the 32 OIO test, SATA showed 52% CPU utilization vs 45% for LSI Logic SAS and 33% for PVSCSI. For the 64 OIO test, CPU utilization stayed relatively the same. If you are planning on using Windows Failover Clustering you are not able to use PVSCSI, as LSI Logic SAS is the only adapter supported. Hopefully VMware will allow PVSCSI to be used in cluster configurations in the future.
Final Word
Where possible, I recommend using PVSCSI. Before choosing PVSCSI, please make sure you are on the latest patches. There have been problems with some of the driver versions in the past, prior to vSphere 5.0 Update 2. VMware KB 2004578 has the details.
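A minimal PowerCLI sketch of switching an existing VM's controller to PVSCSI follows; the VM name is an assumption. Do this with the VM powered off, and make sure the guest already has the pvscsi driver installed, otherwise it may not boot.

# Change the VM's SCSI controller type to paravirtual
Get-VM -Name "sql01" | Get-ScsiController | Set-ScsiController -Type ParaVirtual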
Design Decision
Do not use VMware Tools time synchronization; instead, use in-guest time synchronization mechanisms.
Justification
Using in-guest timekeeping mechanisms is especially significant
for Windows virtual machines which are members of an Active
Directory domain because the authentication protocol used by
Active Directory, Kerberos, is time sensitive for security reasons.
The Windows Domain NTP server should be configured to get its
time from an external time source server.
Guests in general should be configured to get their time from
AD domain controllers. If not possible then the guests should be
configured to use an external NTP source. If this is not practical from a security perspective (e.g. you cannot open firewall ports to an external source), then synchronization with the host can be an alternative.
Another supporting reason for avoiding VMware Tools synchronization is the possible problem caused by excessive CPU overcommitment, which can lead to timekeeping drift at rates the guests cannot correct.
Implications
All templates will need to be preconfigured to use an NTP source
within the guest OS and the existing VMs will need to be updated
to use the same NTP source.
Important Notes
Pay special attention to Domain Controllers and other time
sensitive applications, where it is advised to disable time synchronization completely, by adding these lines to the .vmx file of the
particular VMs:
tools.syncTime = FALSE
time.synchronize.continue = FALSE
time.synchronize.restore = FALSE
time.synchronize.resume.disk = FALSE
time.synchronize.shrink = FALSE
time.synchronize.tools.startup = FALSE
Chapter 6
Application Landscape
How you will perform this transition will be in large measure down
to the platforms you have in place for your host acceleration and
backup infrastructure. Often, this will involve some sort of scripting
where you will transition the VM pre-backup to write-through, and
then return the VM to write-back upon completion of the necessary
backup tasks. The period of time the VM is not accelerating writes
will be determined by how long the backup solution needs the VM
in the write-through state to capture a consistent state of the VM
for recovery.
Chapter 7
Words of Wisdom
since there is still added pressure from new user and virtualization
growth. The only way to turn the tide on the storage pressure is to
instrument a negative reinforcing loop. An I/O Offload solution
can help by decreasing the demand on the storage system and
thus provide better consolidation back onto the ESXi Hosts.
What this illustrates is how a Systems Thinking approach can help overcome some of the complexity in a design decision. This is only a small subset of the possibilities, so my intention is to provide more examples of this on my blog. If you want to learn more about Systems Thinking, check out this short overview for the larger context:
http://www.thinking.net/Systems_Thinking/OverviewSTarticle.pdf
IT Infrastructure Design
Introduction
"All things are created twice" is one of the principles that immediately comes to mind when thinking about designing IT infrastructures. It's the principle by Stephen Covey that says that we first create things in our mind, before we even produce anything in the physical world.
Think about that for a second. We think first before we do something. It's a process we go through unconsciously all day long. We do it every moment of the day, over and over again. The same goes for designing new IT infrastructures. First think about it, write that down into design documents and then build the IT infrastructure as defined in your design documents.
Compare it to building a house. Nobody goes out, buys bricks and mortar and then starts building something without a design and a plan. The same goes for building a new IT infrastructure, or whatever it is that needs to be thought out before it is created. You don't go out and randomly install software hoping it will result in the optimal IT infrastructure that suits your needs, or better yet, the needs of your customer / company. Because most of the time you don't design according to what you think is best; you design the infrastructure to suit the needs and requirements of somebody else.
Just like with building a house, you are the architect, trying to figure out the needs and requirements of your customer. How big must it be? How many people are going to live in it? How should the plumbing / electricity be installed? And last but not least, what is the total amount of money that can be spent?
IT Design Methodology
To be able to create a good design you need to have all the stakeholders involved. Stakeholders are everybody who either works with or depends on the IT infrastructure and is impacted by the decisions made during the IT design process.
Typical roles that we need to look for as stakeholders can be
found in the following IT infrastructure areas:
IT Architecture: infrastructure / application / enterprise
Security
Application development / management
Infrastructure Operations
Project: business / IT
The number of stakeholders will depend on the company size and the number of people that actually need to be involved. Sometimes not all topics are relevant as input for the IT design process, and sometimes people fulfill multiple roles. The stakeholder group needs to be a good representation in order to get all the requirements for the new infrastructure. Too few is not going to be representative enough.
Functional design
The functional design describes what the new IT infrastructure must look like. The input for the functional design comes from the various input methods that an architect can use in order to get the information from the stakeholders.
Kick-off meeting: Every design process should start with a kick-off meeting. This meeting involves all the pre-defined IT infrastructure stakeholders. During the meeting the purpose and goals of the design creation will be explained to everyone involved. This will make sure that everybody knows what is going to happen and will help with buy-in into the design process.
Interviews with stakeholders: All stakeholders should be interviewed to get their input into the design process. Stakeholders will provide the input to the architect, who will document it in the functional design.
Workshops: Workshops can also be beneficial in gathering the input for the functional design. In the workshops the architect and stakeholders will make decisions about design topics, which will be documented in the functional design.
Current state analysis: The analysis of the current state can give a good overview of what the current IT infrastructure looks like. This can then be translated into requirements that go into the functional design, or can raise design topics that can be addressed in workshops and/or interviews.
Customer documentation review: Documentation about the current IT infrastructure contains information that can be valuable for the new IT infrastructure design. Review of the current documentation is therefore needed to gather this information.
Operational Readiness Assessment: An assessment of the current operation can deliver valuable insight into the IT organization and generate requirements and/or constraints for the new IT infrastructure design. Technically everything is possible, but people and process also need to be taken into account when designing the new IT infrastructure.
Requirements
Requirements are the definition of what the new IT infrastructure
should look like. Each requirement is part of the definition of the
new IT infrastructure. Requirements are gathered from the stakeholders through the various input methods. The requirements can
be categorized for IT infrastructure in the following design areas:
Manageability
Availability
Performance
Recoverability
Security
Examples of requirements are:
The availability level defined for the IT infrastructure is 99.99%.
Role Based Access Control (RBAC) needs to be implemented
in the IT infrastructure.
Constraints/Givens
Some facts are already defined upfront, or design decisions have already been made before the design process took place. These are things that cannot be changed and are no longer under the influence of the IT infrastructure design team. These facts are defined as constraints and/or givens in the functional design.
Examples of constraints / givens are:
Customer has bought Compute rack servers of type XLarge 1024. These servers need to be used in the IT infrastructure.
The IT infrastructure needs to be located in the customer's two datacenters in A and B.
Assumptions
One needs to make an assumption if input for the technical design is needed but isn't available at the time of defining the functional design. The assumption will be defined by the stakeholders and the architect as being the most probable fact. All assumptions are documented in the functional design.
Examples of assumptions are:
The available hardware provides enough resources to host 1000 virtual machines.
Users, specifically those with administrator privileges, have sufficient knowledge of IT infrastructure products to be able to manage the IT infrastructure accordingly.
Out-of-Scope
Things that are not part of the new IT infrastructure design will be defined as out-of-scope. Not everything needs to be listed of course, but if the assumption could arise that something is part of the design when it isn't, then list it under out-of-scope to be sure. This clearly defines the scope of the design process.
Examples of out-of-scope definitions are:
The re-design of the VDI environment is out-of-scope for this design process.
Disaster recovery of the IT infrastructure will not be defined during the design process.
Risks
Risks are facts that are not under the influence of the IT infrastructure design team, but do impact the IT infrastructure design. Every identified risk needs to be written down and documented in the functional and/or technical design. Preferably the risk is documented together with a definition of how to mitigate it. This defines how the risk will be controlled and how the impact on the IT infrastructure design process will be minimized.
Examples of risks are:
The server hardware needed for the build of the IT infrastructure will be delivered two weeks before the build takes place.
The stakeholders do not have enough time to participate in the design process. This could potentially lead to an incomplete functional design or an extension of the IT infrastructure design project timelines.
All these aspects are defined in the functional design. They define how the stakeholders want the new IT infrastructure to look. With a complete and signed-off functional design, a start can be made on the technical design.
Technical Design
The technical design is the technical translation of the functional requirements. It defines how the IT infrastructure should be built with the available software and hardware. This has to be done while taking the requirements and constraints into account.
High-Level Design
It is useful to make a translation between the functional design and the technical design. The high-level design is that translation and provides a conceptual design of the IT infrastructure. This conceptual design will be defined further in depth in the rest of the technical design.
Availability/Recoverability Design
The IT infrastructure needs to provide the availability as defined by the stakeholders. The technical design needs to provide the details of how availability is guaranteed and how it is achieved in case of IT infrastructure failures. Part of availability is also the recoverability of the IT infrastructure and how the IT infrastructure can be recovered in case of a disaster.
Security Design
Security is one of the design topics that always comes back when
designing an IT infrastructure. The technical design needs to
provide the configuration to adhere to the security requirements
that have been defined in the functional design.
These topics will all be defined in the technical design. The technical design is the technical blueprint for how the IT infrastructure will be set up and configured. The technical design results in fulfilling the requirements as defined by the stakeholders in the functional design. This has to be done while taking the constraints and assumptions into account. Only then will the IT infrastructure be built according to the wishes and needs of the stakeholders that use or depend on the IT infrastructure.
Operational Documentation
Besides the architectural design, with the functional and technical
design documentation, the design process also needs to provide
guidance on how to implement the technical design within the IT
infrastructure. Creating the following documents does this:
Installation Guides
The installation guides are derived from the technical design. They provide in-depth detail on how to install and configure the components that build up the IT infrastructure. These are step-by-step guides on how to perform the installation and configuration.
To Conclude...
Hopefully this IT design methodology has provided some insight into how to design, build and operate an IT infrastructure. Every IT infrastructure is different, as it is designed and built based on the requirements and constraints that are provided by the sponsor and stakeholders in the IT infrastructure design process. However, the IT infrastructure design methodology is the same for every design. Following the steps in the IT infrastructure design process will help ensure that the end result meets the needs of the stakeholders.