LECTURE NOTES
CS6703 – GRID AND CLOUD COMPUTING
(2013 Regulation)
Prepared by
Dr. R. Suguna, Dean - CSE
Mrs. T. Kujani, AP - CSE
OBJECTIVES:
Understand how Grid computing helps in solving large scale scientific problems.
Gain knowledge on the concept of virtualization that is fundamental to cloud computing.
Learn how to program the grid and the cloud.
Understand the security issues in the grid and the cloud environment.
UNIT I INTRODUCTION 9
Evolution of Distributed computing: Scalable computing over the Internet – Technologies for
network based systems – clusters of cooperative computers - Grid computing Infrastructures –
cloud computing - service oriented architecture – Introduction to Grid Architecture and
standards – Elements of Grid – Overview of Grid Architecture.
UNIT II GRID SERVICES 9
Introduction to Open Grid Services Architecture (OGSA) – Motivation – Functionality Requirements – Practical and Detailed view of OGSA/OGSI – Data intensive grid service models – OGSA services.
UNIT III VIRTUALIZATION 9
Cloud deployment models: public, private, hybrid, community – Categories of cloud computing:
Everything as a service: Infrastructure, platform, software - Pros and Cons of cloud computing –
Implementation levels of virtualization – virtualization structure – virtualization of CPU, Memory
and I/O devices – virtual clusters and Resource Management – Virtualization for data center
automation.
UNIT IV PROGRAMMING MODEL 9
Open source grid middleware packages – Globus Toolkit (GT4) Architecture, Configuration –
Usage of Globus – Main components and Programming model - Introduction to Hadoop
Framework - Mapreduce, Input splitting, map and reduce functions, specifying input and output
parameters, configuring and running a job – Design of Hadoop file system, HDFS concepts,
command line and java interface, dataflow of File read & File write.
UNIT V SECURITY 9
Trust models for Grid security environment – Authentication and Authorization methods – Grid
security infrastructure – Cloud Infrastructure security: network, host and application level –
aspects of data security, provider data and its security, Identity and access management
architecture, IAM practices in the cloud, SaaS, PaaS, IaaS availability in the cloud, Key privacy
issues in the cloud.
TEXT BOOK:
1. Kai Hwang, Geoffrey C. Fox and Jack J. Dongarra, "Distributed and Cloud Computing: Clusters, Grids, Clouds and the Future of Internet", First Edition, Morgan Kaufmann, an Imprint of Elsevier, 2012.
REFERENCES:
1. Jason Venner, "Pro Hadoop: Build Scalable, Distributed Applications in the Cloud", Apress, 2009.
2. Tom White, "Hadoop: The Definitive Guide", First Edition, O'Reilly, 2009.
3. Bart Jacob (Editor), "Introduction to Grid Computing", IBM Redbooks, Vervante, 2005.
4. Ian Foster and Carl Kesselman, "The Grid: Blueprint for a New Computing Infrastructure", 2nd Edition, Morgan Kaufmann.
5. Frederic Magoules and Jie Pan, "Introduction to Grid Computing", CRC Press, 2009.
6. Daniel Minoli, "A Networking Approach to Grid Computing", John Wiley, 2005.
7. Barry Wilkinson, "Grid Computing: Techniques and Applications", Chapman and Hall/CRC, Taylor and Francis Group, 2010.
UNIT I INTRODUCTION 9
Evolution of Distributed computing: Scalable computing over the Internet – Technologies for
network based systems – clusters of cooperative computers - Grid computing Infrastructures –
cloud computing - service oriented architecture – Introduction to Grid Architecture and
standards – Elements of Grid – Overview of Grid Architecture.
Computing technology has undergone a series of platform and environment changes. Instead of
using a centralized computer to solve computational problems, a parallel and distributed
computing system uses multiple computers to solve large-scale problems over the Internet.
Hence, distributed computing becomes data-intensive and network-centric.
On the HTC side, peer-to-peer (P2P) networks are formed for distributed file sharing and
content delivery applications. A P2P system is built over many client machines. Peer
machines are globally distributed in nature. P2P, cloud computing, and web service
platforms are more focused on HTC applications than on HPC applications.
We will discuss the viable approaches to build distributed operating systems for handling
massive parallelism in a distributed environment.
1.2.1 System Components and Wide-Area Networking
Component and network technologies have advanced rapidly in recent years and are crucial in building HPC or HTC systems. Processor speed is measured in MIPS (million instructions per second) and network bandwidth in Mbps or Gbps (mega- or gigabits per second).
1.2.1.1 Advances in Processors:
CPUs today assume a multicore architecture with dual, quad, six, or more processing cores. By Moore's law, processor speed doubles roughly every 18 months. This doubling effect held for about the past 30 years.
The clock rate increased from 10 MHz for Intel 286 to 4 GHz for Pentium 4 in 30 years.
However, the clock rate reached its limit on CMOS chips due to power limitations. Clock
speeds cannot continue to increase due to excessive heat generation and current leakage.
ILP (instruction-level parallelism) is widely exploited in modern processors. ILP mechanisms include multiple-issue superscalar architecture, dynamic branch prediction, and speculative execution. These ILP techniques require both hardware and compiler support. In addition, DLP (data-level parallelism) and TLP (thread-level parallelism) are also heavily exploited in today's processors.
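As a small, hedged illustration of thread-level parallelism exploited in software, the sketch below (class name, array size, and the two-way split are illustrative assumptions, not from the text) divides a summation between two Java threads so that independent cores can execute the partial sums concurrently.

// Illustrative sketch of thread-level parallelism (TLP): two threads sum
// disjoint halves of an array so that separate cores can work concurrently.
public class TlpSum {
    public static void main(String[] args) throws InterruptedException {
        int[] data = new int[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = 1;

        long[] partial = new long[2];   // one result slot per thread
        Thread t0 = new Thread(() -> partial[0] = sum(data, 0, data.length / 2));
        Thread t1 = new Thread(() -> partial[1] = sum(data, data.length / 2, data.length));

        t0.start(); t1.start();         // the two threads can run on two cores
        t0.join();  t1.join();          // wait for both partial sums

        System.out.println("Total = " + (partial[0] + partial[1]));
    }

    private static long sum(int[] a, int from, int to) {
        long s = 0;
        for (int i = from; i < to; i++) s += a[i];
        return s;
    }
}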
Many processors are now upgraded to multicore and multithreaded microarchitectures. The architecture of a typical multicore processor is shown in Fig. 1.4. Each core is essentially a processor with its own private cache (L1 cache). Multiple cores are housed in the same chip.
The "memory wall" is the growing disparity of speed between CPU and memory outside the
CPU chip. An important reason for this disparity is the limited communication bandwidth
beyond chip boundaries. From 1986 to 2000, CPU speed improved at an annual rate of 55%
while memory speed only improved at 10%.
Faster processor speed and larger memory capacity result in a wider performance gap
between processors and memory. The memory wall may become an even worse problem
limiting CPU performance.
The rapid growth of flash memory and solid-state drive (SSD) also impacts the future of
HPC and HTC systems. The power increases linearly with respect to the clock frequency
and quadratically with respect to the voltage applied on chips. We cannot increase the clock
rate indefinitely. Lowering the supply voltage is therefore very much in demand.
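To make the two quantitative claims above concrete, the short sketch below works through the arithmetic; the year span and voltage values are illustrative assumptions, and P ~ C*V^2*f is the usual dynamic-power approximation referred to in the text.

// Illustrative arithmetic for the memory wall and for dynamic power scaling.
public class ScalingDemo {
    public static void main(String[] args) {
        // Memory wall: CPU speed grows ~55%/year while memory speed grows ~10%/year.
        double years = 14;                                  // e.g., 1986 to 2000
        double gap = Math.pow(1.55 / 1.10, years);
        System.out.printf("Relative CPU/memory gap after %.0f years: ~%.0fx%n", years, gap);

        // Dynamic power: P ~ C * V^2 * f, with C and f held constant here.
        double c = 1.0, f = 1.0;                            // normalized constants
        double pNominal = c * 1.0 * 1.0 * f;                // supply voltage V = 1.0
        double pLowered = c * 0.7 * 0.7 * f;                // supply voltage V = 0.7
        System.out.printf("Lowering V from 1.0 to 0.7 cuts power to %.0f%%%n",
                          100 * pLowered / pNominal);
    }
}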
Three VM Architectures
The concept of virtual machines is illustrated in Fig.1.7
The host machine is equipped with the physical hardware shown at the bottom. For example, a desktop with x86 architecture running its installed Windows OS, as shown in Fig. 1.7(a).
The VM can be provisioned to any hardware system. The VM is built with virtual
resources managed by a guest OS to run a specific application. Between the VMs and
the host platform, we need to deploy a middleware layer called a virtual machine monitor
(VMM) .
Figure 1.7(b) shows a native VM installed with the use of a VMM called a hypervisor running in privileged mode. For example, the hardware has an x86 architecture running the Windows system. The guest OS could be a Linux system, and the hypervisor is the XEN system developed at Cambridge University. This hypervisor approach is also called a bare-metal VM, because the hypervisor handles the bare hardware (CPU, memory, and I/O) directly.
In the hosted and dual-mode VM architectures, the host OS may have to be modified to some extent. Multiple VMs can be ported to one given hardware system to support the virtualization process.
Fig 1.7 Three ways of constructing a virtual machine (VM) embedded in a physical machine.
Virtualization Operations:
The VMM provides the VM abstraction to the guest OS. With full virtualization, the VMM exports a VM abstraction identical to the physical machine, so that a standard OS such as Windows 2000 or Linux can run just as it would on the physical hardware.
Low-level VMM operations are illustrated in Fig. 1.8. First, the VMs can be multiplexed between hardware machines, as shown in Fig. 1.8(a). Second, a VM can be suspended and stored in stable storage, as shown in Fig. 1.8(b). Third, a suspended VM can be resumed or provisioned to a new hardware platform, as shown in Fig. 1.8(c). Finally, a VM can be migrated from one hardware platform to another, as shown in Fig. 1.8(d).
The VM approach significantly enhances the utilization of server resources. Multiple server functions can be consolidated on the same hardware platform to achieve higher system efficiency. According to a claim by VMware, server utilization could be increased from the current 5-15% to 60-80%.
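A back-of-the-envelope sketch of the consolidation claim: if each physical server runs one workload at roughly 10% average utilization, packing those workloads as VMs onto hosts targeted at 70% utilization sharply reduces the number of machines needed. The server count below is an assumed example; only the utilization percentages come from the text.

// Rough server-consolidation estimate based on average utilization figures.
public class ConsolidationEstimate {
    public static void main(String[] args) {
        int physicalServers = 100;          // assumed fleet size before virtualization
        double avgUtilization = 0.10;       // within the quoted 5-15% range
        double targetUtilization = 0.70;    // within the quoted 60-80% range

        double totalLoad = physicalServers * avgUtilization;
        int hostsNeeded = (int) Math.ceil(totalLoad / targetUtilization);

        System.out.println("Consolidated hosts needed: " + hostsNeeded);   // prints 15
    }
}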
Virtual Infrastructures:
Physical resources for compute, storage, and networking at the bottom are mapped to the
needy applications embedded in various VMs at the top. Hardware and software are then
separated. Virtual Infrastructure is what connects resources to distributed applications. It is a
dynamic mapping of the system resources to specific applications. The result is decreased
costs and increased efficiencies and responsiveness.
Storage and energy efficiency are more important than sheer speed performance. Data center design emphasizes the performance/price ratio over speed performance alone.
A large data center may be built with thousands of servers. Smaller data centers are typically
built with hundreds of servers. The cost to build and maintain data center servers has increased
over the years.
About 60 percent of the cost to run a data center is allocated to management and maintenance.
The server purchase cost did not increase much with time. The cost of electricity and cooling did
increase from 5 percent to 14 percent in 15 years.
Commodity switches and networks are more desirable in data centers. Similarly, commodity x86 servers are preferred over expensive mainframes. The software layer handles network traffic balancing, fault tolerance, and expandability.
Cluster Architecture
Figure 1.9 shows the architecture of a typical cluster built around a low-latency and high-
bandwidth interconnection network.
Single-System Image:
The system image of a computer is decided by the way the OS manages the shared cluster resources. Most clusters have loosely coupled node computers. All resources of a server node are managed by its own OS. Thus, most clusters have multiple system images coexisting simultaneously.
An ideal cluster should merge multiple system images into a single-system image (SSI) at various operational levels. We need an idealized cluster operating system or some middleware to support SSI at various levels, including the sharing of all CPUs, memories, and I/O across all computer nodes attached to the cluster.
Figure 1.10 shows the hardware and software architecture of a typical cluster system. Each node computer has its own operating system. On top of all the operating systems, we deploy two layers of middleware at the user space to support high availability and some SSI features for shared resources or fast MPI communications.
Figure 1.10 The architecture of a working cluster with full hardware, software, and middleware
support for availability and single system image.
For example, since memory modules are distributed at different server nodes, they are
managed independently over disjoint address spaces. This implies that the cluster has multiple
images at the memory-reference level.
On the other hand, we may want all distributed memories to be shared by all servers by forming
a distributed shared memory (DSM) with a single address space. A DSM cluster thus has a
single-system image (SSI) at the memory-sharing level. Cluster explores data parallelism at the
job level with high system availability.
Unfortunately, a cluster-wide OS for complete resource sharing is not available yet. Middleware
or OS extensions were developed at the user space to achieve SSI at selected functional levels.
Without the middleware, the cluster nodes cannot work together effectively to achieve
cooperative computing.
The software environments and applications must rely on the middleware to achieve high
performance. The cluster benefits come from scalable performance, efficient message-passing,
high system availability, seamless fault tolerance, and cluster-wide job management as
summarized in Table 1.7.
Over the past 30 years there has been a natural growth path from Internet to web and grid computing services. An Internet service such as the Telnet command enables a connection from one computer to a remote computer. A web service such as the HTTP protocol enables remote access to remote web pages.
Grid computing is the collection of computer resources from multiple locations to reach a
common goal. Grid computing allows close interactions among applications running on
distant computers, simultaneously.
A computing grid offers an infrastructure that couples computers, software/middleware,
special instruments, and people and sensors together. The grid is often constructed across
LAN, WAN, or Internet backbone networks at a regional, national, or global scale.
Enterprises or organizations present grids as integrated computing resources. The
computers used in a grid are primarily workstations, servers, clusters, and supercomputers.
Personal computers, laptops, and PDAs can be used as access devices to a grid system.
Figure 1.11 shows an example computational grid built over multiple resource sites owned by different organizations. Large computational grids like the NSF TeraGrid, EGEE, and ChinaGrid have built national infrastructures to perform distributed scientific grid applications.
Figure 1.11 Computational grid or data grid providing computing utility, data, and information
services through resource sharing and cooperation among participating organizations
Grid Families:
National grid projects are followed by industrial grid platform development by IBM, Microsoft,
Sun, HP, Dell, Cisco, EMC, Platform Computing, etc
New grid service providers (GSPs) and new grid applications have emerged rapidly, similar to the growth of Internet and web services in the past two decades.
Grid systems are generally classified into two families: computational or data grids and P2P grids.
A data grid is an architecture or set of services that gives individuals or groups of users the
ability to access, modify and transfer extremely large amounts of geographically
distributed data for research purposes.
A P2P grid consists of peer groups that are managed locally and arranged into a global system supported by servers. The grid controls the central servers, while services at the edge are grouped into "middleware peer groups." P2P technologies form part of the middleware services.
Cloud
A cloud is a pool of virtualized computer resources. A cloud can host a variety of different
workloads.
A cloud infrastructure provides a framework to manage scalable, reliable, on-demand
access to applications
A cloud is the "invisible" backend to many applications
A model of computation and data storage based on "pay as you go" access to "unlimited" remote data center capabilities
A cloud supports redundant, self-recovering, highly scalable programming models that allow
workloads to recover from hardware/software failures. They monitor resource use in real
time to enable rebalancing of allocations when needed.
Internet Clouds:
Cloud Computing
Ian Foster defined cloud computing as follows: "A large-scale distributed computing paradigm that is driven by economies of scale, in which a pool of abstracted, virtualized, dynamically scalable, managed computing power, storage, platforms, and services are delivered on demand to external customers over the Internet."
(1) Cloud platform offers a scalable computing paradigm built around the datacenters.
(2) Cloud resources are dynamically provisioned by datacenters upon user demand.
(3) Cloud system provides computing power, storage space, and flexible platforms for upgraded
web-scale application services.
(4) Cloud computing relies heavily on the virtualization of all sorts of resources.
(5) Cloud computing defines a new paradigm for collective computing, data consumption and
delivery of information services over the Internet.
(6) Clouds stress the cost of ownership reduction in mega datacenters.
Cloud providers install and operate application software in the cloud and cloud users
access the software from cloud clients.
The SaaS model applies to business processes, industry applications, CRM (customer relationship management), ERP (enterprise resource planning), HR (human resources), and collaborative applications.
The pricing model for SaaS applications is typically a monthly or yearly flat fee per user,
so price is scalable and adjustable if users are added or removed at any point.
Examples of SaaS include: Google Apps, innkeypos, Quickbooks Online, Limelight
Video Platform, Salesforce.com, and Microsoft Office 365.
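The per-user flat-fee pricing described above can be sketched in a few lines; the fee and user counts below are hypothetical and are not taken from any provider's price list.

// Hypothetical per-user, per-month SaaS pricing: cost scales with the user count.
public class SaasPricing {
    static double monthlyCost(int users, double feePerUserPerMonth) {
        return users * feePerUserPerMonth;
    }

    public static void main(String[] args) {
        double fee = 8.0;                                           // assumed flat fee per user
        System.out.println("50 users: $" + monthlyCost(50, fee));   // 400.0
        System.out.println("65 users: $" + monthlyCost(65, fee));   // users added -> 520.0
        System.out.println("40 users: $" + monthlyCost(40, fee));   // users removed -> 320.0
    }
}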
Cloud Deployment Models
Cloud deployment models describe the category of cloud environment and are mainly distinguished by ownership, size, and access. The model indicates the purpose and nature of the cloud.
Internet clouds offer four deployment modes: private, public, managed, and hybrid. The different service-level agreements and service deployment modalities imply that security is a shared responsibility of the cloud providers, the cloud resource consumers, and the third-party cloud-enabled software providers.
Public Cloud: a type of cloud hosting in which the cloud services are delivered over a network that is open for public use. The service provider renders services and infrastructure to various clients. Customers have no visibility into, or control over, the location of the infrastructure.
Private Cloud: also known as an internal cloud; the cloud computing platform is implemented in a secure environment safeguarded by a firewall under the governance of the organization's IT department. Because it permits only authorized users, a private cloud gives the organization greater and more direct control over its data.
Managed cloud hosting is a process in which organizations share and access resources,
including databases, hardware and software tools, across a remote network via multiple servers
in another location.
Outsourcing local workload and/or resources to the cloud has become an appealing alternative in terms of operational efficiency and cost effectiveness.
From the consumer's perspective, this pricing model for computing has relieved many issues in
IT practices, such as the burden of new equipment purchases and the ever-increasing costs in
operation of computing facilities (e.g., salary for technical supporting personnel and electricity
bills).
From the provider's perspective, charges imposed for processing consumers' service requests—often exploiting underutilized resources—are an additional source of revenue.
Listed below are some of the eight motivations for adopting the cloud to upgrade Internet applications and web services in general.
(1) Desired location in areas with protected space and better energy efficiency.
(2) Sharing of peak-load capacity among a large pool of users, improving the overall utilization
(4) Significant reduction in cloud computing cost, compared with traditional computing
paradigms.
These architectures build on the traditional seven Open Systems Interconnection (OSI) layers
that provide the base networking abstractions.
On top of this we have a base software environment, which would be .NET or Apache Axis for
web services, the Java Virtual Machine for Java, and a broker network for CORBA.
On top of this base environment one would build a higher level environment reflecting the
special features of the distributed computing environment. This starts with entity interfaces and
inter-entity communication.
The entity interfaces correspond to the Web Services Description Language (WSDL), Java
method, and CORBA interface definition language (IDL) specifications in the distributed
systems. These interfaces are linked with customized, high-level communication systems:
SOAP, RMI, and IIOP in the three examples.
These communication systems support features including particular message patterns (such as
Remote Procedure Call or RPC), fault recovery, and specialized routing.
In the case of fault tolerance, the features in the Web Services Reliable Messaging (WSRM)
framework mimic the OSI layer capability (as in TCP fault tolerance) modified to match the
different abstractions (such as messages versus packets, virtualized addressing) at the entity
levels.
Security is a critical capability that either uses or reimplements the capabilities seen in concepts
such as Internet Protocol Security (IPsec) and secure sockets in the OSI layers.
Jini and JNDI (Java Naming and Directory Interface) illustrate different approaches used within the Java distributed object model. The CORBA Trading Service, UDDI (Universal Description, Discovery, and Integration), LDAP (Lightweight Directory Access Protocol), and ebXML (Electronic Business using eXtensible Markup Language) are other examples of discovery and information services.
Management services include service state and lifetime support; examples include the CORBA Life Cycle and Persistent states, the different Enterprise JavaBeans models, Jini's lifetime model, and a suite of web services specifications.
Loose coupling and support of heterogeneous implementations make services more attractive
than distributed objects. There are two choices of service architecture: web services or REST
systems. Both web services and REST systems have very distinct approaches to building
reliable interoperable systems.
In web services, one aims to fully specify all aspects of the service and its environment. This
specification is carried with communicated messages using Simple Object Access Protocol
(SOAP). The hosting environment then becomes a universal distributed operating system with
fully distributed capability carried by SOAP messages.
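As a hedged illustration of carrying a fully specified request inside a SOAP envelope, the sketch below uses the standard Java SAAJ API (javax.xml.soap; on recent JDKs this requires the standalone SAAJ/Jakarta dependency) to build and print a minimal message. The namespace and operation name are hypothetical.

import javax.xml.namespace.QName;
import javax.xml.soap.MessageFactory;
import javax.xml.soap.SOAPBody;
import javax.xml.soap.SOAPMessage;

// Builds a minimal SOAP request message with the SAAJ API.
public class SoapSketch {
    public static void main(String[] args) throws Exception {
        SOAPMessage message = MessageFactory.newInstance().createMessage();
        SOAPBody body = message.getSOAPBody();

        // Hypothetical operation in a hypothetical namespace.
        QName getQuote = new QName("http://example.org/stock", "GetStockQuote", "m");
        body.addChildElement(getQuote).addChildElement("symbol").addTextNode("IBM");

        message.saveChanges();
        message.writeTo(System.out);   // prints the SOAP envelope as XML
    }
}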
In the REST approach, one adopts simplicity as the universal principle and delegates most of the hard problems to application (implementation-specific) software. In web services terms, REST keeps minimal information in the header, and the message body (which is opaque to generic message processing) carries all the needed information. REST architectures are clearly more appropriate for rapidly changing technology environments. REST can use XML schemas, but not those that are part of SOAP; "XML over HTTP" is a popular design choice.
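By contrast, a REST-style interaction needs little more than a plain HTTP request. The hedged sketch below issues a GET with java.net.HttpURLConnection against a hypothetical resource URL and asks for XML, in the "XML over HTTP" spirit described above.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal REST-style request: the method, URL, and headers carry the needed information.
public class RestGet {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.org/stocks/IBM");    // hypothetical resource
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/xml");  // "XML over HTTP"

        System.out.println("HTTP status: " + conn.getResponseCode());
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);                      // body is opaque to generic processing
            }
        }
        conn.disconnect();
    }
}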
Above the communication and management layers, we have the capability to compose new
entities or distributed programs by integrating several entities together.
In CORBA and Java, the distributed entities are linked with remote procedure calls and the
simplest way to build composite applications is to view the entities as objects and use the
traditional ways of linking them together. For Java, this could be as simple as writing a Java
program with method calls replaced by RMI (Remote Method Invocation) while CORBA
supports a similar model with a syntax reflecting the C++ style of its entity (object) interfaces.
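For the Java case just described, the hedged sketch below shows what "method calls replaced by RMI" looks like: a remote interface plus a client-side lookup. The service name, host, and method are hypothetical, and the server side (exporting and binding the remote object) is omitted for brevity.

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;

// A remote entity interface: the Java analogue of a WSDL or IDL service description.
interface QuoteService extends Remote {
    double getQuote(String symbol) throws RemoteException;
}

// Client side: an ordinary method call becomes a remote invocation through a stub.
public class RmiClient {
    public static void main(String[] args) throws Exception {
        Registry registry = LocateRegistry.getRegistry("gridhost.example.org", 1099);
        QuoteService service = (QuoteService) registry.lookup("QuoteService");
        System.out.println("IBM quote: " + service.getQuote("IBM"));
    }
}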
As shown in Figure 1.14, service-oriented architecture (SOA) has evolved over the years. SOA
applies to building grids, clouds, grids of clouds, clouds of grids, clouds of clouds (also known
as interclouds), and systems of systems in general.
A large number of sensors provide data-collection services, denoted in the figure as SS (sensor
service). A sensor can be a ZigBee device, a Bluetooth device, a WiFi access point, a personal
computer, a GPS device, or a wireless phone, among other things. Raw data is collected by sensor
services. All the SS devices interact with large or small computers, many forms of grids,
databases, the compute cloud, the storage cloud, the filter cloud, the discovery cloud, and so
on.
Filter services (fs in the figure) are used to eliminate unwanted raw data, in order to respond to
specific requests from the web, the grid, or web services. A collection of filter services forms a
filter cloud.
In fact, wisdom or intelligence is sorted out of large knowledge bases. Finally, we make
intelligent decisions based on both biological and machine wisdom.
Most distributed systems require a web interface or portal. For raw data collected by a large
number of sensors to be transformed into useful information or knowledge, the data stream may
go through a sequence of compute, storage, filter, and discovery clouds.
Finally, the inter-service messages converge at the portal, which is accessed by all users. Two example portals, OGFCE and HUBzero, are built using both web service (portal) and Web 2.0 (gadget) technologies. Many distributed programming models are also built on top of these basic constructs.
Cloud computing refers to a client-server architecture where the servers (called "the cloud") typically reside remotely and are accessed via the Internet, usually through a web browser.
Grid Architecture:
The following diagram depicts the generic grid architecture showing the functionality of
each layer:
Resource layer: is made up of actual resources that are part of the grid, such as computers,
storage systems, electronic data catalogues, and even sensors such as telescopes or other
instruments, which can be connected directly to the network.
Middleware layer: provides the tools that enable the various elements, such as servers, storage, and networks, to participate in a unified grid environment.
Application layer: includes the different user applications (science, engineering, business, financial), portals, and development toolkits supporting the applications.
The application layer is where the users describe the applications to be submitted to the grid. The resource layer is a widely distributed infrastructure, composed of different resources linked via the Internet. The main purpose of the resources is to host data and execute jobs. The middleware is in charge of allocating resources to jobs, and of other management issues.
The application layer sends job descriptions to the middleware, together with the locations of the required input data. It then waits for a message saying whether the job was finished or canceled. When it receives a job description, the middleware tries to find a resource to execute the job. If a suitable resource is found, it is first claimed and then the job is sent to it.
The middleware monitors the status of the job and reacts to state changes. The resource
layer sends to the middleware the acknowledgments, and the information on the current state of
resources, new data elements, and finished jobs. When instructed by the application layer, the
middleware removes the data that is no longer needed from the resources.
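The application-layer/middleware exchange just described can be sketched as a simple control loop. Everything below (the JobDescription, Resource, and middleware types) is a hypothetical illustration of the flow, not the API of any real grid middleware.

import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the application layer -> middleware -> resource layer flow.
public class MiddlewareSketch {

    record JobDescription(String executable, List<String> inputDataLocations) {}

    interface Resource {
        boolean isSuitableFor(JobDescription job);
        void execute(JobDescription job);           // the resource layer hosts data and runs jobs
    }

    private final List<Resource> resources;

    MiddlewareSketch(List<Resource> resources) { this.resources = resources; }

    // Receives a job description and tries to claim a suitable resource for it.
    boolean submit(JobDescription job) {
        Optional<Resource> match = resources.stream()
                .filter(r -> r.isSuitableFor(job))
                .findFirst();                       // claim the first suitable resource
        match.ifPresent(r -> r.execute(job));       // then send the job to it
        return match.isPresent();                   // the application layer waits on this outcome
    }
}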
The Global Grid Forum is a community-initiated forum of researchers and practitioners working
on grid computing, and a number of working groups are producing technical specs,
documenting user experiences, and writing implementation guidelines.
The need for open standards that define the interactions and foster interoperability between components supplied from different sources has been the motivation for the Open Grid Services Architecture/Open Grid Services Infrastructure (OGSA/OGSI) milestone documentation published by the Forum.
The following describes the Grid Standards
a) OGSA (Open Grid Service Architecture)
OGSA is a service-oriented architecture (SOA). The aim of OGSA is to standardize grid
computing and to define a basic framework of a grid application structure.
OGSA main goals are:
- Resources must be handled in distributed and heterogeneous environments
- Support of QoS-oriented (Quality of Service) Service Level Agreements
- Partly autonomic management
- Definition of open, published interfaces and protocols that provide interoperability of
diverse resources
- Integration of existing and established standards
b) OGSA Services
The OGSA specifies services which occur within a wide variety of grid systems. They
can be divided into 4 broad groups: core services, data services, program execution
services, and resource management services.
i) The core services: This deals with Service Communication, Service
Management, Service Interaction and Security.
ii) Data Services: The wide range of different data types, usability and transparency
involve a large variety of different interfaces:
- Interfaces for caching
- Interfaces for data replication
- Interfaces for data access
- Interfaces for data transformation and filtering
- Interfaces for file and DBMS services
- Interfaces for grid storage services
iii) Program execution :
Main goal of this category is to enable applications to have coordinated access to
underlying VO resources, regardless of their physical location or access mechanisms.
e) GridFTP
GridFTP is a secure and reliable data transfer protocol providing high performance and
optimized for wide-area networks that have high bandwidth. As one might guess from its
name, it is based upon the Internet FTP protocol and includes extensions that make it a
desirable tool in a grid environment. GridFTP uses basic Grid security on both control
(command) and data channels. Features include multiple data channels for parallel
transfers, partial file transfers, third-party transfers, and more.
Grid Portal :
A portal/user interface functional block usually exists in the grid environment. The user
interaction mechanism (specifically, the interface) can take a number of forms. The interaction
mechanism typically is application specific
The Grid Security Infrastructure:
A user security functional block usually exists in the grid environment and, as noted
above, a key requirement for grid computing is security. In a grid environment, there is a need
for mechanisms to provide authentication, authorization, data confidentiality, data integrity, and
availability, particularly from a user's point of view. When a user's job executes, typically it
requires confidential message-passing services.
Broker Function:
A broker functional block usually exists in the grid environment. After the user is authenticated
by the user security functional block, the user is allowed to launch an application. At this
juncture, the grid system needs to identify appropriate and available resources that can/should
be used within the grid, based on the application and application-related parameters provided
by the user of the application. This task is carried out by a broker function.
In the situation where the user wishes to reserve a specific resource or to ensure that different
jobs within the application run concurrently, then a scheduler is needed to coordinate the
execution of the jobs.
Role of layers:
1) Fabric: interfaces local control. Need to manage a resource. The grid fabric layer provides standardized access to local resource-specific operations. Software is provided to discover resources. This layer provides the resources, which could comprise computers, storage devices, and databases. The resource could also be a logical entity such as a distributed file system or computer pool.
2) Connectivity: secure communications. Need secure connectivity to resources.
Assumes trust is based on user, not service providers. Use public key infrastructure
(PKI). Integrate and obey local security policies in global view. This layer consists of the
core communication and authentication protocols required for transactions.
Communication protocols enable the exchange of data between fabric layer resources.
3) Resource: This layer builds on the Connectivity layer communication and authentication
protocols to define Application Program Interfaces (API) and Software Development Kit
(SDK) for secure negotiation, initiation, monitoring, control, accounting and payment of
sharing operations.
4) Collective: coordinated sharing of multiple resources. Need to coordinate sharing of resources. This layer differs from the resource layer in that, while the resource layer concentrates on interactions with a single resource, this layer helps in coordinating multiple resources. Its tasks can be varied, such as directory services, co-allocation and scheduling, monitoring, diagnostic services, and software discovery services.
InterGrid
Intergrids often rely on the Internet. This crosses organization boundaries. Generally, an intergrid may be used to collaborate on "large" projects of common scientific interest. The intergrid offers the opportunity for sharing, trading, or brokering resources over widespread pools; computational "processor-cycle" resources may also be obtained, as needed, from a utility for a specified fee.
2 Marks Questions and Answers
UNIT I INTRODUCTION
5. What is IoT?
The IoT refers to the networked interconnection of everyday objects, tools, devices, or
computers. One can view the IoT as a wireless network of sensors that interconnect all things in
our daily life. The dynamic connections will grow exponentially into a new dynamic network of
networks, called the Internet of Things (IoT).
6. Define Multithreading.
Multithreading is the ability of a program or an operating system process to manage its use by more than one user at a time, and even to manage multiple requests by the same user, without having to run multiple copies of the program in the computer.
7. What are Cloud Service Models?
Infrastructure as a Service (IaaS)
Platform as a Service (PaaS)
Software as a Service (SaaS)
9. Define DataGrid.
A data grid is an architecture or set of services that gives individuals or groups of users the
ability to access, modify and transfer extremely large amounts of geographically
distributed data for research purposes.
SOA services communicate with messages formally defined via XML Schema (also
called XSD).
SOA services are maintained in the enterprise by a registry that acts as a directory
listing.
Each SOA service has a quality of service (QoS) associated with it.
Grid computing is the concept of distributed computing technologies for computing resource sharing among participants in a virtualized collection of organizations.
Business on demand (BOD) is not just about utility computing; it covers a much broader set of ideas about the transformation of business practices, process transformation, and technology implementations.
The essential characteristics of on-demand business are responsiveness to the dynamics of
business, adapting to variable cost structures, focusing on core business competency and
resiliency for consistent availability.
Grid portal
Security (grid security infrastructure)
Broker (along with directory)
Scheduler
Data management
Job and resource management
Resources
A portal/user interface functional block usually exists in the grid environment. The user
interaction mechanism (specifically, the interface) can take a number of forms. The interaction
mechanism typically is application specific
A broker functional block usually exists in the grid environment. After the user is authenticated
by the user security functional block, the user is allowed to launch an application. At this
juncture, the grid system needs to identify appropriate and available resources that can/should
be used within the grid, based on the application and application-related parameters provided
by the user of the application.
Intergrids often rely on the Internet. This crosses organization boundaries. Generally, an intergrid may be used to collaborate on "large" projects of common scientific interest. The intergrid offers the opportunity for sharing, trading, or brokering resources over widespread pools; computational "processor-cycle" resources may also be obtained as needed, from a utility for a specified fee.
16 Marks Questions
UNIT II GRID SERVICES
OGSI, in effect, is the base infrastructure on which the OGSA is built, as illustrated pictorially in Figure 2.3.
The running of an individual service is called a service instance. Services and service instances can be "lightweight" and transient, or they can be long-term tasks that require "heavy-duty" support from the grid. Services and service instances can be dynamic or interactive, or they can be batch processed.
Grid services include:
Discovery
Lifecycle
State management
Service groups
Factory
Notification
Handle map
A "layering" approach is used to the extent possible in the definition of grid architecture, because it is advantageous for higher-level functions to use common lower-level functions. Grid functionality can include the following, among others: information queries, network bandwidth allocation, data management/extraction, processor requests, managing data sessions, and balancing workloads.
OGSA-related GGF groups include :
The Open Grid Services Architecture Working Group (OGSA-WG)
The Open Grid Services Infrastructure Working Group (OGSI-WG)
The Open Grid Service Architecture Security Working Group (OGSA-SECWG)
Database Access and Integration Services Working Group (DAIS-WG)
OGSA has introduced the concept of a Grid service as a building block of the service-
oriented framework. A Grid service is an enhanced Web service that extends the conventional
Web service functionality into the Grid domain. A Grid service handles issues such as state
management, global service naming, reference resolution, and Grid-aware security, which are
the key requirements for a seamless, universal Grid architecture.
For example, standard APIs enable application portability; without standard APIs,
application portability is difficult to accomplish. Standards enable cross-site interoperability;
without standard protocols, interoperability is difficult to achieve. Standards also enable the deployment of a shared infrastructure.
Use of the OGSI standard, therefore, provides the following benefits:
Increased effective computing capacity. When the resources utilize the same
conventions, interfaces, and mechanisms, one can transparently switch jobs among grid
systems, both from the perspective of the server as well as from the perspective of the
client. This allows grid users to use more capacity and allows clients a more extensive
choice of projects that can be supported on the grid. Hence, with a gamut of platforms
and environments supported, along with the ability to more easily publish the services
available, there will be an increase in the effective computing capacity.
Interoperability of resources. Grid systems can be more easily and efficiently
developed and deployed when utilizing a variety of languages and a variety of platforms.
For example, it is desirable to mix service-provider components, work-dispatch tracking
systems, and systems management; this makes it easier to dispatch work to service
providers and for service providers to support grid services.
Speed of application development. Using middleware based on a standard expedites
the development of grid-oriented applications supporting a business environment.
Rather than spending time developing communication and management systems to help support the grid system, the planner can instead spend time optimizing the business/algorithmic logic related to processing the data.
The functional requirements include fundamental, security and resource management functions.
The basic functionalities are as follows:
Basic Functionality Requirements: The basic functionalities include Discovery and
brokering, Metering and accounting, Data sharing, Deployment, Monitoring, Policy and Virtual
organizations.
Security Requirements: The security functions include Multiple security infrastructures,
Perimeter security solutions, Authentication, Authorization, and Accounting, Encryption,
Application and Network-Level Firewalls and Certification.
Resource Management Requirements: The resource management functions include
Provisioning, Resource virtualization, Optimization of resource usage, Transport management,
Access, Management and monitoring, Processor scavenging, Scheduling of service tasks, Load
balancing, Advanced reservation, Notification and messaging, Logging, Workflow management
and Pricing.
System Properties Requirements: The system properties functions include Fault tolerance, Disaster recovery, Self-healing capabilities, Strong monitoring and Legacy application management.
Applications in the grid are normally grouped into two categories: computation-
intensive and data-intensive. For data-intensive applications, we may have to deal with massive
amounts of data.
The grid system must be specially designed to discover, transfer, and manipulate these massive data sets. Transferring massive data sets is a time-consuming task, so efficient data access and transfer are essential.
Replicating the same data blocks and scattering them in multiple regions of a grid is a data access method also known as caching, which is often applied to enhance data efficiency in a grid environment. With replication, users can access the same data with locality of reference, and some key data will not be lost in case of failures. However, data replication may demand periodic consistency checks. The increase in storage requirements and network bandwidth may cause additional problems.
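As a hedged illustration of the caching idea, the sketch below keeps a bounded number of recently used data blocks in memory under a standard LRU policy (a LinkedHashMap in access order). The block identifiers and the capacity are hypothetical, and a real grid cache would add the consistency checks mentioned above.

import java.util.LinkedHashMap;
import java.util.Map;

// A simple LRU cache for data blocks: recently accessed replicas stay local.
public class BlockCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public BlockCache(int capacity) {
        super(16, 0.75f, true);          // accessOrder = true gives LRU ordering
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;        // evict the least recently used block
    }

    public static void main(String[] args) {
        BlockCache<String, byte[]> cache = new BlockCache<>(2);
        cache.put("block-A", new byte[1024]);
        cache.put("block-B", new byte[1024]);
        cache.get("block-A");            // touch A, so B becomes least recently used
        cache.put("block-C", new byte[1024]);
        System.out.println(cache.keySet());   // [block-A, block-C]; block-B was evicted
    }
}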
Replication strategies can be classified into two method types: dynamic and static. For the static method, the locations and number of replicas are determined in advance and will not be modified.
Dynamic strategies can adjust locations and number of data replicas according to
changes in conditions.
The most common replication strategies include preserving locality, minimizing update costs,
and maximizing profits.
Multiple participants may want to share the same data collection. To retrieve any piece
of data, we need a grid with a unique global namespace. Similarly, we desire to have unique file
names. To achieve these, we have to resolve inconsistencies among multiple data objects
bearing the same name. Access restrictions may be imposed to avoid confusion. Also, data
needs to be protected to avoid leakage and damage. Users who want to access data have to be
authenticated first and then authorized for access. There are four access models for organizing a data grid, as listed here and shown in Figure 2.9.
Monadic model: This is a centralized data repository model, shown in Figure 2.9(a). All the
data is saved in a central data repository. When users want to access some data they
have to submit requests directly to the central repository. No data is replicated for
preserving data locality. This model is the simplest to implement for a small grid. For a
large grid, this model is not efficient in terms of performance and reliability. Data
replication is permitted in this model only when fault tolerance is demanded.
Hierarchical model: The hierarchical model, shown in Figure 2.9(b), is suitable for building a
large data grid which has only one large data access directory. The data may be
transferred from the source to a second-level center. Then some data in the regional
center is transferred to the third-level center. After being forwarded several times,
specific data objects are accessed directly by users.
Federation model: This data access model shown in Figure 2.9(c) is better suited for designing
a data grid with multiple sources of data supplies. Sometimes this model is also known
as a mesh model. The data sources are distributed to many different locations.
Although the data is shared, the data items are still owned and controlled by their
original owners.
Hybrid model: This data access model is shown in Figure 2.9(d). The model combines the best
features of the hierarchical and mesh models. Traditional data transfer technology,
such as FTP, applies for networks with lower bandwidth. Network links in a data grid
often have fairly high bandwidth, and other data transfer models are exploited by high-
speed data transfer tools such as GridFTP developed with the Globus library.
Compared with traditional FTP data transfer, parallel data transfer opens multiple data streams
for passing subdivided segments of a file simultaneously. Although the speed of each
stream is the same as in sequential streaming, the total time to move data in all
streams can be significantly reduced compared to FTP transfer.
In striped data transfer, a data object is partitioned into a number of sections, and each section
is placed in an individual site in a data grid. When a user requests this piece of data, a
data stream is created for each site, and all the sections of data objects are transferred
simultaneously. Striped data transfer can utilize the bandwidths of multiple sites more
efficiently to speed up data transfer.
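A hedged sketch of the parallel-transfer idea: the code below fetches disjoint byte ranges of one file over several concurrent HTTP connections using the standard Range header, which mirrors how multiple data streams shorten total transfer time. The URL, segment size, and stream count are hypothetical, and real GridFTP uses its own protocol rather than plain HTTP.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Parallel data transfer sketch: several streams fetch disjoint segments of one file.
public class ParallelFetch {
    public static void main(String[] args) {
        String source = "http://data.example.org/dataset.bin";   // hypothetical data source
        long segmentSize = 64L * 1024 * 1024;                     // 64 MB per stream
        int streams = 4;

        ExecutorService pool = Executors.newFixedThreadPool(streams);
        for (int i = 0; i < streams; i++) {
            final long start = i * segmentSize;
            final long end = start + segmentSize - 1;
            pool.submit(() -> fetchRange(source, start, end));    // one stream per segment
        }
        pool.shutdown();
    }

    static void fetchRange(String source, long start, long end) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(source).openConnection();
            conn.setRequestProperty("Range", "bytes=" + start + "-" + end);
            try (InputStream in = conn.getInputStream()) {
                // A real transfer would write this segment to the matching offset of a local file.
                long bytes = in.transferTo(OutputStream.nullOutputStream());
                System.out.println("Fetched " + bytes + " bytes for range " + start + "-" + end);
            }
        } catch (Exception e) {
            System.err.println("Segment " + start + "-" + end + " failed: " + e);
        }
    }
}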
Event— Some occurrence within the state of the grid service or its environment that may be of
interest to third parties. This could be a state change or it could be environmental, such as a
timer event.
Message— An artifact of an event, containing information about an event that some entity
wishes to communicate to other entities.
Topic— A "logical" communications channel and matching mechanism to which a requestor may subscribe to receive asynchronous messages and publishers may publish messages.
2.7.13 Event
Events are generally used as asynchronous signaling mechanisms. The most common form is "publish/subscribe," in which a service "publishes" the events that it exports. There is also a distinction between the reliability of an event being raised and its delivery to a client. A service may attempt to deliver every occurrence of an event (reliable posting), but not be able to guarantee delivery.
An event can be anything that the service decides it will be: a change in a state variable, entry into a particular code segment, an exception such as a security violation or floating-point overflow, or the failure of some other expected event to occur.
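A minimal in-memory sketch of the publish/subscribe pattern described here: subscribers register interest in a topic, and messages published for events on that topic are delivered to every current subscriber. The topic name and message type are hypothetical; grid notification services layer reliability and distribution on top of this basic shape.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Bare-bones publish/subscribe: topics map to lists of subscriber callbacks.
public class EventBus {
    private final Map<String, List<Consumer<String>>> topics = new ConcurrentHashMap<>();

    public void subscribe(String topic, Consumer<String> subscriber) {
        topics.computeIfAbsent(topic, t -> new CopyOnWriteArrayList<>()).add(subscriber);
    }

    public void publish(String topic, String message) {
        // Best-effort delivery: posting an event does not guarantee delivery to every client.
        topics.getOrDefault(topic, List.of()).forEach(s -> s.accept(message));
    }

    public static void main(String[] args) {
        EventBus bus = new EventBus();
        bus.subscribe("service.state", msg -> System.out.println("Monitor received: " + msg));
        bus.publish("service.state", "state changed: RUNNING -> SUSPENDED");
    }
}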
The Common Management Model specification defines the base behavioral model for all
resources and resource managers in the grid management infrastructure. A mechanism is
defined by which resource managers can make use of detailed manageability information for a
resource that may come from existing resource models and instrumentation, such as those
expressed in CIM, JMX, SNMP, and so on, combined with a set of canonical operations
introduced by base CMM interfaces.
The CMM specification defines
The base manageable resource interface, which a resource or resource manager must
provide to be manageable
Canonical lifecycle states—the transitions between the states, and the operations
necessary for the transitions that complement OGSI lifetime service data
The ability to represent relationships among manageable resources, including a
canonical set of relationship types
Life cycle metadata - common to all types of managed resources for monitoring and
control of service data and operations based on life cycle state
Canonical services factored out from across multiple resources or domain specific
resource managers, such as an operational port type.
2 marks Questions and answers
UNIT II GRID SERVICES
1. Mention the Grid Services.
Discovery
Lifecycle
State management
Service groups
Factory
Notification
Handle map
2. Mention the OGSA-related GGF groups.
The Open Grid Services Architecture Working Group (OGSA-WG)
The Open Grid Services Infrastructure Working Group (OGSI-WG)
The Open Grid Service Architecture Security Working Group (OGSA-SECWG)
Database Access and Integration Services Working Group (DAIS-WG)
The basic functionalities include Discovery and brokering, Metering and accounting, Data
sharing, Deployment, Monitoring, Policy and Virtual organizations.
The security functions include Multiple security infrastructures, Perimeter security solutions,
Authentication, Authorization, and Accounting, Encryption, Application and Network-Level
Firewalls and Certification.
13. Mention four base data interfaces that can be used to implement a variety of different data service behaviors.
Data Description
DataAccess
DataFactory
DataManagement
14. Discuss the use of Data caching.
Data Caching. In order to improve performance of access to remote data items, caching
services will be employed. At the minimum, caching services for traditional flat file data will be
employed. Caching of other data types, such as views on RDBMS data, streaming data, and
application binaries, are also envisioned.
16 marks Questions
1. Explain Open Grid Services Architecture (OGSA) in detail with suitable diagram.
2. Explain about OGSA: A PRACTICAL VIEW and DETAILED VIEW
3. Explain about Data intensive grid service models
4. List the Various OGSA Services in detail.
UNIT III VIRTUALIZATION
1. Public Clouds
Easy to use: Some developers prefer public cloud due to its ease of access.
Generally, the public cloud operates at a pretty fast speed, which is also alluring to
some enterprises.
Typically a pay-per-use model (cost-effective): Often, public clouds operate on an
elastic pay-as-you-go model, so users only need to pay for what they use.
Operated by a third party: The public cloud isn't specific to a single business,
person or enterprise; it is constructed with shared resources and operated by third-
party providers.
Flexible: Public clouds allow users to easily add or drop capacity, and are typically
accessible from any Internet-connected device — users don't need to jump through
many hurdles in order to access.
Can be unreliable: Public cloud outages have made headlines in recent years, leading to headaches for users.
Less secure: Public clouds often have a lower level of security and may be more susceptible to hacks. Some public cloud providers also reserve the right to shift data around from one region to another without notifying the user, which may cause issues, legal and otherwise, for a company with strict data security policies.
2. Private Clouds
A private cloud is built within the domain of an intranet owned by a single organization.
It is client owned and managed, and its access is limited to the owning clients and their
partners.
Its deployment was not meant to sell capacity over the Internet through publicly
accessible interfaces.
Private clouds give local users a flexible and agile private infrastructure to run service
workloads within their administrative domains.
A private cloud is supposed to deliver more efficient and convenient cloud services.
It may impact the cloud standardization, while retaining greater customization and
organizational control.
Examples of Private Cloud:
o Eucalyptus
o Ubuntu Enterprise Cloud - UEC (powered by Eucalyptus)
o Amazon VPC (Virtual Private Cloud)
o VMware Cloud Infrastructure Suite
o Microsoft ECI data center.
3. Hybrid Clouds
Main Characteristics
Flexible and scalable: Since the hybrid cloud, as its name suggests, employs facets
of both private and public cloud services, enterprises have the ability to mix and match
for the ideal balance of cost and security.
Cost effective: Businesses can take advantage of the cost-effectiveness of public
cloud computing, while also enjoying the security of a private cloud.
Becoming widely popular: More and more enterprises are adopting this type of
model.
In summary, public clouds promote standardization, preserve capital investment, and offer application flexibility. Private clouds attempt to achieve customization and offer higher efficiency, resiliency, security, and privacy.
The services provided over the cloud can be generally categorized into three different service models: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).
All three models allow users to access services over the Internet, relying entirely on the
infrastructures of cloud service providers.
These models are offered based on various SLAs between providers and users. In a
broad sense, the SLA for cloud computing is addressed in terms of service availability,
performance, and data protection and security.
Figure below illustrates three cloud models at different service levels of the cloud.
o SaaS is applied at the application end using special interfaces by users or
clients.
o At the PaaS layer, the cloud platform must perform billing services and handle
job queuing, launching, and monitoring services.
o At the bottom layer of the IaaS services, databases, compute instances, the file
system, and storage must be provisioned to satisfy user demands.
Infrastructure as a Service
This involves offering hardware related services using the principles of cloud computing.
This model allows users to use virtualized IT resources for computing, storage, and
networking. In short, the service is performed by rented cloud infrastructure.
The user can deploy and run his applications over his chosen OS environment. The user
does not manage or control the underlying cloud infrastructure, but has control over the
OS, storage, deployed applications, and possibly select networking components.
This IaaS model encompasses storage as a service, compute instances as a service, and communication as a service. The Amazon Virtual Private Cloud (VPC), for example, provides EC2 clusters and S3 storage to multiple users. GoGrid, FlexiScale, and Aneka are other good examples. Table 4.1 summarizes the IaaS offerings by five public cloud providers.
A virtual machine is a representation of a real machine using software that provides an operating environment which can run or host a guest operating system. Virtual machines are created and managed by virtual machine monitors.
A guest operating system is an operating system running inside the created virtual machine.
Virtual Machine Monitor (Hypervisor): software that runs in a layer between the host operating system and one or more virtual machines and provides the virtual machine abstraction to the guest operating systems. Examples: Xen, KVM, VMware.
The idea of virtualization is to separate the hardware from the software to yield better system
efficiency. Virtualization techniques can be applied to enhance the use of compute engines,
networks, and storage. With sufficient storage, any computer platform can be installed in
another host computer, even if they use processors with different instruction sets and run with
distinct operating systems on the same hardware.
This virtualization layer is known as hypervisor or virtual machine monitor (VMM) . The VMs are
shown in the upper boxes, where applications run with their own guest OS over the virtualized
CPU, memory, and I/O resources.
The virtualization software creates the abstraction of VMs by interposing a virtualization layer at
various levels of a computer system. Common virtualization layers include the instruction set
architecture (ISA) level, hardware level, operating system level, library support level, and
application level
Virtualization at the Instruction Set Architecture (ISA) level:
Advantage:
• It can run a large amount of legacy binary code written for various processors
on any given new hardware host machine.
• Best application flexibility.
Shortcoming & limitation:
• One source instruction may require tens or hundreds of native target instructions
to perform its function, which is relatively slow.
• V-ISA requires adding a processor-specific software translation layer to the
compiler.
Virtualization at the Operating System level:
• This virtualization creates isolated containers on a single physical server, and the OS
instances utilize the hardware and software in data centers.
• Typical systems: Jail / Virtual Environment / Ensim's VPS / FVM
Advantage:
• Has minimal startup/shutdown cost, low resource requirement, and high scalability; can
synchronize VM and host state changes.
Shortcoming & limitation:
• All VMs at the operating system level must have the same kind of guest OS
• Poor application flexibility and isolation.
Virtualization at the User-Application level:
• It creates execution environments for running alien programs on a platform rather than creating
a VM to run an entire operating system.
• This layer sits as an application program on top of an operating system and exports an
abstraction of a VM that can run programs written and compiled to a particular abstract
machine definition.
• Typical systems: JVM, .NET CLR, Panot
VMM Design Requirements:
There are three requirements for a VMM:
First, a VMM should provide an environment for programs which is essentially identical to
the original machine.
Second, programs run in this environment should show, at worst, only minor decreases in
speed.
Third, a VMM should be in complete control of the system resources. Any program run
under a VMM should exhibit a function identical to that which it runs on the original machine
directly.
A VMM should demonstrate efficiency in using the VMs. To guarantee the efficiency of a VMM,
a statistically dominant subset of the virtual processor‘s instructions needs to be executed
directly by the real processor, with no software intervention by the VMM.
(1) The VMM is responsible for allocating hardware resources for programs;
(2) it is not possible for a program to access any resource not explicitly allocated to it;
and
(3) it is possible under certain circumstances for a VMM to regain control of resources
already allocated.
A VMM is tightly related to the architectures of processors. It is difficult to implement a VMM for
some types of processors, such as the x86. Specific limitations include the inability to trap on
some privileged instructions. If a processor is not designed to support virtualization primarily,
the guest OS must be modified or hardware-assisted virtualization must be used.
Hypervisor
Type 1: bare-metal (native) hypervisor
• sits on the bare metal computer hardware like the CPU, memory, etc.
• All guest operating systems are a layer above the hypervisor.
• The original CP/CMS hypervisor developed by IBM was of this kind.
Type 2: hosted hypervisor
• runs as a layer on top of a host operating system, which provides device drivers and other
low-level services to it; the guest OSes run above this hosted hypervisor. VMware Workstation
is a typical example.
Using VMs in a cloud environment raises two difficulties. The first is the ability to use a variable
number of physical machines and VM instances depending on the needs of a problem. For
example, a task may need only a single CPU during some phases of execution but may need
hundreds of CPUs at other times.
The second challenge concerns the slow operation of instantiating new VMs. Currently,
new VMs originate either as fresh boots or as replicates of a template VM, unaware of
the current application state.
Need for OS Virtualization
Moreover, full virtualization at the hardware level also has the disadvantages of slow
performance and low density, and the need for para-virtualization to modify the guest OS.
OS-level virtualization provides a feasible solution for these hardware-level virtualization issues.
Figure 3.7 Operating system virtualization from the point of view of a machine stack
The figure shows the OpenVZ virtualization layer inside the host OS, which provides some OS
images to create VMs quickly.
The virtualization layer is inserted inside the OS to partition the hardware resources for
multiple VMs to run their applications in multiple virtual environments.
To implement OS-level virtualization, isolated execution environments (VMs) should be
created based on a single OS kernel. Furthermore, the access requests from a VM
need to be redirected to the VM‘s local resource partition on the physical machine. For
example, the chroot command in a UNIX system can create several virtual root
directories within a host OS. These virtual root directories are the root directories of all
VMs created.
Virtualization on Linux or Windows Platforms
Most reported OS-level virtualization systems are Linux-based. Virtualization support on the
Windows-based platform is still in the research stage. The Linux kernel offers an abstraction
layer to allow software processes to work with and operate on resources without knowing the
hardware details.
Two OS tools (Linux vServer and OpenVZ) support Linux platforms to run other
platform-based applications through virtualization.
The third tool, FVM, is an attempt specifically developed for virtualization on the
Windows NT platform.
Linux vServer for Linux platforms (http://linux-vserver.org/): Extends Linux kernels to
implement a security mechanism to help build VMs by setting resource limits and file
attributes and changing the root environment for VM isolation.
FVM (Feather-Weight Virtual Machines) for virtualizing the Windows NT platforms: Uses
system call interfaces to create VMs at the NT kernel space; multiple VMs are supported by a
virtualized namespace and copy-on-write.
Library-level virtualization is also known as user-level Application Binary Interface (ABI) or API
emulation. This type of virtualization can create execution environments for running alien
programs on a platform rather than creating a VM to run the entire operating system.
WABI (Windows Application Binary Interface): Middleware that converts Windows system calls
running on x86 PCs to Solaris system calls running on SPARC workstations.
Lxrun (Linux Run): A system call emulator that enables Linux applications written for x86
hosts to run on UNIX systems such as the SCO OpenServer.
CUDA is a programming model and library for general-purpose GPUs. It leverages the high
performance of GPUs to run compute-intensive applications on host operating systems.
vCUDA virtualizes the CUDA library so that CUDA applications can also run inside guest OSes.
It consists of three user space components: the vCUDA library, a virtual GPU in the guest OS
(which acts as a client), and the vCUDA stub in the host OS (which acts as a server).
The vCUDA library resides in the guest OS as a substitute for the standard CUDA library. It is
responsible for intercepting and redirecting API calls from the client to the stub. Besides these
tasks, vCUDA also creates vGPUs and manages them.
Depending on the position of the virtualization layer, there are three typical classes of VM
architecture:
• hypervisor architecture
• full virtualization and host-based virtualization
• para-virtualization
The hypervisor supports hardware-level virtualization on bare metal devices like CPU, memory,
disk and network interfaces. The hypervisor software sits directly between the physical
hardware and its OS. This virtualization layer is referred to as either the VMM or the hypervisor.
The hypervisor provides hypercalls for the guest OSes and applications. Depending on the
functionality, a hypervisor can assume a micro-kernel architecture like the Microsoft Hyper-V. Or
it can assume a monolithic hypervisor architecture like the VMware ESX for server virtualization.
A micro-kernel hypervisor includes only the basic and unchanging functions (such as physical
memory management and processor scheduling). The device drivers and other changeable
components are outside the hypervisor.
A monolithic hypervisor implements all the aforementioned functions, including those of the
device drivers.
Therefore, the size of the hypervisor code of a micro-kernel hypervisor is smaller than that of a
monolithic hypervisor. Essentially, a hypervisor must be able to convert physical devices into
virtual resources dedicated for the deployed VM to use.
Xen does not include any device drivers natively. It just provides a mechanism by which a
guest OS can have direct access to the physical devices. As a result, the size of the Xen
hypervisor is kept rather small.
Xen provides a virtual environment located between the hardware and the OS.
The core components of a Xen system are the hypervisor, kernel, and applications. The
organization of the three components is important.
Like other virtualization systems, many guest OSes can run on top of the hypervisor. However,
not all guest OSes are created equal, and one in particular controls the others.
The guest OS, which has control ability, is called Domain 0, and the others are called Domain
U.
Domain 0 is a privileged guest OS of Xen. It is first loaded when Xen boots without any file
system drivers being available. Domain 0 is designed to access hardware directly and manage
devices. Therefore, one of the responsibilities of Domain 0 is to allocate and map hardware
resources for the guest domains (the Domain U domains).
For example, Xen is based on Linux and its security level is C2. Its management VM is named
Domain 0, which has the privilege to manage other VMs implemented on the same host. If
Domain 0 is compromised, the hacker can control the entire system.
So, in the VM system, security policies are needed to improve the security of Domain 0. Domain
0, behaving as a VMM, allows users to create, copy, save, read, modify, share, migrate, and roll
back VMs as easily as manipulating a file, which flexibly provides tremendous benefits for users.
Unfortunately, it also brings a series of security problems during the software life cycle and data
lifetime.
Depending on implementation technologies, hardware virtualization can be classified into two
categories:
• full virtualization
• host-based virtualization
Full virtualization does not need to modify the host OS. It relies on binary translation
to trap and to virtualize the execution of certain sensitive, nonvirtualizable
instructions. The guest OSes and their applications consist of noncritical and critical
instructions.
In a host-based system, both a host OS and a guest OS are used. A virtualization
software layer is built between the host OS and guest OS.
Full Virtualization
With full virtualization, noncritical instructions run on the hardware directly while critical
instructions are discovered and replaced with traps into the VMM to be emulated by
software.
Both the hypervisor and VMM approaches are considered full virtualization.
Critical instructions are trapped into the VMM because binary translation can incur a large
performance overhead.
Noncritical instructions do not control hardware or threaten the security of the system, but
critical instructions do. Therefore, running noncritical instructions on hardware not only can
promote efficiency, but also can ensure system security
Figure. 3.11 Full Virtualization using a hypervisor / VMM on top of bare hardware device
This approach was implemented by VMware and many other software companies.
VMware puts the VMM at Ring 0 and the guest OS at Ring 1.
The VMM scans the instruction stream and identifies the privileged, control- and
behavior-sensitive instructions. When these instructions are identified, they are trapped into the
VMM, which emulates the behavior of these instructions.
Host-Based Virtualization
This approach installs a virtualization layer on top of the host OS. This host OS is still
responsible for managing the hardware. The guest OSes are installed and run on top of the
virtualization layer. Dedicated applications may run on the VMs.
First, the user can install this VM architecture without modifying the host OS. The virtualizing
software can rely on the host OS to provide device drivers and other low-level services. This
will simplify the VM design and ease its deployment.
Second, the host-based approach appeals to many host machine configurations.
Compared to the hypervisor/VMM architecture, the performance of the host-based architecture
may also be low. When an application requests hardware access, it involves four layers of
mapping, which downgrades performance significantly. When the ISA of a guest OS is different
from the ISA of the underlying hardware, binary translation must be adopted, which lowers
performance further.
The virtualization layer can be inserted at different positions in a machine software stack.
However, para-virtualization attempts to reduce the virtualization overhead, and thus improve
performance by modifying only the guest OS kernel.
Para-virtualized VM architecture:
The traditional x86 processor offers four instruction execution rings: Rings 0, 1, 2, and 3. The
lower the ring number, the higher the privilege of instruction being executed.
The OS is responsible for managing the hardware and the privileged instructions to execute at
Ring 0, while user-level applications run at Ring 3.
Although para-virtualization reduces the overhead, it introduces other problems.
First, its compatibility and portability may be in doubt, because it must support the
unmodified OS as well.
Second, the cost of maintaining para-virtualized OSes is high, because they may require
deep OS kernel modifications.
Finally, the performance advantage of para-virtualization varies greatly due to workload
variations.
Compared with full virtualization, para-virtualization is relatively easy and more practical. The
main problem in full virtualization is its low performance in binary translation. To speed up binary
translation is difficult.
Many virtualization products employ the para-virtualization architecture. The popular Xen, KVM,
and VMware ESX are good examples.
KVM (Kernel-Based VM): This is a Linux para-virtualization system, a part of the Linux version 2.6.20 kernel.
Memory management and scheduling activities are carried out by the existing Linux kernel. The
KVM does the rest, which makes it simpler than the hypervisor that controls the entire machine.
Unlike the full virtualization architecture which intercepts and emulates privileged and sensitive
instructions at runtime, para-virtualization handles these instructions at compile time.
The guest OS kernel is modified to replace the privileged and sensitive instructions with
hypercalls to the hypervisor or VMM. Xen assumes such a para-virtualization architecture.
The guest OS running in a guest domain may run at Ring 1 instead of at Ring 0. This implies
that the guest OS may not be able to execute some privileged and sensitive instructions.
On a UNIX system, a system call involves an interrupt or service routine. The hypercalls apply
a dedicated service routine in Xen.
ESX is a VMM or a hypervisor for bare-metal x86 symmetric multiprocessing (SMP) servers. It
accesses hardware resources such as I/O directly and has complete resource management
control.
To improve performance, the ESX server employs a para-virtualization architecture in which the
VM kernel interacts directly with the hardware without involving the host OS.
Full virtualization
• Does not need to modify guest OS, and critical instructions are emulated by software
through the use of binary translation.
• VMware Workstation applies full virtualization, which uses binary translation to
automatically modify x86 software on-the-fly to replace critical instructions.
• Advantage: no need to modify OS.
• Disadvantage: binary translation slows down the performance.
Para virtualization
• The guest OS kernel is modified to replace privileged and sensitive instructions with
hypercalls to the hypervisor, which avoids the overhead of binary translation.
Hardware-assisted virtualization
• In this approach the VMM and guest OS run in different modes, and all sensitive instructions of
the guest OS and its applications are trapped in the VMM. Mode switching is completed by
hardware.
• VMware Workstation allows users to set up multiple x86 and x86-64 virtual computers and to
use one or more of these VMs simultaneously with the host operating system; it assumes the
host-based virtualization. Xen is a hypervisor for use in IA-32, x86-64, Itanium, and PowerPC
970 hosts.
For memory virtualization, Intel offers the EPT, which translates the virtual address to
the machine‘s physical addresses to improve performance.
For I/O virtualization, Intel implements VT-d and VT-c to support this.
Figure 3.15 Hardware Support for Virtualization in the Intel x86 Processor
Intel and AMD add an additional mode called privilege mode level (some people call it
Ring -1) to x86 processors.
Operating systems can run at Ring 0 and the hypervisor can run at Ring -1. All the
privileged and sensitive instructions are trapped in the hypervisor automatically.
This technique removes the difficulty of implementing binary translation of full
virtualization. It also lets the operating system run in VMs without modification.
Memory Virtualization
Each page table of the guest OSes has a separate page table in the VMM corresponding to
it. The VMM page table is called the shadow page table. Nested page tables add another
layer of indirection to virtual memory.
The MMU already handles virtual-to-physical translations as defined by the OS.
Then the physical memory addresses are translated to machine addresses using another
set of page tables defined by the hypervisor.
Since modern operating systems maintain a set of page tables for every process, the
shadow page tables will get flooded. Consequently, the performance overhead and cost of
memory will be very high.
Extended Page Table by Intel for Memory Virtualization
To improve the efficiency of the software shadow page table technique Intel developed a
hardware-based EPT technique.
When a virtual address needs to be translated, the CPU will first look for the L4 page table
pointed to by Guest CR3.
The key idea of SV-IO (self-virtualized I/O) is to harness the rich resources of a multicore
processor. All tasks associated with virtualizing an I/O device are encapsulated in SV-IO.
For virtualization in multi-core processors, the supporting technique is located under the ISA
and remains unmodified by the operating system or VMM (hypervisor).
Figure illustrates the technique of a software-visible VCPU moving from one core to another and
temporarily suspending execution of a VCPU when there are no appropriate cores on which it
can run.
A virtual hierarchy is a cache hierarchy that can adapt to fit the workload or mix of workloads.
The hierarchy‘s first level locates data blocks close to the cores needing them for faster access,
establishes a shared-cache domain, and establishes a point of coherence for faster
communication.
When a miss leaves a tile, it first attempts to locate the block (or sharers) within the first level.
Space sharing is applied to assign three workloads to three clusters of virtual cores:
The basic assumption is that each workload runs in its own VM. However, space sharing
applies equally within a single operating system.
Statically distributing the directory among tiles can do much better, provided operating systems
or hypervisors carefully map virtual pages to physical frames.
VMs consolidate multiple functionalities on the same server to enhance server
utilization and application flexibility.
VMs can be colonized (replicated) in multiple servers to promote distributed parallelism,
fault tolerance, and disaster recovery.
The size (number of nodes) of a virtual cluster can grow or shrink dynamically
Failure of any physical nodes may disable some VMs installed on the failing nodes. But
the failure of VMs will not pull down the host system.
Figure shows the concept of a virtual cluster based on application partitioning or
customization.
Deployment refers to
o constructing and distributing software stacks (OS, libraries, applications) to a physical
node inside clusters as fast as possible, and
o quickly switching runtime environments from one user's virtual cluster to another
user's virtual cluster.
If one user finishes using his system, the corresponding virtual cluster should shut down
or suspend quickly to save the resources to run other VMs for other users.
The live migration of VMs allows workloads of one node to transfer to another node.
However, VMs cannot migrate randomly, and the potential overhead caused by live
migration may have serious negative effects on cluster utilization, throughput, and
QoS.
Load balancing of applications can be achieved using the load index and frequency of
user logins. The automatic scale-up and scale-down mechanism of a virtual cluster can
be implemented based on this model.
Dynamically adjusting loads among nodes by live migration of VMs is desired, when the
loads on cluster nodes become quite unbalanced.
3.6.1.2 High-Performance Virtual Storage
A template VM can be distributed to several physical hosts in the cluster to customize the
VMs. To efficiently manage the disk spaces occupied by template software packages,
some storage architecture design can be applied to reduce duplicated blocks in a
distributed file system of virtual clusters. Hash values are used to compare the contents
of data blocks.
Users have their own profiles which store the identification of the data blocks for
corresponding VMs in a user-specific virtual cluster. New blocks are created when users
modify the corresponding data.
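The following is a minimal illustrative sketch (not taken from the text) of the hash-comparison idea described above: each data block is keyed by a content hash, so identical blocks contributed by different users' VM images are stored only once. The class and method names are invented for illustration.
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class BlockDedup
{
    private final Map<String, byte[]> store = new HashMap<>();

    /** Stores the block only if its content hash has not been seen before,
        and returns the hash key that identifies the block in a user profile. */
    public String put(byte[] block) throws Exception
    {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(block);
        StringBuilder key = new StringBuilder();
        for (byte b : digest)
        {
            key.append(String.format("%02x", b));
        }
        store.putIfAbsent(key.toString(), block); // duplicate blocks share one stored copy
        return key.toString();
    }
}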
There are four steps to deploy a group of VMs onto a target cluster:
o preparing the disk image,
o configuring the VMs,
o choosing the destination nodes, and
o executing the VM deployment command on every host.
To simplify the disk image preparation process, a template is used. A template is a disk image
that includes a preinstalled operating system with or without certain application software.
Users choose a proper template according to their requirements and make a duplicate of
it as their own disk image. Templates could implement the COW (Copy on Write) format.
A new COW backup file is very small and easy to create and transfer which reduces disk
space consumption.
Every VM is configured with a name, disk image, network setting, and allocated CPU
and memory. VMs with the same configurations could use preedited profiles to simplify
the process.
Normally, users do not care which host is running their VM. A strategy to choose the
proper destination host for any VM is needed.
The deployment principle is to fulfill the VM requirement and to balance workloads
among the whole host network.
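As a rough illustration of this principle (not an algorithm given in the text), one simple destination-selection strategy is to pick, among the hosts that can satisfy the VM's CPU and memory requirement, the host with the most spare capacity. The class and field names below are hypothetical.
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class VmPlacement
{
    public static class Host
    {
        final String name;
        final int freeCpus;
        final int freeMemMB;
        Host(String name, int freeCpus, int freeMemMB)
        {
            this.name = name;
            this.freeCpus = freeCpus;
            this.freeMemMB = freeMemMB;
        }
    }

    /** Returns the host with the most spare capacity that still fits the requested VM, if any. */
    public static Optional<Host> choose(List<Host> hosts, int vmCpus, int vmMemMB)
    {
        return hosts.stream()
                    .filter(h -> h.freeCpus >= vmCpus && h.freeMemMB >= vmMemMB)
                    .max(Comparator.comparingLong((Host h) -> (long) h.freeCpus * 1024 + h.freeMemMB));
    }
}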
o A guest-based cluster manager resides on a guest system; multiple VMs form a virtual
cluster. Example: openMosix, a Linux cluster running different guest systems on top of the
VMM.
o A host-based manager supervises the guest systems and can restart a guest system after
failure. Example: the VMware HA system.
o A third way is to use an independent cluster manager on both the host and guest systems.
In terms of functionality, Eucalyptus works like AWS APIs. Therefore, it can interact with
EC2. It does provide a storage API to emulate the Amazon S3 API for storing user data
and VM images. It is installed on Linux-based platforms, is compatible with EC2 with
SOAP and Query, and is S3-compatible with SOAP and REST. CLI and web portal
services can be applied with Eucalyptus.
FIGURE 3.26 Eucalyptus for building private clouds by establishing virtual networks over the
VMs linking through Ethernet and the Internet. Courtesy of D. Nurmi, et al. [45]
Figure shows vSphere‘s overall architecture. The system interacts with user applications via an
interface layer, called vCenter.
vSphere is primarily intended to offer virtualization support and resource management of data-
center resources in building private clouds. VMware claims the system is the first cloud OS that
supports availability, security, and scalability in providing cloud computing services.
The vSphere 4 is built with two functional software suites: infrastructure services and application
services.
It also has three component packages intended mainly for virtualization purposes:
vCompute is supported by the ESX, ESXi, and DRS virtualization libraries from VMware;
vStorage is supported by VMFS and thin provisioning libraries; and vNetwork covers
distributed switching and networking.
The application services are also divided into three groups: availability, security, and scalability.
Availability support includes VMotion, Storage VMotion, HA, Fault Tolerance, and Data
Recovery from VMware.
The scalability package was built with DRS and Hot Add.
FIGURE 3.27 vSphere/4, a cloud operating system that manages compute, storage, and
network resources over virtualized data centers. Courtesy of VMware, April 2010 [72]
Once a hacker successfully enters the VMM or management VM, the whole system is in
danger.
FIGURE 3.28 The architecture of livewire for intrusion detection using a dedicated VM.
The VM-based IDS contains a policy engine and a policy module. The policy framework
can monitor events in different guest VMs by using an operating system interface library,
and PTrace traces activity against the security policy of the monitored host.
It‘s difficult to predict and prevent all intrusions without delay. Therefore, an analysis of
the intrusion action is extremely important after an intrusion occurs. Most computer
systems use logs to analyze attack actions, but it is hard to ensure the credibility and
integrity of a log.
The IDS log service is based on the operating system kernel. Thus, when an operating
system is invaded by attackers, the log service should be unaffected.
Besides IDS, honeypots and honeynets are also prevalent in intrusion detection. They
attract and provide a fake system view to attackers in order to protect the real system. In
addition, the attack action can be analyzed, and a secure IDS can be built.
FIGURE 3.29 Techniques for establishing trusted zones for virtual cluster insulation and VM
isolation. Courtesy of L. Nick, EMC [40]
The arrowed boxes on the left and the brief description between the arrows and the
zoning boxes are security functions and actions taken at the four levels from the users to
the providers.
The small circles between the four boxes refer to interactions between users and
providers and among the users themselves.
The arrowed boxes on the right are those functions and actions applied between the
tenant environments, the provider, and the global communities.
Virtual task forces, or groups, are formed to solve specific problems associated with the
virtual organization. Dynamic provisioning and management capabilities for the required
resources are needed to meet the SLAs.
Advantage:
• It can run a large amount of legacy binary codes written for various processors
on any given new hardware host machines
• best application flexibility
Limitation:
• One source instruction may require tens or hundreds of native target instructions
to perform its function, which is relatively slow.
• V-ISA requires adding a processor-specific software translation layer to the
compiler.
First, a VMM should provide an environment for programs which is essentially identical to
the original machine.
Second, programs run in this environment should show, at worst, only minor decreases in
speed.
Third, a VMM should be in complete control of the system resources. Any program run
under a VMM should exhibit a function identical to that which it runs on the original machine
directly.
Xen is an open source hypervisor program developed by Cambridge University. Xen is a micro-
kernel hypervisor, which separates the policy from the mechanism. The Xen hypervisor
implements all the mechanisms, leaving the policy to be handled by Domain 0. Xen does not
include any device drivers natively. It just provides a mechanism by which a guest OS can have
direct access to the physical devices.
First, the user can install this VM architecture without modifying the host OS. The
virtualizing software can rely on the host OS to provide device drivers and other low-
level services. This will simplify the VM design and ease its deployment.
Second, the host-based approach appeals to many host machine configurations.
16. Define Full Virtualization with its pros and cons.
• Does not need to modify guest OS, and critical instructions are emulated by software
through the use of binary translation.
• VMware Workstation applies full virtualization, which uses binary translation to
automatically modify x86 software on-the-fly to replace critical instructions.
• Advantage: no need to modify OS.
Disadvantage: binary translation slows down the performance
19. Mention the four steps to deploy a group of VMs onto a target cluster
a. preparing the disk image,
b. configuring the VMs,
c. choosing the destination nodes, and
d. executing the VM deployment command on every host.
Consolidation enhances hardware utilization. Many underutilized servers are consolidated into
fewer servers to enhance resource utilization. Consolidation also facilitates backup services and
disaster recovery. This approach enables more agile provisioning and deployment of resources.
In a virtual environment, the images of the guest OSes and their applications are readily cloned
and reused.
Eucalyptus is an open source software system intended mainly for supporting Infrastructure as
a Service (IaaS) clouds. The system primarily supports virtual networking and the management
of VMs; virtual storage is not supported. Its purpose is to build private clouds that can interact
with end users through Ethernet or the Internet.
16 marks Questions
1. Discuss about cloud deployment models in detail.
2. Explain in detail about the levels of virtualization implementation.
3. Explain CPU, Memory and I/O virtualization in detail.
4. Explain about the Live migration steps
5. Discuss about Server consolidation in data centers
REFERENCES: 1. Jason Venner, ―Pro Hadoop- Build Scalable, Distributed Applications in the
Cloud‖, A Press, 2009
2. Tom White, ―Hadoop The Definitive Guide‖, First Edition. O‟Reilly, 2009.
3. Bart Jacob (Editor), ―Introduction to Grid Computing‖, IBM Red Books, Vervante, 2005
4. Ian Foster, Carl Kesselman, ―The Grid: Blueprint for a New Computing Infrastructure‖, 2nd
Edition, Morgan Kaufmann.
5. Frederic Magoules and Jie Pan, ―Introduction to Grid Computing‖ CRC Press, 2009.
7. Barry Wilkinson, ―Grid Computing: Techniques and Applications‖, Chapman and Hall, CRC,
Taylor and Francis Group, 2010.
Figure: Layered grid architecture showing applications at the top, core middleware providing
distributed resource coupling services below them, and the fabric layer (local resource
management) at the bottom.
Grid applications and portals are typically developed using Grid-enabled languages and utilities
such as HPC++ or MPI.
4.1.1 Basic Functional Grid Middleware Packages
Several significant grid middleware packages have been designed and implemented.
UNICORE Middleware
UNICORE is a vertically integrated Grid computing environment that facilitates the following:
A seamless, secure and intuitive access to resources in a distributed environment – for
end users.
Solid authentication mechanisms integrated into their administration procedures,
reduced training effort and support requirements – for Grid sites.
Easy relocation of computer jobs to different platforms – for both end users and Grid
sites.
4.1.2 Globus
The Globus project provides an open source software toolkit that can be used to build
computational grids and grid-based applications. It allows sharing of computing power,
databases, and other tools securely online across corporate, institutional and geographic
boundaries without sacrificing local autonomy. The core services, interfaces and protocols in the
Globus toolkit allow users to access remote resources seamlessly while simultaneously
preserving local control over who can use resources and when.
4.1.3 Legion
Legion defines a set of core object types that support basic system services, such as
naming and binding, object creation, activation, deactivation, and deletion. These objects provide
the basic mechanisms on which higher-level Legion services are built.
4.1.4 Gridbus
The Gridbus Project is an open-source, multi-institutional project led by the GRIDS Lab
at the University of Melbourne. It is engaged in the design and development of service-oriented
cluster and grid middleware technologies to support eScience and eBusiness applications. It
extensively leverages related software technologies and provides an abstraction layer to hide
idiosyncrasies of heterogeneous resources and low-level middleware technologies from
application developers.
Gridbus supports commoditization of grid services at various levels:
- Raw resource level (e.g., selling CPU cycles and storage resources)
- Application level (e.g., molecular docking operations for drug design application )
- Aggregated services (e.g., brokering and reselling of services across multiple domains)
Example of staging:
<fileStageIn>
<transfer>
<sourceURL>
gsiftp://host1.examplegrid.org:2811/home…user1/userDataFile
</sourceURL>
<destinationURL>
file://${GLOBUS_USER_HOME}/…transferred_files
</destinationURL>
</transfer>
</fileStageIn>
GLOBUS_USER_HOME refers to home directory of the user on the remote host.
In the job description file we can specify where the standard output and error files are to be
directed. These files can then be staged out from the remote server. To redirect standard output
and standard error we add the following to the job description file:
<stdout>${GLOBUS_USER_HOME}/test.out</stdout>
<stderr>${GLOBUS_USER_HOME}/test.err</stderr>
After the completion of job, these files can be transferred to the submission node by adding the
following to the job description file.
<fileStageOut>
 <transfer>
  <!-- illustrative completion: copy the remote stdout file back to the submission host -->
  <sourceURL>file://${GLOBUS_USER_HOME}/test.out</sourceURL>
  <destinationURL>gsiftp://host2.examplegrid.org:2811/home/user1/test.out</destinationURL>
 </transfer>
</fileStageOut>
4.5.4.1 WS GRAM
WS GRAM is the Grid service that provides the remote execution and status
management of jobs. When a job is submitted by a client, the request is sent to the remote host
as a SOAP message, and handled by WS GRAM service located in the remote host. The WS
GRAM service can collaborate with the RFT service for staging files required by jobs. In order to
enable staging with RFT, valid credentials should be delegated to the RFT service by the
Delegation service.
4.5.4.2 Globus Teleoperations Control Protocol (GTCP)
Globus Teleoperations Control Protocol is the WSRF version of NEESgrid Teleoperations
Control Protocol (NTCP). Currently, GTCP is a technical preview component
A factory callback can be implemented to provide custom factories for your services. It can, for
instance, be used to create services in remote hosting environments.
A Grid service client can be written directly on top of the JAX-RPC client APIs. The handle is
passed into a ServiceLocator that constructs a proxy, or stub, responsible for making the call
using the network binding format defined in the WSDL for the service. The proxy is exposed
using a standard JAX-RPC generated PortType interface.
The user is responsible for handling the job setup, specifying the input location(s),
specifying the input, and ensuring the input is in the expected format and location. The
framework is responsible for distributing the job among the TaskTracker nodes of the cluster;
running the map, shuffle, sort, and reduce phases; placing the output in the output directory;
and informing the user of the job-completion status.
The job created by the code in MapReduceIntro.java will read all of its textual input line
by line, and sort the lines based on that portion of the line before the first tab character. If there
are no tab characters in the line, the sort will be based on the entire line. The
MapReduceIntro.java file is structured to provide a simple example of configuring and running
a MapReduce job.
/** Inform the framework that the mapper class will be the {@link
IdentityMapper}. This class simply passes the input Key Value pairs directly to its
output, which in our case will be the shuffle.
*/
conf.setMapperClass(IdentityMapper.class);
FileOutputFormat.setOutputPath(conf,MapReduceIntroConfig.getOutputDirectory
());
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setNumReduceTasks(1);
conf.setReducerClass(IdentityReducer.class);
logger.info("Launching the job.");
/** Send the job configuration to the framework and request that the job be run.
*/
final RunningJob job = JobClient.runJob(conf);
logger.info("The job has completed.");
if (!job.isSuccessful())
{
    logger.error("The job failed.");
    System.exit(1);
}
The magic piece of code is the line output.collect(key, val), which passes a key/value pair back
to the framework for further processing.
Common Mappers
One common mapper drops the values and passes only the keys forward:
public void map(K key, V val, OutputCollector<K, V> output, Reporter reporter)
throws IOException
{
output.collect(key, null); /** Note, no value, just a null */
}
Common Reducers
A common reducer drops the values and passes only the keys forward:
public void reduce(K key, Iterator<V> values, OutputCollector<K, V> output, Reporter reporter)
throws IOException
{
    output.collect(key, null);
}
Another common reducer provides count information for each key:
protected Text count = new Text();
/** Writes each key together with the number of values it had. */
public void reduce(K key, Iterator<V> values, OutputCollector<K, V> output, Reporter reporter)
throws IOException
{
int i = 0;
while (values.hasNext())
{
    values.next();   // consume the value; only the count is needed
    i++;
}
count.set( "" + i );
output.collect(key, count);
}
The length of an InputSplit is measured in bytes. Every InputSplit has a set of storage locations.
The storage locations are used by the MapReduce system to place map tasks as close to the
split's data as possible. The splits are processed in order of size, largest first, in order to
minimize the job runtime. One important thing to remember is that an InputSplit doesn't contain
the input data, but only a reference to the data.
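A small sketch (written against the old org.apache.hadoop.mapred API used by the other examples in these notes) can make this concrete: it asks an InputFormat for its splits and prints each split's length and preferred storage locations. The input path supplied on the command line is illustrative.
import java.util.Arrays;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class ShowSplits
{
    public static void main(String[] args) throws Exception
    {
        JobConf conf = new JobConf(ShowSplits.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        TextInputFormat format = new TextInputFormat();
        format.configure(conf);
        // Each split reports its length in bytes and where its data lives,
        // but it holds only a reference to the data, not the data itself.
        for (InputSplit split : format.getSplits(conf, 1))
        {
            System.out.println(split + " length=" + split.getLength()
                    + " locations=" + Arrays.toString(split.getLocations()));
        }
    }
}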
@Override
public NullWritable getCurrentKey() throws IOException, InterruptedException
{
return NullWritable.get();
}
@Override
public BytesWritable getCurrentValue() throws IOException, InterruptedException
{
    return value;   // the BytesWritable populated by nextKeyValue()
}
2) Text Input
• Hadoop excels at processing unstructured text.
TextInputFormat:
The key is the byte offset of the line within the file, and the value is the line itself.
• TextInputFormat is the default InputFormat. Each record is a line of input.
• The key, a LongWritable, is the byte offset within the file of the beginning of the line.
• The value is the contents of the line, excluding any line terminators (newline, carriage
return), and is packaged as a Text object.
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
• is divided into one split of four records. The records are interpreted as the following
key-value pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
KeyValueTextInputFormat:
Each line is interpreted as a key-value pair, separated by a delimiter byte (a tab character by default).
• Consider the following input file, where → represents a (horizontal) tab character:
line1→On the top of the Crumpetty Tree
line2→The Quangle Wangle sat,
line3→But his face you could not see,
line4→On account of his Beaver Hat.
• Like in the TextInputFormat case, the input is in a single split comprising four records,
although this time the keys are the Text sequences before the tab in each line:
(line1, On the top of the Crumpetty Tree)
(line2, The Quangle Wangle sat,)
(line3, But his face you could not see,)
(line4, On account of his Beaver Hat.)
NLineInputFormat
Unlike the formats above, the splits are based on N lines of input rather than on a fixed
number of bytes.
• With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable
number of lines of input.
• The number depends on the size of the split and the length of the lines.
• If you want your mappers to receive a fixed number of lines of input, then
NLineInputFormat is the InputFormat to use (see the configuration sketch after this list).
• N refers to the number of lines of input that each mapper receives.
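A possible configuration fragment for the job driver, written against the old org.apache.hadoop.mapred API used elsewhere in these notes (NLineInputFormat here is org.apache.hadoop.mapred.lib.NLineInputFormat, and the property name controlling N is the one used by pre-0.21 releases, so treat it as an assumption to check against your Hadoop version):
JobConf conf = new JobConf();
conf.setInputFormat(NLineInputFormat.class);
// Assumed property name: each mapper receives exactly 1,000 lines of input.
conf.setInt("mapred.line.input.format.linespermap", 1000);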
XML
• Most XML parsers operate on whole XML documents, so if a large XML document is
made up of multiple input splits, then it is a challenge to parse these individually.
• Large XML documents that are composed of a series of ―records‖ can be broken into
these records using simple string or regular-expression matching to find start and end
tags of records.
3) Binary Input
• Hadoop MapReduce is not just restricted to processing textual data—it has support for
binary formats, too.
SequenceFileInputFormat
The input file is a Hadoop sequence file, containing serialized key/value pairs.
• Hadoop‘s sequence file format stores sequences of binary key-value pairs. Sequence
files are well suited as a format for MapReduce data since they are splittable, they
support compression as a part of the format, and they can store arbitrary types using a
variety of serialization frameworks.
• To use data from sequence files as the input to MapReduce, you use
SequenceFileInputFormat.
• The keys and values are determined by the sequence file, and you need to make sure
that your map input types correspond.
SequenceFileAsTextInputFormat
• A variant of SequenceFileInputFormat that converts the sequence file's keys and values to
Text objects so that they can be used with text-based mappers.
4) Multiple Inputs
• Although the input to a MapReduce job may consist of multiple input files, all of the input
is interpreted by a single InputFormat and a single Mapper.
• As the data format evolves over time, you may have to write your mapper to cope with all of
your legacy formats.
• Or, you may have data sources that provide the same type of data but in different formats.
• These cases are handled elegantly by using the MultipleInputs class, which allows you
to specify the InputFormat and Mapper to use on a per-path basis.
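A sketch of per-path configuration with the old-API class org.apache.hadoop.mapred.lib.MultipleInputs; the paths are illustrative, and IdentityMapper is reused for both paths only to keep the fragment simple, whereas in practice each path would usually get a mapper written for its own format:
JobConf conf = new JobConf();
MultipleInputs.addInputPath(conf, new Path("/data/plain"),
        TextInputFormat.class, IdentityMapper.class);
MultipleInputs.addInputPath(conf, new Path("/data/tab-separated"),
        KeyValueTextInputFormat.class, IdentityMapper.class);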
5) Database Input (and Output)
• DBInputFormat is an input format for reading data from a relational database, using
JDBC
• The corresponding output format is
• DBOutputFormat, which is useful for dumping job outputs (of modest size) into a
database.
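A hedged driver fragment using the old-API classes org.apache.hadoop.mapred.lib.db.DBConfiguration and DBInputFormat; the JDBC driver, connection URL, credentials, table and column names are placeholders, and MyRecord is a hypothetical user-written class implementing DBWritable:
JobConf conf = new JobConf();
conf.setInputFormat(DBInputFormat.class);
DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost/mydb", "user", "password");
// MyRecord (hypothetical) maps the selected columns to Java fields.
DBInputFormat.setInput(conf, MyRecord.class, "access_logs", null, "id", "id", "url");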
4.8.2 Output Formats
• Hadoop has output data formats that correspond to the input formats covered in the
previous section.
Formats:
• Text Output
• Binary Output
• Multiple Outputs
• Lazy Output
• Database Output
1) Text Output
• TextOutputFormat is the default output format; it writes records as lines of text, with keys
and values separated by a tab character.
2) Binary Output
• SequenceFileOutputFormat
• SequenceFileAsBinaryOutputFormat
• MapFileOutputFormat
SequenceFileOutputFormat
• As the name indicates, SequenceFileOutputFormat writes sequence files for its output.
• This is a good choice of output if it forms the input to a further MapReduce job, since it is
compact and is readily compressed.
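A driver fragment (old mapred API) that selects sequence-file output and enables block compression; SequenceFile.CompressionType and the static helpers used below belong to org.apache.hadoop.mapred.SequenceFileOutputFormat and FileOutputFormat:
JobConf conf = new JobConf();
conf.setOutputFormat(SequenceFileOutputFormat.class);
FileOutputFormat.setCompressOutput(conf, true);
// Block compression compresses groups of records together, which usually gives
// better compression than per-record compression.
SequenceFileOutputFormat.setOutputCompressionType(conf, SequenceFile.CompressionType.BLOCK);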
package com.apress.hadoopbook.examples.ch2;
import java.io.IOException;
import java.util.Formatter;
import java.util.Random;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.log4j.Logger;
public class MapReduceIntroConfig
{
// Log4j is the recommended way to provide textual information to the user about the job.
protected static Logger logger = Logger.getLogger(MapReduceIntroConfig.class);
protected static Path inputDirectory = new Path("file:///tmp/MapReduceIntroInput");
// This is the directory that the job output will be written to. It must not exist at job
// submission time.
protected static Path outputDirectory = new Path("file:///tmp/MapReduceIntroOutput");
protected static void exampleHouseKeeping(final JobConf conf, final Path inputDirectory, final
Path outputDirectory) throws IOException
{
conf.set("mapred.job.tracker", "local");
conf.setInt("io.sort.mb", 1);
generateSampleInputIf(conf, inputDirectory);
if (!removeIf(conf, outputDirectory))
{
logger.error("Unable to remove " + outputDirectory + "job aborted");
if (!inputDirectoryExists)
{
if (!fs.mkdirs(inputDirectory))
{
logger.error("Unable to make the inputDirectory "
+ inputDirectory.makeQualified(fs) + " aborting job");
System.exit(1);
}
}
final int fileCount = 3;
final int maxLines = 100;
generateRandomFiles(fs, inputDirectory, fileCount, maxLines);
}
public static Path getInputDirectory()
{
    return inputDirectory;
}
1) Blocks - HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost
of seeks. By making a block large enough, the time to transfer the data from the disk can
be made to be significantly larger than the time to seek to the start of the block. Thus the
time to transfer a large file made of multiple blocks operates at the disk transfer rate.
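As a rough worked illustration (the numbers are assumed for the example, not taken from the text): if the seek time is about 10 ms and the transfer rate is 100 MB/s, a 100 MB block takes about 1 second to transfer, so the seek costs only about 1 percent of the total time; with a conventional 4 KB disk block, the same 10 ms seek would dwarf the roughly 0.04 ms transfer time.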
2) Namenodes and Datanodes- An HDFS cluster has two types of node operating in a
master-worker pattern: a namenode (the master) and a number of datanodes
(workers). The namenode manages the file system namespace. It maintains the
filesystem tree and the metadata for all the files and directories in the tree. This
information is stored persistently on the local disk in the form of two files: the
namespace image and the edit log.
A client accesses the filesystem on behalf of the user by communicating with the
namenode and datanodes. Datanodes are the workhorses of the filesystem. They store
and retrieve blocks when they are told to (by clients or the namenode), and they report
back to the namenode periodically with lists of blocks that they are storing. Without the
namenode, the filesystem cannot be used. It is also possible to run a secondary
namenode, which despite its name does not act as a namenode. Its main role is to
periodically merge the namespace image with the edit log to prevent the edit log from
becoming too large.
3) The Command-Line Interface- There are many interfaces to HDFS, but the command
line is one of the simplest and, to many developers, the most familiar. There are two
properties that we set in the pseudo-distributed configuration that deserve further explanation.
4) Basic Filesystem Operations - The filesystem is ready to be used, and we can do all
of the usual filesystem operations such as reading files, creating directories, moving
files, deleting data, and listing directories. We can type hadoop fs -help to get detailed
help on every command.
HDFS Interfaces
1. HTTP- HDFS defines a read-only interface for retrieving directory listings and data over
HTTP. This protocol is not tied to a specific HDFS version, making it possible to write
clients that can use HTTP to read data from HDFS clusters that run different versions
of Hadoop.
2. FTP- There is an FTP interface to HDFS, which permits the use of the FTP protocol to
interact with HDFS. This interface is a convenient way to transfer data into and out of
HDFS using existing FTP clients.
Example. Displaying files from a Hadoop filesystem on standard output twice, by using
seek
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemDoubleCat
{
public static void main(String[] args) throws Exception
{
String uri = args[0];
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
FSDataInputStream in = null;
try
{
in = fs.open(new Path(uri));
IOUtils.copyBytes(in, System.out, 4096, false);
in.seek(0); // go back to the start of the file
IOUtils.copyBytes(in, System.out, 4096, false);
}
finally
{
IOUtils.closeStream(in);
}
}
}
Here‘s the result of running it on a small file:
% hadoop FileSystemDoubleCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
Writing Data
The FileSystem class has a number of methods for creating a file. The simplest is the method
that takes a Path object for the file to be created and returns an output stream to write to:
public FSDataOutputStream create(Path f) throws IOException
There are overloaded versions of this method that allow you to specify whether to forcibly
overwrite existing files, the replication factor of the file, the buffer size to use when writing the
file, the block size for the file, and file permissions.
Example. Copying a local file to a Hadoop filesystem
public class FileCopyWithProgress
{
public static void main(String[] args) throws Exception
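{
    // (The rest of the original listing is not reproduced here; the body below is a
    //  sketch based on the standard FileSystem API: it copies a local file to the
    //  given Hadoop filesystem URI and prints a '.' each time progress() is called.
    //  Needed imports: java.io.*, java.net.URI, org.apache.hadoop.conf.Configuration,
    //  org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path,
    //  org.apache.hadoop.io.IOUtils, org.apache.hadoop.util.Progressable.)
    String localSrc = args[0];
    String dst = args[1];
    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    OutputStream out = fs.create(new Path(dst), new Progressable()
    {
        public void progress()
        {
            System.out.print(".");
        }
    });
    IOUtils.copyBytes(in, out, 4096, true);
}
}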
FSDataOutputStream
The create() method on FileSystem returns an FSDataOutputStream, which, like
FSDataInputStream, has a method for querying the current position in the file:
package org.apache.hadoop.fs;
public class FSDataOutputStream extends DataOutputStream implements Syncable {
public long getPos() throws IOException {
// implementation elided
}
// implementation elided
}
However, unlike FSDataInputStream, FSDataOutputStream does not permit seeking. This is
because HDFS allows only sequential writes to an open file or appends to an already written
file.
Core Grid middleware offers services such as remote process management, co-allocation of
resources, storage access, information registration and discovery, security, and aspects of
Quality of Service (QoS) such as resource reservation and trading.
User-level Grid middleware utilizes the interfaces provided by the low-level middleware to
provide higher level abstractions and services. These include application development
environments, programming tools and resource brokers for managing resources and scheduling
application tasks for execution on global resources.
3. List the features of UNICORE.
User driven job creation and submission
Job management
Data management
Application support
Flow control
Single sign-on
Support for legacy jobs
Resource management
Trust models for Grid security environment – Authentication and Authorization methods – Grid
security infrastructure – Cloud Infrastructure security: network, host and application level –
aspects of data security, provider data and its security, Identity and access management
architecture, IAM practices in the cloud, SaaS, PaaS, IaaS availability in the cloud, Key privacy
issues in the cloud
TEXT BOOK:
1. Kai Hwang, Geoffery C. Fox and Jack J. Dongarra, ―Distributed and Cloud Computing:
Clusters, Grids, Clouds and the Future of Internet‖, First Edition, Morgan Kaufman Publisher, an
Imprint of Elsevier, 2012.
REFERENCES: 1. Jason Venner, ―Pro Hadoop- Build Scalable, Distributed Applications in the
Cloud‖, A Press, 2009
2. Tom White, ―Hadoop The Definitive Guide‖, First Edition. O‟Reilly, 2009.
3. Bart Jacob (Editor), ―Introduction to Grid Computing‖, IBM Red Books, Vervante, 2005
4. Ian Foster, Carl Kesselman, ―The Grid: Blueprint for a New Computing Infrastructure‖, 2nd
Edition, Morgan Kaufmann.
5. Frederic Magoules and Jie Pan, ―Introduction to Grid Computing‖ CRC Press, 2009.
6. Daniel Minoli, ―A Networking Approach to Grid Computing‖, John Wiley Publication, 2005.
7. Barry Wilkinson, ―Grid Computing: Techniques and Applications‖, Chapman and Hall, CRC,
Taylor and Francis Group, 2010.
SECURITY
Trust models for Grid security environment – Authentication and Authorization methods – Grid
security infrastructure – Cloud Infrastructure security: network, host and application level –
aspects of data security, provider data and its security, Identity and access management
architecture, IAM practices in the cloud, SaaS, PaaS, IaaS availability in the cloud, Key privacy
issues in the cloud
Figure 5.2 Interactions among multiple parties in a sequence of trust delegation operations
using the PKI services in a GT4-enabled grid environment.
For Charlie to accept the subtask Y, Bob needs to show Charlie some proof of entrust from
Alice. A proxy credential is the solution proposed by GSI.
A proxy credential is a temporary certificate generated by a user. Two benefits are seen by
using proxy credentials.
First, the proxy credential is used by its holder to act on behalf of the original user or the
delegating party. A user can temporarily delegate his right to a proxy.
Second, single sign-on can be achieved with a sequence of credentials passed along
the trust chain. The delegating party (Alice) need not verify the remote intermediate
parties in a trust chain.
The only difference between the proxy credential and a digital certificate is that the proxy
credential is not signed by a CA. We need to know the relationship among the certificates of the
CA and Alice, and proxy credential of Alice.
The CA certificate is signed first with its own private key.
Second, the certificate Alice holds is signed with the private key of the CA.
Finally, the proxy credential sent to her proxy (Bob) is signed with her private key.
The procedure delegates the rights of Alice to Bob by using the proxy credential.
First, the generation of the proxy credential is similar to the procedure of generating a
user certificate in the traditional PKI.
Second, when Bob acts on behalf of Alice, he sends the request together with Alice‘s
proxy credential and the Alice certificate to Charlie.
Third, after obtaining the proxy credential, Charlie finds out that the proxy credential is
signed by Alice. So he tries to verify the identity of Alice and finds Alice trustable. Finally,
Charlie accepts Bob‘s requests on behalf of Alice. This is called a trust delegation chain.
5.2.1 Authorization for Access Control
Figure 5.3 Three authorization models: the subject-push model, resource-pulling model, and the
authorization agent model.
Figure 5.4 GSI functional layers at the message and transport levels.
TLS (transport-level security) or WS-Security and WS-Secure Conversation (message-level) are
used as message protection mechanisms in combination with SOAP.
X.509 End Entity Certificates or Username and Password are used as authentication
credentials.
X.509 Proxy Certificates and WS-Trust are used for delegation.
Figure 5.6 A sequence of trust delegations in which new certificates are signed by the owners
rather than by the CA.
The certificate also includes a time notation after which the proxy should no longer be accepted
by others. Proxies have limited lifetimes. Because the proxy isn‘t valid for very long, it doesn‘t
have to stay quite as secure as the owner‘s private key, and thus it is possible to store the
proxy‘s private key in a local storage system without being encrypted, as long as the
permissions on the file prevent anyone else from looking at them easily.
The foundational infrastructure for a cloud must be inherently secure, whether it is a private or
public cloud and whether the service is SaaS, PaaS or IaaS.
The infrastructure security can be viewed, assessed and implemented according to its building
levels: the network, host and application levels.
When reviewing host security and assessing risks, the context of cloud service delivery models
(SaaS, PaaS, and IaaS) and deployment models (public, private, and hybrid) should be
considered. The host security responsibilities in SaaS and PaaS services are transferred to the
provider of cloud services. IaaS customers are primarily responsible for securing the hosts
provisioned in the cloud (virtualization software security, customer guest OS or virtual server
security).
Typical attacks and countermeasures at the network, host and application levels:
Network level
• DoS attack: prevents authorized users from accessing services on the network.
Countermeasure: DoS attacks can be prevented with a properly configured firewall.
• Sniffer attack: data that is not encrypted flows in the network, giving an attacker a chance to
read vital information. Countermeasures: detect sniffing based on ARP and RTT; implement
Internet Protocol Security (IPSec) to encrypt network traffic; the system administrator can
tighten security further, e.g., with one-time passwords or ticketing authentication.
• BGP prefix hijacking: a wrong announcement of the IP addresses associated with an
autonomous system (AS).
Host level
• Security concerns with the hypervisor: a single hardware unit hosting multiple operating
systems (virtual servers) is difficult to monitor, and malicious code may get control of the
system. Countermeasure: Hooksafe, which can provide generic protection against
kernel-mode rootkits.
Application level
• Backdoor and debug options: debug options left enabled unnoticed provide an easy entry for
a hacker into the web site and let him make changes at the web-site level. Countermeasures:
scan the system periodically for SUID/SGID files; check permissions and ownership of
important files and directories periodically.
• Hidden field manipulation: certain fields are hidden in the web site and used by the
developers; a hacker can easily modify them on the web page. Countermeasure: avoid
putting parameters into a query string.
Aspects of data security include:
a. Data in transit
b. Data at rest
c. Processing of data, including multi-tenancy
d. Data lineage
e. Data provenance
f. Data remanence
b. Data at Rest
Encryption of data stored on media is used to protect the data from unauthorized
access should the media ever be stolen. Physical access can get past file system
permissions, but if the data is stored in encrypted form and the attacker does not have
the decryption key, they cannot access the data.
Most encryption at rest uses a symmetric algorithm so that data can be very quickly
encrypted and decrypted. However, since the symmetric key itself needs to be
protected, they can use a PIN, password, or even a PKI certificate on a smart card to
secure the symmetric key, making it very difficult for an attacker to compromise.
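The following self-contained Java sketch (an illustration added here, not part of the referenced text) shows the idea: a symmetric AES key is derived from a passphrase with PBKDF2, so the raw key never has to be stored next to the data, and AES-GCM also protects the integrity of what is stored.
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.SecretKeySpec;

public class AtRestEncryption
{
    public static byte[] encrypt(char[] passphrase, byte[] salt, byte[] iv, byte[] plaintext)
            throws Exception
    {
        // Derive a 256-bit AES key from the passphrase (PBKDF2, 100,000 iterations).
        SecretKeyFactory factory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256");
        byte[] keyBytes = factory.generateSecret(
                new PBEKeySpec(passphrase, salt, 100000, 256)).getEncoded();
        // Encrypt with AES-GCM, which also detects tampering with the stored data.
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(keyBytes, "AES"),
                new GCMParameterSpec(128, iv));
        return cipher.doFinal(plaintext);
    }

    public static void main(String[] args) throws Exception
    {
        byte[] salt = new byte[16];
        byte[] iv = new byte[12];
        SecureRandom random = new SecureRandom();
        random.nextBytes(salt);
        random.nextBytes(iv);
        byte[] sealed = encrypt("a passphrase".toCharArray(), salt, iv,
                "sensitive record".getBytes("UTF-8"));
        System.out.println("ciphertext length: " + sealed.length);
    }
}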
e. Data Provenance
The Secure Provenance (SPROV) scheme automatically collects data provenance at the
application layer. It provides security assurances of confidentiality and integrity of the data
provenance. In this scheme, confidentiality is ensured by employing state-of-the-art
encryption techniques, while integrity is preserved using the digital signature of the user who
takes any action. Each record in the data provenance includes the signed checksum of the
previous record in the chain.
f. Data Remanence
Data remanence refers to the residual representation of data that remains even after attempts to
delete or erase it. Data remanence may lead to the disclosure of sensitive information when
storage media are released into an uncontrolled environment (e.g., thrown in the trash, or lost).
Various techniques have been developed to counter data remanence. These techniques are
classified as clearing, purging/sanitizing, or destruction.
Specific methods include overwriting, degaussing, encryption, and media destruction.
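As a small sketch of the overwriting (clearing) technique in Python; note that on SSDs and copy-on-write file systems overwriting gives no firm guarantee, so purging or physical destruction may still be required:

import os

def overwrite_and_delete(path, passes=3):
    # overwrite the file contents with random bytes before removing it,
    # reducing the residual data representation left on the medium
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(os.urandom(size))
            f.flush()
            os.fsync(f.fileno())
    os.remove(path)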
When using a public cloud, it is important to ensure both confidentiality and integrity of data in
transit. Using secure protocols such as FTPS, HTTPS, or SCP for data transfer over the Internet
provides both; simply encrypting the data and sending it over a non-secure protocol such as FTP
or HTTP provides confidentiality, but the integrity of the data is not ensured.
For data at rest used by an IaaS service, encryption is possible and strongly recommended, even
for small volumes of storage. For data used by a PaaS- or SaaS-based application, however,
encryption as a compensating control is generally not feasible, because encrypted data cannot be
searched or indexed by the application.
Applications provided with cloud computing are designed with data tagging to prevent
unauthorized access to user data. All data should be encrypted when transferred to or received
from the cloud, but there is no general-purpose method to process encrypted data. IBM has
developed a fully homomorphic encryption scheme which, in principle, allows data to be
processed without decrypting it.
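IBM's fully homomorphic scheme is far more involved; the toy sketch below, using textbook RSA's multiplicative homomorphism with deliberately tiny and insecure parameters, only illustrates the general idea of computing on ciphertexts without decrypting them:

# Toy illustration only (NOT IBM's scheme, NOT secure):
# textbook RSA is multiplicatively homomorphic, i.e. E(a) * E(b) mod n == E(a * b)
p, q, e = 61, 53, 17
n = p * q                      # 3233; real keys are thousands of bits long
phi = (p - 1) * (q - 1)
d = pow(e, -1, phi)            # private exponent (Python 3.8+)

def enc(m): return pow(m, e, n)
def dec(c): return pow(c, d, n)

c = (enc(6) * enc(7)) % n      # the provider multiplies ciphertexts only
assert dec(c) == 42            # the correct result, yet 6 and 7 were never exposed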
Protecting data in the cloud therefore rests on three elements:
Data lockdown
Access policies
Security intelligence
First, make sure that data is not readable and that the solution offers strong key management.
Second, implement access policies that ensure only authorized users can gain access to
sensitive information, so that even privileged users such as the root user cannot view sensitive
information. Third, incorporate security intelligence that generates log information, which can be
used for behavioral analysis to provide alerts that trigger when users perform actions outside of
the norm.
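As a minimal sketch of such behavioral analysis, assuming a hypothetical per-user log of daily action counts and a simple three-sigma rule for "outside of the norm":

from statistics import mean, pstdev

history = {"alice": [12, 15, 11, 14], "root": [3, 2, 4, 3]}   # actions per day (illustrative)
today = {"alice": 13, "root": 57}

for user, count in today.items():
    mu = mean(history[user])
    sigma = pstdev(history[user]) or 1.0
    if count > mu + 3 * sigma:                 # flag behavior far outside the user's norm
        print(f"ALERT: {user} performed {count} actions today (typical ~{mu:.0f})")

A real deployment would feed audit logs into a SIEM rather than a script, but the principle of baselining normal behavior and alerting on deviations is the same.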
Security assessment in an organization must be carried out periodically by persons who are able
to identify and fix problems efficiently. Shared risk appears in multi-tier service arrangements,
where a service provider acquires the infrastructure it needs from another service provider; an
incident at any tier potentially affects all parties.
Staff security screening is important because cloud providers commonly employ contractors; the
customer should confirm that contractors undergo the same background investigation, under the
same policy, as the provider's own employees.
Distributed data centers make the cloud service provider less prone to geographically localized
disasters and reduce the likelihood of having to invoke the disaster recovery plan, which should
nevertheless be periodically tested.
Physical security means that clients must have good knowledge of the security levels of all their
cloud service providers. The cloud service provider's coding should follow standard secure-coding
methods that can be documented and demonstrated to the client, in order to assure the client of a
secure development process. Data leakage is a drawback of every cloud provider, so it is always
recommended that data be transmitted and received in encrypted form.
While many organizations have implemented encryption for data security, they often overlook
inherent weaknesses in key management, access control, and monitoring of data access. If
encryption keys are not sufficiently protected, they are vulnerable to theft by malicious hackers.
The encryption implementation must incorporate a robust key management solution to provide
assurance that the keys are sufficiently protected. It is critical to audit the entire encryption and
key management solution.
Therefore, any data-centric approach must incorporate encryption, key management, strong
access controls, and security intelligence to protect data in the cloud and provide the requisite
level of security. By implementing a layered approach that includes these critical elements,
organizations can improve their security posture more effectively and efficiently than by focusing
exclusively on traditional network-centric security methods.
The strategy should incorporate a blueprint approach that addresses compliance requirements
and actual security threats. Best practices should include securing sensitive data, establishing
appropriate separation of duties between IT operations and IT security, ensuring that the use of
cloud data conforms to existing enterprise policies, as well as strong key management and strict
access policies.
Internal consistency — Ensures that internal data is consistent. For example, assume that an
internal database holds the number of units of a particular item in each department of an
organization. The sum of the number of units in each department should equal the total number
of units that the database has recorded internally for the whole organization.
External consistency — Ensures that the data stored in the database is consistent with the real
world. Using the preceding example, external consistency means that the number of items
recorded in the database for each department is equal to the number of items that physically
exist in that department.
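As a small illustration of the internal-consistency check described above, with hypothetical department figures:

departments = {"HR": 40, "CSE": 120, "ECE": 95}   # units recorded per department
recorded_total = 255                              # organization-wide total held in the database

# Internal consistency: the per-department figures must sum to the recorded total.
assert sum(departments.values()) == recorded_total, "internal inconsistency detected"

# External consistency would additionally require comparing these figures with a
# physical count of the units actually present in each department.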
Specific security challenges pertain to each of the three cloud service models—Software as a
Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS).
◗ SaaS deploys the provider's applications running on a cloud infrastructure; it offers anywhere
access, but also increases security risk. With this service model it is essential to implement
policies for identity management and access control to applications. For example, with
Salesforce.com, only certain salespeople may be authorized to access and download
confidential customer sales information.
◗ IaaS lets the consumer provision processing, storage, networks, and other fundamental
computing resources and control operating systems, storage, and deployed applications. As
with Amazon Elastic Compute Cloud (EC2), the consumer does not manage or control the
underlying cloud infrastructure. Data security is typically a shared responsibility between the
cloud service provider and the cloud consumer. Data encryption without the need to modify
applications is a key requirement in this environment to remove the custodial risk of IaaS
infrastructure personnel accessing sensitive data.
Identification is the act of a user professing an identity to a system, usually in the form of a
username or user logon ID to the system. Identification establishes user accountability for the
actions on the system. User IDs should be unique and not shared among different individuals. In
many large organizations, user IDs follow set standards, such as first initial followed by last
name, and so on. In order to enhance security and reduce the amount of information available
to an attacker, an ID should not reflect the user‘s job title or function.
Authentication is verification that the user's claimed identity is valid, and it is usually
implemented through a user password at logon. Authentication is based on the following three
factor types:
Type 1: something you know, such as a password or PIN
Type 2: something you have, such as a token, memory card, or smart card
Type 3: something you are, such as a fingerprint or other biometric characteristic
Sometimes a fourth factor, something you do, is added to this list. Something you do might be
typing your name or other phrases on a keyboard.
Two-factor authentication requires two of the three factors to be used in the authentication
process. For example, withdrawing funds from an ATM requires two-factor authentication: an
ATM card (something you have) and a PIN (something you know).
Passwords
o Because passwords can be compromised, they must be protected (a small password-hashing
sketch follows this list). In the ideal case, a password should be used only once. This
"one-time password," or OTP, provides maximum security because a new password is
required for each new logon.
o A password that is the same for each logon is called a static password. A password
that changes with each logon is termed a dynamic password. The changing of
passwords can also fall between these two extremes.
o Passwords can be required to change monthly, quarterly, or at other intervals,
depending on the criticality of the information needing protection and the password‘s
frequency of use. Obviously, the more times a password is used, the more chance
there is of it being compromised.
o A passphrase is a sequence of characters that is usually longer than the allotted
number for a password. The passphrase is converted into a virtual password by the
system.
o In all these schemes, a front-end authentication device or a back-end authentication
server, which services multiple workstations or the host, can perform the
authentication.
o Passwords can be provided by a number of devices, including tokens, memory
cards, and smart cards.
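A minimal sketch of how a static password is typically protected in practice, by storing only a salted hash and never the password itself; the passphrase shown is illustrative only:

import hashlib, hmac, os

def hash_password(password, salt=None):
    salt = salt or os.urandom(16)                # unique random salt per user
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200000)
    return salt, digest                          # store both; the password is never stored

def verify(password, salt, stored):
    # constant-time comparison avoids leaking information through timing
    return hmac.compare_digest(hash_password(password, salt)[1], stored)

salt, stored = hash_password("S3cure-Passphrase")
print(verify("S3cure-Passphrase", salt, stored))  # True
print(verify("guess", salt, stored))              # False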
Tokens
Tokens, in the form of small, hand-held devices, are used to provide passwords.
Memory Cards
Memory cards provide nonvolatile storage of information, but they do not have
any processing capability. A memory card stores encrypted passwords and other
related identifying information. A telephone calling card and an ATM card are
examples of memory cards.
Smart Cards
Smart cards provide even more capability than memory cards by incorporating
additional processing power on the card. These credit-card-size devices
comprise a microprocessor and memory and are used to store digital signatures,
private keys, passwords, and other personal information.
Biometrics
In addition to the accuracy of the biometric systems, other factors must be considered, including
enrollment time, throughput rate, and acceptability.
Enrollment time is the time that it takes to initially register with a system by providing samples
of the biometric characteristic to be evaluated. An acceptable enrollment time is around two
minutes. For example, in fingerprint systems the actual fingerprint is stored and requires
approximately 250KB per finger for a high-quality image. This level of information is required for
one-to-many searches in forensics applications on very large databases.
In finger-scan technology, a full fingerprint is not stored; rather, the features extracted from this
fingerprint are stored by using a small template that requires approximately 500 to 1,000 bytes
of storage. The original fingerprint cannot be reconstructed from this template. Finger-scan
technology is used for one-to-one verification by using smaller databases.
The throughput rate is the rate at which the system processes and identifies or authenticates
individuals. Acceptable throughput rates are in the range of 10 subjects per minute.
Collected biometric images are stored in an area referred to as a corpus. The corpus is stored in
a database of images. Potential sources of error include the corruption of images during
collection, and mislabeling or other transcription problems associated with the database.
Therefore, the image collection process and storage must be performed carefully with constant
checking.
The following are typical biometric characteristics that are used to uniquely authenticate an
individual‘s identity:
Fingerprints — Fingerprint characteristics are captured and stored. Typical CERs are 4–
5%.
Retina scans — The eye is placed approximately two inches from a camera and an
invisible light source scans the retina for blood vessel patterns. CERs are approximately
1.4%.
Iris scans — A video camera remotely captures iris patterns and characteristics. CER
values are around 0.5%.
An identity management effort can be supported by software that automates many of the
required tasks.
The Open Group and the World Wide Web Consortium (W3C) are working toward a standard
for a global identity management system that would be interoperable, provide for privacy,
implement accountability, and be portable.
Identity management is also addressed by the XML-based eXtensible Name Service (XNS)
open protocol for universal addressing. XNS provides the following capabilities:
These and other related objectives flow from the organizational security policy. This policy is a
high-level statement of management intent regarding the control of access to information and
the personnel who are authorized to receive that information.
Three things must be considered for the planning and implementation of access control
mechanisms: threats to the system, the system's vulnerability to these threats, and the risk
that the threats might materialize. These concepts are defined as follows:
Threat — An event or activity that has the potential to cause harm to the information
systems or networks
Vulnerability — A weakness or lack of a safeguard that can be exploited by a threat,
causing harm to the information systems or networks
Risk — The potential for harm or loss to an information system or network; the
probability that a threat will materialize
Controls
Controls are implemented to mitigate risk and reduce the potential for loss.
Two important control concepts are separation of duties and the principle of least privilege.
Separation of duties requires an activity or process to be performed by two or more entities for
successful completion. Thus, the only way that a security policy can be violated is if there is
collusion among the entities. For example, in a financial environment, the person requesting that
a check be issued for payment should not also be the person who has authority to sign the
check.
Least privilege means that the entity that has a task to perform should be provided with the
minimum resources and privileges required to complete the task for the minimum necessary
period of time.
Control measures can be administrative, logical (also called technical), and physical in their
implementation.
Controls provide accountability for individuals who are accessing sensitive information in a cloud
environment. This accountability is accomplished through access control mechanisms that
require identification and authentication, and through the audit function.
These controls must be in accordance with and accurately represent the organization‘s security
policy. Assurance procedures ensure that the control mechanisms correctly implement the
security policy for the entire life cycle of a cloud information system.
In general, a group of processes that share access to the same resources is called a protection
domain, and the memory space of these processes is isolated from other running processes.
Choosing not to use the root account improves security in a number of ways.
IAM administrators with this mindset introduce a significant degree of risk into an organization‘s
security policy. Greater entitlement than necessary opens the door for human error and
introduces the need for more complex audits; IAM policies greatly simplify an auditor‘s
investigation into who has access to which resources.
Best practice is to grant least privilege — and then grant more privileges on a granular level if
needed.
It is useful to note that S3 is a special service in that one can restrict access both through IAM
and through S3 Bucket Policies; one can further lock down access to an S3 bucket by
stipulating the actions the user can take in that bucket. For example, a user can be granted IAM
access to the bucket, but be denied if they are accessing it from an IP address outside of an IP
range set out in a bucket policy.
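A minimal sketch of such a least-privilege policy with an IP condition, written here as a Python dictionary in AWS IAM policy syntax; the bucket name and address range are hypothetical:

import json

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject"],               # only the actions this user needs
        "Resource": "arn:aws:s3:::example-records-bucket/*",      # hypothetical bucket
        "Condition": {"IpAddress": {"aws:SourceIp": "203.0.113.0/24"}},  # trusted range only
    }],
}
print(json.dumps(policy, indent=2))   # attach via IAM or adapt as an S3 bucket policy

Granting only GetObject and PutObject, on one bucket, from one address range, is the granular "grant more privileges only if needed" approach described above.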
In a complex healthcare enterprise, the ongoing investigation that entitlement definition requires
often necessitates assigning one or several IT staff the task of constantly updating, removing,
and re-auditing IAM policies.
What are the requirements of each application? Which S3 buckets need to be accessed by which
teams? Who has left and who has been hired? These new permissions and the reasons for them
must be well documented; this is one form of administrative work that is well worth the red tape.
Work done proactively here will save hours of forensic time if something goes wrong and if
auditors come in, there is much less that needs to be gathered. The organization has to prove
that the function can only be performed by certain people in a central location, so auditing is
fairly simple.
3. MFA everywhere + federated access
Multi-Factor Authentication provides an important level of security in any environment. Even if a
password is shared or gets inadvertently released, malicious users still cannot access the
account. This is particularly important in HIPAA-compliant environments.
Since CloudFormation is made to create and destroy infrastructure, it is that much more
important that IAM policies are managed effectively. CloudFormation only runs under the
context of the user running it, or else this powerful tool could become a powerful weapon.
CloudFormation will fail if a user tries to automate a function beyond its IAM role.
IAM puts you in a position of always having control over your environment. This is essential not
only in HIPAA-compliant environments, but in any environment that hosts sensitive or
proprietary data. Through the correct implementation of IAM policies, AWS is fully capable of
hosting sensitive data and may even provide a more granular level of user management
security than a traditional hosting environment.
Researchers are heavily involved in finding new technologies that can make cloud computing
more reliable from a security, performance, and availability perspective.
Traditionally the resources required for businesses have been locally installed, setup and
maintained by the organizations.
The organizations interact with each other in a very controlled and secured environment. They
often sign the service level agreements (SLAs) that hold each party engaged with certain
accountabilities.
In some situations a downtime of a few hours can lead to a loss of hundreds of thousands of
dollars. Establishing robust monitoring tools and practices brings long-term benefits in terms
of achieving high availability in the cloud.
Technically there are several levels where high availability can be achieved. These levels
include application level, data center level, infrastructure level and geographic location level.
One of the very basic goals of high availability is to avoid single point of failures as much as
possible to achieve operational continuity, redundancy and fail-over capability.
Dynamic scalability of the services is one of the very important features of the cloud. This goes
a long way in achieving high availability.
Amazon's EC2 scales up services by provisioning additional servers easily and in a short
amount of time. It provides dynamic scalability capabilities which help in load balancing and in
handling sudden, unexpected increases in network traffic.
This dynamic scalability can be controlled programmatically via the cloud server's API.
Programmatically controlled environments provide near real-time scalability: with a
single API call, several virtual machine instances can be added to a cluster (see the sketch below).
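A minimal sketch of such a call, assuming the third-party boto3 SDK and placeholder values for the AMI, instance type, and region:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI identifier
    InstanceType="t3.micro",
    MinCount=3,                        # three worker instances launched with one call
    MaxCount=3,
)
print([i["InstanceId"] for i in response["Instances"]])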
Since resources need not be fixed at the beginning of the computation, applications can be
scaled up or scaled down as workload requirements change.
These adjustments can take the form of requesting more machines or terminating those that are
no longer needed. Amazon's EC2 provides capabilities to control and manage resources per
user needs, which in turn helps web services achieve high availability.
In this environment the physical hardware resources are abstracted away and are exposed as
compute resources for cloud applications to use.
The Windows Azure Fabric is controlled by a Fabric Controller, which is responsible for exposing
storage and computing resources by abstracting the hardware resources.
In addition, application instances are monitored automatically for availability and scalability. If
one instance of the application goes down for some reason, the Fabric Controller is notified and
the application is instantiated in another virtual machine. This process ensures that application
availability is maintained with minimal downtime, in a consistent manner.
Open source private cloud vendor Eucalyptus has targeted high-availability in Eucalyptus 3.0.
In Eucalyptus 3.0 there are now multiple controllers for high-availability. The controllers are web
services that help to orchestrate the real time operation of the cloud.
In terms of deployment, this can be made across two or more racks with separate controllers on
each. The high-availability feature will detect networking, compute, memory and hardware
failures and then fail-over to a working stable node.
The ability to control what information one reveals about oneself over the Internet, and who can
access that information, has become a growing concern. These concerns include whether email
can be stored or read by third parties without consent, or whether third parties can track the web
sites someone has visited. Another concern is whether web sites which are visited collect, store,
and possibly share personally identifiable information about users.
Personally identifiable information (PII), as used in information security, refers to information that
can be used to uniquely identify, contact, or locate a single person, or can be used with other
sources to uniquely identify a single individual.
Privacy is an important business issue focused on ensuring that personal data is protected from
unauthorized and inappropriate collection, use, and disclosure, ultimately preventing the loss of
customer trust and inappropriate fraudulent activity such as identity theft, email spamming, and
phishing.
Adhering to privacy best practices is simply good business but is typically ensured by legal
requirements. Many countries have enacted laws to protect individuals‘ right to have their
privacy respected, such as Canada‘s Personal Information Protection and Electronic
Documents Act (PIPEDA), the European Commission‘s directive on data privacy, the Swiss
Federal Data Protection Act (DPA), and the Swiss Federal Data Protection Ordinance.
In the United States, individuals‘ right to privacy is also protected by business-sector regulatory
requirements such as the Health Insurance Portability and Accountability Act (HIPAA), The
Gramm-Leach- Bliley Act (GLBA), and the FCC Customer Proprietary Network Information
(CPNI) rules.
Data collected or held by a cloud provider about its customers can include:
Any data that is collected directly from a customer (e.g., entered by the customer via an
application's user interface)
Any data about a customer that is gathered indirectly (e.g., metadata in documents)
Any data about a customer's usage behavior (e.g., logs or history)
Any data relating to a customer's system (e.g., system configuration, IP address)
Personal data (sometimes also called personally identifiable information) is any piece of data
which can potentially be used to uniquely identify, contact, or locate a single person or can be
used with other sources to uniquely identify a single individual.
A subset of personal data is defined as sensitive and requires a greater level of controlled
collection, use, disclosure, and protection. Sensitive data includes some forms of identification
such as Social Security number, some demographic information, and information that can be
used to gain access to financial accounts, such as credit or debit card numbers and account
numbers in combination with any required security code, access code, or password. Finally, it is
important to understand that user data may also be personal data.
The entire contents of a user‘s storage device may be stored with a single cloud provider or with
many cloud providers. Whenever an individual, a business, a government agency, or other
entity shares information in the cloud, privacy or confidentiality questions may arise.
A user‘s privacy and confidentiality risks vary significantly with the terms of service and privacy
policy established by the cloud provider. For some types of information and some categories of
cloud computing users, privacy and confidentiality rights, obligations, and status may change
when a user discloses information to a cloud provider.
Disclosure and remote storage may have adverse consequences for the legal status of or
protections for personal or business information. The location of information in the cloud may
have significant effects on the privacy and confidentiality protections of information and on the
privacy obligations of those who process or store the information. Information in the cloud may
have more than one legal location at the same time, with differing legal consequences.
Laws could oblige a cloud provider to examine user records for evidence of criminal activity and
other matters. Legal uncertainties make it difficult to assess the status of information in the
cloud as well as the privacy and confidentiality protections available to users.
Collection: You should have a valid business purpose for developing applications and
implementing systems that collect, use or transmit personal data.
Notice: There should be a clear statement to the data owner of the company's/provider's intended
collection, use, retention, disclosure, transfer, and protection of personal data.
Choice and consent: The data owner must provide clear and unambiguous consent to the
collection, use, retention, disclosure, and protection of personal data.
Use: Once it is collected, personal data must only be used (including transfers to third parties)
in accordance with the valid business purpose and as stated in the Notice.
Security: Appropriate security measures must be in place (e.g., encryption) to ensure the
confidentiality, integrity, and authentication of personal data during transfer, storage, and use.
Access: Personal data must be available to the owner for review and update. Access to
personal data must be restricted to relevant and authorized personnel.
Retention: A process must be in place to ensure that personal data is only retained for the
period necessary to accomplish the intended business purpose or that which is required by law.
Disposal: Personal data must be disposed of in a secure and appropriate manner (e.g.,
using encryption, disk erasure, or paper shredders).
Particular attention should be paid to the privacy of personal information in a SaaS and
managed services environment.
There should be an emphasis on notice and consent, data security and integrity, and enterprise
control, as appropriate.
The following observations are made on the future of policy and confidentiality in the cloud
computing environment:
Threat — An event or activity that has the potential to cause harm to the information
systems or networks
Vulnerability — A weakness or lack of a safeguard that can be exploited by a threat,
causing harm to the information systems or networks
Risk — The potential for harm or loss to an information system or network; the
probability that a threat will materialize
20. Define PII.
Personally identifiable information (PII), as used in information security, refers to information that
can be used to uniquely identify, contact, or locate a single person, or can be used with other
sources to uniquely identify a single individual.
16 Marks Questions
1. Explain cloud infrastructure security in detail.
2. Explain the Identity and Access Management architecture in detail.
3. List the key privacy issues in the cloud and explain each in detail.
4. Explain the trust models for the Grid security environment and their challenges.