Infrastructure architecture essentials, Part 3: System design methods for scaling

Sam Siewert (siewerts@colorado.edu), Principal Software Architect/Adjunct Professor, University of Colorado

Summary: In an ideal world, all systems would have linear scaling of all resources with linear cost, but this is rarely the case. Cost may include not only capital expenditures but also operational costs for increased cooling, power, rack space, and management requirements. System designers and solution architects who plan ahead for scaling can at least control cost, make initial trade-offs for the long term, and provide mostly linear scaling with similar increases in capital and operating costs. Choosing the right scaling strategy up front, ranging from simple client/server to clusters to grid, cloud, or general Internet services, is critical. This article arms systems designers and solution architects with methods for success.

Date: 14 Oct 2008
Level: Introductory

Scaling is most often thought of as the ability to expand services, increase access to data, or add client load. The ability to handle more clients by providing more services and data access is most often achieved by scaling server-side processor, input/output (I/O), memory, and storage. But this article, the third installment in this series on infrastructure architecture, looks at alternative architectures and considers how scaling fits into new paradigms such as grid and cloud computing. Too often, organizations overlook the costly operational expenditures associated with scaling, including power, cooling, and rack space. Furthermore, good preparation for scaling can help eliminate I/O, processor, memory, or storage bottlenecks, the topic of the second article in this series. Resources such as power are not discussed in this article, but the Resources section provides links to more information on that topic.

Scaling beyond a server with clients requires strategies that include clustering (both processor and file systems), grid computing, and cloud computing, and it is generally bounded at the upper end by ubiquitous Internet services that can meet rapidly changing public demand. This article's first focus is on planning ahead to determine scaling bounds and on strategies for scaling that come as close as possible to linear and unbounded. Second, it looks at each major service resource (processor, I/O, memory, and storage) for server-level, cluster, grid, and cloud computing software as a service (SaaS) and hardware as a service (HaaS) scaling, along with strategies to scale and balance each. The potential breadth and depth of generalized scaling strategies for servers, clusters, grids, and cloud computing is immense, but this article provides concrete examples as well as pointers to greater detail so that you can tackle difficult problems on large systems at the infrastructure level.

Principles and goals for linear scaling systems

A linear scaling system requires a capable compute node, such as an IBM System x server or BladeCenter system, that provides symmetric multiprocessing (SMP) scaling for processor and memory coupled with sufficient I/O scaling for clustering, storage networks, management, and client-access networking.
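It helps to pin down what "linear" means in measurable terms before committing to hardware. The short Python sketch below is one minimal way to do that, assuming you have (or will collect) service throughput measurements at several node counts; the function name and the throughput figures are hypothetical and serve only to illustrate the calculation. Scaling efficiency near 100 percent indicates close-to-linear scaling, while falling efficiency flags a bottleneck or growing coordination overhead.

Listing 1. Estimating scaling efficiency from measured throughput (illustrative sketch)

# Rough scaling-efficiency check: how close to linear is a scale-out plan?
# All throughput numbers below are illustrative assumptions, not measurements.

def scaling_efficiency(throughput_by_nodes):
    """Return (nodes, speedup, efficiency) tuples relative to the 1-node baseline.

    Expects a dict mapping node count -> measured throughput, including an
    entry for a single node to serve as the baseline.
    """
    baseline = throughput_by_nodes[1]
    results = []
    for nodes, throughput in sorted(throughput_by_nodes.items()):
        speedup = throughput / baseline
        results.append((nodes, speedup, speedup / nodes))
    return results

if __name__ == "__main__":
    # Hypothetical measured service throughput (requests/sec) at each cluster size
    measured = {1: 950, 2: 1840, 4: 3500, 8: 6300}
    for nodes, speedup, eff in scaling_efficiency(measured):
        print(f"{nodes} node(s): speedup {speedup:.2f}, efficiency {eff:.0%}")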
Looking forward to larger scale systems with broader scope and client access, it is most important that the basic compute and storage subsystems be carefully designed for expansion:

- Hierarchical scaling: Cloud computing centers potentially composed of grids, high performance computing (HPC) clusters, or SMP servers, with client access at all levels
- Service-Oriented Architecture (SOA): Careful consideration of the coupling of computations and data access for applications, along with client access networks
- High speed interconnect (HSI) cluster networks: Networks such as InfiniBand, Myrinet, and 10GE
- Scalable storage access through a storage area network (SAN): 8G Fibre Channel, Internet Small Computer System Interface (iSCSI) over InfiniBand or 10GE, or Fibre Channel over Ethernet/converged network adapters (FCoE/CNA), with protocol-offloading host bus interfaces
- SMP compute nodes: Nodes with sufficient processor, memory, and I/O channel expansion for storage, cluster, management, and client networks
- Scalable file systems: Network-attached storage (NAS) head or gateway designs with parallel file systems that scale with storage and the number of clients, along with network file system (NFS) protocol acceleration technologies like remote direct memory access (RDMA)
- Green factor: The scalable power, cooling, and rack density of each subsystem selected
- Geo-scaling: Using switch uplinks to dark fiber with dense wavelength division multiplexing (DWDM) add/drop multiplexers

Terascale, petascale, and exascale challenges

Along with operational costs, management of resources is perhaps the biggest problem in scaling. Future systems will have to provide more autonomic features, including self-configuring, self-healing, self-optimizing, and self-protecting (self-CHOP) behavior, to reduce IT costs for systems radically scaled out in compute capability and number of clients. Likewise, the green aspect of the basic building blocks, including power, cooling, and rack density, will become increasingly important. To date, the focus has often been on acquisition cost rather than on the total cost of ownership (TCO) and the cost of providing services.

Skills and competencies: planning for growth up front

As shown in Figure 1, systems can scale from simple client computers to larger SMP servers. Client computers connect to clusters of SMP servers that can divide and conquer algorithms, given their computing power. Clusters of SMP servers can likewise provide concurrent services to grid systems. Built on cluster and grid computing, centralized cloud computing services on the Internet are growing rapidly. Cloud computing might be as simple as a shared calendar and not require HPC clusters, but access to HPC over the Internet is a growing area of interest in both the academic and business computing worlds (see Resources). The clusters and SMP servers in a grid should themselves scale in terms of processor, I/O, memory, and storage access so that they can accommodate application hosting goals for services in the grid or cloud computing center.

Perhaps the best place to start is by asking yourself a series of scaling questions, such as:

- To what extent can processor, memory, and I/O channels be expanded on each SMP compute node?
- How will storage access, client network access, and cluster network synchronization and data sharing be balanced for applications in a given cluster?
- How will client networks and management scale for multiple servers and clusters?
- Will services be made available to a larger number of clients, necessitating grid management?
- Will there be value in opening up services to the public?

Figure 1. Examples of scaling and application coupling

Note that Figure 1 shows IBM System x3650 nodes and Fibre Channel SAN-attached DS4800 storage as an example of a building block for going from an SMP server to a cluster of System x3650 servers with DS4800 SAN storage and a parallel file system such as the General Parallel File System (GPFS). Clusters designed this way can be placed in a grid; for higher density, a BladeCenter system might be considered, either in addition to the rack servers or to rehost the clustered services. A grid provides coordinated management, security, and client/user management tools to simplify the IT work associated with a large number of clients in an SOA.

Tools and techniques: estimating workload and scaling

A detailed overview of I/O, processor, and memory performance tools was provided in the second article in this series. One of the best ways to estimate workload is simply to run applications and, in a simple SOA, to plan on scaling based on the number of clients expected to run each type of service. Good scaling benchmark tools emulate client service requests using threaded or asynchronous workloads and can be scaled and run directly on compute nodes or clusters or from a high performance client-emulator node. For example, a cluster might have a bonded 2 x 10 Gigabit Ethernet interface on a NAS head serving 24 gigabit Ethernet NAS clients.

Figure 2 shows a basic I/O, processing, and memory scaling model. One of the most significant drivers that may not be immediately apparent is the extent to which host bus adapters (HBAs) for storage, host channel adapters (HCAs) for cluster interconnect, and network adapters for client/management networks offload protocol processing; or, put another way, how much host-node loading does each of these I/O interfaces and its stack place on the compute node?

Figure 2. SMP node I/O scaling and workload considerations

Server scaling

Server scaling requires detailed knowledge of the processor complex, the processor-I/O-memory bus architecture, and the host channel design. For example, in Figure 2, if the System x3650 server is used as a compute node, this system has two gen1 x8 and two gen1 x4 PCI-e I/O channels along with on-board dual gigabit Ethernet interfaces, a redundant array of independent disks (RAID) controller, and several high availability features. So, with the System x3650, one possible configuration would be a two-port 10GE network adapter in one x8 slot for the HSI, a two-port 8G Fibre Channel HBA in the other x8 slot, and a one-port 10GE network adapter in each of the two x4 PCI-e slots for the client network interface. This configuration provides 20Gbps full duplex to the client uplink, 16Gbps to SAN storage, and 20Gbps to the HSI for clustering, and it uses the two built-in gigabit Ethernet interfaces as redundant management interfaces. It leaves very little of the 60Gbps available across the 24 gen1 2.5Gbps PCI-e lanes unused: the slot adapters account for 56Gbps (20Gbps client, 16Gbps SAN, and 20Gbps cluster HSI), with the on-board gigabit Ethernet (2Gbps) and the internal RAID controller (roughly 12Gbps) adding further I/O load.
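The accounting above can be captured in a few lines of code. The Python sketch below checks an adapter plan against per-slot PCI-e bandwidth and reports how much headroom remains; the slot widths and adapter figures mirror the System x3650 example in this section, but treat them as illustrative inputs rather than an authoritative specification for any particular server.

Listing 2. Back-of-envelope PCI-e slot bandwidth accounting (illustrative sketch)

# Check an I/O adapter plan against per-slot PCI-e bandwidth, per direction.
# Slot widths and adapter demands are illustrative, drawn from the example above.

GEN1_LANE_GBPS = 2.5  # raw gen1 PCI-e line rate per lane, per direction

slots = {  # slot name -> lane count
    "x8-A": 8, "x8-B": 8, "x4-A": 4, "x4-B": 4,
}
adapters = {  # slot name -> (adapter description, required Gbps per direction)
    "x8-A": ("dual-port 10GE HSI (cluster)", 20.0),
    "x8-B": ("dual-port 8G Fibre Channel HBA (SAN)", 16.0),
    "x4-A": ("single-port 10GE (client)", 10.0),
    "x4-B": ("single-port 10GE (client)", 10.0),
}

total_capacity = 0.0
total_used = 0.0
for slot, lanes in slots.items():
    capacity = lanes * GEN1_LANE_GBPS
    name, needed = adapters[slot]
    status = "OK" if needed <= capacity else "OVERSUBSCRIBED"
    print(f"{slot}: {name:<40} {needed:5.1f} / {capacity:5.1f} Gbps  {status}")
    total_capacity += capacity
    total_used += needed

print(f"Slot bandwidth used per direction: {total_used:.1f} of {total_capacity:.1f} Gbps "
      f"({total_used / total_capacity:.0%})")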
Making sure there is sufficient bandwidth from the I/O interfaces into the processor complex is a great start, and it is nothing more than accounting. Note carefully that PCI-e, gigE/10GE, and Fibre Channel are all full duplex transports, so they are capable of simultaneous transmit and receive data transfers. As such, the raw PCI-e bandwidth in this system approaches 120Gbps in full duplex operation, which significantly exceeds the memory bandwidth, as you'll see.

Skills and competencies: planning for server scaling

Digging deeper than I/O channels and the configuration of HBAs, HCAs, and network adapters to best use that bandwidth, you must also look at memory bandwidth, latency, and processor scaling. Memory bandwidth is a critical scaling parameter and can become a bottleneck, because messages and I/O buffers are typically stored, processed, and forwarded through main memory. In the System x3650 example, the User's Guide states that the server supports 12 fully buffered PC2-5300 DIMMs; PC2-5300 is DDR2-667, with a 6 ns cycle time and a 333MHz I/O bus, capable of 667 million data transfers per second, or 5.333GB/sec. That is roughly 43Gbps of data bandwidth, which is on the same order as the node's half-duplex I/O capability (roughly 48Gbps of payload once gen1 8b/10b encoding is subtracted from the 60Gbps raw line rate). Clearly, with this system and careful planning, you're likely to use all the memory capability. Ideally, it will be the bottleneck, assuming you keep the I/O channels near saturation at half duplex, keep code mostly running out of cache, and DMA mostly directly into and out of memory-mapped kernel buffers.

Tools and techniques: don't leave bandwidth, processor, or memory on the table

IBM offers several tools and documents for sizing its BladeCenter systems and System x servers (see Resources). Measuring actual memory bandwidth as well as consulting the specification is useful, and tools for this, along with catalogues of measurements, can be found on Dr. Bandwidth's Web page (see Resources). Processor scaling is best measured by benchmarking the most complex algorithms that will operate on data entering and leaving memory at line rate, sized by the number of threads or asynchronous I/O and processing contexts that the services must provide for clients. There really is no substitute for running at least the core algorithms on the proposed SMP node to estimate processor requirements. Many tools exist for analyzing the results, including profiling tools like VTune and basic processor-load monitoring tools (see Resources).

Cluster scaling

As shown in Figure 2, sufficient cluster bandwidth to and from each SMP node is required for message passing and synchronization in parallel computations as well as for parallel file systems like the IBM General Parallel File System (see Resources). One of the key decisions is the cluster file system that you'll use. The IBM white paper "An Introduction to GPFS Version 3.2" provides an excellent overview of both SAN clusters and NAS head/gateway client clustered system configurations.
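To make the message-passing style of compute scaling concrete, the Python sketch below splits a simple numerical integration across cluster ranks and merges the partial sums with a reduction. It assumes the mpi4py package and an MPI runtime are available and would be launched with something like mpirun -n 4 python pi_estimate.py; the integrand and the file name are hypothetical choices for illustration only.

Listing 3. Dividing a computation across cluster nodes with message passing (illustrative sketch)

# Minimal divide-and-conquer sketch using MPI message passing (mpi4py).
# Each rank integrates a slice of f(x) = 4 / (1 + x^2) over [0, 1]; the
# partial sums are merged with a reduction, approximating pi.
from mpi4py import MPI

def f(x):
    return 4.0 / (1.0 + x * x)

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n = 1_000_000            # total subintervals shared across all ranks
h = 1.0 / n
# Each rank handles every size-th subinterval (a simple cyclic decomposition)
local_sum = sum(f((i + 0.5) * h) for i in range(rank, n, size)) * h

# Merge the intermediate results on rank 0
pi_estimate = comm.reduce(local_sum, op=MPI.SUM, root=0)
if rank == 0:
    print(f"pi ~= {pi_estimate:.10f} using {size} rank(s)")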
Clusters are built for numerous reasons, including:

- Compute scaling: Breaking algorithms into subparts, computing intermediate results, and merging results from numerous SMP nodes through message passing and distributed synchronization mechanisms
- High availability: Replication of NAS file services for clients to ensure access to stored data despite potential server downtime
- I/O scaling: Increasing I/O bandwidth for I/O-intensive applications that interface with SAN RAID
- Client service scaling: Simply handling more concurrent client service requests from one cluster

The skills, competencies, tools, and techniques for cluster scaling go beyond the scope of this article, as do grid scaling and cloud computing center scaling. However, good practices at the SMP compute node level provide a good staging point for these higher-order scaling architectures. Once you've identified which of the cluster scaling goals enumerated previously apply, you can find numerous resources to assist with cluster scaling in Resources.

Grid and cloud scaling

As shown in Figure 1, clusters provide I/O, processor, and storage scaling but generally don't prescribe management scaling methods, client-side management methods, or security. Scaling the full infrastructure, including the many services that may be hosted on multiple clusters or SMP servers and the clients that use those services, is the domain of grid computing. Grid computing is concerned with:

- Resource virtualization: For storage, networks, and processors through virtual disk arrays, network interface multipath management, and virtual machines
- User interface portals: Including definitions for secure Web access such as Web Services Description Language (WSDL) and Simple Object Access Protocol (SOAP)
- System management: Including provisioning and autonomic features for the management of IT assets

Complete coverage of grid scaling is not possible in this article, but the IBM Research Journal provides in-depth studies, and many grid tools are available from IBM as well (see Resources). Likewise, cloud computing, a relatively new but rapidly growing architecture, is beyond the scope of this article. However, the basic concept of cloud computing is to provide HaaS and SaaS, enabled by building well-designed SMP and cluster servers in grid architectures that make generally useful applications available to users over the Web. For example, everything from shared calendars, code version control and management, and e-mail to environments for social networking has come out of Web-enabled applications. This trend is growing and is starting to include even HPC applications.

Green factors

The cost of scaling is not only the capital expenditure to add new processing capability, I/O channels, memory, storage, or networking, but also the cost to power, cool, host, and manage these new resources. Although grid architecture and autonomic computing can help with management scaling, the green factor of subsystems and components is critical for keeping operational expenditures lower. Several trends are helping, including lower-power storage using solid state disk (SSD) flash drives and small form factor disk arrays. Likewise, the relentless pursuit of higher clock rate processors has given way to more cores with better SMP scaling designs and methods for clustering nodes. Both are helping keep costs down.
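As a rough illustration of why operating costs deserve as much attention as acquisition cost, the Python sketch below compares two hypothetical node choices using a simple capital-plus-electricity model, folding cooling overhead into a power usage effectiveness (PUE) factor. Every price, wattage, and the PUE value are assumptions chosen for illustration; substitute your own figures and planning horizon.

Listing 4. A rough capex-plus-power TCO comparison (illustrative sketch)

# Rough total-cost-of-ownership comparison: acquisition cost versus operating
# cost (power and cooling) over the life of a system. All prices, wattages,
# and the PUE figure below are illustrative assumptions.

def tco(capex_usd, node_watts, nodes, years, usd_per_kwh=0.10, pue=1.8):
    """Capex plus electricity for compute and cooling over the planning horizon."""
    hours = years * 365 * 24
    energy_kwh = node_watts * nodes / 1000.0 * hours * pue  # PUE folds in cooling overhead
    return capex_usd + energy_kwh * usd_per_kwh

if __name__ == "__main__":
    # Hypothetical choice: cheaper, hotter nodes versus pricier, cooler ones
    option_a = tco(capex_usd=80_000, node_watts=650, nodes=16, years=4)
    option_b = tco(capex_usd=95_000, node_watts=450, nodes=16, years=4)
    print(f"Option A 4-year TCO: ${option_a:,.0f}")
    print(f"Option B 4-year TCO: ${option_b:,.0f}")

With these particular assumptions, the more expensive but lower-power option comes out ahead on four-year TCO, which is exactly the kind of result that a purely initial-expense view misses.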
Most often, customers design based on initial expense, such as cost per Gb, rather than on power and performance cost measures, which become more significant as systems are scaled up and operated over the long term. In addition, management and TCO are important considerations, along with green factors.

Conclusion

Scaling is a planning exercise that requires estimation of future needs, budgets, and the trade-offs between initial cost and long-term operational costs. Most systems don't have HPC requirements that are difficult to meet; rather, they have simple client and service scaling needs. With careful analysis and selection of the base SMP nodes and cluster design, you can scale these systems effectively and with minimal waste, provided resources are balanced initially and kept in balance. Don't overlook the scaling of management, either; grid computing provides some great resources for keeping the IT burden from scaling up along with the systems.

Resources

Learn

- See the IBM High Performance On Demand Solutions site for the latest on cloud computing and SOA scaling.
- The IBM Research Journal on grid computing provides an excellent, in-depth review of what grid can do for scaling management and user access to large-scale computing resources.
- See the IBM HPC Cluster solutions site for guidance on cluster solutions.
- See the IBM white paper An Introduction to GPFS Version 3.2 for an overview of SAN clusters and NAS head/gateway client-clustered system configurations.
- For tools and information on compute node sizing for SMP servers or clusters, see the Configuration tools page at IBM.
- See the IBM System x3650 User's Guide.
- The IBM General Parallel File System (GPFS) runs on both the Linux and IBM AIX SMP operating systems for cluster and grid scaling.
- See Part 2 of this series for a more detailed overview of tools you can use to benchmark systems to verify specifications and to find bottlenecks.
- Check out the fourth installment in the "Big Iron" series, "Power, cooling, and performance: Find the right balance" (developerWorks, Sam Siewert, 17 May 2005), for more information about power consumption and related topics.
- Read the developerWorks series Cloud computing with Amazon Web services by Prabhakar Chaganti for more information on cloud computing.
- Performance Tuning for Linux Servers by Sandra K. Johnson, Gerrit Huizenga, and Badari Pulavarty (IBM Press, 2005) provides a great in-depth look at performance analysis, tuning methods, and workload generators for Linux.
- Browse the technology bookstore for books on these and other technical topics.
- Get the RSS feed for this series.

Get products and technologies

- Download IBM product evaluation versions and get your hands on application development tools and middleware products from DB2, Lotus, Rational, Tivoli, and WebSphere.
- Check out System x and BladeCenter grid computing solutions from IBM.
- For memory bandwidth measurements and tools, see Dr. Bandwidth's Web page.
- Check out Raptor 10GE/gigE switches, which employ a customized 10GE layer-2 protocol to replace trunking and provide simple solutions for using dark fiber to expand clusters over large distances.

Discuss

- Check out developerWorks blogs and get involved in the developerWorks community.

About the author
Dr. Sam Siewert is a systems and software architect who has worked in the aerospace, telecommunications, digital cable, and storage industries. He also teaches at the University of Colorado at Boulder in the Embedded Systems Certification Program, which he co-founded in 2000. His research interests include high-performance computing, broadband networks, real-time media, distance learning environments, and embedded real-time systems.