2 Concepts and Products
Applies To: Windows HPC Server 2008
We assume that readers may not be familiar with every concept discussed in the remaining chapters in both Linux and Windows® environments. Therefore, this chapter introduces the technologies (Master Boot Record, dual-boot, virtualization and Pre-boot eXecution Environment) and products (Bull Advanced Server for Xeon, Windows HPC Server 2008 and PBS Professional) mentioned in this document.
If you are already familiar with these concepts or are more interested in general Hybrid OS Cluster (HOSC) considerations, you may want to skip this chapter and go directly to Chapter 3.
2.1 Master Boot Record (MBR)
The 512-byte boot sector is called the Master Boot Record (MBR). It is the first sector of a partitioned data storage device such as a hard disk. The MBR is usually overwritten by operating system (OS) installation procedures; the MBR previously written on the device is then lost.
The MBR includes the partition table of the 4 primary partitions and a bootstrap code that can start the OS or load and run the boot loader code (see the complete MBR structure in Table 3 of Appendix C.1). A partition is encoded as a 16-byte structure with size, location and characteristic fields. The first 1-byte field of the partition structure is called the boot flag.
The Windows MBR starts the OS installed on the active partition. The active partition is the first primary partition that has its boot flag enabled. You can select an OS by activating the partition where it is installed. The diskpart.exe (Windows) and fdisk (Linux) tools can be used to change which partition is active. Appendix D.1.3 and Appendix D.2.3 give examples of commands that enable/disable the boot flag.
The Linux MBR can run a boot loader (e.g., GRUB or LILO). You can then select an OS interactively from its user interface at the console. If no choice is made at the console, the OS selection is taken from the boot loader configuration file, which you can edit in advance of a reboot (e.g., grub.conf for the GRUB boot loader). If necessary, the Linux boot loader configuration file (which resides on a Linux partition) can be replaced from a Windows command line with the dd.exe tool.
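As a minimal sketch of these OS-selection mechanisms (device and partition numbers are placeholders, not taken from the referenced appendices):

    # Linux: toggle the boot flag of partition 2 on the first disk with fdisk
    # (interactive commands: a, 2, w)
    fdisk /dev/sda

    # Windows: activate partition 2 of disk 0 with diskpart.exe
    diskpart
    DISKPART> select disk 0
    DISKPART> select partition 2
    DISKPART> active
    DISKPART> exit

    # GRUB: select the OS started at the next reboot by editing the boot
    # loader configuration file (e.g., /boot/grub/grub.conf)
    default=1        # boot the second "title" entry listed in grub.conf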
Appendix C.2 explains how to save and restore the MBR of a device. It is very important to understand how the MBR works in order to properly configure dual-boot systems.
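For instance, with the Linux dd command (a sketch, assuming the first disk is /dev/sda; Appendix C.2 gives the reference procedure):

    # Save the 512-byte MBR (bootstrap code + partition table) to a file
    dd if=/dev/sda of=/tmp/sda.mbr bs=512 count=1

    # Restore the complete MBR from that file
    dd if=/tmp/sda.mbr of=/dev/sda bs=512 count=1

    # Restore only the bootstrap code (first 446 bytes), keeping the
    # current partition table untouched
    dd if=/tmp/sda.mbr of=/dev/sda bs=446 count=1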
2.2 Dual-boot
Dual-booting is an easy way to have several operating systems (OSs) on a node. When one OS is running, it does not interact with the other installed OSs, so the native performance of the node is not affected by the dual-boot feature. The only limitation is that these OSs cannot run simultaneously.
When designing a dual-boot node, the following points should be analyzed:
The choice of the MBR (and choice of the boot loader if applicable)
The disk partition restrictions (for example, Windows must have a system partition on at least one primary partition of the first device)
The compatibility with Logical Volume Managers (LVM). For example, the RHEL5.1 LVM creates by default a logical volume spanning the entire first device, which makes it impossible to install a second OS on that device.
When booting a computer, the dual-boot feature gives you the ability to choose which of the OSs installed on that computer to start. At boot time, the way you select the OS of a node depends on the installed MBR. A dual-boot method that relies on the Linux MBR and GRUB is described in [1]. Another dual-boot method, which exploits the properties of active partitions, is described in [2] and [3].
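As an illustration of the GRUB-based method, the grub.conf of a dual-boot node could look like the sketch below (partition layout, kernel version and file names are assumptions, not taken from [1], [2] or [3]):

    default=0
    timeout=10

    title XBAS (RHEL5) Linux
            root (hd0,1)
            kernel /vmlinuz-2.6.18-53.el5 ro root=LABEL=/
            initrd /initrd-2.6.18-53.el5.img

    title Windows HPC Server 2008
            rootnoverify (hd0,0)
            chainloader +1

Changing the default line (or, with the other method, activating the corresponding partition) is enough to select the OS started at the next reboot.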
2.3 Virtualization
Virtualization is used to hide the physical characteristics of computers and present only a logical abstraction of these characteristics. Virtual Machines (VMs) are created by the virtualization software: each VM has virtual resources (CPUs, memory, devices, network interfaces, etc.) whose characteristics (quantity, size, etc.) are independent of those available on the physical server. The OS installed in a VM is called a guest OS: the guest OS can only access the virtual resources available in its VM. Several VMs can be created and run on one physical node. These VMs appear as physical machines to the applications, the users and the other nodes (physical or virtual).
Virtualization is interesting in the context of our study for two reasons:
It makes it possible to install several management nodes (MN) on a single physical server. This is an important point for installing several OSs on a cluster without increasing its cost with an additional physical MN server.
It provides a fast and rather easy way to switch from one OS to another: a VM running one OS is started while the VM running the other OS is suspended, as sketched below.
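As a sketch of such a switch with Xen's xm tool (domain names and file paths are assumptions):

    # Suspend the VM hosting the Windows head node to disk...
    xm save winhpc-hn /var/lib/xen/save/winhpc-hn.save

    # ...and resume the VM hosting the Linux management node
    xm restore /var/lib/xen/save/xbas-mn.save

    # Alternatively, pause/unpause the VMs if both can stay in memory
    xm pause winhpc-hn
    xm unpause xbas-mn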
A hypervisor is a software layer that runs directly on the hardware, at a higher privilege level than the operating systems. The virtualization software runs in a privileged partition (domain 0, or dom0), from where it controls how the hypervisor allocates resources to the virtual machines. The other domains, where the VMs run, are called unprivileged domains and are denoted domU. A hypervisor normally enforces scheduling policies and memory boundaries. In some Linux implementations it also provides access to hardware devices through its own drivers; the Windows hypervisor does not.
The virtualization software can be:
Host-based (like VMware): this means that the virtualization software is installed on a physical server with a classical OS called the host OS.
Hypervisor-based (like Windows Server® 2008 Hyper-V™ and Xen): in this case, the hypervisor runs at a lower level than the OS. The “host OS” becomes just another VM that is automatically started at boot time. Such a virtualization architecture is shown in Figure 1.
Figure 1 Overview of hypervisor-based virtualization architecture
“Full virtualization” is an approach that requires no modification of the hosted operating system, providing the illusion of a complete set of real hardware devices. Such Hardware Virtual Machines (HVM) require hardware support, provided for example by Intel® Virtualization Technology (Intel® VT) and AMD-V. Recent Intel® Xeon® processors support full virtualization thanks to Intel® VT. “Para-virtualization” is an approach that requires modification of the operating system in order to run in a VM. Windows is only supported on fully-virtualized VMs, not on para-virtualized VMs.
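A quick way to check whether a node can host HVM guests, in a Linux/Xen environment (a sketch; the output depends on the hardware):

    # Intel VT (vmx flag) or AMD-V (svm flag) must appear in the CPU flags
    egrep -o 'vmx|svm' /proc/cpuinfo | sort -u

    # Once the Xen hypervisor is booted, the HVM capability should be listed
    xm info | grep xen_caps        # e.g., hvm-3.0-x86_64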
The market provides many virtualization software packages among which:
Xen [6]: free software for Linux, included in the RHEL5 distribution, which allows a maximum of 8 virtual CPUs per virtual machine (VM). Oracle VM and Sun xVM Server are commercial implementations based on Xen.
VMware [7]: commercial software for Linux and Windows which allows a maximum of 4 virtual CPUs per VM.
Hyper-V [8]: a solution provided by Microsoft which only works on Windows Server 2008 and allows only 1 virtual CPU per VM for non-Windows VMs.
PowerVM [9] (formerly Advanced POWER Virtualization): an IBM solution for UNIX and Linux on most processor architectures that does not support Windows as a guest OS.
Virtuozzo [10]: a Parallels, Inc. solution designed to deliver near-native performance. It only supports VMs that run the same OS as the host OS (i.e., Linux VMs on Linux hosts and Windows VMs on Windows hosts).
OpenVZ [11]: an operating system-level virtualization technology licensed under GPL version 2. It is the basis of Virtuozzo [10]. It requires both the host and guest OS to be Linux, possibly of different distributions. It has a low performance penalty compared to a standalone server.
2.4 PXE
The Pre-boot eXecution Environment (PXE) is an environment for booting computers over a network interface, independently of available data storage devices or installed OSs. The end goal is to allow a client to boot from the network and receive a network boot program (NBP) from a network boot server.
In a network boot operation, the client computer will:
Obtain an IP address to gain network connectivity: when a PXE-enabled boot is initiated, the PXE-based ROM requests an IP address from a Dynamic Host Configuration Protocol (DHCP) server using the normal DHCP discovery process (see the detailed process in Figure 2). It receives from the DHCP server an IP address lease, together with information about the correct boot server and the correct boot file.
Discover a network boot server: with the information from the DHCP server, the client establishes a connection to the PXE servers (TFTP, WDS, NFS, CIFS, etc.).
Download the NBP file from the network boot server and execute it: the client uses Trivial File Transfer Protocol (TFTP) to download the NBP. Examples of NBP are: pxelinux.0 for Linux and WdsNbp.com for Windows Server.
When booting a compute node with PXE, the goal can be to install or run it with an image deployed through the network, or just to run it with an OS installed on its local disk. In the latter case, the PXE server simply answers the compute node's request by indicating that it must boot from the next boot device listed in its BIOS.
Figure 2 DHCP discovery process
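The sketch below illustrates these mechanisms with an ISC DHCP excerpt and a pxelinux configuration for the local-boot case (addresses and file names are placeholders):

    # /etc/dhcpd.conf excerpt: give PXE clients the boot server and the NBP
    subnet 192.168.0.0 netmask 255.255.255.0 {
        range 192.168.0.10 192.168.0.200;
        next-server 192.168.0.1;        # TFTP boot server
        filename "pxelinux.0";          # NBP downloaded by the client
    }

    # pxelinux configuration: boot from the next boot device listed in the BIOS
    DEFAULT local
    LABEL local
        LOCALBOOT 0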
2.5 Job schedulers and resource managers in an HPC cluster
In an HPC cluster, a resource manager (aka Distributed Resource Management System (DRMS) or Distributed Resource Manager (DRM)) gathers information about all cluster resources that can be used by application jobs. Its main goal is to give accurate resource information about the cluster usage to a job scheduler.
A job scheduler (aka batch scheduler or batch system) is in charge of unattended background executions. It provides a user interface for submitting, monitoring and terminating jobs. It is usually responsible for optimizing job placement on the cluster nodes. For that purpose it deals with resource information, administrator rules and user rules: job priority, job dependencies, resource and time limits, reservation, specific resource requirements, parallel job management, process binding, etc. Over time, job schedulers and resource managers have evolved in such a way that they are now usually integrated under a single product name. Noteworthy products include:
PBS Professional [12]: supported by Altair for Linux/Unix and Windows
Torque [13]: an open source job scheduler based on the original PBS project. It can be used as a resource manager by other schedulers (e.g., Moab workload manager).
SLURM (Simple Linux Utility for Resource Management) [14]: freeware and open source
LSF (Load Sharing Facility) [15]: supported by Platform for Linux/Unix and Windows
SGE (Sun Grid Engine) [16]: supported by Sun Microsystems
OAR [17]: freeware and open source for Linux, AIX and SunOS/Solaris
Microsoft Windows HPC Server 2008 job scheduler: included in the Microsoft HPC pack [5]
2.6 Meta-Scheduler
According to Wikipedia [18], “Meta-scheduling or Super scheduling is a computer software technique of optimizing computational workloads by combining an organization's multiple Distributed Resource Managers into a single aggregated view, allowing batch jobs to be directed to the best location for execution”. In this paper, we consider that the meta-scheduler is able to submit jobs on cluster nodes with heterogeneous OS types and that it can automatically switch the OS type of these nodes when necessary (to optimize computational workloads). Here is a partial list of meta-schedulers currently available:
Moab Grid Suite and Maui Cluster scheduler [19]: supported by Cluster Resources, Inc.
GridWay [20]: a Grid meta-scheduler by the Globus Alliance
CSF (Community Scheduler Framework) [21]: an open source framework (an add-on to the Globus Toolkit v.3) for implementing a grid meta-scheduler, developed by Platform Computing
Recent job schedulers can sometimes be adapted and configured to behave as “simple” meta-schedulers.
2.7 Bull Advanced Server for Xeon
Description
Bull Advanced Server for Xeon (XBAS) is a robust and efficient Linux solution that delivers total cluster management. It addresses each step of the cluster lifecycle with a centralized administration interface: installation, fast and reliable software deployments, topology-aware monitoring and fault handling (to dramatically lower time-to-repair), cluster optimization and expansion. Integrated, tested and supported by Bull [4], XBAS federates the very best of Open Source components, complemented by leading software packages from well-known Independent Software Vendors, and gives them a consistent view of the whole HPC cluster through a common cluster database: the clusterdb. XBAS is fully compatible with standard Red Hat Enterprise Linux (RHEL). The latest Bull Advanced Server for Xeon 5 release (v3.1) is based on RHEL5.3.
Note
The Bull Advanced Server for Xeon 5 release that was used to illustrate examples in this paper is v1.1 based on RHEL5.1 because this was the latest release when we built the first prototypes in May 2008.
Cluster installation mechanisms
The installation of an XBAS cluster starts with the setup of the management node (see the installation and configuration guide [22]). The compute nodes are then deployed with automated tools.
BIOS settings must be set so that XBAS compute nodes boot from the network with PXE by default. The PXE files stored on the management node indicate whether a given compute node should be installed (i.e., its DEFAULT label is ks) or whether it is ready to run (i.e., its DEFAULT label is local_primary), as sketched below.
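For illustration, such a PXE file could follow the usual pxelinux layout sketched here (the exact content generated by XBAS may differ; kernel names, the NFS server address and the kickstart path are assumptions):

    # /tftpboot/C0A80002 (file named after the hexadecimal IP of the node)
    DEFAULT ks                      # switched to local_primary once installed

    LABEL ks                        # install the node with a kickstart over NFS
        KERNEL vmlinuz
        APPEND initrd=initrd.img ks=nfs:192.168.0.1:/release/ks/kickstart

    LABEL local_primary             # normal operation: boot from the local disk
        LOCALBOOT 0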
In the first case, a new OS image should be deployed.
During the PXE boot process, the operations to be executed on the compute node are written in the kickstart file. Tools based on PXE are provided by XBAS to simplify the installation of compute nodes. The “preparenfs” tool writes the configuration files with the information given by the administrator and with the information found in the clusterdb. The generated configuration files are: the PXE files (e.g., /tftpboot/C0A80002), the DHCP configuration file (/etc/dhcpd.conf), the kickstart file (e.g., /release/ks/kickstart) and the NFS export file (/etc/exportfs). No user interface access (remote or local) to the compute node is required during its installation with the preparenfs tool. Figure 3 shows the sequence of interactions between a new XBAS compute node being installed and the servers running on the management node (DHCP, TFTP and NFS). On small clusters, the “preparenfs” tool can be used to install every CN. On large clusters, the ksis tool can be used to optimize the total deployment time of the cluster by cloning the first CN installed with the “preparenfs” tool.
In the second case, the CN is already installed and just needs to boot from its local disk. Figure 4 shows the normal boot scheme of an XBAS compute node.
Figure 3 XBAS compute node PXE installation scheme
Figure 4 XBAS compute node PXE boot scheme
2.8 Windows HPC Server 2008
Description
Microsoft Windows HPC Server 2008 (HPCS), the successor to Windows Compute Cluster Server (WCCS) 2003, is based on the Windows Server 2008 operating system and is designed to increase productivity, scalability and manageability. This new name reflects Microsoft HPC's readiness to tackle the most challenging HPC workloads [5]. HPCS includes key features, such as new high-speed networking, highly efficient and scalable cluster management tools, advanced failover capabilities, a service-oriented architecture (SOA) job scheduler, and support for partners’ clustered file systems. HPCS gives access to an HPC platform that is easy to deploy, operate, and integrate with existing enterprise infrastructures.
Cluster installation mechanisms
The installation of a Windows HPC cluster starts with the setup of the head node (HN). For the deployment of a compute node (CN), HPCS uses Windows Deployment Services (WDS), which fully installs and configures HPCS and adds the new node to the set of Windows HPC compute nodes. WDS is a deployment tool provided by Microsoft and the successor of Remote Installation Services (RIS); it handles the entire compute node installation process and acts as a TFTP server.
During the first installation step, Windows Preinstallation Environment (WinPE) is the boot operating system. It is a lightweight version of Windows Server 2008 that is used for the deployment of servers. It is intended as a 32-bit or 64-bit replacement for MS-DOS during the installation phase of Windows, and can be booted via PXE, CD-ROM, USB flash drive or hard disk.
BIOS settings should be set so that HPCS compute nodes boot from the network with PXE (we assume that a private network exists and that CNs send PXE requests there first). From the head node point of view, a compute node must be deployed if it doesn’t have any entry in Active Directory (AD), or if the cluster administrator has explicitly specified that it must be re-imaged. When a compute node with no OS boots, it first sends a DHCP request in order to get an IP address, a valid network boot server and the name of a network boot program (NBP). Once the DHCP server has answered, the CN downloads the NBP called WdsNbp.com from the WDS server. The purpose of this program is to detect the client architecture and to wait for further downloads from the WDS server.
Then, on the HPCS administration console of the head node, the new compute node appears as “pending approval”. The installation starts once the administrator assigns a deployment template to it. A WinPE image is sent to and booted on the compute node, files are transferred to prepare the Windows Server 2008 installation, and an unattended installation of Windows Server 2008 is performed. Finally, the compute node is joined to the domain and the cluster. Figure 5 shows the details of the PXE boot operations executed during the installation procedure.
If the CN has already been installed, the AD already contains the corresponding computer object, so the WDS server sends it an NBP called abortpxe.com, which boots the server from the next boot item in the BIOS without waiting for a timeout. Figure 6 shows the PXE boot operations executed in this case.
Figure 5 HPCS compute node PXE installation scheme
Figure 6 HPCS compute node PXE boot scheme
2.9 PBS Professional
This section presents PBS Professional, the job scheduler that we used as meta-scheduler for building the HOSC prototype described in Chapter 5. PBS Professional is part of the PBS GridWorks software suite. It is the professional version of the Portable Batch System (PBS), a flexible workload management system originally developed to manage aerospace computing resources at NASA. PBS Professional has since become a leader in supercomputer workload management and the de facto standard on Linux clusters. A few of the more important features of PBS Professional 10 are listed below:
Enterprise-wide Resource Sharing provides transparent job scheduling on any PBS system by any authorized user. Jobs can be submitted from any client system, both local and remote.
Multiple User Interfaces provides a traditional command line and a graphical user interface for submitting batch and interactive jobs; querying job, queue, and system status; and monitoring jobs (see the example commands at the end of this section).
Job Accounting offers detailed logs of system activities for charge-back or usage analysis per user, per group, per project, and per compute host.
Parallel Job Support works with parallel programming libraries such as MPI. Applications can be scheduled to run within a single multi-processor computer or across multiple systems.
Job-Interdependency enables the user to define a wide range of interdependencies between jobs.
Computational Grid Support provides an enabling technology for metacomputing and computational grids.
Comprehensive API includes a complete Application Programming Interface (API).
Automatic Load-Leveling provides numerous ways to distribute the workload across a cluster of machines, based on hardware configuration, resource availability, keyboard activity, and local scheduling policy.
Common User Environment offers users a common view of the job submission, job querying, system status, and job tracking over all systems.
Cross-System Scheduling ensures that jobs do not have to be targeted to a specific computer system. Users may submit their job, and have it run on the first available system that meets their resource requirements.
Job Priority allows users to specify the priority of their jobs.
Username Mapping provides support for mapping user account names on one system to the appropriate name on remote server systems. This allows PBS Professional to fully function in environments where users do not have a consistent username across all hosts.
Broad Platform Availability is achieved through support of Windows and every major version of UNIX and Linux, from workstations and servers to supercomputers.
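For reference, typical PBS Professional command-line usage looks like the following sketch (script name, resource requests and job identifier are illustrative):

    # Submit a batch job requesting 2 nodes with 8 CPUs each and a 2-hour walltime
    qsub -l select=2:ncpus=8 -l walltime=02:00:00 my_job.sh

    # Query the status of jobs and of the execution hosts
    qstat -a
    pbsnodes -a

    # Delete a job, using the identifier returned by qsub
    qdel 1234.headnode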