Virtual cluster for HPC education∗

Linh B. Ngo¹, Jon Kilgannon¹

¹Computer Science
West Chester University of Pennsylvania
West Chester, PA, 19380
{lngo,jk880380}@wcupa.edu

Abstract
For many institutions, it is challenging to procure and maintain resources
to teach parallel and distributed computing. While existing dedicated
environments such as XSEDE are available, they often have a high level of
utilization, which makes it difficult to support real-time hands-on in-class
sessions, especially for larger classes. This work describes the design and
development of a Linux-based distributed cyberinfrastructure (CI) to address
this problem. Unlike a typical production-level environment, the CI is
designed to be dynamically customized and deployed on a federally funded
cloud resource. Besides computing, the CI provides a job scheduler, message
passing, networked storage, and single sign-on mechanisms. Configurations of
these components can be adjusted prior to the automatic installation process
during deployment. Scalability is demonstrated as both the number of cores
and the number of shared storage nodes increase, showing that the proposed
cluster emulates a large-scale system.

∗ Copyright ©2018 by the Consortium for Computing Sciences in Colleges. Permission to
copy without fee all or part of this material is granted provided that the copies are not made
or distributed for direct commercial advantage, the CCSC copyright notice and the title of
the publication and its date appear, and notice is given that copying is by permission of the
Consortium for Computing Sciences in Colleges. To copy otherwise, or to republish, requires
a fee and/or specific permission.

1 Introduction
For the majority of teaching-focused higher education institutions, institutional
missions often emphasize community service, liberal arts, teaching quality,
accessibility, and commitment to diversity [8]. These institutions usually
distribute financial and human resources to meet their primary teaching and
student service goals across all disciplines. This makes it difficult to support
large-scale cyberinfrastructure resources. While there is no previous study regarding
the availability of computing resources at smaller institutions, anecdotal
evidence points toward a clear lack of local resources for educational purposes
[4]. Even though there are large-scale national infrastructures such as XSEDE,
existing utilization from non-research institutions on these resources is low and
grows at a rate much lower than that of research institutions. Furthermore,
the high utilization rate from prioritized activities leads to increased wait times,
particularly during the day. This prevents effective implementation of in-class
hands-on learning activities.
Efforts have been made to develop affordable computing clusters that can
be used to teach basic PDC concepts. One approach is to boot a temporary
networked computer lab into a pre-configured distributed computing environ-
ment [3]. Combined with a small-scale multi-processor buildout hardware kit,
we have the ability to create inexpensive and portable mini clusters for edu-
cation [2]. Alternatively, advances in virtualization technologies have led to
solutions that support the creation of virtual computing clusters within exist-
ing computer laboratories. These clusters can either scale across all resources
[10] or are comprised of many mini clusters for learning at individual levels [7].
Without leveraging existing on-premise resources, reductions in hardware costs
have led to approaches that lean toward the development of personal computing
clusters. The costs can range from around $3,000 [1] down to $200 [11]. In both
scenarios, these approaches also require additional administrative effort, which
could be provided by student teams, supported by institutional staff, or require
time and effort from the instructors. This presents challenges to institutions
where resources are limited, teaching responsibilities are high, and typical
students are not prepared to take up advanced Linux system administration tasks.
We present an approach that leverages cloud computing to provision a
cyberinfrastructure on which a full-fledged virtual supercomputer can be de-
signed and deployed. While conceptually similar to [7], our work does not rely
on premade VM component images. Instead, we utilize an academic cloud
[9] to create blueprints that correspond to components of a supercomputer in-
frastructure. At the cost of longer startup time, this allows us to provide a
high level of customization to clusters deployed through the platform. The
individual tasks in our CI blueprint are carried out as direct automated Linux
instructions, which helps provide more insight into how the system works.
In addition to larger deployments supporting entire classes, the blueprint also
allows smaller clusters (taking only a portion of a physical node) to be deployed
should students wish to study on their own.
The remainder of this paper is organized as follows. Section 2 describes the
design and deployment environments of the proposed cloud-based CI. Section 3
presents and summarizes various administrative and performance scaling tests
regarding operations of the cloud-based CIs. Section 4 concludes the paper
and discusses future work.

2 Design and Deployment


We design and deploy a CI that supports the following components: Single-
Sign-On (SSO), scheduler, networked file system, parallel file system, and com-
puting nodes. The selected cloud environment is CloudLab, a federally funded
academic cloud infrastructure.

2.1 CloudLab
Funded by the National Science Foundation in 2014, CloudLab has been built
to provide researchers with a robust cloud-based environment for next gener-
ation computing research [9]. As of Fall 2019, CloudLab boasts an impressive
collection of hardware. At the Utah site, there are a total of 785 nodes, includ-
ing 315 with ARMv8, 270 with Intel Xeon-D, and 200 with Intel Broadwell.
The compute nodes at Wisconsin include 270 Intel Haswell nodes with memory
ranging between 120GB and 160GB and 260 Intel Skylake nodes with memory
ranging between 128GB and 192GB. At Clemson University, there are 100
nodes running Intel Ivy Bridge, 88 nodes running Intel Haswell, and 72 nodes
running Intel Skylake. All of Clemson’s compute nodes have large memory
(between 256GB and 384GB), and there are also two additional storage-intensive
nodes that have a total of 270TB of storage available.
In order to provision resources using CloudLab, a researcher needs to de-
scribe the necessary computers, network topologies, and startup commands in
a resource description document. CloudLab provides a browser-based graphical
interface that allows users to visually design this document through
drag-and-drop actions. For large and complex profiles, this document can be
generated programmatically via Python. Listing 1 shows a
Python script that generates a resource description document requesting
six virtual machines, each of which has two cores, 4GB of RAM, and runs
CentOS 7. Their IP addresses range from 192.168.1.1 through 192.168.1.6.

Listing 1: A CloudLab profile written in Python to describe a 6-node cluster


import geni.portal as portal
import geni.rspec.pg as pg
import geni.rspec.igext as IG

pc = portal.Context()
request = pc.makeRequestRSpec()

# All nodes are attached to a single LAN
link = request.LAN("lan")
for i in range(6):
    # Node 0 is the head node, node 1 the metadata node, node 2 the
    # storage node; the remaining nodes are compute nodes
    if i == 0:
        node = request.XenVM("head")
        # Only the head node receives a publicly routable IP address
        node.routable_control_ip = "true"
    elif i == 1:
        node = request.XenVM("metadata")
    elif i == 2:
        node = request.XenVM("storage")
    else:
        node = request.XenVM("compute-" + str(i))
    node.cores = 2
    node.ram = 4096  # in MB, i.e. 4GB

    node.disk_image = "urn:publicid:IDN+emulab.net+image+emulab-ops:CENTOS7-64-STD"
    iface = node.addInterface("if" + str(i-3))
    iface.component_id = "eth1"
    # Internal addresses run from 192.168.1.1 through 192.168.1.6
    iface.addAddress(pg.IPv4Address("192.168.1." + str(i + 1), "255.255.255.0"))
    link.addInterface(iface)
pc.printRequestRSpec(request)

The resource description document instructs CloudLab to provision and
instantiate the experiments. Once all resources are allocated and images for the
computing components are booted on top of bare metal infrastructure, users
are granted complete administrative privilege over the provisioned infrastruc-
ture. CloudLab also supports a direct integration between publicly readable
git repositories and its profile storage infrastructure. This minimizes the effort
needed to modify an existing profile while still maintaining a complete history
of previous changes.

2.2 Virtual Supercomputer Blueprint and Deployment


The default blueprint allows users to customize the capacity of their virtual
supercomputer. They can decide the number of compute nodes, the number
of parallel file system nodes, and the number of CPUs and size of memory per
node. While it is possible to request complete physical nodes, we decide to
initially deploy all nodes as virtual machines. Only the head node has a public
IP address, which supports an SSH connection. Configuration and installation
tasks are automated and grouped into individual files. Each file is responsible
for the deployment of one service. These services include LDAP directory
services for passwordless SSO (LDAP server/client), a distributed file system
running NFS (NFS server/client), a parallel file system (BeeGFS management,
meta, and storage servers/client) [5], a scheduler (Slurm server/client) [6], and
a parallel programming library (OpenMPI). Figure 1 illustrates a dependency
graph that represents the workflow among providers and consumers of the
deployed services.
These services are split across several servers both to spread the workload
and to emulate an environment similar to those seen in large-scale supercom-
puters. To reduce the footprint of cloud resources, the current blueprint groups
related services on the same server. It is possible to modify the blueprint to
adjust this service placement. The number of nodes for the computing com-
ponent and for the parallel file system components can be set at deployment
time.
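
The way node counts translate into names and internal addresses can be sketched in plain Python, independent of CloudLab. This is a minimal illustration rather than the blueprint's actual code: the function name plan_nodes, the node naming scheme, and the default base address are assumptions chosen to mirror Listing 1.

```python
from ipaddress import IPv4Address


def plan_nodes(n_compute, n_storage=1, base="192.168.1.1"):
    """Return (name, ip) pairs: head, metadata, storage nodes, then compute nodes.

    The names and the base address are illustrative assumptions.
    """
    if n_compute < 1 or n_storage < 1:
        raise ValueError("need at least one compute and one storage node")
    names = ["head", "metadata"]
    names += ["storage-%d" % i for i in range(n_storage)]
    names += ["compute-%d" % i for i in range(n_compute)]
    first = IPv4Address(base)
    # Addresses are handed out consecutively, starting at the base address
    return [(name, str(first + i)) for i, name in enumerate(names)]


# Three compute nodes and one storage node give a six-node cluster
plan = plan_nodes(n_compute=3)
for name, ip in plan:
    print(name, ip)
```

A plan of this kind could be turned into a resource request by looping over it in the same way Listing 1 loops over range(6). The sketch does not check that the addresses stay within a single /24 network.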

Figure 1: Services provided by each server component

LDAP and Single-Sign-On: Our CI forwards the port of the SSH server
on the head node to allow users to sign in remotely. LDAP provides user ac-
counts with uniform UIDs and GIDs across all nodes in the CI. All nodes in
the CI authenticate accounts against LDAP, enabling a streamlined environ-
ment for all tasks, including passwordless SSH connections between nodes in
the server and shared network storage. The automatic deployment of LDAP is
facilitated through the use of Debian preseeding and configuration scripts. A
pre-configured list of users and groups is included to allow repeated deployment
and can be modified. LDAP is the first component to be deployed before any
other service.
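
As an illustration of what a pre-configured user list might look like, the following sketch renders LDIF entries with fixed UIDs and GIDs. It is not taken from the paper's repository: the base DN dc=example,dc=org, the login names, the numeric ID ranges, and the choice of the account and posixAccount object classes are all assumptions.

```python
def posix_user_ldif(login, uid_number, gid_number,
                    base_dn="dc=example,dc=org", shell="/bin/bash"):
    """Render one LDIF entry for a POSIX account (base_dn is a placeholder)."""
    lines = [
        "dn: uid=%s,ou=People,%s" % (login, base_dn),
        "objectClass: account",
        "objectClass: posixAccount",
        "cn: %s" % login,
        "uid: %s" % login,
        "uidNumber: %d" % uid_number,
        "gidNumber: %d" % gid_number,
        "homeDirectory: /home/%s" % login,
        "loginShell: %s" % shell,
    ]
    return "\n".join(lines) + "\n"


# Hypothetical class roster; every user shares one hypothetical group (GID 10000)
roster = ["student01", "student02"]
ldif = "\n".join(
    posix_user_ldif(name, 10001 + i, 10000) for i, name in enumerate(roster)
)
print(ldif)
```

Because the numeric IDs are fixed in the list rather than assigned at install time, every redeployment yields the same UIDs and GIDs on all nodes, which is what keeps file ownership consistent on the shared storage.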
Filesystems: By default, each instance on CloudLab comes with a total
storage space of 16GB. It is possible to attach storage blocks to each instance
to serve as expanded local scratch for the compute nodes. Two remote net-
work storage infrastructures are included with the CI blueprint. The first is
a network file system (NFS) that is set up on its own node. The NFS file
system provides four directories to be mounted across the compute and head
nodes: /home provides shared user home directories, /software provides
shared client software, /opt provides shared system software, and /mpishare
contains MPI sample codes for students. Housing home directories on the NFS
server provides uniform access to user files across the entire CI and allows the
user to easily run MPI scripts on all compute nodes. It also makes passwordless
sign-on between nodes simpler, as each user will have one SSH key in
the shared home directory. The second remote network storage infrastructure
is BeeGFS, a parallel file system. BeeGFS consists of three primary services:
management, metadata, and storage. In the current default blueprint, we
provision one node for the management service, one node for the metadata service,
and two nodes for storage. Before deploying the CI, it is possible to customize
the blueprint to include more storage servers (improving parallel I/O). It is
also possible to merge all three services onto one or two nodes, at the cost of
performance. The BeeGFS service configuration is stored in a simple JSON
file that is loaded from the GitHub repository. This gives the user the same
PFS architecture each time the CI is deployed while still allowing the
architecture to be quickly updated before deployment. Once deployed, the BeeGFS
storage is mounted as /scratch. Under this directory, each individual user ac-
count in LDAP automatically has a subdirectory, named after the user’s login
name. These user scratch directories are available across head and all compute
nodes.
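
The paper does not show the schema of the JSON file, so the following is only a sketch of how such a service layout could be described and checked before deployment. The key names, host names, and helper functions are hypothetical.

```python
import json

# Hypothetical layout; the schema actually used by the blueprint is not shown in the paper
EXAMPLE = """
{
  "management": ["pfs-mgmt"],
  "metadata": ["pfs-meta"],
  "storage": ["pfs-storage-0", "pfs-storage-1"]
}
"""

REQUIRED = ("management", "metadata", "storage")


def load_layout(text):
    """Parse the layout and check that every BeeGFS service has at least one host."""
    layout = json.loads(text)
    for service in REQUIRED:
        if not layout.get(service):
            raise ValueError("no host assigned to BeeGFS service: " + service)
    return layout


def services_by_host(layout):
    """Invert the mapping so that co-located (merged) services are easy to spot."""
    placement = {}
    for service in REQUIRED:
        for host in layout[service]:
            placement.setdefault(host, []).append(service)
    return placement


layout = load_layout(EXAMPLE)
print(services_by_host(layout))
```

Adding a storage server then amounts to appending one host name to the storage list, and merging services onto one node amounts to repeating the same host name under several keys.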
Scheduler: We include SLURM as our default scheduler in the blueprint,
as it is the most popular scheduler across XSEDE sites. The NFS filesystem en-
ables distribution of configuration files for various SLURM components across
all nodes, including those of MariaDB, the back end database for SLURM. To
automate the configuration process, we created boilerplate configuration files
containing dummy information and updated these at deployment time.
Application (OpenMPI): Once SLURM has been installed and config-
ured, OpenMPI can be deployed. Because the virtual supercomputer is de-
ployed in a cloud environment and is also attached to the Internet for user
access via SSH, there are several network interfaces on each node. For Open-
MPI to run properly, it must be configured at install time to use only the
internal IP network. This is achieved through MPI’s configuration files, which
are loaded with the names of network interfaces provisioned at deployment.
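
One way to express this restriction is through MCA parameters in the openmpi-mca-params.conf file under the installation's etc directory. The sketch below renders such a fragment; it is an assumption about how the configuration could look, not the paper's script. The interface name eth1 is taken from Listing 1, and oob_tcp_if_include applies to Open MPI releases whose runtime still uses it.

```python
def mca_params(interface):
    """Render MCA parameter lines that pin Open MPI's TCP traffic to one interface."""
    if not interface or any(ch.isspace() for ch in interface):
        raise ValueError("interface name must be a non-empty token")
    return (
        "# Restrict MPI point-to-point TCP traffic to the internal network\n"
        "btl_tcp_if_include = %s\n"
        "# Restrict out-of-band (runtime) TCP traffic as well\n"
        "oob_tcp_if_include = %s\n" % (interface, interface)
    )


# "eth1" is the internal interface named in Listing 1
print(mca_params("eth1"))
```

The same parameters can also be passed per run with mpirun's --mca option, which is convenient for checking the setting before it is written into the system-wide file.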
Compute Nodes: These run the clients for LDAP, NFS, BeeGFS, SLURM,
and OpenMPI, allowing them to obtain single sign-on, share directories, mount
scratch space, and participate in the parallel computing process. The /home,
/opt, /software, and /mpishare directories are mounted on each compute node
from the NFS server, and all compute nodes are provided scratch space on the
BeeGFS parallel file system.

2.2.1 Deployment
The hardware for our CI is requested from CloudLab using CloudLab’s Python
scripting support. The same Python script is used to launch accompanying
Bash scripts that deploy and configure CI software components. The Python
and Bash scripts are stored in a GitHub repository and updates are automat-
ically pulled by CloudLab, allowing the latest version of the CI to be incorpo-
rated into the corresponding CloudLab profile with each run.
The entire system deploys and configures itself automatically over the course
of several hours, depending on the number of nodes requested. The user who
deploys the system can place sample scripts, information, or data in the /source
directory on GitHub, and those files will be automatically copied into the
shared scratch directory accessible by all compute nodes. Figure 2a illustrates
a sample deployment of a supercomputer with one head node, one NFS node,
six BeeGFS nodes (one metadata, one management, and four storage nodes),
and four compute nodes. The nodes in this deployment are spread across three
physical systems hosted at CloudLab’s Clemson site. A slightly smaller (in
terms of node count) deployment hosted at CloudLab’s Wisconsin site is shown
in Figure 2b. This deployment has only four BeeGFS nodes (one metadata,
one management, and two storage nodes) instead of six.

(a) Three physical computing nodes. (b) A single computing node.

Figure 2: Example CloudLab deployment

3 Operational and Scaling Tests


The tests described in this section aim to demonstrate that operations of the
deployed CI are similar to those of an actual supercomputing infrastructure.
Furthermore, scaling behaviors should reflect those of a real-world environment.
That is, as we increase the number of computing cores, performance improvements
can be observed.

3.1 Operational Tests


With LDAP, users are able to SSH directly into the deployed CI via the head
node without needing a CloudLab account. This enables ad hoc training
scenarios where only temporary accounts are needed. Once users have logged into
the head node, interactions with the scheduler such as viewing the computing
nodes’ status (sinfo), submitting jobs (sbatch), and watching the queue
(squeue) can be done as usual. This is demonstrated in Figure 3.
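
For readers unfamiliar with batch submission, the sketch below generates the text of a minimal Slurm batch script of the kind that would be handed to sbatch. It is illustrative only: the job name, time limit, and program path /mpishare/trap are hypothetical, and launching through mpirun assumes OpenMPI is available inside the job.

```python
def slurm_script(job_name, ntasks, command, time_limit="00:05:00"):
    """Render a minimal Slurm batch script as text."""
    if ntasks < 1:
        raise ValueError("ntasks must be positive")
    return "\n".join([
        "#!/bin/bash",
        "#SBATCH --job-name=%s" % job_name,
        "#SBATCH --ntasks=%d" % ntasks,
        "#SBATCH --time=%s" % time_limit,
        # %j is expanded by Slurm to the job ID
        "#SBATCH --output=%s-%%j.out" % job_name,
        "",
        "mpirun -np %d %s" % (ntasks, command),
        "",
    ])


# Hypothetical job: four MPI processes running a sample program from /mpishare
script = slurm_script("trap", 4, "/mpishare/trap")
print(script)
```

Saved to a file, the script would be submitted with sbatch, after which squeue shows the job while it is pending or running and sinfo shows the state of the compute nodes.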

Figure 3: Standard interaction with the scheduler for job submissions

(a) Compute-intensive tasks (b) Write-intensive tasks

Figure 4: Speedup as the number of processes increases

3.2 Scaling Behavior Tests


To evaluate scaling behavior, two tests are carried out. The first is an MPI
program that estimates the integral of x^2 using the trapezoidal rule. A SLURM
submission script runs and times the MPI program using increasing numbers of
processes. Calculated speedups are shown in Figure 4a. With 32 total CPUs available
(8 cores per worker node), near-linear speedups are observed as the number
of processes increases from 1 to 2 and then to 4. With 8 and 16 processes,
speedups are still observed but at a much reduced rate. At 32 processes,
speedup stagnates and appears to begin decreasing.
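
The decomposition behind such a program can be sketched in plain Python without MPI. In the sketch below, each loop iteration plays the role of one MPI rank, and the final sum stands in for a reduction onto rank 0. The integration limits [0, 1], the number of trapezoids, and the function names are assumptions, since the paper does not list the test program.

```python
def f(x):
    """Integrand used in the test: x squared."""
    return x * x


def local_trapezoid(a, b, n):
    """Composite trapezoidal rule for f on [a, b] with n subintervals."""
    h = (b - a) / n
    total = (f(a) + f(b)) / 2.0
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


def parallel_style_integral(a, b, n, size):
    """Split n trapezoids evenly across `size` ranks and add up the partial results."""
    if n % size != 0:
        raise ValueError("n must be divisible by the number of ranks")
    h = (b - a) / n
    local_n = n // size
    partials = []
    for rank in range(size):  # each iteration plays the role of one MPI rank
        local_a = a + rank * local_n * h
        local_b = local_a + local_n * h
        partials.append(local_trapezoid(local_a, local_b, local_n))
    return sum(partials)  # stands in for a reduction onto rank 0


estimate = parallel_style_integral(0.0, 1.0, 1024, 4)
print(estimate)
```

Each rank performs n/size function evaluations and communication happens only in the final reduction, which is why a compute-bound test of this kind is expected to scale well at small process counts. Speedup for p processes is the ratio of the single-process run time to the run time with p processes.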
The second test measures the influence of parallel I/O via BeeGFS. This
is done by using MPI-IO routines to write 512MB of random data to a file.
Speedups are measured as the number of processes increases from 1 to 32 and
visualized in Figure 4b. The figure indicates that speedup remains unchanged
for 1, 2, 4, and 8 processes. At 16 and 32 processes, the writing speed is
doubled.
The behavior shown in Figure 4b can be explained as follows. First,
BeeGFS is set up with only two storage nodes; therefore, the maximum possible
speedup for read/write access is two. Second, default MPI settings prioritize
placement of tasks on CPUs belonging to the same compute node. With up
to 8 processes running on the same virtual computing node, writing activities
are limited by a single virtual network connection and are therefore unable to take
advantage of the two BeeGFS storage servers. With more than 8 processes,
speedups are observed, but they are capped at 2 due to similar logic from the storage
servers’ perspective.
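
This explanation can be condensed into a toy model in which write throughput is bounded by whichever is smaller: the number of compute nodes in use (one network link each) or the number of storage servers. The function below is a deliberate simplification for illustration, not a measurement; the defaults of 8 cores per node and 2 storage nodes come from the test setup described above.

```python
import math


def modeled_write_speedup(processes, cores_per_node=8, storage_nodes=2):
    """Toy model: speedup is limited by client network links and by storage servers."""
    if processes < 1:
        raise ValueError("processes must be positive")
    # Default placement fills one compute node before using the next
    client_nodes = math.ceil(processes / cores_per_node)
    return min(client_nodes, storage_nodes)


for p in (1, 2, 4, 8, 16, 32):
    print(p, modeled_write_speedup(p))
```

Under this model the speedup stays at 1 for up to 8 processes and at 2 for 16 and 32 processes, matching the shape described for Figure 4b. It also suggests that adding storage servers would raise the cap only when enough compute nodes participate in the write.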

4 Conclusion and Future Work


It has been demonstrated that from an operational perspective, the CI can pro-
vide learners with a working environment similar to that of a supercomputer.
It enables organizations to become more proactive in CI training for users and
professionals without incurring additional risks in terms of performance
reduction or system errors due to these educational activities. The materialization
of a dynamic mini “supercomputer” in the cloud demonstrated in this work
provides the motivation for a number of future projects. One such example
is the development of additional blueprints to enable site-specific educational
materials. Another example is a project that explores whether it is possible to
fit a toy cluster on a personal computing device. While VM-based solutions
exist, for a multi-node supercomputer emulator, the performance overhead on
storage, memory, and CPU can be significant on a laptop. We will explore
how various components of the supercomputer emulator can be packed into
containers. This helps to understand the minimal amount of CPU and RAM
needed to run an emulated cluster and whether scaling behaviors can be
observed. All source code developed for this work is made available on GitHub
at ...

References
[1] Joel C Adams and Tim H Brom. Microwulf: a beowulf cluster for every
desk. ACM SIGCSE Bulletin, 40(1):121–125, 2008.
[2] Ivan Babic, Aaron Weeden, Mobeen Ludin, Skylar Thompson, Charles
Peck, Kristin Muterspaw, Andrew Fitz Gibbon, Jennifer Houchins, and
Tom Murphy. Littlefe and bccd as a successful on-ramp to hpc. In Proceed-
ings of the 2014 Annual Conference on Extreme Science and Engineering
Discovery Environment, page 73. ACM, 2014.
[3] Sarah M Diesburg, Paul A Gray, and David Joiner. High performance
computing environments without the fuss: the bootable cluster cd. In
19th IEEE International Parallel and Distributed Processing Symposium,
pages 8–pp. IEEE, 2005.

[4] Jeremy Fischer, Steven Tuecke, Ian Foster, and Craig A Stewart. Jet-
stream: a distributed cloud infrastructure for underresourced higher edu-
cation communities. In Proceedings of the 1st Workshop on The Science
of Cyberinfrastructure: Research, Experience, Applications and Models,
pages 53–61. ACM, 2015.

[5] Jan Heichler. An introduction to beegfs, 2014.


[6] Don Lipari. The slurm scheduler design. In SLURM User Group Meeting,
Oct. 9, volume 52, page 52, 2012.
[7] Paul Marshall, Michael Oberg, Nathan Rini, Theron Voran, and Matthew
Woitaszek. Virtual clusters for hands-on linux cluster construction ed-
ucation. In Proc. of the 11th LCI International Conference on High-
Performance Clustered Computing, 2010.
[8] Christopher C Morphew and Matthew Hartley. Mission statements: A
thematic analysis of rhetoric across institutional type. The Journal of
Higher Education, 77(3):456–471, 2006.

[9] Robert Ricci, Eric Eide, and CloudLab Team. Introducing cloudlab:
Scientific infrastructure for advancing cloud architectures and applications.
;login:: the magazine of USENIX & SAGE, 39(6):36–38, 2014.
[10] Elizabeth Shoop, Richard Brown, Eric Biggers, Malcolm Kane, Devry Lin,
and Maura Warner. Virtual clusters for parallel and distributed education.
In Proceedings of the 43rd ACM technical symposium on Computer Science
Education, pages 517–522. ACM, 2012.
[11] David Toth. A portable cluster for each student. In 2014 IEEE Inter-
national Parallel & Distributed Processing Symposium Workshops, pages
1130–1134. IEEE, 2014.
