HPCVL's Sunfire Cluster
The Sunfire Compute Cluster at HPCVL is the default production
cluster. It is based on Symmetric Multiprocessor (SMP) systems using
the UltraSPARC line of processors and the Solaris Operating
Environment. This page explains essential features of the cluster and
is meant as a basic guide for its usage.
1. What is the cluster?
The Sunfire compute cluster is based on Sun Fire 25000 servers, each
of which has 72 dual-core UltraSPARC-IV+ processors (CPUs) with 2 MB of
on-chip L2 cache and 32 MB of L3 cache. There are five (5) of these
servers available, called hpcvl0, hpcvl3, hpcvl4, hpcvl5 and hpcvl6,
plus an E2900 login node called sflogin0 with the same chip
architecture and OS.
The current configurations are:
- Four Sun Fire 25000 Nodes (hpcvl0, hpcvl3, hpcvl4 and hpcvl5) with 72 x dual-core
UltraSPARC-IV+ 1.5 GHz processors with 576 GB of RAM.
- One Sun Fire 25000 Node (hpcvl6) with 72 x dual-core UltraSPARC-IV+ 1.8 GHz processors with 576 GB of RAM.
- One Sun Fire E2900 (sflogin0) with 24 x 1.8 GHz UltraSPARC-IV+ processors and 192 GB of RAM.
- Two Sun Fire 6900 Nodes (1 at U of O, and 1 at Carleton) with 24 x UltraSPARC-IV+ processors with 192 GB of RAM. Both are to be mainly used as workup nodes.
- One Sun Fire 4800 with 12 x UltraSPARC-III processors with 48 GB of RAM at Ryerson University. Currently used as a workup node.
2. Why this cluster?
The main emphasis of the Sunfire cluster is on "standard parallel
jobs". Because the servers are SMP machines, they offer a substantial
amount of memory. With 2 floating-point units per compute core, they
are able to process floating-point intensive jobs at a theoretical
peak of 345.6 GFlops (518.4 GFlops for the 1.8 GHz server) per server.
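The peak figures above can be checked with simple arithmetic. The following is a minimal sketch in Python, used here only for illustration. It assumes that each floating-point unit completes one operation per clock cycle; the function name peak_gflops is hypothetical and not part of any HPCVL tool. With the hpcvl6 configuration (72 dual-core processors, 2 floating-point units per core, 1.8 GHz) it reproduces the 518.4 GFlops figure. Under the same assumption, 345.6 GFlops corresponds to a clock of 1.2 GHz.

```python
def peak_gflops(processors, cores_per_processor, fpus_per_core, clock_ghz):
    """Theoretical peak in GFlops.

    Assumption: every floating-point unit completes one operation
    per clock cycle, so the peak is a plain product of the counts
    and the clock frequency in GHz.
    """
    return processors * cores_per_processor * fpus_per_core * clock_ghz


if __name__ == "__main__":
    # hpcvl6: 72 dual-core processors, 2 FPUs per core, 1.8 GHz
    print(round(peak_gflops(72, 2, 2, 1.8), 1))
```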
3. Who should use this cluster?
The Sunfire machines are currently the default compute cluster,
and are suitable for applications that require a considerable amount
of memory and/or scale to a moderate number of processors. They can
process both shared-memory applications (usually programmed
using OpenMP directives) and distributed-memory parallel programs,
often using MPI.
Applications that are very floating-point intensive, or depend
crucially on cache usage, should be run on this cluster or on
our M9000 servers.
We suggest you consider using the compute cluster if
- Your application is explicitly or automatically multi-threaded
(for instance, using OpenMP) and shows at least some scaling for
moderately large numbers of threads (>20).
- Your application is based on MPI or PVM, and uses substantial
amounts of communication. The SMP nature of the Sunfire 25Ks enables
very fast intra-node communication.
- Your application uses substantial amounts of memory. For extremely
large memory usage, the M9000 servers should be preferable.
- Your application is commercially licensed on a per-process basis.
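To judge whether an application "shows at least some scaling" in the sense of the list above, it helps to compare wall-clock times measured at different thread or process counts. The sketch below is only an illustration in Python: the function name is hypothetical, and the timings are made-up placeholder values, not measurements from this cluster.

```python
def speedup_and_efficiency(serial_seconds, parallel_seconds, workers):
    """Return (speedup, efficiency) for one measured run.

    speedup    = serial time / parallel time
    efficiency = speedup / number of threads or processes
    """
    speedup = serial_seconds / parallel_seconds
    return speedup, speedup / workers


if __name__ == "__main__":
    # Hypothetical wall-clock times in seconds, keyed by thread count.
    timings = {1: 1200.0, 8: 170.0, 24: 68.0, 48: 45.0}
    serial = timings[1]
    for workers in sorted(timings):
        s, e = speedup_and_efficiency(serial, timings[workers], workers)
        print(f"{workers:3d} threads: speedup {s:5.1f}, efficiency {e:4.0%}")
```

A speedup that keeps growing beyond about 20 threads, even with falling efficiency, indicates the kind of scaling the first criterion refers to.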
The cluster might not be suitable if
- Your application is "trivially parallel", employing
distributed-memory systems such as MPI, and uses almost no
communication. For this purpose, our Victoria-Falls cluster is
preferable.
- Your application consists of a very large number of independent
serial runs. Again, the Victoria-Falls cluster should be used.
4. How do I use this cluster?
a) ... to access
Login access to the headnode of the compute cluster is available
via the HPCVL Secure Portal at https://portal.hpcvl.queensu.ca/.
Clicking on the "Secure Desktop" tab in the portal will present you
with a list of applications. Choose the one saying "xterm (sfnode0)"
or "dtterm (sfnode0)". This will bring up a login terminal on the
Sunfire cluster login node sflogin0. Note that the compute nodes of
the Sunfire cluster are accessed via Grid Engine by default.
The file systems for all our clusters are shared, so you will be
using the same home directory. Everything else will also be very
similar on all standard clusters, including OS, shell setup, and Grid
Engine usage. The login node can be used for compilation,
program development, and testing only, not for production
jobs.
b) ... to compile and link
Compiling and linking for the Sunfire cluster is very simple:
- Make sure you are using the Studio 12 compilers. This is the
default, but if you have entries in your shell setup that reset the
compiler, you might have to override these by typing
use studio12
- Many optimization options in the Studio compilers, such as -fast,
imply settings that involve -native, i.e. they optimize for the
architecture and chipset of the machine on which you are doing the
compilation. These settings do not have to be changed. The
compilation should be done on the login node sflogin0.
For a general introduction,
see http://www.hpcvl.org/faqs/programming/parallel-prog-faq.html.
For applications that cannot be re-compiled (for instance, because
the source code is not accessible), binaries compiled for any
post-UltraSPARC-III chip will work.
c) ... to run jobs
As mentioned earlier, program runs for user and application
software on the login node are allowed only for test
purposes. Production runs must be submitted to Grid
Engine. Grid Engine usage is the same as on our other standard
clusters. For a description of how to use Grid Engine,
see the HPCVL GridEngine FAQ.
Grid Engine will schedule jobs to a default pool of machines unless
otherwise stated. This default pool presently contains only
the Sunfire 25Ks, i.e. hpcvl0 and hpcvl3 through hpcvl6. Therefore,
no additional changes need to be made to use them.
Note that the number of processes for these machines must be chosen
such that dedicated scheduling is possible. It is therefore important
that the number of CPUs requested through Grid Engine matches the
maximum number of processes running: if a maximum of 8 processes are
running, 8 CPUs must be requested. Which specific number to choose
must be determined largely by experimentation for each application.
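One way to keep the process count in line with the CPUs requested is to read the slot count from the job environment instead of hard-coding it. The sketch below is written in Python for illustration only. It assumes that Grid Engine sets the NSLOTS environment variable for the job, as it does for parallel-environment jobs in standard Grid Engine installations. The helper names granted_slots and check_process_count are hypothetical.

```python
import os


def granted_slots(environ=os.environ, default=1):
    """Return the slot count from NSLOTS, or `default` if unset or invalid.

    Assumption: Grid Engine exports NSLOTS into the job environment.
    """
    try:
        slots = int(environ.get("NSLOTS", ""))
    except ValueError:
        return default
    return slots if slots > 0 else default


def check_process_count(planned, environ=os.environ):
    """Raise ValueError if more processes are planned than slots granted."""
    slots = granted_slots(environ)
    if planned > slots:
        raise ValueError(
            f"{planned} processes planned, but only {slots} CPUs requested"
        )
    return planned


if __name__ == "__main__":
    # Outside a Grid Engine job NSLOTS is unset, so this prints the default.
    print("CPUs granted:", granted_slots())
```

Called at the start of a job, check_process_count(8) would stop a run that plans 8 processes when fewer than 8 CPUs were requested.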
d) ... to optimize
While in many cases optimization options such as -fast will
result in excellent performance, for larger applications it is often
necessary to analyze the timing profile of typical runs to uncover
bottlenecks and optimize on a source-code level. The Sun Studio
Performance Analyzer is an excellent tool to help with this task.
5. Help?
...to find more information
Our user support (please contact us at help@hpcvl.org) can supply
you with specific help, and is glad to answer questions about cluster
usage.