Writing a system and platform definition
This tutorial will provide an introduction to writing system and platform definitions in Ramble. You should learn what these definition files are for, and how to use them to standardize experiments in your compute facility. The examples will focus on generic systems, but should still provide a good introduction.
It is a good idea to have a basic working understanding of how to create and use Ramble workspaces before starting this tutorial. You should at least be familiar with the content of the Hello World Tutorial.
This tutorial is intended to be a practical, hands-on guide to creating a simple set of system and platform definitions.
Installation
To install Ramble, see the Getting Started guide.
NOTE: This tutorial does not require a package manager to be installed or configured.
Ramble Repositories
Before writing our definitions, we will create a repository to house the system and platform definitions. Repositories in Ramble can house any object type, and are not limited to only those in this tutorial.
The ramble repo command is used to manage object repositories in Ramble. To
create a new repository, execute the following:
$ ramble repo create tutorial-repo
This will create a new directory named tutorial-repo in your current
directory. Inside, directories will exist for each of the object types. These
will include things like applications and package_managers. Resulting
object paths in this repo would look like:
tutorial-repo/applications/hostname/application.py
You can also create a repository without object subdirectories using:
$ ramble repo create tutorial-repo -d ""
In this case, the resulting structure looks like:
tutorial-repo/hostname/application.py
This latter layout allows objects with the same name but different types to coexist in the same directory. For example, if there was a package manager named hostname, it would exist in the following path:
tutorial-repo/hostname/package_manager.py
The remaining commands in this tutorial will assume your repository layout matches the first example, but you can feel free to use either layout. Just map any paths to the correct layout.
The actual object files within the repository are named based on the object they represent. Below is a mapping of some object types to file names:
Application - application.py
Base Application - base_application.py
Modifier - modifier.py
Package Manager - package_manager.py
Workflow Manager - workflow_manager.py
System - system.py
Platform - platform.py
As shown for the application.py file, each object type also has a
corresponding base version, which is mostly used to share behavior through
inheritance across several concrete objects.
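As a rough illustration of the two repository layouts described above, the following sketch builds the expected path of an object definition file. The helper object_path and the two lookup tables are hypothetical names for this example and are not part of Ramble. Only the directory names that appear in this tutorial are included.

```python
# File name used for each object type (taken from the mapping above).
FILE_NAMES = {
    "application": "application.py",
    "package_manager": "package_manager.py",
    "system": "system.py",
    "platform": "platform.py",
}

# Per-type subdirectory used by the default layout. Only directories that
# appear in this tutorial are listed here.
TYPE_DIRS = {
    "application": "applications",
    "package_manager": "package_managers",
    "system": "systems",
    "platform": "platforms",
}


def object_path(repo, obj_type, name, use_type_dirs=True):
    """Return the expected path of an object definition file.

    Hypothetical helper: use_type_dirs=True models the default layout,
    use_type_dirs=False models a repository created with -d "".
    """
    parts = [repo]
    if use_type_dirs:
        parts.append(TYPE_DIRS[obj_type])
    parts.extend([name, FILE_NAMES[obj_type]])
    return "/".join(parts)


print(object_path("tutorial-repo", "application", "hostname"))
print(object_path("tutorial-repo", "package_manager", "hostname", use_type_dirs=False))
```

In the flat layout, the type is carried only by the file name, which is why an application and a package manager named hostname can share one directory.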
Once your repository is created, you can register it with Ramble by issuing the following command:
$ ramble repo add tutorial-repo
NOTE: Ramble comes with a default builtin repository. Adding new
repositories gives them a higher precedence than existing repositories.
Ramble uses this precedence ordering to decide which object definition is used
when multiple exist with the same name. Each repository has a namespace, and
these namespaces can be used to refer to specific instances of each object
definition.
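The precedence rule in the note above can be pictured with a small conceptual model. This is only a sketch of the idea and not how Ramble implements repositories. The function names and the tutorial namespace are assumptions made for illustration.

```python
def add_repo(repos, repo):
    """Return a new repo list with the added repo at highest precedence."""
    return [repo] + repos


def resolve(repos, name, namespace=None):
    """Return the namespace of the repo that provides an object name.

    repos is a list of (namespace, set_of_object_names) tuples ordered from
    highest to lowest precedence. Passing a namespace restricts the search
    to that repository.
    """
    for repo_namespace, objects in repos:
        if namespace is not None and repo_namespace != namespace:
            continue
        if name in objects:
            return repo_namespace
    return None


# Start with only the builtin repository, then add a second one.
repos = [("builtin", {"hostname", "gromacs"})]
repos = add_repo(repos, ("tutorial", {"hostname"}))  # "tutorial" is a made-up namespace

print(resolve(repos, "hostname"))                       # newest repository wins
print(resolve(repos, "gromacs"))                        # only defined in builtin
print(resolve(repos, "hostname", namespace="builtin"))  # explicit namespace
```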
System And Platform Classes
System classes are intended to represent a cluster in a compute facility, while a platform class is intended to represent a compute node within the system. These objects are composable, as a system could be constructed out of many different platform types.
Systems allow specification of a default workflow manager, default package manager, and a default platform. Users are allowed to override these within their workspaces. Systems can define their available platforms, which will force users to select one of the available platforms.
To begin with, we’ll create a platform definition, and then a system definition that uses the new platform definition.
Your Platform Definition
In this section, you will write a platform definition representing whatever machine you are running the tutorial on. The values in this section don’t actually matter, but we will refer to Linux commands to get some system information.
Platform definitions are intended to represent a specific node. As a result, they have some node level definitions in them. The base platform definition is expected to define the following variables.
max_accelerators_per_node - (GPU/TPU/etc.) The number of accelerators each node has
max_sockets_per_node - (CPU) The number of sockets each node has
max_threads_per_core - (CPU) The number of threads available on each core
max_cores_per_node - (CPU) The number of cores available on each node
max_memory_per_node - (RAM) The amount of RAM in GB each node has
By default, a variant (validate_platform) is defined and set to True, which
requires these variables to be defined. You are free to change the default for
your platform, but it is usually simpler to define each variable, with any
value, in your platform definition.
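The requirement above amounts to a simple completeness check. The sketch below is a conceptual model only: the function name is hypothetical, and the tutorial does not describe how Ramble itself reports a missing variable.

```python
# The five variables a platform is expected to define (from the list above).
REQUIRED_PLATFORM_VARIABLES = (
    "max_accelerators_per_node",
    "max_sockets_per_node",
    "max_threads_per_core",
    "max_cores_per_node",
    "max_memory_per_node",
)


def missing_platform_variables(defined, validate_platform=True):
    """Return the required variable names that are absent from `defined`.

    `defined` is any container of variable names (for example a dict of
    name -> value). When validation is turned off, nothing is required.
    """
    if not validate_platform:
        return []
    return [name for name in REQUIRED_PLATFORM_VARIABLES if name not in defined]


partial = {"max_memory_per_node": 512, "max_cores_per_node": 64}
print(missing_platform_variables(partial))
print(missing_platform_variables(partial, validate_platform=False))
```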
To collect the CPU quantities, we will use the lscpu command. Examine the
output of this command, and extract all of the CPU quantities. We will refer to
these later when we’re writing the platform definition. The RAM quantity can be
collected using free -h, which will print the total amount of RAM on your
system. How you query the number of accelerators varies based on the
accelerator you are using. For the purposes of this tutorial, we will not
query accelerator information from the machine.
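If you would rather not read the lscpu output by eye, the CPU quantities can be extracted with a short script. The sketch below assumes the common lscpu field names Socket(s), Core(s) per socket, and Thread(s) per core. Some machines (certain virtual machines, for example) report these fields differently, in which case the script will need adjusting. The sample text is made up to match the values assumed later in this tutorial.

```python
def parse_lscpu(text):
    """Parse lscpu output into a dict of field name -> value string."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def cpu_quantities(text):
    """Derive the CPU-related platform variables from lscpu output."""
    fields = parse_lscpu(text)
    sockets = int(fields["Socket(s)"])
    cores_per_socket = int(fields["Core(s) per socket"])
    threads_per_core = int(fields["Thread(s) per core"])
    return {
        "max_sockets_per_node": sockets,
        "max_threads_per_core": threads_per_core,
        # lscpu reports cores per socket, so multiply to get cores per node.
        "max_cores_per_node": sockets * cores_per_socket,
    }


# Made-up sample output; replace with the real output from your machine.
SAMPLE = """\
Architecture:        x86_64
CPU(s):              64
Thread(s) per core:  1
Core(s) per socket:  32
Socket(s):           2
"""

print(cpu_quantities(SAMPLE))
```

To run this against the current machine, pass the text returned by subprocess.run(["lscpu"], capture_output=True, text=True).stdout to cpu_quantities instead of SAMPLE.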
Create Platform Definition
At this point, you should create a new platform file in:
tutorial-repo/platforms/my-platform/platform.py
You can create this file using the following commands, or you can use whatever method you prefer:
$ mkdir -p tutorial-repo/platforms/my-platform
$ touch tutorial-repo/platforms/my-platform/platform.py
You can edit the platform.py file using your editor of choice, or by
executing:
$ ramble edit --type platforms my-platform
Platform Class
Ramble provides a module, platkit, which imports useful methods,
language features, and utility classes when constructing platform definitions.
Each platform definition should import this using:
from ramble.platkit import *
Platform definitions in Ramble contain a Python class that defines the
attributes of the platform. The name of the class matches the directory name
for the platform, but converted to CamelCase. For example, our platform
directory is named my-platform, so the class name should be MyPlatform.
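The naming rule can be sketched as a small function. This is a minimal illustration that handles the hyphen and underscore separated names used in this tutorial. Ramble's own conversion may handle additional cases, and the function name here is hypothetical.

```python
import re


def class_name_for(directory_name):
    """Convert an object directory name such as my-platform to CamelCase."""
    parts = re.split(r"[-_]", directory_name)
    # Upper-case only the first character so the rest of each part is kept as-is.
    return "".join(part[:1].upper() + part[1:] for part in parts if part)


print(class_name_for("my-platform"))
print(class_name_for("my-system"))
```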
Ramble also provides a base class, PlatformBase, which handles applying the
language definitions, and several other standard aspects of how platforms
function in Ramble. Repositories are allowed to define base platform classes
(of the base_platform.py type) that can be used to build more complicated
inheritance chains, but we will not cover those in this tutorial.
Platform definitions should also have a class level name attribute that
matches (exactly) the directory name of the object. In our case:
name = 'my-platform'
Our beginning platform definition might look something like the following:
from ramble.platkit import *

class MyPlatform(PlatformBase):
    name = 'my-platform'
At this point, you should be able to see your platform definition in the output of:
$ ramble list --type platforms
Adding Platform Attributes
Previously, you collected information about your platform. In this section, you will add variable definitions to ensure your platform functions correctly.
This tutorial will assume you have 2 sockets per node, 1 thread per core, 64 cores per node, and 512 GB of RAM. It will also assume you have 2 accelerators on your platform, and they are both GPUs.
You can use the variable directive to define variables within the platform
definition for each of these quantities. For example, defining the amount of
memory per node can be done using:
variable(
    "max_memory_per_node",
    default=512,
    description="Amount of RAM in GB on each node",
)
Additionally, you can use the variant directive to enable users to control
aspects of the platform. The PlatformBase class automatically adds two
variants: accelerator, which can be True or False (defaults to
False), and accelerator_type, which defaults to None but can be set to
a string.
Since we are assuming your platform has 2 GPUs per node, we will update these variants. If your platform doesn’t have GPUs, you can ignore this portion. These variants can be defined as follows:
variant(
    "accelerator",
    default=True,
    description="Whether platform has accelerators or not",
)
variant(
    "accelerator_type",
    default="GPU",
    values=[None, "GPU"],
    description="Type of accelerator on this platform",
)
Now, if you add the remaining variable definitions to the class, it might look something like the following:
from ramble.platkit import *

class MyPlatform(PlatformBase):
    name = 'my-platform'

    variable(
        "max_sockets_per_node",
        default=2,
        description="Number of sockets on each node",
    )
    variable(
        "max_threads_per_core",
        default=1,
        description="Threads on each core",
    )
    variable(
        "max_cores_per_node",
        default=64,
        description="Number of cores on each node",
    )
    variable(
        "max_memory_per_node",
        default=512,
        description="Amount of RAM in GB on each node",
    )
    variable(
        "max_accelerators_per_node",
        default=2,
        description="Number of accelerators on each node",
    )
    variant(
        "accelerator",
        default=True,
        description="Whether this platform has accelerators or not",
    )
    variant(
        "accelerator_type",
        default="GPU",
        values=[None, "GPU"],
        description="Type of accelerator on this platform",
    )
All of this information should show up correctly when you execute:
$ ramble info --type platforms my-platform
While this concludes the creation of your platform class in this tutorial, you are free to add additional variables, variants, and more advanced features to your platform definition.
Your System Definition
In this section, you will write a system definition representing a fictitious cluster built from nodes of the platform class you just created. We will assume your cluster uses the Spack package manager and the SLURM workflow manager, to show off some of the features of system classes. However, you are free to change your system class however you wish.
System classes are intended to represent a cluster. Clusters are assumed to be collections of nodes that are represented by platform classes, like the one you just created. While the platform class has several required variables, the system class only has one that it directly requires, but other aspects of a system can imply additional required variables.
The only variable the system class requires is max_nodes which should
define how many nodes of a given platform there are. Each platform in the
system can have a different number of nodes, and this variable can take
different values based on the platform selected. For the purposes of this
tutorial, we’ll assume you have 4 nodes in your system and you only have nodes
of the type my-platform.
Create System Definition
At this point, you should create a new system file in:
tutorial-repo/systems/my-system/system.py
You can create this file using the following commands, or you can use whatever method you prefer:
$ mkdir -p tutorial-repo/systems/my-system
$ touch tutorial-repo/systems/my-system/system.py
You can edit the system.py file using your editor of choice, or by
executing:
$ ramble edit --type systems my-system
System Class
As with platform classes, Ramble provides a module, syskit, which
imports useful functionality for creating system classes. Each system
definition should import this using:
from ramble.syskit import *
The name of system classes follows the same pattern as platform classes (as
well as any other class in Ramble). As a result, your system class should be
named MySystem, and it should have name = 'my-system' as a class level
attribute.
The result might look something like the following:
from ramble.syskit import *

class MySystem(SystemBase):
    name = 'my-system'
As with platforms, you should be able to see the my-system system listed in
the output of:
$ ramble list --type systems
Defining The System
Similar to platform classes, system classes can define variables. In this
case, we need to define max_nodes, and are free to define additional
variables. System classes can also define defaults for the package manager,
workflow manager, and platform, and can define the available platforms to help
prevent users from configuring experiments that won’t function properly. It is
important to note that any validation can be disabled by the user, by setting
the variant validate_system: False in their workspace.
As mentioned before, we will assume your system has Spack as the package manager, and SLURM as the workflow manager. The resulting system class might look something like the following:
from ramble.syskit import *

class MySystem(SystemBase):
    name = 'my-system'

    available_platforms(['my-platform'])
    default_workflow_manager('slurm')
    default_package_manager('spack')
    default_platform('my-platform')

    with when("platform=my-platform"):
        variable(
            "max_nodes",
            default=4,
            description="Number of nodes of this platform in system",
        )
At this stage, you have a fairly complete system class, and all of this
information should be viewable using ramble info --type systems my-system.
As shown in this example, the with when(...) context manager can be used to
define variables for each platform, and construct more complicated behaviors
within a system class.
Default Workflow Manager Variables
Some workflow managers have additional required variables, to ensure they
function properly. In this case, we are using the SLURM workflow manager, which
requires a variable slurm_partition to be defined, that tells experiments
how to submit jobs onto the correct hardware. As mentioned earlier, a system
could contain multiple platforms, representing different physical nodes (and in
the case of SLURM, these might be separate partitions).
To help connect our platform to our workflow manager, the system can define the
required variables. This can be done using the with when context manager we
saw earlier, or it can be accomplished using the platform_variable_map
directive. This directive functions as almost the inverse of the context
manager. Below is an example of this directive being used to define the
slurm_partition variable.
platform_variable_map(
    "slurm_partition",
    var_map={
        "my-platform": "partition1",
    },
)
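One way to picture the "almost the inverse" relationship: the directive groups values by variable first and platform second, while the context manager groups them by platform first. The sketch below regroups one form into the other using plain dictionaries. It is a conceptual illustration only and does not use Ramble. The function name, the second platform, and the second partition are made up for the example.

```python
def by_platform(variable_maps):
    """Regroup {variable: {platform: value}} into {platform: {variable: value}}."""
    result = {}
    for variable_name, var_map in variable_maps.items():
        for platform, value in var_map.items():
            result.setdefault(platform, {})[variable_name] = value
    return result


# Variable-first view, as written with platform_variable_map.
# "other-platform" and "partition2" are hypothetical.
variable_first = {
    "slurm_partition": {
        "my-platform": "partition1",
        "other-platform": "partition2",
    },
}

# Platform-first view, as it would be written with one `with when(...)`
# block per platform.
print(by_platform(variable_first))
```

The variable-first form tends to be more compact when one variable takes a different value on each of many platforms.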
Default Software Configuration
Systems (and platforms) sometimes have software installed on them that should be used when building new software for experiments. This might include system provided compilers, MPI implementations, or even something like OpenSSH. Package managers have their own mechanisms for connecting to this software, and systems can help by providing system specific configuration files to the package manager in Ramble. In this case, we’re assuming you are using Spack.
Spack has several configuration files that can be manipulated to customize Spack’s behavior on your system. As an example, Spack has a packages.yaml file that can be used to control package preferences and to connect to external packages or compilers.
For the purposes of this tutorial, we will assume that the image your system
uses has OpenSSH and OpenMPI both installed in /usr/, and we want to tell
Spack (through Ramble) that these exist, and users shouldn’t build their own
installation of these packages. An example packages.yaml file can be seen
below:
packages:
  openssh:
    buildable: false
    externals:
    - spec: openssh@9.9p1
      prefix: /usr
  openmpi:
    buildable: false
    externals:
    - spec: openmpi@4.1.4
      prefix: /usr
Ramble has a directive auxiliary_software_file that can be used to add a
file that should be included in every environment created within a workspace,
when a specific package manager is used. By default, this directive will search
for files alongside the Python file for the object that is registering the
auxiliary software file.
To use this directive, write the contents from the example packages.yaml
into the file:
tutorial-repo/systems/my-system/packages.yaml.tpl
Now, within the system class, add the following:
with when("package_manager_family=spack"):
    auxiliary_software_file(
        "packages.yaml",
        src_path="packages.yaml.tpl",
        dest_path="packages.yaml",
    )
This causes experiments that are generated using the my-system system class,
and that use the Spack package manager, to apply our example
packages.yaml file to their software environments.
NOTE: While we are showing this directive in the context of system classes, any object can register auxiliary software files.
NOTE: The context manager here uses the family of package managers rather than an explicit package manager name. This helps to ensure that inherited package managers with customized behavior will still be identified as part of this family.
NOTE: While the Spack package manager handles applying these YAML files to resulting environments, other package managers might handle this behavior differently.
Testing System and Platform Definitions
At this point, you have successfully created both a platform, and a system class. You are now able to create experiments that would utilize these two classes. This section will walk through testing these classes, but you won’t actually execute any experiments.
Configure Experiments
First, we need to create a workspace that uses these new classes. Here, we will pretend we are going to create experiments using Gromacs. Create and activate a test workspace:
$ ramble workspace create -d test-sys-plat -a
Next, we will add some gromacs experiments to the workspace:
$ ramble workspace manage experiments gromacs --wf water_bare -V system=my-system \
-v n_ranks={processes_per_node}*{n_nodes} -v n_nodes=[1,2,4] -v processes_per_node={cores_per_node} \
-e system-test-{n_nodes}
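To see what this command should produce, the variable expansion can be worked through by hand. The sketch below assumes processes_per_node resolves to 64, matching the cores per node assumed for my-platform earlier. The function name is hypothetical, and the arithmetic simply mirrors the n_ranks expression given on the command line.

```python
PROCESSES_PER_NODE = 64  # assumed: cores per node of my-platform in this tutorial


def expand_experiments(node_counts, processes_per_node):
    """Expand the n_nodes list into one experiment description per value."""
    experiments = []
    for n_nodes in node_counts:
        experiments.append(
            {
                # Mirrors the name template system-test-{n_nodes}.
                "name": f"system-test-{n_nodes}",
                "n_nodes": n_nodes,
                # Mirrors n_ranks = {processes_per_node}*{n_nodes}.
                "n_ranks": processes_per_node * n_nodes,
            }
        )
    return experiments


for experiment in expand_experiments([1, 2, 4], PROCESSES_PER_NODE):
    print(experiment)
```

Under these assumptions, the three experiments use 64, 128, and 256 ranks.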
Now that the experiments are configured, we will add some software packages and an environment.
$ ramble workspace manage software --pkg gcc --spec "gcc@14.2.0 +binutils target=x86_64"
$ ramble workspace manage software --pkg gromacs --spec gromacs@{application::version} --compiler gcc
$ ramble workspace manage software --pkg openmpi --spec openmpi@4.1.4
$ ramble workspace manage software --env gromacs --environment-packages gromacs,openmpi
At this point, you should be able to examine the experiments in the workspace using:
$ ramble workspace info
This should show that there are three experiments, all using Gromacs, and changing the node count between 1, 2, and 4. Once this prints the three experiments correctly, you can perform a dry-run setup using:
$ ramble workspace setup --dry-run
You can now examine the slurm_experiment_sbatch scripts inside the
experiment directories (e.g.
test-sys-plat/experiments/gromacs/water_bare/system-test-1/slurm_experiment_sbatch)
to see that the partition name is applied correctly, and that the number of
cores per node and other platform settings are correct.
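If you want to check the partition across several generated scripts, a small helper can pull it out of the #SBATCH directives. The tutorial does not show the exact contents of the generated slurm_experiment_sbatch script, so the sample text below is made up. The helper only relies on the standard SLURM forms --partition=NAME, --partition NAME, and -p NAME.

```python
import re

# Matches "#SBATCH --partition=NAME", "#SBATCH --partition NAME",
# and "#SBATCH -p NAME" at the start of a line.
_PARTITION_RE = re.compile(
    r"^#SBATCH\s+(?:--partition(?:=|\s+)|-p\s*)(\S+)", re.MULTILINE
)


def sbatch_partition(script_text):
    """Return the partition requested in a batch script, or None if absent."""
    match = _PARTITION_RE.search(script_text)
    return match.group(1) if match else None


# Made-up script text; a real check would read the generated file instead.
SAMPLE_SCRIPT = """\
#!/bin/bash
#SBATCH -N 1
#SBATCH --partition=partition1
"""

print(sbatch_partition(SAMPLE_SCRIPT))
```

To check a real script, read it with open(path).read() and compare the result with the partition name set in your system definition.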
Once this is verified, you can examine the contents of the software environment
(i.e. test-sys-plat/software/spack/gromacs/spack.yaml) and see that the
packages section has been applied from the template config we defined
earlier.
Summary and Final Cleanup
At this stage, you have now created new system and platform definitions that can customize the behavior of a workspace for a specific set of hardware. You have tested it within a workspace, and have constructed a custom object repository to create new definitions in.
To clean up your system, make sure to deactivate your workspace before trying to remove it. These steps can be completed with:
$ ramble workspace deactivate
$ rm -rf test-sys-plat
You are also free to delete the tutorial-repo repository, but make sure you
unregister it from your list of repositories using the ramble repo rm
command.