Registering an HPC System

HPC systems usually rely on a batch scheduler such as Slurm to schedule jobs on a cluster of machines. You can register an HPC cluster as a Tapis system so that Tapis can submit and monitor jobs on the cluster for you.

In this example, we will register a system on the Stampede2 supercomputer at TACC. Access to Stampede2 requires that your TACC account have a valid allocation on Stampede2. If you do not have an allocation on Stampede2, you can still use the concepts illustrated in this tutorial to register another HPC cluster that you do have access to.

The System Description

To register a system with Tapis, you describe the system in a JSON object. The description includes information about how to connect to the system, which container runtimes are available, which batch scheduler it uses, the queues defined, and more.

The following contains a description of the Stampede2 cluster. Copy and paste the code below into your Jupyter notebook, and update <username> in the id field to your Tapis username.

s2_system = {
  "id": "stampede2.<username>",
  "description": "System for running jobs on the Stampede2 cluster",
  "systemType": "LINUX",
  "host": "stampede2.tacc.utexas.edu",
  "defaultAuthnMethod": "PKI_KEYS",
  "effectiveUserId": "${apiUserId}",
  "port": 22,
  "rootDir": "/",
  "canExec": True,
  "jobRuntimes": [ { "runtimeType": "SINGULARITY" } ],
  "jobWorkingDir": "HOST_EVAL($WORK2)",
  "canRunBatch": True,
  "batchScheduler": "SLURM",
  "batchSchedulerProfile": "tacc",
  "batchDefaultLogicalQueue": "tapisNormal",
  "batchLogicalQueues": [
    {
      "name": "tapisNormal",
      "hpcQueueName": "normal",
      "maxJobs": 50,
      "maxJobsPerUser": 10,
      "minNodeCount": 1,
      "maxNodeCount": 16,
      "minCoresPerNode": 1,
      "maxCoresPerNode": 68,
      "minMemoryMB": 1,
      "maxMemoryMB": 16384,
      "minMinutes": 1,
      "maxMinutes": 60
    }
  ]
}

In the description above, we set effectiveUserId to the string ${apiUserId}. Recall that this tells Tapis to use the identity (that is, the username) associated with the token on the API request whenever it interacts with this system. We could have just hard-coded our own username (e.g., "jstubbs") instead, but this approach means that if we share the system with another Tapis user, Tapis will use that user’s identity to interact with the system instead of our own. We’ll cover sharing in more detail in a future tutorial.
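The substitution described above can be pictured with a small toy model. This is only a sketch of the behavior, not how Tapis implements it: the function resolve_effective_user is hypothetical and is not part of Tapis or tapipy, and "another_user" is a made-up username.

```python
API_USER_PLACEHOLDER = "${apiUserId}"


def resolve_effective_user(effective_user_id: str, token_username: str) -> str:
    """Toy model: return the username a request would act as on the host."""
    if effective_user_id == API_USER_PLACEHOLDER:
        # Dynamic case: act as whoever owns the token on the request.
        return token_username
    # Static case: the hard-coded username is used for every caller.
    return effective_user_id


# With the placeholder, each caller is mapped to their own account.
print(resolve_effective_user("${apiUserId}", "jstubbs"))
print(resolve_effective_user("${apiUserId}", "another_user"))

# With a hard-coded value, every caller is mapped to the same account.
print(resolve_effective_user("jstubbs", "another_user"))
```

The third call shows why a hard-coded username is less convenient for a shared system: every user would act as the same account on the host.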

To keep things simple, our description of Stampede2 includes just one queue, the normal queue. We can add additional queues to the description if we wish to submit jobs to them.
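As a rough sketch of what adding a queue might look like, the snippet below appends a second entry to batchLogicalQueues before the system is registered. The queue name tapisDevelopment, the HPC queue name development, and all of its limits are illustrative assumptions; check the queue names and limits that apply on your cluster. The description is trimmed down here so the snippet runs on its own.

```python
# Trimmed-down copy of the system description, so this sketch is self-contained.
s2_system = {
    "id": "stampede2.<username>",
    "batchDefaultLogicalQueue": "tapisNormal",
    "batchLogicalQueues": [
        {"name": "tapisNormal", "hpcQueueName": "normal", "maxMinutes": 60},
    ],
}

# Hypothetical second queue: the names and limits below are assumptions.
dev_queue = {
    "name": "tapisDevelopment",
    "hpcQueueName": "development",
    "maxJobs": 5,
    "maxJobsPerUser": 1,
    "minNodeCount": 1,
    "maxNodeCount": 2,
    "minCoresPerNode": 1,
    "maxCoresPerNode": 68,
    "minMemoryMB": 1,
    "maxMemoryMB": 16384,
    "minMinutes": 1,
    "maxMinutes": 30,
}
s2_system["batchLogicalQueues"].append(dev_queue)

# Sanity check: the default logical queue should be one of the defined queues.
queue_names = [q["name"] for q in s2_system["batchLogicalQueues"]]
assert s2_system["batchDefaultLogicalQueue"] in queue_names
print(queue_names)
```

The logical queue name (name) is the one Tapis users refer to, while hpcQueueName is the scheduler queue it maps to, following the pattern used for tapisNormal above.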

Note also our use of HOST_EVAL($WORK2) for jobWorkingDir. The HOST_EVAL() function instructs Tapis to evaluate an environment variable (in this case, the $WORK2 variable) on the host itself to determine the working directory for jobs. This is useful whenever you want the job working directory to be dynamically determined from a variable defined on the system.
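To make the idea concrete, here is a local emulation of that lookup. It is a sketch only: Tapis performs the evaluation on the remote host, whereas the hypothetical emulate_host_eval function below reads from a plain dictionary that stands in for the host's environment. The example path is made up.

```python
import re

# Matches values of the exact form HOST_EVAL($NAME) and captures NAME.
HOST_EVAL_PATTERN = re.compile(r"HOST_EVAL\(\$(\w+)\)")


def emulate_host_eval(value: str, host_env: dict) -> str:
    """Toy model: resolve HOST_EVAL($VAR) against a stand-in host environment."""
    match = HOST_EVAL_PATTERN.fullmatch(value)
    if match is None:
        # Not a HOST_EVAL expression: treat the value as a literal path.
        return value
    variable_name = match.group(1)
    # A KeyError here models a variable that is not defined on the host.
    return host_env[variable_name]


# Stand-in for the environment on the host (the path is hypothetical).
fake_host_env = {"WORK2": "/work2/example/jstubbs"}

print(emulate_host_eval("HOST_EVAL($WORK2)", fake_host_env))
print(emulate_host_eval("/tmp/jobs", fake_host_env))
```

Because the variable is looked up at the time it is needed, the same system description can resolve to a different directory for each user whose environment defines $WORK2 differently.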

With the system description defined, we are ready to register it with Tapis. We do that as follows:

t.systems.createSystem(**s2_system)

We should now be able to list systems and see our Stampede2 system:

t.systems.getSystems()

The output should include something like this:

[
 canExec: true
 defaultAuthnMethod: PKI_KEYS
 effectiveUserId: jstubbs
 host: stampede2.tacc.utexas.edu
 id: stampede2.jstubbs
 owner: jstubbs
 systemType: LINUX,
 ...
]
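When the listing contains many systems, it can be handy to pick out one entry by its id. The sketch below assumes each listed system exposes its fields as attributes (for example, .id and .host), which matches the output shown above but should be checked against your client version. The find_system helper is hypothetical, and SimpleNamespace objects stand in for the real results so the snippet runs without a Tapis connection.

```python
from types import SimpleNamespace


def find_system(systems, system_id):
    """Return the first listed system whose id matches, or None if absent."""
    for system in systems:
        if system.id == system_id:
            return system
    return None


# Stand-ins for the objects a call such as t.systems.getSystems() might return.
listed = [
    SimpleNamespace(id="other.system", host="example.org", owner="jstubbs"),
    SimpleNamespace(
        id="stampede2.jstubbs",
        host="stampede2.tacc.utexas.edu",
        owner="jstubbs",
    ),
]

match = find_system(listed, "stampede2.jstubbs")
print(match.host if match else "not found")
```

In a notebook, the listed variable would be replaced with the result of the getSystems() call above.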

We can also retrieve full details about our system using its id (update the <username> in the call below):

t.systems.getSystem(systemId='stampede2.<username>')

The output is much more verbose:

authnCredential: None
batchDefaultLogicalQueue: tapisNormal
batchLogicalQueues: [
hpcQueueName: normal
maxCoresPerNode: 68
maxJobs: 50
maxJobsPerUser: 10
maxMemoryMB: 16384
maxMinutes: 60
maxNodeCount: 16
minCoresPerNode: 1
minMemoryMB: 1
minMinutes: 1
minNodeCount: 1
name: tapisNormal]
...

Next Steps

Note that before we can actually use this system with Tapis, we will need to register at least one credential for it. We will do that next.

Next-> Registering System Credentials

Additional Resources

  • Additional details about the Systems endpoints - API specification