caliban cluster

This subcommand allows you to create GKE clusters and submit jobs to them using caliban’s packaging and interface features.

caliban cluster ls

This command lists the clusters currently available in your project.

usage: caliban cluster ls [-h] [--helpfull] [--project_id PROJECT_ID]
                          [--cloud_key CLOUD_KEY] [--zone ZONE]

list clusters

optional arguments:
  -h, --help            show this help message and exit
  --helpfull            show full help message and exit
  --project_id PROJECT_ID
                        ID of the GCloud AI Platform/GKE project to use for
                        Cloud job submission and image persistence. (Defaults
                        to $PROJECT_ID; errors if both the argument and
                        $PROJECT_ID are empty.) (default: None)
  --cloud_key CLOUD_KEY
                        Path to GCloud service account key. (Defaults to
                        $GOOGLE_APPLICATION_CREDENTIALS.) (default: None)
  --zone ZONE           zone (default: None)

Here you may specify a project, credentials file, or cloud zone to narrow your listing. If you do not specify these, caliban tries to determine them from the system defaults.
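For example, to restrict the listing to a particular project and zone (the project id and zone below are just placeholders), you might run:

caliban cluster ls --project_id my-project --zone us-central1-a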

caliban cluster create

This command creates a new cluster in your project. Typically, if you are going to use GKE, you will create a single long-running cluster in your project first and leave it running across many job submissions. In caliban we configure the cluster to take advantage of autoscaling wherever possible.

In GKE, there are two types of autoscaling. The first is known as ‘cluster autoscaling’: this mode automatically increases the number of nodes in your cluster’s node pools as job demand increases. In caliban we configure this automatically, querying your CPU and accelerator quotas to set the cluster autoscaling limits. In this way, your cluster automatically adds nodes when you need them and deletes them when they are no longer needed, which is, of course, quite useful for keeping your costs low.

The second type of autoscaling in GKE is ‘node autoprovisioning’. This addresses the issue that accelerator-enabled instances must be allocated from a node pool in which the particular CPU/memory/GPU configuration is fixed. If you only need to support a handful of node configurations, you can create autoscaling node pools manually. If, however, you wish to support many configurations, or, as in caliban’s case, arbitrary configurations, this becomes more difficult. Node autoprovisioning automatically creates autoscaling node pools based on the requirements of the jobs submitted to the cluster, and deletes these node pools once they are no longer needed. In caliban we enable node autoprovisioning so that you can specify your GPU and machine types on a per-job basis, and the Kubernetes engine will automatically create the appropriate node pools to accommodate your jobs.
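As a result, you can submit jobs with different accelerator requirements to the same cluster and let GKE create (and later remove) a matching node pool for each. A sketch using the job submission command documented below, where the module name and GPU specs are illustrative:

caliban cluster job submit --gpu_spec 1xV100 trainer.train -- --epochs 10
caliban cluster job submit --gpu_spec 4xK80 trainer.train -- --epochs 10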

The syntax for this command is as follows:

totoro@totoro:$ caliban cluster create --help
usage: caliban cluster create [-h] [--helpfull] [--project_id PROJECT_ID]
                              [--cloud_key CLOUD_KEY]
                              [--cluster_name CLUSTER_NAME] [--zone ZONE]
                              [--dry_run]
                              [--release_channel ['UNSPECIFIED', 'RAPID', 'REGULAR', 'STABLE']]
                              [--single_zone]

create cluster

optional arguments:
  -h, --help            show this help message and exit
  --helpfull            show full help message and exit
  --project_id PROJECT_ID
                        ID of the GCloud AI Platform/GKE project to use for
                        Cloud job submission and image persistence. (Defaults
                        to $PROJECT_ID; errors if both the argument and
                        $PROJECT_ID are empty.) (default: None)
  --cloud_key CLOUD_KEY
                        Path to GCloud service account key. (Defaults to
                        $GOOGLE_APPLICATION_CREDENTIALS.) (default: None)
  --cluster_name CLUSTER_NAME
                        cluster name (default: None)
  --zone ZONE           for a single-zone cluster, this specifies the zone for
                        the cluster control plane and all worker nodes, while
                        for a multi-zone cluster this specifies only the zone
                        for the control plane, while worker nodes may be
                        created in any zone within the same region as the
                        control plane. The single_zone argument specifies
                        whether to create a single- or multi- zone cluster.
                        (default: None)
  --dry_run             Don't actually submit; log everything that's going to
                        happen. (default: False)
  --release_channel ['UNSPECIFIED', 'RAPID', 'REGULAR', 'STABLE']
                        cluster release channel, see
                        https://cloud.google.com/kubernetes-
                        engine/docs/concepts/release-channels (default: REGULAR)
  --single_zone         create a single-zone cluster if set, otherwise create
                        a multi-zone cluster: see
                        https://cloud.google.com/kubernetes-
                        engine/docs/concepts/types-of-
                        clusters#cluster_availability_choices (default: False)

You can use the --dry_run flag to see the specification for the cluster that would be submitted to GKE without actually creating the cluster.

A typical creation request (with --dry_run):

totoro@totoro:$ caliban cluster create --zone us-central1-a --cluster_name newcluster --dry_run
I0303 13:07:34.257717 140660011796288 cli.py:160] request:
{'cluster': {'autoscaling': {'autoprovisioningNodePoolDefaults': {'oauthScopes': ['https://www.googleapis.com/auth/compute',
                                                                                  'https://www.googleapis.com/auth/cloud-platform']},
                             'enableNodeAutoprovisioning': 'true',
                             'resourceLimits': [{'maximum': '72',
                                                 'resourceType': 'cpu'},
                                                {'maximum': '4608',
                                                 'resourceType': 'memory'},
                                                {'maximum': '8',
                                                 'resourceType': 'nvidia-tesla-k80'},
                                                {'maximum': '1',
                                                 'resourceType': 'nvidia-tesla-p100'},
                                                {'maximum': '1',
                                                 'resourceType': 'nvidia-tesla-v100'},
                                                {'maximum': '1',
                                                 'resourceType': 'nvidia-tesla-p4'},
                                                {'maximum': '4',
                                                 'resourceType': 'nvidia-tesla-t4'}]},
             'enable_tpu': 'true',
             'ipAllocationPolicy': {'useIpAliases': 'true'},
             'locations': ['us-central1-a',
                           'us-central1-b',
                           'us-central1-c',
                           'us-central1-f'],
             'name': 'newcluster',
             'nodePools': [{'config': {'oauthScopes': ['https://www.googleapis.com/auth/devstorage.read_only',
                                                       'https://www.googleapis.com/auth/logging.write',
                                                       'https://www.googleapis.com/auth/monitoring',
                                                       'https://www.googleapis.com/auth/service.management.readonly',
                                                       'https://www.googleapis.com/auth/servicecontrol',
                                                       'https://www.googleapis.com/auth/trace.append']},
                            'initialNodeCount': '3',
                            'name': 'default-pool'}],
             'releaseChannel': {'channel': 'RAPID'},
             'zone': 'us-central1-a'},
 'parent': 'projects/totoro-project/locations/us-central1-a'}

Cluster creation can take a while to complete (often on the order of five minutes). When you use caliban to create a cluster, caliban will provide a link to the relevant GCP dashboard page where you can monitor the progress of your creation request. Caliban will also monitor the request itself, and once the cluster is created it will apply a daemonset that automatically installs NVIDIA drivers on any GPU-enabled nodes that get created, as described here.
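Caliban applies this daemonset for you, but for reference, the manual equivalent described in the GKE documentation looks roughly like the following (this sketch assumes the default COS node images):

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml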

caliban cluster delete

This command simply deletes an existing cluster. Typically you will leave your cluster running, but the cluster does consume some resources even when idle, so if you are not actively using it you may want to shut it down to save money.

The syntax of this command:

totoro@totoro:$ caliban cluster delete --help
usage: caliban cluster delete [-h] [--helpfull] [--project_id PROJECT_ID]
                              [--cloud_key CLOUD_KEY]
                              [--cluster_name CLUSTER_NAME] [--zone ZONE]

delete cluster

optional arguments:
  -h, --help            show this help message and exit
  --helpfull            show full help message and exit
  --project_id PROJECT_ID
                        ID of the GCloud AI Platform/GKE project to use for
                        Cloud job submission and image persistence. (Defaults
                        to $PROJECT_ID; errors if both the argument and
                        $PROJECT_ID are empty.) (default: None)
  --cloud_key CLOUD_KEY
                        Path to GCloud service account key. (Defaults to
                        $GOOGLE_APPLICATION_CREDENTIALS.) (default: None)
  --cluster_name CLUSTER_NAME
                        cluster name (default: None)
  --zone ZONE           zone (default: None)

As with most caliban commands, if you do not specify arguments, then caliban does its best to determine them from defaults. For example, if you have only a single cluster in your project, you can simply type caliban cluster delete.
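If your project contains more than one cluster, or you simply want to be explicit, you can pass the name and zone directly (the values here are hypothetical):

caliban cluster delete --cluster_name newcluster --zone us-central1-a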

caliban cluster job submit

Most of the CLI arguments for caliban cluster job submit are the same as those for caliban cloud:

totoro@totoro:$ caliban cluster job submit --help
usage: caliban cluster job submit [-h] [--helpfull]
                               [--cluster_name CLUSTER_NAME] [--nogpu]
                               [--cloud_key CLOUD_KEY] [--extras EXTRAS]
                               [-d DIR] [--image_tag IMAGE_TAG]
                               [--project_id PROJECT_ID]
                               [--min_cpu MIN_CPU] [--min_mem MIN_MEM]
                               [--gpu_spec NUMxGPU_TYPE]
                               [--tpu_spec NUMxTPU_TYPE]
                               [--tpu_driver TPU_DRIVER]
                               [--nonpreemptible_tpu] [--force]
                               [--name NAME]
                               [--experiment_config EXPERIMENT_CONFIG]
                               [-l KEY=VALUE] [--nonpreemptible]
                               [--dry_run] [--export EXPORT]
                               [--xgroup XGROUP]
                               module ...

submit cluster job(s)

positional arguments:
  module                Code to execute, in either 'trainer.train' or
                        'trainer/train.py' format. Accepts python scripts,
                        modules or a path to an arbitrary script.

optional arguments:
  -h, --help            show this help message and exit
  --helpfull            show full help message and exit
  --cluster_name CLUSTER_NAME
                        cluster name (default: None)
  --nogpu               Disable GPU mode and force CPU-only. (default: True)
  --cloud_key CLOUD_KEY
                        Path to GCloud service account key. (Defaults to
                        $GOOGLE_APPLICATION_CREDENTIALS.) (default: None)
  --extras EXTRAS       setup.py dependency keys. (default: None)
  -d DIR, --dir DIR     Extra directories to include. List these from large to
                        small to take full advantage of Docker's build cache.
                        (default: None)
  --image_tag IMAGE_TAG
                        Docker image tag accessible via Container Registry. If
                        supplied, Caliban will skip the build and push steps
                        and use this image tag. (default: None)
  --project_id PROJECT_ID
                        ID of the GCloud AI Platform/GKE project to use for
                        Cloud job submission and image persistence. (Defaults
                        to $PROJECT_ID; errors if both the argument and
                        $PROJECT_ID are empty.) (default: None)
  --min_cpu MIN_CPU     Minimum cpu needed by job, in milli-cpus. If not
                        specified, then this value defaults to 1500 for
                        gpu/tpu jobs, and 31000 for cpu jobs. Please note that
                        gke daemon processes utilize a small amount of cpu on
                        each node, so if you want to have your job run on a
                        specific machine type, say a 2-cpu machine, then if
                        you specify a minimum cpu of 2000, then your job will
                        not be schedulable on a 2-cpu machine as the daemon
                        processes will push the total cpu needed to more than
                        two full cpus. (default: None)
  --min_mem MIN_MEM     Minimum memory needed by job, in MB. Please note that
                        gke daemon processes utilize a small amount of memory
                        on each node, so if you want to have your job run on a
                        specific machine type, say a machine with 8GB total
                        memory, then if you specify a minimum memory of
                        8000MB, then your job will not be schedulable on a 8GB
                        machine as the daemon processes will push the total
                        memory needed to more than 8GB. (default: None)
  --gpu_spec NUMxGPU_TYPE
                        Type and number of GPUs to use for each AI
                        Platform/GKE submission. Defaults to 1xP100 in GPU
                        mode or None if --nogpu is passed. (default: None)
  --tpu_spec NUMxTPU_TYPE
                        Type and number of TPUs to request for each AI
                        Platform/GKE submission. Defaults to None. (default:
                        None)
  --tpu_driver TPU_DRIVER
                        tpu driver (default: 1.14)
  --nonpreemptible_tpu  use non-preemptible tpus: note this only applies to
                        v2-8 and v3-8 tpus currently, see:
                        https://cloud.google.com/tpu/docs/preemptible
                        (default: False)
  --force               Force past validations and submit the job as
                        specified. (default: False)
  --name NAME           Set a job name for AI Platform or GKE jobs. (default:
                        None)
  --experiment_config EXPERIMENT_CONFIG
                        Path to an experiment config, or 'stdin' to read from
                        stdin. (default: None)
  -l KEY=VALUE, --label KEY=VALUE
                        Extra label k=v pair to submit to Cloud. (default:
                        None)
  --nonpreemptible      use non-preemptible VM instance: please note that you
                        may need to upgrade your cluster to a recent
                        version/use the rapid release channel for preemptible
                        VMs to be supported with node autoprovisioning:
                        https://cloud.google.com/kubernetes-
                        engine/docs/release-notes-rapid#december_13_2019
                        (default: False)
  --dry_run             Don't actually submit; log everything that's going to
                        happen. (default: False)
  --export EXPORT       Export job spec(s) to file, extension must be one of
                        ('.yaml', '.json') (for example: --export my-job-
                        spec.yaml) For multiple jobs (i.e. in an experiment
                        config scenario), multiple files will be generated
                        with an index inserted (for example: --export my-job-
                        spec.yaml would yield my-job-spec_0.yaml, my-job-
                        spec_1.yaml...) (default: None)
  --xgroup XGROUP       This specifies an experiment group, which ties
                        experiments and job instances together. If you do not
                        specify a group, then a new one will be created. If
                        you specify an existing experiment group here, then
                        new experiments and jobs you create will be added to
                        the group you specify. (default: None)

pass-through arguments:
  -- YOUR_ARGS          This is a catch-all for arguments you want to pass
                        through to your script. any arguments after '--' will
                        pass through.

Again, this command very closely mirrors caliban cloud.
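A typical submission might look like the following, where the cluster name, module, GPU spec, and script arguments are all illustrative:

caliban cluster job submit --cluster_name newcluster --gpu_spec 2xV100 trainer.train -- --epochs 10 --batch_size 64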

You can export job requests created with caliban as a yaml or json file using the --export flag. You can then use this file with caliban cluster job submit_file, or with kubectl (https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#running-an-example-job), to submit the same job again.
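For example, you might export the spec at submission time and resubmit it later (the cluster name, file name, and module are illustrative):

caliban cluster job submit --cluster_name newcluster --export my-job-spec.yaml trainer.train
caliban cluster job submit_file my-job-spec.yaml

The exported file is a standard Kubernetes job spec, so kubectl apply -f my-job-spec.yaml should also work, assuming kubectl is configured to talk to your cluster.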

caliban cluster job submit_file

This command submits a Kubernetes job file to your cluster. This can be useful if you have a job that you run regularly: you can create the job initially with caliban cluster job submit, using the --export option to save the job spec file, and then use this command to submit the job again without having to specify all of the CLI arguments.

The syntax of this command:

totoro@totoro:$ caliban cluster job submit_file --help
usage: caliban cluster job submit_file [-h] [--helpfull]
                                       [--cluster_name CLUSTER_NAME]
                                       [--cloud_key CLOUD_KEY]
                                       [--project_id PROJECT_ID] [--dry_run]
                                       job_file

submit gke job from yaml/json file

positional arguments:
  job_file              kubernetes k8s job file ('.yaml', '.json')

optional arguments:
  -h, --help            show this help message and exit
  --helpfull            show full help message and exit
  --cluster_name CLUSTER_NAME
                        cluster name (default: None)
  --cloud_key CLOUD_KEY
                        Path to GCloud service account key. (Defaults to
                        $GOOGLE_APPLICATION_CREDENTIALS.) (default: None)
  --project_id PROJECT_ID
                        ID of the GCloud AI Platform/GKE project to use for
                        Cloud job submission and image persistence. (Defaults
                        to $PROJECT_ID; errors if both the argument and
                        $PROJECT_ID are empty.) (default: None)
  --dry_run             Don't actually submit; log everything that's going to
                        happen. (default: False)

Thus a common invocation would resemble:

caliban cluster job submit_file my-job.yaml