caliban status
The caliban status command lets you check the status of jobs submitted via
caliban. The command has two primary modes. The first returns your most recent
job submissions across all experiment groups:
$ caliban status --max_jobs 5
most recent 5 jobs for user totoro:
xgroup totoro-xgroup-2020-05-28-11-33-35:
  docker config 1: job_mode: CPU, build url: ~/sw/cluster/caliban/tmp/cpu, extra dirs: None
    experiment id 28: cpu.py --foo 3 --sleep 2
      job 56 STOPPED GKE 2020-05-28 11:33:35 container: gcr.io/totoro-project/0f6d8a3ddbee:latest name: job-stop-test-rssqq
    experiment id 29: cpu.py --foo 3 --sleep 600
      job 57 STOPPED GKE 2020-05-28 11:33:36 container: gcr.io/totoro-project/0f6d8a3ddbee:latest name: job-stop-test-c5x6v
xgroup totoro-xgroup-2020-05-28-11-40-52:
  docker config 1: job_mode: CPU, build url: ~/sw/cluster/caliban/tmp/cpu, extra dirs: None
    experiment id 30: cpu.py --foo 3 --sleep -1
      job 58 STOPPED CAIP 2020-05-28 11:40:54 container: gcr.io/totoro-project/0f6d8a3ddbee:latest name: caliban_totoro_20200528_114052_1
    experiment id 31: cpu.py --foo 3 --sleep 2
      job 59 STOPPED CAIP 2020-05-28 11:40:55 container: gcr.io/totoro-project/0f6d8a3ddbee:latest name: caliban_totoro_20200528_114054_2
    experiment id 32: cpu.py --foo 3 --sleep 600
      job 60 RUNNING CAIP 2020-05-28 11:40:56 container: gcr.io/totoro-project/0f6d8a3ddbee:latest name: caliban_totoro_20200528_114055_3
Here we can see five jobs that we recently submitted, in two experiment groups.
The first experiment group has jobs submitted to GKE, while the second has jobs
submitted to CAIP. You can specify the maximum number of jobs to return using
the --max_jobs flag.
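If you want to summarize this listing in a script, the job lines can be parsed with a regular expression. The following is only a minimal sketch: it assumes the job lines keep the layout shown in the example above (job id, status, platform, timestamp, container, and an optional name), and the names JOB_LINE and parse_jobs are hypothetical helpers, not part of caliban.

```python
import re
from collections import Counter

# Pattern for one "job ..." line, inferred from the sample output above.
# This layout is an assumption; caliban does not document it as a stable format.
JOB_LINE = re.compile(
    r"^\s*job\s+(?P<id>\d+)\s+(?P<status>[A-Z]+)\s+(?P<platform>[A-Z]+)\s+"
    r"(?P<created>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"container:\s+(?P<container>\S+)(?:\s+name:\s+(?P<name>\S+))?\s*$"
)


def parse_jobs(text):
    """Return one dict per job line found in `text`; other lines are skipped."""
    jobs = []
    for line in text.splitlines():
        match = JOB_LINE.match(line)
        if match:
            jobs.append(match.groupdict())
    return jobs


# Sample text copied from the listing above (experiment lines omitted).
sample = """\
xgroup totoro-xgroup-2020-05-28-11-33-35:
job 56 STOPPED GKE 2020-05-28 11:33:35 container: gcr.io/totoro-project/0f6d8a3ddbee:latest name: job-stop-test-rssqq
job 57 STOPPED GKE 2020-05-28 11:33:36 container: gcr.io/totoro-project/0f6d8a3ddbee:latest name: job-stop-test-c5x6v
xgroup totoro-xgroup-2020-05-28-11-40-52:
job 58 STOPPED CAIP 2020-05-28 11:40:54 container: gcr.io/totoro-project/0f6d8a3ddbee:latest name: caliban_totoro_20200528_114052_1
job 59 STOPPED CAIP 2020-05-28 11:40:55 container: gcr.io/totoro-project/0f6d8a3ddbee:latest name: caliban_totoro_20200528_114054_2
job 60 RUNNING CAIP 2020-05-28 11:40:56 container: gcr.io/totoro-project/0f6d8a3ddbee:latest name: caliban_totoro_20200528_114055_3
"""

jobs = parse_jobs(sample)
by_platform = dict(Counter(job["platform"] for job in jobs))
by_status = dict(Counter(job["status"] for job in jobs))
print(by_platform)
print(by_status)
```

For the sample above this counts two GKE jobs and three CAIP jobs, matching the two experiment groups in the listing.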
The second mode for the caliban status command returns jobs in a given
experiment group, using the --xgroup flag:
$ caliban status --xgroup xg2 --max_jobs 2
xgroup xg2:
  docker config 1: job_mode: CPU, build url: ~/sw/cluster/caliban/tmp/cpu, extra dirs: None
    experiment id 1: cpu.py --foo 3 --sleep -1
      job 34 FAILED CAIP 2020-05-08 18:26:56 container: gcr.io/totoro-project/e2a0b8fca1dc:latest name: caliban_totoro_1_20200508_182654
      job 37 FAILED CAIP 2020-05-08 19:01:08 container: gcr.io/totoro-project/e2a0b8fca1dc:latest name: caliban_totoro_1_20200508_190107
    experiment id 2: cpu.py --foo 3 --sleep 2
      job 30 SUCCEEDED LOCAL 2020-05-08 09:59:04 container: e2a0b8fca1dc
      job 35 SUCCEEDED CAIP 2020-05-08 18:26:57 container: gcr.io/totoro-project/e2a0b8fca1dc:latest name: caliban_totoro_2_20200508_182656
    experiment id 5: cpu.py --foo 3 --sleep 600
      job 36 STOPPED CAIP 2020-05-08 18:26:58 container: gcr.io/totoro-project/e2a0b8fca1dc:latest name: caliban_totoro_3_20200508_182657
      job 38 SUCCEEDED CAIP 2020-05-08 19:01:09 container: gcr.io/totoro-project/e2a0b8fca1dc:latest name: caliban_totoro_3_20200508_190108
Here we can see the jobs that have been submitted as part of the xg2
experiment group. By specifying --max_jobs 2 in the call, we can see the two
most recent job submissions for each experiment in the group. In this case, we
can see that experiment 2 was submitted both locally and to CAIP at different
times. We can also see that experiment 1 failed (due to an invalid parameter),
and that the first submission to CAIP of experiment 5 was stopped by the user.
Note also that the container hash is the same for every one of these job
submissions, which tells us that the underlying code did not change between
submissions.
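The container comparison described above can also be done programmatically. This is a rough sketch under the assumption that the hash is the last path component of the image reference with any tag removed, as in the listing above; image_hash is a hypothetical helper, not a caliban function.

```python
def image_hash(container: str) -> str:
    """Extract the hash portion of a container reference.

    Handles both forms seen in the listing above:
      gcr.io/totoro-project/e2a0b8fca1dc:latest  (pushed image)
      e2a0b8fca1dc                               (local image)
    """
    last_component = container.rsplit("/", 1)[-1]  # drop registry and project
    return last_component.split(":", 1)[0]         # drop the tag, if any


# Container fields copied from the xg2 listing above.
containers = [
    "gcr.io/totoro-project/e2a0b8fca1dc:latest",  # job 34
    "gcr.io/totoro-project/e2a0b8fca1dc:latest",  # job 37
    "e2a0b8fca1dc",                               # job 30 (LOCAL)
    "gcr.io/totoro-project/e2a0b8fca1dc:latest",  # job 35
    "gcr.io/totoro-project/e2a0b8fca1dc:latest",  # job 36
    "gcr.io/totoro-project/e2a0b8fca1dc:latest",  # job 38
]

hashes = {image_hash(c) for c in containers}
same_code = len(hashes) == 1  # a single distinct hash means one shared image
print(hashes, same_code)
```

A single distinct hash across the local and CAIP submissions is what lets us conclude that all six jobs ran the same container image.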
This command supports the following arguments:
$ caliban status --help
usage: caliban status [-h] [--helpfull] [--xgroup XGROUP]
                      [--max_jobs MAX_JOBS]

optional arguments:
  -h, --help           show this help message and exit
  --helpfull           show full help message and exit
  --xgroup XGROUP      experiment group
  --max_jobs MAX_JOBS  Maximum number of jobs to view. If you specify an
                       experiment group, then this specifies the maximum
                       number of jobs per experiment to view. If you do not
                       specify an experiment group, then this specifies the
                       total number of jobs to return, ordered by creation
                       date, or all jobs if max_jobs==0.
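The two meanings of --max_jobs can be illustrated with a small model. This is a sketch of the semantics described in the help text, not caliban's actual implementation: the Job record and select_jobs function are hypothetical, the newest-first ordering of the result is an assumption, and treating max_jobs == 0 as "no limit" when an experiment group is given is also an assumption, since the help text only states it for the case without a group.

```python
from collections import defaultdict
from typing import List, NamedTuple, Optional


class Job(NamedTuple):
    job_id: int
    xgroup: str
    experiment_id: int
    created: str  # "YYYY-MM-DD HH:MM:SS", which sorts chronologically as text


def select_jobs(jobs: List[Job], xgroup: Optional[str] = None,
                max_jobs: int = 0) -> List[Job]:
    """Model of the --xgroup / --max_jobs selection rules (hypothetical)."""
    newest_first = sorted(jobs, key=lambda j: j.created, reverse=True)
    if xgroup is None:
        # No group: max_jobs caps the total; 0 returns everything.
        return newest_first if max_jobs == 0 else newest_first[:max_jobs]
    # With a group: max_jobs caps the jobs kept for each experiment.
    per_experiment = defaultdict(list)
    for job in newest_first:
        if job.xgroup == xgroup:
            per_experiment[job.experiment_id].append(job)
    selected = []
    for experiment_id in sorted(per_experiment):
        group = per_experiment[experiment_id]
        selected.extend(group if max_jobs == 0 else group[:max_jobs])
    return selected


# Job ids, experiment ids, and timestamps taken from the xg2 listing above.
jobs = [
    Job(34, "xg2", 1, "2020-05-08 18:26:56"),
    Job(37, "xg2", 1, "2020-05-08 19:01:08"),
    Job(30, "xg2", 2, "2020-05-08 09:59:04"),
    Job(35, "xg2", 2, "2020-05-08 18:26:57"),
    Job(36, "xg2", 5, "2020-05-08 18:26:58"),
    Job(38, "xg2", 5, "2020-05-08 19:01:09"),
]

total_cap = [j.job_id for j in select_jobs(jobs, max_jobs=2)]
per_experiment_cap = [j.job_id for j in select_jobs(jobs, xgroup="xg2", max_jobs=1)]
print(total_cap)           # [38, 37]
print(per_experiment_cap)  # [37, 35, 38]
```

Without a group, a limit of 2 keeps only the two newest jobs overall; with the group, a limit of 1 keeps the newest job of each of the three experiments.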