caliban resubmit
Often one needs to re-run an experiment after making code changes, or to run the
same code with a different random seed. Caliban supports this with its resubmit
command.
This command allows you to resubmit jobs in an experiment group without having to remember or re-enter all of the parameters for your experiments. For example, suppose you run a set of experiments in an experiment group on CAIP:
caliban cloud --xgroup resubmit_test --nogpu --experiment_config experiment.json cpu.py -- --foo 3
You then realize that you made a coding error, causing some of your jobs to fail:
$ caliban status --xgroup resubmit_test
xgroup resubmit_test:
docker config 1: job_mode: CPU, build url: ~/sw/cluster/caliban/tmp/cpu, extra dirs: None
experiment id 37: cpu.py --foo 3 --sleep 2
job 69 SUCCEEDED CAIP 2020-05-29 10:53:41 container: gcr.io/totoro-project/cffd1475aaca:latest name: caliban_totoro_20200529_105340_2
experiment id 38: cpu.py --foo 3 --sleep 1
job 68 FAILED CAIP 2020-05-29 10:53:40 container: gcr.io/totoro-project/cffd1475aaca:latest name: caliban_totoro_20200529_105338_1
After fixing your code, you can use the resubmit command to re-run the jobs that
failed:
$ caliban resubmit --xgroup resubmit_test
the following jobs would be resubmitted:
cpu.py --foo 3 --sleep 1
job 68 FAILED CAIP 2020-05-29 10:53:40 container: gcr.io/totoro-project/cffd1475aaca:latest name: caliban_totoro_20200529_105338_1
do you wish to resubmit these 1 jobs? [yN]: y
rebuilding containers...
...
Submitting request!
...
Checking back in with caliban status shows that the code change worked: all of
the experiments in the group have now succeeded. The container hash has changed
for the previously failed job, reflecting your code change:
$ caliban status --xgroup resubmit_test
xgroup resubmit_test:
docker config 1: job_mode: CPU, build url: ~/sw/cluster/caliban/tmp/cpu, extra dirs: None
experiment id 37: cpu.py --foo 3 --sleep 2
job 69 SUCCEEDED CAIP 2020-05-29 10:53:41 container: gcr.io/totoro-project/cffd1475aaca:latest name: caliban_totoro_20200529_105340_2
experiment id 38: cpu.py --foo 3 --sleep 1
job 70 SUCCEEDED CAIP 2020-05-29 11:03:01 container: gcr.io/totoro-project/81b2087b5026:latest name: caliban_totoro_20200529_110259_1
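To make the container change concrete, here is a minimal Python sketch that parses two of the job lines shown above and compares their image hashes. The regular expression is inferred only from this sample output, so treat the line format as an assumption and not a documented interface. The helper names parse_job_line and image_hash are hypothetical and not part of Caliban.

```python
import re

# Pattern inferred from the sample `caliban status` lines in this section.
# This is an assumption about the output format, not a documented interface.
JOB_LINE = re.compile(
    r"job (?P<id>\d+)\s+(?P<state>[A-Z]+)\s+(?P<platform>\S+)\s+"
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"container: (?P<container>\S+)\s+name: (?P<name>\S+)"
)


def parse_job_line(line):
    """Return a dict of fields for one job line, or None if it does not match."""
    match = JOB_LINE.search(line)
    return match.groupdict() if match else None


def image_hash(container):
    """Extract the image name component, e.g. 'cffd1475aaca', from the image URL."""
    return container.rsplit("/", 1)[-1].split(":", 1)[0]


# The failed job before the fix and its resubmitted counterpart after the fix.
before = parse_job_line(
    "job 68 FAILED CAIP 2020-05-29 10:53:40 "
    "container: gcr.io/totoro-project/cffd1475aaca:latest "
    "name: caliban_totoro_20200529_105338_1"
)
after = parse_job_line(
    "job 70 SUCCEEDED CAIP 2020-05-29 11:03:01 "
    "container: gcr.io/totoro-project/81b2087b5026:latest "
    "name: caliban_totoro_20200529_110259_1"
)

# A different hash indicates the container was rebuilt from the modified code.
rebuilt = image_hash(before["container"]) != image_hash(after["container"])
print(rebuilt)
```

Note that the resubmitted job also receives a new job id (70 in place of 68), while experiment id 38 stays the same.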
The resubmit command supports the following arguments:
$ caliban resubmit --help
usage: caliban resubmit [-h] [--helpfull] [--xgroup XGROUP] [--dry_run] [--all_jobs] [--project_id PROJECT_ID] [--cloud_key CLOUD_KEY]

optional arguments:
  -h, --help            show this help message and exit
  --helpfull            show full help message and exit
  --xgroup XGROUP       experiment group
  --dry_run             Don't actually submit; log everything that's going to happen.
  --all_jobs            resubmit all jobs regardless of current state, otherwise only jobs that are in FAILED or STOPPED state will be resubmitted
  --project_id PROJECT_ID
                        ID of the GCloud AI Platform/GKE project to use for Cloud job submission and image persistence. (Defaults to $PROJECT_ID; errors if both the argument and $PROJECT_ID are empty.)
  --cloud_key CLOUD_KEY
                        Path to GCloud service account key. (Defaults to $GOOGLE_APPLICATION_CREDENTIALS.)
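The selection rule described for --all_jobs can be sketched in a few lines of Python. This is only an illustration of the behavior stated in the help text, not Caliban's actual implementation. The Job class, the RESUBMITTABLE_STATES constant, and the jobs_to_resubmit function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass

# States that qualify for resubmission by default, per the --all_jobs help text.
RESUBMITTABLE_STATES = {"FAILED", "STOPPED"}


@dataclass(frozen=True)
class Job:
    """Hypothetical stand-in for a job record; not a Caliban type."""
    job_id: int
    state: str


def jobs_to_resubmit(jobs, all_jobs=False):
    """Select jobs the way the help text describes.

    With all_jobs=True every job is selected regardless of state;
    otherwise only jobs in FAILED or STOPPED state are selected.
    """
    if all_jobs:
        return list(jobs)
    return [job for job in jobs if job.state in RESUBMITTABLE_STATES]


# The two jobs from the resubmit_test example group.
group = [Job(69, "SUCCEEDED"), Job(68, "FAILED")]

print([job.job_id for job in jobs_to_resubmit(group)])                 # [68]
print([job.job_id for job in jobs_to_resubmit(group, all_jobs=True)])  # [69, 68]
```

In the example group, the default selection matches what the resubmit prompt listed earlier: only job 68, which was in the FAILED state.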