Job Manager and job API
The Job Manager, aka "job-scheduler", is a web API service that you use to create, delete and monitor the state of jobs.
Radix creates one job-scheduler per job defined in radixconfig.yaml. A job-scheduler listens on the port defined by schedulerPort, with a host name equal to the name of the job. The job-scheduler API can only be accessed by components running in the same environment, and it is not exposed to the Internet. No authentication is required.
The Job Manager exposes the following methods for managing jobs:
GET /api/v1/jobs: get states (with names and statuses) for all jobs
GET /api/v1/jobs/{jobName}: get the state of a named job
DELETE /api/v1/jobs/{jobName}: delete a named job
POST /api/v1/jobs/{jobName}/stop: stop a named job
... and the following methods for managing batches:
GET /api/v1/batches: get states (with names and statuses) for all batches
GET /api/v1/batches/{batchName}: get the state of a named batch and the statuses of its jobs
DELETE /api/v1/batches/{batchName}: delete a named batch
POST /api/v1/batches/{batchName}/stop: stop a named batch
POST /api/v1/batches/{batchName}/jobs/{jobName}/stop: stop a named job within a batch
Create a single job
POST /api/v1/jobs
Create a new job using the Docker image that Radix built for the job. Job-specific arguments can be sent in the request body:
{
"payload": "Sk9CX1BBUkFNMTogeHl6Cg==",
"jobId": "my-job-1",
"imageTagName": "1.0.0",
"timeLimitSeconds": 120,
"backoffLimit": 10,
"failurePolicy": {
"rules": [
{
"action": "FailJob",
"onExitCodes": {
"operator": "In",
"values": [42]
}
}
]
},
"resources": {
"limits": {
"memory": "32Mi",
"cpu": "300m"
},
"requests": {
"memory": "16Mi",
"cpu": "150m"
}
},
"node": {
"gpu": "gpu1, gpu2, gpu3",
"gpuCount": "6"
}
}
payload, jobId, imageTagName, timeLimitSeconds, backoffLimit, failurePolicy, resources and node are all optional fields, and any of them can be omitted from the request.
The imageTagName field makes it possible to override the image tag for this specific job. To use it, the {imageTagName} placeholder needs to be set as described in radixconfig.yaml.
Create a batch of jobs
POST /api/v1/batches
Create a new batch of single jobs, using the Docker image that Radix built for the job component. Job-specific arguments can be sent in the request body, specified individually for each item in jobScheduleDescriptions, with default values defined in defaultRadixJobComponentConfig:
{
"batchId": "random-batch-id-123",
"defaultRadixJobComponentConfig": {
"imageTagName": "1.0.0",
"timeLimitSeconds": 200,
"backoffLimit": 5,
"resources": {
"limits": {
"memory": "200Mi",
"cpu": "200m"
},
"requests": {
"memory": "100Mi",
"cpu": "100m"
}
},
"node": {
"gpu": "gpu1",
"gpuCount": "2"
}
},
"jobScheduleDescriptions": [
{
"payload": "{'data':'value1'}",
"jobId": "my-job-1",
"imageTagName": "1.0.0",
"timeLimitSeconds": 120,
"backoffLimit": 10,
"resources": {
"limits": {
"memory": "32Mi",
"cpu": "300m"
},
"requests": {
"memory": "16Mi",
"cpu": "150m"
}
},
"node": {
"gpu": "gpu1, gpu2, gpu3",
"gpuCount": "6"
}
},
{
"payload": "{'data':'value2'}",
"jobId": "my-job-2",
...
},
{
"payload": "{'data':'value3'}",
...
}
]
}
Starting a new job
The example configuration at the top has a component named backend and two jobs, compute and etl. Radix creates two job-schedulers, one for each of the two jobs. The job-scheduler for compute listens on http://compute:8000, and the job-scheduler for etl listens on http://etl:9000.
To start a new single job, send a POST request to http://compute:8000/api/v1/jobs with request body set to:
{
"payload": "{\"x\": 10, \"y\": 20}"
}
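For illustration, here is a minimal sketch of how the backend component might send this request, assuming backend runs Python with the requests library available (the URL constant and helper name are only for this example):

# Minimal sketch, assuming the backend runs Python and has the "requests"
# library installed. The job-scheduler URL follows the example above:
# job name "compute", schedulerPort 8000.
import json
import requests

SCHEDULER_URL = "http://compute:8000/api/v1"

def start_compute_job(x: int, y: int) -> dict:
    """Start a single compute job; the payload is an arbitrary string."""
    body = {"payload": json.dumps({"x": x, "y": y})}
    response = requests.post(f"{SCHEDULER_URL}/jobs", json=body, timeout=10)
    response.raise_for_status()
    return response.json()  # job state object with name, started, ended, status

job_state = start_compute_job(10, 20)
print(job_state["name"], job_state["status"])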
The job-scheduler creates a new job and mounts the payload from the request body to a file named payload in the directory /compute/args.
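Inside the job container, the code can read the payload back from that file. A minimal sketch, assuming the compute job is written in Python and the payload path configured in radixconfig.yaml is /compute/args, as in this example:

# Minimal sketch of how the compute job could read its payload.
# Assumes the payload path configured in radixconfig.yaml is /compute/args,
# so the job-scheduler mounts the payload at /compute/args/payload.
import json

with open("/compute/args/payload", "r", encoding="utf-8") as f:
    params = json.load(f)

print(params["x"] + params["y"])  # 30 for the example payload above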
Once the job has been created successfully, the job-scheduler responds to backend with a job state object:
{
"name": "batch-compute-20230220101417-idwsxncs-rkwaibwe",
"started": "",
"ended": "",
"status": "Running"
}
name is the unique name of the job. This is the value to be used in the GET /api/v1/jobs/{jobName} and DELETE /api/v1/jobs/{jobName} methods. It is also the host name used to connect to the running job's container on its exposed port, e.g. http://batch-compute-20230220100755-xkoxce5g-mll3kxxh:3000
started is the date and time the job was started, in RFC3339 format, UTC.
ended is the date and time the job successfully ended, also in RFC3339 format, UTC. This value is only set for Succeeded jobs.
status is the current status of the job. Possible values are Waiting, Stopping, Stopped, Active, Running, Succeeded and Failed. Active means that a replica has been created for the job but is not yet ready (for example because a volume mount is not ready, or the replica cannot be scheduled on a node due to insufficient memory); the job can remain in this status indefinitely. The status is Failed if the job's replica container exits with a non-zero exit code, and Succeeded if the exit code is zero.
Getting the status of all existing jobs
Get a list of all single jobs with their states by sending a GET request to http://compute:8000/api/v1/jobs. The response is an array of job state objects, similar to the response received when creating a new job. Jobs that have been started within a batch are not included in this list:
[
{
"name": "batch-compute-20230220100755-xkoxce5g-mll3kxxh",
"started": "2021-04-07T09:08:37Z",
"ended": "2021-04-07T09:08:45Z",
"status": "Succeeded"
},
{
"name": "batch-compute-20230220101417-idwsxncs-rkwaibwe",
"started": "2021-04-07T10:55:56Z",
"ended": "",
"status": "Failed"
}
]
To get the state of a specific job (single or one within a batch), e.g. batch-compute-20230220100755-xkoxce5g-mll3kxxh, send a GET request to http://compute:8000/api/v1/jobs/batch-compute-20230220100755-xkoxce5g-mll3kxxh. The response is a single job state object:
{
"name": "batch-compute-20230220100755-xkoxce5g-mll3kxxh",
"started": "2021-04-07T09:08:37Z",
"ended": "2021-04-07T09:08:45Z",
"status": "Succeeded"
}
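Because a job runs asynchronously, a caller will typically poll this endpoint until the job reaches a terminal status. A minimal polling sketch, assuming the Python requests library and the statuses described above:

# Minimal polling sketch, assuming the "requests" library and the statuses
# documented above (Succeeded, Failed and Stopped are terminal).
import time
import requests

SCHEDULER_URL = "http://compute:8000/api/v1"
TERMINAL_STATUSES = {"Succeeded", "Failed", "Stopped"}

def wait_for_job(job_name: str, poll_seconds: int = 10) -> dict:
    while True:
        state = requests.get(f"{SCHEDULER_URL}/jobs/{job_name}", timeout=10).json()
        if state["status"] in TERMINAL_STATUSES:
            return state
        time.sleep(poll_seconds)

final_state = wait_for_job("batch-compute-20230220100755-xkoxce5g-mll3kxxh")
print(final_state["status"], final_state["ended"])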
Deleting an existing job
The job list in the example above has a job named batch-compute-20230220101417-idwsxncs-rkwaibwe. To delete it, send a DELETE request to http://compute:8000/api/v1/jobs/batch-compute-20230220101417-idwsxncs-rkwaibwe. A successful deletion responds with a result object. Only single jobs can be deleted with this method:
{
"status": "Success",
"message": "job batch-compute-20230220101417-idwsxncs-rkwaibwe successfully deleted",
"code": 200
}
Stop a job
The job list in the example above has a job named batch-compute-20230220100755-xkoxce5g-mll3kxxh. To stop it, send a POST request to http://compute:8000/api/v1/jobs/batch-compute-20230220100755-xkoxce5g-mll3kxxh/stop. A successful stop responds with a result object. Only single jobs can be stopped with this method. Stopping a job automatically deletes the corresponding Kubernetes job and its replica, as well as its log. The job will get the status "Stopped".
{
"status": "Success",
"message": "job batch-compute-20230220100755-xkoxce5g-mll3kxxh successfully stopped",
"code": 200
}
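Deleting and stopping jobs follow the same request pattern. A minimal sketch, again assuming the Python requests library (the helper names are only for this example):

# Minimal sketch of the delete and stop calls, assuming the "requests" library.
import requests

SCHEDULER_URL = "http://compute:8000/api/v1"

def delete_job(job_name: str) -> dict:
    """DELETE /api/v1/jobs/{jobName}: delete a single job."""
    response = requests.delete(f"{SCHEDULER_URL}/jobs/{job_name}", timeout=10)
    response.raise_for_status()
    return response.json()  # {"status": "Success", "message": ..., "code": 200}

def stop_job(job_name: str) -> dict:
    """POST /api/v1/jobs/{jobName}/stop: stop a single job."""
    response = requests.post(f"{SCHEDULER_URL}/jobs/{job_name}/stop", timeout=10)
    response.raise_for_status()
    return response.json()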
Starting a new batch of jobs
To start a new batch of jobs, send a POST request to http://compute:8000/api/v1/batches with request body set to:
{
"jobScheduleDescriptions": [
{
"payload": "{\"x\": 10, \"y\": 20}"
},
{
"payload": "{\"x\": 20, \"y\": 30}"
}
]
}
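A batch request can also be built programmatically from a list of work items. A minimal sketch, assuming the Python requests library:

# Minimal sketch that builds jobScheduleDescriptions from a list of work items
# and posts them as one batch. Assumes the "requests" library.
import json
import requests

SCHEDULER_URL = "http://compute:8000/api/v1"

work_items = [{"x": 10, "y": 20}, {"x": 20, "y": 30}]

batch_request = {
    "jobScheduleDescriptions": [
        {"payload": json.dumps(item)} for item in work_items
    ]
}

batch_state = requests.post(
    f"{SCHEDULER_URL}/batches", json=batch_request, timeout=10
).json()
print(batch_state["batchName"], batch_state["status"])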
Batch ID
A batch can have a batchId: an optional string that can hold any value; Radix does not process it. It can be set in the batchScheduleDescription (the request body JSON) for a batch.
If batchId is specified, it is returned in the batch status and shown in the batch list in the Radix console.
Job ID
Jobs can have a jobId: an optional string that can hold any value; Radix does not process it. It can be set in the jobScheduleDescription for a single job or for jobs within a batch.
If jobId is specified, it is returned in the job's status and shown in the job list in the Radix console.
Job ID in a single job
{
"jobId": "my-job",
"payload": "{\"x\": 10, \"y\": 20}"
}
Job ID in batch jobs
{
"jobScheduleDescriptions": [
{
"jobId": "my-job-1",
"payload": "{\"x\": 10, \"y\": 20}"
},
{
"jobId": "my-job-2",
"payload": "{\"x\": 20, \"y\": 30}"
}
]
}
Default parameters for jobs can be defined in defaultRadixJobComponentConfig. These parameters can be overridden for each job individually in jobScheduleDescriptions:
{
"defaultRadixJobComponentConfig": {
"imageTagName": "1.0.0",
"timeLimitSeconds": 200,
"backoffLimit": 5,
"resources": {
"limits": {
"memory": "200Mi",
"cpu": "200m"
},
"requests": {
"memory": "100Mi",
"cpu": "100m"
}
}
},
"jobScheduleDescriptions": [
{
"payload": "{'data':'value1'}",
"timeLimitSeconds": 120,
"backoffLimit": 2,
"resources": {
"limits": {
"memory": "32Mi",
"cpu": "300m"
},
"requests": {
"memory": "16Mi",
"cpu": "150m"
}
},
"node": {
"gpu": "gpu1, gpu2, gpu3",
"gpuCount": "6"
}
},
{
"payload": "{'data':'value2'}",
"imageTagName": "2.0.0"
},
{
"payload": "{'data':'value3'}",
"timeLimitSeconds": 300,
"backoffLimit": 10,
"node": {
"gpu": "gpu3",
"gpuCount": "1"
}
}
]
}
The job-scheduler creates a new batch, which in turn creates a single job for each item in jobScheduleDescriptions.
Once the batch has been created, the job-scheduler responds to backend with a batch state object:
{
"batchName": "batch-compute-20220302170647-6ytkltvk",
"name": "batch-compute-20220302170647-6ytkltvk-tlugvgs",
"created": "2022-03-02T17:06:47+01:00",
"status": "Running"
}
batchName is the unique name of the batch. This is the value to be used in the GET /api/v1/batches/{batchName} and DELETE /api/v1/batches/{batchName} methods.
started is the date and time the batch was started, in RFC3339 format, UTC.
ended is the date and time the batch successfully ended (empty when not completed), in RFC3339 format, UTC. This value is only set for Succeeded batches. A batch is ended when all of its jobs are completed or failed.
status is the current status of the batch. Possible values are Running, Succeeded and Failed. The status is Failed if the batch fails for any reason.
Get a list of all batches
Get a list of all batches with their states by sending a GET request to http://compute:8000/api/v1/batches. The response is an array of batch state objects, similar to the response received when creating a new batch:
[
{
"name": "batch-compute-20220302155333-hrwl53mw",
"created": "2022-03-02T15:53:33+01:00",
"started": "2022-03-02T15:53:33+01:00",
"ended": "2022-03-02T15:54:00+01:00",
"status": "Succeeded"
},
{
"name": "batch-compute-20220302170647-6ytkltvk",
"created": "2022-03-02T17:06:47+01:00",
"started": "2022-03-02T17:06:47+01:00",
"status": "Running"
}
]
Get the state of a batch
To get the state of a specific batch, e.g. batch-compute-20220302155333-hrwl53mw, send a GET request to http://compute:8000/api/v1/batches/batch-compute-20220302155333-hrwl53mw. The response is a batch state object, including the states of its jobs and the statuses of their replicas (pods):
{
"name": "batch-compute-20220302155333-hrwl53mw",
"created": "2022-03-02T15:53:33+01:00",
"started": "2022-03-02T15:53:33+01:00",
"ended": "2022-03-02T15:54:00+01:00",
"status": "Succeeded",
"updated": "2022-03-02T15:54:00+01:00",
"jobStatuses": [
{
"jobId": "job1",
"batchName": "batch-compute-20220302155333-hrwl53mw",
"name": "batch-compute-20220302155333-hrwl53mw-fjhcqwj7",
"created": "2022-03-02T15:53:36+01:00",
"started": "2022-03-02T15:53:36+01:00",
"ended": "2022-03-02T15:53:56+01:00",
"status": "Succeeded",
"updated": "2022-03-02T15:53:56+01:00",
"podStatuses": [
{
"name": "batch-compute-20220302155333-hrwl53mw-fjhcqwj7-5sfnl",
"created": "2022-03-02T15:53:36Z",
"startTime": "2022-03-02T15:53:36Z",
"endTime": "2022-03-02T15:53:56Z",
"containerStarted": "2022-03-02T15:53:36Z",
"replicaStatus": {
"status": "Succeeded"
},
"image": "radixprod.azurecr.io/radix-app-dev-compute:6k8vv",
"imageId": "radixprod.azurecr.io/radix-app-dev-compute@sha256:1f9ce890db8eb89ae0369995f76676a58af2a82129fc0babe080a5daca86a44e",
"exitCode": 0,
"reason": "Completed"
}
]
},
{
"jobId": "job2",
"batchName": "batch-compute-20220302155333-hrwl53mw",
"name": "batch-compute-20220302155333-hrwl53mw-qjzykhrd",
"created": "2022-03-02T15:53:39+01:00",
"started": "2022-03-02T15:53:39+01:00",
"ended": "2022-03-02T15:53:56+01:00",
"status": "Succeeded",
"updated": "2022-03-02T15:53:56+01:00",
"podStatuses": [
{
"name": "batch-compute-20220302155333-hrwl53mw-qjzykhrd-5sfnl",
"created": "2022-03-02T15:53:39Z",
"startTime": "2022-03-02T15:53:40Z",
"endTime": "2022-03-02T15:53:56Z",
"containerStarted": "2022-03-02T15:53:40Z",
"replicaStatus": {
"status": "Succeeded"
},
"image": "radixprod.azurecr.io/radix-app-dev-compute:6k8vv",
"imageId": "radixprod.azurecr.io/radix-app-dev-compute@sha256:1f9ce890db8eb89ae0369995f76676a58af2a82129fc0babe080a5daca86a44e",
"exitCode": 0,
"reason": "Completed"
}
]
}
]
}
If a job's replica fails and the job component has backoffLimit greater than 0, podStatuses contains exitCode and reason for the failed pods. podIndex gives the order of the pod statuses (starting from 0):
{
"name": "batch-compute-20220302155333-hrwl53mw",
"created": "2022-03-02T15:53:33+01:00",
"started": "2022-03-02T15:53:33+01:00",
"ended": "2022-03-02T15:53:48+01:00",
"status": "Failed",
"updated": "2022-03-02T15:53:48+01:00",
"jobStatuses": [
{
"jobId": "job1",
"batchName": "batch-compute-20220302155333-hrwl53mw",
"name": "batch-compute-20220302155333-hrwl53mw-fjhcqwj7",
"created": "2022-03-02T15:53:36+01:00",
"started": "2022-03-02T15:53:36+01:00",
"ended": "2022-03-02T15:53:56+01:00",
"status": "Failed",
"message": "Job has reached the specified backoff limit",
"updated": "2022-03-02T15:53:56+01:00",
"podStatuses": [
{
"name": "batch-compute-20220302155333-hrwl53mw-fjhcqwj7-wbn9q",
"created": "2022-03-02T15:53:36Z",
"startTime": "2022-03-02T15:53:36Z",
"endTime": "2022-03-02T15:53:40Z",
"containerStarted": "2022-03-02T15:53:36Z",
"replicaStatus": {
"status": "Failed"
},
"image": "radixprod.azurecr.io/radix-app-dev-compute:6k8vv",
"imageId": "radixprod.azurecr.io/radix-app-dev-compute@sha256:1f9ce890db8eb89ae0369995f76676a58af2a82129fc0babe080a5daca86a44e",
"exitCode": 1,
"reason": "Error"
},
{
"name": "batch-compute-20220302155333-hrwl53mw-fjhcqwj7-859xq",
"created": "2022-03-02T15:53:40Z",
"startTime": "2022-03-02T15:53:42Z",
"endTime": "2022-03-02T15:53:48Z",
"containerStarted": "2022-03-02T15:53:42Z",
"replicaStatus": {
"status": "Failed"
},
"image": "radixprod.azurecr.io/radix-app-dev-compute:6k8vv",
"imageId": "radixprod.azurecr.io/radix-app-dev-compute@sha256:1f9ce890db8eb89ae0369995f76676a58af2a82129fc0babe080a5daca86a44e",
"podIndex": 1,
"exitCode": 1,
"reason": "Error"
}
]
}
]
}
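When monitoring a batch, the jobStatuses and podStatuses arrays can be inspected to see which jobs failed and why. A minimal sketch, assuming the Python requests library and the batch state shape shown above:

# Minimal sketch that summarizes a batch, assuming the "requests" library and
# the batch state shape shown above (jobStatuses with nested podStatuses).
import requests

SCHEDULER_URL = "http://compute:8000/api/v1"

def summarize_batch(batch_name: str) -> None:
    batch = requests.get(f"{SCHEDULER_URL}/batches/{batch_name}", timeout=10).json()
    print(f"batch {batch['name']}: {batch['status']}")
    for job in batch.get("jobStatuses", []):
        print(f"  job {job['name']}: {job['status']}")
        if job["status"] == "Failed":
            for pod in job.get("podStatuses", []):
                # exitCode and reason are set for failed pods
                print(f"    pod {pod['name']}: exitCode={pod.get('exitCode')} "
                      f"reason={pod.get('reason')}")

summarize_batch("batch-compute-20220302155333-hrwl53mw")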
Delete a batch
The batch list in the example above has a batch named batch-compute-20220302155333-hrwl53mw. To delete it, send a DELETE request to http://compute:8000/api/v1/batches/batch-compute-20220302155333-hrwl53mw. A successful deletion responds with a result object. Deleting a batch automatically deletes all jobs belonging to that batch:
{
"status": "Success",
"message": "batch batch-compute-20220302155333-hrwl53mw successfully deleted",
"code": 200
}
Stop an existing batch
The batch list in the example above has a batch named batch-compute-20220302155333-hrwl53mw. To stop it, send a POST request to http://compute:8000/api/v1/batches/batch-compute-20220302155333-hrwl53mw/stop. A successful stop responds with a result object. Stopping a batch automatically deletes all Kubernetes jobs and replicas belonging to that batch, as well as their logs. All jobs that have not completed will get the status "Stopped".
{
"status": "Success",
"message": "batch batch-compute-20220302155333-hrwl53mw successfully stopped",
"code": 200
}
Stop a job in a batch
The batch list in the example above has a batch named batch-compute-20220302155333-hrwl53mw with jobs, one of which is named batch-compute-20220302155333-hrwl53mw-fjhcqwj7. To stop this job, send a POST request to http://compute:8000/api/v1/batches/batch-compute-20220302155333-hrwl53mw/jobs/batch-compute-20220302155333-hrwl53mw-fjhcqwj7/stop. A successful stop responds with a result object. Stopping a batch job automatically deletes the corresponding Kubernetes job and its replica, as well as its log. The job will get the status "Stopped".
{
"status": "Success",
"message": "job batch-compute-20220302155333-hrwl53mw-fjhcqwj7 in the batch batch-compute-20220302155333-hrwl53mw successfully stopped",
"code": 200
}