Overview
Bakes represent model training jobs that use rollout data as training datasets. They are composed from a base template plus per-bake overrides and support advanced features like LoRA adapters and DeepSpeed.
Endpoints
List Bakes
List all bakes in a repository.
Endpoint: GET /v1/repo/{repo_name}/bakes
Request:
200 OK
Get Bake
Get bake definition and metadata.
Endpoint: GET /v1/repo/{repo_name}/bakes/{bake_name}
Request:
Bake name
Repository name
200 OK
- status (string): Bake status: 'not_started', 'running', 'complete', or 'failed'
- config (object): Complete bake configuration (datasets, model, optimizer, etc.)
- job_id (integer | null): Coordinator job ID (present when status is 'running' or 'complete')
- progress_percent (number | null): Training progress percentage (0-100, present when running)
- model_name (Array<string> | null): List of model checkpoint paths in the format 'user/repo/bake_name/checkpoint'. Index 0 is the latest checkpoint. Only present when status is 'complete'
- loss (object | null): Training loss metrics (present when status is 'running' or 'complete' and metrics are available)
  - latest_loss: Latest training loss value
  - final_loss: Final training loss value
  - min_loss: Minimum loss encountered during training
  - max_loss: Maximum loss encountered during training
- error (string | null): Error message if the bake failed (present when status is 'failed')
- lines (integer | null): Not applicable for bakes (always null)
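The response fields above can be read defensively, since several of them are null until the bake reaches a particular status. The following is a minimal sketch, assuming the response has already been decoded into a Python dict; the helper names `latest_checkpoint` and `describe_bake` are illustrative and not part of any SDK.

```python
from typing import Any, Dict, Optional


def latest_checkpoint(bake: Dict[str, Any]) -> Optional[str]:
    """Return the newest checkpoint path, or None if the bake is not complete."""
    if bake.get("status") != "complete":
        return None
    checkpoints = bake.get("model_name") or []
    # Index 0 is documented as the latest checkpoint.
    return checkpoints[0] if checkpoints else None


def describe_bake(bake: Dict[str, Any]) -> str:
    """Summarize a decoded Get Bake response in one line."""
    status = bake.get("status")
    if status == "failed":
        return f"failed: {bake.get('error') or 'unknown error'}"
    if status == "running":
        progress = bake.get("progress_percent")
        # progress_percent may be null even while the bake is running.
        return f"running ({progress:.0f}%)" if progress is not None else "running"
    if status == "complete":
        return f"complete: {latest_checkpoint(bake)}"
    return str(status)
```

Treating every nullable field as optional keeps the caller correct regardless of which status the bake is in when the response arrives.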
Create or Update Bake
Create or update a bake configuration.
Endpoint: POST /v1/repo/{repo_name}/bakes
Request:
Repository name
Bake name
Template: ‘default’ or existing bake name
Bake configuration overrides. See Bake Configuration.
Model Configuration: The model.parent_model_name field defaults to the repository's base model if not specified. You can override it with a base model (e.g., "Qwen/Qwen3-32B") or a previously baked model (e.g., "user/repo/bake_name/checkpoint").
Start Bake
Start a bake (training) job.
Endpoint: POST /v1/repo/{repo_name}/bakes/{bake_name}
Request:
200 OK
Bake name
Repository name
- Complete bake configuration (datasets + training settings)
- All referenced targets have completed rollouts
Ensure all target rollouts are complete before starting bake
- Idempotent: repeated calls return current state
- Asynchronous: returns immediately
- Poll get() to monitor status
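For callers that start a bake over plain HTTP and then need to wait for it, the asynchronous behavior above can be handled with a small polling loop. This is a sketch under stated assumptions: `fetch_bake` is a hypothetical callable that performs the Get Bake request and returns the decoded JSON body, and the timeout default is arbitrary. The 30-second interval follows the 30-60 second guidance given elsewhere in this page.

```python
import time
from typing import Any, Callable, Dict

TERMINAL_STATUSES = {"complete", "failed"}


def wait_for_bake(
    fetch_bake: Callable[[], Dict[str, Any]],  # hypothetical: wraps GET .../bakes/{bake_name}
    interval_seconds: float = 30.0,
    timeout_seconds: float = 6 * 60 * 60,  # arbitrary default, not from the API
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the bake is 'complete' or 'failed', then return the last response."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        bake = fetch_bake()
        if bake.get("status") in TERMINAL_STATUSES:
            return bake
        if time.monotonic() >= deadline:
            raise TimeoutError("bake did not finish before the timeout")
        sleep(interval_seconds)
```

Passing `sleep` as a parameter lets the loop be exercised in tests without real delays.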
Python SDK: The repo_name parameter must be passed as a keyword argument (not positional). This is intentional for API clarity and consistency.
Polling: By default, poll=True automatically waits for the job to complete. Manual polling loops are not needed unless you set poll=False.
BakeResponse
- status: Bake status ('not_started', 'running', 'complete', 'failed')
- job_id: Coordinator job ID (if queued/running)
- progress_percent: Training progress (0-100)
- model_name: List of checkpoint paths (when complete)
- loss: Training loss metrics (when available)
- error: Error message (if failed)
Batch Create or Update Bakes
Create or update multiple bakes.
Endpoint: POST /v1/repo/{repo_name}/bakes/batch
Request:
Create or Update Bake (Deprecated)
Endpoint: PUT /v1/repo/{repo_name}/bakes/{bake_name}
Request:
Batch Create or Update Bakes (Deprecated)
Endpoint: PUT /v1/repo/{repo_name}/bakes/batch
Request:
Get Bake Metrics
Get training metrics for a bake. Returns all metrics from train_log_metrics.jsonl as a JSON array.
Endpoint: GET /v1/repo/{repo_name}/bakes/{bake_name}/metrics
Request:
200 OK
Repository name
Bake name
Use Case: Useful for plotting loss curves and other training metrics on the frontend. Each entry contains metrics like iter, loss, train_loss, lr, epoch, etc.
Get Bake Download URL
Get a presigned URL for downloading model weights by repository and bake name.
Endpoint: GET /v1/repo/{repo_name}/bakes/{bake_name}/download
Request (latest checkpoint):
200 OK
Repository name
Bake name
Checkpoint number (defaults to latest checkpoint)
URL expiry in seconds (1-604800, default: 3600). Maximum: 7 days (604800 seconds)
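Because the expiry must fall within 1 to 604800 seconds, it can be worth validating the value before issuing the request. The sketch below builds the download request URL and rejects out-of-range expiries. The base URL and the query parameter name `expires_in` are assumptions made for illustration; this page does not state the actual parameter name, so check it against the API before relying on it.

```python
from urllib.parse import quote, urlencode

MAX_EXPIRY_SECONDS = 604800  # 7 days, the documented maximum
DEFAULT_EXPIRY_SECONDS = 3600  # 1 hour, the documented default


def build_download_url(
    base_url: str,
    repo_name: str,
    bake_name: str,
    expiry_seconds: int = DEFAULT_EXPIRY_SECONDS,
    expiry_param: str = "expires_in",  # hypothetical query parameter name
) -> str:
    """Build the Get Bake Download URL request, validating the expiry range."""
    if not 1 <= expiry_seconds <= MAX_EXPIRY_SECONDS:
        raise ValueError(
            f"expiry_seconds must be between 1 and {MAX_EXPIRY_SECONDS}, got {expiry_seconds}"
        )
    # Percent-encode path segments so unusual names cannot alter the path.
    path = f"/v1/repo/{quote(repo_name, safe='')}/bakes/{quote(bake_name, safe='')}/download"
    query = urlencode({expiry_param: expiry_seconds})
    return f"{base_url.rstrip('/')}{path}?{query}"
```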
Delete Bake
Delete a bake from the repository.
Endpoint: DELETE /v1/repo/{repo_name}/bakes/{bake_name}
Request:
Complete Training Workflow
1. Verify Rollouts Complete
2. Configure Bake
3. Start Training
4. Monitor Progress
5. Check Training Metrics (Optional)
6. Download Model Weights (When Complete)
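The workflow steps map onto the endpoints documented earlier on this page. As a rough sketch, the helper below lists the method and path for steps 2 through 6; step 1 is omitted because the rollout endpoints are not described here. The function name `workflow_requests` is illustrative only.

```python
from typing import List, Tuple


def workflow_requests(repo_name: str, bake_name: str) -> List[Tuple[str, str, str]]:
    """Return (step, HTTP method, path) for workflow steps 2-6."""
    base = f"/v1/repo/{repo_name}/bakes"
    return [
        ("Configure Bake", "POST", base),
        ("Start Training", "POST", f"{base}/{bake_name}"),
        ("Monitor Progress", "GET", f"{base}/{bake_name}"),
        ("Check Training Metrics", "GET", f"{base}/{bake_name}/metrics"),
        ("Download Model Weights", "GET", f"{base}/{bake_name}/download"),
    ]


for step, method, path in workflow_requests("my_repo", "coding_v1"):
    print(f"{step}: {method} {path}")
```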
Configuration Examples
LoRA Training
Key Configuration Fields
See Bake Configuration Reference for complete details.
Dataset Configuration
List of targets to use as training data
- target (string, required): Target name
- weight (float): Dataset sampling weight
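A dataset list can be checked client-side before it is submitted. The sketch below enforces only what is stated above: each entry needs a string target, and weight, when given, must be numeric. The helper name `dataset_problems` and the target names in the usage line are hypothetical, and the server may apply further rules not covered here.

```python
from typing import Any, Dict, List


def dataset_problems(datasets: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems found in a list of dataset entries."""
    problems = []
    for i, entry in enumerate(datasets):
        target = entry.get("target")
        if not isinstance(target, str) or not target:
            problems.append(f"datasets[{i}]: 'target' is required and must be a string")
        weight = entry.get("weight")
        # bool is a subclass of int in Python, so reject it explicitly.
        if weight is not None and (isinstance(weight, bool) or not isinstance(weight, (int, float))):
            problems.append(f"datasets[{i}]: 'weight' must be a number")
    return problems


# Hypothetical target names, for illustration only.
print(dataset_problems([{"target": "coding_target", "weight": 0.7}, {"target": "math_target"}]))
```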
Training Parameters
Number of training epochs
Micro batch size per device
Gradient accumulation steps for effective batch size. Effective batch size = micro_batch_size * gradient_accumulation_steps * num_gpus
Total number of trajectories to use for training. If not specified, uses all available trajectories from the datasets.
Random seed for reproducibility
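The effective batch size formula given above is simple arithmetic, shown here as a worked example. The specific values (micro batch size 2, 8 accumulation steps, 4 GPUs) are chosen only for illustration.

```python
def effective_batch_size(
    micro_batch_size: int,
    gradient_accumulation_steps: int,
    num_gpus: int,
) -> int:
    """Effective batch size = micro_batch_size * gradient_accumulation_steps * num_gpus."""
    return micro_batch_size * gradient_accumulation_steps * num_gpus


# Example values for illustration: 2 * 8 * 4
print(effective_batch_size(2, 8, 4))  # 64
```

Halving the micro batch size to fit memory while doubling the accumulation steps leaves the effective batch size unchanged.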
Best Practices
Verify Prerequisites
Ensure all target rollouts are complete before starting training
Start Small
Test with fewer epochs and smaller datasets first, then scale up
Version Your Bakes
Use descriptive names with versions: coding_v1, coding_v2_lora
Monitor Progress
Check status periodically but not too frequently (every 30-60 seconds). Use the loss field to track training quality. A good final loss is typically around 4e-7.
Check Checkpoints
When a bake completes, model_name contains checkpoint paths. Index 0 is the latest checkpoint. Use this path for inference or downloading weights.
Download Weights
Use the download endpoint to get presigned URLs for model weights. URLs expire after 1 hour by default, so download promptly.
Track Metrics
Use the metrics endpoint to get detailed training logs for plotting loss curves and analyzing training behavior.
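Turning the metrics array into a plottable series takes only a few lines. The sketch below assumes the metrics response has already been decoded into a list of dicts and that entries carry the iter and loss keys mentioned for the metrics endpoint. Entries missing either key are skipped, since not every log line necessarily contains both.

```python
from typing import Any, Dict, List, Tuple


def loss_curve(
    metrics: List[Dict[str, Any]],
    x_key: str = "iter",
    y_key: str = "loss",
) -> List[Tuple[Any, Any]]:
    """Extract (x, y) points from metrics entries, sorted by x."""
    points = [
        (entry[x_key], entry[y_key])
        for entry in metrics
        if x_key in entry and y_key in entry
    ]
    return sorted(points)


sample = [
    {"iter": 2, "loss": 0.5, "lr": 1e-5},
    {"iter": 1, "loss": 0.9},
    {"iter": 3, "epoch": 1},  # no loss recorded, so this entry is skipped
]
print(loss_curve(sample))  # [(1, 0.9), (2, 0.5)]
```

The same helper works for other series, for example `loss_curve(metrics, y_key="lr")` for the learning-rate schedule.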
Error Handling
Not Found (404)
Starting a bake that doesn't exist: Response: 404 Not Found
Prerequisites Not Met
Starting a bake when rollouts aren't complete: Response: 400 Bad Request
Insufficient Credits
Starting a bake without sufficient credit balance: Response: 402 Payment Required
Bake Already Exists
Creating a bake that already exists with different configuration: Response: 409 Conflict
Bake Failed
When a bake fails during training: Response: 200 OK (status field indicates failure)
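Since a failed bake is reported with 200 OK, client code needs to inspect the body as well as the HTTP status. The sketch below maps the cases listed in this section to short messages. The function name and message wording are illustrative, and a 400 response may also have causes other than incomplete rollouts.

```python
from typing import Any, Dict, Optional

# HTTP statuses documented in the Error Handling section.
HTTP_ERRORS = {
    400: "prerequisites not met (rollouts incomplete)",
    402: "insufficient credit balance",
    404: "bake not found",
    409: "bake already exists with a different configuration",
}


def interpret_bake_response(status_code: int, body: Optional[Dict[str, Any]] = None) -> str:
    """Classify a bake API response using both the HTTP status and the body."""
    if status_code in HTTP_ERRORS:
        return HTTP_ERRORS[status_code]
    if status_code == 200:
        body = body or {}
        # A training failure still returns 200; the status field carries the outcome.
        if body.get("status") == "failed":
            return f"bake failed: {body.get('error') or 'no error message'}"
        return f"ok: {body.get('status')}"
    return f"unexpected HTTP status {status_code}"
```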