From zero to your first inference job in minutes. Create an API key, call the Python SDK, and understand how Olive separates the people who own an account from the developers who build on it.
Step 1
An API key is how your code proves who it is. Every key belongs to an Olive account and inherits that account's models, quotas, and billing.
Sign in to the customer portal
Open the Olive customer portal and sign in. If this is your organization’s first time, the person who signs up becomes the account Admin — more on what that means in Admins & developers.
Create a key in the API Keys tab
Go to API Keys → Create API key, give it a label you’ll recognize later (e.g. publishing-pipeline), and copy the value. Keys begin with olv_ and are shown only once — Olive stores a hash, never the key itself, so if you lose it you simply revoke and mint a new one.
Store it as an environment variable
Treat the key like a password. Keep it out of source control and inject it through the environment or your secrets manager:
# Never hard-code keys. Keep them in the environment.
export OLIVE_API_KEY="olv_your_key_here"
Step 2
The Olive Python SDK wraps the REST API with typed clients, retries, and helpful errors. It needs Python 3.9 or newer.
pip install olive-compute
Step 3
One import, one client, one call. Pass the key from the environment variable you set above; the call blocks until the network returns a result.
import os
from olive import OliveClient
client = OliveClient(api_key=os.environ["OLIVE_API_KEY"])
# Run a single inference call against the default chat model.
# The call blocks until the job completes on the network.
reply = client.inference(
    "Summarize the benefits of print-on-demand for a small publisher.",
    max_tokens=256,
)
print(reply)
That’s a complete job: the SDK submits it to the Olive network, a provider device runs the model, and the generated text comes back as a string. No servers to manage, no cloud capacity to provision.
inference() blocks until that string comes back. For longer work, or if you don’t want your process blocked while it waits, submit the job and poll separately; see Going further below.
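For short calls, one generic way to keep the main thread free is to run the blocking call on a worker thread. The sketch below is plain Python, not an Olive feature: `fake_inference` is a hypothetical stand-in for `client.inference()` so the example runs without the SDK, and `run_in_background` is an illustrative helper name. Whether an `OliveClient` instance is safe to share across threads is not stated here, so treat that as an assumption to verify.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_in_background(executor, blocking_call, *args, **kwargs):
    """Start a blocking call on a worker thread and return a Future right away."""
    return executor.submit(blocking_call, *args, **kwargs)


def fake_inference(prompt, max_tokens=256):
    """Hypothetical stand-in for client.inference(); sleeps instead of calling the network."""
    time.sleep(0.05)
    return f"(generated text for: {prompt})"


with ThreadPoolExecutor(max_workers=4) as pool:
    future = run_in_background(pool, fake_inference, "Summarize chapter one.", max_tokens=128)
    # The main thread is free to do other work here while the call runs.
    reply = future.result(timeout=5)  # blocks only at the point the text is needed

print(reply)
```

With the real SDK you would pass `client.inference` in place of `fake_inference`. For work that may outlive the process, submitting a job and polling remains the better fit.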
Step 4
Pin a model, generate embeddings, run long jobs asynchronously, and handle failures cleanly.
Omit model= to use the default, or pin any catalog model. The compute tier controls the hardware your job lands on.
# Browse the catalog and pin a specific model.
for m in client.list_models(modality="chat"):
    print(m["id"], "·", m["pricing"]["input_per_1m_tokens_usd"], "USD / 1M tokens")

reply = client.inference(
    "Draft a back-cover blurb for a regional cookbook.",
    model="meta/llama-3.2-3b-instruct",
    compute="medium",  # light · medium · heavy
    temperature=0.4,
)
print(reply)

| Tier | Resources | Best for |
|---|---|---|
| light | 1 core · 2 GB | Embeddings, short inputs |
| medium | 2 cores · 4 GB | Standard inference (default) |
| heavy | 4 cores · 8 GB | Long context, large batches |
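If you choose the tier in code, the "Best for" column can be turned into a small rule of thumb. The helper below is only a sketch: `pick_compute_tier` and its thresholds (500 characters, 20,000 characters, a batch of 32) are hypothetical values picked for illustration, not limits published by Olive. Tune them against your own workloads.

```python
def pick_compute_tier(workload, input_chars=0, batch_size=1,
                      short_input_chars=500, long_input_chars=20_000,
                      large_batch=32):
    """Map a workload to "light", "medium", or "heavy".

    All thresholds are illustrative assumptions, not Olive limits.
    """
    # Long context or large batches: the table suggests the heavy tier.
    if input_chars >= long_input_chars or batch_size >= large_batch:
        return "heavy"
    # Embeddings and short inputs: the light tier.
    if workload == "embeddings" or input_chars <= short_input_chars:
        return "light"
    # Everything else is standard inference, which is the default tier.
    return "medium"


print(pick_compute_tier("embeddings", input_chars=40))
print(pick_compute_tier("inference", input_chars=3_000))
print(pick_compute_tier("inference", input_chars=50_000))
```

The returned string can be passed as the `compute=` argument shown in the example above.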
Turn text into vectors for search, clustering, or deduplication across a catalog.
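As a rough illustration of the deduplication use case, the sketch below flags near-duplicate items by cosine similarity. It assumes the vectors arrive as a list of equal-length lists of floats; the toy vectors, the 0.95 threshold, and the helper names are made up for the example, and a real catalog would likely call for a vector index instead of this pairwise loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def near_duplicates(vectors, threshold=0.95):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold."""
    pairs = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_similarity(vectors[i], vectors[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Toy 3-dimensional vectors; real embeddings have many more dimensions.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
]
print(near_duplicates(toy_vectors))  # → [(0, 1)]
```

The call that produces such vectors from text follows.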
# Embeddings for search, clustering, or dedup across a catalog.
vectors = client.embeddings(
    ["The Hudson Valley Baker", "Seasonal Preserves & Pickles"],
    model="baai/bge-small-en-v1.5",
)
print(len(vectors), "vectors ·", len(vectors[0]), "dims")
submit_job() returns a handle immediately so your process keeps moving; call job.wait() when you need the result.
import json
# For long-running work, submit and poll separately so your
# process isn't blocked while the job runs on the network.
# input_data is a JSON-encoded string — same shape inference() builds for you.
job = client.submit_job(
    workload_type="inference",
    input_data=json.dumps({"prompt": "Write a 400-word author bio.", "max_tokens": 600}),
    model="meta/llama-3.2-3b-instruct",
    compute="heavy",
)
print(job.id, job.status)  # e3b2a1c0-... running
result = job.wait(timeout=300)  # blocks until done, or raises JobError on failure/timeout
output = json.loads(result["output_data"])
print(output["text"])  # same "text" field inference() unwraps for you
The SDK raises typed exceptions so you can separate an auth problem from a rate limit from a failed job. It retries transient network and server errors for you.
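A rate limit is one case you may still want to handle in your own loop. The sketch below shows one way to do that, using a stand-in exception class so it runs without the SDK. It assumes the real `RateLimitError` exposes `retry_after` in seconds; `FakeRateLimitError` and `call_with_rate_limit_retry` are hypothetical names, not part of the SDK.

```python
import time


class FakeRateLimitError(Exception):
    """Stand-in for the SDK's RateLimitError; carries a retry_after hint in seconds."""

    def __init__(self, retry_after):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


def call_with_rate_limit_retry(call, rate_limit_error, max_attempts=3,
                               default_delay=1.0, sleep=time.sleep):
    """Run call(); on a rate-limit error, wait for the hinted delay and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except rate_limit_error as exc:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the error
            delay = getattr(exc, "retry_after", None)
            sleep(default_delay if delay is None else delay)


# Demo: a call that is rate limited twice, then succeeds.
attempts = []
waits = []


def flaky_call():
    attempts.append(1)
    if len(attempts) < 3:
        raise FakeRateLimitError(retry_after=0.5)
    return "ok"


# waits.append replaces time.sleep so the demo records delays instead of pausing.
result = call_with_rate_limit_retry(flaky_call, FakeRateLimitError, sleep=waits.append)
print(result, waits)  # → ok [0.5, 0.5]
```

With the SDK installed, you would pass `RateLimitError` and a lambda wrapping `client.inference(...)`. The SDK's own exception types appear in the example that follows.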
import os
from olive import OliveClient, AuthError, JobError, RateLimitError
try:
    client = OliveClient(api_key=os.environ["OLIVE_API_KEY"])
    reply = client.inference("Hello, Olive.")
except AuthError:
    # Bad or revoked key: mint a new one in the portal.
    print("Check OLIVE_API_KEY")
except RateLimitError as e:
    print(f"Slow down, retry after {e.retry_after}s")
except JobError as e:
    # The job ran but failed or timed out on the network.
    print(f"Job failed: {e}")
Step 5
Olive separates two responsibilities on every account. Understanding them now means your setup scales cleanly as your team grows.
Admin
Owns the account. Manages billing, controls which models and compute tiers are enabled, and issues or revokes API keys.
Developer
Builds on the account. Uses an API key to run inference, embeddings, and jobs against the enabled models — without touching billing or account settings.
During private beta
staging vs prod), label them clearly, and revoke a key the moment it’s no longer needed. That habit is exactly what the multi-developer model below builds on.
Step 6
These features ship after private beta. They're documented here so you can design your integration around them today.
On the roadmap · Multi-developer support with role-based access
On the roadmap · Spend caps at the developer and account level
Building against Olive during private beta and want early access to these? Mention it when you reach out — we’re prioritizing based on what customers actually need.