Documentation.
Evaluate any model end-to-end — driven by agents, or with a single pip package.
Overview
Spherics Bench is the evaluation layer for AI models. You bring a model; we provide the benchmark suites, sample orchestration, scoring, live telemetry, and failure analysis — end-to-end, with nothing to wire up yourself.
There are two ways to evaluate, and an agent can drive either of them for you. Everything is reachable from the benchmark catalog and your experiments.
Agents
This is the heart of Spherics Bench. We natively integrate evaluation agents that run the whole loop for you — picking a benchmark, launching the run, streaming telemetry, and triaging failures. You can also connect your own agents, so they offload evaluation work and minimise your team's effort. Either way, you mostly just describe what you want validated.
API mode
In API mode we stream evaluation samples to your model and score the responses in real time. Your model stays wherever it runs — we just send it the inputs and read back the outputs, sample by sample. Ideal for hosted or API-served models.
You expose your model behind one HTTP endpoint. It receives a sample's input and returns the answer in the benchmark's output format (shown on every benchmark page). Any framework works — here's the whole contract:

# Minimal model endpoint (FastAPI)
from fastapi import FastAPI

app = FastAPI()

@app.post("/infer")
def infer(sample: dict):
    # sample = the benchmark input (question, images, choices, ...)
    # my_model is a placeholder for your own inference function
    answer = my_model(sample)
    # return JSON matching the benchmark's output_format
    return {"answer": answer, "reasoning": "optional"}

Deploy it anywhere reachable over HTTPS, then give Spherics Bench the URL when you launch (the Model endpoint field), or just hand it to an agent. We POST each sample to your endpoint and score what comes back.
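Before deploying, it can help to exercise the request/response round trip locally. The sketch below is not part of Spherics Bench. It uses only the Python standard library to stand up a stub endpoint and POST one sample to it, the way the platform is described as doing. The sample fields (`question`, `choices`), the stub `my_model`, and the helper names `post_sample` and `smoke_test` are assumptions for illustration; the real input and output shapes are the ones shown on each benchmark page.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def my_model(sample: dict) -> str:
    # Hypothetical stand-in for a real model: always picks the first choice.
    return sample["choices"][0]


class InferHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/infer":
            self.send_error(404)
            return
        # Read the JSON sample from the request body.
        length = int(self.headers.get("Content-Length", 0))
        sample = json.loads(self.rfile.read(length))
        body = json.dumps({"answer": my_model(sample)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def post_sample(url: str, sample: dict) -> dict:
    # POST one sample as JSON and decode the JSON answer.
    request = urllib.request.Request(
        url,
        data=json.dumps(sample).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # An empty ProxyHandler bypasses any system proxy for this local call.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=10) as response:
        return json.loads(response.read())


def smoke_test() -> dict:
    server = HTTPServer(("127.0.0.1", 0), InferHandler)  # port 0 = any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/infer"
        return post_sample(url, {"question": "2 + 2?", "choices": ["4", "5"]})
    finally:
        server.shutdown()
        server.server_close()
```

Calling `smoke_test()` runs one full round trip on localhost. Pointing `post_sample` at a deployed URL instead would check the same shape against the real endpoint.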
Container mode
In container mode we launch a dedicated, isolated job that runs the full benchmark against your model end-to-end — no machine of your own to keep online. Ideal for self-contained model images and longer evaluations.
Live telemetry
Every run streams live to the Experiments dashboard: the mean score as it converges, per-category breakdowns, processed samples, latency and efficiency, and a failure report.
The dashboard adapts to each run. The views you see are assembled on the fly from which model you are benchmarking (a transformer surfaces different internals than a diffusion model) and which benchmark is running (audio, video, and image benchmarks expose fundamentally different outputs and metrics). There is nothing to configure and no code to write: the right visualizations simply appear for your run.
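To make "the mean score as it converges" and "per-category breakdowns" concrete, here is a minimal sketch of how a running mean can be updated one sample at a time without storing every score. It illustrates the general technique only and is not Spherics Bench's implementation; the class name `RunningScores` and the category labels are hypothetical.

```python
from collections import defaultdict


class RunningScores:
    """Incrementally tracks an overall mean and one mean per category."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        # category -> [sample count, running mean]
        self._by_category = defaultdict(lambda: [0, 0.0])

    def update(self, category: str, score: float) -> None:
        # Incremental mean: new_mean = old_mean + (x - old_mean) / n
        self.count += 1
        self.mean += (score - self.mean) / self.count
        entry = self._by_category[category]
        entry[0] += 1
        entry[1] += (score - entry[1]) / entry[0]

    def breakdown(self) -> dict:
        # Per-category means seen so far.
        return {name: mean for name, (_, mean) in self._by_category.items()}


scores = RunningScores()
for category, score in [("math", 1.0), ("math", 0.0), ("vision", 1.0)]:
    scores.update(category, score)
```

After these three samples, `scores.mean` is two thirds and `scores.breakdown()` reports 0.5 for `math` and 1.0 for `vision`.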
Quickstart
Prefer to do it yourself? It's one package and one call. Install it, then point it at a benchmark and your model:

pip install benchcloud

import benchcloud

benchcloud.run(
    benchmark="<benchmark_id>",
    model="<your_model_id>",
)

Is that really all? Yes, that's it. And our agents can handle even this for you: just describe what you want evaluated.
Error codes
The API uses standard HTTP status codes. Authenticated endpoints expect a bearer token from sign-in.
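As a rough sketch of what this looks like on the client side, the snippet below attaches a bearer token using the standard `Authorization: Bearer <token>` header and turns a status code into its standard reason phrase. The URL is a placeholder and the helper names are hypothetical; the actual Spherics Bench endpoints and the codes they return are not specified here.

```python
import urllib.request
from http import HTTPStatus


def build_authenticated_request(url: str, token: str) -> urllib.request.Request:
    # Standard bearer-token header; the token comes from sign-in.
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def describe_status(code: int) -> str:
    # Map a status code to its standard reason phrase, if it has one.
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        return f"{code} (non-standard status)"


# Placeholder URL and token, for illustration only.
request = build_authenticated_request("https://example.invalid/api", "test-token")
```

The request is only constructed here, not sent. A caller would typically pass it to `urllib.request.urlopen` and use `describe_status` on the code of any `HTTPError` that is raised.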
Ready to evaluate?
Pick a benchmark and let an agent take it from there.