56/100
prowl
Benchmarked Apr 06, 2026

ARBITER Oracle Verified

Multi-oracle verification engine for autonomous AI agent contracts. Accepts contract specs, verifies task completion via federated oracles (QTIP, PushMeBot), issues HMAC-signed receipts with PASS/PARTIAL/FAIL verdicts and trust scores. 129 endpoints. L402 + x402 payment rails.

verification ai-agents api_benchmark x402 public

Score Breakdown

Auth Simplicity 10/10
Latency 8/10
Consistency 8/10
Parseability 7/10
Documentation 6/10
Token Efficiency 6/10
First-Try Success 4/10
Error Clarity 2/10

Benchmark Analysis Log

Full LLM thinking from the 4-phase benchmark pipeline.

Analyze
Looking at the ARBITER Oracle API specification and vendor benchmark guide, I'll extract the structured information:

```json
{
  "service_type": "rest_api",
  "base_url": "https://arbiter.chitacloud.dev",
  "auth_method": "none",
  "auth_config": null,
  "endpoints": [
    {
      "path": "/verify-task",
      "method": "POST",
      "purpose": "Submit task spec and agent output for verification, returns HMAC-signed receipt with PASS/PARTIAL/FAIL verdict",
      "params": {
        "task_spec": {"type": "object", "required": true},
        "agent_output": {"type": "object", "required": true}
      },
      "response_format": "json",
      "is_primary": true
    },
    {
      "path": "/health",
      "method": "GET",
      "purpose": "Returns service health status including version, receipt count, MongoDB status, peer oracle count",
      "params": {},
      "response_format": "json",
      "is_primary": false
    },
    {
      "path": "/stats",
      "method": "GET",
      "purpose": "Oracle statistics including pass/partial/fail breakdown, divergence rate, average confidence",
      "params": {},
      "response_format": "json",
      "is_primary": false
    },
    {
      "path": "/real-receipts",
      "method": "GET",
      "purpose": "Returns last 30 production receipts with HMAC signatures for audit purposes",
      "params": {},
      "response_format": "json",
      "is_primary": false
    },
    {
      "path": "/judge-guide-v3",
      "method": "GET",
      "purpose": "Returns full evaluation methodology for SYNTHESIS judges",
      "params": {},
      "response_format": "json",
      "is_primary": false
    }
  ],
  "pricing_model": {
    "type": "unknown",
    "details": {},
    "free_tier": null,
    "paid_tiers": []
  },
  "rate_limits": {
    "rpm": null,
    "tpm": null,
    "daily": null,
    "concurrent": null
  },
  "capabilities": [
    "multi_oracle_verification",
    "contract_verification",
    "task_completion_verification",
    "hmac_signed_receipts",
    "federated_oracle_network",
    "trust_scoring",
    "audit_trail"
  ],
  "agent_readiness": {
    "supports_x402": true,
    "supports_streaming": false,
    "has_sandbox": false,
    "sdks": [],
    "agent_auth_methods": ["public"]
  }
}
```
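The extracted spec above is enough to construct a call to the primary endpoint. The following is a minimal sketch, not a confirmed client: it builds (but does not send) a `POST /verify-task` request using only the `base_url`, the two required top-level parameters, and the `"none"` auth method listed in the spec. The inner fields (`task_id`, `description`, `result`) are illustrative assumptions, and the helper name `build_verify_request` is hypothetical. Note that the benchmark run recorded 404 responses for this path, so the live route or payload shape may differ from the spec.

```python
import json
import urllib.request

BASE_URL = "https://arbiter.chitacloud.dev"  # base_url from the extracted spec


def build_verify_request(task_spec: dict, agent_output: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /verify-task request."""
    body = json.dumps({"task_spec": task_spec, "agent_output": agent_output}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/verify-task",
        data=body,
        # No auth header: the spec lists auth_method "none".
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Inner field names below are assumptions for illustration only.
request = build_verify_request(
    task_spec={"task_id": "example_001", "description": "Generate a simple greeting message"},
    agent_output={"result": "Hello, World!"},
)
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes it straightforward to inspect or log the exact payload before any network call is made.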
Plan
```json
{
  "tests": [
    {
      "name": "health_check",
      "endpoint": "/health",
      "method": "GET",
      "headers": {},
      "payload": {},
      "expected_status": 200,
      "expected_behavior": "Returns service health status with version, receipt count, MongoDB status, peer oracle count",
      "metrics": ["latency", "status_code", "response_time"],
      "validation": {
        "field": "version",
        "type": "string",
        "required": true
      }
    },
    {
      "name": "oracle_stats",
      "endpoint": "/stats",
      "method": "GET", 
      "headers": {},
      "payload": {},
      "expected_status": 200,
      "expected_behavior": "Returns oracle statistics with pass/partial/fail breakdown, divergence rate, confidence metrics",
      "metrics": ["latency", "status_code", "response_time"],
      "validation": {
        "field": "pass_rate",
        "type": "number",
        "min_value": 0
      }
    },
    {
      "name": "judge_methodology",
      "endpoint": "/judge-guide-v3",
      "method": "GET",
      "headers": {},
      "payload": {},
      "expected_status": 200,
      "expected_behavior": "Returns evaluation methodology for SYNTHESIS judges",
      "metrics": ["latency", "status_code", "response_time"],
      "validation": {
        "field": "methodology",
        "type": "object",
        "required": true
      }
    },
    {
      "name": "audit_receipts",
      "endpoint": "/real-receipts",
      "method": "GET",
      "headers": {},
      "payload": {},
      "expected_status": 200,
      "expected_behavior": "Returns last 30 production receipts with HMAC signatures",
      "metrics": ["latency", "status_code", "response_time"],
      "validation": {
        "field": "receipts",
        "type": "array",
        "max_length": 30
      }
    },
    {
      "name": "verify_simple_task",
      "endpoint": "/verify-task",
      "method": "POST",
      "headers": {
        "Content-Type": "application/json"
      },
      "payload": {
        "task_spec": {
          "task_id": "test_001",
          "description": "Generate a simple greeting message",
          "expected_output": "Hello, World!",
          "criteria": ["contains greeting", "proper format"]
        },
        "agent_output": {
          "result": "Hello, World!",
          "confidence": 0.95,
          "completion_time": 1.2
        }
      },
      "expected_status": 200,
      "expected_behavior": "Returns HMAC-signed receipt with PASS/PARTIAL/FAIL verdict",
      "metrics": ["latency", "accuracy", "status_code"],
      "validation": {
        "field": "verdict",
        "type": "string",
        "enum": ["PASS", "PARTIAL", "FAIL"]
      }
    },
    {
      "name": "verify_complex_task",
      "endpoint": "/verify-task",
      "method": "POST",
      "headers": {
        "Content-Type": "application/json"
      },
      "payload": {
        "task_spec": {
          "task_id": "test_002",
          "description": "Analyze sentiment of customer reviews",
          "expected_output": "positive",
          "criteria": ["accurate sentiment classification", "confidence score provided"],
          "input_data": "This product is amazing! I love it."
        },
        "agent_output": {
          "result": "positive",
          "confidence": 0.87,
          "reasoning": "Contains positive words like 'amazing' and 'love'"
        }
      },
      "expected_status": 200,
      "expected_behavior": "Returns verification receipt with oracle consensus",
      "metrics": ["latency", "accuracy", "status_code"],
      "validation": {
        "field": "hmac_signature",
        "type": "string",
        "required": true
      }
    },
    {
      "name": "verify_failed_task",
      "endpoint": "/verify-task",
      "method": "POST",
      "headers": {
        "Content-Type": "application/json"
      },
      "payload": {
        "task_spec": {
          "task_id": "test_003",
          "description": "Calculate 2 + 2",
          "expected_output": "4",
          "criteria": ["correct mathematical answer"]
        },
        "agent_output": {
          "result": "5",
          "confidence": 0.9
        }
      },
      "expected_status": 200,
      "expected_behavior": "Returns FAIL verdict for incorrect answer",
      "metrics": ["latency", "accuracy", "status_code"],
      "validation": {
        "field": "verdict",
        "type": "string",
        "expected_value": "FAIL"
      }
    },
    {
      "name": "missing_task_spec",
      "endpoint": "/verify-task",
      "method": "POST",
      "headers": {
        "Content-Type": "application/json"
      },
      "payload": {
        "agent_output": {
          "result": "test output"
        }
      },
      "expected_status": 400,
      "expected_behavior": "Returns error for missing required task_spec parameter",
      "metrics": ["latency", "error_handling", "status_code"],
      "validation": {
        "field": "error",
        "type": "string",
        "required"
```
Execute
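Each planned test carries a small `validation` object with keys such as `field`, `type`, `enum`, `expected_value`, `min_value`, and `max_length`. How the pipeline actually applies these rules is not shown in the log. The sketch below is one plausible interpretation, assuming a missing field always counts as a failure; the helper name `check_rule` and the sample response bodies are hypothetical.

```python
# Hypothetical mapping from the plan's "type" strings to Python types.
_TYPES = {"string": str, "number": (int, float), "object": dict, "array": list}


def check_rule(body: dict, rule: dict) -> list[str]:
    """Return a list of problems found; an empty list means the rule passed."""
    field = rule["field"]
    if field not in body:
        return [f"missing field {field!r}"]
    value = body[field]

    expected = _TYPES.get(rule.get("type"))
    # bool is a subclass of int in Python, so reject it explicitly.
    if expected and (isinstance(value, bool) or not isinstance(value, expected)):
        return [f"{field!r} should be of type {rule['type']}"]

    problems = []
    if "enum" in rule and value not in rule["enum"]:
        problems.append(f"{field!r} is not one of {rule['enum']}")
    if "expected_value" in rule and value != rule["expected_value"]:
        problems.append(f"{field!r} is {value!r}, expected {rule['expected_value']!r}")
    if "min_value" in rule and isinstance(value, (int, float)) and value < rule["min_value"]:
        problems.append(f"{field!r} is below {rule['min_value']}")
    if "max_length" in rule and isinstance(value, (list, str)) and len(value) > rule["max_length"]:
        problems.append(f"{field!r} is longer than {rule['max_length']}")
    return problems


# Rule copied from the verify_simple_task entry; response bodies are made up.
verdict_rule = {"field": "verdict", "type": "string", "enum": ["PASS", "PARTIAL", "FAIL"]}
print(check_rule({"verdict": "PASS"}, verdict_rule))
print(check_rule({"verdict": "MAYBE"}, verdict_rule))
print(check_rule({}, verdict_rule))
```

Returning a list of problems rather than a boolean keeps the reason for a failure available, which matters for a report that scores error clarity.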

4/10 tests passed

| Test | Endpoint | Status | Latency |
|---|---|---|---|
| health_check | GET /health | 200 | 206 ms |
| oracle_stats | GET /stats | 200 | 138 ms |
| judge_methodology | GET /judge-guide-v3 | 200 | 138 ms |
| audit_receipts | GET /real-receipts | 200 | 141 ms |
| verify_simple_task | POST /verify-task | 404 | 149 ms |
| verify_complex_task | POST /verify-task | 404 | 129 ms |
| verify_failed_task | POST /verify-task | 404 | 130 ms |
| missing_task_spec | POST /verify-task | 404 | 131 ms |
| missing_agent_output | POST /verify-task | 404 | 129 ms |
| invalid_json | POST /verify-task | 404 | 131 ms |
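The results above can be summarized with a short tally. This sketch copies the ten rows verbatim and groups them by status code. It shows that the four 200 responses line up with the reported 4/10 pass count and that every non-200 response came from `POST /verify-task`. A uniform 404 across valid and malformed payloads alike would be consistent with the route not being found at that path, though the log does not confirm the cause.

```python
from collections import Counter
from statistics import mean, median

# (test name, method, path, HTTP status, latency in ms), copied from the results above.
RESULTS = [
    ("health_check", "GET", "/health", 200, 206),
    ("oracle_stats", "GET", "/stats", 200, 138),
    ("judge_methodology", "GET", "/judge-guide-v3", 200, 138),
    ("audit_receipts", "GET", "/real-receipts", 200, 141),
    ("verify_simple_task", "POST", "/verify-task", 404, 149),
    ("verify_complex_task", "POST", "/verify-task", 404, 129),
    ("verify_failed_task", "POST", "/verify-task", 404, 130),
    ("missing_task_spec", "POST", "/verify-task", 404, 131),
    ("missing_agent_output", "POST", "/verify-task", 404, 129),
    ("invalid_json", "POST", "/verify-task", 404, 131),
]

# Count responses per HTTP status code.
by_status = Counter(status for _name, _method, _path, status, _ms in RESULTS)

# Distinct paths that returned anything other than 200.
failing_paths = sorted({path for _name, _method, path, status, _ms in RESULTS if status != 200})

latencies = [ms for _name, _method, _path, _status, ms in RESULTS]

print(dict(by_status))
print(failing_paths)
print(f"mean {mean(latencies):.1f} ms, median {median(latencies)} ms")
```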
Interpret
{"multi_model": true, "models_used": ["openai", "claude_cli"], "model_scores": {"GPT-4o": {"overall": 25, "dimensions": {"token_efficiency": 6.0, "first_try_success": 4.0, "response_parseability": 8.0, "error_clarity": 3.0, "doc_quality": 5.0, "auth_simplicity": 10.0, "latency": 7.0, "consistency": 7.0}}, "Claude CLI": {"overall": 35, "dimensions": {"token_efficiency": 7.0, "first_try_success": 4.0, "response_parseability": 6.0, "error_clarity": 2.0, "doc_quality": 6.0, "auth_simplicity": 10.0, "latency": 8.0, "consistency": 8.0}}}, "averaged": true}
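The `"averaged": true` flag suggests the published Score Breakdown is derived from the two per-model dimension scores. As a check, the sketch below takes the mean of each dimension and rounds halves to the nearest even integer (the behavior of Python's built-in `round`), which happens to reproduce all eight published values. That the pipeline uses exactly this rounding is an assumption, and the log does not show how the 56/100 headline score is derived from these numbers.

```python
# Per-model dimension scores, copied from the Interpret output above.
MODEL_SCORES = {
    "GPT-4o": {
        "token_efficiency": 6.0, "first_try_success": 4.0,
        "response_parseability": 8.0, "error_clarity": 3.0,
        "doc_quality": 5.0, "auth_simplicity": 10.0,
        "latency": 7.0, "consistency": 7.0,
    },
    "Claude CLI": {
        "token_efficiency": 7.0, "first_try_success": 4.0,
        "response_parseability": 6.0, "error_clarity": 2.0,
        "doc_quality": 6.0, "auth_simplicity": 10.0,
        "latency": 8.0, "consistency": 8.0,
    },
}


def average_dimensions(model_scores: dict) -> dict:
    """Mean of each dimension across all models."""
    dimensions = next(iter(model_scores.values())).keys()
    return {
        dim: sum(scores[dim] for scores in model_scores.values()) / len(model_scores)
        for dim in dimensions
    }


averaged = average_dimensions(MODEL_SCORES)
# round() rounds exact halves to the nearest even integer (6.5 -> 6, 7.5 -> 8).
rounded = {dim: round(value) for dim, value in averaged.items()}

for dim, value in averaged.items():
    print(f"{dim}: mean {value}, displayed {rounded[dim]}")
```

For example, Error Clarity averages to 2.5 and is displayed as 2, while Latency averages to 7.5 and is displayed as 8.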

Agent Readiness

x402 Payments: Supported
Streaming: No
Sandbox: None
Agent Auth: public
SDKs: None listed
MCP Support: No

Vendor Profile

ARBITER v58.0.0 - Multi-oracle verification for autonomous AI agent contracts. 65 live receipts, 48% pass rate, MongoDB persistence. SYNTHESIS 2026 participant.

Features

3-verifier independent consensus (structural, semantic, completeness)
HMAC-signed receipts
38% oracle divergence tracking
spec-quality-preflight for pre-job validation
MongoDB persistence with 72h appeal window
