Evaluation engine

Optionally run prompts through multiple models. Find where they disagree.

Three evaluation strategies for multi-perspective analysis, from fast side-by-side comparison to adversarial contradiction detection: Basic, Multi-perspective, and Adversarial. Bring your own key. Accessible via the FusionLayer MCP server in Claude Code, Cursor, and any MCP-compatible host.

Smart evaluation: On. FusionLayer routes to the best-fit model per task.

How it works

The engine runs in the background. Your users never see the seams.

When a user sends a message, FusionLayer classifies the task type (code, writing, analysis, etc.) and uses that signal — along with aggregated routing data from across the user base — to decide which model is most likely to produce the best result for that task. Classification runs server-side. The classifier output (a topic label) — not the prompt — drives routing decisions and bandit selection.

With the setting on, the response your user sees comes from the model FusionLayer judged the best fit for the task. If the setting is off, the user's preferred model handles everything. One toggle, one tradeoff: routing quality vs. latency budget.

Task classification runs server-side — prompt text is not stored in routing signals, only the resulting topic label
Routing is informed by aggregated signal across the user base
Fastest acceptable model wins in a tie
Implicit feedback (retry, edit, continue) improves routing over time
Works with any vendor connector or bring-your-own-key setup
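
The selection step described above can be sketched as follows. This is a minimal illustration, not FusionLayer's actual router: the `ModelStats` fields, the `pick_model` function, the `tie_margin` value, and the model names are all assumptions made for the example. It picks the model with the highest estimated quality for a task label and, when several models are within a small margin of the best, prefers the fastest one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelStats:
    """Hypothetical aggregate for one (task label, model) pair."""
    model: str
    quality: float     # assumed: estimated success rate in [0, 1]
    latency_ms: float  # assumed: typical end-to-end latency


def pick_model(candidates: list[ModelStats], tie_margin: float = 0.02) -> str:
    """Pick the highest-quality model; among near-ties, prefer the fastest."""
    if not candidates:
        raise ValueError("no candidate models for this task label")
    best_quality = max(c.quality for c in candidates)
    # Every model within tie_margin of the best counts as "acceptable".
    acceptable = [c for c in candidates if best_quality - c.quality <= tie_margin]
    return min(acceptable, key=lambda c: c.latency_ms).model


# Illustrative numbers only, for the task label "code".
stats_for_code = [
    ModelStats("model-a", quality=0.91, latency_ms=2400),
    ModelStats("model-b", quality=0.90, latency_ms=900),
    ModelStats("model-c", quality=0.70, latency_ms=300),
]
chosen = pick_model(stats_for_code)
```

Here `model-a` and `model-b` are treated as tied on quality, so the faster `model-b` is chosen; `model-c` is fastest overall but falls outside the acceptable range.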

What enters the aggregator

The shared aggregator receives only a task-type label, token counts, the model and vendor chosen, latency, estimated cost, and an implicit quality signal. It receives no prompt text and no conversation content. In managed Smart mode, classification runs server-side and prompt text enters the routing pipeline transiently to produce the label; it is not stored in routing signals. This is enforced in code by a server-side field whitelist rather than left to policy.

What the engine observes

Task-type label: produced server-side; only the label is retained, not the prompt
Token counts: input, output, context, reasoning
Model + vendor chosen: which model handled the request
Latency (ms): end-to-end wall time
Estimated cost (USD): per-request model cost
Implicit quality signal: retry, edit, continue (not content)

Prompt text, conversation content, and any personally identifiable information are not included — even as an anonymous signal. The schema is enforced server-side.
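
A field whitelist of this kind can be sketched in a few lines. The key names in `ALLOWED_FIELDS` and the `to_routing_signal` function are hypothetical, chosen to mirror the fields listed above; the real schema is not shown on this page. The point of the sketch is that filtering by an explicit allowlist drops any unexpected field, including prompt text, before it reaches the aggregator.

```python
# Hypothetical key names mirroring the observed fields listed above.
ALLOWED_FIELDS = frozenset({
    "task_label",
    "input_tokens", "output_tokens", "context_tokens", "reasoning_tokens",
    "model", "vendor",
    "latency_ms",
    "estimated_cost_usd",
    "quality_signal",
})


def to_routing_signal(event: dict) -> dict:
    """Keep only allowlisted keys; everything else is silently dropped."""
    return {key: value for key, value in event.items() if key in ALLOWED_FIELDS}


# Illustrative event that wrongly carries prompt text.
raw_event = {
    "task_label": "code",
    "input_tokens": 412,
    "output_tokens": 128,
    "model": "model-b",
    "vendor": "openai",
    "latency_ms": 930,
    "quality_signal": "continue",
    "prompt": "this text must never reach the aggregator",
}
signal = to_routing_signal(raw_event)
```

An allowlist fails closed: a newly added field stays out of the aggregator until someone explicitly adds it to the set.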

Evaluation strategies

Three strategies. Pick the right one for the task.

The eval engine exposes three explicit strategies via MCP tools. Each is a distinct tradeoff between speed, cost, and depth of analysis.

Basic

Side-by-side comparison

Send the same prompt to multiple vendors in parallel. Responses come back independently — no aggregation, no judge. Compare directly.

Best for: quick sanity checks, spotting outlier responses, cost benchmarking across vendors.

fl_eval_basic
vendors: [anthropic, openai, google]
→ three responses, side-by-side
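
The parallel fan-out behind a side-by-side comparison can be sketched with `asyncio`. This is an illustration under assumptions, not the `fl_eval_basic` implementation: `call_vendor` is a stub standing in for a real vendor API call, and `eval_basic` is a hypothetical name. Each response is returned untouched, keyed by vendor, with no aggregation and no judge.

```python
import asyncio


async def call_vendor(vendor: str, prompt: str) -> str:
    """Stub for a real vendor API call (hypothetical; no network access)."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"[{vendor}] answer to: {prompt}"


async def eval_basic(prompt: str, vendors: list[str]) -> dict[str, str]:
    """Send the same prompt to every vendor concurrently."""
    replies = await asyncio.gather(*(call_vendor(v, prompt) for v in vendors))
    # gather() preserves input order, so zip pairs each vendor with its reply.
    return dict(zip(vendors, replies))


results = asyncio.run(
    eval_basic("What is a mutex?", ["anthropic", "openai", "google"])
)
```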

Multi-perspective

Parallel aggregation

Run multiple vendors and combine their responses with one of three aggregation modes: consensus (median response), union (all responses combined), or best (lowest-latency winner).

Best for: high-stakes answers where you want a synthesized view, not just the fastest reply.

fl_eval_multi_perspective
aggregation: consensus
→ single aggregated response + confidence
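
The three aggregation modes can be sketched as below. How FusionLayer computes a "median response" is not specified here, so this example makes an assumption and says so: consensus is approximated as the reply that overlaps most with the others, measured by word-set (Jaccard) similarity. The `Reply` type, the `aggregate` function, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reply:
    vendor: str
    text: str
    latency_ms: float


def _jaccard(a: str, b: str) -> float:
    """Word-set overlap between two texts, in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def aggregate(replies: list[Reply], mode: str) -> str:
    if not replies:
        raise ValueError("no replies to aggregate")
    if mode == "union":
        # All responses combined, labeled by vendor.
        return "\n\n".join(f"{r.vendor}: {r.text}" for r in replies)
    if mode == "best":
        # Lowest-latency winner, as described above.
        return min(replies, key=lambda r: r.latency_ms).text
    if mode == "consensus":
        # Assumption: the reply most similar to all the others.
        def agreement(r: Reply) -> float:
            return sum(_jaccard(r.text, o.text) for o in replies if o is not r)
        return max(replies, key=agreement).text
    raise ValueError(f"unknown aggregation mode: {mode}")


replies = [
    Reply("anthropic", "Paris is the capital of France", 1200),
    Reply("openai", "The capital of France is Paris", 800),
    Reply("google", "Lyon is the capital", 500),
]
```

With this data, `best` returns the fastest reply even though it is the outlier, while `consensus` returns one of the two replies that agree with each other.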

Adversarial

Contradiction detection

Run two or more vendors, then use an LLM judge to find specific factual contradictions between their responses. Each contradiction is returned with severity (low / medium / high) and an explanation.

Best for: factual research, medical / legal questions, any domain where model hallucination is expensive.

fl_eval_adversarial
vendors: [anthropic, openai]
→ contradictions[] + consensus
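
A caller will often want to act only on the more serious contradictions. The sketch below shows one way to filter by severity. The `Contradiction` type and the `at_least` helper are hypothetical; only the severity levels (low / medium / high) and the presence of an explanation come from the description above, and the sample entries are invented for illustration.

```python
from dataclasses import dataclass

# Ordering of the three severity levels named above.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2}


@dataclass(frozen=True)
class Contradiction:
    """Hypothetical shape of one entry in contradictions[]."""
    severity: str      # "low", "medium", or "high"
    explanation: str


def at_least(items: list[Contradiction], minimum: str) -> list[Contradiction]:
    """Return contradictions at or above the given severity, order preserved."""
    floor = SEVERITY_RANK[minimum]
    return [c for c in items if SEVERITY_RANK[c.severity] >= floor]


# Invented sample data.
found = [
    Contradiction("low", "Responses differ on the release year by one."),
    Contradiction("high", "One response gives a dosage ten times the other."),
    Contradiction("medium", "Responses cite different governing statutes."),
]
serious = at_least(found, "medium")
```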

All three strategies are available as MCP tools — install the FusionLayer MCP server and call them from Claude Code, Cursor, or any MCP-compatible host. Bring your own API keys.

Common questions

Does FusionLayer send my prompts to multiple models at once?

It depends on the strategy. The Basic and Multi-perspective strategies send the same prompt to multiple vendors in parallel — that is the point. The Adversarial strategy also runs multiple vendors and then uses an LLM judge to find contradictions. Smart evaluation (the background routing mode) is different: it classifies the task server-side and routes to one model — the one most likely to do best. All strategies are opt-in; you choose which one to use per request via the MCP tools or the in-app settings.

Can I disable smart evaluation?

Yes. Turn the setting off and FusionLayer forwards every request to the user's preferred model, unchanged. Memory injection still works regardless of the evaluation setting.

How does the engine get better over time?

Each request produces an implicit quality signal (did the user retry? edit? continue?) that is aggregated across the user base and fed back into routing. No prompt text is involved — only the result metadata. The engine improves routing for each task type as more signal accumulates.
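
One simple way to fold implicit signals into a per-task quality estimate is an exponential moving average. This sketch is an assumption, not the engine's actual update rule: the reward values in `SIGNAL_REWARD`, the `update_quality` function, the starting estimate of 0.5, and the smoothing factor `alpha` are all chosen for illustration. Note that only the task label, the model name, and the signal are used; no prompt text appears anywhere.

```python
# Assumed mapping from implicit signal to a reward in [0, 1].
SIGNAL_REWARD = {"retry": 0.0, "edit": 0.5, "continue": 1.0}


def update_quality(
    estimates: dict,
    task_label: str,
    model: str,
    signal: str,
    alpha: float = 0.1,
) -> float:
    """Move the (task label, model) estimate a step toward the observed reward."""
    key = (task_label, model)
    previous = estimates.get(key, 0.5)  # assumed neutral starting point
    estimates[key] = previous + alpha * (SIGNAL_REWARD[signal] - previous)
    return estimates[key]


estimates: dict = {}
update_quality(estimates, "code", "model-a", "continue")  # estimate moves up
update_quality(estimates, "code", "model-a", "retry")     # estimate moves down
```

A smaller `alpha` makes the estimate change more slowly, which smooths out noise from individual users at the cost of reacting later to real changes in model quality.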

What happens in Private or Incognito mode?

Evaluation still runs. Routing uses server-side task classification regardless of the privacy mode. What changes is memory — Private and Incognito sessions use encrypted or ephemeral storage, not the server-indexed context store. The routing signal (aggregated metadata only) still contributes to the shared model regardless of privacy mode.

Can I bring my own keys and still use the eval engine?

Yes. Bring-your-own-key connectors work with the eval engine. The engine routes within your connected vendors. If you have only one vendor key, the engine optimizes within that vendor's model family.