feat: Add AI online evaluations (Judge, Evaluator, IRunner) by mattrmc1 · Pull Request #301 · launchdarkly/dotnet-core

mattrmc1 · 2026-06-30T18:28:25Z

Summary

Adds AI online evaluations to LaunchDarkly.ServerSdk.Ai. A caller that supplies a runnerFactory to LdAiClient gets automatic judge evaluation wired into every CompletionConfig and AgentConfig — each returned config carries an Evaluator that can score model output against the judges declared in the flag's judgeConfiguration. When no runnerFactory is provided (or no judges are configured), configs receive a noop Evaluator so callers never need null checks.

Implements the AIEVALS and AIRUNNER specs (sections 1.1–1.4). createJudge (AIEVALS 1.2) is intentionally omitted per .NET/Java convention — judges are created internally by the SDK, not by user code.

New public types

// Provider-facing runner interface (AIRUNNER 1.2)
public interface IRunner
{
    Task<RunnerResult> RunAsync(string input,
        IReadOnlyDictionary<string, object> outputType = null);
}

// Runner return type (AIRUNNER 1.3)
public sealed record RunnerResult(
    string Content,
    AiMetrics Metrics,
    object Raw = null,
    IReadOnlyDictionary<string, object> Parsed = null);

// Evaluation orchestrator (AIEVALS 1.4)
public sealed class Evaluator
{
    public static Evaluator Noop();
    public Task<IReadOnlyList<JudgeResult>> EvaluateAsync(string input, string output);
}

// Single-judge executor (AIEVALS 1.1)
public sealed class Judge
{
    public LdAiJudgeConfig Config { get; }
    public IRunner Runner { get; }
    public Task<JudgeResult> EvaluateAsync(string input, string output, double? samplingRate = null);
    public Task<JudgeResult> EvaluateMessagesAsync(
        IReadOnlyList<LdAiConfigTypes.Message> messages,
        RunnerResult runnerResult, double? samplingRate = null);
}
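
As a rough illustration of the provider-facing side, a test double for IRunner might look like the sketch below. The IRunner and RunnerResult declarations repeat the shapes listed above so the snippet compiles on its own. AiMetrics is reduced to a hypothetical one-field placeholder, and the "score" and "reasoning" keys in Parsed are assumptions, not something this PR description specifies.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical placeholder; the SDK's real AiMetrics type is not shown in this PR text.
public sealed record AiMetrics(bool Success);

// Copied from the "New public types" listing so the sketch is self-contained.
public sealed record RunnerResult(
    string Content,
    AiMetrics Metrics,
    object Raw = null,
    IReadOnlyDictionary<string, object> Parsed = null);

public interface IRunner
{
    Task<RunnerResult> RunAsync(string input,
        IReadOnlyDictionary<string, object> outputType = null);
}

// Hypothetical runner that returns a canned judge response instead of calling a model.
public sealed class CannedRunner : IRunner
{
    private readonly double _score;

    public CannedRunner(double score) => _score = score;

    public Task<RunnerResult> RunAsync(string input,
        IReadOnlyDictionary<string, object> outputType = null)
    {
        // Assumed structured-output keys; the real judge output schema may differ.
        var parsed = new Dictionary<string, object>
        {
            ["score"] = _score,
            ["reasoning"] = "canned response",
        };
        return Task.FromResult(new RunnerResult(
            Content: "canned",
            Metrics: new AiMetrics(Success: true),
            Parsed: parsed));
    }
}
```

A factory such as `config => new CannedRunner(0.8)` would then match the `Func<LdAiJudgeConfig, IRunner>` shape that the new LdAiClient parameter accepts.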

`LdAiClient` changes

// New optional parameter
public LdAiClient(ILaunchDarklyClient client,
    Func<LdAiJudgeConfig, IRunner> runnerFactory = null);

When runnerFactory is non-null, ConfigFactory.BuildEvaluator iterates the flag's judgeConfiguration, evaluates each judge key as a flag variation, creates a Judge + IRunner pair per enabled judge, and attaches the resulting Evaluator to the config. Disabled judges, null runners, and initialization exceptions are logged and skipped — no single judge failure prevents the others from being built.
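
The log-and-skip behavior described above can be sketched generically. This is not the code in ConfigFactory.BuildEvaluator; the class name, the `build` delegate, and the `warn` callback are hypothetical stand-ins for the SDK's judge construction and logger.

```csharp
using System;
using System.Collections.Generic;

public static class JudgeBuilderSketch
{
    // Builds one judge per key. A null result or an exception for one key is
    // reported through warn and skipped, so the remaining keys are still built.
    public static Dictionary<string, TJudge> BuildAll<TJudge>(
        IEnumerable<string> judgeKeys,
        Func<string, TJudge> build,
        Action<string> warn) where TJudge : class
    {
        var judges = new Dictionary<string, TJudge>();
        foreach (var key in judgeKeys)
        {
            try
            {
                var judge = build(key);
                if (judge == null)
                {
                    warn($"judge '{key}' skipped: nothing was built");
                    continue;
                }
                judges[key] = judge;
            }
            catch (Exception e)
            {
                warn($"judge '{key}' skipped: {e.Message}");
            }
        }
        return judges;
    }
}
```

Catching per key rather than around the whole loop is what keeps a single bad judge from affecting the others.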

`LdAiConfig` base class

All config types (LdAiCompletionConfig, LdAiAgentConfig, LdAiJudgeConfig) now carry an Evaluator property via the base class. LdAiJudgeConfig always receives Evaluator.Noop() (judges don't evaluate themselves).

`JudgeResult` changes

Updated to match AIEVALS 1.3.1 defaults:

Field	Before	After
`MetricKey`	required	optional (default `null`)
`Score`	required	optional (default `0.0`)
`Sampled`	default `true`	default `false`
`Success`	default `true`	default `false`
`ErrorMessage`	—	new field
`Reasoning`	—	new field
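
To make the effect of the new defaults concrete, here is a minimal stand-in for the result type. It is not the SDK's JudgeResult: the class name is hypothetical, and the constructor parameter names and order are assumptions based on the named arguments that appear elsewhere in this PR.

```csharp
using System;

// Hypothetical stand-in mirroring the "After" column of the table above.
public sealed class JudgeResultSketch
{
    public string JudgeConfigKey { get; }
    public string MetricKey { get; }
    public double Score { get; }
    public bool Sampled { get; }
    public bool Success { get; }
    public string ErrorMessage { get; }
    public string Reasoning { get; }

    // Every parameter is optional; Sampled and Success default to false.
    public JudgeResultSketch(
        string judgeConfigKey = null,
        string metricKey = null,
        double score = 0.0,
        bool sampled = false,
        bool success = false,
        string errorMessage = null,
        string reasoning = null)
    {
        JudgeConfigKey = judgeConfigKey;
        MetricKey = metricKey;
        Score = score;
        Sampled = sampled;
        Success = success;
        ErrorMessage = errorMessage;
        Reasoning = reasoning;
    }
}
```

Under this shape, a call that supplies only a metric key and score yields Sampled == false and Success == false, so callers that want a successful result would pass `sampled: true, success: true` explicitly.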

Null safety (ref: java-core#175 discussion)

BuildEvaluator handles every failure mode without throwing:

runnerFactory == null or empty judgeConfiguration → Evaluator.Noop()
Runner factory returns null → warn + skip judge
Judge config disabled → warn + skip judge
Any exception during judge init → warn + skip judge
Missing evaluationMetricKey → JudgeResult(success: false, errorMessage: ...)
Score out of [0, 1] range → JudgeResult(success: false, errorMessage: ...)
Sampling rate NaN/Infinity/negative/> 1.0 → normalized to safe bounds

The Judge constructor still uses ArgumentNullException guards, but BuildEvaluator validates its inputs before construction, so these guards are not expected to fire in practice.
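
The last two guards in the list can be sketched as small pure functions. The class name is hypothetical, and mapping NaN to 0.0 is an assumption of this sketch: the PR text says only that such values are "normalized to safe bounds".

```csharp
using System;

public static class JudgeGuardsSketch
{
    // Clamps a sampling rate into [0, 1]. Infinity clamps to the nearest bound.
    // NaN -> 0.0 (never sample) is an assumption; the PR does not state the mapping.
    public static double NormalizeSamplingRate(double rate)
    {
        if (double.IsNaN(rate)) return 0.0;
        return Math.Min(1.0, Math.Max(0.0, rate));
    }

    // A score is accepted only when it is finite and within [0, 1].
    public static bool IsValidScore(double score) =>
        !double.IsNaN(score) && !double.IsInfinity(score) && score >= 0.0 && score <= 1.0;
}
```

A rejected score would then be reported as a failed result with an error message rather than thrown, in line with the failure modes listed above.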

Migration

None required. The LdAiClient constructor gains an optional runnerFactory parameter (default null) — existing callers are unaffected. JudgeResult default changes are source-compatible (all parameters are now optional with safe defaults). No members removed or renamed.

Test plan

dotnet test pkgs/sdk/server-ai/test/LaunchDarkly.ServerSdk.Ai.Tests.csproj --framework net8.0 passes
JudgeTest (435 lines) covers: successful evaluation with score/reasoning extraction, sampling skip path, samplingRate edge cases (NaN, negative, > 1.0), runner exception handling, missing evaluationMetricKey validation, out-of-range score rejection with errorMessage, evaluateMessages formatting, null/empty message handling
EvaluatorTest (210 lines) covers: noop returns empty list, multi-judge execution with per-judge sampling, missing judge key logs warning and skips, noop does not log warnings
LdAiCompletionConfigTest (189 lines added) covers: Evaluator attached when runnerFactory provided, noop Evaluator when no runner factory, noop when judgeConfiguration is empty, disabled judge skipped, null runner skipped
LdAiJudgeConfigTest covers: JudgeResult ErrorMessage and Reasoning fields, judge config always receives noop Evaluator
LdAiConfigTrackerTest covers: TrackJudgeResult with new optional fields

Note

Medium Risk
New async judge execution and extra flag evaluations on config build can affect latency and cost; JudgeResult default changes are source-compatible but may alter behavior for callers relying on old implicit defaults.

Overview
Adds AI online evaluations to the server AI SDK: callers can pass an optional runnerFactory to LdAiClient so completion and agent configs get a non-noop Evaluator when the flag includes judgeConfiguration.

ConfigFactory now builds evaluators by resolving each judge key as a flag variation, creating Judge + IRunner pairs (with failures logged and skipped), and attaching the result on normal and default fallback paths—including agent graphs via graphKey for judge tracking. Judge prompt variables from the parent config call are forwarded into nested judge config builds.

New public surface: IRunner, RunnerResult, Judge, and Evaluator (with Noop()). LdAiConfig exposes a always-non-null Evaluator; judge configs use Evaluator.Noop().

JudgeResult gains ErrorMessage and Reasoning, looser optional constructor defaults (Sampled/Success now default false), and JudgeConfiguration.Judge.SamplingRate is double? (omitted on wire when null; parse only reads numeric samplingRate).

^{Reviewed by Cursor Bugbot for commit 6f7d88d. Bugbot is set up for automated code reviews on this repo. Configure here.}


cursor

Cursor Bugbot has reviewed your changes using default effort and found 3 potential issues.


cursor · 2026-07-01T20:50:08Z

+            {
+                var defaultValue = LdAiJudgeConfigDefault.Disabled;
+                var ldValue = _client.JsonVariation(judgeEntry.Key, context, defaultValue.ToLdValue());
+                var judgeConfig = BuildJudgeConfig(judgeEntry.Key, ldValue, context, defaultValue, variables, graphKey: graphKey);


Judge sampling ignores interpolate flag

Medium Severity

BuildEvaluator always calls BuildJudgeConfig with the default interpolate: true, even when the parent config was built with interpolate: false (template APIs). Nested judge configs then run Mustache interpolation (including ldctx) while completion/agent instructions and messages stay as raw templates, so template configs and runnerFactory output can disagree with the documented template behavior.

Additional Locations (2)

pkgs/sdk/server-ai/src/Config/ConfigFactory.cs#L267-L308

pkgs/sdk/server-ai/src/LdAiClient.cs#L156-L165


cursor · 2026-07-01T20:50:09Z

+        var effectiveRate = samplingRate.HasValue ? NormalizeSamplingRate(samplingRate.Value) : 1.0;
+        if (new Random().NextDouble() > effectiveRate)
+        {
+            return new JudgeResult(sampled: false, judgeConfigKey: Config.Key);


Same random draw for all judges

Medium Severity

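Each EvaluateAsync call uses new Random().NextDouble() for sampling. On .NET Framework, where the parameterless Random constructor seeds from the system clock, instances created in the same tick share a seed. When an Evaluator runs multiple judges in one request, those instances can therefore draw the same value and always skip or always run together instead of sampling independently per judge. Modern .NET seeds each instance differently, so the problem is specific to .NET Framework targets.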
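
One common way to avoid correlated draws is to share a single generator across all judges. The sketch below is an illustration under that assumption, not the fix adopted in this PR; the class and method names are hypothetical.

```csharp
using System;

public static class SamplingSketch
{
    // One process-wide generator, guarded by a lock because System.Random
    // instances are not thread-safe. On .NET 6 and later, Random.Shared is
    // a built-in thread-safe alternative.
    private static readonly Random Rng = new Random();
    private static readonly object Gate = new object();

    public static double NextDouble()
    {
        lock (Gate)
        {
            return Rng.NextDouble();
        }
    }

    // True when an evaluation should run for a rate already normalized to [0, 1].
    public static bool ShouldSample(double rate) => NextDouble() < rate;
}
```

Because NextDouble() returns values in [0, 1), the strict comparison means a rate of 0 never samples and a rate of 1 always does. This differs slightly from the `>` comparison in the diff above, where a draw of exactly 0 would still run at rate 0.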


cursor · 2026-07-01T20:50:09Z

+
+            var result = await judge.EvaluateAsync(input, output, judgeEntry.SamplingRate);
+            results.Add(result);
+        }


Duplicate judge keys run twice

Low Severity

If judgeConfiguration.judges lists the same key more than once, BuildEvaluator stores one Judge in a dictionary but builds filteredConfig from the full list without deduplicating. EvaluateAsync then invokes the same judge once per duplicate entry and returns multiple results for one logical judge.

Additional Locations (1)

pkgs/sdk/server-ai/src/Config/ConfigFactory.cs#L305-L307
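
A possible remedy for the duplicate-key case is to keep only the first entry per judge key before the evaluator iterates. The helper below is a hypothetical sketch, not code from this PR; `keySelector` stands in for however an entry exposes its judge key.

```csharp
using System;
using System.Collections.Generic;

public static class JudgeKeySketch
{
    // Keeps the first entry for each key and preserves the original order.
    public static List<T> DistinctByKey<T>(
        IEnumerable<T> entries, Func<T, string> keySelector)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var result = new List<T>();
        foreach (var entry in entries)
        {
            // HashSet.Add returns false when the key was already present.
            if (seen.Add(keySelector(entry)))
            {
                result.Add(entry);
            }
        }
        return result;
    }
}
```

Keeping the first occurrence means a duplicate's sampling rate is ignored; whether first-wins or last-wins is the right policy is a design choice the review comment leaves open.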


mattrmc1 added 3 commits June 15, 2026 11:21

[AIC-2660] Implement AIEVALS (first pass)

a625933

Merge branch 'main' of github.com:launchdarkly/dotnet-core into mmcca…

db62ad3

…rthy/AIC-2660/aievals-core-judge

fix: AIEVALS spec compliance — JudgeResult defaults, null safety, sam…

a680fb4

…pling normalization

mattrmc1 changed the title ~~Mmccarthy/aic 2660/aievals core judge~~ feat: Add AI online evaluations (Judge, Evaluator, IRunner) Jun 30, 2026

fix: set errorMessage on out-of-range score for JS/Python parity

f920e9d

mattrmc1 marked this pull request as ready for review June 30, 2026 20:23

mattrmc1 requested a review from a team as a code owner June 30, 2026 20:23

cursor Bot reviewed Jun 30, 2026


Comment thread pkgs/sdk/server-ai/src/Config/ConfigFactory.cs Outdated

Comment thread pkgs/sdk/server-ai/src/Evals/Judge.cs

Comment thread pkgs/sdk/server-ai/src/Evals/Evaluator.cs

mattrmc1 requested review from jsonbailey and tanderson-ld June 30, 2026 20:47

Merge branch 'main' into mmccarthy/AIC-2660/aievals-core-judge

aa959ce

cursor Bot reviewed Jun 30, 2026


Comment thread pkgs/sdk/server-ai/src/Config/ConfigFactory.cs Outdated

Comment thread pkgs/sdk/server-ai/src/Config/ConfigFactory.cs

Comment thread pkgs/sdk/server-ai/src/Evals/Judge.cs

mattrmc1 and others added 3 commits June 30, 2026 16:39

Merge branch 'main' into mmccarthy/AIC-2660/aievals-core-judge

b814097

fix: robust judge score parsing, fallback evaluator wiring, filtered …

66e2c31

…judge config, debug log for disabled judges

fix: forward caller variables to embedded judge message interpolation

f5f064e

cursor Bot reviewed Jul 1, 2026


Comment thread pkgs/sdk/server-ai/src/Evals/Judge.cs

Comment thread pkgs/sdk/server-ai/src/Evals/Evaluator.cs

Merge branch 'main' into mmccarthy/AIC-2660/aievals-core-judge

3bea03d

cursor Bot reviewed Jul 1, 2026


Comment thread pkgs/sdk/server-ai/src/Config/ConfigFactory.cs Outdated

mattrmc1 added 3 commits July 1, 2026 15:47

fix: NaN/Infinity score rejection, nullable judge samplingRate, graph…

b3bad64

…Key on judge trackers

Merge branch 'main' of github.com:launchdarkly/dotnet-core into mmcca…

c06c4b9

…rthy/AIC-2660/aievals-core-judge

Merge branch 'mmccarthy/AIC-2660/aievals-core-judge' of github.com:la…

6f7d88d

…unchdarkly/dotnet-core into mmccarthy/AIC-2660/aievals-core-judge

cursor Bot reviewed Jul 1, 2026

mattrmc1 wants to merge 12 commits into main from mmccarthy/AIC-2660/aievals-core-judge
