feat: Add AI online evaluations (Judge, Evaluator, IRunner) #301
mattrmc1 wants to merge 12 commits into
Cursor Bugbot has reviewed your changes using default effort and found 3 potential issues.
Reviewed by Cursor Bugbot for commit 6f7d88d.
    {
        var defaultValue = LdAiJudgeConfigDefault.Disabled;
        var ldValue = _client.JsonVariation(judgeEntry.Key, context, defaultValue.ToLdValue());
        var judgeConfig = BuildJudgeConfig(judgeEntry.Key, ldValue, context, defaultValue, variables, graphKey: graphKey);
Judge sampling ignores interpolate flag
Medium Severity
BuildEvaluator always calls BuildJudgeConfig with the default interpolate: true, even when the parent config was built with interpolate: false (template APIs). Nested judge configs then run Mustache interpolation (including ldctx) while completion/agent instructions and messages stay as raw templates, so template configs and runnerFactory output can disagree with the documented template behavior.
    var effectiveRate = samplingRate.HasValue ? NormalizeSamplingRate(samplingRate.Value) : 1.0;
    if (new Random().NextDouble() > effectiveRate)
    {
        return new JudgeResult(sampled: false, judgeConfigKey: Config.Key);
Same random draw per judge
Medium Severity
Each EvaluateAsync call uses new Random().NextDouble() for sampling. When an Evaluator runs multiple judges in one request, those instances are often created in the same tick and share the same seed, so they draw the same value and always skip or always run together instead of independent per-judge sampling.
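One way to address this is to draw every sampling decision from a single shared generator instead of constructing a new one per call. The following is a minimal sketch under that assumption; `SamplingDice` and `ShouldSample` are hypothetical names that do not appear in the PR. Clock-based seeding of the parameterless `Random` constructor applies to runtimes such as .NET Framework, and on .NET 6 and later `Random.Shared` is a thread-safe alternative.

```csharp
using System;

// Hypothetical helper, not part of the PR: one shared generator so that
// sampling decisions made back to back draw independent values.
public static class SamplingDice
{
    private static readonly Random Generator = new Random();
    private static readonly object Gate = new object();

    // Returns true when the judge should run for this call.
    // The rate is assumed to be already normalized into [0, 1].
    public static bool ShouldSample(double rate)
    {
        double draw;
        lock (Gate) // System.Random instances are not thread-safe
        {
            draw = Generator.NextDouble(); // value in [0, 1)
        }
        // Strict comparison: a rate of 0 never samples, a rate of 1 always does.
        return draw < rate;
    }
}
```

With a helper like this, each judge in a multi-judge `Evaluator` run would call `SamplingDice.ShouldSample(effectiveRate)` and get its own draw.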
        var result = await judge.EvaluateAsync(input, output, judgeEntry.SamplingRate);
        results.Add(result);
    }
Duplicate judge keys run twice
Low Severity
If judgeConfiguration.judges lists the same key more than once, BuildEvaluator stores one Judge in a dictionary but builds filteredConfig from the full list without deduplicating. EvaluateAsync then invokes the same judge once per duplicate entry and returns multiple results for one logical judge.
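A possible remedy is to deduplicate the judge entries by key before building `filteredConfig`. The sketch below keeps the first entry per key and preserves order. `JudgeEntry` and `JudgeEntries.DistinctByKey` are hypothetical stand-ins for illustration; the PR's actual type is `JudgeConfiguration.Judge`, whose exact shape is not shown here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a judge entry; the PR's real type is
// JudgeConfiguration.Judge and may differ in shape.
public sealed class JudgeEntry
{
    public string Key { get; }
    public double? SamplingRate { get; }

    public JudgeEntry(string key, double? samplingRate = null)
    {
        Key = key;
        SamplingRate = samplingRate;
    }
}

public static class JudgeEntries
{
    // Keeps the first entry seen for each key and preserves the original order.
    public static List<JudgeEntry> DistinctByKey(IEnumerable<JudgeEntry> entries)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var result = new List<JudgeEntry>();
        foreach (var entry in entries)
        {
            if (seen.Add(entry.Key)) // Add returns false for a key already seen
            {
                result.Add(entry);
            }
        }
        return result;
    }
}
```

Whether the first or the last duplicate should win is a design choice; first-wins is assumed here only because it is the simplest to reason about.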


Summary
Adds AI online evaluations to LaunchDarkly.ServerSdk.Ai. A caller that supplies a runnerFactory to LdAiClient gets automatic judge evaluation wired into every CompletionConfig and AgentConfig: each returned config carries an Evaluator that can score model output against the judges declared in the flag's judgeConfiguration. When no runnerFactory is provided (or no judges are configured), configs receive a noop Evaluator so callers never need null checks.

Implements the AIEVALS and AIRUNNER specs (sections 1.1–1.4).

createJudge (AIEVALS 1.2) is intentionally omitted per .NET/Java convention; judges are created internally by the SDK, not by user code.

New public types

LdAiClient changes
When runnerFactory is non-null, ConfigFactory.BuildEvaluator iterates the flag's judgeConfiguration, evaluates each judge key as a flag variation, creates a Judge + IRunner pair per enabled judge, and attaches the resulting Evaluator to the config. Disabled judges, null runners, and initialization exceptions are logged and skipped, so no single judge failure prevents the others from being built.

LdAiConfig base class
All config types (LdAiCompletionConfig, LdAiAgentConfig, LdAiJudgeConfig) now carry an Evaluator property via the base class. LdAiJudgeConfig always receives Evaluator.Noop() (judges don't evaluate themselves).

JudgeResult changes
Updated to match AIEVALS 1.3.1 defaults:
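- MetricKey: defaults to null
- Score: defaults to 0.0
- Sampled: default changed from true to false
- Success: default changed from true to false
- ErrorMessage: new field
- Reasoning: new field

Null safety (ref: java-core#175 discussion)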
BuildEvaluator handles every failure mode without throwing:
- runnerFactory == null or empty judgeConfiguration → Evaluator.Noop()
- null runner → warn + skip judge
- missing evaluationMetricKey → JudgeResult(success: false, errorMessage: ...)
- score outside the [0, 1] range → JudgeResult(success: false, errorMessage: ...)
- samplingRate that is NaN/Infinity/negative/> 1.0 → normalized to safe bounds

The Judge constructor still uses ArgumentNullException guards, but these are never hit in practice because BuildEvaluator validates inputs before construction.

Migration
None required. The LdAiClient constructor gains an optional runnerFactory parameter (default null), so existing callers are unaffected. JudgeResult default changes are source-compatible (all parameters are now optional with safe defaults). No members removed or renamed.

Test plan
- dotnet test pkgs/sdk/server-ai/test/LaunchDarkly.ServerSdk.Ai.Tests.csproj --framework net8.0 passes
- JudgeTest (435 lines) covers: successful evaluation with score/reasoning extraction, sampling skip path, samplingRate edge cases (NaN, negative, > 1.0), runner exception handling, missing evaluationMetricKey validation, out-of-range score rejection with errorMessage, evaluateMessages formatting, null/empty message handling
- EvaluatorTest (210 lines) covers: noop returns empty list, multi-judge execution with per-judge sampling, missing judge key logs warning and skips, noop does not log warnings
- LdAiCompletionConfigTest (189 lines added) covers: Evaluator attached when runnerFactory provided, noop Evaluator when no runner factory, noop when judgeConfiguration is empty, disabled judge skipped, null runner skipped
- LdAiJudgeConfigTest covers: JudgeResult ErrorMessage and Reasoning fields, judge config always receives noop Evaluator
- LdAiConfigTrackerTest covers: TrackJudgeResult with new optional fields

Note
Medium Risk
New async judge execution and extra flag evaluations on config build can affect latency and cost; JudgeResult default changes are source-compatible but may alter behavior for callers relying on old implicit defaults.

Overview
Adds AI online evaluations to the server AI SDK: callers can pass an optional runnerFactory to LdAiClient so completion and agent configs get a non-noop Evaluator when the flag includes judgeConfiguration. ConfigFactory now builds evaluators by resolving each judge key as a flag variation, creating Judge + IRunner pairs (with failures logged and skipped), and attaching the result on normal and default fallback paths, including agent graphs via graphKey for judge tracking. Judge prompt variables from the parent config call are forwarded into nested judge config builds.

New public surface: IRunner, RunnerResult, Judge, and Evaluator (with Noop()). LdAiConfig exposes an always-non-null Evaluator; judge configs use Evaluator.Noop(). JudgeResult gains ErrorMessage and Reasoning and looser optional constructor defaults (Sampled/Success now default false). JudgeConfiguration.Judge.SamplingRate is double? (omitted on wire when null; parse only reads numeric samplingRate).
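
The description says that NaN, Infinity, negative, and > 1.0 sampling rates are normalized to safe bounds, and the reviewed diff treats a missing rate as 1.0. The PR does not show the body of NormalizeSamplingRate, so the following is only a sketch of what such normalization could look like. `SamplingRates.Normalize` is a hypothetical name, and the mapping of NaN to 1.0 in particular is an assumption rather than the PR's confirmed behavior.

```csharp
using System;

public static class SamplingRates
{
    // Hypothetical normalization; the PR's NormalizeSamplingRate may map
    // these inputs differently.
    public static double Normalize(double? rate)
    {
        if (!rate.HasValue) return 1.0;       // null: always sample, as in the reviewed diff
        var value = rate.Value;
        if (double.IsNaN(value)) return 1.0;  // assumption: treat NaN like an unset rate
        if (value < 0.0) return 0.0;          // covers negatives and -Infinity
        if (value > 1.0) return 1.0;          // covers values above 1.0 and +Infinity
        return value;                         // already within [0, 1]
    }
}
```

Explicit comparisons are used instead of Math.Clamp so the sketch also compiles against older target frameworks where that method is unavailable.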