When you ship an AI-powered "find similar" feature backed by embeddings and a vector store, every change — a new model, a different chunking strategy, a tweaked threshold — raises the same question: did this make retrieval better, or just different? Without a measurable baseline, you end up tuning by vibes.
In this article, I will walk through how I built a small but proper evaluation harness in Spring Boot that measures retrieval quality using a curated set of golden pairs. It computes industry-standard metrics like Recall@K and Mean Reciprocal Rank (MRR), and exposes them via an admin-only endpoint so the team can re-run evaluations any time before promoting a change to production.
1. The Problem with Tuning by Vibes #
In a support portal, agents resolve incoming tickets faster when the system suggests similar tickets that were already resolved in the past. The "find similar" feature is implemented with three moving parts:
- a content builder that turns a ticket into a piece of text,
- an embedding model that turns that text into a vector,
- a vector store that returns the top-K nearest neighbours.
Every component in that pipeline is a knob. Change the embedding model? The vector dimensions, semantic neighbourhood, and similarity scores change. Tweak how the ticket is rendered into text? Different signals are encoded. Adjust the similarity threshold? The recall/precision trade-off shifts.
The only honest way to know whether a change is an improvement is to define what "good" looks like — with concrete examples — and measure against it.
2. Golden Pairs: The Source of Truth #
A golden pair is a labeled query: "for this input, the system should return these tickets". The labels are produced by domain experts (here, support engineers who know which past resolutions actually applied to a given new ticket).
I model a single pair as a Java record:
@JsonInclude(JsonInclude.Include.NON_NULL)
public record EvalGoldenPair(
String id,
String queryText,
Long queryTicketId,
List<Long> expectedSimilarTicketIds,
String note
) {
public EvalGoldenPair {
if ((queryText == null || queryText.isBlank()) && queryTicketId == null) {
throw new IllegalArgumentException(
"EvalGoldenPair " + id + " must define either queryText or queryTicketId");
}
if (expectedSimilarTicketIds == null || expectedSimilarTicketIds.isEmpty()) {
throw new IllegalArgumentException(
"EvalGoldenPair " + id + " must define at least one expectedSimilarTicketId");
}
}
}
A few decisions are worth calling out:
- queryText or queryTicketId. Sometimes the labeled query is a hand-written description simulating a new ticket. Other times it is an existing resolved ticket used as a query against the rest of the corpus — a leave-one-out style evaluation. The compact constructor enforces that at least one of the two is provided.
- expectedSimilarTicketIds is a list, not a single ID. Multiple past tickets may legitimately be relevant. The first element is the most relevant; the rest are also acceptable hits.
- note. A free-text field where the labeller explains why this is a golden pair. Future-you, debugging a regression, will be grateful.
The dataset wrapper is a simple list:
public record EvalDataset(
String name,
String description,
List<EvalGoldenPair> pairs
) {}
Wrapping the list in a record (instead of using a bare List<EvalGoldenPair>) lets future versions add fields like datasetVersion or tags without breaking existing files.
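Serialised to JSON, a minimal dataset might look like the sketch below. The ticket IDs, wording, and notes are purely illustrative; the real file goes under src/main/resources/ai-eval/, which is where the loader in the next section expects it.
{
  "name": "support-tickets-v1",
  "description": "Golden pairs curated by the support team",
  "pairs": [
    {
      "id": "gp-001",
      "queryText": "Login fails with a 500 error after the SSO certificate was rotated",
      "expectedSimilarTicketIds": [4211, 3987],
      "note": "Both past tickets were resolved by re-importing the IdP metadata"
    },
    {
      "id": "gp-002",
      "queryTicketId": 5120,
      "expectedSimilarTicketIds": [4770],
      "note": "Leave-one-out: ticket 5120 has the same root cause as 4770"
    }
  ]
}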
3. Loading the Dataset from the Classpath #
Datasets live in JSON on the classpath. I keep a golden-pairs.example.json checked into the repo so anyone cloning the project gets a runnable starter, while the real dataset (which can include sensitive customer data) lives at golden-pairs.json and is .gitignored.
@Slf4j
@Component
@RequiredArgsConstructor
public class EvalDatasetLoader {
static final String DEFAULT_DATASET_PATH = "classpath:ai-eval/golden-pairs.json";
static final String EXAMPLE_DATASET_PATH = "classpath:ai-eval/golden-pairs.example.json";
private final ResourceLoader resourceLoader;
private final ObjectMapper objectMapper;
public EvalDataset load() {
return load(DEFAULT_DATASET_PATH);
}
public EvalDataset load(String path) {
Resource resource = resourceLoader.getResource(path);
if (!resource.exists()) {
log.warn("Eval dataset not found at {} — falling back to example dataset at {}",
path, EXAMPLE_DATASET_PATH);
resource = resourceLoader.getResource(EXAMPLE_DATASET_PATH);
if (!resource.exists()) {
throw new IllegalStateException(
"No eval dataset found at " + path + " or " + EXAMPLE_DATASET_PATH);
}
}
try (InputStream in = resource.getInputStream()) {
EvalDataset dataset = objectMapper.readValue(in, EvalDataset.class);
List<EvalGoldenPair> pairs = dataset.pairs();
if (pairs == null || pairs.isEmpty()) {
throw new IllegalStateException("Eval dataset at " + path + " contains no pairs");
}
log.info("Loaded eval dataset '{}' with {} pairs from {}",
dataset.name(), pairs.size(), path);
return dataset;
} catch (IOException ex) {
throw new IllegalStateException("Failed to read eval dataset from " + path, ex);
}
}
}
The path is parameterised so the team can ship multiple datasets — one per product, customer, or quality tier — and pick which to run from the admin endpoint.
4. The Metrics: Recall@K and MRR #
Two metrics drive the evaluation:
- Recall@K — for each pair, did at least one of the expected IDs show up in the top-K retrieved results? Average over all pairs. I report Recall@1, Recall@3, and Recall@5.
- Mean Reciprocal Rank (MRR) — for each pair, take 1 / rank-of-first-relevant-result, where rank is 1-indexed; if none of the expected IDs appears in the top-K, contribute 0. Average over all pairs. MRR rewards the system for putting the right answer near the top, not just somewhere in the list.
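To make the two metrics concrete: with five golden pairs whose first relevant results land at ranks 1, 2, 1, 4, and "not found", Recall@3 is 3/5 = 0.6 (only the ranks 1, 2, and 1 count), Recall@5 is 4/5 = 0.8, and MRR is (1 + 0.5 + 1 + 0.25 + 0) / 5 = 0.55. The rank-4 hit helps Recall@5 but barely moves MRR, which is exactly the asymmetry you want when ranking quality matters.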
I also keep a sanity-check metric — average top-1 cosine similarity score — but it is not a quality signal: a confidently wrong answer can have a higher score than a less confident correct one. It is useful for spotting score collapse after a bad config change, nothing more.
5. Running the Evaluation #
The EmbeddingEvaluator queries the live retrieval pipeline once per pair, computes per-pair metrics, and aggregates them into a report:
@Slf4j
@Service
@RequiredArgsConstructor
public class EmbeddingEvaluator {
static final int TOP_K = 5;
private final TicketEmbeddingService embeddingService;
private final TicketRepository ticketRepository;
private final TicketContentBuilder contentBuilder;
public EvalReport evaluate(EvalDataset dataset) {
long started = System.currentTimeMillis();
List<EvalReport.PerPairResult> perPair = new ArrayList<>(dataset.pairs().size());
int hitsAt1 = 0, hitsAt3 = 0, hitsAt5 = 0;
double mrrSum = 0.0;
double topScoreSum = 0.0;
int topScoreCount = 0;
for (EvalGoldenPair pair : dataset.pairs()) {
EvalReport.PerPairResult result = evaluatePair(pair);
perPair.add(result);
if (result.hitAt1()) hitsAt1++;
if (result.hitAt3()) hitsAt3++;
if (result.hitAt5()) hitsAt5++;
if (result.firstRelevantRank() > 0) {
mrrSum += 1.0 / result.firstRelevantRank();
}
if (!result.retrievedScores().isEmpty()) {
topScoreSum += result.retrievedScores().getFirst();
topScoreCount++;
}
}
int total = dataset.pairs().size();
double averageTopScore = topScoreCount == 0 ? 0.0 : topScoreSum / topScoreCount;
long duration = System.currentTimeMillis() - started;
return new EvalReport(
total,
(double) hitsAt1 / total,
(double) hitsAt3 / total,
(double) hitsAt5 / total,
mrrSum / total,
averageTopScore,
duration,
perPair
);
}
private EvalReport.PerPairResult evaluatePair(EvalGoldenPair pair) {
String queryText = resolveQueryText(pair);
List<TicketSimilarityResult> hits = embeddingService.findSimilar(queryText, TOP_K, 0.0);
List<Long> retrievedIds = new ArrayList<>(hits.size());
List<Double> retrievedScores = new ArrayList<>(hits.size());
for (TicketSimilarityResult hit : hits) {
// Skip the query ticket itself — it would always rank #1 in a
// leave-one-out eval and would mask retrieval quality.
if (pair.queryTicketId() != null && pair.queryTicketId().equals(hit.ticketId())) {
continue;
}
retrievedIds.add(hit.ticketId());
retrievedScores.add(hit.score());
}
int firstRelevantRank = firstRelevantRank(retrievedIds, pair.expectedSimilarTicketIds());
return new EvalReport.PerPairResult(
pair.id(),
pair.expectedSimilarTicketIds(),
retrievedIds,
retrievedScores,
firstRelevantRank,
firstRelevantRank == 1,
firstRelevantRank > 0 && firstRelevantRank <= 3,
firstRelevantRank > 0 && firstRelevantRank <= 5
);
}
private static int firstRelevantRank(List<Long> retrieved, List<Long> expected) {
for (int rank = 1; rank <= retrieved.size(); rank++) {
if (expected.contains(retrieved.get(rank - 1))) {
return rank;
}
}
return 0;
}
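    // Not shown in the walkthrough: resolving the query text. This sketch is an assumption
    // about how such a helper could look. It mirrors the two routes allowed by EvalGoldenPair:
    // use the hand-written queryText when present, otherwise load the referenced ticket and
    // render it with the same content builder used at indexing time. The repository and
    // builder method names (findById, buildContent) are illustrative, not the actual API.
    private String resolveQueryText(EvalGoldenPair pair) {
        if (pair.queryText() != null && !pair.queryText().isBlank()) {
            return pair.queryText();
        }
        return ticketRepository.findById(pair.queryTicketId())
            .map(contentBuilder::buildContent)
            .orElseThrow(() -> new IllegalStateException(
                "EvalGoldenPair " + pair.id() + " references unknown ticket " + pair.queryTicketId()));
    }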
}
Three subtle decisions are worth pulling out:
- TOP_K is fixed at 5, not read from config. In production, top-K is configurable per deployment. But if the eval used the same setting, every config tweak would silently change the reported numbers — and you would not know whether the model improved or you simply retrieved more candidates. A fixed eval window keeps runs comparable.
- The minimum-similarity threshold passed to findSimilar is 0.0. Recall already encodes whether a result was good enough. Filtering by score would conflate two different questions: "is the right ticket findable?" vs. "is the score above a threshold?".
- Leave-one-out skip. When the query is itself an existing ticket, the vector store will gleefully return that ticket as the perfect top-1 match. Skipping it before computing rank prevents the eval from looking better than it really is.
6. The Report #
EvalReport is the aggregated picture, but each per-pair result is preserved so failures are debuggable:
public record EvalReport(
int totalPairs,
double recallAt1,
double recallAt3,
double recallAt5,
double meanReciprocalRank,
double averageTopScore,
long durationMillis,
List<PerPairResult> perPair
) {
public record PerPairResult(
String id,
List<Long> expectedSimilarTicketIds,
List<Long> retrievedTicketIds,
List<Double> retrievedScores,
int firstRelevantRank,
boolean hitAt1,
boolean hitAt3,
boolean hitAt5
) {}
}
The top-level numbers tell you whether the change is good. The per-pair list tells you which queries regressed — and that is what makes the next iteration possible. "Recall@5 dropped from 0.78 to 0.74" is a problem statement; "Recall@5 dropped from 0.78 to 0.74 and these three specific pairs flipped" is a debuggable lead.
7. Triggering Evaluation On-Demand #
Finally, an admin-only HTTP endpoint to re-run the evaluation any time:
@RestController
@RequestMapping(API_PREFIX + "/ai/admin/evaluation")
@RequiredArgsConstructor
public class EvaluationAdminController {
private final EvalDatasetLoader datasetLoader;
private final EmbeddingEvaluator evaluator;
@PostMapping("/run")
@PreAuthorize(Role.HAS_ADMIN)
public EvalReport run(@RequestParam(required = false) String datasetPath) {
EvalDataset dataset = datasetPath == null || datasetPath.isBlank()
? datasetLoader.load()
: datasetLoader.load(datasetPath);
return evaluator.evaluate(dataset);
}
}
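Assuming API_PREFIX resolves to /api and you have an admin token at hand, re-running the evaluation is a single request; the host and token below are placeholders:
curl -X POST "https://support.example.com/api/ai/admin/evaluation/run" \
  -H "Authorization: Bearer <admin-token>"
Adding ?datasetPath=classpath:ai-eval/golden-pairs-billing.json (a hypothetical per-product file) runs the same evaluation against a different dataset.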
The intended workflow is simple:
- Change a retrieval parameter, model, or prompt.
- Hit POST /api/ai/admin/evaluation/run.
- Compare the report to the previous baseline.
- Only promote the change if the numbers improved (or held steady on the metrics that matter).
The same evaluator can also be wired into a CI job that fails the build if Recall@5 drops below a floor — turning the harness into a regression gate before code ever reaches production.
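As a sketch of such a gate, assuming the loader and evaluator are available as Spring beans in the test context, that AssertJ is on the test classpath, and that 0.70 is the floor the team has agreed on, the guard can be a plain test. Note that it runs against whatever embedding model and vector store the test context wires up, so it belongs in a pipeline stage that can reach them.
import static org.assertj.core.api.Assertions.assertThat;

import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

@SpringBootTest
class RetrievalQualityGateTest {

    // Hypothetical floor agreed with the team; tune it per dataset.
    private static final double RECALL_AT_5_FLOOR = 0.70;

    @Autowired
    private EvalDatasetLoader datasetLoader;

    @Autowired
    private EmbeddingEvaluator evaluator;

    @Test
    void recallAt5DoesNotRegress() {
        EvalReport report = evaluator.evaluate(datasetLoader.load());
        assertThat(report.recallAt5())
            .as("Recall@5 must stay at or above the agreed floor")
            .isGreaterThanOrEqualTo(RECALL_AT_5_FLOOR);
    }
}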
Wrapping Up #
Building an AI feature without an evaluation harness is like refactoring without a test suite: every change feels like progress, but you have no way to prove it. The harness I walked through here is intentionally small — three records, two services, one controller, and a JSON file — yet it converts a fuzzy "is this better?" question into a number you can compare across runs.
The investment pays for itself the first time someone asks "are we sure the new embedding model is actually better?" and the answer is "Recall@5 went from 0.71 to 0.83, and here are the four pairs that flipped". From vibes to numbers.
Natural next steps, when you outgrow this minimal version, are: tracking metrics over time per dataset, adding precision-style metrics for cases where false positives hurt, or scoring the LLM’s answer instead of the retrieval (using LLM-as-judge). But you do not need any of that on day one. A small dataset and Recall@K are enough to start steering with data instead of intuition.
If you have any questions or want to discuss building one of these for your own retrieval pipeline, feel free to reach out.