Eval run: Agenda Intelligence MD, May 2026

Self-scored. Not a validated benchmark. This page reports a small (N=3) self-scoring run of Agenda Intelligence MD against its own published rubric and its own published before/after cases. The rubric, the texts, and the scores are all public; anyone can re-score independently. The run is reproducible; the conclusion is not authoritative.

Why this page exists

Case studies on this site describe what the skill does. They do not show, in numbers, how much the protocol changes the output. This page closes that gap with the smallest honest experiment I can run today: take the public rubric, score the public before/after cases, and publish the result.

The result is a useful directional signal, not a benchmark. The biggest caveats are that I authored both the protocol and the rubric, I selected the cases, and I did the scoring. Independent re-scoring is welcome and will likely shift some individual criterion scores by a point.

Setup

Subject under test: Agenda Intelligence MD, the Markdown protocol layer of the reasoning-skill portfolio.
Cases: the three before/after pairs already published in the repo.
Rubric: the 8-criterion 0–2 scoring rubric in examples/before-after/evaluation-rubric.md. Max score = 16.
- Criteria: signal classification, what changed, actor specificity, mechanism, uncertainty, falsifiability, watch-next indicators, decision value.
- 0 = missing or generic; 1 = partially present; 2 = specific and decision-useful.
- Bands: 0–5 mostly summary · 6–10 usable orientation · 11–16 decision-useful.
Scoring: one human pass (the author of the protocol and the rubric), no LLM judge in this run.
Date of scoring: 2026-05-09.
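The scoring arithmetic is simple enough to sketch in a few lines. The following is a minimal illustration, not code from the repo: the names `CRITERIA`, `total_score`, and `band` are hypothetical, and only the criteria list, the 0–2 scale, and the band boundaries come from the rubric summary above.

```python
from typing import Dict

# The eight criteria named in the rubric summary above.
CRITERIA = (
    "signal classification",
    "what changed",
    "actor specificity",
    "mechanism",
    "uncertainty",
    "falsifiability",
    "watch-next indicators",
    "decision value",
)
MAX_SCORE = 2 * len(CRITERIA)  # 8 criteria x 2 points = 16


def total_score(scores: Dict[str, int]) -> int:
    """Sum per-criterion scores after checking each criterion is scored 0, 1, or 2."""
    if set(scores) != set(CRITERIA):
        raise ValueError("scores must cover exactly the eight rubric criteria")
    for name, value in scores.items():
        if value not in (0, 1, 2):
            raise ValueError(f"{name}: score must be 0, 1, or 2, got {value!r}")
    return sum(scores.values())


def band(total: int) -> str:
    """Map a total to the rubric band: 0-5, 6-10, 11-16."""
    if not 0 <= total <= MAX_SCORE:
        raise ValueError(f"total must be between 0 and {MAX_SCORE}")
    if total <= 5:
        return "mostly summary"
    if total <= 10:
        return "usable orientation"
    return "decision-useful"


# Example: a text that scores 1 on mechanism and 0 everywhere else.
example = {name: 0 for name in CRITERIA}
example["mechanism"] = 1
print(total_score(example), band(total_score(example)))
```

Validating the inputs before summing is a deliberate choice here: a missing criterion or an out-of-range score would otherwise silently distort the total.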

Per-case results

Case 1 — Sanctions routing through Central Asia

Criterion	Before	After	Justification (Before → After)
Signal classification	0	2	“Importance of sanctions compliance” → “Weak signal → signal if supported by customs anomalies, designations, bank behavior, or named intermediaries.”
What changed	0	2	“Become more important trade routes” → identifies the relevant unit of change as banks / customs brokers / BO structures, not geography.
Actor specificity	0	2	“Companies”, “businesses”, “Western regulators” → segmented Western regulators, correspondent banks, local banks, customs authorities, smaller intermediaries; each tied to leverage.
Mechanism	0	2	“Could increase scrutiny” → enforcement-attention → tier-1 KYC → tier-2 de-risking → reputational contamination.
Uncertainty	0	2	None named → “Whether the reports reflect isolated evasion cases, statistical noise, or a repeatable routing system.”
Falsifiability	0	2	None → watch-next list (designations, customs data, named intermediaries) explicitly named as confirming/weakening signals.
Watch-next indicators	0	2	“Monitor regulatory developments” → 7 concrete indicators with what each would tell you.
Decision value	0	2	“Conduct due diligence” → “Treat as compliance-relevant signal unless enforcement action / customs data / named firms / regulator guidance escalate it” — clear decision frame.
Total / 16	0	16	Δ = +16

Case 2 — Red Sea shipping disruption

Criterion	Before	After	Justification (Before → After)
Signal classification	0	2	“Concerning” → “Escalation marker / possible trigger event if attacks are repeated, claimed credibly, or followed by shipping/insurance behavior.”
What changed	0	2	“Concerning” → “Risk shifts from geopolitical background noise to operational exposure when shipping lines, insurers, ports, or governments change behavior.”
Actor specificity	0	2	“Companies”, “businesses” → 5 segmented groups (importers/exporters Asia–Europe; logistics firms; insurers and trade finance; energy/commodity traders; firms with tight delivery windows).
Mechanism	1	2	Names “supply chains, shipping costs, delays” but only as surface labels → “freight cost, route timing, insurance premiums, force majeure risk, inventory planning, contractual delivery obligations, sanctions screening.”
Uncertainty	0	2	None → “Whether this remains episodic harassment or becomes a sustained route-risk regime.”
Falsifiability	0	2	None → three explicit scenarios with distinguishing indicators.
Watch-next indicators	0	2	“Monitor the situation” → “carrier route announcements, war-risk insurance rates, naval advisories, Suez traffic data, port delays, credible claims of responsibility, government force-protection changes.”
Decision value	0	2	“Consider alternative routes” → trigger conditions tied to scenarios; tells the reader when a single incident is and is not sufficient to act.
Total / 16	1	16	Δ = +15

Case 3 — EU AI Act implementation guidance

Criterion	Before	After	Justification (Before → After)
Signal classification	0	2	“Important regulatory framework” → “Signal / compliance-relevant development; not automatically a new legal obligation.”
What changed	0	2	“Moving forward with practical application” → “Moved from broad legal text toward implementation detail; companies can now compare their controls against regulator expectations rather than guessing from the statute.”
Actor specificity	1	2	“Providers and deployers … especially high-risk AI” — partially specific → 4 segmented groups including non-EU firms needing EU market access.
Mechanism	0	2	“May affect” → “Guidance can shape enforcement even when it is not itself the law” + the institutional-path layer (Commission guidance vs. agency guidance vs. harmonized standards vs. delegated acts vs. national authority interpretation).
Uncertainty	0	2	None → “Whether the guidance will become the practical enforcement baseline or remain a non-binding interpretation.”
Falsifiability	0	2	None → three scenarios with the watch-next set that would distinguish them.
Watch-next indicators	0	2	“Monitor developments” → “implementing/delegated acts, harmonized standards, regulator statements, enforcement deadlines, first national authority actions, and product/vendor behavior by large EU-facing AI firms.”
Decision value	0	2	“Review processes” (generic baseline) → “The risk is not only non-compliance, but building the wrong documentation, risk-management, or vendor-review process before enforcement starts.” Tells the reader where the act-now / wait-and-watch line is.
Total / 16	2	16	Δ = +14

Aggregate (N=3)

Statistic	Before	After	Delta
Mean total	1.0 / 16	16.0 / 16	+15.0
Median total	1 / 16	16 / 16	+15
Min / Max total	0 / 2	16 / 16	+14 / +16
Per-criterion mean (Before)	0.13 / 2	—	—
Per-criterion mean (After)	—	2.00 / 2	—

All three Before outputs land at the bottom of the rubric’s “mostly summary” band. All three After outputs land at the top of the “decision-useful” band.
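The aggregate rows can be recomputed from the three reported per-case totals. This is a small sketch using only Python's standard library. The variable names are illustrative, and the Case 1 Before total of 0 is taken from its reported After of 16 and Δ of +16.

```python
from statistics import mean, median

N_CRITERIA = 8

# Reported totals for Cases 1-3 (Case 1 Before = 16 - 16 = 0).
before_totals = [0, 1, 2]
after_totals = [16, 16, 16]
deltas = [after - before for before, after in zip(before_totals, after_totals)]

summary = {
    "mean_before": mean(before_totals),
    "mean_after": mean(after_totals),
    "mean_delta": mean(deltas),
    "median_before": median(before_totals),
    "median_delta": median(deltas),
    "min_max_before": (min(before_totals), max(before_totals)),
    "min_max_delta": (min(deltas), max(deltas)),
    # 0.125, which the table reports rounded to 0.13
    "per_criterion_mean_before": mean(before_totals) / N_CRITERIA,
    "per_criterion_mean_after": mean(after_totals) / N_CRITERIA,
}

for name, value in summary.items():
    print(f"{name}: {value}")
```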

Honesty caveats

The numbers above are real, but the design choices around them limit how much weight they carry.

Self-scored. The protocol author, the rubric author, and the scorer are the same person.
Self-selected cases. The cases were written to demonstrate the protocol’s value. They are not a random sample of agenda-monitoring questions.
Stylized Before outputs. The “Before” texts read like a typical generic agent answer, but they were authored, not produced by a contemporaneous LLM run. A real LLM might or might not produce something better.
Single rater, single pass. No second rater, no LLM judge cross-check.
Small N. Three cases. The 95% CI on the mean delta is wide; do not generalize beyond what is shown.
Rubric symmetry. The rubric was designed alongside the protocol. Some items (e.g., “watch-next indicators”) map closely to fields the protocol explicitly produces, which makes high After scores easier than they would be against an external rubric.
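To put a rough number on the small-N caveat, the interval can be sketched with a Student-t calculation on the three reported deltas (+16, +15, +14). This rests on an assumption the caveats above explicitly reject, namely that the cases are an independent random sample, so treat it as an illustration of scale only. The critical value 4.303 is the standard two-sided 95% t value for 2 degrees of freedom, hard-coded to avoid a third-party dependency.

```python
from math import sqrt
from statistics import mean, stdev

# Reported per-case deltas (After minus Before) for Cases 1-3.
deltas = [16, 15, 14]

# Two-sided 95% critical value of Student's t with n - 1 = 2 degrees of freedom.
T_CRIT_DF2 = 4.303

n = len(deltas)
mean_delta = mean(deltas)
std_error = stdev(deltas) / sqrt(n)  # sample standard deviation / sqrt(n)
half_width = T_CRIT_DF2 * std_error
ci_low, ci_high = mean_delta - half_width, mean_delta + half_width

print(f"mean delta = {mean_delta}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
# prints: mean delta = 15, 95% CI = (12.52, 17.48)
```

The interval spans about five points on a 16-point scale, and its upper bound exceeds the maximum possible delta of 16. That is a sign the approximation is strained at N=3, independent of the selection issues listed above.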

What would change my judgment

Independent re-scoring lands within ±1 per criterion. Strong corroboration. Publish independent scores alongside.
Independent re-scoring disagrees by more than 1 point per criterion on average. Rubric needs work or my scoring is biased; revise the rubric and re-run.
A real LLM produces a Before that scores >5 / 16. The current “Before” texts are too generic; rerun with live LLM output.
A larger run (N ≥ 10) collapses the mean delta below +8. The effect is overestimated; tighten claims about the protocol everywhere.
A larger run preserves Δ ≥ +12 across mixed (selected and adversarial) cases. Then the protocol-shaped output really does dominate the baseline on the chosen rubric, and a more public benchmark becomes worth building.

How to reproduce

Open the three before/after files and the rubric in agenda-intelligence-md/examples/before-after/.
Score each Before and After against the 8 criteria, 0–2.
Compare your totals and per-criterion scores against the table above.
Open an issue with deltas you disagree with, and where on the rubric the disagreement comes from.
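The comparison step can be done by hand, but a short script makes the disagreements explicit. The sketch below uses the published Case 2 scores from the table above. The re-scorer's numbers (`my_scores`) are hypothetical, invented only to exercise the function, and the helper names are illustrative rather than anything shipped in the repo.

```python
from typing import Dict, List, Tuple

Scores = Dict[str, Tuple[int, int]]  # criterion -> (before, after)

# Published Case 2 scores, copied from the per-case table.
PUBLISHED_CASE_2: Scores = {
    "signal classification": (0, 2),
    "what changed": (0, 2),
    "actor specificity": (0, 2),
    "mechanism": (1, 2),
    "uncertainty": (0, 2),
    "falsifiability": (0, 2),
    "watch-next indicators": (0, 2),
    "decision value": (0, 2),
}


def total_delta(scores: Scores) -> int:
    """Sum of (after - before) across all criteria."""
    return sum(after - before for before, after in scores.values())


def disagreements(published: Scores, rescored: Scores) -> List[Tuple[str, str, int, int]]:
    """List (criterion, side, published, rescored) wherever the two raters differ."""
    if set(published) != set(rescored):
        raise ValueError("both score sets must cover the same criteria")
    rows = []
    for criterion, (pub_before, pub_after) in published.items():
        re_before, re_after = rescored[criterion]
        if re_before != pub_before:
            rows.append((criterion, "before", pub_before, re_before))
        if re_after != pub_after:
            rows.append((criterion, "after", pub_after, re_after))
    return rows


# HYPOTHETICAL re-score: not real data, just an example of a dissenting rater.
my_scores: Scores = dict(PUBLISHED_CASE_2)
my_scores["falsifiability"] = (0, 1)
my_scores["decision value"] = (1, 2)

for row in disagreements(PUBLISHED_CASE_2, my_scores):
    print(row)
print("published delta:", total_delta(PUBLISHED_CASE_2))
print("re-scored delta:", total_delta(my_scores))
```

Listing each disagreement with its criterion and side (Before or After) gives the kind of detail the issue template asks for: which delta is disputed and where on the rubric the disagreement sits.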

Case Study: Agenda Intelligence MD
Agenda Intelligence MD on GitHub
The repo also ships a secondary 100-point quality rubric (evals/rubric.md), an LLM-judge prompt (evals/llm_judge_prompt.txt), and a human checklist (evals/human_checklist.md). A future run can use those instead of the simpler 8-criterion rubric used here.