Self-scored. Not a validated benchmark. This page reports a small (N=3) self-scoring run of Agenda Intelligence MD against its own published rubric and its own published before/after cases. The rubric, the texts, and the scores are all public; anyone can re-score independently. The run is reproducible; the conclusion is not authoritative.
Case studies on this site describe what the skill does. They do not show, in numbers, how much the protocol changes the output. This page closes that gap with the smallest honest experiment I can run today: take the public rubric, score the public before/after cases, and publish the result.
The result is a useful directional signal, not a benchmark. The biggest caveats are: I authored both the protocol and the rubric; I selected the cases; and I did the scoring. Independent re-scoring is welcome and will likely shift individual scores by at least one point per criterion.
examples/before-after/evaluation-rubric.md. Max score = 16 (8 criteria, each scored 0 to 2).
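As a minimal sketch of how a scorecard under this rubric adds up, the snippet below assumes each of the eight criteria takes an integer score of 0, 1, or 2, which is consistent with the maximum of 16. The criterion names are taken from the tables on this page; `total_score`, `CRITERIA`, and the dict layout are illustrative names, not part of the repository.

```python
# Criterion names as they appear in the score tables on this page.
CRITERIA = [
    "Signal classification",
    "What changed",
    "Actor specificity",
    "Mechanism",
    "Uncertainty",
    "Falsifiability",
    "Watch-next indicators",
    "Decision value",
]
MAX_PER_CRITERION = 2  # assumption: each criterion is scored 0, 1, or 2
MAX_TOTAL = len(CRITERIA) * MAX_PER_CRITERION  # 8 * 2 = 16


def total_score(scores: dict[str, int]) -> int:
    """Validate a scorecard (hypothetical layout) and return its total."""
    missing = set(CRITERIA) - set(scores)
    extra = set(scores) - set(CRITERIA)
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)}, unexpected={sorted(extra)}")
    for name, value in scores.items():
        if value not in range(MAX_PER_CRITERION + 1):
            raise ValueError(f"{name}: score {value!r} is outside 0..{MAX_PER_CRITERION}")
    return sum(scores.values())


# A scorecard with every criterion at 0 totals 0; every criterion at 2 totals 16.
all_zero = dict.fromkeys(CRITERIA, 0)
all_two = dict.fromkeys(CRITERIA, 2)
print(total_score(all_zero), total_score(all_two), MAX_TOTAL)
```

Anyone re-scoring the cases can fill in one such dict per text and compare totals directly against the tables that follow.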
Case 1

| Criterion | Before | After | Justification (Before → After) |
|---|---|---|---|
| Signal classification | 0 | 2 | “Importance of sanctions compliance” → “Weak signal → signal if supported by customs anomalies, designations, bank behavior, or named intermediaries.” |
| What changed | 0 | 2 | “Become more important trade routes” → identifies the relevant unit of change as banks / customs brokers / BO structures, not geography. |
| Actor specificity | 0 | 2 | “Companies”, “businesses”, “Western regulators” → segmented Western regulators, correspondent banks, local banks, customs authorities, smaller intermediaries; each tied to leverage. |
| Mechanism | 0 | 2 | “Could increase scrutiny” → enforcement-attention → tier-1 KYC → tier-2 de-risking → reputational contamination. |
| Uncertainty | 0 | 2 | None named → “Whether the reports reflect isolated evasion cases, statistical noise, or a repeatable routing system.” |
| Falsifiability | 0 | 2 | None → watch-next list (designations, customs data, named intermediaries) explicitly named as confirming/weakening signals. |
| Watch-next indicators | 0 | 2 | “Monitor regulatory developments” → 7 concrete indicators with what each would tell you. |
| Decision value | 0 | 2 | “Conduct due diligence” → “Treat as compliance-relevant signal unless enforcement action / customs data / named firms / regulator guidance escalate it” — clear decision frame. |
| Total / 16 | 0 | 16 | Δ = +16 |

Case 2

| Criterion | Before | After | Justification (Before → After) |
|---|---|---|---|
| Signal classification | 0 | 2 | “Concerning” → “Escalation marker / possible trigger event if attacks are repeated, claimed credibly, or followed by shipping/insurance behavior.” |
| What changed | 0 | 2 | “Concerning” → “Risk shifts from geopolitical background noise to operational exposure when shipping lines, insurers, ports, or governments change behavior.” |
| Actor specificity | 0 | 2 | “Companies”, “businesses” → 5 segmented groups (importers/exporters Asia–Europe; logistics firms; insurers and trade finance; energy/commodity traders; firms with tight delivery windows). |
| Mechanism | 1 | 2 | Names “supply chains, shipping costs, delays” but only as surface labels → “freight cost, route timing, insurance premiums, force majeure risk, inventory planning, contractual delivery obligations, sanctions screening.” |
| Uncertainty | 0 | 2 | None → “Whether this remains episodic harassment or becomes a sustained route-risk regime.” |
| Falsifiability | 0 | 2 | None → three explicit scenarios with distinguishing indicators. |
| Watch-next indicators | 0 | 2 | “Monitor the situation” → “carrier route announcements, war-risk insurance rates, naval advisories, Suez traffic data, port delays, credible claims of responsibility, government force-protection changes.” |
| Decision value | 0 | 2 | “Consider alternative routes” → trigger conditions tied to scenarios; tells the reader when a single incident is and is not sufficient to act. |
| Total / 16 | 1 | 16 | Δ = +15 |

Case 3

| Criterion | Before | After | Justification (Before → After) |
|---|---|---|---|
| Signal classification | 0 | 2 | “Important regulatory framework” → “Signal / compliance-relevant development; not automatically a new legal obligation.” |
| What changed | 0 | 2 | “Moving forward with practical application” → “Moved from broad legal text toward implementation detail; companies can now compare their controls against regulator expectations rather than guessing from the statute.” |
| Actor specificity | 1 | 2 | “Providers and deployers … especially high-risk AI” — partially specific → 4 segmented groups including non-EU firms needing EU market access. |
| Mechanism | 0 | 2 | “May affect” → “Guidance can shape enforcement even when it is not itself the law” + the institutional-path layer (Commission guidance vs. agency guidance vs. harmonized standards vs. delegated acts vs. national authority interpretation). |
| Uncertainty | 0 | 2 | None → “Whether the guidance will become the practical enforcement baseline or remain a non-binding interpretation.” |
| Falsifiability | 0 | 2 | None → three scenarios with the watch-next set that would distinguish them. |
| Watch-next indicators | 0 | 2 | “Monitor developments” → “implementing/delegated acts, harmonized standards, regulator statements, enforcement deadlines, first national authority actions, and product/vendor behavior by large EU-facing AI firms.” |
| Decision value | 1 | 2 | “Review processes” (generic baseline) → “The risk is not only non-compliance, but building the wrong documentation, risk-management, or vendor-review process before enforcement starts.” Tells the reader where the act-now / wait-and-watch line is. |
| Total / 16 | 2 | 16 | Δ = +14 |

Aggregate (N=3)

| Statistic | Before | After | Delta |
|---|---|---|---|
| Mean total | 1.0 / 16 | 16.0 / 16 | +15.0 |
| Median total | 1 / 16 | 16 / 16 | +15 |
| Min / Max total | 0 / 2 | 16 / 16 | +14 / +16 |
| Per-criterion mean (Before) | 0.13 / 2 | — | — |
| Per-criterion mean (After) | — | 2.00 / 2 | — |
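
The aggregate rows can be recomputed from the three per-case totals reported above (Before: 0, 1, 2; After: 16, 16, 16). The following is a minimal sketch using Python's standard `statistics` module; the function and variable names are illustrative, and the per-criterion mean is assumed to be the sum of totals divided by (cases × 8 criteria).

```python
from statistics import mean, median

N_CRITERIA = 8  # the rubric used on this page has 8 criteria

# Per-case totals as reported in the "Total / 16" rows.
before_totals = [0, 1, 2]
after_totals = [16, 16, 16]


def summarize(totals):
    """Return the summary statistics shown in the aggregate table."""
    return {
        "mean": mean(totals),
        "median": median(totals),
        "min": min(totals),
        "max": max(totals),
        # Assumption: average score per criterion across all cases.
        "per_criterion_mean": sum(totals) / (len(totals) * N_CRITERIA),
    }


before = summarize(before_totals)
after = summarize(after_totals)
# Per-case improvement, in the same case order as the totals.
deltas = [a - b for a, b in zip(after_totals, before_totals)]

print(before)
print(after)
print(deltas, mean(deltas))
```

Under these inputs the Before per-criterion mean comes out at 0.125, which the table reports rounded as 0.13. With only three cases, every statistic here moves noticeably if a single total changes by a point, which is worth keeping in mind when comparing against an independent re-score.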
All three Before outputs land at the bottom of the rubric’s “mostly summary” band. All three After outputs land at the top of the “decision-useful” band.
The numbers above are real, but the design choices around them limit how much weight they carry.
agenda-intelligence-md/examples/before-after/.evals/rubric.md), an LLM-judge prompt (evals/llm_judge_prompt.txt), and a human checklist (evals/human_checklist.md). A future run can use those instead of the simpler 8-criterion rubric used here.