Case Study: Gulf + Middle East Hybrid Intelligence Skill

TL;DR

Built Gulf + Middle East Hybrid Intelligence Skill — a vertical specialist skill for AI agents working on Iran sanctions, GCC financial and energy hubs, maritime chokepoint risk (Hormuz, Bab-el-Mandeb, Red Sea), sovereign wealth deployment, and regional geopolitical exposure.
It is a regional reasoning layer (a behavior contract), not legal advice, sanctions screening, vessel screening, AML transaction monitoring, a CLI, an MCP server, or a validation platform — those concerns live elsewhere in the portfolio.
Forces agents to reason mechanism-first, label every claim with two-axis provenance tags (source type: [primary] / [secondary] / [inference] / [analyst-judgment]; action flags: [verify] / [stale-risk: YYYY-MM]), trigger live source verification on currency-sensitive topics (sanctions, OPEC+, chokepoint events, JCPOA), and distinguish Iran-state / IRGC-affiliated / Iran-private commercial actors instead of collapsing them.
Positioned as the vertical specialist layer that composes on top of Global Think Tank Analyst and validates through Agenda Intelligence MD.

Evidence

Public GitHub repository: https://github.com/vassiliylakhonin/gulf-middle-east-hybrid-intelligence-skill
Universal agent contract: AGENTS.md (project rules: identity, honesty, evidence, naming, definition of done with two explicit bars).
Skill variants for different runtimes: skills/claude/SKILL.md, skills/codex/SKILL.md (OpenClaw variant deferred until an active use case appears).
Validator: scripts/validate.py — structural checks; does not validate factuality.
Status against the two-bar Definition of Done: STATUS.md.
Source guidance: docs/source-guide.md (regional source tier hierarchy, freshness horizons), docs/currency-watch.md (what to re-check now), taxonomy.json (machine-readable scope and topics).
Signal archive: signals/ (Red Sea, OPEC+, US-Iran diplomatic signals as public examples of the skill's output style).

Project state (self-reported)

Distribution: plain markdown skill files, attachable to any AI agent. No CLI, no runtime.
Implemented layers: skill variants for Claude / Codex, structural validator, two-bar Definition of Done, source-tier hierarchy, currency-watch list, signal archive, eight flagship examples across four canonical evidence modes.
Domain coverage: Iran sanctions and US/EU/UK secondary-sanctions exposure, GCC correspondent banking and trade-finance routing, sovereign wealth deployment (PIF, ADIA, Mubadala, QIA, KIA), oil and LNG markets and OPEC+ behavior, shipping and tanker tracking, dark-fleet patterns, Houthi / Red Sea attacks, Iranian proxy network exposure, Levant flows when material.
Bar 1 (early but credible) — cleared per STATUS.md. Bar 2 (agent-validated specialist resource) — cleared for agent integration as of 2026-05-21: three agent-eval delta cases committed across distinct Gulf sub-domains (Hormuz shipping insurer, delta +6; dark-fleet / sanctioned-oil for a refiner with mixed evidence-mode mapping through analyze, delta +6; GCC correspondent banking tiering for a Western respondent bank, delta +5.5). B2.1–B2.7 met; B2.8 (practitioner review) is an optional trust layer, not a hard Bar 2 gate for agent-first validation. Self-scored structural deltas; not factual verification, not model-quality comparison, not aggregate benchmark, not practitioner validation.
No production-usage, adoption, or benchmark numbers are claimed.

Context / Constraint

Generic LLMs produce broad, fluent commentary on the Gulf and the wider Middle East: country narration, hand-wavy "Iran tensions," vague chokepoint risk, no transmission mechanism, no actor incentives, no trigger points, no evidence boundaries, and a tendency to collapse Iran-state / IRGC-affiliated / Iran-private commercial actors into one undifferentiated actor.

That output is not decision-useful for sanctions compliance, energy trading, shipping insurance, Gulf banking, sovereign-wealth deal teams, or Iran-watcher analysts who actually have exposure to the region.

The skill needed to be small enough to attach to any capable agent and strict enough to actually change the shape of regional analysis — without becoming a screening tool, vessel-tracking product, or compliance platform.

Problem

Most AI-generated regional analysis on the Gulf and Middle East is fluent but decision-light. It rarely traces how a sanction designation, a chokepoint incident, or a sovereign-wealth deployment transmits into bank exposure, refining margins, charter rates, insurance premia, or counterparty contamination. It rarely separates verified facts from informed inference. It rarely names trigger points that would update the view. And it consistently fails the Iran-state / IRGC / Iran-private distinction that every serious sanctions compliance question depends on.

That is fine for background reading. It is weak for sanctions-exposure decisions, energy-trade structuring, shipping route posture, Gulf-bank counterparty review, sovereign-wealth co-investment screens, or any regional risk decision that has to be defensible.

Actions

Reframed the project as a vertical specialist Gulf + Middle East skill — explicitly not a generic strategic-memo tool and not infrastructure.
Wrote AGENTS.md as a canonical project-rules spec: identity, honesty, evidence, naming hierarchy, retrieved-content trust, currency-trigger rules, per-claim provenance tags, three-value response logic, safety/limitation rules, and a definition of done with two explicit bars.
Defined the regional analytical contract every output must respect: mechanism-first reasoning; per-claim two-axis provenance tags; currency trigger for sanctions / OPEC+ / chokepoint / JCPOA / vessel-specific claims; Iran-state / IRGC-affiliated / Iran-private commercial actor distinction enforced; role-based implications; trigger points and concrete watch-next indicators.
Authored skill variants for Claude and Codex runtimes from a single canonical contract; deferred OpenClaw until an active use case appears.
Added scripts/validate.py as a structural validator (required phrases, forbidden determinative claims, evidence-mode coverage). Made it explicit that this validator does not check factuality.
Wrote docs/source-guide.md with a regional source tier hierarchy and freshness horizons; docs/currency-watch.md for fast-moving topics; taxonomy.json for machine-readable scope.
Published eight flagship examples across all four canonical evidence modes covering Iran sanctions adjacency, Hormuz/Bab-el-Mandeb chokepoint risk, Gulf correspondent banking, sovereign-wealth deployment, dark-fleet exposure, and Iraq banking-sector reform.
Started a public signal archive in signals/ (Red Sea, OPEC+, US-Iran diplomatic signals).
Kept STATUS.md honest about which bar is and is not cleared, including the Anti-criteria that explicitly disallow adding more reasoning-only examples or self-applied scorecards as progress toward Bar 2.
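
The three structural checks named above (required phrases, forbidden determinative claims, evidence-mode coverage) can be sketched roughly as follows. This is a minimal illustration, not the contents of scripts/validate.py: the function name, the required phrases, and the forbidden patterns are all assumptions chosen for the example. Only the four evidence-mode names come from this page. Like the real validator, it checks shape only and says nothing about factuality.

```python
import re

# Hypothetical rule set for illustration; the real scripts/validate.py may differ.
REQUIRED_PHRASES = ["Not legal advice", "Trigger points"]
FORBIDDEN_PATTERNS = [
    r"\bis (?:fully )?compliant\b",   # determinative compliance conclusion
    r"\bsafe to onboard\b",           # determinative onboarding conclusion
]
# The four canonical evidence modes as named on this page.
EVIDENCE_MODES = [
    "live-source-backed",
    "user-provided sources",
    "illustrative source packet",
    "reasoning-only",
]


def validate_structure(text: str) -> list[str]:
    """Return structural problems found in text; an empty list means all checks passed."""
    problems = []
    lowered = text.lower()
    for phrase in REQUIRED_PHRASES:
        if phrase.lower() not in lowered:
            problems.append(f"missing required phrase: {phrase!r}")
    for pattern in FORBIDDEN_PATTERNS:
        if re.search(pattern, text, flags=re.IGNORECASE):
            problems.append(f"forbidden determinative claim: /{pattern}/")
    if not any(mode in lowered for mode in EVIDENCE_MODES):
        problems.append("no evidence mode declared")
    return problems


if __name__ == "__main__":
    sample = "Evidence mode: reasoning-only. Not legal advice. Trigger points: OFAC entry changes."
    print(validate_structure(sample))
```

A check of this kind can run in CI on every example file, which is enough to catch a memo that drops its limitations language or states a conclusion the contract forbids.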

What it does now

Frames a regional question as a concrete risk or strategy problem before producing analysis.
Forces mechanism-first reasoning across Iran sanctions, GCC banking, energy markets, maritime chokepoints, sovereign wealth, and proxy-network exposure.
Tags every factual claim on two provenance axes: source type ([primary] / [secondary] / [inference] / [analyst-judgment]) plus optional action flags ([verify] / [stale-risk: YYYY-MM]).
Triggers live source verification on currency-sensitive topics (sanctions designations, OPEC+ quotas, chokepoint events, JCPOA-track status, vessel-specific claims).
Maps risk-transmission channels (banking, payments, charter rates, insurance premia, refining margins, counterparty contamination) instead of country-level commentary.
Produces role-based implications (compliance leads, energy buyers, shipping insurers, Gulf banks, refiners, sovereign-wealth analysts).
Distinguishes Iran-state / IRGC-affiliated / Iran-private commercial actors instead of collapsing them.
Names trigger points and watch-next indicators specific to the region.
Travels across runtimes: Claude, Codex, and any LLM environment that accepts a markdown skill.
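
The two-axis tag scheme in the list above is regular enough to check mechanically. The sketch below is illustrative only and is not part of the repository: parse_tags and untagged_claims are hypothetical helper names, and it assumes tags appear inline in square brackets exactly as written on this page.

```python
import re

# Source-type and action-flag vocabulary as listed on this page.
SOURCE_TYPES = {"primary", "secondary", "inference", "analyst-judgment"}
TAG_RE = re.compile(r"\[([^\[\]]+)\]")
# [stale-risk: YYYY-MM] with a month between 01 and 12.
STALE_RE = re.compile(r"stale-risk: \d{4}-(0[1-9]|1[0-2])")


def parse_tags(claim: str) -> dict:
    """Split the bracketed tags in one claim into source types, action flags, and unknowns."""
    source_types, action_flags, unknown = [], [], []
    for raw in TAG_RE.findall(claim):
        tag = raw.strip()
        if tag in SOURCE_TYPES:
            source_types.append(tag)
        elif tag == "verify" or STALE_RE.fullmatch(tag):
            action_flags.append(tag)
        else:
            unknown.append(tag)
    return {"source_types": source_types, "action_flags": action_flags, "unknown": unknown}


def untagged_claims(claims: list[str]) -> list[str]:
    """Return the claims that carry no source-type tag at all."""
    return [c for c in claims if not parse_tags(c)["source_types"]]


if __name__ == "__main__":
    print(parse_tags("OPEC+ quota decision [secondary][verify][stale-risk: 2026-05]"))
```

Keeping source type and action flags in separate lists mirrors the two axes: a claim can be [primary] and still need [verify], because the first axis describes where the claim came from and the second describes what has to happen before it is relied on.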

Bar 2 closure — 2026-05-21

Added the remaining two agent-eval delta cases under evals/agent-eval/, closing B2.2 (≥3 cases) and B2.3 (evidence-mode mapping through analyze). STATUS.md and AGENTS.md "Honest current status" updated accordingly; Bar 2 cleared for agent integration. B2.8 practitioner review remains an optional, audience-gated trust layer, not a hard agent-first gate.

Dark-fleet / sanctioned-oil for a refiner (mixed evidence-mode mapping): an upstream live-source-backed regulatory framework (E.O. 13846 / 13902, OFAC Iran program page; retrieved 2026-05-12), composed with a user-provided-sources skeleton packet (eleven canonical OFAC / EU / UK / FATF-MENAFATF / IMO / P&I / IMB / UKMTO mandate-page URLs; accessibility-checked 2026-05-15), is mapped through Agenda Intelligence analyze as mixed, not live_source_backed. The case shows that the specialist evidence vocabulary composes with the product-shell schema without breaking it. Self-scored delta +6 (3/9 → 9/9).
GCC correspondent banking tiering for a Western respondent bank (reasoning_only): jurisdiction-times-counterparty matrix, UAE FATF remediation trajectory (Mar 2022 grey-listing entry / Feb 2024 removal), free-zone BO opacity as the highest-leverage AML sub-criterion, and Iran-actor distinctions embedded into underlying-client AML typology. Self-scored delta +5.5 (2.5/8 → 8/8).
B2.7 honesty discipline across the full case set: each writeup states structural-only limitations explicitly — one model, one prompt run; self-scored; not factual verification; not model-quality comparison; not aggregate benchmark; not compliance / sanctions / vessel screening; not practitioner validation.
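
The deltas quoted above are raw point differences between the two scores in each case. Below is a small worked sketch of that arithmetic for the two cases whose before and after scores appear on this page. The helper names are hypothetical, and the share-of-maximum column is an added illustration rather than a figure from the writeups; it is there only because the two cases are scored out of different maxima (9 and 8).

```python
def delta(before: float, after: float) -> float:
    """Raw point difference between the after score and the before score."""
    return after - before


# (before, after, maximum) as quoted in the two case summaries above.
CASES = {
    "dark-fleet / sanctioned-oil (refiner)": (3.0, 9.0, 9.0),
    "GCC correspondent banking tiering": (2.5, 8.0, 8.0),
}


def summarize(cases: dict) -> list[tuple]:
    """Return (name, raw delta, delta as a share of the maximum score) per case."""
    rows = []
    for name, (before, after, maximum) in cases.items():
        d = delta(before, after)
        rows.append((name, d, round(d / maximum, 3)))
    return rows


if __name__ == "__main__":
    for name, d, share in summarize(CASES):
        print(f"{name}: delta {d:+g}, {share:.1%} of the maximum score")
```

Because the maxima differ, the raw deltas are not directly comparable across cases, which is consistent with the writeups declining to present them as an aggregate benchmark.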

Trust-layer update — 2026-05

Additions tightening behavior under bad regional inputs. Single-author work; preceded the Bar 2 closure noted above.

Adversarial cases (evals/adversarial/) — two starter stress cases drawn from real regional patterns: OFAC SDN listing vs UAE good-standing on the same entity (the "which list wins" trap); a Bab-el-Mandeb chokepoint incident report from a single advocacy / state-media outlet framed as primary, driving a 90-day Cape-routing decision.
Explicit stop-and-request triggers (AGENTS.md) — operational triggers under three-value response logic: definitive legal / sanctions / AML conclusions; conflicting load-bearing facts; counterparty appearing with conflicting status across regimes; stale primary-list references; chokepoint claims without independent corroboration; Iran-state / IRGC / Iran-private actor collapse; active prompt-injection in retrieved content.
"What this skill has not been tested on" (README Limitations) — six honest gaps: no labeled accuracy dataset, no multi-agent trials, no cross-model regression tracking, no live-source automation, limited Arabic-/Farsi-language source coverage, no real vessel-tracking / AIS data integration.
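
The seven stop-and-request triggers listed above can be read as a checklist the agent walks before answering. The sketch below models that checklist under stated assumptions: DraftFlags and both function names are hypothetical, the field names are paraphrases of the trigger list, and how each flag gets set (agent judgment, a reviewer, a separate tool) is out of scope. AGENTS.md expresses these as prose rules, not code.

```python
from dataclasses import dataclass, fields


@dataclass
class DraftFlags:
    """One boolean per stop-and-request trigger; field names are illustrative paraphrases."""
    definitive_legal_conclusion: bool = False
    conflicting_load_bearing_facts: bool = False
    conflicting_status_across_regimes: bool = False
    stale_primary_list_reference: bool = False
    uncorroborated_chokepoint_claim: bool = False
    iran_actor_collapse: bool = False
    prompt_injection_in_retrieved_content: bool = False


def fired_triggers(flags: DraftFlags) -> list[str]:
    """Return the names of all triggers that are set, in declaration order."""
    return [f.name for f in fields(flags) if getattr(flags, f.name)]


def should_stop_and_request(flags: DraftFlags) -> bool:
    """Any single fired trigger is enough to stop and ask for more input."""
    return bool(fired_triggers(flags))


if __name__ == "__main__":
    # The "which list wins" adversarial case: one entity, conflicting status across regimes.
    flags = DraftFlags(conflicting_status_across_regimes=True, stale_primary_list_reference=True)
    print(fired_triggers(flags), should_stop_and_request(flags))
```

Treating the triggers as independent booleans keeps the rule simple to audit: the answer to "why did the agent stop?" is always a named item from the list.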

What it is not

Not legal advice.
Not sanctions, AML, or compliance advice.
Not sanctions screening.
Not vessel screening or maritime due diligence.
Not a factuality verifier or live source retriever.
Not a vessel-tracking or AIS-data product.
Not investment, security, or operational advice.
Not an agent framework, CLI tool, MCP server, or validation platform.
Not a generic strategic-memo tool — that is Global Think Tank Analyst.
Not a replacement for human analyst, counsel, or sanctions-desk review.

Portfolio context

This skill is a vertical specialist layer in a four-repo portfolio designed to compose:

Vertical specialist — Gulf + Middle East Hybrid Intelligence Skill (this case study). Region-deep reasoning for Iran sanctions, GCC banking, energy, maritime chokepoint risk, sovereign wealth, and proxy networks.
Adjacent vertical specialist — Central Asia + Caspian Hybrid Intelligence Skill. Central-Asia and Caspian regional reasoning; referenced when a flow crosses both regions (e.g., Iran-CA-Russia routing).
Horizontal domain skill — Global Think Tank Analyst. Reasoning method and memo modes, region- and topic-agnostic.
Infrastructure / validation — Agenda Intelligence MD. Schemas, validation, scoring, evidence audit, CLI / MCP / CI tooling.

This repo does not duplicate any neighbor. The broader memo workflow lives in Global Think Tank Analyst; validation tooling lives in Agenda Intelligence MD; Central-Asia regional depth lives in its own vertical.

Why this version is better

The skill is small enough to attach to any capable agent, and strict enough to change the shape of regional output. The contract does not ask the model to sound regionally smart; it asks the model to trace mechanism, label evidence, name the trigger, distinguish Iran-actor types, and say which role each implication is for.

That is the part most generic Gulf / Middle East commentary misses.

Before / after (illustrative)

Excerpt condensed for this page. Full memos, with the complete transmission mechanism, exposure map, leverage shifts, and triggers, live in examples/. Evidence modes are explicitly labeled per example: live-source-backed, user-provided sources, illustrative source packet, or reasoning-only.

User question: "We are a European bank's sanctions desk reviewing onboarding of a UAE-licensed trading counterparty that appears on a US OFAC SDN-related listing under a non-Iran programme but is in good standing with its UAE regulator. We clear USD through a US correspondent. Should we onboard?"

Before — generic regional commentary:

The UAE is a major financial hub with strong regulatory standards. OFAC sanctions present additional complexity for non-US banks. Banks should carefully balance their compliance obligations across jurisdictions and consult their sanctions desks. Maintaining strong correspondent relationships is essential.

That is fluent regional commentary. It does not say which list is operative for this bank's exposure, what the actual transmission mechanism is, or what would update the view.

After — with the Gulf + Middle East skill attached:

Bottom line ([analyst-judgment]): The "OFAC vs UAE" framing is the wrong question. Both statuses are simultaneously true within their own regimes. The operative question for this bank is its exposure surface, not list adjudication.
Risk-transmission mechanism: SDN listing + USD clearing through a US correspondent ⇒ blocking effect applies operationally to that USD touchpoint regardless of UAE status and regardless of programme (Iran or otherwise). Blocking under SDN is binary for US-person and USD touchpoints; "non-Iran programme" does not weaken it. UAE good-standing is locally relevant for UAE licensing; it is not exoneration for the European bank's foreign-regime exposure.
Required documentary basis: full OFAC SDN entry retrieved from the current list (entity names, programme code, designation date, applicable GLs); EU autonomous-sanctions check; US correspondent's policy and prior communication; entity's UAE licence scope; full BO and control map to test reach of the designation; internal sanctions-desk and legal review.
Evidence labels visible throughout: OFAC SDN entry status [verify] (must be retrieved at decision time); EU/national sanctions implications [secondary][verify]; UAE good-standing [user-provided]; blocking-effect logic [analyst-judgment].
Leverage shifts: OFAC and the US correspondent gain leverage on the USD touchpoint; the UAE regulator's good-standing has limited cross-border weight; the European bank carries the exposure across all touched regimes and cannot offload it onto local good standing.
Role-based implications: different next steps for the bank's sanctions desk, the bank's USD correspondent relationship manager, the relationship banker on the UAE counterparty, and the bank's legal function — not one generic "consult compliance."
Trigger points (watch-next): changes to the OFAC entry (programme, GL coverage, removal); EU autonomous-sanctions action on the same entity; correspondent's explicit policy statement; UAE regulator's public position on the foreign listing; ownership changes that move the designation reach.

The skill itself does not perform sanctions screening, retrieve sources, or verify facts. It forces the agent to apply the currency trigger (mandatory live OFAC lookup), refuse the "which list wins" framing, distinguish actor regimes, and produce role-specific implications.
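
The currency trigger described above amounts to a rule that claims touching fast-moving topics must carry a [verify] flag. A minimal sketch of that rule follows, assuming a simple keyword match. The keyword tuple and the function name are hypothetical and far cruder than the prose rules in the skill files; the point is only to show the shape of the check.

```python
# Hypothetical keyword list for currency-sensitive topics; an assumption for illustration only.
CURRENCY_KEYWORDS = (
    "ofac", "sdn", "sanction", "opec+", "hormuz",
    "bab-el-mandeb", "red sea", "jcpoa", "vessel",
)


def missing_verify(claims: list[str]) -> list[str]:
    """Return currency-sensitive claims that do not carry a [verify] action flag."""
    flagged = []
    for claim in claims:
        text = claim.lower()
        is_sensitive = any(keyword in text for keyword in CURRENCY_KEYWORDS)
        if is_sensitive and "[verify]" not in text:
            flagged.append(claim)
    return flagged


if __name__ == "__main__":
    draft = [
        "OFAC SDN entry status [primary]",
        "OFAC SDN entry status [primary][verify]",
        "UAE licence scope [secondary]",
    ]
    print(missing_verify(draft))
```

In the example above, the first claim would be flagged because it references the SDN list without [verify], matching the requirement that the OFAC entry be retrieved at decision time.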

Tech stack

Plain markdown skill files (AGENTS.md, skills/claude/SKILL.md, skills/codex/SKILL.md, STATUS.md).
Lightweight Python validator (scripts/validate.py).
GitHub repository with CI running the validator.
Markdown documentation for source guide, currency-watch, and the public signal archive.
Machine-readable scope: taxonomy.json.

Relevance

This project demonstrates how I think about useful agent infrastructure for high-stakes regional reasoning, in a domain where actor distinctions, source-tier discipline, and currency-sensitivity are the binding constraints. The approach is small reusable layers, mechanism-first contracts, honest evidence discipline, and outputs aimed at sanctions, banking, energy, shipping, and sovereign-wealth decisions, composed cleanly with a horizontal skill and a separate infrastructure layer instead of bundling everything into one repo.

Project links

GitHub repository: https://github.com/vassiliylakhonin/gulf-middle-east-hybrid-intelligence-skill
Companion horizontal skill: https://github.com/vassiliylakhonin/global-think-tank-analyst
Companion vertical specialist: https://github.com/vassiliylakhonin/central-asia-caspian-hybrid-intelligence-skill
Companion infrastructure: https://github.com/vassiliylakhonin/agenda-intelligence-md

Author: Vassiliy Lakhonin