cortex · recall@10 · 97.8% LongMemEval cortex.tests · 2,500 passing cortex.citations · 41 peer-reviewed cortex.tools · 47 MCP · 9 hooks cortex.beam · +91% vs LIGHT baseline 2 papers · seeking arxiv endorser · cs.IR · 3rd in progress agentic-ai · one marketplace · 4 plugins (memory, reasoning, codebase, prd) zetetic · 97 reasoning patterns + 19 specialists pipeline · 23 MCP · 220 tests · 10 stages prd-spec · 11 MCP · 9 steps · multi-judge llm-judges-llm · 0 open.source · MIT
Zetetic AI · Verification-first · Open source

We don't guess.
We verify.

AI Architect builds agents that prove what they claim: 97.8% Recall@10 on LongMemEval, 5 fused retrieval signals, zero LLM-judges-LLM. Every output is traceable and every claim is checked by deterministic algorithms, not by another model's opinion.

Production · cortex · zetetic · pipeline · prd-spec
The zetetic standard

Four principles. No exceptions.

Zetetic comes from zētēsis — Greek for inquiry. Truth is something you investigate, not something you assume. Everything we ship enforces this.
i. PROVENANCE

Every claim has a source.

If an agent states a fact, that fact ties back to a memory, a file, a commit, or a citation. No assertion lives without a trail.

ii. ALGORITHM > OPINION

Zero LLM-judges-LLM.

Verification is deterministic: graph analysis, semantic checks, atomic claim decomposition. We don't ask one model whether another model is right.

iii. MEMORY THAT LEARNS

Compounding context.

Cortex applies neuroscience — spreading activation, dream cycles, microglial pruning — so agents remember what worked, not just what happened.

iv. AUDITABLE

Built for regulated work.

Every PRD, PR, decision and reasoning step is logged and reviewable. Designed against the same bar as financial-systems software.

Verification, in the open

Watch an agent prove its work.

This is the actual verification report from a generated PRD: 64 atomic claims, each decomposed and checked by six independent algorithms. The full audit trail lives next to the deliverable, not buried in a log file.

Two ways to start

Hire us to build it. Or build it yourself with our tools.

Every component is open source, MIT-licensed, and shipping in production. The choice is whether you want the system handed to you — or the keys to do it yourself.

For teams & founders

We build the
agent with you.

For operators who know AI should help — but don't want to spend six months stitching tutorials together. We design and ship the agent against your real infrastructure. Same engineering bar as the financial systems we build by day.

  • Discovery — find the one workflow worth automating
  • Built against your CRM, data, internal tools — not a sandbox
  • Verification baked in: every action is auditable
  • Hand-over with documentation & 30 days of post-launch support
Book a discovery call
For developers

Grab the templates.
Ship faster.

If you build with AI yourself, use the same components we use in production. Cortex, Zetetic Agents, Automatised Pipeline, and PRD Spec Generator — fully documented, MIT-licensed, no telemetry, no lock-in.

  • Cortex: persistent memory with neuroscience-backed retrieval
  • Zetetic Agents: 97 reasoning patterns + 19 specialists (116 agents), one epistemic standard
  • Automatised Pipeline: read-only codebase intelligence (Rust MCP)
  • PRD Spec Generator: 11 MCP tools, 9-step pipeline + 2-phase multi-judge verification, specialized panels by claim type
Explore on GitHub
How we work

From "we should try AI" to a system you can audit. Four stages.

Every engagement runs the same protocol, the one our open-source pipeline runs internally. You see working software early, you see verification at every step, and you own everything when we're done.

01 · WEEK 1

Discovery & framing

We map the workflow where an agent will actually move the needle. No slide decks, no AI theater. Output: a one-page spec with success criteria you can measure against.

02 · WEEKS 2–3

Build with verification

Agent designed against your real systems. Every component ships with tests, provenance and an audit log. You watch it work in /cortex-visualize as we go.

03 · WEEK 4

Hand-over

Deployed to your infra. Runbooks, dashboards, and a verification report on every shipped feature. Your team is trained on how to extend it — not dependent on us forever.

04 · ONGOING

Compounding

Cortex memory means the agent gets smarter every week without retraining. We stay on call for the first 30 days; after that, you own a system you can audit, evolve, and keep running.

Clement Deust
founder
Day job: Senior eng · fintech
By night: Open-source AI research
Discipline: Verification-first
Based: Remote · global
About

I ship critical systems by day. I research how agents should think by night.

By day I build software in financial infrastructure, where "mostly works" never ships. Every system has to be tested, verified, auditable. Or it doesn't go live.

By night I apply that same bar to AI. I started AI Architect because I kept seeing the same anti-pattern: teams treating agents like demos, stacking prompts on prompts, asking another LLM whether the first one got it right, and wondering why nothing held up in production.

The work here is zetetic — every claim is investigated, never assumed. The tools are open source because the frontier should be shared. The consulting exists because some teams need the system built with them, not handed a repo and a prayer.

"An agent without memory isn't intelligent. An agent without verification isn't trustworthy. I'm only interested in building both."

2 papers in review · seeking arXiv endorsement · cs.IR

The papers behind the numbers.
Read them. Help us publish.

arXiv requires an endorser in cs.IR for first-time authors. Two preprints are ready, and a third on HALO retrieval is in progress. The two drafts below are the work behind the LongMemEval, LoCoMo, and BEAM results on this page. If you've published in cs.IR and find them useful, a single endorsement per paper puts it on the open scientific record.
preprint #1 · cs.IR · May 2026

Stage-Aware Context Assembly for Long-Context Memory Retrieval

Clement Deust · Independent Researcher
+33.4% MRR · vs BEAM oracle
0.471 MRR · BEAM-10M
8 / 10 memory abilities improved

A structured context-assembly architecture that recovers the geometric degradation of dense vector retrieval at the 10M-token scale. Two primitives: a priority-budgeted prompt decomposer with domain-aware condensers, and a two-phase stage-aware assembler with submodular coverage selection, Personalized PageRank entity-graph traversal, and schema-structured summary fallback.

On BEAM-10M, the assembler reaches 0.471 MRR — +33.4% over the flat baseline, with 8 of 10 memory abilities improving. Originally designed in September 2025 for Apple Intelligence's 4,096-token window — one month before the BEAM benchmark itself was published.
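
To make "submodular coverage selection" concrete, here is a minimal sketch of greedy selection under a token budget. It is an illustration, not the paper's implementation: the function name `select_by_coverage`, the facet sets, and the gain-per-token rule are assumptions, since this page does not state the objective the assembler optimizes.

```python
from typing import Dict, List, Set, Tuple


def select_by_coverage(
    candidates: Dict[str, Tuple[Set[str], int]], budget: int
) -> List[str]:
    """Greedily pick passages that add the most uncovered facets per token.

    candidates maps a passage id to (facets it covers, token cost).
    Set coverage is monotone submodular, so each pick has diminishing returns.
    """
    covered: Set[str] = set()
    chosen: List[str] = []
    remaining = budget
    pool = dict(candidates)
    while pool:
        best_id, best_gain = None, 0.0
        for cid, (facets, cost) in pool.items():
            if cost <= 0 or cost > remaining:
                continue  # skip passages that do not fit the remaining budget
            gain = len(facets - covered) / cost  # new facets per token
            if gain > best_gain:
                best_id, best_gain = cid, gain
        if best_id is None:
            break  # nothing affordable adds new coverage
        facets, cost = pool.pop(best_id)
        covered |= facets
        remaining -= cost
        chosen.append(best_id)
    return chosen


# Hypothetical passages: ids, facets and token costs are made up for the example.
example = {
    "auth_notes": ({"login", "token", "session"}, 100),
    "token_faq": ({"token"}, 50),
    "billing_log": ({"invoice"}, 50),
}
print(select_by_coverage(example, budget=150))
```

In this example `token_faq` is never chosen: once `auth_notes` is in the context it adds no new facet, which is the redundancy a coverage objective is meant to filter out.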

Looking for an endorser
arXiv's policy requires an existing cs.IR author to endorse first-time submitters. If you've published in cs.IR and you find this useful, a single endorsement is enough; the rest is automated. Reach out below or open an issue on the repo.
preprint #2 · cs.IR · May 2026

Thermodynamic Memory vs. Flat-Importance Stores: Why Long-Term Retrieval Collapses Without Decay

Clement Deust · Independent Researcher
98.4% R@10 · LongMemEval
94.2% R@10 · LoCoMo
0.591 Overall · BEAM

External memory for LLMs is dominated by flat-importance stores — vector indexes, BM25 corpora, long-context buffers — in which every item carries the same long-term retrieval prior. This design is asymptotically broken: as the corpus grows, top-k retrieval degenerates into near-arbitrary tie-breaking.

This paper formalises the collapse and describes Cortex, a memory architecture that maintains a non-flat priority distribution across all N stored items by coupling four mechanisms: continuously decaying heat (Ebbinghaus), a hierarchical predictive-coding write gate (Friston), consolidation cascades (Kandel, McClelland), and WRRF fusion with heat as tie-breaker. Result: LongMemEval R@10 = 98.4% (vs 78.4% paper-best), LoCoMo R@10 = 94.2%, BEAM Overall = 0.591 (vs 0.329 paper-best).
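
As a rough illustration of two of these mechanisms, the sketch below pairs an exponentially decaying heat value with rank fusion that uses heat to break ties. It assumes WRRF means weighted reciprocal rank fusion, which this page does not spell out. The function names, the 72-hour half-life and the signal weights are hypothetical; `k = 60` is the constant commonly used with reciprocal rank fusion, not a value taken from Cortex.

```python
import math
from typing import Dict, List, Sequence


def heat(initial: float, hours_since_access: float, half_life_hours: float = 72.0) -> float:
    """Exponential forgetting curve: heat halves every half_life_hours."""
    return initial * math.exp(-math.log(2) * hours_since_access / half_life_hours)


def wrrf(
    rankings: Dict[str, Sequence[str]], weights: Dict[str, float], k: int = 60
) -> Dict[str, float]:
    """Weighted reciprocal rank fusion: sum of weight / (k + rank) per signal."""
    scores: Dict[str, float] = {}
    for signal, ranked in rankings.items():
        for rank, item in enumerate(ranked, start=1):
            scores[item] = scores.get(item, 0.0) + weights[signal] / (k + rank)
    return scores


def retrieve(
    rankings: Dict[str, Sequence[str]],
    weights: Dict[str, float],
    heat_by_item: Dict[str, float],
    top_k: int = 10,
) -> List[str]:
    """Order by fused score; when scores tie, the hotter memory wins."""
    fused = wrrf(rankings, weights)
    ordered = sorted(
        fused, key=lambda item: (-fused[item], -heat_by_item.get(item, 0.0))
    )
    return ordered[:top_k]


# Two memories with mirrored ranks get identical fused scores,
# so only heat separates them.
rankings = {"dense": ["m1", "m2"], "bm25": ["m2", "m1"]}
weights = {"dense": 1.0, "bm25": 1.0}
heat_by_item = {"m1": heat(1.0, hours_since_access=200.0), "m2": heat(1.0, hours_since_access=5.0)}
print(retrieve(rankings, weights, heat_by_item))
```

Without the heat term the order of `m1` and `m2` would depend only on insertion order, which is the arbitrary tie-breaking the abstract describes.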

Looking for an endorser
Same endorsement ask as paper #1; both submissions are independent. If you can endorse only one, this one establishes the underlying principle the other builds on.

Forthcoming — preprint #3: HALO retrieval. Drafting in progress; will join the same endorsement queue once complete.

Standing on shoulders

The science behind the system.

Cortex draws from 41 peer-reviewed papers across neuroscience, memory research and AI evaluation. A few of the load-bearing ones:

Wegner, 1987 · Transactive Memory: A Contemporary Analysis of the Group Mind
Hebb, 1949 · The Organization of Behavior (synaptic plasticity foundations)
McGaugh, 2000 · Memory consolidation and the dream cycle
Bi & Poo, 2001 · STDP (Spike-Timing-Dependent Plasticity)
LongMemEval, ICLR 2025 · Benchmark for chat assistants on sustained memory
LoCoMo, 2024 · Long-Conversation Memory benchmark
Schaffer et al., 2018 · Microglial pruning & memory selectivity
Tononi & Cirelli, 2014 · Synaptic Homeostasis Hypothesis (sleep)
+ 33 more · Full bibliography in the Cortex repo
Common questions

Before you
book a call.

I'm not technical — can I still work with you?
Yes. Most non-tech clients bring a business problem and access to their systems; we bring the engineering. Every check-in uses plain language and working demos, not jargon. The whole point of the verification standard is so you can trust what's shipping without needing to read the code.
What does "no LLM-judges-LLM" actually mean?
Most "AI verification" is one model asking another model whether the first one is right. That's not verification — it's polling. We use deterministic algorithms instead: graph analysis to detect contradictions, atomic claim decomposition, semantic alignment scoring against a fixed corpus, and consensus across independent checks. Math, not vibes.
What does a project typically cost?
A typical first engagement is 4–6 weeks, scoped to a single high-value workflow. Pricing depends on integrations and scope — you'll get a concrete number within 48 hours of the discovery call. Not a vague range, not a "starting at."
Where does my data live?
In your infrastructure — AWS, GCP, on-prem, your call. Cortex is local-first by design (PostgreSQL + pgvector, no GPU). Your data never passes through a server we own. For regulated industries, the deployment plugs into your existing security model.
What if I just want to try the open-source tools first?
Please do. Everything is on GitHub, MIT-licensed, and documented. Open an issue if you get stuck — every one of them gets read. The consulting is for teams who'd rather have it implemented with them than figure it out from the README.
What is the best persistent memory plugin for Claude Code?
Cortex is a biologically-inspired persistent memory MCP server for Claude Code. It scores 97.8% Recall@10 / 0.882 MRR on LongMemEval (ICLR 2025), 92.6% Recall@10 on LoCoMo (ACL 2024), and +91% MRR vs LIGHT baseline on BEAM. 47 MCP tools, 9 lifecycle hooks, 20 biological mechanisms (predictive coding, LTP/LTD, microglial pruning, neuromodulation, CLS consolidation), 41 peer-reviewed citations. Runs on PostgreSQL + pgvector, no GPU. Install via the agentic-ai monorepo: run /plugin marketplace add cdeust/agentic-ai, then /plugin install memory@agentic-ai.
How do I verify AI-generated PRDs without LLM-as-judge?
PRD Spec Generator uses six independent deterministic algorithms instead of LLM-as-judge polling: multi-judge consensus across specialized panels (Architecture: Liskov / Alexander / Dijkstra; Performance: Fermi / Carnot / Curie / Erlang; Security: Wu / Ibn al-Haytham; Data model: Mendeleev / DBA / Lavoisier; Acceptance: Toulmin / Popper), atomic claim decomposition, zero-LLM graph analysis with Tarjan SCC for cycles, multi-agent debate, adaptive early stopping, and semantic alignment scoring against a reference corpus. The distribution_suspicious flag catches confirmatory bias. NFR claims never receive PASS — only SPEC-COMPLETE or NEEDS-RUNTIME.
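
For readers who want to see what a zero-LLM cycle check looks like, here is a minimal sketch of Tarjan's strongly connected components algorithm applied to a dependency graph. The graph shape (requirement ids pointing at the requirements they depend on) and the names `tarjan_scc` and `find_cycles` are assumptions for illustration; the page does not describe the data model PRD Spec Generator uses.

```python
from typing import Dict, List, Set


def tarjan_scc(graph: Dict[str, List[str]]) -> List[List[str]]:
    """Return the strongly connected components of a directed graph."""
    index_of: Dict[str, int] = {}
    lowlink: Dict[str, int] = {}
    stack: List[str] = []
    on_stack: Set[str] = set()
    components: List[List[str]] = []
    counter = 0

    def visit(node: str) -> None:
        nonlocal counter
        index_of[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for successor in graph.get(node, []):
            if successor not in index_of:
                visit(successor)
                lowlink[node] = min(lowlink[node], lowlink[successor])
            elif successor in on_stack:
                lowlink[node] = min(lowlink[node], index_of[successor])
        if lowlink[node] == index_of[node]:
            # node is the root of a component: pop its members off the stack
            component: List[str] = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            components.append(component)

    for node in graph:
        if node not in index_of:
            visit(node)
    return components


def find_cycles(graph: Dict[str, List[str]]) -> List[List[str]]:
    """A component is cyclic if it has several nodes or a self-loop."""
    cycles = []
    for component in tarjan_scc(graph):
        first = component[0]
        if len(component) > 1 or first in graph.get(first, []):
            cycles.append(sorted(component))
    return cycles


# Hypothetical requirement ids; REQ-1 -> REQ-2 -> REQ-3 -> REQ-1 is circular.
deps = {
    "REQ-1": ["REQ-2"],
    "REQ-2": ["REQ-3"],
    "REQ-3": ["REQ-1"],
    "REQ-4": ["REQ-1"],
}
print(find_cycles(deps))
```

The result is the same on every run for the same graph, which is the property that separates this kind of check from asking a model for its opinion. This recursive version is fine for small graphs; very deep graphs would need an iterative variant to stay within Python's recursion limit.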
What is zetetic AI?
Zetetic comes from the Greek zētēsis meaning inquiry. Zetetic AI is verification-first AI: every claim has a source (provenance), verification is deterministic not LLM-judges-LLM (algorithm > opinion), memory learns through neuroscience-backed mechanisms, and every PRD/PR/decision is auditable. AI Architect implements this standard across four open-source Claude Code plugins: Cortex memory, zetetic-team-subagents (97 reasoning patterns + 19 specialists), automatised-pipeline (Rust codebase intelligence), and prd-spec-generator.
How does Cortex's LongMemEval recall@10 of 97.8% compare to baselines?
Cortex's 97.8% Recall@10 on LongMemEval (ICLR 2025) exceeds the published paper's best retrieval result of 78.4% by 19.4 percentage points. MRR is 0.882. The paper used 500 human-curated questions embedded in ~40 sessions of conversation history (~115k tokens). These are retrieval-only metrics, with no LLM reader in the evaluation loop. Cortex also achieves 92.6% Recall@10 / 0.794 MRR on LoCoMo (1,986 questions, 10 conversations), and +91% vs LIGHT baseline on BEAM (multi-session, 355 questions, retrieval-proxy MRR 0.627).
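
For reference, this is how Recall@k and MRR are typically computed from ranked retrieval results. It is a generic sketch, not the Cortex evaluation harness: the function names are made up, and it assumes recall is the fraction of relevant items found in the top k, which may differ from the exact variant each benchmark reports.

```python
from typing import Dict, List, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / position of the first relevant result, or 0 if none is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def evaluate(runs: List[Tuple[Sequence[str], Set[str]]], k: int = 10) -> Dict[str, float]:
    """Average both metrics over all questions."""
    n = len(runs)
    return {
        "recall": sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / n,
        "mrr": sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / n,
    }


# Two toy questions: the first finds its answer at rank 2, the second misses.
runs = [
    (["s7", "s3", "s9"], {"s3"}),
    (["s1", "s2"], {"s8"}),
]
print(evaluate(runs, k=10))  # {'recall': 0.5, 'mrr': 0.25}
```

Both numbers depend only on the ranked lists and the gold labels, so no model is needed to score a run.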
What are the 97 reasoning patterns in zetetic-team-subagents?
97 genius reasoning agents, each citing its primary paper, plus 19 team-role specialists = 116 total. Examples: Pearl (causal inference, do-calculus), Peirce (abductive inference), Feynman (integrity & first principles), Dijkstra (correctness, structured programming), Cochrane (evidence synthesis), Curie (residual analysis), Lamport (concurrency, happens-before), Pāṇini (generative specifications), Gödel (incompleteness limits), Hamilton (priority-displaced scheduling), Taleb (fragile/robust/antifragile), Kahneman (System 1/2 debiasing), Rawls (veil of ignorance), Toulmin (argument structure), Popper (falsifiability). 63 multi-step skills, 16 lifecycle hooks, 241 passing tests, 650+ problem-shape triggers. Pre-commit hook blocks UNSOURCED / MAGIC_NUMBER / TODO_NO_REF.
What does automatised-pipeline do for AI codebase intelligence?
Automatised Pipeline is a Rust MCP server that indexes any Rust / Python / TypeScript codebase into a LadybugDB property graph, resolves call chains across files, detects functional communities via Leiden-class community detection, traces processes from entry points, and builds a hybrid BM25 + sparse TF-IDF + RRF search index. 23 MCP tools across 10 stages. Read-only — never writes code, opens PRs, or runs CI. 220 passing tests, zero warnings, 12,000+ lines of Rust. Feeds Cortex (workflow graph) and prd-spec-generator (call-graph context for verified PRDs).
How do I install all four AI Architect plugins?
All four ship in a single marketplace: cdeust/agentic-ai. MIT-licensed, free, install only the ones you want — they're independent.

Step 1 — add the marketplace once:
/plugin marketplace add cdeust/agentic-ai

Step 2 — install any of the four:
/plugin install memory@agentic-ai — Cortex persistent memory (requires PostgreSQL + pgvector)
/plugin install reasoning@agentic-ai — 97 reasoning patterns + 19 specialists
/plugin install codebase@agentic-ai — codebase graph + semantic search (Rust toolchain required; builds on first install)
/plugin install prd@agentic-ai — PRD pipeline with multi-judge verification (Node 20.x or 22.x)

All four interoperate — memory remembers, reasoning reasons, codebase maps, prd adjudicates the spec.
Let's talk

Tell us what you want the agent to do.
We'll tell you if it can be verified.

A 30-minute call. No pitch deck, no commitment. If your problem doesn't fit what we do, we'll point you somewhere that does.

RESPONSE · within 24h
BASED · remote · global
STANDARD · zetetic