PEIRASTES

The Apophenia Filter

Separating structure-discrimination from luck and pattern-perception with calibration scoring

Instrument & Working Paper

Type: Calibration instrument
Program: On Dynamical Systems
Status: Live · v0.4.1

Think you can read the market? Most conviction is spent on noise — and a track record of correct calls cannot tell skill from luck. This instrument can.

The Apophenia Filter deals you one-dimensional trajectories drawn from hidden mechanisms — momentum, mean reversion, regime shifts — shuffled among pure martingales with nothing in them. You call the next move, or admit you have no edge, and the instrument scores your calibration, not your bravado. Because every trajectory is generated from a known mechanism with a Monte-Carlo-measured oracle ceiling, your score is read against what a perfectly informed observer could achieve — never against 100%. A significance-tested two-axis classification then separates genuine sensitivity to structure from restraint on noise, and names the one profile a lucky streak cannot fake.

Launch the instrument: a full session runs in your browser · anonymous · opt-in data only

The working paper

Draft for comment — not yet a submission. The instrument is built and reproducible; the human study described in §6 is a protocol, not yet run. I’m most interested in whether the scoring and classification carve the construct at its joints, and where the market-transfer argument in §8 is weakest.


Abstract

People are confident readers of patterns in data that contain none. The difficulty is that a record of correct calls cannot, by itself, distinguish a skilled forecaster from a lucky one, because survivorship and multiple comparisons manufacture track records for free. We describe The Apophenia Filter, a web-based behavioral instrument that sidesteps outcome track records entirely and scores calibration over many registered predictions instead. Participants forecast the next move of one-dimensional trajectories drawn from four generative families — a pure martingale (no recoverable structure) and three structured processes (autoregressive momentum, mean reversion, and a hidden two-state regime) — presented at graded signal strengths and shuffled together. Because every trajectory is generated from a known mechanism and a stored seed, each mechanism × strength has a Monte-Carlo-measured oracle ceiling: the accuracy of an observer who knows the latent state. Human performance is read against that ceiling, never against perfection. Predictions are scored with the Brier score, and abstention (“no edge”) is a first-class, scored action, so that the correct behavior on a martingale — declining to predict — is rewarded rather than penalized. A two-axis classification separates sensitivity to real structure from restraint on noise, yielding four profiles; the diagnostically important one is the participant who is sensitive and spends conviction on noise, a signature of pattern-perception (apophenia) that a raw accuracy or profit score would hide. We argue the design is a direct instance of exam-as-filter reasoning from educational assessment, discuss machine baselines run through the identical protocol, and treat financial-market prediction as a single uncontrolled test case whose transfer is bounded by reflexivity.


1. Motivation: the lucky forecaster

Suppose a forecaster shows you ten correct market calls in a row. Should you believe they can read the market? The honest answer is that you cannot tell, and the reason is structural, not a matter of gathering more of the same evidence. Given enough forecasters making enough calls, someone produces a long correct streak by chance alone; survivorship then hides everyone who didn’t, and the survivor is indistinguishable, on their track record, from a genuine expert. Backtests inherit the same disease in a subtler form: a strategy tuned until it fits the history it was tuned on has not been tested against anything. Outcome records are, in the strict sense, unfalsifiable as evidence of skill — they are consistent with both skill and luck, so they discriminate neither.

There is a way out, and it is old: calibration. Instead of asking whether the predictions came true, ask whether the stated confidences were earned — whether the events a forecaster called at 70% happen about 70% of the time, across many registered, pre-committed predictions. Luck cannot hold calibration across repetition; that is the entire leverage. The instrument described here is built around this single move: it never scores whether you were right, only whether your confidence was justified, over enough trials that luck averages out.

It helps to fix a framing borrowed from dynamical systems. The same trajectory is deterministic to an observer who knows every hidden force, random to one who knows none, and probabilistic to one who knows some. Real forecasting is always the third case. The instrument is a measurement of the partially-informed observer’s inference capacity — how much of the recoverable structure a person actually recovers, and how well they know when there is none to recover.

2. The instrument

A trial presents a one-dimensional trajectory: 60 visible steps followed by a hidden continuation. The participant predicts whether the series will be higher or lower 12 steps later, at a confidence from 55% to 95%, or selects NO EDGE to abstain. Every trajectory is produced by a seeded pseudo-random generator (mulberry32), so a trial is fully reproducible from the tuple (seed, mechanism, strength, version) — a property that matters for auditing, for machine baselines (§7), and for pooling only sessions generated by identical parameters.
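
The reproducibility property can be illustrated with a Python port of mulberry32. This is a minimal sketch: it follows the widely circulated 32-bit routine, and whether the instrument seeds or consumes the generator in exactly this way is an assumption. The helper `uniforms` is hypothetical and exists only for the demonstration.

```python
MASK32 = 0xFFFFFFFF

def mulberry32(seed: int):
    """Return a function that yields floats in [0, 1) from a 32-bit state."""
    state = seed & MASK32

    def next_float() -> float:
        nonlocal state
        state = (state + 0x6D2B79F5) & MASK32
        t = state
        # 32-bit multiply-and-xor mixing; masks emulate integer overflow
        t = ((t ^ (t >> 15)) * (t | 1)) & MASK32
        t ^= (t + (((t ^ (t >> 7)) * (t | 61)) & MASK32)) & MASK32
        return (t ^ (t >> 14)) / 4294967296

    return next_float

def uniforms(seed: int, n: int):
    """Hypothetical helper: the first n draws for a given seed."""
    rng = mulberry32(seed)
    return [rng() for _ in range(n)]

# The same seed replays the same stream; a different seed does not.
same = uniforms(12345, 5) == uniforms(12345, 5)
different = uniforms(12345, 5) != uniforms(12346, 5)
print(same, different)
```

Because the stream depends only on the seed, storing (seed, mechanism, strength, version) is enough to regenerate a trial for audit.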

Four generative families are used. The pure-noise family is a martingale, dx = ε, in which no observer can beat 50%. The three structured families each admit a genuine edge:

| Family | Generative rule | Strength knob |
| --- | --- | --- |
| Momentum (autoregressive drift) | dx = d + 0.7ε, with drift d ← φ·d + 0.25η | persistence φ |
| Mean reversion (Ornstein–Uhlenbeck-like) | dx = −k·x + ε, with a ± shock near the end of the visible window | pull k, shock size |
| Regime switching (hidden two-state) | dx = s·μ + ε, sign s flips with per-step probability p | drift μ, flip rate p |

Each structured family appears at three signal strengths: subtle, standard, and sentinel. Each is set by the strength knob (Appendix A gives exact values). Subtle trials sit just above the noise floor; sentinels are high signal-to-noise trials that a competent observer should almost always get right. The families are deliberately chosen to be the canonical ways a 1-D series can carry recoverable structure (trend persistence, restoring force, and hidden state), which also happen to be the folk categories traders reach for (“it’s trending,” “it’s overbought,” “the regime changed”).
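
The four generative rules can be sketched as a single step loop. Several details here are assumptions, not statements about the instrument: ε and η are taken to be standard normal, the series and the drift start at zero, the shock lands on the 56th step with a random sign, the initial regime sign is random, and Python's `random.Random` stands in for the seeded generator. The function name `generate` and the parameter keys are hypothetical.

```python
import random

VISIBLE, HORIZON = 60, 12

def generate(mechanism, params, seed):
    """Return (visible, hidden) lists of levels for one trial."""
    rng = random.Random(seed)        # stand-in for the instrument's seeded PRNG
    x, d = 0.0, 0.0                  # level and momentum drift (assumed to start at 0)
    s = rng.choice((-1, 1))          # initial regime sign (assumption)
    path = []
    for t in range(VISIBLE + HORIZON):
        eps = rng.gauss(0.0, 1.0)
        if mechanism == "noise":
            dx = eps                                  # martingale
        elif mechanism == "momentum":
            d = params["phi"] * d + 0.25 * rng.gauss(0.0, 1.0)
            dx = d + 0.7 * eps
        elif mechanism == "mean_reversion":
            dx = -params["k"] * x + eps
            if t + 1 == 56 and params.get("shock"):   # shock on the 56th step
                dx += rng.choice((-1, 1)) * params["shock"]
        elif mechanism == "regime":
            if rng.random() < params["p"]:            # hidden sign flip
                s = -s
            dx = s * params["mu"] + eps
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        x += dx
        path.append(x)
    return path[:VISIBLE], path[VISIBLE:]

visible, hidden = generate("momentum", {"phi": 0.92}, seed=7)
outcome_up = hidden[-1] > visible[-1]   # higher 12 steps after the visible window?
print(len(visible), len(hidden), outcome_up)
```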

A deck interleaves the families and strengths and shuffles them. The 24-trial “Full Lab” deck is roughly 40% noise (9 martingales) plus, for each structured family, one sentinel, two standard, and two subtle trials; a 12-trial “Quick” deck preserves the same character with two sentinels (and is presented to participants as a practice deck — see §5). Mechanisms are never revealed until the final report.
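
The Full Lab composition described above can be written down directly. This is a sketch under assumptions: the tuple layout, the per-trial seed draw, and the name `build_full_deck` are illustrative and not taken from the instrument's code.

```python
import random

FAMILIES = ("momentum", "mean_reversion", "regime")
PER_FAMILY = {"sentinel": 1, "standard": 2, "subtle": 2}
NOISE_TRIALS = 9

def build_full_deck(seed):
    """Return 24 shuffled (mechanism, strength, trial_seed) tuples."""
    rng = random.Random(seed)
    deck = [("noise", None)] * NOISE_TRIALS
    for family in FAMILIES:
        for strength, count in PER_FAMILY.items():
            deck += [(family, strength)] * count
    rng.shuffle(deck)
    # Give every trial its own seed so it can be regenerated on its own.
    return [(mech, strength, rng.randrange(2 ** 32)) for mech, strength in deck]

deck = build_full_deck(seed=2024)
noise_share = sum(1 for mech, _, _ in deck if mech == "noise") / len(deck)
print(len(deck), noise_share)
```

Nine martingales out of 24 trials is 37.5%, which is the "roughly 40%" quoted above; the remaining fifteen trials are the structured ones.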

Disclosure is a fairness constraint, not decoration. Before any trial, participants are told the approximate noise proportion, the three structured families and their tells, that abstaining is legitimate and scored, that sentinels exist, and that ceilings exist. This is the analogue of telling students what to study before an exam: the effort signal is only interpretable if the examinee knew the rules. Rewording for tone is fine; withholding any of this information changes what the instrument measures.

3. Ceilings, not perfection

Because each trajectory is generated from a known mechanism, we can compute what an oracle — an observer who knows the latent state (the current drift, the mean, the active regime) — achieves on each mechanism × strength, by Monte Carlo (N = 30,000). These oracle ceilings are the correct denominator for human performance. A person who scores 70% on a class whose ceiling is 73% has recovered nearly all the available signal; the same 70% against a 94% ceiling is a large unrecovered gap. Reporting accuracy against 100% would be a category error, because some fraction of every trajectory is genuinely unknowable even to the oracle.

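| Family | Subtle | Standard | Sentinel |
| --- | --- | --- | --- |
| Momentum | 63% | 73% | 88% |
| Mean reversion | 65% | 73% | 93% |
| Regime switching | 66% | 77% | 94% |
| Pure noise | 50% (no strength levels) | | |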

One result is worth flagging because it falls out of the measurement rather than being designed in: momentum’s ceiling saturates near 88% even as persistence is pushed to near-permanent drift. Momentum is intrinsically the hardest of the three structured classes to exploit at this horizon — the observation noise on each step limits what even perfect knowledge of the drift can buy. Not all sentinels are equal, and the instrument says so rather than pretending the three families are interchangeable.
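
A Monte Carlo estimate of the momentum ceiling can be sketched as follows. The assumptions are mine: ε and η are standard normal, the drift starts at zero, and the oracle is modeled as an observer who knows the drift at the end of the visible window and bets on its sign (which is the best call when the noise is symmetric). The function name and the reduced trial count are illustrative; the paper's own measurement uses N = 30,000.

```python
import random

def momentum_oracle_accuracy(phi, n_trials=4000, seed=0, visible=60, horizon=12):
    """Fraction of trials where the sign of the known drift matches the move."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        d = 0.0
        for _ in range(visible):
            # Only the drift matters to this oracle, so the level is not tracked.
            d = phi * d + 0.25 * rng.gauss(0.0, 1.0)
        call_up = d > 0
        move = 0.0
        for _ in range(horizon):
            d = phi * d + 0.25 * rng.gauss(0.0, 1.0)
            move += d + 0.7 * rng.gauss(0.0, 1.0)
        hits += (move > 0) == call_up
    return hits / n_trials

for phi in (0.82, 0.92, 0.99):
    print(phi, round(momentum_oracle_accuracy(phi), 3))
```

Under these assumptions the estimates should come out in the neighborhood of the momentum row above, though an exact match would depend on details the paper does not specify. The sketch also makes the saturation visible: even with φ = 0.99, the 0.7ε step noise and the fresh drift innovations accumulated over twelve steps keep the oracle well short of certainty.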

4. Scoring: calibration, with abstention first-class

Each committed prediction is scored with the Brier score, the squared error between the stated probability of “up” and the outcome, (p_up − outcome)². Lower is better; 0 is perfect. The pivotal design choice is the treatment of abstention: NO EDGE is scored as p = 0.5, which on a binary outcome yields a Brier of exactly 0.250 every time, win or lose. A participant who abstains on everything therefore scores exactly 0.250, and this “perpetual abstainer” is the benchmark every session is read against. To come in under 0.250, your confident correct calls must more than pay for your confident wrong ones; overconfident guessing on noise reliably pushes the average above 0.250. Abstention is thus not a forfeit but the correct move on a martingale, and the scoring rewards it as such. The highest behavior the filter can detect is knowing when not to predict.
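
The scoring rule is small enough to state as code. This is a sketch: the string labels `"UP"`, `"DOWN"`, `"NO_EDGE"` and the function names are assumptions chosen for illustration.

```python
def p_up(call, confidence=None):
    """Stated probability of 'up'; abstention is scored as 0.5."""
    if call == "NO_EDGE":
        return 0.5
    p = confidence / 100.0
    return p if call == "UP" else 1.0 - p

def brier(call, outcome_up, confidence=None):
    """Squared error between the stated probability of 'up' and the outcome."""
    return (p_up(call, confidence) - (1.0 if outcome_up else 0.0)) ** 2

def mean_brier(trials):
    """trials: list of (call, confidence, outcome_up)."""
    return sum(brier(call, up, conf) for call, conf, up in trials) / len(trials)

print(brier("NO_EDGE", True))        # abstention costs 0.25 whatever happens
print(brier("UP", True, 75))         # confident and right
print(brier("UP", False, 75))        # confident and wrong
print(mean_brier([("UP", 75, True), ("DOWN", 75, True)]))
```

One consequence worth spelling out: a call at confidence c averages (1 − c)²·h + c²·(1 − h) when its hit rate is h, and setting this equal to 0.25 gives h = (c + 0.5)/2. A 75% call therefore beats abstention on average only if it is right more than 62.5% of the time, and a 95% call only above 72.5%.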

Alongside the aggregate Brier we compute an apophenia index: the mean conviction above chance (mean of confidence − 50) spent on the pure-noise trials. It is a direct readout of how much certainty a person invests where, by construction, none is warranted. Abstaining on noise contributes zero to it; betting confidently on noise inflates it.
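
As a sketch of the apophenia index, assuming each noise trial is recorded as a (call, confidence) pair with the hypothetical label `"NO_EDGE"` for abstention:

```python
def apophenia_index(noise_trials):
    """Mean of (confidence - 50) over noise trials; abstentions count as 0."""
    if not noise_trials:
        return None
    spent = [0 if call == "NO_EDGE" else confidence - 50
             for call, confidence in noise_trials]
    return sum(spent) / len(spent)

# Nine noise trials: four abstentions, three calls at 60%, two calls at 80%.
session = ([("NO_EDGE", None)] * 4
           + [("UP", 60)] * 3
           + [("DOWN", 80)] * 2)
print(apophenia_index(session))   # (0*4 + 10*3 + 30*2) / 9 = 10.0
```

The denominator is every noise trial, not only the committed ones, so abstaining dilutes the index as well as contributing nothing to it.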

5. Classification: sensitivity × restraint

The report places each participant on a two-axis grid, effectively a political compass for forecasting:

  • Sensitivity (horizontal): is your accuracy on structured trials significantly above chance? The boundary is a one-sided exact binomial test against p = 0.5 at α = 0.05 — not a fixed accuracy cutoff (see “Why a significance test” below). Are you recovering real structure, or riding a lucky streak?
  • Restraint (vertical): low conviction on noise (apophenia index ≤ 10) or frequent abstention on noise (≥ 40%). Are you keeping conviction off the martingales?

The two axes yield four profiles:

  • A — Discriminator (sensitive and restrained): recovers structure, declines on noise. The profile the instrument exists to find, and the one luck cannot sustain across repetition.
  • B — Calibrated, insensitive (restrained, not sensitive): withheld conviction where nothing was recoverable, but didn’t recover the structure either. Honest and trainable; the limiting factor is perception, not judgment.
  • C — Undecidable (sensitive, not restrained): hits on structure, but also spends conviction on noise. Real perception contaminated by apophenia — or luck. This profile cannot be resolved at a single session’s sample size; only repetition separates the two accounts. Its existence, and the honest refusal to over-decide it, is the epistemic heart of the instrument.
  • D — Apophenic (neither): conviction everywhere, discrimination nowhere. The modal human profile on financial charts.

A participant with too few committed structured calls is left Unclassifiable — abstaining everywhere is perfectly calibrated but carries no discrimination signal, so the filter declines to place them rather than inventing a verdict.

Why a significance test, not an accuracy cutoff. An earlier version of the instrument drew the sensitivity boundary at a fixed 60% accuracy over as few as three committed calls. That is far too permissive: a coin-flipper who commits all fifteen structured trials of the full deck clears 60% roughly 30% of the time, a 30% false-positive rate for the very quadrant the instrument exists to certify. An instrument that hands out “Discriminator” to nearly one in three lucky guessers would be committing the lucky-forecaster error it was built to expose. Replacing the cutoff with a one-sided exact binomial test fixes this at the root: the false-positive rate is held at or below α by construction (≈2% for fifteen committed calls at α = 0.05, because the exact test is conservative), and the accuracy required to clear significance adapts to how many calls a participant actually committed. Commit fewer, and you must be correspondingly more accurate. This also rewards the right behavior: a participant who abstains on the trajectories they cannot read and commits only where a tell is visible needs fewer, higher-quality calls to prove skill, exactly as a calibrated forecaster should.
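
The two false-positive rates can be checked with exact binomial arithmetic. This is a sketch; the function names are illustrative.

```python
from math import comb

def tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))

def critical_count(n, alpha=0.05):
    """Smallest hit count whose one-sided p-value is <= alpha, else None."""
    for k in range(n + 1):
        if tail(n, k) <= alpha:
            return k
    return None

n = 15
cutoff_hits = -(-6 * n // 10)          # fewest hits with accuracy >= 60%
print(cutoff_hits, round(tail(n, cutoff_hits), 4))    # old fixed cutoff
k_star = critical_count(n)
print(k_star, round(tail(n, k_star), 4))              # exact test at alpha = 0.05

for calls in (3, 4, 5, 7, 10, 15):
    print(calls, critical_count(calls))
```

For fifteen committed calls the 60% cutoff corresponds to 9 hits, which a coin-flipper reaches with probability 9949/32768 ≈ 0.304, while the exact test requires 12 hits, reached by chance with probability 576/32768 ≈ 0.018. The last loop shows how the bar rises as calls fall. It also shows a boundary case: with three or four committed calls even a perfect record has a p-value of 0.125 or 0.0625, so significance at α = 0.05 first becomes reachable at five calls, all correct.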

The cost is honest and, in fact, on-thesis. At these sample sizes — seven structured trials in the short deck, fifteen in the full one — the test is necessarily strict; the full-deck oracle itself averages only ~74% accuracy, so even a near-perfect human clears α = 0.05 in only a minority of single sessions. “Discriminator” is therefore rare and earned, and a confident verdict genuinely requires the full deck and, ideally, repetition. The short deck is presented to participants as practice for precisely this reason: it is underpowered to separate skill from luck, which is the instrument’s central claim expressed as a design constraint rather than concealed behind a lenient threshold.
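
How often the oracle itself would clear the test can be estimated from the ceilings in §3. The sketch assumes the oracle commits on all fifteen structured trials, that trials are independent, and that each is won with probability equal to its class ceiling; the names are illustrative.

```python
from math import comb

CEILINGS = {
    "momentum":       {"subtle": 0.63, "standard": 0.73, "sentinel": 0.88},
    "mean_reversion": {"subtle": 0.65, "standard": 0.73, "sentinel": 0.93},
    "regime":         {"subtle": 0.66, "standard": 0.77, "sentinel": 0.94},
}
PER_FAMILY = {"subtle": 2, "standard": 2, "sentinel": 1}

def critical_count(n, alpha=0.05):
    """Smallest hit count with one-sided exact p-value <= alpha under p = 0.5."""
    for k in range(n + 1):
        if sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n <= alpha:
            return k
    return None

def hit_count_distribution(probs):
    """Distribution of total hits over independent trials with unequal odds."""
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1 - p)
            nxt[k + 1] += mass * p
        dist = nxt
    return dist

probs = [CEILINGS[fam][s] for fam in CEILINGS
         for s, count in PER_FAMILY.items() for _ in range(count)]
mean_ceiling = sum(probs) / len(probs)
k_star = critical_count(len(probs))
oracle_power = sum(hit_count_distribution(probs)[k_star:])
print(round(mean_ceiling, 3), k_star, round(oracle_power, 3))
```

The mean of the fifteen ceilings is 11.09/15 ≈ 0.739, matching the ~74% quoted above, and the probability of reaching the critical count comes out below one half under these assumptions.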

Sentinels are the instrument’s self-check. Because sentinel trials carry the highest oracle ceilings (88–94%), a competent, attentive observer should rarely miss them. An individual who misses sentinels is emitting a subject-level signal (inattention, or a genuine perceptual gap). But if a population misses sentinels at a high rate, that indicts the instrument’s calibration, not the subjects. This is the class-wide-jump rule familiar from exam analysis: when everyone fails the question everyone should pass, the question is broken. The report states this explicitly so that a single low score is never over-read.

6. A human study (protocol)

The instrument is built and in private pilot; the following is the intended study, offered here for critique rather than as completed work.

  • Participants: anonymous, uncompensated, self-selected web participants; pseudonymous per-browser identifiers only, with an optional user-chosen codename for cross-device continuity. No names, emails, IP-linked records, or accounts.
  • Data: opt-in and post-hoc — a session is contributed only after the participant has seen their report and explicitly consents, and the contributed payload is exactly what they can download for themselves (per-trial seed, mechanism, strength, call, confidence, response time; session-level summary). Developer/tester sessions carry a pilot flag and are permanently excluded from the analysis pool.
  • Primary measures: sensitivity index d′ on structured-vs-noise discrimination; apophenia rate (conviction on noise); the distribution of participants across the four quadrants; and calibration curves (reliability diagrams) pooled within identical generator parameters.
  • Ethics: because public collection is intended to support publication, an IRB determination (anonymous, minimal-risk, web-based behavioral task, no identifiers) precedes any analysis-pool collection. If no publication is intended, this gate relaxes.
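
The calibration curves listed among the primary measures reduce to a simple tabulation: group committed calls by stated confidence and compare each group's hit rate with the confidence itself. A minimal sketch, assuming committed calls are available as (confidence, correct) pairs; the function name and the sample data are invented for illustration.

```python
from collections import defaultdict

def reliability_table(predictions):
    """Map each stated confidence to (number of calls, observed hit rate)."""
    bins = defaultdict(lambda: [0, 0])
    for confidence, correct in predictions:
        bins[confidence][0] += 1
        bins[confidence][1] += bool(correct)
    return {c: (n, hits / n) for c, (n, hits) in sorted(bins.items())}

# Invented example data, not study results.
calls = [(55, True), (55, False), (75, True), (75, True), (75, False), (95, True)]
for confidence, (n, hit_rate) in reliability_table(calls).items():
    print(confidence, n, round(hit_rate, 2))
```

A well-calibrated participant's hit rates track the stated confidences; pooling is only meaningful across sessions generated with identical parameters, as the protocol specifies.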

Hypotheses worth pre-registering: that a substantial fraction of untrained participants land in D; that the modal error is overconfidence on noise rather than insensitivity to structure; and that repeated play moves individuals from C toward either A or D as sample size resolves the undecidable case.

7. Machine baselines

Because trials are reproducible from seeds, the identical protocol can be run by algorithms: a Kalman filter and a particle filter (each of which knows the form of the generative families but not the latent state, i.e., the honest probabilistic observer of §1), and a small neural network trained only on visible windows. Reporting these baselines on the same decks, scored the same way and read against the same ceilings, does two things. It locates human performance on an absolute scale between chance and the oracle, and it turns the ceilings from asserted constants into a claim that can itself be checked — a filter that knows the generative form should approach the oracle, and if it doesn’t, the ceiling estimate is suspect.
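
For the momentum family, a Kalman-filter baseline of this kind can be sketched in a few lines. The assumptions are mine: ε and η are standard normal, the drift starts at exactly zero, the filter is told φ and the two noise scales, and it observes the step sizes of the visible window. The function names are illustrative.

```python
import math
import random

Q, R = 0.25 ** 2, 0.7 ** 2    # variances of the drift innovation and step noise

def filter_drift(steps, phi):
    """Posterior mean and variance of the current drift given observed steps."""
    m, p = 0.0, 0.0                                   # drift assumed to start at 0
    for y in steps:
        m, p = phi * m, phi * phi * p + Q             # predict
        gain = p / (p + R)
        m, p = m + gain * (y - m), (1.0 - gain) * p   # update on y = d + 0.7*eps
    return m, p

def prob_up(steps, phi, horizon=12):
    """Predictive probability that the series is higher `horizon` steps later."""
    m, p = filter_drift(steps, phi)
    g = [sum(phi ** i for i in range(k)) for k in range(1, horizon + 1)]
    load = phi * g[-1]                    # phi + phi^2 + ... + phi^horizon
    mean = load * m
    var = load * load * p + Q * sum(v * v for v in g) + horizon * R
    return 0.5 * (1.0 + math.erf(mean / math.sqrt(2.0 * var)))

def baseline_accuracy(phi, n_trials=2000, seed=1):
    """Directional accuracy of the filter on freshly simulated trials."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        d, steps = 0.0, []
        for _ in range(72):
            d = phi * d + 0.25 * rng.gauss(0.0, 1.0)
            steps.append(d + 0.7 * rng.gauss(0.0, 1.0))
        hits += (prob_up(steps[:60], phi) > 0.5) == (sum(steps[60:]) > 0)
    return hits / n_trials

print(round(baseline_accuracy(0.99), 3))
```

Because the filter has to infer the drift that the oracle is handed, its accuracy should sit somewhat below the oracle ceiling for the same class; a filter that lands far below, or above, would be a reason to recheck either the filter or the ceiling. Mapping `prob_up` onto the instrument's 55–95% confidence steps and NO EDGE is left out of the sketch.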

8. Market data as one uncontrolled test case

The instrument’s construct is deliberately synthetic: we control the generative process, so “structure” and “noise” are ground truth rather than interpretation. Financial price series are the obvious real-world target, and the temptation is to claim that a Discriminator on this instrument will read markets. That claim is exactly what the design refuses to make cheaply. Markets can be treated as one uncontrolled test case — a series whose generating process is unknown and, crucially, reflexive: acting on a detected pattern changes the process, and any edge that becomes widely known is arbitraged away (a Goodhart dynamic). The synthetic families here are, by contrast, indifferent to being predicted. Transfer from the instrument to markets is therefore an empirical question bounded by reflexivity, not a corollary of a good score, and the honest statement of that limit is part of the contribution.

9. Relation to assessment theory

The instrument is a near-literal instance of exam-as-filter reasoning from educational assessment, which is where the design actually came from. Noise trials are the “easy exam” that isolates a single trait (restraint/calibration) the way a well-designed easy item isolates effort from ability. Graded strengths mirror staged item difficulty. Sentinels are the instrument’s internal validity check, and the class-wide-jump rule is imported wholesale. The two-axis classification is the assessment distinction between a student who doesn’t know the material (insensitive) and one who is overconfident on material that isn’t there (apophenic) — different diagnoses requiring different remediation. Reading scores against measured ceilings rather than a nominal 100% is the same move as grading against what the best-prepared student could actually achieve on a given instrument. The filter is, in effect, an exam whose subject is the examinee’s own pattern-recognition, administered under conditions where the answer key is known exactly.

10. Limitations and open questions

The instrument measures one narrow inference — direction over a fixed horizon on 1-D series from a small family of processes — and should not be read as a general “forecasting ability” score. Single sessions are underpowered to resolve the C (undecidable) profile, which is a feature of the honesty of the design but also a real limit on what one sitting can conclude. Self-selection in a public web sample biases any population statistics. The generative families are a modeling choice; a participant who reasons well about processes not in the family will be scored as insensitive. And the transfer to markets (§8) is, at present, argued rather than demonstrated. None of these are fatal, but each bounds the claims.

11. Conclusion

Outcome track records cannot separate skill from luck, and confident pattern-perception on noise is a pervasive, measurable error. By generating trajectories from known mechanisms with measured oracle ceilings, scoring calibration rather than outcomes, making abstention a first-class scored action, and classifying participants on sensitivity and restraint separately, The Apophenia Filter turns a slippery question — can this person actually read structure, or are they fooling themselves? — into a measurement with an explicit answer key and an explicit account of what it cannot yet decide. The instrument is live; the study is the next step.


Appendix A — Generative parameters (v0.4.1)

Trajectory: 60 visible steps, 12-step prediction horizon. Per-trial seed via mulberry32; every trajectory reproducible from (seed, mechanism, strength, version).

| Family | Generative rule | Subtle | Standard | Sentinel |
| --- | --- | --- | --- | --- |
| Momentum | dx = d + 0.7ε, d ← φd + 0.25η | φ = 0.82 → 63% | φ = 0.92 → 73% | φ = 0.99 → 88% |
| Mean reversion | dx = −k·x + ε, ± shock at step 56 | k = 0.04, no shock → 65% | k = 0.12, shock 1.5 → 73% | k = 0.12, shock 8 → 93% |
| Regime | dx = s·μ + ε, flip prob p/step | μ = 0.30, p = 0.06 → 66% | μ = 0.45, p = 0.04 → 77% | μ = 0.90, p = 0.01 → 94% |
| Noise | dx = ε (martingale) | 50% by construction | | |

Percentages are oracle ceilings (Monte Carlo, N = 30,000; observer knows the latent state). Any change to generator parameters, deck composition, horizon, or scoring invalidates cross-session pooling and requires re-measuring the ceilings — sessions generated under different parameter hashes are never pooled.
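
One way to realize a parameter hash is to serialize the parameter set canonically and digest it. The scheme below (sorted-key JSON hashed with SHA-256) and the dictionary layout are assumptions for illustration; the paper does not say how its hashes are computed. A complete version would also include deck composition and scoring rules, which the paragraph above lists as pooling-relevant.

```python
import hashlib
import json

PARAMS = {
    "version": "0.4.1",
    "visible": 60,
    "horizon": 12,
    "momentum": {"subtle": {"phi": 0.82}, "standard": {"phi": 0.92},
                 "sentinel": {"phi": 0.99}},
    "mean_reversion": {"subtle": {"k": 0.04, "shock": 0},
                       "standard": {"k": 0.12, "shock": 1.5},
                       "sentinel": {"k": 0.12, "shock": 8}},
    "regime": {"subtle": {"mu": 0.30, "p": 0.06},
               "standard": {"mu": 0.45, "p": 0.04},
               "sentinel": {"mu": 0.90, "p": 0.01}},
}

def parameter_hash(params):
    """Digest of a canonical JSON form, so key order cannot change the hash."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def poolable(session_a, session_b):
    """Sessions may be pooled only when their parameter hashes agree."""
    return session_a["param_hash"] == session_b["param_hash"]

current = parameter_hash(PARAMS)
changed = json.loads(json.dumps(PARAMS))
changed["momentum"]["sentinel"]["phi"] = 0.995
print(poolable({"param_hash": current}, {"param_hash": parameter_hash(changed)}))
```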

Appendix B — Scoring and classification

  • Actions: UP / DOWN at confidence 55–95% (5-point steps), or NO EDGE (abstain; scored as p = 0.5, Brier 0.25, correctness undefined).
  • Brier: (p_up − outcome)²; perpetual-abstainer benchmark = 0.250.
  • Apophenia index: mean of (confidence − 50) over noise trials (abstentions contribute 0).
  • Sensitive: structured accuracy significantly above chance by a one-sided exact binomial test against p = 0.5 at α = 0.05 (requires ≥ 3 committed structured calls to be testable; the short “Quick” deck is underpowered for this and is labeled practice).
  • Restrained: apophenia index ≤ 10 or noise-abstention rate ≥ 0.40.
  • Quadrants: A = sensitive ∧ restrained; B = restrained ∧ ¬sensitive; C = sensitive ∧ ¬restrained; D = ¬sensitive ∧ ¬restrained; Unclassifiable if fewer than 3 committed structured calls.
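
The rules in this appendix compose into a short classifier. This is a sketch of the stated rules, not the instrument's code; the function names and argument layout are assumptions.

```python
from math import comb

def binom_p_value(hits, n):
    """One-sided exact p-value: P(X >= hits) under Binomial(n, 0.5)."""
    return sum(comb(n, j) for j in range(hits, n + 1)) / 2 ** n

def classify(structured_hits, structured_committed,
             apophenia_index, noise_abstention_rate, alpha=0.05):
    """Return 'A', 'B', 'C', 'D', or 'Unclassifiable'."""
    if structured_committed < 3:
        return "Unclassifiable"
    sensitive = binom_p_value(structured_hits, structured_committed) <= alpha
    restrained = apophenia_index <= 10 or noise_abstention_rate >= 0.40
    if sensitive:
        return "A" if restrained else "C"
    return "B" if restrained else "D"

print(classify(12, 15, apophenia_index=5, noise_abstention_rate=0.5))
print(classify(12, 15, apophenia_index=20, noise_abstention_rate=0.1))
print(classify(9, 15, apophenia_index=25, noise_abstention_rate=0.0))
print(classify(2, 2, apophenia_index=0, noise_abstention_rate=1.0))
```

Note that a session with three or four committed structured calls is classifiable under these rules but cannot come out sensitive, since the smallest attainable p-value at those sizes exceeds 0.05; it will land in B or D.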

Try the instrument: https://peirastes.com/apophenia-filter/ — and if you contribute a session, it’s opt-in and carries no identifying information.

The typeset PDF

The same paper, typeset for print and reading offline.