The judgement calls behind the numbers.
Every published answer on this site is the product of explicit choices: which source to trust, which comparison is fair enough to publish, which model assumptions are acceptable, and where the data stops supporting the story. This page is the paper trail.
If you only read one part of this page, the core rules are simple:
- Compare teammates whenever possible. Same team, same car, same weekend is the closest thing this dataset offers to a fair comparison.
- Keep the modelling shape visible. The GOAT page is a Bradley–Terry fit over teammate pairings; the LOAT page is a scorecard built from several teammate-normalised lenses rather than one hidden model.
- Stop where the data stops. When a source changes its coding or the comparison stops being fair, the chart should say so instead of smoothing it away.
Sources
- Jolpica-F1 — the community replacement for the deprecated Ergast API (frozen end of 2024). Schema-compatible, so historical season / race / result / qualifying data from 1950 onwards flows in with minimal translation. Raw JSON is pulled into Parquet per endpoint, partitioned by season for the per-race tables.
- FastF1 — official F1 timing-feed wrapper. Lap-level timing with per-lap track-status codes for 2018 onwards. Used to detect safety-car / VSC windows for the safety-car lottery analysis. Race sessions only — qualifying and practice are skipped at ingest.
The Ergast → Jolpica migration turned out to be mostly a plumbing job: Jolpica preserves Ergast's schema deliberately, so the ingestion rewrite was mechanical rather than conceptual.
Pipeline shape
raw Parquet → staging views (rename + cast) → ephemeral intermediates (gaps-and-islands, DNF classification, teammate pairings) → materialised marts (dim + fact)
Staging lives in DuckDB as views — cheap, 1:1 with raw. Intermediate models are ephemeral: dbt inlines them as CTEs inside downstream marts rather than materialising them as separate objects. Marts are tables because the presentation layer hits them directly and wants fast reads.
Ingestion and the Bradley–Terry fit are both Python because each job
picks its best tool: httpx + pyarrow for polite paginated API fetches
with atomic Parquet writes, scipy.optimize for a ridge-penalised
maximum-likelihood fit of the latent-pace model. Forcing either into pure
SQL would hide more than it would reveal.
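The two load-bearing ingestion patterns — the offset/limit pagination walk and the atomic Parquet write — can be sketched in a few lines. This is a stdlib-only illustration, not the project's actual code: the real job fetches with httpx and writes Parquet via pyarrow.parquet.write_table, and `write_atomic` / `fetch_paginated` are hypothetical names.

```python
import os
import tempfile

def write_atomic(data: bytes, dest: str) -> None:
    """Write to a temp file in the destination directory, then rename.

    os.replace is atomic on POSIX, so a crash mid-write never leaves a
    half-written file at the final path. The real job writes Parquet to
    the temp path instead of raw bytes.
    """
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(dest) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, dest)  # atomic swap into the final path
    except BaseException:
        os.remove(tmp)
        raise

def fetch_paginated(get_page, limit=100):
    """Walk an offset/limit-paginated endpoint to exhaustion.

    get_page(offset, limit) is injected so the walk is testable
    offline; with httpx it would wrap a GET carrying offset/limit
    query params.
    """
    offset = 0
    while True:
        rows = get_page(offset, limit)
        if not rows:
            return
        yield from rows
        if len(rows) < limit:  # short page means the endpoint is drained
            return
        offset += limit
```

Injecting the fetch function keeps the pagination logic separable from the HTTP client, which is what makes "polite" retry/backoff policy an orthogonal concern.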
Grain decisions
Every fact table documents its grain in its YAML description. Headlines:
| Model | Grain | Notes |
|---|---|---|
| fct_race_results | one row per driver per race | Joined to dim_drivers via the SCD-2 window so the team assignment is date-correct. |
| fct_teammate_pairings | one row per constructor per race per driver_a < driver_b | Canonical ordering so each head-to-head is counted once. |
fct_teammate_pairings is the input to the Bradley–Terry fit.
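The canonical ordering behind that grain is small enough to show in full. A Python sketch — the real ordering lives in SQL, and the helper name and example driver ids are hypothetical:

```python
def canonical_pairing(driver_a: str, driver_b: str) -> tuple[str, str]:
    """Keep the lexicographically smaller driver id in driver_a so
    (A, B) and (B, A) collapse onto the same fact row."""
    return (driver_a, driver_b) if driver_a < driver_b else (driver_b, driver_a)
```

Without this, a self-join on constructor and race would emit each head-to-head twice, once in each direction, and every win rate downstream would be double-counted.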
DNF classification
finish_status in Jolpica is free text. Modern seasons use strings like
"Engine", "Gearbox", "Collision", "+1 Lap", "Lapped", "Did not start", and a long tail of component failures. Raw text is useful for
exposition (“Verstappen retired — gearbox”) but hostile to
aggregation (“what share of DNFs was mechanical in the hybrid era?”).
A dnf_category macro collapses the long tail into a ten-element enum: finished, lapped, collision, mechanical, disqualified, dns, dnq, medical, retired, unknown.
Two calls worth flagging. First, both "+N Lap(s)" and bare "Lapped" get
bucketed as lapped — missing the second form misclassifies roughly
30% of modern-era results as mechanical DNFs (caught during smoke testing,
before it reached the ratings model). Second, mechanical is a
catch-all for any named component failure — Engine, Gearbox, ERS,
Hydraulics, Clutch, Brakes, Suspension, Tyre, and the rest. The original
finish_status is preserved alongside the category so a downstream split
(power-unit vs. chassis) is a single CTE away when we need it.
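In Python terms, the bucketing logic looks roughly like this. The real implementation is a SQL/Jinja macro covering the full ten-bucket enum; this sketch handles only a few representative statuses, and the component list is an illustrative subset, not the macro's actual lookup table.

```python
import re

# Illustrative subset of named component failures bucketed as mechanical.
MECHANICAL = {"engine", "gearbox", "ers", "hydraulics", "clutch",
              "brakes", "suspension", "tyre"}

def dnf_category(finish_status: str) -> str:
    s = finish_status.strip().lower()
    if s == "finished":
        return "finished"
    # Both "+N Lap(s)" and bare "Lapped" are classified finishers a lap
    # down; missing the bare form is the 30%-misclassification bug
    # described above.
    if s == "lapped" or re.fullmatch(r"\+\d+ laps?", s):
        return "lapped"
    if s == "collision":
        return "collision"
    if s == "did not start":
        return "dns"
    if s in MECHANICAL:
        return "mechanical"
    return "unknown"
```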
Slowly changing driver → constructor relationships
A driver can race for several constructors over their career and occasionally
switches mid-season. Any fact table that links to dim_drivers needs the
version of the driver row that was valid on the race date, not the latest
row.
dim_drivers is therefore SCD-2: one row per (driver_id, span_index), with
valid_from / valid_to / is_current. The spans are derived with a gaps-and-islands walk over each driver's chronological race appearances — a
new span opens whenever the constructor changes, and a driver who returns
to a previous constructor gets a fresh span rather than extending the old
one. Historical team loyalty is preserved faithfully.
This is a computed SCD-2 rather than a dbt snapshot on purpose: snapshots depend on when you ran them, which breaks backfill reproducibility. A computed SCD-2 produces the same spans from the same raw data forever.
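A minimal sketch of that gaps-and-islands walk, assuming appearances arrive as chronological (race_date, constructor_id) pairs. `derive_spans` is a hypothetical name, and closing each span at the next span's start date is a sketch-level simplification of the real valid_from / valid_to bounds:

```python
def derive_spans(appearances):
    """Open a new span whenever the constructor changes; a return to a
    previous constructor opens a fresh span rather than extending the
    old one. Returns (span_index, constructor, valid_from, valid_to)
    tuples; the last span has valid_to = None (i.e. is_current)."""
    spans = []
    for race_date, constructor in appearances:
        if spans and spans[-1][1] == constructor:
            continue  # same island: the open span already covers this race
        if spans:
            # Close the prior span at the boundary (exclusive upper bound).
            spans[-1] = (*spans[-1][:3], race_date)
        spans.append((len(spans), constructor, race_date, None))
    return spans
```

Because the spans are a pure function of the ordered appearances, a backfill from scratch reproduces them exactly — which is the reproducibility property the paragraph above is after.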
The Bradley–Terry fit
The live GOAT page is the short-form write-up; this section is the technical version.
- Model statement. If drivers A and B are teammates in a comparable race, the model estimates P(A beats B) = sigmoid(β_A - β_B). There is one fixed effect per driver and no intercept; one driver is pinned at zero for identifiability.
- Input grain. One row per teammate head-to-head in which both drivers were classified finishers — the cleanest head-to-head we can extract from race results. DNFs that weren't the driver's fault (mechanical, collision-with-another-car) would inject pure noise, so they're excluded rather than imputed.
- Estimator. Custom ridge-penalised logistic regression in Python (`scipy.optimize`, L-BFGS-B) rather than a library like `choix` or `statsmodels.Logit`. Perfect separation — a driver with a 100% or 0% win record — drives the MLE to ±∞ and poisons the Fisher information for every other driver's standard error. A weak L2 penalty (α = 1.0, equivalent to a N(0, 1) prior on each rating) makes this a MAP estimate, keeps every coefficient finite, and keeps every Wald SE well-defined while barely moving well-estimated drivers from the pure-MLE solution.
- Reference driver. The driver with the median number of pairings is pinned at zero — rather than the alphabetically first — so the rating scale is centred on a typical career rather than an arbitrary name order.
- Uncertainty. 95% Wald intervals from the inverse penalised Hessian. Drivers whose unpenalised record is perfectly separated are flagged so the leaderboard shows their point estimate without a misleading error bar: the model can say "at least this good" but not "this much better than their peers". These are model-based intervals, not a claim that the underlying sport is perfectly observed.
- Independence assumption. The fit treats teammate pairings as conditionally independent observations. Real careers clearly violate that in places: the same driver appears many times, teams change, and pairings cluster by era. That makes the model useful rather than literal. It is best read as a disciplined approximation, not a final theory of driver talent.
- Chaining. Ratings are fit over the full corpus in one shot, not era by era — Senna's rating is comparable to Hamilton's via chains of shared teammates. The penalty also doubles as shrinkage for drivers at the end of long, thin chains (a driver who shares only one or two teammates with the core cluster is pulled toward zero).
- Career arcs — rolling-window fit. The cross-era fit returns one rating per driver, which is the right shape for a leaderboard but flattens an entire career into a single point. The career-arc view runs the same estimator on a sliding five-season window (Y ± 2) and reports the window's rating against the year at the centre. Three seasons of pairings is the eligibility floor inside each window (vs. five for the cross-era fit) because the windows are necessarily thinner. Parameters (`window_size`, `fit_min_pairings`, `fit_ridge_alpha`) are stamped into every row of `mart_driver_ratings_by_season`, so the visualisation's assumptions are auditable from the data alone.
- Hollow circles on the career-arc chart. A driver who loses no head-to-heads to any teammate inside a window is perfectly separated within that window — the unpenalised MLE for their rating is +∞. The ridge penalty pulls it back to something finite, but the estimate is no longer supported by a loss observation; it's a shrunk lower bound. The career-arc chart renders those windows as hollow circles rather than filled dots so the reader can tell which points of a rising arc are "we saw them beat their teammate every time and this is how far the penalty will let us read that" rather than a fully-supported estimate. (The related leaderboard convention — no error bar on drivers whose full-career record is separated — is covered in the Uncertainty bullet above.)
- Post-fit derivatives. The GOAT page uses the cross-era fit four different ways: the leaderboard (raw ratings with their Wald bars), the career arcs (same estimator on a sliding window), a head-to-head calculator (`sigmoid(β_A - β_B)` with first-order SE of the difference), and a peak-vs-longevity scatter (the max of each driver's windowed trajectory against its length). Everything reads the same two Parquet marts; nothing is re-fit client-side.
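The estimator bullets above condense into a short runnable sketch. Assumptions worth flagging: `fit_bradley_terry` and `head_to_head` are hypothetical names, the reference driver is pinned by re-centring after the fit rather than by the median-pairings rule, and the Wald-interval machinery (inverse penalised Hessian) is omitted.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable sigmoid

def fit_bradley_terry(pairs, n_drivers, alpha=1.0, ref=0):
    """Ridge-penalised Bradley-Terry MAP fit over teammate head-to-heads.

    pairs: list of (winner_idx, loser_idx). The L2 term makes the
    objective strictly convex, so ratings stay finite even under
    perfect separation."""
    win = np.array([p[0] for p in pairs])
    los = np.array([p[1] for p in pairs])

    def neg_log_posterior(beta):
        margin = beta[win] - beta[los]
        nll = np.sum(np.logaddexp(0.0, -margin))   # -sum log sigmoid(margin)
        return nll + 0.5 * alpha * np.sum(beta ** 2)  # N(0, 1) prior at alpha=1

    res = minimize(neg_log_posterior, np.zeros(n_drivers), method="L-BFGS-B")
    return res.x - res.x[ref]  # pin the reference driver at exactly zero

def head_to_head(beta, a, b):
    """P(A beats B) = sigmoid(beta_A - beta_B)."""
    return expit(beta[a] - beta[b])
```

On a toy corpus where driver 0 beats driver 1 more often than not, and driver 1 edges driver 2, the fitted ratings come out ordered β₀ > β₁ > β₂, and a perfectly separated record still yields a finite (shrunk) rating.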
Generated model documentation
Everything above describes methodology at a narrative level. The machine-readable version — every model, every column, every test, every line
of lineage — is published as a static site at
/projects/f1/docs/.
It's the output of dbt docs generate plus the descriptions and tests
declared alongside the models, rebuilt on every deploy.
A few spots worth looking at if you're curious:
- Lineage graph (`mart_safety_car_lottery` → `int_fastf1_sc_windows` → `int_fastf1_race_sc_laps` → `stg_fastf1__laps`) — the gaps-and-islands detection for SC windows is the most interesting SQL in the project.
- Schema tests — not just `not_null` / `unique`, but `dbt-expectations` range checks (`expect_column_values_to_be_between`) and row-logic assertions (`expression_is_true`) on things like "inherited_share should be between 0 and 1".
- Column-level descriptions — every mart's columns are described in `pipeline/dbt/models/marts/_schema.yml` and surfaced in the docs UI, so an outside reader can understand the grain of any `mart_*.parquet` without opening the SQL.
Other analyses on this site
Every page is backed by a mart_* parquet written by dbt-duckdb's
external materialisation, so the presentation layer reads its data the
same way regardless of whether the mart came from SQL or Python.
| Page | Mart | Shape |
|---|---|---|
| The GOAT, in four charts | mart_driver_ratings + mart_driver_ratings_by_season | Cross-era teammate ratings, rolling-window career arcs, head-to-head win probability, and peak-vs-longevity scatter. Both marts come from pipeline/analysis/bradley_terry.py. Indy 500 rounds (1950-1960) excluded at source. |
| Era competitiveness | mart_era_competitiveness | HHI of constructor race-wins per season, plus unique-winners count and champion win share. Scope starts at 1958 (constructors' championship era). |
| DNF causes | mart_dnf_causes | Per-season counts and shares by dnf_category. Presentation rolls up to decade or stays per-year without re-pivoting. |
| Overtaking extinction | mart_overtaking | Two grid-to-finish proxies per season: mean \|grid − finish\| across classified finishers, and the Pearson correlation between grid and finish. |
| The LOAT, in three charts | mart_teammate_reliability + mart_inherited_positions + mart_safety_car_lottery | Three teammate-normalised luck lenses: mechanical-DNF delta on paired same-team starts, inherited-share delta on paired classified starts, and safety-car window delta since 2018. SC windows are contiguous runs of neutralised laps detected via gaps-and-islands on FastF1 track_status_code. The closing verdict is a percentile-sum scorecard over the joined marts. Indy 500 (1950-1960) excluded at source across all three. |
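The gaps-and-islands idea behind SC-window detection is easier to see in Python than in SQL. A sketch assuming the per-lap neutralisation flag has already been derived upstream from the FastF1 track-status codes; `sc_windows` is a hypothetical name, and the project's real detection is the SQL in int_fastf1_sc_windows:

```python
def sc_windows(laps):
    """Collapse per-lap neutralisation flags into contiguous windows.

    laps: list of (lap_number, is_neutralised) in lap order.
    Returns (start_lap, end_lap) tuples — each tuple is one 'island'
    of consecutive neutralised laps, the gaps between them being the
    green-flag running."""
    windows = []
    for lap, neutralised in laps:
        if not neutralised:
            continue
        if windows and windows[-1][1] == lap - 1:
            windows[-1] = (windows[-1][0], lap)  # extend the open island
        else:
            windows.append((lap, lap))           # a gap: start a new island
    return windows
```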
A couple of confounds worth naming that affect several analyses at once:
- Grid-to-finish deltas and car quality. Metrics built on raw race results — positions gained, grid-vs-finish delta, inherited position share — cannot separate driver performance from the car underneath them. A driver in a rocket ship gains positions everywhere; a driver in a sled loses them. The GOAT page leans exclusively on teammate-based marts (`fct_teammate_pairings`, `mart_driver_ratings`, `mart_driver_ratings_by_season`) precisely because the same-car-same-day head-to-head is the only comparison the data gives that cancels the car out. The LOAT page used to lean on the raw grid-based metrics (inherited share, SC gain rate), which turned out to correlate with average starting grid position at r ≈ 0.93 and r ≈ 0.84 respectively — i.e. the leaderboard was mostly ordering drivers by how far back they started, not how lucky they got. Every lens on LOAT is now teammate-normalised the same way the reliability mart already was (paired self-join on `constructor_id`, then driver delta on a same-car-same-race-same-window subset). That drops the grid correlations to 0.38 and 0.25, which is as close to grid-independent as this dataset will let us get without stepping outside the teammate-comparison framing.
- DNFs bias overtaking metrics. Non-classified finishers are excluded from the overtaking-extinction proxies so unreliable eras don't artificially inflate position change. That trades one bias (DNF inflation) for another (less signal in years with lots of DNFs). The methodology callout on that page flags it explicitly.
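The paired self-join that teammate-normalises every LOAT lens reduces to a small amount of Python. A sketch with hypothetical names and fields — `teammate_deltas` is not a real model, and the production version is SQL over the fct_* tables:

```python
from collections import defaultdict
from itertools import combinations

def teammate_deltas(rows):
    """Within each (race, constructor) cell, every driver pair yields
    one delta per driver on the chosen metric, so the comparison is
    same-car, same-race by construction. rows are
    (race_id, constructor_id, driver_id, metric) tuples.
    Returns each driver's mean paired delta; drivers with no paired
    teammate in any race drop out entirely."""
    cells = defaultdict(list)
    for race, team, driver, metric in rows:
        cells[(race, team)].append((driver, metric))
    deltas = defaultdict(list)
    for drivers in cells.values():
        for (d1, m1), (d2, m2) in combinations(drivers, 2):
            deltas[d1].append(m1 - m2)
            deltas[d2].append(m2 - m1)
    return {d: sum(v) / len(v) for d, v in deltas.items()}
```

Because every delta is taken against a teammate in the same car on the same day, the car's quality cancels out of the metric — which is exactly what drops the grid correlation from ≈0.9 to ≈0.3 in the confound discussion above.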
Limits and non-claims
Honesty is part of the methodology.
- Car-quality blindness — Bradley–Terry on teammate pairings controls for team effects by construction (only same-team head-to-heads count), but it can't separate driver skill from how well the car suited a given driver's style. If a team's car systematically favoured one driver's style for years, that bakes into their rating.
- Sparse eras — the 1950s had a fraction of today's field size and many drivers entered one or two races. Confidence intervals in that era are wide; the ratings chart shows them honestly.
- Team orders — when a team instructs one driver to let the other past, the finishing order reflects team strategy, not pace. We don't detect or exclude these, so ratings are noisier than a theoretical upper bound.
None of this invalidates the ratings — it shapes how to read them. That's the whole point of having a methodology page.