Fit, but for what? | PUFFERIZED Blog

I went looking for a better way to track an athlete's fitness for Pufferized. One month later... I have emerged from a pretty deep rabbit hole.

This post is the story of that rabbit hole, and how it has changed the way we track fitness, fatigue and form in Pufferized.

All models are wrong

"All models are wrong, but some of them are useful." — George Box¹

Weather forecasting doesn't simulate every molecule of the atmosphere. It reduces this complexity to pressure systems, temperature gradients, moisture. The model is wrong — it isn't the actual atmosphere. But it's useful enough to tell you whether to bring a jacket.

Your body runs thousands of interacting processes when you train — mitochondrial biogenesis, capillarisation, neural recruitment, hormonal signalling, immune function, psychological state.

No model captures all of that. And even if it did, I'm not convinced it would make me any better at riding my bike.

The question isn't whether the model is correct. It isn't. The question is whether it's useful enough to help you make better decisions than guessing.

Keep that frame in mind for everything below.

Popular Models

The intellectual foundation of most training models still in use today comes from Eric Banister's 1975 paper, "A systems model of training for athletic performance."²

It's essentially a nice little loop. Every session produces a training load number, calculated from time and intensity. You'll see this concept in Strava as Relative Effort, in TrainingPeaks as TSS, and in Intervals.icu as Load.

Banister's original load calculation was called TRIMP (TRaining IMPulse) — heart-rate based, roughly duration × HR intensity × a weighting factor. Modern versions adapt it to use cycling power, along with other signals for intensity.
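As a rough sketch of the heart-rate version, assuming the commonly published exponential weighting constants (0.64 and 1.92 for men, 0.86 and 1.67 for women) and a single average heart rate for the whole session. The function name and the example athlete's numbers are illustrative only.

```python
import math


def banister_trimp(duration_min, avg_hr, rest_hr, max_hr, sex="male"):
    """Banister-style TRIMP for one session, from its average heart rate."""
    # Fraction of heart-rate reserve used during the session (0 to 1).
    hr_reserve_fraction = (avg_hr - rest_hr) / (max_hr - rest_hr)
    # Exponential weighting, so hard minutes count for more than easy ones.
    if sex == "male":
        weight = 0.64 * math.exp(1.92 * hr_reserve_fraction)
    else:
        weight = 0.86 * math.exp(1.67 * hr_reserve_fraction)
    return duration_min * hr_reserve_fraction * weight


# Same hour, same (hypothetical) athlete, two very different intensities.
easy_hour = banister_trimp(60, avg_hr=130, rest_hr=50, max_hr=190)
hard_hour = banister_trimp(60, avg_hr=170, rest_hr=50, max_hr=190)
```

The hard hour scores well over twice the easy one, even though both lasted 60 minutes: that is the weighting factor doing its job.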

This training load number then feeds two rolling averages: a fast one (your average training load over the last week or so), and a slow one (the last month or two). Most implementations weight them exponentially, so recent days count for more than older ones. When recent load is higher than long-term load you're fatigued; when it's lower, you're fresh. When you push your long-term average to a new height, you're pushing the ceiling of your fitness.
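The loop can be sketched in a few lines. This assumes an exponentially weighted update with 42-day and 7-day constants, which are common defaults rather than anything universal, and the variable names are made up for the example.

```python
def rolling_load(daily_loads, slow_days=42, fast_days=7):
    """Return (slow, fast, slow - fast) for each day of training load."""
    slow = fast = 0.0
    history = []
    for load in daily_loads:
        # Each average moves a fraction of the way towards today's load.
        slow += (load - slow) / slow_days
        fast += (load - fast) / fast_days
        history.append((slow, fast, slow - fast))
    return history


# Four steady weeks at 60 load per day, then a full rest week.
days = [60.0] * 28 + [0.0] * 7
history = rolling_load(days)
slow_28, fast_28, form_28 = history[27]  # end of the training block
slow_35, fast_35, form_35 = history[34]  # end of the rest week
```

At the end of the block the fast average sits well above the slow one (fatigued). After the rest week the fast average has dropped below it (fresh), at the cost of a little long-term load.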

The most widely-known implementation of Banister's model is Andy Coggan's Performance Management Chart (PMC), popularised by TrainingPeaks in the early 2000s and still dominant today³. Coggan did two important things:

1. Gave the stack a universal anchor: TSS (Training Stress Score), calibrated so that 1 hour at threshold = 100 TSS. The signal can be anything you've got: power (anchored to FTP), heart rate (hrTSS, anchored to threshold HR), pace (rTSS for runners, anchored to threshold pace), swim pace (sTSS), or perceived exertion (RPE). Same one-hour-equals-100 logic, whatever the input. This gave the number universal, intuitive meaning ("200 is a big day, 50 is easy") and made sessions comparable across athletes, sports, and devices.
2. Named the stack and fixed the constants: TSS, CTL (Chronic Training Load, a 42-day average), ATL (Acute Training Load, a 7-day average), and TSB (Training Stress Balance, which is CTL minus ATL, i.e. form). Dinner party fact: all of those terms are registered trademarks of Peaksware, TrainingPeaks' parent company.
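For power, the one-hour-equals-100 anchor is usually written as the formula below. This sketch assumes the caller already has a weighted average power for the session (Coggan's Normalized Power), and the function and argument names are illustrative.

```python
def training_stress_score(duration_s, normalized_power, ftp):
    """Power-based TSS: one hour exactly at FTP scores 100."""
    # Intensity relative to threshold; 1.0 means riding at FTP.
    intensity_factor = normalized_power / ftp
    return (duration_s * normalized_power * intensity_factor) / (ftp * 3600) * 100


hour_at_threshold = training_stress_score(3600, normalized_power=250, ftp=250)
two_easy_hours = training_stress_score(7200, normalized_power=175, ftp=250)
```

Because intensity is squared, two hours at 70% of threshold come out at 98, just under a single hour at threshold.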

Some platforms use different approaches (often because they have proprietary tech):

Garmin: EPOC-based load (internal metabolic cost via oxygen debt) with an acute/chronic ratio (ACWR).
Polar: three parallel components, namely TRIMP-style cardio load, a separately-modelled muscle load, and RPE. (RPE is one of the best-validated load signals in sports science.)
WHOOP: a logarithmic Strain score (0–21, a TRIMP variant from wrist HR) paired with their proprietary Recovery Score.

They're all solving the same problem: how do we quantify what you just did, and track it over time?

Limitations

A few obvious ones with the Banister-style models:

All TSS is equal.
A 5-hour easy ride and a 90-minute interval session can produce the same training load, but they're training completely different systems.

Any cumulative chart inherits the confusion.
Since CTL is just a rolling average of daily TSS, you could have the same "fitness" at two points in the year, but one could be built from easy mountain bike rides and the other from fast 5k runs. Fit, but for what?

We don't all recover the same.
TSB is built on the same fixed windows for everyone (7 days for fatigue, 42 for fitness), based on population samples from the 70s. Subsequent work fitting the Banister model to real athletes has shown recovery time constants vary 2–3× between individuals⁴. One fixed window cannot be right for everyone.
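For reference, the impulse-response model those time constants come from is usually written as below. The symbols are the conventional ones rather than anything defined in this post: w(s) is the training load on day s, p_0 a baseline performance level, k_1 and k_2 gain factors, and τ_1 and τ_2 the fitness and fatigue time constants.

```latex
p(t) = p_0
     + k_1 \sum_{s=1}^{t-1} w(s)\, e^{-(t-s)/\tau_1}
     - k_2 \sum_{s=1}^{t-1} w(s)\, e^{-(t-s)/\tau_2}
```

The fixed 42-day and 7-day windows amount to fixing τ_1 and τ_2 for everyone. The fitted values in reference 4 (τ fitness 38 ± 16 days, τ fatigue 19 ± 11 days) show how far individual athletes can sit from those defaults.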

Fatigue is much greater than just training.
How we define fatigue is limited to a single number. Sleep, stress, HRV, illness and injury... these things are incredibly nuanced.

New research

A quick tour of where the science has gone in recent years:

Multi-dimensional load models that split training across energy systems were formalised in Kontro, Mastracci, Cheung & MacInnis (2026) — a three-dimensional impulse-response model⁵. Commercially, Xert has been running a version of it for about seven years (Mastracci is the founder).

Durability — how much power you can still produce after hours of accumulated fatigue — was crystallised as a concept by Maunder et al. (2021) in Sports Medicine⁶, with the landmark validation data coming from pro cycling (Van Erp 2021 on Grand Tour power decline⁷, Spragg 2022–23⁸) simply because power files are the cleanest signal available. Running durability research (marathon pacing collapse, late-race economy) is catching up fast.

Durability is especially relevant for ultra-endurance: hour 20 of a Strathpuffer lap rotation is nothing like hour two. Tracking how your engine holds up under accumulated fatigue is probably the most useful thing we could be doing for our kind of rider.

Polarised training — the idea that intensity distribution matters more than total load — is Stephen Seiler's work with Kjerland (2006)⁹, refined since by Stöggl and Sperlich and others¹⁰. In simple terms: a lower-TSS week split properly between 80% easy and 20% hard can have more impact than a high-TSS week stuck in the 'grey zone'.

Let's not get carried away

Before I started picking specific models and designing the status tracking in Pufferized, I set out a few principles. The most important one:

My goal isn't to build a data analysis tool. It's to work directly with athletes and help them prepare for specific events.

Don't lead with the model.
This is something I kept running into as I explored the more advanced training platforms — most are essentially throwing vast amounts of unprocessed data at you.

That makes sense for tools aimed at coaches, sports scientists, and people who want to manipulate the data point by point themselves.

I want to avoid drowning people (including myself) in charts, and instead focus on answering a few simple questions:

Was this session useful?
What should I do next?
Am I progressing?
Am I ready for my goal?
How should I taper?

AI-first coaching.
The value in LLMs is that if we give them the data, they can process multiple models on the fly, cross-reference additional context we provide, and answer our questions in human terms.

It becomes a conversation, or a proactive nudge from a coach, rather than a festival of graphs.

Combine LLMs with inputs like HRV, stress, and sleep, and you get a much more holistic view of fatigue and form. This is the part I'm most excited about.

So what did you actually build?

For now... I've settled on Kontro et al.'s three-dimensional model, breaking down the training stress of every session across the three primary energy systems:

Aerobic — sustainable work below the second lactate threshold (Critical Power, or CP)
Glycolytic — finite anaerobic capacity above threshold (W′, pronounced W-prime)
Alactic — maximum short-duration power (Pmax)

But... I have rolled the two above-threshold systems (glycolytic and alactic) together, so what the athlete actually sees is just two dimensions:

Power — everything above the second lactate threshold
Endurance — everything below it

So the model stays grounded in science, but we surface what's most useful (while using less academic language).

The second lactate threshold is a clean physiological point for the split: below it you're in predominantly aerobic, mitochondrially-driven metabolism; above it you lean increasingly on anaerobic energy that can't be sustained for long.

Quite conveniently, many established coaching frameworks (Seiler's 80/20 polarised model, Stöggl & Sperlich's 2014 follow-up¹⁰, pyramidal training) already think in those binary above/below terms.

Note: I did consider "Speed" instead of "Power", as that would sound less biased to cycling. I'd be very interested in any feedback here.

How does this actually work

You do the work. Any session, any sport — a 5-hour gravel ride, a track run, an open-water swim, a strength block in the gym.

The session gets analysed. Pufferized pulls the activity, breaks it down by time-in-zone, and works out what part of it was above the second lactate threshold and what part was below. We prioritise power if you've got it, HR otherwise, and fall back to a basic RPE estimation if we must.

Load is attributed across two dimensions. A weighted calculation based on time and intensity gives us three numbers per session: Power Load, Endurance Load, and the Total (which is just the two added together).

To make that concrete — here's what the split looks like on two very different sessions:

| Session | TSS | Power Load | Endurance Load | What it built |
| --- | --- | --- | --- | --- |
| 1.5hr VO₂max intervals | 120 | 101 (84%) | 19 (16%) | Power |
| 4.9hr endurance ride | 120 | 17 (14%) | 103 (86%) | Endurance |

Same total stress. Completely different training stimulus. That's the whole point.
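One way to picture the split, as a deliberately simplified sketch and not the actual Pufferized weighting: score every second of a power file with a TSS-like stress value, then add it to whichever side of the threshold that second fell on. The function name, the per-second scoring rule and the example sessions are all assumptions made for illustration.

```python
def split_load(power_samples, threshold_power):
    """Split a TSS-like load into above- and below-threshold parts.

    power_samples holds one reading per second, in watts.
    """
    power_load = endurance_load = 0.0
    for watts in power_samples:
        # Scaled so that one hour exactly at threshold sums to 100.
        stress = (watts / threshold_power) ** 2 * 100 / 3600
        if watts > threshold_power:
            power_load += stress
        else:
            endurance_load += stress
    return {
        "power": power_load,
        "endurance": endurance_load,
        "total": power_load + endurance_load,
    }


# 20 min warm-up, then 5 x (4 min hard, 4 min easy), threshold at 250 W.
intervals = split_load([150] * 1200 + ([300] * 240 + [125] * 240) * 5, 250)
# Three steady hours well below threshold.
long_ride = split_load([160] * 10800, 250)
```

The interval session lands mostly in the power bucket and the long ride entirely in the endurance bucket, which is the same pattern as the table above.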

Power and Endurance trend at different rates. Power on a 28-day rolling average (neural and anaerobic systems come on fast and fade fast). Endurance on 42 days (aerobic adaptations build slower and persist longer). A 7-day acute average per dimension picks up whether each is currently building, maintaining or declining.
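A minimal sketch of that trend logic, assuming plain trailing means and a hypothetical 5% tolerance band for deciding what counts as "maintaining". The real thresholds are not stated in this post.

```python
def trailing_mean(values, window):
    """Mean of the last `window` days; missing earlier days count as zero."""
    return sum(values[-window:]) / window


def dimension_trend(daily_loads, chronic_days, acute_days=7, tolerance=0.05):
    """Compare the 7-day average against the longer one for a dimension."""
    chronic = trailing_mean(daily_loads, chronic_days)
    acute = trailing_mean(daily_loads, acute_days)
    if acute > chronic * (1 + tolerance):
        status = "building"
    elif acute < chronic * (1 - tolerance):
        status = "declining"
    else:
        status = "maintaining"
    return chronic, acute, status


# Power on 28 days: three quiet weeks, then a week of hard sessions.
power_trend = dimension_trend([10.0] * 21 + [25.0] * 7, chronic_days=28)
# Endurance on 42 days: five big weeks, then a light one.
endurance_trend = dimension_trend([50.0] * 35 + [20.0] * 7, chronic_days=42)
```

Here Power comes out as building and Endurance as declining, even though both belong to the same athlete in the same week.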

You chat to the coach. The coach has your full dimensional profile plus HRV, sleep, RPE, your journal entries, event details, training history, and recent conversations. Ask was that session useful, what should I do tomorrow, am I ready for Sunday — and you get a grounded human answer, not a number in a box.

Then... you get out there again and do whatever you do.

Durability, Power Curves and Fatigue

Alongside the core two-dimensional model, a few other signals get tracked and surfaced where they're most useful.

The Power Curve. Your best average power (or pace) at every duration from 5 seconds to 24 hours — a rolling 90-day best that extends whenever you set a new peak. This idea of a new 90-day peak is under-rated in my opinion, so it's now something we actively celebrate — you've pushed to a new ceiling. Like a new heart container in Zelda. If you're into that.

[Figure: The Power Curve chart from 5 seconds to 4 hours, with green ceiling-push markers flagging efforts that moved the 90-day peak]
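The curve itself is straightforward to compute from a power file. The sketch below assumes one sample per second and uses prefix sums so each window is a single subtraction. The helper names are hypothetical, and expiring peaks older than 90 days is left out.

```python
from itertools import accumulate


def best_average_power(power_samples, durations_s):
    """Best average power for each duration found in one activity."""
    prefix = [0.0] + list(accumulate(power_samples))
    curve = {}
    for duration in durations_s:
        if duration > len(power_samples):
            continue  # activity too short to say anything at this duration
        best_sum = max(
            prefix[start + duration] - prefix[start]
            for start in range(len(power_samples) - duration + 1)
        )
        curve[duration] = best_sum / duration
    return curve


def push_ceiling(rolling_best, session_curve):
    """Merge a session into the rolling best; report which peaks moved."""
    improved = [d for d, watts in session_curve.items()
                if watts > rolling_best.get(d, 0.0)]
    updated = {**rolling_best, **{d: session_curve[d] for d in improved}}
    return updated, improved


# A steady ride with one hard minute in the middle.
ride = [200] * 600 + [400] * 60 + [200] * 600
curve = best_average_power(ride, [5, 60, 300, 3600])
updated, improved = push_ceiling({5: 450.0, 60: 380.0, 300: 250.0}, curve)
```

Only the 60-second peak moves in this example, so that is the one duration that would earn a ceiling-push marker.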

Durability. As noted, this is emerging as a very exciting subject in sports science. For now we surface durability signals via aerobic decoupling: how far the power-to-HR ratio drifts between the first half and the second half of a ride. A falling decoupling trend on long efforts suggests your aerobic engine is holding up better under fatigue. We also look for any other hard sustained efforts late in an activity: if you smashed a PB on a climb after a 5-hour mission, that's a huge signal for durability.

[Figure: Durability panel showing longest ride, average decoupling across rides over two hours, and sustained 4h+ normalised power]
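Decoupling is commonly reported as the percentage drop in the power-to-HR ratio from the first half of a ride to the second. The sketch below assumes per-second samples and plain average power (many tools use normalised power instead), and the names are illustrative.

```python
def aerobic_decoupling(power_samples, hr_samples):
    """Percentage drop in power-per-heartbeat from first half to second."""
    half = len(power_samples) // 2

    def efficiency(power, heart_rate):
        # Average watts produced per beat per minute.
        return (sum(power) / len(power)) / (sum(heart_rate) / len(heart_rate))

    first = efficiency(power_samples[:half], hr_samples[:half])
    second = efficiency(power_samples[half:], hr_samples[half:])
    return (first - second) / first * 100


# Steady 200 W for an hour while heart rate drifts from 140 to 147 bpm.
drift = aerobic_decoupling([200] * 3600, [140] * 1800 + [147] * 1800)
```

Same power, higher heart rate in the second half: that ride decouples by a little under 5%. A lower number on a comparable ride a few months later would be the falling trend described above.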

Fatigue. No single fatigue score. Letting LLMs do what LLMs do best. The coach pieces fatigue together from the full picture — training patterns, HRV, sleep, RPE, the cold you mentioned yesterday — and gives you a grounded answer in plain language.

[Figure: Fatigue & Form panel with a plain-language AI note above Training Load, Sleep, HRV and Resting HR tiles]

How to test this

RPE and post-session notes on every session.
After any session on your plan, you can log how it actually felt (1–10 RPE) and add a short note. RPE catches what power and HR miss: the injury you're dealing with, the legs that felt heavy, the fact you were just out there having a nice time.

[Figure: "How did that go?" modal with a short note and an RPE dropdown set to 9 (Almost impossible)]

Structured session completion feedback.
When you tick a session off your plan, you get immediate context — what that session contributed to your Power and Endurance, how it fits into the week, and how it matched what was prescribed. You can then choose to generate a detailed session review...

[Figure: Session feedback card titled "The plan didn't survive contact", with ride stats and Endurance and Power dimension deltas]

Detailed session review report in the coach chat.
A proper breakdown generated right in the coach chat after any notable session — feedback on what you did, and any changes that might be needed in your plans. The perfect conversation starter for the coach.

[Figure: The Coach chat showing a /review me command with a rendered Athlete Review report]

A new status page.
I accept that we still need some charts. The status page lets you track your training trends and fitness forecasts leading up to events. You'll also see plain-language interpretations of the fuzzier stuff — fatigue, form, durability, and where you might want to focus your training next. This feeds straight back into the coach chat, so you can actually use it to discuss your plan.

[Figure: Fitness Dimensions chart showing Power, Endurance and Total lines over time, with Split, Volume and Adherence tiles beneath]

Slightly less wrong, slightly more useful

Still a model. Just a more useful one. Plenty of tweaks to come — and I'd like to hear what you think.

Thanks for reading.

References

1. Box GEP. Science and Statistics. J Am Stat Assoc. 1976;71(356):791–799.

2. Banister EW, Calvert TW, Savage MV, Bach T. A systems model of training for athletic performance. Aust J Sports Med. 1975;7:57–61.

3. Allen H, Coggan AR, McGregor S. Training and Racing with a Power Meter. 3rd ed. Boulder, CO: VeloPress; 2019. (The Performance Management Chart framework — TSS, CTL, ATL, TSB — was popularised via TrainingPeaks in the early 2000s and is trademarked by Peaksware.)

4. Hellard P, Avalos M, Lacoste L, Barale F, Chatard JC, Millet GP. Assessing the limitations of the Banister model in monitoring training. J Sports Sci. 2006;24(5):509–520. (Fitted τ values across 9 elite swimmers: τ fitness 38 ± 16d, τ fatigue 19 ± 11d; CoV >30% on most parameters.)

5. Kontro H, Mastracci A, Cheung SS, MacInnis MJ. The three-dimensional impulse-response model: modeling the training process in accordance with energy system-specific adaptation. PLoS One. 2026;21(2):e0341721.

6. Maunder E, Seiler S, Mildenhall MJ, Kilding AE, Plews DJ. The importance of 'durability' in the physiological profiling of endurance athletes. Sports Med. 2021;51(8):1619–1628.

7. Van Erp T, et al. Durability and repeatability of professional cyclists during a Grand Tour. Eur J Sport Sci. 2022;22(12).

8. Spragg J, Leo P, Swart J. The relationship between training characteristics and durability in professional cyclists across a competitive season. Eur J Sport Sci. 2023;23(4).

9. Seiler KS, Kjerland GØ. Quantifying training intensity distribution in elite endurance athletes: is there evidence for an "optimal" distribution? Scand J Med Sci Sports. 2006;16(1):49–56.

10. Stöggl T, Sperlich B. Polarized training has greater impact on key endurance variables than threshold, high intensity, or high volume training. Front Physiol. 2014;5:33.