Does it actually work?

Evaluation, the gotchas that almost invalidated everything, and what happened with real (play) money.

Part 3 of 3 · May 2026


Walk-forward evaluation

To check whether the model picks winners, I march it forward through history one week at a time. For each week, I retrain on everything up to that point and score names for the following week — names it has never seen. Tens of thousands of predictions, all on data the model didn't get to peek at.
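A minimal sketch of the weekly split logic, not the actual pipeline code. The `Fold` type and `walk_forward_folds` helper are hypothetical names, and the start date and week count are borrowed from the backtest period quoted later in this post.

```python
from datetime import date, timedelta
from typing import Iterator, NamedTuple


class Fold(NamedTuple):
    train_end: date   # last day (inclusive) the model may train on
    test_start: date  # first day of the held-out week
    test_end: date    # last day (inclusive) of the held-out week


def walk_forward_folds(first_test_start: date, n_weeks: int) -> Iterator[Fold]:
    """Yield one fold per week; training data always ends before the test week begins."""
    for i in range(n_weeks):
        test_start = first_test_start + timedelta(weeks=i)
        yield Fold(
            train_end=test_start - timedelta(days=1),
            test_start=test_start,
            test_end=test_start + timedelta(days=6),
        )


# Assumed dates: 23 weekly folds starting Dec 6, 2025.
folds = list(walk_forward_folds(date(2025, 12, 6), n_weeks=23))
for fold in folds[:2]:
    print(fold)
```

For each fold, the model would be retrained on everything up to `train_end` and scored only on the test week, so no prediction is made on data the model has already seen.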

The base rate is how often a randomly chosen stock makes a significant move. The model's most-confident picks hit that threshold roughly 1.5 to 1.7 times as often as the base rate, depending on the horizon and direction.
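To make "lift" concrete, here is a small sketch of the calculation: the hit rate among the top-scored picks divided by the hit rate across everything. The function name and the toy scores are made up for illustration.

```python
from typing import Sequence


def lift_at_top_k(scores: Sequence[float], hits: Sequence[int], k: int) -> float:
    """Hit rate among the k highest-scored picks divided by the overall hit rate."""
    if len(scores) != len(hits):
        raise ValueError("scores and hits must have the same length")
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of predictions")
    base_rate = sum(hits) / len(hits)
    if base_rate == 0:
        raise ValueError("lift is undefined when nothing hit the threshold")
    # Rank predictions from most to least confident.
    ranked = sorted(zip(scores, hits), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(hit for _, hit in ranked[:k]) / k
    return top_rate / base_rate


# Toy data: 4 of 10 names hit the threshold, 3 of the top 4 picks did.
scores = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
hits = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
lift = lift_at_top_k(scores, hits, k=4)
print(f"lift = {lift:.3f}")  # 0.75 / 0.40 = 1.875
```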

The lift is real but modest. The model isn't clairvoyant — it's slightly better than random in a systematic, consistent way. That's the whole game.


Gotcha #1: Slippage eats your returns

Here's the thing nobody tells you until you try to trade a model: the price you think you'll get and the price you actually get are different numbers. That gap is slippage, and for a system that trades small-cap names, it can eat your entire edge.

I ran a study: simulate buying and selling hundreds of stocks at different times of day, measure how far the fill price deviated from the expected price.

The result was dramatic. Slippage at the open was 3-4x worse than near the close. A round-trip at the open cost about 1% — and when your average winner is only a few percent, that's a devastating tax on every trade.
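A rough sketch of how slippage can be measured per fill and added up over a round trip. The helper names and the prices are illustrative assumptions, chosen so the round trip lands on the roughly 1% figure above.

```python
def slippage_bps(expected_price: float, fill_price: float, side: str) -> float:
    """Signed cost in basis points; positive means the fill was worse than expected."""
    if side not in ("buy", "sell"):
        raise ValueError("side must be 'buy' or 'sell'")
    # Paying more hurts a buyer; receiving less hurts a seller.
    direction = 1 if side == "buy" else -1
    return direction * (fill_price - expected_price) / expected_price * 10_000


def round_trip_cost_pct(entry_bps: float, exit_bps: float) -> float:
    """Approximate total cost of entry plus exit, in percent of the position."""
    return (entry_bps + exit_bps) / 100


# Hypothetical fills: expected $10.00 both ways, filled 5 cents worse each time.
buy_cost = slippage_bps(10.00, 10.05, "buy")
sell_cost = slippage_bps(10.00, 9.95, "sell")
round_trip = round_trip_cost_pct(buy_cost, sell_cost)
print(f"round trip costs about {round_trip:.2f}% of the position")
```

Against a hypothetical 3% winner, a 1% round trip takes a third of the gain before the trade has done anything.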

The fix was simple once the data was clear: stop trading at the open. The system now runs inference late in the session using a live price snapshot and enters near close. Since the entry price is so close to the anchor the labels are computed from, the gap between "what the model predicted" and "what the trade actually does" is tiny.

Gotcha #2: The model lies to you if you let it

Most of the work isn't in the model — it's in making sure it can't peek at data the real bot wouldn't have. Every exciting result starts with "what leaked?" — and there's usually a real leak.

Some examples I caught:

Each of these bugs, when present, inflated backtested accuracy by several percentage points. The model looked amazing. Then you'd fix the leak and the numbers would drop to something honest.
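One generic guard against this kind of leak is a point-in-time filter: tag every value with the moment the live bot could first have seen it, and refuse anything newer than the decision time. A minimal sketch, where `Observation`, `available_at`, and both helper names are hypothetical and not taken from the actual system:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Sequence


@dataclass(frozen=True)
class Observation:
    symbol: str
    value: float
    available_at: datetime  # when the live bot could first have seen this value


def visible_as_of(observations: Sequence[Observation], as_of: datetime) -> List[Observation]:
    """Keep only the observations that already existed at decision time."""
    return [o for o in observations if o.available_at <= as_of]


def assert_no_lookahead(observations: Sequence[Observation], as_of: datetime) -> None:
    """Fail loudly if any input is newer than the decision time."""
    leaked = [o for o in observations if o.available_at > as_of]
    if leaked:
        raise ValueError(f"{len(leaked)} observation(s) are newer than {as_of}")


# Toy example: a decision at 15:45 must not see the 16:00 closing price.
decision_time = datetime(2026, 5, 8, 15, 45)
raw = [
    Observation("ABC", 10.02, datetime(2026, 5, 8, 15, 40)),  # live snapshot
    Observation("ABC", 10.10, datetime(2026, 5, 8, 16, 0)),   # official close
]
features = visible_as_of(raw, decision_time)
assert_no_lookahead(features, decision_time)
```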

Gotcha #3: A good model doesn't mean good trades

A model with 1.7x lift on 1-day up predictions is great. But between "this stock will probably go up" and "I made money" sits an ocean of practical problems:

Gotcha #4: Cumulative returns lie

This one almost cost me weeks of wrong decisions. I ran a parameter sweep — dozens of configurations simulated over 23 weeks — and sorted by cumulative return to find the "best" config. The winner was obvious: +232% versus the runner-up's +143%. Case closed, right?

Wrong. When I dug into the weekly breakdown, the +232% config had a single week that returned +48.7%. That one week — the very first week, where the model was essentially bootstrapping with no history — contributed 47% of the entire cumulative return. Remove it and the "winner" drops to +123%, well below the runner-up.

The problem is compounding. An early outlier inflates everything after it because all subsequent returns multiply on a larger base. A config that gets lucky once in week 1 will dominate any sweep sorted by cumulative return, even if it underperforms in 20 of 23 weeks.
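The arithmetic is easy to check by dividing the outlier week back out of the compounded total. In this sketch the function name is made up, and "share of the total" is read as the fraction of cumulative return that disappears when the week is removed, which is an assumption that happens to reproduce the 47% figure.

```python
def return_without_week(cumulative_return: float, week_return: float) -> float:
    """Cumulative return after dividing one week's growth factor back out."""
    return (1 + cumulative_return) / (1 + week_return) - 1


total = 2.32   # +232% over the full sweep period
lucky = 0.487  # the single +48.7% week

rest = return_without_week(total, lucky)
share = (total - rest) / total
print(f"{rest:+.0%}, {share:.0%}")  # +123%, 47%
```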

The fix was adding outlier-resistant metrics to the sweep analysis:

After switching to these metrics, the "winner" changed — and the more boring, consistent config (+143%, Sharpe 4.31, no single week contributing more than 14% of total returns) is what's now running in production.
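Here is a rough sketch of how the metrics mentioned in this post (median weekly return, Sharpe, and the largest share of the total owed to one week) could be computed from a list of weekly returns. The function names, the zero risk-free rate, the 52-week annualization, and the toy numbers are all assumptions for illustration.

```python
import math
import statistics
from typing import Sequence


def cumulative_return(weekly: Sequence[float]) -> float:
    """Compound a series of weekly returns into one total return."""
    growth = 1.0
    for r in weekly:
        growth *= 1 + r
    return growth - 1


def sharpe_ratio(weekly: Sequence[float], periods_per_year: int = 52) -> float:
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    return statistics.mean(weekly) / statistics.stdev(weekly) * math.sqrt(periods_per_year)


def max_week_share(weekly: Sequence[float]) -> float:
    """Largest fraction of the cumulative return that vanishes when one week is removed."""
    total = cumulative_return(weekly)
    if total == 0:
        raise ValueError("share is undefined when the cumulative return is zero")
    shares = []
    for r in weekly:
        without = (1 + total) / (1 + r) - 1
        shares.append((total - without) / total)
    return max(shares)


# Toy configs: one lucky week versus steady small gains.
lucky = [0.45, 0.01, 0.00, 0.01, 0.02, 0.01, 0.00, 0.01]
steady = [0.04, 0.03, 0.05, 0.02, 0.04, 0.03, 0.05, 0.04]

for name, weekly in (("lucky", lucky), ("steady", steady)):
    print(
        f"{name}: cumulative {cumulative_return(weekly):+.1%}, "
        f"median week {statistics.median(weekly):+.1%}, "
        f"Sharpe {sharpe_ratio(weekly):.1f}, "
        f"max week share {max_week_share(weekly):.0%}"
    )
```

In this toy data the lucky config has the higher cumulative return, while the steady config has the better median week, the better Sharpe, and the smaller single-week share.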

The meta-lesson: cumulative return is the metric everyone reports, but it's the worst metric for comparing configurations. It rewards variance, not skill.

Gotcha #5: Retraining can make things worse

Every Saturday the system retrains. New data, new hyperparameters, fresh model. But "new" doesn't mean "better" — sometimes the new model overfits a recent regime or loses sensitivity to patterns the old one caught.

The fix is a promotion gate: the new model and the old model both run a backtest on the same recent period. If the new model doesn't beat the old one, nothing changes. Monday's automation picks up whichever model is current.

This sounds obvious in retrospect. Without the gate, every retrain is a coin flip disguised as progress.
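The gate itself can be tiny. A minimal sketch, assuming a hypothetical `backtest_score` callable that runs both models over the same recent period and returns one number where higher is better:

```python
from typing import Callable, TypeVar

Model = TypeVar("Model")


def promotion_gate(
    incumbent: Model,
    candidate: Model,
    backtest_score: Callable[[Model], float],
) -> Model:
    """Return the model to deploy: the candidate only if it strictly beats the incumbent."""
    if backtest_score(candidate) > backtest_score(incumbent):
        return candidate
    return incumbent  # ties and losses change nothing


# Toy usage: models are just labels, scores come from a lookup table.
scores = {"old": 0.12, "new_good": 0.15, "new_bad": 0.09}
print(promotion_gate("old", "new_good", scores.__getitem__))
print(promotion_gate("old", "new_bad", scores.__getitem__))
```

Keeping the incumbent on a tie is a deliberate choice here: a new model has to earn its place rather than get it by default.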


The backtest journey

The numbers didn't start good. The first honest backtest — with all leakage fixed — came in at roughly -10% over the test period. The model had an edge on paper, but the trading simulation lost money.

Tuning the model and the trading parameters (hold period, position sizing, how many picks per day) got it to about +70%. That felt great, until I added realistic slippage. With open-of-day entry and proper spread modeling, it crashed to -40%. Slippage at the open was eating the entire edge and then some.

Three changes turned it around: moving entry to near market close (cutting round-trip slippage by ~70%), adding a reranker to improve precision at the top of the list, and filtering out stocks under $5 (which were volatile enough to trigger false drawdown kills). The result over 23 weeks:

Cumulative return: +143%
Sharpe ratio: 4.31
Max drawdown: -11.5%
Trades: 303

Dec 6, 2025 – May 9, 2026. 23-week walk-forward backtest with weekly retraining, realistic slippage, close-mode entry. $10k starting equity.

These are not the biggest numbers in the sweep — a different configuration hit +232%. But as I described in Gotcha #4 above, that config's returns were 47% attributable to a single lucky week. The numbers here come from the most consistent config: median weekly return of 4.0%, no single week contributing more than 14% of the total, Calmar ratio of 55.7.
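For reference, a sketch of how max drawdown and a Calmar ratio are commonly computed. The post doesn't say how its Calmar figure was annualized, so geometric annualization over 52 weeks is an assumption here, and the function names and toy equity curve are made up.

```python
from typing import Sequence


def max_drawdown(equity: Sequence[float]) -> float:
    """Worst peak-to-trough decline of an equity curve, as a negative fraction."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = min(worst, value / peak - 1)
    return worst


def annualized_return(total_return: float, weeks: int, weeks_per_year: float = 52.0) -> float:
    """Geometric annualization of a total return earned over a number of weeks."""
    return (1 + total_return) ** (weeks_per_year / weeks) - 1


def calmar_ratio(total_return: float, weeks: int, max_dd: float) -> float:
    """Annualized return divided by the size of the max drawdown."""
    return annualized_return(total_return, weeks) / abs(max_dd)


# Toy equity curve: the worst drop is from 11,000 down to 9,900.
toy_drawdown = max_drawdown([10_000, 11_000, 9_900, 12_000])
print(f"toy max drawdown: {toy_drawdown:.1%}")

# Headline figures from this post: +143% over 23 weeks, -11.5% max drawdown.
calmar = calmar_ratio(1.43, 23, -0.115)
print(f"Calmar under these assumptions: {calmar:.1f}")
```

Under these assumptions the ratio comes out at roughly 56, in the same range as the 55.7 quoted above. The small gap is plausibly down to rounding in the published inputs.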

A fair objection: the last six weeks of that period included a strong market rally that flatters the numbers. Here are the metrics for the 17 weeks before the rally kicked in:

Cumulative return: +56%
Sharpe ratio: 3.27
Max drawdown: -7.5%
Pre-rally period: 17 weeks

Dec 6, 2025 – Mar 28, 2026, before the April rally. Same configuration, same slippage model.

Still strong, and arguably more honest — the system was making money in a choppy market, not just riding a rising tide.

A caveat on these numbers

This backtest covers roughly the last year of market data, which is all I have access to on free-tier data plans. It doesn't include 2022's bear market or 2023's rate-hike volatility. I honestly don't know how the system would behave in a sustained crash or a prolonged sideways grind. Paper trading is the next best way to find out.

The strategy in plain English: let the winners run, don't let the losers run at all.

Paper trading

The system now runs on a paper-trading account, fully automated. A handful of new buys per day, short hold period, capped concurrent positions. Entry near close with bracket orders for downside protection.


What I've learned

Where this goes next

The system is running on play money — fully automated, retraining weekly, promoting only when the new model proves itself. Real money starts June 4th, when the PDT rule goes away and I can start small instead of needing $25k upfront.

The meta-point I keep coming back to: the gap between "I know graphs and ML" and "I'm running an automated trading system" used to be enormous, and most of it was domain knowledge. That gap is much smaller now. Fantasy football, but for stocks. Without ever having watched a game.

This series
Part 1: Fantasy football, but for stocks
Part 2: The data and the graph
Part 3: Does it actually work?