Does it actually work?

Evaluation, the gotchas that almost invalidated everything, and what happened with real (play) money.

Part 3 of 3 · May 2026

Walk-forward evaluation

To check whether the model picks winners, I march it forward through history one week at a time. For each week, I retrain on everything up to that point and score names for the following week — names it has never seen. Tens of thousands of predictions, all on data the model didn't get to peek at.
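A minimal sketch of that loop, assuming the history is bucketed into integer week indices. The names `walk_forward_splits` and `min_train_weeks` are hypothetical and only illustrate the shape of the procedure, not the system's actual code.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    weeks: List[int], min_train_weeks: int = 1
) -> Iterator[Tuple[List[int], int]]:
    """Yield (training weeks, test week) pairs in chronological order.

    Every test week is paired only with strictly earlier weeks, so a model
    trained on the first element never sees the week it is scored on.
    """
    ordered = sorted(weeks)
    for i in range(min_train_weeks, len(ordered)):
        yield ordered[:i], ordered[i]


splits = list(walk_forward_splits([1, 2, 3, 4]))
for train_weeks, test_week in splits:
    # In a real run: retrain on train_weeks, then score names for test_week.
    print(f"train on {train_weeks}, test on week {test_week}")
```

The training window here expands every week ("everything up to that point"); a fixed-length rolling window would be a different design choice.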

The base rate is how often a randomly chosen stock hit a significant move. The model's most-confident picks hit that threshold roughly 1.5 to 1.7 times as often as the base rate, depending on the horizon and direction. That ratio is the lift.
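The lift figure above is the hit rate among the top-scored picks divided by the base rate. A small sketch under assumptions: `lift_at_top` and `top_frac` are hypothetical names, and the scores and outcomes are made-up toy data, not results from the system.

```python
from typing import Sequence


def lift_at_top(
    scores: Sequence[float], hits: Sequence[bool], top_frac: float = 0.1
) -> float:
    """Hit rate among the top-scored fraction divided by the overall base rate."""
    if len(scores) != len(hits) or not scores:
        raise ValueError("scores and hits must be non-empty and equal length")
    base_rate = sum(hits) / len(hits)
    if base_rate == 0:
        raise ValueError("base rate is zero, so lift is undefined")
    k = max(1, int(len(scores) * top_frac))
    # Rank predictions from most to least confident and keep the top k.
    ranked = sorted(zip(scores, hits), key=lambda pair: pair[0], reverse=True)
    top_hit_rate = sum(hit for _, hit in ranked[:k]) / k
    return top_hit_rate / base_rate


# Toy data: 4 of 10 names hit (base rate 40%), 3 of the top 5 hit (60%).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
toy_hits = [True, True, False, True, False, False, False, False, True, False]
toy_lift = lift_at_top(toy_scores, toy_hits, top_frac=0.5)
print(f"lift = {toy_lift:.2f}x")
```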

The lift is real but modest. The model isn't clairvoyant — it's slightly better than random in a systematic, consistent way. That's the whole game.

Gotcha #1: Slippage eats your returns

Here's the thing nobody tells you until you try to trade a model: the price you think you'll get and the price you actually get are different numbers. That gap is slippage, and for a system that trades small-cap names, it can eat your entire edge.

I ran a study: simulate buying and selling hundreds of stocks at different times of day, measure how far the fill price deviated from the expected price.

The result was dramatic. Slippage at the open was 3-4x worse than near the close. A round-trip at the open cost about 1% — and when your average winner is only a few percent, that's a devastating tax on every trade.
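One way to measure this, sketched under assumptions: slippage is taken as the adverse deviation of the fill from the expected price, as a fraction of the expected price, and a round trip sums the entry and exit legs. The function names, the prices, and the 3% "average winner" are illustrative placeholders, not figures from the study.

```python
def slippage_frac(expected: float, fill: float, side: str) -> float:
    """Adverse slippage as a fraction of the expected price (positive = worse fill)."""
    if side == "buy":
        return (fill - expected) / expected   # paying more than expected is a cost
    if side == "sell":
        return (expected - fill) / expected   # receiving less than expected is a cost
    raise ValueError(f"unknown side: {side!r}")


def round_trip_cost(
    buy_expected: float, buy_fill: float, sell_expected: float, sell_fill: float
) -> float:
    """Total slippage for one entry plus one exit."""
    return slippage_frac(buy_expected, buy_fill, "buy") + slippage_frac(
        sell_expected, sell_fill, "sell"
    )


# Made-up fills: 0.5% worse on entry and 0.5% worse on exit.
cost = round_trip_cost(10.00, 10.05, 10.30, 10.2485)

assumed_gross_winner = 0.03  # hypothetical "few percent" average winner
share_lost = cost / assumed_gross_winner
print(f"round trip {cost:.2%}, share of winner lost {share_lost:.0%}")
```

With these placeholder numbers, a 1% round trip takes about a third of a 3% winner before any losing trades are counted.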

The fix was simple once the data was clear: stop trading at the open. The system now runs inference late in the session using a live price snapshot and enters near close. Since the entry price is so close to the anchor the labels are computed from, the gap between "what the model predicted" and "what the trade actually does" is tiny.

Gotcha #2: The model lies to you if you let it

Most of the work isn't in the model — it's in making sure it can't peek at data the real bot wouldn't have. Every exciting result starts with "what leaked?" — and there's usually a real leak.

Some examples I caught:

Graph leakage. A graph built today quietly bakes in this week's price movements. If you use today's graph to test predictions for this week, the graph already knows the answer. Fix: rebuild the graph as it would have looked before the test period, for every test period, every time.
News cutoff drift. News timestamps need to be anchored at market close, timezone-aware. Get this wrong by even a few hours and features can see tomorrow's pre-market news.
Feature date off-by-one. All features must use data through the previous close only. The open price on the prediction date is future data — you don't know it when you'd be making the prediction.
Systemic day contamination. Broad market selloffs need special handling, and the flag for "yesterday was systemic" has to be shifted forward by one day — at market open you don't yet know if today will be systemic.

Each of these bugs, when present, inflated backtested accuracy by several percentage points. The model looked amazing. Then you'd fix the leak and the numbers would drop to something honest.
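Two of the leaks listed above (the feature date off-by-one and the systemic-day flag) come down to the same guard: only use values that were known before the prediction date. A minimal sketch, with hypothetical helper names and toy data; defaulting the first day's lagged flag to False is an assumption.

```python
from datetime import date
from typing import Dict, List


def lag_by_one(flags: List[bool]) -> List[bool]:
    """Shift a daily flag forward one day so position t holds the value for day t-1.

    The first day has no prior day, so it defaults to False (an assumption).
    """
    return [False] + flags[:-1] if flags else []


def closes_known_at(
    prediction_date: date, closes: Dict[date, float]
) -> Dict[date, float]:
    """Keep only closing prices from strictly before the prediction date."""
    return {d: px for d, px in closes.items() if d < prediction_date}


is_systemic = [False, True, False, False]   # was each day a broad selloff?
yesterday_was_systemic = lag_by_one(is_systemic)

closes = {
    date(2026, 1, 5): 10.0,
    date(2026, 1, 6): 10.4,
    date(2026, 1, 7): 10.1,   # same-day data, not known when predicting for Jan 7
}
usable = closes_known_at(date(2026, 1, 7), closes)
print(yesterday_was_systemic, sorted(usable))
```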

Gotcha #3: A good model doesn't mean good trades

A model with 1.7x lift on 1-day up predictions is great. But between "this stock will probably go up" and "I made money" sits an ocean of practical problems:

Position sizing. Putting too much in one high-conviction name is a great idea until it gaps down. Per-name weight caps are essential.
Cash management. You can't buy everything the model likes. A cash reserve plus a daily budget cap keeps the system from going all-in on Monday and having nothing left for Friday's better picks.
Small-cap traps. Very cheap stocks are volatile enough to trigger drawdown protection on normal fluctuation, forcing a liquidation that locks in losses. Adding a minimum price filter made a surprisingly large difference.
The honest backtest. A simulation with pretend prices prints beautiful numbers. The real test uses the same picking logic, same sizing, same cash limits, same "skip if it already moved." A green light only counts if the simulation would have made the same trades the live bot would.
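The sizing and cash rules above could be combined roughly as follows. This is a sketch, not the bot's logic: `allocate` and its parameters are hypothetical, and the cap, reserve, and budget percentages are placeholders. Only the $5 minimum price matches a number given later in this post.

```python
from typing import Dict, List, Tuple


def allocate(
    picks: List[Tuple[str, float, float]],  # (ticker, model score, last price)
    equity: float,
    cash: float,
    max_weight: float = 0.10,    # per-name cap as a fraction of equity (assumed)
    cash_reserve: float = 0.20,  # fraction of equity never spent (assumed)
    daily_budget: float = 0.25,  # fraction of equity spendable per day (assumed)
    min_price: float = 5.0,      # skip very cheap names
) -> Dict[str, float]:
    """Return dollars to spend per ticker, filling the highest scores first."""
    spendable = min(cash - cash_reserve * equity, daily_budget * equity)
    orders: Dict[str, float] = {}
    for ticker, _score, price in sorted(picks, key=lambda p: p[1], reverse=True):
        if spendable <= 0:
            break                 # budget exhausted for today
        if price < min_price:
            continue              # minimum price filter
        dollars = min(max_weight * equity, spendable)
        orders[ticker] = dollars
        spendable -= dollars
    return orders


toy_picks = [("AAA", 0.9, 12.0), ("BBB", 0.8, 3.5), ("CCC", 0.7, 40.0), ("DDD", 0.6, 8.0)]
orders = allocate(toy_picks, equity=10_000.0, cash=10_000.0)
print(orders)
```

Running the same function in both the backtest and the live bot is one way to keep the simulation honest, since both then apply identical caps and filters.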

Gotcha #4: Cumulative returns lie

This one almost cost me weeks of wrong decisions. I ran a parameter sweep — dozens of configurations simulated over 23 weeks — and sorted by cumulative return to find the "best" config. The winner was obvious: +232% versus the runner-up's +143%. Case closed, right?

Wrong. When I dug into the weekly breakdown, the +232% config had a single week that returned +48.7%. That one week — the very first week, where the model was essentially bootstrapping with no history — contributed 47% of the entire cumulative return. Remove it and the "winner" drops to +123%, well below the runner-up.

The problem is compounding. An early outlier inflates everything after it because all subsequent returns multiply on a larger base. A config that gets lucky once on week 1 will dominate any sweep sorted by cumulative return, even if it underperforms in 20 of 23 weeks.
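The arithmetic behind those figures can be checked directly. Because weekly returns compound multiplicatively, removing a week means dividing its growth factor back out. The inputs below are the rounded numbers quoted above; the variable names are only for illustration.

```python
cumulative = 2.32   # +232% cumulative return, from the text
best_week = 0.487   # the +48.7% first week, from the text

growth = 1 + cumulative                     # 3.32x starting equity
growth_without = growth / (1 + best_week)   # divide the outlier week back out
cumulative_without = growth_without - 1     # about +123%

# Share of the headline number that disappears with that one week.
share = (cumulative - cumulative_without) / cumulative
print(f"{cumulative_without:+.0%}, top-week share {share:.0%}")
```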

The fix was adding outlier-resistant metrics to the sweep analysis:

Median weekly return — what does a "normal" week look like? (Both configs were ~4%, essentially tied)
Trimmed cumulative — drop the top-2 and bottom-2 weeks, then compound what's left
Leave-one-out minimum — the worst cumulative return you'd get if you removed any single week. If removing one week tanks your result, your "edge" is really just luck.
Top-1 share — what percentage of your cumulative return comes from the single best week? Above 30% is a red flag.
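A sketch of these four metrics over a list of weekly returns. The function names are hypothetical, and the top-1 share definition (how much of the cumulative return disappears when the best week is removed) is an assumption chosen to match the 47% arithmetic above. The sample weeks are made up.

```python
from math import prod
from statistics import median
from typing import List


def cumulative(returns: List[float]) -> float:
    """Compound a series of periodic returns into one cumulative return."""
    return prod(1 + r for r in returns) - 1


def trimmed_cumulative(returns: List[float], k: int = 2) -> float:
    """Drop the k best and k worst weeks, then compound what is left."""
    if len(returns) <= 2 * k:
        raise ValueError("not enough weeks to trim")
    return cumulative(sorted(returns)[k : len(returns) - k])


def leave_one_out_min(returns: List[float]) -> float:
    """Worst cumulative return after removing any single week."""
    return min(
        cumulative(returns[:i] + returns[i + 1 :]) for i in range(len(returns))
    )


def top1_share(returns: List[float]) -> float:
    """Fraction of the cumulative return that vanishes without the best week."""
    total = cumulative(returns)
    without_best = (1 + total) / (1 + max(returns)) - 1
    return (total - without_best) / total


toy_weeks = [0.50, 0.02, -0.01, 0.03, 0.01, 0.02]  # one lucky week, then small moves
print(f"cumulative        {cumulative(toy_weeks):+.1%}")
print(f"median weekly     {median(toy_weeks):+.1%}")
print(f"trimmed (k=1)     {trimmed_cumulative(toy_weeks, k=1):+.1%}")
print(f"leave-one-out min {leave_one_out_min(toy_weeks):+.1%}")
print(f"top-1 share       {top1_share(toy_weeks):.0%}")
```

On the toy series the headline cumulative return looks strong, while the median, trimmed, and leave-one-out figures all show that most of it comes from one week.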

After switching to these metrics, the "winner" changed — and the more boring, consistent config (+143%, Sharpe 4.31, no single week contributing more than 14% of total returns) is what's now running in production.

The meta-lesson: cumulative return is the metric everyone reports, but it's the worst metric for comparing configurations. It rewards variance, not skill.

Gotcha #5: Retraining can make things worse

Every Saturday the system retrains. New data, new hyperparameters, fresh model. But "new" doesn't mean "better" — sometimes the new model overfits a recent regime or loses sensitivity to patterns the old one caught.

The fix is a promotion gate: the new model and the old model both run a backtest on the same recent period. If the new model doesn't beat the old one, nothing changes. Monday's automation picks up whichever model is current.

This sounds obvious in retrospect. Without the gate, every retrain is a coin flip disguised as progress.
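The gate itself is only a few lines. A minimal sketch, assuming a `backtest` callable that scores any model on the same recent period; `promote`, the model labels, and the scores are hypothetical.

```python
from typing import Callable, Dict, TypeVar

Model = TypeVar("Model")


def promote(
    current: Model, candidate: Model, backtest: Callable[[Model], float]
) -> Model:
    """Return the candidate only if it strictly beats the current model."""
    return candidate if backtest(candidate) > backtest(current) else current


# Stand-in scores, e.g. median weekly return on the shared backtest window.
toy_scores: Dict[str, float] = {"old": 0.040, "new_a": 0.035, "new_b": 0.046}


def score(name: str) -> float:
    return toy_scores[name]


kept = promote("old", "new_a", score)       # worse candidate: nothing changes
promoted = promote("old", "new_b", score)   # better candidate: it takes over
print(kept, promoted)
```

Requiring a strict improvement means ties keep the existing model, which avoids churn when the two are indistinguishable.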

The backtest journey

The numbers didn't start good. The first honest backtest — with all leakage fixed — came in at roughly -10% over the test period. The model had an edge on paper, but the trading simulation lost money.

Tuning the model and the trading parameters (hold period, position sizing, how many picks per day) got it to about +70%. That felt great, until I added realistic slippage. With open-of-day entry and proper spread modeling, it crashed to -40%, worse than where it started. Slippage at the open was eating the entire edge and then some.

Three changes turned it around: moving entry to near market close (cutting round-trip slippage by ~70%), adding a reranker to improve precision at the top of the list, and filtering out stocks under $5 (which were volatile enough to trigger false drawdown kills). The result over 23 weeks:

Cumulative return: +143%
Sharpe ratio: 4.31
Max drawdown: -11.5%
Trades: 303

Dec 6, 2025 – May 9, 2026. 23-week walk-forward backtest with weekly retraining, realistic slippage, close-mode entry. $10k starting equity.

These are not the biggest numbers in the sweep — a different configuration hit +232%. But as I described in Gotcha #4 above, that config's returns were 47% attributable to a single lucky week. The numbers here come from the most consistent config: median weekly return of 4.0%, no single week contributing more than 14% of the total, Calmar ratio of 55.7.
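The post does not state how the Calmar ratio was computed. A common definition is annualized return divided by the absolute maximum drawdown, and the sketch below assumes that definition, compounding annualization, and 52 weeks per year. With the rounded headline inputs it lands near 56, close to the reported 55.7. The function names and the toy equity curve are illustrative.

```python
from typing import List


def max_drawdown(equity: List[float]) -> float:
    """Largest peak-to-trough decline of an equity curve, as a negative fraction."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = min(worst, (value - peak) / peak)
    return worst


def calmar(cum_return: float, weeks: int, max_dd: float) -> float:
    """Annualized return over absolute max drawdown (assumed definition)."""
    annualized = (1 + cum_return) ** (52 / weeks) - 1
    return annualized / abs(max_dd)


toy_dd = max_drawdown([100.0, 110.0, 99.0, 120.0])   # 110 -> 99 is a 10% drop
headline = calmar(cum_return=1.43, weeks=23, max_dd=-0.115)
print(f"toy drawdown {toy_dd:.1%}, Calmar from headline numbers {headline:.1f}")
```

Annualizing 23 weeks of returns extrapolates a short sample to a full year, so the ratio is best read as a comparison between configurations rather than a forecast.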

A fair objection: the last six weeks of that period included a strong market rally that flatters the numbers. Here are the metrics for the 17 weeks before the rally kicked in:

Cumulative return: +56%
Sharpe ratio: 3.27
Max drawdown: -7.5%
Pre-rally period: 17 wks

Dec 6, 2025 – Mar 28, 2026, before the April rally. Same configuration, same slippage model.

Still strong, and arguably more honest — the system was making money in a choppy market, not just riding a rising tide.

A caveat on these numbers

This backtest covers roughly the last year of market data — that's all I have access to on free-tier data plans. It doesn't include 2022's bear market or 2023's rate-hike volatility. I honestly don't know how the system would behave in a sustained crash or a prolonged sideways grind. Paper trading is the next best thing to find out.

The strategy in plain English: let the winners run, don't let the losers run at all.

Paper trading

The system now runs on a paper-trading account, fully automated. A handful of new buys per day, short hold period, capped concurrent positions. Entry near close with bracket orders for downside protection.

What I've learned

The graph is half the magic. The same model on the same numbers works okay flat, but noticeably better when it can see who each company is connected to. Graph neighborhood features consistently rank among the most important.
The LLM reads filings, not trades. An LLM extracts relationships from SEC filings (who supplies whom, who competes with whom), but it doesn't pick stocks or size positions. I tried using it for allocation, but you can't honestly backtest an LLM that knows the future: its training data may already cover the period being tested. So it stays in the data pipeline, not the decision loop.
Build a small army of analysis scripts. Every claim that something was "better" needed its own script: simulate a week, line up two models on the same days, count winners by hold length, slice picks by confidence. The system got good because I had a dozen ways to ask "is this actually working?" and the answers slowly stopped disagreeing.
Don't trust the headline number. Cumulative return is a terrible comparison metric. One lucky week early in the backtest inflates everything after it via compounding. Median weekly return, trimmed cumulative, and leave-one-out analysis tell you whether you have a genuine edge or just a fortunate accident.
The edge is small. 1.5x lift means you're still wrong more often than you're right. But being slightly right, consistently, over hundreds of trades — that's enough. The returns come from volume and discipline, not from any single brilliant pick.

Where this goes next

The system is running on play money — fully automated, retraining weekly, promoting only when the new model proves itself. Real money starts June 4th, when the PDT rule goes away and I can start small instead of needing $25k upfront.

The meta-point I keep coming back to: the gap between "I know graphs and ML" and "I'm running an automated trading system" used to be enormous, and most of it was domain knowledge. That gap is much smaller now. Fantasy football, but for stocks. Without ever having watched a game.

This series

Part 1: Fantasy football, but for stocks
Part 2: The data and the graph
Part 3: Does it actually work?