The data and the graph

How ~3,000 stocks become a network, and how that network becomes a trading signal.

Part 2 of 3 · May 2026


Four data sources, one graph

The system watches the Russell 3000 — roughly 3,000 US stocks. Every trading day, it pulls data from several sources:

The filings get an extra pass: an LLM reads each one and extracts structured relationships between companies — who's a supplier, customer, competitor. These are cached so the LLM only runs once per filing.
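The "runs once per filing" behaviour can be sketched as a content-addressed cache. This is a minimal illustration, not the system's actual code: `extract_cached`, the on-disk JSON layout, and the `extract_fn` callable standing in for the LLM call are all assumptions.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def extract_cached(filing_text: str,
                   extract_fn: Callable[[str], list],
                   cache_dir: str) -> list:
    """Return extracted relationships, calling extract_fn at most once per filing.

    extract_fn is a hypothetical stand-in for the LLM extraction call; it must
    return JSON-serialisable data (here: a list of dicts).
    """
    # Key the cache on the filing's content, so re-downloads of the same text hit.
    key = hashlib.sha256(filing_text.encode("utf-8")).hexdigest()
    path = Path(cache_dir) / f"{key}.json"

    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))

    relations = extract_fn(filing_text)  # the expensive step
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(relations), encoding="utf-8")
    return relations
```

Hashing the text rather than a filing ID is a design choice made for the sketch: an amended filing gets a new key and is re-extracted, while an unchanged one never is.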

Building the graph

Every day, the system builds a directed graph over a rolling window. Three types of edges connect companies:

Price co-movement edges

If stock A has a big move on day D, and stock B has a big move shortly after, that's a co-movement event. "Big move" is adaptive per stock. If this lead-lag pattern repeats enough times in the window, a directed edge is created from A to B. The direction matters: A leading B is independent of B leading A.
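A rough sketch of the lead-lag counting described above. Everything specific here is an assumption made for illustration: "big move" is taken to mean an absolute return above `k` times that stock's own standard deviation in the window, "shortly after" means within `max_lag` trading days, and `min_events` is the repeat threshold. The real definitions may differ.

```python
from statistics import pstdev


def big_move_days(returns, k=2.0):
    """Indices of days whose absolute return exceeds k * this stock's own std."""
    sigma = pstdev(returns)
    if sigma == 0:
        return set()
    return {i for i, r in enumerate(returns) if abs(r) > k * sigma}


def lead_lag_edges(returns_by_ticker, max_lag=2, min_events=3, k=2.0):
    """Return {(leader, follower): event_count} for pairs that repeat often enough.

    returns_by_ticker maps ticker -> list of daily returns over the same window.
    """
    moves = {t: big_move_days(r, k) for t, r in returns_by_ticker.items()}
    edges = {}
    for a, days_a in moves.items():
        for b, days_b in moves.items():
            if a == b:
                continue
            # Count A's big-move days that are followed by a B big move
            # within 1..max_lag days. (A, B) and (B, A) are counted separately.
            count = sum(
                1 for d in days_a
                if any(d + lag in days_b for lag in range(1, max_lag + 1))
            )
            if count >= min_events:
                edges[(a, b)] = count
    return edges
```

The all-pairs loop is quadratic in the number of tickers, which is fine for a sketch but would need pruning or vectorising at Russell 3000 scale.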

News co-occurrence edges

When two tickers appear in the same news headline repeatedly, they get a news edge. The news cutoff is timezone-aware and anchored at market close — so a graph built on Monday doesn't accidentally include Tuesday's pre-market news.
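The cutoff logic might look something like this. It is a sketch under assumptions: a 16:00 close, headlines as `(published_at, text)` tuples with timezone-aware timestamps, and hypothetical function names. In practice `market_tz` would likely be `zoneinfo.ZoneInfo("America/New_York")` so that daylight saving is handled.

```python
from datetime import date, datetime, time, tzinfo


def news_cutoff(graph_date: date, market_tz: tzinfo,
                close: time = time(16, 0)) -> datetime:
    """Timezone-aware market close for the given graph date (close time assumed)."""
    return datetime.combine(graph_date, close, tzinfo=market_tz)


def headlines_for_graph(headlines, graph_date: date, market_tz: tzinfo):
    """Keep only headlines published at or before the close on graph_date.

    headlines: iterable of (published_at, text); published_at must be
    timezone-aware so comparisons across zones are well defined.
    """
    cutoff = news_cutoff(graph_date, market_tz)
    return [h for h in headlines if h[0] <= cutoff]
```

Comparing aware datetimes rather than calendar dates is the point: a headline stamped late Monday in UTC can already be after the New York close, and a date-only filter would let it through.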

Filing relationship edges

These come from the LLM extraction step. "AAPL is a customer of TSM" becomes a directed edge with a confidence score. Only relationships above a confidence threshold make it in.
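As a sketch, the thresholding step could be as small as the following. The `Relation` shape, the edge direction (from the company the statement is about to the company it names), and the `0.7` default are assumptions for illustration, not the system's actual values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    source: str        # e.g. "AAPL"
    target: str        # e.g. "TSM"
    kind: str          # e.g. "customer_of" (hypothetical label)
    confidence: float  # extraction confidence in [0, 1]


def filing_edges(relations, min_confidence=0.7):
    """Return {(source, target, kind): confidence} for confident relations.

    If the same relation is extracted from several filings, the highest
    confidence is kept.
    """
    best = {}
    for r in relations:
        if r.confidence < min_confidence:
            continue
        key = (r.source, r.target, r.kind)
        best[key] = max(best.get(key, 0.0), r.confidence)
    return best
```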

     Price edges        News edges        Filing edges
     (lead-lag)       (co-mentioned)     (LLM-extracted)
          |                  |                  |
          v                  v                  v
┌─────────────────────────────────────────────────────────┐
│                 Directed temporal graph                 │
│     thousands of nodes, tens of thousands of edges      │
│            rebuilt daily on a rolling window            │
└─────────────────────────────────────────────────────────┘

The graph is rebuilt from scratch each day, so it always reflects the most recent relationships. The key design choice: price co-movement is the backbone. News and filing edges enrich the graph with context — sentiment, supply-chain structure — but a pair of stocks only stays in the graph if they also have a co-movement edge. If two companies are mentioned in the same headline but their prices never lead or lag each other, that edge is dropped. Stocks with no co-movement connections at all are removed entirely and aren't even sent to model training. The model only sees stocks the graph says are part of an active lead-lag network.
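The backbone rule can be sketched with plain sets. One assumption to flag: the text doesn't say whether the co-movement edge has to point the same way as the news or filing edge, so this version accepts a price edge in either direction. The function name is hypothetical.

```python
def prune_to_backbone(price_edges, context_edges):
    """Apply the 'price co-movement is the backbone' rule.

    price_edges, context_edges: sets of directed (src, dst) ticker pairs.
    Returns (kept_context_edges, surviving_nodes).
    """
    # Pairs that have a co-movement edge in either direction (assumption).
    connected_pairs = {frozenset(edge) for edge in price_edges}

    # A news/filing edge survives only if its pair also co-moves.
    kept_context = {e for e in context_edges if frozenset(e) in connected_pairs}

    # Stocks with no co-movement edge at all drop out of the graph entirely.
    nodes = {ticker for edge in price_edges for ticker in edge}
    return kept_context, nodes
```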

From graph to features

For each stock on each day, the system extracts features from the graph. The idea: a stock's neighborhood tells you things that the stock's own price history doesn't.

Features fall into a few broad groups:

Recent observations are weighted more heavily than older ones, so the graph stays responsive to the current market regime without losing longer-term structure.
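To make the recency weighting concrete, here is one way it could work. This is purely illustrative: the exponential half-life form, the 20-day default, and the `leader_signal` feature are assumptions, not the system's actual features.

```python
def decayed_edge_weight(event_ages_days, half_life_days=20.0):
    """Sum of co-movement events, each discounted by its age.

    An event from today counts 1.0; one from half_life_days ago counts 0.5.
    """
    return sum(0.5 ** (age / half_life_days) for age in event_ages_days)


def leader_signal(in_edges, latest_return):
    """Hypothetical neighbourhood feature: weighted mean of leaders' latest returns.

    in_edges: {leader_ticker: edge_weight} for edges pointing at this stock.
    latest_return: {ticker: most recent daily return}.
    """
    total = sum(in_edges.values())
    if total == 0:
        return 0.0
    weighted = sum(w * latest_return.get(leader, 0.0)
                   for leader, w in in_edges.items())
    return weighted / total
```

With this form, old events never vanish completely, they just fade, which is one way to stay responsive without throwing away longer-term structure.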

The models

The models are a small set of gradient-boosted classifiers, each answering a slightly different question: will this stock have a significant move up (or down) over the next few days? There are separate models for different horizons and directions, and their outputs are combined into a single buy score via a simple formula. This is not an ensemble in the stacking sense, just a weighted combination of the individual model outputs.
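The actual formula isn't given here, so the following is only one plausible shape for "a weighted combination": for each horizon, take the up-probability minus the down-probability, then weight across horizons. The function name, the `(horizon, direction)` keys, and the weights are all hypothetical.

```python
def buy_score(probs, horizon_weights):
    """Combine per-model probabilities into a single score.

    probs: {(horizon_days, "up" | "down"): calibrated probability}
    horizon_weights: {horizon_days: weight}
    """
    score = 0.0
    for horizon, weight in horizon_weights.items():
        # Net conviction at this horizon: chance of a big up move
        # minus chance of a big down move.
        net = probs[(horizon, "up")] - probs[(horizon, "down")]
        score += weight * net
    return score
```

For example, with `probs = {(3, "up"): 0.6, (3, "down"): 0.2, (5, "up"): 0.5, (5, "down"): 0.3}` and equal weights of 0.5, the score is about 0.3.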

The models are intentionally simple — the complexity budget goes into the features, not the model architecture. Early stopping on a held-out validation set, calibrated probabilities, and that's about it.

Walk-forward training

The system never trains on a single fixed split. Instead, it tiles short test windows backward from the most recent data:

◄── time ──────────────────────────────────────────────────►

iter 0:  [════════ train ════════][val][cal][ test ]
iter 1:      [════════ train ════════][val][cal][ test ]
iter 2:          [════════ train ════════][val][cal][ test ]
...
iter N:                                              [ test ]  ◄── freshest

Each iteration: retrain from scratch on its own training window.
Hyperparameters are tuned per iteration.
An embargo gap sits between train and test (prevents label leakage).

The backward tiling is deliberate — it anchors the last iteration to the freshest data, so the ensemble's recency weighting puts the most weight where it matters most.
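A sketch of backward tiling over day indices. The window lengths are placeholders, and placing the embargo immediately before the test window is an assumption (the text only says it sits between train and test). `Split` and `backward_tiles` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    train: range
    val: range
    cal: range
    test: range


def backward_tiles(n_days, n_iters, train_len, val_len, cal_len,
                   test_len, embargo):
    """Tile test windows backward from the most recent day.

    Returns splits ordered oldest first, so the last one ends on the
    freshest data. Stops early if history runs out.
    """
    splits = []
    test_end = n_days  # exclusive; anchor the first tile at the latest day
    for _ in range(n_iters):
        test_start = test_end - test_len
        cal_end = test_start - embargo  # gap so forward-looking labels can't leak
        cal_start = cal_end - cal_len
        val_start = cal_start - val_len
        train_start = val_start - train_len
        if train_start < 0:
            break  # not enough history for another full iteration
        splits.append(Split(
            train=range(train_start, val_start),
            val=range(val_start, cal_start),
            cal=range(cal_start, cal_end),
            test=range(test_start, test_end),
        ))
        test_end = test_start  # next tile sits directly before this one
    splits.reverse()
    return splits
```

Anchoring at `n_days` and stepping backward means any leftover, un-tiled days fall at the old end of the history rather than the recent end.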

Combining iterations

At inference time, only the last few iterations contribute — keeping too many stale models in the ensemble adds noise. Recent iterations count more via a recency half-life, so the latest model has significantly more weight than one from two weeks ago. Each iteration's individual model outputs are combined via the same formula, then the iteration-level scores are blended with the recency weighting.
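The blending step might look roughly like this. The half-life of 7 days and `keep_last=3` are placeholder values, and the sketch assumes every retained iteration scores the same tickers.

```python
def blend_iterations(iter_scores, iter_ages_days, keep_last=3,
                     half_life_days=7.0):
    """Recency-weighted average of per-iteration scores.

    iter_scores: list of {ticker: score}, ordered oldest to newest.
    iter_ages_days: age of each iteration's model in days, same order.
    Only the last keep_last iterations contribute.
    """
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    if len(iter_scores) != len(iter_ages_days):
        raise ValueError("scores and ages must have the same length")

    recent = list(zip(iter_scores, iter_ages_days))[-keep_last:]
    if not recent:
        return {}

    # Weight halves for every half_life_days of age.
    weights = [0.5 ** (age / half_life_days) for _, age in recent]
    total = sum(weights)

    tickers = recent[-1][0].keys()
    return {
        t: sum(w * scores[t] for w, (scores, _) in zip(weights, recent)) / total
        for t in tickers
    }
```

With a 7-day half-life, a model from two weeks ago carries a quarter of the weight of today's.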

The reranker

The combined score produces a ranked list, but rank quality at the very top matters more than overall ranking accuracy. A second-stage learning-to-rank model re-orders the top of the list using actual forward returns as relevance labels, with heavy weighting toward the best-performing stocks.

If the reranker can't beat a simple "sort by combined score" on the test set, it's discarded and inference falls back to the base ranking. No harm done.
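The fallback gate is simple enough to sketch. The comparison metric used here, mean forward return of the top `k` names on the test set, is an assumption; the real gate may use a ranking metric instead. Function names are hypothetical.

```python
def top_k_mean_return(order, forward_return, k):
    """Mean realised forward return of the first k tickers in a ranking."""
    top = order[:k]
    if not top:
        return float("-inf")
    return sum(forward_return[t] for t in top) / len(top)


def choose_ranking(base_order, reranked_order, forward_return, k=10):
    """Keep the reranker only if it strictly beats the base ranking on the test set.

    Returns (name, order). Ties go to the base ranking, the simpler option.
    """
    base = top_k_mean_return(base_order, forward_return, k)
    reranked = top_k_mean_return(reranked_order, forward_return, k)
    if reranked > base:
        return "reranker", reranked_order
    return "base", base_order
```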

Weekly retrain

Every Saturday, the full pipeline retrains: feature extraction, hyperparameter tuning, model assembly, and reranker training. The new model runs a backtest against the same period the current production model was tested on. If the new model wins, it gets promoted. If it loses, nothing changes. Monday's automation picks up whichever model is current.

How well all of this actually works — and the many ways it can go wrong — is in Part 3.

This series
Part 1: Fantasy football, but for stocks
Part 2: The data and the graph
Part 3: Does it actually work?