JEPA, Rebuilt from the Paper

Highlights

Full research report comparing JEPA and MAE objectives under controlled conditions with matched ViT backbones on CIFAR-10.
Evaluates representations through four downstream evaluations: linear probing, retrieval, anomaly detection, and embedding visualization.
Finds a real tradeoff: JEPA produces more perturbation-invariant embeddings; MAE wins on linear probe accuracy.

Research question

The report asks a narrow but consequential question: does predicting latent structure, as in JEPA (Joint-Embedding Predictive Architecture), produce more useful embeddings than reconstructing masked pixels, as in MAE (Masked Autoencoder)? The analysis frames this not as an accuracy contest but as a question about what structure each objective pushes the model to learn.
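To make the contrast concrete, here is a minimal NumPy sketch of where the two objectives differ: the loss has the same form in both cases, and only the prediction target changes. Everything in it is an illustrative assumption rather than the report's implementation: the 16x16 patch size (which gives 36 patches for a 96x96 image), the 75% mask ratio, the random linear map W_target standing in for a target encoder, and the noisy "predictions" standing in for model outputs.

```python
import numpy as np

def masked_mse(pred, target, mask):
    """Mean squared error averaged over masked patches only."""
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return float(per_patch[mask].mean())

rng = np.random.default_rng(0)

# Assumed geometry: a 96x96 RGB image cut into 16x16 patches -> 6 * 6 = 36 patches.
num_patches, patch_dim, latent_dim = 36, 16 * 16 * 3, 16
patches = rng.normal(size=(num_patches, patch_dim))

# Assumed mask ratio of 75%: 27 of the 36 patches are hidden from the encoder.
mask = np.zeros(num_patches, dtype=bool)
mask[rng.choice(num_patches, size=27, replace=False)] = True

# Hypothetical stand-in for a target encoder: a fixed random linear projection.
W_target = rng.normal(size=(patch_dim, latent_dim)) / np.sqrt(patch_dim)

# MAE-style target: the raw pixels of the hidden patches.
mae_target = patches
# JEPA-style target: the target encoder's embedding of the hidden patches.
jepa_target = patches @ W_target

# Hypothetical model outputs, simulated here as noisy copies of each target.
mae_pred = mae_target + 0.1 * rng.normal(size=mae_target.shape)
jepa_pred = jepa_target + 0.1 * rng.normal(size=jepa_target.shape)

mae_loss = masked_mse(mae_pred, mae_target, mask)
jepa_loss = masked_mse(jepa_pred, jepa_target, mask)
print(f"MAE-style pixel loss:   {mae_loss:.4f}")
print(f"JEPA-style latent loss: {jepa_loss:.4f}")
```

The point of the sketch is the target, not the numbers: a pixel target asks the model to account for every value in a hidden patch, while a latent target only asks it to match whatever the target encoder chose to keep.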

Experiment design

Both objectives were trained on CIFAR-10 (upsampled to 96x96) using matched Vision Transformer backbones, which controls for architecture and compute. Representations were then evaluated through linear probing, retrieval, anomaly detection, and t-SNE embedding visualization. The results surfaced a genuine tradeoff: JEPA embeddings were more perturbation-invariant and better structured, while MAE produced stronger linear probe accuracy.
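Two of these evaluations can be sketched in a few lines of NumPy. The source does not say which metrics the report used, so the choices below are assumptions: perturbation invariance is measured as the mean cosine similarity between clean and perturbed embeddings, and retrieval as precision@k over cosine nearest neighbours. The function names and the toy two-cluster data are hypothetical; linear probing and anomaly detection are left out.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length, guarding against zero vectors."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def invariance_score(clean_emb, perturbed_emb):
    """Mean cosine similarity between each clean embedding and its perturbed twin."""
    a, b = l2_normalize(clean_emb), l2_normalize(perturbed_emb)
    return float((a * b).sum(axis=1).mean())

def precision_at_k(embeddings, labels, k=5):
    """Share of each query's k nearest neighbours (cosine, self excluded) with its label."""
    z = l2_normalize(embeddings)
    sims = z @ z.T
    np.fill_diagonal(sims, -np.inf)          # a query must not retrieve itself
    topk = np.argsort(-sims, axis=1)[:, :k]  # indices of the k most similar items
    hits = labels[topk] == labels[:, None]
    return float(hits.mean())

# Toy embeddings: two well-separated classes of ten points each in 8 dimensions.
rng = np.random.default_rng(0)
labels = np.array([0] * 10 + [1] * 10)
centers = np.zeros((2, 8))
centers[0, 0] = 5.0
centers[1, 1] = 5.0
clean = centers[labels] + 0.1 * rng.normal(size=(20, 8))

# Simulated effect of an input perturbation on the embeddings.
perturbed = clean + 0.5 * rng.normal(size=clean.shape)

print("invariance:", invariance_score(clean, perturbed))
print("precision@5:", precision_at_k(clean, labels, k=5))
```

In a comparison like the one described here, the same two functions would be run on embeddings from each encoder, with the perturbed embeddings obtained by encoding perturbed images rather than by adding noise in embedding space.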

Why it connects

JEPA is part of the same thread as RAID: useful intelligence needs representations that encode structure, not surface. If systems are going to reason across modalities and feedback loops, representation quality becomes a first-order question, and the choice of self-supervised objective bears on it directly.