HRM-Text-1B and Efficient Pretraining
Sapient Intelligence released HRM-Text-1B, a 1B-parameter language model built on the Hierarchical Reasoning Model architecture.
The accompanying paper, HRM-Text: Efficient Pretraining Beyond Scaling, is interesting because it pushes on a question I keep coming back to: how much useful capability can come from better structure and better training objectives, instead of simply more parameters, more data, and more compute?
The authors frame HRM-Text as an existence proof. I think that is the right way to read it.
The Core Idea
Most current language models are dense decoder-only Transformers trained on enormous raw-text corpora. That recipe works, but it makes pretraining from scratch expensive enough that only a small number of labs can seriously experiment with new foundations.
HRM-Text takes a different route.
Instead of a single monolithic Transformer stack, it uses two recurrent Transformer modules:
- H, a slower high-level state that carries broader semantic context.
- L, a faster low-level state that performs local iterative refinement.
In the released 1B model, the recurrent core runs two high-level cycles, and each high-level cycle contains three low-level updates. The model is doing more serial computation than a same-size one-pass Transformer, but without adding a proportional number of parameters.
That is the part that feels important: the architecture tries to buy depth with recurrence rather than size alone.
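To make the schedule concrete, here is a minimal sketch of the nested loop in plain Python. Only the two cycle counts come from the description above. The function names, the zero initial states, the toy update rules, and the choice to update H once after each group of L updates are all assumptions made for illustration, not the released implementation.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def hierarchical_forward(
    x: Vector,
    low_step: Callable[[Vector, Vector, Vector], Vector],
    high_step: Callable[[Vector, Vector], Vector],
    high_cycles: int = 2,  # from the post: two high-level cycles
    low_updates: int = 3,  # from the post: three low-level updates per cycle
) -> Tuple[Vector, Vector, int]:
    """Run the nested H/L schedule and count how often a module is applied."""
    h = [0.0] * len(x)  # assumed zero initial high-level state
    l = [0.0] * len(x)  # assumed zero initial low-level state
    calls = 0
    for _ in range(high_cycles):
        for _ in range(low_updates):
            l = low_step(l, h, x)  # fast local refinement, conditioned on H
            calls += 1
        h = high_step(h, l)  # slow update that absorbs the refined L state
        calls += 1
    return h, l, calls


# Toy stand-ins for the two recurrent Transformer modules (hypothetical).
def toy_low(l: Vector, h: Vector, x: Vector) -> Vector:
    return [0.5 * (li + hi + xi) for li, hi, xi in zip(l, h, x)]


def toy_high(h: Vector, l: Vector) -> Vector:
    return [0.5 * (hi + li) for hi, li in zip(h, l)]


h, l, calls = hierarchical_forward([1.0, -2.0, 0.5], toy_low, toy_high)
print(calls)  # 2 * (3 + 1) = 8 module applications
```

Under these assumptions the same two sets of weights are applied eight times in sequence, which is the sense in which recurrence adds serial depth without adding parameters.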
Why Recurrence Is Hard
Reusing the same transformation across recurrent steps sounds elegant, but it can make optimization unstable. Hidden states can drift, activation variance can grow, and credit becomes hard to assign across repeated computation.
The paper introduces two stabilizers:
- MagicNorm, which caps each recurrent module with a final normalization layer. The goal is forward stability without losing the gradient-flow advantages of internal PreNorm blocks.
- Warmup deep credit assignment, where training starts with a short backward horizon and gradually expands it. The model is not asked to learn through the full recurrent path before it has a stable base.
That combination is the engineering heart of the paper. HRM-Text is not just "loop a Transformer and hope." It is a recurrent language model with explicit machinery for keeping the loop trainable.
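A toy sketch of both ideas, in plain Python. This is not the paper's MagicNorm or its actual warmup schedule. It assumes an RMS-style normalization as the stand-in for the final norm layer, a made-up residual update that grows the state, and a linear ramp for the backward horizon. All function names and constants are hypothetical.

```python
import math
from typing import List


def rms(v: List[float]) -> float:
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(x * x for x in v) / len(v))


def rms_norm(v: List[float], eps: float = 1e-6) -> List[float]:
    """Rescale v to roughly unit RMS (assumed stand-in for the final norm)."""
    scale = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / scale for x in v]


def run_loop(v: List[float], steps: int, cap_output: bool) -> List[float]:
    """Apply a toy residual update repeatedly, optionally capping each output."""
    for _ in range(steps):
        v = [x + 0.5 * x for x in v]  # toy residual block that inflates the state
        if cap_output:
            v = rms_norm(v)  # normalization applied to the module output
    return v


def backward_horizon(step: int, warmup_steps: int, max_horizon: int,
                     min_horizon: int = 1) -> int:
    """Number of recurrent steps to backpropagate through (assumed linear ramp)."""
    if step >= warmup_steps:
        return max_horizon
    frac = step / warmup_steps
    return min_horizon + int(frac * (max_horizon - min_horizon))


state = [1.0, -2.0, 0.5]
uncapped = run_loop(state, steps=8, cap_output=False)
capped = run_loop(state, steps=8, cap_output=True)
print(rms(uncapped) > 20.0)            # the state grows by 1.5x per step
print(abs(rms(capped) - 1.0) < 1e-3)   # the capped state stays near unit scale

for s in (0, 500, 1000):
    print(s, backward_horizon(s, warmup_steps=1000, max_horizon=8))
```

In a real training loop, one common way to apply such a horizon is to run the earlier recurrent steps without gradient tracking and backpropagate only through the last few. Whether HRM-Text does exactly that is not something this post can confirm.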
PrefixLM Matters
The other big move is the objective.
HRM-Text is trained from scratch on instruction-response pairs with a task-completion objective. Instead of predicting every token in a raw text stream, the loss is computed on the response conditioned on the instruction.
The attention mask matches that objective:
- Prompt tokens can attend bidirectionally to each other.
- Response tokens remain causal.
That is the PrefixLM setup. It lets the prompt behave more like an encoder-side representation while the answer still generates autoregressively.
The model card makes one practical detail very clear: the prompt has to be marked as the bidirectional prefix block during inference. If it is treated like ordinary causal text, the model is no longer running in the same attention pattern it saw during pretraining.
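The attention pattern is easy to write down. The sketch below builds a generic PrefixLM mask and a response-only loss mask in plain Python. The function names are hypothetical and this is not the model's inference API; it only illustrates the rule that prompt tokens see each other in both directions while response tokens stay causal.

```python
from typing import List


def prefix_lm_mask(prefix_len: int, total_len: int) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must be between 0 and total_len")
    mask = []
    for q in range(total_len):
        row = []
        for k in range(total_len):
            both_in_prompt = q < prefix_len and k < prefix_len  # bidirectional block
            causal = k <= q  # ordinary left-to-right rule
            row.append(both_in_prompt or causal)
        mask.append(row)
    return mask


def response_loss_mask(prefix_len: int, total_len: int) -> List[bool]:
    """True for response positions, the only ones that contribute to the loss."""
    return [pos >= prefix_len for pos in range(total_len)]


# Example: a 3-token prompt followed by a 2-token response.
mask = prefix_lm_mask(prefix_len=3, total_len=5)
rows = ["".join("1" if allowed else "0" for allowed in row) for row in mask]
for row in rows:
    print(row)
# 11100
# 11100
# 11100
# 11110
# 11111

print(response_loss_mask(prefix_len=3, total_len=5))
# [False, False, False, True, True]
```

With `prefix_len=0` the mask collapses to a plain causal triangle, which is the mismatch the model card warns about when the prompt is not marked as the prefix block.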
The Efficiency Claim
The headline numbers are strong.
The authors report that HRM-Text-1B was trained from scratch on 40B unique tokens, with a total training budget of 60B tokens, in about 46 hours on two 8xH100 nodes. They estimate the run at roughly USD 1,472.
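A quick back-of-the-envelope check on those figures, using only the numbers quoted above. The derived quantities are my own arithmetic, not values from the paper. Reading the 60B budget as repeated passes over the 40B unique tokens is an assumption, and the reported duration and cost are both approximate.

```python
nodes = 2
gpus_per_node = 8            # "8xH100" per node
wall_clock_hours = 46        # reported as "about 46 hours"
estimated_cost_usd = 1472    # reported as "roughly USD 1,472"
unique_tokens = 40e9
total_tokens = 60e9

gpus = nodes * gpus_per_node
gpu_hours = gpus * wall_clock_hours
implied_rate = estimated_cost_usd / gpu_hours        # USD per GPU-hour
average_passes = total_tokens / unique_tokens        # assumed: repeats of the unique data
tokens_per_second = total_tokens / (wall_clock_hours * 3600)

print(gpus)                  # 16
print(gpu_hours)             # 736
print(implied_rate)          # 2.0
print(average_passes)        # 1.5
print(round(tokens_per_second / gpus))  # approximate per-GPU throughput
```

So the cost estimate is consistent with pricing an H100 at about USD 2 per GPU-hour, and the token budget works out to about one and a half passes over the unique data.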
On their reported evaluations, the 1B checkpoint reaches:
- 60.7 on MMLU.
- 81.9 on ARC-C.
- 82.2 on DROP.
- 84.5 on GSM8K.
- 56.2 on MATH.
The paper compares that against contemporary 2B to 7B open models trained with much larger token and compute budgets. Their claim is not that HRM-Text dominates every model. The claim is that a compact recurrent model can reach a competitive neighborhood with far fewer tokens and less estimated compute.
That matters because it reopens a door that has been closing: training new architectures from scratch.
Not a Chat Model
The Hugging Face model card is careful about limitations, and this post should be too.
HRM-Text-1B is a pre-alignment checkpoint. It is not a chat assistant. It has not been instruction-tuned for multi-turn assistant behavior, RLHF-trained, long-context adapted, or aligned for general deployment.
The card recommends:
- For classification, extraction, structured output, and short QA, use the `direct` condition with a few examples.
- For math, reasoning, and open-ended generation, use the `synth,cot` condition, with the caveat that output quality is uneven and below a comparable instruction-tuned model.
It is also English-only, not trained on code datasets, and expected to be weak at coding before adaptation.
That is not a criticism. It is a category boundary. HRM-Text-1B is more interesting as a research checkpoint and architecture signal than as a drop-in assistant.
Why I Care
The paper's discussion points toward a future I find compelling: decoupling reasoning from knowledge storage.
If a compact recurrent model can learn useful task execution and reasoning behavior from a smaller curated mixture, then maybe future systems do not need to cram all factual breadth into one dense backbone. The reasoning core can be smaller and more computationally deliberate. Factual coverage can come from retrieval, curated corpora, memory systems, or external tools.
That fits the Oxygen AI thesis:
- Durable memory should be explicit.
- Reasoning should be measurable.
- Interfaces should expose uncertainty and state.
- Capability should not depend only on brute-force scale.
HRM-Text is not the final answer, but it is a useful reminder that architecture still matters. Training objective still matters. The shape of computation still matters.
For builders, that is encouraging. It means there is still room to think.