xAI and 10X Training

Published / Last Updated May 29, 2026

Category Blog

Ceramic has built the fastest training stack. Vendors (AWS, Coreweave, AMD, and Lambda) have tested this, and we have demonstrated more than 80% MFU on B200s. That is actually higher GEMM performance than these chips can expect for matrices of the size found in LLM models. We have extracted all the possible performance because we train at line speed of matrix multiplication, as seen in the table below linked from Lambda’s blog at https://lambda.ai/blog/ceramic-lambda-training-performance-nvidia-hgx-b200:

Performance Metrics: 8B Model Training on 8 NVIDIA Blackwell GPUs

Recent claims that SpaceX is writing a new training stack in C to get a 10X performance increase, and, as the experts, we thought it would be polite to give some advice. Elon posted,

“SpaceX has almost finished writing V1.0 of an in-house AI training stack in C that exact-maps to 220k GB300s with 800G NICs, making heavy use of pipeline parallelism and getting as close to bare metal as possible.

The potential speed improvement vs JAX for large training runs is over an order of magnitude.”

What follows is not good advice for the average researcher - this makes your stack harder to modify, and difficult to change for experiments - but it also makes it really fast.

The first thing to realize is that Autograd is not your friend. Symbolic differentiation is one of the great CS achievements. It was the original LISP demo*, but it gets in the way of performance. All the tricky parts of the backward pass are done in specialized kernels (flash attention, RMSNorm) so autograd is just used for trivial things - like addition to the residual stream or taking the derivative of GEMM (it is two GEMMs). Just stop using it. Suddenly, you have straight line code. This is worth 10%.

The second thing is to fuse ops. Fuse them all. A single layer is just a few GEMMs and a few other ops like rotary embeddings and norms (and attention). Fuse these. Tri Dao’s code is your friend here. Another 10%.

Don’t use frameworks. Once you have a framework, you lose track of where things are kept. Instead, allocate fixed contiguous buffers for grads and parameters. Decide on the memory layout to make all networking non-copy. This is easy, so long as you have gotten rid of autograd and frameworks.

Don’t use streams. They make your code easier, but they slow everything down. Networking will overlap with computation. You don't need separate streams to run parallel tasks. Explicitly work out how long each compute and network step takes, and call the networking explicitly at the right time. This saves another 10%.

Don’t use modules - break up each piece rather than use nice code. The linear module is made up of one GEMM in the forward pass and 2 in the backward. But, we need to break this up for the backward pass, as one of these GEMMs is not on the critical path (the one that computes the gradient of the parameters). We will need to postpone this until we have some networking to do. We explicitly delay tasks that are not on the critical path until we have something to overlap. 5% more.

These tricks will get a dense model to 95% of GEMM performance. The idea is to do less, not more. But sometimes doing less means being explicit about what you are doing, which means writing a few more lines of code.

Don’t use tensor parallelism if you don’t need it. If you think you need it, stop and realize that you can almost certainly do without it. You might think that you need it to save space, but a little thought will show that you can move parameters rather than activations over the “tensor parallel” group. This has the property that the networking is not on the critical path, so we have removed all the slowdown from TP networking (another 10%).

You probably want long context, so you might be thinking about context parallelism. This involves networking on the critical path, unless, of course, you use our idea of time-based context parallelism, which leverages the pipeline parallelism to do context parallelism over time. We split the context into microbatches that are fed into the pipeline (we do need to reverse on the backward pass - but we can do that, because we got rid of autograd and frameworks). This removes the cost of context parallelism completely. There is a lesson here, the way to remove the cost of networking is to not send the data.

MoE models make some things a little trickier. The two big changes are that we now have expert parallelism, which uses 8 times more networking than tensor parallelism. That is a lot. The second big issue is that 1 in 32 tokens now go to each expert. Our (or rather Jensen’s) GEMMs don’t work well unless we have at least 256 (the arithmetic efficiency) tokens for each GEMM.

This does not sound too bad, but what kills us is Goodhart’s law: "When a measure becomes a target, it ceases to be a good measure." If the big boss says MFU matters, there is a tendency to optimize for better MFU, rather than better training. The way you goose MFU is by increasing the global batch size. The bigger this is, the easier everything gets. All the GEMMs are bigger, so more efficient, and there is less networking (as we only need to move the gradients once a batch). But, training is the opposite of golf - more steps are better. A bigger global batch size means you take fewer steps. Yes, the gradient is more accurate, but once you pass the gradient noise level, you are just wasting compute.

This is one of the reasons that Llama 3 models underperformed compared to Chinese models. Chinese labs have fewer GPUs, so there's no reason to use crazy-large batch sizes. Llama 3 used batch sizes as large as 64M for its dense models (which was probably 32 times too large). This wasted 95% of the compute. The researchers were caught between the pressure to show a high MFU number and the unexpected generosity (64k GPUs) of Zuck.

But a smaller global batch makes everything harder. Suppose we have 256k GPUs. For MoE training, each expert gets 1/32 tokens, so the experts can be trained with a 64M batch size. This means we have 256 tokens per GPU. This makes things hard.

Before I explain how to fix this, I want to mention a major error that everyone makes when training MoE models. They use a large batch size, like 64M tokens, which is right for the experts as they see 3% of tokens. But the attention, routing, norms, and shared expert see every token. They are wasting 95% of the compute. The obvious fix is to split the optimizer update and update the parameters for attention and routing more often. These have fewer parameters, so this can work out. It stops you wasting 95% of your compute for this critical part (attention), so it's worth a little effort.

We have 512 tokens per GPU. We need 4x the number of microbatches to get 80% of efficiency, so this reduces our microbatch size to 128. 1 in 32 tokens goes to each, so from each microbatch, an expert sees 4 tokens on average. But we have NVL72s, so we can use 72x expert parallelism to get 288 tokens per expert. This gets us over the efficiency threshold - memory is 256 times slower than compute, so we need our GEMMs to have 256 as their smallest dimension.

Arithmetic efficiency (the ratio of compute to memory speed) is a critical idea, but for MoE, there is another metric. The ratio of compute to network. When we do an expert step, we send d (the hidden dimension) times s (the number of tokens). We do 3 (for Swiglu) 2 (multiply and add) d (hidden dimension) * (intermediate dimension) of compute. A little math shows that, as networking is much slower than memory, the critical size for the intermediate dimension is 2k. At that size, the networking time is the same as the compute, so with clever kernels like DeepEP, we can overlap compute and networking. Keep your intermediate dimension above 2k, or nothing can save you. This is a cruel master.

One thing to remember here is that in order not to waste the compute, we need to keep the batch size down. But this means that we send as few as 256 or 128 tokens to each GPU. We need to concatenate these together with context parallelism to get long (or even medium) context, and the way to do that is with our pipeline trick. The temptation is always to increase the batch size, but as Dario showed, seeing over 2M or 4M tokens is enough. More than that is just waste.

One last trick, before I conclude. Always break down the walls between your functions and look at the primitive steps. When we do pipeline and tensor parallelism, the frameworks and code lead us to the RMSNorm using sequence parallelism, and then do the P2P transfer to the next node. This is a horrible mistake. The right way is to do the P2P in the middle of the RMSNorm. Rather than do the reduce scatter, the RMSNorm, the all-gather, and then the P2P, we do the P2P before the all-gather. This breaks the modularity of our code, but saves 8x in networking. This makes the code fragile, hard to maintain, and likely to confuse, but it is 8x faster. For big training runs, we need to think like F1 designers and make crazy tradeoffs, not build minivans. I am sure there is a Tesla analogy there (but I refuse to criticize my Model X).

In conclusion, other people get 40% MFU on H100s, some people boast at getting to near 50%. Ceramic can achieve over 80%, demonstrated to industry partners on their own hardware. We do this by breaking all the rules of good code. No frameworks, no autograd, no streams, no garbage-collected buffers - all static allocation, no copying ever, as we lay out for networking first. We break down all the walls between modules, so each primitive can be used when it is most effective.

Elon hopes for more than 10x. This means his current stack must be less than 8% MFU, as you can’t go faster than GEMM performance. We have gotten within a few percent of the maximum possible, but to do this, we needed to break all the usual engineering rules. My last piece of advice is don’t fall for the Siren song of increasing batch size - what matters is the useful flops, not the MFU of the cluster. Small batch sizes make everything harder, but that is where great engineering thrives.

* According to John McCarthy, he demoed this to industry leaders from the basement (where the computer was) while luminaries watched him from the penthouse over closed circuit TV. Like any careful researcher, John ran the program first to see all was well, so when he demoed it, half way through it garbage collected (which took minutes as the machine had thousands of words of memory). John learned never to test anything ever again, and the CEOs did not notice.