A walkthrough of the SAME autoencoder architecture — how patching and a Transformer Resampling Block achieve 4096× temporal compression of stereo audio without a single strided convolution.

I’ve long known cascaded strided-convolution blocks as the dominant, and perhaps the only, way of both compressing audio waveforms and learning useful representations from them. This lineage runs from SoundStream (2022), through EnCodec (2023) and DAC (2023), to Stable Audio Open (2024). I was recently pointed to a paper from the folks at Stability AI that introduces the Semantically-Aligned Music autoEncoder (SAME) and breaks from this lineage entirely. It adopts an idea from the image and speech domains (resampling through interleaved attention tokens, as explored in the Perceiver, Yu et al. 2024, and ALMTokenizer) and applies it to music: a parameter-free reshape followed by a single transformer pass. The result is a 4096× temporal compression ratio with state-of-the-art reconstruction quality and significantly faster inference. SAME also serves as the autoencoder in the recently published Stable Audio 3.

SAME compresses stereo 44.1 kHz audio by 4096× in time using two operations: a parameter-free patching step (256×) and a single Transformer Resampling Block (16×). No cascade of strided convolutions — just a reshape and one transformer pass. This post focuses on the TRB: how it simultaneously downsamples and creates new features, and the two attention strategies it uses to handle cross-segment information flow.

Patching: folding time into width

The first step is a pure reshape with no learned parameters. Non-overlapping windows (“patches”) of P = 256 samples are taken from each channel and concatenated into a single vector. Each resulting embedding is a 512-dimensional vector — but those 512 dimensions aren’t features. They’re literally 256 left-channel samples concatenated with 256 right-channel samples. Time-domain data folded into the feature axis. Consider the simplified example in the interactive figure below:

Stereo waveform: 2 channels × 30 samples

Why concatenate L and R rather than processing them separately? The paper doesn’t state a rationale explicitly, but stereo audio is highly correlated across channels — most energy lives in the “mid” signal. Packing both channels into a single vector lets the downstream transformer jointly attend over both channels at every time position. This presumably makes it easier to learn inter-channel structure (stereo width, panning) without needing an explicit cross-channel mechanism.
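The patching step can be sketched in a few lines of plain Python. This is a minimal illustration rather than the paper's code: the function names `patch` and `unpatch` are hypothetical, the toy patch size P = 5 stands in for the real P = 256, and rejecting lengths that are not a multiple of P is an assumption (a real pipeline would more likely pad).

```python
def patch(stereo, P):
    """Fold non-overlapping windows of P samples per channel into one vector.

    stereo: [left, right], two equal-length lists of samples.
    Returns a list of patches of length 2 * P: P left samples, then P right samples.
    """
    left, right = stereo
    if len(left) != len(right):
        raise ValueError("channels must have equal length")
    if len(left) % P != 0:
        # Assumption: we refuse ragged input instead of padding it.
        raise ValueError("length must be a multiple of P")
    return [left[t:t + P] + right[t:t + P] for t in range(0, len(left), P)]


def unpatch(patches, P):
    """Inverse reshape: split each patch back into its left and right halves."""
    left, right = [], []
    for vec in patches:
        left.extend(vec[:P])
        right.extend(vec[P:])
    return [left, right]


# Toy example: 2 channels x 30 samples, P = 5.
left = [float(i) for i in range(30)]
right = [100.0 + i for i in range(30)]
patches = patch([left, right], P=5)
print(len(patches), len(patches[0]))  # 6 10
```

Nothing is learned here, and `unpatch(patch(x, P), P)` returns `x` exactly. On a tensor library the same operation would typically be a single reshape plus a transpose.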

The TRB: data flow step by step

The TRB takes the patched sequence and produces a shorter sequence in a new feature space. Continuing the example above with deliberately distinct dimensions (2P = 10, d = 8, S = 3, d_latent = 6) for clarity:

Not just downsampling. Each output embedding passes through multiple rounds of attention (content-dependent mixing) and feedforward layers (nonlinear transformation). By extraction time, the output embeddings contain genuinely new features rather than averaged inputs. The TRB simultaneously resamples the time axis and creates a new representation.
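At the level of sequence bookkeeping, the resampling amounts to interleaving and then extracting. The sketch below tracks only token positions, using string labels in place of vectors. The names `interleave` and `extract` are hypothetical, and the transformer itself is left out because it does not change the sequence length.

```python
def interleave(inputs, S):
    """Place one output-embedding slot after every S input tokens."""
    if len(inputs) % S != 0:
        raise ValueError("sequence length must be a multiple of S")
    seq = []
    for k, start in enumerate(range(0, len(inputs), S)):
        seq.extend(inputs[start:start + S])
        seq.append(f"o{k}")  # placeholder label for the k-th output embedding
    return seq


def extract(seq, S):
    """Keep only the output-embedding slots: every (S + 1)-th token."""
    return seq[S::S + 1]


S = 3
inputs = ["x0", "x1", "x2", "x3", "x4", "x5"]  # 6 patch embeddings
seq = interleave(inputs, S)
# ... transformer layers would run here, changing content but not length ...
outputs = extract(seq, S)
print(seq)      # ['x0', 'x1', 'x2', 'o0', 'x3', 'x4', 'x5', 'o1']
print(outputs)  # ['o0', 'o1']
```

Six inputs become two outputs, an S× reduction of the time axis. In the toy dimensions above, a projection from 2P = 10 to d = 8 would sit before the interleave and one from d = 8 to d_latent = 6 after the extract; exactly where the model places them is an assumption here.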

Feature evolution through transformer layers

The “Transform” step above collapses D=3 transformer layers into one view. Here’s what happens inside each layer: first, self-attention mixes information across all tokens in the window (a content-dependent weighted combination), then the feedforward network applies a nonlinear transformation to each token independently. A residual connection adds each layer’s contribution to the previous state. Critically, a transformer block preserves the sequence length and dimensionality — it changes the content of each vector, not the shape. The sequence only gets shorter at the extraction step.

The output embeddings (rightmost amber column) start near zero — the paper explicitly initialises them that way. Through each layer they accumulate information: layer 1 pulls in a first weighted mix from the inputs, layer 2 refines this richer representation, and so on. The inputs evolve too, since they attend to each other and to the output embeddings. By layer 3, every vector has been thoroughly transformed:

Colour intensity = feature magnitude. Rightmost column in each segment is the output embedding.
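The two properties described above (shape is preserved, and a near-zero output embedding fills up through the residual path) can be seen in a deliberately stripped-down layer. This is a toy under loud assumptions: single-head attention with Q = K = V = x and no learned projections, a bare `tanh` standing in for the feedforward network, and no layer normalisation. It is not the SAME architecture.

```python
import math


def attention(x):
    """Toy single-head self-attention with Q = K = V = x (no learned weights)."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[j] for w, v in zip(weights, x))
                    for j in range(d)])
    return out


def feedforward(x):
    """Stand-in for the per-token MLP: an elementwise nonlinearity."""
    return [[math.tanh(v) for v in tok] for tok in x]


def add(x, update):
    """Residual connection: add an update to every token."""
    return [[a + b for a, b in zip(tok, upd)] for tok, upd in zip(x, update)]


def layer(x):
    """One block: residual attention, then residual feedforward."""
    x = add(x, attention(x))
    return add(x, feedforward(x))


def norm(tok):
    return math.sqrt(sum(v * v for v in tok))


# One segment: S = 3 inputs followed by an output embedding that starts near zero.
x = [[1.0, 0.5], [0.8, -0.2], [0.6, 0.9], [1e-3, -1e-3]]
norms = [norm(x[-1])]
for _ in range(3):  # D = 3 layers
    x = layer(x)
    norms.append(norm(x[-1]))
print(len(x), len(x[0]))  # 4 2
```

Because the output embedding starts near zero, its attention scores are nearly uniform in the first layer, so it begins as roughly the mean of the segment and only later layers make the mix content-dependent. The `norms` list records its magnitude growing layer by layer.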

Two attention strategies

Full self-attention over the entire interleaved sequence is O(n²) — too expensive for long audio. SAME uses two strategies, for a practical reason: sliding-window attention gives better quality but isn’t supported by current CPU inference libraries like LiteRT (Google’s on-device ML runtime, formerly TensorFlow Lite). So SAME-L (GPU) uses sliding window, and SAME-S (CPU/edge) uses chunked attention with a midpoint shift as a pragmatic alternative. The example below uses S=2 (rather than S=3 as above) to fit more segments into view: 12 tokens across 4 segments of 2 inputs plus 1 output embedding each. Selecting any token reveals what it can attend to.

Sliding window (SAME-L, GPU)

Each token attends to S+1 = 3 neighbours on each side. The window moves smoothly — no hard boundaries. Output embeddings at segment edges see into neighbouring segments:

Interactive figure: select a token to see which positions it can attend to. Inputs and output embeddings are marked separately; bright = in window, faded = outside.
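The sliding-window rule can be written as a boolean mask. A minimal sketch, assuming a symmetric window in which each token also attends to itself; the helper name `sliding_window_mask` is hypothetical.

```python
def sliding_window_mask(n, w):
    """mask[i][j] is True when token i may attend to token j: |i - j| <= w."""
    return [[abs(i - j) <= w for j in range(n)] for i in range(n)]


S = 2
n = 12                       # 4 segments of S inputs + 1 output embedding
mask = sliding_window_mask(n, w=S + 1)
o1 = 5                       # second output embedding (0-indexed position)
visible = [j for j in range(n) if mask[o1][j]]
print(visible)               # [2, 3, 4, 5, 6, 7, 8]
```

Position 5 closes the segment at positions 3 to 5, yet its window reaches back to position 2 (the previous output embedding) and forward across the whole next segment, which is the cross-segment visibility described above.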

Chunked + midpoint shift (SAME-S, CPU)

The sequence is split into fixed-size chunks — tokens can only attend within their chunk. This creates a problem: tokens at a chunk boundary can never see tokens in the adjacent chunk. The fix is a midpoint shift: the first half of the transformer layers (layers 1…D/2) uses one set of boundaries, then the second half (layers D/2+1…D) shifts them by half a chunk. The layers are sequential — the output of layer D/2 feeds directly into layer D/2+1, with only a re-chunking of the sequence inserted between the two halves. A token stuck at an edge in the first phase finds itself in the middle of a chunk in the second phase. Both phases are shown below — selecting any token reveals how its context changes between them:

Interactive figure, two panels: layers 1…D/2 with the original boundaries, and layers D/2+1…D with the shifted boundaries. Select a token to see its attention scope in both phases.

Consider output embedding o1 (position 5, right at the original chunk boundary). In the first phase it’s trapped at the edge of chunk 1 with no access to chunk 2. In the second phase, the shifted boundary places it in the middle of a chunk with full context on both sides.
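The two phases can be reproduced with a pair of block-diagonal masks. This sketch assumes a chunk size of 6 tokens for the 12-token example and assumes that shifting leaves shorter partial chunks at the two ends of the sequence; the source does not specify either detail. `chunked_mask` and `visible` are hypothetical helper names.

```python
def chunked_mask(n, chunk, shift=0):
    """mask[i][j] is True when tokens i and j fall in the same chunk.

    Boundaries are moved right by `shift`; Python's floor division puts
    positions before the first shifted boundary into a chunk of their own.
    """
    ids = [(i - shift) // chunk for i in range(n)]
    return [[ids[i] == ids[j] for j in range(n)] for i in range(n)]


def visible(mask, i):
    """Positions that token i is allowed to attend to."""
    return [j for j, ok in enumerate(mask[i]) if ok]


n, chunk = 12, 6
phase1 = chunked_mask(n, chunk)                    # layers 1..D/2
phase2 = chunked_mask(n, chunk, shift=chunk // 2)  # layers D/2+1..D
o1 = 5
print(visible(phase1, o1))  # [0, 1, 2, 3, 4, 5]
print(visible(phase2, o1))  # [3, 4, 5, 6, 7, 8]
```

Across the two phases, position 5 has seen positions 0 to 8, and information from further away can still reach it indirectly through tokens that mixed with their own neighbours in the first phase.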

The decoder: reversing the process

The decoder TRB uses the same mechanism but with reversed roles. In the encoder, S inputs informed 1 output embedding (many-to-one). In the decoder, 1 latent embedding informs S output embeddings (one-to-many). Each latent vector is paired with S=3 learnable output embeddings, initialised near zero (I assume, by analogy with the encoder; the paper only specifies that the decoder’s embeddings are perturbed with Gaussian noise). The latent provides context as the information-rich token, and the output embeddings absorb from it through attention. After the transformer layers, the latent is discarded and the S outputs are kept, yielding S× upsampling:

The symmetry with the encoder is worth noting: in both directions, learnable output embeddings start blank and gain structure through self-attention. The only difference is the ratio — many-to-one (encoder) versus one-to-many (decoder). After extraction, a linear projection maps back to the patch dimension, and the inverse reshape (unpatch) recovers the stereo waveform.
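The decoder's bookkeeping mirrors the encoder's. The sketch below again uses string labels instead of vectors, with hypothetical names `decoder_interleave` and `decoder_extract`. Placing the latent before its S output slots is an assumption, since the text above does not say where it sits within a segment.

```python
def decoder_interleave(latents, S):
    """Follow each latent token with S output-embedding slots."""
    seq = []
    for k, z in enumerate(latents):
        seq.append(z)
        seq.extend(f"o{k}.{s}" for s in range(S))
    return seq


def decoder_extract(seq, S):
    """Discard each latent slot and keep the S output slots after it."""
    return [tok for i, tok in enumerate(seq) if i % (S + 1) != 0]


S = 3
latents = ["z0", "z1"]
seq = decoder_interleave(latents, S)
# ... transformer layers would run here, changing content but not length ...
outputs = decoder_extract(seq, S)
print(len(latents), len(seq), len(outputs))  # 2 8 6
```

Two latents become six outputs, the S× upsampling that undoes the encoder's S× reduction before the linear projection and the unpatch.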

The full encode-decode path

Putting it all together — encoder, bottleneck, and decoder. No cascade of convolutional blocks on either side, just a reshape, one transformer pass, and a projection in each direction:

The latent operates at ~10.77 Hz (44,100 / 4096): each of 48 vectors covers ~93 ms of audio. Between encoder and decoder sits a soft-normalisation bottleneck rather than a VAE.
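The latent rate and the duration covered by each latent vector follow directly from the sample rate and the two compression factors. A quick check of that arithmetic, with constant names chosen here for illustration:

```python
SAMPLE_RATE = 44_100   # Hz
PATCH = 256            # samples folded into each patch
TRB_FACTOR = 16        # temporal reduction of the Transformer Resampling Block

ratio = PATCH * TRB_FACTOR                  # total temporal compression
latent_rate_hz = SAMPLE_RATE / ratio        # latent vectors per second
ms_per_latent = 1000 * ratio / SAMPLE_RATE  # audio covered by one latent

print(ratio)                                             # 4096
print(round(latent_rate_hz, 2), round(ms_per_latent, 1)) # 10.77 92.9
```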

The SAME paper and model weights are available online.