Six steps to get a self-hosted image generation service into production on Cloud Run GPU — what I landed on at each step, how I got there, and what each one still costs.

It began with a side-project I build in my spare time, which needed a component that generates illustrations from text prompts. I'd been using OpenAI's DALL-E endpoint, and it worked fine until it didn't: I needed reference-conditioned generation — multiple images of the same character or object across different scenes, with visual consistency — and the hosted API didn't give me enough control over the model. So I went self-hosted. That's a big leap for one feature, and I'll be honest: it was as much about wanting to get my hands into the nitty-gritty of MLOps as about the feature itself. There are surely more pragmatic ways to solve this.

I kept one deliberate constraint throughout: minimise GPU expenses. Not primarily to save money, but because tight constraints force engagement with the things that matter. Quantisation, weight loading, cold start design — these are where the interesting problems live, and renting a comfortable A100 would have let me skip all of them. The side effect was that I made expensive mistakes rather than hardware-masking them, which is where most of what follows comes from.


Pick a platform

Google Cloud Run GPU with model weights stored in a Google Cloud Storage bucket, mounted as a volume at container startup. The container image contains only application code — no model weights baked in.

gcloud run services update inference \
  --add-volume=name=model-weights,type=cloud-storage,bucket=flux-model-weights \
  --add-volume-mount=volume=model-weights,mount-path=/mnt/model-weights \
  --set-env-vars=HF_HOME=/mnt/model-weights/hf-cache

The mental model that unlocks this: treat weights as data, not code. Code is versioned, deployed, rebuilt. Data is fetched at runtime. Model weights belong to the second category. Once that distinction is clear, the architecture follows: the container image contains Python dependencies and application code; the bucket contains everything the model needs at runtime.
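
Inside the container, treating weights as data can be as simple as resolving paths under the mount at startup and failing fast when they are missing. The sketch below is illustrative only: the default mount path mirrors the gcloud command above, while the helper name resolve_weights and the MODEL_WEIGHTS_DIR override are hypothetical.

```python
import os
from pathlib import Path
from typing import Optional

# Mount path from the gcloud command above. MODEL_WEIGHTS_DIR is a hypothetical
# override, useful for local runs against a different directory.
DEFAULT_WEIGHTS_DIR = "/mnt/model-weights"


def resolve_weights(relative_path: str, root: Optional[str] = None) -> Path:
    """Return the absolute path of a weight file, or raise if it is not there."""
    base = Path(root or os.environ.get("MODEL_WEIGHTS_DIR", DEFAULT_WEIGHTS_DIR))
    path = base / relative_path
    if not path.is_file():
        # A clear startup error beats a confusing loader traceback minutes later.
        raise FileNotFoundError(f"weights not found at {path}; is the bucket mounted?")
    return path
```

Calling resolve_weights("gguf/flux-kontext-q8_0.gguf") before any model loading keeps a missing or mis-mounted bucket from looking like a model bug.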

Baking the weights into the image vs. mounting them from GCS: the container shrinks ~50 GB → 6 GB and monthly storage drops from ~$5 to under $1 — the payoff of treating weights as data, not code.

How I got there

RunPod was the starting point — pick a card, pay by the hour, SSH in. It was quick for experimentation, but gave me a machine I configured by hand, not a service I deployed from a pipeline. The Pods I was using stayed running until they were stopped; there was no scaling down to nothing between requests, so the meter ran whether or not anything was using it. RunPod did have a separate serverless product that scaled to zero, but moving to it would have meant rewriting the service around RunPod's own handler format. Either way, what I actually wanted was the conventional infrastructure path: build the image in CI, push it to a registry, deploy declaratively, and authenticate through the platform's own IAM. That pointed at a managed container service, not a rented box.

Baking weights into the Docker image was the next attempt — the obvious way to get that CI build. Download the weights at build time, layer them into the container, load from local disk on startup: fast and self-contained, and it worked fine when I built the image by hand. Automating it was where it fell apart. A stock GitHub Actions runner has only ~16 GB of free disk, and the weights alone are ~34 GB; a cleanup step (jlumbroso/free-disk-space) that strips the runner's preinstalled Android SDK, .NET and Haskell toolchains reclaims ~23 GB — but ~39 GB still won't hold a ~50 GB image. The baked image simply couldn't be built in CI. And even setting that aside, the ~50 GB image cost roughly $5/month in Artifact Registry before a single request was served, and switching models meant rebuilding and re-pushing the whole 50 GB. Baking the weights in put their entire bulk on the critical path of every build.

The GCS mount approach brings the code image to 6 GB and CI to 13 minutes. Storage cost for the full weight set (~34 GB of Standard-class storage) is approximately $0.65/month.
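
The arithmetic behind those numbers fits in a few lines. The sizes are the ones quoted in this post; the two per-GB prices are assumptions (rough list prices that vary by region), so treat the dollar figures as ballpark.

```python
# Sizes quoted in this post, in GB.
RUNNER_FREE_GB = 16
RECLAIMED_GB = 23
BAKED_IMAGE_GB = 50
CODE_IMAGE_GB = 6
WEIGHTS_GB = 34

# Assumed prices in USD per GB-month; check current pricing for your region.
ARTIFACT_REGISTRY_PER_GB = 0.10
GCS_STANDARD_PER_GB = 0.02


def fits_on_runner(image_gb: float) -> bool:
    """True if an image of this size fits in the runner's free disk after cleanup."""
    return image_gb <= RUNNER_FREE_GB + RECLAIMED_GB


def monthly_cost(gb: float, price_per_gb: float) -> float:
    """Storage cost per month, rounded to cents."""
    return round(gb * price_per_gb, 2)


print(fits_on_runner(BAKED_IMAGE_GB))                          # False: 50 GB > 39 GB
print(fits_on_runner(CODE_IMAGE_GB))                           # True
print(monthly_cost(BAKED_IMAGE_GB, ARTIFACT_REGISTRY_PER_GB))  # 5.0
print(monthly_cost(WEIGHTS_GB, GCS_STANDARD_PER_GB))           # 0.68
```

At the assumed GCS rate the weights come to about $0.68 a month, in the same ballpark as the ~$0.65 above.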

What it still costs

Cold starts now involve reading weights from GCS over the network, which takes several minutes for a large model. Cloud Run scales to zero when idle, which means every cold start is a blank slate. For my use case — background illustration generation for a side-project, not a latency-sensitive user-facing API — this is acceptable. While I was actively iterating on the service I temporarily set --min-instances=1 to keep one instance warm and skip cold starts during that period, then let it scale back to zero for steady-state use. For anything that needs a guaranteed first-request latency in production, that minimum-instance setting or a separate warm-up strategy is the lever.


Pick a model

FLUX.1-Kontext-dev, Black Forest Labs' own reference-conditioning model (released under their non-commercial licence). It handles reference image conditioning natively: given an image of a character and a text prompt, it generates a new scene featuring that character, without any adapter or separate encoder.

How I got there

The initial approach was FLUX.1-schnell plus a community IP-Adapter from XLabs-AI for visual conditioning. An IP-Adapter needs two parts — a CLIP image encoder that turns the reference picture into an embedding, and adapter weights that feed that embedding into the model. XLabs had packaged theirs for ComfyUI, where you wire the encoder in as a separate step, so the download shipped only the adapter weights. The diffusers load_ip_adapter method expects the opposite: a single bundle with the adapter weights and an image_encoder/ subfolder it can load in one call. With no encoder in the package, it errored. I could have wired one in by hand — but before going down that road, I went looking for an out-of-the-box alternative. That's when I found Kontext.

Kontext solves reference-conditioning at the model level: no adapter, no separate encoder, no third-party packaging to reconcile. The thinking behind the switch: wiring a CLIP encoder into the XLabs adapter wouldn't have been much work either — but when the model's own author ships the capability natively and purpose-built for exactly this, it's the simpler bet, and most likely the higher-quality one. Given that choice, I'd rather lean on what the model does natively than shim an adapter onto a base model.

What it still costs

Kontext-dev is a large model — roughly 34 GB across the transformer, T5, VAE and CLIP in bf16. The quantisation and VRAM budget decisions that follow are a direct consequence of that size.


Get the weights

Pre-quantised Q8_0 GGUF for the transformer (12 GB), downloaded to GCS in advance and read sequentially at startup. Q8_0 dequantises via a native CUDA int8 operation and loads cleanly onto the GPU.

Transformer cold-start, three ways: runtime quantisation over a GCS FUSE mount took 74 minutes (measured and logged); a pre-quantised Q8_0 GGUF, read sequentially, brings it to ~3.5 minutes.

How I got there

The original plan was to load the full bfloat16 weights at startup and quantise them at runtime using torchao's quantize_(). The result was a 74-minute cold start. This was because quantize_() touches every parameter tensor individually, and the weights sat on a GCS FUSE mount — a network-backed filesystem. On a fast local disk the same work would take minutes.

The fix is pre-quantised GGUF weights. GGUF is a single flat binary file. The GGUF reader reads it sequentially from start to finish — no random access, no per-tensor network reads, no page fault cascade. Cold start for the transformer dropped from ~74 minutes to ~3.5 minutes.
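
The difference between the two read patterns can be illustrated without any ML library. This is a toy comparison, not a reproduction of what torchao or the GGUF reader do internally: both functions read the same bytes, but one issues a handful of large front-to-back reads and the other many small seek-and-read pairs. On a local disk both are quick; on a network-backed mount each call can turn into its own round trip.

```python
import os
from typing import Tuple


def read_sequential(path: str, chunk_size: int = 1 << 20) -> Tuple[int, int]:
    """Read the file front to back; return (bytes_read, read_calls)."""
    total = calls = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
            calls += 1
    return total, calls


def read_scattered(path: str, piece_size: int) -> Tuple[int, int]:
    """Read the same bytes as many small seek+read pairs, last piece first."""
    size = os.path.getsize(path)
    total = calls = 0
    with open(path, "rb") as f:
        for offset in reversed(range(0, size, piece_size)):
            f.seek(offset)
            total += len(f.read(piece_size))
            calls += 1
    return total, calls
```

Timing the two against a file on the mounted bucket versus a file on local disk is a cheap way to see how much of a cold start is access pattern rather than raw size.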

The first GGUF I tried was Q5_K_M (8.4 GB). It loaded, but pipe.to("cuda") failed with a device mismatch — somewhere in the k-quant load path, not every tensor ended up on the GPU. Q5_K is a supported diffusers quantisation type, and has been since well before the version I was running, so this was a loading quirk rather than a missing feature. Rather than root-cause a library issue, I switched to Q8_0 (12 GB) — the highest-precision and most universally supported GGUF type — which moved to CUDA cleanly.

What it still costs

Q8_0 is about 3.6 GB larger than Q5_K_M (12 GB against 8.4 GB). More importantly, T5-XXL (the text encoder) is still loaded in bfloat16 at ~9.5 GB, because the GGUF loading path for T5 doesn't work cleanly in diffusers, which is the main story in the VRAM section below. The cold start for T5 is dominated by GCS read time, not quantisation, so it's acceptable, but a pre-quantised T5 would reduce the weight storage footprint further.


Fit in VRAM

enable_model_cpu_offload(). T5-XXL stays in CPU RAM and moves to the GPU only during text encoding; the transformer stays on GPU for all the denoising steps. Peak VRAM sits at around 14 GB — within the L4's 24 GB budget.

Fitting the model into the L4's 24 GB: today, CPU-offloading T5 holds peak VRAM around 14 GB; the all-GPU configuration becomes possible once T5 loads cleanly from GGUF.

How I got there

With Q8_0 GGUF for the transformer (~12 GB) and T5-XXL in bfloat16 (~9.5 GB), plus VAE and CLIP (~1 GB), total VRAM demand is around 22–23 GB against the L4's 24 GB. Tight, but theoretically feasible all on GPU. The first approach was to also load T5 from a GGUF to bring the total down. There's a Q4_0 GGUF for T5-XXL (~2.9 GB) that would bring the total under 16 GB.
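
The budget is easier to reason about written down. The sizes below are the approximate figures from this post and cover weights only; activations and the CUDA context come on top, which is why 22.5 GB out of 24 GB counts as tight.

```python
# Approximate weight sizes in GB, as quoted in this post.
SIZES_GB = {
    "transformer_q8_0": 12.0,
    "t5_bf16": 9.5,
    "t5_q4_0": 2.9,
    "vae_and_clip": 1.0,
}
L4_VRAM_GB = 24.0


def total_gb(*components: str) -> float:
    """Sum of the named components' weights, in GB."""
    return round(sum(SIZES_GB[name] for name in components), 1)


def headroom_gb(*components: str) -> float:
    """VRAM left on an L4 once the named components are resident."""
    return round(L4_VRAM_GB - total_gb(*components), 1)


print(total_gb("transformer_q8_0", "t5_bf16", "vae_and_clip"))     # 22.5
print(headroom_gb("transformer_q8_0", "t5_bf16", "vae_and_clip"))  # 1.5
print(total_gb("transformer_q8_0", "t5_q4_0", "vae_and_clip"))     # 15.9
print(headroom_gb("transformer_q8_0", "t5_q4_0", "vae_and_clip"))  # 8.1
```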

That didn't work out, unfortunately. A FLUX pipeline is assembled from two libraries: the transformer is a diffusers class, while the T5 text encoder is a transformers class — Hugging Face's general NLP library. The GGUF single-file loader, from_single_file, is a diffusers feature, available only on diffusers' own model classes. T5's encoder isn't one of them, so the method simply isn't there — calling T5EncoderModel.from_single_file(...) errors out, and there's no equivalent quantized-loading entry point in its place. T5 stays in bfloat16 at ~9.5 GB, and fitting everything on the GPU was out.

CPU offload is the workaround: enable_model_cpu_offload() keeps T5 in system RAM and moves model components to the GPU only as needed. It works.
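
In code, the workaround is one call on the assembled pipeline. The sketch below assumes the pipeline class is FluxKontextPipeline from a recent diffusers release and that the GGUF transformer has already been loaded separately; the function name build_pipeline is hypothetical, and the actual service may assemble things differently.

```python
def build_pipeline(transformer):
    """Assemble the Kontext pipeline around an already-loaded transformer."""
    # Imported lazily so this module can be imported without the GPU stack installed.
    import torch
    from diffusers import FluxKontextPipeline  # assumed pipeline class

    pipe = FluxKontextPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-Kontext-dev",
        transformer=transformer,  # the Q8_0 GGUF transformer, loaded beforehand
        torch_dtype=torch.bfloat16,
    )
    # Do not also call pipe.to("cuda"): the offload hooks manage device placement.
    pipe.enable_model_cpu_offload()
    return pipe
```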

What it still costs

enable_model_cpu_offload() evicts model components back to CPU RAM after each pipe() call. On the next call, everything has to be moved back to the GPU before inference can start. For a 12 GB transformer, that reload takes several minutes. If you're generating one image every 20 minutes, you pay this cost every time. This is the main performance gap in the current setup — and the reason a working T5 GGUF loading path would matter: loading T5 in Q4_0 GGUF (~2.9 GB) alongside the Q8_0 transformer (~12 GB) brings the total to ~16 GB, comfortably within the L4's 24 GB, with no offloading needed and no per-call eviction cost.


Deploy: the wrong model

Pass the model config explicitly to from_single_file, bypassing diffusers' automatic model-type detection entirely:

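import torch
from diffusers import FluxTransformer2DModel, GGUFQuantizationConfig

transformer = FluxTransformer2DModel.from_single_file(
    "/mnt/model-weights/gguf/flux-kontext-q8_0.gguf",
    config="black-forest-labs/FLUX.1-Kontext-dev",  # repo ID as a string: skips detection
    subfolder="transformer",  # the transformer config lives under transformer/ in the repo
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)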

This is the workaround for a model-detection bug in diffusers 0.37.1: a Kontext GGUF whose img_in.weight tensor is stored as BF16 gets silently misidentified as FLUX.1-Depth-dev, regardless of which file you load. The step-through below traces it through the diffusers source.

The mechanism, step by step: diffusers reads a BF16 tensor's byte shape, 128 where the model has 64, and its model-type check picks FLUX.1-Depth-dev.

How the investigation went

This one took a while to run down, and by the end I was reading the GGUF file's raw binary header by hand to find out what the model on disk actually was. It began, as these things do, with a single inference request:

RuntimeError: mat1 and mat2 shapes cannot be multiplied (4096x128 and 64x3072)

Decoded: the transformer's input projection layer (x_embedder) was configured with in_channels=128 — the signature of FLUX.1-Depth-dev. Kontext should have in_channels=64. Something was giving the model the wrong config.
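
One way to read that message, sketched as plain Python: in a Linear layer's matmul, mat1 is the activations and mat2 is the transposed weight, so the two inner numbers are "features the input carries" and "features the weights expect". The helper below is hypothetical and only restates the shape rule; it does not touch torch.

```python
from typing import Tuple

Shape = Tuple[int, int]


def explain_matmul(mat1: Shape, mat2: Shape) -> str:
    """Describe whether mat1 @ mat2 is valid, in terms of a Linear layer."""
    tokens, features_in = mat1        # activations: (rows, features per row)
    expected_in, features_out = mat2  # transposed weight: (in_features, out_features)
    if features_in == expected_in:
        return f"ok: ({tokens}, {features_out})"
    return (
        f"mismatch: activations carry {features_in} features, "
        f"but the layer's weights expect {expected_in}"
    )


print(explain_matmul((4096, 128), (64, 3072)))
# mismatch: activations carry 128 features, but the layer's weights expect 64
print(explain_matmul((4096, 64), (64, 3072)))
# ok: (4096, 3072)
```

Read this way, the error is consistent with weights that expect 64 input features sitting under a config that prepares 128-feature inputs.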

Hypothesis 1: wrong file in GCS. I logged the full model config immediately after from_single_file:

_name_or_path: '/raid/aryan/flux1-depth-dev-diffusers'
in_channels: 128

The _name_or_path in the loaded config was a path on whoever's machine originally converted the model, and it named Depth-dev. The obvious reading was that the file named flux-kontext-q8_0.gguf in GCS was a Depth-dev conversion uploaded under the wrong name. I replaced it with a fresh download from gguf-org/flux-kontext-gguf on Hugging Face. Restarted. Still in_channels=128.

Log _name_or_path at startup. The GGUF filename only tells you what someone decided to call the file; the _name_or_path embedded in the config tells you which checkpoint was actually converted. Here they disagreed — and logging that one field is what surfaced it. (As the next two hypotheses show, it wasn't the whole story — but it's the cheapest check, so make it first.)
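
A startup check along these lines costs a few lines. It is a sketch, not the service's actual code: the function name, the logger name and the expected value of 64 as a default are my choices, and it assumes the loaded model's config can be read like a mapping.

```python
import logging
from typing import Any, Mapping

logger = logging.getLogger("inference.startup")  # hypothetical logger name


def check_loaded_config(config: Mapping[str, Any], expected_in_channels: int = 64) -> None:
    """Log what was actually loaded and refuse to start on the wrong model."""
    name_or_path = config.get("_name_or_path", "<unset>")
    in_channels = config.get("in_channels")
    logger.info("transformer config: _name_or_path=%r in_channels=%r", name_or_path, in_channels)
    if in_channels != expected_in_channels:
        raise RuntimeError(
            f"expected in_channels={expected_in_channels}, got {in_channels} "
            f"(config came from {name_or_path!r})"
        )
```

Called as check_loaded_config(transformer.config) right after loading, it turns a shape error at first inference into an explicit failure at startup.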

Hypothesis 2: stale HF cache. Since HF_HOME=/mnt/model-weights/hf-cache points into the GCS bucket, the Hugging Face cache also lives in GCS. I found a cached black-forest-labs/FLUX.1-Depth-dev transformer config that had been downloaded at some earlier point. I deleted it. Next cold start: it was back. Something in diffusers was re-downloading the Depth-dev config from Hugging Face on every startup, regardless of which GGUF file was on disk.

Hypothesis 3: the bug is in diffusers itself. I downloaded the first 10 MB of the GGUF and wrote a small binary parser to read the header directly:

img_in.weight: dims=[64, 3072]  dtype=30 (BF16)

In GGUF's column-major convention, dims=[64, 3072] means a PyTorch tensor of shape [3072, 64]: a Linear(in_features=64) layer. The file was correct — in_channels=64 — which left diffusers' own detection as the only place the 128 could come from. The visualisation above traces the mechanism through the 0.37.1 source.
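
A header parser of that kind is short. What follows is a sketch rather than the script I actually ran, and it assumes GGUF version 2 or later (little-endian, 64-bit counts and string lengths). It skips over the metadata key-value pairs and returns each tensor's stored dims and type ID.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"
# Fixed-size metadata value types from the GGUF spec: type ID -> struct format.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _unpack(f, fmt):
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise EOFError("header truncated; fetch more of the file")
    return struct.unpack(fmt, data)[0]


def _read_string(f):
    length = _unpack(f, "<Q")
    return f.read(length).decode("utf-8")


def _read_value(f, value_type):
    if value_type in _SCALARS:
        return _unpack(f, _SCALARS[value_type])
    if value_type == _STRING:
        return _read_string(f)
    if value_type == _ARRAY:
        item_type = _unpack(f, "<I")
        count = _unpack(f, "<Q")
        return [_read_value(f, item_type) for _ in range(count)]
    raise ValueError(f"unknown metadata value type {value_type}")


def read_tensor_infos(f):
    """Return {tensor_name: (dims_as_stored, ggml_type_id)} from a GGUF header."""
    if f.read(4) != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    _unpack(f, "<I")                  # format version
    tensor_count = _unpack(f, "<Q")
    kv_count = _unpack(f, "<Q")
    for _ in range(kv_count):         # skip the metadata key-value pairs
        _read_string(f)
        _read_value(f, _unpack(f, "<I"))
    infos = {}
    for _ in range(tensor_count):
        name = _read_string(f)
        n_dims = _unpack(f, "<I")
        dims = [_unpack(f, "<Q") for _ in range(n_dims)]
        ggml_type = _unpack(f, "<I")
        _unpack(f, "<Q")              # data offset, not needed here
        infos[name] = (dims, ggml_type)
    return infos


def torch_shape(dims):
    """GGUF stores dims fastest-varying first; PyTorch shapes are the reverse."""
    return tuple(reversed(dims))
```

Run against io.BytesIO of the first few megabytes of the file, read_tensor_infos(...)["img_in.weight"] gives the stored dims and type, and torch_shape turns [64, 3072] into (3072, 64).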

The fix has two non-obvious requirements: config must be the repo ID as a string (a dict from load_config() is rejected), and it needs an explicit subfolder="transformer", since the Kontext repo keeps its transformer config under transformer/ rather than at the root. The call above is the correct form.

This worked.

What it still costs

This is a defect in diffusers, not a design choice on my part — the file is correct, so the fix belongs upstream. The mechanism checks out against the 0.37.1 source: with no explicit config, from_single_file runs model-type detection on the raw GGUF checkpoint, reading img_in.weight.shape[1] while a BF16 tensor still carries its doubled byte shape (128) — before the dequantisation step that would halve it back to 64. The bypass with explicit config= costs nothing at runtime, but it hardcodes the config repo ID in the deployment: if Black Forest Labs changes the config structure, the call breaks silently.
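
The doubling itself is simple arithmetic, shown here as a toy model of the mechanism described above. This is not diffusers code: it only assumes, as the description does, that the raw tensor is held as bytes with the last dimension scaled by the element size until it is converted back.

```python
from typing import Tuple

BYTES_PER_ELEMENT = {"BF16": 2, "F16": 2, "F32": 4}


def raw_byte_shape(shape: Tuple[int, int], dtype: str) -> Tuple[int, int]:
    """Shape of the same tensor viewed as raw bytes (last dim scaled up)."""
    rows, cols = shape
    return rows, cols * BYTES_PER_ELEMENT[dtype]


def logical_shape(byte_shape: Tuple[int, int], dtype: str) -> Tuple[int, int]:
    """Undo raw_byte_shape: recover the element-wise shape from the byte view."""
    rows, byte_cols = byte_shape
    return rows, byte_cols // BYTES_PER_ELEMENT[dtype]


stored = raw_byte_shape((3072, 64), "BF16")
print(stored)                         # (3072, 128): what a check on shape[1] sees
print(logical_shape(stored, "BF16"))  # (3072, 64): what the model actually has
```

Any check that reads shape[1] before the conversion sees 128, which is exactly the value that distinguishes Depth-dev.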


Add auth

--allow-unauthenticated at the Cloud Run network level, with a static API key checked inside the FastAPI application. The key lives in Secret Manager and is injected as an environment variable at deploy time.

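from fastapi import Depends, HTTPException
from fastapi.responses import JSONResponse
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

# `app`, `settings` and `GenerateRequest` are defined elsewhere in the service.
bearer = HTTPBearer()

def verify_api_key(credentials: HTTPAuthorizationCredentials = Depends(bearer)) -> None:
    # Reject any request whose bearer token is not the configured key.
    if credentials.credentials != settings.flux_api_key:
        raise HTTPException(status_code=401, detail="Invalid API key")

@app.post("/generate", dependencies=[Depends(verify_api_key)])
async def generate(request: GenerateRequest) -> JSONResponse:
    ...

gcloud run services update inference \
  --set-secrets=FLUX_API_KEY=FLUX_API_KEY:latest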

One non-obvious step: the Cloud Run service account needs the roles/secretmanager.secretAccessor role on the secret before Secret Manager injection works. Without it the deployment fails with a permission error, and not a very informative one.

Two separate auth layers with different scopes. Confusing them leads to requests that appear to succeed at one layer but fail silently at the other.

How I got there

The original plan used Cloud Run's --no-allow-unauthenticated flag, which protects the service with short-lived GCP identity tokens that are easy to mint from inside GCP and awkward to mint anywhere else. On a laptop that means fetching a fresh one with gcloud auth print-identity-token every hour as it expires, and forgetting to refresh means silent 401s, rejected at the network layer before a request ever reaches FastAPI, with no useful error in either the backend or the inference logs.

gcloud run services proxy papers over this for local development: it runs a localhost proxy that attaches a fresh token to each forwarded request, so pointing FLUX_API_URL at it unblocks testing immediately. But the friction is the point. Rather than keep wrestling the network layer, I left it open (--allow-unauthenticated) and moved auth up into the application as a static API key — which needs no identity token and no proxy, and behaves the same from a laptop as from anywhere else. That is the design at the top of this section; the proxy belongs to the approach I dropped.
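
From the caller's side, the static-key design reduces to one header. The sketch below uses only the standard library; the /generate path and the FLUX_API_URL and FLUX_API_KEY names come from this post, while the request body shape (a single prompt field) and the function name are assumptions.

```python
import json
import urllib.request


def build_generate_request(base_url: str, api_key: str, prompt: str) -> urllib.request.Request:
    """Build the POST /generate request with the static key as a bearer token."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")  # assumed payload shape
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

Sending it is urllib.request.urlopen(request, timeout=...) with the URL and key read from FLUX_API_URL and FLUX_API_KEY; given the cold starts described earlier, the timeout wants to be generous.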

What it still costs

A static API key is a shared secret. For this use case — a single-tenant side-project — that's acceptable. For anything multi-tenant or higher-security, the right approach is IAM: each caller gets its own service account, permissions are per-caller, and there's no shared credential to rotate.
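
One small, optional hardening if the shared key stays: compare it in constant time rather than with !=, so the comparison itself leaks nothing about how much of a guess was right. This is a suggestion, not something the service above does, and the helper name is hypothetical.

```python
import hmac


def keys_match(presented: str, expected: str) -> bool:
    """Constant-time comparison of the presented key against the configured one."""
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))
```

Inside a key check like the one above, the condition would become `not keys_match(credentials.credentials, settings.flux_api_key)`.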


What I took from this

Three lessons stuck, one from each layer this project made me get wrong — the deploy architecture, the runtime, and the file on disk itself.

Weights-as-data. The weights-in-image problem is a category error — it treats model weights like application code and pays the price in CI time, storage cost, and iteration speed. The GCS mount pattern solves all three at once. It should be the default starting point for any GPU inference service on Cloud Run.

Match the file format to the read pattern. The 74-minute cold start was caused by a mismatch between how GCS FUSE works (network-backed, random-access cost) and how runtime quantisation works (per-tensor reads). GGUF's sequential flat-binary layout sidesteps this entirely. Understanding the read pattern of your storage format and the access pattern of your consumer is the kind of thing that doesn't appear in deployment tutorials because it requires knowing both ends simultaneously.

Verify the label. The GGUF filename is decoration. The _name_or_path embedded in the model config is what was actually converted. Log it at startup. The same principle applies to cached configs, dependency versions, and injected secrets: the thing you think is there is not always the thing that's there.

What's left is ordinary optimisation: removing CPU offloading once T5 can be loaded quantised and the whole pipeline fits within the L4's budget, reducing cold start time further, and understanding the per-step timing.