The case for self-hosting LLMs inside enterprise infrastructure keeps getting easier to make on slides and harder to execute in production. Data sovereignty requirements are real, the regulatory pressure is real, and the open-source model quality has closed the gap substantially. The part that gets glossed over: what it actually costs to run this in a way that's reliable enough to build products on.

This article is a cost and trade-off breakdown for senior engineers and technical leaders evaluating an on-premise LLM strategy. Not a benchmark comparison — that's covered elsewhere. The focus here is the operational reality: hardware tiers, hidden expenses, serving infrastructure, and where the model on your GPU simply isn't the right answer.

Why enterprises are moving to self-hosting

The primary driver is not cost. Across the enterprise deployments that have become public in 2025–2026, the pattern is consistent: data residency requirements and compliance constraints are what force the decision, not token economics. Cost reduction is the secondary justification, rarely the trigger.

The compliance triggers that come up most often:

Hardware tiers — what the market actually looks like

Model capability is gated by VRAM. Everything else — CPU, storage, network — is secondary. The practical tiers for enterprise self-hosting in 2026:

TierHardwareVRAMModels it runsAcquisition cost
Developer workstation RTX 4090 24 GB 7B full, 13B Q4, 27B Q3 ~€2,000
Departmental server 2× RTX 4090 48 GB 34B Q4, 70B Q3 ~€5,000
Professional GPU NVIDIA L40S 48 GB 70B Q4 single-card, data-center grade ~€10,000
Data center (entry) A100 40 GB 40 GB 70B Q8, 34B full ~€8,000–12,000 (used)
Data center (standard) H100 80 GB 80 GB 70B full, 2× model sharding ~€25,000–35,000

The 27B parameter range (Qwen 3.5 27B, Mistral Small 3.1) represents the practical sweet spot for enterprise use cases today: fits in 24 GB VRAM at Q4 quantization, delivers 75–85% of frontier model quality on structured tasks, and runs on hardware an IT department can manage without specialized MLOps skills.

The real cost structure

Hardware acquisition is the number people quote. It's rarely the dominant cost over a 3-year horizon. A worked example for a single RTX 4090 server serving a team of 15–20 engineers:

3-year TCO — single RTX 4090 server
Hardware (server + GPU + networking) €4,500
Electricity (~900W full server draw, €0.28/kWh, 70% utilization) €1,540 / year → €4,620
IT ops (racking, maintenance, monitoring) — 2h/month €3,600 (at €50/hr blended)
Model updates and re-testing (quarterly) €2,400 (4h/quarter)
Security patching (OS, CUDA drivers, serving stack) €1,800
Total 3-year cost ~€16,920

At ~€470/month, this serves a team of 15–20 for general coding and data tasks. The equivalent Claude Sonnet 4.6 spend for that team is roughly €200–400/month depending on usage intensity. The economics are not obviously favorable — and this is the optimistic scenario with stable ops and no hardware failures.

The cost structure shifts significantly at scale. A 10-GPU cluster serving 200+ users amortizes the ops overhead across a much larger denominator, and the per-user cost drops below any cloud API option. The break-even point — where on-premise becomes clearly cheaper than cloud APIs — is typically around 50–80 heavy users or a sustained throughput requirement above 500K tokens/day.

Serving infrastructure: Ollama vs vLLM

The serving layer is where most enterprise self-hosting deployments encounter their first serious problems. Two tools dominate the current landscape; they solve different problems.

Ollama

Ollama is the right tool for a developer workstation or a small team exploration environment. The setup time is minutes, the model library covers every current open-source model of interest, and the OpenAI-compatible API means existing tooling integrates without modification.

The production limitations are real and documented:

The summary: Ollama for development, evaluation, and teams under five concurrent users. Not for production service endpoints.

vLLM

vLLM is the production-grade option. PagedAttention — the memory management technique that gives vLLM its throughput advantage — allows efficient batching of requests that would otherwise fight over contiguous VRAM blocks. The practical result: 2–4× higher throughput than naive serving at the same hardware tier, with sub-linear latency degradation under load.

The operational overhead is proportionally higher:

For teams without existing MLOps experience, plan for one engineer spending 30–50% of their time managing a vLLM deployment for the first quarter. That cost rarely appears in the initial business case.

Model selection for enterprise self-hosting

The practical shortlist for enterprise deployments as of Q2 2026, filtered for license compatibility with commercial use:

ModelParamsLicenseBest forVRAM (Q4)
Qwen 3.5 27B 27B Apache 2.0 SQL, coding, general reasoning ~16 GB
Mistral Small 3.1 24B Apache 2.0 EU deployments, multilingual ~14 GB
Mistral Devstral 24B Apache 2.0 Agentic coding workflows ~14 GB
Llama 3.3 70B 70B Llama community General tasks at higher quality ~40 GB
Codestral 25.01 22B Non-commercial Code completion only ~13 GB

License compliance is non-negotiable in enterprise contexts. The Llama community license prohibits use in applications with over 700 million monthly active users — not a constraint for most enterprise internal tools, but worth reading in full. Codestral is released under the Mistral AI Non-Production License, which prohibits use in production environments entirely — research and evaluation only. Qwen and Mistral Small are genuinely Apache 2.0 with no usage restrictions.

When self-hosting is the wrong answer

The self-hosting conversation often starts from the assumption that it's inherently more secure or more private. Neither is automatically true. An on-premise deployment that lacks proper access controls, audit logging, and patch management creates a different risk profile than cloud APIs — not a lower one.

Self-host when
  • Regulatory requirement for data residency is explicit and documented
  • Sustained throughput above 500K tokens/day makes cloud costs prohibitive
  • Workloads involve IP or trade secrets that cannot leave the perimeter
  • You have MLOps capacity to own the stack
  • Offline or air-gapped operation is required
Don't self-host when
  • The driver is cost and usage is below 50K tokens/day
  • The team has no GPU infrastructure experience
  • Tasks require frontier model quality (complex reasoning, novel problem types)
  • You need the latest model capabilities within weeks of release
  • The IT environment can't support 24/7 infrastructure ownership

The hybrid approach that most mature enterprise AI teams land on: self-hosted open-source models for high-volume, structured, sensitive-data workloads (classification, extraction, SQL generation on internal schemas), combined with cloud APIs routed through Bedrock EU or Vertex AI EU for complex reasoning tasks that open-source models handle poorly.

The honest accounting

Self-hosting an LLM is building and operating a piece of infrastructure. The model is the easy part — Ollama installs in five minutes. What you're actually committing to is a GPU server, a serving stack, a model registry, an update cadence, a security posture, and someone accountable for uptime. Price all of that before comparing it to an API invoice.

The EU compliance path

For Italian enterprises operating under GDPR and the Garante's enforcement guidance, the on-premise route using Mistral or Qwen models on internal infrastructure is the most legally defensible posture for AI processing of personal data. The combination that satisfies both GDPR and the EU AI Act's technical requirements for high-risk AI use cases:

The alternative — cloud APIs with a DPA and the inference_geo parameter set to EU regions — is valid for many use cases and operationally simpler. Whether it satisfies your specific legal basis for processing depends on the data categories involved and your legal team's reading of the applicable supervisory authority guidance. For sensitive categories under Article 9 GDPR, on-premise is the more defensible choice.

MM
Michele Mader
Technical Leader · Fortop S.R.L.

I lead the technical direction of AI-driven data products for enterprise clients — defining architecture, making stack decisions, and owning delivery from roadmap to production.

Connect on LinkedIn