The case for self-hosting LLMs inside enterprise infrastructure keeps getting easier to make on slides and harder to execute in production. Data sovereignty requirements are real, the regulatory pressure is real, and the open-source model quality has closed the gap substantially. The part that gets glossed over: what it actually costs to run this in a way that's reliable enough to build products on.
This article is a cost and trade-off breakdown for senior engineers and technical leaders evaluating an on-premise LLM strategy. Not a benchmark comparison — that's covered elsewhere. The focus here is the operational reality: hardware tiers, hidden expenses, serving infrastructure, and where the model on your GPU simply isn't the right answer.
Why enterprises are moving to self-hosting
The primary driver is not cost. Across the enterprise deployments that have become public in 2025–2026, the pattern is consistent: data residency requirements and compliance constraints are what force the decision, not token economics. Cost reduction is the secondary justification, rarely the trigger.
The compliance triggers that come up most often:
- GDPR and the Garante italiano — sending personal data to a US-based API for processing requires a valid legal basis, a DPA, and in some interpretations is simply prohibited for sensitive categories. Several EU DPAs have issued guidance that cloud LLM processing of personal data is not compatible with Chapter V transfer rules without additional safeguards.
- HIPAA (US healthcare) — Business Associate Agreements with major LLM providers exist but are narrow. Any use of protected health information in prompts requires careful scoping.
- Sector-specific regulation — banking, insurance, and public administration in Italy are increasingly requiring that AI processing of sensitive data occur on infrastructure under direct organizational control.
- IP and trade secret protection — engineering teams uncomfortable sending proprietary code or internal documents to external APIs. A legitimate concern at enterprises where source code constitutes competitive advantage.
Hardware tiers — what the market actually looks like
Model capability is gated by VRAM. Everything else — CPU, storage, network — is secondary. The practical tiers for enterprise self-hosting in 2026:
| Tier | Hardware | VRAM | Models it runs | Acquisition cost |
|---|---|---|---|---|
| Developer workstation | RTX 4090 | 24 GB | 7B full, 13B Q4, 27B Q3 | ~€2,000 |
| Departmental server | 2× RTX 4090 | 48 GB | 34B Q4, 70B Q3 | ~€5,000 |
| Professional GPU | NVIDIA L40S | 48 GB | 70B Q4 single-card, data-center grade | ~€10,000 |
| Data center (entry) | A100 40 GB | 40 GB | 70B Q8, 34B full | ~€8,000–12,000 (used) |
| Data center (standard) | H100 80 GB | 80 GB | 70B full, 2× model sharding | ~€25,000–35,000 |
The 27B parameter range (Qwen 3.5 27B, Mistral Small 3.1) represents the practical sweet spot for enterprise use cases today: fits in 24 GB VRAM at Q4 quantization, delivers 75–85% of frontier model quality on structured tasks, and runs on hardware an IT department can manage without specialized MLOps skills.
The real cost structure
Hardware acquisition is the number people quote. It's rarely the dominant cost over a 3-year horizon. A worked example for a single RTX 4090 server serving a team of 15–20 engineers:
At ~€470/month, this serves a team of 15–20 for general coding and data tasks. The equivalent Claude Sonnet 4.6 spend for that team is roughly €200–400/month depending on usage intensity. The economics are not obviously favorable — and this is the optimistic scenario with stable ops and no hardware failures.
The cost structure shifts significantly at scale. A 10-GPU cluster serving 200+ users amortizes the ops overhead across a much larger denominator, and the per-user cost drops below any cloud API option. The break-even point — where on-premise becomes clearly cheaper than cloud APIs — is typically around 50–80 heavy users or a sustained throughput requirement above 500K tokens/day.
Serving infrastructure: Ollama vs vLLM
The serving layer is where most enterprise self-hosting deployments encounter their first serious problems. Two tools dominate the current landscape; they solve different problems.
Ollama
Ollama is the right tool for a developer workstation or a small team exploration environment. The setup time is minutes, the model library covers every current open-source model of interest, and the OpenAI-compatible API means existing tooling integrates without modification.
The production limitations are real and documented:
- Single-request concurrency by default. Ollama queues concurrent requests rather than batching them. Under any meaningful concurrent load — five engineers sending requests simultaneously — latency becomes unacceptable.
- No authentication layer. The HTTP server has no built-in auth. Running it on a shared server requires an nginx proxy and manual token management.
- No horizontal scaling. There is no native mechanism to distribute load across multiple GPU nodes.
- Memory management is basic. Ollama keeps a model in VRAM for a configurable keep-alive window, then unloads it. Switching between models in a multi-model setup causes repeated load/unload cycles that dominate latency.
The summary: Ollama for development, evaluation, and teams under five concurrent users. Not for production service endpoints.
vLLM
vLLM is the production-grade option. PagedAttention — the memory management technique that gives vLLM its throughput advantage — allows efficient batching of requests that would otherwise fight over contiguous VRAM blocks. The practical result: 2–4× higher throughput than naive serving at the same hardware tier, with sub-linear latency degradation under load.
The operational overhead is proportionally higher:
- Python environment management with CUDA version dependencies that break on OS updates
- Tensor parallelism configuration across multiple GPUs requires explicit tuning
- The OpenAI-compatible API is complete, but streaming and function calling behavior differs from Anthropic/OpenAI in subtle ways that affect client code
- Model loading from HuggingFace requires outbound internet access or an internal model registry — the latter requires additional infrastructure
For teams without existing MLOps experience, plan for one engineer spending 30–50% of their time managing a vLLM deployment for the first quarter. That cost rarely appears in the initial business case.
Model selection for enterprise self-hosting
The practical shortlist for enterprise deployments as of Q2 2026, filtered for license compatibility with commercial use:
| Model | Params | License | Best for | VRAM (Q4) |
|---|---|---|---|---|
| Qwen 3.5 27B | 27B | Apache 2.0 | SQL, coding, general reasoning | ~16 GB |
| Mistral Small 3.1 | 24B | Apache 2.0 | EU deployments, multilingual | ~14 GB |
| Mistral Devstral | 24B | Apache 2.0 | Agentic coding workflows | ~14 GB |
| Llama 3.3 70B | 70B | Llama community | General tasks at higher quality | ~40 GB |
| Codestral 25.01 | 22B | Non-commercial | Code completion only | ~13 GB |
License compliance is non-negotiable in enterprise contexts. The Llama community license prohibits use in applications with over 700 million monthly active users — not a constraint for most enterprise internal tools, but worth reading in full. Codestral is released under the Mistral AI Non-Production License, which prohibits use in production environments entirely — research and evaluation only. Qwen and Mistral Small are genuinely Apache 2.0 with no usage restrictions.
When self-hosting is the wrong answer
The self-hosting conversation often starts from the assumption that it's inherently more secure or more private. Neither is automatically true. An on-premise deployment that lacks proper access controls, audit logging, and patch management creates a different risk profile than cloud APIs — not a lower one.
- Regulatory requirement for data residency is explicit and documented
- Sustained throughput above 500K tokens/day makes cloud costs prohibitive
- Workloads involve IP or trade secrets that cannot leave the perimeter
- You have MLOps capacity to own the stack
- Offline or air-gapped operation is required
- The driver is cost and usage is below 50K tokens/day
- The team has no GPU infrastructure experience
- Tasks require frontier model quality (complex reasoning, novel problem types)
- You need the latest model capabilities within weeks of release
- The IT environment can't support 24/7 infrastructure ownership
The hybrid approach that most mature enterprise AI teams land on: self-hosted open-source models for high-volume, structured, sensitive-data workloads (classification, extraction, SQL generation on internal schemas), combined with cloud APIs routed through Bedrock EU or Vertex AI EU for complex reasoning tasks that open-source models handle poorly.
Self-hosting an LLM is building and operating a piece of infrastructure. The model is the easy part — Ollama installs in five minutes. What you're actually committing to is a GPU server, a serving stack, a model registry, an update cadence, a security posture, and someone accountable for uptime. Price all of that before comparing it to an API invoice.
The EU compliance path
For Italian enterprises operating under GDPR and the Garante's enforcement guidance, the on-premise route using Mistral or Qwen models on internal infrastructure is the most legally defensible posture for AI processing of personal data. The combination that satisfies both GDPR and the EU AI Act's technical requirements for high-risk AI use cases:
- vLLM or Ollama serving a Mistral or Qwen model on internal compute
- Langfuse (self-hosted) for observability, audit trails, and prompt logging — required documentation for AI Act compliance
- An internal model registry (MinIO + a version manifest) to control which model version is in production and maintain rollback capability
- A data classification layer that routes requests containing personal data exclusively to the on-premise endpoint — not a step that can be skipped
The alternative — cloud APIs with a DPA and the inference_geo parameter set to EU regions — is valid for many use cases and operationally simpler. Whether it satisfies your specific legal basis for processing depends on the data categories involved and your legal team's reading of the applicable supervisory authority guidance. For sensitive categories under Article 9 GDPR, on-premise is the more defensible choice.