Paperless-ngx 2.20 on a VPS: The Right OCR, Mail-Ingestion, and Storage Layout from Day One

If you want Paperless-ngx on a VPS, the real decision is not "can I get it running?" That part is easy. The real decision is whether you are going to install it in a way that still makes sense six months later, when you have thousands of PDFs, OCR jobs queued, mail ingestion rules piling up, and backups large enough to hurt if you got the storage layout wrong. Paperless-ngx is one of those apps that rewards getting the boring parts right from day one.

The correct starting point in 2026 is simple: use Paperless-ngx 2.20.x, run it on a roomy LXC or a small KVM VPS, use PostgreSQL, keep Redis local, separate your consume and archive paths cleanly, and decide early whether you want Office document conversion and email ingestion. If you skip those choices and just "throw Docker at it," you will usually end up rebuilding the stack later.

This tutorial is not a vague overview. It is the install path for a practical Paperless-ngx VPS: correct OCR expectations, proper mail-ingestion support, sane Docker Compose layout, and a storage design that does not become a mess the moment you start importing real documents.

Why Paperless-ngx 2.20 is worth installing now

Paperless-ngx was already a useful self-hosted document system before the 2.20 line, but 2.20.x is a better starting point for new installs than older pinned stacks people still copy from random blog posts. The project’s official changelog shows three things that matter operationally: 2.20.0 moved the Docker image base to Debian Trixie, the release included performance-oriented changes, and later 2.20.x releases addressed security issues. That is enough reason to stop pasting stale Compose files built around older versions and just pin a clean 2.20.x tag from the start.

That does not mean "always chase latest" blindly. It means do not start a fresh install on an older branch for no reason. If you are building this on a VPS today, start from a current 2.20.x image, document the version you pinned, and upgrade deliberately later instead of inheriting old deployment debt on day one.

LXC or KVM first: which VPS shape actually fits Paperless-ngx?

Paperless-ngx runs well on either a roomy LXC or a small KVM VPS. The right answer depends less on Paperless itself and more on how much you trust your container model, how much isolation you want, and how much storage weirdness you intend to introduce later.

A roomy LXC is usually the better fit when:

you want lower overhead
the storage is local and straightforward
you are not trying to do unusual nested virtualization tricks
you are comfortable managing Docker inside a containerized environment

A small KVM VPS is usually the better fit when:

you want a cleaner isolation boundary
you expect to add reverse proxy, backup agents, or other services around it
you do not want container-on-container edge cases
you prefer fewer surprises with storage mounts and permissions

The blunt version is this: if this is your personal document archive and you know your Proxmox or VPS environment well, a roomy LXC is perfectly reasonable. If this is business-critical, shared between users, or you already know you hate debugging container nesting and mount behavior, use a small KVM and move on.

For most small deployments, a sane starting point looks like this:

2 vCPU
4 GB RAM minimum
6 to 8 GB RAM if you expect heavier OCR and document conversion
60 to 100 GB SSD to start, more if you already have a backlog of scanned PDFs
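
If you want to check an existing VPS against those numbers before installing anything, a rough preflight sketch looks like this. The thresholds are assumptions derived from the list above (3500 MB stands in for "4 GB", because a 4 GB VPS normally reports a little less), and it measures free space on / rather than total disk size.

```shell
# Rough preflight check against the sizing above (thresholds are assumptions).
# meets_minimum ACTUAL REQUIRED -> exit status 0 when ACTUAL >= REQUIRED
meets_minimum() {
  [ "$1" -ge "$2" ]
}

# report LABEL ACTUAL REQUIRED UNIT -> prints one status line
report() {
  if meets_minimum "$2" "$3"; then
    echo "ok:   $1 = $2 $4 (want >= $3)"
  else
    echo "low:  $1 = $2 $4 (want >= $3)"
  fi
}

cpus=$(nproc)
ram_mb=$(awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo)
disk_gb=$(df -Pk / | awk 'NR == 2 { print int($4 / 1024 / 1024) }')

report "vCPU" "$cpus" 2 "cores"
# A "4 GB" VPS usually reports slightly less than 4096 MB, hence 3500.
report "RAM" "$ram_mb" 3500 "MB"
report "free disk on /" "$disk_gb" 60 "GB"
```

A "low" line is a prompt to think, not a hard failure. A small archive will run on less, just not comfortably once OCR queues build up.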

If you need a clean place to run it, this is exactly the kind of workload that fits well on ServerSpan Virtual Servers. If you do not want to own the Linux side at all, that is where ServerSpan Linux Administration becomes more rational than pretending you will "figure it out later."

The storage layout that saves you pain later

This is the part most rushed tutorials get wrong. Paperless-ngx does not just need "a volume." It needs storage paths that stay understandable after growth.

You should think in terms of five functional areas:

consume for incoming files waiting to be processed
media for stored documents and thumbnails
data for internal app data, search index, model files, and similar state
export for explicit exports
db for the PostgreSQL data directory

Do not mash everything into one random folder if you care about backups, troubleshooting, or migrations. Yes, Paperless can technically work with a simpler layout. That does not make it a good idea.

A clean host-side directory structure on the VPS looks like this:

/srv/paperless/
├── consume/
├── media/
├── data/
├── export/
├── db/
└── compose/

This gives you something important: you can back up media differently from PostgreSQL, you can inspect consume without digging through app internals, and you can migrate the stack later without reverse engineering your own shortcuts.

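One more rule that matters: keep consume on fast local storage if you can. Paperless watches the directory with inotify, and inotify does not reliably report files written by other machines over network shares such as NFS or SMB. If you later drop files in from network storage and pickup becomes inconsistent, Paperless documents the fallback option PAPERLESS_CONSUMER_POLLING, which rescans the directory every N seconds instead. That is a workaround, not your first choice.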
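
A quick way to see which side of that line you are on is to check the filesystem type behind the consume directory. This is a sketch, not an official check: the list of "network-ish" filesystem types and the 60-second interval are assumptions, and CONSUME_DIR is a hypothetical override variable.

```shell
# needs_polling FSTYPE -> exit status 0 for filesystem types where inotify
# events from remote writers are typically not delivered (assumed list).
needs_polling() {
  case "$1" in
    nfs|nfs4|cifs|smb3|fuse.sshfs|9p) return 0 ;;
    *) return 1 ;;
  esac
}

consume_dir=${CONSUME_DIR:-/srv/paperless/consume}

# findmnt -T resolves the mount that holds the path; -n -o FSTYPE prints only its type.
if [ -d "$consume_dir" ]; then
  fstype=$(findmnt -n -o FSTYPE -T "$consume_dir")
  if needs_polling "$fstype"; then
    echo "$consume_dir is on $fstype: consider PAPERLESS_CONSUMER_POLLING=60"
  else
    echo "$consume_dir is on $fstype: inotify should be fine, leave polling at 0"
  fi
fi
```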

What OCR, Tika, and Gotenberg actually do

A lot of people use the word OCR loosely here. Paperless-ngx’s document pipeline is not one single feature blob.

OCR is what turns scanned image content into searchable text.
Tika helps parse file contents and metadata from document types that are not simple PDFs.
Gotenberg handles conversion work for Office-style documents.
Tika and Gotenberg together are also required if you want proper parsing of email files such as .eml.

This is why you should decide early whether your Paperless stack is only for scanned PDFs and image imports, or whether you also want Office files and email ingestion to behave properly. If you skip Tika and Gotenberg now, then add email workflows later, you will end up revisiting the Compose stack anyway.

The practical answer for a fresh VPS install is simple: deploy with Tika and Gotenberg from day one unless you are absolutely sure this will remain a PDF-only archive.

Base OS and package preparation

This guide assumes a current Debian 12 or Ubuntu 24.04 VPS with root access. Update the system first and install Docker plus the Compose plugin properly. Do not build a fresh app stack on an unpatched base.

# Refresh package lists and apply pending updates before installing anything
apt update
apt upgrade -y
apt install -y ca-certificates curl gnupg lsb-release

# Store Docker's signing key in a dedicated keyring
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/debian/gpg | gpg --dearmor -o /etc/apt/keyrings/docker.gpg
chmod a+r /etc/apt/keyrings/docker.gpg

# Add the Docker apt repository for this Debian release
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
  https://download.docker.com/linux/debian $(. /etc/os-release && echo "$VERSION_CODENAME") stable" \
  > /etc/apt/sources.list.d/docker.list

# Install Docker Engine and the Compose plugin
apt update
apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

# Start Docker now and on boot, then confirm both CLIs respond
systemctl enable --now docker
docker version
docker compose version

The commands above target Debian. If you are on Ubuntu, use the Ubuntu Docker repo instead of blindly copying them: replace linux/debian with linux/ubuntu in both the GPG key URL and the repository line. The point is not the exact vendor line. The point is to install Docker cleanly and verify it before the application stack enters the picture.
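
If you script this across machines, one way to avoid the Debian/Ubuntu copy-paste mistake is to derive the repository base from /etc/os-release. The helper name docker_repo_base is made up for this sketch, and it deliberately handles only the two distributions this guide assumes.

```shell
# docker_repo_base ID -> prints the Docker apt repository base URL for a distro ID.
# Only debian and ubuntu are handled here; anything else is rejected.
docker_repo_base() {
  case "$1" in
    debian|ubuntu) echo "https://download.docker.com/linux/$1" ;;
    *) echo "unsupported distro: $1" >&2; return 1 ;;
  esac
}

# ID comes from /etc/os-release ("debian" or "ubuntu" on the systems this guide targets).
if [ -r /etc/os-release ]; then
  distro_id=$(. /etc/os-release && echo "$ID")
  docker_repo_base "$distro_id" || true
fi
```

The printed URL is the prefix for both the gpg key download and the deb line in docker.list.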

Create the Paperless-ngx directory layout

mkdir -p /srv/paperless/{consume,media,data,export,db,compose}
cd /srv/paperless/compose

Use a dedicated location like /srv/paperless or another explicit service path. Do not scatter volumes under random home directories unless you enjoy forgetting where your own data lives.

The Docker Compose stack that makes sense

For a fresh install, use PostgreSQL. Paperless-ngx officially recommends PostgreSQL for new installations. That is enough reason not to start new production-ish installs on SQLite just because a smaller example file exists somewhere on the internet.

Create docker-compose.yml:

services:
  broker:
    image: docker.io/library/redis:7
    container_name: paperless-redis
    restart: unless-stopped
    volumes:
      - redisdata:/data

  db:
    image: docker.io/library/postgres:16
    container_name: paperless-postgres
    restart: unless-stopped
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: change-this-now
    volumes:
      - /srv/paperless/db:/var/lib/postgresql/data

  gotenberg:
    image: docker.io/gotenberg/gotenberg:8
    container_name: paperless-gotenberg
    restart: unless-stopped
    command:
      - gotenberg
      - --chromium-disable-javascript=true
      - --chromium-allow-list=file:///tmp/.*

  tika:
    image: docker.io/apache/tika:latest
    container_name: paperless-tika
    restart: unless-stopped

  webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:2.20.8
    container_name: paperless-web
    restart: unless-stopped
    depends_on:
      - db
      - broker
      - gotenberg
      - tika
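    ports:
      # Published on all host interfaces. If a reverse proxy on this same host fronts
      # the app, "127.0.0.1:8000:8000" keeps port 8000 off the public interface.
      - "8000:8000"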
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_DBNAME: paperless
      PAPERLESS_DBUSER: paperless
      PAPERLESS_DBPASS: change-this-now
      PAPERLESS_URL: https://paperless.example.com
      PAPERLESS_SECRET_KEY: change-this-to-a-long-random-secret
      PAPERLESS_TIME_ZONE: Europe/Bucharest
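      # Romanian is not bundled in the image; PAPERLESS_OCR_LANGUAGES installs it at startup
      PAPERLESS_OCR_LANGUAGES: ron
      PAPERLESS_OCR_LANGUAGE: eng+ron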
      PAPERLESS_TIKA_ENABLED: 1
      PAPERLESS_TIKA_ENDPOINT: http://tika:9998
      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
      PAPERLESS_CONSUMER_POLLING: 0
      PAPERLESS_ADMIN_USER: admin
      PAPERLESS_ADMIN_PASSWORD: change-this-now
    volumes:
      - /srv/paperless/data:/usr/src/paperless/data
      - /srv/paperless/media:/usr/src/paperless/media
      - /srv/paperless/export:/usr/src/paperless/export
      - /srv/paperless/consume:/usr/src/paperless/consume

volumes:
  redisdata:

This is not the only valid stack. It is the one that gets the important things right from day one:

PostgreSQL is separate and persistent.
Redis is local and not shared with random unrelated apps.
Tika and Gotenberg are included from the start.
The Paperless data paths are explicit and bind-mounted.
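The Paperless-ngx image is pinned to a 2.20.x release, not "latest". The Tika image in this file still uses the latest tag; pin it to a specific release as well once you have verified one.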

Change the passwords and secret immediately. The example values are placeholders, not suggestions.
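
One way to produce replacement values is to pull random bytes from the kernel and print them as hex. This is a sketch using only coreutils; gen_secret is a hypothetical helper name, and hex output is chosen so the values need no quoting or escaping in YAML.

```shell
# gen_secret [BYTES] -> prints BYTES random bytes as lowercase hex (default 32 bytes = 64 chars).
gen_secret() {
  head -c "${1:-32}" /dev/urandom | od -An -tx1 | tr -d ' \n'
  echo
}

# One value per secret; the database password must be identical in
# POSTGRES_PASSWORD and PAPERLESS_DBPASS.
echo "PAPERLESS_SECRET_KEY: $(gen_secret 32)"
echo "POSTGRES_PASSWORD / PAPERLESS_DBPASS: $(gen_secret 16)"
echo "PAPERLESS_ADMIN_PASSWORD: $(gen_secret 16)"
```

Paste the output into docker-compose.yml before the first docker compose up, and restrict the file so other users on the host cannot read it, for example with chmod 600.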

Bring the stack up and verify it cleanly

cd /srv/paperless/compose
docker compose up -d
docker compose ps
docker compose logs -f webserver

Do not stop at "containers are up." Wait until the app is reachable and the logs show normal startup. Then test the web UI directly:

curl -I http://127.0.0.1:8000
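
The first start takes a while because migrations run before the web server answers, so a single curl can fail for boring reasons. A small retry loop is more honest. This is a sketch: wait_for_http is a hypothetical helper, the timeouts are arbitrary, and the commented sidecar checks assume curl is available inside the webserver container.

```shell
# wait_for_http URL [TIMEOUT_SECONDS] -> exit status 0 once URL answers, 1 on timeout.
wait_for_http() {
  url=$1
  remaining=${2:-120}
  while [ "$remaining" -gt 0 ]; do
    # -f: treat HTTP errors (4xx/5xx) as failure, -s: quiet, -o /dev/null: discard the body
    if curl -fs -o /dev/null --max-time 2 "$url"; then
      return 0
    fi
    sleep 1
    remaining=$((remaining - 1))
  done
  return 1
}

# Example calls, left commented so the block can be pasted without side effects:
# wait_for_http http://127.0.0.1:8000 180 && echo "web UI is answering"
# docker compose exec webserver curl -fsS http://gotenberg:3000/health
# docker compose exec webserver curl -fsS http://tika:9998/version
```

The two docker compose exec lines test Gotenberg and Tika from inside the Compose network, which is the path Paperless itself uses.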

If you are exposing it behind a reverse proxy, that comes next. Do not point DNS at it until the local app itself is healthy.

Reverse proxy and public URL done properly

If this is not a private-only deployment, put Paperless behind a reverse proxy and use a real hostname. The configuration variable that matters most here is PAPERLESS_URL. It should match the public scheme and hostname you actually intend to use.

For example:

PAPERLESS_URL=https://paperless.example.com

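Do not leave it vague or wrong. Paperless documents that this setting feeds the allowed-hosts, CORS, and CSRF trusted-origin checks, so a mismatch between PAPERLESS_URL and the address in the browser typically shows up as rejected requests or CSRF errors at login. Include the scheme and leave off the trailing slash. If you deploy behind a proxy later and forget to correct it, you create the kind of self-inflicted admin breakage that wastes an hour for no good reason.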
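
A cheap guard against the usual typos is to validate the value before it goes into the Compose file. The rules below are a conservative assumption (scheme required, no trailing slash, no path), not a copy of Paperless's own validation, and valid_paperless_url is a hypothetical helper name.

```shell
# valid_paperless_url URL -> exit status 0 when URL has an http(s) scheme,
# a host, and no trailing slash or path.
valid_paperless_url() {
  case "$1" in
    */) return 1 ;;                  # trailing slash
    http://?*|https://?*) ;;         # scheme plus at least one host character
    *) return 1 ;;                   # missing or unsupported scheme
  esac
  # Reject anything with a path after the host.
  rest=${1#*://}
  case "$rest" in
    */*) return 1 ;;
  esac
  return 0
}

for candidate in https://paperless.example.com paperless.example.com https://paperless.example.com/; do
  if valid_paperless_url "$candidate"; then
    echo "ok:  $candidate"
  else
    echo "bad: $candidate"
  fi
done
```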

The right OCR setup from day one

The OCR decision is not just "do I turn OCR on?" It is "what languages do I actually need, and will the system produce usable searchable text for my documents?"

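If your archive is mostly English and Romanian, this is the correct kind of starting point:

PAPERLESS_OCR_LANGUAGES=ron
PAPERLESS_OCR_LANGUAGE=eng+ron

PAPERLESS_OCR_LANGUAGE tells the OCR engine which languages to expect in your documents. PAPERLESS_OCR_LANGUAGES tells the Docker image which extra language packs to install at startup. The image ships with English, German, French, Italian, and Spanish only, so Romanian has to be requested explicitly or the eng+ron setting has nothing to load.

If you only set English but half your scans are Romanian, your OCR will still run. It will just do a worse job than it should. That matters later when you are relying on search to find invoices, contracts, IDs, or handwritten annotations that were already difficult to OCR.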
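
To confirm the language packs you asked for are actually present, compare the OCR setting with what Tesseract reports inside the container. The helper below is a sketch with a made-up name (missing_langs). The commented usage assumes the tesseract binary is on the PATH in the webserver container and that the first line of its output is a header.

```shell
# missing_langs WANTED INSTALLED -> prints each language code from WANTED
# (plus-separated, as in PAPERLESS_OCR_LANGUAGE) that is absent from INSTALLED
# (whitespace-separated, as printed by `tesseract --list-langs`).
missing_langs() {
  for lang in $(printf '%s' "$1" | tr '+' ' '); do
    found=no
    for have in $2; do
      if [ "$lang" = "$have" ]; then
        found=yes
      fi
    done
    if [ "$found" = no ]; then
      echo "$lang"
    fi
  done
}

# Example against a running stack (commented; needs the containers up):
# installed=$(docker compose exec -T webserver tesseract --list-langs | tail -n +2)
# missing_langs "eng+ron" "$installed"

# Offline illustration using the languages bundled with the image:
missing_langs "eng+ron" "deu eng fra ita spa"
```

Any code it prints belongs in PAPERLESS_OCR_LANGUAGES.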

Also understand what OCR cannot fix:

terrible scan quality
bad contrast
skewed pages
photos of documents that were never scanned properly

Paperless is good. It is not magic. If the input is garbage, searchable output will still be limited.

Mail ingestion done correctly from the beginning

This is one of the features that makes Paperless-ngx worth the effort. If invoices, statements, contracts, or tickets arrive by email, you do not want to forward them manually forever. Use the built-in mail consumption properly.

The official workflow is simple inside the web interface:

define one or more email accounts
define mail rules for those accounts
decide how attachments, body content, and routing should work

What matters operationally is getting the policy right early:

use a dedicated mailbox for invoices and statements if possible
do not point Paperless at a noisy personal inbox full of junk
tag mail-ingested documents consistently
use storage paths and correspondents deliberately, not as an afterthought

If you want Paperless to process email documents properly, including .eml parsing and office attachment handling, that is exactly why Tika and Gotenberg were included in the stack earlier. This is one of those design choices that is annoying to retrofit later if you skipped it at install time.

The storage-path model that keeps the archive usable

Paperless-ngx can store documents without any special filename or path structure. That is fine for a lab. It is weak for a real archive. Decide early whether you want predictable on-disk organization.

A practical starting point is to keep the archive grouped by correspondent and year, so exported or backed-up media still makes some human sense outside the UI. You can use Paperless filename formatting for that later, but do not over-engineer it on day one. The important part is that your media directory is durable, separate from consume, and included in backup planning.
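
When you do get to filename formatting, the setting is PAPERLESS_FILENAME_FORMAT. The value below is only an assumed starting point that matches the "correspondent, then year" grouping described above; adjust the placeholders to taste.

```shell
# Assumed starting value: group stored files by correspondent, then by year created.
PAPERLESS_FILENAME_FORMAT='{{ correspondent }}/{{ created_year }}/{{ title }}'

# Print the line in the shape used under webserver.environment in docker-compose.yml.
# The value is quoted because YAML would otherwise misread the leading braces.
printf '      PAPERLESS_FILENAME_FORMAT: "%s"\n' "$PAPERLESS_FILENAME_FORMAT"

# After changing the format on an existing archive, the document_renamer
# management command moves stored files to the new layout:
# docker compose exec webserver document_renamer
```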

Do not put the archive on fragile remote storage just because you can. If your VPS has fast local SSD, use it first. Move to more complex storage only when you actually understand the backup and restore trade-offs.

Permissions and ownership: the part people break in silence

Paperless is easy to break with sloppy volume permissions. The docs are very clear that the directories Paperless uses must exist and must be writable by the user running the service. On Docker-based installs, this becomes a host-side bind-mount responsibility.

If consumption is failing, documents are not picked up, or files stay stuck in the inbox path, permissions are one of the first things to verify.

cd /srv/paperless/compose
ls -lan /srv/paperless
ls -lan /srv/paperless/consume
ls -lan /srv/paperless/media
docker compose logs --tail=100 webserver
docker compose logs --tail=100 gotenberg
docker compose logs --tail=100 tika

Do not wait until after you have imported 800 files to discover that one directory was mounted incorrectly and Paperless has been half-working the whole time.
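
A small ownership check catches most of this before the first import. It is a sketch under one loud assumption: that the container writes as UID 1000 unless you set USERMAP_UID, so verify that against your own stack. check_dir is a hypothetical helper name.

```shell
# check_dir DIR UID -> prints a status line; exit status 0 when DIR exists and
# is owned by UID.
check_dir() {
  if [ ! -d "$1" ]; then
    echo "missing: $1"
    return 1
  fi
  owner=$(stat -c '%u' "$1")
  if [ "$owner" = "$2" ]; then
    echo "ok: $1 (uid $owner)"
  else
    echo "wrong owner: $1 is uid $owner, expected $2"
    return 1
  fi
}

# 1000 is assumed here as the UID the container runs as; adjust if you set USERMAP_UID.
paperless_uid=${USERMAP_UID:-1000}
for d in consume media data export; do
  check_dir "/srv/paperless/$d" "$paperless_uid" || true
done
```

If a directory reports the wrong owner, chown -R 1000:1000 on that path (with the UID you actually use) is the usual fix. Leave /srv/paperless/db alone, because the PostgreSQL container manages its own ownership.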

Backup strategy from day one

If the archive matters, backup planning is part of the install, not something you do once you have "time later." Paperless has two storage classes that matter most:

the database
the document media

Lose either one and the restore gets ugly. The basic rule is simple:

back up PostgreSQL regularly
back up /srv/paperless/media
back up /srv/paperless/data as well
test restore logic before you trust the system with irreplaceable documents
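
The list above can be sketched as a short script. Treat it as a starting point, not a finished backup system: the destination path is hypothetical, there is no rotation or off-site copy, and the service names, database user, and directories are the ones from the Compose file in this guide.

```shell
# backup_path DIR PREFIX STAMP EXT -> prints DIR/PREFIX-STAMP.EXT
backup_path() {
  echo "$1/$2-$3.$4"
}

# run_backup DEST -> dumps PostgreSQL and archives media and data into DEST.
# The body runs in a subshell so the cd does not leak into the calling shell.
run_backup() (
  dest=$1
  stamp=$(date +%Y%m%d-%H%M%S)
  mkdir -p "$dest" || exit 1
  cd /srv/paperless/compose || exit 1

  # Custom-format dump, restorable with pg_restore.
  docker compose exec -T db pg_dump -U paperless -d paperless -Fc \
    > "$(backup_path "$dest" db "$stamp" dump)" || exit 1

  # Plain tar of the document store and the application state.
  tar -C /srv/paperless -czf "$(backup_path "$dest" files "$stamp" tar.gz)" media data
)

# Example call (the destination is a hypothetical location; keep a copy off the VPS too):
# run_backup /var/backups/paperless
```

Paperless also ships a document_exporter management command, run as docker compose exec -T webserver document_exporter ../export, which writes documents plus metadata into the export directory. It is a useful second format to keep alongside the raw dump.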

If this is business paperwork, not just household scans, stop acting like "the VPS provider has snapshots" is a document retention plan. It is not.

What usually goes wrong on first Paperless installs

People deploy SQLite for a fresh install even though PostgreSQL is recommended.
They skip Tika and Gotenberg, then later wonder why Office files and email parsing are weak.
They dump all volumes into one vague path and regret it during backup or migration.
They use weak passwords or leave the secret key untouched.
They do not think about OCR language coverage until search quality disappoints them.
They expose the app publicly before the local stack is actually stable.

None of these are hard technical problems. They are planning failures. Which is exactly why they keep happening.

When a roomy LXC is enough and when a small KVM is the smarter call

A roomy LXC is enough when the install is clean, the storage is local, and you want efficient overhead for a self-hosted document archive. A small KVM is the smarter call when you care more about isolation, cleaner Docker behavior, or running this as part of a more serious stack that includes reverse proxy, backup agents, monitoring, and more deliberate change control.

Do not overthink this into paralysis. Paperless-ngx is not unusually hard to host. It just punishes sloppy filesystem and service layout once you start relying on it.

If you want a clean place to run it without improvising the infrastructure, use ServerSpan Virtual Servers. If you want the app but not the Linux chores around Docker, reverse proxy, backups, and hardening, use ServerSpan Linux Administration and stop pretending that you will enjoy debugging container logs at 11 PM.

The practical bottom line

The right Paperless-ngx install on a VPS is not the one that merely starts. It is the one that has the correct OCR languages, the optional services already in place for email and Office documents, a storage layout that still makes sense after growth, PostgreSQL from day one, and a VPS shape that matches how much isolation and control you actually need.

If you build it that way now, you avoid most of the rebuild triggers that make self-hosted document systems annoying later. If you build it carelessly, Paperless-ngx will still run. It will just become one more app you plan to "fix properly later."

Source & Attribution

This article is based on original data belonging to serverspan.com blog. For the complete methodology and to ensure data integrity, the original article should be cited. The canonical source is available at: Paperless-ngx 2.20 on a VPS: The Right OCR, Mail-Ingestion, and Storage Layout from Day One.