This page explains the internal mechanics of how BitBonsai distributes work across nodes. For setup instructions, see Multi-Node Setup.

Architecture Overview

BitBonsai uses a shared-database, direct-poll model for distributed encoding. Every node — main and child — connects directly to the same PostgreSQL database and polls the job queue independently.
┌─────────────────────────────────────────────────────────┐
│                     PostgreSQL (Main Node)               │
│                     jobs table, nodes table              │
└───────────────────┬─────────────────────────────────────┘
                    │ All nodes read/write directly
          ┌─────────┴──────────┐
          ▼                    ▼
   ┌─────────────┐      ┌─────────────┐
   │  Main Node  │      │  Child Node │
   │  NODE_MODE  │      │  NODE_MODE  │
   │    =MAIN    │      │   =LINKED   │
   └──────┬──────┘      └──────┬──────┘
          │                    │
          └────────┬───────────┘
                   ▼
          ┌─────────────────┐
          │   NFS Storage   │
          │  /mnt/videos    │
          │ (shared mount)  │
          └─────────────────┘
Key design decisions:
  • No central job dispatcher — nodes self-assign via optimistic locking
  • No message queue (no Redis, no RabbitMQ) — PostgreSQL is the queue
  • Child nodes need direct database access (DATABASE_URL) to poll jobs
  • MAIN_NODE_URL is used only for node registration and health reporting

Node Registration

When a child node starts with NODE_MODE=LINKED, it:
  1. Connects to PostgreSQL via DATABASE_URL
  2. POSTs its registration to MAIN_NODE_URL/nodes/register with:
    • Hostname / IP address
    • CONCURRENT_JOBS capacity
    • Hardware info (CPU cores, GPU presence)
  3. Main node inserts a row in the nodes table
  4. Child node appears in Settings → Nodes in the web UI
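The registration body from step 2 might be assembled as in the sketch below. The field names (`hostname`, `capacity`, `hardware`) and the GPU-presence check are illustrative assumptions, not the documented wire format:

```python
import json
import os
import socket

def build_registration_payload():
    """Assemble the body a LINKED node might POST to
    MAIN_NODE_URL/nodes/register. Field names are illustrative."""
    return {
        "hostname": socket.gethostname(),
        "capacity": int(os.environ.get("CONCURRENT_JOBS", "1")),
        "hardware": {
            "cpu_cores": os.cpu_count(),
            "gpu": os.path.exists("/dev/nvidia0"),  # crude GPU presence check
        },
    }

print(json.dumps(build_registration_payload(), indent=2))
```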
Child Node Startup Log:
  ✓ Connected to database at postgresql://192.168.1.100:5432/bitbonsai
  ✓ Registered as LINKED node with ID: node_abc123
  ✓ Capacity: 8 concurrent jobs
  ✓ Heartbeat started (every 30 seconds)
Child nodes send a heartbeat every 30 seconds. If the main node hasn’t received a heartbeat in 90 seconds, the node is marked OFFLINE and its active jobs are reset to QUEUED.
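The 30/90-second liveness rule can be sketched as a pair of small functions. This is a simplified model of the behavior described above, not BitBonsai's actual implementation:

```python
from datetime import datetime, timedelta

HEARTBEAT_INTERVAL = timedelta(seconds=30)
OFFLINE_AFTER = timedelta(seconds=90)  # three missed heartbeats

def node_status(last_heartbeat: datetime, now: datetime) -> str:
    """Classify a node from its last heartbeat timestamp."""
    return "OFFLINE" if now - last_heartbeat > OFFLINE_AFTER else "ONLINE"

def requeue_if_offline(jobs, node_id, status):
    """When a node goes OFFLINE, reset its active jobs to QUEUED
    so other nodes' polling loops can pick them up."""
    if status != "OFFLINE":
        return jobs
    return [
        {**j, "status": "QUEUED", "assigned_node": None}
        if j["assigned_node"] == node_id and j["status"] == "ENCODING"
        else j
        for j in jobs
    ]
```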

Job Distribution Mechanics

There is no central scheduler. Each node runs its own polling loop:
Every 5 seconds, each node:
  1. Counts its active jobs (status=ENCODING, assigned_node=me)
  2. If active < CONCURRENT_JOBS:
     a. SELECT * FROM jobs WHERE status='QUEUED'
        ORDER BY created_at ASC
        LIMIT (CONCURRENT_JOBS - active)
        FOR UPDATE SKIP LOCKED   ← prevents double-assignment
     b. UPDATE jobs SET status='ENCODING', assigned_node=me
        WHERE id IN (claimed job IDs)
     c. Starts one FFmpeg process per claimed job
The FOR UPDATE SKIP LOCKED clause is the key: PostgreSQL's row-level locking prevents two nodes from claiming the same job, so no distributed lock manager is needed.

Load balancing results:
Scenario                     Outcome
All nodes equally loaded     Jobs distributed round-robin by creation time
One node faster than others  Faster node picks up more jobs (polls more aggressively)
Node offline                 Its QUEUED jobs are immediately visible to other nodes
Node crashes mid-encode      Orphan recovery resets its ENCODING jobs to QUEUED on next startup
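The no-double-assignment property can be illustrated with an in-memory analogue of FOR UPDATE SKIP LOCKED: several worker threads claim jobs from a shared queue under a lock, and each job ends up claimed exactly once. This is a toy model of the mechanism, not BitBonsai's actual code:

```python
import threading

jobs = [{"id": i, "status": "QUEUED", "assigned_node": None} for i in range(20)]
queue_lock = threading.Lock()

def claim_jobs(node, capacity):
    """Analogue of SELECT ... FOR UPDATE SKIP LOCKED + UPDATE:
    atomically flip up to `capacity` QUEUED jobs to ENCODING."""
    claimed = []
    with queue_lock:  # PostgreSQL's row locks play this role
        for job in jobs:
            if len(claimed) >= capacity:
                break
            if job["status"] == "QUEUED":
                job["status"] = "ENCODING"
                job["assigned_node"] = node
                claimed.append(job["id"])
    return claimed

results = {}

def worker(node):
    got = []
    while True:
        batch = claim_jobs(node, capacity=2)
        if not batch:
            break
        got.extend(batch)
    results[node] = got

threads = [threading.Thread(target=worker, args=(f"node_{n}",)) for n in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

all_claimed = [jid for ids in results.values() for jid in ids]
print(sorted(all_claimed))  # every job id appears exactly once
```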

NFS: Why It’s Required

All nodes must mount the same NFS share at the same path (/media inside the container). Here’s why:
  1. Job assignment stores the file path — e.g., /media/Movies/Film.mkv
  2. FFmpeg runs on the assigned node — it opens that exact path
  3. Output is written back to the same path (encoded file replaces original)
If paths differ between nodes, FFmpeg on a child node will fail with ENOENT (no such file or directory) even though the file exists on the main node.
✓ Correct:
  Main node:  /mnt/videos:/media   (NFS: nas:/mnt/user/videos → /mnt/videos)
  Child node: /mnt/videos:/media   (NFS: nas:/mnt/user/videos → /mnt/videos)
  Job path: /media/Movies/Film.mkv → works on any node

✗ Wrong:
  Main node:  /mnt/movies:/media
  Child node: /mnt/media:/media
  Job path: /media/Movies/Film.mkv → ENOENT on child (different host path)
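A toy path translation makes the failure mode concrete: with identical container paths but different host mappings, the same job path resolves to two different physical locations. The `host_path` helper and `mount_map` shape below are illustrative, not part of BitBonsai:

```python
def host_path(job_path, mount_map):
    """Translate a container job path to this node's host path using
    its volume mapping {host_dir: container_dir} (illustrative)."""
    for host, container in mount_map.items():
        if job_path.startswith(container + "/"):
            return host + job_path[len(container):]
    raise FileNotFoundError(job_path)  # ENOENT analogue

# Wrong setup from the example above: same container path, different host dirs
main = host_path("/media/Movies/Film.mkv", {"/mnt/movies": "/media"})
child = host_path("/media/Movies/Film.mkv", {"/mnt/media": "/media"})
print(main, child)  # two different physical files for one job path
```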

TRANSFERRING Status

When a job enters TRANSFERRING, the assigned node is verifying that the source file is accessible on its local NFS mount before starting FFmpeg. This is a brief file-existence check, not a copy operation.

Flow:
Node assigned job → check /media/Movies/Film.mkv exists
  → File accessible → status: ENCODING (immediate)
  → File not yet accessible → retry 10x with 2s delay → ENCODING or FAILED
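The retry loop above might look like the following sketch; the `exists` and `sleep` parameters are injectable here only so the example runs without a real NFS mount:

```python
import os
import time

def wait_for_file(path, attempts=10, delay=2.0,
                  exists=os.path.exists, sleep=time.sleep):
    """Poll for the source file before encoding, mirroring the
    TRANSFERRING check: up to `attempts` tries, `delay` seconds apart."""
    for attempt in range(attempts):
        if exists(path):
            return "ENCODING"
        if attempt < attempts - 1:
            sleep(delay)
    return "FAILED"
```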
If you see jobs stuck in TRANSFERRING for more than 30 seconds, the NFS mount on that node is degraded. Check:
# On the affected child node
docker compose exec backend ls -la /media/Movies/

Capacity Planning

Use this formula to determine how many nodes you need:
Required nodes = ceil(target_throughput_fps / per_node_fps)

Example:
  Target: encode 1000 files/day × 5,400 frames/file ÷ 86,400s = 62.5 fps needed
  Per node (NVIDIA GPU, HEVC): ~60 fps
  Required: ceil(62.5 / 60) = 2 nodes
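The same arithmetic as a small helper (a sketch of the formula above, assuming throughput is measured in average frames per second):

```python
import math

def required_nodes(files_per_day, frames_per_file, per_node_fps):
    """Nodes needed to sustain a target daily encoding throughput."""
    target_fps = files_per_day * frames_per_file / 86_400  # seconds per day
    return math.ceil(target_fps / per_node_fps)

print(required_nodes(1000, 5400, 60))  # → 2
```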
Node sizing guidelines:
Library Size    Recommended Setup
< 500 GB        Single node (main only)
500 GB – 5 TB   1 main + 1–2 child nodes
5–20 TB         1 main + 3–5 child nodes (GPU recommended)
> 20 TB         1 main + 6+ child nodes, 10GbE NFS