This page explains the internal mechanics of how BitBonsai distributes work across nodes. For setup instructions, see Multi-Node Setup.

Architecture Overview

BitBonsai uses a shared-database, direct-poll model for distributed encoding. Every node — main and child — connects directly to the same PostgreSQL database and polls the job queue independently.
┌─────────────────────────────────────────────────────────┐
│                     PostgreSQL (Main Node)               │
│                     jobs table, nodes table              │
└───────────────────┬─────────────────────────────────────┘
                    │ All nodes read/write directly
          ┌─────────┴──────────┐
          ▼                    ▼
   ┌─────────────┐      ┌─────────────┐
   │  Main Node  │      │  Child Node │
   │  NODE_MODE  │      │  NODE_MODE  │
   │    =MAIN    │      │   =LINKED   │
   └──────┬──────┘      └──────┬──────┘
          │                    │
          └────────┬───────────┘
                   ▼
          ┌─────────────────┐
          │   NFS Storage   │
          │  /mnt/videos    │
          │ (shared mount)  │
          └─────────────────┘
Key design decisions:
  • No central job dispatcher — nodes self-assign via optimistic locking
  • No message queue (no Redis, no RabbitMQ) — PostgreSQL is the queue
  • Child nodes need direct database access (DATABASE_URL) to poll jobs
  • MAIN_NODE_URL is used only for node registration and health reporting

Node Registration

When a child node starts with NODE_MODE=LINKED, it:
  1. Connects to PostgreSQL via DATABASE_URL
  2. POSTs its registration to MAIN_NODE_URL/nodes/register with:
    • Hostname / IP address
    • CONCURRENT_JOBS capacity
    • Hardware info (CPU cores, GPU presence)
  3. Main node inserts a row in the nodes table
  4. Child node appears in Settings → Nodes in the web UI
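The registration body from step 2 might be assembled as in the sketch below. The field names (`hostname`, `capacity`, `hardware`) and the GPU-presence check are illustrative assumptions, not the documented wire format:

```python
import json
import os
import socket

def build_registration_payload():
    """Assemble the body a LINKED node might POST to
    MAIN_NODE_URL/nodes/register. Field names are illustrative."""
    return {
        "hostname": socket.gethostname(),
        "capacity": int(os.environ.get("CONCURRENT_JOBS", "1")),
        "hardware": {
            "cpu_cores": os.cpu_count(),
            "gpu": os.path.exists("/dev/nvidia0"),  # crude GPU presence check
        },
    }

print(json.dumps(build_registration_payload(), indent=2))
```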
Child Node Startup Log:
  ✓ Connected to database at postgresql://192.168.1.100:5432/bitbonsai
  ✓ Registered as LINKED node with ID: node_abc123
  ✓ Capacity: 8 concurrent jobs
  ✓ Heartbeat started (every 30 seconds)
Child nodes send a heartbeat every 30 seconds. If the main node hasn’t received a heartbeat in 90 seconds, the node is marked OFFLINE and its active jobs are reset to QUEUED.
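The 30/90-second liveness rule can be sketched as a pair of small functions. This is a simplified model of the behavior described above, not BitBonsai's actual implementation:

```python
from datetime import datetime, timedelta

HEARTBEAT_INTERVAL = timedelta(seconds=30)
OFFLINE_AFTER = timedelta(seconds=90)  # three missed heartbeats

def node_status(last_heartbeat: datetime, now: datetime) -> str:
    """Classify a node from its last heartbeat timestamp."""
    return "OFFLINE" if now - last_heartbeat > OFFLINE_AFTER else "ONLINE"

def requeue_if_offline(jobs, node_id, status):
    """When a node goes OFFLINE, reset its active jobs to QUEUED
    so other nodes' polling loops can pick them up."""
    if status != "OFFLINE":
        return jobs
    return [
        {**j, "status": "QUEUED", "assigned_node": None}
        if j["assigned_node"] == node_id and j["status"] == "ENCODING"
        else j
        for j in jobs
    ]
```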

Job Distribution Mechanics

There is no central scheduler. Each node runs its own polling loop:
Every 5 seconds, each node:
  1. Counts its active jobs (status=ENCODING, assigned_node=me)
  2. If active < CONCURRENT_JOBS:
     a. SELECT * FROM jobs WHERE status='QUEUED'
        ORDER BY created_at ASC
        LIMIT (CONCURRENT_JOBS - active)
        FOR UPDATE SKIP LOCKED   ← prevents double-assignment
     b. UPDATE jobs SET status='ENCODING', assigned_node=me
        WHERE id IN (claimed job IDs)
     c. Starts one FFmpeg process per claimed job
The FOR UPDATE SKIP LOCKED clause is the key: PostgreSQL's row-level locking prevents two nodes from claiming the same job, so no distributed lock manager is needed.

Load balancing results:
Scenario                     Outcome
All nodes equally loaded     Jobs distributed round-robin by creation time
One node faster than others  Faster node picks up more jobs (polls more aggressively)
Node offline                 Its QUEUED jobs are immediately visible to other nodes
Node crashes mid-encode      Orphan recovery resets its ENCODING jobs to QUEUED on next startup
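The no-double-assignment property can be illustrated with an in-memory analogue of FOR UPDATE SKIP LOCKED: several worker threads claim jobs from a shared queue under a lock, and each job ends up claimed exactly once. This is a toy model of the mechanism, not BitBonsai's actual code:

```python
import threading

jobs = [{"id": i, "status": "QUEUED", "assigned_node": None} for i in range(20)]
queue_lock = threading.Lock()

def claim_jobs(node, capacity):
    """Analogue of SELECT ... FOR UPDATE SKIP LOCKED + UPDATE:
    atomically flip up to `capacity` QUEUED jobs to ENCODING."""
    claimed = []
    with queue_lock:  # PostgreSQL's row locks play this role
        for job in jobs:
            if len(claimed) >= capacity:
                break
            if job["status"] == "QUEUED":
                job["status"] = "ENCODING"
                job["assigned_node"] = node
                claimed.append(job["id"])
    return claimed

results = {}

def worker(node):
    got = []
    while True:
        batch = claim_jobs(node, capacity=2)
        if not batch:
            break
        got.extend(batch)
    results[node] = got

threads = [threading.Thread(target=worker, args=(f"node_{n}",)) for n in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

all_claimed = [jid for ids in results.values() for jid in ids]
print(sorted(all_claimed))  # every job id appears exactly once
```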

NFS: Why It’s Required

All nodes must mount the same NFS share at the same path (/media inside the container). Here’s why:
  1. Job assignment stores the file path — e.g., /media/Movies/Film.mkv
  2. FFmpeg runs on the assigned node — it opens that exact path
  3. Output is written back to the same path (encoded file replaces original)
If paths differ between nodes, FFmpeg on a child node will fail with ENOENT (no such file or directory) even though the file exists on the main node.
✓ Correct:
  Main node:  /mnt/videos:/media   (NFS: nas:/mnt/user/videos → /mnt/videos)
  Child node: /mnt/videos:/media   (NFS: nas:/mnt/user/videos → /mnt/videos)
  Job path: /media/Movies/Film.mkv → works on any node

✗ Wrong:
  Main node:  /mnt/movies:/media
  Child node: /mnt/media:/media
  Job path: /media/Movies/Film.mkv → ENOENT on child (different host path)
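A toy path translation makes the failure mode concrete: with identical container paths but different host mappings, the same job path resolves to two different physical locations. The `host_path` helper and `mount_map` shape below are illustrative, not part of BitBonsai:

```python
def host_path(job_path, mount_map):
    """Translate a container job path to this node's host path using
    its volume mapping {host_dir: container_dir} (illustrative)."""
    for host, container in mount_map.items():
        if job_path.startswith(container + "/"):
            return host + job_path[len(container):]
    raise FileNotFoundError(job_path)  # ENOENT analogue

# Wrong setup from the example above: same container path, different host dirs
main = host_path("/media/Movies/Film.mkv", {"/mnt/movies": "/media"})
child = host_path("/media/Movies/Film.mkv", {"/mnt/media": "/media"})
print(main, child)  # two different physical files for one job path
```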

TRANSFERRING Status

When a job enters TRANSFERRING, the assigned node is verifying that the source file is accessible on its local NFS mount before starting FFmpeg. This is a brief file-existence check, not a copy operation.

Flow:
Node assigned job → check /media/Movies/Film.mkv exists
  → File accessible → status: ENCODING (immediate)
  → File not yet accessible → retry 10x with 2s delay → ENCODING or FAILED
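The retry loop above might look like the following sketch; the `exists` and `sleep` parameters are injectable here only so the example runs without a real NFS mount:

```python
import os
import time

def wait_for_file(path, attempts=10, delay=2.0,
                  exists=os.path.exists, sleep=time.sleep):
    """Poll for the source file before encoding, mirroring the
    TRANSFERRING check: up to `attempts` tries, `delay` seconds apart."""
    for attempt in range(attempts):
        if exists(path):
            return "ENCODING"
        if attempt < attempts - 1:
            sleep(delay)
    return "FAILED"
```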
If you see jobs stuck in TRANSFERRING for more than 30 seconds, the NFS mount on that node is degraded. Check:
# On the affected child node
docker compose exec backend ls -la /media/Movies/

Capacity Planning

Use this formula to determine how many nodes you need:
Required nodes = ceil(target_throughput_fps / per_node_fps)

Example:
  Target: encode 1000 files/day × 5,400 frames/file ÷ 86,400s = 62.5 fps needed
  Per node (NVIDIA GPU, HEVC): ~60 fps
  Required: ceil(62.5 / 60) = 2 nodes
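The same arithmetic as a small helper (a sketch of the formula above, assuming throughput is measured in average frames per second):

```python
import math

def required_nodes(files_per_day, frames_per_file, per_node_fps):
    """Nodes needed to sustain a target daily encoding throughput."""
    target_fps = files_per_day * frames_per_file / 86_400  # seconds per day
    return math.ceil(target_fps / per_node_fps)

print(required_nodes(1000, 5400, 60))  # → 2
```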
Node sizing guidelines:
Library Size    Recommended Setup
< 500 GB        Single node (main only)
500 GB – 5 TB   1 main + 1–2 child nodes
5–20 TB         1 main + 3–5 child nodes (GPU recommended)
> 20 TB         1 main + 6+ child nodes, 10GbE NFS