Docker GPU Build on v25 fails on startup: FileNotFoundError in transformers / torch distributed tmp directory #1155

@BoBBer446

Description

When running the v25 branch in Docker with GPU support, the container immediately exits on startup with a FileNotFoundError. The error originates from torch.distributed.nn.jit.instantiator when creating a temporary directory under /app/tmp/….

Creating /app/tmp manually (both on the host and inside the container) does not resolve the issue.


Steps to reproduce

  1. Clean environment and clone the repo:
docker system prune -a
cd ..
rm -rf ebook2audiobook/
git clone -b v25 https://github.com/DrewThomasson/ebook2audiobook.git
cd ebook2audiobook
  2. Use the following docker-compose.yml (GPU enabled, building locally):
x-gpu-enabled: &gpu-enabled
  devices:
    - driver: nvidia
      count: all
      capabilities:
        - gpu

x-gpu-disabled: &gpu-disabled
  devices: []

services:
  ebook2audiobook:
    build:
      context: .
      args:
        TORCH_VERSION: cuda128   # Available tags: [cuda121, cuda118, cuda128, rocm, xpu, cpu]
        SKIP_XTTS_TEST: "true"
    entrypoint: ["python", "app.py", "--script_mode", "full_docker"]
    command: []
    tty: true
    stdin_open: true
    ports:
      - 7860:7860
    deploy:
      resources:
        reservations:
          <<: *gpu-enabled
        limits: {}
    volumes:
      - ./:/app
  3. Build and start:
docker compose up -d
docker compose logs -f

Actual behavior

The container prints:

v25.11.11 full_docker mode
Traceback (most recent call last):
  File "/app/app.py", line 495, in <module>
    main()
  File "/app/app.py", line 378, in main
    import lib.functions as f
  File "/app/lib/functions.py", line 48, in <module>
    from lib.classes.voice_extractor import VoiceExtractor
  File "/app/lib/classes/voice_extractor.py", line 17, in <module>
    from lib.classes.background_detector import BackgroundDetector
  File "/app/lib/classes/background_detector.py", line 5, in <module>
    from pyannote.audio import Model
  File "/usr/local/lib/python3.12/site-packages/pyannote/audio/__init__.py", line 29, in <module>
    from .core.inference import Inference
  File "/usr/local/lib/python3.12/site-packages/pyannote/audio/core/inference.py", line 33, in <module>
    from pytorch_lightning.utilities.memory import is_oom_error
  File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/__init__.py", line 25, in <module>
    from lightning_fabric.utilities.seed import seed_everything  # noqa: E402
  File "/usr/local/lib/python3.12/site-packages/lightning_fabric/__init__.py", line 35, in <module>
    from lightning_fabric.fabric import Fabric  # noqa: E402
  File "/usr/local/lib/python3.12/site-packages/lightning_fabric/fabric.py", line 38, in <module>
    from lightning_fabric.accelerators.accelerator import Accelerator
  File "/usr/local/lib/python3.12/site-packages/lightning_fabric/accelerators/__init__.py", line 15, in <module>
    from lightning_fabric.accelerators.accelerator import Accelerator
  File "/usr/local/lib/python3.12/site-packages/lightning_fabric/accelerators/accelerator.py", line 19, in <module>
    from lightning_fabric.accelerators.registry import _AcceleratorRegistry
  File "/usr/local/lib/python3.12/site-packages/lightning_fabric/accelerators/registry.py", line 18, in <module>
    from lightning_fabric.utilities.exceptions import MisconfigurationException
  File "/usr/local/lib/python3.12/site-packages/lightning_fabric/utilities/__init__.py", line 16, in <module>
    from lightning_fabric.utilities.apply_func import move_data_to_device
  File "/usr/local/lib/python3.12/site-packages/lightning_fabric/utilities/apply_func.py", line 24, in <module>
    from lightning_fabric.utilities.imports import _NUMPY_AVAILABLE
  File "/usr/local/lib/python3.12/site-packages/lightning_fabric/utilities/imports.py", line 39, in <module>
    _TORCHMETRICS_GREATER_EQUAL_1_0_0 = compare_version("torchmetrics", operator.ge, "1.0.0")
  File "/usr/local/lib/python3.12/site-packages/lightning_utilities/core/imports.py", line 78, in compare_version
    pkg = importlib.import_module(package)
  File "/usr/local/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/usr/local/lib/python3.12/site-packages/torchmetrics/__init__.py", line 37, in <module>
    from torchmetrics import functional  # noqa: E402
  File "/usr/local/lib/python3.12/site-packages/torchmetrics/functional/__init__.py", line 129, in <module>
    from torchmetrics.functional.text._deprecated import _bleu_score as bleu_score
  File "/usr/local/lib/python3.12/site-packages/torchmetrics/functional/text/__init__.py", line 50, in <module>
    from torchmetrics.functional.text.bert import bert_score
  File "/usr/local/lib/python3.12/site-packages/torchmetrics/functional/text/bert.py", line 56, in <module>
    from transformers import AutoModel, AutoTokenizer
  File "/usr/local/lib/python3.12/site-packages/transformers/generation/utils.py", line 48, in <module>
    from ..masking_utils import create_masks_for_generate
  File "/usr/local/lib/python3.12/site-packages/transformers/masking_utils.py", line 29, in <module>
    from torch.nn.attention.flex_attention import _DEFAULT_SPARSE_BLOCK_SIZE as flex_default_block_size  # noqa: N811
  File "/usr/local/lib/python3.12/site-packages/torch/nn/attention/flex_attention.py", line 15, in <module>
    from torch._dynamo._trace_wrapped_higher_order_op import TransformGetItemToIndex
  File "/usr/local/lib/python3.12/site-packages/torch/distributed/nn/jit/instantiator.py", line 21, in <module>
    _TEMP_DIR = tempfile.TemporaryDirectory()
  File "/usr/local/lib/python3.12/tempfile.py", line 886, in __init__
    self.name = mkdtemp(suffix, prefix, dir)
  File "/usr/local/lib/python3.12/tempfile.py", line 384, in mkdtemp
    _os.mkdir(file, 0o700)
FileNotFoundError: [Errno 2] No such file or directory: '/app/tmp/tmpgbtrn3qt'

The container exits with code 1.
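The failure is reproducible outside the project: Python's tempfile raises the same error whenever the directory it is asked to create a temp dir in does not exist. A minimal sketch (using a hypothetical missing path `/nonexistent/tmp` in place of `/app/tmp`):

```python
import tempfile

# mkdtemp() calls os.mkdir() on <dir>/tmpXXXXXXXX; if the parent
# directory is missing, it fails with the same FileNotFoundError
# seen in the traceback above.
try:
    tempfile.mkdtemp(dir="/nonexistent/tmp")  # hypothetical stand-in for /app/tmp
except FileNotFoundError as exc:
    print(exc)
```

This is why the question below about how TMPDIR ends up pointing at /app/tmp matters: torch.distributed creates its temp dir via `tempfile.TemporaryDirectory()`, which resolves its parent from the temp-related environment variables.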

Environment

  • Branch: v25
  • Mode: full_docker
  • Docker: Docker Compose (v2)
  • Base image: python:3.12
  • Build args: TORCH_VERSION=cuda128, SKIP_XTTS_TEST=true
  • GPU: NVIDIA (NVIDIA Container Toolkit installed)
  • Host OS: Linux (WSL2-based environment)

Question

Could you please check whether:

  • TMPDIR or any other temp-related environment variable is set to /app/tmp in the Dockerfile or the code, and
  • there is a recommended temp-directory configuration for GPU / PyTorch / transformers in this project?

If you want, I can also test a patch that forces TMPDIR=/tmp or similar in the container entrypoint.
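For reference, the override I have in mind could be as small as the following compose snippet; this is only a sketch of the workaround, not a tested fix:

```yaml
services:
  ebook2audiobook:
    environment:
      # Point Python's tempfile module at the container's real /tmp
      # instead of a path under the bind-mounted /app.
      - TMPDIR=/tmp
```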
