Docker GPU Build on v25 fails on startup: FileNotFoundError in transformers / torch distributed tmp directory #1155

@BoBBer446

Description

When running the v25 branch in Docker with GPU support, the container immediately exits on startup with a FileNotFoundError. The error originates from torch.distributed.nn.jit.instantiator when creating a temporary directory under /app/tmp/….

Creating /app/tmp manually (both on the host and inside the container) does not resolve the issue.


Steps to reproduce

  1. Clean environment and clone the repo:
docker system prune -a
cd ..
rm -rf ebook2audiobook/
git clone -b v25 https://github.com/DrewThomasson/ebook2audiobook.git
cd ebook2audiobook
  2. Use the following docker-compose.yml (GPU enabled, building locally):
x-gpu-enabled: &gpu-enabled
  devices:
    - driver: nvidia
      count: all
      capabilities:
        - gpu

x-gpu-disabled: &gpu-disabled
  devices: []

services:
  ebook2audiobook:
    build:
      context: .
      args:
        TORCH_VERSION: cuda128   # Available tags: [cuda121, cuda118, cuda128, rocm, xpu, cpu]
        SKIP_XTTS_TEST: "true"
    entrypoint: ["python", "app.py", "--script_mode", "full_docker"]
    command: []
    tty: true
    stdin_open: true
    ports:
      - 7860:7860
    deploy:
      resources:
        reservations:
          <<: *gpu-enabled
        limits: {}
    volumes:
      - ./:/app
  3. Build and start:
docker compose up -d
docker compose logs -f

Actual behavior

The container prints:

v25.11.11 full_docker mode
Traceback (most recent call last):
  File "/app/app.py", line 495, in <module>
    main()
  File "/app/app.py", line 378, in main
    import lib.functions as f
  File "/app/lib/functions.py", line 48, in <module>
    from lib.classes.voice_extractor import VoiceExtractor
  File "/app/lib/classes/voice_extractor.py", line 17, in <module>
    from lib.classes.background_detector import BackgroundDetector
  File "/app/lib/classes/background_detector.py", line 5, in <module>
    from pyannote.audio import Model
  File "/usr/local/lib/python3.12/site-packages/pyannote/audio/__init__.py", line 29, in <module>
    from .core.inference import Inference
  File "/usr/local/lib/python3.12/site-packages/pyannote/audio/core/inference.py", line 33, in <module>
    from pytorch_lightning.utilities.memory import is_oom_error
  File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/__init__.py", line 25, in <module>
    from lightning_fabric.utilities.seed import seed_everything  # noqa: E402
  File "/usr/local/lib/python3.12/site-packages/lightning_fabric/__init__.py", line 35, in <module>
    from lightning_fabric.fabric import Fabric  # noqa: E402
  File "/usr/local/lib/python3.12/site-packages/lightning_fabric/fabric.py", line 38, in <module>
    from lightning_fabric.accelerators.accelerator import Accelerator
  File "/usr/local/lib/python3.12/site-packages/lightning_fabric/accelerators/__init__.py", line 15, in <module>
    from lightning_fabric.accelerators.accelerator import Accelerator
  File "/usr/local/lib/python3.12/site-packages/lightning_fabric/accelerators/accelerator.py", line 19, in <module>
    from lightning_fabric.accelerators.registry import _AcceleratorRegistry
  File "/usr/local/lib/python3.12/site-packages/lightning_fabric/accelerators/registry.py", line 18, in <module>
    from lightning_fabric.utilities.exceptions import MisconfigurationException
  File "/usr/local/lib/python3.12/site-packages/lightning_fabric/utilities/__init__.py", line 16, in <module>
    from lightning_fabric.utilities.apply_func import move_data_to_device
  File "/usr/local/lib/python3.12/site-packages/lightning_fabric/utilities/apply_func.py", line 24, in <module>
    from lightning_fabric.utilities.imports import _NUMPY_AVAILABLE
  File "/usr/local/lib/python3.12/site-packages/lightning_fabric/utilities/imports.py", line 39, in <module>
    _TORCHMETRICS_GREATER_EQUAL_1_0_0 = compare_version("torchmetrics", operator.ge, "1.0.0")
  File "/usr/local/lib/python3.12/site-packages/lightning_utilities/core/imports.py", line 78, in compare_version
    pkg = importlib.import_module(package)
  File "/usr/local/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/usr/local/lib/python3.12/site-packages/torchmetrics/__init__.py", line 37, in <module>
    from torchmetrics import functional  # noqa: E402
  File "/usr/local/lib/python3.12/site-packages/torchmetrics/functional/__init__.py", line 129, in <module>
    from torchmetrics.functional.text._deprecated import _bleu_score as bleu_score
  File "/usr/local/lib/python3.12/site-packages/torchmetrics/functional/text/__init__.py", line 50, in <module>
    from torchmetrics.functional.text.bert import bert_score
  File "/usr/local/lib/python3.12/site-packages/torchmetrics/functional/text/bert.py", line 56, in <module>
    from transformers import AutoModel, AutoTokenizer
  File "/usr/local/lib/python3.12/site-packages/transformers/generation/utils.py", line 48, in <module>
    from ..masking_utils import create_masks_for_generate
  File "/usr/local/lib/python3.12/site-packages/transformers/masking_utils.py", line 29, in <module>
    from torch.nn.attention.flex_attention import _DEFAULT_SPARSE_BLOCK_SIZE as flex_default_block_size  # noqa: N811
  File "/usr/local/lib/python3.12/site-packages/torch/nn/attention/flex_attention.py", line 15, in <module>
    from torch._dynamo._trace_wrapped_higher_order_op import TransformGetItemToIndex
  File "/usr/local/lib/python3.12/site-packages/torch/distributed/nn/jit/instantiator.py", line 21, in <module>
    _TEMP_DIR = tempfile.TemporaryDirectory()
  File "/usr/local/lib/python3.12/tempfile.py", line 886, in __init__
    self.name = mkdtemp(suffix, prefix, dir)
  File "/usr/local/lib/python3.12/tempfile.py", line 384, in mkdtemp
    _os.mkdir(file, 0o700)
FileNotFoundError: [Errno 2] No such file or directory: '/app/tmp/tmpgbtrn3qt'

The container exits with code 1.
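The failure is reproducible outside the project: Python's tempfile raises the same error whenever the directory it is asked to create a temp dir in does not exist. A minimal sketch (using a hypothetical missing path `/nonexistent/tmp` in place of `/app/tmp`):

```python
import tempfile

# mkdtemp() calls os.mkdir() on <dir>/tmpXXXXXXXX; if the parent
# directory is missing, it fails with the same FileNotFoundError
# seen in the traceback above.
try:
    tempfile.mkdtemp(dir="/nonexistent/tmp")  # hypothetical stand-in for /app/tmp
except FileNotFoundError as exc:
    print(exc)
```

This is why the question below about how TMPDIR ends up pointing at /app/tmp matters: torch.distributed creates its temp dir via `tempfile.TemporaryDirectory()`, which resolves its parent from the temp-related environment variables.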

Environment

  • Branch: v25
  • Mode: full_docker
  • Docker: Docker Compose (v2)
  • Base image: python:3.12
  • Build args: TORCH_VERSION=cuda128, SKIP_XTTS_TEST=true
  • GPU: NVIDIA (NVIDIA Container Toolkit installed)
  • Host OS: Linux (WSL2-based environment)

Question

Could you please check whether:

  • TMPDIR or any other temp-related environment variable is set to /app/tmp in the Dockerfile or the code, and
  • there is a recommended temp-directory configuration for GPU / PyTorch / transformers in this project?

If you want, I can also test a patch that forces TMPDIR=/tmp or similar in the container entrypoint.
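For reference, the override I have in mind could be as small as the following compose snippet; this is only a sketch of the workaround, not a tested fix:

```yaml
services:
  ebook2audiobook:
    environment:
      # Point Python's tempfile module at the container's real /tmp
      # instead of a path under the bind-mounted /app.
      - TMPDIR=/tmp
```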
