Skip to content

Conversation

@ada-ggf25
Copy link

@ada-ggf25 ada-ggf25 commented Nov 29, 2025

Fix: Auto-create TMPDIR directory when it doesn't exist

Fixes #7877

Description

This PR fixes issue #7877 by implementing automatic creation of the TMPDIR directory when it is set but doesn't exist. Previously, tempfile.mkdtemp() would silently ignore TMPDIR and fall back to /tmp if the specified directory didn't exist, causing confusion for users experiencing "No space left on device" errors.

Problem

When users set TMPDIR to a non-existent directory (e.g., export TMPDIR=/some/big/storage), Python's tempfile module silently ignores it and falls back to the default temporary directory (/tmp). This leads to:

  • Users unable to use their specified temporary directory
  • Silent failures that are difficult to debug
  • Continued "No space left on device" errors even after setting TMPDIR

Solution

The fix automatically creates the TMPDIR directory if it is set but doesn't exist, ensuring that:

  1. Users' TMPDIR settings are respected
  2. Clear logging is provided when the directory is created
  3. Graceful fallback with warnings if directory creation fails
  4. The fix works even if tempfile.gettempdir() was already called and cached

Changes

Implementation (src/datasets/fingerprint.py)

  • Modified _TempCacheDir.__init__() to check if TMPDIR environment variable is set
  • Added logic to auto-create the directory if it doesn't exist using os.makedirs()
  • Added informative logging when directory is created
  • Added warning logging when directory creation fails, with graceful fallback
  • Added path normalisation to handle path resolution issues
  • Explicitly pass dir parameter to tempfile.mkdtemp() to ensure TMPDIR is respected

Tests (tests/test_fingerprint.py)

Added comprehensive test coverage:

  • test_temp_cache_dir_with_tmpdir_nonexistent: Tests auto-creation of non-existent TMPDIR
  • test_temp_cache_dir_with_tmpdir_existing: Tests behaviour when TMPDIR already exists
  • test_temp_cache_dir_without_tmpdir: Tests default behaviour when TMPDIR is not set
  • test_temp_cache_dir_tmpdir_creation_failure: Tests error handling when directory creation fails

Also fixed incomplete test_fingerprint_in_multiprocessing test that was missing implementation.

Testing

  • All existing tests pass
  • New tests added for TMPDIR handling scenarios
  • Code formatted with make style
  • No linter errors
  • Manual testing confirms the fix works as expected

Example Usage

Before this fix:

$ export TMPDIR='/tmp/username'  # Directory doesn't exist
$ python -c "import tempfile; print(tempfile.gettempdir())"
/tmp  # Silently falls back, ignoring TMPDIR

After this fix:

$ export TMPDIR='/tmp/username'  # Directory doesn't exist
$ python -c "from datasets.fingerprint import get_temporary_cache_files_directory; print(get_temporary_cache_files_directory())"
# Directory is automatically created and used
# Log: "Created TMPDIR directory: /tmp/username"

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Note: This implementation follows the approach suggested in the issue, automatically creating the TMPDIR directory when it doesn't exist, which provides the best user experience whilst maintaining security (we only create directories explicitly specified by the user via environment variable).

…mpCacheDir

Enhanced the _TempCacheDir.__init__ method to properly respect and handle
the TMPDIR environment variable when creating temporary cache directories.

Changes:
- Add TMPDIR environment variable detection and validation
- Normalise paths to handle path resolution issues
- Auto-create TMPDIR directory if it doesn't exist to prevent silent
  fallback to default temporary directory
- Validate that TMPDIR is actually a directory before use
- Explicitly pass directory to mkdtemp to ensure TMPDIR is respected even
  if tempfile.gettempdir() was already called and cached
- Add appropriate logging for directory creation and fallback scenarios

This ensures that when TMPDIR is set, the temporary cache files are
created in the specified directory rather than silently falling back to
the system default temporary directory.
…mpCacheDir

Add test coverage for the improved TMPDIR environment variable handling
in the _TempCacheDir class. These tests verify the various scenarios
for TMPDIR usage and error handling.

Changes:
- Refactor test_fingerprint_in_multiprocessing to use Pool.map for
  cleaner test implementation
- Add test_temp_cache_dir_with_tmpdir_nonexistent to verify TMPDIR
  auto-creation when directory doesn't exist
- Add test_temp_cache_dir_with_tmpdir_existing to verify correct
  behaviour when TMPDIR exists and is valid
- Add test_temp_cache_dir_without_tmpdir to verify fallback to default
  temporary directory when TMPDIR is not set
- Add test_temp_cache_dir_tmpdir_creation_failure to verify graceful
  error handling and fallback when TMPDIR creation fails

These tests ensure that the TMPDIR improvements work correctly across
all scenarios and edge cases, including proper logging and fallback
behaviour.
Copy link
Contributor

@stas00 stas00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for implementing this, @ada-ggf25

I see there are a few more uses of tempfile in datasets besides the one treated here - should they be treated in the same way as well?

os.makedirs(tmpdir, exist_ok=True)
logger.info(f"Created TMPDIR directory: {tmpdir}")
except OSError as e:
logger.warning(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if TMPDIR is considered as part of the API then this and the one below ideally should re-raise an exception, while giving additional context.

Warnings are very easy to miss in complex applications where there are already dozens of warnings multiplied by multiple gpu processes.

Copy link
Author

@ada-ggf25 ada-ggf25 Dec 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if TMPDIR is considered as part of the API then this and the one below ideally should re-raise an exception, while giving additional context.

Warnings are very easy to miss in complex applications where there are already dozens of warnings multiplied by multiple gpu processes.

I agree that if we treat TMPDIR as part of the API, a silent warning is too easy to miss in complex multi‑GPU runs. I’ll update _TempCacheDir.__init__ so that when TMPDIR is set but cannot be created/used, the code re‑raises an OSError with a clearer, actionable message (e.g. “TMPDIR is set to ‘…’ but the directory could not be created, please create it manually or unset TMPDIR”). I’ll also adjust the corresponding test to expect this behaviour.

I’ll open a PR update with those changes, please let me know if you’d prefer a different error type or wording.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! You can check the new PR here #7891. I hope it helps!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused - why was there a need for a new PR? Just in case the maintainers would prefer the original instead?

@ada-ggf25
Copy link
Author

ada-ggf25 commented Dec 1, 2025

Thank you for implementing this, @ada-ggf25

I see there are a few more uses of tempfile in datasets besides the one treated here - should they be treated in the same way as well?

I had a look through the other tempfile call sites:

  • In arrow_dataset.py the code always pass an explicit dir=cache_dir when using NamedTemporaryFile, so TMPDIR is not involved there.
  • In search.py the NamedTemporaryFile is only used to generate a random Elasticsearch index name, it doesn’t control where big cache files are written, so it doesn’t have the same “No space left on device” impact.

Given that, the fingerprint temp directory is the only place where respecting TMPDIR is part of the user‑visible API for disk usage. If you’d like, I’m happy to also thread TMPDIR through the search.py name generation, but I think this PR should keep scoped to the place that actually writes large temporary files.

Refine TMPDIR-related failure tests for _TempCacheDir to assert explicit
error conditions instead of fallback behaviour.

Changes:
- Update test_temp_cache_dir_tmpdir_creation_failure to use _TempCacheDir
  directly and assert that an OSError is raised with a clear TMPDIR
  context when directory creation fails
- Introduce test_temp_cache_dir_tmpdir_not_directory to verify that
  pointing TMPDIR at a non-directory raises an OSError with an
  informative error message

These tests better match the intended contract of _TempCacheDir by
ensuring invalid TMPDIR configurations fail loudly with descriptive
messages rather than silently falling back.
…loudly

Tighten TMPDIR handling in _TempCacheDir so that invalid configurations
raise clear errors instead of silently falling back to the default
temporary directory.

Changes:
- When TMPDIR points to a non-existent directory, raise an OSError with
  explicit guidance to create it manually or unset TMPDIR
- When TMPDIR points to a non-directory path, raise an OSError with
  guidance to point TMPDIR to a writable directory or unset it
- Remove previous warning-and-fallback behaviour to avoid masking
  configuration issues

This ensures that TMPDIR misconfigurations are surfaced early and
clearly, aligning runtime behaviour with the stricter expectations
codified in the new tests.
@stas00
Copy link
Contributor

stas00 commented Dec 1, 2025

Thank you for checking the other instances of tempfile use, @ada-ggf25

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

work around tempfile silently ignoring TMPDIR if the dir doesn't exist

3 participants