Resumable training #15
base: main
Conversation
pszemraj commented on May 18, 2025
- ability to start from existing .pt checkpoint from previous run
- optionally save optimizer/rng states with output checkpoints, can load those later
- ability to start from hf transformers format weights, convert to .pt, then train
- Add functionality to save and load optimizer states
- Implement continuous checkpointing with RNG state
- Support resuming training from specific checkpoint
- Add stub for HuggingFace model loading
- Add command-line arguments for controlling resumable training

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…gFace model loading
…rly handle RNG state
…PyTorch models only
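As a rough illustration of the checkpointing flow described in the PR summary (save optimizer/RNG states alongside the model, resume from a specific .pt checkpoint), a minimal sketch could look like the following; the function names and checkpoint keys here are placeholders, not the actual code in this PR:

```python
import torch


def save_checkpoint(path, model, optimizer, step, best_loss, save_states=True):
    """Save model weights plus (optionally) optimizer and RNG states for resuming."""
    ckpt = {"model": model.state_dict(), "step": step, "best_loss": best_loss}
    if save_states:
        ckpt["optimizer"] = optimizer.state_dict()
        ckpt["rng"] = {
            "torch": torch.get_rng_state(),
            "cuda": torch.cuda.get_rng_state_all() if torch.cuda.is_available() else None,
        }
    torch.save(ckpt, path)


def load_checkpoint(path, model, optimizer=None):
    """Restore model weights and, if present, optimizer/RNG states; return (step, best_loss)."""
    # weights_only=False because the checkpoint holds more than bare tensors
    ckpt = torch.load(path, map_location="cpu", weights_only=False)
    model.load_state_dict(ckpt["model"])
    if optimizer is not None and "optimizer" in ckpt:
        optimizer.load_state_dict(ckpt["optimizer"])
    if ckpt.get("rng") is not None:
        torch.set_rng_state(ckpt["rng"]["torch"])
        if ckpt["rng"]["cuda"] is not None and torch.cuda.is_available():
            torch.cuda.set_rng_state_all(ckpt["rng"]["cuda"])
    return ckpt.get("step", 0), ckpt.get("best_loss", float("inf"))
```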
PR is still WIP - need to test/improve starting from hf format weights. There is some weird unrelated bug where https://huggingface.co/BEE-spoke-data/tiny-random-MPNetForMaskedLM (and any other mpnet model) pytorch weights are not recognized locally, but it works fine on colab (versions of things are the same ...), so will test there.
@amazingvince here are the tests I was going to run locally for starting from existing hf weights: https://gist.github.com/pszemraj/30d1a6995d4365ef92bbe71ee10e8c91. I started having a strange bug where my WSL environment would not recognize any PyTorch/safetensors files in model repos, but the tests for resuming training (starting from random weights) worked.
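For context, the "start from hf format weights, convert to .pt" path being tested could look roughly like this; the hf_to_pt helper and output filename are hypothetical, only the model repo comes from the thread:

```python
import torch
from transformers import AutoModelForMaskedLM


def hf_to_pt(repo_id: str, out_path: str) -> None:
    """Download HF-format weights and re-save the state dict as a plain .pt file."""
    model = AutoModelForMaskedLM.from_pretrained(repo_id)
    torch.save({"model": model.state_dict()}, out_path)


hf_to_pt("BEE-spoke-data/tiny-random-MPNetForMaskedLM", "init_weights.pt")
```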
Signed-off-by: peter szemraj <peterszemraj@gmail.com>
- Fix accuracy calculation showing 0 in early steps by normalizing by total tokens
- Fix RNG state restoration with proper ByteTensor conversion
- Add weights_only=False to torch.load() for PyTorch 2.6 compatibility
- Add ruff.toml and .isort.cfg to configure linting/formatting
- Configure to skip __init__.py files to avoid circular import issues

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
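A minimal sketch of the two load-side fixes listed above (the checkpoint key names are assumptions, not necessarily the repo's actual layout):

```python
import torch

# PyTorch 2.6 changed torch.load() to default to weights_only=True, which rejects
# checkpoints that pickle optimizer/RNG objects, so pass weights_only=False explicitly.
ckpt = torch.load("checkpoint.pt", map_location="cpu", weights_only=False)

# torch.set_rng_state() expects a CPU uint8 (Byte) tensor; a state that was
# serialized through numpy or moved to another dtype/device needs converting back.
rng_state = ckpt["rng"]["torch"]
if torch.is_tensor(rng_state):
    rng_state = rng_state.cpu().to(torch.uint8)
else:
    rng_state = torch.tensor(rng_state, dtype=torch.uint8)
torch.set_rng_state(rng_state)
```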
@codex review?
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
# Create meters for all the relevant logging statistics using the Meters module
meters = {
Preserve best_loss value when resuming
After loading a checkpoint the code restores best_loss from the file, but immediately afterwards it reassigns best_loss = 10e6 before entering the training loop. This wipes out the loaded metric, so a resumed run always considers the next validation pass a “new best” and overwrites best_checkpoint.pt even if the loss increased. Drop the second assignment or only initialize when no checkpoint was loaded.
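A sketch of the suggested fix: set the sentinel once, let a loaded checkpoint override it, and drop the later unconditional reassignment (names other than best_loss and the 10e6 sentinel are hypothetical):

```python
from typing import Optional

import torch


def init_best_loss(resume_path: Optional[str], model: torch.nn.Module) -> float:
    """Initialize best_loss for a fresh run, or keep the value stored in the checkpoint."""
    best_loss = 10e6  # sentinel for fresh runs only
    if resume_path:
        ckpt = torch.load(resume_path, map_location="cpu", weights_only=False)
        model.load_state_dict(ckpt["model"])
        best_loss = ckpt.get("best_loss", best_loss)
    # no second `best_loss = 10e6` after this point, so the restored value survives
    return best_loss
```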