# Data Management Best Practices

## Overview

RoseLab servers provide two types of storage, local SSD and network-mounted HDD, with different characteristics. Understanding when and how to use each type is crucial for optimal performance and efficient resource utilization.

## Storage Types

### System SSD (Local Storage)

- **Size**: ~256 GB available per container
- **Performance**: High-speed NVMe SSD
- **Scope**: Local to each machine (not synchronized)
- **Best for**: Active development, environments, code, and frequently accessed small files

### `/data` Directory (Network HDD)

- **Size**: 5 TB per user (private, accessible only to you)
- **Performance**: Network-mounted HDD over a 100 Gbps connection
- **Scope**: Synchronized across all RoseLab servers (roselab1-5)
- **Best for**: Large datasets, checkpoints, archived data

### `/public` Directory (Shared Network HDD)

- **Size**: 5 TB total (shared among all users)
- **Performance**: Network-mounted HDD over a 100 Gbps connection
- **Scope**: Synchronized across all servers, accessible to all users
- **Best for**: Shared datasets, collaborative data

## Best Practices by File Type

### ❌ Never Put on `/data`

**Development Environments and Package Caches**

Do **NOT** store these on `/data`:
- Python virtual environments (`venv`, `conda` environments)
- Package manager caches (`pip`, `conda`, `npm`)
- Compiled code and bytecode (`.pyc` files, `__pycache__` directories)
- Build artifacts

**Why?** Loading thousands of small files over the network causes significant performance degradation. Even with a 100 Gbps connection, the latency of accessing numerous small files adds up quickly.

**Example of what to avoid:**
```bash
# ❌ BAD: Creating conda environment on /data
conda create -p /data/envs/myenv python=3.10

# ✅ GOOD: Keep environments on local SSD
conda create -n myenv python=3.10
```

### ✅ Always Put on `/data`

**Large Model Checkpoints**

Checkpoints should be stored on `/data` because:
- They are typically single large files (GBs each)
- Loading a single large file over the network is efficient
- They benefit from cross-server synchronization
- They don't need to be duplicated across servers

**Example:**
```python
# ✅ GOOD: Save checkpoints directly to /data
torch.save(model.state_dict(), '/data/experiments/project1/checkpoint_epoch_50.pt')

# Loading is also efficient
model.load_state_dict(torch.load('/data/experiments/project1/checkpoint_epoch_50.pt'))
```

**Archived or Cold Datasets**

Move datasets to `/data` when:
- You've completed a project but want to keep the data
- You're doing intermediate preprocessing
- The dataset is not actively used in training
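
For example, a finished project's dataset can be packed into a single archive on `/data` and the local copy removed. This is only a sketch; the paths are illustrative, so adjust them to your own layout:

```bash
# Pack the cold dataset into one archive on /data
tar -czf /data/archives/project1_data.tar.gz -C /home/ubuntu/projects/project1 data/

# Verify the archive is readable before removing the local copy
tar -tzf /data/archives/project1_data.tar.gz > /dev/null && rm -rf /home/ubuntu/projects/project1/data
```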

### ⚖️ Conditional: Active Datasets

The decision depends on dataset characteristics:

#### Small to Medium Datasets (< 500 GB, consolidated files)

**Strategy**: Keep a hot copy on local SSD, archive to `/data`

```bash
# Keep active copy on local SSD
/home/ubuntu/projects/active-project/data/

# Archive completed datasets to /data
/data/datasets/project1/
```
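
One way to maintain this split is to sync between the two locations with `rsync`. This is a sketch; the paths are illustrative:

```bash
# Push the local working copy to /data when you reach a milestone
rsync -a --delete /home/ubuntu/projects/active-project/data/ /data/datasets/project1/

# Pull it back to the local SSD if you resume work on the project later
rsync -a /data/datasets/project1/ /home/ubuntu/projects/active-project/data/
```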

#### Large Consolidated Datasets (> 500 GB, few large files)

**Strategy**: Load directly from `/data`

If your dataset consists of a few large files (e.g., HDF5, Parquet, or compressed archives), loading from `/data` is acceptable:

```python
# ✅ Acceptable: Loading large consolidated files from /data
import h5py
with h5py.File('/data/datasets/large_dataset.h5', 'r') as f:
    data = f['train'][:]
```

#### Large Scattered Datasets (> 500 GB, millions of small files)

**Strategy**: Create a consolidated copy for `/data` storage

If you have millions of small files (e.g., ImageNet with individual JPG files):

1. **For active use**: Keep on local SSD if space permits
2. **For archival**: Create a single consolidated file

```bash
# Create a tar archive for efficient storage/loading from /data
tar -czf /data/datasets/imagenet.tar.gz /home/ubuntu/datasets/imagenet/
```

Or use HDF5 to consolidate:

```python
import glob
import os

import h5py
import numpy as np
from PIL import Image

src_dir = '/home/ubuntu/datasets/imagenet/'
# Adjust the glob pattern to match your file layout
image_paths = glob.glob(os.path.join(src_dir, '**', '*.JPEG'), recursive=True)

with h5py.File('/data/datasets/imagenet.h5', 'w') as f:
    images_group = f.create_group('images')
    for img_path in image_paths:
        img = np.array(Image.open(img_path))
        # Name each dataset by its path relative to the source directory
        images_group.create_dataset(os.path.relpath(img_path, src_dir), data=img)
```

3. **Alternative**: Use a dataloader that supports streaming from tar archives:
```python
import webdataset as wds

# Stream from tar archive on /data
dataset = wds.WebDataset('/data/datasets/imagenet.tar')
```
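
WebDataset is typically used with plain (uncompressed) `.tar` shards, and it expects the files that make up one sample to sit next to each other in the archive. A sketch of building such an archive with GNU tar (illustrative paths):

```bash
# Build an uncompressed tar, sorted by name so related files stay adjacent
tar --sort=name -cf /data/datasets/imagenet.tar -C /home/ubuntu/datasets imagenet/
```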

## Storage Management Strategies

### Symlinks for Data Access

Use symbolic links to maintain clean project structure while storing data on `/data`:

```bash
# Instead of hardcoding paths like /data/project1/samples...
# Create a symlink in your project directory
cd /home/ubuntu/projects/my-project/
ln -s /data/project1/ ./data

# Now you can use relative paths in your code
# ./data/samples/sample1.pt
```

This approach allows you to:
- Keep code and data logically together
- Easily switch between different data locations
- Move projects between servers without changing code

### Hot/Cold Data Management

**Active Projects (Hot Data)**:
- Store on local SSD for best performance
- Keep code, environments, and active datasets local
- Use `/data` only for checkpoints and large files

**Completed Projects (Cold Data)**:
- Move entire project data to `/data`
- Keep only code on local SSD (or use Git)
- This frees up SSD space for new active projects

**Example workflow:**
```bash
# During active development
/home/ubuntu/projects/active-research/
├── code/
├── data/ -> /data/active-research/data/ # symlink to /data for large files
├── checkpoints/ -> /data/active-research/checkpoints/ # symlink
└── env/ # local conda environment

# After project completion
# Move everything to /data, remove local copy
mv /home/ubuntu/projects/active-research /data/archived-projects/
# Keep only the code in git, remove local files
```

### Monitoring Storage Usage

Regularly check your storage usage:

```bash
# Check local SSD usage
df -h /

# Check /data usage
df -h /data

# Find large directories
du -h --max-depth=1 /home/ubuntu/ | sort -hr | head -10
du -h --max-depth=1 /data/ | sort -hr | head -10
```

### When Storage is Full

If you encounter storage capacity issues:

1. **Check for unnecessary files**:
   ```bash
   # Find large files
   find /home/ubuntu -type f -size +1G -exec ls -lh {} \;

   # Clean conda/pip caches
   conda clean --all
   pip cache purge
   ```

2. **Move cold data to `/data`** (see the sketch after this list):
   - Archive completed projects
   - Move old checkpoints
   - Compress large log files

3. **Remove redundant data**:
   - Delete duplicate datasets across servers (keep one copy in `/data`)
   - Remove intermediate experiment results
   - Clean up old Docker images if using Docker-in-LXC

4. **Contact admin** if `/data` is full - additional storage may need to be provisioned
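
For step 2, the cleanup can be as simple as the commands below. The paths and the 30-day threshold are illustrative; adjust them to your projects:

```bash
# Compress training logs that have not been modified in 30 days
find /home/ubuntu/projects -name '*.log' -mtime +30 -exec gzip {} \;

# Move old checkpoints off the local SSD to /data
mkdir -p /data/archived-checkpoints/project1
mv /home/ubuntu/projects/project1/checkpoints/*.pt /data/archived-checkpoints/project1/
```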

## Performance Considerations

### Network Mounted Storage Performance

While `/data` is connected via a 100 Gbps network, performance depends on access patterns:

- **Good**: Sequential reads of large files (300+ MB/s)
- **Acceptable**: Random reads of medium files
- **Poor**: Random access to thousands of small files

### Loading Large Checkpoints

Loading large checkpoints from `/data` is efficient:

```python
# This is fine - single large file transfer
checkpoint = torch.load('/data/models/large_model_5GB.pt')
```

### Accessing Many Small Files

Avoid patterns like this:

```python
# ❌ BAD: Loading many small files from /data during training
class MyDataset(Dataset):
    def __getitem__(self, idx):
        # Each call loads a small file from network - very slow!
        return torch.load(f'/data/samples/sample_{idx}.pt')
```

Instead:

```python
# ✅ GOOD: Consolidate small files or cache locally
class MyDataset(Dataset):
    def __init__(self):
        # Load entire dataset once from /data
        self.data = torch.load('/data/dataset/consolidated.pt')

    def __getitem__(self, idx):
        return self.data[idx]
```
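
The "cache locally" option can be as simple as copying the consolidated file to the local SSD once before training. A sketch with illustrative paths:

```bash
# Copy the consolidated dataset from /data to the local SSD (skip if already cached)
mkdir -p /home/ubuntu/cache
cp -n /data/dataset/consolidated.pt /home/ubuntu/cache/

# Training code then loads /home/ubuntu/cache/consolidated.pt instead of the /data path
```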

## Summary

| File Type | Local SSD | /data | /public |
|-----------|-----------|-------|---------|
| Code, scripts | ✅ | ❌ | ❌ |
| Conda/venv environments | ✅ | ❌ | ❌ |
| Python cache (`__pycache__`) | ✅ | ❌ | ❌ |
| Active small datasets | ✅ | ❌ | ❌ |
| Large model checkpoints | 🟡 | ✅ | ❌ |
| Archived datasets | ❌ | ✅ | 🟡 |
| Shared datasets | ❌ | ❌ | ✅ |
| Logs (active) | ✅ | ❌ | ❌ |
| Logs (archived) | ❌ | ✅ | ❌ |

Legend: ✅ Recommended, 🟡 Acceptable, ❌ Not recommended

## Additional Resources

- [Getting Started Guide](./index) - Basic storage overview
- [Moving between Machines](./cluster) - Container migration
- [Troubleshooting](./troubleshooting) - Performance issues