
Commit 004f224

Add comprehensive documentation for data management and multi-server workflows
- Add Data Management Best Practices guide covering SSD vs HDD usage, storage strategies, and performance considerations
- Add Multi-Server Workflow guide with branching strategies, Git integration, and server migration procedures
- Add System Status page for tracking server status, storage alerts, and system updates
- Update navigation to include new documentation pages
- Add last updated timestamp to config page
1 parent 211822e commit 004f224

5 files changed, +950 -2 lines changed

docs/.vitepress/config.mts

Lines changed: 39 additions & 2 deletions

```diff
@@ -46,6 +46,7 @@ export default defineConfig({
     nav: [
       { text: 'Roselab Guide', link: '/guide/', activeMatch: '/guide/' },
       { text: 'CSE Public Server Guide', link: '/shared/', activeMatch: '/shared/' },
+      { text: 'System Status', link: '/status/', activeMatch: '/status/' },
       { text: 'Config', link: '/config/', activeMatch: '/config/' }
     ],
 
@@ -61,6 +62,30 @@ export default defineConfig({
         ],
       }
     ],
+    '/status/': [
+      {
+        text: 'System Status',
+        items: [
+          {
+            text: 'Current Status',
+            link: '/status/',
+          },
+        ],
+      },
+      {
+        text: 'Related',
+        items: [
+          {
+            text: 'Software Versions',
+            link: '/config/',
+          },
+          {
+            text: 'Grafana Dashboard',
+            link: 'http://roselab1.ucsd.edu/grafana/',
+          },
+        ],
+      },
+    ],
     '/guide/': [
       {
         text: 'Guide',
@@ -77,6 +102,14 @@ export default defineConfig({
           {
             text: 'Moving between Machines',
             link: '/guide/cluster',
           },
+          {
+            text: 'Multi-Server Workflow',
+            link: '/guide/workflow',
+          },
+          {
+            text: 'Data Management',
+            link: '/guide/data-management',
+          },
           {
             text: 'Limitations',
             link: '/guide/limit',
@@ -104,10 +137,14 @@ export default defineConfig({
         ],
       },
       {
-        text: 'Config',
+        text: 'System Info',
         items: [
           {
-            text: 'Versions',
+            text: 'Current Status',
+            link: '/status/',
+          },
+          {
+            text: 'Software Versions',
             link: '/config/',
           },
         ],
```

docs/config/index.md

Lines changed: 2 additions & 0 deletions

```diff
@@ -4,6 +4,8 @@ layout: doc
 
 # Config
 
+**Last Updated**: October 24, 2025
+
 ## Software Info
 
 | Name | Version | Installation Method | Location | Uninstall Method |
```

docs/guide/data-management.md

Lines changed: 297 additions & 0 deletions (new file)

# Data Management Best Practices

## Overview

RoseLab servers provide two kinds of storage, local SSD and network-mounted HDD, with different characteristics. Understanding when and how to use each type is crucial for optimal performance and efficient resource utilization.

## Storage Types

### System SSD (Local Storage)

- **Size**: ~256 GB available per container
- **Performance**: High-speed NVMe SSD
- **Scope**: Local to each machine (not synchronized)
- **Best for**: Active development, environments, code, and frequently accessed small files

### `/data` Directory (Network HDD)

- **Size**: 5 TB per user (private, accessible only to you)
- **Performance**: Network-mounted HDD over 100Gbps connection
- **Scope**: Synchronized across all RoseLab servers (roselab1-5)
- **Best for**: Large datasets, checkpoints, archived data

### `/public` Directory (Shared Network HDD)

- **Size**: 5 TB total (shared among all users)
- **Performance**: Network-mounted HDD over 100Gbps connection
- **Scope**: Synchronized across all servers, accessible to all users
- **Best for**: Shared datasets, collaborative data
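For example, sharing a dataset usually just means copying it into `/public` and making it readable for others; a minimal sketch (the directory layout and permission conventions are assumptions, check with your group):

```bash
# Copy the dataset into the shared mount (paths are illustrative)
mkdir -p /public/datasets/my-shared-corpus
cp -r /home/ubuntu/datasets/corpus/. /public/datasets/my-shared-corpus/

# Make files readable and directories traversable for other users
chmod -R a+rX /public/datasets/my-shared-corpus
```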
## Best Practices by File Type

### ❌ Never Put on `/data`

**Development Environments and Package Caches**

Do **NOT** store these on `/data`:

- Python virtual environments (`venv`, `conda` environments)
- Package manager caches (`pip`, `conda`, `npm`)
- Compiled code and bytecode (`.pyc` files, `__pycache__` directories)
- Build artifacts

**Why?** Loading thousands of small files through the network causes significant performance degradation. Even with a 100Gbps connection, the latency of accessing numerous small files adds up quickly.

**Example of what to avoid:**

```bash
# ❌ BAD: Creating conda environment on /data
conda create -p /data/envs/myenv python=3.10

# ✅ GOOD: Keep environments on local SSD
conda create -n myenv python=3.10
```
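Since package caches belong on the local disk, it is worth checking where yours point. A minimal sketch for moving them back to their local defaults, assuming the standard pip and conda configuration commands:

```bash
# Keep pip's cache on the local SSD (~/.cache/pip is the default)
pip config set global.cache-dir ~/.cache/pip

# Likewise for conda package downloads
conda config --add pkgs_dirs ~/.conda/pkgs
```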
### ✅ Always Put on `/data`

**Large Model Checkpoints**

Checkpoints should be stored on `/data` because:

- They are typically single large files (GBs each)
- Loading a single large file over the network is efficient
- They benefit from cross-server synchronization
- They don't need to be duplicated across servers

**Example:**

```python
import torch

# ✅ GOOD: Save checkpoints directly to /data
torch.save(model.state_dict(), '/data/experiments/project1/checkpoint_epoch_50.pt')

# Loading is also efficient
model.load_state_dict(torch.load('/data/experiments/project1/checkpoint_epoch_50.pt'))
```

**Archived or Cold Datasets**

Move datasets to `/data` when:

- You've completed a project but want to keep the data
- You're storing intermediate preprocessing outputs
- The dataset is not actively used in training
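A typical archiving step then looks like this (paths are illustrative):

```bash
# Move a finished project's dataset to /data...
mv /home/ubuntu/datasets/old-project /data/datasets/old-project

# ...or, if it contains many small files, pack it into one archive first
tar -czf /data/datasets/old-project.tar.gz -C /home/ubuntu/datasets old-project
rm -rf /home/ubuntu/datasets/old-project
```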
### ⚖️ Conditional: Active Datasets

The decision depends on dataset characteristics:

#### Small to Medium Datasets (< 500 GB, consolidated files)

**Strategy**: Keep a hot copy on local SSD, archive to `/data`

```bash
# Keep active copy on local SSD
/home/ubuntu/projects/active-project/data/

# Archive completed datasets to /data
/data/datasets/project1/
```

#### Large Consolidated Datasets (> 500 GB, few large files)

**Strategy**: Load directly from `/data`

If your dataset consists of a few large files (e.g., HDF5, Parquet, or compressed archives), loading from `/data` is acceptable:

```python
# ✅ Acceptable: Loading large consolidated files from /data
import h5py

with h5py.File('/data/datasets/large_dataset.h5', 'r') as f:
    data = f['train'][:]
```

#### Large Scattered Datasets (> 500 GB, millions of small files)

**Strategy**: Create a consolidated copy for `/data` storage

If you have millions of small files (e.g., ImageNet with individual JPG files):

1. **For active use**: Keep on local SSD if space permits

2. **For archival**: Create a single consolidated file

   ```bash
   # Create a tar archive for efficient storage/loading from /data
   tar -czf /data/datasets/imagenet.tar.gz /home/ubuntu/datasets/imagenet/
   ```

   Or use HDF5 to consolidate:

   ```python
   import glob
   import os

   import h5py
   import numpy as np
   from PIL import Image

   # Gather the scattered image files (glob pattern is illustrative)
   image_paths = glob.glob('/home/ubuntu/datasets/imagenet/**/*.JPEG', recursive=True)

   with h5py.File('/data/datasets/imagenet.h5', 'w') as f:
       images_group = f.create_group('images')
       for img_path in image_paths:
           img = np.array(Image.open(img_path))
           # Key by basename; slashes in dataset names would create nested groups
           images_group.create_dataset(os.path.basename(img_path), data=img)
   ```

3. **Alternative**: Use a dataloader that supports streaming from tar archives (for streaming, an uncompressed `.tar` is the usual choice, so consider skipping the `-z` flag above):

   ```python
   import webdataset as wds

   # Stream from tar archive on /data
   dataset = wds.WebDataset('/data/datasets/imagenet.tar')
   ```
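   In a real training loop you would typically also decode and batch the streamed samples. A sketch using WebDataset's standard pipeline stages, assuming the archive stores image/label pairs under `jpg`/`cls` keys (adjust to however the tar was actually built):

   ```python
   import webdataset as wds
   from torch.utils.data import DataLoader

   dataset = (
       wds.WebDataset('/data/datasets/imagenet.tar')
       .decode('pil')            # decode image entries with PIL
       .to_tuple('jpg', 'cls')   # yield (image, label) pairs
       .batched(64)              # batch inside the pipeline
   )

   # batch_size=None because batching already happens in the pipeline
   loader = DataLoader(dataset, batch_size=None)
   ```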
## Storage Management Strategies

### Symlinks for Data Access

Use symbolic links to maintain a clean project structure while storing data on `/data`:

```bash
# Instead of hardcoding paths like /data/project1/samples...
# create a symlink in your project directory
cd /home/ubuntu/projects/my-project/
ln -s /data/project1/ ./data

# Now you can use relative paths in your code
# ./data/samples/sample1.pt
```

This approach allows you to:

- Keep code and data logically together
- Easily switch between different data locations
- Move projects between servers without changing code
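For instance, switching to a new dataset version is just repointing the link, with no code changes (paths are illustrative):

```bash
# -f replaces the existing link; -n treats the existing symlink itself
# as the thing to replace rather than following it
ln -sfn /data/project1-v2/ ./data
```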
### Hot/Cold Data Management

**Active Projects (Hot Data)**:

- Store on local SSD for best performance
- Keep code, environments, and active datasets local
- Use `/data` only for checkpoints and large files

**Completed Projects (Cold Data)**:

- Move entire project data to `/data`
- Keep only code on local SSD (or use Git)
- This frees up SSD space for new active projects

**Example workflow:**

```bash
# During active development
/home/ubuntu/projects/active-research/
├── code/
├── data/ -> /data/active-research/data/               # symlink to /data for large files
├── checkpoints/ -> /data/active-research/checkpoints/ # symlink
└── env/                                               # local conda environment

# After project completion:
# move everything to /data, remove the local copy
mv /home/ubuntu/projects/active-research /data/archived-projects/
# Keep only the code in git, remove local files
```

### Monitoring Storage Usage

Regularly check your storage usage:

```bash
# Check local SSD usage
df -h /

# Check /data usage
df -h /data

# Find large directories
du -h --max-depth=1 /home/ubuntu/ | sort -hr | head -10
du -h --max-depth=1 /data/ | sort -hr | head -10
```

### When Storage is Full

If you encounter storage capacity issues:

1. **Check for unnecessary files**:

   ```bash
   # Find large files
   find /home/ubuntu -type f -size +1G -exec ls -lh {} \;

   # Clean conda/pip caches
   conda clean --all
   pip cache purge
   ```

2. **Move cold data to `/data`**:
   - Archive completed projects
   - Move old checkpoints
   - Compress large log files

3. **Remove redundant data**:
   - Delete duplicate datasets across servers (keep one copy in `/data`)
   - Remove intermediate experiment results
   - Clean up old Docker images if using Docker-in-LXC (see the sketch after this list)

4. **Contact admin** if `/data` is full; additional storage may need to be provisioned
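For the Docker cleanup in step 3, the built-in prune command is the usual tool (with `-a` it removes every image not used by a running container, so review before confirming):

```bash
# Remove stopped containers, unused networks, dangling build cache,
# and (with -a) all images not referenced by a running container
docker system prune -a
```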
## Performance Considerations

### Network-Mounted Storage Performance

While `/data` is connected via a 100Gbps network, performance depends on access patterns:

- **Good**: Sequential reads of large files (300+ MB/s)
- **Acceptable**: Random reads of medium-sized files
- **Poor**: Random access to thousands of small files
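A pattern that follows directly from this: stage data from `/data` onto the local SSD once per job as a single sequential copy, then read only the local copy during training. A minimal sketch (the script name and flag are placeholders):

```bash
# One sequential copy over the network (fast), instead of millions
# of random small reads from /data during training (slow)
rsync -a /data/datasets/project1/ /home/ubuntu/scratch/project1/

# Point the training run at the local copy (train.py / --data-dir are hypothetical)
python train.py --data-dir /home/ubuntu/scratch/project1/
```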
### Loading Large Checkpoints

Loading large checkpoints from `/data` is efficient:

```python
# This is fine: a single large file transfer
checkpoint = torch.load('/data/models/large_model_5GB.pt')
```

### Accessing Many Small Files

Avoid patterns like this:

```python
import torch
from torch.utils.data import Dataset

# ❌ BAD: Loading many small files from /data during training
class MyDataset(Dataset):
    def __getitem__(self, idx):
        # Each call loads a small file over the network - very slow!
        return torch.load(f'/data/samples/sample_{idx}.pt')
```

Instead:

```python
# ✅ GOOD: Consolidate small files or cache locally
class MyDataset(Dataset):
    def __init__(self):
        # Load the entire dataset once from /data
        self.data = torch.load('/data/dataset/consolidated.pt')

    def __getitem__(self, idx):
        return self.data[idx]
```
## Summary

| File Type | Local SSD | `/data` | `/public` |
|-----------|-----------|---------|-----------|
| Code, scripts | ✅ | ❌ | ❌ |
| Conda/venv environments | ✅ | ❌ | ❌ |
| Python cache (`__pycache__`) | ✅ | ❌ | ❌ |
| Active small datasets | ✅ | ❌ | ❌ |
| Large model checkpoints | 🟡 | ✅ | ❌ |
| Archived datasets | ❌ | ✅ | 🟡 |
| Shared datasets | ❌ | ❌ | ✅ |
| Logs (active) | ✅ | ❌ | ❌ |
| Logs (archived) | ❌ | ✅ | ❌ |

Legend: ✅ Recommended, 🟡 Acceptable, ❌ Not recommended

## Additional Resources

- [Getting Started Guide](./index) - Basic storage overview
- [Moving between Machines](./cluster) - Container migration
- [Troubleshooting](./troubleshooting) - Performance issues
