
Commit d0fb697

Update documentation with recent server announcements
- Add storage capacity warning (new drives on the way)
- Document memory limits with hard kill policy
- Add memory quota differences by server (roselab5: 4x, roselab4: 2x)
- Document new utility features (kill container, clean pip cache)
- Add RoseLibreChat privacy notice (API-based, not self-hosted)
- Add server selection spreadsheet link
- Document NVIDIA driver fix script and current version (580.95.05)
- Update common-utilities options numbering
1 parent 004f224 commit d0fb697

6 files changed: 180 additions, 4 deletions

docs/guide/cluster.md

Lines changed: 28 additions & 3 deletions
@@ -15,8 +15,10 @@ The utility offers several options:
 2. Copy container
 3. Start container
 4. Stop container
-5. Remove container
-6. Create new container
+5. Kill container (force stop)
+6. Remove container
+7. Create new container
+8. Clean pip cache (with temporary quota lift)

 ## Container Migration Process


@@ -27,7 +29,30 @@ To migrate your container to another server:

 ## Creating New Containers

-Use option 6 to create a new container from preset images on any available server.
+Use option 7 to create a new container from preset images on any available server.
+
+## Troubleshooting Stuck Containers
+
+If your container becomes stuck and unresponsive (e.g., due to memory issues or process deadlocks):
+
+1. Try option 4 to stop the container normally
+2. If the container won't stop, use option 5 to kill the container (force stop)
+3. After killing, you can restart the container with option 3
+
+::: warning
+Killing a container forcefully terminates all processes without cleanup. Only use this option when the container is truly stuck and won't respond to normal stop commands.
+:::
+
+## Managing Storage
+
+If you're running out of storage quota, use option 8 to clean pip caches:
+
+1. Select "Clean pip cache" from the menu
+2. The utility will temporarily lift your storage quota
+3. Pip caches will be removed to free up space
+4. Your quota will be restored after cleanup
+
+This is particularly useful when you've installed many Python packages and pip's cache is consuming significant space.

 ## Benefits of Using This Utility

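Before reaching for option 8, it can be useful to see how much space pip's cache actually occupies. A minimal sketch in Python, assuming pip 20.1+ (which provides the `pip cache dir` subcommand); it is not part of the utility itself:

```python
import subprocess
from pathlib import Path

# Ask pip where its cache lives (available in pip >= 20.1).
cache_dir = Path(subprocess.run(
    ["pip", "cache", "dir"], capture_output=True, text=True, check=True
).stdout.strip())

# Sum the sizes of all regular files under the cache directory.
total_bytes = sum(f.stat().st_size for f in cache_dir.rglob("*") if f.is_file())
print(f"pip cache at {cache_dir}: {total_bytes / 2**30:.2f} GiB")
```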

docs/guide/data-management.md

Lines changed: 8 additions & 0 deletions
@@ -20,6 +20,10 @@ RoseLab servers provide two types of storage with different characteristics. Und
 - **Scope**: Synchronized across all RoseLab servers (roselab1-5)
 - **Best for**: Large datasets, checkpoints, archived data

+::: warning Storage Capacity
+The lab's shared storage is at high utilization. New drives are on the way to expand capacity. In the meantime, please be mindful of storage usage and clean up unnecessary data regularly. Contact the admin if you need assistance with storage management.
+:::
+
 ### `/public` Directory (Shared Network HDD)

 - **Size**: 5 TB total (shared among all users)

@@ -216,6 +220,10 @@ If you encounter storage capacity issues:
 # Clean conda/pip caches
 conda clean --all
 pip cache purge
+
+# Or use the common utility to clean pip cache with temporary quota lift
+python /utilities/common-utilities.py
+# Select the "Clean pip cache" option
 ```

 2. **Move cold data to `/data`**:
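When deciding what counts as cold data worth moving to `/data`, it helps to see which directories dominate your home quota. A minimal sketch that lists the ten largest top-level directories under `$HOME` (the scan can take a while on large trees):

```python
from pathlib import Path

def dir_size(path: Path) -> int:
    """Total size in bytes of regular files under `path` (symlinks skipped)."""
    return sum(f.stat().st_size for f in path.rglob("*")
               if f.is_file() and not f.is_symlink())

home = Path.home()
sizes = sorted(((dir_size(d), d) for d in home.iterdir() if d.is_dir()), reverse=True)
for size, d in sizes[:10]:
    print(f"{size / 2**30:8.2f} GiB  {d}")
```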

docs/guide/index.md

Lines changed: 5 additions & 1 deletion
@@ -9,7 +9,11 @@ The RoseLab servers, owned and managed by the UCSD CSE [Rose Lab](https://roseyu
 - S3 dataset hosting with [MinIO](https://rosedata.ucsd.edu)
 - Online markdown collaboration using [Hedgedoc](https://roselab1.ucsd.edu/hedgedoc)
 - Self-hosted experiment tracking via [WandB](https://rosewandb.ucsd.edu) (contact admin for invitation email)
-- Lab-shared ChatGPT service frontend [RoseLibreChat](https://roselab1.ucsd.edu:3407/) (contact admin for backend API access)
+- Lab-shared AI chat interface [RoseLibreChat](https://roselab1.ucsd.edu:3407/) using lab credits (contact admin for access)
+
+::: warning Privacy Notice for RoseLibreChat
+Most AI models available through RoseLibreChat are **not self-hosted** - they connect to external APIs (OpenAI, Anthropic, etc.). Do not share confidential, proprietary, or sensitive information through this service if you have privacy concerns. The lab provides this service using shared credits for convenience, not as a secure private deployment.
+:::

 Additional web applications are planned to support future research needs.


docs/guide/limit.md

Lines changed: 80 additions & 0 deletions
@@ -93,7 +93,87 @@ This is because the request made by an outsider, after being forwarded by the ho

 As a result, it is not recommended to establish your own firewall inside the container. If you want to control access to your service, for example, only allowing your work PC to access your Jupyter Lab, please contact the admin to add a firewall rule to the host OS.

+## Memory Limits

+Each container has a memory quota that varies by server. When your processes exceed the allocated memory limit, the container will be automatically killed to protect the host system.
+
+### Memory Enforcement Policy
+
+The lab uses a **hard enforcement policy** for memory limits:
+- When your processes exceed the memory quota, the container will be immediately killed
+- There is no grace period or soft limit
+- Pay close attention to your processes' memory usage to avoid unexpected termination
+
+### Monitoring Memory Usage
+
+You can monitor your container's RAM usage on [Grafana](http://roselab1.ucsd.edu/grafana/):
+
+1. Navigate to the container metrics dashboard
+2. Check the "Memory Usage" panel to see current and historical usage
+3. Set up alerts if your usage approaches the quota
+
+### Memory Quota by Server
+
+Each RoseLab server has different RAM capacity and quota allocation:
+
+- **roselab1**: 512 GB total RAM, standard quota per container
+- **roselab2**: 512 GB total RAM, standard quota per container
+- **roselab3**: 512 GB total RAM, standard quota per container
+- **roselab4**: 1 TB total RAM, 2x standard quota per container
+- **roselab5**: 2 TB total RAM, 4x standard quota per container
+
+::: tip Moving to Higher-Memory Servers
+If your workload requires more memory than your current allocation:
+1. Move your container to roselab5 (4x quota) using `/utilities/common-utilities.py`
+2. Contact Rose for resource request approval if even roselab5's quota is insufficient
+3. After approval, contact the admin to increase your specific quota
+:::
+
+### Reducing Memory Usage
+
+If you're hitting memory limits:
+
+1. **Profile your code** to identify memory leaks:
+   ```python
+   # Use memory_profiler to get a line-by-line memory report
+   from memory_profiler import profile
+
+   @profile
+   def my_function():
+       # Your code here
+       pass
+   ```
+
+2. **Reduce batch size** in training:
+   ```python
+   # Smaller batch size uses less memory
+   train_loader = DataLoader(dataset, batch_size=16)  # instead of 32
+   ```
+
+3. **Use gradient accumulation** instead of large batches:
+   ```python
+   # Accumulate gradients over multiple steps
+   accumulation_steps = 4
+   for i, (inputs, labels) in enumerate(train_loader):
+       outputs = model(inputs)
+       loss = criterion(outputs, labels)
+       loss = loss / accumulation_steps
+       loss.backward()
+
+       if (i + 1) % accumulation_steps == 0:
+           optimizer.step()
+           optimizer.zero_grad()
+   ```
+
+4. **Clear unused variables**:
+   ```python
+   import gc
+   import torch
+
+   del large_variable
+   gc.collect()
+   torch.cuda.empty_cache()  # For GPU memory
+   ```
+
+5. **Use data streaming** instead of loading entire datasets into memory

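Because the limit is enforced as a hard kill, it is worth checking headroom from inside the container before launching a large job. A minimal sketch, assuming the `psutil` package is installed and that the container's `/proc/meminfo` reflects its own quota rather than the host's:

```python
import psutil

mem = psutil.virtual_memory()
print(f"total: {mem.total / 2**30:.1f} GiB")
print(f"used:  {mem.used / 2**30:.1f} GiB ({mem.percent:.0f}%)")

# The 80% threshold is arbitrary; exceeding the quota kills the container,
# so leave yourself a comfortable margin.
if mem.percent > 80:
    print("WARNING: approaching the memory quota")
```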

docs/guide/troubleshooting.md

Lines changed: 50 additions & 0 deletions
@@ -85,5 +85,55 @@ When running machine learning tasks, large datasets are often loaded from disk.

 PyTorch has a built-in data prefetching mechanism. While the GPU is training on a batch, `torch.utils.data.DataLoader` will load the data for the next batch. By default, a total of 2 * `num_workers` batches are prefetched across all workers. The `num_workers` is 0 by default. If you face slow disk reading speed, consider increasing the number of workers.

+## NVIDIA Driver Issues
+
+### Driver/Library Version Mismatch
+
+If you see the error message:
+
+```bash
+Failed to initialize NVML: Driver/library version mismatch
+```
+
+This indicates that your container's NVIDIA driver is out of sync with the host system.
+
+#### Cause
+
+The host and container NVIDIA driver versions must match exactly. This mismatch typically occurs after:
+- Server maintenance or reboots
+- Host driver updates
+- Container restoration from backup
+
+::: warning
+You **cannot** change host driver versions yourself - these are managed by the admin and documented in the [config table](../config/).
+:::
+
+#### Solution
+
+1. **Use the NVIDIA upgrade script** (recommended):
+   ```bash
+   sudo /utilities/nvidia-upgrade.sh
+   sudo reboot
+   ```
+
+2. **Wait for the reboot**: After running the script, your container will reboot to apply the new driver.
+
+3. **Verify the fix**: After reboot, check that CUDA is working:
+   ```bash
+   nvidia-smi
+   python -c "import torch; print(torch.cuda.is_available())"
+   ```
+
+::: danger Important
+**Never install nvidia-driver through your package manager** (apt, yum, etc.). This will break GPU passthrough and prevent your container from accessing GPUs. Always use the provided upgrade script.
+:::
+
+### Current Driver Version
+
+As of the latest server migration (October 2024), all NVIDIA drivers have been upgraded to version **580.95.05**. You can verify your driver version with:
+
+```bash
+nvidia-smi | grep "Driver Version"
+```

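To make the prefetching advice above concrete, a minimal sketch of a `DataLoader` configured with worker processes; the `TensorDataset` here is only a stand-in for a real disk-backed dataset, and `num_workers=4` is a starting point, not a recommendation:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for a real disk-backed Dataset.
dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))

if __name__ == "__main__":
    train_loader = DataLoader(
        dataset,
        batch_size=32,
        shuffle=True,
        num_workers=4,      # worker processes read batches in parallel with GPU compute
        prefetch_factor=2,  # each worker keeps 2 batches ready (PyTorch's default)
        pin_memory=True,    # page-locked memory speeds up host-to-GPU copies
    )
    for inputs, labels in train_loader:
        pass  # training step goes here
```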

docs/guide/workflow.md

Lines changed: 9 additions & 0 deletions
@@ -15,6 +15,15 @@ RoseLab provides five servers (roselab1-5) that you can use simultaneously. This
 - "I thought I fixed this bug already... did I do it on roselab2 or roselab3?"
 - "Which server has my newest experimental results?"

+::: tip Server Selection Guidelines
+The lab maintains a [Server Selection Spreadsheet](https://docs.google.com/spreadsheets/d/1aTKbCNTq0guwF144nnNwXgWHBZ8lwyzOLQC5vKHU6mk/edit?usp=sharing) where you should claim your main server. This helps:
+- Avoid interruptions when someone runs heavy jobs
+- Prevent bottlenecks by balancing load across servers
+- Match your workload to the right GPU type (4090 vs A100 vs L40S vs H200)
+
+Review the guidelines in the spreadsheet to choose the most appropriate server for your typical workload.
+:::
+
 **Example Strategy**:
 ```
 roselab1: Main development server (always has latest code)
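When choosing which server to claim in the spreadsheet, it can help to confirm which GPU type a given container actually sees. A minimal sketch, assuming PyTorch is installed in the container:

```python
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.0f} GiB")
else:
    print("No CUDA device visible in this container")
```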
