
Commit d0fb697

Update documentation with recent server announcements
- Add storage capacity warning (new drives on the way)
- Document memory limits with hard kill policy
- Add memory quota differences by server (roselab5: 4x, roselab4: 2x)
- Document new utility features (kill container, clean pip cache)
- Add RoseLibreChat privacy notice (API-based, not self-hosted)
- Add server selection spreadsheet link
- Document NVIDIA driver fix script and current version (580.95.05)
- Update common-utilities options numbering
1 parent 004f224 commit d0fb697

6 files changed: 180 additions, 4 deletions

docs/guide/cluster.md

Lines changed: 28 additions & 3 deletions
@@ -15,8 +15,10 @@ The utility offers several options:
 2. Copy container
 3. Start container
 4. Stop container
-5. Remove container
-6. Create new container
+5. Kill container (force stop)
+6. Remove container
+7. Create new container
+8. Clean pip cache (with temporary quota lift)

 ## Container Migration Process


@@ -27,7 +29,30 @@ To migrate your container to another server:

 ## Creating New Containers

-Use option 6 to create a new container from preset images on any available server.
+Use option 7 to create a new container from preset images on any available server.
+
+## Troubleshooting Stuck Containers
+
+If your container becomes stuck and unresponsive (e.g., due to memory issues or process deadlocks):
+
+1. Try option 4 to stop the container normally
+2. If the container won't stop, use option 5 to kill the container (force stop)
+3. After killing, you can restart the container with option 3
+
+::: warning
+Killing a container forcefully terminates all processes without cleanup. Only use this option when the container is truly stuck and won't respond to normal stop commands.
+:::
+
+## Managing Storage
+
+If you're running out of storage quota, use option 8 to clean pip caches:
+
+1. Select "Clean pip cache" from the menu
+2. The utility will temporarily lift your storage quota
+3. Pip caches will be removed to free up space
+4. Your quota will be restored after cleanup
+
+This is particularly useful when you've installed many Python packages and pip's cache is consuming significant space.

 ## Benefits of Using This Utility

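Before reaching for option 8, it can be useful to see how much space pip's cache actually occupies. A minimal sketch in Python, assuming pip 20.1+ (which provides the `pip cache dir` subcommand); it is not part of the utility itself:

```python
import subprocess
from pathlib import Path

# Ask pip where its cache lives (available in pip >= 20.1).
cache_dir = Path(subprocess.run(
    ["pip", "cache", "dir"], capture_output=True, text=True, check=True
).stdout.strip())

# Sum the sizes of all regular files under the cache directory.
total_bytes = sum(f.stat().st_size for f in cache_dir.rglob("*") if f.is_file())
print(f"pip cache at {cache_dir}: {total_bytes / 2**30:.2f} GiB")
```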

docs/guide/data-management.md

Lines changed: 8 additions & 0 deletions
@@ -20,6 +20,10 @@ RoseLab servers provide two types of storage with different characteristics. Und
 - **Scope**: Synchronized across all RoseLab servers (roselab1-5)
 - **Best for**: Large datasets, checkpoints, archived data

+::: warning Storage Capacity
+The lab's shared storage is at high utilization. New drives are on the way to expand capacity. In the meantime, please be mindful of storage usage and clean up unnecessary data regularly. Contact the admin if you need assistance with storage management.
+:::
+
 ### `/public` Directory (Shared Network HDD)

 - **Size**: 5 TB total (shared among all users)

@@ -216,6 +220,10 @@ If you encounter storage capacity issues:
 # Clean conda/pip caches
 conda clean --all
 pip cache purge
+
+# Or use the common utility to clean pip cache with temporary quota lift
+python /utilities/common-utilities.py
+# Select the "Clean pip cache" option
 ```

 2. **Move cold data to `/data`**:
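When deciding what counts as cold data worth moving to `/data`, it helps to see which directories dominate your home quota. A minimal sketch that lists the ten largest top-level directories under `$HOME` (the scan can take a while on large trees):

```python
from pathlib import Path

def dir_size(path: Path) -> int:
    """Total size in bytes of regular files under `path` (symlinks skipped)."""
    return sum(f.stat().st_size for f in path.rglob("*")
               if f.is_file() and not f.is_symlink())

home = Path.home()
sizes = sorted(((dir_size(d), d) for d in home.iterdir() if d.is_dir()), reverse=True)
for size, d in sizes[:10]:
    print(f"{size / 2**30:8.2f} GiB  {d}")
```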

docs/guide/index.md

Lines changed: 5 additions & 1 deletion
@@ -9,7 +9,11 @@ The RoseLab servers, owned and managed by the UCSD CSE [Rose Lab](https://roseyu
 - S3 dataset hosting with [MinIO](https://rosedata.ucsd.edu)
 - Online markdown collaboration using [Hedgedoc](https://roselab1.ucsd.edu/hedgedoc)
 - Self-hosted experiment tracking via [WandB](https://rosewandb.ucsd.edu) (contact admin for invitation email)
-- Lab-shared ChatGPT service frontend [RoseLibreChat](https://roselab1.ucsd.edu:3407/) (contact admin for backend API access)
+- Lab-shared AI chat interface [RoseLibreChat](https://roselab1.ucsd.edu:3407/) using lab credits (contact admin for access)
+
+::: warning Privacy Notice for RoseLibreChat
+Most AI models available through RoseLibreChat are **not self-hosted** - they connect to external APIs (OpenAI, Anthropic, etc.). Do not share confidential, proprietary, or sensitive information through this service if you have privacy concerns. The lab provides this service using shared credits for convenience, not as a secure private deployment.
+:::

 Additional web applications are planned to support future research needs.


docs/guide/limit.md

Lines changed: 80 additions & 0 deletions
@@ -93,7 +93,87 @@ This is because the request made by an outsider, after being forwarded by the ho

 As a result, it is not recommended to establish your own firewall inside the container. If you want to control access to your service, for example, only allowing your work PC to access your Jupyter Lab, please contact the admin to add a firewall rule to the host OS.

+## Memory Limits

+Each container has a memory quota that varies by server. When your processes exceed the allocated memory limit, the container will be automatically killed to protect the host system.
+
+### Memory Enforcement Policy
+
+The lab uses a **hard enforcement policy** for memory limits:
+- When your processes exceed the memory quota, the container will be immediately killed
+- There is no grace period or soft limit
+- Pay close attention to your processes' memory usage to avoid unexpected termination
+
+### Monitoring Memory Usage
+
+You can monitor your container's RAM usage on [Grafana](http://roselab1.ucsd.edu/grafana/):
+
+1. Navigate to the container metrics dashboard
+2. Check the "Memory Usage" panel to see current and historical usage
+3. Set up alerts if your usage approaches the quota
+
+### Memory Quota by Server
+
+Each RoseLab server has different RAM capacity and quota allocation:
+
+- **roselab1**: 512 GB total RAM, standard quota per container
+- **roselab2**: 512 GB total RAM, standard quota per container
+- **roselab3**: 512 GB total RAM, standard quota per container
+- **roselab4**: 1 TB total RAM, 2x standard quota per container
+- **roselab5**: 2 TB total RAM, 4x standard quota per container
+
+::: tip Moving to Higher-Memory Servers
+If your workload requires more memory than your current allocation:
+1. Move your container to roselab5 (4x quota) using `/utilities/common-utilities.py`
+2. Contact Rose for resource request approval if even roselab5's quota is insufficient
+3. After approval, contact the admin to increase your specific quota
+:::
+
+### Reducing Memory Usage
+
+If you're hitting memory limits:
+
+1. **Profile your code** to identify memory leaks:
+   ```python
+   # Use memory_profiler to get a line-by-line memory report
+   from memory_profiler import profile
+
+   @profile
+   def my_function():
+       # Your code here
+       pass
+   ```
+
+2. **Reduce batch size** in training:
+   ```python
+   # Smaller batch size uses less memory
+   train_loader = DataLoader(dataset, batch_size=16)  # instead of 32
+   ```
+
+3. **Use gradient accumulation** instead of large batches:
+   ```python
+   # Accumulate gradients over multiple steps
+   accumulation_steps = 4
+   for i, (inputs, labels) in enumerate(train_loader):
+       outputs = model(inputs)
+       loss = criterion(outputs, labels)
+       loss = loss / accumulation_steps
+       loss.backward()
+
+       if (i + 1) % accumulation_steps == 0:
+           optimizer.step()
+           optimizer.zero_grad()
+   ```
+
+4. **Clear unused variables**:
+   ```python
+   import gc
+   import torch
+
+   del large_variable
+   gc.collect()
+   torch.cuda.empty_cache()  # For GPU memory
+   ```
+
+5. **Use data streaming** instead of loading entire datasets into memory

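Because the limit is enforced as a hard kill, it is worth checking headroom from inside the container before launching a large job. A minimal sketch, assuming the `psutil` package is installed and that the container's `/proc/meminfo` reflects its own quota rather than the host's:

```python
import psutil

mem = psutil.virtual_memory()
print(f"total: {mem.total / 2**30:.1f} GiB")
print(f"used:  {mem.used / 2**30:.1f} GiB ({mem.percent:.0f}%)")

# The 80% threshold is arbitrary; exceeding the quota kills the container,
# so leave yourself a comfortable margin.
if mem.percent > 80:
    print("WARNING: approaching the memory quota")
```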

docs/guide/troubleshooting.md

Lines changed: 50 additions & 0 deletions
@@ -85,5 +85,55 @@ When running machine learning tasks, large datasets are often loaded from disk.

 PyTorch has a built-in data prefetching mechanism. While the GPU is training on a batch, `torch.utils.data.DataLoader` will load the data for the next batch. By default, a total of 2 * `num_workers` batches are prefetched across all workers. The `num_workers` is 0 by default. If you face slow disk reading speed, consider increasing the number of workers.

+## NVIDIA Driver Issues
+
+### Driver/Library Version Mismatch
+
+If you see the error message:
+
+```bash
+Failed to initialize NVML: Driver/library version mismatch
+```
+
+This indicates that your container's NVIDIA driver is out of sync with the host system.
+
+#### Cause
+
+The host and container NVIDIA driver versions must match exactly. This mismatch typically occurs after:
+- Server maintenance or reboots
+- Host driver updates
+- Container restoration from backup
+
+::: warning
+You **cannot** change host driver versions yourself - these are managed by the admin and documented in the [config table](../config/).
+:::
+
+#### Solution
+
+1. **Use the NVIDIA upgrade script** (recommended):
+   ```bash
+   sudo /utilities/nvidia-upgrade.sh
+   sudo reboot
+   ```
+
+2. **Wait for the reboot**: After running the script, your container will reboot to apply the new driver.
+
+3. **Verify the fix**: After reboot, check that CUDA is working:
+   ```bash
+   nvidia-smi
+   python -c "import torch; print(torch.cuda.is_available())"
+   ```
+
+::: danger Important
+**Never install nvidia-driver through your package manager** (apt, yum, etc.). This will break GPU passthrough and prevent your container from accessing GPUs. Always use the provided upgrade script.
+:::
+
+### Current Driver Version
+
+As of the latest server migration (October 2024), all NVIDIA drivers have been upgraded to version **580.95.05**. You can verify your driver version with:
+
+```bash
+nvidia-smi | grep "Driver Version"
+```

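To make the prefetching advice above concrete, a minimal sketch of a `DataLoader` configured with worker processes; the `TensorDataset` here is only a stand-in for a real disk-backed dataset, and `num_workers=4` is a starting point, not a recommendation:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for a real disk-backed Dataset.
dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))

if __name__ == "__main__":
    train_loader = DataLoader(
        dataset,
        batch_size=32,
        shuffle=True,
        num_workers=4,      # worker processes read batches in parallel with GPU compute
        prefetch_factor=2,  # each worker keeps 2 batches ready (PyTorch's default)
        pin_memory=True,    # page-locked memory speeds up host-to-GPU copies
    )
    for inputs, labels in train_loader:
        pass  # training step goes here
```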

docs/guide/workflow.md

Lines changed: 9 additions & 0 deletions
@@ -15,6 +15,15 @@ RoseLab provides five servers (roselab1-5) that you can use simultaneously. This
 - "I thought I fixed this bug already... did I do it on roselab2 or roselab3?"
 - "Which server has my newest experimental results?"

+::: tip Server Selection Guidelines
+The lab maintains a [Server Selection Spreadsheet](https://docs.google.com/spreadsheets/d/1aTKbCNTq0guwF144nnNwXgWHBZ8lwyzOLQC5vKHU6mk/edit?usp=sharing) where you should claim your main server. This helps:
+- Avoid interruptions when someone runs heavy jobs
+- Prevent bottlenecks by balancing load across servers
+- Match your workload to the right GPU type (4090 vs A100 vs L40S vs H200)
+
+Review the guidelines in the spreadsheet to choose the most appropriate server for your typical workload.
+:::
+
 **Example Strategy**:
 ```
 roselab1: Main development server (always has latest code)
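When choosing which server to claim in the spreadsheet, it can help to confirm which GPU type a given container actually sees. A minimal sketch, assuming PyTorch is installed in the container:

```python
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.0f} GiB")
else:
    print("No CUDA device visible in this container")
```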
