GPU Server

From MuHack
Service: GPU Server
Service URI gpu.muhack
Location Aula associazioni
Contact webmaster@muhack.org
Operational Status Working
Super Users Ceres-c, Mrmoddom, Cammo
Owner MuHack
Last Update 2025-05-14

MuHack GPU Server

A shared machine that gives all members access to serious GPU horsepower for AI/ML, rendering, data-science and other compute-intensive experiments.

What this box is for

  • Training / fine-tuning neural networks
  • CUDA / OpenCL development and compilation
  • 3-D rendering, video encoding or scientific workloads
  • General “I-need-48-GB-of-VRAM” tinkering

The host runs Proxmox VE, so we can either

  1. carve the GPU into vGPU slices for VMs, or
  2. pass the full card through to Debian 12 LXC containers.

Fast facts

  • Host name  : gpu.muhack (reachable only locally or through the MuHack Tailscale network)
  • Management  : Proxmox web UI at https://gpu.muhack:8006
  • Sysadmins  : Ceres-c, Mrmoddom, Cammo
  • Contact  : webmaster@muhack.org, or in person every Tuesday evening at the MuHack meeting

Hardware

  • CPU  : Intel Core i7-4930K — 6 cores / 12 threads @ 3.40 GHz (3.90 GHz Turbo)
  • RAM  : 64 GB DDR3-1866
  • GPU-A  : NVIDIA Quadro RTX 8000 — 48 GB GDDR6 (compute card)
  • GPU-B  : NVIDIA GeForce GT 710 — emergency host console only
  • Disk  : 1 TB PNY SATA SSD + additional SATA ports free for expansion
  • NICs  :
    • Intel 82574L Gigabit
    • Intel 82579V Gigabit

Software stack

  • Proxmox VE 8.x (kernel 6.8)
  • “Merged” NVIDIA driver 550.90.07 (both vGPU and passthrough available)

Pre-built images

  • Debian 12 LXC (default; CUDA-ready; shared GPU)
  • Ubuntu 22.04 / Windows VM (on request, with a dedicated vGPU slice, only for specific needs)

Getting your own environment

  1. Make sure your laptop/PC is connected to the MuHack Tailscale tailnet and can ping gpu.muhack.
  2. Go to the Proxmox login page and log in once to create your account, then log out immediately.
  3. Talk to any sysadmin (email or in person on Tuesday). Tell us
    1. what you plan to run (ML, rendering, etc.)
    2. an estimate of how much GPU RAM / runtime you’ll need
  4. We’ll create a container (or VM) for you and hand back:
    1. ID (e.g. ct104 or vm203)
    2. a random initial password
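Step 1 above can be verified with a quick pre-flight check from your own machine. A minimal sketch (the host name comes from this page; everything else is illustrative):

```shell
#!/bin/sh
# Pre-flight check before asking for a container: can we reach the host?
# Run this on your own laptop, with Tailscale already connected.
HOST=gpu.muhack
if ping -c 1 -W 2 "$HOST" >/dev/null 2>&1; then
  STATUS="reachable"
else
  STATUS="not reachable - check 'tailscale status' and your tailnet login"
fi
echo "$HOST is $STATUS"
```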

Typical turnaround time: usually same day or next day, though it may take longer if the sysadmins are busy.

After the first login you have to log in a second time so that Proxmox sets up your account's group correctly.

At this point you have two options for interacting with the console:

  1. Use the proxmox web interface's integrated console
  2. Join the machine to a tailnet (your personal one or MuHack's) and connect through SSH -> remember to enable root SSH access in the sshd config file!
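The sshd tweak in option 2 comes down to one directive. A minimal sketch, demonstrated on a throwaway copy of the file so it is safe to run anywhere; on the real container point CFG at /etc/ssh/sshd_config and restart the daemon afterwards:

```shell
#!/bin/sh
# Demo on a scratch file; on the container use CFG=/etc/ssh/sshd_config
# and then run: systemctl restart ssh
CFG=$(mktemp)
printf '#PermitRootLogin prohibit-password\n' > "$CFG"
# Uncomment/replace the directive so root may log in over SSH:
sed -i 's/^#\{0,1\}PermitRootLogin.*/PermitRootLogin yes/' "$CFG"
grep PermitRootLogin "$CFG"
```

SSH keys are safer than the random initial password; consider adding your key to root's authorized_keys once you are in.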

Using your Debian 12 LXC

Need CUDA toolkit? Run the following script:

#!/bin/bash
set -e  # stop on the first error

# Add NVIDIA's CUDA repository for Debian 12
wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
apt install -y ./cuda-keyring_1.1-1_all.deb
apt update
rm cuda-keyring_1.1-1_all.deb

# Install the toolkit version that matches the host driver
apt install -y cuda-toolkit-12-4

# Put nvcc and the CUDA libraries on your paths
echo "export PATH=\${PATH}:/usr/local/cuda/bin" >> ~/.bashrc
echo "export LD_LIBRARY_PATH=\${LD_LIBRARY_PATH}:/usr/local/cuda/lib64" >> ~/.bashrc

After running the script, apply the environment changes by sourcing your `.bashrc`:

source ~/.bashrc

You’ll get CUDA toolkit 12.4, which matches the host driver.
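A quick sanity check after the install, written defensively so it reports what is missing instead of erroring out:

```shell
#!/bin/sh
# Verify the toolkit and the GPU are actually usable from the container.
if command -v nvcc >/dev/null 2>&1; then
  NVCC_STATUS=$(nvcc --version | tail -n 1)
else
  NVCC_STATUS="nvcc not on PATH - did you source ~/.bashrc?"
fi
if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
  GPU_STATUS="GPU visible"
else
  GPU_STATUS="no GPU visible - tell a sysadmin"
fi
echo "$NVCC_STATUS"
echo "$GPU_STATUS"
```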

Fair-share policy

  • Jobs longer than 12 h → announce them at the Tuesday meeting or in the Telegram channel.
  • Sysadmins may pause or stop workloads that block others.
  • Crypto-mining is strictly forbidden.

Good-citizen checklist

  • Before logging out, run nvidia-smi and confirm your processes are gone and nothing of yours is still hogging the GPU.
  • Keep /tmp and your home directory tidy; disk space is shared
  • BACK UP YOUR OWN DATA: we have no data persistency or backup guarantee
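For the last point, a minimal sketch of a backup before logging out. The paths and the scp target are placeholders; SRC defaults to a demo directory so the snippet runs as-is:

```shell
#!/bin/sh
# Archive a work directory, then copy the tarball off the box.
SRC=${SRC:-/tmp/demo-project}
mkdir -p "$SRC" && echo "results" > "$SRC/run.log"   # stand-in data for the demo
tar -czf /tmp/backup.tar.gz -C "$(dirname "$SRC")" "$(basename "$SRC")"
echo "wrote /tmp/backup.tar.gz"
# From the container, over the tailnet (placeholder target):
# scp /tmp/backup.tar.gz you@your-laptop:~/backups/
```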

Extras

  • vGPU-enabled Windows or Ubuntu VMs — ask if needed
  • Docker-inside-LXC supported (nesting + cgroupv2 enabled)
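For Docker-inside-LXC, a hypothetical smoke test for GPU access. It assumes the NVIDIA container toolkit is configured in your container (ask a sysadmin if --gpus all errors out); the image tag is just an example:

```shell
#!/bin/sh
# Run nvidia-smi inside a CUDA base image to confirm Docker can see the GPU.
CMD="docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi"
if command -v docker >/dev/null 2>&1; then
  $CMD
else
  echo "docker not installed here; inside the container run: $CMD"
fi
```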

Troubleshooting quick-ref

Problem → Fix
  • nvidia-smi shows no devices → the container lost its GPU mapping; tell a sysadmin.
  • Driver mismatch error → the host driver was updated and the matching userspace driver must be reinstalled with --no-kernel-modules; tell a sysadmin.
  • Out-of-memory in PyTorch → in a VM, request a larger vGPU slice; in a container, someone else may be using GPU memory, so check with nvidia-smi (a GUI for this is planned).
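For the out-of-memory case, these nvidia-smi queries show current VRAM usage and which processes hold it. Run inside your container; the sketch degrades gracefully on a machine with no GPU:

```shell
#!/bin/sh
# Show total/used VRAM and the compute processes holding it.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=memory.used,memory.total --format=csv
  nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
  MSG="queried"
else
  MSG="nvidia-smi not available on this machine"
fi
echo "$MSG"
```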

Questions / ideas → email ceres-c@muhack.org, mrmoddom@muhack.org, or cammo@muhack.org, or just grab us at the Tuesday MuHack meeting.