📖 Fotheidil Speech-to-Text Service

Overview

The Fotheidil recognition service provides speech-to-text for long-form audio and video files with the following features:

  • Speaker diarization (speaker segmentation)
  • Capitalisation and punctuation restoration
  • GPU-accelerated ASR and diarization for efficiency and accuracy

The service takes a pre-processed .wav file (downsampled and converted by the frontend) and returns a transcript enriched with speaker information and restored text formatting.
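For illustration, a client request to the service could be built as below. This is a minimal sketch using only the standard library; the multipart field name `"file"` and the default URL (the Recognition frontend on port 6060, documented later in this page) are assumptions to check against the actual FastAPI handler.

```python
import io
import urllib.request
import uuid

def build_transcript_request(wav_bytes: bytes,
                             url: str = "http://localhost:6060/generate_transcripts"):
    """Build a multipart/form-data POST for the transcription endpoint.

    The form field name "file" is an assumption; check the parameter
    name of the FastAPI handler on the Recognition server.
    """
    boundary = uuid.uuid4().hex
    body = io.BytesIO()
    body.write(f"--{boundary}\r\n".encode())
    body.write(b'Content-Disposition: form-data; name="file"; filename="audio.wav"\r\n')
    body.write(b"Content-Type: audio/wav\r\n\r\n")
    body.write(wav_bytes)
    body.write(f"\r\n--{boundary}--\r\n".encode())
    return urllib.request.Request(
        url,
        data=body.getvalue(),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

With the frontend running, `urllib.request.urlopen(build_transcript_request(wav))` would send the file and return the enriched transcript.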


System Architecture

The system is distributed across two servers:

🖥️ Banba

  • Host: phoneticsrv3.lcs.tcd.ie
  • IP: 134.226.98.116
  • Role: GPU-accelerated diarization & ASR (NeMo + Pyannote), FastAPI backend, MarianMT CPU translation.

The Docker image is built with the requirements for NeMo and Pyannote using the following Dockerfile:

FROM nvcr.io/nvidia/nemo:24.09

# Install OS dependencies
RUN apt-get update && apt-get install -y \
        ffmpeg \
        sox \
    && rm -rf /var/lib/apt/lists/*

RUN pip install torch==2.4.0 torchaudio==2.4.0 torchvision --index-url https://download.pytorch.org/whl/cu121


# Install Python packages
RUN pip install pyannote.audio fastapi uvicorn python-multipart

# Copy your FastAPI app into the container
WORKDIR /media/storage/liam/nemo
COPY diarize_asr_capt_fastapi_fotheidil_output_error_handling.py .
COPY FastConformer-Hybrid-Transducer-CTC-BPE-lr5-NEST-train_unlabelled_mwer-0.1_min_dur_1.0_remove_duplicates_remove_tg4_27may25-averaged.nemo .

# Run FastAPI when container starts
CMD ["python", "diarize_asr_capt_fastapi_fotheidil_output_error_handling.py"]

The Docker image is run in conjunction with the MarianMT server for capitalisation and punctuation restoration (C&PR) using the following systemd service:

[Unit]
Description=Fotheidil Automatic Docker Container
After=network.target docker.service
Requires=docker.service

[Service]
Restart=always
# Optional: restart delay to avoid rapid crash loops
RestartSec=10s
# Start the MarianMT server first, so that the FastAPI server can subsequently connect to it on port 10001
ExecStartPre=/bin/bash -c 'cd /media/storage/liam/CaptPunct_MT && bash marian_server_batch.sh & sleep 5'


ExecStart=/usr/bin/docker run \
--rm \
--gpus all \
--ipc=host \
--add-host=host.docker.internal:172.17.0.1 \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
--name fotheidil-automatic \
fotheidil-automatic

ExecStop=/usr/bin/docker stop -t 10 fotheidil-automatic

# Optional: set limits
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

  • Frameworks: FastAPI, Docker, MarianMT
  • Port: 8000
  • Endpoint:
    @app.post("/generate_transcripts")


🖥️ Recognition

  • Host: proxmox
  • IP: 10.0.0.8
  • Role: API frontend, audio preprocessing, and management of the SSH tunnel to Banba.
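The preprocessing step converts uploaded audio/video into the .wav format the backend expects. A sketch of how that ffmpeg invocation could be assembled is below; the 16 kHz mono, 16-bit PCM target is an assumption (typical for NeMo ASR models), not a documented requirement of this service.

```python
def ffmpeg_preprocess_cmd(src: str, dst: str, sample_rate: int = 16000):
    """Build an ffmpeg command converting any audio/video input to a
    mono PCM .wav. The 16 kHz mono target is an assumption (typical
    for NeMo ASR models); adjust to whatever the backend expects.
    """
    return [
        "ffmpeg", "-y",           # overwrite output without prompting
        "-i", src,                # input audio or video file
        "-vn",                    # drop any video stream
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample
        "-c:a", "pcm_s16le",      # encode as 16-bit PCM wav
        dst,
    ]
```

The command list can be executed with `subprocess.run(ffmpeg_preprocess_cmd("in.mp4", "out.wav"), check=True)`.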

The API frontend is managed using the following systemd service:

[Unit]
Description=Run FastAPI Test Banba
After=network.target

[Service]
Type=simple
ExecStart=/bin/bash /home/llonerga/run_fastapi_test_banba.sh
WorkingDirectory=/home/llonerga
Restart=always
RestartSec=5
User=llonerga
Environment=PATH=/usr/local/bin:/usr/bin:/bin

[Install]
WantedBy=multi-user.target

  • Framework: FastAPI
  • Port: 6060
  • Endpoint:
    @app.post("/generate_transcripts")

Networking

  • Requires an SSH tunnel from Recognition → Banba on port 8000
  • Tunnel is managed by a systemd service called banba-tunnel on Recognition:

[Unit]
Description=SSH tunnel to banba
After=network.target

[Service]
ExecStart=/usr/bin/ssh -o ServerAliveInterval=60 -o ServerAliveCountMax=3 -N -L 8000:134.226.98.116:8000 lonergan@phoneticsrv3.lcs.tcd.ie
Restart=always
RestartSec=10
User=llonerga

[Install]
WantedBy=multi-user.target
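Because the tunnel forwards localhost:8000 on Recognition to Banba's backend, a simple TCP connect to that port tells you whether the tunnel (and the remote listener) is up. A minimal health-check sketch:

```python
import socket

def tunnel_up(host: str = "127.0.0.1", port: int = 8000,
              timeout: float = 2.0) -> bool:
    """Return True if something is listening on the forwarded port.

    On Recognition, the banba-tunnel service forwards localhost:8000
    to Banba's FastAPI backend, so a successful TCP connect here
    indicates the tunnel and the remote listener are both up.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

This could be wired into monitoring or run by hand after restarting the banba-tunnel service.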

Models Used

  • Diarization: Pyannote Diarization 3.1
  • Recognition: NVIDIA NeMo 110M FastConformer RNN-T (semi-supervised)
  • Capitalisation & Punctuation Restoration: MarianMT transformer model
    • Currently running on CPU (could be sped up with batching or a GPU build)
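To enrich the transcript with speaker information, the diarization turns and the ASR word timestamps have to be merged at some point in the pipeline. The sketch below shows one common way to do this, labelling each word with the speaker whose turn overlaps it the most; it is an illustrative sketch, not the service's actual merging code.

```python
def assign_speakers(words, turns):
    """Attach a speaker label to each timestamped word.

    words: [(start, end, token), ...] from ASR
    turns: [(start, end, speaker), ...] from diarization
    Each word gets the speaker whose turn overlaps it the most
    ("unknown" if nothing overlaps). A sketch only -- the real
    service's merging logic may differ.
    """
    labelled = []
    for w_start, w_end, token in words:
        best, best_overlap = "unknown", 0.0
        for t_start, t_end, speaker in turns:
            overlap = min(w_end, t_end) - max(w_start, t_start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        labelled.append((token, best))
    return labelled
```

For example, a word spanning 2.0–2.5 s would be attributed to a diarization turn covering 1.8–3.0 s rather than one covering 0.0–1.0 s.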