📖 Fotheidil Speech-to-Text Service

Overview

The Fotheidil recognition service provides speech-to-text for long-form audio and video files with the following features:

  • Speaker diarization (speaker segmentation)
  • Capitalisation and punctuation restoration
  • GPU-accelerated ASR and diarization for efficiency and accuracy

The service takes a pre-processed .wav file (downsampled and converted by the frontend) and returns a transcript enriched with speaker information and restored text formatting.
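For illustration, a client request to the service could be built as below. This is a minimal sketch using only the standard library; the multipart field name `"file"` and the default URL (the Recognition frontend on port 6060, documented later in this page) are assumptions to check against the actual FastAPI handler.

```python
import io
import urllib.request
import uuid

def build_transcript_request(wav_bytes: bytes,
                             url: str = "http://localhost:6060/generate_transcripts"):
    """Build a multipart/form-data POST for the transcription endpoint.

    The form field name "file" is an assumption; check the parameter
    name of the FastAPI handler on the Recognition server.
    """
    boundary = uuid.uuid4().hex
    body = io.BytesIO()
    body.write(f"--{boundary}\r\n".encode())
    body.write(b'Content-Disposition: form-data; name="file"; filename="audio.wav"\r\n')
    body.write(b"Content-Type: audio/wav\r\n\r\n")
    body.write(wav_bytes)
    body.write(f"\r\n--{boundary}--\r\n".encode())
    return urllib.request.Request(
        url,
        data=body.getvalue(),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

With the frontend running, `urllib.request.urlopen(build_transcript_request(wav))` would send the file and return the enriched transcript.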


System Architecture

The system is distributed across two servers:

🖥️ Banba

  • Host: phoneticsrv3.lcs.tcd.ie
  • IP: 134.226.98.116
  • Role: GPU-accelerated diarization & ASR (NeMo + Pyannote), FastAPI backend, MarianMT CPU translation.

The Docker image is built with the requirements for NeMo and Pyannote using the following Dockerfile:

FROM nvcr.io/nvidia/nemo:24.09

# Install OS dependencies
RUN apt-get update && apt-get install -y \
        ffmpeg \
        sox \
    && rm -rf /var/lib/apt/lists/*

RUN pip install torch==2.4.0 torchaudio==2.4.0 torchvision --index-url https://download.pytorch.org/whl/cu121


# Install Python packages
RUN pip install pyannote.audio fastapi uvicorn python-multipart

# Copy your FastAPI app into the container
WORKDIR /media/storage/liam/nemo
COPY diarize_asr_capt_fastapi_fotheidil_output_error_handling.py .
COPY FastConformer-Hybrid-Transducer-CTC-BPE-lr5-NEST-train_unlabelled_mwer-0.1_min_dur_1.0_remove_duplicates_remove_tg4_27may25-averaged.nemo .

# Run FastAPI when container starts
CMD ["python", "diarize_asr_capt_fastapi_fotheidil_output_error_handling.py"]

The Docker image is run in conjunction with the MarianMT server for capitalisation and punctuation restoration (C&PR) using the following systemd service:

[Unit]
Description=Fotheidil Automatic Docker Container
After=network.target docker.service
Requires=docker.service

[Service]
Restart=always
# Optional: restart delay to avoid rapid crash loops
RestartSec=10s
# Start the MarianMT server first, so that the FastAPI server can subsequently connect to it on port 10001
ExecStartPre=/bin/bash -c 'cd /media/storage/liam/CaptPunct_MT && bash marian_server_batch.sh & sleep 5'


ExecStart=/usr/bin/docker run \
--rm \
--gpus all \
--ipc=host \
--add-host=host.docker.internal:172.17.0.1 \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
--name fotheidil-automatic \
fotheidil-automatic

ExecStop=/usr/bin/docker stop -t 10 fotheidil-automatic

# Optional: set limits
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

  • Frameworks: FastAPI, Docker, MarianMT
  • Port: 8000
  • Endpoint:
    @app.post("/generate_transcripts")


🖥️ Recognition

  • Host: proxmox
  • IP: 10.0.0.8
  • Role: API frontend, audio preprocessing, and management of the SSH tunnel to Banba.
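The preprocessing step converts uploaded audio/video into the .wav format the backend expects. A sketch of how that ffmpeg invocation could be assembled is below; the 16 kHz mono, 16-bit PCM target is an assumption (typical for NeMo ASR models), not a documented requirement of this service.

```python
def ffmpeg_preprocess_cmd(src: str, dst: str, sample_rate: int = 16000):
    """Build an ffmpeg command converting any audio/video input to a
    mono PCM .wav. The 16 kHz mono target is an assumption (typical
    for NeMo ASR models); adjust to whatever the backend expects.
    """
    return [
        "ffmpeg", "-y",           # overwrite output without prompting
        "-i", src,                # input audio or video file
        "-vn",                    # drop any video stream
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample
        "-c:a", "pcm_s16le",      # encode as 16-bit PCM wav
        dst,
    ]
```

The command list can be executed with `subprocess.run(ffmpeg_preprocess_cmd("in.mp4", "out.wav"), check=True)`.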

The API frontend is managed using the following systemd service:

[Unit]
Description=Run FastAPI Test Banba
After=network.target

[Service]
Type=simple
ExecStart=/bin/bash /home/llonerga/run_fastapi_test_banba.sh
WorkingDirectory=/home/llonerga
Restart=always
RestartSec=5
User=llonerga
Environment=PATH=/usr/local/bin:/usr/bin:/bin

[Install]
WantedBy=multi-user.target

  • Framework: FastAPI
  • Port: 6060
  • Endpoint:
    @app.post("/generate_transcripts")

Networking

  • Requires an SSH tunnel from Recognition → Banba on port 8000
  • Tunnel is managed by a systemd service called banba-tunnel on Recognition:

[Unit]
Description=SSH tunnel to banba
After=network.target

[Service]
ExecStart=/usr/bin/ssh -o ServerAliveInterval=60 -o ServerAliveCountMax=3 -N -L 8000:134.226.98.116:8000 lonergan@phoneticsrv3.lcs.tcd.ie
Restart=always
RestartSec=10
User=llonerga

[Install]
WantedBy=multi-user.target
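Because the tunnel forwards localhost:8000 on Recognition to Banba's backend, a simple TCP connect to that port tells you whether the tunnel (and the remote listener) is up. A minimal health-check sketch:

```python
import socket

def tunnel_up(host: str = "127.0.0.1", port: int = 8000,
              timeout: float = 2.0) -> bool:
    """Return True if something is listening on the forwarded port.

    On Recognition, the banba-tunnel service forwards localhost:8000
    to Banba's FastAPI backend, so a successful TCP connect here
    indicates the tunnel and the remote listener are both up.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

This could be wired into monitoring or run by hand after restarting the banba-tunnel service.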

Models Used

  • Diarization: Pyannote Diarization 3.1
  • Recognition: NVIDIA NeMo 110M FastConformer RNN-T (semi-supervised)
  • Capitalisation & Punctuation Restoration: MarianMT transformer model
    • Currently running on CPU (could be sped up with batching or a GPU build)
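To enrich the transcript with speaker information, the diarization turns and the ASR word timestamps have to be merged at some point in the pipeline. The sketch below shows one common way to do this, labelling each word with the speaker whose turn overlaps it the most; it is an illustrative sketch, not the service's actual merging code.

```python
def assign_speakers(words, turns):
    """Attach a speaker label to each timestamped word.

    words: [(start, end, token), ...] from ASR
    turns: [(start, end, speaker), ...] from diarization
    Each word gets the speaker whose turn overlaps it the most
    ("unknown" if nothing overlaps). A sketch only -- the real
    service's merging logic may differ.
    """
    labelled = []
    for w_start, w_end, token in words:
        best, best_overlap = "unknown", 0.0
        for t_start, t_end, speaker in turns:
            overlap = min(w_end, t_end) - max(w_start, t_start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        labelled.append((token, best))
    return labelled
```

For example, a word spanning 2.0–2.5 s would be attributed to a diarization turn covering 1.8–3.0 s rather than one covering 0.0–1.0 s.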