# 📖 Fotheidil Speech-to-Text Service

## Overview
The Fotheidil recognition service provides speech-to-text for long-form audio and video files with the following features:
- Speaker diarization (speaker segmentation)
- Capitalisation and punctuation restoration
- GPU-accelerated ASR and diarization for efficiency and accuracy
The service takes a pre-processed .wav file (downsampled and converted by the frontend) and returns a transcript enriched with speaker information and restored text formatting.
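The frontend's preprocessing step can be sketched as follows. This is a minimal illustration, not the frontend's actual code: the 16 kHz mono 16-bit PCM target is an assumption (a common input format for NeMo ASR models), and the function names are hypothetical.

```python
# Hedged sketch of frontend audio preprocessing: downmix to mono and
# resample with ffmpeg before handing the .wav to the recognition service.
# Target format (16 kHz, mono, s16) is an assumption, not confirmed by the docs.
import subprocess

def build_ffmpeg_cmd(src: str, dst: str, sample_rate: int = 16000) -> list[str]:
    """Return an ffmpeg command that converts src to a mono 16-bit PCM wav."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-ac", "1",                # downmix to mono
        "-ar", str(sample_rate),   # resample to the target rate
        "-sample_fmt", "s16",      # 16-bit PCM
        dst,
    ]

def preprocess(src: str, dst: str) -> None:
    """Run the conversion, raising if ffmpeg exits non-zero."""
    subprocess.run(build_ffmpeg_cmd(src, dst), check=True)
```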
## System Architecture
The system is distributed across two servers:
### 🖥️ Banba
- Host: `phoneticsrv3.lcs.tcd.ie`
- IP: `134.226.98.116`
- Role: GPU-accelerated diarization & ASR (NeMo + Pyannote), FastAPI backend, MarianMT CPU translation
The Docker image is built with the NeMo and Pyannote dependencies using the following Dockerfile:

```dockerfile
FROM nvcr.io/nvidia/nemo:24.09

# Install OS dependencies
RUN apt-get update && apt-get install -y \
    ffmpeg \
    sox \
    && rm -rf /var/lib/apt/lists/*

RUN pip install torch==2.4.0 torchaudio==2.4.0 torchvision --index-url https://download.pytorch.org/whl/cu121

# Install Python packages
RUN pip install pyannote.audio fastapi uvicorn python-multipart

# Copy the FastAPI app and ASR model into the container
WORKDIR /media/storage/liam/nemo
COPY diarize_asr_capt_fastapi_fotheidil_output_error_handling.py .
COPY FastConformer-Hybrid-Transducer-CTC-BPE-lr5-NEST-train_unlabelled_mwer-0.1_min_dur_1.0_remove_duplicates_remove_tg4_27may25-averaged.nemo .

# Run the FastAPI app when the container starts
CMD ["python", "diarize_asr_capt_fastapi_fotheidil_output_error_handling.py"]
```
The Docker container runs alongside the MarianMT server, which handles capitalisation and punctuation restoration (C&PR); both are managed by the following systemd service:
```ini
[Unit]
Description=Fotheidil Automatic Docker Container
After=network.target docker.service
Requires=docker.service

[Service]
Restart=always
# Optional: restart delay to avoid rapid crash loops
RestartSec=10s

# Start the MarianMT server first, so that the subsequent FastAPI server
# can establish a connection on port 10001
ExecStartPre=/bin/bash -c 'cd /media/storage/liam/CaptPunct_MT && bash marian_server_batch.sh & sleep 5'

ExecStart=/usr/bin/docker run \
  --rm \
  --gpus all \
  --ipc=host \
  --add-host=host.docker.internal:172.17.0.1 \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  --name fotheidil-automatic \
  fotheidil-automatic

ExecStop=/usr/bin/docker stop -t 10 fotheidil-automatic

# Optional: set limits
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
```
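A deployment workflow for the unit above might look like this. The unit file name `fotheidil-automatic.service` and the build context are assumptions; adjust to the actual paths on Banba.

```shell
# Build the image the unit refers to (run from the directory holding the Dockerfile)
docker build -t fotheidil-automatic .

# Install and start the unit (file name is an assumption)
sudo cp fotheidil-automatic.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now fotheidil-automatic.service
```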
- Frameworks: FastAPI, Docker, MarianMT
- Port: `8000`
- Endpoint: `@app.post("/generate_transcripts")`
### 🖥️ Recognition
- Host: `proxmox`
- IP: `10.0.0.8`
- Role: API frontend, audio preprocessing, manages tunneling to Banba
The API frontend is managed using the following systemd service:
```ini
[Unit]
Description=Run FastAPI Test Banba
After=network.target

[Service]
Type=simple
ExecStart=/bin/bash /home/llonerga/run_fastapi_test_banba.sh
WorkingDirectory=/home/llonerga
Restart=always
RestartSec=5
User=llonerga
Environment=PATH=/usr/local/bin:/usr/bin:/bin

[Install]
WantedBy=multi-user.target
```
- Framework: FastAPI
- Port: `6060`
- Endpoint: `@app.post("/generate_transcripts")`
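A client might call the frontend as sketched below. The endpoint path and port come from this README; the multipart field name `file`, the JSON response, and the helper name are assumptions about the FastAPI endpoint's signature.

```python
# Hedged client sketch for the Recognition frontend.
# Field name "file" and JSON response shape are assumptions.
import requests

def request_transcript(wav_path: str,
                       url: str = "http://localhost:6060/generate_transcripts") -> dict:
    """Upload a preprocessed .wav and return the parsed JSON transcript."""
    with open(wav_path, "rb") as f:
        # Long audio can take a while to transcribe, so use a generous timeout
        resp = requests.post(url, files={"file": f}, timeout=3600)
    resp.raise_for_status()
    return resp.json()
```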
## Networking
- Requires an SSH tunnel from Recognition → Banba on port 8000
- Tunnel is managed by a systemd service called banba-tunnel on Recognition:
```ini
[Unit]
Description=SSH tunnel to banba
After=network.target

[Service]
ExecStart=/usr/bin/ssh -o ServerAliveInterval=60 -o ServerAliveCountMax=3 -N -L 8000:134.226.98.116:8000 lonergan@phoneticsrv3.lcs.tcd.ie
Restart=always
RestartSec=10
User=llonerga

[Install]
WantedBy=multi-user.target
```
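With the tunnel up, a quick way to verify that Banba is reachable through the forwarded port (an assumed check; `/docs` is FastAPI's default interactive-docs route, which may be disabled on the backend):

```shell
# From Recognition: confirm the tunnel forwards to Banba's FastAPI server
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/docs
```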
## Models Used
- Diarization: Pyannote Diarization 3.1
- Recognition: Nvidia NeMo 110M FastConformer RNN-T (semi-supervised)
- Capitalisation & Punctuation Restoration: MarianMT transformer model
  - Currently running on CPU (can be sped up with batching or a GPU build)