Available starting with FlowX.AI 5.7.0
The Speech to Text service powers the Speech to Text workflow node in Integration Designer.
Dependencies
The Speech to Text service connects to standard FlowX infrastructure services that should already be configured in your environment:
- Kafka: async job processing for transcription requests from Integration Designer
- S3-compatible storage (MinIO or AWS S3): audio file retrieval and result storage
- OpenAI API key: required for cloud transcription (`/transcribe/openai`) and text-to-speech (`/tts/openai`)
- ffmpeg: required for audio decoding (included in Docker image)
The local Whisper transcription engine runs on-device and does not require an OpenAI API key. Only cloud transcription and TTS features need the API key.
Capabilities
What it can do
- Local transcription: On-device transcription using any Whisper model (`base`, `small`, `medium`, `large`, `turbo`) with automatic language detection
- Cloud transcription: OpenAI Whisper API with automatic chunking of long audio into 10-minute segments
- Text-to-speech: OpenAI TTS API with multiple models (`tts-1`, `tts-1-hd`, `gpt-4o-mini-tts`) and voice options
- Segment-level timing: Returns word-level segments with start/end timestamps
- Language detection: Automatic source language detection with confidence probabilities
- Long audio support: Kafka-based async processing for files over 15 minutes
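As a rough illustration of the 10-minute chunking mentioned for cloud transcription, the sketch below computes segment boundaries for a given audio duration. The function name `chunk_ranges` and the handling of the final partial chunk are assumptions; the service's actual splitting logic is not described here.

```python
def chunk_ranges(duration_s: float, chunk_s: float = 600.0) -> list[tuple[float, float]]:
    """Return (start, end) offsets in seconds that cover the whole duration.

    600 seconds matches the 10-minute segment length mentioned above.
    Keeping a shorter final chunk is an assumption made for this sketch.
    """
    if duration_s < 0 or chunk_s <= 0:
        raise ValueError("duration_s must be >= 0 and chunk_s must be > 0")
    ranges = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        ranges.append((start, end))
        start = end
    return ranges
```

A 25-minute file (1500 seconds) would yield two full chunks and one 5-minute remainder under these assumptions.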
Supported audio formats
| Format | Extensions |
|---|---|
| MP3 | .mp3 |
| WAV | .wav |
| FLAC | .flac |
| AAC | .aac, .m4a |
| OGG | .ogg |
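A client could check a file name against the table above before submitting it. This is a minimal sketch: the helper name `is_supported_audio` is hypothetical, and matching extensions case-insensitively is an assumption, since the service may inspect file content rather than the name.

```python
from pathlib import Path

# Extensions copied from the supported audio formats table.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".flac", ".aac", ".m4a", ".ogg"}


def is_supported_audio(filename: str) -> bool:
    """Return True if the file extension appears in the supported list."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```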
API endpoints
All endpoints are prefixed with the `URL_PREFIX` value (default: `/speech-to-text`).
| Endpoint | Method | Purpose |
|---|---|---|
| `/api/v1/speech-to-text/transcribe` | POST | Transcribe audio using local Whisper model |
| `/api/v1/speech-to-text/transcribe/openai` | POST | Transcribe audio using OpenAI Whisper API |
| `/api/v1/speech-to-text/tts/openai` | POST | Text-to-speech using OpenAI TTS API |
| `/api/v1/speech-to-text/info/health` | GET | Health check |
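The health endpoint is the simplest one to call from a script. The sketch below uses only the Python standard library. The helper names are hypothetical, and whether the `URL_PREFIX` must be part of the base URL depends on how your reverse proxy routes requests, so treat the base URL as an assumption to adapt.

```python
import urllib.request


def endpoint_url(base_url: str, path: str) -> str:
    """Join a base URL and an endpoint path without doubling slashes."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def is_healthy(base_url: str, timeout_s: float = 5.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    url = endpoint_url(base_url, "/api/v1/speech-to-text/info/health")
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts all derive from OSError.
        return False
```

For example, `is_healthy("http://speech-to-text:8000")` targets the in-cluster address used later in the verification steps.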
Configuration
Server configuration
| Environment Variable | Description | Default Value |
|---|---|---|
| `URL_PREFIX` | FastAPI root path for reverse proxy routing | `/speech-to-text` |
| `PORT` | API port | `8000` |
| `GUNICORN_WORKERS` | Number of Gunicorn worker processes | `2` |
| `GUNICORN_TIMEOUT` | Worker timeout in seconds | `120` |
Whisper model configuration
| Environment Variable | Description | Default Value |
|---|---|---|
| `WHISPER_MODEL` | Local Whisper model name | `turbo` |
| `OPENAI_API_KEY` | OpenAI API key (required for cloud transcription and TTS) | - |
| `OPENAI_WHISPER_MODEL` | Model for OpenAI cloud transcription | `whisper-1` |

| Model | Size | Speed | Accuracy | Use case |
|---|---|---|---|---|
| `base` | 74 MB | Fastest | Low | Quick prototyping, non-critical transcription |
| `small` | 244 MB | Fast | Medium | General use with limited resources |
| `medium` | 769 MB | Medium | Good | Balanced speed and accuracy |
| `large` | 1.55 GB | Slow | High | High-accuracy requirements |
| `turbo` | 809 MB | Fast | High | Recommended default: best speed/accuracy trade-off |
Text-to-speech configuration
TTS is configured per request via the API. Available options:
| Parameter | Options | Default |
|---|---|---|
| Model | tts-1, tts-1-hd, gpt-4o-mini-tts | tts-1 |
| Voice | alloy, ash, coral, echo, fable, onyx, nova, sage, shimmer | alloy |
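Because TTS options are chosen per request, a caller may want to validate them before sending. The sketch below encodes the options from the table, plus the additional `gpt-4o-mini-tts` voices listed in this section. The function name `validate_tts_options` is hypothetical, and the service may apply its own validation.

```python
# Options copied from the table and the note in this section.
TTS_MODELS = {"tts-1", "tts-1-hd", "gpt-4o-mini-tts"}
BASE_VOICES = {"alloy", "ash", "coral", "echo", "fable", "onyx", "nova", "sage", "shimmer"}
EXTRA_VOICES = {"ballad", "verse", "marin", "cedar"}  # gpt-4o-mini-tts only


def validate_tts_options(model: str = "tts-1", voice: str = "alloy") -> tuple[str, str]:
    """Return (model, voice) if the combination is documented, else raise ValueError."""
    if model not in TTS_MODELS:
        raise ValueError(f"unsupported TTS model: {model}")
    allowed = (BASE_VOICES | EXTRA_VOICES) if model == "gpt-4o-mini-tts" else BASE_VOICES
    if voice not in allowed:
        raise ValueError(f"voice {voice!r} is not available for model {model!r}")
    return model, voice
```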
The `gpt-4o-mini-tts` model supports additional voices: ballad, verse, marin, cedar.
Kafka configuration (optional)
The Speech to Text service supports Kafka for async job processing of long-running audio files (15+ minutes) that would time out over REST. Kafka topics are not pre-created in standard deployments and must be provisioned manually if needed.
Core Kafka settings
| Environment Variable | Description | Default Value |
|---|---|---|
| `KAFKA_BOOTSTRAP_SERVERS` | Kafka broker address | `kafka:9092` |
| `KAFKA_CONSUMER_ENABLED` | Enable Kafka consumer (`0` to turn off) | `1` |
| `KAFKA_CONSUMER_GROUPID` | Consumer group ID | `ai-services` |
| `KAFKA_SECURITY_ENABLED` | Enable SASL/OAUTHBEARER auth (`0` to turn off) | `0` |
| `KAFKA_MAX_POLL_INTERVAL_MS` | Max poll interval (ms) | `600000` |
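To show how these variables and their defaults fit together, here is a minimal sketch that reads them into a typed object. `KafkaSettings` and `load_kafka_settings` are hypothetical names, and treating every value other than `0` as "enabled" is an assumption; the service's own parsing may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class KafkaSettings:
    bootstrap_servers: str
    consumer_enabled: bool
    consumer_group_id: str
    security_enabled: bool
    max_poll_interval_ms: int


def _flag(value: str) -> bool:
    # The table documents "0 to turn off"; any other value counts as on (assumption).
    return value.strip() != "0"


def load_kafka_settings(env: Mapping[str, str] = os.environ) -> KafkaSettings:
    """Read the core Kafka variables, falling back to the documented defaults."""
    return KafkaSettings(
        bootstrap_servers=env.get("KAFKA_BOOTSTRAP_SERVERS", "kafka:9092"),
        consumer_enabled=_flag(env.get("KAFKA_CONSUMER_ENABLED", "1")),
        consumer_group_id=env.get("KAFKA_CONSUMER_GROUPID", "ai-services"),
        security_enabled=_flag(env.get("KAFKA_SECURITY_ENABLED", "0")),
        max_poll_interval_ms=int(env.get("KAFKA_MAX_POLL_INTERVAL_MS", "600000")),
    )
```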
Topic configuration
| Environment Variable | Description | Default Value |
|---|---|---|
| `KAFKA_JOB_REQUEST_TOPIC` | Incoming transcription requests (from Integration Designer) | `ai.flowx.stt.job.request` |
| `KAFKA_JOB_RESPONSE_TOPIC` | Outgoing transcription responses (to Integration Designer) | `ai.flowx.stt.job.response` |
The Speech to Text service integrates with flx-job-lib for async job processing. On startup, the service verifies Kafka connectivity and starts a background consumer.
Storage configuration (MinIO / S3)
The Speech to Text service reads audio files from and writes results to S3-compatible storage.
| Environment Variable | Description | Default Value |
|---|---|---|
| `MINIO_DOCUMENTS_URL` | MinIO/S3 endpoint URL | - |
| `MINIO_DOCUMENTS_ACCESS_KEY` | MinIO access key | - |
| `MINIO_DOCUMENTS_SECRET_KEY` | MinIO secret key | - |
| `MINIO_DOCUMENTS_BUCKET` | MinIO bucket name | - |
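None of the storage variables has a default, so a pre-deployment check for unset values can catch configuration gaps early. The sketch below is a hypothetical helper; treating all four variables as mandatory is an assumption based on the empty default column.

```python
import os
from typing import Mapping

# Variable names copied from the storage configuration table.
REQUIRED_STORAGE_VARS = (
    "MINIO_DOCUMENTS_URL",
    "MINIO_DOCUMENTS_ACCESS_KEY",
    "MINIO_DOCUMENTS_SECRET_KEY",
    "MINIO_DOCUMENTS_BUCKET",
)


def missing_storage_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the storage variables that are unset or blank."""
    return [name for name in REQUIRED_STORAGE_VARS if not env.get(name, "").strip()]
```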
Observability (optional)
| Environment Variable | Description | Default Value |
|---|---|---|
| `USE_OBSERVATORY` | Enable FlowX Observatory middleware (`1` to enable) | `0` |
FLEURS evaluation (optional)
The service can optionally download Google FLEURS datasets at startup for language evaluation and benchmarking.
| Environment Variable | Description | Default Value |
|---|---|---|
| `FLEURS_LANGS` | Comma-separated FLEURS language codes | `en_us,hu_hu` |
| `FLEURS_SPLIT` | FLEURS dataset split | `test` |
| `FLEURS_CACHE_DIR` | Dataset cache directory | `/flowx/code/data/fleurs` |
FLEURS datasets are used for evaluating transcription accuracy across languages. This is not required for production use — skip this configuration unless you are benchmarking transcription quality.
Kafka job processing
The Speech to Text service processes transcription requests asynchronously via Kafka. Integration Designer sends requests to the job request topic, and the service writes results to S3 and responds through the job response topic.
Request payload
| Field | Type | Default | Description |
|---|---|---|---|
| `fileStoragePath` | string | (required) | Path to audio file in MinIO/S3 |
| `provider` | string | `local` | `local` for on-device Whisper, `openai` for OpenAI Whisper API |
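A request body with the two documented fields can be assembled as follows. This is a sketch only: the helper name `build_job_request` and the example path are hypothetical, and the real message may carry additional envelope fields from flx-job-lib that are not described here.

```python
import json

ALLOWED_PROVIDERS = {"local", "openai"}


def build_job_request(file_storage_path: str, provider: str = "local") -> bytes:
    """Serialize the documented request fields as UTF-8 JSON."""
    if not file_storage_path:
        raise ValueError("fileStoragePath is required")
    if provider not in ALLOWED_PROVIDERS:
        raise ValueError(f"provider must be one of {sorted(ALLOWED_PROVIDERS)}")
    payload = {"fileStoragePath": file_storage_path, "provider": provider}
    return json.dumps(payload).encode("utf-8")


# Hypothetical path; use the location of your audio file in MinIO/S3.
message = build_job_request("audio/call-001.mp3", provider="openai")
```

The resulting bytes would be published to the topic configured in `KAFKA_JOB_REQUEST_TOPIC` with whichever Kafka client your environment uses.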
Response flow
- Integration Designer publishes a transcription request to `ai.flowx.stt.job.request`
- Speech to Text downloads the audio file from S3, transcribes it, and stores the result in S3
- Speech to Text publishes a response with the result path to `ai.flowx.stt.job.response`
- Integration Designer retrieves the result (text, language, segments with timestamps) from S3
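The flow above can be pictured with an in-memory stand-in, where a dictionary plays the role of S3 and a stub replaces Whisper. Everything here is illustrative: the `resultPath` field name, the result file naming, and the exact shape of a segment are assumptions, not the service's actual message schema.

```python
import json
from typing import Callable


def handle_job(request_bytes: bytes, storage: dict, transcribe: Callable[[bytes, str], dict]) -> bytes:
    """Process one request message and return the response message."""
    request = json.loads(request_bytes)
    audio_path = request["fileStoragePath"]
    audio = storage[audio_path]  # download the audio file
    result = transcribe(audio, request.get("provider", "local"))
    result_path = audio_path + ".transcript.json"  # hypothetical naming scheme
    storage[result_path] = json.dumps(result).encode("utf-8")  # store the result
    # Respond with the result path; the field name is an assumption.
    return json.dumps({"resultPath": result_path}).encode("utf-8")


def fake_transcribe(audio: bytes, provider: str) -> dict:
    """Stub that returns the kinds of fields the flow mentions."""
    return {
        "text": "hello",
        "language": "en",
        "segments": [{"start": 0.0, "end": 1.0, "text": "hello"}],
    }


storage = {"audio/call-001.mp3": b"fake-audio-bytes"}
response = handle_job(b'{"fileStoragePath": "audio/call-001.mp3"}', storage, fake_transcribe)
```

After the call, the caller reads the path from `response` and fetches the stored result from `storage`, mirroring the last step of the flow.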
Deployment and sizing
Docker
- Base image: Python 3.13
- Port: `8000`
- Health check: `/speech-to-text/api/v1/speech-to-text/info/health`
- Requires ffmpeg (included in Docker image)
Kubernetes configuration
- Turbo model (recommended)
- Large model (high accuracy)
| Setting | Value |
|---|---|
| Replicas | 2 |
| CPU requests | 2 cores |
| CPU limits | 4 cores |
| RAM requests | 2 Gi |
| RAM limits | 4 Gi |
| `WHISPER_MODEL` | `turbo` |
| `GUNICORN_WORKERS` | `2` |
Verify your setup
- The Speech to Text pod is running: `kubectl get pods -l app=speech-to-text`
- The health endpoint returns HTTP 200: `curl http://speech-to-text:8000/api/v1/speech-to-text/info/health`
- Whisper model loaded successfully: check pod logs for model initialization messages at startup
- Kafka consumer is connected: check pod logs for the `Kafka consumer started` message at startup
- Integration Designer can reach the service: verify `FLOWX_SPEECH_TO_TEXT_BASE_URL` is set in Integration Designer setup
Troubleshooting
Pod fails to start or crashes
Symptoms: Pod stays in CrashLoopBackOff or never becomes ready.
Solutions:
- Check that memory limits are sufficient for the configured Whisper model
- Verify ffmpeg is available (included in Docker image, may need manual install for local dev)
- Review pod logs for model download or loading errors
- Ensure the S3/MinIO endpoint is reachable if Kafka consumer is enabled
Transcription jobs not being processed
Symptoms: Integration Designer sends transcription requests but no results are returned.
Solutions:
- Verify `KAFKA_CONSUMER_ENABLED` is set to `1`
- Check that `KAFKA_JOB_REQUEST_TOPIC` matches the topic Integration Designer publishes to
- Ensure the Kafka consumer group (`KAFKA_CONSUMER_GROUPID`) has no conflicting consumers
- Review pod logs for Kafka connection or authentication errors
Transcription quality is poor
Symptoms: Transcription text is inaccurate or contains many errors.
Solutions:
- Upgrade to a larger Whisper model (`turbo` or `large` instead of `base`)
- Check audio quality: low bitrate or noisy recordings reduce accuracy
- For non-English audio, use the `large` model, which has better multilingual support
- Try the OpenAI Whisper API (`provider: openai`) for comparison
Consumer rebalancing during long transcriptions
Symptoms: Kafka consumer drops out mid-transcription, causing job failures.
Solutions:
- Increase `KAFKA_MAX_POLL_INTERVAL_MS` beyond the expected transcription time
- For very long audio files (30+ minutes), set it to `1800000` (30 minutes)
- Consider splitting long audio files before submitting
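Choosing the poll interval comes down to converting the expected processing time to milliseconds and adding headroom. The sketch below shows that arithmetic. The helper name and the 1.5x safety factor are assumptions, not values recommended by the service.

```python
import math


def poll_interval_ms(expected_transcription_s: float, safety_factor: float = 1.5) -> int:
    """Suggest a KAFKA_MAX_POLL_INTERVAL_MS value above the expected processing time."""
    if expected_transcription_s <= 0 or safety_factor < 1:
        raise ValueError("expected_transcription_s must be > 0 and safety_factor >= 1")
    return math.ceil(expected_transcription_s * safety_factor * 1000)
```

With these assumptions, a job expected to take 20 minutes (1200 seconds) gives 1800000 ms, the 30-minute value mentioned above.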
OOM kills with large Whisper models
Symptoms: Pods restart with OOMKilled status.
Solutions:
- Reduce `GUNICORN_WORKERS` to `1`, because each worker loads its own model copy
- Increase memory limits (4 Gi minimum for `turbo`, 8 Gi for `large`)
- Use a smaller model (`small` or `base`) if accuracy requirements allow
- Scale horizontally with more single-worker pods
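Because each worker holds its own copy of the model, the memory taken by model weights grows linearly with `GUNICORN_WORKERS`. The sketch below multiplies the sizes from the Whisper model table by the worker count. Treating the listed size as the per-copy footprint is an assumption, and real usage is higher once audio buffers and the Python runtime are included, so read the result as a lower bound only.

```python
# Model sizes from the Whisper model table, in MB (1.55 GB taken as 1550 MB).
MODEL_SIZE_MB = {"base": 74, "small": 244, "medium": 769, "large": 1550, "turbo": 809}


def model_weights_mb(model: str, workers: int) -> int:
    """Lower bound on memory used by model weights: one copy per Gunicorn worker."""
    if workers < 1:
        raise ValueError("workers must be >= 1")
    return MODEL_SIZE_MB[model] * workers
```

For example, `turbo` with two workers accounts for about 1618 MB in weights alone, while `large` with two workers needs about 3100 MB before any other memory use.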
OpenAI API errors
Symptoms: Cloud transcription or TTS requests fail with API errors.
Solutions:
- Verify `OPENAI_API_KEY` is set and valid
- Check API rate limits on your OpenAI account
- For large files, the service automatically chunks audio into 10-minute segments, so ensure the network is stable
- Review pod logs for specific API error messages
Related resources
Integration Designer setup
Configure Integration Designer, which orchestrates Speech to Text jobs
Speech to Text node
Configure the Speech to Text workflow node in Integration Designer
Kafka Authentication
Configure Kafka security and authentication
Web Crawler setup
Configure the Web Crawler service for web page extraction

