Available starting with FlowX.AI 5.7.0.

The Speech to Text service powers the Speech to Text workflow node in Integration Designer.
The Speech to Text service is a Python-based microservice that provides audio transcription using OpenAI Whisper (local or API) and text-to-speech via the OpenAI TTS API. It supports both synchronous REST calls and asynchronous Kafka job processing for long-running audio files.

Dependencies

The Speech to Text service connects to standard FlowX infrastructure services that should already be configured in your environment:
  • Kafka — async job processing for transcription requests from Integration Designer
  • S3-compatible storage (MinIO or AWS S3) — audio file retrieval and result storage
  • OpenAI API key — required for cloud transcription (/transcribe/openai) and text-to-speech (/tts/openai)
  • ffmpeg — required for audio decoding (included in Docker image)
The local Whisper transcription engine runs on-device and does not require an OpenAI API key. Only cloud transcription and TTS features need the API key.

Capabilities

What it can do

  • Local transcription: On-device transcription using any Whisper model (base, small, medium, large, turbo) with automatic language detection
  • Cloud transcription: OpenAI Whisper API with automatic chunking of long audio into 10-minute segments
  • Text-to-speech: OpenAI TTS API with multiple models (tts-1, tts-1-hd, gpt-4o-mini-tts) and voice options
  • Segment-level timing: Returns word-level segments with start/end timestamps
  • Language detection: Automatic source language detection with confidence probabilities
  • Long audio support: Kafka-based async processing for files over 15 minutes

Supported audio formats

| Format | Extensions |
| --- | --- |
| MP3 | .mp3 |
| WAV | .wav |
| FLAC | .flac |
| AAC | .aac, .m4a |
| OGG | .ogg |

API endpoints

All endpoints are prefixed with the URL_PREFIX value (default: /speech-to-text).
| Endpoint | Method | Purpose |
| --- | --- | --- |
| /api/v1/speech-to-text/transcribe | POST | Transcribe audio using the local Whisper model |
| /api/v1/speech-to-text/transcribe/openai | POST | Transcribe audio using the OpenAI Whisper API |
| /api/v1/speech-to-text/tts/openai | POST | Text-to-speech using the OpenAI TTS API |
| /api/v1/speech-to-text/info/health | GET | Health check |
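The endpoint paths above can be assembled programmatically. Below is a minimal sketch; the hostname is a hypothetical in-cluster address, and whether the URL_PREFIX must appear in the path depends on how the service is exposed behind your reverse proxy:

```python
# Build full endpoint URLs for the Speech to Text API.
# BASE is a hypothetical in-cluster address; URL_PREFIX mirrors the default above.
BASE = "http://speech-to-text:8000"
URL_PREFIX = "/speech-to-text"

def endpoint(path: str) -> str:
    """Return the full URL for an API path such as '/info/health'."""
    return f"{BASE}{URL_PREFIX}/api/v1/speech-to-text{path}"

print(endpoint("/info/health"))
# http://speech-to-text:8000/speech-to-text/api/v1/speech-to-text/info/health
```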

Configuration

Server configuration

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| URL_PREFIX | FastAPI root path for reverse proxy routing | /speech-to-text |
| PORT | API port | 8000 |
| GUNICORN_WORKERS | Number of Gunicorn worker processes | 2 |
| GUNICORN_TIMEOUT | Worker timeout in seconds | 120 |
Increase GUNICORN_TIMEOUT for long audio files. The default of 120 seconds works for most files, but transcribing recordings of 30 minutes or more may require a value of 600 or higher.
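As a rough planning aid, the timeout can be derived from the longest audio you expect. The sketch below assumes transcription takes about 20 seconds per audio minute with 50% headroom; both figures are illustrative assumptions, not benchmarks:

```python
def suggested_timeout(audio_minutes: float,
                      secs_per_audio_minute: float = 20.0,
                      headroom: float = 1.5,
                      floor: int = 120) -> int:
    """Estimate GUNICORN_TIMEOUT: assumed processing rate times headroom,
    never below the 120-second default."""
    return max(floor, int(audio_minutes * secs_per_audio_minute * headroom))

print(suggested_timeout(30))  # 900, above the 600+ suggested for 30-minute recordings
print(suggested_timeout(2))   # 120, the floor
```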

Whisper model configuration

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| WHISPER_MODEL | Local Whisper model name | turbo |
| OPENAI_API_KEY | OpenAI API key (required for cloud transcription and TTS) | - |
| OPENAI_WHISPER_MODEL | Model for OpenAI cloud transcription | whisper-1 |
Available local Whisper models:
| Model | Size | Speed | Accuracy | Use case |
| --- | --- | --- | --- | --- |
| base | 74 MB | Fastest | Low | Quick prototyping, non-critical transcription |
| small | 244 MB | Fast | Medium | General use with limited resources |
| medium | 769 MB | Medium | Good | Balanced speed and accuracy |
| large | 1.55 GB | Slow | High | High-accuracy requirements |
| turbo | 809 MB | Fast | High | Recommended default — best speed/accuracy trade-off |
Larger Whisper models require significantly more memory. The large model needs at least 4 Gi RAM per worker. Plan memory limits accordingly and keep GUNICORN_WORKERS low when using larger models.
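Because each worker loads its own model copy, a pod's memory limit scales with GUNICORN_WORKERS. A back-of-the-envelope sketch; the per-worker figures below are rough assumptions loosely based on this page's guidance, so validate them against observed usage:

```python
# Per-worker memory in Gi. These are planning assumptions, not measurements;
# the "large" figure follows the 4 Gi-per-worker guidance above.
PER_WORKER_GI = {"base": 1, "small": 1, "medium": 2, "large": 4, "turbo": 4}

def pod_memory_limit_gi(model: str, workers: int, overhead_gi: int = 1) -> int:
    """Suggested pod memory limit: one model copy per Gunicorn worker,
    plus a small fixed overhead for the server itself."""
    return PER_WORKER_GI[model] * workers + overhead_gi

print(pod_memory_limit_gi("large", 2))  # 9 with the default 1 Gi overhead
```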

Text-to-speech configuration

TTS is configured per-request via the API. Available options:
| Parameter | Options | Default |
| --- | --- | --- |
| Model | tts-1, tts-1-hd, gpt-4o-mini-tts | tts-1 |
| Voice | alloy, ash, coral, echo, fable, onyx, nova, sage, shimmer | alloy |
The gpt-4o-mini-tts model supports additional voices: ballad, verse, marin, cedar.
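A sketch that validates the model/voice combination before building a request body. The voice lists mirror the tables above; the JSON field names are assumptions, so check the /tts/openai endpoint's actual schema:

```python
import json

# Voices per the tables above; per-model support is indicative.
BASE_VOICES = {"alloy", "ash", "coral", "echo", "fable",
               "onyx", "nova", "sage", "shimmer"}
GPT4O_MINI_TTS_EXTRA = {"ballad", "verse", "marin", "cedar"}

def tts_body(text: str, model: str = "tts-1", voice: str = "alloy") -> str:
    """Build a JSON body after checking the voice is valid for the model."""
    allowed = BASE_VOICES | (GPT4O_MINI_TTS_EXTRA if model == "gpt-4o-mini-tts" else set())
    if voice not in allowed:
        raise ValueError(f"voice {voice!r} is not supported by model {model!r}")
    # Key names below are assumptions, not the documented request schema.
    return json.dumps({"model": model, "voice": voice, "input": text})

print(tts_body("Hello from FlowX"))
```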

Kafka configuration (optional)

The Speech to Text service supports Kafka for async job processing of long-running audio files (15+ minutes) that would time out over REST. Kafka topics are not pre-created in standard deployments and must be provisioned manually if needed.
Kafka integration for Speech to Text requires manual topic creation. Ensure the request and response topics exist before enabling the consumer.

Core Kafka settings

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| KAFKA_BOOTSTRAP_SERVERS | Kafka broker address | kafka:9092 |
| KAFKA_CONSUMER_ENABLED | Enable Kafka consumer (0 to turn off) | 1 |
| KAFKA_CONSUMER_GROUPID | Consumer group ID | ai-services |
| KAFKA_SECURITY_ENABLED | Enable SASL/OAUTHBEARER auth (0 to turn off) | 0 |
| KAFKA_MAX_POLL_INTERVAL_MS | Max poll interval (ms) | 600000 |
The default KAFKA_MAX_POLL_INTERVAL_MS is set to 10 minutes (600000 ms) to prevent consumer rebalancing during long transcriptions. Increase this value if you regularly process audio files longer than 10 minutes.
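A sizing sketch, assuming the worst-case transcription takes roughly as long as the audio itself. That 1:1 rate is an assumption (CPU-bound local models can be slower), so raise the headroom factor if your jobs run longer:

```python
def max_poll_interval_ms(longest_audio_minutes: float, headroom: float = 1.0) -> int:
    """KAFKA_MAX_POLL_INTERVAL_MS sized to the longest expected job,
    assuming transcription time roughly equals audio duration."""
    return int(longest_audio_minutes * 60_000 * headroom)

print(max_poll_interval_ms(10))  # 600000, the default
print(max_poll_interval_ms(30))  # 1800000, matching the guidance for 30-minute files
```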

Topic configuration

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| KAFKA_JOB_REQUEST_TOPIC | Incoming transcription requests (from Integration Designer) | ai.flowx.stt.job.request |
| KAFKA_JOB_RESPONSE_TOPIC | Outgoing transcription responses (to Integration Designer) | ai.flowx.stt.job.response |
The Speech to Text service integrates with flx-job-lib for async job processing. On startup, the service verifies Kafka connectivity and starts a background consumer.

Storage configuration (MinIO / S3)

The Speech to Text service reads audio files from and writes results to S3-compatible storage.
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| MINIO_DOCUMENTS_URL | MinIO/S3 endpoint URL | - |
| MINIO_DOCUMENTS_ACCESS_KEY | MinIO access key | - |
| MINIO_DOCUMENTS_SECRET_KEY | MinIO secret key | - |
| MINIO_DOCUMENTS_BUCKET | MinIO bucket name | - |

Observability (optional)

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| USE_OBSERVATORY | Enable FlowX Observatory middleware (1 to enable) | 0 |

FLEURS evaluation (optional)

The service can optionally download Google FLEURS datasets at startup for language evaluation and benchmarking.
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| FLEURS_LANGS | Comma-separated FLEURS language codes | en_us,hu_hu |
| FLEURS_SPLIT | FLEURS dataset split | test |
| FLEURS_CACHE_DIR | Dataset cache directory | /flowx/code/data/fleurs |
FLEURS datasets are used for evaluating transcription accuracy across languages. This is not required for production use — skip this configuration unless you are benchmarking transcription quality.

Kafka job processing

The Speech to Text service processes transcription requests asynchronously via Kafka. Integration Designer sends requests to the job request topic, and the service writes results to S3 and responds through the job response topic.

Request payload

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| fileStoragePath | string | (required) | Path to the audio file in MinIO/S3 |
| provider | string | local | local for on-device Whisper, openai for the OpenAI Whisper API |

Response flow

  1. Integration Designer publishes a transcription request to ai.flowx.stt.job.request
  2. Speech to Text downloads the audio file from S3, transcribes it, and stores the result in S3
  3. Speech to Text publishes a response with the result path to ai.flowx.stt.job.response
  4. Integration Designer retrieves the result (text, language, segments with timestamps) from S3
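The request side of this flow can be sketched with the kafka-python client (an assumption; any Kafka client works). The topic name and payload fields follow this page, while the file path in the usage comment is a hypothetical S3 key:

```python
import json

REQUEST_TOPIC = "ai.flowx.stt.job.request"  # default KAFKA_JOB_REQUEST_TOPIC

def encode_job(file_storage_path: str, provider: str = "local") -> bytes:
    """Serialize a transcription job request for the request topic."""
    if provider not in ("local", "openai"):
        raise ValueError("provider must be 'local' or 'openai'")
    return json.dumps({"fileStoragePath": file_storage_path,
                       "provider": provider}).encode("utf-8")

# Publishing requires kafka-python and a reachable broker; uncomment to use:
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="kafka:9092")
# producer.send(REQUEST_TOPIC, encode_job("audio/long-call.wav"))
# producer.flush()
```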

Deployment and sizing

Docker

  • Base image: Python 3.13
  • Port: 8000
  • Health check: /speech-to-text/api/v1/speech-to-text/info/health
  • Requires ffmpeg (included in Docker image)

Kubernetes configuration

The Whisper model is loaded into memory at startup. Each Gunicorn worker loads its own copy of the model. With the large model (~1.55 GB), keep GUNICORN_WORKERS at 1 to avoid OOM kills. Scale horizontally with more pods instead.

Verify your setup

  • The Speech to Text pod is running: kubectl get pods -l app=speech-to-text
  • The health endpoint returns HTTP 200: curl http://speech-to-text:8000/api/v1/speech-to-text/info/health
  • The Whisper model loaded successfully — check pod logs for model initialization messages at startup
  • The Kafka consumer is connected — check pod logs for a Kafka consumer started message at startup
  • Integration Designer can reach the service — verify FLOWX_SPEECH_TO_TEXT_BASE_URL is set in Integration Designer setup

Troubleshooting

Symptoms: Pod stays in CrashLoopBackOff or never becomes ready.
Solutions:
  • Check that memory limits are sufficient for the configured Whisper model
  • Verify ffmpeg is available (included in Docker image, may need manual install for local dev)
  • Review pod logs for model download or loading errors
  • Ensure the S3/MinIO endpoint is reachable if Kafka consumer is enabled
Symptoms: Integration Designer sends transcription requests but no results are returned.
Solutions:
  • Verify KAFKA_CONSUMER_ENABLED is set to 1
  • Check that KAFKA_JOB_REQUEST_TOPIC matches the topic Integration Designer publishes to
  • Ensure the Kafka consumer group (KAFKA_CONSUMER_GROUPID) has no conflicting consumers
  • Review pod logs for Kafka connection or authentication errors
Symptoms: Transcription text is inaccurate or contains many errors.
Solutions:
  • Upgrade to a larger Whisper model (turbo or large instead of base)
  • Check audio quality — low bitrate or noisy recordings reduce accuracy
  • For non-English audio, use the large model which has better multilingual support
  • Try the OpenAI Whisper API (provider: openai) for comparison
Symptoms: Kafka consumer drops out mid-transcription, causing job failures.
Solutions:
  • Increase KAFKA_MAX_POLL_INTERVAL_MS beyond the expected transcription time
  • For very long audio files (30+ minutes), set to 1800000 (30 minutes)
  • Consider splitting long audio files before submitting
Symptoms: Pods restart with OOMKilled status.
Solutions:
  • Reduce GUNICORN_WORKERS to 1 — each worker loads its own model copy
  • Increase memory limits (4 Gi minimum for turbo, 8 Gi for large)
  • Use a smaller model (small or base) if accuracy requirements allow
  • Scale horizontally with more single-worker pods
Symptoms: Cloud transcription or TTS requests fail with API errors.
Solutions:
  • Verify OPENAI_API_KEY is set and valid
  • Check API rate limits on your OpenAI account
  • For large files, the service automatically chunks audio into 10-minute segments — ensure network is stable
  • Review pod logs for specific API error messages

Integration Designer setup

Configure Integration Designer, which orchestrates Speech to Text jobs

Speech to Text node

Configure the Speech to Text workflow node in Integration Designer

Kafka Authentication

Configure Kafka security and authentication

Web Crawler setup

Configure the Web Crawler service for web page extraction
Last modified on April 9, 2026