Available starting with FlowX.AI 5.7.0.

The Speech to Text service powers the Speech to Text workflow node in Integration Designer.
The Speech to Text service is a Python-based microservice that provides audio transcription using OpenAI Whisper (local or API) and text-to-speech via the OpenAI TTS API. It supports both synchronous REST calls and asynchronous Kafka job processing for long-running audio files.

Dependencies

The Speech to Text service connects to standard FlowX infrastructure services that should already be configured in your environment:
  • Kafka — async job processing for transcription requests from Integration Designer
  • S3-compatible storage (MinIO or AWS S3) — audio file retrieval and result storage
  • OpenAI API key — required for cloud transcription (/transcribe/openai) and text-to-speech (/tts/openai)
  • ffmpeg — required for audio decoding (included in Docker image)
The local Whisper transcription engine runs on-device and does not require an OpenAI API key. Only cloud transcription and TTS features need the API key.

Capabilities

What it can do

  • Local transcription: On-device transcription using any Whisper model (base, small, medium, large, turbo) with automatic language detection
  • Cloud transcription: OpenAI Whisper API with automatic chunking of long audio into 10-minute segments
  • Text-to-speech: OpenAI TTS API with multiple models (tts-1, tts-1-hd, gpt-4o-mini-tts) and voice options
  • Segment-level timing: Returns word-level segments with start/end timestamps
  • Language detection: Automatic source language detection with confidence probabilities
  • Long audio support: Kafka-based async processing for files over 15 minutes

Supported audio formats

| Format | Extensions |
| --- | --- |
| MP3 | .mp3 |
| WAV | .wav |
| FLAC | .flac |
| AAC | .aac, .m4a |
| OGG | .ogg |

API endpoints

All endpoints are prefixed with the URL_PREFIX value (default: /speech-to-text).
| Endpoint | Method | Purpose |
| --- | --- | --- |
| /api/v1/speech-to-text/transcribe | POST | Transcribe audio using the local Whisper model |
| /api/v1/speech-to-text/transcribe/openai | POST | Transcribe audio using the OpenAI Whisper API |
| /api/v1/speech-to-text/tts/openai | POST | Text-to-speech using the OpenAI TTS API |
| /api/v1/speech-to-text/info/health | GET | Health check |
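The endpoint paths above can be assembled programmatically. Below is a minimal sketch; the hostname is a hypothetical in-cluster address, and whether the URL_PREFIX must appear in the path depends on how the service is exposed behind your reverse proxy:

```python
# Build full endpoint URLs for the Speech to Text API.
# BASE is a hypothetical in-cluster address; URL_PREFIX mirrors the default above.
BASE = "http://speech-to-text:8000"
URL_PREFIX = "/speech-to-text"

def endpoint(path: str) -> str:
    """Return the full URL for an API path such as '/info/health'."""
    return f"{BASE}{URL_PREFIX}/api/v1/speech-to-text{path}"

print(endpoint("/info/health"))
# http://speech-to-text:8000/speech-to-text/api/v1/speech-to-text/info/health
```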

Configuration

Server configuration

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| URL_PREFIX | FastAPI root path for reverse proxy routing | /speech-to-text |
| PORT | API port | 8000 |
| GUNICORN_WORKERS | Number of Gunicorn worker processes | 2 |
| GUNICORN_TIMEOUT | Worker timeout in seconds | 120 |
Increase GUNICORN_TIMEOUT for long audio files. The default of 120 seconds works for most files, but transcribing recordings of 30 minutes or more may require a value of 600 or higher.
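As a rough planning aid, the timeout can be derived from the longest audio you expect. The sketch below assumes transcription takes about 20 seconds per audio minute with 50% headroom; both figures are illustrative assumptions, not benchmarks:

```python
def suggested_timeout(audio_minutes: float,
                      secs_per_audio_minute: float = 20.0,
                      headroom: float = 1.5,
                      floor: int = 120) -> int:
    """Estimate GUNICORN_TIMEOUT: assumed processing rate times headroom,
    never below the 120-second default."""
    return max(floor, int(audio_minutes * secs_per_audio_minute * headroom))

print(suggested_timeout(30))  # 900, above the 600+ suggested for 30-minute recordings
print(suggested_timeout(2))   # 120, the floor
```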

Whisper model configuration

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| WHISPER_MODEL | Local Whisper model name | turbo |
| OPENAI_API_KEY | OpenAI API key (required for cloud transcription and TTS) | - |
| OPENAI_WHISPER_MODEL | Model for OpenAI cloud transcription | whisper-1 |
Available local Whisper models:
| Model | Size | Speed | Accuracy | Use case |
| --- | --- | --- | --- | --- |
| base | 74 MB | Fastest | Low | Quick prototyping, non-critical transcription |
| small | 244 MB | Fast | Medium | General use with limited resources |
| medium | 769 MB | Medium | Good | Balanced speed and accuracy |
| large | 1.55 GB | Slow | High | High-accuracy requirements |
| turbo | 809 MB | Fast | High | Recommended default — best speed/accuracy trade-off |
Larger Whisper models require significantly more memory. The large model needs at least 4 Gi RAM per worker. Plan memory limits accordingly and keep GUNICORN_WORKERS low when using larger models.
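Because each worker loads its own model copy, a pod's memory limit scales with GUNICORN_WORKERS. A back-of-the-envelope sketch; the per-worker figures below are rough assumptions loosely based on this page's guidance, so validate them against observed usage:

```python
# Per-worker memory in Gi. These are planning assumptions, not measurements;
# the "large" figure follows the 4 Gi-per-worker guidance above.
PER_WORKER_GI = {"base": 1, "small": 1, "medium": 2, "large": 4, "turbo": 4}

def pod_memory_limit_gi(model: str, workers: int, overhead_gi: int = 1) -> int:
    """Suggested pod memory limit: one model copy per Gunicorn worker,
    plus a small fixed overhead for the server itself."""
    return PER_WORKER_GI[model] * workers + overhead_gi

print(pod_memory_limit_gi("large", 2))  # 9 with the default 1 Gi overhead
```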

Text-to-speech configuration

TTS is configured per-request via the API. Available options:
| Parameter | Options | Default |
| --- | --- | --- |
| Model | tts-1, tts-1-hd, gpt-4o-mini-tts | tts-1 |
| Voice | alloy, ash, coral, echo, fable, onyx, nova, sage, shimmer | alloy |
The gpt-4o-mini-tts model supports additional voices: ballad, verse, marin, cedar.
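A sketch that validates the model/voice combination before building a request body. The voice lists mirror the tables above; the JSON field names are assumptions, so check the /tts/openai endpoint's actual schema:

```python
import json

# Voices per the tables above; per-model support is indicative.
BASE_VOICES = {"alloy", "ash", "coral", "echo", "fable",
               "onyx", "nova", "sage", "shimmer"}
GPT4O_MINI_TTS_EXTRA = {"ballad", "verse", "marin", "cedar"}

def tts_body(text: str, model: str = "tts-1", voice: str = "alloy") -> str:
    """Build a JSON body after checking the voice is valid for the model."""
    allowed = BASE_VOICES | (GPT4O_MINI_TTS_EXTRA if model == "gpt-4o-mini-tts" else set())
    if voice not in allowed:
        raise ValueError(f"voice {voice!r} is not supported by model {model!r}")
    # Key names below are assumptions, not the documented request schema.
    return json.dumps({"model": model, "voice": voice, "input": text})

print(tts_body("Hello from FlowX"))
```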

Kafka configuration (optional)

The Speech to Text service supports Kafka for async job processing of long-running audio files (15+ minutes) that would time out over REST. Kafka topics are not pre-created in standard deployments and must be provisioned manually if needed.
Kafka integration for Speech to Text requires manual topic creation. Ensure the request and response topics exist before enabling the consumer.

Core Kafka settings

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| KAFKA_BOOTSTRAP_SERVERS | Kafka broker address | kafka:9092 |
| KAFKA_CONSUMER_ENABLED | Enable Kafka consumer (0 to turn off) | 1 |
| KAFKA_CONSUMER_GROUPID | Consumer group ID | ai-services |
| KAFKA_SECURITY_ENABLED | Enable SASL/OAUTHBEARER auth (0 to turn off) | 0 |
| KAFKA_MAX_POLL_INTERVAL_MS | Max poll interval (ms) | 600000 |
The default KAFKA_MAX_POLL_INTERVAL_MS is set to 10 minutes (600000 ms) to prevent consumer rebalancing during long transcriptions. Increase this value if you regularly process audio files longer than 10 minutes.
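A sizing sketch, assuming the worst-case transcription takes roughly as long as the audio itself. That 1:1 rate is an assumption (CPU-bound local models can be slower), so raise the headroom factor if your jobs run longer:

```python
def max_poll_interval_ms(longest_audio_minutes: float, headroom: float = 1.0) -> int:
    """KAFKA_MAX_POLL_INTERVAL_MS sized to the longest expected job,
    assuming transcription time roughly equals audio duration."""
    return int(longest_audio_minutes * 60_000 * headroom)

print(max_poll_interval_ms(10))  # 600000, the default
print(max_poll_interval_ms(30))  # 1800000, matching the guidance for 30-minute files
```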

Topic configuration

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| KAFKA_JOB_REQUEST_TOPIC | Incoming transcription requests (from Integration Designer) | ai.flowx.stt.job.request |
| KAFKA_JOB_RESPONSE_TOPIC | Outgoing transcription responses (to Integration Designer) | ai.flowx.stt.job.response |
The Speech to Text service integrates with flx-job-lib for async job processing. On startup, the service verifies Kafka connectivity and starts a background consumer.

Storage configuration (MinIO / S3)

The Speech to Text service reads audio files from and writes results to S3-compatible storage.
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| MINIO_DOCUMENTS_URL | MinIO/S3 endpoint URL | - |
| MINIO_DOCUMENTS_ACCESS_KEY | MinIO access key | - |
| MINIO_DOCUMENTS_SECRET_KEY | MinIO secret key | - |
| MINIO_DOCUMENTS_BUCKET | MinIO bucket name | - |

Observability (optional)

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| USE_OBSERVATORY | Enable FlowX Observatory middleware (1 to enable) | 0 |

FLEURS evaluation (optional)

The service can optionally download Google FLEURS datasets at startup for language evaluation and benchmarking.
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| FLEURS_LANGS | Comma-separated FLEURS language codes | en_us,hu_hu |
| FLEURS_SPLIT | FLEURS dataset split | test |
| FLEURS_CACHE_DIR | Dataset cache directory | /flowx/code/data/fleurs |
FLEURS datasets are used for evaluating transcription accuracy across languages. This is not required for production use — skip this configuration unless you are benchmarking transcription quality.

Kafka job processing

The Speech to Text service processes transcription requests asynchronously via Kafka. Integration Designer sends requests to the job request topic, and the service writes results to S3 and responds through the job response topic.

Request payload

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| fileStoragePath | string | (required) | Path to the audio file in MinIO/S3 |
| provider | string | local | local for on-device Whisper, openai for the OpenAI Whisper API |

Response flow

  1. Integration Designer publishes a transcription request to ai.flowx.stt.job.request
  2. Speech to Text downloads the audio file from S3, transcribes it, and stores the result in S3
  3. Speech to Text publishes a response with the result path to ai.flowx.stt.job.response
  4. Integration Designer retrieves the result (text, language, segments with timestamps) from S3
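The request side of this flow can be sketched with the kafka-python client (an assumption; any Kafka client works). The topic name and payload fields follow this page, while the file path in the usage comment is a hypothetical S3 key:

```python
import json

REQUEST_TOPIC = "ai.flowx.stt.job.request"  # default KAFKA_JOB_REQUEST_TOPIC

def encode_job(file_storage_path: str, provider: str = "local") -> bytes:
    """Serialize a transcription job request for the request topic."""
    if provider not in ("local", "openai"):
        raise ValueError("provider must be 'local' or 'openai'")
    return json.dumps({"fileStoragePath": file_storage_path,
                       "provider": provider}).encode("utf-8")

# Publishing requires kafka-python and a reachable broker; uncomment to use:
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="kafka:9092")
# producer.send(REQUEST_TOPIC, encode_job("audio/long-call.wav"))
# producer.flush()
```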

Deployment and sizing

Docker

  • Base image: Python 3.13
  • Port: 8000
  • Health check: /speech-to-text/api/v1/speech-to-text/info/health
  • Requires ffmpeg (included in Docker image)

Kubernetes configuration

The Whisper model is loaded into memory at startup. Each Gunicorn worker loads its own copy of the model. With the large model (~1.55 GB), keep GUNICORN_WORKERS at 1 to avoid OOM kills. Scale horizontally with more pods instead.

Verify your setup

  • The Speech to Text pod is running: kubectl get pods -l app=speech-to-text
  • The health endpoint returns HTTP 200: curl http://speech-to-text:8000/api/v1/speech-to-text/info/health
  • The Whisper model loaded successfully — check pod logs for model initialization messages at startup
  • The Kafka consumer is connected — check pod logs for a Kafka consumer started message at startup
  • Integration Designer can reach the service — verify FLOWX_SPEECH_TO_TEXT_BASE_URL is set in Integration Designer setup

Troubleshooting

Symptoms: Pod stays in CrashLoopBackOff or never becomes ready.
Solutions:
  • Check that memory limits are sufficient for the configured Whisper model
  • Verify ffmpeg is available (included in Docker image, may need manual install for local dev)
  • Review pod logs for model download or loading errors
  • Ensure the S3/MinIO endpoint is reachable if Kafka consumer is enabled
Symptoms: Integration Designer sends transcription requests but no results are returned.
Solutions:
  • Verify KAFKA_CONSUMER_ENABLED is set to 1
  • Check that KAFKA_JOB_REQUEST_TOPIC matches the topic Integration Designer publishes to
  • Ensure the Kafka consumer group (KAFKA_CONSUMER_GROUPID) has no conflicting consumers
  • Review pod logs for Kafka connection or authentication errors
Symptoms: Transcription text is inaccurate or contains many errors.
Solutions:
  • Upgrade to a larger Whisper model (turbo or large instead of base)
  • Check audio quality — low bitrate or noisy recordings reduce accuracy
  • For non-English audio, use the large model which has better multilingual support
  • Try the OpenAI Whisper API (provider: openai) for comparison
Symptoms: Kafka consumer drops out mid-transcription, causing job failures.
Solutions:
  • Increase KAFKA_MAX_POLL_INTERVAL_MS beyond the expected transcription time
  • For very long audio files (30+ minutes), set to 1800000 (30 minutes)
  • Consider splitting long audio files before submitting
Symptoms: Pods restart with OOMKilled status.
Solutions:
  • Reduce GUNICORN_WORKERS to 1 — each worker loads its own model copy
  • Increase memory limits (4 Gi minimum for turbo, 8 Gi for large)
  • Use a smaller model (small or base) if accuracy requirements allow
  • Scale horizontally with more single-worker pods
Symptoms: Cloud transcription or TTS requests fail with API errors.
Solutions:
  • Verify OPENAI_API_KEY is set and valid
  • Check API rate limits on your OpenAI account
  • For large files, the service automatically chunks audio into 10-minute segments — ensure network is stable
  • Review pod logs for specific API error messages

Integration Designer setup

Configure Integration Designer, which orchestrates Speech to Text jobs

Speech to Text node

Configure the Speech to Text workflow node in Integration Designer

Kafka Authentication

Configure Kafka security and authentication

Web Crawler setup

Configure the Web Crawler service for web page extraction
Last modified on April 9, 2026