What is Observability?

Observability is the capacity to infer the internal state of a system by analyzing its external outputs. In software development, this entails understanding the internal workings of a system through its telemetry data, which comprises traces, metrics, and logs.

What is OpenTelemetry?

OpenTelemetry is an observability framework and toolkit for generating and managing telemetry data, including traces, metrics, and logs. It is vendor-agnostic and compatible with various observability backends like Jaeger and Prometheus. Unlike observability backends, OpenTelemetry focuses on the creation, collection, and export of telemetry data, leaving storage and visualization to other tools.

Tracing with OpenTelemetry is available starting with the FlowX.AI v4.1.0 release.

How does it work?

Our monitoring and performance analysis system leverages OpenTelemetry for comprehensive tracing and logging across our microservices architecture. By integrating with Grafana and other observability tools, we achieve detailed visibility into the lifecycle of requests, the performance of individual operations, and the interactions between different components of the system.

OTEL Collectors receive, process, and export telemetry data in a vendor-agnostic way. For more information about OTEL Collectors, see the OpenTelemetry documentation.

Recommended OpenTelemetry Collector processors: use the processors recommended for the OpenTelemetry Collector.

Prerequisites

Microservices

  • Custom code addition for manual instrumentation.
  • Configuration and deployment of Java agent.
  • Performance impact assessment.

Kubernetes

  • Use of a Kubernetes Operator for managing instrumentation and tracing configuration.

Instrumentation

Auto-instrumentation with Java agent

  • How it works: The agent automatically wraps methods at the application edges (HTTP calls, Kafka messages, DB calls), creating spans and adding default span attributes.
  • Configuration: Configure the Java agent for auto-instrumentation.
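As a rough illustration of what attaching the agent involves, the sketch below assembles a JVM launch command in Java. The system properties otel.service.name, otel.traces.exporter, and otel.exporter.otlp.endpoint are standard OpenTelemetry Java agent settings; the class name AgentLaunchCommand and every path, service name, and endpoint passed to it are hypothetical placeholders, and an actual FlowX.AI deployment may set the same options through environment variables or the Kubernetes Operator instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: builds a JVM command line that attaches the OpenTelemetry Java agent. */
final class AgentLaunchCommand {

    private AgentLaunchCommand() {}

    /**
     * All arguments are supplied by the caller; none of them come from FlowX.AI defaults.
     *
     * @param agentJar     path to the downloaded opentelemetry-javaagent.jar (placeholder)
     * @param serviceName  logical service name shown in the tracing backend (placeholder)
     * @param otlpEndpoint OTLP endpoint of the OTEL Collector (placeholder)
     * @param appJar       the microservice jar to start (placeholder)
     */
    static List<String> build(String agentJar, String serviceName,
                              String otlpEndpoint, String appJar) {
        List<String> command = new ArrayList<>();
        command.add("java");
        // Attaches the agent so that HTTP, Kafka, and DB calls are wrapped in spans.
        command.add("-javaagent:" + agentJar);
        // Standard agent properties: service name, exporter type, collector endpoint.
        command.add("-Dotel.service.name=" + serviceName);
        command.add("-Dotel.traces.exporter=otlp");
        command.add("-Dotel.exporter.otlp.endpoint=" + otlpEndpoint);
        command.add("-jar");
        command.add(appJar);
        return command;
    }
}
```

The resulting list can be handed to new ProcessBuilder(command) or simply printed and copied into a container entrypoint.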

Manual instrumentation

  • Custom Spans: Created for methods important to the business flow and enriched with business attributes such as fx.type, fx.methodName, fx.processInstanceUuid, and others.
  • Custom BUSINESS Spans: Create spans for business events.

Business logic metadata in logs and spans

Spans now include custom FlowX attributes (e.g., node names, action names, process names, instance UUIDs), which can be used for filtering and searching in traces.

Here is the full list of included custom FlowX span attributes:

Custom span attributes

  • fx.type - BUSINESS/TECHNICAL
  • fx.methodName
  • fx.parentProcessInstanceId
  • fx.parentProcessInstanceUuid
  • fx.processInstanceUuid
  • fx.processName
  • fx.processVersionId
  • fx.tokenInstanceUuid
  • fx.nodeName
  • fx.nodeId
  • fx.nodeUuid
  • fx.boundaryEventId
  • fx.nextNodeId
  • fx.triggeredByBoundaryEventId
  • fx.actionUuid
  • fx.actionName
  • fx.context
  • fx.platform

Custom business spans

  • identified by the fx.type = BUSINESS attribute
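To make the attribute-based selection concrete, here is a minimal Java sketch that models each span as a plain map of attribute names to values and keeps only the spans matching a set of criteria. In practice this filtering is done in the tracing backend's search UI. The class SpanAttributeFilter and the map-based span model are assumptions made for illustration; only the attribute names (fx.type, fx.nodeName, fx.processInstanceUuid) come from the list above.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Sketch: selects spans by their custom FlowX attributes. */
final class SpanAttributeFilter {

    private SpanAttributeFilter() {}

    /** Returns the spans whose attributes contain every key/value pair in criteria. */
    static List<Map<String, String>> matching(List<Map<String, String>> spans,
                                              Map<String, String> criteria) {
        return spans.stream()
                .filter(attributes -> criteria.entrySet().stream()
                        .allMatch(c -> c.getValue().equals(attributes.get(c.getKey()))))
                .collect(Collectors.toList());
    }

    /** Business spans are the ones tagged with fx.type = BUSINESS. */
    static List<Map<String, String>> businessSpans(List<Map<String, String>> spans) {
        return matching(spans, Map.of("fx.type", "BUSINESS"));
    }
}
```

The same matching method covers narrower searches, for example passing both fx.nodeName and fx.processInstanceUuid as criteria to find the spans of one node in one process instance.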

Detailed trace operations

Trace specific operations and measure request time across different layers/services.

  • Process Start: Auto-instrumentation enabled for Spring Data to show time spent in repository methods. JDBC query instrumentation can be added.
  • Token Creation and Advancing: Custom tracing added.
  • Action Execution and Subprocess Start: Custom tracing added.

Troubleshooting scenarios and common usages

Scenario examples

  • Process Trace: Analyze DB vs cache times, token advancement, node actions.
  • Parallel Gateway: Trace split tokens.
  • DB Query Time: Enable JDBC query tracing.
  • Endpoint Data Issues: Check traces for Redis or DB source.
  • Token Stuck: Filter by node name and process UUID.
  • Action Execution: Trace action names for stuck tokens.
  • Subprocess Failures: Analyze subprocess start and failures.
  • Latency Analysis: Identify latencies in automatic actions.
  • Boundary Events: Ensure Kafka schedule messages are sent and received correctly.
  • External Service Tracking: Trace between process engine and external plugins.

Business operation analysis

  • Long Running Operations: Use Uptrace for identifying slow operations.
  • Failed Requests: Filter traces by error status.
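The two checks above boil down to a duration threshold and an error flag. The Java sketch below shows that logic on a simplified span model; SpanTriage, SpanSummary, and their fields are hypothetical names introduced here, not part of FlowX.AI or OpenTelemetry, and a tracing backend would normally run the equivalent query for you.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/** Sketch: picks out slow and failed operations from a list of span summaries. */
final class SpanTriage {

    private SpanTriage() {}

    /** Minimal stand-in for an exported span; the fields are illustrative only. */
    static final class SpanSummary {
        final String name;
        final long durationMillis;
        final boolean error;

        SpanSummary(String name, long durationMillis, boolean error) {
            this.name = name;
            this.durationMillis = durationMillis;
            this.error = error;
        }
    }

    /** Spans at or above the threshold, slowest first. */
    static List<SpanSummary> slowest(List<SpanSummary> spans, long thresholdMillis) {
        return spans.stream()
                .filter(s -> s.durationMillis >= thresholdMillis)
                .sorted(Comparator.comparingLong((SpanSummary s) -> s.durationMillis).reversed())
                .collect(Collectors.toList());
    }

    /** Spans that ended with an error status, in their original order. */
    static List<SpanSummary> failed(List<SpanSummary> spans) {
        return spans.stream()
                .filter(s -> s.error)
                .collect(Collectors.toList());
    }
}
```

The threshold is left to the caller because what counts as a long-running operation depends on the process being analyzed.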

Visualization of Traces

We recommend using Grafana, but any observability platform compatible with OpenTelemetry standards can be used.

Grafana integrates with backends such as Tempo (for traces) and Loki (for logs), allowing us to visualize the entire lifecycle of a request. This includes detailed views of spans, which are the basic units of work in a trace. By using Grafana, we can:

  • View Trace Trees: Grafana provides an intuitive UI for viewing the hierarchy and relationships between spans, making it easier to understand the flow of a request through the system.
  • Filter and Search: Use Grafana to filter and search spans based on custom attributes like fx.processInstanceUuid, fx.nodeName, fx.actionName, and others. This helps in pinpointing specific operations or issues within a trace.
  • Error Analysis: Identify spans with errors and visualize the stack trace or error message, aiding in quick troubleshooting.

Resources:

OpenTelemetry Docs