Infrastructure Prerequisites:

  • S3 bucket or an S3-compatible alternative (for example, MinIO)
  • Kafka cluster

Starting with ocr-plugin 1.X, RabbitMQ is no longer required.

The following environment variable from previous releases must be removed in order to use the OCR plugin: CELERY_BROKER_URL.

Deployment/Configuration

To deploy the OCR plugin, deploy the ocr-plugin Helm chart with a custom values file.

The most important sections are shown below; the Helm chart exposes additional options.

image:
  repository: <repository>/ocr-plugin

applicationSecrets: {}

replicaCount: 2

resources: {}
  
env: []

Credentials

S3 bucket:

applicationSecrets:
  enable: true
  envSecretKeyRef:
    STORAGE_S3_ACCESS_KEY: access-key # default empty
    STORAGE_S3_SECRET_KEY: secret-key # default empty
  existingSecret: true
  secretName: ocr-plugin-application-config

Kafka configuration

You can override the following environment variables:

| Environment Variable | Definition | Default Value | Example |
|---|---|---|---|
| ENABLE_KAFKA_SASL | Indicates whether Kafka SASL authentication is enabled | False | - |
| KAFKA_ADDRESS | The address of the Kafka bootstrap server in the format <hostname>:<port> | - | kafka-server1:9092 |
| KAFKA_CONSUME_SCHEDULE | The interval (in seconds) at which Kafka messages are consumed | 30 | - |
| KAFKA_INPUT_TOPIC | The Kafka topic from which input messages are consumed | - | - |
| KAFKA_OCR_CONSUMER_GROUPID | The consumer group ID for the OCR Kafka consumer | ocr_group | - |
| KAFKA_CONSUMER_AUTO_COMMIT | Determines whether the Kafka consumer commits offsets automatically | True | - |
| KAFKA_CONSUMER_AUTO_COMMIT_INTERVAL | The interval (in milliseconds) at which the Kafka consumer commits offsets automatically | 1000 | - |
| KAFKA_CONSUMER_TIMEOUT | The timeout (in milliseconds) for Kafka consumer operations | 28000 | - |
| KAFKA_CONSUMER_MAX_POLL_INTERVAL | The maximum interval (in milliseconds) between consecutive polls for the Kafka consumer | 25000 | - |
| KAFKA_CONSUMER_AUTO_OFFSET_RESET | The strategy for resetting the offset when no initial offset is available or the current offset is invalid | earliest | - |
| KAFKA_OUTPUT_TOPIC | The Kafka topic to which output messages are sent | - | - |

Please note that the default values and examples provided here are for illustrative purposes. Make sure to replace them with the appropriate values based on your Kafka configuration.

When configuring the OCR plugin, make sure the outgoing topic name (KAFKA_OUTPUT_TOPIC) matches the naming pattern expected by the Engine, which listens for messages on topics with specific names.
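
As a sketch of how the Kafka variables listed above could be read and validated before startup, the following Python helper applies the documented defaults and rejects a configuration in which a variable without a default is missing. The name `load_kafka_settings` and the parsing rules (for example, treating `true`, `1` and `yes` as true) are assumptions made for illustration; they are not taken from the plugin's implementation.

```python
import os

# Documented defaults; None marks variables that have no default and must be set.
KAFKA_DEFAULTS = {
    "ENABLE_KAFKA_SASL": "False",
    "KAFKA_ADDRESS": None,
    "KAFKA_CONSUME_SCHEDULE": "30",
    "KAFKA_INPUT_TOPIC": None,
    "KAFKA_OCR_CONSUMER_GROUPID": "ocr_group",
    "KAFKA_CONSUMER_AUTO_COMMIT": "True",
    "KAFKA_CONSUMER_AUTO_COMMIT_INTERVAL": "1000",
    "KAFKA_CONSUMER_TIMEOUT": "28000",
    "KAFKA_CONSUMER_MAX_POLL_INTERVAL": "25000",
    "KAFKA_CONSUMER_AUTO_OFFSET_RESET": "earliest",
    "KAFKA_OUTPUT_TOPIC": None,
}

BOOL_KEYS = {"ENABLE_KAFKA_SASL", "KAFKA_CONSUMER_AUTO_COMMIT"}
INT_KEYS = {
    "KAFKA_CONSUME_SCHEDULE",
    "KAFKA_CONSUMER_AUTO_COMMIT_INTERVAL",
    "KAFKA_CONSUMER_TIMEOUT",
    "KAFKA_CONSUMER_MAX_POLL_INTERVAL",
}


def load_kafka_settings(env=None):
    """Hypothetical helper: read Kafka settings from a mapping (default: os.environ)."""
    env = os.environ if env is None else env
    settings, missing = {}, []
    for key, default in KAFKA_DEFAULTS.items():
        raw = env.get(key, default)
        if raw is None or raw == "":
            # No value supplied and no documented default.
            missing.append(key)
            continue
        if key in BOOL_KEYS:
            # Assumed convention for boolean flags.
            settings[key] = str(raw).strip().lower() in ("true", "1", "yes")
        elif key in INT_KEYS:
            settings[key] = int(raw)
        else:
            settings[key] = raw
    if missing:
        raise ValueError("Missing required Kafka settings: " + ", ".join(missing))
    return settings
```

Calling `load_kafka_settings()` with only KAFKA_ADDRESS, KAFKA_INPUT_TOPIC and KAFKA_OUTPUT_TOPIC set returns the remaining values from the documented defaults, converted to booleans and integers.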

Authorization

You can override the following environment variables:

| Environment Variable | Definition | Default Value | Example |
|---|---|---|---|
| OAUTH_CLIENT_ID | The client ID for OAuth authentication | - | your_client_id |
| OAUTH_CLIENT_SECRET | The client secret for OAuth authentication | - | your_client_secret |
| OAUTH_TOKEN_ENDPOINT_URI | The URI of the token endpoint for OAuth authentication | - | https://oauth.example.com/token |

Please note that the default values and examples provided here are for illustrative purposes. Make sure to replace them with the appropriate values based on your OAuth authentication configuration.

Storage (S3 configuration)

You can override the following environment variables:

| Environment Variable | Definition | Default Value | Example |
|---|---|---|---|
| STORAGE_S3_HOST | The host address of the S3 storage service | - | minio:9000, https://s3.eu-west-1.amazonaws.com/ |
| STORAGE_S3_SECURE_CONNECTION | Indicates whether to use a secure connection (HTTPS) for S3 storage | False | - |
| STORAGE_S3_LOCATION | The location of the S3 storage service | - | eu-west-1 |
| STORAGE_S3_OCR_SCANS_BUCKET | The name of the S3 bucket for storing OCR scans | - | pdf-scans |
| STORAGE_S3_OCR_SIGNATURE_BUCKET | The name of the S3 bucket for storing OCR signatures | - | extracted-signatures |
| STORAGE_S3_OCR_SIGNATURE_FILENAME | The filename pattern for extracted OCR signatures | - | extracted_signature_{}.png |
| STORAGE_S3_ACCESS_KEY | The access key for connecting to the S3 storage service | - | - |
| STORAGE_S3_SECRET_KEY | The secret key for connecting to the S3 storage service | - | - |

Please note that the default values and examples provided here are for illustrative purposes. Make sure to replace them with the appropriate values based on your S3 storage configuration.

Performance

| Environment Variable | Definition | Default Value |
|---|---|---|
| ENABLE_PERFORMANCE_PAYLOAD | When set to true, the response payload contains performance metrics for the various stages of the process | true |

Example

  "perf": {
    "total_time": 998,
    "split": {
      "get_file": 248,
      "extract_images": 172,
      "extract_barcodes": 37,
      "extract_signatures": 238,
      "minio_signature_save": 301
    }
  }
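
To illustrate how the `perf` block might be consumed, the sketch below parses a payload shaped like the example above and computes the share of `total_time` spent in each stage. The helper name `stage_shares` is hypothetical, and wrapping the fragment in an outer JSON object is an assumption made so that it parses on its own; the source does not state the unit of the timings.

```python
import json


def stage_shares(perf):
    """Return each stage's share of total_time as a fraction between 0 and 1."""
    total = perf["total_time"]
    if total <= 0:
        return {}
    return {stage: duration / total for stage, duration in perf["split"].items()}


# The example payload from above, wrapped in an object so it is valid JSON.
payload = json.loads("""
{
  "perf": {
    "total_time": 998,
    "split": {
      "get_file": 248,
      "extract_images": 172,
      "extract_barcodes": 37,
      "extract_signatures": 238,
      "minio_signature_save": 301
    }
  }
}
""")

shares = stage_shares(payload["perf"])
slowest = max(shares, key=shares.get)
# Time not attributed to any listed stage.
unaccounted = payload["perf"]["total_time"] - sum(payload["perf"]["split"].values())
print(slowest, unaccounted)
```

For this payload the slowest stage is `minio_signature_save`, and the listed stages add up to 996 of the 998 reported in `total_time`.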

Certificates

You can override the following environment variables:

| Environment Variable | Definition | Default Value |
|---|---|---|
| REQUESTS_CA_BUNDLE | The path to the certificate bundle file used for secure requests | 5 |
| CERT_REQUESTS | The certificate verification mode applied to secure requests | 'CERT_REQUIRED' |

Workers Behavior

You can override the following environment variables:

| Environment Variable | Definition | Default Value |
|---|---|---|
| OCR_WORKER_COUNT | Number of workers | 5 |
| OCR_WORK_QUEUE_TIMEOUT | If no activity has occurred for this number of seconds, an attempt is made to refresh the workers | 10 |

If no worker is released within OCR_WORK_QUEUE_TIMEOUT seconds, the application checks whether any workers have become unresponsive and need to be restarted.

If none of the workers have died, they are most likely blocked in some process. In this case, the application terminates all workers and shuts itself down, on the expectation that the container will then be restarted.
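
The recovery behavior described above can be sketched as a small decision function. This is an illustration under assumptions rather than the plugin's code: `Worker`, `is_alive` and `decide_action` are hypothetical names, and the real implementation may track its workers differently.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    # Hypothetical stand-in for a worker process handle.
    name: str
    is_alive: bool


def decide_action(workers, seconds_since_release, queue_timeout=10):
    """Decide what to do when no worker has been released for a while.

    Returns a tuple (action, affected_workers), where action is one of
    "wait", "restart" or "shutdown". queue_timeout mirrors the documented
    default of OCR_WORK_QUEUE_TIMEOUT.
    """
    if seconds_since_release < queue_timeout:
        # Timeout not reached yet: nothing to do.
        return "wait", []
    dead = [w for w in workers if not w.is_alive]
    if dead:
        # Some workers died: only those need to be restarted.
        return "restart", dead
    # All workers are alive but none was released: assume they are blocked,
    # terminate everything and let the container be restarted.
    return "shutdown", list(workers)
```

With five live workers and no release for 12 seconds, `decide_action` returns `"shutdown"`; if one of them has died, it returns `"restart"` together with that worker.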

Control Aspect Ratio

| Environment Variable | Definition | Default Value |
|---|---|---|
| OCR_SIGNATURE_MAX_RATIO | The maximum acceptable aspect ratio for a signed scanned document; the OCR plugin recognizes a signature only if the document aspect ratio is less than or equal to this value | 1.43 |
| OCR_SIGNATURE_MIN_RATIO | The minimum acceptable aspect ratio for a signed scanned document; the OCR plugin recognizes a signature only if the document aspect ratio is greater than or equal to this value | 1.39 |

The plugin has been tested with aspect ratio values between 1.38 and 1.43. Values outside this range are untested and may lead to plugin instability or undesired behavior; adjust these parameters at your own risk.
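
As a rough sketch of the range check these two variables describe, the function below tests whether a page's aspect ratio lies between the minimum and maximum. How the plugin actually computes the ratio is not stated in this document; dividing the longer side by the shorter side is an assumption, and `ratio_in_range` is a hypothetical name.

```python
def ratio_in_range(width, height, min_ratio=1.39, max_ratio=1.43):
    """Return True if the page aspect ratio lies within [min_ratio, max_ratio].

    The ratio is taken as long side / short side (an assumption), so the
    result does not depend on page orientation. The default bounds mirror
    the documented defaults of OCR_SIGNATURE_MIN_RATIO and
    OCR_SIGNATURE_MAX_RATIO.
    """
    long_side, short_side = max(width, height), min(width, height)
    if short_side <= 0:
        raise ValueError("width and height must be positive")
    ratio = long_side / short_side
    return min_ratio <= ratio <= max_ratio
```

Under this assumption, an A4 page (210 x 297 mm, ratio about 1.414) falls inside the default range, while a US Letter page (8.5 x 11 in, ratio about 1.294) does not.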