IngestDocument

This plugin is currently in beta. While it is considered safe for use, please be aware that its API could change in ways that are not compatible with earlier versions in future releases, or it might become unsupported.

Ingest documents into an embedding store.

Only text documents (TXT, HTML, Markdown) are supported for now.

yaml
type: "io.kestra.plugin.langchain4j.rag.IngestDocument"

Ingest documents into a KV embedding store. WARNING: the KV embedding store is for quick prototyping only, as it stores the embedding vectors in a K/V store and loads them all into memory.

yaml
id: document-ingestion
namespace: company.team

tasks:
  - id: ingest
    type: io.kestra.plugin.langchain4j.rag.IngestDocument
    provider:
      type: io.kestra.plugin.langchain4j.provider.GoogleGemini
      modelName: gemini-embedding-exp-03-07
      apiKey: "{{ secret('GEMINI_API_KEY') }}"
    embeddings:
      type: io.kestra.plugin.langchain4j.embeddings.KestraKVStore
    drop: true
    fromExternalURLs:
      - https://raw.githubusercontent.com/kestra-io/docs/refs/heads/main/content/blogs/release-0-22.md

Dynamic NO

Embedding Store Provider

Dynamic NO

Language Model Provider

This provider must be configured with an embedding model.

Dynamic NO

The document splitter

Dynamic NO

Default false

Whether to drop the store before ingestion. Useful for testing purposes.

SubType

Dynamic NO

A list of inline documents
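A minimal sketch of inline ingestion might look like the flow below. The property name `fromDocuments` and the `content` / `metadata` keys are assumptions inferred from the descriptions on this page (the property names themselves are not shown here), so verify them against the plugin schema. The provider and store settings are copied from the example above.

```yaml
id: inline-document-ingestion
namespace: company.team

tasks:
  - id: ingest_inline
    type: io.kestra.plugin.langchain4j.rag.IngestDocument
    provider:
      type: io.kestra.plugin.langchain4j.provider.GoogleGemini
      modelName: gemini-embedding-exp-03-07
      apiKey: "{{ secret('GEMINI_API_KEY') }}"
    embeddings:
      type: io.kestra.plugin.langchain4j.embeddings.KestraKVStore
    # Assumed property name for "A list of inline documents"
    fromDocuments:
      # Assumed keys: "The content of the document" / "The metadata of the document"
      - content: "This is the first inline document."
        metadata:
          source: inline-note
      - content: "This is the second inline document."
```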

SubType string

Dynamic YES

A list of document URLs from external sources

SubType string

Dynamic YES

A list of internal storage URIs representing documents

Dynamic YES

A path inside the task working directory that contains documents to ingest

Each document inside the directory will be ingested into the embedding store. The directory is read recursively, and the path is protected against path traversal (CWE-22).
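As a hedged illustration of directory ingestion, the sketch below wraps the task in a `WorkingDirectory` task so that files exist in the working directory before ingestion. The property name `fromPath` is an assumption inferred from the description above, and the file names and contents are hypothetical.

```yaml
id: ingest-from-path
namespace: company.team

tasks:
  - id: wdir
    type: io.kestra.plugin.core.flow.WorkingDirectory
    # Hypothetical files created in the shared working directory
    inputFiles:
      docs/intro.md: |
        # Intro
        First document.
      docs/guide.txt: "Second document."
    tasks:
      - id: ingest
        type: io.kestra.plugin.langchain4j.rag.IngestDocument
        provider:
          type: io.kestra.plugin.langchain4j.provider.GoogleGemini
          modelName: gemini-embedding-exp-03-07
          apiKey: "{{ secret('GEMINI_API_KEY') }}"
        embeddings:
          type: io.kestra.plugin.langchain4j.embeddings.KestraKVStore
        # Assumed property name; every document under docs/ is ingested recursively
        fromPath: docs
```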

SubType string

Dynamic YES

Additional metadata that will be added to all ingested documents

Additional outputs from the embedding store.

The number of ingested documents

The input token count

The output token count

The total token count
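The outputs listed above can be read by a downstream task. The fragment below is a sketch meant to be appended under `tasks:` after a task with the id `ingest`. The output names `ingestedDocuments` and `totalTokenCount` are assumptions inferred from the descriptions, not names confirmed on this page.

```yaml
  - id: log_result
    type: io.kestra.plugin.core.log.Log
    # Assumed output names: ingestedDocuments, totalTokenCount
    message: "Ingested {{ outputs.ingest.ingestedDocuments }} documents using {{ outputs.ingest.totalTokenCount }} tokens"
```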

Dynamic YES

Endpoint URL

Dynamic YES

Project location

Dynamic YES

Model name

Dynamic YES

Project ID

Dynamic NO

Dynamic YES

API endpoint

The Azure OpenAI endpoint in the format: https://{resource}.openai.azure.com/

Dynamic YES

Model name

Dynamic NO

Dynamic YES

API Key

Dynamic YES

Client ID

Dynamic YES

Client secret

Dynamic YES

API version

Dynamic YES

Tenant ID

Dynamic YES

API Key

Dynamic YES

Model name

Dynamic NO

Dynamic YES

Default https://api.deepseek.com/v1

API base URL

SubType string

Dynamic YES

Min items 1

List of HTTP ElasticSearch servers.

Must be a URI like https://elasticsearch.com:9200, with scheme and port.

Dynamic NO

Basic auth configuration.

SubType string

Dynamic YES

List of HTTP headers to be sent on every request.

Each header must be a string with the key and value separated by a colon, e.g. Authorization: Token XYZ.

Dynamic YES

Sets the path's prefix for every request used by the HTTP client.

For example, if this is set to /my/path, then any client request will become /my/path/ + endpoint. In essence, every request's endpoint is prefixed by this pathPrefix. The path prefix is useful when ElasticSearch is behind a proxy that provides a base path, or a proxy that requires all paths to start with '/'. It is not intended for other purposes and should not be supplied in other scenarios.

Dynamic NO

Whether the REST client should return any response containing at least one warning header as a failure.

Dynamic NO

Trust all SSL CA certificates.

Use this if the server is using a self-signed SSL certificate.
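The connection options above could be combined roughly as follows. This is only a sketch of the `embeddings` block: the type name `io.kestra.plugin.langchain4j.embeddings.Elasticsearch` and every property name (`connection`, `hosts`, `basicAuth`, `headers`, `pathPrefix`, `trustAllSsl`, `indexName`) are assumptions inferred from the descriptions on this page, and the secret names and index name are hypothetical.

```yaml
embeddings:
  # Assumed type name for the Elasticsearch embedding store
  type: io.kestra.plugin.langchain4j.embeddings.Elasticsearch
  indexName: embeddings              # assumed name: the index to store embeddings
  connection:                        # assumed wrapper for the connection settings
    hosts:                           # at least one URI, with scheme and port
      - https://elasticsearch.com:9200
    basicAuth:
      username: "{{ secret('ELASTICSEARCH_USERNAME') }}"
      password: "{{ secret('ELASTICSEARCH_PASSWORD') }}"
    headers:                         # "key: value" strings sent on every request
      - "Authorization: Token XYZ"
    pathPrefix: /my/path             # only when a proxy requires a base path
    trustAllSsl: true                # only for self-signed certificates
```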

Dynamic YES

API Key

Dynamic YES

Model name

Dynamic NO

Dynamic YES

API Key

Dynamic YES

Model name

Dynamic NO

Dynamic YES

API base URL

Dynamic YES

Model endpoint

Dynamic YES

Model name

Dynamic NO

Dynamic YES

Basic auth password.

Dynamic YES

Basic auth username.

Dynamic NO

Dynamic YES

Default {{flow.id}}-embedding-store

The name of the K/V entry to use

Dynamic YES

API Key

Dynamic YES

Model name

Dynamic NO

Dynamic YES

AWS Access Key ID

Dynamic YES

Model name

Dynamic YES

AWS Secret Access Key

Dynamic NO

Dynamic YES

Default COHERE

Possible Values

COHERE, TITAN

Amazon Bedrock Embedding Model Type

Dynamic YES

The content of the document

Dynamic YES

The metadata of the document

Dynamic YES

The database name

Dynamic YES

The database server host

Dynamic YES

The database password

Dynamic NO

The database server port

Dynamic YES

The table to store embeddings in

Dynamic NO

Dynamic YES

The database user

Dynamic NO

Default false

Whether to use an IVFFlat index

An IVFFlat index divides vectors into lists, and then searches a subset of those lists closest to the query vector. It has faster build times and uses less memory than HNSW but has lower query performance (in terms of speed-recall tradeoff).
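A PostgreSQL-backed store with the IVFFlat index enabled might be configured along these lines. The type name `io.kestra.plugin.langchain4j.embeddings.PGVector` and the property names (`host`, `port`, `database`, `user`, `password`, `table`, `useIndex`) are assumptions inferred from the descriptions above. The host, port, database, table, and secret names are hypothetical values.

```yaml
embeddings:
  # Assumed type name for the pgvector embedding store
  type: io.kestra.plugin.langchain4j.embeddings.PGVector
  host: localhost
  port: 5432
  database: postgres
  user: "{{ secret('POSTGRES_USER') }}"
  password: "{{ secret('POSTGRES_PASSWORD') }}"
  table: embeddings
  # Assumed flag for "Whether to use an IVFFlat index" (default false)
  useIndex: true
```

Leaving the flag at its default of false is the simpler choice for small prototypes. The IVFFlat trade-off described above matters mostly as the table grows.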

Dynamic YES

API Key

Dynamic YES

Model name

Dynamic NO

Dynamic YES

API base URL

Dynamic NO

The maximum size of the overlap, defined in characters. Only full sentences are considered for the overlap.

Dynamic NO

The maximum size of the segment, defined in characters.

Dynamic NO

Default RECURSIVE

Possible Values

RECURSIVE, PARAGRAPH, LINE, SENTENCE, WORD

The type of the DocumentSplitter

We recommend using a RECURSIVE DocumentSplitter for generic text. It tries to split the document into paragraphs first and fits as many paragraphs into a single TextSegment as possible. If some paragraphs are too long, they are recursively split into lines, then sentences, then words, and then characters until they fit into a segment.
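A splitter configuration following this recommendation could look like the fragment below, placed inside the `IngestDocument` task. The property names `documentSplitter`, `splitter`, `maxSegmentSizeInChars`, and `maxOverlapSizeInChars` are assumptions inferred from the descriptions on this page, and the sizes are illustrative values rather than recommended defaults.

```yaml
documentSplitter:
  splitter: RECURSIVE            # paragraphs first, then lines, sentences, words
  maxSegmentSizeInChars: 1000    # maximum segment size, in characters
  maxOverlapSizeInChars: 100     # maximum overlap, in characters (full sentences only)
```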

Dynamic NO

Dynamic YES

The name of the index to store embeddings

Dynamic NO

