Light-Fabric

Light-Fabric is a high-performance, unified platform for managing the lifecycle, governance, and orchestration of enterprise AI services, including agents, tools, skills, memories, MCP servers, APIs, gateways, and workflows.

Why Light-Fabric?

We chose the name Light-Fabric because it embodies the "Unified Governance" required for enterprise-grade AI:

  • Unified Control Plane: Light-Fabric provides a single point of truth for discovering, governing, and auditing agents, MCP servers, and APIs via the light-portal.
  • Enterprise Governance: It prioritizes security and policy enforcement (such as fine-grained authorization) over pure decentralized autonomy, making it safe for corporate environments.
  • Integrated Ecosystem: It "weaves" together distributed components—from memory units (Hindsight) to centralized skills—into a cohesive, observable system.
  • Durable Identity: The name emphasizes the platform's role as the infrastructure foundation, remaining relevant regardless of the underlying implementation details.

Technical Advantages

By building Light-Fabric on a Rust foundation, we achieve:

  • Performance: Built on top of tokio and axum for high throughput, with the memory safety guarantees of Rust.
  • Native Intelligence: Specialized crates for Hindsight memory, tool calling, and workflow orchestration.
  • Production Ready: Includes robust features like retries, failover, and observability out of the box.

Core Components

Light-Fabric is composed of modular crates, infrastructure frameworks, and reference applications:

Crates

  • crates/model-provider: A unified interface for multiple LLM providers (Ollama, etc.).
  • crates/hindsight-client: Client for the Hindsight biomimetic memory system.
  • crates/mcp-client: Implementation of the Model Context Protocol (MCP) for tool discovery and execution.
  • crates/portal-registry: Integration with the Light-Portal for service registration and discovery.
  • crates/light-runtime: Core runtime foundation for building agentic and microservice components.
  • crates/light-rule: High-performance rule engine for fine-grained authorization and data filtering.
  • crates/workflow-core & workflow-builder: Core engine and builder for complex agentic workflows.
  • crates/config-loader: Flexible configuration management for enterprise environments.
  • crates/asymmetric-decryptor & symmetric-decryptor: Security utilities for sensitive data handling.

Frameworks

  • frameworks/light-axum: A specialized microservice & agentic framework built on top of the Axum web ecosystem.
  • frameworks/light-pingora: High-performance proxy and gateway framework built on top of Cloudflare's Pingora.

Applications

  • apps/light-agent: A managed AI agent capable of using tools, accessing memory, and executing complex tasks.
  • apps/light-gateway: An enterprise-grade gateway for securing and governing API and agent traffic.
  • apps/light-workflow: A service for orchestrating and executing long-running agentic workflows.

Getting Started with Light-Fabric

This guide will help you set up a local development environment for Light-Fabric, including the AI Gateway, Agent Engine, and the management Portal.

Prerequisites

  • Rust: Latest stable version.
  • Docker: For running database and backend services.
  • Node.js: For running the portal-view UI.
  • Git: To clone the necessary repositories.

Local Development Setup

To run the entire ecosystem locally, we use the portal-config-loc and service-asset repositories to manage configuration and pre-built assets.

1. Initialize Workspace

Create a unified workspace directory (e.g., ~/lightapi) and clone the core management repositories:

cd ~
mkdir -p lightapi
cd lightapi

# Clone configuration and assets
git clone [email protected]:lightapi/portal-config-loc.git
git clone [email protected]:lightapi/service-asset.git

2. Deploy Local Services

Light-Fabric services are orchestrated via Docker Compose scripts in portal-config-loc. The following command starts the PostgreSQL database and the core services (including the Rust-based components):

cd ~/lightapi/portal-config-loc
./scripts/deploy-local.sh pg rust

3. Import Initial Data

Use the importer script in service-asset to populate the local database with initial events, users, and configurations:

cd ~/lightapi/service-asset
./importer.sh -f events.json

4. Update /etc/hosts

The platform uses virtual hosts for local routing. Add the following entry to your /etc/hosts file (replace with your actual local IP if necessary):

127.0.0.1  local.lightapi.net locsignin.lightapi.net

Running the Management Portal

The Light-Portal provides a unified UI for onboarding MCP servers, configuring AI Gateways, and interacting with agents.

cd ~/lightapi
git clone [email protected]:lightapi/portal-view.git
cd portal-view
npm install
npm run dev

Navigate to https://localhost:3000 and log in with your developer credentials.


Cloud Development (Coming Soon)

We are currently preparing a Cloud Development Server. This will allow developers to:

  • Connect to a shared, high-performance AI Gateway.
  • Onboard and test MCP servers without a full local installation.
  • Collaborate on shared agentic workflows and Hindsight memory banks.

Stay tuned for the connection details and onboarding guide for the cloud environment.


Contributing to Light-Fabric

If you are developing for the Rust crates specifically:

cd ~/lightapi
git clone [email protected]:networknt/light-fabric.git
cd light-fabric
cargo build

Model Providers

Light-Fabric provides a unified, high-performance interface for interacting with diverse Large Language Model (LLM) providers. This abstraction is centered around the Provider trait, allowing applications to remain model-agnostic while leveraging advanced capabilities like native tool calling and prompt caching.

The Provider Trait

All model integrations implement the Provider trait, which supports:

  • One-shot and Multi-turn Chat: Simplified APIs for simple prompts and full conversation histories.
  • Structured Tool Calling: Native integration for function calling (OpenAI-style).
  • Capabilities Detection: Programmatic checks for vision, native tool support, and prompt caching.
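A minimal, synchronous sketch of what such a trait might look like. The names here (Provider, Capability, chat, supports) are illustrative, not the actual crates/model-provider API; the real trait would be async and expose richer chat and tool-calling types:

```rust
/// Capabilities a provider can advertise (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Capability {
    NativeToolCalling,
    Vision,
    PromptCaching,
}

/// A simplified, synchronous stand-in for the Provider trait.
pub trait Provider {
    /// One-shot chat: a single prompt in, a single completion out.
    fn chat(&self, prompt: &str) -> Result<String, String>;

    /// Programmatic capability detection.
    fn supports(&self, cap: Capability) -> bool;
}

/// Example implementation for a hypothetical echo backend.
pub struct EchoProvider;

impl Provider for EchoProvider {
    fn chat(&self, prompt: &str) -> Result<String, String> {
        Ok(format!("echo: {prompt}"))
    }

    fn supports(&self, cap: Capability) -> bool {
        // This toy backend only claims tool calling.
        matches!(cap, Capability::NativeToolCalling)
    }
}
```

Because every integration implements the same trait, an application can swap Ollama for a cloud provider without touching its calling code.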

Supported Cloud Providers

Light-Fabric supports the major cloud LLM providers. Because the Provider trait is model-agnostic, newly released models from these providers typically work without framework changes.

  • OpenAI: Native support for the GPT-5 series (5.4, mini, nano), the o4 reasoning models, and full legacy support for GPT-4o and GPT-4 Turbo.
  • Anthropic: Support for the Claude 4 generation, including Opus 4.7, Sonnet, and Haiku.
  • Google Gemini: Support for Gemini 3.1 Pro and Flash, leveraging Vertex AI or AI Studio for multi-modal and long-context tasks.
  • Azure OpenAI: Enterprise-grade OpenAI deployments with support for the latest model deployments.
  • AWS Bedrock: Access to the latest Claude and Titan models hosted on Amazon Web Services.
  • OpenRouter: Access to hundreds of open-source and proprietary models via a single unified API.
  • Telnyx: Support for models hosted on the Telnyx platform.
  • GLM (Zhipu AI): Support for the ChatGLM/GLM-5 series of models.

Local & Specialized Providers

  • Ollama: Seamless integration with local models running on your machine.
  • OpenAI-Compatible: A generic CompatibleProvider for any service implementing the OpenAI REST API.
  • GitHub Copilot: Integration with GitHub Copilot Chat for developer-centric workflows.

Meta-Providers (Orchestration)

These providers wrap other providers to add resilient or intelligent behavior:

  • ReliableProvider: Enhances any base provider with retries, exponential backoff, and automatic failover to fallback models.
  • RouterProvider: Dynamically routes requests to different models based on hints or input complexity.
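The meta-provider idea can be sketched as a wrapper that retries a base provider and then fails over. This is illustrative only: the real ReliableProvider would add exponential backoff and async execution, and AlwaysFails/AlwaysOk are test doubles invented here:

```rust
/// Simplified provider interface for illustration.
pub trait Provider {
    fn chat(&self, prompt: &str) -> Result<String, String>;
}

/// Sketch of ReliableProvider-style behavior: retry the primary
/// provider a bounded number of times, then fail over to fallbacks.
pub struct ReliableProvider {
    pub primary: Box<dyn Provider>,
    pub fallbacks: Vec<Box<dyn Provider>>,
    pub max_retries: usize,
}

impl Provider for ReliableProvider {
    fn chat(&self, prompt: &str) -> Result<String, String> {
        let mut last_err = String::new();
        for _ in 0..=self.max_retries {
            match self.primary.chat(prompt) {
                Ok(out) => return Ok(out),
                Err(e) => last_err = e, // exponential backoff would sleep here
            }
        }
        // Primary exhausted: try fallback models in order.
        for fb in &self.fallbacks {
            if let Ok(out) = fb.chat(prompt) {
                return Ok(out);
            }
        }
        Err(format!("all providers failed; last error: {last_err}"))
    }
}

/// Test doubles: one provider that always fails, one that succeeds.
pub struct AlwaysFails;
impl Provider for AlwaysFails {
    fn chat(&self, _p: &str) -> Result<String, String> { Err("boom".into()) }
}
pub struct AlwaysOk;
impl Provider for AlwaysOk {
    fn chat(&self, p: &str) -> Result<String, String> { Ok(format!("ok: {p}")) }
}
```

Because the wrapper implements the same trait it wraps, meta-providers compose: a RouterProvider could route between several ReliableProvider instances.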

CLI & Tooling Integrations

Light-Fabric includes specialized integrations for developer tools and terminal environments:

  • Claude Code CLI: Integration with Anthropic's Claude Code environment.
  • Gemini CLI: Terminal-based access to Google's Gemini models.
  • KiloCLI: Light-Fabric's native CLI integration for rapid testing and automation.

Key Capabilities

Providers can be queried for their support of advanced features:

  • Native Tool Calling: Efficiently generate structured function calls.
  • Vision: Process images alongside text prompts.
  • Prompt Caching: Leverage provider-side caching to reduce latency and costs for long contexts.

Agentic Workflow Design

Agentic Workflow in Light-Fabric implements a hybrid orchestration model for enterprise business processes. The workflow is deterministic, auditable, and stateful, while selected steps can be executed by agents, API calls, rule engine checks, or humans.

The design goal is not to replace enterprise process control with an open-ended agent loop. The goal is to let agents work inside a managed process that has clear state, clear ownership, repeatable execution, and human approval where needed.

Enterprise Challenge

In regulated or operationally sensitive environments, a purely autonomous AI agent is not enough for long-running business work.

  • Compliance requires deterministic process paths, approval records, and audit history.
  • Reliability requires long-running state to survive process restarts, UI disconnects, and agent failures.
  • Safety requires human-in-the-loop checkpoints for decisions with business, security, or financial impact.
  • Coordination requires multiple humans and roles to participate in the same process.
  • Testing requires the same workflow to run interactively with humans or headlessly with example data.

Light-Fabric solves this by separating orchestration from execution.

Hybrid Model

The workflow is the deterministic process manager. It defines the ordered steps, conditions, retries, error handling, human checkpoints, and outputs.

Agents are workers inside that process. They can reason, call tools, ask for missing data, and use skills, but they do not own the overall process state.

| Feature | Traditional Workflow | Pure Agent Loop | Light-Fabric Hybrid |
|---|---|---|---|
| Path | Fixed | Dynamic | Fixed path with flexible task execution |
| State | Durable | Often transient | Durable workflow and task state |
| Human input | Forms and approvals | Ad hoc chat | First-class waiting tasks |
| Audit | Strong | Weak | Step-level audit and agent trace |
| API calls | Built into code | Tool calls | Spec-described endpoint invocations |
| Testing | Separate test harness | Prompt replay | Same workflow can run live tests |

Core Separation

There are two related specifications:

  1. Agentic Workflow Specification. Describes orchestration: task order, branching, human input, assertions, API calls, retries, errors, exports, and state transitions.

  2. LightAPI Description Specification. Describes API capabilities at the endpoint level: how an endpoint is invoked, what inputs it accepts, what result shape it returns, examples, behavior notes, and result expectations.

This separation is important. The workflow should not duplicate every endpoint contract. It should reference endpoint descriptions and use them to invoke calls, guide agents, and verify results.

Endpoint-Level Consumption

Light-Portal manages API descriptions at the endpoint level, not only at the whole API level.

This is necessary because real workflows often combine one endpoint from one API with one endpoint from another API. For example, onboarding an API to an AI gateway may involve:

  1. register an API
  2. create an API version from a specification
  3. create a development API instance
  4. configure the API through config server
  5. link the API instance to a gateway instance
  6. select endpoints to expose as MCP tools
  7. create a gateway config snapshot
  8. reload the gateway through controller
  9. run MCP tests against the gateway

Each step may come from a different API surface. The workflow consumes only the endpoints it needs.

The recommended model is:

  • API-level descriptions can be authored for convenience and consistency.
  • Endpoint-level descriptions are published and consumed by agents and workflows.
  • Endpoint descriptions inherit shared context such as authentication, environments, sources, and secrets from an API catalog.
  • Agents progressively load endpoint information by disclosure level instead of receiving the entire catalog up front.

Progressive Disclosure

Endpoint descriptions should be disclosed to agents in layers:

  • index: operation id, title, tags, visibility
  • summary: purpose, capability group, lifecycle
  • invocation: input shape, request mapping, auth, examples
  • behavior: result cases, errors, edge cases, assertions
  • full: complete description for debugging or generation

This allows the agent to discover capabilities cheaply, load invocation details only for selected endpoints, and load behavior details only when verification or failure analysis needs it.
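The layers above can be modeled as an ordered enum with a cumulative-visibility rule: requesting a level discloses that level and everything cheaper. The types here are an illustrative sketch, not the shipped model:

```rust
/// The five disclosure levels, ordered from cheapest to most complete.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
pub enum DisclosureLevel {
    Index,
    Summary,
    Invocation,
    Behavior,
    Full,
}

/// Which sections of an endpoint description are visible at a level.
/// An agent asking for `Invocation` also sees Index and Summary data.
pub fn visible_sections(level: DisclosureLevel) -> Vec<&'static str> {
    let all = [
        (DisclosureLevel::Index, "operation id, title, tags, visibility"),
        (DisclosureLevel::Summary, "purpose, capability group, lifecycle"),
        (DisclosureLevel::Invocation, "input shape, request mapping, auth, examples"),
        (DisclosureLevel::Behavior, "result cases, errors, edge cases, assertions"),
        (DisclosureLevel::Full, "complete description"),
    ];
    all.iter()
        .filter(|&&(l, _)| l <= level)
        .map(|&(_, s)| s)
        .collect()
}
```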

Workflow Task Types

The updated workflow specification adds first-class support for the task types needed by agentic API workflows.

Ask Task

ask pauses the workflow and waits for human input. It supports prompts, choices, validation, defaults, timeouts, and sensitive input.

The task returns the user's answer as task output. The normal export block should move the answer into workflow context.

Example:

- ask-authz:
    ask:
      prompt: Do you want to configure endpoint authorization?
      mode: choice
      options:
        - label: Configure authorization
          value: configure
        - label: Skip
          value: skip
    export:
      as:
        authzChoice: ${ .result }

Assert Task

assert validates workflow state or API results. It is used for both live tests and interactive workflows.

It supports simple comparisons, JSONPath-style checks, length checks, regex checks, and rule-engine-backed assertions for complex business logic.

Assertion failures should produce structured, catchable errors so workflows can route failures to remediation, task creation, or agent investigation. Complex business assertions can delegate to Light-Rule.
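A minimal sketch of such a structured failure, assuming a simple equality check; the real assert task would also cover JSONPath, regex, length, and rule-engine checks, and the field names here are illustrative:

```rust
/// A structured, catchable assertion failure that a workflow can
/// route to remediation, task creation, or agent investigation.
#[derive(Debug, PartialEq)]
pub struct AssertionFailure {
    pub path: String,     // where in the result the check looked
    pub expected: String,
    pub actual: String,
}

/// Compare one field of an API result against an expected value.
pub fn assert_eq_field(
    path: &str,
    expected: &str,
    actual: &str,
) -> Result<(), AssertionFailure> {
    if expected == actual {
        Ok(())
    } else {
        Err(AssertionFailure {
            path: path.to_string(),
            expected: expected.to_string(),
            actual: actual.to_string(),
        })
    }
}
```

Returning a typed error rather than panicking is what makes the failure catchable by the surrounding workflow.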

API Call Tasks

The workflow supports direct and description-backed API calls:

  • HTTP / OpenAPI
  • JSON-RPC
  • OpenRPC
  • gRPC
  • MCP tool/resource/prompt calls

For direct internal calls, jsonrpc can be used with an endpoint, method, params, id, notification flag, and error policy.

For cataloged JSON-RPC, openrpc references an OpenRPC document and method.

For MCP, the workflow references a tool, resource, or prompt and passes arguments. MCP capability descriptions belong in the API description layer; the workflow only selects and invokes them.

Explanation Metadata

Tasks can include explain metadata to help an agent or UI explain what is happening.

Useful fields include:

  • purpose
  • visible
  • before
  • success
  • failure
  • requires

Example:

explain:
  purpose: Link the API instance to the development gateway.
  visible: true
  requires:
    - portal-command-token authentication
    - apiInstanceId from prior step

Human Task State

Human-in-the-loop behavior must be represented as durable workflow state.

Recommended task states:

A = active
W = waiting for input
C = completed
F = failed
X = canceled

When an ask or approval task reaches W, the process remains active but the task is no longer picked up by the executor. A user, CLI, scheduler, or agent must complete the task through the workflow API.
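The state model above can be sketched as an enum with an explicit transition table. This is illustrative; the shipped task state handling may allow different transitions:

```rust
/// The recommended task states (A/W/C/F/X).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TaskState {
    Active,          // A
    WaitingForInput, // W
    Completed,       // C
    Failed,          // F
    Canceled,        // X
}

/// Whether a transition is legal. A waiting task moves on only
/// through the workflow API (complete/cancel) or a timeout policy,
/// never back to Active via the executor's normal claim loop.
pub fn can_transition(from: TaskState, to: TaskState) -> bool {
    use TaskState::*;
    matches!(
        (from, to),
        (Active, WaitingForInput)
            | (Active, Completed)
            | (Active, Failed)
            | (Active, Canceled)
            | (WaitingForInput, Completed)
            | (WaitingForInput, Canceled)
            | (WaitingForInput, Failed) // e.g. timeout policy fires
    )
}
```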

Waiting tasks should carry:

  • prompt
  • input mode
  • options
  • validation rules
  • default value
  • sensitive flag
  • assignment metadata
  • explanation metadata
  • timeout policy

Assignment And Worklist

Enterprise workflows need more than chat. Some tasks must be assigned to roles or users and coordinated across multiple humans.

Human tasks should support:

  • assigned user
  • assigned role
  • candidate roles
  • claimed by
  • claimed timestamp
  • due timestamp
  • priority
  • comments
  • audit trail

A role-based task appears in the worklist for users with a matching role. Once claimed, it belongs to the claiming user until completed, released, delegated, or timed out.
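That claiming rule can be sketched with a few of the fields listed above. HumanTask and try_claim are illustrative names, not the shipped schema:

```rust
/// Simplified worklist task with assignment fields.
#[derive(Debug, Default)]
pub struct HumanTask {
    pub assigned_user: Option<String>,
    pub candidate_roles: Vec<String>,
    pub claimed_by: Option<String>,
}

/// A user may claim a task if it is unclaimed and either directly
/// assigned to them or matching one of their roles.
pub fn try_claim(task: &mut HumanTask, user: &str, roles: &[&str]) -> bool {
    if task.claimed_by.is_some() {
        return false; // owned until completed, released, or delegated
    }
    let assigned = task.assigned_user.as_deref() == Some(user);
    let role_match = task
        .candidate_roles
        .iter()
        .any(|r| roles.contains(&r.as_str()));
    if assigned || role_match {
        task.claimed_by = Some(user.to_string());
        true
    } else {
        false
    }
}
```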

Client Architecture

light-workflow should run as a containerized backend service alongside other portal services. It owns workflow execution and state. Portal chat, worklist, CLI, scheduler, and agents are all clients of the same workflow APIs.

The client surfaces are:

  • Portal Chat: conversational guidance for a single user.
  • Worklist: role-based task inbox for approvals, reviews, and coordination.
  • CLI: developer, CI/CD, live test, and automation interface.
  • Scheduler: periodic headless execution, such as hourly live integration tests.
  • Agent: task executor that can call APIs, use skills, and report results back to the workflow.

See Workflow Client Architecture for the dedicated client design.

Workflow Service API

The workflow service should expose one stable API boundary for all clients.

Core operations:

workflow.start
workflow.getInstance
workflow.listInstances
workflow.getEvents
workflow.listTasks
workflow.getTask
workflow.claimTask
workflow.releaseTask
workflow.completeTask
workflow.delegateTask
workflow.cancelInstance

Streaming clients should subscribe to workflow events through Server-Sent Events, WebSocket, or another portal-standard event mechanism.

Important event types:

  • workflow started
  • task started
  • task completed
  • task failed
  • task waiting for input
  • task assigned
  • task claimed
  • task completed by human
  • agent started
  • agent completed
  • workflow completed
  • workflow failed

Live Testing

The same workflow runtime should support interactive runs and headless live tests.

Interactive workflows use ask tasks when decisions or missing values are needed.

Live tests should use example data from LightAPI endpoint descriptions and workflow input fixtures instead of asking the user. Assertions should verify results through assert tasks or rule-engine checks.

This lets the scheduler run workflows every hour against the latest deployed services. When a test fails, the workflow can create a task with the failure detail and assign an agent or human to investigate.

Example: API Onboarding To AI Gateway

An API onboarding workflow can guide a user through a complex multi-endpoint process without requiring a dedicated UI for every operation.

The workflow can:

  1. ask for or infer the API metadata
  2. call the register API endpoint
  3. create an API version from an OpenAPI specification
  4. create a development API instance
  5. configure the API
  6. ask whether fine-grained authorization should be configured
  7. route to create or select authorization rules
  8. link the API instance to the development AI gateway
  9. select endpoints to expose as MCP tools
  10. create a gateway config snapshot
  11. reload the gateway through controller
  12. run MCP tests through the gateway
  13. assert expected results
  14. report success or create remediation tasks

The same workflow can run interactively through portal chat, be managed through the worklist, or run headlessly with examples as a live test.

Technical Implementation

The Light-Fabric implementation is split across:

  • workflow-core: Rust models for the workflow specification.
  • workflow-builder: fluent builders for programmatic workflow construction.
  • light-workflow: runtime service and executor.
  • light-agent: agent execution surface for delegated agent tasks.
  • light-rule: rule engine used by workflow and assertion tasks. See Light-Rule Design.

Runtime responsibilities include:

  • deserializing workflow definitions
  • claiming active tasks
  • executing supported task types
  • storing task output
  • applying exports into process context
  • creating next tasks
  • pausing waiting tasks
  • resuming after human completion
  • failing or completing process instances
  • exposing workflow APIs to clients

The current executable slice supports API invocation and verification tasks such as HTTP, JSON-RPC, OpenRPC, MCP over enterprise HTTP transports, rules, assertions, and waiting human input. MCP stdio transport is intentionally not a priority for enterprise deployment.

Design Rule

There must be one workflow runtime and one task state model.

Chat, worklist, CLI, scheduler, and agents should never implement their own workflow execution. They should all use the same light-workflow service APIs.

This keeps enterprise workflow behavior auditable, testable, and consistent regardless of how a process is started, resumed, or observed.

Workflow Client Architecture

Light-Fabric workflow execution should run as a containerized backend service, not as logic embedded in a portal screen, CLI, scheduler, or agent. The workflow service owns process state, task state, audit records, API invocation, agent invocation, and human-in-the-loop transitions. Clients are thin interaction surfaces over the same service APIs.

This separation lets the same workflow instance be driven by a portal chat session, a worklist user, a CLI command, a scheduler, or an AI agent without creating multiple execution models.

Goals

  • Provide one authoritative workflow runtime for long-running enterprise processes.
  • Support human-in-the-loop tasks from both conversational and worklist interfaces.
  • Support headless execution for live tests, scheduled runs, and CI/CD.
  • Keep all clients stateless or lightly stateful; workflow state lives in light-workflow.
  • Make role assignment, audit, and retry behavior consistent across UI, CLI, scheduler, and agent use.

Runtime Service

light-workflow should be deployed as a portal service in a container alongside the other portal services. It should expose APIs for workflow definitions, workflow instances, task claiming, task completion, event streaming, and operational control.

The service is responsible for:

  • loading workflow definitions
  • starting workflow instances
  • persisting process_info_t and task_info_t
  • executing API calls and assertions
  • invoking agents for agent-owned tasks
  • pausing on ask and approval tasks
  • assigning human tasks to users or roles
  • resuming workflows when a human answer is submitted
  • emitting workflow and task events
  • recording audit history

Clients should never execute workflow steps themselves. They should only start workflows, inspect workflow state, and complete assigned tasks.

Client Surfaces

Portal Chat

The portal chat client is the guided conversational interface for a single user working through a process. It is useful when the workflow needs to ask clarifying questions, explain the next action, or guide a user through a complex multi-endpoint operation.

Typical uses:

  • API onboarding
  • API endpoint publication to an AI gateway
  • guided configuration
  • troubleshooting and remediation workflows
  • interactive approval with explanation

The chat client should call the workflow service for current state and submit answers to waiting tasks. It may stream workflow events and render agent explanations, but it should not own workflow state.

Worklist

The worklist is the enterprise task inbox. It is the right interface for multi-user coordination, role-based assignment, approvals, escalations, and audit-sensitive operations.

Typical uses:

  • approval tasks
  • compliance review
  • operations handoff
  • role-based queue processing
  • task claim and release
  • delegated work
  • due-date and priority management

The worklist should be built around waiting human tasks. A task may have:

  • assigned user
  • candidate roles
  • assigned role
  • priority
  • due time
  • claim status
  • comments
  • completion payload
  • audit trail

The worklist is especially important because many enterprise workflows are not purely conversational. They need accountable ownership and coordination between multiple humans.

CLI

The CLI is a developer and automation client. It should use the same workflow service APIs as portal-view and should not contain separate execution logic.

Typical uses:

  • local workflow testing
  • live parity tests
  • CI/CD automation
  • scheduled headless runs
  • debugging stuck workflow instances
  • submitting test data
  • completing simple waiting tasks from scripts

Example commands:

light-workflow start portal.onboard-api --input input.yaml
light-workflow status <instance-id>
light-workflow tasks --role portal-admin
light-workflow claim <task-id>
light-workflow answer <task-id> --value approve
light-workflow logs <instance-id>
light-workflow cancel <instance-id>

The CLI should be added after the workflow APIs stabilize. It will be valuable for developers and automation, but the worklist and portal chat should drive the primary enterprise UX.

API Boundary

The workflow service should expose a stable API boundary that all clients use. The API can be HTTP, JSON-RPC, or both, but the concepts should remain the same.

Core operations:

workflow.start
workflow.getInstance
workflow.listInstances
workflow.getEvents
workflow.listTasks
workflow.getTask
workflow.claimTask
workflow.releaseTask
workflow.completeTask
workflow.delegateTask
workflow.cancelInstance

For streaming clients, the service should expose workflow events through Server-Sent Events, WebSocket, or another portal-standard event mechanism.

Important event types:

  • workflow started
  • task started
  • task completed
  • task failed
  • task waiting for input
  • task assigned
  • task claimed
  • task completed by human
  • agent started
  • agent completed
  • workflow completed
  • workflow failed

Human Task State

ask and approval-style tasks should enter a waiting state. While waiting, the workflow instance remains active, but the task is no longer executable by the worker loop until a human answer is submitted.

Recommended states:

A = active
W = waiting for input
C = completed
F = failed
X = canceled

The waiting task should include enough metadata for all clients:

  • prompt
  • input mode
  • options
  • validation rules
  • default value
  • sensitivity flag
  • assignment metadata
  • explanation metadata
  • timeout policy

The completion API should validate submitted input against the task definition before resuming the workflow.
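A sketch of that validation step, assuming a simplified ask definition; AskDefinition, Mode, options, and required are illustrative names:

```rust
/// Minimal ask-task definition against which an answer is checked
/// before the workflow is resumed.
pub struct AskDefinition {
    pub mode: Mode,
    pub options: Vec<String>, // allowed values for choice mode
    pub required: bool,
}

pub enum Mode { Choice, FreeText }

/// Reject empty required answers and choice answers outside the
/// declared options; anything else is accepted.
pub fn validate_answer(def: &AskDefinition, answer: &str) -> Result<(), String> {
    if answer.is_empty() {
        return if def.required {
            Err("answer required".into())
        } else {
            Ok(())
        };
    }
    match def.mode {
        Mode::Choice if !def.options.iter().any(|o| o.as_str() == answer) => {
            Err(format!("'{answer}' is not one of the allowed options"))
        }
        _ => Ok(()),
    }
}
```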

Assignment Model

Human tasks should support both direct assignment and role-based queues.

Recommended fields:

assigned_user
assigned_role
candidate_roles
claimed_by
claimed_ts
due_ts
priority
comments

A role-based task can appear in the worklist for all users with a matching role. Once a user claims it, the task becomes owned by that user until completed, released, delegated, or timed out.

Rollout Order

  1. Implement stable workflow service APIs for start, status, events, task list, task claim, and task completion.
  2. Harden the ask resume path and waiting task state machine.
  3. Build the worklist because it forces the assignment, audit, and state model to be correct.
  4. Build the portal chat workflow interaction on top of the same task APIs.
  5. Add the CLI after the API shape stabilizes.
  6. Add scheduler integration for hourly live tests and headless workflow runs.

Design Rule

There must be one workflow runtime and one task state model. Chat, worklist, CLI, scheduler, and agents are only clients of that runtime.

This keeps enterprise workflow behavior auditable, testable, and consistent regardless of how a workflow is started or resumed.

LightAPI Description Design

LightAPI Description is the endpoint capability specification used by Light-Fabric agents, workflows, live tests, and portal API administration.

It describes how an API endpoint is discovered, invoked, explained, and verified. It is intentionally separate from the Agentic Workflow Specification. Workflow describes process orchestration. LightAPI describes endpoint capability.

Why LightAPI

OpenAPI is useful for REST APIs, and OpenRPC is useful for JSON-RPC APIs, but Light-Fabric needs a common description model across multiple enterprise protocols:

  • REST / HTTP
  • OpenAPI-described HTTP
  • JSON-RPC 2.0
  • OpenRPC-described JSON-RPC
  • gRPC
  • MCP tools, resources, and prompts

LightAPI provides a single agent-facing and workflow-facing description layer over these protocols.

The goal is not to replace OpenAPI or OpenRPC. The goal is to reference them where they exist and add the missing information needed by agents and workflow live tests.

API-Level Authoring, Endpoint-Level Consumption

Light-Portal may let teams author descriptions at the API level for convenience. However, workflows and agents consume descriptions at the endpoint level.

This distinction is important because real workflow processes rarely use a whole API. They usually combine selected endpoints from multiple APIs.

For example, onboarding an API to an AI gateway may consume:

  • one endpoint from API registration
  • one endpoint from API version management
  • one endpoint from API instance management
  • one endpoint from config server
  • one endpoint from gateway linking
  • one endpoint from controller reload
  • one or more MCP tools exposed through the gateway

Each consumed operation should have an endpoint-level description with a stable endpointId.

API-level descriptions are still useful as catalogs. Endpoint-level descriptions may inherit shared API context such as:

  • environments
  • authentication
  • secrets
  • sources
  • common tags
  • lifecycle metadata

Relationship To Agentic Workflow

Agentic Workflow and LightAPI have different responsibilities.

| Concern | Agentic Workflow | LightAPI Description |
|---|---|---|
| Process order | Yes | No |
| Branching and retries | Yes | No |
| Human-in-the-loop | Yes | No |
| Endpoint invocation contract | Reference only | Yes |
| Input and result examples | Optional workflow fixtures | Yes |
| Result verification expectations | Calls assert | Describes expected result cases |
| Agent progressive disclosure | Uses selected endpoints | Defines disclosure levels |
| Live testing | Orchestrates execution | Supplies examples and expected results |

In live tests, the workflow should use example data from LightAPI descriptions and workflow fixtures instead of asking for user input.

In interactive runs, the workflow may ask the user for missing values, then invoke endpoints described by LightAPI.

Relationship To Centralized Agent Skills

LightAPI endpoint descriptions are a source of agent skills.

The centralized skill registry should not require every API operation to be manually rewritten as a separate skill. Instead, Light-Portal can publish selected LightAPI endpoint descriptions into the skill registry as invokable capabilities.

The skill registry adds:

  • permission-aware discovery
  • semantic search
  • skill grouping
  • agent persona scoping
  • audit around skill disclosure and execution

LightAPI provides:

  • endpoint identity
  • protocol details
  • input schema
  • request mapping
  • result shape
  • examples
  • behavior notes
  • result cases

Together, they allow an agent to discover a capability as a skill, progressively load only the endpoint details it needs, and execute through the workflow or controller runtime.

See Centralized Agentic Skill Registry for the skill registry design.

Core Document Concepts

A LightAPI document should support both API-level catalogs and endpoint-level documents.

Important top-level concepts:

  • lightapi: specification version
  • profile: api or endpoint
  • info: name, title, version, namespace, owner, contact
  • context: inherited catalog context for endpoint-level documents
  • sources: OpenAPI, OpenRPC, protobuf, MCP, or raw protocol references
  • environments: environment-specific server details
  • secrets: required secret names
  • authentications: reusable authentication policies
  • operations: endpoint operation descriptions
  • testSequences: linear endpoint test sequences
  • agent: progressive disclosure and skill metadata

For profile: endpoint, the document should describe at most one operation.
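That rule can be checked mechanically. A sketch with illustrative types (Profile, LightApiDoc, and validate_profile are invented here, not the shipped model):

```rust
/// The two document profiles named above.
#[derive(Debug, PartialEq)]
pub enum Profile { Api, Endpoint }

/// Simplified document: just the profile and its operation ids.
pub struct LightApiDoc {
    pub profile: Profile,
    pub operations: Vec<String>,
}

/// An endpoint-level document carries at most one operation;
/// an API-level catalog may carry many.
pub fn validate_profile(doc: &LightApiDoc) -> Result<(), String> {
    if doc.profile == Profile::Endpoint && doc.operations.len() > 1 {
        return Err(format!(
            "endpoint profile allows at most one operation, found {}",
            doc.operations.len()
        ));
    }
    Ok(())
}
```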

Operation Model

Each operation represents one endpoint-level capability.

Common fields include:

  • operationId: local operation identifier
  • endpointId: globally stable endpoint identifier
  • title
  • summary
  • description
  • visibility
  • lifecycle
  • tags
  • capability
  • agent
  • input
  • request
  • result
  • examples

The input section describes the logical interface the agent or workflow sees.

The request section describes how logical input maps to the wire protocol.

The result section describes expected output, result cases, and failure shapes.
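Taken together, the top-level concepts and operation fields above can be sketched as a single endpoint-level document. This is a hypothetical illustration of the shape only, not the normative schema; the field names follow the concept lists, but the exact layout, values, and nesting are assumptions:

```yaml
# Hypothetical endpoint-level LightAPI document (illustrative layout).
lightapi: "0.1"
profile: endpoint
info:
  name: market-data
  title: Market Data API
  version: 1.0.0
  namespace: com.example.finance
sources:
  openapi: ./market-data-openapi.yaml
environments:
  dev:
    server: https://dev.api.example.com
authentications:
  default:
    type: oauth2
operations:
  - operationId: getQuote
    endpointId: ep_market_quote_001
    summary: Return the latest quote for a ticker
    input:                 # the logical interface the agent or workflow sees
      ticker: { type: string, required: true }
    request:               # how logical input maps to the wire protocol
      method: GET
      path: /quotes/{ticker}
    result:
      cases:
        - name: found
          status: 200
        - name: unknown-ticker
          status: 404
    examples:
      - input: { ticker: AAPL }
        expect: found
```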

Protocol Coverage

HTTP And OpenAPI

For raw HTTP, the operation describes method, endpoint, headers, query, path, and body mappings.

For OpenAPI, LightAPI references the OpenAPI document and operation, then adds agent-oriented behavior, examples, and result expectations.

JSON-RPC And OpenRPC

For direct JSON-RPC, the operation describes endpoint, method, params, id behavior, notification behavior, and error policy.

For OpenRPC, LightAPI references the OpenRPC document and method. The workflow runtime can use the OpenRPC document to validate that the method exists and that required params are present before calling it.
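The pre-call check described above can be sketched in Rust. This assumes the OpenRPC document has already been parsed into a simplified in-memory form (the `MethodSpec` type here is hypothetical; a real implementation would deserialize the full OpenRPC JSON):

```rust
use std::collections::{HashMap, HashSet};

/// Minimal, illustrative view of an OpenRPC method: which params it
/// requires. (A real implementation would parse the full document.)
#[derive(Clone)]
pub struct MethodSpec {
    pub required_params: HashSet<String>,
}

/// Hypothetical pre-call validation: confirm the method exists and
/// every required param is supplied before issuing the JSON-RPC call.
pub fn validate_call(
    methods: &HashMap<String, MethodSpec>,
    method: &str,
    supplied: &HashSet<String>,
) -> Result<(), String> {
    let spec = methods
        .get(method)
        .ok_or_else(|| format!("unknown method: {method}"))?;
    let missing: Vec<&String> = spec.required_params.difference(supplied).collect();
    if missing.is_empty() {
        Ok(())
    } else {
        Err(format!("missing required params: {missing:?}"))
    }
}
```

Failing this check before the call lets the workflow report a precise configuration error instead of a protocol-level failure.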

gRPC

For gRPC, the operation describes service, method, protobuf source, transport, metadata, request mapping, and result mapping.

For browser or gateway-mediated enterprise deployments, gRPC over WebSocket can be represented as a transport on the structured protocol operation.

MCP

For MCP, the operation describes tool, resource, or prompt invocation.

Tool listing alone is not enough. The description must also include:

  • input schema
  • result shape
  • examples
  • behavior differences for important input cases
  • error cases
  • verification expectations

MCP stdio is not a priority for enterprise portal deployment. HTTP and streamable HTTP transports should be the main runtime targets.

Result Cases And Verification

LightAPI should describe expected result behavior, but Agentic Workflow should execute the actual assertions.

This keeps verification orchestration in one place.

Recommended model:

  • LightAPI operation result cases describe expected outputs, failure shapes, and examples.
  • Workflow test steps invoke the operation.
  • Workflow assert tasks verify actual output against expected result cases.
  • Complex business checks can call the rule engine.

This allows the same endpoint description to support:

  • agent skill usage
  • workflow execution
  • live integration testing
  • failure diagnosis

Progressive Disclosure For Agents

A LightAPI document should support progressive disclosure so an agent can load only the information needed at each stage.

Recommended levels:

  • index: endpoint id, title, tags, visibility
  • summary: purpose, capability group, lifecycle
  • invocation: input schema, request mapping, authentication, examples
  • behavior: result cases, edge cases, errors, assertions
  • full: complete endpoint description

The portal can expose query APIs such as:

lightapi.listOperations
lightapi.getOperation
lightapi.getCapabilityGroup

Agents should start with index or summary data, load invocation details only for selected endpoints, and load behavior details only for testing, troubleshooting, or failure repair.
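The five levels can be modeled directly. This is a sketch under the assumption that each level maps to a fixed set of top-level sections; the section names mirror the lists above, but the helper itself is illustrative, not a defined API:

```rust
/// The five disclosure levels from the design, ordered from cheapest
/// to most expensive in context tokens.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum DisclosureLevel {
    Index,
    Summary,
    Invocation,
    Behavior,
    Full,
}

/// Hypothetical helper: which sections of an endpoint description a
/// portal query would return at each level.
pub fn sections_for(level: DisclosureLevel) -> &'static [&'static str] {
    match level {
        DisclosureLevel::Index => &["endpointId", "title", "tags", "visibility"],
        DisclosureLevel::Summary => &["purpose", "capability", "lifecycle"],
        DisclosureLevel::Invocation => &["input", "request", "authentications", "examples"],
        DisclosureLevel::Behavior => &["result", "edgeCases", "errors", "assertions"],
        DisclosureLevel::Full => &["*"],
    }
}
```

The derived ordering lets a caller compare levels, e.g. never downgrade a cached description to a cheaper level mid-task.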

Portal Publishing Flow

Light-Portal should manage endpoint descriptions as part of API endpoint administration.

Recommended flow:

  1. API owner creates or imports API metadata.
  2. Portal extracts initial endpoint descriptions from OpenAPI, OpenRPC, protobuf, MCP, or raw endpoint configuration.
  3. API owner enriches endpoint descriptions with examples, behavior notes, result cases, and visibility.
  4. Portal stores endpoint-level LightAPI descriptions.
  5. Authorized agents and workflows query descriptions by endpoint, tag, lifecycle, visibility, or capability.
  6. Selected endpoints can be published into the centralized skill registry.
  7. Workflow instances reference endpoint descriptions during execution and live testing.

Live Test Use

Live tests should be workflow-driven.

LightAPI supplies:

  • example input data
  • expected result cases
  • protocol invocation details
  • error behavior

Agentic Workflow supplies:

  • sequence
  • fixtures
  • environment selection
  • endpoint invocation
  • assertions
  • failure routing
  • task creation
  • agent assignment

This avoids building a second test runner model outside the workflow engine.

Design Rule

LightAPI describes endpoint capability. Agentic Workflow orchestrates endpoint use. Centralized Skills expose selected capabilities to agents.

Keeping these responsibilities separate lets Light-Fabric support API administration, agent skill discovery, workflow execution, and live integration testing without duplicating endpoint definitions across multiple systems.

Light-Rule Design

Light-Rule is the local YAML rule engine used by Light-Fabric services and workflows for deterministic business checks, transformations, authorization decisions, and workflow assertions.

It complements agentic workflow by keeping critical decisions explicit, repeatable, and auditable. Agents can propose or select rules, but the rule engine executes the deterministic logic.

Purpose

Light-Rule is designed for enterprise services that need fast local policy and transformation logic without a database call on every request.

Primary uses:

  • fine-grained authorization
  • request transformation
  • response transformation
  • workflow assertions
  • business validation
  • permission and filter injection
  • reusable rule templates selected from Light-Portal

The rule configuration is loaded locally by the target service. When permissions or rule mappings change, the controller can trigger a config reload so the service swaps to the latest rules.

Relationship To Agentic Workflow

Agentic Workflow orchestrates process steps. Light-Rule evaluates deterministic logic inside those steps.

Workflow uses Light-Rule in two main ways:

  1. Rule call task: a workflow task can call a named rule to validate or mutate workflow context.

  2. Assert task extension: simple checks can be handled directly by assert, while complex business checks can delegate to Light-Rule.

This separation keeps workflows readable. The workflow says when a check happens; Light-Rule defines the reusable business logic for the check.

Example workflow responsibilities:

  • decide when authorization configuration is needed
  • select or create a rule
  • invoke a rule during live testing
  • route failures to a human or agent

Example Light-Rule responsibilities:

  • evaluate role, group, position, or attribute checks
  • inject endpoint permissions into the context
  • compute row or column filters
  • execute transformation plugins
  • return pass/fail for business assertions

See Agentic Workflow Design for the workflow orchestration model.

Relationship To LightAPI

LightAPI endpoint descriptions describe endpoint invocation and expected result behavior. Light-Rule can implement complex result checks that are too business-specific for simple schema assertions.

Recommended model:

  • LightAPI describes endpoint result cases and expected behavior.
  • Agentic Workflow invokes the endpoint and runs assert tasks.
  • assert handles simple checks directly.
  • Light-Rule handles complex checks, authorization logic, row filters, column filters, and reusable business policies.

See LightAPI Description Design for endpoint capability descriptions.

Rule Specification

Rules are described by the rule specification in rule-specification/schema/rule.yaml.

The top-level configuration contains:

  • ruleBodies: named rule definitions
  • endpointRules: endpoint-to-rule mappings

Each rule can contain:

  • ruleId
  • ruleDesc
  • version
  • author
  • updatedAt
  • conditions
  • actions

Each endpoint mapping can contain:

  • req-tra: request transformation rules
  • res-tra: response transformation rules
  • access-control: access control rules
  • permission: permission values injected into context
  • x-*: extension rule phases

Rule Conditions

Conditions evaluate fields in the input context.

Supported operand forms:

  • direct field: role
  • dotted path: user.role
  • JSON Pointer: /user/role
  • JSONPath-like path: $.user.roles[0]

Supported operators:

==
!=
>
<
>=
<=
eq
ne
contains
matches
startsWith
endsWith
exists
notExists

The expected value is typed and may be a string, number, boolean, array, object, or null.

Flat condition arrays are evaluated left-to-right. joinCode combines the current condition with the previous result.

A AND B OR C

is evaluated as:

(A AND B) OR C

If explicit grouping is required, split logic into multiple rules and combine them through endpoint mapping or workflow orchestration.
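The left-to-right fold can be sketched as follows. The `JoinCode` and `Evaluated` types are illustrative stand-ins (a real engine would evaluate the operand and operator first); the point is that there is no operator precedence, only accumulation:

```rust
/// joinCode for a condition: how its result combines with the
/// accumulated result of everything to its left.
#[derive(Clone, Copy)]
pub enum JoinCode {
    And,
    Or,
}

/// A condition reduced to its evaluated boolean plus its joinCode.
/// The first condition's joinCode is ignored.
pub struct Evaluated {
    pub join: JoinCode,
    pub result: bool,
}

/// Left-to-right fold: `A AND B OR C` becomes `(A AND B) OR C`,
/// with no precedence and no grouping.
pub fn combine(conditions: &[Evaluated]) -> bool {
    let mut iter = conditions.iter();
    let mut acc = match iter.next() {
        Some(first) => first.result,
        None => return true, // an empty condition list passes
    };
    for c in iter {
        acc = match c.join {
            JoinCode::And => acc && c.result,
            JoinCode::Or => acc || c.result,
        };
    }
    acc
}
```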

Rule Actions

Actions execute plugin logic after conditions pass.

An action contains:

  • actionId
  • actionClassName
  • actionValues

actionClassName identifies the registered plugin. actionValues carries plugin-specific configuration.

Typical action plugins:

  • add values to request context
  • inject permission attributes
  • compute filters
  • transform request body
  • transform response body
  • call a local business function

Actions are intentionally plugin-based so the schema remains stable while implementation logic can evolve.
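The plugin seam can be sketched with a trait and a registry map. This is a minimal illustration, not the crates/light-rule API: the context is simplified to a string map, and `InjectValues` is a hypothetical plugin standing in for permission injection:

```rust
use std::collections::HashMap;

/// Simplified evaluation context; a real context would hold typed JSON.
pub type RuleContext = HashMap<String, String>;

/// An action plugin receives its configured actionValues plus the
/// mutable context.
pub trait ActionPlugin {
    fn execute(&self, action_values: &HashMap<String, String>, ctx: &mut RuleContext);
}

/// Illustrative plugin that copies actionValues into the context,
/// e.g. injecting permission attributes.
pub struct InjectValues;

impl ActionPlugin for InjectValues {
    fn execute(&self, action_values: &HashMap<String, String>, ctx: &mut RuleContext) {
        for (k, v) in action_values {
            ctx.insert(k.clone(), v.clone());
        }
    }
}

/// Registry mapping actionClassName to a plugin instance.
#[derive(Default)]
pub struct ActionRegistry {
    plugins: HashMap<String, Box<dyn ActionPlugin>>,
}

impl ActionRegistry {
    pub fn register(&mut self, class_name: &str, plugin: Box<dyn ActionPlugin>) {
        self.plugins.insert(class_name.to_string(), plugin);
    }

    /// Returns false for an unknown actionClassName instead of panicking.
    pub fn run(
        &self,
        class_name: &str,
        values: &HashMap<String, String>,
        ctx: &mut RuleContext,
    ) -> bool {
        match self.plugins.get(class_name) {
            Some(p) => {
                p.execute(values, ctx);
                true
            }
            None => false,
        }
    }
}
```

Because the schema only names the plugin, new action behavior can ship as a new registration without any change to rule YAML.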

Endpoint Rule Phases

Endpoint mappings define when rules run.

Request Transformation

req-tra rules run before the service handles the request. They can enrich or transform request context.

Response Transformation

res-tra rules run after the service produces a response. They can filter, redact, or reshape response data.

Access Control

access-control rules validate whether a request is allowed. These rules normally run in parallel because they should not mutate shared state.

Permission Injection

permission values are injected into the evaluation context before rule execution. This lets API owners configure roles, groups, attributes, row filters, or column filters without editing the technical rule body.

Extension Phases

Custom phases must use the x-* prefix. This avoids silent typos in standard phase names while preserving controlled extensibility.
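A validator for phase names is a one-liner; a sketch, assuming the four standard phase names listed above:

```rust
/// Standard endpoint rule phases defined by the specification.
const STANDARD_PHASES: &[&str] = &["req-tra", "res-tra", "access-control", "permission"];

/// A phase name is valid if it is a standard phase or uses the x-
/// extension prefix. A typo such as "req-trans" is rejected instead
/// of being silently ignored.
pub fn is_valid_phase(name: &str) -> bool {
    STANDARD_PHASES.contains(&name) || name.starts_with("x-")
}
```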

Execution Model

The Rust implementation lives in crates/light-rule.

Core components:

  • RuleConfig: top-level config model
  • Rule: rule definition
  • RuleCondition: condition model
  • RuleAction: action model
  • RuleEngine: evaluates one rule
  • ActionRegistry: maps action class names to plugins
  • MultiThreadRuleExecutor: executes rule lists and endpoint phase mappings

Sequential phases such as req-tra and res-tra should execute every matched rule in order, so transformations are applied deterministically.

Access control can run in parallel because it should be a validation step rather than a mutation step.

Why Not Replace With Cedar Or Casbin

Cedar and Casbin are strong policy engines, but Light-Rule has a different role in this platform.

Light-Rule supports:

  • local YAML configuration
  • request and response transformation
  • permission injection
  • row and column filters
  • endpoint-specific rule selection
  • technical-team-authored reusable rules
  • API-owner-selected rule parameters
  • config reload through controller

Cedar is excellent for authorization policy, but it does not naturally cover transformation, row filter, and column filter use cases. Casbin is strong for policy enforcement, but it introduces a different policy storage and matching model.

Light-Rule should remain the native rule engine for Light-Fabric service configuration and workflow assertions. External policy engines can still be integrated as action plugins if needed.

Governance

Rule bodies should be authored and reviewed like code or controlled configuration.

Recommended governance metadata:

  • version
  • author
  • updatedAt
  • ruleDesc

Recommended operational controls:

  • validate rule YAML against the schema before publishing
  • reject endpoint phase typos
  • keep ruleId equal to the ruleBodies map key
  • audit rule publication and reload events
  • test rules with representative input contexts
  • use workflow live tests to verify rules in integrated environments

Workflow Live Testing

Light-Rule is useful in live tests because it can express business checks that are more specific than generic JSON assertions.

Example flow:

  1. Workflow invokes an endpoint using LightAPI description.
  2. Workflow captures the endpoint response.
  3. assert verifies simple fields.
  4. A rule task validates business-specific behavior.
  5. On failure, workflow creates a task for a human or agent to investigate.

This keeps live test orchestration in workflow while preserving reusable business rules in Light-Rule.

Design Rule

Use workflow for process control. Use LightAPI for endpoint capability. Use Light-Rule for deterministic business logic.

Agents may select, explain, or help author rules, but the rule engine should execute the final deterministic decision.

Design Document: Centralized Agentic Skill Registry

Subject: Transitioning from File-Based Markdown Skills to a Database-Backed Skill Registry


1. Executive Summary

Currently, most AI agent frameworks rely on localized Markdown (.md) files to define agent "skills." While Markdown is highly LLM-native and human-readable, it creates significant bottlenecks at an enterprise scale regarding strict typing, API integration, and context window limits.

This document proposes transitioning to an Agentic Control Plane (Centralized Skill Registry) backed by a database. By decoupling skill metadata, schemas, and instructions, and by utilizing dynamic routing, we will achieve hierarchical structuring, strict schema enforcement, and progressive disclosure of tools to agents.


2. Problem Statement

Managing agent skills as flat Markdown files introduces several scaling challenges:

  1. Lack of Strict Typing: Markdown cannot enforce data types (e.g., ensuring a parameter is an integer vs. string), leading to hallucinated or malformed tool inputs.
  2. Context Window Exhaustion: Loading dozens or hundreds of skill definitions at startup overwhelms the LLM context window, increasing latency, token costs, and the risk of tool misuse.
  3. Static Deployments: Updating a skill or changing access permissions requires a full application redeploy.
  4. Poor Discoverability: Flat file structures offer no native mechanism for progressive disclosure or tool search.

3. Data Models & Formats

To solve the limitations of purely text-based skills, we will adopt a hybrid, structured format stored within a database (e.g., PostgreSQL/MongoDB). The architecture uses the right format for the right job:

  • JSON Schema: Used strictly for defining parameters, inputs, and tool shapes. Natively supported by OpenAI/Anthropic/Google tool-calling APIs.
  • LightAPI Description (YAML/JSON): Used to map endpoint-level API capabilities to skills across REST, JSON-RPC, gRPC, and MCP.
  • OpenAPI / OpenRPC / Protobuf: Referenced by LightAPI where protocol-native specifications already exist.
  • Executable Code (Python/JS) / URI: Stores the actual execution logic or the endpoint reference.
  • Markdown: Retained only for the instructions or prompt fields, as LLMs excel at parsing markdown headers and lists for constraints and persona instructions.

LightAPI is the preferred source format for API-backed skills because it describes endpoint identity, protocol invocation, input schema, request mapping, result shape, examples, and behavior notes in one agent-oriented document. See LightAPI Description Design for the endpoint description model.

3.1 Proposed Database Schema Structure

The centralized Controller will store skills in a structured table/collection. Below is a representation of the skill payload:

{
  "skill_id": "sk_finance_001",
  "name": "generate_financial_report",
  "version": "1.2.0",
  "tags": ["finance", "reporting"],
  "tool_schema": {
    "type": "function",
    "function": {
      "name": "generate_financial_report",
      "description": "Generates a Q3 report based on ticker symbol.",
      "parameters": {
        "type": "object",
        "properties": {
          "ticker": {"type": "string", "description": "The stock ticker"}
        },
        "required": ["ticker"]
      },
      "response_schema": {
        "type": "object",
        "properties": {
          "report_url": {"type": "string"},
          "status": {"type": "string"}
        }
      }
    }
  },
  "execution": {
    "type": "rest_api",
    "endpoint_id": "ep_finance_report_001",
    "endpoint": "https://internal-api.company.com/v1/finance/report",
    "method": "POST"
  },
  "instructions": "## Role\nYou are a financial analyst.\n## Constraints\n- Never hallucinate financial data.\n- Always return exact numbers."
}

4. Hierarchical Structure & Progressive Disclosure

Dumping 500 JSON schemas into an LLM's context window will cause system failure. The Centralized Controller will act as a mediator, enforcing hierarchy and progressive disclosure (giving the agent only the schemas it needs, exactly when it needs them).

4.1 Implementing Hierarchy & Tagging

Because JSON Schema does not have built-in folders, hierarchy and categorization are enforced via the platform's global entity management system:

  1. Namespacing: Tool names follow a strict convention: [domain]_[subdomain]_[action] (e.g., aws_rds_provision).
  2. Tags & Categories: Instead of hardcoded columns, the registry utilizes the entity_tag_t and entity_category_t tables (with entity_type = 'skill'). This allows for unlimited flat tagging and deep hierarchical folder structures that are consistent across the entire portal.
  3. Discovery API: The Controller's discovery service filters by these tags/categories to provide scoped skill sets for specific agent personas.
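The namespacing convention can be enforced at publish time. A minimal sketch, assuming the convention means at least three non-empty, lowercase alphanumeric segments separated by underscores (the exact rule would be defined by the registry):

```rust
/// Checks the [domain]_[subdomain]_[action] convention, e.g.
/// "aws_rds_provision": at least three non-empty segments, each
/// lowercase ASCII alphanumeric.
pub fn follows_naming_convention(tool_name: &str) -> bool {
    let segments: Vec<&str> = tool_name.split('_').collect();
    segments.len() >= 3
        && segments.iter().all(|s| {
            !s.is_empty()
                && s.chars()
                    .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit())
        })
}
```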

4.2 Progressive Disclosure Patterns

Agents will no longer load all skills at startup. Instead, the Controller will mediate access using one (or a combination) of the following patterns:

Pattern A: Meta-Tools (Dynamic Injection)

The agent is booted with only two "meta-tools" designed for discovery.

  1. search_skills(query): Agent searches the DB. The Controller returns lightweight summaries (Name + Description only, no heavy schemas).
  2. load_skill_schema(skill_name): Once the agent identifies the correct tool, it calls this. The Controller dynamically injects the heavy JSON schema into the context for the next turn.
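The two meta-tools can be sketched against an in-memory registry. The types and the substring search are illustrative stand-ins (a real Controller would query the database and apply permission filters); the point is the split between cheap summaries and on-demand schemas:

```rust
use std::collections::HashMap;

/// A stored skill: a lightweight summary plus the heavy schema that
/// is only disclosed on demand.
pub struct Skill {
    pub description: String,
    pub schema_json: String, // full JSON Schema, loaded lazily
}

pub struct SkillRegistry {
    skills: HashMap<String, Skill>,
}

impl SkillRegistry {
    pub fn new(skills: HashMap<String, Skill>) -> Self {
        Self { skills }
    }

    /// Meta-tool 1: returns (name, description) summaries only, keeping
    /// the agent's context small. Substring match stands in for real
    /// keyword or semantic search.
    pub fn search_skills(&self, query: &str) -> Vec<(String, String)> {
        self.skills
            .iter()
            .filter(|(name, s)| name.contains(query) || s.description.contains(query))
            .map(|(name, s)| (name.clone(), s.description.clone()))
            .collect()
    }

    /// Meta-tool 2: once the agent selects a skill, load its full
    /// schema for injection into the next turn's context.
    pub fn load_skill_schema(&self, skill_name: &str) -> Option<&str> {
        self.skills.get(skill_name).map(|s| s.schema_json.as_str())
    }
}
```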

Pattern B: Semantic Tool RAG (Zero-Shot Discovery)

For highly complex systems with thousands of skills:

  1. Tool descriptions are embedded into a Vector Database (e.g., pgvector).
  2. When the user prompts the system (e.g., "Reset my AWS password"), the Controller intercepts the prompt, performs a semantic search, and retrieves the Top-3 most relevant JSON Schemas.
  3. The agent boots with only those 3 tools in its context.

Pattern C: Multi-Agent Orchestration (Supervisor / Worker)

Hierarchy is mapped to agent teams.

  1. A Supervisor Agent holds routing tools (e.g., delegate_to_finance, delegate_to_devops).
  2. When delegate_to_devops is triggered, the Controller spins up a DevOps Worker Agent, loading only the specific DevOps JSON schemas into its context.

5. Example Flow: Dynamic Loading in Action

User: "I need to provision a new database for the marketing team."

  1. Turn 1: Discovery
    • Agent Context: Possesses only search_skills(query).
    • Agent Action: Calls search_skills(query="provision database").
  2. Turn 2: High-Level Awareness
    • Controller Response: Returns token-efficient summaries from the DB: [{"name": "aws_rds_provision", "description": "Creates AWS RDS DB"}, {"name": "mongo_atlas_create", "description": "Creates Mongo cluster"}]
    • Agent Action: Decides AWS is needed. Calls load_skill_schema("aws_rds_provision").
  3. Turn 3: Strict Execution
    • Controller Response: Injects the full JSON schema (requiring instance_type, storage_gb).
    • Agent Action: Understands parameters and safely executes aws_rds_provision via the Controller's execution engine.

6. Operational Benefits & Security

By centralizing skills in a database, the platform gains enterprise-grade operational capabilities:

  • Dynamic Updates: API endpoints, instructions, and schemas can be updated in the database without restarting agents.
  • Permission-Aware Discovery (RBAC): By linking skills to LightAPI endpoint descriptions and api_endpoint_t, the Controller ensures that an agent only "discovers" tools that the current user/agent session is authorized to execute based on their roles.
  • A/B Testing: The Controller can route 50% of an agent's requests to skill_v1 and 50% to skill_v2 to measure prompt/tool efficacy.
  • Audit Logging: Every tool injection and execution is logged at the Controller level, establishing a single pane of glass for multi-agent compliance.
  • Distilled Memory RAG: Following the "Hindsight" pattern, raw conversation history (agent_session_history_t) is separated from RAG-optimized memory (session_memory_t). This prevents the "noisy context" problem while maintaining a perfect audit trail.

7. LightAPI As Skill Source

API-backed skills should be generated from endpoint-level LightAPI descriptions whenever possible.

The skill registry should store skill metadata, access control, grouping, and agent-facing instructions. The LightAPI description should remain the source of truth for endpoint invocation and verification details.

Recommended flow:

  1. Light-Portal creates or imports endpoint-level LightAPI descriptions.
  2. API owners enrich endpoint descriptions with examples, behavior notes, result cases, and visibility.
  3. Approved endpoint descriptions are published as agent skills.
  4. Skill discovery returns lightweight summaries first.
  5. When the agent selects a skill, the registry loads the relevant LightAPI disclosure level.
  6. Execution goes through the workflow service or controller runtime, preserving audit and authorization.

This avoids manually duplicating every API endpoint as a separate hand-written skill while still giving agents strict schemas and progressive disclosure.

8. Next Steps

  1. Provision the agent_skills table in the core database.
  2. Build the API layer (Controller) to handle search, retrieve, and execute requests from agents.
  3. Add publishing from LightAPI endpoint descriptions into the skill registry.
  4. Migrate existing Markdown-based skills into the structured DB payload (extracting prompts to the instructions field and converting parameters to JSON Schema).
  5. Implement Pattern B (Semantic Tool RAG) as the default progressive disclosure mechanism.

Hindsight Memory

Hindsight Memory is the core memory system for light-rs, designed to move beyond simple chat logs. Instead of just remembering what was said, the agent learns and forms mental models over time.

This design is strongly inspired by the paper Hindsight is 20/20: Building Agent Memory that Retains, Recalls, and Reflects and extends it with multi-tenant support.


1. Core Concepts

Hindsight memory organizes information into three distinct "Pathway" types:

  1. World Facts: Objective truths about the environment (e.g., "The production server is in US-East-1").
  2. Experiences: The agent's own history of actions and results (e.g., "I tried to deploy to US-East-1 and it failed due to a timeout").
  3. Mental Models: Synthesized understandings formed by reflecting on facts and experiences (e.g., "Deployments to US-East-1 are unstable during peak hours").

2. The Three Operations

Interaction with the memory system is standardized into three primary operations:

Retain (Storage)

The retain operation ingests information. Behind the scenes, the system:

  • Extracts entities and relationships.
  • Normalizes time and temporal data.
  • Stores the data in agent_memory_unit_t.

Recall (Retrieval)

The recall operation retrieves relevant context using a hybrid strategy:

  • Semantic: Vector similarity using the hnsw index.
  • Graph: Following links in agent_memory_link_t (causes, enables, prevents).
  • Temporal: Time-series filtering.

Reflect (Synthesis)

The reflect operation performs "deep thinking." It analyzes existing memories to generate new insights, which are stored in agent_memory_reflection_t.


3. Database Architecture

The Hindsight system is fully integrated into the portal's multi-tenant schema:

  • agent_memory_bank_t: The primary container. Defines personality and disposition (skepticism, empathy).
  • agent_memory_doc_t: Source documents (logs, files, transcripts) that provide the raw text for memory units.
  • agent_memory_unit_t: Sentence-level "atoms" of thought. Stores content, embeddings, and fact types (world, experience, etc.).
  • agent_memory_entity_t: Resolved Knowledge Graph nodes, optionally linked to platform users (user_t).
  • agent_memory_unit_entity_t: The join table linking individual memories to the entities they mention.
  • agent_memory_entity_cooccur_t: Association graph tracking concept relationships and co-occurrence counts.
  • agent_memory_link_t: Defines causal and semantic relationships between memories (causes, enables, etc.).
  • agent_memory_directive_t: "Hard rules" that override probabilistic learning.
  • agent_memory_reflection_t: Synthesized high-level insights generated during the "Reflect" phase.
  • agent_session_history_t: The live record of active conversations, linked to a specific bank for context.

4. Privacy & Multi-Tenancy

Isolation is managed at the Bank level using three scoping tiers:

  1. Global Host Bank (user_id IS NULL, agent_def_id IS NULL):
    • Knowledge shared across all users and all agents within a specific host_id.
    • Ideal for organization-wide SOPs, common facts, and shared documentation.
  2. Shared Agent Bank (user_id IS NULL, agent_def_id IS NOT NULL):
    • Knowledge shared by all users interacting with a specific agent type.
    • Used for maintaining a consistent agent "Persona" or specialized domain expertise.
  3. Private User Bank (user_id IS NOT NULL):
    • Knowledge unique to a specific user.
    • Can be scoped further by agent_def_id to provide user-specific memory within a particular agent persona.
    • Used for personal preferences, private history, and individualized learning.
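The three tiers resolve directly from the two nullable columns. A minimal sketch of that resolution logic (the function name is illustrative; the column semantics follow the list above):

```rust
/// The three scoping tiers resolved from the nullable columns on
/// agent_memory_bank_t.
#[derive(Debug, PartialEq, Eq)]
pub enum BankScope {
    GlobalHost,
    SharedAgent,
    PrivateUser,
}

/// user_id and agent_def_id mirror the nullable database columns.
pub fn resolve_scope(user_id: Option<&str>, agent_def_id: Option<&str>) -> BankScope {
    match (user_id, agent_def_id) {
        (Some(_), _) => BankScope::PrivateUser, // user-owned, optionally per agent
        (None, Some(_)) => BankScope::SharedAgent, // all users of one agent type
        (None, None) => BankScope::GlobalHost,  // shared across the host
    }
}
```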

5. Implementation Guide

To implement a "Learning Agent," follow this sequence in your application logic:

  1. Ingestion: After every tool call or user interaction, call retain to update the bank.
  2. Context Loading: Before calling the LLM, call recall to fetch the most relevant 3-5 memories for the current prompt.
  3. Scheduled Reflection: Run the reflect operation during idle time to compress raw experiences into high-level mental models.
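The sequence above can be sketched as a single agent turn against a memory trait. Both the trait and the in-memory bank are hypothetical stand-ins for the Hindsight client (which is asynchronous and far richer); the sketch only shows the recall-then-retain ordering:

```rust
/// Hypothetical trait capturing the three memory operations.
pub trait Memory {
    fn retain(&mut self, observation: &str);
    fn recall(&self, prompt: &str, limit: usize) -> Vec<String>;
    fn reflect(&mut self);
}

/// Toy in-memory bank for illustration.
pub struct InMemoryBank {
    pub items: Vec<String>,
}

impl Memory for InMemoryBank {
    fn retain(&mut self, observation: &str) {
        self.items.push(observation.to_string());
    }
    // A real recall would rank semantically and follow graph links;
    // this toy version just returns up to `limit` stored items.
    fn recall(&self, _prompt: &str, limit: usize) -> Vec<String> {
        self.items.iter().take(limit).cloned().collect()
    }
    fn reflect(&mut self) {
        // Scheduled synthesis would compress experiences into mental models.
    }
}

/// One turn: recall context before the LLM call, retain after it.
pub fn agent_turn<M: Memory>(memory: &mut M, prompt: &str) -> Vec<String> {
    let context = memory.recall(prompt, 5);
    // ... call the LLM with `context` plus the prompt here ...
    memory.retain(prompt);
    context
}
```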

Fine-Grained Authorization

The following is the raw material collected from my chats with AI about migrating the light-4j fine-grained authorization to light-rs to secure light-agent and light-gateway.

The information needs to be organized and documented into a design document for light-rs fine-grained authorization.

Question: We have a Java implementation of fine-grained authorization that we want to migrate to light-rs. In our java implementation, we have a middleware handler to verify the custom claims like role, group, attribute etc in the jwt token against endpoint specific rules defined in database and cached at the service. How should we handle the same requirement in rust?

Response:

In Rust, bridging the gap between dynamic, database-driven rules and high-performance web requests is a well-solved problem. If you are coming from Java, the architecture translates very cleanly, but takes advantage of Rust's async runtime and thread-safe concurrency primitives.

To handle fine-grained authorization (Attribute-Based Access Control / ABAC or Role-Based Access Control / RBAC) where rules are cached locally but backed by a database, you have two primary paths in Rust:

  1. The Custom Middleware Approach (Closest to your current Java implementation)
  2. The Policy Engine Approach (Using an industry-standard Rust authorization crate)

Here is how you handle this requirement idiomatically in Rust, assuming you are using Axum (the current standard web framework in Rust).


Approach 1: Custom Middleware with Async Caching

In Java, you likely used something like Guava or Caffeine for your local cache. In Rust, the exact equivalent for high-performance, concurrent, async-aware caching is the moka crate.

Here is the architectural blueprint:

  1. Define your custom claims using serde.
  2. Setup a shared Application State holding your Database connection pool (e.g., sqlx) and your local Cache (moka).
  3. Create an Axum Middleware to intercept the request, parse the JWT, consult the cache based on the route, and authorize the request.

1. Dependencies (Cargo.toml)

[dependencies]
axum = "0.7"
jsonwebtoken = "9.0"
moka = { version = "0.12", features = ["future"] } # Async cache (Rust's Caffeine)
sqlx = { version = "0.7", features = ["postgres", "runtime-tokio"] }
serde = { version = "1.0", features = ["derive"] }
tokio = { version = "1", features = ["full"] } # Async runtime used by #[tokio::main]

2. Defining Claims and State

use axum::{extract::State, http::Request, middleware::Next, response::Response};
use moka::future::Cache;
use serde::{Deserialize, Serialize};
use sqlx::PgPool;
use std::sync::Arc;

// 1. Define your custom JWT claims
#[derive(Debug, Serialize, Deserialize, Clone)]
struct CustomClaims {
    sub: String,
    roles: Vec<String>,
    groups: Vec<String>,
    attributes: std::collections::HashMap<String, String>,
    exp: usize,
}

// 2. Define the Endpoint Rule (what you fetch from DB)
#[derive(Clone)]
struct EndpointRule {
    required_roles: Vec<String>,
    // add required_groups, attributes, etc.
}

// 3. Define the Application State to share across your app
struct AppState {
    db: PgPool,
    // Cache keyed by String (e.g., "GET:/api/data") returning an EndpointRule
    rule_cache: Cache<String, EndpointRule>,
}

3. The Authorization Middleware

In Axum, middleware is written as an asynchronous function that takes the request, does some work, and either rejects it or passes it to the Next handler.

use axum::http::StatusCode;
use axum::response::IntoResponse;

async fn fine_grained_authz_middleware(
    State(state): State<Arc<AppState>>,
    req: Request<axum::body::Body>,
    next: Next,
) -> Result<Response, StatusCode> {

    // 1. Extract and decode the JWT from Authorization header
    let auth_header = req.headers().get("Authorization").and_then(|h| h.to_str().ok());
    let claims = decode_jwt(auth_header).ok_or(StatusCode::UNAUTHORIZED)?;

    // 2. Identify the endpoint being accessed (e.g., "GET:/api/users/123")
    // Note: In practice, use req.extensions() to get the matched generic route template
    // rather than the raw URI so "/api/users/123" matches "/api/users/{id}"
    let endpoint_key = format!("{}:{}", req.method(), req.uri().path());

    // 3. Check Cache or Fallback to DB
    let rule = match state.rule_cache.get(&endpoint_key).await {
        Some(cached_rule) => cached_rule,
        None => {
            // Cache miss: Fetch from database
            let db_rule = fetch_rule_from_db(&state.db, &endpoint_key)
                .await
                .map_err(|_| StatusCode::INTERNAL_SERVER_ERROR)?;

            // Populate cache
            state.rule_cache.insert(endpoint_key, db_rule.clone()).await;
            db_rule
        }
    };

    // 4. Execute Fine-Grained Authorization Logic
    let is_authorized = rule.required_roles.iter().any(|role| claims.roles.contains(role));
    // ... check groups, evaluate custom attribute expressions, etc.

    if !is_authorized {
        return Err(StatusCode::FORBIDDEN);
    }

    // 5. Pass to the actual route handler!
    Ok(next.run(req).await)
}

// Dummy helper functions to make the example compile mentally
fn decode_jwt(_header: Option<&str>) -> Option<CustomClaims> { /* ... */ unimplemented!() }
async fn fetch_rule_from_db(_db: &PgPool, _key: &str) -> Result<EndpointRule, ()> { /* ... */ unimplemented!() }

4. Wiring it up

use axum::{routing::get, Router};
use moka::future::Cache;
use sqlx::PgPool;
use std::sync::Arc;

#[tokio::main]
async fn main() {
    let state = Arc::new(AppState {
        db: PgPool::connect("postgres://...").await.unwrap(),
        rule_cache: Cache::builder()
            .max_capacity(10_000)
            .time_to_live(std::time::Duration::from_secs(300)) // 5 minute TTL
            .build(),
    });

    let app = Router::new()
        .route("/api/secure-data", get(|| async { "Secret Data" }))
        // Apply the middleware to all routes above this line
        .route_layer(axum::middleware::from_fn_with_state(state.clone(), fine_grained_authz_middleware))
        .with_state(state);

    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}

If your endpoint-specific rules are highly complex (e.g., "User must be in Group X AND have Attribute Y = 'finance', UNLESS Role is 'Admin'"), writing that evaluation logic manually in Rust gets messy.

Instead of writing custom evaluation logic, the Rust ecosystem heavily relies on external Policy Engines for ABAC/RBAC. The two leaders are:

1. AWS Cedar (cedar-policy)

Developed by Amazon Web Services in Rust, Cedar is a blazing-fast language for defining permissions.

  • You store your Cedar policies in your database as text.
  • Your Rust middleware fetches the Cedar policy string from the DB, compiles it (and caches the compiled policy using moka).
  • You pass the JWT CustomClaims into the Cedar engine as a context entity.
  • Cedar evaluates the claims against the policy in microseconds.

2. Casbin (casbin)

Casbin is a very popular authorization library that exists in Java, Go, and Rust.

  • It supports ACL, RBAC, and ABAC.
  • It has native adapters for sqlx (Postgres/MySQL). You don't even need to write the caching logic yourself; casbin-rs handles reading the rules from the database and keeping them cached in memory automatically.

Summary

  • For simple Role/Group matching: Stick to the Custom Middleware approach using moka for async caching and axum::middleware to intercept requests. It is functionally identical to your Java approach but extremely fast.
  • For complex Attribute evaluation: Use your middleware to parse the JWT, but delegate the actual rule evaluation to cedar-policy or casbin.

Question: Could you please provide some examples for cedar-policy and casbin? For example, what does the format look like in the database? In our system, we use a YAML rule engine to evaluate fine-grained authorization. We create rules that users can choose from, and they can also create their own rules if necessary.

Response:

Transitioning from a custom YAML rule engine to an industry-standard policy engine is a fantastic move. Both AWS Cedar and Casbin are designed precisely for this kind of "user-defined fine-grained rules" requirement, but they take very different approaches to how rules are written and stored.

Here is a breakdown of how both engines work, what their database format looks like, and how you map your JWT claims to them in Rust.


1. AWS Cedar (cedar-policy) - Best for Complex/Custom Rules

Cedar uses a custom, highly readable Domain Specific Language (DSL). It is specifically built to evaluate ABAC (Attribute-Based Access Control) using contextual data like JWT claims.

The Format (In the Database)

In your database, you do not need complex tables. You simply store the Cedar policies as plain text strings in a standard TEXT or VARCHAR column.

If a user wants to create a custom rule saying, "Allow access if the user's JWT has the 'finance' group and a clearance level greater than 3," the Cedar policy stored in the DB looks like this:

// Database column: policy_text
permit (
    principal,
    action == Action::"read_data",
    resource == Resource::"secure_endpoint"
)
when {
    context.jwt.groups contains "finance" &&
    context.jwt.attributes.clearance_level > 3
};

For "predefined rules that users can choose," Cedar has Policy Templates. One caveat: template slots are limited to ?principal and ?resource, so a parameter like "required group" cannot itself be a slot; it is modeled through principal group membership or entity attributes instead. You store the template once, and users instantiate it by linking a principal and resource:

// Template: "Grant a principal access to a resource"
permit(principal == ?principal, action, resource == ?resource);

How you execute it in Rust

When a request comes in, you fetch the relevant Cedar text strings from your database, pass in the JWT claims as the Context, and let Cedar evaluate it.

#![allow(unused)]
fn main() {
use cedar_policy::{Authorizer, Context, Decision, Entities, PolicySet, Request};
use serde_json::json;
use std::str::FromStr;

fn evaluate_cedar_rule(db_policy_text: &str, jwt_claims: &serde_json::Value) {
    // 1. Parse the text policy from the database
    let policies = PolicySet::from_str(db_policy_text).expect("Invalid policy format");

    // 2. Map your JWT claims directly into the Cedar "Context" using JSON
    let context_json = json!({
        "jwt": jwt_claims
    });
    let context = Context::from_json_value(context_json, None).unwrap();

    // 3. Create the evaluation request
    let request = Request::new(
        Some("User::\"Alice\"".parse().unwrap()), 
        Some("Action::\"read_data\"".parse().unwrap()), 
        Some("Resource::\"secure_endpoint\"".parse().unwrap()), 
        context,
        None,
    ).unwrap();

    // 4. Evaluate
    let authorizer = Authorizer::new();
    let entities = Entities::empty(); // Used if fetching hierarchical data
    let answer = authorizer.is_authorized(&request, &policies, &entities);

    if answer.decision() == Decision::Allow {
        println!("Access Granted!");
    } else {
        println!("Access Denied. Failed policies: {:?}", answer.diagnostics().errors());
    }
}
}

2. Casbin (casbin-rs) - Best for Tabular/Database-Native Rules

Casbin works fundamentally differently. It splits authorization into two parts: the Model (a config file defining the structure of your rules) and the Policies/Rules (the actual data stored in the database).

The Format (In the Database)

Casbin doesn't store a DSL string. Instead, it natively connects to your database using an adapter (like sqlx-adapter) and creates a table usually called casbin_rule.

The table structure looks exactly like a CSV file, with columns: ptype, v0, v1, v2, v3, v4, v5.

If a user selects a predefined rule to allow the finance_admin role to read /api/data, Casbin inserts a row into your database like this:

ptype | v0 (Subject)  | v1 (Object) | v2 (Action) | v3 (Condition, optional)
------|---------------|-------------|-------------|-------------------------
p     | finance_admin | /api/data   | read        |
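
For reference, the table the adapter manages looks roughly like this; treat the DDL as a sketch, since exact column sizes and constraints vary by adapter version:

```sql
-- Approximate shape of the table sqlx-adapter creates and manages.
CREATE TABLE casbin_rule (
    id    SERIAL PRIMARY KEY,
    ptype VARCHAR(12)  NOT NULL,
    v0    VARCHAR(128) NOT NULL DEFAULT '',
    v1    VARCHAR(128) NOT NULL DEFAULT '',
    v2    VARCHAR(128) NOT NULL DEFAULT '',
    v3    VARCHAR(128) NOT NULL DEFAULT '',
    v4    VARCHAR(128) NOT NULL DEFAULT '',
    v5    VARCHAR(128) NOT NULL DEFAULT ''
);

-- Selecting the predefined rule inserts one row:
INSERT INTO casbin_rule (ptype, v0, v1, v2)
VALUES ('p', 'finance_admin', '/api/data', 'read');
```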

If a user writes a custom ABAC rule using a UI (e.g., clearance > 3), Casbin can store the evaluation expression as a string in one of the columns (evaluated internally via the rhai scripting language):

ptype | v0 (Subject rule)                                        | v1 (Object) | v2 (Action)
------|----------------------------------------------------------|-------------|------------
p     | r.sub.groups.contains("finance") && r.sub.clearance > 3  | /api/data   | read

How you execute it in Rust

To make Casbin evaluate JWT attributes dynamically, you define a model.conf file that tells Casbin to evaluate (eval) the string stored in the database.

The model.conf File:

[request_definition]
r = sub, obj, act

[policy_definition]
p = sub_rule, obj, act

[policy_effect]
e = some(where (p.eft == allow))

[matchers]
# This matcher says: check if the object and action match, 
# AND evaluate the ABAC rule stored in the database (p.sub_rule) against the request subject (r.sub)
m = r.obj == p.obj && r.act == p.act && eval(p.sub_rule) 

The Rust Code:

#![allow(unused)]
fn main() {
use casbin::{CoreApi, Enforcer};
use serde::Serialize;
use sqlx_adapter::SqlxAdapter;

// A struct representing your JWT Claims.
// Deriving Serialize (plus Hash, required by casbin's argument trait) lets
// the rhai matcher read the fields as `r.sub.groups` and `r.sub.clearance`.
#[derive(Clone, Serialize, Hash)]
struct JwtClaims {
    groups: Vec<String>,
    clearance: i32,
}

async fn evaluate_casbin() {
    // 1. Connect to DB. The adapter automatically fetches the rules from the `casbin_rule` table.
    let adapter = SqlxAdapter::new("postgres://user:pass@localhost/db", 8).await.unwrap();
    let mut enforcer = Enforcer::new("model.conf", adapter).await.unwrap();

    // 2. Parse the JWT for the incoming request
    let claims = JwtClaims {
        groups: vec!["finance".to_string()],
        clearance: 5,
    };

    // 3. Casbin in Rust uses `rhai` under the hood for dynamic object evaluation.
    //    You pass a serializable struct (a HashMap of mixed value types would not
    //    compile); its fields become `r.sub.*` inside the matcher.

    // 4. Enforce the rules against the DB
    // Request is: Subject (JWT claims), Object (/api/data), Action (read)
    if enforcer.enforce((claims, "/api/data", "read")).unwrap() {
        println!("Access Granted!");
    } else {
        println!("Access Denied!");
    }
}
}

Which one maps better to your YAML Rule Engine?

AWS Cedar is highly recommended for your use case.

Because you currently use a YAML rule engine where users define complex logical rules, your users are already accustomed to defining logical expressions.

  1. Translation: It is very easy to write a script that takes your existing YAML definitions and converts them into Cedar text policies.
  2. Safety: Cedar is statically typed and formally verified. If a user writes an invalid custom rule, Cedar will fail to parse it instantly and give you a helpful error message to show in your UI. Casbin's eval() logic can sometimes be brittle at runtime if the user types bad syntax.
  3. Database Schema: With Cedar, you control your database schema. You just add a policy_text column to your existing "Rules" table. Casbin requires taking over a table (casbin_rule) to map to its strict CSV-like format.

Question: In our current implementation, we define all the permissions and filters from the UI and save the permissions and the picked rules in the database. During server startup, the server loads everything locally in YAML format like the following. One of the pros is that our technical team writes the rules, while the user or API owner just defines the roles, groups, positions, and attributes for the endpoint. The rules are executed locally on the target server without going to the database, so the performance is very good. If any permission changes for the target server, we can trigger a config reload from the controller to ask the server to reload the config/permissions. I am not sure if we should replace this implementation with Cedar. One of the features we have is the column and row filter, and I don't think Cedar supports it. What do you think?

rule.endpointRules:
  echo@call:
    req-acc:
    - allow-role-based-access-control.lightapi.net
    permission:
      roles: mcp-reader
  /v1/pets@get:
    req-acc:
    - allow-role-based-access-control.lightapi.net
    permission:
      roles: host-admin
  /v1/accounts@get:
    req-acc:
    - allow-role-based-access-control.lightapi.net
    res-fil:
    - response-column-filter.lightapi.net
    - response-row-filter.lightapi.net
    permission:
      col:
        role:
          teller: '["accountNo","userId","accountType","firstName","lastName"]'
          account-manager: '["accountNo","userId","accountType","firstName","lastName","status"]'
      row:
        role:
          teller:
          - colName: accountType
            colValue: C
            operator: =
      roles: account-manager teller

Response:

You have excellent architectural instincts. You are absolutely right to hesitate, and your concern about column and row filtering is the exact reason you probably should not replace your current system with Cedar.

Your current design is actually highly sophisticated. Let’s break down why your current implementation is so strong, why Cedar would struggle with your specific requirements, and how you should idiomatically build your current design in Rust.


Why Cedar is NOT a drop-in replacement for your system

AWS Cedar (and most standard policy engines like Casbin or OPA) are designed as Binary Decision Engines. You give them a Request, and they return an Allow or Deny.

The Missing Feature: Data Filtering (Masking) Your system handles two fundamentally different concerns in one elegant YAML file:

  1. Access Control (req-acc): Can the user access the /v1/accounts endpoint? (Cedar does this perfectly).
  2. Data Filtering (res-fil): Which rows and columns can the user see in the response? (Cedar does not do this natively).

If you forced Cedar to do column filtering, you would have to misuse its "Annotations" feature to attach custom metadata to policies, read the output, and then write a custom Rust engine to parse that metadata and filter the JSON anyway. You would essentially be rebuilding your existing system, just with Cedar awkwardly wedged in the middle.

Strengths of your current architecture

What you have built is essentially a custom API Gateway / Sidecar Data Masking Engine.

  1. Performance: Loading rules into memory at startup means zero network latency. (Matches Cedar).
  2. Hot Reloading: Fetching updates via a controller is standard cloud-native practice. (Matches Cedar).
  3. Separation of Concerns: Tech team writes the underlying engine logic (allow-role-based-access-control), while API owners just write YAML configs. (Matches Cedar’s Policy Templates).
  4. Row/Column filtering: Natively understood by your engine. (Far exceeds Cedar).

Verdict: Do not switch to Cedar. Your current YAML-based, locally cached evaluation engine is a perfect fit for your business requirements.


How to implement your architecture in Rust

Since you are migrating to or building in Rust, your current Java-based architecture translates beautifully. In fact, Rust will make this system significantly faster and more memory-efficient.

Here is how you replicate your exact Java architecture—including the hot-reloading and filtering—in Rust.

1. Hot-Reloading using arc-swap

In Java, hot-reloading configuration safely across threads can be tricky. In Rust, the industry standard for "read-heavy, occasionally-updated" memory structures is the arc-swap crate. It allows your web server to read the YAML rules lock-free at millions of requests per second, while your controller can swap out the rules in the background instantly.

#![allow(unused)]
fn main() {
use arc_swap::ArcSwap;
use serde::Deserialize;
use std::sync::Arc;
use std::collections::HashMap;

// 1. Map your YAML exactly to Rust Structs
#[derive(Deserialize, Debug, Clone)]
struct EndpointRule {
    #[serde(rename = "req-acc")]
    req_acc: Option<Vec<String>>,
    #[serde(rename = "res-fil")]
    res_fil: Option<Vec<String>>,
    permission: PermissionConfig,
}

#[derive(Deserialize, Debug, Clone)]
struct PermissionConfig {
    roles: String, // space-separated, e.g. "account-manager teller"
    col: Option<HashMap<String, HashMap<String, String>>>, // role -> column-list JSON strings
    row: Option<HashMap<String, HashMap<String, Vec<RowFilter>>>>, // role -> row predicates
}

#[derive(Deserialize, Debug, Clone)]
struct RowFilter {
    #[serde(rename = "colName")]
    col_name: String,
    #[serde(rename = "colValue")]
    col_value: String,
    operator: String,
}

// 2. Wrap your entire config in an ArcSwap
lazy_static::lazy_static! {
    static ref CONFIG: ArcSwap<HashMap<String, EndpointRule>> = ArcSwap::from_pointee(HashMap::new());
}

// 3. Background task to listen to the Controller and Hot-Reload
async fn reload_config_from_controller(new_yaml_str: &str) {
    match serde_yaml::from_str::<HashMap<String, EndpointRule>>(new_yaml_str) {
        Ok(new_config) => {
            // Swap the pointer atomically.
            // All new requests see the new rules instantly. Old requests finish using the old rules.
            CONFIG.store(Arc::new(new_config));
        }
        // A malformed push from the controller must not take down the server;
        // keep serving with the last good config.
        Err(e) => eprintln!("Config reload rejected, keeping previous rules: {e}"),
    }
}
}

2. The Request Access (req-acc) Middleware

You would write an Axum middleware that intercepts the request, looks up the current endpoint in the arc-swap configuration, checks the JWT roles against permission.roles, and rejects the request if it fails.

#![allow(unused)]
fn main() {
// Inside your Axum middleware
let current_config = CONFIG.load();
if let Some(rule) = current_config.get("/v1/accounts@get") {
    let required_roles: Vec<&str> = rule.permission.roles.split_whitespace().collect();
    if !user_roles.iter().any(|r| required_roles.contains(&r.as_str())) {
        return Err(StatusCode::FORBIDDEN); // req-acc fails
    }
}
}

3. The Response Filtering (res-fil) Middleware

This is where Rust shines. Because Axum and Actix allow you to intercept the response body after the handler completes but before it goes to the user, you can apply your column filtering dynamically.

Using serde_json, you can strip out columns that the user isn't allowed to see based on the col mapping in your YAML.

#![allow(unused)]
fn main() {
use serde_json::Value;

fn apply_column_filter(mut response_json: Value, allowed_columns_json: &str) -> Value {
    // 1. Parse the allowed columns from the YAML string (e.g., '["accountNo","firstName"]')
    let allowed_cols: Vec<String> = serde_json::from_str(allowed_columns_json).unwrap();

    // 2. Filter the JSON: a single object, or every object in an array
    //    (list endpoints typically return an array of rows)
    match response_json {
        Value::Object(ref mut map) => map.retain(|key, _| allowed_cols.contains(key)),
        Value::Array(ref mut rows) => {
            for row in rows.iter_mut() {
                if let Value::Object(map) = row {
                    map.retain(|key, _| allowed_cols.contains(key));
                }
            }
        }
        _ => {}
    }

    response_json
}
}
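
The row filter from the YAML (colName/colValue/operator) can be sketched the same way. This pure-std version works over simple string maps and implements only the = operator; the names RowRule and apply_row_filter are illustrative, not part of any crate:

```rust
use std::collections::HashMap;

// Illustrative stand-in for one row-filter entry from the YAML config.
struct RowRule {
    col_name: String,  // e.g. "accountType"
    col_value: String, // e.g. "C"
    operator: String,  // only "=" is handled in this sketch
}

// Keep only the rows whose column satisfies the configured predicate.
fn apply_row_filter(
    rows: Vec<HashMap<String, String>>,
    rule: &RowRule,
) -> Vec<HashMap<String, String>> {
    rows.into_iter()
        .filter(|row| match rule.operator.as_str() {
            "=" => row.get(&rule.col_name) == Some(&rule.col_value),
            _ => true, // unknown operators pass rows through in this sketch
        })
        .collect()
}

fn main() {
    let rule = RowRule {
        col_name: "accountType".into(),
        col_value: "C".into(),
        operator: "=".into(),
    };
    let rows = vec![
        HashMap::from([("accountNo".to_string(), "1".to_string()),
                       ("accountType".to_string(), "C".to_string())]),
        HashMap::from([("accountNo".to_string(), "2".to_string()),
                       ("accountType".to_string(), "S".to_string())]),
    ];
    let filtered = apply_row_filter(rows, &rule);
    assert_eq!(filtered.len(), 1);
    println!("rows kept: {}", filtered.len());
}
```

In the real middleware, a teller's response rows would be filtered like this before the column mask is applied.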

Summary

Keep your current system!

  1. The YAML approach is great. It acts as a perfect declarative contract between your API owners and your technical rules engine.
  2. Cedar is the wrong tool for Data Masking. Cedar is for strict Allow/Deny decisions, not mutating response bodies or filtering data.
  3. Rust will supercharge this. By using serde_yaml to parse your config, arc-swap for lock-free hot-reloading, and serde_json for column masking inside an Axum middleware, you will get a system that does exactly what your Java system does, but with significantly less memory overhead and higher throughput.

Agent Engine Pattern

The Agent Engine Pattern is the architectural standard for building industrial-grade, metadata-driven AI platforms within the Light-Fabric ecosystem.

In this model, the Rust Runtime acts as a high-performance Orchestrator, while the Application Logic resides in externalized metadata (JSON/YAML) and the Hindsight Memory database.

1. Why the Metadata-Driven Approach?

  • Separation of Concerns: Complex platform logic (security, retries, database connectivity, LLM integration) is implemented once in Rust. Business logic—defining agent personas, goals, and steps—is "programmed" via JSON or Database records.
  • Hot-Reloading: Using the arc-swap crate and YAML-based rule engines, agent personas, model parameters, and tool access can be updated in real-time without a server restart.
  • Elastic Scalability: Deploy a single, generic light-agent binary. At runtime, it specializes into a "Researcher," "Auditor," or "Support Specialist" based on the workflow_id or agent_id it retrieves from the registry.
  • High Performance: Rust's asynchronous tokio runtime allows a single engine instance to manage thousands of concurrent agentic sessions with minimal memory overhead.

2. The Core Architecture: Engine vs. Content

To function as a generic interpreter, the Light-Fabric Engine relies on four primary components:

A. The Tool & Skill Registry (The "Hands")

The engine maps string identifiers in the workflow JSON (e.g., "call": "get_customer_data") to executable code or remote MCP tools.

  • Implementation: Uses a ToolRegistry with trait objects (Box<dyn Tool>) or dynamic dispatch to MCP (Model Context Protocol) servers.
  • Logic: When the LLM requests a tool call, the engine verifies permissions via Fine-Grained Authorization, executes the tool, and feeds the result back into the context.
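
A minimal, synchronous sketch of the registry mechanics; the real ToolRegistry is async and permission-checked, and EchoTool plus the string-based call signature are illustrative only:

```rust
use std::collections::HashMap;

// Simplified Tool trait; the production version is async and returns structured results.
trait Tool {
    fn name(&self) -> &str;
    fn call(&self, params: &str) -> Result<String, String>;
}

// Illustrative tool implementation.
struct EchoTool;
impl Tool for EchoTool {
    fn name(&self) -> &str { "echo" }
    fn call(&self, params: &str) -> Result<String, String> { Ok(params.to_string()) }
}

struct ToolRegistry {
    tools: HashMap<String, Box<dyn Tool>>,
}

impl ToolRegistry {
    fn new() -> Self { Self { tools: HashMap::new() } }

    fn register(&mut self, tool: Box<dyn Tool>) {
        self.tools.insert(tool.name().to_string(), tool);
    }

    // Maps the string identifier from the workflow JSON to executable code
    // via dynamic dispatch on the trait object.
    fn call(&self, name: &str, params: &str) -> Result<String, String> {
        self.tools
            .get(name)
            .ok_or_else(|| format!("unknown tool: {name}"))?
            .call(params)
    }
}

fn main() {
    let mut registry = ToolRegistry::new();
    registry.register(Box::new(EchoTool));
    assert_eq!(registry.call("echo", "hello"), Ok("hello".to_string()));
    assert!(registry.call("get_customer_data", "{}").is_err());
    println!("dispatch ok");
}
```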

B. Hindsight State Manager (The "Memory")

Unlike simple session storage, the state manager persists every step of the agentic interaction into biomimetic memory banks.

  • Implementation: Every "turn" in the conversation is saved as a unit_t in the Hindsight database.
  • Benefit: Provides fault tolerance (resuming from a crashed step) and "Recall" capabilities, allowing agents to remember past interactions across different sessions.

C. Prompt Templating (The "Mind")

System prompts and instructions are stored as templates rather than hardcoded strings.

  • Implementation: Uses the tera or rinja engines for high-performance string interpolation.
  • Example: "You are a {{agent_role}}. Your current objective is to {{agent_goal}}."
  • Rust Logic: The engine merges runtime context (user input, memory recall, tool results) into the template before calling the LLM.
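
To illustrate the merge step without pulling in tera, here is a hand-rolled {{var}} interpolation; it stands in for the real template engine only to show how runtime context is substituted into the stored template:

```rust
use std::collections::HashMap;

// Naive {{key}} substitution, standing in for tera/rinja in this sketch.
fn render_prompt(template: &str, context: &HashMap<&str, &str>) -> String {
    let mut out = template.to_string();
    for (key, value) in context {
        // "{{{{{key}}}}}" renders as the literal placeholder "{{key}}"
        out = out.replace(&format!("{{{{{key}}}}}"), value);
    }
    out
}

fn main() {
    let mut ctx = HashMap::new();
    ctx.insert("agent_role", "Researcher");
    ctx.insert("agent_goal", "summarize the quarterly filings");
    let prompt = render_prompt(
        "You are a {{agent_role}}. Your current objective is to {{agent_goal}}.",
        &ctx,
    );
    assert_eq!(
        prompt,
        "You are a Researcher. Your current objective is to summarize the quarterly filings."
    );
    println!("{prompt}");
}
```

A real template engine adds escaping, conditionals, and loops on top of this substitution step.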

D. Policy Engine (The "Shield")

Before any tool execution or data retrieval, the engine consults the Light-Rule middleware.

  • Logic: Ensures the agent has the authority to access specific data or execute specific functions, preventing "prompt injection" from leading to unauthorized actions.

3. Conceptual Implementation in Rust

The AgentEngine in Light-Fabric follows a non-blocking, async loop:

#![allow(unused)]
fn main() {
pub struct AgentEngine {
    registry: Arc<ToolRegistry>,
    memory: Arc<HindsightClient>,
    rules: Arc<RuleEngine>,
    llm_provider: Arc<ModelProvider>, // unified LLM interface from crates/model-provider
}

impl AgentEngine {
    pub async fn execute_step(&self, session_id: Uuid, task: Task) -> anyhow::Result<()> {
        // 1. Fetch current context from Hindsight Memory
        let mut context = self.memory.get_context(session_id).await?;

        // 2. Resolve Task Type (Agentic vs. Tool Call)
        match task {
            Task::LlmCall { agent_id, prompt_template } => {
                // Render prompt with Tera
                let prompt = self.render_prompt(prompt_template, &context)?;
                
                // Call LLM Provider
                let response = self.llm_provider.chat(prompt, &context).await?;
                
                // Retain turn in Hindsight
                self.memory.retain_turn(session_id, response).await?;
            },
            Task::ToolCall { tool_name, params } => {
                // 3. Enforce Fine-Grained Authorization
                if self.rules.authorize(session_id, &tool_name).await? {
                    let result = self.registry.call(&tool_name, params).await?;
                    context.add_result(tool_name, result);
                }
            }
        }

        // 4. Update Session State
        self.memory.checkpoint(session_id, context).await
    }
}
}

4. Operational Challenges & Solutions

  1. Tool Versioning: As the platform evolves, tools may change. Light-Fabric handles this by versioning tool definitions in the Registry, ensuring old workflows remain compatible with the tools they were designed for.
  2. Safe Execution: For dynamic "scripts" defined in metadata, Light-Fabric utilizes WebAssembly (WASM) runtimes to provide a fast-starting, secure sandbox with far lower per-invocation overhead than traditional container-based isolation.
  3. Observability: Because the engine is generic, tracing is built into the light-runtime. Every step generates OpenTelemetry traces, allowing developers to visualize the "thought process" and execution path of any agent in real-time.
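
The versioning idea in point 1 can be sketched as a registry keyed by (name, version), where workflows may pin a version and otherwise resolve to the latest; the names and the String stand-in for a tool definition are illustrative:

```rust
use std::collections::HashMap;

// Tool definitions keyed by (name, version). Old workflows pin a version;
// new workflows resolve to the highest registered version.
struct VersionedRegistry {
    tools: HashMap<(String, u32), String>, // value = tool definition (stand-in)
}

impl VersionedRegistry {
    fn new() -> Self { Self { tools: HashMap::new() } }

    fn register(&mut self, name: &str, version: u32, def: &str) {
        self.tools.insert((name.to_string(), version), def.to_string());
    }

    fn resolve(&self, name: &str, pinned: Option<u32>) -> Option<&String> {
        match pinned {
            // Pinned: exact lookup keeps old workflows compatible.
            Some(v) => self.tools.get(&(name.to_string(), v)),
            // Unpinned: pick the highest version registered under this name.
            None => self
                .tools
                .iter()
                .filter(|((n, _), _)| n == name)
                .max_by_key(|((_, v), _)| *v)
                .map(|(_, def)| def),
        }
    }
}

fn main() {
    let mut reg = VersionedRegistry::new();
    reg.register("get_customer_data", 1, "v1 schema");
    reg.register("get_customer_data", 2, "v2 schema");
    // An old workflow pinned to v1 keeps working; new workflows get v2.
    assert_eq!(reg.resolve("get_customer_data", Some(1)).unwrap(), "v1 schema");
    assert_eq!(reg.resolve("get_customer_data", None).unwrap(), "v2 schema");
    println!("resolution ok");
}
```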

The Recommendation

Light-Fabric adopts this "Engine-first" philosophy to ensure the platform remains sustainable. By treating the Agentic Workflow as data and the Rust Runtime as the interpreter, we achieve the perfect balance of extreme performance and business flexibility.

Database Design

The Light-Fabric utilizes a robust PostgreSQL schema to manage the entire lifecycle of agentic workflows, skills, and the biomimetic Hindsight memory system. The schema is organized into four logical layers:

1. Workflow Engine

These tables manage the definition and execution of long-running agentic workflows.

wf_definition_t

Stores the Agentic Workflow DSL (YAML) that defines the high-level orchestration logic.

process_info_t & task_info_t

Manage the runtime state of workflow instances (processes) and individual steps (tasks). They include input_data, context_data, and error_info to provide a resilient "scratchpad" for intermediate variables.

worklist_t & task_asst_t

Manage task assignments and visibility for human-in-the-loop interactions.


2. Agentic Core (The "Brain & Skills")

These tables define the identity, expertise, and capabilities of individual agents.

agent_definition_t

Defines the agent's persona, model provider (OpenAI, Anthropic, etc.), and runtime parameters like temperature and max tokens.

skill_t

Stores the "Expertise" of an agent in Markdown format. Skills are hierarchical and versioned.

tool_t & tool_param_t

The "Hands" of the agent. Defines executable functions, including REST endpoints, MCP server calls, or WASM scripts.

agent_skill_t & skill_tool_t

Maps agents to skills and skills to tools, implementing the Progressive Disclosure pattern where agents only see the tools required for their current skill context.
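
As a sketch, Progressive Disclosure can be implemented as a join at skill-activation time. The join columns below (agent_def_id, skill_id, tool_id) are assumptions, since the DDL for these mapping tables is not shown in this section:

```sql
-- Hypothetical query: expose only the tools reachable through the agent's
-- currently active skill, not the full tool catalog.
SELECT t.*
FROM agent_skill_t ags
JOIN skill_tool_t  st ON st.host_id = ags.host_id AND st.skill_id = ags.skill_id
JOIN tool_t        t  ON t.host_id  = st.host_id  AND t.tool_id   = st.tool_id
WHERE ags.host_id = $1
  AND ags.agent_def_id = $2
  AND ags.skill_id = $3; -- the skill context the agent is currently operating in
```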


3. Hindsight Memory System

A biomimetic memory architecture that transitions from flat logs to structured "atoms of thought."

agent_memory_bank_t

Profiles for memory banks, defining the "Personality and Disposition" (e.g., skepticism, empathy) of the memory layer.

agent_memory_unit_t

The individual "Atoms" of memory. Each unit contains content and a vector embedding (384-dim) for semantic retrieval.

A Knowledge Graph layer that resolves entities and causal/semantic relationships between memory units.


4. Session Management

agent_session_history_t

The "Source of Truth" for active conversations, linking specific sessions to their respective Hindsight memory banks.


DDL Specification

-- Workflow Definitions: Stores the Agentic Workflow DSL (YAML)
CREATE TABLE wf_definition_t (
    host_id             UUID NOT NULL,
    wf_def_id           UUID NOT NULL,
    namespace           VARCHAR(126) NOT NULL,
    name                VARCHAR(126) NOT NULL,
    version             VARCHAR(20) NOT NULL,
    definition          TEXT NOT NULL, -- The Agentic Workflow DSL in YAML
    aggregate_version    BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT TRUE,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, wf_def_id),
    UNIQUE(host_id, namespace, name, version)
);

CREATE TABLE worklist_t (
  host_id              UUID NOT NULL,
  assignee_id          VARCHAR(126) NOT NULL,
  category_id          VARCHAR(126) DEFAULT '(all)' NOT NULL,
  status_code          VARCHAR(10) DEFAULT 'Active' NOT NULL,
  app_id               VARCHAR(512) DEFAULT 'global' NOT NULL,
  aggregate_version    BIGINT DEFAULT 1 NOT NULL,
  active               BOOLEAN NOT NULL DEFAULT TRUE,
  update_user          VARCHAR (255) DEFAULT SESSION_USER NOT NULL,
  update_ts            TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
  PRIMARY KEY(host_id, assignee_id, category_id)
);

CREATE TABLE worklist_column_t (
  host_id               UUID NOT NULL,
  assignee_id           VARCHAR(126) NOT NULL,
  category_id           VARCHAR(126) DEFAULT '(all)' NOT NULL,
  sequence_id           INTEGER NOT NULL,
  column_id             VARCHAR(126) NOT NULL,
  aggregate_version     BIGINT DEFAULT 1 NOT NULL,
  active                BOOLEAN DEFAULT TRUE,
  update_ts             TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
  update_user           VARCHAR(126) DEFAULT SESSION_USER,
  PRIMARY KEY(host_id, assignee_id, category_id, sequence_id),
  FOREIGN KEY(host_id, assignee_id, category_id) REFERENCES worklist_t(host_id, assignee_id, category_id) ON DELETE CASCADE
);

CREATE TABLE process_info_t (
  host_id                    UUID NOT NULL,
  process_id                 UUID NOT NULL, -- generated uuid
  wf_def_id                  UUID NOT NULL, -- workflow definition id
  wf_instance_id             VARCHAR(126)       NOT NULL, -- workflow instance id
  app_id                     VARCHAR(512)       NOT NULL, -- application id
  process_type               VARCHAR(126)      NOT NULL,
  status_code                CHAR(1)            NOT NULL, -- process status code 'A', 'C'
  started_ts                 TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
  ex_trigger_ts              TIMESTAMP WITH TIME ZONE          NOT NULL,
  custom_status_code         VARCHAR(126),
  completed_ts               TIMESTAMP WITH TIME ZONE,
  result_code                VARCHAR(126),
  source_id                  VARCHAR(126),
  branch_code                VARCHAR(126),
  rr_code                    VARCHAR(126),
  party_id                   VARCHAR(126),
  party_name                 VARCHAR(126),
  counter_party_id           VARCHAR(126),
  counter_party_name         VARCHAR(126),
  txn_id                     VARCHAR(126),
  txn_name                   VARCHAR(126),
  product_id                 VARCHAR(126),
  product_name               VARCHAR(126),
  product_type               VARCHAR(126),
  group_name                 VARCHAR(126),
  subgroup_name              VARCHAR(126),
  event_start_ts             TIMESTAMP WITH TIME ZONE,
  event_end_ts               TIMESTAMP WITH TIME ZONE,
  event_other_ts             TIMESTAMP WITH TIME ZONE,
  event_other                VARCHAR(126),
  risk                       NUMERIC,
  risk_scale                 INTEGER,
  price                      NUMERIC,
  price_scale                INTEGER, -- Scale (number of digits to the right of the decimal) of the price column. NULL implies zero
  product_qy                 NUMERIC,
  currency_code              CHAR(3),
  ex_ref_id                  VARCHAR(126),
  ex_ref_code                VARCHAR(126),
  product_qy_scale           INTEGER,
  parent_process_id          VARCHAR(22),
  deadline_ts                TIMESTAMP WITH TIME ZONE,
  parent_group_id            NUMERIC,
  process_subtype_code       VARCHAR(126),
  owning_group_name          VARCHAR(126), -- Name of the group that owns the process
  input_data                 JSONB,        -- The initial data that triggered the workflow
  context_data               JSONB,        -- The runtime "scratchpad" for intermediate variables
  error_info                 TEXT,         -- Detailed error or stack trace if the process fails
  aggregate_version   BIGINT DEFAULT 1 NOT NULL,
  active              BOOLEAN DEFAULT TRUE,
  update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
  update_user         VARCHAR(126) DEFAULT SESSION_USER,
  PRIMARY KEY(host_id, process_id),
  FOREIGN KEY(host_id, wf_def_id) REFERENCES wf_definition_t(host_id, wf_def_id) ON DELETE CASCADE
);

CREATE TABLE task_info_t
(
    host_id             UUID NOT NULL,
    task_id             UUID NOT NULL,
    task_type           VARCHAR(126) NOT NULL,
    process_id          UUID NOT NULL,
    wf_instance_id      VARCHAR(126) NOT NULL,
    wf_task_id          VARCHAR(126) NOT NULL,
    status_code         CHAR(1)       NOT NULL, -- U, A, C
    started_ts          TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
    locked              CHAR(1)       NOT NULL,
    priority            INTEGER        NOT NULL,
    completed_ts        TIMESTAMP WITH TIME ZONE      NULL,
    completed_user      VARCHAR(126)     NULL,
    result_code         VARCHAR(126)     NULL,
    locking_user        VARCHAR(126)     NULL,
    locking_role        VARCHAR(126)     NULL,
    deadline_ts         TIMESTAMP WITH TIME ZONE      NULL,
    lock_group          VARCHAR(126)     NULL,
    task_input          JSONB,           -- Specific data passed to the task
    task_output         JSONB,           -- Result returned by the task action
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT TRUE,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, task_id),
    FOREIGN KEY (host_id, process_id) REFERENCES process_info_t(host_id, process_id) ON DELETE CASCADE
);

CREATE TABLE task_asst_t
(
    host_id             UUID NOT NULL,
    task_asst_id         UUID NOT NULL,
    task_id              UUID NOT NULL,
    assigned_ts          TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
    assignee_id          VARCHAR(126) NOT NULL,
    reason_code          VARCHAR(126) NOT NULL,
    unassigned_ts        TIMESTAMP WITH TIME ZONE      NULL,
    unassigned_reason    VARCHAR(126)     NULL,
    category_code        VARCHAR(126)     NULL,
    aggregate_version    BIGINT DEFAULT 1 NOT NULL,
    active               BOOLEAN DEFAULT TRUE,
    update_ts            TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user          VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, task_asst_id),
    FOREIGN KEY(host_id, task_id) REFERENCES task_info_t(host_id, task_id) ON DELETE CASCADE
);

CREATE TABLE audit_log_t
(
    host_id             UUID NOT NULL,
    audit_log_id        UUID NOT NULL,
    source_type_id      VARCHAR(126)      NULL,
    correlation_id      VARCHAR(126)      NULL,
    user_id             VARCHAR(126)     NULL,
    event_ts            TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
    success             CHAR(1)           NULL,
    message0            VARCHAR(126)     NULL,
    message1            VARCHAR(126)     NULL,
    message2            VARCHAR(126)     NULL,
    message3            VARCHAR(126)     NULL,
    message             VARCHAR(500)     NULL,
    user_comment        VARCHAR(500)     NULL,
    PRIMARY KEY(host_id, audit_log_id)
);

CREATE INDEX audit_log_idx1 ON audit_log_t (source_type_id, correlation_id, event_ts, user_id);

-- Agent Definitions: Stores the "Brain" configuration
CREATE TABLE agent_definition_t (
    host_id             UUID NOT NULL,
    agent_def_id        UUID NOT NULL,
    agent_name          VARCHAR(126) NOT NULL,
    model_provider      VARCHAR(64) NOT NULL,  -- 'openai', 'anthropic', etc.
    model_name          VARCHAR(126) NOT NULL, -- 'gpt-4o', 'claude-3-5-sonnet'
    api_key_ref         VARCHAR(126),          -- Reference to Secret Manager key
    temperature         NUMERIC(3,2) DEFAULT 0.7,
    max_tokens          INTEGER,               -- Maximum number of tokens that can be used
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT TRUE,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, agent_def_id),
    UNIQUE(host_id, agent_name)
);


-- Skills: Stores Instructions and Domain Knowledge (The "Expertise")
-- Note: Use entity_tag_t and entity_category_t with entity_type = 'skill' 
-- for flat tagging and hierarchical folder structure of skills.
CREATE TABLE skill_t (
    host_id             UUID NOT NULL,
    skill_id            UUID NOT NULL,
    parent_skill_id     UUID,                  -- Self-reference for Hierarchy
    name                VARCHAR(126) NOT NULL,
    description         VARCHAR(500),          -- High-level description for the initial LLM prompt
    content_markdown    TEXT NOT NULL,         -- The actual instructions/prompts

    description_embedding VECTOR(384),          -- For semantic lookup/discovery
    version             VARCHAR(20) DEFAULT '1.0.0',
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, skill_id),
    FOREIGN KEY(host_id, parent_skill_id) REFERENCES skill_t(host_id, skill_id)
);

CREATE INDEX idx_skill_active ON skill_t(active);
CREATE INDEX idx_skill_name ON skill_t(name);

-- Tools: Stores Executable Functions (The "Hands")
CREATE TABLE tool_t (
    host_id             UUID NOT NULL,
    tool_id             UUID NOT NULL,
    name                VARCHAR(126) NOT NULL,
    description         TEXT NOT NULL,         -- Instructions for LLM on when/how to use this tool

    -- Implementation specifics
    implementation_type VARCHAR(50),           -- 'java', 'mcp_server', 'rest', 'python', 'javascript'
    implementation_class VARCHAR(500),         -- FQCN if 'java'
    mcp_server_name      VARCHAR(126),         -- MCP server name if 'mcp_server'
    api_endpoint        VARCHAR(1024),         -- URL if 'rest'
    api_method          VARCHAR(10),           -- HTTP Method if 'rest'
    endpoint_id         UUID,                  -- Reference to fine-grained auth endpoint
    script_content      TEXT,                  -- Source code if 'python'/'javascript'
    response_schema     JSONB,                 -- Strict output schema for tool results

    description_embedding VECTOR(384),          -- For semantic lookup/discovery
    version             VARCHAR(20) DEFAULT '1.0.0',
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, tool_id),
    FOREIGN KEY(host_id, endpoint_id) REFERENCES api_endpoint_t(host_id, endpoint_id) ON DELETE CASCADE
);

CREATE INDEX idx_tool_host_endpoint ON tool_t(host_id, endpoint_id);
CREATE INDEX idx_tool_active ON tool_t(active);
CREATE INDEX idx_tool_name ON tool_t(name);

-- Tool Parameters: Defines the arguments for each tool
CREATE TABLE tool_param_t (
    host_id             UUID NOT NULL,
    param_id            UUID NOT NULL,
    tool_id             UUID NOT NULL,
    name                VARCHAR(255) NOT NULL,
    param_type          VARCHAR(50) NOT NULL,      -- 'string', 'number', 'boolean', 'object', 'array'
    required            BOOLEAN DEFAULT true,
    default_value       JSONB,
    description         TEXT,                      -- Helps LLM understand what value to extract
    validation_schema   JSONB,                     -- JSON Schema for complex validation
    order_index         INTEGER DEFAULT 0,
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, param_id),
    FOREIGN KEY(host_id, tool_id) REFERENCES tool_t(host_id, tool_id) ON DELETE CASCADE
);

-- Skill Dependencies: Manages hierarchies where one skill requires another
CREATE TABLE skill_dependency_t (
    host_id             UUID NOT NULL,
    skill_id            UUID NOT NULL,
    depends_on_skill_id UUID NOT NULL,
    required            BOOLEAN DEFAULT true,
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY (host_id, skill_id, depends_on_skill_id),
    FOREIGN KEY(host_id, skill_id) REFERENCES skill_t(host_id, skill_id),
    FOREIGN KEY(host_id, depends_on_skill_id) REFERENCES skill_t(host_id, skill_id)
);

-- Agent-Skill Mapping: Links Agents to their Skills
CREATE TABLE agent_skill_t (
    host_id             UUID NOT NULL,
    agent_def_id        UUID NOT NULL,
    skill_id            UUID NOT NULL,

    config              JSONB DEFAULT '{}',
    priority            INTEGER DEFAULT 0,
    sequence_id         INTEGER DEFAULT 0,     -- Order in which skills are concatenated

    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, agent_def_id, skill_id),
    FOREIGN KEY(host_id, agent_def_id) REFERENCES agent_definition_t(host_id, agent_def_id) ON DELETE CASCADE,
    FOREIGN KEY(host_id, skill_id) REFERENCES skill_t(host_id, skill_id) ON DELETE CASCADE
);
CREATE INDEX idx_agent_skill_agent ON agent_skill_t(agent_def_id);

-- Skill-Tool Mapping: Implements Progressive Disclosure
CREATE TABLE skill_tool_t (
    host_id             UUID NOT NULL,
    skill_id            UUID NOT NULL,
    tool_id             UUID NOT NULL,

    config              JSONB DEFAULT '{}',
    access_level        VARCHAR(20) DEFAULT 'read', -- e.g., 'read', 'write', 'execute'

    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, skill_id, tool_id),
    FOREIGN KEY(host_id, skill_id) REFERENCES skill_t(host_id, skill_id) ON DELETE CASCADE,
    FOREIGN KEY(host_id, tool_id) REFERENCES tool_t(host_id, tool_id) ON DELETE CASCADE
);
CREATE INDEX idx_skill_tool_skill ON skill_tool_t(skill_id);

-- Hindsight Advanced Memory System
-- Transitioned from flat logs to biomimetic memory banks (World, Experiences, Mental Models)

-- Memory bank profiles (Personality & Disposition)
CREATE TABLE agent_memory_bank_t (
    host_id             UUID NOT NULL,
    bank_id             UUID NOT NULL,
    agent_def_id        UUID,                  -- NULL if bank is shared across agents
    user_id             UUID,                  -- NULL if bank is global for the host/agent
    bank_name           VARCHAR(126) NOT NULL,
    disposition         JSONB NOT NULL DEFAULT '{"skepticism": 3, "literalism": 3, "empathy": 3}'::jsonb,
    background          TEXT,
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, bank_id),
    FOREIGN KEY(host_id) REFERENCES host_t(host_id) ON DELETE CASCADE,
    FOREIGN KEY(host_id, agent_def_id) REFERENCES agent_definition_t(host_id, agent_def_id) ON DELETE CASCADE,
    FOREIGN KEY(user_id) REFERENCES user_t(user_id) ON DELETE CASCADE
);

-- Source documents for memory units
CREATE TABLE agent_memory_doc_t (
    host_id             UUID NOT NULL,
    doc_id              UUID NOT NULL,
    bank_id             UUID NOT NULL,
    original_text       TEXT,
    content_hash        TEXT,
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY (host_id, bank_id, doc_id),
    FOREIGN KEY (host_id, bank_id) REFERENCES agent_memory_bank_t(host_id, bank_id) ON DELETE CASCADE
);

-- Individual sentence-level memories (The "Atoms" of thought)
CREATE TABLE agent_memory_unit_t (
    host_id             UUID NOT NULL,
    unit_id             UUID NOT NULL,
    bank_id             UUID NOT NULL,
    doc_id              UUID,
    content             TEXT NOT NULL,
    embedding           vector(384),
    context             TEXT,
    event_date          TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now(),
    occurred_start      TIMESTAMP WITH TIME ZONE,
    occurred_end        TIMESTAMP WITH TIME ZONE,
    mentioned_at        TIMESTAMP WITH TIME ZONE,
    fact_type           VARCHAR(32) NOT NULL DEFAULT 'world' CHECK (fact_type IN ('world', 'experience', 'opinion', 'observation', 'mental_model')),
    metadata            JSONB DEFAULT '{}'::jsonb,
    proof_count         INT DEFAULT 1,
    source_memory_ids   UUID[] DEFAULT ARRAY[]::UUID[],
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, bank_id, unit_id),
    FOREIGN KEY(host_id, bank_id) REFERENCES agent_memory_bank_t(host_id, bank_id) ON DELETE CASCADE,
    FOREIGN KEY(host_id, bank_id, doc_id) REFERENCES agent_memory_doc_t(host_id, bank_id, doc_id) ON DELETE CASCADE
);

CREATE INDEX idx_mem_unit_bank ON agent_memory_unit_t(bank_id);
CREATE INDEX idx_mem_unit_embedding ON agent_memory_unit_t USING hnsw (embedding vector_cosine_ops);

-- Resolved entities (Knowledge Graph Nodes)
CREATE TABLE agent_memory_entity_t (
    host_id             UUID NOT NULL,
    entity_id           UUID NOT NULL,
    bank_id             UUID NOT NULL,
    user_id             UUID,                  -- Link to user_t if this entity is a platform user
    canonical_name      TEXT NOT NULL,
    mention_count       INT DEFAULT 1,
    metadata            JSONB DEFAULT '{}'::jsonb,
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY (host_id, bank_id, entity_id),
    FOREIGN KEY (host_id, bank_id) REFERENCES agent_memory_bank_t(host_id, bank_id) ON DELETE CASCADE,
    FOREIGN KEY (user_id) REFERENCES user_t(user_id) ON DELETE CASCADE
);

-- Association between memory units and entities
CREATE TABLE agent_memory_unit_entity_t (
    host_id             UUID NOT NULL,
    bank_id             UUID NOT NULL,
    unit_id             UUID NOT NULL,
    entity_id           UUID NOT NULL,
    PRIMARY KEY (host_id, bank_id, unit_id, entity_id),
    FOREIGN KEY (host_id, bank_id, unit_id) REFERENCES agent_memory_unit_t(host_id, bank_id, unit_id) ON DELETE CASCADE,
    FOREIGN KEY (host_id, bank_id, entity_id) REFERENCES agent_memory_entity_t(host_id, bank_id, entity_id) ON DELETE CASCADE
);

-- Cache of entity co-occurrences (Concept Relationship Graph)
CREATE TABLE agent_memory_entity_cooccur_t (
    host_id             UUID NOT NULL,
    bank_id             UUID NOT NULL,
    entity_id_1         UUID NOT NULL,
    entity_id_2         UUID NOT NULL,
    cooccur_count       INT DEFAULT 1,
    last_cooccurred     TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY (host_id, bank_id, entity_id_1, entity_id_2),
    CONSTRAINT entity_cooccur_order_check CHECK (entity_id_1 < entity_id_2),
    FOREIGN KEY (host_id, bank_id, entity_id_1) REFERENCES agent_memory_entity_t(host_id, bank_id, entity_id) ON DELETE CASCADE,
    FOREIGN KEY (host_id, bank_id, entity_id_2) REFERENCES agent_memory_entity_t(host_id, bank_id, entity_id) ON DELETE CASCADE
);

CREATE INDEX idx_mem_cooccur_e1 ON agent_memory_entity_cooccur_t(host_id, entity_id_1);
CREATE INDEX idx_mem_cooccur_e2 ON agent_memory_entity_cooccur_t(host_id, entity_id_2);

-- Links between memory units (Semantic & Causal relationships)
CREATE TABLE agent_memory_link_t (
    host_id             UUID NOT NULL,
    bank_id             UUID NOT NULL,
    from_unit_id        UUID NOT NULL,
    to_unit_id          UUID NOT NULL,
    link_type           VARCHAR(32) NOT NULL,
    weight              FLOAT NOT NULL DEFAULT 1.0,
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY (host_id, bank_id, from_unit_id, to_unit_id, link_type),
    CONSTRAINT memory_links_type_check CHECK (link_type IN ('temporal', 'semantic', 'entity', 'causes', 'caused_by', 'enables', 'prevents')),
    FOREIGN KEY (host_id, bank_id, from_unit_id) REFERENCES agent_memory_unit_t(host_id, bank_id, unit_id) ON DELETE CASCADE,
    FOREIGN KEY (host_id, bank_id, to_unit_id) REFERENCES agent_memory_unit_t(host_id, bank_id, unit_id) ON DELETE CASCADE
);

-- Directives (Hard rules that override probabilistic learning)
CREATE TABLE agent_memory_directive_t (
    host_id             UUID NOT NULL,
    directive_id        UUID NOT NULL,
    bank_id             UUID NOT NULL,
    name                VARCHAR(256) NOT NULL,
    content             TEXT NOT NULL,
    priority            INT NOT NULL DEFAULT 0,
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, bank_id, directive_id),
    FOREIGN KEY(host_id, bank_id) REFERENCES agent_memory_bank_t(host_id, bank_id) ON DELETE CASCADE
);

-- Reflections (Synthesized knowledge and high-level observations)
CREATE TABLE agent_memory_reflection_t (
    host_id             UUID NOT NULL,
    reflection_id       UUID NOT NULL,
    bank_id             UUID NOT NULL,
    content             TEXT NOT NULL,
    embedding           vector(384),
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, bank_id, reflection_id),
    FOREIGN KEY(host_id, bank_id) REFERENCES agent_memory_bank_t(host_id, bank_id) ON DELETE CASCADE
);

CREATE INDEX idx_mem_reflection_embedding ON agent_memory_reflection_t USING hnsw (embedding vector_cosine_ops);

-- Raw Session History (the source of truth for active conversations)
CREATE TABLE agent_session_history_t (
    host_id             UUID NOT NULL,
    session_id          UUID NOT NULL,
    bank_id             UUID NOT NULL,         -- Links the session to a Hindsight bank
    messages            JSONB NOT NULL DEFAULT '[]'::jsonb,
    metadata            JSONB DEFAULT '{}'::jsonb,
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, bank_id, session_id),
    FOREIGN KEY(host_id, bank_id) REFERENCES agent_memory_bank_t(host_id, bank_id) ON DELETE CASCADE
);

CREATE INDEX idx_session_bank ON agent_session_history_t(host_id, bank_id);


Light-Deployer Design

light-deployer is the cluster-local Kubernetes deployment executor in Light-Fabric.

This document focuses only on the deployer service that lives in apps/light-deployer. The broader Light Portal deployment workflow, approval flow, deployment history model, controller routing, and portal UI are covered outside this repository.

Purpose

light-deployer receives a deployment command, fetches Kubernetes templates, renders them with deployment values, validates the resulting resources, applies or deletes resources in the target Kubernetes cluster, and returns safe status details.

It is intentionally narrow. It does not decide whether a user is allowed to deploy an instance, does not own portal deployment history, and does not create tenant business workflows. Those decisions belong to Light Portal, Light Controller, and the workflow engine.

Service Boundary

light-deployer owns:

  • local deployment policy enforcement
  • template repository fetch
  • YAML template rendering
  • manifest parsing and resource summary generation
  • Kubernetes dry-run, apply, delete, status, and pruning
  • safe event and error reporting
  • direct local/MicroK8s deployment endpoints

light-deployer does not own:

  • tenant authorization
  • instance metadata
  • deployment approval
  • deployment history persistence
  • config snapshot creation
  • long-running human workflow decisions

The deployer should reject commands outside its local policy even if an upstream service sends them.

Runtime Model

The service follows the same runtime pattern as light-agent.

main.rs builds the domain service and starts it through:

LightRuntimeBuilder::new(AxumTransport::new(app))

The HTTP listener is owned by light-runtime and light-axum, not by service-specific socket code. Bind address, HTTP/HTTPS ports, service identity, and registry settings live in runtime config files.

Default config files:

  • config/server.yml
  • config/deployer.yml
  • config/portal-registry.yml

Local cargo run resolves config from apps/light-deployer/config when run from the workspace root. The container image runs from /app and uses /app/config.

Public Endpoints

Phase 1 exposes a direct HTTP surface for local and MicroK8s testing:

GET  /health
GET  /ready
POST /mcp
GET  /mcp/tools
GET  /mcp/tools/list
GET  /mcp/tools/{tool}
POST /mcp/tools/{tool}
POST /deployments
GET  /events?request_id=...

POST /mcp is the MCP JSON-RPC 2.0 endpoint. It supports tools/list, tools/call, and a minimal initialize response. This is the endpoint that MCP clients, Light Portal, and AI agents should use.

/deployments accepts the canonical deployment request directly. /mcp/tools/{tool} maps tool names onto the same internal service functions as a REST-style local debugging convenience. The convenience tool-list endpoints return metadata with name, description, inputSchema, endpoint, and method, but they are not the MCP protocol endpoint.

Supported tool names:

  • deployment.render
  • deployment.dryRun
  • deployment.diff
  • deployment.apply
  • deployment.delete
  • deployment.status
  • deployment.rollback

The direct HTTP mode is useful for development and managed environments. The same internal command handling should later be reused by controller-mediated WebSocket/MCP routing.

Request Model

A deployment request is explicit and auditable.

{
  "requestId": "01964b05-0000-7000-8000-000000000001",
  "hostId": "01964b05-552a-7c4b-9184-6857e7f3dc5f",
  "instanceId": "petstore-dev",
  "environment": "dev",
  "clusterId": "microk8s-local",
  "namespace": "petstore-dev",
  "action": "deploy",
  "values": {
    "name": "petstore",
    "image": {
      "repository": "networknt/openapi-petstore",
      "tag": "latest"
    }
  },
  "template": {
    "repoUrl": "https://github.com/networknt/openapi-petstore.git",
    "ref": "master",
    "path": "k8s"
  },
  "options": {
    "dryRun": false,
    "waitForRollout": true,
    "timeoutSeconds": 300,
    "pruneOverride": false
  }
}

The current implementation supports inline values. The request model also contains fields for future values references and immutable snapshot metadata so it can align with the full portal deployment workflow.

When invoking a specific /mcp/tools/{tool} endpoint, callers do not need to send action. The deployer derives the action from the tool name. The generic /deployments endpoint still expects an explicit action in the request body.

For the MCP endpoint, callers use JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": "tools-list-1",
  "method": "tools/list",
  "params": {}
}

Tool invocation uses tools/call:

{
  "jsonrpc": "2.0",
  "id": "render-1",
  "method": "tools/call",
  "params": {
    "name": "deployment.render",
    "arguments": {
      "hostId": "local-host",
      "instanceId": "petstore-dev",
      "environment": "dev",
      "clusterId": "local",
      "namespace": "light-deployer",
      "values": {},
      "template": {
        "repoUrl": "local",
        "ref": "main",
        "path": "k8s"
      }
    }
  }
}

tools/call derives the deployment action from params.name; callers should not provide an action field in arguments.
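A minimal sketch of that derivation, assuming the tool names map onto the actions in the Actions section (in particular, that the apply and delete tools correspond to the deploy and undeploy actions); the real handler may structure this differently:

```rust
/// Map an MCP tool name to a deployment action (illustrative sketch).
fn action_for_tool(tool: &str) -> Option<&'static str> {
    match tool {
        "deployment.render" => Some("render"),
        "deployment.dryRun" => Some("dryRun"),
        "deployment.diff" => Some("diff"),
        // Assumption: apply/delete tools invoke the deploy/undeploy actions.
        "deployment.apply" => Some("deploy"),
        "deployment.delete" => Some("undeploy"),
        "deployment.status" => Some("status"),
        "deployment.rollback" => Some("rollback"),
        _ => None,
    }
}
```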

Actions

render : Fetch templates, render manifests, add namespaces and management labels, and return resource summaries plus a manifest hash.

dryRun : Render manifests and validate them against Kubernetes using server-side dry-run.

diff : Render manifests, fetch current managed resources, calculate additions, modifications, and pruned resources, and return a redacted diff summary.

deploy : Accept the request, run the deployment in the background, apply manifests, prune removed managed resources, and stream events.

undeploy : Delete resources associated with the deployment.

status : Return current managed resource status.

rollback : Reserved for redeploying a previous immutable portal snapshot. Native Kubernetes rollout undo is not the target rollback model because it does not restore ConfigMaps, Secrets, or values snapshots.

Template Fetching

Templates are loaded through the TemplateSource trait.

The current source supports two modes:

  • local template root through LIGHT_DEPLOYER_TEMPLATE_BASE_DIR
  • remote HTTPS Git clone through gix

For remote repositories, the deployment request provides:

{
  "template": {
    "repoUrl": "https://github.com/networknt/openapi-petstore.git",
    "ref": "master",
    "path": "k8s"
  }
}

Private HTTPS Git access is controlled by environment variables:

  • LIGHT_DEPLOYER_GIT_TOKEN: token or app password
  • LIGHT_DEPLOYER_GIT_USERNAME: optional username override

Defaults:

  • GitHub uses x-access-token
  • Bitbucket Cloud uses x-token-auth

SSH authentication is intentionally deferred because it requires private key handling and strict known_hosts validation.

Template Format

The built-in renderer uses simple placeholders:

image: ${image.repository}:${image.tag:latest}

Supported behavior:

  • nested paths such as image.repository
  • default values after :
  • render failure when a required value is missing
  • placeholder replacement only inside YAML string scalar values

The renderer parses YAML into serde_yaml::Value, traverses the AST, replaces placeholders, and serializes the structured YAML afterward for validation or apply. This avoids the most common raw-string replacement bugs around quoting, indentation, certificates, and multi-line values.

Because placeholders currently produce strings, templates should avoid placeholders in numeric-only Kubernetes fields unless Kubernetes accepts a string value there. For example, containerPort should be hardcoded or rendered by a future typed placeholder extension.
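The placeholder rules above can be sketched as a small resolver. This is a simplified illustration that works on pre-flattened dotted keys inside plain strings; the actual renderer traverses a serde_yaml AST as described above.

```rust
use std::collections::HashMap;

/// Resolve ${path.to.key} and ${path.to.key:default} placeholders.
/// Assumes values were pre-flattened into dotted keys ("image.repository").
fn resolve_placeholders(
    input: &str,
    values: &HashMap<String, String>,
) -> Result<String, String> {
    let mut out = String::new();
    let mut rest = input;
    while let Some(start) = rest.find("${") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        let end = after.find('}').ok_or("unterminated placeholder")?;
        let body = &after[..end];
        // A default value follows the first ':'.
        let (key, default) = match body.split_once(':') {
            Some((k, d)) => (k, Some(d)),
            None => (body, None),
        };
        match values.get(key).map(String::as_str).or(default) {
            Some(v) => out.push_str(v),
            // Rendering fails when a required value is missing.
            None => return Err(format!("missing required value: {key}")),
        }
        rest = &after[end + 1..];
    }
    out.push_str(rest);
    Ok(out)
}
```

With `image.repository` supplied and `image.tag` absent, the example template line renders the default `latest` tag.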

Resource Metadata

After rendering, the deployer ensures every resource has the target namespace and adds management labels:

  • app.kubernetes.io/managed-by=light-deployer
  • lightapi.net/host-id
  • lightapi.net/instance-id
  • lightapi.net/request-id

These labels are used for status lookup and pruning.

Kubernetes Execution

Kubernetes execution is behind the KubeExecutor trait.

Current implementations:

  • KubeRsExecutor: real Kubernetes API execution through kube-rs
  • NoopKubeExecutor: local render/test mode

Execution mode:

  • LIGHT_DEPLOYER_KUBE_MODE=real: force real Kubernetes mode
  • LIGHT_DEPLOYER_KUBE_MODE=noop: force no-op mode
  • default: real mode when KUBERNETES_SERVICE_HOST is present, otherwise no-op

The production path uses kube-rs, not kubectl.

Kubernetes operations should use:

  • in-cluster ServiceAccount auth when running as a pod
  • server-side dry-run for validation
  • server-side apply with field manager light-deployer
  • structured status and error handling
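The executor abstraction and the mode-selection rules above can be sketched as follows. The trait shape and signatures here are illustrative assumptions, not the actual KubeExecutor API; mode selection is written as a pure function over the environment inputs for clarity.

```rust
/// Illustrative shape of a KubeExecutor-style abstraction.
trait KubeExecutor {
    fn dry_run(&self, manifests: &[String]) -> Result<(), String>;
    fn apply(&self, manifests: &[String]) -> Result<(), String>;
    fn delete(&self, manifests: &[String]) -> Result<(), String>;
}

/// Render/test mode: accepts everything without touching a cluster.
struct NoopKubeExecutor;

impl KubeExecutor for NoopKubeExecutor {
    fn dry_run(&self, _m: &[String]) -> Result<(), String> { Ok(()) }
    fn apply(&self, _m: &[String]) -> Result<(), String> { Ok(()) }
    fn delete(&self, _m: &[String]) -> Result<(), String> { Ok(()) }
}

/// Mode selection mirroring the rules above: an explicit
/// LIGHT_DEPLOYER_KUBE_MODE wins; otherwise real mode is chosen
/// only when KUBERNETES_SERVICE_HOST indicates an in-cluster pod.
fn kube_mode(mode_var: Option<&str>, in_cluster: bool) -> &'static str {
    match mode_var {
        Some("real") => "real",
        Some("noop") => "noop",
        _ if in_cluster => "real",
        _ => "noop",
    }
}
```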

Pruning

The deployer is declarative. If a previously managed resource is no longer rendered from the template, it should be considered for pruning.

Pruning is calculated by comparing:

  • current resources in the namespace with lightapi.net/instance-id
  • resources rendered from the new template

The policy layer enforces blast-radius protection:

  • maximum delete percentage
  • sensitive kinds requiring override
  • explicit pruneOverride in deployment options

This prevents stale resources while still protecting against accidental large-scale deletion.
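A sketch of that prune calculation with the maximum-delete-percentage guard; ResourceId and the exact threshold handling are illustrative assumptions, not the deployer's actual types.

```rust
use std::collections::HashSet;

#[derive(Clone, PartialEq, Eq, Hash, Debug)]
struct ResourceId {
    kind: String,
    namespace: String,
    name: String,
}

/// Resources that are currently managed but no longer rendered are prune
/// candidates; large deletions require an explicit pruneOverride.
fn prune_candidates(
    current_managed: &HashSet<ResourceId>,
    rendered: &HashSet<ResourceId>,
    max_delete_pct: f64,
    prune_override: bool,
) -> Result<Vec<ResourceId>, String> {
    let candidates: Vec<ResourceId> =
        current_managed.difference(rendered).cloned().collect();
    if current_managed.is_empty() || candidates.is_empty() {
        return Ok(candidates);
    }
    let pct = candidates.len() as f64 / current_managed.len() as f64 * 100.0;
    // Blast-radius protection.
    if pct > max_delete_pct && !prune_override {
        return Err(format!(
            "refusing to prune {:.0}% of managed resources without pruneOverride",
            pct
        ));
    }
    Ok(candidates)
}
```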

Policy

The local deployer.yml policy constrains what a deployer is allowed to do.

Policy dimensions:

  • allowed namespaces
  • allowed repository hosts
  • allowed repository URL prefixes
  • allowed image registries
  • allowed actions
  • allowed Kubernetes kinds
  • blocked Kubernetes kinds
  • prune settings
  • development insecure mode

Version 1 allows application-level resource kinds by default:

  • Deployment
  • Service
  • Ingress
  • ConfigMap
  • Secret

Cluster-scoped and control-plane resources are blocked by default:

  • Namespace
  • ClusterRole
  • ClusterRoleBinding
  • CustomResourceDefinition
  • admission webhooks
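The kind policy reduces to a small allow/block test. This is a sketch under the assumption that the block list wins over the allow list; the real deployer.yml evaluation covers all of the policy dimensions listed above.

```rust
use std::collections::HashSet;

/// Check a Kubernetes kind against the local policy.
/// Assumption: an explicit block always overrides an allow entry.
fn kind_allowed(kind: &str, allowed: &HashSet<&str>, blocked: &HashSet<&str>) -> bool {
    if blocked.contains(kind) {
        return false;
    }
    allowed.contains(kind)
}
```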

Security

The deployer can mutate a Kubernetes cluster, so its default posture must be conservative.

Required practices:

  • run in Kubernetes with a dedicated ServiceAccount
  • prefer namespace-scoped Role and RoleBinding
  • restrict allowed namespaces and resource kinds
  • restrict template repository hosts or prefixes in production
  • restrict image registries in production
  • never log raw rendered Secret manifests
  • never log raw Kubernetes patch/apply payloads containing Secret data
  • return redacted summaries and diffs

Secret values in rendered manifests are redacted before being included in responses or diffs. Kubernetes Secret values are base64 encoded, not encrypted, so they must be treated as plaintext for logging purposes.
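A sketch of the redaction rule, assuming a flattened key view in which Secret data and stringData entries are masked; the real code operates on parsed manifests rather than key strings.

```rust
/// Mask Secret payload fields before a value reaches responses or logs.
/// Kubernetes Secret values are base64, not encrypted: treat as plaintext.
fn redact_secret_data(kind: &str, key: &str, value: &str) -> String {
    if kind == "Secret" && (key.starts_with("data.") || key.starts_with("stringData.")) {
        "[REDACTED]".to_string()
    } else {
        value.to_string()
    }
}
```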

Response Model

Responses include enough detail for callers to understand what happened without exposing secrets.

Important fields:

  • requestId
  • action
  • status
  • deployerId
  • clusterId
  • namespace
  • manifestHash
  • templateCommitSha
  • resources
  • diff
  • events
  • error

Resource summaries contain kind, namespace, name, apiVersion, and action. Full rendered manifests should not be returned or persisted by default.

Event Model

Long-running operations return quickly and continue in the background.

Clients can subscribe to:

GET /events?request_id=...

Events contain:

  • request ID
  • timestamp
  • status
  • message
  • optional resource identity

The event stream is currently direct SSE. Controller-mediated mode can forward the same event shape later.

Installation

The app includes Kubernetes install manifests under apps/light-deployer/k8s:

  • namespace
  • RBAC
  • deployment
  • service

The deployment runs the container with LIGHT_DEPLOYER_KUBE_MODE=real. The image contains /app/config, and server.yml defaults the HTTP port to 7088.

For MicroK8s testing:

./apps/light-deployer/build.sh latest
docker save networknt/light-deployer:latest | microk8s ctr image import -
microk8s kubectl apply -f apps/light-deployer/k8s/namespace.yaml
microk8s kubectl apply -f apps/light-deployer/k8s/rbac.yaml
microk8s kubectl apply -f apps/light-deployer/k8s/deployment.yaml
microk8s kubectl apply -f apps/light-deployer/k8s/service.yaml

Current Limitations

  • Direct HTTP/MCP-style mode is implemented first; controller-mediated WebSocket routing is a later integration step.
  • Inline values are implemented; config-server valuesRef fetching is still a future integration point.
  • Rollback is represented in the model but needs portal snapshot integration.
  • Helm and Kustomize are not implemented yet.
  • Typed placeholders are not implemented yet.
  • Rollout watch depth is intentionally basic in the first phase.

Design Direction

Keep light-deployer small and cluster-local.

The deployer should execute precise deployment commands, enforce local safety policy, and report structured results. It should not grow into a portal, workflow engine, or deployment database. That separation keeps the service easy to install inside customer clusters and reduces the security blast radius.

Asymmetric Decryptor

asymmetric-decryptor decrypts RSA-encrypted configuration values.

It is used by config-loader when a service loads encrypted values that use the CRYPT:RSA: prefix. The crate supports RSA private keys in PKCS#8 and PKCS#1 PEM formats and decrypts payloads with RSA-OAEP using SHA-256.

Main Types

  • AsymmetricDecryptor: owns the RSA private key and decrypts supported payloads.
  • AsymmetricError: error type for prefix, base64, key, and decrypt failures.
  • CRYPT_RSA_PREFIX: the required CRYPT:RSA: payload prefix.

Usage

use asymmetric_decryptor::AsymmetricDecryptor;

let decryptor = AsymmetricDecryptor::from_pem(private_key_pem)?;
let plaintext = decryptor.decrypt("CRYPT:RSA:...")?;

Notes

This crate is intentionally small. It does not fetch keys, rotate keys, or perform configuration merging. Those concerns belong to config-loader and the runtime layer.

Config Loader

config-loader loads, merges, resolves, and decrypts service configuration.

It provides the common configuration behavior used by fabric services and runtime modules. Configuration can be loaded from YAML, JSON, or TOML files, merged across layers, expanded from values maps, and decrypted when encrypted values are present.

Main Types

  • ConfigLoader: loads files and resolves ${key:default} style values.
  • ConfigManager<T>: stores hot-swappable typed configuration behind an atomic reference.
  • ConfigError: shared error type for IO, parse, decrypt, and conversion failures.
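The hot-swap behavior behind ConfigManager<T> can be pictured with a std-only sketch. The ConfigStore type below is an illustrative stand-in, not the crate's actual implementation; the real crate may use a different synchronization primitive than RwLock:

```rust
use std::sync::{Arc, RwLock};

// Illustrative hot-swappable typed config holder: readers take a cheap
// Arc snapshot, writers replace the whole snapshot in one step.
struct ConfigStore<T> {
    current: RwLock<Arc<T>>,
}

impl<T> ConfigStore<T> {
    fn new(initial: T) -> Self {
        Self { current: RwLock::new(Arc::new(initial)) }
    }

    // A snapshot stays valid even if new configuration is swapped in later.
    fn get(&self) -> Arc<T> {
        self.current.read().unwrap().clone()
    }

    // Atomically replace the configuration for all future readers.
    fn swap(&self, next: T) {
        *self.current.write().unwrap() = Arc::new(next);
    }
}

fn main() {
    let store = ConfigStore::new(10u32);
    let old = store.get();
    store.swap(20);
    assert_eq!(*old, 10);         // earlier snapshot is unchanged
    assert_eq!(*store.get(), 20); // new readers see the swapped config
    println!("ok");
}
```

The point of the pattern is that in-flight requests keep a consistent view while a reload swaps in a new typed configuration.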

Resolution Model

The loader supports:

  • merging multiple config files in order
  • external overlays through LIGHT_RS_CONFIG_DIR
  • whole-value variable replacement
  • embedded variable expansion inside strings
  • typed deserialization through Serde
  • symmetrically encrypted values through symmetric-decryptor
  • asymmetrically encrypted values through asymmetric-decryptor
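The ${key:default} style can be sketched with a minimal std-only expander. The resolve function below is illustrative of the documented fallback semantics, not the crate's actual resolver:

```rust
use std::collections::HashMap;

// Resolve a single `${key:default}` expression against a values map,
// falling back to the default when the key is absent. Illustrative
// sketch of the documented style, not the crate's implementation.
fn resolve(expr: &str, values: &HashMap<String, String>) -> String {
    let inner = expr.strip_prefix("${").and_then(|s| s.strip_suffix('}'));
    match inner {
        Some(body) => {
            // Split at the first ':' into key and default.
            let (key, default) = match body.split_once(':') {
                Some((k, d)) => (k, d),
                None => (body, ""),
            };
            values.get(key).cloned().unwrap_or_else(|| default.to_string())
        }
        None => expr.to_string(), // not a placeholder; pass through
    }
}

fn main() {
    let mut values = HashMap::new();
    values.insert("server.httpPort".to_string(), "9090".to_string());
    // Present in the values map: the supplied value wins.
    assert_eq!(resolve("${server.httpPort:7088}", &values), "9090");
    // Absent: the embedded default applies.
    assert_eq!(resolve("${server.ip:0.0.0.0}", &values), "0.0.0.0");
    println!("ok");
}
```

The same shape appears in server.yml defaults such as ${server.httpPort:7088} later in this document.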

Usage

use config_loader::ConfigLoader;
use std::collections::HashMap;

let loader = ConfigLoader::from_values(HashMap::new(), None, None)?;
let config: MyConfig = loader.load_typed(["config/my-service.yml"])?;

Consumers

light-runtime uses this crate for service bootstrap and runtime config. Application crates can also use it for app-specific policy or domain config.

Hindsight Client

hindsight-client provides a small client abstraction for persistent agent memory.

It stores and recalls memory units from PostgreSQL. The current implementation uses sqlx and pgvector for vector similarity search.

Main Types

  • HindsightMemory: trait used by applications that need memory retention and recall without coupling to a specific database implementation.
  • PgHindsightClient: PostgreSQL-backed implementation of HindsightMemory.
  • MemoryUnit: returned memory record with content, type, metadata, and bank identity.

Usage

use hindsight_client::{HindsightMemory, PgHindsightClient};

let memory = PgHindsightClient::new(pool);
let unit_id = memory
    .retain(host_id, bank_id, "User prefers concise answers", "fact", None, metadata)
    .await?;

Data Model

The PostgreSQL implementation writes to agent_memory_unit_t and uses host_id plus bank_id to isolate memory between tenants, users, or sessions.
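The isolation model amounts to a composite key: every retain and recall is scoped by (host_id, bank_id). The in-memory analogue below is an illustrative sketch of that partitioning, not the actual table mapping:

```rust
use std::collections::HashMap;

// Memory units are partitioned by (host_id, bank_id), so one tenant,
// user, or session never sees another's records. Illustrative analogue
// of the PostgreSQL scoping, not the real data access layer.
#[derive(Default)]
struct MemoryBanks {
    units: HashMap<(String, String), Vec<String>>,
}

impl MemoryBanks {
    fn retain(&mut self, host_id: &str, bank_id: &str, content: &str) {
        self.units
            .entry((host_id.to_string(), bank_id.to_string()))
            .or_default()
            .push(content.to_string());
    }

    fn recall(&self, host_id: &str, bank_id: &str) -> &[String] {
        self.units
            .get(&(host_id.to_string(), bank_id.to_string()))
            .map(Vec::as_slice)
            .unwrap_or(&[])
    }
}

fn main() {
    let mut banks = MemoryBanks::default();
    banks.retain("host-1", "user-a", "prefers concise answers");
    assert_eq!(banks.recall("host-1", "user-a").len(), 1);
    // A different bank_id under the same host sees nothing.
    assert!(banks.recall("host-1", "user-b").is_empty());
    println!("ok");
}
```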

Consumers

light-agent uses this crate to persist and recall agent conversation memory.

Light Rule

light-rule is the Rust rule engine for evaluating rule definitions and executing registered actions.

It is designed to align with the rule.yaml specification while remaining runtime-neutral. Java services can use yaml-rule; Rust services use this crate.

Main Types

  • RuleEngine: evaluates rule conditions and determines action execution.
  • MultiThreadRuleExecutor: executes rules with runtime state.
  • RuntimeState: input/output state passed through rule evaluation.
  • ActionRegistry: registry for action plugins.
  • RuleActionPlugin: trait implemented by Rust action handlers.
  • Rule, RuleCondition, RuleAction, RuleConfig, EndpointConfig: rule model types.

Action Model

Rules reference actions by actionRef. In Rust, actionRef resolves to a registered RuleActionPlugin; it is not a Java class name. This keeps the rule format portable across Java and Rust executors.
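Resolution by name can be sketched as a registry lookup. The types below are simplified illustrations (the real RuleActionPlugin trait and ActionRegistry API differ), but they show why an actionRef stays portable: the rule file carries only a name, and each runtime binds that name to its own handler:

```rust
use std::collections::HashMap;

// Illustrative actionRef resolution: a name maps to a registered handler
// rather than to a Java class. Signatures are simplified sketches, not
// the crate's actual RuleActionPlugin trait.
type Action = Box<dyn Fn(&str) -> String>;

#[derive(Default)]
struct Registry {
    actions: HashMap<String, Action>,
}

impl Registry {
    fn register(&mut self, action_ref: &str, action: Action) {
        self.actions.insert(action_ref.to_string(), action);
    }

    // Returns None when no handler is registered under the name.
    fn execute(&self, action_ref: &str, input: &str) -> Option<String> {
        self.actions.get(action_ref).map(|a| a(input))
    }
}

fn main() {
    let mut registry = Registry::default();
    registry.register("audit.log", Box::new(|input: &str| format!("logged: {input}")));
    assert_eq!(
        registry.execute("audit.log", "rule fired"),
        Some("logged: rule fired".to_string())
    );
    assert!(registry.execute("missing.ref", "x").is_none());
    println!("ok");
}
```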

Usage

use light_rule::{ActionRegistry, RuleEngine};

let registry = ActionRegistry::default();
let engine = RuleEngine::new(registry);

See Light-Rule for the rule format and its relationship to workflow assertions and portal rule management.

Light Runtime

light-runtime is the shared service runtime for Light Fabric applications.

It owns bootstrap, configuration loading, transport startup, graceful shutdown, and optional portal registry registration. Apps such as light-agent and light-deployer should start through this crate instead of binding sockets directly.

Main Types

  • LightRuntimeBuilder: builds a runtime from a transport.
  • LightRuntime: configured runtime before start.
  • RunningRuntime: running service handle with shutdown support.
  • Module: lifecycle hook abstraction.
  • RuntimeConfig: resolved runtime configuration.
  • ServerConfig: HTTP/HTTPS bind and service identity settings.
  • BootstrapConfig: remote config bootstrap settings.
  • PortalRegistryConfig: portal registry connection settings.

Startup Pattern

use light_axum::AxumTransport;
use light_runtime::LightRuntimeBuilder;

let runtime = LightRuntimeBuilder::new(AxumTransport::new(app))
    .with_config_dir("config")
    .build();

let running = runtime.start().await?;
running.shutdown().await?;

Configuration

At minimum, runtime services need server.yml. Optional files include startup.yml, client.yml, and portal-registry.yml.

light-runtime is transport-neutral. light-axum supplies the Axum transport implementation.

MCP Client

mcp-client is a client for calling MCP-compatible gateway endpoints.

It provides a small API for listing and invoking tools through a configured MCP gateway path. It is intentionally focused on the client side; MCP server implementations live in applications or framework layers.

Main Types

  • McpGatewayClient: gateway client used by applications.
  • McpTool: tool metadata returned by the gateway.
  • McpContent: content item returned by MCP tool calls.
  • McpToolCallResult: structured result for a tool invocation.

Usage

use mcp_client::McpGatewayClient;

let client = McpGatewayClient::new(gateway_url, path, timeout_ms);
let result = client.call_tool("tool.name", arguments).await?;

Consumers

light-agent uses this crate when an agent session needs to discover or invoke tools exposed through an MCP gateway.

Model Provider

model-provider defines a common abstraction over LLM providers and implements multiple provider adapters.

The goal is to let agent and workflow code depend on one Provider trait while supporting local models, hosted APIs, and provider-specific features.

Main Types

  • Provider: async trait implemented by model providers.
  • ChatRequest, ChatResponse, ChatMessage: common chat data model.
  • ToolSpec, ToolCall: tool-calling model.
  • ProviderCapabilities: capability metadata.
  • TokenUsage: usage accounting.
  • ReliableProvider: reliability wrapper.
  • RouterProvider: route requests across multiple providers.

Provider Implementations

Current modules include:

  • Anthropic
  • Azure OpenAI
  • Bedrock
  • Claude Code
  • Codex
  • OpenAI-compatible providers
  • Copilot
  • Gemini
  • Gemini CLI
  • GLM
  • Kilo Code CLI
  • Ollama
  • OpenAI
  • OpenRouter
  • Telnyx

Consumers

light-agent uses this crate to send chat requests and tool specs without hard-coding a single LLM provider.

Portal Registry

portal-registry provides client support for registering services with Light Portal or Light Controller.

It uses a JSON-RPC style WebSocket protocol for service registration, metadata updates, and skill search. Runtime services normally use this through light-runtime, but applications can also use the client directly when they need custom registry behavior.

Main Types

  • PortalRegistryClient: WebSocket client for registry communication.
  • RegistryHandler: trait for handling registry callbacks and messages.
  • RegistrationState: client registration state.
  • RegistrationBuilder: helper for constructing registration parameters.
  • ServiceRegistrationParams: service identity and advertised endpoint.
  • ServiceMetadataUpdate: metadata update payload.
  • SkillSearchRequest, SkillSearchResponse: skill discovery messages.

Usage

use portal_registry::RegistrationBuilder;

let registration = RegistrationBuilder::new(
    "com.networknt.service-1.0.0",
    "1.0.0",
    "http",
    "127.0.0.1",
    8080,
)
.with_env("DEV")
.with_jwt(token)
.build();

Runtime Integration

light-runtime can register a service automatically when server.yml enables registry support and portal-registry.yml supplies the portal connection.

Symmetric Decryptor

symmetric-decryptor decrypts legacy symmetrically encrypted configuration values.

It supports payloads with the CRYPT prefix and decrypts AES-256-CBC data with a key derived from the configured password using PBKDF2-HMAC-SHA256.
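The envelope handling can be sketched as: validate the CRYPT prefix, then hand the remainder to the cipher layer. The sketch below covers only the prefix/format step; key derivation and AES-CBC decryption are omitted, and the error type is illustrative rather than the crate's DecryptError:

```rust
// Illustrative parsing of the CRYPT payload envelope: check the prefix,
// then pass the remainder on for decryption. Format sketch only; the
// real crate also performs PBKDF2 key derivation and AES-256-CBC.
#[derive(Debug, PartialEq)]
enum EnvelopeError {
    MissingPrefix,
    EmptyPayload,
}

fn strip_crypt_prefix(payload: &str) -> Result<&str, EnvelopeError> {
    let rest = payload
        .strip_prefix("CRYPT:")
        .ok_or(EnvelopeError::MissingPrefix)?;
    if rest.is_empty() {
        return Err(EnvelopeError::EmptyPayload);
    }
    Ok(rest)
}

fn main() {
    assert_eq!(strip_crypt_prefix("CRYPT:deadbeef"), Ok("deadbeef"));
    assert_eq!(strip_crypt_prefix("plain-value"), Err(EnvelopeError::MissingPrefix));
    assert_eq!(strip_crypt_prefix("CRYPT:"), Err(EnvelopeError::EmptyPayload));
    println!("ok");
}
```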

Main Types

  • Decryptor: trait implemented by decryptors.
  • SymmetricDecryptor: password-based decryptor.
  • DecryptError: error type for prefix, format, hex, and cipher failures.
  • CRYPT_PREFIX: required CRYPT payload prefix.

Usage

use symmetric_decryptor::{Decryptor, SymmetricDecryptor};

let decryptor = SymmetricDecryptor::new("password");
let plaintext = decryptor.decrypt("CRYPT:...")?;

Consumers

config-loader uses this crate when it encounters symmetrically encrypted values and a config password is available.

Workflow Builder

workflow-builder provides fluent builders for creating Agentic Workflow definitions programmatically.

It depends on workflow-core for the actual model types and layers a builder API on top so applications and tests can construct valid workflows without manually assembling nested maps.

Main Areas

  • workflow metadata construction
  • authentication definitions
  • task definitions
  • nested do, for, fork, try, and other task structures
  • YAML/JSON serialization through workflow-core model types

Usage

use workflow_builder::services::workflow::WorkflowBuilder;

let workflow = WorkflowBuilder::new()
    .use_dsl("1.0.0")
    .with_namespace("lightapi")
    .with_name("example")
    .with_version("1.0.0")
    .build();

Relationship To Workflow Core

Use workflow-core when you need direct access to the schema model. Use workflow-builder when you want an ergonomic construction API.

Workflow Core

workflow-core contains the Rust model for the Agentic Workflow DSL.

The crate is schema-oriented: its structs and enums represent workflow documents, tasks, authentication blocks, durations, timeouts, errors, and supporting map types.

Main Areas

  • workflow document metadata
  • task definitions
  • call task protocol definitions
  • ask and assert task definitions
  • duration and timeout models
  • error definitions
  • ordered map support for workflow task lists

Usage

use workflow_core::models::workflow::{
    WorkflowDefinition,
    WorkflowDefinitionMetadata,
};

let metadata = WorkflowDefinitionMetadata::new(
    "lightapi",
    "example",
    "1.0.0",
    Some("Example".to_string()),
    None,
    None,
    None,
);
let workflow = WorkflowDefinition::new(metadata);

Consumers

workflow-builder builds on this crate. light-workflow and workflow-related services use the model for loading, validating, and executing workflow documents.

Light-Axum

light-axum adapts Axum applications to light-runtime.

Applications implement AxumApp and return an axum::Router. The framework owns binding, optional TLS, runtime metadata resolution, and graceful shutdown through the runtime transport contract.

Main Types

  • AxumApp: trait implemented by an application.
  • AxumTransport: transport passed to LightRuntimeBuilder.
  • ServerContext: runtime context passed into the app when building routes.
  • AxumBoundHandle: running Axum server handle.

Pattern

use light_axum::{AxumApp, AxumTransport, ServerContext};
use light_runtime::LightRuntimeBuilder;

#[derive(Clone)]
struct App;

impl AxumApp for App {
    fn router(&self, _context: ServerContext) -> axum::Router {
        axum::Router::new()
    }
}

let runtime = LightRuntimeBuilder::new(AxumTransport::new(App))
    .with_config_dir("config")
    .build();

Consumers

light-agent and light-deployer use this framework.

Light-Pingora

light-pingora adapts Pingora proxy services to light-runtime.

It is the framework layer for high-performance gateway and proxy products. The crate keeps runtime concerns such as configuration and service lifecycle separate from Pingora-specific proxy behavior.

Role

  • bridge Pingora services into the common runtime lifecycle
  • expose transport metadata to light-runtime
  • support gateway products without duplicating bootstrap code

Consumers

light-gateway uses this framework.

Light-Agent

light-agent is the interactive agent service in Light Fabric.

It provides a WebSocket chat interface, integrates with model providers, invokes MCP tools through mcp-client, and stores conversation memory through hindsight-client.

Key Dependencies

  • light-runtime
  • light-axum
  • model-provider
  • mcp-client
  • hindsight-client
  • portal-registry

Runtime

The app follows the standard runtime pattern:

  • load config from config/
  • implement an Axum app
  • start through LightRuntimeBuilder
  • optionally register through portal registry

Light-Deployer

light-deployer is the cluster-local Kubernetes deployment executor for Light Portal.

It renders Kubernetes templates, validates manifests, applies resources through kube-rs, reports rollout status, and exposes deployment tools through an MCP JSON-RPC endpoint for local and MicroK8s testing.

Key Capabilities

  • MCP JSON-RPC endpoint at POST /mcp
  • AST-based YAML template rendering
  • Git template fetching with gix
  • Kubernetes dry-run, apply, delete, status, and prune
  • redacted manifest summaries and diffs
  • SSE deployment events

Runtime

light-deployer uses light-runtime, light-axum, config-loader, and portal-registry so it follows the same service boot model as light-agent.

Testing Path

Use these pages in order when testing locally:

  1. Build Local
  2. Prepare Config
  3. Run Standalone
  4. Run Kubernetes

Start with standalone noop mode to validate template rendering. Then move to MicroK8s real mode once the render request and target templates are correct.

For MCP clients, Light Portal, and AI agents, use POST /mcp with JSON-RPC methods such as tools/list and tools/call. The /mcp/tools/* routes are kept only as local debugging conveniences.

Build Local

This page builds the light-deployer binary and container image from the Light Fabric workspace.

Run all commands from the repository root:

cd ~/workspace/light-fabric

Rust Build

Use cargo check first for a quick compile validation:

cargo check -p light-deployer

Run the deployer tests:

cargo test -p light-deployer

Build a debug binary:

cargo build -p light-deployer

Build a release binary:

cargo build --release -p light-deployer

The release binary is written to:

target/release/light-deployer

Docker Image

Build the local image:

./apps/light-deployer/build.sh latest

The default image name is:

networknt/light-deployer:latest

To override the image name:

IMAGE=localhost:32000/light-deployer:latest ./apps/light-deployer/build.sh latest

Verify the image exists:

docker image inspect networknt/light-deployer:latest

What The Image Contains

The Dockerfile copies:

  • /usr/local/bin/light-deployer
  • /app/config

The container runs from /app, so the default runtime config directory is:

/app/config

The default HTTP port is 7088, configured in:

apps/light-deployer/config/server.yml

Expected Result

Before moving on, these commands should pass:

cargo check -p light-deployer
cargo test -p light-deployer
./apps/light-deployer/build.sh latest
docker image inspect networknt/light-deployer:latest

Prepare Config

light-deployer uses two kinds of configuration:

  • runtime config loaded by light-runtime
  • deployment request data sent through MCP tools/call at POST /mcp

Runtime Config Files

Default config lives in:

apps/light-deployer/config

Files:

  • server.yml: HTTP/HTTPS bind settings and service identity
  • deployer.yml: local deployer policy
  • portal-registry.yml: future portal/controller registry settings

When running from the workspace root, the deployer automatically uses:

apps/light-deployer/config

When running inside the Docker image, it uses:

/app/config

Override the config directory with:

LIGHT_DEPLOYER_CONFIG_DIR=/path/to/config

Server Config

The default server config listens on HTTP port 7088:

ip: ${server.ip:0.0.0.0}
httpPort: ${server.httpPort:7088}
enableHttp: ${server.enableHttp:true}
enableHttps: ${server.enableHttps:false}
serviceId: ${server.serviceId:com.networknt.light-deployer-0.1.0}
enableRegistry: ${server.enableRegistry:false}

To change the port without editing the file, provide values through the normal runtime values mechanism, or use a copied config directory for local testing.

Deployer Policy

The default policy is permissive enough for local testing:

deployerId: ${deployer.deployerId:local-light-deployer}
clusterId: ${deployer.clusterId:local}
allowedNamespaces: []
allowedRepoHosts: []
allowedRepoPrefixes: []
allowedImageRegistries: []
devInsecure: ${deployer.devInsecure:false}

Empty allow lists mean the policy does not restrict that dimension. For production, configure explicit values.
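That rule can be expressed in a few lines. The function below is an illustrative sketch of the documented semantics (empty list means unrestricted; a non-empty list requires a match, shown here as a prefix match in the style of allowedRepoPrefixes), not the deployer's actual policy code:

```rust
// Illustrative allow-list check: an empty list imposes no restriction;
// a non-empty list requires a prefix match. Sketch of the documented
// semantics, not the deployer's implementation.
fn allowed_by_prefix(allow_list: &[&str], candidate: &str) -> bool {
    allow_list.is_empty() || allow_list.iter().any(|p| candidate.starts_with(*p))
}

fn main() {
    // Empty list: everything passes.
    assert!(allowed_by_prefix(&[], "https://example.com/repo.git"));

    // Non-empty list: only matching prefixes pass.
    let prefixes = ["https://github.com/networknt/"];
    assert!(allowed_by_prefix(&prefixes, "https://github.com/networknt/openapi-petstore.git"));
    assert!(!allowed_by_prefix(&prefixes, "https://gitlab.com/other/repo.git"));
    println!("ok");
}
```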

Example tighter policy:

deployerId: petstore-microk8s
clusterId: microk8s-local
allowedNamespaces:
  - petstore-dev
allowedRepoHosts:
  - github.com
allowedRepoPrefixes:
  - https://github.com/networknt/
allowedImageRegistries:
  - networknt
devInsecure: false
prune:
  enabled: true
  maxDeletePercent: 30
  sensitiveKinds:
    - PersistentVolumeClaim
  overrideRequired: true
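The maxDeletePercent guard in the prune block can be sketched as a simple ratio check. This is an illustrative reading of the setting, not the deployer's actual accounting:

```rust
// Illustrative prune guard: block a prune when the share of resources
// to delete exceeds maxDeletePercent. Integer arithmetic avoids
// floating-point comparisons. Sketch only.
fn prune_allowed(total: usize, to_delete: usize, max_delete_percent: usize) -> bool {
    if total == 0 {
        return to_delete == 0;
    }
    // to_delete / total <= max_delete_percent / 100, cross-multiplied.
    to_delete * 100 <= total * max_delete_percent
}

fn main() {
    // 3 of 10 resources = 30%, exactly at the limit: allowed.
    assert!(prune_allowed(10, 3, 30));
    // 4 of 10 resources = 40%, over the limit: blocked.
    assert!(!prune_allowed(10, 4, 30));
    println!("ok");
}
```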

Git Access

Public repositories do not need credentials.

For private HTTPS repositories, set:

LIGHT_DEPLOYER_GIT_TOKEN=...

Defaults:

  • GitHub username: x-access-token
  • Bitbucket Cloud username: x-token-auth

For Bitbucket app passwords or other Git servers:

LIGHT_DEPLOYER_GIT_USERNAME=my-user
LIGHT_DEPLOYER_GIT_TOKEN=my-token-or-app-password

Only HTTPS token auth is supported in Phase 1. SSH auth is deferred.

Template Repository Requirements

The target application repository should contain a k8s/ directory with YAML templates. The deployer reads all .yaml and .yml files under the requested template path.

Example template reference:

{
  "template": {
    "repoUrl": "https://github.com/networknt/openapi-petstore.git",
    "ref": "master",
    "path": "k8s"
  }
}

For local testing without Git clone, set:

LIGHT_DEPLOYER_TEMPLATE_BASE_DIR=/home/steve/workspace/openapi-petstore

Then use:

{
  "template": {
    "repoUrl": "local",
    "ref": "master",
    "path": "k8s"
  }
}

Request Values

The request values object supplies placeholder values for templates.

Example for openapi-petstore:

{
  "name": "openapi-petstore",
  "image": {
    "repository": "networknt/openapi-petstore",
    "tag": "latest",
    "pullPolicy": "IfNotPresent"
  },
  "service": {
    "name": "openapi-petstore",
    "type": "ClusterIP"
  },
  "resources": {
    "requests": {
      "memory": "64Mi",
      "cpu": "250m"
    },
    "limits": {
      "memory": "256Mi",
      "cpu": "500m"
    }
  }
}

The current renderer replaces placeholders inside YAML string scalar values. Avoid placeholders in Kubernetes fields that must be numeric unless the template keeps those fields as fixed numbers.
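Nested values like the example above are typically referenced by dotted keys such as image.repository. The sketch below illustrates that flattening; the Value enum is a stand-in for the real JSON value type, and the actual renderer may organize this differently:

```rust
use std::collections::BTreeMap;

// Illustrative flattening of a nested values object into dotted keys
// ("image.repository", "image.tag", ...) of the shape templates
// reference. Value is a stand-in for the real JSON value type.
enum Value {
    Str(String),
    Map(BTreeMap<String, Value>),
}

fn flatten(prefix: &str, value: &Value, out: &mut BTreeMap<String, String>) {
    match value {
        Value::Str(s) => {
            out.insert(prefix.to_string(), s.clone());
        }
        Value::Map(m) => {
            for (k, v) in m {
                let key = if prefix.is_empty() {
                    k.clone()
                } else {
                    format!("{prefix}.{k}")
                };
                flatten(&key, v, out);
            }
        }
    }
}

fn main() {
    let mut image = BTreeMap::new();
    image.insert("repository".to_string(), Value::Str("networknt/openapi-petstore".to_string()));
    image.insert("tag".to_string(), Value::Str("latest".to_string()));
    let mut root = BTreeMap::new();
    root.insert("image".to_string(), Value::Map(image));

    let mut flat = BTreeMap::new();
    flatten("", &Value::Map(root), &mut flat);
    assert_eq!(flat.get("image.repository").map(String::as_str), Some("networknt/openapi-petstore"));
    assert_eq!(flat.get("image.tag").map(String::as_str), Some("latest"));
    println!("ok");
}
```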

Run Standalone

Standalone mode is the fastest way to test light-deployer before using a real Kubernetes cluster.

Use noop mode first. It validates config, HTTP endpoints, template loading, rendering, resource summaries, and response shape without mutating Kubernetes.

Run all commands from:

cd /home/steve/workspace/light-fabric

Start With Built-In Sample

Start the deployer with the sample template directory:

LIGHT_DEPLOYER_TEMPLATE_BASE_DIR=apps/light-deployer/examples/petstore \
LIGHT_DEPLOYER_KUBE_MODE=noop \
cargo run -p light-deployer

The service listens on:

http://127.0.0.1:7088

Check health from another terminal:

curl -fsSL http://127.0.0.1:7088/health

Expected output:

ok

List Tools With MCP JSON-RPC

The MCP endpoint is JSON-RPC 2.0 over HTTP at:

POST /mcp

List all deployment tools:

curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "tools-list-1",
    "method": "tools/list",
    "params": {}
  }'

Call a tool through MCP:

curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "render-1",
    "method": "tools/call",
    "params": {
      "name": "deployment.render",
      "arguments": {
        "hostId": "local-host",
        "instanceId": "petstore-dev",
        "environment": "dev",
        "clusterId": "local",
        "namespace": "light-deployer",
        "values": {
          "name": "petstore",
          "image": {
            "repository": "nginx",
            "tag": "1.27"
          },
          "containerPort": 80
        },
        "template": {
          "repoUrl": "local",
          "ref": "main",
          "path": "k8s"
        }
      }
    }
  }'

For local debugging, the deployer also exposes REST-style convenience endpoints:

curl -fsSL http://127.0.0.1:7088/mcp/tools/list
curl -fsSL http://127.0.0.1:7088/mcp/tools
curl -fsSL http://127.0.0.1:7088/mcp/tools/deployment.render

Use POST /mcp for MCP clients and AI agents.

Render The Built-In Sample

curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "render-sample-1",
    "method": "tools/call",
    "params": {
      "name": "deployment.render",
      "arguments": {
        "hostId": "local-host",
        "instanceId": "petstore-dev",
        "environment": "dev",
        "clusterId": "microk8s-local",
        "namespace": "light-deployer",
        "values": {
          "name": "petstore",
          "replicas": 1,
          "image": {
            "repository": "nginx",
            "tag": "1.27"
          },
          "containerPort": 80,
          "service": {
            "port": 80
          }
        },
        "template": {
          "repoUrl": "local",
          "ref": "main",
          "path": "k8s"
        }
      }
    }
  }'

Expected response shape:

{
  "jsonrpc": "2.0",
  "result": {
    "isError": false,
    "structuredContent": {
      "action": "render",
      "status": "rendered",
      "deployerId": "local-light-deployer",
      "clusterId": "local",
      "resources": [
        {
          "kind": "Deployment",
          "name": "petstore"
        },
        {
          "kind": "Service",
          "name": "petstore"
        }
      ]
    }
  }
}

Fields such as requestId and manifestHash in the full response will differ between runs.

Render openapi-petstore Locally

If /home/steve/workspace/openapi-petstore is available and has a k8s/ folder, run:

LIGHT_DEPLOYER_TEMPLATE_BASE_DIR=/home/steve/workspace/openapi-petstore \
LIGHT_DEPLOYER_KUBE_MODE=noop \
cargo run -p light-deployer

Render request:

curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "render-openapi-petstore-1",
    "method": "tools/call",
    "params": {
      "name": "deployment.render",
      "arguments": {
        "hostId": "local-host",
        "instanceId": "openapi-petstore-dev",
        "environment": "dev",
        "clusterId": "microk8s-local",
        "namespace": "petstore-dev",
        "values": {
          "name": "openapi-petstore",
          "image": {
            "repository": "networknt/openapi-petstore",
            "tag": "latest",
            "pullPolicy": "IfNotPresent"
          },
          "service": {
            "name": "openapi-petstore",
            "type": "ClusterIP"
          },
          "resources": {
            "requests": {
              "memory": "64Mi",
              "cpu": "250m"
            },
            "limits": {
              "memory": "256Mi",
              "cpu": "500m"
            }
          }
        },
        "template": {
          "repoUrl": "local",
          "ref": "master",
          "path": "k8s"
        }
      }
    }
  }'

Expected resources:

  • Deployment/openapi-petstore
  • Service/openapi-petstore

Test Git Fetch

Stop the local-template run and restart without LIGHT_DEPLOYER_TEMPLATE_BASE_DIR:

LIGHT_DEPLOYER_KUBE_MODE=noop \
cargo run -p light-deployer

Render from GitHub:

curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "render-git-1",
    "method": "tools/call",
    "params": {
      "name": "deployment.render",
      "arguments": {
        "hostId": "local-host",
        "instanceId": "openapi-petstore-dev",
        "environment": "dev",
        "clusterId": "microk8s-local",
        "namespace": "petstore-dev",
        "values": {
          "name": "openapi-petstore",
          "image": {
            "repository": "networknt/openapi-petstore",
            "tag": "latest"
          }
        },
        "template": {
          "repoUrl": "https://github.com/networknt/openapi-petstore.git",
          "ref": "master",
          "path": "k8s"
        }
      }
    }
  }'

For a private repository:

LIGHT_DEPLOYER_GIT_TOKEN=... \
LIGHT_DEPLOYER_KUBE_MODE=noop \
cargo run -p light-deployer

For Bitbucket app-password style auth:

LIGHT_DEPLOYER_GIT_USERNAME=my-user \
LIGHT_DEPLOYER_GIT_TOKEN=my-app-password \
LIGHT_DEPLOYER_KUBE_MODE=noop \
cargo run -p light-deployer

Dry Run And Diff In Noop Mode

Noop mode can also exercise the request path for these tools:

curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "dry-run-sample-1",
    "method": "tools/call",
    "params": {
      "name": "deployment.dryRun",
      "arguments": {
        "hostId": "local-host",
        "instanceId": "petstore-dev",
        "environment": "dev",
        "clusterId": "microk8s-local",
        "namespace": "light-deployer",
        "values": {
          "name": "petstore",
          "replicas": 1,
          "image": {
            "repository": "nginx",
            "tag": "1.27"
          },
          "containerPort": 80,
          "service": {
            "port": 80
          }
        },
        "template": {
          "repoUrl": "local",
          "ref": "main",
          "path": "k8s"
        }
      }
    }
  }'
curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "diff-sample-1",
    "method": "tools/call",
    "params": {
      "name": "deployment.diff",
      "arguments": {
        "hostId": "local-host",
        "instanceId": "petstore-dev",
        "environment": "dev",
        "clusterId": "microk8s-local",
        "namespace": "light-deployer",
        "values": {
          "name": "petstore",
          "replicas": 1,
          "image": {
            "repository": "nginx",
            "tag": "1.27"
          },
          "containerPort": 80,
          "service": {
            "port": 80
          }
        },
        "template": {
          "repoUrl": "local",
          "ref": "main",
          "path": "k8s"
        }
      }
    }
  }'

These calls do not validate against Kubernetes unless real mode is enabled.

Stop The Service

Press Ctrl-C in the terminal running cargo run.

Run Kubernetes

This page runs light-deployer inside MicroK8s and uses the in-cluster ServiceAccount with kube-rs.

Prerequisites

MicroK8s should be running and microk8s kubectl should work:

microk8s status --wait-ready
microk8s kubectl get nodes

Build the image first:

cd /home/steve/workspace/light-fabric
./apps/light-deployer/build.sh latest

Import Image Into MicroK8s

docker save networknt/light-deployer:latest | microk8s ctr image import -

If your MicroK8s install requires elevated permissions:

docker save networknt/light-deployer:latest | sudo microk8s ctr image import -

Verify the image is available:

microk8s ctr images ls | grep light-deployer

Install Deployer

Apply the included manifests:

microk8s kubectl apply -f apps/light-deployer/k8s/namespace.yaml
microk8s kubectl apply -f apps/light-deployer/k8s/rbac.yaml
microk8s kubectl apply -f apps/light-deployer/k8s/deployment.yaml
microk8s kubectl apply -f apps/light-deployer/k8s/service.yaml

Wait for the pod:

microk8s kubectl -n light-deployer rollout status deploy/light-deployer
microk8s kubectl -n light-deployer get pods

Check logs:

microk8s kubectl -n light-deployer logs deploy/light-deployer

The deployment sets:

LIGHT_DEPLOYER_KUBE_MODE=real

The service therefore makes real Kubernetes API calls from inside the cluster.

Port Forward

microk8s kubectl -n light-deployer port-forward svc/light-deployer 7088:7088

In another terminal:

curl -fsSL http://127.0.0.1:7088/health

Expected:

ok

List Tools

curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "tools-list-1",
    "method": "tools/list",
    "params": {}
  }'

The response contains the deployer's tool names, descriptions, input schemas, and invocation metadata. Light Portal can use this JSON-RPC response to populate MCP tools for the API details view.

Render In Kubernetes

Rendering does not mutate the cluster:

curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "render-sample-1",
    "method": "tools/call",
    "params": {
      "name": "deployment.render",
      "arguments": {
        "hostId": "local-host",
        "instanceId": "petstore-dev",
        "environment": "dev",
        "clusterId": "microk8s-local",
        "namespace": "light-deployer",
        "values": {
          "name": "petstore",
          "replicas": 1,
          "image": {
            "repository": "nginx",
            "tag": "1.27"
          },
          "containerPort": 80,
          "service": {
            "port": 80
          }
        },
        "template": {
          "repoUrl": "local",
          "ref": "main",
          "path": "k8s"
        }
      }
    }
  }'

Dry Run In Kubernetes

Dry-run renders the manifest and asks the Kubernetes API to validate it without persisting resources:

curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "dry-run-sample-1",
    "method": "tools/call",
    "params": {
      "name": "deployment.dryRun",
      "arguments": {
        "hostId": "local-host",
        "instanceId": "petstore-dev",
        "environment": "dev",
        "clusterId": "microk8s-local",
        "namespace": "light-deployer",
        "values": {
          "name": "petstore",
          "replicas": 1,
          "image": {
            "repository": "nginx",
            "tag": "1.27"
          },
          "containerPort": 80,
          "service": {
            "port": 80
          }
        },
        "template": {
          "repoUrl": "local",
          "ref": "main",
          "path": "k8s"
        }
      }
    }
  }'

Expected status:

{
  "jsonrpc": "2.0",
  "result": {
    "isError": false,
    "structuredContent": {
      "status": "validated"
    }
  }
}

Deploy Sample

The sample request deploys into the light-deployer namespace so it matches the included namespace-scoped RBAC.

curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "apply-sample-1",
    "method": "tools/call",
    "params": {
      "name": "deployment.apply",
      "arguments": {
        "hostId": "local-host",
        "instanceId": "petstore-dev",
        "environment": "dev",
        "clusterId": "microk8s-local",
        "namespace": "light-deployer",
        "values": {
          "name": "petstore",
          "replicas": 1,
          "image": {
            "repository": "nginx",
            "tag": "1.27"
          },
          "containerPort": 80,
          "service": {
            "port": 80
          }
        },
        "template": {
          "repoUrl": "local",
          "ref": "main",
          "path": "k8s"
        }
      }
    }
  }'

The response should return quickly with an accepted/applying-style status. The operation continues in the deployer.

Watch Kubernetes resources:

microk8s kubectl -n light-deployer get deploy,svc,pods

Stream Events

Use the requestId from the deployment response:

curl -N "http://127.0.0.1:7088/events?request_id=<requestId>"

The event stream reports deployment progress and failures for that request.

Check Status

curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "status-sample-1",
    "method": "tools/call",
    "params": {
      "name": "deployment.status",
      "arguments": {
        "hostId": "local-host",
        "instanceId": "petstore-dev",
        "environment": "dev",
        "clusterId": "microk8s-local",
        "namespace": "light-deployer",
        "template": {
          "repoUrl": "local",
          "ref": "main",
          "path": "k8s"
        }
      }
    }
  }'

Undeploy Sample

curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "delete-sample-1",
    "method": "tools/call",
    "params": {
      "name": "deployment.delete",
      "arguments": {
        "hostId": "local-host",
        "instanceId": "petstore-dev",
        "environment": "dev",
        "clusterId": "microk8s-local",
        "namespace": "light-deployer",
        "template": {
          "repoUrl": "local",
          "ref": "main",
          "path": "k8s"
        }
      }
    }
  }'

Then verify resources:

microk8s kubectl -n light-deployer get deploy,svc,pods

Deploy openapi-petstore From Git

After the openapi-petstore repository has a k8s/ folder committed, use a request like this:

curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "apply-openapi-petstore-1",
    "method": "tools/call",
    "params": {
      "name": "deployment.apply",
      "arguments": {
        "hostId": "local-host",
        "instanceId": "openapi-petstore-dev",
        "environment": "dev",
        "clusterId": "microk8s-local",
        "namespace": "light-deployer",
        "values": {
          "name": "openapi-petstore",
          "image": {
            "repository": "networknt/openapi-petstore",
            "tag": "latest",
            "pullPolicy": "IfNotPresent"
          },
          "service": {
            "name": "openapi-petstore",
            "type": "ClusterIP"
          }
        },
        "template": {
          "repoUrl": "https://github.com/networknt/openapi-petstore.git",
          "ref": "master",
          "path": "k8s"
        }
      }
    }
  }'

For private Git access, set LIGHT_DEPLOYER_GIT_TOKEN on the deployer pod. In Kubernetes this should be injected from a Secret, not written directly into the deployment manifest.
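One way to wire that up is shown below. The Secret and key names here are illustrative, not taken from the repository; only the LIGHT_DEPLOYER_GIT_TOKEN variable name comes from the text above.

```yaml
# Create the Secret out-of-band, e.g.:
#   microk8s kubectl -n light-deployer create secret generic deployer-git \
#     --from-literal=token=<your-token>
# Then reference it from the deployer container spec:
env:
  - name: LIGHT_DEPLOYER_GIT_TOKEN
    valueFrom:
      secretKeyRef:
        name: deployer-git
        key: token
```

This keeps the token out of the deployment manifest and out of version control.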

Update The Deployer Image

After rebuilding locally:

./apps/light-deployer/build.sh latest
docker save networknt/light-deployer:latest | microk8s ctr image import -
microk8s kubectl -n light-deployer rollout restart deploy/light-deployer
microk8s kubectl -n light-deployer rollout status deploy/light-deployer

Remove The Deployer

microk8s kubectl delete -f apps/light-deployer/k8s/service.yaml
microk8s kubectl delete -f apps/light-deployer/k8s/deployment.yaml
microk8s kubectl delete -f apps/light-deployer/k8s/rbac.yaml
microk8s kubectl delete -f apps/light-deployer/k8s/namespace.yaml

Light-Gateway

light-gateway is the Pingora-based gateway product in Light-Fabric.

It is intended to host gateway behavior such as routing, proxying, and eventually AI/MCP gateway integrations while using the shared runtime and config model.

Key Dependencies

  • light-runtime
  • light-pingora
  • config-loader

Runtime

The gateway uses light-pingora as its transport framework and light-runtime for lifecycle, bootstrap, and service configuration.

Light-Workflow

light-workflow is the workflow execution service for Agentic Workflow documents.

It loads workflow definitions, executes workflow tasks, integrates with light-rule for rule-backed checks, and exposes workflow execution APIs.

Key Dependencies

  • workflow-core
  • light-rule
  • axum
  • sqlx
  • reqwest

Role

light-workflow is the runtime service that turns workflow specifications into long-running execution state. It is used by agentic flows, human-in-the-loop orchestration, and integration-test-style automation.

Comparison: Light-Fabric vs. AgentGateway

This document provides a high-level comparison between Light-Fabric and AgentGateway to help architects and engineering leaders choose the right foundation for their agentic workflows.

Overview

While both systems aim to facilitate interaction with Large Language Models (LLMs), they operate at different layers of the AI stack and prioritize different architectural outcomes.

| Feature | Light-Fabric | AgentGateway |
|---|---|---|
| Primary Philosophy | Agentic Fabric: Unified Governance & Lifecycle | Agentic Gateway: High-performance Proxy |
| Core Architecture | Integrated Platform (Layer) | Standalone Gateway (Service) |
| Target User | Central IT / Platform Engineering | Application Developers / DevOps |
| Lifecycle Management | APIs, Agents, MCPs, and Gateways | Primarily LLM Request Routing |
| Language | Native Rust (Extreme Performance) | Rust / Go (Variable) |

1. Governance vs. Connectivity

Light-Fabric (Governance)

Light-Fabric is designed as a Single Control Plane. It assumes that in an enterprise environment, "freedom without governance is chaos." It provides:

  • Centralized Registry: Every agent, skill, and tool is registered and governed via the light-portal.
  • Fine-Grained Authorization: Deep policy enforcement at the endpoint level, including row and column-level data masking.
  • Auditability: A unified audit trail for all agentic interactions across the entire organization.

AgentGateway (Connectivity)

AgentGateway typically focuses on the North-South traffic between an application and multiple LLM providers. Its primary strength is:

  • Simplified Routing: Getting a request from Point A to Point B with retries and failover.
  • Provider Abstraction: Normalizing different LLM APIs into a single interface.

2. Integrated Intelligence: Hindsight

One of the defining differences of the Light-Fabric is the deep integration of Hindsight Memory.

  • Light-Fabric: Memory is not an "add-on." The platform provides native biomimetic memory banks (World Facts, Experiences, Mental Models) that are automatically managed and scoped (Global, Shared, Private) as part of the fabric.
  • AgentGateway: Typically treats memory as external state. The application or a separate vector database must manage context before sending the request through the gateway.

3. Skill & Tool Management

Centralized Skills (Fabric)

In Light-Fabric, skills (tools) are first-class citizens. They are registered, versioned, and governed centrally. An agent doesn't just "have" a tool; the Fabric grants the agent access to a skill based on its role and the current context.

Standard Tooling (Gateway)

AgentGateway generally passes tool definitions through to the provider. The management of who can use which tool and how those tools are secured is usually left to the application logic.


4. Orchestration: Hybrid Agentic Workflows

Light-Fabric (Integrated Orchestrator)

Light-Fabric treats orchestration as a foundational service. It implements a Hybrid Model:

  • Deterministic Process: The overall business logic (e.g., insurance claim steps) is fixed and compliant.
  • Autonomous Tasks: Individual steps within the process are delegated to agents.
  • Statefulness: The Fabric manages long-running state across days or weeks, ensuring durability.
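To make the hybrid model concrete, a workflow document might look roughly like the sketch below. The schema shown is purely hypothetical, not the actual light-workflow format; it only illustrates how deterministic steps, agent-delegated tasks, and human gates can coexist in one definition.

```yaml
# Hypothetical sketch -- not the actual light-workflow schema.
workflow: insurance-claim
steps:
  - id: intake             # deterministic: fixed, auditable step
    type: task
    action: validate-claim-form
  - id: assess-damage      # autonomous: delegated to an agent
    type: agent
    agent: claims-assessor
    tools: [photo-analysis, policy-lookup]
  - id: approval           # human-in-the-loop gate; may wait days
    type: human
    role: claims-manager
  - id: payout             # deterministic: resumes once approved
    type: task
    action: issue-payment
```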

AgentGateway (Stateless Proxy)

AgentGateway is primarily a stateless component.

  • External Orchestration: The workflow logic must reside in your application code or an external engine (like Temporal).
  • Proxy Only: It handles the communication but does not "understand" or manage the multi-step business process itself.

5. Security: The Rule Engine

Light-Fabric (Integrated Governance)

Light-Fabric includes an integrated YAML-based Rule Engine (light-rule) designed for fine-grained authorization:

  • Data Filtering: Automatically masks or filters response data (column/row level) based on policies.
  • Policy Enforcement: Checks permissions before an agent executes a tool or accesses a memory unit.
  • Hot-Reloading: Security rules can be updated in real-time without redeploying the platform.
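As a concrete illustration, a rule in this style could look like the following. The schema here is illustrative only and is not the actual light-rule syntax.

```yaml
# Illustrative sketch -- not the actual light-rule schema.
rule: mask-customer-pii
target: tool:customer-lookup
when:
  subject.role: support-agent
then:
  filter:
    rows: "region == subject.region"    # row-level filtering
    mask_columns: [ssn, date_of_birth]  # column-level masking
```

Because rules are plain YAML, they can be validated and hot-reloaded by the control plane without a platform redeploy.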

AgentGateway (Basic Middleware)

AgentGateway typically provides basic security features like API key validation or rate limiting.

  • Limited Filtering: While it can intercept traffic, implementing complex, context-aware data masking usually requires writing custom middleware or handling it at the application level.

6. MCP Support: Gateway vs. Ecosystem

Light-Fabric (Integrated Tooling)

Light-Fabric treats Model Context Protocol (MCP) as a primary source for agent tools.

  • Direct Integration: Agents use the mcp-client to directly consume tools from MCP servers.
  • Registry Management: MCP servers are registered in the light-portal, allowing for centralized discovery and governance.
  • Unified Security: The same Fine-Grained Authorization rules apply to MCP tools as they do to native Rust tools.

AgentGateway (Specialized MCP Proxy)

AgentGateway provides a highly specialized MCP Gateway layer.

  • Protocol Translation: It excels at translating between different MCP transports (SSE, Streamable HTTP, etc.).
  • Exposing Servers: Its primary role is to make MCP servers accessible to external applications through a normalized gateway interface.
  • Advanced Networking: Includes features like stream merging and specialized MCP routing.

For a deep dive into the technical differences, see our Detailed MCP Feature Comparison.


Summary: Which to Choose?

Choose Light-Fabric if:

  • You are building an Enterprise AI Strategy that requires unified governance, stateful workflows, and integrated security.
  • You need to manage the entire lifecycle of agents and the business processes they participate in.
  • You require advanced data privacy (masking) and long-term memory (Hindsight) as native platform features.

Choose AgentGateway if:

  • You need a lightweight proxy to handle LLM provider failover and basic request normalization.
  • You prefer to manage agent logic, workflows, memory, and security entirely within your external application stack.
  • You are looking for a simple tool to solve immediate connectivity needs without implementing a comprehensive platform layer.

Detailed Comparison: MCP Gateway Features

This document provides a technical deep dive into the Model Context Protocol (MCP) implementations in Light-Fabric and AgentGateway.

Feature Matrix

| Feature | Light-Fabric | AgentGateway |
|---|---|---|
| Primary Role | Provider/Gateway/Portal: Exposes MCP/API servers. | Provider/Gateway: Exposes MCP servers. |
| Onboarding | Auto-Discovery: Automatic tools/list sync. | Manual: K8s CRD/Manifest configuration. |
| Data Privacy | Deep: Row/Column-level masking. | Basic: Allow/Deny access control. |
| Transports | SSE, Streamable HTTP, WebSocket | SSE, Streamable HTTP, WebSocket |
| Legacy Integration | Native: REST/RPC to MCP transformation. | External: Manual wrappers required. |
| Authorization | Managed: Roles, Groups, Positions, Attributes. | Infrastructure: CEL-based policies. |
| Hot-Reloading | Native: Integrated Control Plane & Registry. | Infrastructure: Istio/xDS sync. |
| Authentication | JWT (End-to-End Propagation) | JWT, Keycloak, OIDC, Passthrough |
| Observability | Distributed Tracing (OTEL) and Integrated Hindsight Memory | Distributed Tracing (OTEL) |

1. Architectural Intent

AgentGateway: The Network Proxy Layer

AgentGateway is designed as a high-availability proxy for MCP servers. Its primary focus is the North-South traffic between an application and multiple MCP backends.

  • Multiplexing: Optimized for merging multiple MCP backends into a single upstream connection (mergestream.rs).
  • Protocol Translation: Excels at translating between SSE, Streamable HTTP, and WebSocket transports.
  • Infrastructure Focus: Operates as a Kubernetes-native component managed via manifests and standard networking policies.

Light-Fabric: The Managed Enterprise Platform

Light-Fabric provides a Unified Governance Fabric that treats AI agents and MCP tools as part of the broader enterprise API ecosystem.

  • Unified Gateway: The AI Gateway (Rust/Pingora-based) serves as a single entry point for UI, Agents, and Tools, supporting both MCP and traditional REST/RPC APIs.
  • Centralized Portal: Uses the Light-Portal as a control plane for onboarding (auto-discovery), configuration (hot-reloading), and security management.
  • Governed Intelligence: Integrates the gateway directly with Hindsight Memory and the Fine-Grained Rule Engine, ensuring that every tool call is governed by corporate compliance rules (e.g., row/column masking).
  • End-to-End Security: Maintains a single JWT-based identity from the user's chat interface all the way to the underlying MCP or API endpoint.

2. Security & Authorization

AgentGateway: Infrastructure-Aware RBAC

AgentGateway uses Common Expression Language (CEL) for its authorization policies.

  • Capabilities: High-speed, network-level blocking based on JWT claims and request headers.
  • Limitation: Lacks native support for content-aware data masking or organizational hierarchy logic.
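For context, a CEL policy in this style typically evaluates JWT claims and request metadata. The expression below is illustrative and not taken from AgentGateway's configuration; the `jwt` and `request` variable names are assumptions.

```cel
// Allow only members of the "mcp-users" group calling tools over POST.
jwt.claims["groups"].exists(g, g == "mcp-users") && request.method == "POST"
```

This is fast and expressive for header- and claim-level decisions, but it operates on the request envelope, not on the response payload, which is why content-aware masking falls outside its scope.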

Light-Fabric: Content-Aware Managed Auth

Light-Fabric provides a mature Fine-Grained Authorization layer:

  • Managed ABAC/PBAC: Supports Role, Group, Corporate Position (Hierarchy), and Attribute-based protection.
  • Data Privacy: Supports native Row and Column filtering (data masking), ensuring agents only see data they are authorized to process.
  • End-to-End JWT: The same JWT token is propagated from the UI through the Agent to the AI Gateway and MCP tool.

3. Lifecycle & Tool Onboarding

AgentGateway: Configuration-Driven

Onboarding tools in AgentGateway is an infrastructure task:

  • Manual Mapping: Requires defining Kubernetes Custom Resources (HTTPRoute, Backend) to map MCP servers to the gateway.
  • Scope: Primarily focused on exposing existing MCP servers.

Light-Fabric: Registry-Driven

Light-Fabric provides a "Zero-Effort" onboarding experience via Light-Portal:

  • Auto-Discovery: Registering an MCP API triggers an automatic tools/list call to populate the registry.
  • Protocol Transformation: Automatically transforms existing OpenAPI/REST and RPC services into MCP tools without requiring wrappers.
  • Centralized Governance: All tools (Native, REST, MCP) are managed in a single unified registry.
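The auto-discovery step relies on the standard MCP tools/list method; in essence, registering a server triggers a JSON-RPC call of the following shape against it (the id value is illustrative):

```json
{
  "jsonrpc": "2.0",
  "id": "discover-1",
  "method": "tools/list"
}
```

The returned tool definitions are then persisted in the registry, so no per-tool manifest authoring is required.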

4. Control Plane & Configuration

AgentGateway: Kubernetes-Native

  • Orchestration: Managed via the Istio/xDS control plane.
  • Updates: Configuration changes are applied via Kubernetes manifests (YAML).

Light-Fabric: Portal-Managed

  • Hot-Reloading: Uses a dedicated Config Server and Control Plane to update gateway and agent configurations in real-time without restarts.
  • Enterprise Management: Business-centric UI for managing tool visibility, agent permissions, and security policies.

5. Conclusion

  • Use AgentGateway if you are an infrastructure provider who needs to expose MCP-based tools to multiple external applications securely and reliably.
  • Use Light-Fabric if you are building intelligent agents that need to use those tools to solve complex business problems within a governed framework.

Why Light-Fabric Already Covers the MCP Gateway — No Second Gateway Required

This document addresses a recommendation (produced by Grok AI) suggesting that an enterprise should deploy the open-source AgentGateway as a dedicated MCP layer alongside an existing API platform. After performing a side-by-side source code analysis of both projects (see vs-agentgateway.md and vs-agent-gateway-mcp.md), we present the findings below.


1. The Recommendation Was Generated Without Knowledge of Light-Fabric

The Grok-produced analysis operates under a critical blind spot: it has no knowledge of Light-Fabric (Rust-based, open-sourced to customers) or its capabilities. The recommendation frames the choice as "keep your existing REST platform + add AgentGateway for MCP," because Grok only knows about publicly documented open-source projects. It does not account for the fact that:

  • Light-Fabric is already in production and serving agentic workloads today.
  • Every feature listed in the recommendation — MCP federation, tool discovery, protocol translation, security, and observability — has already been built, demonstrated, and validated with the project team.
  • The comparison is therefore not between "a REST framework" and "an MCP gateway." It is between two systems that both provide MCP gateway capabilities, where one (Light-Fabric/Light-Gateway) is already deployed and battle-tested in our environment.

2. Source Code Analysis: Light-Fabric Already Does What AgentGateway Does

We conducted a detailed, code-level comparison of both projects. The full results are documented in our High-Level Comparison and Detailed MCP Feature Comparison. The key findings are summarized below.

2.1 MCP Protocol Support

| Capability | Light-Fabric | AgentGateway |
|---|---|---|
| Transports | SSE, Streamable HTTP, WebSocket | SSE, Streamable HTTP, WebSocket |
| Tool Discovery | Auto-discovery via tools/list sync | Manual K8s CRD configuration |
| Protocol Translation | Native REST/RPC → MCP transformation | Manual wrappers required |
| Stream Handling | Supported | Supported (mergestream) |

Both projects support the same MCP transports. Light-Fabric goes further with automatic tool discovery and native protocol transformation from existing REST/RPC APIs — exactly the "OpenAPI-to-MCP mapping" that the Grok recommendation credits to AgentGateway, except Light-Fabric does it without requiring a separate component.

2.2 Security & Authorization

| Capability | Light-Fabric | AgentGateway |
|---|---|---|
| Authentication | JWT (end-to-end propagation) | JWT, Keycloak, OIDC, Passthrough |
| Authorization | Role, Group, Position, Attribute-based (ABAC/PBAC) | CEL-based policies |
| Data Privacy | Row/Column-level masking | Allow/Deny access control |
| Rule Engine | Integrated YAML-based, hot-reloadable | Basic middleware |

The Grok recommendation highlights "tool-level RBAC" and "MCP-compliant OAuth 2.1" as AgentGateway strengths. Our code analysis shows that Light-Fabric's authorization model is significantly deeper — it supports corporate-hierarchy-aware policies and content-level data masking that AgentGateway simply does not implement.

2.3 Lifecycle & Operations

| Capability | Light-Fabric | AgentGateway |
|---|---|---|
| Onboarding | Portal-driven, auto-discovery | K8s manifest-driven, manual |
| Hot-Reloading | Native (Config Server + Control Plane) | Infrastructure-dependent (Istio/xDS) |
| Observability | OTEL + integrated Hindsight Memory | OTEL + OpenInference |
| Orchestration | Integrated hybrid workflows (deterministic + autonomous) | None (stateless proxy) |

Light-Fabric manages the entire lifecycle — from tool registration through governance to runtime orchestration — while AgentGateway only handles the proxy layer.


3. Two Gateways Is Overkill

The Grok recommendation frames the architecture as a "clean separation of concerns." In practice, deploying both Light-Fabric and AgentGateway creates redundant infrastructure with real costs:

Duplicated Capabilities

Both systems would be performing the same core functions:

  • Receiving MCP requests from agents
  • Translating tool calls to backend HTTP requests
  • Enforcing security policies on tool access
  • Providing observability for agentic traffic

Running two gateways that do the same thing is not "separation of concerns" — it is duplication of concerns. Every MCP request would traverse two proxy layers instead of one, adding latency and operational complexity for zero additional capability.

Operational Burden

  • Two deployment pipelines to maintain on EKS
  • Two sets of security policies to keep in sync
  • Two configuration surfaces (K8s CRDs for AgentGateway vs. Portal for Light-Fabric)
  • Two failure domains to monitor and troubleshoot
  • Two upgrade cycles to coordinate

The "No Code Changes" Claim Is Misleading

The Grok recommendation states AgentGateway requires "no code changes." This is true only if you ignore the work required to:

  • Write and maintain Kubernetes Custom Resources for every MCP backend
  • Build manual wrappers for non-MCP services (Light-Fabric does this natively)
  • Implement application-level logic for everything AgentGateway doesn't cover (stateful workflows, data masking, memory management)

Light-Fabric also requires no code changes to existing backend services — and it provides the governance layer out of the box.


4. Addressing the "Rust Performance" Argument

The recommendation claims AgentGateway has a "performance edge" due to its Rust data plane. This argument does not hold:

  • Light-Fabric's AI Gateway currently runs on the high-performance Java-based light-gateway, and a new Rust-based AI Gateway, built on the Pingora framework (Cloudflare's production proxy engine), is underway. Even the existing Java gateway delivers exceptional throughput, and the Rust gateway will remove the JVM from the critical path entirely.
  • Both systems benefit from Rust's zero-cost abstractions, memory safety, and lack of garbage collection pauses.
  • The performance comparison between the two Rust implementations would be marginal and workload-dependent — not a differentiator.

5. Addressing the "Custom Development" Concern

The recommendation warns against "implementing MCP directly" because it "involves significant custom development." This concern does not apply:

  • Light-Fabric's MCP support is not custom development — it is a fully implemented, production-ready feature of the platform.
  • The MCP client, gateway routing, tool registry, and security integration are all existing, tested components, not a backlog of work to be done.
  • The project team has already seen these features demonstrated end-to-end.

6. Summary

| Concern from Grok Recommendation | Reality |
|---|---|
| "Light4j is a REST framework, not an AI proxy" | Light-Fabric is a full agentic platform with an AI Gateway already in production |
| "AgentGateway provides MCP federation and tool discovery" | Light-Fabric provides the same capabilities with deeper governance |
| "Rust performance advantage over JVM" | Light-Fabric's Java gateway is already very fast, and a Rust (Pingora-based) gateway is coming |
| "Clean separation of concerns" | Two gateways doing the same thing is duplication, not separation |
| "No code changes required" | True for both — but AgentGateway requires extensive K8s manifest management |
| "Custom MCP implementation is risky" | Light-Fabric's MCP support is already built, tested, and in production |

Conclusion

The Grok-generated recommendation is well-structured but fundamentally flawed because it was produced without knowledge of Light-Fabric's capabilities. When evaluated against the actual source code and production state of both systems, the case for adding AgentGateway collapses:

  • Light-Fabric already provides every MCP gateway capability that AgentGateway offers.
  • Light-Fabric goes significantly further with integrated governance, data privacy, memory, and orchestration.
  • Adding a second gateway introduces operational complexity and latency with no net-new capability.

The pragmatic, low-risk path is to continue with the platform that is already built, already in production, and already proven to the team.