Light-Fabric

Light-Fabric is a high-performance, unified platform for managing the lifecycle, governance, and orchestration of enterprise AI services including agentic services, agents, tools, skills, memories, MCP servers, APIs, gateways and workflows.

Why Light-Fabric?

We chose the name Light-Fabric because it embodies the "Unified Governance" required for enterprise-grade AI:

  • Unified Control Plane: Light-Fabric provides a single point of truth for discovering, governing, and auditing agents, MCP servers, and APIs via the light-portal.
  • Enterprise Governance: It prioritizes security and policy enforcement (such as fine-grained authorization) over pure decentralized autonomy, making it safe for corporate environments.
  • Integrated Ecosystem: It "weaves" together distributed components—from memory units (Hindsight) to centralized skills—into a cohesive, observable system.
  • Durable Identity: The name emphasizes the platform's role as the infrastructure foundation, remaining relevant regardless of the underlying implementation details.

Technical Advantages

By building Light-Fabric on a Rust foundation, we achieve:

  • Performance: Built on tokio and axum for high throughput, with memory safety provided by Rust itself.
  • Native Intelligence: Specialized crates for Hindsight memory, tool calling, and workflow orchestration.
  • Production Ready: Includes robust features like retries, failover, and observability out of the box.

Core Components

Light-Fabric is composed of modular crates, infrastructure frameworks, and reference applications:

Crates

  • crates/model-provider: A unified interface for multiple LLM providers (Ollama, etc.).
  • crates/hindsight-client: Client for the Hindsight biomimetic memory system.
  • crates/mcp-client: Implementation of the Model Context Protocol (MCP) for tool discovery and execution.
  • crates/portal-registry: Integration with the Light-Portal for service registration and discovery.
  • crates/light-runtime: Core runtime foundation for building agentic and microservice components.
  • crates/light-rule: High-performance rule engine for fine-grained authorization and data filtering.
  • crates/workflow-core & workflow-builder: Core engine and builder for complex agentic workflows.
  • crates/config-loader: Flexible configuration management for enterprise environments.
  • crates/asymmetric-decryptor & symmetric-decryptor: Security utilities for sensitive data handling.

Frameworks

  • frameworks/light-axum: A specialized microservice & agentic framework built on top of the Axum web ecosystem.
  • frameworks/light-pingora: High-performance proxy and gateway framework built on top of Cloudflare's Pingora.

Applications

  • apps/light-agent: A managed AI agent capable of using tools, accessing memory, and executing complex tasks.
  • apps/light-gateway: An enterprise-grade gateway for securing and governing API and agent traffic.
  • apps/light-workflow: A service for orchestrating and executing long-running agentic workflows.

Getting Started with Light-Fabric

This guide will help you set up a local development environment for Light-Fabric, including the AI Gateway, Agent Engine, and the management Portal.

Prerequisites

  • Rust: Latest stable version.
  • Docker: For running database and backend services.
  • Node.js: For running the portal-view UI.
  • Git: To clone the necessary repositories.

Local Development Setup

To run the entire ecosystem locally, we use the portal-config-loc and service-asset repositories to manage configuration and pre-built assets.

1. Initialize Workspace

Create a unified workspace directory (e.g., ~/lightapi) and clone the core management repositories:

cd ~
mkdir -p lightapi
cd lightapi

# Clone configuration and assets
git clone git@github.com:lightapi/portal-config-loc.git
git clone git@github.com:lightapi/service-asset.git

2. Deploy Local Services

Light-Fabric services are orchestrated via Docker Compose scripts in portal-config-loc. The following command starts the PostgreSQL database and the core services (including the Rust-based components):

cd ~/lightapi/portal-config-loc
./scripts/deploy-local.sh pg rust

3. Import Initial Data

Use the importer script in service-asset to populate the local database with initial events, users, and configurations:

cd ~/lightapi/service-asset
./importer.sh -f events.json

4. Update /etc/hosts

The platform uses virtual hosts for local routing. Add the following entry to your /etc/hosts file (replace with your actual local IP if necessary):

127.0.0.1  local.lightapi.net locsignin.lightapi.net

Running the Management Portal

The Light-Portal provides a unified UI for onboarding MCP servers, configuring AI Gateways, and interacting with agents.

cd ~/lightapi
git clone git@github.com:lightapi/portal-view.git
cd portal-view
npm install
npm run dev

Navigate to https://localhost:3000 and log in with your developer credentials.


Cloud Development (Coming Soon)

We are currently preparing a Cloud Development Server. This will allow developers to:

  • Connect to a shared, high-performance AI Gateway.
  • Onboard and test MCP servers without a full local installation.
  • Collaborate on shared agentic workflows and Hindsight memory banks.

Stay tuned for the connection details and onboarding guide for the cloud environment.


Contributing to Light-Fabric

To work on the Rust crates directly, clone and build the light-fabric repository:

cd ~/lightapi
git clone git@github.com:networknt/light-fabric.git
cd light-fabric
cargo build

Model Providers

Light-Fabric provides a unified, high-performance interface for interacting with diverse Large Language Model (LLM) providers. This abstraction is centered around the Provider trait, allowing applications to remain model-agnostic while leveraging advanced capabilities like native tool calling and prompt caching.

The Provider Trait

All model integrations implement the Provider trait, which supports:

  • One-shot and Multi-turn Chat: Simplified APIs for simple prompts and full conversation histories.
  • Structured Tool Calling: Native integration for function calling (OpenAI-style).
  • Capabilities Detection: Programmatic checks for vision, native tool support, and prompt caching.
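
To make the shape of such a trait concrete, here is a minimal sketch. It is not the actual crates/model-provider API: the names Provider, Capabilities, ChatMessage, and EchoProvider are assumptions made for this example, and the sketch is synchronous for brevity, whereas a real provider trait would most likely be asynchronous.

```rust
/// Hypothetical capability flags; field names are assumptions for illustration.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct Capabilities {
    pub vision: bool,
    pub native_tools: bool,
    pub prompt_caching: bool,
}

/// One message in a conversation history.
#[derive(Debug, Clone, PartialEq)]
pub struct ChatMessage {
    pub role: String,
    pub content: String,
}

/// Simplified, synchronous stand-in for a provider trait.
pub trait Provider {
    fn name(&self) -> &str;

    /// Lets callers check for optional features before relying on them.
    fn capabilities(&self) -> Capabilities;

    /// Multi-turn chat over a full conversation history.
    fn chat(&self, history: &[ChatMessage]) -> Result<String, String>;

    /// One-shot convenience built on top of `chat`.
    fn complete(&self, prompt: &str) -> Result<String, String> {
        self.chat(&[ChatMessage {
            role: "user".into(),
            content: prompt.into(),
        }])
    }
}

/// Toy provider that echoes the last message; stands in for a real backend.
pub struct EchoProvider;

impl Provider for EchoProvider {
    fn name(&self) -> &str {
        "echo"
    }

    fn capabilities(&self) -> Capabilities {
        Capabilities {
            native_tools: true,
            ..Capabilities::default()
        }
    }

    fn chat(&self, history: &[ChatMessage]) -> Result<String, String> {
        history
            .last()
            .map(|m| format!("echo: {}", m.content))
            .ok_or_else(|| "empty history".to_string())
    }
}
```

Application code written against the trait can call complete or chat without knowing which backend is behind it, and can branch on capabilities() before sending images or tool definitions.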

Supported Cloud Providers

Light-Fabric ships integrations for the major hosted LLM providers listed below. Because the Provider trait is model-agnostic, new model releases from a supported provider can typically be adopted without changes to application code.

  • OpenAI: Native support for the GPT-5 series (5.4, mini, nano), the o4 reasoning models, and full legacy support for GPT-4o and GPT-4 Turbo.
  • Anthropic: Support for the Claude 4 generation, including Opus 4.7, Sonnet, and Haiku.
  • Google Gemini: Support for Gemini 3.1 Pro and Flash, leveraging Vertex AI or AI Studio for multi-modal and long-context tasks.
  • Azure OpenAI: Enterprise-grade OpenAI deployments with support for the latest model deployments.
  • AWS Bedrock: Access to the latest Claude and Titan models hosted on Amazon Web Services.
  • OpenRouter: Access to hundreds of open-source and proprietary models via a single unified API.
  • Telnyx: Support for models hosted on the Telnyx platform.
  • GLM (Zhipu AI): Support for the ChatGLM/GLM-5 series of models.

Local & Specialized Providers

  • Ollama: Seamless integration with local models running on your machine.
  • OpenAI-Compatible: A generic CompatibleProvider for any service implementing the OpenAI REST API.
  • GitHub Copilot: Integration with GitHub Copilot Chat for developer-centric workflows.

Meta-Providers (Orchestration)

These providers wrap other providers to add resilient or intelligent behavior:

  • ReliableProvider: Enhances any base provider with retries, exponential backoff, and automatic failover to fallback models.
  • RouterProvider: Dynamically routes requests to different models based on hints or input complexity.
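
The wrapping idea behind ReliableProvider can be sketched as follows. This is an illustration under stated assumptions, not the crate's implementation: the one-method Provider trait, the field names, and the AlwaysFail and Fixed test doubles are all hypothetical, and the sketch blocks the thread while backing off where a real implementation would presumably use an async timer.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Minimal stand-in for a model provider (hypothetical, synchronous).
pub trait Provider {
    fn complete(&self, prompt: &str) -> Result<String, String>;
}

/// Wraps a list of providers with retries, exponential backoff, and failover.
pub struct ReliableProvider {
    pub providers: Vec<Box<dyn Provider>>, // primary first, then fallbacks
    pub max_attempts: u32,                 // attempts per provider
    pub base_delay: Duration,              // wait before the first retry
}

impl Provider for ReliableProvider {
    fn complete(&self, prompt: &str) -> Result<String, String> {
        let mut last_err = String::from("no providers configured");
        for provider in &self.providers {
            let mut delay = self.base_delay;
            for attempt in 1..=self.max_attempts {
                match provider.complete(prompt) {
                    Ok(text) => return Ok(text),
                    Err(e) => last_err = e,
                }
                // Back off before retrying the same provider. Skip the wait
                // after the final attempt so failover happens immediately.
                if attempt < self.max_attempts {
                    sleep(delay);
                    delay *= 2;
                }
            }
        }
        Err(last_err)
    }
}

/// Test double that always fails.
pub struct AlwaysFail;

impl Provider for AlwaysFail {
    fn complete(&self, _prompt: &str) -> Result<String, String> {
        Err("unavailable".into())
    }
}

/// Test double that always returns a fixed reply.
pub struct Fixed(pub &'static str);

impl Provider for Fixed {
    fn complete(&self, _prompt: &str) -> Result<String, String> {
        Ok(self.0.to_string())
    }
}
```

Because the wrapper implements the same trait as the providers it wraps, callers do not need to know whether they are talking to a single backend or a chain with fallbacks.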

CLI & Tooling Integrations

Light-Fabric includes specialized integrations for developer tools and terminal environments:

  • Claude Code CLI: Integration with Anthropic's Claude Code environment.
  • Gemini CLI: Terminal-based access to Google's Gemini models.
  • KiloCLI: Light-Fabric's native CLI integration for rapid testing and automation.

Key Capabilities

Providers can be queried for their support of advanced features:

  • Native Tool Calling: Efficiently generate structured function calls.
  • Vision: Process images alongside text prompts.
  • Prompt Caching: Leverage provider-side caching to reduce latency and costs for long contexts.

Agentic Workflow Design

Hybrid Agentic Workflow Specification

Agentic Workflow in Light-Fabric implements a hybrid orchestration model for enterprise business processes. The workflow is deterministic, auditable, and stateful, while selected steps can be executed by agents, API calls, rule engine checks, or humans.

The design goal is not to replace enterprise process control with an open-ended agent loop. The goal is to let agents work inside a managed process that has clear state, clear ownership, repeatable execution, and human approval where needed.

Enterprise Challenge

In regulated or operationally sensitive environments, a purely autonomous AI agent is not enough for long-running business work.

  • Compliance requires deterministic process paths, approval records, and audit history.
  • Reliability requires long-running state to survive process restarts, UI disconnects, and agent failures.
  • Safety requires human-in-the-loop checkpoints for decisions with business, security, or financial impact.
  • Coordination requires multiple humans and roles to participate in the same process.
  • Testing requires the same workflow to run interactively with humans or headlessly with example data.

Light-Fabric solves this by separating orchestration from execution.

Hybrid Model

The workflow is the deterministic process manager. It defines the ordered steps, conditions, retries, error handling, human checkpoints, and outputs.

Agents are workers inside that process. They can reason, call tools, ask for missing data, and use skills, but they do not own the overall process state.

| Feature     | Traditional Workflow  | Pure Agent Loop | Light-Fabric Hybrid                     |
|-------------|-----------------------|-----------------|-----------------------------------------|
| Path        | Fixed                 | Dynamic         | Fixed path with flexible task execution |
| State       | Durable               | Often transient | Durable workflow and task state         |
| Human input | Forms and approvals   | Ad hoc chat     | First-class waiting tasks               |
| Audit       | Strong                | Weak            | Step-level audit and agent trace        |
| API calls   | Built into code       | Tool calls      | Spec-described endpoint invocations     |
| Testing     | Separate test harness | Prompt replay   | Same workflow can run live tests        |

Core Separation

There are two related specifications:

  1. Agentic Workflow Specification: describes orchestration, including task order, branching, human input, assertions, API calls, retries, errors, exports, and state transitions.

  2. LightAPI Description Specification: describes API capabilities at the endpoint level, including how an endpoint is invoked, what inputs it accepts, what result shape it returns, examples, behavior notes, and result expectations.

This separation is important. The workflow should not duplicate every endpoint contract. It should reference endpoint descriptions and use them to invoke calls, guide agents, and verify results.

Endpoint-Level Consumption

Light-Portal manages API descriptions at the endpoint level, not only at the whole API level.

This is necessary because real workflows often combine one endpoint from one API with one endpoint from another API. For example, onboarding an API to an AI gateway may involve:

  1. register an API
  2. create an API version from a specification
  3. create a development API instance
  4. configure the API through config server
  5. link the API instance to a gateway instance
  6. select endpoints to expose as MCP tools
  7. create a gateway config snapshot
  8. reload the gateway through controller
  9. run MCP tests against the gateway

Each step may come from a different API surface. The workflow consumes only the endpoints it needs.

The recommended model is:

  • API-level descriptions can be authored for convenience and consistency.
  • Endpoint-level descriptions are published and consumed by agents and workflows.
  • Endpoint descriptions inherit shared context such as authentication, environments, sources, and secrets from an API catalog.
  • Agents progressively load endpoint information by disclosure level instead of receiving the entire catalog up front.

Progressive Disclosure

Endpoint descriptions should be disclosed to agents in layers:

  • index: operation id, title, tags, visibility
  • summary: purpose, capability group, lifecycle
  • invocation: input shape, request mapping, auth, examples
  • behavior: result cases, errors, edge cases, assertions
  • full: complete description for debugging or generation

This allows the agent to discover capabilities cheaply, load invocation details only for selected endpoints, and load behavior details only when verification or failure analysis needs it.
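
One way to model the layered disclosure described above is an ordered enum plus a minimum level per field group. The sketch below is only an illustration: the camelCase field keys and the rawDescription entry are assumed names derived from the bullet list, not identifiers from the specification.

```rust
/// Disclosure levels in increasing order of detail (names follow the list above).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Disclosure {
    Index,
    Summary,
    Invocation,
    Behavior,
    Full,
}

/// Field groups of a hypothetical endpoint description, each tagged with the
/// minimum level at which it becomes visible to the agent.
const FIELDS: &[(&str, Disclosure)] = &[
    ("operationId", Disclosure::Index),
    ("title", Disclosure::Index),
    ("tags", Disclosure::Index),
    ("visibility", Disclosure::Index),
    ("purpose", Disclosure::Summary),
    ("capabilityGroup", Disclosure::Summary),
    ("lifecycle", Disclosure::Summary),
    ("inputShape", Disclosure::Invocation),
    ("requestMapping", Disclosure::Invocation),
    ("auth", Disclosure::Invocation),
    ("examples", Disclosure::Invocation),
    ("resultCases", Disclosure::Behavior),
    ("errors", Disclosure::Behavior),
    ("edgeCases", Disclosure::Behavior),
    ("assertions", Disclosure::Behavior),
    ("rawDescription", Disclosure::Full),
];

/// Returns the fields an agent may see at the requested level.
/// Each level includes everything from the levels below it.
pub fn visible_fields(level: Disclosure) -> Vec<&'static str> {
    FIELDS
        .iter()
        .filter(|(_, min)| *min <= level)
        .map(|(name, _)| *name)
        .collect()
}
```

With this layout, a catalog listing would request Disclosure::Index for every endpoint and escalate to Invocation or Behavior only for the endpoints the agent actually selects.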

Workflow Task Types

The updated workflow specification adds first-class support for the task types needed by agentic API workflows.

Ask Task

ask pauses the workflow and waits for human input. It supports prompts, choices, validation, defaults, timeouts, and sensitive input.

The task returns the user's answer as task output. The normal export block should move the answer into workflow context.

Example:

- ask-authz:
    ask:
      prompt: Do you want to configure endpoint authorization?
      mode: choice
      options:
        - label: Configure authorization
          value: configure
        - label: Skip
          value: skip
    export:
      as:
        authzChoice: ${ .result }

Assert Task

assert validates workflow state or API results. It is used for both live tests and interactive workflows.

It supports simple comparisons, JSONPath-style checks, length checks, regex checks, and rule-engine-backed assertions for complex business logic.

Assertion failures should produce structured, catchable errors so workflows can route failures to remediation, task creation, or agent investigation. Complex business assertions can delegate to Light-Rule.
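
The idea of a structured, catchable assertion failure can be sketched as below. The Check variants, the AssertionError fields, and the error codes are hypothetical names chosen for this example; they cover only two of the check kinds mentioned above and do not reflect the actual workflow-core types.

```rust
/// A small subset of the checks mentioned above; variant names are assumptions.
pub enum Check<'a> {
    Equals { actual: &'a str, expected: &'a str },
    LengthAtLeast { actual_len: usize, min: usize },
}

/// Structured failure that a workflow could catch and route to remediation.
#[derive(Debug, PartialEq)]
pub struct AssertionError {
    pub task: String,       // name of the assert task that failed
    pub code: &'static str, // machine-readable code for error handlers
    pub detail: String,     // human-readable explanation
}

/// Evaluates one check and returns a structured error when it does not hold.
pub fn evaluate(task: &str, check: &Check) -> Result<(), AssertionError> {
    let (code, detail) = match check {
        Check::Equals { actual, expected } if actual != expected => (
            "assert.equals",
            format!("expected {expected:?}, got {actual:?}"),
        ),
        Check::LengthAtLeast { actual_len, min } if actual_len < min => (
            "assert.length",
            format!("expected length >= {min}, got {actual_len}"),
        ),
        // Every remaining case means the check passed.
        _ => return Ok(()),
    };
    Err(AssertionError {
        task: task.to_string(),
        code,
        detail,
    })
}
```

Returning a value with a stable code, rather than panicking or logging, is what allows a workflow error handler to branch on the kind of failure.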

API Call Tasks

The workflow supports direct and description-backed API calls:

  • HTTP / OpenAPI
  • JSON-RPC
  • OpenRPC
  • gRPC
  • MCP tool/resource/prompt calls

For direct internal calls, jsonrpc can be used with an endpoint, method, params, id, notification flag, and error policy.

For cataloged JSON-RPC, openrpc references an OpenRPC document and method.

For MCP, the workflow references a tool, resource, or prompt and passes arguments. MCP capability descriptions belong in the API description layer; the workflow only selects and invokes them.
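
For the direct jsonrpc case, the request that goes on the wire is a JSON-RPC 2.0 envelope. The helper below is a minimal sketch of building one with the standard library only. The function name and signature are assumptions for illustration, and production code would normally use a JSON library rather than string formatting.

```rust
/// Builds a JSON-RPC 2.0 request body.
///
/// `params_json` must already be valid JSON (an object or an array).
/// Passing `None` for `id` produces a notification: the `id` member is
/// omitted, so the server sends no response.
pub fn jsonrpc_request(method: &str, params_json: &str, id: Option<u64>) -> String {
    // Escape only backslashes and quotes in the method name; this sketch
    // does not handle control characters.
    let method = method.replace('\\', "\\\\").replace('"', "\\\"");
    match id {
        Some(id) => format!(
            r#"{{"jsonrpc":"2.0","method":"{method}","params":{params_json},"id":{id}}}"#
        ),
        None => format!(
            r#"{{"jsonrpc":"2.0","method":"{method}","params":{params_json}}}"#
        ),
    }
}
```

The notification flag mentioned above corresponds to the None branch: a caller that sets it should not wait for a result or an error.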

Explanation Metadata

Tasks can include explain metadata to help an agent or UI explain what is happening.

Useful fields include:

  • purpose
  • visible
  • before
  • success
  • failure
  • requires

Example:

explain:
  purpose: Link the API instance to the development gateway.
  visible: true
  requires:
    - portal-command-token authentication
    - apiInstanceId from prior step

Human Task State

Human-in-the-loop behavior must be represented as durable workflow state.

Recommended task states:

A = active
W = waiting for input
C = completed
F = failed
X = canceled

When an ask or approval task reaches W, the process remains active but the task is no longer picked up by the executor. A user, CLI, scheduler, or agent must complete the task through the workflow API.
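
The state codes and the executor rule above can be captured in a small enum. This is a sketch only: the type and method names are hypothetical, and the transition rule (any non-terminal state may move to any other state) is an assumption made for this example rather than a rule taken from the specification.

```rust
/// Task states using the single-letter codes listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskState {
    Active,
    Waiting,
    Completed,
    Failed,
    Canceled,
}

impl TaskState {
    /// Single-letter code as it would be persisted.
    pub fn code(self) -> char {
        match self {
            TaskState::Active => 'A',
            TaskState::Waiting => 'W',
            TaskState::Completed => 'C',
            TaskState::Failed => 'F',
            TaskState::Canceled => 'X',
        }
    }

    /// Parses a persisted code; unknown codes yield `None`.
    pub fn from_code(code: char) -> Option<Self> {
        match code {
            'A' => Some(TaskState::Active),
            'W' => Some(TaskState::Waiting),
            'C' => Some(TaskState::Completed),
            'F' => Some(TaskState::Failed),
            'X' => Some(TaskState::Canceled),
            _ => None,
        }
    }

    /// Completed, failed, and canceled tasks never change state again.
    pub fn is_terminal(self) -> bool {
        matches!(
            self,
            TaskState::Completed | TaskState::Failed | TaskState::Canceled
        )
    }

    /// Only active tasks are picked up by the executor; waiting tasks are not.
    pub fn executable(self) -> bool {
        self == TaskState::Active
    }

    /// Assumed rule: a non-terminal state may move to any other state.
    pub fn can_transition(self, next: TaskState) -> bool {
        !self.is_terminal() && self != next
    }
}
```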

Waiting tasks should carry:

  • prompt
  • input mode
  • options
  • validation rules
  • default value
  • sensitive flag
  • assignment metadata
  • explanation metadata
  • timeout policy

Assignment And Worklist

Enterprise workflows need more than chat. Some tasks must be assigned to roles or users and coordinated across multiple humans.

Human tasks should support:

  • assigned user
  • assigned role
  • candidate roles
  • claimed by
  • claimed timestamp
  • due timestamp
  • priority
  • comments
  • audit trail

A role-based task appears in the worklist for users with a matching role. Once claimed, it belongs to the claiming user until completed, released, delegated, or timed out.
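
The claim rule in the previous paragraph can be sketched as follows. The struct and method names are hypothetical, only three of the listed fields are modeled, and delegation and timeouts are left out.

```rust
/// Subset of the human task fields listed above (names are assumptions).
#[derive(Debug, Default)]
pub struct HumanTask {
    pub assigned_user: Option<String>,
    pub candidate_roles: Vec<String>,
    pub claimed_by: Option<String>,
}

/// Caller identity as the worklist would see it.
pub struct User<'a> {
    pub id: &'a str,
    pub roles: &'a [&'a str],
}

impl HumanTask {
    /// A claimed task is visible only to its owner. An unclaimed task is
    /// visible to its assigned user and to anyone holding a candidate role.
    pub fn visible_to(&self, user: &User) -> bool {
        if let Some(owner) = &self.claimed_by {
            return owner.as_str() == user.id;
        }
        if self.assigned_user.as_deref() == Some(user.id) {
            return true;
        }
        self.candidate_roles
            .iter()
            .any(|role| user.roles.contains(&role.as_str()))
    }

    /// Claims the task for `user` if it is unclaimed and visible to them.
    pub fn claim(&mut self, user: &User) -> Result<(), String> {
        if self.claimed_by.is_some() {
            return Err("task is already claimed".into());
        }
        if !self.visible_to(user) {
            return Err("user is not a candidate for this task".into());
        }
        self.claimed_by = Some(user.id.to_string());
        Ok(())
    }

    /// Returns the task to the shared queue; only the owner may release it.
    pub fn release(&mut self, user: &User) -> Result<(), String> {
        if self.claimed_by.as_deref() != Some(user.id) {
            return Err("only the claiming user can release the task".into());
        }
        self.claimed_by = None;
        Ok(())
    }
}
```

In a real service the claim would need to be an atomic update in the task store so that two users cannot claim the same task concurrently; the in-memory version here ignores that concern.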

Client Architecture

light-workflow should run as a containerized backend service alongside other portal services. It owns workflow execution and state. Portal chat, worklist, CLI, scheduler, and agents are all clients of the same workflow APIs.

The client surfaces are:

  • Portal Chat: conversational guidance for a single user.
  • Worklist: role-based task inbox for approvals, reviews, and coordination.
  • CLI: developer, CI/CD, live test, and automation interface.
  • Scheduler: periodic headless execution, such as hourly live integration tests.
  • Agent: task executor that can call APIs, use skills, and report results back to the workflow.

See Workflow Client Architecture for the dedicated client design.

Workflow Service API

The workflow service should expose one stable API boundary for all clients.

Core operations:

workflow.start
workflow.getInstance
workflow.listInstances
workflow.getEvents
workflow.listTasks
workflow.getTask
workflow.claimTask
workflow.releaseTask
workflow.completeTask
workflow.delegateTask
workflow.cancelInstance

Streaming clients should subscribe to workflow events through Server-Sent Events, WebSocket, or another portal-standard event mechanism.

Important event types:

  • workflow started
  • task started
  • task completed
  • task failed
  • task waiting for input
  • task assigned
  • task claimed
  • task completed by human
  • agent started
  • agent completed
  • workflow completed
  • workflow failed

Live Testing

The same workflow runtime should support interactive runs and headless live tests.

Interactive workflows use ask tasks when decisions or missing values are needed.

Live tests should use example data from LightAPI endpoint descriptions and workflow input fixtures instead of asking the user. Assertions should verify results through assert tasks or rule-engine checks.

This lets the scheduler run workflows every hour against the latest deployed services. When a test fails, the workflow can create a task with the failure detail and assign an agent or human to investigate.

Example: API Onboarding To AI Gateway

An API onboarding workflow can guide a user through a complex multi-endpoint process without requiring a dedicated UI for every operation.

The workflow can:

  1. ask for or infer the API metadata
  2. call the register API endpoint
  3. create an API version from an OpenAPI specification
  4. create a development API instance
  5. configure the API
  6. ask whether fine-grained authorization should be configured
  7. route to create or select authorization rules
  8. link the API instance to the development AI gateway
  9. select endpoints to expose as MCP tools
  10. create a gateway config snapshot
  11. reload the gateway through controller
  12. run MCP tests through the gateway
  13. assert expected results
  14. report success or create remediation tasks

The same workflow can run interactively through portal chat, be managed through the worklist, or run headlessly with examples as a live test.

Technical Implementation

The Light-Fabric implementation is split across:

  • workflow-core: Rust models for the workflow specification.
  • workflow-builder: fluent builders for programmatic workflow construction.
  • light-workflow: runtime service and executor.
  • light-agent: agent execution surface for delegated agent tasks.
  • light-rule: rule engine used by workflow and assertion tasks. See Light-Rule Design.

Runtime responsibilities include:

  • deserializing workflow definitions
  • claiming active tasks
  • executing supported task types
  • storing task output
  • applying exports into process context
  • creating next tasks
  • pausing waiting tasks
  • resuming after human completion
  • failing or completing process instances
  • exposing workflow APIs to clients

The current executable slice supports API invocation and verification tasks such as HTTP, JSON-RPC, OpenRPC, MCP over enterprise HTTP transports, rules, assertions, and waiting human input. MCP stdio transport is intentionally not a priority for enterprise deployment.

Design Rule

There must be one workflow runtime and one task state model.

Chat, worklist, CLI, scheduler, and agents should never implement their own workflow execution. They should all use the same light-workflow service APIs.

This keeps enterprise workflow behavior auditable, testable, and consistent regardless of how a process is started, resumed, or observed.

Workflow Client Architecture

Light-Fabric workflow execution should run as a containerized backend service, not as logic embedded in a portal screen, CLI, scheduler, or agent. The workflow service owns process state, task state, audit records, API invocation, agent invocation, and human-in-the-loop transitions. Clients are thin interaction surfaces over the same service APIs.

This separation lets the same workflow instance be driven by a portal chat session, a worklist user, a CLI command, a scheduler, or an AI agent without creating multiple execution models.

Goals

  • Provide one authoritative workflow runtime for long-running enterprise processes.
  • Support human-in-the-loop tasks from both conversational and worklist interfaces.
  • Support headless execution for live tests, scheduled runs, and CI/CD.
  • Keep all clients stateless or lightly stateful; workflow state lives in light-workflow.
  • Make role assignment, audit, and retry behavior consistent across UI, CLI, scheduler, and agent use.

Runtime Service

light-workflow should be deployed as a portal service in a container alongside the other portal services. It should expose APIs for workflow definitions, workflow instances, task claiming, task completion, event streaming, and operational control.

The service is responsible for:

  • loading workflow definitions
  • starting workflow instances
  • persisting process_info_t and task_info_t
  • executing API calls and assertions
  • invoking agents for agent-owned tasks
  • pausing on ask and approval tasks
  • assigning human tasks to users or roles
  • resuming workflows when a human answer is submitted
  • emitting workflow and task events
  • recording audit history

Clients should never execute workflow steps themselves. They should only start workflows, inspect workflow state, and complete assigned tasks.

Client Surfaces

Portal Chat

The portal chat client is the guided conversational interface for a single user working through a process. It is useful when the workflow needs to ask clarifying questions, explain the next action, or guide a user through a complex multi-endpoint operation.

Typical uses:

  • API onboarding
  • API endpoint publication to an AI gateway
  • guided configuration
  • troubleshooting and remediation workflows
  • interactive approval with explanation

The chat client should call the workflow service for current state and submit answers to waiting tasks. It may stream workflow events and render agent explanations, but it should not own workflow state.

Worklist

The worklist is the enterprise task inbox. It is the right interface for multi-user coordination, role-based assignment, approvals, escalations, and audit-sensitive operations.

Typical uses:

  • approval tasks
  • compliance review
  • operations handoff
  • role-based queue processing
  • task claim and release
  • delegated work
  • due-date and priority management

The worklist should be built around waiting human tasks. A task may have:

  • assigned user
  • candidate roles
  • assigned role
  • priority
  • due time
  • claim status
  • comments
  • completion payload
  • audit trail

The worklist is especially important because many enterprise workflows are not purely conversational. They need accountable ownership and coordination between multiple humans.

CLI

The CLI is a developer and automation client. It should use the same workflow service APIs as portal-view and should not contain separate execution logic.

Typical uses:

  • local workflow testing
  • live parity tests
  • CI/CD automation
  • scheduled headless runs
  • debugging stuck workflow instances
  • submitting test data
  • completing simple waiting tasks from scripts

Example commands:

light-workflow start portal.onboard-api --input input.yaml
light-workflow status <instance-id>
light-workflow tasks --role portal-admin
light-workflow claim <task-id>
light-workflow answer <task-id> --value approve
light-workflow logs <instance-id>
light-workflow cancel <instance-id>

The CLI should be added after the workflow APIs stabilize. It will be valuable for developers and automation, but the worklist and portal chat should drive the primary enterprise UX.

API Boundary

The workflow service should expose a stable API boundary that all clients use. The API can be HTTP, JSON-RPC, or both, but the concepts should remain the same.

Core operations:

workflow.start
workflow.getInstance
workflow.listInstances
workflow.getEvents
workflow.listTasks
workflow.getTask
workflow.claimTask
workflow.releaseTask
workflow.completeTask
workflow.delegateTask
workflow.cancelInstance

For streaming clients, the service should expose workflow events through Server-Sent Events, WebSocket, or another portal-standard event mechanism.

Important event types:

  • workflow started
  • task started
  • task completed
  • task failed
  • task waiting for input
  • task assigned
  • task claimed
  • task completed by human
  • agent started
  • agent completed
  • workflow completed
  • workflow failed

Human Task State

ask and approval-style tasks should enter a waiting state. While waiting, the workflow instance remains active, but the task is no longer executable by the worker loop until a human answer is submitted.

Recommended states:

A = active
W = waiting for input
C = completed
F = failed
X = canceled

The waiting task should include enough metadata for all clients:

  • prompt
  • input mode
  • options
  • validation rules
  • default value
  • sensitivity flag
  • assignment metadata
  • explanation metadata
  • timeout policy

The completion API should validate submitted input against the task definition before resuming the workflow.
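
A minimal sketch of that validation step is shown below, covering only the choice mode and the default value. The type names, the function name, and the error messages are assumptions made for this example.

```rust
/// Input modes for a waiting task; only two are modeled in this sketch.
pub enum InputMode {
    Text,
    Choice,
}

/// The parts of a task definition needed to validate an answer.
pub struct AskDefinition {
    pub mode: InputMode,
    pub options: Vec<String>,    // allowed values when mode is Choice
    pub default: Option<String>, // used when no answer is submitted
}

/// Returns the value the workflow should resume with, or an error that the
/// completion API can send back to the client without resuming the workflow.
pub fn validate_answer(def: &AskDefinition, submitted: Option<&str>) -> Result<String, String> {
    // Fall back to the default when the client submits nothing.
    let value = match submitted.or(def.default.as_deref()) {
        Some(v) => v,
        None => return Err("an answer is required".to_string()),
    };
    match def.mode {
        InputMode::Choice if !def.options.iter().any(|o| o.as_str() == value) => {
            Err(format!("'{value}' is not one of the allowed options"))
        }
        _ => Ok(value.to_string()),
    }
}
```

Running the same check in the service, rather than in each client, keeps chat, worklist, and CLI submissions subject to identical rules.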

Assignment Model

Human tasks should support both direct assignment and role-based queues.

Recommended fields:

assigned_user
assigned_role
candidate_roles
claimed_by
claimed_ts
due_ts
priority
comments

A role-based task can appear in the worklist for all users with a matching role. Once a user claims it, the task becomes owned by that user until completed, released, delegated, or timed out.

Recommended Implementation Order

  1. Implement stable workflow service APIs for start, status, events, task list, task claim, and task completion.
  2. Harden the ask resume path and waiting task state machine.
  3. Build the worklist because it forces the assignment, audit, and state model to be correct.
  4. Build the portal chat workflow interaction on top of the same task APIs.
  5. Add the CLI after the API shape stabilizes.
  6. Add scheduler integration for hourly live tests and headless workflow runs.

Design Rule

There must be one workflow runtime and one task state model. Chat, worklist, CLI, scheduler, and agents are only clients of that runtime.

This keeps enterprise workflow behavior auditable, testable, and consistent regardless of how a workflow is started or resumed.

LightAPI Description Design

lightapi-description-specification

LightAPI Description is the endpoint capability specification used by Light-Fabric agents, workflows, live tests, and portal API administration.

It describes how an API endpoint is discovered, invoked, explained, and verified. It is intentionally separate from the Agentic Workflow Specification. Workflow describes process orchestration. LightAPI describes endpoint capability.

Why LightAPI

OpenAPI is useful for REST APIs, and OpenRPC is useful for JSON-RPC APIs, but Light-Fabric needs a common description model across multiple enterprise protocols:

  • REST / HTTP
  • OpenAPI-described HTTP
  • JSON-RPC 2.0
  • OpenRPC-described JSON-RPC
  • gRPC
  • MCP tools, resources, and prompts

LightAPI provides a single agent-facing and workflow-facing description layer over these protocols.

The goal is not to replace OpenAPI or OpenRPC. The goal is to reference them where they exist and add the missing information needed by agents and workflow live tests.

API-Level Authoring, Endpoint-Level Consumption

Light-Portal may let teams author descriptions at the API level for convenience. However, workflows and agents consume descriptions at the endpoint level.

This distinction is important because real workflow processes rarely use a whole API. They usually combine selected endpoints from multiple APIs.

For example, onboarding an API to an AI gateway may consume:

  • one endpoint from API registration
  • one endpoint from API version management
  • one endpoint from API instance management
  • one endpoint from config server
  • one endpoint from gateway linking
  • one endpoint from controller reload
  • one or more MCP tools exposed through the gateway

Each consumed operation should have an endpoint-level description with a stable endpointId.

API-level descriptions are still useful as catalogs. Endpoint-level descriptions may inherit shared API context such as:

  • environments
  • authentication
  • secrets
  • sources
  • common tags
  • lifecycle metadata

Relationship To Agentic Workflow

Agentic Workflow and LightAPI have different responsibilities.

| Concern                          | Agentic Workflow           | LightAPI Description                    |
|----------------------------------|----------------------------|-----------------------------------------|
| Process order                    | Yes                        | No                                      |
| Branching and retries            | Yes                        | No                                      |
| Human-in-the-loop                | Yes                        | No                                      |
| Endpoint invocation contract     | Reference only             | Yes                                     |
| Input and result examples        | Optional workflow fixtures | Yes                                     |
| Result verification expectations | Calls assert               | Describes expected result cases         |
| Agent progressive disclosure     | Uses selected endpoints    | Defines disclosure levels               |
| Live testing                     | Orchestrates execution     | Supplies examples and expected results  |

In live tests, the workflow should use example data from LightAPI descriptions and workflow fixtures instead of asking for user input.

In interactive runs, the workflow may ask the user for missing values, then invoke endpoints described by LightAPI.

Relationship To Centralized Agent Skills

LightAPI endpoint descriptions are a source of agent skills.

The centralized skill registry should not require every API operation to be manually rewritten as a separate skill. Instead, Light-Portal can publish selected LightAPI endpoint descriptions into the skill registry as invokable capabilities.

The skill registry adds:

  • permission-aware discovery
  • semantic search
  • skill grouping
  • agent persona scoping
  • audit around skill disclosure and execution

LightAPI provides:

  • endpoint identity
  • protocol details
  • input schema
  • request mapping
  • result shape
  • examples
  • behavior notes
  • result cases

Together, they allow an agent to discover a capability as a skill, progressively load only the endpoint details it needs, and execute through the workflow or controller runtime.

See Centralized Agentic Skill Registry for the skill registry design.

Core Document Concepts

A LightAPI document should support both API-level catalogs and endpoint-level documents.

Important top-level concepts:

  • lightapi: specification version
  • profile: api or endpoint
  • info: name, title, version, namespace, owner, contact
  • context: inherited catalog context for endpoint-level documents
  • sources: OpenAPI, OpenRPC, protobuf, MCP, or raw protocol references
  • environments: environment-specific server details
  • secrets: required secret names
  • authentications: reusable authentication policies
  • operations: endpoint operation descriptions
  • testSequences: linear endpoint test sequences
  • agent: progressive disclosure and skill metadata

For profile: endpoint, the document should describe at most one operation.

Operation Model

Each operation represents one endpoint-level capability.

Common fields include:

  • operationId: local operation identifier
  • endpointId: globally stable endpoint identifier
  • title
  • summary
  • description
  • visibility
  • lifecycle
  • tags
  • capability
  • agent
  • input
  • request
  • result
  • examples

The input section describes the logical interface the agent or workflow sees.

The request section describes how logical input maps to the wire protocol.

The result section describes expected output, result cases, and failure shapes.

Protocol Coverage

HTTP And OpenAPI

For raw HTTP, the operation describes method, endpoint, headers, query, path, and body mappings.

For OpenAPI, LightAPI references the OpenAPI document and operation, then adds agent-oriented behavior, examples, and result expectations.

JSON-RPC And OpenRPC

For direct JSON-RPC, the operation describes endpoint, method, params, id behavior, notification behavior, and error policy.

For OpenRPC, LightAPI references the OpenRPC document and method. The workflow runtime can use the OpenRPC document to validate that the method exists and that required params are present before calling it.

gRPC

For gRPC, the operation describes service, method, protobuf source, transport, metadata, request mapping, and result mapping.

For browser or gateway-mediated enterprise deployments, gRPC over WebSocket can be represented as a transport on the structured protocol operation.

MCP

For MCP, the operation describes tool, resource, or prompt invocation.

Tool listing alone is not enough. The description must also include:

  • input schema
  • result shape
  • examples
  • behavior differences for important input cases
  • error cases
  • verification expectations

MCP stdio is not a priority for enterprise portal deployment. HTTP and streamable HTTP transports should be the main runtime targets.

Result Cases And Verification

LightAPI should describe expected result behavior, but Agentic Workflow should execute the actual assertions.

This keeps verification orchestration in one place.

Recommended model:

  • LightAPI operation result cases describe expected outputs, failure shapes, and examples.
  • Workflow test steps invoke the operation.
  • Workflow assert tasks verify actual output against expected result cases.
  • Complex business checks can call the rule engine.

This allows the same endpoint description to support:

  • agent skill usage
  • workflow execution
  • live integration testing
  • failure diagnosis

Progressive Disclosure For Agents

A LightAPI document should support progressive disclosure so an agent can load only the information needed at each stage.

Recommended levels:

  • index: endpoint id, title, tags, visibility
  • summary: purpose, capability group, lifecycle
  • invocation: input schema, request mapping, authentication, examples
  • behavior: result cases, edge cases, errors, assertions
  • full: complete endpoint description

The portal can expose query APIs such as:

lightapi.listOperations
lightapi.getOperation
lightapi.getCapabilityGroup

Agents should start with index or summary data, load invocation details only for selected endpoints, and load behavior details only for testing, troubleshooting, or failure repair.
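The disclosure levels above can be sketched as an ordered enum plus a lookup of which fields become visible at each level. This is a minimal sketch: the field labels are hypothetical names for the items listed above, and treating each level as cumulative (a higher level includes everything below it) is an assumption, not something the LightAPI specification states.

```rust
/// Disclosure levels, ordered from least to most detail.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Level {
    Index,
    Summary,
    Invocation,
    Behavior,
    Full,
}

/// Minimum level at which each (hypothetically named) field is disclosed.
const FIELDS: [(Level, &str); 15] = [
    (Level::Index, "endpointId"),
    (Level::Index, "title"),
    (Level::Index, "tags"),
    (Level::Index, "visibility"),
    (Level::Summary, "purpose"),
    (Level::Summary, "capabilityGroup"),
    (Level::Summary, "lifecycle"),
    (Level::Invocation, "inputSchema"),
    (Level::Invocation, "requestMapping"),
    (Level::Invocation, "authentication"),
    (Level::Invocation, "examples"),
    (Level::Behavior, "resultCases"),
    (Level::Behavior, "edgeCases"),
    (Level::Behavior, "errors"),
    (Level::Behavior, "assertions"),
];

/// Fields an agent may load at `level`, assuming levels are cumulative.
pub fn visible_fields(level: Level) -> Vec<&'static str> {
    FIELDS
        .iter()
        .filter(|entry| entry.0 <= level)
        .map(|entry| entry.1)
        .collect()
}
```

An agent browsing a catalog would request Level::Index or Level::Summary, and move to Level::Invocation only for the endpoints it actually intends to call.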

Portal Publishing Flow

Light-Portal should manage endpoint descriptions as part of API endpoint administration.

Recommended flow:

  1. API owner creates or imports API metadata.
  2. Portal extracts initial endpoint descriptions from OpenAPI, OpenRPC, protobuf, MCP, or raw endpoint configuration.
  3. API owner enriches endpoint descriptions with examples, behavior notes, result cases, and visibility.
  4. Portal stores endpoint-level LightAPI descriptions.
  5. Authorized agents and workflows query descriptions by endpoint, tag, lifecycle, visibility, or capability.
  6. Selected endpoints can be published into the centralized skill registry.
  7. Workflow instances reference endpoint descriptions during execution and live testing.

Live Test Use

Live tests should be workflow-driven.

LightAPI supplies:

  • example input data
  • expected result cases
  • protocol invocation details
  • error behavior

Agentic Workflow supplies:

  • sequence
  • fixtures
  • environment selection
  • endpoint invocation
  • assertions
  • failure routing
  • task creation
  • agent assignment

This avoids building a second test runner model outside the workflow engine.

Design Rule

LightAPI describes endpoint capability. Agentic Workflow orchestrates endpoint use. Centralized Skills expose selected capabilities to agents.

Keeping these responsibilities separate lets Light-Fabric support API administration, agent skill discovery, workflow execution, and live integration testing without duplicating endpoint definitions across multiple systems.

Light-Rule Design

rule-specification

Light-Rule is the local YAML rule engine used by Light-Fabric services and workflows for deterministic business checks, transformations, authorization decisions, and workflow assertions.

It complements agentic workflow by keeping critical decisions explicit, repeatable, and auditable. Agents can propose or select rules, but the rule engine executes the deterministic logic.

Purpose

Light-Rule is designed for enterprise services that need fast local policy and transformation logic without a database call on every request.

Primary uses:

  • fine-grained authorization
  • request transformation
  • response transformation
  • workflow assertions
  • business validation
  • permission and filter injection
  • reusable rule templates selected from Light-Portal

The rule configuration is loaded locally by the target service. When permissions or rule mappings change, the controller can trigger a config reload so the service swaps to the latest rules.

Relationship To Agentic Workflow

Agentic Workflow orchestrates process steps. Light-Rule evaluates deterministic logic inside those steps.

Workflow uses Light-Rule in two main ways:

  1. Rule call task: a workflow task can call a named rule to validate or mutate workflow context.

  2. Assert task extension: simple checks can be handled directly by assert, while complex business checks can delegate to Light-Rule.

This separation keeps workflows readable. The workflow says when a check happens; Light-Rule defines the reusable business logic for the check.

Example workflow responsibilities:

  • decide when authorization configuration is needed
  • select or create a rule
  • invoke a rule during live testing
  • route failures to a human or agent

Example Light-Rule responsibilities:

  • evaluate role, group, position, or attribute checks
  • inject endpoint permissions into the context
  • compute row or column filters
  • execute transformation plugins
  • return pass/fail for business assertions

See Agentic Workflow Design for the workflow orchestration model.

Relationship To LightAPI

LightAPI endpoint descriptions describe endpoint invocation and expected result behavior. Light-Rule can implement complex result checks that are too business-specific for simple schema assertions.

Recommended model:

  • LightAPI describes endpoint result cases and expected behavior.
  • Agentic Workflow invokes the endpoint and runs assert tasks.
  • assert handles simple checks directly.
  • Light-Rule handles complex checks, authorization logic, row filters, column filters, and reusable business policies.

See LightAPI Description Design for endpoint capability descriptions.

Rule Specification

Rules are described by the rule specification in rule-specification/schema/rule.yaml.

The top-level configuration contains:

  • ruleBodies: named rule definitions
  • endpointRules: endpoint-to-rule mappings

Each rule can contain:

  • ruleId
  • ruleDesc
  • version
  • author
  • updatedAt
  • conditions
  • actions

Each endpoint mapping can contain:

  • req-tra: request transformation rules
  • res-tra: response transformation rules
  • access-control: access control rules
  • permission: permission values injected into context
  • x-*: extension rule phases

Rule Conditions

Conditions evaluate fields in the input context.

Supported operand forms:

  • direct field: role
  • dotted path: user.role
  • JSON Pointer: /user/role
  • JSONPath-like path: $.user.roles[0]
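One way to treat the four operand forms uniformly is to normalize each into a list of path segments before walking the context. The sketch below is an illustration only: path_segments is a hypothetical helper, not the function used in crates/light-rule, and it handles just the simple shapes shown above (no wildcards or filters).

```rust
/// Normalize an operand into path segments.
/// Hypothetical helper covering only the four forms listed above.
pub fn path_segments(operand: &str) -> Vec<String> {
    // JSON Pointer form: "/user/role". Unescape "~1" before "~0" (RFC 6901).
    if let Some(pointer) = operand.strip_prefix('/') {
        return pointer
            .split('/')
            .map(|s| s.replace("~1", "/").replace("~0", "~"))
            .collect();
    }
    // JSONPath-like form: drop the "$." root marker, then treat as dotted.
    let dotted = operand.strip_prefix("$.").unwrap_or(operand);
    let mut segments = Vec::new();
    for part in dotted.split('.') {
        match part.find('[') {
            // "roles[0]" becomes "roles" followed by "0".
            Some(open) => {
                if open > 0 {
                    segments.push(part[..open].to_string());
                }
                let mut rest = &part[open..];
                while let Some(inner) = rest.strip_prefix('[') {
                    match inner.find(']') {
                        Some(close) => {
                            segments.push(inner[..close].to_string());
                            rest = &inner[close + 1..];
                        }
                        None => break,
                    }
                }
            }
            // Direct field or plain dotted segment.
            None => segments.push(part.to_string()),
        }
    }
    segments
}
```

With this normalization, user.role and /user/role resolve to the same two segments, so the evaluator needs only one lookup routine.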

Supported operators:

==
!=
>
<
>=
<=
eq
ne
contains
matches
startsWith
endsWith
exists
notExists

expected is typed and may be a string, number, boolean, array, object, or null.

Flat condition arrays are evaluated left-to-right. joinCode combines the current condition with the previous result.

A AND B OR C

is evaluated as:

(A AND B) OR C

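Because there is no operator precedence, A OR B AND C is evaluated as (A OR B) AND C, not as A OR (B AND C). If explicit grouping is required, split logic into multiple rules and combine them through endpoint mapping or workflow orchestration.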
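The left-to-right combination can be sketched as a fold over already-evaluated conditions. The type and function names below are hypothetical and exist only for illustration. Treating an empty condition list as passing is an assumption, chosen to match rules without conditions still running their actions.

```rust
/// How a condition combines with the result accumulated so far.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum JoinCode {
    And,
    Or,
}

/// One evaluated condition: its outcome plus its join code.
/// The join code of the first condition is ignored.
#[derive(Clone, Copy, Debug)]
pub struct Evaluated {
    pub join: JoinCode,
    pub outcome: bool,
}

/// Combine outcomes strictly left-to-right, with no operator precedence.
pub fn combine(conditions: &[Evaluated]) -> bool {
    let mut iter = conditions.iter();
    // Assumption: an empty condition list passes.
    let Some(first) = iter.next() else { return true };
    iter.fold(first.outcome, |acc, c| match c.join {
        JoinCode::And => acc && c.outcome,
        JoinCode::Or => acc || c.outcome,
    })
}
```

For A = true, B = true, C = false joined as A OR B AND C, this returns false, whereas conventional AND-before-OR precedence would return true.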

Rule Actions

Actions execute plugin logic after conditions pass.

An action contains:

  • actionId
  • actionClassName
  • actionValues

actionClassName identifies the registered plugin. actionValues carries plugin-specific configuration.

Typical action plugins:

  • add values to request context
  • inject permission attributes
  • compute filters
  • transform request body
  • transform response body
  • call a local business function

Actions are intentionally plugin-based so the schema remains stable while implementation logic can evolve.

Endpoint Rule Phases

Endpoint mappings define when rules run.

Request Transformation

req-tra rules run before the service handles the request. They can enrich or transform request context.

Response Transformation

res-tra rules run after the service produces a response. They can filter, redact, or reshape response data.

Access Control

access-control rules validate whether a request is allowed. These rules normally run in parallel because they should not mutate shared state.

Permission Injection

permission values are injected into the evaluation context before rule execution. This lets API owners configure roles, groups, attributes, row filters, or column filters without editing the technical rule body.

Extension Phases

Custom phases must use the x-* prefix. This avoids silent typos in standard phase names while preserving controlled extensibility.
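A validator for endpoint mapping keys might look like the following sketch. classify_phase and PhaseKey are hypothetical names; the standard keys are the ones listed in the rule specification section above.

```rust
/// Standard endpoint mapping keys named in this document.
const STANDARD_KEYS: [&str; 4] = ["req-tra", "res-tra", "access-control", "permission"];

#[derive(Debug, PartialEq)]
pub enum PhaseKey<'a> {
    Standard(&'a str),
    Extension(&'a str),
}

/// Accept standard keys and x-* extensions; reject everything else so a
/// typo such as "req-tar" fails loudly instead of being silently ignored.
pub fn classify_phase(key: &str) -> Result<PhaseKey<'_>, String> {
    if STANDARD_KEYS.contains(&key) {
        Ok(PhaseKey::Standard(key))
    } else if key.len() > 2 && key.starts_with("x-") {
        Ok(PhaseKey::Extension(key))
    } else {
        Err(format!(
            "unknown endpoint rule phase '{key}': custom phases must start with 'x-'"
        ))
    }
}
```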

Execution Model

The Rust implementation lives in crates/light-rule.

Core components:

  • RuleConfig: top-level config model
  • Rule: rule definition
  • RuleCondition: condition model
  • RuleAction: action model
  • RuleEngine: evaluates one rule
  • ActionRegistry: maps action class names to plugins
  • MultiThreadRuleExecutor: executes rule lists and endpoint phase mappings

Sequential phases such as req-tra and res-tra should run with all semantics so transformations happen in order.

Access control can run in parallel because it should be a validation step rather than a mutation step.
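The difference between the two execution styles can be illustrated with standard-library threads. This is a sketch under stated assumptions, not the MultiThreadRuleExecutor implementation: Ctx is a simplified stand-in for the JSON rule context, the function names are hypothetical, and requiring every access check to pass is an assumed combination policy.

```rust
use std::collections::BTreeMap;
use std::thread;

/// Simplified stand-in for the rule context (the real context is JSON).
pub type Ctx = BTreeMap<String, String>;
/// A transformation may mutate the context, so steps run one after another.
pub type Transform = fn(&mut Ctx);
/// An access check only reads the context, so checks can run concurrently.
pub type Check = fn(&Ctx) -> bool;

/// Run transformations in order; each step sees the previous step's output.
pub fn run_sequential(ctx: &mut Ctx, steps: &[Transform]) {
    for step in steps.iter().copied() {
        step(&mut *ctx);
    }
}

/// Run read-only checks on scoped threads and require all of them to pass.
/// A panicking check is treated as a failed check.
pub fn run_access_checks(ctx: &Ctx, checks: &[Check]) -> bool {
    thread::scope(|scope| {
        let handles: Vec<_> = checks
            .iter()
            .copied()
            .map(|check| scope.spawn(move || check(ctx)))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().unwrap_or(false))
            .fold(true, |all_passed, passed| all_passed && passed)
    })
}
```

Sharing the context as an immutable reference is what makes the parallel path safe: the compiler rejects any check that tries to mutate it.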

Why Not Replace With Cedar Or Casbin

Cedar and Casbin are strong policy engines, but Light-Rule has a different role in this platform.

Light-Rule supports:

  • local YAML configuration
  • request and response transformation
  • permission injection
  • row and column filters
  • endpoint-specific rule selection
  • technical-team-authored reusable rules
  • API-owner-selected rule parameters
  • config reload through controller

Cedar is excellent for authorization policy, but it does not naturally cover transformation, row filter, and column filter use cases. Casbin is strong for policy enforcement, but it introduces a different policy storage and matching model.

Light-Rule should remain the native rule engine for Light-Fabric service configuration and workflow assertions. External policy engines can still be integrated as action plugins if needed.

Governance

Rule bodies should be authored and reviewed like code or controlled configuration.

Recommended governance metadata:

  • version
  • author
  • updatedAt
  • ruleDesc

Recommended operational controls:

  • validate rule YAML against the schema before publishing
  • reject endpoint phase typos
  • keep ruleId equal to the ruleBodies map key
  • audit rule publication and reload events
  • test rules with representative input contexts
  • use workflow live tests to verify rules in integrated environments

Workflow Live Testing

Light-Rule is useful in live tests because it can express business checks that are more specific than generic JSON assertions.

Example flow:

  1. Workflow invokes an endpoint using LightAPI description.
  2. Workflow captures the endpoint response.
  3. assert verifies simple fields.
  4. A rule task validates business-specific behavior.
  5. On failure, workflow creates a task for a human or agent to investigate.

This keeps live test orchestration in workflow while preserving reusable business rules in Light-Rule.

Design Rule

Use workflow for process control. Use LightAPI for endpoint capability. Use Light-Rule for deterministic business logic.

Agents may select, explain, or help author rules, but the rule engine should execute the final deterministic decision.

CEL Rule Conditions

Light-Rule should support both the existing native condition schema and CEL expressions. The two forms solve different problems and should share the same rule lifecycle, endpoint mapping, action execution, config loading, testing, and governance model.

The native condition schema remains the default because it is easy to render in Light-Portal, simple to validate, and suitable for most API-owner use cases. CEL is an advanced condition form for customers that need richer boolean logic, grouping, list predicates, or compatibility with existing CEL-based policy assets.

Each rule should choose one condition language: native or cel. Mixing native condition rows and CEL expressions inside the same rule is not recommended as the canonical model because it makes portal authoring, validation, and runtime dispatch harder to reason about.

Goals

  • keep existing rule YAML and portal-authored rules compatible
  • support CEL expressions as a rule-level condition language
  • evaluate native and CEL rules in the same RuleEngine
  • reuse the existing rule context for gateway, workflow, and test execution
  • preserve existing actions, endpointRules, and rule phase semantics
  • let Light-Portal choose the correct editor from rule metadata without parsing arbitrary rule bodies
  • validate CEL before publishing or reloading rules where possible
  • keep CEL execution deterministic and side-effect free

Non-Goals

  • replacing the native Light-Rule condition schema
  • replacing actions with CEL
  • allowing CEL expressions to perform I/O, network calls, mutation, or service lookups
  • making every native operator available as a custom CEL function on day one
  • requiring business users to write CEL for common rules
  • supporting mixed native and CEL condition blocks in the canonical portal authoring flow

Current Model

Today a rule contains an optional flat list of native conditions:

ruleBodies:
  allowMcpReader:
    common: Y
    ruleId: allowMcpReader
    ruleName: Allow MCP reader
    ruleType: req-acc
    conditions:
      - operatorCode: isNotNull
        propertyPath: auditInfo.subject_claims.ClaimsMap.role
    actions:
      - actionClassName: com.networknt.rule.RoleBasedAccessControlAction

Each native condition contains:

  • operator (also accepted as operatorCode, as in the example above)
  • operand (also accepted as propertyPath)
  • expected
  • joinCode

The engine evaluates conditions left-to-right. joinCode combines each condition with the accumulated result. If the final condition result is true, actions run as they do today.

Portal persistence stores rule metadata in rule_t and the executable rule JSON in rule_t.rule_body. Today there is no dedicated column that tells the portal which condition editor to render, so the UI would have to inspect rule_body.

Proposed Rule Shape

Add a rule-level condition language flag. Use native for existing condition rows and cel for a single CEL expression.

Persist the flag in both places:

  • rule_t.condition_language: indexed/listable portal metadata
  • ruleBody.conditionLanguage: self-contained exported runtime configuration

Recommended values:

native
cel

Existing rules without the field are interpreted as native.

Native rule body:

ruleBodies:
  allowMcpReader:
    common: Y
    ruleId: allowMcpReader
    ruleName: Allow MCP reader
    ruleType: req-acc
    conditionLanguage: native
    conditions:
      - operatorCode: isNotNull
        propertyPath: auditInfo.subject_claims.ClaimsMap.role
    actions:
      - actionClassName: com.networknt.rule.RoleBasedAccessControlAction

CEL rule body:

ruleBodies:
  allowApprovedTransfer:
    common: Y
    ruleId: allowApprovedTransfer
    ruleName: Allow approved transfer
    ruleType: req-acc
    conditionLanguage: cel
    conditionSecurityProfile: strict
    expression: >
      auditInfo.subject_claims.ClaimsMap.role != null
      && roles.exists(r, r == auditInfo.subject_claims.ClaimsMap.role)
    actions:
      - actionClassName: com.networknt.rule.RoleBasedAccessControlAction

Recommended database shape:

ALTER TABLE rule_t
ADD COLUMN condition_language VARCHAR(16) DEFAULT 'native' NOT NULL;

ALTER TABLE rule_t
ADD COLUMN condition_security_profile VARCHAR(32);

ALTER TABLE rule_t
ADD CONSTRAINT rule_t_condition_language_check
CHECK (condition_language IN ('native', 'cel'));

ALTER TABLE rule_t
ADD CONSTRAINT rule_t_condition_security_profile_check
CHECK (
  condition_security_profile IS NULL
  OR condition_security_profile IN ('strict', 'standard', 'internal-admin')
);

Recommended schema rules:

  • conditionLanguage is optional and defaults to native
  • conditionLanguage: native allows conditions and rejects expression
  • conditionLanguage: cel requires expression and rejects conditions
  • conditionSecurityProfile is optional and names a runtime-defined profile
  • native conditions continue to require operator or operatorCode
  • native conditions continue to require operand or propertyPath
  • unknown rule and condition fields should continue to be rejected by the schema
  • command handlers should reject requests where the DB metadata and rule body condition language disagree

This can be represented with conditional validation in rule-specification/schema/rule.yaml:

allOf:
  - if:
      properties:
        conditionLanguage:
          const: cel
      required: [conditionLanguage]
    then:
      required: [expression]
      not:
        required: [conditions]
    else:
      not:
        required: [expression]

The Rust model can add optional fields to Rule:

pub struct Rule {
    // Existing fields stay unchanged and are omitted here.
    pub condition_language: Option<String>,
    pub condition_security_profile: Option<String>,
    pub expression: Option<String>,
}

This is less disruptive than changing RuleCondition into an enum and keeps old rule bodies valid.
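With optional fields, the mode-specific shape rules have to be enforced by validation rather than by the type system. A minimal sketch of that check follows. RuleSketch is a hypothetical, trimmed-down stand-in for the real Rule model, and rejecting a blank expression is an added assumption beyond the schema's "required" constraint.

```rust
/// Hypothetical, trimmed-down view of a rule body used only for shape checks.
#[derive(Debug, Default)]
pub struct RuleSketch {
    pub condition_language: Option<String>,
    /// Each string stands in for one native condition row.
    pub conditions: Option<Vec<String>>,
    pub expression: Option<String>,
}

/// Enforce the native/cel shape rules; a missing language means native.
pub fn validate_shape(rule: &RuleSketch) -> Result<(), String> {
    match rule.condition_language.as_deref().unwrap_or("native") {
        "native" => {
            if rule.expression.is_some() {
                return Err("native rules must not set expression".to_string());
            }
            Ok(())
        }
        "cel" => {
            if rule.conditions.is_some() {
                return Err("cel rules must not set conditions".to_string());
            }
            match rule.expression.as_deref() {
                // Assumption: a blank expression is treated as missing.
                Some(text) if !text.trim().is_empty() => Ok(()),
                _ => Err("cel rules require a non-empty expression".to_string()),
            }
        }
        other => Err(format!("unsupported conditionLanguage '{other}'")),
    }
}
```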

Cross-Repository Scope

This change crosses the rule specification, runtime engines, portal services, and portal UI. The implementation should be tracked as a coordinated change rather than a light-fabric-only feature.

Area | Required work
rule-specification | Add conditionLanguage, conditionSecurityProfile, expression, native rule and CEL rule schema branches, and mode/profile-specific validation rules.
portal-db | Add rule_t.condition_language with default native, optional rule_t.condition_security_profile, check constraints, and pending rule-change approval state if workflow task payloads are not sufficient. Keep existing rows valid without rewriting rule_body.
light-portal | Update persistence and projection code so rule create/update/read/export/import paths carry conditionLanguage and conditionSecurityProfile; ensure endpoint rule config generation emits only approved, self-contained rule bodies; integrate stronger-profile requests with worklist and assistant-task approval.
rule-command | Accept conditionLanguage, conditionSecurityProfile, and expression, normalize old/native payloads, validate mode/profile-specific shape, publish strict changes immediately, route stronger profile requests through approval, and write both DB metadata and rule body consistently after approval.
rule-query | Return conditionLanguage, conditionSecurityProfile, and approval status for list/detail APIs, include selected/effective profiles in test-case execution payloads, and surface CEL parse/type/missing-field/profile errors from Java and Rust runners.
portal-view | Render either the native condition builder or a CEL expression editor based on conditionLanguage; show a controlled profile selector for CEL rules; submit strict directly and route standard or internal-admin to worklist approval; do not require the UI to infer mode from ruleBody.
workflow and assistant task | Use the existing human-in-the-loop worklist flow for stronger profile approval, route tasks to admin and rule-admin, and attach an advisory assistant-task risk summary for the approver.
light-fabric | Add conditionLanguage, conditionSecurityProfile, and expression to crates/light-rule, dispatch in RuleEngine, add policy-driven CEL evaluator/caching, and update gateway/workflow tests.
yaml-rule | Add Java runtime parity for conditionLanguage: cel and named profile enforcement if Java services need to execute the same rules; otherwise reject CEL rules explicitly with a clear runtime-capability error.

portal-db is listed even though it is not a rule engine because rule_t lives there. Without the DB column, portal-view would need to parse the compact rule body to choose the editor, which is the coupling this design is trying to avoid.

Operator Alias Alternative

Another possible shape is to add operatorCode: cel and store the CEL expression in expected inside conditions:

conditions:
  - operatorCode: cel
    expected: >
      context.toolArguments.amount < 1000
      || roles.exists(r, r == "approver")

This has one advantage: it can be implemented with a small Rust model change because operator, operand, and expected already exist. It is useful as a compatibility alias or import format.

It should not be the canonical schema because:

  • CEL is a full boolean expression, not a comparison operator
  • overloading expected makes validation and portal rendering less clear
  • operand becomes ignored or artificial
  • the UI still has to draw a condition-row editor even though the rule is really a single expression
  • the rule schema still needs to change because the operator enum must include cel and native operand requirements must be relaxed
  • future expression languages would continue overloading native condition fields

The recommended contract is therefore:

  • canonical form: conditionLanguage: cel plus rule-level expression
  • optional compatibility form: operatorCode: cel plus string expected
  • normalize compatibility imports to the canonical rule-level model before persistence or runtime evaluation

Mixed Conditions Alternative

Another possible shape is to allow native and CEL conditions in the same conditions array. The runtime can support this if needed, but it should not be the default authoring model.

Reasons to avoid canonical mixed rules:

  • Light-Portal would need a hybrid editor that switches row-by-row
  • validation errors become harder to explain to non-technical users
  • joinCode semantics across native and CEL expressions are correct but subtle
  • users may expect CEL operator precedence inside the whole rule even though native joinCode remains left-to-right
  • runtime dispatch is simpler and faster when the rule selects one evaluator

If mixed rules are ever accepted for import or advanced API use, joinCode should still apply left-to-right to the accumulated result regardless of which evaluator handled the current or previous condition.

Execution Model

Rule execution should dispatch by conditionLanguage once per rule:

RuleEngine::execute_rule
  -> conditionLanguage == native
     -> evaluate native conditions
  -> conditionLanguage == cel
     -> evaluate rule expression
  -> execute actions when conditions pass

The outer behavior stays unchanged:

  • rules with no conditions continue to run actions
  • CEL rules without an expression fail validation before runtime
  • failed conditions skip actions
  • failed action execution fails the rule
  • endpoint rule ordering and access-control logic stay unchanged
  • req-tra and res-tra continue to run sequentially
  • access-control rules can still be evaluated independently

Runtime should treat a missing conditionLanguage as native for backward compatibility.
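The dispatch described above can be sketched with the evaluators passed in as closures. This is an illustration only: the names are hypothetical, and the real RuleEngine signatures are not shown in this document.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum ConditionLanguage {
    Native,
    Cel,
}

impl ConditionLanguage {
    /// A missing field is interpreted as native for backward compatibility.
    pub fn from_rule_field(value: Option<&str>) -> Result<Self, String> {
        match value {
            None | Some("native") => Ok(ConditionLanguage::Native),
            Some("cel") => Ok(ConditionLanguage::Cel),
            Some(other) => Err(format!("unsupported conditionLanguage '{other}'")),
        }
    }
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    ActionsRun,
    Skipped,
}

/// Pick one evaluator per rule, then run actions only when conditions pass.
pub fn execute_rule(
    language: ConditionLanguage,
    native_eval: impl FnOnce() -> Result<bool, String>,
    cel_eval: impl FnOnce() -> Result<bool, String>,
    run_actions: impl FnOnce() -> Result<(), String>,
) -> Result<Outcome, String> {
    let passed = match language {
        ConditionLanguage::Native => native_eval()?,
        ConditionLanguage::Cel => cel_eval()?,
    };
    if !passed {
        // Failed conditions skip actions without failing the rule.
        return Ok(Outcome::Skipped);
    }
    // A failed action fails the rule.
    run_actions()?;
    Ok(Outcome::ActionsRun)
}
```

Only the selected evaluator closure is ever called, so a native rule never pays for CEL context construction.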

Rule Context

CEL should evaluate against the same JSON context used by native conditions. For gateway access-control and response filtering, this includes fields such as:

  • auditInfo
  • headers
  • endpoint
  • toolName
  • toolArguments
  • correlationId
  • responseBody
  • statusCode

Endpoint permission values are merged into the root context as their configured keys. For example, permission.roles in endpointRules is available to conditions as roles, response row filters are available as row, and column filters are available as col. A future runtime can also expose a namespaced permission object as an additive convenience, but CEL support should not require that shape to preserve compatibility with existing native rules and actions.

For standard and internal-admin profiles, the CEL environment can expose variables in two ways:

  • top-level context fields as direct CEL variables, such as auditInfo, toolArguments, and roles
  • the full root object as context, so expressions can use explicit paths such as context.toolArguments.amount

Direct variables keep expressions concise and close to the native condition path style. The context variable is safer for generated expressions, collision avoidance, and future fields that are not valid CEL identifiers.

For the strict profile, the runtime should expose only curated root variables such as auditInfo, headers, toolArguments, endpoint metadata, and permission values needed by the rule phase. It should not expose the full context object by default. This prevents future internal runtime metadata from becoming visible to tenant-authored CEL just because it was appended to the root request context.

The context contract should be documented as part of Light-Rule because CEL expressions depend on stable field names. Adding fields is compatible. Renaming or changing field shapes is a breaking change for CEL rules.

Type Mapping

The CEL evaluator should receive deterministic values converted from serde_json::Value:

  • JSON object to CEL map
  • JSON array to CEL list
  • JSON string to CEL string
  • JSON number to CEL integer or double
  • JSON boolean to CEL bool
  • JSON null to CEL null

Missing fields should evaluate according to the chosen CEL implementation's standard behavior. The rule test API should expose these failures clearly so authors can distinguish "expression false" from "expression invalid".

Authors should guard optional fields explicitly. Depending on the selected CEL runtime and the field shape, this can use presence checks such as has(...) or map membership checks such as:

"role" in auditInfo.subject_claims.ClaimsMap
  && auditInfo.subject_claims.ClaimsMap.role == "admin"

The portal rule tester should surface missing-field evaluation errors and suggest guarded expressions instead of letting these failures look like ordinary denied rules.

Context Injection Performance

CEL expressions run on request paths, so context conversion must be controlled. The implementation should not recursively deep-clone and convert large JSON payloads separately for every CEL rule evaluation.

Recommended approach:

  • compile expressions once at rule load
  • build the rule context once per request or response phase
  • reuse converted CEL variables across evaluations in the same request or response phase when possible
  • prefer lazy or reference-backed variable resolution if the selected CEL crate supports it
  • if eager conversion is required, convert only the variables exposed to CEL and avoid parsing large string fields such as responseBody unless an expression explicitly needs structured access to them
  • benchmark access-control and response-filter scenarios before enabling CEL by default in high-throughput paths

The initial implementation can be pragmatic, but performance tests should guard against accidentally making CEL expression evaluation proportional to the full response body size when the expression only needs claims or endpoint metadata.

Validation

CEL should be validated earlier than request execution.

Recommended validation points:

  • portal rule editor
  • rule command create/update handler
  • rule test API
  • runtime config reload

Validation must enforce mode-specific shape:

  • native: conditions is allowed, expression is rejected
  • cel: expression is required, conditions is rejected
  • persisted rule_t.condition_language must match ruleBody.conditionLanguage
  • persisted rule_t.condition_security_profile must match ruleBody.conditionSecurityProfile when either side is present

Runtime reload should reject invalid CEL when strict validation is enabled. If a service must preserve availability, it can keep the last known-good rule set and report the new config as rejected.
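The last known-good behavior can be sketched as a holder that replaces its active rule set only after every expression validates. RuleSet, RuleHolder, and the validate callback are hypothetical names; the callback stands in for whatever CEL validation the runtime performs.

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct RuleSet {
    pub version: u32,
    pub expressions: Vec<String>,
}

pub struct RuleHolder {
    active: RuleSet,
    last_rejection: Option<String>,
}

impl RuleHolder {
    pub fn new(initial: RuleSet) -> Self {
        Self { active: initial, last_rejection: None }
    }

    pub fn active(&self) -> &RuleSet {
        &self.active
    }

    pub fn last_rejection(&self) -> Option<&str> {
        self.last_rejection.as_deref()
    }

    /// Validate the whole candidate first. On any failure keep the current
    /// rule set active and record why the new config was rejected.
    pub fn reload<F>(&mut self, candidate: RuleSet, validate: F) -> bool
    where
        F: Fn(&str) -> Result<(), String>,
    {
        for expr in &candidate.expressions {
            if let Err(reason) = validate(expr.as_str()) {
                self.last_rejection =
                    Some(format!("version {} rejected: {}", candidate.version, reason));
                return false;
            }
        }
        self.active = candidate;
        self.last_rejection = None;
        true
    }
}
```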

Approval workflow should not bypass validation. For profile escalation requests, the command path should validate the submitted rule shape and expression before creating the approval task. Final approval should revalidate the exact submitted rule body before emitting the active rule event.

Validation output should include:

  • rule id
  • condition language
  • parse or type error
  • source offset when provided by the CEL implementation

Compilation And Caching

Do not compile CEL on every request. Compile once per rule load and cache the compiled program with the loaded rule set.

Recommended cache key:

ruleId + expression hash + effective profile

The compiled expression cache should be replaced atomically when the rule config reloads. It should not outlive the rule version it was compiled from. Old compiled entries must be evicted during reload so repeated rule updates cannot leak memory through stale expression hashes.
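One way to get atomic replacement and automatic eviction is to build a fresh map on every reload and swap it in behind a lock. The sketch below uses only the standard library. Compiled is a stand-in for a real compiled CEL program, and ProgramCache and CacheKey are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::{Arc, RwLock};

/// Cache key: ruleId + expression hash + effective profile.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct CacheKey {
    pub rule_id: String,
    pub expression_hash: u64,
    pub profile: String,
}

impl CacheKey {
    pub fn new(rule_id: &str, expression: &str, profile: &str) -> Self {
        // DefaultHasher is adequate for an in-process cache; its output is
        // not guaranteed stable across Rust releases, so do not persist it.
        let mut hasher = DefaultHasher::new();
        expression.hash(&mut hasher);
        Self {
            rule_id: rule_id.to_string(),
            expression_hash: hasher.finish(),
            profile: profile.to_string(),
        }
    }
}

/// Stand-in for a compiled CEL program.
#[derive(Debug)]
pub struct Compiled(pub String);

pub struct ProgramCache {
    current: RwLock<Arc<HashMap<CacheKey, Compiled>>>,
}

impl ProgramCache {
    pub fn new() -> Self {
        Self { current: RwLock::new(Arc::new(HashMap::new())) }
    }

    /// Build a complete map for the new rule set, then swap it in.
    /// Entries from the previous rule set are dropped with the old map.
    pub fn reload(&self, rules: &[(&str, &str, &str)]) {
        let fresh: HashMap<CacheKey, Compiled> = rules
            .iter()
            .map(|&(id, expr, profile)| {
                (CacheKey::new(id, expr, profile), Compiled(expr.to_string()))
            })
            .collect();
        *self.current.write().unwrap() = Arc::new(fresh);
    }

    /// Requests take a cheap snapshot and never observe a half-built map.
    pub fn snapshot(&self) -> Arc<HashMap<CacheKey, Compiled>> {
        Arc::clone(&self.current.read().unwrap())
    }
}
```

Because the old map is freed once the last in-flight request drops its snapshot, stale expression hashes cannot accumulate across repeated reloads.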

Rust CEL Library

cel-interpreter is a practical first candidate for the Rust implementation. It provides Program::compile(...), Program::execute(...), a Context for variables and functions, and compiled Program values that are Send + Sync.

Implementation should still be isolated behind a small internal trait:

CelEvaluator
  -> compile(ruleId, expression) -> compiled expression
  -> evaluate(compiled expression, serde_json::Value context) -> bool

This keeps Light-Rule from leaking third-party crate types through its public model and allows the implementation to change if CEL crate maturity, feature flags, or Java parity requirements change.
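In Rust the trait outline might look like the sketch below. To keep the example dependency-free, the context is an associated type rather than serde_json::Value, and LiteralEvaluator is a toy implementation that understands only the literals true and false. Both are illustrative assumptions, not the planned cel-interpreter integration.

```rust
/// Internal seam between Light-Rule and whichever CEL crate is selected.
pub trait CelEvaluator {
    /// Opaque compiled form; the third-party program type stays hidden here.
    type Compiled;
    /// Evaluation context (serde_json::Value in the design above).
    type Context;

    fn compile(&self, rule_id: &str, expression: &str) -> Result<Self::Compiled, String>;

    fn evaluate(
        &self,
        compiled: &Self::Compiled,
        context: &Self::Context,
    ) -> Result<bool, String>;
}

/// Toy evaluator used only to show the trait shape.
pub struct LiteralEvaluator;

impl CelEvaluator for LiteralEvaluator {
    type Compiled = bool;
    type Context = ();

    fn compile(&self, rule_id: &str, expression: &str) -> Result<bool, String> {
        match expression.trim() {
            "true" => Ok(true),
            "false" => Ok(false),
            other => Err(format!("rule {rule_id}: unsupported expression '{other}'")),
        }
    }

    fn evaluate(&self, compiled: &bool, _context: &()) -> Result<bool, String> {
        Ok(*compiled)
    }
}
```

Returning Result from evaluate, rather than a bare bool, lets callers distinguish an expression that evaluated to false from one that failed to evaluate.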

Native Operator Parity

The native evaluator includes operators that may not map one-to-one to the selected CEL runtime. Examples include:

  • containsIgnoreCase
  • matches and notMatch
  • inList and notInList
  • containsAny, containsAll, and containsNone
  • date-style comparisons such as before, after, and on

Before encouraging migration from native conditions to CEL, the implementation should define a small compatibility function registry for any gaps. Candidate pure helper functions include:

contains_ignore_case(value, substring)
matches(value, pattern)
in_list(value, values)
contains_any(value, values)
contains_all(value, values)

These functions must be deterministic, side-effect free, and shared by the rule tester and runtime evaluator. If Java parity is required, the same function names and edge-case behavior should be implemented in the Java runtime.
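As plain Rust, several of these helpers are only a few lines each. The sketch below makes assumptions the design does not fix: list arguments are slices of strings, case-insensitive matching uses to_lowercase (not full Unicode case folding), and contains_all over an empty values list returns true. matches is omitted because regex support would need an external crate.

```rust
/// Case-insensitive substring test (simple lowercase comparison).
pub fn contains_ignore_case(value: &str, substring: &str) -> bool {
    value.to_lowercase().contains(&substring.to_lowercase())
}

/// True when `value` equals one of `values`.
pub fn in_list(value: &str, values: &[&str]) -> bool {
    values.contains(&value)
}

/// True when `value` holds at least one element of `values`.
pub fn contains_any(value: &[&str], values: &[&str]) -> bool {
    values.iter().any(|candidate| value.contains(candidate))
}

/// True when `value` holds every element of `values`.
/// An empty `values` list is vacuously satisfied.
pub fn contains_all(value: &[&str], values: &[&str]) -> bool {
    values.iter().all(|candidate| value.contains(candidate))
}
```

Edge cases such as the empty-list behavior are the details that need to match between the Rust and Java runtimes.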

Safety

CEL support should be deterministic and sandboxed.

The evaluator does not need an operating-system sandbox for normal trusted/admin-authored rule configuration. CEL is an interpreted expression language, not arbitrary Rust or JavaScript execution, and expressions can only resolve variables and functions registered in the CEL context. The CEL context is therefore the primary sandbox boundary.

For the Rust cel-interpreter integration, context construction should be explicit. Context::default() exposes standard pure CEL functions such as size, contains, string helpers, type conversions, regex matches, and time parsing helpers depending on enabled crate features. If a service accepts tenant-authored or otherwise untrusted CEL, prefer Context::empty() and add only platform-approved helper functions.

Security policy should be engine-owned. A rule may request a named condition security profile, but it must not define its own function allowlist, size limits, resource limits, or isolation mode. If a rule author controls the rule body, then inline security settings are also attacker-controlled.

Recommended policy model:

runtime config defines profiles:
  strict
  standard
  internal-admin

rule optionally requests:
  conditionSecurityProfile: strict

effective policy:
  runtime maximum profile intersected with requested profile

If a rule omits conditionSecurityProfile, the runtime default applies. If a rule requests a profile that the service, tenant, or rule phase does not allow, the rule config should be rejected during validation or runtime reload. The engine may choose a stricter profile than requested, but it must never choose a weaker one because the rule requested it.
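
The resolution rule can be sketched as follows. The enum ordering (strict is least permissive, internal-admin most permissive) and the names `effective_profile` and `ProfileError` are assumptions for illustration. This sketch rejects an explicit request above the runtime maximum and clamps the runtime default to the maximum.

```rust
/// Profiles ordered from least to most permissive.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum CelProfile {
    Strict,
    Standard,
    InternalAdmin,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ProfileError {
    Unknown(String),
    NotAllowed { requested: CelProfile, maximum: CelProfile },
}

pub fn parse_profile(name: &str) -> Result<CelProfile, ProfileError> {
    match name {
        "strict" => Ok(CelProfile::Strict),
        "standard" => Ok(CelProfile::Standard),
        "internal-admin" => Ok(CelProfile::InternalAdmin),
        other => Err(ProfileError::Unknown(other.to_string())),
    }
}

/// Resolve the profile a rule actually runs under.
/// `requested` is the optional conditionSecurityProfile value from the rule body.
pub fn effective_profile(
    requested: Option<&str>,
    runtime_default: CelProfile,
    runtime_maximum: CelProfile,
) -> Result<CelProfile, ProfileError> {
    match requested {
        // No request: use the runtime default, never exceeding the maximum.
        None => Ok(runtime_default.min(runtime_maximum)),
        Some(name) => {
            let requested = parse_profile(name)?;
            if requested > runtime_maximum {
                // The rule asked for more than the runtime allows: reject the config.
                Err(ProfileError::NotAllowed { requested, maximum: runtime_maximum })
            } else {
                Ok(requested)
            }
        }
    }
}
```

The result can only be equal to or stricter than what the runtime permits, so a rule body never widens its own policy.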

Recommended profiles:

  • strict: default for tenant-authored, portal self-service, imported, or marketplace-style CEL. Use an empty CEL context, expose only approved variables, add only pure helper functions, and enforce tight size and expression-shape limits. Do not expose the full context root, and disable regex until both Java and Rust provide matching bounded or linear-time behavior.
  • standard: default for internal business rules. Keep allowlists and resource limits, but permit common pure helpers such as size, contains, startsWith, endsWith, contains_ignore_case, and bounded regex support if needed.
  • internal-admin: limited to trusted operator-maintained rules. This may be closer to the selected CEL runtime's default behavior, but should still compile during rule load, validate references, enforce maximum input size, and protect reloads with the last known-good rule set.

Allowed:

  • boolean logic
  • comparisons
  • arithmetic supported by the CEL implementation
  • string operations
  • list and map predicates
  • approved pure helper functions

Not allowed:

  • file access
  • network access
  • database access
  • current time unless explicitly added as an input field
  • random values
  • mutation of the rule context
  • action execution from inside CEL

Custom functions should be added conservatively. Native Light-Rule actions remain the extension point for side effects and transformations.

The core runtime object should be a policy-driven condition evaluator rather than ad hoc logic embedded directly in RuleEngine:

RuleEngineOptions
  -> ConditionExecutionPolicy
      -> defaultCelProfile
      -> allowRuleProfileSelection
      -> profiles[name] = CelSecurityProfile

CelSecurityProfile
  -> allowedFunctions
  -> allowedRootVariables
  -> exposeContextRoot
  -> exposeTopLevelAliases
  -> maxExpressionBytes
  -> maxContextBytes
  -> maxStringBytes
  -> maxCollectionItems
  -> allowRegex
  -> allowTimeParsing
  -> allowComprehensions
  -> maxComprehensionNesting

CEL still needs resource and robustness controls because expressions run on request paths and can iterate over input data. Runtime and publish-time validation should:

  • allow-list functions and variables, using compiled expression references where available
  • reject functions that perform I/O, mutation, service lookup, action execution, random generation, or implicit current-time access
  • cap expression length and input context size
  • reject or limit expensive access to large request or response bodies
  • compile during rule load and fail invalid expression shapes before request execution
  • keep the last known-good rule set if reload validation fails
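
A hedged sketch of the first few checks, split into a load-time function and a request-time function. `CelLimits` is a hypothetical subset of the CelSecurityProfile fields above, and the `functions` and `roots` arguments are assumed to come from the compiled expression's reference list rather than from text scanning.

```rust
use std::collections::BTreeSet;

/// Hypothetical subset of CelSecurityProfile used by this sketch.
pub struct CelLimits {
    pub allowed_functions: BTreeSet<String>,
    pub allowed_root_variables: BTreeSet<String>,
    pub max_expression_bytes: usize,
    pub max_context_bytes: usize,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Violation {
    ExpressionTooLarge { bytes: usize, max: usize },
    ContextTooLarge { bytes: usize, max: usize },
    FunctionNotAllowed(String),
    VariableNotAllowed(String),
}

/// Load-time or publish-time check; returns every violation so the author sees them all at once.
pub fn validate_expression(
    limits: &CelLimits,
    expression: &str,
    functions: &[&str],
    roots: &[&str],
) -> Vec<Violation> {
    let mut out = Vec::new();
    if expression.len() > limits.max_expression_bytes {
        out.push(Violation::ExpressionTooLarge {
            bytes: expression.len(),
            max: limits.max_expression_bytes,
        });
    }
    for f in functions {
        if !limits.allowed_functions.contains(*f) {
            out.push(Violation::FunctionNotAllowed(f.to_string()));
        }
    }
    for r in roots {
        if !limits.allowed_root_variables.contains(*r) {
            out.push(Violation::VariableNotAllowed(r.to_string()));
        }
    }
    out
}

/// Request-time check, run before the context is handed to the evaluator.
pub fn check_context_size(limits: &CelLimits, context_bytes: usize) -> Result<(), Violation> {
    if context_bytes > limits.max_context_bytes {
        Err(Violation::ContextTooLarge { bytes: context_bytes, max: limits.max_context_bytes })
    } else {
        Ok(())
    }
}
```

If `validate_expression` returns any violation during reload, the caller would keep serving the last known-good rule set instead of activating the new one.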

Phase ceilings should be enforced by runtime policy. Response phases such as res-tra and res-fil should default to a strict ceiling or tight maxContextBytes limits because they can include large response payloads. Access-control phases may allow standard only when the exposed context is small and bounded. A rule request for a more permissive profile than the phase ceiling must be rejected or downgraded to the stricter effective profile.

For fully untrusted public input, evaluate CEL in a separate worker, process, or another resource-isolated execution path with CPU and memory limits. A Tokio timeout alone is not a complete guard for synchronous CPU-bound expression evaluation.

Portal Experience

Light-Portal should use conditionLanguage to choose the rule editor. This keeps the form predictable and avoids mixing two mental models on the same screen.

Recommended authoring modes:

  • Builder: native condition rows with operand, operator, expected, and join controls
  • CEL: advanced text area for one rule-level CEL expression

Recommended behavior:

  • default new rules to native
  • render condition subforms only for conditionLanguage: native
  • render a CEL expression text area only for conditionLanguage: cel
  • hide native condition controls when CEL is selected
  • hide CEL expression controls when native is selected
  • require confirmation when switching modes if the existing mode has content
  • do not try to round-trip arbitrary CEL into native builder rows
  • store the selected mode in rule_t.condition_language and in the JSON rule body as conditionLanguage
  • for CEL rules, store only the selected profile name in rule_t.condition_security_profile and in the JSON rule body as conditionSecurityProfile; do not expose raw policy limits in the form
  • do not show internal-admin in standard self-service forms; allow it only through checked-in runtime configuration or an explicitly authorized internal admin JWT/role path

The CEL editor should provide:

  • syntax validation
  • test context input
  • expression result preview
  • visible context field reference
  • selected and effective security profile display
  • rule test execution against the same backend evaluator used by runtime

Profile Approval Workflow

Light-Portal may allow a user to select a CEL security profile, but the selected profile is only a request. Runtime policy still computes the effective profile from the requested profile, the service maximum, the tenant maximum, and the rule phase ceiling.

Recommended publish behavior:

  • strict: direct publish. If schema, CEL validation, and command authorization pass, create or update the rule immediately.
  • standard: approval required. Submit the proposed rule change, create a worklist task for rule-admin and admin, and keep the change pending until approval.
  • internal-admin: hidden from standard self-service authoring. If exposed to an operator-only flow, require stronger approval and never allow ordinary self-service users to request it.

For approval-required changes, the command side should not emit the final active RuleCreated or RuleUpdated event at submission time. It should emit a submission event such as RuleChangeSubmittedEvent or RuleApprovalRequestedEvent, store the proposed rule body and requested profile, and create the human-in-the-loop worklist task. Only approval should emit the active rule event. Rejection should emit a rejection event and leave the active rule unchanged.

Assistant tasks can help the approver by summarizing the CEL expression, rule phase, requested profile, referenced context roots, use of response body fields, regex usage, and any runtime ceiling that would downgrade the effective profile. The assistant output is advisory only; the human approver remains responsible for the approval decision.

Recommended approval rules:

  • changing the expression, action list, rule phase, requested profile, or exposed context assumptions invalidates prior approval
  • downgrading from standard to strict can publish directly after validation
  • upgrading from strict to standard or internal-admin requires approval
  • requester and approver should be different users except for an explicit break-glass workflow
  • approval audit should record requested profile, effective profile, requester, approver, approval time, assistant-task summary id, and approval comments
  • pending rules must not be exported to runtime endpoint rule config until approved
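
The publish behavior and approval rules above can be summarized in a small decision sketch. All names here are hypothetical. `RuleChange` models only the fields named in the first approval rule (exposed context assumptions are left out for brevity), and validation is assumed to have passed before `publish_decision` is called.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CelProfile {
    Strict,
    Standard,
    InternalAdmin,
}

/// The parts of a rule that an approval covers.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RuleChange {
    pub expression: String,
    pub actions: Vec<String>,
    pub phase: String,
    pub requested_profile: CelProfile,
}

#[derive(Debug, PartialEq, Eq)]
pub enum PublishDecision {
    PublishDirectly,
    ApprovalRequired,
    Rejected,
}

/// Decide what happens when a validated change is submitted.
pub fn publish_decision(change: &RuleChange, self_service: bool) -> PublishDecision {
    match change.requested_profile {
        CelProfile::Strict => PublishDecision::PublishDirectly,
        CelProfile::Standard => PublishDecision::ApprovalRequired,
        // Ordinary self-service users may never request internal-admin.
        CelProfile::InternalAdmin if self_service => PublishDecision::Rejected,
        CelProfile::InternalAdmin => PublishDecision::ApprovalRequired,
    }
}

/// Any difference in the covered fields invalidates a prior approval.
pub fn approval_still_valid(approved: &RuleChange, proposed: &RuleChange) -> bool {
    approved == proposed
}

/// Requester and approver must differ unless an explicit break-glass flow is in use.
pub fn approver_allowed(requester: &str, approver: &str, break_glass: bool) -> bool {
    break_glass || requester != approver
}
```

Only a `PublishDirectly` result, or a later approval, would lead the command side to emit the active rule event.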

Compatibility

Existing rule YAML remains valid.

Rules without conditionLanguage are treated as native. The database migration should add rule_t.condition_language with default native, so existing rows do not need their rule_body rewritten immediately.

Rules without conditionSecurityProfile use the runtime default CEL profile. The field is meaningful only for CEL rules; native rules do not need a condition security profile.

Native condition aliases must continue to work:

  • operatorCode as alias for operator
  • propertyPath as alias for operand
  • actionClassName as alias for actionRef

CEL introduces a new capability. If the Java yaml-rule runtime needs to execute the same rules, it must implement the same CEL rule shape. Until then, Java runtimes must fail closed with a clear capability error, such as UnsupportedConditionLanguageException, when loading or executing a rule with conditionLanguage: cel. A runtime must not silently ignore a CEL rule because that can fail open for access-control rules.
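
The same default-and-fail-closed behavior is sketched here in Rust for consistency with the rest of this document; a Java runtime would raise its capability exception at the equivalent point. The function and type names are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConditionLanguage {
    Native,
    Cel,
}

/// Returned instead of silently skipping a rule the runtime cannot evaluate.
#[derive(Debug, PartialEq, Eq)]
pub struct UnsupportedConditionLanguage(pub String);

/// A missing conditionLanguage means native; any unknown value is an error.
pub fn parse_condition_language(
    field: Option<&str>,
) -> Result<ConditionLanguage, UnsupportedConditionLanguage> {
    match field {
        None | Some("native") => Ok(ConditionLanguage::Native),
        Some("cel") => Ok(ConditionLanguage::Cel),
        Some(other) => Err(UnsupportedConditionLanguage(other.to_string())),
    }
}

/// Load-time gate: a runtime without CEL support rejects CEL rules outright.
pub fn check_rule_loadable(
    field: Option<&str>,
    runtime_supports_cel: bool,
) -> Result<ConditionLanguage, UnsupportedConditionLanguage> {
    let language = parse_condition_language(field)?;
    if language == ConditionLanguage::Cel && !runtime_supports_cel {
        return Err(UnsupportedConditionLanguage("cel".to_string()));
    }
    Ok(language)
}
```

Returning an error at load time means an access-control rule is never treated as absent, which is the fail-open case the paragraph above warns about.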

Java parity is feasible because Google maintains CEL-Java under the dev.cel Maven group, including the dev.cel:cel artifact with compiler and runtime APIs. The compatibility requirement is therefore mostly about aligning the rule schema, context shape, custom functions, and error handling across the Rust and Java runtimes.

Example: Access Control

ruleBodies:
  allowApprovedTransfer:
    common: Y
    ruleId: allowApprovedTransfer
    ruleName: Allow approved transfer
    ruleType: req-acc
    conditionLanguage: cel
    conditionSecurityProfile: strict
    expression: >
      auditInfo.subject_claims.ClaimsMap.role in roles
      && (
        toolArguments.amount < 1000
        || "transfer.approve" in auditInfo.subject_claims.ClaimsMap.scope
      )
    actions:
      - actionClassName: com.networknt.rule.RoleBasedAccessControlAction

endpointRules:
  /transfer@post:
    req-acc:
      - allowApprovedTransfer
    permission:
      roles:
        - teller
        - approver

Example: Response Filter Guard

ruleBodies:
  filterAccountsForPortalUsers:
    common: Y
    ruleId: filterAccountsForPortalUsers
    ruleName: Filter accounts for portal users
    ruleType: res-fil
    conditionLanguage: cel
    conditionSecurityProfile: strict
    expression: >
      statusCode == 200
      && responseBody != ""
      && auditInfo.subject_claims.ClaimsMap.role != null
    actions:
      - actionClassName: com.networknt.rule.ResponseRowFilterAction

Rollout Plan

  1. Add rule_t.condition_language with default native, optional rule_t.condition_security_profile, and check constraints.
  2. Extend the rule specification with native and CEL rule branches plus optional conditionSecurityProfile.
  3. Add conditionLanguage, conditionSecurityProfile, and expression fields to the Rust Rule model.
  4. Update command/query APIs so the portal can persist and read the condition language, security profile, and approval state without parsing ruleBody.
  5. Optionally accept operatorCode: cel as an import/compatibility alias and normalize it to the rule-level CEL shape.
  6. Choose and pin the Rust CEL crate behind an internal evaluator abstraction.
  7. Add runtime-owned CEL security profiles and policy-driven context building.
  8. Add approval workflow integration for standard and internal-admin profile requests, including worklist and assistant-task support.
  9. Dispatch inside RuleEngine::execute_rule based on conditionLanguage.
  10. Compile and cache CEL expressions during rule config load.
  11. Add unit tests for CEL true, CEL false, invalid expression, mode validation, and missing-field behavior.
  12. Add tests for custom native-parity helper functions.
  13. Add performance tests for context conversion with large toolArguments and response payloads.
  14. Add gateway integration tests using the existing rule context and the context root variable.
  15. Add rule test API support so Light-Portal can validate CEL before publish.
  16. Add portal mode-based rule editing, a controlled CEL profile selector, and approval UX for stronger profile requests.
  17. Document runtime compatibility and Java parity requirements.

Decision

Support both condition languages. Native Light-Rule conditions remain the stable, portal-friendly default. CEL becomes an optional advanced expression language inside the same rule engine for customers that need richer policy expressions. A rule should select one condition language through conditionLanguage; mixed native/CEL condition arrays are not the canonical authoring model.

Design Document: Centralized Agentic Skill Registry

Subject: Transitioning from File-Based Markdown Skills to a Database-Backed Skill Registry


1. Executive Summary

Currently, most AI agent frameworks rely on localized Markdown (.md) files to define agent "skills." While Markdown is highly LLM-native and human-readable, it creates significant bottlenecks at enterprise scale in strict typing, API integration, and context window limits.

This document proposes transitioning to an Agentic Control Plane (Centralized Skill Registry) backed by a database. By decoupling skill metadata, schemas, and instructions, and by utilizing dynamic routing, we will achieve hierarchical structuring, strict schema enforcement, and progressive disclosure of tools to agents.


2. Problem Statement

Managing agent skills as flat Markdown files introduces several scaling challenges:

  1. Lack of Strict Typing: Markdown cannot enforce data types (e.g., ensuring a parameter is an integer vs. string), leading to hallucinated or malformed tool inputs.
  2. Context Window Exhaustion: Loading dozens or hundreds of skill definitions at startup overwhelms the LLM context window, increasing latency, token costs, and tool misuse.
  3. Static Deployments: Updating a skill or changing access permissions requires a full application redeploy.
  4. Poor Discoverability: Flat file structures offer no native mechanism for progressive disclosure or tool search.

3. Data Models & Formats

To solve the limitations of purely text-based skills, we will adopt a hybrid, structured format stored within a database (e.g., PostgreSQL/MongoDB). The architecture uses the right format for the right job:

  • JSON Schema: Used strictly for defining parameters, inputs, and tool shapes. Natively supported by OpenAI/Anthropic/Google tool-calling APIs.
  • LightAPI Description (YAML/JSON): Used to map endpoint-level API capabilities to skills across REST, JSON-RPC, gRPC, and MCP.
  • OpenAPI / OpenRPC / Protobuf: Referenced by LightAPI where protocol-native specifications already exist.
  • Executable Code (Python/JS) / URI: Stores the actual execution logic or the endpoint reference.
  • Markdown: Retained only for the instructions or prompt fields, as LLMs excel at parsing markdown headers and lists for constraints and persona instructions.

LightAPI is the preferred source format for API-backed skills because it describes endpoint identity, protocol invocation, input schema, request mapping, result shape, examples, and behavior notes in one agent-oriented document. See LightAPI Description Design for the endpoint description model.

YAML and JSON are the external skill document formats. In the portal database, they should not replace the Markdown instruction field. The normalized model is structured columns and relationships for identity, versioning, taxonomy, tools, and execution metadata, plus content_markdown for the LLM-facing instruction body. If the portal later needs to persist a full structured skill document, add a nullable JSONB skill-spec column beside content_markdown and normalize YAML imports to JSON.

3.1 Proposed Database Schema Structure

Light Portal stores skills in structured catalog tables. Below is a representation of the skill payload:

{
  "skill_id": "sk_finance_001",
  "name": "generate_financial_report",
  "version": "1.2.0",
  "tags": ["finance", "reporting"],
  "tool_schema": {
    "type": "function",
    "function": {
      "name": "generate_financial_report",
      "description": "Generates a Q3 report based on ticker symbol.",
      "parameters": {
        "type": "object",
        "properties": {
          "ticker": {"type": "string", "description": "The stock ticker"}
        },
        "required": ["ticker"]
      },
      "response_schema": {
        "type": "object",
        "properties": {
          "report_url": {"type": "string"},
          "status": {"type": "string"}
        }
      }
    }
  },
  "execution": {
    "type": "rest_api",
    "endpoint_id": "ep_finance_report_001",
    "endpoint": "https://internal-api.company.com/v1/finance/report",
    "method": "POST"
  },
  "instructions": "## Role\nYou are a financial analyst.\n## Constraints\n- Never hallucinate financial data.\n- Always return exact numbers."
}

4. Hierarchical Structure & Progressive Disclosure

Dumping 500 JSON schemas into an LLM's context window exhausts the available context, raising latency and token cost and degrading tool selection. The Centralized Controller will act as a mediator, enforcing hierarchy and progressive disclosure (giving the agent only the schemas it needs, exactly when it needs them).

4.1 Implementing Hierarchy & Tagging

Because JSON Schema does not have built-in folders, hierarchy and categorization are enforced via the platform's global entity management system:

  1. Namespacing: Tool names follow a strict convention: [domain]_[subdomain]_[action] (e.g., aws_rds_provision).
  2. Tags & Categories: Instead of hardcoded columns, the registry utilizes the entity_tag_t and entity_category_t tables (with entity_type = 'skill'). This allows for unlimited flat tagging and deep hierarchical folder structures that are consistent across the entire portal.
  3. Discovery API: Portal-query filters by these tags/categories to return scoped skill sets for specific agent personas. Agents cache the effective catalog locally and reload it when runtime cache-management invalidation is triggered.

4.2 Progressive Disclosure Patterns

Agents should not load every executable tool into the LLM context. Instead, they should load their assigned skill/tool catalog from the portal API, cache it locally, and use one of the following progressive disclosure patterns:

Phase 5 starts with the Rust light-agent. The agent loads genai-query/getEffectiveAgentCatalog, keeps a local cache keyed by hostId + agentDefId + serviceId + envTag, ranks cached skill/tool entries with keyword and routing-field matching, and intersects the selected tool names with the live gateway tools/list result before giving schemas to the model. Execution remains gateway tools/call.
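
The ranking and intersection step can be illustrated with a small sketch. This is not the light-agent implementation: the struct names, the scoring rule (count of query keywords found in the tool name or summary), and the `limit` parameter are all assumptions.

```rust
use std::collections::{HashMap, HashSet};

/// Cache key fields taken from the description above.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct CacheKey {
    pub host_id: String,
    pub agent_def_id: String,
    pub service_id: String,
    pub env_tag: String,
}

/// Lightweight summary of one catalog entry (hypothetical shape).
#[derive(Debug, Clone)]
pub struct CatalogEntry {
    pub tool_name: String,
    pub summary: String,
}

/// Local catalog cache: one entry list per agent identity.
pub type CatalogCache = HashMap<CacheKey, Vec<CatalogEntry>>;

/// Rank cached entries by keyword hits and keep only tools the gateway currently lists.
pub fn select_tools(
    entries: &[CatalogEntry],
    query: &str,
    live_tools: &HashSet<String>,
    limit: usize,
) -> Vec<String> {
    let keywords: Vec<String> =
        query.to_lowercase().split_whitespace().map(str::to_string).collect();

    let mut scored: Vec<(usize, &str)> = entries
        .iter()
        // Intersection with the live tools/list result.
        .filter(|e| live_tools.contains(&e.tool_name))
        .filter_map(|e| {
            let haystack = format!("{} {}", e.tool_name, e.summary).to_lowercase();
            let score = keywords.iter().filter(|k| haystack.contains(k.as_str())).count();
            (score > 0).then_some((score, e.tool_name.as_str()))
        })
        .collect();

    // Highest score first; ties broken by name so the result is deterministic.
    scored.sort_by(|a, b| b.0.cmp(&a.0).then(a.1.cmp(b.1)));
    scored.into_iter().take(limit).map(|(_, name)| name.to_string()).collect()
}
```

Only the names returned by `select_tools` would have their schemas passed to the model; a tool that appears in the catalog but not in the live gateway list is never offered.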

Pattern A: Meta-Tools (Dynamic Injection)

The agent is booted with only two "meta-tools" designed for discovery.

  1. Local catalog search: Agent searches its cached assigned skills. The cache contains lightweight summaries and mapped tool names.
  2. Schema loading: Once the agent identifies the correct tool, it loads the schema from the local catalog cache or refreshes the cache from portal-query.

Pattern B: Semantic Tool RAG (Zero-Shot Discovery)

For highly complex systems with thousands of skills:

  1. Tool descriptions are embedded into a Vector Database (e.g., pgvector).
  2. When the user prompts the system (e.g., "Reset my AWS password"), portal-query or the agent's local cache performs semantic search and retrieves the Top-3 most relevant JSON Schemas.
  3. The agent boots with only those 3 tools in its context.
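
The retrieval step in Pattern B amounts to a nearest-neighbor search over embeddings. The sketch below shows the idea in memory with cosine similarity; in a deployment backed by pgvector the ranking would normally run inside the database. The function names and the tiny two-dimensional vectors are illustrative assumptions.

```rust
/// Cosine similarity of two vectors; returns 0.0 if either vector has zero length.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Return the names of the `k` tools whose embeddings are most similar to the query.
pub fn top_k<'a>(query: &[f32], tools: &'a [(String, Vec<f32>)], k: usize) -> Vec<&'a str> {
    let mut scored: Vec<(f32, &'a str)> = tools
        .iter()
        .map(|(name, embedding)| (cosine(query, embedding), name.as_str()))
        .collect();
    // Sort by similarity, highest first. total_cmp gives a total order over f32.
    scored.sort_by(|a, b| b.0.total_cmp(&a.0));
    scored.into_iter().take(k).map(|(_, name)| name).collect()
}
```

With `k = 3` this corresponds to the Top-3 retrieval described in step 2; the selected names would still be intersected with the live gateway tool list before schemas reach the model.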

Pattern C: Multi-Agent Orchestration (Supervisor / Worker)

Hierarchy is mapped to agent teams.

  1. A Supervisor Agent holds routing tools (e.g., delegate_to_finance, delegate_to_devops).
  2. When delegate_to_devops is triggered, the supervisor routes to a DevOps Worker Agent, loading only the specific DevOps JSON schemas into its context.

5. Example Flow: Dynamic Loading in Action

User: "I need to provision a new database for the marketing team."

  1. Turn 1: Discovery
    • Agent Context: Has a local cache of assigned skill summaries.
    • Agent Action: Searches the local cache for provision database.
  2. Turn 2: High-Level Awareness
    • Local Cache Result: Returns token-efficient summaries from the portal catalog: [{"name": "aws_rds_provision", "description": "Creates AWS RDS DB"}, {"name": "mongo_atlas_create", "description": "Creates Mongo cluster"}]
    • Agent Action: Decides AWS is needed and loads the cached schema for aws_rds_provision.
  3. Turn 3: Strict Execution
    • Agent Catalog: Provides the full JSON schema (requiring instance_type, storage_gb).
    • Agent Action: Understands parameters and safely executes aws_rds_provision through the gateway tools/call path.

6. Operational Benefits & Security

By centralizing skills in a database, the platform gains enterprise-grade operational capabilities:

  • Dynamic Updates: API endpoints, instructions, and schemas can be updated in the database without restarting agents.
  • Permission-Aware Discovery (RBAC): By linking skills to LightAPI endpoint descriptions and api_endpoint_t, portal-query can limit catalog disclosure to the current agent or tenant, while runtime gateway policy still authorizes execution.
  • A/B Testing: Portal catalog metadata can route 50% of an agent's requests to skill_v1 and 50% to skill_v2 to measure prompt/tool efficacy.
  • Audit Logging: Catalog disclosure and gateway execution can be logged separately, preserving a compliance trail without moving tool execution into the registry.
  • Distilled Memory RAG: Following the "Hindsight" pattern, raw conversation history (agent_session_history_t) is separated from RAG-optimized memory (session_memory_t). This prevents the "noisy context" problem while preserving the complete raw history as an audit trail.

7. LightAPI As Skill Source

API-backed skills should be generated from endpoint-level LightAPI descriptions whenever possible.

The skill registry should store skill metadata, access control, grouping, and agent-facing instructions. The LightAPI description should remain the source of truth for endpoint invocation and verification details.

Recommended flow:

  1. Light-Portal creates or imports endpoint-level LightAPI descriptions.
  2. API owners enrich endpoint descriptions with examples, behavior notes, result cases, and visibility.
  3. Approved endpoint descriptions are published as agent skills.
  4. The agent loads assigned skill summaries from portal-query and caches them locally.
  5. When the agent selects a skill, it loads the relevant LightAPI disclosure level from the local cache or refreshes from portal-query.
  6. Execution goes through the gateway tools/call path, preserving runtime policy and downstream authorization.

This avoids manually duplicating every API endpoint as a separate hand-written skill while still giving agents strict schemas and progressive disclosure.

8. Workflow-Backed Skills

Some skills need more than instructions and a curated tool set. A skill that must orchestrate several tools, wait for human approval, retry failed steps, run assertions, or preserve a durable audit trail should be backed by light-workflow.

The boundary should stay clear:

| Layer | Responsibility |
|---|---|
| Skill | Discovery metadata, taxonomy, instructions, allowed tools, and agent guidance. |
| Workflow | Ordered execution, branching, retries, assertions, human tasks, durable state, and audit events. |
| Gateway | Runtime tool execution through tools/list and tools/call. |

Workflow backing should be optional. Simple skills can stay as instructions plus tool mappings. Durable or regulated processes should link to workflow definitions and let light-workflow own execution.

Recommended storage:

  1. Keep wf_definition_t.definition as the canonical workflow YAML.
  2. Keep skill_t.content_markdown as the LLM-facing skill instruction body.
  3. Add skill_workflow_t to link skills to workflow definitions with a role such as primary, validation, remediation, or test.
  4. Treat skill_tool_t as the allowed tool set for a workflow-backed skill. Validation should flag workflow tool-call steps that are not linked to the skill.
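
The validation in item 4 is a set difference between the tools a workflow calls and the tools linked to the skill. A minimal sketch, assuming the tool names have already been extracted from the workflow definition and from skill_tool_t (the function name is hypothetical):

```rust
use std::collections::BTreeSet;

/// Return workflow tool-call names that are not linked to the skill,
/// in first-seen order and without duplicates.
pub fn unlinked_tool_calls(workflow_tool_calls: &[&str], skill_tools: &[&str]) -> Vec<String> {
    let allowed: BTreeSet<&str> = skill_tools.iter().copied().collect();
    let mut seen: BTreeSet<&str> = BTreeSet::new();
    workflow_tool_calls
        .iter()
        .copied()
        // Keep a name only if it is not allowed and has not been reported yet.
        .filter(|tool| !allowed.contains(tool) && seen.insert(*tool))
        .map(str::to_string)
        .collect()
}
```

An empty result means every tool-call step stays within the skill's allowed tool set; a non-empty result would be surfaced as a validation finding in the editor.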

The Portal Skill Workspace should embed a generic Workflow Editor instead of creating a skill-specific workflow runtime. The editor provides YAML editing, step preview, reference lookup, validation, and test runs. Skill authoring provides the surrounding context: skill metadata, taxonomy, allowed tools, effective prompt preview, and workflow link configuration.

9. Next Steps

  1. Complete phase 3 by adding category and tag assignment to existing skill create/update forms, backed by entity_category_t and entity_tag_t with entity_type = 'skill'.
  2. Save skill taxonomy through a composite skill command so the skill row and selected taxonomy associations are emitted from the same user action.
  3. Move the richer authoring workspace, effective prompt preview, skill_tool_t.config formalization, workflow-backed skills, and "create skill from LightAPI/tool" flows to phase 3.5.
  4. Build the generic Workflow Editor for YAML editing, parsed step preview, catalog references, validation, and workflow test runs.
  5. Complete phase 4 agent assignment by improving the agent_skill_t UI, adding an Agent Definition assignment context, and adding a batch assignment composite command that emits one AgentSkillCreatedEvent per selected skill.
  6. Enforce phase 4 assignment validation in command handlers and UI preflight: assigned skills must be active and must have at least one active direct skill_tool_t link. Workflow-backed skills still rely on skill_tool_t as the allowed tool set.
  7. Keep live gateway tools/list runtime executability checks as a diagnostics or governance concern, not as phase 4 persistence validation.
  8. Complete phase 5 for the Rust agent with the genai-query getEffectiveAgentCatalog endpoint, claim checks against host, sid, and env, local catalog caching, keyword/routing search, gateway tools/list intersection, and controller-driven cache invalidation.
  9. Complete phase 6 governance for the Rust agent only: normalize sensitivity tiers to public, internal, confidential, and restricted; filter blocked tools before catalog disclosure; compare the effective catalog with gateway tools/list through /diagnostics/tools; and keep execution through gateway tools/call.
  10. Enforce destructive, approval-required, and sensitivity metadata at the gateway with debug/auditInfo fields when a call is blocked. Do not use workflow audit_log_t for catalog disclosure; use auditInfo/file logging until a generic governance audit table is introduced.
  11. Keep current active row plus aggregate version as the approval/version boundary until workflow-owned approval state is implemented.
  12. Add publishing from LightAPI endpoint descriptions into the skill registry.
  13. Migrate existing file-based skills into structured catalog payloads, keeping instructions in Markdown and converting parameters to JSON Schema.
  14. Implement Pattern B (Semantic Tool RAG) after indexed catalog fields and embeddings are ready for production search.

Skill Workflow Orchestration

Status

Proposed demo design.

Executive Summary

This design describes a focused demo for agent-driven orchestration in Light-Fabric. The demo uses one agent with two skills:

  1. A skill that starts a workflow which calls two REST APIs directly.
  2. A skill that starts a workflow which calls the same two REST APIs through the MCP router.

Both paths solve the same business use case and return the same output. The visible difference is the execution trace:

  • The REST workflow shows light-workflow invoking HTTP endpoints directly.
  • The MCP workflow shows light-workflow invoking MCP tools/call, with light-gateway routing each tool call to the same backend REST APIs.

This demonstrates that skills provide agent-facing guidance and discovery, workflows provide durable orchestration, and the gateway provides the MCP data plane for tool execution.

Goals

  • Show one agent selecting between two assigned skills.
  • Show a workflow that orchestrates multiple REST APIs directly.
  • Show a second workflow that orchestrates the same APIs through MCP tools.
  • Keep the input and output contract identical across both workflows.
  • Keep the demo small enough to explain in a few minutes.
  • Preserve the runtime boundary: skills guide, workflows orchestrate, gateway executes MCP tool calls.

Non-Goals

  • Do not benchmark REST versus MCP latency.
  • Do not claim that MCP replaces REST. The demo shows two supported access patterns over the same backend capabilities.
  • Do not require every skill to be workflow-backed. Simple skills can remain instructions plus allowed tools.
  • Do not move MCP tool execution into the portal registry or agent catalog. Runtime tool execution stays on the gateway tools/call path.
  • Do not make the demo depend on a large endpoint catalog.

Recommendation

Use two APIs, not one.

A one-API demo can show sequencing, but it does not clearly prove cross-service orchestration. Two APIs show a more realistic enterprise shape: the workflow has to collect data from one business capability and make a decision through another capability.

Use four endpoints for the base demo.

| Demo size | Endpoint count | Recommendation | Why |
|---|---|---|---|
| Smoke test | 2 | Optional only | Shows a happy path, but not enough variation. |
| Base demo | 4 | Recommended | Covers path parameters, query parameters, arrays, request bodies, branching, and transformation. |
| Advanced demo | 6 | Later phase | Adds parallel enrichment, compensation, or audit callbacks. |

The base demo should be small enough to run repeatedly while still proving meaningful orchestration behavior.

Demo Scenario

The demo domain is personalized offer recommendation.

The agent receives a prompt such as:

Recommend an offer for customer CUST-1001.

The agent can use either skill:

  • Personalized Offer via REST Workflow
  • Personalized Offer via MCP Router

If the prompt does not specify REST or MCP, the demo agent should not pick a path at random. It should ask a short clarification question:

Do you want to run this through the direct REST workflow or through the MCP router workflow?

Scripted demos can avoid the clarification by naming the path in the prompt.

Both skills start a workflow that:

  1. Loads the customer profile.
  2. Loads customer preferences and consent.
  3. Stops if the customer has not consented.
  4. Searches for eligible offers.
  5. Selects the best offer.
  6. Records the offer decision.
  7. Returns a normalized decision payload.
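
The seven steps can be sketched as one transport-neutral function. Everything here is illustrative: the `OfferBackend` trait, the seed data, and the rule that the "best" offer is the one with the highest `priority` are assumptions, since the selection criterion is not specified. The trait stands in for either the direct REST calls or the MCP tool calls, which is why both workflows can return the same payload.

```rust
pub struct Profile { pub segment: String }
pub struct Preferences { pub consented: bool, pub categories: Vec<String> }

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Offer { pub id: String, pub category: String, pub priority: u32 }

/// Normalized decision payload returned by both workflows.
#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    CustomerNotFound,
    NoConsent,
    NoEligibleOffer,
    Offered { offer_id: String, decision_id: String },
}

/// Transport-neutral view of the backend calls (REST or MCP behind it).
pub trait OfferBackend {
    fn load_profile(&self, customer_id: &str) -> Option<Profile>;
    fn load_preferences(&self, customer_id: &str) -> Preferences;
    fn search_offers(&self, segment: &str, categories: &[String]) -> Vec<Offer>;
    fn record_decision(&mut self, customer_id: &str, offer_id: &str) -> String;
}

pub fn recommend_offer<B: OfferBackend>(backend: &mut B, customer_id: &str) -> Decision {
    // Step 1: load the customer profile.
    let Some(profile) = backend.load_profile(customer_id) else {
        return Decision::CustomerNotFound;
    };
    // Steps 2 and 3: load preferences and stop without consent.
    let prefs = backend.load_preferences(customer_id);
    if !prefs.consented {
        return Decision::NoConsent;
    }
    // Steps 4 and 5: search eligible offers and pick the highest priority (assumed criterion).
    let offers = backend.search_offers(&profile.segment, &prefs.categories);
    let Some(best) = offers.into_iter().max_by_key(|o| o.priority) else {
        return Decision::NoEligibleOffer;
    };
    // Steps 6 and 7: record the decision and return the normalized payload.
    let decision_id = backend.record_decision(customer_id, &best.id);
    Decision::Offered { offer_id: best.id, decision_id }
}

/// Deterministic in-memory backend with made-up seed data.
pub struct SeedBackend { pub consented: bool, pub recorded: Vec<(String, String)> }

impl OfferBackend for SeedBackend {
    fn load_profile(&self, customer_id: &str) -> Option<Profile> {
        (customer_id == "CUST-1001").then(|| Profile { segment: "gold".to_string() })
    }
    fn load_preferences(&self, _customer_id: &str) -> Preferences {
        Preferences { consented: self.consented, categories: vec!["travel".to_string()] }
    }
    fn search_offers(&self, _segment: &str, categories: &[String]) -> Vec<Offer> {
        let all = [("OFF-1", "travel", 10), ("OFF-2", "travel", 30), ("OFF-3", "dining", 50)];
        all.iter()
            .filter(|o| categories.iter().any(|c| c.as_str() == o.1))
            .map(|&(id, category, priority)| Offer {
                id: id.to_string(),
                category: category.to_string(),
                priority,
            })
            .collect()
    }
    fn record_decision(&mut self, customer_id: &str, offer_id: &str) -> String {
        self.recorded.push((customer_id.to_string(), offer_id.to_string()));
        format!("DEC-{}", self.recorded.len())
    }
}
```

Because the consent check returns before any offer search or write, a customer without consent never produces a recorded decision.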

APIs And Endpoints

Customer Profile API

The Customer Profile API owns customer data and preferences.

| Endpoint | Shape | Purpose |
|---|---|---|
| GET /customers/{customerId} | Path parameter, object response | Load customer identity, segment, region, and account status. |
| GET /customers/{customerId}/preferences?channel=portal | Path parameter plus query parameter | Load consent, preferred categories, and contact channel rules. |

Offer Decision API

The Offer Decision API owns eligible offer lookup and decision recording.

| Endpoint | Shape | Purpose |
|---|---|---|
| GET /offers?segment={segment}&state={state}&category={category} | Query parameters, array response | Search active offers matching the customer profile and preferences. |
| POST /offer-decisions | JSON request body, object response | Persist the selected offer decision and return a decision id. |

Demo API Runtime Services

The two business APIs should be implemented as real Rust services using the light-axum framework, not as ad hoc mocks. This keeps the demo aligned with normal Light-Fabric service lifecycle behavior:

  • load runtime configuration from config-server
  • bind HTTP using configured server settings
  • register with controller through portal-registry
  • appear in the control panel service-discovery view
  • support gateway service discovery by serviceId and envTag

Recommended demo apps:

| App | Service id | Default HTTP port | Purpose |
| --- | --- | --- | --- |
| demo-customer-profile-api | com.networknt.demo.customer-profile-1.0.0 | 8085 | Serves customer profile and preference data. |
| demo-offer-decision-api | com.networknt.demo.offer-decision-1.0.0 | 8086 | Serves offer lookup and decision recording. |

The ports are config defaults only. They must be configurable through config-server values so local, Docker, Kubernetes, and shared demo environments can choose different ports without recompiling.

Both services should expose:

GET /health

The API endpoints should return deterministic demo data. A database is not required for the first demo; in-memory seed data is enough as long as the data is stable and documented. If later demos need persistence, keep it behind the same endpoint contract.

Light-Axum Bootstrap

Each demo API should follow the normal light-axum pattern: implement AxumApp, return an axum::Router, and let LightRuntimeBuilder own binding, configuration, shutdown, and controller registration.

The service should read config from the same runtime config files used by other Light-Fabric services:

startup.yml
server.yml
portal-registry.yml

Example config-server values for the Customer Profile API:

startup.host: dev.lightapi.net
startup.externalConfigDir: /var/lib/demo-customer-profile-api/config-cache

light-config-server-uri: https://config-server.lightapi.svc.cluster.local:8435

server.serviceId: com.networknt.demo.customer-profile-1.0.0
server.environment: demo
server.ip: 0.0.0.0
server.advertisedAddress: demo-customer-profile-api
server.httpPort: 8085
server.enableHttp: true
server.enableHttps: false
server.enableRegistry: true
server.startOnRegistryFailure: true

portalRegistry.portalUrl: https://controller.lightapi.svc.cluster.local:8438

Example config-server values for the Offer Decision API:

startup.host: dev.lightapi.net
startup.externalConfigDir: /var/lib/demo-offer-decision-api/config-cache

light-config-server-uri: https://config-server.lightapi.svc.cluster.local:8435

server.serviceId: com.networknt.demo.offer-decision-1.0.0
server.environment: demo
server.ip: 0.0.0.0
server.advertisedAddress: demo-offer-decision-api
server.httpPort: 8086
server.enableHttp: true
server.enableHttps: false
server.enableRegistry: true
server.startOnRegistryFailure: true

portalRegistry.portalUrl: https://controller.lightapi.svc.cluster.local:8438

server.advertisedAddress must be a reachable address, not 0.0.0.0. In Kubernetes, use the Service DNS name. In local Docker Compose, use the Compose service name. In a native VM demo, use the VM hostname or another reachable address.
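
A startup check for this rule could look like the sketch below. `validate_advertised_address` is a hypothetical helper, not part of light-runtime; it only rejects empty values and wildcard bind addresses, because real reachability depends on the caller's network.

```rust
use std::net::IpAddr;

/// Rejects advertised addresses that other services cannot connect to.
/// Hypothetical helper for illustration only.
fn validate_advertised_address(addr: &str) -> Result<(), String> {
    let addr = addr.trim();
    if addr.is_empty() {
        return Err("advertisedAddress must not be empty".to_string());
    }
    // Wildcard bind addresses (0.0.0.0, ::) are valid for server.ip only.
    if let Ok(ip) = addr.parse::<IpAddr>() {
        if ip.is_unspecified() {
            return Err(format!("advertisedAddress {addr} is a wildcard bind address"));
        }
    }
    // Anything else (DNS name, Compose service name, routable IP) is accepted;
    // actual reachability can only be verified from the caller's network.
    Ok(())
}
```

With the example values above, `demo-customer-profile-api` passes and `0.0.0.0` is rejected before the service registers with the controller.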

Controller Registration

The services should register with controller using the runtime's portal-registry integration. The controller registration payload must include at least:

  • serviceId
  • envTag
  • protocol
  • advertised address
  • port
  • discovery token or portal registry token, according to environment policy

After startup, the control panel should show two registered service instances:

com.networknt.demo.customer-profile-1.0.0 / demo
com.networknt.demo.offer-decision-1.0.0 / demo

The MCP router configuration should prefer these service IDs over fixed targetHost values where service discovery is available. Fixed targetHost values are still useful for a minimal local smoke test.

Optional Advanced Endpoints

The base demo should start with four endpoints. If we later want to demonstrate more workflow shapes, add one or two of these optional endpoints:

| Endpoint | Shape Demonstrated | Use |
| --- | --- | --- |
| GET /customers/{customerId}/risk | Parallel enrichment | Run profile, preferences, and risk lookup before offer selection. |
| POST /offer-decisions/{decisionId}/audit | Follow-up side effect | Record a compliance audit event after the decision is created. |
| POST /offer-decisions/{decisionId}/cancel | Compensation | Cancel the decision if a later step fails. |

Agent, Skills, And Workflows

Use one agent so the demo highlights skill selection rather than agent handoff.

| Object | Name | Responsibility |
| --- | --- | --- |
| Agent | Demo Orchestration Agent | Receives the user request and selects one of the assigned skills. |
| Skill | Personalized Offer via REST Workflow | Guides the agent to start the direct REST workflow. |
| Skill | Personalized Offer via MCP Router | Guides the agent to start the MCP-backed workflow. |
| Workflow | personalized-offer-rest-v1 | Orchestrates direct HTTP calls to the two REST APIs. |
| Workflow | personalized-offer-mcp-v1 | Orchestrates MCP tool calls through the gateway router. |

The skill registry should link each skill to its workflow definition through skill_workflow_t. The workflow definition remains canonical in wf_definition_t.definition. The skill content_markdown remains agent-facing guidance, not the executable workflow source.

Execution Paths

Direct REST Workflow

User prompt
  -> Demo Orchestration Agent
  -> Personalized Offer via REST Workflow skill
  -> light-workflow
  -> Customer Profile API
  -> Offer Decision API
  -> normalized decision result

This path is useful for showing direct, durable API orchestration.

MCP Router Workflow

User prompt
  -> Demo Orchestration Agent
  -> Personalized Offer via MCP Router skill
  -> light-workflow
  -> MCP tools/call
  -> light-gateway MCP router
  -> Customer Profile API
  -> Offer Decision API
  -> normalized decision result

This path is useful for showing MCP protocol orchestration over the same backend API capabilities.

Common Workflow Contract

Both workflows should accept the same input:

{
  "customerId": "CUST-1001",
  "channel": "portal"
}

Both workflows should return the same successful output shape:

{
  "status": "APPROVED",
  "customerId": "CUST-1001",
  "selectedOfferId": "OFFER-TRAVEL-01",
  "decisionId": "DEC-1001"
}

Both workflows should return comparable business outcomes for known edge cases:

{
  "status": "NO_CONSENT",
  "customerId": "CUST-3003",
  "reason": "Customer has not consented to personalized offers."
}

{
  "status": "NO_ELIGIBLE_OFFER",
  "customerId": "CUST-2002",
  "reason": "No active offer matches the customer profile and preferences."
}

Workflow Shape

The REST and MCP workflows should have the same logical steps.

| Step | REST workflow action | MCP workflow action |
| --- | --- | --- |
| Load profile | GET /customers/{customerId} | tools/call customer_get_profile |
| Load preferences | GET /customers/{customerId}/preferences | tools/call customer_get_preferences |
| Check consent | Workflow condition | Workflow condition |
| Search offers | GET /offers | tools/call offer_search |
| Select offer | Workflow expression or rule | Workflow expression or rule |
| Record decision | POST /offer-decisions | tools/call offer_record_decision |
| Return result | Workflow output mapping | Workflow output mapping |

The workflow should own branching, retries, and output normalization. The agent should not manually sequence each API call after the workflow starts.

Error Handling And Retries

Business outcomes and technical failures should be treated differently.

Business outcomes are expected workflow results and should not be retried:

  • NO_CONSENT
  • NO_ELIGIBLE_OFFER

Technical failures should use bounded workflow retries:

| Failure | Recommended behavior |
| --- | --- |
| Customer Profile API timeout | Retry the profile step with exponential backoff. |
| Offer Decision API returns 503 | Retry the affected offer step with exponential backoff. |
| Gateway MCP tools/call timeout | Retry the MCP tool-call step with the same workflow policy. |
| Persistent downstream failure | End with a controlled technical failure result and preserve the workflow trace. |

Recommended transient retry status codes:

408, 429, 502, 503, 504
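
The bounded retry policy can be sketched as follows. This is a minimal illustration under stated assumptions: `is_retryable`, `backoff_delay`, and `run_with_retries` are hypothetical names, the step is modeled as a closure returning an HTTP status code, and the blocking `std::thread::sleep` stands in for whatever timer the workflow engine actually uses.

```rust
use std::time::Duration;

/// Status codes listed above as transient.
fn is_retryable(status: u16) -> bool {
    matches!(status, 408 | 429 | 502 | 503 | 504)
}

/// Exponential backoff: base * 2^attempt, capped at `max`.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(max)
}

/// Calls `step` until it returns a non-retryable status or `max_attempts`
/// is reached. Returns the last status and the number of attempts made.
fn run_with_retries<F>(max_attempts: u32, base: Duration, max: Duration, mut step: F) -> (u16, u32)
where
    F: FnMut() -> u16,
{
    let mut attempt = 0;
    loop {
        let status = step();
        attempt += 1;
        if !is_retryable(status) || attempt >= max_attempts {
            return (status, attempt);
        }
        // Wait before the next attempt; the first retry waits `base`.
        std::thread::sleep(backoff_delay(attempt - 1, base, max));
    }
}
```

A persistent 503 exhausts `max_attempts` and surfaces the last status, which the workflow can map to the controlled technical failure result. Business outcomes such as NO_CONSENT never enter this loop because they are not transport failures.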

The POST /offer-decisions step should include an idempotency key derived from the workflow instance id and selected offer id. This prevents duplicate decisions when a retry happens after the backend processed the first request but the response was lost.
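
The sketch below shows one way the key and the backend-side deduplication could fit together. The `:` separator, the `DecisionStore` type, and the `DEC-` id numbering are assumptions for illustration; the document does not prescribe a key format or a storage design.

```rust
use std::collections::HashMap;

/// Derives a stable key from the workflow instance id and the selected offer id.
/// The `:` separator and the key format are assumptions for illustration.
fn idempotency_key(workflow_instance_id: &str, offer_id: &str) -> String {
    format!("{workflow_instance_id}:{offer_id}")
}

/// In-memory stand-in for the Offer Decision API's decision store.
#[derive(Default)]
struct DecisionStore {
    by_key: HashMap<String, String>,
    next_id: u32,
}

impl DecisionStore {
    /// Records a decision once per idempotency key; a retry with the same
    /// key gets the original decision id back instead of a new record.
    fn record(&mut self, key: &str) -> String {
        if let Some(existing) = self.by_key.get(key) {
            return existing.clone();
        }
        self.next_id += 1;
        let decision_id = format!("DEC-{}", 1000 + self.next_id);
        self.by_key.insert(key.to_string(), decision_id.clone());
        decision_id
    }
}
```

Because the key depends only on the workflow instance and the selected offer, a retried step sends the same key and receives the same decision id.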

For parity, the REST and MCP workflows should use the same retry policy. In the MCP path, the gateway should preserve enough error detail for the workflow trace to show the tool name, mapped backend endpoint, status code, and correlation id.

MCP Tool Mapping

The MCP workflow should use a small, explicit tool set.

| MCP tool | Backend endpoint | Arguments |
| --- | --- | --- |
| customer_get_profile | GET /customers/{customerId} | customerId |
| customer_get_preferences | GET /customers/{customerId}/preferences | customerId, channel |
| offer_search | GET /offers | segment, state, category |
| offer_record_decision | POST /offer-decisions | customerId, offerId, channel, source, reason |

The MCP tool input schemas should be normalized JSON objects. The gateway router maps those objects to path parameters, query parameters, or request bodies for the backend REST APIs.

The MCP skill should list these tools in skill_tool_t as its allowed runtime tool set. Workflow validation should flag an MCP tool-call step if it references a tool that is not linked to the skill.
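
The validation rule amounts to a set difference between the tools a workflow references and the tools linked to the skill. A minimal sketch, assuming both lists have already been loaded as tool names (`unlinked_tools` is a hypothetical helper):

```rust
use std::collections::BTreeSet;

/// Returns the tool names referenced by workflow steps that are not in the
/// skill's allowed tool set (the rows linked through skill_tool_t).
/// Each offending name is reported once, in first-seen order.
fn unlinked_tools(workflow_tools: &[&str], allowed_tools: &[&str]) -> Vec<String> {
    let allowed: BTreeSet<&str> = allowed_tools.iter().copied().collect();
    let mut seen = BTreeSet::new();
    workflow_tools
        .iter()
        .copied()
        .filter(|tool| !allowed.contains(*tool) && seen.insert(*tool))
        .map(|tool| tool.to_string())
        .collect()
}
```

An empty result means every tool-call step is covered by the skill's allowed tool set; any returned name should be flagged before the workflow test run.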

Gateway Tool Configuration Example

Current gateway HTTP tool execution maps GET arguments to query parameters and sends non-GET arguments as JSON request bodies. To support endpoint shapes such as GET /customers/{customerId} without changing the backend API, the demo should add or configure explicit path-template substitution before the request is sent.

Recommended minimal mapping shape:

mcp-router.tools:
  - name: customer_get_profile
    description: Get a customer profile by id.
    protocol: http
    serviceId: com.networknt.demo.customer-profile-1.0.0
    envTag: demo
    path: /customers/{customerId}
    method: GET
    apiType: http
    inputSchema:
      type: object
      required:
        - customerId
      properties:
        customerId:
          type: string
    toolMetadata:
      pathParams:
        - customerId

With this mapping, the MCP tool call:

{
  "name": "customer_get_profile",
  "arguments": {
    "customerId": "CUST-1001"
  }
}

should be routed to:

GET /customers/CUST-1001

The path parameter should not also be appended as a query parameter. Arguments not listed under pathParams can still be appended as query parameters for GET requests or sent as JSON body fields for POST requests.
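
The substitution rule can be sketched as a pure function. This is an illustration only, not the gateway's implementation: `resolve_path` is a hypothetical name, arguments are simplified to string values, and percent-encoding is left out.

```rust
use std::collections::BTreeMap;

/// Fills `{name}` placeholders for the declared path parameters and returns
/// the resolved path plus the arguments left over for the query string or body.
fn resolve_path(
    template: &str,
    path_params: &[&str],
    arguments: &BTreeMap<String, String>,
) -> Result<(String, BTreeMap<String, String>), String> {
    let mut path = template.to_string();
    let mut remaining = arguments.clone();
    for name in path_params {
        // A path parameter is consumed here so it is not sent again as a query parameter.
        let value = remaining
            .remove(*name)
            .ok_or_else(|| format!("missing path parameter: {name}"))?;
        let placeholder = format!("{{{name}}}");
        if !path.contains(&placeholder) {
            return Err(format!("template has no placeholder for: {name}"));
        }
        // Assumption: values are URL-safe; a real router would percent-encode them.
        path = path.replace(&placeholder, &value);
    }
    Ok((path, remaining))
}
```

For customer_get_preferences, the `customerId` argument is consumed by the path and `channel` stays in the leftover map, ready to be appended as a query parameter.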

Skill Content Markdown Guidance

The skill content_markdown should explain when and how the agent should use the skill. It should not duplicate the workflow definition or the full API contract.

Example REST skill content:

## Purpose
Use this skill when the user asks for a personalized offer decision through the
direct REST workflow.

## Inputs
- customerId: customer identifier, such as CUST-1001
- channel: request channel, default portal

## Behavior
- Start workflow personalized-offer-rest-v1.
- Return the workflow result as the answer.
- Do not manually call offer APIs outside the workflow.
- If the user does not specify REST or MCP, ask which execution path they want.

Example MCP skill content:

## Purpose
Use this skill when the user asks to demonstrate MCP router orchestration for a
personalized offer decision.

## Inputs
- customerId: customer identifier, such as CUST-1001
- channel: request channel, default portal

## Behavior
- Start workflow personalized-offer-mcp-v1.
- The workflow will call MCP tools through the gateway.
- Return the workflow result as the answer.
- If the user does not specify REST or MCP, ask which execution path they want.

Structured execution metadata belongs in registry rows and workflow definitions, not only in markdown. The markdown is the LLM-facing explanation.

Output Normalization

The workflows should not pass raw endpoint responses directly to the agent. They should normalize backend responses into a stable business result.

Example raw POST /offer-decisions response:

{
  "decisionId": "DEC-1001",
  "customerId": "CUST-1001",
  "offerId": "OFFER-TRAVEL-01",
  "decision": "approved",
  "createdAt": "2026-05-25T14:12:00Z",
  "auditRef": "AUD-7788"
}

Normalized workflow output:

{
  "status": "APPROVED",
  "customerId": "CUST-1001",
  "selectedOfferId": "OFFER-TRAVEL-01",
  "decisionId": "DEC-1001"
}

The workflow should own this transformation so the REST and MCP variants produce identical final results even if their intermediate transport envelopes are different.
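
As a rough sketch of that transformation, assuming the raw response has already been parsed into a struct (`RawDecision`, `NormalizedDecision`, and `normalize` are hypothetical names; the actual mapping would live in the workflow definition):

```rust
/// Raw fields from the POST /offer-decisions response shown above.
#[allow(dead_code)] // created_at and audit_ref are parsed but not forwarded
struct RawDecision {
    decision_id: String,
    customer_id: String,
    offer_id: String,
    decision: String,
    created_at: String,
    audit_ref: String,
}

/// Stable business result returned by both workflows.
#[derive(Debug, PartialEq)]
struct NormalizedDecision {
    status: String,
    customer_id: String,
    selected_offer_id: String,
    decision_id: String,
}

fn normalize(raw: RawDecision) -> NormalizedDecision {
    NormalizedDecision {
        // The backend example uses a lower-case decision; the contract uses upper case.
        status: raw.decision.to_ascii_uppercase(),
        customer_id: raw.customer_id,
        // offerId is renamed to selectedOfferId in the normalized shape.
        selected_offer_id: raw.offer_id,
        decision_id: raw.decision_id,
        // createdAt and auditRef are intentionally dropped.
    }
}
```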

Demo Data

Use deterministic seed data so the demo is repeatable.

| Customer | Profile | Preferences | Expected result |
| --- | --- | --- | --- |
| CUST-1001 | Premium segment, active, Ontario | Consent true, travel preferred | APPROVED with OFFER-TRAVEL-01. |
| CUST-2002 | Standard segment, active, Ontario | Consent true, travel preferred | NO_ELIGIBLE_OFFER. |
| CUST-3003 | Premium segment, active, Ontario | Consent false | NO_CONSENT. |

Seed offers:

| Offer | Match condition | Result |
| --- | --- | --- |
| OFFER-TRAVEL-01 | segment=premium, state=ON, category=travel | Eligible for CUST-1001. |
| OFFER-CASHBACK-01 | segment=premium, state=BC, category=shopping | Not eligible for Ontario travel scenario. |
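
The seed data and the consent-then-match logic can be checked with a small in-memory sketch. The struct and function names are hypothetical, "Ontario" is encoded as "ON" on the assumption that the profile region maps to the offer `state` parameter, and offer selection is simplified to "first matching offer".

```rust
#[derive(Debug, PartialEq)]
enum Outcome {
    Approved { offer_id: &'static str },
    NoConsent,
    NoEligibleOffer,
}

struct Customer {
    id: &'static str,
    segment: &'static str,
    state: &'static str,
    consent: bool,
    preferred_category: Option<&'static str>,
}

struct Offer {
    id: &'static str,
    segment: &'static str,
    state: &'static str,
    category: &'static str,
}

// Seed rows copied from the two tables above.
static CUSTOMERS: [Customer; 3] = [
    Customer { id: "CUST-1001", segment: "premium", state: "ON", consent: true, preferred_category: Some("travel") },
    Customer { id: "CUST-2002", segment: "standard", state: "ON", consent: true, preferred_category: Some("travel") },
    Customer { id: "CUST-3003", segment: "premium", state: "ON", consent: false, preferred_category: None },
];

static OFFERS: [Offer; 2] = [
    Offer { id: "OFFER-TRAVEL-01", segment: "premium", state: "ON", category: "travel" },
    Offer { id: "OFFER-CASHBACK-01", segment: "premium", state: "BC", category: "shopping" },
];

/// Returns `None` for an unknown customer id, otherwise the business outcome.
fn decide(customer_id: &str) -> Option<Outcome> {
    let customer = CUSTOMERS.iter().find(|c| c.id == customer_id)?;
    // Stop before any offer lookup when consent is missing.
    if !customer.consent {
        return Some(Outcome::NoConsent);
    }
    let offer = OFFERS.iter().find(|o| {
        o.segment == customer.segment
            && o.state == customer.state
            && Some(o.category) == customer.preferred_category
    });
    Some(match offer {
        Some(o) => Outcome::Approved { offer_id: o.id },
        None => Outcome::NoEligibleOffer,
    })
}
```

With this data, CUST-1001 matches OFFER-TRAVEL-01, CUST-2002 finds no offer for the standard segment, and CUST-3003 stops at the consent check.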

Demo Script

Run the REST workflow path first:

Use the REST workflow skill to recommend an offer for CUST-1001.

Expected observation:

  • The agent selects Personalized Offer via REST Workflow.
  • The workflow trace shows direct HTTP calls to the Customer Profile API and Offer Decision API.
  • The final response contains status=APPROVED and a decision id.

Run the MCP workflow path second:

Use the MCP router skill to recommend an offer for CUST-1001.

Expected observation:

  • The agent selects Personalized Offer via MCP Router.
  • The workflow trace shows MCP tools/call invocations.
  • The gateway trace shows those tool calls routed to the same backend REST endpoints.
  • The final response uses the same output shape as the REST workflow.

Then run one edge case:

Use either skill to recommend an offer for CUST-3003.

Expected observation:

  • The workflow stops after the consent check.
  • No offer decision is recorded.
  • The result is NO_CONSENT.

Run one ambiguity case:

Recommend an offer for CUST-1001.

Expected observation:

  • The agent asks whether to use the direct REST workflow or the MCP router workflow.
  • After the user chooses, the agent starts the selected workflow.

Run one technical failure case:

Use the MCP router skill to recommend an offer for CUST-1001 while the Offer
Decision API returns one transient 503.

Expected observation:

  • The workflow retries the failed tool-call step.
  • The gateway trace records the failed offer_record_decision call and the successful retry.
  • The final response still uses the normalized APPROVED output shape.

Portal Authoring Flow

The portal should make the demo visible from the existing GenAI and workflow surfaces:

  1. Create or import the two REST APIs and four endpoint descriptions.
  2. Implement the two APIs as light-axum services.
  3. Add config-server values for both API services.
  4. Start both services and verify controller registration.
  5. Publish MCP router tools for the same four endpoints.
  6. Create personalized-offer-rest-v1 in the workflow catalog.
  7. Create personalized-offer-mcp-v1 in the workflow catalog.
  8. Create the two skills in the skill registry.
  9. Link each skill to its primary workflow through skill_workflow_t.
  10. Link the MCP skill to its allowed tool set through skill_tool_t.
  11. Assign both skills to Demo Orchestration Agent.
  12. Use Skill Workspace preview and test panels to validate the effective prompt, workflow link, allowed tools, and sample test input.

Validation Rules

The authoring experience should validate the following before the demo is considered complete:

  • Each skill has exactly one primary workflow link.
  • The REST workflow does not require MCP tools.
  • The MCP workflow references only MCP tools linked through skill_tool_t.
  • Both workflows declare the same input schema.
  • Both workflows declare the same normalized output shape.
  • The four backend endpoint descriptions are active.
  • Both demo API services load config from config-server.
  • Both demo API services register with controller and appear in the control panel service-discovery view.
  • The MCP router tools/list result includes the four expected tool names.
  • MCP router tools resolve the demo APIs by serviceId and envTag in the service-discovery environment.
  • MCP path-parameter mappings are validated before the workflow test run.
  • POST /offer-decisions includes an idempotency key for retry safety.
  • Test runs for CUST-1001, CUST-2002, and CUST-3003 produce the expected outcomes.

Observability

The demo should show three different traces:

  1. Agent trace: which skill the agent selected and what workflow it started.
  2. Workflow trace: step order, branches, retries, and final output.
  3. Gateway trace: MCP tool name, mapped backend endpoint, status, duration, and correlation id for the MCP path.

Use the same correlation id across the agent request, workflow instance, and gateway calls where possible. This makes the REST and MCP execution paths easy to compare.

Security And Authorization

Authorization should be enforced at each layer:

  • The agent can discover only assigned skills.
  • The workflow can start only definitions visible to the authenticated caller or service identity.
  • The MCP skill can expose only tools linked to the skill and allowed for the agent.
  • The gateway still performs runtime MCP access checks before executing tools/call.
  • Backend REST APIs continue to enforce their own authorization policies.

The skill registry is not a runtime bypass. It narrows discovery and guidance, while the workflow and gateway remain responsible for execution-time controls.

Context And Auth Propagation

The demo should explicitly show that caller context is preserved.

For direct REST workflow steps:

  1. The workflow start request records the initiating user, host, tenant, correlation id, and authorization context.
  2. The workflow executor builds outbound REST calls with the correct caller context. If the original bearer token is safe to forward, it can be passed through. Otherwise, the workflow service should use a service token with on-behalf-of metadata that preserves the initiating subject.
  3. Backend APIs enforce their normal authorization policies.

For MCP workflow steps:

  1. light-workflow calls the gateway MCP endpoint with the same correlation, tenant, locale, and authorization context.
  2. light-gateway validates the MCP request and runtime tool authorization.
  3. The MCP router forwards the allowed caller headers to the backend REST API while regenerating transport-specific headers such as Host, Content-Length, and connection management headers.
  4. Backend APIs see the same business identity context they would see on the direct REST path.

The trace should show this propagation without exposing sensitive token values.
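
The forwarding rule in step 3 of the MCP path can be sketched as a header filter. This is an illustration under assumptions: `forwardable_headers` is a hypothetical helper, the allow-list is supplied by configuration, and the exact set of regenerated headers is a guess beyond the three kinds named above.

```rust
/// Headers the router regenerates for the outbound connection instead of copying.
/// Host and Content-Length come from the text; the rest are assumed
/// connection-management headers.
const REGENERATED: [&str; 5] = ["host", "content-length", "connection", "keep-alive", "transfer-encoding"];

/// Keeps caller headers that are on the allow-list and not transport-specific.
/// Header names are compared case-insensitively.
fn forwardable_headers(inbound: &[(String, String)], allowed: &[&str]) -> Vec<(String, String)> {
    inbound
        .iter()
        .filter(|(name, _)| {
            let name = name.to_ascii_lowercase();
            allowed.iter().any(|a| a.eq_ignore_ascii_case(&name))
                && !REGENERATED.iter().any(|r| *r == name)
        })
        .cloned()
        .collect()
}
```

Even if a transport header such as Host is mistakenly placed on the allow-list, it is still dropped, so the backend always sees headers generated for its own connection.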

Acceptance Criteria

  • One demo agent has both skills assigned.
  • The REST skill starts personalized-offer-rest-v1.
  • The MCP skill starts personalized-offer-mcp-v1.
  • Both workflows accept the same input JSON.
  • Both workflows return the same normalized output shape.
  • The REST workflow trace shows direct REST calls to two APIs.
  • The MCP workflow trace shows MCP tools/call routed through the gateway to the same two APIs.
  • The two APIs run as light-axum services with config-server supplied HTTP ports.
  • The two APIs register with controller and are visible in the control panel service-discovery view.
  • The demo succeeds for CUST-1001.
  • The demo returns controlled business outcomes for CUST-2002 and CUST-3003.
  • Ambiguous user prompts trigger a clarification question instead of random skill selection.
  • A transient 503 from the Offer Decision API is retried and appears in the workflow trace.
  • The MCP path preserves caller context through workflow, gateway, and backend REST calls.

Hindsight Memory

Hindsight Memory is the core memory system for light-rs, designed to move beyond simple chat logs. Instead of just remembering what was said, the agent learns and forms mental models over time.

This design is strongly inspired by the paper Hindsight is 20/20: Building Agent Memory that Retains, Recalls, and Reflects and extends it with multi-tenant support.


1. Core Concepts

Hindsight memory organizes information into three distinct "Pathway" types:

  1. World Facts: Objective truths about the environment (e.g., "The production server is in US-East-1").
  2. Experiences: The agent's own history of actions and results (e.g., "I tried to deploy to US-East-1 and it failed due to a timeout").
  3. Mental Models: Synthesized understandings formed by reflecting on facts and experiences (e.g., "Deployments to US-East-1 are unstable during peak hours").

2. The Three Operations

Interaction with the memory system is standardized into three primary operations:

Retain (Storage)

The retain operation ingests information. Behind the scenes, the system:

  • Extracts entities and relationships.
  • Normalizes time and temporal data.
  • Stores the data in agent_memory_unit_t.

Recall (Retrieval)

The recall operation retrieves relevant context using a hybrid strategy:

  • Semantic: Vector similarity using the hnsw index.
  • Graph: Following links in agent_memory_link_t (causes, enables, prevents).
  • Temporal: Time-series filtering.

Reflect (Synthesis)

The reflect operation performs "deep thinking." It analyzes existing memories to generate new insights, which are stored in agent_memory_reflection_t.


3. Database Architecture

The Hindsight system is fully integrated into the portal's multi-tenant schema:

| Table Name | Description |
| --- | --- |
| agent_memory_bank_t | The primary container. Defines personality and disposition (skepticism, empathy). |
| agent_memory_doc_t | Source documents (logs, files, transcripts) that provide the raw text for memory units. |
| agent_memory_unit_t | Sentence-level "atoms" of thought. Stores content, embeddings, and fact types (world, experience, etc.). |
| agent_memory_entity_t | Resolved Knowledge Graph nodes, optionally linked to platform users (user_t). |
| agent_memory_unit_entity_t | The join table linking individual memories to the entities they mention. |
| agent_memory_entity_cooccur_t | Association graph tracking concept relationships and co-occurrence counts. |
| agent_memory_link_t | Defines causal and semantic relationships between memories (causes, enables, etc.). |
| agent_memory_directive_t | "Hard rules" that override probabilistic learning. |
| agent_memory_reflection_t | Synthesized high-level insights generated during the "Reflect" phase. |
| agent_session_history_t | The live record of active conversations, linked to a specific bank for context. |

4. Privacy & Multi-Tenancy

Isolation is managed at the Bank level using three scoping tiers:

  1. Global Host Bank (user_id IS NULL, agent_def_id IS NULL):
    • Knowledge shared across all users and all agents within a specific host_id.
    • Ideal for organization-wide SOPs, common facts, and shared documentation.
  2. Shared Agent Bank (user_id IS NULL, agent_def_id IS NOT NULL):
    • Knowledge shared by all users interacting with a specific agent type.
    • Used for maintaining a consistent agent "Persona" or specialized domain expertise.
  3. Private User Bank (user_id IS NOT NULL):
    • Knowledge unique to a specific user.
    • Can be scoped further by agent_def_id to provide user-specific memory within a particular agent persona.
    • Used for personal preferences, private history, and individualized learning.
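
The three tiers follow directly from the two nullable columns, which can be sketched as a classification function (`BankScope` and `bank_scope` are hypothetical names used only for illustration):

```rust
#[derive(Debug, PartialEq)]
enum BankScope {
    GlobalHost,
    SharedAgent,
    PrivateUser,
}

/// Classifies a bank row from its nullable `user_id` and `agent_def_id` columns.
fn bank_scope(user_id: Option<&str>, agent_def_id: Option<&str>) -> BankScope {
    match (user_id, agent_def_id) {
        (None, None) => BankScope::GlobalHost,
        (None, Some(_)) => BankScope::SharedAgent,
        // A private bank may or may not be narrowed to one agent definition.
        (Some(_), _) => BankScope::PrivateUser,
    }
}
```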

5. Implementation Guide

To implement a "Learning Agent," follow this sequence in your application logic:

  1. Ingestion: After every tool call or user interaction, call retain to update the bank.
  2. Context Loading: Before calling the LLM, call recall to fetch the most relevant 3-5 memories for the current prompt.
  3. Scheduled Reflection: Run the reflect operation during idle time to compress raw experiences into high-level mental models.
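
The recall-then-retain loop can be sketched with a toy in-memory bank. Everything here is an assumption for illustration: the `Memory` trait is not the hindsight-client API, keyword overlap stands in for the hybrid semantic, graph, and temporal recall, and the scheduled reflect step is omitted.

```rust
/// Hypothetical memory interface; the real hindsight-client API may differ.
trait Memory {
    fn retain(&mut self, content: &str);
    fn recall(&self, query: &str, limit: usize) -> Vec<String>;
}

/// Toy in-memory bank that ranks memories by words shared with the query.
#[derive(Default)]
struct InMemoryBank {
    units: Vec<String>,
}

/// Lower-cased alphanumeric tokens of `text`.
fn words(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}

impl Memory for InMemoryBank {
    fn retain(&mut self, content: &str) {
        self.units.push(content.to_string());
    }

    fn recall(&self, query: &str, limit: usize) -> Vec<String> {
        let query_words = words(query);
        let mut scored: Vec<(usize, &String)> = self
            .units
            .iter()
            .map(|unit| {
                let unit_words = words(unit);
                let score = query_words.iter().filter(|w| unit_words.contains(*w)).count();
                (score, unit)
            })
            .filter(|(score, _)| *score > 0)
            .collect();
        // Highest overlap first; the stable sort keeps insertion order for ties.
        scored.sort_by(|a, b| b.0.cmp(&a.0));
        scored.into_iter().take(limit).map(|(_, unit)| unit.clone()).collect()
    }
}

/// One agent turn: recall context, build the LLM prompt, then retain the interaction.
fn agent_turn(memory: &mut impl Memory, user_prompt: &str) -> String {
    let context = memory.recall(user_prompt, 5);
    let prompt = format!("Context:\n{}\n\nUser: {}", context.join("\n"), user_prompt);
    memory.retain(user_prompt);
    prompt
}
```

The returned string is what would be sent to the LLM; in a real agent the tool results and the model's answer would also be retained after the call.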

Fine-Grained Authorization

The following is raw material collected from a conversation with an AI assistant about migrating the light-4j fine-grained authorization to light-rs to secure light-agent and light-gateway.

It still needs to be organized into a design document for light-rs fine-grained authorization.

Question: We have a Java implementation of fine-grained authorization that we want to migrate to light-rs. In our Java implementation, a middleware handler verifies custom claims such as role, group, and attribute in the JWT against endpoint-specific rules that are defined in the database and cached at the service. How should we handle the same requirement in Rust?

Response:

In Rust, bridging the gap between dynamic, database-driven rules and high-performance web requests is a well-solved problem. If you are coming from Java, the architecture translates very cleanly, but takes advantage of Rust's async runtime and thread-safe concurrency primitives.

To handle fine-grained authorization (Attribute-Based Access Control / ABAC or Role-Based Access Control / RBAC) where rules are cached locally but backed by a database, you have two primary paths in Rust:

  1. The Custom Middleware Approach (Closest to your current Java implementation)
  2. The Policy Engine Approach (Using an industry-standard Rust authorization crate)

Here is how you handle this requirement idiomatically in Rust, assuming you are using Axum (a widely used Rust web framework built on tokio).


Approach 1: Custom Middleware with Async Caching

In Java, you likely used something like Guava or Caffeine for your local cache. In Rust, the closest equivalent for high-performance, concurrent, async-aware caching is the moka crate, whose design is inspired by Caffeine.

Here is the architectural blueprint:

  1. Define your custom claims using serde.
  2. Setup a shared Application State holding your Database connection pool (e.g., sqlx) and your local Cache (moka).
  3. Create an Axum Middleware to intercept the request, parse the JWT, consult the cache based on the route, and authorize the request.

1. Dependencies (Cargo.toml)

[dependencies]
axum = "0.7"
jsonwebtoken = "9.0"
moka = { version = "0.12", features = ["future"] } # Async cache (Rust's Caffeine)
sqlx = { version = "0.7", features = ["postgres", "runtime-tokio"] }
serde = { version = "1.0", features = ["derive"] }
tokio = { version = "1", features = ["full"] } # Runtime for #[tokio::main] and TcpListener

2. Defining Claims and State

use axum::{extract::State, http::Request, middleware::Next, response::Response};
use moka::future::Cache;
use serde::{Deserialize, Serialize};
use sqlx::PgPool;
use std::sync::Arc;

// 1. Define your custom JWT claims
#[derive(Debug, Serialize, Deserialize, Clone)]
struct CustomClaims {
    sub: String,
    roles: Vec<String>,
    groups: Vec<String>,
    attributes: std::collections::HashMap<String, String>,
    exp: usize,
}

// 2. Define the Endpoint Rule (what you fetch from DB)
#[derive(Clone)]
struct EndpointRule {
    required_roles: Vec<String>,
    // add required_groups, attributes, etc.
}

// 3. Define the Application State to share across your app
struct AppState {
    db: PgPool,
    // Cache keyed by String (e.g., "GET:/api/data") returning an EndpointRule
    rule_cache: Cache<String, EndpointRule>,
}

3. The Authorization Middleware

In Axum, middleware is written as an asynchronous function that takes the request, does some work, and either rejects it or passes it to the Next handler.

use axum::http::StatusCode;
use axum::response::IntoResponse;

async fn fine_grained_authz_middleware(
    State(state): State<Arc<AppState>>,
    req: Request<axum::body::Body>,
    next: Next,
) -> Result<Response, StatusCode> {

    // 1. Extract and decode the JWT from Authorization header
    let auth_header = req.headers().get("Authorization").and_then(|h| h.to_str().ok());
    let claims = decode_jwt(auth_header).ok_or(StatusCode::UNAUTHORIZED)?;

    // 2. Identify the endpoint being accessed (e.g., "GET:/api/users/123")
    // Note: In practice, read axum::extract::MatchedPath from req.extensions() to get
    // the matched route template rather than the raw URI, so "/api/users/123"
    // resolves to the rule stored for "/api/users/:id" (axum 0.7 path syntax).
    let endpoint_key = format!("{}:{}", req.method(), req.uri().path());

    // 3. Check Cache or Fallback to DB
    let rule = match state.rule_cache.get(&endpoint_key).await {
        Some(cached_rule) => cached_rule,
        None => {
            // Cache miss: Fetch from database
            let db_rule = fetch_rule_from_db(&state.db, &endpoint_key)
                .await
                .map_err(|_| StatusCode::INTERNAL_SERVER_ERROR)?;

            // Populate cache
            state.rule_cache.insert(endpoint_key, db_rule.clone()).await;
            db_rule
        }
    };

    // 4. Execute Fine-Grained Authorization Logic: any one required role is enough
    let is_authorized = rule.required_roles.iter().any(|role| claims.roles.contains(role));
    // ... check groups, evaluate custom attribute expressions, etc.

    if !is_authorized {
        return Err(StatusCode::FORBIDDEN);
    }

    // 5. Pass to the actual route handler
    Ok(next.run(req).await)
}

// Placeholder helpers so the example type-checks; real implementations are omitted
fn decode_jwt(_header: Option<&str>) -> Option<CustomClaims> { /* ... */ unimplemented!() }
async fn fetch_rule_from_db(_db: &PgPool, _key: &str) -> Result<EndpointRule, ()> { /* ... */ unimplemented!() }

4. Wiring it up

use axum::{routing::get, Router};

#[tokio::main]
async fn main() {
    let state = Arc::new(AppState {
        db: PgPool::connect("postgres://...").await.unwrap(),
        rule_cache: Cache::builder()
            .max_capacity(10_000)
            .time_to_live(std::time::Duration::from_secs(300)) // 5 minute TTL
            .build(),
    });

    let app = Router::new()
        .route("/api/secure-data", get(|| async { "Secret Data" }))
        // Apply the middleware to all routes above this line
        .route_layer(axum::middleware::from_fn_with_state(state.clone(), fine_grained_authz_middleware))
        .with_state(state);

    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}

Approach 2: The Policy Engine Approach

If your endpoint-specific rules are highly complex (e.g., "User must be in Group X AND have Attribute Y = 'finance', UNLESS Role is 'Admin'"), writing that evaluation logic manually in Rust gets messy.

Instead of writing custom evaluation logic, you can delegate ABAC/RBAC evaluation to an external policy engine. Two widely used options are:

1. AWS Cedar (cedar-policy)

Developed by Amazon Web Services in Rust, Cedar is a fast, purpose-built language for defining permissions.

  • You store your Cedar policies in your database as text.
  • Your Rust middleware fetches the Cedar policy string from the DB, compiles it (and caches the compiled policy using moka).
  • You pass the JWT CustomClaims into the Cedar engine as a context entity.
  • Cedar evaluates the claims against the policy in microseconds.

2. Casbin (casbin)

Casbin is a very popular authorization library that exists in Java, Go, and Rust.

  • It supports ACL, RBAC, and ABAC.
  • It has an sqlx adapter (Postgres/MySQL). casbin-rs reads the rules from the database and keeps them in memory once the policy is loaded, so you do not need to write the caching logic yourself; picking up later rule changes still requires reloading the policy or using a watcher.

Summary

  • For simple Role/Group matching: Stick to the Custom Middleware approach using moka for async caching and axum::middleware to intercept requests. It is functionally identical to your Java approach but extremely fast.
  • For complex Attribute evaluation: Use your middleware to parse the JWT, but delegate the actual rule evaluation to cedar-policy or casbin.

Question: Could you please provide some examples for cedar-policy and casbin? For example, what is the format in database? In our system, we are using a yaml rule engine to evaluate the fine-grained authorization. We create some rules users can choose and they can also create their own rules if necessary.

Response:

Transitioning from a custom YAML rule engine to an industry-standard policy engine is a fantastic move. Both AWS Cedar and Casbin are designed precisely for this kind of "user-defined fine-grained rules" requirement, but they take very different approaches to how rules are written and stored.

Here is a breakdown of how both engines work, what their database format looks like, and how you map your JWT claims to them in Rust.


1. AWS Cedar (cedar-policy) - Best for Complex/Custom Rules

Cedar uses a custom, highly readable Domain Specific Language (DSL). It is specifically built to evaluate ABAC (Attribute-Based Access Control) using contextual data like JWT claims.

The Format (In the Database)

In your database, you do not need complex tables. You simply store the Cedar policies as plain text strings in a standard TEXT or VARCHAR column.

If a user wants to create a custom rule saying, "Allow access if the user's JWT has the 'finance' group and a clearance level greater than 3," the Cedar policy stored in the DB looks like this:

// Database column: policy_text
permit (
    principal,
    action == Action::"read_data",
    resource == Resource::"secure_endpoint"
)
when {
    context.jwt.groups.contains("finance") &&
    context.jwt.attributes.clearance_level > 3
};

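For "predefined rules that users can choose," Cedar has Policy Templates. A template may contain only two placeholders, ?principal and ?resource, and they may appear only in the policy scope, not in the when clause. To offer a "require a specific group" rule, model groups as entities and link the template to the chosen group. This means passing the user's group membership to Cedar as entities rather than as context.

// Template: "Require specific group"
permit(principal in ?principal, action, resource);

// Linked per rule, e.g. with ?principal = Group::"finance"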

How you execute it in Rust

When a request comes in, you fetch the relevant Cedar text strings from your database, pass in the JWT claims as the Context, and let Cedar evaluate it.

// Written against the cedar-policy 3.x API; later major versions change
// the Request::new signature, so check the version you depend on.
use cedar_policy::{Authorizer, Context, Decision, Entities, PolicySet, Request};
use serde_json::json;
use std::str::FromStr;

fn evaluate_cedar_rule(db_policy_text: &str, jwt_claims: &serde_json::Value) {
    // 1. Parse the text policy from the database
    //    (in production, cache the parsed PolicySet instead of re-parsing per request)
    let policies = PolicySet::from_str(db_policy_text).expect("Invalid policy format");

    // 2. Map your JWT claims directly into the Cedar "Context" using JSON
    let context_json = json!({
        "jwt": jwt_claims
    });
    let context = Context::from_json_value(context_json, None).unwrap();

    // 3. Create the evaluation request
    let request = Request::new(
        Some("User::\"Alice\"".parse().unwrap()),
        Some("Action::\"read_data\"".parse().unwrap()),
        Some("Resource::\"secure_endpoint\"".parse().unwrap()),
        context,
        None,
    ).unwrap();

    // 4. Evaluate
    let authorizer = Authorizer::new();
    let entities = Entities::empty(); // Used if fetching hierarchical data
    let answer = authorizer.is_authorized(&request, &policies, &entities);

    if answer.decision() == Decision::Allow {
        println!("Access Granted!");
    } else {
        // errors() reports policies that failed to evaluate, not policies that simply did not match
        let errors: Vec<String> = answer.diagnostics().errors().map(|e| e.to_string()).collect();
        println!("Access Denied. Evaluation errors: {:?}", errors);
    }
}

2. Casbin (casbin-rs) - Best for Tabular/Database-Native Rules

Casbin works fundamentally differently. It splits authorization into two parts: the Model (a config file defining the structure of your rules) and the Policies/Rules (the actual data stored in the database).

The Format (In the Database)

Casbin doesn't store a DSL string. Instead, it natively connects to your database using an adapter (like sqlx-adapter) and creates a table usually called casbin_rule.

The table structure looks exactly like a CSV file, with columns: ptype, v0, v1, v2, v3, v4, v5.

If a user selects a predefined rule to allow the finance_admin role to read /api/data, Casbin inserts a row into your database like this:

ptype | v0 (Subject)  | v1 (Object) | v2 (Action) | v3 (Condition - optional)
p     | finance_admin | /api/data   | read        |

If a user writes a custom ABAC rule using a UI (e.g., clearance > 3), Casbin can store the evaluation expression as a string in one of the columns (evaluated internally via the rhai scripting language):

ptype | v0 (Subject/Condition)                                   | v1 (Object) | v2 (Action)
p     | r.sub.groups.contains("finance") && r.sub.clearance > 3 | /api/data   | read

How you execute it in Rust

To make Casbin evaluate JWT attributes dynamically, you define a model.conf file that tells Casbin to evaluate (eval) the string stored in the database.

The model.conf File:

[request_definition]
r = sub, obj, act

[policy_definition]
p = sub_rule, obj, act

[policy_effect]
e = some(where (p.eft == allow))

[matchers]
# This matcher says: check if the object and action match, 
# AND evaluate the ABAC rule stored in the database (p.sub_rule) against the request subject (r.sub)
m = r.obj == p.obj && r.act == p.act && eval(p.sub_rule) 

The Rust Code:

use casbin::{CoreApi, Enforcer};
use serde::Serialize;
use sqlx_adapter::SqlxAdapter;

// A struct representing your JWT Claims.
// casbin-rs converts request values into rhai values through serde,
// so the struct derives Serialize (and Hash) instead of being copied into a map.
#[derive(Clone, Serialize, Hash)]
struct JwtClaims {
    groups: Vec<String>,
    clearance: i32,
}

async fn evaluate_casbin() {
    // 1. Connect to DB. The adapter fetches the rules from the `casbin_rule` table
    //    when the enforcer is created.
    let adapter = SqlxAdapter::new("postgres://user:pass@localhost/db", 8).await.unwrap();
    let enforcer = Enforcer::new("model.conf", adapter).await.unwrap();

    // 2. Parse the JWT for the incoming request
    let claims = JwtClaims {
        groups: vec!["finance".to_string()],
        clearance: 5,
    };

    // 3. Casbin in Rust uses `rhai` under the hood for dynamic object evaluation.
    //    The claims struct becomes the request subject, so the expression stored
    //    in the database can read r.sub.groups and r.sub.clearance.

    // 4. Enforce the rules loaded from the DB
    // Request is: Subject (JWT claims), Object (/api/data), Action (read)
    if enforcer.enforce((claims, "/api/data", "read")).unwrap() {
        println!("Access Granted!");
    } else {
        println!("Access Denied!");
    }
}

Which one maps better to your YAML Rule Engine?

AWS Cedar is highly recommended for your use case.

Because you currently use a YAML rule engine where users define complex logical rules, your users are already accustomed to defining logical expressions.

  1. Translation: It is very easy to write a script that takes your existing YAML definitions and converts them into Cedar text policies.
  2. Safety: Cedar is statically typed and formally verified. If a user writes an invalid custom rule, Cedar will fail to parse it instantly and give you a helpful error message to show in your UI. Casbin's eval() logic can sometimes be brittle at runtime if the user types bad syntax.
  3. Database Schema: With Cedar, you control your database schema. You just add a policy_text column to your existing "Rules" table. Casbin requires taking over a table (casbin_rule) to map to its strict CSV-like format.
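The translation step in point 1 can be sketched with plain string formatting. This is a minimal illustration, assuming each YAML rule reduces to an action, a resource, and a whitespace-separated role list, and that the roles arrive in the Cedar context as context.jwt.roles. The function name roles_to_cedar and the Action/Resource entity type names are hypothetical, and role names are not escaped.

```rust
/// Hypothetical helper: turn one endpoint's role list into Cedar policy text.
/// `roles` is whitespace-separated, e.g. "account-manager teller".
fn roles_to_cedar(action: &str, resource: &str, roles: &str) -> String {
    // One `contains` check per role; any matching role grants access.
    let conditions: Vec<String> = roles
        .split_whitespace()
        .map(|role| format!("context.jwt.roles.contains(\"{}\")", role))
        .collect();

    // With no roles configured, emit a condition that never matches.
    let when_body = if conditions.is_empty() {
        "false".to_string()
    } else {
        conditions.join(" ||\n    ")
    };

    format!(
        "permit (\n    principal,\n    action == Action::\"{}\",\n    resource == Resource::\"{}\"\n)\nwhen {{\n    {}\n}};",
        action, resource, when_body
    )
}
```

The generated text can be stored in the policy_text column and should be parsed once with the Cedar library before it is saved, so that malformed output is rejected early.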

Question: In our current implementation, we define all the permissions and filters in the UI and save the permissions and the selected rules in the database. During server startup, the server loads everything locally in YAML format like the following. One of the pros is that our technical team writes the rules, and the user or API owner just defines the roles, groups, positions, and attributes for the endpoint. The rules are executed locally on the target server without going to the database, so the performance is very good. If any permission changes for the target server, we can trigger a config reload from the controller to ask the server to reload the config/permissions. I am not sure if we should replace this implementation with Cedar. One of the features we have is the column and row filter, and I don't think Cedar supports it. What do you think?

rule.endpointRules:
  echo@call:
    req-acc:
    - allow-role-based-access-control.lightapi.net
    permission:
      roles: mcp-reader
  /v1/pets@get:
    req-acc:
    - allow-role-based-access-control.lightapi.net
    permission:
      roles: host-admin
  /v1/accounts@get:
    req-acc:
    - allow-role-based-access-control.lightapi.net
    res-fil:
    - response-column-filter.lightapi.net
    - response-row-filter.lightapi.net
    permission:
      col:
        role:
          teller: '["accountNo","userId","accountType","firstName","lastName"]'
          account-manager: '["accountNo","userId","accountType","firstName","lastName","status"]'
      row:
        role:
          teller:
          - colName: accountType
            colValue: C
            operator: =
      roles: account-manager teller

Response:

You have excellent architectural instincts. You are absolutely right to hesitate, and your concern about column and row filtering is the exact reason you probably should not replace your current system with Cedar.

Your current design is actually highly sophisticated. Let’s break down why your current implementation is so strong, why Cedar would struggle with your specific requirements, and how you should idiomatically build your current design in Rust.


Why Cedar is NOT a drop-in replacement for your system

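AWS Cedar and Casbin are designed as binary decision engines: you give them a request, and they return Allow or Deny. OPA can return arbitrary structured data from a policy, but you would still have to write the code that applies that result to the response body.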

The Missing Feature: Data Filtering (Masking)

Your system handles two fundamentally different concerns in one elegant YAML file:

  1. Access Control (req-acc): Can the user access the /v1/accounts endpoint? (Cedar does this perfectly).
  2. Data Filtering (res-fil): Which rows and columns can the user see in the response? (Cedar does not do this natively).

If you forced Cedar to do column filtering, you would have to misuse its "Annotations" feature to attach custom metadata to policies, read the output, and then write a custom Rust engine to parse that metadata and filter the JSON anyway. You would essentially be rebuilding your existing system, just with Cedar awkwardly wedged in the middle.

Strengths of your current architecture

What you have built is essentially a custom API Gateway / Sidecar Data Masking Engine.

  1. Performance: Loading rules into memory at startup means zero network latency. (Matches Cedar).
  2. Hot Reloading: Fetching updates via a controller is standard cloud-native practice. (Matches Cedar).
  3. Separation of Concerns: Tech team writes the underlying engine logic (allow-role-based-access-control), while API owners just write YAML configs. (Matches Cedar’s Policy Templates).
  4. Row/Column filtering: Natively understood by your engine. (Far exceeds Cedar).

Verdict: Do not switch to Cedar. Your current YAML-based, locally cached evaluation engine is a perfect fit for your business requirements.


How to implement your architecture in Rust

Since you are migrating to or building in Rust, your current Java-based architecture translates beautifully. In fact, Rust will make this system significantly faster and more memory-efficient.

Here is how you replicate your exact Java architecture—including the hot-reloading and filtering—in Rust.

1. Hot-Reloading using arc-swap

In Java, hot-reloading configuration safely across threads can be tricky. In Rust, a widely used crate for "read-heavy, occasionally-updated" data is arc-swap. It lets request handlers read the current rules without taking a lock, while your controller swaps in a new rule set atomically in the background.

use arc_swap::ArcSwap;
use serde::Deserialize;
use std::collections::HashMap;
use std::sync::Arc;

// 1. Map your YAML to Rust structs
#[derive(Deserialize, Debug, Clone)]
struct RuleConfig {
    // The sample YAML nests every endpoint under this top-level key
    #[serde(rename = "rule.endpointRules")]
    endpoint_rules: HashMap<String, EndpointRule>,
}

#[derive(Deserialize, Debug, Clone)]
struct EndpointRule {
    #[serde(rename = "req-acc")]
    req_acc: Option<Vec<String>>,
    #[serde(rename = "res-fil")]
    res_fil: Option<Vec<String>>,
    permission: PermissionConfig,
}

#[derive(Deserialize, Debug, Clone)]
struct PermissionConfig {
    roles: String,
    col: Option<HashMap<String, HashMap<String, String>>>, // "role" -> role name -> column JSON string
    // The `row` section is not modeled here; serde ignores unknown fields by default
}

// 2. Wrap your entire config in an ArcSwap
lazy_static::lazy_static! {
    static ref CONFIG: ArcSwap<HashMap<String, EndpointRule>> = ArcSwap::from_pointee(HashMap::new());
}

// 3. Background task to listen to the Controller and Hot-Reload
async fn reload_config_from_controller(new_yaml_str: &str) -> Result<(), serde_yaml::Error> {
    // On a parse error we return early, so the previous rules stay active
    let new_config: RuleConfig = serde_yaml::from_str(new_yaml_str)?;
    // Swaps the pointer atomically.
    // All new requests see the new rules instantly. Old requests finish using the old rules.
    CONFIG.store(Arc::new(new_config.endpoint_rules));
    Ok(())
}
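To make the snapshot semantics concrete without pulling in a crate, here is a standard-library stand-in. It is not arc-swap: it uses RwLock<Arc<...>>, so readers take a brief read lock, but it shows the same behavior where a request keeps the rule set it loaded while new requests see the replacement. The names RuleStore and Rules are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Simplified rule set for illustration: endpoint key -> allowed roles.
type Rules = HashMap<String, String>;

struct RuleStore {
    current: RwLock<Arc<Rules>>,
}

impl RuleStore {
    fn new(initial: Rules) -> Self {
        Self { current: RwLock::new(Arc::new(initial)) }
    }

    /// Cheap snapshot: clones the Arc pointer, not the map.
    fn load(&self) -> Arc<Rules> {
        let guard = self.current.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Replace the whole rule set; snapshots handed out earlier stay valid.
    fn store(&self, next: Rules) {
        *self.current.write().unwrap() = Arc::new(next);
    }
}
```

A handler would call load() once at the start of a request and use that snapshot throughout, so a reload in the middle of the request cannot change its view of the rules.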

2. The Request Access (req-acc) Middleware

You would write an Axum middleware that intercepts the request, looks up the current endpoint in the arc-swap configuration, checks the JWT roles against permission.roles, and rejects the request if it fails.

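use axum::http::StatusCode;

// Called from your Axum middleware with the endpoint key (e.g. "/v1/accounts@get")
// and the roles extracted from the verified JWT
fn check_access(endpoint: &str, user_roles: &[String]) -> Result<(), StatusCode> {
    let current_config = CONFIG.load();
    if let Some(rule) = current_config.get(endpoint) {
        let required_roles: Vec<&str> = rule.permission.roles.split_whitespace().collect();
        if !user_roles.iter().any(|r| required_roles.contains(&r.as_str())) {
            return Err(StatusCode::FORBIDDEN); // req-acc fails
        }
    }
    // An endpoint without a rule is allowed here; return an error instead for fail-closed behavior
    Ok(())
}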
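The keys in rule.endpointRules combine a path and a method, as in "/v1/accounts@get". A minimal sketch of deriving that key from an incoming request follows. The helper name endpoint_key is hypothetical, and it assumes you pass the matched route template (not a raw path containing IDs) and that the method is stored in lowercase, as in the sample YAML.

```rust
/// Build the lookup key used in `rule.endpointRules`, e.g. "/v1/pets@get".
/// `path` should be the matched route template so that parameterized routes
/// map to a single rule entry.
fn endpoint_key(path: &str, method: &str) -> String {
    format!("{}@{}", path, method.to_ascii_lowercase())
}
```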

3. The Response Filtering (res-fil) Middleware

This is where Rust shines. Because Axum and Actix allow you to intercept the response body after the handler completes but before it goes to the user, you can apply your column filtering dynamically.

Using serde_json, you can strip out columns that the user isn't allowed to see based on the col mapping in your YAML.

use serde_json::Value;

fn apply_column_filter(
    mut response_json: Value,
    allowed_columns_json: &str,
) -> Result<Value, serde_json::Error> {
    // 1. Parse the allowed columns from the YAML string (e.g., '["accountNo","firstName"]')
    let allowed_cols: Vec<String> = serde_json::from_str(allowed_columns_json)?;

    // 2. Filter a single JSON object, or every object in a top-level array
    match response_json {
        // Retain only the keys that are in the allowed_cols list
        Value::Object(ref mut map) => map.retain(|key, _| allowed_cols.contains(key)),
        Value::Array(ref mut items) => {
            for item in items.iter_mut() {
                if let Value::Object(map) = item {
                    map.retain(|key, _| allowed_cols.contains(key));
                }
            }
        }
        _ => {} // Scalars pass through unchanged
    }

    Ok(response_json)
}
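Row filtering, the other half of res-fil, can be sketched the same way. The example below is an illustration only: it models rows as string maps instead of serde_json values so it has no dependencies, assumes that all rules for a role must hold (AND), and assumes a "!=" operator exists alongside the "=" shown in the sample YAML. RowRule, row_matches, and apply_row_filter are hypothetical names.

```rust
use std::collections::HashMap;

/// One row condition from the YAML (`colName`, `colValue`, `operator`).
struct RowRule {
    col_name: String,
    col_value: String,
    operator: String,
}

type Row = HashMap<String, String>;

/// Returns true when `row` satisfies a single rule.
fn row_matches(row: &Row, rule: &RowRule) -> bool {
    let actual = row.get(&rule.col_name).map(String::as_str);
    match rule.operator.as_str() {
        "=" => actual == Some(rule.col_value.as_str()),
        // Assumption: "!=" is supported; only "=" appears in the sample config.
        "!=" => actual != Some(rule.col_value.as_str()),
        // Unknown operators fail closed: the row is hidden.
        _ => false,
    }
}

/// Keep only the rows that satisfy every rule configured for the caller's role.
fn apply_row_filter(rows: Vec<Row>, rules: &[RowRule]) -> Vec<Row> {
    rows.into_iter()
        .filter(|row| rules.iter().all(|rule| row_matches(row, rule)))
        .collect()
}
```

With the teller rule from the sample config (accountType = C), a teller would receive only the rows whose accountType is "C".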

Summary

Keep your current system!

  1. The YAML approach is great. It acts as a perfect declarative contract between your API owners and your technical rules engine.
  2. Cedar is the wrong tool for Data Masking. Cedar is for strict Allow/Deny decisions, not mutating response bodies or filtering data.
  3. Rust will supercharge this. By using serde_yaml to parse your config, arc-swap for lock-free hot-reloading, and serde_json for column masking inside an Axum middleware, you will get a system that does exactly what your Java system does, but with significantly less memory overhead and higher throughput.

Agent Engine Pattern

The Agent Engine Pattern is the architectural standard for building industrial-grade, metadata-driven AI platforms within the Light-Fabric ecosystem.

In this model, the Rust Runtime acts as a high-performance Orchestrator, while the Application Logic resides in externalized metadata (JSON/YAML) and the Hindsight Memory database.

1. Why the Metadata-Driven Approach?

  • Separation of Concerns: Complex platform logic (security, retries, database connectivity, LLM integration) is implemented once in Rust. Business logic—defining agent personas, goals, and steps—is "programmed" via JSON or Database records.
  • Hot-Reloading: Using the arc-swap crate and YAML-based rule engines, agent personas, model parameters, and tool access can be updated in real-time without a server restart.
  • Elastic Scalability: Deploy a single, generic light-agent binary. At runtime, it specializes into a "Researcher," "Auditor," or "Support Specialist" based on the workflow_id or agent_id it retrieves from the registry.
  • High Performance: Rust's asynchronous tokio runtime allows a single engine instance to manage thousands of concurrent agentic sessions with minimal memory overhead.

2. The Core Architecture: Engine vs. Content

To function as a generic interpreter, the Light-Fabric Engine relies on four primary components:

A. The Tool & Skill Registry (The "Hands")

The engine maps string identifiers in the workflow JSON (e.g., "call": "get_customer_data") to executable code or remote MCP tools.

  • Implementation: Uses a ToolRegistry with trait objects (Box<dyn Tool>) or dynamic dispatch to MCP (Model Context Protocol) servers.
  • Logic: When the LLM requests a tool call, the engine verifies permissions via Fine-Grained Authorization, executes the tool, and feeds the result back into the context.
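A minimal sketch of the registry idea, assuming a synchronous Tool trait and string parameters for brevity (the real engine is async and would pass structured data). The Echo tool and the is_allowed callback, which stands in for the fine-grained authorization check, are hypothetical.

```rust
use std::collections::HashMap;

/// A tool the engine can invoke by name.
trait Tool {
    fn call(&self, params: &str) -> Result<String, String>;
}

/// Trivial example tool used only for illustration.
struct Echo;

impl Tool for Echo {
    fn call(&self, params: &str) -> Result<String, String> {
        Ok(format!("echo: {}", params))
    }
}

#[derive(Default)]
struct ToolRegistry {
    tools: HashMap<String, Box<dyn Tool>>,
}

impl ToolRegistry {
    fn register(&mut self, name: &str, tool: Box<dyn Tool>) {
        self.tools.insert(name.to_string(), tool);
    }

    /// Authorize first, then dispatch through the trait object.
    fn call(
        &self,
        name: &str,
        params: &str,
        is_allowed: impl Fn(&str) -> bool,
    ) -> Result<String, String> {
        if !is_allowed(name) {
            return Err(format!("tool '{}' denied by policy", name));
        }
        let tool = self
            .tools
            .get(name)
            .ok_or_else(|| format!("unknown tool '{}'", name))?;
        tool.call(params)
    }
}
```

Checking authorization before the lookup means a denied caller cannot learn whether a tool name exists.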

B. Hindsight State Manager (The "Memory")

Unlike simple session storage, the state manager persists every step of the agentic interaction into biomimetic memory banks.

  • Implementation: Every "turn" in the conversation is saved as a memory unit (a row in agent_memory_unit_t) in the Hindsight database.
  • Benefit: Provides fault tolerance (resuming from a crashed step) and "Recall" capabilities, allowing agents to remember past interactions across different sessions.

C. Prompt Templating (The "Mind")

System prompts and instructions are stored as templates rather than hardcoded strings.

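  • Implementation: Uses a template engine such as tera, which parses templates at runtime and therefore suits templates stored in the database. Compile-time engines such as rinja only fit templates that ship with the binary.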
  • Example: "You are a {{agent_role}}. Your current objective is to {{agent_goal}}."
  • Rust Logic: The engine merges runtime context (user input, memory recall, tool results) into the template before calling the LLM.
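To show what the merge step does, here is a dependency-free stand-in for the template engine. It is not Tera and supports only plain {{name}} substitution. The function name render is hypothetical, and leaving unknown placeholders untouched is a choice made for this sketch.

```rust
use std::collections::HashMap;

/// Replace each `{{name}}` placeholder with its value from `vars`.
/// Unknown placeholders are left untouched so they are easy to spot.
fn render(template: &str, vars: &HashMap<&str, &str>) -> String {
    let mut out = String::with_capacity(template.len());
    let mut rest = template;
    while let Some(start) = rest.find("{{") {
        // Copy the literal text before the placeholder.
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        match after.find("}}") {
            Some(end) => {
                let key = after[..end].trim();
                match vars.get(key) {
                    Some(value) => out.push_str(value),
                    None => out.push_str(&rest[start..start + 2 + end + 2]),
                }
                rest = &after[end + 2..];
            }
            None => {
                // Unterminated placeholder: copy the remainder verbatim.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

Rendering the example template with agent_role = "Researcher" and agent_goal = "summarize the report" yields "You are a Researcher. Your current objective is to summarize the report."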

D. Policy Engine (The "Shield")

Before any tool execution or data retrieval, the engine consults the Light-Rule middleware.

  • Logic: Ensures the agent has the authority to access specific data or execute specific functions, preventing "prompt injection" from leading to unauthorized actions.

3. Conceptual Implementation in Rust

The AgentEngine in Light-Fabric follows a non-blocking, async loop:

// Conceptual sketch: ToolRegistry, HindsightClient, RuleEngine, ModelProvider,
// Task, and render_prompt are platform items whose definitions are not shown here.
use std::sync::Arc;
use uuid::Uuid;

pub struct AgentEngine {
    registry: Arc<ToolRegistry>,
    memory: Arc<HindsightClient>,
    rules: Arc<RuleEngine>,
    llm_provider: Arc<ModelProvider>,
}

impl AgentEngine {
    pub async fn execute_step(&self, session_id: Uuid, task: Task) -> anyhow::Result<()> {
        // 1. Fetch current context from Hindsight Memory
        let mut context = self.memory.get_context(session_id).await?;

        // 2. Resolve Task Type (Agentic vs. Tool Call)
        match task {
            Task::LlmCall { agent_id, prompt_template } => {
                // Render prompt with Tera
                let prompt = self.render_prompt(prompt_template, &context)?;

                // Call LLM Provider
                let response = self.llm_provider.chat(prompt, &context).await?;

                // Retain turn in Hindsight
                self.memory.retain_turn(session_id, response).await?;
            },
            Task::ToolCall { tool_name, params } => {
                // 3. Enforce Fine-Grained Authorization
                if self.rules.authorize(session_id, &tool_name).await? {
                    let result = self.registry.call(&tool_name, params).await?;
                    context.add_result(tool_name, result);
                } else {
                    // Surface the denial instead of silently skipping the call
                    anyhow::bail!("tool call '{}' denied by policy", tool_name);
                }
            }
        }

        // 4. Update Session State
        self.memory.checkpoint(session_id, context).await
    }
}

4. Operational Challenges & Solutions

  1. Tool Versioning: As the platform evolves, tools may change. Light-Fabric handles this by versioning tool definitions in the Registry, ensuring old workflows remain compatible with the tools they were designed for.
  2. Safe Execution: For dynamic "scripts" defined in metadata, Light-Fabric utilizes WebAssembly (WASM) runtimes to provide a secure sandbox that starts faster and uses less memory than container-based isolation.
  3. Observability: Because the engine is generic, tracing is built into the light-runtime. Every step generates OpenTelemetry traces, allowing developers to visualize the "thought process" and execution path of any agent in real-time.
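The tool versioning point can be illustrated with a registry keyed by name and version. This is a sketch under assumptions: it supposes each workflow pins the tool version it was written against, and the VersionedTools type and its string "target" value (an endpoint or handler identifier) are hypothetical.

```rust
use std::collections::HashMap;

/// Registry keyed by (tool name, version) so that old workflows keep
/// resolving the definition they were written against.
#[derive(Default)]
struct VersionedTools {
    defs: HashMap<(String, String), String>, // value: endpoint or handler id
}

impl VersionedTools {
    fn register(&mut self, name: &str, version: &str, target: &str) {
        self.defs
            .insert((name.to_string(), version.to_string()), target.to_string());
    }

    /// Look up the exact version a workflow asked for; no implicit upgrade.
    fn resolve(&self, name: &str, version: &str) -> Option<&str> {
        self.defs
            .get(&(name.to_string(), version.to_string()))
            .map(String::as_str)
    }
}
```

Resolving by exact version keeps behavior predictable; whether to fall back to a newer compatible version is a policy decision this sketch leaves out.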

The Recommendation

Light-Fabric adopts this "Engine-first" philosophy to ensure the platform remains sustainable. By treating the Agentic Workflow as data and the Rust Runtime as the interpreter, we achieve the perfect balance of extreme performance and business flexibility.

Database Design

The Light-Fabric utilizes a robust PostgreSQL schema to manage the entire lifecycle of agentic workflows, skills, and the biomimetic Hindsight memory system. The schema is organized into four logical layers:

1. Workflow Engine

These tables manage the definition and execution of long-running agentic workflows.

wf_definition_t

Stores the Agentic Workflow DSL (YAML) that defines the high-level orchestration logic.

process_info_t & task_info_t

Manage the runtime state of workflow instances (processes) and individual steps (tasks). They include input_data, context_data, and error_info to provide a resilient "scratchpad" for intermediate variables.

worklist_t, worklist_column_t & task_asst_t

Manage task assignments and visibility for human-in-the-loop interactions.


2. Agentic Core (The "Brain & Skills")

These tables define the identity, expertise, and capabilities of individual agents.

agent_definition_t

Defines the agent's persona, model provider (OpenAI, Anthropic, etc.), and runtime parameters like temperature and max tokens.

skill_t

Stores the "Expertise" of an agent in Markdown format. Skills are hierarchical and versioned.

tool_t & tool_param_t

The "Hands" of the agent. Defines executable functions, including REST endpoints, MCP server calls, or WASM scripts.

agent_skill_t & skill_tool_t

Maps agents to skills and skills to tools, implementing the Progressive Disclosure pattern where agents only see the tools required for their current skill context.
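The Progressive Disclosure lookup can be sketched in memory. The maps below stand in for rows of agent_skill_t and skill_tool_t, using names rather than UUIDs for readability. The Catalog type and visible_tools method are hypothetical and not part of the schema.

```rust
use std::collections::{BTreeSet, HashMap};

/// In-memory stand-ins for rows of agent_skill_t and skill_tool_t.
struct Catalog {
    agent_skills: HashMap<String, Vec<String>>, // agent -> skills
    skill_tools: HashMap<String, Vec<String>>,  // skill -> tools
}

impl Catalog {
    /// Tools visible to `agent` while it works within `active_skill`.
    /// Returns an empty set if the agent does not hold that skill.
    fn visible_tools(&self, agent: &str, active_skill: &str) -> BTreeSet<String> {
        let holds_skill = self
            .agent_skills
            .get(agent)
            .map_or(false, |skills| skills.iter().any(|s| s == active_skill));
        if !holds_skill {
            return BTreeSet::new();
        }
        self.skill_tools
            .get(active_skill)
            .map(|tools| tools.iter().cloned().collect())
            .unwrap_or_default()
    }
}
```

Only the tools for the active skill are placed in the prompt, which keeps the tool list short and limits what an agent can invoke at any moment.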


3. Hindsight Memory System

A biomimetic memory architecture that transitions from flat logs to structured "atoms of thought."

agent_memory_bank_t

Profiles for memory banks, defining the "Personality and Disposition" (e.g., skepticism, empathy) of the memory layer.

agent_memory_unit_t

The individual "Atoms" of memory. Each unit contains content and a vector embedding (384-dim) for semantic retrieval.
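Semantic retrieval over these embeddings usually ranks units by vector similarity. In practice the database would do this with a vector index; the sketch below only illustrates the ranking itself in plain Rust, assuming cosine similarity is the metric. The function names are hypothetical.

```rust
/// Cosine similarity between two embeddings.
/// Returns None if the lengths differ or either vector has zero magnitude.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Index of the memory unit whose embedding is most similar to `query`.
fn nearest(query: &[f32], units: &[Vec<f32>]) -> Option<usize> {
    units
        .iter()
        .enumerate()
        .filter_map(|(i, unit)| cosine_similarity(query, unit).map(|score| (i, score)))
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(i, _)| i)
}
```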

A Knowledge Graph layer that resolves entities and causal/semantic relationships between memory units.


4. Session Management

agent_session_history_t

The "Source of Truth" for active conversations, linking specific sessions to their respective Hindsight memory banks.


DDL Specification

-- Workflow Definitions: Stores the Agentic Workflow DSL (YAML)
CREATE TABLE wf_definition_t (
    host_id             UUID NOT NULL,
    wf_def_id           UUID NOT NULL,
    namespace           VARCHAR(126) NOT NULL,
    name                VARCHAR(126) NOT NULL,
    version             VARCHAR(20) NOT NULL,
    definition          TEXT NOT NULL, -- The Agentic Workflow DSL in YAML
    aggregate_version    BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT TRUE,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, wf_def_id),
    UNIQUE(host_id, namespace, name, version)
);

CREATE TABLE worklist_t (
  host_id              UUID NOT NULL,
  assignee_id          VARCHAR(126) NOT NULL,
  category_id          VARCHAR(126) DEFAULT '(all)' NOT NULL,
  status_code          VARCHAR(10) DEFAULT 'Active' NOT NULL,
  app_id               VARCHAR(512) DEFAULT 'global' NOT NULL,
  aggregate_version    BIGINT DEFAULT 1 NOT NULL,
  active               BOOLEAN NOT NULL DEFAULT TRUE,
  update_user          VARCHAR (255) DEFAULT SESSION_USER NOT NULL,
  update_ts            TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
  PRIMARY KEY(host_id, assignee_id, category_id)
);

CREATE TABLE worklist_column_t (
  host_id               UUID NOT NULL,
  assignee_id           VARCHAR(126) NOT NULL,
  category_id           VARCHAR(126) DEFAULT '(all)' NOT NULL,
  sequence_id           INTEGER NOT NULL,
  column_id             VARCHAR(126) NOT NULL,
  aggregate_version     BIGINT DEFAULT 1 NOT NULL,
  active                BOOLEAN DEFAULT TRUE,
  update_ts             TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
  update_user           VARCHAR(126) DEFAULT SESSION_USER,
  PRIMARY KEY(host_id, assignee_id, category_id, sequence_id),
  FOREIGN KEY(host_id, assignee_id, category_id) REFERENCES worklist_t(host_id, assignee_id, category_id) ON DELETE CASCADE
);

CREATE TABLE process_info_t (
  host_id                    UUID NOT NULL,
  process_id                 UUID NOT NULL, -- generated uuid
  wf_def_id                  UUID NOT NULL, -- workflow definition id
  wf_instance_id             VARCHAR(126)       NOT NULL, -- workflow instance id
  app_id                     VARCHAR(512)       NOT NULL, -- application id
  process_type               VARCHAR(126)      NOT NULL,
  status_code                CHAR(1)            NOT NULL, -- process status code 'A', 'C'
  started_ts                 TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
  ex_trigger_ts              TIMESTAMP WITH TIME ZONE          NOT NULL,
  custom_status_code         VARCHAR(126),
  completed_ts               TIMESTAMP WITH TIME ZONE,
  result_code                VARCHAR(126),
  source_id                  VARCHAR(126),
  branch_code                VARCHAR(126),
  rr_code                    VARCHAR(126),
  party_id                   VARCHAR(126),
  party_name                 VARCHAR(126),
  counter_party_id           VARCHAR(126),
  counter_party_name         VARCHAR(126),
  txn_id                     VARCHAR(126),
  txn_name                   VARCHAR(126),
  product_id                 VARCHAR(126),
  product_name               VARCHAR(126),
  product_type               VARCHAR(126),
  group_name                 VARCHAR(126),
  subgroup_name              VARCHAR(126),
  event_start_ts             TIMESTAMP WITH TIME ZONE,
  event_end_ts               TIMESTAMP WITH TIME ZONE,
  event_other_ts             TIMESTAMP WITH TIME ZONE,
  event_other                VARCHAR(126),
  risk                       NUMERIC,
  risk_scale                 INTEGER,
  price                      NUMERIC,
  price_scale                INTEGER, -- Scale (number of digits to the right of the decimal) of the price column. NULL implies zero
  product_qy                 NUMERIC,
  currency_code              CHAR(3),
  ex_ref_id                  VARCHAR(126),
  ex_ref_code                VARCHAR(126),
  product_qy_scale           INTEGER,
  parent_process_id          VARCHAR(22),
  deadline_ts                TIMESTAMP WITH TIME ZONE,
  parent_group_id            NUMERIC,
  process_subtype_code       VARCHAR(126),
  owning_group_name          VARCHAR(126), -- Name of the group that owns the process
  input_data                 JSONB,        -- The initial data that triggered the workflow
  context_data               JSONB,        -- The runtime "scratchpad" for intermediate variables
  error_info                 TEXT,         -- Detailed error or stack trace if the process fails
  aggregate_version   BIGINT DEFAULT 1 NOT NULL,
  active              BOOLEAN DEFAULT TRUE,
  update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
  update_user         VARCHAR(126) DEFAULT SESSION_USER,
  PRIMARY KEY(host_id, process_id),
  FOREIGN KEY(host_id, wf_def_id) REFERENCES wf_definition_t(host_id, wf_def_id) ON DELETE CASCADE
);

CREATE TABLE task_info_t
(
    host_id             UUID NOT NULL,
    task_id             UUID NOT NULL,
    task_type           VARCHAR(126) NOT NULL,
    process_id          UUID NOT NULL,
    wf_instance_id      VARCHAR(126) NOT NULL,
    wf_task_id          VARCHAR(126) NOT NULL,
    status_code         CHAR(1)       NOT NULL, -- U, A, C
    started_ts          TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
    locked              CHAR(1)       NOT NULL,
    priority            INTEGER        NOT NULL,
    completed_ts        TIMESTAMP WITH TIME ZONE      NULL,
    completed_user      VARCHAR(126)     NULL,
    result_code         VARCHAR(126)     NULL,
    locking_user        VARCHAR(126)     NULL,
    locking_role        VARCHAR(126)     NULL,
    deadline_ts         TIMESTAMP WITH TIME ZONE      NULL,
    lock_group          VARCHAR(126)     NULL,
    task_input          JSONB,           -- Specific data passed to the task
    task_output         JSONB,           -- Result returned by the task action
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT TRUE,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, task_id),
    FOREIGN KEY (host_id, process_id) REFERENCES process_info_t(host_id, process_id) ON DELETE CASCADE
);

CREATE TABLE task_asst_t
(
    host_id             UUID NOT NULL,
    task_asst_id         UUID NOT NULL,
    task_id              UUID NOT NULL,
    assigned_ts          TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
    assignee_id          VARCHAR(126) NOT NULL,
    reason_code          VARCHAR(126) NOT NULL,
    unassigned_ts        TIMESTAMP WITH TIME ZONE      NULL,
    unassigned_reason    VARCHAR(126)     NULL,
    category_code        VARCHAR(126)     NULL,
    aggregate_version    BIGINT DEFAULT 1 NOT NULL,
    active               BOOLEAN DEFAULT TRUE,
    update_ts            TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user          VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, task_asst_id),
    FOREIGN KEY(host_id, task_id) REFERENCES task_info_t(host_id, task_id) ON DELETE CASCADE
);

CREATE TABLE audit_log_t
(
    host_id             UUID NOT NULL,
    audit_log_id        UUID NOT NULL,
    source_type_id      VARCHAR(126)      NULL,
    correlation_id      VARCHAR(126)      NULL,
    user_id             VARCHAR(126)     NULL,
    event_ts            TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
    success             CHAR(1)           NULL,
    message0            VARCHAR(126)     NULL,
    message1            VARCHAR(126)     NULL,
    message2            VARCHAR(126)     NULL,
    message3            VARCHAR(126)     NULL,
    message             VARCHAR(500)     NULL,
    user_comment        VARCHAR(500)     NULL,
    PRIMARY KEY(host_id, audit_log_id)
);

CREATE INDEX audit_log_idx1 ON audit_log_t (source_type_id, correlation_id, event_ts, user_id);

-- Agent Definitions: Stores the "Brain" configuration
CREATE TABLE agent_definition_t (
    host_id             UUID NOT NULL,
    agent_def_id        UUID NOT NULL,
    agent_name          VARCHAR(126) NOT NULL,
    model_provider      VARCHAR(64) NOT NULL,  -- 'openai', 'anthropic', etc.
    model_name          VARCHAR(126) NOT NULL, -- 'gpt-4o', 'claude-3-5-sonnet'
    api_key_ref         VARCHAR(126),          -- Reference to Secret Manager key
    temperature         NUMERIC(3,2) DEFAULT 0.7,
    max_tokens          INTEGER,               -- maximum number of tokens that can be used
    aggregate_version    BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT TRUE,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, agent_def_id),
    UNIQUE(host_id, agent_name)
);


-- Skills: Stores Instructions and Domain Knowledge (The "Expertise")
-- Note: Use entity_tag_t and entity_category_t with entity_type = 'skill' 
-- for flat tagging and hierarchical folder structure of skills.
CREATE TABLE skill_t (
    host_id             UUID NOT NULL,
    skill_id            UUID NOT NULL,
    parent_skill_id     UUID,                  -- Self-reference for Hierarchy
    name                VARCHAR(126) NOT NULL,
    description         VARCHAR(500),          -- High-level description for the initial LLM prompt
    content_markdown    TEXT NOT NULL,         -- The actual instructions/prompts

    description_embedding VECTOR(384),          -- For semantic lookup/discovery
    version             VARCHAR(20) DEFAULT '1.0.0',
    aggregate_version    BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, skill_id),
    FOREIGN KEY(host_id, parent_skill_id) REFERENCES skill_t(host_id, skill_id)
);

CREATE INDEX idx_skill_active ON skill_t(active);
CREATE INDEX idx_skill_name ON skill_t(name);

-- Tools: Stores Executable Functions (The "Hands")
CREATE TABLE tool_t (
    host_id             UUID NOT NULL,
    tool_id             UUID NOT NULL,
    name                VARCHAR(126) NOT NULL,
    description         TEXT NOT NULL,         -- Instructions for LLM on when/how to use this tool

    -- Implementation specifics
    implementation_type VARCHAR(50),           -- 'java', 'mcp_server', 'rest', 'python', 'javascript'
    implementation_class VARCHAR(500),         -- FQCN if 'java'
    mcp_server_name      VARCHAR(126),         -- MCP server name if 'mcp_server'
    api_endpoint        VARCHAR(1024),         -- URL if 'rest'
    api_method          VARCHAR(10),           -- HTTP Method if 'rest'
    endpoint_id         UUID,                  -- Reference to fine-grained auth endpoint
    script_content      TEXT,                  -- Source code if 'python'/'javascript'
    response_schema     JSONB,                 -- Strict output schema for tool results

    description_embedding VECTOR(384),          -- For semantic lookup/discovery
    version             VARCHAR(20) DEFAULT '1.0.0',
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, tool_id),
    FOREIGN KEY(host_id, endpoint_id) REFERENCES api_endpoint_t(host_id, endpoint_id) ON DELETE CASCADE
);

CREATE INDEX idx_tool_host_endpoint ON tool_t(host_id, endpoint_id);
CREATE INDEX idx_tool_active ON tool_t(active);
CREATE INDEX idx_tool_name ON tool_t(name);

-- Tool Parameters: Defines the arguments for each tool
CREATE TABLE tool_param_t (
    host_id             UUID NOT NULL,
    param_id            UUID NOT NULL,
    tool_id             UUID NOT NULL,
    name                VARCHAR(255) NOT NULL,
    param_type          VARCHAR(50) NOT NULL,      -- 'string', 'number', 'boolean', 'object', 'array'
    required            BOOLEAN DEFAULT true,
    default_value       JSONB,
    description         TEXT,                      -- Helps LLM understand what value to extract
    validation_schema   JSONB,                     -- JSON Schema for complex validation
    order_index         INTEGER DEFAULT 0,
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, param_id),
    FOREIGN KEY(host_id, tool_id) REFERENCES tool_t(host_id, tool_id) ON DELETE CASCADE
);

-- Skill Dependencies: Manages hierarchies where one skill requires another
CREATE TABLE skill_dependency_t (
    host_id             UUID NOT NULL,
    skill_id            UUID NOT NULL,
    depends_on_skill_id UUID NOT NULL,
    required            BOOLEAN DEFAULT true,
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY (host_id, skill_id, depends_on_skill_id),
    FOREIGN KEY(host_id, skill_id) REFERENCES skill_t(host_id, skill_id),
    FOREIGN KEY(host_id, depends_on_skill_id) REFERENCES skill_t(host_id, skill_id)
);

-- Agent-Skill Mapping: Links Agents to their Skills
CREATE TABLE agent_skill_t (
    host_id             UUID NOT NULL,
    agent_def_id        UUID NOT NULL,
    skill_id            UUID NOT NULL,

    config              JSONB DEFAULT '{}',
    priority            INTEGER DEFAULT 0,
    sequence_id         INTEGER DEFAULT 0,     -- Order in which skills are concatenated

    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, agent_def_id, skill_id),
    FOREIGN KEY(host_id, agent_def_id) REFERENCES agent_definition_t(host_id, agent_def_id) ON DELETE CASCADE,
    FOREIGN KEY(host_id, skill_id) REFERENCES skill_t(host_id, skill_id) ON DELETE CASCADE
);
CREATE INDEX idx_agent_skill_agent ON agent_skill_t(agent_def_id);

-- Skill-Tool Mapping: Implements Progressive Disclosure
CREATE TABLE skill_tool_t (
    host_id             UUID NOT NULL,
    skill_id            UUID NOT NULL,
    tool_id             UUID NOT NULL,

    config              JSONB DEFAULT '{}',
    access_level        VARCHAR(20) DEFAULT 'read', -- e.g., 'read', 'write', 'execute'

    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, skill_id, tool_id),
    FOREIGN KEY(host_id, skill_id) REFERENCES skill_t(host_id, skill_id) ON DELETE CASCADE,
    FOREIGN KEY(host_id, tool_id) REFERENCES tool_t(host_id, tool_id) ON DELETE CASCADE
);
CREATE INDEX idx_skill_tool_skill ON skill_tool_t(skill_id);

-- Hindsight Advanced Memory System
-- Transitioned from flat logs to biomimetic memory banks (World, Experiences, Mental Models)

-- Memory bank profiles (Personality & Disposition)
CREATE TABLE agent_memory_bank_t (
    host_id             UUID NOT NULL,
    bank_id             UUID NOT NULL,
    agent_def_id        UUID,                  -- NULL if bank is shared across agents
    user_id             UUID,                  -- NULL if bank is global for the host/agent
    bank_name           VARCHAR(126) NOT NULL,
    disposition         JSONB NOT NULL DEFAULT '{"skepticism": 3, "literalism": 3, "empathy": 3}'::jsonb,
    background          TEXT,
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, bank_id),
    FOREIGN KEY(host_id) REFERENCES host_t(host_id) ON DELETE CASCADE,
    FOREIGN KEY(host_id, agent_def_id) REFERENCES agent_definition_t(host_id, agent_def_id) ON DELETE CASCADE,
    FOREIGN KEY(user_id) REFERENCES user_t(user_id) ON DELETE CASCADE
);

-- Source documents for memory units
CREATE TABLE agent_memory_doc_t (
    host_id             UUID NOT NULL,
    doc_id              UUID NOT NULL,
    bank_id             UUID NOT NULL,
    original_text       TEXT,
    content_hash        TEXT,
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY (host_id, bank_id, doc_id),
    FOREIGN KEY (host_id, bank_id) REFERENCES agent_memory_bank_t(host_id, bank_id) ON DELETE CASCADE
);

-- Individual sentence-level memories (The "Atoms" of thought)
CREATE TABLE agent_memory_unit_t (
    host_id             UUID NOT NULL,
    unit_id             UUID NOT NULL,
    bank_id             UUID NOT NULL,
    doc_id              UUID,
    content             TEXT NOT NULL,
    embedding           vector(384),
    context             TEXT,
    event_date          TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now(),
    occurred_start      TIMESTAMP WITH TIME ZONE,
    occurred_end        TIMESTAMP WITH TIME ZONE,
    mentioned_at        TIMESTAMP WITH TIME ZONE,
    fact_type           VARCHAR(32) NOT NULL DEFAULT 'world' CHECK (fact_type IN ('world', 'experience', 'opinion', 'observation', 'mental_model')),
    metadata            JSONB DEFAULT '{}'::jsonb,
    proof_count         INT DEFAULT 1,
    source_memory_ids   UUID[] DEFAULT ARRAY[]::UUID[],
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, bank_id, unit_id),
    FOREIGN KEY(host_id, bank_id) REFERENCES agent_memory_bank_t(host_id, bank_id) ON DELETE CASCADE,
    FOREIGN KEY(host_id, bank_id, doc_id) REFERENCES agent_memory_doc_t(host_id, bank_id, doc_id) ON DELETE CASCADE
);

CREATE INDEX idx_mem_unit_bank ON agent_memory_unit_t(bank_id);
CREATE INDEX idx_mem_unit_embedding ON agent_memory_unit_t USING hnsw (embedding vector_cosine_ops);

-- Resolved entities (Knowledge Graph Nodes)
CREATE TABLE agent_memory_entity_t (
    host_id             UUID NOT NULL,
    entity_id           UUID NOT NULL,
    bank_id             UUID NOT NULL,
    user_id             UUID,                  -- Link to user_t if this entity is a platform user
    canonical_name      TEXT NOT NULL,
    mention_count       INT DEFAULT 1,
    metadata            JSONB DEFAULT '{}'::jsonb,
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY (host_id, bank_id, entity_id),
    FOREIGN KEY (host_id, bank_id) REFERENCES agent_memory_bank_t(host_id, bank_id) ON DELETE CASCADE,
    FOREIGN KEY (user_id) REFERENCES user_t(user_id) ON DELETE CASCADE
);

-- Association between memory units and entities
CREATE TABLE agent_memory_unit_entity_t (
    host_id             UUID NOT NULL,
    bank_id             UUID NOT NULL,
    unit_id             UUID NOT NULL,
    entity_id           UUID NOT NULL,
    PRIMARY KEY (host_id, bank_id, unit_id, entity_id),
    FOREIGN KEY (host_id, bank_id, unit_id) REFERENCES agent_memory_unit_t(host_id, bank_id, unit_id) ON DELETE CASCADE,
    FOREIGN KEY (host_id, bank_id, entity_id) REFERENCES agent_memory_entity_t(host_id, bank_id, entity_id) ON DELETE CASCADE
);

-- Cache of entity co-occurrences (Concept Relationship Graph)
CREATE TABLE agent_memory_entity_cooccur_t (
    host_id             UUID NOT NULL,
    bank_id             UUID NOT NULL,
    entity_id_1         UUID NOT NULL,
    entity_id_2         UUID NOT NULL,
    cooccur_count       INT DEFAULT 1,
    last_cooccurred     TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY (host_id, bank_id, entity_id_1, entity_id_2),
    CONSTRAINT entity_cooccur_order_check CHECK (entity_id_1 < entity_id_2),
    FOREIGN KEY (host_id, bank_id, entity_id_1) REFERENCES agent_memory_entity_t(host_id, bank_id, entity_id) ON DELETE CASCADE,
    FOREIGN KEY (host_id, bank_id, entity_id_2) REFERENCES agent_memory_entity_t(host_id, bank_id, entity_id) ON DELETE CASCADE
);

CREATE INDEX idx_mem_cooccur_e1 ON agent_memory_entity_cooccur_t(host_id, entity_id_1);
CREATE INDEX idx_mem_cooccur_e2 ON agent_memory_entity_cooccur_t(host_id, entity_id_2);

-- Links between memory units (Semantic & Causal relationships)
CREATE TABLE agent_memory_link_t (
    host_id             UUID NOT NULL,
    bank_id             UUID NOT NULL,
    from_unit_id        UUID NOT NULL,
    to_unit_id          UUID NOT NULL,
    link_type           VARCHAR(32) NOT NULL,
    weight              FLOAT NOT NULL DEFAULT 1.0,
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY (host_id, bank_id, from_unit_id, to_unit_id, link_type),
    CONSTRAINT memory_links_type_check CHECK (link_type IN ('temporal', 'semantic', 'entity', 'causes', 'caused_by', 'enables', 'prevents')),
    FOREIGN KEY (host_id, bank_id, from_unit_id) REFERENCES agent_memory_unit_t(host_id, bank_id, unit_id) ON DELETE CASCADE,
    FOREIGN KEY (host_id, bank_id, to_unit_id) REFERENCES agent_memory_unit_t(host_id, bank_id, unit_id) ON DELETE CASCADE
);

-- Directives (Hard rules that override probabilistic learning)
CREATE TABLE agent_memory_directive_t (
    host_id             UUID NOT NULL,
    directive_id        UUID NOT NULL,
    bank_id             UUID NOT NULL,
    name                VARCHAR(256) NOT NULL,
    content             TEXT NOT NULL,
    priority            INT NOT NULL DEFAULT 0,
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, bank_id, directive_id),
    FOREIGN KEY(host_id, bank_id) REFERENCES agent_memory_bank_t(host_id, bank_id) ON DELETE CASCADE
);

-- Reflections (Synthesized knowledge and high-level observations)
CREATE TABLE agent_memory_reflection_t (
    host_id             UUID NOT NULL,
    reflection_id       UUID NOT NULL,
    bank_id             UUID NOT NULL,
    content             TEXT NOT NULL,
    embedding           vector(384),
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, bank_id, reflection_id),
    FOREIGN KEY(host_id, bank_id) REFERENCES agent_memory_bank_t(host_id, bank_id) ON DELETE CASCADE
);

CREATE INDEX idx_mem_reflection_embedding ON agent_memory_reflection_t USING hnsw (embedding vector_cosine_ops);

-- Raw Session History (The source of Truth for active conversations)
CREATE TABLE agent_session_history_t (
    host_id             UUID NOT NULL,
    session_id          UUID NOT NULL,
    bank_id             UUID NOT NULL,         -- Links the session to a Hindsight bank
    messages            JSONB NOT NULL DEFAULT '[]'::jsonb,
    metadata            JSONB DEFAULT '{}'::jsonb,
    aggregate_version   BIGINT DEFAULT 1 NOT NULL,
    active              BOOLEAN DEFAULT true,
    update_ts           TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    update_user         VARCHAR(126) DEFAULT SESSION_USER,
    PRIMARY KEY(host_id, bank_id, session_id),
    FOREIGN KEY(host_id, bank_id) REFERENCES agent_memory_bank_t(host_id, bank_id) ON DELETE CASCADE
);

CREATE INDEX idx_session_bank ON agent_session_history_t(host_id, bank_id);


Light-Deployer Design

light-deployer is the cluster-local Kubernetes deployment executor in Light Fabric.

This document focuses only on the deployer service that lives in apps/light-deployer. The broader Light Portal deployment workflow, approval flow, deployment history model, controller routing, and portal UI are covered outside this repository.

Purpose

light-deployer receives a deployment command, fetches Kubernetes templates, renders them with deployment values, validates the resulting resources, applies or deletes resources in the target Kubernetes cluster, and returns safe status details.

It is intentionally narrow. It does not decide whether a user is allowed to deploy an instance, does not own portal deployment history, and does not create tenant business workflows. Those decisions belong to Light Portal, Light Controller, and the workflow engine.

Service Boundary

light-deployer owns:

  • local deployment policy enforcement
  • template repository fetch
  • YAML template rendering
  • manifest parsing and resource summary generation
  • Kubernetes dry-run, apply, delete, status, and pruning
  • safe event and error reporting
  • direct local/MicroK8s deployment endpoints

light-deployer does not own:

  • tenant authorization
  • instance metadata
  • deployment approval
  • deployment history persistence
  • config snapshot creation
  • long-running human workflow decisions

The deployer should reject commands outside its local policy even if an upstream service sends them.

Runtime Model

The service follows the same runtime pattern as light-agent.

main.rs builds the domain service and starts it through:

LightRuntimeBuilder::new(AxumTransport::new(app))

The HTTP listener is owned by light-runtime and light-axum, not by service-specific socket code. Bind address, HTTP/HTTPS ports, service identity, and registry settings live in runtime config files.

Default config files:

  • config/server.yml
  • config/deployer.yml
  • config/portal-registry.yml

Local cargo run resolves config from apps/light-deployer/config when run from the workspace root. The container image runs from /app and uses /app/config.

Public Endpoints

Phase 1 exposes a direct HTTP surface for local and MicroK8s testing:

GET  /health
GET  /ready
POST /mcp
GET  /mcp/tools
GET  /mcp/tools/list
GET  /mcp/tools/{tool}
POST /deployments
POST /mcp/tools/{tool}
GET  /events?request_id=...

POST /mcp is the MCP JSON-RPC 2.0 endpoint. It supports tools/list, tools/call, and a minimal initialize response. This is the endpoint that MCP clients, Light Portal, and AI agents should use.

/deployments accepts the canonical deployment request directly. /mcp/tools/{tool} maps tool names onto the same internal service functions as a REST-style local debugging convenience. The convenience tool-list endpoints return metadata with name, description, inputSchema, endpoint, and method, but they are not the MCP protocol endpoint.

Supported tool names:

  • deployment.render
  • deployment.dryRun
  • deployment.diff
  • deployment.apply
  • deployment.delete
  • deployment.status
  • deployment.rollback

The direct HTTP mode is useful for development and managed environments. The same internal command handling should later be reused by controller-mediated WebSocket/MCP routing.

Request Model

A deployment request is explicit and auditable.

{
  "requestId": "01964b05-0000-7000-8000-000000000001",
  "hostId": "01964b05-552a-7c4b-9184-6857e7f3dc5f",
  "instanceId": "petstore-dev",
  "environment": "dev",
  "clusterId": "microk8s-local",
  "namespace": "petstore-dev",
  "action": "deploy",
  "values": {
    "name": "petstore",
    "image": {
      "repository": "networknt/openapi-petstore",
      "tag": "latest"
    }
  },
  "template": {
    "repoUrl": "https://github.com/networknt/openapi-petstore.git",
    "ref": "master",
    "path": "k8s"
  },
  "options": {
    "dryRun": false,
    "waitForRollout": true,
    "timeoutSeconds": 300,
    "pruneOverride": false
  }
}

The current implementation supports inline values. The request model also contains fields for future values references and immutable snapshot metadata so it can align with the full portal deployment workflow.

When invoking a specific /mcp/tools/{tool} endpoint, callers do not need to send action. The deployer derives the action from the tool name. The generic /deployments endpoint still expects an explicit action in the request body.

For the MCP endpoint, callers use JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": "tools-list-1",
  "method": "tools/list",
  "params": {}
}

Tool invocation uses tools/call:

{
  "jsonrpc": "2.0",
  "id": "render-1",
  "method": "tools/call",
  "params": {
    "name": "deployment.render",
    "arguments": {
      "hostId": "local-host",
      "instanceId": "petstore-dev",
      "environment": "dev",
      "clusterId": "local",
      "namespace": "light-deployer",
      "values": {},
      "template": {
        "repoUrl": "local",
        "ref": "main",
        "path": "k8s"
      }
    }
  }
}

tools/call derives the deployment action from params.name; callers should not provide an action field in arguments.
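
The derivation of an action from a tool name can be sketched as a plain lookup. This is a minimal illustration, not the deployer's actual code: the Action enum and action_for_tool function are hypothetical names, and the pairing of deployment.apply with deploy and deployment.delete with undeploy is an assumption based on the tool and action lists in this document.

```rust
/// Deployment actions named in this document.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    Render,
    DryRun,
    Diff,
    Deploy,
    Undeploy,
    Status,
    Rollback,
}

/// Map a tool name such as "deployment.apply" to an action.
/// Assumption: apply maps to Deploy and delete maps to Undeploy.
/// Names without the "deployment." prefix, or with an unknown suffix, yield None.
pub fn action_for_tool(tool: &str) -> Option<Action> {
    match tool.strip_prefix("deployment.")? {
        "render" => Some(Action::Render),
        "dryRun" => Some(Action::DryRun),
        "diff" => Some(Action::Diff),
        "apply" => Some(Action::Deploy),
        "delete" => Some(Action::Undeploy),
        "status" => Some(Action::Status),
        "rollback" => Some(Action::Rollback),
        _ => None,
    }
}
```

With a lookup like this, both /mcp/tools/{tool} and tools/call can share one code path and reject unknown tool names before any template is fetched.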

Actions

render : Fetch templates, render manifests, set the target namespace, add management labels, and return resource summaries plus a manifest hash.

dryRun : Render manifests and validate them against Kubernetes using server-side dry-run.

diff : Render manifests, fetch current managed resources, calculate additions, modifications, and pruned resources, and return a redacted diff summary.

deploy : Accept the request, run the deployment in the background, apply manifests, prune removed managed resources, and stream events.

undeploy : Delete resources associated with the deployment.

status : Return current managed resource status.

rollback : Reserved for redeploying a previous immutable portal snapshot. Native Kubernetes rollout undo is not the target rollback model because it does not restore ConfigMaps, Secrets, or values snapshots.

Template Fetching

Templates are loaded through the TemplateSource trait.

The current source supports two modes:

  • local template root through LIGHT_DEPLOYER_TEMPLATE_BASE_DIR
  • remote HTTPS Git clone through gix

For remote repositories, the deployment request provides:

{
  "template": {
    "repoUrl": "https://github.com/networknt/openapi-petstore.git",
    "ref": "master",
    "path": "k8s"
  }
}

Private HTTPS Git access is controlled by environment variables:

  • LIGHT_DEPLOYER_GIT_TOKEN: token or app password
  • LIGHT_DEPLOYER_GIT_USERNAME: optional username override

Defaults:

  • GitHub uses x-access-token
  • Bitbucket Cloud uses x-token-auth
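
A rough sketch of how the username could be chosen, assuming the override wins when it is set and the defaults are keyed on the repository host. The function name git_username and the host names github.com and bitbucket.org are assumptions made for illustration; the document only states which username each provider uses.

```rust
/// Pick the HTTPS Git username for a repository URL.
/// `username_override` stands in for LIGHT_DEPLOYER_GIT_USERNAME.
/// Returns None when no override is set and the host has no known default.
pub fn git_username(repo_url: &str, username_override: Option<&str>) -> Option<String> {
    // An explicit, non-empty override always wins.
    if let Some(user) = username_override.filter(|u| !u.is_empty()) {
        return Some(user.to_string());
    }
    // Only HTTPS URLs are handled; SSH is deferred.
    let rest = repo_url.strip_prefix("https://")?;
    let authority = rest.split('/').next()?;
    let host = authority.rsplit('@').next()?; // drop any userinfo
    let host = host.split(':').next()?; // drop any port
    match host.to_ascii_lowercase().as_str() {
        "github.com" => Some("x-access-token".to_string()), // assumed host name
        "bitbucket.org" => Some("x-token-auth".to_string()), // assumed host name
        _ => None,
    }
}
```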

SSH authentication is intentionally deferred because it requires private key handling and strict known_hosts validation.

Template Format

The built-in renderer uses simple placeholders:

image: ${image.repository}:${image.tag:latest}

Supported behavior:

  • nested paths such as image.repository
  • default values after :
  • render failure when a required value is missing
  • placeholder replacement only inside YAML string scalar values

The renderer parses YAML into serde_yaml::Value, traverses the AST, replaces placeholders, and serializes or applies structured YAML values afterward. This avoids the most common raw string replacement bugs around quoting, indentation, certificates, and multi-line values.

Because placeholders currently produce strings, templates should avoid placeholders in numeric-only Kubernetes fields unless Kubernetes accepts a string value there. For example, containerPort should be fixed or rendered by a future typed placeholder extension.
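
The placeholder rules above can be illustrated with a small standalone sketch. It is not the built-in renderer: the real one walks serde_yaml::Value, while this version uses a hypothetical Value enum and a render_scalar function so that it compiles with the standard library alone. It handles one string scalar at a time, which matches the rule that replacement only happens inside string values.

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for a values tree (hypothetical; the real renderer uses YAML values).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Str(String),
    Map(BTreeMap<String, Value>),
}

/// Resolve a dotted path such as "image.repository" to a string leaf.
fn lookup<'a>(values: &'a Value, path: &str) -> Option<&'a str> {
    let mut current = values;
    for key in path.split('.') {
        match current {
            Value::Map(map) => current = map.get(key)?,
            Value::Str(_) => return None,
        }
    }
    match current {
        Value::Str(s) => Some(s.as_str()),
        Value::Map(_) => None,
    }
}

/// Replace every `${path}` or `${path:default}` in one string scalar.
/// Fails when a value is missing and the placeholder has no default.
pub fn render_scalar(input: &str, values: &Value) -> Result<String, String> {
    let mut out = String::new();
    let mut rest = input;
    while let Some(start) = rest.find("${") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        let end = after
            .find('}')
            .ok_or_else(|| format!("unterminated placeholder in {:?}", input))?;
        let body = &after[..end];
        // Everything after the first ':' is the default value.
        let (path, default) = match body.split_once(':') {
            Some((p, d)) => (p, Some(d)),
            None => (body, None),
        };
        match lookup(values, path).or(default) {
            Some(v) => out.push_str(v),
            None => return Err(format!("missing required value: {}", path)),
        }
        rest = &after[end + 1..];
    }
    out.push_str(rest);
    Ok(out)
}
```

Given values containing only image.repository, rendering the template line shown earlier falls back to the default tag, while a placeholder such as ${name} with no value and no default returns an error.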

Resource Metadata

After rendering, the deployer ensures every resource has the target namespace and adds management labels:

  • app.kubernetes.io/managed-by=light-deployer
  • lightapi.net/host-id
  • lightapi.net/instance-id
  • lightapi.net/request-id

These labels are used for status lookup and pruning.
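
As an illustration, the labels and a matching lookup selector might be built as follows. The function names are hypothetical, and combining managed-by with instance-id in the selector is an assumption; the selector string itself uses standard Kubernetes equality-based label selector syntax.

```rust
use std::collections::BTreeMap;

pub const MANAGED_BY: &str = "light-deployer";

/// Add the management labels to a resource's existing label map.
/// Existing labels with the same keys are overwritten.
pub fn add_management_labels(
    labels: &mut BTreeMap<String, String>,
    host_id: &str,
    instance_id: &str,
    request_id: &str,
) {
    labels.insert("app.kubernetes.io/managed-by".to_string(), MANAGED_BY.to_string());
    labels.insert("lightapi.net/host-id".to_string(), host_id.to_string());
    labels.insert("lightapi.net/instance-id".to_string(), instance_id.to_string());
    labels.insert("lightapi.net/request-id".to_string(), request_id.to_string());
}

/// Label selector for finding the resources of one instance (assumed shape).
pub fn instance_selector(instance_id: &str) -> String {
    format!(
        "app.kubernetes.io/managed-by={},lightapi.net/instance-id={}",
        MANAGED_BY, instance_id
    )
}
```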

Kubernetes Execution

Kubernetes execution is behind the KubeExecutor trait.

Current implementations:

  • KubeRsExecutor: real Kubernetes API execution through kube-rs
  • NoopKubeExecutor: local render/test mode

Execution mode:

  • LIGHT_DEPLOYER_KUBE_MODE=real: force real Kubernetes mode
  • LIGHT_DEPLOYER_KUBE_MODE=noop: force no-op mode
  • default: real mode when KUBERNETES_SERVICE_HOST is present, otherwise no-op
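
The selection rule can be sketched as a pure function plus a thin environment wrapper. KubeMode, resolve_kube_mode, and kube_mode_from_env are hypothetical names, and rejecting unrecognized values is an assumption; the document does not say how invalid values are handled.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KubeMode {
    Real,
    Noop,
}

/// Decide the execution mode from the two inputs described above.
/// Assumption: any value other than "real" or "noop" is an error.
pub fn resolve_kube_mode(
    kube_mode: Option<&str>,
    service_host: Option<&str>,
) -> Result<KubeMode, String> {
    match kube_mode {
        Some("real") => Ok(KubeMode::Real),
        Some("noop") => Ok(KubeMode::Noop),
        Some(other) => Err(format!("unsupported LIGHT_DEPLOYER_KUBE_MODE: {}", other)),
        // No explicit mode: real only when running inside a cluster.
        None if service_host.is_some() => Ok(KubeMode::Real),
        None => Ok(KubeMode::Noop),
    }
}

/// Read both variables from the process environment.
pub fn kube_mode_from_env() -> Result<KubeMode, String> {
    let mode = std::env::var("LIGHT_DEPLOYER_KUBE_MODE").ok();
    let host = std::env::var("KUBERNETES_SERVICE_HOST").ok();
    resolve_kube_mode(mode.as_deref(), host.as_deref())
}
```

Keeping the decision in a pure function makes it testable without mutating process environment variables.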

The production path uses kube-rs, not kubectl.

Kubernetes operations should use:

  • in-cluster ServiceAccount auth when running as a pod
  • server-side dry-run for validation
  • server-side apply with field manager light-deployer
  • structured status and error handling

Pruning

The deployer is declarative. If a previously managed resource is no longer rendered from the template, it should be considered for pruning.

Pruning is calculated by comparing:

  • current resources in the namespace with lightapi.net/instance-id
  • resources rendered from the new template

The policy layer enforces blast-radius protection:

  • maximum delete percentage
  • sensitive kinds requiring override
  • explicit pruneOverride in deployment options

This prevents stale resources while still protecting against accidental large-scale deletion.
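
A minimal sketch of the prune calculation and the blast-radius checks, under stated assumptions: ResourceKey, PrunePolicy, and plan_prune are hypothetical names, the percentage is measured against the number of currently managed resources, and pruneOverride is assumed to bypass both the percentage limit and the sensitive-kind check.

```rust
use std::collections::BTreeSet;

#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct ResourceKey {
    pub kind: String,
    pub name: String,
}

pub struct PrunePolicy {
    pub max_delete_percent: u32,
    pub sensitive_kinds: Vec<String>,
}

/// Return the resources to prune, or an error when policy blocks the prune.
pub fn plan_prune(
    current: &BTreeSet<ResourceKey>,
    rendered: &BTreeSet<ResourceKey>,
    policy: &PrunePolicy,
    prune_override: bool,
) -> Result<Vec<ResourceKey>, String> {
    // Managed resources that the new template no longer renders.
    let to_prune: Vec<ResourceKey> = current.difference(rendered).cloned().collect();
    if to_prune.is_empty() || prune_override {
        return Ok(to_prune);
    }
    if let Some(key) = to_prune.iter().find(|k| policy.sensitive_kinds.contains(&k.kind)) {
        return Err(format!("pruning {} {} requires pruneOverride", key.kind, key.name));
    }
    // Integer form of: to_prune / current > max_delete_percent / 100.
    let limit = (policy.max_delete_percent as usize) * current.len();
    if to_prune.len() * 100 > limit {
        return Err(format!(
            "prune would delete {} of {} managed resources, above the {}% limit",
            to_prune.len(),
            current.len(),
            policy.max_delete_percent
        ));
    }
    Ok(to_prune)
}
```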

Policy

The local deployer.yml policy constrains what a deployer is allowed to do.

Policy dimensions:

  • allowed namespaces
  • allowed repository hosts
  • allowed repository URL prefixes
  • allowed image registries
  • allowed actions
  • allowed Kubernetes kinds
  • blocked Kubernetes kinds
  • prune settings
  • development insecure mode

Version 1 allows application-level resource kinds by default:

  • Deployment
  • Service
  • Ingress
  • ConfigMap
  • Secret

Cluster-scoped and control-plane resources are blocked by default:

  • Namespace
  • ClusterRole
  • ClusterRoleBinding
  • CustomResourceDefinition
  • admission webhooks
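
The kind rules could be checked along these lines. KindPolicy and its methods are hypothetical names. Two points are assumptions: a blocked kind wins over an allowed one, and a kind that appears in neither list is rejected. The two webhook kinds are the standard Kubernetes resource kinds behind "admission webhooks".

```rust
pub struct KindPolicy {
    pub allowed: Vec<String>,
    pub blocked: Vec<String>,
}

fn to_vec(kinds: &[&str]) -> Vec<String> {
    kinds.iter().map(|k| k.to_string()).collect()
}

impl KindPolicy {
    /// Defaults mirroring the Version 1 lists above.
    pub fn v1_defaults() -> Self {
        KindPolicy {
            allowed: to_vec(&["Deployment", "Service", "Ingress", "ConfigMap", "Secret"]),
            blocked: to_vec(&[
                "Namespace",
                "ClusterRole",
                "ClusterRoleBinding",
                "CustomResourceDefinition",
                "ValidatingWebhookConfiguration",
                "MutatingWebhookConfiguration",
            ]),
        }
    }

    /// Blocked kinds are rejected first; anything not explicitly allowed is also rejected.
    pub fn check(&self, kind: &str) -> Result<(), String> {
        if self.blocked.iter().any(|k| k.as_str() == kind) {
            Err(format!("kind {} is blocked by policy", kind))
        } else if self.allowed.iter().any(|k| k.as_str() == kind) {
            Ok(())
        } else {
            Err(format!("kind {} is not in the allowed list", kind))
        }
    }
}
```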

Security

The deployer can mutate a Kubernetes cluster, so its default posture must be conservative.

Required practices:

  • run in Kubernetes with a dedicated ServiceAccount
  • prefer namespace-scoped Role and RoleBinding
  • restrict allowed namespaces and resource kinds
  • restrict template repository hosts or prefixes in production
  • restrict image registries in production
  • never log raw rendered Secret manifests
  • never log raw Kubernetes patch/apply payloads containing Secret data
  • return redacted summaries and diffs

Secret values in rendered manifests are redacted before being included in responses or diffs. Kubernetes Secret values are base64 encoded, not encrypted, so they must be treated as plaintext for logging purposes.
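
One way to picture the redaction step is a function that keeps Secret keys but replaces every value before the data reaches a response, diff, or log line. The names REDACTED and redact_secret_data and the marker text are hypothetical.

```rust
use std::collections::BTreeMap;

/// Marker written in place of every Secret value (assumed text).
pub const REDACTED: &str = "<redacted>";

/// Return a copy of a resource's data map that is safe to report.
/// Keys are kept so callers can still see which entries exist.
pub fn redact_secret_data(
    kind: &str,
    data: &BTreeMap<String, String>,
) -> BTreeMap<String, String> {
    if kind != "Secret" {
        return data.clone();
    }
    data.keys()
        .map(|k| (k.clone(), REDACTED.to_string()))
        .collect()
}
```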

Response Model

Responses include enough detail for callers to understand what happened without exposing secrets.

Important fields:

  • requestId
  • action
  • status
  • deployerId
  • clusterId
  • namespace
  • manifestHash
  • templateCommitSha
  • resources
  • diff
  • events
  • error

Resource summaries contain kind, namespace, name, apiVersion, and action. Full rendered manifests should not be returned or persisted by default.

Event Model

Long-running operations return quickly and continue in the background.

Clients can subscribe to:

GET /events?request_id=...

Events contain:

  • request ID
  • timestamp
  • status
  • message
  • optional resource identity

The event stream is currently served directly as Server-Sent Events (SSE). Controller-mediated mode can forward the same event shape later.
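
For orientation, an SSE frame is a "data:" line followed by a blank line. The sketch below formats one event that way. DeployEvent, its field names, and the JSON key names are assumptions, the optional resource identity is omitted, and the JSON is hand-encoded only to keep the example free of dependencies.

```rust
pub struct DeployEvent {
    pub request_id: String,
    pub timestamp: String, // timestamp text produced elsewhere
    pub status: String,
    pub message: String,
}

/// Encode a string as a JSON string literal.
fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Format one event as a single-line SSE data frame.
/// Newlines in the message are escaped, so the payload never spans lines.
pub fn to_sse_frame(event: &DeployEvent) -> String {
    format!(
        "data: {{\"requestId\":{},\"timestamp\":{},\"status\":{},\"message\":{}}}\n\n",
        json_string(&event.request_id),
        json_string(&event.timestamp),
        json_string(&event.status),
        json_string(&event.message)
    )
}
```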

Installation

The app includes Kubernetes install manifests under apps/light-deployer/k8s:

  • namespace
  • RBAC
  • deployment
  • service

The deployment runs the container with LIGHT_DEPLOYER_KUBE_MODE=real. The image contains /app/config, and server.yml defaults the HTTP port to 7088.

For MicroK8s testing:

./apps/light-deployer/build.sh latest
docker save networknt/light-deployer:latest | microk8s ctr image import -
microk8s kubectl apply -f apps/light-deployer/k8s/namespace.yaml
microk8s kubectl apply -f apps/light-deployer/k8s/rbac.yaml
microk8s kubectl apply -f apps/light-deployer/k8s/deployment.yaml
microk8s kubectl apply -f apps/light-deployer/k8s/service.yaml

Current Limitations

  • Direct HTTP/MCP-style mode is implemented first; controller-mediated WebSocket routing is a later integration step.
  • Inline values are implemented; config-server valuesRef fetching is still a future integration point.
  • Rollback is represented in the model but needs portal snapshot integration.
  • Helm and Kustomize are not implemented yet.
  • Typed placeholders are not implemented yet.
  • Rollout watch depth is intentionally basic in the first phase.

Design Direction

Keep light-deployer small and cluster-local.

The deployer should execute precise deployment commands, enforce local safety policy, and report structured results. It should not grow into a portal, workflow engine, or deployment database. That separation keeps the service easy to install inside customer clusters and reduces the security blast radius.

Module Registry

Status: Phase 4 implemented for light-gateway/gateway; additional module reloaders remain planned.

Purpose

Light Fabric needs a runtime module registry equivalent to the ModuleRegistry feature in light-4j.

In light-4j, each active component registers its runtime configuration when the component loads. Older integrations exposed this through the /adm/server/info REST endpoint, but the current control-plane path uses MCP tools through portal-registry. The same registry is also used by the config-reload operation to decide which modules can reload configuration from the config server.

Light Fabric already has structured config files and a shared runtime startup flow, but it does not yet have a central registry that answers these operational questions:

  • which modules are active in this running instance
  • which config file each module loaded
  • what masked runtime config is currently active
  • which modules can be reloaded without restarting the process
  • what happened during the last reload attempt

This document proposes a registry in light-runtime so every Light Fabric application can expose the same control-plane behavior.

Goals

  • Register built-in runtime configs such as startup, server, client, and portal-registry.
  • Register application configs such as gateway, deployer, ollama, and mcp-client.
  • Store only masked config snapshots in the registry.
  • Expose a Java-compatible server-info payload through the get_service_info MCP tool.
  • Expose a module list through the get_modules MCP tool for config reload selection.
  • Support control-plane reload requests for one module, several modules, or all modules through the reload_modules MCP tool. Phase 3 reports non-reloadable modules as skipped. Phase 4 adds real hot reload for light-gateway/gateway.
  • Keep the feature transport-neutral by routing management requests through portal-registry, not through framework-specific REST routes.
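
The reload goal, including the Phase 3 rule that non-reloadable modules are reported as skipped, can be sketched as follows. This is an illustration only: the reload method on ReloadableModule, the Module wrapper, ReloadOutcome, NoopReloader, and the NotFound outcome for unknown module IDs are all assumptions.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ReloadOutcome {
    Reloaded,
    Skipped,  // registered but not reloadable
    NotFound, // module ID is not registered (assumed outcome)
    Failed(String),
}

/// Assumed shape of the reload hook.
pub trait ReloadableModule {
    fn reload(&self) -> Result<(), String>;
}

/// Stand-in reloader that always succeeds.
pub struct NoopReloader;

impl ReloadableModule for NoopReloader {
    fn reload(&self) -> Result<(), String> {
        Ok(())
    }
}

pub struct Module {
    pub reloader: Option<Box<dyn ReloadableModule>>,
}

/// Reload the requested modules, or every registered module when `requested` is None.
pub fn reload_modules(
    modules: &BTreeMap<String, Module>,
    requested: Option<&[&str]>,
) -> BTreeMap<String, ReloadOutcome> {
    let targets: Vec<String> = match requested {
        Some(ids) => ids.iter().map(|id| id.to_string()).collect(),
        None => modules.keys().cloned().collect(),
    };
    targets
        .into_iter()
        .map(|id| {
            let outcome = match modules.get(&id) {
                None => ReloadOutcome::NotFound,
                Some(Module { reloader: None }) => ReloadOutcome::Skipped,
                Some(Module { reloader: Some(r) }) => match r.reload() {
                    Ok(()) => ReloadOutcome::Reloaded,
                    Err(e) => ReloadOutcome::Failed(e),
                },
            };
            (id, outcome)
        })
        .collect()
}
```

Returning one outcome per requested module lets the control plane show a partial result instead of failing the whole request when a single module cannot reload.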

Non-Goals

  • Do not make every config hot-reloadable in the first phase.
  • Do not rebind server ports or TLS listeners unless a transport explicitly supports it.
  • Do not expose decrypted secrets through diagnostics.
  • Do not make Rust type names part of the public control-plane contract.
  • Do not add /adm/... REST endpoints for Light Fabric.

Current Light Fabric Runtime Shape

The natural home for this feature is crates/light-runtime.

LightRuntimeBuilder already owns the startup sequence:

  1. load local bootstrap config
  2. optionally fetch remote config from config server
  3. build RuntimeConfig
  4. call registered runtime modules
  5. bind the transport
  6. register the running instance with the controller
  7. mark the runtime ready

RuntimeConfig already carries the merged resolved_values, config_dir, and external_config_dir. Application code can use those fields to load resolved application config without reparsing values.yml.

The config registry should build on that runtime boundary instead of creating a separate app-local registry per product.

Registry Model

Add a shared registry type in light-runtime.

pub struct ModuleRegistry {
    entries: RwLock<BTreeMap<String, ModuleEntry>>,
    reloaders: RwLock<BTreeMap<String, Arc<dyn ReloadableModule>>>,
}

pub struct ModuleEntry {
    pub module_id: String,
    pub config_name: String,
    pub kind: ModuleKind,
    pub active: bool,
    pub enabled: Option<bool>,
    pub reloadable: bool,
    pub config: serde_json::Value,
    pub masks: Vec<MaskSpec>,
    pub loaded_at: DateTime<Utc>,
    pub last_reload: Option<ReloadStatus>,
}

pub enum ModuleKind {
    Core,
    Framework,
    Application,
    Plugin,
}

Use stable module IDs instead of Rust type names. Java uses class names because they are stable operational identifiers in the JVM. Rust type names are not a good public API and can change during refactoring.

Example module IDs:

  • light-runtime/startup
  • light-runtime/server
  • light-client/client
  • light-runtime/portal-registry
  • light-gateway/gateway
  • light-deployer/deployer
  • light-agent/ollama
  • light-agent/mcp-client

The registry key should be module_id. Each entry also carries config_name so the server-info response can preserve the Java-style component map keyed by config name.
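
The two key spaces can be illustrated with a small standard-library sketch. It is not the light-runtime implementation: the Entry and Registry types, the String stand-in for the masked JSON value, and the method names are assumptions made only for this example.

```rust
use std::collections::BTreeMap;
use std::sync::RwLock;

/// Simplified entry: only the fields needed to show the two key spaces.
#[derive(Clone, Debug, PartialEq)]
pub struct Entry {
    pub module_id: String,
    pub config_name: String,
    /// Stand-in for the masked serde_json::Value held by the real entry.
    pub masked_config: String,
}

/// Hypothetical registry keyed by stable module ID.
#[derive(Default)]
pub struct Registry {
    entries: RwLock<BTreeMap<String, Entry>>,
}

impl Registry {
    /// Insert or replace the entry stored under its module ID.
    pub fn register(&self, entry: Entry) {
        let mut guard = self.entries.write().expect("registry lock poisoned");
        guard.insert(entry.module_id.clone(), entry);
    }

    /// Module IDs in sorted order, as a get_modules-style string list.
    pub fn module_ids(&self) -> Vec<String> {
        let guard = self.entries.read().expect("registry lock poisoned");
        guard.keys().cloned().collect()
    }

    /// Java-style component map: config name -> masked config.
    pub fn component_map(&self) -> BTreeMap<String, String> {
        let guard = self.entries.read().expect("registry lock poisoned");
        guard
            .values()
            .map(|e| (e.config_name.clone(), e.masked_config.clone()))
            .collect()
    }
}
```

Keeping module_id as the storage key means a refactor that renames a Rust type does not change what the controller sees, while component_map still gives portal-view the config-name view it already reads.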

Registered Config Loading

Add a small registered-loader API around the existing ConfigLoader behavior.

let gateway_config: GatewayConfig = context
    .config()
    .load_registered(
        "gateway",
        "light-gateway/gateway",
        [MaskSpec::key("password")],
    )?;

The helper should:

  1. merge the base file from config_dir
  2. overlay the external file from external_config_dir
  3. resolve variables from RuntimeConfig.resolved_values
  4. deserialize the typed config
  5. serialize the resolved config to serde_json::Value
  6. apply masks to the serialized copy
  7. store only the masked copy in ModuleRegistry
  8. return the typed config to the caller

This keeps the app code simple and prevents accidental registry entries that contain raw secrets.
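
The steps above can be sketched with flat string maps in place of YAML files and typed structs. Everything here is an assumption for illustration: the Flat alias, the "*" mask string, the single-placeholder resolve helper, and the function signature are not the light-runtime API.

```rust
use std::collections::BTreeMap;

/// Flat key/value stand-in for a parsed config file.
pub type Flat = BTreeMap<String, String>;

/// Replacement text stored instead of a secret (assumed value).
const MASK: &str = "*";

/// Resolve a value that is exactly one `${name:default}` placeholder.
/// Any other value passes through unchanged.
fn resolve(raw: &str, values: &Flat) -> String {
    let inner = match raw.strip_prefix("${").and_then(|s| s.strip_suffix('}')) {
        Some(inner) => inner,
        None => return raw.to_string(),
    };
    // Split on the first colon only, so defaults may contain colons.
    let (name, default) = match inner.split_once(':') {
        Some((name, default)) => (name, default),
        None => (inner, ""),
    };
    values.get(name).cloned().unwrap_or_else(|| default.to_string())
}

/// Hypothetical registered loader: returns the resolved config to the caller
/// and stores only the masked copy in the registry.
pub fn load_registered(
    module_id: &str,
    base: &Flat,
    external: &Flat,
    resolved_values: &Flat,
    mask_keys: &[&str],
    registry: &mut BTreeMap<String, Flat>,
) -> Flat {
    // Steps 1-2: start from the base file, then overlay the external file.
    let mut merged = base.clone();
    merged.extend(external.iter().map(|(k, v)| (k.clone(), v.clone())));

    // Step 3: resolve placeholders against the runtime's resolved values.
    let resolved: Flat = merged
        .iter()
        .map(|(k, v)| (k.clone(), resolve(v, resolved_values)))
        .collect();

    // Steps 5-6: build a separate copy and mask it.
    let masked: Flat = resolved
        .iter()
        .map(|(k, v)| {
            let shown = if mask_keys.contains(&k.as_str()) {
                MASK.to_string()
            } else {
                v.clone()
            };
            (k.clone(), shown)
        })
        .collect();

    // Step 7: the registry never sees the raw values.
    registry.insert(module_id.to_string(), masked);

    // Step 8: the caller gets the unmasked resolved config.
    resolved
}
```

Because masking happens before the insert, there is no code path in which the registry holds the raw secret, even briefly.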

Phase 2 added this shared registered-loader path in ModuleRegistry and attached the registry to RuntimeConfig so apps that load after runtime bootstrap can register resolved config through the same runtime-owned registry. Apps that load before runtime startup can create the registry first, register their application configs, and pass that registry into LightRuntimeBuilder. For modules that must validate typed config before changing the registry snapshot, the same loader is also available as load_config(...) followed by register_loaded_config(...) after validation succeeds.

Masking

Masking must happen at registration time. The registry should not store raw config and then mask it later.

Support two mask forms:

pub enum MaskSpec {
    Key(String),
    Path(String),
}

MaskSpec::Key("password") masks every matching key recursively, matching the current light-4j behavior.

MaskSpec::Path("oauth.clientSecret") masks a precise path for configs where a generic key would be too broad.
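
The difference between the two forms is easier to see on a small tree. The sketch below uses a hand-rolled Value enum instead of serde_json::Value so it stays dependency-free; the "*" mask string, the dot-separated path syntax, and the apply_masks name are assumptions for the example.

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for a JSON value.
#[derive(Clone, Debug, PartialEq)]
pub enum Value {
    Str(String),
    List(Vec<Value>),
    Map(BTreeMap<String, Value>),
}

pub enum MaskSpec {
    Key(String),
    Path(String),
}

/// Replacement text for masked values (assumed).
const MASK: &str = "*";

/// Mask every occurrence of `key`, at any depth, including inside lists.
fn mask_key(value: &mut Value, key: &str) {
    match value {
        Value::Map(map) => {
            for (k, v) in map.iter_mut() {
                if k.as_str() == key {
                    *v = Value::Str(MASK.to_string());
                } else {
                    mask_key(v, key);
                }
            }
        }
        Value::List(items) => {
            for item in items.iter_mut() {
                mask_key(item, key);
            }
        }
        Value::Str(_) => {}
    }
}

/// Mask only the value found by walking `segments` from the root.
fn mask_path(value: &mut Value, segments: &[&str]) {
    if let Some((first, rest)) = segments.split_first() {
        if let Value::Map(map) = value {
            if let Some(child) = map.get_mut(*first) {
                if rest.is_empty() {
                    *child = Value::Str(MASK.to_string());
                } else {
                    mask_path(child, rest);
                }
            }
        }
    }
}

/// Apply every mask spec to the serialized copy in place.
pub fn apply_masks(value: &mut Value, masks: &[MaskSpec]) {
    for mask in masks {
        match mask {
            MaskSpec::Key(key) => mask_key(value, key),
            MaskSpec::Path(path) => {
                let segments: Vec<&str> = path.split('.').collect();
                mask_path(value, &segments);
            }
        }
    }
}
```

A Key mask is the safe default because it cannot miss a nested copy of the secret. A Path mask is the narrower tool for names that are sensitive in one place and harmless elsewhere.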

Suggested default masks:

  • authorization
  • password
  • secret
  • clientSecret
  • apiKey
  • token
  • portalToken
  • controllerDiscoveryToken
  • privateKey
  • tlsKeyPath
  • bootstrapKeyPath

Add a runtime flag such as server.maskConfigProperties or admin.maskConfigProperties, defaulting to true, for parity with the Java server.maskConfigProperties behavior. Even if this flag is disabled, the control-plane documentation should treat unmasked output as a local debugging mode only.

Server Info MCP Response

The get_service_info MCP tool response should preserve the same logical shape that portal-view already understands from Java instances.

{
  "deployment": {
    "apiVersion": "0.1.0",
    "frameworkVersion": "0.1.0"
  },
  "environment": {
    "host": {
      "ip": "127.0.0.1",
      "hostname": "light-gateway-0"
    },
    "runtime": {},
    "system": {}
  },
  "security": {},
  "component": {
    "server": {},
    "gateway": {}
  },
  "plugin": {},
  "plugins": [],
  "modules": []
}

component should remain keyed by config_name for compatibility.

modules should provide richer Rust metadata:

[
  {
    "moduleId": "light-gateway/gateway",
    "configName": "gateway",
    "kind": "application",
    "active": true,
    "enabled": true,
    "reloadable": true,
    "loadedAt": "2026-05-07T14:30:00Z",
    "lastReload": {
      "status": "success",
      "message": "reloaded from config server",
      "completedAt": "2026-05-07T14:45:00Z"
    }
  }
]

MCP Access

Expose the registry only through MCP tools served by the runtime's portal-registry connection.

MCP tools:

get_service_info
get_modules
reload_modules

These are invoked through standard MCP JSON-RPC calls:

{
  "jsonrpc": "2.0",
  "id": "info-1",
  "method": "tools/call",
  "params": {
    "name": "get_service_info",
    "arguments": {}
  }
}

The controller remains the management channel. portal-registry receives the MCP request from the controller, dispatches it to the local runtime registry, and returns the result through the same websocket session. Light Fabric should not expose a parallel REST admin surface for this feature.

For compatibility with the existing Java and portal-view workflow, get_modules returns a string list of module IDs:

{
  "modules": [
    "light-runtime/server",
    "light-gateway/gateway"
  ]
}

The richer module metadata remains available in the modules field of get_service_info.

Reload Request

The reload_modules tool should accept omitted arguments, ALL, or explicit module IDs.

{
  "modules": [
    "light-gateway/gateway",
    "light-runtime/portal-registry"
  ]
}

An omitted modules value, an empty array, or ["ALL"] targets all registered modules. Registered modules without concrete reload implementations are reported as skipped instead of being marked as reloaded.
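
The target-selection rule can be written as a small pure function. This is a sketch only: the resolve_targets name and signature are hypothetical, and it treats exactly ["ALL"] as the wildcard, as the sentence above describes.

```rust
/// Decide which module IDs a reload request targets.
/// `requested` is None when the `modules` argument is omitted.
pub fn resolve_targets(requested: Option<&[String]>, registered: &[String]) -> Vec<String> {
    match requested {
        // Omitted argument: every registered module.
        None => registered.to_vec(),
        // Empty array: every registered module.
        Some(ids) if ids.is_empty() => registered.to_vec(),
        // The single-element list ["ALL"]: every registered module.
        Some(ids) if ids.len() == 1 && ids[0] == "ALL" => registered.to_vec(),
        // Otherwise: exactly the IDs the caller listed, in the caller's order.
        Some(ids) => ids.to_vec(),
    }
}
```

Explicit IDs are passed through without checking them against the registry here, so that an unknown ID can be reported in the reload result rather than silently dropped.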

The response should be explicit about what happened:

{
  "modules": ["light-gateway/gateway"],
  "reloaded": ["light-gateway/gateway"],
  "skipped": [
    {
      "moduleId": "light-runtime/server",
      "reason": "requiresRestart"
    }
  ],
  "failed": [
    {
      "moduleId": "light-agent/ollama",
      "message": "missing ollama.yml"
    }
  ]
}

modules is a Java-compatible alias for the successfully reloaded module IDs and is the field portal-view reads today. reloaded, skipped, and failed carry the more explicit Rust result details.

Reload Implementation

Phase 4 adds a reload trait for modules that can safely swap runtime config.

#[async_trait]
pub trait ReloadableModule: Send + Sync {
    async fn reload(&self, ctx: ReloadContext) -> Result<ReloadOutcome, RuntimeError>;
}

ReloadContext includes:

  • a refreshed RuntimeConfig
  • updated resolved_values
  • the existing config_dir
  • the existing external_config_dir
  • the shared ModuleRegistry

Reload flow:

  1. Re-fetch values.yml, certs, and files from the config server into external_config_dir.
  2. Rebuild the merged resolved_values.
  3. Resolve requested module IDs.
  4. For each reloadable module, call its reload implementation.
  5. Each module validates the new typed config before swapping it into live state.
  6. Update the registry entry and last_reload status.
  7. Return a detailed reload result.
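
Steps 3 through 7 of the reload flow can be sketched as a synchronous loop. The real trait is async and takes a ReloadContext; this example drops both to stay dependency-free. The Reloadable trait, FnReloader helper, ReloadResult shape, the "requiresRestart" skip reason for modules without a reloader, and the "unknown module" failure message are all assumptions for illustration.

```rust
use std::collections::BTreeMap;

/// Synchronous stand-in for the async ReloadableModule trait.
pub trait Reloadable {
    fn reload(&self) -> Result<(), String>;
}

/// Helper that turns a plain function into a reloader (example only).
pub struct FnReloader(pub fn() -> Result<(), String>);

impl Reloadable for FnReloader {
    fn reload(&self) -> Result<(), String> {
        (self.0)()
    }
}

#[derive(Debug, Default, PartialEq)]
pub struct ReloadResult {
    /// Java-compatible alias of `reloaded`.
    pub modules: Vec<String>,
    pub reloaded: Vec<String>,
    /// (moduleId, reason)
    pub skipped: Vec<(String, String)>,
    /// (moduleId, message)
    pub failed: Vec<(String, String)>,
}

pub fn reload_modules(
    targets: &[String],
    registered: &[String],
    reloaders: &BTreeMap<String, Box<dyn Reloadable>>,
) -> ReloadResult {
    let mut result = ReloadResult::default();
    for id in targets {
        // An ID that was never registered is reported, not ignored.
        if !registered.contains(id) {
            result.failed.push((id.clone(), "unknown module".to_string()));
            continue;
        }
        match reloaders.get(id) {
            // Registered but no reload implementation: skipped.
            None => result
                .skipped
                .push((id.clone(), "requiresRestart".to_string())),
            // The module validates and swaps its own config inside reload().
            Some(reloader) => match reloader.reload() {
                Ok(()) => result.reloaded.push(id.clone()),
                Err(message) => result.failed.push((id.clone(), message)),
            },
        }
    }
    // Keep the Java-compatible field in sync with the detailed one.
    result.modules = result.reloaded.clone();
    result
}
```

One module failing does not abort the loop, so a single bad config file cannot block reloads for the other requested modules.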

Use ConfigManager<T> or an ArcSwap-backed holder for modules that need hot reload. Either approach keeps long-held locks off the request path while still allowing atomic replacement of the whole config.

Phase 4 implements this with ConfigManager<T> in light-runtime. It stores an Arc<T> behind a short-lived RwLock, so request handlers clone the current config quickly and reloaders replace the entire typed config only after the new config has loaded and validated.
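
A minimal sketch of such a holder, assuming only what the paragraph above states (an Arc<T> behind a briefly held RwLock). The method names load and replace_validated are hypothetical and may differ from the light-runtime type.

```rust
use std::sync::{Arc, RwLock};

/// Holder for a typed config that can be swapped as a whole.
pub struct ConfigManager<T> {
    current: RwLock<Arc<T>>,
}

impl<T> ConfigManager<T> {
    pub fn new(initial: T) -> Self {
        Self {
            current: RwLock::new(Arc::new(initial)),
        }
    }

    /// Request path: hold the read lock only long enough to clone the Arc.
    /// The caller keeps a consistent snapshot even if a reload happens later.
    pub fn load(&self) -> Arc<T> {
        let guard = self.current.read().expect("config lock poisoned");
        Arc::clone(&*guard)
    }

    /// Reload path: validate the candidate first, then swap it in.
    /// On validation failure the live config is left untouched.
    pub fn replace_validated<E>(
        &self,
        candidate: T,
        validate: impl FnOnce(&T) -> Result<(), E>,
    ) -> Result<(), E> {
        validate(&candidate)?;
        let mut guard = self.current.write().expect("config lock poisoned");
        *guard = Arc::new(candidate);
        Ok(())
    }
}
```

Handlers that called load() before a swap keep working against their snapshot until they drop it, so a reload never changes config in the middle of a request.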

Reloadability Rules

Classify configs by reload safety.

Reloadable candidates:

  • light-gateway/gateway
  • light-deployer/deployer
  • light-agent/ollama
  • light-agent/mcp-client
  • route, policy, provider, or rule configs that are already read through swappable state

Requires restart by default:

  • bind IP
  • HTTP/HTTPS port
  • protocol enablement
  • TLS certificate path used by the listener
  • runtime config directory
  • config-server bootstrap identity
  • controller registration identity

Some server.yml fields can still be reloadable later, such as shutdownGracefulPeriod, but listener-affecting fields should stay requiresRestart until each transport supports safe rebinding.

Framework Integration

The registry should not require each framework to expose admin routes.

light-runtime should attach an MCP-capable RegistryHandler to the portal-registry client. When the controller invokes tools/list or tools/call, the handler can advertise and execute the local management tools without involving light-axum or light-pingora request routing.

This keeps light-axum and light-pingora focused on application traffic. It also avoids adding service ports, Kubernetes routes, or Pingora request filters only for control-plane operations.

Application Integration

light-gateway is integrated first because it already loads gateway.yml from RuntimeConfig.resolved_values, config_dir, and external_config_dir. It loads the resolved typed config, validates upstreams, and then stores the masked registry snapshot. In Phase 4, light-gateway/gateway also registers a ReloadableModule that reloads and validates gateway.yml, updates the masked registry snapshot, and swaps the live GatewayConfig through ConfigManager.

light-deployer loads deployer.yml before the runtime is started, so it creates a ModuleRegistry before loading its config, registers the final env-overridden deployer config, and passes the same registry to LightRuntimeBuilder.

light-agent also loads application configs before runtime startup. It now registers ollama.yml and mcp-client.yml in the pre-runtime registry and passes that registry into LightRuntimeBuilder. The existing manual PortalRegistryClient setup is unchanged so the registry feature does not reintroduce duplicate controller registration.

Current Registered Modules

Phase 4 registers these modules:

| Module ID | Config name | Kind | Reloadable |
|---|---|---|---|
| light-runtime/startup | startup | core | no |
| light-runtime/server | server | core | no |
| light-client/client | client | core | no |
| light-runtime/portal-registry | portal-registry | core | no |
| light-gateway/gateway | gateway | application | yes |
| light-deployer/deployer | deployer | application | no |
| light-agent/ollama | ollama | application | no |
| light-agent/mcp-client | mcp-client | application | no |

The application modules are visible in get_service_info once their owning application loads them. get_modules returns the corresponding module ID strings for portal-view selection. light-gateway/gateway can reload without a restart. Other application modules keep reloadable=false until their runtime state is moved behind swappable holders.

Rollout Plan

Phase 1: Registry and Masked Info

  • Implemented: ModuleRegistry, ModuleEntry, and mask utilities in light-runtime.
  • Implemented: built-in runtime config registration.
  • Implemented: tests proving raw secrets are not stored in registry entries.
  • Implemented: Java-compatible server-info response assembly.
  • Implemented: module-list response.
  • Implemented: a portal-registry MCP handler that exposes get_service_info and get_modules.

Phase 2: Application Registration

  • Implemented: convert light-gateway/gateway to registered config loading.
  • Implemented: convert light-deployer/deployer.
  • Implemented: convert light-agent/ollama and light-agent/mcp-client.
  • Implemented: add docs showing module IDs and reloadability.

Phase 3: Controller Operations

  • Implemented: add MCP tools/list and tools/call support for reload_modules.
  • Implemented: align portal-view calls so Java and Rust instances can be managed with the same control-plane workflow.
  • Implemented: return Java-compatible modules string lists while preserving detailed reloaded, skipped, and failed reload result fields.

Phase 4: Hot Reload

  • Implemented: add ReloadableModule, ReloadContext, and ReloadOutcome.
  • Implemented: add ConfigManager<T> for swappable typed configs.
  • Implemented: implement reload for light-gateway/gateway.
  • Implemented: add reload result tracking in the registry.
  • Implemented: add tests for registry reload results, gateway live config swapping, and config-server-backed reload context refresh.

Open Questions

  • Should module IDs be centrally reserved in light-runtime, or should each application own its ID namespace?
  • Should the Java-compatible component map include only active modules, while modules includes inactive-but-known modules?
  • Should MCP tool execution be enabled whenever portal-registry is enabled, or guarded by a separate admin-tools flag?
  • Should server.maskConfigProperties=false be allowed in production builds, or should Rust always mask known dangerous keys?

Implementation Sequence

Phase 1 implemented registry and masked server info first, without hot reload.

Phase 2 added application registration, so portal-view can display Rust application modules next to Java modules once it calls the MCP tools through portal-registry.

Phase 3 added the controller-facing reload_modules tool and Java-compatible module ID lists.

Phase 4 added the first real hot reload implementation for light-gateway/gateway. The next implementation step is to move additional application configs, such as light-deployer/deployer, light-agent/ollama, and light-agent/mcp-client, behind swappable runtime state before marking them reloadable.

Cache Control Plane

Status: Proposed

Purpose

Light Fabric should expose the same cache operations through the portal control plane that Java services expose through light-4j and portal-registry.

Today, portal-view can list caches and inspect cache entries for a running service instance. The next required operation is clearing a cache so cached data can be reloaded from its source of truth after operational data changes. A common case is clearing the reference-data cache in portal-service after reference tables are changed from light-portal.

The feature should be generic. It should not be a portal-service only endpoint. Any Java or Rust service that registers with the controller and has named local caches should be manageable through the same MCP tool contract.

Current Shape

The Java implementation already has most of the control-plane pieces:

  • light-4j/cache-manager defines the generic CacheManager API.
  • light-4j/caffeine-cache provides the Caffeine-backed implementation.
  • light-4j/portal-registry exposes MCP tools such as list_caches and get_cache_entries.
  • controller-rs and the Java controller forward instance-specific MCP tool calls by runtimeInstanceId.
  • portal-view calls the controller MCP websocket and passes runtimeInstanceId for cache exploration.

The main semantic gap is that CacheManager.removeCache(name) removes the cache from the manager in the Caffeine implementation. For a control-plane clear operation, the desired behavior is different: invalidate all entries while keeping the configured cache alive so the next application read repopulates it.

Goals

  • Add a generic whole-cache clear operation.
  • Keep the control-plane contract compatible between Java services and Light Fabric services.
  • Expose cache operations through portal-registry and controller MCP routing, not through service-specific REST endpoints.
  • Let portal-view clear a selected cache from the existing Cache Explorer page.
  • Use the same feature for portal-service reference data caching.
  • Preserve existing cache inspection behavior.

Non-Goals

  • Do not remove or unregister a configured cache when clearing entries.
  • Do not require every service to use the same cache backend.
  • Do not expose raw secrets or unsafe object internals through cache inspection.
  • Do not build event-driven cross-service cache invalidation in the first phase.
  • Do not confuse runtime data caches with the config-cache directory used for remote configuration files.

MCP Tool Contract

Add a new generic tool:

{
  "name": "clear_cache",
  "description": "Clear all entries from a named cache on a live runtime instance.",
  "inputSchema": {
    "type": "object",
    "required": ["runtimeInstanceId", "name"],
    "properties": {
      "runtimeInstanceId": { "type": "string", "format": "uuid" },
      "name": { "type": "string" }
    }
  }
}

The controller accepts runtimeInstanceId, removes it from the forwarded arguments, and sends this to the target runtime:

{
  "name": "clear_cache",
  "arguments": {
    "name": "reference-data"
  }
}
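
The controller-side step can be sketched as a function that separates the routing argument from the arguments forwarded to the runtime. String-valued arguments, the split_routing name, and the error text are simplifications assumed for this example; the real arguments are JSON.

```rust
use std::collections::BTreeMap;

/// Argument consumed by the controller and never forwarded.
pub const ROUTING_ARG: &str = "runtimeInstanceId";

/// Split controller-level arguments into (target instance, forwarded arguments).
/// Returns an error when the routing argument is missing.
pub fn split_routing(
    mut arguments: BTreeMap<String, String>,
) -> Result<(String, BTreeMap<String, String>), String> {
    let instance = arguments
        .remove(ROUTING_ARG)
        .ok_or_else(|| format!("missing required argument: {}", ROUTING_ARG))?;
    Ok((instance, arguments))
}
```

Stripping the routing argument at the controller keeps the runtime-side tool schema identical for Java and Rust services, since neither needs to know its own runtimeInstanceId to handle the call.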

Recommended success response:

{
  "supported": true,
  "status": "success",
  "name": "reference-data",
  "beforeSize": 42,
  "afterSize": 0
}

Recommended unsupported response:

{
  "supported": false,
  "status": "unsupported",
  "name": "reference-data",
  "message": "Cache support is not available on this service."
}

Key-level invalidation can be added later as a separate invalidate_cache_entry tool with { "name": "...", "key": "..." }. Whole-cache clear should be implemented first because it solves the reference data reload case without introducing cache-key UX and serialization questions.

Java Compatibility Work

In light-4j, add an explicit clear operation to the generic cache API:

void clear(String cacheName);

The Caffeine implementation should call cache.invalidateAll() and keep the cache in the manager. It may call cache.cleanUp() before returning size data. removeCache(name) should keep its existing unregister/remove semantics.

portal-registry should advertise clear_cache in tools/list and handle it in tools/call by using CacheManager.getInstance(). The handler should return supported: false when cache classes or a cache manager are not available, matching the current list_caches and get_cache_entries behavior.

The controller catalogs need the same tool so portal-view can call it through the normal controller websocket:

  • controller-rs tool catalog and command serialization
  • Java light-controller tool catalog and routed-call handling, if it remains a supported control-plane runtime

Light Fabric Runtime Design

Light Fabric should provide a small cache abstraction at the runtime layer so applications do not each define a different operational surface.

A practical shape is:

#[async_trait::async_trait]
pub trait RuntimeCache: Send + Sync {
    async fn len(&self) -> usize;
    async fn entries_summary(&self) -> serde_json::Value;
    async fn clear(&self);
}

#[derive(Default)]
pub struct CacheRegistry {
    caches: RwLock<BTreeMap<String, Arc<dyn RuntimeCache>>>,
}

The registry should support:

  • register named cache
  • list cache names
  • get summarized entries
  • clear a named cache

moka is the preferred default backend for async Rust services because it maps well to the Caffeine use case. Applications should still be free to register custom cache wrappers as long as they implement the runtime trait.

RuntimeMcpHandler in light-runtime should expose the same tools as Java:

  • list_caches
  • get_cache_entries
  • clear_cache

If a runtime has no cache registry, these tools should return supported: false rather than failing the request.
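
The clear path can be sketched with a synchronous version of the trait and a toy in-memory cache. The sync trait, MapCache, and ClearOutcome are assumptions made to keep the example dependency-free; a real backend such as moka would sit behind the same trait.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex, RwLock};

/// Synchronous stand-in for the async RuntimeCache trait.
pub trait RuntimeCache: Send + Sync {
    fn len(&self) -> usize;
    fn clear(&self);
}

/// Toy in-memory cache used only to exercise the registry.
#[derive(Default)]
pub struct MapCache {
    data: Mutex<BTreeMap<String, String>>,
}

impl MapCache {
    pub fn put(&self, key: &str, value: &str) {
        let mut data = self.data.lock().expect("cache lock poisoned");
        data.insert(key.to_string(), value.to_string());
    }
}

impl RuntimeCache for MapCache {
    fn len(&self) -> usize {
        self.data.lock().expect("cache lock poisoned").len()
    }
    fn clear(&self) {
        self.data.lock().expect("cache lock poisoned").clear();
    }
}

/// Result of a clear request (hypothetical shape).
#[derive(Debug, PartialEq)]
pub enum ClearOutcome {
    Cleared { before_size: usize, after_size: usize },
    UnknownCache,
}

#[derive(Default)]
pub struct CacheRegistry {
    caches: RwLock<BTreeMap<String, Arc<dyn RuntimeCache>>>,
}

impl CacheRegistry {
    pub fn register(&self, name: &str, cache: Arc<dyn RuntimeCache>) {
        let mut caches = self.caches.write().expect("registry lock poisoned");
        caches.insert(name.to_string(), cache);
    }

    pub fn names(&self) -> Vec<String> {
        let caches = self.caches.read().expect("registry lock poisoned");
        caches.keys().cloned().collect()
    }

    /// Empty the named cache but keep it registered.
    pub fn clear(&self, name: &str) -> ClearOutcome {
        // Clone the Arc out so the registry lock is not held while clearing.
        let cache = {
            let caches = self.caches.read().expect("registry lock poisoned");
            caches.get(name).cloned()
        };
        match cache {
            None => ClearOutcome::UnknownCache,
            Some(cache) => {
                let before_size = cache.len();
                cache.clear();
                ClearOutcome::Cleared { before_size, after_size: cache.len() }
            }
        }
    }
}
```

The cache stays in the registry after clear, which is the semantic difference from removeCache(name) that the control-plane operation needs.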

Portal Service Reference Data Cache

portal-service can use the generic Light Fabric cache for /r/data.

Suggested cache names:

  • reference-data
  • reference-data-relation

Suggested keys:

  • host:{hostId|global}:lang:{lang}:table:{name}
  • host:{hostId|global}:lang:{lang}:table:{name}:rela:{rela}:from:{from}

The request flow becomes:

  1. /r/data receives a reference-data request.
  2. ReferenceService builds a stable cache key from host, language, table, relation, and source value.
  3. On cache hit, return cached reference data.
  4. On cache miss, query Postgres, cache the result, and return it.
  5. When reference data changes in light-portal, an operator clears reference-data or reference-data-relation for the target portal-service runtime instance from portal-view.
  6. The next /r/data call reloads from Postgres.

This keeps the first implementation manual and deterministic. A later phase can subscribe to reference-table change events and clear matching caches automatically.
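
The key format and the hit/miss flow can be sketched as follows. The function names, the Vec<String> payload, and the closure standing in for the Postgres query are assumptions for this example, not the portal-service code.

```rust
use std::collections::HashMap;

/// Key for the reference-data cache. A missing host ID maps to "global".
pub fn table_key(host_id: Option<&str>, lang: &str, table: &str) -> String {
    format!(
        "host:{}:lang:{}:table:{}",
        host_id.unwrap_or("global"),
        lang,
        table
    )
}

/// Key for the reference-data-relation cache.
pub fn relation_key(
    host_id: Option<&str>,
    lang: &str,
    table: &str,
    rela: &str,
    from: &str,
) -> String {
    format!("{}:rela:{}:from:{}", table_key(host_id, lang, table), rela, from)
}

/// Toy cache holding rows as strings.
#[derive(Default)]
pub struct ReferenceCache {
    entries: HashMap<String, Vec<String>>,
}

impl ReferenceCache {
    /// Return the cached rows, or call `load` once and cache its result.
    pub fn get_or_load(&mut self, key: &str, load: impl FnOnce() -> Vec<String>) -> Vec<String> {
        if let Some(hit) = self.entries.get(key) {
            return hit.clone();
        }
        let loaded = load();
        self.entries.insert(key.to_string(), loaded.clone());
        loaded
    }

    /// Operator-triggered clear: the next read goes back to the source.
    pub fn clear(&mut self) {
        self.entries.clear();
    }
}
```

Because the key is built only from request fields, two requests for the same host, language, and table always land on the same entry, and clearing the whole cache is enough to force a reload without knowing any individual key.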

Portal View UX

The existing Cache Explorer page should stay the main UI.

Add a clear action for the selected cache:

  • show the selected cache name
  • require confirmation before clearing
  • disable the button while the request is running
  • call clear_cache with { runtimeInstanceId, name }
  • show success or error status
  • refetch cache entries after a successful clear

The UI should not require users to know whether the target service is Java or Rust. Unsupported runtimes should show the returned unsupported message.

Implementation Phases

Phase 1: Java clear support

  • Add CacheManager.clear(cacheName).
  • Implement it in caffeine-cache.
  • Add clear_cache to portal-registry MCP tools.
  • Add targeted tests for clearing while preserving the configured cache.

Phase 2: Controller and portal-view

  • Add clear_cache to controller tool catalogs and command routing.
  • Add the Cache Explorer clear button and confirmation.
  • Verify the existing runtimeInstanceId forwarding path is reused.

Phase 3: Light Fabric generic cache

  • Add a runtime cache registry and trait.
  • Add moka-backed cache support.
  • Expose list_caches, get_cache_entries, and clear_cache from RuntimeMcpHandler.
  • Add focused light-runtime tests for supported and unsupported cache cases.

Phase 4: Portal service reference data

  • Register reference-data and reference-data-relation caches.
  • Cache /r/data query results.
  • Clear the cache from portal-view and verify the next request reloads from Postgres.

Verification

Recommended targeted checks:

mvn -q -pl cache-manager,caffeine-cache,portal-registry test
cargo test -p light-runtime
cargo check --workspace
yarn build

Use the Maven command in light-4j, the Cargo commands in light-fabric and portal-service as appropriate, and the frontend build in portal-view.

Client Configuration And Modules

Status

Brainstorming proposal for standardizing client.yml across Light Fabric runtime, framework modules, and products.

The immediate trigger is that different Rust modules currently interpret client.yml differently. For example, light-runtime reads a small top-level verifyHostname field for controller and config-server clients, while light-pingora token and SPA modules read a Java-style nested tls section. That split makes a single client.verifyHostname: false value unreliable.

This document proposes a common contract so every Rust module uses the same client.yml file and the same typed configuration model.

Purpose

client.yml should describe outbound client behavior for a running service:

  • TLS trust, hostname verification, and optional client identity.
  • HTTP request timeout, retry, circuit breaker, connection pool, and HTTP/2 behavior.
  • OAuth 2.0 token, key, sign, dereference, and provider-selection behavior.
  • Path-prefix-to-service mapping used when different downstream services use different OAuth providers.
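
As an illustration of the last item, a path-prefix-to-service lookup could look like the sketch below. The longest-prefix-wins rule, the resolve_service_id name, and the sample service IDs in the usage are assumptions; the source does not specify how overlapping prefixes are ordered.

```rust
/// Pick the service ID whose path prefix matches `path`.
/// When several prefixes match, the longest one wins (assumed rule).
pub fn resolve_service_id<'a>(path: &str, mappings: &'a [(String, String)]) -> Option<&'a str> {
    mappings
        .iter()
        // Keep only mappings whose prefix the request path starts with.
        .filter(|(prefix, _)| path.starts_with(prefix.as_str()))
        // Prefer the most specific (longest) prefix.
        .max_by_key(|(prefix, _)| prefix.len())
        .map(|(_, service_id)| service_id.as_str())
}
```

The resolved service ID is what a shared provider resolver would then use to choose the OAuth provider for the outbound call; a path with no matching prefix returns None and can fall back to the default provider.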

The file should be loaded once through the runtime configuration system, registered once in the module registry with secrets masked, then shared by all modules that make outbound calls.

Compatibility Contract

The Java light-4j client.yml remains the compatibility baseline. Rust can clean up the internal model, but it should not remove behavior that Java http-client and client-config expose.

Important Java sections:

tls:
  verifyHostname: ${client.verifyHostname:true}
  loadDefaultTrustStore: ${client.loadDefaultTrustStore:true}
  loadTrustStore: ${client.loadTrustStore:true}
  trustStore: ${client.trustStore:client.truststore}
  trustStorePass: ${client.trustStorePass:password}
  loadKeyStore: ${client.loadKeyStore:false}
  keyStore: ${client.keyStore:client.keystore}
  keyStorePass: ${client.keyStorePass:password}
  keyPass: ${client.keyPass:password}
  defaultCertPassword: ${client.defaultCertPassword:changeit}
  tlsVersion: ${client.tlsVersion:TLSv1.3}

oauth:
  multipleAuthServers: ${client.multipleAuthServers:false}
  token:
    cache:
      capacity: ${client.tokenCacheCapacity:200}
    tokenRenewBeforeExpired: ${client.tokenRenewBeforeExpired:60000}
    expiredRefreshRetryDelay: ${client.expiredRefreshRetryDelay:2000}
    earlyRefreshRetryDelay: ${client.earlyRefreshRetryDelay:4000}
    server_url: ${client.tokenServerUrl:}
    serviceId: ${client.tokenServiceId:com.networknt.oauth2-token-1.0.0}
    proxyHost: ${client.tokenProxyHost:}
    proxyPort: ${client.tokenProxyPort:}
    enableHttp2: ${client.tokenEnableHttp2:true}
    authorization_code: {}
    client_credentials: {}
    refresh_token: {}
    token_exchange: {}
    key: {}
  sign: {}
  deref: {}

pathPrefixServices: ${client.pathPrefixServices:}

request:
  errorThreshold: ${client.errorThreshold:2}
  connectTimeout: ${client.connectTimeout:2000}
  timeout: ${client.timeout:3000}
  resetTimeout: ${client.resetTimeout:7000}
  injectOpenTracing: ${client.injectOpenTracing:false}
  injectCallerId: ${client.injectCallerId:false}
  enableHttp2: ${client.enableHttp2:true}
  connectionPoolSize: ${client.connectionPoolSize:1000}
  connectionExpireTime: ${client.connectionExpireTime:1800000}
  maxReqPerConn: ${client.maxReqPerConn:1000000}
  maxConnectionNumPerHost: ${client.maxConnectionNumPerHost:1000}
  minConnectionNumPerHost: ${client.minConnectionNumPerHost:250}
  maxRequestRetry: ${client.maxRequestRetry:3}
  requestRetryDelay: ${client.requestRetryDelay:1000}
  poolMetricsEnabled: ${client.poolMetricsEnabled:false}
  poolWarmUpEnabled: ${client.poolWarmUpEnabled:false}
  poolWarmUpSize: ${client.poolWarmUpSize:1}
  healthCheckEnabled: ${client.healthCheckEnabled:true}
  healthCheckIntervalMs: ${client.healthCheckIntervalMs:30000}

Rust should add fields such as tls.caCertPath, tls.clientCertPath, and tls.clientKeyPath because PEM files are the native Rust deployment shape. Rust does not need to support Java-specific JKS/JCEKS truststore or keystore formats. If those Java-only fields appear in a Rust client.yml, they can be ignored because config-server should control which fields it injects for Rust services.

Initial Rust Gaps

At the start of this migration, the Rust implementation had three separate interpretations of client configuration:

| Area | Current behavior | Problem |
|---|---|---|
| light-runtime config-server and portal-registry clients | Read ClientConfig { verify_hostname } from top-level client.yml | Did not understand the Java nested tls.verifyHostname shape |
| light-pingora token, security JWKS, stateless auth, and MSAL exchange | Read ClientTokenConfig with tls, oauth, pathPrefixServices, and request | Was closer to Java, but framework-local and did not drive runtime clients |
| light-gateway upstream proxy | Read the resolved flat value client.verifyHostname directly from values.yml | Bypassed typed client.yml and could disagree with other modules |

Before this design, Rust support was also partial compared with Java:

| Java capability | Initial Rust status |
|---|---|
| tls.verifyHostname | Supported by Pingora token/SPAs, not by runtime controller/config-server clients |
| CA trust | Supported through Rust caCertPath; Java truststore fields are not modeled |
| Client certificate and key for mTLS | Not yet modeled for outbound clients |
| TLS version | Not yet modeled |
| Request connect and total timeout | Supported for token/SPAs |
| Retries, circuit breaker, pool sizing, pool health | Not yet modeled as shared client behavior |
| OAuth authorization_code | Supported by SPA auth |
| OAuth client_credentials | Supported by token handler |
| OAuth refresh_token | Supported by SPA auth |
| OAuth token_exchange | Supported by MSAL exchange and SPA auth |
| OAuth token key / JWKS | Partially supported by security runtime |
| token.key.serviceIdAuthServers and audience | Not fully modeled in Rust |
| OAuth sign | Not yet modeled |
| OAuth sign.key / sign JWKS | Not yet modeled |
| OAuth deref | Not yet modeled |
| Multiple auth providers by service id | Supported for client credentials, but should become a shared resolver |
| pathPrefixServices | Supported in token handler, but should become shared resolver logic |

Goals

  • Keep client.yml as the only config file for outbound client behavior.
  • Make the Java nested shape canonical: tls.verifyHostname, not top-level verifyHostname.
  • Load and register the resolved client.yml once through light-runtime.
  • Share one typed ClientConfig across runtime, Pingora, gateway, agent, deployer, MCP clients, model-provider clients, and future products.
  • Preserve Java-compatible field names and config-server placeholder names.
  • Support direct URL, direct registry, and portal registry service discovery consistently for token, key, sign, deref, and generic outbound calls.
  • Keep secrets masked in module registry snapshots and logs.
  • Make invalid active client config fail startup or reject reload before it changes live runtime behavior.
  • Allow Rust-native PEM fields without forcing Java keystore names into every Rust deployment.

Non-Goals

  • Do not move handler activation into client.yml. Handler-specific files such as token.yml, statelessAuth.yml, and msal-exchange.yml still decide whether a handler runs.
  • Do not implement every Java-only low-level connection-pool behavior in the first phase. The shared schema should include the fields so config is not lost, but unsupported fields can be ignored deliberately until the transport supports them.
  • Do not expose decrypted client secrets, tokens, or legacy Java password fields through module registry, MCP tools, logs, metrics, or cache output.
  • Do not require every module to use OAuth. The shared config must support simple TLS-only clients too.

Resolved Decisions

  • Create a separate light-client crate now so the shared config, HTTP client factory, OAuth client, and provider resolver can be reused without coupling every consumer to light-runtime.
  • Standardize Rust outbound TLS material on PEM paths. Java truststore and keystore formats are not required for Rust services.
  • client.yml reload should not force an immediate portal-registry reconnect. Reload is primarily for newly onboarded JWKS/JWT access and future outbound requests. Existing long-lived controller connections can keep running until their normal reconnect or service restart.
  • Unsupported Java fields can be ignored by Rust. Config-server should avoid injecting unsupported fields into Rust service config.
  • Ignored Java-only fields should be ignored silently. Rust startup does not need to warn about fields that config-server may omit for Rust services.
  • oauth.multipleAuthServers remains accepted for Java compatibility, but Rust should infer multi-provider mode when serviceIdAuthServers is configured.
  • pathPrefixServices stays in client.yml. It is outbound-client provider selection and is different from inbound path routing to downstream services.
  • Circuit breaker behavior is only needed by Pingora. Shared request config can carry the Java-compatible fields, but non-Pingora clients do not need to own circuit breaker state.
  • SAML bearer is not required for Light Fabric and should remain out of scope unless a future product explicitly needs it.

Proposed Canonical Shape

The canonical Rust client.yml should stay close to Java:

tls:
  verifyHostname: ${client.verifyHostname:true}
  caCertPath: ${client.caCertPath:}
  clientCertPath: ${client.clientCertPath:}
  clientKeyPath: ${client.clientKeyPath:}
  tlsVersion: ${client.tlsVersion:TLSv1.3}

request:
  connectTimeout: ${client.connectTimeout:2000}
  timeout: ${client.timeout:3000}
  maxRequestRetry: ${client.maxRequestRetry:3}
  requestRetryDelay: ${client.requestRetryDelay:1000}
  errorThreshold: ${client.errorThreshold:2}
  resetTimeout: ${client.resetTimeout:7000}
  injectCallerId: ${client.injectCallerId:false}
  enableHttp2: ${client.enableHttp2:true}
  connectionPoolSize: ${client.connectionPoolSize:1000}
  connectionExpireTime: ${client.connectionExpireTime:1800000}
  maxReqPerConn: ${client.maxReqPerConn:1000000}
  maxConnectionNumPerHost: ${client.maxConnectionNumPerHost:1000}
  minConnectionNumPerHost: ${client.minConnectionNumPerHost:250}
  poolMetricsEnabled: ${client.poolMetricsEnabled:false}
  poolWarmUpEnabled: ${client.poolWarmUpEnabled:false}
  poolWarmUpSize: ${client.poolWarmUpSize:1}
  healthCheckEnabled: ${client.healthCheckEnabled:true}
  healthCheckIntervalMs: ${client.healthCheckIntervalMs:30000}

oauth:
  multipleAuthServers: ${client.multipleAuthServers:false}
  token:
    cache:
      capacity: ${client.tokenCacheCapacity:200}
    tokenRenewBeforeExpired: ${client.tokenRenewBeforeExpired:60000}
    expiredRefreshRetryDelay: ${client.expiredRefreshRetryDelay:2000}
    earlyRefreshRetryDelay: ${client.earlyRefreshRetryDelay:4000}
    server_url: ${client.tokenServerUrl:}
    serviceId: ${client.tokenServiceId:com.networknt.oauth2-token-1.0.0}
    proxyHost: ${client.tokenProxyHost:}
    proxyPort: ${client.tokenProxyPort:}
    enableHttp2: ${client.tokenEnableHttp2:true}
    authorization_code:
      uri: ${client.tokenAcUri:/oauth2/token}
      client_id: ${client.tokenAcClientId:}
      client_secret: ${client.tokenAcClientSecret:}
      redirect_uri: ${client.tokenAcRedirectUri:}
      scope: ${client.tokenAcScope:}
    client_credentials:
      uri: ${client.tokenCcUri:/oauth2/token}
      client_id: ${client.tokenCcClientId:}
      client_secret: ${client.tokenCcClientSecret:}
      scope: ${client.tokenCcScope:}
      serviceIdAuthServers: ${client.tokenCcServiceIdAuthServers:}
    refresh_token:
      uri: ${client.tokenRtUri:/oauth2/token}
      client_id: ${client.tokenRtClientId:}
      client_secret: ${client.tokenRtClientSecret:}
      scope: ${client.tokenRtScope:}
    token_exchange:
      uri: ${client.tokenExUri:/oauth2/token}
      client_id: ${client.tokenExClientId:}
      client_secret: ${client.tokenExClientSecret:}
      scope: ${client.tokenExScope:}
      subjectToken: ${client.subjectToken:}
      subjectTokenType: ${client.subjectTokenType:urn:ietf:params:oauth:token-type:jwt}
      requestedTokenType: ${client.requestedTokenType:}
      audience: ${client.tokenExAudience:}
    key:
      server_url: ${client.tokenKeyServerUrl:}
      serviceId: ${client.tokenKeyServiceId:com.networknt.oauth2-key-1.0.0}
      uri: ${client.tokenKeyUri:/oauth2/key}
      client_id: ${client.tokenKeyClientId:}
      client_secret: ${client.tokenKeyClientSecret:}
      enableHttp2: ${client.tokenKeyEnableHttp2:true}
      serviceIdAuthServers: ${client.tokenKeyServiceIdAuthServers:}
      audience: ${client.tokenKeyAudience:}
  sign:
    server_url: ${client.signServerUrl:}
    serviceId: ${client.signServiceId:com.networknt.oauth2-token-1.0.0}
    uri: ${client.signUri:/oauth2/sign}
    timeout: ${client.signTimeout:2000}
    client_id: ${client.signClientId:}
    client_secret: ${client.signClientSecret:}
    proxyHost: ${client.signProxyHost:}
    proxyPort: ${client.signProxyPort:}
    enableHttp2: ${client.signEnableHttp2:true}
    key:
      server_url: ${client.signKeyServerUrl:}
      serviceId: ${client.signKeyServiceId:com.networknt.oauth2-key-1.0.0}
      uri: ${client.signKeyUri:/oauth2/key}
      client_id: ${client.signKeyClientId:}
      client_secret: ${client.signKeyClientSecret:}
      enableHttp2: ${client.signKeyEnableHttp2:true}
      audience: ${client.signKeyAudience:}
  deref:
    server_url: ${client.derefServerUrl:}
    serviceId: ${client.derefServiceId:com.networknt.oauth2-token-1.0.0}
    uri: ${client.derefUri:/oauth2/deref}
    client_id: ${client.derefClientId:}
    client_secret: ${client.derefClientSecret:}
    proxyHost: ${client.derefProxyHost:}
    proxyPort: ${client.derefProxyPort:}
    enableHttp2: ${client.derefEnableHttp2:true}

pathPrefixServices: ${client.pathPrefixServices:}

Compatibility aliases:

  • Accept serverUrl in addition to Java server_url for Rust callers.
  • Accept clientId and clientSecret in addition to Java client_id and client_secret only as aliases. The emitted template should keep Java names.
  • Temporarily accept top-level verifyHostname only as a migration fallback, but register a warning and normalize it into tls.verifyHostname.

Serde strategy for the top-level verifyHostname fallback:

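  • The shared ClientConfig should deserialize into a struct that has a tls.verifyHostname field and a separate #[serde(default)] top-level verify_hostname field. At the deserialization layer both fields need an optional type such as Option<bool>, because a plain bool with a default cannot distinguish an absent key from an explicitly set one.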
  • After deserialization, a post-parse normalization step should check whether the top-level field was explicitly set. If so, it logs a deprecation warning and copies the value into tls.verify_hostname only when the nested field was not also explicitly set.
  • When both the top-level and nested fields are present, the nested tls.verifyHostname value wins. The top-level value is ignored after the warning.
  • Do not rely on two competing #[serde(default)] fields resolving the conflict. Use a custom Deserialize impl or an explicit post-parse step.
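
The post-parse step described above can be sketched without serde. This is a minimal illustration, assuming both keys have already been parsed into Option<bool> values; the names NormalizedVerifyHostname and normalize_verify_hostname are hypothetical and not part of the proposed crate. The default of true mirrors the ${client.verifyHostname:true} placeholder in the canonical template.

```rust
/// Result of normalizing the deprecated top-level `verifyHostname` key.
/// (Hypothetical type, used only for this sketch.)
#[derive(Debug, PartialEq)]
pub struct NormalizedVerifyHostname {
    /// Effective value to store in `tls.verifyHostname`.
    pub value: bool,
    /// True when the deprecated top-level key was present, so a warning is due.
    pub deprecated_key_used: bool,
}

/// `nested` is `tls.verifyHostname`; `top_level` is the deprecated key.
/// Both are `Option<bool>` so "absent" differs from "explicitly set".
pub fn normalize_verify_hostname(
    nested: Option<bool>,
    top_level: Option<bool>,
) -> NormalizedVerifyHostname {
    NormalizedVerifyHostname {
        // The nested value wins; the top-level value is only a fallback.
        value: nested.or(top_level).unwrap_or(true),
        deprecated_key_used: top_level.is_some(),
    }
}
```

A caller would log the deprecation warning whenever deprecated_key_used is true, including the case where the nested value wins and the top-level value is ignored.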

Serde strategy for Java-compatible but unimplemented sections:

  • Do not use #[serde(deny_unknown_fields)] for the top-level ClientConfig or OAuth section during Phase 1.
  • Known but not-yet-implemented Java sections such as oauth.sign and oauth.deref should deserialize into typed structs or serde_json::Value placeholders so representative Java fixtures load successfully.
  • Demand-driven validation decides whether a section is required. If no active module consumes oauth.sign or oauth.deref, those sections can be present and ignored silently.

Proposed Rust Modules

Shared Config Model

Create one shared typed config model outside light-pingora and light-runtime:

crates/light-client/src/lib.rs
crates/light-client/src/config.rs
crates/light-client/src/http.rs
crates/light-client/src/oauth.rs
crates/light-client/src/provider.rs

light-runtime should use light-client for loading, validating, and building outbound clients, but the reusable client model should not live inside the runtime crate.

Core types:

use std::collections::BTreeMap;
use std::path::PathBuf;

// Top-level shape of client.yml after placeholder resolution.
pub struct ClientConfig {
    pub tls: ClientTlsConfig,
    pub request: ClientRequestConfig,
    pub oauth: ClientOauthConfig,
    // Outbound path prefix -> service id, used only for OAuth provider selection.
    pub path_prefix_services: BTreeMap<String, String>,
}

pub struct ClientTlsConfig {
    pub verify_hostname: bool,
    pub ca_cert_path: Option<PathBuf>,
    pub client_cert_path: Option<PathBuf>,
    pub client_key_path: Option<PathBuf>,
    pub tls_version: Option<TlsVersion>,
}

pub struct ClientRequestConfig {
    pub connect_timeout_ms: u64,
    pub timeout_ms: u64,
    pub max_request_retry: u32,
    pub request_retry_delay_ms: u64,
    pub error_threshold: u32,
    pub reset_timeout_ms: u64,
    pub inject_caller_id: bool,
    pub enable_http2: bool,
    // Pool-related request settings; ClientPoolConfig is not shown here.
    pub pool: ClientPoolConfig,
}

TlsVersion should be an enum with serde names for Java-compatible strings such as TLSv1.2 and TLSv1.3, rather than a raw string in runtime code.
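
A minimal sketch of such an enum follows. In the real crate the Java-compatible names would most likely be attached with serde rename attributes; this std-only version uses FromStr to show the same mapping, and the variant names and the as_java_name helper are assumptions for illustration.

```rust
use std::str::FromStr;

/// TLS protocol versions accepted in client.yml (variant names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TlsVersion {
    Tls12,
    Tls13,
}

impl TlsVersion {
    /// The Java-compatible spelling used in config files.
    pub fn as_java_name(self) -> &'static str {
        match self {
            TlsVersion::Tls12 => "TLSv1.2",
            TlsVersion::Tls13 => "TLSv1.3",
        }
    }
}

impl FromStr for TlsVersion {
    type Err = String;

    /// Rejects anything other than the two known strings instead of
    /// carrying an unchecked raw string into runtime code.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "TLSv1.2" => Ok(TlsVersion::Tls12),
            "TLSv1.3" => Ok(TlsVersion::Tls13),
            other => Err(format!("unsupported tlsVersion: {other}")),
        }
    }
}
```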

Secrets should use a type that serializes as masked data for registry output, or the registry masks should cover every secret field recursively.
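
One way to approach the first option is a wrapper whose formatting never reveals the value. The sketch below is std-only and hypothetical (the Secret type and its expose method are not named in this design); a real implementation would apply the same idea to its serde Serialize impl so registry snapshots stay masked.

```rust
use std::fmt;

/// Wrapper for secret strings whose Debug and Display output is masked.
/// (Hypothetical type for illustration.)
#[derive(Clone, PartialEq, Eq)]
pub struct Secret(String);

impl Secret {
    pub fn new(value: impl Into<String>) -> Self {
        Secret(value.into())
    }

    /// Explicit accessor for the code path that really needs the raw value,
    /// for example when building a token request.
    pub fn expose(&self) -> &str {
        &self.0
    }
}

impl fmt::Debug for Secret {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str("Secret(****)")
    }
}

impl fmt::Display for Secret {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str("****")
    }
}
```

With this shape, accidental logging through {:?} or {} prints only the mask, and every use of the raw value is visible as a call to expose.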

Runtime Loader

light-runtime should own the startup lifecycle for client.yml loading, but delegate parsing and validation to light-client:

  1. Load local values.yml.
  2. Load local startup.yml.
  3. Load local client.yml with resolved values for config-server bootstrap.
  4. Fetch remote config if configured.
  5. Rebuild the final RuntimeConfig with the remote client.yml overlay.
  6. Register masked light-client/client in ModuleRegistry.

Every runtime client should use this shared config:

  • config-server fetch client
  • portal-registry WebSocket client
  • MCP client
  • future model-provider outbound clients
  • framework/application clients through RuntimeConfig.client

For the earlier hostname-verification bug, the controller client should read:

runtime_config.client.tls.verify_hostname

not a separate top-level ClientConfig.verify_hostname.

HTTP Client Factory

Add a small factory that converts ClientConfig plus optional per-endpoint overrides into concrete clients:

use std::sync::Arc;

pub struct ClientFactory {
    config: Arc<ClientConfig>,
    direct_registry: DirectRegistryConfig,
    registry_client: Option<Arc<PortalRegistryClient>>,
}

// Optional per-endpoint overrides; None means "use the shared ClientConfig value".
pub struct EndpointOptions {
    pub server_url: Option<String>,
    pub service_id: Option<String>,
    pub proxy_host: Option<String>,
    pub proxy_port: Option<u16>,
    pub enable_http2: Option<bool>,
    pub timeout_ms: Option<u64>,
}

Responsibilities:

  • Build reqwest::Client with consistent TLS, timeout, proxy, HTTP/2, retry, and pool settings for non-Pingora consumers.
  • Build Pingora HttpPeer options from the same TLS config for gateway upstream proxying.
  • Resolve endpoint base URL by priority:
    1. direct server_url
    2. direct-registry.yml
    3. portal-registry discovery by serviceId
  • Apply per-service AuthServerConfig overrides without duplicating resolver logic in each handler.
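
The base URL priority above can be sketched as a single function. This is an illustration only: the function name, the use of a HashMap to stand in for direct-registry.yml, and the discover closure standing in for portal-registry lookup are all assumptions. Treating an empty server_url as unset is also an assumption, chosen because the template placeholders default to empty strings.

```rust
use std::collections::HashMap;

/// Resolves an endpoint base URL by priority:
/// 1. direct server_url, 2. direct registry mapping, 3. discovery by service id.
pub fn resolve_base_url(
    server_url: Option<&str>,
    service_id: Option<&str>,
    direct_registry: &HashMap<String, String>,
    discover: impl Fn(&str) -> Option<String>,
) -> Option<String> {
    // 1. An explicit, non-empty server_url always wins.
    if let Some(url) = server_url.filter(|u| !u.is_empty()) {
        return Some(url.to_string());
    }
    // Steps 2 and 3 both need a service id.
    let id = service_id?;
    // 2. Static mapping first, then 3. dynamic discovery.
    direct_registry.get(id).cloned().or_else(|| discover(id))
}
```

Keeping the priority in one place means handlers pass only EndpointOptions-style inputs and never repeat the fallback order themselves.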

The config-server bootstrap path still starts from BootstrapConfig because it needs enough client settings before remote client.yml has been fetched. To keep light-client independent from light-runtime, the factory should not take a BootstrapConfig type directly. Instead, light-runtime should adapt BootstrapConfig.connect_timeout, BootstrapConfig.timeout, authorization, and bootstrap CA path into EndpointOptions or a small bootstrap options type owned by light-client.

OAuth Client

Add a shared OAuth client module that implements Java http-client behavior:

oauth/client_credentials
oauth/authorization_code
oauth/refresh_token
oauth/token_exchange
oauth/key
oauth/sign
oauth/deref

The existing light-pingora SpaTokenClient, token handler client credentials code, and security JWKS fetcher should delegate to this shared module. Handler modules still own request-path decisions, cookies, headers, and rejection mapping.

OAuth provider selection should be one reusable resolver:

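use std::sync::Arc;

pub struct OAuthProviderResolver {
    client: Arc<ClientConfig>,
}

// Signatures only; method bodies are omitted in this design sketch.
impl OAuthProviderResolver {
    // Maps an outbound request path to a service id via pathPrefixServices.
    pub fn service_for_path(&self, path: &str) -> Option<&str>;
    // Token provider for the service, inheriting unset values from oauth.token defaults.
    pub fn client_credentials_provider(&self, service_id: Option<&str>) -> Result<AuthServerConfig>;
    // JWKS/key provider for the service, with the same inheritance.
    pub fn key_provider(&self, service_id: Option<&str>) -> Result<AuthServerConfig>;
}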

Rules:

  • Single-provider mode uses global oauth.token.* defaults.
  • Multi-provider mode is enabled when oauth.multipleAuthServers: true or when relevant serviceIdAuthServers maps are non-empty.
  • Multi-provider mode selects the service id from an explicit request header first, then outbound pathPrefixServices.
  • client_credentials.serviceIdAuthServers[serviceId] selects the token provider.
  • key.serviceIdAuthServers[serviceId] selects the JWKS/key provider.
  • Per-service config inherits unset values from global oauth.token defaults.
  • Path-prefix matching should be boundary-aware in Rust. Java uses startsWith; the Rust implementation can be stricter as an intentional improvement. Exact rule: a prefix matches when the request path equals the prefix or starts with prefix + "/". Therefore /api matches /api and /api/orders, but does not match /api-v2.
  • pathPrefixServices is not an inbound routing table. It maps outbound request paths to service ids only for client-side OAuth provider selection.
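
The boundary-aware rule can be sketched as follows. The function names are illustrative. Choosing the longest matching prefix when several entries match is an assumption of this sketch; the rules above do not specify a tie-break.

```rust
use std::collections::BTreeMap;

/// A prefix matches when the path equals it or continues with "/".
pub fn prefix_matches(prefix: &str, path: &str) -> bool {
    match path.strip_prefix(prefix) {
        Some(rest) => rest.is_empty() || rest.starts_with('/'),
        None => false,
    }
}

/// Looks up the service id for an outbound path.
/// Assumption: the longest matching prefix wins.
pub fn service_for_path<'a>(
    path_prefix_services: &'a BTreeMap<String, String>,
    path: &str,
) -> Option<&'a str> {
    path_prefix_services
        .iter()
        .filter(|(prefix, _)| prefix_matches(prefix, path))
        .max_by_key(|(prefix, _)| prefix.len())
        .map(|(_, service)| service.as_str())
}
```

Compared with a plain starts_with check, the only behavioral difference is that sibling paths such as /api-v2 no longer fall under /api.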

Consumer Modules

All modules should consume the same shared config:

| Module | Uses |
| --- | --- |
| light-runtime/config-server | light-client tls, request |
| light-runtime/portal-registry | light-client tls, request |
| light-pingora/security | oauth.token.key, tls, request, provider resolver |
| light-pingora/token | oauth.token.client_credentials, token cache settings, provider resolver |
| light-pingora/stateless-auth | authorization_code, refresh_token, token client |
| light-pingora/msal-exchange | token_exchange, token client |
| light-gateway/proxy | tls.verifyHostname, PEM mTLS, request timeout, retry, circuit breaker, and pool settings where Pingora supports them |
| light-agent | controller/MCP outbound clients |
| light-deployer | controller/MCP/outbound clients as needed |

Reload Behavior

client.yml should be reloadable as a module, but reload must be conservative:

  1. Load and validate the new config into a fresh ClientConfig.
  2. Build new shared client factories and OAuth clients.
  3. Swap the config atomically for future requests.
  4. Clear OAuth token caches because client credentials, scopes, providers, or trust settings may have changed.
  5. Keep old in-flight requests on their existing client instances.
  6. Reject the reload if active modules cannot build required clients from the new config.
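
The validate-then-swap part of these steps can be sketched with standard library types. This is a simplified illustration under stated assumptions: ClientHandle is a hypothetical name, ClientConfig is reduced to two fields, and building factories and clearing token caches are not modeled.

```rust
use std::sync::{Arc, RwLock};

/// Reduced stand-in for the shared config; only two fields are modeled.
#[derive(Debug, Clone, PartialEq)]
pub struct ClientConfig {
    pub connect_timeout_ms: u64,
    pub timeout_ms: u64,
}

impl ClientConfig {
    fn validate(&self) -> Result<(), String> {
        if self.connect_timeout_ms == 0 || self.timeout_ms == 0 {
            return Err("request.connectTimeout and request.timeout must be positive".into());
        }
        Ok(())
    }
}

/// Holds the live config; readers take cheap `Arc` snapshots.
pub struct ClientHandle {
    current: RwLock<Arc<ClientConfig>>,
}

impl ClientHandle {
    pub fn new(initial: ClientConfig) -> Self {
        ClientHandle { current: RwLock::new(Arc::new(initial)) }
    }

    /// In-flight requests keep the snapshot they started with.
    pub fn snapshot(&self) -> Arc<ClientConfig> {
        Arc::clone(&self.current.read().unwrap())
    }

    /// Validates first; on failure the old config stays live.
    pub fn reload(&self, candidate: ClientConfig) -> Result<(), String> {
        candidate.validate()?;
        *self.current.write().unwrap() = Arc::new(candidate);
        Ok(())
    }
}
```

Because the swap replaces the inner Arc rather than mutating the config, a request that called snapshot before the reload continues to see the old values until it finishes.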

Reload atomicity: all runtimes that consume client.yml must be swapped together in the same reload callback. Today, the gateway TokenReloader already rebuilds token_runtime, stateless_auth, and msal_exchange as a unit. This must remain a hard requirement. A reload that updates the client config without also rebuilding dependent runtimes would leave stale TLS or OAuth state in the old runtime instances.

Controller registration is long-lived. Reloading client.yml should not force an immediate portal-registry reconnect. New TLS and request settings should apply to future outbound clients and the next normal controller reconnect, but the active controller WebSocket can remain open.

Validation Rules

Base validation:

  • tls.verifyHostname: false requires explicit trust material unless the transport has a clear dev-only mode.
  • If Rust-native mTLS is configured, both client certificate and client key paths are required.
  • request.connectTimeout and request.timeout must be positive.
  • proxyPort must be 1 to 65535.
  • pathPrefixServices keys must start with /.
  • Secret fields may be empty only when the consuming active module does not need that grant.

OAuth validation should be demand-driven:

  • If token handler is active and enabled, validate client_credentials.
  • If stateless-auth is active, validate authorization_code and refresh_token.
  • If msal-exchange is active, validate token_exchange.
  • If security.yml enables JWKS bootstrap from key service, validate oauth.token.key.
  • If a future sign module is active, validate oauth.sign.
  • If a future deref module is active, validate oauth.deref.

This avoids forcing every service to configure every Java OAuth section.
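
A minimal sketch of the demand-driven mapping, assuming the set of active modules is known at validation time. The ActiveModule enum and required_sections function are hypothetical names; the section paths follow the canonical client.yml shape.

```rust
/// Modules whose presence makes a client.yml section mandatory.
/// (Hypothetical enum for illustration.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ActiveModule {
    TokenHandler,
    StatelessAuth,
    MsalExchange,
    SecurityJwks,
    Sign,
    Deref,
}

/// Returns the config sections that must validate for the given modules,
/// in first-seen order and without duplicates.
pub fn required_sections(active: &[ActiveModule]) -> Vec<&'static str> {
    let mut sections: Vec<&'static str> = Vec::new();
    for module in active {
        let needed: &[&'static str] = match module {
            ActiveModule::TokenHandler => &["oauth.token.client_credentials"],
            ActiveModule::StatelessAuth => {
                &["oauth.token.authorization_code", "oauth.token.refresh_token"]
            }
            ActiveModule::MsalExchange => &["oauth.token.token_exchange"],
            ActiveModule::SecurityJwks => &["oauth.token.key"],
            ActiveModule::Sign => &["oauth.sign"],
            ActiveModule::Deref => &["oauth.deref"],
        };
        for section in needed {
            if !sections.contains(section) {
                sections.push(*section);
            }
        }
    }
    sections
}
```

Sections that are absent from the returned list can be present in the file and still be ignored, which matches the Phase 1 rule for oauth.sign and oauth.deref.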

Validation failure behavior:

  • At startup, validation failures are fatal. The process must exit with a clear error message identifying which active module requires which missing or invalid client config section.
  • On reload, validation failures are non-fatal. The reload is rejected, the old config stays live, and the rejection reason is logged and reported through the module registry reload outcome.

Masking

Mask these fields recursively in registry output:

  • client_secret
  • clientSecret
  • trustStorePass
  • keyStorePass
  • keyPass
  • defaultCertPassword
  • subjectToken
  • access_token
  • refresh_token
  • id_token
  • authorization
  • any field ending in Token whose value is a scalar string (not a nested object, list, or URN-typed field like subjectTokenType or requestedTokenType)
  • any field ending in Secret

Explicit exclusions from suffix matching:

  • subjectTokenType - a URN string, not a secret.
  • requestedTokenType - a URN string, not a secret.

The registry should store only the masked snapshot. It should not store raw config and mask later.
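
The field-name rules can be sketched as a predicate that a recursive masker would call for each key. The function name and the value_is_scalar_string parameter are assumptions for illustration. How an explicitly listed name such as refresh_token should be treated when it names a nested section rather than a scalar is not decided here; this sketch masks explicit names unconditionally.

```rust
/// Field names that are always masked.
const EXPLICIT_SECRET_FIELDS: &[&str] = &[
    "client_secret",
    "clientSecret",
    "trustStorePass",
    "keyStorePass",
    "keyPass",
    "defaultCertPassword",
    "subjectToken",
    "access_token",
    "refresh_token",
    "id_token",
    "authorization",
];

/// Names excluded from suffix matching because they hold URNs, not secrets.
const SUFFIX_EXCLUSIONS: &[&str] = &["subjectTokenType", "requestedTokenType"];

/// Decides whether a field should be masked in registry output.
pub fn is_secret_field(name: &str, value_is_scalar_string: bool) -> bool {
    if SUFFIX_EXCLUSIONS.contains(&name) {
        return false;
    }
    if EXPLICIT_SECRET_FIELDS.contains(&name) || name.ends_with("Secret") {
        return true;
    }
    // The "Token" suffix rule applies only to scalar string values.
    name.ends_with("Token") && value_is_scalar_string
}
```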

Migration Plan

Phase 0: Deprecation Logging

  • Add a tracing::warn! in light-gateway where it reads resolved_values["client.verifyHostname"] to alert operators that this path is deprecated and will be replaced by runtime_config.client.tls.verify_hostname.
  • This gives operators visibility into the migration before behavior changes.

Phase 1: Unify The Schema

  • Add the light-client crate with the full shared ClientConfig type.
  • Make light-runtime load nested tls.verifyHostname.
  • Keep top-level verifyHostname as a temporary compatibility fallback.
  • Update Rust config templates to include only the canonical nested shape.
  • Add tests proving client.verifyHostname: false reaches config-server, portal-registry, token, security JWKS, SPA auth, and gateway proxy clients.

Phase 2: Move Consumers To Shared Config

  • Replace light-pingora::token::ClientTokenConfig with the light-client shared type or a type alias.
  • Replace gateway direct resolved_values["client.verifyHostname"] lookup with runtime_config.client.tls.verify_hostname.
  • Move JWKS, token, and SPA token HTTP client construction behind the shared client factory.
  • Register one masked light-client/client module instead of separate partial client registry entries.

Phase 3: Shared OAuth Provider Resolver

  • Extract provider selection from the token handler.
  • Support token.key.serviceIdAuthServers and audience.
  • Use the same resolver for token injection and JWT key lookup.
  • Keep Java field names and config-server placeholders.

Phase 4: Java Feature Completion

  • Implement sign client support in light-client.
  • Implement deref client support in light-client.
  • Implement Rust-native PEM mTLS for reqwest clients and Pingora upstreams.
  • Implement retry, circuit breaker, and pool behavior where the Rust transport supports them.

Open Questions

None at this stage.

Test Plan

Unit tests:

  • Parse the Java client.yml template into the shared Rust config.
  • Parse the current Rust client.yml template into the shared Rust config.
  • Resolve client.verifyHostname into tls.verifyHostname.
  • Accept top-level verifyHostname only as a fallback and prefer nested TLS when both are set.
  • Mask every secret field in the module registry snapshot.
  • Validate provider selection by service id and path prefix.
  • Validate per-service override inheritance for token and key providers.

Runtime tests:

  • Config-server bootstrap uses tls.verifyHostname.
  • Portal-registry controller WebSocket uses tls.verifyHostname.
  • Gateway upstream proxy uses tls.verifyHostname.
  • Token handler, stateless auth, MSAL exchange, and security JWKS all receive the same ClientConfig instance or snapshot.
  • Client reload clears token caches and rejects invalid active grant config.
  • Reload round-trip: verify that reloading from config A to config B swaps the ClientConfig, creates fresh token caches, and that in-flight requests on the old config are not affected. Verify that a reload from valid config to invalid config is rejected and the old config stays live.

Compatibility tests:

  • Reuse representative Java client.yml fixtures for single provider, multiple providers, proxy, token key, sign, and deref sections.
  • Confirm Java-compatible form bodies for authorization_code, client_credentials, refresh_token, and token_exchange.
  • Confirm config-server injected YAML strings and structured YAML maps both deserialize for serviceIdAuthServers and pathPrefixServices.

Embedded Configuration Templates

Status

Initial implementation completed.

Rust applications in light-fabric and related portal-service applications keep template configuration files under each app's config directory. Container images may copy those files into /app/config-defaults, then runtime overlays local config, downloaded config-cache, remote values.yml, and environment variables.

That works well for container deployments. It is awkward for native binary deployments on a VM because the operator must copy a full template directory beside the binary even when they only want to provide values.yml, certs, or a small local override.

This design embeds the template files into the Rust binary while keeping the app config directories in source control as the readable template source.

Purpose

Embedded configuration templates should make the Rust deployment model match the Java module model more closely:

  1. The application binary carries its default template files.
  2. Operators provide only overrides, usually values.yml, startup.yml, certs, keys, or environment variables.
  3. Config-server can still return values.yml after bootstrap, plus external files for explicit migration or operational exceptions.
  4. Developers and operators can still inspect the app's config directory in source control to learn supported properties.

The embedded files are defaults. They are not runtime state and should not be written out automatically unless an explicit diagnostic/export command is added later.

Current Model

The current runtime model has these filesystem layers:

| Layer | Example | Purpose |
| --- | --- | --- |
| Default templates | config-defaults/server.yml | App-provided templates copied into the container image |
| Local config | config/values.yml, config/startup.yml | Operator overrides and bootstrap inputs |
| External/cache config | config-cache/values.yml | Files downloaded from config-server |
| Remote values | config-server response body | Runtime values fetched during bootstrap |
| Environment variables | CLIENT_VERIFYHOSTNAME=false | Last-mile process overrides during placeholder expansion |

For light-fabric runtime applications, LightRuntimeBuilder passes default_config_dir, config_dir, and external_config_dir into light-runtime. load_bootstrap_config() reads bootstrap-time values.yml, startup.yml, and client.yml before remote config-server bootstrap. After remote bootstrap, runtime config loads server.yml, client.yml, portal-registry.yml, and framework/application module files through the same merged configuration path.

Some portal-service apps share the light-runtime path, while standalone apps such as config-server and light-oauth have local helper functions that merge config-defaults and config.

Goals

  • Allow a native binary deployment to start with embedded templates and a small external config/values.yml.
  • Keep apps/<app>/config/*.yml as the source of truth for template content.
  • Keep container deployment behavior compatible with the current /app/config-defaults copy.
  • Preserve the existing overlay order and placeholder expansion behavior.
  • Support bootstrap-time files such as startup.yml and client.yml.
  • Support runtime module files such as handler.yml, proxy.yml, model-provider.yml, provider configs, and product-specific files.
  • Provide one reusable loading abstraction for light-fabric and portal-service instead of app-specific parsing logic.
  • Avoid writing embedded templates to disk during normal startup.

Non-Goals

  • Do not embed secrets, certificates, private keys, trust bundles, static web assets, or downloaded config-server files.
  • Do not remove the source config directories. They remain the reviewable, documented template source.
  • Do not make values.yml mandatory. Apps should keep current defaults where they are already valid.
  • Do not make config-server responsible for delivering template files that are already part of the binary.
  • Do not change the meaning of values.yml placeholders or environment variable expansion.

Proposed Layer Order

The new effective source order should be:

  1. Embedded template file from the binary.
  2. Filesystem default template from config-defaults, if present.
  3. Local operator file from config.
  4. External/cache file from config-cache, when runtime loading supports it.
  5. Remote values.yml payload from config-server.
  6. Environment variables during placeholder resolution.

This keeps existing container images compatible. If config-defaults exists, it can override the embedded template. That gives operators and image builders a transition path and a deliberate escape hatch for patched images.

For native binary deployment, config-defaults is simply absent and the binary falls back to embedded templates.

Structured config files and values.yml should use different overlay semantics:

| File type | Semantics | Reason |
| --- | --- | --- |
| Structured config files such as server.yml, handler.yml, proxy.yml, and model-provider.yml | Source-level override. The highest-priority source that contains the file supplies the whole template. | Avoids surprising hybrid files assembled from embedded, image, local, and cache layers. Operators should use values.yml for partial property overrides. |
| values.yml | Key-level overlay in source order, followed by remote values and environment variables. | values.yml is explicitly the property override surface. Partial overlays are expected and useful. |

After the structured file source is selected, placeholders in that file are resolved from the merged values map and environment variables.
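
The key-level overlay and the placeholder step can be sketched together. This is an illustration under assumptions: values are modeled as flat string maps, environment overrides are assumed to be folded into the last layer already, and the function names are hypothetical. The placeholder syntax ${key:default} is the one used in the client.yml template.

```rust
use std::collections::BTreeMap;

/// Key-level overlay: later layers override earlier ones, key by key.
pub fn merge_values(layers: &[BTreeMap<String, String>]) -> BTreeMap<String, String> {
    let mut merged = BTreeMap::new();
    for layer in layers {
        for (key, value) in layer {
            merged.insert(key.clone(), value.clone());
        }
    }
    merged
}

/// Replaces each `${key:default}` with the merged value or its default.
pub fn resolve_placeholders(template: &str, values: &BTreeMap<String, String>) -> String {
    let mut out = String::new();
    let mut rest = template;
    while let Some(start) = rest.find("${") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        let Some(end) = after.find('}') else {
            // Unterminated placeholder: keep the remaining text verbatim.
            out.push_str(&rest[start..]);
            return out;
        };
        let body = &after[..end];
        // Split at the first ':' only, so defaults such as URNs stay intact.
        let (key, default) = body.split_once(':').unwrap_or((body, ""));
        out.push_str(values.get(key).map(String::as_str).unwrap_or(default));
        rest = &after[end + 1..];
    }
    out.push_str(rest);
    out
}
```

The structured file itself is never merged in this model; only the values map is, which is what keeps templates from turning into hybrids of several layers.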

Embedded Template Representation

include_dir is a possible embedding mechanism. It embeds the entire app config directory at compile time and avoids custom directory-scanning build scripts in every application crate:

use include_dir::{include_dir, Dir};

pub static EMBEDDED_CONFIG: Dir<'_> = include_dir!("$CARGO_MANIFEST_DIR/config");

The runtime should hide the concrete embedding mechanism behind a small config source abstraction. A typed file representation is still useful as the stable runtime boundary:

pub struct EmbeddedConfigFile {
    pub name: &'static str,
    pub content: &'static str,
}

Application code should pass a flattened static file list into the runtime:

LightRuntimeBuilder::new(transport)
    .with_embedded_config(embedded_config::FILES)
    .build();

include_str! is still acceptable for one or two files, but application main.rs files should not accumulate hand-maintained include_str! lists. include_bytes! is not preferred for YAML templates because configuration templates should be valid UTF-8 before they are parsed.

The initial implementation uses a shared build-time generator instead of adding an external embedding dependency. Each app has a small build.rs that calls config-embed-build, which scans the committed config directory and produces a manifest like this under OUT_DIR:

pub const FILES: &[config_loader::EmbeddedConfigFile] = &[
    config_loader::EmbeddedConfigFile {
        name: "server.yml",
        content: include_str!(concat!(env!("CARGO_MANIFEST_DIR"), "/config/server.yml")),
    },
    config_loader::EmbeddedConfigFile {
        name: "startup.yml",
        content: include_str!(concat!(env!("CARGO_MANIFEST_DIR"), "/config/startup.yml")),
    },
];

Build-Time Generation Fallback

The project currently uses the build-time manifest path. Each app uses a shared build.rs helper to scan its config directory and generate the embedded manifest. The generator lives in one reusable crate so apps do not carry duplicated build logic.

The generated manifest should:

  • Include only known text config extensions, initially .yml, .yaml, .json, and .toml.
  • Preserve the file name relative to the app config directory.
  • Emit cargo:rerun-if-changed=config.
  • Fail the build if a template file cannot be read as UTF-8.

Nested config paths are not needed for current app templates, but the manifest should allow names such as oauth/server.yml if a future product needs them.
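
The filtering and rendering parts of such a generator can be sketched as pure functions. This is not the config-embed-build implementation; the function names are hypothetical, and directory scanning, the UTF-8 check, writing to OUT_DIR, and printing cargo:rerun-if-changed=config are left out so the sketch stays self-contained.

```rust
use std::path::Path;

/// Extensions treated as text config templates.
const TEMPLATE_EXTENSIONS: &[&str] = &["yml", "yaml", "json", "toml"];

/// True when the file name (relative to the config directory) is a template.
pub fn is_template(relative_name: &str) -> bool {
    Path::new(relative_name)
        .extension()
        .and_then(|ext| ext.to_str())
        .map_or(false, |ext| TEMPLATE_EXTENSIONS.contains(&ext))
}

/// Renders the manifest source for the given relative file names,
/// skipping anything that is not a known template extension.
pub fn render_manifest(relative_names: &[&str]) -> String {
    let mut out = String::from("pub const FILES: &[config_loader::EmbeddedConfigFile] = &[\n");
    for name in relative_names.iter().filter(|n| is_template(n)) {
        out.push_str(&format!(
            "    config_loader::EmbeddedConfigFile {{\n        name: {name:?},\n        content: include_str!(concat!(env!(\"CARGO_MANIFEST_DIR\"), \"/config/{name}\")),\n    }},\n"
        ));
    }
    out.push_str("];\n");
    out
}
```

Because names are kept relative to the config directory, a nested entry such as oauth/server.yml passes through unchanged.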

Runtime API

Add embedded defaults to LightRuntimeBuilder:

LightRuntimeBuilder::new(transport)
    .with_embedded_config(embedded_config::FILES)
    .with_default_config_dir(DEFAULT_CONFIG_DIR)
    .with_config_dir(CONFIG_DIR)
    .with_external_config_dir(EXTERNAL_CONFIG_DIR)
    .build();

RuntimeConfig should carry the embedded source as skipped runtime state, the same way it carries default_config_dir and registries today:

pub struct RuntimeConfig {
    // existing fields
    #[serde(skip, default)]
    pub embedded_config: &'static [EmbeddedConfigFile],
}

The stable contract is lookup by relative file name and iteration for diagnostics or dumping. The concrete representation can remain a static file slice or later move behind a provider abstraction if needed.

The low-level loader should accept named in-memory content as another config source:

pub enum ConfigSource {
    Embedded { name: &'static str, content: &'static str },
    File(PathBuf),
}

ConfigLoader can then parse embedded and filesystem sources with the same YAML/JSON/TOML parser. Structured config loading should select the highest priority source for the requested file. values.yml loading should continue to merge maps in source order.
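
Source-level selection for a structured file can be sketched as picking the last available candidate. The function name is hypothetical, and the convention that callers list candidates from lowest to highest priority, with None for a layer that lacks the file, is an assumption of this sketch.

```rust
use std::path::PathBuf;

#[derive(Debug, Clone, PartialEq)]
pub enum ConfigSource {
    Embedded { name: &'static str, content: &'static str },
    File(PathBuf),
}

/// Picks the highest-priority source that contains the requested file.
/// `candidates` runs from lowest to highest priority.
pub fn select_structured_source(candidates: Vec<Option<ConfigSource>>) -> Option<ConfigSource> {
    // Drop the layers that lack the file, then keep the last remaining one.
    candidates.into_iter().flatten().last()
}
```

A native deployment with no config-defaults directory and no local server.yml would pass only the embedded candidate as Some, so the embedded template is selected.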

Bootstrap Behavior

Bootstrap must support embedded templates because this is the path that native deployments need most.

load_bootstrap_values() should merge:

  1. Embedded values.yml, if present.
  2. config-defaults/values.yml, if present.
  3. config/values.yml, if present.

load_bootstrap_config() should load startup.yml and client.yml from:

  1. Embedded templates.
  2. config-defaults.
  3. config.

For startup.yml and client.yml, the highest-priority source that contains the file should be used as the full template. Placeholder resolution still uses the merged bootstrap values.

After bootstrap fetches remote values, load_values_map() should merge embedded values.yml before the existing file and remote layers. This allows remote values to override embedded placeholders exactly as they override copied template files today.

Application Integration

Light-Gateway

light-gateway should be the first light-fabric application to adopt the runtime API because it has the richest template set:

  • bootstrap and server files
  • client and portal registry files
  • handler chain files
  • proxy, resource, MCP, websocket, auth, token, metrics, and rule-related files

After integration, a native gateway deployment can run with the binary plus a small config/values.yml and any required cert/key files.

Light-Agent

light-agent should use the same runtime API for all provider templates. The embedded set should include model-provider.yml, mcp-client.yml, and every provider-specific template such as openai.yml, bedrock.yml, codex.yml, anthropic.yml, and ollama.yml.

Runtime provider selection should still happen after bootstrap. Embedded templates do not mean provider clients are created before config-server values are loaded.

Light-Deployer

light-deployer currently has a separate app-level config load for deployer.yml. It should either move to the shared embedded-source helper or set embedded defaults on LightRuntimeBuilder and use the same merged source logic for its application config.

Portal-Service App

portal-service/apps/portal-service already uses LightRuntimeBuilder, but it loads portal-service.yml before runtime startup to create the database pool. That pre-runtime load should use the same shared embedded-source helper.

The portal-service.yml config remains non-reloadable because dbUrl and hostId feed process-owned state.

Portal-Service Config-Server And Light-OAuth

portal-service/apps/config-server and apps/light-oauth do not bootstrap from config-server. They should still embed their server.yml templates so native deployment does not require a copied config-defaults directory.

Because these apps have local merge helpers today, they should consume a shared config-loader helper that can merge:

  1. Embedded defaults.
  2. Filesystem defaults.
  3. Local config.

This keeps their behavior aligned with light-runtime without requiring them to become runtime-bootstrap applications.

Operator Model

For a native deployment, the recommended layout becomes:

/opt/light-gateway/
  light-gateway
  config/
    values.yml
    startup.yml        # optional, only when values/env defaults are not enough
    cert.pem           # optional external asset
    key.pem            # optional external asset

The operator no longer needs to copy every template file beside the binary. They only provide files that are deployment-specific.

For a container deployment, the current layout continues to work:

/app/light-gateway
/app/config-defaults/*.yml
/config/values.yml
/app/config-cache/values.yml

In the long term, the /app/config-defaults copy can become optional. Keeping it during migration is useful because it lets operators inspect templates inside the image and provides a familiar override layer.

After embedded templates are stable across production deployments, Docker images should deprecate and then remove the unconditional /app/config-defaults copy. Template inspectability should move to explicit dump/print commands rather than extra image layers.

Diagnostics

The runtime should expose enough information to make source precedence clear:

  • Log whether embedded templates were registered for the application.
  • When a required config file is missing, include the searched source names: embedded, config-defaults, config, and config-cache.
  • Module registry snapshots should show the resolved config, not the raw embedded template.
  • Module registry metadata should include config source provenance when available, for example embedded, file:/app/config-defaults/server.yml, or file:/config/server.yml.
  • Normal startup should not write embedded templates to disk.

Native operators should have explicit inspection commands:

light-gateway --print-default-config server.yml
light-gateway --dump-default-configs ./config-defaults

The print command writes one embedded template to stdout. The dump command writes all embedded templates to a target directory so operators can inspect, copy, and customize them.

Controller Server Info Compatibility

Rust services register with the controller, and the controller can call the runtime MCP service-info path to inspect runtime configuration. This behavior must continue to work with embedded templates.

The service-info response should expose resolved runtime configuration, not raw templates. The implementation contract is:

  1. Select the effective structured config source, such as embedded server.yml, filesystem config/server.yml, or cached config-cache server.yml.
  2. Build the merged values map from embedded, filesystem, cached, remote values.yml, and environment variables.
  3. Resolve placeholders in the selected config source.
  4. Deserialize the resolved config into the typed runtime or module config.
  5. Register that typed config in ModuleRegistry.
  6. Return ModuleRegistry component configs from the controller service-info MCP call.

With that flow, the controller still sees every registered config file with defaults and overrides applied. Embedded templates only replace the missing filesystem default-template layer. They should not bypass typed config loading, masking, module registration, reload validation, or service-info reporting.

Source provenance can be added as metadata beside each registered config, but it must not replace the resolved config payload that operators and the controller depend on.
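
Steps 2 and 3 of the contract can be sketched with the standard library alone. This is a minimal illustration, not the runtime's implementation: merge_values and resolve_placeholders are hypothetical names, values are simplified to strings, and the placeholder grammar is assumed to be ${key:default} with balanced braces allowed inside the default.

```rust
use std::collections::HashMap;

/// Key-level overlay: layers are given from lowest to highest priority and
/// later layers replace earlier ones key by key.
pub fn merge_values(layers: &[HashMap<String, String>]) -> HashMap<String, String> {
    let mut merged = HashMap::new();
    for layer in layers {
        for (key, value) in layer {
            merged.insert(key.clone(), value.clone());
        }
    }
    merged
}

/// Find the `}` that closes a placeholder, skipping balanced `{}` pairs that
/// belong to a default such as `{}`.
fn find_close(s: &str) -> Option<usize> {
    let mut depth = 0usize;
    for (i, ch) in s.char_indices() {
        match ch {
            '{' => depth += 1,
            '}' if depth == 0 => return Some(i),
            '}' => depth -= 1,
            _ => {}
        }
    }
    None
}

/// Replace every `${key:default}` with the merged value for `key`, or with
/// the default when no layer provides the key.
pub fn resolve_placeholders(template: &str, values: &HashMap<String, String>) -> String {
    let mut out = String::new();
    let mut rest = template;
    while let Some(start) = rest.find("${") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        match find_close(after) {
            Some(end) => {
                let body = &after[..end];
                // Split on the first ':' only, so URL defaults stay intact.
                let (key, default) = body.split_once(':').unwrap_or((body, ""));
                out.push_str(values.get(key).map(String::as_str).unwrap_or(default));
                rest = &after[end + 1..];
            }
            None => {
                // Unterminated placeholder: keep the remaining text unchanged.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

The resolved text is what would then be deserialized into the typed config and registered, so the controller sees overrides applied rather than the raw template.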

Testing Strategy

Add unit tests at the shared loader boundary:

  • Embedded-only server.yml loads successfully.
  • Local config/server.yml replaces embedded server.yml rather than deep merging with it.
  • config-defaults/server.yml replaces embedded server.yml.
  • config-cache/server.yml replaces local config during runtime loads.
  • Embedded values.yml is overridden by local values.yml.
  • Remote values.yml overrides embedded and filesystem values.
  • Missing required config reports all searched layers.
  • Source provenance is recorded for resolved module configs.
  • --print-default-config and --dump-default-configs expose embedded templates without changing normal startup behavior.
  • Controller service-info output includes resolved values from embedded defaults plus local, cached, remote, and environment overrides.

Add application-level smoke tests for:

  • light-gateway startup with no filesystem server.yml, using embedded templates plus local values.yml.
  • light-agent provider config loading from embedded templates after bootstrap.
  • portal-service/apps/portal-service pre-runtime portal-service.yml load from embedded templates.
  • portal-service/apps/config-server standalone server.yml load from embedded templates.

Migration Plan

  1. Add embedded source support to config-loader and light-runtime.
  2. Add shared build-time template embedding for light-gateway.
  3. Wire light-gateway to pass embedded templates to LightRuntimeBuilder.
  4. Keep Docker config-defaults copies unchanged and verify container parity.
  5. Add native startup tests that run without a copied template directory.
  6. Roll the same pattern to light-agent and light-deployer.
  7. Add the shared embedded-source helper to portal-service and migrate portal-service, config-server, and light-oauth.
  8. Add print and dump commands for embedded templates.
  9. After several releases, deprecate Docker config-defaults copies and rely on embedded defaults plus explicit dump commands for inspectability.

Risks And Mitigations

| Risk | Mitigation |
| --- | --- |
| Embedded templates drift from source templates | Embed the committed config/ directory directly with include_dir, or generate a manifest from that directory at build time |
| Operators cannot inspect templates in native deployment | Keep source templates in repo and add print/dump commands for embedded templates |
| Docker behavior changes unexpectedly | Keep config-defaults above embedded defaults during migration |
| Config-server remote values stop overriding defaults | Preserve remote values as the highest non-env value layer |
| Apps duplicate merge logic | Move embedded-source merging into shared loader/runtime helpers |
| Secrets accidentally embedded | Embed only committed template files and keep secrets in values, env, or external files |
| Structured config becomes hard to reason about | Use source-level override for config files and reserve key-level merging for values.yml |

Resolved Decisions

  • Native operators should get --print-default-config <name> and --dump-default-configs <directory> commands.
  • Module registry should expose resolved config first, with source provenance as metadata when available.
  • Docker images should keep /app/config-defaults during migration, then deprecate it once embedded templates and dump commands are stable.
  • Rust deployments should standardize on embedded templates plus remote values.yml. Config-server should not normally deliver full template files for Rust services.

Decision Summary

Embed app config/*.yml templates into the binary as the lowest-priority default configuration source. The initial implementation uses a shared build-time manifest generator, with include_dir remaining a possible future implementation detail. Keep the existing source config directories for documentation and build input. Use source-level override for structured config files and key-level overlay for values.yml. Preserve current filesystem and remote value layers so container deployments keep working, while native deployments can run with only the binary and a small deployment-specific config directory.

Handler Chain

Status: Phases 1 through 8 implemented; further transport phases proposed

Purpose

Light-Fabric needs a light-pingora handler chain for the Rust light-gateway product.

The first implementation should focus on light-pingora, not a generic cross-framework abstraction. A Pingora-first design is simpler and matches the gateway family of use cases: gateway, sidecar, proxy server, proxy client, load balancer, and BFF.

The deployment model should use one light-gateway binary. Different runtime behaviors should come from product-specific configuration managed in light-portal and delivered by config-server. A BFF deployment, a sidecar deployment, and a load-balancer deployment can therefore run the same binary with different handler.yml, traffic/resource config, and handler-specific config files.

The design should preserve the useful part of light-4j handler.yml: ordered configuration of cross-cutting request and response concerns. It should not copy the Java reflection model, mutable next handler pattern, or class-name-based configuration.

Goals

  • Add middleware handler-chain support to frameworks/light-pingora.
  • Use one apps/light-gateway binary for the Pingora gateway family.
  • Keep handler.yml as the chain and ordering configuration.
  • Let light-portal manage product-specific configuration and config-server deliver it at startup.
  • Support virtual hosts selected from the HTTP Host header.
  • Serve static SPA content directly from Pingora.
  • Proxy API, BFF, sidecar, and balancer routes to upstream services.
  • Use stable handler IDs instead of Rust type names.
  • Use explicit handler registration. Do not require inventory.
  • Integrate loaded handler and traffic/resource config with ModuleRegistry.
  • Keep the design compatible with runtime config reload.

Non-Goals

  • Do not build a transport-neutral light-handler crate in the first phase.
  • Do not add an Axum/Tower adapter in the first phase.
  • Do not create separate binaries for gateway, sidecar, proxy server, proxy client, load balancer, and BFF in the first phase.
  • Do not dynamically load handler crates from handler.yml.
  • Do not use Java-style reflection or string-to-type construction.
  • Do not make Rust type names part of the public config contract.
  • Do not support multi-certificate TLS SNI selection in the first phase.
  • Do not implement streaming static-file delivery in the first phase unless it is needed for a concrete SPA asset size problem.

Current Shape

light-pingora already adapts a Pingora proxy into the shared runtime:

pub trait PingoraApp: Send + Sync + 'static {
    type Proxy: ProxyHttp + Send + Sync + 'static;

    fn proxy(&self, config: &RuntimeConfig) -> Result<Self::Proxy, RuntimeError>;
}

PingoraTransport calls app.proxy(config) and passes the result to pingora::proxy::http_proxy_service(...).

Pingora's ProxyHttp lifecycle already has the hooks needed for the gateway family:

  • request_filter: validate, authenticate, rate limit, or directly write a local response such as a static file
  • upstream_peer: select the upstream for proxy routes
  • upstream_request_filter: mutate the request sent to upstream
  • upstream_response_filter: mutate the upstream response before caching
  • response_filter: mutate the response sent to the downstream client

The current light-gateway already writes /health directly from request_filter. Static SPA serving can use the same pattern.

Product Model

The Rust light-gateway binary should link all built-in Pingora gateway capabilities:

  • virtual host routing
  • static SPA serving
  • reverse proxy routing
  • outbound proxy behavior
  • upstream load balancing
  • sidecar token/header behavior
  • shared middleware handlers

The active behavior is selected by configuration, not by compiling a different binary. The six product personas are configuration profiles:

  • gateway
  • sidecar
  • proxy-server
  • proxy-client
  • balancer
  • bff

These profiles can be represented in light-portal as product-specific config sets. At runtime, light-gateway only sees the resolved files returned by config-server. The binary should not need to know whether the files came from a portal product template, an environment override, or a local fallback.

This keeps deployment simple:

  • one binary
  • one container image
  • one light-pingora framework
  • different behavior by remote config

The tradeoff is that config validation must be strong. A product config should not silently start in a different mode if a static root, virtual host, upstream, or chain is wrong.

High-Level Flow

The Pingora gateway request flow should be:

request
  -> match handler.yml paths by path and method
  -> fall back to handler.yml defaultHandlers when no path matches
  -> run request handlers
  -> proxy fixed upstream, route by service_id/service_url, serve static file,
     or return error
  -> run response handlers
  -> response

For static handlers such as virtual-host or path-resource, request_filter writes the response and returns Ok(true) so Pingora does not proxy the request.

For proxy or router handlers, request_filter stores the selected upstream decision in the per-request context and returns Ok(false). upstream_peer and upstream_request_filter then use that context to connect to the right upstream and set headers.

Crate Layout

Keep the first implementation inside frameworks/light-pingora.

Suggested modules:

frameworks/light-pingora/src/
  lib.rs
  handler.rs
  correlation.rs
  cors.rs
  metrics.rs
  proxy.rs
  resource.rs
  router.rs
  service.rs
  token.rs

Responsibilities:

  • parse and validate handler.yml
  • parse handler.yaml as a compatibility fallback
  • parse and validate proxy.yml, router.yml, path-resource.yml, and virtual-host.yml
  • build explicit handler registry
  • resolve handler chains
  • match handler paths and fallback handlers
  • capture Java-style {name} path-template variables
  • load active handler-specific config files
  • serve static SPA content
  • select fixed proxy upstreams from proxy.yml
  • select dynamic sidecar/router upstreams from router.yml
  • resolve sidecar service_id values from pathPrefixService.yml
  • retrieve and cache OAuth client-credentials tokens from client.yml
  • expose module-registry entries for active handler and traffic/resource config

This keeps the first implementation close to the Pingora lifecycle and avoids premature abstractions for Axum.

If Axum later needs the same handler semantics, extract the framework-neutral parts after the Pingora implementation has stabilized.

Configuration Split

Use handler.yml for the Java-compatible handler middleware contract: handler declarations, reusable chains, path-to-chain mappings, and fallback handlers.

Use Java-compatible product-specific config files for traffic and static resource behavior:

  • proxy.yml: fixed inbound reverse proxy targets for gateway, proxy server, balancer, and simple BFF API forwarding.
  • router.yml: dynamic outbound routing by service_id or service_url, mainly for sidecar-style deployments.
  • path-resource.yml or path-resource.yaml: a single static resource mount.
  • virtual-host.yml or virtual-host.yaml: host-based static resource mounts for BFF/SPA deployments.

The product profile selected in light-portal decides which of these files are included and which handlers are active in handler.yml. The Rust binary should not require a separate gateway.yml to duplicate these existing contracts.

Handler-specific files such as correlation.yml, cors.yml, metrics.yml, header.yml, security.yml, apikey.yml, basic-auth.yml, unified-security.yml, and limit.yml stay separate. They are loaded only when the corresponding handler is active in the resolved path/default execution model. Phase 3 implements this active loading for correlation.yml, cors.yml, and metrics.yml. Phase 4 extends the same active-loading and reload model to header.yml, security.yml, apikey.yml, basic-auth.yml, unified-security.yml, and limit.yml.

Remote Config Source

light-gateway starts with enough local bootstrap configuration to contact config-server. The existing Light-Fabric runtime then resolves local and remote configuration before light-pingora builds the runtime handler/resource/proxy model.

Startup flow:

  1. load local bootstrap files from the configured config directory
  2. contact config-server using the configured service identity, environment, and authorization
  3. download remote product configuration managed by light-portal
  4. merge remote config with local fallback config
  5. load handler.yml, applicable traffic/resource config files, and active handler-specific config files
  6. validate the complete route and handler model
  7. bind Pingora listeners
  8. register the runtime instance with the controller

The remote product config should include:

  • handler.yml
  • proxy.yml for fixed inbound proxy profiles
  • router.yml for sidecar/router profiles
  • path-resource.yml or virtual-host.yml for static/BFF profiles
  • active handler config files
  • TLS, trust, or client files required by the runtime
  • optional product-specific static file references or mount paths

handler.yml decides which linked handlers are active. A handler that is registered in the binary but not referenced by any configured paths entry or defaultHandlers chain should not be instantiated, should not load its config file, and should never run.

Handler Config

Example handler.yml:

enabled: ${handler.enabled:true}
reportHandlerDuration: ${handler.reportHandlerDuration:false}
handlerMetricsLogLevel: ${handler.handlerMetricsLogLevel:DEBUG}
basePath: ${handler.basePath:/}
handlers: ${handler.handlers:[]}
chains: ${handler.chains:{}}
paths: ${handler.paths:[]}
defaultHandlers: ${handler.defaultHandlers:[]}

The config-server values managed by light-portal provide the concrete arrays and maps:

handler.handlers:
  - correlation
  - headers
  - metrics
  - cors
  - jwt
  - rate-limit

handler.chains:
  spa:
    exec:
      - correlation
      - headers
      - metrics
      - cors
  api:
    exec:
      - correlation
      - headers
      - metrics
      - cors
      - jwt
      - rate-limit
  public:
    exec:
      - correlation
      - headers
      - metrics

handler.paths:
  - path: /api/
    method: GET
    exec:
      - api

handler.defaultHandlers:
  - public

This keeps the same top-level handler.yml contract as the Java framework: enabled, reportHandlerDuration, handlerMetricsLogLevel, basePath, handlers, chains, paths, and defaultHandlers.
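
To make the paths and defaultHandlers fallback concrete, here is a small sketch of the lookup. It assumes prefix matching on the path and a case-insensitive method comparison, neither of which is confirmed by this document; PathRule and resolve_exec are hypothetical names, and the returned entries are chain or handler names that still need expansion.

```rust
/// One entry of handler.paths: a path, a method, and the chains or handlers
/// to execute when both match.
pub struct PathRule {
    pub path: &'static str,
    pub method: &'static str,
    pub exec: Vec<&'static str>,
}

/// Return the exec list of the first matching path rule, or the
/// defaultHandlers list when no rule matches.
pub fn resolve_exec<'a>(
    rules: &'a [PathRule],
    default_handlers: &'a [&'static str],
    method: &str,
    path: &str,
) -> &'a [&'static str] {
    rules
        .iter()
        // Assumption: a rule matches when the method is equal (ignoring
        // case) and the request path starts with the configured path.
        .find(|rule| rule.method.eq_ignore_ascii_case(method) && path.starts_with(rule.path))
        .map(|rule| rule.exec.as_slice())
        .unwrap_or(default_handlers)
}
```

With the example values above, GET /api/pets resolves to the api chain, while POST /api/pets and GET /index.html both fall back to public.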

The Rust implementation also accepts the Java extension fields additionalHandlers, additionalChains, and additionalPaths. They are merged into the effective handler model before validation.

Unlike Java, the Rust handlers list uses stable short handler IDs. It does not use fully qualified class names, and it does not need @alias because the IDs are already short and stable.

handler.yml is the preferred Rust file name. handler.yaml is accepted as a compatibility fallback because some Java modules and templates use that suffix.

Fixed Proxy Config

proxy.yml should keep the Java inbound reverse-proxy contract. It is used when the deployment has a known set of target upstream URIs.

enabled: ${proxy.enabled:true}
http2Enabled: ${proxy.http2Enabled:false}
hosts: ${proxy.hosts:http://localhost:8080}
connectionsPerThread: ${proxy.connectionsPerThread:20}
maxRequestTime: ${proxy.maxRequestTime:1000}
rewriteHostHeader: ${proxy.rewriteHostHeader:true}
reuseXForwarded: ${proxy.reuseXForwarded:false}
maxConnectionRetries: ${proxy.maxConnectionRetries:3}
maxQueueSize: ${proxy.maxQueueSize:0}
forwardJwtClaims: ${proxy.forwardJwtClaims:false}
metricsInjection: ${proxy.metricsInjection:false}
metricsName: ${proxy.metricsName:proxy-response}

The Rust implementation should parse proxy.hosts as one or more comma-separated http:// or https:// targets and select a target with round-robin load balancing. It should preserve rewriteHostHeader, reuseXForwarded, request timeout, retry, and queue settings where Pingora exposes equivalent behavior.
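
A minimal sketch of that parsing and selection, using only the standard library. RoundRobin is a hypothetical type for illustration; the real handler would hand the chosen target to Pingora's upstream_peer rather than return a string.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Round-robin selector over the targets listed in proxy.hosts.
pub struct RoundRobin {
    targets: Vec<String>,
    next: AtomicUsize,
}

impl RoundRobin {
    /// Parse a comma-separated list and reject anything that is not an
    /// http:// or https:// target.
    pub fn parse(hosts: &str) -> Result<Self, String> {
        let targets: Vec<String> = hosts
            .split(',')
            .map(str::trim)
            .filter(|host| !host.is_empty())
            .map(String::from)
            .collect();
        if targets.is_empty() {
            return Err("proxy.hosts is empty".to_string());
        }
        if let Some(bad) = targets
            .iter()
            .find(|t| !(t.starts_with("http://") || t.starts_with("https://")))
        {
            return Err(format!("unsupported proxy target: {}", bad));
        }
        Ok(Self { targets, next: AtomicUsize::new(0) })
    }

    /// Pick the next target. The atomic counter lets concurrent requests
    /// share the selector through `&self`.
    pub fn pick(&self) -> &str {
        let index = self.next.fetch_add(1, Ordering::Relaxed) % self.targets.len();
        &self.targets[index]
    }
}
```

Validating the scheme at parse time fits the goal of failing at startup rather than silently running with a bad upstream.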

Router Config

router.yml should keep the Java outbound router contract. This is primarily for the sidecar pattern, where earlier handlers resolve service_id, service_url, tokens, and discovery context before the router connects to the downstream service.

http2Enabled: ${router.http2Enabled:true}
httpsEnabled: ${router.httpsEnabled:true}
maxRequestTime: ${router.maxRequestTime:1000}
pathPrefixMaxRequestTime: ${router.pathPrefixMaxRequestTime:{}}
connectionsPerThread: ${router.connectionsPerThread:10}
softMaxConnectionsPerThread: ${router.softMaxConnectionsPerThread:5}
maxQueueSize: ${router.maxQueueSize:0}
rewriteHostHeader: ${router.rewriteHostHeader:true}
reuseXForwarded: ${router.reuseXForwarded:false}
maxConnectionRetries: ${router.maxConnectionRetries:3}
preResolveFQDN2IP: ${router.preResolveFQDN2IP:false}
hostWhitelist: ${router.hostWhitelist:[]}
serviceIdQueryParameter: ${router.serviceIdQueryParameter:false}
urlRewriteRules: ${router.urlRewriteRules:[]}
methodRewriteRules: ${router.methodRewriteRules:[]}
queryParamRewriteRules: ${router.queryParamRewriteRules:{}}
headerRewriteRules: ${router.headerRewriteRules:{}}
metricsInjection: ${router.metricsInjection:false}
metricsName: ${router.metricsName:router-response}

The Java router chooses the target from service_url first, guarded by hostWhitelist, or from service_id plus optional env_tag through service discovery.

Phase 5 implements the Pingora router execution path and keeps the Java configuration shape. The active router handler loads and registers router.yml, selects direct service_url targets after hostWhitelist validation, supports serviceIdQueryParameter, and removes router selection headers before forwarding upstream. It also applies Java-style URL, method, query-parameter, and header rewrite rules.

Rust adds serviceTargets as an interim improvement for service_id routing:

serviceTargets:
  com.networknt.petstore-1.0.0:
    - http://localhost:8080
  com.networknt.petstore-1.0.0|dev:
    - https://petstore-dev.example.com

This lets sidecar-style router flows run in local/static deployments and acts as the fallback when controller discovery is unavailable.
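
The lookup can be sketched as follows. The key shape service_id|env_tag comes from the example above; falling back from the tagged key to the untagged key is an assumption made for this illustration, and static_targets is a hypothetical function name.

```rust
use std::collections::HashMap;

/// Look up static targets for a service. When an env_tag is present, the
/// `service_id|env_tag` key is tried first.
pub fn static_targets<'a>(
    service_targets: &'a HashMap<String, Vec<String>>,
    service_id: &str,
    env_tag: Option<&str>,
) -> Option<&'a [String]> {
    if let Some(tag) = env_tag {
        let tagged_key = format!("{}|{}", service_id, tag);
        if let Some(targets) = service_targets.get(&tagged_key) {
            return Some(targets.as_slice());
        }
    }
    // Assumption: an untagged entry also serves tagged requests that have
    // no tag-specific entry.
    service_targets.get(service_id).map(Vec::as_slice)
}
```

A None result is where controller discovery, when available, would be consulted or an error returned.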

Phase 6 adds the sidecar path-prefix and token flow. Phase 7 adds controller-backed service_id discovery while keeping the same request contract and the same static fallback.

Sidecar Path Prefix And Token Config

pathPrefixService.yml maps request path prefixes to downstream service IDs. The handler writes service_id only when the request does not already provide one.

enabled: ${pathPrefixService.enabled:true}
mapping: ${pathPrefixService.mapping:{}}

Rust intentionally selects the longest prefix that ends on a path-segment boundary. This avoids map iteration ambiguity when prefixes overlap and prevents the prefix /v1/address from matching a request for /v1/address2.
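
A short sketch of longest path-boundary prefix selection. The function names and the service IDs in the usage note are hypothetical; the mapping is modeled as a slice of (prefix, service_id) pairs rather than the parsed YAML map.

```rust
/// True when `prefix` matches `path` on a segment boundary: the path either
/// equals the prefix or continues with '/'.
fn is_boundary_prefix(prefix: &str, path: &str) -> bool {
    let prefix = prefix.trim_end_matches('/');
    match path.strip_prefix(prefix) {
        Some(rest) => rest.is_empty() || rest.starts_with('/'),
        None => false,
    }
}

/// Among all boundary-matching prefixes, choose the longest one and return
/// its service ID.
pub fn match_prefix<'a>(mapping: &'a [(&'a str, &'a str)], path: &str) -> Option<&'a str> {
    mapping
        .iter()
        .filter(|(prefix, _)| is_boundary_prefix(prefix, path))
        .max_by_key(|(prefix, _)| prefix.len())
        .map(|(_, service_id)| *service_id)
}
```

Given prefixes /v1, /v1/address, and /v1/address2, a request for /v1/address2/123 selects the /v1/address2 entry, and /v1/address never matches it because the remainder "2/123" does not start at a segment boundary.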

token.yml gates when the token handler should run:

enabled: ${token.enabled:false}
appliedPathPrefixes: ${token.appliedPathPrefixes:}

The token handler reads the Java-compatible client credentials section from client.yml:

tls:
  verifyHostname: ${client.verifyHostname:true}
oauth:
  multipleAuthServers: ${client.multipleAuthServers:false}
  token:
    cache:
      capacity: ${client.tokenCacheCapacity:200}
    tokenRenewBeforeExpired: ${client.tokenRenewBeforeExpired:60000}
    server_url: ${client.tokenServerUrl:}
    serviceId: ${client.tokenServiceId:com.networknt.oauth2-token-1.0.0}
    proxyHost: ${client.tokenProxyHost:}
    proxyPort: ${client.tokenProxyPort:}
    enableHttp2: ${client.tokenEnableHttp2:true}
    client_credentials:
      uri: ${client.tokenCcUri:/oauth2/token}
      client_id: ${client.tokenCcClientId:}
      client_secret: ${client.tokenCcClientSecret:}
      scope: ${client.tokenCcScope:}
      serviceIdAuthServers: ${client.tokenCcServiceIdAuthServers:}
pathPrefixServices: ${client.pathPrefixServices:}
request:
  connectTimeout: ${client.connectTimeout:2000}
  timeout: ${client.timeout:3000}
  enableHttp2: ${client.enableHttp2:true}

In single-auth-server mode, the handler uses the configured token server and client credentials for all matched paths. In multipleAuthServers mode, it uses service_id or pathPrefixServices to select client_credentials.serviceIdAuthServers[service_id].

The token request follows the Java request shape:

  • POST to server_url + uri
  • Content-Type: application/x-www-form-urlencoded
  • Accept: application/json
  • HTTP Basic authentication with client_id:client_secret
  • form fields grant_type=client_credentials and optional space-joined scope

The injected header follows the Java gateway rule:

  • if the inbound request has no Authorization, inject Authorization: Bearer <token>
  • if the inbound request already has Authorization, inject X-Scope-Token: Bearer <token>
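
The injection rule reduces to a single branch. The sketch below uses a hypothetical token_header function and takes the presence of an inbound Authorization header as a boolean, leaving header access to the surrounding Pingora code.

```rust
/// Choose the header used to carry the client-credentials token.
/// Returns the header name and the `Bearer <token>` value.
pub fn token_header(inbound_has_authorization: bool, token: &str) -> (&'static str, String) {
    let name = if inbound_has_authorization {
        // Keep the caller's Authorization header untouched.
        "X-Scope-Token"
    } else {
        "Authorization"
    };
    (name, format!("Bearer {}", token))
}
```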

The Rust cache is local to the gateway process and is registered as light-pingora/token-cache when a runtime cache registry is available. Cache summaries expose key and expiry metadata but never expose bearer token values. Tokens are refreshed synchronously inside the configured renew-before-expiry window. Async background renewal can be added later if blocking refresh latency becomes visible.
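
The renew-before-expiry check can be illustrated as below. CachedToken and needs_refresh are hypothetical names; the window corresponds to tokenRenewBeforeExpired, which this sketch assumes is expressed in milliseconds because its default is 60000.

```rust
use std::time::{Duration, Instant};

/// A cached bearer token with its absolute expiry time.
pub struct CachedToken {
    pub value: String,
    pub expires_at: Instant,
}

impl CachedToken {
    /// True once `now` is inside the renew window, that is, when the token
    /// expires within `renew_before` or has already expired.
    pub fn needs_refresh(&self, now: Instant, renew_before: Duration) -> bool {
        now + renew_before >= self.expires_at
    }
}
```

A synchronous refresh would call the token endpoint when needs_refresh returns true and replace the cached entry before injecting the header.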

When server_url is not configured, phase 7 discovers the token service from serviceId through the runtime portal-registry client. This requires server.enableRegistry and a live controller registration. A disconnected registry client returns a clear configuration/runtime error instead of silently falling back to an unknown token endpoint.

Static Resource Config

For a single static site, keep path-resource.yml:

path: ${path-resource.path:/public}
base: ${path-resource.base:/opt/light-4j/public}
prefix: ${path-resource.prefix:true}
transferMinSize: ${path-resource.transferMinSize:1024}
directoryListingEnabled: ${path-resource.directoryListingEnabled:false}

For host-based BFF/static sites, keep virtual-host.yml:

hosts: ${virtual-host.hosts:[]}

Example config-server values:

virtual-host.hosts:
  - domain: local.localhost
    path: /
    base: /lightapi/dist
    transferMinSize: 10245760
    directoryListingEnabled: false
  - domain: signin.localhost
    path: /
    base: /signin/dist
    transferMinSize: 10245760
    directoryListingEnabled: false

Rust should preserve the Java domain, path, base, transferMinSize, and directoryListingEnabled fields. It should also add the Rust improvement for SPA fallback: when a static virtual host cannot find a requested browser route and the path does not look like an asset, it should serve index.html from the matched static root.
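
One way to express the SPA fallback decision is sketched below. The heuristic for "looks like an asset" (the last path segment contains a dot) is an assumption for illustration, as are the names StaticDecision, looks_like_asset, and decide.

```rust
/// Outcome of a static lookup under a matched virtual host.
#[derive(Debug, PartialEq, Eq)]
pub enum StaticDecision {
    /// Serve the requested file from the static root.
    File,
    /// Serve index.html so the SPA router can handle the browser route.
    IndexFallback,
    /// The request named an asset that does not exist.
    NotFound,
}

/// Assumed heuristic: a path is an asset when its last segment has a file
/// extension, for example /assets/app.js.
pub fn looks_like_asset(path: &str) -> bool {
    path.rsplit('/')
        .next()
        .map_or(false, |segment| segment.contains('.'))
}

/// Decide how to answer a request given whether the file exists on disk.
pub fn decide(path: &str, file_exists: bool) -> StaticDecision {
    if file_exists {
        StaticDecision::File
    } else if looks_like_asset(path) {
        StaticDecision::NotFound
    } else {
        StaticDecision::IndexFallback
    }
}
```

Returning NotFound for missing assets keeps a broken script or stylesheet URL from being answered with HTML.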

BFF Wiring Example

The Java BFF config in portal-config-loc/all-in-lt/light-gateway uses handler.paths to send API routes through the default chain, which includes path-prefix service resolution, token handling, and the router. It then uses:

handler.defaultHandlers:
  - cors
  - virtual

That means unmatched browser routes fall through to CORS plus virtual-host static serving. Rust should keep this pattern: handler.yml decides whether a request goes to proxy/router/static handling, based on paths and fallback handlers.

Other product personas use different config file combinations. A BFF commonly uses handler.yml, router.yml, path-prefix/token configs, and virtual-host.yml. A simple proxy or balancer can use handler.yml and proxy.yml. A sidecar uses handler.yml, router.yml, token/cache config, registry/discovery config, and usually no static resource config.

Phase 3 Handler Config

Phase 3 implements the first three Java-compatible cross-cutting handlers.

correlation.yml:

enabled: ${correlation.enabled:true}
autogenCorrelationID: ${correlation.autogenCorrelationID:true}
correlationMdcField: ${correlation.correlationMdcField:cId}
traceabilityMdcField: ${correlation.traceabilityMdcField:tId}

The Rust handler reads X-Correlation-Id and X-Traceability-Id, generates a Java-compatible URL-safe UUID value when X-Correlation-Id is missing, passes the correlation ID to the upstream request, and echoes X-Traceability-Id on the response. It stores the values in the Pingora request context instead of MDC.

cors.yml:

enabled: ${cors.enabled:true}
allowedOrigins: ${cors.allowedOrigins:}
allowedMethods: ${cors.allowedMethods:}
pathPrefixAllowed: ${cors.pathPrefixAllowed:}

The Rust handler accepts the same list/string forms as Java, supports pathPrefixAllowed, short-circuits preflight OPTIONS, rejects disallowed origins with 403, and adds the CORS response headers before static or proxied responses are sent. Rust intentionally uses longest-prefix selection for pathPrefixAllowed so overlapping prefixes are deterministic.

metrics.yml:

enabled: ${metrics.enabled:true}
enableJVMMonitor: ${metrics.enableJVMMonitor:false}
serverProtocol: ${metrics.serverProtocol:http}
serverHost: ${metrics.serverHost:localhost}
serverPath: ${metrics.serverPath:/apm/metricFeed}
serverPort: ${metrics.serverPort:8086}
serverName: ${metrics.serverName:metrics}
serverUser: ${metrics.serverUser:admin}
serverPass: ${metrics.serverPass:admin}
reportInMinutes: ${metrics.reportInMinutes:1}
productName: ${metrics.productName:http-sidecar}
sendScopeClientId: ${metrics.sendScopeClientId:false}
sendCallerId: ${metrics.sendCallerId:false}
sendIssuer: ${metrics.sendIssuer:false}
issuerRegex: ${metrics.issuerRegex:}

Phase 3 parses and registers this config with serverPass masked, records request counts and status classes in memory, and logs request metrics with the matched endpoint and correlation ID. enableJVMMonitor is parsed for config compatibility but is not applicable to Rust. External Influx/APM reporters are deferred until the metrics sink decision is made.

Phase 4 Handler Config

Phase 4 implements the security-oriented Java-compatible handlers that fit the Pingora request metadata model.

header.yml:

enabled: ${header.enabled:false}
request:
  remove: ${header.request.remove:}
  update: ${header.request.update:}
response:
  remove: ${header.response.remove:}
  update: ${header.response.update:}
pathPrefixHeader: ${header.pathPrefixHeader:}

The Rust handler applies request header remove/update rules before proxying and response header remove/update rules before static or proxied responses are sent. Rust intentionally uses longest-prefix selection for pathPrefixHeader so overlapping prefixes are deterministic.

apikey.yml:

enabled: ${apikey.enabled:true}
hashEnabled: ${apikey.hashEnabled:false}
pathPrefixAuths: ${apikey.pathPrefixAuths:[]}

The Rust handler follows the Java rule that no matching path prefix means the handler passes the request. A matching rule validates the configured header against either a plain API key or the Java iterations:saltHex:hashHex PBKDF2-HMAC-SHA1 hash format.

basic-auth.yml:

enabled: ${basic.enabled:false}
enableAD: ${basic.enableAD:true}
allowAnonymous: ${basic.allowAnonymous:false}
allowBearerToken: ${basic.allowBearerToken:false}
users: ${basic.users:[]}

The Rust handler supports configured local users, anonymous path users, and the Java-compatible bearer pass-through mode. LDAP/AD authentication is parsed for configuration compatibility but is not implemented in phase 4.

security.yml:

enableVerifyJwt: ${security.enableVerifyJwt:true}
ignoreJwtExpiry: ${security.ignoreJwtExpiry:false}
enableH2c: ${security.enableH2c:false}
enableMockJwt: ${security.enableMockJwt:false}
jwt:
  certificate: ${security.jwt.certificate:{}}
  clockSkewInSeconds: ${security.jwt.clockSkewInSeconds:60}
  keyResolver: ${security.jwt.keyResolver:}
skipPathPrefixes: ${security.skipPathPrefixes:[]}
passThroughClaims: ${security.passThroughClaims:{}}

The Rust handler verifies Bearer JWTs with configured PEM certificates, honors kid when present, supports RSA and EC algorithms handled by the Rust JWT library, applies clock skew and optional expiry bypass, caches decoded claims, and forwards configured pass-through claims as request headers. Dynamic JWK key service bootstrap and SWT/SJWT verification are deferred until the runtime has the discovery and key-service client surface needed by those flows.

unified-security.yml:

enabled: ${unified-security.enabled:true}
anonymousPrefixes: ${unified-security.anonymousPrefixes:[]}
pathPrefixAuths: ${unified-security.pathPrefixAuths:[]}

The Rust handler supports Java-style path-prefix selection across Basic, JWT, and API-key authentication. Anonymous prefixes bypass authentication. SWT/SJWT rules return a clear not-implemented response until the discovery-backed key flow is added.

limit.yml:

enabled: ${limit.enabled:false}
concurrentRequest: ${limit.concurrentRequest:0}
queueSize: ${limit.queueSize:0}
errorCode: ${limit.errorCode:429}
rateLimit: ${limit.rateLimit:}
headersAlwaysSet: ${limit.headersAlwaysSet:false}
key: ${limit.key:server}
server: ${limit.server:{}}
address: ${limit.address:{}}
client: ${limit.client:{}}
user: ${limit.user:{}}

The Rust handler implements in-memory request rate limiting by server, client address, JWT client ID, or JWT user ID. It emits X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, and Retry-After when a request is rejected, and it can always emit the rate-limit headers when headersAlwaysSet is enabled. Cluster-wide distributed counters are deferred until there is a concrete gateway clustering requirement.
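
For illustration only, an in-memory limiter keyed by server, address, client, or user could look like the fixed-window sketch below. This document does not state which algorithm the handler uses, so the fixed window, the FixedWindowLimiter type, and the return shape are all assumptions.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Fixed-window counter per key (for example a client address).
pub struct FixedWindowLimiter {
    limit: u32,
    window: Duration,
    counters: HashMap<String, (Instant, u32)>,
}

impl FixedWindowLimiter {
    pub fn new(limit: u32, window: Duration) -> Self {
        Self { limit, window, counters: HashMap::new() }
    }

    /// Ok(remaining) when the request is allowed, which can feed
    /// X-RateLimit-Remaining. Err(retry_after) when it is rejected, which
    /// can feed Retry-After.
    pub fn check(&mut self, key: &str, now: Instant) -> Result<u32, Duration> {
        let entry = self.counters.entry(key.to_string()).or_insert((now, 0));
        // Start a new window once the current one has elapsed.
        if now.duration_since(entry.0) >= self.window {
            *entry = (now, 0);
        }
        if entry.1 < self.limit {
            entry.1 += 1;
            Ok(self.limit - entry.1)
        } else {
            Err(self.window - now.duration_since(entry.0))
        }
    }
}
```

Passing the current time as a parameter keeps the limiter deterministic under test; a cluster-wide variant would replace the HashMap with a shared counter store.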

Handler Registry

Use explicit registration.

let handlers = PingoraHandlerRegistry::new()
    .register(correlation::descriptor())
    .register(headers::descriptor())
    .register(metrics::descriptor())
    .register(cors::descriptor())
    .register(jwt::descriptor())
    .register(rate_limit::descriptor());

No inventory is needed for the first version. Explicit registration is deterministic, testable, and makes the compiled-in handler set clear from the service binary.

The light-gateway binary can register every built-in handler it supports. Registration only makes a handler available. Activation is controlled by handler.yml.

Build the active handler set lazily:

  1. parse handler.yml
  2. resolve paths and defaultHandlers
  3. expand any referenced chains
  4. compute the set of referenced handler IDs
  5. instantiate only referenced handlers
  6. load config only for referenced handlers

This allows one binary to support gateway, sidecar, proxy, balancer, and BFF profiles without requiring unused handler config files.
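Steps 2 through 4 above can be sketched as a small graph walk. The function name active_handler_ids and the plain string maps are assumptions chosen for illustration; the real light-pingora types are not shown in this document.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Expands chain references into the set of handler IDs that are actually used.
///
/// `registered` holds the handler IDs compiled into the binary, `chains` maps a
/// chain name to its items (handler IDs or other chain names), and `roots` are
/// the items referenced from `paths` and `defaultHandlers`.
pub fn active_handler_ids(
    registered: &BTreeSet<String>,
    chains: &BTreeMap<String, Vec<String>>,
    roots: &[String],
) -> Result<BTreeSet<String>, String> {
    fn visit(
        item: &str,
        registered: &BTreeSet<String>,
        chains: &BTreeMap<String, Vec<String>>,
        stack: &mut Vec<String>,
        active: &mut BTreeSet<String>,
    ) -> Result<(), String> {
        if let Some(items) = chains.get(item) {
            // A chain that is already on the expansion stack refers to itself.
            if stack.iter().any(|s| s == item) {
                return Err(format!("recursive chain reference: {}", item));
            }
            stack.push(item.to_string());
            for child in items {
                visit(child, registered, chains, stack, active)?;
            }
            stack.pop();
            Ok(())
        } else if registered.contains(item) {
            active.insert(item.to_string());
            Ok(())
        } else {
            Err(format!("unknown handler or chain: {}", item))
        }
    }

    let mut active = BTreeSet::new();
    let mut stack = Vec::new();
    for root in roots {
        visit(root, registered, chains, &mut stack, &mut active)?;
    }
    Ok(active)
}
```

Only the IDs in the returned set would then be instantiated and have their config files loaded, so a registered but unreferenced handler costs nothing at startup.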

The registry maps stable config IDs to factories:

pub struct PingoraHandlerDescriptor {
    pub id: &'static str,
    pub kind: PingoraHandlerKind,
    pub factory: PingoraHandlerFactory,
}

Suggested first handler IDs:

  • correlation
  • headers
  • metrics
  • cors
  • jwt
  • api-key
  • basic-auth
  • rate-limit
  • request-size-limit

Trace headers should be handled by correlation; there should not be a separate traceability handler.

Handler API

Use Pingora phases directly. Avoid a generic exchange abstraction until another framework needs it.

The current implementation keeps PingoraHandler as a descriptor/factory surface and executes the built-in Phase 3 handlers from light-gateway's Pingora lifecycle. This keeps the first implementation straightforward:

  • request_filter resolves the configured chain and runs request-stage handlers in order.
  • A request-stage handler can continue, short-circuit with a local response, or select a terminal action such as proxy/static/health.
  • upstream_request_filter applies upstream request mutations such as generated correlation IDs.
  • response_filter applies response-stage headers and records proxied response metrics.
  • Static responses call the same response decoration and metrics code before writing the local response.

Once security/rate-limit handlers are added, this can be lifted into a richer trait with request/upstream/response hooks if the duplication becomes real. It is intentionally not generalized before the Pingora behavior stabilizes.

Response handlers should run before both static and proxied responses are sent. For proxied responses, this maps to Pingora response_filter. For static responses, the static-file renderer calls the same response handler chain before writing the local response.

Request Context

The per-request context should carry route decisions across Pingora phases.

pub struct GatewayRequestContext {
    pub upstream: Option<ProxyTarget>,
    pub endpoint: String,
    pub method: String,
    pub path_params: BTreeMap<String, String>,
    pub correlation: CorrelationState,
    pub cors: Option<CorsResponseHeaders>,
    pub metrics_enabled: bool,
}

The context is created by ProxyHttp::new_ctx() and populated in request_filter.

upstream_peer should only select an upstream after a proxy or router handler has selected one. If no upstream is selected for a proxied request, the implementation should return a clear configuration error rather than silently falling back.

Virtual Hosts

Virtual-host static serving should use the HTTP Host header.

Host normalization rules:

  • lowercase the host
  • strip the port when present
  • reject empty or invalid hosts unless a default virtual host is configured
  • exact host match first
  • wildcard match such as *.example.com after exact hosts, with the longest matching suffix winning
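The rules above can be sketched as two small functions. The names normalize_host and select_virtual_host are hypothetical, IPv6 literals are out of scope, and falling back to a default virtual host is left to the caller.

```rust
/// Lowercases the host and strips a trailing `:port`.
/// Returns `None` for empty or obviously invalid input.
pub fn normalize_host(raw: &str) -> Option<String> {
    let lowered = raw.trim().to_ascii_lowercase();
    let name = match lowered.rsplit_once(':') {
        Some((name, port)) => {
            if port.is_empty() || !port.bytes().all(|b| b.is_ascii_digit()) {
                return None;
            }
            name
        }
        None => lowered.as_str(),
    };
    let valid = !name.is_empty()
        && name.bytes().all(|b| b.is_ascii_alphanumeric() || b == b'-' || b == b'.');
    if valid { Some(name.to_string()) } else { None }
}

/// Picks the configured pattern for a normalized host: exact match first,
/// then the wildcard pattern (`*.suffix`) with the longest matching suffix.
pub fn select_virtual_host<'a>(host: &str, patterns: &[&'a str]) -> Option<&'a str> {
    if let Some(exact) = patterns.iter().copied().find(|p| *p == host) {
        return Some(exact);
    }
    patterns
        .iter()
        .copied()
        .filter(|p| {
            p.strip_prefix('*').map_or(false, |suffix| {
                suffix.starts_with('.') && host.len() > suffix.len() && host.ends_with(suffix)
            })
        })
        .max_by_key(|p| p.len())
}
```

With this sketch, *.example.com does not match the bare host example.com, because the wildcard requires at least one label in front of the suffix.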

HTTP host routing is enough for the first implementation.

TLS certificate selection by SNI is separate. The current light-pingora transport uses one Rustls TLS setting for the listener, so the first production options are:

  • terminate TLS at ingress or a load balancer
  • use a wildcard certificate
  • use one certificate with all required SANs

Phase 8 evaluated dynamic multi-cert SNI selection. The current light-pingora build uses Pingora's Rustls listener, and Pingora 0.8 Rustls TLS settings do not support certificate callbacks, so the three production options listed above still apply. Native multi-cert SNI can be added only after moving to a Pingora TLS backend or version that supports server certificate callbacks or certificate resolution through Rustls.

Static SPA Rendering

Static SPA rendering should be part of the Pingora resource engine, not a generic middleware handler. It is enabled by path-resource.yml or virtual-host.yml, typically for BFF profiles.

Rules for the first implementation:

  • support GET and HEAD
  • return 405 for unsupported methods on static routes
  • canonicalize requested paths under the configured static root
  • reject path traversal
  • do not serve files outside the static root
  • deny dotfiles by default
  • do not list directories
  • serve index.html for the root path
  • support SPA fallback to index.html for non-asset routes
  • infer Content-Type from file extension
  • set Cache-Control: no-cache for index.html
  • set long immutable cache headers for hashed assets
  • allow static route prefixes to be bypassed by API routes such as /api/, /oauth/, /mcp/, or /ws/
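The traversal, dotfile, and root-path rules can be sketched as a lexical resolver. This is a simplified illustration: resolve_static_path and StaticPathError are hypothetical names, the request path is assumed to be percent-decoded already, and the function does not touch the filesystem, so symlink checks through canonicalization would still be needed on top of it.

```rust
use std::path::{Component, Path, PathBuf};

#[derive(Debug, PartialEq)]
pub enum StaticPathError {
    Traversal,
    Dotfile,
}

/// Lexically resolves a request path under `root`.
/// Any `..` component is rejected outright, which is stricter than necessary
/// but keeps the result inside the static root by construction.
pub fn resolve_static_path(root: &Path, request_path: &str) -> Result<PathBuf, StaticPathError> {
    let mut resolved = root.to_path_buf();
    let mut depth = 0usize;
    for component in Path::new(request_path.trim_start_matches('/')).components() {
        match component {
            Component::Normal(part) => {
                if part.to_string_lossy().starts_with('.') {
                    return Err(StaticPathError::Dotfile);
                }
                resolved.push(part);
                depth += 1;
            }
            Component::CurDir => {}
            // `..`, a root, or a drive prefix must never reach the filesystem.
            _ => return Err(StaticPathError::Traversal),
        }
    }
    if depth == 0 {
        // The root path serves index.html.
        resolved.push("index.html");
    }
    Ok(resolved)
}
```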

Recommended cache behavior:

index.html                 Cache-Control: no-cache
*.js, *.css with hash       Cache-Control: public, max-age=31536000, immutable
images/fonts with hash      Cache-Control: public, max-age=31536000, immutable
other assets                Cache-Control: public, max-age=3600
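The table above can be expressed as a single selection function. How a hashed asset is recognized is not specified here, so the heuristic below (a file-name segment of at least eight hex characters) is an assumption, and cache_control_for is a hypothetical name.

```rust
/// Assumed heuristic: a file name counts as hashed when one of its `.` or `-`
/// separated segments is at least eight hex characters long.
fn looks_hashed(file_name: &str) -> bool {
    file_name
        .split(|c: char| c == '.' || c == '-')
        .any(|part| part.len() >= 8 && part.bytes().all(|b| b.is_ascii_hexdigit()))
}

/// Selects the Cache-Control value for a static path.
pub fn cache_control_for(path: &str) -> &'static str {
    let file_name = path.rsplit('/').next().unwrap_or(path);
    if file_name == "index.html" {
        "no-cache"
    } else if looks_hashed(file_name) {
        "public, max-age=31536000, immutable"
    } else {
        "public, max-age=3600"
    }
}
```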

Phase 8 keeps small static files on the simple read-then-write path and streams files whose size is greater than or equal to the configured transferMinSize. Static responses include ETag and Last-Modified, honor If-None-Match and If-Modified-Since, and return 304 without a response body when the browser cache is current.

Proxy And Router Behavior

proxy.yml selects from configured upstream URIs. This is the simpler inbound reverse-proxy case and should be implemented before dynamic sidecar routing.

Fixed proxy target behavior:

  • parse comma-separated proxy.hosts
  • support http:// and https://
  • duplicate a single host internally if retry/load-balancer behavior needs at least two entries
  • select upstream with round-robin
  • apply timeout, retry, queue, and host-forwarding settings where Pingora supports them
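Parsing proxy.hosts and round-robin selection can be sketched as follows. RoundRobinTargets is a hypothetical name, and timeouts, retries, and host duplication are left out of the sketch.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Fixed upstream list with lock-free round-robin selection.
pub struct RoundRobinTargets {
    hosts: Vec<String>,
    next: AtomicUsize,
}

impl RoundRobinTargets {
    /// Parses a comma-separated `proxy.hosts` value and accepts only
    /// http:// and https:// entries.
    pub fn parse(hosts: &str) -> Result<Self, String> {
        let hosts: Vec<String> = hosts
            .split(',')
            .map(|h| h.trim().to_string())
            .filter(|h| !h.is_empty())
            .collect();
        if hosts.is_empty() {
            return Err("proxy.hosts is empty".to_string());
        }
        for h in &hosts {
            if !(h.starts_with("http://") || h.starts_with("https://")) {
                return Err(format!("unsupported proxy host scheme: {}", h));
            }
        }
        Ok(Self { hosts, next: AtomicUsize::new(0) })
    }

    /// Returns the next upstream, wrapping around at the end of the list.
    pub fn select(&self) -> &str {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.hosts.len();
        &self.hosts[i]
    }
}
```

Because select only needs a shared reference, one parsed instance can be shared across request threads without a lock.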

router.yml selects from request metadata. Phase 5 implements direct service_url targets, static serviceTargets for service_id, host whitelist enforcement, and rewrite behavior. Phase 7 adds controller-backed service_id lookup through the runtime portal-registry client and keeps static serviceTargets as a local fallback.

Router target behavior:

  • prefer service_url when present and allowed by router.hostWhitelist
  • otherwise use service_id plus optional env_tag
  • optionally allow service_id from the query string when serviceIdQueryParameter is true
  • resolve service_id from controller discovery when the portal-registry client is connected
  • fall back to router.serviceTargets for local/static deployments or controller lookup failures
  • support URL, method, query-parameter, and header rewrite rules
  • remove service_url and service_id headers before forwarding

upstream_peer creates the HttpPeer from the selected upstream:

  • address
  • TLS enabled
  • SNI
  • optional host header

upstream_request_filter should set or override upstream headers such as:

  • Host
  • X-Forwarded-For
  • X-Forwarded-Proto
  • X-Forwarded-Host
  • X-Light-Gateway or equivalent runtime marker

Handler-specific upstream mutations should also run from this phase.

Chain Resolution

Startup should validate handler and selected traffic/resource configuration before binding listeners.

Validation rules:

  • every handler ID in handler.yml must exist in the explicit registry
  • every chain item must resolve to a registered handler or another chain
  • recursive chain references are invalid
  • every handler.paths entry must reference existing chains or handlers
  • every handler.defaultHandlers entry must reference existing chains or handlers
  • proxy.yml hosts must be valid http:// or https:// URIs when the proxy handler is active
  • router.yml rewrite rules must be parseable when the router handler is active
  • every static virtual host must have a static root
  • static roots must be absolute or resolved relative to a configured base
  • duplicate exact virtual hosts are invalid
  • duplicate handler IDs in the registry are invalid

The resolved model should be immutable and cheap to read:

pub struct GatewayRuntimeModel {
    pub virtual_hosts: BTreeMap<String, Arc<VirtualHost>>,
    pub default_host: Option<Arc<VirtualHost>>,
    pub chains: BTreeMap<String, Arc<ResolvedHandlerChain>>,
    pub proxy_targets: Vec<Arc<ProxyTarget>>,
}

Config reload should continue to swap loaded models atomically. In-flight requests should keep using the handler/resource/proxy/router model they already selected.
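The swap-and-keep behavior can be sketched with the standard library alone. ModelHandle is a hypothetical name, and RwLock<Arc<T>> stands in here for whatever swap primitive the runtime actually uses.

```rust
use std::sync::{Arc, RwLock};

/// Holds the current resolved model and lets reload replace it in one step.
pub struct ModelHandle<T> {
    current: RwLock<Arc<T>>,
}

impl<T> ModelHandle<T> {
    pub fn new(model: T) -> Self {
        Self { current: RwLock::new(Arc::new(model)) }
    }

    /// A request takes one snapshot and keeps using it, even if a reload
    /// publishes a newer model while the request is still in flight.
    pub fn snapshot(&self) -> Arc<T> {
        Arc::clone(&self.current.read().unwrap())
    }

    /// Reload builds and validates the new model first, then publishes it.
    pub fn swap(&self, model: T) {
        *self.current.write().unwrap() = Arc::new(model);
    }
}
```

The old model is dropped automatically once the last in-flight request releases its snapshot.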

Runtime Integration

light-runtime remains responsible for bootstrap, config loading, lifecycle, controller registration, and module registry. light-pingora should load its Pingora-specific handler, traffic, and resource config through the existing runtime config loader.

Module IDs:

  • light-pingora/handler
  • light-pingora/proxy
  • light-pingora/router
  • light-pingora/path-prefix-service
  • light-pingora/token
  • light-client/client
  • light-pingora/path-resource
  • light-pingora/virtual-host
  • light-pingora/correlation
  • light-pingora/cors
  • light-pingora/metrics
  • light-pingora/header
  • light-pingora/security
  • light-pingora/apikey
  • light-pingora/basic-auth
  • light-pingora/unified-security
  • light-pingora/limit

The module registry should expose:

  • handler config snapshot, masked
  • proxy, router, path-resource, and virtual-host config snapshots, masked
  • active handler IDs
  • active chains
  • active virtual hosts
  • active proxy/router/static capabilities
  • reloadable status

The implemented phases use the existing ReloadableModule pattern for active handler, proxy, router, resource, virtual-host, path-prefix service, token, and handler-specific config files. Phase 7 exposes a capabilities summary from get_service_info, including active modules, traffic capabilities, active handlers, chain names, path mappings, default handlers, virtual hosts, and path-resource config.

Suitable First Handlers

Start with handlers that map cleanly to Pingora request and response metadata:

  • correlation ID and trace headers
  • response headers
  • metrics
  • CORS
  • JWT verification
  • API key verification
  • basic auth
  • request size limit from headers
  • simple rate limiting by principal, IP, host, or route

Defer handlers that require deeper body handling:

  • request decompression
  • response compression policy beyond Pingora modules
  • request body sanitizer
  • generic body parser
  • WebSocket message handlers

Error Model

Handlers and proxy/resource selection should return structured errors that render consistently.

pub struct HandlerError {
    pub status: u16,
    pub code: Cow<'static, str>,
    pub message: Cow<'static, str>,
    pub metadata: serde_json::Value,
}

Security handlers should avoid returning sensitive validation details to the client. Detailed diagnostics should go to logs with correlation IDs.

Common gateway errors:

  • unknown host: 404
  • no matching handler path or static resource: 404
  • unsupported method for static route: 405
  • static file outside root: 403
  • missing upstream: startup validation error
  • auth failure: 401 or 403
  • rate limit: 429

Testing Strategy

Unit tests in light-pingora:

  • build active handler set from referenced paths and defaultHandlers
  • ignore registered but unreferenced handlers
  • do not require config files for unreferenced handlers
  • parse valid handler.yml
  • reject unknown handler IDs
  • reject recursive chains
  • resolve path/default handler chains in order
  • parse handler.yaml fallback
  • merge additionalHandlers, additionalChains, and additionalPaths
  • capture path-template variables
  • parse CORS list/string and path-prefix config
  • classify metrics status codes
  • normalize host names and strip ports
  • reject duplicate virtual hosts
  • match exact virtual hosts
  • parse and validate proxy.yml hosts
  • parse and validate router.yml rewrite-rule config
  • select router targets from direct service_url
  • reject direct router targets that do not match hostWhitelist
  • select router targets from controller discovery and static serviceTargets
  • apply router URL, method, query-parameter, and header rewrites
  • parse pathPrefixService.yml and avoid partial-segment path matches
  • parse token.yml and the client credentials subset of client.yml
  • support single and multiple auth-server token configuration
  • discover token service endpoints from client.yml token serviceId
  • mask token cache summaries and never expose bearer token values
  • expose gateway capabilities in get_service_info
  • prevent static path traversal
  • deny dotfiles by default
  • serve index.html for /
  • serve SPA fallback for non-asset paths
  • avoid SPA fallback for /api/ proxy routes
  • select cache headers for index.html and hashed assets
  • stop handler execution on early response
  • run response handlers before static response write

Integration tests:

  • same binary starts with BFF profile config
  • same binary starts with proxy or balancer profile config
  • BFF profile can route API paths through configured handlers and serve SPA fallback through defaultHandlers
  • static SPA route returns index.html
  • static asset route returns correct content type and cache header
  • virtual host A and virtual host B serve different roots
  • API route is proxied to the configured proxy.yml upstream
  • auth handler blocks protected API routes
  • public static route does not require auth unless configured

Rollout Plan

Phase 1: Product config and active handler model (implemented)

  • keep a single apps/light-gateway binary
  • register all built-in handler descriptors explicitly
  • resolve active handler IDs from handler.yml
  • instantiate only active handlers
  • load config only for active handlers
  • document product profiles managed by light-portal

Phase 2: BFF and fixed proxy engine (implemented)

  • load and register proxy.yml, path-resource.yml, and virtual-host.yml
  • match handler.yml paths and fallback handlers in Java-compatible order
  • select fixed proxy upstreams from proxy.yml
  • match virtual hosts by Host
  • serve single-site and virtual-host static content
  • implement safe static path resolution
  • serve static files from request_filter
  • add Rust SPA fallback improvement
  • add content type and cache headers
  • add traversal, dotfile, fallback, proxy-host, and virtual-host tests

Phase 3: Handler chain execution (implemented)

  • run request and response handlers around static and proxied responses
  • implement correlation, CORS, and basic metrics
  • parse correlation.yml, cors.yml, and metrics.yml
  • pass generated correlation IDs upstream
  • apply response headers to both static and proxied responses
  • log handler duration when reportHandlerDuration is enabled
  • defer generic response headers to a handler-specific follow-up

Phase 4: Security and request/response policy handlers (implemented)

  • implement JWT, API key, basic auth, and rate-limit handlers
  • implement the generic header handler for request and response mutation
  • implement unified-security path-prefix selection for Basic, JWT, and API key
  • parse Java-compatible security.yml, apikey.yml, basic-auth.yml, unified-security.yml, header.yml, and limit.yml
  • add JWT pass-through claim request header mutation
  • add path-level chain selection for public SPA and protected API routes

Phase 5: Sidecar router (implemented)

  • load and register router.yml
  • implement dynamic target selection by service_url or service_id
  • enforce hostWhitelist
  • support static serviceTargets for service_id routing until runtime discovery is available
  • support router URL, method, query-parameter, and header rewrites
  • apply router request mutation in upstream_request_filter
  • remove router selection headers before forwarding
  • include router config in the active reload model
  • add sidecar-focused tests

Phase 6: Sidecar path-prefix and token flow (implemented)

  • load and register pathPrefixService.yml
  • resolve service_id by longest path-boundary prefix
  • load and register token.yml
  • load and register the token-related view of client.yml
  • support single-auth-server and multipleAuthServers client credentials
  • cache tokens locally and expose masked cache summaries through the runtime cache registry
  • inject Authorization or X-Scope-Token according to inbound request state
  • extend reload coverage to pathPrefixService.yml, token.yml, and token-related client.yml
  • add sidecar token/path-prefix tests
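Longest path-boundary prefix resolution, as listed for pathPrefixService.yml, can be sketched like this. The function name resolve_service_id and the slice-of-pairs mapping are assumptions for illustration.

```rust
/// Resolves a service ID from (prefix, service_id) pairs.
/// A prefix matches only on a path-segment boundary, so `/v1/pets` matches
/// `/v1/pets/42` but not `/v1/petsitter`. The longest matching prefix wins.
pub fn resolve_service_id<'a>(path: &str, mappings: &[(&str, &'a str)]) -> Option<&'a str> {
    mappings
        .iter()
        .filter(|(prefix, _)| {
            let prefix = prefix.trim_end_matches('/');
            path == prefix
                || path
                    .strip_prefix(prefix)
                    .map_or(false, |rest| rest.starts_with('/'))
        })
        .max_by_key(|(prefix, _)| prefix.len())
        .map(|(_, service_id)| *service_id)
}
```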

Phase 7: Discovery and control plane (implemented)

  • expose the runtime portal-registry client to framework transports
  • add discovery/lookup support to the portal-registry client
  • resolve router service_id targets through controller discovery
  • keep static router.serviceTargets as a fallback for local/static profiles
  • discover token service endpoints from client.yml token serviceId
  • expose active capabilities, hosts, paths, handlers, and chains through get_service_info
  • atomically replace resolved handler/resource/proxy models on reload

Phase 8: Advanced transport features (implemented)

  • add streaming static-file delivery for files at or above transferMinSize
  • add conditional static requests with ETag and Last-Modified
  • add wildcard virtual hosts with exact-host precedence
  • evaluate multi-cert TLS SNI support and document the Rustls limitation

Phase 2 Decisions

  • Static roots can be absolute, matching the Java deployment model, or relative to the runtime config directory for local Rust development.
  • SPA fallback applies only to browser routes. Paths that look like assets, such as /app.js or /favicon.ico, return 404 when the file is missing.
  • Handler path matching supports exact paths and Java/OpenAPI-style {name} path-template segments.

Open Questions

  • Should static content support ETag in the first implementation if portal deployments depend on browser cache validation?

MCP Router

Status

Phases 1, 2, 3, and 4 are implemented in light-pingora and light-gateway. The configurable tokenization client remains deferred until light-tokenization is migrated to portal-service/apps/portal-service and the protocol is selected. Stateful backend MCP session mapping is implemented for the single-process gateway session store and documented below.

Purpose

The Java mcp-router module exposes a configured Model Context Protocol endpoint, /mcp by default, and turns configured gateway targets into MCP tools. AI agents can call initialize, tools/list, and tools/call; the router then forwards the tool call to an HTTP service or another MCP server.

In light-fabric this should be a light-pingora handler that is activated by light-gateway through handler.yml. The same gateway binary can contain the MCP router implementation, but each product decides whether it runs by including the mcp handler and the mcp-router.yml configuration from the config server.

This feature is separate from the existing runtime MCP control plane in light-runtime. Runtime MCP is an internal management surface exposed through the portal registry connection. The MCP router is an HTTP-facing gateway feature and is subject to the normal inbound handler chain.

The transport target is MCP Streamable HTTP as defined by the 2025-06-18 revision of the MCP transport specification: https://modelcontextprotocol.io/specification/2025-06-18/basic/transports.

Goals

  • Keep the Java configuration model recognizable: enabled, path, and tools.
  • Allow mcp-router.tools to be injected by the config server the same way handler.handlers, handler.chains, handler.paths, and handler.defaultHandlers are injected.
  • Activate the router with the existing mcp handler id in handler.yml.
  • Expose one MCP endpoint with Streamable HTTP semantics, so /mcp is the only public MCP path for both POST messages and optional GET streams.
  • Support MCP JSON-RPC methods needed by the Java module: initialize, notifications/initialized, tools/list, and tools/call.
  • Route tools to direct targetHost endpoints, discovered serviceId targets, and backend MCP servers.
  • Reuse existing cross-cutting handlers such as correlation, security, CORS, rate limit, header, metrics, and proxy routing where the chain order allows.
  • Register the router configuration with the module registry so it can be inspected and reloaded consistently with other light-fabric modules.

Non-Goals

  • Do not use Rust dynamic plugins or inventory for runtime tool registration. The active tools are product configuration, not compile-time discovery.
  • Do not merge the public MCP router and the internal runtime MCP control plane into one handler.
  • Do not implement a full MCP server framework in the first pass. The gateway only needs the methods used by agents to discover and call configured tools.
  • Do not copy Java's legacy HTTP+SSE endpoint split as the target transport. Streamable HTTP is the Rust target; legacy SSE can be considered only as a compatibility mode if an older client requires it.
  • Do not hardcode tokenization or masking service URLs. Java currently has a hardcoded tokenization endpoint in this path; the Rust port should make that configurable when masking/tokenization is added.

Java Behavior To Map

The Java module has three main pieces:

  • McpConfig loads mcp-router.yml with enabled, path, and tools.
  • McpHandler owns the HTTP MCP endpoint and JSON-RPC protocol handling.
  • McpToolRegistry stores configured tool implementations by name.

Java configuration:

enabled: ${mcp-router.enabled:true}
path: ${mcp-router.path:/mcp}
maxSessions: ${mcp-router.maxSessions:10000}
maxSessionsPerClient: ${mcp-router.maxSessionsPerClient:100}
tools: ${mcp-router.tools:}

Each tool supports these fields:

- name: weather
  description: Get weather information
  protocol: http
  serviceId: com.networknt.weather-1.0.0
  envTag: dev
  targetHost: http://localhost:7081
  path: /weather
  method: GET
  endpoint: /weather@get
  apiType: http
  inputSchema:
    type: object
    properties:
      city:
        type: string
  toolMetadata: {}

The Java handler currently supports:

  • GET /mcp as an SSE compatibility endpoint. It creates a session id and emits an endpoint event pointing to /mcp?sessionId=....
  • POST /mcp for JSON-RPC messages.
  • initialize, returning protocol version, tool capabilities, and server info.
  • notifications/initialized, returning no response.
  • tools/list, optionally filtered by params.query or params.intent.
  • tools/call, forwarding arguments to the configured tool.

The Java tool execution supports two target types:

  • HTTP tools call a configured HTTP endpoint. GET maps arguments to query parameters. Other methods send the arguments as a JSON body.
  • MCP proxy tools call a backend MCP server by sending a JSON-RPC tools/call request to the configured backend path.

Java also includes rule-based access checks, response filtering, masking, and tokenization around tool calls. The Rust version now implements access checks, response filtering, and schema-driven request masking without hardcoded service endpoints. Tokenization is intentionally deferred.

The Rust implementation should map this behavior to MCP Streamable HTTP rather than keeping Java's legacy HTTP+SSE transport as the default. Streamable HTTP uses one MCP endpoint path. Clients send JSON-RPC messages with POST /mcp; the server can return either a single application/json response or text/event-stream from that same POST when streaming is needed. Clients may also issue GET /mcp to open an optional server-to-client SSE stream on the same endpoint.

Resolved Decisions

  • Use Streamable HTTP so only one public MCP endpoint, normally /mcp, is exposed.
  • Defer the tokenization client design until light-tokenization is migrated into portal-service/apps/portal-service and its protocol is selected.
  • Reuse the light-4j access-control.yml compatibility contract for MCP, REST, and JSON-RPC authorization.
  • Do not add configured per-tool outbound headers. Backend tool calls should pass through the headers received from the agent, subject only to headers that the HTTP client must regenerate for a new outbound request and MCP session headers that the gateway must map or regenerate.

Rust Architecture

Add the MCP router to light-pingora because it is a request/response gateway handler. light-gateway should wire it into the existing handler descriptor table and runtime state.

Proposed modules:

frameworks/light-pingora/src/access_control.rs
frameworks/light-pingora/src/mcp.rs

Primary types:

pub struct McpRouterConfig {
    pub enabled: bool,
    pub path: String,
    pub tools: Vec<McpToolConfig>,
}

pub struct McpToolConfig {
    pub name: String,
    pub description: String,
    pub protocol: Option<String>,
    pub service_id: Option<String>,
    pub env_tag: Option<String>,
    pub target_host: Option<String>,
    pub path: String,
    pub method: HttpMethod,
    pub endpoint: Option<String>,
    pub api_type: McpToolType,
    pub input_schema: serde_json::Value,
    pub tool_metadata: serde_json::Value,
}

pub struct McpRouterRuntime {
    pub config: ArcSwap<McpRouterConfig>,
    pub client: reqwest::Client,
    pub registry_client: Option<Arc<PortalRegistryClient>>,
}

The exact field names should follow the existing light-fabric serde naming style while accepting the Java config names through aliases:

  • serviceId
  • envTag
  • targetHost
  • apiType
  • inputSchema
  • toolMetadata

mcp-router.yml should be the primary Rust file name, but the loader should also accept mcp-router.yaml for Java compatibility.

Tool Registration

The router does not need global static registration. Build an immutable tool map when mcp-router.yml is loaded:

McpRouterConfig -> BTreeMap<String, McpToolConfig> -> Arc<McpRouterState>

On reload, build a new state and atomically swap the Arc. In-flight requests continue with the old state.

This is simpler than Java's static McpToolRegistry and avoids Rust plugin complexity. It also matches the light-fabric product model: all handlers can be linked into one binary, while the config server decides which handlers and tools are active for a product.

Request Flow

The mcp handler should participate in the normal handler chain:

request
  -> correlation
  -> metrics
  -> cors
  -> security or unified security
  -> limit
  -> mcp
  -> proxy or route handler, only if mcp did not consume the request
response
  -> header
  -> metrics
  -> access log

When the request path matches mcp-router.path:

  • POST parses a JSON-RPC message. A JSON-RPC request returns either application/json for a single response or text/event-stream for a streamed response on the same endpoint. A notification or a JSON-RPC response sent by the client returns 202 Accepted with no body when the gateway accepts it.
  • GET with Accept: text/event-stream may open a server-to-client SSE stream on the same endpoint. If the gateway has no server-initiated messages to stream, it should return 405 Method Not Allowed.
  • DELETE should terminate the gateway session and any mapped backend MCP sessions. Until session termination is implemented, it can return 405 Method Not Allowed.
  • Other methods return 405 Method Not Allowed.

When the path does not match, the handler continues to the next handler in the configured chain.

The handler must be safe to include in shared chains. If mcp-router.enabled is false, or the mcp handler is not in handler.yml, no MCP route is exposed.

JSON-RPC Handling

Supported methods:

initialize
notifications/initialized
tools/list
tools/call

initialize response:

{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "protocolVersion": "2024-11-05",
    "capabilities": {
      "tools": {
        "listChanged": true
      }
    },
    "serverInfo": {
      "name": "light-gateway-mcp",
      "version": "1.0.0"
    }
  }
}

tools/list returns configured tools with name, description, and inputSchema. It should preserve Java's simple filtering:

  • params.query matches tool name or description.
  • params.intent matches tool name or description.
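The filtering can be sketched as below. The exact Java matching rules are not spelled out here, so case-insensitive substring matching, and requiring both filters to match when both are present, are assumptions. ToolSummary and filter_tools are hypothetical names.

```rust
pub struct ToolSummary {
    pub name: String,
    pub description: String,
}

/// Assumed matching rule: case-insensitive substring of name or description.
fn tool_matches(tool: &ToolSummary, needle: &str) -> bool {
    let needle = needle.to_lowercase();
    tool.name.to_lowercase().contains(&needle)
        || tool.description.to_lowercase().contains(&needle)
}

/// Applies the optional `params.query` and `params.intent` filters.
/// A missing filter matches every tool.
pub fn filter_tools<'a>(
    tools: &'a [ToolSummary],
    query: Option<&str>,
    intent: Option<&str>,
) -> Vec<&'a ToolSummary> {
    tools
        .iter()
        .filter(|t| query.map_or(true, |q| tool_matches(t, q)))
        .filter(|t| intent.map_or(true, |i| tool_matches(t, i)))
        .collect()
}
```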

tools/call validates params.name, finds the tool, validates or forwards params.arguments, and returns either:

{
  "content": [
    {
      "type": "text",
      "text": "..."
    }
  ]
}

or the structured result returned by the backend MCP server.

JSON-RPC errors should use the same codes as Java where practical:

-32700 parse error
-32601 method or tool not found
-32602 invalid params
-32000 tool execution failed
-32001 access denied

Rust improvement: malformed transport payloads should return a clear HTTP 400 with a JSON-RPC error body instead of a generic HTTP 500.
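The error codes above map naturally onto a small enum. The sketch below builds the JSON-RPC error body by hand to stay dependency-free; McpError and error_body are hypothetical names, the message strings are placeholders, and only numeric or null ids are handled.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum McpError {
    ParseError,
    NotFound,
    InvalidParams,
    ToolExecutionFailed,
    AccessDenied,
}

impl McpError {
    /// JSON-RPC error code, following the list above.
    pub fn code(self) -> i32 {
        match self {
            McpError::ParseError => -32700,
            McpError::NotFound => -32601,
            McpError::InvalidParams => -32602,
            McpError::ToolExecutionFailed => -32000,
            McpError::AccessDenied => -32001,
        }
    }

    /// Placeholder message text, safe to embed in JSON without escaping.
    pub fn message(self) -> &'static str {
        match self {
            McpError::ParseError => "parse error",
            McpError::NotFound => "method or tool not found",
            McpError::InvalidParams => "invalid params",
            McpError::ToolExecutionFailed => "tool execution failed",
            McpError::AccessDenied => "access denied",
        }
    }
}

/// Renders the JSON-RPC error body. `id` is `None` when the request id could
/// not be read, for example after a parse error, and is rendered as null.
pub fn error_body(id: Option<i64>, error: McpError) -> String {
    let id = id.map_or("null".to_string(), |v| v.to_string());
    format!(
        "{{\"jsonrpc\":\"2.0\",\"id\":{},\"error\":{{\"code\":{},\"message\":\"{}\"}}}}",
        id,
        error.code(),
        error.message()
    )
}
```

A malformed payload would then be answered with HTTP 400 and error_body(None, McpError::ParseError) as the response body.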

For Streamable HTTP:

  • Clients must send each JSON-RPC message as a separate POST to the MCP endpoint.
  • Clients should send Accept: application/json, text/event-stream.
  • The router should negotiate and honor MCP-Protocol-Version.
  • The router terminates the client-facing MCP session. initialize responses should include a gateway-owned Mcp-Session-Id, and later client requests should be validated against that gateway session.

MCP Session Management

The MCP router should use a facade model. To the agent, light-gateway is the MCP server. To upstream MCP targets, light-gateway is an MCP client. This keeps gateway security, access-control policy, masking, response filtering, and tool aggregation in one place while still respecting upstream MCP session state.

There are two distinct session scopes:

  • Frontend session: the session between the MCP client and light-gateway.
  • Backend session: one upstream MCP server session owned by the gateway for a specific frontend session and backend target.

The frontend session is created during client initialize:

  1. The client sends initialize to mcp-router.path.
  2. The gateway returns the MCP capabilities it exposes and a gateway-generated Mcp-Session-Id.
  3. The gateway stores session state keyed by that id. The state should include the negotiated protocol version, client info, security principal or relevant auth context, and any backend MCP sessions created for this client session.
  4. Later client requests must include the gateway session id. Unknown or expired session ids should fail before tool execution.
  5. A client DELETE request, explicit expiry, or gateway shutdown should close all backend sessions associated with the frontend session.

The in-memory gateway store uses a 30-minute idle timeout, a configurable maximum frontend session count, and a configurable per-client frontend session count. Expired sessions are purged lazily during later MCP requests, and any mapped backend MCP sessions are closed during that purge. If the store is still full after lazy purge, or the client already owns the maximum allowed sessions, new initialize requests fail without issuing another session id.
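The store behavior described above can be sketched with a plain map. SessionStore and SessionError are hypothetical names, the per-client key is passed in as an opaque string, time is passed in as seconds for determinism, and the counter-based session id is a placeholder only: a real gateway must issue unguessable random ids.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum SessionError {
    StoreFull,
    ClientLimit,
    Unknown,
}

struct Session {
    client_key: String,
    last_seen: u64,
}

pub struct SessionStore {
    idle_timeout_secs: u64,
    max_sessions: usize,
    max_per_client: usize,
    sessions: HashMap<String, Session>,
    next_id: u64,
}

impl SessionStore {
    pub fn new(max_sessions: usize, max_per_client: usize) -> Self {
        Self {
            idle_timeout_secs: 30 * 60, // 30-minute idle timeout
            max_sessions,
            max_per_client,
            sessions: HashMap::new(),
            next_id: 0,
        }
    }

    /// Lazy purge. Returns the purged ids so the caller can close any mapped
    /// backend MCP sessions.
    pub fn purge_expired(&mut self, now: u64) -> Vec<String> {
        let timeout = self.idle_timeout_secs;
        let expired: Vec<String> = self
            .sessions
            .iter()
            .filter(|(_, s)| now.saturating_sub(s.last_seen) >= timeout)
            .map(|(id, _)| id.clone())
            .collect();
        for id in &expired {
            self.sessions.remove(id);
        }
        expired
    }

    /// Handles `initialize`: purge first, then enforce both limits.
    pub fn create(&mut self, client_key: &str, now: u64) -> Result<String, SessionError> {
        let _purged = self.purge_expired(now); // backend cleanup omitted in this sketch
        if self.sessions.len() >= self.max_sessions {
            return Err(SessionError::StoreFull);
        }
        let owned = self.sessions.values().filter(|s| s.client_key == client_key).count();
        if owned >= self.max_per_client {
            return Err(SessionError::ClientLimit);
        }
        self.next_id += 1;
        let id = format!("session-{}", self.next_id); // placeholder, not secure
        self.sessions.insert(id.clone(), Session { client_key: client_key.to_string(), last_seen: now });
        Ok(id)
    }

    /// Validates a later request and refreshes the idle timer.
    pub fn touch(&mut self, id: &str, now: u64) -> Result<(), SessionError> {
        let _purged = self.purge_expired(now);
        match self.sessions.get_mut(id) {
            Some(s) => {
                s.last_seen = now;
                Ok(())
            }
            None => Err(SessionError::Unknown),
        }
    }
}
```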

The per-client key is derived from the authenticated principal when available, preferring client_id, then user_id, email, and host. If no security principal is available, the key falls back to MCP clientInfo.name and clientInfo.version from the initialize request.
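The preference order can be sketched as a simple fallback chain. Principal and client_key are hypothetical names, and joining the client name and version with a slash is an assumption about the fallback key format.

```rust
/// Fields of the authenticated principal that can identify a client.
pub struct Principal<'a> {
    pub client_id: Option<&'a str>,
    pub user_id: Option<&'a str>,
    pub email: Option<&'a str>,
    pub host: Option<&'a str>,
}

/// Derives the per-client key: client_id, then user_id, email, and host,
/// falling back to the MCP clientInfo name and version.
pub fn client_key(
    principal: Option<&Principal<'_>>,
    client_name: &str,
    client_version: &str,
) -> String {
    principal
        .and_then(|p| p.client_id.or(p.user_id).or(p.email).or(p.host))
        .map(|v| v.to_string())
        .unwrap_or_else(|| format!("{}/{}", client_name, client_version))
}
```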

For a single gateway process, the session store can start in memory. In a multi-pod deployment, the store should be external, such as Redis, or ingress must provide sticky routing for all requests that carry the same Mcp-Session-Id.

Backend handling depends on the tool type.

For apiType: http, the backend is a normal stateless API:

  1. No backend MCP session is created.
  2. The gateway translates tools/call arguments into a normal HTTP request.
  3. GET tools serialize arguments into the query string; body-capable methods send JSON.
  4. The gateway wraps the HTTP response into an MCP tools/call result.
  5. User-specific auth, tenant, correlation, and trace headers come from the frontend session or inbound request and are applied to the outbound HTTP call as normal gateway headers.

For apiType: mcp, the backend is a stateful MCP server:

  1. The gateway lazily initializes the backend session the first time a frontend session calls a tool for that backend target. If future dynamic tool discovery depends on the backend, this initialization can happen before tools/list instead.
  2. The gateway sends initialize to the backend MCP endpoint as an MCP client. It should use the client-requested protocol version when supported and pass only the capabilities it needs upstream.
  3. If the backend returns Mcp-Session-Id, the gateway stores it in a mapping keyed by the gateway session id and backend target identity.
  4. The gateway sends notifications/initialized to the backend when the backend session is established.
  5. For later backend calls, the gateway sends the backend session id to that backend. It must not forward the frontend gateway session id as if it were a backend session id.
  6. The gateway still performs access checks before calling the backend and response filtering after the backend response.
  7. When the frontend session ends, the gateway should terminate each mapped backend MCP session to avoid leaking backend resources.

The backend target identity used in the session map should be stable across requests. It should include the resolved route information that distinguishes one backend MCP endpoint from another, such as targetHost or serviceId, envTag, protocol, and tool path.
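One way to picture the mapping is a hash map keyed by the gateway session id plus a hashable target identity. The sketch below is illustrative only: `BackendTarget`, `BackendSessions`, and the `init` callback (which stands in for the backend initialize handshake) are hypothetical names, and errors are plain strings to keep the example std-only.

```rust
use std::collections::HashMap;

/// Stable identity of one backend MCP endpoint.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct BackendTarget {
    /// Either a direct targetHost or a serviceId.
    pub host_or_service: String,
    pub env_tag: Option<String>,
    pub protocol: String,
    pub path: String,
}

/// Maps (gateway session id, backend target) to the backend Mcp-Session-Id.
#[derive(Default)]
pub struct BackendSessions {
    map: HashMap<(String, BackendTarget), String>,
}

impl BackendSessions {
    /// Returns the mapped backend session id, running `init` only the first
    /// time a frontend session reaches this backend target.
    pub fn get_or_init<F>(
        &mut self,
        gateway_session: &str,
        target: &BackendTarget,
        init: F,
    ) -> Result<String, String>
    where
        F: FnOnce(&BackendTarget) -> Result<String, String>,
    {
        let key = (gateway_session.to_string(), target.clone());
        if let Some(id) = self.map.get(&key) {
            return Ok(id.clone());
        }
        let id = init(target)?;
        self.map.insert(key, id.clone());
        Ok(id)
    }

    /// Removes every backend session owned by a frontend session and returns
    /// them so the caller can terminate each one upstream.
    pub fn remove_all(&mut self, gateway_session: &str) -> Vec<(BackendTarget, String)> {
        let keys: Vec<(String, BackendTarget)> = self
            .map
            .keys()
            .filter(|(g, _)| g.as_str() == gateway_session)
            .cloned()
            .collect();
        keys.into_iter()
            .filter_map(|k| self.map.remove(&k).map(|id| (k.1, id)))
            .collect()
    }
}
```

Because the frontend session id is only ever used as part of the key, it never needs to be sent upstream.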

When the router aggregates tools from both MCP servers and normal APIs, the client still sees one gateway MCP session and one tools/list response. The gateway registry decides how each tools/call is executed:

| Feature | MCP server backend | Normal API backend |
| --- | --- | --- |
| Config type | apiType: mcp | apiType: http or omitted |
| Backend session | Yes, mapped from gateway session to backend target | No |
| Initialization | Gateway initializes backend as an MCP client | No upstream initialization |
| Message handling | JSON-RPC tools/call through backend MCP session | Translate JSON-RPC arguments to HTTP |
| Backend session header | Send backend Mcp-Session-Id only to that backend | Do not send MCP session state |
| Tear-down | Close backend session on client session end | Nothing backend-specific |

The configured tools/list remains the gateway's public contract. A future dynamic-discovery mode may call backend MCP tools/list and merge those tools with configured HTTP tools, but that must still preserve the gateway's policy surface and avoid exposing backend tools that are not authorized for the product.

HTTP Tool Execution

For apiType: http or missing apiType:

  1. Resolve the target base URL.
  2. Build the target URL from base URL plus tool path.
  3. For GET, serialize arguments with url::form_urlencoded.
  4. For POST, PUT, and PATCH, send arguments as JSON.
  5. Pass through the inbound agent headers to the backend tool call so caller identity, authorization, correlation, tenant, locale, and tracing context are preserved.
  6. Let the HTTP client regenerate transport-specific headers for the new outbound request, such as Host, Content-Length, Transfer-Encoding, and connection management headers.
  7. Treat 2xx as success.
  8. Parse JSON responses as structured MCP results.
  9. Wrap non-JSON responses as MCP text content.
  10. Return an empty 2xx response as { "result": "success" }.
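Steps 7 through 10 amount to a small classification of the backend response. The sketch below illustrates that classification under stated assumptions: `McpResult` and `wrap_response` are hypothetical names, JSON is detected from the Content-Type header only, and the JSON body is carried as text where a real handler would parse it.

```rust
#[derive(Debug, PartialEq)]
pub enum McpResult {
    /// JSON text to return as a structured MCP result.
    Structured(String),
    /// Non-JSON body wrapped as MCP text content.
    Text(String),
    /// Non-2xx backend response surfaced as a tool execution error.
    Error { status: u16, body: String },
}

pub fn wrap_response(status: u16, content_type: Option<&str>, body: &str) -> McpResult {
    // Only 2xx counts as success.
    if !(200..300).contains(&status) {
        return McpResult::Error { status, body: body.to_string() };
    }
    // An empty 2xx response becomes a fixed success payload.
    if body.trim().is_empty() {
        return McpResult::Structured(r#"{"result":"success"}"#.to_string());
    }
    // Assumption: JSON is recognized by media type rather than by parsing.
    let is_json = content_type
        .map(|ct| {
            let ct = ct.to_ascii_lowercase();
            ct.starts_with("application/json") || ct.contains("+json")
        })
        .unwrap_or(false);
    if is_json {
        McpResult::Structured(body.to_string())
    } else {
        McpResult::Text(body.to_string())
    }
}
```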

Target resolution:

  • Prefer targetHost for direct calls.
  • Otherwise use serviceId, protocol, and envTag through the existing portal registry discovery client.
  • If neither is available, return a tool execution error.
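The resolution order can be sketched as follows. The `Discovery` trait is a stand-in for the portal registry discovery client, whose real API is not shown here; `StaticDiscovery`, the `https` default protocol, and the trailing-slash trimming are assumptions made for the example.

```rust
use std::collections::HashMap;

pub struct ToolTarget {
    pub target_host: Option<String>,
    pub service_id: Option<String>,
    pub protocol: Option<String>,
    pub env_tag: Option<String>,
}

/// Hypothetical abstraction over the discovery client.
pub trait Discovery {
    fn lookup(&self, service_id: &str, protocol: &str, env_tag: Option<&str>) -> Option<String>;
}

/// Table-backed discovery used only for illustration and tests.
pub struct StaticDiscovery(pub HashMap<String, String>);

impl Discovery for StaticDiscovery {
    fn lookup(&self, service_id: &str, protocol: &str, env_tag: Option<&str>) -> Option<String> {
        let key = format!("{}|{}|{}", service_id, protocol, env_tag.unwrap_or(""));
        self.0.get(&key).cloned()
    }
}

#[derive(Debug, PartialEq)]
pub enum ResolveError {
    /// Neither targetHost nor serviceId is configured.
    NoTarget,
    /// Discovery returned no instance for the service.
    NotFound(String),
}

pub fn resolve_base_url(tool: &ToolTarget, discovery: &dyn Discovery) -> Result<String, ResolveError> {
    // 1. Prefer targetHost for direct calls.
    if let Some(host) = tool.target_host.as_deref().filter(|h| !h.is_empty()) {
        return Ok(host.trim_end_matches('/').to_string());
    }
    // 2. Otherwise resolve serviceId, protocol, and envTag through discovery.
    let service_id = tool
        .service_id
        .as_deref()
        .filter(|s| !s.is_empty())
        .ok_or(ResolveError::NoTarget)?;
    // Assumption: https when no protocol is configured.
    let protocol = tool.protocol.as_deref().unwrap_or("https");
    discovery
        .lookup(service_id, protocol, tool.env_tag.as_deref())
        .ok_or_else(|| ResolveError::NotFound(service_id.to_string()))
}
```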

MCP Proxy Tool Execution

For apiType: mcp:

  1. Resolve the target base URL the same way as HTTP tools.
  2. Ensure a backend MCP session exists for the current gateway session and backend target. If none exists, initialize the backend MCP endpoint and store the returned backend Mcp-Session-Id.
  3. POST to the configured backend path.
  4. Pass through the inbound agent headers to the backend MCP server, with transport-specific headers regenerated for the new outbound request. Replace any frontend gateway Mcp-Session-Id with the mapped backend session id for this backend target.
  5. Send a backend JSON-RPC request:
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "tool-name",
    "arguments": {}
  }
}
  6. If the backend returns error, map it to -32000.
  7. If the backend returns result, return it to the caller.
  8. On frontend session termination or expiry, close the backend MCP session.

This preserves the Java McpProxyTool behavior, uses Rust's typed JSON-RPC models where possible, and adds the MCP session mapping required by stateful backend MCP servers.

Configuration Loading

The router should be loaded as a normal light-fabric module:

config-server product values
  -> mcp-router.yml placeholders
  -> light-gateway startup
  -> light-pingora mcp router state

Example product values:

mcp-router.enabled: true
mcp-router.path: /mcp
mcp-router.tools:
  - name: get_pet
    description: Get a pet by id.
    targetHost: http://petstore:8080
    path: /v1/pets
    method: GET
    inputSchema:
      type: object
      properties:
        id:
          type: string

Example handler.yml path wiring:

handlers:
  - correlation
  - metrics
  - cors
  - jwt
  - mcp
  - proxy

chains:
  default:
    - correlation
    - metrics
    - cors
    - jwt
    - proxy
  mcp:
    - correlation
    - metrics
    - cors
    - jwt
    - mcp

paths:
  - path: /mcp
    method: POST
    exec:
      - mcp
  - path: /mcp
    method: GET
    exec:
      - mcp

defaultHandlers:
  - proxy

The exact chain names are product choices. The important point is that /mcp can have a narrow chain while normal API proxy traffic keeps the normal proxy chain.

Module Registry

The MCP router should register its configuration with the module registry:

  • module name: mcp-router
  • config files: mcp-router.yml, with mcp-router.yaml as compatibility fallback
  • enabled status
  • configured path
  • tool count
  • tool names

The module registry should mask any future secret fields in toolMetadata, headers, or credential configuration.

Reload behavior:

  1. Reload mcp-router.yml.
  2. Validate duplicate tool names, missing paths, unsupported methods, and target resolution fields.
  3. Build a new immutable router state.
  4. Swap the runtime state atomically.
  5. Report the updated module registry status.
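The validate-then-swap sequence can be sketched with `Arc` and `RwLock` from the standard library. This is an illustration under assumptions, not the light-pingora implementation: `ToolConfig`, `RouterState`, and `Router` are hypothetical names, and the supported method list is taken from the HTTP tool execution steps in this document.

```rust
use std::collections::HashSet;
use std::sync::{Arc, RwLock};

#[derive(Clone)]
pub struct ToolConfig {
    pub name: String,
    pub path: String,
    pub method: String,
    pub target_host: Option<String>,
    pub service_id: Option<String>,
}

/// Immutable snapshot of the router configuration.
pub struct RouterState {
    pub tools: Vec<ToolConfig>,
}

pub fn validate(tools: &[ToolConfig]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for t in tools {
        if !seen.insert(t.name.as_str()) {
            return Err(format!("duplicate tool name: {}", t.name));
        }
        if t.path.is_empty() {
            return Err(format!("tool {} has no path", t.name));
        }
        // Assumption: only the methods named in the HTTP tool steps are supported.
        let method = t.method.to_ascii_uppercase();
        if !matches!(method.as_str(), "GET" | "POST" | "PUT" | "PATCH") {
            return Err(format!("tool {} uses unsupported method {}", t.name, t.method));
        }
        if t.target_host.is_none() && t.service_id.is_none() {
            return Err(format!("tool {} has no targetHost or serviceId", t.name));
        }
    }
    Ok(())
}

pub struct Router {
    state: RwLock<Arc<RouterState>>,
}

impl Router {
    pub fn new(tools: Vec<ToolConfig>) -> Result<Self, String> {
        validate(&tools)?;
        Ok(Self { state: RwLock::new(Arc::new(RouterState { tools })) })
    }

    /// Requests take a cheap snapshot and keep it for their whole lifetime.
    pub fn current(&self) -> Arc<RouterState> {
        let guard = self.state.read().unwrap();
        Arc::clone(&guard)
    }

    /// Validates first, so an invalid reload leaves the prior state in place.
    /// Returns the new tool count for module registry reporting.
    pub fn reload(&self, tools: Vec<ToolConfig>) -> Result<usize, String> {
        validate(&tools)?;
        let count = tools.len();
        *self.state.write().unwrap() = Arc::new(RouterState { tools });
        Ok(count)
    }
}
```

Requests that took a snapshot before the swap keep using the old state until they finish, which is what makes the swap safe without pausing traffic.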

Security And Policy

The first layer of protection should be the handler chain. Products can place JWT, API key, basic auth, unified security, CORS, rate limit, and header handlers before or after mcp as needed.

Because MCP Streamable HTTP is browser-reachable, the mcp handler must also validate the Origin header according to the configured CORS or security policy. Invalid origins should fail before tool execution.

Fine-grained tool authorization should be added after the base router:

  • Reuse the existing light-4j access-control.yml model as the compatibility contract. access-control.yml controls enabled, accessRuleLogic, defaultDeny, and skipPathPrefixes; rule.yml provides ruleBodies and endpointRules.
  • Make the access policy endpoint stable. Java uses the tool endpoint field, such as /weather@get; when omitted, Rust derives {path}@{method}.
  • Include correlation id, caller claims, request headers, tool name, endpoint, and arguments in the policy input.
  • Support default deny when access control is enabled and no req-acc rule matches.
  • Provide built-in Rust actions compatible with the Java class names used by current config: RoleBasedAccessControlAction, ResponseColumnFilterAction, and ResponseRowFilterAction.
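Two of these rules are simple enough to sketch directly: deriving the policy endpoint and applying default deny. The names below (`Tool`, `policy_endpoint`, `decide`) are hypothetical, and lowercasing the method in the derived endpoint is an assumption based on the /weather@get example.

```rust
pub struct Tool {
    pub endpoint: Option<String>,
    pub path: String,
    pub method: String,
}

/// Uses the configured endpoint when present, otherwise derives {path}@{method}.
pub fn policy_endpoint(tool: &Tool) -> String {
    match tool.endpoint.as_deref() {
        Some(e) if !e.is_empty() => e.to_string(),
        // Assumption: the method is lowercased, as in /weather@get.
        _ => format!("{}@{}", tool.path, tool.method.to_ascii_lowercase()),
    }
}

#[derive(Debug, PartialEq)]
pub enum Decision {
    Allow,
    Deny,
}

/// `matched_rule` is None when no req-acc rule matched the endpoint,
/// otherwise the outcome of evaluating the matched rule.
pub fn decide(enabled: bool, default_deny: bool, matched_rule: Option<bool>) -> Decision {
    if !enabled {
        return Decision::Allow;
    }
    match matched_rule {
        Some(true) => Decision::Allow,
        Some(false) => Decision::Deny,
        None if default_deny => Decision::Deny,
        None => Decision::Allow,
    }
}
```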

Response filtering should be implemented as a second policy stage:

  • Apply policy after backend execution and before JSON-RPC response emission.
  • Support both structuredContent and single text content responses, matching Java's behavior.
  • Match endpoint rules exactly first, then Java-style path templates and parent path entries such as /v1/accounts@get for /v1/accounts/123@get.
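The lookup order for endpoint rules might look like the sketch below. It is a simplified illustration: `find_rule` and `template_matches` are hypothetical names, templates are assumed to use `{name}` segments, and when several templates match, the one returned depends on map iteration order, which a real implementation would need to make deterministic.

```rust
use std::collections::HashMap;

/// True when every template segment equals the path segment or is a {placeholder}.
fn template_matches(template: &str, path: &str) -> bool {
    let t: Vec<&str> = template.split('/').collect();
    let p: Vec<&str> = path.split('/').collect();
    t.len() == p.len()
        && t.iter()
            .zip(&p)
            .all(|(a, b)| (a.starts_with('{') && a.ends_with('}')) || a == b)
}

/// Finds the rule for an endpoint such as /v1/accounts/123@get.
pub fn find_rule<'a, R>(rules: &'a HashMap<String, R>, endpoint: &str) -> Option<&'a R> {
    // 1. Exact match.
    if let Some(rule) = rules.get(endpoint) {
        return Some(rule);
    }
    let (path, method) = endpoint.rsplit_once('@')?;
    // 2. Path templates with the same method.
    for (key, rule) in rules {
        if let Some((key_path, key_method)) = key.rsplit_once('@') {
            if key_method == method && key_path.contains('{') && template_matches(key_path, path) {
                return Some(rule);
            }
        }
    }
    // 3. Parent path entries, nearest parent first.
    let mut current = path;
    while let Some((parent, _)) = current.rsplit_once('/') {
        if parent.is_empty() {
            break;
        }
        if let Some(rule) = rules.get(&format!("{}@{}", parent, method)) {
            return Some(rule);
        }
        current = parent;
    }
    None
}
```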

Masking and tokenization handling:

  • Preserve Java schema extensions: x-mask, x-mask-pattern, and x-tokenize.
  • Parse these extensions from inputSchema as serde_json::Value.
  • Apply schema-driven x-mask request masking before backend tool execution.
  • Keep x-tokenize as a future extension point. Do not call a tokenization service until the portal-service tokenization protocol is finalized.
  • Do not hardcode a tokenization service URL. The tokenization client should be designed after light-tokenization is migrated into portal-service/apps/portal-service, whether the final protocol is JSON-RPC, MCP, or gRPC.

Per-tool outbound headers are headers that the MCP router would add from tool configuration when it calls a specific backend target, for example a configured Authorization, X-API-Key, tenant routing header, or vendor-specific version header. That feature is not required. The required behavior is header pass-through: backend tool calls receive the headers that came from the agent, and the HTTP client regenerates only the transport-specific headers needed for a valid outbound request. MCP session headers are the exception to pass-through: the gateway owns the frontend Mcp-Session-Id and maps it to backend session ids when an upstream MCP server is involved.
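A minimal sketch of the pass-through rule follows. The function name and the exact list of transport headers are illustrative assumptions; in practice the HTTP client decides which transport headers it regenerates.

```rust
/// Headers the outbound HTTP client is expected to regenerate (illustrative list).
const TRANSPORT_HEADERS: [&str; 5] =
    ["host", "content-length", "transfer-encoding", "connection", "keep-alive"];

const MCP_SESSION_HEADER: &str = "mcp-session-id";

/// Builds outbound headers from the inbound agent headers.
/// `backend_session` is None for apiType: http and the mapped backend
/// session id for apiType: mcp.
pub fn outbound_headers(
    inbound: &[(String, String)],
    backend_session: Option<&str>,
) -> Vec<(String, String)> {
    let mut out: Vec<(String, String)> = inbound
        .iter()
        .filter(|(name, _)| {
            let lower = name.to_ascii_lowercase();
            // Drop transport headers and never forward the frontend session id.
            !TRANSPORT_HEADERS.iter().any(|h| *h == lower) && lower != MCP_SESSION_HEADER
        })
        .cloned()
        .collect();
    if let Some(id) = backend_session {
        out.push(("Mcp-Session-Id".to_string(), id.to_string()));
    }
    out
}
```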

Relationship To Existing Runtime MCP

light-runtime already has RuntimeMcpHandler for runtime management tools. That should remain internal and registry-facing.

The gateway MCP router should not automatically expose runtime management tools. If a product needs that bridge later, add an explicit configured tool type, for example:

apiType: runtime

That keeps public agent-facing tools separate from management tools and avoids accidentally exposing cache, module, or service operations through a public gateway route.

Phased Implementation

Phase 1: Core Router

  • Add mcp-router.yml config parsing in light-pingora.
  • Accept tools as either a YAML array or a JSON string to match Java config server injection behavior.
  • Add immutable tool map validation.
  • Implement the base Streamable HTTP single endpoint: unary POST /mcp, Accept validation for application/json and text/event-stream, 202 Accepted for accepted notifications, and 405 for unsupported methods.
  • Implement JSON-RPC initialize, notifications/initialized, tools/list, and tools/call.
  • Implement direct targetHost HTTP tools.
  • Pass through agent request headers to direct HTTP and backend MCP tool calls, except MCP session headers that the gateway must map separately.
  • Wire the existing mcp handler id in light-gateway.
  • Register module status and config with the module registry.
  • Add parser and handler tests.

Status: implemented.

Phase 2: Discovery And MCP Proxy

  • Resolve serviceId, protocol, and envTag through the existing portal registry discovery client.
  • Implement apiType: mcp backend proxy tools.
  • Add reload support with atomic state swap.
  • Add tests with fake discovery and backend MCP responses.

Status: implemented.

Phase 3: Streamable HTTP Streaming

  • Add streamed text/event-stream responses from POST /mcp for long-running tool calls or server-to-client messages related to the originating request.
  • Add optional GET /mcp server-to-client streams on the same endpoint.
  • Track frontend sessions when Mcp-Session-Id is issued. Return 405 for standalone GET streams until server-initiated messages are implemented.
  • Add tests for content negotiation, 202 Accepted notifications, streamed POST responses, and optional GET behavior.

Status: implemented.

Phase 4: Policy, Filtering, Masking

  • Add tool-level authorization using the access-control.yml compatibility contract.
  • Add response filtering for structured and text MCP results.
  • Add schema-driven request masking.
  • Add MCP tool-call log fields for tool name, endpoint, duration, status, and policy outcome.

Status: implemented for access control, response filtering, and request masking. Tokenization is deferred until the portal-service tokenization client is designed.

Phase 5: Stateful MCP Backend Sessions

  • Add a gateway session store keyed by frontend Mcp-Session-Id.
  • Validate later client requests against the gateway session.
  • For apiType: mcp, maintain backend session mappings keyed by gateway session id and backend target identity.
  • Lazily initialize backend MCP sessions by sending backend initialize, capturing backend Mcp-Session-Id, and sending notifications/initialized.
  • Replace the frontend session id with the mapped backend session id on upstream MCP calls.
  • Terminate mapped backend MCP sessions when the frontend session is deleted, expires, or the gateway shuts down.
  • Add tests for frontend session validation, backend session creation, backend session reuse, and backend session termination.

Status: implemented for the in-memory frontend session store, configurable global and per-client session caps, 30-minute lazy idle expiry, lazy backend initialization, backend Mcp-Session-Id mapping, backend session reuse, and explicit DELETE teardown. Shutdown cleanup, external session storage, and multi-backend isolation tests remain future hardening for multi-pod deployments.

Testing Strategy

  • Config tests:
    • empty config
    • disabled config
    • duplicate tool names
    • tools as YAML array
    • tools as JSON string
    • inputSchema as object and string
  • JSON-RPC tests:
    • initialize
    • notifications/initialized
    • notification returns 202 Accepted
    • tools/list
    • tools/list with query and intent
    • missing method
    • invalid params
    • malformed JSON
  • Streamable HTTP tests:
    • single /mcp endpoint handles POST
    • POST validates Accept
    • unsupported methods return 405
    • optional GET stream returns 405 until enabled
  • Tool execution tests:
    • direct GET with encoded arguments
    • direct POST with JSON arguments
    • non-JSON backend response
    • empty 2xx backend response
    • non-2xx backend response
    • agent headers are forwarded to backend tool calls
    • discovered service target
    • backend MCP proxy success and error
  • Handler chain tests:
    • /mcp consumed by mcp
    • non-MCP path continues to the next handler
    • disabled router does not expose /mcp
  • Reload tests:
    • tool added
    • tool removed
    • invalid reload keeps the prior good state

Remaining Decisions

  • Confirm whether Phase 1 includes only unary Streamable HTTP POST or also streamed POST responses.
  • Decide the tokenization client protocol after light-tokenization is migrated into portal-service/apps/portal-service.
  • Map the Java access-control.yml schema to Rust policy execution and define how it will be shared by REST, JSON-RPC, and MCP handlers.

LLM Gateway

Status

Design proposal. The current light-agent runtime selects one active model provider from model-provider.yml. That is acceptable if the selected provider is an LLM gateway endpoint. The agent does not need to know how many upstream providers the gateway can reach.

Purpose

The LLM gateway is a centralized model access layer for agents and services. Instead of each agent carrying credentials, endpoint details, routing rules, and provider fallback logic, each agent calls one gateway endpoint. The gateway then routes the request to OpenAI, Azure OpenAI, Bedrock, Anthropic, Gemini, Ollama, Codex, or another provider based on agent configuration, model policy, prompt characteristics, capability requirements, health, cost, and compliance constraints.

This keeps light-agent simple and matches the current bootstrap model:

  1. startup.yml is local.
  2. Runtime configuration is fetched from config-server.
  3. The agent loads one model provider after bootstrap.
  4. That provider can be an OpenAI-compatible LLM gateway.
  5. The gateway owns multi-provider fan-out.

Goals

  • Keep agents configured with one active model endpoint.
  • Support many upstream LLM providers at the gateway at the same time.
  • Allow provider routing by agent, service id, environment, prompt intent, requested capability, logical model name, cost, latency, region, and health.
  • Keep provider credentials out of agent pods and agent config.
  • Preserve the existing model-provider abstraction for direct provider access and reuse it inside the gateway where useful.
  • Expose a provider-compatible HTTP API so existing agents can use the gateway without a new SDK.
  • Support normal light-fabric bootstrap, config-server overrides, module registry visibility, config reload, controller registration, and audit.
  • Make gateway decisions explainable enough for operations and compliance.

Non-Goals

  • Do not make light-agent load many providers directly for this use case. Multi-provider routing belongs in the gateway.
  • Do not require every agent to understand provider-specific fields such as AWS region, Azure deployment name, or Anthropic max token settings.
  • Do not expose upstream provider secrets through tools/list, diagnostics, or agent-visible configuration.
  • Do not depend on an LLM classification call for every routing decision. The gateway should support deterministic routing first and optional classifier routing later.
  • Do not merge the LLM gateway with the MCP router. The LLM gateway routes model calls; the MCP router routes tool calls.

Relationship To Existing Components

Light-Agent

light-agent should continue to select one model provider after runtime bootstrap. For an LLM gateway deployment, the selected provider is the gateway:

model-provider.provider: compatible
model-provider.model: agent-default
compatible.name: llm-gateway
compatible.baseUrl: https://llm-gateway.light-gateway:8443/v1
compatible.apiKey: ${secret.llmGatewayApiKey}

The model-provider.model value becomes a logical model name. It does not need to be an upstream provider model id. Examples:

model-provider.model: agent-default
model-provider.model: fast
model-provider.model: reasoning
model-provider.model: coding
model-provider.model: pii-safe

The gateway maps the logical model to a physical provider and model.

Light-Gateway

The LLM gateway should be implemented as a light-gateway product capability, activated by handler/config. This keeps LLM egress under the same gateway family that already handles MCP routing, auth, rule execution, metrics, service discovery, bootstrap, and reload.

The first implementation can expose an OpenAI-compatible endpoint:

POST /v1/chat/completions

That is enough for CompatibleProvider and many external clients. Later phases can add:

POST /v1/responses
GET  /v1/models

Model Provider Crate

The gateway can reuse crates/model-provider for upstream calls. The crate already contains concrete providers and meta-providers:

  • OpenAI
  • Azure OpenAI
  • Anthropic
  • Bedrock
  • Codex
  • Compatible
  • Gemini
  • GLM
  • Ollama
  • OpenRouter
  • Telnyx
  • Copilot
  • CLI providers where operationally appropriate
  • RouterProvider
  • ReliableProvider

For the gateway, direct concrete providers are upstream adapters. Routing and fallback should be controlled by gateway config and policy, not by each agent.

Request Flow

agent
  -> LLM provider trait
  -> CompatibleProvider
  -> light-gateway /v1/chat/completions
  -> auth, correlation, policy, rate limit
  -> LLM route decision
  -> upstream provider adapter
  -> upstream LLM provider
  -> normalized response
  -> audit, metrics, token usage
  -> agent

The agent sees one model provider. The gateway sees the full routing context.

Routing Inputs

The gateway should make routing decisions from a combination of trusted inputs:

  • Authenticated caller identity from JWT, mTLS, or gateway-authenticated service registration.
  • Agent metadata such as host id, agent definition id, service id, environment, tenant, and account.
  • Logical model name from the request body.
  • Request capabilities: tool calling, vision, JSON mode, long context, reasoning, streaming, prompt caching.
  • Prompt features: intent keywords, size, language, sensitivity markers, coding vs support vs workflow execution.
  • Configured policy: allowed providers, blocked providers, region constraints, cost tier, data residency, fallback chain.
  • Runtime health: provider availability, error rate, latency, quota pressure.

If metadata is supplied as headers, the gateway should only trust those headers from authenticated internal clients. Otherwise it should derive identity from the token or connection.

Suggested internal headers:

X-Light-Request-Id
X-Light-Service-Id
X-Light-Env-Tag
X-Light-Agent-Host-Id
X-Light-Agent-Definition-Id
X-Light-Tenant-Id
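The trust rule for metadata headers can be sketched in a few lines. `Caller` and `effective_service_id` are hypothetical names, and how a caller is classified as an authenticated internal client is left out of the example.

```rust
/// Hypothetical summary of what the security handlers established.
pub struct Caller {
    /// True only for authenticated internal clients.
    pub authenticated_internal: bool,
    /// Service id derived from the token or connection, if any.
    pub token_service_id: Option<String>,
}

/// Uses the X-Light-Service-Id header only when the caller is trusted,
/// otherwise falls back to the identity derived from the token.
pub fn effective_service_id(caller: &Caller, header_service_id: Option<&str>) -> Option<String> {
    if caller.authenticated_internal {
        if let Some(h) = header_service_id.filter(|h| !h.is_empty()) {
            return Some(h.to_string());
        }
    }
    caller.token_service_id.clone()
}
```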

Routing Stages

Routing should be deterministic before it is intelligent.

  1. Explicit route

    If the request asks for a logical model with a direct configured route, use that route.

  2. Agent policy

    Apply policy for the authenticated agent or service. This can narrow the allowed logical models and upstream providers.

  3. Capability filter

    Remove upstreams that cannot satisfy required capabilities such as tools, vision, long context, or streaming.

  4. Prompt classifier

    Optionally classify the prompt into a routing domain such as fast, reasoning, coding, customer-support, or restricted-data.

  5. Cost and latency preference

    Choose the cheapest or fastest provider that satisfies policy and capability constraints.

  6. Health and fallback

    If the selected upstream is unhealthy or returns a retryable error, follow a configured fallback chain.
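The deterministic stages can be sketched as a single selection function. This is a simplified illustration that covers explicit route lookup, agent policy, capability filtering, and health-based fallback; it leaves out the optional prompt classifier and the cost and latency preference. All type and function names are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

pub struct Upstream {
    pub provider: String,
    pub model: String,
}

pub struct Route {
    pub primary: Upstream,
    pub fallbacks: Vec<Upstream>,
    pub required_capabilities: Vec<String>,
}

pub struct Provider {
    pub capabilities: HashSet<String>,
    pub healthy: bool,
}

pub struct AgentPolicy {
    pub allowed_routes: Vec<String>,
    pub blocked_providers: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub enum RouteError {
    UnknownRoute,
    RouteNotAllowed,
    NoEligibleUpstream,
}

pub fn select_upstream<'a>(
    routes: &'a HashMap<String, Route>,
    providers: &HashMap<String, Provider>,
    policy: &AgentPolicy,
    logical_model: &str,
) -> Result<&'a Upstream, RouteError> {
    // Stage 1: explicit route for the requested logical model.
    let route = routes.get(logical_model).ok_or(RouteError::UnknownRoute)?;
    // Stage 2: agent policy narrows the allowed logical models.
    if !policy.allowed_routes.iter().any(|r| r == logical_model) {
        return Err(RouteError::RouteNotAllowed);
    }
    // Stages 3 and 6: walk primary then fallbacks, keeping the first upstream
    // that is healthy, not blocked, and has every required capability.
    std::iter::once(&route.primary)
        .chain(route.fallbacks.iter())
        .find(|u| {
            providers.get(&u.provider).map_or(false, |p| {
                p.healthy
                    && !policy.blocked_providers.contains(&u.provider)
                    && route.required_capabilities.iter().all(|c| p.capabilities.contains(c))
            })
        })
        .ok_or(RouteError::NoEligibleUpstream)
}
```

Returning a distinct error per stage also gives the audit log a ready-made routing reason.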

Gateway Configuration

The gateway should use a dedicated config file, for example llm-gateway.yml, loaded through the same runtime config layering as other light-fabric modules.

Example:

enabled: ${llm-gateway.enabled:true}
pathPrefix: ${llm-gateway.pathPrefix:/v1}
defaultRoute: ${llm-gateway.defaultRoute:agent-default}

routes:
  agent-default:
    provider: openai-prod
    model: gpt-4o
    fallbacks:
      - provider: bedrock-us
        model: anthropic.claude-3-5-sonnet-20240620-v1:0

  fast:
    provider: openai-prod
    model: gpt-4o-mini

  reasoning:
    provider: bedrock-us
    model: anthropic.claude-3-7-sonnet-20250219-v1:0
    requiredCapabilities:
      - tools
      - long-context

providers:
  openai-prod:
    type: openai
    baseUrl: ${llm.openai.baseUrl:https://api.openai.com/v1}
    apiKey: ${llm.openai.apiKey:}
    maxTokens: ${llm.openai.maxTokens:}
    costTier: medium
    regions:
      - global

  bedrock-us:
    type: bedrock
    region: ${llm.bedrock.region:us-east-1}
    accessKeyId: ${llm.bedrock.accessKeyId:}
    secretAccessKey: ${llm.bedrock.secretAccessKey:}
    sessionToken: ${llm.bedrock.sessionToken:}
    costTier: high
    regions:
      - us-east-1

agentPolicies:
  com.networknt.agent.account-1.0.0:
    defaultRoute: agent-default
    allowedRoutes:
      - agent-default
      - fast
      - reasoning
    blockedProviders: []
    dataResidency:
      allowedRegions:
        - us-east-1
        - global

fallback:
  maxRetries: ${llm-gateway.fallback.maxRetries:1}
  baseBackoffMs: ${llm-gateway.fallback.baseBackoffMs:100}

The exact schema can evolve, but the important boundary is stable:

  • Agent config points to one gateway endpoint.
  • Gateway config owns provider inventory and route policy.
  • Provider secrets are masked in module registry output.

Provider Inventory

Each configured provider should have:

  • A stable provider id.
  • A provider type.
  • Provider-specific connection settings.
  • Supported capabilities.
  • Allowed regions.
  • Cost tier.
  • Timeout and retry settings.
  • Optional quota metadata.
  • Optional tenant or account restrictions.

Provider ids should be operational names, not user-visible model names:

providers:
  openai-prod:
    type: openai
  openai-eu:
    type: azure-openai
  bedrock-us:
    type: bedrock
  local-ollama:
    type: ollama

Logical model names are route names. They are safe for agents to request.

Request And Response Contract

The first API should be OpenAI-compatible enough for CompatibleProvider:

POST /v1/chat/completions
Authorization: Bearer <agent-or-service-token>
Content-Type: application/json

Request:

{
  "model": "agent-default",
  "messages": [
    {"role": "system", "content": "You are a support agent."},
    {"role": "user", "content": "Help me investigate this account."}
  ],
  "temperature": 0.7,
  "tools": []
}

Response:

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 1200,
    "completion_tokens": 240,
    "total_tokens": 1440
  },
  "light_gateway": {
    "route": "agent-default",
    "provider": "openai-prod",
    "model": "gpt-4o"
  }
}

The light_gateway field should be optional and controlled by diagnostics policy. It is useful for internal debugging, but may be hidden from external clients.

Tool Calling

Tool calling remains an agent responsibility, but model-native tool-call generation flows through the LLM gateway.

The LLM gateway should:

  • Accept OpenAI-style tool definitions from the agent.
  • Convert tool definitions to the upstream provider's native format when possible.
  • Normalize provider tool-call responses back to the OpenAI-compatible shape.
  • Return clear errors when a route cannot support tool calling.

The gateway should not execute MCP tools. The agent still calls light-gateway MCP endpoints for tools/list and tools/call.

Security

The gateway becomes the model egress control point, so it must enforce:

  • Authentication for every model request.
  • Authorization for logical models and provider routes.
  • Tenant isolation.
  • Provider allowlists and denylists.
  • Secret masking in module registry and diagnostics.
  • Optional request/response redaction or tokenization hooks.
  • Data residency rules.
  • Rate limits by tenant, agent, route, and provider.
  • Audit records for route selection and usage.

Provider credentials should live only in gateway config or the secret system feeding config-server. They should not be copied into agent config.

Observability

Each gateway model call should produce structured telemetry:

  • Request id.
  • Caller identity.
  • Agent id and service id when available.
  • Logical model.
  • Selected provider and physical model.
  • Routing reason.
  • Fallback attempts.
  • Prompt and completion token counts.
  • Latency by stage.
  • Provider status code and error class.
  • Cache hit or prompt-cache usage where available.
  • Policy decisions.

Metrics should support dashboards by route, provider, tenant, and agent.

Config Reload

The gateway should register llm-gateway.yml in the module registry and support runtime reload.

Reload should be atomic:

  1. Load and validate the new config.
  2. Build provider clients and route tables.
  3. Reject invalid route references before swapping state.
  4. Swap active routing state.
  5. Keep in-flight requests on the old state.

Validation should catch:

  • Unknown provider ids.
  • Unknown provider types.
  • Routes without a provider.
  • Fallbacks pointing to missing providers.
  • Logical routes that require capabilities no provider can satisfy.
  • Missing required provider settings for active routes.
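Most of these checks are reference checks between the routes and providers sections. The sketch below shows one possible shape, with hypothetical `RouteCfg` and `ProviderCfg` structs; the list of known provider types is illustrative, and the capability check only considers the providers a route actually references.

```rust
use std::collections::{HashMap, HashSet};

pub struct RouteCfg {
    pub provider: Option<String>,
    pub fallbacks: Vec<String>,
    pub required_capabilities: Vec<String>,
}

pub struct ProviderCfg {
    pub kind: String,
    pub capabilities: HashSet<String>,
}

/// Returns every validation error, sorted for stable output.
/// An empty result means the new config may be swapped in.
pub fn validate(
    routes: &HashMap<String, RouteCfg>,
    providers: &HashMap<String, ProviderCfg>,
) -> Vec<String> {
    let mut errors = Vec::new();
    for (id, p) in providers {
        // Assumption: illustrative subset of provider types.
        if !matches!(p.kind.as_str(), "openai" | "azure-openai" | "bedrock" | "ollama") {
            errors.push(format!("provider {}: unknown type {}", id, p.kind));
        }
    }
    for (name, route) in routes {
        match &route.provider {
            None => errors.push(format!("route {}: no provider", name)),
            Some(id) if !providers.contains_key(id) => {
                errors.push(format!("route {}: unknown provider {}", name, id))
            }
            Some(_) => {}
        }
        for fb in &route.fallbacks {
            if !providers.contains_key(fb) {
                errors.push(format!("route {}: fallback points to missing provider {}", name, fb));
            }
        }
        if !route.required_capabilities.is_empty() {
            let satisfiable = route
                .provider
                .iter()
                .chain(route.fallbacks.iter())
                .filter_map(|id| providers.get(id))
                .any(|p| route.required_capabilities.iter().all(|c| p.capabilities.contains(c)));
            if !satisfiable {
                errors.push(format!("route {}: no provider satisfies required capabilities", name));
            }
        }
    }
    errors.sort();
    errors
}
```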

Agent Configuration Pattern

For a direct provider deployment:

model-provider.provider: bedrock
model-provider.model: anthropic.claude-3-5-sonnet-20240620-v1:0
bedrock.region: us-east-1

For a gateway deployment:

model-provider.provider: compatible
model-provider.model: agent-default
compatible.name: llm-gateway
compatible.baseUrl: https://llm-gateway.light-gateway:8443/v1
compatible.apiKey: ${llmGateway.agentApiKey}

The second form is the preferred enterprise model once centralized routing is available.

Phased Implementation

Phase 1: OpenAI-Compatible Gateway Endpoint

  • Add llm-gateway.yml.
  • Add a light-gateway handler for /v1/chat/completions.
  • Support non-streaming OpenAI-compatible requests and responses.
  • Route by logical model name to one configured upstream provider.
  • Mask provider secrets in module registry.
  • Add basic audit and metrics.

Phase 2: Policy, Fallback, And Reload

  • Add per-agent route policy.
  • Add health-aware fallback chains.
  • Support runtime reload with atomic state swap.
  • Add diagnostics endpoint or module registry details for active routes.

Phase 3: Capability-Aware Routing

  • Add capability metadata for each provider route.
  • Route by tools, vision, long context, streaming, and JSON mode.
  • Normalize tool-call request and response shapes across providers.

Phase 4: Prompt-Aware Routing

  • Add deterministic prompt classifiers.
  • Add optional lightweight model or embedding classifier for complex routing.
  • Record routing reasons for audit.

Phase 5: Advanced Provider Features

  • Add streaming.
  • Add /v1/responses.
  • Add prompt caching hints.
  • Add quota-aware routing.
  • Add data redaction or tokenization hooks when the tokenization service contract is finalized.

Open Questions

  • Should the first implementation live in light-pingora as a handler module or in a new llm-gateway crate used by light-gateway?
  • Should logical model policy be stored only in config-server values, or also managed by portal database tables for runtime UI edits?
  • Should gateway diagnostics expose selected provider/model to agents, or only to operators?
  • Should prompt-aware routing use Light-Rule first, a dedicated classifier, or both?
  • How should provider quota information be collected for cloud providers that do not expose uniform quota APIs?

Decision

Use the LLM gateway as the single model provider endpoint for enterprise agents. light-agent stays single-provider from its own point of view. The gateway owns multiple upstream providers, route selection, fallback, credentials, policy, audit, and observability.

WebSocket Router

Status

Phases 1, 2, and 3 are implemented. Phase 1 added configuration parsing, Java-compatible pathPrefixService normalization, route resolution, and upstream URI cleanup in light-pingora. Phase 2 wired the websocket handler into light-gateway with WebSocket upgrade detection, discovery-based upstream selection, request context storage, and upstream header/query cleanup. Phase 3 added a real gateway-to-backend WebSocket integration test for text, binary, close, subprotocol, and header behavior.

Purpose

The Java light-websocket-4j websocket-router module routes WebSocket traffic through a gateway or sidecar. A client connects to the gateway, the router resolves the downstream service from headers, query parameters, or path prefix configuration, and the gateway connects to the target WebSocket service.

In light-fabric this should be a light-pingora traffic handler activated by light-gateway through handler.yml. The same light-gateway binary can link the WebSocket router implementation, and each product decides whether the router runs by including the websocket handler in handler.yml and supplying websocket-router.yml from config-server.

The Rust implementation should preserve the Java routing semantics and most of the Java configuration shape, but it should not copy Java's enabled flag or frame-bridging architecture. Pingora already supports HTTP/1 upgrade proxying, so the first implementation should resolve the target and let Pingora tunnel the upgraded connection.

Goals

  • Add a Java-compatible WebSocket router to frameworks/light-pingora.
  • Activate the router with the existing websocket handler id in apps/light-gateway.
  • Keep the Java websocket-router routing configuration recognizable: defaultProtocol, defaultEnvTag, and pathPrefixService.
  • Allow websocket-router.pathPrefixService to be injected by config-server at startup the same way other handler-specific config is injected.
  • Resolve downstream services from header, query parameter, or longest path prefix.
  • Reuse the existing light-gateway discovery and upstream selection model.
  • Preserve WebSocket handshake headers and pass normal agent/browser headers through to the downstream service.
  • Register the router configuration with the module registry and support the same reload model as other light-pingora handler configs.
  • Keep the design suitable for gateway, sidecar, and BFF deployments.

Non-Goals

  • Do not implement a separate WebSocket server framework in light-fabric.
  • Do not terminate and re-create WebSocket frames in the first phase.
  • Do not multiplex multiple client WebSocket sessions over one downstream connection.
  • Do not support HTTP/2 extended CONNECT for WebSocket in the first phase.
  • Do not use Rust dynamic plugins or inventory for WebSocket route registration.
  • Do not create a separate gateway binary for WebSocket routing.
  • Do not use enabled in websocket-router.yml. The handler is active when handler.yml includes websocket in the matched execution chain.

Resolved Decisions

  • Activation is controlled only by handler.yml. If a matched chain includes websocket, the router is enabled for that request.
  • websocket-router.yml should not contain enabled.
  • WebSocket-specific controls should cover both request/upgrade rate and active upgraded connection count.
  • The first implementation should use Pingora HTTP/1 upgrade passthrough, not a frame-aware WebSocket bridge.
  • Invalid websocket-router.yml configuration should fail startup. Invalid reloads should be rejected while the last valid runtime state keeps serving existing traffic.

Java Behavior To Map

Java configuration includes enabled, but the Rust target config removes it:

# Light websocket router configuration
defaultProtocol: ${websocket-router.defaultProtocol:http}
defaultEnvTag: ${websocket-router.defaultEnvTag:}
pathPrefixService: ${websocket-router.pathPrefixService:}
preserveRoutingHeaders: ${websocket-router.preserveRoutingHeaders:false}
idleTimeoutMs: ${websocket-router.idleTimeoutMs:3600000}
maxConnectionDurationMs: ${websocket-router.maxConnectionDurationMs:}
maxActiveConnections: ${websocket-router.maxActiveConnections:}
maxUpgradeRequestsPerSecond: ${websocket-router.maxUpgradeRequestsPerSecond:}

The Java enabled field is intentionally not carried forward. In Rust, the handler chain is the activation contract. Removing websocket from a path or default chain disables WebSocket routing for that path.

Production controls are optional. idleTimeoutMs defaults to one hour; blank or zero values disable the matching control. preserveRoutingHeaders defaults to false, so routing-only Service-Id, service_id, and serviceId headers are stripped before the upstream handshake unless a backend explicitly needs them.

pathPrefixService accepts three forms:

# Form 1: YAML object entries
pathPrefixService:
  /chat:
    serviceId: com.networknt.llmchat-1.0.0
    protocol: http
    envTag: dev

# Form 2: service id string entries
pathPrefixService:
  /chat: com.networknt.llmchat-1.0.0

# Form 3: JSON string map
pathPrefixService: {"/chat":{"serviceId":"com.networknt.llmchat-1.0.0","protocol":"http","envTag":"dev"}}

The Java handler resolves the downstream service in this order:

  1. Header: first non-blank value from Service-Id, service_id, or serviceId.
  2. Query parameter: first non-blank value from service_id or serviceId.
  3. Path prefix: pathPrefixService match against the request path.

If a target is found, query parameters can override the target protocol and environment tag:

  • protocol
  • env_tag
  • envTag

The Java handler removes router-only query parameters before connecting to the downstream service:

  • protocol
  • service_id
  • serviceId
  • env_tag
  • envTag

The Java implementation accepts client WebSocket subprotocols, opens a new JDK WebSocket client connection to the downstream service, forwards Authorization, forwards the selected subprotocols, and then bridges text and binary frames in both directions.

Rust Architecture

Add the WebSocket router to light-pingora because it is a Pingora gateway traffic handler.

Proposed module:

frameworks/light-pingora/src/websocket.rs

Primary types:

use std::collections::BTreeMap;

// Normalized router configuration after defaults are applied.
pub struct WebSocketRouterConfig {
    pub default_protocol: String,
    pub default_env_tag: Option<String>,
    pub path_prefix_service: BTreeMap<String, WebSocketServiceTarget>,
}

// One pathPrefixService entry.
pub struct WebSocketServiceTarget {
    pub service_id: String,
    pub protocol: String,
    pub env_tag: Option<String>,
}

// Per-request routing result stored in the request context.
pub struct WebSocketRouteDecision {
    pub service_id: String,
    pub protocol: String,
    pub env_tag: Option<String>,
    pub upstream_path_and_query: String,
}

The serde layer should accept Java field names through aliases:

  • defaultProtocol
  • defaultEnvTag
  • pathPrefixService
  • serviceId
  • envTag

Use websocket-router.yml as the preferred Rust file name. Accept websocket-router.yaml as a compatibility fallback.

Config Normalization

Normalize pathPrefixService at load time:

raw config
  -> validate defaultProtocol/defaultEnvTag
  -> parse pathPrefixService YAML map, JSON string map, or legacy key/value string
  -> apply defaults to entries missing protocol or envTag
  -> sort prefixes by length for longest-prefix matching
  -> build Arc<WebSocketRouterState>

An invalid entry should fail config loading instead of being ignored silently. This is stricter than Java and is safer for remote config delivered by config-server.
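
The normalization steps above can be sketched as follows, assuming the raw entries have already been deserialized from YAML or JSON. RawTarget, Target, and normalize are hypothetical names used only for illustration, and the rule that a prefix must start with / is an assumption rather than a documented Java rule.

```rust
use std::collections::BTreeMap;

/// Raw entry shape after parsing (hypothetical): a bare service id or a full object.
pub enum RawTarget {
    ServiceId(String),
    Full { service_id: String, protocol: Option<String>, env_tag: Option<String> },
}

#[derive(Debug, Clone, PartialEq)]
pub struct Target {
    pub prefix: String,
    pub service_id: String,
    pub protocol: String,
    pub env_tag: Option<String>,
}

/// Applies defaults, rejects invalid entries, and orders prefixes longest first.
pub fn normalize(
    raw: BTreeMap<String, RawTarget>,
    default_protocol: &str,
    default_env_tag: Option<&str>,
) -> Result<Vec<Target>, String> {
    let mut out = Vec::with_capacity(raw.len());
    for (prefix, entry) in raw {
        let (service_id, protocol, env_tag) = match entry {
            RawTarget::ServiceId(id) => (id, None, None),
            RawTarget::Full { service_id, protocol, env_tag } => (service_id, protocol, env_tag),
        };
        // Assumed validation rules: fail loading instead of skipping the entry.
        if !prefix.starts_with('/') {
            return Err(format!("prefix '{prefix}' must start with '/'"));
        }
        if service_id.trim().is_empty() {
            return Err(format!("prefix '{prefix}' has a blank serviceId"));
        }
        let protocol = protocol.unwrap_or_else(|| default_protocol.to_string());
        if protocol != "http" && protocol != "https" {
            return Err(format!("prefix '{prefix}' has invalid protocol '{protocol}'"));
        }
        let env_tag = env_tag.or_else(|| default_env_tag.map(str::to_string));
        out.push(Target { prefix, service_id, protocol, env_tag });
    }
    // Longest prefix first so the first match during lookup is the longest match.
    out.sort_by(|a, b| b.prefix.len().cmp(&a.prefix.len()));
    Ok(out)
}
```

Returning an error for the whole load, rather than dropping the bad entry, is what lets startup fail and lets a reload be rejected as a unit.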

Handler Registration

apps/light-gateway already reserves the websocket handler id as a traffic handler. The implementation should attach that id to the WebSocket router runtime:

handlers:
  - correlation
  - metrics
  - jwt
  - limit
  - websocket

paths:
  - path: /chat
    method: GET
    exec:
      - correlation
      - metrics
      - jwt
      - limit
      - websocket

The router should only run for chains that include websocket. This lets a BFF serve static SPA assets, REST APIs, MCP, JSON-RPC, and WebSocket endpoints from the same gateway binary with path-specific handler chains.

Request Flow

The target flow should be:

client request
  -> handler.yml path/chain match
  -> cross-cutting request handlers
  -> websocket handler
       -> verify WebSocket upgrade
       -> resolve service target
       -> strip router-only query parameters
       -> store WebSocketRouteDecision in request context
  -> Pingora upstream_peer selects discovered target
  -> Pingora upstream_request_filter preserves WebSocket handshake headers
  -> Pingora proxies the HTTP/1 upgraded stream
  -> response/metrics handlers observe completion

The router should not read the request body and should not buffer WebSocket messages. Once the request is upgraded, Pingora owns the tunnel.

Upgrade Detection

The handler should require the normal WebSocket handshake:

  • method GET
  • Connection contains upgrade
  • Upgrade equals websocket
  • Sec-WebSocket-Key exists
  • HTTP version is compatible with HTTP/1 upgrade

If the websocket handler is selected by handler.yml but the request is not a WebSocket upgrade, return 426 Upgrade Required.
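
A minimal sketch of the handshake check listed above, written against plain header pairs rather than Pingora's request types. The function names are hypothetical, and treating only HTTP/1.1 as compatible is an assumption for this sketch.

```rust
/// Looks up a header value by name, ignoring ASCII case.
fn header<'a>(headers: &'a [(&'a str, &'a str)], name: &str) -> Option<&'a str> {
    headers
        .iter()
        .find(|(k, _)| k.eq_ignore_ascii_case(name))
        .map(|(_, v)| *v)
}

/// Returns true when the request looks like an HTTP/1.1 WebSocket upgrade.
pub fn is_websocket_upgrade(method: &str, version: &str, headers: &[(&str, &str)]) -> bool {
    // Connection is a comma-separated token list, for example "keep-alive, Upgrade".
    let connection_has_upgrade = header(headers, "Connection")
        .map(|v| v.split(',').any(|t| t.trim().eq_ignore_ascii_case("upgrade")))
        .unwrap_or(false);
    let upgrade_is_websocket = header(headers, "Upgrade")
        .map(|v| v.trim().eq_ignore_ascii_case("websocket"))
        .unwrap_or(false);
    let has_key = header(headers, "Sec-WebSocket-Key")
        .map(|v| !v.trim().is_empty())
        .unwrap_or(false);
    method == "GET"
        && version == "HTTP/1.1"
        && connection_has_upgrade
        && upgrade_is_websocket
        && has_key
}

/// Returns Some(426) when the websocket handler should reject the request.
pub fn rejection_status(method: &str, version: &str, headers: &[(&str, &str)]) -> Option<u16> {
    if is_websocket_upgrade(method, version, headers) {
        None
    } else {
        Some(426)
    }
}
```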

HTTP/2 extended CONNECT can be considered later, but should not block the first implementation.

Target Resolution

Target resolution should match Java precedence:

1. service id header
2. service id query parameter
3. pathPrefixService longest-prefix match

Header names:

Service-Id
service_id
serviceId

Query names:

service_id
serviceId
protocol
env_tag
envTag

For path-prefix matches, use the request path without the query string. When multiple prefixes match, choose the longest prefix.
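
The precedence rules above can be sketched as a small resolver. All names here are hypothetical. Two details are assumptions of this sketch: header names are compared case-insensitively while query names are compared exactly, and a prefix matches by plain string prefix rather than by path segment.

```rust
use std::collections::BTreeMap;

/// Where the route decision came from; useful for metrics and logs.
#[derive(Debug, PartialEq)]
pub enum RouteSource {
    Header,
    Query,
    PathPrefix,
}

/// Returns the first non-blank value among `names`, in the order given.
fn first_non_blank(pairs: &[(&str, &str)], names: &[&str], ignore_case: bool) -> Option<String> {
    for name in names {
        for (key, value) in pairs {
            let same = if ignore_case { key.eq_ignore_ascii_case(name) } else { key == name };
            if same && !value.trim().is_empty() {
                return Some(value.trim().to_string());
            }
        }
    }
    None
}

/// Picks the service id of the longest configured prefix that matches `path`.
fn longest_prefix<'a>(routes: &'a BTreeMap<String, String>, path: &str) -> Option<&'a str> {
    routes
        .iter()
        .filter(|(prefix, _)| path.starts_with(prefix.as_str()))
        .max_by_key(|(prefix, _)| prefix.len())
        .map(|(_, service_id)| service_id.as_str())
}

/// Header first, then query parameter, then longest path prefix.
pub fn resolve(
    headers: &[(&str, &str)],
    query: &[(&str, &str)],
    path_and_query: &str,
    routes: &BTreeMap<String, String>,
) -> Option<(String, RouteSource)> {
    if let Some(id) = first_non_blank(headers, &["Service-Id", "service_id", "serviceId"], true) {
        return Some((id, RouteSource::Header));
    }
    if let Some(id) = first_non_blank(query, &["service_id", "serviceId"], false) {
        return Some((id, RouteSource::Query));
    }
    // Match on the path only; the query string never takes part in prefix matching.
    let path = path_and_query.split('?').next().unwrap_or(path_and_query);
    longest_prefix(routes, path).map(|id| (id.to_string(), RouteSource::PathPrefix))
}
```

Returning the route source alongside the service id keeps the resolver usable for the metrics described under Observability.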

The resolved protocol should be http or https. Conceptually this maps to ws or wss, but Pingora should still connect to the upstream as HTTP or HTTPS and then perform the WebSocket upgrade.

Header And Query Policy

Because the Rust implementation should use Pingora upgrade passthrough, it should preserve the original handshake headers:

  • Upgrade
  • Connection
  • Sec-WebSocket-Key
  • Sec-WebSocket-Version
  • Sec-WebSocket-Protocol
  • Sec-WebSocket-Extensions
  • Authorization
  • cookies
  • normal agent/browser headers

The router should strip only router-control query parameters from the upstream URI:

  • protocol
  • service_id
  • serviceId
  • env_tag
  • envTag

The service-id routing headers should be removed before the upstream request by default:

  • Service-Id
  • service_id
  • serviceId

This keeps gateway routing controls separate from backend application headers. If a backend needs these headers, set preserveRoutingHeaders to true rather than leaking them by default.
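
A minimal sketch of the cleanup policy above, operating on a raw path-and-query string and an owned header list. The function names are hypothetical, and the sketch assumes query parameter names arrive unencoded.

```rust
/// Query parameters that only the router consumes.
const ROUTER_QUERY_PARAMS: [&str; 5] = ["protocol", "service_id", "serviceId", "env_tag", "envTag"];

/// Headers that only the router consumes.
const ROUTING_HEADERS: [&str; 3] = ["Service-Id", "service_id", "serviceId"];

/// Removes router-control parameters and keeps every other parameter in order.
pub fn strip_router_query(path_and_query: &str) -> String {
    let Some((path, query)) = path_and_query.split_once('?') else {
        return path_and_query.to_string();
    };
    let kept: Vec<&str> = query
        .split('&')
        .filter(|pair| !pair.is_empty())
        .filter(|pair| {
            let name = pair.split('=').next().unwrap_or("");
            !ROUTER_QUERY_PARAMS.contains(&name)
        })
        .collect();
    if kept.is_empty() {
        path.to_string()
    } else {
        format!("{path}?{}", kept.join("&"))
    }
}

/// Drops service-id routing headers unless the caller asks to preserve them.
pub fn strip_routing_headers(headers: &mut Vec<(String, String)>, preserve: bool) {
    if preserve {
        return;
    }
    headers.retain(|(name, _)| !ROUTING_HEADERS.iter().any(|h| h.eq_ignore_ascii_case(name)));
}
```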

Discovery And Upstream Selection

The WebSocket router should reuse the same discovery/runtime model as router.yml and the existing Pingora proxy flow.

Resolved target:

protocol + serviceId + envTag

Discovery returns an upstream HTTP or HTTPS endpoint. upstream_peer creates the Pingora peer:

  • http: non-TLS upstream
  • https: TLS upstream with normal SNI/hostname handling

For the first implementation, require HTTP/1.1 to the backend for WebSocket upgrade. HTTP/2 WebSocket tunneling can be a later feature.

Error Handling

Errors should be returned before the connection is upgraded:

Condition                                              Response
Handler selected but request is not WebSocket upgrade  426 Upgrade Required
No service id and no path-prefix match                 403 Forbidden
Invalid protocol override                              400 Bad Request
Discovery has no usable endpoint                       502 Bad Gateway
Upstream connect/upgrade failure                       502 Bad Gateway

Returning HTTP errors before upgrade is clearer than Java's close-frame behavior because the Rust implementation does not accept the WebSocket until the target is known.
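
One way to keep these pre-upgrade responses in a single place is a small rejection enum. This is a sketch only; WebSocketRejection and parse_protocol are hypothetical names, and matching the protocol override exactly (without case folding) is an assumption.

```rust
/// Reasons the router can reject a request before the connection is upgraded.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WebSocketRejection {
    NotUpgrade,
    NoTarget,
    InvalidProtocol,
    NoEndpoint,
    UpstreamFailure,
}

impl WebSocketRejection {
    /// HTTP status returned to the client for each rejection.
    pub fn status(self) -> u16 {
        match self {
            WebSocketRejection::NotUpgrade => 426,
            WebSocketRejection::NoTarget => 403,
            WebSocketRejection::InvalidProtocol => 400,
            WebSocketRejection::NoEndpoint | WebSocketRejection::UpstreamFailure => 502,
        }
    }
}

/// Validates a protocol override taken from the query string.
pub fn parse_protocol(value: &str) -> Result<&'static str, WebSocketRejection> {
    match value {
        "http" => Ok("http"),
        "https" => Ok("https"),
        _ => Err(WebSocketRejection::InvalidProtocol),
    }
}
```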

Module Registry And Reload

Register the loaded configuration with the module registry:

module id: light-pingora/websocket-router
config name: websocket-router
config file: websocket-router.yml or websocket-router.yaml

On reload:

  1. Load and validate the new config.
  2. Build a new immutable route state.
  3. Atomically swap the state.
  4. Let in-flight upgraded connections continue with the old decision.

Existing WebSocket tunnels should not be interrupted by a config reload unless the gateway process is restarted.
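
The reload steps above can be sketched with the standard library alone. RouteState and RouterHandle are hypothetical names, and a real implementation may use a different swap primitive; the point is that a request keeps the Arc it resolved with, so a later swap does not affect it.

```rust
use std::sync::{Arc, RwLock};

/// Immutable route state built from one validated config load (fields are illustrative).
pub struct RouteState {
    pub version: u64,
    pub default_protocol: String,
}

/// Holds the current state and swaps it as a whole on reload.
pub struct RouterHandle {
    current: RwLock<Arc<RouteState>>,
}

impl RouterHandle {
    pub fn new(initial: RouteState) -> Self {
        Self { current: RwLock::new(Arc::new(initial)) }
    }

    /// Each request takes a snapshot and keeps using it after later reloads.
    pub fn snapshot(&self) -> Arc<RouteState> {
        Arc::clone(&self.current.read().expect("route state lock poisoned"))
    }

    /// `candidate` is the result of loading and validating the new config.
    /// A failed load leaves the last valid state in place.
    pub fn reload(&self, candidate: Result<RouteState, String>) -> Result<(), String> {
        let next = Arc::new(candidate?);
        *self.current.write().expect("route state lock poisoned") = next;
        Ok(())
    }
}
```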

Observability

The handler should integrate with existing correlation and metrics handlers:

  • include correlation id in pre-upgrade logs
  • record target resolution result
  • record route source: header, query, or pathPrefixService
  • count upgrade attempts, successful upgrades, rejected upgrades, and upstream connection failures
  • optionally record tunnel duration once Pingora exposes completion

Do not log full query strings by default because they may contain application data.

Test Plan

Parser and resolver tests:

  • YAML object pathPrefixService
  • string service id entries
  • JSON string map entries
  • legacy key/value string entries
  • default protocol and env tag application
  • invalid entries fail load
  • header beats query and path prefix
  • query beats path prefix
  • longest prefix wins
  • query protocol/envTag override
  • router query params are stripped

Gateway tests:

  • non-upgrade request to a WebSocket chain returns 426
  • missing target returns 403
  • unknown discovery target returns 502
  • upgrade request preserves Sec-WebSocket-Protocol
  • Authorization and normal browser/agent headers pass through
  • service-id routing headers are stripped before upstream

Integration tests:

  • connect through light-gateway to a local WebSocket echo backend
  • text message round trip
  • binary message round trip
  • close frame behavior
  • subprotocol negotiation
  • TLS upstream smoke test when a local test certificate is available

Implementation Phases

Phase 1: Config And Resolver

Status: implemented.

  • Add frameworks/light-pingora/src/websocket.rs.
  • Parse websocket-router.yml and websocket-router.yaml.
  • Normalize all Java-compatible pathPrefixService forms.
  • Implement target resolution and upstream URI cleanup.
  • Add unit tests.

Phase 2: Gateway Handler Wiring

Status: implemented.

  • Connect the existing websocket handler id to the router runtime.
  • Detect WebSocket upgrade requests in the Pingora request flow.
  • Store WebSocketRouteDecision in the request context.
  • Select the discovered upstream in upstream_peer.
  • Strip router query params and service-id headers in upstream_request_filter.

Phase 3: WebSocket Integration Tests

Status: implemented.

  • Add a local test WebSocket echo service.
  • Verify text, binary, close, subprotocol, and header behavior through light-gateway.
  • Verify HTTP and HTTPS upstream paths if practical in CI.

Phase 4: Production Controls

Status: implemented.

  • Add optional idle timeout and max connection duration.
  • Add WebSocket-specific limit controls for both upgrade/request rate and active upgraded connection count.
  • Add explicit config for preserving routing headers if a backend requires them.
  • Add access-control integration once the same access-control model is shared across REST, JSON-RPC, MCP, and WebSocket routes.

Implementation notes:

  • maxUpgradeRequestsPerSecond gates accepted upgrade attempts before discovery lookup.
  • maxActiveConnections tracks proxied upgraded sessions with a permit that is released when Pingora finishes the request context. The active counter is preserved across router and policy reloads.
  • idleTimeoutMs is applied to downstream and upstream tunnel IO. Pingora's body-filter hooks also check idle age when either side sends tunneled data.
  • maxConnectionDurationMs is checked by the tunnel body filters and is also used as an IO timeout when it is the only timeout configured. A connection that continuously exchanges frames is closed on the next tunneled body chunk after the duration is exceeded.
  • WebSocket access-control uses the shared access-control.yml and rule.yml model. The rule context uses tool name websocket, endpoint from handler.yml, and tool arguments containing serviceId, protocol, envTag, upstreamPathAndQuery, and route source.
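
The permit behavior described for maxActiveConnections can be illustrated with a drop-based guard. This is a sketch under assumptions, not the gateway's actual code: ConnectionLimiter and ConnectionPermit are hypothetical names, and sharing the counter through an Arc is one possible way to keep the count across reloads.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Tracks active upgraded connections against an optional limit.
pub struct ConnectionLimiter {
    active: Arc<AtomicUsize>,
    max: Option<usize>,
}

/// Held for the lifetime of one proxied session; releases its slot on drop.
pub struct ConnectionPermit {
    active: Arc<AtomicUsize>,
}

impl ConnectionLimiter {
    /// A blank (None) or zero limit disables the control.
    pub fn new(max: Option<usize>) -> Self {
        Self { active: Arc::new(AtomicUsize::new(0)), max: max.filter(|m| *m > 0) }
    }

    /// Returns a permit, or None when the limit is already reached.
    pub fn try_acquire(&self) -> Option<ConnectionPermit> {
        let max = self.max;
        self.active
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| match max {
                Some(limit) if n >= limit => None,
                _ => Some(n + 1),
            })
            .ok()
            .map(|_| ConnectionPermit { active: Arc::clone(&self.active) })
    }

    pub fn active(&self) -> usize {
        self.active.load(Ordering::Acquire)
    }
}

impl Drop for ConnectionPermit {
    fn drop(&mut self) {
        self.active.fetch_sub(1, Ordering::AcqRel);
    }
}
```

Storing the permit in the per-request context means the slot is released when that context is dropped, without a separate cleanup call.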

Open Questions

None.

Stateless Auth Handler

Status

Initial Rust implementation is complete in light-pingora and light-gateway. It includes the shared SPA session runtime, authorization-code entrypoint, logout, cookie handling, CSRF validation, refresh-token renewal, Google/Facebook/GitHub callback entrypoints, handler wiring, config stubs, and runtime-load tests.

Purpose

The Java light-spa-4j stateless-auth module is the BFF login bridge for SPA deployments that use OAuth 2.0 authorization code flow in the cloud. The browser completes the provider redirect, calls the gateway callback path with the authorization code, and the gateway exchanges that code for light-oauth tokens. The gateway then stores the internal access token, refresh token, user metadata, and CSRF value in browser cookies.

In light-fabric this should be a light-pingora security handler used by light-gateway. The handler should be activated by handler.yml, loaded from config-server with the same product-level configuration model as the rest of the gateway, and implemented with the same shared SPA session runtime used by the MSAL exchange handler.

Goals

  • Preserve the Java BFF behavior for authorization code login, logout, CSRF validation, refresh-token renewal, and downstream Authorization injection.
  • Keep the Java statelessAuth.yml field names recognizable so light-portal can inject statelessAuth.* values into config-server output.
  • Use handler.yml as the primary activation and ordering contract.
  • Keep the existing stateless handler id as the public handler-chain name.
  • Share cookie, CSRF, JWT parsing, refresh-token single-flight, and Authorization injection code with the MSAL exchange handler.
  • Use the existing client.yml OAuth token configuration for authorization code and refresh-token calls.
  • Register the loaded config in ModuleRegistry and reject invalid config at startup.
  • Support BFF chains that also use static SPA serving, proxy/router, WebSocket routing, and MCP routing.
  • Support Google, Facebook, and GitHub login entrypoints in addition to the generic authorization-code callback.

Non-Goals

  • Do not use Rust dynamic plugins or inventory.
  • Do not create a separate BFF binary.
  • Do not store server-side browser sessions in the first implementation.
  • Do not require the Rust social-login implementation to copy Java's provider-specific classes. Rust should preserve the external behavior and config contract, but it can use established OAuth/OIDC crates for provider protocol handling.
  • Do not redirect the browser from the gateway by default. Java returns a JSON body containing redirectUri, denyUri, and scopes; Rust should preserve that behavior.

Resolved Decisions

  • Google, Facebook, and GitHub login handlers are in scope. The existing google, facebook, and github handler ids should remain as public handler-chain names.
  • Rust should prefer provider-appropriate crates instead of hand-rolling every provider flow. openidconnect is a good fit for OpenID Connect providers such as Google, and oauth2 is a good fit for plain OAuth 2.0 providers or provider-specific extensions.
  • cookieTimeoutUri should be used by Rust to return a structured session-expired response when a browser session cannot be renewed.

Java Behavior To Map

Java config file:

enabled: ${statelessAuth.enabled:true}
redirectUri: ${statelessAuth.redirectUri:https://localhost:3000/#/app/dashboard}
denyUri: ${statelessAuth.denyUri:https://localhost:3000/#/app/dashboard}
enableHttp2: ${statelessAuth.enableHttp2:false}
authPath: ${statelessAuth.authPath:/authorization}
logoutPath: ${statelessAuth.logoutPath:/logout}
cookieDomain: ${statelessAuth.cookieDomain:localhost}
cookiePath: ${statelessAuth.cookiePath:/}
cookieTimeoutUri: ${statelessAuth.cookieTimeoutUri:/}
cookieSecure: ${statelessAuth.cookieSecure:true}
sessionTimeout: ${statelessAuth.sessionTimeout:3600}
rememberMeTimeout: ${statelessAuth.rememberMeTimeout:604800}
bootstrapToken: ${statelessAuth.bootstrapToken:token}
googlePath: ${statelessAuth.googlePath:/google}
googleClientId: ${statelessAuth.googleClientId:google_client_id}
googleClientSecret: ${statelessAuth.googleClientSecret:secret}
googleRedirectUri: ${statelessAuth.googleRedirectUri:https://localhost:3000}
facebookPath: ${statelessAuth.facebookPath:/facebook}
facebookClientId: ${statelessAuth.facebookClientId:facebook_client_id}
facebookClientSecret: ${statelessAuth.facebookClientSecret:secret}
githubPath: ${statelessAuth.githubPath:/github}
githubClientId: ${statelessAuth.githubClientId:github_client_id}
githubClientSecret: ${statelessAuth.githubClientSecret:secret}

Java request behavior:

  • GET authPath, normally /authorization, expects query parameter code and optional state.
  • Missing code returns ERR10035.
  • The handler generates a CSRF value and sends an authorization-code token request through http-client using client.yml oauth.token.authorization_code.
  • On success, it sets browser cookies and returns JSON containing scopes, redirectUri, and denyUri.
  • GET logoutPath, normally /logout, clears BFF cookies and ends the request.
  • Other requests are treated as downstream BFF requests. The handler reads the accessToken cookie, verifies/parses it, validates CSRF, refreshes the token if it expires within 90 seconds, and injects Authorization: Bearer <access-token> before the proxy/router handler runs.
  • If no access token exists but a refresh token exists, the handler attempts refresh and then injects the new access token.
  • If neither cookie exists, Java allows the request to continue. The downstream service can still decide whether the endpoint is anonymous or protected.

Java error codes to preserve:

Code      Meaning
ERR10035  Authorization code is missing
ERR10000  Access token is invalid
ERR10036  CSRF token is missing from request
ERR10038  CSRF claim is missing from JWT
ERR10039  Request CSRF and JWT CSRF do not match
ERR10037  Refresh-token response is empty

Rust Architecture

Add a shared SPA auth runtime in light-pingora and expose it through light-gateway.

Proposed modules:

frameworks/light-pingora/src/spa_auth.rs
frameworks/light-pingora/src/stateless_auth.rs

spa_auth.rs owns the reusable mechanics:

use std::sync::Arc;

// Cookie attributes shared by every BFF cookie.
pub struct SpaCookieConfig {
    pub cookie_domain: String,
    pub cookie_path: String,
    pub cookie_secure: bool,
    pub session_timeout: u64,
    pub remember_me_timeout: u64,
    pub same_site: CookieSameSite,
    pub renew_before_seconds: u64,
}

// Shared session runtime used by the stateless and MSAL exchange handlers.
pub struct SpaSessionRuntime {
    pub cookies: SpaCookieConfig,
    pub token_client: Arc<SpaTokenClient>,
    pub jwt_verifier: Arc<SecurityRuntime>,
    pub refresh_single_flight: RefreshSingleFlight,
}

// Outcome of session validation for one request.
pub struct SpaSessionResult {
    pub access_token: Option<String>,
    pub principal: Option<AuthPrincipal>,
    pub response_cookies: Vec<SetCookie>,
}

stateless_auth.rs owns the authorization-code entrypoint:

pub struct StatelessAuthConfig {
    pub enabled: bool,
    pub redirect_uri: String,
    pub deny_uri: Option<String>,
    pub enable_http2: bool,
    pub auth_path: String,
    pub logout_path: String,
    pub cookie_domain: String,
    pub cookie_path: String,
    pub cookie_timeout_uri: String,
    pub cookie_secure: bool,
    pub session_timeout: u64,
    pub remember_me_timeout: u64,
    pub bootstrap_token: Option<String>,
    pub renew_before_seconds: u64,
    pub google: Option<SocialProviderConfig>,
    pub facebook: Option<SocialProviderConfig>,
    pub github: Option<SocialProviderConfig>,
}

pub struct SocialProviderConfig {
    pub path: String,
    pub client_id: String,
    pub client_secret: String,
    pub redirect_uri: Option<String>,
    pub scopes: Vec<String>,
}

pub struct StatelessAuthRuntime {
    pub config: StatelessAuthConfig,
    pub session: SpaSessionRuntime,
}

Use Java-compatible serde aliases for camel-case config fields. The primary file should be statelessAuth.yml; accept statelessAuth.yaml as a compatibility fallback.

The serde layer can keep the Java-compatible flat fields, such as googlePath, googleClientId, and googleClientSecret, and normalize them into SocialProviderConfig entries after load. This keeps config-server compatibility while giving Rust a cleaner internal model.

Handler Registration

apps/light-gateway already reserves the stateless handler id. The runtime loader should follow the same pattern as MCP:

// Load the runtime only when a handler chain references the stateless handler.
let stateless_auth = load_stateless_auth_runtime(
    config,
    active_handlers.is_handler_active("stateless"),
)?;

If stateless is not active in any chain, the config does not need to be loaded. If the handler is active but the config sets enabled: false, register the disabled module and return None.

No @alias syntax is needed. The handler id in handler.yml is the stable Rust contract.

Example BFF chain:

handlers:
  - exception
  - cors
  - stateless
  - header
  - prefix
  - token
  - router

chains:
  default:
    - exception
    - cors
    - stateless
    - header
    - prefix
    - token
    - router
  websocket:
    - exception
    - stateless
    - security
    - websocket

paths:
  - path: /authorization
    method: GET
    exec:
      - default
  - path: /logout
    method: GET
    exec:
      - default

The handler should normally run after CORS and before proxy/router/WebSocket.

Login Flow

For authPath:

GET /authorization?code=...&state=...
  -> validate code
  -> generate csrf
  -> call token endpoint with authorization_code grant
  -> verify/parse returned internal access token
  -> set BFF cookies
  -> return { "scopes": [...], "redirectUri": "...?state=...", "denyUri": "..." }

Token request mapping should reuse client.yml:

  • oauth.token.server_url or oauth.token.serviceId
  • oauth.token.enableHttp2
  • oauth.token.authorization_code.uri
  • oauth.token.authorization_code.client_id
  • oauth.token.authorization_code.client_secret
  • oauth.token.authorization_code.redirect_uri
  • oauth.token.authorization_code.scope

The form body should match Java:

grant_type=authorization_code
code=<code>
redirect_uri=<optional redirect_uri>
csrf=<generated csrf>
scope=<space separated scopes, if configured>
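
A minimal sketch of building that form body as an application/x-www-form-urlencoded string. The function names are hypothetical, and encoding every byte outside the unreserved set as %XX (including spaces as %20) is a choice made for this sketch, not a statement about what the Java client sends.

```rust
/// Percent-encodes everything except unreserved characters.
fn encode(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for b in value.bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                out.push(b as char)
            }
            _ => out.push_str(&format!("%{b:02X}")),
        }
    }
    out
}

/// Builds the authorization-code token request body in the field order listed above.
pub fn authorization_code_form(
    code: &str,
    redirect_uri: Option<&str>,
    csrf: &str,
    scopes: &[&str],
) -> String {
    let mut fields = vec![
        ("grant_type", "authorization_code".to_string()),
        ("code", code.to_string()),
    ];
    // redirect_uri and scope are optional and omitted when not configured.
    if let Some(uri) = redirect_uri {
        fields.push(("redirect_uri", uri.to_string()));
    }
    fields.push(("csrf", csrf.to_string()));
    if !scopes.is_empty() {
        fields.push(("scope", scopes.join(" ")));
    }
    fields
        .iter()
        .map(|(name, value)| format!("{name}={}", encode(value)))
        .collect::<Vec<_>>()
        .join("&")
}
```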

Session Validation Flow

For requests that are not login/logout:

request
  -> read accessToken cookie
  -> verify/parse internal JWT with security.yml rules
  -> extract csrf claim
  -> find request CSRF from X-CSRF-TOKEN, WebSocket subprotocol, or query
  -> compare csrf values
  -> refresh token if exp is inside renew window
  -> inject Authorization: Bearer <access-token>
  -> continue handler chain

CSRF source order should match Java:

  1. X-CSRF-TOKEN header.
  2. Sec-WebSocket-Protocol value starting with csrf. when the request has Sec-WebSocket-Key and Sec-WebSocket-Version.
  3. Query parameter csrf.

The WebSocket subprotocol behavior is important for browser WebSocket clients that cannot set arbitrary headers. The auth handler should run before the websocket router so the downstream handshake receives the internal Authorization header.
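
The CSRF source order can be sketched as below. The function names are hypothetical. The sketch assumes the subprotocol prefix is the literal csrf. and that the CSRF value is whatever follows that prefix; both should be checked against the Java handler before relying on them.

```rust
/// Finds a non-blank value by name; header lookups ignore ASCII case.
fn find<'a>(pairs: &'a [(&'a str, &'a str)], name: &str, ignore_case: bool) -> Option<&'a str> {
    pairs
        .iter()
        .find(|(k, _)| if ignore_case { k.eq_ignore_ascii_case(name) } else { *k == name })
        .map(|(_, v)| v.trim())
        .filter(|v| !v.is_empty())
}

/// Header first, then WebSocket subprotocol, then query parameter.
pub fn request_csrf(headers: &[(&str, &str)], query: &[(&str, &str)]) -> Option<String> {
    if let Some(value) = find(headers, "X-CSRF-TOKEN", true) {
        return Some(value.to_string());
    }
    // The subprotocol source only applies to WebSocket handshakes.
    let is_handshake = find(headers, "Sec-WebSocket-Key", true).is_some()
        && find(headers, "Sec-WebSocket-Version", true).is_some();
    if is_handshake {
        if let Some(protocols) = find(headers, "Sec-WebSocket-Protocol", true) {
            let token = protocols
                .split(',')
                .map(str::trim)
                .find_map(|p| p.strip_prefix("csrf."))
                .filter(|t| !t.is_empty());
            if let Some(token) = token {
                return Some(token.to_string());
            }
        }
    }
    find(query, "csrf", false).map(str::to_string)
}
```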

Session-Expired Response

The Java handler usually allows requests with no cookies to continue so the downstream service can decide whether the endpoint is anonymous. Rust should preserve that pass-through behavior for requests with no session evidence.

When the request does have session evidence but the session cannot be renewed, for example an expired or rejected refresh token, Rust should clear BFF cookies and return a structured response using cookieTimeoutUri:

{
  "code": "ERR10040",
  "message": "SPA session expired",
  "timeoutUri": "/",
  "authenticated": false
}

The status should be 401 unless a later product config explicitly asks for a different behavior. This gives the SPA a deterministic signal to navigate to the configured timeout or login page without scraping an Undertow-style status string.

Internal JWT Verification

The shared SPA runtime should not call the existing verify_jwt_request function directly. That function is designed for API requests with an Authorization header, path skips, pass-through claims, and normal security handler behavior.

The SPA auth runtime needs a lower-level token verifier that can:

  • verify the access-token signature using the same certificates and algorithms as security.yml;
  • parse claims from a token stored in a cookie;
  • optionally ignore expiration while deciding whether the token can be refreshed;
  • fail hard on invalid signature, invalid algorithm, malformed JWT, and missing key;
  • return an AuthPrincipal and raw claims for CSRF, cookie metadata, and optional request-context propagation.

This can be implemented by extracting a reusable helper from security.rs, for example:

// Token-level verification helper extracted from security.rs (signature only).
fn verify_jwt_token(
    runtime: &SecurityRuntime,
    token: &str,
    expiry_mode: JwtExpiryMode,
) -> Result<AuthPrincipal, HandlerRejection>;

The normal security handler can keep its current request-level wrapper, while SPA auth uses the token-level helper for cookie tokens.

Social Provider Login

Google, Facebook, and GitHub login are implemented as thin handler entrypoints that reuse the same cookie/session runtime as the authorization-code callback. The existing handler ids are kept:

chains:
  google:
    - exception
    - correlation
    - cors
    - google
    - stateless
    - header
    - prefix
    - router
  facebook:
    - exception
    - correlation
    - cors
    - facebook
    - stateless
    - header
    - prefix
    - router
  github:
    - exception
    - correlation
    - cors
    - github
    - stateless
    - header
    - prefix
    - router

The implemented provider flow is:

  1. Match its configured provider path, for example googlePath, facebookPath, or githubPath.
  2. For Google, exchange the authorization code with the Google token endpoint and use the returned id_token as the subject token. If the provider does not return an ID token, fall back to access_token.
  3. For Facebook, accept the Java-compatible accessToken query parameter, or exchange an authorization code with the Facebook token endpoint.
  4. For GitHub, exchange the authorization code with the GitHub token endpoint.
  5. Use client.yml oauth.token.token_exchange to exchange the provider subject token for an internal light-oauth token set with a CSRF claim.
  6. Set the same BFF cookies as the generic stateless handler and return the same JSON shape.
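
The subject-token choice in step 2 can be sketched as a small helper. ProviderTokenResponse and subject_token are hypothetical names, and treating a blank id_token the same as a missing one is an assumption.

```rust
/// The two token fields of a provider token response that matter here.
pub struct ProviderTokenResponse {
    pub id_token: Option<String>,
    pub access_token: Option<String>,
}

/// Returns the trimmed token, or None when it is missing or blank.
fn non_blank(token: &Option<String>) -> Option<&str> {
    token.as_deref().map(str::trim).filter(|s| !s.is_empty())
}

/// Prefers the ID token and falls back to the access token.
pub fn subject_token(response: &ProviderTokenResponse) -> Option<&str> {
    non_blank(&response.id_token).or_else(|| non_blank(&response.access_token))
}
```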

Provider token endpoints default to the public provider URLs, but can be overridden for tests or regional deployments:

googleTokenEndpoint: ${statelessAuth.googleTokenEndpoint:https://oauth2.googleapis.com/token}
facebookTokenEndpoint: ${statelessAuth.facebookTokenEndpoint:https://graph.facebook.com/v19.0/oauth/access_token}
githubTokenEndpoint: ${statelessAuth.githubTokenEndpoint:https://github.com/login/oauth/access_token}

External identity mapping is intentionally delegated to the internal token-exchange implementation. Once portal-service tokenization has a final RPC contract, the subject-token exchange can map provider identities there without changing the gateway cookie/session runtime.

Refresh Flow

The Java handler refreshes 90 seconds before expiry and deduplicates concurrent refreshes with RefreshTokenSingleFlight. Rust should keep that behavior.

Default Rust settings:

renewBeforeSeconds: ${statelessAuth.renewBeforeSeconds:90}
refreshSingleFlightWaitMs: ${statelessAuth.refreshSingleFlightWaitMs:5000}
refreshSingleFlightCacheMs: ${statelessAuth.refreshSingleFlightCacheMs:3000}
refreshSingleFlightMaxEntries: ${statelessAuth.refreshSingleFlightMaxEntries:10000}

These fields are Rust improvements. They can be omitted from config-server templates until a product needs to tune them.
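
The renew-window check itself is small. A sketch, assuming exp and the current time are both epoch seconds and that a token sitting exactly on the window boundary is refreshed; needs_refresh is a hypothetical name.

```rust
/// Returns true when the token expires within `renew_before_seconds` or has already expired.
pub fn needs_refresh(exp_epoch_secs: u64, now_epoch_secs: u64, renew_before_seconds: u64) -> bool {
    // saturating_sub treats an already expired token as having zero seconds left.
    exp_epoch_secs.saturating_sub(now_epoch_secs) <= renew_before_seconds
}
```

With the default of 90 seconds, a token with 95 seconds remaining is left alone, while one with 90 seconds or less triggers a refresh.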

Refresh-token request mapping should reuse client.yml oauth.token.refresh_token and send:

grant_type=refresh_token
refresh_token=<cookie refresh token>
csrf=<new csrf>
scope=<space separated scopes, if configured>

Cookies

Cookie names should remain Java-compatible:

Cookie        HttpOnly  Source
accessToken   true      OAuth access token
refreshToken  true      OAuth refresh token
csrf          false     Generated CSRF value
userId        false     JWT uid claim
userType      false     JWT userType claim
roles         false     Base64-encoded JWT role claim, default user
host          false     JWT host claim
email         false     JWT eml claim
eid           false     JWT eid claim

Access-token, user-info, and CSRF cookies should use the access token expires_in value as Max-Age. Refresh-token cookie Max-Age should use sessionTimeout unless the token response includes a remember value other than N, in which case it should use rememberMeTimeout.

Java only clears cookies that were present on the request. Rust should improve logout by always emitting deletion cookies for the known cookie names, using the configured domain/path/secure attributes. This avoids stale browser cookies when a cookie is omitted from a particular request.
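
A minimal sketch of the two cookie rules above: choosing the refresh-token Max-Age and emitting deletion cookies for every known name on logout. The function names are hypothetical, and the exact attribute order and the omission of SameSite in the deletion cookies are simplifications for this sketch.

```rust
/// (cookie name, HttpOnly) pairs from the Java-compatible cookie table.
const BFF_COOKIES: [(&str, bool); 9] = [
    ("accessToken", true),
    ("refreshToken", true),
    ("csrf", false),
    ("userId", false),
    ("userType", false),
    ("roles", false),
    ("host", false),
    ("email", false),
    ("eid", false),
];

/// Uses rememberMeTimeout only when the token response carries a remember value other than "N".
pub fn refresh_cookie_max_age(
    remember: Option<&str>,
    session_timeout: u64,
    remember_me_timeout: u64,
) -> u64 {
    match remember {
        Some(value) if value != "N" => remember_me_timeout,
        _ => session_timeout,
    }
}

/// Builds one expired Set-Cookie value per known cookie, whether or not the request sent it.
pub fn deletion_cookies(domain: &str, path: &str, secure: bool) -> Vec<String> {
    BFF_COOKIES
        .iter()
        .map(|(name, http_only)| {
            let mut cookie = format!("{name}=; Max-Age=0; Domain={domain}; Path={path}");
            if secure {
                cookie.push_str("; Secure");
            }
            if *http_only {
                cookie.push_str("; HttpOnly");
            }
            cookie
        })
        .collect()
}
```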

Default SameSite should remain None for Java parity. Add a Rust-only optional cookieSameSite field with default None so deployments can choose Lax or Strict when the SPA and BFF are same-site.

Config Server Model

The config-server should continue to resolve placeholders before startup:

statelessAuth.redirectUri: https://localhost:3000/#/app/dashboard
statelessAuth.cookieDomain: localhost
statelessAuth.cookieSecure: true
client.tokenAcClientId: ...
client.tokenAcClientSecret: ...
client.tokenRtClientId: ...
client.tokenRtClientSecret: ...

The Rust gateway should only consume the resolved statelessAuth.yml, client.yml, security.yml, and handler.yml files. It should not need to know whether the values came from product defaults, environment variables, or light-portal overrides.

Implemented Surface

  • Shared SPA cookie/session runtime, including cookie parser/writer, CSRF extraction, JWT claim extraction, and Java-compatible cookie names.
  • OAuth token client support for authorization-code, refresh-token, and token-exchange grant requests using client.yml.
  • Refresh-token renewal with a bounded completed-result cache.
  • statelessAuth.yml loader, module registry registration, active-handler gating, and runtime reload.
  • stateless, google, facebook, and github request handling in light-gateway.
  • Structured session-expired response using cookieTimeoutUri.
  • Unit/runtime-load coverage for config parsing, cookie attributes, provider subject-token selection, active-handler loading, and gateway wiring.

MSAL Exchange Handler

Status

Initial Rust implementation is complete in light-pingora and light-gateway. It includes config loading, named security-msal.yml validation support, token-exchange handling, shared SPA session/cookie/CSRF logic, logout, refresh-token renewal, handler wiring, config stubs, and runtime-load tests.

Purpose

The Java light-spa-4j msal-exchange module is the on-prem BFF login bridge for SPA deployments that use Microsoft Authentication Library SSO. The browser uses MSAL.js to obtain a Microsoft token, sends that token to the gateway, and the gateway exchanges it for an internal light-oauth token set. After exchange, the browser session behaves the same as the stateless authorization-code handler: internal tokens are stored in cookies, CSRF is validated on subsequent requests, refresh tokens keep the session alive, and the gateway injects Authorization: Bearer <internal-token> before routing downstream.

In light-fabric this should be a light-pingora security handler in light-gateway. It should share most of its implementation with stateless-auth.md; only the initial login exchange differs.

Goals

  • Preserve the Java MSAL token-exchange flow.
  • Keep msal-exchange.yml field names recognizable for light-portal and config-server product configuration.
  • Validate the incoming Microsoft token with a separate security-msal.yml runtime before token exchange.
  • Exchange the Microsoft token with light-oauth using client.yml oauth.token.token_exchange.
  • Store the returned internal token set in the same Java-compatible cookies as the stateless handler.
  • Share CSRF validation, cookie writing, logout, refresh-token renewal, and downstream Authorization injection with the stateless handler.
  • Add a stable msal-exchange handler id to light-gateway.
  • Register loaded config in ModuleRegistry and fail startup on invalid active configuration.

Non-Goals

  • Do not forward the Microsoft token to downstream services after exchange.
  • Do not implement a server-side browser session store.
  • Do not merge MSAL token validation into the normal downstream security handler. MSAL validation applies only to the exchange endpoint.
  • Do not invent a REST-specific tokenization or portal-service client in this handler. The only outbound call is the OAuth token-exchange request.
  • Do not require a separate BFF binary.

Resolved Decisions

  • Support subjectTokenType in both client.yml and msal-exchange.yml. The handler-specific value takes precedence when set, and client.yml remains the shared OAuth token-exchange default.
  • Support strict Microsoft token validation in security-msal.yml when a deployment needs issuer and audience checks.
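The subjectTokenType precedence can be sketched as below. The function name is hypothetical, and treating a blank string as "not set" is an assumption about how resolved placeholder defaults should behave.

```rust
/// Pick the subject token type: the msal-exchange.yml value wins when set,
/// otherwise fall back to the shared client.yml token_exchange default.
/// Assumption: blank or whitespace-only strings count as "not set".
pub fn resolve_subject_token_type<'a>(
    handler_value: Option<&'a str>,
    client_value: Option<&'a str>,
) -> Option<&'a str> {
    let non_blank = |v: Option<&'a str>| v.map(str::trim).filter(|s| !s.is_empty());
    non_blank(handler_value).or_else(|| non_blank(client_value))
}
```

A None result would mean neither file configures a type, and the caller would have to decide whether that is a startup error.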

Java Behavior To Map

Java config file:

enabled: ${msal-exchange.enabled:true}
exchangePath: ${msal-exchange.exchangePath:/auth/ms/exchange}
logoutPath: ${msal-exchange.logoutPath:/auth/ms/logout}
cookieDomain: ${msal-exchange.cookieDomain:localhost}
cookiePath: ${msal-exchange.cookiePath:/}
cookieSecure: ${msal-exchange.cookieSecure:false}
sessionTimeout: ${msal-exchange.sessionTimeout:3600}
rememberMeTimeout: ${msal-exchange.rememberMeTimeout:604800}

Java also loads a separate security config named security-msal:

SecurityConfig.load("security-msal")

This config verifies the incoming Microsoft token. The normal security.yml runtime verifies/parses internal light-oauth access tokens used in cookies.

Java request behavior:

  • exchangePath, normally /auth/ms/exchange, requires Authorization: Bearer <microsoft-token>.
  • Missing bearer token returns ERR11000.
  • The handler verifies the Microsoft token with security-msal.yml.
  • Verification failure returns ERR10000.
  • The handler generates a CSRF value and sends an OAuth token-exchange request with the Microsoft token as subject_token.
  • Token-exchange failure returns ERR11001.
  • On success, the handler sets the same BFF cookies as the stateless handler and returns JSON containing scopes.
  • logoutPath, normally /auth/ms/logout, clears BFF cookies and ends the request.
  • Subsequent requests use the same cookie, CSRF, refresh, and downstream Authorization injection flow as the stateless handler.

Java error codes to preserve:

Code     | Meaning
ERR11000 | Microsoft bearer token is missing
ERR11001 | Internal token exchange failed
ERR10000 | Incoming Microsoft token or returned internal token is invalid
ERR10036 | CSRF token is missing from request
ERR10038 | CSRF claim is missing from JWT
ERR10039 | Request CSRF and JWT CSRF do not match
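One way to keep these codes stable in Rust is a small enum plus a bearer-token extractor that maps every missing or malformed header to ERR11000. SpaAuthError and extract_bearer are hypothetical names for this sketch, and treating a non-Bearer scheme as "missing" is an assumption.

```rust
/// Error codes preserved from the Java handler (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SpaAuthError {
    MissingBearer,
    ExchangeFailed,
    InvalidToken,
    CsrfMissing,
    CsrfClaimMissing,
    CsrfMismatch,
}

impl SpaAuthError {
    /// The Java-compatible status code for each error.
    pub fn code(self) -> &'static str {
        match self {
            SpaAuthError::MissingBearer => "ERR11000",
            SpaAuthError::ExchangeFailed => "ERR11001",
            SpaAuthError::InvalidToken => "ERR10000",
            SpaAuthError::CsrfMissing => "ERR10036",
            SpaAuthError::CsrfClaimMissing => "ERR10038",
            SpaAuthError::CsrfMismatch => "ERR10039",
        }
    }
}

/// Extract the token from an `Authorization: Bearer <token>` header value.
/// Assumption: a different scheme or an empty token is reported as missing.
pub fn extract_bearer(authorization: Option<&str>) -> Result<&str, SpaAuthError> {
    let value = authorization.ok_or(SpaAuthError::MissingBearer)?;
    let (scheme, token) = value
        .trim()
        .split_once(' ')
        .ok_or(SpaAuthError::MissingBearer)?;
    let token = token.trim();
    if scheme.eq_ignore_ascii_case("Bearer") && !token.is_empty() {
        Ok(token)
    } else {
        Err(SpaAuthError::MissingBearer)
    }
}
```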

Rust Architecture

Use the shared SPA auth runtime described in stateless-auth.md.

Proposed modules:

frameworks/light-pingora/src/spa_auth.rs
frameworks/light-pingora/src/msal_exchange.rs

msal_exchange.rs owns only the Microsoft-token exchange entrypoint:

pub struct MsalExchangeConfig {
    pub enabled: bool,
    pub exchange_path: String,
    pub logout_path: String,
    pub cookie_domain: String,
    pub cookie_path: String,
    pub cookie_secure: bool,
    pub session_timeout: u64,
    pub remember_me_timeout: u64,
    pub renew_before_seconds: u64,
    pub subject_token_type: String,
}

pub struct MsalExchangeRuntime {
    pub config: MsalExchangeConfig,
    pub session: SpaSessionRuntime,
    pub msal_security: SecurityRuntime,
}

Use msal-exchange.yml as the primary file name and accept msal-exchange.yaml as a compatibility fallback.

The SecurityRuntime loader should be generalized so the MSAL handler can load a named security config:

load_security_runtime_from_file(
    runtime_config,
    "security-msal.yml",
    "light-pingora/security-msal",
    "security-msal",
    active,
)

That keeps normal downstream JWT behavior on security.yml while the exchange endpoint validates Microsoft tokens against security-msal.yml.

Handler Registration

Add msal-exchange to apps/light-gateway handler descriptors as a security handler:

("msal-exchange", PingoraHandlerKind::Security)

The primary handler id should be msal-exchange. No @alias syntax is needed. An additional short alias such as msal can be added later only if a real product config needs it.

Runtime loading should follow the existing active-handler model:

let msal_exchange = load_msal_exchange_runtime(
    config,
    active_handlers.is_handler_active("msal-exchange"),
)?;

If the handler is not active in handler.yml, no MSAL config is required. If the handler is active and its config is invalid, startup should fail. If enabled: false, register the disabled module and return None.
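The three loading outcomes can be sketched with a tiny stand-in for the real loader. MsalConfigSketch and load_msal_sketch are hypothetical, and the closure stands in for reading and validating msal-exchange.yml.

```rust
/// Reduced stand-in for MsalExchangeConfig (hypothetical, for this sketch only).
#[derive(Debug, Clone, PartialEq)]
pub struct MsalConfigSketch {
    pub enabled: bool,
    pub exchange_path: String,
}

/// Decide whether an MSAL runtime should exist.
/// `load` stands in for parsing and validating msal-exchange.yml.
pub fn load_msal_sketch(
    handler_active: bool,
    load: impl FnOnce() -> Result<MsalConfigSketch, String>,
) -> Result<Option<MsalConfigSketch>, String> {
    if !handler_active {
        // Handler not listed in handler.yml: the config file is never read.
        return Ok(None);
    }
    // Active handler with invalid or missing config: propagate and fail startup.
    let config = load()?;
    if !config.enabled {
        // A real loader would register the disabled module here.
        return Ok(None);
    }
    Ok(Some(config))
}
```

Passing the loader as a closure keeps the "inactive handlers need no config" rule explicit: the file is only touched on the active path.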

Example chain:

handlers:
  - exception
  - cors
  - msal-exchange
  - header
  - prefix
  - token
  - router

chains:
  bff:
    - exception
    - cors
    - msal-exchange
    - header
    - prefix
    - token
    - router
  websocket:
    - exception
    - msal-exchange
    - security
    - websocket

paths:
  - path: /auth/ms/exchange
    method: POST
    exec:
      - bff
  - path: /auth/ms/logout
    method: GET
    exec:
      - bff

Exchange Flow

For exchangePath:

POST /auth/ms/exchange
Authorization: Bearer <microsoft-token>

  -> extract bearer token
  -> verify Microsoft token with security-msal.yml
  -> generate csrf
  -> call light-oauth token endpoint with token-exchange grant
  -> verify/parse returned internal access token
  -> set BFF cookies
  -> return { "scopes": [...] }

The token-exchange request should use client.yml oauth.token.token_exchange:

  • oauth.token.server_url or oauth.token.serviceId
  • oauth.token.enableHttp2
  • oauth.token.token_exchange.uri
  • oauth.token.token_exchange.client_id
  • oauth.token.token_exchange.client_secret
  • oauth.token.token_exchange.scope
  • oauth.token.token_exchange.subjectTokenType as the default subject token type when the handler config does not override it

The form body should match Java and the http-client composer:

grant_type=urn:ietf:params:oauth:grant-type:token-exchange
subject_token=<microsoft-token>
subject_token_type=urn:ietf:params:oauth:token-type:jwt
csrf=<generated csrf>
requested_token_type=<optional requested token type>
audience=<optional audience>
scope=<space separated scopes, if configured>

The handler should set Authorization: Basic <base64(client_id:client_secret)> on the outbound token-exchange request.
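A std-only sketch of the outbound request pieces follows. All type and function names are hypothetical, the values are assumed to be form-urlencoded (the Java composer's exact encoding was not checked), and the base64 encoder is hand-rolled only to keep the example free of dependencies. A real implementation would more likely use an existing encoding crate.

```rust
/// Percent-encode a value for application/x-www-form-urlencoded (space becomes '+').
fn form_encode(input: &str) -> String {
    let mut out = String::new();
    for b in input.bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'*' => {
                out.push(b as char)
            }
            b' ' => out.push('+'),
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}

/// Inputs for the token-exchange form body (hypothetical struct).
pub struct TokenExchangeRequest<'a> {
    pub subject_token: &'a str,
    pub subject_token_type: &'a str,
    pub csrf: &'a str,
    pub requested_token_type: Option<&'a str>,
    pub audience: Option<&'a str>,
    pub scope: &'a [&'a str],
}

/// Build the form body in the field order listed above; optional fields are
/// omitted when not configured.
pub fn token_exchange_body(req: &TokenExchangeRequest<'_>) -> String {
    let scope = req.scope.join(" ");
    let mut pairs: Vec<(&str, &str)> = vec![
        ("grant_type", "urn:ietf:params:oauth:grant-type:token-exchange"),
        ("subject_token", req.subject_token),
        ("subject_token_type", req.subject_token_type),
        ("csrf", req.csrf),
    ];
    if let Some(v) = req.requested_token_type {
        pairs.push(("requested_token_type", v));
    }
    if let Some(v) = req.audience {
        pairs.push(("audience", v));
    }
    if !scope.is_empty() {
        pairs.push(("scope", scope.as_str()));
    }
    pairs
        .iter()
        .map(|(k, v)| format!("{}={}", k, form_encode(v)))
        .collect::<Vec<_>>()
        .join("&")
}

/// Standard base64 with padding.
fn base64_encode(input: &[u8]) -> String {
    const ALPHABET: &[u8; 64] =
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    let mut out = String::new();
    for chunk in input.chunks(3) {
        let b1 = *chunk.get(1).unwrap_or(&0);
        let b2 = *chunk.get(2).unwrap_or(&0);
        let n = (u32::from(chunk[0]) << 16) | (u32::from(b1) << 8) | u32::from(b2);
        out.push(ALPHABET[((n >> 18) & 63) as usize] as char);
        out.push(ALPHABET[((n >> 12) & 63) as usize] as char);
        out.push(if chunk.len() > 1 { ALPHABET[((n >> 6) & 63) as usize] as char } else { '=' });
        out.push(if chunk.len() > 2 { ALPHABET[(n & 63) as usize] as char } else { '=' });
    }
    out
}

/// Authorization header value for HTTP Basic client authentication.
pub fn basic_auth_header(client_id: &str, client_secret: &str) -> String {
    let credentials = format!("{}:{}", client_id, client_secret);
    format!("Basic {}", base64_encode(credentials.as_bytes()))
}
```

For example, basic_auth_header("client", "secret") produces Basic Y2xpZW50OnNlY3JldA==.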

Session Validation Flow

After exchange, MSAL and stateless auth must use the same downstream request flow:

request
  -> read accessToken cookie
  -> verify/parse internal JWT with security.yml
  -> validate CSRF from request against JWT csrf claim
  -> refresh internal token when it is inside the renew window
  -> inject Authorization: Bearer <internal-access-token>
  -> continue handler chain
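The "inside the renew window" step can be read as a comparison between the token's remaining lifetime and renew_before_seconds. The helper below is a hypothetical sketch. It assumes epoch seconds for both timestamps and treats an already expired token as inside the window.

```rust
/// True when the internal access token should be renewed before routing.
/// `exp_epoch_secs` is the JWT exp claim, `renew_before_seconds` the handler config value.
/// Assumption: an expired token (exp <= now) also counts as inside the window.
pub fn needs_renewal(now_epoch_secs: u64, exp_epoch_secs: u64, renew_before_seconds: u64) -> bool {
    exp_epoch_secs.saturating_sub(now_epoch_secs) <= renew_before_seconds
}
```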

CSRF source order should be identical to the stateless handler:

  1. X-CSRF-TOKEN header.
  2. Sec-WebSocket-Protocol value starting with the csrf. prefix, when the request has both Sec-WebSocket-Key and Sec-WebSocket-Version.
  3. Query parameter csrf.
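The source order above might look like this as a pure function over header pairs and the raw query string. extract_csrf is a hypothetical name, and two details are assumptions: the token is taken as the text after the csrf. prefix, and the query value is used without URL decoding.

```rust
/// Case-insensitive header lookup over (name, value) pairs.
fn header<'a>(headers: &[(&str, &'a str)], name: &str) -> Option<&'a str> {
    headers
        .iter()
        .find(|&&(k, _)| k.eq_ignore_ascii_case(name))
        .map(|&(_, v)| v.trim())
}

/// Resolve the request CSRF value in the documented order:
/// header, then WebSocket subprotocol, then query parameter.
pub fn extract_csrf(headers: &[(&str, &str)], query: Option<&str>) -> Option<String> {
    // 1. X-CSRF-TOKEN header.
    if let Some(v) = header(headers, "X-CSRF-TOKEN").filter(|v| !v.is_empty()) {
        return Some(v.to_string());
    }
    // 2. Sec-WebSocket-Protocol entry starting with "csrf.", only on WebSocket handshakes.
    let is_websocket = header(headers, "Sec-WebSocket-Key").is_some()
        && header(headers, "Sec-WebSocket-Version").is_some();
    if is_websocket {
        if let Some(protocols) = header(headers, "Sec-WebSocket-Protocol") {
            let found = protocols
                .split(',')
                .map(str::trim)
                .find_map(|p| p.strip_prefix("csrf."));
            if let Some(token) = found.filter(|t| !t.is_empty()) {
                return Some(token.to_string());
            }
        }
    }
    // 3. Query parameter csrf (assumption: value is used as-is, not URL-decoded).
    query?
        .split('&')
        .find_map(|pair| pair.strip_prefix("csrf="))
        .filter(|v| !v.is_empty())
        .map(str::to_string)
}
```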

The MSAL handler must never inject the Microsoft token downstream. The only downstream bearer token after login is the internal light-oauth token.

Internal JWT Verification

MSAL exchange should use the same lower-level token verifier as stateless auth for internal cookie tokens. It should not use the request-oriented verify_jwt_request wrapper because the token source is a cookie, not an Authorization header.

The shared verifier should validate signature and key material from security.yml, parse claims for CSRF and user cookies, and support an expiry-mode option so the refresh path can inspect tokens close to expiry without treating that as a downstream API authentication success.

Cookies

MSAL exchange should use the same cookie contract as stateless auth:

Cookie       | HttpOnly | Source
accessToken  | true     | Internal OAuth access token
refreshToken | true     | Internal OAuth refresh token
csrf         | false    | Generated CSRF value
userId       | false    | JWT uid claim
userType     | false    | JWT userType claim
roles        | false    | Base64-encoded JWT role claim, default user
host         | false    | JWT host claim
email        | false    | JWT eml claim
eid          | false    | JWT eid claim

For Java parity, keep cookieSecure defaulting to false in msal-exchange.yml, but production config should set it to true when the BFF is served over HTTPS.

Rust should share the logout improvement from stateless auth: always emit deletion cookies for known cookie names rather than only clearing cookies that were present on the request.

Security Config

security-msal.yml should be treated as an active handler dependency when msal-exchange is active. Missing or invalid config should fail startup because the gateway would otherwise accept an exchange endpoint without a working Microsoft-token verifier.

Recommended distinction:

  • security-msal.yml: verifies the incoming Microsoft token on exchangePath.
  • security.yml: verifies/parses internal light-oauth tokens in BFF cookies and is also used by normal API security handlers.

The Java code skips audience verification for MSAL in the current call path. Rust should preserve that behavior unless security-msal.yml explicitly configures audience validation. That keeps on-prem deployments working when the Microsoft token audience is the SPA client id rather than the BFF.

When a product requires stricter validation, security-msal.yml should be able to require issuer and audience checks for the incoming Microsoft token. The initial implementation can add these checks to the named SecurityRuntime loader as optional fields:

issuer: ${security-msal.issuer:}
audience: ${security-msal.audience:}

Blank values preserve the Java-compatible relaxed behavior. Non-blank values must be enforced during exchange-path token verification, and invalid issuer/audience should return the same invalid-token error path as other Microsoft token verification failures.
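The blank-versus-enforced rule can be sketched as a claim check that runs after signature verification. MsalClaimPolicy and check_claims are hypothetical names, and the sketch assumes the aud claim has already been normalized to a list.

```rust
/// Optional issuer/audience policy from security-msal.yml (hypothetical struct).
/// Blank strings mean "do not check", which preserves the relaxed Java behavior.
pub struct MsalClaimPolicy {
    pub issuer: String,
    pub audience: String,
}

/// Check the iss and aud claims of an already signature-verified Microsoft token.
/// Failures map to the same invalid-token code as other verification errors.
pub fn check_claims(
    policy: &MsalClaimPolicy,
    iss: Option<&str>,
    aud: &[&str],
) -> Result<(), &'static str> {
    let issuer = policy.issuer.trim();
    if !issuer.is_empty() && iss != Some(issuer) {
        return Err("ERR10000");
    }
    let audience = policy.audience.trim();
    if !audience.is_empty() && !aud.contains(&audience) {
        return Err("ERR10000");
    }
    Ok(())
}
```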

Config Server Model

Light-portal should manage the product config values and config-server should deliver resolved files:

msal-exchange.exchangePath: /auth/ms/exchange
msal-exchange.logoutPath: /auth/ms/logout
msal-exchange.cookieDomain: localhost
msal-exchange.cookieSecure: true
msal-exchange.subjectTokenType: urn:ietf:params:oauth:token-type:jwt
client.tokenExClientId: ...
client.tokenExClientSecret: ...
client.subjectTokenType: urn:ietf:params:oauth:token-type:jwt
security-msal.issuer: https://login.microsoftonline.com/{tenant-id}/v2.0
security-msal.audience: <spa-client-id>

The gateway consumes only the resolved files:

  • handler.yml
  • msal-exchange.yml
  • security-msal.yml
  • security.yml
  • client.yml

Implemented Surface

  • Shared SPA auth runtime from stateless-auth.md.
  • Named SecurityRuntime loading for security-msal.yml.
  • Token-exchange support in the shared OAuth token client.
  • msal-exchange.yml parsing, module registry registration, active-handler gating, and runtime reload.
  • msal-exchange request handling in light-gateway.
  • Required bearer-token extraction, Microsoft token validation, token-exchange request, Java-compatible cookie writing, logout, refresh renewal, and downstream internal Authorization injection.
  • Optional issuer/audience validation through security-msal.yml.
  • Unit/runtime-load coverage for subject-token-type precedence and gateway wiring.

PII Tokenization

Status

Proposed design for migrating the light-tokenization capability into light-fabric as light-pingora handlers used by light-gateway.

Purpose

PII tokenization protects sensitive employee/customer data when a request is sent from inside the organization to an external cloud service through the gateway. The outbound request replaces configured PII fields with generated tokens. When the cloud response returns, the gateway replaces those tokens with the original cleartext values so internal employees can complete their work.

This is a request/response hot-path concern. The first Rust implementation should therefore run inside light-gateway and access PostgreSQL directly instead of making a network call to a tokenization service for every field.

Current Java Behavior

The current light-tokenization service exposes REST endpoints:

  • POST /v1/token: body { "schemeId": <int>, "value": "<cleartext>" }; returns a token string. If the value already exists, it returns the existing token.
  • GET /v1/token/{token}: returns the cleartext value.
  • DELETE /v1/token/{token}: deletes the token mapping.
  • GET /v1/scheme and GET /v1/scheme/{id}: return token format schemes.

Startup loads multiple JDBC pools from datasource.yml. One database is named tokenization; the others are vault databases such as vault000. The tokenization database maps client_id to a vault database through client_database. Each vault database has a token_vault table.

Java tokenization flow:

  1. Read client_id from the JWT audit info.
  2. Resolve client_id -> db_name.
  3. Select a vault datasource by db_name.
  4. For tokenization, look up by cleartext value; return existing id if found.
  5. If not found, generate a token with the configured schemeId, insert (id, value), cache token -> value, and return the token.
  6. For detokenization, check the cache first, then query by token id.

The current Java MCP router also uses tokenization through token-client. Tool input schemas can mark fields with x-tokenize; the router extracts JsonPath rules from the schema and calls the tokenization service.

Design Direction

Use direct PostgreSQL access for the initial light-fabric implementation.

Reasons:

  • It removes one HTTP hop per tokenized field in the gateway hot path.
  • It avoids running and scaling another service only to perform local database lookups.
  • PostgreSQL connection pooling is already used in nearby light-fabric apps with sqlx.
  • The same database will also support other gateway handlers that need local data access, such as vector search for MCP routing.
  • Multi-tenancy is cleaner with host_id in the schema than with one vault database per tenant.

If this capability is later exposed as a standalone service, prefer gRPC over MCP for the hot-path service API. gRPC gives a strongly typed protobuf contract, HTTP/2 multiplexing, compact binary payloads, deadlines, and well-understood client pooling. MCP is useful when tokenization is exposed as an agent tool or administrative capability, but it adds JSON-RPC/tooling semantics that are not needed for a low-latency service-to-service data-plane call.

Goals

  • Implement TokenizeHandler and DetokenizeHandler in light-pingora.
  • Activate handlers only through handler.yml.
  • Use one PostgreSQL database with host_id tenant isolation.
  • Integrate schema into portal-db/postgres/ddl.sql and future patch files.
  • Preserve the Java token schemes and stable tokenization behavior.
  • Avoid storing/indexing cleartext PII directly in PostgreSQL.
  • Support request-body tokenization before proxy/router sends to the external service.
  • Support response-body detokenization before the gateway returns to the internal caller.
  • Reuse the same runtime for MCP tool argument tokenization.

Non-Goals

  • Do not preserve multiple vault databases.
  • Do not preserve MySQL or SQLite runtime support in light-fabric.
  • Do not make tokenization an MCP-only service.
  • Do not require a separate tokenization service for the first implementation.
  • Do not try to tokenize arbitrary binary payloads in the first pass.

Handler Model

Use two public handler ids:

  • tokenize: request-phase handler that replaces cleartext fields with tokens.
  • detokenize: response-phase handler that replaces configured token fields with cleartext.

Both handlers share one runtime:

frameworks/light-pingora/src/pii_tokenization.rs

Primary types:

pub struct PiiTokenizationConfig {
    pub database: PiiDatabaseConfig,
    pub host_id_claim: String,
    pub max_body_size: usize,
    pub cache: PiiTokenCacheConfig,
    pub crypto: PiiTokenCryptoConfig,
    pub rules: Vec<PiiTokenizationRule>,
}

pub struct PiiTokenizationRuntime {
    pub config: Arc<PiiTokenizationConfig>,
    pub pool: PgPool,
    pub tokenizers: TokenizerRegistry,
    pub value_cache: TokenCache,
    pub token_cache: TokenCache,
    pub keyring: PiiKeyring,
}

pub struct PiiTokenizationRule {
    pub path_prefix: String,
    pub methods: Vec<String>,
    pub request: Vec<PiiFieldRule>,
    pub response: Vec<PiiFieldRule>,
}

pub struct PiiFieldRule {
    pub path: String,
    pub scheme: String,
    pub required: bool,
}

The handler should fail startup if an active config references an unknown scheme, has invalid field paths, cannot initialize the keyring, or cannot connect to PostgreSQL within the configured startup timeout.

Resolved Decisions

  • Handler ids are tokenize and detokenize to align with other light-fabric handler names.
  • Encrypt stored cleartext with AES-256-GCM. Resolve key material from environment variables first, with direct config values allowed only as a local-development fallback.
  • Detokenization fails closed by default when a configured token field cannot be resolved.
  • Field selection uses a constrained compiled JsonPath subset rather than full dynamic JsonPath evaluation.
  • Cleartext reverse caching is configurable through cache.cacheCleartext.
  • Request/response mutation buffers are bounded by configurable maxBodySize.
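The "environment first, config only as a local-development fallback" decision could be implemented along these lines. resolve_key is a hypothetical helper. The environment lookup is injected as a closure so the precedence can be tested without touching the process environment.

```rust
/// Resolve key material: the named environment variable wins, the direct
/// config value is only a fallback, and having neither is a startup error.
pub fn resolve_key(
    env_name: &str,
    config_fallback: &str,
    env: impl Fn(&str) -> Option<String>,
) -> Result<String, String> {
    if let Some(value) = env(env_name).filter(|v| !v.trim().is_empty()) {
        return Ok(value);
    }
    if !config_fallback.trim().is_empty() {
        return Ok(config_fallback.to_string());
    }
    // Never echo key material in errors; only name the variable.
    Err(format!("no key material: set {} or the config fallback", env_name))
}
```

In production wiring the closure would typically be `|name| std::env::var(name).ok()`.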

Handler Chain

For a BFF or gateway that calls an external cloud service:

handlers:
  - correlation
  - security
  - tokenize
  - router
  - detokenize

chains:
  external-cloud:
    - correlation
    - security
    - tokenize
    - router
    - detokenize

paths:
  - path: /claims
    method: POST
    exec:
      - external-cloud

tokenize must run after authentication so it can resolve host_id from the verified JWT principal. It must run before router or proxy so the external service never receives cleartext PII. detokenize must run after the upstream response body is available and before response delivery.

This likely requires extending the existing gateway handler model with request-body and response-body filter phases:

pub trait PingoraBodyHandler {
    async fn request_body_filter(&self, ctx: &mut GatewayRequestContext, body: Bytes)
        -> Result<Bytes, HandlerRejection>;

    async fn response_body_filter(&self, ctx: &mut GatewayRequestContext, body: Bytes)
        -> Result<Bytes, HandlerRejection>;
}

The first implementation can wire this directly in light-gateway; later it can be generalized for other body-mutating handlers.

Configuration

Primary file: pii-tokenization.yml.

enabled is not needed. If neither tokenize nor detokenize appears in handler.yml, this config is not loaded. If either handler is active, the config is required and invalid config fails startup.

Example:

database:
  url: ${pii-tokenization.database.url:${database.url:}}
  maxConnections: ${pii-tokenization.database.maxConnections:8}
  minConnections: ${pii-tokenization.database.minConnections:1}
  connectTimeoutMs: ${pii-tokenization.database.connectTimeoutMs:2000}

hostIdClaim: ${pii-tokenization.hostIdClaim:host_id}
maxBodySize: ${pii-tokenization.maxBodySize:1048576}

crypto:
  algorithm: ${pii-tokenization.crypto.algorithm:AES-256-GCM}
  keyId: ${pii-tokenization.crypto.keyId:default}
  valueEncryptionKeyEnv: ${pii-tokenization.crypto.valueEncryptionKeyEnv:PII_TOKENIZATION_VALUE_ENCRYPTION_KEY}
  valueHashKeyEnv: ${pii-tokenization.crypto.valueHashKeyEnv:PII_TOKENIZATION_VALUE_HASH_KEY}
  valueEncryptionKey: ${pii-tokenization.crypto.valueEncryptionKey:}
  valueHashKey: ${pii-tokenization.crypto.valueHashKey:}

cache:
  enabled: ${pii-tokenization.cache.enabled:true}
  maxEntries: ${pii-tokenization.cache.maxEntries:10000}
  ttlSeconds: ${pii-tokenization.cache.ttlSeconds:86400}
  cacheCleartext: ${pii-tokenization.cache.cacheCleartext:true}

rules:
  - pathPrefix: /claims
    methods: [POST]
    request:
      - path: $.claimant.ssn
        scheme: LN
        required: false
      - path: $.payment.cardNumber
        scheme: CC4
        required: false
    response:
      - path: $.claimant.ssn
        scheme: LN
        required: false
      - path: $.payment.cardNumber
        scheme: CC4
        required: false

Field paths should support the Java-compatible JsonPath subset used by mcp-router tokenization rules: object fields and [*] arrays. For performance and predictable mutation, the Rust implementation should compile rules at startup and avoid dynamic path parsing on every request.
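Compiling a rule path once at startup might look like the following. PathSegment and compile_path are hypothetical, and the accepted grammar is an assumption: paths start with `$.`, segments are object field names, and `[*]` may follow a field. A root-level array (`$[*]`) is deliberately not handled in this sketch.

```rust
/// One step of a compiled field path.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum PathSegment {
    /// Descend into the named object field.
    Field(String),
    /// Visit every element of an array.
    AllElements,
}

/// Compile a constrained JsonPath such as `$.items[*].ssn` into segments.
/// Invalid paths return an error so an active config can fail startup.
pub fn compile_path(path: &str) -> Result<Vec<PathSegment>, String> {
    let rest = path
        .strip_prefix("$.")
        .ok_or_else(|| format!("path must start with '$.': {}", path))?;
    let mut segments = Vec::new();
    for part in rest.split('.') {
        // Peel trailing [*] markers off the field name.
        let mut name = part;
        let mut wildcards = 0usize;
        while let Some(stripped) = name.strip_suffix("[*]") {
            name = stripped;
            wildcards += 1;
        }
        let valid = !name.is_empty()
            && name.chars().all(|c| c.is_alphanumeric() || c == '_' || c == '-');
        if !valid {
            return Err(format!("unsupported segment '{}' in path {}", part, path));
        }
        segments.push(PathSegment::Field(name.to_string()));
        segments.extend(std::iter::repeat(PathSegment::AllElements).take(wildcards));
    }
    Ok(segments)
}
```

The per-request work then reduces to walking a pre-validated segment list, with no string parsing on the hot path.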

For MCP tools, keep supporting x-tokenize in input schemas. The MCP router can convert schema annotations into the same compiled field rules and call the shared PiiTokenizationRuntime directly.

PostgreSQL Schema

Replace the old split between tokenization and vault databases with tenant-scoped tables in portal-db.

Recommended DDL:

CREATE TABLE pii_token_scheme_t (
    scheme_id        SMALLINT PRIMARY KEY,
    scheme_code      VARCHAR(16) NOT NULL UNIQUE,
    description      TEXT NOT NULL,
    active           BOOLEAN DEFAULT TRUE NOT NULL,
    update_ts        TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
    update_user      VARCHAR(126) DEFAULT SESSION_USER NOT NULL
);

CREATE TABLE pii_token_vault_t (
    host_id           UUID NOT NULL,
    token             TEXT NOT NULL,
    scheme_id         SMALLINT NOT NULL,
    value_hash        BYTEA NOT NULL,
    value_ciphertext  BYTEA NOT NULL,
    value_nonce       BYTEA NOT NULL,
    key_id            VARCHAR(128) NOT NULL,
    created_ts        TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
    expires_ts        TIMESTAMP WITH TIME ZONE,
    active            BOOLEAN DEFAULT TRUE NOT NULL,
    update_ts         TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
    update_user       VARCHAR(126) DEFAULT SESSION_USER NOT NULL,
    PRIMARY KEY(host_id, token),
    FOREIGN KEY(scheme_id) REFERENCES pii_token_scheme_t(scheme_id)
);

CREATE UNIQUE INDEX pii_token_vault_value_uk
ON pii_token_vault_t(host_id, scheme_id, value_hash)
WHERE active = TRUE;

CREATE INDEX pii_token_vault_expiry_idx
ON pii_token_vault_t(expires_ts)
WHERE expires_ts IS NOT NULL;

Seed schemes:

Id | Code | Meaning
0  | UUID | UUID v4 token
1  | GUID | URL-safe base64 UUID token
2  | LN   | Luhn compliant numeric token
3  | N    | Random numeric token, length preserving
4  | LN4  | Luhn numeric token retaining last four digits
5  | AN   | Random alpha-numeric token, length preserving
6  | AN4  | Alpha-numeric token retaining last four characters
7  | CC   | Credit-card-shaped Luhn token retaining first digit
8  | CC4  | Credit-card-shaped Luhn token retaining first and last four digits
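Several schemes above require Luhn-compliant output. The sketch below shows only the standard Luhn checksum, which such a generator needs: validation of a full digit string and computation of the check digit for a payload. It is not the Java scheme's generation logic, and both function names are hypothetical.

```rust
/// Luhn sum over digits, processed right to left; `double_first` says whether
/// the rightmost digit is doubled.
fn luhn_sum(digits: &str, double_first: bool) -> Option<u32> {
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let sum = digits
        .bytes()
        .rev()
        .enumerate()
        .map(|(i, b)| {
            let d = u32::from(b - b'0');
            let doubled_position = (i % 2 == 0) == double_first;
            if doubled_position {
                let doubled = d * 2;
                if doubled > 9 { doubled - 9 } else { doubled }
            } else {
                d
            }
        })
        .sum();
    Some(sum)
}

/// True when the full digit string (including check digit) passes Luhn.
pub fn luhn_valid(digits: &str) -> bool {
    // The check digit itself (rightmost) is not doubled.
    matches!(luhn_sum(digits, false), Some(sum) if sum % 10 == 0)
}

/// Check digit that makes `payload` followed by that digit Luhn-valid.
pub fn luhn_check_digit(payload: &str) -> Option<char> {
    // Once the check digit is appended, the payload's rightmost digit is doubled.
    let sum = luhn_sum(payload, true)?;
    char::from_digit((10 - sum % 10) % 10, 10)
}
```

For the common test number 79927398713, the payload 7992739871 yields check digit 3.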

The old database_owner and client_database tables are not needed. Tenant isolation is by host_id, resolved from the authenticated request. If a legacy client only has client_id, handle that with normal portal auth/client metadata rather than recreating tokenization-specific vault routing.

Cleartext Storage

The Java schema stores cleartext PII in token_vault.value and indexes it. The Rust schema should not.

Use:

  • value_hash: deterministic HMAC-SHA-256 of (host_id, scheme_id, canonical_value) with valueHashKey; used for idempotent token lookup.
  • value_ciphertext and value_nonce: the encrypted cleartext value and its nonce, using AES-256-GCM with valueEncryptionKey as chosen under Resolved Decisions.
  • key_id: identifies which key encrypted the row so key rotation is possible.

This keeps tokenization idempotent without indexing cleartext PII.

Tokenization Algorithm

Shared runtime operation:

tokenize(host_id, scheme_id, value)
  -> canonicalize value
  -> compute value_hash
  -> cache lookup by (host_id, scheme_id, value_hash)
  -> SELECT token WHERE host_id, scheme_id, value_hash, active
  -> if found, cache and return
  -> generate scheme-specific token
  -> encrypt cleartext
  -> INSERT row
  -> on token collision, retry generation
  -> on value_hash conflict, SELECT existing token and return it

Use PostgreSQL uniqueness instead of application locks:

INSERT INTO pii_token_vault_t (...)
VALUES (...)
ON CONFLICT DO NOTHING;

If no row is inserted, determine whether the conflict was on (host_id, token) or (host_id, scheme_id, value_hash). Token collision means retry with a new token. Value conflict means another request already inserted the mapping; select and return the existing token.
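The conflict handling can be illustrated with an in-memory model of the two uniqueness constraints. This is not database code: VaultModel is hypothetical, a u32 stands in for the host_id UUID, and the generator closure stands in for the scheme-specific tokenizer.

```rust
use std::collections::HashMap;

/// Result of one insert attempt against the two unique keys.
#[derive(Debug, PartialEq, Eq)]
enum InsertOutcome {
    Inserted,
    TokenCollision,
    ValueExists(String),
}

/// In-memory stand-in for pii_token_vault_t.
#[derive(Default)]
pub struct VaultModel {
    /// Models PRIMARY KEY(host_id, token).
    by_token: HashMap<(u32, String), Vec<u8>>,
    /// Models the unique index on (host_id, scheme_id, value_hash).
    by_value: HashMap<(u32, i16, Vec<u8>), String>,
}

impl VaultModel {
    fn insert(&mut self, host: u32, scheme: i16, hash: &[u8], token: &str) -> InsertOutcome {
        if let Some(existing) = self.by_value.get(&(host, scheme, hash.to_vec())) {
            return InsertOutcome::ValueExists(existing.clone());
        }
        if self.by_token.contains_key(&(host, token.to_string())) {
            return InsertOutcome::TokenCollision;
        }
        self.by_token.insert((host, token.to_string()), hash.to_vec());
        self.by_value.insert((host, scheme, hash.to_vec()), token.to_string());
        InsertOutcome::Inserted
    }

    /// Idempotent tokenize: retry on token collision, return the existing
    /// token when another writer already stored the same value.
    pub fn tokenize(
        &mut self,
        host: u32,
        scheme: i16,
        hash: &[u8],
        mut generate: impl FnMut() -> String,
        max_attempts: usize,
    ) -> Result<String, String> {
        for _ in 0..max_attempts {
            let candidate = generate();
            match self.insert(host, scheme, hash, &candidate) {
                InsertOutcome::Inserted => return Ok(candidate),
                InsertOutcome::ValueExists(token) => return Ok(token),
                InsertOutcome::TokenCollision => continue,
            }
        }
        Err("could not generate a unique token".to_string())
    }
}
```

Bounding the retry loop (max_attempts is an assumption, not in the design) avoids spinning forever when a length-preserving scheme has a very small token space.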

Detokenization:

detokenize(host_id, token)
  -> cache lookup by (host_id, token)
  -> SELECT encrypted value WHERE host_id, token, active
  -> decrypt cleartext
  -> cache and return

If token is not found, the handler fails the response with a handler error. For gateway response-body detokenization, fail closed so employees do not see partial or incorrect data without a signal.

Runtime Caching

Use bounded in-process caches:

  • (host_id, scheme_id, value_hash) -> token
  • (host_id, token) -> cleartext

The cache must be tenant-scoped and bounded by count and TTL. Because the reverse cache contains cleartext PII, make it configurable and register it with the runtime cache registry only with masked summaries. A clear-cache operation should be available through the runtime control plane.

The cache is an optimization only. PostgreSQL remains the source of truth.
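A minimal bounded cache with TTL, keyed by a tenant-scoped tuple, could look like this. BoundedTtlCache is a hypothetical type. It evicts in insertion order (an assumption, the design does not name an eviction policy) and takes the current time as a parameter so expiry is easy to test.

```rust
use std::collections::{HashMap, VecDeque};
use std::hash::Hash;

/// Count- and TTL-bounded cache; use keys such as (host_id, token) so entries
/// never cross tenants.
pub struct BoundedTtlCache<K, V> {
    max_entries: usize,
    ttl_secs: u64,
    entries: HashMap<K, (V, u64)>, // value and expiry time in seconds
    order: VecDeque<K>,            // insertion order, oldest first
}

impl<K: Eq + Hash + Clone, V: Clone> BoundedTtlCache<K, V> {
    pub fn new(max_entries: usize, ttl_secs: u64) -> Self {
        Self { max_entries, ttl_secs, entries: HashMap::new(), order: VecDeque::new() }
    }

    /// Return a fresh entry, dropping it if its TTL has passed.
    pub fn get(&mut self, key: &K, now: u64) -> Option<V> {
        let fresh = self
            .entries
            .get(key)
            .filter(|(_, expires_at)| *expires_at > now)
            .map(|(value, _)| value.clone());
        if fresh.is_none() && self.entries.remove(key).is_some() {
            self.order.retain(|k| k != key);
        }
        fresh
    }

    /// Insert or refresh an entry, evicting the oldest keys when full.
    pub fn put(&mut self, key: K, value: V, now: u64) {
        if self.max_entries == 0 {
            return;
        }
        if !self.entries.contains_key(&key) {
            while self.entries.len() >= self.max_entries {
                match self.order.pop_front() {
                    Some(oldest) => {
                        self.entries.remove(&oldest);
                    }
                    None => break,
                }
            }
            self.order.push_back(key.clone());
        }
        self.entries.insert(key, (value, now.saturating_add(self.ttl_secs)));
    }

    /// Clear-cache operation for the runtime control plane.
    pub fn clear(&mut self) {
        self.entries.clear();
        self.order.clear();
    }
}
```

Setting max_entries to 0 disables storage entirely, which is one way the reverse cleartext cache could be switched off.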

Request And Response Mutation

Only mutate supported structured content:

  • application/json in phase 1.
  • JSON arrays and nested objects through compiled path rules.
  • Missing optional fields are ignored.
  • Missing required fields reject the request or response with a handler error.

For outbound request tokenization:

  1. Buffer the JSON request body within a configured max body size.
  2. Parse to serde_json::Value.
  3. Apply matching request rules.
  4. Replace every string value selected by a matching rule with a token.
  5. Serialize JSON, update Content-Length, and forward upstream.

For inbound response detokenization:

  1. Buffer the JSON response body within a configured max body size.
  2. Parse to serde_json::Value.
  3. Apply matching response rules.
  4. Replace every string token selected by a matching rule with cleartext.
  5. Serialize JSON, update Content-Length, and return downstream.

For very large or streaming payloads, do not attempt mutation. Fail closed by default by rejecting the request or response rather than forwarding it unmodified. Streaming tokenization can be considered later only if a real product requires it.

Security

  • Require a verified JWT principal before tokenization.
  • Resolve host_id from a configured claim, default host_id.
  • Reject active tokenization if host_id is missing.
  • Do not log cleartext values, generated tokens, value hashes, ciphertext, or keys.
  • Mask crypto keys in module registry summaries.
  • Use least-privilege PostgreSQL credentials: only select/insert/update on the tokenization tables.
  • Store cleartext only in encrypted form; do not persist a plaintext value column.
  • Keep tokens scoped by host_id; the same token string in another tenant does not detokenize.

Future Service API

The direct database implementation should be the first production path. However, keep the core API independent from Pingora:

#[async_trait]
pub trait PiiTokenVault: Send + Sync {
    async fn tokenize(&self, host_id: Uuid, scheme_id: i16, value: &str)
        -> Result<String, PiiTokenError>;

    async fn detokenize(&self, host_id: Uuid, token: &str)
        -> Result<String, PiiTokenError>;
}

Then a future service can wrap the same trait.

Protocol recommendation:

  • gRPC for request-path service-to-service tokenization if a standalone service becomes necessary.
  • MCP only as an optional tool surface for agents or administrative workflows.
  • REST/JSON-RPC only for compatibility or operational simplicity, not the preferred low-latency path.

The gRPC API can be very small:

service PiiTokenization {
  rpc Tokenize(TokenizeRequest) returns (TokenizeResponse);
  rpc Detokenize(DetokenizeRequest) returns (DetokenizeResponse);
  rpc BatchTokenize(BatchTokenizeRequest) returns (BatchTokenizeResponse);
  rpc BatchDetokenize(BatchDetokenizeRequest) returns (BatchDetokenizeResponse);
}

Batch operations are important if a future remote service is used; otherwise per-field network calls will dominate latency.

Implementation Phases

  1. Add portal-db DDL and seed data for pii_token_scheme_t and pii_token_vault_t.
  2. Add a light-pingora shared tokenization runtime with sqlx::PgPool, scheme registry, value hashing, encryption, token generation, and tests.
  3. Add pii-tokenization.yml loader, module registry registration, and runtime reload.
  4. Add gateway request-body and response-body filter support.
  5. Implement tokenize and detokenize handler wiring in light-gateway.
  6. Integrate MCP x-tokenize with the same runtime so MCP tools do not call a hardcoded tokenization service.
  7. Add optional gRPC service wrapper only if deployment needs a separate tokenization service.

Remaining Considerations

  • KMS or light-portal managed keys can be added later, but the first implementation should read the configured environment variables before any resolved config fallback.
  • Products that disable cache.cacheCleartext will still use PostgreSQL as the source of truth, with higher detokenization latency.

Token Handler

Status

Proposed design for migrating the Java egress-router TokenHandler into light-fabric as the token handler used by light-pingora and light-gateway.

A baseline Rust token runtime already exists in light-pingora. This document captures the Java behavior, the compatibility contract, and the design direction for hardening it for gateway and sidecar deployments.

Purpose

The token handler obtains an OAuth 2.0 client credentials access token on behalf of the backend service in the sidecar or gateway egress path. The token is then attached to the outbound request before the router or proxy sends the request to the downstream API.

This is different from the PII tokenize and detokenize handlers. The token handler deals only with service-to-service OAuth tokens.

Java Behavior To Map

The Java implementation is centered on:

  • egress-router/.../TokenHandler.java
  • sidecar/.../SidecarTokenHandler.java
  • router-config/.../TokenConfig.java
  • client-config/.../client.yaml
  • sidecar-config/.../sidecar.yml

token.yml controls whether the handler is active and which request paths need token injection:

enabled: ${token.enabled:false}
appliedPathPrefixes: ${token.appliedPathPrefixes:}

The OAuth provider, client credentials, cache, timeout, proxy, HTTP/2, and single-vs-multiple-auth-server settings live in client.yml:

oauth:
  multipleAuthServers: ${client.multipleAuthServers:false}
  token:
    cache:
      capacity: ${client.tokenCacheCapacity:200}
    tokenRenewBeforeExpired: ${client.tokenRenewBeforeExpired:60000}
    expiredRefreshRetryDelay: ${client.expiredRefreshRetryDelay:2000}
    earlyRefreshRetryDelay: ${client.earlyRefreshRetryDelay:30000}
    server_url: ${client.tokenServerUrl:}
    serviceId: ${client.tokenServiceId:}
    proxyHost: ${client.tokenProxyHost:}
    proxyPort: ${client.tokenProxyPort:}
    enableHttp2: ${client.tokenEnableHttp2:true}
    client_credentials:
      uri: ${client.tokenCcUri:/oauth2/token}
      client_id: ${client.tokenCcClientId:}
      client_secret: ${client.tokenCcClientSecret:}
      scope: ${client.tokenCcScope:}
      serviceIdAuthServers: ${client.tokenCcServiceIdAuthServers:}
pathPrefixServices: ${client.pathPrefixServices:}
request:
  connectTimeout: ${client.connectTimeout:2000}
  timeout: ${client.timeout:4000}

The Java request flow is:

  1. Reload token.yml for the request.
  2. Check appliedPathPrefixes with a string prefix match.
  3. Read service_id from the request. This header is expected to be set by PathPrefixServiceHandler or ServiceDictHandler.
  4. Resolve the auth server configuration from client.yml.
  5. Get or refresh a cached client credentials JWT for the service.
  6. If the request has no Authorization header, set Authorization: Bearer <token>.
  7. If the request already has Authorization, preserve it and set X-Scope-Token: Bearer <token>.
  8. Continue to the next handler, usually router.
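
Steps 6 and 7 can be sketched as a small pure function. This is only an illustration of the header rule described above, not the Java or Rust implementation; the `Headers` alias and the `attach_token` name are hypothetical, and header names are kept lowercased because HTTP header names are case-insensitive.

```rust
use std::collections::BTreeMap;

/// Hypothetical header map for this sketch; keys are stored lowercased.
pub type Headers = BTreeMap<String, String>;

/// Attach the client credentials token using the rule from steps 6 and 7:
/// fill `Authorization` when the caller did not send one, otherwise keep the
/// caller's value and put the service token in `X-Scope-Token`.
pub fn attach_token(headers: &mut Headers, access_token: &str) {
    let value = format!("Bearer {access_token}");
    if headers.contains_key("authorization") {
        // The caller's own Authorization header is preserved untouched.
        headers.insert("x-scope-token".to_string(), value);
    } else {
        headers.insert("authorization".to_string(), value);
    }
}
```

A request that arrives without `Authorization` leaves with only that header set; a request that already carries one leaves with both headers.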

For multiple auth servers, Java reads oauth.token.client_credentials.serviceIdAuthServers[service_id] and enriches that entry with the global token defaults. For a single auth server, it uses the global oauth.token.client_credentials section.
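
The enrichment step can be pictured as a field-by-field merge in which the per-service entry wins and unset fields fall back to the global section. The sketch below assumes that reading; `AuthServerSettings`, `merged_with`, and the exact set of merged fields are hypothetical and chosen only to mirror the client.yml key names.

```rust
/// Hypothetical subset of the client credentials settings from client.yml.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct AuthServerSettings {
    pub server_url: Option<String>,
    pub uri: Option<String>,
    pub client_id: Option<String>,
    pub client_secret: Option<String>,
    pub scope: Vec<String>,
}

impl AuthServerSettings {
    /// Per-service values win; anything left unset falls back to the global
    /// token defaults. An empty scope list is treated as "unset" here, which
    /// is an assumption of this sketch.
    pub fn merged_with(self, global: &AuthServerSettings) -> AuthServerSettings {
        AuthServerSettings {
            server_url: self.server_url.or_else(|| global.server_url.clone()),
            uri: self.uri.or_else(|| global.uri.clone()),
            client_id: self.client_id.or_else(|| global.client_id.clone()),
            client_secret: self.client_secret.or_else(|| global.client_secret.clone()),
            scope: if self.scope.is_empty() {
                global.scope.clone()
            } else {
                self.scope
            },
        }
    }
}
```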

The Java cache is a static map keyed by service_id. The cached Jwt stores the access token and its exp claim in milliseconds. OauthHelper refreshes synchronously after expiry and attempts async refresh while the token is in the renewal window.

SidecarTokenHandler adds an egress gate before calling TokenHandler:

  • sidecar.egressIngressIndicator: header runs the token handler only when the request has service_id or service_url.
  • sidecar.egressIngressIndicator: protocol runs the token handler for HTTP requests, which is the usual in-pod sidecar egress protocol.
  • Any other value skips token injection.

The base Java TokenHandler still needs service_id to choose the service token. A request with only service_url can identify egress traffic, but it does not by itself select a service-specific token.

Goals

  • Preserve the Java configuration files: token.yml and client.yml.
  • Activate the handler with the existing token id in handler.yml.
  • Support config-server injection for token.enabled, token.appliedPathPrefixes, client.multipleAuthServers, client.tokenCcServiceIdAuthServers, sidecar.egressIngressIndicator, and the rest of the client.yml token fields.
  • Support single auth server and per-service auth server configurations.
  • Support token endpoint discovery through oauth.token.serviceId when a direct server_url is not configured.
  • Preserve the Java header behavior for Authorization and X-Scope-Token.
  • Keep token retrieval fast and safe for request-path execution.
  • Register configuration and token cache state with the module registry and runtime cache registry without exposing token or secret values.
  • Keep the design usable by light-gateway, future sidecar products, and BFF deployments that need to call downstream APIs.

Non-Goals

  • Do not use inventory or dynamic plugins. Handler availability is compiled into the binary; handler activation is controlled by handler.yml.
  • Do not implement authorization code, refresh token, or token exchange in this handler. This handler only performs client_credentials.
  • Do not migrate Java SAMLTokenHandler as part of this design.
  • Do not use the PII tokenization table or handlers. token, tokenize, and detokenize are separate concerns.
  • Do not send the generated access token to logs, metrics labels, module registry output, or cache summaries.

Resolved Decisions

  • Use sidecar.yml to differentiate inbound proxy traffic from outbound router traffic before applying token injection.
  • Implement refresh with the same concurrency model as Java http-client: synchronize refresh per cached token, refresh expired tokens synchronously, refresh valid tokens in the renewal window asynchronously, and use retry windows to prevent repeated failed refresh attempts.

Handler Chain

The token handler must run after service resolution and before egress routing:

handlers:
  - correlation
  - security
  - path-prefix-service
  - token
  - router

chains:
  sidecar-egress:
    - correlation
    - security
    - path-prefix-service
    - token
    - router

paths:
  - path: /v1/pets
    method: GET
    exec:
      - sidecar-egress

path-prefix-service sets service_id from path configuration. token uses that service id to resolve and cache the client credentials token. router uses the same service id to select the downstream API target and should remove routing-only headers before forwarding.

For products where only some outbound APIs need a scope token, keep one chain with token and another without it, or use token.appliedPathPrefixes to limit token injection inside a shared chain.

Rust Architecture

Keep the implementation in light-pingora because token injection is a request-path gateway handler. light-gateway wires the handler into the existing chain execution model.

Primary Rust module:

frameworks/light-pingora/src/token.rs

Primary types:

use std::collections::BTreeMap;
use std::sync::Arc;

pub struct TokenHandlerConfig {
    pub enabled: bool,
    pub applied_path_prefixes: Vec<String>,
}

pub struct ClientTokenConfig {
    pub tls: ClientTlsConfig,
    pub oauth: ClientOauthConfig,
    pub path_prefix_services: BTreeMap<String, String>,
    pub request: ClientRequestConfig,
}

pub struct TokenRuntime {
    handler: TokenHandlerConfig,
    sidecar: SidecarTrafficConfig,
    client: ClientTokenConfig,
    cache: Arc<TokenCache>,
    registry_client: Option<Arc<PortalRegistryClient>>,
}

apps/light-gateway should load TokenRuntime only when the matched handler configuration contains token. For Java compatibility, token.yml still has enabled; therefore the handler is effective only when both conditions are true:

handler.yml contains token
token.yml enabled is true

If token.yml enables the handler, client.yml is required and invalid configuration should fail startup. sidecar.yml is also loaded into the token runtime so the same handler chain can distinguish inbound proxy requests from outbound router requests. Invalid reloads should be rejected while the last valid runtime keeps serving traffic.

Request Flow

The Rust request flow should be:

  1. Resolve the active handler chain for the path and method.
  2. When token is encountered, check TokenHandlerConfig.enabled.
  3. Evaluate sidecar.yml and skip token injection for inbound proxy traffic.
  4. Check appliedPathPrefixes with boundary-aware matching. /v1/address should match /v1/address/123, but not /v1/address2.
  5. Resolve the token service id:
    • first from the service_id request header,
    • then from client.yml pathPrefixServices,
    • then from oauth.token.serviceId for single-auth-server token endpoint discovery when applicable.
  6. Resolve the token endpoint:
    • use direct server_url first,
    • otherwise discover oauth.token.serviceId through portal registry.
  7. Select client credentials:
    • for single auth server, use oauth.token.client_credentials,
    • for multiple auth servers, require client_credentials.serviceIdAuthServers[service_id] and merge it with global token defaults.
  8. Look up the token cache.
  9. Fetch a new token when the cache entry is missing or expired. When the cached token is inside the refresh window, use it for the current request and refresh in the background.
  10. Add Authorization or X-Scope-Token using the Java-compatible rule.
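
The boundary-aware matching in step 4 could look like the sketch below. The function name is hypothetical, the input is assumed to be the request path without its query string, and treating an empty or "/" prefix as "match everything" is an assumption of this sketch.

```rust
/// Returns true when `path` equals `prefix` or continues it at a `/` boundary.
/// A raw `starts_with` would wrongly accept `/v1/address2` for `/v1/address`.
pub fn path_matches_prefix(path: &str, prefix: &str) -> bool {
    // Ignore a trailing slash on the configured prefix.
    let prefix = prefix.trim_end_matches('/');
    if prefix.is_empty() {
        // Assumption: an empty or "/" prefix applies to every path.
        return true;
    }
    match path.strip_prefix(prefix) {
        Some(rest) => rest.is_empty() || rest.starts_with('/'),
        None => false,
    }
}
```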

The outbound token request should be Java-compatible:

POST {server_url}{uri}
Content-Type: application/x-www-form-urlencoded
Accept: application/json
Authorization: Basic base64(client_id:client_secret)

grant_type=client_credentials&scope=...

The response must contain access_token. Expiry should be derived from the JWT exp claim when available, with expires_in as a fallback for non-JWT token servers.
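
As an illustration of the request shape above, the following std-only sketch builds the `Authorization: Basic` value and the form body. The helper names are hypothetical, and a real implementation would more likely use an existing base64 and URL-encoding crate. The sketch encodes `client_id:client_secret` as written; whether the Java client additionally form-encodes the credentials first is not stated here. Omitting `scope` when none is configured is also an assumption.

```rust
const B64: &[u8; 64] = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Standard base64 with `=` padding.
fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(B64[(n >> 18) as usize & 63] as char);
        out.push(B64[(n >> 12) as usize & 63] as char);
        out.push(if chunk.len() > 1 { B64[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { B64[n as usize & 63] as char } else { '=' });
    }
    out
}

/// application/x-www-form-urlencoded encoding for a single value.
fn form_encode(value: &str) -> String {
    let mut out = String::new();
    for b in value.bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'*' => {
                out.push(b as char)
            }
            b' ' => out.push('+'),
            _ => out.push_str(&format!("%{b:02X}")),
        }
    }
    out
}

/// Value for the `Authorization` header of the token request.
pub fn basic_auth_header(client_id: &str, client_secret: &str) -> String {
    let raw = format!("{client_id}:{client_secret}");
    format!("Basic {}", base64_encode(raw.as_bytes()))
}

/// Form body for the client credentials grant.
pub fn token_request_body(scope: Option<&str>) -> String {
    match scope {
        Some(s) if !s.is_empty() => {
            format!("grant_type=client_credentials&scope={}", form_encode(s))
        }
        _ => "grant_type=client_credentials".to_string(),
    }
}
```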

Cache And Refresh

Use a bounded async cache owned by TokenRuntime.

The cache key should include both service id and scope:

pub struct TokenCacheKey {
    pub service_id: Option<String>,
    pub scope: Option<String>,
}

This is stricter than the Java Map<String, Jwt> keyed only by service_id and avoids collisions when the same service uses multiple scope sets.
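
To make the collision argument concrete, the sketch below uses the same key shape in a standard `HashMap`. The derives are what a map key needs; the `new` helper and the plain `String` token value are hypothetical simplifications, and the real cache is expected to be a bounded async cache rather than a `HashMap`.

```rust
use std::collections::HashMap;

/// Same fields as the key above, with the derives a map key requires.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct TokenCacheKey {
    pub service_id: Option<String>,
    pub scope: Option<String>,
}

impl TokenCacheKey {
    /// Hypothetical convenience constructor for this sketch.
    pub fn new(service_id: Option<&str>, scope: Option<&str>) -> Self {
        Self {
            service_id: service_id.map(str::to_string),
            scope: scope.map(str::to_string),
        }
    }
}

/// Stores two tokens for one service with different scope sets and returns
/// how many distinct entries the map holds afterwards.
pub fn distinct_entries_for_same_service() -> usize {
    let mut cache: HashMap<TokenCacheKey, String> = HashMap::new();
    let service = Some("com.networknt.petstore-1.0.0");
    cache.insert(TokenCacheKey::new(service, Some("petstore.r")), "token-a".into());
    cache.insert(TokenCacheKey::new(service, Some("petstore.w")), "token-b".into());
    cache.len()
}
```

With a key made of `service_id` alone, the second insert would overwrite the first and a request could receive a token issued for the wrong scope set.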

Refresh policy:

  • If the token is valid and outside the renewal window, use the cached token.
  • If the token is expired, synchronize on that cache entry and refresh synchronously. Concurrent requests for the same service and scope should wait on the same per-entry lock, then re-check the refreshed token instead of making duplicate token endpoint calls.
  • If the token is expired but a previous failed refresh attempt is still inside expiredRefreshRetryDelay, fail closed with a token-not-available rejection.
  • If the token is in the renewal window but not expired, return the current token and start one background refresh for that cache entry when no refresh is already running and earlyRefreshRetryDelay has elapsed.
  • Keep refresh state per cached token: token string, expiry, scope, renewing, expired_retry_timeout, and early_retry_timeout.

This intentionally mirrors Java OauthHelper.populateCCToken. The Rust implementation should use tokio locks/tasks instead of Java synchronized and ScheduledExecutorService, but the observable behavior should stay the same: expired tokens block the current request, early refresh does not block the current request, and multiple concurrent requests for the same token are coordinated through one cache entry.
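
The policy can be read as a pure decision function over the per-entry state, which keeps the locking and task spawning separate from the rules themselves. The sketch below is one possible reading, not the light-pingora code: the type and function names are hypothetical, all times are epoch milliseconds, and the two retry fields are assumed to hold the absolute time before which another attempt is suppressed (failure time plus the configured delay).

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum RefreshAction {
    /// Token is valid and outside the renewal window, or a refresh is pending.
    UseCached,
    /// Token is still valid; serve it and start one background refresh.
    UseCachedAndRefreshInBackground,
    /// Token is expired; the current request waits on the per-entry lock.
    RefreshBlocking,
    /// Token is expired and a recent refresh failed; reject the request.
    FailClosed,
}

/// Hypothetical per-entry refresh state for this sketch.
pub struct EntryState {
    pub expires_at_ms: u64,
    pub renewing: bool,
    pub expired_retry_timeout_ms: u64,
    pub early_retry_timeout_ms: u64,
}

pub fn decide(now_ms: u64, renew_before_ms: u64, entry: &EntryState) -> RefreshAction {
    if now_ms >= entry.expires_at_ms {
        // Expired: block unless a failed attempt is still inside its retry delay.
        if now_ms < entry.expired_retry_timeout_ms {
            RefreshAction::FailClosed
        } else {
            RefreshAction::RefreshBlocking
        }
    } else if now_ms.saturating_add(renew_before_ms) >= entry.expires_at_ms {
        // Inside the renewal window: never block the current request.
        if !entry.renewing && now_ms >= entry.early_retry_timeout_ms {
            RefreshAction::UseCachedAndRefreshInBackground
        } else {
            RefreshAction::UseCached
        }
    } else {
        RefreshAction::UseCached
    }
}
```

A cache miss can be modeled as an entry with `expires_at_ms` set to 0, which falls into the blocking branch.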

On token.yml or client.yml reload, build a new TokenRuntime and discard the old cache. This prevents tokens issued with old client credentials or old scope configuration from being reused after a config change.
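
One way to get "swap on valid reload, keep serving on invalid reload" is to hold the runtime behind a shared pointer that is replaced only after the new value has been built successfully. The sketch below uses only the standard library; `RuntimeSlot` is a hypothetical name, and the generic `T` stands in for `TokenRuntime` together with the cache it owns.

```rust
use std::sync::{Arc, RwLock};

/// Holds the currently active runtime. Requests clone the `Arc` and keep
/// using their snapshot even if a reload happens mid-request.
pub struct RuntimeSlot<T> {
    current: RwLock<Arc<T>>,
}

impl<T> RuntimeSlot<T> {
    pub fn new(initial: T) -> Self {
        Self { current: RwLock::new(Arc::new(initial)) }
    }

    /// Snapshot of the active runtime for one request.
    pub fn load(&self) -> Arc<T> {
        let guard = self.current.read().expect("runtime lock poisoned");
        Arc::clone(&*guard)
    }

    /// Swap in a freshly built runtime. If building failed, the error is
    /// returned and the previous runtime stays active.
    pub fn reload<E>(&self, candidate: Result<T, E>) -> Result<(), E> {
        let next = Arc::new(candidate?);
        *self.current.write().expect("runtime lock poisoned") = next;
        Ok(())
    }
}
```

Because the cache lives inside the runtime value, the old cache is dropped once the last in-flight request releases its snapshot.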

Sidecar Egress Gate

The token handler must use sidecar.yml to decide whether the current request is outbound router traffic or inbound proxy traffic. This allows one gateway or sidecar process to host both directions while applying token injection only to egress calls.

Use the Java sidecar.yml contract:

egressIngressIndicator: ${sidecar.egressIngressIndicator:header}

Rust behavior:

  • header: run token only when service_id or service_url is present.
  • protocol: run token for HTTP requests entering the sidecar listener.
  • any other value: skip token injection.

Even with this gate, token selection should still require either a resolved service id or a single-auth-server configuration that can use a direct server_url.
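
The gate itself reduces to a small predicate. In the sketch below the enum and function names are hypothetical, parsing is case-insensitive as an assumption, and `scheme_is_http` stands for "the request arrived as plain HTTP on the sidecar listener". Values that parse to `Other` simply skip injection here; rejecting them at load time is a separate validation step.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EgressIngressIndicator {
    Header,
    Protocol,
    Other,
}

impl EgressIngressIndicator {
    /// Case-insensitive parsing is an assumption of this sketch.
    pub fn parse(raw: &str) -> Self {
        match raw.trim().to_ascii_lowercase().as_str() {
            "header" => Self::Header,
            "protocol" => Self::Protocol,
            _ => Self::Other,
        }
    }
}

/// Decide whether the token handler should run for the current request.
pub fn should_run_token(
    indicator: EgressIngressIndicator,
    has_service_id: bool,
    has_service_url: bool,
    scheme_is_http: bool,
) -> bool {
    match indicator {
        EgressIngressIndicator::Header => has_service_id || has_service_url,
        EgressIngressIndicator::Protocol => scheme_is_http,
        EgressIngressIndicator::Other => false,
    }
}
```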

The sidecar config should be registered in the module registry as a framework config. Invalid values should fail startup or reject reload.

Configuration Examples

Single auth server:

# sidecar.yml
egressIngressIndicator: ${sidecar.egressIngressIndicator:header}
# token.yml
enabled: ${token.enabled:true}
appliedPathPrefixes: ${token.appliedPathPrefixes:/v1}
# client.yml
oauth:
  multipleAuthServers: false
  token:
    server_url: ${client.tokenServerUrl:https://oauth.example.com}
    tokenRenewBeforeExpired: ${client.tokenRenewBeforeExpired:60000}
    client_credentials:
      uri: ${client.tokenCcUri:/oauth2/token}
      client_id: ${client.tokenCcClientId:gateway-client}
      client_secret: ${client.tokenCcClientSecret:}
      scope: ${client.tokenCcScope:petstore.r petstore.w}

Multiple auth servers:

# client.yml
oauth:
  multipleAuthServers: true
  token:
    tokenRenewBeforeExpired: ${client.tokenRenewBeforeExpired:60000}
    client_credentials:
      uri: /oauth2/token
      serviceIdAuthServers: ${client.tokenCcServiceIdAuthServers:}
pathPrefixServices: ${client.pathPrefixServices:}

The config server can inject client.tokenCcServiceIdAuthServers as YAML or a JSON string:

com.networknt.petstore-1.0.0:
  server_url: https://oauth-petstore.example.com
  client_id: petstore-client
  client_secret: ${PETSTORE_CLIENT_SECRET}
  scope:
    - petstore.r
    - petstore.w

Rust Improvements Over Java

  • Use boundary-aware path prefix matching instead of raw startsWith.
  • Include scope in the cache key.
  • Mask client_secret and token values in module registry and cache output.
  • Fail startup for enabled but invalid token configuration.
  • Use Rust async primitives to implement the same per-token synchronized refresh behavior as Java without spawning a dedicated executor per refresh attempt.
  • Support direct server_url and portal-registry discovery with the same runtime path.
  • Keep all config-server injected values in the normal module registry and reload model.

Observability

Record metrics and logs around the token operation, but never include the token or client secret:

  • handler duration for token,
  • cache hit, miss, refresh, and failure counts,
  • token endpoint latency and HTTP status,
  • service id and provider selection,
  • refresh retry suppression counts,
  • module registry entry for loaded token.yml and masked client.yml,
  • runtime cache entry count and expiry summaries without access token strings.

Failure Behavior

Fail closed when token injection is required but cannot be completed:

  • missing service_id for multiple auth servers,
  • missing serviceIdAuthServers[service_id],
  • missing client_id or client_secret,
  • no direct server_url and failed token service discovery,
  • token endpoint returns non-2xx,
  • token response has no access_token,
  • token response has neither JWT exp nor expires_in,
  • invalid proxy, URL, or TLS configuration.

Requests outside appliedPathPrefixes should bypass the handler without error.

Test Plan

Unit tests in light-pingora:

  • parse Java-compatible token.yml and client.yml,
  • parse and validate Java-compatible sidecar.yml,
  • parse appliedPathPrefixes as YAML list, JSON string list, and comma list,
  • parse serviceIdAuthServers as YAML map and JSON string map,
  • verify boundary-aware prefix matching,
  • verify sidecar.yml header mode applies token only to outbound requests with service_id or service_url,
  • verify sidecar.yml protocol mode applies token to HTTP egress traffic,
  • verify single auth server option resolution,
  • verify multiple auth server option merging,
  • verify Authorization versus X-Scope-Token header selection,
  • verify cache key includes service id and scope,
  • verify token cache summaries never include token strings,
  • verify expired token refresh is synchronized across concurrent requests,
  • verify early-window refresh returns the current token and starts only one background refresh.

Gateway tests in light-gateway:

  • chain with path-prefix-service -> token -> router,
  • inbound proxy request skips token injection according to sidecar.yml,
  • outbound router request applies token injection according to sidecar.yml,
  • missing service id for multiple auth servers returns a handler rejection,
  • existing caller Authorization is preserved and scope token is added to X-Scope-Token,
  • token runtime reload swaps config and clears old cache,
  • inactive token handler does not require token.yml or client.yml.

Integration tests:

  • mock OAuth token endpoint with client credentials Basic auth,
  • mock discovered token service through portal registry,
  • mock downstream service and assert the final outbound headers,
  • refresh behavior with expired and near-expiry tokens.

Service Discovery

Status

Implemented baseline.

light-runtime, portal-registry, light-pingora, and light-gateway already have the main pieces needed for controller-backed service discovery. This document captures the supported invocation path, the configuration contract, and the intended hardening direction for gateway, sidecar, BFF, MCP, WebSocket, and token-handler deployments.

Purpose

light-gateway should be able to discover downstream service instances from the Light Controller through portal-registry instead of relying only on static host lists in router.yml, proxy.yml, mcp-router.yml, or handler-specific configuration.

The same mechanism should work with both controller implementations:

  • Rust controller-rs
  • Java light-controller

The gateway should use one portal-registry connection for registration, runtime control-plane callbacks, and service discovery lookup. A separate discovery client connection is not required for a registered runtime.

Goals

  • Reuse the existing portal-registry JSON-RPC WebSocket client.
  • Keep service discovery available to all light-pingora handlers through RuntimeConfig.registry_client.
  • Support controller-backed lookup for:
    • REST/router outbound calls
    • WebSocket routing
    • MCP tool routing
    • OAuth token-server resolution
    • SPA auth token-server resolution
  • Keep direct URL configuration as an explicit override when a handler supports it.
  • Keep static target configuration as a fallback where it already exists.
  • Preserve Java-compatible discovery data names such as serviceId, envTag, protocol, address, and port.
  • Let light-portal and config-server manage product-specific registry and handler configuration.
  • Work with one light-gateway binary and different product config sets.

Non-Goals

  • Do not add a second discovery protocol for light-gateway.
  • Do not require dynamic Rust plugins, inventory, or reflection for discovery.
  • Do not make each handler own a separate controller connection.
  • Do not require /ws/discovery for registered gateway instances.
  • Do not remove static fallback configuration from router-style deployments.
  • Do not make service discovery hide invalid product configuration. Startup validation and runtime errors should remain explicit.

Controller Protocol

The controller exposes two WebSocket endpoints:

/ws/microservice
/ws/discovery

light-gateway uses /ws/microservice.

The flow is:

light-gateway
  -> connect /ws/microservice
  -> JSON-RPC service/register
  <- registered runtimeInstanceId
  -> JSON-RPC discovery/lookup on the same websocket
  <- DiscoverySnapshot

The dedicated /ws/discovery endpoint is still useful for non-service clients that only need discovery. It is not needed by the gateway because both controller-rs and light-controller accept discovery JSON-RPC methods on the registered microservice socket after service/register succeeds.

The lookup request uses a DiscoverySubscription shape:

{
  "serviceId": "com.networknt.petstore-1.0.0",
  "envTag": "dev",
  "protocol": "https"
}

envTag and protocol are optional. When protocol is omitted, the controller can return all matching protocols and the caller decides which nodes are usable.

The response is a DiscoverySnapshot:

{
  "serviceId": "com.networknt.petstore-1.0.0",
  "envTag": "dev",
  "protocol": "https",
  "nodes": [
    {
      "runtimeInstanceId": "...",
      "serviceId": "com.networknt.petstore-1.0.0",
      "envTag": "dev",
      "environment": "dev",
      "version": "1.0.0",
      "protocol": "https",
      "address": "petstore",
      "port": 8443,
      "tags": {},
      "connectedAt": "...",
      "lastSeenAt": "...",
      "connected": true
    }
  ]
}

Only connected nodes with a non-zero port should be used as upstream targets. Handlers should ignore protocols they cannot proxy.
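
A minimal sketch of that filter follows. `DiscoveryNode` here is a hypothetical reduction of the snapshot node to the four fields the rule needs, and the caller passes the protocols its handler can proxy.

```rust
/// Hypothetical reduced view of a DiscoverySnapshot node.
#[derive(Clone, Debug, PartialEq)]
pub struct DiscoveryNode {
    pub protocol: String,
    pub address: String,
    pub port: u16,
    pub connected: bool,
}

/// Keep only nodes that are connected, have a non-zero port, and speak a
/// protocol the calling handler can proxy.
pub fn usable_nodes<'a>(
    nodes: &'a [DiscoveryNode],
    supported_protocols: &[&str],
) -> Vec<&'a DiscoveryNode> {
    nodes
        .iter()
        .filter(|n| n.connected)
        .filter(|n| n.port != 0)
        .filter(|n| supported_protocols.contains(&n.protocol.as_str()))
        .collect()
}
```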

Runtime Configuration

Registry participation is controlled by server.yml:

serviceId: ${server.serviceId:com.networknt.light-gateway-1.0.0}
advertisedAddress: ${server.advertisedAddress:127.0.0.1}
enableRegistry: ${server.enableRegistry:true}
startOnRegistryFailure: ${server.startOnRegistryFailure:true}
environment: ${server.environment:dev}

Controller connection settings come from portal-registry.yml:

portalUrl: ${portalRegistry.portalUrl:https://localhost:8438}
portalToken: ${light_portal_authorization:}
controllerDiscoveryToken: ${portalRegistry.controllerDiscoveryToken:}

Current light-gateway discovery uses the microservice registration token from LIGHT_PORTAL_AUTHORIZATION or portalToken. The token is sent in the service/register payload. controllerDiscoveryToken is reserved for clients that use the dedicated /ws/discovery endpoint and is not part of the current gateway lookup path.

The runtime converts portalUrl to /ws/microservice, strips any query string, and starts the shared PortalRegistryClient when registry is enabled. The client must be connected and registered before discovery lookup can succeed.
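
The URL conversion might look like the sketch below. Only "target /ws/microservice and strip the query string" comes from the description above; the function name, the https to wss and http to ws scheme mapping, and replacing any existing path are assumptions of this sketch.

```rust
/// Derive the controller WebSocket URL from `portalUrl`.
/// Returns None for an unsupported scheme or a missing host.
pub fn microservice_ws_url(portal_url: &str) -> Option<String> {
    // Drop the query string, if any.
    let without_query = portal_url.split('?').next().unwrap_or(portal_url);
    let (scheme, rest) = without_query.split_once("://")?;
    // Assumption: TLS schemes map to wss, plain schemes map to ws.
    let ws_scheme = match scheme {
        "https" | "wss" => "wss",
        "http" | "ws" => "ws",
        _ => return None,
    };
    // Keep host and port only; any configured path is replaced.
    let authority = rest.split('/').next()?;
    if authority.is_empty() {
        return None;
    }
    Some(format!("{ws_scheme}://{authority}/ws/microservice"))
}
```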

Gateway Invocation Path

Startup path:

config-server/local config
  -> light-runtime loads server.yml, client.yml, portal-registry.yml
  -> RuntimeConfig.service_identity is built from server/bootstrap config
  -> RuntimeConfig.registry_client is created when registry is enabled
  -> runtime startup registers the gateway with controller
  -> light-gateway builds Pingora proxy state from RuntimeConfig

Request-time path:

incoming request
  -> handler.yml selects a handler chain
  -> handler resolves direct target, serviceId, or static target
  -> handler calls PortalRegistryClient.lookup_discovery when serviceId discovery is needed
  -> controller returns DiscoverySnapshot
  -> handler converts nodes to Pingora ProxyTarget or base URL
  -> Pingora proxies the request

PortalRegistryClient.lookup_discovery sends JSON-RPC method discovery/lookup over the registered websocket and waits for a response. If the websocket is not connected, lookup fails with a registry client connection error.

Handler Usage

Router

The router handler supports both direct routing and service discovery.

Resolution order:

  1. service_url request routing, when configured and present.
  2. service_id from query/header/path-prefix logic.
  3. direct-registry.directUrls using serviceId|envTag, then serviceId.
  4. Controller discovery with serviceId and optional envTag.
  5. Legacy static router.serviceTargets fallback.
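
Step 3 uses a two-level key: the more specific serviceId|envTag entry is tried first, then the plain serviceId entry. A minimal sketch, assuming direct-registry.directUrls has already been loaded into a string map; the function name is hypothetical.

```rust
use std::collections::BTreeMap;

/// Look up a direct URL, preferring the environment-specific mapping.
pub fn lookup_direct_url<'a>(
    direct_urls: &'a BTreeMap<String, String>,
    service_id: &str,
    env_tag: Option<&str>,
) -> Option<&'a str> {
    if let Some(tag) = env_tag {
        let key = format!("{service_id}|{tag}");
        if let Some(url) = direct_urls.get(&key) {
            return Some(url.as_str());
        }
    }
    direct_urls.get(service_id).map(String::as_str)
}
```

When this returns `None`, the handler moves on to controller discovery and then to the legacy static fallback.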

Direct registry is the standard static service map. router.serviceTargets remains a deprecated compatibility fallback for old Rust gateway configs. New product configs should not maintain both maps.

WebSocket Router

The websocket handler resolves the target service from header, query, or pathPrefixService. It checks direct-registry.directUrls first, then passes serviceId, optional envTag, and protocol to discovery. Connected http and https nodes are converted to upstream WebSocket targets and Pingora handles the upgrade proxying.

MCP Router

The mcp handler can route tools by direct targetHost or discovered serviceId.

Resolution order:

  1. Tool targetHost.
  2. direct-registry.directUrls using serviceId|envTag, then serviceId.
  3. Tool serviceId through controller discovery.

When a tool uses serviceId, portal registry is only required if no direct URL mapping exists. The tool can also specify envTag and protocol to constrain direct URL and discovery lookup.

Token Handler

The token handler can resolve the OAuth token server by direct oauth.token.server_url or by oauth.token.serviceId.

Resolution order:

  1. Direct token server URL.
  2. direct-registry.directUrls using token server serviceId.
  3. Token server serviceId through controller discovery.

Node selection prefers https and falls back to http. If discovery is required and portal registry is not enabled, token injection fails explicitly.

SPA Auth

The stateless SPA auth and MSAL exchange token clients use the same token-server resolution model as the token handler:

  1. Direct token server URL.
  2. direct-registry.directUrls using token server serviceId.
  3. Token server serviceId through controller discovery.

This keeps BFF deployments independent from fixed OAuth hostnames when the token service is registered with the controller.

Direct URLs And Fallbacks

Discovery should not override an explicit direct URL selected by a handler. Direct URLs are operator intent and should remain authoritative. The standard shared direct URL map is direct-registry.directUrls.

Static fallback is handler-specific:

  • direct-registry.directUrls is checked before controller discovery.
  • router.serviceTargets is deprecated and only remains as a legacy router fallback.
  • MCP, token, SPA auth, JWK, and WebSocket service-id routing can use direct-registry.directUrls without per-handler duplicate maps.

This keeps failure behavior predictable. Product configs that require dynamic discovery should fail requests loudly when the controller connection is down instead of silently choosing an unrelated target.

Load Balancing

The controller returns a list of matching nodes. The handler is responsible for choosing one.

Current behavior is intentionally simple:

  • drop disconnected nodes
  • drop nodes with port 0
  • drop unsupported protocols
  • prefer https for token-server resolution
  • round-robin or index-based selection where the handler already has an index

Future hardening can add weighted selection, zone preference, health score, least-connections, or sticky routing. Those policies should live in the handler or a shared target-selection helper, not in the controller protocol.
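
A shared target-selection helper could start as small as the sketch below, which covers two of the current rules: the https preference used for token-server resolution and a lock-free round-robin pick. `Target`, `prefer_https`, and `RoundRobin` are hypothetical names, and the input is assumed to be already filtered for connected nodes and non-zero ports.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical upstream target derived from a discovery node.
#[derive(Clone, Debug, PartialEq)]
pub struct Target {
    pub protocol: String,
    pub address: String,
    pub port: u16,
}

/// Return the https targets when any exist, otherwise the http targets.
pub fn prefer_https(nodes: &[Target]) -> Vec<&Target> {
    let https: Vec<&Target> = nodes.iter().filter(|n| n.protocol == "https").collect();
    if !https.is_empty() {
        return https;
    }
    nodes.iter().filter(|n| n.protocol == "http").collect()
}

/// Round-robin index shared across requests.
pub struct RoundRobin {
    next: AtomicUsize,
}

impl RoundRobin {
    pub const fn new() -> Self {
        Self { next: AtomicUsize::new(0) }
    }

    /// Pick the next candidate, wrapping around at the end of the slice.
    pub fn pick<'a, T>(&self, candidates: &'a [T]) -> Option<&'a T> {
        if candidates.is_empty() {
            return None;
        }
        let index = self.next.fetch_add(1, Ordering::Relaxed) % candidates.len();
        candidates.get(index)
    }
}
```

Keeping both pieces outside the controller protocol means a weighted or zone-aware policy can later replace `RoundRobin` without changing what the controller returns.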

Failure Semantics

Startup behavior is controlled by startOnRegistryFailure:

  • true: the gateway can start if initial controller registration times out; the registry client keeps retrying in the background.
  • false: initial controller registration timeout fails startup.

Request-time behavior depends on handler fallback:

  • with direct URL: discovery is bypassed
  • with usable static fallback: handler may continue
  • with discovery-only config: return an explicit gateway error

The runtime should continue reconnecting the registry websocket. Once the client is registered again, new discovery lookups can succeed without restarting the gateway.

Security

The gateway registers through /ws/microservice with the portal registry token. The controller validates the registration token and then allows discovery RPCs on that registered socket.

Security expectations:

  • Use TLS for controller connections outside local development.
  • Keep hostname verification enabled outside local development.
  • Prefer environment-provided token values over static config files.
  • Mask portalToken and controllerDiscoveryToken in module-registry output.
  • Do not pass registry tokens to downstream services.
  • Do not trust discovery data from an untrusted controller.

Discovery returns transport endpoints. Authentication, authorization, rate limit, CORS, header mutation, token injection, and access-control decisions remain normal handler-chain responsibilities.

Config Server Model

In production, light-portal owns product configuration and config-server delivers resolved files at startup.

A product that needs controller-backed discovery should include:

  • server.yml with enableRegistry: true
  • portal-registry.yml with portalUrl and a valid portal token source
  • direct-registry.yml or values.yml entries under direct-registry.directUrls for transition services that are not registered in the controller yet
  • handler-specific config that uses serviceId instead of direct host URLs
  • handler.yml chains that include the relevant handler IDs

For local Docker Compose, the Rust gateway must not keep the default https://localhost:8438 controller URL, because inside the gateway container localhost resolves to the gateway itself rather than the controller. Use portalRegistry.portalUrl: https://controller:8438, pass LIGHT_PORTAL_AUTHORIZATION, and keep static transition mappings in direct-registry.directUrls.

The same binary can therefore run as:

  • gateway
  • sidecar
  • proxy server
  • proxy client
  • balancer
  • BFF

The product identity comes from config, not from a separate executable.

Compatibility Notes

The current Rust and Java controllers are compatible with the gateway discovery path because both support:

  • /ws/microservice
  • service/register
  • discovery lookup on the registered microservice socket
  • serviceId, envTag, and protocol filters
  • DiscoverySnapshot.nodes
  • connected-node metadata with address, port, and protocol

The gateway does not currently depend on /ws/discovery, although that endpoint can remain available for external discovery clients.

Future Work

  • Add optional discovery subscriptions for handlers that benefit from a local in-memory discovery cache.
  • Add shared target-selection policies for weighted, sticky, or zone-aware routing.
  • Expose discovery health through the module registry or an admin endpoint.
  • Add an integration test that starts a controller, registers a backend, starts light-gateway, and verifies an end-to-end proxied request through discovery.
  • Decide whether controllerDiscoveryToken should be used by any standalone discovery-only client in light-fabric.
  • Document operational examples for gateway, sidecar, WebSocket, MCP, token handler, and BFF product profiles.

Tracing

Light-Fabric uses Rust tracing for application logs and runtime diagnostics. The same tracing events must support two different consumers:

  • operators and developers reading live logs from the console or control plane
  • log platforms such as Splunk that ingest structured JSON

The logging design should keep one source of truth for emitted events and make the output format configurable at the edge of the process.

Goals

  • Preserve the current human-readable console format for local development and controller-streamed logs.
  • Support newline-delimited JSON logs for Splunk and other log ingestion systems.
  • Allow deployments to choose text or JSON console output without changing application code.
  • Allow authorized control-plane users to change log levels and logger targets without restarting the service.
  • Avoid coupling Light-Fabric services directly to Splunk availability, credentials, retry policy, or backpressure handling.
  • Keep log fields stable enough for portal-view, controller, and Splunk queries.

Non-Goals

  • Implement a Splunk HTTP Event Collector client inside every Light-Fabric service.
  • Mix human text logs and JSON logs on the same stream.
  • Use values.yml to mutate process environment variables. Environment variables are startup inputs; runtime changes should use an explicit logging configuration model.

Current State

The application binaries initialize tracing_subscriber locally. The current format is text-oriented and is easy to read in a terminal, Docker logs, or a controller stream. Some binaries also support an ANSI toggle so container logs can avoid escape sequences.

This works well for humans, but it is less reliable for Splunk field extraction. Splunk can ingest text logs, but structured JSON gives predictable fields for filtering, dashboards, alerts, and correlation.

Output Formats

Light-Fabric should support the following output formats:

Format | Intended Consumer | Notes
text | humans, local development, controller live log stream | Existing behavior. Best for direct reading.
json | Splunk, OpenTelemetry Collector, Kubernetes log collectors | Newline-delimited JSON. Best for machine ingestion.

The output should be selected with an environment variable:

LIGHT_LOG_FORMAT=text

or:

LIGHT_LOG_FORMAT=json

If the variable is absent, the default should remain text to preserve existing operator behavior.
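
A minimal sketch of that selection, assuming a hypothetical LogFormat enum and log_format_from helper (neither name comes from the existing code). Falling back to text for unrecognized values is also an assumption; a stricter implementation could fail startup instead.

```rust
/// Console output format selected at process startup.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LogFormat {
    Text,
    Json,
}

/// Resolve the format from the raw value of LIGHT_LOG_FORMAT.
/// An absent variable keeps the existing text behavior.
/// Assumption: unknown values also fall back to text.
pub fn log_format_from(raw: Option<&str>) -> LogFormat {
    match raw.map(|v| v.trim().to_ascii_lowercase()).as_deref() {
        Some("json") => LogFormat::Json,
        _ => LogFormat::Text,
    }
}
```

A binary would typically call it as log_format_from(std::env::var("LIGHT_LOG_FORMAT").ok().as_deref()) before building the subscriber.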

RUST_LOG should continue to provide the startup filter:

RUST_LOG=info
RUST_LOG=light_gateway=debug,info
RUST_LOG=light_workflow=debug,info

Single Console Stream

For most deployments, the preferred model is a single console stream with a configurable format:

application tracing event
        |
        v
tracing_subscriber fmt layer
        |
        +-- stdout/stderr as text or JSON

This has the lowest runtime overhead because each event is formatted and written once. It also keeps container logging simple: the platform captures the process console stream, and the customer chooses whether that stream is text or JSON.

When LIGHT_LOG_FORMAT=json, the console output should be newline-delimited JSON:

{"timestamp":"2026-06-03T14:12:41.233Z","level":"INFO","target":"light_gateway","fields":{"message":"proxy request completed","method":"GET","path":"/api/customer","status":200,"elapsed_ms":18,"correlation_id":"abc-123"}}

Raw JSON is readable, but it is not as pleasant as the text format. For the control plane, portal-view should parse JSON log lines and render a human projection:

14:12:41.233  INFO  light_gateway  proxy request completed
method=GET path=/api/customer status=200 elapsed_ms=18 correlation_id=abc-123

This lets Splunk receive structured logs while portal-view remains readable for operators.
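
The projection itself is a small formatting step. The sketch below assumes the JSON line has already been parsed into a hypothetical LogEvent struct; the struct and the render_human function are illustrative names, not existing portal-view code, and the real renderer may well live in the UI rather than in Rust.

```rust
/// Fields of one parsed JSON log line, already extracted by a JSON parser.
pub struct LogEvent<'a> {
    pub timestamp: &'a str,
    pub level: &'a str,
    pub target: &'a str,
    pub message: &'a str,
    /// Remaining entries of `fields`, in display order.
    pub fields: Vec<(&'a str, String)>,
}

/// Render the human projection: a header line, then key=value pairs.
pub fn render_human(event: &LogEvent) -> String {
    // Keep only the time of day from an RFC 3339 UTC timestamp.
    let time = event
        .timestamp
        .split_once('T')
        .map(|(_, t)| t.trim_end_matches('Z'))
        .unwrap_or(event.timestamp);
    let header = format!(
        "{}  {}  {}  {}",
        time, event.level, event.target, event.message
    );
    if event.fields.is_empty() {
        return header;
    }
    let pairs: Vec<String> = event
        .fields
        .iter()
        .map(|(k, v)| format!("{k}={v}"))
        .collect();
    format!("{header}\n{}", pairs.join(" "))
}
```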

Portal-View Rendering

The controller should stream log lines without needing to understand every field. Portal-view can detect whether a line is JSON:

  1. Trim the line.
  2. If it starts with {, try to parse it as JSON.
  3. If parsing succeeds, render common fields in a stable layout.
  4. If parsing fails, render the original line as plain text.

The renderer should treat JSON parsing as an enhancement, not a hard requirement. This keeps mixed historical output, startup messages, and unrelated tool output usable.
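
The four steps can be sketched as follows. To keep the example dependency-free, the JSON parser is passed in as a closure; in practice it would be something like serde_json::from_str or the UI's own JSON parser. Rendered and classify_line are hypothetical names used only for illustration.

```rust
/// How portal-view should display one streamed line.
#[derive(Debug, PartialEq)]
pub enum Rendered<T> {
    /// The line parsed as JSON and can use the structured layout.
    Structured(T),
    /// Anything else is shown unchanged.
    Plain(String),
}

/// Apply the detection steps. `parse` stands in for a real JSON parser
/// and returns None when the text is not valid JSON.
pub fn classify_line<T>(
    line: &str,
    parse: impl Fn(&str) -> Option<T>,
) -> Rendered<T> {
    let trimmed = line.trim();
    if trimmed.starts_with('{') {
        if let Some(value) = parse(trimmed) {
            return Rendered::Structured(value);
        }
    }
    // Parsing is an enhancement only: fall back to the original line.
    Rendered::Plain(line.to_string())
}
```

Because the fallback returns the untouched input, a truncated JSON line or a startup banner is still displayed exactly as the service wrote it.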

Recommended display fields:

JSON Field | Display Use
timestamp | leading timestamp
level | severity badge/text
target | module or service source
fields.message | main message
fields.correlation_id | request correlation
fields.request_id | request identifier, when present
fields.status | HTTP or operation status
fields.elapsed_ms | latency

Unknown fields can be shown in an expandable details view or appended as key=value pairs.

Splunk Ingestion

A log file is not the only option for Splunk ingestion.

Console JSON in Containers

For Kubernetes and container deployments, console JSON is usually the best default. The service writes JSON to stdout/stderr, and the platform logging agent collects the container log stream. Splunk Connect for Kubernetes, OpenTelemetry Collector, or an equivalent customer-managed collector can parse the JSON and send it to Splunk HTTP Event Collector.

This avoids application-level Splunk credentials and keeps retry, batching, and backpressure in the collector.

JSON Log File

For VM or bare-metal deployments where the customer already uses Splunk Universal Forwarder, a JSON log file is also valid. In that mode the application would write newline-delimited JSON to a rotating file, and the forwarder or OpenTelemetry filelog receiver would tail it.

This mode is useful when stdout is reserved for human-readable controller logs. If text console output remains enabled, each event is formatted and written twice: once as text for the console and once as JSON for the file.

Direct Splunk HEC

Direct HTTP Event Collector delivery from the application is possible but should not be the default. It adds Splunk endpoint configuration, token management, retry policy, buffering, and failure handling to every service. A collector or forwarder is a cleaner boundary for production deployments.

Dual Sink Option

If a deployment must keep text console logs and produce JSON at the same time, Light-Fabric can use multiple tracing layers:

application tracing event
        |
        v
tracing subscriber registry
        |
        +-- text layer -> stdout/stderr
        |
        +-- JSON layer -> rolling file

This preserves the current control-plane stream and gives Splunk a clean JSON source. The tradeoff is extra formatting and I/O work per event.

Use this mode only when a single JSON console stream is not acceptable for the operator experience.

Configuration

The design supports both single-stream and dual-sink logging through configuration. The common deployment profiles are:

Deployment | Console Output | JSON File | Typical Splunk Path
Kubernetes/container | json | disabled | container log collector to Splunk HEC
Bare metal/VM with human console | text | enabled | Splunk Universal Forwarder or filelog receiver tails the JSON file
Local development | text | disabled | terminal or controller stream only

The minimal configuration should be:

LIGHT_LOG_FORMAT=text
LIGHT_LOG_ANSI=false
RUST_LOG=info

JSON console mode:

LIGHT_LOG_FORMAT=json
LIGHT_LOG_ANSI=false
RUST_LOG=info

Optional dual-sink file mode:

LIGHT_LOG_FORMAT=text
LIGHT_LOG_ANSI=false
LIGHT_LOG_JSON_FILE_ENABLED=true
LIGHT_LOG_JSON_FILE_DIR=/var/log/light-fabric
LIGHT_LOG_JSON_FILE_NAME=light-gateway.jsonl
LIGHT_LOG_JSON_FILE_ROTATION=daily
RUST_LOG=info

In this dual-sink mode, the application emits the same tracing event to both sinks: text to stdout/stderr for humans and controller-streamed logs, and JSON to the configured file for Splunk ingestion.
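
As a rough sketch, the file-sink settings could be read from a snapshot of the environment like this. JsonFileSink and json_file_sink are hypothetical names. Treating an incomplete configuration as "disabled" and defaulting the rotation to daily are assumptions; the real helper might report an error instead.

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Optional JSON file sink used in dual-sink mode.
#[derive(Debug, PartialEq)]
pub struct JsonFileSink {
    pub path: PathBuf,
    pub rotation: String,
}

/// Read the dual-sink settings from a snapshot of the environment.
/// Returns None when the sink is disabled or incompletely configured.
pub fn json_file_sink(vars: &HashMap<String, String>) -> Option<JsonFileSink> {
    let enabled = vars
        .get("LIGHT_LOG_JSON_FILE_ENABLED")
        .map(|v| v.trim().eq_ignore_ascii_case("true"))
        .unwrap_or(false);
    if !enabled {
        return None;
    }
    let dir = vars.get("LIGHT_LOG_JSON_FILE_DIR")?;
    let name = vars.get("LIGHT_LOG_JSON_FILE_NAME")?;
    // Assumption: rotation defaults to "daily" when unset.
    let rotation = vars
        .get("LIGHT_LOG_JSON_FILE_ROTATION")
        .cloned()
        .unwrap_or_else(|| "daily".to_string());
    Some(JsonFileSink {
        path: PathBuf::from(dir).join(name),
        rotation,
    })
}
```

Passing a map instead of reading std::env directly keeps the function easy to test; a binary can build the map with std::env::vars().collect().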

Service-specific aliases such as GATEWAY_LOG_ANSI, AGENT_LOG_ANSI, or WORKFLOW_LOG_ANSI can remain during migration, but the long-term interface should converge on LIGHT_LOG_* variables shared by all Light-Fabric binaries.

Runtime Logging Control

Light-Fabric should support the Java control-plane behavior where an authorized operator changes log levels and logger targets from portal-view without restarting the service.

Rust can support this through tracing_subscriber::reload. Instead of installing a fixed EnvFilter, the runtime should wrap the filter in a reloadable layer and keep a reload handle in a shared logging controller:

application tracing event
        |
        v
reloadable EnvFilter
        |
        v
text/json formatting layers

The reloadable part is the filter only. A filter can change the global level and individual logger targets:

info
debug
info,light_gateway=debug
info,light_gateway=debug,light_pingora::security=trace
info,light_pingora::security=off

This matches the practical Java use case: enable debug or trace for one logger while keeping the rest of the service at info.

Dynamic Versus Restart-Only Settings

Setting | Dynamic | Reason
Global log level | yes | Updates the reloadable EnvFilter.
Per-target logger level | yes | Updates the reloadable EnvFilter.
Disable a target with target=off | yes | Updates the reloadable EnvFilter.
Console format text/json | no | Requires rebuilding formatter layers.
JSON file enabled/disabled | no | Requires adding or removing a writer layer.
JSON file directory/name/rotation | no | Requires replacing the appender and guard.
ANSI setting | no | Formatter setting; treat as startup-only.

Startup Precedence

The startup filter should use this precedence:

  1. RUST_LOG, when present.
  2. logging.filter from values.yml.
  3. The service default, such as info or light_workflow=debug,info.

This preserves existing RUST_LOG behavior for local and container deployments while allowing managed deployments to define a persistent default filter in config.
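
The precedence can be expressed as a small pure function. This is a sketch with hypothetical names (FilterSource, resolve_startup_filter); treating blank values as absent is an assumption. Returning the source alongside the filter lets get_logging_filter report where the startup filter came from.

```rust
/// Where the startup filter came from.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FilterSource {
    RustLog,
    ValuesYml,
    ServiceDefault,
}

/// Assumption: blank values are treated as absent.
fn non_blank(value: Option<&str>) -> Option<String> {
    value
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(str::to_string)
}

/// Apply the startup precedence: RUST_LOG, then logging.filter,
/// then the service default.
pub fn resolve_startup_filter(
    rust_log: Option<&str>,
    config_filter: Option<&str>,
    service_default: &str,
) -> (String, FilterSource) {
    if let Some(filter) = non_blank(rust_log) {
        return (filter, FilterSource::RustLog);
    }
    if let Some(filter) = non_blank(config_filter) {
        return (filter, FilterSource::ValuesYml);
    }
    (service_default.to_string(), FilterSource::ServiceDefault)
}
```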

Example values.yml:

logging.filter: info

More targeted example:

logging.filter: info,light_gateway=debug,light_pingora::security=trace

values.yml should not overwrite environment variables and should not be the normal path for day-to-day control-plane log-level changes. It should provide the baseline filter that the logging module reads at startup. If an operator wants to restore that baseline after a live debugging change, reload_modules can reload runtime/logging from the latest resolved values.

Changing config server values and then triggering reload is therefore a persistence/reset workflow, not the primary live-control workflow.

MCP Tools

The runtime MCP tool surface should expose logging control alongside existing runtime tools such as get_service_info, get_modules, and reload_modules.

Recommended tools:

Tool | Purpose
get_logging_filter | Return the current effective filter and startup source.
set_logging_filter | Validate and apply a new live filter immediately. This is the normal portal-view control path.
reload_modules with runtime/logging | Reset the live filter from the configured baseline in values.yml or remote values.

Example live filter update:

{
  "name": "set_logging_filter",
  "arguments": {
    "filter": "info,light_gateway=debug"
  }
}

Example reset from the configured baseline:

{
  "name": "reload_modules",
  "arguments": {
    "modules": ["runtime/logging"]
  }
}

The service response should include the active filter and status:

{
  "status": "success",
  "filter": "info,light_gateway=debug"
}

Invalid filters should be rejected without changing the current filter:

{
  "status": "error",
  "message": "invalid logging filter: ..."
}

Portal-View Flow

The portal-view control plane should follow the same route used for other runtime management tools:

portal-view
  -> controller
  -> portal-registry/runtime instance connection
  -> service runtime MCP handler
  -> logging control

The UI can offer:

  • a global level selector: off, error, warn, info, debug, trace
  • per-target rows for Rust targets such as light_gateway or light_pingora::security
  • an advanced filter text box for the full EnvFilter expression
  • an apply action that calls set_logging_filter
  • a reset action that reloads runtime/logging from the configured baseline
  • an optional "save as default" action that persists the filter to config server

The advanced filter is important because Rust logger targets are module paths, and operators may need precise target-level control during incident debugging.

The default portal-view workflow should be:

operator changes filter
  -> portal-view calls set_logging_filter
  -> service updates the reloadable EnvFilter immediately

Portal-view should not require this slower path for a temporary debug change:

operator changes filter
  -> portal-view updates config server
  -> portal-view calls reload_modules
  -> service reloads values.yml

That slower path is still useful when the operator intentionally wants the new filter to survive service restart or redeploy.

JSON Field Shape

JSON logs should be stable enough for both portal-view rendering and Splunk searches. Recommended fields include:

Field | Meaning
timestamp | event time in UTC
level | ERROR, WARN, INFO, DEBUG, or TRACE
target | Rust module or logical component
fields.message | human message
fields.service | logical service name, such as light-gateway
fields.instance_id | runtime instance, when known
fields.host_id | tenant/host context, when safe to log
fields.correlation_id | cross-service request correlation
fields.request_id | request identifier
fields.method | HTTP method, when applicable
fields.path | request path without sensitive query string
fields.status | response or operation status
fields.elapsed_ms | operation duration

Sensitive values must not be logged in either format. This includes tokens, API keys, session cookies, full authorization headers, raw secrets, and request or response payload fields that may contain PII.
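
For fields.path, one simple way to keep query strings out of the log is to cut the URI at the first ? or # before recording it. The helper name loggable_path is hypothetical, and this sketch does not attempt to redact sensitive values embedded in the path itself.

```rust
/// Return the request path with any query string or fragment removed,
/// so values such as `?token=...` never reach the log.
pub fn loggable_path(uri: &str) -> &str {
    let end = uri
        .find(|c: char| c == '?' || c == '#')
        .unwrap_or(uri.len());
    &uri[..end]
}
```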

Implementation Notes

Use tracing_subscriber as the formatting boundary. The JSON format requires the json feature:

tracing-subscriber = { version = "0.3", features = ["env-filter", "fmt", "json"] }

File output should use tracing_appender:

tracing-appender = "0.2"

If non-blocking file output is used, the returned WorkerGuard must be kept alive until process shutdown so buffered log lines are flushed.

The implementation should move per-binary init_tracing() logic into a shared runtime helper so light-gateway, light-agent, light-workflow, and light-deployer expose the same behavior.

For dynamic filtering, the shared helper should:

  1. Build the initial EnvFilter from RUST_LOG, logging.filter, or the service default.
  2. Install the filter through tracing_subscriber::reload.
  3. Keep the reload handle in a LoggingControl value.
  4. Register a reloadable module named runtime/logging with ModuleRegistry.
  5. Add runtime MCP handlers for get_logging_filter and set_logging_filter.
  6. Reject invalid filter expressions before swapping the active filter.
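
The validate-then-swap behavior in the steps above can be illustrated without any external crates. In this sketch the active filter is a plain string behind an RwLock, and validate_filter accepts only the simplified level and target=level forms shown in this document. Both are stand-ins: the real helper would presumably validate with EnvFilter::try_new and swap the filter through the tracing_subscriber reload handle. The LoggingControl methods shown here are assumptions, not the existing API.

```rust
use std::sync::RwLock;

const LEVELS: [&str; 6] = ["off", "error", "warn", "info", "debug", "trace"];

/// Validate a simplified filter: comma-separated `level` or
/// `target=level` directives. This is not the full EnvFilter grammar.
pub fn validate_filter(filter: &str) -> Result<(), String> {
    if filter.trim().is_empty() {
        return Err("invalid logging filter: empty".to_string());
    }
    for directive in filter.split(',') {
        let directive = directive.trim();
        let level = match directive.split_once('=') {
            Some((target, level)) if !target.is_empty() => level,
            Some(_) => {
                return Err(format!(
                    "invalid logging filter: missing target in '{directive}'"
                ))
            }
            None => directive,
        };
        if !LEVELS.iter().any(|l| l.eq_ignore_ascii_case(level)) {
            return Err(format!("invalid logging filter: unknown level '{level}'"));
        }
    }
    Ok(())
}

/// Holds the active filter. A real implementation would also hold the
/// reload handle and swap the EnvFilter through it.
pub struct LoggingControl {
    active: RwLock<String>,
}

impl LoggingControl {
    pub fn new(initial: &str) -> Self {
        Self { active: RwLock::new(initial.to_string()) }
    }

    /// Backs get_logging_filter.
    pub fn get_filter(&self) -> String {
        self.active.read().unwrap().clone()
    }

    /// Backs set_logging_filter: validate first, then swap.
    /// On error the active filter is left untouched.
    pub fn set_filter(&self, filter: &str) -> Result<String, String> {
        validate_filter(filter)?;
        *self.active.write().unwrap() = filter.trim().to_string();
        Ok(self.get_filter())
    }
}
```

The Ok and Err values map directly onto the "success" and "error" responses of the MCP tool.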

Recommendation

Start with configurable single-stream console output:

  • default LIGHT_LOG_FORMAT=text
  • production/Splunk option LIGHT_LOG_FORMAT=json
  • portal-view JSON parsing and human-friendly rendering
  • no direct Splunk dependency in the application

Add dual-sink JSON file output only for customers who cannot change the console stream to JSON but still require structured Splunk ingestion.

Release Workflow

Light-Fabric already has a release.sh script that builds Linux binaries, packages release archives, and creates or updates a GitHub release. The current release page uses a static note string, so operators can download artifacts but cannot easily see what changed between tags.

This design introduces a cascading polyrepo release orchestrated by light-workflow. It automates release notes, the changelog flow, binary generation, Docker image pushes, and downstream dependency propagation across both public (light-fabric, light-example-rs) and private (controller-rs, portal-service) repositories.

The implementation should start with a small dependency-free git-log script and leave room to adopt a more structured changelog generator later. It should also centralize Docker image publishing so binary archives and container images use the same release version.

Goals

  • Generate release notes from commits between the previous release tag and the current release tag.
  • Use the same generated notes for GitHub release creation and release updates.
  • Maintain a checked-in CHANGELOG.md so release history is visible without opening GitHub.
  • Preserve the current release.sh VERSION [-l|--local] [--skip-build] operator workflow.
  • Release Linux binary archives and Docker images with the same version tag and the same compiled Linux binaries.
  • Support Apple Silicon and Windows binary artifacts through CI runners that match those operating systems.
  • Add one repo-root build.sh for all Docker images while preserving app-level build script compatibility.
  • Allow manual edits before publishing when release notes need customer-facing cleanup.
  • Avoid requiring Conventional Commit messages on day one.

Non-Goals

  • Replace GitHub releases as the artifact distribution point.
  • Require every commit message to follow feat:, fix:, or another convention immediately.
  • Generate perfect marketing release notes without review.
  • Upload changelog files as separate release artifacts.
  • Remove existing app-level build.sh entrypoints immediately.
  • Build macOS binaries from a normal Linux Docker builder. Apple toolchains and SDKs require a macOS build runner.
  • Build Windows MSVC binaries from a normal Linux Docker builder. Use a Windows runner for the official Windows artifacts.
  • Publish Windows container images as part of the first release flow. Windows container images require Windows base images and a Windows container builder.

Current State

release.sh currently performs these steps:

  1. Parse release options and target version.
  2. Build light-agent, light-deployer, light-gateway, and light-workflow for Linux GNU and Linux musl targets.
  3. Package the binaries into dist/light-fabric-${VERSION}-${TARGET}.tar.gz.
  4. If --local is not set, create a GitHub release or upload artifacts to an existing release.

When creating a new GitHub release, the script uses a static note body:

Light-Fabric Linux release binaries

When the release already exists, the script uploads artifacts but does not update the release notes.

Docker image builds are currently handled by app-level scripts:

apps/light-agent/build.sh
apps/light-deployer/build.sh
apps/light-gateway/build.sh
apps/light-workflow/build.sh

Most app scripts use this shape:

./build.sh 0.3.0
./build.sh 0.3.0 --local
./build.sh 0.3.0 --no-cache

Those scripts build and optionally push networknt/<app>:${VERSION} and networknt/<app>:latest. light-deployer has a simpler custom script, so the app-level workflow is not completely consistent.

release.sh does not currently build or push Docker images. As a result, binary archives and Docker images can drift if they are released in separate manual steps or with different version strings.

Options

Option 1: GitHub Generated Notes

GitHub CLI can generate release notes:

gh release create "$VERSION" --generate-notes --notes-start-tag "$PREVIOUS_TAG"

This option requires the least code, and it works well for the GitHub release page. The tradeoff is that it does not update CHANGELOG.md in the repository unless an additional script calls the GitHub API and copies the generated notes back into the repo.

This option is useful as a fallback, but it should not be the primary design if the repo changelog is a required output.

Option 2: Dependency-Free Git-Log Script

A local script can generate release notes from the git history:

git log "${PREVIOUS_TAG}..${TARGET_REF}" --pretty=format:"- %s (%h)"

The script can write a markdown file and use that same file for both CHANGELOG.md and gh release create --notes-file.

This option is simple, reviewable, and fits the current Bash release script. It does not require new tooling or commit-message conventions. The initial output will be commit-oriented rather than category-oriented, but it can be improved incrementally.

Option 3: git-cliff

git-cliff can generate structured changelogs from Conventional Commit messages and custom templates. It can group entries into sections such as features, fixes, documentation, and breaking changes.

This gives the best long-term release notes, but it adds a release-tool dependency and works best only after the team consistently writes conventional commit messages.

This can be adopted later without changing the overall release flow: replace the internal git-log generator with a git-cliff invocation that writes the same release notes file.

Proposed Design

Start with Option 2.

Add a helper script:

scripts/release-notes.sh

The script should generate:

dist/release-notes-${VERSION}.md

It should optionally update:

CHANGELOG.md

release.sh should call the helper before publishing the GitHub release. The generated notes file becomes the release page source:

gh release create "$VERSION" "${ARCHIVES[@]}" \
  --title "$VERSION" \
  --notes-file "$NOTES_FILE"

For an existing release, the script should update the release body as well as uploading artifacts:

gh release edit "$VERSION" --notes-file "$NOTES_FILE"
gh release upload "$VERSION" "${ARCHIVES[@]}" --clobber

Use Docker as the official Linux release builder. The controlled Docker builder environment should compile Linux binaries once per Linux platform, export those binaries into dist/, and use the same binaries when assembling runtime Docker images. Local host builds remain useful for development, but they should not be the official release source for Linux artifacts.

Add a repo-root Docker image script:

build.sh

The root script should become the source of truth for building and publishing all Light-Fabric app images:

./build.sh 0.3.0
./build.sh 0.3.0 --local
./build.sh 0.3.0 --app light-agent
./build.sh 0.3.0 --image-org networknt --no-cache

The script should build these images by default:

networknt/light-agent:0.3.0
networknt/light-deployer:0.3.0
networknt/light-gateway:0.3.0
networknt/light-workflow:0.3.0

Unless --skip-latest is set, it should also tag and push:

networknt/light-agent:latest
networknt/light-deployer:latest
networknt/light-gateway:latest
networknt/light-workflow:latest

Existing app-level build scripts should remain, but they should become thin wrappers around the root script:

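# Resolve the root script relative to this wrapper so it works from any directory.
exec "$(dirname "$0")/../../build.sh" "$@" --app light-agent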

This preserves established operator muscle memory and removes duplicated Docker publish logic.

release.sh should call the root build.sh with the same VERSION. For Linux targets, the release should build once per platform and reuse the output:

Docker/BuildKit Linux builder
        |
        +-- dist/linux/<target>/bin/<app>       -> GitHub release tarballs
        |
        +-- dist/linux/<target>/bin/<app>       -> Docker runtime images

This makes one command release both binary artifacts and Docker images without compiling the same Linux binaries twice.

Changelog Format

CHANGELOG.md should use reverse chronological release sections:

# Changelog

## 0.3.0 - 2026-06-03

- Add JSON file logging support to `light-runtime` (abc1234)
- Wire runtime logging control into `light-gateway` (def5678)
- Document Splunk ingestion options for tracing (123abcd)

## 0.2.0 - 2026-05-20

- ...

The generated release notes file should contain the same section body:

## 0.3.0 - 2026-06-03

### Changes

- Add JSON file logging support to `light-runtime` (abc1234)
- Wire runtime logging control into `light-gateway` (def5678)
- Document Splunk ingestion options for tracing (123abcd)

### Artifacts

- `light-fabric-0.3.0-x86_64-unknown-linux-gnu.tar.gz`
- `light-fabric-0.3.0-x86_64-unknown-linux-musl.tar.gz`
- `light-fabric-0.3.0-aarch64-unknown-linux-gnu.tar.gz`
- `light-fabric-0.3.0-aarch64-unknown-linux-musl.tar.gz`
- `light-fabric-0.3.0-aarch64-apple-darwin.tar.gz`
- `light-fabric-0.3.0-x86_64-pc-windows-msvc.zip`
- `networknt/light-agent:0.3.0`
- `networknt/light-deployer:0.3.0`
- `networknt/light-gateway:0.3.0`
- `networknt/light-workflow:0.3.0`

The release notes file can include artifact names because it is used directly for the GitHub release page. CHANGELOG.md should focus on changes and can omit artifact details.
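
The section layout is simple enough to render with plain string formatting. The sketch below is written in Rust for consistency with the rest of this document, although the proposed helper is a Bash script; Commit and render_release_notes are hypothetical names.

```rust
/// One commit line from the git log range.
pub struct Commit<'a> {
    pub subject: &'a str,
    pub short_hash: &'a str,
}

/// Render the release notes section for one version.
pub fn render_release_notes(
    version: &str,
    date: &str,
    commits: &[Commit],
    artifacts: &[&str],
) -> String {
    let mut out = format!("## {version} - {date}\n\n### Changes\n\n");
    for commit in commits {
        out.push_str(&format!("- {} ({})\n", commit.subject, commit.short_hash));
    }
    out.push_str("\n### Artifacts\n\n");
    for artifact in artifacts {
        out.push_str(&format!("- `{artifact}`\n"));
    }
    out
}
```

The CHANGELOG.md variant would use the same heading and commit list and simply skip the artifacts block.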

Docker images should be listed in the GitHub release body even though they are published to Docker Hub instead of attached to the release page. This gives operators one place to see every artifact produced by a release.

Docker image platform variants should also be visible:

networknt/light-agent:0.3.0       linux/amd64, linux/arm64
networknt/light-deployer:0.3.0    linux/amd64, linux/arm64
networknt/light-gateway:0.3.0     linux/amd64, linux/arm64
networknt/light-workflow:0.3.0    linux/amd64, linux/arm64

Tag Range Selection

The release-notes script needs a deterministic commit range.

Inputs:

  • VERSION: target tag, for example 0.3.0 or v0.3.0
  • optional --from PREVIOUS_TAG
  • optional --target TARGET_REF

Default behavior:

  1. If --target is supplied, use it as the end of the range.
  2. Else if the VERSION tag exists locally, use VERSION.
  3. Else use HEAD.
  4. If --from is supplied, use it as the start of the range.
  5. Else find the newest semver-like tag before VERSION.
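  6. If no previous tag exists, include the full history up to the target. A range that starts at the first commit would omit that commit itself, so the script should run git log on the target ref alone in this case.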

For existing releases, this allows regenerating the notes for the exact tag. For new releases, this allows generating notes before the tag exists.

Recommended git command:

git log --no-merges --pretty=format:"- %s (%h)" "${PREVIOUS_TAG}..${TARGET_REF}"

If merge commits are important for the team, the script can add a --include-merges option.
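
Selecting "the newest semver-like tag before VERSION" needs a numeric comparison, because a plain string sort would place 0.10.0 before 0.9.0. A minimal sketch, assuming tags of the form 0.3.0 or v0.3.0 with no pre-release suffixes; parse_semver and previous_tag are hypothetical names, and the real script would do the equivalent in Bash.

```rust
/// Parse tags such as `0.3.0` or `v0.3.0` into (major, minor, patch).
/// Pre-release and build suffixes are not handled in this sketch.
pub fn parse_semver(tag: &str) -> Option<(u64, u64, u64)> {
    let mut parts = tag.strip_prefix('v').unwrap_or(tag).split('.');
    let major: u64 = parts.next()?.parse().ok()?;
    let minor: u64 = parts.next()?.parse().ok()?;
    let patch: u64 = parts.next()?.parse().ok()?;
    if parts.next().is_some() {
        return None;
    }
    Some((major, minor, patch))
}

/// Pick the newest semver-like tag that sorts strictly before `version`.
/// Tags that do not look like semver are ignored.
pub fn previous_tag<'a>(tags: &[&'a str], version: &str) -> Option<&'a str> {
    let current = parse_semver(version)?;
    tags.iter()
        .filter_map(|t| parse_semver(t).map(|v| (v, *t)))
        .filter(|(v, _)| *v < current)
        .max_by_key(|(v, _)| *v)
        .map(|(_, t)| t)
}
```

A None result corresponds to the "no previous tag exists" case in the default behavior above.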

Release Script Flow

The updated release.sh flow should be:

  1. Parse release options.
  2. Validate build and publish dependencies.
  3. Generate release notes into dist/release-notes-${VERSION}.md.
  4. Build Linux binaries with the Docker release builder unless --skip-build or --host-build is set.
  5. Package release archives.
  6. Build Docker images unless --skip-docker is set.
  7. Print generated archive names, Docker image names, and release notes path.
  8. If --local is set, stop before GitHub and Docker Hub publishing.
  9. If the GitHub release exists:
    • update the release body from the generated notes file
    • upload archives with --clobber
  10. If the GitHub release does not exist:
    • create it with --notes-file
    • upload archives during creation
  11. Push Docker images unless --skip-docker or --local is set.

The release notes should be generated before publishing, but the changelog update should be explicit. A release engineer may want to review and commit CHANGELOG.md before publishing.

Recommended flags:

--update-changelog       prepend the generated section to CHANGELOG.md
--notes-only             generate notes and optionally update changelog without building
--from TAG               override previous tag selection
--target REF             override release notes target ref
--include-merges         include merge commits in generated commit list
--skip-docker            release binary archives only
--docker-only            build and publish Docker images only
--skip-latest            publish VERSION image tags without updating latest
--host-build             use local cargo builds for Linux binaries instead of the Docker release builder
--app APP                restrict Docker image work to one app
--image-org ORG          Docker image namespace, default networknt
--platform PLATFORM      restrict Docker image platform, default linux/amd64,linux/arm64
--skip-macos             skip macOS binary artifacts in CI release mode
--skip-windows           skip Windows binary artifacts in CI release mode

--local should still build and package locally. It may generate release notes, but it should not call gh or push Docker images.

--docker-only should skip binary archive packaging and GitHub release asset upload. It should still generate release notes by default so the same version context is visible in the command output. If --local is also set, it should build images locally without pushing them.
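
The interaction between these flags is easier to review as a decision table in code. The sketch below covers only a subset of the flags and encodes one reading of the rules above. ReleaseFlags, ReleasePlan, and plan are hypothetical names, and rejecting --skip-docker together with --docker-only is an assumption rather than something this design specifies.

```rust
/// Subset of the release.sh flags that change which steps run.
#[derive(Debug, Default, Clone, Copy)]
pub struct ReleaseFlags {
    pub local: bool,
    pub skip_build: bool,
    pub skip_docker: bool,
    pub docker_only: bool,
    pub notes_only: bool,
}

/// Steps the release run will perform.
#[derive(Debug, PartialEq, Eq)]
pub struct ReleasePlan {
    pub generate_notes: bool,
    pub build_binaries: bool,
    pub package_archives: bool,
    pub build_images: bool,
    pub publish_github: bool,
    pub push_images: bool,
}

pub fn plan(flags: ReleaseFlags) -> Result<ReleasePlan, String> {
    // Assumption: these two flags contradict each other and are rejected.
    if flags.skip_docker && flags.docker_only {
        return Err("--skip-docker and --docker-only are mutually exclusive".to_string());
    }
    let work = !flags.notes_only;
    let archives = work && !flags.docker_only;
    let images = work && !flags.skip_docker;
    Ok(ReleasePlan {
        // Notes are generated in every mode.
        generate_notes: true,
        build_binaries: work && !flags.skip_build,
        package_archives: archives,
        build_images: images,
        // --local never calls gh or pushes images.
        publish_github: archives && !flags.local,
        push_images: images && !flags.local,
    })
}
```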

Automated Polyrepo Release Workflow

Because controller-rs, portal-service, and light-example-rs depend on light-fabric crates, they must be released sequentially in a Cascading Release Pipeline. Attempting to release them manually is error-prone.

We will dogfood light-workflow as our Release Orchestrator to automate this across the public and private repository boundaries.

The Release-Train Workflow Template

The light-workflow template acts as the overarching controller:

  1. Step 1: Upstream Release (light-fabric)

    • Task A: The workflow runs cargo release (or equivalent) to bump versions, tag, and publish the public light-fabric crates to crates.io.
    • Task B: The workflow invokes the build.sh script to compile Linux binaries and push light-fabric Docker images.
    • Task C: The workflow calls release.sh to generate the changelog and publish the GitHub Release page.
  2. Step 2: The Sync Barrier (Wait Step)

    • The workflow pauses for a short duration (e.g., 2 minutes) to ensure crates.io indexing has completed, preventing downstream builds from failing to find the new crate versions.
  3. Step 3: Downstream Dependency Propagation

    • The workflow clones controller-rs, portal-service, and light-example-rs.
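    • It updates the light-fabric crate dependencies to the newly published version. Running cargo update -p <crate> is enough when the new version satisfies the existing requirement in Cargo.toml; otherwise the version requirement itself must be bumped first.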
    • It pushes these changes to their respective main branches.
  4. Step 4: Parallel Downstream Releases

    • The workflow uses a parallel execution pattern to trigger releases for the downstream repositories simultaneously:
      • Branch 1 (controller-rs): Build private binaries, push private Docker images, and tag the private repo.
      • Branch 2 (portal-service): Build private binaries, push private Docker images, and tag the private repo.
      • Branch 3 (light-example-rs): Publish any downstream public crates, push public Docker images, and create the GitHub Release.

By wrapping the individual release.sh and build.sh scripts in a light-workflow execution, we gain stateful retries, full pipeline visibility, and automated propagation without exposing secure tokens on developer workstations.

Root Docker Build Script

The repo-root build.sh should own Linux Docker image build and push behavior for all apps.

Recommended app metadata:

App | Image | Dockerfile
light-agent | networknt/light-agent | apps/light-agent/docker/Dockerfile
light-deployer | networknt/light-deployer | apps/light-deployer/Dockerfile
light-gateway | networknt/light-gateway | apps/light-gateway/docker/Dockerfile
light-workflow | networknt/light-workflow | apps/light-workflow/docker/Dockerfile

The Docker build context should remain the workspace root because the Dockerfiles copy workspace-level Cargo.toml, Cargo.lock, crates, frameworks, and app directories.

The script should support:

build.sh [VERSION] [-l|--local] [--no-cache] [--app APP] [--image-org ORG] [--platform PLATFORM] [--skip-latest] [--binary-dir DIR]

Default behavior:

  1. Build all app images for linux/amd64 and linux/arm64.
  2. Tag each image as ${IMAGE_ORG}/${APP}:${VERSION}.
  3. Tag each image as ${IMAGE_ORG}/${APP}:latest unless --skip-latest is set.
  4. Use the Linux binaries produced by the release Docker builder instead of compiling Rust again inside each runtime image build.
  5. If --local is set, stop after local image builds.
  6. Otherwise push all generated tags and multi-platform manifests.

The script should print the full list of image tags it built and pushed. This list should be available to release.sh so the GitHub release notes can include the Docker image artifacts.
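
The tag list follows mechanically from the image namespace, the app list, the version, and --skip-latest. A sketch of that expansion, with hypothetical names APPS and image_tags (the real build.sh would produce the same list in Bash):

```rust
/// Apps built by default, as listed in the app metadata above.
pub const APPS: [&str; 4] = [
    "light-agent",
    "light-deployer",
    "light-gateway",
    "light-workflow",
];

/// Expand the image tags that one build.sh run would build and push.
pub fn image_tags(
    image_org: &str,
    apps: &[&str],
    version: &str,
    skip_latest: bool,
) -> Vec<String> {
    let mut tags = Vec::new();
    for app in apps {
        tags.push(format!("{image_org}/{app}:{version}"));
        if !skip_latest {
            tags.push(format!("{image_org}/{app}:latest"));
        }
    }
    tags
}
```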

When build.sh is called from release.sh, it should receive the exported binary directory explicitly:

./build.sh "$VERSION" --binary-dir "dist/build"

When build.sh is called directly without --binary-dir, it can either invoke the Docker release builder for the requested platforms or fall back to the current Dockerfile builder stages. The preferred direct behavior is to invoke the same Docker release builder so local and CI image builds stay aligned.

Recommended implementation:

  1. Add a release builder Dockerfile, for example:
docker/Dockerfile.release
  2. Add a builder target that compiles all apps for one Linux target and exports binaries:
docker buildx build \
  --file docker/Dockerfile.release \
  --target export-binaries \
  --platform linux/amd64 \
  --output type=local,dest=dist/build/linux-amd64 \
  .
  3. Repeat for linux/arm64 if multi-architecture Linux images are enabled.
  4. Package the exported binaries into GitHub release tarballs.
  5. Build runtime images from those exported binaries, not from another cargo build.

The runtime image Dockerfiles can use a binary-only context or a release target that copies prebuilt binaries:

COPY dist/build/linux-amd64/bin/light-gateway /app/light-gateway

For multi-platform images, docker buildx build --platform linux/amd64,linux/arm64 can publish one image tag with a manifest list. The important point is that each platform-specific image must use the binary built for that platform.

Cross-Platform Binary Strategy

"Build once" means build once per target platform, then reuse that output everywhere that platform can run. It does not mean one binary can serve every operating system and CPU architecture.

Recommended artifact matrix:

Artifact | Target | Builder
Linux x86_64 binary archive | x86_64-unknown-linux-gnu or x86_64-unknown-linux-musl | Docker/BuildKit Linux builder
Linux arm64 binary archive | aarch64-unknown-linux-gnu or aarch64-unknown-linux-musl | Docker/BuildKit Linux builder
Linux Docker image for Intel/AMD | linux/amd64 | Docker/BuildKit Linux builder
Linux Docker image for Apple Silicon Docker Desktop | linux/arm64 | Docker/BuildKit Linux builder
Apple Silicon macOS binary archive | aarch64-apple-darwin | macOS arm64 runner
Windows binary archive | x86_64-pc-windows-msvc | Windows runner

Apple Silicon has two different release meanings:

  • Docker image support for Apple Silicon machines is a Linux arm64 container image. Docker Desktop on Apple Silicon runs Linux containers, so linux/arm64 is the right image platform.
  • Native Apple Silicon binaries are macOS binaries targeting aarch64-apple-darwin. These should be built on a macOS runner, not inside a normal Linux Docker build.

Windows binaries and Windows container images are also separate concerns:

  • Windows binary archives should target x86_64-pc-windows-msvc and should be built on a Windows runner for the official release.
  • Windows container images require Windows base images and a Windows container builder. They should be treated as a later phase unless customers explicitly need Windows containers.

In CI, these builds can run at the same time as separate jobs:

linux-release:
  Docker/BuildKit builds Linux binaries and Linux Docker images.

macos-release:
  macOS runner builds aarch64-apple-darwin binaries.

windows-release:
  Windows runner builds x86_64-pc-windows-msvc binaries.

The release publish job should collect all artifacts and update the same GitHub release page. Docker Hub publishing should remain in the Linux release job because the Docker images are Linux container images.

CHANGELOG Update Strategy

The changelog update should be idempotent.

Rules:

  • If CHANGELOG.md does not exist, create it with a # Changelog heading.
  • If a section for VERSION already exists, replace that section.
  • If no section for VERSION exists, insert the new section immediately after the # Changelog heading.
  • Preserve older release sections as-is.
  • Never rewrite unrelated content below older release sections.

This makes rerunning the release script safe during release preparation.
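
The rules above can be sketched as a pure function over the changelog text. This is an illustration rather than the planned shell implementation: it assumes each release section starts with a "## VERSION" heading (the source does not fix the heading format), and the name upsert_changelog is hypothetical.

```rust
/// Inserts or replaces the section for `version` and returns the new changelog text.
/// `existing` is `None` when CHANGELOG.md does not exist yet.
/// Assumption: release sections start with a "## <version>" heading.
pub fn upsert_changelog(existing: Option<&str>, version: &str, body: &str) -> String {
    let heading = format!("## {version}");
    let text = existing.unwrap_or("# Changelog\n");
    let lines: Vec<&str> = text.lines().collect();

    // The new section: heading, blank line, body, trailing blank line.
    let mut section: Vec<String> = vec![heading.clone(), String::new()];
    section.extend(body.trim().lines().map(str::to_string));
    section.push(String::new());

    let mut out: Vec<String> = Vec::new();
    if let Some(start) = lines.iter().position(|l| l.trim_end() == heading.as_str()) {
        // Replace: the old section runs to the next "## " heading or the end of the file.
        let end = lines[start + 1..]
            .iter()
            .position(|l| l.starts_with("## "))
            .map_or(lines.len(), |i| start + 1 + i);
        out.extend(lines[..start].iter().map(|l| l.to_string()));
        out.extend(section);
        out.extend(lines[end..].iter().map(|l| l.to_string()));
    } else {
        // Insert directly after "# Changelog" (or at the top if the heading is missing).
        let at = lines
            .iter()
            .position(|l| l.trim_end() == "# Changelog")
            .map_or(0, |i| i + 1);
        out.extend(lines[..at].iter().map(|l| l.to_string()));
        if at > 0 {
            out.push(String::new());
        }
        out.extend(section);
        out.extend(
            lines[at..]
                .iter()
                .skip_while(|l| l.trim().is_empty())
                .map(|l| l.to_string()),
        );
    }
    // Normalize the ending so repeated runs produce byte-identical output.
    while out.last().map_or(false, |l| l.is_empty()) {
        out.pop();
    }
    out.join("\n") + "\n"
}
```

Calling the function twice with the same version and body returns the same text, which is the property that makes a release script rerun safe. Older sections are copied through unchanged.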

Manual Review Workflow

For a normal release:

./release.sh 0.3.0 --notes-only --update-changelog
git diff CHANGELOG.md dist/release-notes-0.3.0.md

The release engineer reviews and edits CHANGELOG.md if needed, commits it, then publishes:

./release.sh 0.3.0 --skip-build

If binaries also need to be rebuilt:

./release.sh 0.3.0

By default, the official Linux binaries and Linux Docker images should be built with Docker and published together. If a developer needs the old host-build path for local troubleshooting:

./release.sh 0.3.0 --host-build --local

If CI is producing all OS artifacts, the release job should collect the platform-specific archives before publishing:

dist/light-fabric-0.3.0-x86_64-unknown-linux-gnu.tar.gz
dist/light-fabric-0.3.0-aarch64-unknown-linux-gnu.tar.gz
dist/light-fabric-0.3.0-aarch64-apple-darwin.tar.gz
dist/light-fabric-0.3.0-x86_64-pc-windows-msvc.zip
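
Before publishing, the release job could check that every expected archive is present. The sketch below derives the archive paths from the naming pattern in the list above; the target list and the helper names archive_path and missing_archives are assumptions for illustration.

```rust
/// Release targets taken from the archive list above.
pub const RELEASE_TARGETS: [&str; 4] = [
    "x86_64-unknown-linux-gnu",
    "aarch64-unknown-linux-gnu",
    "aarch64-apple-darwin",
    "x86_64-pc-windows-msvc",
];

/// Builds the archive path for one target, following the naming shown above.
pub fn archive_path(version: &str, target: &str) -> String {
    // Windows archives use .zip; every other target uses .tar.gz.
    let ext = if target.contains("windows") { "zip" } else { "tar.gz" };
    format!("dist/light-fabric-{version}-{target}.{ext}")
}

/// Returns the expected archives that are absent from `present`.
pub fn missing_archives(version: &str, present: &[&str]) -> Vec<String> {
    RELEASE_TARGETS
        .iter()
        .map(|target| archive_path(version, target))
        .filter(|path| !present.contains(&path.as_str()))
        .collect()
}
```

A publish job can refuse to continue while missing_archives returns a non-empty list, so a failed macOS or Windows job does not produce a partial release page.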

If only Docker images need to be rebuilt and pushed with the same release tag:

./release.sh 0.3.0 --docker-only

If only one Docker image needs to be rebuilt locally:

./build.sh 0.3.0 --app light-gateway --local

If the release page already exists and only the notes need refreshing:

./release.sh 0.3.0 --notes-only
gh release edit 0.3.0 --notes-file dist/release-notes-0.3.0.md

The final implementation can make the last command part of release.sh when --local is not set.

GitHub Release Body

The GitHub release body should be generated from the same release notes file. For new releases:

gh release create "$VERSION" "${ARCHIVES[@]}" \
  --title "$VERSION" \
  --notes-file "$NOTES_FILE"

For existing releases:

gh release edit "$VERSION" --notes-file "$NOTES_FILE"
gh release upload "$VERSION" "${ARCHIVES[@]}" --clobber

This keeps release reruns predictable. Re-uploading artifacts should not leave stale release notes behind.

Future Conventional Commit Mode

If the team later adopts Conventional Commits, the helper script can switch from plain git log output to grouped output:

### Features

- add JSON tracing output

### Fixes

- preserve ANSI toggle in demo services

### Documentation

- document Splunk ingestion options

At that point, git-cliff is a good fit. The public contract can remain the same:

scripts/release-notes.sh VERSION --update-changelog

Only the internals of the generator change.
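
To illustrate what the grouped mode does, here is a minimal sketch of grouping commit subjects by Conventional Commit type. It is not how git-cliff works internally; the mapping of feat, fix, and docs to the three headings above and the function names are assumptions.

```rust
/// Splits "type(scope)!: description" into (type, description).
fn parse_subject(subject: &str) -> Option<(&str, &str)> {
    let (prefix, description) = subject.split_once(':')?;
    let prefix = prefix.trim().trim_end_matches('!');
    // Drop an optional "(scope)" suffix from the type.
    let kind = prefix.split('(').next().unwrap_or(prefix);
    let description = description.trim();
    if description.is_empty() {
        return None;
    }
    Some((kind, description))
}

/// Renders commit subjects as grouped changelog sections.
/// Subjects with an unknown type or no type prefix are skipped.
pub fn render_grouped(subjects: &[&str]) -> String {
    // Assumed mapping from commit type to section heading, in output order.
    const SECTIONS: [(&str, &str); 3] = [
        ("feat", "Features"),
        ("fix", "Fixes"),
        ("docs", "Documentation"),
    ];
    let mut out = String::new();
    for &(kind, heading) in SECTIONS.iter() {
        let items: Vec<&str> = subjects
            .iter()
            .filter_map(|s| parse_subject(*s))
            .filter(|(k, _)| *k == kind)
            .map(|(_, description)| description)
            .collect();
        if items.is_empty() {
            continue;
        }
        out.push_str(&format!("### {heading}\n\n"));
        for item in items {
            out.push_str(&format!("- {item}\n"));
        }
        out.push('\n');
    }
    out
}
```

Skipping unrecognized subjects is one possible choice; a real generator might instead collect them under a catch-all heading so nothing is dropped before review.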

Risks And Mitigations

| Risk | Mitigation |
| --- | --- |
| Commit messages are too noisy for customer-facing notes | Generate notes early, then review and edit before publishing. |
| Previous tag detection picks the wrong tag | Support --from TAG override and print the selected range. |
| Release script rerun duplicates changelog sections | Replace existing VERSION section instead of blindly prepending. |
| Existing GitHub release has stale notes after artifact upload | Always call gh release edit --notes-file for existing releases. |
| Local builds unexpectedly modify CHANGELOG.md | Require explicit --update-changelog for file mutation. |
| Binary archives publish but Docker push fails | Build and push images before or immediately after GitHub release publication, print clear recovery commands, and support --docker-only reruns. |
| Docker image tags drift from GitHub release version | Have release.sh call root build.sh with the same VERSION; do not ask operators to type the image version separately. |
| Full release builds take longer because Dockerfiles rebuild Rust | Use Docker/BuildKit as the release builder and make runtime images copy exported binaries instead of running another cargo build. |
| App-level build scripts diverge again | Convert them to wrappers around repo-root build.sh. |
| Apple Silicon image support is confused with macOS binary support | Document that Docker Desktop on Apple Silicon needs linux/arm64 images, while native macOS binaries need aarch64-apple-darwin. |
| Windows artifacts are expected from a Linux Docker build | Build official Windows MSVC binaries on a Windows runner; treat Windows container images as a separate later phase. |

Implementation Plan

  1. Add CHANGELOG.md with a short heading and no release entries.
  2. Add scripts/release-notes.sh with dependency-free git-log generation.
  3. Add idempotent changelog insertion or replacement.
  4. Add docker/Dockerfile.release or equivalent release-builder targets for Linux binaries.
  5. Add repo-root build.sh for all app Docker images and Linux image platforms.
  6. Convert app-level build scripts into compatibility wrappers.
  7. Update release.sh to generate dist/release-notes-${VERSION}.md.
  8. Update release.sh to call root build.sh with the same VERSION, unless --skip-docker is set.
  9. Update runtime image builds to copy binaries exported by the Docker release builder instead of compiling Rust again.
  10. Add CI matrix jobs for macOS Apple Silicon and Windows binary archives.
  11. Update publish_release() to use --notes-file for both new and existing releases.
  12. Add README release documentation for the new flags and review workflow.
  13. Validate changelog generation locally with:
./release.sh 0.3.0 --notes-only --update-changelog --local
git diff --check
  14. Validate Docker image builds locally with:
./build.sh 0.3.0 --local
./build.sh 0.3.0 --app light-gateway --local
  15. Validate combined local release packaging with:
./release.sh 0.3.0 --local
  16. Validate CI artifact collection for Linux, macOS, and Windows archives.
  17. Validate GitHub and Docker Hub publishing on a test tag or draft release before using it for a production release.

Light-Workflow Runner

Status

Proposed design.

light-workflow-runner is a tenant-side execution agent for workflow tasks that must run near tenant systems, tenant repositories, private tools, local gateways, sidecars, or sandboxed release workspaces. It is not a second workflow engine and it must not consume workflow start events directly.

The SaaS-owned light-workflow instance remains the authoritative orchestrator. It consumes workflow start events, creates workflow instances, persists task state, resolves workflow definitions, applies policy, and owns audit history. Tenant runners register with controller-rs, receive server-issued task leases, execute only the leased task, and report normalized results back to the control plane.

Problem

For SaaS deployments, Light owns the main workflow control plane. Tenants may run APIs, gateways, sidecars, deployers, and other services in their own networks. Some workflow tasks need to execute inside those tenant environments instead of inside the SaaS control plane.

Examples:

  • release workflows running in a prepared VM or sandbox with many repositories checked out,
  • command-line tasks that need local files or private repository access,
  • build and test tasks that need tenant-specific toolchains,
  • deployment tasks that need access to private clusters,
  • MCP servers or sidecars running only in the tenant network,
  • AI repair tasks that need to inspect and patch a local sandbox workspace.

Running multiple full light-workflow instances would create control-plane ambiguity:

  • more than one instance may see the same workflow start event,
  • tenant-side config can be changed through environment variables or local values.yml,
  • a tenant runtime could claim work outside its intended scope,
  • workflow definition loading and event consumption become hard to audit,
  • duplicate workflow starts require more complex idempotency and broker ACLs.

The platform needs a runner model that lets tenant-side services execute approved tasks without letting them own workflow orchestration.

Goals

  • Keep one authoritative SaaS light-workflow orchestrator for workflow start events and workflow state.
  • Add a tenant-side light-workflow-runner executable for command, sandbox, deployment, MCP, and local tool execution.
  • Register tenant runners through controller-rs.
  • Enforce task visibility with server-side leases, not runner-side local config.
  • Support release runners in prepared VMs or sandboxes with checked-out repos and approved toolchains.
  • Support per-tenant runner pools, execution profiles, capabilities, and network placement.
  • Let controller-rs periodically audit effective runtime configuration.
  • Reuse workflow-core task models and result contracts where possible.

Non-Goals

  • Do not create a second workflow orchestrator that consumes workflow start events.
  • Do not let tenant runners load arbitrary workflow definitions from local config.
  • Do not trust tenant-side environment variables or local values.yml as the enforcement boundary.
  • Do not expose all workflow tasks to all registered runners.
  • Do not let AI or command tasks bypass publish, signing, or human approval gates.

Current Runtime Boundary

The current light-workflow executable starts the workflow event consumer, task executor, and rule API in one process. The executor actively handles control-plane task types such as ask, assert, call, set, and switch.

workflow-core already models run.container, run.script, run.shell, and run.workflow. These task definitions are the right surface for runner-backed execution, but they still need a runtime executor boundary.

This design keeps the workflow model shared and adds a separate runner executable for effectful execution.

Domain Event
  |
  | consumed by SaaS control plane only
  v
light-workflow
  |
  | workflow instance, tasks, policy, audit
  v
controller-rs
  |
  | registration, leases, heartbeat, audit
  v
light-workflow-runner
  |
  | local command, sandbox, MCP, deploy, release tools
  v
Tenant Runtime Environment

The split is:

  • light-workflow: Authoritative orchestrator. It sees workflow start events, loads workflow definitions, creates tasks, computes effective task policy, and records state.
  • controller-rs: Runtime control plane. It authenticates runners, records runner capabilities, issues task leases, receives heartbeats, audits runtime config, and quarantines mismatched runners.
  • light-workflow-runner: Tenant-side execution agent. It claims only leased work, executes the assigned task in the approved environment, streams logs, and reports normalized results.
  • Sandbox or VM: Optional execution substrate used by the runner for high-risk tasks such as release builds, AI repair, scripts, and publishing.

The runner can run beside tenant APIs, gateways, sidecars, and deployers. It may also run in a prepared release VM or sandbox with approved tools and repository workspaces.

Event Visibility

Workflow start events should be visible only to the SaaS-owned light-workflow orchestrator.

Recommended flow:

  1. A domain event is published.
  2. The SaaS light-workflow consumer evaluates matching workflow definitions.
  3. It creates one workflow instance per matching definition.
  4. It creates tasks with runner requirements.
  5. controller-rs exposes only eligible task leases to registered runners.
  6. Runners execute leased tasks and return results.

This avoids duplicate starts and avoids tenant-side event subscription authorization problems.

If a future deployment requires separate workflow clusters, route start events by lane and enforce broker ACLs:

workflow.start.main
workflow.start.release
workflow.start.deployment
workflow.start.tenant.<tenantId>

Even with event lanes, the workflow database should enforce idempotency on a source-event key such as:

tenant_id + source_event_id + workflow_definition_id

For the SaaS model, task leases are a cleaner boundary than exposing start events to tenant runtimes.
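
The source-event idempotency key can be sketched with an in-memory set standing in for a unique constraint in the workflow database. The type names StartKey and StartRegistry are hypothetical; a real implementation would rely on the database to reject the duplicate insert.

```rust
use std::collections::HashSet;

/// Composite idempotency key: tenant_id + source_event_id + workflow_definition_id.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct StartKey {
    pub tenant_id: String,
    pub source_event_id: String,
    pub workflow_definition_id: String,
}

impl StartKey {
    pub fn new(tenant_id: &str, source_event_id: &str, workflow_definition_id: &str) -> Self {
        StartKey {
            tenant_id: tenant_id.to_string(),
            source_event_id: source_event_id.to_string(),
            workflow_definition_id: workflow_definition_id.to_string(),
        }
    }
}

/// In-memory stand-in for a unique index on the workflow instance table.
#[derive(Default)]
pub struct StartRegistry {
    seen: HashSet<StartKey>,
}

impl StartRegistry {
    /// Returns true when the start is new and a workflow instance should be created.
    /// A repeated key returns false, so a redelivered event does not start twice.
    pub fn try_start(&mut self, key: StartKey) -> bool {
        self.seen.insert(key)
    }
}
```

Because the workflow definition is part of the key, one event can still start several different workflows, while a redelivered event cannot start the same workflow twice.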

Runner Registration

A runner must register before it can claim work.

Registration should include:

{
  "runnerId": "release-runner-01",
  "tenantId": "tenant-a",
  "hostId": "host-a",
  "runnerKind": "release",
  "runnerPools": ["release"],
  "executionProfiles": ["release-sandbox"],
  "capabilities": [
    "git",
    "maven",
    "cargo",
    "docker",
    "event-importer"
  ],
  "imageDigest": "sha256:...",
  "configHash": "sha256:...",
  "commandAllowlistHash": "sha256:...",
  "workspacePolicy": "release-workspace-v1",
  "networkZone": "tenant-private",
  "version": "0.3.0"
}

controller-rs validates the registration against server-side runtime policy. If accepted, it creates a runner session and issues short-lived credentials for heartbeat and task claim operations.

Local runner config can request capabilities, but the server decides the effective capabilities. A runner cannot claim work merely because it sets an environment variable or local values.yml value.
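
One way to read "the server decides the effective capabilities" is as a set intersection between what the runner requests and what server-side policy allows. The sketch below assumes that reading; the function name and the example capability names are illustrative.

```rust
use std::collections::BTreeSet;

/// Effective capabilities are the requested ones that server-side policy also allows.
/// Anything the runner requests beyond the policy is dropped.
pub fn effective_capabilities(requested: &[&str], allowed: &[&str]) -> BTreeSet<String> {
    let allowed: BTreeSet<&str> = allowed.iter().copied().collect();
    requested
        .iter()
        .copied()
        .filter(|capability| allowed.contains(capability))
        .map(str::to_string)
        .collect()
}
```

With this shape, adding a capability to local config changes nothing unless the server-side policy already permits it.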

Task Lease Model

The task lease is the enforcement object. The runner should execute a task only when it has a valid lease issued by the control plane.

Lease example:

{
  "leaseId": "01970f5d-0000-7000-8000-000000000001",
  "tenantId": "tenant-a",
  "hostId": "host-a",
  "runnerId": "release-runner-01",
  "wfInstanceId": "release-2026.06.0",
  "taskId": "build-java-products",
  "taskType": "run.shell",
  "runnerPool": "release",
  "executionProfile": "release-sandbox",
  "capabilities": ["git", "maven"],
  "commandTemplateId": "light-fabric-release-build",
  "expiresAt": "2026-06-08T19:30:00Z",
  "nonce": "single-use-random-value"
}

Server-side validation must check:

  • runner session is active,
  • runner is not quarantined,
  • tenant and host match,
  • task runner pool matches registered pool,
  • task execution profile is allowed,
  • required capabilities are a subset of effective runner capabilities,
  • command template is approved,
  • lease is not expired,
  • lease has not already been used.

The runner reports task start, logs, progress, and final result using the lease. The control plane rejects reports that do not match the active lease.
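
The server-side checks listed above could be implemented as a single validation function that returns the first failure. This is a sketch under stated assumptions: the struct shapes, the error names, and the use of epoch seconds for expiry are illustrative and are not controller-rs types.

```rust
use std::collections::HashSet;

/// Server-side view of a registered runner (illustrative field set).
pub struct RunnerSession {
    pub active: bool,
    pub quarantined: bool,
    pub tenant_id: String,
    pub host_id: String,
    pub runner_pools: HashSet<String>,
    pub execution_profiles: HashSet<String>,
    pub effective_capabilities: HashSet<String>,
}

/// The lease fields that the checks below need.
pub struct Lease {
    pub tenant_id: String,
    pub host_id: String,
    pub runner_pool: String,
    pub execution_profile: String,
    pub capabilities: Vec<String>,
    pub command_template_id: String,
    pub expires_at: u64, // seconds since the Unix epoch
    pub used: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum LeaseError {
    SessionInactive,
    RunnerQuarantined,
    TenantOrHostMismatch,
    RunnerPoolMismatch,
    ProfileNotAllowed,
    MissingCapability(String),
    TemplateNotApproved,
    Expired,
    AlreadyUsed,
}

/// Runs the checks in the order listed above and returns the first failure.
pub fn validate_lease(
    lease: &Lease,
    session: &RunnerSession,
    approved_templates: &HashSet<String>,
    now: u64,
) -> Result<(), LeaseError> {
    if !session.active {
        return Err(LeaseError::SessionInactive);
    }
    if session.quarantined {
        return Err(LeaseError::RunnerQuarantined);
    }
    if lease.tenant_id != session.tenant_id || lease.host_id != session.host_id {
        return Err(LeaseError::TenantOrHostMismatch);
    }
    if !session.runner_pools.contains(&lease.runner_pool) {
        return Err(LeaseError::RunnerPoolMismatch);
    }
    if !session.execution_profiles.contains(&lease.execution_profile) {
        return Err(LeaseError::ProfileNotAllowed);
    }
    // Required capabilities must be a subset of the effective capabilities.
    for capability in &lease.capabilities {
        if !session.effective_capabilities.contains(capability) {
            return Err(LeaseError::MissingCapability(capability.clone()));
        }
    }
    if !approved_templates.contains(&lease.command_template_id) {
        return Err(LeaseError::TemplateNotApproved);
    }
    if now >= lease.expires_at {
        return Err(LeaseError::Expired);
    }
    if lease.used {
        return Err(LeaseError::AlreadyUsed);
    }
    Ok(())
}

/// Small helper for building string sets in examples.
pub fn set(items: &[&str]) -> HashSet<String> {
    items.iter().map(|s| s.to_string()).collect()
}
```

Returning a distinct error per check gives the audit trail a precise rejection reason. The same function can be reused when the runner reports start, logs, or results, so reports against an expired or mismatched lease are rejected the same way.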

Task Routing

light-workflow should execute pure control-plane tasks locally:

ask
assert
set
switch
context merge
workflow branching
workflow persistence
approved internal call tasks

light-workflow-runner should execute effectful or tenant-local tasks:

run.shell
run.script
run.container
call.mcp to tenant-local servers
deployment commands
release build and test commands
AI repair with filesystem access
browser automation
external tool processes

Some call.* tasks can run on either side. The routing decision should come from effective task policy:

| Task | Default Runtime | Notes |
| --- | --- | --- |
| call.http internal SaaS API | light-workflow | Use host-side service credentials. |
| call.http tenant-private API | runner | Needs tenant network access. |
| call.mcp approved SaaS gateway | light-workflow | Gateway enforces tool access. |
| call.mcp tenant-local server | runner | Local sidecar or private MCP server. |
| call.agent no tools | light-workflow | Bounded model call. |
| call.agent with file/tools | runner | Requires sandbox/tool policy. |

Agent Call Placement

Workflow agent calls need an explicit placement decision. The same workflow can use more than one agent execution mode, but the placement must come from server-side policy and task metadata, not tenant-side local config.

Use three agent execution modes.

Native Workflow Agent

A native call: agent task stays in the SaaS-owned light-workflow process. This is the current bounded agent task model: light-workflow resolves the portal agent, skill, and tool metadata, builds a constrained prompt from workflow context, calls the configured model provider, validates structured output, and continues the workflow.

Use native workflow agents for bounded reasoning:

  • classify a request or command result,
  • summarize API responses or logs,
  • choose a workflow branch,
  • draft a customer-facing explanation,
  • decide whether human review is required,
  • produce JSON output that must match a schema.

Native workflow agents should not receive filesystem access, local network access, release secrets, or dynamic tool execution. API orchestration should remain explicit workflow tasks such as call.http, call.mcp, assert, switch, and ask.

By default, native workflow agents use SaaS-approved model providers and model credentials managed by the Light control plane. Tenant-private repository content, tenant-local logs, local files, and private network data should not be sent to this path unless the tenant policy explicitly allows it.

Runner Agent

Runner agents execute through light-workflow-runner under a server-issued task lease. Use this mode when the agent needs access to tenant-local state or effectful tools:

  • checked-out repositories,
  • command output plus working directory inspection,
  • private tenant network access,
  • local MCP servers,
  • sandbox tools,
  • AI repair of source code,
  • test reruns,
  • branch or pull-request creation.

The main light-workflow instance still creates the task and records the result. controller-rs issues a lease only to a runner whose effective capabilities, runner pool, execution profile, command allowlist, workspace policy, and audit state match the task requirements.

Runner agent lease example:

{
  "taskType": "call.agent",
  "agentPlacement": "runner",
  "runnerPool": "release",
  "executionProfile": "release-sandbox",
  "sandboxMode": "per-agent-call",
  "sandboxProvider": "cubesandbox",
  "modelProviderScope": "tenant",
  "modelProviderRef": "tenant-openai-eastus",
  "credentialRef": "runner-secret://llm-provider",
  "dataBoundary": "tenant-network",
  "allowedTools": ["git", "maven", "cargo"],
  "workspaceAccess": "copy-on-write-release-workspace",
  "networkPolicy": "release-egress",
  "secretPolicy": "none",
  "maxRepairAttempts": 2,
  "requiresHumanApprovalBefore": ["publish", "sign", "tag"]
}

The runner agent can inspect files and propose or apply bounded patches inside the approved workspace. It must not publish artifacts, sign releases, push final tags, read unrestricted secrets, or expand its own permission scope.

By default, runner agents use tenant-approved model providers and tenant-owned credentials. This keeps private workspace data and private network context inside the tenant boundary and avoids exposing SaaS model credentials to tenant-side runtimes.

Runner Agent Sandbox Isolation

The runner itself is a tenant-side execution agent. For stronger isolation, the runner can launch the agent task inside a separate sandbox such as Cube Sandbox, a VM, or a Kubernetes Job. This should be a tenant-selectable policy because the runner is deployed in the tenant namespace, but the effective choice must still be recorded and enforced by the control plane.

Recommended isolation levels:

| Isolation Level | Use Case | Default Policy |
| --- | --- | --- |
| no sandbox | bounded model call with no tools, no file access, and no private network mutation | allowed for low-risk tasks |
| workflow-session sandbox | release build/test/diagnosis that needs the same checkout and cache across steps | useful for release workflows |
| per-agent-call sandbox | AI repair, arbitrary code inspection, generated patches, dynamic tools, or untrusted scripts | preferred for high-risk agent tasks |
| per-publish sandbox | signing, publish tokens, artifact upload, and final tag push | required for high-value secrets |

For a release workflow, the runner should usually orchestrate a separate per-agent-call sandbox for AI repair. The runner injects only the leased workspace, approved tools, network policy, and task-scoped secrets. It collects logs, artifacts, patches, and structured output, then destroys or freezes the sandbox according to retention policy.

This creates a layered boundary:

SaaS light-workflow
  -> controller-rs task lease
  -> tenant light-workflow-runner
  -> per-agent sandbox
  -> model, tools, files, network

Tenants may choose Cube Sandbox, VM isolation, Kubernetes Job isolation, or no sandbox for allowed profiles. Runner registration must advertise supported sandbox providers and modes. If a task requires per-agent-call isolation and the runner cannot provide it, controller-rs must not issue the lease.

Local runner config can select among tenant-approved profiles, but it cannot weaken a task requirement. The lease contains the final effective sandboxMode, sandboxProvider, workspace, network, tool, and secret policy. Heartbeat and audit snapshots should prove the runner is still operating under that profile.

Agent Service

Containerized light-agent services should be invoked explicitly. They are the right runtime for interactive or independently scaled agents:

  • chat and session memory,
  • dynamic tools/list and tools/call loops,
  • long-lived specialist agents,
  • independently deployed model/tool runtime,
  • local catalog caching.

Do not silently change native call: agent to call a containerized light-agent service. Use an explicit contract such as call: agent-service or call: agent with mode: service so operators can audit which runtime path was used.

Model Provider Boundary

Agent placement and model-provider placement should be decided together.

Recommended defaults:

native call: agent in SaaS light-workflow
  -> SaaS-approved model provider
  -> SaaS workflow context data boundary

leased runner agent in tenant workflow runner
  -> tenant-approved model provider
  -> tenant network/workspace data boundary

containerized light-agent service
  -> service-owned or tenant-approved model provider
  -> explicit service data boundary

The default SaaS model is useful for bounded reasoning over workflow-safe context, such as classification, summaries, branch decisions, and structured JSON output. It should not be the default path for tenant-local source code, private command logs, local files, or private network data.

The default runner model is useful when the task needs tenant-local context. The runner should resolve model credentials from tenant-controlled secret stores or tenant-approved local provider configuration. SaaS model credentials must not be sent to tenant runners.

The control plane should still make this policy-driven instead of hard-coding it. Some tenants may require every agent call, including bounded summaries, to use their own provider or regional model endpoint. In that case, the workflow task should be routed to a runner or to an approved tenant model gateway even if the reasoning itself is small.

Lease examples:

{
  "agentPlacement": "workflow",
  "modelProviderScope": "saas",
  "modelProviderRef": "light-managed-default",
  "credentialRef": "saas-secret://llm-provider",
  "dataBoundary": "saas-workflow-context"
}

{
  "agentPlacement": "runner",
  "modelProviderScope": "tenant",
  "modelProviderRef": "tenant-openai-eastus",
  "credentialRef": "runner-secret://llm-provider",
  "dataBoundary": "tenant-network"
}

Recommended placement rule:

bounded reasoning over workflow context -> native call: agent in light-workflow
agent needs files, tools, or private network -> leased runner agent
interactive session or dynamic tool loop -> containerized light-agent service

For release workflows, use native call: agent to summarize and classify a failed command. Use a runner agent for repo inspection, patch generation, test rerun, and pull-request creation. Human approval remains required before publish, signing, or final tag creation.
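
The placement rule can be sketched as a small decision function. The type names are hypothetical, and the precedence between rules (service first, then runner) is an assumption, since the rule above does not say what happens when an agent matches more than one line. Tenant policy overrides, such as requiring a tenant-owned provider for every call, are not modeled.

```rust
/// Where an agent call executes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AgentPlacement {
    Workflow, // native call: agent in light-workflow
    Runner,   // leased runner agent
    Service,  // containerized light-agent service
}

/// What the agent task needs, as resolved by server-side policy.
#[derive(Debug, Clone, Copy, Default)]
pub struct AgentNeeds {
    pub interactive_session: bool,
    pub dynamic_tool_loop: bool,
    pub files: bool,
    pub tools: bool,
    pub private_network: bool,
}

/// Applies the recommended placement rule.
/// Assumption: interactive or dynamic-tool needs take precedence over runner needs.
pub fn place_agent(needs: AgentNeeds) -> AgentPlacement {
    if needs.interactive_session || needs.dynamic_tool_loop {
        AgentPlacement::Service
    } else if needs.files || needs.tools || needs.private_network {
        AgentPlacement::Runner
    } else {
        AgentPlacement::Workflow
    }
}
```

Keeping the decision in one server-side function makes it straightforward to record the chosen placement in the lease and in the audit history.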

Effective Policy

Workflow definitions and tasks can request runner execution through metadata, but the control plane computes the effective policy.

Workflow-level example:

document:
  dsl: "1.0.3"
  namespace: release
  name: java-release
  version: "0.1.0"
  metadata:
    lightWorkflow:
      runner:
        runnerPool: release
        executionProfile: release-sandbox
        capabilities:
          - git
          - maven
          - docker

Task-level example:

do:
  - build-java:
      run:
        shell:
          command: ./release.sh
          arguments:
            - "${ .release.version }"
      metadata:
        lightWorkflow:
          runner:
            runnerPool: release
            commandTemplateId: light-fabric-release-build
          security:
            sandbox:
              mode: workflow-session

Runtime policy resolution:

  1. Workflow definition requests a runner profile.
  2. Task metadata can request stricter handling.
  3. Tenant policy sets the maximum tenant privilege.
  4. SaaS service policy sets global allowed runner types.
  5. Operator-approved profile definitions set allowed commands, networks, images, mounts, sandbox modes, sandbox providers, model provider scopes, data boundaries, and secrets.
  6. controller-rs validates actual registered runner state.
  7. The task lease contains the final allowed execution scope.

A task may request stricter isolation than the workflow, but it must not weaken the effective policy.
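
The "stricter only" rule can be expressed by ordering the sandbox modes and taking the maximum. The sketch below assumes the isolation levels from the earlier table are ordered from weakest to strongest in the order they are listed, which the source does not state explicitly; the names are hypothetical.

```rust
/// Sandbox modes, declared from weakest to strongest isolation.
/// Assumption: this ordering matches the isolation-level table.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum SandboxMode {
    NoSandbox,
    WorkflowSession,
    PerAgentCall,
    PerPublish,
}

/// A task may raise the workflow's isolation level but never lower it.
pub fn effective_mode(workflow: SandboxMode, task: Option<SandboxMode>) -> SandboxMode {
    match task {
        Some(requested) => workflow.max(requested),
        None => workflow,
    }
}

/// A lease is issued only if the runner advertises the required mode.
/// Assumption: support is matched exactly, and "no sandbox" needs no support.
pub fn can_issue_lease(required: SandboxMode, runner_supported: &[SandboxMode]) -> bool {
    required == SandboxMode::NoSandbox || runner_supported.contains(&required)
}
```

A task that asks for a weaker mode than the workflow is simply ignored by effective_mode, so a tenant-side edit to task metadata cannot reduce isolation.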

Runtime Configuration Audit

Tenant-controlled local configuration cannot be the source of truth. A runner can load local config for its own startup, but the server must verify and audit the effective runtime state.

controller-rs should audit at three points.

Startup Admission

On registration, the runner reports:

  • binary version,
  • image digest or VM image ID,
  • effective config hash,
  • command allowlist hash,
  • enabled execution profiles,
  • runner pools,
  • mounted workspace paths,
  • supported sandbox modes and providers,
  • sandbox provider and template,
  • allowed model provider scopes,
  • network zone,
  • secret policy,
  • host and tenant identity.

controller-rs compares this report with approved server-side policy before allowing claims.

Heartbeat

Each heartbeat should include:

{
  "runnerId": "release-runner-01",
  "sessionId": "01970f5d-1111-7000-8000-000000000001",
  "status": "ready",
  "configHash": "sha256:...",
  "commandAllowlistHash": "sha256:...",
  "imageDigest": "sha256:...",
  "activeLeases": 1,
  "timestamp": "2026-06-08T19:00:00Z"
}

If a hash changes unexpectedly, the controller marks the runner suspicious and stops issuing new leases.
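
Heartbeat drift detection can be sketched as a comparison between the fingerprint recorded at admission and the one reported in each heartbeat. The type and function names are hypothetical; the compared fields are the three hash fields from the heartbeat example.

```rust
/// Hashes recorded at admission and repeated in every heartbeat.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RuntimeFingerprint {
    pub config_hash: String,
    pub command_allowlist_hash: String,
    pub image_digest: String,
}

impl RuntimeFingerprint {
    pub fn new(config_hash: &str, command_allowlist_hash: &str, image_digest: &str) -> Self {
        RuntimeFingerprint {
            config_hash: config_hash.to_string(),
            command_allowlist_hash: command_allowlist_hash.to_string(),
            image_digest: image_digest.to_string(),
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RunnerState {
    Ready,
    Suspicious,
}

/// Lists the heartbeat fields whose values differ from the admitted ones.
pub fn drifted_fields(
    admitted: &RuntimeFingerprint,
    reported: &RuntimeFingerprint,
) -> Vec<&'static str> {
    let mut drifted = Vec::new();
    if admitted.config_hash != reported.config_hash {
        drifted.push("configHash");
    }
    if admitted.command_allowlist_hash != reported.command_allowlist_hash {
        drifted.push("commandAllowlistHash");
    }
    if admitted.image_digest != reported.image_digest {
        drifted.push("imageDigest");
    }
    drifted
}

/// Any drift marks the runner suspicious; only ready runners receive new leases.
pub fn state_after_heartbeat(
    admitted: &RuntimeFingerprint,
    reported: &RuntimeFingerprint,
) -> RunnerState {
    if drifted_fields(admitted, reported).is_empty() {
        RunnerState::Ready
    } else {
        RunnerState::Suspicious
    }
}

pub fn may_issue_lease(state: RunnerState) -> bool {
    state == RunnerState::Ready
}
```

Reporting which fields drifted, rather than a bare boolean, gives the resulting audit event enough detail for an operator to tell a config change from an image change.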

Periodic Deep Audit

Periodically, controller-rs should request an effective runtime snapshot from the runner and compare it with the approved policy. For high-risk runners, the snapshot should include command allowlist, sandbox template, mount list, network policy, and secret bindings.

On mismatch:

  1. Mark runner as quarantined.
  2. Revoke active claim credentials.
  3. Stop issuing new leases.
  4. Emit a runtime audit event.
  5. Create an operator task if active work may be affected.

Audit is not the only enforcement mechanism. It detects drift after admission. The task lease remains the primary runtime authorization boundary.

Release Runner Mode

A release runner is a specialized light-workflow-runner profile.

It can run in:

  • a prepared VM,
  • a Cube Sandbox session,
  • a Kubernetes Job,
  • a controlled bare-metal release host.

Recommended default for release workflows:

  • one workflow-session sandbox or VM workspace for checkout, build, test, and package steps,
  • per-agent-call sandbox isolation for AI repair, source inspection, generated patches, and test reruns driven by an agent,
  • per-task sandbox isolation for publishing, signing, and tasks with release secrets,
  • clean checkout inside the runner rather than writable host repository mounts,
  • artifact export through controlled storage,
  • AI repair limited to sandbox workspace changes or branch/PR creation.

Writable host mounts should be avoided for AI repair and release commands. If host repositories must be mapped, default to read-only mounts and copy the repo into a runner-owned working directory before mutation.

Runner API

The first runner API can be small.

POST /runner/register
POST /runner/heartbeat
POST /runner/claim
POST /runner/task/{leaseId}/started
POST /runner/task/{leaseId}/log
POST /runner/task/{leaseId}/complete
POST /runner/task/{leaseId}/fail
POST /runner/audit-snapshot
POST /runner/drain

controller-rs can expose these APIs directly or mediate them over its existing persistent connection model. For private tenant networks, outbound runner registration and polling is preferable to inbound SaaS calls into the tenant environment.

The claim response should include only the task payload needed for execution, not the full workflow definition.

Command Result Contract

Runner results should use a normalized command result so light-workflow, human tasks, AI diagnosis, and audit do not depend on raw console parsing.

{
  "leaseId": "01970f5d-0000-7000-8000-000000000001",
  "taskId": "build-java-products",
  "runnerId": "release-runner-01",
  "attempt": 1,
  "status": "failed",
  "exitCode": 1,
  "startedAt": "2026-06-08T19:10:00Z",
  "completedAt": "2026-06-08T19:18:30Z",
  "summary": "Maven test failure in db-provider",
  "stdoutRef": "artifact://release/2026.06.0/build/stdout.log",
  "stderrRef": "artifact://release/2026.06.0/build/stderr.log",
  "artifactRefs": [
    "artifact://release/2026.06.0/build/surefire-reports.zip"
  ],
  "changedFiles": [],
  "aiDiagnosisAllowed": true
}

The runner should stream logs in chunks and store full logs as artifacts. Workflow context should keep summaries and artifact references, not unbounded stdout or stderr.

Security Requirements

  • Runners authenticate to controller-rs with tenant-scoped credentials.
  • Task leases are short-lived, single-use, and scoped to one task.
  • Runners never see workflow start events unless they are explicitly deployed as trusted orchestrators in a non-SaaS topology.
  • Runners receive task payloads, not complete workflow definitions.
  • Server-side policy decides runner pools, execution profiles, capabilities, commands, networks, mounts, sandbox modes, model provider scopes, data boundaries, and secrets.
  • SaaS model credentials must not be sent to tenant-side runners.
  • Tenant-private source code, local files, and private command logs should use tenant-approved model providers unless tenant policy explicitly allows SaaS model processing.
  • Secrets are task-scoped and never included in logs or AI prompts.
  • AI repair runs only in approved runner profiles and cannot publish or sign.
  • Publish and signing tasks require human approval and per-task isolation.
  • Runtime drift causes quarantine and lease revocation.
  • All task results include runner identity, effective policy version, command template ID, artifact references, and approval references.

Implementation Plan

Phase 1: Split Runner Boundary

  • Create apps/light-workflow-runner.
  • Reuse workflow-core models for run.* task payloads.
  • Define runner registration, heartbeat, claim, and result APIs.
  • Add server-side runner pools and execution profiles.
  • Keep the existing light-workflow event consumer as the only workflow start consumer.

Phase 2: Leased Run Task Execution

  • Implement run.shell execution in the runner.
  • Add command template allowlists.
  • Add normalized command result output.
  • Add log streaming and artifact references.
  • Route eligible run.shell tasks from light-workflow to registered runners through controller-rs.

Phase 3: Sandbox and Workspace Policy

  • Add workflow-session and per-task sandbox modes.
  • Support release VM or Cube Sandbox runner profiles.
  • Add workspace mount and checkout policies.
  • Add network and secret policy enforcement.
  • Add runtime config hash reporting.

Phase 4: Audit and Quarantine

  • Add periodic effective runtime snapshots.
  • Compare runner-reported config with server-approved policy.
  • Quarantine drifted runners.
  • Revoke active claim credentials.
  • Emit audit events and operator tasks.
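A minimal sketch of the Phase 4 audit step, assuming the runner reports a config hash and the server holds the approved hash. `Runner`, `RunnerState`, and `audit_runner` are illustrative names, not existing types.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum RunnerState {
    Active,
    Quarantined,
}

pub struct Runner {
    pub runner_id: String,
    pub state: RunnerState,
    pub active_leases: Vec<String>,
}

/// Compares the runner-reported hash with the approved one. On drift, the
/// runner is quarantined and its active lease IDs are returned for revocation.
pub fn audit_runner(runner: &mut Runner, approved_hash: &str, reported_hash: &str) -> Vec<String> {
    if approved_hash == reported_hash {
        return Vec::new();
    }
    runner.state = RunnerState::Quarantined;
    // Take the leases out so the runner holds none after quarantine.
    std::mem::take(&mut runner.active_leases)
}
```

Emitting the audit event and operator task would follow from the returned lease list; that part is left out of the sketch.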

Phase 5: Release and AI Workflows

  • Add release-runner profile.
  • Execute Java and Rust release build/test tasks through the runner.
  • Add ConfigProfile manifest and event-importer dry-run tasks.
  • Add AI failure analysis and bounded repair loops.
  • Gate publish and signing tasks behind human approval and per-task isolation.

Open Questions

  • Should runner registration and task claim be direct HTTP APIs, WebSocket messages through controller-rs, or both?
  • Where should long-running task logs and artifacts be stored for SaaS deployments?
  • How should the control plane attest VM-based runners that do not have a container image digest?
  • Should command templates be stored in workflow definitions, tenant policy, or a separate runner policy registry?
  • How much of the existing TaskExecutor should move into shared crates so light-workflow and light-workflow-runner can share evaluation and result handling without sharing orchestration responsibilities?

Recommendation

Create light-workflow-runner as a separate executable and keep light-workflow as the single SaaS-owned orchestrator. The runner should be a leased execution agent, not a workflow starter or workflow definition loader.

This gives tenants a practical way to run workflow tasks near their own APIs, gateways, repositories, clusters, and sandboxes while keeping workflow start events, policy decisions, task visibility, and audit under the SaaS control plane.

Asymmetric Decryptor

asymmetric-decryptor decrypts RSA-encrypted configuration values.

It is used by config-loader when a service loads encrypted values that use the CRYPT:RSA: prefix. The crate supports RSA private keys in PKCS#8 and PKCS#1 PEM formats and decrypts payloads with RSA-OAEP using SHA-256.

Main Types

  • AsymmetricDecryptor: owns the RSA private key and decrypts supported payloads.
  • AsymmetricError: error type for prefix, base64, key, and decrypt failures.
  • CRYPT_RSA_PREFIX: the required CRYPT:RSA: payload prefix.

Usage

use asymmetric_decryptor::AsymmetricDecryptor;

// `private_key_pem` is a PKCS#8 or PKCS#1 PEM string supplied by the caller.
let decryptor = AsymmetricDecryptor::from_pem(private_key_pem)?;
let plaintext = decryptor.decrypt("CRYPT:RSA:...")?;

Notes

This crate is intentionally small. It does not fetch keys, rotate keys, or perform configuration merging. Those concerns belong to config-loader and the runtime layer.

Config Loader

config-loader loads, merges, resolves, and decrypts service configuration.

It provides the common configuration behavior used by fabric services and runtime modules. Configuration can be loaded from YAML, JSON, or TOML files, merged across layers, expanded from values maps, and decrypted when encrypted values are present.

Main Types

  • ConfigLoader: loads files and resolves ${key:default} style values.
  • ConfigManager<T>: stores hot-swappable typed configuration behind an atomic reference.
  • ConfigError: shared error type for IO, parse, decrypt, and conversion failures.
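The hot-swap idea behind ConfigManager<T> can be illustrated with the standard library alone. This sketch uses `RwLock<Arc<T>>` as a stand-in; the crate's actual atomic reference type is not shown in this document, and `HotConfig` is a hypothetical name.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for a hot-swappable typed config holder.
pub struct HotConfig<T> {
    current: RwLock<Arc<T>>,
}

impl<T> HotConfig<T> {
    pub fn new(initial: T) -> Self {
        Self {
            current: RwLock::new(Arc::new(initial)),
        }
    }

    /// Returns a cheap snapshot. Readers keep their snapshot even if the
    /// config is replaced afterwards.
    pub fn get(&self) -> Arc<T> {
        let guard = self.current.read().expect("config lock poisoned");
        Arc::clone(&*guard)
    }

    /// Swaps in a new config for all future readers.
    pub fn replace(&self, next: T) {
        *self.current.write().expect("config lock poisoned") = Arc::new(next);
    }
}
```

Handing out `Arc<T>` snapshots means a request that started under the old config finishes with a consistent view.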

Resolution Model

The loader supports:

  • merging multiple config files in order
  • external overlays through LIGHT_RS_CONFIG_DIR
  • whole-value variable replacement
  • embedded variable expansion inside strings
  • typed deserialization through Serde
  • symmetric encrypted values through symmetric-decryptor
  • asymmetric encrypted values through asymmetric-decryptor
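Embedded variable expansion can be sketched as a single pass over the string. The `expand` function below is an illustration, not the loader's implementation. Leaving an unresolved placeholder without a default unchanged is an assumption of this sketch.

```rust
use std::collections::HashMap;

/// Expands `${key}` and `${key:default}` placeholders from a values map.
pub fn expand(input: &str, values: &HashMap<String, String>) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(start) = rest.find("${") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        match after.find('}') {
            Some(end) => {
                let token = &after[..end];
                // Split on the first ':' so defaults may contain ':' (for example URLs).
                let (key, default) = match token.split_once(':') {
                    Some((k, d)) => (k, Some(d)),
                    None => (token, None),
                };
                match values.get(key).map(String::as_str).or(default) {
                    Some(v) => out.push_str(v),
                    None => {
                        // Assumption: keep the placeholder when nothing resolves it.
                        out.push_str("${");
                        out.push_str(token);
                        out.push('}');
                    }
                }
                rest = &after[end + 1..];
            }
            None => {
                // Unterminated placeholder: copy the remainder as-is.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

Whole-value replacement differs in that the resolved value can keep its native type (number, boolean, map) instead of being spliced into a string; this sketch covers only the string case.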

Usage

use config_loader::ConfigLoader;
use std::collections::HashMap;

// `MyConfig` is the caller's own Serde-deserializable config type.
let loader = ConfigLoader::from_values(HashMap::new(), None, None)?;
let config: MyConfig = loader.load_typed(["config/my-service.yml"])?;

Consumers

light-runtime uses this crate for service bootstrap and runtime config. Application crates can also use it for app-specific policy or domain config.

Hindsight Client

hindsight-client provides a small client abstraction for persistent agent memory.

It stores and recalls memory units from PostgreSQL. The current implementation uses sqlx and pgvector for vector similarity search.

Main Types

  • HindsightMemory: trait used by applications that need memory retention and recall without coupling to a specific database implementation.
  • PgHindsightClient: PostgreSQL-backed implementation of HindsightMemory.
  • MemoryUnit: returned memory record with content, type, metadata, and bank identity.

Usage

use hindsight_client::{HindsightMemory, PgHindsightClient};

// `pool`, `host_id`, `bank_id`, and `metadata` are supplied by the caller.
let memory = PgHindsightClient::new(pool);
let unit_id = memory
    .retain(host_id, bank_id, "User prefers concise answers", "fact", None, metadata)
    .await?;

Data Model

The PostgreSQL implementation writes to agent_memory_unit_t and uses host_id plus bank_id to isolate memory between tenants, users, or sessions.
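The isolation rule can be pictured with an in-memory analogue that keys every read and write by the (host_id, bank_id) pair. This is a synchronous, hypothetical sketch; it does not mirror the HindsightMemory trait signatures or the table schema.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory store: each (host_id, bank_id) pair is its own bank.
#[derive(Default)]
pub struct InMemoryBanks {
    units: HashMap<(String, String), Vec<String>>,
}

impl InMemoryBanks {
    pub fn retain(&mut self, host_id: &str, bank_id: &str, content: &str) {
        self.units
            .entry((host_id.to_string(), bank_id.to_string()))
            .or_default()
            .push(content.to_string());
    }

    /// Returns only the units stored under the same host and bank.
    pub fn recall(&self, host_id: &str, bank_id: &str) -> &[String] {
        self.units
            .get(&(host_id.to_string(), bank_id.to_string()))
            .map(Vec::as_slice)
            .unwrap_or(&[])
    }
}
```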

Consumers

light-agent uses this crate to persist and recall agent conversation memory.

Light Rule

light-rule is the Rust rule engine for evaluating rule definitions and executing registered actions.

It is designed to align with the rule.yaml specification while remaining runtime-neutral. Java services can use yaml-rule; Rust services use this crate.

Main Types

  • RuleEngine: evaluates rule conditions and determines action execution.
  • MultiThreadRuleExecutor: executes rules with runtime state.
  • RuntimeState: input/output state passed through rule evaluation.
  • ActionRegistry: registry for action plugins.
  • RuleActionPlugin: trait implemented by Rust action handlers.
  • Rule, RuleCondition, RuleAction, RuleConfig, EndpointConfig: rule model types.

Action Model

Rules reference actions by actionRef. In Rust, actionRef resolves to a registered RuleActionPlugin; it is not a Java class name. This keeps the rule format portable across Java and Rust executors.
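The lookup from actionRef to a registered handler can be sketched as a map of trait objects. `ActionPlugin`, `SketchRegistry`, and `Uppercase` are hypothetical stand-ins; the real RuleActionPlugin and ActionRegistry signatures are not shown in this document.

```rust
use std::collections::HashMap;

/// Hypothetical plugin trait for this sketch.
pub trait ActionPlugin {
    fn execute(&self, input: &str) -> String;
}

#[derive(Default)]
pub struct SketchRegistry {
    plugins: HashMap<String, Box<dyn ActionPlugin>>,
}

impl SketchRegistry {
    /// Registers a plugin under the name that rules use as actionRef.
    pub fn register(&mut self, action_ref: &str, plugin: Box<dyn ActionPlugin>) {
        self.plugins.insert(action_ref.to_string(), plugin);
    }

    /// Resolves actionRef by name; returns None when nothing is registered.
    pub fn run(&self, action_ref: &str, input: &str) -> Option<String> {
        self.plugins.get(action_ref).map(|p| p.execute(input))
    }
}

pub struct Uppercase;

impl ActionPlugin for Uppercase {
    fn execute(&self, input: &str) -> String {
        input.to_uppercase()
    }
}
```

Because the rule only carries a name, the same rule file can resolve to a Java class in one executor and to a registered plugin in another.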

Usage

use light_rule::{ActionRegistry, RuleEngine};

let registry = ActionRegistry::default();
let engine = RuleEngine::new(registry);

See Light-Rule for the rule format and its relationship to workflow assertions and portal rule management.

Light Runtime

light-runtime is the shared service runtime for Light-Fabric applications.

It owns bootstrap, configuration loading, transport startup, graceful shutdown, and optional portal registry registration. Apps such as light-agent and light-deployer should start through this crate instead of binding sockets directly.

Main Types

  • LightRuntimeBuilder: builds a runtime from a transport.
  • LightRuntime: configured runtime before start.
  • RunningRuntime: running service handle with shutdown support.
  • Module: lifecycle hook abstraction.
  • RuntimeConfig: resolved runtime configuration.
  • ServerConfig: HTTP/HTTPS bind and service identity settings.
  • BootstrapConfig: remote config bootstrap settings.
  • PortalRegistryConfig: portal registry connection settings.

Startup Pattern

use light_axum::AxumTransport;
use light_runtime::LightRuntimeBuilder;

// `app` is the caller's type that implements light_axum::AxumApp.
let runtime = LightRuntimeBuilder::new(AxumTransport::new(app))
    .with_config_dir("config")
    .build();

let running = runtime.start().await?;
running.shutdown().await?;

Configuration

At minimum, runtime services need server.yml. Optional files include startup.yml, client.yml, and portal-registry.yml.

light-runtime is transport-neutral. light-axum supplies the Axum transport implementation.

MCP Client

mcp-client is a client for calling MCP-compatible gateway endpoints.

It provides a small API for listing and invoking tools through a configured MCP gateway path. It is intentionally focused on the client side; MCP server implementations live in applications or framework layers.

Main Types

  • McpGatewayClient: gateway client used by applications.
  • McpTool: tool metadata returned by the gateway.
  • McpContent: content item returned by MCP tool calls.
  • McpToolCallResult: structured result for a tool invocation.

Usage

use mcp_client::McpGatewayClient;

// `gateway_url`, `path`, `timeout_ms`, and `arguments` are supplied by the caller.
let client = McpGatewayClient::new(gateway_url, path, timeout_ms);
let result = client.call_tool("tool.name", arguments).await?;

Consumers

light-agent uses this crate when an agent session needs to discover or invoke tools exposed through an MCP gateway.

Model Provider

model-provider defines a common abstraction over LLM providers and implements multiple provider adapters.

The goal is to let agent and workflow code depend on one Provider trait while supporting local models, hosted APIs, and provider-specific features.

Main Types

  • Provider: async trait implemented by model providers.
  • ChatRequest, ChatResponse, ChatMessage: common chat data model.
  • ToolSpec, ToolCall: tool-calling model.
  • ProviderCapabilities: capability metadata.
  • TokenUsage: usage accounting.
  • ReliableProvider: reliability wrapper.
  • RouterProvider: route requests across multiple providers.
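The idea of wrapping several providers behind one interface can be sketched with a synchronous trait and an ordered fallback. Every name here (`SketchProvider`, `Fallback`, `Fixed`) is hypothetical; the real Provider trait is async and its methods are not shown in this document.

```rust
/// Hypothetical, simplified provider interface for this sketch.
pub trait SketchProvider {
    fn name(&self) -> &str;
    fn chat(&self, prompt: &str) -> Result<String, String>;
}

/// Tries providers in order and returns the first successful reply.
pub struct Fallback {
    pub providers: Vec<Box<dyn SketchProvider>>,
}

impl Fallback {
    pub fn chat(&self, prompt: &str) -> Result<String, String> {
        let mut last_err = String::from("no providers configured");
        for provider in &self.providers {
            match provider.chat(prompt) {
                Ok(reply) => return Ok(reply),
                Err(e) => last_err = format!("{}: {}", provider.name(), e),
            }
        }
        Err(last_err)
    }
}

/// Test double that always returns the same result.
pub struct Fixed {
    pub name: &'static str,
    pub reply: Result<&'static str, &'static str>,
}

impl SketchProvider for Fixed {
    fn name(&self) -> &str {
        self.name
    }
    fn chat(&self, _prompt: &str) -> Result<String, String> {
        self.reply.map(str::to_string).map_err(str::to_string)
    }
}
```

Calling code depends only on the trait, so swapping a local model for a hosted API does not change the agent or workflow logic.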

Provider Implementations

Current modules include:

  • Anthropic
  • Azure OpenAI
  • Bedrock
  • Claude Code
  • Codex
  • OpenAI-compatible providers
  • Copilot
  • Gemini
  • Gemini CLI
  • GLM
  • Kilo Code CLI
  • Ollama
  • OpenAI
  • OpenRouter
  • Telnyx

Consumers

light-agent uses this crate to send chat requests and tool specs without hard-coding a single LLM provider.

Portal Registry

portal-registry provides client support for registering services with Light Portal or Light Controller.

It uses a JSON-RPC style WebSocket protocol for service registration, metadata updates, discovery, and cache-management control. Runtime services normally use this through light-runtime, but applications can also use the client directly when they need custom registry behavior.

Main Types

  • PortalRegistryClient: WebSocket client for registry communication.
  • RegistryHandler: trait for handling registry callbacks and messages.
  • RegistrationState: client registration state.
  • RegistrationBuilder: helper for constructing registration parameters.
  • ServiceRegistrationParams: service identity and advertised endpoint.
  • ServiceMetadataUpdate: metadata update payload.

Usage

use portal_registry::RegistrationBuilder;

// `token` is the JWT supplied by the caller for registry authentication.
let registration = RegistrationBuilder::new(
    "com.networknt.service-1.0.0",
    "1.0.0",
    "http",
    "127.0.0.1",
    8080,
)
.with_env("dev")
.with_jwt(token)
.build();

Runtime Integration

light-runtime can register a service automatically when server.yml enables registry support and portal-registry.yml supplies the portal connection.

Symmetric Decryptor

symmetric-decryptor decrypts legacy symmetric encrypted configuration values.

It supports payloads with the CRYPT prefix and decrypts AES-256-CBC data with a key derived from the configured password using PBKDF2-HMAC-SHA256.

Main Types

  • Decryptor: trait implemented by decryptors.
  • SymmetricDecryptor: password-based decryptor.
  • DecryptError: error type for prefix, format, hex, and cipher failures.
  • CRYPT_PREFIX: required CRYPT payload prefix.

Usage

use symmetric_decryptor::{Decryptor, SymmetricDecryptor};

let decryptor = SymmetricDecryptor::new("password");
let plaintext = decryptor.decrypt("CRYPT:...")?;

Consumers

config-loader uses this crate when it encounters symmetric encrypted values and a config password is available.
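Because both encrypted forms begin with CRYPT, a loader has to test the longer CRYPT:RSA: prefix first. The sketch below shows that ordering. The constants and `classify` are hypothetical, and the exact value of the crate's CRYPT_PREFIX is an assumption here.

```rust
// Hypothetical local constants; the crates export their own.
const RSA_PREFIX: &str = "CRYPT:RSA:";
const SYMMETRIC_PREFIX: &str = "CRYPT:"; // assumed form of the symmetric prefix

#[derive(Debug, PartialEq)]
pub enum ValueKind<'a> {
    Plain(&'a str),
    Symmetric(&'a str),
    Asymmetric(&'a str),
}

/// Decides which decryptor, if any, should handle a config value.
pub fn classify(value: &str) -> ValueKind<'_> {
    // Check the more specific prefix first; otherwise RSA payloads would be
    // misrouted to the symmetric decryptor.
    if let Some(payload) = value.strip_prefix(RSA_PREFIX) {
        ValueKind::Asymmetric(payload)
    } else if let Some(payload) = value.strip_prefix(SYMMETRIC_PREFIX) {
        ValueKind::Symmetric(payload)
    } else {
        ValueKind::Plain(value)
    }
}
```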

Workflow Builder

workflow-builder provides fluent builders for creating Agentic Workflow definitions programmatically.

It depends on workflow-core for the actual model types and layers a builder API on top so applications and tests can construct valid workflows without manually assembling nested maps.

Main Areas

  • workflow metadata construction
  • authentication definitions
  • task definitions
  • nested do, for, fork, try, and other task structures
  • YAML/JSON serialization through workflow-core model types

Usage

use workflow_builder::services::workflow::WorkflowBuilder;

let workflow = WorkflowBuilder::new()
    .use_dsl("1.0.0")
    .with_namespace("lightapi")
    .with_name("example")
    .with_version("1.0.0")
    .build();

Relationship To Workflow Core

Use workflow-core when you need direct access to the schema model. Use workflow-builder when you want an ergonomic construction API.

Workflow Core

workflow-core contains the Rust model for the Agentic Workflow DSL.

The crate is schema-oriented: its structs and enums represent workflow documents, tasks, authentication blocks, durations, timeouts, errors, and supporting map types.

Main Areas

  • workflow document metadata
  • task definitions
  • call task protocol definitions
  • ask and assert task definitions
  • duration and timeout models
  • error definitions
  • ordered map support for workflow task lists

Usage

use workflow_core::models::workflow::{
    WorkflowDefinition,
    WorkflowDefinitionMetadata,
};

let document = WorkflowDefinitionMetadata::new(
    "lightapi",
    "example",
    "1.0.0",
    Some("Example".to_string()),
    None,
    None,
    None,
);
let workflow = WorkflowDefinition::new(document);

Consumers

workflow-builder builds on this crate. light-workflow and workflow-related services use the model for loading, validating, and executing workflow documents.

Light-Axum

light-axum adapts Axum applications to light-runtime.

Applications implement AxumApp and return an axum::Router. The framework owns binding, optional TLS, runtime metadata resolution, and graceful shutdown through the runtime transport contract.

Main Types

  • AxumApp: trait implemented by an application.
  • AxumTransport: transport passed to LightRuntimeBuilder.
  • ServerContext: runtime context passed into the app when building routes.
  • AxumBoundHandle: running Axum server handle.

Pattern

use light_axum::{AxumApp, AxumTransport, ServerContext};
use light_runtime::LightRuntimeBuilder;

#[derive(Clone)]
struct App;

impl AxumApp for App {
    fn router(&self, _context: ServerContext) -> axum::Router {
        axum::Router::new()
    }
}

let runtime = LightRuntimeBuilder::new(AxumTransport::new(App))
    .with_config_dir("config")
    .build();

Consumers

light-agent and light-deployer use this framework.

Light-Pingora

light-pingora adapts Pingora proxy services to light-runtime.

It is the framework layer for high-performance gateway and proxy products. The crate keeps runtime concerns such as configuration and service lifecycle separate from Pingora-specific proxy behavior.

Role

  • bridge Pingora services into the common runtime lifecycle
  • expose transport metadata to light-runtime
  • support gateway products without duplicating bootstrap code

Consumers

light-gateway uses this framework.

MSAL Exchange

The msal-exchange handler is a BFF (backend-for-frontend) security handler for SPA applications that authenticate with the Microsoft Authentication Library (MSAL) and need an internal light-oauth security profile for gateway authorization.

The SPA obtains Azure MSAL tokens in the browser. It sends the MSAL ID token to the gateway for light-oauth token exchange. In the Azure authorization placement pattern, it also sends the MSAL access token during the exchange so the gateway can store it in a secure BFF cookie. The internal light-oauth token set is stored in secure BFF cookies and is used on later requests together with CSRF protection.

This page documents the current behavior and the token placement extension for deployments that must keep the Azure MSAL access token in the downstream Authorization header while forwarding the light-oauth token in a separate header.

Use Cases

Use msal-exchange when:

  • The UI is a browser SPA using MSAL.js.
  • Microsoft Entra ID (formerly Azure AD) is the identity provider for the browser login.
  • The gateway must exchange the Azure token for a light-oauth token containing the enterprise security profile and custom claims.
  • The gateway must protect browser requests with HttpOnly cookies and CSRF.
  • Downstream routing needs either the light-oauth token or the Azure MSAL token in the Authorization header.

Handler Placement

Enable the handler in the gateway handler chain before downstream routing and before handlers that depend on the authenticated principal.

Example:

handlers:
  - exception
  - cors
  - msal-exchange
  - header
  - prefix
  - router

chains:
  bff:
    - exception
    - cors
    - msal-exchange
    - header
    - prefix
    - router

paths:
  - path: /auth/ms/exchange
    method: POST
    exec:
      - bff
  - path: /auth/ms/logout
    method: GET
    exec:
      - bff

When the handler is active, the gateway needs these resolved config files:

  • msal-exchange.yml
  • security-msal.yml
  • security.yml
  • client.yml

security-msal.yml validates Azure MSAL tokens. security.yml validates the light-oauth tokens stored in BFF cookies. client.yml provides the light-oauth token-exchange client configuration.

Exchange Flow

The exchange endpoint receives the Azure MSAL ID token from the SPA and creates the BFF session.

POST /auth/ms/exchange
Authorization: Bearer <azure-msal-id-token>

  -> read the Azure MSAL ID token
  -> verify the ID token with security-msal.yml
  -> generate a CSRF value
  -> call light-oauth with the token-exchange grant
  -> verify the returned light-oauth access token with security.yml
  -> set BFF cookies
  -> return { "scopes": [...] }

The token-exchange request uses client.yml oauth.token.token_exchange. The outgoing form body contains:

grant_type=urn:ietf:params:oauth:grant-type:token-exchange
subject_token=<azure-msal-id-token>
subject_token_type=urn:ietf:params:oauth:token-type:jwt
csrf=<generated-csrf>

subjectTokenType can be set in msal-exchange.yml. When it is blank, the shared token client default from client.yml is used.
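The form body above is shown unencoded for readability. On the wire it is application/x-www-form-urlencoded, which a minimal sketch can build as follows. `form_encode` and `token_exchange_body` are hypothetical helpers, not the handler's code.

```rust
/// Encodes a value for application/x-www-form-urlencoded bodies:
/// ASCII alphanumerics and `*-._` pass through, space becomes `+`,
/// every other byte becomes %XX.
fn form_encode(value: &str) -> String {
    let mut out = String::new();
    for b in value.bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'*' | b'-' | b'.' | b'_' => {
                out.push(b as char)
            }
            b' ' => out.push('+'),
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}

/// Builds the token-exchange form body with the four fields listed above.
pub fn token_exchange_body(subject_token: &str, subject_token_type: &str, csrf: &str) -> String {
    let pairs = [
        ("grant_type", "urn:ietf:params:oauth:grant-type:token-exchange"),
        ("subject_token", subject_token),
        ("subject_token_type", subject_token_type),
        ("csrf", csrf),
    ];
    pairs
        .iter()
        .map(|(k, v)| format!("{}={}", k, form_encode(v)))
        .collect::<Vec<_>>()
        .join("&")
}
```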

On success, the response body contains the scopes from the light-oauth token:

{
  "scopes": ["scope1", "scope2"]
}

Session Cookies

The handler uses the same cookie contract as the stateless SPA auth handler.

| Cookie | HttpOnly | Description |
|---|---|---|
| accessToken | true | light-oauth access token |
| refreshToken | true | light-oauth refresh token, when returned |
| msalAccessToken | true | Azure MSAL access token when authorizationToken is azure-msal |
| csrf | false | Generated CSRF value |
| userId | false | User id from uid, user_id, or sub |
| userType | false | User type from userType |
| roles | false | Base64 encoded role value, default user |
| host | false | Host claim |
| email | false | Email claim from eml |
| eid | false | Enterprise id claim |

accessToken, refreshToken, and msalAccessToken are HttpOnly so browser JavaScript cannot read the light-oauth or Azure MSAL tokens. The SPA reads the non-HttpOnly csrf cookie and sends it back with protected requests.
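The cookie attributes can be pictured with a small Set-Cookie builder. This is a sketch under assumptions: `CookieSettings` and `set_cookie` are hypothetical names, and the attribute order is arbitrary.

```rust
/// Hypothetical subset of the handler's cookie-related settings.
pub struct CookieSettings<'a> {
    pub domain: &'a str,
    pub path: &'a str,
    pub secure: bool,
    pub same_site: &'a str,
    pub max_age_secs: u64,
}

/// Builds a Set-Cookie header value. Token cookies pass `http_only = true`;
/// cookies the SPA must read, such as csrf, pass `false`.
pub fn set_cookie(name: &str, value: &str, http_only: bool, s: &CookieSettings<'_>) -> String {
    let mut cookie = format!(
        "{}={}; Domain={}; Path={}; Max-Age={}; SameSite={}",
        name, value, s.domain, s.path, s.max_age_secs, s.same_site
    );
    if s.secure {
        cookie.push_str("; Secure");
    }
    if http_only {
        cookie.push_str("; HttpOnly");
    }
    cookie
}
```

Note that current browsers generally reject SameSite=None cookies that lack the Secure attribute, so the two settings are usually enabled together outside local development.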

CSRF Validation

For normal protected requests, the handler validates the request CSRF value against the csrf claim in the light-oauth access token.

CSRF source order:

  1. X-CSRF-TOKEN request header.
  2. Sec-WebSocket-Protocol value starting with csrf. for WebSocket requests.
  3. csrf query parameter.

If the CSRF value is missing or does not match the JWT claim, the request is rejected.
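The source order and the comparison can be sketched as two small functions. `RequestParts` is a hypothetical view of the request, and reading the token as the text after the csrf. prefix in Sec-WebSocket-Protocol is an assumption of this sketch.

```rust
/// Hypothetical view of the request fields that can carry a CSRF value.
pub struct RequestParts<'a> {
    pub csrf_header: Option<&'a str>,  // X-CSRF-TOKEN
    pub ws_protocols: Option<&'a str>, // Sec-WebSocket-Protocol, comma separated
    pub csrf_query: Option<&'a str>,   // csrf query parameter
}

/// Picks the request CSRF value using the documented source order.
pub fn request_csrf<'a>(req: &RequestParts<'a>) -> Option<&'a str> {
    req.csrf_header
        .or_else(|| {
            req.ws_protocols.and_then(|protocols| {
                protocols
                    .split(',')
                    .map(str::trim)
                    .find_map(|p| p.strip_prefix("csrf."))
            })
        })
        .or(req.csrf_query)
}

/// Accepts the request only when both values exist and are equal.
pub fn csrf_matches(req: &RequestParts<'_>, token_claim: Option<&str>) -> bool {
    match (request_csrf(req), token_claim) {
        (Some(request_value), Some(claim)) => request_value == claim,
        _ => false,
    }
}
```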

Token Placement

authorizationToken selects which token owns the downstream Authorization header after the BFF session has been established.

Supported values:

| Value | Authorization header | Light-oauth token location | Use case |
|---|---|---|---|
| light-oauth | Bearer <light-oauth-token> | Authorization | Existing enterprise BFF pattern |
| azure-msal | Bearer <azure-msal-access-token> | lightTokenHeader, default X-Light-Token | Azure-whitelisted downstream systems, such as AWS Agent Core |

authorizationToken: light-oauth

This is the current default behavior.

After the exchange, the SPA calls the gateway with cookies and CSRF:

GET /api/orders
Cookie: accessToken=...; csrf=...
X-CSRF-TOKEN: <csrf>

The handler:

  -> reads the light-oauth accessToken cookie
  -> verifies it with security.yml
  -> validates CSRF
  -> refreshes the token if it is close to expiry
  -> injects Authorization: Bearer <light-oauth-token>
  -> continues the handler chain

Downstream services receive:

Authorization: Bearer <light-oauth-token>

This mode is appropriate when downstream services and MCP tools trust light-oauth directly and expect fine-grained security claims in the normal Authorization header.
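The "close to expiry" check in the flow above reduces to a comparison against the renewBeforeSeconds window. A minimal sketch, with `needs_refresh` as a hypothetical name and all times in epoch seconds:

```rust
/// Returns true when the token expires within `renew_before_secs`
/// of `now_epoch_secs`, or has already expired.
pub fn needs_refresh(exp_epoch_secs: u64, now_epoch_secs: u64, renew_before_secs: u64) -> bool {
    // saturating_sub yields 0 for an already expired token.
    exp_epoch_secs.saturating_sub(now_epoch_secs) <= renew_before_secs
}
```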

authorizationToken: azure-msal

This token placement pattern uses both Azure and light-oauth tokens downstream.

At exchange time, the SPA sends the MSAL ID token in Authorization and the MSAL access token in msalAccessTokenHeader, which defaults to X-MSAL-Access-Token:

POST /auth/ms/exchange
Authorization: Bearer <azure-msal-id-token>
X-MSAL-Access-Token: Bearer <azure-msal-access-token>

  -> verify the MSAL ID token with security-msal.yml
  -> verify the MSAL access token with security-msal.yml
  -> exchange the ID token for a light-oauth token
  -> store the light-oauth token in accessToken
  -> store the MSAL access token in msalAccessToken

For later protected requests, the SPA sends cookies and CSRF. The SPA does not need to put the Azure access token in the browser request Authorization header because the gateway reads it from the HttpOnly msalAccessToken cookie:

GET /agent/chat
Cookie: accessToken=...; msalAccessToken=...; csrf=...
X-CSRF-TOKEN: <csrf>

The handler:

  -> reads the MSAL access token from the msalAccessToken cookie
  -> verifies the MSAL access token with security-msal.yml
  -> reads the light-oauth accessToken cookie
  -> verifies the light-oauth token with security.yml
  -> validates CSRF
  -> refreshes the light-oauth token if it is close to expiry
  -> injects Authorization: Bearer <azure-msal-access-token>
  -> injects X-Light-Token: Bearer <light-oauth-token>
  -> continues the handler chain

Downstream systems receive both tokens:

Authorization: Bearer <azure-msal-access-token>
X-Light-Token: Bearer <light-oauth-token>

This mode is intended for systems that only allow Azure as the OAuth provider for the normal Authorization header, while still needing the light-oauth security profile for API and MCP authorization decisions.

The SPA should not read or send X-Light-Token itself. The gateway should derive that header from the HttpOnly light-oauth cookie after CSRF validation. That keeps the light-oauth token out of browser JavaScript.

If a downstream light-gateway is responsible for fine-grained authorization, it must be configured to verify X-Light-Token as the light-oauth token or to promote X-Light-Token to Authorization at a trusted boundary before the normal security/access-control handlers run.
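The two placement modes differ only in which headers the gateway injects after the session checks pass. The sketch below illustrates that decision; `AuthorizationToken` and `downstream_headers` are hypothetical names, and verification, CSRF, and refresh are assumed to have already happened.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum AuthorizationToken {
    LightOauth,
    AzureMsal,
}

/// Returns the (header name, header value) pairs to inject downstream.
pub fn downstream_headers(
    mode: AuthorizationToken,
    light_token: &str,
    msal_access_token: Option<&str>,
    light_token_header: &str,
) -> Result<Vec<(String, String)>, String> {
    match mode {
        AuthorizationToken::LightOauth => Ok(vec![(
            "Authorization".to_string(),
            format!("Bearer {}", light_token),
        )]),
        AuthorizationToken::AzureMsal => {
            // In azure-msal mode the MSAL access token cookie is required.
            let msal = msal_access_token
                .ok_or_else(|| "MSAL access token cookie is missing".to_string())?;
            Ok(vec![
                ("Authorization".to_string(), format!("Bearer {}", msal)),
                (light_token_header.to_string(), format!("Bearer {}", light_token)),
            ])
        }
    }
}
```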

Configuration

Example default configuration:

enabled: ${msal-exchange.enabled:true}
exchangePath: ${msal-exchange.exchangePath:/auth/ms/exchange}
logoutPath: ${msal-exchange.logoutPath:/auth/ms/logout}
cookieDomain: ${msal-exchange.cookieDomain:localhost}
cookiePath: ${msal-exchange.cookiePath:/}
cookieSecure: ${msal-exchange.cookieSecure:false}
sessionTimeout: ${msal-exchange.sessionTimeout:3600}
rememberMeTimeout: ${msal-exchange.rememberMeTimeout:604800}
renewBeforeSeconds: ${msal-exchange.renewBeforeSeconds:90}
refreshSingleFlightWaitMs: ${msal-exchange.refreshSingleFlightWaitMs:5000}
refreshSingleFlightCacheMs: ${msal-exchange.refreshSingleFlightCacheMs:3000}
refreshSingleFlightMaxEntries: ${msal-exchange.refreshSingleFlightMaxEntries:10000}
cookieSameSite: ${msal-exchange.cookieSameSite:None}
cookieTimeoutUri: ${msal-exchange.cookieTimeoutUri:/}
subjectTokenType: ${msal-exchange.subjectTokenType:}
authorizationToken: ${msal-exchange.authorizationToken:light-oauth}
lightTokenHeader: ${msal-exchange.lightTokenHeader:X-Light-Token}
msalAccessTokenHeader: ${msal-exchange.msalAccessTokenHeader:X-MSAL-Access-Token}
msalAccessTokenCookie: ${msal-exchange.msalAccessTokenCookie:msalAccessToken}

Fields:

| Field | Default | Description |
|---|---|---|
| enabled | true | Enables or disables the handler once it is active in the chain. |
| exchangePath | /auth/ms/exchange | Endpoint that receives the Azure MSAL ID token and creates the BFF session. |
| logoutPath | /auth/ms/logout | Endpoint that clears BFF cookies. |
| cookieDomain | localhost | Cookie domain for session cookies. |
| cookiePath | / | Cookie path for session cookies. |
| cookieSecure | false | Adds the Secure cookie attribute. Use true for HTTPS deployments. |
| sessionTimeout | 3600 | Default max age in seconds for session cookies. |
| rememberMeTimeout | 604800 | Max age in seconds for long-lived refresh-token cookies when light-oauth returns remember-me behavior. |
| renewBeforeSeconds | 90 | Refresh the light-oauth access token when it expires within this window. |
| refreshSingleFlightWaitMs | 5000 | Maximum wait time for concurrent refresh requests sharing the same refresh token. |
| refreshSingleFlightCacheMs | 3000 | Short cache window for a successful refresh result. |
| refreshSingleFlightMaxEntries | 10000 | Maximum refresh single-flight cache entries. |
| cookieSameSite | None | Cookie SameSite attribute. Supported values are None, Lax, and Strict. |
| cookieTimeoutUri | / | URI returned when the session expires and cannot be refreshed. |
| subjectTokenType | blank | Optional token-exchange subject token type override. |
| authorizationToken | light-oauth | Token to place in downstream Authorization: light-oauth or azure-msal. |
| lightTokenHeader | X-Light-Token | Header used for the light-oauth token when authorizationToken is azure-msal. |
| msalAccessTokenHeader | X-MSAL-Access-Token | Header that carries the Azure MSAL access token on the exchange request when authorizationToken is azure-msal. |
| msalAccessTokenCookie | msalAccessToken | HttpOnly cookie used to store the Azure MSAL access token after exchange when authorizationToken is azure-msal. |

Invalid authorizationToken values should fail startup. lightTokenHeader should not be Authorization; use authorizationToken: light-oauth for that case. In azure-msal mode, msalAccessTokenHeader must not be Authorization because Authorization carries the MSAL ID token on the exchange endpoint. msalAccessTokenHeader must also be different from lightTokenHeader.
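These startup rules can be sketched as one validation function. `validate_placement` is a hypothetical name. Comparing header names case-insensitively is an assumption based on HTTP header semantics, not something this page states.

```rust
/// Validates the token placement settings; an Err should fail startup.
pub fn validate_placement(
    authorization_token: &str,
    light_token_header: &str,
    msal_access_token_header: &str,
) -> Result<(), String> {
    match authorization_token {
        "light-oauth" => Ok(()),
        "azure-msal" => {
            if light_token_header.eq_ignore_ascii_case("Authorization") {
                return Err("lightTokenHeader must not be Authorization".to_string());
            }
            if msal_access_token_header.eq_ignore_ascii_case("Authorization") {
                return Err("msalAccessTokenHeader must not be Authorization".to_string());
            }
            if msal_access_token_header.eq_ignore_ascii_case(light_token_header) {
                return Err("msalAccessTokenHeader must differ from lightTokenHeader".to_string());
            }
            Ok(())
        }
        other => Err(format!("unsupported authorizationToken: {}", other)),
    }
}
```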

Security Configuration

security-msal.yml validates Azure MSAL tokens. It is required when the handler is active.

Example:

enableVerifyJwt: ${security-msal.enableVerifyJwt:true}
ignoreJwtExpiry: ${security-msal.ignoreJwtExpiry:false}
enableRelaxedKeyValidation: ${security-msal.enableRelaxedKeyValidation:false}
issuer: ${security-msal.issuer:}
audience: ${security-msal.audience:}
jwt:
  certificate: ${security-msal.jwt.certificate:}
  clockSkewInSeconds: ${security-msal.jwt.clockSkewInSeconds:60}
  keyResolver: ${security-msal.jwt.keyResolver:}

Recommended settings:

  • Set issuer to the Azure tenant issuer when the tenant is known.
  • Set audience to the SPA client id or the expected Azure access-token audience.
  • Keep ignoreJwtExpiry: false in production.
  • Use the configured Microsoft JWK or certificate resolver supported by the gateway security runtime.

security.yml remains the normal light-oauth verifier. It validates the light-oauth access token stored in the accessToken cookie and provides the principal used by gateway authorization logic.

SPA Integration

Initial exchange:

await fetch("/auth/ms/exchange", {
  method: "POST",
  credentials: "include",
  headers: {
    Authorization: `Bearer ${azureMsalIdToken}`
  }
});

Initial exchange with authorizationToken: azure-msal:

await fetch("/auth/ms/exchange", {
  method: "POST",
  credentials: "include",
  headers: {
    Authorization: `Bearer ${azureMsalIdToken}`,
    "X-MSAL-Access-Token": `Bearer ${azureMsalAccessToken}`
  }
});

Subsequent requests with the existing light-oauth authorization pattern:

await fetch("/api/orders", {
  credentials: "include",
  headers: {
    "X-CSRF-TOKEN": csrf
  }
});

Subsequent requests with the Azure MSAL authorization pattern:

await fetch("/agent/chat", {
  credentials: "include",
  headers: {
    "X-CSRF-TOKEN": csrf
  }
});

In both patterns, the SPA must send cookies with credentials: "include". In the Azure MSAL authorization pattern, MSAL.js is responsible for obtaining the Azure access token before calling /auth/ms/exchange. The gateway stores that access token in the HttpOnly msalAccessToken cookie, validates it on later BFF requests, injects it into Authorization, and injects the light-oauth token into lightTokenHeader.

Logout

Logout clears all BFF cookies managed by the handler:

GET /auth/ms/logout

The handler returns an empty 200 response with deletion cookies for the known session cookie names.

Error Handling

Important error codes:

| Code | Meaning |
|---|---|
| ERR11000 | Required Azure MSAL bearer token is missing on the exchange endpoint or in the MSAL access-token cookie. |
| ERR11001 | light-oauth token exchange failed. |
| ERR10000 | Azure MSAL token or light-oauth token verification failed. |
| ERR10036 | CSRF token is missing from the request. |
| ERR10038 | CSRF claim is missing from the light-oauth token. |
| ERR10039 | Request CSRF and token CSRF do not match. |
| ERR10052 | Token response does not contain expires_in and the JWT has no usable exp. |

Implementation Notes

Rust light-pingora and Java light-spa-4j use the same token placement contract:

  • authorizationToken: light-oauth preserves the existing behavior and injects the light-oauth token into Authorization.
  • authorizationToken: azure-msal verifies the exchange request's MSAL ID token and MSAL access token with security-msal.yml, stores the MSAL access token in msalAccessToken, injects it into downstream Authorization, and injects the light-oauth token into lightTokenHeader.
  • lightTokenHeader defaults to X-Light-Token and must not be Authorization when authorizationToken is azure-msal.
  • msalAccessTokenHeader defaults to X-MSAL-Access-Token and is used only on the exchange endpoint.
  • msalAccessTokenCookie defaults to msalAccessToken and is HttpOnly.

In azure-msal placement, the gateway requires the MSAL access-token cookie only when a BFF session cookie is present. Requests without accessToken or refreshToken cookies keep the existing pass-through behavior so public endpoints are not forced to authenticate at this handler.

Light-Agent

light-agent is the interactive agent service in Light-Fabric.

It provides a WebSocket chat interface, integrates with model providers, invokes MCP tools through mcp-client, and stores conversation memory through hindsight-client.

Key Dependencies

  • light-runtime
  • light-axum
  • model-provider
  • mcp-client
  • hindsight-client
  • portal-registry

Runtime

The app follows the standard runtime pattern:

  • load config from config/
  • implement an Axum app
  • start through LightRuntimeBuilder
  • optionally register through portal registry

Deploy Native

This page describes the recommended VM deployment model for the Rust light-agent native binary.

Use this model when a customer wants to run an agent service on a VM and expose the chat UI/WebSocket endpoint outside Kubernetes. The agent serves the local chat UI, connects to an LLM provider, calls MCP tools through light-gateway, stores conversation memory in PostgreSQL, and registers with the controller.

Deliver a versioned install bundle, not an ad hoc runtime script.

The bundle should contain:

  • light-agent native binary.
  • public/ static assets for the chat UI.
  • Minimal bootstrap config files.
  • A systemd unit.
  • An install script for filesystem setup.
  • A template for the root-owned environment file; the real secrets are added on the VM and are not shipped in the bundle.

Use systemd to run the service:

  • It restarts the process on failure.
  • It keeps logs in the host journal.
  • It avoids shell-history and process-list leakage from command-line secrets.
  • It gives the customer a standard operational surface: start, stop, restart, status, and journalctl.

Do not use a long-running shell wrapper to pass the bootstrap token, database URL, or model configuration. Use config files and an environment file instead.

Runtime Layout

light-agent uses relative runtime paths:

  • config
  • public

The systemd service should therefore set WorkingDirectory to the installed application directory.

Recommended VM layout:

/opt/light-agent/
  light-agent -> releases/2.2.1/light-agent
  releases/
    2.2.1/
      light-agent
  config -> /etc/light-agent
  public/
    index.html

/etc/light-agent/
  startup.yml
  server.yml
  portal-registry.yml
  client.yml
  mcp-client.yml
  ollama.yml
  values.yml
  ca.pem
  light-agent.env

/var/lib/light-agent/
  config-cache/

The local config directory contains bootstrap and agent-specific config. Runtime config downloaded from config-server should be written to /var/lib/light-agent/config-cache by setting externalConfigDir in startup.yml.

Keep /etc/light-agent readable by the service user. Keep /var/lib/light-agent/config-cache writable by the service user.

Build Artifact

Build a release binary from light-fabric:

cargo build --release -p light-agent

The artifact is:

target/release/light-agent

For a static Linux build that matches the Docker build target:

rustup target add x86_64-unknown-linux-musl
cargo build --release -p light-agent --target x86_64-unknown-linux-musl

The static artifact is:

target/x86_64-unknown-linux-musl/release/light-agent

Build on a Linux distribution that is compatible with the customer VM. If the customer fleet has mixed Linux versions, prefer a static or target-compatible build so the binary does not fail on an older glibc.

Package with a versioned filename:

light-agent-<version>-linux-amd64.tar.gz

Include the static assets from:

apps/light-agent/public/
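
As a rough sketch of the packaging step, the helper below stages the bundle contents and packs them under the versioned filename. The function name, the input directory layout, and the file names light-agent.service, install.sh, and light-agent.env.example are assumptions for illustration; adjust them to the actual build output.

```shell
# make_bundle <version> <input-dir> <output-dir>
# Assumes <input-dir> already holds: light-agent, public/, config/*.yml,
# light-agent.service, install.sh, and light-agent.env.example (hypothetical names).
make_bundle() {
  version=$1
  src=$2
  out=$3
  name="light-agent-${version}-linux-amd64"
  stage="$out/$name"

  mkdir -p "$stage/config" "$stage/public" || return 1
  cp "$src/light-agent" "$stage/light-agent" || return 1
  chmod 0755 "$stage/light-agent"
  cp -R "$src/public/." "$stage/public/" || return 1
  cp "$src"/config/*.yml "$stage/config/" || return 1
  cp "$src/light-agent.service" "$src/install.sh" "$stage/" || return 1
  # Ship only a template for the environment file; real secrets are added on the VM.
  cp "$src/light-agent.env.example" "$stage/" || return 1

  tar -czf "$out/$name.tar.gz" -C "$out" "$name" || return 1
  echo "$out/$name.tar.gz"
}
```

For example, make_bundle 2.2.1 ./dist-input ./dist would print the path of ./dist/light-agent-2.2.1-linux-amd64.tar.gz.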

Runtime Dependencies

The VM must be able to reach:

  • Controller, through portalRegistry.portalUrl.
  • Config-server, through startup.configServerUri.
  • light-gateway, through mcp-client.gatewayUrl and mcp-client.path.
  • The model provider, currently Ollama by default.
  • Postgres, through DATABASE_URL.

The Postgres database must contain the Hindsight memory tables used by light-agent, including:

  • agent_memory_bank_t
  • agent_memory_unit_t
  • agent_session_history_t

LIGHT_AGENT_HOST_ID must be a valid host UUID for the target tenant/host. The agent stores memory and session history under this host id.

Agent Roles

The same binary can run different logical agents. Use a different service id, port, install directory, and systemd unit for each concurrently running role.

Common service ids are:

com.networknt.agent.account-1.0.0
com.networknt.agent.advisor-1.0.0
com.networknt.agent.tech-support-1.0.0

For a single account agent, keep the service name light-agent. For multiple agents on the same VM, use names such as:

light-agent-account
light-agent-advisor
light-agent-tech-support

Each role needs a unique listener port if they run on the same VM.
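
The naming convention above can be derived mechanically from the service id. The helper below is a hypothetical sketch that assumes every service id follows the com.networknt.agent.<role>-<version> shape shown in the examples.

```shell
# role_unit_name <service-id>: map a service id to the matching systemd service name.
role_unit_name() {
  role=${1#com.networknt.agent.}                              # drop the common prefix
  role=$(printf '%s\n' "$role" | sed 's/-[0-9][0-9.]*$//')    # drop the trailing "-<version>"
  printf 'light-agent-%s\n' "$role"
}

role_unit_name com.networknt.agent.tech-support-1.0.0
```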

Bootstrap Config

The local bootstrap config needs enough information to reach config-server, controller, light-gateway, Ollama, and Postgres.

Example values.yml for an account agent:

startup.host: customer.example.com
startup.timeout: 3000
startup.connectTimeout: 3000
startup.bootstrapCaCertPath: config/ca.pem
startup.externalConfigDir: /var/lib/light-agent/config-cache

light-config-server-uri: https://config-server.customer.example.com:8435

server.serviceId: com.networknt.agent.account-1.0.0
server.environment: prod
server.ip: 0.0.0.0
server.advertisedAddress: agent-account-01.customer.example.com
server.httpPort: 8083
server.enableHttp: true
server.httpsPort: 8443
server.enableHttps: false
server.enableRegistry: true
server.startOnRegistryFailure: true

portalRegistry.portalUrl: https://controller.customer.example.com:8438

client.verifyHostname: true

mcp-client.gatewayUrl: https://mcp-gateway.customer.example.com
mcp-client.path: /mcp
mcp-client.timeoutMs: 5000

ollama.ollamaUrl: http://ollama.customer.example.com:11434
ollama.model: llama3.1:8b

server.advertisedAddress must be a stable address that controller and clients can use to reach the VM agent. Do not advertise 127.0.0.1 or 0.0.0.0.
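
A pre-flight check for this rule might look like the sketch below. The function name is hypothetical, and rejecting localhost and the whole 127.* range goes slightly beyond the two addresses named above; that extension is an assumption.

```shell
# advertised_address_ok <values.yml>: fail when the advertised address is empty or loopback/wildcard.
advertised_address_ok() {
  addr=$(sed -n 's/^server\.advertisedAddress:[[:space:]]*//p' "$1" | tail -n 1)
  case "$addr" in
    ''|0.0.0.0|127.*|localhost)
      echo "server.advertisedAddress is not usable for registration: '$addr'" >&2
      return 1
      ;;
  esac
  return 0
}
```

Running advertised_address_ok /etc/light-agent/values.yml before starting the service catches a values.yml that still carries a development default.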

Example startup.yml:

host: ${startup.host:dev.lightapi.net}
serviceId: ${server.serviceId:com.networknt.agent.account-1.0.0}
envTag: ${server.environment:dev}
acceptHeader: application/yaml
timeout: ${startup.timeout:3000}
connectTimeout: ${startup.connectTimeout:3000}
configServerUri: ${light-config-server-uri:https://local.localhost}
authorization: ${light_portal_authorization:}
bootstrapCaCertPath: ${startup.bootstrapCaCertPath:config/ca.pem}
externalConfigDir: ${startup.externalConfigDir:/var/lib/light-agent/config-cache}

Example server.yml:

ip: ${server.ip:0.0.0.0}
advertisedAddress: ${server.advertisedAddress:127.0.0.1}
httpPort: ${server.httpPort:8083}
enableHttp: ${server.enableHttp:true}
httpsPort: ${server.httpsPort:8443}
enableHttps: ${server.enableHttps:false}
serviceId: ${server.serviceId:com.networknt.agent.account-1.0.0}
enableRegistry: ${server.enableRegistry:true}
startOnRegistryFailure: ${server.startOnRegistryFailure:true}
dynamicPort: ${server.dynamicPort:false}
environment: ${server.environment:dev}
shutdownGracefulPeriod: ${server.shutdownGracefulPeriod:2000}

Example portal-registry.yml:

portalUrl: ${portalRegistry.portalUrl:https://localhost:8438}
portalToken: ${light_portal_authorization:}
controllerDiscoveryToken: ${portalRegistry.controllerDiscoveryToken:}

Example client.yml:

tls:
  verifyHostname: ${client.verifyHostname:true}

Example mcp-client.yml:

gatewayUrl: ${mcp-client.gatewayUrl:https://mcp-gateway.customer.example.com}
path: ${mcp-client.path:/mcp}
timeoutMs: ${mcp-client.timeoutMs:5000}

Example ollama.yml:

ollamaUrl: ${ollama.ollamaUrl:http://localhost:11434}
model: ${ollama.model:llama3.1:8b}

For the current light-agent implementation, keep ollama.yml and mcp-client.yml in the local bootstrap config. They are read during process initialization before the runtime completes remote config bootstrap.
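
The ${key:default} placeholders in these files take the value for key from values.yml when it is set and fall back to the text after the first colon otherwise. The sketch below illustrates only that lookup; it is not the loader's implementation, the function name is hypothetical, and it deliberately ignores environment variables, which the real loader also appears to consult (for example for light_portal_authorization).

```shell
# resolve_placeholder <values.yml> '${key:default}'
resolve_placeholder() {
  case "$2" in
    '${'*'}') inner=${2#??}; inner=${inner%?} ;;   # strip the leading "${" and the trailing "}"
    *) echo "not a placeholder: $2" >&2; return 2 ;;
  esac
  key=${inner%%:*}                  # text before the first ":"
  case "$inner" in
    *:*) default=${inner#*:} ;;     # text after the first ":" (may contain more ":")
    *)   default="" ;;
  esac
  # Exact string comparison on the key avoids regex escaping for "." and "-".
  if value=$(awk -v k="$key" '
        { i = index($0, ": ") }
        i > 0 && substr($0, 1, i - 1) == k { v = substr($0, i + 2); found = 1 }
        END { if (!found) exit 1; print v }
      ' "$1"); then
    printf '%s\n' "$value"
  else
    printf '%s\n' "$default"
  fi
}

values=$(mktemp)
printf '%s\n' 'server.httpPort: 9090' > "$values"
resolve_placeholder "$values" '${server.httpPort:8083}'       # key is set in values.yml
resolve_placeholder "$values" '${server.enableHttps:false}'   # key is absent, default is used
rm -f "$values"
```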

Secrets

Keep secrets in a root-owned environment file or in the customer's secret manager. Do not pass secrets in command-line arguments.

Example /etc/light-agent/light-agent.env:

LIGHT_PORTAL_AUTHORIZATION=Bearer <token>
light_4j_config_password=<config-password-if-needed>
LIGHT_AGENT_HOST_ID=<host-uuid>
DATABASE_URL=postgres://agent_user:<password>@postgres.customer.example.com:5432/configserver
RUST_LOG=info
AGENT_LOG_ANSI=false

Permissions:

chown root:light-agent /etc/light-agent/light-agent.env
chmod 0640 /etc/light-agent/light-agent.env

LIGHT_PORTAL_AUTHORIZATION is used for config-server bootstrap and controller registration. It is not the end-user chat token. If downstream MCP tools require caller identity, the caller should send the user's Authorization header to the agent WebSocket endpoint so the agent can forward it to light-gateway. In practice the caller is a BFF or an authenticated reverse proxy, because a browser cannot set an Authorization header on a WebSocket it opens.
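
Before the first start it can help to verify that the environment file is complete. The sketch below is a hypothetical pre-flight check: the function names are illustrative, the canonical 8-4-4-4-12 UUID form for LIGHT_AGENT_HOST_ID is an assumption, and the file is parsed as plain KEY=value lines rather than sourced, because values such as "Bearer <token>" contain spaces.

```shell
# env_value <file> <KEY>: print the value of the last KEY=value line in the file.
env_value() {
  sed -n "s/^$2=//p" "$1" | tail -n 1
}

# is_uuid <text>: succeed for the canonical 8-4-4-4-12 hexadecimal form.
is_uuid() {
  printf '%s\n' "$1" |
    grep -Eq '^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$'
}

# check_env_file <file>: report every problem on stderr and return non-zero if any was found.
check_env_file() {
  file=$1
  status=0
  for key in LIGHT_PORTAL_AUTHORIZATION LIGHT_AGENT_HOST_ID DATABASE_URL; do
    if [ -z "$(env_value "$file" "$key")" ]; then
      echo "missing or empty: $key" >&2
      status=1
    fi
  done
  if ! is_uuid "$(env_value "$file" LIGHT_AGENT_HOST_ID)"; then
    echo "LIGHT_AGENT_HOST_ID is not in canonical UUID form" >&2
    status=1
  fi
  # The file should not be readable by "other" (expected mode is 0640).
  if [ -n "$(find "$file" -prune -perm -004)" ]; then
    echo "$file is world-readable" >&2
    status=1
  fi
  return "$status"
}
```

A template that still contains <host-uuid> fails the UUID check, which makes a forgotten placeholder visible before systemd starts the service.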

Systemd Unit

Example /etc/systemd/system/light-agent.service:

[Unit]
Description=Light Agent
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=light-agent
Group=light-agent
WorkingDirectory=/opt/light-agent
EnvironmentFile=/etc/light-agent/light-agent.env
ExecStart=/opt/light-agent/light-agent
Restart=on-failure
RestartSec=5
LimitNOFILE=65535

NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=full
ProtectHome=true
ReadWritePaths=/var/lib/light-agent/config-cache

[Install]
WantedBy=multi-user.target

Install and start:

systemctl daemon-reload
systemctl enable light-agent
systemctl start light-agent
systemctl status light-agent

View logs:

journalctl -u light-agent -f

Install Script Scope

An install script is useful, but keep it deterministic and small.

It should:

  • Create the light-agent user and group.
  • Create /opt/light-agent, /etc/light-agent, and /var/lib/light-agent/config-cache.
  • Install the binary with executable permissions.
  • Install the public/ static assets.
  • Install bootstrap config files.
  • Install or update the systemd unit.
  • Set file ownership and permissions.
  • Print the next operator steps for adding secrets and starting the service.

It should not:

  • Embed bearer tokens.
  • Pass tokens to ExecStart.
  • Rewrite customer config-server state.
  • Start the process before secrets, CA files, and database access are ready.
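
A minimal sketch of such an install script follows, under stated assumptions: the bundle layout (light-agent, public/, config/*.yml, light-agent.service) and the optional staging-root argument are hypothetical, and the script deliberately neither writes secrets nor starts the service.

```shell
# install_light_agent <bundle-dir> <version> [staging-root]
# With a staging root the files are laid out under that directory and all
# root-only steps (user creation, ownership, systemd) are skipped.
install_light_agent() {
  bundle=$1
  version=$2
  root=${3:-}
  app="$root/opt/light-agent"
  etc="$root/etc/light-agent"
  cache="$root/var/lib/light-agent/config-cache"

  # Create the service user and group only on a real install run as root.
  if [ -z "$root" ] && [ "$(id -u)" -eq 0 ] && ! id light-agent >/dev/null 2>&1; then
    useradd --system --user-group --no-create-home --shell /usr/sbin/nologin light-agent || return 1
  fi

  mkdir -p "$app/releases/$version" "$app/public" "$etc" "$cache" || return 1
  install -m 0755 "$bundle/light-agent" "$app/releases/$version/light-agent" || return 1
  cp -R "$bundle/public/." "$app/public/" || return 1
  cp "$bundle"/config/*.yml "$etc/" || return 1
  ln -sfn "releases/$version/light-agent" "$app/light-agent"
  ln -sfn /etc/light-agent "$app/config"

  if [ -z "$root" ] && [ "$(id -u)" -eq 0 ]; then
    install -m 0644 "$bundle/light-agent.service" /etc/systemd/system/light-agent.service || return 1
    chown -R root:light-agent "$etc" && chmod 0750 "$etc"
    chown -R light-agent:light-agent "$root/var/lib/light-agent"
    systemctl daemon-reload
  fi

  echo "Next: create $etc/light-agent.env, add ca.pem, then run: systemctl enable --now light-agent"
}
```

Passing a staging root, for example install_light_agent ./bundle 2.2.1 /tmp/stage, lets the resulting layout be inspected without touching the real filesystem.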

Startup Flow

The expected runtime flow is:

systemd
  -> /opt/light-agent/light-agent
  -> read local config/values.yml, ollama.yml, and mcp-client.yml
  -> connect to Postgres with DATABASE_URL
  -> build the MCP client for light-gateway
  -> call config-server with LIGHT_PORTAL_AUTHORIZATION
  -> write downloaded runtime config into /var/lib/light-agent/config-cache
  -> start the Axum HTTP/WebSocket server
  -> register the agent with controller using portalRegistry.portalUrl
  -> serve the chat UI from public/
  -> forward tool discovery and tool calls to light-gateway

If config-server is temporarily unavailable and cached config exists, the runtime can continue from config-cache. If this is not acceptable for a customer environment, make it an operational policy to clear config-cache before restart or add a pre-start health check for config-server availability.
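
Both policies can be sketched as small helpers. The function names are hypothetical, and the reachability check only confirms that some HTTP response comes back from the configured URI; it does not assume any particular config-server health endpoint.

```shell
# config_server_reachable <uri> <ca-file>: succeed if any HTTP response arrives within 5 seconds.
config_server_reachable() {
  curl --silent --output /dev/null --max-time 5 --cacert "$2" "$1"
}

# clear_config_cache <dir>: delete the cached files but keep the directory itself.
clear_config_cache() {
  dir=${1:?cache directory required}
  [ -d "$dir" ] || return 0
  find "$dir" -mindepth 1 -delete
}
```

Either helper could be called from a script referenced by an ExecStartPre= line in the unit, so that a failed check stops the start before the agent falls back to cached config.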

Endpoints

The native service exposes:

GET /health
GET /
GET /chat

/chat upgrades to WebSocket. The static chat UI is served from public/.

For local testing on the VM:

curl -i http://127.0.0.1:8083/health

Upgrade And Rollback

Use versioned binary releases:

/opt/light-agent/releases/2.2.1/light-agent
/opt/light-agent/releases/2.2.2/light-agent
/opt/light-agent/light-agent -> releases/2.2.2/light-agent

Upgrade:

systemctl stop light-agent
ln -sfn /opt/light-agent/releases/2.2.2/light-agent /opt/light-agent/light-agent
systemctl start light-agent

Rollback:

systemctl stop light-agent
ln -sfn /opt/light-agent/releases/2.2.1/light-agent /opt/light-agent/light-agent
systemctl start light-agent

Do not delete config-cache during a normal binary rollback. It is the local cache of the config-server-delivered runtime state.
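
The symlink switch used for upgrade and rollback can be wrapped so that it refuses to point at a release that is not installed. This is a sketch with hypothetical function names; it uses a relative link target, matching the layout shown earlier.

```shell
# switch_release <app-dir> <version>: repoint <app-dir>/light-agent at an installed release.
switch_release() {
  if [ ! -x "$1/releases/$2/light-agent" ]; then
    echo "release $2 is not installed under $1/releases" >&2
    return 1
  fi
  ln -sfn "releases/$2/light-agent" "$1/light-agent"
}

# current_release <app-dir>: print the version the symlink currently points at.
current_release() {
  readlink "$1/light-agent" | sed 's|.*releases/\([^/]*\)/light-agent$|\1|'
}
```

An upgrade then reads: systemctl stop light-agent, switch_release /opt/light-agent 2.2.2, systemctl start light-agent, followed by the /health check. A rollback is the same sequence with the previous version.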

Validation Checklist

Before handing the VM to the customer:

  • systemctl status light-agent reports the service as active (running).
  • journalctl -u light-agent shows successful config-server bootstrap.
  • journalctl -u light-agent shows successful controller registration.
  • The controller shows the agent registered with the expected service id, environment, address, and port.
  • curl http://127.0.0.1:8083/health returns 200 OK.
  • The chat UI loads from the VM address.
  • The chat WebSocket connects to /chat.
  • Logs show that the agent can connect to Postgres.
  • Logs do not show MCP tools/list failures from light-gateway.
  • A chat request can discover and call a tool through light-gateway.
  • Restarting the VM starts the agent automatically.

Security Checklist

  • Store bearer tokens, config passwords, and database passwords outside the install bundle.
  • Use a customer CA file instead of disabling TLS verification in production.
  • Use a stable DNS name for server.advertisedAddress.
  • Restrict inbound VM firewall rules to the required agent port.
  • Restrict outbound VM firewall rules to config-server, controller, light-gateway, Ollama, and Postgres.
  • Run as the dedicated light-agent user.
  • Keep /etc/light-agent/light-agent.env readable only by root and the service group.
  • Keep /etc/light-agent writable only by administrators.
  • Keep only /var/lib/light-agent/config-cache writable by the service.
  • Rotate LIGHT_PORTAL_AUTHORIZATION through the customer secret process.

Deploy Kubernetes

This page describes the recommended Kubernetes deployment model for the Rust light-agent image from light-fabric/apps/light-agent.

Use this model when an agent service runs in a cluster and exposes the chat UI/WebSocket endpoint through a Kubernetes Service, Ingress, or Gateway API. The agent serves the local chat UI, connects to an LLM provider, calls MCP tools through light-gateway, stores conversation memory in Postgres, and registers with controller.

Deploy the agent as a normal single-container Kubernetes workload:

  • Deployment for the agent pod.
  • Service for stable in-cluster access.
  • ConfigMap for bootstrap config and non-secret values.
  • Secret for bearer tokens, config passwords, host id, and database URL.
  • emptyDir or PersistentVolumeClaim for config-cache.
  • ConfigMap or custom image layer for public/ chat UI assets.
  • Optional Ingress, Gateway API, NodePort, or LoadBalancer for external browser access.

Keep runtime policy and shared platform configuration in config-server. The Kubernetes bootstrap config should only contain enough information for startup, trust, model/provider selection, light-gateway access, database access, and controller registration.

Image

Build the image from the workspace root:

./apps/light-agent/build.sh 2.2.1

For local testing without pushing:

./apps/light-agent/build.sh 2.2.1 --local

Use immutable tags in Kubernetes. Avoid latest for customer deployments.

The current runtime image uses:

/app/light-agent
/app/config -> /config

The process runs as the image user named agent. Mount /config for bootstrap config and make /app/config-cache writable.

The current Dockerfile does not copy apps/light-agent/public/ into the runtime image. For Kubernetes, either mount the public/ files from a ConfigMap or build a custom image that includes them under /app/public.

Runtime Paths

Recommended container layout:

/config/
  startup.yml
  server.yml
  portal-registry.yml
  client.yml
  mcp-client.yml
  ollama.yml
  values.yml
  ca.pem

/app/config-cache/
  values.yml
  downloaded certs and files

/app/public/
  index.html

Use a read-only projected volume for /config. Use a writable volume for /app/config-cache.

For most deployments, use emptyDir for config-cache. This gives each pod a fresh cache and avoids accidentally keeping stale config across pod replacement.

Use a PersistentVolumeClaim only when the customer explicitly wants the agent to restart from the last downloaded config during a config-server outage. A persistent cache improves outage tolerance but can also preserve stale runtime state.

Runtime Dependencies

The pod must be able to reach:

  • Controller, through portalRegistry.portalUrl.
  • Config-server, through startup.configServerUri.
  • light-gateway, through mcp-client.gatewayUrl and mcp-client.path.
  • The model provider, currently Ollama by default.
  • Postgres, through DATABASE_URL.

The Postgres database must contain the Hindsight memory tables used by light-agent, including:

  • agent_memory_bank_t
  • agent_memory_unit_t
  • agent_session_history_t

LIGHT_AGENT_HOST_ID must be a valid host UUID for the target tenant/host. The agent stores memory and session history under this host id.

Agent Roles

The same image can run different logical agents. Use a different service id, deployment name, Service name, and port for each concurrently running role.

Common service ids are:

com.networknt.agent.account-1.0.0
com.networknt.agent.advisor-1.0.0
com.networknt.agent.tech-support-1.0.0

For a single account agent, a conventional Kubernetes name is light-agent-account. For multiple agents in the same namespace, use names such as:

light-agent-account
light-agent-advisor
light-agent-tech-support

Each role needs a unique Service name. If the roles share one namespace and are exposed through one Ingress, route each role by host or path.

Registration Address

In Kubernetes, do not register the pod IP. Pod IPs are ephemeral.

If controller and callers are inside the same cluster, advertise the Service DNS name:

server.advertisedAddress: light-agent-account.light-agent

The pattern is:

<service-name>.<namespace>

The port is still registered separately from the host/address.

If controller or callers are outside the cluster, advertise the externally reachable DNS name instead, such as the Ingress or LoadBalancer hostname:

server.advertisedAddress: account-agent.customer.example.com

Bootstrap Config

Example values.yml for an in-cluster controller, config-server, gateway, Ollama, and Postgres:

startup.host: customer.example.com
startup.timeout: 3000
startup.connectTimeout: 3000
startup.bootstrapCaCertPath: config/ca.pem
startup.externalConfigDir: /app/config-cache

light-config-server-uri: https://config-server.lightapi.svc.cluster.local:8435

server.serviceId: com.networknt.agent.account-1.0.0
server.environment: prod
server.ip: 0.0.0.0
server.advertisedAddress: light-agent-account.light-agent
server.httpPort: 8083
server.enableHttp: true
server.httpsPort: 8443
server.enableHttps: false
server.enableRegistry: true
server.startOnRegistryFailure: true

portalRegistry.portalUrl: https://controller.lightapi.svc.cluster.local:8438

client.verifyHostname: true

mcp-client.gatewayUrl: https://ai-microgateway.light-gateway:8443
mcp-client.path: /mcp
mcp-client.timeoutMs: 5000

ollama.ollamaUrl: http://ollama.ai.svc.cluster.local:11434
ollama.model: llama3.1:8b

Example startup.yml:

host: ${startup.host:dev.lightapi.net}
serviceId: ${server.serviceId:com.networknt.agent.account-1.0.0}
envTag: ${server.environment:dev}
acceptHeader: application/yaml
timeout: ${startup.timeout:3000}
connectTimeout: ${startup.connectTimeout:3000}
configServerUri: ${light-config-server-uri:https://local.localhost}
authorization: ${light_portal_authorization:}
bootstrapCaCertPath: ${startup.bootstrapCaCertPath:config/ca.pem}
externalConfigDir: ${startup.externalConfigDir:/app/config-cache}

Example server.yml:

ip: ${server.ip:0.0.0.0}
advertisedAddress: ${server.advertisedAddress:127.0.0.1}
httpPort: ${server.httpPort:8083}
enableHttp: ${server.enableHttp:true}
httpsPort: ${server.httpsPort:8443}
enableHttps: ${server.enableHttps:false}
serviceId: ${server.serviceId:com.networknt.agent.account-1.0.0}
enableRegistry: ${server.enableRegistry:true}
startOnRegistryFailure: ${server.startOnRegistryFailure:true}
dynamicPort: ${server.dynamicPort:false}
environment: ${server.environment:dev}
shutdownGracefulPeriod: ${server.shutdownGracefulPeriod:2000}

Example portal-registry.yml:

portalUrl: ${portalRegistry.portalUrl:https://localhost:8438}
portalToken: ${light_portal_authorization:}
controllerDiscoveryToken: ${portalRegistry.controllerDiscoveryToken:}

Example client.yml:

tls:
  verifyHostname: ${client.verifyHostname:true}

Example mcp-client.yml:

gatewayUrl: ${mcp-client.gatewayUrl:https://ai-microgateway.light-gateway:8443}
path: ${mcp-client.path:/mcp}
timeoutMs: ${mcp-client.timeoutMs:5000}

Example ollama.yml:

ollamaUrl: ${ollama.ollamaUrl:http://ollama.ai.svc.cluster.local:11434}
model: ${ollama.model:llama3.1:8b}

For the current light-agent implementation, keep ollama.yml and mcp-client.yml in the local bootstrap config. They are read during process initialization before the runtime completes remote config bootstrap.

Use the customer CA in ca.pem. Do not disable hostname verification in production to work around certificate SAN problems.

Secrets

Store the portal bearer token, optional config password, host id, and database URL in a Kubernetes Secret.

Example:

apiVersion: v1
kind: Secret
metadata:
  name: light-agent-account-secret
  namespace: light-agent
type: Opaque
stringData:
  LIGHT_PORTAL_AUTHORIZATION: "Bearer <token>"
  light_4j_config_password: "<config-password-if-needed>"
  LIGHT_AGENT_HOST_ID: "<host-uuid>"
  DATABASE_URL: "postgres://agent_user:<password>@postgres.lightapi.svc.cluster.local:5432/configserver"
data:
  ca.pem: <base64-ca-pem>

LIGHT_PORTAL_AUTHORIZATION is used for config-server bootstrap and controller registration. It is not the end-user chat token. If downstream MCP tools require caller identity, the browser or BFF should send the user's Authorization header to the agent WebSocket endpoint so the agent can forward it to light-gateway.

Do not store real bearer tokens, database passwords, or customer CA material in Git, ConfigMaps, Helm values committed to the repo, or rendered deployment examples.

Example Manifests

Example ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: light-agent-account-config
  namespace: light-agent
  labels:
    app.kubernetes.io/name: light-agent-account
    app.kubernetes.io/component: agent
data:
  values.yml: |
    startup.host: customer.example.com
    startup.timeout: 3000
    startup.connectTimeout: 3000
    startup.bootstrapCaCertPath: config/ca.pem
    startup.externalConfigDir: /app/config-cache
    light-config-server-uri: https://config-server.lightapi.svc.cluster.local:8435
    server.serviceId: com.networknt.agent.account-1.0.0
    server.environment: prod
    server.ip: 0.0.0.0
    server.advertisedAddress: light-agent-account.light-agent
    server.httpPort: 8083
    server.enableHttp: true
    server.httpsPort: 8443
    server.enableHttps: false
    server.enableRegistry: true
    server.startOnRegistryFailure: true
    portalRegistry.portalUrl: https://controller.lightapi.svc.cluster.local:8438
    client.verifyHostname: true
    mcp-client.gatewayUrl: https://ai-microgateway.light-gateway:8443
    mcp-client.path: /mcp
    mcp-client.timeoutMs: 5000
    ollama.ollamaUrl: http://ollama.ai.svc.cluster.local:11434
    ollama.model: llama3.1:8b
  startup.yml: |
    host: ${startup.host:dev.lightapi.net}
    serviceId: ${server.serviceId:com.networknt.agent.account-1.0.0}
    envTag: ${server.environment:dev}
    acceptHeader: application/yaml
    timeout: ${startup.timeout:3000}
    connectTimeout: ${startup.connectTimeout:3000}
    configServerUri: ${light-config-server-uri:https://local.localhost}
    authorization: ${light_portal_authorization:}
    bootstrapCaCertPath: ${startup.bootstrapCaCertPath:config/ca.pem}
    externalConfigDir: ${startup.externalConfigDir:/app/config-cache}
  server.yml: |
    ip: ${server.ip:0.0.0.0}
    advertisedAddress: ${server.advertisedAddress:127.0.0.1}
    httpPort: ${server.httpPort:8083}
    enableHttp: ${server.enableHttp:true}
    httpsPort: ${server.httpsPort:8443}
    enableHttps: ${server.enableHttps:false}
    serviceId: ${server.serviceId:com.networknt.agent.account-1.0.0}
    enableRegistry: ${server.enableRegistry:true}
    startOnRegistryFailure: ${server.startOnRegistryFailure:true}
    dynamicPort: ${server.dynamicPort:false}
    environment: ${server.environment:dev}
    shutdownGracefulPeriod: ${server.shutdownGracefulPeriod:2000}
  portal-registry.yml: |
    portalUrl: ${portalRegistry.portalUrl:https://localhost:8438}
    portalToken: ${light_portal_authorization:}
    controllerDiscoveryToken: ${portalRegistry.controllerDiscoveryToken:}
  client.yml: |
    tls:
      verifyHostname: ${client.verifyHostname:true}
  mcp-client.yml: |
    gatewayUrl: ${mcp-client.gatewayUrl:https://ai-microgateway.light-gateway:8443}
    path: ${mcp-client.path:/mcp}
    timeoutMs: ${mcp-client.timeoutMs:5000}
  ollama.yml: |
    ollamaUrl: ${ollama.ollamaUrl:http://ollama.ai.svc.cluster.local:11434}
    model: ${ollama.model:llama3.1:8b}

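Render the public/ ConfigMap manifest from the repo asset. Because of --dry-run=client, the command only prints the YAML; save the output or pipe it to kubectl apply -f - to create the ConfigMap: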

kubectl -n light-agent create configmap light-agent-account-public \
  --from-file=index.html=apps/light-agent/public/index.html \
  --dry-run=client -o yaml

Example Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: light-agent-account
  namespace: light-agent
  labels:
    app.kubernetes.io/name: light-agent-account
    app.kubernetes.io/component: agent
    app.kubernetes.io/part-of: lightapi
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: light-agent-account
  template:
    metadata:
      labels:
        app.kubernetes.io/name: light-agent-account
        app.kubernetes.io/component: agent
        app.kubernetes.io/part-of: lightapi
    spec:
      securityContext:
        fsGroup: 999
        fsGroupChangePolicy: OnRootMismatch
      containers:
        - name: light-agent
          image: networknt/light-agent:2.2.1
          imagePullPolicy: IfNotPresent
          env:
            - name: LIGHT_PORTAL_AUTHORIZATION
              valueFrom:
                secretKeyRef:
                  name: light-agent-account-secret
                  key: LIGHT_PORTAL_AUTHORIZATION
            - name: light_4j_config_password
              valueFrom:
                secretKeyRef:
                  name: light-agent-account-secret
                  key: light_4j_config_password
                  optional: true
            - name: LIGHT_AGENT_HOST_ID
              valueFrom:
                secretKeyRef:
                  name: light-agent-account-secret
                  key: LIGHT_AGENT_HOST_ID
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: light-agent-account-secret
                  key: DATABASE_URL
            - name: RUST_LOG
              value: info
            - name: AGENT_LOG_ANSI
              value: "false"
          ports:
            - name: http
              containerPort: 8083
              protocol: TCP
            - name: https
              containerPort: 8443
              protocol: TCP
          readinessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 30
            periodSeconds: 30
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 1000m
              memory: 768Mi
          volumeMounts:
            - name: bootstrap-config
              mountPath: /config
              readOnly: true
            - name: config-cache
              mountPath: /app/config-cache
            - name: public
              mountPath: /app/public
              readOnly: true
      volumes:
        - name: bootstrap-config
          projected:
            sources:
              - configMap:
                  name: light-agent-account-config
              - secret:
                  name: light-agent-account-secret
                  items:
                    - key: ca.pem
                      path: ca.pem
        - name: config-cache
          emptyDir: {}
        - name: public
          configMap:
            name: light-agent-account-public

Example Service:

apiVersion: v1
kind: Service
metadata:
  name: light-agent-account
  namespace: light-agent
  labels:
    app.kubernetes.io/name: light-agent-account
    app.kubernetes.io/component: agent
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: light-agent-account
  ports:
    - name: http
      port: 8083
      targetPort: http
      protocol: TCP
    - name: https
      port: 8443
      targetPort: https
      protocol: TCP

External Access

For local testing with a ClusterIP Service:

kubectl -n light-agent port-forward svc/light-agent-account 8083:8083

Health check:

curl -i http://127.0.0.1:8083/health

If exposing through Ingress, make sure WebSocket upgrade is supported and idle timeouts are long enough for chat sessions.

Example NGINX Ingress annotations:

nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
nginx.ingress.kubernetes.io/backend-protocol: "HTTP"

If downstream MCP tools require caller identity, put the agent behind a BFF or authenticated reverse proxy that forwards the user's Authorization header to the WebSocket request. A WebSocket created by the browser from the embedded static UI cannot carry an Authorization header, because the browser WebSocket API does not allow custom request headers.

Deploy Through Light-Deployer

The repo template lives at:

apps/light-agent/k8s/light-agent

Use the same template rules as light-gateway.

When light-deployer runs outside the cluster and has LIGHT_DEPLOYER_TEMPLATE_BASE_DIR set, repoUrl: "local" can point to local templates.

When light-deployer runs inside Kubernetes, use a real Git URL:

{
  "template": {
    "repoUrl": "https://github.com/networknt/light-fabric.git",
    "ref": "main",
    "path": "apps/light-agent/k8s/light-agent"
  }
}

Do not use repoUrl: "local" for an in-cluster deployer unless the template repo is mounted into the deployer container and LIGHT_DEPLOYER_TEMPLATE_BASE_DIR points to it.

Keep Namespace out of templates rendered by light-deployer if the deployer policy blocks cluster-scoped resources. Create the namespace separately:

kubectl create namespace light-agent

Config-Server Requirements

Before deploying the agent pod, config-server should already have config for the tuple used by startup:

host = startup.host
serviceId = server.serviceId
envTag = server.environment
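
To confirm which tuple a given bootstrap config will send, the three values can be read back from values.yml. The helpers below are a hypothetical sketch that assumes the flat key: value layout used in the examples; leading indentation is tolerated so the same check works on a values.yml block copied out of a ConfigMap.

```shell
# yaml_value <file> <flat-key>: print the value of the last "key: value" line for that key.
yaml_value() {
  pattern=$(printf '%s' "$2" | sed 's/\./\\./g')   # match the dots in the key literally
  sed -n "s/^[[:space:]]*$pattern:[[:space:]]*//p" "$1" | tail -n 1
}

# startup_tuple <values.yml>: print the lookup tuple on one line.
startup_tuple() {
  printf 'host=%s serviceId=%s envTag=%s\n' \
    "$(yaml_value "$1" startup.host)" \
    "$(yaml_value "$1" server.serviceId)" \
    "$(yaml_value "$1" server.environment)"
}
```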

At minimum, config-server should return runtime config for:

  • values.yml.
  • server.yml when listener or registration settings are centrally managed.
  • portal-registry.yml when controller URLs or registry settings are centrally managed.
  • client.yml when TLS verification behavior is centrally managed.

For the current light-agent, keep mcp-client.yml and ollama.yml in the local bootstrap ConfigMap even if other runtime config comes from config-server. They are loaded before remote bootstrap completes.

Startup Flow

Expected runtime flow:

Kubernetes starts pod
  -> /app/light-agent
  -> read local /config/values.yml, ollama.yml, and mcp-client.yml
  -> connect to Postgres with DATABASE_URL
  -> build the MCP client for light-gateway
  -> call config-server with LIGHT_PORTAL_AUTHORIZATION
  -> write downloaded runtime config into /app/config-cache
  -> start the Axum HTTP/WebSocket server
  -> register the agent with controller using portalRegistry.portalUrl
  -> serve the chat UI from /app/public
  -> forward tool discovery and tool calls to light-gateway

If config-server is unavailable and /app/config-cache/values.yml exists, the runtime can continue from cached config. With emptyDir, that cache disappears when the pod is recreated. With a PVC, it can survive pod replacement.

Upgrade And Rollback

Use Kubernetes rolling updates with immutable image tags:

kubectl -n light-agent set image deploy/light-agent-account \
  light-agent=networknt/light-agent:2.2.2
kubectl -n light-agent rollout status deploy/light-agent-account

Rollback:

kubectl -n light-agent rollout undo deploy/light-agent-account

For production, prefer changing only one variable at a time: either image tag or config-server runtime config, not both in the same rollout.

Validation Checklist

After deployment:

  • kubectl -n light-agent rollout status deploy/light-agent-account succeeds.
  • Pods are ready and restart count is stable.
  • Logs show successful Postgres connection.
  • Logs show successful config-server bootstrap.
  • Logs show successful controller registration.
  • Controller shows the agent registered with the expected service id, environment, host, and port.
  • curl http://127.0.0.1:8083/health succeeds through port-forward, or the same /health request succeeds against the Ingress host.
  • The chat UI loads.
  • The chat WebSocket connects to /chat.
  • MCP tools/list reaches light-gateway.
  • MCP tools/call reaches the backend MCP server through light-gateway.
  • A pod restart still starts cleanly with the selected cache policy.

Security Checklist

  • Keep bearer tokens, config passwords, database passwords, and host ids in a Kubernetes Secret, not a ConfigMap.
  • Use customer CA trust and keep client.verifyHostname: true in production.
  • Use immutable image tags. When the registry is private, supply image pull credentials from a Kubernetes Secret.
  • Run as the non-root image user.
  • Make /config read-only.
  • Make only /app/config-cache writable.
  • Restrict ingress traffic to required agent ports.
  • Restrict egress traffic to config-server, controller, light-gateway, Ollama, and Postgres.
  • Rotate LIGHT_PORTAL_AUTHORIZATION through the customer secret process.

Light-Deployer

light-deployer is the cluster-local Kubernetes deployment executor for Light Portal.

It renders Kubernetes templates, validates manifests, applies resources through kube-rs, reports rollout status, and exposes deployment tools through an MCP JSON-RPC endpoint for local and MicroK8s testing.

Key Capabilities

  • MCP JSON-RPC endpoint at POST /mcp
  • AST-based YAML template rendering
  • Git template fetching with gix
  • Kubernetes dry-run, apply, delete, status, and prune
  • Redacted manifest summaries and diffs
  • SSE deployment events

Runtime

light-deployer uses light-runtime, light-axum, config-loader, and portal-registry so it follows the same service boot model as light-agent.

Testing Path

Use these pages in order when testing locally:

  1. Build Local
  2. Prepare Config
  3. Run Standalone
  4. Run Kubernetes

Start with standalone noop mode to validate template rendering. Then move to MicroK8s real mode once the render request and target templates are correct.

For MCP clients, Light Portal, and AI agents, use POST /mcp with JSON-RPC methods such as tools/list and tools/call. The /mcp/tools/* routes are kept only as local debugging conveniences.

Build Local

This page builds the light-deployer binary and container image from the Light Fabric workspace.

Run all commands from the repository root:

cd ~/workspace/light-fabric

Rust Build

Use cargo check first for a quick compile validation:

cargo check -p light-deployer

Run the deployer tests:

cargo test -p light-deployer

Build a debug binary:

cargo build -p light-deployer

Build a release binary:

cargo build --release -p light-deployer

The release binary is written to:

target/release/light-deployer

Docker Image

Build the local image:

./apps/light-deployer/build.sh latest

The default image name is:

networknt/light-deployer:latest

To override the image name:

IMAGE=localhost:32000/light-deployer:latest ./apps/light-deployer/build.sh latest

Verify the image exists:

docker image inspect networknt/light-deployer:latest

What The Image Contains

The Dockerfile copies the following into the image:

  • /usr/local/bin/light-deployer
  • /app/config

The container runs from /app, so the default runtime config directory is:

/app/config

The default HTTP port is 7088, configured in:

apps/light-deployer/config/server.yml

Expected Result

Before moving on, these commands should pass:

cargo check -p light-deployer
cargo test -p light-deployer
./apps/light-deployer/build.sh latest
docker image inspect networknt/light-deployer:latest

Prepare Config

light-deployer uses two kinds of configuration:

  • runtime config loaded by light-runtime
  • deployment request data sent through MCP tools/call at POST /mcp

Runtime Config Files

Default config lives in:

apps/light-deployer/config

Files:

  • server.yml: HTTP/HTTPS bind settings and service identity
  • deployer.yml: local deployer policy
  • portal-registry.yml: future portal/controller registry settings

When running from the workspace root, the deployer automatically uses:

apps/light-deployer/config

When running inside the Docker image, it uses:

/app/config

Override the config directory with:

LIGHT_DEPLOYER_CONFIG_DIR=/path/to/config

Server Config

The default server config listens on HTTP port 7088:

ip: ${server.ip:0.0.0.0}
httpPort: ${server.httpPort:7088}
enableHttp: ${server.enableHttp:true}
enableHttps: ${server.enableHttps:false}
serviceId: ${server.serviceId:com.networknt.light-deployer-0.1.0}
enableRegistry: ${server.enableRegistry:false}

To change the port without editing the file, provide values through the normal runtime values mechanism, or use a copied config directory for local testing.
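
The `${key:default}` expressions in server.yml can be read as "use the supplied value for `key`, otherwise fall back to `default`". The sketch below illustrates that reading only. It is not the config-loader implementation, `resolve_placeholder` is a hypothetical helper, and the in-memory map stands in for whatever source the runtime values mechanism actually uses.

```rust
use std::collections::HashMap;

/// Resolve one `${key:default}` expression against a map of override values.
/// Returns None when the input is not a single placeholder expression, or
/// when the key has neither an override nor a default.
fn resolve_placeholder(expr: &str, values: &HashMap<&str, &str>) -> Option<String> {
    let inner = expr.strip_prefix("${")?.strip_suffix('}')?;
    // Split on the first ':' only, so a default that contains ':' stays intact.
    let (key, default) = match inner.split_once(':') {
        Some((k, d)) => (k, Some(d)),
        None => (inner, None),
    };
    match values.get(key) {
        Some(v) => Some((*v).to_string()),
        None => default.map(|d| d.to_string()),
    }
}

fn main() {
    // Hypothetical override: move the deployer to another port.
    let mut values = HashMap::new();
    values.insert("server.httpPort", "8443");

    for expr in ["${server.ip:0.0.0.0}", "${server.httpPort:7088}"] {
        println!("{} -> {:?}", expr, resolve_placeholder(expr, &values));
    }
}
```

With this reading, overriding `server.httpPort` changes the port while every other setting keeps the default written in the file.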

Deployer Policy

The default policy is permissive enough for local testing:

deployerId: ${deployer.deployerId:local-light-deployer}
clusterId: ${deployer.clusterId:local}
allowedNamespaces: []
allowedRepoHosts: []
allowedRepoPrefixes: []
allowedImageRegistries: []
devInsecure: ${deployer.devInsecure:false}

Empty allow lists mean the policy does not restrict that dimension. For production, configure explicit values.
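
The "empty list means unrestricted" rule can be sketched as follows. The function names are hypothetical, and the comparison modes (exact match for namespaces, prefix match for repository URLs) are assumptions made for illustration rather than confirmed deployer behavior.

```rust
/// Exact-match check against an allow list where an empty list means
/// "this dimension is not restricted".
fn namespace_allowed(allowed: &[&str], namespace: &str) -> bool {
    allowed.is_empty() || allowed.contains(&namespace)
}

/// Prefix variant for an allowedRepoPrefixes-style list (assumed semantics).
fn repo_prefix_allowed(allowed_prefixes: &[&str], repo_url: &str) -> bool {
    allowed_prefixes.is_empty() || allowed_prefixes.iter().any(|p| repo_url.starts_with(*p))
}

fn main() {
    // Default policy: empty lists, so nothing is rejected.
    println!("{}", namespace_allowed(&[], "kube-system"));
    // Tighter policy: only the listed namespace and repository prefix pass.
    println!("{}", namespace_allowed(&["petstore-dev"], "kube-system"));
    println!(
        "{}",
        repo_prefix_allowed(
            &["https://github.com/networknt/"],
            "https://github.com/networknt/openapi-petstore.git"
        )
    );
}
```

This is why the default policy suits local testing only: with every list empty, no request is rejected on namespace, repository, or registry grounds.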

Example tighter policy:

deployerId: petstore-microk8s
clusterId: microk8s-local
allowedNamespaces:
  - petstore-dev
allowedRepoHosts:
  - github.com
allowedRepoPrefixes:
  - https://github.com/networknt/
allowedImageRegistries:
  - networknt
devInsecure: false
prune:
  enabled: true
  maxDeletePercent: 30
  sensitiveKinds:
    - PersistentVolumeClaim
  overrideRequired: true

Git Access

Public repositories do not need credentials.

For private HTTPS repositories, set:

LIGHT_DEPLOYER_GIT_TOKEN=...

Defaults:

  • GitHub username: x-access-token
  • Bitbucket Cloud username: x-token-auth

For Bitbucket app passwords or other Git servers:

LIGHT_DEPLOYER_GIT_USERNAME=my-user
LIGHT_DEPLOYER_GIT_TOKEN=my-token-or-app-password

Only HTTPS token auth is supported in Phase 1. SSH auth is deferred.

Template Repository Requirements

The target application repository should contain a k8s/ directory with YAML templates. The deployer reads all .yaml and .yml files under the requested template path.

Example template reference:

{
  "template": {
    "repoUrl": "https://github.com/networknt/openapi-petstore.git",
    "ref": "master",
    "path": "k8s"
  }
}

For local testing without Git clone, set:

LIGHT_DEPLOYER_TEMPLATE_BASE_DIR=/home/steve/workspace/openapi-petstore

Then use:

{
  "template": {
    "repoUrl": "local",
    "ref": "master",
    "path": "k8s"
  }
}

Request Values

The request values object supplies placeholder values for templates.

Example for openapi-petstore:

{
  "name": "openapi-petstore",
  "image": {
    "repository": "networknt/openapi-petstore",
    "tag": "latest",
    "pullPolicy": "IfNotPresent"
  },
  "service": {
    "name": "openapi-petstore",
    "type": "ClusterIP"
  },
  "resources": {
    "requests": {
      "memory": "64Mi",
      "cpu": "250m"
    },
    "limits": {
      "memory": "256Mi",
      "cpu": "500m"
    }
  }
}

The current renderer replaces placeholders inside YAML string scalar values. Avoid placeholders in Kubernetes fields that must be numeric; keep those fields as fixed numbers in the template.

Run Standalone

Standalone mode is the fastest way to test light-deployer before using a real Kubernetes cluster.

Use noop mode first. It validates config, HTTP endpoints, template loading, rendering, resource summaries, and response shape without mutating Kubernetes.

Run all commands from the repository root:

cd ~/workspace/light-fabric

Start With Built-In Sample

Start the deployer with the sample template directory:

LIGHT_DEPLOYER_TEMPLATE_BASE_DIR=apps/light-deployer/examples/petstore \
LIGHT_DEPLOYER_KUBE_MODE=noop \
cargo run -p light-deployer

The service listens on:

http://127.0.0.1:7088

Check health from another terminal:

curl -fsSL http://127.0.0.1:7088/health

Expected output:

ok

List Tools With MCP JSON-RPC

The MCP endpoint is JSON-RPC 2.0 over HTTP at:

POST /mcp

List all deployment tools:

curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "tools-list-1",
    "method": "tools/list",
    "params": {}
  }'

Call a tool through MCP:

curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "render-1",
    "method": "tools/call",
    "params": {
      "name": "deployment.render",
      "arguments": {
        "hostId": "local-host",
        "instanceId": "petstore-dev",
        "environment": "dev",
        "clusterId": "local",
        "namespace": "light-deployer",
        "values": {
          "name": "petstore",
          "image": {
            "repository": "nginx",
            "tag": "1.27"
          },
          "containerPort": 80
        },
        "template": {
          "repoUrl": "local",
          "ref": "main",
          "path": "k8s"
        }
      }
    }
  }'

For local debugging, the deployer also exposes REST-style convenience endpoints:

curl -fsSL http://127.0.0.1:7088/mcp/tools/list
curl -fsSL http://127.0.0.1:7088/mcp/tools
curl -fsSL http://127.0.0.1:7088/mcp/tools/deployment.render

Use POST /mcp for MCP clients and AI agents.

Render The Built-In Sample

curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "render-sample-1",
    "method": "tools/call",
    "params": {
      "name": "deployment.render",
      "arguments": {
        "hostId": "local-host",
        "instanceId": "petstore-dev",
        "environment": "dev",
        "clusterId": "microk8s-local",
        "namespace": "light-deployer",
        "values": {
          "name": "petstore",
          "replicas": 1,
          "image": {
            "repository": "nginx",
            "tag": "1.27"
          },
          "containerPort": 80,
          "service": {
            "port": 80
          }
        },
        "template": {
          "repoUrl": "local",
          "ref": "main",
          "path": "k8s"
        }
      }
    }
  }'

Expected response shape:

{
  "jsonrpc": "2.0",
  "result": {
    "isError": false,
    "structuredContent": {
      "action": "render",
      "status": "rendered",
      "deployerId": "local-light-deployer",
      "clusterId": "local",
      "resources": [
        {
          "kind": "Deployment",
          "name": "petstore"
        },
        {
          "kind": "Service",
          "name": "petstore"
        }
      ]
    }
  }
}

The response also carries per-request fields such as requestId and manifestHash, which are omitted above. Their exact values will differ.

Render openapi-petstore Locally

If /home/steve/workspace/openapi-petstore is available and has a k8s/ folder, run:

LIGHT_DEPLOYER_TEMPLATE_BASE_DIR=/home/steve/workspace/openapi-petstore \
LIGHT_DEPLOYER_KUBE_MODE=noop \
cargo run -p light-deployer

Render request:

curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "render-openapi-petstore-1",
    "method": "tools/call",
    "params": {
      "name": "deployment.render",
      "arguments": {
        "hostId": "local-host",
        "instanceId": "openapi-petstore-dev",
        "environment": "dev",
        "clusterId": "microk8s-local",
        "namespace": "petstore-dev",
        "values": {
          "name": "openapi-petstore",
          "image": {
            "repository": "networknt/openapi-petstore",
            "tag": "latest",
            "pullPolicy": "IfNotPresent"
          },
          "service": {
            "name": "openapi-petstore",
            "type": "ClusterIP"
          },
          "resources": {
            "requests": {
              "memory": "64Mi",
              "cpu": "250m"
            },
            "limits": {
              "memory": "256Mi",
              "cpu": "500m"
            }
          }
        },
        "template": {
          "repoUrl": "local",
          "ref": "master",
          "path": "k8s"
        }
      }
    }
  }'

Expected resources:

  • Deployment/openapi-petstore
  • Service/openapi-petstore

Test Git Fetch

Stop the local-template run and restart without LIGHT_DEPLOYER_TEMPLATE_BASE_DIR:

LIGHT_DEPLOYER_KUBE_MODE=noop \
cargo run -p light-deployer

Render from GitHub:

curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "render-git-1",
    "method": "tools/call",
    "params": {
      "name": "deployment.render",
      "arguments": {
        "hostId": "local-host",
        "instanceId": "openapi-petstore-dev",
        "environment": "dev",
        "clusterId": "microk8s-local",
        "namespace": "petstore-dev",
        "values": {
          "name": "openapi-petstore",
          "image": {
            "repository": "networknt/openapi-petstore",
            "tag": "latest"
          }
        },
        "template": {
          "repoUrl": "https://github.com/networknt/openapi-petstore.git",
          "ref": "master",
          "path": "k8s"
        }
      }
    }
  }'

For a private repository:

LIGHT_DEPLOYER_GIT_TOKEN=... \
LIGHT_DEPLOYER_KUBE_MODE=noop \
cargo run -p light-deployer

For Bitbucket app-password style auth:

LIGHT_DEPLOYER_GIT_USERNAME=my-user \
LIGHT_DEPLOYER_GIT_TOKEN=my-app-password \
LIGHT_DEPLOYER_KUBE_MODE=noop \
cargo run -p light-deployer

Dry Run And Diff In Noop Mode

Noop mode can also exercise the request path for deployment.dryRun and deployment.diff:

curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "dry-run-sample-1",
    "method": "tools/call",
    "params": {
      "name": "deployment.dryRun",
      "arguments": {
        "hostId": "local-host",
        "instanceId": "petstore-dev",
        "environment": "dev",
        "clusterId": "microk8s-local",
        "namespace": "light-deployer",
        "values": {
          "name": "petstore",
          "replicas": 1,
          "image": {
            "repository": "nginx",
            "tag": "1.27"
          },
          "containerPort": 80,
          "service": {
            "port": 80
          }
        },
        "template": {
          "repoUrl": "local",
          "ref": "main",
          "path": "k8s"
        }
      }
    }
  }'

curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "diff-sample-1",
    "method": "tools/call",
    "params": {
      "name": "deployment.diff",
      "arguments": {
        "hostId": "local-host",
        "instanceId": "petstore-dev",
        "environment": "dev",
        "clusterId": "microk8s-local",
        "namespace": "light-deployer",
        "values": {
          "name": "petstore",
          "replicas": 1,
          "image": {
            "repository": "nginx",
            "tag": "1.27"
          },
          "containerPort": 80,
          "service": {
            "port": 80
          }
        },
        "template": {
          "repoUrl": "local",
          "ref": "main",
          "path": "k8s"
        }
      }
    }
  }'

These calls do not validate against Kubernetes unless real mode is enabled.

Stop The Service

Press Ctrl-C in the terminal running cargo run.

Run Kubernetes

This page runs light-deployer inside MicroK8s and uses the in-cluster ServiceAccount with kube-rs.

Prerequisites

MicroK8s should be running and microk8s kubectl should work:

microk8s status --wait-ready
microk8s kubectl get nodes

Build the image first:

cd ~/workspace/light-fabric
./apps/light-deployer/build.sh latest

Import Image Into MicroK8s

docker save networknt/light-deployer:latest | microk8s ctr image import -

If your MicroK8s install requires elevated permissions:

docker save networknt/light-deployer:latest | sudo microk8s ctr image import -

Verify the image is available:

microk8s ctr images ls | grep light-deployer

Install Deployer

Apply the included manifests:

microk8s kubectl apply -f apps/light-deployer/k8s/namespace.yaml
microk8s kubectl apply -f apps/light-deployer/k8s/rbac.yaml
microk8s kubectl apply -f apps/light-deployer/k8s/deployment.yaml
microk8s kubectl apply -f apps/light-deployer/k8s/service.yaml

Wait for the pod:

microk8s kubectl -n light-deployer rollout status deploy/light-deployer
microk8s kubectl -n light-deployer get pods

Check logs:

microk8s kubectl -n light-deployer logs deploy/light-deployer

The deployment sets:

LIGHT_DEPLOYER_KUBE_MODE=real

So the service uses real Kubernetes API calls from inside the cluster.

Port Forward

microk8s kubectl -n light-deployer port-forward svc/light-deployer 7088:7088

In another terminal:

curl -fsSL http://127.0.0.1:7088/health

Expected:

ok

List Tools

curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "tools-list-1",
    "method": "tools/list",
    "params": {}
  }'

The response contains the deployer's tool names, descriptions, input schemas, and invocation metadata. Light Portal can use this JSON-RPC response to populate MCP tools for the API details view.

Render In Kubernetes

Rendering does not mutate the cluster:

curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "render-sample-1",
    "method": "tools/call",
    "params": {
      "name": "deployment.render",
      "arguments": {
        "hostId": "local-host",
        "instanceId": "petstore-dev",
        "environment": "dev",
        "clusterId": "microk8s-local",
        "namespace": "light-deployer",
        "values": {
          "name": "petstore",
          "replicas": 1,
          "image": {
            "repository": "nginx",
            "tag": "1.27"
          },
          "containerPort": 80,
          "service": {
            "port": 80
          }
        },
        "template": {
          "repoUrl": "local",
          "ref": "main",
          "path": "k8s"
        }
      }
    }
  }'

Dry Run In Kubernetes

Dry-run renders the manifest and asks the Kubernetes API to validate it without persisting resources:

curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "dry-run-sample-1",
    "method": "tools/call",
    "params": {
      "name": "deployment.dryRun",
      "arguments": {
        "hostId": "local-host",
        "instanceId": "petstore-dev",
        "environment": "dev",
        "clusterId": "microk8s-local",
        "namespace": "light-deployer",
        "values": {
          "name": "petstore",
          "replicas": 1,
          "image": {
            "repository": "nginx",
            "tag": "1.27"
          },
          "containerPort": 80,
          "service": {
            "port": 80
          }
        },
        "template": {
          "repoUrl": "local",
          "ref": "main",
          "path": "k8s"
        }
      }
    }
  }'

Expected status:

{
  "jsonrpc": "2.0",
  "result": {
    "isError": false,
    "structuredContent": {
      "status": "validated"
    }
  }
}

Deploy Sample

The sample request deploys into the light-deployer namespace so it matches the included namespace-scoped RBAC.

curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "apply-sample-1",
    "method": "tools/call",
    "params": {
      "name": "deployment.apply",
      "arguments": {
        "hostId": "local-host",
        "instanceId": "petstore-dev",
        "environment": "dev",
        "clusterId": "microk8s-local",
        "namespace": "light-deployer",
        "values": {
          "name": "petstore",
          "replicas": 1,
          "image": {
            "repository": "nginx",
            "tag": "1.27"
          },
          "containerPort": 80,
          "service": {
            "port": 80
          }
        },
        "template": {
          "repoUrl": "local",
          "ref": "main",
          "path": "k8s"
        }
      }
    }
  }'

The response should return quickly with an accepted/applying-style status. The operation continues in the deployer.

Watch Kubernetes resources:

microk8s kubectl -n light-deployer get deploy,svc,pods

Stream Events

Use the requestId from the deployment response:

curl -N "http://127.0.0.1:7088/events?request_id=<requestId>"

The event stream reports deployment progress and failures for that request.

Check Status

curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "status-sample-1",
    "method": "tools/call",
    "params": {
      "name": "deployment.status",
      "arguments": {
        "hostId": "local-host",
        "instanceId": "petstore-dev",
        "environment": "dev",
        "clusterId": "microk8s-local",
        "namespace": "light-deployer",
        "template": {
          "repoUrl": "local",
          "ref": "main",
          "path": "k8s"
        }
      }
    }
  }'

Undeploy Sample

curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "delete-sample-1",
    "method": "tools/call",
    "params": {
      "name": "deployment.delete",
      "arguments": {
        "hostId": "local-host",
        "instanceId": "petstore-dev",
        "environment": "dev",
        "clusterId": "microk8s-local",
        "namespace": "light-deployer",
        "template": {
          "repoUrl": "local",
          "ref": "main",
          "path": "k8s"
        }
      }
    }
  }'

Then verify resources:

microk8s kubectl -n light-deployer get deploy,svc,pods

Deploy openapi-petstore From Git

After the openapi-petstore repository has a k8s/ folder committed, use a request like this:

curl -fsSL http://127.0.0.1:7088/mcp \
  -H 'content-type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "apply-openapi-petstore-1",
    "method": "tools/call",
    "params": {
      "name": "deployment.apply",
      "arguments": {
        "hostId": "local-host",
        "instanceId": "openapi-petstore-dev",
        "environment": "dev",
        "clusterId": "microk8s-local",
        "namespace": "light-deployer",
        "values": {
          "name": "openapi-petstore",
          "image": {
            "repository": "networknt/openapi-petstore",
            "tag": "latest",
            "pullPolicy": "IfNotPresent"
          },
          "service": {
            "name": "openapi-petstore",
            "type": "ClusterIP"
          }
        },
        "template": {
          "repoUrl": "https://github.com/networknt/openapi-petstore.git",
          "ref": "master",
          "path": "k8s"
        }
      }
    }
  }'

For private Git access, set LIGHT_DEPLOYER_GIT_TOKEN on the deployer pod. In Kubernetes this should be injected from a Secret, not written directly into the deployment manifest.

Update The Deployer Image

After rebuilding locally:

./apps/light-deployer/build.sh latest
docker save networknt/light-deployer:latest | microk8s ctr image import -
microk8s kubectl -n light-deployer rollout restart deploy/light-deployer
microk8s kubectl -n light-deployer rollout status deploy/light-deployer

Remove The Deployer

microk8s kubectl delete -f apps/light-deployer/k8s/service.yaml
microk8s kubectl delete -f apps/light-deployer/k8s/deployment.yaml
microk8s kubectl delete -f apps/light-deployer/k8s/rbac.yaml
microk8s kubectl delete -f apps/light-deployer/k8s/namespace.yaml

Light-Gateway

light-gateway is the Pingora-based gateway product in Light Fabric.

It is intended to host gateway behavior such as routing, proxying, and eventually AI/MCP gateway integrations while using the shared runtime and config model.

Key Dependencies

  • light-runtime
  • light-pingora
  • config-loader

Runtime

The gateway uses light-pingora as its transport framework and light-runtime for lifecycle, bootstrap, and service configuration.

Light-Rule In Light-Gateway

light-gateway uses Light-Rule to enforce deterministic policy decisions in the Pingora request path. The first production use is MCP tool authorization and response filtering for the mcp handler.

This lets a gateway route agent MCP traffic to downstream MCP servers or API servers while enforcing fine-grained authorization locally from configuration delivered by config-server.

When It Runs

Light-Rule is invoked by light-gateway when all of these are true:

  • handler.yml includes the mcp handler in the matched chain.
  • mcp-router.yml enables the MCP router and defines tools.
  • access-control.yml and/or rule.yml are available from local config or config-server.
  • A client sends tools/call to the configured MCP endpoint, normally /mcp.

The dependency path is:

light-gateway
  -> light-pingora
  -> light-rule

light-gateway links light-pingora, and light-pingora links light-rule. The rule engine is therefore part of the gateway binary; there is no dynamic plugin loading step.

Request Flow

For MCP traffic, the runtime flow is:

POST /mcp
  -> handler.yml selects mcp
  -> mcp-router parses JSON-RPC tools/call
  -> access-control runtime builds rule context
  -> light-rule evaluates req-acc rules
  -> denied: return JSON-RPC error -32001
  -> allowed: call downstream HTTP or MCP tool
  -> light-rule evaluates optional res-fil rules
  -> return JSON-RPC result

Authorization happens before the downstream call. Response filtering happens after the downstream response and before the MCP JSON-RPC response is returned to the agent.

Required Files

handler.yml

The mcp handler must be in the execution chain for the MCP path:

handlers:
  - correlation
  - security
  - mcp

paths:
  - path: /mcp
    method: POST
    exec:
      - correlation
      - security
      - mcp

defaultHandlers: []

The security handler should run before mcp when rules depend on JWT claims such as role, grp, pos, att, uid, or sub.

mcp-router.yml

mcp-router.yml exposes the MCP endpoint and maps tools to downstream APIs or downstream MCP servers:

enabled: true
path: /mcp
maxSessions: 10000
maxSessionsPerClient: 100
tools:
  - name: weather
    description: Get weather.
    targetHost: http://weather-api:8080
    path: /weather
    method: GET
    endpoint: /weather@get
    apiType: http
    inputSchema:
      type: object
      properties:
        city:
          type: string

The endpoint field is the stable policy key used by rule.yml. If it is omitted, the gateway derives one from the tool path and method, such as /weather@get.
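
A minimal sketch of that fallback, assuming the derived key is simply the path, an `@`, and the lower-cased HTTP method, as the `/weather@get` example suggests. `policy_endpoint` is a hypothetical helper, not a gateway API.

```rust
/// Return the policy key for a tool: the explicit `endpoint` when configured,
/// otherwise one derived from the tool path and method.
fn policy_endpoint(explicit: Option<&str>, path: &str, method: &str) -> String {
    match explicit {
        Some(endpoint) => endpoint.to_string(),
        // Assumed derivation: "<path>@<method in lower case>".
        None => format!("{}@{}", path, method.to_lowercase()),
    }
}

fn main() {
    // No explicit endpoint: the key follows the downstream path.
    println!("{}", policy_endpoint(None, "/weather", "GET"));
    // Explicit endpoint: the key stays stable even if the path changes.
    println!("{}", policy_endpoint(Some("/weather@get"), "/v2/weather", "GET"));
}
```

The second call shows why an explicit endpoint is useful: the downstream path can move to `/v2/weather` without invalidating the rules keyed on `/weather@get`.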

maxSessions caps the total in-memory MCP frontend sessions for this gateway process. maxSessionsPerClient caps sessions for one authenticated client or, when no principal is available, one MCP clientInfo.name and clientInfo.version pair.
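
The two caps can be pictured as a global counter plus a per-client counter. The sketch below is illustrative only. `SessionLimiter` and its client key format are hypothetical, session close and expiry are left out, and the gateway's real bookkeeping may differ.

```rust
use std::collections::HashMap;

/// Tracks open MCP frontend sessions against a global and a per-client cap.
struct SessionLimiter {
    max_sessions: usize,
    max_sessions_per_client: usize,
    total: usize,
    per_client: HashMap<String, usize>,
}

impl SessionLimiter {
    fn new(max_sessions: usize, max_sessions_per_client: usize) -> Self {
        SessionLimiter {
            max_sessions,
            max_sessions_per_client,
            total: 0,
            per_client: HashMap::new(),
        }
    }

    /// Try to open a session for `client_key`: the authenticated principal,
    /// or a clientInfo name and version pair when no principal is available.
    fn try_open(&mut self, client_key: &str) -> bool {
        if self.total >= self.max_sessions {
            return false; // the process-wide cap is reached
        }
        let count = self.per_client.entry(client_key.to_string()).or_insert(0);
        if *count >= self.max_sessions_per_client {
            return false; // this client already holds its share
        }
        *count += 1;
        self.total += 1;
        true
    }
}

fn main() {
    // Values from the mcp-router.yml example above.
    let mut limiter = SessionLimiter::new(10000, 100);
    println!("opened: {}", limiter.try_open("agent-a/1.0"));
}
```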

For downstream MCP servers, set apiType: mcp. For downstream API servers, use apiType: http or omit it when the default is acceptable.

access-control.yml

access-control.yml controls whether policy is active and how rules combine:

enabled: true
accessRuleLogic: any
defaultDeny: true
skipPathPrefixes: []

Fields:

  • enabled: turns access-control evaluation on or off.
  • accessRuleLogic: any or all for req-acc rule ids on an endpoint.
  • defaultDeny: when true, deny calls with no matching endpoint rule.
  • skipPathPrefixes: endpoint prefixes that bypass access control.
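
How the four fields interact can be sketched as one decision function. The type and function names are hypothetical, and the order of the checks is an assumption. The sketch takes the outcome of each req-acc rule as an input instead of running a rule engine.

```rust
/// How several req-acc rule results on one endpoint are combined.
#[derive(Clone, Copy)]
enum RuleLogic {
    Any,
    All,
}

/// In-memory view of access-control.yml (hypothetical struct).
struct AccessControl {
    enabled: bool,
    access_rule_logic: RuleLogic,
    default_deny: bool,
    skip_path_prefixes: Vec<String>,
}

/// `rule_results` holds one boolean per req-acc rule mapped to the endpoint.
/// An empty slice means no endpoint rule matched.
fn is_allowed(cfg: &AccessControl, endpoint: &str, rule_results: &[bool]) -> bool {
    if !cfg.enabled {
        return true; // evaluation is switched off
    }
    if cfg.skip_path_prefixes.iter().any(|p| endpoint.starts_with(p.as_str())) {
        return true; // endpoint bypasses access control
    }
    if rule_results.is_empty() {
        return !cfg.default_deny; // nothing matched: defaultDeny decides
    }
    match cfg.access_rule_logic {
        RuleLogic::Any => rule_results.iter().any(|r| *r),
        RuleLogic::All => rule_results.iter().all(|r| *r),
    }
}

fn main() {
    let any_cfg = AccessControl {
        enabled: true,
        access_rule_logic: RuleLogic::Any,
        default_deny: true,
        skip_path_prefixes: vec![],
    };
    let all_cfg = AccessControl { access_rule_logic: RuleLogic::All, ..any_cfg_copy(&any_cfg) };
    println!("any: {}", is_allowed(&any_cfg, "/weather@get", &[false, true]));
    println!("all: {}", is_allowed(&all_cfg, "/weather@get", &[false, true]));
    println!("unmapped: {}", is_allowed(&any_cfg, "/unmapped@get", &[]));
}

/// Small helper so the demo can reuse one configuration with a different logic.
fn any_cfg_copy(cfg: &AccessControl) -> AccessControl {
    AccessControl {
        enabled: cfg.enabled,
        access_rule_logic: cfg.access_rule_logic,
        default_deny: cfg.default_deny,
        skip_path_prefixes: cfg.skip_path_prefixes.clone(),
    }
}
```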

The file name is access-control.yml. The loader also accepts access-control.yaml.

rule.yml

rule.yml provides the rules and endpoint mappings:

ruleBodies:
  allowMcpReader:
    common: Y
    ruleId: allowMcpReader
    ruleName: Allow MCP reader
    ruleType: req-acc
    conditions:
      - operatorCode: isNotNull
        propertyPath: auditInfo.subject_claims.ClaimsMap.role
    actions:
      - actionClassName: com.networknt.rule.RoleBasedAccessControlAction

endpointRules:
  /weather@get:
    req-acc:
      - allowMcpReader
    permission:
      roles: mcp-reader

In this example, a caller is allowed only when the authenticated principal has a role matching mcp-reader.

The file name is rule.yml. The loader also accepts rule.yaml.

Rule Context

For MCP tool calls, the gateway builds a rule context with:

  • auditInfo.subject_claims.ClaimsMap: normalized JWT claims from the security handler.
  • headers: incoming agent request headers, lowercased.
  • endpoint: the tool policy endpoint, for example /weather@get.
  • toolName: the MCP tool name.
  • toolArguments: the JSON arguments from tools/call.
  • correlationId: the correlation id when one is available.
  • permission: endpoint permission values merged into the root context.

The current built-in access-control action checks the caller role against permission.roles.

Response filter actions can also use these claim dimensions:

role
group or grp
position or pos
attribute or att
user, user_id, uid, or sub

Built-In Actions

The gateway registers Rust actions under Java-compatible class names:

com.networknt.rule.RoleBasedAccessControlAction
RoleBasedAccessControlAction
com.networknt.rule.ResponseColumnFilterAction
ResponseColumnFilterAction
com.networknt.rule.ResponseRowFilterAction
ResponseRowFilterAction

RoleBasedAccessControlAction

Used with req-acc. It compares the caller role claim to permission.roles. If there is no role claim or no configured roles, the action returns denied.
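
A minimal sketch of that comparison, under one loud assumption: both the role claim and permission.roles are treated as whitespace-separated lists, and a single shared role is enough. The real action may parse these values differently. `role_allowed` is a hypothetical name.

```rust
/// Allow the call only when the caller holds at least one permitted role.
/// A missing claim or missing configuration denies, as described above.
fn role_allowed(caller_role_claim: Option<&str>, permission_roles: Option<&str>) -> bool {
    match (caller_role_claim, permission_roles) {
        (Some(claim), Some(permitted)) => {
            // Assumption: both values are whitespace-separated role lists.
            let permitted: Vec<&str> = permitted.split_whitespace().collect();
            claim.split_whitespace().any(|role| permitted.contains(&role))
        }
        // No role claim or no configured roles: deny.
        _ => false,
    }
}

fn main() {
    println!("{}", role_allowed(Some("mcp-reader"), Some("mcp-reader")));
    println!("{}", role_allowed(None, Some("mcp-reader")));
}
```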

ResponseColumnFilterAction

Used with res-fil. It filters fields from array-like JSON responses according to endpoint permission configuration.

Example:

ruleBodies:
  filterColumns:
    common: Y
    ruleId: filterColumns
    ruleName: Filter account columns
    ruleType: res-fil
    conditions:
      - operatorCode: isNotNull
        propertyPath: col
    actions:
      - actionClassName: com.networknt.rule.ResponseColumnFilterAction

endpointRules:
  /accounts@get:
    res-fil:
      - filterColumns
    permission:
      col:
        role:
          mcp-reader: '["id","name"]'

ResponseRowFilterAction

Used with res-fil. It filters rows from array-like JSON responses according to configured row predicates.

Example:

ruleBodies:
  filterRows:
    common: Y
    ruleId: filterRows
    ruleName: Filter account rows
    ruleType: res-fil
    conditions:
      - operatorCode: isNotNull
        propertyPath: row
    actions:
      - actionClassName: com.networknt.rule.ResponseRowFilterAction

endpointRules:
  /accounts@get:
    res-fil:
      - filterRows
    permission:
      row:
        role:
          mcp-reader:
            - colName: status
              operator: "="
              colValue: "OPEN"
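
The row predicate above can be read as "for role mcp-reader, keep only rows whose status equals OPEN". The sketch below illustrates that reading and nothing more. It assumes matching rows are kept, that all predicates must hold, and that only the "=" operator is known. Rows are modeled as string maps to stay within the standard library, whereas the gateway works on JSON responses. All names are hypothetical.

```rust
use std::collections::BTreeMap;

/// One response row, modeled as column name to string value.
type Row = BTreeMap<String, String>;

/// Mirrors one colName / operator / colValue entry from rule.yml.
struct RowPredicate {
    col_name: String,
    operator: String,
    col_value: String,
}

/// True when the row satisfies the predicate. Unknown operators and
/// missing columns count as "no match" in this sketch.
fn row_matches(row: &Row, pred: &RowPredicate) -> bool {
    match (pred.operator.as_str(), row.get(&pred.col_name)) {
        ("=", Some(value)) => *value == pred.col_value,
        _ => false,
    }
}

/// Keep only the rows that satisfy every predicate (assumed semantics).
fn filter_rows(rows: Vec<Row>, predicates: &[RowPredicate]) -> Vec<Row> {
    rows.into_iter()
        .filter(|row| predicates.iter().all(|p| row_matches(row, p)))
        .collect()
}

fn main() {
    let mut open = Row::new();
    open.insert("id".to_string(), "1".to_string());
    open.insert("status".to_string(), "OPEN".to_string());
    let mut closed = Row::new();
    closed.insert("id".to_string(), "2".to_string());
    closed.insert("status".to_string(), "CLOSED".to_string());

    let predicates = vec![RowPredicate {
        col_name: "status".to_string(),
        operator: "=".to_string(),
        col_value: "OPEN".to_string(),
    }];
    let kept = filter_rows(vec![open, closed], &predicates);
    println!("rows kept: {}", kept.len());
}
```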

Matching Rules

Endpoint matching checks:

  • exact endpoint key first
  • Java-style path templates such as /accounts/{id}@get
  • parent path entries, for example /accounts@get for /accounts/123@get
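
The three checks can be sketched as a lookup that tries them in the listed order. This is an approximation for illustration: the function names are hypothetical, template segments are any `{...}` segment, and walking up more than one parent level is an assumption.

```rust
/// Return the configured rule key that applies to `endpoint`,
/// for example "/accounts/123@get".
fn match_endpoint<'a>(keys: &[&'a str], endpoint: &str) -> Option<&'a str> {
    // 1. Exact endpoint key.
    if let Some(k) = keys.iter().find(|k| **k == endpoint) {
        return Some(*k);
    }
    let (path, method) = endpoint.rsplit_once('@')?;

    // 2. Path templates such as /accounts/{id}@get.
    for key in keys {
        if let Some((key_path, key_method)) = key.rsplit_once('@') {
            if key_method == method && template_matches(key_path, path) {
                return Some(*key);
            }
        }
    }

    // 3. Parent path entries, nearest parent first.
    let mut current = path;
    while let Some((parent, _)) = current.rsplit_once('/') {
        if parent.is_empty() {
            break;
        }
        let candidate = format!("{}@{}", parent, method);
        if let Some(k) = keys.iter().find(|k| **k == candidate.as_str()) {
            return Some(*k);
        }
        current = parent;
    }
    None
}

/// Segment-wise comparison where a `{name}` segment matches any value.
fn template_matches(template: &str, path: &str) -> bool {
    let t: Vec<&str> = template.split('/').collect();
    let p: Vec<&str> = path.split('/').collect();
    t.len() == p.len()
        && t.iter()
            .zip(p.iter())
            .all(|(ts, ps)| (ts.starts_with('{') && ts.ends_with('}')) || ts == ps)
}

fn main() {
    let keys = ["/accounts/{id}@get", "/accounts@get", "/weather@get"];
    println!("{:?}", match_endpoint(&keys, "/accounts/123@get"));
    println!("{:?}", match_endpoint(&keys, "/orders@get"));
}
```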

For MCP tools, prefer explicitly setting endpoint in mcp-router.yml so the policy key remains stable even if the downstream path changes.

Reload Behavior

light-gateway has reload support for MCP and access-control config:

  • reloading mcp-router.yml rebuilds the MCP router runtime
  • reloading access-control.yml or rule.yml rebuilds MCP and WebSocket policy runtimes

This matches the product model where light-portal manages configuration and config-server delivers the resolved files.

Operational Notes

  • If access-control.yml is missing, MCP tools are allowed unless another handler blocks the request.
  • If access-control.yml is enabled and defaultDeny is true, a tool call with no matching req-acc endpoint rule is denied.
  • If the security handler does not run before mcp, role-based rules will not have caller claims and will deny.
  • Rule execution is local to the gateway. It does not call the database on each request.
  • x-mask and x-mask-pattern in MCP tool inputSchema are handled before downstream execution. x-tokenize is reserved for the tokenization service integration.

Verification

Useful checks:

cargo tree -p light-gateway -i light-rule
cargo test -p light-pingora access_control
cargo test -p light-gateway gateway_loads_mcp_router_when_mcp_handler_is_active

The first command confirms that light-rule is in the light-gateway dependency tree and is therefore compiled into the gateway binary. The test commands verify the MCP access-control path, default deny behavior, role-based allow behavior, response filtering, and gateway MCP runtime loading.

Deploy Native

This page describes the recommended VM deployment model for the Rust light-gateway native binary.

Use this model when a customer wants to run light-gateway as a microgateway on a VM to protect backend MCP servers. The gateway starts from a small local bootstrap config, downloads runtime config from config-server, then registers itself with controller.

Deliver a versioned install bundle, not an ad hoc runtime script.

The bundle should contain:

  • light-gateway native binary.
  • Minimal bootstrap config files.
  • A systemd unit.
  • An install script for filesystem setup.
  • A root-owned environment file for secrets.

The install script can create users, directories, symlinks, permissions, and the systemd unit. It should not be the long-running process wrapper, and it should not pass secrets as command-line arguments.

Use systemd to run the service:

  • It restarts the process on failure.
  • It keeps logs in the host journal.
  • It avoids shell-history and process-list leakage from command-line secrets.
  • It gives the customer a standard operational surface: start, stop, restart, status, and journalctl.

Runtime Layout

light-gateway uses relative runtime paths:

  • config
  • config-cache

The systemd service should therefore set WorkingDirectory to the installed application directory.

Recommended VM layout:

/opt/light-gateway/
  light-gateway
  config -> /etc/light-gateway
  config-cache -> /var/lib/light-gateway/config-cache

/etc/light-gateway/
  startup.yml
  server.yml
  portal-registry.yml
  client.yml
  values.yml
  ca.pem
  light-gateway.env

/var/lib/light-gateway/
  config-cache/

The local config directory contains only bootstrap-time files. Runtime config downloaded from config-server is written to config-cache before Pingora starts. Keep config-cache writable by the light-gateway service user.

Build Artifact

Build a release binary from light-fabric:

cargo build --release -p light-gateway

The artifact is:

target/release/light-gateway

Build on a Linux distribution that is compatible with the customer VM. If the customer fleet has mixed Linux versions, prefer a static or target-compatible build so the binary does not fail on an older glibc.

Package with a versioned filename:

light-gateway-<version>-linux-amd64.tar.gz

For customers with package-management standards, wrap the same layout in a .deb or .rpm later. Start with tar.gz until the runtime contract is stable.

Bootstrap Config

The local bootstrap config only needs enough information to reach config-server, identify the gateway instance, and trust TLS.

Example values.yml:

startup.host: customer.example.com
startup.timeout: 3000
startup.connectTimeout: 3000
startup.bootstrapCaCertPath: config/ca.pem

light-config-server-uri: https://config-server.customer.example.com:8435

server.serviceId: com.customer.mcp-gateway-1.0.0
server.environment: prod
server.ip: 0.0.0.0
server.advertisedAddress: mcp-gateway-01.customer.example.com
server.httpPort: 8080
server.enableHttp: true
server.httpsPort: 8443
server.enableHttps: false
server.enableRegistry: true
server.startOnRegistryFailure: true

portalRegistry.portalUrl: https://controller.customer.example.com:8438

server.advertisedAddress must be a stable address that controller and clients can use to reach the VM gateway. Do not advertise 127.0.0.1 or 0.0.0.0.
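
An install or pre-start check could catch an unusable advertised address early. The sketch below is illustrative only: is_advertisable is a hypothetical helper, and rejecting the name localhost is an added assumption beyond the two addresses named above.

```rust
use std::net::IpAddr;

/// Hypothetical pre-flight check for server.advertisedAddress.
fn is_advertisable(address: &str) -> bool {
    let address = address.trim();
    // Assumption: an empty value or "localhost" is never reachable by the controller.
    if address.is_empty() || address.eq_ignore_ascii_case("localhost") {
        return false;
    }
    match address.parse::<IpAddr>() {
        // Reject loopback (such as 127.0.0.1) and unspecified (such as 0.0.0.0) addresses.
        Ok(ip) => !(ip.is_loopback() || ip.is_unspecified()),
        // Not an IP literal: treat it as a DNS name and accept it.
        Err(_) => true,
    }
}

fn main() {
    for candidate in ["mcp-gateway-01.customer.example.com", "127.0.0.1", "0.0.0.0"] {
        println!("{candidate}: {}", is_advertisable(candidate));
    }
}
```

The check only looks at the literal value. It does not prove that the name resolves or that the controller can reach it.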

Example startup.yml:

host: ${startup.host:dev.lightapi.net}
serviceId: ${server.serviceId:com.networknt.light-gateway-1.0.0}
envTag: ${server.environment:dev}
acceptHeader: application/yaml
timeout: ${startup.timeout:3000}
connectTimeout: ${startup.connectTimeout:3000}
configServerUri: ${light-config-server-uri:https://local.localhost}
authorization: ${light_portal_authorization:}
bootstrapCaCertPath: ${startup.bootstrapCaCertPath:config/ca.pem}

Example server.yml:

ip: ${server.ip:0.0.0.0}
advertisedAddress: ${server.advertisedAddress:127.0.0.1}
httpPort: ${server.httpPort:8080}
enableHttp: ${server.enableHttp:true}
httpsPort: ${server.httpsPort:8443}
enableHttps: ${server.enableHttps:false}
tlsCertPath: ${server.tlsCertPath:}
tlsKeyPath: ${server.tlsKeyPath:}
serviceId: ${server.serviceId:com.networknt.light-gateway-1.0.0}
enableRegistry: ${server.enableRegistry:true}
startOnRegistryFailure: ${server.startOnRegistryFailure:true}
dynamicPort: ${server.dynamicPort:false}
environment: ${server.environment:dev}
shutdownGracefulPeriod: ${server.shutdownGracefulPeriod:2000}

Example portal-registry.yml:

portalUrl: ${portalRegistry.portalUrl:https://localhost:8438}
portalToken: ${light_portal_authorization:}
controllerDiscoveryToken: ${portalRegistry.controllerDiscoveryToken:}

Example client.yml should include the customer CA path and hostname verification policy for outbound HTTPS calls:

tls:
  caCertPath: ${client.caCertPath:config/ca.pem}
  verifyHostname: ${client.verifyHostname:true}

Keep the full gateway behavior, including MCP routing, authentication, rule configuration, and downstream MCP targets, in config-server. The VM should not need local edits for normal policy or route changes.
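
The example files use ${key:default} placeholders that are filled from values.yml. As a rough illustration of that substitution, here is a minimal sketch. It is not the light-runtime resolver: resolve is a hypothetical function, it handles one line at a time, and it ignores environment variables and nested placeholders.

```rust
use std::collections::HashMap;

/// Replace every `${key:default}` in `line` with the value for `key`,
/// or with `default` when the key is not present in `values`.
fn resolve(line: &str, values: &HashMap<&str, &str>) -> String {
    let mut out = String::new();
    let mut rest = line;
    while let Some(start) = rest.find("${") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        let Some(end) = after.find('}') else {
            // Unterminated placeholder: keep the remaining text unchanged.
            out.push_str(&rest[start..]);
            return out;
        };
        let inner = &after[..end];
        // Split on the first ':' only, so defaults such as URLs keep their colons.
        let (key, default) = inner.split_once(':').unwrap_or((inner, ""));
        out.push_str(values.get(key).copied().unwrap_or(default));
        rest = &after[end + 1..];
    }
    out.push_str(rest);
    out
}

fn main() {
    let mut values = HashMap::new();
    values.insert("server.httpPort", "8080");
    println!("{}", resolve("httpPort: ${server.httpPort:9090}", &values));
    println!("{}", resolve("enableHttps: ${server.enableHttps:false}", &values));
}
```

Splitting on the first colon matters for entries such as configServerUri, whose default value is itself a URL.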

Secrets

Keep secrets in a root-owned environment file or in the customer's secret manager. Do not pass secrets in command-line arguments.

Example /etc/light-gateway/light-gateway.env:

LIGHT_PORTAL_AUTHORIZATION=Bearer <token>
light_4j_config_password=<config-password-if-needed>
RUST_LOG=info

Permissions:

chown root:light-gateway /etc/light-gateway/light-gateway.env
chmod 0640 /etc/light-gateway/light-gateway.env

LIGHT_PORTAL_AUTHORIZATION is used for config-server bootstrap. The same token is also used by portal registry startup when portal-registry.yml resolves portalToken from light_portal_authorization.
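
A post-install check can confirm that the environment file is not readable by other users. This is a Unix-only sketch with a hypothetical helper name. It treats 0640 and stricter modes as acceptable, which is an interpretation of the permissions shown above rather than a documented rule.

```rust
use std::os::unix::fs::MetadataExt;

/// True when the mode grants nothing to "other" and no write bit to the group.
fn env_file_mode_ok(mode: u32) -> bool {
    let perms = mode & 0o777;
    (perms & 0o007) == 0 && (perms & 0o020) == 0
}

fn main() {
    let path = "/etc/light-gateway/light-gateway.env";
    match std::fs::metadata(path) {
        Ok(meta) => println!(
            "{path}: mode {:o}, acceptable = {}",
            meta.mode() & 0o777,
            env_file_mode_ok(meta.mode())
        ),
        Err(err) => println!("{path}: cannot read metadata ({err})"),
    }
}
```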

Systemd Unit

Example /etc/systemd/system/light-gateway.service:

[Unit]
Description=Light Gateway
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=light-gateway
Group=light-gateway
WorkingDirectory=/opt/light-gateway
EnvironmentFile=/etc/light-gateway/light-gateway.env
ExecStart=/opt/light-gateway/light-gateway
Restart=on-failure
RestartSec=5
LimitNOFILE=65535

NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=full
ProtectHome=true
ReadWritePaths=/var/lib/light-gateway/config-cache

[Install]
WantedBy=multi-user.target

Install and start:

systemctl daemon-reload
systemctl enable light-gateway
systemctl start light-gateway
systemctl status light-gateway

View logs:

journalctl -u light-gateway -f

Install Script Scope

An install script is useful, but keep it deterministic and small.

It should:

  • Create the light-gateway user and group.
  • Create /opt/light-gateway, /etc/light-gateway, and /var/lib/light-gateway/config-cache.
  • Install the binary with executable permissions.
  • Install bootstrap config files.
  • Install or update the systemd unit.
  • Set file ownership and permissions.
  • Print the next operator steps for adding secrets and starting the service.

It should not:

  • Embed bearer tokens.
  • Pass tokens to ExecStart.
  • Rewrite customer config-server state.
  • Start the process before secrets and CA files are installed.

Startup Flow

The expected runtime flow is:

systemd
  -> /opt/light-gateway/light-gateway
  -> read local config/values.yml and startup.yml
  -> call config-server with LIGHT_PORTAL_AUTHORIZATION
  -> write downloaded config and files into config-cache
  -> start Pingora with resolved runtime config
  -> register gateway to controller using portalRegistry.portalUrl
  -> route protected MCP traffic to downstream MCP servers

If config-server is temporarily unavailable and cached config exists, the runtime can continue from config-cache. If this is not acceptable for a customer environment, make it an operational policy to clear config-cache before restart or add a pre-start health check for config-server availability.
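
The fallback rule can be summarized as a small selection function. This is a sketch only: the names are hypothetical, and allow_cache_fallback stands in for whatever operational policy the customer chooses. It is not an existing gateway setting.

```rust
#[derive(Debug, PartialEq)]
enum ConfigSource {
    ConfigServer,
    Cache,
}

/// Pick where runtime config comes from at startup.
fn choose_config_source(
    server_reachable: bool,
    cache_present: bool,
    allow_cache_fallback: bool, // hypothetical policy flag
) -> Result<ConfigSource, String> {
    if server_reachable {
        return Ok(ConfigSource::ConfigServer);
    }
    if cache_present && allow_cache_fallback {
        return Ok(ConfigSource::Cache);
    }
    Err("config-server unavailable and no usable cached config".to_string())
}

fn main() {
    println!("{:?}", choose_config_source(true, false, true));
    println!("{:?}", choose_config_source(false, true, true));
    println!("{:?}", choose_config_source(false, true, false));
}
```

Clearing config-cache before restart has the same effect as setting cache_present to false in this model.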

Upgrade And Rollback

Use versioned binary releases:

/opt/light-gateway/releases/2.2.1/light-gateway
/opt/light-gateway/releases/2.2.2/light-gateway
/opt/light-gateway/light-gateway -> releases/2.2.2/light-gateway

Upgrade:

systemctl stop light-gateway
ln -sfn /opt/light-gateway/releases/2.2.2/light-gateway /opt/light-gateway/light-gateway
systemctl start light-gateway

Rollback:

systemctl stop light-gateway
ln -sfn /opt/light-gateway/releases/2.2.1/light-gateway /opt/light-gateway/light-gateway
systemctl start light-gateway

Do not delete config-cache during a normal binary rollback. It is the local cache of the config-server-delivered runtime state.

Validation Checklist

Before handing the VM to the customer:

  • systemctl status light-gateway reports the service as active.
  • journalctl -u light-gateway shows successful config-server bootstrap.
  • journalctl -u light-gateway shows successful controller registration.
  • The controller shows the gateway registered with the expected service id, environment, address, and port.
  • The gateway health endpoint responds from the VM network.
  • An MCP tools/list call reaches the gateway.
  • An MCP tools/call request reaches the configured backend MCP server.
  • Restarting the VM starts the gateway automatically.

Security Checklist

  • Store bearer tokens and config passwords outside the install bundle.
  • Use a customer CA file instead of disabling TLS verification in production.
  • Use a stable DNS name for server.advertisedAddress.
  • Restrict inbound VM firewall rules to required gateway ports.
  • Restrict outbound VM firewall rules to config-server, controller, and backend MCP server addresses.
  • Run as the dedicated light-gateway user.
  • Keep /etc/light-gateway/light-gateway.env readable only by root and the service group.
  • Rotate LIGHT_PORTAL_AUTHORIZATION through the customer secret process.

Deploy Kubernetes

This page describes the recommended Kubernetes deployment model for the Rust light-gateway image from light-fabric/apps/light-gateway.

Use this model when light-gateway runs as a microgateway in front of backend MCP servers. The pod starts from local bootstrap config, downloads runtime config from config-server into config-cache, starts Pingora, and registers the gateway with controller.

Deploy the gateway as a normal single-container Kubernetes workload:

  • Deployment for the gateway pod.
  • Service for stable in-cluster access.
  • ConfigMap for bootstrap config and non-secret values.
  • Secret for bearer tokens and config passwords.
  • emptyDir or PersistentVolumeClaim for config-cache.
  • Optional Ingress, Gateway API, NodePort, or LoadBalancer for external client access.

Keep gateway behavior such as MCP route definitions, access-control rules, backend MCP targets, and runtime TLS files in config-server. The Kubernetes bootstrap config should only contain enough information for startup, trust, and registration.

Image

Build the image from the workspace root:

./apps/light-gateway/build.sh 2.2.1

For local testing without pushing:

./apps/light-gateway/build.sh 2.2.1 --local

Use immutable tags in Kubernetes. Avoid latest for customer deployments.

The runtime image uses:

/app/light-gateway
/app/config -> /config
/app/config-cache

The process runs as the image user gateway. Mount /config for bootstrap config and make /app/config-cache writable.

Runtime Paths

Recommended container layout:

/config/
  startup.yml
  server.yml
  portal-registry.yml
  client.yml
  values.yml
  ca.pem

/app/config-cache/
  values.yml
  downloaded certs and files

Use a read-only ConfigMap for /config. Use a writable volume for /app/config-cache.

For most deployments, use emptyDir for config-cache. This gives each pod a fresh cache and avoids accidentally keeping stale config across pod replacement.

Use a PersistentVolumeClaim only when the customer explicitly wants the gateway to restart from the last downloaded config during a config-server outage. A persistent cache improves outage tolerance but can also preserve stale runtime state.

Registration Address

In Kubernetes, do not register the pod IP. Pod IPs are ephemeral.

If controller and callers are inside the same cluster, advertise the Service DNS name:

server.advertisedAddress: ai-microgateway.light-gateway

The pattern is:

<service-name>.<namespace>

The port is still registered separately from the host/address.

If controller or callers are outside the cluster, advertise the externally reachable DNS name instead, such as the Ingress or LoadBalancer hostname:

server.advertisedAddress: mcp-gateway.customer.example.com

For the Rust gateway, this is configured with server.advertisedAddress. The Java gateway template uses STATUS_HOST_IP; that is a light-4j-specific hook and is not the Rust gateway contract.
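
The two cases above can be captured in a small helper that produces the value for server.advertisedAddress. The type and function names are hypothetical and exist only for illustration.

```rust
/// Where the controller and callers sit relative to the cluster.
enum Reachability<'a> {
    InCluster { service: &'a str, namespace: &'a str },
    External { dns_name: &'a str },
}

/// Build the address to advertise. The port is registered separately.
fn advertised_address(reach: &Reachability) -> String {
    match reach {
        // In-cluster callers use <service-name>.<namespace>.
        Reachability::InCluster { service, namespace } => format!("{service}.{namespace}"),
        // External callers use the Ingress or LoadBalancer hostname.
        Reachability::External { dns_name } => dns_name.to_string(),
    }
}

fn main() {
    let in_cluster = Reachability::InCluster {
        service: "ai-microgateway",
        namespace: "light-gateway",
    };
    let external = Reachability::External {
        dns_name: "mcp-gateway.customer.example.com",
    };
    println!("{}", advertised_address(&in_cluster));
    println!("{}", advertised_address(&external));
}
```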

Bootstrap Config

Example values.yml for an in-cluster controller and config-server:

startup.host: customer.example.com
startup.timeout: 3000
startup.connectTimeout: 3000
startup.bootstrapCaCertPath: config/ca.pem

light-config-server-uri: https://config-server.lightapi.svc.cluster.local:8435

server.serviceId: com.customer.mcp-gateway-1.0.0
server.environment: prod
server.ip: 0.0.0.0
server.advertisedAddress: ai-microgateway.light-gateway
server.httpPort: 8080
server.enableHttp: true
server.httpsPort: 8443
server.enableHttps: false
server.enableRegistry: true
server.startOnRegistryFailure: true

portalRegistry.portalUrl: https://controller.lightapi.svc.cluster.local:8438
client.caCertPath: config/ca.pem
client.verifyHostname: true

Example startup.yml:

host: ${startup.host:dev.lightapi.net}
serviceId: ${server.serviceId:com.networknt.light-gateway-1.0.0}
envTag: ${server.environment:dev}
acceptHeader: application/yaml
timeout: ${startup.timeout:3000}
connectTimeout: ${startup.connectTimeout:3000}
configServerUri: ${light-config-server-uri:https://local.localhost}
authorization: ${light_portal_authorization:}
bootstrapCaCertPath: ${startup.bootstrapCaCertPath:config/ca.pem}

Example server.yml:

ip: ${server.ip:0.0.0.0}
advertisedAddress: ${server.advertisedAddress:127.0.0.1}
httpPort: ${server.httpPort:8080}
enableHttp: ${server.enableHttp:true}
httpsPort: ${server.httpsPort:8443}
enableHttps: ${server.enableHttps:false}
tlsCertPath: ${server.tlsCertPath:}
tlsKeyPath: ${server.tlsKeyPath:}
serviceId: ${server.serviceId:com.networknt.light-gateway-1.0.0}
enableRegistry: ${server.enableRegistry:true}
startOnRegistryFailure: ${server.startOnRegistryFailure:true}
dynamicPort: ${server.dynamicPort:false}
environment: ${server.environment:dev}
shutdownGracefulPeriod: ${server.shutdownGracefulPeriod:2000}

Example portal-registry.yml:

portalUrl: ${portalRegistry.portalUrl:https://localhost:8438}
portalToken: ${light_portal_authorization:}
controllerDiscoveryToken: ${portalRegistry.controllerDiscoveryToken:}

Example client.yml:

tls:
  caCertPath: ${client.caCertPath:config/ca.pem}
  verifyHostname: ${client.verifyHostname:true}

Use the customer CA in ca.pem. Do not disable hostname verification in production to work around certificate SAN problems.

Secrets

Store the portal bearer token and optional config password in a Kubernetes Secret.

Example:

apiVersion: v1
kind: Secret
metadata:
  name: light-gateway-secret
  namespace: light-gateway
type: Opaque
stringData:
  LIGHT_PORTAL_AUTHORIZATION: "Bearer <token>"
  light_4j_config_password: "<config-password-if-needed>"

LIGHT_PORTAL_AUTHORIZATION is used for config-server bootstrap. It is also used by portal registry startup when portal-registry.yml resolves portalToken from light_portal_authorization.

Do not store real bearer tokens in Git, ConfigMaps, Helm values committed to the repo, or rendered deployment examples.

Example Manifests

Create the namespace separately:

kubectl create namespace light-gateway

If deploying through light-deployer, keep Namespace out of the rendered bundle because deployer policy may block cluster-scoped resources.

Example bootstrap ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: light-gateway-bootstrap
  namespace: light-gateway
data:
  values.yml: |
    startup.host: customer.example.com
    startup.timeout: 3000
    startup.connectTimeout: 3000
    startup.bootstrapCaCertPath: config/ca.pem
    light-config-server-uri: https://config-server.lightapi.svc.cluster.local:8435
    server.serviceId: com.customer.mcp-gateway-1.0.0
    server.environment: prod
    server.ip: 0.0.0.0
    server.advertisedAddress: ai-microgateway.light-gateway
    server.httpPort: 8080
    server.enableHttp: true
    server.httpsPort: 8443
    server.enableHttps: false
    server.enableRegistry: true
    server.startOnRegistryFailure: true
    portalRegistry.portalUrl: https://controller.lightapi.svc.cluster.local:8438
    client.caCertPath: config/ca.pem
    client.verifyHostname: true
  startup.yml: |
    host: ${startup.host:dev.lightapi.net}
    serviceId: ${server.serviceId:com.networknt.light-gateway-1.0.0}
    envTag: ${server.environment:dev}
    acceptHeader: application/yaml
    timeout: ${startup.timeout:3000}
    connectTimeout: ${startup.connectTimeout:3000}
    configServerUri: ${light-config-server-uri:https://local.localhost}
    authorization: ${light_portal_authorization:}
    bootstrapCaCertPath: ${startup.bootstrapCaCertPath:config/ca.pem}
  server.yml: |
    ip: ${server.ip:0.0.0.0}
    advertisedAddress: ${server.advertisedAddress:127.0.0.1}
    httpPort: ${server.httpPort:8080}
    enableHttp: ${server.enableHttp:true}
    httpsPort: ${server.httpsPort:8443}
    enableHttps: ${server.enableHttps:false}
    tlsCertPath: ${server.tlsCertPath:}
    tlsKeyPath: ${server.tlsKeyPath:}
    serviceId: ${server.serviceId:com.networknt.light-gateway-1.0.0}
    enableRegistry: ${server.enableRegistry:true}
    startOnRegistryFailure: ${server.startOnRegistryFailure:true}
    dynamicPort: ${server.dynamicPort:false}
    environment: ${server.environment:dev}
    shutdownGracefulPeriod: ${server.shutdownGracefulPeriod:2000}
  portal-registry.yml: |
    portalUrl: ${portalRegistry.portalUrl:https://localhost:8438}
    portalToken: ${light_portal_authorization:}
    controllerDiscoveryToken: ${portalRegistry.controllerDiscoveryToken:}
  client.yml: |
    tls:
      caCertPath: ${client.caCertPath:config/ca.pem}
      verifyHostname: ${client.verifyHostname:true}
  ca.pem: |
    -----BEGIN CERTIFICATE-----
    <customer-ca-certificate>
    -----END CERTIFICATE-----

Example Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-microgateway
  namespace: light-gateway
  labels:
    app: ai-microgateway
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ai-microgateway
  template:
    metadata:
      labels:
        app: ai-microgateway
    spec:
      securityContext:
        fsGroup: 999
        fsGroupChangePolicy: OnRootMismatch
      containers:
        - name: light-gateway
          image: networknt/light-gateway:2.2.1
          imagePullPolicy: IfNotPresent
          env:
            - name: LIGHT_PORTAL_AUTHORIZATION
              valueFrom:
                secretKeyRef:
                  name: light-gateway-secret
                  key: LIGHT_PORTAL_AUTHORIZATION
            - name: light_4j_config_password
              valueFrom:
                secretKeyRef:
                  name: light-gateway-secret
                  key: light_4j_config_password
                  optional: true
            - name: RUST_LOG
              value: info
          ports:
            - name: http
              containerPort: 8080
            - name: https
              containerPort: 8443
          readinessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 30
            periodSeconds: 30
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: "1"
              memory: 512Mi
          volumeMounts:
            - name: bootstrap-config
              mountPath: /config
              readOnly: true
            - name: config-cache
              mountPath: /app/config-cache
      volumes:
        - name: bootstrap-config
          configMap:
            name: light-gateway-bootstrap
        - name: config-cache
          emptyDir: {}

The example uses fsGroup: 999, which matches the default gateway group in the current image. Adjust it if the image user or group changes.

If HTTP is disabled and only HTTPS is enabled, change the probes to an HTTPS probe or a TCP probe.

Example Service:

apiVersion: v1
kind: Service
metadata:
  name: ai-microgateway
  namespace: light-gateway
spec:
  type: ClusterIP
  selector:
    app: ai-microgateway
  ports:
    - name: http
      port: 8080
      targetPort: http
    - name: https
      port: 8443
      targetPort: https

For external access, add an Ingress, Gateway API route, NodePort, or LoadBalancer according to the customer cluster standard. If external clients or controller use that external path, set server.advertisedAddress to the same externally reachable DNS name.

Apply With Kubectl

Apply manifests in this order:

kubectl apply -f namespace.yml
kubectl apply -f secret.yml
kubectl apply -f configmap.yml
kubectl apply -f deployment.yml
kubectl apply -f service.yml

Check rollout:

kubectl -n light-gateway rollout status deploy/ai-microgateway
kubectl -n light-gateway get pods -l app=ai-microgateway
kubectl -n light-gateway logs deploy/ai-microgateway

For local testing with a ClusterIP Service:

kubectl -n light-gateway port-forward svc/ai-microgateway 8080:8080 8443:8443

Deploy Through Light-Deployer

When light-deployer runs outside the cluster and has LIGHT_DEPLOYER_TEMPLATE_BASE_DIR set, repoUrl: "local" can point to local templates.

When light-deployer runs inside Kubernetes, use a real Git URL:

{
  "template": {
    "repoUrl": "https://github.com/networknt/light-fabric.git",
    "ref": "main",
    "path": "apps/light-gateway/k8s/light-gateway"
  }
}

Do not use repoUrl: "local" for an in-cluster deployer unless the template repo is mounted into the deployer container and LIGHT_DEPLOYER_TEMPLATE_BASE_DIR points to it.

The in-cluster deployer checks out repoUrl at ref and reads manifests from template.path.
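
The repoUrl rule can be expressed as a simple check. This sketch assumes that any non-empty value other than "local" is a Git URL. It does not validate the URL or contact the repository, and template_source_ok is a hypothetical name, not a light-deployer function.

```rust
/// True when the template source is usable:
/// "local" requires a template base directory, anything else is treated as a Git URL.
fn template_source_ok(repo_url: &str, template_base_dir: Option<&str>) -> bool {
    match repo_url.trim() {
        "" => false,
        "local" => template_base_dir.map_or(false, |dir| !dir.trim().is_empty()),
        _ => true,
    }
}

fn main() {
    // LIGHT_DEPLOYER_TEMPLATE_BASE_DIR is the variable named in the text above.
    let base_dir = std::env::var("LIGHT_DEPLOYER_TEMPLATE_BASE_DIR").ok();
    println!("local usable: {}", template_source_ok("local", base_dir.as_deref()));
    println!(
        "git usable: {}",
        template_source_ok("https://github.com/networknt/light-fabric.git", None)
    );
}
```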

Keep Namespace out of templates rendered by light-deployer if the deployer policy blocks cluster-scoped resources. Create the namespace separately:

kubectl create namespace light-gateway

Config-Server Requirements

Before deploying the gateway pod, config-server should already have config for the tuple used by startup:

host = startup.host
serviceId = server.serviceId
envTag = server.environment

At minimum, config-server should return runtime config for:

  • handler.yml
  • mcp-router.yml
  • access-control.yml and rule.yml when MCP authorization is enabled.
  • security.yml, unified-security.yml, or other active auth config.
  • websocket-router.yml when WebSocket MCP/BFF routing is enabled.
  • Any downstream client, token, or registry config required by the selected handlers.

The pod bootstrap files should stay small and stable. Normal route, policy, and backend changes should go through config-server and controller reload flows.
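
A pre-deployment check can compare what config-server delivers against the conditional items in the list above. The sketch below covers only the files with explicit conditions. The auth config and handler-specific files are left out because the list does not pin them to fixed names. Function names are hypothetical.

```rust
/// Files expected from config-server for the selected features.
fn required_runtime_configs(mcp_authorization: bool, websocket_routing: bool) -> Vec<&'static str> {
    let mut files = vec!["handler.yml", "mcp-router.yml"];
    if mcp_authorization {
        files.extend(["access-control.yml", "rule.yml"]);
    }
    if websocket_routing {
        files.push("websocket-router.yml");
    }
    files
}

/// Required files that are not in the delivered set.
fn missing<'a>(required: &[&'a str], delivered: &[&str]) -> Vec<&'a str> {
    required
        .iter()
        .copied()
        .filter(|file| !delivered.iter().any(|d| d == file))
        .collect()
}

fn main() {
    let required = required_runtime_configs(true, false);
    let delivered = ["handler.yml", "mcp-router.yml", "rule.yml"];
    println!("missing: {:?}", missing(&required, &delivered));
}
```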

Startup Flow

Expected runtime flow:

Kubernetes starts pod
  -> /app/light-gateway
  -> read bootstrap files from /app/config (symlink to /config)
  -> call config-server with LIGHT_PORTAL_AUTHORIZATION
  -> write downloaded config and files into /app/config-cache
  -> start Pingora with resolved runtime config
  -> register gateway to controller using portalRegistry.portalUrl
  -> advertise server.advertisedAddress and configured port
  -> route protected MCP traffic to backend MCP servers

If config-server is unavailable and /app/config-cache/values.yml exists, the runtime can continue from cached config. With emptyDir, that cache disappears when the pod is recreated. With a PVC, it can survive pod replacement.

Upgrade And Rollback

Use Kubernetes rolling updates with immutable image tags:

kubectl -n light-gateway set image deploy/ai-microgateway \
  light-gateway=networknt/light-gateway:2.2.2
kubectl -n light-gateway rollout status deploy/ai-microgateway

Rollback:

kubectl -n light-gateway rollout undo deploy/ai-microgateway

For production, prefer changing only one variable at a time: either image tag or config-server runtime config, not both in the same rollout.

Validation Checklist

After deployment:

  • kubectl -n light-gateway rollout status deploy/ai-microgateway succeeds.
  • Pods are ready and restart count is stable.
  • Logs show successful config-server bootstrap.
  • Logs show successful controller registration.
  • Controller shows the gateway registered with the expected service id, environment, host, and port.
  • server.advertisedAddress is reachable from the controller.
  • The Service responds on /health.
  • MCP tools/list reaches the gateway.
  • MCP tools/call reaches the backend MCP server.
  • A pod restart still starts cleanly with the selected cache policy.

Security Checklist

  • Keep bearer tokens in Kubernetes Secret, not ConfigMap.
  • Use customer CA trust and keep client.verifyHostname: true in production.
  • Use immutable image tags and image pull credentials from Kubernetes secrets when the registry is private.
  • Run as the non-root image user.
  • Make /config read-only.
  • Make only /app/config-cache writable.
  • Restrict ingress traffic to required gateway ports.
  • Restrict egress traffic to config-server, controller, token/key services, and backend MCP servers.
  • Rotate LIGHT_PORTAL_AUTHORIZATION through the customer secret process.

Kubernetes Gateway API Design

Status

Proposal.

This page captures how the current light-gateway work can be reused for Kubernetes Gateway API without turning the microgateway product into a catch-all Kubernetes control plane. The recommended direction is a separate light-k8s-gateway product built on light-pingora for north/south ingress, with a later sidecar or mesh product for transparent east/west traffic.

Context

The current Kubernetes deployment model runs light-gateway as a normal Deployment with a ClusterIP Service. Runtime behavior comes from local bootstrap config, files downloaded from config-server into config-cache, and the Pingora data plane built by light-pingora.

The current gateway already has useful data-plane pieces:

  • HTTP and HTTPS proxying through Pingora.
  • Static upstreams from proxy.yml.
  • Service-aware routing from router.yml.
  • Direct registry, controller-backed discovery, and static service targets.
  • Handler chains for security, header mutation, CORS, rate limits, token handling, MCP, WebSocket, static resources, and config reload.
  • Live config managers and reloaders for route and handler modules.

Gateway API adds a Kubernetes-native control plane. For ingress, users create GatewayClass, Gateway, and route resources such as HTTPRoute. For service mesh, the GAMMA model attaches route resources directly to Kubernetes Service objects instead of using Gateway and GatewayClass.

Product Boundary

Keep the product line split by operational role:

  • light-pingora is the shared data-plane framework.
  • light-gateway remains the microgateway, sidecar, BFF, API, agent, MCP, and LLM gateway product configured through Light runtime, config-server, controller-rs, and local config.
  • light-k8s-gateway is the proposed Kubernetes Gateway API product for north/south ingress. It should reuse light-pingora and lift reusable light-gateway modules where appropriate, but it should own Kubernetes watches, Gateway API status, RBAC, listener translation, TLS Secret handling, and EndpointSlice routing.
  • light-k8s-gateway-controller and light-k8s-gateway-proxy should be separate deployments from the first implementation. The controller owns Kubernetes RBAC and status writes. The proxy owns untrusted client traffic and should not need Kubernetes API permissions.
  • A future light-mesh or light-sidecar product should own transparent east/west Service Mesh behavior if we pursue GAMMA conformance. It should share the Gateway API route compiler and light-pingora data-plane modules, but its deployment model is sidecar or node-local interception, not ingress.

This avoids giving ordinary microgateway deployments broad Kubernetes RBAC and keeps config-server/controller-rs routing separate from portable Gateway API routing intent.

Goals

  • Let operators install light-k8s-gateway as a Gateway API implementation with a controller name such as networknt.com/light-k8s-gateway.
  • Support north/south ingress with GatewayClass, Gateway, HTTPRoute, Kubernetes Service, EndpointSlice, Secret, and ReferenceGrant.
  • Separate Kubernetes reconciliation from request proxying so control-plane RBAC is never granted to the public traffic data plane.
  • Provide a migration path from NGINX or Traefik by running side by side with a distinct GatewayClass, then moving routes class by class or host by host.
  • Reuse the existing Pingora proxy, handler chain, service discovery, metrics, and config reload model instead of creating a separate proxy stack.
  • Use Gateway API policy attachment for Light-specific Kubernetes policy CRDs instead of annotations or out-of-band route policy.
  • Support east/west traffic using Gateway API mesh semantics where HTTPRoute.parentRefs can point at a Service.
  • Keep Light-specific policies available without forcing them into portable Gateway API fields. Gateway API should configure routing; Light config and future policy CRDs should configure Light-specific behavior.
  • Build toward Gateway API conformance tests for both Gateway and Mesh feature sets.

Non-Goals

  • Do not remove existing config-server, direct registry, portal registry, or static route support.
  • Do not require every light-gateway deployment to watch Kubernetes. Gateway API support should be disabled unless explicitly configured.
  • Do not run the Kubernetes controller reconciler inside public data-plane pods with broad Kubernetes RBAC.
  • Do not claim immediate support for every Gateway API route type. Start with HTTPRoute; add GRPCRoute, TLSRoute, TCPRoute, and UDPRoute in later milestones.
  • Do not make transparent east/west interception a hidden side effect of the ingress deployment. Mesh mode needs an explicit data-plane deployment model.
  • Do not treat a non-transparent egress gateway as fully GAMMA-compliant mesh support.

Target API Versions

The north/south MVP targets the Gateway API v1 Standard Channel resources:

  • GatewayClass
  • Gateway
  • HTTPRoute
  • ReferenceGrant

Experimental or later milestones must be labeled explicitly in docs, manifests, and conformance reports. This includes GAMMA mesh behavior and route kinds such as GRPCRoute, TLSRoute, TCPRoute, and UDPRoute when those features rely on non-Standard channels in the installed Gateway API version.

North/South Ingress Model

For ingress replacement, light-k8s-gateway should run as two cooperating pieces:

  • light-k8s-gateway-controller: watches Kubernetes resources, validates attachment and policy, updates status, performs leader election, and produces a compiled routing snapshot.
  • light-k8s-gateway-proxy: consumes signed or mTLS-protected snapshots and serves client traffic through Pingora. It has no Kubernetes watch or status permissions and can scale independently with an HPA.

The split is mandatory from day 1. It prevents a proxy vulnerability in the public data plane from becoming a Kubernetes control-plane compromise. The controller can run as an HA deployment with Kubernetes Lease leader election; only the leader reconciles resources and writes status. Non-leader controller replicas stay warm and can take over quickly.

Snapshot delivery can start as a lightweight internal gRPC stream and evolve toward an xDS-like API if we need richer incremental updates. The proxy should apply the received GatewayApiSnapshot through the same kind of ConfigManager swap used by the current Pingora modules.
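
One way to picture the snapshot swap on the proxy side is a shared pointer that is replaced as a whole, so in-flight requests keep the snapshot they started with. This is a sketch using only the standard library. The field layout of GatewayApiSnapshot, the SnapshotHolder type, and the version check are assumptions for illustration, and the actual ConfigManager in light-pingora may work differently.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical compiled snapshot produced by the controller.
#[derive(Debug)]
struct GatewayApiSnapshot {
    version: u64,
    routes: Vec<String>,
}

/// Holds the active snapshot and replaces it as a whole.
struct SnapshotHolder {
    current: RwLock<Arc<GatewayApiSnapshot>>,
}

impl SnapshotHolder {
    fn new(initial: GatewayApiSnapshot) -> Self {
        Self { current: RwLock::new(Arc::new(initial)) }
    }

    /// Request path: take a cheap clone of the pointer to the active snapshot.
    fn load(&self) -> Arc<GatewayApiSnapshot> {
        let guard = self.current.read().unwrap();
        Arc::clone(&guard)
    }

    /// Update path: install a newer snapshot. Returns false for stale versions (assumed rule).
    fn swap(&self, next: GatewayApiSnapshot) -> bool {
        let mut guard = self.current.write().unwrap();
        if next.version <= guard.version {
            return false;
        }
        *guard = Arc::new(next);
        true
    }
}

fn main() {
    let holder = SnapshotHolder::new(GatewayApiSnapshot { version: 1, routes: vec![] });
    let in_flight = holder.load();
    holder.swap(GatewayApiSnapshot { version: 2, routes: vec!["petstore".to_string()] });
    println!("in-flight request sees version {}", in_flight.version);
    println!("new request sees {:?}", holder.load());
}
```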

Typical installation:

apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: light-k8s-gateway
spec:
  controllerName: networknt.com/light-k8s-gateway
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: public
  namespace: gateway-system
spec:
  gatewayClassName: light-k8s-gateway
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All
    - name: https
      protocol: HTTPS
      port: 443
      hostname: api.example.com
      tls:
        mode: Terminate
        certificateRefs:
          - kind: Secret
            name: api-example-com
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: petstore
  namespace: apps
spec:
  parentRefs:
    - name: public
      namespace: gateway-system
      sectionName: https
  hostnames:
    - api.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /pets
      backendRefs:
        - name: petstore
          port: 8080

The controller resolves this into a runtime route table:

Gateway listener
  -> accepted HTTPRoutes
  -> host/path/header/method/query matches
  -> filters supported by light-k8s-gateway
  -> backend Service
  -> EndpointSlice addresses
  -> Pingora ProxyTarget set

The existing proxy.yml and router.yml paths remain useful for legacy and non-Kubernetes deployments. Kubernetes Gateway API routes should not depend on service_id headers or pathPrefixService.yml; they should route from the compiled Gateway API table directly to Kubernetes endpoints.
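The lookup against the compiled table can be sketched as follows. This is an illustration under stated assumptions, not the light-pingora module: `CompiledRoute`, `host_matches`, `path_prefix_matches`, and `select_backend` are hypothetical names, hostnames are assumed to be lower-cased already, and only hostname and PathPrefix matching are covered. PathPrefix is matched on path-segment boundaries, so `/pets` matches `/pets/1` but not `/petstore`, and the longest matching prefix wins.

```rust
/// Hypothetical compiled route entry; field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct CompiledRoute {
    pub hostname: String,    // exact host, or a wildcard such as "*.example.com"
    pub path_prefix: String, // PathPrefix match value
    pub backend: String,     // e.g. "namespace/service:port"
}

/// Exact match, or wildcard match on one or more leading labels.
pub fn host_matches(pattern: &str, host: &str) -> bool {
    match pattern.strip_prefix("*.") {
        Some(suffix) => host
            .strip_suffix(suffix)
            .map_or(false, |rest| rest.len() > 1 && rest.ends_with('.')),
        None => pattern == host,
    }
}

/// Prefix match on whole path segments.
pub fn path_prefix_matches(prefix: &str, path: &str) -> bool {
    let prefix = prefix.trim_end_matches('/');
    match path.strip_prefix(prefix) {
        Some(rest) => rest.is_empty() || rest.starts_with('/'),
        None => false,
    }
}

/// Pick the backend of the matching route with the longest prefix.
pub fn select_backend<'a>(
    routes: &'a [CompiledRoute],
    host: &str,
    path: &str,
) -> Option<&'a str> {
    routes
        .iter()
        .filter(|r| host_matches(&r.hostname, host) && path_prefix_matches(&r.path_prefix, path))
        .max_by_key(|r| r.path_prefix.len())
        .map(|r| r.backend.as_str())
}
```

Header, method, and query matches would extend the filter in the same way; they are left out to keep the sketch short.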

Required Ingress Patches

Add a Kubernetes Gateway API module:

k8sGatewayApi:
  enabled: ${k8sGatewayApi.enabled:false}
  mode: ${k8sGatewayApi.mode:ingress}
  controllerName: ${k8sGatewayApi.controllerName:networknt.com/light-k8s-gateway}
  gatewayClassName: ${k8sGatewayApi.gatewayClassName:light-k8s-gateway}
  watchNamespaces: ${k8sGatewayApi.watchNamespaces:[]}
  statusAddress: ${k8sGatewayApi.statusAddress:}

Implementation changes:

  • Create apps/light-k8s-gateway-controller and apps/light-k8s-gateway-proxy.
  • Add Gateway API and Kubernetes clients, likely behind a Cargo feature such as k8s-gateway-api, using kube, kube-runtime, k8s-openapi, and generated Gateway API resource types.
  • Watch GatewayClass, Gateway, HTTPRoute, ReferenceGrant, Service, EndpointSlice, Secret, and Namespace.
  • Compile watched objects into a deterministic GatewayApiSnapshot.
  • Push the compiled snapshot to proxy pods over an authenticated internal channel.
  • Store the received snapshot in a proxy-side ConfigManager, similar to the current proxy and router reload model.
  • Add a light-pingora Gateway API route-table module that can select a backend before falling back to existing proxy/router behavior.
  • Update Kubernetes status conditions for GatewayClass, Gateway, listeners, and routes. Status must clearly report unsupported route types, listener conflicts, missing TLS secrets, rejected cross-namespace references, empty backends, and unsupported filters.
  • Add Kubernetes Lease leader election so only one controller replica writes status and publishes snapshots.
  • Add controller RBAC for read watches, Secret reads where allowed, Lease writes, and status updates. Secret read permissions should be namespace-scoped where possible.
  • Give proxy pods no Kubernetes RBAC by default.
  • Add install manifests for separate controller and proxy ServiceAccount, ClusterRole, ClusterRoleBinding, Deployment, Service, and a sample GatewayClass.
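One way to make the compiled snapshot deterministic is to keep routes in an ordered map and derive the snapshot version from a stable hash of the ordered content. The sketch below assumes a reduced route table (route key to backend list) and uses a hand-written FNV-1a hash, because the standard library's default hasher is not guaranteed to be stable across Rust releases. `snapshot_version` is a hypothetical helper, not an existing function.

```rust
use std::collections::BTreeMap;

/// FNV-1a 64-bit, folded over `bytes` starting from `hash`.
fn fnv1a(bytes: &[u8], mut hash: u64) -> u64 {
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// Derive a version from the route table. BTreeMap iterates in key order and
/// backends are sorted before hashing, so the watch-event order that produced
/// the table does not change the result.
pub fn snapshot_version(routes: &BTreeMap<String, Vec<String>>) -> u64 {
    let mut hash = 0xcbf2_9ce4_8422_2325_u64; // FNV-1a offset basis
    for (route, backends) in routes {
        hash = fnv1a(route.as_bytes(), hash);
        hash = fnv1a(&[0], hash); // field separator
        let mut sorted = backends.clone();
        sorted.sort();
        for backend in &sorted {
            hash = fnv1a(backend.as_bytes(), hash);
            hash = fnv1a(&[0], hash);
        }
        hash = fnv1a(&[1], hash); // record separator
    }
    hash
}
```

With a content-derived version, proxies can skip a snapshot they already hold, and the snapshot version metric stays constant while nothing changes.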

The transport also needs a listener model. Today PingoraTransport binds the single server.httpPort and single server.httpsPort from server.yml. That is enough for the first 80/443 ingress path, but full Gateway API support needs multiple listeners with independent protocol, port, hostname, and TLS settings.

Suggested runtime patch:

server:
  listeners:
    - name: http
      protocol: HTTP
      ip: 0.0.0.0
      port: 80
    - name: https-api
      protocol: HTTPS
      ip: 0.0.0.0
      port: 443
      hostname: api.example.com
      tlsCertPath: /var/run/light-k8s-gateway/tls/api/tls.crt
      tlsKeyPath: /var/run/light-k8s-gateway/tls/api/tls.key

Keep server.httpPort, server.enableHttp, server.httpsPort, and server.enableHttps as backward-compatible shorthand.
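The shorthand can be treated as a way to synthesize the listener list. A minimal sketch, assuming that an explicit `listeners` list takes precedence when present and that shorthand listeners bind to `0.0.0.0`; both are assumptions of this example, and the Rust type and field names are hypothetical rather than the current light-runtime config structs:

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum ListenerProtocol {
    Http,
    Https,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Listener {
    pub name: String,
    pub protocol: ListenerProtocol,
    pub ip: String,
    pub port: u16,
}

/// Reduced view of server.yml: legacy shorthand plus the new listener list.
pub struct ServerConfig {
    pub enable_http: bool,
    pub http_port: u16,
    pub enable_https: bool,
    pub https_port: u16,
    pub listeners: Vec<Listener>,
}

/// Explicit listeners win; otherwise expand the legacy shorthand.
pub fn effective_listeners(cfg: &ServerConfig) -> Vec<Listener> {
    if !cfg.listeners.is_empty() {
        return cfg.listeners.clone();
    }
    let mut out = Vec::new();
    if cfg.enable_http {
        out.push(Listener {
            name: "http".to_string(),
            protocol: ListenerProtocol::Http,
            ip: "0.0.0.0".to_string(),
            port: cfg.http_port,
        });
    }
    if cfg.enable_https {
        out.push(Listener {
            name: "https".to_string(),
            protocol: ListenerProtocol::Https,
            ip: "0.0.0.0".to_string(),
            port: cfg.https_port,
        });
    }
    out
}
```

Existing deployments that only set the shorthand fields then keep their current binding behavior without configuration changes.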

HTTPRoute Support Plan

Start with the common ingress subset:

  • GatewayClass acceptance for networknt.com/light-k8s-gateway.
  • Gateway listeners for HTTP and terminated HTTPS.
  • HTTPRoute attachment by parentRefs, sectionName, listener hostname, listener namespace policy, and route hostname.
  • HTTPRoute matches for path prefix, exact path, method, headers, and query parameters.
  • backendRefs to Kubernetes Service backends, including weights.
  • ReferenceGrant for cross-namespace backend references.
  • Endpoint resolution from EndpointSlice, with Service DNS as a fallback only when endpoint watching is unavailable.
  • TLS Secret loading for terminated HTTPS.
  • Request header modification and URL rewrite where existing Pingora handlers already provide equivalent behavior.
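Weighted backendRefs from the list above can be resolved with a cumulative-weight walk. The sketch below is illustrative only: `WeightedBackend` and `pick_backend` are hypothetical names, and the caller is assumed to supply `roll` from a random source or a request counter, which keeps the function deterministic and easy to test. Returning `None` when the total weight is zero is a choice of this sketch; how the proxy responds in that case is not specified here.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct WeightedBackend {
    pub name: String,
    pub port: u16,
    pub weight: u32,
}

/// Select a backend in proportion to its weight.
/// `roll` may be any u64; it is reduced modulo the total weight.
pub fn pick_backend(backends: &[WeightedBackend], roll: u64) -> Option<&WeightedBackend> {
    let total: u64 = backends.iter().map(|b| u64::from(b.weight)).sum();
    if total == 0 {
        return None; // no backend is eligible for traffic
    }
    let mut point = roll % total;
    for backend in backends {
        let weight = u64::from(backend.weight);
        if point < weight {
            return Some(backend);
        }
        point -= weight;
    }
    None // unreachable when total > 0, kept for exhaustiveness
}
```

With weights 10 and 90, rolls 0 through 99 select the first backend exactly 10 times, which matches a 10 percent canary split.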

Later milestones:

  • Request redirect, response header modification, request mirroring, retries, and timeouts.
  • GRPCRoute over HTTP/2.
  • TLSRoute for SNI routing and passthrough.
  • TCPRoute and UDPRoute for L4 ingress if Pingora transport support is added.
  • Backend TLS policy and mTLS to upstream services.

Light Policy Attachment

Kubernetes-native deployments should use the Gateway API Policy Attachment pattern from GEP-713 for Light-specific behavior. Do not use annotations for core behavior, and do not require config-server-owned route policy for the Kubernetes Gateway API path.

Add Light policy CRDs with targetRefs that point at Gateway API resources:

apiVersion: gateway.lightapi.net/v1alpha1
kind: LightAuthPolicy
metadata:
  name: petstore-auth
  namespace: apps
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: petstore
  jwt:
    issuer: https://issuer.example.com
    audience:
      - petstore
---
apiVersion: gateway.lightapi.net/v1alpha1
kind: LightRateLimitPolicy
metadata:
  name: petstore-ratelimit
  namespace: apps
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: petstore
  limits:
    - name: default
      requests: 1000
      window: 60s

The controller should resolve effective policy for supported target kinds such as Gateway, listener section, HTTPRoute, route rule, and eventually Service for mesh. Policy status should report Accepted, Programmed, and conflict conditions so resource owners can tell whether a policy is active.
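Effective-policy resolution could look like the sketch below. It assumes a simple "most specific target wins" rule and reports a conflict when two policies target the same level. That rule is an assumption made for illustration; the actual merge semantics (for example defaults versus overrides) are a design decision this document leaves open. All type and function names are hypothetical.

```rust
/// Target kinds ordered from least to most specific (derive(Ord) uses this order).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum TargetKind {
    Gateway,
    ListenerSection,
    HttpRoute,
    RouteRule,
}

#[derive(Debug, Clone, PartialEq)]
pub struct RateLimit {
    pub policy: String, // policy object name
    pub target: TargetKind,
    pub requests: u32,
    pub window_secs: u32,
}

#[derive(Debug, PartialEq)]
pub enum Effective<'a> {
    Unset,
    Active(&'a RateLimit),
    Conflict(Vec<&'a str>), // names of the policies that collide
}

/// Resolve the policies that apply to one request path through the hierarchy.
pub fn resolve(candidates: &[RateLimit]) -> Effective<'_> {
    let best = match candidates.iter().map(|p| p.target).max() {
        Some(kind) => kind,
        None => return Effective::Unset,
    };
    let winners: Vec<&RateLimit> = candidates.iter().filter(|p| p.target == best).collect();
    if winners.len() == 1 {
        Effective::Active(winners[0])
    } else {
        Effective::Conflict(winners.into_iter().map(|p| p.policy.as_str()).collect())
    }
}
```

The `Conflict` variant maps naturally onto a conflict condition in policy status, so owners of both policies can see why neither is programmed.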

Config-server remains valid for non-Kubernetes light-gateway deployments and for migration bridges. For light-k8s-gateway, Kubernetes resources should be the source of routing and policy intent.

TLS Secret Handling

TLS Secret material must not be written to persistent disk or normal config-cache.

Preferred handling:

  • The controller reads referenced TLS Secret objects, validates references and ReferenceGrant requirements, and distributes certificate material to proxies through the authenticated snapshot channel.
  • Proxies hold certificate material in memory and update Pingora TLS state without persisting private keys.
  • If Pingora integration requires file paths for an early milestone, write temporary files only to an emptyDir mounted with medium: Memory, under a path such as /var/run/light-k8s-gateway/tls.

Never copy TLS private keys into config-server, config-cache, persistent volumes, image layers, or logs.

Endpoint Abstraction

light-pingora should not need to know whether endpoints came from Kubernetes, direct-registry.yml, controller-rs discovery, or a static config file. Add a shared endpoint abstraction such as:

UpstreamCluster
  name
  protocol
  tls settings
  load-balancing policy
  EndpointSet
    endpoint address
    port
    health/ready state
    metadata

light-k8s-gateway-controller translates Service and EndpointSlice objects into this shape. Existing Light discovery paths can translate direct registry and portal-registry results into the same shape. The Pingora route-table module then selects an UpstreamCluster without carrying Kubernetes-specific logic.
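As a rough sketch, the shape above might map to Rust types like these. The field types, the enum variants, and the `from_static` adapter are assumptions for illustration; the exact health model is still listed as an open question, so readiness is reduced to a single boolean here.

```rust
use std::net::{AddrParseError, SocketAddr};

#[derive(Debug, Clone, PartialEq)]
pub enum UpstreamProtocol {
    Http1,
    Http2,
}

#[derive(Debug, Clone, PartialEq)]
pub enum LbPolicy {
    RoundRobin,
    Random,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Endpoint {
    pub addr: SocketAddr, // endpoint address and port
    pub ready: bool,      // simplified health/ready state
    pub metadata: Vec<(String, String)>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct UpstreamCluster {
    pub name: String,
    pub protocol: UpstreamProtocol,
    pub tls: bool, // placeholder for richer TLS settings
    pub lb_policy: LbPolicy,
    pub endpoints: Vec<Endpoint>, // the EndpointSet
}

impl UpstreamCluster {
    /// Addresses the proxy may select from, regardless of discovery source.
    pub fn ready_addrs(&self) -> Vec<SocketAddr> {
        self.endpoints.iter().filter(|e| e.ready).map(|e| e.addr).collect()
    }
}

/// Example source adapter: build a cluster from a static "ip:port" list.
pub fn from_static(name: &str, addrs: &[&str]) -> Result<UpstreamCluster, AddrParseError> {
    let mut endpoints = Vec::with_capacity(addrs.len());
    for addr in addrs {
        endpoints.push(Endpoint {
            addr: addr.parse()?,
            ready: true,
            metadata: vec![("source".to_string(), "static".to_string())],
        });
    }
    Ok(UpstreamCluster {
        name: name.to_string(),
        protocol: UpstreamProtocol::Http1,
        tls: false,
        lb_policy: LbPolicy::RoundRobin,
        endpoints,
    })
}
```

A Kubernetes adapter would fill the same struct from EndpointSlice conditions, so the route-table module only ever calls something like `ready_addrs`.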

East/West Mesh Model

Gateway API mesh support uses a different binding model. Routes attach directly to Service resources:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: petstore-policy
  namespace: apps
spec:
  parentRefs:
    - group: ""
      kind: Service
      name: petstore
      port: 8080
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1
      backendRefs:
        - name: petstore-v2
          port: 8080
          weight: 10
        - name: petstore-v1
          port: 8080
          weight: 90

The runtime semantics are:

  • If no route attaches to a Service, default mesh behavior forwards to the Service backend.
  • If routes attach and the request matches at least one route, the selected route backendRefs determine the destination.
  • If routes attach and no route matches, reject the request.
  • Same-namespace routes are producer routes and affect all clients.
  • Different-namespace routes are consumer routes and affect clients in the route namespace.
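The runtime semantics above can be sketched as a single decision function. The types and names are hypothetical, only PathPrefix matching is modeled, and each route carries one backend instead of weighted backendRefs. The sketch also assumes that consumer routes from other namespaces are invisible to a client, so a Service with only foreign consumer routes behaves as if no route were attached; precedence between producer and consumer routes is not modeled.

```rust
/// A route attached to one Service; `namespace` is the route's own namespace.
#[derive(Debug, Clone, PartialEq)]
pub struct MeshRoute {
    pub namespace: String,
    pub path_prefix: String,
    pub backend: String,
}

#[derive(Debug, PartialEq)]
pub enum MeshDecision {
    ForwardToService,         // no applicable route: default mesh behavior
    ForwardToBackend(String), // a route matched: its backend decides
    Reject,                   // routes apply but none matched
}

/// Prefix match on whole path segments.
fn prefix_matches(prefix: &str, path: &str) -> bool {
    let prefix = prefix.trim_end_matches('/');
    match path.strip_prefix(prefix) {
        Some(rest) => rest.is_empty() || rest.starts_with('/'),
        None => false,
    }
}

pub fn decide(
    service_ns: &str,
    client_ns: &str,
    routes: &[MeshRoute],
    path: &str,
) -> MeshDecision {
    // Producer routes (same namespace as the Service) affect all clients;
    // consumer routes affect only clients in the route's namespace.
    let applicable: Vec<&MeshRoute> = routes
        .iter()
        .filter(|r| r.namespace == service_ns || r.namespace == client_ns)
        .collect();
    if applicable.is_empty() {
        return MeshDecision::ForwardToService;
    }
    match applicable.iter().find(|r| prefix_matches(&r.path_prefix, path)) {
        Some(route) => MeshDecision::ForwardToBackend(route.backend.clone()),
        None => MeshDecision::Reject,
    }
}
```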

The current light-gateway can proxy service-to-service calls explicitly, but it does not transparently intercept traffic to Kubernetes Service frontends. That means a real mesh implementation needs a data-plane attachment model, not only a route compiler.

Recommended mesh milestones:

  • Mesh M0: compile Service-attached HTTPRoute resources and expose the effective route table through logs, module registry, and status. This proves the control-plane model without traffic interception.
  • Mesh M1: support an explicit in-cluster egress gateway mode. Workloads call light-gateway directly or through a configured HTTP proxy. This is useful operationally, but it must not be advertised as transparent GAMMA conformance.
  • Mesh M2: add sidecar mode. Inject a lightweight light-gateway sidecar, or preferably a smaller light-sidecar or light-mesh binary using the same light-pingora route-table module. Redirect outbound HTTP traffic to the sidecar, identify the original Service destination, apply Service-attached routes, then proxy to selected endpoints.
  • Mesh M3: add node-local or ambient mode. Use a DaemonSet plus CNI or eBPF redirection to intercept Service traffic without per-pod sidecars. This has a larger operational surface and should follow sidecar validation.

Sidecar mode is the shortest path because the current light-gateway already has sidecar concepts such as sidecar.egressIngressIndicator, token handling for outbound calls, and service discovery. The production packaging should still be a dedicated sidecar or mesh product if the target is transparent east/west traffic. The missing pieces are transparent redirect, original destination detection, and a Service-oriented route table.

Mesh Data-Plane Requirements

To proxy east/west traffic with GAMMA semantics, add:

  • A mesh route compiler that watches HTTPRoute, Service, EndpointSlice, ReferenceGrant, and namespaces.
  • A Service frontend index keyed by namespace, Service name, port, DNS name, ClusterIP, and possibly original destination socket address.
  • Producer and consumer route merge logic that follows Gateway API mesh rules.
  • Request matching and rejection behavior for Services with attached routes.
  • Backend endpoint selection from the selected route's backendRefs.
  • A sidecar or node-local interception mechanism that can recover the original destination Service before the request is proxied.
  • Policy hooks for Light security, token, and observability handlers.
  • Mesh conformance test wiring with --supported-features=Mesh.

Do not map GAMMA Service routes to Gateway listeners. In mesh mode, the Service is the parent object, and GatewayClass/Gateway are intentionally not part of the route binding.

Coexistence With Existing Light Runtime

Keep these layers distinct:

  • Gateway API resources express portable Kubernetes routing intent.
  • light-pingora route tables execute the selected routing intent.
  • handler.yml and Light module config apply Light-specific behavior.
  • light-gateway continues to serve the current microgateway, sidecar, BFF, API, agent, MCP, and LLM provider use cases.
  • light-k8s-gateway owns Kubernetes Gateway API ingress behavior.
  • portal-registry and direct-registry.yml remain available for non-Kubernetes targets and existing Light service discovery.
  • Config-server remains the source for non-Kubernetes light-gateway policy and migration bridges. Kubernetes-native light-k8s-gateway routing and policy intent should come from Gateway API resources and Light policy CRDs.

For ingress, Kubernetes Service and EndpointSlice should be the primary backend source. For non-Kubernetes or hybrid targets, add an explicit implementation-specific backend policy instead of overloading portable backendRefs.

Status And Conformance

Gateway API users rely on status. The controller must update:

  • GatewayClass.status.conditions.
  • Gateway.status.addresses, listener conditions, and supported features.
  • HTTPRoute.status.parents for every parentRef.
  • Light policy CRD status, including Accepted, Programmed, and conflict conditions.

Only the active leader should update Kubernetes status. Controller replicas use Kubernetes Lease leader election to avoid API-server write races and status flapping.

Minimum conformance gates:

go test ./conformance -run TestConformance -args \
  --gateway-class=light-k8s-gateway \
  --supported-features=Gateway,HTTPRoute

Mesh conformance gate:

go test ./conformance -run TestConformance -args \
  --supported-features=Mesh

When ingress and mesh are both enabled:

go test ./conformance -run TestConformance -args \
  --gateway-class=light-k8s-gateway \
  --supported-features=Mesh,Gateway,HTTPRoute

Observability And Telemetry

light-k8s-gateway must be operable as a primary ingress controller. Provide Prometheus metrics, OpenTelemetry traces, and structured logs from day 1.

Proxy metrics:

  • Request count tagged by Gateway, listener, route namespace, HTTPRoute, backend Service, status code, and status class.
  • Request duration and upstream duration histograms.
  • Active connections and in-flight requests.
  • Upstream connection errors, retries, timeouts, and circuit-breaker opens.
  • Snapshot version, snapshot age, and snapshot apply errors.

Controller metrics:

  • Reconcile count, duration, and error count by resource kind.
  • Kubernetes watch reconnect count and API-server request errors.
  • Status update count and conflict count.
  • Leader-election state.
  • Snapshot generation count, size, and publish errors.

Tracing:

  • Propagate W3C traceparent and existing Light correlation IDs.
  • Create ingress spans tagged with Gateway API resource identity: gateway.namespace, gateway.name, listener.name, route.namespace, route.name, route.rule, backend.service.namespace, and backend.service.name.
  • Record upstream selection, retries, and policy decisions as span events without logging tokens, private keys, or sensitive headers.

Migration From NGINX Or Traefik

Recommended customer migration:

  1. Install light-k8s-gateway with a new GatewayClass named light-k8s-gateway.
  2. Keep NGINX or Traefik running for existing Ingress or Gateway API classes.
  3. Create equivalent Gateway and HTTPRoute resources for one host.
  4. Validate status, route behavior, TLS, logs, metrics, and backend health.
  5. Move DNS or load balancer traffic for that host to light-k8s-gateway.
  6. Repeat host by host.
  7. Remove the old ingress controller only after route parity and operational dashboards are in place.

An optional Ingress-to-HTTPRoute converter can help customers migrate, but it should be a tool, not part of the runtime request path.

Open Questions

  • What is the first supported east/west deployment model: current light-gateway as explicit egress gateway, a dedicated sidecar, or ambient?
  • How much of the current server.yml listener contract should remain in light-runtime versus moving Gateway API listener binding into light-pingora?
  • Should the controller-to-proxy snapshot protocol stay as a small internal gRPC API, or should it adopt an xDS-compatible model early?
  • Which Light policy CRDs are required for the MVP: auth, rate limit, header policy, request size, token, or a generic handler-chain policy?
  • What is the exact UpstreamCluster health model shared by Kubernetes EndpointSlice, controller-rs discovery, and direct registry sources?

Suggested Implementation Order

  1. Create apps/light-k8s-gateway-controller and apps/light-k8s-gateway-proxy with separate ServiceAccounts and RBAC.
  2. Add controller leader election with Kubernetes Lease objects.
  3. Define GatewayApiSnapshot, UpstreamCluster, EndpointSet, and the authenticated controller-to-proxy snapshot stream.
  4. Implement proxy-side snapshot loading through ConfigManager.
  5. Implement GatewayClass, Gateway, HTTPRoute, ReferenceGrant, Service, EndpointSlice, Secret, and Namespace watches.
  6. Implement attachment validation, policy validation, status updates, and snapshot publishing.
  7. Add a light-pingora Gateway API route table and route HTTP traffic to Kubernetes Service endpoints.
  8. Add memory-only TLS Secret handling and terminated HTTPS listener support for the common 80/443 ingress case.
  9. Add initial Light policy CRDs using Gateway API policy attachment.
  10. Add Prometheus metrics, OpenTelemetry tracing, and structured logs for the controller and proxy.
  11. Run HTTPRoute Gateway conformance and close gaps.
  12. Add multi-listener runtime support.
  13. Add mesh route compilation for Service-attached HTTPRoute resources.
  14. Add explicit egress gateway mode for early east/west use.
  15. Add sidecar interception and run mesh conformance.
  16. Evaluate ambient/node-local mode after sidecar behavior is proven.

Light-Workflow

light-workflow is the workflow execution service for Agentic Workflow documents.

It loads workflow definitions, executes workflow tasks, integrates with light-rule for rule-backed checks, and exposes workflow execution APIs.

Key Dependencies

  • workflow-core
  • light-rule
  • axum
  • sqlx
  • reqwest

Role

light-workflow is the runtime service that turns workflow specifications into long-running execution state. It is used by agentic flows, human-in-the-loop orchestration, and integration-test style automation.

Start Workflow

This page describes the local workflow start path used to test light-workflow from light-portal.

light-workflow does not create workflow definitions and it is not the public entry point for starting a workflow. For local testing, create the workflow definition through the portal workflow service, then start it through the startWorkflow command. The running light-workflow process consumes the workflow start event from the portal database and executes the workflow tasks.

Runtime Path

The local start flow is:

  1. Create or update a workflow definition in light-portal.
  2. Start the workflow with the workflow service startWorkflow command.
  3. workflow-command writes a workflow started event into the event store and outbox tables.
  4. light-workflow polls the same database, loads the definition by wfDefId, creates the process and task records, and executes the workflow.

For this reason, the DATABASE_URL used by light-workflow must point to the same database used by the local portal stack.

Prerequisites

Start the local portal stack first. For the Rust local stack, use the normal portal-config-local deployment command from the portal-config-loc checkout:

./scripts/deploy-local.sh pg rust

Make sure the workflow command and query services are available in that stack. The workflow definition pages in portal-view depend on those services.

Then build light-workflow:

cd /home/steve/workspace/light-fabric/apps/light-workflow
cargo build -p light-workflow --locked

Start light-workflow Locally

Create light-workflow.env in /home/steve/workspace/light-fabric/apps/light-workflow:

DATABASE_URL=postgres://postgres:secret@localhost:5432/configserver
LIGHT_WORKFLOW_HTTP_ADDR=0.0.0.0:8080
RUST_LOG=light_workflow=debug,info
WORKFLOW_LOG_ANSI=false

Start the service with the debug binary:

./run.sh --debug-binary

The script loads light-workflow.env automatically. If you do not use the env file, export the values before running the script:

export DATABASE_URL=postgres://postgres:secret@localhost:5432/configserver
export LIGHT_WORKFLOW_HTTP_ADDR=0.0.0.0:8080
export RUST_LOG=light_workflow=debug,info
export WORKFLOW_LOG_ANSI=false
./run.sh --debug-binary

Do not set the variables on separate shell lines without export. Plain assignments create shell variables that exist only in the current shell; they are not passed to child processes, so run.sh will not receive them.

The easiest local test is to create the definition in the portal UI and start it from the workflow editor test action.

  1. Open light-portal.

  2. Go to the workflow definition page.

  3. Create a workflow definition.

  4. Paste one of the example workflow YAML files from:

    /home/steve/workspace/light-fabric/apps/light-workflow/examples
    
  5. Save the definition.

  6. Open the definition in the workflow editor.

  7. Use the editor test run action with a JSON input object.

For the basic example, use apps/light-workflow/examples/simple-set-assert.yaml and this input:

{
  "applicantId": "APP-001"
}

The editor test action is preferred for local testing because it parses the input text as JSON and sends input as an object.

The table run button opens the generic startWorkflow form. If using that path, make sure the request sends input as a JSON object, not as a string. If the input is submitted as a string, the workflow command may accept the request but the runtime context will not have the expected object fields.

Start with Postman or curl

You can also start the workflow directly through the portal command endpoint. Send the request to the same light-gateway or light-portal host used by the UI. Do not send this request to light-workflow; light-workflow is the executor, not the command API.

The command envelope is:

{
  "host": "lightapi.net",
  "service": "workflow",
  "action": "startWorkflow",
  "version": "0.1.0",
  "data": {
    "hostId": "<host-id>",
    "wfDefId": "<workflow-definition-id>",
    "input": {
      "applicantId": "APP-001"
    }
  }
}

Example curl shape:

curl -k -X POST "https://localhost:8443/portal/command" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <access-token>" \
  -d '{
    "host": "lightapi.net",
    "service": "workflow",
    "action": "startWorkflow",
    "version": "0.1.0",
    "data": {
      "hostId": "<host-id>",
      "wfDefId": "<workflow-definition-id>",
      "input": {
        "applicantId": "APP-001"
      }
    }
  }'

If your local UI uses a session cookie instead of a bearer token, use Postman with the same authenticated session or copy the current local authorization header from the browser request.

Creating the Definition by API

For most local tests, create the definition in the UI. It is easier because the YAML can be pasted directly.

If you create the definition through the command API, send a workflow definition command first and use the returned definition id as wfDefId in the startWorkflow command.

The command shape is:

{
  "host": "lightapi.net",
  "service": "workflow",
  "action": "createWfDefinition",
  "version": "0.1.0",
  "data": {
    "hostId": "<host-id>",
    "namespace": "light-portal",
    "name": "simple-set-assert",
    "version": "1.0.0",
    "definition": "<workflow-yaml-as-json-string>"
  }
}

When calling this from Postman, remember that the YAML definition is a JSON string field. Newlines must be escaped correctly by the JSON editor or sent by a tool that can build the JSON body safely.
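To show what "escaped correctly" means, here is a small std-only sketch that turns raw workflow YAML into a JSON string literal suitable for the `definition` field. `json_string` is a hypothetical helper written for this example; in a real client a JSON library would normally do this for you.

```rust
/// Encode `raw` as a JSON string literal, including the surrounding quotes.
pub fn json_string(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len() + 2);
    out.push('"');
    for c in raw.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters must use the \u00XX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}
```

For example, a two-line YAML text becomes a single line in which the line break is the two characters `\n`, and any double quotes inside the YAML are written as `\"`.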

Example Workflows

The current examples are in /home/steve/workspace/light-fabric/apps/light-workflow/examples:

| File | Purpose | Input |
| --- | --- | --- |
| simple-set-assert.yaml | Basic local smoke test with no external dependency. | { "applicantId": "APP-001" } |
| http-risk-decision.yaml | Calls a risk evaluation HTTP endpoint and branches on the result. | { "applicantId": "APP-001", "loanAmount": 25000, "creditScore": 720 } |
| human-approval.yaml | Creates a human approval style workflow and waits for a later decision. | { "requestId": "REQ-001", "summary": "Approve test request" } |
| insurance-claim-rest-v1.yaml | Complete product demo with direct HTTP API orchestration, native agent tasks, and human tasks. | See examples/README.md. |
| insurance-claim-mcp-v1.yaml | Complete product demo with gateway MCP tool orchestration, native agent tasks, and human tasks. | See examples/README.md. |
| insurance-claim-headless-v1.yaml | Headless insurance-claim regression workflow with deterministic agent outputs and no human-task pauses. | See examples/README.md. |

Start with simple-set-assert.yaml. It is the best smoke test because it does not require another service.

For a complete multi-agent product demo, use the insurance claim suite in apps/light-workflow/examples. The product walkthrough is Insurance Claim Agentic Workflow, and the operational runbook is in apps/light-workflow/examples/README.md.

For http-risk-decision.yaml, start a local mock service for the URL used by the definition. When light-workflow runs natively with run.sh, 127.0.0.1 means the host machine. When light-workflow runs in Docker, 127.0.0.1 means the container itself, so change the workflow endpoint to a Compose service name or host.docker.internal.

For human-approval.yaml, the first run should create a waiting task. Completing that flow requires the worklist or task-completion API path.

Verify Execution

Watch the light-workflow log after sending startWorkflow. A successful run should show that the start event was received, the first task was initialized, and the executor picked up task work.

Useful database checks:

select wf_def_id, namespace, name, version
from wf_definition_t
order by update_ts desc
limit 5;

select process_id, wf_instance_id, status_code, context_data
from process_info_t
order by started_ts desc
limit 5;

select wf_task_id, task_type, status_code, task_output
from task_info_t
order by started_ts desc
limit 10;

select c_offset, event_type, aggregate_id, payload
from outbox_message_t
order by c_offset desc
limit 10;

If outbox_message_t has the workflow started event but no process or task records appear, check that light-workflow is running against the same DATABASE_URL as the portal stack.

Troubleshooting

  • DATABASE_URL is required: Put DATABASE_URL in light-workflow.env, export it before running run.sh, or put the assignment on the same command line as ./run.sh.
  • function make_interval(mins => bigint) does not exist: Rebuild and restart light-workflow. The runtime query must cast the retry value to int before passing it to make_interval.
  • Workflow definition list is empty in the UI: Confirm the workflow query service is running and the local stack is using the jar or binary that contains the workflow definition owner-scope fix. Some local stacks run copied service artifacts, so rebuilding a source checkout is not enough unless the deployed artifact is refreshed.
  • No tasks are created after starting the workflow: Confirm the startWorkflow command wrote a workflow started event to the outbox table, and confirm light-workflow points to that same database.
  • The workflow input is missing fields: Confirm input was submitted as a JSON object. A string that contains JSON text is not the same as a JSON object in the workflow context.

Native Agent Call

Status

Recommended platform boundary.

call: agent is currently a native light-workflow task. It does not invoke a running light-agent container. The workflow engine loads the portal agent definition, selected skills, and skill tools from the database, builds a bounded model prompt, calls the configured model provider directly, validates the JSON output, and continues the workflow.

Containerized light-agent remains the interactive agent runtime. It serves chat clients, keeps session memory, loads its effective catalog, and calls MCP tools through light-gateway.

This page defines how both models should coexist in an enterprise platform.

Problem

The platform has two useful agent execution models:

  • native agent tasks inside light-workflow
  • containerized light-agent services

Both can use the same portal-authored concepts: agent definitions, skills, tools, workflow mappings, and gateway-routed API capabilities. They should not be treated as interchangeable runtime paths.

The main design question is whether a workflow should keep executing call: agent natively or call a containerized light-agent service for every agent step.

Current Behavior

When a workflow contains:

do:
  - review-offer:
      call: agent
      with:
        agent: com.networknt.agent.offer-1.0.0
        skill: offer-decision
        input:
          customerId: "${ .customerId }"
          profile: "${ .profile }"
        outputSchemaRef: offerDecision

light-workflow handles the task itself:

  1. Resolve the agent by agent_def_id or agent API name.
  2. Load active skills assigned to the agent from agent_skill_t.
  3. If a skill is specified, narrow the prompt to that skill.
  4. Load skill tool metadata from skill_tool_t, tool_t, and tool_param_t.
  5. Build a bounded prompt from workflow context, skill instructions, optional task instructions, and the expected output schema.
  6. Call the model provider configured on the portal agent definition.
  7. Parse and validate the model response as JSON.
  8. Return the structured output to the workflow context.

The native task does not:

  • call the light-agent HTTP or WebSocket endpoint,
  • use light-agent session memory,
  • let the model run a dynamic gateway tool loop,
  • execute tool calls from the model response.

Skill tools are included in the prompt as guidance and future-routing context only. In the current runtime phase, API orchestration stays in explicit workflow tasks such as call: http, call: mcp, assert, switch, and ask.

Native Agent Tasks

Native agent tasks are best for bounded reasoning where the workflow remains the system of record.

Good examples:

  • classify a request,
  • normalize user-provided input,
  • summarize API results,
  • choose between workflow branches,
  • draft a customer-facing explanation,
  • assess whether human approval is required,
  • produce structured output that must match a schema.

Benefits:

  • Strong auditability: workflow records input, output, status, retry, and failure state.
  • Deterministic orchestration: API calls, approvals, assertions, and retries stay in the workflow definition.
  • Easier governance: output schemas and workflow-owned context constrain the model.
  • Lower operational coupling: the task does not depend on a separate agent service instance being healthy.
  • Better replay and diagnostics: the workflow engine owns the execution state.

Tradeoffs:

  • It is not the full light-agent runtime.
  • It does not use chat session history or Hindsight memory.
  • It can duplicate some prompt/catalog handling from light-agent.
  • Model provider scaling is tied to light-workflow.
  • Dynamic tool selection is intentionally limited.

Containerized Agents

Containerized agents are independently deployed light-agent services.

They are best for interactive or autonomous agent behavior where the agent runtime itself is the product surface.

Good examples:

  • user-facing chat agents,
  • long-lived specialist agents,
  • agents that need session memory,
  • agents that should cache and refresh their effective catalog locally,
  • agents that need a dynamic tools/list and tools/call loop through light-gateway,
  • agents that must scale independently from workflow execution.

Benefits:

  • Real agent runtime behavior: memory, chat sessions, local catalog cache, and gateway tool execution.
  • Independent deployment, scaling, health checks, and versioning.
  • Clear service identity through controller registration.
  • Better fit for interactive clients and long-running conversational work.

Tradeoffs:

  • Harder workflow audit if the agent internally decides which APIs to call.
  • More distributed failure modes: network errors, timeouts, retries, and partial progress.
  • Requires strict request and response contracts.
  • Requires idempotency, correlation IDs, auth scopes, and timeout policy.
  • Can make the workflow less deterministic if the agent is allowed to run an open-ended tool loop.

Recommendation

Keep the mixed approach, but make the boundary explicit.

Use native call: agent for bounded reasoning inside workflow-controlled processes. Use workflow tasks and subworkflows for API orchestration. Use containerized light-agent for interactive chat and specialist runtime agents.

The recommended enterprise pattern is:

main workflow
  -> call: mcp or call: http for deterministic API access
  -> run/start subworkflow for reusable skill-backed API orchestration
  -> call: agent for bounded reasoning over workflow-owned context
  -> ask/assert/switch/retry/audit in workflow

chat client
  -> containerized light-agent
  -> effective catalog from portal-query
  -> tools/list and tools/call through light-gateway
  -> session memory and chat history

Do not route every workflow agent step through a containerized agent by default. That would move too much process control into agent services and make enterprise audit, replay, and approval harder.

Do not remove native call: agent. It is the right primitive for workflow-owned reasoning steps.

Skill To Workflow Pattern

For skills that require API orchestration, prefer mapping the skill to a workflow or subworkflow.

Example:

skill_t: customer-profile-review
  -> skill_workflow_t: customer-profile-enrichment-v1
  -> wf_definition_t: workflow that calls gateway MCP tools

In that pattern:

  • the skill describes when and why to use the capability,
  • the workflow owns the API call sequence,
  • light-gateway executes MCP tool calls,
  • native call: agent can summarize or classify the results,
  • the workflow remains the audit boundary.

This is the preferred model for enterprise API access because it prevents an agent from inventing an unreviewed process path.

Demo Guidance

The current demos should be described precisely:

  • insurance-claim-rest-v1.yaml shows workflow-owned API orchestration with direct HTTP calls plus native agent tasks for bounded reasoning.
  • insurance-claim-mcp-v1.yaml shows the same business flow through light-gateway MCP tools plus native agent tasks for bounded reasoning.
  • insurance-claim-headless-v1.yaml shows the deterministic regression path without human-task pauses.

The demos do not currently show light-workflow invoking the containerized light-agent services. That can be added later as an explicit runtime integration if the platform needs it.

Future Containerized-Agent Invocation

If a workflow needs to call containerized light-agent services in the future, do not silently change the meaning of native call: agent. Add an explicit mode or task contract so operators can see which runtime path is used.

Possible options:

call: agent
with:
  mode: native
  agent: com.networknt.agent.offer-1.0.0
  skill: offer-decision

call: agent
with:
  mode: service
  agent: com.networknt.agent.offer-1.0.0
  skill: offer-decision
  timeout: PT30S

or a separate task type:

call: agent-service
with:
  serviceId: com.networknt.agent.offer-1.0.0
  envTag: dev
  skill: offer-decision

The service-call contract must require:

  • explicit timeout and retry policy,
  • idempotency key for side-effecting work,
  • correlation and workflow instance headers,
  • output schema validation,
  • clear failure mapping to workflow status,
  • portal/gateway authorization policy,
  • observability across workflow, gateway, controller, and agent logs.
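
A minimal sketch of how the checkable parts of that contract could be validated before dispatch. AgentServiceCall, validate_contract, and every field name are hypothetical and not part of light-workflow; only the required items come from the list above. Authorization policy and cross-service observability are outside what a request check can cover.

```rust
use std::time::Duration;

/// Hypothetical request envelope for a workflow-to-agent service call.
#[derive(Debug, Clone)]
pub struct AgentServiceCall {
    pub service_id: String,
    pub skill: String,
    pub timeout: Option<Duration>,
    pub max_retries: Option<u32>,
    pub idempotency_key: Option<String>,
    pub correlation_id: Option<String>,
    pub workflow_instance_id: Option<String>,
    pub output_schema_id: Option<String>,
    /// True when the call can change external state.
    pub side_effecting: bool,
}

/// Returns every contract violation so an operator sees all problems at once.
pub fn validate_contract(call: &AgentServiceCall) -> Vec<&'static str> {
    let mut errors = Vec::new();
    if call.timeout.is_none() {
        errors.push("missing timeout");
    }
    if call.max_retries.is_none() {
        errors.push("missing retry policy");
    }
    if call.side_effecting && call.idempotency_key.is_none() {
        errors.push("missing idempotency key for side-effecting work");
    }
    if call.correlation_id.is_none() {
        errors.push("missing correlation id");
    }
    if call.workflow_instance_id.is_none() {
        errors.push("missing workflow instance id");
    }
    if call.output_schema_id.is_none() {
        errors.push("missing output schema");
    }
    errors
}
```

An empty result means the request satisfies the checkable parts of the contract; the engine would refuse to dispatch otherwise.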

Decision Matrix

Need                          Preferred runtime
Deterministic API sequence    Workflow task or subworkflow
Gateway-routed API access     call: mcp through light-gateway
Bounded model reasoning       Native call: agent
Human approval or form input  ask task
Policy assertion              assert, switch, or rule task
Interactive chat              Containerized light-agent
Session memory                Containerized light-agent
Dynamic tool loop             Containerized light-agent
Enterprise audit and replay   Workflow-owned task

Long-Term Direction

The platform should keep both execution models:

  • Native agent tasks for workflow-owned reasoning.
  • Containerized agents for interactive, memory-backed, independently scaled agent services.

The enterprise control rule is simple: workflows own durable process state and auditable API orchestration; agents provide bounded reasoning or interactive specialist behavior within contracts defined by the platform.

Sandbox Execution

Status

Proposed product design.

light-workflow should support sandbox-backed execution for tenant-authored and automation-heavy workflows. The workflow engine remains the durable orchestrator on the host. Sandboxes execute selected effectful tasks, or a bounded group of related tasks, according to an approved execution security profile.

Cube Sandbox is a good candidate provider for this boundary because it is designed for fast VM-backed sandbox creation, hardware isolation, and network policy enforcement. The design below treats Cube Sandbox as a pluggable provider, not as a hard dependency in the workflow DSL.

Problem

Workflows can be created by tenants and can eventually include tasks that run commands, scripts, containers, model calls, MCP tools, browser automation, or release automation. Those capabilities are useful, but they are also the highest-risk part of the workflow runtime.

The platform needs a way to say:

  • whether a workflow is allowed to use sandbox execution,
  • which tasks must be sent to a sandbox,
  • whether tasks should share a sandbox session,
  • which network, filesystem, image, command, and secret policies apply,
  • how release-style workflows can keep a workspace and cache across steps without moving the workflow orchestrator itself into the sandbox.

Recommendation

Add an execution security profile to the workflow definition. The profile is a request, not a final authority. At runtime, light-workflow computes an effective profile from:

  • workflow definition metadata,
  • task metadata,
  • tenant policy,
  • service policy,
  • operator-approved profile definitions,
  • deployment defaults.

The workflow engine should stay on the host and continue to own task claiming, context loading, branching, retries, persistence, and audit. Sandbox execution should be delegated to a sandbox runner for the tasks that need isolation.

For release workflows, use one sandbox session per workflow instance by default. That allows checkout state, build caches, generated artifacts, and intermediate files to survive across related build and test tasks. Use a fresh task sandbox for high-privilege publish or signing tasks if they require release tokens or signing material.

First Schema Surface

Use existing metadata fields first so the design can be implemented without an immediate workflow-core schema break. WorkflowDefinitionMetadata already has document.metadata, and every task has metadata through TaskDefinitionFields.

Workflow-level example:

document:
  dsl: "1.0.3"
  namespace: release
  name: light-fabric-polyrepo-release
  version: "1.0.0"
  metadata:
    lightWorkflow:
      security:
        executionProfile: release-sandbox
        sandbox:
          mode: workflow-session
          provider: cubesandbox
          template: light-fabric-release
          reuse: same-workflow-instance
          ttl: PT2H
          idleTimeout: PT10M

Task-level example:

do:
  - publish-github-release:
      run:
        shell:
          command: ./release.sh
          arguments:
            - "${ .version }"
      metadata:
        lightWorkflow:
          security:
            sandbox:
              mode: per-task
              reason: release-token-isolation
            secrets:
              - github-release-token

Later, the runtime can normalize a first-class security field into the same internal policy object:

security:
  executionProfile: release-sandbox
  sandbox:
    mode: workflow-session
    provider: cubesandbox
    template: light-fabric-release

Execution Modes

none

Trusted workflows run in the host executor. This mode should be limited to internal workflows or workflows that contain no effectful, untrusted tasks.

effectful-tasks

The default tenant mode. Pure orchestration tasks stay on the host, while effectful tasks are delegated to sandbox execution. Examples include shell, script, container, browser automation, external MCP servers, and filesystem work.

workflow-session

One sandbox session is created for a workflow instance and reused by approved tasks in that same instance. This is the right default for build, test, and release workflows because the sandbox can keep cloned repositories, dependency caches, build output, and temporary files across steps.

per-task

Each sandboxed task gets a fresh sandbox. This is the strongest isolation mode and should be used for untrusted commands, tasks with separate privilege levels, and tasks that receive high-value secrets.

Task Routing

Host execution should remain the default for control-plane tasks:

ask
assert
set
switch
workflow context merge
task claiming and completion
process state persistence

Sandbox execution should be required for high-risk task families:

run.shell
run.script
run.container
browser automation
tenant-provided code
filesystem mutation outside workflow context
external MCP server processes
command-line tools
release build and package commands

Provider calls need policy-based routing:

call.http      host or sandbox, depending on egress policy
call.jsonrpc   host or sandbox, depending on egress policy
call.mcp       host for approved gateway endpoints, sandbox for external servers
call.agent     host for bounded native model calls, sandbox when tools or code execution are enabled
call.rule      host unless a rule profile explicitly requires isolation
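
The routing rules above can be sketched as a single function. This is an illustration only: ExecutionTarget, RoutingContext, and route are hypothetical names, and sending unknown task types to the sandbox is an assumption, not a documented rule.

```rust
/// Where a task runs. Hypothetical type for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionTarget {
    Host,
    Sandbox,
}

/// Facts the router needs about a task; all fields are illustrative.
#[derive(Debug, Clone, Copy, Default)]
pub struct RoutingContext {
    /// call.http / call.jsonrpc: the egress policy requires isolation.
    pub egress_requires_sandbox: bool,
    /// call.mcp: the target is an approved light-gateway endpoint.
    pub approved_gateway_endpoint: bool,
    /// call.agent: tools or code execution are enabled.
    pub agent_tools_enabled: bool,
    /// call.rule: the rule profile explicitly requires isolation.
    pub rule_requires_isolation: bool,
}

pub fn route(task_type: &str, ctx: RoutingContext) -> ExecutionTarget {
    use ExecutionTarget::{Host, Sandbox};
    match task_type {
        // Control-plane tasks stay on the host.
        "ask" | "assert" | "set" | "switch" => Host,
        // High-risk task families always go to a sandbox.
        "run.shell" | "run.script" | "run.container" => Sandbox,
        // Provider calls depend on policy.
        "call.http" | "call.jsonrpc" => {
            if ctx.egress_requires_sandbox { Sandbox } else { Host }
        }
        "call.mcp" => {
            if ctx.approved_gateway_endpoint { Host } else { Sandbox }
        }
        "call.agent" => {
            if ctx.agent_tools_enabled { Sandbox } else { Host }
        }
        "call.rule" => {
            if ctx.rule_requires_isolation { Sandbox } else { Host }
        }
        // Assumption: anything unrecognized is treated as high risk.
        _ => Sandbox,
    }
}
```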

A task may request stricter isolation than the workflow profile, but it must not weaken the effective profile. For example, a workflow can run in workflow-session mode while a publish task requests per-task mode. A task inside a tenant workflow cannot request none if the tenant ceiling requires sandbox execution.
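
A minimal sketch of the "stricter only" rule, assuming the modes can be ranked by isolation strength. The ranking of effectful-tasks below workflow-session is an assumption made for illustration, the tenant requirement is modeled as a minimum isolation level, and SandboxMode and effective_task_mode are hypothetical names.

```rust
/// Sandbox modes from this design, ordered from weakest to strongest isolation.
/// The relative rank of EffectfulTasks and WorkflowSession is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum SandboxMode {
    NoSandbox, // the "none" mode
    EffectfulTasks,
    WorkflowSession,
    PerTask,
}

/// Resolves the mode for one task. A task may raise isolation above the
/// baseline, but a request below the baseline is rejected.
pub fn effective_task_mode(
    workflow_mode: SandboxMode,
    tenant_minimum: SandboxMode,
    task_request: Option<SandboxMode>,
) -> Result<SandboxMode, &'static str> {
    // The baseline is whichever of the workflow request and tenant minimum is stricter.
    let baseline = workflow_mode.max(tenant_minimum);
    match task_request {
        Some(requested) if requested < baseline => {
            Err("task may not weaken the effective profile")
        }
        Some(requested) => Ok(requested),
        None => Ok(baseline),
    }
}
```

With this ordering, a publish task inside a workflow-session workflow can ask for per-task isolation, while a request for none inside a sandbox-required tenant is refused.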

Effective Policy

The runtime should compute and persist an effective policy for each workflow instance:

{
  "requestedProfile": "release-sandbox",
  "effectiveProfile": "release-sandbox",
  "sandboxMode": "workflow-session",
  "provider": "cubesandbox",
  "template": "light-fabric-release",
  "networkPolicy": "release-egress",
  "secretPolicy": "task-scoped",
  "approvedTaskTypes": ["run.shell", "call.http", "call.mcp"],
  "policyVersion": 7
}

This policy should be written into process audit metadata so replay and incident review can prove which policy was active when the workflow ran.

Policy resolution rules:

  • Tenant policy sets the maximum privilege a tenant can request.
  • Service policy sets the maximum privilege light-workflow may grant in the current deployment.
  • Workflow metadata requests a profile.
  • Task metadata can request stricter handling.
  • Runtime validation rejects unsupported or unapproved task/provider combinations before the task executes.
  • Approval-required profile changes emit pending workflow-definition events and must not immediately publish an active workflow definition.
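
One way to read the ceiling rules is as a set intersection over approved task types, followed by a pre-execution check. The sketch below assumes privileges are expressed as task-type names; effective_task_types and validate_task are hypothetical helpers, not existing light-workflow functions.

```rust
use std::collections::BTreeSet;

/// Keeps only the requested task types that both the tenant ceiling and the
/// service ceiling allow.
pub fn effective_task_types(
    requested: &[&str],
    tenant_ceiling: &[&str],
    service_ceiling: &[&str],
) -> BTreeSet<String> {
    requested
        .iter()
        .filter(|t| tenant_ceiling.contains(*t) && service_ceiling.contains(*t))
        .map(|t| t.to_string())
        .collect()
}

/// Rejects a task before execution when its type is not in the effective set.
pub fn validate_task(effective: &BTreeSet<String>, task_type: &str) -> Result<(), String> {
    if effective.contains(task_type) {
        Ok(())
    } else {
        Err(format!(
            "task type {} is not approved by the effective policy",
            task_type
        ))
    }
}
```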

Sandbox Session Lifecycle

For workflow-session mode:

  1. Claim a task on the host.
  2. Resolve the effective workflow security profile.
  3. Create or resume the sandbox session for this workflow instance.
  4. Mount or create an isolated workspace for the workflow instance.
  5. Send the task input, command specification, environment allowlist, and permitted secret handles to the sandbox runner.
  6. Stream logs and collect bounded output.
  7. Copy declared artifacts to a controlled artifact store.
  8. Return structured task output to light-workflow.
  9. Update task and process state on the host.
  10. Destroy the sandbox when the workflow completes, fails permanently, times out, or is cancelled.

The sandbox session id should be scoped to:

tenant id
workflow definition id and version
workflow instance id
effective profile version
requesting principal

Do not reuse one sandbox across tenants, workflow definitions, unrelated workflow instances, or different requesting principals.
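
The scoping rule can be sketched as a composite key: a session is resumed only when every scope field matches. SandboxSessionKey and SessionRegistry are hypothetical, and the in-memory map stands in for whatever store the runtime would actually use.

```rust
use std::collections::HashMap;

/// All fields that scope a sandbox session. Any difference means a new session.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct SandboxSessionKey {
    pub tenant_id: String,
    pub workflow_definition_id: String,
    pub workflow_definition_version: String,
    pub workflow_instance_id: String,
    pub effective_profile_version: u32,
    pub requesting_principal: String,
}

/// In-memory registry used only for illustration.
#[derive(Default)]
pub struct SessionRegistry {
    sessions: HashMap<SandboxSessionKey, String>,
    next_id: u64,
}

impl SessionRegistry {
    /// Returns the existing session id for this exact scope, or creates one.
    pub fn create_or_resume(&mut self, key: &SandboxSessionKey) -> String {
        if let Some(id) = self.sessions.get(key) {
            return id.clone();
        }
        self.next_id += 1;
        let id = format!("sbx-session-{}", self.next_id);
        self.sessions.insert(key.clone(), id.clone());
        id
    }
}
```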

Release Workflow Example

A Light-Fabric release workflow can use one sandbox session to release these repositories:

light-fabric
portal-service
controller-rs
light-example-rs

The host light-workflow process should still own the workflow instance. The sandbox holds the release workspace:

light-workflow host
  - claims tasks
  - loads workflow context from Postgres
  - resolves policy
  - starts or resumes sandbox session
  - dispatches build/test/release commands
  - records task output, status, and audit

sandbox session
  - checks out repositories
  - runs tests and build scripts
  - stores dependency caches
  - produces release artifacts
  - exposes logs and declared artifacts

Recommended task grouping:

prepare workspace          workflow-session sandbox
checkout repositories      workflow-session sandbox
run unit tests             workflow-session sandbox
build docker images        workflow-session sandbox, if Docker or BuildKit is available in policy
package release artifacts  workflow-session sandbox
generate release notes     workflow-session sandbox
publish release            per-task sandbox or isolated publish worker
sign artifacts             per-task sandbox or external signing service

The normal build/test/package tasks can share the same sandbox because they belong to one workflow instance and benefit from shared workspace state. Publish and signing tasks should be isolated because they require stronger secrets and have irreversible external effects.

Secret Handling

The sandbox should never receive broad platform credentials. Secrets must be:

  • referenced by logical name in workflow or task metadata,
  • approved by the effective profile,
  • injected only for the task that needs them,
  • short-lived where the provider supports it,
  • redacted from logs and task output,
  • excluded from workflow context exports.

Release tokens should be task-scoped. For example, tests and builds do not need GitHub release credentials. The publish task can receive a short-lived release token in a fresh sandbox or through a separate publish worker.
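
As an illustration of the redaction requirement, the sketch below replaces injected secret values in log text with a marker that carries only the logical name. The redact function and the marker format are assumptions; plain substring replacement would not catch encoded or split values, so a real implementation would need more than this.

```rust
/// Replaces every occurrence of a secret value with a marker naming the secret.
/// `secrets` holds (logical name, value) pairs for the current task only.
pub fn redact(text: &str, secrets: &[(&str, &str)]) -> String {
    let mut out = text.to_string();
    for (name, value) in secrets {
        // Skip empty values: replacing "" would insert markers everywhere.
        if !value.is_empty() {
            out = out.replace(*value, &format!("[REDACTED:{}]", name));
        }
    }
    out
}
```

Because the marker keeps the logical name, the audit record can still show which secret a task used without exposing its value.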

Network Policy

Each profile should define egress explicitly. A release profile might allow:

github.com
api.github.com
ghcr.io
crates.io
index.crates.io
registry.npmjs.org
docker.io

Tenant workflows should not get unrestricted network access. The sandbox provider must enforce the egress policy, and light-workflow should still keep its existing destination validation for host-executed HTTP, JSON-RPC, and MCP calls.
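
A minimal sketch of an allowlist check for host-side destination validation. It assumes exact host matching, so subdomains are allowed only when listed explicitly; is_egress_allowed is a hypothetical helper and says nothing about how a sandbox provider enforces policy at the network layer.

```rust
/// Returns true only when `host` exactly matches an allowlisted name.
/// Matching is case-insensitive and ignores a trailing dot.
pub fn is_egress_allowed(host: &str, allowlist: &[&str]) -> bool {
    let host = host.trim_end_matches('.').to_ascii_lowercase();
    allowlist
        .iter()
        .any(|allowed| host == allowed.to_ascii_lowercase())
}
```

Exact matching avoids the common mistake of suffix or substring checks, which would accept a host such as github.com.evil.example.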

Artifact Boundary

The sandbox filesystem is not the workflow state store. Tasks must declare which outputs are copied out:

metadata:
  lightWorkflow:
    artifacts:
      - dist/*.tar.gz
      - dist/*.sha256
      - target/release/light-workflow

The runtime should copy artifacts into a controlled store and record artifact metadata in task output:

{
  "artifacts": [
    {
      "name": "light-fabric-0.3.0-x86_64-unknown-linux-gnu.tar.gz",
      "sha256": "...",
      "size": 12450000,
      "storeUri": "artifact://..."
    }
  ]
}
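
To show how declared patterns could gate what leaves the sandbox, the sketch below matches produced paths against the declared list. It assumes a deliberately small pattern language in which * matches any run of characters within one path segment; matches_pattern and select_artifacts are hypothetical, and the actual glob semantics are not specified by this design.

```rust
/// Matches `path` against `pattern`, where `*` matches zero or more characters
/// but never crosses a '/' separator.
pub fn matches_pattern(pattern: &str, path: &str) -> bool {
    fn go(p: &[u8], s: &[u8]) -> bool {
        match p.split_first() {
            None => s.is_empty(),
            Some((b'*', rest)) => {
                // Try every prefix length that stays inside the current segment.
                let mut i = 0;
                loop {
                    if go(rest, &s[i..]) {
                        return true;
                    }
                    if i == s.len() || s[i] == b'/' {
                        return false;
                    }
                    i += 1;
                }
            }
            Some((c, rest)) => match s.split_first() {
                Some((d, tail)) if c == d => go(rest, tail),
                _ => false,
            },
        }
    }
    go(pattern.as_bytes(), path.as_bytes())
}

/// Keeps only the produced paths that some declared pattern matches.
pub fn select_artifacts<'a>(declared: &[&str], produced: &[&'a str]) -> Vec<&'a str> {
    produced
        .iter()
        .copied()
        .filter(|path| declared.iter().any(|p| matches_pattern(p, path)))
        .collect()
}
```

Anything the task writes but does not declare, such as a stray credentials file, stays in the sandbox and is discarded with it.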

Audit

Every sandboxed task should record:

  • workflow definition id and version,
  • workflow instance id,
  • task id and task name,
  • requested and effective profile,
  • sandbox provider, template, session id, and sandbox id,
  • command, argv, working directory, and environment allowlist,
  • injected secret names, not values,
  • network policy id,
  • artifact metadata,
  • exit status,
  • duration,
  • output size,
  • log reference,
  • policy version.

For call: agent, also record model provider, model name, prompt profile, token budget, output schema id, validation result, and whether tool execution was allowed.

Failure Handling

Sandbox failures should map to normal workflow task failure semantics:

  • startup failure: task fails with sandbox_start_failed,
  • policy rejection: task fails with sandbox_policy_denied,
  • timeout: task fails with sandbox_timeout,
  • command non-zero exit: task fails with command_failed,
  • oversized output: task fails with sandbox_output_too_large,
  • sandbox lost: task fails or retries according to task retry policy.

The host must not mark a task complete until the sandbox result has been validated and persisted. If a sandbox dies after a command has external side effects, retries must respect the task idempotency key and release workflow guardrails.
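
The failure mapping can be sketched as a single match. The failure codes are the ones listed above; SandboxFailure, TaskOutcome, map_failure, and the sandbox_lost code used once retries are exhausted are assumptions for illustration.

```rust
/// Sandbox failure classes named in this design.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SandboxFailure {
    StartFailed,
    PolicyDenied,
    Timeout,
    CommandFailed { exit_code: i32 },
    OutputTooLarge,
    SandboxLost,
}

/// What the workflow engine does with the task.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskOutcome {
    Fail(&'static str),
    Retry,
}

pub fn map_failure(failure: SandboxFailure, retries_left: u32) -> TaskOutcome {
    match failure {
        SandboxFailure::StartFailed => TaskOutcome::Fail("sandbox_start_failed"),
        SandboxFailure::PolicyDenied => TaskOutcome::Fail("sandbox_policy_denied"),
        SandboxFailure::Timeout => TaskOutcome::Fail("sandbox_timeout"),
        SandboxFailure::CommandFailed { .. } => TaskOutcome::Fail("command_failed"),
        SandboxFailure::OutputTooLarge => TaskOutcome::Fail("sandbox_output_too_large"),
        // A lost sandbox follows the task retry policy.
        SandboxFailure::SandboxLost if retries_left > 0 => TaskOutcome::Retry,
        // Assumed code; the design does not name one for this case.
        SandboxFailure::SandboxLost => TaskOutcome::Fail("sandbox_lost"),
    }
}
```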

Implementation Plan

  1. Define ExecutionSecurityProfile and sandbox policy structs in light-workflow.
  2. Parse workflow and task metadata under lightWorkflow.security.
  3. Add effective-profile resolution using tenant and service ceilings.
  4. Add a SandboxRunner trait with provider-neutral operations: create session, execute task, copy artifacts, checkpoint, destroy.
  5. Add a Cube Sandbox provider implementation behind configuration.
  6. Keep unsupported run.* task types disabled until they route through SandboxRunner.
  7. Persist sandbox session metadata in process context or a dedicated runtime table.
  8. Add audit output to every sandboxed task.
  9. Add approval gates for profiles that enable command execution, external MCP servers, broad egress, or task-scoped secrets.
  10. Add release workflow examples that use workflow-session mode for build/test/package and per-task mode for publish/sign.
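
A rough sketch of what the provider-neutral trait in step 4 might look like. Method signatures, TaskResult, and FakeRunner are hypothetical; the sketch is synchronous to stay self-contained, whereas a runtime built on tokio would more likely use async methods.

```rust
/// Bounded result of one sandboxed command.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TaskResult {
    pub exit_code: i32,
    pub stdout: String,
}

/// Provider-neutral sandbox operations (hypothetical shape).
pub trait SandboxRunner {
    fn create_session(&mut self, template: &str) -> Result<String, String>;
    fn execute_task(&mut self, session_id: &str, command: &str, args: &[&str])
        -> Result<TaskResult, String>;
    fn copy_artifacts(&mut self, session_id: &str, patterns: &[&str])
        -> Result<Vec<String>, String>;
    fn checkpoint(&mut self, session_id: &str) -> Result<(), String>;
    fn destroy(&mut self, session_id: &str) -> Result<(), String>;
}

/// In-memory stand-in that runs nothing; useful for engine-side tests.
#[derive(Default)]
pub struct FakeRunner {
    live: Vec<String>,
    counter: u32,
}

impl FakeRunner {
    fn ensure_live(&self, session_id: &str) -> Result<(), String> {
        if self.live.iter().any(|s| s.as_str() == session_id) {
            Ok(())
        } else {
            Err(format!("unknown sandbox session: {}", session_id))
        }
    }
}

impl SandboxRunner for FakeRunner {
    fn create_session(&mut self, template: &str) -> Result<String, String> {
        self.counter += 1;
        let id = format!("{}-{}", template, self.counter);
        self.live.push(id.clone());
        Ok(id)
    }

    fn execute_task(&mut self, session_id: &str, command: &str, _args: &[&str])
        -> Result<TaskResult, String> {
        self.ensure_live(session_id)?;
        Ok(TaskResult { exit_code: 0, stdout: format!("ran {}", command) })
    }

    fn copy_artifacts(&mut self, session_id: &str, _patterns: &[&str])
        -> Result<Vec<String>, String> {
        self.ensure_live(session_id)?;
        Ok(Vec::new())
    }

    fn checkpoint(&mut self, session_id: &str) -> Result<(), String> {
        self.ensure_live(session_id)
    }

    fn destroy(&mut self, session_id: &str) -> Result<(), String> {
        self.live.retain(|s| s.as_str() != session_id);
        Ok(())
    }
}
```

Keeping a fake behind the same trait lets the engine's routing, audit, and failure mapping be tested without a Cube Sandbox deployment.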

Open Decisions

  • Whether sandbox profiles live only in service configuration or are also portal-managed records.
  • Whether artifact storage should use portal tables, object storage, or both.
  • Whether publish/sign should run in a sandbox or call a separate release service.
  • How much of the Cube Sandbox API should be exposed directly versus hidden behind a provider-neutral interface.
  • Whether a first-class security field should be added to workflow-core after the metadata-based design proves stable.

Insurance Claim Agentic Workflow

This page describes a product workflow demo for orchestrating multiple agents, skills, APIs, and human tasks with light-workflow.

The scenario is an auto insurance claim from first notice of loss to a settlement recommendation. It is a useful demo because it is familiar, has clear business states, needs several API calls, and includes human decisions that should not be delegated fully to an agent.

Demo Goal

The workflow should show how a deterministic process can coordinate:

  • two or three agents
  • multiple skills per agent
  • REST API calls
  • MCP tool calls through light-gateway
  • human input and approval tasks
  • branching based on policy, risk, and claim severity

The same business flow should be executable in two variants:

  • REST workflow: calls the demo APIs directly with HTTP/OpenAPI tasks.
  • MCP workflow: calls the same capabilities through MCP tools exposed by light-gateway.

The workflow owns the process. Agents work inside bounded tasks and should not invent new process paths outside the workflow definition.

For the agent execution boundary, see Native Agent Call. In the current implementation, call: agent is a native light-workflow task. It does not invoke a containerized light-agent service. API access in this demo is owned by the workflow through direct HTTP tasks or MCP tool calls routed through light-gateway.

Execution Model

This demo uses the enterprise workflow-first model:

  • light-workflow owns the claim process, task state, retries, branching, human tasks, and audit trail.
  • API access is explicit in the workflow as call: http or call: mcp.
  • Native call: agent tasks perform bounded reasoning over workflow-owned context and must return structured output.
  • Skills provide instructions, tool context, and workflow mappings, but they do not give an agent permission to invent unreviewed process paths.
  • Containerized light-agent services are not invoked by this demo workflow. They remain the runtime for chat clients and future service-agent integration.

Demo APIs

The existing demo APIs can be used as stand-ins for insurance services.

API                        Role in the claim workflow
demo-customer-profile-api  Policyholder profile, vehicle list, policy status, contact preference, prior claims.
demo-offer-decision-api    Claim triage, risk decision, settlement or repair recommendation.

If more realism is needed later, the same workflow can add simulated services for document storage, repair estimates, fraud review, or payment authorization.

Agents

Claim Intake Agent

The Claim Intake Agent owns first notice of loss collection and basic validation.

Skills:

  • collect accident facts
  • validate required claim fields
  • look up customer, policy, and vehicle data
  • identify missing information
  • summarize the claim for the next agent

Typical tools or API calls:

  • get customer profile
  • get customer policies
  • get covered vehicles
  • get prior claims

Human tasks:

  • claimant confirms accident details
  • claimant answers missing information questions
  • claimant uploads or confirms photos, police report, and tow status

Coverage And Liability Agent

The Coverage and Liability Agent checks whether the claim can continue and whether a human adjuster must review it.

Skills:

  • coverage eligibility check
  • incident date versus policy period check
  • vehicle coverage check
  • liability and severity classification
  • fraud or special investigation flagging

Typical tools or API calls:

  • get policy status
  • get prior claim history
  • run triage decision
  • run risk decision

Human tasks:

  • adjuster reviews unclear liability
  • adjuster confirms coverage exception handling
  • special investigation team reviews high-risk claims

Settlement Agent

The Settlement Agent prepares the next action and customer-facing explanation.

Skills:

  • repair versus total-loss recommendation
  • deductible explanation
  • settlement recommendation
  • customer message draft
  • next-document request

Typical tools or API calls:

  • get offer decision
  • get customer contact preference
  • create settlement recommendation

Human tasks:

  • adjuster approves high-value payment
  • claimant accepts repair or settlement path
  • claimant requests callback or more review

Claim Context And Handoffs

The workflow engine owns the claim state. Agents should be treated as stateless workers that read the current claim context, perform a bounded task, and return structured output.

Each major step enriches a shared claim context:

  • intake adds normalized accident facts and missing information status
  • customer lookup adds profile, policy, vehicle, and prior-claim data
  • coverage review adds eligibility, deductible, liability, and risk signals
  • triage adds severity, recommended path, and human-review requirements
  • settlement adds the recommendation, explanation, and next actions

Handoffs between agents should happen through this workflow-owned context, not through private agent memory. This keeps the process deterministic, replayable, and auditable.

Workflow Outline

1. Start Claim

Input:

{
  "customerId": "CUST-001",
  "vehicleId": "VEH-001",
  "incidentDate": "2026-05-30",
  "accidentDescription": "Rear-ended at an intersection.",
  "location": "Ottawa, ON",
  "injuryReported": false,
  "vehicleDrivable": false
}

The workflow validates that customerId, vehicleId, incidentDate, and accidentDescription are present.
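
A small sketch of that presence check. ClaimInput and missing_required_fields are hypothetical, and treating whitespace-only strings as missing is an assumption. In the workflow itself this would more likely be expressed as an input schema or assert task.

```rust
/// Start-claim input; optional fields model values the caller may omit.
#[derive(Debug, Default, Clone)]
pub struct ClaimInput {
    pub customer_id: Option<String>,
    pub vehicle_id: Option<String>,
    pub incident_date: Option<String>,
    pub accident_description: Option<String>,
    pub location: Option<String>,
    pub injury_reported: Option<bool>,
    pub vehicle_drivable: Option<bool>,
}

/// Returns the JSON names of required fields that are absent or blank.
pub fn missing_required_fields(input: &ClaimInput) -> Vec<&'static str> {
    let required = [
        ("customerId", &input.customer_id),
        ("vehicleId", &input.vehicle_id),
        ("incidentDate", &input.incident_date),
        ("accidentDescription", &input.accident_description),
    ];
    let mut missing = Vec::new();
    for (name, value) in required {
        // Assumption: a whitespace-only value counts as missing.
        let blank = value.as_deref().map_or(true, |s| s.trim().is_empty());
        if blank {
            missing.push(name);
        }
    }
    missing
}
```

A non-empty result could feed the "Ask For Missing Information" step instead of failing the claim outright.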

2. Fetch Customer Context

The workflow calls the profile and policy capabilities to retrieve:

  • customer identity
  • policy list
  • covered vehicles
  • contact preference
  • prior claim count

Assertions:

  • customer exists
  • vehicle belongs to customer
  • at least one active policy exists

3. Ask For Missing Information

If the input is incomplete, the workflow creates a human task for the claimant.

Example questions:

  • Was anyone injured?
  • Was another vehicle involved?
  • Is the vehicle drivable?
  • Was a police report filed?
  • Are photos available?

The workflow should be resumable after the claimant answers.

4. Coverage Check

The workflow passes the gathered claim context to a native Coverage and Liability agent task. That task checks:

  • policy active on incident date
  • covered vehicle
  • applicable coverage type
  • deductible
  • excluded conditions

Branches:

  • no matching policy: route to adjuster review
  • policy inactive: prepare denial draft for human review
  • coverage found: continue to triage

5. Triage Decision

The workflow calls the decision API, either directly with HTTP or through light-gateway MCP, with normalized claim context.

Expected decision output:

{
  "severity": "medium",
  "riskLevel": "low",
  "recommendedPath": "repair",
  "requiresAdjusterReview": false,
  "estimatedLoss": 3200
}

Branches:

  • low risk and low value: continue automatically
  • unclear liability: create adjuster review task
  • high risk: create special investigation task
  • high value: create approval task
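
The branch selection could be sketched as follows. The precedence between branches, the use of requiresAdjusterReview to signal unclear liability, and the approval threshold parameter are all assumptions; the demo does not fix these values. In the workflow this logic would live in a switch task rather than in code.

```rust
/// Decision API output, mirroring the expected JSON fields.
#[derive(Debug, Clone, PartialEq)]
pub struct TriageDecision {
    pub severity: String,
    pub risk_level: String,
    pub recommended_path: String,
    pub requires_adjuster_review: bool,
    pub estimated_loss: f64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NextStep {
    ContinueAutomatically,
    AdjusterReview,
    SpecialInvestigation,
    Approval,
}

/// Picks the next branch. Order of checks is an assumed precedence:
/// high risk first, then adjuster review, then high value.
pub fn next_step(decision: &TriageDecision, approval_threshold: f64) -> NextStep {
    if decision.risk_level == "high" {
        NextStep::SpecialInvestigation
    } else if decision.requires_adjuster_review {
        NextStep::AdjusterReview
    } else if decision.estimated_loss > approval_threshold {
        NextStep::Approval
    } else {
        NextStep::ContinueAutomatically
    }
}
```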

6. Settlement Recommendation

The workflow passes the approved claim context to a native Settlement agent task. That task prepares:

  • recommended path: repair, estimate, total-loss review, denial draft, or more information
  • deductible explanation
  • next documents required
  • customer-facing summary

The result should be structured so the UI can render it and the agent can explain it.

7. Human Approval

Approval is required for:

  • high estimated loss
  • denial recommendation
  • special investigation referral
  • liability uncertainty
  • customer dispute

The task should record:

  • approver role
  • approval decision
  • comment
  • timestamp
  • whether the workflow should proceed, revise, or stop

8. Customer Response

The claimant chooses one of:

  • accept repair path
  • request adjuster callback
  • upload more documents
  • dispute the recommendation

This should be modeled as a human ask task rather than an agent-only step.

9. End State

Possible workflow outcomes:

State                  Meaning
claim-approved         Claim can proceed to repair or settlement.
needs-adjuster-review  Human adjuster must review before next action.
needs-customer-info    Claimant must provide missing information.
referred-to-siu        Claim is referred to special investigation.
claim-denied-draft     Denial is drafted but still needs human approval.

Failure Handling And Fallbacks

The demo should show graceful degradation when an API call or agent task cannot finish automatically.

Recommended fallback behavior:

Failure                                  Workflow response
Customer profile returns 404             Create a manual customer verification task.
Policy or vehicle lookup is unavailable  Retry, then route to adjuster review with the partial claim context.
Decision API is unavailable              Create a manual triage task and include the last successful context.
Agent output fails validation            Re-run once with validation feedback, then create a human review task.
Human task times out                     Escalate to the configured role or mark the claim as waiting for follow-up.

The failure branch should preserve the accumulated claim context and the failed request or response metadata so the human reviewer can continue from the same state instead of restarting the claim.

REST Variant

The REST workflow calls the demo APIs directly.

Use this variant to show:

  • deterministic API orchestration
  • direct HTTP/OpenAPI task execution
  • workflow assertions
  • human waiting tasks
  • repeatable headless tests with fixed inputs

Example task sequence:

start-claim
get-customer-profile
assert-active-policy
ask-missing-info
run-claim-triage
switch-risk-path
ask-adjuster-approval
prepare-settlement-summary
ask-customer-response
complete-claim

MCP Variant

The MCP workflow invokes the same capabilities through MCP tools exposed by light-gateway.

Use this variant to show:

  • tool discovery with tools/list
  • tool execution with tools/call
  • agent skill guidance over the selected tool set
  • gateway as the runtime MCP data plane

Skills should be treated as guidance and curation for the agent, not as the runtime transport. The workflow still calls MCP tools through light-gateway. A skill describes when and how to use tools. For example, the coverage-review skill can instruct the agent to call evaluate_coverage before score_claim_risk, explain which fields must be present, and define what output shape the workflow expects.

Example tool groups:

Skill            Tools
claim-intake     get_customer_profile, get_policy, get_vehicle, list_prior_claims
coverage-review  evaluate_coverage, score_claim_risk, classify_liability
settlement       recommend_offer, generate_customer_summary, list_required_documents

Human Task Model

Human work should be explicit and durable.

Recommended task types:

  • claimant information request
  • adjuster approval
  • liability review
  • special investigation review
  • customer settlement response

Recommended fields:

  • prompt
  • mode: choice, text, object, file, approval
  • assignee or candidate role
  • due time
  • validation rules
  • sensitive flag
  • comments
  • decision result

The workflow should pause at the human task and resume after a valid response is recorded.

The pause is durable. light-workflow persists the process and task state while waiting, so the workflow can remain idle for hours or days without consuming active execution resources. When the claimant, adjuster, or investigator completes the task, the workflow resumes from the persisted state and continues with the same claim context.

Minimal First Implementation

Start with a narrow happy path:

  1. Start with customerId, vehicleId, and accident details.
  2. Workflow fetches customer profile through HTTP or MCP.
  3. Workflow asserts active policy and covered vehicle.
  4. Workflow calls the decision API for triage.
  5. Workflow asks an adjuster to approve if estimatedLoss exceeds a threshold.
  6. Native Settlement agent task prepares the recommendation.
  7. Workflow completes with claim-approved or needs-adjuster-review.

This first version is enough to demonstrate workflow-owned orchestration of API calls, a human approval, and a native agent task without needing every insurance edge case.

Later Enhancements

Add complexity incrementally:

  • document upload and OCR simulation
  • repair shop estimate comparison
  • fraud and special investigation path
  • payment authorization
  • subrogation when another driver is liable
  • scheduled headless regression runs
  • customer notification drafting
  • analytics for cycle time and approval bottlenecks

Demo Success Criteria

The demo is successful if it shows:

  • the same business process running through REST and MCP variants
  • agents using skills to perform bounded work
  • APIs called through both direct HTTP and MCP tool paths
  • at least one human input task
  • at least one human approval task
  • auditable workflow state transitions
  • clear final outcome and explanation

Light Portal Setup

This page describes the portal-side setup required to run the light-workflow product demos from a local light-portal stack.

For the execution model behind native agent tasks, see Native Agent Call. For the insurance product scenario, see Insurance Claim Agentic Workflow.

Prerequisites

Start the local portal stack with the workflow services, gateway, controller, and Postgres available.

For the Rust local stack:

cd /home/steve/workspace/portal-config-loc
./scripts/deploy-local.sh pg rust

The local stack should include:

  • Postgres,
  • workflow-command,
  • workflow-query,
  • light-gateway,
  • controller,
  • config-server,
  • demo-customer-profile-api,
  • demo-offer-decision-api.

light-workflow must use the same database as workflow-command:

DATABASE_URL=postgres://postgres:secret@localhost:5432/configserver

Start Light-Workflow

Build and run light-workflow from the light-fabric checkout:

cd /home/steve/workspace/light-fabric
cargo build -p light-workflow --locked

cd apps/light-workflow
DATABASE_URL=postgres://postgres:secret@localhost:5432/configserver \
LIGHT_WORKFLOW_HTTP_ADDR=0.0.0.0:8080 \
RUST_LOG=light_workflow=debug,info \
WORKFLOW_LOG_ANSI=false \
./run.sh --debug-binary

For repeated runs, put those values in apps/light-workflow/light-workflow.env and run:

./run.sh --debug-binary

Import Agent Catalog Data

Native agent tasks (call: agent) load portal agent, skill, and tool metadata from the portal database. Import the demo catalog events before running workflows that contain agent tasks.

cd /home/steve/workspace/event-importer
./importer.sh \
  --filename /home/steve/workspace/light-fabric/apps/light-workflow/examples/agent-catalog-events.json

For a different host or user, pass replacement rules:

./importer.sh \
  --filename /home/steve/workspace/light-fabric/apps/light-workflow/examples/agent-catalog-events.json \
  --replacement '[
    {"field":"hostId","from":"01964b05-552a-7c4b-9184-6857e7f3dc5f","to":"<host-id>"},
    {"field":"user","from":"01964b05-5532-7c79-8cde-191dcbd421b8","to":"<user-id>"},
    {"field":"operationOwner","from":"01964b05-5532-7c79-8cde-191dcbd421b8","to":"<user-id>"},
    {"field":"deliveryOwner","from":"01964b05-5532-7c79-8cde-191dcbd421b8","to":"<user-id>"}
  ]'

The demo catalog uses modelProvider: mock for deterministic local runs. For real model execution, update the portal agent definitions to use the desired provider and apiKeyRef.

Upload API Metadata

For the insurance claim demos, upload or refresh the OpenAPI specs for:

  • demo-customer-profile-api,
  • demo-offer-decision-api.

The portal catalog should contain endpoint and tool projections for the demo APIs before the MCP workflow is run. The MCP workflow expects light-gateway tools/list to expose these tools:

getCustomerProfile
getCustomerPreferences
getCustomerPolicies
getCoveredVehicle
listPriorClaims
triageClaim
recommendSettlement

Verify the tool surface through the gateway:

curl -k -sS -X POST "https://localhost:8443/mcp" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <access-token>" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}'
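Once the tool names have been extracted from the tools/list response, the comparison against the expected list is a simple set difference. The following is a minimal sketch under that assumption: `EXPECTED_TOOLS` copies the names listed above, while `missing_tools` is a hypothetical helper and not part of light-gateway or light-workflow.

```rust
use std::collections::BTreeSet;

/// Tool names the MCP workflow expects, copied from the list above.
pub const EXPECTED_TOOLS: [&str; 7] = [
    "getCustomerProfile",
    "getCustomerPreferences",
    "getCustomerPolicies",
    "getCoveredVehicle",
    "listPriorClaims",
    "triageClaim",
    "recommendSettlement",
];

/// Returns the expected tool names that are absent from `exposed`,
/// in the order they appear in `EXPECTED_TOOLS`.
pub fn missing_tools(exposed: &[&str]) -> Vec<&'static str> {
    let exposed: BTreeSet<&str> = exposed.iter().copied().collect();
    EXPECTED_TOOLS
        .iter()
        .copied()
        .filter(|name| !exposed.contains(*name))
        .collect()
}
```

An empty result means the gateway exposes every tool the MCP workflow depends on. Any returned name points at an API whose metadata still needs to be uploaded or refreshed.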

Create Workflow Definitions

Create workflow definitions in the portal UI or through the workflow command API. For the insurance claim demo, create these definitions:

insurance-claim-rest-v1.yaml
insurance-claim-mcp-v1.yaml
insurance-claim-headless-v1.yaml

The files live in:

/home/steve/workspace/light-fabric/apps/light-workflow/examples

After creation, capture their ids:

psql "postgresql://postgres:secret@localhost:5432/configserver" \
  -c "select host_id, wf_def_id, name from wf_definition_t where active and name in ('insurance-claim-rest-v1', 'insurance-claim-mcp-v1', 'insurance-claim-headless-v1') order by name;"

Roles And Human Tasks

The insurance claim workflow creates durable human tasks. Confirm that the demo host has the roles used by those assignments:

claimant
claims-adjuster
siu-investigator
customer-service

Human tasks remain in the portal database while waiting. The workflow resumes after the task-completion command records a valid response.
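The wait-and-resume behavior described above can be pictured as a small state transition. This is a hedged sketch only: `WorkflowState`, `TaskCompletion`, and `resume` are hypothetical names, and treating a "valid response" as a non-empty answer from the assigned role is an assumption, not the portal's actual validation rule.

```rust
/// Hypothetical, simplified workflow states for illustration.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum WorkflowState {
    Running,
    WaitingForHuman { role: String },
    Completed,
}

/// Hypothetical shape of a task-completion command.
pub struct TaskCompletion {
    pub completed_by_role: String,
    pub response: Option<String>,
}

/// Resumes a waiting workflow only when the completion comes from the
/// assigned role and carries a non-empty response (an assumed validity rule).
/// Every other combination leaves the state unchanged.
pub fn resume(state: &WorkflowState, cmd: &TaskCompletion) -> WorkflowState {
    match state {
        WorkflowState::WaitingForHuman { role }
            if *role == cmd.completed_by_role
                && cmd
                    .response
                    .as_deref()
                    .map_or(false, |r| !r.trim().is_empty()) =>
        {
            WorkflowState::Running
        }
        other => other.clone(),
    }
}
```

Because an invalid completion returns the state unchanged, the task simply stays pending, which matches the durable-wait behavior described above.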

Start And Verify

Use the portal UI start action, Postman collection, or curl helper from the examples directory.

cd /home/steve/workspace/light-fabric/apps/light-workflow/examples

ACCESS_TOKEN=<token> \
HOST_ID=<host-id> \
HEADLESS_WF_DEF_ID=<headless-wf-def-id> \
./insurance-claim-demo-curl.sh start-headless

Run the SQL verification helper after each start or task completion:

psql "postgresql://postgres:secret@localhost:5432/configserver" \
  -v host_id=<host-id> \
  -f /home/steve/workspace/light-fabric/apps/light-workflow/examples/insurance-claim-demo-queries.sql

For the full runbook, see:

/home/steve/workspace/light-fabric/apps/light-workflow/examples/README.md

Troubleshooting

| Symptom | Check |
| --- | --- |
| Workflow starts but no process appears | Confirm light-workflow uses the same DATABASE_URL as workflow-command. |
| Agent task fails before a human task | Confirm agent-catalog-events.json was imported for the same hostId. |
| MCP tool is not found | Call gateway tools/list and confirm the tool names match the workflow YAML. |
| Human task is not visible | Check task_asst_t, role membership, and task status. |
| Input fields resolve as ${ .customerId } | Confirm startWorkflow sends input as a JSON object, not a JSON string. |

Comparison: Light-Fabric vs. AgentGateway

This document provides a high-level comparison between Light-Fabric and AgentGateway to help architects and engineering leaders choose the right foundation for their agentic workflows.

Overview

While both systems aim to facilitate interaction with Large Language Models (LLMs), they operate at different layers of the AI stack and prioritize different architectural outcomes.

| Feature | Light-Fabric | AgentGateway |
| --- | --- | --- |
| Primary Philosophy | Agentic Fabric: Unified Governance & Lifecycle | Agentic Gateway: High-performance Proxy |
| Core Architecture | Integrated Platform (Layer) | Standalone Gateway (Service) |
| Target User | Central IT / Platform Engineering | Application Developers / DevOps |
| Lifecycle Management | APIs, Agents, MCPs, and Gateways | Primarily LLM Request Routing |
| Language | Native Rust (Extreme Performance) | Rust / Go (Variable) |

1. Governance vs. Connectivity

Light-Fabric (Governance)

Light-Fabric is designed as a Single Control Plane. It assumes that in an enterprise environment, "freedom without governance is chaos." It provides:

  • Centralized Registry: Every agent, skill, and tool is registered and governed via the light-portal.
  • Fine-Grained Authorization: Deep policy enforcement at the endpoint level, including row- and column-level data masking.
  • Auditability: A unified audit trail for all agentic interactions across the entire organization.

AgentGateway (Connectivity)

AgentGateway typically focuses on the North-South traffic between an application and multiple LLM providers. Its primary strength is:

  • Simplified Routing: Getting a request from Point A to Point B with retries and failover.
  • Provider Abstraction: Normalizing different LLM APIs into a single interface.

2. Integrated Intelligence: Hindsight

A defining difference of Light-Fabric is its deep integration of Hindsight Memory.

  • Light-Fabric: Memory is not an "add-on." The platform provides native biomimetic memory banks (World Facts, Experiences, Mental Models) that are automatically managed and scoped (Global, Shared, Private) as part of the fabric.
  • AgentGateway: Typically treats memory as external state. The application or a separate vector database must manage context before sending the request through the gateway.

3. Skill & Tool Management

Centralized Skills (Fabric)

In Light-Fabric, skills (tools) are first-class citizens. They are registered, versioned, and governed centrally. An agent doesn't just "have" a tool; the Fabric grants the agent access to a skill based on its role and the current context.
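The idea that access is granted by role and context, rather than owned by the agent, can be sketched as a lookup against a grant record. This is an illustrative sketch only: `SkillGrant`, its fields, and `may_use` are hypothetical and do not reflect the actual light-portal data model.

```rust
use std::collections::HashSet;

/// Hypothetical grant record: which roles may use a skill, and in which contexts.
pub struct SkillGrant {
    pub skill: String,
    pub allowed_roles: HashSet<String>,
    pub allowed_contexts: HashSet<String>,
}

/// An agent may use the skill only when both its role and the current
/// context are covered by the grant.
pub fn may_use(grant: &SkillGrant, agent_role: &str, context: &str) -> bool {
    grant.allowed_roles.contains(agent_role) && grant.allowed_contexts.contains(context)
}
```

Requiring both conditions means the same agent can be allowed a skill in one workflow and denied it in another, without changing the agent definition.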

Standard Tooling (Gateway)

AgentGateway generally passes tool definitions through to the provider. The management of who can use which tool and how those tools are secured is usually left to the application logic.


4. Orchestration: Hybrid Agentic Workflows

Light-Fabric (Integrated Orchestrator)

Light-Fabric treats orchestration as a foundational service. It implements a Hybrid Model:

  • Deterministic Process: The overall business logic (e.g., insurance claim steps) is fixed and compliant.
  • Autonomous Tasks: Individual steps within the process are delegated to agents.
  • Statefulness: The Fabric manages long-running state across days or weeks, ensuring durability.

AgentGateway (Stateless Proxy)

AgentGateway is primarily a stateless component.

  • External Orchestration: The workflow logic must reside in your application code or an external engine (like Temporal).
  • Proxy Only: It handles the communication but does not "understand" or manage the multi-step business process itself.

5. Security: The Rule Engine

Light-Fabric (Integrated Governance)

Light-Fabric includes an integrated YAML-based Rule Engine (light-rule) designed for fine-grained authorization:

  • Data Filtering: Automatically masks or filters response data (column/row level) based on policies.
  • Policy Enforcement: Checks permissions before an agent executes a tool or accesses a memory unit.
  • Hot-Reloading: Security rules can be updated in real-time without redeploying the platform.
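To make column-level masking concrete, the sketch below applies a masking policy to response rows. It is a simplified illustration, not the light-rule engine: `Row`, `MaskPolicy`, `apply_mask`, the `***` placeholder, and the notion of exempt roles are all assumptions introduced here.

```rust
use std::collections::{BTreeMap, HashSet};

/// A response row as column name -> value (a simplification for illustration).
pub type Row = BTreeMap<String, String>;

/// Hypothetical policy: columns to mask, and roles that may see clear values.
pub struct MaskPolicy {
    pub masked_columns: HashSet<String>,
    pub exempt_roles: HashSet<String>,
}

/// Returns a copy of `rows` with masked columns replaced by "***",
/// unless the caller's role is exempt from the policy.
pub fn apply_mask(rows: &[Row], policy: &MaskPolicy, caller_role: &str) -> Vec<Row> {
    if policy.exempt_roles.contains(caller_role) {
        return rows.to_vec();
    }
    rows.iter()
        .map(|row| {
            row.iter()
                .map(|(column, value)| {
                    let shown = if policy.masked_columns.contains(column) {
                        "***".to_string()
                    } else {
                        value.clone()
                    };
                    (column.clone(), shown)
                })
                .collect::<Row>()
        })
        .collect()
}
```

Row-level filtering would follow the same pattern, dropping whole rows instead of rewriting individual values.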

AgentGateway (Basic Middleware)

AgentGateway typically provides basic security features like API key validation or rate limiting.

  • Limited Filtering: While it can intercept traffic, implementing complex, context-aware data masking usually requires writing custom middleware or handling it at the application level.

6. MCP Support: Gateway vs. Ecosystem

Light-Fabric (Integrated Tooling)

Light-Fabric treats Model Context Protocol (MCP) as a primary source for agent tools.

  • Direct Integration: Agents use the mcp-client to directly consume tools from MCP servers.
  • Registry Management: MCP servers are registered in the light-portal, allowing for centralized discovery and governance.
  • Unified Security: The same Fine-Grained Authorization rules apply to MCP tools as they do to native Rust tools.

AgentGateway (Specialized MCP Proxy)

AgentGateway provides a highly specialized MCP Gateway layer.

  • Protocol Translation: It excels at translating between different MCP transports (SSE, Streamable HTTP, etc.).
  • Exposing Servers: Its primary role is to make MCP servers accessible to external applications through a normalized gateway interface.
  • Advanced Networking: Includes features like stream merging and specialized MCP routing.

For a deep dive into the technical differences, see our Detailed MCP Feature Comparison.


Summary: Which to Choose?

Choose Light-Fabric if:

  • You are building an Enterprise AI Strategy that requires unified governance, stateful workflows, and integrated security.
  • You need to manage the entire lifecycle of agents and the business processes they participate in.
  • You require advanced data privacy (masking) and long-term memory (Hindsight) as native platform features.

Choose AgentGateway if:

  • You need a lightweight proxy to handle LLM provider failover and basic request normalization.
  • You prefer to manage agent logic, workflows, memory, and security entirely within your external application stack.
  • You are looking for a simple tool to solve immediate connectivity needs without implementing a comprehensive platform layer.

Detailed Comparison: MCP Gateway Features

This document provides a technical deep dive into the Model Context Protocol (MCP) implementations in Light-Fabric and AgentGateway.

Feature Matrix

| Feature | Light-Fabric | AgentGateway |
| --- | --- | --- |
| Primary Role | Provider/Gateway/Portal: Exposes MCP/API Servers. | Provider/Gateway: Exposes MCP servers. |
| Onboarding | Auto-Discovery: Automatic tools/list sync. | Manual: K8s CRD/Manifest configuration. |
| Data Privacy | Deep: Row/Column level masking. | Basic: Allow/Deny access control. |
| Transports | SSE, Streamable HTTP, WebSocket | SSE, Streamable HTTP, WebSocket |
| Legacy Integration | Native: REST/RPC to MCP transformation. | External: Manual wrappers required. |
| Authorization | Managed: Roles, Groups, Positions, Attributes. | Infrastructure: CEL-based policies. |
| Hot-Reloading | Native: Integrated Control Plane & Registry. | Infrastructure: Istio/xDS sync. |
| Authentication | JWT (End-to-End Propagation) | JWT, Keycloak, OIDC, Passthrough |
| Observability | Distributed Tracing (OTEL) and Integrated Hindsight Memory | Distributed Tracing (OTEL) |

1. Architectural Intent

AgentGateway: The Network Proxy Layer

AgentGateway is designed as a high-availability proxy for MCP servers. Its primary focus is the North-South traffic between an application and multiple MCP backends.

  • Multiplexing: Optimized for merging multiple MCP backends into a single upstream connection (mergestream.rs).
  • Protocol Translation: Excels at translating between SSE, Streamable HTTP, and WebSocket transports.
  • Infrastructure Focus: Operates as a Kubernetes-native component managed via manifests and standard networking policies.

Light-Fabric: The Managed Enterprise Platform

Light-Fabric provides a Unified Governance Fabric that treats AI agents and MCP tools as part of the broader enterprise API ecosystem.

  • Unified Gateway: The AI Gateway (Rust/Pingora-based) serves as a single entry point for UI, Agents, and Tools, supporting both MCP and traditional REST/RPC APIs.
  • Centralized Portal: Uses the Light-Portal as a control plane for onboarding (auto-discovery), configuration (hot-reloading), and security management.
  • Governed Intelligence: Integrates the gateway directly with Hindsight Memory and the Fine-Grained Rule Engine, ensuring that every tool call is governed by corporate compliance rules (e.g., row/column masking).
  • End-to-End Security: Maintains a single JWT-based identity from the user's chat interface all the way to the underlying MCP or API endpoint.

2. Security & Authorization

AgentGateway: Infrastructure-Aware RBAC

AgentGateway uses Common Expression Language (CEL) for its authorization policies.

  • Capabilities: High-speed, network-level blocking based on JWT claims and request headers.
  • Limitation: Lacks native support for content-aware data masking or organizational hierarchy logic.

Light-Fabric: Content-Aware Managed Auth

Light-Fabric provides a mature Fine-Grained Authorization layer:

  • Managed ABAC/PBAC: Supports Role, Group, Corporate Position (Hierarchy), and Attribute-based protection.
  • Data Privacy: Supports native Row and Column filtering (data masking), ensuring agents only see data they are authorized to process.
  • End-to-End JWT: The same JWT is propagated from the UI through the Agent to the AI Gateway and MCP tool.

3. Lifecycle & Tool Onboarding

AgentGateway: Configuration-Driven

Onboarding tools in AgentGateway is an infrastructure task:

  • Manual Mapping: Requires defining Kubernetes Custom Resources (HTTPRoute, Backend) to map MCP servers to the gateway.
  • Scope: Primarily focused on exposing existing MCP servers.

Light-Fabric: Registry-Driven

Light-Fabric provides a "Zero-Effort" onboarding experience via Light-Portal:

  • Auto-Discovery: Registering an MCP API triggers an automatic tools/list call to populate the registry.
  • Protocol Transformation: Automatically transforms existing OpenAPI/REST and RPC services into MCP tools without requiring wrappers.
  • Centralized Governance: All tools (Native, REST, MCP) are managed in a single unified registry.
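As a rough illustration of the protocol transformation idea, the sketch below maps one REST operation onto an MCP-style tool descriptor. The source does not show how Light-Fabric performs this mapping, so everything here is an assumption: `Operation`, `McpTool`, `to_mcp_tool`, and the field mapping (operationId to tool name, summary to description, required parameters as a stand-in for the tool's input schema) are hypothetical.

```rust
/// Hypothetical, simplified view of one OpenAPI operation.
pub struct Operation {
    pub operation_id: String,
    pub method: String,
    pub path: String,
    pub summary: String,
    pub required_params: Vec<String>,
}

/// Hypothetical tool descriptor. MCP tools carry a name, a description,
/// and an input schema; `required_inputs` is a simplified stand-in for the schema.
pub struct McpTool {
    pub name: String,
    pub description: String,
    pub required_inputs: Vec<String>,
    /// Backend call the gateway would make when the tool is invoked.
    pub backend: String,
}

/// Maps an operation to a tool descriptor using the assumed field mapping.
pub fn to_mcp_tool(op: &Operation) -> McpTool {
    McpTool {
        name: op.operation_id.clone(),
        description: op.summary.clone(),
        required_inputs: op.required_params.clone(),
        backend: format!("{} {}", op.method.to_uppercase(), op.path),
    }
}
```

In this sketch the existing service is untouched: the descriptor only records which backend call sits behind the tool name.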

4. Control Plane & Configuration

AgentGateway: Kubernetes-Native

  • Orchestration: Managed via the Istio/xDS control plane.
  • Updates: Configuration changes are applied via Kubernetes manifests (YAML).

Light-Fabric: Portal-Managed

  • Hot-Reloading: Uses a dedicated Config Server and Control Plane to update gateway and agent configurations in real-time without restarts.
  • Enterprise Management: Business-centric UI for managing tool visibility, agent permissions, and security policies.

5. Conclusion

  • Use AgentGateway if you are an infrastructure provider who needs to expose MCP-based tools to multiple external applications securely and reliably.
  • Use Light-Fabric if you are building intelligent agents that need to use those tools to solve complex business problems within a governed framework.

Why Light-Fabric Already Covers the MCP Gateway — No Second Gateway Required

This document addresses a recommendation (produced by Grok AI) suggesting that an enterprise should deploy the open-source AgentGateway as a dedicated MCP layer alongside an existing API platform. After performing a side-by-side source code analysis of both projects (see vs-agentgateway.md and vs-agent-gateway-mcp.md), we present the findings below.


1. The Recommendation Was Generated Without Knowledge of Light-Fabric

The Grok-produced analysis operates under a critical blind spot: it has no knowledge of Light-Fabric (Rust-based, open-sourced to customers) or its capabilities. The recommendation frames the choice as "keep your existing REST platform + add AgentGateway for MCP," because Grok only knows about publicly documented open-source projects. It does not account for the fact that:

  • Light-Fabric is already in production and serving agentic workloads today.
  • Every feature listed in the recommendation — MCP federation, tool discovery, protocol translation, security, and observability — has already been built, demonstrated, and validated with the project team.
  • The comparison is therefore not between "a REST framework" and "an MCP gateway." It is between two systems that both provide MCP gateway capabilities, where one (Light-Fabric/Light-Gateway) is already deployed and battle-tested in our environment.

2. Source Code Analysis: Light-Fabric Already Does What AgentGateway Does

We conducted a detailed, code-level comparison of both projects. The full results are documented in our High-Level Comparison and Detailed MCP Feature Comparison. The key findings are summarized below.

2.1 MCP Protocol Support

| Capability | Light-Fabric | AgentGateway |
| --- | --- | --- |
| Transports | SSE, Streamable HTTP, WebSocket | SSE, Streamable HTTP, WebSocket |
| Tool Discovery | Auto-discovery via tools/list sync | Manual K8s CRD configuration |
| Protocol Translation | Native REST/RPC → MCP transformation | Manual wrappers required |
| Stream Handling | Supported | Supported (mergestream) |

Both projects support the same MCP transports. Light-Fabric goes further with automatic tool discovery and native protocol transformation from existing REST/RPC APIs — exactly the "OpenAPI-to-MCP mapping" that the Grok recommendation credits to AgentGateway, except Light-Fabric does it without requiring a separate component.

2.2 Security & Authorization

| Capability | Light-Fabric | AgentGateway |
| --- | --- | --- |
| Authentication | JWT (end-to-end propagation) | JWT, Keycloak, OIDC, Passthrough |
| Authorization | Role, Group, Position, Attribute-based (ABAC/PBAC) | CEL-based policies |
| Data Privacy | Row/Column-level masking | Allow/Deny access control |
| Rule Engine | Integrated YAML-based, hot-reloadable | Basic middleware |

The Grok recommendation highlights "tool-level RBAC" and "MCP-compliant OAuth 2.1" as AgentGateway strengths. Our code analysis shows that Light-Fabric's authorization model is significantly deeper — it supports corporate-hierarchy-aware policies and content-level data masking that AgentGateway simply does not implement.

2.3 Lifecycle & Operations

| Capability | Light-Fabric | AgentGateway |
| --- | --- | --- |
| Onboarding | Portal-driven, auto-discovery | K8s manifest-driven, manual |
| Hot-Reloading | Native (Config Server + Control Plane) | Infrastructure-dependent (Istio/xDS) |
| Observability | OTEL + integrated Hindsight Memory | OTEL + OpenInference |
| Orchestration | Integrated hybrid workflows (deterministic + autonomous) | None (stateless proxy) |

Light-Fabric manages the entire lifecycle — from tool registration through governance to runtime orchestration — while AgentGateway only handles the proxy layer.


3. Two Gateways Is Overkill

The Grok recommendation frames the architecture as a "clean separation of concerns." In practice, deploying both Light-Fabric and AgentGateway creates redundant infrastructure with real costs:

Duplicated Capabilities

Both systems would be performing the same core functions:

  • Receiving MCP requests from agents
  • Translating tool calls to backend HTTP requests
  • Enforcing security policies on tool access
  • Providing observability for agentic traffic

Running two gateways that do the same thing is not "separation of concerns" — it is duplication of concerns. Every MCP request would traverse two proxy layers instead of one, adding latency and operational complexity for zero additional capability.

Operational Burden

  • Two deployment pipelines to maintain on EKS
  • Two sets of security policies to keep in sync
  • Two configuration surfaces (K8s CRDs for AgentGateway vs. Portal for Light-Fabric)
  • Two failure domains to monitor and troubleshoot
  • Two upgrade cycles to coordinate

The "No Code Changes" Claim Is Misleading

The Grok recommendation states AgentGateway requires "no code changes." This is true only if you ignore the work required to:

  • Write and maintain Kubernetes Custom Resources for every MCP backend
  • Build manual wrappers for non-MCP services (Light-Fabric does this natively)
  • Implement application-level logic for everything AgentGateway doesn't cover (stateful workflows, data masking, memory management)

Light-Fabric also requires no code changes to existing backend services — and it provides the governance layer out of the box.


4. Addressing the "Rust Performance" Argument

The recommendation claims AgentGateway has a "performance edge" due to its Rust data plane. This argument does not hold:

  • Light-Fabric's AI Gateway currently runs on the Java-based light-gateway, and a Rust-based AI Gateway built on the Pingora framework (Cloudflare's production proxy engine) is under way. The existing Java gateway already delivers high throughput, and the Rust gateway will remove the JVM from the critical path entirely.
  • Once the Rust gateway ships, both systems benefit from Rust's zero-cost abstractions, memory safety, and lack of garbage collection pauses.
  • The performance difference between the two Rust implementations would be marginal and workload-dependent, so it is not a differentiator.

5. Addressing the "Custom Development" Concern

The recommendation warns against "implementing MCP directly" because it "involves significant custom development." This concern does not apply:

  • Light-Fabric's MCP support is not custom development — it is a fully implemented, production-ready feature of the platform.
  • The MCP client, gateway routing, tool registry, and security integration are all existing, tested components, not a backlog of work to be done.
  • The project team has already seen these features demonstrated end-to-end.

6. Summary

| Concern from Grok Recommendation | Reality |
| --- | --- |
| "Light4j is a REST framework, not an AI proxy" | Light-Fabric is a full agentic platform with an AI Gateway already in production |
| "AgentGateway provides MCP federation and tool discovery" | Light-Fabric provides the same capabilities with deeper governance |
| "Rust performance advantage over JVM" | Light-Fabric's Java gateway is already very fast, and a Rust (Pingora-based) gateway is coming |
| "Clean separation of concerns" | Two gateways doing the same thing is duplication, not separation |
| "No code changes required" | True for both, but AgentGateway requires extensive K8s manifest management |
| "Custom MCP implementation is risky" | Light-Fabric's MCP support is already built, tested, and in production |

Conclusion

The Grok-generated recommendation is well-structured but fundamentally flawed because it was produced without knowledge of Light-Fabric's capabilities. When evaluated against the actual source code and production state of both systems, the case for adding AgentGateway collapses:

  • Light-Fabric already provides every MCP gateway capability that AgentGateway offers.
  • Light-Fabric goes significantly further with integrated governance, data privacy, memory, and orchestration.
  • Adding a second gateway introduces operational complexity and latency with no net-new capability.

The pragmatic, low-risk path is to continue with the platform that is already built, already in production, and already proven to the team.