Light-Fabric
Light-Fabric is a high-performance, unified platform for managing the lifecycle, governance, and orchestration of enterprise AI services including agentic services, agents, tools, skills, memories, MCP servers, APIs, gateways and workflows.
Why Light-Fabric?
We chose the name Light-Fabric because it embodies the "Unified Governance" required for enterprise-grade AI:
- Unified Control Plane: Light-Fabric provides a single point of truth for discovering, governing, and auditing agents, MCP servers, and APIs via the light-portal.
- Enterprise Governance: It prioritizes security and policy enforcement (such as fine-grained authorization) over pure decentralized autonomy, making it safe for corporate environments.
- Integrated Ecosystem: It "weaves" together distributed components—from memory units (Hindsight) to centralized skills—into a cohesive, observable system.
- Durable Identity: The name emphasizes the platform's role as the infrastructure foundation, remaining relevant regardless of the underlying implementation details.
Technical Advantages
By building Light-Fabric on a Rust foundation, we achieve:
- Performance: Built on top of tokio and axum for maximum throughput and memory safety.
- Native Intelligence: Specialized crates for Hindsight memory, tool calling, and workflow orchestration.
- Production Ready: Includes robust features like retries, failover, and observability out of the box.
Core Components
Light-Fabric is composed of modular crates, infrastructure frameworks, and reference applications:
Crates
- crates/model-provider: A unified interface for multiple LLM providers (Ollama, etc.).
- crates/hindsight-client: Client for the Hindsight biomimetic memory system.
- crates/mcp-client: Implementation of the Model Context Protocol (MCP) for tool discovery and execution.
- crates/portal-registry: Integration with the Light-Portal for service registration and discovery.
- crates/light-runtime: Core runtime foundation for building agentic and microservice components.
- crates/light-rule: High-performance rule engine for fine-grained authorization and data filtering.
- crates/workflow-core & workflow-builder: Core engine and builder for complex agentic workflows.
- crates/config-loader: Flexible configuration management for enterprise environments.
- crates/asymmetric-decryptor & symmetric-decryptor: Security utilities for sensitive data handling.
Frameworks
- frameworks/light-axum: A specialized microservice & agentic framework built on top of the Axum web ecosystem.
- frameworks/light-pingora: High-performance proxy and gateway framework built on top of Cloudflare's Pingora.
Applications
- apps/light-agent: A managed AI agent capable of using tools, accessing memory, and executing complex tasks.
- apps/light-gateway: An enterprise-grade gateway for securing and governing API and agent traffic.
- apps/light-workflow: A service for orchestrating and executing long-running agentic workflows.
Getting Started with Light-Fabric
This guide will help you set up a local development environment for Light-Fabric, including the AI Gateway, Agent Engine, and the management Portal.
Prerequisites
- Rust: Latest stable version.
- Docker: For running database and backend services.
- Node.js: For running the portal-view UI.
- Git: To clone the necessary repositories.
Local Development Setup
To run the entire ecosystem locally, we use the portal-config-loc and service-asset repositories to manage configuration and pre-built assets.
1. Initialize Workspace
Create a unified workspace directory (e.g., ~/lightapi) and clone the core management repositories:
cd ~
mkdir -p lightapi
cd lightapi
# Clone configuration and assets
git clone git@github.com:lightapi/portal-config-loc.git
git clone git@github.com:lightapi/service-asset.git
2. Deploy Local Services
Light-Fabric services are orchestrated via Docker Compose scripts in portal-config-loc. The following command starts the PostgreSQL database and the core services (including the Rust-based components):
cd ~/lightapi/portal-config-loc
./scripts/deploy-local.sh pg rust
3. Import Initial Data
Use the importer script in service-asset to populate the local database with initial events, users, and configurations:
cd ~/lightapi/service-asset
./importer.sh -f events.json
4. Update /etc/hosts
The platform uses virtual hosts for local routing. Add the following entry to your /etc/hosts file (replace 127.0.0.1 with your actual local IP if necessary):
127.0.0.1 local.lightapi.net locsignin.lightapi.net
Running the Management Portal
The Light-Portal provides a unified UI for onboarding MCP servers, configuring AI Gateways, and interacting with agents.
cd ~/lightapi
git clone git@github.com:lightapi/portal-view.git
cd portal-view
npm install
npm run dev
Navigate to https://localhost:3000 and log in with your developer credentials.
Cloud Development (Coming Soon)
We are currently preparing a Cloud Development Server. This will allow developers to:
- Connect to a shared, high-performance AI Gateway.
- Onboard and test MCP servers without a full local installation.
- Collaborate on shared agentic workflows and Hindsight memory banks.
Stay tuned for the connection details and onboarding guide for the cloud environment.
Contributing to Light-Fabric
If you are developing for the Rust crates specifically:
cd ~/lightapi
git clone git@github.com:networknt/light-fabric.git
cd light-fabric
cargo build
Model Providers
Light-Fabric provides a unified, high-performance interface for interacting with diverse Large Language Model (LLM) providers. This abstraction is centered around the Provider trait, allowing applications to remain model-agnostic while leveraging advanced capabilities like native tool calling and prompt caching.
The Provider Trait
All model integrations implement the Provider trait, which supports:
- One-shot and Multi-turn Chat: Simplified APIs for simple prompts and full conversation histories.
- Structured Tool Calling: Native integration for function calling (OpenAI-style).
- Capabilities Detection: Programmatic checks for vision, native tool support, and prompt caching.
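The shape of such a trait can be sketched as follows. This is a minimal, synchronous illustration only: the names `Capabilities`, `Message`, `chat`, `complete`, and `EchoProvider` are assumptions made for this example, and the real Provider trait in crates/model-provider is likely asynchronous and may use different signatures.

```rust
/// Feature flags a provider can report (field names are illustrative).
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct Capabilities {
    pub vision: bool,
    pub native_tools: bool,
    pub prompt_caching: bool,
}

/// One turn of a conversation.
#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub role: String,
    pub content: String,
}

/// Hypothetical, synchronous stand-in for the Provider trait.
pub trait Provider {
    fn name(&self) -> &str;
    /// Capability detection: callers check this before using a feature.
    fn capabilities(&self) -> Capabilities;
    /// Multi-turn chat over a full conversation history.
    fn chat(&self, history: &[Message]) -> Result<String, String>;
    /// One-shot convenience wrapper built on `chat`.
    fn complete(&self, prompt: &str) -> Result<String, String> {
        let msg = Message { role: "user".to_string(), content: prompt.to_string() };
        self.chat(&[msg])
    }
}

/// Toy provider that echoes the last message; stands in for a real backend.
pub struct EchoProvider;

impl Provider for EchoProvider {
    fn name(&self) -> &str {
        "echo"
    }
    fn capabilities(&self) -> Capabilities {
        Capabilities { native_tools: true, ..Capabilities::default() }
    }
    fn chat(&self, history: &[Message]) -> Result<String, String> {
        history
            .last()
            .map(|m| format!("echo: {}", m.content))
            .ok_or_else(|| "empty history".to_string())
    }
}
```

Application code written against the trait stays the same when the concrete provider changes, which is the model-agnostic property described above.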
Supported Cloud Providers
Light-Fabric supports the major LLM providers listed below. Because the Provider trait is model-agnostic, the framework is designed to work with new flagship releases as they become available.
- OpenAI: Native support for the GPT-5 series (5.4, mini, nano), the o4 reasoning models, and full legacy support for GPT-4o and GPT-4 Turbo.
- Anthropic: Support for the Claude 4 generation, including Opus 4.7, Sonnet, and Haiku.
- Google Gemini: Support for Gemini 3.1 Pro and Flash, leveraging Vertex AI or AI Studio for multi-modal and long-context tasks.
- Azure OpenAI: Enterprise-grade OpenAI deployments with support for the latest model deployments.
- AWS Bedrock: Access to the latest Claude and Titan models hosted on Amazon Web Services.
- OpenRouter: Access to hundreds of open-source and proprietary models via a single unified API.
- Telnyx: Support for models hosted on the Telnyx platform.
- GLM (Zhipu AI): Support for the ChatGLM/GLM-5 series of models.
Local & Specialized Providers
- Ollama: Seamless integration with local models running on your machine.
- OpenAI-Compatible: A generic CompatibleProvider for any service implementing the OpenAI REST API.
- GitHub Copilot: Integration with GitHub Copilot Chat for developer-centric workflows.
Meta-Providers (Orchestration)
These providers wrap other providers to add resilient or intelligent behavior:
- ReliableProvider: Enhances any base provider with retries, exponential backoff, and automatic failover to fallback models.
- RouterProvider: Dynamically routes requests to different models based on hints or input complexity.
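The retry and failover behavior of a wrapper like ReliableProvider can be sketched as below. This is an illustration under assumptions, not the crate's implementation: the one-method `Provider` trait, the `chain`, `max_retries`, and `base_delay` fields, the 5-second cap, and the `AlwaysFails` / `AlwaysOk` test doubles are all hypothetical.

```rust
use std::time::Duration;

/// Minimal stand-in for a provider: one fallible call.
pub trait Provider {
    fn complete(&self, prompt: &str) -> Result<String, String>;
}

/// Delay before retry `attempt` (0-based): base * 2^attempt, capped at `max`.
pub fn backoff_delay(base: Duration, attempt: u32, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(max)
}

/// Hypothetical wrapper: retry each provider, then fail over to the next one.
pub struct ReliableProvider {
    pub chain: Vec<Box<dyn Provider>>, // primary first, fallbacks after
    pub max_retries: u32,
    pub base_delay: Duration,
}

impl Provider for ReliableProvider {
    fn complete(&self, prompt: &str) -> Result<String, String> {
        let mut last_err = "no providers configured".to_string();
        for provider in &self.chain {
            for attempt in 0..=self.max_retries {
                match provider.complete(prompt) {
                    Ok(text) => return Ok(text),
                    Err(e) => last_err = e,
                }
                // Sleep only if another attempt on this provider follows.
                if attempt < self.max_retries {
                    let cap = Duration::from_secs(5);
                    std::thread::sleep(backoff_delay(self.base_delay, attempt, cap));
                }
            }
        }
        Err(last_err)
    }
}

/// Test double that always fails.
pub struct AlwaysFails;
impl Provider for AlwaysFails {
    fn complete(&self, _prompt: &str) -> Result<String, String> {
        Err("down".to_string())
    }
}

/// Test double that always succeeds.
pub struct AlwaysOk;
impl Provider for AlwaysOk {
    fn complete(&self, prompt: &str) -> Result<String, String> {
        Ok(format!("ok: {}", prompt))
    }
}
```

Because the wrapper implements the same trait as the providers it wraps, callers do not need to know whether retries or failover are in play.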
CLI & Tooling Integrations
Light-Fabric includes specialized integrations for developer tools and terminal environments:
- Claude Code CLI: Integration with Anthropic's Claude Code environment.
- Gemini CLI: Terminal-based access to Google's Gemini models.
- KiloCLI: Light-Fabric's native CLI integration for rapid testing and automation.
Key Capabilities
Providers can be queried for their support of advanced features:
- Native Tool Calling: Efficiently generate structured function calls.
- Vision: Process images alongside text prompts.
- Prompt Caching: Leverage provider-side caching to reduce latency and costs for long contexts.
Agentic Workflow Design
Hybrid Agentic Workflow Specification
Agentic Workflow in Light-Fabric implements a hybrid orchestration model for enterprise business processes. The workflow is deterministic, auditable, and stateful, while selected steps can be executed by agents, API calls, rule engine checks, or humans.
The design goal is not to replace enterprise process control with an open-ended agent loop. The goal is to let agents work inside a managed process that has clear state, clear ownership, repeatable execution, and human approval where needed.
Enterprise Challenge
In regulated or operationally sensitive environments, a purely autonomous AI agent is not enough for long-running business work.
- Compliance requires deterministic process paths, approval records, and audit history.
- Reliability requires long-running state to survive process restarts, UI disconnects, and agent failures.
- Safety requires human-in-the-loop checkpoints for decisions with business, security, or financial impact.
- Coordination requires multiple humans and roles to participate in the same process.
- Testing requires the same workflow to run interactively with humans or headlessly with example data.
Light-Fabric solves this by separating orchestration from execution.
Hybrid Model
The workflow is the deterministic process manager. It defines the ordered steps, conditions, retries, error handling, human checkpoints, and outputs.
Agents are workers inside that process. They can reason, call tools, ask for missing data, and use skills, but they do not own the overall process state.
| Feature | Traditional Workflow | Pure Agent Loop | Light-Fabric Hybrid |
|---|---|---|---|
| Path | Fixed | Dynamic | Fixed path with flexible task execution |
| State | Durable | Often transient | Durable workflow and task state |
| Human input | Forms and approvals | Ad hoc chat | First-class waiting tasks |
| Audit | Strong | Weak | Step-level audit and agent trace |
| API calls | Built into code | Tool calls | Spec-described endpoint invocations |
| Testing | Separate test harness | Prompt replay | Same workflow can run live tests |
Core Separation
There are two related specifications:
- Agentic Workflow Specification: describes orchestration, including task order, branching, human input, assertions, API calls, retries, errors, exports, and state transitions.
- LightAPI Description Specification: describes API capabilities at the endpoint level, including how an endpoint is invoked, what inputs it accepts, what result shape it returns, examples, behavior notes, and result expectations.
This separation is important. The workflow should not duplicate every endpoint contract. It should reference endpoint descriptions and use them to invoke calls, guide agents, and verify results.
Endpoint-Level Consumption
Light-Portal manages API descriptions at the endpoint level, not only at the whole API level.
This is necessary because real workflows often combine one endpoint from one API with one endpoint from another API. For example, onboarding an API to an AI gateway may involve:
- register an API
- create an API version from a specification
- create a development API instance
- configure the API through config server
- link the API instance to a gateway instance
- select endpoints to expose as MCP tools
- create a gateway config snapshot
- reload the gateway through controller
- run MCP tests against the gateway
Each step may come from a different API surface. The workflow consumes only the endpoints it needs.
The recommended model is:
- API-level descriptions can be authored for convenience and consistency.
- Endpoint-level descriptions are published and consumed by agents and workflows.
- Endpoint descriptions inherit shared context such as authentication, environments, sources, and secrets from an API catalog.
- Agents progressively load endpoint information by disclosure level instead of receiving the entire catalog up front.
Progressive Disclosure
Endpoint descriptions should be disclosed to agents in layers:
- index: operation id, title, tags, visibility
- summary: purpose, capability group, lifecycle
- invocation: input shape, request mapping, auth, examples
- behavior: result cases, errors, edge cases, assertions
- full: complete description for debugging or generation
This allows the agent to discover capabilities cheaply, load invocation details only for selected endpoints, and load behavior details only when verification or failure analysis needs it.
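The layered disclosure above can be modeled as an ordered enum. The sketch below assumes the levels are cumulative (each level includes everything below it), which the text implies but does not state; the `Disclosure` type and the function names are hypothetical.

```rust
/// Disclosure levels, ordered from cheapest to most detailed.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Disclosure {
    Index,
    Summary,
    Invocation,
    Behavior,
    Full,
}

const LEVELS: [Disclosure; 5] = [
    Disclosure::Index,
    Disclosure::Summary,
    Disclosure::Invocation,
    Disclosure::Behavior,
    Disclosure::Full,
];

/// Fields introduced at each level, taken from the list above.
fn fields_at(level: Disclosure) -> &'static [&'static str] {
    match level {
        Disclosure::Index => &["operation id", "title", "tags", "visibility"],
        Disclosure::Summary => &["purpose", "capability group", "lifecycle"],
        Disclosure::Invocation => &["input shape", "request mapping", "auth", "examples"],
        Disclosure::Behavior => &["result cases", "errors", "edge cases", "assertions"],
        Disclosure::Full => &["complete description"],
    }
}

/// Everything an agent may see at `level`: that level plus all cheaper ones.
pub fn visible_fields(level: Disclosure) -> Vec<&'static str> {
    LEVELS
        .iter()
        .copied()
        .filter(|l| *l <= level)
        .flat_map(|l| fields_at(l).iter().copied())
        .collect()
}
```

An agent browsing the catalog would request `Disclosure::Index` for every endpoint and escalate to `Invocation` or `Behavior` only for the few endpoints it selects.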
Workflow Task Types
The updated workflow specification adds first-class support for the task types needed by agentic API workflows.
Ask Task
ask pauses the workflow and waits for human input. It supports prompts, choices, validation, defaults, timeouts, and sensitive input.
The task returns the user's answer as task output. The normal export block should move the answer into workflow context.
Example:
- ask-authz:
    ask:
      prompt: Do you want to configure endpoint authorization?
      mode: choice
      options:
        - label: Configure authorization
          value: configure
        - label: Skip
          value: skip
    export:
      as:
        authzChoice: ${ .result }
Assert Task
assert validates workflow state or API results. It is used for both live tests and interactive workflows.
It supports simple comparisons, JSONPath-style checks, length checks, regex checks, and rule-engine-backed assertions for complex business logic.
Assertion failures should produce structured, catchable errors so workflows can route failures to remediation, task creation, or agent investigation. Complex business assertions can delegate to Light-Rule.
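A structured, catchable assertion failure might look like the sketch below. It covers only a few of the check kinds named above (regex, JSONPath, and rule-engine checks are left out to keep the example std-only), and the `Check`, `AssertionError`, and `run_assert` names are assumptions for illustration rather than the workflow-core types.

```rust
/// A few of the check kinds an assert task could carry.
#[derive(Debug, Clone, PartialEq)]
pub enum Check {
    Equals { actual: String, expected: String },
    LengthAtLeast { actual_len: usize, min: usize },
    Contains { actual: String, needle: String },
}

/// Structured failure a workflow can catch and route (fields are illustrative).
#[derive(Debug, Clone, PartialEq)]
pub struct AssertionError {
    pub task: String,
    pub check_index: usize,
    pub message: String,
}

/// Runs checks in order and stops at the first failure.
pub fn run_assert(task: &str, checks: &[Check]) -> Result<(), AssertionError> {
    for (i, check) in checks.iter().enumerate() {
        let failure = match check {
            Check::Equals { actual, expected } if actual != expected => {
                Some(format!("expected {:?}, got {:?}", expected, actual))
            }
            Check::LengthAtLeast { actual_len, min } if actual_len < min => {
                Some(format!("length {} is below minimum {}", actual_len, min))
            }
            Check::Contains { actual, needle } if !actual.contains(needle.as_str()) => {
                Some(format!("{:?} does not contain {:?}", actual, needle))
            }
            _ => None,
        };
        if let Some(message) = failure {
            return Err(AssertionError { task: task.to_string(), check_index: i, message });
        }
    }
    Ok(())
}
```

Returning the task name and the index of the failing check gives a remediation step or an investigating agent enough context to act without re-running the workflow.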
API Call Tasks
The workflow supports direct and description-backed API calls:
- HTTP / OpenAPI
- JSON-RPC
- OpenRPC
- gRPC
- MCP tool/resource/prompt calls
For direct internal calls, jsonrpc can be used with an endpoint, method, params, id, notification flag, and error policy.
For cataloged JSON-RPC, openrpc references an OpenRPC document and method.
For MCP, the workflow references a tool, resource, or prompt and passes arguments. MCP capability descriptions belong in the API description layer; the workflow only selects and invokes them.
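For the direct jsonrpc case, the method, params, id, and notification flag map onto a JSON-RPC 2.0 request envelope. The sketch below builds that envelope by hand to stay std-only; the `jsonrpc_request` helper is hypothetical, and a real runtime would use a JSON library rather than string formatting.

```rust
/// Escapes backslashes and double quotes so `s` can sit inside a JSON string.
fn escape(s: &str) -> String {
    s.replace('\\', "\\\\").replace('"', "\\\"")
}

/// Builds a JSON-RPC 2.0 request body.
/// `params_json` must already be valid JSON (an object or an array).
/// `id == None` produces a notification: no `id` member, so no response is expected.
pub fn jsonrpc_request(method: &str, params_json: &str, id: Option<u64>) -> String {
    let method = escape(method);
    match id {
        Some(id) => format!(
            r#"{{"jsonrpc":"2.0","method":"{}","params":{},"id":{}}}"#,
            method, params_json, id
        ),
        None => format!(
            r#"{{"jsonrpc":"2.0","method":"{}","params":{}}}"#,
            method, params_json
        ),
    }
}
```

The error policy mentioned above is not shown here; it would govern how the runtime treats a response carrying an `error` member instead of a `result`.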
Explanation Metadata
Tasks can include explain metadata to help an agent or UI explain what is happening.
Useful fields include:
- purpose
- visible
- before
- success
- failure
- requires
Example:
explain:
  purpose: Link the API instance to the development gateway.
  visible: true
  requires:
    - portal-command-token authentication
    - apiInstanceId from prior step
Human Task State
Human-in-the-loop behavior must be represented as durable workflow state.
Recommended task states:
A = active
W = waiting for input
C = completed
F = failed
X = canceled
When an ask or approval task reaches W, the process remains active but the task is no longer picked up by the executor. A user, CLI, scheduler, or agent must complete the task through the workflow API.
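The state codes and the rule that waiting tasks are skipped by the executor can be sketched as a small state machine. The transition table is an assumption for illustration (for example, that a waiting task may fail on timeout); the source only fixes the five codes and the executor rule.

```rust
/// Task states with the single-letter codes listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskState {
    Active,
    Waiting,
    Completed,
    Failed,
    Canceled,
}

impl TaskState {
    pub fn code(self) -> char {
        match self {
            TaskState::Active => 'A',
            TaskState::Waiting => 'W',
            TaskState::Completed => 'C',
            TaskState::Failed => 'F',
            TaskState::Canceled => 'X',
        }
    }

    pub fn from_code(c: char) -> Option<TaskState> {
        match c {
            'A' => Some(TaskState::Active),
            'W' => Some(TaskState::Waiting),
            'C' => Some(TaskState::Completed),
            'F' => Some(TaskState::Failed),
            'X' => Some(TaskState::Canceled),
            _ => None,
        }
    }

    /// Only active tasks are picked up by the executor.
    pub fn executor_may_claim(self) -> bool {
        self == TaskState::Active
    }

    /// Assumed transition table: C, F, and X are terminal.
    pub fn can_transition_to(self, next: TaskState) -> bool {
        match (self, next) {
            (TaskState::Active, TaskState::Waiting)
            | (TaskState::Active, TaskState::Completed)
            | (TaskState::Active, TaskState::Failed)
            | (TaskState::Active, TaskState::Canceled)
            | (TaskState::Waiting, TaskState::Completed)
            | (TaskState::Waiting, TaskState::Failed)
            | (TaskState::Waiting, TaskState::Canceled) => true,
            _ => false,
        }
    }
}
```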
Waiting tasks should carry:
- prompt
- input mode
- options
- validation rules
- default value
- sensitive flag
- assignment metadata
- explanation metadata
- timeout policy
Assignment And Worklist
Enterprise workflows need more than chat. Some tasks must be assigned to roles or users and coordinated across multiple humans.
Human tasks should support:
- assigned user
- assigned role
- candidate roles
- claimed by
- claimed timestamp
- due timestamp
- priority
- comments
- audit trail
A role-based task appears in the worklist for users with a matching role. Once claimed, it belongs to the claiming user until completed, released, delegated, or timed out.
Client Architecture
light-workflow should run as a containerized backend service alongside other portal services. It owns workflow execution and state. Portal chat, worklist, CLI, scheduler, and agents are all clients of the same workflow APIs.
The client surfaces are:
- Portal Chat: conversational guidance for a single user.
- Worklist: role-based task inbox for approvals, reviews, and coordination.
- CLI: developer, CI/CD, live test, and automation interface.
- Scheduler: periodic headless execution, such as hourly live integration tests.
- Agent: task executor that can call APIs, use skills, and report results back to the workflow.
See Workflow Client Architecture for the dedicated client design.
Workflow Service API
The workflow service should expose one stable API boundary for all clients.
Core operations:
workflow.start
workflow.getInstance
workflow.listInstances
workflow.getEvents
workflow.listTasks
workflow.getTask
workflow.claimTask
workflow.releaseTask
workflow.completeTask
workflow.delegateTask
workflow.cancelInstance
Streaming clients should subscribe to workflow events through Server-Sent Events, WebSocket, or another portal-standard event mechanism.
Important event types:
- workflow started
- task started
- task completed
- task failed
- task waiting for input
- task assigned
- task claimed
- task completed by human
- agent started
- agent completed
- workflow completed
- workflow failed
Live Testing
The same workflow runtime should support interactive runs and headless live tests.
Interactive workflows use ask tasks when decisions or missing values are needed.
Live tests should use example data from LightAPI endpoint descriptions and workflow input fixtures instead of asking the user. Assertions should verify results through assert tasks or rule-engine checks.
This lets the scheduler run workflows every hour against the latest deployed services. When a test fails, the workflow can create a task with the failure detail and assign an agent or human to investigate.
Example: API Onboarding To AI Gateway
An API onboarding workflow can guide a user through a complex multi-endpoint process without requiring a dedicated UI for every operation.
The workflow can:
- ask for or infer the API metadata
- call the register API endpoint
- create an API version from an OpenAPI specification
- create a development API instance
- configure the API
- ask whether fine-grained authorization should be configured
- route to create or select authorization rules
- link the API instance to the development AI gateway
- select endpoints to expose as MCP tools
- create a gateway config snapshot
- reload the gateway through controller
- run MCP tests through the gateway
- assert expected results
- report success or create remediation tasks
The same workflow can run interactively through portal chat, be managed through the worklist, or run headlessly with examples as a live test.
Technical Implementation
The Light-Fabric implementation is split across:
- workflow-core: Rust models for the workflow specification.
- workflow-builder: fluent builders for programmatic workflow construction.
- light-workflow: runtime service and executor.
- light-agent: agent execution surface for delegated agent tasks.
- light-rule: rule engine used by workflow and assertion tasks. See Light-Rule Design.
Runtime responsibilities include:
- deserializing workflow definitions
- claiming active tasks
- executing supported task types
- storing task output
- applying exports into process context
- creating next tasks
- pausing waiting tasks
- resuming after human completion
- failing or completing process instances
- exposing workflow APIs to clients
The current executable slice supports API invocation and verification tasks such as HTTP, JSON-RPC, OpenRPC, MCP over enterprise HTTP transports, rules, assertions, and waiting human input. MCP stdio transport is intentionally not a priority for enterprise deployment.
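Several of the runtime responsibilities listed above (executing active tasks, applying exports into process context, pausing on waiting tasks, and resuming after human completion) can be illustrated with an in-memory sketch. Everything here is hypothetical: the `Process`, `Task`, and `Kind` types, the `register_api` and `link_gateway` demo functions, and the linear task list stand in for the persisted, branching model of the real light-workflow service.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum State {
    Active,
    Waiting,
    Completed,
}

/// A task either calls a function with the process context or asks a human.
pub enum Kind {
    Call(fn(&HashMap<String, String>) -> String),
    Ask,
}

pub struct Task {
    pub kind: Kind,
    pub export_as: &'static str, // context key that receives the task output
    pub state: State,
}

pub struct Process {
    pub tasks: Vec<Task>,
    pub context: HashMap<String, String>,
}

impl Process {
    /// Runs tasks in order until one waits; returns true when all are complete.
    pub fn run(&mut self) -> bool {
        for task in self.tasks.iter_mut() {
            match task.state {
                State::Completed => continue,
                State::Waiting => return false,
                State::Active => {}
            }
            match task.kind {
                Kind::Ask => {
                    // Pause: the executor stops here until a human answers.
                    task.state = State::Waiting;
                    return false;
                }
                Kind::Call(f) => {
                    let output = f(&self.context);
                    self.context.insert(task.export_as.to_string(), output);
                    task.state = State::Completed;
                }
            }
        }
        true
    }

    /// Stores the human answer for the waiting task, then resumes execution.
    pub fn complete_waiting(&mut self, answer: &str) -> bool {
        match self.tasks.iter().position(|t| t.state == State::Waiting) {
            Some(i) => {
                let key = self.tasks[i].export_as;
                self.context.insert(key.to_string(), answer.to_string());
                self.tasks[i].state = State::Completed;
                self.run()
            }
            None => false,
        }
    }
}

/// Demo task: pretends to register an API and returns its id.
pub fn register_api(_ctx: &HashMap<String, String>) -> String {
    "api-1".to_string()
}

/// Demo task: uses values exported by earlier tasks.
pub fn link_gateway(ctx: &HashMap<String, String>) -> String {
    format!("linked {} authz={}", ctx["apiId"], ctx["authzChoice"])
}
```

The key property is that `run` is safe to call repeatedly: a waiting task blocks progress until `complete_waiting` supplies an answer, no matter which client triggers the call.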
Design Rule
There must be one workflow runtime and one task state model.
Chat, worklist, CLI, scheduler, and agents should never implement their own workflow execution. They should all use the same light-workflow service APIs.
This keeps enterprise workflow behavior auditable, testable, and consistent regardless of how a process is started, resumed, or observed.
Workflow Client Architecture
Light-Fabric workflow execution should run as a containerized backend service, not as logic embedded in a portal screen, CLI, scheduler, or agent. The workflow service owns process state, task state, audit records, API invocation, agent invocation, and human-in-the-loop transitions. Clients are thin interaction surfaces over the same service APIs.
This separation lets the same workflow instance be driven by a portal chat session, a worklist user, a CLI command, a scheduler, or an AI agent without creating multiple execution models.
Goals
- Provide one authoritative workflow runtime for long-running enterprise processes.
- Support human-in-the-loop tasks from both conversational and worklist interfaces.
- Support headless execution for live tests, scheduled runs, and CI/CD.
- Keep all clients stateless or lightly stateful; workflow state lives in light-workflow.
- Make role assignment, audit, and retry behavior consistent across UI, CLI, scheduler, and agent use.
Runtime Service
light-workflow should be deployed as a portal service in a container alongside the other portal services. It should expose APIs for workflow definitions, workflow instances, task claiming, task completion, event streaming, and operational control.
The service is responsible for:
- loading workflow definitions
- starting workflow instances
- persisting process_info_t and task_info_t
- executing API calls and assertions
- invoking agents for agent-owned tasks
- pausing on ask and approval tasks
- assigning human tasks to users or roles
- resuming workflows when a human answer is submitted
- emitting workflow and task events
- recording audit history
Clients should never execute workflow steps themselves. They should only start workflows, inspect workflow state, and complete assigned tasks.
Client Surfaces
Portal Chat
The portal chat client is the guided conversational interface for a single user working through a process. It is useful when the workflow needs to ask clarifying questions, explain the next action, or guide a user through a complex multi-endpoint operation.
Typical uses:
- API onboarding
- API endpoint publication to an AI gateway
- guided configuration
- troubleshooting and remediation workflows
- interactive approval with explanation
The chat client should call the workflow service for current state and submit answers to waiting tasks. It may stream workflow events and render agent explanations, but it should not own workflow state.
Worklist
The worklist is the enterprise task inbox. It is the right interface for multi-user coordination, role-based assignment, approvals, escalations, and audit-sensitive operations.
Typical uses:
- approval tasks
- compliance review
- operations handoff
- role-based queue processing
- task claim and release
- delegated work
- due-date and priority management
The worklist should be built around waiting human tasks. A task may have:
- assigned user
- candidate roles
- assigned role
- priority
- due time
- claim status
- comments
- completion payload
- audit trail
The worklist is especially important because many enterprise workflows are not purely conversational. They need accountable ownership and coordination between multiple humans.
CLI
The CLI is a developer and automation client. It should use the same workflow service APIs as portal-view and should not contain separate execution logic.
Typical uses:
- local workflow testing
- live parity tests
- CI/CD automation
- scheduled headless runs
- debugging stuck workflow instances
- submitting test data
- completing simple waiting tasks from scripts
Example commands:
light-workflow start portal.onboard-api --input input.yaml
light-workflow status <instance-id>
light-workflow tasks --role portal-admin
light-workflow claim <task-id>
light-workflow answer <task-id> --value approve
light-workflow logs <instance-id>
light-workflow cancel <instance-id>
The CLI should be added after the workflow APIs stabilize. It will be valuable for developers and automation, but the worklist and portal chat should drive the primary enterprise UX.
API Boundary
The workflow service should expose a stable API boundary that all clients use. The API can be HTTP, JSON-RPC, or both, but the concepts should remain the same.
Core operations:
workflow.start
workflow.getInstance
workflow.listInstances
workflow.getEvents
workflow.listTasks
workflow.getTask
workflow.claimTask
workflow.releaseTask
workflow.completeTask
workflow.delegateTask
workflow.cancelInstance
For streaming clients, the service should expose workflow events through Server-Sent Events, WebSocket, or another portal-standard event mechanism.
Important event types:
- workflow started
- task started
- task completed
- task failed
- task waiting for input
- task assigned
- task claimed
- task completed by human
- agent started
- agent completed
- workflow completed
- workflow failed
Human Task State
ask and approval-style tasks should enter a waiting state. While waiting, the workflow instance remains active, but the task is no longer executable by the worker loop until a human answer is submitted.
Recommended states:
A = active
W = waiting for input
C = completed
F = failed
X = canceled
The waiting task should include enough metadata for all clients:
- prompt
- input mode
- options
- validation rules
- default value
- sensitivity flag
- assignment metadata
- explanation metadata
- timeout policy
The completion API should validate submitted input against the task definition before resuming the workflow.
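Validation at completion time can be sketched as a pure function over the waiting task's metadata. The `AskTask` fields mirror the metadata list above, but the type itself, the `max_len` rule, and the fallback to the default value when no answer is submitted are assumptions for illustration.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum InputMode {
    Text,
    Choice,
}

/// Subset of the waiting-task metadata needed for validation.
pub struct AskTask {
    pub prompt: String,
    pub mode: InputMode,
    pub options: Vec<String>,    // allowed values when mode is Choice
    pub default: Option<String>, // used when nothing is submitted
    pub max_len: usize,          // example validation rule for Text mode
}

/// Returns the accepted value, or a message explaining the rejection.
pub fn validate_answer(task: &AskTask, submitted: Option<&str>) -> Result<String, String> {
    let value = match submitted.or(task.default.as_deref()) {
        Some(v) => v,
        None => return Err("no answer submitted and no default defined".to_string()),
    };
    match task.mode {
        InputMode::Choice if !task.options.iter().any(|o| o == value) => {
            Err(format!("{:?} is not one of {:?}", value, task.options))
        }
        InputMode::Text if value.len() > task.max_len => {
            Err(format!("answer exceeds {} bytes", task.max_len))
        }
        _ => Ok(value.to_string()),
    }
}
```

Running the same check in the service, rather than in each client, means chat, worklist, and CLI all reject the same invalid answers.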
Assignment Model
Human tasks should support both direct assignment and role-based queues.
Recommended fields:
assigned_user
assigned_role
candidate_roles
claimed_by
claimed_ts
due_ts
priority
comments
A role-based task can appear in the worklist for all users with a matching role. Once a user claims it, the task becomes owned by that user until completed, released, delegated, or timed out.
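The claim and release semantics described above can be sketched with a few of the recommended fields. The `HumanTask` and `User` types and the three functions are hypothetical; delegation, timeouts, and audit records are omitted.

```rust
/// Assignment fields from the list above (priority, due_ts, comments omitted).
#[derive(Debug, Clone, Default, PartialEq)]
pub struct HumanTask {
    pub assigned_user: Option<String>,
    pub candidate_roles: Vec<String>,
    pub claimed_by: Option<String>,
    pub claimed_ts: Option<u64>,
}

pub struct User {
    pub id: String,
    pub roles: Vec<String>,
}

/// Whether the task shows up in this user's worklist.
pub fn visible_to(task: &HumanTask, user: &User) -> bool {
    match &task.claimed_by {
        // A claimed task belongs to the claiming user only.
        Some(owner) => *owner == user.id,
        None => {
            task.assigned_user.as_deref() == Some(user.id.as_str())
                || task.candidate_roles.iter().any(|r| user.roles.contains(r))
        }
    }
}

/// Claims an unclaimed task for a user who is allowed to see it.
pub fn claim(task: &mut HumanTask, user: &User, now_ts: u64) -> Result<(), String> {
    if let Some(owner) = &task.claimed_by {
        return Err(format!("already claimed by {}", owner));
    }
    if !visible_to(task, user) {
        return Err("user is not a candidate for this task".to_string());
    }
    task.claimed_by = Some(user.id.clone());
    task.claimed_ts = Some(now_ts);
    Ok(())
}

/// Returns a claimed task to the role queue; only the owner may release it.
pub fn release(task: &mut HumanTask, user: &User) -> Result<(), String> {
    if task.claimed_by.as_deref() != Some(user.id.as_str()) {
        return Err("only the claiming user can release".to_string());
    }
    task.claimed_by = None;
    task.claimed_ts = None;
    Ok(())
}
```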
Recommended Build Order
- Implement stable workflow service APIs for start, status, events, task list, task claim, and task completion.
- Harden the ask resume path and waiting task state machine.
- Build the worklist because it forces the assignment, audit, and state model to be correct.
- Build the portal chat workflow interaction on top of the same task APIs.
- Add the CLI after the API shape stabilizes.
- Add scheduler integration for hourly live tests and headless workflow runs.
Design Rule
There must be one workflow runtime and one task state model. Chat, worklist, CLI, scheduler, and agents are only clients of that runtime.
This keeps enterprise workflow behavior auditable, testable, and consistent regardless of how a workflow is started or resumed.
LightAPI Description Design
lightapi-description-specification
LightAPI Description is the endpoint capability specification used by Light-Fabric agents, workflows, live tests, and portal API administration.
It describes how an API endpoint is discovered, invoked, explained, and verified. It is intentionally separate from the Agentic Workflow Specification. Workflow describes process orchestration. LightAPI describes endpoint capability.
Why LightAPI
OpenAPI is useful for REST APIs, and OpenRPC is useful for JSON-RPC APIs, but Light-Fabric needs a common description model across multiple enterprise protocols:
- REST / HTTP
- OpenAPI-described HTTP
- JSON-RPC 2.0
- OpenRPC-described JSON-RPC
- gRPC
- MCP tools, resources, and prompts
LightAPI provides a single agent-facing and workflow-facing description layer over these protocols.
The goal is not to replace OpenAPI or OpenRPC. The goal is to reference them where they exist and add the missing information needed by agents and workflow live tests.
API-Level Authoring, Endpoint-Level Consumption
Light-Portal may let teams author descriptions at the API level for convenience. However, workflows and agents consume descriptions at the endpoint level.
This distinction is important because real workflow processes rarely use a whole API. They usually combine selected endpoints from multiple APIs.
For example, onboarding an API to an AI gateway may consume:
- one endpoint from API registration
- one endpoint from API version management
- one endpoint from API instance management
- one endpoint from config server
- one endpoint from gateway linking
- one endpoint from controller reload
- one or more MCP tools exposed through the gateway
Each consumed operation should have an endpoint-level description with a stable endpointId.
API-level descriptions are still useful as catalogs. Endpoint-level descriptions may inherit shared API context such as:
- environments
- authentication
- secrets
- sources
- common tags
- lifecycle metadata
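Inheritance of shared context can be sketched as a merge in which endpoint-level values override the catalog and list-valued fields are combined. The merge rules and the `SharedContext` type are assumptions for this example; the specification may define inheritance differently.

```rust
/// A few of the inheritable context fields named above.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct SharedContext {
    pub environment: Option<String>,
    pub authentication: Option<String>,
    pub secrets: Vec<String>,
    pub tags: Vec<String>,
}

/// Catalog entries first, then endpoint-specific entries not already present.
fn merge_lists(base: &[String], extra: &[String]) -> Vec<String> {
    let mut out = base.to_vec();
    for item in extra {
        if !out.contains(item) {
            out.push(item.clone());
        }
    }
    out
}

/// Resolves the effective context for one endpoint.
/// Scalar fields: the endpoint value wins, otherwise the catalog value is inherited.
pub fn resolve(catalog: &SharedContext, endpoint: &SharedContext) -> SharedContext {
    SharedContext {
        environment: endpoint.environment.clone().or_else(|| catalog.environment.clone()),
        authentication: endpoint
            .authentication
            .clone()
            .or_else(|| catalog.authentication.clone()),
        secrets: merge_lists(&catalog.secrets, &endpoint.secrets),
        tags: merge_lists(&catalog.tags, &endpoint.tags),
    }
}
```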
Relationship To Agentic Workflow
Agentic Workflow and LightAPI have different responsibilities.
| Concern | Agentic Workflow | LightAPI Description |
|---|---|---|
| Process order | Yes | No |
| Branching and retries | Yes | No |
| Human-in-the-loop | Yes | No |
| Endpoint invocation contract | Reference only | Yes |
| Input and result examples | Optional workflow fixtures | Yes |
| Result verification expectations | Calls assert | Describes expected result cases |
| Agent progressive disclosure | Uses selected endpoints | Defines disclosure levels |
| Live testing | Orchestrates execution | Supplies examples and expected results |
In live tests, the workflow should use example data from LightAPI descriptions and workflow fixtures instead of asking for user input.
In interactive runs, the workflow may ask the user for missing values, then invoke endpoints described by LightAPI.
Relationship To Centralized Agent Skills
LightAPI endpoint descriptions are a source of agent skills.
The centralized skill registry should not require every API operation to be manually rewritten as a separate skill. Instead, Light-Portal can publish selected LightAPI endpoint descriptions into the skill registry as invokable capabilities.
The skill registry adds:
- permission-aware discovery
- semantic search
- skill grouping
- agent persona scoping
- audit around skill disclosure and execution
LightAPI provides:
- endpoint identity
- protocol details
- input schema
- request mapping
- result shape
- examples
- behavior notes
- result cases
Together, they allow an agent to discover a capability as a skill, progressively load only the endpoint details it needs, and execute through the workflow or controller runtime.
See Centralized Agentic Skill Registry for the skill registry design.
Core Document Concepts
A LightAPI document should support both API-level catalogs and endpoint-level documents.
Important top-level concepts:
- lightapi: specification version
- profile: api or endpoint
- info: name, title, version, namespace, owner, contact
- context: inherited catalog context for endpoint-level documents
- sources: OpenAPI, OpenRPC, protobuf, MCP, or raw protocol references
- environments: environment-specific server details
- secrets: required secret names
- authentications: reusable authentication policies
- operations: endpoint operation descriptions
- testSequences: linear endpoint test sequences
- agent: progressive disclosure and skill metadata
For profile: endpoint, the document should describe at most one operation.
Operation Model
Each operation represents one endpoint-level capability.
Common fields include:
- operationId: local operation identifier
- endpointId: globally stable endpoint identifier
- title
- summary
- description
- visibility
- lifecycle
- tags
- capability
- agent
- input
- request
- result
- examples
The input section describes the logical interface the agent or workflow sees.
The request section describes how logical input maps to the wire protocol.
The result section describes expected output, result cases, and failure shapes.
Protocol Coverage
HTTP And OpenAPI
For raw HTTP, the operation describes method, endpoint, headers, query, path, and body mappings.
For OpenAPI, LightAPI references the OpenAPI document and operation, then adds agent-oriented behavior, examples, and result expectations.
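The mapping from logical input to an HTTP request can be illustrated with path and query handling. This is a sketch under assumptions: the `build_url` helper, the `{name}` placeholder syntax, and the example path are hypothetical, and URL encoding is deliberately left out.

```rust
use std::collections::BTreeMap;

/// Substitutes `{name}` placeholders in `path_template` from the logical input,
/// then appends the inputs named in `query_keys` as query parameters.
/// Values are inserted as-is; a real implementation would percent-encode them.
pub fn build_url(
    base: &str,
    path_template: &str,
    query_keys: &[&str],
    input: &BTreeMap<String, String>,
) -> Result<String, String> {
    let mut path = String::new();
    let mut rest = path_template;
    while let Some(start) = rest.find('{') {
        let end = rest[start..].find('}').ok_or("unclosed placeholder")? + start;
        let key = &rest[start + 1..end];
        let value = input
            .get(key)
            .ok_or_else(|| format!("missing input {:?}", key))?;
        path.push_str(&rest[..start]);
        path.push_str(value);
        rest = &rest[end + 1..];
    }
    path.push_str(rest);

    let query: Vec<String> = query_keys
        .iter()
        .filter_map(|k| input.get(*k).map(|v| format!("{}={}", k, v)))
        .collect();
    if query.is_empty() {
        Ok(format!("{}{}", base, path))
    } else {
        Ok(format!("{}{}?{}", base, path, query.join("&")))
    }
}
```

Keeping the mapping in the description lets the agent work with logical input names only, while the runtime decides where each value lands on the wire.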
JSON-RPC And OpenRPC
For direct JSON-RPC, the operation describes endpoint, method, params, id behavior, notification behavior, and error policy.
For OpenRPC, LightAPI references the OpenRPC document and method. The workflow runtime can use the OpenRPC document to validate that the method exists and that required params are present before calling it.
gRPC
For gRPC, the operation describes service, method, protobuf source, transport, metadata, request mapping, and result mapping.
For browser or gateway-mediated enterprise deployments, gRPC over WebSocket can be represented as a transport on the structured protocol operation.
MCP
For MCP, the operation describes tool, resource, or prompt invocation.
Tool listing alone is not enough. The description must also include:
- input schema
- result shape
- examples
- behavior differences for important input cases
- error cases
- verification expectations
MCP stdio is not a priority for enterprise portal deployment. HTTP and streamable HTTP transports should be the main runtime targets.
Result Cases And Verification
LightAPI should describe expected result behavior, but Agentic Workflow should execute the actual assertions.
This keeps verification orchestration in one place.
Recommended model:
- LightAPI operation result cases describe expected outputs, failure shapes, and examples.
- Workflow test steps invoke the operation.
- Workflow assert tasks verify actual output against expected result cases.
- Complex business checks can call the rule engine.
This allows the same endpoint description to support:
- agent skill usage
- workflow execution
- live integration testing
- failure diagnosis
Progressive Disclosure For Agents
A LightAPI document should support progressive disclosure so an agent can load only the information needed at each stage.
Recommended levels:
- index: endpoint id, title, tags, visibility
- summary: purpose, capability group, lifecycle
- invocation: input schema, request mapping, authentication, examples
- behavior: result cases, edge cases, errors, assertions
- full: complete endpoint description
The portal can expose query APIs such as:
lightapi.listOperations
lightapi.getOperation
lightapi.getCapabilityGroup
Agents should start with index or summary data, load invocation details only for selected endpoints, and load behavior details only for testing, troubleshooting, or failure repair.
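The loading guidance above can be sketched as an ordered enum. This is a minimal illustration, not part of the LightAPI specification: the stage names (Discovering, Invoking, Troubleshooting) and both helper functions are hypothetical, and the assumption that the five levels form a strict least-to-most-detail ordering is inferred from the list rather than stated by it.

```rust
/// Disclosure levels from the list above, assumed to be ordered from
/// least to most detail. The derived `Ord` follows declaration order.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum DisclosureLevel {
    Index,
    Summary,
    Invocation,
    Behavior,
    Full,
}

/// Hypothetical agent stages, named only for this illustration.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum AgentStage {
    Discovering,
    Invoking,
    Troubleshooting,
}

/// Most detailed level a stage is expected to need.
pub fn level_for(stage: AgentStage) -> DisclosureLevel {
    match stage {
        AgentStage::Discovering => DisclosureLevel::Summary,
        AgentStage::Invoking => DisclosureLevel::Invocation,
        AgentStage::Troubleshooting => DisclosureLevel::Behavior,
    }
}

/// True when `requested` does not exceed what the stage needs.
pub fn within_budget(stage: AgentStage, requested: DisclosureLevel) -> bool {
    requested <= level_for(stage)
}
```

A portal query handler could use a check like within_budget to refuse behavior-level detail to an agent that is still browsing the catalog.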
Portal Publishing Flow
Light-Portal should manage endpoint descriptions as part of API endpoint administration.
Recommended flow:
- API owner creates or imports API metadata.
- Portal extracts initial endpoint descriptions from OpenAPI, OpenRPC, protobuf, MCP, or raw endpoint configuration.
- API owner enriches endpoint descriptions with examples, behavior notes, result cases, and visibility.
- Portal stores endpoint-level LightAPI descriptions.
- Authorized agents and workflows query descriptions by endpoint, tag, lifecycle, visibility, or capability.
- Selected endpoints can be published into the centralized skill registry.
- Workflow instances reference endpoint descriptions during execution and live testing.
Live Test Use
Live tests should be workflow-driven.
LightAPI supplies:
- example input data
- expected result cases
- protocol invocation details
- error behavior
Agentic Workflow supplies:
- sequence
- fixtures
- environment selection
- endpoint invocation
- assertions
- failure routing
- task creation
- agent assignment
This avoids building a second test runner model outside the workflow engine.
Design Rule
LightAPI describes endpoint capability. Agentic Workflow orchestrates endpoint use. Centralized Skills expose selected capabilities to agents.
Keeping these responsibilities separate lets Light-Fabric support API administration, agent skill discovery, workflow execution, and live integration testing without duplicating endpoint definitions across multiple systems.
Light-Rule Design
Light-Rule is the local YAML rule engine used by Light-Fabric services and workflows for deterministic business checks, transformations, authorization decisions, and workflow assertions.
It complements agentic workflow by keeping critical decisions explicit, repeatable, and auditable. Agents can propose or select rules, but the rule engine executes the deterministic logic.
Purpose
Light-Rule is designed for enterprise services that need fast local policy and transformation logic without a database call on every request.
Primary uses:
- fine-grained authorization
- request transformation
- response transformation
- workflow assertions
- business validation
- permission and filter injection
- reusable rule templates selected from Light-Portal
The rule configuration is loaded locally by the target service. When permissions or rule mappings change, the controller can trigger a config reload so the service swaps to the latest rules.
Relationship To Agentic Workflow
Agentic Workflow orchestrates process steps. Light-Rule evaluates deterministic logic inside those steps.
Workflow uses Light-Rule in two main ways:
- Rule call task: A workflow task can call a named rule to validate or mutate workflow context.
- Assert task extension: Simple checks can be handled directly by assert, while complex business checks can delegate to Light-Rule.
This separation keeps workflows readable. The workflow says when a check happens; Light-Rule defines the reusable business logic for the check.
Example workflow responsibilities:
- decide when authorization configuration is needed
- select or create a rule
- invoke a rule during live testing
- route failures to a human or agent
Example Light-Rule responsibilities:
- evaluate role, group, position, or attribute checks
- inject endpoint permissions into the context
- compute row or column filters
- execute transformation plugins
- return pass/fail for business assertions
See Agentic Workflow Design for the workflow orchestration model.
Relationship To LightAPI
LightAPI endpoint descriptions describe endpoint invocation and expected result behavior. Light-Rule can implement complex result checks that are too business-specific for simple schema assertions.
Recommended model:
- LightAPI describes endpoint result cases and expected behavior.
- Agentic Workflow invokes the endpoint and runs assert tasks.
- assert handles simple checks directly.
- Light-Rule handles complex checks, authorization logic, row filters, column filters, and reusable business policies.
See LightAPI Description Design for endpoint capability descriptions.
Rule Specification
Rules are described by the rule specification in rule-specification/schema/rule.yaml.
The top-level configuration contains:
- ruleBodies: named rule definitions
- endpointRules: endpoint-to-rule mappings
Each rule can contain:
- ruleId
- ruleDesc
- version
- author
- updatedAt
- conditions
- actions
Each endpoint mapping can contain:
- req-tra: request transformation rules
- res-tra: response transformation rules
- access-control: access control rules
- permission: permission values injected into context
- x-*: extension rule phases
Rule Conditions
Conditions evaluate fields in the input context.
Supported operand forms:
- direct field: role
- dotted path: user.role
- JSON Pointer: /user/role
- JSONPath-like path: $.user.roles[0]
Supported operators:
==
!=
>
<
>=
<=
eq
ne
contains
matches
startsWith
endsWith
exists
notExists
expected is typed and may be a string, number, boolean, array, object, or null.
Flat condition arrays are evaluated left-to-right. joinCode combines the current condition with the previous result.
A AND B OR C
is evaluated as:
(A AND B) OR C
If explicit grouping is required, split logic into multiple rules and combine them through endpoint mapping or workflow orchestration.
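The left-to-right rule can be sketched as a plain fold. The types below are hypothetical stand-ins rather than the model in crates/light-rule: each condition is assumed to be already evaluated to a boolean, and the first condition's joinCode is assumed to be ignored.

```rust
/// How a condition combines with the result accumulated so far.
/// Hypothetical type; the real crate may represent joinCode differently.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum JoinCode {
    And,
    Or,
}

/// One condition that has already been evaluated to a boolean outcome.
#[derive(Clone, Copy, Debug)]
pub struct EvaluatedCondition {
    pub join: JoinCode,
    pub outcome: bool,
}

/// Fold a flat condition list strictly left-to-right.
/// There is no operator precedence: each joinCode applies to the
/// accumulated result. An empty list is treated as passing (assumption).
pub fn fold_conditions(conditions: &[EvaluatedCondition]) -> bool {
    let mut iter = conditions.iter();
    let first = match iter.next() {
        Some(c) => c.outcome, // the first joinCode is ignored
        None => return true,
    };
    iter.fold(first, |acc, c| match c.join {
        JoinCode::And => acc && c.outcome,
        JoinCode::Or => acc || c.outcome,
    })
}
```

With A = true, B = false, C = false, the list A OR B AND C folds to (A OR B) AND C, which is false. Conventional precedence would read it as A OR (B AND C), which is true, so the difference matters whenever OR precedes AND.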
Rule Actions
Actions execute plugin logic after conditions pass.
An action contains:
- actionId
- actionClassName
- actionValues
actionClassName identifies the registered plugin. actionValues carries plugin-specific configuration.
Typical action plugins:
- add values to request context
- inject permission attributes
- compute filters
- transform request body
- transform response body
- call a local business function
Actions are intentionally plugin-based so the schema remains stable while implementation logic can evolve.
Endpoint Rule Phases
Endpoint mappings define when rules run.
Request Transformation
req-tra rules run before the service handles the request. They can enrich or transform request context.
Response Transformation
res-tra rules run after the service produces a response. They can filter, redact, or reshape response data.
Access Control
access-control rules validate whether a request is allowed. These rules normally run in parallel because they should not mutate shared state.
Permission Injection
permission values are injected into the evaluation context before rule execution. This lets API owners configure roles, groups, attributes, row filters, or column filters without editing the technical rule body.
Extension Phases
Custom phases must use the x-* prefix. This avoids silent typos in standard phase names while preserving controlled extensibility.
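A minimal sketch of the phase-name check follows. The list of standard keys is copied from the endpoint mapping keys named earlier in this section and may be incomplete; the function name is hypothetical, and requiring at least one character after x- is an assumption.

```rust
/// Standard endpoint mapping keys named in this document.
/// The real schema may define more; treat this list as an assumption.
const STANDARD_PHASES: [&str; 4] = ["req-tra", "res-tra", "access-control", "permission"];

/// Accept a standard key or an extension key that starts with "x-"
/// and has at least one character after the prefix.
pub fn is_valid_phase(name: &str) -> bool {
    STANDARD_PHASES.contains(&name) || (name.starts_with("x-") && name.len() > 2)
}
```

Under this check a typo such as req-tar is rejected instead of being silently treated as a new custom phase.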
Execution Model
The Rust implementation lives in crates/light-rule.
Core components:
- RuleConfig: top-level config model
- Rule: rule definition
- RuleCondition: condition model
- RuleAction: action model
- RuleEngine: evaluates one rule
- ActionRegistry: maps action class names to plugins
- MultiThreadRuleExecutor: executes rule lists and endpoint phase mappings
Sequential phases such as req-tra and res-tra should run with all semantics so transformations happen in order.
Access control can run in parallel because it should be a validation step rather than a mutation step.
Why Not Replace With Cedar Or Casbin
Cedar and Casbin are strong policy engines, but Light-Rule has a different role in this platform.
Light-Rule supports:
- local YAML configuration
- request and response transformation
- permission injection
- row and column filters
- endpoint-specific rule selection
- technical-team-authored reusable rules
- API-owner-selected rule parameters
- config reload through controller
Cedar is excellent for authorization policy, but it does not naturally cover transformation, row filter, and column filter use cases. Casbin is strong for policy enforcement, but it introduces a different policy storage and matching model.
Light-Rule should remain the native rule engine for Light-Fabric service configuration and workflow assertions. External policy engines can still be integrated as action plugins if needed.
Governance
Rule bodies should be authored and reviewed like code or controlled configuration.
Recommended governance metadata:
- version
- author
- updatedAt
- ruleDesc
Recommended operational controls:
- validate rule YAML against the schema before publishing
- reject endpoint phase typos
- keep ruleId equal to the ruleBodies map key
- audit rule publication and reload events
- test rules with representative input contexts
- use workflow live tests to verify rules in integrated environments
Workflow Live Testing
Light-Rule is useful in live tests because it can express business checks that are more specific than generic JSON assertions.
Example flow:
- Workflow invokes an endpoint using LightAPI description.
- Workflow captures the endpoint response.
- assert verifies simple fields.
- A rule task validates business-specific behavior.
- On failure, workflow creates a task for a human or agent to investigate.
This keeps live test orchestration in workflow while preserving reusable business rules in Light-Rule.
Design Rule
Use workflow for process control. Use LightAPI for endpoint capability. Use Light-Rule for deterministic business logic.
Agents may select, explain, or help author rules, but the rule engine should execute the final deterministic decision.
CEL Rule Conditions
Light-Rule should support both the existing native condition schema and CEL expressions. The two forms solve different problems and should share the same rule lifecycle, endpoint mapping, action execution, config loading, testing, and governance model.
The native condition schema remains the default because it is easy to render in Light-Portal, simple to validate, and suitable for most API-owner use cases. CEL is an advanced condition form for customers that need richer boolean logic, grouping, list predicates, or compatibility with existing CEL-based policy assets.
Each rule should choose one condition language: native or cel. Mixing native
condition rows and CEL expressions inside the same rule is not recommended as the
canonical model because it makes portal authoring, validation, and runtime
dispatch harder to reason about.
Goals
- keep existing rule YAML and portal-authored rules compatible
- support CEL expressions as a rule-level condition language
- evaluate native and CEL rules in the same RuleEngine
- reuse the existing rule context for gateway, workflow, and test execution
- preserve existing actions, endpointRules, and rule phase semantics
- let Light-Portal choose the correct editor from rule metadata without parsing arbitrary rule bodies
- validate CEL before publishing or reloading rules where possible
- keep CEL execution deterministic and side-effect free
Non-Goals
- replacing the native Light-Rule condition schema
- replacing actions with CEL
- allowing CEL expressions to perform I/O, network calls, mutation, or service lookups
- making every native operator available as a custom CEL function on day one
- requiring business users to write CEL for common rules
- supporting mixed native and CEL condition blocks in the canonical portal authoring flow
Current Model
Today a rule contains an optional flat list of native conditions:
ruleBodies:
  allowMcpReader:
    common: Y
    ruleId: allowMcpReader
    ruleName: Allow MCP reader
    ruleType: req-acc
    conditions:
      - operatorCode: isNotNull
        propertyPath: auditInfo.subject_claims.ClaimsMap.role
    actions:
      - actionClassName: com.networknt.rule.RoleBasedAccessControlAction
Each native condition contains:
- operator
- operand
- expected
- joinCode
The engine evaluates conditions left-to-right. joinCode combines each
condition with the accumulated result. If the final condition result is true,
actions run as they do today.
Portal persistence stores rule metadata in rule_t and the executable rule JSON
in rule_t.rule_body. Today there is no dedicated column that tells the portal
which condition editor to render, so the UI would have to inspect rule_body.
Proposed Rule Shape
Add a rule-level condition language flag. Use native for existing condition
rows and cel for a single CEL expression.
Persist the flag in both places:
- rule_t.condition_language: indexed/listable portal metadata
- ruleBody.conditionLanguage: self-contained exported runtime configuration
Recommended values:
native
cel
Existing rules without the field are interpreted as native.
Native rule body:
ruleBodies:
  allowMcpReader:
    common: Y
    ruleId: allowMcpReader
    ruleName: Allow MCP reader
    ruleType: req-acc
    conditionLanguage: native
    conditions:
      - operatorCode: isNotNull
        propertyPath: auditInfo.subject_claims.ClaimsMap.role
    actions:
      - actionClassName: com.networknt.rule.RoleBasedAccessControlAction
CEL rule body:
ruleBodies:
  allowApprovedTransfer:
    common: Y
    ruleId: allowApprovedTransfer
    ruleName: Allow approved transfer
    ruleType: req-acc
    conditionLanguage: cel
    conditionSecurityProfile: strict
    expression: >
      auditInfo.subject_claims.ClaimsMap.role != null
      && roles.exists(r, r == auditInfo.subject_claims.ClaimsMap.role)
    actions:
      - actionClassName: com.networknt.rule.RoleBasedAccessControlAction
Recommended database shape:
ALTER TABLE rule_t
  ADD COLUMN condition_language VARCHAR(16) DEFAULT 'native' NOT NULL;
ALTER TABLE rule_t
  ADD COLUMN condition_security_profile VARCHAR(32);
ALTER TABLE rule_t
  ADD CONSTRAINT rule_t_condition_language_check
  CHECK (condition_language IN ('native', 'cel'));
ALTER TABLE rule_t
  ADD CONSTRAINT rule_t_condition_security_profile_check
  CHECK (
    condition_security_profile IS NULL
    OR condition_security_profile IN ('strict', 'standard', 'internal-admin')
  );
Recommended schema rules:
- conditionLanguage is optional and defaults to native
- conditionLanguage: native allows conditions and rejects expression
- conditionLanguage: cel requires expression and rejects conditions
- conditionSecurityProfile is optional and names a runtime-defined profile
- native conditions continue to require operator or operatorCode
- native conditions continue to require operand or propertyPath
- unknown rule and condition fields should continue to be rejected by the schema
- command handlers should reject requests where the DB metadata and rule body condition language disagree
This can be represented with conditional validation in
rule-specification/schema/rule.yaml:
allOf:
  - if:
      properties:
        conditionLanguage:
          const: cel
      required: [conditionLanguage]
    then:
      required: [expression]
      not:
        required: [conditions]
    else:
      not:
        required: [expression]
The Rust model can add optional fields to Rule:
pub condition_language: Option<String>,
pub condition_security_profile: Option<String>,
pub expression: Option<String>,
This is less disruptive than changing RuleCondition into an enum and keeps old
rule bodies valid.
Cross-Repository Scope
This change crosses the rule specification, runtime engines, portal services, and
portal UI. The implementation should be tracked as a coordinated change rather
than a light-fabric-only feature.
| Area | Required work |
|---|---|
| rule-specification | Add conditionLanguage, conditionSecurityProfile, expression, native rule and CEL rule schema branches, and mode/profile-specific validation rules. |
| portal-db | Add rule_t.condition_language with default native, optional rule_t.condition_security_profile, check constraints, and pending rule-change approval state if workflow task payloads are not sufficient. Keep existing rows valid without rewriting rule_body. |
| light-portal | Update persistence and projection code so rule create/update/read/export/import paths carry conditionLanguage and conditionSecurityProfile; ensure endpoint rule config generation emits only approved, self-contained rule bodies; integrate stronger-profile requests with worklist and assistant-task approval. |
| rule-command | Accept conditionLanguage, conditionSecurityProfile, and expression, normalize old/native payloads, validate mode/profile-specific shape, publish strict changes immediately, route stronger profile requests through approval, and write both DB metadata and rule body consistently after approval. |
| rule-query | Return conditionLanguage, conditionSecurityProfile, and approval status for list/detail APIs, include selected/effective profiles in test-case execution payloads, and surface CEL parse/type/missing-field/profile errors from Java and Rust runners. |
| portal-view | Render either the native condition builder or a CEL expression editor based on conditionLanguage; show a controlled profile selector for CEL rules; submit strict directly and route standard or internal-admin to worklist approval; do not require the UI to infer mode from ruleBody. |
| workflow and assistant task | Use the existing human-in-the-loop worklist flow for stronger profile approval, route tasks to admin and rule-admin, and attach an advisory assistant-task risk summary for the approver. |
| light-fabric | Add conditionLanguage, conditionSecurityProfile, and expression to crates/light-rule, dispatch in RuleEngine, add policy-driven CEL evaluator/caching, and update gateway/workflow tests. |
| yaml-rule | Add Java runtime parity for conditionLanguage: cel and named profile enforcement if Java services need to execute the same rules; otherwise reject CEL rules explicitly with a clear runtime-capability error. |
portal-db is listed even though it is not a rule engine because rule_t lives
there. Without the DB column, portal-view would need to parse the compact rule
body to choose the editor, which is the coupling this design is trying to avoid.
Operator Alias Alternative
Another possible shape is to add operatorCode: cel and store the CEL
expression in expected inside conditions:
conditions:
  - operatorCode: cel
    expected: >
      context.toolArguments.amount < 1000
      || roles.exists(r, r == "approver")
This has one advantage: it can be implemented with a small Rust model change
because operator, operand, and expected already exist. It is useful as a
compatibility alias or import format.
It should not be the canonical schema because:
- CEL is a full boolean expression, not a comparison operator
- overloading expected makes validation and portal rendering less clear
- operand becomes ignored or artificial
- the UI still has to draw a condition-row editor even though the rule is really a single expression
- the rule schema still needs to change because the operator enum must include cel and native operand requirements must be relaxed
- future expression languages would continue overloading native condition fields
The recommended contract is therefore:
- canonical form: conditionLanguage: cel plus rule-level expression
- optional compatibility form: operatorCode: cel plus string expected
- normalize compatibility imports to the canonical rule-level model before persistence or runtime evaluation
Mixed Conditions Alternative
Another possible shape is to allow native and CEL conditions in the same
conditions array. The runtime can support this if needed, but it should not be
the default authoring model.
Reasons to avoid canonical mixed rules:
- Light-Portal would need a hybrid editor that switches row-by-row
- validation errors become harder to explain to non-technical users
- joinCode semantics across native and CEL expressions are correct but subtle
- users may expect CEL operator precedence inside the whole rule even though native joinCode remains left-to-right
- runtime dispatch is simpler and faster when the rule selects one evaluator
If mixed rules are ever accepted for import or advanced API use, joinCode
should still apply left-to-right to the accumulated result regardless of which
evaluator handled the current or previous condition.
Execution Model
Rule execution should dispatch by conditionLanguage once per rule:
RuleEngine::execute_rule
  -> conditionLanguage == native
       -> evaluate native conditions
  -> conditionLanguage == cel
       -> evaluate rule expression
  -> execute actions when conditions pass
The outer behavior stays unchanged:
- rules with no conditions continue to run actions
- CEL rules without an expression fail validation before runtime
- failed conditions skip actions
- failed action execution fails the rule
- endpoint rule ordering and access-control logic stay unchanged
- req-tra and res-tra continue to run sequentially
- access-control rules can still be evaluated independently
Runtime should treat a missing conditionLanguage as native for backward
compatibility.
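The per-rule dispatch, including the backward-compatible default, might look like the sketch below. The Rule struct is trimmed to the two fields that matter here, and the evaluator closures are placeholders for the real native and CEL evaluators; none of these names are confirmed signatures from crates/light-rule.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ConditionLanguage {
    Native,
    Cel,
}

/// Trimmed-down rule used only for this illustration.
pub struct Rule {
    pub condition_language: Option<String>,
    pub expression: Option<String>,
}

/// A missing conditionLanguage is interpreted as native.
pub fn resolve_language(rule: &Rule) -> Result<ConditionLanguage, String> {
    match rule.condition_language.as_deref() {
        None | Some("native") => Ok(ConditionLanguage::Native),
        Some("cel") => Ok(ConditionLanguage::Cel),
        Some(other) => Err(format!("unsupported conditionLanguage: {}", other)),
    }
}

/// Dispatch once per rule; the closures stand in for the real evaluators.
pub fn conditions_pass(
    rule: &Rule,
    eval_native: impl Fn(&Rule) -> bool,
    eval_cel: impl Fn(&str) -> bool,
) -> Result<bool, String> {
    match resolve_language(rule)? {
        ConditionLanguage::Native => Ok(eval_native(rule)),
        ConditionLanguage::Cel => {
            // Schema validation should already have rejected this case.
            let expression = rule
                .expression
                .as_deref()
                .ok_or_else(|| "cel rule is missing expression".to_string())?;
            Ok(eval_cel(expression))
        }
    }
}
```

Resolving the language once per rule keeps the rest of the pipeline (actions, phase ordering) unaware of which evaluator ran.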
Rule Context
CEL should evaluate against the same JSON context used by native conditions. For gateway access-control and response filtering, this includes fields such as:
- auditInfo
- headers
- endpoint
- toolName
- toolArguments
- correlationId
- responseBody
- statusCode
Endpoint permission values are merged into the root context as their configured
keys. For example, permission.roles in endpointRules is available to
conditions as roles, response row filters are available as row, and column
filters are available as col. A future runtime can also expose a namespaced
permission object as an additive convenience, but CEL support should not
require that shape to preserve compatibility with existing native rules and
actions.
For standard and internal-admin profiles, the CEL environment can expose
variables in two ways:
- top-level context fields as direct CEL variables, such as auditInfo, toolArguments, and roles
- the full root object as context, so expressions can use explicit paths such as context.toolArguments.amount
Direct variables keep expressions concise and close to the native condition path
style. The context variable is safer for generated expressions, collision
avoidance, and future fields that are not valid CEL identifiers.
For the strict profile, the runtime should expose only curated root variables
such as auditInfo, headers, toolArguments, endpoint metadata, and
permission values needed by the rule phase. It should not expose the full
context object by default. This prevents future internal runtime metadata from
becoming visible to tenant-authored CEL just because it was appended to the root
request context.
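One way to picture the strict profile's curated exposure is an allowlist filter over the root context. This is a sketch under simplifying assumptions: the context is modeled as a BTreeMap with a generic value type instead of serde_json::Value, and the function name and the internalTrace key in the usage note are hypothetical.

```rust
use std::collections::BTreeMap;

/// Copy only allowlisted root variables into the map handed to CEL.
/// Anything appended to the root context later stays hidden unless
/// it is added to `allowed` explicitly.
pub fn expose_variables<V: Clone>(
    root: &BTreeMap<String, V>,
    allowed: &[&str],
) -> BTreeMap<String, V> {
    root.iter()
        .filter(|(key, _)| allowed.contains(&key.as_str()))
        .map(|(key, value)| (key.clone(), value.clone()))
        .collect()
}
```

For example, with a root context holding auditInfo and internalTrace and an allowlist of auditInfo and headers, only auditInfo reaches the expression. A production version would likely avoid cloning values, in line with the performance guidance in this document.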
The context contract should be documented as part of Light-Rule because CEL expressions depend on stable field names. Adding fields is compatible. Renaming or changing field shapes is a breaking change for CEL rules.
Type Mapping
The CEL evaluator should receive deterministic values converted from
serde_json::Value:
- JSON object to CEL map
- JSON array to CEL list
- JSON string to CEL string
- JSON number to CEL integer or double
- JSON boolean to CEL bool
- JSON null to CEL null
Missing fields should evaluate according to the chosen CEL implementation's standard behavior. The rule test API should expose these failures clearly so authors can distinguish "expression false" from "expression invalid".
Authors should guard optional fields explicitly. Depending on the selected CEL
runtime and the field shape, this can use presence checks such as has(...) or
map membership checks such as:
"role" in auditInfo.subject_claims.ClaimsMap
&& auditInfo.subject_claims.ClaimsMap.role == "admin"
The portal rule tester should surface missing-field evaluation errors and suggest guarded expressions instead of letting these failures look like ordinary denied rules.
Context Injection Performance
CEL expressions run on request paths, so context conversion must be controlled. The implementation should not recursively deep-clone and convert large JSON payloads separately for every CEL rule evaluation.
Recommended approach:
- compile expressions once at rule load
- build the rule context once per request or response phase
- reuse converted CEL variables across evaluations in the same request or response phase when possible
- prefer lazy or reference-backed variable resolution if the selected CEL crate supports it
- if eager conversion is required, convert only the variables exposed to CEL and avoid parsing large string fields such as responseBody unless an expression explicitly needs structured access to them
- benchmark access-control and response-filter scenarios before enabling CEL by default in high-throughput paths
The initial implementation can be pragmatic, but performance tests should guard against accidentally making CEL expression evaluation proportional to the full response body size when the expression only needs claims or endpoint metadata.
Validation
CEL should be validated earlier than request execution.
Recommended validation points:
- portal rule editor
- rule command create/update handler
- rule test API
- runtime config reload
Validation must enforce mode-specific shape:
- native: conditions is allowed, expression is rejected
- cel: expression is required, conditions is rejected
- persisted rule_t.condition_language must match ruleBody.conditionLanguage
- persisted rule_t.condition_security_profile must match ruleBody.conditionSecurityProfile when either side is present
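The shape rules above can be sketched as a single validation function. RuleRecord is a hypothetical flattened view of one rule_t row plus its rule body, and the error strings are illustrative. The security-profile comparison is left out to keep the sketch short.

```rust
/// Hypothetical flattened view of one persisted rule.
pub struct RuleRecord<'a> {
    /// Value of rule_t.condition_language.
    pub db_condition_language: &'a str,
    /// Value of ruleBody.conditionLanguage, if present.
    pub body_condition_language: Option<&'a str>,
    pub has_conditions: bool,
    pub has_expression: bool,
}

/// Enforce mode-specific shape and DB/body agreement.
pub fn validate_shape(rule: &RuleRecord<'_>) -> Result<(), String> {
    // A missing body field is interpreted as native.
    let body_language = rule.body_condition_language.unwrap_or("native");
    if body_language != rule.db_condition_language {
        return Err(format!(
            "condition language mismatch: db={} body={}",
            rule.db_condition_language, body_language
        ));
    }
    match body_language {
        "native" if rule.has_expression => {
            Err("native rules must not define expression".to_string())
        }
        "native" => Ok(()),
        "cel" if !rule.has_expression => Err("cel rules require expression".to_string()),
        "cel" if rule.has_conditions => Err("cel rules must not define conditions".to_string()),
        "cel" => Ok(()),
        other => Err(format!("unknown condition language: {}", other)),
    }
}
```

Running the same function in the command handler and at runtime reload would keep the two validation points from drifting apart.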
Runtime reload should reject invalid CEL when strict validation is enabled. If a service must preserve availability, it can keep the last known-good rule set and report the new config as rejected.
Approval workflow should not bypass validation. For profile escalation requests, the command path should validate the submitted rule shape and expression before creating the approval task. Final approval should revalidate the exact submitted rule body before emitting the active rule event.
Validation output should include:
- rule id
- condition language
- parse or type error
- source offset when provided by the CEL implementation
Compilation And Caching
Do not compile CEL on every request. Compile once per rule load and cache the compiled program with the loaded rule set.
Recommended cache key:
ruleId + expression hash + effective profile
The compiled expression cache should be replaced atomically when the rule config reloads. It should not outlive the rule version it was compiled from. Old compiled entries must be evicted during reload so repeated rule updates cannot leak memory through stale expression hashes.
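A minimal sketch of such a cache, using only the standard library, is shown below. CompiledExpression is a placeholder for whatever compiled program type the chosen CEL crate provides, and ExpressionCache, CacheKey, and their methods are hypothetical names. Reload builds a complete new map and swaps it in one assignment, so entries from older rule versions are dropped rather than accumulated.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::{Arc, RwLock};

/// Cache key: ruleId + expression hash + effective profile.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct CacheKey {
    pub rule_id: String,
    pub expression_hash: u64,
    pub profile: String,
}

impl CacheKey {
    pub fn new(rule_id: &str, expression: &str, profile: &str) -> Self {
        // DefaultHasher is only used in-process here; its output is not
        // guaranteed stable across Rust releases, so never persist it.
        let mut hasher = DefaultHasher::new();
        expression.hash(&mut hasher);
        CacheKey {
            rule_id: rule_id.to_string(),
            expression_hash: hasher.finish(),
            profile: profile.to_string(),
        }
    }
}

/// Placeholder for a compiled CEL program.
#[derive(Debug)]
pub struct CompiledExpression {
    pub source: String,
}

pub struct ExpressionCache {
    current: RwLock<Arc<HashMap<CacheKey, Arc<CompiledExpression>>>>,
}

impl ExpressionCache {
    pub fn new() -> Self {
        ExpressionCache { current: RwLock::new(Arc::new(HashMap::new())) }
    }

    /// Rebuild from (ruleId, expression, profile) triples, then swap.
    /// The previous map is released once in-flight readers finish.
    pub fn reload(&self, rules: &[(&str, &str, &str)]) {
        let mut next = HashMap::new();
        for &(rule_id, expression, profile) in rules {
            let compiled = CompiledExpression { source: expression.to_string() };
            next.insert(CacheKey::new(rule_id, expression, profile), Arc::new(compiled));
        }
        *self.current.write().unwrap() = Arc::new(next);
    }

    pub fn get(&self, rule_id: &str, expression: &str, profile: &str) -> Option<Arc<CompiledExpression>> {
        let key = CacheKey::new(rule_id, expression, profile);
        let guard = self.current.read().unwrap();
        guard.get(&key).cloned()
    }

    pub fn len(&self) -> usize {
        self.current.read().unwrap().len()
    }
}
```

Because the whole map is replaced, updating a rule's expression repeatedly leaves exactly one entry for that rule after each reload.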
Rust CEL Library
cel-interpreter is a practical first candidate for the Rust implementation. It
provides Program::compile(...), Program::execute(...), a Context for
variables and functions, and compiled Program values that are Send + Sync.
Implementation should still be isolated behind a small internal trait:
CelEvaluator
  -> compile(ruleId, expression) -> compiled expression
  -> evaluate(compiled expression, serde_json::Value context) -> bool
This keeps Light-Rule from leaking third-party crate types through its public model and allows the implementation to change if CEL crate maturity, feature flags, or Java parity requirements change.
Native Operator Parity
The native evaluator includes operators that may not map one-to-one to the selected CEL runtime. Examples include:
- containsIgnoreCase
- matches and notMatch
- inList and notInList
- containsAny, containsAll, and containsNone
- date-style comparisons such as before, after, and on
Before encouraging migration from native conditions to CEL, the implementation should define a small compatibility function registry for any gaps. Candidate pure helper functions include:
contains_ignore_case(value, substring)
matches(value, pattern)
in_list(value, values)
contains_any(value, values)
contains_all(value, values)
These functions must be deterministic, side-effect free, and shared by the rule tester and runtime evaluator. If Java parity is required, the same function names and edge-case behavior should be implemented in the Java runtime.
Safety
CEL support should be deterministic and sandboxed.
The evaluator does not need an operating-system sandbox for normal trusted/admin-authored rule configuration. CEL is an interpreted expression language, not arbitrary Rust or JavaScript execution, and expressions can only resolve variables and functions registered in the CEL context. The CEL context is therefore the primary sandbox boundary.
For the Rust cel-interpreter integration, context construction should be
explicit. Context::default() exposes standard pure CEL functions such as
size, contains, string helpers, type conversions, regex matches, and time
parsing helpers depending on enabled crate features. If a service accepts
tenant-authored or otherwise untrusted CEL, prefer Context::empty() and add
only platform-approved helper functions.
Security policy should be engine-owned. A rule may request a named condition security profile, but it must not define its own function allowlist, size limits, resource limits, or isolation mode. If a rule author controls the rule body, then inline security settings are also attacker-controlled.
Recommended policy model:
runtime config defines profiles:
  strict
  standard
  internal-admin
rule optionally requests:
  conditionSecurityProfile: strict
effective policy:
  runtime maximum profile intersected with requested profile
If a rule omits conditionSecurityProfile, the runtime default applies. If a
rule requests a profile that the service, tenant, or rule phase does not allow,
the rule config should be rejected during validation or runtime reload. The
engine may choose a stricter profile than requested, but it must never choose a
weaker one because the rule requested it.
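The profile resolution described above can be sketched with an ordered enum. Two points here are assumptions rather than stated design: the three profiles are treated as a total order from strict to internal-admin, and an omitted profile falls back to the runtime default clamped to the ceiling. The function and parameter names are hypothetical.

```rust
/// Profiles ordered from most to least restrictive (assumed ordering).
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Profile {
    Strict,
    Standard,
    InternalAdmin,
}

/// Resolve the effective profile for one rule.
/// `ceiling` is the strongest profile the service, tenant, or phase allows.
pub fn effective_profile(
    requested: Option<Profile>,
    runtime_default: Profile,
    ceiling: Profile,
) -> Result<Profile, String> {
    match requested {
        // An explicit request above the ceiling rejects the rule config.
        Some(profile) if profile > ceiling => Err(format!(
            "requested profile {:?} exceeds allowed ceiling {:?}",
            profile, ceiling
        )),
        Some(profile) => Ok(profile),
        // No request: use the default, never anything above the ceiling.
        None => Ok(runtime_default.min(ceiling)),
    }
}
```

The rule body only ever names a profile; the limits behind that name stay in engine-owned configuration.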
Recommended profiles:
- strict: default for tenant-authored, portal self-service, imported, or marketplace-style CEL. Use an empty CEL context, expose only approved variables, add only pure helper functions, and enforce tight size and expression-shape limits. Do not expose the full context root, and disable regex until both Java and Rust provide matching bounded or linear-time behavior.
- standard: default for internal business rules. Keep allowlists and resource limits, but permit common pure helpers such as size, contains, startsWith, endsWith, contains_ignore_case, and bounded regex support if needed.
- internal-admin: limited to trusted operator-maintained rules. This may be closer to the selected CEL runtime's default behavior, but should still compile during rule load, validate references, enforce maximum input size, and protect reloads with the last known-good rule set.
Allowed:
- boolean logic
- comparisons
- arithmetic supported by the CEL implementation
- string operations
- list and map predicates
- approved pure helper functions
Not allowed:
- file access
- network access
- database access
- current time unless explicitly added as an input field
- random values
- mutation of the rule context
- action execution from inside CEL
Custom functions should be added conservatively. Native Light-Rule actions remain the extension point for side effects and transformations.
The core runtime object should be a policy-driven condition evaluator rather
than ad hoc logic embedded directly in RuleEngine:
RuleEngineOptions
  -> ConditionExecutionPolicy
    -> defaultCelProfile
    -> allowRuleProfileSelection
    -> profiles[name] = CelSecurityProfile
CelSecurityProfile
  -> allowedFunctions
  -> allowedRootVariables
  -> exposeContextRoot
  -> exposeTopLevelAliases
  -> maxExpressionBytes
  -> maxContextBytes
  -> maxStringBytes
  -> maxCollectionItems
  -> allowRegex
  -> allowTimeParsing
  -> allowComprehensions
  -> maxComprehensionNesting
CEL still needs resource and robustness controls because expressions run on request paths and can iterate over input data. Runtime and publish-time validation should:
- allow-list functions and variables, using compiled expression references where available
- reject functions that perform I/O, mutation, service lookup, action execution, random generation, or implicit current-time access
- cap expression length and input context size
- reject or limit expensive access to large request or response bodies
- compile during rule load and fail invalid expression shapes before request execution
- keep the last known-good rule set if reload validation fails
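A minimal sketch of two of these checks, load-time validation and last known-good reload, is shown below. All names (`LoadLimits`, `CandidateRule`, `RuleStore`) are hypothetical, and the sketch does not parse CEL: it assumes the list of referenced functions is supplied by whichever CEL compiler the runtime selects.

```rust
use std::collections::HashSet;
use std::sync::{Arc, RwLock};

/// Hypothetical subset of a security profile used for load-time checks.
pub struct LoadLimits {
    pub max_expression_bytes: usize,
    pub allowed_functions: HashSet<String>,
}

/// A rule as seen by the loader. `referenced_functions` is assumed to come
/// from the CEL compiler's reference information.
pub struct CandidateRule {
    pub rule_id: String,
    pub expression: String,
    pub referenced_functions: Vec<String>,
}

/// Reject oversized expressions and functions outside the allowlist.
pub fn validate_rule(rule: &CandidateRule, limits: &LoadLimits) -> Result<(), String> {
    if rule.expression.len() > limits.max_expression_bytes {
        return Err(format!(
            "{}: expression exceeds {} bytes",
            rule.rule_id, limits.max_expression_bytes
        ));
    }
    for f in &rule.referenced_functions {
        if !limits.allowed_functions.contains(f) {
            return Err(format!("{}: function '{}' is not allow-listed", rule.rule_id, f));
        }
    }
    Ok(())
}

/// Holds the active rule set and swaps it only after every rule validates.
pub struct RuleStore {
    active: RwLock<Arc<Vec<CandidateRule>>>,
}

impl RuleStore {
    pub fn new() -> Self {
        Self { active: RwLock::new(Arc::new(Vec::new())) }
    }

    /// Snapshot of the currently active rules.
    pub fn active(&self) -> Arc<Vec<CandidateRule>> {
        Arc::clone(&self.active.read().unwrap())
    }

    /// On any validation error the previous (last known-good) set stays active.
    pub fn reload(&self, candidate: Vec<CandidateRule>, limits: &LoadLimits) -> Result<(), String> {
        for rule in &candidate {
            validate_rule(rule, limits)?;
        }
        *self.active.write().unwrap() = Arc::new(candidate);
        Ok(())
    }
}
```

Validating the whole candidate set before taking the write lock means request paths never observe a partially loaded configuration.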
Phase ceilings should be enforced by runtime policy. Response phases such as
res-tra and res-fil should default to a strict ceiling or tight
maxContextBytes limits because they can include large response payloads.
Access-control phases may allow standard only when the exposed context is
small and bounded. A rule request for a more permissive profile than the phase
ceiling must be rejected or downgraded to the stricter effective profile.
For fully untrusted public input, evaluate CEL in a separate worker, process, or another resource-isolated execution path with CPU and memory limits. A Tokio timeout alone is not a complete guard for synchronous CPU-bound expression evaluation.
Portal Experience
Light-Portal should use conditionLanguage to choose the rule editor. This
keeps the form predictable and avoids mixing two mental models on the same
screen.
Recommended authoring modes:
- Builder: native condition rows with operand, operator, expected, and join controls
- CEL: advanced text area for one rule-level CEL expression
Recommended behavior:
- default new rules to native
- render condition subforms only for conditionLanguage: native
- render a CEL expression text area only for conditionLanguage: cel
- hide native condition controls when CEL is selected
- hide CEL expression controls when native is selected
- require confirmation when switching modes if the existing mode has content
- do not try to round-trip arbitrary CEL into native builder rows
- store the selected mode in rule_t.condition_language and in the JSON rule body as conditionLanguage
- for CEL rules, store only the selected profile name in rule_t.condition_security_profile and in the JSON rule body as conditionSecurityProfile; do not expose raw policy limits in the form
- do not show internal-admin in standard self-service forms; allow it only through checked-in runtime configuration or an explicitly authorized internal admin JWT/role path
The CEL editor should provide:
- syntax validation
- test context input
- expression result preview
- visible context field reference
- selected and effective security profile display
- rule test execution against the same backend evaluator used by runtime
Profile Approval Workflow
Light-Portal may allow a user to select a CEL security profile, but the selected profile is only a request. Runtime policy still computes the effective profile from the requested profile, the service maximum, the tenant maximum, and the rule phase ceiling.
Recommended publish behavior:
- strict: direct publish. If schema, CEL validation, and command authorization pass, create or update the rule immediately.
- standard: approval required. Submit the proposed rule change, create a worklist task for rule-admin and admin, and keep the change pending until approval.
- internal-admin: hidden from standard self-service authoring. If exposed to an operator-only flow, require stronger approval and never allow ordinary self-service users to request it.
For approval-required changes, the command side should not emit the final active
RuleCreated or RuleUpdated event at submission time. It should emit a
submission event such as RuleChangeSubmittedEvent or
RuleApprovalRequestedEvent, store the proposed rule body and requested profile,
and create the human-in-the-loop worklist task. Only approval should emit the
active rule event. Rejection should emit a rejection event and leave the active
rule unchanged.
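The command-side behavior described above can be sketched as a pair of pure functions. The event variants mirror the names in this section where they exist; `RuleChangeRejected`, `SubmitError`, and the function names are hypothetical placeholders rather than the portal's actual command API.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RequestedProfile {
    Strict,
    Standard,
    InternalAdmin,
}

#[derive(Debug, PartialEq, Eq)]
pub enum RuleEvent {
    /// Active rule event: the rule is live after this.
    RuleCreated { rule_id: String },
    /// Submission event: the proposed body is stored and a worklist task is created.
    RuleApprovalRequested { rule_id: String, profile: RequestedProfile },
    /// Hypothetical rejection event: the active rule is left unchanged.
    RuleChangeRejected { rule_id: String },
}

#[derive(Debug, PartialEq, Eq)]
pub enum SubmitError {
    /// internal-admin is not selectable from self-service authoring.
    ProfileNotSelfService,
}

/// Decide which event a self-service submission emits.
/// Schema, CEL validation, and command authorization are assumed to have passed.
pub fn on_submit(rule_id: &str, profile: RequestedProfile) -> Result<RuleEvent, SubmitError> {
    match profile {
        RequestedProfile::Strict => Ok(RuleEvent::RuleCreated { rule_id: rule_id.to_string() }),
        RequestedProfile::Standard => Ok(RuleEvent::RuleApprovalRequested {
            rule_id: rule_id.to_string(),
            profile,
        }),
        RequestedProfile::InternalAdmin => Err(SubmitError::ProfileNotSelfService),
    }
}

pub enum ApprovalDecision {
    Approve,
    Reject,
}

/// Only an approval emits the active rule event.
pub fn on_decision(rule_id: &str, decision: ApprovalDecision) -> RuleEvent {
    match decision {
        ApprovalDecision::Approve => RuleEvent::RuleCreated { rule_id: rule_id.to_string() },
        ApprovalDecision::Reject => RuleEvent::RuleChangeRejected { rule_id: rule_id.to_string() },
    }
}
```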
Assistant tasks can help the approver by summarizing the CEL expression, rule phase, requested profile, referenced context roots, use of response body fields, regex usage, and any runtime ceiling that would downgrade the effective profile. The assistant output is advisory only; the human approver remains responsible for the approval decision.
Recommended approval rules:
- changing the expression, action list, rule phase, requested profile, or exposed context assumptions invalidates prior approval
- downgrading from standard to strict can publish directly after validation
- upgrading from strict to standard or internal-admin requires approval
- requester and approver should be different users except for an explicit break-glass workflow
- approval audit should record requested profile, effective profile, requester, approver, approval time, assistant-task summary id, and approval comments
- pending rules must not be exported to runtime endpoint rule config until approved
Compatibility
Existing rule YAML remains valid.
Rules without conditionLanguage are treated as native. The database
migration should add rule_t.condition_language with default native, so
existing rows do not need their rule_body rewritten immediately.
Rules without conditionSecurityProfile use the runtime default CEL profile.
The field is meaningful only for CEL rules; native rules do not need a condition
security profile.
Native condition aliases must continue to work:
- operatorCode as alias for operator
- propertyPath as alias for operand
- actionClassName as alias for actionRef
CEL introduces a new capability. If the Java yaml-rule runtime needs to execute
the same rules, it must implement the same CEL rule shape. Until then, Java
runtimes must fail closed with a clear capability error, such as
UnsupportedConditionLanguageException, when loading or executing a rule with
conditionLanguage: cel. A runtime must not silently ignore a CEL rule because
that can fail open for access-control rules.
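The fail-closed requirement applies to any runtime that lacks a CEL evaluator. The sketch below shows the load-time check in Rust for consistency with the rest of this book; a Java runtime would raise an exception such as UnsupportedConditionLanguageException at the same point. The function and type names are illustrative assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConditionLanguage {
    Native,
    Cel,
}

#[derive(Debug, PartialEq, Eq)]
pub enum LoadError {
    UnsupportedConditionLanguage { rule_id: String, language: String },
}

/// Rules without `conditionLanguage` are treated as native.
/// Unknown values are an error rather than a silent fallback.
pub fn parse_language(rule_id: &str, raw: Option<&str>) -> Result<ConditionLanguage, LoadError> {
    match raw {
        None | Some("native") => Ok(ConditionLanguage::Native),
        Some("cel") => Ok(ConditionLanguage::Cel),
        Some(other) => Err(LoadError::UnsupportedConditionLanguage {
            rule_id: rule_id.to_string(),
            language: other.to_string(),
        }),
    }
}

/// Fail closed at load time when the runtime cannot evaluate the rule's language.
pub fn check_capability(
    rule_id: &str,
    raw: Option<&str>,
    runtime_supports_cel: bool,
) -> Result<ConditionLanguage, LoadError> {
    let language = parse_language(rule_id, raw)?;
    if language == ConditionLanguage::Cel && !runtime_supports_cel {
        return Err(LoadError::UnsupportedConditionLanguage {
            rule_id: rule_id.to_string(),
            language: "cel".to_string(),
        });
    }
    Ok(language)
}
```

Returning an error from the loader, instead of skipping the rule, is what prevents an access-control rule from being dropped silently.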
Java parity is feasible because Google maintains CEL-Java under the dev.cel
Maven group, including the dev.cel:cel artifact with compiler and runtime
APIs. The compatibility requirement is therefore mostly about aligning the rule
schema, context shape, custom functions, and error handling across the Rust and
Java runtimes.
Example: Access Control
ruleBodies:
  allowApprovedTransfer:
    common: Y
    ruleId: allowApprovedTransfer
    ruleName: Allow approved transfer
    ruleType: req-acc
    conditionLanguage: cel
    conditionSecurityProfile: strict
    expression: >
      auditInfo.subject_claims.ClaimsMap.role in roles
      && (
        toolArguments.amount < 1000
        || "transfer.approve" in auditInfo.subject_claims.ClaimsMap.scope
      )
    actions:
      - actionClassName: com.networknt.rule.RoleBasedAccessControlAction
endpointRules:
  /transfer@post:
    req-acc:
      - allowApprovedTransfer
    permission:
      roles:
        - teller
        - approver
Example: Response Filter Guard
ruleBodies:
  filterAccountsForPortalUsers:
    common: Y
    ruleId: filterAccountsForPortalUsers
    ruleName: Filter accounts for portal users
    ruleType: res-fil
    conditionLanguage: cel
    conditionSecurityProfile: strict
    expression: >
      statusCode == 200
      && responseBody != ""
      && auditInfo.subject_claims.ClaimsMap.role != null
    actions:
      - actionClassName: com.networknt.rule.ResponseRowFilterAction
Rollout Plan
- Add rule_t.condition_language with default native, optional rule_t.condition_security_profile, and check constraints.
- Extend the rule specification with native and CEL rule branches plus optional conditionSecurityProfile.
- Add conditionLanguage, conditionSecurityProfile, and expression fields to the Rust Rule model.
- Update command/query APIs so the portal can persist and read the condition language, security profile, and approval state without parsing ruleBody.
- Optionally accept operatorCode: cel as an import/compatibility alias and normalize it to the rule-level CEL shape.
- Choose and pin the Rust CEL crate behind an internal evaluator abstraction.
- Add runtime-owned CEL security profiles and policy-driven context building.
- Add approval workflow integration for standard and internal-admin profile requests, including worklist and assistant-task support.
- Dispatch inside RuleEngine::execute_rule based on conditionLanguage.
- Compile and cache CEL expressions during rule config load.
- Add unit tests for CEL true, CEL false, invalid expression, mode validation, and missing-field behavior.
- Add tests for custom native-parity helper functions.
- Add performance tests for context conversion with large toolArguments and response payloads.
- Add gateway integration tests using the existing rule context and the context root variable.
- Add rule test API support so Light-Portal can validate CEL before publish.
- Add portal mode-based rule editing, a controlled CEL profile selector, and approval UX for stronger profile requests.
- Document runtime compatibility and Java parity requirements.
Decision
Support both condition languages. Native Light-Rule conditions remain the
stable, portal-friendly default. CEL becomes an optional advanced expression
language inside the same rule engine for customers that need richer policy
expressions. A rule should select one condition language through
conditionLanguage; mixed native/CEL condition arrays are not the canonical
authoring model.
Design Document: Centralized Agentic Skill Registry
Subject: Transitioning from File-Based Markdown Skills to a Database-Backed Skill Registry
1. Executive Summary
Currently, most AI agent frameworks rely on localized Markdown (.md) files to define agent "skills." While Markdown is highly LLM-native and human-readable, it creates significant bottlenecks at enterprise scale in strict typing, API integration, and context window usage.
This document proposes transitioning to an Agentic Control Plane (Centralized Skill Registry) backed by a database. By decoupling skill metadata, schemas, and instructions, and by utilizing dynamic routing, we will achieve hierarchical structuring, strict schema enforcement, and progressive disclosure of tools to agents.
2. Problem Statement
Managing agent skills as flat Markdown files introduces several scaling challenges:
- Lack of Strict Typing: Markdown cannot enforce data types (e.g., ensuring a parameter is an integer vs. string), leading to hallucinated or malformed tool inputs.
- Context Window Exhaustion: Loading dozens or hundreds of skill definitions at startup overwhelms the LLM context window, increasing latency, token costs, and tool misuse.
- Static Deployments: Updating a skill or changing access permissions requires a full application redeploy.
- Poor Discoverability: Flat file structures offer no native mechanism for progressive disclosure or tool search.
3. Data Models & Formats
To solve the limitations of purely text-based skills, we will adopt a hybrid, structured format stored within a database (e.g., PostgreSQL/MongoDB). The architecture uses the right format for the right job:
- JSON Schema: Used strictly for defining parameters, inputs, and tool shapes. Natively supported by OpenAI/Anthropic/Google tool-calling APIs.
- LightAPI Description (YAML/JSON): Used to map endpoint-level API capabilities to skills across REST, JSON-RPC, gRPC, and MCP.
- OpenAPI / OpenRPC / Protobuf: Referenced by LightAPI where protocol-native specifications already exist.
- Executable Code (Python/JS) / URI: Stores the actual execution logic or the endpoint reference.
- Markdown: Retained only for the instructions or prompt fields, as LLMs excel at parsing markdown headers and lists for constraints and persona instructions.
LightAPI is the preferred source format for API-backed skills because it describes endpoint identity, protocol invocation, input schema, request mapping, result shape, examples, and behavior notes in one agent-oriented document. See LightAPI Description Design for the endpoint description model.
YAML and JSON are the external skill document formats. In the portal database,
they should not replace the Markdown instruction field. The normalized model is
structured columns and relationships for identity, versioning, taxonomy, tools,
and execution metadata, plus content_markdown for the LLM-facing instruction
body. If the portal later needs to persist a full structured skill document,
add a nullable JSONB skill-spec column beside content_markdown and normalize
YAML imports to JSON.
3.1 Proposed Database Schema Structure
Light Portal stores skills in structured catalog tables. Below is a representation of the skill payload:
{
  "skill_id": "sk_finance_001",
  "name": "generate_financial_report",
  "version": "1.2.0",
  "tags": ["finance", "reporting"],
  "tool_schema": {
    "type": "function",
    "function": {
      "name": "generate_financial_report",
      "description": "Generates a Q3 report based on ticker symbol.",
      "parameters": {
        "type": "object",
        "properties": {
          "ticker": {"type": "string", "description": "The stock ticker"}
        },
        "required": ["ticker"]
      },
      "response_schema": {
        "type": "object",
        "properties": {
          "report_url": {"type": "string"},
          "status": {"type": "string"}
        }
      }
    }
  },
  "execution": {
    "type": "rest_api",
    "endpoint_id": "ep_finance_report_001",
    "endpoint": "https://internal-api.company.com/v1/finance/report",
    "method": "POST"
  },
  "instructions": "## Role\nYou are a financial analyst.\n## Constraints\n- Never hallucinate financial data.\n- Always return exact numbers."
}
4. Hierarchical Structure & Progressive Disclosure
Loading hundreds of JSON schemas into an LLM's context window at once wastes tokens, adds latency, and degrades tool selection. The Centralized Controller will act as a mediator, enforcing hierarchy and progressive disclosure (giving the agent only the schemas it needs, exactly when it needs them).
4.1 Implementing Hierarchy & Tagging
Because JSON Schema does not have built-in folders, hierarchy and categorization are enforced via the platform's global entity management system:
- Namespacing: Tool names follow a strict convention: [domain]_[subdomain]_[action] (e.g., aws_rds_provision).
- Tags & Categories: Instead of hardcoded columns, the registry utilizes the entity_tag_t and entity_category_t tables (with entity_type = 'skill'). This allows for unlimited flat tagging and deep hierarchical folder structures that are consistent across the entire portal.
- Discovery API: Portal-query filters by these tags/categories to return scoped skill sets for specific agent personas. Agents cache the effective catalog locally and reload it when runtime cache-management invalidation is triggered.
4.2 Progressive Disclosure Patterns
Agents should not load every executable tool into the LLM context. Instead, they should load their assigned skill/tool catalog from the portal API, cache it locally, and use one of the following progressive disclosure patterns:
Phase 5 starts with the Rust light-agent. The agent loads
genai-query/getEffectiveAgentCatalog, keeps a local cache keyed by
hostId + agentDefId + serviceId + envTag, ranks cached skill/tool entries with
keyword and routing-field matching, and intersects the selected tool names with
the live gateway tools/list result before giving schemas to the model.
Execution remains gateway tools/call.
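The ranking and intersection step can be sketched as below. This is an illustration under stated assumptions, not the light-agent implementation: `CatalogEntry` and `select_tools` are hypothetical names, the score is a plain keyword-overlap count, and the live tool set stands in for the gateway tools/list result.

```rust
use std::collections::HashSet;

/// Lightweight cached summary of one assigned tool (illustrative shape).
pub struct CatalogEntry {
    pub tool_name: String,
    pub description: String,
}

/// Lowercased alphanumeric tokens; underscores in tool names act as separators.
fn tokens(text: &str) -> HashSet<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Rank cached entries by keyword overlap with the query, keeping only tools
/// that the live gateway currently lists.
pub fn select_tools(
    query: &str,
    cached: &[CatalogEntry],
    live_tools: &HashSet<String>,
    limit: usize,
) -> Vec<String> {
    let query_tokens = tokens(query);
    let mut scored: Vec<(usize, &CatalogEntry)> = cached
        .iter()
        // Intersection with the live tools/list result comes first.
        .filter(|e| live_tools.contains(&e.tool_name))
        .map(|e| {
            let entry_tokens = tokens(&format!("{} {}", e.tool_name, e.description));
            (query_tokens.intersection(&entry_tokens).count(), e)
        })
        .filter(|(score, _)| *score > 0)
        .collect();
    // Highest score first; ties are broken by name for deterministic output.
    scored.sort_by(|a, b| b.0.cmp(&a.0).then_with(|| a.1.tool_name.cmp(&b.1.tool_name)));
    scored.into_iter().take(limit).map(|(_, e)| e.tool_name.clone()).collect()
}
```

Only the tool names returned here would have their schemas handed to the model; a cached tool that the gateway no longer lists is dropped even if it scores well.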
Pattern A: Meta-Tools (Dynamic Injection)
The agent is booted with only two "meta-tools" designed for discovery.
- Local catalog search: Agent searches its cached assigned skills. The cache contains lightweight summaries and mapped tool names.
- Schema loading: Once the agent identifies the correct tool, it loads the schema from the local catalog cache or refreshes the cache from portal-query.
Pattern B: Semantic Tool RAG (Zero-Shot Discovery)
For highly complex systems with thousands of skills:
- Tool descriptions are embedded into a Vector Database (e.g., pgvector).
- When the user prompts the system (e.g., "Reset my AWS password"), portal-query or the agent's local cache performs semantic search and retrieves the top 3 most relevant JSON Schemas.
- The agent boots with only those 3 tools in its context.
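As a rough illustration of the retrieval step, the sketch below ranks tools by cosine similarity over in-memory embeddings. It is an assumption-laden stand-in: a production deployment would push this query into the vector store, and the embedding vectors would come from an embedding model rather than hand-written values.

```rust
/// Cosine similarity between two vectors; returns 0.0 if either has zero length.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Return the names of the `k` tools whose embeddings are closest to the query.
/// `tools` pairs a tool name with its precomputed description embedding.
pub fn top_k<'a>(query: &[f32], tools: &'a [(String, Vec<f32>)], k: usize) -> Vec<&'a str> {
    let mut scored: Vec<(f32, &str)> = tools
        .iter()
        .map(|(name, embedding)| (cosine(query, embedding), name.as_str()))
        .collect();
    // Sort by descending similarity; total_cmp gives a total order over f32.
    scored.sort_by(|a, b| b.0.total_cmp(&a.0));
    scored.into_iter().take(k).map(|(_, name)| name).collect()
}
```

Calling `top_k` with `k = 3` corresponds to the three schemas the agent boots with in this pattern.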
Pattern C: Multi-Agent Orchestration (Supervisor / Worker)
Hierarchy is mapped to agent teams.
- A Supervisor Agent holds routing tools (e.g., delegate_to_finance, delegate_to_devops).
- When delegate_to_devops is triggered, the supervisor routes to a DevOps Worker Agent, loading only the specific DevOps JSON schemas into its context.
5. Example Flow: Dynamic Loading in Action
User: "I need to provision a new database for the marketing team."
- Turn 1: Discovery
  - Agent Context: Has a local cache of assigned skill summaries.
  - Agent Action: Searches the local cache for provision database.
- Turn 2: High-Level Awareness
  - Local Cache Result: Returns token-efficient summaries from the portal catalog: [{"name": "aws_rds_provision", "description": "Creates AWS RDS DB"}, {"name": "mongo_atlas_create", "description": "Creates Mongo cluster"}]
  - Agent Action: Decides AWS is needed and loads the cached schema for aws_rds_provision.
- Turn 3: Strict Execution
  - Agent Catalog: Provides the full JSON schema (requiring instance_type, storage_gb).
  - Agent Action: Understands parameters and safely executes aws_rds_provision through the gateway tools/call path.
6. Operational Benefits & Security
By centralizing skills in a database, the platform gains enterprise-grade operational capabilities:
- Dynamic Updates: API endpoints, instructions, and schemas can be updated in the database without restarting agents.
- Permission-Aware Discovery (RBAC): By linking skills to LightAPI endpoint descriptions and api_endpoint_t, portal-query can limit catalog disclosure to the current agent or tenant, while runtime gateway policy still authorizes execution.
- A/B Testing: Portal catalog metadata can route 50% of an agent's requests to skill_v1 and 50% to skill_v2 to measure prompt/tool efficacy.
- Audit Logging: Catalog disclosure and gateway execution can be logged separately, preserving a compliance trail without moving tool execution into the registry.
- Distilled Memory RAG: Following the "Hindsight" pattern, raw conversation history (agent_session_history_t) is separated from RAG-optimized memory (session_memory_t). This prevents the "noisy context" problem while maintaining a perfect audit trail.
7. LightAPI As Skill Source
API-backed skills should be generated from endpoint-level LightAPI descriptions whenever possible.
The skill registry should store skill metadata, access control, grouping, and agent-facing instructions. The LightAPI description should remain the source of truth for endpoint invocation and verification details.
Recommended flow:
- Light-Portal creates or imports endpoint-level LightAPI descriptions.
- API owners enrich endpoint descriptions with examples, behavior notes, result cases, and visibility.
- Approved endpoint descriptions are published as agent skills.
- The agent loads assigned skill summaries from portal-query and caches them locally.
- When the agent selects a skill, it loads the relevant LightAPI disclosure level from the local cache or refreshes from portal-query.
- Execution goes through the gateway tools/call path, preserving runtime policy and downstream authorization.
This avoids manually duplicating every API endpoint as a separate hand-written skill while still giving agents strict schemas and progressive disclosure.
8. Workflow-Backed Skills
Some skills need more than instructions and a curated tool set. A skill that
must orchestrate several tools, wait for human approval, retry failed steps,
run assertions, or preserve a durable audit trail should be backed by
light-workflow.
The boundary should stay clear:
| Layer | Responsibility |
|---|---|
| Skill | Discovery metadata, taxonomy, instructions, allowed tools, and agent guidance. |
| Workflow | Ordered execution, branching, retries, assertions, human tasks, durable state, and audit events. |
| Gateway | Runtime tool execution through tools/list and tools/call. |
Workflow backing should be optional. Simple skills can stay as instructions plus
tool mappings. Durable or regulated processes should link to workflow
definitions and let light-workflow own execution.
Recommended storage:
- Keep wf_definition_t.definition as the canonical workflow YAML.
- Keep skill_t.content_markdown as the LLM-facing skill instruction body.
- Add skill_workflow_t to link skills to workflow definitions with a role such as primary, validation, remediation, or test.
- Treat skill_tool_t as the allowed tool set for a workflow-backed skill. Validation should flag workflow tool-call steps that are not linked to the skill.
The Portal Skill Workspace should embed a generic Workflow Editor instead of creating a skill-specific workflow runtime. The editor provides YAML editing, step preview, reference lookup, validation, and test runs. Skill authoring provides the surrounding context: skill metadata, taxonomy, allowed tools, effective prompt preview, and workflow link configuration.
9. Next Steps
- Complete phase 3 by adding category and tag assignment to existing skill create/update forms, backed by entity_category_t and entity_tag_t with entity_type = 'skill'.
- Save skill taxonomy through a composite skill command so the skill row and selected taxonomy associations are emitted from the same user action.
- Move the richer authoring workspace, effective prompt preview, skill_tool_t.config formalization, workflow-backed skills, and "create skill from LightAPI/tool" flows to phase 3.5.
- Build the generic Workflow Editor for YAML editing, parsed step preview, catalog references, validation, and workflow test runs.
- Complete phase 4 agent assignment by improving the agent_skill_t UI, adding an Agent Definition assignment context, and adding a batch assignment composite command that emits one AgentSkillCreatedEvent per selected skill.
- Enforce phase 4 assignment validation in command handlers and UI preflight: assigned skills must be active and must have at least one active direct skill_tool_t link. Workflow-backed skills still rely on skill_tool_t as the allowed tool set.
- Keep live gateway tools/list runtime executability checks as a diagnostics or governance concern, not as phase 4 persistence validation.
- Complete phase 5 for the Rust agent with the genai-query getEffectiveAgentCatalog endpoint, claim checks against host, sid, and env, local catalog caching, keyword/routing search, gateway tools/list intersection, and controller-driven cache invalidation.
- Complete phase 6 governance for the Rust agent only: normalize sensitivity tiers to public, internal, confidential, and restricted; filter blocked tools before catalog disclosure; compare the effective catalog with gateway tools/list through /diagnostics/tools; and keep execution through gateway tools/call.
- Enforce destructive, approval-required, and sensitivity metadata at the gateway with debug/auditInfo fields when a call is blocked. Do not use workflow audit_log_t for catalog disclosure; use auditInfo/file logging until a generic governance audit table is introduced.
- Keep current active row plus aggregate version as the approval/version boundary until workflow-owned approval state is implemented.
- Add publishing from LightAPI endpoint descriptions into the skill registry.
- Migrate existing file-based skills into structured catalog payloads, keeping instructions in Markdown and converting parameters to JSON Schema.
- Implement Pattern B (Semantic Tool RAG) after indexed catalog fields and embeddings are ready for production search.
Skill Workflow Orchestration
Status
Proposed demo design.
Executive Summary
This design describes a focused demo for agent-driven orchestration in Light-Fabric. The demo uses one agent with two skills:
- A skill that starts a workflow which calls two REST APIs directly.
- A skill that starts a workflow which calls the same two REST APIs through the MCP router.
Both paths solve the same business use case and return the same output. The visible difference is the execution trace:
- The REST workflow shows light-workflow invoking HTTP endpoints directly.
- The MCP workflow shows light-workflow invoking MCP tools/call, with light-gateway routing each tool call to the same backend REST APIs.
This demonstrates that skills provide agent-facing guidance and discovery, workflows provide durable orchestration, and the gateway provides the MCP data plane for tool execution.
Goals
- Show one agent selecting between two assigned skills.
- Show a workflow that orchestrates multiple REST APIs directly.
- Show a second workflow that orchestrates the same APIs through MCP tools.
- Keep the input and output contract identical across both workflows.
- Keep the demo small enough to explain in a few minutes.
- Preserve the runtime boundary: skills guide, workflows orchestrate, gateway executes MCP tool calls.
Non-Goals
- Do not benchmark REST versus MCP latency.
- Do not claim that MCP replaces REST. The demo shows two supported access patterns over the same backend capabilities.
- Do not require every skill to be workflow-backed. Simple skills can remain instructions plus allowed tools.
- Do not move MCP tool execution into the portal registry or agent catalog. Runtime tool execution stays on the gateway tools/call path.
- Do not make the demo depend on a large endpoint catalog.
Recommendation
Use two APIs, not one.
A one-API demo can show sequencing, but it does not clearly prove cross-service orchestration. Two APIs show a more realistic enterprise shape: the workflow has to collect data from one business capability and make a decision through another capability.
Use four endpoints for the base demo.
| Demo size | Endpoint count | Recommendation | Why |
|---|---|---|---|
| Smoke test | 2 | Optional only | Shows a happy path, but not enough variation. |
| Base demo | 4 | Recommended | Covers path parameters, query parameters, arrays, request bodies, branching, and transformation. |
| Advanced demo | 6 | Later phase | Adds parallel enrichment, compensation, or audit callbacks. |
The base demo should be small enough to run repeatedly while still proving meaningful orchestration behavior.
Demo Scenario
The demo domain is personalized offer recommendation.
The agent receives a prompt such as:
Recommend an offer for customer CUST-1001.
The agent can use either skill:
- Personalized Offer via REST Workflow
- Personalized Offer via MCP Router
If the prompt does not specify REST or MCP, the demo agent should not pick a path at random. It should ask a short clarification question:
Do you want to run this through the direct REST workflow or through the MCP
router workflow?
Scripted demos can avoid the clarification by naming the path in the prompt.
Both skills start a workflow that:
- Loads the customer profile.
- Loads customer preferences and consent.
- Stops if the customer has not consented.
- Searches for eligible offers.
- Selects the best offer.
- Records the offer decision.
- Returns a normalized decision payload.
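The step sequence above can be sketched as one function over a backend trait, which is one way to keep the REST and MCP paths on the same contract. Everything here is illustrative: the trait, the type names, the `NoEligibleOffer` outcome, and the seed values are assumptions, and the real orchestration is expected to live in a light-workflow definition rather than in Rust code.

```rust
pub struct Profile {
    pub segment: String,
    pub state: String,
}

pub struct Preferences {
    pub consented: bool,
    pub category: String,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Offer {
    pub offer_id: String,
    pub score: u32,
}

/// Normalized decision payload returned by both workflows.
#[derive(Debug, PartialEq, Eq)]
pub enum OfferDecision {
    NoConsent,
    NoEligibleOffer, // assumed outcome, not named in the design
    Offered { offer_id: String, decision_id: String },
}

/// One implementation could call REST directly, another could go through MCP tools/call.
pub trait OfferBackend {
    fn load_profile(&self, customer_id: &str) -> Profile;
    fn load_preferences(&self, customer_id: &str) -> Preferences;
    fn search_offers(&self, profile: &Profile, prefs: &Preferences) -> Vec<Offer>;
    fn record_decision(&self, customer_id: &str, offer: &Offer) -> String;
}

pub fn run_offer_workflow(backend: &dyn OfferBackend, customer_id: &str) -> OfferDecision {
    let profile = backend.load_profile(customer_id);
    let prefs = backend.load_preferences(customer_id);
    // Stop before any offer lookup when consent is missing.
    if !prefs.consented {
        return OfferDecision::NoConsent;
    }
    let offers = backend.search_offers(&profile, &prefs);
    // "Best" is assumed to mean the highest score.
    match offers.into_iter().max_by_key(|o| o.score) {
        None => OfferDecision::NoEligibleOffer,
        Some(best) => {
            let decision_id = backend.record_decision(customer_id, &best);
            OfferDecision::Offered { offer_id: best.offer_id, decision_id }
        }
    }
}

/// Deterministic in-memory stand-in with made-up seed data.
pub struct SeedBackend {
    pub consented: bool,
}

impl OfferBackend for SeedBackend {
    fn load_profile(&self, _customer_id: &str) -> Profile {
        Profile { segment: "gold".to_string(), state: "ON".to_string() }
    }
    fn load_preferences(&self, _customer_id: &str) -> Preferences {
        Preferences { consented: self.consented, category: "travel".to_string() }
    }
    fn search_offers(&self, _profile: &Profile, _prefs: &Preferences) -> Vec<Offer> {
        vec![
            Offer { offer_id: "OFF-1".to_string(), score: 40 },
            Offer { offer_id: "OFF-2".to_string(), score: 75 },
        ]
    }
    fn record_decision(&self, customer_id: &str, offer: &Offer) -> String {
        format!("DEC-{}-{}", customer_id, offer.offer_id)
    }
}
```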
APIs And Endpoints
Customer Profile API
The Customer Profile API owns customer data and preferences.
| Endpoint | Shape | Purpose |
|---|---|---|
| GET /customers/{customerId} | Path parameter, object response | Load customer identity, segment, region, and account status. |
| GET /customers/{customerId}/preferences?channel=portal | Path parameter plus query parameter | Load consent, preferred categories, and contact channel rules. |
Offer Decision API
The Offer Decision API owns eligible offer lookup and decision recording.
| Endpoint | Shape | Purpose |
|---|---|---|
| GET /offers?segment={segment}&state={state}&category={category} | Query parameters, array response | Search active offers matching the customer profile and preferences. |
| POST /offer-decisions | JSON request body, object response | Persist the selected offer decision and return a decision id. |
Demo API Runtime Services
The two business APIs should be implemented as real Rust services using the
light-axum framework, not as ad hoc mocks. This keeps the demo aligned with
normal Light-Fabric service lifecycle behavior:
- load runtime configuration from config-server
- bind HTTP using configured server settings
- register with controller through portal-registry
- appear in the control panel service-discovery view
- support gateway service discovery by serviceId and envTag
Recommended demo apps:
| App | Service id | Default HTTP port | Purpose |
|---|---|---|---|
| demo-customer-profile-api | com.networknt.demo.customer-profile-1.0.0 | 8085 | Serves customer profile and preference data. |
| demo-offer-decision-api | com.networknt.demo.offer-decision-1.0.0 | 8086 | Serves offer lookup and decision recording. |
The ports are config defaults only. They must be configurable through config-server values so local, Docker, Kubernetes, and shared demo environments can choose different ports without recompiling.
Both services should expose:
GET /health
The API endpoints should return deterministic demo data. A database is not required for the first demo; in-memory seed data is enough as long as the data is stable and documented. If later demos need persistence, keep it behind the same endpoint contract.
Light-Axum Bootstrap
Each demo API should follow the normal light-axum pattern: implement
AxumApp, return an axum::Router, and let LightRuntimeBuilder own binding,
configuration, shutdown, and controller registration.
The service should read config from the same runtime config files used by other Light-Fabric services:
startup.yml
server.yml
portal-registry.yml
Example config-server values for the Customer Profile API:
startup.host: dev.lightapi.net
startup.externalConfigDir: /var/lib/demo-customer-profile-api/config-cache
light-config-server-uri: https://config-server.lightapi.svc.cluster.local:8435
server.serviceId: com.networknt.demo.customer-profile-1.0.0
server.environment: demo
server.ip: 0.0.0.0
server.advertisedAddress: demo-customer-profile-api
server.httpPort: 8085
server.enableHttp: true
server.enableHttps: false
server.enableRegistry: true
server.startOnRegistryFailure: true
portalRegistry.portalUrl: https://controller.lightapi.svc.cluster.local:8438
Example config-server values for the Offer Decision API:
startup.host: dev.lightapi.net
startup.externalConfigDir: /var/lib/demo-offer-decision-api/config-cache
light-config-server-uri: https://config-server.lightapi.svc.cluster.local:8435
server.serviceId: com.networknt.demo.offer-decision-1.0.0
server.environment: demo
server.ip: 0.0.0.0
server.advertisedAddress: demo-offer-decision-api
server.httpPort: 8086
server.enableHttp: true
server.enableHttps: false
server.enableRegistry: true
server.startOnRegistryFailure: true
portalRegistry.portalUrl: https://controller.lightapi.svc.cluster.local:8438
server.advertisedAddress must be a reachable address, not 0.0.0.0. In
Kubernetes, use the Service DNS name. In local Docker Compose, use the Compose
service name. In a native VM demo, use the VM hostname or another reachable
address.
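The reachability rule above can be checked at startup. The following is a minimal sketch using only the standard library; `validate_advertised_address` is a hypothetical helper and not part of light-runtime. It only rejects empty values and wildcard bind addresses, and treats everything else as a hostname.

```rust
use std::net::IpAddr;

/// Hypothetical startup check: rejects values that can never be a reachable
/// advertised address. It does not attempt DNS resolution.
pub fn validate_advertised_address(addr: &str) -> Result<(), String> {
    let trimmed = addr.trim();
    if trimmed.is_empty() {
        return Err("advertisedAddress must not be empty".to_string());
    }
    // Literal IPs: reject wildcard bind addresses such as 0.0.0.0 and ::.
    if let Ok(ip) = trimmed.parse::<IpAddr>() {
        if ip.is_unspecified() {
            return Err(format!("{trimmed} is a bind address, not a reachable address"));
        }
    }
    // Anything else is treated as a DNS name (Service name, Compose name, VM hostname).
    Ok(())
}
```

With the example values above, `demo-customer-profile-api` passes and `0.0.0.0` is rejected.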
Controller Registration
The services should register with controller using the runtime's
portal-registry integration. The controller registration payload must include
at least:
- serviceId
- envTag
- protocol
- advertised address
- port
- discovery token or portal registry token, according to environment policy
After startup, the control panel should show two registered service instances:
com.networknt.demo.customer-profile-1.0.0 / demo
com.networknt.demo.offer-decision-1.0.0 / demo
The MCP router configuration should prefer these service IDs over fixed
targetHost values where service discovery is available. Fixed targetHost
values are still useful for a minimal local smoke test.
Optional Advanced Endpoints
The base demo should start with four endpoints. If we later want to demonstrate more workflow shapes, add one or two optional endpoints:
| Endpoint | Shape Demonstrated | Use |
|---|---|---|
| GET /customers/{customerId}/risk | Parallel enrichment | Run profile, preferences, and risk lookup before offer selection. |
| POST /offer-decisions/{decisionId}/audit | Follow-up side effect | Record a compliance audit event after the decision is created. |
| POST /offer-decisions/{decisionId}/cancel | Compensation | Cancel the decision if a later step fails. |
Agent, Skills, And Workflows
Use one agent so the demo highlights skill selection rather than agent handoff.
| Object | Name | Responsibility |
|---|---|---|
| Agent | Demo Orchestration Agent | Receives the user request and selects one of the assigned skills. |
| Skill | Personalized Offer via REST Workflow | Guides the agent to start the direct REST workflow. |
| Skill | Personalized Offer via MCP Router | Guides the agent to start the MCP-backed workflow. |
| Workflow | personalized-offer-rest-v1 | Orchestrates direct HTTP calls to the two REST APIs. |
| Workflow | personalized-offer-mcp-v1 | Orchestrates MCP tool calls through the gateway router. |
The skill registry should link each skill to its workflow definition through
skill_workflow_t. The workflow definition remains canonical in
wf_definition_t.definition. The skill content_markdown remains
agent-facing guidance, not the executable workflow source.
Execution Paths
Direct REST Workflow
User prompt
-> Demo Orchestration Agent
-> Personalized Offer via REST Workflow skill
-> light-workflow
-> Customer Profile API
-> Offer Decision API
-> normalized decision result
This path is useful for showing direct, durable API orchestration.
MCP Router Workflow
User prompt
-> Demo Orchestration Agent
-> Personalized Offer via MCP Router skill
-> light-workflow
-> MCP tools/call
-> light-gateway MCP router
-> Customer Profile API
-> Offer Decision API
-> normalized decision result
This path is useful for showing MCP protocol orchestration over the same backend API capabilities.
Common Workflow Contract
Both workflows should accept the same input:
{
"customerId": "CUST-1001",
"channel": "portal"
}
Both workflows should return the same successful output shape:
{
"status": "APPROVED",
"customerId": "CUST-1001",
"selectedOfferId": "OFFER-TRAVEL-01",
"decisionId": "DEC-1001"
}
Both workflows should return comparable business outcomes for known edge cases:
{
"status": "NO_CONSENT",
"customerId": "CUST-3003",
"reason": "Customer has not consented to personalized offers."
}
{
"status": "NO_ELIGIBLE_OFFER",
"customerId": "CUST-2002",
"reason": "No active offer matches the customer profile and preferences."
}
Workflow Shape
The REST and MCP workflows should have the same logical steps.
| Step | REST workflow action | MCP workflow action |
|---|---|---|
| Load profile | GET /customers/{customerId} | tools/call customer_get_profile |
| Load preferences | GET /customers/{customerId}/preferences | tools/call customer_get_preferences |
| Check consent | Workflow condition | Workflow condition |
| Search offers | GET /offers | tools/call offer_search |
| Select offer | Workflow expression or rule | Workflow expression or rule |
| Record decision | POST /offer-decisions | tools/call offer_record_decision |
| Return result | Workflow output mapping | Workflow output mapping |
The workflow should own branching, retries, and output normalization. The agent should not manually sequence each API call after the workflow starts.
Error Handling And Retries
Business outcomes and technical failures should be treated differently.
Business outcomes are expected workflow results and should not be retried:
- NO_CONSENT
- NO_ELIGIBLE_OFFER
Technical failures should use bounded workflow retries:
| Failure | Recommended behavior |
|---|---|
| Customer Profile API timeout | Retry the profile step with exponential backoff. |
| Offer Decision API returns 503 | Retry the affected offer step with exponential backoff. |
| Gateway MCP tools/call timeout | Retry the MCP tool-call step with the same workflow policy. |
| Persistent downstream failure | End with a controlled technical failure result and preserve the workflow trace. |
Recommended transient retry status codes:
408, 429, 502, 503, 504
The POST /offer-decisions step should include an idempotency key derived from
the workflow instance id and selected offer id. This prevents duplicate
decisions when a retry happens after the backend processed the first request
but the response was lost.
For parity, the REST and MCP workflows should use the same retry policy. In the MCP path, the gateway should preserve enough error detail for the workflow trace to show the tool name, mapped backend endpoint, status code, and correlation id.
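The retry rules in this section can be sketched with three small helpers. This is an illustration under assumptions, not the light-workflow implementation: the function names, the doubling factor, the cap, and the `instance:offer` key format are all hypothetical choices.

```rust
use std::time::Duration;

/// True for the transient HTTP status codes recommended above.
pub fn is_transient_status(status: u16) -> bool {
    matches!(status, 408 | 429 | 502 | 503 | 504)
}

/// Exponential backoff: base * 2^attempt, capped at `max`. `attempt` starts at 0.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    // On overflow fall back to the cap; otherwise never exceed the cap.
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}

/// Idempotency key derived from the workflow instance id and selected offer id.
/// A retry of the same step produces the same key.
pub fn idempotency_key(workflow_instance_id: &str, offer_id: &str) -> String {
    format!("{workflow_instance_id}:{offer_id}")
}
```

Because the key depends only on the workflow instance and the selected offer, a retried POST /offer-decisions carries the same key as the first attempt, which is what lets the backend detect the duplicate.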
MCP Tool Mapping
The MCP workflow should use a small, explicit tool set.
| MCP tool | Backend endpoint | Arguments |
|---|---|---|
| customer_get_profile | GET /customers/{customerId} | customerId |
| customer_get_preferences | GET /customers/{customerId}/preferences | customerId, channel |
| offer_search | GET /offers | segment, state, category |
| offer_record_decision | POST /offer-decisions | customerId, offerId, channel, source, reason |
The MCP tool input schemas should be normalized JSON objects. The gateway router maps those objects to path parameters, query parameters, or request bodies for the backend REST APIs.
The MCP skill should list these tools in skill_tool_t as its allowed runtime
tool set. Workflow validation should flag an MCP tool-call step if it references
a tool that is not linked to the skill.
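The validation rule above amounts to a set difference between the tools referenced by workflow steps and the tools linked to the skill. A minimal sketch follows; `unlinked_tools` is a hypothetical helper, and in practice the linked list would be read from skill_tool_t rather than passed in as a slice.

```rust
use std::collections::BTreeSet;

/// Returns the tool names referenced by workflow steps that are not linked to
/// the skill, in first-seen order and without duplicates.
pub fn unlinked_tools<'a>(step_tools: &[&'a str], linked_tools: &[&str]) -> Vec<&'a str> {
    let linked: BTreeSet<&str> = linked_tools.iter().copied().collect();
    let mut seen = BTreeSet::new();
    step_tools
        .iter()
        .copied()
        .filter(|tool| !linked.contains(*tool) && seen.insert(*tool))
        .collect()
}
```

An empty result means the MCP workflow passes this check; any returned name should be reported as a validation error on the step that references it.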
Gateway Tool Configuration Example
Current gateway HTTP tool execution maps GET arguments to query parameters and
sends non-GET arguments as JSON request bodies. To support endpoint shapes such
as GET /customers/{customerId} without changing the backend API, the demo
should add or configure explicit path-template substitution before the request
is sent.
Recommended minimal mapping shape:
mcp-router.tools:
- name: customer_get_profile
description: Get a customer profile by id.
protocol: http
serviceId: com.networknt.demo.customer-profile-1.0.0
envTag: demo
path: /customers/{customerId}
method: GET
apiType: http
inputSchema:
type: object
required:
- customerId
properties:
customerId:
type: string
toolMetadata:
pathParams:
- customerId
With this mapping, the MCP tool call:
{
"name": "customer_get_profile",
"arguments": {
"customerId": "CUST-1001"
}
}
should be routed to:
GET /customers/CUST-1001
The path parameter should not also be appended as a query parameter. Arguments
not listed under pathParams can still be appended as query parameters for GET
requests or sent as JSON body fields for POST requests.
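The substitution rule can be sketched as a function that consumes the declared path parameters and returns the remaining arguments for query or body mapping. This is an assumption-laden illustration, not the current gateway code: `apply_path_params` is a hypothetical name, arguments are simplified to strings, and percent-encoding of values is omitted.

```rust
use std::collections::BTreeMap;

/// Replaces `{name}` placeholders for each declared path parameter and returns
/// the concrete path plus the arguments that were not used in the path.
/// Fails if a declared path parameter is missing from the arguments.
pub fn apply_path_params(
    template: &str,
    path_params: &[&str],
    args: &BTreeMap<String, String>,
) -> Result<(String, BTreeMap<String, String>), String> {
    let mut path = template.to_string();
    let mut remaining = args.clone();
    for name in path_params {
        // Removing the argument ensures it is not also sent as a query parameter.
        let value = remaining
            .remove(*name)
            .ok_or_else(|| format!("missing path parameter: {name}"))?;
        path = path.replace(&format!("{{{name}}}"), &value);
    }
    Ok((path, remaining))
}
```

For customer_get_preferences, calling it with customerId and channel yields the path /customers/CUST-1001/preferences and leaves channel to be appended as a query parameter.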
Skill Content Markdown Guidance
The skill content_markdown should explain when and how the agent should use
the skill. It should not duplicate the workflow definition or the full API
contract.
Example REST skill content:
## Purpose
Use this skill when the user asks for a personalized offer decision through the
direct REST workflow.
## Inputs
- customerId: customer identifier, such as CUST-1001
- channel: request channel, default portal
## Behavior
- Start workflow personalized-offer-rest-v1.
- Return the workflow result as the answer.
- Do not manually call offer APIs outside the workflow.
- If the user does not specify REST or MCP, ask which execution path they want.
Example MCP skill content:
## Purpose
Use this skill when the user asks to demonstrate MCP router orchestration for a
personalized offer decision.
## Inputs
- customerId: customer identifier, such as CUST-1001
- channel: request channel, default portal
## Behavior
- Start workflow personalized-offer-mcp-v1.
- The workflow will call MCP tools through the gateway.
- Return the workflow result as the answer.
- If the user does not specify REST or MCP, ask which execution path they want.
Structured execution metadata belongs in registry rows and workflow definitions, not only in markdown. The markdown is the LLM-facing explanation.
Output Normalization
The workflows should not pass raw endpoint responses directly to the agent. They should normalize backend responses into a stable business result.
Example raw POST /offer-decisions response:
{
"decisionId": "DEC-1001",
"customerId": "CUST-1001",
"offerId": "OFFER-TRAVEL-01",
"decision": "approved",
"createdAt": "2026-05-25T14:12:00Z",
"auditRef": "AUD-7788"
}
Normalized workflow output:
{
"status": "APPROVED",
"customerId": "CUST-1001",
"selectedOfferId": "OFFER-TRAVEL-01",
"decisionId": "DEC-1001"
}
The workflow should own this transformation so the REST and MCP variants produce identical final results even if their intermediate transport envelopes are different.
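As a rough illustration of the transformation, the mapping from the raw response to the normalized result can be written as a plain function. The struct and function names are hypothetical, and JSON parsing is left out so the sketch stays dependency-free; in the demo this mapping would live in the workflow output mapping.

```rust
/// Fields of the raw POST /offer-decisions response that the workflow reads.
/// createdAt and auditRef are intentionally not carried into the result.
pub struct RawDecision {
    pub decision_id: String,
    pub customer_id: String,
    pub offer_id: String,
    pub decision: String,
}

/// Stable business result returned by both workflows.
#[derive(Debug, PartialEq)]
pub struct NormalizedResult {
    pub status: String,
    pub customer_id: String,
    pub selected_offer_id: String,
    pub decision_id: String,
}

/// Maps the backend fields to the normalized shape: `decision` is upper-cased
/// into `status`, and `offerId` is renamed to `selectedOfferId`.
pub fn normalize(raw: &RawDecision) -> NormalizedResult {
    NormalizedResult {
        status: raw.decision.to_uppercase(),
        customer_id: raw.customer_id.clone(),
        selected_offer_id: raw.offer_id.clone(),
        decision_id: raw.decision_id.clone(),
    }
}
```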
Demo Data
Use deterministic seed data so the demo is repeatable.
| Customer | Profile | Preferences | Expected result |
|---|---|---|---|
| CUST-1001 | Premium segment, active, Ontario | Consent true, travel preferred | APPROVED with OFFER-TRAVEL-01. |
| CUST-2002 | Standard segment, active, Ontario | Consent true, travel preferred | NO_ELIGIBLE_OFFER. |
| CUST-3003 | Premium segment, active, Ontario | Consent false | NO_CONSENT. |
Seed offers:
| Offer | Match condition | Result |
|---|---|---|
| OFFER-TRAVEL-01 | segment=premium, state=ON, category=travel | Eligible for CUST-1001. |
| OFFER-CASHBACK-01 | segment=premium, state=BC, category=shopping | Not eligible for Ontario travel scenario. |
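The seed tables can be encoded as in-memory data to confirm that the expected results follow from the match conditions. This is a sketch only: the type and function names are hypothetical, Ontario is encoded as ON to match the offer conditions, and the preferred category for CUST-3003 is a placeholder because the table does not specify one.

```rust
#[derive(Debug, PartialEq)]
pub enum Outcome {
    Approved { offer_id: &'static str },
    NoConsent,
    NoEligibleOffer,
}

pub struct Customer {
    pub segment: &'static str,
    pub state: &'static str,
    pub consent: bool,
    pub preferred_category: &'static str,
}

pub struct Offer {
    pub id: &'static str,
    pub segment: &'static str,
    pub state: &'static str,
    pub category: &'static str,
}

pub static OFFERS: [Offer; 2] = [
    Offer { id: "OFFER-TRAVEL-01", segment: "premium", state: "ON", category: "travel" },
    Offer { id: "OFFER-CASHBACK-01", segment: "premium", state: "BC", category: "shopping" },
];

pub fn seed_customer(id: &str) -> Option<Customer> {
    match id {
        "CUST-1001" => Some(Customer { segment: "premium", state: "ON", consent: true, preferred_category: "travel" }),
        "CUST-2002" => Some(Customer { segment: "standard", state: "ON", consent: true, preferred_category: "travel" }),
        // Placeholder category: the seed table only states that consent is false.
        "CUST-3003" => Some(Customer { segment: "premium", state: "ON", consent: false, preferred_category: "travel" }),
        _ => None,
    }
}

/// Consent is checked first; only then is an offer matched on segment, state, and category.
pub fn decide(customer: &Customer) -> Outcome {
    if !customer.consent {
        return Outcome::NoConsent;
    }
    OFFERS
        .iter()
        .find(|o| {
            o.segment == customer.segment
                && o.state == customer.state
                && o.category == customer.preferred_category
        })
        .map_or(Outcome::NoEligibleOffer, |o| Outcome::Approved { offer_id: o.id })
}
```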
Demo Script
Run the REST workflow path first:
Use the REST workflow skill to recommend an offer for CUST-1001.
Expected observation:
- The agent selects Personalized Offer via REST Workflow.
- The workflow trace shows direct HTTP calls to the Customer Profile API and Offer Decision API.
- The final response contains status=APPROVED and a decision id.
Run the MCP workflow path second:
Use the MCP router skill to recommend an offer for CUST-1001.
Expected observation:
- The agent selects Personalized Offer via MCP Router.
- The workflow trace shows MCP tools/call invocations.
- The gateway trace shows those tool calls routed to the same backend REST endpoints.
- The final response uses the same output shape as the REST workflow.
Then run one edge case:
Use either skill to recommend an offer for CUST-3003.
Expected observation:
- The workflow stops after the consent check.
- No offer decision is recorded.
- The result is NO_CONSENT.
Run one ambiguity case:
Recommend an offer for CUST-1001.
Expected observation:
- The agent asks whether to use the direct REST workflow or the MCP router workflow.
- After the user chooses, the agent starts the selected workflow.
Run one technical failure case:
Use the MCP router skill to recommend an offer for CUST-1001 while the Offer
Decision API returns one transient 503.
Expected observation:
- The workflow retries the failed tool-call step.
- The gateway trace records the failed offer_record_decision call and the successful retry.
- The final response still uses the normalized APPROVED output shape.
Portal Authoring Flow
The portal should make the demo visible from the existing GenAI and workflow surfaces:
- Create or import the two REST APIs and four endpoint descriptions.
- Implement the two APIs as light-axum services.
- Add config-server values for both API services.
- Start both services and verify controller registration.
- Publish MCP router tools for the same four endpoints.
- Create personalized-offer-rest-v1 in the workflow catalog.
- Create personalized-offer-mcp-v1 in the workflow catalog.
- Create the two skills in the skill registry.
- Link each skill to its primary workflow through skill_workflow_t.
- Link the MCP skill to its allowed tool set through skill_tool_t.
- Assign both skills to Demo Orchestration Agent.
- Use Skill Workspace preview and test panels to validate the effective prompt, workflow link, allowed tools, and sample test input.
Validation Rules
The authoring experience should validate the following before the demo is considered complete:
- Each skill has exactly one primary workflow link.
- The REST workflow does not require MCP tools.
- The MCP workflow references only MCP tools linked through skill_tool_t.
- Both workflows declare the same input schema.
- Both workflows declare the same normalized output shape.
- The four backend endpoint descriptions are active.
- Both demo API services load config from config-server.
- Both demo API services register with controller and appear in the control panel service-discovery view.
- The MCP router tools/list result includes the four expected tool names.
- MCP router tools resolve the demo APIs by serviceId and envTag in the service-discovery environment.
- MCP path-parameter mappings are validated before the workflow test run.
- POST /offer-decisions includes an idempotency key for retry safety.
- Test runs for CUST-1001, CUST-2002, and CUST-3003 produce the expected outcomes.
Observability
The demo should show three different traces:
- Agent trace: which skill the agent selected and what workflow it started.
- Workflow trace: step order, branches, retries, and final output.
- Gateway trace: MCP tool name, mapped backend endpoint, status, duration, and correlation id for the MCP path.
Use the same correlation id across the agent request, workflow instance, and gateway calls where possible. This makes the REST and MCP execution paths easy to compare.
Security And Authorization
Authorization should be enforced at each layer:
- The agent can discover only assigned skills.
- The workflow can start only definitions visible to the authenticated caller or service identity.
- The MCP skill can expose only tools linked to the skill and allowed for the agent.
- The gateway still performs runtime MCP access checks before executing tools/call.
- Backend REST APIs continue to enforce their own authorization policies.
The skill registry is not a runtime bypass. It narrows discovery and guidance, while the workflow and gateway remain responsible for execution-time controls.
Context And Auth Propagation
The demo should explicitly show that caller context is preserved.
For direct REST workflow steps:
- The workflow start request records the initiating user, host, tenant, correlation id, and authorization context.
- The workflow executor builds outbound REST calls with the correct caller context. If the original bearer token is safe to forward, it can be passed through. Otherwise, the workflow service should use a service token with on-behalf-of metadata that preserves the initiating subject.
- Backend APIs enforce their normal authorization policies.
For MCP workflow steps:
- light-workflow calls the gateway MCP endpoint with the same correlation, tenant, locale, and authorization context.
- light-gateway validates the MCP request and runtime tool authorization.
- The MCP router forwards the allowed caller headers to the backend REST API while regenerating transport-specific headers such as Host, Content-Length, and connection management headers.
- Backend APIs see the same business identity context they would see on the direct REST path.
The trace should show this propagation without exposing sensitive token values.
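The header handling on the MCP path can be sketched as an allow-list filter that also drops the headers the router regenerates. The function name, the allow-list contents, and the exact set of connection management headers below are assumptions for illustration; the real gateway configuration may differ.

```rust
/// Headers the router regenerates for the outbound request instead of
/// forwarding (compared case-insensitively). The list is illustrative.
const REGENERATED: [&str; 6] = [
    "host",
    "content-length",
    "connection",
    "keep-alive",
    "transfer-encoding",
    "upgrade",
];

/// Keeps only caller headers that are on the allow-list and are not
/// transport-specific. Header names keep their original casing.
pub fn forwardable_headers(
    incoming: &[(String, String)],
    allowed: &[&str],
) -> Vec<(String, String)> {
    incoming
        .iter()
        .filter(|(name, _)| {
            let lower = name.to_ascii_lowercase();
            let regenerated = REGENERATED.iter().any(|h| *h == lower);
            let is_allowed = allowed.iter().any(|a| a.eq_ignore_ascii_case(&lower));
            !regenerated && is_allowed
        })
        .cloned()
        .collect()
}
```

When tracing the forwarded set, log header names only so that token values stay out of the trace.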
Acceptance Criteria
- One demo agent has both skills assigned.
- The REST skill starts personalized-offer-rest-v1.
- The MCP skill starts personalized-offer-mcp-v1.
- Both workflows accept the same input JSON.
- Both workflows return the same normalized output shape.
- The REST workflow trace shows direct REST calls to two APIs.
- The MCP workflow trace shows MCP tools/call routed through the gateway to the same two APIs.
- The two APIs run as light-axum services with config-server supplied HTTP ports.
- The two APIs register with controller and are visible in the control panel service-discovery view.
- The demo succeeds for CUST-1001.
- The demo returns controlled business outcomes for CUST-2002 and CUST-3003.
- Ambiguous user prompts trigger a clarification question instead of random skill selection.
- A transient 503 from the Offer Decision API is retried and appears in the workflow trace.
- The MCP path preserves caller context through workflow, gateway, and backend REST calls.
Related Designs
Hindsight Memory
Hindsight Memory is the core memory system for light-rs, designed to move beyond simple chat logs. Instead of just remembering what was said, the agent learns and forms mental models over time.
This design is strongly inspired by the paper Hindsight is 20/20: Building Agent Memory that Retains, Recalls, and Reflects and extends it with multi-tenant support.
1. Core Concepts
Hindsight memory organizes information into three distinct "Pathway" types:
- World Facts: Objective truths about the environment (e.g., "The production server is in US-East-1").
- Experiences: The agent's own history of actions and results (e.g., "I tried to deploy to US-East-1 and it failed due to a timeout").
- Mental Models: Synthesized understandings formed by reflecting on facts and experiences (e.g., "Deployments to US-East-1 are unstable during peak hours").
2. The Three Operations
Interaction with the memory system is standardized into three primary operations:
Retain (Storage)
The retain operation ingests information. Behind the scenes, the system:
- Extracts entities and relationships.
- Normalizes time and temporal data.
- Stores the data in agent_memory_unit_t.
Recall (Retrieval)
The recall operation retrieves relevant context using a hybrid strategy:
- Semantic: Vector similarity using the hnsw index.
- Graph: Following links in agent_memory_link_t (causes, enables, prevents).
- Temporal: Time-series filtering.
Reflect (Synthesis)
The reflect operation performs "deep thinking." It analyzes existing memories to generate new insights, which are stored in agent_memory_reflection_t.
3. Database Architecture
The Hindsight system is fully integrated into the portal's multi-tenant schema:
| Table Name | Description |
|---|---|
| agent_memory_bank_t | The primary container. Defines personality and disposition (skepticism, empathy). |
| agent_memory_doc_t | Source documents (logs, files, transcripts) that provide the raw text for memory units. |
| agent_memory_unit_t | Sentence-level "atoms" of thought. Stores content, embeddings, and fact types (world, experience, etc.). |
| agent_memory_entity_t | Resolved Knowledge Graph nodes, optionally linked to platform users (user_t). |
| agent_memory_unit_entity_t | The join table linking individual memories to the entities they mention. |
| agent_memory_entity_cooccur_t | Association graph tracking concept relationships and co-occurrence counts. |
| agent_memory_link_t | Defines causal and semantic relationships between memories (causes, enables, etc.). |
| agent_memory_directive_t | "Hard rules" that override probabilistic learning. |
| agent_memory_reflection_t | Synthesized high-level insights generated during the "Reflect" phase. |
| agent_session_history_t | The live record of active conversations, linked to a specific bank for context. |
4. Privacy & Multi-Tenancy
Isolation is managed at the Bank level using three scoping tiers:
- Global Host Bank (user_id IS NULL, agent_def_id IS NULL):
  - Knowledge shared across all users and all agents within a specific host_id.
  - Ideal for organization-wide SOPs, common facts, and shared documentation.
- Shared Agent Bank (user_id IS NULL, agent_def_id IS NOT NULL):
  - Knowledge shared by all users interacting with a specific agent type.
  - Used for maintaining a consistent agent "Persona" or specialized domain expertise.
- Private User Bank (user_id IS NOT NULL):
  - Knowledge unique to a specific user.
  - Can be scoped further by agent_def_id to provide user-specific memory within a particular agent persona.
  - Used for personal preferences, private history, and individualized learning.
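The scoping tiers map directly onto the nullability of user_id and agent_def_id. A minimal sketch of that classification is shown below; the `BankScope` enum and `bank_scope` function are hypothetical names, and the private tier is split into two variants only to make the optional agent scoping explicit.

```rust
#[derive(Debug, PartialEq)]
pub enum BankScope {
    /// user_id IS NULL, agent_def_id IS NULL
    GlobalHost,
    /// user_id IS NULL, agent_def_id IS NOT NULL
    SharedAgent,
    /// user_id IS NOT NULL, agent_def_id IS NULL
    PrivateUser,
    /// user_id IS NOT NULL, agent_def_id IS NOT NULL
    PrivateUserForAgent,
}

/// Classifies a bank row by which of its two scoping columns are set.
pub fn bank_scope(user_id: Option<&str>, agent_def_id: Option<&str>) -> BankScope {
    match (user_id, agent_def_id) {
        (None, None) => BankScope::GlobalHost,
        (None, Some(_)) => BankScope::SharedAgent,
        (Some(_), None) => BankScope::PrivateUser,
        (Some(_), Some(_)) => BankScope::PrivateUserForAgent,
    }
}
```

All four cases remain scoped to a single host_id; the sketch leaves that column out because it is always present.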
5. Implementation Guide
To implement a "Learning Agent," follow this sequence in your application logic:
- Ingestion: After every tool call or user interaction, call retain to update the bank.
- Context Loading: Before calling the LLM, call recall to fetch the most relevant 3-5 memories for the current prompt.
- Scheduled Reflection: Run the reflect operation during idle time to compress raw experiences into high-level mental models.
Fine-Grained Authorization
The following is raw material collected from a chat with an AI assistant about migrating the light-4j fine-grained authorization to light-rs to secure light-agent and light-gateway.
It still needs to be organized into a design document for light-rs fine-grained authorization.
Question: We have a Java implementation of fine-grained authorization that we want to migrate to light-rs. In our java implementation, we have a middleware handler to verify the custom claims like role, group, attribute etc in the jwt token against endpoint specific rules defined in database and cached at the service. How should we handle the same requirement in rust?
Response:
In Rust, bridging the gap between dynamic, database-driven rules and high-performance web requests is a well-solved problem. If you are coming from Java, the architecture translates very cleanly, but takes advantage of Rust's async runtime and thread-safe concurrency primitives.
To handle fine-grained authorization (Attribute-Based Access Control / ABAC or Role-Based Access Control / RBAC) where rules are cached locally but backed by a database, you have two primary paths in Rust:
- The Custom Middleware Approach (Closest to your current Java implementation)
- The Policy Engine Approach (Using an industry-standard Rust authorization crate)
Here is how you handle this requirement idiomatically in Rust, assuming you are using Axum (the current standard web framework in Rust).
Approach 1: Custom Middleware with Async Caching
In Java, you likely used something like Guava or Caffeine for your local cache. In Rust, the exact equivalent for high-performance, concurrent, async-aware caching is the moka crate.
Here is the architectural blueprint:
- Define your custom claims using serde.
- Set up a shared Application State holding your Database connection pool (e.g., sqlx) and your local Cache (moka).
- Create an Axum Middleware to intercept the request, parse the JWT, consult the cache based on the route, and authorize the request.
1. Dependencies (Cargo.toml)
axum = "0.7"
jsonwebtoken = "9.0"
moka = { version = "0.12", features = ["future"] } # Async cache (Rust's Caffeine)
sqlx = { version = "0.7", features = ["postgres", "runtime-tokio"] }
serde = { version = "1.0", features = ["derive"] }
tokio = { version = "1", features = ["full"] } # Needed for #[tokio::main] and TcpListener in the wiring step
2. Defining Claims and State
use axum::{extract::State, http::Request, middleware::Next, response::Response};
use moka::future::Cache;
use serde::{Deserialize, Serialize};
use sqlx::PgPool;
use std::sync::Arc;

// 1. Define your custom JWT claims
#[derive(Debug, Serialize, Deserialize, Clone)]
struct CustomClaims {
    sub: String,
    roles: Vec<String>,
    groups: Vec<String>,
    attributes: std::collections::HashMap<String, String>,
    exp: usize,
}

// 2. Define the Endpoint Rule (what you fetch from DB)
#[derive(Clone)]
struct EndpointRule {
    required_roles: Vec<String>,
    // add required_groups, attributes, etc.
}

// 3. Define the Application State to share across your app
struct AppState {
    db: PgPool,
    // Cache keyed by String (e.g., "GET:/api/data") returning an EndpointRule
    rule_cache: Cache<String, EndpointRule>,
}
3. The Authorization Middleware
In Axum, middleware is written as an asynchronous function that takes the request, does some work, and either rejects it or passes it to the Next handler.
use axum::http::StatusCode;
use axum::response::IntoResponse;

async fn fine_grained_authz_middleware(
    State(state): State<Arc<AppState>>,
    req: Request<axum::body::Body>,
    next: Next,
) -> Result<Response, StatusCode> {
    // 1. Extract and decode the JWT from Authorization header
    let auth_header = req.headers().get("Authorization").and_then(|h| h.to_str().ok());
    let claims = decode_jwt(auth_header).ok_or(StatusCode::UNAUTHORIZED)?;

    // 2. Identify the endpoint being accessed (e.g., "GET:/api/users/123")
    // Note: In practice, use req.extensions() to get the matched generic route template
    // rather than the raw URI so "/api/users/123" matches "/api/users/{id}"
    let endpoint_key = format!("{}:{}", req.method(), req.uri().path());

    // 3. Check Cache or Fallback to DB
    let rule = match state.rule_cache.get(&endpoint_key).await {
        Some(cached_rule) => cached_rule,
        None => {
            // Cache miss: Fetch from database
            let db_rule = fetch_rule_from_db(&state.db, &endpoint_key)
                .await
                .map_err(|_| StatusCode::INTERNAL_SERVER_ERROR)?;
            // Populate cache
            state.rule_cache.insert(endpoint_key, db_rule.clone()).await;
            db_rule
        }
    };

    // 4. Execute Fine-Grained Authorization Logic
    let is_authorized = rule.required_roles.iter().any(|role| claims.roles.contains(role));
    // ... check groups, evaluate custom attribute expressions, etc.
    if !is_authorized {
        return Err(StatusCode::FORBIDDEN);
    }

    // 5. Pass to the actual route handler!
    Ok(next.run(req).await)
}

// Dummy helper functions to make the example compile mentally
fn decode_jwt(_header: Option<&str>) -> Option<CustomClaims> {
    /* ... */
    unimplemented!()
}

async fn fetch_rule_from_db(_db: &PgPool, _key: &str) -> Result<EndpointRule, ()> {
    /* ... */
    unimplemented!()
}
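Step 4 of the middleware only checks roles and leaves groups and attributes as a comment. One way that check could be extended is sketched below with standard-library types only. The `Claims` and `Rule` structs are simplified stand-ins for the types above, the `required_groups` and `required_attributes` fields are hypothetical, and treating an empty requirement list as "no constraint" is an assumption rather than the light-4j behavior.

```rust
use std::collections::HashMap;

pub struct Claims {
    pub roles: Vec<String>,
    pub groups: Vec<String>,
    pub attributes: HashMap<String, String>,
}

pub struct Rule {
    pub required_roles: Vec<String>,
    pub required_groups: Vec<String>,
    pub required_attributes: HashMap<String, String>,
}

/// An empty requirement list imposes no constraint; otherwise at least one
/// required value must be present in the caller's claims.
fn any_or_unconstrained(required: &[String], actual: &[String]) -> bool {
    required.is_empty() || required.iter().any(|r| actual.contains(r))
}

/// Roles and groups are "any of"; attributes are "all of" with exact match.
pub fn is_authorized(rule: &Rule, claims: &Claims) -> bool {
    any_or_unconstrained(&rule.required_roles, &claims.roles)
        && any_or_unconstrained(&rule.required_groups, &claims.groups)
        && rule
            .required_attributes
            .iter()
            .all(|(key, value)| claims.attributes.get(key) == Some(value))
}
```

Keeping the evaluation in a pure function like this makes it easy to unit test separately from the Axum middleware, which then only has to map `false` to 403.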
4. Wiring it up
// Assumes the imports, AppState, and fine_grained_authz_middleware from the previous snippets are in scope.
use axum::{routing::get, Router};

#[tokio::main]
async fn main() {
    let state = Arc::new(AppState {
        db: PgPool::connect("postgres://...").await.unwrap(),
        rule_cache: Cache::builder()
            .max_capacity(10_000)
            .time_to_live(std::time::Duration::from_secs(300)) // 5 minute TTL
            .build(),
    });

    let app = Router::new()
        .route("/api/secure-data", get(|| async { "Secret Data" }))
        // Apply the middleware to all routes above this line
        .route_layer(axum::middleware::from_fn_with_state(
            state.clone(),
            fine_grained_authz_middleware,
        ))
        .with_state(state);

    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
Approach 2: Using a Dedicated Policy Engine (Recommended for complex rules)
If your endpoint-specific rules are highly complex (e.g., "User must be in Group X AND have Attribute Y = 'finance', UNLESS Role is 'Admin'"), writing that evaluation logic manually in Rust gets messy.
Instead of writing custom evaluation logic, the Rust ecosystem heavily relies on external Policy Engines for ABAC/RBAC. The two leaders are:
1. AWS Cedar (cedar-policy) - Highly Recommended
Developed by Amazon Web Services in Rust, Cedar is a blazing-fast language for defining permissions.
- You store your Cedar policies in your database as text.
- Your Rust middleware fetches the Cedar policy string from the DB, compiles it (and caches the compiled policy using moka).
- You pass the JWT CustomClaims into the Cedar engine as the request context.
- Cedar evaluates the claims against the policy in microseconds.
2. Casbin (casbin)
Casbin is a very popular authorization library that exists in Java, Go, and Rust.
- It supports ACL, RBAC, and ABAC.
- It has native adapters for sqlx (Postgres/MySQL). You don't even need to write the caching logic yourself; casbin-rs handles reading the rules from the database and keeping them cached in memory automatically.
Summary
- For simple Role/Group matching: Stick to the Custom Middleware approach using moka for async caching and axum::middleware to intercept requests. It is functionally identical to your Java approach but extremely fast.
- For complex Attribute evaluation: Use your middleware to parse the JWT, but delegate the actual rule evaluation to cedar-policy or casbin.
Question: Could you please provide some examples for cedar-policy and casbin? For example, what is the format in database? In our system, we are using a yaml rule engine to evaluate the fine-grained authorization. We create some rules users can choose and they can also create their own rules if necessary.
Response:
Transitioning from a custom YAML rule engine to an industry-standard policy engine is a fantastic move. Both AWS Cedar and Casbin are designed precisely for this kind of "user-defined fine-grained rules" requirement, but they take very different approaches to how rules are written and stored.
Here is a breakdown of how both engines work, what their database format looks like, and how you map your JWT claims to them in Rust.
1. AWS Cedar (cedar-policy) - Best for Complex/Custom Rules
Cedar uses a custom, highly readable Domain Specific Language (DSL). It is specifically built to evaluate ABAC (Attribute-Based Access Control) using contextual data like JWT claims.
The Format (In the Database)
In your database, you do not need complex tables. You simply store the Cedar policies as plain text strings in a standard TEXT or VARCHAR column.
If a user wants to create a custom rule saying, "Allow access if the user's JWT has the 'finance' group and a clearance level greater than 3," the Cedar policy stored in the DB looks like this:
// Database column: policy_text
permit (
principal,
action == Action::"read_data",
resource == Resource::"secure_endpoint"
)
when {
context.jwt.groups.contains("finance") &&
context.jwt.attributes.clearance_level > 3
};
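For "predefined rules that users can choose," Cedar has Policy Templates. You store the template once, and each user selection becomes a link that fills in the template's slots. Template slots are limited to ?principal and ?resource in the policy scope, so a value such as a group name cannot be a slot inside a when clause; model the selectable value as an entity (for example, a group the principal belongs to) instead.
// Template: "Allow members of a linked group"
// Linking supplies the group entity for ?principal; membership comes from the entity hierarchy.
permit(principal in ?principal, action, resource);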
How you execute it in Rust
When a request comes in, you fetch the relevant Cedar text strings from your database, pass in the JWT claims as the Context, and let Cedar evaluate it.
use cedar_policy::{Authorizer, Context, Decision, Entities, PolicySet, Request};
use serde_json::json;
use std::str::FromStr;

// Note: written against the cedar-policy 3.x API; the Request::new signature
// differs in other major versions.
fn evaluate_cedar_rule(db_policy_text: &str, jwt_claims: &serde_json::Value) {
    // 1. Parse the text policy from the database
    let policies = PolicySet::from_str(db_policy_text).expect("Invalid policy format");

    // 2. Map your JWT claims directly into the Cedar "Context" using JSON
    let context_json = json!({ "jwt": jwt_claims });
    let context = Context::from_json_value(context_json, None).unwrap();

    // 3. Create the evaluation request
    let request = Request::new(
        Some("User::\"Alice\"".parse().unwrap()),
        Some("Action::\"read_data\"".parse().unwrap()),
        Some("Resource::\"secure_endpoint\"".parse().unwrap()),
        context,
        None,
    )
    .unwrap();

    // 4. Evaluate
    let authorizer = Authorizer::new();
    let entities = Entities::empty(); // Used if fetching hierarchical data
    let answer = authorizer.is_authorized(&request, &policies, &entities);

    if answer.decision() == Decision::Allow {
        println!("Access Granted!");
    } else {
        // errors() yields policy evaluation errors, not the list of policies that denied access
        let errors: Vec<String> = answer.diagnostics().errors().map(|e| e.to_string()).collect();
        println!("Access Denied. Evaluation errors: {:?}", errors);
    }
}
2. Casbin (casbin-rs) - Best for Tabular/Database-Native Rules
Casbin works fundamentally differently. It splits authorization into two parts: the Model (a config file defining the structure of your rules) and the Policies/Rules (the actual data stored in the database).
The Format (In the Database)
Casbin doesn't store a DSL string. Instead, it natively connects to your database using an adapter (like sqlx-adapter) and creates a table usually called casbin_rule.
The table structure looks exactly like a CSV file, with columns: ptype, v0, v1, v2, v3, v4, v5.
If a user selects a predefined rule to allow the finance_admin role to read /api/data, Casbin inserts a row into your database like this:
| ptype | v0 (Subject) | v1 (Object) | v2 (Action) | v3 (Condition - optional) |
|---|---|---|---|---|
| p | finance_admin | /api/data | read | |
If a user writes a custom ABAC rule using a UI (e.g., clearance > 3), Casbin can store the evaluation expression as a string in one of the columns (evaluated internally via the rhai scripting language):
| ptype | v0 (Subject/Condition) | v1 (Object) | v2 (Action) |
|---|---|---|---|
| p | r.sub.groups.contains("finance") && r.sub.clearance > 3 | /api/data | read |
How you execute it in Rust
To make Casbin evaluate JWT attributes dynamically, you define a model.conf file that tells Casbin to evaluate (eval) the string stored in the database.
The model.conf File:
[request_definition]
r = sub, obj, act
[policy_definition]
p = sub_rule, obj, act
[policy_effect]
e = some(where (p.eft == allow))
[matchers]
# This matcher says: check if the object and action match,
# AND evaluate the ABAC rule stored in the database (p.sub_rule) against the request subject (r.sub)
m = r.obj == p.obj && r.act == p.act && eval(p.sub_rule)
The Rust Code:
use casbin::{CoreApi, Enforcer};
use serde::Serialize;
use sqlx_adapter::SqlxAdapter;

// A struct representing your JWT claims. casbin-rs hands request values to the
// rhai engine through serde, so the struct derives Serialize and Hash.
#[derive(Clone, Hash, Serialize)]
struct JwtClaims {
    groups: Vec<String>,
    clearance: i32,
}

async fn evaluate_casbin() {
    // 1. Connect to the DB. The adapter loads the rules from the `casbin_rule` table.
    let adapter = SqlxAdapter::new("postgres://user:pass@localhost/db", 8)
        .await
        .unwrap();
    let enforcer = Enforcer::new("model.conf", adapter).await.unwrap();

    // 2. Parse the JWT for the incoming request
    let claims = JwtClaims {
        groups: vec!["finance".to_string()],
        clearance: 5,
    };

    // 3. Enforce the rules against the DB
    // Request is: Subject (JWT claims), Object (/api/data), Action (read).
    // The matcher's eval() runs the stored expression with `r.sub` bound to the claims.
    if enforcer.enforce((claims, "/api/data", "read")).unwrap() {
        println!("Access Granted!");
    } else {
        println!("Access Denied!");
    }
}
Which one maps better to your YAML Rule Engine?
AWS Cedar is highly recommended for your use case.
Because you currently use a YAML rule engine where users define complex logical rules, your users are already accustomed to defining logical expressions.
- Translation: It is very easy to write a script that takes your existing YAML definitions and converts them into Cedar text policies.
- Safety: Cedar is statically typed and formally verified. If a user writes an invalid custom rule, Cedar will fail to parse it instantly and give you a helpful error message to show in your UI. Casbin's eval() logic can sometimes be brittle at runtime if the user types bad syntax.
- Database Schema: With Cedar, you control your database schema. You just add a policy_text column to your existing "Rules" table. Casbin requires taking over a table (casbin_rule) to map to its strict CSV-like format.
Question: In our current implementation, we define all the permissions and filters from the UI and save the permissions and the selected rules in the database. During server startup, it loads everything locally in YAML format like the following. One of the pros is that our technical team writes the rules, and the user or API owner just defines the roles, groups, positions, and attributes for the endpoint. The rules are executed locally on the target server without going to the database, so the performance is very good. If any permission changes for the target server, we can trigger a config reload from the controller to ask the server to reload the config/permissions. I am not sure if we should replace this implementation with Cedar. One of the features we have is the column and row filter, and I don't think Cedar supports it. What do you think?
rule.endpointRules:
  echo@call:
    req-acc:
      - allow-role-based-access-control.lightapi.net
    permission:
      roles: mcp-reader
  /v1/pets@get:
    req-acc:
      - allow-role-based-access-control.lightapi.net
    permission:
      roles: host-admin
  /v1/accounts@get:
    req-acc:
      - allow-role-based-access-control.lightapi.net
    res-fil:
      - response-column-filter.lightapi.net
      - response-row-filter.lightapi.net
    permission:
      col:
        role:
          teller: '["accountNo","userId","accountType","firstName","lastName"]'
          account-manager: '["accountNo","userId","accountType","firstName","lastName","status"]'
      row:
        role:
          teller:
            - colName: accountType
              colValue: C
              operator: =
      roles: account-manager teller
Response:
You have excellent architectural instincts. You are absolutely right to hesitate, and your concern about column and row filtering is the exact reason you probably should not replace your current system with Cedar.
Your current design is actually highly sophisticated. Let’s break down why your current implementation is so strong, why Cedar would struggle with your specific requirements, and how you should idiomatically build your current design in Rust.
Why Cedar is NOT a drop-in replacement for your system
AWS Cedar and Casbin are designed as Binary Decision Engines. You give them a Request, and they return an Allow or Deny. (OPA can return structured data, but it is most often used for allow/deny decisions as well.)
The Missing Feature: Data Filtering (Masking)
Your system handles two fundamentally different concerns in one elegant YAML file:
- Access Control (req-acc): Can the user access the /v1/accounts endpoint? (Cedar does this perfectly).
- Data Filtering (res-fil): Which rows and columns can the user see in the response? (Cedar does not do this natively).
If you forced Cedar to do column filtering, you would have to misuse its "Annotations" feature to attach custom metadata to policies, read the output, and then write a custom Rust engine to parse that metadata and filter the JSON anyway. You would essentially be rebuilding your existing system, just with Cedar awkwardly wedged in the middle.
Strengths of your current architecture
What you have built is essentially a custom API Gateway / Sidecar Data Masking Engine.
- Performance: Loading rules into memory at startup means zero network latency. (Matches Cedar).
- Hot Reloading: Fetching updates via a controller is standard cloud-native practice. (Matches Cedar).
- Separation of Concerns: Tech team writes the underlying engine logic (allow-role-based-access-control), while API owners just write YAML configs. (Matches Cedar’s Policy Templates).
- Row/Column filtering: Natively understood by your engine. (Far exceeds Cedar).
Verdict: Do not switch to Cedar. Your current YAML-based, locally cached evaluation engine is a perfect fit for your business requirements.
How to implement your architecture in Rust
Since you are migrating to or building in Rust, your current Java-based architecture translates beautifully. In fact, Rust will make this system significantly faster and more memory-efficient.
Here is how you replicate your exact Java architecture—including the hot-reloading and filtering—in Rust.
1. Hot-Reloading using arc-swap
In Java, hot-reloading configuration safely across threads can be tricky. In Rust, a widely used choice for "read-heavy, occasionally-updated" memory structures is the arc-swap crate. It lets your web server read the YAML rules without taking a lock on the request path, while your controller swaps in new rules in the background.
use arc_swap::ArcSwap;
use serde::Deserialize;
use std::collections::HashMap;
use std::sync::Arc;

// 1. Map your YAML exactly to Rust structs
#[derive(Deserialize, Debug, Clone)]
struct RuleConfig {
    // The sample YAML nests every endpoint under this top-level key
    #[serde(rename = "rule.endpointRules")]
    endpoint_rules: HashMap<String, EndpointRule>,
}

#[derive(Deserialize, Debug, Clone)]
struct EndpointRule {
    #[serde(rename = "req-acc")]
    req_acc: Option<Vec<String>>,
    #[serde(rename = "res-fil")]
    res_fil: Option<Vec<String>>,
    permission: PermissionConfig,
}

#[derive(Deserialize, Debug, Clone)]
struct PermissionConfig {
    roles: String,
    // "role" -> role name -> JSON array string of allowed columns
    col: Option<HashMap<String, HashMap<String, String>>>,
}

// 2. Wrap your entire config in an ArcSwap
lazy_static::lazy_static! {
    static ref CONFIG: ArcSwap<HashMap<String, EndpointRule>> =
        ArcSwap::from_pointee(HashMap::new());
}

// 3. Background task to listen to the controller and hot-reload
async fn reload_config_from_controller(new_yaml_str: &str) -> Result<(), serde_yaml::Error> {
    // Parse first, so a malformed update leaves the current rules in place
    let new_config: RuleConfig = serde_yaml::from_str(new_yaml_str)?;

    // Swaps the pointer atomically.
    // New requests see the new rules; in-flight requests finish with the old ones.
    CONFIG.store(Arc::new(new_config.endpoint_rules));
    Ok(())
}
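The swap pattern itself can be tried without any extra crates. The following is a minimal std-only sketch that uses RwLock<Arc<T>> in place of arc-swap, so readers briefly take a read lock to clone the pointer instead of reading lock-free. RuleStore and the simplified EndpointRule are illustrative names, not Light-Fabric types.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical, simplified rule: only the space-separated role list.
#[derive(Debug, Clone, PartialEq)]
pub struct EndpointRule {
    pub roles: String,
}

pub type RuleMap = HashMap<String, EndpointRule>;

/// Holds the current rule snapshot. The lock is held only long enough
/// to clone or replace the `Arc` pointer, never while rules are evaluated.
pub struct RuleStore {
    current: RwLock<Arc<RuleMap>>,
}

impl RuleStore {
    pub fn new(initial: RuleMap) -> Self {
        Self { current: RwLock::new(Arc::new(initial)) }
    }

    /// Readers take a cheap snapshot and keep using it after the lock is released.
    pub fn load(&self) -> Arc<RuleMap> {
        let guard = self.current.read().expect("rule lock poisoned");
        Arc::clone(&*guard)
    }

    /// The reload path receives a fully built map and swaps the pointer.
    pub fn store(&self, next: RuleMap) {
        *self.current.write().expect("rule lock poisoned") = Arc::new(next);
    }
}
```

A snapshot taken before a reload keeps pointing at the old map, which is the same "old requests finish with the old rules" behavior described for arc-swap.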
2. The Request Access (req-acc) Middleware
You would write an Axum middleware that intercepts the request, looks up the current endpoint in the arc-swap configuration, checks the JWT roles against permission.roles, and rejects the request if it fails.
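use axum::http::StatusCode;

// Called from your Axum middleware with the endpoint key (for example
// "/v1/accounts@get") and the roles extracted from the caller's JWT.
fn check_request_access(endpoint: &str, user_roles: &[String]) -> Result<(), StatusCode> {
    let current_config = CONFIG.load();
    if let Some(rule) = current_config.get(endpoint) {
        let required_roles: Vec<&str> = rule.permission.roles.split_whitespace().collect();
        if !user_roles.iter().any(|r| required_roles.contains(&r.as_str())) {
            return Err(StatusCode::FORBIDDEN); // req-acc fails
        }
    }
    // Endpoints without a rule are allowed here; return an error instead to deny by default
    Ok(())
}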
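The lookup key and the role comparison can be exercised in isolation. The sketch below assumes keys follow the path@method shape seen in the sample YAML (for example /v1/accounts@get); endpoint_key and has_required_role are hypothetical helper names introduced only for illustration.

```rust
/// Builds the lookup key used in the YAML rules, e.g. "/v1/accounts@get".
/// Assumption: keys are "<path>@<lowercase method>", as in the sample config.
pub fn endpoint_key(path: &str, method: &str) -> String {
    format!("{}@{}", path, method.to_ascii_lowercase())
}

/// Returns true when the caller holds at least one role listed in the rule.
/// `rule_roles` is the space-separated string from `permission.roles`.
pub fn has_required_role(rule_roles: &str, user_roles: &[&str]) -> bool {
    rule_roles
        .split_whitespace()
        .any(|required| user_roles.contains(&required))
}
```

An empty role string grants access to nobody in this sketch, which keeps a misconfigured rule from silently opening an endpoint.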
3. The Response Filtering (res-fil) Middleware
This is where Rust shines. Because Axum and Actix allow you to intercept the response body after the handler completes but before it goes to the user, you can apply your column filtering dynamically.
Using serde_json, you can strip out columns that the user isn't allowed to see based on the col mapping in your YAML.
use serde_json::Value;

fn apply_column_filter(mut response_json: Value, allowed_columns_json: &str) -> Value {
    // 1. Parse the allowed columns from the YAML string (e.g., '["accountNo","firstName"]').
    //    A malformed list yields no allowed columns, so the filter fails closed.
    let allowed_cols: Vec<String> =
        serde_json::from_str(allowed_columns_json).unwrap_or_default();

    // 2. Filter a single object, or every object in a top-level array
    match &mut response_json {
        // Retain only the keys that are in the allowed_cols list
        Value::Object(map) => map.retain(|key, _| allowed_cols.contains(key)),
        Value::Array(items) => {
            for item in items.iter_mut() {
                if let Value::Object(map) = item {
                    map.retain(|key, _| allowed_cols.contains(key));
                }
            }
        }
        _ => {}
    }
    response_json
}
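The sample YAML also declares a row filter (colName, colValue, operator) for the teller role. A minimal std-only sketch of that step follows. It makes several assumptions that the YAML does not confirm: rows are simplified to string columns, multiple conditions are combined with AND, only "=" and "!=" are supported, and anything unrecognized hides the row. RowCondition, Row, and apply_row_filter are illustrative names.

```rust
use std::collections::HashMap;

/// One row condition, mirroring `colName` / `colValue` / `operator` in the YAML.
#[derive(Debug, Clone)]
pub struct RowCondition {
    pub col_name: String,
    pub col_value: String,
    pub operator: String,
}

/// A response row, simplified to string columns for this sketch.
pub type Row = HashMap<String, String>;

/// Returns true when the row satisfies a single condition.
/// Unknown operators and missing columns fail closed (the row is hidden).
fn matches(row: &Row, cond: &RowCondition) -> bool {
    match (row.get(&cond.col_name), cond.operator.as_str()) {
        (Some(actual), "=") => actual == &cond.col_value,
        (Some(actual), "!=") => actual != &cond.col_value,
        _ => false,
    }
}

/// Keeps only the rows that satisfy every condition (assumed AND semantics).
pub fn apply_row_filter(rows: Vec<Row>, conditions: &[RowCondition]) -> Vec<Row> {
    rows.into_iter()
        .filter(|row| conditions.iter().all(|c| matches(row, c)))
        .collect()
}
```

With the teller condition from the sample config (accountType = C), only rows whose accountType is C remain in the response.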
Summary
Keep your current system!
- The YAML approach is great. It acts as a perfect declarative contract between your API owners and your technical rules engine.
- Cedar is the wrong tool for Data Masking. Cedar is for strict Allow/Deny decisions, not mutating response bodies or filtering data.
- Rust will supercharge this. By using serde_yaml to parse your config, arc-swap for lock-free hot-reloading, and serde_json for column masking inside an Axum middleware, you will get a system that does exactly what your Java system does, but with significantly less memory overhead and higher throughput.
Agent Engine Pattern
The Agent Engine Pattern is the architectural standard for building industrial-grade, metadata-driven AI platforms within the Light-Fabric ecosystem.
In this model, the Rust Runtime acts as a high-performance Orchestrator, while the Application Logic resides in externalized metadata (JSON/YAML) and the Hindsight Memory database.
1. Why the Metadata-Driven Approach?
- Separation of Concerns: Complex platform logic (security, retries, database connectivity, LLM integration) is implemented once in Rust. Business logic—defining agent personas, goals, and steps—is "programmed" via JSON or Database records.
- Hot-Reloading: Using the arc-swap crate and YAML-based rule engines, agent personas, model parameters, and tool access can be updated in real time without a server restart.
- Elastic Scalability: Deploy a single, generic light-agent binary. At runtime, it specializes into a "Researcher," "Auditor," or "Support Specialist" based on the workflow_id or agent_id it retrieves from the registry.
- High Performance: Rust's asynchronous tokio runtime allows a single engine instance to manage thousands of concurrent agentic sessions with minimal memory overhead.
2. The Core Architecture: Engine vs. Content
To function as a generic interpreter, the Light-Fabric Engine relies on four primary components:
A. The Tool & Skill Registry (The "Hands")
The engine maps string identifiers in the workflow JSON (e.g., "call": "get_customer_data") to executable code or remote MCP tools.
- Implementation: Uses a ToolRegistry with trait objects (Box<dyn Tool>) or dynamic dispatch to MCP (Model Context Protocol) servers.
- Logic: When the LLM requests a tool call, the engine verifies permissions via Fine-Grained Authorization, executes the tool, and feeds the result back into the context.
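The registry idea can be sketched with plain trait objects. This is a simplified, synchronous illustration rather than the Light-Fabric implementation: the Tool trait signature, the string-based parameters, and EchoTool are all assumptions, and a real engine would likely be async and would check authorization before calling.

```rust
use std::collections::HashMap;

/// A hypothetical synchronous tool interface.
pub trait Tool {
    fn call(&self, params: &str) -> Result<String, String>;
}

/// Example tool that echoes its input (a stand-in for something like "get_customer_data").
pub struct EchoTool;

impl Tool for EchoTool {
    fn call(&self, params: &str) -> Result<String, String> {
        Ok(format!("echo: {}", params))
    }
}

/// Maps string identifiers from workflow metadata to executable tools.
#[derive(Default)]
pub struct ToolRegistry {
    tools: HashMap<String, Box<dyn Tool>>,
}

impl ToolRegistry {
    pub fn register(&mut self, name: &str, tool: Box<dyn Tool>) {
        self.tools.insert(name.to_string(), tool);
    }

    /// Looks up the tool by name and runs it; unknown names are an error.
    pub fn call(&self, name: &str, params: &str) -> Result<String, String> {
        let tool = self
            .tools
            .get(name)
            .ok_or_else(|| format!("unknown tool: {}", name))?;
        tool.call(params)
    }
}
```

Because lookup is by string, the workflow metadata never needs to know which Rust type backs a tool, and an MCP-backed tool could implement the same trait.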
B. Hindsight State Manager (The "Memory")
Unlike simple session storage, the state manager persists every step of the agentic interaction into biomimetic memory banks.
- Implementation: Every "turn" in the conversation is saved as a memory unit (agent_memory_unit_t) in the Hindsight database.
- Benefit: Provides fault tolerance (resuming from a crashed step) and "Recall" capabilities, allowing agents to remember past interactions across different sessions.
C. Prompt Templating (The "Mind")
System prompts and instructions are stored as templates rather than hardcoded strings.
- Implementation: Uses the tera or rinja engines for high-performance string interpolation.
- Example: "You are a {{agent_role}}. Your current objective is to {{agent_goal}}."
- Rust Logic: The engine merges runtime context (user input, memory recall, tool results) into the template before calling the LLM.
D. Policy Engine (The "Shield")
Before any tool execution or data retrieval, the engine consults the Light-Rule middleware.
- Logic: Ensures the agent has the authority to access specific data or execute specific functions, preventing "prompt injection" from leading to unauthorized actions.
3. Conceptual Implementation in Rust
The AgentEngine in Light-Fabric follows a non-blocking, async loop:
use std::sync::Arc;
use uuid::Uuid;

// Conceptual sketch: ToolRegistry, HindsightClient, RuleEngine, ModelProvider,
// Task, and render_prompt stand in for the corresponding Light-Fabric types.
pub struct AgentEngine {
    registry: Arc<ToolRegistry>,
    memory: Arc<HindsightClient>,
    rules: Arc<RuleEngine>,
    llm_provider: Arc<ModelProvider>,
}

impl AgentEngine {
    pub async fn execute_step(&self, session_id: Uuid, task: Task) -> anyhow::Result<()> {
        // 1. Fetch current context from Hindsight Memory
        let mut context = self.memory.get_context(session_id).await?;

        // 2. Resolve Task Type (Agentic vs. Tool Call)
        match task {
            Task::LlmCall { agent_id: _, prompt_template } => {
                // Render prompt with Tera
                let prompt = self.render_prompt(prompt_template, &context)?;
                // Call LLM Provider
                let response = self.llm_provider.chat(prompt, &context).await?;
                // Retain turn in Hindsight
                self.memory.retain_turn(session_id, response).await?;
            }
            Task::ToolCall { tool_name, params } => {
                // 3. Enforce Fine-Grained Authorization; a denied call is an error, not a silent skip
                if !self.rules.authorize(session_id, &tool_name).await? {
                    anyhow::bail!("tool call denied by policy: {}", tool_name);
                }
                let result = self.registry.call(&tool_name, params).await?;
                context.add_result(tool_name, result);
            }
        }

        // 4. Update Session State
        self.memory.checkpoint(session_id, context).await
    }
}
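The prompt rendering step in the loop can be illustrated without a template engine. The sketch below is a naive std-only stand-in for tera or rinja, not their API: it only replaces {{key}} placeholders with values from a map, and render_prompt here is a free function written for illustration.

```rust
use std::collections::HashMap;

/// Replaces each `{{key}}` placeholder with its value from `vars`.
/// Placeholders without a matching key are left untouched.
/// Limitation: values that themselves contain `{{...}}` may be substituted
/// again, so this is not a replacement for a real template engine.
pub fn render_prompt(template: &str, vars: &HashMap<&str, &str>) -> String {
    let mut out = template.to_string();
    for (key, value) in vars {
        let placeholder = ["{{", *key, "}}"].concat();
        out = out.replace(&placeholder, value);
    }
    out
}
```

Rendering the example template with agent_role set to "Researcher" and agent_goal set to "summarize the report" produces a complete system prompt that can be passed to the LLM call.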
4. Operational Challenges & Solutions
- Tool Versioning: As the platform evolves, tools may change. Light-Fabric handles this by versioning tool definitions in the Registry, ensuring old workflows remain compatible with the tools they were designed for.
- Safe Execution: For dynamic "scripts" defined in metadata, Light-Fabric utilizes WebAssembly (WASM) runtimes to provide a lightweight sandbox with lower startup overhead than container-based isolation.
- Observability: Because the engine is generic, tracing is built into the light-runtime. Every step generates OpenTelemetry traces, allowing developers to visualize the "thought process" and execution path of any agent in real time.
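The tool versioning point can be made concrete with a registry keyed by name and version. This is a hedged sketch under the assumption that workflows pin an exact version string; ToolDef, its fields, and VersionedRegistry are hypothetical and do not describe the actual Registry API.

```rust
use std::collections::HashMap;

/// A hypothetical versioned tool definition (fields are illustrative).
#[derive(Debug, Clone, PartialEq)]
pub struct ToolDef {
    pub name: String,
    pub version: String,
    pub endpoint: String,
}

/// Stores every published version so older workflows keep resolving
/// the exact definition they were written against.
#[derive(Default)]
pub struct VersionedRegistry {
    defs: HashMap<(String, String), ToolDef>,
}

impl VersionedRegistry {
    /// Adds a definition without removing earlier versions of the same tool.
    pub fn publish(&mut self, def: ToolDef) {
        self.defs.insert((def.name.clone(), def.version.clone()), def);
    }

    /// Resolves a tool by the name and version pinned in the workflow.
    pub fn resolve(&self, name: &str, version: &str) -> Option<&ToolDef> {
        self.defs.get(&(name.to_string(), version.to_string()))
    }
}
```

Publishing version 2.0.0 of a tool leaves 1.0.0 resolvable, so a workflow written against the older definition keeps working.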
The Recommendation
Light-Fabric adopts this "Engine-first" philosophy to ensure the platform remains sustainable. By treating the Agentic Workflow as data and the Rust Runtime as the interpreter, we balance runtime performance with business flexibility.
Database Design
The Light-Fabric utilizes a robust PostgreSQL schema to manage the entire lifecycle of agentic workflows, skills, and the biomimetic Hindsight memory system. The schema is organized into four logical layers:
1. Workflow Engine
These tables manage the definition and execution of long-running agentic workflows.
wf_definition_t
Stores the Agentic Workflow DSL (YAML) that defines the high-level orchestration logic.
process_info_t & task_info_t
Manage the runtime state of workflow instances (processes) and individual steps (tasks). They include input_data, context_data, and error_info to provide a resilient "scratchpad" for intermediate variables.
worklist_t, worklist_column_t & task_asst_t
Manage task assignments and visibility for human-in-the-loop interactions.
2. Agentic Core (The "Brain & Skills")
These tables define the identity, expertise, and capabilities of individual agents.
agent_definition_t
Defines the agent's persona, model provider (OpenAI, Anthropic, etc.), and runtime parameters like temperature and max tokens.
skill_t
Stores the "Expertise" of an agent in Markdown format. Skills are hierarchical and versioned.
tool_t & tool_param_t
The "Hands" of the agent. Defines executable functions, including Java classes, REST endpoints, MCP server calls, and Python or JavaScript scripts.
agent_skill_t & skill_tool_t
Maps agents to skills and skills to tools, implementing the Progressive Disclosure pattern where agents only see the tools required for their current skill context.
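The Progressive Disclosure lookup can be sketched in memory. The example below assumes the two mapping tables have already been loaded into hash maps keyed by name; Catalog and visible_tools are illustrative names, and a real implementation would likely key on the UUID columns and respect the active flag.

```rust
use std::collections::{BTreeSet, HashMap};

/// In-memory stand-ins for `agent_skill_t` and `skill_tool_t` rows.
pub struct Catalog {
    pub agent_skills: HashMap<String, Vec<String>>, // agent -> skills
    pub skill_tools: HashMap<String, Vec<String>>,  // skill -> tools
}

impl Catalog {
    /// Returns the tools an agent may see while a given skill is active.
    /// The result is empty when the skill is not assigned to the agent.
    pub fn visible_tools(&self, agent: &str, active_skill: &str) -> BTreeSet<String> {
        let assigned = self
            .agent_skills
            .get(agent)
            .map_or(false, |skills| skills.iter().any(|s| s == active_skill));
        if !assigned {
            return BTreeSet::new();
        }
        self.skill_tools
            .get(active_skill)
            .map(|tools| tools.iter().cloned().collect())
            .unwrap_or_default()
    }
}
```

Only the tools linked to the active skill are returned, so tools belonging to other skills never reach the prompt.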
3. Hindsight Memory System
A biomimetic memory architecture that transitions from flat logs to structured "atoms of thought."
agent_memory_bank_t
Profiles for memory banks, defining the "Personality and Disposition" (e.g., skepticism, empathy) of the memory layer.
agent_memory_unit_t
The individual "Atoms" of memory. Each unit contains content and a vector embedding (384-dim) for semantic retrieval.
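Semantic retrieval over these embeddings comes down to a similarity measure between vectors. The following std-only sketch shows cosine similarity and a brute-force nearest-unit search. It is an illustration of the measure only: in the database the search would be served by the vector index rather than a linear scan, and cosine_similarity and nearest are hypothetical helper names.

```rust
/// Cosine similarity between two equal-length embeddings.
/// Returns None for empty input, mismatched lengths, or zero-magnitude vectors.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Returns the index of the stored embedding most similar to the query.
/// Units that cannot be compared with the query are skipped.
pub fn nearest(query: &[f32], units: &[Vec<f32>]) -> Option<usize> {
    units
        .iter()
        .enumerate()
        .filter_map(|(i, u)| cosine_similarity(query, u).map(|s| (i, s)))
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(i, _)| i)
}
```

Cosine similarity ignores vector magnitude, so a query pointing in the same direction as a stored unit scores 1.0 regardless of its length.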
agent_memory_entity_t & agent_memory_link_t
A Knowledge Graph layer that resolves entities and causal/semantic relationships between memory units.
4. Session Management
agent_session_history_t
The "Source of Truth" for active conversations, linking specific sessions to their respective Hindsight memory banks.
DDL Specification
-- Workflow Definitions: Stores the Agentic Workflow DSL (YAML)
CREATE TABLE wf_definition_t (
host_id UUID NOT NULL,
wf_def_id UUID NOT NULL,
namespace VARCHAR(126) NOT NULL,
name VARCHAR(126) NOT NULL,
version VARCHAR(20) NOT NULL,
definition TEXT NOT NULL, -- The Agentic Workflow DSL in YAML
aggregate_version BIGINT DEFAULT 1 NOT NULL,
active BOOLEAN DEFAULT TRUE,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
update_user VARCHAR(126) DEFAULT SESSION_USER,
PRIMARY KEY(host_id, wf_def_id),
UNIQUE(host_id, namespace, name, version)
);
CREATE TABLE worklist_t (
host_id UUID NOT NULL,
assignee_id VARCHAR(126) NOT NULL,
category_id VARCHAR(126) DEFAULT '(all)' NOT NULL,
status_code VARCHAR(10) DEFAULT 'Active' NOT NULL,
app_id VARCHAR(512) DEFAULT 'global' NOT NULL,
aggregate_version BIGINT DEFAULT 1 NOT NULL,
active BOOLEAN NOT NULL DEFAULT TRUE,
update_user VARCHAR (255) DEFAULT SESSION_USER NOT NULL,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
PRIMARY KEY(host_id, assignee_id, category_id)
);
CREATE TABLE worklist_column_t (
host_id UUID NOT NULL,
assignee_id VARCHAR(126) NOT NULL,
category_id VARCHAR(126) DEFAULT '(all)' NOT NULL,
sequence_id INTEGER NOT NULL,
column_id VARCHAR(126) NOT NULL,
aggregate_version BIGINT DEFAULT 1 NOT NULL,
active BOOLEAN DEFAULT TRUE,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
update_user VARCHAR(126) DEFAULT SESSION_USER,
PRIMARY KEY(host_id, assignee_id, category_id, sequence_id),
FOREIGN KEY(host_id, assignee_id, category_id) REFERENCES worklist_t(host_id, assignee_id, category_id) ON DELETE CASCADE
);
CREATE TABLE process_info_t (
host_id UUID NOT NULL,
process_id UUID NOT NULL, -- generated uuid
wf_def_id UUID NOT NULL, -- workflow definition id
wf_instance_id VARCHAR(126) NOT NULL, -- workflow instance id
app_id VARCHAR(512) NOT NULL, -- application id
process_type VARCHAR(126) NOT NULL,
status_code CHAR(1) NOT NULL, -- process status code 'A', 'C'
started_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
ex_trigger_ts TIMESTAMP WITH TIME ZONE NOT NULL,
custom_status_code VARCHAR(126),
completed_ts TIMESTAMP WITH TIME ZONE,
result_code VARCHAR(126),
source_id VARCHAR(126),
branch_code VARCHAR(126),
rr_code VARCHAR(126),
party_id VARCHAR(126),
party_name VARCHAR(126),
counter_party_id VARCHAR(126),
counter_party_name VARCHAR(126),
txn_id VARCHAR(126),
txn_name VARCHAR(126),
product_id VARCHAR(126),
product_name VARCHAR(126),
product_type VARCHAR(126),
group_name VARCHAR(126),
subgroup_name VARCHAR(126),
event_start_ts TIMESTAMP WITH TIME ZONE,
event_end_ts TIMESTAMP WITH TIME ZONE,
event_other_ts TIMESTAMP WITH TIME ZONE,
event_other VARCHAR(126),
risk NUMERIC,
risk_scale INTEGER,
price NUMERIC,
price_scale INTEGER, -- Scale (number of digits to the right of the decimal) of the price column. NULL implies zero
product_qy NUMERIC,
currency_code CHAR(3),
ex_ref_id VARCHAR(126),
ex_ref_code VARCHAR(126),
product_qy_scale INTEGER,
parent_process_id VARCHAR(22),
deadline_ts TIMESTAMP WITH TIME ZONE,
parent_group_id NUMERIC,
process_subtype_code VARCHAR(126),
owning_group_name VARCHAR(126), -- Name of the group that owns the process
input_data JSONB, -- The initial data that triggered the workflow
context_data JSONB, -- The runtime "scratchpad" for intermediate variables
error_info TEXT, -- Detailed error or stack trace if the process fails
aggregate_version BIGINT DEFAULT 1 NOT NULL,
active BOOLEAN DEFAULT TRUE,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
update_user VARCHAR(126) DEFAULT SESSION_USER,
PRIMARY KEY(host_id, process_id),
FOREIGN KEY(host_id, wf_def_id) REFERENCES wf_definition_t(host_id, wf_def_id) ON DELETE CASCADE
);
CREATE TABLE task_info_t
(
host_id UUID NOT NULL,
task_id UUID NOT NULL,
task_type VARCHAR(126) NOT NULL,
process_id UUID NOT NULL,
wf_instance_id VARCHAR(126) NOT NULL,
wf_task_id VARCHAR(126) NOT NULL,
status_code CHAR(1) NOT NULL, -- U, A, C
started_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
locked CHAR(1) NOT NULL,
priority INTEGER NOT NULL,
completed_ts TIMESTAMP WITH TIME ZONE NULL,
completed_user VARCHAR(126) NULL,
result_code VARCHAR(126) NULL,
locking_user VARCHAR(126) NULL,
locking_role VARCHAR(126) NULL,
deadline_ts TIMESTAMP WITH TIME ZONE NULL,
lock_group VARCHAR(126) NULL,
task_input JSONB, -- Specific data passed to the task
task_output JSONB, -- Result returned by the task action
aggregate_version BIGINT DEFAULT 1 NOT NULL,
active BOOLEAN DEFAULT TRUE,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
update_user VARCHAR(126) DEFAULT SESSION_USER,
PRIMARY KEY(host_id, task_id),
FOREIGN KEY (host_id, process_id) REFERENCES process_info_t(host_id, process_id) ON DELETE CASCADE
);
CREATE TABLE task_asst_t
(
host_id UUID NOT NULL,
task_asst_id UUID NOT NULL,
task_id UUID NOT NULL,
assigned_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
assignee_id VARCHAR(126) NOT NULL,
reason_code VARCHAR(126) NOT NULL,
unassigned_ts TIMESTAMP WITH TIME ZONE NULL,
unassigned_reason VARCHAR(126) NULL,
category_code VARCHAR(126) NULL,
aggregate_version BIGINT DEFAULT 1 NOT NULL,
active BOOLEAN DEFAULT TRUE,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
update_user VARCHAR(126) DEFAULT SESSION_USER,
PRIMARY KEY(host_id, task_asst_id),
FOREIGN KEY(host_id, task_id) REFERENCES task_info_t(host_id, task_id) ON DELETE CASCADE
);
CREATE TABLE audit_log_t
(
host_id UUID NOT NULL,
audit_log_id UUID NOT NULL,
source_type_id VARCHAR(126) NULL,
correlation_id VARCHAR(126) NULL,
user_id VARCHAR(126) NULL,
event_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
success CHAR(1) NULL,
message0 VARCHAR(126) NULL,
message1 VARCHAR(126) NULL,
message2 VARCHAR(126) NULL,
message3 VARCHAR(126) NULL,
message VARCHAR(500) NULL,
user_comment VARCHAR(500) NULL,
PRIMARY KEY(host_id, audit_log_id)
);
CREATE INDEX audit_log_idx1 ON audit_log_t (source_type_id, correlation_id, event_ts, user_id);
-- Agent Definitions: Stores the "Brain" configuration
CREATE TABLE agent_definition_t (
host_id UUID NOT NULL,
agent_def_id UUID NOT NULL,
agent_name VARCHAR(126) NOT NULL,
model_provider VARCHAR(64) NOT NULL, -- 'openai', 'anthropic', etc.
model_name VARCHAR(126) NOT NULL, -- 'gpt-4o', 'claude-3-5-sonnet'
api_key_ref VARCHAR(126), -- Reference to Secret Manager key
temperature NUMERIC(3,2) DEFAULT 0.7,
max_tokens INTEGER, -- maximum number of tokens that can be used
aggregate_version BIGINT DEFAULT 1 NOT NULL,
active BOOLEAN DEFAULT TRUE,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
update_user VARCHAR(126) DEFAULT SESSION_USER,
PRIMARY KEY(host_id, agent_def_id),
UNIQUE(host_id, agent_name)
);
-- Skills: Stores Instructions and Domain Knowledge (The "Expertise")
-- Note: Use entity_tag_t and entity_category_t with entity_type = 'skill'
-- for flat tagging and hierarchical folder structure of skills.
CREATE TABLE skill_t (
host_id UUID NOT NULL,
skill_id UUID NOT NULL,
parent_skill_id UUID, -- Self-reference for Hierarchy
name VARCHAR(126) NOT NULL,
description VARCHAR(500), -- High-level description for the initial LLM prompt
content_markdown TEXT NOT NULL, -- The actual instructions/prompts
description_embedding VECTOR(384), -- For semantic lookup/discovery
version VARCHAR(20) DEFAULT '1.0.0',
aggregate_version BIGINT DEFAULT 1 NOT NULL,
active BOOLEAN DEFAULT true,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
update_user VARCHAR(126) DEFAULT SESSION_USER,
PRIMARY KEY(host_id, skill_id),
FOREIGN KEY(host_id, parent_skill_id) REFERENCES skill_t(host_id, skill_id)
);
CREATE INDEX idx_skill_active ON skill_t(active);
CREATE INDEX idx_skill_name ON skill_t(name);
-- Tools: Stores Executable Functions (The "Hands")
CREATE TABLE tool_t (
host_id UUID NOT NULL,
tool_id UUID NOT NULL,
name VARCHAR(126) NOT NULL,
description TEXT NOT NULL, -- Instructions for LLM on when/how to use this tool
-- Implementation specifics
implementation_type VARCHAR(50), -- 'java', 'mcp_server', 'rest', 'python', 'javascript'
implementation_class VARCHAR(500), -- FQCN if 'java'
mcp_server_name VARCHAR(126), -- MCP server name if 'mcp_server'
api_endpoint VARCHAR(1024), -- URL if 'rest'
api_method VARCHAR(10), -- HTTP Method if 'rest'
endpoint_id UUID, -- Reference to fine-grained auth endpoint
script_content TEXT, -- Source code if 'python'/'javascript'
response_schema JSONB, -- Strict output schema for tool results
description_embedding VECTOR(384), -- For semantic lookup/discovery
version VARCHAR(20) DEFAULT '1.0.0',
aggregate_version BIGINT DEFAULT 1 NOT NULL,
active BOOLEAN DEFAULT true,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
update_user VARCHAR(126) DEFAULT SESSION_USER,
PRIMARY KEY(host_id, tool_id),
FOREIGN KEY(host_id, endpoint_id) REFERENCES api_endpoint_t(host_id, endpoint_id) ON DELETE CASCADE
);
CREATE INDEX idx_tool_host_endpoint ON tool_t(host_id, endpoint_id);
CREATE INDEX idx_tool_active ON tool_t(active);
CREATE INDEX idx_tool_name ON tool_t(name);
-- Tool Parameters: Defines the arguments for each tool
CREATE TABLE tool_param_t (
host_id UUID NOT NULL,
param_id UUID NOT NULL,
tool_id UUID NOT NULL,
name VARCHAR(255) NOT NULL,
param_type VARCHAR(50) NOT NULL, -- 'string', 'number', 'boolean', 'object', 'array'
required BOOLEAN DEFAULT true,
default_value JSONB,
description TEXT, -- Helps LLM understand what value to extract
validation_schema JSONB, -- JSON Schema for complex validation
order_index INTEGER DEFAULT 0,
aggregate_version BIGINT DEFAULT 1 NOT NULL,
active BOOLEAN DEFAULT true,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
update_user VARCHAR(126) DEFAULT SESSION_USER,
PRIMARY KEY(host_id, param_id),
FOREIGN KEY(host_id, tool_id) REFERENCES tool_t(host_id, tool_id) ON DELETE CASCADE
);
-- Skill Dependencies: Manages hierarchies where one skill requires another
CREATE TABLE skill_dependency_t (
host_id UUID NOT NULL,
skill_id UUID NOT NULL,
depends_on_skill_id UUID NOT NULL,
required BOOLEAN DEFAULT true,
aggregate_version BIGINT DEFAULT 1 NOT NULL,
active BOOLEAN DEFAULT true,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
update_user VARCHAR(126) DEFAULT SESSION_USER,
PRIMARY KEY (host_id, skill_id, depends_on_skill_id),
FOREIGN KEY(host_id, skill_id) REFERENCES skill_t(host_id, skill_id),
FOREIGN KEY(host_id, depends_on_skill_id) REFERENCES skill_t(host_id, skill_id)
);
-- Agent-Skill Mapping: Links Agents to their Skills
CREATE TABLE agent_skill_t (
host_id UUID NOT NULL,
agent_def_id UUID NOT NULL,
skill_id UUID NOT NULL,
config JSONB DEFAULT '{}',
priority INTEGER DEFAULT 0,
sequence_id INTEGER DEFAULT 0, -- Order in which skills are concatenated
aggregate_version BIGINT DEFAULT 1 NOT NULL,
active BOOLEAN DEFAULT true,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
update_user VARCHAR(126) DEFAULT SESSION_USER,
PRIMARY KEY(host_id, agent_def_id, skill_id),
FOREIGN KEY(host_id, agent_def_id) REFERENCES agent_definition_t(host_id, agent_def_id) ON DELETE CASCADE,
FOREIGN KEY(host_id, skill_id) REFERENCES skill_t(host_id, skill_id) ON DELETE CASCADE
);
CREATE INDEX idx_agent_skill_agent ON agent_skill_t(agent_def_id);
-- Skill-Tool Mapping: Implements Progressive Disclosure
CREATE TABLE skill_tool_t (
host_id UUID NOT NULL,
skill_id UUID NOT NULL,
tool_id UUID NOT NULL,
config JSONB DEFAULT '{}',
access_level VARCHAR(20) DEFAULT 'read', -- e.g., 'read', 'write', 'execute'
aggregate_version BIGINT DEFAULT 1 NOT NULL,
active BOOLEAN DEFAULT true,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
update_user VARCHAR(126) DEFAULT SESSION_USER,
PRIMARY KEY(host_id, skill_id, tool_id),
FOREIGN KEY(host_id, skill_id) REFERENCES skill_t(host_id, skill_id) ON DELETE CASCADE,
FOREIGN KEY(host_id, tool_id) REFERENCES tool_t(host_id, tool_id) ON DELETE CASCADE
);
CREATE INDEX idx_skill_tool_skill ON skill_tool_t(skill_id);
-- Hindsight Advanced Memory System
-- Transitioned from flat logs to biomimetic memory banks (World, Experiences, Mental Models)
-- Memory bank profiles (Personality & Disposition)
CREATE TABLE agent_memory_bank_t (
host_id UUID NOT NULL,
bank_id UUID NOT NULL,
agent_def_id UUID, -- NULL if bank is shared across agents
user_id UUID, -- NULL if bank is global for the host/agent
bank_name VARCHAR(126) NOT NULL,
disposition JSONB NOT NULL DEFAULT '{"skepticism": 3, "literalism": 3, "empathy": 3}'::jsonb,
background TEXT,
aggregate_version BIGINT DEFAULT 1 NOT NULL,
active BOOLEAN DEFAULT true,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
update_user VARCHAR(126) DEFAULT SESSION_USER,
PRIMARY KEY(host_id, bank_id),
FOREIGN KEY(host_id) REFERENCES host_t(host_id) ON DELETE CASCADE,
FOREIGN KEY(host_id, agent_def_id) REFERENCES agent_definition_t(host_id, agent_def_id) ON DELETE CASCADE,
FOREIGN KEY(user_id) REFERENCES user_t(user_id) ON DELETE CASCADE
);
-- Source documents for memory units
CREATE TABLE agent_memory_doc_t (
host_id UUID NOT NULL,
doc_id UUID NOT NULL,
bank_id UUID NOT NULL,
original_text TEXT,
content_hash TEXT,
aggregate_version BIGINT DEFAULT 1 NOT NULL,
active BOOLEAN DEFAULT true,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
update_user VARCHAR(126) DEFAULT SESSION_USER,
PRIMARY KEY (host_id, bank_id, doc_id),
FOREIGN KEY (host_id, bank_id) REFERENCES agent_memory_bank_t(host_id, bank_id) ON DELETE CASCADE
);
-- Individual sentence-level memories (The "Atoms" of thought)
CREATE TABLE agent_memory_unit_t (
host_id UUID NOT NULL,
unit_id UUID NOT NULL,
bank_id UUID NOT NULL,
doc_id UUID,
content TEXT NOT NULL,
embedding vector(384),
context TEXT,
event_date TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now(),
occurred_start TIMESTAMP WITH TIME ZONE,
occurred_end TIMESTAMP WITH TIME ZONE,
mentioned_at TIMESTAMP WITH TIME ZONE,
fact_type VARCHAR(32) NOT NULL DEFAULT 'world' CHECK (fact_type IN ('world', 'experience', 'opinion', 'observation', 'mental_model')),
metadata JSONB DEFAULT '{}'::jsonb,
proof_count INT DEFAULT 1,
source_memory_ids UUID[] DEFAULT ARRAY[]::UUID[],
aggregate_version BIGINT DEFAULT 1 NOT NULL,
active BOOLEAN DEFAULT true,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
update_user VARCHAR(126) DEFAULT SESSION_USER,
PRIMARY KEY(host_id, bank_id, unit_id),
FOREIGN KEY(host_id, bank_id) REFERENCES agent_memory_bank_t(host_id, bank_id) ON DELETE CASCADE,
FOREIGN KEY(host_id, bank_id, doc_id) REFERENCES agent_memory_doc_t(host_id, bank_id, doc_id) ON DELETE CASCADE
);
CREATE INDEX idx_mem_unit_bank ON agent_memory_unit_t(bank_id);
CREATE INDEX idx_mem_unit_embedding ON agent_memory_unit_t USING hnsw (embedding vector_cosine_ops);
-- Resolved entities (Knowledge Graph Nodes)
CREATE TABLE agent_memory_entity_t (
host_id UUID NOT NULL,
entity_id UUID NOT NULL,
bank_id UUID NOT NULL,
user_id UUID, -- Link to user_t if this entity is a platform user
canonical_name TEXT NOT NULL,
mention_count INT DEFAULT 1,
metadata JSONB DEFAULT '{}'::jsonb,
aggregate_version BIGINT DEFAULT 1 NOT NULL,
active BOOLEAN DEFAULT true,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
update_user VARCHAR(126) DEFAULT SESSION_USER,
PRIMARY KEY (host_id, bank_id, entity_id),
FOREIGN KEY (host_id, bank_id) REFERENCES agent_memory_bank_t(host_id, bank_id) ON DELETE CASCADE,
FOREIGN KEY (user_id) REFERENCES user_t(user_id) ON DELETE CASCADE
);
-- Association between memory units and entities
CREATE TABLE agent_memory_unit_entity_t (
host_id UUID NOT NULL,
bank_id UUID NOT NULL,
unit_id UUID NOT NULL,
entity_id UUID NOT NULL,
PRIMARY KEY (host_id, bank_id, unit_id, entity_id),
FOREIGN KEY (host_id, bank_id, unit_id) REFERENCES agent_memory_unit_t(host_id, bank_id, unit_id) ON DELETE CASCADE,
FOREIGN KEY (host_id, bank_id, entity_id) REFERENCES agent_memory_entity_t(host_id, bank_id, entity_id) ON DELETE CASCADE
);
-- Cache of entity co-occurrences (Concept Relationship Graph)
CREATE TABLE agent_memory_entity_cooccur_t (
host_id UUID NOT NULL,
bank_id UUID NOT NULL,
entity_id_1 UUID NOT NULL,
entity_id_2 UUID NOT NULL,
cooccur_count INT DEFAULT 1,
last_cooccurred TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
aggregate_version BIGINT DEFAULT 1 NOT NULL,
active BOOLEAN DEFAULT true,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
update_user VARCHAR(126) DEFAULT SESSION_USER,
PRIMARY KEY (host_id, bank_id, entity_id_1, entity_id_2),
CONSTRAINT entity_cooccur_order_check CHECK (entity_id_1 < entity_id_2),
FOREIGN KEY (host_id, bank_id, entity_id_1) REFERENCES agent_memory_entity_t(host_id, bank_id, entity_id) ON DELETE CASCADE,
FOREIGN KEY (host_id, bank_id, entity_id_2) REFERENCES agent_memory_entity_t(host_id, bank_id, entity_id) ON DELETE CASCADE
);
CREATE INDEX idx_mem_cooccur_e1 ON agent_memory_entity_cooccur_t(host_id, entity_id_1);
CREATE INDEX idx_mem_cooccur_e2 ON agent_memory_entity_cooccur_t(host_id, entity_id_2);
-- Links between memory units (Semantic & Causal relationships)
CREATE TABLE agent_memory_link_t (
host_id UUID NOT NULL,
bank_id UUID NOT NULL,
from_unit_id UUID NOT NULL,
to_unit_id UUID NOT NULL,
link_type VARCHAR(32) NOT NULL,
weight FLOAT NOT NULL DEFAULT 1.0,
aggregate_version BIGINT DEFAULT 1 NOT NULL,
active BOOLEAN DEFAULT true,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
update_user VARCHAR(126) DEFAULT SESSION_USER,
PRIMARY KEY (host_id, bank_id, from_unit_id, to_unit_id, link_type),
CONSTRAINT memory_links_type_check CHECK (link_type IN ('temporal', 'semantic', 'entity', 'causes', 'caused_by', 'enables', 'prevents')),
FOREIGN KEY (host_id, bank_id, from_unit_id) REFERENCES agent_memory_unit_t(host_id, bank_id, unit_id) ON DELETE CASCADE,
FOREIGN KEY (host_id, bank_id, to_unit_id) REFERENCES agent_memory_unit_t(host_id, bank_id, unit_id) ON DELETE CASCADE
);
-- Directives (Hard rules that override probabilistic learning)
CREATE TABLE agent_memory_directive_t (
host_id UUID NOT NULL,
directive_id UUID NOT NULL,
bank_id UUID NOT NULL,
name VARCHAR(256) NOT NULL,
content TEXT NOT NULL,
priority INT NOT NULL DEFAULT 0,
aggregate_version BIGINT DEFAULT 1 NOT NULL,
active BOOLEAN DEFAULT true,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
update_user VARCHAR(126) DEFAULT SESSION_USER,
PRIMARY KEY(host_id, bank_id, directive_id),
FOREIGN KEY(host_id, bank_id) REFERENCES agent_memory_bank_t(host_id, bank_id) ON DELETE CASCADE
);
-- Reflections (Synthesized knowledge and high-level observations)
CREATE TABLE agent_memory_reflection_t (
host_id UUID NOT NULL,
reflection_id UUID NOT NULL,
bank_id UUID NOT NULL,
content TEXT NOT NULL,
embedding vector(384),
aggregate_version BIGINT DEFAULT 1 NOT NULL,
active BOOLEAN DEFAULT true,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
update_user VARCHAR(126) DEFAULT SESSION_USER,
PRIMARY KEY(host_id, bank_id, reflection_id),
FOREIGN KEY(host_id, bank_id) REFERENCES agent_memory_bank_t(host_id, bank_id) ON DELETE CASCADE
);
CREATE INDEX idx_mem_reflection_embedding ON agent_memory_reflection_t USING hnsw (embedding vector_cosine_ops);
-- Raw Session History (The source of Truth for active conversations)
CREATE TABLE agent_session_history_t (
host_id UUID NOT NULL,
session_id UUID NOT NULL,
bank_id UUID NOT NULL, -- Links the session to a Hindsight bank
messages JSONB NOT NULL DEFAULT '[]'::jsonb,
metadata JSONB DEFAULT '{}'::jsonb,
aggregate_version BIGINT DEFAULT 1 NOT NULL,
active BOOLEAN DEFAULT true,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
update_user VARCHAR(126) DEFAULT SESSION_USER,
PRIMARY KEY(host_id, bank_id, session_id),
FOREIGN KEY(host_id, bank_id) REFERENCES agent_memory_bank_t(host_id, bank_id) ON DELETE CASCADE
);
CREATE INDEX idx_session_bank ON agent_session_history_t(host_id, bank_id);
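The entity_cooccur_order_check constraint on agent_memory_entity_cooccur_t means a writer has to put each entity pair into a canonical order before inserting or updating a row. The following is a minimal sketch of that normalization, not the project's implementation. It assumes UUIDs are handled as raw 16-byte arrays; UuidBytes and ordered_pair are hypothetical names. PostgreSQL compares uuid values byte by byte, which matches the lexicographic ordering Rust uses for [u8; 16].

```rust
/// Raw 16-byte UUID value (an assumption of this sketch; real code may use
/// a dedicated UUID type instead).
pub type UuidBytes = [u8; 16];

/// Orders an entity pair so that it satisfies
/// `CHECK (entity_id_1 < entity_id_2)`.
/// Returns `None` for a self-pair, which the strict `<` check would reject.
pub fn ordered_pair(a: UuidBytes, b: UuidBytes) -> Option<(UuidBytes, UuidBytes)> {
    use std::cmp::Ordering;
    match a.cmp(&b) {
        Ordering::Less => Some((a, b)),
        Ordering::Greater => Some((b, a)),
        Ordering::Equal => None,
    }
}
```

Normalizing the pair in one place also means (A, B) and (B, A) always land on the same primary key, so the co-occurrence counter is never split across two rows.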
Light-Deployer Design
light-deployer is the cluster-local Kubernetes deployment executor in
Light Fabric.
This document focuses only on the deployer service that lives in
apps/light-deployer. The broader Light Portal deployment workflow, approval
flow, deployment history model, controller routing, and portal UI are covered
outside this repository.
Purpose
light-deployer receives a deployment command, fetches Kubernetes templates,
renders them with deployment values, validates the resulting resources, applies
or deletes resources in the target Kubernetes cluster, and returns safe status
details.
It is intentionally narrow. It does not decide whether a user is allowed to deploy an instance, does not own portal deployment history, and does not create tenant business workflows. Those decisions belong to Light Portal, Light Controller, and the workflow engine.
Service Boundary
light-deployer owns:
- local deployment policy enforcement
- template repository fetch
- YAML template rendering
- manifest parsing and resource summary generation
- Kubernetes dry-run, apply, delete, status, and pruning
- safe event and error reporting
- direct local/MicroK8s deployment endpoints
light-deployer does not own:
- tenant authorization
- instance metadata
- deployment approval
- deployment history persistence
- config snapshot creation
- long-running human workflow decisions
The deployer should reject commands outside its local policy even if an upstream service sends them.
Runtime Model
The service follows the same runtime pattern as light-agent.
main.rs builds the domain service and starts it through:
LightRuntimeBuilder::new(AxumTransport::new(app))
The HTTP listener is owned by light-runtime and light-axum, not by
service-specific socket code. Bind address, HTTP/HTTPS ports, service identity,
and registry settings live in runtime config files.
Default config files:
- config/server.yml
- config/deployer.yml
- config/portal-registry.yml
Local cargo run resolves config from apps/light-deployer/config when run
from the workspace root. The container image runs from /app and uses
/app/config.
Public Endpoints
Phase 1 exposes a direct HTTP surface for local and MicroK8s testing:
GET /health
GET /ready
POST /mcp
GET /mcp/tools
GET /mcp/tools/list
GET /mcp/tools/{tool}
POST /deployments
POST /mcp/tools/{tool}
GET /events?request_id=...
POST /mcp is the MCP JSON-RPC 2.0 endpoint. It supports tools/list,
tools/call, and a minimal initialize response. This is the endpoint that
MCP clients, Light Portal, and AI agents should use.
/deployments accepts the canonical deployment request directly.
/mcp/tools/{tool} maps tool names onto the same internal service functions as
a REST-style local debugging convenience. The convenience tool-list endpoints
return metadata with name, description, inputSchema, endpoint, and
method, but they are not the MCP protocol endpoint.
Supported tool names:
- deployment.render
- deployment.dryRun
- deployment.diff
- deployment.apply
- deployment.delete
- deployment.status
- deployment.rollback
The direct HTTP mode is useful for development and managed environments. The same internal command handling should later be reused by controller-mediated WebSocket/MCP routing.
Request Model
A deployment request is explicit and auditable.
{
"requestId": "01964b05-0000-7000-8000-000000000001",
"hostId": "01964b05-552a-7c4b-9184-6857e7f3dc5f",
"instanceId": "petstore-dev",
"environment": "dev",
"clusterId": "microk8s-local",
"namespace": "petstore-dev",
"action": "deploy",
"values": {
"name": "petstore",
"image": {
"repository": "networknt/openapi-petstore",
"tag": "latest"
}
},
"template": {
"repoUrl": "https://github.com/networknt/openapi-petstore.git",
"ref": "master",
"path": "k8s"
},
"options": {
"dryRun": false,
"waitForRollout": true,
"timeoutSeconds": 300,
"pruneOverride": false
}
}
The current implementation supports inline values. The request model also
contains fields for future values references and immutable snapshot metadata so
it can align with the full portal deployment workflow.
When invoking a specific /mcp/tools/{tool} endpoint, callers do not need to
send action. The deployer derives the action from the tool name. The generic
/deployments endpoint still expects an explicit action in the request body.
For the MCP endpoint, callers use JSON-RPC:
{
"jsonrpc": "2.0",
"id": "tools-list-1",
"method": "tools/list",
"params": {}
}
Tool invocation uses tools/call:
{
"jsonrpc": "2.0",
"id": "render-1",
"method": "tools/call",
"params": {
"name": "deployment.render",
"arguments": {
"hostId": "local-host",
"instanceId": "petstore-dev",
"environment": "dev",
"clusterId": "local",
"namespace": "light-deployer",
"values": {},
"template": {
"repoUrl": "local",
"ref": "main",
"path": "k8s"
}
}
}
}
tools/call derives the deployment action from params.name; callers should
not provide an action field in arguments.
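The derivation of an action from a tool name can be pictured as a small lookup. The sketch below is illustrative only: Action and action_for_tool are hypothetical names, and the pairing of deployment.apply with deploy and deployment.delete with undeploy is an assumption inferred from the tool and action lists in this document, not something the source states explicitly.

```rust
/// Deployment actions named in this document.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    Render,
    DryRun,
    Diff,
    Deploy,
    Undeploy,
    Status,
    Rollback,
}

/// Maps a tool name to its action; unknown tool names yield `None`.
pub fn action_for_tool(tool: &str) -> Option<Action> {
    match tool {
        "deployment.render" => Some(Action::Render),
        "deployment.dryRun" => Some(Action::DryRun),
        "deployment.diff" => Some(Action::Diff),
        // Assumption: the apply/delete tools correspond to deploy/undeploy.
        "deployment.apply" => Some(Action::Deploy),
        "deployment.delete" => Some(Action::Undeploy),
        "deployment.status" => Some(Action::Status),
        "deployment.rollback" => Some(Action::Rollback),
        _ => None,
    }
}
```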
Actions
render
: Fetch templates, render manifests, add namespaces and management labels, and
return resource summaries plus a manifest hash.
dryRun
: Render manifests and validate them against Kubernetes using server-side
dry-run.
diff
: Render manifests, fetch current managed resources, calculate additions,
modifications, and pruned resources, and return a redacted diff summary.
deploy
: Accept the request, run the deployment in the background, apply manifests,
prune removed managed resources, and stream events.
undeploy
: Delete resources associated with the deployment.
status
: Return current managed resource status.
rollback
: Reserved for redeploying a previous immutable portal snapshot. Native
Kubernetes rollout undo is not the target rollback model because it does not
restore ConfigMaps, Secrets, or values snapshots.
Template Fetching
Templates are loaded through the TemplateSource trait.
The current source supports two modes:
- local template root through LIGHT_DEPLOYER_TEMPLATE_BASE_DIR
- remote HTTPS Git clone through gix
For remote repositories, the deployment request provides:
{
"template": {
"repoUrl": "https://github.com/networknt/openapi-petstore.git",
"ref": "master",
"path": "k8s"
}
}
Private HTTPS Git access is controlled by environment variables:
- LIGHT_DEPLOYER_GIT_TOKEN: token or app password
- LIGHT_DEPLOYER_GIT_USERNAME: optional username override
Defaults:
- GitHub uses x-access-token
- Bitbucket Cloud uses x-token-auth
SSH authentication is intentionally deferred because it requires private key
handling and strict known_hosts validation.
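The username defaults for private HTTPS Git access could be resolved along these lines. This is a sketch under stated assumptions: git_username is a hypothetical helper, and matching on the host names github.com and bitbucket.org is an assumption about how the two providers are recognized. The override argument stands in for the value of LIGHT_DEPLOYER_GIT_USERNAME.

```rust
/// Picks the HTTPS Git username for a repository host.
/// An explicit, non-empty override always wins over the provider default.
pub fn git_username(repo_host: &str, username_override: Option<&str>) -> Option<String> {
    if let Some(name) = username_override.filter(|n| !n.is_empty()) {
        return Some(name.to_string());
    }
    match repo_host.to_ascii_lowercase().as_str() {
        // Assumption: providers are recognized by these host names.
        "github.com" => Some("x-access-token".to_string()),
        "bitbucket.org" => Some("x-token-auth".to_string()),
        // No known default: the caller must supply a username.
        _ => None,
    }
}
```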
Template Format
The built-in renderer uses simple placeholders:
image: ${image.repository}:${image.tag:latest}
Supported behavior:
- nested paths such as image.repository
- default values after :
- render failure when a required value is missing
- placeholder replacement only inside YAML string scalar values
The renderer parses YAML into serde_yaml::Value, traverses the AST, replaces
placeholders, and serializes or applies structured YAML values afterward. This
avoids the most common raw string replacement bugs around quoting,
indentation, certificates, and multi-line values.
Because placeholders currently produce strings, templates should avoid
placeholders in numeric-only Kubernetes fields unless Kubernetes accepts a
string value there. For example, containerPort should be fixed or rendered by
a future typed placeholder extension.
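To make the placeholder rules concrete, here is a minimal sketch of substitution inside a single string scalar. It is not the deployer's renderer: render_scalar is a hypothetical function, and it assumes nested values have already been flattened into dotted keys such as image.repository. The YAML traversal described above is left out.

```rust
use std::collections::HashMap;

/// Replaces `${path}` and `${path:default}` placeholders in one string scalar.
/// `values` maps dotted paths to strings (an assumption of this sketch).
pub fn render_scalar(input: &str, values: &HashMap<String, String>) -> Result<String, String> {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(start) = rest.find("${") {
        // Copy the literal text that precedes the placeholder.
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        let end = after
            .find('}')
            .ok_or_else(|| format!("unterminated placeholder in {input:?}"))?;
        let body = &after[..end];
        // Split once on ':' so the default itself may contain colons.
        let (path, default) = match body.split_once(':') {
            Some((p, d)) => (p, Some(d)),
            None => (body, None),
        };
        // A provided value wins; otherwise fall back to the default, if any.
        match values.get(path).map(String::as_str).or(default) {
            Some(v) => out.push_str(v),
            None => return Err(format!("missing required value: {path}")),
        }
        rest = &after[end + 1..];
    }
    out.push_str(rest);
    Ok(out)
}
```

With image.repository set to networknt/openapi-petstore and no image.tag, the template string from this section renders to networknt/openapi-petstore:latest. The result is always a string, which is why numeric-only fields need the typed extension mentioned above.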
Resource Metadata
After rendering, the deployer ensures every resource has the target namespace and adds management labels:
- app.kubernetes.io/managed-by=light-deployer
- lightapi.net/host-id
- lightapi.net/instance-id
- lightapi.net/request-id
These labels are used for status lookup and pruning.
Kubernetes Execution
Kubernetes execution is behind the KubeExecutor trait.
Current implementations:
- KubeRsExecutor: real Kubernetes API execution through kube-rs
- NoopKubeExecutor: local render/test mode
Execution mode:
- LIGHT_DEPLOYER_KUBE_MODE=real: force real Kubernetes mode
- LIGHT_DEPLOYER_KUBE_MODE=noop: force no-op mode
- default: real mode when KUBERNETES_SERVICE_HOST is present, otherwise no-op
The production path uses kube-rs, not kubectl.
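The execution-mode selection can be expressed as a pure function plus a thin environment wrapper, which keeps the rule easy to test. The sketch below uses hypothetical names (KubeMode, resolve_kube_mode, kube_mode_from_env). Treating an unrecognized LIGHT_DEPLOYER_KUBE_MODE value the same as an unset one is an assumption; the real service may reject it instead.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KubeMode {
    Real,
    Noop,
}

/// Applies the documented rule: an explicit mode wins, otherwise real mode
/// is chosen only when the process appears to run inside a cluster.
pub fn resolve_kube_mode(mode: Option<&str>, in_cluster: bool) -> KubeMode {
    match mode.map(str::trim) {
        Some("real") => KubeMode::Real,
        Some("noop") => KubeMode::Noop,
        // Unset, or (as an assumption) unrecognized: fall back to detection.
        _ => {
            if in_cluster {
                KubeMode::Real
            } else {
                KubeMode::Noop
            }
        }
    }
}

/// Reads the two environment variables named in this section.
pub fn kube_mode_from_env() -> KubeMode {
    let mode = std::env::var("LIGHT_DEPLOYER_KUBE_MODE").ok();
    let in_cluster = std::env::var_os("KUBERNETES_SERVICE_HOST").is_some();
    resolve_kube_mode(mode.as_deref(), in_cluster)
}
```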
Kubernetes operations should use:
- in-cluster ServiceAccount auth when running as a pod
- server-side dry-run for validation
- server-side apply with field manager light-deployer
- structured status and error handling
Pruning
The deployer is declarative. If a previously managed resource is no longer rendered from the template, it should be considered for pruning.
Pruning is calculated by comparing:
- current resources in the namespace with lightapi.net/instance-id
- resources rendered from the new template
The policy layer enforces blast-radius protection:
- maximum delete percentage
- sensitive kinds requiring override
- explicit pruneOverride in deployment options
This prevents stale resources while still protecting against accidental large-scale deletion.
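The prune calculation and its blast-radius checks can be sketched as a set difference followed by policy gates. This is an illustration, not the deployer's code: ResourceKey, PrunePolicy, PruneError, and plan_prune are hypothetical. It also assumes that pruneOverride bypasses both the percentage limit and the sensitive-kind check, which the source does not spell out.

```rust
use std::collections::BTreeSet;

#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct ResourceKey {
    pub kind: String,
    pub name: String,
}

pub struct PrunePolicy {
    pub max_delete_percent: u32,
    pub sensitive_kinds: Vec<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum PruneError {
    TooManyDeletes { percent: u32, limit: u32 },
    SensitiveKind(String),
}

/// Returns the managed resources that are no longer rendered, or an error
/// when deleting them would violate the policy.
pub fn plan_prune(
    current: &BTreeSet<ResourceKey>,
    rendered: &BTreeSet<ResourceKey>,
    policy: &PrunePolicy,
    prune_override: bool,
) -> Result<Vec<ResourceKey>, PruneError> {
    // Stale = currently managed but absent from the new render.
    let stale: Vec<ResourceKey> = current.difference(rendered).cloned().collect();
    // Assumption: an explicit override skips both checks below.
    if stale.is_empty() || prune_override {
        return Ok(stale);
    }
    // `current` is non-empty here because `stale` is a non-empty subset of it.
    let percent = (stale.len() * 100 / current.len()) as u32;
    if percent > policy.max_delete_percent {
        return Err(PruneError::TooManyDeletes { percent, limit: policy.max_delete_percent });
    }
    if let Some(key) = stale.iter().find(|k| policy.sensitive_kinds.contains(&k.kind)) {
        return Err(PruneError::SensitiveKind(key.kind.clone()));
    }
    Ok(stale)
}
```

Returning an error instead of silently trimming the delete list keeps the decision visible to the caller, who can then resubmit with pruneOverride if the deletion is intended.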
Policy
The local deployer.yml policy constrains what a deployer is allowed to do.
Policy dimensions:
- allowed namespaces
- allowed repository hosts
- allowed repository URL prefixes
- allowed image registries
- allowed actions
- allowed Kubernetes kinds
- blocked Kubernetes kinds
- prune settings
- development insecure mode
Version 1 allows application-level resource kinds by default:
- Deployment
- Service
- Ingress
- ConfigMap
- Secret
Cluster-scoped and control-plane resources are blocked by default:
- Namespace
- ClusterRole
- ClusterRoleBinding
- CustomResourceDefinition
- admission webhooks
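A kind check against these defaults might look like the sketch below. KindPolicy and its methods are hypothetical names. Two assumptions are baked in: "admission webhooks" is read as the Kubernetes kinds MutatingWebhookConfiguration and ValidatingWebhookConfiguration, and a blocked kind is rejected even if it also appears in the allow list.

```rust
pub struct KindPolicy {
    pub allowed: Vec<String>,
    pub blocked: Vec<String>,
}

fn owned(kinds: &[&str]) -> Vec<String> {
    kinds.iter().map(|k| k.to_string()).collect()
}

impl KindPolicy {
    /// Defaults listed in this section.
    pub fn v1_defaults() -> Self {
        Self {
            allowed: owned(&["Deployment", "Service", "Ingress", "ConfigMap", "Secret"]),
            blocked: owned(&[
                "Namespace",
                "ClusterRole",
                "ClusterRoleBinding",
                "CustomResourceDefinition",
                // Assumption: these two kinds are what "admission webhooks" means.
                "MutatingWebhookConfiguration",
                "ValidatingWebhookConfiguration",
            ]),
        }
    }

    /// A kind is permitted only if it is allowed and not blocked.
    pub fn permits(&self, kind: &str) -> bool {
        let is_blocked = self.blocked.iter().any(|k| k == kind);
        let is_allowed = self.allowed.iter().any(|k| k == kind);
        !is_blocked && is_allowed
    }
}
```

Because the check is allow-list based, a kind that appears in neither list (for example DaemonSet) is rejected as well.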
Security
The deployer can mutate a Kubernetes cluster, so its default posture must be conservative.
Required practices:
- run in Kubernetes with a dedicated ServiceAccount
- prefer namespace-scoped Role and RoleBinding
- restrict allowed namespaces and resource kinds
- restrict template repository hosts or prefixes in production
- restrict image registries in production
- never log raw rendered Secret manifests
- never log raw Kubernetes patch/apply payloads containing Secret data
- return redacted summaries and diffs
Secret values in rendered manifests are redacted before being included in responses or diffs. Kubernetes Secret values are base64 encoded, not encrypted, so they must be treated as plaintext for logging purposes.
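As a small illustration of the redaction rule, the sketch below replaces every value in a Secret's data map while keeping the keys, so a diff can still show which entries changed. The function name, the flat map representation, and the placeholder string are all assumptions of this example.

```rust
use std::collections::BTreeMap;

/// Placeholder written in place of every Secret value (arbitrary choice).
pub const REDACTED: &str = "<redacted>";

/// Returns a copy of a Secret's `data` map that is safe to log or diff.
/// Keys are preserved; values are dropped regardless of their content,
/// because base64 is an encoding, not encryption.
pub fn redact_secret_data(data: &BTreeMap<String, String>) -> BTreeMap<String, String> {
    data.keys()
        .map(|key| (key.clone(), REDACTED.to_string()))
        .collect()
}
```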
Response Model
Responses include enough detail for callers to understand what happened without exposing secrets.
Important fields:
- requestId
- action
- status
- deployerId
- clusterId
- namespace
- manifestHash
- templateCommitSha
- resources
- diff
- events
- error
Resource summaries contain kind, namespace, name, apiVersion, and action. Full rendered manifests should not be returned or persisted by default.
Event Model
Long-running operations return quickly and continue in the background.
Clients can subscribe to:
GET /events?request_id=...
Events contain:
- request ID
- timestamp
- status
- message
- optional resource identity
The event stream is currently direct SSE. Controller-mediated mode can forward the same event shape later.
Installation
The app includes Kubernetes install manifests under apps/light-deployer/k8s:
- namespace
- RBAC
- deployment
- service
The deployment runs the container with LIGHT_DEPLOYER_KUBE_MODE=real. The
image contains /app/config, and server.yml defaults the HTTP port to 7088.
For MicroK8s testing:
./apps/light-deployer/build.sh latest
docker save networknt/light-deployer:latest | microk8s ctr image import -
microk8s kubectl apply -f apps/light-deployer/k8s/namespace.yaml
microk8s kubectl apply -f apps/light-deployer/k8s/rbac.yaml
microk8s kubectl apply -f apps/light-deployer/k8s/deployment.yaml
microk8s kubectl apply -f apps/light-deployer/k8s/service.yaml
Current Limitations
- Direct HTTP/MCP-style mode is implemented first; controller-mediated WebSocket routing is a later integration step.
- Inline values are implemented; config-server valuesRef fetching is still a future integration point.
- Rollback is represented in the model but needs portal snapshot integration.
- Helm and Kustomize are not implemented yet.
- Typed placeholders are not implemented yet.
- Rollout watch depth is intentionally basic in the first phase.
Design Direction
Keep light-deployer small and cluster-local.
The deployer should execute precise deployment commands, enforce local safety policy, and report structured results. It should not grow into a portal, workflow engine, or deployment database. That separation keeps the service easy to install inside customer clusters and reduces the security blast radius.
Module Registry
Status: Phase 4 implemented for light-gateway/gateway; additional module
reloaders remain planned.
Purpose
Light Fabric needs a runtime module registry equivalent to the ModuleRegistry
feature in light-4j.
In light-4j, each active component registers its runtime configuration when
the component loads. Older integrations exposed this through the
/adm/server/info REST endpoint, but the current control-plane path uses MCP
tools through portal-registry. The same registry is also used by the
config-reload operation to decide which modules can reload configuration from
the config server.
Light Fabric already has structured config files and a shared runtime startup flow, but it does not yet have a central registry that answers these operational questions:
- which modules are active in this running instance
- which config file each module loaded
- what masked runtime config is currently active
- which modules can be reloaded without restarting the process
- what happened during the last reload attempt
This document proposes a registry in light-runtime so every Light Fabric
application can expose the same control-plane behavior.
Goals
- Register built-in runtime configs such as startup, server, client, and portal-registry.
- Register application configs such as gateway, deployer, ollama, and mcp-client.
- Store only masked config snapshots in the registry.
- Expose a Java-compatible server-info payload through the get_service_info MCP tool.
- Expose a module list through the get_modules MCP tool for config reload selection.
- Support control-plane reload requests for one module, several modules, or all modules through the reload_modules MCP tool. Phase 3 reports non-reloadable modules as skipped. Phase 4 adds real hot reload for light-gateway/gateway.
- Keep the feature transport-neutral by routing management requests through portal-registry, not through framework-specific REST routes.
Non-Goals
- Do not make every config hot-reloadable in the first phase.
- Do not rebind server ports or TLS listeners unless a transport explicitly supports it.
- Do not expose decrypted secrets through diagnostics.
- Do not make Rust type names part of the public control-plane contract.
- Do not add /adm/... REST endpoints for Light Fabric.
Current Light Fabric Runtime Shape
The natural home for this feature is crates/light-runtime.
LightRuntimeBuilder already owns the startup sequence:
- load local bootstrap config
- optionally fetch remote config from config server
- build RuntimeConfig
- call registered runtime modules
- bind the transport
- register the running instance with the controller
- mark the runtime ready
RuntimeConfig already carries the merged resolved_values, config_dir, and
external_config_dir. Application code can use those fields to load resolved
application config without reparsing values.yml.
The config registry should build on that runtime boundary instead of creating a separate app-local registry per product.
Registry Model
Add a shared registry type in light-runtime.
pub struct ModuleRegistry {
    entries: RwLock<BTreeMap<String, ModuleEntry>>,
    reloaders: RwLock<BTreeMap<String, Arc<dyn ReloadableModule>>>,
}

pub struct ModuleEntry {
    pub module_id: String,
    pub config_name: String,
    pub kind: ModuleKind,
    pub active: bool,
    pub enabled: Option<bool>,
    pub reloadable: bool,
    pub config: serde_json::Value,
    pub masks: Vec<MaskSpec>,
    pub loaded_at: DateTime<Utc>,
    pub last_reload: Option<ReloadStatus>,
}

pub enum ModuleKind {
    Core,
    Framework,
    Application,
    Plugin,
}
Use stable module IDs instead of Rust type names. Java uses class names because they are stable operational identifiers in the JVM. Rust type names are not a good public API and can change during refactoring.
Example module IDs:
- light-runtime/startup
- light-runtime/server
- light-client/client
- light-runtime/portal-registry
- light-gateway/gateway
- light-deployer/deployer
- light-agent/ollama
- light-agent/mcp-client
The registry key should be module_id. Each entry also carries config_name
so the server-info response can preserve the Java-style component map keyed by
config name.
Registered Config Loading
Add a small registered-loader API around the existing ConfigLoader behavior.
let gateway_config: GatewayConfig = context
    .config()
    .load_registered(
        "gateway",
        "light-gateway/gateway",
        [MaskSpec::key("password")],
    )?;
The helper should:
- merge the base file from config_dir
- overlay the external file from external_config_dir
- resolve variables from RuntimeConfig.resolved_values
- deserialize the typed config
- serialize the resolved config to serde_json::Value
- apply masks to the serialized copy
- store only the masked copy in ModuleRegistry
- return the typed config to the caller
This keeps the app code simple and prevents accidental registry entries that contain raw secrets.
Phase 2 added this shared registered-loader path in ModuleRegistry and
attached the registry to RuntimeConfig so apps that load after runtime
bootstrap can register resolved config through the same runtime-owned registry.
Apps that load before runtime startup can create the registry first, register
their application configs, and pass that registry into LightRuntimeBuilder.
For modules that must validate typed config before changing the registry
snapshot, the same loader is also available as load_config(...) followed by
register_loaded_config(...) after validation succeeds.
Masking
Masking must happen at registration time. The registry should not store raw config and then mask it later.
Support two mask forms:
pub enum MaskSpec {
    Key(String),
    Path(String),
}
MaskSpec::Key("password") masks every matching key recursively, matching the
current light-4j behavior.
MaskSpec::Path("oauth.clientSecret") masks a precise path for configs where a
generic key would be too broad.
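The difference between the two mask forms is easiest to see in code. The sketch below is self-contained and therefore uses a tiny stand-in Value tree instead of serde_json::Value, and redeclares MaskSpec locally. The apply_mask function, the dot-separated path syntax, and the MASK placeholder string are assumptions of this example, not the light-runtime API.

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for a JSON tree (the real registry uses serde_json::Value).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Str(String),
    List(Vec<Value>),
    Map(BTreeMap<String, Value>),
}

pub enum MaskSpec {
    Key(String),
    Path(String),
}

/// Placeholder written over masked values (arbitrary choice for this sketch).
pub const MASK: &str = "****";

pub fn apply_mask(value: &mut Value, spec: &MaskSpec) {
    match spec {
        MaskSpec::Key(key) => mask_key(value, key),
        MaskSpec::Path(path) => {
            let segments: Vec<&str> = path.split('.').collect();
            mask_path(value, &segments);
        }
    }
}

/// Masks every occurrence of `key`, at any depth, including inside lists.
fn mask_key(value: &mut Value, key: &str) {
    match value {
        Value::Map(map) => {
            for (k, v) in map.iter_mut() {
                if k == key {
                    *v = Value::Str(MASK.to_string());
                } else {
                    mask_key(v, key);
                }
            }
        }
        Value::List(items) => items.iter_mut().for_each(|v| mask_key(v, key)),
        Value::Str(_) => {}
    }
}

/// Masks only the value found by walking the exact path from the root.
fn mask_path(value: &mut Value, segments: &[&str]) {
    let Some((head, tail)) = segments.split_first() else { return };
    if let Value::Map(map) = value {
        if let Some(child) = map.get_mut(*head) {
            if tail.is_empty() {
                *child = Value::Str(MASK.to_string());
            } else {
                mask_path(child, tail);
            }
        }
    }
}
```

Applying the masks to the serialized copy at registration time, as described above, means the unmasked tree never needs to be stored in the registry at all.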
Suggested default masks:
- authorization
- password
- secret
- clientSecret
- apiKey
- token
- portalToken
- controllerDiscoveryToken
- privateKey
- tlsKeyPath
- bootstrapKeyPath
Add a runtime flag such as server.maskConfigProperties or
admin.maskConfigProperties, defaulting to true, for parity with the Java
server.maskConfigProperties behavior. Even if this flag is disabled, the
control-plane documentation should treat unmasked output as a local debugging
mode only.
Server Info MCP Response
The get_service_info MCP tool response should preserve the same logical shape
that portal-view already understands from Java instances.
{
"deployment": {
"apiVersion": "0.1.0",
"frameworkVersion": "0.1.0"
},
"environment": {
"host": {
"ip": "127.0.0.1",
"hostname": "light-gateway-0"
},
"runtime": {},
"system": {}
},
"security": {},
"component": {
"server": {},
"gateway": {}
},
"plugin": {},
"plugins": [],
"modules": []
}
component should remain keyed by config_name for compatibility.
modules should provide richer Rust metadata:
[
{
"moduleId": "light-gateway/gateway",
"configName": "gateway",
"kind": "application",
"active": true,
"enabled": true,
"reloadable": true,
"loadedAt": "2026-05-07T14:30:00Z",
"lastReload": {
"status": "success",
"message": "reloaded from config server",
"completedAt": "2026-05-07T14:45:00Z"
}
}
]
MCP Access
Expose the registry only through MCP tools served by the runtime's
portal-registry connection.
MCP tools:
get_service_info
get_modules
reload_modules
These are invoked through standard MCP JSON-RPC calls:
{
"jsonrpc": "2.0",
"id": "info-1",
"method": "tools/call",
"params": {
"name": "get_service_info",
"arguments": {}
}
}
The controller remains the management channel. portal-registry receives the
MCP request from the controller, dispatches it to the local runtime registry,
and returns the result through the same websocket session. Light Fabric should
not expose a parallel REST admin surface for this feature.
For compatibility with the existing Java and portal-view workflow,
get_modules returns a string list of module IDs:
{
"modules": [
"light-runtime/server",
"light-gateway/gateway"
]
}
The richer module metadata remains available in the modules field of
get_service_info.
Reload Request
The reload_modules tool should accept omitted arguments, ALL, or explicit
module IDs.
{
"modules": [
"light-gateway/gateway",
"light-runtime/portal-registry"
]
}
An omitted modules value, an empty array, or ["ALL"] targets all registered
modules. Registered modules without concrete reload implementations are
reported as skipped instead of being marked as reloaded.
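The target-selection rule can be sketched as follows. resolve_targets is a hypothetical helper. It only decides which module IDs a request addresses; whether each one is then reloaded, skipped, or failed is a separate step. A list that mixes "ALL" with explicit IDs is not covered by the rule above, so this sketch simply treats it as an explicit list.

```rust
/// Resolves the `modules` argument of a reload request to concrete IDs.
/// `requested` is `None` when the argument was omitted.
pub fn resolve_targets(requested: Option<&[&str]>, registered: &[&str]) -> Vec<String> {
    let selected = match requested {
        // A non-empty list other than exactly ["ALL"] is taken literally.
        Some(ids) if !ids.is_empty() && !matches!(ids, ["ALL"]) => ids,
        // Omitted, empty, or ["ALL"]: target every registered module.
        _ => registered,
    };
    selected.iter().map(|id| id.to_string()).collect()
}
```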
The response should be explicit about what happened:
{
"modules": ["light-gateway/gateway"],
"reloaded": ["light-gateway/gateway"],
"skipped": [
{
"moduleId": "light-runtime/server",
"reason": "requiresRestart"
}
],
"failed": [
{
"moduleId": "light-agent/ollama",
"message": "missing ollama.yml"
}
]
}
modules is a Java-compatible alias for the successfully reloaded module IDs
and is the field portal-view reads today. reloaded, skipped, and failed
carry the more explicit Rust result details.
Reload Implementation
Phase 4 adds a reload trait for modules that can safely swap runtime config.
#[async_trait]
pub trait ReloadableModule: Send + Sync {
    async fn reload(&self, ctx: ReloadContext) -> Result<ReloadOutcome, RuntimeError>;
}
ReloadContext includes:
- a refreshed RuntimeConfig
- updated resolved_values
- the existing config_dir
- the existing external_config_dir
- the shared ModuleRegistry
Reload flow:
- Re-fetch values.yml, certs, and files from the config server into external_config_dir.
- Rebuild the merged resolved_values.
- Resolve requested module IDs.
- For each reloadable module, call its reload implementation.
- Each module validates the new typed config before swapping it into live state.
- Update the registry entry and last_reload status.
- Return a detailed reload result.
Use ConfigManager<T>, or an ArcSwap-backed holder, for modules that need hot
reload. Either approach keeps long-held locks off the request path while still
allowing atomic config replacement.
Phase 4 implements this with ConfigManager<T> in light-runtime. It stores an
Arc<T> behind a short-lived RwLock, so request handlers clone the current
config quickly and reloaders replace the entire typed config only after the new
config has loaded and validated.
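The holder described here can be approximated with the standard library alone. The sketch below shows the pattern (an Arc<T> behind a briefly held RwLock, with validation before the swap). It is not the light-runtime ConfigManager<T>; the method names load and replace_validated are hypothetical.

```rust
use std::sync::{Arc, RwLock};

/// Holds the live typed config and lets reloaders replace it atomically.
pub struct ConfigManager<T> {
    current: RwLock<Arc<T>>,
}

impl<T> ConfigManager<T> {
    pub fn new(initial: T) -> Self {
        Self { current: RwLock::new(Arc::new(initial)) }
    }

    /// Cheap read for request handlers: the lock is held only long enough
    /// to clone the `Arc`, never while the config is being used.
    pub fn load(&self) -> Arc<T> {
        let guard = self.current.read().unwrap_or_else(|e| e.into_inner());
        Arc::clone(&*guard)
    }

    /// Validates the candidate first and swaps it in only on success,
    /// so a bad reload leaves the previous config live.
    pub fn replace_validated<E>(
        &self,
        candidate: T,
        validate: impl FnOnce(&T) -> Result<(), E>,
    ) -> Result<(), E> {
        validate(&candidate)?;
        let mut guard = self.current.write().unwrap_or_else(|e| e.into_inner());
        *guard = Arc::new(candidate);
        Ok(())
    }
}
```

A handler that called load before a reload keeps its own Arc to the old config until it finishes, so in-flight requests never observe a half-applied change.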
Reloadability Rules
Classify configs by reload safety.
Reloadable candidates:
- light-gateway/gateway
- light-deployer/deployer
- light-agent/ollama
- light-agent/mcp-client
- route, policy, provider, or rule configs that are already read through swappable state
Requires restart by default:
- bind IP
- HTTP/HTTPS port
- protocol enablement
- TLS certificate path used by the listener
- runtime config directory
- config-server bootstrap identity
- controller registration identity
Some server.yml fields can still be reloadable later, such as
shutdownGracefulPeriod, but listener-affecting fields should stay
requiresRestart until each transport supports safe rebinding.
Framework Integration
The registry should not require each framework to expose admin routes.
light-runtime should attach an MCP-capable RegistryHandler to the
portal-registry client. When the controller invokes tools/list or
tools/call, the handler can advertise and execute the local management tools
without involving light-axum or light-pingora request routing.
This keeps light-axum and light-pingora focused on application traffic. It
also avoids adding service ports, Kubernetes routes, or Pingora request filters
only for control-plane operations.
Application Integration
light-gateway is integrated first because it already loads gateway.yml from
RuntimeConfig.resolved_values, config_dir, and external_config_dir. It
loads the resolved typed config, validates upstreams, and then stores the
masked registry snapshot. In Phase 4, light-gateway/gateway also registers a
ReloadableModule that reloads and validates gateway.yml, updates the masked
registry snapshot, and swaps the live GatewayConfig through ConfigManager.
light-deployer loads deployer.yml before the runtime is started, so it
creates a ModuleRegistry before loading its config, registers the final
env-overridden deployer config, and passes the same registry to
LightRuntimeBuilder.
light-agent also loads application configs before runtime startup. It now
registers ollama.yml and mcp-client.yml in the pre-runtime registry and
passes that registry into LightRuntimeBuilder. The existing manual
PortalRegistryClient setup is unchanged so the registry feature does not
reintroduce duplicate controller registration.
Current Registered Modules
Phase 4 registers these modules:
| Module ID | Config name | Kind | Reloadable |
|---|---|---|---|
| light-runtime/startup | startup | core | no |
| light-runtime/server | server | core | no |
| light-client/client | client | core | no |
| light-runtime/portal-registry | portal-registry | core | no |
| light-gateway/gateway | gateway | application | yes |
| light-deployer/deployer | deployer | application | no |
| light-agent/ollama | ollama | application | no |
| light-agent/mcp-client | mcp-client | application | no |
The application modules are visible in get_service_info once their owning
application loads them. get_modules returns the corresponding module ID
strings for portal-view selection. light-gateway/gateway can reload without a
restart. Other application modules keep reloadable=false until their runtime
state is moved behind swappable holders.
Rollout Plan
Phase 1: Registry and Masked Info
- Implemented: ModuleRegistry, ModuleEntry, and mask utilities in light-runtime.
- Implemented: built-in runtime config registration.
- Implemented: tests proving raw secrets are not stored in registry entries.
- Implemented: Java-compatible server-info response assembly.
- Implemented: module-list response.
- Implemented: a portal-registry MCP handler that exposes get_service_info and get_modules.
Phase 2: Application Registration
- Implemented: convert light-gateway/gateway to registered config loading.
- Implemented: convert light-deployer/deployer.
- Implemented: convert light-agent/ollama and light-agent/mcp-client.
- Implemented: add docs showing module IDs and reloadability.
Phase 3: Controller Operations
- Implemented: add MCP tools/list and tools/call support for reload_modules.
- Implemented: align portal-view calls so Java and Rust instances can be managed with the same control-plane workflow.
- Implemented: return Java-compatible modules string lists while preserving detailed reloaded, skipped, and failed reload result fields.
Phase 4: Hot Reload
- Implemented: add ReloadableModule, ReloadContext, and ReloadOutcome.
- Implemented: add ConfigManager<T> for swappable typed configs.
- Implemented: implement reload for light-gateway/gateway.
- Implemented: add reload result tracking in the registry.
- Implemented: add tests for registry reload results, gateway live config swapping, and config-server-backed reload context refresh.
Open Questions
- Should module IDs be centrally reserved in light-runtime, or should each application own its ID namespace?
- Should the Java-compatible component map include only active modules, while modules includes inactive-but-known modules?
- Should MCP tool execution be enabled whenever portal-registry is enabled, or guarded by a separate admin-tools flag?
- Should server.maskConfigProperties=false be allowed in production builds, or should Rust always mask known dangerous keys?
Implementation Sequence
Phase 1 implemented registry and masked server info first, without hot reload.
Phase 2 added application registration, so portal-view can display Rust
application modules next to Java modules once it calls the MCP tools through
portal-registry.
Phase 3 added the controller-facing reload_modules tool and Java-compatible
module ID lists.
Phase 4 added the first real hot reload implementation for
light-gateway/gateway. The next implementation step is to move additional
application configs, such as light-deployer/deployer,
light-agent/ollama, and light-agent/mcp-client, behind swappable runtime
state before marking them reloadable.
Cache Control Plane
Status: Proposed
Purpose
Light Fabric should expose the same cache operations through the portal control
plane that Java services expose through light-4j and portal-registry.
Today, portal-view can list caches and inspect cache entries for a running
service instance. The next required operation is clearing a cache so cached data
can be reloaded from its source of truth after operational data changes. A
common case is clearing the reference-data cache in portal-service after
reference tables are changed from light-portal.
The feature should be generic. It should not be a portal-service only endpoint.
Any Java or Rust service that registers with the controller and has named local
caches should be manageable through the same MCP tool contract.
Current Shape
The Java implementation already has most of the control-plane pieces:
- light-4j/cache-manager defines the generic CacheManager API.
- light-4j/caffeine-cache provides the Caffeine-backed implementation.
- light-4j/portal-registry exposes MCP tools such as list_caches and get_cache_entries.
- controller-rs and the Java controller forward instance-specific MCP tool calls by runtimeInstanceId.
- portal-view calls the controller MCP websocket and passes runtimeInstanceId for cache exploration.
The main semantic gap is that CacheManager.removeCache(name) removes the cache
from the manager in the Caffeine implementation. For a control-plane clear
operation, the desired behavior is different: invalidate all entries while
keeping the configured cache alive so the next application read repopulates it.
Goals
- Add a generic whole-cache clear operation.
- Keep the control-plane contract compatible between Java services and Light Fabric services.
- Expose cache operations through portal-registry and controller MCP routing, not through service-specific REST endpoints.
- Let portal-view clear a selected cache from the existing Cache Explorer page.
- Use the same feature for portal-service reference data caching.
- Preserve existing cache inspection behavior.
Non-Goals
- Do not remove or unregister a configured cache when clearing entries.
- Do not require every service to use the same cache backend.
- Do not expose raw secrets or unsafe object internals through cache inspection.
- Do not build event-driven cross-service cache invalidation in the first phase.
- Do not confuse runtime data caches with the config-cache directory used for remote configuration files.
MCP Tool Contract
Add a new generic tool:
{
"name": "clear_cache",
"description": "Clear all entries from a named cache on a live runtime instance.",
"inputSchema": {
"type": "object",
"required": ["runtimeInstanceId", "name"],
"properties": {
"runtimeInstanceId": { "type": "string", "format": "uuid" },
"name": { "type": "string" }
}
}
}
The controller accepts runtimeInstanceId, removes it from the forwarded
arguments, and sends this to the target runtime:
{
"name": "clear_cache",
"arguments": {
"name": "reference-data"
}
}
Recommended success response:
{
"supported": true,
"status": "success",
"name": "reference-data",
"beforeSize": 42,
"afterSize": 0
}
Recommended unsupported response:
{
"supported": false,
"status": "unsupported",
"name": "reference-data",
"message": "Cache support is not available on this service."
}
Key-level invalidation can be added later as a separate
invalidate_cache_entry tool with { "name": "...", "key": "..." }.
Whole-cache clear should be implemented first because it solves the reference
data reload case without introducing cache-key UX and serialization questions.
Java Compatibility Work
In light-4j, add an explicit clear operation to the generic cache API:
void clear(String cacheName);
The Caffeine implementation should call cache.invalidateAll() and keep the
cache in the manager. It may call cache.cleanUp() before returning size data.
removeCache(name) should keep its existing unregister/remove semantics.
portal-registry should advertise clear_cache in tools/list and handle it
in tools/call by using CacheManager.getInstance(). The handler should
return supported: false when cache classes or a cache manager are not
available, matching the current list_caches and get_cache_entries behavior.
The controller catalogs need the same tool so portal-view can call it through the normal controller websocket:
- controller-rs tool catalog and command serialization
- Java light-controller tool catalog and routed-call handling, if it remains a supported control-plane runtime
Light Fabric Runtime Design
Light Fabric should provide a small cache abstraction at the runtime layer so applications do not each define a different operational surface.
A practical shape is:
#[async_trait::async_trait]
pub trait RuntimeCache: Send + Sync {
    async fn len(&self) -> usize;
    async fn entries_summary(&self) -> serde_json::Value;
    async fn clear(&self);
}

#[derive(Default)]
pub struct CacheRegistry {
    caches: RwLock<BTreeMap<String, Arc<dyn RuntimeCache>>>,
}
The registry should support:
- register named cache
- list cache names
- get summarized entries
- clear a named cache
moka is the preferred default backend for async Rust services because it maps
well to the Caffeine use case. Applications should still be free to register
custom cache wrappers as long as they implement the runtime trait.
RuntimeMcpHandler in light-runtime should expose the same tools as Java:
- list_caches
- get_cache_entries
- clear_cache
If a runtime has no cache registry, these tools should return supported: false rather than failing the request.
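The registry operations listed above can be sketched with the standard library alone. This is a simplified, synchronous illustration: the trait drops the async methods and entries_summary of the proposed RuntimeCache, and MapCache, ClearResult, and the method names are assumptions made for the example, not the light-runtime API.

```rust
use std::collections::{BTreeMap, HashMap};
use std::sync::{Arc, RwLock};

/// Simplified, synchronous stand-in for the proposed async RuntimeCache trait.
pub trait RuntimeCache: Send + Sync {
    fn len(&self) -> usize;
    fn clear(&self);
}

/// Hypothetical in-memory cache used only for this illustration.
#[derive(Default)]
pub struct MapCache {
    entries: RwLock<HashMap<String, String>>,
}

impl MapCache {
    pub fn insert(&self, key: &str, value: &str) {
        self.entries
            .write()
            .unwrap()
            .insert(key.to_string(), value.to_string());
    }
}

impl RuntimeCache for MapCache {
    fn len(&self) -> usize {
        self.entries.read().unwrap().len()
    }

    fn clear(&self) {
        // Drop the entries but keep the cache itself registered and usable.
        self.entries.write().unwrap().clear();
    }
}

/// Outcome of a clear request, mirroring the beforeSize/afterSize response fields.
#[derive(Debug, PartialEq)]
pub enum ClearResult {
    Cleared { before_size: usize, after_size: usize },
    UnknownCache,
}

#[derive(Default)]
pub struct CacheRegistry {
    caches: RwLock<BTreeMap<String, Arc<dyn RuntimeCache>>>,
}

impl CacheRegistry {
    /// Register a named cache.
    pub fn register(&self, name: &str, cache: Arc<dyn RuntimeCache>) {
        self.caches.write().unwrap().insert(name.to_string(), cache);
    }

    /// List cache names in sorted order.
    pub fn names(&self) -> Vec<String> {
        self.caches.read().unwrap().keys().cloned().collect()
    }

    /// Clear a named cache without unregistering it.
    pub fn clear(&self, name: &str) -> ClearResult {
        let cache = self.caches.read().unwrap().get(name).cloned();
        match cache {
            Some(cache) => {
                let before_size = cache.len();
                cache.clear();
                ClearResult::Cleared { before_size, after_size: cache.len() }
            }
            None => ClearResult::UnknownCache,
        }
    }
}
```

A handler built on this shape could map ClearResult::Cleared to the success response and a missing registry to supported: false. How an unknown cache name should be reported is not specified here and is left open in the sketch.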
Portal Service Reference Data Cache
portal-service can use the generic Light Fabric cache for /r/data.
Suggested cache names:
- reference-data
- reference-data-relation
Suggested keys:
- host:{hostId|global}:lang:{lang}:table:{name}
- host:{hostId|global}:lang:{lang}:table:{name}:rela:{rela}:from:{from}
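As a small illustration of the key templates above, the helpers below build both key forms. The function names are hypothetical, and the sketch assumes that {hostId|global} means "the host id when present, otherwise the literal segment global".

```rust
/// Builds a reference-data cache key.
/// A host_id of None maps to the literal "global" segment (an assumption of this sketch).
pub fn reference_data_key(host_id: Option<&str>, lang: &str, table: &str) -> String {
    format!(
        "host:{}:lang:{}:table:{}",
        host_id.unwrap_or("global"),
        lang,
        table
    )
}

/// Builds a reference-data-relation cache key by extending the table key.
pub fn reference_relation_key(
    host_id: Option<&str>,
    lang: &str,
    table: &str,
    rela: &str,
    from: &str,
) -> String {
    format!(
        "{}:rela:{}:from:{}",
        reference_data_key(host_id, lang, table),
        rela,
        from
    )
}
```

Building the relation key from the table key keeps the two formats from drifting apart if the prefix ever changes.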
The request flow becomes:
- /r/data receives a reference-data request.
- ReferenceService builds a stable cache key from host, language, table, relation, and source value.
- On cache hit, return cached reference data.
- On cache miss, query Postgres, cache the result, and return it.
- When reference data changes in light-portal, an operator clears reference-data or reference-data-relation for the target portal-service runtime instance from portal-view.
- The next /r/data call reloads from Postgres.
This keeps the first implementation manual and deterministic. A later phase can subscribe to reference-table change events and clear matching caches automatically.
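The hit, miss, and clear steps of this flow follow the usual cache-aside pattern. A minimal sketch, assuming a hypothetical ReferenceCache type and a load closure standing in for the Postgres query:

```rust
use std::collections::HashMap;

/// Hypothetical cache-aside helper; rows are plain strings to keep the example small.
pub struct ReferenceCache {
    entries: HashMap<String, Vec<String>>,
}

impl ReferenceCache {
    pub fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Returns cached rows on a hit. On a miss, calls `load` (the stand-in for
    /// the database query), stores the result, and returns it.
    pub fn get_or_load<F>(&mut self, key: &str, load: F) -> Vec<String>
    where
        F: FnOnce() -> Vec<String>,
    {
        if let Some(rows) = self.entries.get(key) {
            return rows.clone();
        }
        let rows = load();
        self.entries.insert(key.to_string(), rows.clone());
        rows
    }

    /// Whole-cache clear: the next read for any key goes back to the source.
    pub fn clear(&mut self) {
        self.entries.clear();
    }
}
```

After clear, the next get_or_load call for any key invokes the loader again, which is the behavior the operator relies on when clearing the cache from portal-view.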
Portal View UX
The existing Cache Explorer page should stay the main UI.
Add a clear action for the selected cache:
- show the selected cache name
- require confirmation before clearing
- disable the button while the request is running
- call clear_cache with { runtimeInstanceId, name }
- refetch cache entries after a successful clear
The UI should not require users to know whether the target service is Java or Rust. Unsupported runtimes should show the returned unsupported message.
Implementation Phases
Phase 1: Java clear support
- Add CacheManager.clear(cacheName).
- Implement it in caffeine-cache.
- Add clear_cache to portal-registry MCP tools.
- Add targeted tests for clearing while preserving the configured cache.
Phase 2: Controller and portal-view
- Add clear_cache to controller tool catalogs and command routing.
- Add the Cache Explorer clear button and confirmation.
- Verify the existing runtimeInstanceId forwarding path is reused.
Phase 3: Light Fabric generic cache
- Add a runtime cache registry and trait.
- Add moka-backed cache support.
- Expose list_caches, get_cache_entries, and clear_cache from RuntimeMcpHandler.
- Add focused light-runtime tests for supported and unsupported cache cases.
Phase 4: Portal service reference data
- Register reference-data and reference-data-relation caches.
- Cache /r/data query results.
- Clear the cache from portal-view and verify the next request reloads from Postgres.
Verification
Recommended targeted checks:
mvn -q -pl cache-manager,caffeine-cache,portal-registry test
cargo test -p light-runtime
cargo check --workspace
yarn build
Use the Maven command in light-4j, the Cargo commands in light-fabric and
portal-service as appropriate, and the frontend build in portal-view.
Client Configuration And Modules
Status
Brainstorming proposal for standardizing client.yml across Light Fabric
runtime, framework modules, and products.
The immediate trigger is that different Rust modules currently interpret
client.yml differently. For example, light-runtime reads a small top-level
verifyHostname field for controller and config-server clients, while
light-pingora token and SPA modules read a Java-style nested tls section.
That split makes a single client.verifyHostname: false value unreliable.
This document proposes a common contract so every Rust module uses the same
client.yml file and the same typed configuration model.
Purpose
client.yml should describe outbound client behavior for a running service:
- TLS trust, hostname verification, and optional client identity.
- HTTP request timeout, retry, circuit breaker, connection pool, and HTTP/2 behavior.
- OAuth 2.0 token, key, sign, dereference, and provider-selection behavior.
- Path-prefix-to-service mapping used when different downstream services use different OAuth providers.
The file should be loaded once through the runtime configuration system, registered once in the module registry with secrets masked, then shared by all modules that make outbound calls.
Compatibility Contract
The Java light-4j client.yml remains the compatibility baseline. Rust can
clean up the internal model, but it should not remove behavior that Java
http-client and client-config expose.
Important Java sections:
tls:
  verifyHostname: ${client.verifyHostname:true}
  loadDefaultTrustStore: ${client.loadDefaultTrustStore:true}
  loadTrustStore: ${client.loadTrustStore:true}
  trustStore: ${client.trustStore:client.truststore}
  trustStorePass: ${client.trustStorePass:password}
  loadKeyStore: ${client.loadKeyStore:false}
  keyStore: ${client.keyStore:client.keystore}
  keyStorePass: ${client.keyStorePass:password}
  keyPass: ${client.keyPass:password}
  defaultCertPassword: ${client.defaultCertPassword:changeit}
  tlsVersion: ${client.tlsVersion:TLSv1.3}
oauth:
  multipleAuthServers: ${client.multipleAuthServers:false}
  token:
    cache:
      capacity: ${client.tokenCacheCapacity:200}
    tokenRenewBeforeExpired: ${client.tokenRenewBeforeExpired:60000}
    expiredRefreshRetryDelay: ${client.expiredRefreshRetryDelay:2000}
    earlyRefreshRetryDelay: ${client.earlyRefreshRetryDelay:4000}
    server_url: ${client.tokenServerUrl:}
    serviceId: ${client.tokenServiceId:com.networknt.oauth2-token-1.0.0}
    proxyHost: ${client.tokenProxyHost:}
    proxyPort: ${client.tokenProxyPort:}
    enableHttp2: ${client.tokenEnableHttp2:true}
    authorization_code: {}
    client_credentials: {}
    refresh_token: {}
    token_exchange: {}
    key: {}
  sign: {}
  deref: {}
pathPrefixServices: ${client.pathPrefixServices:}
request:
  errorThreshold: ${client.errorThreshold:2}
  connectTimeout: ${client.connectTimeout:2000}
  timeout: ${client.timeout:3000}
  resetTimeout: ${client.resetTimeout:7000}
  injectOpenTracing: ${client.injectOpenTracing:false}
  injectCallerId: ${client.injectCallerId:false}
  enableHttp2: ${client.enableHttp2:true}
  connectionPoolSize: ${client.connectionPoolSize:1000}
  connectionExpireTime: ${client.connectionExpireTime:1800000}
  maxReqPerConn: ${client.maxReqPerConn:1000000}
  maxConnectionNumPerHost: ${client.maxConnectionNumPerHost:1000}
  minConnectionNumPerHost: ${client.minConnectionNumPerHost:250}
  maxRequestRetry: ${client.maxRequestRetry:3}
  requestRetryDelay: ${client.requestRetryDelay:1000}
  poolMetricsEnabled: ${client.poolMetricsEnabled:false}
  poolWarmUpEnabled: ${client.poolWarmUpEnabled:false}
  poolWarmUpSize: ${client.poolWarmUpSize:1}
  healthCheckEnabled: ${client.healthCheckEnabled:true}
  healthCheckIntervalMs: ${client.healthCheckIntervalMs:30000}
Rust should add fields such as tls.caCertPath, tls.clientCertPath, and
tls.clientKeyPath because PEM files are the native Rust deployment shape.
Rust does not need to support Java-specific JKS/JCEKS truststore or keystore
formats. If those Java-only fields appear in a Rust client.yml, they can be
ignored because config-server should control which fields it injects for Rust
services.
Initial Rust Gaps
At the start of this migration, the Rust implementation had three separate interpretations of client configuration:
| Area | Current behavior | Problem |
|---|---|---|
| light-runtime config-server and portal-registry clients | Read ClientConfig { verify_hostname } from top-level client.yml | Did not understand the Java nested tls.verifyHostname shape |
| light-pingora token, security JWKS, stateless auth, and MSAL exchange | Read ClientTokenConfig with tls, oauth, pathPrefixServices, and request | Was closer to Java, but framework-local and did not drive runtime clients |
| light-gateway upstream proxy | Read the resolved flat value client.verifyHostname directly from values.yml | Bypassed typed client.yml and could disagree with other modules |
Before this design, Rust support was also partial compared with Java:
| Java capability | Initial Rust status |
|---|---|
| tls.verifyHostname | Supported by Pingora token/SPAs, not by runtime controller/config-server clients |
| CA trust | Supported through Rust caCertPath; Java truststore fields are not modeled |
| Client certificate and key for mTLS | Not yet modeled for outbound clients |
| TLS version | Not yet modeled |
| Request connect and total timeout | Supported for token/SPAs |
| Retries, circuit breaker, pool sizing, pool health | Not yet modeled as shared client behavior |
| OAuth authorization_code | Supported by SPA auth |
| OAuth client_credentials | Supported by token handler |
| OAuth refresh_token | Supported by SPA auth |
| OAuth token_exchange | Supported by MSAL exchange and SPA auth |
| OAuth token key / JWKS | Partially supported by security runtime |
| token.key.serviceIdAuthServers and audience | Not fully modeled in Rust |
| OAuth sign | Not yet modeled |
| OAuth sign.key / sign JWKS | Not yet modeled |
| OAuth deref | Not yet modeled |
| Multiple auth providers by service id | Supported for client credentials, but should become a shared resolver |
| pathPrefixServices | Supported in token handler, but should become shared resolver logic |
Goals
- Keep client.yml as the only config file for outbound client behavior.
- Make the Java nested shape canonical: tls.verifyHostname, not top-level verifyHostname.
- Load and register the resolved client.yml once through light-runtime.
- Share one typed ClientConfig across runtime, Pingora, gateway, agent, deployer, MCP clients, model-provider clients, and future products.
- Preserve Java-compatible field names and config-server placeholder names.
- Support direct URL, direct registry, and portal registry service discovery consistently for token, key, sign, deref, and generic outbound calls.
- Keep secrets masked in module registry snapshots and logs.
- Make invalid active client config fail startup or reject reload before it changes live runtime behavior.
- Allow Rust-native PEM fields without forcing Java keystore names into every Rust deployment.
Non-Goals
- Do not move handler activation into client.yml. Handler-specific files such as token.yml, statelessAuth.yml, and msal-exchange.yml still decide whether a handler runs.
- Do not implement every Java-only low-level connection-pool behavior in the first phase. The shared schema should include the fields so config is not lost, but unsupported fields can be ignored deliberately until the transport supports them.
- Do not expose decrypted client secrets, tokens, or legacy Java password fields through module registry, MCP tools, logs, metrics, or cache output.
- Do not require every module to use OAuth. The shared config must support simple TLS-only clients too.
Resolved Decisions
- Create a separate light-client crate now so the shared config, HTTP client factory, OAuth client, and provider resolver can be reused without coupling every consumer to light-runtime.
- Standardize Rust outbound TLS material on PEM paths. Java truststore and keystore formats are not required for Rust services.
- client.yml reload should not force an immediate portal-registry reconnect. Reload is primarily for newly onboarded JWKS/JWT access and future outbound requests. Existing long-lived controller connections can keep running until their normal reconnect or service restart.
- Unsupported Java fields can be ignored by Rust. Config-server should avoid injecting unsupported fields into Rust service config.
- Ignored Java-only fields should be ignored silently. Rust startup does not need to warn about fields that config-server may omit for Rust services.
- oauth.multipleAuthServers remains accepted for Java compatibility, but Rust should infer multi-provider mode when serviceIdAuthServers is configured.
- pathPrefixServices stays in client.yml. It is outbound-client provider selection and is different from inbound path routing to downstream services.
- Circuit breaker behavior is only needed by Pingora. Shared request config can carry the Java-compatible fields, but non-Pingora clients do not need to own circuit breaker state.
- SAML bearer is not required for Light Fabric and should remain out of scope unless a future product explicitly needs it.
Proposed Canonical Shape
The canonical Rust client.yml should stay close to Java:
tls:
  verifyHostname: ${client.verifyHostname:true}
  caCertPath: ${client.caCertPath:}
  clientCertPath: ${client.clientCertPath:}
  clientKeyPath: ${client.clientKeyPath:}
  tlsVersion: ${client.tlsVersion:TLSv1.3}
request:
  connectTimeout: ${client.connectTimeout:2000}
  timeout: ${client.timeout:3000}
  maxRequestRetry: ${client.maxRequestRetry:3}
  requestRetryDelay: ${client.requestRetryDelay:1000}
  errorThreshold: ${client.errorThreshold:2}
  resetTimeout: ${client.resetTimeout:7000}
  injectCallerId: ${client.injectCallerId:false}
  enableHttp2: ${client.enableHttp2:true}
  connectionPoolSize: ${client.connectionPoolSize:1000}
  connectionExpireTime: ${client.connectionExpireTime:1800000}
  maxReqPerConn: ${client.maxReqPerConn:1000000}
  maxConnectionNumPerHost: ${client.maxConnectionNumPerHost:1000}
  minConnectionNumPerHost: ${client.minConnectionNumPerHost:250}
  poolMetricsEnabled: ${client.poolMetricsEnabled:false}
  poolWarmUpEnabled: ${client.poolWarmUpEnabled:false}
  poolWarmUpSize: ${client.poolWarmUpSize:1}
  healthCheckEnabled: ${client.healthCheckEnabled:true}
  healthCheckIntervalMs: ${client.healthCheckIntervalMs:30000}
oauth:
  multipleAuthServers: ${client.multipleAuthServers:false}
  token:
    cache:
      capacity: ${client.tokenCacheCapacity:200}
    tokenRenewBeforeExpired: ${client.tokenRenewBeforeExpired:60000}
    expiredRefreshRetryDelay: ${client.expiredRefreshRetryDelay:2000}
    earlyRefreshRetryDelay: ${client.earlyRefreshRetryDelay:4000}
    server_url: ${client.tokenServerUrl:}
    serviceId: ${client.tokenServiceId:com.networknt.oauth2-token-1.0.0}
    proxyHost: ${client.tokenProxyHost:}
    proxyPort: ${client.tokenProxyPort:}
    enableHttp2: ${client.tokenEnableHttp2:true}
    authorization_code:
      uri: ${client.tokenAcUri:/oauth2/token}
      client_id: ${client.tokenAcClientId:}
      client_secret: ${client.tokenAcClientSecret:}
      redirect_uri: ${client.tokenAcRedirectUri:}
      scope: ${client.tokenAcScope:}
    client_credentials:
      uri: ${client.tokenCcUri:/oauth2/token}
      client_id: ${client.tokenCcClientId:}
      client_secret: ${client.tokenCcClientSecret:}
      scope: ${client.tokenCcScope:}
      serviceIdAuthServers: ${client.tokenCcServiceIdAuthServers:}
    refresh_token:
      uri: ${client.tokenRtUri:/oauth2/token}
      client_id: ${client.tokenRtClientId:}
      client_secret: ${client.tokenRtClientSecret:}
      scope: ${client.tokenRtScope:}
    token_exchange:
      uri: ${client.tokenExUri:/oauth2/token}
      client_id: ${client.tokenExClientId:}
      client_secret: ${client.tokenExClientSecret:}
      scope: ${client.tokenExScope:}
      subjectToken: ${client.subjectToken:}
      subjectTokenType: ${client.subjectTokenType:urn:ietf:params:oauth:token-type:jwt}
      requestedTokenType: ${client.requestedTokenType:}
      audience: ${client.tokenExAudience:}
    key:
      server_url: ${client.tokenKeyServerUrl:}
      serviceId: ${client.tokenKeyServiceId:com.networknt.oauth2-key-1.0.0}
      uri: ${client.tokenKeyUri:/oauth2/key}
      client_id: ${client.tokenKeyClientId:}
      client_secret: ${client.tokenKeyClientSecret:}
      enableHttp2: ${client.tokenKeyEnableHttp2:true}
      serviceIdAuthServers: ${client.tokenKeyServiceIdAuthServers:}
      audience: ${client.tokenKeyAudience:}
  sign:
    server_url: ${client.signServerUrl:}
    serviceId: ${client.signServiceId:com.networknt.oauth2-token-1.0.0}
    uri: ${client.signUri:/oauth2/sign}
    timeout: ${client.signTimeout:2000}
    client_id: ${client.signClientId:}
    client_secret: ${client.signClientSecret:}
    proxyHost: ${client.signProxyHost:}
    proxyPort: ${client.signProxyPort:}
    enableHttp2: ${client.signEnableHttp2:true}
    key:
      server_url: ${client.signKeyServerUrl:}
      serviceId: ${client.signKeyServiceId:com.networknt.oauth2-key-1.0.0}
      uri: ${client.signKeyUri:/oauth2/key}
      client_id: ${client.signKeyClientId:}
      client_secret: ${client.signKeyClientSecret:}
      enableHttp2: ${client.signKeyEnableHttp2:true}
      audience: ${client.signKeyAudience:}
  deref:
    server_url: ${client.derefServerUrl:}
    serviceId: ${client.derefServiceId:com.networknt.oauth2-token-1.0.0}
    uri: ${client.derefUri:/oauth2/deref}
    client_id: ${client.derefClientId:}
    client_secret: ${client.derefClientSecret:}
    proxyHost: ${client.derefProxyHost:}
    proxyPort: ${client.derefProxyPort:}
    enableHttp2: ${client.derefEnableHttp2:true}
pathPrefixServices: ${client.pathPrefixServices:}
Compatibility aliases:
- Accept serverUrl in addition to Java server_url for Rust callers.
- Accept clientId and clientSecret in addition to Java client_id and client_secret only as aliases. The emitted template should keep Java names.
- Temporarily accept top-level verifyHostname only as a migration fallback, but register a warning and normalize it into tls.verifyHostname.
Serde strategy for the top-level verifyHostname fallback:
- The shared ClientConfig should deserialize into a struct that has a tls.verifyHostname field and a separate #[serde(default)] top-level verify_hostname field.
- After deserialization, a post-parse normalization step should check whether the top-level field was explicitly set. If so, it logs a deprecation warning and copies the value into tls.verify_hostname only when the nested field was not also explicitly set.
- When both the top-level and nested fields are present, the nested tls.verifyHostname value wins. The top-level value is ignored after the warning.
- Do not rely on two competing #[serde(default)] fields resolving the conflict. Use a custom Deserialize impl or an explicit post-parse step.
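The post-parse step described above can be sketched without serde by modelling "explicitly set" as Option<bool>. The type and function names are hypothetical, the warning text is illustrative, and the default of true is taken from the ${client.verifyHostname:true} placeholder.

```rust
/// Raw parse result: None means the field was absent from client.yml.
#[derive(Debug, Default)]
pub struct RawClientConfig {
    /// Deprecated top-level verifyHostname.
    pub verify_hostname: Option<bool>,
    /// Canonical nested tls.verifyHostname.
    pub tls_verify_hostname: Option<bool>,
}

/// Result of normalization: the effective value plus an optional deprecation warning.
#[derive(Debug, PartialEq)]
pub struct Normalized {
    pub verify_hostname: bool,
    pub warning: Option<String>,
}

pub fn normalize(raw: &RawClientConfig) -> Normalized {
    // Warn whenever the deprecated top-level field is present, even if it is ignored.
    let warning = raw.verify_hostname.map(|_| {
        "top-level verifyHostname is deprecated; use tls.verifyHostname".to_string()
    });
    // The nested value wins; the top-level value is only a fallback; default is true.
    let verify_hostname = raw
        .tls_verify_hostname
        .or(raw.verify_hostname)
        .unwrap_or(true);
    Normalized { verify_hostname, warning }
}
```

Keeping both raw fields as Option lets the normalization step tell "absent" apart from "set to the default", which two plain bool fields with #[serde(default)] cannot do.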
Serde strategy for Java-compatible but unimplemented sections:
- Do not use #[serde(deny_unknown_fields)] for the top-level ClientConfig or OAuth section during Phase 1.
- Known but not-yet-implemented Java sections such as oauth.sign and oauth.deref should deserialize into typed structs or serde_json::Value placeholders so representative Java fixtures load successfully.
- Demand-driven validation decides whether a section is required. If no active module consumes oauth.sign or oauth.deref, those sections can be present and ignored silently.
Proposed Rust Modules
Shared Config Model
Create one shared typed config model outside light-pingora and
light-runtime:
crates/light-client/src/lib.rs
crates/light-client/src/config.rs
crates/light-client/src/http.rs
crates/light-client/src/oauth.rs
crates/light-client/src/provider.rs
light-runtime should use light-client for loading, validating, and building
outbound clients, but the reusable client model should not live inside the
runtime crate.
Core types:
use std::collections::BTreeMap;
use std::path::PathBuf;

pub struct ClientConfig {
    pub tls: ClientTlsConfig,
    pub request: ClientRequestConfig,
    pub oauth: ClientOauthConfig,
    pub path_prefix_services: BTreeMap<String, String>,
}

pub struct ClientTlsConfig {
    pub verify_hostname: bool,
    pub ca_cert_path: Option<PathBuf>,
    pub client_cert_path: Option<PathBuf>,
    pub client_key_path: Option<PathBuf>,
    pub tls_version: Option<TlsVersion>,
}

pub struct ClientRequestConfig {
    pub connect_timeout_ms: u64,
    pub timeout_ms: u64,
    pub max_request_retry: u32,
    pub request_retry_delay_ms: u64,
    pub error_threshold: u32,
    pub reset_timeout_ms: u64,
    pub inject_caller_id: bool,
    pub enable_http2: bool,
    pub pool: ClientPoolConfig,
}
TlsVersion should be an enum with serde names for Java-compatible strings
such as TLSv1.2 and TLSv1.3, rather than a raw string in runtime code.
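A dependency-free sketch of such an enum is shown here. It uses FromStr in place of serde renames so the example stays self-contained; the variant names and the as_java_str helper are assumptions, and a real implementation would more likely use serde attributes as described above.

```rust
use std::str::FromStr;

/// TLS protocol versions accepted in client.yml, keyed by the Java-compatible strings.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TlsVersion {
    Tls12,
    Tls13,
}

impl FromStr for TlsVersion {
    type Err = String;

    fn from_str(value: &str) -> Result<Self, Self::Err> {
        match value {
            "TLSv1.2" => Ok(TlsVersion::Tls12),
            "TLSv1.3" => Ok(TlsVersion::Tls13),
            other => Err(format!("unsupported tlsVersion: {}", other)),
        }
    }
}

impl TlsVersion {
    /// Returns the Java-compatible string form used in configuration files.
    pub fn as_java_str(self) -> &'static str {
        match self {
            TlsVersion::Tls12 => "TLSv1.2",
            TlsVersion::Tls13 => "TLSv1.3",
        }
    }
}
```

Parsing at the config boundary means an unsupported value is rejected once, during validation, instead of being compared as a string wherever a client is built.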
Secrets should use a type that serializes as masked data for registry output, or the registry masks should cover every secret field recursively.
Runtime Loader
light-runtime should own the startup lifecycle for client.yml loading, but
delegate parsing and validation to light-client:
- Load local values.yml.
- Load local startup.yml.
- Load local client.yml with resolved values for config-server bootstrap.
- Fetch remote config if configured.
- Rebuild the final RuntimeConfig with the remote client.yml overlay.
- Register masked light-client/client in ModuleRegistry.
Every runtime client should use this shared config:
- config-server fetch client
- portal-registry WebSocket client
- MCP client
- future model-provider outbound clients
- framework/application clients through RuntimeConfig.client
For the earlier hostname-verification bug, the controller client should read:
runtime_config.client.tls.verify_hostname
not a separate top-level ClientConfig.verify_hostname.
HTTP Client Factory
Add a small factory that converts ClientConfig plus optional per-endpoint
overrides into concrete clients:
use std::sync::Arc;

pub struct ClientFactory {
    config: Arc<ClientConfig>,
    direct_registry: DirectRegistryConfig,
    registry_client: Option<Arc<PortalRegistryClient>>,
}

pub struct EndpointOptions {
    pub server_url: Option<String>,
    pub service_id: Option<String>,
    pub proxy_host: Option<String>,
    pub proxy_port: Option<u16>,
    pub enable_http2: Option<bool>,
    pub timeout_ms: Option<u64>,
}
Responsibilities:
- Build reqwest::Client with consistent TLS, timeout, proxy, HTTP/2, retry, and pool settings for non-Pingora consumers.
- Build Pingora HttpPeer options from the same TLS config for gateway upstream proxying.
- Resolve endpoint base URL by priority:
  - direct server_url
  - direct-registry.yml
  - portal-registry discovery by serviceId
- Apply per-service AuthServerConfig overrides without duplicating resolver logic in each handler.
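The base-URL priority order above can be illustrated with a small function. The function name, the map standing in for direct-registry.yml, and the discover closure standing in for portal-registry lookup are all assumptions made for this sketch.

```rust
use std::collections::BTreeMap;

/// Resolves an endpoint base URL using the documented priority:
/// explicit server_url, then the direct registry, then discovery by service id.
pub fn resolve_base_url<F>(
    server_url: Option<&str>,
    service_id: Option<&str>,
    direct_registry: &BTreeMap<String, String>,
    discover: F,
) -> Option<String>
where
    F: Fn(&str) -> Option<String>,
{
    // 1. An explicit, non-empty server_url always wins.
    if let Some(url) = server_url.filter(|u| !u.is_empty()) {
        return Some(url.to_string());
    }
    // The remaining steps both need a service id.
    let service_id = service_id?;
    // 2. Static mapping, standing in for direct-registry.yml.
    if let Some(url) = direct_registry.get(service_id) {
        return Some(url.clone());
    }
    // 3. Dynamic lookup, standing in for portal-registry discovery.
    discover(service_id)
}
```

Treating an empty server_url as unset matches the empty default in placeholders such as ${client.tokenServerUrl:}, so a blank value falls through to the registry steps.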
The config-server bootstrap path still starts from BootstrapConfig because it
needs enough client settings before remote client.yml has been fetched. To
keep light-client independent from light-runtime, the factory should not
take a BootstrapConfig type directly. Instead, light-runtime should adapt
BootstrapConfig.connect_timeout, BootstrapConfig.timeout, authorization,
and bootstrap CA path into EndpointOptions or a small bootstrap options type
owned by light-client.
OAuth Client
Add a shared OAuth client module that implements Java http-client behavior:
oauth/client_credentials
oauth/authorization_code
oauth/refresh_token
oauth/token_exchange
oauth/key
oauth/sign
oauth/deref
The existing light-pingora SpaTokenClient, token handler client
credentials code, and security JWKS fetcher should delegate to this shared
module. Handler modules still own request-path decisions, cookies, headers,
and rejection mapping.
OAuth provider selection should be one reusable resolver:
use std::sync::Arc;

pub struct OAuthProviderResolver {
    client: Arc<ClientConfig>,
}

// Signatures only; method bodies are omitted in this design sketch.
impl OAuthProviderResolver {
    pub fn service_for_path(&self, path: &str) -> Option<&str>;
    pub fn client_credentials_provider(&self, service_id: Option<&str>) -> Result<AuthServerConfig>;
    pub fn key_provider(&self, service_id: Option<&str>) -> Result<AuthServerConfig>;
}
Rules:
- Single-provider mode uses global oauth.token.* defaults.
- Multi-provider mode is enabled when oauth.multipleAuthServers: true or when relevant serviceIdAuthServers maps are non-empty.
- Multi-provider mode selects the service id from an explicit request header first, then outbound pathPrefixServices.
- client_credentials.serviceIdAuthServers[serviceId] selects the token provider.
- key.serviceIdAuthServers[serviceId] selects the JWKS/key provider.
- Per-service config inherits unset values from global oauth.token defaults.
- Path-prefix matching should be boundary-aware in Rust. Java uses startsWith; the Rust implementation can be stricter as an intentional improvement. Exact rule: a prefix matches when the request path equals the prefix or starts with prefix + "/". Therefore /api matches /api and /api/orders, but does not match /api-v2.
- pathPrefixServices is not an inbound routing table. It maps outbound request paths to service ids only for client-side OAuth provider selection.
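The boundary-aware matching rule can be written directly from its definition. In the sketch below, prefix_matches implements the exact rule stated above. Choosing the longest matching prefix in service_for_path is an additional assumption of this example, since the rules do not say how overlapping prefixes are ranked.

```rust
/// True when `path` equals `prefix` or starts with `prefix` followed by "/".
pub fn prefix_matches(prefix: &str, path: &str) -> bool {
    match path.strip_prefix(prefix) {
        // Either nothing is left (exact match) or the remainder begins a new segment.
        Some(rest) => rest.is_empty() || rest.starts_with('/'),
        None => false,
    }
}

/// Picks the service id for an outbound path from (prefix, service id) pairs.
/// Longest matching prefix wins (an assumption of this sketch).
pub fn service_for_path<'a>(mappings: &'a [(String, String)], path: &str) -> Option<&'a str> {
    mappings
        .iter()
        .filter(|(prefix, _)| prefix_matches(prefix, path))
        .max_by_key(|(prefix, _)| prefix.len())
        .map(|(_, service_id)| service_id.as_str())
}
```

With this rule, /api matches /api and /api/orders but not /api-v2, which is the difference from a plain startsWith check.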
Consumer Modules
All modules should consume the same shared config:
| Module | Uses |
|---|---|
| light-runtime/config-server | light-client tls, request |
| light-runtime/portal-registry | light-client tls, request |
| light-pingora/security | oauth.token.key, tls, request, provider resolver |
| light-pingora/token | oauth.token.client_credentials, token cache settings, provider resolver |
| light-pingora/stateless-auth | authorization_code, refresh_token, token client |
| light-pingora/msal-exchange | token_exchange, token client |
| light-gateway/proxy | tls.verifyHostname, PEM mTLS, request timeout, retry, circuit breaker, and pool settings where Pingora supports them |
| light-agent | controller/MCP outbound clients |
| light-deployer | controller/MCP/outbound clients as needed |
Reload Behavior
client.yml should be reloadable as a module, but reload must be conservative:
- Load and validate the new config into a fresh ClientConfig.
- Build new shared client factories and OAuth clients.
- Swap the config atomically for future requests.
- Clear OAuth token caches because client credentials, scopes, providers, or trust settings may have changed.
- Keep old in-flight requests on their existing client instances.
- Reject the reload if active modules cannot build required clients from the new config.
Reload atomicity: all runtimes that consume client.yml must be swapped
together in the same reload callback. Today, the gateway TokenReloader
already rebuilds token_runtime, stateless_auth, and msal_exchange as a
unit. This must remain a hard requirement. A reload that updates the client
config without also rebuilding dependent runtimes would leave stale TLS or
OAuth state in the old runtime instances.
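A minimal sketch of the conservative reload described above, assuming the dependent runtimes can be grouped behind one Arc that is swapped as a unit. ReloadableClient, ClientRuntimes, the two ClientConfig fields, and the string-keyed token cache are hypothetical stand-ins, not the gateway's TokenReloader.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};

#[derive(Debug, Clone)]
pub struct ClientConfig {
    pub verify_hostname: bool,
    pub connect_timeout_ms: u64,
}

/// Everything built from client.yml. It is rebuilt and swapped as one unit.
pub struct ClientRuntimes {
    pub config: ClientConfig,
    /// Stand-in for the OAuth token cache, keyed by service id.
    pub token_cache: Mutex<HashMap<String, String>>,
}

impl ClientRuntimes {
    /// Validates the config and builds fresh runtimes with an empty cache.
    fn build(config: ClientConfig) -> Result<Self, String> {
        if config.connect_timeout_ms == 0 {
            return Err("request.connectTimeout must be positive".to_string());
        }
        Ok(Self { config, token_cache: Mutex::new(HashMap::new()) })
    }
}

pub struct ReloadableClient {
    current: RwLock<Arc<ClientRuntimes>>,
}

impl ReloadableClient {
    pub fn new(config: ClientConfig) -> Result<Self, String> {
        Ok(Self { current: RwLock::new(Arc::new(ClientRuntimes::build(config)?)) })
    }

    /// In-flight requests keep the Arc they cloned before a reload.
    pub fn snapshot(&self) -> Arc<ClientRuntimes> {
        Arc::clone(&self.current.read().unwrap())
    }

    /// Builds everything first; live state is touched only on success.
    pub fn reload(&self, candidate: ClientConfig) -> Result<(), String> {
        let fresh = Arc::new(ClientRuntimes::build(candidate)?);
        *self.current.write().unwrap() = fresh;
        Ok(())
    }
}
```

Because the token cache lives inside the swapped unit, a successful reload discards cached tokens without a separate clearing step, and a rejected reload leaves the old unit untouched.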
Controller registration is long-lived. Reloading client.yml should not force
an immediate portal-registry reconnect. New TLS and request settings should
apply to future outbound clients and the next normal controller reconnect, but
the active controller WebSocket can remain open.
Validation Rules
Base validation:
- tls.verifyHostname: false requires explicit trust material unless the transport has a clear dev-only mode.
- If Rust-native mTLS is configured, both client certificate and client key paths are required.
- request.connectTimeout and request.timeout must be positive.
- proxyPort must be 0 to 65535.
- pathPrefixServices keys must start with /.
- Secret fields may be empty only when the consuming active module does not need that grant.
OAuth validation should be demand-driven:
- If token handler is active and enabled, validate client_credentials.
- If stateless-auth is active, validate authorization_code and refresh_token.
- If msal-exchange is active, validate token_exchange.
- If security.yml enables JWKS bootstrap from key service, validate oauth.token.key.
- If a future sign module is active, validate oauth.sign.
- If a future deref module is active, validate oauth.deref.
This avoids forcing every service to configure every Java OAuth section.
Validation failure behavior:
- At startup, validation failures are fatal. The process must exit with a clear error message identifying which active module requires which missing or invalid client config section.
- On reload, validation failures are non-fatal. The reload is rejected, the old config stays live, and the rejection reason is logged and reported through the module registry reload outcome.
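Demand-driven validation can be sketched as a lookup from active modules to the grant sections they need. The ActiveModule enum, validate_oauth, and the representation of configured sections as a list of names are illustrative assumptions; only the module-to-section mapping comes from the rules above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ActiveModule {
    Token,
    StatelessAuth,
    MsalExchange,
}

impl ActiveModule {
    fn name(self) -> &'static str {
        match self {
            ActiveModule::Token => "token",
            ActiveModule::StatelessAuth => "stateless-auth",
            ActiveModule::MsalExchange => "msal-exchange",
        }
    }

    /// Grant sections that must validate when this module is active.
    fn required_sections(self) -> &'static [&'static str] {
        match self {
            ActiveModule::Token => &["client_credentials"],
            ActiveModule::StatelessAuth => &["authorization_code", "refresh_token"],
            ActiveModule::MsalExchange => &["token_exchange"],
        }
    }
}

/// Checks only the sections that active modules need. Each error names the
/// module and the section so the message is actionable.
pub fn validate_oauth(active: &[ActiveModule], configured: &[&str]) -> Result<(), Vec<String>> {
    let mut errors = Vec::new();
    for module in active {
        for section in module.required_sections() {
            if !configured.contains(section) {
                errors.push(format!(
                    "module {} requires oauth.token.{section}",
                    module.name()
                ));
            }
        }
    }
    if errors.is_empty() { Ok(()) } else { Err(errors) }
}
```

A caller would treat an Err as fatal at startup and as a rejected reload afterwards, keeping the previous config live.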
Masking
Mask these fields recursively in registry output:
- client_secret
- clientSecret
- trustStorePass
- keyStorePass
- keyPass
- defaultCertPassword
- subjectToken
- access_token
- refresh_token
- id_token
- authorization
- any field ending in Token whose value is a scalar string (not a nested object, list, or URN-typed field like subjectTokenType or requestedTokenType)
- any field ending in Secret
Explicit exclusions from suffix matching:
- subjectTokenType: a URN string, not a secret.
- requestedTokenType: a URN string, not a secret.
The registry should store only the masked snapshot. It should not store raw config and mask later.
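The masking rules can be sketched as a recursive walk over a config tree. The Value enum below is a simplified stand-in for the real YAML/JSON value type, and the "******" replacement string is an assumption.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for a parsed config value.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Str(String),
    List(Vec<Value>),
    Map(BTreeMap<String, Value>),
}

const EXACT: &[&str] = &[
    "client_secret", "clientSecret", "trustStorePass", "keyStorePass", "keyPass",
    "defaultCertPassword", "subjectToken", "access_token", "refresh_token",
    "id_token", "authorization",
];
const EXCLUDED: &[&str] = &["subjectTokenType", "requestedTokenType"];

fn is_secret_key(key: &str, value: &Value) -> bool {
    if EXCLUDED.contains(&key) {
        return false;
    }
    if EXACT.contains(&key) || key.ends_with("Secret") {
        return true;
    }
    // The Token suffix rule applies to scalar strings only.
    key.ends_with("Token") && matches!(value, Value::Str(_))
}

/// Returns a masked copy; the raw tree is never stored in the registry.
pub fn mask(value: &Value) -> Value {
    match value {
        Value::Str(s) => Value::Str(s.clone()),
        Value::List(items) => Value::List(items.iter().map(mask).collect()),
        Value::Map(map) => Value::Map(
            map.iter()
                .map(|(key, inner)| {
                    let masked = if is_secret_key(key, inner) {
                        Value::Str("******".to_string())
                    } else {
                        mask(inner)
                    };
                    (key.clone(), masked)
                })
                .collect(),
        ),
    }
}
```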
Migration Plan
Phase 0: Deprecation Logging
- Add a tracing::warn! in light-gateway where it reads resolved_values["client.verifyHostname"] to alert operators that this path is deprecated and will be replaced by runtime_config.client.tls.verify_hostname.
- This gives operators visibility into the migration before behavior changes.
Phase 1: Unify The Schema
- Add the light-client crate with the full shared ClientConfig type.
- Make light-runtime load nested tls.verifyHostname.
- Keep top-level verifyHostname as a temporary compatibility fallback.
- Update Rust config templates to include only the canonical nested shape.
- Add tests proving client.verifyHostname: false reaches config-server, portal-registry, token, security JWKS, SPA auth, and gateway proxy clients.
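The compatibility fallback in this phase reduces to a small precedence rule. The function name is hypothetical, and defaulting to true when neither field is set is an assumption chosen as the safer behavior.

```rust
/// Prefers nested tls.verifyHostname and falls back to the legacy top-level
/// verifyHostname. Assumption: verification stays on when neither is set.
pub fn effective_verify_hostname(nested_tls: Option<bool>, top_level: Option<bool>) -> bool {
    nested_tls.or(top_level).unwrap_or(true)
}
```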
Phase 2: Move Consumers To Shared Config
- Replace light-pingora::token::ClientTokenConfig with the light-client shared type or a type alias.
- Replace gateway direct resolved_values["client.verifyHostname"] lookup with runtime_config.client.tls.verify_hostname.
- Move JWKS, token, and SPA token HTTP client construction behind the shared client factory.
- Register one masked light-client/client module instead of separate partial client registry entries.
Phase 3: Shared OAuth Provider Resolver
- Extract provider selection from the token handler.
- Support token.key.serviceIdAuthServers and audience.
- Use the same resolver for token injection and JWT key lookup.
- Keep Java field names and config-server placeholders.
Phase 4: Java Feature Completion
- Implemented sign client support in light-client.
- Implemented deref client support in light-client.
- Implemented Rust-native PEM mTLS for reqwest clients and Pingora upstreams.
- Implemented retry, circuit breaker, and pool behavior where the Rust transport supports them.
Open Questions
None at this stage.
Test Plan
Unit tests:
- Parse the Java client.yml template into the shared Rust config.
- Parse the current Rust client.yml template into the shared Rust config.
- Resolve client.verifyHostname into tls.verifyHostname.
- Accept top-level verifyHostname only as a fallback and prefer nested TLS when both are set.
- Mask every secret field in the module registry snapshot.
- Validate provider selection by service id and path prefix.
- Validate per-service override inheritance for token and key providers.
Runtime tests:
- Config-server bootstrap uses tls.verifyHostname.
- Portal-registry controller WebSocket uses tls.verifyHostname.
- Gateway upstream proxy uses tls.verifyHostname.
- Token handler, stateless auth, MSAL exchange, and security JWKS all receive the same ClientConfig instance or snapshot.
- Client reload clears token caches and rejects invalid active grant config.
- Reload round-trip: verify that reloading from config A to config B swaps the ClientConfig, creates fresh token caches, and that in-flight requests on the old config are not affected. Verify that a reload from valid config to invalid config is rejected and the old config stays live.
Compatibility tests:
- Reuse representative Java client.yml fixtures for single provider, multiple providers, proxy, token key, sign, and deref sections.
- Confirm Java-compatible form bodies for authorization_code, client_credentials, refresh_token, and token_exchange.
- Confirm config-server injected YAML strings and structured YAML maps both deserialize for serviceIdAuthServers and pathPrefixServices.
Embedded Configuration Templates
Status
Initial implementation completed. Rust applications in light-fabric and
related portal-service applications keep template configuration files under
each app's config directory. Container images may copy those files into
/app/config-defaults, then runtime overlays local config, downloaded
config-cache, remote values.yml, and environment variables.
That works well for container deployments. It is awkward for native binary
deployments on a VM because the operator must copy a full template directory
beside the binary even when they only want to provide values.yml, certs, or a
small local override.
This design embeds the template files into the Rust binary while keeping the
app config directories in source control as the readable template source.
Purpose
Embedded configuration templates should make the Rust deployment model match the Java module model more closely:
- The application binary carries its default template files.
- Operators provide only overrides, usually values.yml, startup.yml, certs, keys, or environment variables.
- Config-server can still return values.yml after bootstrap, plus external files for explicit migration or operational exceptions.
- Developers and operators can still inspect the app's config directory in source control to learn supported properties.
The embedded files are defaults. They are not runtime state and should not be written out automatically unless an explicit diagnostic/export command is added later.
Current Model
The current runtime model has these filesystem layers:
| Layer | Example | Purpose |
|---|---|---|
| Default templates | config-defaults/server.yml | App-provided templates copied into the container image |
| Local config | config/values.yml, config/startup.yml | Operator overrides and bootstrap inputs |
| External/cache config | config-cache/values.yml | Files downloaded from config-server |
| Remote values | config-server response body | Runtime values fetched during bootstrap |
| Environment variables | CLIENT_VERIFYHOSTNAME=false | Last-mile process overrides during placeholder expansion |
For light-fabric runtime applications, LightRuntimeBuilder passes
default_config_dir, config_dir, and external_config_dir into
light-runtime. load_bootstrap_config() reads bootstrap-time values.yml,
startup.yml, and client.yml before remote config-server bootstrap. After
remote bootstrap, runtime config loads server.yml, client.yml,
portal-registry.yml, and framework/application module files through the same
merged configuration path.
Some portal-service apps share the light-runtime path, while standalone apps
such as config-server and light-oauth have local helper functions that merge
config-defaults and config.
Goals
- Allow a native binary deployment to start with embedded templates and a small external config/values.yml.
- Keep apps/<app>/config/*.yml as the source of truth for template content.
- Keep container deployment behavior compatible with the current /app/config-defaults copy.
- Preserve the existing overlay order and placeholder expansion behavior.
- Support bootstrap-time files such as startup.yml and client.yml.
- Support runtime module files such as handler.yml, proxy.yml, model-provider.yml, provider configs, and product-specific files.
- Provide one reusable loading abstraction for light-fabric and portal-service instead of app-specific parsing logic.
- Avoid writing embedded templates to disk during normal startup.
Non-Goals
- Do not embed secrets, certificates, private keys, trust bundles, static web assets, or downloaded config-server files.
- Do not remove the source config directories. They remain the reviewable, documented template source.
- Do not make values.yml mandatory. Apps should keep current defaults where they are already valid.
- Do not make config-server responsible for delivering template files that are already part of the binary.
- Do not change the meaning of values.yml placeholders or environment variable expansion.
Proposed Layer Order
The new effective source order should be:
- Embedded template file from the binary.
- Filesystem default template from config-defaults, if present.
- Local operator file from config.
- External/cache file from config-cache, when runtime loading supports it.
- Remote values.yml payload from config-server.
- Environment variables during placeholder resolution.
This keeps existing container images compatible. If config-defaults exists, it
can override the embedded template. That gives operators and image builders a
transition path and a deliberate escape hatch for patched images.
For native binary deployment, config-defaults is simply absent and the binary
falls back to embedded templates.
Structured config files and values.yml should use different overlay
semantics:
| File type | Semantics | Reason |
|---|---|---|
| Structured config files such as server.yml, handler.yml, proxy.yml, and model-provider.yml | Source-level override. The highest-priority source that contains the file supplies the whole template. | Avoids surprising hybrid files assembled from embedded, image, local, and cache layers. Operators should use values.yml for partial property overrides. |
values.yml | Key-level overlay in source order, followed by remote values and environment variables. | values.yml is explicitly the property override surface. Partial overlays are expected and useful. |
After the structured file source is selected, placeholders in that file are resolved from the merged values map and environment variables.
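The two overlay semantics can be contrasted in a short sketch. Layer, select_structured, and merge_values are hypothetical names, values.yml is modeled as a flat key/value map for brevity, and the layer slice is assumed to be ordered from lowest to highest priority.

```rust
use std::collections::BTreeMap;

/// One config source, such as "embedded", "config-defaults", or "config".
pub struct Layer {
    pub name: &'static str,
    /// Structured template files by name (raw text).
    pub files: BTreeMap<&'static str, &'static str>,
    /// Flattened values.yml entries contributed by this layer.
    pub values: BTreeMap<&'static str, &'static str>,
}

/// Source-level override: the highest-priority layer that contains the file
/// supplies the whole template. Returns (source name, content).
pub fn select_structured(layers: &[Layer], file: &str) -> Option<(&'static str, &'static str)> {
    layers
        .iter()
        .rev()
        .find_map(|layer| layer.files.get(file).map(|content| (layer.name, *content)))
}

/// Key-level overlay: later layers override individual keys.
pub fn merge_values(layers: &[Layer]) -> BTreeMap<&'static str, &'static str> {
    let mut merged = BTreeMap::new();
    for layer in layers {
        for (key, value) in &layer.values {
            merged.insert(*key, *value);
        }
    }
    merged
}
```

Returning the source name alongside the content also gives the loader what it needs to record provenance.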
Embedded Template Representation
include_dir is a possible embedding mechanism. It embeds the entire app
config directory at compile time and avoids custom directory-scanning build
scripts in every application crate:
```rust
use include_dir::{include_dir, Dir};

// Embeds every file under the app's config directory at compile time.
pub static EMBEDDED_CONFIG: Dir<'_> = include_dir!("$CARGO_MANIFEST_DIR/config");
```
The runtime should hide the concrete embedding mechanism behind a small config source abstraction. A typed file representation is still useful as the stable runtime boundary:
```rust
pub struct EmbeddedConfigFile {
    // File name relative to the app config directory.
    pub name: &'static str,
    // Raw UTF-8 template text.
    pub content: &'static str,
}
```
Application code should pass a flattened static file list into the runtime:
```rust
LightRuntimeBuilder::new(transport)
    .with_embedded_config(embedded_config::FILES)
    .build();
```
include_str! is still acceptable for one or two files, but application
main.rs files should not accumulate hand-maintained include_str! lists.
include_bytes! is not preferred for YAML templates because configuration
templates should be valid UTF-8 before they are parsed.
The initial implementation uses a shared build-time generator instead of adding
an external embedding dependency. Each app has a small build.rs that calls
config-embed-build, which scans the committed config directory and produces
a manifest like this under OUT_DIR:
```rust
pub const FILES: &[config_loader::EmbeddedConfigFile] = &[
    config_loader::EmbeddedConfigFile {
        name: "server.yml",
        content: include_str!(concat!(env!("CARGO_MANIFEST_DIR"), "/config/server.yml")),
    },
    config_loader::EmbeddedConfigFile {
        name: "startup.yml",
        content: include_str!(concat!(env!("CARGO_MANIFEST_DIR"), "/config/startup.yml")),
    },
];
```
Build-Time Generation Fallback
The project currently uses the build-time manifest path. Each app uses a shared
build.rs helper to scan its config directory and generate the embedded
manifest. The generator lives in one reusable crate so apps do not carry
duplicated build logic.
The generated manifest should:
- Include only known text config extensions, initially .yml, .yaml, .json, and .toml.
- Preserve the file name relative to the app config directory.
- Emit cargo:rerun-if-changed=config.
- Fail the build if a template file cannot be read as UTF-8.
Nested config paths are not needed for current app templates, but the manifest
should allow names such as oauth/server.yml if a future product needs them.
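The manifest rules above can be sketched as a pure rendering function that a build.rs helper could call after scanning the directory. is_template and render_manifest are hypothetical names and do not describe the actual config-embed-build API. Because the generated entries use include_str!, a template that is not valid UTF-8 fails the build without an extra check.

```rust
const TEXT_EXTENSIONS: &[&str] = &["yml", "yaml", "json", "toml"];

/// Accepts only known text config extensions.
pub fn is_template(name: &str) -> bool {
    name.rsplit_once('.')
        .map_or(false, |(_, ext)| TEXT_EXTENSIONS.contains(&ext))
}

/// Renders the manifest source for file names relative to the app config
/// directory. Nested names such as oauth/server.yml are kept as-is.
pub fn render_manifest(relative_names: &[&str]) -> String {
    let mut names: Vec<&str> = relative_names
        .iter()
        .copied()
        .filter(|name| is_template(name))
        .collect();
    // Sorting keeps the generated file stable across directory scan orders.
    names.sort_unstable();

    let mut out = String::from("pub const FILES: &[config_loader::EmbeddedConfigFile] = &[\n");
    for name in names {
        out.push_str(&format!(
            "    config_loader::EmbeddedConfigFile {{ name: {name:?}, content: include_str!(concat!(env!(\"CARGO_MANIFEST_DIR\"), \"/config/{name}\")) }},\n"
        ));
    }
    out.push_str("];\n");
    out
}
```

A build script using this sketch would write the returned string under OUT_DIR and print cargo:rerun-if-changed=config so edits to templates trigger a rebuild.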
Runtime API
Add embedded defaults to LightRuntimeBuilder:
```rust
LightRuntimeBuilder::new(transport)
    .with_embedded_config(embedded_config::FILES)
    .with_default_config_dir(DEFAULT_CONFIG_DIR)
    .with_config_dir(CONFIG_DIR)
    .with_external_config_dir(EXTERNAL_CONFIG_DIR)
    .build();
```
RuntimeConfig should carry the embedded source as skipped runtime state, the
same way it carries default_config_dir and registries today:
```rust
pub struct RuntimeConfig {
    // existing fields

    // Runtime-only state: never serialized or deserialized.
    #[serde(skip, default)]
    pub embedded_config: &'static [EmbeddedConfigFile],
}
```
The stable contract is lookup by relative file name and iteration for diagnostics or dumping. The concrete representation can remain a static file slice or later move behind a provider abstraction if needed.
The low-level loader should accept named in-memory content as another config source:
```rust
pub enum ConfigSource {
    // Named in-memory template compiled into the binary.
    Embedded { name: &'static str, content: &'static str },
    // Template or override read from the filesystem.
    File(PathBuf),
}
```
ConfigLoader can then parse embedded and filesystem sources with the same
YAML/JSON/TOML parser. Structured config loading should select the highest
priority source for the requested file. values.yml loading should continue to
merge maps in source order.
Bootstrap Behavior
Bootstrap must support embedded templates because this is the path that native deployments need most.
load_bootstrap_values() should merge:
- Embedded values.yml, if present.
- config-defaults/values.yml, if present.
- config/values.yml, if present.
load_bootstrap_config() should load startup.yml and client.yml from:
- Embedded templates.
- config-defaults.
- config.
For startup.yml and client.yml, the highest-priority source that contains
the file should be used as the full template. Placeholder resolution still uses
the merged bootstrap values.
After bootstrap fetches remote values, load_values_map() should merge embedded
values.yml before the existing file and remote layers. This allows remote
values to override embedded placeholders exactly as they override copied
template files today.
Application Integration
Light-Gateway
light-gateway should be the first light-fabric application to adopt the
runtime API because it has the richest template set:
- bootstrap and server files
- client and portal registry files
- handler chain files
- proxy, resource, MCP, websocket, auth, token, metrics, and rule-related files
After integration, a native gateway deployment can run with the binary plus a
small config/values.yml and any required cert/key files.
Light-Agent
light-agent should use the same runtime API for all provider templates. The
embedded set should include model-provider.yml, mcp-client.yml, and every
provider-specific template such as openai.yml, bedrock.yml, codex.yml,
anthropic.yml, and ollama.yml.
Runtime provider selection should still happen after bootstrap. Embedded templates do not mean provider clients are created before config-server values are loaded.
Light-Deployer
light-deployer currently has a separate app-level config load for
deployer.yml. It should either move to the shared embedded-source helper or
set embedded defaults on LightRuntimeBuilder and use the same merged source
logic for its application config.
Portal-Service App
portal-service/apps/portal-service already uses LightRuntimeBuilder, but it
loads portal-service.yml before runtime startup to create the database pool.
That pre-runtime load should use the same shared embedded-source helper.
The portal-service.yml config remains non-reloadable because dbUrl and
hostId feed process-owned state.
Portal-Service Config-Server And Light-OAuth
portal-service/apps/config-server and apps/light-oauth do not bootstrap from
config-server. They should still embed their server.yml templates so native
deployment does not require a copied config-defaults directory.
Because these apps have local merge helpers today, they should consume a shared
config-loader helper that can merge:
- Embedded defaults.
- Filesystem defaults.
- Local config.
This keeps their behavior aligned with light-runtime without requiring them
to become runtime-bootstrap applications.
Operator Model
For a native deployment, the recommended layout becomes:
```text
/opt/light-gateway/
  light-gateway
  config/
    values.yml
    startup.yml        # optional, only when values/env defaults are not enough
    cert.pem           # optional external asset
    key.pem            # optional external asset
```
The operator no longer needs to copy every template file beside the binary. They only provide files that are deployment-specific.
For a container deployment, the current layout continues to work:
/app/light-gateway
/app/config-defaults/*.yml
/config/values.yml
/app/config-cache/values.yml
In the long term, the /app/config-defaults copy can become optional. Keeping it
during migration is useful because it lets operators inspect templates inside
the image and provides a familiar override layer.
After embedded templates are stable across production deployments, Docker images
should deprecate and then remove the unconditional /app/config-defaults copy.
Template inspectability should move to explicit dump/print commands rather than
extra image layers.
Diagnostics
The runtime should expose enough information to make source precedence clear:
- Log whether embedded templates were registered for the application.
- When a required config file is missing, include the searched source names: embedded, config-defaults, config, and config-cache.
- Module registry snapshots should show the resolved config, not the raw embedded template.
- Module registry metadata should include config source provenance when available, for example embedded, file:/app/config-defaults/server.yml, or file:/config/server.yml.
- Normal startup should not write embedded templates to disk.
Native operators should have explicit inspection commands:
light-gateway --print-default-config server.yml
light-gateway --dump-default-configs ./config-defaults
The print command writes one embedded template to stdout. The dump command writes all embedded templates to a target directory so operators can inspect, copy, and customize them.
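A minimal sketch of the two inspection commands, assuming the embedded templates are available as a static slice. The function names and the local EmbeddedConfigFile definition are illustrative, and argument parsing is omitted.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

pub struct EmbeddedConfigFile {
    pub name: &'static str,
    pub content: &'static str,
}

/// Backs --print-default-config <name>: writes one template to `out`.
pub fn print_default_config(
    files: &[EmbeddedConfigFile],
    name: &str,
    out: &mut impl Write,
) -> io::Result<()> {
    let file = files.iter().find(|file| file.name == name).ok_or_else(|| {
        io::Error::new(
            io::ErrorKind::NotFound,
            format!("no embedded template named {name}"),
        )
    })?;
    out.write_all(file.content.as_bytes())
}

/// Backs --dump-default-configs <directory>: writes every template,
/// creating parent directories for nested names such as oauth/server.yml.
pub fn dump_default_configs(files: &[EmbeddedConfigFile], target: &Path) -> io::Result<()> {
    for file in files {
        let path = target.join(file.name);
        if let Some(parent) = path.parent() {
            fs::create_dir_all(parent)?;
        }
        fs::write(path, file.content)?;
    }
    Ok(())
}
```

Both functions run only when the operator asks for them, so normal startup still never writes embedded templates to disk.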
Controller Server Info Compatibility
Rust services register with the controller, and the controller can call the runtime MCP service-info path to inspect runtime configuration. This behavior must continue to work with embedded templates.
The service-info response should expose resolved runtime configuration, not raw templates. The implementation contract is:
- Select the effective structured config source, such as embedded server.yml, filesystem config/server.yml, or cached config-cache server.yml.
- Build the merged values map from embedded, filesystem, cached, remote values.yml, and environment variables.
- Resolve placeholders in the selected config source.
- Deserialize the resolved config into the typed runtime or module config.
- Register that typed config in ModuleRegistry.
- Return ModuleRegistry component configs from the controller service-info MCP call.
With that flow, the controller still sees every registered config file with defaults and overrides applied. Embedded templates only replace the missing filesystem default-template layer. They should not bypass typed config loading, masking, module registration, reload validation, or service-info reporting.
Source provenance can be added as metadata beside each registered config, but it must not replace the resolved config payload that operators and the controller depend on.
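The placeholder-resolution step of this contract can be illustrated with a small resolver. The ${key:default} syntax, the resolve_placeholders name, and the flat string map are assumptions for illustration; the real loader's placeholder grammar and value types may differ.

```rust
use std::collections::BTreeMap;

/// Replaces `${key}` and `${key:default}` using the merged values map.
/// Assumption: environment overrides were already folded into `values`.
pub fn resolve_placeholders(
    template: &str,
    values: &BTreeMap<String, String>,
) -> Result<String, String> {
    let mut out = String::new();
    let mut rest = template;
    while let Some(start) = rest.find("${") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        let end = after
            .find('}')
            .ok_or_else(|| "unterminated placeholder".to_string())?;
        let body = &after[..end];
        // Split at the first ':' so defaults such as URLs stay intact.
        let (key, default) = match body.split_once(':') {
            Some((key, default)) => (key, Some(default)),
            None => (body, None),
        };
        match values.get(key).map(String::as_str).or(default) {
            Some(value) => out.push_str(value),
            None => return Err(format!("no value for placeholder {key}")),
        }
        rest = &after[end + 1..];
    }
    out.push_str(rest);
    Ok(out)
}
```

The resolved text, not the raw template, is what would then be deserialized, masked, and registered, which keeps service-info output independent of where the template came from.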
Testing Strategy
Add unit tests at the shared loader boundary:
- Embedded-only server.yml loads successfully.
- Local config/server.yml replaces embedded server.yml rather than deep merging with it.
- config-defaults/server.yml replaces embedded server.yml.
- config-cache/server.yml replaces local config during runtime loads.
- Embedded values.yml is overridden by local values.yml.
- Remote values.yml overrides embedded and filesystem values.
- Missing required config reports all searched layers.
- Source provenance is recorded for resolved module configs.
- --print-default-config and --dump-default-configs expose embedded templates without changing normal startup behavior.
- Controller service-info output includes resolved values from embedded defaults plus local, cached, remote, and environment overrides.
Add application-level smoke tests for:
- light-gateway startup with no filesystem server.yml, using embedded templates plus local values.yml.
- light-agent provider config loading from embedded templates after bootstrap.
- portal-service/apps/portal-service pre-runtime portal-service.yml load from embedded templates.
- portal-service/apps/config-server standalone server.yml load from embedded templates.
Migration Plan
- Add embedded source support to config-loader and light-runtime.
- Add shared build-time template embedding for light-gateway.
- Wire light-gateway to pass embedded templates to LightRuntimeBuilder.
- Keep Docker config-defaults copies unchanged and verify container parity.
- Add native startup tests that run without a copied template directory.
- Roll the same pattern to light-agent and light-deployer.
- Add the shared embedded-source helper to portal-service and migrate portal-service, config-server, and light-oauth.
- Add print and dump commands for embedded templates.
- After several releases, deprecate Docker config-defaults copies and rely on embedded defaults plus explicit dump commands for inspectability.
Risks And Mitigations
| Risk | Mitigation |
|---|---|
| Embedded templates drift from source templates | Embed the committed config/ directory directly with include_dir, or generate a manifest from that directory at build time |
| Operators cannot inspect templates in native deployment | Keep source templates in repo and add print/dump commands for embedded templates |
| Docker behavior changes unexpectedly | Keep config-defaults above embedded defaults during migration |
| Config-server remote values stop overriding defaults | Preserve remote values as the highest non-env value layer |
| Apps duplicate merge logic | Move embedded-source merging into shared loader/runtime helpers |
| Secrets accidentally embedded | Embed only committed template files and keep secrets in values, env, or external files |
| Structured config becomes hard to reason about | Use source-level override for config files and reserve key-level merging for values.yml |
Resolved Decisions
- Native operators should get --print-default-config <name> and --dump-default-configs <directory> commands.
- Module registry should expose resolved config first, with source provenance as metadata when available.
- Docker images should keep /app/config-defaults during migration, then deprecate it once embedded templates and dump commands are stable.
- Rust deployments should standardize on embedded templates plus remote values.yml. Config-server should not normally deliver full template files for Rust services.
Decision Summary
Embed app config/*.yml templates into the binary as the lowest-priority
default configuration source. The initial implementation uses a shared
build-time manifest generator, with include_dir remaining a possible future
implementation detail. Keep the existing source config directories for
documentation and build input. Use source-level override for structured config
files and key-level overlay for values.yml. Preserve current filesystem and
remote value layers so container deployments keep working, while native
deployments can run with only the binary and a small deployment-specific config
directory.
Handler Chain
Status: Phases 1 through 8 implemented; further transport phases proposed
Purpose
Light Fabric needs a light-pingora handler chain for the Rust
light-gateway product.
The first implementation should focus on light-pingora, not a generic
cross-framework abstraction. A Pingora-first design is simpler and matches the
gateway family of use cases: gateway, sidecar, proxy server, proxy client, load
balancer, and BFF.
The deployment model should use one light-gateway binary. Different runtime
behaviors should come from product-specific configuration managed in
light-portal and delivered by config-server. A BFF deployment, a sidecar
deployment, and a load-balancer deployment can therefore run the same binary
with different handler.yml, traffic/resource config, and handler-specific
config files.
The design should preserve the useful part of light-4j handler.yml: ordered
configuration of cross-cutting request and response concerns. It should not copy
the Java reflection model, mutable next handler pattern, or class-name-based
configuration.
Goals
- Add middleware handler-chain support to frameworks/light-pingora.
- Use one apps/light-gateway binary for the Pingora gateway family.
- Keep handler.yml as the chain and ordering configuration.
- Let light-portal manage product-specific configuration and config-server deliver it at startup.
- Support virtual hosts selected from the HTTP Host header.
- Serve static SPA content directly from Pingora.
- Proxy API, BFF, sidecar, and balancer routes to upstream services.
- Use stable handler IDs instead of Rust type names.
- Use explicit handler registration. Do not require inventory.
- Integrate loaded handler and traffic/resource config with ModuleRegistry.
- Keep the design compatible with runtime config reload.
Non-Goals
- Do not build a transport-neutral light-handler crate in the first phase.
- Do not add an Axum/Tower adapter in the first phase.
- Do not create separate binaries for gateway, sidecar, proxy server, proxy client, load balancer, and BFF in the first phase.
- Do not dynamically load handler crates from handler.yml.
- Do not use Java-style reflection or string-to-type construction.
- Do not make Rust type names part of the public config contract.
- Do not support multi-certificate TLS SNI selection in the first phase.
- Do not implement streaming static-file delivery in the first phase unless it is needed for a concrete SPA asset size problem.
Current Shape
light-pingora already adapts a Pingora proxy into the shared runtime:
```rust
pub trait PingoraApp: Send + Sync + 'static {
    type Proxy: ProxyHttp + Send + Sync + 'static;

    // Builds the Pingora proxy from the resolved runtime config.
    fn proxy(&self, config: &RuntimeConfig) -> Result<Self::Proxy, RuntimeError>;
}
```
PingoraTransport calls app.proxy(config) and passes the result to
pingora::proxy::http_proxy_service(...).
Pingora's ProxyHttp lifecycle already has the hooks needed for the gateway
family:
- request_filter: validate, authenticate, rate limit, or directly write a local response such as a static file
- upstream_peer: select the upstream for proxy routes
- upstream_request_filter: mutate the request sent to upstream
- upstream_response_filter: mutate the upstream response before caching
- response_filter: mutate the response sent to the browser
The current light-gateway already writes /health directly from
request_filter. Static SPA serving can use the same pattern.
Product Model
The Rust light-gateway binary should link all built-in Pingora gateway
capabilities:
- virtual host routing
- static SPA serving
- reverse proxy routing
- outbound proxy behavior
- upstream load balancing
- sidecar token/header behavior
- shared middleware handlers
The active behavior is selected by configuration, not by compiling a different binary. The six product personas are configuration profiles:
- gateway
- sidecar
- proxy-server
- proxy-client
- balancer
- bff
These profiles can be represented in light-portal as product-specific config
sets. At runtime, light-gateway only sees the resolved files returned by
config-server. The binary should not need to know whether the files came from a
portal product template, an environment override, or a local fallback.
This keeps deployment simple:
- one binary
- one container image
- one light-pingora framework
- different behavior by remote config
The tradeoff is that config validation must be strong. A product config should not silently start in a different mode if a static root, virtual host, upstream, or chain is wrong.
High-Level Flow
The Pingora gateway request flow should be:
request
-> match handler.yml paths by path and method
-> fall back to handler.yml defaultHandlers when no path matches
-> run request handlers
-> proxy fixed upstream, route by service_id/service_url, serve static file,
or return error
-> run response handlers
-> response
For static handlers such as virtual-host or path-resource, request_filter
writes the response and returns Ok(true) so Pingora does not proxy the
request.
For proxy or router handlers, request_filter stores the selected upstream
decision in the per-request context and returns Ok(false). upstream_peer
and upstream_request_filter then use that context to connect to the right
upstream and set headers.
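The chain-selection part of this flow can be sketched without Pingora types. PathRule, HandlerConfig, and resolve_chain are hypothetical names, and the matching shown here (case-insensitive method, segment-by-segment comparison with {name} wildcards, first match wins) is an assumption rather than the implemented algorithm.

```rust
/// One handler.yml paths entry: a path template, a method, and handler ids.
pub struct PathRule {
    pub path: &'static str,
    pub method: &'static str,
    pub exec: &'static [&'static str],
}

pub struct HandlerConfig {
    pub paths: Vec<PathRule>,
    pub default_handlers: Vec<&'static str>,
}

/// Matches segment by segment; a `{name}` segment accepts any non-empty value.
fn template_matches(template: &str, path: &str) -> bool {
    let template_segments: Vec<&str> = template.split('/').collect();
    let path_segments: Vec<&str> = path.split('/').collect();
    template_segments.len() == path_segments.len()
        && template_segments.iter().zip(&path_segments).all(|(t, p)| {
            (t.starts_with('{') && t.ends_with('}') && !p.is_empty()) || t == p
        })
}

/// Returns the handler ids to run: the first matching path rule, otherwise
/// the defaultHandlers fallback.
pub fn resolve_chain<'a>(
    config: &'a HandlerConfig,
    method: &str,
    path: &str,
) -> &'a [&'static str] {
    config
        .paths
        .iter()
        .find(|rule| {
            rule.method.eq_ignore_ascii_case(method) && template_matches(rule.path, path)
        })
        .map(|rule| rule.exec)
        .unwrap_or(config.default_handlers.as_slice())
}
```

In the gateway, request_filter would run the resolved handlers in order and then either write a local response or record the upstream decision in the per-request context.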
Crate Layout
Keep the first implementation inside frameworks/light-pingora.
Suggested modules:
```text
frameworks/light-pingora/src/
  lib.rs
  handler.rs
  correlation.rs
  cors.rs
  metrics.rs
  proxy.rs
  resource.rs
  router.rs
  service.rs
  token.rs
```
Responsibilities:
- parse and validate handler.yml
- parse handler.yaml as a compatibility fallback
- parse and validate proxy.yml, router.yml, path-resource.yml, and virtual-host.yml
- build explicit handler registry
- resolve handler chains
- match handler paths and fallback handlers
- capture Java-style {name} path-template variables
- load active handler-specific config files
- serve static SPA content
- select fixed proxy upstreams from proxy.yml
- select dynamic sidecar/router upstreams from router.yml
- resolve sidecar service_id values from pathPrefixService.yml
- retrieve and cache OAuth client-credentials tokens from client.yml
- expose module-registry entries for active handler and traffic/resource config
This keeps the first implementation close to the Pingora lifecycle and avoids premature abstractions for Axum.
If Axum later needs the same handler semantics, extract the framework-neutral parts after the Pingora implementation has stabilized.
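As an illustration of the {name} path-template capture listed above, the following is a minimal sketch using only the standard library. The function name match_path_template and the exact-segment-count matching rule are assumptions for this example, not the light-pingora API.

```rust
use std::collections::BTreeMap;

/// Match a request path against a Java-style template such as `/v1/pets/{petId}`.
/// Returns the captured variables when every segment matches, otherwise `None`.
/// Assumption: template and path must have the same number of segments.
pub fn match_path_template(template: &str, path: &str) -> Option<BTreeMap<String, String>> {
    let template_segments: Vec<&str> = template.trim_matches('/').split('/').collect();
    let path_segments: Vec<&str> = path.trim_matches('/').split('/').collect();
    if template_segments.len() != path_segments.len() {
        return None;
    }
    let mut params = BTreeMap::new();
    for (t, p) in template_segments.iter().zip(path_segments.iter()) {
        if let Some(name) = t.strip_prefix('{').and_then(|s| s.strip_suffix('}')) {
            // A template variable must capture a non-empty segment.
            if p.is_empty() {
                return None;
            }
            params.insert(name.to_string(), p.to_string());
        } else if t != p {
            // Literal segments must match exactly.
            return None;
        }
    }
    Some(params)
}
```

The captured map has the same shape as the path_params field of the per-request context, so a matcher along these lines could fill it directly.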
Configuration Split
Use handler.yml for the Java-compatible handler middleware contract:
handler declarations, reusable chains, path-to-chain mappings, and fallback
handlers.
Use Java-compatible product-specific config files for traffic and static resource behavior:
- proxy.yml: fixed inbound reverse proxy targets for gateway, proxy server, balancer, and simple BFF API forwarding.
- router.yml: dynamic outbound routing by service_id or service_url, mainly for sidecar-style deployments.
- path-resource.yml or path-resource.yaml: a single static resource mount.
- virtual-host.yml or virtual-host.yaml: host-based static resource mounts for BFF/SPA deployments.
The product profile selected in light-portal decides which of these files are
included and which handlers are active in handler.yml. The Rust binary should
not require a separate gateway.yml to duplicate these existing contracts.
Handler-specific files such as correlation.yml, cors.yml, metrics.yml,
header.yml, security.yml, apikey.yml, basic-auth.yml,
unified-security.yml, and limit.yml stay separate. They are loaded only
when the corresponding handler is active in the resolved path/default
execution model. Phase 3 implements this active loading for correlation.yml,
cors.yml, and metrics.yml. Phase 4 extends the same active-loading and
reload model to header.yml, security.yml, apikey.yml,
basic-auth.yml, unified-security.yml, and limit.yml.
Remote Config Source
light-gateway starts with enough local bootstrap configuration to contact
config-server. The existing Light Fabric runtime then resolves local and remote
configuration before light-pingora builds the runtime handler/resource/proxy
model.
Startup flow:
- load local bootstrap files from the configured config directory
- contact config-server using the configured service identity, environment, and authorization
- download remote product configuration managed by light-portal
- merge remote config with local fallback config
- load handler.yml, applicable traffic/resource config files, and active handler-specific config files
- validate the complete route and handler model
- bind Pingora listeners
- register the runtime instance with the controller
The remote product config should include:
- handler.yml
- proxy.yml for fixed inbound proxy profiles
- router.yml for sidecar/router profiles
- path-resource.yml or virtual-host.yml for static/BFF profiles
- active handler config files
- TLS, trust, or client files required by the runtime
- optional product-specific static file references or mount paths
handler.yml decides which linked handlers are active. A handler that is
registered in the binary but not referenced by any configured paths entry or
defaultHandlers chain should not be instantiated, should not load its config
file, and should never run.
Handler Config
Example handler.yml:
enabled: ${handler.enabled:true}
reportHandlerDuration: ${handler.reportHandlerDuration:false}
handlerMetricsLogLevel: ${handler.handlerMetricsLogLevel:DEBUG}
basePath: ${handler.basePath:/}
handlers: ${handler.handlers:[]}
chains: ${handler.chains:{}}
paths: ${handler.paths:[]}
defaultHandlers: ${handler.defaultHandlers:[]}
The config-server values managed by light-portal provide the concrete arrays
and maps:
handler.handlers:
  - correlation
  - headers
  - metrics
  - cors
  - jwt
  - rate-limit
handler.chains:
  spa:
    exec:
      - correlation
      - headers
      - metrics
      - cors
  api:
    exec:
      - correlation
      - headers
      - metrics
      - cors
      - jwt
      - rate-limit
  public:
    exec:
      - correlation
      - headers
      - metrics
handler.paths:
  - path: /api/
    method: GET
    exec:
      - api
handler.defaultHandlers:
  - public
This keeps the same top-level handler.yml contract as the Java framework:
enabled, reportHandlerDuration, handlerMetricsLogLevel, basePath,
handlers, chains, paths, and defaultHandlers.
The Rust implementation also accepts the Java extension fields
additionalHandlers, additionalChains, and additionalPaths. They are
merged into the effective handler model before validation.
Unlike Java, the Rust handlers list uses stable short handler IDs. It does
not use fully qualified class names, and it does not need @alias because the
IDs are already short and stable.
handler.yml is the preferred Rust file name. handler.yaml is accepted as a
compatibility fallback because some Java modules and templates use that suffix.
Fixed Proxy Config
proxy.yml should keep the Java inbound reverse-proxy contract. It is used
when the deployment has a known set of target upstream URIs.
enabled: ${proxy.enabled:true}
http2Enabled: ${proxy.http2Enabled:false}
hosts: ${proxy.hosts:http://localhost:8080}
connectionsPerThread: ${proxy.connectionsPerThread:20}
maxRequestTime: ${proxy.maxRequestTime:1000}
rewriteHostHeader: ${proxy.rewriteHostHeader:true}
reuseXForwarded: ${proxy.reuseXForwarded:false}
maxConnectionRetries: ${proxy.maxConnectionRetries:3}
maxQueueSize: ${proxy.maxQueueSize:0}
forwardJwtClaims: ${proxy.forwardJwtClaims:false}
metricsInjection: ${proxy.metricsInjection:false}
metricsName: ${proxy.metricsName:proxy-response}
The Rust implementation should parse proxy.hosts as one or more comma
separated http:// or https:// targets and select a target with round-robin
load balancing. It should preserve rewriteHostHeader, reuseXForwarded,
request timeout, retry, and queue settings where Pingora exposes equivalent
behavior.
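The parsing and round-robin selection described above can be sketched with the standard library alone. ProxyHost, parse_proxy_hosts, and RoundRobin are hypothetical names for this example; the actual implementation would hand the selected target to Pingora's peer types.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// One parsed entry of `proxy.hosts` (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ProxyHost {
    pub tls: bool,
    pub authority: String,
}

/// Parse a comma-separated list of `http://` or `https://` targets.
pub fn parse_proxy_hosts(raw: &str) -> Result<Vec<ProxyHost>, String> {
    let mut hosts = Vec::new();
    for entry in raw.split(',').map(str::trim).filter(|e| !e.is_empty()) {
        let (tls, rest) = if let Some(rest) = entry.strip_prefix("https://") {
            (true, rest)
        } else if let Some(rest) = entry.strip_prefix("http://") {
            (false, rest)
        } else {
            return Err(format!("unsupported proxy host scheme: {entry}"));
        };
        let authority = rest.trim_end_matches('/');
        if authority.is_empty() {
            return Err(format!("missing host in proxy target: {entry}"));
        }
        hosts.push(ProxyHost { tls, authority: authority.to_string() });
    }
    if hosts.is_empty() {
        return Err("proxy.hosts must contain at least one target".to_string());
    }
    Ok(hosts)
}

/// Lock-free round-robin selector over a non-empty host list.
pub struct RoundRobin {
    hosts: Vec<ProxyHost>,
    next: AtomicUsize,
}

impl RoundRobin {
    /// `hosts` must be non-empty, which `parse_proxy_hosts` guarantees.
    pub fn new(hosts: Vec<ProxyHost>) -> Self {
        Self { hosts, next: AtomicUsize::new(0) }
    }

    /// Return the next host, wrapping around at the end of the list.
    pub fn select(&self) -> &ProxyHost {
        let index = self.next.fetch_add(1, Ordering::Relaxed) % self.hosts.len();
        &self.hosts[index]
    }
}
```

Rejecting unknown schemes at parse time fits the startup-validation goal: a bad proxy.hosts value fails before listeners are bound rather than on the first request.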
Router Config
router.yml should keep the Java outbound router contract. This is primarily
for the sidecar pattern, where earlier handlers resolve service_id,
service_url, tokens, and discovery context before the router connects to the
downstream service.
http2Enabled: ${router.http2Enabled:true}
httpsEnabled: ${router.httpsEnabled:true}
maxRequestTime: ${router.maxRequestTime:1000}
pathPrefixMaxRequestTime: ${router.pathPrefixMaxRequestTime:{}}
connectionsPerThread: ${router.connectionsPerThread:10}
softMaxConnectionsPerThread: ${router.softMaxConnectionsPerThread:5}
maxQueueSize: ${router.maxQueueSize:0}
rewriteHostHeader: ${router.rewriteHostHeader:true}
reuseXForwarded: ${router.reuseXForwarded:false}
maxConnectionRetries: ${router.maxConnectionRetries:3}
preResolveFQDN2IP: ${router.preResolveFQDN2IP:false}
hostWhitelist: ${router.hostWhitelist:[]}
serviceIdQueryParameter: ${router.serviceIdQueryParameter:false}
urlRewriteRules: ${router.urlRewriteRules:[]}
methodRewriteRules: ${router.methodRewriteRules:[]}
queryParamRewriteRules: ${router.queryParamRewriteRules:{}}
headerRewriteRules: ${router.headerRewriteRules:{}}
metricsInjection: ${router.metricsInjection:false}
metricsName: ${router.metricsName:router-response}
The Java router chooses the target from service_url first, guarded by
hostWhitelist, or from service_id plus optional env_tag through service
discovery.
Phase 5 implements the Pingora router execution path and keeps the Java
configuration shape. The active router handler loads and registers
router.yml, selects direct service_url targets after hostWhitelist
validation, supports serviceIdQueryParameter, and removes router selection
headers before forwarding upstream. It also applies Java-style URL, method,
query-parameter, and header rewrite rules.
Rust adds serviceTargets as an interim improvement for service_id routing:
serviceTargets:
  com.networknt.petstore-1.0.0:
    - http://localhost:8080
  com.networknt.petstore-1.0.0|dev:
    - https://petstore-dev.example.com
This lets sidecar-style router flows run in local/static deployments and acts as the fallback when controller discovery is unavailable.
Phase 6 adds the sidecar path-prefix and token flow. Phase 7 adds
controller-backed service_id discovery while keeping the same request
contract and the same static fallback.
Sidecar Path Prefix And Token Config
pathPrefixService.yml maps request path prefixes to downstream service IDs.
The handler writes service_id only when the request does not already provide
one.
enabled: ${pathPrefixService.enabled:true}
mapping: ${pathPrefixService.mapping:{}}
Rust intentionally selects the longest prefix that ends on a path-segment
boundary. This avoids map iteration ambiguity when prefixes overlap and
prevents the prefix /v1/address from matching a request for /v1/address2.
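A minimal sketch of longest path-boundary prefix selection follows. The function names and the HashMap-based mapping type are assumptions for illustration; the real handler reads the mapping from pathPrefixService.yml.

```rust
use std::collections::HashMap;

/// Return the service ID mapped to the longest prefix that matches `path`
/// on a path-segment boundary, or `None` when no prefix applies.
pub fn resolve_service_id<'a>(
    mapping: &'a HashMap<String, String>,
    path: &str,
) -> Option<&'a str> {
    mapping
        .iter()
        .filter(|(prefix, _)| matches_on_boundary(prefix, path))
        // The longest matching prefix wins, independent of map iteration order.
        .max_by_key(|(prefix, _)| prefix.len())
        .map(|(_, service_id)| service_id.as_str())
}

/// A prefix matches only if the remainder is empty, starts a new segment,
/// or the prefix itself already ends with `/`.
fn matches_on_boundary(prefix: &str, path: &str) -> bool {
    match path.strip_prefix(prefix) {
        Some(rest) => rest.is_empty() || rest.starts_with('/') || prefix.ends_with('/'),
        None => false,
    }
}
```

With prefixes /v1 and /v1/address configured, a request for /v1/address/123 resolves through /v1/address, while /v1/address2 falls back to /v1.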
token.yml gates when the token handler should run:
enabled: ${token.enabled:false}
appliedPathPrefixes: ${token.appliedPathPrefixes:}
The token handler reads the Java-compatible client credentials section from
client.yml:
tls:
  verifyHostname: ${client.verifyHostname:true}
oauth:
  multipleAuthServers: ${client.multipleAuthServers:false}
  token:
    cache:
      capacity: ${client.tokenCacheCapacity:200}
    tokenRenewBeforeExpired: ${client.tokenRenewBeforeExpired:60000}
    server_url: ${client.tokenServerUrl:}
    serviceId: ${client.tokenServiceId:com.networknt.oauth2-token-1.0.0}
    proxyHost: ${client.tokenProxyHost:}
    proxyPort: ${client.tokenProxyPort:}
    enableHttp2: ${client.tokenEnableHttp2:true}
    client_credentials:
      uri: ${client.tokenCcUri:/oauth2/token}
      client_id: ${client.tokenCcClientId:}
      client_secret: ${client.tokenCcClientSecret:}
      scope: ${client.tokenCcScope:}
      serviceIdAuthServers: ${client.tokenCcServiceIdAuthServers:}
pathPrefixServices: ${client.pathPrefixServices:}
request:
  connectTimeout: ${client.connectTimeout:2000}
  timeout: ${client.timeout:3000}
  enableHttp2: ${client.enableHttp2:true}
In single-auth-server mode, the handler uses the configured token server and
client credentials for all matched paths. In multipleAuthServers mode, it
uses service_id or pathPrefixServices to select
client_credentials.serviceIdAuthServers[service_id].
The token request follows the Java request shape:
- POST to server_url + uri
- Content-Type: application/x-www-form-urlencoded
- Accept: application/json
- HTTP Basic authentication with client_id:client_secret
- form fields grant_type=client_credentials and optional space-joined scope
The injected header follows the Java gateway rule:
- if the inbound request has no Authorization, inject Authorization: Bearer <token>
- if the inbound request already has Authorization, inject X-Scope-Token: Bearer <token>
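The form body and the header injection rule can be sketched as two small pure functions. The names token_form_body and token_header are hypothetical, and the hand-written form_encode helper is only an assumption to keep the example free of external crates.

```rust
/// Encode a value for application/x-www-form-urlencoded:
/// alphanumerics and `-_.*` pass through, space becomes `+`, everything else is %XX.
fn form_encode(value: &str) -> String {
    let mut out = String::new();
    for byte in value.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'*' => {
                out.push(byte as char)
            }
            b' ' => out.push('+'),
            other => out.push_str(&format!("%{other:02X}")),
        }
    }
    out
}

/// Build the client-credentials form body with an optional space-joined scope.
pub fn token_form_body(scopes: &[&str]) -> String {
    let mut body = String::from("grant_type=client_credentials");
    if !scopes.is_empty() {
        body.push_str("&scope=");
        body.push_str(&form_encode(&scopes.join(" ")));
    }
    body
}

/// Choose the header that carries the token to the downstream service.
/// An existing inbound Authorization header is preserved, so the token moves
/// to X-Scope-Token in that case.
pub fn token_header(inbound_has_authorization: bool, token: &str) -> (&'static str, String) {
    let name = if inbound_has_authorization {
        "X-Scope-Token"
    } else {
        "Authorization"
    };
    (name, format!("Bearer {token}"))
}
```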
The Rust cache is local to the gateway process and is registered as
light-pingora/token-cache when a runtime cache registry is available. Cache
summaries expose key and expiry metadata but never expose bearer token values.
Tokens are refreshed synchronously inside the configured renew-before-expiry
window. Async background renewal can be added later if blocking refresh latency
becomes visible.
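The renew-before-expiry check amounts to a comparison against the expiry time minus the configured window. The sketch below assumes tokenRenewBeforeExpired is expressed in milliseconds, as in the Java client, and uses a hypothetical CachedToken type.

```rust
use std::time::{Duration, Instant};

/// A cached bearer token with its absolute expiry time (hypothetical type).
pub struct CachedToken {
    pub value: String,
    pub expires_at: Instant,
}

impl CachedToken {
    /// True when `now` has entered the renew-before-expiry window,
    /// meaning the caller should fetch a new token before using this one.
    pub fn needs_refresh(&self, now: Instant, renew_before: Duration) -> bool {
        match self.expires_at.checked_sub(renew_before) {
            Some(renew_at) => now >= renew_at,
            // The window is longer than the representable lifetime: always refresh.
            None => true,
        }
    }
}
```

Passing now as a parameter keeps the check deterministic in unit tests, which matters because the refresh runs synchronously on the request path.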
When server_url is not configured, phase 7 discovers the token service from
serviceId through the runtime portal-registry client. This requires
server.enableRegistry and a live controller registration. A disconnected
registry client returns a clear configuration/runtime error instead of silently
falling back to an unknown token endpoint.
Static Resource Config
For a single static site, keep path-resource.yml:
path: ${path-resource.path:/public}
base: ${path-resource.base:/opt/light-4j/public}
prefix: ${path-resource.prefix:true}
transferMinSize: ${path-resource.transferMinSize:1024}
directoryListingEnabled: ${path-resource.directoryListingEnabled:false}
For host-based BFF/static sites, keep virtual-host.yml:
hosts: ${virtual-host.hosts:[]}
Example config-server values:
virtual-host.hosts:
  - domain: local.localhost
    path: /
    base: /lightapi/dist
    transferMinSize: 10245760
    directoryListingEnabled: false
  - domain: signin.localhost
    path: /
    base: /signin/dist
    transferMinSize: 10245760
    directoryListingEnabled: false
Rust should preserve the Java domain, path, base, transferMinSize, and
directoryListingEnabled fields. It should also add the Rust improvement for
SPA fallback: when a static virtual host cannot find a requested browser route
and the path does not look like an asset, it should serve index.html from the
matched static root.
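The fallback decision can be sketched as follows. The text does not define what "looks like an asset" means, so the rule used here (the last path segment has a file extension) is an assumption, as are the names StaticDecision, looks_like_asset, and decide_static.

```rust
/// Outcome of a static lookup under a virtual-host root (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum StaticDecision {
    ServeFile,
    ServeIndexFallback,
    NotFound,
}

/// Assumed heuristic: a path is an asset when its last segment has a
/// non-empty stem and a non-empty extension, such as `app.js`.
pub fn looks_like_asset(path: &str) -> bool {
    let last = path.rsplit('/').next().unwrap_or("");
    match last.rsplit_once('.') {
        Some((stem, ext)) => !stem.is_empty() && !ext.is_empty(),
        None => false,
    }
}

/// Missing browser routes fall back to index.html; missing assets stay 404.
pub fn decide_static(path: &str, file_exists: bool) -> StaticDecision {
    if file_exists {
        StaticDecision::ServeFile
    } else if looks_like_asset(path) {
        StaticDecision::NotFound
    } else {
        StaticDecision::ServeIndexFallback
    }
}
```

Keeping missing assets as 404 avoids returning HTML for a broken script or stylesheet URL, which is harder to diagnose in the browser.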
BFF Wiring Example
The Java BFF config in portal-config-loc/all-in-lt/light-gateway uses
handler.paths to send API routes through the default chain, which includes
path-prefix service resolution, token handling, and the router. It then uses:
handler.defaultHandlers:
- cors
- virtual
That means unmatched browser routes fall through to CORS plus virtual-host
static serving. Rust should keep this pattern: handler.yml decides whether a
request goes to proxy/router/static handling, based on paths and fallback
handlers.
Other product personas use different config file combinations. A BFF commonly
uses handler.yml, router.yml, path-prefix/token configs, and
virtual-host.yml. A simple proxy or balancer can use handler.yml and
proxy.yml. A sidecar uses handler.yml, router.yml, token/cache config,
registry/discovery config, and usually no static resource config.
Phase 3 Handler Config
Phase 3 implements the first three Java-compatible cross-cutting handlers.
correlation.yml:
enabled: ${correlation.enabled:true}
autogenCorrelationID: ${correlation.autogenCorrelationID:true}
correlationMdcField: ${correlation.correlationMdcField:cId}
traceabilityMdcField: ${correlation.traceabilityMdcField:tId}
The Rust handler reads X-Correlation-Id and X-Traceability-Id, generates a
Java-compatible URL-safe UUID value when correlation is missing, passes the
correlation ID to the upstream request, and echoes X-Traceability-Id on the
response. It stores the values in the Pingora request context instead of MDC.
cors.yml:
enabled: ${cors.enabled:true}
allowedOrigins: ${cors.allowedOrigins:}
allowedMethods: ${cors.allowedMethods:}
pathPrefixAllowed: ${cors.pathPrefixAllowed:}
The Rust handler accepts the same list/string forms as Java, supports
pathPrefixAllowed, short-circuits preflight OPTIONS, rejects disallowed
origins with 403, and adds the CORS response headers before static or proxied
responses are sent. Rust intentionally uses longest-prefix selection for
pathPrefixAllowed so overlapping prefixes are deterministic.
metrics.yml:
enabled: ${metrics.enabled:true}
enableJVMMonitor: ${metrics.enableJVMMonitor:false}
serverProtocol: ${metrics.serverProtocol:http}
serverHost: ${metrics.serverHost:localhost}
serverPath: ${metrics.serverPath:/apm/metricFeed}
serverPort: ${metrics.serverPort:8086}
serverName: ${metrics.serverName:metrics}
serverUser: ${metrics.serverUser:admin}
serverPass: ${metrics.serverPass:admin}
reportInMinutes: ${metrics.reportInMinutes:1}
productName: ${metrics.productName:http-sidecar}
sendScopeClientId: ${metrics.sendScopeClientId:false}
sendCallerId: ${metrics.sendCallerId:false}
sendIssuer: ${metrics.sendIssuer:false}
issuerRegex: ${metrics.issuerRegex:}
Phase 3 parses and registers this config with serverPass masked, records
request counts and status classes in memory, and logs request metrics with the
matched endpoint and correlation ID. enableJVMMonitor is parsed for config
compatibility but is not applicable to Rust. External Influx/APM reporters are
deferred until the metrics sink decision is made.
Phase 4 Handler Config
Phase 4 implements the security-oriented Java-compatible handlers that fit the Pingora request metadata model.
header.yml:
enabled: ${header.enabled:false}
request:
  remove: ${header.request.remove:}
  update: ${header.request.update:}
response:
  remove: ${header.response.remove:}
  update: ${header.response.update:}
pathPrefixHeader: ${header.pathPrefixHeader:}
The Rust handler applies request header remove/update rules before proxying and
response header remove/update rules before static or proxied responses are
sent. Rust intentionally uses longest-prefix selection for pathPrefixHeader
so overlapping prefixes are deterministic.
apikey.yml:
enabled: ${apikey.enabled:true}
hashEnabled: ${apikey.hashEnabled:false}
pathPrefixAuths: ${apikey.pathPrefixAuths:[]}
The Rust handler follows the Java rule that no matching path prefix means the
handler passes the request. A matching rule validates the configured header
against either a plain API key or the Java iterations:saltHex:hashHex
PBKDF2-HMAC-SHA1 hash format.
basic-auth.yml:
enabled: ${basic.enabled:false}
enableAD: ${basic.enableAD:true}
allowAnonymous: ${basic.allowAnonymous:false}
allowBearerToken: ${basic.allowBearerToken:false}
users: ${basic.users:[]}
The Rust handler supports configured local users, anonymous path users, and the Java-compatible bearer pass-through mode. LDAP/AD authentication is parsed for configuration compatibility but is not implemented in phase 4.
security.yml:
enableVerifyJwt: ${security.enableVerifyJwt:true}
ignoreJwtExpiry: ${security.ignoreJwtExpiry:false}
enableH2c: ${security.enableH2c:false}
enableMockJwt: ${security.enableMockJwt:false}
jwt:
  certificate: ${security.jwt.certificate:{}}
  clockSkewInSeconds: ${security.jwt.clockSkewInSeconds:60}
  keyResolver: ${security.jwt.keyResolver:}
skipPathPrefixes: ${security.skipPathPrefixes:[]}
passThroughClaims: ${security.passThroughClaims:{}}
The Rust handler verifies Bearer JWTs with configured PEM certificates, honors
kid when present, supports RSA and EC algorithms handled by the Rust JWT
library, applies clock skew and optional expiry bypass, caches decoded claims,
and forwards configured pass-through claims as request headers. Dynamic JWK key
service bootstrap and SWT/SJWT verification are deferred until the runtime has
the discovery and key-service client surface needed by those flows.
unified-security.yml:
enabled: ${unified-security.enabled:true}
anonymousPrefixes: ${unified-security.anonymousPrefixes:[]}
pathPrefixAuths: ${unified-security.pathPrefixAuths:[]}
The Rust handler supports Java-style path-prefix selection across Basic, JWT, and API-key authentication. Anonymous prefixes bypass authentication. SWT/SJWT rules return a clear not-implemented response until the discovery-backed key flow is added.
limit.yml:
enabled: ${limit.enabled:false}
concurrentRequest: ${limit.concurrentRequest:0}
queueSize: ${limit.queueSize:0}
errorCode: ${limit.errorCode:429}
rateLimit: ${limit.rateLimit:}
headersAlwaysSet: ${limit.headersAlwaysSet:false}
key: ${limit.key:server}
server: ${limit.server:{}}
address: ${limit.address:{}}
client: ${limit.client:{}}
user: ${limit.user:{}}
The Rust handler implements in-memory request rate limiting by server, client
address, JWT client ID, or JWT user ID. It emits X-RateLimit-Limit,
X-RateLimit-Remaining, X-RateLimit-Reset, and Retry-After when a request
is rejected, and it can always emit the rate-limit headers when
headersAlwaysSet is enabled. Cluster-wide distributed counters are deferred
until there is a concrete gateway clustering requirement.
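The text above does not state which counting algorithm the handler uses. As one possible shape, the sketch below shows a fixed-window in-memory limiter keyed by an arbitrary string such as a client address or JWT client ID. FixedWindowLimiter and LimitDecision are hypothetical names, and the fixed-window choice is an assumption for illustration.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Result of a limiter check; the fields map onto the rate-limit response headers.
pub struct LimitDecision {
    pub allowed: bool,
    pub remaining: u32,
    pub reset_after: Duration,
}

/// Fixed-window counter per key (assumed algorithm, process-local state).
pub struct FixedWindowLimiter {
    limit: u32,
    window: Duration,
    counters: HashMap<String, (Instant, u32)>,
}

impl FixedWindowLimiter {
    pub fn new(limit: u32, window: Duration) -> Self {
        Self { limit, window, counters: HashMap::new() }
    }

    /// Count one request for `key` at time `now` and report whether it is allowed.
    pub fn check(&mut self, key: &str, now: Instant) -> LimitDecision {
        let entry = self.counters.entry(key.to_string()).or_insert((now, 0));
        // Start a fresh window once the previous one has fully elapsed.
        if now.duration_since(entry.0) >= self.window {
            *entry = (now, 0);
        }
        let allowed = entry.1 < self.limit;
        if allowed {
            entry.1 += 1;
        }
        LimitDecision {
            allowed,
            remaining: self.limit - entry.1,
            reset_after: self.window.saturating_sub(now.duration_since(entry.0)),
        }
    }
}
```

A shared instance would need a Mutex or sharded map around it; that part is left out to keep the counting logic visible.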
Handler Registry
Use explicit registration.
let handlers = PingoraHandlerRegistry::new()
    .register(correlation::descriptor())
    .register(headers::descriptor())
    .register(metrics::descriptor())
    .register(cors::descriptor())
    .register(jwt::descriptor())
    .register(rate_limit::descriptor());
No inventory is needed for the first version. Explicit registration is
deterministic, testable, and makes the compiled-in handler set clear from the
service binary.
The light-gateway binary can register every built-in handler it supports.
Registration only makes a handler available. Activation is controlled by
handler.yml.
Build the active handler set lazily:
- parse handler.yml
- resolve paths and defaultHandlers
- expand any referenced chains
- compute the set of referenced handler IDs
- instantiate only referenced handlers
- load config only for referenced handlers
This allows one binary to support gateway, sidecar, proxy, balancer, and BFF profiles without requiring unused handler config files.
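The lazy build steps above can be sketched as a recursive expansion over chains. The function names and the plain string collections are assumptions for this example; the sketch also assumes that a name defined as a chain takes precedence over a handler ID of the same name.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Compute the set of handler IDs referenced by `roots`, where each root is an
/// item from a `paths` entry or from `defaultHandlers`. Chains are expanded
/// recursively; unknown names and recursive chains are reported as errors.
pub fn active_handlers(
    registered: &BTreeSet<String>,
    chains: &BTreeMap<String, Vec<String>>,
    roots: &[String],
) -> Result<BTreeSet<String>, String> {
    let mut active = BTreeSet::new();
    let mut stack = Vec::new();
    for item in roots {
        expand(item, registered, chains, &mut stack, &mut active)?;
    }
    Ok(active)
}

fn expand(
    item: &str,
    registered: &BTreeSet<String>,
    chains: &BTreeMap<String, Vec<String>>,
    stack: &mut Vec<String>,
    active: &mut BTreeSet<String>,
) -> Result<(), String> {
    if let Some(members) = chains.get(item) {
        // A chain already on the expansion stack means a recursive reference.
        if stack.iter().any(|seen| seen == item) {
            return Err(format!("recursive chain reference: {item}"));
        }
        stack.push(item.to_string());
        for member in members {
            expand(member, registered, chains, stack, active)?;
        }
        stack.pop();
        Ok(())
    } else if registered.contains(item) {
        active.insert(item.to_string());
        Ok(())
    } else {
        Err(format!("unknown handler or chain: {item}"))
    }
}
```

Only the IDs in the returned set would then be instantiated and have their config files loaded; a registered handler that never appears in the set stays inert.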
The registry maps stable config IDs to factories:
pub struct PingoraHandlerDescriptor {
    pub id: &'static str,
    pub kind: PingoraHandlerKind,
    pub factory: PingoraHandlerFactory,
}
Suggested first handler IDs:
- correlation
- headers
- metrics
- cors
- jwt
- api-key
- basic-auth
- rate-limit
- request-size-limit
Trace headers should be handled by correlation; there should not be a
separate traceability handler.
Handler API
Use Pingora phases directly. Avoid a generic exchange abstraction until another framework needs it.
The current implementation keeps PingoraHandler as a descriptor/factory
surface and executes the built-in phase 3 handlers from light-gateway's
Pingora lifecycle. This keeps the first implementation straightforward:
- request_filter resolves the configured chain and runs request-stage handlers in order.
- A request-stage handler can continue, short-circuit with a local response, or select a terminal action such as proxy/static/health.
- upstream_request_filter applies upstream request mutations such as generated correlation IDs.
- response_filter applies response-stage headers and records proxied response metrics.
- Static responses call the same response decoration and metrics code before writing the local response.
Once security/rate-limit handlers are added, this can be lifted into a richer trait with request/upstream/response hooks if the duplication becomes real. It is intentionally not generalized before the Pingora behavior stabilizes.
Response handlers should run before both static and proxied responses are sent.
For proxied responses, this maps to Pingora response_filter. For static
responses, the static-file renderer calls the same response handler chain
before writing the local response.
Request Context
The per-request context should carry route decisions across Pingora phases.
pub struct GatewayRequestContext {
    pub upstream: Option<ProxyTarget>,
    pub endpoint: String,
    pub method: String,
    pub path_params: BTreeMap<String, String>,
    pub correlation: CorrelationState,
    pub cors: Option<CorsResponseHeaders>,
    pub metrics_enabled: bool,
}
The context is created by ProxyHttp::new_ctx() and populated in
request_filter.
upstream_peer should only select an upstream after a proxy or router handler
has selected one. If no upstream is selected for a proxied request, the
implementation should return a clear configuration error rather than silently
falling back.
Virtual Hosts
Virtual-host static serving should use the HTTP Host header.
Host normalization rules:
- lowercase the host
- strip the port when present
- reject empty or invalid hosts unless a default virtual host is configured
- exact host match first
- wildcard match such as *.example.com after exact hosts, with the longest matching suffix winning
HTTP host routing is enough for the first implementation.
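The normalization and matching rules above can be sketched as two functions. normalize_host and match_virtual_host are hypothetical names; the sketch ignores IPv6 literals and treats the configured domains as a plain list of strings, both of which are simplifying assumptions.

```rust
/// Lowercase the Host header value, strip a numeric port, and reject
/// empty or syntactically invalid names.
pub fn normalize_host(raw: &str) -> Option<String> {
    let lowered = raw.trim().to_ascii_lowercase();
    let name = match lowered.rsplit_once(':') {
        Some((name, port)) if !port.is_empty() && port.bytes().all(|b| b.is_ascii_digit()) => name,
        _ => lowered.as_str(),
    };
    let valid = !name.is_empty()
        && name.bytes().all(|b| b.is_ascii_alphanumeric() || b == b'-' || b == b'.');
    if valid {
        Some(name.to_string())
    } else {
        None
    }
}

/// Pick the configured domain for a normalized host: exact match first,
/// then the wildcard pattern (`*.suffix`) with the longest matching suffix.
pub fn match_virtual_host<'a>(domains: &'a [String], host: &str) -> Option<&'a str> {
    if let Some(exact) = domains.iter().find(|d| d.as_str() == host) {
        return Some(exact.as_str());
    }
    domains
        .iter()
        .filter(|pattern| {
            pattern
                .strip_prefix('*')
                .map_or(false, |suffix| suffix.starts_with('.') && host.ends_with(suffix))
        })
        .max_by_key(|pattern| pattern.len())
        .map(String::as_str)
}
```

When normalize_host returns None, the caller would fall back to the default virtual host if one is configured, or reject the request otherwise.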
TLS certificate selection by SNI is separate. The current light-pingora
transport uses one Rustls TLS setting for the listener, so the first production
options are:
- terminate TLS at ingress or a load balancer
- use a wildcard certificate
- use one certificate with all required SANs
Phase 8 evaluated dynamic multi-cert SNI selection. The current
light-pingora build uses Pingora's Rustls listener, and Pingora 0.8 Rustls
TLS settings do not support certificate callbacks. For now the production
options remain terminating TLS before light-gateway, using a wildcard
certificate, or using one certificate with all required SANs. Native multi-cert
SNI can be added only after moving to a Pingora TLS backend/version that
supports server certificate callbacks or certificate resolution through Rustls.
Static SPA Rendering
Static SPA rendering should be part of the Pingora resource engine, not a
generic middleware handler. It is enabled by path-resource.yml or
virtual-host.yml, typically for BFF profiles.
Rules for the first implementation:
- support GET and HEAD
- return 405 for unsupported methods on static routes
- canonicalize requested paths under the configured static root
- reject path traversal
- do not serve files outside the static root
- deny dotfiles by default
- do not list directories
- serve index.html for the root path
- support SPA fallback to index.html for non-asset routes
- infer Content-Type from file extension
- set Cache-Control: no-cache for index.html
- set long immutable cache headers for hashed assets
- allow static route prefixes to be bypassed by API routes such as /api/, /oauth/, /mcp/, or /ws/
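The traversal, dotfile, and root-path rules can be sketched as a purely lexical resolver. resolve_static_path is a hypothetical name, the request path is assumed to be percent-decoded already, and a real implementation would likely also canonicalize against the filesystem so that symlinks cannot escape the root.

```rust
use std::path::{Component, Path, PathBuf};

/// Map a request path onto a file under `root` without touching the filesystem.
/// Returns `None` for traversal attempts and dotfiles; the root path maps to index.html.
pub fn resolve_static_path(root: &Path, request_path: &str) -> Option<PathBuf> {
    let mut resolved = root.to_path_buf();
    for component in Path::new(request_path.trim_start_matches('/')).components() {
        match component {
            Component::Normal(segment) => {
                let segment = segment.to_str()?;
                // Deny dotfiles and dot-directories by default.
                if segment.starts_with('.') {
                    return None;
                }
                resolved.push(segment);
            }
            Component::CurDir => {}
            // `..`, absolute roots, and drive prefixes are never accepted.
            _ => return None,
        }
    }
    if resolved == root {
        resolved.push("index.html");
    }
    Some(resolved)
}
```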
Recommended cache behavior:
index.html Cache-Control: no-cache
*.js, *.css with hash Cache-Control: public, max-age=31536000, immutable
images/fonts with hash Cache-Control: public, max-age=31536000, immutable
other assets Cache-Control: public, max-age=3600
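The cache table can be expressed as a small selection function. How a hashed asset is detected is not specified above; the sketch assumes a file name counts as hashed when one of its dot- or dash-separated parts is at least eight hexadecimal characters, and both function names are hypothetical.

```rust
/// Choose the Cache-Control value for a static file name.
pub fn cache_control(file_name: &str) -> &'static str {
    if file_name == "index.html" {
        "no-cache"
    } else if has_content_hash(file_name) {
        "public, max-age=31536000, immutable"
    } else {
        "public, max-age=3600"
    }
}

/// Assumed heuristic: a part such as `3f9a1c2b` in `app.3f9a1c2b.js`
/// marks the file as content-hashed.
fn has_content_hash(file_name: &str) -> bool {
    file_name
        .split(|c: char| c == '.' || c == '-')
        .any(|part| part.len() >= 8 && part.bytes().all(|b| b.is_ascii_hexdigit()))
}
```

A heuristic like this can misclassify names that happen to contain long hexadecimal runs, so a build manifest, where available, would be a more reliable source.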
Phase 8 keeps small static files on the simple read-then-write path and streams
files whose size is greater than or equal to the configured
transferMinSize. Static responses include ETag and Last-Modified, honor
If-None-Match and If-Modified-Since, and return 304 without a response
body when the browser cache is current.
Proxy And Router Behavior
proxy.yml selects from configured upstream URIs. This is the simpler inbound
reverse-proxy case and should be implemented before dynamic sidecar routing.
Fixed proxy target behavior:
- parse comma-separated proxy.hosts
- support http:// and https://
- duplicate a single host internally if retry/load-balancer behavior needs at least two entries
- select upstream with round-robin
- apply timeout, retry, queue, and host-forwarding settings where Pingora supports them
router.yml selects from request metadata. Phase 5 implements direct
service_url targets, static serviceTargets for service_id, host whitelist
enforcement, and rewrite behavior. Phase 7 adds controller-backed
service_id lookup through the runtime portal-registry client and keeps static
serviceTargets as a local fallback.
Router target behavior:
- prefer service_url when present and allowed by router.hostWhitelist
- otherwise use service_id plus optional env_tag
- optionally allow service_id from the query string when serviceIdQueryParameter is true
- resolve service_id from controller discovery when the portal-registry client is connected
- fall back to router.serviceTargets for local/static deployments or controller lookup failures
- support URL, method, query-parameter, and header rewrite rules
- remove service_url and service_id headers before forwarding
upstream_peer creates the HttpPeer from the selected upstream:
- address
- TLS enabled
- SNI
- optional host header
upstream_request_filter should set or override upstream headers such as:
- Host
- X-Forwarded-For
- X-Forwarded-Proto
- X-Forwarded-Host
- X-Light-Gateway or equivalent runtime marker
Handler-specific upstream mutations should also run from this phase.
Chain Resolution
Startup should validate handler and selected traffic/resource configuration before binding listeners.
Validation rules:
- every handler ID in handler.yml must exist in the explicit registry
- every chain item must resolve to a registered handler or another chain
- recursive chain references are invalid
- every handler.paths entry must reference existing chains or handlers
- every handler.defaultHandlers entry must reference existing chains or handlers
- proxy.yml hosts must be valid http:// or https:// URIs when the proxy handler is active
- router.yml rewrite rules must be parseable when the router handler is active
- every static virtual host must have a static root
- static roots must be absolute or resolved relative to a configured base
- duplicate exact virtual hosts are invalid
- duplicate handler IDs in the registry are invalid
The resolved model should be immutable and cheap to read:
pub struct GatewayRuntimeModel {
    pub virtual_hosts: BTreeMap<String, Arc<VirtualHost>>,
    pub default_host: Option<Arc<VirtualHost>>,
    pub chains: BTreeMap<String, Arc<ResolvedHandlerChain>>,
    pub proxy_targets: Vec<Arc<ProxyTarget>>,
}
Config reload should continue to swap loaded models atomically. In-flight requests should keep using the handler/resource/proxy/router model they already selected.
Runtime Integration
light-runtime remains responsible for bootstrap, config loading, lifecycle,
controller registration, and module registry. light-pingora should load its
Pingora-specific handler, traffic, and resource config through the existing
runtime config loader.
Module IDs:
- light-pingora/handler
- light-pingora/proxy
- light-pingora/router
- light-pingora/path-prefix-service
- light-pingora/token
- light-client/client
- light-pingora/path-resource
- light-pingora/virtual-host
- light-pingora/correlation
- light-pingora/cors
- light-pingora/metrics
- light-pingora/header
- light-pingora/security
- light-pingora/apikey
- light-pingora/basic-auth
- light-pingora/unified-security
- light-pingora/limit
The module registry should expose:
- handler config snapshot, masked
- proxy, router, path-resource, and virtual-host config snapshots, masked
- active handler IDs
- active chains
- active virtual hosts
- active proxy/router/static capabilities
- reloadable status
The implemented phases use the existing ReloadableModule pattern for active
handler, proxy, router, resource, virtual-host, path-prefix service, token, and
handler-specific config files. Phase 7 exposes a capabilities summary from
get_service_info, including active modules, traffic capabilities, active
handlers, chain names, path mappings, default handlers, virtual hosts, and
path-resource config.
Suitable First Handlers
Start with handlers that map cleanly to Pingora request and response metadata:
- correlation ID and trace headers
- response headers
- metrics
- CORS
- JWT verification
- API key verification
- basic auth
- request size limit from headers
- simple rate limiting by principal, IP, host, or route
Defer handlers that require deeper body handling:
- request decompression
- response compression policy beyond Pingora modules
- request body sanitizer
- generic body parser
- WebSocket message handlers
Error Model
Handlers and proxy/resource selection should return structured errors that render consistently.
pub struct HandlerError {
    pub status: u16,
    pub code: Cow<'static, str>,
    pub message: Cow<'static, str>,
    pub metadata: serde_json::Value,
}
Security handlers should avoid returning sensitive validation details to the browser. Detailed diagnostics should go to logs with correlation IDs.
Common gateway errors:
- unknown host: 404
- no matching handler path or static resource: 404
- unsupported method for static route: 405
- static file outside root: 403
- missing upstream: startup validation error
- auth failure: 401 or 403
- rate limit: 429
Testing Strategy
Unit tests in light-pingora:
- build active handler set from referenced paths and defaultHandlers
- ignore registered but unreferenced handlers
- do not require config files for unreferenced handlers
- parse valid handler.yml
- reject unknown handler IDs
- reject recursive chains
- resolve path/default handler chains in order
- parse handler.yaml fallback
- merge additionalHandlers, additionalChains, and additionalPaths
- capture path-template variables
- parse CORS list/string and path-prefix config
- classify metrics status codes
- normalize host names and strip ports
- reject duplicate virtual hosts
- match exact virtual hosts
- parse and validate proxy.yml hosts
- parse and validate router.yml rewrite-rule config
- select router targets from direct service_url
- reject direct router targets that do not match hostWhitelist
- select router targets from controller discovery and static serviceTargets
- apply router URL, method, query-parameter, and header rewrites
- parse pathPrefixService.yml and avoid partial-segment path matches
- parse token.yml and the client credentials subset of client.yml
- support single and multiple auth-server token configuration
- discover token service endpoints from client.yml token serviceId
- mask token cache summaries and never expose bearer token values
- expose gateway capabilities in get_service_info
- prevent static path traversal
- deny dotfiles by default
- serve index.html for /
- serve SPA fallback for non-asset paths
- avoid SPA fallback for /api/ proxy routes
- select cache headers for index.html and hashed assets
- stop handler execution on early response
- run response handlers before static response write
Integration tests:
- same binary starts with BFF profile config
- same binary starts with proxy or balancer profile config
- BFF profile can route API paths through configured handlers and serve SPA fallback through defaultHandlers
- static SPA route returns index.html
- static asset route returns correct content type and cache header
- virtual host A and virtual host B serve different roots
- API route is proxied to the configured proxy.yml upstream
- auth handler blocks protected API routes
- public static route does not require auth unless configured
Rollout Plan
Phase 1: Product config and active handler model (implemented)
- keep a single apps/light-gateway binary
- register all built-in handler descriptors explicitly
- resolve active handler IDs from handler.yml
- instantiate only active handlers
- load config only for active handlers
- document product profiles managed by light-portal
Phase 2: BFF and fixed proxy engine (implemented)
- load and register proxy.yml, path-resource.yml, and virtual-host.yml
- match handler.yml paths and fallback handlers in Java-compatible order
- select fixed proxy upstreams from proxy.yml
- match virtual hosts by Host
- serve single-site and virtual-host static content
- implement safe static path resolution
- serve static files from request_filter
- add Rust SPA fallback improvement
- add content type and cache headers
- add traversal, dotfile, fallback, proxy-host, and virtual-host tests
Phase 3: Handler chain execution (implemented)
- run request and response handlers around static and proxied responses
- implement correlation, CORS, and basic metrics
- parse correlation.yml, cors.yml, and metrics.yml
- pass generated correlation IDs upstream
- apply response headers to both static and proxied responses
- log handler duration when reportHandlerDuration is enabled
- defer generic response headers to a handler-specific follow-up
Phase 4: Security and request/response policy handlers (implemented)
- implement JWT, API key, basic auth, and rate-limit handlers
- implement the generic header handler for request and response mutation
- implement unified-security path-prefix selection for Basic, JWT, and API key
- parse Java-compatible security.yml, apikey.yml, basic-auth.yml, unified-security.yml, header.yml, and limit.yml
- add JWT pass-through claim request header mutation
- add path-level chain selection for public SPA and protected API routes
Phase 5: Sidecar router (implemented)
- load and register router.yml
- implement dynamic target selection by service_url or service_id
- enforce hostWhitelist
- support static serviceTargets for service_id routing until runtime discovery is available
- support router URL, method, query-parameter, and header rewrites
- apply router request mutation in upstream_request_filter
- remove router selection headers before forwarding
- include router config in the active reload model
- add sidecar-focused tests
Phase 6: Sidecar path-prefix and token flow (implemented)
- load and register pathPrefixService.yml
- resolve service_id by longest path-boundary prefix
- load and register token.yml
- load and register the token-related view of client.yml
- support single-auth-server and multipleAuthServers client credentials
- cache tokens locally and expose masked cache summaries through the runtime cache registry
- inject Authorization or X-Scope-Token according to inbound request state
- extend reload coverage to pathPrefixService.yml, token.yml, and token-related client.yml
- add sidecar token/path-prefix tests
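The "longest path-boundary prefix" rule from Phase 6 can be sketched as follows. This is a minimal illustration, not the light-pingora implementation: the function names and the `(prefix, service_id)` slice shape are hypothetical, and the real pathPrefixService.yml model may differ.

```rust
/// True when `prefix` matches `path` on a segment boundary, so "/v1/pets"
/// matches "/v1/pets/42" but not "/v1/petstore".
fn matches_on_boundary(prefix: &str, path: &str) -> bool {
    let prefix = prefix.trim_end_matches('/');
    match path.strip_prefix(prefix) {
        Some(rest) => rest.is_empty() || rest.starts_with('/'),
        None => false,
    }
}

/// Pick the service id whose prefix is the longest boundary match.
/// `mappings` is a hypothetical flattened view of the prefix table.
pub fn resolve_service_id<'a>(mappings: &[(&str, &'a str)], path: &str) -> Option<&'a str> {
    mappings
        .iter()
        .filter(|(prefix, _)| matches_on_boundary(prefix, path))
        .max_by_key(|(prefix, _)| prefix.len())
        .map(|(_, service_id)| *service_id)
}
```

With entries for /v1/pets and /v1, a request for /v1/petstore falls back to the /v1 entry because the longer prefix does not end on a segment boundary.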
Phase 7: Discovery and control plane (implemented)
- expose the runtime portal-registry client to framework transports
- add discovery/lookup support to the portal-registry client
- resolve router service_id targets through controller discovery
- keep static router.serviceTargets as a fallback for local/static profiles
- discover token service endpoints from client.yml token serviceId
- expose active capabilities, hosts, paths, handlers, and chains through get_service_info
- atomically replace resolved handler/resource/proxy models on reload
Phase 8: Advanced transport features (implemented)
- add streaming static-file delivery for files at or above transferMinSize
- add conditional static requests with ETag and Last-Modified
- add wildcard virtual hosts with exact-host precedence
- evaluate multi-cert TLS SNI support and document the Rustls limitation
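As an illustration of host normalization and "wildcard virtual hosts with exact-host precedence", here is a small sketch. The `*.suffix` wildcard syntax, the longest-suffix tie-break, and both function names are assumptions made for this example; virtual-host.yml may define wildcards differently.

```rust
/// Lowercase the Host header value and strip an optional ":port" suffix.
pub fn normalize_host(raw: &str) -> String {
    let host = raw.trim().to_ascii_lowercase();
    if let Some((name, port)) = host.rsplit_once(':') {
        let numeric = !port.is_empty() && port.bytes().all(|b| b.is_ascii_digit());
        // Leave unbracketed IPv6 literals alone; "[::1]:8080" still loses its port.
        if numeric && (!name.contains(':') || name.ends_with(']')) {
            return name.to_string();
        }
    }
    host
}

/// Exact entries win; otherwise the longest "*.suffix" wildcard is used.
/// `hosts` is a hypothetical list of (configured host, static root) pairs
/// whose host names are assumed to be lowercase already.
pub fn match_virtual_host<'a>(hosts: &[(&str, &'a str)], host_header: &str) -> Option<&'a str> {
    let host = normalize_host(host_header);
    if let Some((_, root)) = hosts.iter().find(|(name, _)| *name == host) {
        return Some(*root);
    }
    hosts
        .iter()
        .filter(|(pattern, _)| {
            pattern
                .strip_prefix('*')
                .map_or(false, |suffix| suffix.starts_with('.') && host.ends_with(suffix))
        })
        .max_by_key(|(pattern, _)| pattern.len())
        .map(|(_, root)| *root)
}
```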
Phase 2 Decisions
- Static roots can be absolute, matching the Java deployment model, or relative to the runtime config directory for local Rust development.
- SPA fallback applies only to browser routes. Paths that look like assets, such as /app.js or /favicon.ico, return 404 when the file is missing.
- Handler path matching supports exact paths and Java/OpenAPI-style {name} path-template segments.
Open Questions
- Should static content support ETag in the first implementation if portal deployments depend on browser cache validation?
MCP Router
Status
Phases 1, 2, 3, and 4 are implemented in light-pingora and
light-gateway. The configurable tokenization client remains deferred until
light-tokenization is migrated to portal-service/apps/portal-service and
the protocol is selected. Stateful backend MCP session mapping is implemented
for the single-process gateway session store and documented below.
Purpose
The Java mcp-router module exposes a configured Model Context Protocol
endpoint, /mcp by default, and turns configured gateway targets into MCP
tools. AI agents can call initialize, tools/list, and tools/call; the
router then forwards the tool call to an HTTP service or another MCP server.
In light-fabric this should be a light-pingora handler that is activated by
light-gateway through handler.yml. The same gateway binary can contain the
MCP router implementation, but each product decides whether it runs by including
the mcp handler and the mcp-router.yml configuration from the config server.
This feature is separate from the existing runtime MCP control plane in
light-runtime. Runtime MCP is an internal management surface exposed through
the portal registry connection. The MCP router is an HTTP-facing gateway
feature and is subject to the normal inbound handler chain.
The transport target is MCP Streamable HTTP as defined by the current MCP transport specification: https://modelcontextprotocol.io/specification/2025-06-18/basic/transports.
Goals
- Keep the Java configuration model recognizable: enabled, path, and tools.
- Allow mcp-router.tools to be injected by the config server the same way handler.handlers, handler.chains, handler.paths, and handler.defaultHandlers are injected.
- Activate the router with the existing mcp handler id in handler.yml.
- Expose one MCP endpoint with Streamable HTTP semantics, so /mcp is the only public MCP path for both POST messages and optional GET streams.
- Support MCP JSON-RPC methods needed by the Java module: initialize, notifications/initialized, tools/list, and tools/call.
- Route tools to direct targetHost endpoints, discovered serviceId targets, and backend MCP servers.
targetHostendpoints, discoveredserviceIdtargets, and backend MCP servers. - Reuse existing cross-cutting handlers such as correlation, security, CORS, rate limit, header, metrics, and proxy routing where the chain order allows.
- Register the router configuration with the module registry so it can be inspected and reloaded consistently with other light-fabric modules.
Non-Goals
- Do not use Rust dynamic plugins or inventory for runtime tool registration. The active tools are product configuration, not compile-time discovery.
- Do not merge the public MCP router and the internal runtime MCP control plane into one handler.
- Do not implement a full MCP server framework in the first pass. The gateway only needs the methods used by agents to discover and call configured tools.
- Do not copy Java's legacy HTTP+SSE endpoint split as the target transport. Streamable HTTP is the Rust target; legacy SSE can be considered only as a compatibility mode if an older client requires it.
- Do not hardcode tokenization or masking service URLs. Java currently has a hardcoded tokenization endpoint in this path; the Rust port should make that configurable when masking/tokenization is added.
Java Behavior To Map
The Java module has three main pieces:
- McpConfig loads mcp-router.yml with enabled, path, and tools.
- McpHandler owns the HTTP MCP endpoint and JSON-RPC protocol handling.
- McpToolRegistry stores configured tool implementations by name.
Java configuration:
enabled: ${mcp-router.enabled:true}
path: ${mcp-router.path:/mcp}
maxSessions: ${mcp-router.maxSessions:10000}
maxSessionsPerClient: ${mcp-router.maxSessionsPerClient:100}
tools: ${mcp-router.tools:}
Each tool supports these fields:
- name: weather
  description: Get weather information
  protocol: http
  serviceId: com.networknt.weather-1.0.0
  envTag: dev
  targetHost: http://localhost:7081
  path: /weather
  method: GET
  endpoint: /weather@get
  apiType: http
  inputSchema:
    type: object
    properties:
      city:
        type: string
  toolMetadata: {}
The Java handler currently supports:
- GET /mcp as an SSE compatibility endpoint. It creates a session id and emits an endpoint event pointing to /mcp?sessionId=....
- POST /mcp for JSON-RPC messages.
- initialize, returning protocol version, tool capabilities, and server info.
- notifications/initialized, returning no response.
- tools/list, optionally filtered by params.query or params.intent.
- tools/call, forwarding arguments to the configured tool.
The Java tool execution supports two target types:
- HTTP tools call a configured HTTP endpoint. GET maps arguments to query parameters. Other methods send the arguments as a JSON body.
- MCP proxy tools call a backend MCP server by sending a JSON-RPC tools/call request to the configured backend path.
Java also includes rule-based access checks, response filtering, masking, and tokenization around tool calls. The Rust version now implements access checks, response filtering, and schema-driven request masking without hardcoded service endpoints. Tokenization is intentionally deferred.
The Rust implementation should map this behavior to MCP Streamable HTTP rather
than keeping Java's legacy HTTP+SSE transport as the default. Streamable HTTP
uses one MCP endpoint path. Clients send JSON-RPC messages with POST /mcp;
the server can return either a single application/json response or
text/event-stream from that same POST when streaming is needed. Clients may
also issue GET /mcp to open an optional server-to-client SSE stream on the
same endpoint.
Resolved Decisions
- Use Streamable HTTP so only one public MCP endpoint, normally /mcp, is exposed.
- Defer the tokenization client design until light-tokenization is migrated into portal-service/apps/portal-service and its protocol is selected.
- Reuse the light-4j access-control.yml compatibility contract for MCP, REST, and JSON-RPC authorization.
- Do not add configured per-tool outbound headers. Backend tool calls should pass through the headers received from the agent, subject only to headers that the HTTP client must regenerate for a new outbound request and MCP session headers that the gateway must map or regenerate.
Rust Architecture
Add the MCP router to light-pingora because it is a request/response gateway
handler. light-gateway should wire it into the existing handler descriptor
table and runtime state.
Proposed modules:
frameworks/light-pingora/src/access_control.rs
frameworks/light-pingora/src/mcp.rs
Primary types:
pub struct McpRouterConfig {
    pub enabled: bool,
    pub path: String,
    pub tools: Vec<McpToolConfig>,
}

pub struct McpToolConfig {
    pub name: String,
    pub description: String,
    pub protocol: Option<String>,
    pub service_id: Option<String>,
    pub env_tag: Option<String>,
    pub target_host: Option<String>,
    pub path: String,
    pub method: HttpMethod,
    pub endpoint: Option<String>,
    pub api_type: McpToolType,
    pub input_schema: serde_json::Value,
    pub tool_metadata: serde_json::Value,
}

pub struct McpRouterRuntime {
    pub config: ArcSwap<McpRouterConfig>,
    pub client: reqwest::Client,
    pub registry_client: Option<Arc<PortalRegistryClient>>,
}
The exact field names should follow the existing light-fabric serde naming style while accepting the Java config names through aliases:
- serviceId
- envTag
- targetHost
- apiType
- inputSchema
- toolMetadata
mcp-router.yml should be the primary Rust file name, but the loader should
also accept mcp-router.yaml for Java compatibility.
Tool Registration
The router does not need global static registration. Build an immutable tool map
when mcp-router.yml is loaded:
McpRouterConfig -> BTreeMap<String, McpToolConfig> -> Arc<McpRouterState>
On reload, build a new state and atomically swap the Arc. In-flight requests
continue with the old state.
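The build-then-swap model above can be sketched with the standard library alone. This is an illustration under stated assumptions: `ToolEntry`, `RouterState`, and `RouterHandle` are hypothetical names, and `RwLock<Arc<_>>` stands in for the `ArcSwap` used by the proposed runtime type so the example needs no external crates.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, RwLock};

/// Trimmed-down tool entry; the real tool config carries more fields.
#[derive(Clone, Debug, PartialEq)]
pub struct ToolEntry {
    pub name: String,
    pub path: String,
}

/// Immutable snapshot shared by requests until the next reload.
#[derive(Debug, Default)]
pub struct RouterState {
    pub tools: BTreeMap<String, ToolEntry>,
}

/// Build a snapshot, rejecting duplicate tool names.
pub fn build_state(tools: Vec<ToolEntry>) -> Result<RouterState, String> {
    let mut map = BTreeMap::new();
    for tool in tools {
        if map.contains_key(&tool.name) {
            return Err(format!("duplicate tool name: {}", tool.name));
        }
        map.insert(tool.name.clone(), tool);
    }
    Ok(RouterState { tools: map })
}

/// Holder for the current snapshot (std stand-in for ArcSwap).
pub struct RouterHandle {
    current: RwLock<Arc<RouterState>>,
}

impl RouterHandle {
    pub fn new(state: RouterState) -> Self {
        Self { current: RwLock::new(Arc::new(state)) }
    }

    /// Clone the Arc once per request; the request keeps this snapshot.
    pub fn snapshot(&self) -> Arc<RouterState> {
        self.current.read().unwrap().clone()
    }

    /// Swap in a new snapshot only when the new tool list validates.
    pub fn reload(&self, tools: Vec<ToolEntry>) -> Result<(), String> {
        let next = Arc::new(build_state(tools)?);
        *self.current.write().unwrap() = next;
        Ok(())
    }
}
```

A request that took its snapshot before a reload keeps seeing the old tool map, and a reload that fails validation leaves the previous snapshot in place.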
This is simpler than Java's static McpToolRegistry and avoids Rust plugin
complexity. It also matches the light-fabric product model: all handlers can be
linked into one binary, while the config server decides which handlers and tools
are active for a product.
Request Flow
The mcp handler should participate in the normal handler chain:
request
-> correlation
-> metrics
-> cors
-> security or unified security
-> limit
-> mcp
-> proxy or route handler, only if mcp did not consume the request
response
-> header
-> metrics
-> access log
When the request path matches mcp-router.path:
- POST parses a JSON-RPC message. Requests return either application/json for a single response or text/event-stream for a streamed response on the same endpoint. Notifications and JSON-RPC responses sent by the client return 202 Accepted with no body when accepted.
- GET with Accept: text/event-stream may open a server-to-client SSE stream on the same endpoint. If the gateway has no server-initiated messages to stream, it should return 405 Method Not Allowed.
- DELETE should terminate the gateway session and any mapped backend MCP sessions. Until session termination is implemented, it can return 405 Method Not Allowed.
- Other methods return 405 Method Not Allowed.
When the path does not match, the handler continues to the next handler in the configured chain.
The handler must be safe to include in shared chains. If mcp-router.enabled is
false, or the mcp handler is not in handler.yml, no MCP route is exposed.
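The routing decision described above can be summarized in a small sketch. The `McpDispatch` enum and both functions are hypothetical; the sketch assumes the path has already had its query string removed, and that the GET stream has nothing to send, so GET falls into the 405 branch.

```rust
/// What the mcp handler decides to do with an inbound request.
#[derive(Debug, PartialEq)]
pub enum McpDispatch {
    /// Router disabled or path mismatch: continue with the next handler.
    PassThrough,
    /// POST on the MCP path: parse and handle a JSON-RPC message.
    HandleJsonRpc,
    /// DELETE on the MCP path: end the gateway session and mapped backend sessions.
    TerminateSession,
    /// Answer immediately with this HTTP status.
    Respond(u16),
}

pub fn dispatch(enabled: bool, router_path: &str, method: &str, path: &str) -> McpDispatch {
    if !enabled || path != router_path {
        return McpDispatch::PassThrough;
    }
    match method {
        "POST" => McpDispatch::HandleJsonRpc,
        "DELETE" => McpDispatch::TerminateSession,
        // GET has no server-initiated messages in this sketch, so it is
        // treated like any other unsupported method.
        _ => McpDispatch::Respond(405),
    }
}

/// Status for an accepted POST body: JSON-RPC requests get a response body,
/// while notifications and client-sent responses are acknowledged with 202.
pub fn post_status(is_request: bool) -> u16 {
    if is_request { 200 } else { 202 }
}
```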
JSON-RPC Handling
Supported methods:
initialize
notifications/initialized
tools/list
tools/call
initialize response:
{
"jsonrpc": "2.0",
"id": 1,
"result": {
"protocolVersion": "2024-11-05",
"capabilities": {
"tools": {
"listChanged": true
}
},
"serverInfo": {
"name": "light-gateway-mcp",
"version": "1.0.0"
}
}
}
tools/list returns configured tools with name, description, and
inputSchema. It should preserve Java's simple filtering:
- params.query matches tool name or description.
- params.intent matches tool name or description.
tools/call validates params.name, finds the tool, validates or forwards
params.arguments, and returns either:
{
"content": [
{
"type": "text",
"text": "..."
}
]
}
or the structured result returned by the backend MCP server.
JSON-RPC errors should use the same codes as Java where practical:
-32700 parse error
-32601 method or tool not found
-32602 invalid params
-32000 tool execution failed
-32001 access denied
Rust improvement: malformed transport payloads should return a clear HTTP 400
with a JSON-RPC error body instead of a generic HTTP 500.
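One way to keep the error codes above in a single place is a small enum. This is a sketch: the `McpError` type, its message strings, and the integer-only `id` are assumptions for illustration (JSON-RPC ids may also be strings), and the body is hand-formatted only to keep the example free of external crates.

```rust
/// JSON-RPC error categories used by the router, with the codes listed above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum McpError {
    Parse,
    NotFound,
    InvalidParams,
    ToolFailed,
    AccessDenied,
}

impl McpError {
    pub fn code(self) -> i32 {
        match self {
            McpError::Parse => -32700,
            McpError::NotFound => -32601,
            McpError::InvalidParams => -32602,
            McpError::ToolFailed => -32000,
            McpError::AccessDenied => -32001,
        }
    }

    /// Illustrative messages; the real wording is not specified here.
    pub fn message(self) -> &'static str {
        match self {
            McpError::Parse => "parse error",
            McpError::NotFound => "method or tool not found",
            McpError::InvalidParams => "invalid params",
            McpError::ToolFailed => "tool execution failed",
            McpError::AccessDenied => "access denied",
        }
    }

    /// Render a JSON-RPC error body; `id` is null when the request id is unknown.
    pub fn body(self, id: Option<i64>) -> String {
        let id = id.map_or("null".to_string(), |v| v.to_string());
        format!(
            r#"{{"jsonrpc":"2.0","id":{},"error":{{"code":{},"message":"{}"}}}}"#,
            id,
            self.code(),
            self.message()
        )
    }
}
```

For a malformed payload the handler would pair `McpError::Parse.body(None)` with HTTP 400.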
For Streamable HTTP:
- Clients must send each JSON-RPC message as a separate POST to the MCP endpoint.
- Clients should send Accept: application/json, text/event-stream.
- The router should negotiate and honor MCP-Protocol-Version.
- The router terminates the client-facing MCP session. initialize responses should include a gateway-owned Mcp-Session-Id, and later client requests should be validated against that gateway session.
MCP Session Management
The MCP router should use a facade model. To the agent, light-gateway is the
MCP server. To upstream MCP targets, light-gateway is an MCP client. This
keeps gateway security, access-control policy, masking, response filtering, and
tool aggregation in one place while still respecting upstream MCP session
state.
There are two distinct session scopes:
- Frontend session: the session between the MCP client and light-gateway.
- Backend session: one upstream MCP server session owned by the gateway for a specific frontend session and backend target.
The frontend session is created during client initialize:
- The client sends initialize to mcp-router.path.
- The gateway returns the MCP capabilities it exposes and a gateway-generated Mcp-Session-Id.
- The gateway stores session state keyed by that id. The state should include the negotiated protocol version, client info, security principal or relevant auth context, and any backend MCP sessions created for this client session.
- Later client requests must include the gateway session id. Unknown or expired session ids should fail before tool execution.
- A client DELETE request, explicit expiry, or gateway shutdown should close all backend sessions associated with the frontend session.
The in-memory gateway store uses a 30-minute idle timeout, a configurable
maximum frontend session count, and a configurable per-client frontend session
count. Expired sessions are purged lazily during later MCP requests, and any
mapped backend MCP sessions are closed during that purge. If the store is still
full after lazy purge, or the client already owns the maximum allowed sessions,
new initialize requests fail without issuing another session id.
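A minimal sketch of the store behavior described above: idle timeout, global and per-client caps, and lazy purge. The `SessionStore` type and its methods are hypothetical, the session id is supplied by the caller, and the clock is passed in as an argument so the expiry logic can be exercised without waiting.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// 30-minute idle timeout, as described above.
pub const IDLE_TIMEOUT: Duration = Duration::from_secs(30 * 60);

struct Session {
    client_key: String,
    last_seen: Instant,
}

pub struct SessionStore {
    sessions: HashMap<String, Session>,
    max_sessions: usize,
    max_per_client: usize,
}

impl SessionStore {
    pub fn new(max_sessions: usize, max_per_client: usize) -> Self {
        Self { sessions: HashMap::new(), max_sessions, max_per_client }
    }

    /// Remove idle sessions and return their ids so the caller can close
    /// any backend MCP sessions mapped to them.
    pub fn purge_expired(&mut self, now: Instant) -> Vec<String> {
        let expired: Vec<String> = self
            .sessions
            .iter()
            .filter(|(_, s)| now.saturating_duration_since(s.last_seen) >= IDLE_TIMEOUT)
            .map(|(id, _)| id.clone())
            .collect();
        for id in &expired {
            self.sessions.remove(id);
        }
        expired
    }

    /// Register a new frontend session, purging lazily before checking caps.
    pub fn create(&mut self, id: String, client_key: &str, now: Instant) -> Result<(), &'static str> {
        // A real implementation would close backend sessions for these ids.
        let _purged = self.purge_expired(now);
        if self.sessions.len() >= self.max_sessions {
            return Err("session store is full");
        }
        let owned = self.sessions.values().filter(|s| s.client_key == client_key).count();
        if owned >= self.max_per_client {
            return Err("per-client session limit reached");
        }
        self.sessions.insert(id, Session { client_key: client_key.to_string(), last_seen: now });
        Ok(())
    }

    /// Validate a later request and refresh the idle timer.
    pub fn touch(&mut self, id: &str, now: Instant) -> bool {
        let _purged = self.purge_expired(now);
        match self.sessions.get_mut(id) {
            Some(session) => {
                session.last_seen = now;
                true
            }
            None => false,
        }
    }
}
```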
The per-client key is derived from the authenticated principal when available,
preferring client_id, then user_id, email, and host. If no security
principal is available, the key falls back to MCP clientInfo.name and
clientInfo.version from the initialize request.
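The key derivation order can be sketched as below. The `Principal` and `ClientInfo` structs and the `label:value` key format are assumptions for illustration; the sketch also assumes that a principal with none of the four fields set falls back to the client info, which the text above does not state explicitly.

```rust
/// Hypothetical view of the authenticated principal.
#[derive(Default)]
pub struct Principal {
    pub client_id: Option<String>,
    pub user_id: Option<String>,
    pub email: Option<String>,
    pub host: Option<String>,
}

/// clientInfo from the MCP initialize request.
pub struct ClientInfo {
    pub name: String,
    pub version: String,
}

/// Prefer client_id, then user_id, email, and host; otherwise use clientInfo.
pub fn client_key(principal: Option<&Principal>, info: &ClientInfo) -> String {
    if let Some(p) = principal {
        let candidates = [
            ("client_id", &p.client_id),
            ("user_id", &p.user_id),
            ("email", &p.email),
            ("host", &p.host),
        ];
        for (label, value) in candidates {
            if let Some(v) = value {
                return format!("{}:{}", label, v);
            }
        }
    }
    format!("client:{}/{}", info.name, info.version)
}
```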
For a single gateway process, the session store can start in memory. In a
multi-pod deployment, the store should be external, such as Redis, or ingress
must provide sticky routing for all requests that carry the same
Mcp-Session-Id.
Backend handling depends on the tool type.
For apiType: http, the backend is a normal stateless API:
- No backend MCP session is created.
- The gateway translates tools/call arguments into a normal HTTP request.
- GET tools serialize arguments into the query string; body-capable methods send JSON.
- The gateway wraps the HTTP response into an MCP tools/call result.
- User-specific auth, tenant, correlation, and trace headers come from the frontend session or inbound request and are applied to the outbound HTTP call as normal gateway headers.
For apiType: mcp, the backend is a stateful MCP server:
- The gateway lazily initializes the backend session the first time a frontend session calls a tool for that backend target. If future dynamic tool discovery depends on the backend, this initialization can happen before tools/list instead.
- The gateway sends initialize to the backend MCP endpoint as an MCP client. It should use the client-requested protocol version when supported and pass only the capabilities it needs upstream.
- If the backend returns Mcp-Session-Id, the gateway stores it in a mapping keyed by the gateway session id and backend target identity.
- The gateway sends notifications/initialized to the backend when the backend session is established.
- For later backend calls, the gateway sends the backend session id to that backend. It must not forward the frontend gateway session id as if it were a backend session id.
- The gateway still performs access checks before calling the backend and response filtering after the backend response.
- When the frontend session ends, the gateway should terminate each mapped backend MCP session to avoid leaking backend resources.
The backend target identity used in the session map should be stable across
requests. It should include the resolved route information that distinguishes
one backend MCP endpoint from another, such as targetHost or serviceId,
envTag, protocol, and tool path.
When the router aggregates tools from both MCP servers and normal APIs, the
client still sees one gateway MCP session and one tools/list response. The
gateway registry decides how each tools/call is executed:
| Feature | MCP server backend | Normal API backend |
|---|---|---|
| Config type | apiType: mcp | apiType: http or omitted |
| Backend session | Yes, mapped from gateway session to backend target | No |
| Initialization | Gateway initializes backend as an MCP client | No upstream initialization |
| Message handling | JSON-RPC tools/call through backend MCP session | Translate JSON-RPC arguments to HTTP |
| Backend session header | Send backend Mcp-Session-Id only to that backend | Do not send MCP session state |
| Tear-down | Close backend session on client session end | Nothing backend-specific |
The configured tools/list remains the gateway's public contract. A future
dynamic-discovery mode may call backend MCP tools/list and merge those tools
with configured HTTP tools, but that must still preserve the gateway's policy
surface and avoid exposing backend tools that are not authorized for the
product.
HTTP Tool Execution
For apiType: http or missing apiType:
- Resolve the target base URL.
- Build the target URL from base URL plus tool path.
- For GET, serialize arguments with url::form_urlencoded.
- For POST, PUT, and PATCH, send arguments as JSON.
- Pass through the inbound agent headers to the backend tool call so caller identity, authorization, correlation, tenant, locale, and tracing context are preserved.
- Let the HTTP client regenerate transport-specific headers for the new outbound request, such as Host, Content-Length, Transfer-Encoding, and connection management headers.
- Treat 2xx as success.
- Parse JSON responses as structured MCP results.
- Wrap non-JSON responses as MCP text content.
- Return an empty 2xx response as { "result": "success" }.
Target resolution:
- Prefer targetHost for direct calls.
- Otherwise use serviceId, protocol, and envTag through the existing portal registry discovery client.
- If neither is available, return a tool execution error.
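The resolution order above can be sketched as a single function. `ToolTarget` and `resolve_url` are hypothetical names, and the `discover` closure is a stand-in for the portal registry discovery client, whose real signature is not shown in this document.

```rust
/// The target-related subset of a tool definition.
pub struct ToolTarget {
    pub target_host: Option<String>,
    pub service_id: Option<String>,
    pub protocol: Option<String>,
    pub env_tag: Option<String>,
    pub path: String,
}

/// Resolve the full URL for a tool call.
/// `discover(service_id, protocol, env_tag)` is a placeholder for discovery.
pub fn resolve_url<F>(tool: &ToolTarget, discover: F) -> Result<String, String>
where
    F: Fn(&str, Option<&str>, Option<&str>) -> Option<String>,
{
    let base = if let Some(host) = &tool.target_host {
        // Direct calls win when targetHost is configured.
        host.clone()
    } else if let Some(id) = &tool.service_id {
        discover(id, tool.protocol.as_deref(), tool.env_tag.as_deref())
            .ok_or_else(|| format!("no instance discovered for {}", id))?
    } else {
        return Err("tool has neither targetHost nor serviceId".to_string());
    };
    // Join base and path with exactly one slash between them.
    Ok(format!(
        "{}/{}",
        base.trim_end_matches('/'),
        tool.path.trim_start_matches('/')
    ))
}
```

An `Err` from this function would surface to the caller as a tool execution error (-32000).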
MCP Proxy Tool Execution
For apiType: mcp:
- Resolve the target base URL the same way as HTTP tools.
- Ensure a backend MCP session exists for the current gateway session and backend target. If none exists, initialize the backend MCP endpoint and store the returned backend Mcp-Session-Id.
- POST to the configured backend path.
- Pass through the inbound agent headers to the backend MCP server, with transport-specific headers regenerated for the new outbound request. Replace any frontend gateway Mcp-Session-Id with the mapped backend session id for this backend target.
- Send a backend JSON-RPC request:
{
"jsonrpc": "2.0",
"id": 1,
"method": "tools/call",
"params": {
"name": "tool-name",
"arguments": {}
}
}
- If the backend returns error, map it to -32000.
- If the backend returns result, return it to the caller.
- On frontend session termination or expiry, close the backend MCP session.
This preserves the Java McpProxyTool behavior while using Rust's typed
JSON-RPC models where possible and adds the MCP session mapping required by
stateful backend MCP servers.
Configuration Loading
The router should be loaded as a normal light-fabric module:
config-server product values
-> mcp-router.yml placeholders
-> light-gateway startup
-> light-pingora mcp router state
Example product values:
mcp-router.enabled: true
mcp-router.path: /mcp
mcp-router.tools:
  - name: get_pet
    description: Get a pet by id.
    targetHost: http://petstore:8080
    path: /v1/pets
    method: GET
    inputSchema:
      type: object
      properties:
        id:
          type: string
Example handler.yml path wiring:
handlers:
  - correlation
  - metrics
  - cors
  - jwt
  - mcp
  - proxy
chains:
  default:
    - correlation
    - metrics
    - cors
    - jwt
    - proxy
  mcp:
    - correlation
    - metrics
    - cors
    - jwt
    - mcp
paths:
  - path: /mcp
    method: POST
    exec:
      - mcp
  - path: /mcp
    method: GET
    exec:
      - mcp
defaultHandlers:
  - proxy
The exact chain names are product choices. The important point is that /mcp
can have a narrow chain while normal API proxy traffic keeps the normal proxy
chain.
Module Registry
The MCP router should register its configuration with the module registry:
- module name: mcp-router
- config files: mcp-router.yml, with mcp-router.yaml as compatibility fallback
- enabled status
- configured path
- tool count
- tool names
The module registry should mask any future secret fields in toolMetadata,
headers, or credential configuration.
Reload behavior:
- Reload mcp-router.yml.
- Validate duplicate tool names, missing paths, unsupported methods, and target resolution fields.
- Build a new immutable router state.
- Swap the runtime state atomically.
- Report the updated module registry status.
Security And Policy
The first layer of protection should be the handler chain. Products can place
JWT, API key, basic auth, unified security, CORS, rate limit, and header
handlers before or after mcp as needed.
Because MCP Streamable HTTP is browser-reachable, the mcp handler must also
validate the Origin header according to the configured CORS or security
policy. Invalid origins should fail before tool execution.
Fine-grained tool authorization should be added after the base router:
- Reuse the existing light-4j access-control.yml model as the compatibility contract. access-control.yml controls enabled, accessRuleLogic, defaultDeny, and skipPathPrefixes; rule.yml provides ruleBodies and endpointRules.
- Make the access policy endpoint stable. Java uses the tool endpoint field, such as /weather@get; when omitted, Rust derives {path}@{method}.
- Include correlation id, caller claims, request headers, tool name, endpoint, and arguments in the policy input.
- Support default deny when access control is enabled and no req-acc rule matches.
- Provide built-in Rust actions compatible with the Java class names used by current config: RoleBasedAccessControlAction, ResponseColumnFilterAction, and ResponseRowFilterAction.
Response filtering should be implemented as a second policy stage:
- Apply policy after backend execution and before JSON-RPC response emission.
- Support both structuredContent and single text content responses, matching Java's behavior.
- Match endpoint rules exactly first, then Java-style path templates and parent path entries such as /v1/accounts@get for /v1/accounts/123@get.
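The endpoint derivation and rule lookup order can be sketched as follows. This is an illustration only: the function names are hypothetical, the lowercase method in the derived endpoint follows the /weather@get example, and two details are assumptions not stated above, namely that templates are tried before parent paths and that the longest parent path wins.

```rust
/// Use the configured endpoint, or derive "{path}@{method}" with a lowercase method.
pub fn derive_endpoint(endpoint: Option<&str>, path: &str, method: &str) -> String {
    match endpoint {
        Some(e) => e.to_string(),
        None => format!("{}@{}", path, method.to_ascii_lowercase()),
    }
}

/// Segment-wise match where "{name}" accepts any non-empty segment.
fn template_matches(template: &str, path: &str) -> bool {
    let ts: Vec<&str> = template.split('/').collect();
    let ps: Vec<&str> = path.split('/').collect();
    ts.len() == ps.len()
        && ts.iter().zip(ps.iter()).all(|(t, p)| {
            let is_var = t.starts_with('{') && t.ends_with('}');
            (is_var && !p.is_empty()) || t == p
        })
}

/// True when `parent` is a strict ancestor of `path` on a segment boundary.
fn is_parent(parent: &str, path: &str) -> bool {
    path.strip_prefix(parent).map_or(false, |rest| rest.starts_with('/'))
}

/// Find the rule key for an endpoint: exact, then template, then parent path.
pub fn find_rule<'a>(rules: &[&'a str], endpoint: &str) -> Option<&'a str> {
    let (path, method) = endpoint.rsplit_once('@')?;
    if let Some(exact) = rules.iter().copied().find(|r| *r == endpoint) {
        return Some(exact);
    }
    let templated = rules.iter().copied().find(|r| match r.rsplit_once('@') {
        Some((rule_path, rule_method)) => rule_method == method && template_matches(rule_path, path),
        None => false,
    });
    if templated.is_some() {
        return templated;
    }
    rules
        .iter()
        .copied()
        .filter(|r| match r.rsplit_once('@') {
            Some((rule_path, rule_method)) => rule_method == method && is_parent(rule_path, path),
            None => false,
        })
        .max_by_key(|r| r.len())
}
```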
Masking and tokenization handling:
- Preserve Java schema extensions: x-mask, x-mask-pattern, and x-tokenize.
- Parse these extensions from inputSchema as serde_json::Value.
- Apply schema-driven x-mask request masking before backend tool execution.
- Keep x-tokenize as a future extension point. Do not call a tokenization service until the portal-service tokenization protocol is finalized.
- Do not hardcode a tokenization service URL. The tokenization client should be designed after light-tokenization is migrated into portal-service/apps/portal-service, whether the final protocol is JSON-RPC, MCP, or gRPC.
Per-tool outbound headers would mean headers that the MCP router adds from tool
configuration when it calls a specific backend target, for example a configured
Authorization, X-API-Key, tenant routing header, or vendor-specific version
header. We do not need that feature. The required behavior is header
pass-through: backend tool calls receive the headers that came from the agent,
while the HTTP client regenerates only the transport-specific headers required
for a valid outbound request. MCP session headers are not normal pass-through
headers. The gateway owns the frontend Mcp-Session-Id and maps it to backend
session ids when an upstream MCP server is involved.
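A sketch of the pass-through rule described above. The function name and the tuple representation of headers are hypothetical, and the list of regenerated headers is limited to the ones named in this document plus Connection as a representative connection-management header.

```rust
/// Headers the outbound HTTP client is expected to regenerate (lowercase).
const REGENERATED: [&str; 4] = ["host", "content-length", "transfer-encoding", "connection"];
/// The frontend session header is owned by the gateway and never forwarded as-is.
const SESSION_HEADER: &str = "mcp-session-id";

/// Build outbound headers from the inbound agent headers.
/// `backend_session` is set only for apiType: mcp targets with a mapped session.
pub fn outbound_headers(
    inbound: &[(String, String)],
    backend_session: Option<&str>,
) -> Vec<(String, String)> {
    let mut out: Vec<(String, String)> = inbound
        .iter()
        .filter(|(name, _)| {
            let lower = name.to_ascii_lowercase();
            lower != SESSION_HEADER && !REGENERATED.contains(&lower.as_str())
        })
        .cloned()
        .collect();
    if let Some(id) = backend_session {
        // Replace the frontend session id with the mapped backend session id.
        out.push(("Mcp-Session-Id".to_string(), id.to_string()));
    }
    out
}
```

For apiType: http targets the caller passes `None`, so no MCP session state reaches the backend.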
Relationship To Existing Runtime MCP
light-runtime already has RuntimeMcpHandler for runtime management tools.
That should remain internal and registry-facing.
The gateway MCP router should not automatically expose runtime management tools. If a product needs that bridge later, add an explicit configured tool type, for example:
apiType: runtime
That keeps public agent-facing tools separate from management tools and avoids accidentally exposing cache, module, or service operations through a public gateway route.
Phased Implementation
Phase 1: Core Router
- Add mcp-router.yml config parsing in light-pingora.
- Accept tools as either a YAML array or a JSON string to match Java config server injection behavior.
- Add immutable tool map validation.
- Implement the base Streamable HTTP single endpoint: unary POST /mcp, Accept validation for application/json and text/event-stream, 202 Accepted for accepted notifications, and 405 for unsupported methods.
- Implement JSON-RPC initialize, notifications/initialized, tools/list, and tools/call.
- Implement direct targetHost HTTP tools.
- Pass through agent request headers to direct HTTP and backend MCP tool calls, except MCP session headers that the gateway must map separately.
- Wire the existing mcp handler id in light-gateway.
- Register module status and config with the module registry.
- Add parser and handler tests.
Status: implemented.
Phase 2: Discovery And MCP Proxy
- Resolve serviceId, protocol, and envTag through the existing portal registry discovery client.
- Implement apiType: mcp backend proxy tools.
- Add reload support with atomic state swap.
- Add tests with fake discovery and backend MCP responses.
Status: implemented.
Phase 3: Streamable HTTP Streaming
- Add streamed text/event-stream responses from POST /mcp for long-running tool calls or server-to-client messages related to the originating request.
- Add optional GET /mcp server-to-client streams on the same endpoint.
- Track frontend sessions when Mcp-Session-Id is issued. Return 405 for standalone GET streams until server-initiated messages are implemented.
- Add tests for content negotiation, 202 Accepted notifications, streamed POST responses, and optional GET behavior.
Status: implemented.
Phase 4: Policy, Filtering, Masking
- Add tool-level authorization using the access-control.yml compatibility contract.
- Add response filtering for structured and text MCP results.
- Add schema-driven request masking.
- Add MCP tool-call log fields for tool name, endpoint, duration, status, and policy outcome.
Status: implemented for access control, response filtering, and request masking. Tokenization is deferred until the portal-service tokenization client is designed.
Phase 5: Stateful MCP Backend Sessions
- Add a gateway session store keyed by frontend Mcp-Session-Id.
- Validate later client requests against the gateway session.
- For apiType: mcp, maintain backend session mappings keyed by gateway session id and backend target identity.
- Lazily initialize backend MCP sessions by sending backend initialize, capturing backend Mcp-Session-Id, and sending notifications/initialized.
- Replace the frontend session id with the mapped backend session id on upstream MCP calls.
- Terminate mapped backend MCP sessions when the frontend session is deleted, expires, or the gateway shuts down.
- Add tests for frontend session validation, backend session creation, backend session reuse, and backend session termination.
Status: implemented for the in-memory frontend session store, configurable
global and per-client session caps, 30-minute lazy idle expiry, lazy backend
initialization, backend Mcp-Session-Id mapping, backend session reuse, and
explicit DELETE teardown. Shutdown cleanup, external session storage, and
multi-backend isolation tests remain future hardening for multi-pod
deployments.
Testing Strategy
- Config tests:
  - empty config
  - disabled config
  - duplicate tool names
  - tools as YAML array
  - tools as JSON string
  - inputSchema as object and string
- JSON-RPC tests:
  - initialize
  - notifications/initialized
  - notification returns 202 Accepted
  - tools/list
  - tools/list with query and intent
  - missing method
  - invalid params
  - malformed JSON
- Streamable HTTP tests:
  - single /mcp endpoint handles POST
  - POST validates Accept
  - unsupported methods return 405
  - optional GET stream returns 405 until enabled
- Tool execution tests:
  - direct GET with encoded arguments
  - direct POST with JSON arguments
  - non-JSON backend response
  - empty 2xx backend response
  - non-2xx backend response
  - agent headers are forwarded to backend tool calls
  - discovered service target
  - backend MCP proxy success and error
- Handler chain tests:
  - /mcp consumed by mcp
  - non-MCP path continues to the next handler
  - disabled router does not expose /mcp
- Reload tests:
  - tool added
  - tool removed
  - invalid reload keeps the prior good state
Remaining Decisions
- Confirm whether Phase 1 includes only unary Streamable HTTP POST or also streamed POST responses.
- Decide the tokenization client protocol after light-tokenization is migrated into portal-service/apps/portal-service.
- Map the Java access-control.yml schema to Rust policy execution and define how it will be shared by REST, JSON-RPC, and MCP handlers.
LLM Gateway
Status
Design proposal. The current light-agent runtime selects one active model
provider from model-provider.yml. That is acceptable if the selected provider
is an LLM gateway endpoint. The agent does not need to know how many upstream
providers the gateway can reach.
Purpose
The LLM gateway is a centralized model access layer for agents and services. Instead of each agent carrying credentials, endpoint details, routing rules, and provider fallback logic, each agent calls one gateway endpoint. The gateway then routes the request to OpenAI, Azure OpenAI, Bedrock, Anthropic, Gemini, Ollama, Codex, or another provider based on agent configuration, model policy, prompt characteristics, capability requirements, health, cost, and compliance constraints.
This keeps light-agent simple and matches the current bootstrap model:
- startup.yml is local.
- Runtime configuration is fetched from config-server.
- The agent loads one model provider after bootstrap.
- That provider can be an OpenAI-compatible LLM gateway.
- The gateway owns multi-provider fan-out.
Goals
- Keep agents configured with one active model endpoint.
- Support many upstream LLM providers at the gateway at the same time.
- Allow provider routing by agent, service id, environment, prompt intent, requested capability, logical model name, cost, latency, region, and health.
- Keep provider credentials out of agent pods and agent config.
- Preserve the existing model-provider abstraction for direct provider access and reuse it inside the gateway where useful.
- Expose a provider-compatible HTTP API so existing agents can use the gateway without a new SDK.
- Support normal light-fabric bootstrap, config-server overrides, module registry visibility, config reload, controller registration, and audit.
- Make gateway decisions explainable enough for operations and compliance.
Non-Goals
- Do not make light-agent load many providers directly for this use case. Multi-provider routing belongs in the gateway.
- Do not require every agent to understand provider-specific fields such as AWS region, Azure deployment name, or Anthropic max token settings.
- Do not expose upstream provider secrets through tools/list, diagnostics, or agent-visible configuration.
- Do not depend on an LLM classification call for every routing decision. The gateway should support deterministic routing first and optional classifier routing later.
- Do not merge the LLM gateway with the MCP router. The LLM gateway routes model calls; the MCP router routes tool calls.
Relationship To Existing Components
Light-Agent
light-agent should continue to select one model provider after runtime
bootstrap. For an LLM gateway deployment, the selected provider is the gateway:
model-provider.provider: compatible
model-provider.model: agent-default
compatible.name: llm-gateway
compatible.baseUrl: https://llm-gateway.light-gateway:8443/v1
compatible.apiKey: ${secret.llmGatewayApiKey}
The model-provider.model value becomes a logical model name. It does not need
to be an upstream provider model id. Examples:
model-provider.model: agent-default
model-provider.model: fast
model-provider.model: reasoning
model-provider.model: coding
model-provider.model: pii-safe
The gateway maps the logical model to a physical provider and model.
Light-Gateway
The LLM gateway should be implemented as a light-gateway product capability,
activated by handler/config. This keeps LLM egress under the same gateway
family that already handles MCP routing, auth, rule execution, metrics,
service discovery, bootstrap, and reload.
The first implementation can expose an OpenAI-compatible endpoint:
POST /v1/chat/completions
That is enough for CompatibleProvider and many external clients. Later phases
can add:
POST /v1/responses
GET /v1/models
Model Provider Crate
The gateway can reuse crates/model-provider for upstream calls. The crate
already contains concrete providers and meta-providers:
- OpenAI
- Azure OpenAI
- Anthropic
- Bedrock
- Codex
- Compatible
- Gemini
- GLM
- Ollama
- OpenRouter
- Telnyx
- Copilot
- CLI providers where operationally appropriate
- RouterProvider
- ReliableProvider
For the gateway, direct concrete providers are upstream adapters. Routing and fallback should be controlled by gateway config and policy, not by each agent.
Request Flow
agent
-> LLM provider trait
-> CompatibleProvider
-> light-gateway /v1/chat/completions
-> auth, correlation, policy, rate limit
-> LLM route decision
-> upstream provider adapter
-> upstream LLM provider
-> normalized response
-> audit, metrics, token usage
-> agent
The agent sees one model provider. The gateway sees the full routing context.
Routing Inputs
The gateway should make routing decisions from a combination of trusted inputs:
- Authenticated caller identity from JWT, mTLS, or gateway-authenticated service registration.
- Agent metadata such as host id, agent definition id, service id, environment, tenant, and account.
- Logical model name from the request body.
- Request capabilities: tool calling, vision, JSON mode, long context, reasoning, streaming, prompt caching.
- Prompt features: intent keywords, size, language, sensitivity markers, coding vs support vs workflow execution.
- Configured policy: allowed providers, blocked providers, region constraints, cost tier, data residency, fallback chain.
- Runtime health: provider availability, error rate, latency, quota pressure.
If metadata is supplied as headers, the gateway should only trust those headers from authenticated internal clients. Otherwise it should derive identity from the token or connection.
Suggested internal headers:
X-Light-Request-Id
X-Light-Service-Id
X-Light-Env-Tag
X-Light-Agent-Host-Id
X-Light-Agent-Definition-Id
X-Light-Tenant-Id
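The header trust rule above can be illustrated with a small sketch. It assumes header names have already been lowercased and that an earlier authentication step decided whether the caller is an internal client; `CallerContext` and `effective_service_id` are hypothetical names, not part of the existing crates.

```rust
use std::collections::HashMap;

/// Result of the authentication step (hypothetical shape).
pub struct CallerContext<'a> {
    /// True only when the caller was authenticated as an internal client.
    pub authenticated_internal: bool,
    /// Service identity derived from the verified token or connection.
    pub token_service_id: &'a str,
}

/// Prefers X-Light-Service-Id only for authenticated internal clients and
/// otherwise falls back to the identity derived from the token.
/// Header keys are assumed to be lowercased by the caller.
pub fn effective_service_id(
    ctx: &CallerContext<'_>,
    headers: &HashMap<String, String>,
) -> String {
    if ctx.authenticated_internal {
        if let Some(value) = headers.get("x-light-service-id") {
            let value = value.trim();
            if !value.is_empty() {
                return value.to_string();
            }
        }
    }
    ctx.token_service_id.to_string()
}
```

The same pattern would apply to the other suggested headers; only the service id is shown to keep the sketch short.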
Routing Stages
Routing should be deterministic before it is intelligent.
- Explicit route: If the request asks for a logical model with a direct configured route, use that route.
- Agent policy: Apply policy for the authenticated agent or service. This can narrow the allowed logical models and upstream providers.
- Capability filter: Remove upstreams that cannot satisfy required capabilities such as tools, vision, long context, or streaming.
- Prompt classifier: Optionally classify the prompt into a routing domain such as fast, reasoning, coding, customer-support, or restricted-data.
- Cost and latency preference: Choose the cheapest or fastest provider that satisfies policy and capability constraints.
- Health and fallback: If the selected upstream is unhealthy or returns a retryable error, follow a configured fallback chain.
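The deterministic stages (explicit route, agent policy, capability filter, health and fallback) can be sketched as a single selection function. This is an illustration under assumptions, not the gateway implementation: the `Gateway`, `Route`, `Upstream`, and `RouteError` types are hypothetical, and the prompt classifier and cost/latency stages are left out.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Clone, Debug, PartialEq)]
pub struct Upstream {
    pub provider: String,
    pub model: String,
}

pub struct Route {
    /// Primary upstream first, then the configured fallback chain.
    pub candidates: Vec<Upstream>,
}

pub struct Gateway {
    /// Logical model name -> route.
    pub routes: HashMap<String, Route>,
    /// Provider id -> capabilities it supports.
    pub capabilities: HashMap<String, HashSet<String>>,
    /// Provider ids currently considered unhealthy.
    pub unhealthy: HashSet<String>,
}

#[derive(Debug, PartialEq)]
pub enum RouteError {
    UnknownRoute,
    NotAllowed,
    NoUsableUpstream,
}

impl Gateway {
    pub fn select(
        &self,
        logical_model: &str,
        allowed_routes: &[&str],
        required: &[&str],
    ) -> Result<Upstream, RouteError> {
        // Explicit route: the logical model must map to a configured route.
        let route = self.routes.get(logical_model).ok_or(RouteError::UnknownRoute)?;
        // Agent policy: the caller must be allowed to use this route.
        if !allowed_routes.contains(&logical_model) {
            return Err(RouteError::NotAllowed);
        }
        route
            .candidates
            .iter()
            // Capability filter: drop upstreams missing a required capability.
            .filter(|u| {
                let caps = self.capabilities.get(&u.provider);
                required.iter().all(|c| caps.map_or(false, |s| s.contains(*c)))
            })
            // Health and fallback: take the first healthy candidate in order.
            .find(|u| !self.unhealthy.contains(&u.provider))
            .cloned()
            .ok_or(RouteError::NoUsableUpstream)
    }
}
```

Because every stage here is a pure lookup, the same inputs always give the same decision, which keeps the routing reason easy to record for audit.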
Gateway Configuration
The gateway should use a dedicated config file, for example
llm-gateway.yml, loaded through the same runtime config layering as other
light-fabric modules.
Example:
enabled: ${llm-gateway.enabled:true}
pathPrefix: ${llm-gateway.pathPrefix:/v1}
defaultRoute: ${llm-gateway.defaultRoute:agent-default}
routes:
  agent-default:
    provider: openai-prod
    model: gpt-4o
    fallbacks:
      - provider: bedrock-us
        model: anthropic.claude-3-5-sonnet-20240620-v1:0
  fast:
    provider: openai-prod
    model: gpt-4o-mini
  reasoning:
    provider: bedrock-us
    model: anthropic.claude-3-7-sonnet-20250219-v1:0
    requiredCapabilities:
      - tools
      - long-context
providers:
  openai-prod:
    type: openai
    baseUrl: ${llm.openai.baseUrl:https://api.openai.com/v1}
    apiKey: ${llm.openai.apiKey:}
    maxTokens: ${llm.openai.maxTokens:}
    costTier: medium
    regions:
      - global
  bedrock-us:
    type: bedrock
    region: ${llm.bedrock.region:us-east-1}
    accessKeyId: ${llm.bedrock.accessKeyId:}
    secretAccessKey: ${llm.bedrock.secretAccessKey:}
    sessionToken: ${llm.bedrock.sessionToken:}
    costTier: high
    regions:
      - us-east-1
agentPolicies:
  com.networknt.agent.account-1.0.0:
    defaultRoute: agent-default
    allowedRoutes:
      - agent-default
      - fast
      - reasoning
    blockedProviders: []
    dataResidency:
      allowedRegions:
        - us-east-1
        - global
fallback:
  maxRetries: ${llm-gateway.fallback.maxRetries:1}
  baseBackoffMs: ${llm-gateway.fallback.baseBackoffMs:100}
The exact schema can evolve, but the important boundary is stable:
- Agent config points to one gateway endpoint.
- Gateway config owns provider inventory and route policy.
- Provider secrets are masked in module registry output.
Provider Inventory
Each configured provider should have:
- A stable provider id.
- A provider type.
- Provider-specific connection settings.
- Supported capabilities.
- Allowed regions.
- Cost tier.
- Timeout and retry settings.
- Optional quota metadata.
- Optional tenant or account restrictions.
Provider ids should be operational names, not user-visible model names:
providers:
  openai-prod:
    type: openai
  openai-eu:
    type: azure-openai
  bedrock-us:
    type: bedrock
  local-ollama:
    type: ollama
Logical model names are route names. They are safe for agents to request.
Request And Response Contract
The first API should be OpenAI-compatible enough for CompatibleProvider:
POST /v1/chat/completions
Authorization: Bearer <agent-or-service-token>
Content-Type: application/json
Request:
{
"model": "agent-default",
"messages": [
{"role": "system", "content": "You are a support agent."},
{"role": "user", "content": "Help me investigate this account."}
],
"temperature": 0.7,
"tools": []
}
Response:
{
"id": "chatcmpl-...",
"object": "chat.completion",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 1200,
"completion_tokens": 240,
"total_tokens": 1440
},
"light_gateway": {
"route": "agent-default",
"provider": "openai-prod",
"model": "gpt-4o"
}
}
The light_gateway field should be optional and controlled by diagnostics
policy. It is useful for internal debugging, but may be hidden from external
clients.
Tool Calling
Tool calling remains an agent responsibility, but model-native tool-call generation flows through the LLM gateway.
The LLM gateway should:
- Accept OpenAI-style tool definitions from the agent.
- Convert tool definitions to the upstream provider's native format when possible.
- Normalize provider tool-call responses back to the OpenAI-compatible shape.
- Return clear errors when a route cannot support tool calling.
The gateway should not execute MCP tools. The agent still calls
light-gateway MCP endpoints for tools/list and tools/call.
Security
The gateway becomes the model egress control point, so it must enforce:
- Authentication for every model request.
- Authorization for logical models and provider routes.
- Tenant isolation.
- Provider allowlists and denylists.
- Secret masking in module registry and diagnostics.
- Optional request/response redaction or tokenization hooks.
- Data residency rules.
- Rate limits by tenant, agent, route, and provider.
- Audit records for route selection and usage.
Provider credentials should live only in gateway config or the secret system feeding config-server. They should not be copied into agent config.
Observability
Each gateway model call should produce structured telemetry:
- Request id.
- Caller identity.
- Agent id and service id when available.
- Logical model.
- Selected provider and physical model.
- Routing reason.
- Fallback attempts.
- Prompt and completion token counts.
- Latency by stage.
- Provider status code and error class.
- Cache hit or prompt-cache usage where available.
- Policy decisions.
Metrics should support dashboards by route, provider, tenant, and agent.
Config Reload
The gateway should register llm-gateway.yml in the module registry and support
runtime reload.
Reload should be atomic:
- Load and validate the new config.
- Build provider clients and route tables.
- Reject invalid route references before swapping state.
- Swap active routing state.
- Keep in-flight requests on the old state.
Validation should catch:
- Unknown provider ids.
- Unknown provider types.
- Routes without a provider.
- Fallbacks pointing to missing providers.
- Logical routes that require capabilities no provider can satisfy.
- Missing required provider settings for active routes.
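The reload and validation steps above can be sketched with a validated candidate state that is swapped behind an `Arc`. This is a simplified illustration: `RoutingState` and `ReloadableState` are hypothetical types, only the "unknown provider id" check is shown, and a production version might use a lock-free swap instead of `RwLock`.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Immutable routing snapshot (hypothetical, heavily simplified).
pub struct RoutingState {
    /// Logical route name -> provider id.
    pub routes: HashMap<String, String>,
    /// Known provider ids.
    pub providers: Vec<String>,
}

pub struct ReloadableState {
    active: RwLock<Arc<RoutingState>>,
}

impl ReloadableState {
    pub fn new(initial: RoutingState) -> Self {
        Self { active: RwLock::new(Arc::new(initial)) }
    }

    /// Each request clones the Arc once and keeps using that snapshot,
    /// so in-flight requests stay on the old state after a reload.
    /// Panics if the lock is poisoned; acceptable for a sketch.
    pub fn current(&self) -> Arc<RoutingState> {
        let state: Arc<RoutingState> = self.active.read().unwrap().clone();
        state
    }

    /// Validates the candidate first and swaps only when it is valid.
    pub fn reload(&self, candidate: RoutingState) -> Result<(), String> {
        for (route, provider) in &candidate.routes {
            if !candidate.providers.contains(provider) {
                return Err(format!(
                    "route {route} references unknown provider {provider}"
                ));
            }
        }
        *self.active.write().unwrap() = Arc::new(candidate);
        Ok(())
    }
}
```

A rejected reload leaves the previous snapshot in place, which matches the rule that invalid route references are rejected before state is swapped.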
Agent Configuration Pattern
For a direct provider deployment:
model-provider.provider: bedrock
model-provider.model: anthropic.claude-3-5-sonnet-20240620-v1:0
bedrock.region: us-east-1
For a gateway deployment:
model-provider.provider: compatible
model-provider.model: agent-default
compatible.name: llm-gateway
compatible.baseUrl: https://llm-gateway.light-gateway:8443/v1
compatible.apiKey: ${llmGateway.agentApiKey}
The second form is the preferred enterprise model once centralized routing is available.
Phased Implementation
Phase 1: OpenAI-Compatible Gateway Endpoint
- Add llm-gateway.yml.
- Add a light-gateway handler for /v1/chat/completions.
- Support non-streaming OpenAI-compatible requests and responses.
- Route by logical model name to one configured upstream provider.
- Mask provider secrets in module registry.
- Add basic audit and metrics.
Phase 2: Policy, Fallback, And Reload
- Add per-agent route policy.
- Add health-aware fallback chains.
- Support runtime reload with atomic state swap.
- Add diagnostics endpoint or module registry details for active routes.
Phase 3: Capability-Aware Routing
- Add capability metadata for each provider route.
- Route by tools, vision, long context, streaming, and JSON mode.
- Normalize tool-call request and response shapes across providers.
Phase 4: Prompt-Aware Routing
- Add deterministic prompt classifiers.
- Add optional lightweight model or embedding classifier for complex routing.
- Record routing reasons for audit.
Phase 5: Advanced Provider Features
- Add streaming.
- Add /v1/responses.
- Add prompt caching hints.
- Add quota-aware routing.
- Add data redaction or tokenization hooks when the tokenization service contract is finalized.
Open Questions
- Should the first implementation live in light-pingora as a handler module or in a new llm-gateway crate used by light-gateway?
- Should logical model policy be stored only in config-server values, or also managed by portal database tables for runtime UI edits?
- Should gateway diagnostics expose selected provider/model to agents, or only to operators?
- Should prompt-aware routing use Light-Rule first, a dedicated classifier, or both?
- How should provider quota information be collected for cloud providers that do not expose uniform quota APIs?
Decision
Use the LLM gateway as the single model provider endpoint for enterprise
agents. light-agent stays single-provider from its own point of view. The
gateway owns multiple upstream providers, route selection, fallback,
credentials, policy, audit, and observability.
WebSocket Router
Status
Phases 1 through 4 are implemented. Phase 1 added configuration parsing,
Java-compatible pathPrefixService normalization, route resolution, and
upstream URI cleanup in light-pingora. Phase 2 wired the websocket handler
into light-gateway with WebSocket upgrade detection, discovery-based upstream
selection, request context storage, and upstream header/query cleanup. Phase 3
added a real gateway-to-backend WebSocket integration test for text, binary,
close, subprotocol, and header behavior. Phase 4 added optional production
controls: idle timeout, maximum connection duration, active-connection and
upgrade-rate limits, routing-header preservation, and access-control
integration.
Purpose
The Java light-websocket-4j websocket-router module routes WebSocket
traffic through a gateway or sidecar. A client connects to the gateway, the
router resolves the downstream service from headers, query parameters, or path
prefix configuration, and the gateway connects to the target WebSocket service.
In light-fabric this should be a light-pingora traffic handler activated by
light-gateway through handler.yml. The same light-gateway binary can link
the WebSocket router implementation, while each product decides whether it runs
by including the websocket handler and websocket-router.yml configuration
from config-server.
The Rust implementation should preserve the Java routing semantics and most of
the Java configuration shape, but it should not copy Java's enabled flag or
frame-bridging architecture. Pingora already supports HTTP/1 upgrade proxying,
so the first implementation should resolve the target and let Pingora tunnel
the upgraded connection.
Goals
- Add a Java-compatible WebSocket router to frameworks/light-pingora.
- Activate the router with the existing websocket handler id in apps/light-gateway.
- Keep the Java websocket-router routing configuration recognizable: defaultProtocol, defaultEnvTag, and pathPrefixService.
- Allow websocket-router.pathPrefixService to be injected by config-server at startup the same way other handler-specific config is injected.
- Resolve downstream services from header, query parameter, or longest path prefix.
- Reuse the existing light-gateway discovery and upstream selection model.
- Preserve WebSocket handshake headers and pass normal agent/browser headers through to the downstream service.
- Register the router configuration with the module registry and support the same reload model as other light-pingora handler configs.
- Keep the design suitable for gateway, sidecar, and BFF deployments.
Non-Goals
- Do not implement a separate WebSocket server framework in light-fabric.
- Do not terminate and re-create WebSocket frames in the first phase.
- Do not multiplex multiple client WebSocket sessions over one downstream connection.
- Do not support HTTP/2 extended CONNECT for WebSocket in the first phase.
- Do not use Rust dynamic plugins or inventory for WebSocket route registration.
- Do not create a separate gateway binary for WebSocket routing.
- Do not use enabled in websocket-router.yml. The handler is active when handler.yml includes websocket in the matched execution chain.
Resolved Decisions
- Activation is controlled only by handler.yml. If a matched chain includes websocket, the router is enabled for that request.
- websocket-router.yml should not contain enabled.
- WebSocket-specific controls should cover both request/upgrade rate and active upgraded connection count.
- The first implementation should use Pingora HTTP/1 upgrade passthrough, not a frame-aware WebSocket bridge.
- Invalid websocket-router.yml configuration should fail startup. Invalid reloads should be rejected while the last valid runtime state keeps serving existing traffic.
Java Behavior To Map
Java configuration includes enabled, but the Rust target config removes it:
# Light websocket router configuration
defaultProtocol: ${websocket-router.defaultProtocol:http}
defaultEnvTag: ${websocket-router.defaultEnvTag:}
pathPrefixService: ${websocket-router.pathPrefixService:}
preserveRoutingHeaders: ${websocket-router.preserveRoutingHeaders:false}
idleTimeoutMs: ${websocket-router.idleTimeoutMs:3600000}
maxConnectionDurationMs: ${websocket-router.maxConnectionDurationMs:}
maxActiveConnections: ${websocket-router.maxActiveConnections:}
maxUpgradeRequestsPerSecond: ${websocket-router.maxUpgradeRequestsPerSecond:}
The Java enabled field is intentionally not carried forward. In Rust, the
handler chain is the activation contract. Removing websocket from a path or
default chain disables WebSocket routing for that path.
Production controls are optional. idleTimeoutMs defaults to one hour; blank
or zero values disable the matching control. preserveRoutingHeaders defaults
to false, so routing-only
Service-Id, service_id, and serviceId headers are stripped before the
upstream handshake unless a backend explicitly needs them.
pathPrefixService accepts three forms:
pathPrefixService:
  /chat:
    serviceId: com.networknt.llmchat-1.0.0
    protocol: http
    envTag: dev

pathPrefixService:
  /chat: com.networknt.llmchat-1.0.0

pathPrefixService: {"/chat":{"serviceId":"com.networknt.llmchat-1.0.0","protocol":"http","envTag":"dev"}}
The Java handler resolves the downstream service in this order:
- Header: first non-blank value from Service-Id, service_id, or serviceId.
- Query parameter: first non-blank value from service_id or serviceId.
- Path prefix: pathPrefixService match against the request path.
If a target is found, query parameters can override the target protocol and environment tag:
- protocol
- env_tag
- envTag
The Java handler removes router-only query parameters before connecting to the downstream service:
- protocol
- service_id
- serviceId
- env_tag
- envTag
The Java implementation accepts client WebSocket subprotocols, opens a new JDK
WebSocket client connection to the downstream service, forwards Authorization,
forwards the selected subprotocols, and then bridges text and binary frames in
both directions.
Rust Architecture
Add the WebSocket router to light-pingora because it is a Pingora gateway
traffic handler.
Proposed module:
frameworks/light-pingora/src/websocket.rs
Primary types:
use std::collections::BTreeMap;

/// Normalized router configuration.
pub struct WebSocketRouterConfig {
    pub default_protocol: String,
    pub default_env_tag: Option<String>,
    pub path_prefix_service: BTreeMap<String, WebSocketServiceTarget>,
}

/// Target configured for one path prefix.
pub struct WebSocketServiceTarget {
    pub service_id: String,
    pub protocol: String,
    pub env_tag: Option<String>,
}

/// Per-request routing result stored in the request context.
pub struct WebSocketRouteDecision {
    pub service_id: String,
    pub protocol: String,
    pub env_tag: Option<String>,
    pub upstream_path_and_query: String,
}
The serde layer should accept Java field names through aliases:
- defaultProtocol
- defaultEnvTag
- pathPrefixService
- serviceId
- envTag
Use websocket-router.yml as the preferred Rust file name. Accept
websocket-router.yaml as a compatibility fallback.
Config Normalization
Normalize pathPrefixService at load time:
raw config
-> validate defaultProtocol/defaultEnvTag
-> parse pathPrefixService YAML map, JSON string map, or legacy key/value string
-> apply defaults to entries missing protocol or envTag
-> sort prefixes by length for longest-prefix matching
-> build Arc<WebSocketRouterState>
An invalid entry should fail config loading instead of being ignored silently. This is stricter than Java and is safer for remote config delivered by config-server.
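The normalization pipeline can be sketched as below: apply defaults, fail on invalid entries, and sort prefixes longest-first so the first match is the longest match. The `RawTarget` and `Target` types are hypothetical stand-ins for the real config types, and the specific validity checks (prefix starts with `/`, non-blank service id, protocol is `http` or `https`) as well as plain string-prefix matching are assumptions of this sketch.

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct Target {
    pub service_id: String,
    pub protocol: String,
    pub env_tag: Option<String>,
}

/// Parsed entry before defaults are applied (hypothetical).
pub struct RawTarget {
    pub service_id: String,
    pub protocol: Option<String>,
    pub env_tag: Option<String>,
}

/// Applies defaults, rejects invalid entries, and sorts longest prefix first.
pub fn normalize(
    raw: Vec<(String, RawTarget)>,
    default_protocol: &str,
    default_env_tag: Option<&str>,
) -> Result<Vec<(String, Target)>, String> {
    let mut out = Vec::with_capacity(raw.len());
    for (prefix, entry) in raw {
        if !prefix.starts_with('/') || entry.service_id.trim().is_empty() {
            return Err(format!("invalid pathPrefixService entry: {prefix}"));
        }
        let protocol = entry.protocol.unwrap_or_else(|| default_protocol.to_string());
        if protocol != "http" && protocol != "https" {
            return Err(format!("invalid protocol for {prefix}: {protocol}"));
        }
        let env_tag = entry.env_tag.or_else(|| default_env_tag.map(str::to_string));
        out.push((prefix, Target { service_id: entry.service_id, protocol, env_tag }));
    }
    // Longest prefix first, so the first hit below is the longest match.
    out.sort_by(|a, b| b.0.len().cmp(&a.0.len()));
    Ok(out)
}

/// Returns the target for the longest configured prefix of `path`.
pub fn match_prefix<'a>(table: &'a [(String, Target)], path: &str) -> Option<&'a Target> {
    table
        .iter()
        .find(|(prefix, _)| path.starts_with(prefix.as_str()))
        .map(|(_, target)| target)
}
```

Sorting once at load time keeps the per-request lookup a simple linear scan over an immutable table.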
Handler Registration
apps/light-gateway already reserves the websocket handler id as a traffic
handler. The implementation should attach that id to the WebSocket router
runtime:
handlers:
  - correlation
  - metrics
  - jwt
  - limit
  - websocket
paths:
  - path: /chat
    method: GET
    exec:
      - correlation
      - metrics
      - jwt
      - limit
      - websocket
The router should only run for chains that include websocket. This lets a BFF
serve static SPA assets, REST APIs, MCP, JSON-RPC, and WebSocket endpoints from
the same gateway binary with path-specific handler chains.
Request Flow
The target flow should be:
client request
-> handler.yml path/chain match
-> cross-cutting request handlers
-> websocket handler
-> verify WebSocket upgrade
-> resolve service target
-> strip router-only query parameters
-> store WebSocketRouteDecision in request context
-> Pingora upstream_peer selects discovered target
-> Pingora upstream_request_filter preserves WebSocket handshake headers
-> Pingora proxies the HTTP/1 upgraded stream
-> response/metrics handlers observe completion
The router should not read the request body and should not buffer WebSocket messages. Once the request is upgraded, Pingora owns the tunnel.
Upgrade Detection
The handler should require the normal WebSocket handshake:
- method GET
- Connection contains upgrade
- Upgrade equals websocket
- Sec-WebSocket-Key exists
- HTTP version is compatible with HTTP/1 upgrade
If the websocket handler is selected by handler.yml but the request is not
a WebSocket upgrade, return 426 Upgrade Required.
HTTP/2 extended CONNECT can be considered later, but should not block the first implementation.
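The handshake checks listed above can be sketched as a small predicate. It assumes header names are already lowercased in a plain map and that the caller knows whether the request arrived over HTTP/1; `check_upgrade` is a hypothetical helper, not a Pingora API.

```rust
use std::collections::HashMap;

/// Looks up a header value by its lowercased name and trims it.
fn header<'a>(headers: &'a HashMap<String, String>, name: &str) -> Option<&'a str> {
    headers.get(name).map(|v| v.trim())
}

/// Returns Ok(()) for a WebSocket upgrade request and Err(426) otherwise.
pub fn check_upgrade(
    method: &str,
    is_http1: bool,
    headers: &HashMap<String, String>,
) -> Result<(), u16> {
    // Connection may carry several comma-separated tokens.
    let connection_has_upgrade = header(headers, "connection").map_or(false, |v| {
        v.split(',').any(|token| token.trim().eq_ignore_ascii_case("upgrade"))
    });
    let upgrade_is_websocket =
        header(headers, "upgrade").map_or(false, |v| v.eq_ignore_ascii_case("websocket"));
    let has_key = header(headers, "sec-websocket-key").map_or(false, |v| !v.is_empty());

    if method == "GET" && is_http1 && connection_has_upgrade && upgrade_is_websocket && has_key {
        Ok(())
    } else {
        Err(426) // Upgrade Required
    }
}
```

Header values are compared case-insensitively because clients commonly send `Upgrade` and `WebSocket` with mixed casing.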
Target Resolution
Target resolution should match Java precedence:
1. service id header
2. service id query parameter
3. pathPrefixService longest-prefix match
Header names:
Service-Id
service_id
serviceId
Query names:
service_id
serviceId
protocol
env_tag
envTag
For path-prefix matches, use the request path without the query string. When multiple prefixes match, choose the longest prefix.
The resolved protocol should be http or https. Conceptually this maps to
ws or wss, but Pingora should still connect to the upstream as HTTP or
HTTPS and then perform the WebSocket upgrade.
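The precedence rules can be sketched as follows. Headers and query parameters are modeled as plain name/value pairs, header names are compared case-insensitively, query names are compared exactly, and the path-prefix result is passed in already resolved. `RouteSource`, `first_non_blank`, and `resolve_service_id` are hypothetical names, and the protocol and env tag overrides are omitted.

```rust
#[derive(Debug, PartialEq)]
pub enum RouteSource {
    Header,
    Query,
    PathPrefix,
}

/// Returns the first non-blank value, trying `names` in order.
fn first_non_blank<'a>(
    pairs: &'a [(String, String)],
    names: &[&str],
    ignore_case: bool,
) -> Option<&'a str> {
    names.iter().find_map(|name| {
        pairs
            .iter()
            .filter(|(k, _)| {
                if ignore_case {
                    k.eq_ignore_ascii_case(name)
                } else {
                    k.as_str() == *name
                }
            })
            .map(|(_, v)| v.trim())
            .find(|v| !v.is_empty())
    })
}

/// Header beats query, query beats path prefix.
pub fn resolve_service_id(
    headers: &[(String, String)],
    query: &[(String, String)],
    prefix_service: Option<&str>,
) -> Option<(String, RouteSource)> {
    if let Some(v) = first_non_blank(headers, &["Service-Id", "service_id", "serviceId"], true) {
        return Some((v.to_string(), RouteSource::Header));
    }
    if let Some(v) = first_non_blank(query, &["service_id", "serviceId"], false) {
        return Some((v.to_string(), RouteSource::Query));
    }
    prefix_service.map(|s| (s.to_string(), RouteSource::PathPrefix))
}
```

Returning the source alongside the service id makes it straightforward to record the route source in metrics and logs.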
Header And Query Policy
Because the Rust implementation should use Pingora upgrade passthrough, it should preserve the original handshake headers:
- Upgrade
- Connection
- Sec-WebSocket-Key
- Sec-WebSocket-Version
- Sec-WebSocket-Protocol
- Sec-WebSocket-Extensions
- Authorization
- cookies
- normal agent/browser headers
The router should strip only router-control query parameters from the upstream URI:
- protocol
- service_id
- serviceId
- env_tag
- envTag
The service-id routing headers should be removed before the upstream request by default:
- Service-Id
- service_id
- serviceId
This keeps gateway routing controls separate from backend application headers. If a backend needs these headers, set preserveRoutingHeaders to true rather than leaking them by default.
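A minimal sketch of the query cleanup, assuming the upstream path and query is available as one string and that parameter names are not percent-encoded; `strip_router_params` is a hypothetical helper name.

```rust
/// Router-control query parameters that must not reach the backend.
const ROUTER_QUERY_PARAMS: [&str; 5] =
    ["protocol", "service_id", "serviceId", "env_tag", "envTag"];

/// Removes router-only parameters and keeps the remaining pairs in order.
pub fn strip_router_params(path_and_query: &str) -> String {
    let (path, query) = match path_and_query.split_once('?') {
        Some((path, query)) => (path, query),
        None => return path_and_query.to_string(),
    };
    let kept: Vec<&str> = query
        .split('&')
        .filter(|pair| !pair.is_empty())
        .filter(|pair| {
            // The name is everything before the first '='.
            let name = pair.split('=').next().unwrap_or("");
            !ROUTER_QUERY_PARAMS.iter().any(|p| *p == name)
        })
        .collect();
    if kept.is_empty() {
        path.to_string()
    } else {
        format!("{path}?{}", kept.join("&"))
    }
}
```

Application parameters pass through untouched, so a request such as `/chat/ws?serviceId=a&room=1` would reach the backend as `/chat/ws?room=1`.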
Discovery And Upstream Selection
The WebSocket router should reuse the same discovery/runtime model as
router.yml and the existing Pingora proxy flow.
Resolved target:
protocol + serviceId + envTag
Discovery returns an upstream HTTP or HTTPS endpoint. upstream_peer creates
the Pingora peer:
- http: non-TLS upstream
- https: TLS upstream with normal SNI/hostname handling
For the first implementation, require HTTP/1.1 to the backend for WebSocket upgrade. HTTP/2 WebSocket tunneling can be a later feature.
Error Handling
Errors should be returned before the connection is upgraded:
| Condition | Response |
|---|---|
| Handler selected but request is not WebSocket upgrade | 426 Upgrade Required |
| No service id and no path-prefix match | 403 Forbidden |
| Invalid protocol override | 400 Bad Request |
| Discovery has no usable endpoint | 502 Bad Gateway |
| Upstream connect/upgrade failure | 502 Bad Gateway |
Returning HTTP errors before upgrade is clearer than Java's close-frame behavior because the Rust implementation does not accept the WebSocket until the target is known.
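The table can be mirrored by a small error type, sketched here with hypothetical names. The protocol override check assumes only `http` and `https` are accepted, which follows from the rule that the resolved protocol should be one of those two.

```rust
/// Failures that are reported before the connection is upgraded.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PreUpgradeError {
    NotUpgrade,
    NoTarget,
    InvalidProtocol,
    NoEndpoint,
    UpstreamFailure,
}

impl PreUpgradeError {
    /// HTTP status code returned to the client for each condition.
    pub fn status(self) -> u16 {
        match self {
            PreUpgradeError::NotUpgrade => 426,
            PreUpgradeError::NoTarget => 403,
            PreUpgradeError::InvalidProtocol => 400,
            PreUpgradeError::NoEndpoint | PreUpgradeError::UpstreamFailure => 502,
        }
    }
}

/// Validates a `protocol` query override (assumed to allow only http/https).
pub fn parse_protocol_override(value: &str) -> Result<&'static str, PreUpgradeError> {
    match value.trim() {
        "http" => Ok("http"),
        "https" => Ok("https"),
        _ => Err(PreUpgradeError::InvalidProtocol),
    }
}
```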
Module Registry And Reload
Register the loaded configuration with the module registry:
module id: light-pingora/websocket-router
config name: websocket-router
config file: websocket-router.yml or websocket-router.yaml
On reload:
- Load and validate the new config.
- Build a new immutable route state.
- Atomically swap the state.
- Let in-flight upgraded connections continue with the old decision.
Existing WebSocket tunnels should not be interrupted by a config reload unless the gateway process is restarted.
Observability
The handler should integrate with existing correlation and metrics handlers:
- include correlation id in pre-upgrade logs
- record target resolution result
- record route source: header, query, or pathPrefixService
- count upgrade attempts, successful upgrades, rejected upgrades, and upstream connection failures
- optionally record tunnel duration once Pingora exposes completion
Do not log full query strings by default because they may contain application data.
Test Plan
Parser and resolver tests:
- YAML object pathPrefixService
- string service id entries
- JSON string map entries
- legacy key/value string entries
- default protocol and env tag application
- invalid entries fail load
- header beats query and path prefix
- query beats path prefix
- longest prefix wins
- query protocol/envTag override
- router query params are stripped
Gateway tests:
- non-upgrade request to a WebSocket chain returns 426
- missing target returns 403
- unknown discovery target returns 502
- upgrade request preserves Sec-WebSocket-Protocol
- Authorization and normal browser/agent headers pass through
- service-id routing headers are stripped before upstream
Integration tests:
- connect through light-gateway to a local WebSocket echo backend
- text message round trip
- binary message round trip
- close frame behavior
- subprotocol negotiation
- TLS upstream smoke test when a local test certificate is available
Implementation Phases
Phase 1: Config And Resolver
Status: implemented.
- Add frameworks/light-pingora/src/websocket.rs.
- Parse websocket-router.yml and websocket-router.yaml.
- Normalize all Java-compatible pathPrefixService forms.
- Implement target resolution and upstream URI cleanup.
- Add unit tests.
Phase 2: Gateway Handler Wiring
Status: implemented.
- Connect the existing websocket handler id to the router runtime.
- Detect WebSocket upgrade requests in the Pingora request flow.
- Store WebSocketRouteDecision in the request context.
- Select the discovered upstream in upstream_peer.
- Strip router query params and service-id headers in upstream_request_filter.
Phase 3: WebSocket Integration Tests
Status: implemented.
- Add a local test WebSocket echo service.
- Verify text, binary, close, subprotocol, and header behavior through light-gateway.
- Verify HTTP and HTTPS upstream paths if practical in CI.
Phase 4: Production Controls
Status: implemented.
- Add optional idle timeout and max connection duration.
- Add WebSocket-specific limit controls for both upgrade/request rate and active upgraded connection count.
- Add explicit config for preserving routing headers if a backend requires them.
- Add access-control integration once the same access-control model is shared across REST, JSON-RPC, MCP, and WebSocket routes.
Implementation notes:
- maxUpgradeRequestsPerSecond gates accepted upgrade attempts before discovery lookup.
- maxActiveConnections tracks proxied upgraded sessions with a permit that is released when Pingora finishes the request context. The active counter is preserved across router and policy reloads.
- idleTimeoutMs is applied to downstream and upstream tunnel IO. Pingora's body-filter hooks also check idle age when either side sends tunneled data.
- maxConnectionDurationMs is checked by the tunnel body filters and is also used as an IO timeout when it is the only timeout configured. A connection that continuously exchanges frames is closed on the next tunneled body chunk after the duration is exceeded.
- WebSocket access-control uses the shared access-control.yml and rule.yml model. The rule context uses tool name websocket, endpoint from handler.yml, and tool arguments containing serviceId, protocol, envTag, upstreamPathAndQuery, and route source.
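The permit behaviour described for maxActiveConnections can be sketched with an atomic counter and a guard that releases on drop. This is an illustration of the pattern, not the actual light-pingora code: `ConnectionLimiter`, `Permit`, and `with_max` are hypothetical names, and the shared `Arc` counter is one possible way to keep the active count across reloads.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

pub struct ConnectionLimiter {
    active: Arc<AtomicUsize>,
    /// None means the control is disabled.
    max: Option<usize>,
}

/// Held for the lifetime of one upgraded session; releases on drop.
pub struct Permit {
    active: Arc<AtomicUsize>,
}

impl ConnectionLimiter {
    /// Blank (None) or zero disables the limit.
    pub fn new(max: Option<usize>) -> Self {
        Self { active: Arc::new(AtomicUsize::new(0)), max: max.filter(|m| *m > 0) }
    }

    /// Builds a limiter with a new limit that shares the same counter,
    /// so sessions opened before a reload are still counted.
    pub fn with_max(&self, max: Option<usize>) -> Self {
        Self { active: Arc::clone(&self.active), max: max.filter(|m| *m > 0) }
    }

    pub fn active(&self) -> usize {
        self.active.load(Ordering::Acquire)
    }

    /// Returns a permit, or None when the limit is already reached.
    pub fn try_acquire(&self) -> Option<Permit> {
        let mut current = self.active.load(Ordering::Acquire);
        loop {
            if let Some(max) = self.max {
                if current >= max {
                    return None;
                }
            }
            match self.active.compare_exchange_weak(
                current,
                current + 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return Some(Permit { active: Arc::clone(&self.active) }),
                Err(observed) => current = observed,
            }
        }
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        self.active.fetch_sub(1, Ordering::AcqRel);
    }
}
```

Storing the permit in the per-request context would release it when that context is dropped, which is the point at which the proxied session is finished.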
Open Questions
None.
Stateless Auth Handler
Status
Initial Rust implementation is complete in light-pingora and
light-gateway. It includes the shared SPA session runtime, authorization-code
entrypoint, logout, cookie handling, CSRF validation, refresh-token renewal,
Google/Facebook/GitHub callback entrypoints, handler wiring, config stubs, and
runtime-load tests.
Purpose
The Java light-spa-4j stateless-auth module is the BFF login bridge for
SPA deployments that use OAuth 2.0 authorization code flow in the cloud. The
browser completes the provider redirect, calls the gateway callback path with
the authorization code, and the gateway exchanges that code for light-oauth
tokens. The gateway then stores the internal access token, refresh token, user
metadata, and CSRF value in browser cookies.
In light-fabric this should be a light-pingora security handler used by
light-gateway. The handler should be activated by handler.yml, loaded from
config-server with the same product-level configuration model as the rest of
the gateway, and implemented with the same shared SPA session runtime used by
the MSAL exchange handler.
Goals
- Preserve the Java BFF behavior for authorization code login, logout, CSRF validation, refresh-token renewal, and downstream Authorization injection.
- Keep the Java statelessAuth.yml field names recognizable so light-portal can inject statelessAuth.* values into config-server output.
- Use handler.yml as the primary activation and ordering contract.
- Keep the existing stateless handler id as the public handler-chain name.
- Share cookie, CSRF, JWT parsing, refresh-token single-flight, and Authorization injection code with the MSAL exchange handler.
- Use the existing client.yml OAuth token configuration for authorization code and refresh-token calls.
- Register the loaded config in ModuleRegistry and reject invalid config at startup.
ModuleRegistryand reject invalid config at startup. - Support BFF chains that also use static SPA serving, proxy/router, WebSocket routing, and MCP routing.
- Support Google, Facebook, and GitHub login entrypoints in addition to the generic authorization-code callback.
Non-Goals
- Do not use Rust dynamic plugins or inventory.
- Do not create a separate BFF binary.
- Do not store server-side browser sessions in the first implementation.
- Do not require the Rust social-login implementation to copy Java's provider-specific classes. Rust should preserve the external behavior and config contract, but it can use established OAuth/OIDC crates for provider protocol handling.
- Do not redirect the browser from the gateway by default. Java returns a JSON body containing redirectUri, denyUri, and scopes; Rust should preserve that behavior.
Resolved Decisions
- Google, Facebook, and GitHub login handlers are in scope. The existing google, facebook, and github handler ids should remain as public handler-chain names.
- Rust should prefer provider-appropriate crates instead of hand-rolling every provider flow. openidconnect is a good fit for OpenID Connect providers such as Google, and oauth2 is a good fit for plain OAuth 2.0 providers or provider-specific extensions.
- cookieTimeoutUri should be used by Rust to return a structured session-expired response when a browser session cannot be renewed.
Java Behavior To Map
Java config file:
enabled: ${statelessAuth.enabled:true}
redirectUri: ${statelessAuth.redirectUri:https://localhost:3000/#/app/dashboard}
denyUri: ${statelessAuth.denyUri:https://localhost:3000/#/app/dashboard}
enableHttp2: ${statelessAuth.enableHttp2:false}
authPath: ${statelessAuth.authPath:/authorization}
logoutPath: ${statelessAuth.logoutPath:/logout}
cookieDomain: ${statelessAuth.cookieDomain:localhost}
cookiePath: ${statelessAuth.cookiePath:/}
cookieTimeoutUri: ${statelessAuth.cookieTimeoutUri:/}
cookieSecure: ${statelessAuth.cookieSecure:true}
sessionTimeout: ${statelessAuth.sessionTimeout:3600}
rememberMeTimeout: ${statelessAuth.rememberMeTimeout:604800}
bootstrapToken: ${statelessAuth.bootstrapToken:token}
googlePath: ${statelessAuth.googlePath:/google}
googleClientId: ${statelessAuth.googleClientId:google_client_id}
googleClientSecret: ${statelessAuth.googleClientSecret:secret}
googleRedirectUri: ${statelessAuth.googleRedirectUri:https://localhost:3000}
facebookPath: ${statelessAuth.facebookPath:/facebook}
facebookClientId: ${statelessAuth.facebookClientId:facebook_client_id}
facebookClientSecret: ${statelessAuth.facebookClientSecret:secret}
githubPath: ${statelessAuth.githubPath:/github}
githubClientId: ${statelessAuth.githubClientId:github_client_id}
githubClientSecret: ${statelessAuth.githubClientSecret:secret}
Java request behavior:
- GET authPath, normally /authorization, expects query parameter code and optional state.
- Missing code returns ERR10035.
- The handler generates a CSRF value and sends an authorization-code token request through http-client using client.yml oauth.token.authorization_code.
- On success, it sets browser cookies and returns JSON containing scopes, redirectUri, and denyUri.
- GET logoutPath, normally /logout, clears BFF cookies and ends the request.
- Other requests are treated as downstream BFF requests. The handler reads the accessToken cookie, verifies/parses it, validates CSRF, refreshes the token if it expires within 90 seconds, and injects Authorization: Bearer <access-token> before the proxy/router handler runs.
- If no access token exists but a refresh token exists, the handler attempts refresh and then injects the new access token.
- If neither cookie exists, Java allows the request to continue. The downstream service can still decide whether the endpoint is anonymous or protected.
Java error codes to preserve:
| Code | Meaning |
|---|---|
| ERR10035 | Authorization code is missing |
| ERR10000 | Access token is invalid |
| ERR10036 | CSRF token is missing from request |
| ERR10038 | CSRF claim is missing from JWT |
| ERR10039 | Request CSRF and JWT CSRF do not match |
| ERR10037 | Refresh-token response is empty |
Rust Architecture
Add a shared SPA auth runtime in light-pingora and expose it through
light-gateway.
Proposed modules:
frameworks/light-pingora/src/spa_auth.rs
frameworks/light-pingora/src/stateless_auth.rs
spa_auth.rs owns the reusable mechanics:
pub struct SpaCookieConfig {
    pub cookie_domain: String,
    pub cookie_path: String,
    pub cookie_secure: bool,
    pub session_timeout: u64,
    pub remember_me_timeout: u64,
    pub same_site: CookieSameSite,
    pub renew_before_seconds: u64,
}

pub struct SpaSessionRuntime {
    pub cookies: SpaCookieConfig,
    pub token_client: Arc<SpaTokenClient>,
    pub jwt_verifier: Arc<SecurityRuntime>,
    pub refresh_single_flight: RefreshSingleFlight,
}

pub struct SpaSessionResult {
    pub access_token: Option<String>,
    pub principal: Option<AuthPrincipal>,
    pub response_cookies: Vec<SetCookie>,
}
stateless_auth.rs owns the authorization-code entrypoint:
pub struct StatelessAuthConfig {
    pub enabled: bool,
    pub redirect_uri: String,
    pub deny_uri: Option<String>,
    pub enable_http2: bool,
    pub auth_path: String,
    pub logout_path: String,
    pub cookie_domain: String,
    pub cookie_path: String,
    pub cookie_timeout_uri: String,
    pub cookie_secure: bool,
    pub session_timeout: u64,
    pub remember_me_timeout: u64,
    pub bootstrap_token: Option<String>,
    pub renew_before_seconds: u64,
    pub google: Option<SocialProviderConfig>,
    pub facebook: Option<SocialProviderConfig>,
    pub github: Option<SocialProviderConfig>,
}

pub struct SocialProviderConfig {
    pub path: String,
    pub client_id: String,
    pub client_secret: String,
    pub redirect_uri: Option<String>,
    pub scopes: Vec<String>,
}

pub struct StatelessAuthRuntime {
    pub config: StatelessAuthConfig,
    pub session: SpaSessionRuntime,
}
Use Java-compatible serde aliases for camel-case config fields. The primary
file should be statelessAuth.yml; accept statelessAuth.yaml as a
compatibility fallback.
The serde layer can keep the Java-compatible flat fields, such as
googlePath, googleClientId, and googleClientSecret, and normalize them
into SocialProviderConfig entries after load. This keeps config-server
compatibility while giving Rust a cleaner internal model.
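The normalization step can be sketched with the standard library alone. This is a minimal illustration, not the actual loader: FlatStatelessAuth, non_blank, and normalize_google are hypothetical names, serde is left out, and the rule that a provider is only produced when path, client id, and client secret are all non-blank is an assumption.

```rust
/// Hypothetical flat view of the Java-compatible Google fields after load.
#[derive(Default)]
pub struct FlatStatelessAuth {
    pub google_path: Option<String>,
    pub google_client_id: Option<String>,
    pub google_client_secret: Option<String>,
    pub google_redirect_uri: Option<String>,
}

#[derive(Debug, PartialEq)]
pub struct SocialProviderConfig {
    pub path: String,
    pub client_id: String,
    pub client_secret: String,
    pub redirect_uri: Option<String>,
    pub scopes: Vec<String>,
}

/// Treat missing and blank strings the same way.
fn non_blank(value: &Option<String>) -> Option<String> {
    value
        .as_deref()
        .map(str::trim)
        .filter(|v| !v.is_empty())
        .map(str::to_owned)
}

/// Assumption: the provider entry exists only when path, client id,
/// and client secret are all configured.
pub fn normalize_google(flat: &FlatStatelessAuth) -> Option<SocialProviderConfig> {
    Some(SocialProviderConfig {
        path: non_blank(&flat.google_path)?,
        client_id: non_blank(&flat.google_client_id)?,
        client_secret: non_blank(&flat.google_client_secret)?,
        redirect_uri: non_blank(&flat.google_redirect_uri),
        scopes: Vec::new(),
    })
}
```

The same shape would apply to the Facebook and GitHub fields; only the flat field names differ.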
Handler Registration
apps/light-gateway already reserves the stateless handler id. The runtime
loader should follow the same pattern as MCP:
let stateless_auth = load_stateless_auth_runtime(
    config,
    active_handlers.is_handler_active("stateless"),
)?;
If stateless is not active in any chain, the config does not need to be
loaded. If the handler is active but the config sets enabled: false, register
the disabled module and return None.
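The gating rules above can be sketched as follows. This is a simplified stand-in, not the real loader: the config is reduced to two fields, the loader takes a closure instead of the gateway config object, the authPath check is an invented example of validation, and ModuleRegistry registration is only marked by a comment.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct StatelessAuthConfig {
    pub enabled: bool,
    pub auth_path: String,
}

#[derive(Debug, PartialEq)]
pub struct StatelessAuthRuntime {
    pub config: StatelessAuthConfig,
}

/// `load` is only called when the handler is active in some chain.
pub fn load_stateless_auth_runtime<F>(
    load: F,
    active: bool,
) -> Result<Option<StatelessAuthRuntime>, String>
where
    F: FnOnce() -> Result<StatelessAuthConfig, String>,
{
    if !active {
        // Inactive handler: the config file is never read.
        return Ok(None);
    }
    let config = load()?;
    // Hypothetical validation rule; invalid active config fails startup.
    if !config.auth_path.starts_with('/') {
        return Err(format!("authPath must start with '/': {}", config.auth_path));
    }
    if !config.enabled {
        // The real loader would register the disabled module here.
        return Ok(None);
    }
    Ok(Some(StatelessAuthRuntime { config }))
}
```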
No @alias syntax is needed. The handler id in handler.yml is the stable
Rust contract.
Example BFF chain:
handlers:
  - exception
  - cors
  - stateless
  - header
  - prefix
  - token
  - router
chains:
  default:
    - exception
    - cors
    - stateless
    - header
    - prefix
    - token
    - router
  websocket:
    - exception
    - stateless
    - security
    - websocket
paths:
  - path: /authorization
    method: GET
    exec:
      - default
  - path: /logout
    method: GET
    exec:
      - default
The handler should normally run after CORS and before proxy/router/WebSocket.
Login Flow
For authPath:
GET /authorization?code=...&state=...
-> validate code
-> generate csrf
-> call token endpoint with authorization_code grant
-> verify/parse returned internal access token
-> set BFF cookies
-> return { "scopes": [...], "redirectUri": "...?state=...", "denyUri": "..." }
Token request mapping should reuse client.yml:
- oauth.token.server_url or oauth.token.serviceId
- oauth.token.enableHttp2
- oauth.token.authorization_code.uri
- oauth.token.authorization_code.client_id
- oauth.token.authorization_code.client_secret
- oauth.token.authorization_code.redirect_uri
- oauth.token.authorization_code.scope
The form body should match Java:
grant_type=authorization_code
code=<code>
redirect_uri=<optional redirect_uri>
csrf=<generated csrf>
scope=<space separated scopes, if configured>
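A minimal sketch of how that form body could be assembled, assuming the body is sent as application/x-www-form-urlencoded. AuthCodeRequest, form_encode, and authorization_code_form are hypothetical names, and the encoder is a small hand-written stand-in for whatever URL-encoding crate the gateway actually uses.

```rust
/// Percent-encode one value for application/x-www-form-urlencoded.
fn form_encode(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for byte in value.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'*' => {
                out.push(byte as char)
            }
            b' ' => out.push('+'),
            other => out.push_str(&format!("%{:02X}", other)),
        }
    }
    out
}

pub struct AuthCodeRequest<'a> {
    pub code: &'a str,
    pub redirect_uri: Option<&'a str>,
    pub csrf: &'a str,
    pub scopes: &'a [&'a str],
}

/// Build the body in the field order listed above; optional fields are
/// omitted when they are not configured.
pub fn authorization_code_form(req: &AuthCodeRequest<'_>) -> String {
    let mut pairs: Vec<(&str, String)> = vec![
        ("grant_type", "authorization_code".to_string()),
        ("code", req.code.to_string()),
    ];
    if let Some(uri) = req.redirect_uri {
        pairs.push(("redirect_uri", uri.to_string()));
    }
    pairs.push(("csrf", req.csrf.to_string()));
    if !req.scopes.is_empty() {
        pairs.push(("scope", req.scopes.join(" ")));
    }
    pairs
        .iter()
        .map(|(key, value)| format!("{}={}", key, form_encode(value)))
        .collect::<Vec<_>>()
        .join("&")
}
```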
Session Validation Flow
For requests that are not login/logout:
request
-> read accessToken cookie
-> verify/parse internal JWT with security.yml rules
-> extract csrf claim
-> find request CSRF from X-CSRF-TOKEN, WebSocket subprotocol, or query
-> compare csrf values
-> refresh token if exp is inside renew window
-> inject Authorization: Bearer <access-token>
-> continue handler chain
CSRF source order should match Java:
- X-CSRF-TOKEN header.
- Sec-WebSocket-Protocol value starting with the prefix "csrf." when the request has Sec-WebSocket-Key and Sec-WebSocket-Version.
- Query parameter csrf.
The WebSocket subprotocol behavior is important for browser WebSocket clients
that cannot set arbitrary headers. The auth handler should run before the
websocket router so the downstream handshake receives the internal
Authorization header.
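The source order can be sketched as a single lookup function. This is an illustration under stated assumptions: find_request_csrf is a hypothetical name, header names are assumed to be lowercased by the caller, the token is assumed to be the text after the "csrf." prefix, and query values are not percent-decoded.

```rust
use std::collections::HashMap;

/// Return the request CSRF value using the header, WebSocket subprotocol,
/// then query parameter order described above.
pub fn find_request_csrf(
    headers: &HashMap<String, String>,
    query: Option<&str>,
) -> Option<String> {
    // 1. Explicit header.
    if let Some(value) = headers
        .get("x-csrf-token")
        .map(|v| v.trim())
        .filter(|v| !v.is_empty())
    {
        return Some(value.to_string());
    }
    // 2. WebSocket handshake: look for a "csrf."-prefixed subprotocol.
    let is_websocket = headers.contains_key("sec-websocket-key")
        && headers.contains_key("sec-websocket-version");
    if is_websocket {
        if let Some(protocols) = headers.get("sec-websocket-protocol") {
            for protocol in protocols.split(',').map(str::trim) {
                if let Some(token) = protocol.strip_prefix("csrf.") {
                    if !token.is_empty() {
                        return Some(token.to_string());
                    }
                }
            }
        }
    }
    // 3. Query parameter named csrf.
    query?.split('&').find_map(|pair| {
        let (key, value) = pair.split_once('=')?;
        (key == "csrf" && !value.is_empty()).then(|| value.to_string())
    })
}
```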
Session-Expired Response
The Java handler usually allows requests with no cookies to continue so the downstream service can decide whether the endpoint is anonymous. Rust should preserve that pass-through behavior for requests with no session evidence.
When the request does have session evidence but the session cannot be renewed,
for example an expired or rejected refresh token, Rust should clear BFF cookies
and return a structured response using cookieTimeoutUri:
{
"code": "ERR10040",
"message": "SPA session expired",
"timeoutUri": "/",
"authenticated": false
}
The status should be 401 unless a later product config explicitly asks for a
different behavior. This gives the SPA a deterministic signal to navigate to
the configured timeout or login page without scraping an Undertow-style status
string.
Internal JWT Verification
The shared SPA runtime should not call the existing verify_jwt_request
function directly. That function is designed for API requests with an
Authorization header, path skips, pass-through claims, and normal security
handler behavior.
The SPA auth runtime needs a lower-level token verifier that can:
- verify the access-token signature using the same certificates and algorithms as security.yml;
- parse claims from a token stored in a cookie;
- optionally ignore expiration while deciding whether the token can be refreshed;
- fail hard on invalid signature, invalid algorithm, malformed JWT, and missing key;
- return an AuthPrincipal and raw claims for CSRF, cookie metadata, and optional request-context propagation.
This can be implemented by extracting a reusable helper from security.rs,
for example:
fn verify_jwt_token(
    runtime: &SecurityRuntime,
    token: &str,
    expiry_mode: JwtExpiryMode,
) -> Result<AuthPrincipal, HandlerRejection>
The normal security handler can keep its current request-level wrapper, while
SPA auth uses the token-level helper for cookie tokens.
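The expiry-mode switch could look like the sketch below. The source only names the JwtExpiryMode type; the Enforce and Ignore variants and the check_expiry helper are assumptions, and signature verification is deliberately left out.

```rust
/// Hypothetical variants for the expiry mode named above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum JwtExpiryMode {
    /// Normal API behavior: an expired token is rejected.
    Enforce,
    /// Refresh path: expiry is ignored so the claims can still be read.
    Ignore,
}

/// `exp` and `now` are seconds since the Unix epoch.
/// Signature and algorithm checks would run before this in both modes.
pub fn check_expiry(exp: u64, now: u64, mode: JwtExpiryMode) -> Result<(), &'static str> {
    match mode {
        JwtExpiryMode::Ignore => Ok(()),
        JwtExpiryMode::Enforce if now < exp => Ok(()),
        JwtExpiryMode::Enforce => Err("token expired"),
    }
}
```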
Social Provider Login
Google, Facebook, and GitHub login are implemented as thin handler entrypoints that reuse the same cookie/session runtime as the authorization-code callback. The existing handler ids are kept:
chains:
  google:
    - exception
    - correlation
    - cors
    - google
    - stateless
    - header
    - prefix
    - router
  facebook:
    - exception
    - correlation
    - cors
    - facebook
    - stateless
    - header
    - prefix
    - router
  github:
    - exception
    - correlation
    - cors
    - github
    - stateless
    - header
    - prefix
    - router
The implemented provider flow is:
- Each handler matches its configured provider path, for example googlePath, facebookPath, or githubPath.
- For Google, exchange the authorization code with the Google token endpoint and use the returned id_token as the subject token. If the provider does not return an ID token, fall back to access_token.
- For Facebook, accept the Java-compatible accessToken query parameter, or exchange an authorization code with the Facebook token endpoint.
- For GitHub, exchange the authorization code with the GitHub token endpoint.
- Use client.yml oauth.token.token_exchange to exchange the provider subject token for an internal light-oauth token set with a CSRF claim.
- Set the same BFF cookies as the generic stateless handler and return the same JSON shape.
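The subject-token selection for Google can be sketched as a small helper. ProviderTokenResponse and select_subject_token are hypothetical names, and treating an empty string the same as a missing token is an assumption.

```rust
/// Hypothetical subset of a provider token-endpoint response.
pub struct ProviderTokenResponse {
    pub id_token: Option<String>,
    pub access_token: Option<String>,
}

/// Prefer the ID token; fall back to the access token when it is absent.
pub fn select_subject_token(resp: &ProviderTokenResponse) -> Option<&str> {
    resp.id_token
        .as_deref()
        .filter(|token| !token.is_empty())
        .or_else(|| resp.access_token.as_deref().filter(|token| !token.is_empty()))
}
```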
Provider token endpoints default to the public provider URLs, but can be overridden for tests or regional deployments:
googleTokenEndpoint: ${statelessAuth.googleTokenEndpoint:https://oauth2.googleapis.com/token}
facebookTokenEndpoint: ${statelessAuth.facebookTokenEndpoint:https://graph.facebook.com/v19.0/oauth/access_token}
githubTokenEndpoint: ${statelessAuth.githubTokenEndpoint:https://github.com/login/oauth/access_token}
External identity mapping is intentionally delegated to the internal token-exchange implementation. Once portal-service tokenization has a final RPC contract, the subject-token exchange can map provider identities there without changing the gateway cookie/session runtime.
Refresh Flow
The Java handler refreshes 90 seconds before expiry and deduplicates concurrent
refreshes with RefreshTokenSingleFlight. Rust should keep that behavior.
Default Rust settings:
renewBeforeSeconds: ${statelessAuth.renewBeforeSeconds:90}
refreshSingleFlightWaitMs: ${statelessAuth.refreshSingleFlightWaitMs:5000}
refreshSingleFlightCacheMs: ${statelessAuth.refreshSingleFlightCacheMs:3000}
refreshSingleFlightMaxEntries: ${statelessAuth.refreshSingleFlightMaxEntries:10000}
These fields are Rust improvements. They can be omitted from config-server templates until a product needs to tune them.
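The renew-window check itself is a small comparison. A minimal sketch, assuming exp and now are Unix timestamps in seconds and that a token with exactly renewBeforeSeconds left counts as inside the window; RenewDecision and renew_decision are hypothetical names, and single-flight deduplication is not shown.

```rust
#[derive(Debug, PartialEq)]
pub enum RenewDecision {
    /// Token is still comfortably valid.
    Keep,
    /// Token is expired or expires within the renew window.
    Refresh,
}

pub fn renew_decision(exp: u64, now: u64, renew_before_seconds: u64) -> RenewDecision {
    // saturating_sub yields 0 for an already expired token.
    let remaining = exp.saturating_sub(now);
    if remaining <= renew_before_seconds {
        RenewDecision::Refresh
    } else {
        RenewDecision::Keep
    }
}
```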
Refresh-token request mapping should reuse client.yml
oauth.token.refresh_token and send:
grant_type=refresh_token
refresh_token=<cookie refresh token>
csrf=<new csrf>
scope=<space separated scopes, if configured>
Cookies
Cookie names should remain Java-compatible:
| Cookie | HttpOnly | Source |
|---|---|---|
| accessToken | true | OAuth access token |
| refreshToken | true | OAuth refresh token |
| csrf | false | Generated CSRF value |
| userId | false | JWT uid claim |
| userType | false | JWT userType claim |
| roles | false | Base64-encoded JWT role claim, default user |
| host | false | JWT host claim |
| email | false | JWT eml claim |
| eid | false | JWT eid claim |
Access-token, user-info, and CSRF cookies should use the access token
expires_in value as Max-Age. Refresh-token cookie Max-Age should use
sessionTimeout unless the token response includes a remember value other than
N, in which case it should use rememberMeTimeout.
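The Max-Age selection can be sketched as below. TokenResponse, CookieLifetimes, and cookie_lifetimes are hypothetical names; treating a missing remember value like N is an assumption, since the text only covers values other than N.

```rust
/// Hypothetical subset of the token response.
pub struct TokenResponse {
    pub expires_in: u64,
    pub remember: Option<String>,
}

#[derive(Debug, PartialEq)]
pub struct CookieLifetimes {
    /// Used for the access-token, user-info, and CSRF cookies.
    pub access_max_age: u64,
    /// Used for the refresh-token cookie.
    pub refresh_max_age: u64,
}

pub fn cookie_lifetimes(
    resp: &TokenResponse,
    session_timeout: u64,
    remember_me_timeout: u64,
) -> CookieLifetimes {
    let refresh_max_age = match resp.remember.as_deref() {
        Some(flag) if flag != "N" => remember_me_timeout,
        // Assumption: an absent value behaves like "N".
        _ => session_timeout,
    };
    CookieLifetimes {
        access_max_age: resp.expires_in,
        refresh_max_age,
    }
}
```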
Java only clears cookies that were present on the request. Rust should improve logout by always emitting deletion cookies for the known cookie names, using the configured domain/path/secure attributes. This avoids stale browser cookies when a cookie is omitted from a particular request.
Default SameSite should remain None for Java parity. Add a Rust-only optional
cookieSameSite field with default None so deployments can choose Lax or
Strict when the SPA and BFF are same-site.
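A sketch of the logout improvement, emitting one deletion cookie per known name regardless of what the request carried. BFF_COOKIE_NAMES and deletion_cookies are hypothetical names, and the attribute order in the Set-Cookie value is a choice made for this example.

```rust
/// Java-compatible cookie names from the table above.
pub const BFF_COOKIE_NAMES: [&str; 9] = [
    "accessToken", "refreshToken", "csrf", "userId", "userType",
    "roles", "host", "email", "eid",
];

#[derive(Clone, Copy)]
pub enum CookieSameSite {
    None,
    Lax,
    Strict,
}

impl CookieSameSite {
    fn as_str(self) -> &'static str {
        match self {
            CookieSameSite::None => "None",
            CookieSameSite::Lax => "Lax",
            CookieSameSite::Strict => "Strict",
        }
    }
}

/// Build Set-Cookie values that expire every known BFF cookie.
pub fn deletion_cookies(
    domain: &str,
    path: &str,
    secure: bool,
    same_site: CookieSameSite,
) -> Vec<String> {
    BFF_COOKIE_NAMES
        .iter()
        .map(|name| {
            let http_only = matches!(*name, "accessToken" | "refreshToken");
            let mut cookie = format!(
                "{}=; Domain={}; Path={}; Max-Age=0; SameSite={}",
                name, domain, path, same_site.as_str()
            );
            if secure {
                cookie.push_str("; Secure");
            }
            if http_only {
                cookie.push_str("; HttpOnly");
            }
            cookie
        })
        .collect()
}
```

Current browsers generally reject SameSite=None cookies that lack Secure, so a deployment that keeps the None default would normally also keep cookieSecure set to true.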
Config Server Model
The config-server should continue to resolve placeholders before startup:
statelessAuth.redirectUri: https://localhost:3000/#/app/dashboard
statelessAuth.cookieDomain: localhost
statelessAuth.cookieSecure: true
client.tokenAcClientId: ...
client.tokenAcClientSecret: ...
client.tokenRtClientId: ...
client.tokenRtClientSecret: ...
The Rust gateway should only consume the resolved statelessAuth.yml,
client.yml, security.yml, and handler.yml files. It should not need to
know whether the values came from product defaults, environment variables, or
light-portal overrides.
Implemented Surface
- Shared SPA cookie/session runtime, including cookie parser/writer, CSRF extraction, JWT claim extraction, and Java-compatible cookie names.
- OAuth token client support for authorization-code, refresh-token, and token-exchange grant requests using client.yml.
- Refresh-token renewal with a bounded completed-result cache.
- statelessAuth.yml loader, module registry registration, active-handler gating, and runtime reload.
- stateless, google, facebook, and github request handling in light-gateway.
- Structured session-expired response using cookieTimeoutUri.
- Unit/runtime-load coverage for config parsing, cookie attributes, provider subject-token selection, active-handler loading, and gateway wiring.
MSAL Exchange Handler
Status
Initial Rust implementation is complete in light-pingora and
light-gateway. It includes config loading, named security-msal.yml
validation support, token-exchange handling, shared SPA session/cookie/CSRF
logic, logout, refresh-token renewal, handler wiring, config stubs, and
runtime-load tests.
Purpose
The Java light-spa-4j msal-exchange module is the on-prem BFF login bridge
for SPA deployments that use Microsoft Authentication Library SSO. The browser
uses MSAL.js to obtain a Microsoft token, sends that token to the gateway, and
the gateway exchanges it for an internal light-oauth token set. After exchange,
the browser session behaves the same as the stateless authorization-code
handler: internal tokens are stored in cookies, CSRF is validated on subsequent
requests, refresh tokens keep the session alive, and the gateway injects
Authorization: Bearer <internal-token> before routing downstream.
In light-fabric this should be a light-pingora security handler in
light-gateway. It should share most of its implementation with
stateless-auth.md; only the initial login exchange differs.
Goals
- Preserve the Java MSAL token-exchange flow.
- Keep msal-exchange.yml field names recognizable for light-portal and config-server product configuration.
- Validate the incoming Microsoft token with a separate security-msal.yml runtime before token exchange.
- Exchange the Microsoft token with light-oauth using client.yml oauth.token.token_exchange.
- Store the returned internal token set in the same Java-compatible cookies as the stateless handler.
- Share CSRF validation, cookie writing, logout, refresh-token renewal, and downstream Authorization injection with the stateless handler.
- Add a stable msal-exchange handler id to light-gateway.
- Register loaded config in ModuleRegistry and fail startup on invalid active configuration.
Non-Goals
- Do not forward the Microsoft token to downstream services after exchange.
- Do not implement a server-side browser session store.
- Do not merge MSAL token validation into the normal downstream security handler. MSAL validation applies only to the exchange endpoint.
- Do not invent a REST-specific tokenization or portal-service client in this handler. The only outbound call is the OAuth token-exchange request.
- Do not require a separate BFF binary.
Resolved Decisions
- Support subjectTokenType in both client.yml and msal-exchange.yml. The handler-specific value takes precedence when set, and client.yml remains the shared OAuth token-exchange default.
- Support strict Microsoft token validation in security-msal.yml when a deployment needs issuer and audience checks.
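The subjectTokenType precedence can be sketched as follows. resolve_subject_token_type is a hypothetical name, and falling back to the JWT token-type URN when neither file sets a value is an assumption based on the value used elsewhere in this document.

```rust
/// Assumed fallback when neither config file provides a value.
pub const DEFAULT_SUBJECT_TOKEN_TYPE: &str = "urn:ietf:params:oauth:token-type:jwt";

/// Treat blank strings as unset.
fn non_blank(value: Option<&str>) -> Option<&str> {
    value.map(str::trim).filter(|v| !v.is_empty())
}

/// The msal-exchange.yml value wins; client.yml is the shared default.
pub fn resolve_subject_token_type(handler: Option<&str>, client: Option<&str>) -> String {
    non_blank(handler)
        .or_else(|| non_blank(client))
        .unwrap_or(DEFAULT_SUBJECT_TOKEN_TYPE)
        .to_string()
}
```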
Java Behavior To Map
Java config file:
enabled: ${msal-exchange.enabled:true}
exchangePath: ${msal-exchange.exchangePath:/auth/ms/exchange}
logoutPath: ${msal-exchange.logoutPath:/auth/ms/logout}
cookieDomain: ${msal-exchange.cookieDomain:localhost}
cookiePath: ${msal-exchange.cookiePath:/}
cookieSecure: ${msal-exchange.cookieSecure:false}
sessionTimeout: ${msal-exchange.sessionTimeout:3600}
rememberMeTimeout: ${msal-exchange.rememberMeTimeout:604800}
Java also loads a separate security config named security-msal:
SecurityConfig.load("security-msal")
This config verifies the incoming Microsoft token. The normal security.yml
runtime verifies/parses internal light-oauth access tokens used in cookies.
Java request behavior:
- exchangePath, normally /auth/ms/exchange, requires Authorization: Bearer <microsoft-token>.
- Missing bearer token returns ERR11000.
- The handler verifies the Microsoft token with security-msal.yml.
- Verification failure returns ERR10000.
- The handler generates a CSRF value and sends an OAuth token-exchange request with the Microsoft token as subject_token.
- Token-exchange failure returns ERR11001.
- On success, the handler sets the same BFF cookies as the stateless handler and returns JSON containing scopes.
- logoutPath, normally /auth/ms/logout, clears BFF cookies and ends the request.
- Subsequent requests use the same cookie, CSRF, refresh, and downstream Authorization injection flow as the stateless handler.
Java error codes to preserve:
| Code | Meaning |
|---|---|
| ERR11000 | Microsoft bearer token is missing |
| ERR11001 | Internal token exchange failed |
| ERR10000 | Incoming Microsoft token or returned internal token is invalid |
| ERR10036 | CSRF token is missing from request |
| ERR10038 | CSRF claim is missing from JWT |
| ERR10039 | Request CSRF and JWT CSRF do not match |
Rust Architecture
Use the shared SPA auth runtime described in stateless-auth.md.
Proposed modules:
frameworks/light-pingora/src/spa_auth.rs
frameworks/light-pingora/src/msal_exchange.rs
msal_exchange.rs owns only the Microsoft-token exchange entrypoint:
pub struct MsalExchangeConfig {
    pub enabled: bool,
    pub exchange_path: String,
    pub logout_path: String,
    pub cookie_domain: String,
    pub cookie_path: String,
    pub cookie_secure: bool,
    pub session_timeout: u64,
    pub remember_me_timeout: u64,
    pub renew_before_seconds: u64,
    pub subject_token_type: String,
}

pub struct MsalExchangeRuntime {
    pub config: MsalExchangeConfig,
    pub session: SpaSessionRuntime,
    pub msal_security: SecurityRuntime,
}
Use msal-exchange.yml as the primary file name and accept
msal-exchange.yaml as a compatibility fallback.
The SecurityRuntime loader should be generalized so the MSAL handler can load
a named security config:
load_security_runtime_from_file(
    runtime_config,
    "security-msal.yml",
    "light-pingora/security-msal",
    "security-msal",
    active,
)
That keeps normal downstream JWT behavior on security.yml while the exchange
endpoint validates Microsoft tokens against security-msal.yml.
Handler Registration
Add msal-exchange to apps/light-gateway handler descriptors as a security
handler:
("msal-exchange", PingoraHandlerKind::Security)
The primary handler id should be msal-exchange. No @alias syntax is
needed. An additional short alias such as msal can be added later only if a
real product config needs it.
Runtime loading should follow the existing active-handler model:
let msal_exchange = load_msal_exchange_runtime(
    config,
    active_handlers.is_handler_active("msal-exchange"),
)?;
If the handler is not active in handler.yml, no MSAL config is required. If
the handler is active and its config is invalid, startup should fail. If
enabled: false, register the disabled module and return None.
Example chain:
handlers:
  - exception
  - cors
  - msal-exchange
  - header
  - prefix
  - token
  - router
chains:
  bff:
    - exception
    - cors
    - msal-exchange
    - header
    - prefix
    - token
    - router
  websocket:
    - exception
    - msal-exchange
    - security
    - websocket
paths:
  - path: /auth/ms/exchange
    method: POST
    exec:
      - bff
  - path: /auth/ms/logout
    method: GET
    exec:
      - bff
Exchange Flow
For exchangePath:
POST /auth/ms/exchange
Authorization: Bearer <microsoft-token>
-> extract bearer token
-> verify Microsoft token with security-msal.yml
-> generate csrf
-> call light-oauth token endpoint with token-exchange grant
-> verify/parse returned internal access token
-> set BFF cookies
-> return { "scopes": [...] }
The token-exchange request should use client.yml
oauth.token.token_exchange:
- oauth.token.server_url or oauth.token.serviceId
- oauth.token.enableHttp2
- oauth.token.token_exchange.uri
- oauth.token.token_exchange.client_id
- oauth.token.token_exchange.client_secret
- oauth.token.token_exchange.scope
- oauth.token.token_exchange.subjectTokenType as the default subject token type when the handler config does not override it
The form body should match Java and the http-client composer:
grant_type=urn:ietf:params:oauth:grant-type:token-exchange
subject_token=<microsoft-token>
subject_token_type=urn:ietf:params:oauth:token-type:jwt
csrf=<generated csrf>
requested_token_type=<optional requested token type>
audience=<optional audience>
scope=<space separated scopes, if configured>
The handler should set Authorization: Basic <base64(client_id:client_secret)> on
the outbound token-exchange request.
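HTTP Basic credentials are the Base64 encoding of client_id:client_secret. The sketch below builds that header value with a small hand-written encoder so it stays dependency-free; a real implementation would more likely use an existing Base64 crate. base64_encode and basic_auth_header are hypothetical names.

```rust
const B64: &[u8; 64] = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Standard Base64 with '=' padding.
fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(B64[(n >> 18) as usize & 63] as char);
        out.push(B64[(n >> 12) as usize & 63] as char);
        out.push(if chunk.len() > 1 { B64[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { B64[n as usize & 63] as char } else { '=' });
    }
    out
}

/// Value for the Authorization header on the token-exchange request.
pub fn basic_auth_header(client_id: &str, client_secret: &str) -> String {
    let credentials = format!("{}:{}", client_id, client_secret);
    format!("Basic {}", base64_encode(credentials.as_bytes()))
}
```

This sketch does not form-encode the client id and secret before Base64 encoding, which OAuth 2.0 asks for when they contain reserved characters.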
Session Validation Flow
After exchange, MSAL and stateless auth must use the same downstream request flow:
request
-> read accessToken cookie
-> verify/parse internal JWT with security.yml
-> validate CSRF from request against JWT csrf claim
-> refresh internal token when it is inside the renew window
-> inject Authorization: Bearer <internal-access-token>
-> continue handler chain
CSRF source order should be identical to the stateless handler:
- X-CSRF-TOKEN header.
- Sec-WebSocket-Protocol value starting with the prefix "csrf." when the request has Sec-WebSocket-Key and Sec-WebSocket-Version.
- Query parameter csrf.
The MSAL handler must never inject the Microsoft token downstream. The only downstream bearer token after login is the internal light-oauth token.
Internal JWT Verification
MSAL exchange should use the same lower-level token verifier as stateless auth
for internal cookie tokens. It should not use the request-oriented
verify_jwt_request wrapper because the token source is a cookie, not an
Authorization header.
The shared verifier should validate signature and key material from
security.yml, parse claims for CSRF and user cookies, and support an
expiry-mode option so the refresh path can inspect tokens close to expiry
without treating that as a downstream API authentication success.
Cookies
MSAL exchange should use the same cookie contract as stateless auth:
| Cookie | HttpOnly | Source |
|---|---|---|
| accessToken | true | Internal OAuth access token |
| refreshToken | true | Internal OAuth refresh token |
| csrf | false | Generated CSRF value |
| userId | false | JWT uid claim |
| userType | false | JWT userType claim |
| roles | false | Base64-encoded JWT role claim, default user |
| host | false | JWT host claim |
| email | false | JWT eml claim |
| eid | false | JWT eid claim |
For Java parity, keep cookieSecure defaulting to false in
msal-exchange.yml, but production config should set it to true when the BFF
is served over HTTPS.
Rust should share the logout improvement from stateless auth: always emit deletion cookies for known cookie names rather than only clearing cookies that were present on the request.
Security Config
security-msal.yml should be treated as an active handler dependency when
msal-exchange is active. Missing or invalid config should fail startup
because the gateway would otherwise accept an exchange endpoint without a
working Microsoft-token verifier.
Recommended distinction:
- security-msal.yml: verifies the incoming Microsoft token on exchangePath.
- security.yml: verifies/parses internal light-oauth tokens in BFF cookies and is also used by normal API security handlers.
The Java code skips audience verification for MSAL in the current call path.
Rust should preserve compatibility unless security-msal.yml explicitly
configures audience validation support. That keeps on-prem deployments working
when the Microsoft token audience is the SPA client id rather than the BFF.
When a product requires stricter validation, security-msal.yml should be able
to require issuer and audience checks for the incoming Microsoft token. The
initial implementation can add these checks to the named SecurityRuntime
loader as optional fields:
issuer: ${security-msal.issuer:}
audience: ${security-msal.audience:}
Blank values preserve the Java-compatible relaxed behavior. Non-blank values must be enforced during exchange-path token verification, and invalid issuer/audience should return the same invalid-token error path as other Microsoft token verification failures.
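The blank-versus-strict rule can be sketched as a claim check that runs after signature verification. MsalClaimPolicy, MsalClaims, and check_msal_claims are hypothetical names, and returning the ERR10000 code as a string is a simplification of the real rejection type.

```rust
/// Resolved issuer and audience values from security-msal.yml.
pub struct MsalClaimPolicy {
    pub issuer: String,
    pub audience: String,
}

/// Hypothetical subset of the verified Microsoft token claims.
pub struct MsalClaims<'a> {
    pub iss: Option<&'a str>,
    pub aud: Vec<&'a str>,
}

/// Blank policy values skip the check; non-blank values must match.
pub fn check_msal_claims(
    policy: &MsalClaimPolicy,
    claims: &MsalClaims<'_>,
) -> Result<(), &'static str> {
    let issuer = policy.issuer.trim();
    if !issuer.is_empty() && claims.iss != Some(issuer) {
        return Err("ERR10000");
    }
    let audience = policy.audience.trim();
    if !audience.is_empty() && !claims.aud.iter().any(|aud| *aud == audience) {
        return Err("ERR10000");
    }
    Ok(())
}
```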
Config Server Model
Light-portal should manage the product config values and config-server should deliver resolved files:
msal-exchange.exchangePath: /auth/ms/exchange
msal-exchange.logoutPath: /auth/ms/logout
msal-exchange.cookieDomain: localhost
msal-exchange.cookieSecure: true
msal-exchange.subjectTokenType: urn:ietf:params:oauth:token-type:jwt
client.tokenExClientId: ...
client.tokenExClientSecret: ...
client.subjectTokenType: urn:ietf:params:oauth:token-type:jwt
security-msal.issuer: https://login.microsoftonline.com/{tenant-id}/v2.0
security-msal.audience: <spa-client-id>
The gateway consumes only the resolved files:
- handler.yml
- msal-exchange.yml
- security-msal.yml
- security.yml
- client.yml
Implemented Surface
- Shared SPA auth runtime from stateless-auth.md.
- Named SecurityRuntime loading for security-msal.yml.
- Token-exchange support in the shared OAuth token client.
- msal-exchange.yml parsing, module registry registration, active-handler gating, and runtime reload.
- msal-exchange request handling in light-gateway.
- Required bearer-token extraction, Microsoft token validation, token-exchange request, Java-compatible cookie writing, logout, refresh renewal, and downstream internal Authorization injection.
- Optional issuer/audience validation through security-msal.yml.
- Unit/runtime-load coverage for subject-token-type precedence and gateway wiring.
PII Tokenization
Status
Proposed design for migrating the light-tokenization capability into
light-fabric as light-pingora handlers used by light-gateway.
Purpose
PII tokenization protects sensitive employee/customer data when a request is sent from inside the organization to an external cloud service through the gateway. The outbound request replaces configured PII fields with generated tokens. When the cloud response returns, the gateway replaces those tokens with the original cleartext values so internal employees can complete their work.
This is a request/response hot-path concern. The first Rust implementation
should therefore run inside light-gateway and access PostgreSQL directly
instead of making a network call to a tokenization service for every field.
Current Java Behavior
The current light-tokenization service exposes REST endpoints:
- POST /v1/token: body { "schemeId": <int>, "value": "<cleartext>" }; returns a token string. If the value already exists, it returns the existing token.
- GET /v1/token/{token}: returns the cleartext value.
- DELETE /v1/token/{token}: deletes the token mapping.
- GET /v1/scheme and GET /v1/scheme/{id}: return token format schemes.
Startup loads multiple JDBC pools from datasource.yml. One database is named
tokenization; the others are vault databases such as vault000. The
tokenization database maps client_id to a vault database through
client_database. Each vault database has a token_vault table.
Java tokenization flow:
- Read client_id from the JWT audit info.
- Resolve client_id -> db_name.
- Select a vault datasource by db_name.
- For tokenization, look up by cleartext value; return existing id if found.
- If not found, generate a token with the configured schemeId, insert (id, value), cache token -> value, and return the token.
- For detokenization, check the cache first, then query by token id.
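The get-or-create behavior of that flow can be illustrated with an in-memory stand-in for a single resolved vault. InMemoryVault is hypothetical: two HashMaps replace the token_vault table and cache, and a counter replaces the scheme-based token generator, so only the lookup order is representative.

```rust
use std::collections::HashMap;

/// In-memory stand-in for one vault database.
#[derive(Default)]
pub struct InMemoryVault {
    by_value: HashMap<String, String>,
    by_token: HashMap<String, String>,
    next_id: u64,
}

impl InMemoryVault {
    /// Return the existing token for a value, or create and store a new one.
    pub fn tokenize(&mut self, value: &str) -> String {
        if let Some(token) = self.by_value.get(value) {
            return token.clone();
        }
        // Placeholder generator; the real service formats tokens by schemeId.
        self.next_id += 1;
        let token = format!("tok-{:08}", self.next_id);
        self.by_value.insert(value.to_string(), token.clone());
        self.by_token.insert(token.clone(), value.to_string());
        token
    }

    /// Look up the cleartext value for a token.
    pub fn detokenize(&self, token: &str) -> Option<&str> {
        self.by_token.get(token).map(String::as_str)
    }
}
```

Tokenizing the same value twice returns the same token, which is the stable behavior the Java service provides by looking up the cleartext value before inserting.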
The current Java MCP router also uses tokenization through token-client.
Tool input schemas can mark fields with x-tokenize; the router extracts
JsonPath rules from the schema and calls the tokenization service.
Design Direction
Use direct PostgreSQL access for the initial light-fabric implementation.
Reasons:
- It removes one HTTP hop per tokenized field in the gateway hot path.
- It avoids running and scaling another service only to perform local database lookups.
- PostgreSQL connection pooling is already used in nearby light-fabric apps with sqlx.
- The same database will also support other gateway handlers that need local data access, such as vector search for MCP routing.
- Multi-tenancy is cleaner with host_id in the schema than with one vault database per tenant.
If this capability is later exposed as a standalone service, prefer gRPC over MCP for the hot-path service API. gRPC gives a strongly typed protobuf contract, HTTP/2 multiplexing, compact binary payloads, deadlines, and well-understood client pooling. MCP is useful when tokenization is exposed as an agent tool or administrative capability, but it adds JSON-RPC/tooling semantics that are not needed for a low-latency service-to-service data-plane call.
Goals
- Implement TokenizeHandler and DetokenizeHandler in light-pingora.
- Activate handlers only through handler.yml.
- Use one PostgreSQL database with host_id tenant isolation.
- Integrate schema into portal-db/postgres/ddl.sql and future patch files.
- Preserve the Java token schemes and stable tokenization behavior.
- Avoid storing/indexing cleartext PII directly in PostgreSQL.
- Support request-body tokenization before proxy/router sends to the external service.
- Support response-body detokenization before the gateway returns to the internal caller.
- Reuse the same runtime for MCP tool argument tokenization.
Non-Goals
- Do not preserve multiple vault databases.
- Do not preserve MySQL or SQLite runtime support in light-fabric.
- Do not make tokenization an MCP-only service.
- Do not require a separate tokenization service for the first implementation.
- Do not try to tokenize arbitrary binary payloads in the first pass.
Handler Model
Use two public handler ids:
- tokenize: request-phase handler that replaces cleartext fields with tokens.
- detokenize: response-phase handler that replaces configured token fields with cleartext.
Both handlers share one runtime:
frameworks/light-pingora/src/pii_tokenization.rs
Primary types:
pub struct PiiTokenizationConfig {
    pub database: PiiDatabaseConfig,
    pub host_id_claim: String,
    pub max_body_size: usize,
    pub cache: PiiTokenCacheConfig,
    pub crypto: PiiTokenCryptoConfig,
    pub rules: Vec<PiiTokenizationRule>,
}

pub struct PiiTokenizationRuntime {
    pub config: Arc<PiiTokenizationConfig>,
    pub pool: PgPool,
    pub tokenizers: TokenizerRegistry,
    pub value_cache: TokenCache,
    pub token_cache: TokenCache,
    pub keyring: PiiKeyring,
}

pub struct PiiTokenizationRule {
    pub path_prefix: String,
    pub methods: Vec<String>,
    pub request: Vec<PiiFieldRule>,
    pub response: Vec<PiiFieldRule>,
}

pub struct PiiFieldRule {
    pub path: String,
    pub scheme: String,
    pub required: bool,
}
The handler should fail startup if an active config references an unknown scheme, has invalid field paths, cannot initialize the keyring, or cannot connect to PostgreSQL within the configured startup timeout.
Resolved Decisions
- Handler ids are tokenize and detokenize to align with other light-fabric handler names.
- Encrypt stored cleartext with AES-256-GCM. Resolve key material from environment variables first, with direct config values allowed only as a local-development fallback.
- Detokenization fails closed by default when a configured token field cannot be resolved.
- Field selection uses a constrained compiled JsonPath subset rather than full dynamic JsonPath evaluation.
- Cleartext reverse caching is configurable through cache.cacheCleartext.
- Request/response mutation buffers are bounded by configurable maxBodySize.
Handler Chain
For a BFF or gateway that calls an external cloud service:
handlers:
  - correlation
  - security
  - tokenize
  - router
  - detokenize
chains:
  external-cloud:
    - correlation
    - security
    - tokenize
    - router
    - detokenize
paths:
  - path: /claims
    method: POST
    exec:
      - external-cloud
tokenize must run after authentication so it can resolve host_id from the
verified JWT principal. It must run before router or proxy so the external
service never receives cleartext PII. detokenize must run after the upstream
response body is available and before response delivery.
This likely requires extending the existing gateway handler model with a response-body filter phase:
pub trait PingoraBodyHandler {
    async fn request_body_filter(
        &self,
        ctx: &mut GatewayRequestContext,
        body: Bytes,
    ) -> Result<Bytes, HandlerRejection>;

    async fn response_body_filter(
        &self,
        ctx: &mut GatewayRequestContext,
        body: Bytes,
    ) -> Result<Bytes, HandlerRejection>;
}
The first implementation can wire this directly in light-gateway; later it
can be generalized for other body-mutating handlers.
Configuration
Primary file: pii-tokenization.yml.
enabled is not needed. If neither tokenize nor detokenize appears in
handler.yml, this config is not loaded. If either handler is active, the
config is required and invalid config fails startup.
Example:
database:
  url: ${pii-tokenization.database.url:${database.url:}}
  maxConnections: ${pii-tokenization.database.maxConnections:8}
  minConnections: ${pii-tokenization.database.minConnections:1}
  connectTimeoutMs: ${pii-tokenization.database.connectTimeoutMs:2000}
hostIdClaim: ${pii-tokenization.hostIdClaim:host_id}
maxBodySize: ${pii-tokenization.maxBodySize:1048576}
crypto:
  algorithm: ${pii-tokenization.crypto.algorithm:AES-256-GCM}
  keyId: ${pii-tokenization.crypto.keyId:default}
  valueEncryptionKeyEnv: ${pii-tokenization.crypto.valueEncryptionKeyEnv:PII_TOKENIZATION_VALUE_ENCRYPTION_KEY}
  valueHashKeyEnv: ${pii-tokenization.crypto.valueHashKeyEnv:PII_TOKENIZATION_VALUE_HASH_KEY}
  valueEncryptionKey: ${pii-tokenization.crypto.valueEncryptionKey:}
  valueHashKey: ${pii-tokenization.crypto.valueHashKey:}
cache:
  enabled: ${pii-tokenization.cache.enabled:true}
  maxEntries: ${pii-tokenization.cache.maxEntries:10000}
  ttlSeconds: ${pii-tokenization.cache.ttlSeconds:86400}
  cacheCleartext: ${pii-tokenization.cache.cacheCleartext:true}
rules:
  - pathPrefix: /claims
    methods: [POST]
    request:
      - path: $.claimant.ssn
        scheme: LN
        required: false
      - path: $.payment.cardNumber
        scheme: CC4
        required: false
    response:
      - path: $.claimant.ssn
        scheme: LN
        required: false
      - path: $.payment.cardNumber
        scheme: CC4
        required: false
Field paths should support the Java-compatible JsonPath subset used by
mcp-router tokenization rules: object fields and [*] arrays. For
performance and predictable mutation, the Rust implementation should compile
rules at startup and avoid dynamic path parsing on every request.
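As a rough illustration of compiling rules once at startup, the sketch below turns a path string into a list of segments that can be walked without re-parsing. The PathSegment type and compile_path function are hypothetical names, not part of the existing code, and the accepted character set for field names is an assumption.

```rust
/// One step of a compiled field path (hypothetical representation).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum PathSegment {
    /// Descend into an object field by name.
    Field(String),
    /// Visit every element of an array (`[*]`).
    AllElements,
}

/// Compile a path such as `$.claimant.ssn` or `$.items[*].ssn`.
/// Anything outside the object-field and `[*]` subset is rejected,
/// which lets an invalid rule fail startup instead of a request.
pub fn compile_path(path: &str) -> Result<Vec<PathSegment>, String> {
    let rest = path
        .strip_prefix("$.")
        .ok_or_else(|| format!("path must start with `$.`: {}", path))?;
    let mut segments = Vec::new();
    for part in rest.split('.') {
        // Peel trailing `[*]` markers: `items[*]` becomes `items` + AllElements.
        let mut name = part;
        let mut wildcards = 0;
        while let Some(stripped) = name.strip_suffix("[*]") {
            name = stripped;
            wildcards += 1;
        }
        // Assumption: field names are ASCII alphanumerics, `_` or `-`.
        let valid_name = !name.is_empty()
            && name
                .chars()
                .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '-');
        if !valid_name {
            return Err(format!("unsupported segment `{}` in {}", part, path));
        }
        segments.push(PathSegment::Field(name.to_string()));
        for _ in 0..wildcards {
            segments.push(PathSegment::AllElements);
        }
    }
    Ok(segments)
}
```

Indexed access such as `$.items[0]` is rejected on purpose in this sketch, since only object fields and `[*]` arrays are in the supported subset.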
For MCP tools, keep supporting x-tokenize in input schemas. The MCP router
can convert schema annotations into the same compiled field rules and call the
shared PiiTokenizationRuntime directly.
PostgreSQL Schema
Replace the old split between tokenization and vault databases with
tenant-scoped tables in portal-db.
Recommended DDL:
CREATE TABLE pii_token_scheme_t (
scheme_id SMALLINT PRIMARY KEY,
scheme_code VARCHAR(16) NOT NULL UNIQUE,
description TEXT NOT NULL,
active BOOLEAN DEFAULT TRUE NOT NULL,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
update_user VARCHAR(126) DEFAULT SESSION_USER NOT NULL
);
CREATE TABLE pii_token_vault_t (
host_id UUID NOT NULL,
token TEXT NOT NULL,
scheme_id SMALLINT NOT NULL,
value_hash BYTEA NOT NULL,
value_ciphertext BYTEA NOT NULL,
value_nonce BYTEA NOT NULL,
key_id VARCHAR(128) NOT NULL,
created_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
expires_ts TIMESTAMP WITH TIME ZONE,
active BOOLEAN DEFAULT TRUE NOT NULL,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
update_user VARCHAR(126) DEFAULT SESSION_USER NOT NULL,
PRIMARY KEY(host_id, token),
FOREIGN KEY(scheme_id) REFERENCES pii_token_scheme_t(scheme_id)
);
CREATE UNIQUE INDEX pii_token_vault_value_uk
ON pii_token_vault_t(host_id, scheme_id, value_hash)
WHERE active = TRUE;
CREATE INDEX pii_token_vault_expiry_idx
ON pii_token_vault_t(expires_ts)
WHERE expires_ts IS NOT NULL;
Seed schemes:
| Id | Code | Meaning |
|---|---|---|
| 0 | UUID | UUID v4 token |
| 1 | GUID | URL-safe base64 UUID token |
| 2 | LN | Luhn compliant numeric token |
| 3 | N | Random numeric token, length preserving |
| 4 | LN4 | Luhn numeric token retaining last four digits |
| 5 | AN | Random alpha-numeric token, length preserving |
| 6 | AN4 | Alpha-numeric token retaining last four characters |
| 7 | CC | Credit-card-shaped Luhn token retaining first digit |
| 8 | CC4 | Credit-card-shaped Luhn token retaining first and last four digits |
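Several schemes (LN, LN4, CC, CC4) require Luhn-compliant tokens. The sketch below shows only the standard Luhn arithmetic a generator would need: computing the check digit for a numeric payload and validating a full number. The function names are hypothetical, and how the Java schemes pick the random digits is not covered here.

```rust
/// Luhn sum walking right to left. `double_first` says whether the
/// rightmost digit is doubled (true when a check digit will be appended).
fn luhn_sum(digits: &str, double_first: bool) -> u32 {
    let mut double = double_first;
    let mut sum = 0;
    for b in digits.bytes().rev() {
        let mut d = (b - b'0') as u32;
        if double {
            d *= 2;
            if d > 9 {
                d -= 9;
            }
        }
        sum += d;
        double = !double;
    }
    sum
}

fn all_digits(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit())
}

/// Returns true when `number` consists of digits and passes the Luhn check.
pub fn luhn_valid(number: &str) -> bool {
    all_digits(number) && luhn_sum(number, false) % 10 == 0
}

/// Check digit that makes `payload` followed by that digit Luhn-valid.
pub fn luhn_check_digit(payload: &str) -> Option<char> {
    if !all_digits(payload) {
        return None;
    }
    let digit = (10 - luhn_sum(payload, true) % 10) % 10;
    char::from_digit(digit, 10)
}
```

A generator for a scheme such as LN4 could, under these assumptions, draw random digits for the non-retained positions and search for a combination that keeps the retained digits while passing luhn_valid.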
The old database_owner and client_database tables are not needed. Tenant
isolation is by host_id, resolved from the authenticated request. If a
legacy client only has client_id, handle that with normal portal auth/client
metadata rather than recreating tokenization-specific vault routing.
Cleartext Storage
The Java schema stores cleartext PII in token_vault.value and indexes it.
The Rust schema should not.
Use:
- value_hash: deterministic HMAC-SHA-256 of (host_id, scheme_id, canonical_value) with valueHashKey; used for idempotent token lookup.
- value_ciphertext and value_nonce: encrypted cleartext value, for example AES-GCM or ChaCha20-Poly1305 with valueEncryptionKey.
- key_id: identifies which key encrypted the row so key rotation is possible.
This keeps tokenization idempotent without indexing cleartext PII.
Tokenization Algorithm
Shared runtime operation:
tokenize(host_id, scheme_id, value)
-> canonicalize value
-> compute value_hash
-> cache lookup by (host_id, scheme_id, value_hash)
-> SELECT token WHERE host_id, scheme_id, value_hash, active
-> if found, cache and return
-> generate scheme-specific token
-> encrypt cleartext
-> INSERT row
-> on token collision, retry generation
-> on value_hash conflict, SELECT existing token and return it
Use PostgreSQL uniqueness instead of application locks:
INSERT INTO pii_token_vault_t (...)
VALUES (...)
ON CONFLICT DO NOTHING;
If no row is inserted, determine whether the conflict was on
(host_id, token) or (host_id, scheme_id, value_hash). Token collision means
retry with a new token. Value conflict means another request already inserted
the mapping; select and return the existing token.
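The conflict handling can be sketched with an in-memory stand-in for the two uniqueness rules. MemoryVault, Conflict, and the u128 host id are hypothetical simplifications for illustration; the real runtime would issue the INSERT through sqlx and classify the conflict with a follow-up SELECT.

```rust
use std::collections::HashMap;

/// Which uniqueness rule rejected an insert.
#[derive(Debug, PartialEq, Eq)]
pub enum Conflict {
    Token,     // (host_id, token) already exists
    ValueHash, // (host_id, scheme_id, value_hash) already exists
}

/// In-memory stand-in for pii_token_vault_t; not the real storage layer.
#[derive(Default)]
pub struct MemoryVault {
    by_token: HashMap<(u128, String), Vec<u8>>,
    by_value: HashMap<(u128, i16, Vec<u8>), String>,
}

impl MemoryVault {
    /// Models INSERT ... ON CONFLICT DO NOTHING plus conflict classification.
    pub fn try_insert(&mut self, host: u128, scheme: i16, hash: &[u8], token: &str) -> Result<(), Conflict> {
        if self.by_value.contains_key(&(host, scheme, hash.to_vec())) {
            return Err(Conflict::ValueHash);
        }
        if self.by_token.contains_key(&(host, token.to_string())) {
            return Err(Conflict::Token);
        }
        self.by_token.insert((host, token.to_string()), hash.to_vec());
        self.by_value.insert((host, scheme, hash.to_vec()), token.to_string());
        Ok(())
    }

    /// Idempotent tokenize: reuse an existing token, otherwise insert,
    /// retrying with a fresh candidate when the token string collides.
    pub fn tokenize<F: FnMut() -> String>(
        &mut self,
        host: u128,
        scheme: i16,
        hash: &[u8],
        mut generate: F,
        max_attempts: usize,
    ) -> Option<String> {
        let key = (host, scheme, hash.to_vec());
        if let Some(existing) = self.by_value.get(&key) {
            return Some(existing.clone());
        }
        for _ in 0..max_attempts {
            let candidate = generate();
            match self.try_insert(host, scheme, hash, &candidate) {
                Ok(()) => return Some(candidate),
                // Another writer inserted the same value first: return its token.
                Err(Conflict::ValueHash) => return self.by_value.get(&key).cloned(),
                // Token string already used in this tenant: try a new one.
                Err(Conflict::Token) => continue,
            }
        }
        None
    }
}
```

The max_attempts bound is an added assumption so that a misbehaving generator cannot loop forever; the source text does not specify a retry limit.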
Detokenization:
detokenize(host_id, token)
-> cache lookup by (host_id, token)
-> SELECT encrypted value WHERE host_id, token, active
-> decrypt cleartext
-> cache and return
If token is not found, the handler fails the response with a handler error. For gateway response-body detokenization, fail closed so employees do not see partial or incorrect data without a signal.
Runtime Caching
Use bounded in-process caches:
- (host_id, scheme_id, value_hash) -> token
- (host_id, token) -> cleartext
The cache must be tenant-scoped and bounded by count and TTL. Because the reverse cache contains cleartext PII, make it configurable and register it with the runtime cache registry only with masked summaries. A clear-cache operation should be available through the runtime control plane.
The cache is an optimization only. PostgreSQL remains the source of truth.
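A minimal sketch of a tenant-scoped cache bounded by entry count and TTL is shown below. BoundedTtlCache is a hypothetical type, the u128 host id stands in for the UUID, and evicting the entry closest to expiry is an assumed policy; a production cache would likely use a concurrent structure rather than a plain HashMap.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Tenant-scoped cache bounded by entry count and TTL (illustrative only).
pub struct BoundedTtlCache {
    max_entries: usize,
    ttl: Duration,
    // (host_id, key) -> (value, expires_at)
    entries: HashMap<(u128, String), (String, Instant)>,
}

impl BoundedTtlCache {
    pub fn new(max_entries: usize, ttl: Duration) -> Self {
        BoundedTtlCache { max_entries, ttl, entries: HashMap::new() }
    }

    /// Returns the cached value if present and not expired at `now`.
    pub fn get(&mut self, host: u128, key: &str, now: Instant) -> Option<String> {
        let k = (host, key.to_string());
        let expired = match self.entries.get(&k) {
            Some((value, expires_at)) if *expires_at > now => return Some(value.clone()),
            Some(_) => true,
            None => false,
        };
        if expired {
            self.entries.remove(&k);
        }
        None
    }

    /// Inserts a value, evicting expired entries first and then the
    /// entry closest to expiry when the cache is still full.
    pub fn insert(&mut self, host: u128, key: &str, value: String, now: Instant) {
        if self.max_entries == 0 {
            return;
        }
        let k = (host, key.to_string());
        if !self.entries.contains_key(&k) && self.entries.len() >= self.max_entries {
            self.entries.retain(|_, entry| entry.1 > now);
            if self.entries.len() >= self.max_entries {
                let victim = self
                    .entries
                    .iter()
                    .min_by_key(|(_, entry)| entry.1)
                    .map(|(key, _)| key.clone());
                if let Some(victim) = victim {
                    self.entries.remove(&victim);
                }
            }
        }
        self.entries.insert(k, (value, now + self.ttl));
    }

    /// Clear-cache operation, for example from the runtime control plane.
    pub fn clear(&mut self) {
        self.entries.clear();
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

Passing now explicitly keeps the sketch deterministic in tests. Because the tenant id is part of the key, a token cached for one host can never be returned for another.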
Request And Response Mutation
Only mutate supported structured content:
- application/json in phase 1.
- JSON arrays and nested objects through compiled path rules.
- Missing optional fields are ignored.
- Missing required fields reject the request or response with a handler error.
For outbound request tokenization:
- Buffer the JSON request body within a configured max body size.
- Parse to serde_json::Value.
- Apply matching request rules.
- Replace every matched string value with a token.
- Serialize JSON, update Content-Length, and forward upstream.
For inbound response detokenization:
- Buffer the JSON response body within a configured max body size.
- Parse to serde_json::Value.
- Apply matching response rules.
- Replace every matched string token with cleartext.
- Serialize JSON, update Content-Length, and return downstream.
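Payloads that exceed the configured max body size, and streaming payloads, are not mutated; by default the handler fails closed and rejects them instead of forwarding unmodified data. Streaming tokenization can be considered later only if a real product requires it.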
Security
- Require a verified JWT principal before tokenization.
- Resolve host_id from a configured claim, default host_id.
- Reject active tokenization if host_id is missing.
- Do not log cleartext values, generated tokens, value hashes, ciphertext, or keys.
- Mask crypto keys in module registry summaries.
- Use least-privilege PostgreSQL credentials: only select/insert/update on the tokenization tables.
- Prefer encrypted cleartext storage, not plaintext value.
- Keep tokens scoped by host_id; the same token string in another tenant does not detokenize.
Future Service API
The direct database implementation should be the first production path. However, keep the core API independent from Pingora:
#[async_trait]
pub trait PiiTokenVault: Send + Sync {
    async fn tokenize(&self, host_id: Uuid, scheme_id: i16, value: &str) -> Result<String, PiiTokenError>;
    async fn detokenize(&self, host_id: Uuid, token: &str) -> Result<String, PiiTokenError>;
}
Then a future service can wrap the same trait.
Protocol recommendation:
- gRPC for request-path service-to-service tokenization if a standalone service becomes necessary.
- MCP only as an optional tool surface for agents or administrative workflows.
- REST/JSON-RPC only for compatibility or operational simplicity, not the preferred low-latency path.
The gRPC API can be very small:
service PiiTokenization {
rpc Tokenize(TokenizeRequest) returns (TokenizeResponse);
rpc Detokenize(DetokenizeRequest) returns (DetokenizeResponse);
rpc BatchTokenize(BatchTokenizeRequest) returns (BatchTokenizeResponse);
rpc BatchDetokenize(BatchDetokenizeRequest) returns (BatchDetokenizeResponse);
}
Batch operations are important if a future remote service is used; otherwise per-field network calls will dominate latency.
Implementation Phases
- Add portal-db DDL and seed data for pii_token_scheme_t and pii_token_vault_t.
- Add a light-pingora shared tokenization runtime with sqlx::PgPool, scheme registry, value hashing, encryption, token generation, and tests.
- Add pii-tokenization.yml loader, module registry registration, and runtime reload.
- Add gateway request-body and response-body filter support.
- Implement tokenize and detokenize handler wiring in light-gateway.
- Integrate MCP x-tokenize with the same runtime so MCP tools do not call a hardcoded tokenization service.
- Add optional gRPC service wrapper only if deployment needs a separate tokenization service.
Remaining Considerations
- KMS or light-portal managed keys can be added later, but the first implementation should read the configured environment variables before any resolved config fallback.
- Products that disable cache.cacheCleartext will still use PostgreSQL as the source of truth, with higher detokenization latency.
Token Handler
Status
Proposed design for migrating the Java egress-router TokenHandler into
light-fabric as the token handler used by light-pingora and
light-gateway.
A baseline Rust token runtime already exists in light-pingora. This document
captures the Java behavior, the compatibility contract, and the design direction
for hardening it for gateway and sidecar deployments.
Purpose
The token handler obtains an OAuth 2.0 client credentials access token on behalf
of the backend service in the sidecar or gateway egress path. The token is then
attached to the outbound request before router or proxy sends the request to
the downstream API.
This is different from the PII tokenize and detokenize handlers. The
token handler deals only with service-to-service OAuth tokens.
Java Behavior To Map
The Java implementation is centered on:
- egress-router/.../TokenHandler.java
- sidecar/.../SidecarTokenHandler.java
- router-config/.../TokenConfig.java
- client-config/.../client.yaml
- sidecar-config/.../sidecar.yml
token.yml controls whether the handler is active and which request paths need
token injection:
enabled: ${token.enabled:false}
appliedPathPrefixes: ${token.appliedPathPrefixes:}
The OAuth provider, client credentials, cache, timeout, proxy, HTTP/2, and
single-vs-multiple-auth-server settings live in client.yml:
oauth:
  multipleAuthServers: ${client.multipleAuthServers:false}
  token:
    cache:
      capacity: ${client.tokenCacheCapacity:200}
    tokenRenewBeforeExpired: ${client.tokenRenewBeforeExpired:60000}
    expiredRefreshRetryDelay: ${client.expiredRefreshRetryDelay:2000}
    earlyRefreshRetryDelay: ${client.earlyRefreshRetryDelay:30000}
    server_url: ${client.tokenServerUrl:}
    serviceId: ${client.tokenServiceId:}
    proxyHost: ${client.tokenProxyHost:}
    proxyPort: ${client.tokenProxyPort:}
    enableHttp2: ${client.tokenEnableHttp2:true}
    client_credentials:
      uri: ${client.tokenCcUri:/oauth2/token}
      client_id: ${client.tokenCcClientId:}
      client_secret: ${client.tokenCcClientSecret:}
      scope: ${client.tokenCcScope:}
      serviceIdAuthServers: ${client.tokenCcServiceIdAuthServers:}
pathPrefixServices: ${client.pathPrefixServices:}
request:
  connectTimeout: ${client.connectTimeout:2000}
  timeout: ${client.timeout:4000}
The Java request flow is:
- Reload token.yml for the request.
- Check appliedPathPrefixes with a string prefix match.
- Read service_id from the request. This header is expected to be set by PathPrefixServiceHandler or ServiceDictHandler.
- Resolve the auth server configuration from client.yml.
- Get or refresh a cached client credentials JWT for the service.
- If the request has no Authorization header, set Authorization: Bearer <token>.
- If the request already has Authorization, preserve it and set X-Scope-Token: Bearer <token>.
- Continue to the next handler, usually router.
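The header rule in the last steps can be sketched as a small function. The BTreeMap header representation and the attach_token name are hypothetical simplifications; a Pingora handler would operate on its own request header type.

```rust
use std::collections::BTreeMap;

/// Attach a client credentials token using the Java-compatible rule:
/// fill Authorization when it is absent, otherwise leave the caller's
/// Authorization untouched and put the token in X-Scope-Token.
pub fn attach_token(headers: &mut BTreeMap<String, String>, access_token: &str) {
    let bearer = format!("Bearer {}", access_token);
    // HTTP header names are case-insensitive, so compare accordingly.
    let has_authorization = headers
        .keys()
        .any(|name| name.eq_ignore_ascii_case("authorization"));
    if has_authorization {
        headers.insert("X-Scope-Token".to_string(), bearer);
    } else {
        headers.insert("Authorization".to_string(), bearer);
    }
}
```

Keeping the caller's Authorization intact means a downstream API can still see the original subject token, while the scope token travels in its own header.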
For multiple auth servers, Java reads
oauth.token.client_credentials.serviceIdAuthServers[service_id] and enriches
that entry with the global token defaults. For a single auth server, it uses the
global oauth.token.client_credentials section.
The Java cache is a static map keyed by service_id. The cached Jwt stores
the access token and its exp claim in milliseconds. OauthHelper refreshes
synchronously after expiry and attempts async refresh while the token is in the
renewal window.
SidecarTokenHandler adds an egress gate before calling TokenHandler:
- sidecar.egressIngressIndicator: header runs the token handler only when the request has service_id or service_url.
- sidecar.egressIngressIndicator: protocol runs the token handler for HTTP requests, which is the usual in-pod sidecar egress protocol.
- Any other value skips token injection.
The base Java TokenHandler still needs service_id to choose the service
token. A request with only service_url can identify egress traffic, but it
does not by itself select a service-specific token.
Goals
- Preserve the Java configuration files: token.yml and client.yml.
- Activate the handler with the existing token id in handler.yml.
- Support config-server injection for token.enabled, token.appliedPathPrefixes, client.multipleAuthServers, client.tokenCcServiceIdAuthServers, sidecar.egressIngressIndicator, and the rest of the client.yml token fields.
- Support single auth server and per-service auth server configurations.
- Support token endpoint discovery through oauth.token.serviceId when a direct server_url is not configured.
- Preserve the Java header behavior for Authorization and X-Scope-Token.
- Keep token retrieval fast and safe for request-path execution.
- Register configuration and token cache state with the module registry and runtime cache registry without exposing token or secret values.
- Keep the design usable by light-gateway, future sidecar products, and BFF deployments that need to call downstream APIs.
Non-Goals
- Do not use inventory or dynamic plugins. Handler availability is compiled into the binary; handler activation is controlled by handler.yml.
- Do not implement authorization code, refresh token, or token exchange in this handler. This handler only performs client_credentials.
- Do not migrate Java SAMLTokenHandler as part of this design.
- Do not use the PII tokenization table or handlers. token, tokenize, and detokenize are separate concerns.
- Do not send the generated access token to logs, metrics labels, module registry output, or cache summaries.
Resolved Decisions
- Use sidecar.yml to differentiate inbound proxy traffic from outbound router traffic before applying token injection.
- Implement refresh with the same concurrency model as Java http-client: synchronize refresh per cached token, refresh expired tokens synchronously, refresh valid tokens in the renewal window asynchronously, and use retry windows to prevent repeated failed refresh attempts.
Handler Chain
The token handler must run after service resolution and before egress routing:
handlers:
  - correlation
  - security
  - path-prefix-service
  - token
  - router
chains:
  sidecar-egress:
    - correlation
    - security
    - path-prefix-service
    - token
    - router
paths:
  - path: /v1/pets
    method: GET
    exec:
      - sidecar-egress
path-prefix-service sets service_id from path configuration. token uses
that service id to resolve and cache the client credentials token. router
uses the same service id to select the downstream API target and should remove
routing-only headers before forwarding.
For products where only some outbound APIs need a scope token, keep one chain
with token and another without it, or use token.appliedPathPrefixes to
limit token injection inside a shared chain.
Rust Architecture
Keep the implementation in light-pingora because token injection is a
request-path gateway handler. light-gateway wires the handler into the
existing chain execution model.
Primary Rust module:
frameworks/light-pingora/src/token.rs
Primary types:
pub struct TokenHandlerConfig {
    pub enabled: bool,
    pub applied_path_prefixes: Vec<String>,
}

pub struct ClientTokenConfig {
    pub tls: ClientTlsConfig,
    pub oauth: ClientOauthConfig,
    pub path_prefix_services: BTreeMap<String, String>,
    pub request: ClientRequestConfig,
}

pub struct TokenRuntime {
    handler: TokenHandlerConfig,
    sidecar: SidecarTrafficConfig,
    client: ClientTokenConfig,
    cache: Arc<TokenCache>,
    registry_client: Option<Arc<PortalRegistryClient>>,
}
apps/light-gateway should load TokenRuntime only when the matched handler
configuration contains token. For Java compatibility, token.yml still has
enabled; therefore the handler is effective only when both conditions are
true:
handler.yml contains token
token.yml enabled is true
If token.yml enables the handler, client.yml is required and invalid
configuration should fail startup. sidecar.yml is also loaded into the token
runtime so the same handler chain can distinguish inbound proxy requests from
outbound router requests. Invalid reloads should be rejected while the last
valid runtime keeps serving traffic.
Request Flow
The Rust request flow should be:
- Resolve the active handler chain for the path and method.
- When token is encountered, check TokenHandlerConfig.enabled.
- Evaluate sidecar.yml and skip token injection for inbound proxy traffic.
- Check appliedPathPrefixes with boundary-aware matching. /v1/address should match /v1/address/123, but not /v1/address2.
- Resolve the token service id:
  - first from the service_id request header,
  - then from client.yml pathPrefixServices,
  - then from oauth.token.serviceId for single-auth-server token endpoint discovery when applicable.
- Resolve the token endpoint:
  - use direct server_url first,
  - otherwise discover oauth.token.serviceId through portal registry.
- Select client credentials:
  - for single auth server, use oauth.token.client_credentials,
  - for multiple auth servers, require client_credentials.serviceIdAuthServers[service_id] and merge it with global token defaults.
- Look up the token cache.
- Fetch a new token when the cache is missing, expired, or inside the refresh window.
- Add Authorization or X-Scope-Token using the Java-compatible rule.
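The boundary-aware prefix check in the flow above could look like the following sketch. The function names are hypothetical, the path is assumed to exclude the query string, and treating an empty prefix as "no match" is an assumption rather than documented behavior.

```rust
/// True when `path` equals `prefix` or continues it at a `/` boundary.
/// `/v1/address` matches `/v1/address/123` but not `/v1/address2`.
pub fn path_matches_prefix(path: &str, prefix: &str) -> bool {
    // Assumption: an empty configured prefix never matches.
    if prefix.is_empty() {
        return false;
    }
    // Ignore a trailing slash on the configured prefix.
    let prefix = prefix.trim_end_matches('/');
    match path.strip_prefix(prefix) {
        Some(rest) => rest.is_empty() || rest.starts_with('/'),
        None => false,
    }
}

/// True when any configured prefix applies to the request path.
pub fn applies_to_path(path: &str, prefixes: &[String]) -> bool {
    prefixes.iter().any(|p| path_matches_prefix(path, p))
}
```

Compared with a raw starts-with test, the extra boundary condition is what keeps a rule for one resource from silently covering a sibling resource with a longer name.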
The outbound token request should be Java-compatible:
POST {server_url}{uri}
Content-Type: application/x-www-form-urlencoded
Accept: application/json
Authorization: Basic base64(client_id:client_secret)
grant_type=client_credentials&scope=...
The response must contain access_token. Expiry should be derived from the JWT
exp claim when available, with expires_in as a fallback for non-JWT token
servers.
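As an illustration of the request body and the expiry fallback, the sketch below builds the form-encoded body and derives an expiry in epoch milliseconds. The function names are hypothetical, and omitting the scope parameter when no scope is configured is an assumption; the Basic Authorization header is left out because base64 encoding is not in the standard library.

```rust
/// application/x-www-form-urlencoded encoding of one value:
/// alphanumerics and `*-._` pass through, space becomes `+`,
/// every other byte is percent-encoded.
fn form_encode(value: &str) -> String {
    let mut out = String::new();
    for b in value.bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'*' | b'-' | b'.' | b'_' => {
                out.push(b as char)
            }
            b' ' => out.push('+'),
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}

/// Body of the client credentials token request.
/// Assumption: the scope parameter is omitted when no scope is configured.
pub fn client_credentials_body(scope: Option<&str>) -> String {
    match scope {
        Some(s) if !s.is_empty() => {
            format!("grant_type=client_credentials&scope={}", form_encode(s))
        }
        _ => "grant_type=client_credentials".to_string(),
    }
}

/// Expiry in epoch milliseconds: prefer the JWT `exp` claim (seconds),
/// fall back to `expires_in` (seconds from now), otherwise None so the
/// caller can fail closed.
pub fn token_expiry_ms(
    jwt_exp_seconds: Option<u64>,
    expires_in_seconds: Option<u64>,
    now_ms: u64,
) -> Option<u64> {
    match (jwt_exp_seconds, expires_in_seconds) {
        (Some(exp), _) => Some(exp.saturating_mul(1000)),
        (None, Some(ttl)) => Some(now_ms.saturating_add(ttl.saturating_mul(1000))),
        (None, None) => None,
    }
}
```

Returning None when neither value is present lines up with treating a response that has no usable expiry as a failure.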
Cache And Refresh
Use a bounded async cache owned by TokenRuntime.
The cache key should include both service id and scope:
pub struct TokenCacheKey {
    pub service_id: Option<String>,
    pub scope: Option<String>,
}
This is stricter than the Java Map<String, Jwt> keyed only by service_id
and avoids collisions when the same service uses multiple scope sets.
Refresh policy:
- If the token is valid and outside the renewal window, use the cached token.
- If the token is expired, synchronize on that cache entry and refresh synchronously. Concurrent requests for the same service and scope should wait on the same per-entry lock, then re-check the refreshed token instead of making duplicate token endpoint calls.
- If the token is expired but another failed refresh attempt is still inside expiredRefreshRetryDelay, fail closed with a token-not-available rejection.
- If the token is in the renewal window but not expired, return the current token and start one background refresh for that cache entry when no refresh is already running and earlyRefreshRetryDelay has elapsed.
- Keep refresh state per cached token: token string, expiry, scope, renewing, expired_retry_timeout, and early_retry_timeout.
This intentionally mirrors Java OauthHelper.populateCCToken. The Rust
implementation should use tokio locks/tasks instead of Java synchronized
and ScheduledExecutorService, but the observable behavior should stay the
same: expired tokens block the current request, early refresh does not block the
current request, and multiple concurrent requests for the same token are
coordinated through one cache entry.
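The refresh policy can be expressed as a pure decision function that the locking and task-spawning code then acts on. TokenState, RefreshAction, and decide_refresh are hypothetical names, and modelling the two retry timeouts as absolute epoch-millisecond deadlines is an assumption about how the retry delays are applied.

```rust
/// What the request path should do with a cached token.
#[derive(Debug, PartialEq, Eq)]
pub enum RefreshAction {
    UseCached,
    /// Expired: refresh under the per-entry lock before continuing.
    RefreshBlocking,
    /// Expired and a recent refresh failed: reject the request.
    FailClosed,
    /// Still valid but in the renewal window: serve it and refresh once.
    UseCachedAndRefreshInBackground,
}

/// Per-entry refresh state (times in epoch milliseconds).
pub struct TokenState {
    pub expires_at_ms: u64,
    pub renewing: bool,
    /// Earliest time a blocking refresh may be retried after a failure.
    pub expired_retry_timeout_ms: u64,
    /// Earliest time a background refresh may be retried after a failure.
    pub early_retry_timeout_ms: u64,
}

pub fn decide_refresh(
    state: &TokenState,
    now_ms: u64,
    renew_before_expired_ms: u64,
) -> RefreshAction {
    if now_ms >= state.expires_at_ms {
        if now_ms < state.expired_retry_timeout_ms {
            RefreshAction::FailClosed
        } else {
            RefreshAction::RefreshBlocking
        }
    } else if now_ms.saturating_add(renew_before_expired_ms) >= state.expires_at_ms {
        if !state.renewing && now_ms >= state.early_retry_timeout_ms {
            RefreshAction::UseCachedAndRefreshInBackground
        } else {
            RefreshAction::UseCached
        }
    } else {
        RefreshAction::UseCached
    }
}
```

Separating the decision from the side effects makes the policy easy to unit test; the caller would still need to re-check the entry after acquiring the per-entry lock, as described above.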
On token.yml or client.yml reload, build a new TokenRuntime and discard
the old cache. This prevents tokens issued with old client credentials or old
scope configuration from being reused after a config change.
Sidecar Egress Gate
The token handler must use sidecar.yml to decide whether the current request
is outbound router traffic or inbound proxy traffic. This allows one gateway or
sidecar process to host both directions while applying token injection only to
egress calls.
Use the Java sidecar.yml contract:
egressIngressIndicator: ${sidecar.egressIngressIndicator:header}
Rust behavior:
- header: run token only when service_id or service_url is present.
- protocol: run token for HTTP requests entering the sidecar listener.
- any other value: skip token injection.
Even with this gate, token selection should still require either a resolved
service id or a single-auth-server configuration that can use a direct
server_url.
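A minimal sketch of the gate, assuming the indicator is compared as an exact lowercase string and that the caller already knows whether the request arrived over plain HTTP. The function name and parameters are hypothetical.

```rust
/// Decide whether the token handler should run for this request,
/// following the three-way rule for egressIngressIndicator.
pub fn should_apply_token(
    indicator: &str,
    has_service_id: bool,
    has_service_url: bool,
    is_plain_http: bool,
) -> bool {
    match indicator {
        // Egress is signalled by routing headers set upstream in the chain.
        "header" => has_service_id || has_service_url,
        // Egress is signalled by the listener protocol.
        "protocol" => is_plain_http,
        // Any other value: skip token injection.
        _ => false,
    }
}
```

This only covers the gate itself; a request that passes it with just service_url still needs a service id or a single-auth-server configuration before a token can be selected.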
The sidecar config should be registered in the module registry as a framework config. Invalid values should fail startup or reject reload.
Configuration Examples
Single auth server:
# sidecar.yml
egressIngressIndicator: ${sidecar.egressIngressIndicator:header}
# token.yml
enabled: ${token.enabled:true}
appliedPathPrefixes: ${token.appliedPathPrefixes:/v1}
# client.yml
oauth:
  multipleAuthServers: false
  token:
    server_url: ${client.tokenServerUrl:https://oauth.example.com}
    tokenRenewBeforeExpired: ${client.tokenRenewBeforeExpired:60000}
    client_credentials:
      uri: ${client.tokenCcUri:/oauth2/token}
      client_id: ${client.tokenCcClientId:gateway-client}
      client_secret: ${client.tokenCcClientSecret:}
      scope: ${client.tokenCcScope:petstore.r petstore.w}
Multiple auth servers:
# client.yml
oauth:
  multipleAuthServers: true
  token:
    tokenRenewBeforeExpired: ${client.tokenRenewBeforeExpired:60000}
    client_credentials:
      uri: /oauth2/token
      serviceIdAuthServers: ${client.tokenCcServiceIdAuthServers:}
pathPrefixServices: ${client.pathPrefixServices:}
The config server can inject client.tokenCcServiceIdAuthServers as YAML or a
JSON string:
com.networknt.petstore-1.0.0:
  server_url: https://oauth-petstore.example.com
  client_id: petstore-client
  client_secret: ${PETSTORE_CLIENT_SECRET}
  scope:
    - petstore.r
    - petstore.w
Rust Improvements Over Java
- Use boundary-aware path prefix matching instead of raw startsWith.
- Include scope in the cache key.
- Mask client_secret and token values in module registry and cache output.
- Fail startup for enabled but invalid token configuration.
- Use Rust async primitives to implement the same per-token synchronized refresh behavior as Java without spawning a dedicated executor per refresh attempt.
- Support direct server_url and portal-registry discovery with the same runtime path.
- Keep all config-server injected values in the normal module registry and reload model.
Observability
Record metrics and logs around the token operation, but never include the token or client secret:
- handler duration for token,
- cache hit, miss, refresh, and failure counts,
- token endpoint latency and HTTP status,
- service id and provider selection,
- refresh retry suppression counts,
- module registry entry for loaded token.yml and masked client.yml,
- runtime cache entry count and expiry summaries without access token strings.
Failure Behavior
Fail closed when token injection is required but cannot be completed:
- missing service_id for multiple auth servers,
- missing serviceIdAuthServers[service_id],
- missing client_id or client_secret,
- no direct server_url and failed token service discovery,
- token endpoint returns non-2xx,
- token response has no access_token,
- token response has neither JWT exp nor expires_in,
- invalid proxy, URL, or TLS configuration.
Requests outside appliedPathPrefixes should bypass the handler without error.
Test Plan
Unit tests in light-pingora:
- parse Java-compatible token.yml and client.yml,
- parse and validate Java-compatible sidecar.yml,
- parse appliedPathPrefixes as YAML list, JSON string list, and comma list,
- parse serviceIdAuthServers as YAML map and JSON string map,
- verify boundary-aware prefix matching,
- verify sidecar.yml header mode applies token only to outbound requests with service_id or service_url,
- verify sidecar.yml protocol mode applies token to HTTP egress traffic,
- verify single auth server option resolution,
- verify multiple auth server option merging,
- verify Authorization versus X-Scope-Token header selection,
- verify cache key includes service id and scope,
- verify token cache summaries never include token strings,
- verify expired token refresh is synchronized across concurrent requests,
- verify early-window refresh returns the current token and starts only one background refresh.
Gateway tests in light-gateway:
- chain with path-prefix-service -> token -> router,
- inbound proxy request skips token injection according to sidecar.yml,
- outbound router request applies token injection according to sidecar.yml,
- missing service id for multiple auth servers returns a handler rejection,
- existing caller Authorization is preserved and scope token is added to X-Scope-Token,
- token runtime reload swaps config and clears old cache,
- inactive token handler does not require token.yml or client.yml.
Integration tests:
- mock OAuth token endpoint with client credentials Basic auth,
- mock discovered token service through portal registry,
- mock downstream service and assert the final outbound headers,
- refresh behavior with expired and near-expiry tokens.
Service Discovery
Status
Implemented baseline.
light-runtime, portal-registry, light-pingora, and light-gateway
already have the main pieces needed for controller-backed service discovery.
This document captures the supported invocation path, the configuration
contract, and the intended hardening direction for gateway, sidecar, BFF, MCP,
WebSocket, and token-handler deployments.
Purpose
light-gateway should be able to discover downstream service instances from
the Light Controller through portal-registry instead of relying only on static
host lists in router.yml, proxy.yml, mcp-router.yml, or handler-specific
configuration.
The same mechanism should work with both controller implementations:
- Rust controller-rs
- Java light-controller
The gateway should use one portal-registry connection for registration,
runtime control-plane callbacks, and service discovery lookup. A separate
discovery client connection is not required for a registered runtime.
Goals
- Reuse the existing portal-registry JSON-RPC WebSocket client.
- Keep service discovery available to all light-pingora handlers through RuntimeConfig.registry_client.
- Support controller-backed lookup for:
  - REST/router outbound calls
  - WebSocket routing
  - MCP tool routing
  - OAuth token-server resolution
  - SPA auth token-server resolution
- Keep direct URL configuration as an explicit override when a handler supports it.
- Keep static target configuration as a fallback where it already exists.
- Preserve Java-compatible discovery data names such as serviceId, envTag, protocol, address, and port.
- Let light-portal and config-server manage product-specific registry and handler configuration.
- Work with one light-gateway binary and different product config sets.
Non-Goals
- Do not add a second discovery protocol for light-gateway.
- Do not require dynamic Rust plugins, inventory, or reflection for discovery.
- Do not make each handler own a separate controller connection.
- Do not require /ws/discovery for registered gateway instances.
- Do not remove static fallback configuration from router-style deployments.
- Do not make service discovery hide invalid product configuration. Startup validation and runtime errors should remain explicit.
Controller Protocol
The controller exposes two WebSocket endpoints:
/ws/microservice
/ws/discovery
light-gateway uses /ws/microservice.
The flow is:
light-gateway
-> connect /ws/microservice
-> JSON-RPC service/register
<- registered runtimeInstanceId
-> JSON-RPC discovery/lookup on the same websocket
<- DiscoverySnapshot
The dedicated /ws/discovery endpoint is still useful for non-service clients
that only need discovery. It is not needed by the gateway because both
controller-rs and light-controller accept discovery JSON-RPC methods on the
registered microservice socket after service/register succeeds.
The lookup request uses a DiscoverySubscription shape:
{
"serviceId": "com.networknt.petstore-1.0.0",
"envTag": "dev",
"protocol": "https"
}
envTag and protocol are optional. When protocol is omitted, the
controller can return all matching protocols and the caller decides which nodes
are usable.
The response is a DiscoverySnapshot:
{
"serviceId": "com.networknt.petstore-1.0.0",
"envTag": "dev",
"protocol": "https",
"nodes": [
{
"runtimeInstanceId": "...",
"serviceId": "com.networknt.petstore-1.0.0",
"envTag": "dev",
"environment": "dev",
"version": "1.0.0",
"protocol": "https",
"address": "petstore",
"port": 8443,
"tags": {},
"connectedAt": "...",
"lastSeenAt": "...",
"connected": true
}
]
}
Only connected nodes with a non-zero port should be used as upstream targets. Handlers should ignore protocols they cannot proxy.
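The node filtering rule above can be sketched as follows. This is a minimal illustration, not the light-pingora implementation: the DiscoveryNode struct only mirrors a few fields of the JSON shown above, and usable_nodes is a hypothetical helper name.

```rust
/// Subset of a DiscoverySnapshot node; field names mirror the JSON above.
/// The struct itself is an assumption made for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct DiscoveryNode {
    pub protocol: String,
    pub address: String,
    pub port: u16,
    pub connected: bool,
}

/// Keep only nodes a handler can use as upstream targets:
/// connected, non-zero port, and a protocol the handler can proxy.
pub fn usable_nodes<'a>(
    nodes: &'a [DiscoveryNode],
    supported_protocols: &[&str],
) -> Vec<&'a DiscoveryNode> {
    nodes
        .iter()
        .filter(|n| n.connected && n.port != 0)
        .filter(|n| supported_protocols.contains(&n.protocol.as_str()))
        .collect()
}
```

Each handler would pass its own protocol list, so a REST router could accept http and https while ignoring anything else the controller returns.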
Runtime Configuration
Registry participation is controlled by server.yml:
serviceId: ${server.serviceId:com.networknt.light-gateway-1.0.0}
advertisedAddress: ${server.advertisedAddress:127.0.0.1}
enableRegistry: ${server.enableRegistry:true}
startOnRegistryFailure: ${server.startOnRegistryFailure:true}
environment: ${server.environment:dev}
Controller connection settings come from portal-registry.yml:
portalUrl: ${portalRegistry.portalUrl:https://localhost:8438}
portalToken: ${light_portal_authorization:}
controllerDiscoveryToken: ${portalRegistry.controllerDiscoveryToken:}
Current light-gateway discovery uses the microservice registration token from
LIGHT_PORTAL_AUTHORIZATION or portalToken. The token is sent in the
service/register payload. controllerDiscoveryToken is reserved for clients
that use the dedicated /ws/discovery endpoint and is not part of the current
gateway lookup path.
The runtime converts portalUrl to /ws/microservice, strips any query string,
and starts the shared PortalRegistryClient when registry is enabled. The
client must be connected and registered before discovery lookup can succeed.
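As a rough sketch of the URL conversion described above: the function name microservice_ws_url is hypothetical, and the scheme mapping (https to wss, http to ws) and the replacement of any existing path are assumptions of this example rather than confirmed runtime behavior.

```rust
/// Hypothetical helper: derive the controller websocket URL from portalUrl.
/// Returns None for an unsupported scheme or a missing host.
pub fn microservice_ws_url(portal_url: &str) -> Option<String> {
    // Strip any query string (and fragment) before touching the rest.
    let base = portal_url
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or("");
    let (scheme, rest) = base.split_once("://")?;
    // Assumption: http(s) maps to ws(s); ws(s) passes through unchanged.
    let ws_scheme = match scheme {
        "https" | "wss" => "wss",
        "http" | "ws" => "ws",
        _ => return None,
    };
    // Assumption: only host[:port] is kept; any path is replaced.
    let authority = rest.split('/').next().unwrap_or("");
    if authority.is_empty() {
        return None;
    }
    Some(format!("{}://{}/ws/microservice", ws_scheme, authority))
}
```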
Gateway Invocation Path
Startup path:
config-server/local config
-> light-runtime loads server.yml, client.yml, portal-registry.yml
-> RuntimeConfig.service_identity is built from server/bootstrap config
-> RuntimeConfig.registry_client is created when registry is enabled
-> runtime startup registers the gateway with controller
-> light-gateway builds Pingora proxy state from RuntimeConfig
Request-time path:
incoming request
-> handler.yml selects a handler chain
-> handler resolves direct target, serviceId, or static target
-> handler calls PortalRegistryClient.lookup_discovery when serviceId discovery is needed
-> controller returns DiscoverySnapshot
-> handler converts nodes to Pingora ProxyTarget or base URL
-> Pingora proxies the request
PortalRegistryClient.lookup_discovery sends JSON-RPC method
discovery/lookup over the registered websocket and waits for a response. If
the websocket is not connected, lookup fails with a registry client connection
error.
Handler Usage
Router
The router handler supports both direct routing and service discovery.
Resolution order:
- service_url request routing, when configured and present.
- service_id from query/header/path-prefix logic.
- direct-registry.directUrls using serviceId|envTag, then serviceId.
- Controller discovery with serviceId and optional envTag.
- Legacy static router.serviceTargets fallback.
Direct registry is the standard static service map. router.serviceTargets
remains a deprecated compatibility fallback for old Rust gateway configs. New
product configs should not maintain both maps.
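The direct-registry step of the resolution order can be sketched as a two-key lookup. This assumes direct-registry.directUrls is loaded into a map from key to URL string; lookup_direct is a hypothetical name, not an existing light-pingora function.

```rust
use std::collections::HashMap;

/// Look up a direct URL using the key order described above:
/// first "serviceId|envTag", then plain "serviceId".
pub fn lookup_direct<'a>(
    direct_urls: &'a HashMap<String, String>,
    service_id: &str,
    env_tag: Option<&str>,
) -> Option<&'a str> {
    if let Some(tag) = env_tag {
        // The more specific key wins when an envTag is supplied.
        let scoped_key = format!("{}|{}", service_id, tag);
        if let Some(url) = direct_urls.get(&scoped_key) {
            return Some(url.as_str());
        }
    }
    direct_urls.get(service_id).map(String::as_str)
}
```

A None result means the handler moves on to controller discovery, and only then to any legacy static fallback.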
WebSocket Router
The websocket handler resolves the target service from header, query, or
pathPrefixService. It checks direct-registry.directUrls first, then passes
serviceId, optional envTag, and protocol to discovery. Connected http and
https nodes are converted to upstream WebSocket targets and Pingora handles
the upgrade proxying.
MCP Router
The mcp handler can route tools by direct targetHost or discovered
serviceId.
Resolution order:
- Tool targetHost.
- direct-registry.directUrls using serviceId|envTag, then serviceId.
- Tool serviceId through controller discovery.
When a tool uses serviceId, portal registry is only required if no direct URL
mapping exists. The tool can also specify envTag and protocol to constrain
direct URL and discovery lookup.
Token Handler
The token handler can resolve the OAuth token server by direct
oauth.token.server_url or by oauth.token.serviceId.
Resolution order:
- Direct token server URL.
- direct-registry.directUrls using token server serviceId.
- Token server serviceId through controller discovery.
Node selection prefers https and then falls back to http. If discovery is
required and portal registry is not enabled, token injection fails explicitly.
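The https-then-http preference can be sketched like this. TokenServerNode and token_server_base_url are hypothetical names used only for illustration; the real handler may select nodes differently within each protocol.

```rust
/// Minimal view of a discovered node; field names follow the DiscoverySnapshot JSON.
pub struct TokenServerNode {
    pub protocol: String,
    pub address: String,
    pub port: u16,
    pub connected: bool,
}

/// Pick a token server base URL, trying https nodes before http nodes.
/// Returns None when no connected node with a non-zero port matches.
pub fn token_server_base_url(nodes: &[TokenServerNode]) -> Option<String> {
    for wanted in ["https", "http"] {
        let hit = nodes
            .iter()
            .find(|n| n.connected && n.port != 0 && n.protocol == wanted);
        if let Some(node) = hit {
            return Some(format!("{}://{}:{}", wanted, node.address, node.port));
        }
    }
    None
}
```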
SPA Auth
The stateless SPA auth and MSAL exchange token clients use the same token-server resolution model as the token handler:
- Direct token server URL.
- direct-registry.directUrls using token server serviceId.
- Token server serviceId through controller discovery.
This keeps BFF deployments independent from fixed OAuth hostnames when the token service is registered with the controller.
Direct URLs And Fallbacks
Discovery should not override an explicit direct URL selected by a handler.
Direct URLs are operator intent and should remain authoritative. The standard
shared direct URL map is direct-registry.directUrls.
Static fallback is handler-specific:
- direct-registry.directUrls is checked before controller discovery.
- router.serviceTargets is deprecated and only remains as a legacy router fallback.
- MCP, token, SPA auth, JWK, and WebSocket service-id routing can use direct-registry.directUrls without per-handler duplicate maps.
This keeps failure behavior predictable. Product configs that require dynamic discovery should fail requests loudly when the controller connection is down instead of silently choosing an unrelated target.
Load Balancing
The controller returns a list of matching nodes. The handler is responsible for choosing one.
Current behavior is intentionally simple:
- drop disconnected nodes
- drop nodes with port 0
- drop unsupported protocols
- prefer https for token-server resolution
- round-robin or index-based selection where the handler already has an index
Future hardening can add weighted selection, zone preference, health score, least-connections, or sticky routing. Those policies should live in the handler or a shared target-selection helper, not in the controller protocol.
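A shared target-selection helper for the simple round-robin case might look like the sketch below. RoundRobin is a hypothetical type, not part of light-pingora, and it assumes the candidate list has already been filtered to usable nodes.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Lock-free round-robin cursor that can be shared across request tasks.
pub struct RoundRobin {
    next: AtomicUsize,
}

impl RoundRobin {
    pub fn new() -> Self {
        Self { next: AtomicUsize::new(0) }
    }

    /// Return the next candidate, wrapping around the slice.
    /// The counter wraps on overflow, so the index stays valid.
    pub fn pick<'a, T>(&self, candidates: &'a [T]) -> Option<&'a T> {
        if candidates.is_empty() {
            return None;
        }
        let i = self.next.fetch_add(1, Ordering::Relaxed);
        candidates.get(i % candidates.len())
    }
}
```

Because the snapshot can change between lookups, the cursor is kept independent of the node list; a shrinking list only changes which node the next index maps to.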
Failure Semantics
Startup behavior is controlled by startOnRegistryFailure:
- true: the gateway can start if initial controller registration times out; the registry client keeps retrying in the background.
- false: initial controller registration timeout fails startup.
Request-time behavior depends on handler fallback:
- with direct URL: discovery is bypassed
- with usable static fallback: handler may continue
- with discovery-only config: return an explicit gateway error
The runtime should continue reconnecting the registry websocket. Once the client is registered again, new discovery lookups can succeed without restarting the gateway.
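The request-time rules can be sketched as a small decision function. The names choose_target and TargetError are hypothetical, and discovery is modeled as a closure so that a direct URL really does bypass the lookup.

```rust
#[derive(Debug, PartialEq)]
pub enum TargetError {
    /// Discovery was the only option and it produced no target.
    DiscoveryUnavailable(String),
}

/// Resolve a target following the request-time rules above:
/// direct URL first, then discovery, then any static fallback.
pub fn choose_target<F>(
    direct_url: Option<&str>,
    discover: F,
    static_fallback: Option<&str>,
) -> Result<String, TargetError>
where
    F: FnOnce() -> Result<Option<String>, String>,
{
    // A direct URL is operator intent: discovery is never called.
    if let Some(url) = direct_url {
        return Ok(url.to_string());
    }
    let failure = match discover() {
        Ok(Some(url)) => return Ok(url),
        Ok(None) => "controller returned no usable nodes".to_string(),
        Err(e) => e,
    };
    // Discovery failed: continue only if a static fallback exists.
    match static_fallback {
        Some(url) => Ok(url.to_string()),
        None => Err(TargetError::DiscoveryUnavailable(failure)),
    }
}
```

In a discovery-only configuration the error is surfaced to the caller as an explicit gateway error instead of being replaced by an unrelated target.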
Security
The gateway registers through /ws/microservice with the portal registry token.
The controller validates the registration token and then allows discovery RPCs
on that registered socket.
Security expectations:
- Use TLS for controller connections outside local development.
- Keep hostname verification enabled outside local development.
- Prefer environment-provided token values over static config files.
- Mask portalToken and controllerDiscoveryToken in module-registry output.
- Do not pass registry tokens to downstream services.
- Do not trust discovery data from an untrusted controller.
Discovery returns transport endpoints. Authentication, authorization, rate limit, CORS, header mutation, token injection, and access-control decisions remain normal handler-chain responsibilities.
Config Server Model
In production, light-portal owns product configuration and config-server
delivers resolved files at startup.
A product that needs controller-backed discovery should include:
- server.yml with enableRegistry: true
- portal-registry.yml with portalUrl and a valid portal token source
- direct-registry.yml or values.yml entries under direct-registry.directUrls for transition services that are not registered in the controller yet
- handler-specific config that uses serviceId instead of direct host URLs
- handler.yml chains that include the relevant handler IDs
For local Docker Compose, the Rust gateway must not keep the default
https://localhost:8438 controller URL, because inside the gateway container
localhost resolves to the gateway itself. Use
portalRegistry.portalUrl: https://controller:8438, pass
LIGHT_PORTAL_AUTHORIZATION, and keep static transition mappings in
direct-registry.directUrls.
The same binary can therefore run as:
- gateway
- sidecar
- proxy server
- proxy client
- balancer
- BFF
The product identity comes from config, not from a separate executable.
Compatibility Notes
The current Rust and Java controllers are compatible with the gateway discovery path because both support:
- /ws/microservice
- service/register
- discovery lookup on the registered microservice socket
- serviceId, envTag, and protocol filters
- DiscoverySnapshot.nodes
- connected-node metadata with address, port, and protocol
The gateway does not currently depend on /ws/discovery, although that endpoint
can remain available for external discovery clients.
Future Work
- Add optional discovery subscriptions for handlers that benefit from a local in-memory discovery cache.
- Add shared target-selection policies for weighted, sticky, or zone-aware routing.
- Expose discovery health through the module registry or an admin endpoint.
- Add an integration test that starts a controller, registers a backend, starts light-gateway, and verifies an end-to-end proxied request through discovery.
- Decide whether controllerDiscoveryToken should be used by any standalone discovery-only client in light-fabric.
- Document operational examples for gateway, sidecar, WebSocket, MCP, token handler, and BFF product profiles.
Tracing
Light-Fabric uses Rust tracing for application logs and runtime diagnostics.
The same tracing events must support two different consumers:
- operators and developers reading live logs from the console or control plane
- log platforms such as Splunk that ingest structured JSON
The logging design should keep one source of truth for emitted events and make the output format configurable at the edge of the process.
Goals
- Preserve the current human-readable console format for local development and controller-streamed logs.
- Support newline-delimited JSON logs for Splunk and other log ingestion systems.
- Allow deployments to choose text or JSON console output without changing application code.
- Allow authorized control-plane users to change log levels and logger targets without restarting the service.
- Avoid coupling Light-Fabric services directly to Splunk availability, credentials, retry policy, or backpressure handling.
- Keep log fields stable enough for portal-view, controller, and Splunk queries.
Non-Goals
- Implement a Splunk HTTP Event Collector client inside every Light-Fabric service.
- Mix human text logs and JSON logs on the same stream.
- Use values.yml to mutate process environment variables. Environment variables are startup inputs; runtime changes should use an explicit logging configuration model.
Current State
The application binaries initialize tracing_subscriber locally. The current
format is text-oriented and is easy to read in a terminal, Docker logs, or a
controller stream. Some binaries also support an ANSI toggle so container logs
can avoid escape sequences.
This works well for humans, but it is less reliable for Splunk field extraction. Splunk can ingest text logs, but structured JSON gives predictable fields for filtering, dashboards, alerts, and correlation.
Output Formats
Light-Fabric should support the following output formats:
| Format | Intended Consumer | Notes |
|---|---|---|
| text | humans, local development, controller live log stream | Existing behavior. Best for direct reading. |
| json | Splunk, OpenTelemetry Collector, Kubernetes log collectors | Newline-delimited JSON. Best for machine ingestion. |
The output should be selected with an environment variable:
LIGHT_LOG_FORMAT=text
or:
LIGHT_LOG_FORMAT=json
If the variable is absent, the default should remain text to preserve existing
operator behavior.
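A minimal sketch of the format selection, assuming a hypothetical LogFormat enum in the shared logging helper. Treating unrecognized values as text and matching case-insensitively are assumptions of this example; the design above only specifies the default when the variable is absent.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LogFormat {
    Text,
    Json,
}

/// Map a raw LIGHT_LOG_FORMAT value to a format, defaulting to text.
pub fn parse_log_format(raw: Option<&str>) -> LogFormat {
    match raw.map(|v| v.trim().to_ascii_lowercase()).as_deref() {
        Some("json") => LogFormat::Json,
        // Absent, "text", or anything unrecognized keeps existing behavior.
        _ => LogFormat::Text,
    }
}

/// Read the format from the process environment at startup.
pub fn log_format_from_env() -> LogFormat {
    parse_log_format(std::env::var("LIGHT_LOG_FORMAT").ok().as_deref())
}
```

Keeping the parsing separate from the environment read makes the default easy to test without mutating process state.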
RUST_LOG should continue to provide the startup filter:
RUST_LOG=info
RUST_LOG=light_gateway=debug,info
RUST_LOG=light_workflow=debug,info
Single Console Stream
For most deployments, the preferred model is a single console stream with a configurable format:
application tracing event
|
v
tracing_subscriber fmt layer
|
+-- stdout/stderr as text or JSON
This has the lowest runtime overhead because each event is formatted and written once. It also keeps container logging simple: the platform captures the process console stream, and the customer chooses whether that stream is text or JSON.
When LIGHT_LOG_FORMAT=json, the console output should be newline-delimited
JSON:
{"timestamp":"2026-06-03T14:12:41.233Z","level":"INFO","target":"light_gateway","fields":{"message":"proxy request completed","method":"GET","path":"/api/customer","status":200,"elapsed_ms":18,"correlation_id":"abc-123"}}
Raw JSON is readable, but it is not as pleasant as the text format. For the control plane, portal-view should parse JSON log lines and render a human projection:
14:12:41.233 INFO light_gateway proxy request completed
method=GET path=/api/customer status=200 elapsed_ms=18 correlation_id=abc-123
This lets Splunk receive structured logs while portal-view remains readable for operators.
Portal-View Rendering
The controller should stream log lines without needing to understand every field. Portal-view can detect whether a line is JSON:
- Trim the line.
- If it starts with {, try to parse it as JSON.
- If parsing succeeds, render common fields in a stable layout.
- If parsing fails, render the original line as plain text.
The renderer should treat JSON parsing as an enhancement, not a hard requirement. This keeps mixed historical output, startup messages, and unrelated tool output usable.
Recommended display fields:
| JSON Field | Display Use |
|---|---|
| timestamp | leading timestamp |
| level | severity badge/text |
| target | module or service source |
| fields.message | main message |
| fields.correlation_id | request correlation |
| fields.request_id | request identifier, when present |
| fields.status | HTTP or operation status |
| fields.elapsed_ms | latency |
Unknown fields can be shown in an expandable details view or appended as
key=value pairs.
Splunk Ingestion
A log file is not the only option for Splunk ingestion.
Console JSON in Containers
For Kubernetes and container deployments, console JSON is usually the best default. The service writes JSON to stdout/stderr, and the platform logging agent collects the container log stream. Splunk Connect for Kubernetes, OpenTelemetry Collector, or an equivalent customer-managed collector can parse the JSON and send it to Splunk HTTP Event Collector.
This avoids application-level Splunk credentials and keeps retry, batching, and backpressure in the collector.
JSON Log File
For VM or bare-metal deployments where the customer already uses Splunk Universal Forwarder, a JSON log file is also valid. In that mode the application would write newline-delimited JSON to a rotating file, and the forwarder or OpenTelemetry filelog receiver would tail it.
This mode is useful when stdout is reserved for human-readable controller logs, but it formats and writes each event through an additional sink if text console output remains enabled.
Direct Splunk HEC
Direct HTTP Event Collector delivery from the application is possible but should not be the default. It adds Splunk endpoint configuration, token management, retry policy, buffering, and failure handling to every service. A collector or forwarder is a cleaner boundary for production deployments.
Dual Sink Option
If a deployment must keep text console logs and produce JSON at the same time, Light-Fabric can use multiple tracing layers:
application tracing event
|
v
tracing subscriber registry
|
+-- text layer -> stdout/stderr
|
+-- JSON layer -> rolling file
This preserves the current control-plane stream and gives Splunk a clean JSON source. The tradeoff is extra formatting and I/O work per event.
Use this mode only when a single JSON console stream is not acceptable for the operator experience.
Configuration
The design supports both single-stream and dual-sink logging through configuration. The common deployment profiles are:
| Deployment | Console Output | JSON File | Typical Splunk Path |
|---|---|---|---|
| Kubernetes/container | json | disabled | container log collector to Splunk HEC |
| Bare metal/VM with human console | text | enabled | Splunk Universal Forwarder or filelog receiver tails the JSON file |
| Local development | text | disabled | terminal or controller stream only |
The minimal configuration should be:
LIGHT_LOG_FORMAT=text
LIGHT_LOG_ANSI=false
RUST_LOG=info
JSON console mode:
LIGHT_LOG_FORMAT=json
LIGHT_LOG_ANSI=false
RUST_LOG=info
Optional dual-sink file mode:
LIGHT_LOG_FORMAT=text
LIGHT_LOG_ANSI=false
LIGHT_LOG_JSON_FILE_ENABLED=true
LIGHT_LOG_JSON_FILE_DIR=/var/log/light-fabric
LIGHT_LOG_JSON_FILE_NAME=light-gateway.jsonl
LIGHT_LOG_JSON_FILE_ROTATION=daily
RUST_LOG=info
In this dual-sink mode, the application emits the same tracing event to both sinks: text to stdout/stderr for humans and controller-streamed logs, and JSON to the configured file for Splunk ingestion.
Service-specific aliases such as GATEWAY_LOG_ANSI, AGENT_LOG_ANSI, or
WORKFLOW_LOG_ANSI can remain during migration, but the long-term interface
should converge on LIGHT_LOG_* variables shared by all Light-Fabric binaries.
Runtime Logging Control
Light-Fabric should support the Java control-plane behavior where an authorized operator changes log levels and logger targets from portal-view without restarting the service.
Rust can support this through tracing_subscriber::reload. Instead of installing
a fixed EnvFilter, the runtime should wrap the filter in a reloadable layer and
keep a reload handle in a shared logging controller:
application tracing event
|
v
reloadable EnvFilter
|
v
text/json formatting layers
The reloadable part is the filter only. A filter can change the global level and individual logger targets:
info
debug
info,light_gateway=debug
info,light_gateway=debug,light_pingora::security=trace
info,light_pingora::security=off
This matches the practical Java use case: enable debug or trace for one logger
while keeping the rest of the service at info.
Dynamic Versus Restart-Only Settings
| Setting | Dynamic | Reason |
|---|---|---|
| Global log level | yes | Updates the reloadable EnvFilter. |
| Per-target logger level | yes | Updates the reloadable EnvFilter. |
| Disable a target with target=off | yes | Updates the reloadable EnvFilter. |
| Console format text/json | no | Requires rebuilding formatter layers. |
| JSON file enabled/disabled | no | Requires adding or removing a writer layer. |
| JSON file directory/name/rotation | no | Requires replacing the appender and guard. |
| ANSI setting | no | Formatter setting; treat as startup-only. |
Startup Precedence
The startup filter should use this precedence:
- RUST_LOG, when present.
- logging.filter from values.yml.
- The service default, such as info or light_workflow=debug,info.
This preserves existing RUST_LOG behavior for local and container deployments
while allowing managed deployments to define a persistent default filter in
config.
Example values.yml:
logging.filter: info
More targeted example:
logging.filter: info,light_gateway=debug,light_pingora::security=trace
values.yml should not overwrite environment variables and should not be the
normal path for day-to-day control-plane log-level changes. It should provide the
baseline filter that the logging module reads at startup. If an operator wants to
restore that baseline after a live debugging change, reload_modules can reload
runtime/logging from the latest resolved values.
Changing config server values and then triggering reload is therefore a persistence/reset workflow, not the primary live-control workflow.
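The startup precedence can be sketched as a pure function that also records where the filter came from, which is the kind of source information a status tool could report. FilterSource and startup_filter are hypothetical names, and treating an empty value as absent is an assumption of this sketch.

```rust
#[derive(Debug, PartialEq)]
pub enum FilterSource {
    RustLog,
    ValuesYml,
    ServiceDefault,
}

/// Treat missing or blank values as "not present".
fn non_empty(value: Option<&str>) -> Option<String> {
    value
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(str::to_string)
}

/// Resolve the startup filter: RUST_LOG, then logging.filter, then the default.
pub fn startup_filter(
    rust_log: Option<&str>,
    configured: Option<&str>,
    service_default: &str,
) -> (String, FilterSource) {
    if let Some(filter) = non_empty(rust_log) {
        return (filter, FilterSource::RustLog);
    }
    if let Some(filter) = non_empty(configured) {
        return (filter, FilterSource::ValuesYml);
    }
    (service_default.to_string(), FilterSource::ServiceDefault)
}
```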
MCP Tools
The runtime MCP tool surface should expose logging control alongside existing
runtime tools such as get_service_info, get_modules, and reload_modules.
Recommended tools:
| Tool | Purpose |
|---|---|
| get_logging_filter | Return the current effective filter and startup source. |
| set_logging_filter | Validate and apply a new live filter immediately. This is the normal portal-view control path. |
| reload_modules with runtime/logging | Reset the live filter from the configured baseline in values.yml or remote values. |
Example live filter update:
{
"name": "set_logging_filter",
"arguments": {
"filter": "info,light_gateway=debug"
}
}
Example reset from the configured baseline:
{
"name": "reload_modules",
"arguments": {
"modules": ["runtime/logging"]
}
}
The service response should include the active filter and status:
{
"status": "success",
"filter": "info,light_gateway=debug"
}
Invalid filters should be rejected without changing the current filter:
{
"status": "error",
"message": "invalid logging filter: ..."
}
Portal-View Flow
The portal-view control plane should follow the same route used for other runtime management tools:
portal-view
-> controller
-> portal-registry/runtime instance connection
-> service runtime MCP handler
-> logging control
The UI can offer:
- a global level selector: off, error, warn, info, debug, trace
- per-target rows for Rust targets such as light_gateway or light_pingora::security
- an advanced filter text box for the full EnvFilter expression
- an apply action that calls set_logging_filter
- a reset action that reloads runtime/logging from the configured baseline
- an optional "save as default" action that persists the filter to config server
The advanced filter is important because Rust logger targets are module paths, and operators may need precise target-level control during incident debugging.
The default portal-view workflow should be:
operator changes filter
-> portal-view calls set_logging_filter
-> service updates the reloadable EnvFilter immediately
Portal-view should not require this slower path for a temporary debug change:
operator changes filter
-> portal-view updates config server
-> portal-view calls reload_modules
-> service reloads values.yml
That slower path is still useful when the operator intentionally wants the new filter to survive service restart or redeploy.
JSON Field Shape
JSON logs should be stable enough for both portal-view rendering and Splunk searches. Recommended fields include:
| Field | Meaning |
|---|---|
| timestamp | event time in UTC |
| level | ERROR, WARN, INFO, DEBUG, or TRACE |
| target | Rust module or logical component |
| fields.message | human message |
| fields.service | logical service name, such as light-gateway |
| fields.instance_id | runtime instance, when known |
| fields.host_id | tenant/host context, when safe to log |
| fields.correlation_id | cross-service request correlation |
| fields.request_id | request identifier |
| fields.method | HTTP method, when applicable |
| fields.path | request path without sensitive query string |
| fields.status | response or operation status |
| fields.elapsed_ms | operation duration |
Sensitive values must not be logged in either format. This includes tokens, API keys, session cookies, full authorization headers, raw secrets, and request or response payload fields that may contain PII.
Implementation Notes
Use tracing_subscriber as the formatting boundary. The JSON format requires
the json feature:
tracing-subscriber = { version = "0.3", features = ["env-filter", "fmt", "json"] }
File output should use tracing_appender:
tracing-appender = "0.2"
If non-blocking file output is used, the returned WorkerGuard must be kept
alive until process shutdown so buffered log lines are flushed.
The implementation should move per-binary init_tracing() logic into a shared
runtime helper so light-gateway, light-agent, light-workflow, and
light-deployer expose the same behavior.
For dynamic filtering, the shared helper should:
- Build the initial EnvFilter from RUST_LOG, logging.filter, or the service default.
- Install the filter through tracing_subscriber::reload.
- Keep the reload handle in a LoggingControl value.
- Register a reloadable module named runtime/logging with ModuleRegistry.
- Add runtime MCP handlers for get_logging_filter and set_logging_filter.
- Reject invalid filter expressions before swapping the active filter.
Recommendation
Start with configurable single-stream console output:
- default LIGHT_LOG_FORMAT=text
- production/Splunk option LIGHT_LOG_FORMAT=json
- portal-view JSON parsing and human-friendly rendering
- no direct Splunk dependency in the application
Add dual-sink JSON file output only for customers who cannot change the console stream to JSON but still require structured Splunk ingestion.
Release Workflow
Light-Fabric already has a release.sh script that builds Linux binaries,
packages release archives, and creates or updates a GitHub release. The current
release page uses a static note string, so operators can download artifacts but
cannot easily see what changed between tags.
This design introduces a cascading polyrepo release orchestrated by light-workflow.
It automates release notes, changelog updates, binary builds, Docker image pushes,
and downstream dependency propagation across both public (light-fabric, light-example-rs)
and private (controller-rs, portal-service) repositories.
The implementation should start with a small dependency-free git-log script and leave room to adopt a more structured changelog generator later. It should also centralize Docker image publishing so binary archives and container images use the same release version.
Goals
- Generate release notes from commits between the previous release tag and the current release tag.
- Use the same generated notes for GitHub release creation and release updates.
- Maintain a checked-in CHANGELOG.md so release history is visible without opening GitHub.
- Preserve the current release.sh VERSION [-l|--local] [--skip-build] operator workflow.
- Release Linux binary archives and Docker images with the same version tag and the same compiled Linux binaries.
- Support Apple Silicon and Windows binary artifacts through CI runners that match those operating systems.
- Add one repo-root build.sh for all Docker images while preserving app-level build script compatibility.
- Allow manual edits before publishing when release notes need customer-facing cleanup.
- Avoid requiring Conventional Commit messages on day one.
Non-Goals
- Replace GitHub releases as the artifact distribution point.
- Require every commit message to follow feat:, fix:, or another convention immediately.
- Generate perfect marketing release notes without review.
- Upload changelog files as separate release artifacts.
- Remove existing app-level build.sh entrypoints immediately.
- Build macOS binaries from a normal Linux Docker builder. Apple toolchains and SDKs require a macOS build runner.
- Build Windows MSVC binaries from a normal Linux Docker builder. Use a Windows runner for the official Windows artifacts.
- Publish Windows container images as part of the first release flow. Windows container images require Windows base images and a Windows container builder.
Current State
release.sh currently performs these steps:
- Parse release options and target version.
- Build light-agent, light-deployer, light-gateway, and light-workflow for Linux GNU and Linux musl targets.
- Package the binaries into dist/light-fabric-${VERSION}-${TARGET}.tar.gz.
- If --local is not set, create a GitHub release or upload artifacts to an existing release.
When creating a new GitHub release, the script uses a static note body:
Light-Fabric Linux release binaries
When the release already exists, the script uploads artifacts but does not update the release notes.
Docker image builds are currently handled by app-level scripts:
apps/light-agent/build.sh
apps/light-deployer/build.sh
apps/light-gateway/build.sh
apps/light-workflow/build.sh
Most app scripts use this shape:
./build.sh 0.3.0
./build.sh 0.3.0 --local
./build.sh 0.3.0 --no-cache
Those scripts build and optionally push networknt/<app>:${VERSION} and
networknt/<app>:latest. light-deployer has a simpler custom script, so the
app-level workflow is not completely consistent.
release.sh does not currently build or push Docker images. As a result,
binary archives and Docker images can drift if they are released in separate
manual steps or with different version strings.
Options
Option 1: GitHub Generated Notes
GitHub CLI can generate release notes:
gh release create "$VERSION" --generate-notes --notes-start-tag "$PREVIOUS_TAG"
This requires the least code, and it works well for the GitHub release page. The
tradeoff is that it does not update CHANGELOG.md in the repository unless an
additional script calls the GitHub API and copies the generated notes back into
the repo.
This option is useful as a fallback, but it should not be the primary design if the repo changelog is a required output.
Option 2: Dependency-Free Git-Log Script
A local script can generate release notes from the git history:
git log "${PREVIOUS_TAG}..${TARGET_REF}" --pretty=format:"- %s (%h)"
The script can write a markdown file and use that same file for both
CHANGELOG.md and gh release create --notes-file.
This option is simple, reviewable, and fits the current Bash release script. It does not require new tooling or commit-message conventions. The initial output will be commit-oriented rather than category-oriented, but it can be improved incrementally.
Option 3: git-cliff
git-cliff can generate structured changelogs from Conventional Commit
messages and custom templates. It can group entries into sections such as
features, fixes, documentation, and breaking changes.
This gives the best long-term release notes, but it adds a release-tool dependency and works best only after the team consistently writes conventional commit messages.
This can be adopted later without changing the overall release flow: replace the
internal git-log generator with a git-cliff invocation that writes the same
release notes file.
Proposed Design
Start with Option 2.
Add a helper script:
scripts/release-notes.sh
The script should generate:
dist/release-notes-${VERSION}.md
It should optionally update:
CHANGELOG.md
release.sh should call the helper before publishing the GitHub release. The
generated notes file becomes the release page source:
gh release create "$VERSION" "${ARCHIVES[@]}" \
--title "$VERSION" \
--notes-file "$NOTES_FILE"
For an existing release, the script should update the release body and upload the artifacts:
gh release edit "$VERSION" --notes-file "$NOTES_FILE"
gh release upload "$VERSION" "${ARCHIVES[@]}" --clobber
Use Docker as the official Linux release builder. The controlled Docker builder
environment should compile Linux binaries once per Linux platform, export those
binaries into dist/, and use the same binaries when assembling runtime Docker
images. Local host builds remain useful for development, but they should not be
the official release source for Linux artifacts.
Add a repo-root Docker image script:
build.sh
The root script should become the source of truth for building and publishing all Light-Fabric app images:
./build.sh 0.3.0
./build.sh 0.3.0 --local
./build.sh 0.3.0 --app light-agent
./build.sh 0.3.0 --image-org networknt --no-cache
The script should build these images by default:
networknt/light-agent:0.3.0
networknt/light-deployer:0.3.0
networknt/light-gateway:0.3.0
networknt/light-workflow:0.3.0
Unless --skip-latest is set, it should also tag and push:
networknt/light-agent:latest
networknt/light-deployer:latest
networknt/light-gateway:latest
networknt/light-workflow:latest
Existing app-level build scripts should remain, but they should become thin wrappers around the root script:
../../build.sh "$@" --app light-agent
This preserves established operator muscle memory and removes duplicated Docker publish logic.
release.sh should call the root build.sh with the same VERSION. For Linux
targets, the release should build once per platform and reuse the output:
Docker/BuildKit Linux builder
|
+-- dist/linux/<target>/bin/<app> -> GitHub release tarballs
|
+-- dist/linux/<target>/bin/<app> -> Docker runtime images
This makes one command release both binary artifacts and Docker images without compiling the same Linux binaries twice.
Changelog Format
CHANGELOG.md should use reverse chronological release sections:
# Changelog
## 0.3.0 - 2026-06-03
- Add JSON file logging support to `light-runtime` (abc1234)
- Wire runtime logging control into `light-gateway` (def5678)
- Document Splunk ingestion options for tracing (123abcd)
## 0.2.0 - 2026-05-20
- ...
The generated release notes file should contain the same section body:
## 0.3.0 - 2026-06-03
### Changes
- Add JSON file logging support to `light-runtime` (abc1234)
- Wire runtime logging control into `light-gateway` (def5678)
- Document Splunk ingestion options for tracing (123abcd)
### Artifacts
- `light-fabric-0.3.0-x86_64-unknown-linux-gnu.tar.gz`
- `light-fabric-0.3.0-x86_64-unknown-linux-musl.tar.gz`
- `light-fabric-0.3.0-aarch64-unknown-linux-gnu.tar.gz`
- `light-fabric-0.3.0-aarch64-unknown-linux-musl.tar.gz`
- `light-fabric-0.3.0-aarch64-apple-darwin.tar.gz`
- `light-fabric-0.3.0-x86_64-pc-windows-msvc.zip`
- `networknt/light-agent:0.3.0`
- `networknt/light-deployer:0.3.0`
- `networknt/light-gateway:0.3.0`
- `networknt/light-workflow:0.3.0`
The release notes file can include artifact names because it is used directly
for the GitHub release page. CHANGELOG.md should focus on changes and can
omit artifact details.
Docker images should be listed in the GitHub release body even though they are published to Docker Hub instead of attached to the release page. This gives operators one place to see every artifact produced by a release.
Docker image platform variants should also be visible:
networknt/light-agent:0.3.0 linux/amd64, linux/arm64
networknt/light-deployer:0.3.0 linux/amd64, linux/arm64
networknt/light-gateway:0.3.0 linux/amd64, linux/arm64
networknt/light-workflow:0.3.0 linux/amd64, linux/arm64
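The release notes layout above could be rendered by a small helper like the following. This is a minimal sketch, not the final scripts/release-notes.sh: the function name render_release_notes and its argument order are hypothetical, and it assumes the change lines arrive on stdin already formatted as "- subject (hash)".

```bash
# render_release_notes VERSION DATE [ARTIFACT...]
# Reads change lines on stdin and prints the release notes section.
# Hypothetical helper: the name and argument order are assumptions.
render_release_notes() (
  version="$1"; date="$2"; shift 2
  printf '## %s - %s\n' "$version" "$date"
  printf '### Changes\n'
  # `git log --pretty=format:` does not end with a newline, so also
  # emit a final line that was read without one.
  while IFS= read -r line || [ -n "$line" ]; do
    printf '%s\n' "$line"
  done
  if [ "$#" -gt 0 ]; then
    printf '### Artifacts\n'
    for artifact in "$@"; do
      printf '%s\n' "- \`$artifact\`"
    done
  fi
)
```

Under these assumptions, the git log command from this document can feed it directly, for example `git log --no-merges --pretty=format:"- %s (%h)" "${PREVIOUS_TAG}..${TARGET_REF}" | render_release_notes "$VERSION" "$(date +%F)" "${ARTIFACTS[@]}"`, where ARTIFACTS is an illustrative array of archive and image names.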
Tag Range Selection
The release-notes script needs a deterministic commit range.
Inputs:
- VERSION: target tag, for example 0.3.0 or v0.3.0
- optional --from PREVIOUS_TAG
- optional --target TARGET_REF
Default behavior:
- If --target is supplied, use it as the end of the range.
- Else if the VERSION tag exists locally, use VERSION.
- Else use HEAD.
- If --from is supplied, use it as the start of the range.
- Else find the newest semver-like tag before VERSION.
- If no previous tag exists, use the first commit as the start.
For existing releases, this allows regenerating the notes for the exact tag. For new releases, this allows generating notes before the tag exists.
Recommended git command:
git log --no-merges --pretty=format:"- %s (%h)" "${PREVIOUS_TAG}..${TARGET_REF}"
If merge commits are important for the team, the script can add a
--include-merges option.
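The "newest semver-like tag before VERSION" rule can be sketched as a filter over the tag list. This is a minimal sketch under stated assumptions: the function name previous_tag is hypothetical, tags are assumed to be plain MAJOR.MINOR.PATCH with an optional leading v, and pre-release tags are skipped.

```bash
# previous_tag VERSION
# Reads candidate tags on stdin, one per line, and prints the newest tag
# that sorts strictly before VERSION. Prints nothing if there is none, so
# the caller can fall back to the first commit.
# Hypothetical helper; the tag format is an assumption of this sketch.
previous_tag() {
  awk -v target="${1#v}" '
    # Compare two MAJOR.MINOR.PATCH strings numerically: -1, 0, or 1.
    function cmp(a, b,    x, y, i) {
      split(a, x, "."); split(b, y, ".")
      for (i = 1; i <= 3; i++) {
        if (x[i] + 0 < y[i] + 0) return -1
        if (x[i] + 0 > y[i] + 0) return 1
      }
      return 0
    }
    {
      bare = $0
      sub(/^v/, "", bare)
      # Skip anything that is not a plain MAJOR.MINOR.PATCH tag.
      if (bare !~ /^[0-9]+\.[0-9]+\.[0-9]+$/) next
      # Skip VERSION itself and anything newer.
      if (cmp(bare, target) >= 0) next
      if (best == "" || cmp(bare, best_bare) > 0) { best = $0; best_bare = bare }
    }
    END { if (best != "") print best }
  '
}
```

A caller might use it as `PREVIOUS_TAG=$(git tag --list | previous_tag "$VERSION")` and print the selected range before generating notes, so a wrong pick is visible and can be overridden with --from.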
Release Script Flow
The updated release.sh flow should be:
- Parse release options.
- Validate build and publish dependencies.
- Generate release notes into dist/release-notes-${VERSION}.md.
- Build Linux binaries with the Docker release builder unless --skip-build or --host-build is set.
- Package release archives.
- Build Docker images unless --skip-docker is set.
- Print generated archive names, Docker image names, and release notes path.
- If --local is set, stop before GitHub and Docker Hub publishing.
- If the GitHub release exists:
  - update the release body from the generated notes file
  - upload archives with --clobber
- If the GitHub release does not exist:
  - create it with --notes-file
  - upload archives during creation
- Push Docker images unless --skip-docker or --local is set.
The release notes should be generated before publishing, but the changelog
update should be explicit. A release engineer may want to review and commit
CHANGELOG.md before publishing.
Recommended flags:
--update-changelog prepend the generated section to CHANGELOG.md
--notes-only generate notes and optionally update changelog without building
--from TAG override previous tag selection
--target REF override release notes target ref
--include-merges include merge commits in generated commit list
--skip-docker release binary archives only
--docker-only build and publish Docker images only
--skip-latest publish VERSION image tags without updating latest
--host-build use local cargo builds for Linux binaries instead of the Docker release builder
--app APP restrict Docker image work to one app
--image-org ORG Docker image namespace, default networknt
--platform PLATFORM restrict Docker image platform, default linux/amd64,linux/arm64
--skip-macos skip macOS binary artifacts in CI release mode
--skip-windows skip Windows binary artifacts in CI release mode
--local should still build and package locally. It may generate release notes,
but it should not call gh or push Docker images.
--docker-only should skip binary archive packaging and GitHub release asset
upload. It should still generate release notes by default so the same version
context is visible in the command output. If --local is also set, it should
build images locally without pushing them.
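How these flags combine is easier to review as a dry run. The sketch below only prints the phases that would run. It is illustrative rather than the actual release.sh: the function name release_phases and the phase labels are hypothetical, value-taking flags are ignored, and it assumes --docker-only still builds Linux binaries because the runtime images copy them.

```bash
# release_phases [VERSION] [FLAGS...]
# Prints the phases that would run, one per line, without running anything.
# Hypothetical dry-run helper; phase names are labels invented for this sketch.
release_phases() (
  local_only=0; notes_only=0; skip_build=0; host_build=0
  skip_docker=0; docker_only=0; update_changelog=0
  for arg in "$@"; do
    case "$arg" in
      --local)            local_only=1 ;;
      --notes-only)       notes_only=1 ;;
      --skip-build)       skip_build=1 ;;
      --host-build)       host_build=1 ;;
      --skip-docker)      skip_docker=1 ;;
      --docker-only)      docker_only=1 ;;
      --update-changelog) update_changelog=1 ;;
      *) ;;  # VERSION and value-taking flags are ignored in this sketch
    esac
  done

  echo generate-notes
  if [ "$update_changelog" -eq 1 ]; then echo update-changelog; fi
  # --notes-only stops after notes and the optional changelog update.
  if [ "$notes_only" -eq 1 ]; then exit 0; fi

  if [ "$skip_build" -eq 0 ]; then
    if [ "$host_build" -eq 1 ]; then echo build-linux-host; else echo build-linux-docker; fi
  fi
  if [ "$docker_only" -eq 0 ]; then echo package-archives; fi
  if [ "$skip_docker" -eq 0 ]; then echo build-images; fi

  # --local stops before anything is published.
  if [ "$local_only" -eq 1 ]; then exit 0; fi
  if [ "$docker_only" -eq 0 ]; then echo publish-github-release; fi
  if [ "$skip_docker" -eq 0 ]; then echo push-images; fi
)
```

For example, `release_phases 0.3.0 --docker-only --local` prints generate-notes, build-linux-docker, and build-images, which matches the rule that --local never pushes and --docker-only never packages archives.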
Automated Polyrepo Release Workflow
Because controller-rs, portal-service, and light-example-rs depend on light-fabric crates, they must be released sequentially in a Cascading Release Pipeline. Attempting to release them manually is error-prone.
We will dogfood light-workflow as our Release Orchestrator to automate this across the public and private repository boundaries.
The Release-Train Workflow Template
The light-workflow template acts as the overarching controller:
- Step 1: Upstream Release (light-fabric)
  - Task A: The workflow runs cargo release (or equivalent) to bump versions, tag, and publish the public light-fabric crates to crates.io.
  - Task B: The workflow invokes the build.sh script to compile Linux binaries and push light-fabric Docker images.
  - Task C: The workflow calls release.sh to generate the changelog and publish the GitHub Release page.
- Step 2: The Sync Barrier (Wait Step)
  - The workflow pauses for a short duration (e.g., 2 minutes) to ensure crates.io indexing has completed, preventing downstream builds from failing to find the new crate versions.
- Step 3: Downstream Dependency Propagation
  - The workflow clones controller-rs, portal-service, and light-example-rs.
  - It runs cargo update -p light-fabric to point the downstream repositories to the newly published version.
  - It pushes these changes to their respective main branches.
- Step 4: Parallel Downstream Releases
  - The workflow uses a parallel execution pattern to trigger releases for the downstream repositories simultaneously:
    - Branch 1 (controller-rs): Build private binaries, push private Docker images, and tag the private repo.
    - Branch 2 (portal-service): Build private binaries, push private Docker images, and tag the private repo.
    - Branch 3 (light-example-rs): Publish any downstream public crates, push public Docker images, and create the GitHub Release.
By wrapping the individual release.sh and build.sh scripts in a light-workflow execution, we gain stateful retries, full pipeline visibility, and automated propagation without exposing secure tokens on developer workstations.
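The Sync Barrier in Step 2 is described as a fixed pause. One possible refinement, which is not part of the design above, is a bounded retry around a readiness check, so the workflow proceeds as soon as the new versions are visible and fails clearly if they never appear. The sketch below is hypothetical: wait_until is an invented helper name, and the check command is left to the caller.

```bash
# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted, sleeping
# DELAY_SECONDS between attempts. Returns 0 on success, 1 otherwise.
# Hypothetical helper: a bounded retry, not the fixed pause from Step 2.
wait_until() (
  attempts="$1"; delay="$2"; shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then exit 0; fi
    n=$((n + 1))
    # Do not sleep after the final failed attempt.
    if [ "$n" -le "$attempts" ]; then sleep "$delay"; fi
  done
  exit 1
)
```

A workflow task could call it as `wait_until 12 10 crate_version_visible`, where crate_version_visible is a placeholder for whatever readiness check the team chooses. The fixed two-minute pause remains the simpler option if no reliable check is available.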
Root Docker Build Script
The repo-root build.sh should own Linux Docker image build and push behavior
for all apps.
Recommended app metadata:
| App | Image | Dockerfile |
|---|---|---|
| light-agent | networknt/light-agent | apps/light-agent/docker/Dockerfile |
| light-deployer | networknt/light-deployer | apps/light-deployer/Dockerfile |
| light-gateway | networknt/light-gateway | apps/light-gateway/docker/Dockerfile |
| light-workflow | networknt/light-workflow | apps/light-workflow/docker/Dockerfile |
The Docker build context should remain the workspace root because the
Dockerfiles copy workspace-level Cargo.toml, Cargo.lock, crates,
frameworks, and app directories.
The script should support:
build.sh [VERSION] [-l|--local] [--no-cache] [--app APP] [--image-org ORG] [--platform PLATFORM] [--skip-latest] [--binary-dir DIR]
Default behavior:
- Build all app images for linux/amd64 and linux/arm64.
- Tag each image as ${IMAGE_ORG}/${APP}:${VERSION}.
- Tag each image as ${IMAGE_ORG}/${APP}:latest unless --skip-latest is set.
- Use the Linux binaries produced by the release Docker builder instead of compiling Rust again inside each runtime image build.
- If --local is set, stop after local image builds.
- Otherwise push all generated tags and multi-platform manifests.
The script should print the full list of image tags it built and pushed. This
list should be available to release.sh so the GitHub release notes can include
the Docker image artifacts.
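One way to make that list available is to compute it in a single function that both build.sh and release.sh can call. This is a minimal sketch: the function name image_tags and its argument order are hypothetical, and it only prints tag names without building or pushing anything.

```bash
# image_tags ORG VERSION SKIP_LATEST APP [APP...]
# Prints every image tag the build would produce, one per line.
# SKIP_LATEST is 1 to omit the :latest tags, anything else to include them.
# Hypothetical helper; names and argument order are assumptions.
image_tags() (
  org="$1"; version="$2"; skip_latest="$3"; shift 3
  for app in "$@"; do
    printf '%s/%s:%s\n' "$org" "$app" "$version"
    if [ "$skip_latest" != "1" ]; then
      printf '%s/%s:latest\n' "$org" "$app"
    fi
  done
)
```

For example, `image_tags networknt 0.3.0 1 light-agent light-gateway` prints only the two versioned tags, and the same output can be appended to the Artifacts section of the release notes.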
When build.sh is called from release.sh, it should receive the exported
binary directory explicitly:
./build.sh "$VERSION" --binary-dir "dist/build"
When build.sh is called directly without --binary-dir, it can either invoke
the Docker release builder for the requested platforms or fall back to the
current Dockerfile builder stages. The preferred direct behavior is to invoke
the same Docker release builder so local and CI image builds stay aligned.
Recommended implementation:
- Add a release builder Dockerfile, for example:
docker/Dockerfile.release
- Add a builder target that compiles all apps for one Linux target and exports binaries:
docker buildx build \
--file docker/Dockerfile.release \
--target export-binaries \
--platform linux/amd64 \
--output type=local,dest=dist/build/linux-amd64 \
.
- Repeat for linux/arm64 if multi-architecture Linux images are enabled.
- Package the exported binaries into GitHub release tarballs.
- Build runtime images from those exported binaries, not from another cargo build.
The runtime image Dockerfiles can use a binary-only context or a release target that copies prebuilt binaries:
COPY dist/build/linux-amd64/bin/light-gateway /app/light-gateway
For multi-platform images, docker buildx build --platform linux/amd64,linux/arm64
can publish one image tag with a manifest list. The important point is that
each platform-specific image must use the binary built for that platform.
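The per-platform export step can be sketched as a loop that derives the output directory from the platform name. The sketch prints the commands instead of running them so it can be reviewed safely. The function names are hypothetical, and docker/Dockerfile.release is only the example file name used in this document.

```bash
# export_dir_for_platform PLATFORM
# Maps linux/amd64 to dist/build/linux-amd64, following the layout above.
export_dir_for_platform() {
  printf 'dist/build/%s\n' "$(printf '%s' "$1" | tr '/' '-')"
}

# print_export_commands PLATFORMS [DOCKERFILE]
# PLATFORMS is a comma-separated list such as linux/amd64,linux/arm64.
# Prints one buildx export command per platform (dry run only).
# Hypothetical helper; the default Dockerfile path is an assumption.
print_export_commands() (
  dockerfile="${2:-docker/Dockerfile.release}"
  IFS=,
  for platform in $1; do
    printf 'docker buildx build --file %s --target export-binaries --platform %s --output type=local,dest=%s .\n' \
      "$dockerfile" "$platform" "$(export_dir_for_platform "$platform")"
  done
)
```

Keeping one export directory per platform makes it harder for a runtime image to copy a binary built for the wrong architecture.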
Cross-Platform Binary Strategy
"Build once" means build once per target platform, then reuse that output everywhere that platform can run. It does not mean one binary can serve every operating system and CPU architecture.
Recommended artifact matrix:
| Artifact | Target | Builder |
|---|---|---|
| Linux x86_64 binary archive | x86_64-unknown-linux-gnu or x86_64-unknown-linux-musl | Docker/BuildKit Linux builder |
| Linux arm64 binary archive | aarch64-unknown-linux-gnu or aarch64-unknown-linux-musl | Docker/BuildKit Linux builder |
| Linux Docker image for Intel/AMD | linux/amd64 | Docker/BuildKit Linux builder |
| Linux Docker image for Apple Silicon Docker Desktop | linux/arm64 | Docker/BuildKit Linux builder |
| Apple Silicon macOS binary archive | aarch64-apple-darwin | macOS arm64 runner |
| Windows binary archive | x86_64-pc-windows-msvc | Windows runner |
Apple Silicon has two different release meanings:
- Docker image support for Apple Silicon machines is a Linux arm64 container image. Docker Desktop on Apple Silicon runs Linux containers, so linux/arm64 is the right image platform.
- Native Apple Silicon binaries are macOS binaries targeting aarch64-apple-darwin. These should be built on a macOS runner, not inside a normal Linux Docker build.
Windows binaries and Windows container images are also separate concerns:
- Windows binary archives should target x86_64-pc-windows-msvc and should be built on a Windows runner for the official release.
- Windows container images require Windows base images and a Windows container builder. They should be treated as a later phase unless customers explicitly need Windows containers.
In CI, these builds can run at the same time as separate jobs:
linux-release:
Docker/BuildKit builds Linux binaries and Linux Docker images.
macos-release:
macOS runner builds aarch64-apple-darwin binaries.
windows-release:
Windows runner builds x86_64-pc-windows-msvc binaries.
The release publish job should collect all artifacts and update the same GitHub release page. Docker Hub publishing should remain in the Linux release job because the Docker images are Linux container images.
CHANGELOG Update Strategy
The changelog update should be idempotent.
Rules:
- If CHANGELOG.md does not exist, create it with # Changelog.
- If a section for VERSION already exists, replace that section.
- If no section for VERSION exists, insert the new section immediately after the # Changelog heading.
- Preserve older release sections as-is.
- Never rewrite unrelated content below older release sections.
This makes rerunning the release script safe during release preparation.
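The rules above can be sketched with awk and no extra tooling. This is a minimal sketch, not the final implementation: the function name update_changelog is hypothetical, release headings are assumed to look like "## VERSION - DATE", an existing file is assumed to contain the "# Changelog" heading, and the section file is assumed to start with its own "## VERSION - DATE" line.

```bash
# update_changelog CHANGELOG VERSION SECTION_FILE
# Inserts or replaces the VERSION section; running it twice gives the same file.
# Hypothetical helper; heading formats are assumptions of this sketch.
update_changelog() (
  changelog="$1"; version="$2"; section_file="$3"

  # Create the file with its heading when it does not exist yet.
  [ -f "$changelog" ] || printf '# Changelog\n' > "$changelog"

  # Decide between replacing an existing "## VERSION ..." section and inserting.
  if awk -v v="$version" '$1 == "##" && $2 == v { found = 1 } END { exit !found }' "$changelog"; then
    mode=replace
  else
    mode=insert
  fi

  tmp="$(mktemp)"
  awk -v version="$version" -v section="$section_file" -v mode="$mode" '
    function emit(    line) {
      while ((getline line < section) > 0) print line
      close(section)
      need_sep = 1   # keep one blank line before the next section
    }
    mode == "replace" && $1 == "##" && $2 == version { emit(); skipping = 1; next }
    skipping && $1 == "##" { skipping = 0 }   # next release heading ends the old section
    skipping { next }
    need_sep { if ($0 != "") print ""; need_sep = 0 }
    { print }
    mode == "insert" && !inserted && $0 == "# Changelog" { print ""; emit(); inserted = 1 }
  ' "$changelog" > "$tmp" && mv "$tmp" "$changelog"
)
```

Because a rerun replaces the section in place, older sections below it are copied through unchanged, which is the property the review workflow depends on.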
Manual Review Workflow
For a normal release:
./release.sh 0.3.0 --notes-only --update-changelog
git diff CHANGELOG.md dist/release-notes-0.3.0.md
The release engineer reviews and edits CHANGELOG.md if needed, commits it,
then publishes:
./release.sh 0.3.0 --skip-build
If binaries also need to be rebuilt:
./release.sh 0.3.0
By default, the official Linux binaries and Linux Docker images should be built from Docker and published together. If a developer needs the old host-build path for local troubleshooting:
./release.sh 0.3.0 --host-build --local
If CI is producing all OS artifacts, the release job should collect the platform-specific archives before publishing:
dist/light-fabric-0.3.0-x86_64-unknown-linux-gnu.tar.gz
dist/light-fabric-0.3.0-aarch64-unknown-linux-gnu.tar.gz
dist/light-fabric-0.3.0-aarch64-apple-darwin.tar.gz
dist/light-fabric-0.3.0-x86_64-pc-windows-msvc.zip
If only Docker images need to be rebuilt and pushed with the same release tag:
./release.sh 0.3.0 --docker-only
If only one Docker image needs to be rebuilt locally:
./build.sh 0.3.0 --app light-gateway --local
If the release page already exists and only the notes need refreshing:
./release.sh 0.3.0 --notes-only
gh release edit 0.3.0 --notes-file dist/release-notes-0.3.0.md
The final implementation can make the last command part of release.sh when
--local is not set.
GitHub Release Body
The GitHub release body should be generated from the same release notes file. For new releases:
gh release create "$VERSION" "${ARCHIVES[@]}" \
--title "$VERSION" \
--notes-file "$NOTES_FILE"
For existing releases:
gh release edit "$VERSION" --notes-file "$NOTES_FILE"
gh release upload "$VERSION" "${ARCHIVES[@]}" --clobber
This keeps release reruns predictable. Re-uploading artifacts should not leave stale release notes behind.
Future Conventional Commit Mode
If the team later adopts Conventional Commits, the helper script can switch from
plain git log output to grouped output:
### Features
- add JSON tracing output
### Fixes
- preserve ANSI toggle in demo services
### Documentation
- document Splunk ingestion options
At that point, git-cliff is a good fit. The public contract can remain the
same:
scripts/release-notes.sh VERSION --update-changelog
Only the internals of the generator change.
Risks And Mitigations
| Risk | Mitigation |
|---|---|
| Commit messages are too noisy for customer-facing notes | Generate notes early, then review and edit before publishing. |
| Previous tag detection picks the wrong tag | Support --from TAG override and print the selected range. |
| Release script rerun duplicates changelog sections | Replace existing VERSION section instead of blindly prepending. |
| Existing GitHub release has stale notes after artifact upload | Always call gh release edit --notes-file for existing releases. |
| Local builds unexpectedly modify CHANGELOG.md | Require explicit --update-changelog for file mutation. |
| Binary archives publish but Docker push fails | Build and push images before or immediately after GitHub release publication, print clear recovery commands, and support --docker-only reruns. |
| Docker image tags drift from GitHub release version | Have release.sh call root build.sh with the same VERSION; do not ask operators to type the image version separately. |
| Full release builds take longer because Dockerfiles rebuild Rust | Use Docker/BuildKit as the release builder and make runtime images copy exported binaries instead of running another cargo build. |
| App-level build scripts diverge again | Convert them to wrappers around repo-root build.sh. |
| Apple Silicon image support is confused with macOS binary support | Document that Docker Desktop on Apple Silicon needs linux/arm64 images, while native macOS binaries need aarch64-apple-darwin. |
| Windows artifacts are expected from a Linux Docker build | Build official Windows MSVC binaries on a Windows runner; treat Windows container images as a separate later phase. |
Implementation Plan
- Add CHANGELOG.md with a short heading and no release entries.
- Add scripts/release-notes.sh with dependency-free git-log generation.
- Add idempotent changelog insertion or replacement.
- Add docker/Dockerfile.release or equivalent release-builder targets for Linux binaries.
- Add repo-root build.sh for all app Docker images and Linux image platforms.
- Convert app-level build scripts into compatibility wrappers.
- Update release.sh to generate dist/release-notes-${VERSION}.md.
- Update release.sh to call root build.sh with the same VERSION, unless --skip-docker is set.
- Update runtime image builds to copy binaries exported by the Docker release builder instead of compiling Rust again.
- Add CI matrix jobs for macOS Apple Silicon and Windows binary archives.
- Update publish_release() to use --notes-file for both new and existing releases.
- Add README release documentation for the new flags and review workflow.
- Validate changelog generation locally with:
./release.sh 0.3.0 --notes-only --update-changelog --local
git diff --check
- Validate Docker image builds locally with:
./build.sh 0.3.0 --local
./build.sh 0.3.0 --app light-gateway --local
- Validate combined local release packaging with:
./release.sh 0.3.0 --local
- Validate CI artifact collection for Linux, macOS, and Windows archives.
- Validate GitHub and Docker Hub publishing on a test tag or draft release before using it for a production release.
Light-Workflow Runner
Status
Proposed design.
light-workflow-runner is a tenant-side execution agent for workflow tasks
that must run near tenant systems, tenant repositories, private tools, local
gateways, sidecars, or sandboxed release workspaces. It is not a second
workflow engine and it must not consume workflow start events directly.
The SaaS-owned light-workflow instance remains the authoritative
orchestrator. It consumes workflow start events, creates workflow instances,
persists task state, resolves workflow definitions, applies policy, and owns
audit history. Tenant runners register with controller-rs, receive
server-issued task leases, execute only the leased task, and report normalized
results back to the control plane.
Problem
For SaaS deployments, Light owns the main workflow control plane. Tenants may run APIs, gateways, sidecars, deployers, and other services in their own networks. Some workflow tasks need to execute inside those tenant environments instead of inside the SaaS control plane.
Examples:
- release workflows running in a prepared VM or sandbox with many repositories checked out,
- command-line tasks that need local files or private repository access,
- build and test tasks that need tenant-specific toolchains,
- deployment tasks that need access to private clusters,
- MCP servers or sidecars running only in the tenant network,
- AI repair tasks that need to inspect and patch a local sandbox workspace.
Running multiple full light-workflow instances would create control-plane
ambiguity:
- more than one instance may see the same workflow start event,
- tenant-side config can be changed through environment variables or local values.yml,
- a tenant runtime could claim work outside its intended scope,
- workflow definition loading and event consumption become hard to audit,
- duplicate workflow starts require more complex idempotency and broker ACLs.
The platform needs a runner model that lets tenant-side services execute approved tasks without letting them own workflow orchestration.
Goals
- Keep one authoritative SaaS light-workflow orchestrator for workflow start events and workflow state.
- Add a tenant-side light-workflow-runner executable for command, sandbox, deployment, MCP, and local tool execution.
- Register tenant runners through controller-rs.
- Enforce task visibility with server-side leases, not runner-side local config.
- Support release runners in prepared VMs or sandboxes with checked-out repos and approved toolchains.
- Support per-tenant runner pools, execution profiles, capabilities, and network placement.
- Let controller-rs periodically audit effective runtime configuration.
- Reuse workflow-core task models and result contracts where possible.
Non-Goals
- Do not create a second workflow orchestrator that consumes workflow start events.
- Do not let tenant runners load arbitrary workflow definitions from local config.
- Do not trust tenant-side environment variables or local values.yml as the enforcement boundary.
- Do not expose all workflow tasks to all registered runners.
- Do not let AI or command tasks bypass publish, signing, or human approval gates.
Current Runtime Boundary
The current light-workflow executable starts the workflow event consumer, task
executor, and rule API in one process. The executor actively handles
control-plane task types such as ask, assert, call, set, and switch.
workflow-core already models run.container, run.script, run.shell, and
run.workflow. These task definitions are the right surface for runner-backed
execution, but they still need a runtime executor boundary.
This design keeps the workflow model shared and adds a separate runner executable for effectful execution.
Recommended Architecture
Domain Event
|
| consumed by SaaS control plane only
v
light-workflow
|
| workflow instance, tasks, policy, audit
v
controller-rs
|
| registration, leases, heartbeat, audit
v
light-workflow-runner
|
| local command, sandbox, MCP, deploy, release tools
v
Tenant Runtime Environment
The split is:
- light-workflow: Authoritative orchestrator. It sees workflow start events, loads workflow definitions, creates tasks, computes effective task policy, and records state.
- controller-rs: Runtime control plane. It authenticates runners, records runner capabilities, issues task leases, receives heartbeats, audits runtime config, and quarantines mismatched runners.
- light-workflow-runner: Tenant-side execution agent. It claims only leased work, executes the assigned task in the approved environment, streams logs, and reports normalized results.
- Sandbox or VM: Optional execution substrate used by the runner for high-risk tasks such as release builds, AI repair, scripts, and publishing.
The runner can run beside tenant APIs, gateways, sidecars, and deployers. It may also run in a prepared release VM or sandbox with approved tools and repository workspaces.
Event Visibility
Workflow start events should be visible only to the SaaS-owned
light-workflow orchestrator.
Recommended flow:
- A domain event is published.
- The SaaS light-workflow consumer evaluates matching workflow definitions.
- It creates one workflow instance per matching definition.
- It creates tasks with runner requirements.
- controller-rs exposes only eligible task leases to registered runners.
- Runners execute leased tasks and return results.
This avoids duplicate starts and avoids tenant-side event subscription authorization problems.
If a future deployment requires separate workflow clusters, route start events by lane and enforce broker ACLs:
workflow.start.main
workflow.start.release
workflow.start.deployment
workflow.start.tenant.<tenantId>
Even with event lanes, the workflow database should enforce idempotency on a source-event key such as:
tenant_id + source_event_id + workflow_definition_id
For the SaaS model, task leases are the cleaner boundary than exposing start events to tenant runtimes.
Runner Registration
A runner must register before it can claim work.
Registration should include:
{
"runnerId": "release-runner-01",
"tenantId": "tenant-a",
"hostId": "host-a",
"runnerKind": "release",
"runnerPools": ["release"],
"executionProfiles": ["release-sandbox"],
"capabilities": [
"git",
"maven",
"cargo",
"docker",
"event-importer"
],
"imageDigest": "sha256:...",
"configHash": "sha256:...",
"commandAllowlistHash": "sha256:...",
"workspacePolicy": "release-workspace-v1",
"networkZone": "tenant-private",
"version": "0.3.0"
}
controller-rs validates the registration against server-side runtime policy.
If accepted, it creates a runner session and issues short-lived credentials for
heartbeat and task claim operations.
Local runner config can request capabilities, but the server decides the
effective capabilities. A runner cannot claim work merely because it sets an
environment variable or local values.yml value.
Task Lease Model
The task lease is the enforcement object. The runner should execute a task only when it has a valid lease issued by the control plane.
Lease example:
{
"leaseId": "01970f5d-0000-7000-8000-000000000001",
"tenantId": "tenant-a",
"hostId": "host-a",
"runnerId": "release-runner-01",
"wfInstanceId": "release-2026.06.0",
"taskId": "build-java-products",
"taskType": "run.shell",
"runnerPool": "release",
"executionProfile": "release-sandbox",
"capabilities": ["git", "maven"],
"commandTemplateId": "light-fabric-release-build",
"expiresAt": "2026-06-08T19:30:00Z",
"nonce": "single-use-random-value"
}
Server-side validation must check:
- runner session is active,
- runner is not quarantined,
- tenant and host match,
- task runner pool matches registered pool,
- task execution profile is allowed,
- required capabilities are a subset of effective runner capabilities,
- command template is approved,
- lease is not expired,
- lease has not already been used.
The runner reports task start, logs, progress, and final result using the lease. The control plane rejects reports that do not match the active lease.
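Two of the checks above, the capability subset test and lease expiry, are small enough to illustrate directly. The shell sketch below is illustrative only: the function names are hypothetical, the real checks would live in the control plane rather than in a script, and the expiry comparison assumes both timestamps are UTC in the same YYYY-MM-DDTHH:MM:SSZ form used in the lease example.

```bash
# capabilities_allowed "REQUIRED LIST" "EFFECTIVE LIST"
# Succeeds only if every required capability is in the effective list.
# Both arguments are space-separated. Hypothetical helper.
capabilities_allowed() (
  required="$1"; effective=" $2 "
  for cap in $required; do
    case "$effective" in
      *" $cap "*) ;;   # present, keep checking
      *) exit 1 ;;     # missing capability, reject
    esac
  done
  exit 0
)

# lease_not_expired EXPIRES_AT NOW
# Succeeds only if NOW is strictly before EXPIRES_AT. Timestamps in the same
# fixed-width UTC format order correctly as plain strings.
lease_not_expired() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | LC_ALL=C sort | head -n 1)" = "$2" ]
}
```

With the lease example above, `capabilities_allowed "git maven" "git maven cargo docker event-importer"` succeeds, while a task that also required a capability the runner did not register would be rejected before a lease is issued.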
Task Routing
light-workflow should execute pure control-plane tasks locally:
ask
assert
set
switch
context merge
workflow branching
workflow persistence
approved internal call tasks
light-workflow-runner should execute effectful or tenant-local tasks:
run.shell
run.script
run.container
call.mcp to tenant-local servers
deployment commands
release build and test commands
AI repair with filesystem access
browser automation
external tool processes
Some call.* tasks can run on either side. The routing decision should come
from effective task policy:
| Task | Default Runtime | Notes |
|---|---|---|
| call.http internal SaaS API | light-workflow | Use host-side service credentials. |
| call.http tenant-private API | runner | Needs tenant network access. |
| call.mcp approved SaaS gateway | light-workflow | Gateway enforces tool access. |
| call.mcp tenant-local server | runner | Local sidecar or private MCP server. |
| call.agent no tools | light-workflow | Bounded model call. |
| call.agent with file/tools | runner | Requires sandbox/tool policy. |
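The default routing can be summarized as a lookup in which call.* tasks require an explicit placement. The sketch below is illustrative only: route_task is a hypothetical name, and the placement labels saas and tenant are invented stand-ins for whatever the effective task policy actually records.

```bash
# route_task TASK_TYPE [PLACEMENT]
# Prints the default runtime for a task type. PLACEMENT ("saas" or "tenant")
# is a hypothetical label taken from server-side policy and is required
# for call.* tasks, which can run on either side.
route_task() {
  task_type="$1"; placement="${2:-}"
  case "$task_type" in
    ask|assert|set|switch)
      echo light-workflow ;;
    run.shell|run.script|run.container)
      echo runner ;;
    call.*)
      case "$placement" in
        saas)   echo light-workflow ;;
        tenant) echo runner ;;
        *) echo "route_task: $task_type needs an explicit placement" >&2; return 1 ;;
      esac ;;
    *)
      echo "route_task: unknown task type $task_type" >&2; return 1 ;;
  esac
}
```

Failing when a call.* task has no placement mirrors the intent that routing comes from effective task policy, never from a runner-side default.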
Agent Call Placement
Workflow agent calls need an explicit placement decision. The same workflow can use more than one agent execution mode, but the placement must come from server-side policy and task metadata, not tenant-side local config.
Use three agent execution modes.
Native Workflow Agent
Native call: agent stays in the SaaS-owned light-workflow process. This is
the current bounded agent task model: light-workflow resolves the portal
agent, skill, and tool metadata, builds a constrained prompt from workflow
context, calls the configured model provider, validates structured output, and
continues the workflow.
Use native workflow agents for bounded reasoning:
- classify a request or command result,
- summarize API responses or logs,
- choose a workflow branch,
- draft a customer-facing explanation,
- decide whether human review is required,
- produce JSON output that must match a schema.
Native workflow agents should not receive filesystem access, local network
access, release secrets, or dynamic tool execution. API orchestration should
remain explicit workflow tasks such as call.http, call.mcp, assert,
switch, and ask.
By default, native workflow agents use SaaS-approved model providers and model credentials managed by the Light control plane. Tenant-private repository content, tenant-local logs, local files, and private network data should not be sent to this path unless the tenant policy explicitly allows it.
Runner Agent
Runner agents execute through light-workflow-runner under a server-issued
task lease. Use this mode when the agent needs access to tenant-local state or
effectful tools:
- checked-out repositories,
- command output plus working directory inspection,
- private tenant network access,
- local MCP servers,
- sandbox tools,
- AI repair of source code,
- test reruns,
- branch or pull-request creation.
The main light-workflow instance still creates the task and records the
result. controller-rs issues a lease only to a runner whose effective
capabilities, runner pool, execution profile, command allowlist, workspace
policy, and audit state match the task requirements.
Runner agent lease example:
{
"taskType": "call.agent",
"agentPlacement": "runner",
"runnerPool": "release",
"executionProfile": "release-sandbox",
"sandboxMode": "per-agent-call",
"sandboxProvider": "cubesandbox",
"modelProviderScope": "tenant",
"modelProviderRef": "tenant-openai-eastus",
"credentialRef": "runner-secret://llm-provider",
"dataBoundary": "tenant-network",
"allowedTools": ["git", "maven", "cargo"],
"workspaceAccess": "copy-on-write-release-workspace",
"networkPolicy": "release-egress",
"secretPolicy": "none",
"maxRepairAttempts": 2,
"requiresHumanApprovalBefore": ["publish", "sign", "tag"]
}
The runner agent can inspect files and propose or apply bounded patches inside the approved workspace. It must not publish artifacts, sign releases, push final tags, read unrestricted secrets, or expand its own permission scope.
By default, runner agents use tenant-approved model providers and tenant-owned credentials. This keeps private workspace data and private network context inside the tenant boundary and avoids exposing SaaS model credentials to tenant-side runtimes.
Runner Agent Sandbox Isolation
The runner itself is a tenant-side execution agent. For stronger isolation, the runner can launch the agent task inside a separate sandbox such as Cube Sandbox, a VM, or a Kubernetes Job. This should be a tenant-selectable policy because the runner is deployed in the tenant namespace, but the effective choice must still be recorded and enforced by the control plane.
Recommended isolation levels:
| Isolation Level | Use Case | Default Policy |
|---|---|---|
| no sandbox | bounded model call with no tools, no file access, and no private network mutation | allowed for low-risk tasks |
| workflow-session sandbox | release build/test/diagnosis that needs the same checkout and cache across steps | useful for release workflows |
| per-agent-call sandbox | AI repair, arbitrary code inspection, generated patches, dynamic tools, or untrusted scripts | preferred for high-risk agent tasks |
| per-publish sandbox | signing, publish tokens, artifact upload, and final tag push | required for high-value secrets |
For a release workflow, the runner should usually orchestrate a separate per-agent-call sandbox for AI repair. The runner injects only the leased workspace, approved tools, network policy, and task-scoped secrets. It collects logs, artifacts, patches, and structured output, then destroys or freezes the sandbox according to retention policy.
This creates a layered boundary:
SaaS light-workflow
-> controller-rs task lease
-> tenant light-workflow-runner
-> per-agent sandbox
-> model, tools, files, network
Tenants may choose Cube Sandbox, VM isolation, Kubernetes Job isolation, or no
sandbox for allowed profiles. Runner registration must advertise supported
sandbox providers and modes. If a task requires per-agent-call isolation and
the runner cannot provide it, controller-rs must not issue the lease.
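The admission rule above can be sketched as a pure check. This is a minimal illustration, not the controller-rs implementation: the `RunnerRegistration` and `TaskRequirement` types, their fields, and `can_issue_lease` are hypothetical names chosen for the example, and only pool and sandbox matching is modeled.

```rust
use std::collections::HashSet;

/// Sandbox modes named in the isolation table (enum name is illustrative).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum SandboxMode {
    NoSandbox,
    WorkflowSession,
    PerAgentCall,
    PerPublish,
}

/// Hypothetical view of what a runner advertised at registration.
struct RunnerRegistration {
    runner_pool: String,
    sandbox_modes: HashSet<SandboxMode>,
    sandbox_providers: HashSet<String>,
}

/// Hypothetical view of what a task requires.
struct TaskRequirement {
    runner_pool: String,
    sandbox_mode: SandboxMode,
    /// `None` means any advertised provider is acceptable.
    sandbox_provider: Option<String>,
}

/// Returns true only when the runner satisfies every modeled requirement.
fn can_issue_lease(runner: &RunnerRegistration, task: &TaskRequirement) -> bool {
    if runner.runner_pool != task.runner_pool {
        return false;
    }
    // A runner that cannot provide the required mode never gets the lease.
    if !runner.sandbox_modes.contains(&task.sandbox_mode) {
        return false;
    }
    match &task.sandbox_provider {
        Some(provider) => runner.sandbox_providers.contains(provider),
        None => true,
    }
}
```

Keeping the check free of side effects makes it easy to record the inputs and the decision in the audit trail.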
Local runner config can select among tenant-approved profiles, but it cannot
weaken a task requirement. The lease contains the final effective
sandboxMode, sandboxProvider, workspace, network, tool, and secret policy.
Heartbeat and audit snapshots should prove the runner is still operating under
that profile.
Agent Service
Containerized light-agent services should be invoked explicitly. They are the
right runtime for interactive or independently scaled agents:
- chat and session memory,
- dynamic tools/list and tools/call loops,
- long-lived specialist agents,
- independently deployed model/tool runtime,
- local catalog caching.
Do not silently change native call: agent to call a containerized
light-agent service. Use an explicit contract such as call: agent-service
or call: agent with mode: service so operators can audit which runtime path
was used.
Model Provider Boundary
Agent placement and model-provider placement should be decided together.
Recommended defaults:
native call: agent in SaaS light-workflow
-> SaaS-approved model provider
-> SaaS workflow context data boundary
leased runner agent in tenant workflow runner
-> tenant-approved model provider
-> tenant network/workspace data boundary
containerized light-agent service
-> service-owned or tenant-approved model provider
-> explicit service data boundary
The default SaaS model is useful for bounded reasoning over workflow-safe context, such as classification, summaries, branch decisions, and structured JSON output. It should not be the default path for tenant-local source code, private command logs, local files, or private network data.
The default runner model is useful when the task needs tenant-local context. The runner should resolve model credentials from tenant-controlled secret stores or tenant-approved local provider configuration. SaaS model credentials must not be sent to tenant runners.
The control plane should still make this policy-driven instead of hard-coding it. Some tenants may require every agent call, including bounded summaries, to use their own provider or regional model endpoint. In that case, the workflow task should be routed to a runner or to an approved tenant model gateway even if the reasoning itself is small.
Lease examples:
{
"agentPlacement": "workflow",
"modelProviderScope": "saas",
"modelProviderRef": "light-managed-default",
"credentialRef": "saas-secret://llm-provider",
"dataBoundary": "saas-workflow-context"
}
{
"agentPlacement": "runner",
"modelProviderScope": "tenant",
"modelProviderRef": "tenant-openai-eastus",
"credentialRef": "runner-secret://llm-provider",
"dataBoundary": "tenant-network"
}
Recommended placement rule:
bounded reasoning over workflow context -> native call: agent in light-workflow
agent needs files, tools, or private network -> leased runner agent
interactive session or dynamic tool loop -> containerized light-agent service
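The placement rule can be expressed as a small decision function. This is a sketch under assumptions: `AgentNeeds`, `AgentPlacement`, and `recommend_placement` are hypothetical names, and the precedence (service first, then runner, then native workflow) is an assumption for cases where several conditions hold at once.

```rust
#[derive(Debug, PartialEq, Eq)]
enum AgentPlacement {
    /// Native call: agent inside light-workflow.
    Workflow,
    /// Leased runner agent.
    Runner,
    /// Containerized light-agent service.
    Service,
}

/// Hypothetical description of what an agent call needs.
#[derive(Default)]
struct AgentNeeds {
    needs_files: bool,
    needs_tools: bool,
    needs_private_network: bool,
    interactive_session: bool,
    dynamic_tool_loop: bool,
}

/// Applies the recommended placement rule in an assumed precedence order.
fn recommend_placement(needs: &AgentNeeds) -> AgentPlacement {
    if needs.interactive_session || needs.dynamic_tool_loop {
        AgentPlacement::Service
    } else if needs.needs_files || needs.needs_tools || needs.needs_private_network {
        AgentPlacement::Runner
    } else {
        // Bounded reasoning over workflow context only.
        AgentPlacement::Workflow
    }
}
```

In practice the control plane would still apply tenant policy on top of this recommendation.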
For release workflows, use native call: agent to summarize and classify a
failed command. Use a runner agent for repo inspection, patch generation, test
rerun, and pull-request creation. Human approval remains required before
publish, signing, or final tag creation.
Effective Policy
Workflow definitions and tasks can request runner execution through metadata, but the control plane computes the effective policy.
Workflow-level example:
document:
dsl: "1.0.3"
namespace: release
name: java-release
version: "0.1.0"
metadata:
lightWorkflow:
runner:
runnerPool: release
executionProfile: release-sandbox
capabilities:
- git
- maven
- docker
Task-level example:
do:
- build-java:
run:
shell:
command: ./release.sh
arguments:
- "${ .release.version }"
metadata:
lightWorkflow:
runner:
runnerPool: release
commandTemplateId: light-fabric-release-build
security:
sandbox:
mode: workflow-session
Runtime policy resolution:
- Workflow definition requests a runner profile.
- Task metadata can request stricter handling.
- Tenant policy sets the maximum tenant privilege.
- SaaS service policy sets global allowed runner types.
- Operator-approved profile definitions set allowed commands, networks, images, mounts, sandbox modes, sandbox providers, model provider scopes, data boundaries, and secrets.
- controller-rs validates actual registered runner state.
- The task lease contains the final allowed execution scope.
A task may request stricter isolation than the workflow, but it must not weaken the effective policy.
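One way to picture the "stricter only" rule is to treat sandbox modes as an ordered scale and always take the strictest request. The ordering below (no sandbox, then workflow-session, then per-agent-call, then per-publish) and the `effective_sandbox_mode` function are assumptions made for this sketch; the real policy resolution covers more dimensions than sandbox mode.

```rust
/// Sandbox modes ordered from weakest to strictest (assumed ordering).
/// Deriving `Ord` compares variants by declaration order.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum SandboxMode {
    NoSandbox,
    WorkflowSession,
    PerAgentCall,
    PerPublish,
}

/// Computes the effective mode: a task request can raise the workflow
/// level but never lower it, and the tenant minimum always applies.
fn effective_sandbox_mode(
    workflow: SandboxMode,
    task: Option<SandboxMode>,
    tenant_minimum: SandboxMode,
) -> SandboxMode {
    let requested = task.map_or(workflow, |t| t.max(workflow));
    requested.max(tenant_minimum)
}
```

A task that asks for a weaker mode than its workflow simply keeps the workflow level in this sketch; a stricter design could reject such a definition at validation time.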
Runtime Configuration Audit
Tenant-controlled local configuration cannot be the source of truth. A runner can load local config for its own startup, but the server must verify and audit the effective runtime state.
controller-rs should audit at three points.
Startup Admission
On registration, the runner reports:
- binary version,
- image digest or VM image ID,
- effective config hash,
- command allowlist hash,
- enabled execution profiles,
- runner pools,
- mounted workspace paths,
- supported sandbox modes and providers,
- sandbox provider and template,
- allowed model provider scopes,
- network zone,
- secret policy,
- host and tenant identity.
controller-rs compares this report with approved server-side policy before
allowing claims.
Heartbeat
Each heartbeat should include:
{
"runnerId": "release-runner-01",
"sessionId": "01970f5d-1111-7000-8000-000000000001",
"status": "ready",
"configHash": "sha256:...",
"commandAllowlistHash": "sha256:...",
"imageDigest": "sha256:...",
"activeLeases": 1,
"timestamp": "2026-06-08T19:00:00Z"
}
If a hash changes unexpectedly, the controller marks the runner suspicious and stops issuing new leases.
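The heartbeat comparison can be sketched as follows. The `RuntimeHashes` type and both functions are hypothetical; the field names mirror the heartbeat payload above, and the approved values are assumed to come from server-side policy.

```rust
/// Hash fields carried in a heartbeat, mirrored from the example payload.
#[derive(Debug, Clone, PartialEq, Eq)]
struct RuntimeHashes {
    config_hash: String,
    command_allowlist_hash: String,
    image_digest: String,
}

/// Lists the heartbeat fields whose reported value differs from the approved one.
fn drifted_fields(approved: &RuntimeHashes, reported: &RuntimeHashes) -> Vec<&'static str> {
    let mut drifted = Vec::new();
    if approved.config_hash != reported.config_hash {
        drifted.push("configHash");
    }
    if approved.command_allowlist_hash != reported.command_allowlist_hash {
        drifted.push("commandAllowlistHash");
    }
    if approved.image_digest != reported.image_digest {
        drifted.push("imageDigest");
    }
    drifted
}

/// A runner with any drifted field is treated as suspicious: no new leases.
fn may_issue_new_leases(approved: &RuntimeHashes, reported: &RuntimeHashes) -> bool {
    drifted_fields(approved, reported).is_empty()
}
```

Returning the drifted field names, rather than a bare boolean, gives the audit event something concrete to report.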
Periodic Deep Audit
Periodically, controller-rs should request an effective runtime snapshot from
the runner and compare it with the approved policy. For high-risk runners, the
snapshot should include command allowlist, sandbox template, mount list, network
policy, and secret bindings.
On mismatch:
- Mark runner as quarantined.
- Revoke active claim credentials.
- Stop issuing new leases.
- Emit a runtime audit event.
- Create an operator task if active work may be affected.
Audit is not the only enforcement mechanism. It detects drift after admission. The task lease remains the primary runtime authorization boundary.
Release Runner Mode
A release runner is a specialized light-workflow-runner profile.
It can run in:
- a prepared VM,
- a Cube Sandbox session,
- a Kubernetes Job,
- a controlled bare-metal release host.
Recommended default for release workflows:
- one workflow-session sandbox or VM workspace for checkout, build, test, and package steps,
- per-agent-call sandbox isolation for AI repair, source inspection, generated patches, and test reruns driven by an agent,
- per-task sandbox isolation for publishing, signing, and tasks with release secrets,
- clean checkout inside the runner rather than writable host repository mounts,
- artifact export through controlled storage,
- AI repair limited to sandbox workspace changes or branch/PR creation.
Writable host mounts should be avoided for AI repair and release commands. If host repositories must be mapped, default to read-only mounts and copy the repo into a runner-owned working directory before mutation.
Runner API
The first runner API can be small.
POST /runner/register
POST /runner/heartbeat
POST /runner/claim
POST /runner/task/{leaseId}/started
POST /runner/task/{leaseId}/log
POST /runner/task/{leaseId}/complete
POST /runner/task/{leaseId}/fail
POST /runner/audit-snapshot
POST /runner/drain
controller-rs can expose these APIs directly or mediate them over its
existing persistent connection model. For private tenant networks, outbound
runner registration and polling is preferable to inbound SaaS calls into the
tenant environment.
The claim response should include only the task payload needed for execution, not the full workflow definition.
Command Result Contract
Runner results should use a normalized command result so light-workflow,
human tasks, AI diagnosis, and audit do not depend on raw console parsing.
{
"leaseId": "01970f5d-0000-7000-8000-000000000001",
"taskId": "build-java-products",
"runnerId": "release-runner-01",
"attempt": 1,
"status": "failed",
"exitCode": 1,
"startedAt": "2026-06-08T19:10:00Z",
"completedAt": "2026-06-08T19:18:30Z",
"summary": "Maven test failure in db-provider",
"stdoutRef": "artifact://release/2026.06.0/build/stdout.log",
"stderrRef": "artifact://release/2026.06.0/build/stderr.log",
"artifactRefs": [
"artifact://release/2026.06.0/build/surefire-reports.zip"
],
"changedFiles": [],
"aiDiagnosisAllowed": true
}
The runner should stream logs in chunks and store full logs as artifacts. Workflow context should keep summaries and artifact references, not unbounded stdout or stderr.
Security Requirements
- Runners authenticate to controller-rs with tenant-scoped credentials.
- Task leases are short-lived, single-use, and scoped to one task.
- Runners never see workflow start events unless they are explicitly deployed as trusted orchestrators in a non-SaaS topology.
- Runners receive task payloads, not complete workflow definitions.
- Server-side policy decides runner pools, execution profiles, capabilities, commands, networks, mounts, sandbox modes, model provider scopes, data boundaries, and secrets.
- SaaS model credentials must not be sent to tenant-side runners.
- Tenant-private source code, local files, and private command logs should use tenant-approved model providers unless tenant policy explicitly allows SaaS model processing.
- Secrets are task-scoped and never included in logs or AI prompts.
- AI repair runs only in approved runner profiles and cannot publish or sign.
- Publish and signing tasks require human approval and per-task isolation.
- Runtime drift causes quarantine and lease revocation.
- All task results include runner identity, effective policy version, command template ID, artifact references, and approval references.
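The lease properties in the list above (short-lived, single-use, scoped to one task) can be illustrated with a small state check. This is a sketch only: `TaskLease`, `LeaseError`, and the `issue` and `redeem` methods are hypothetical names, and a real lease would be validated server-side rather than in the runner.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
enum LeaseError {
    WrongTask,
    Expired,
    AlreadyUsed,
}

/// Hypothetical lease record scoped to exactly one task.
struct TaskLease {
    task_id: String,
    expires_at: Instant,
    used: bool,
}

impl TaskLease {
    /// Issues a lease that expires `ttl` after `now`.
    fn issue(task_id: &str, ttl: Duration, now: Instant) -> Self {
        Self { task_id: task_id.to_string(), expires_at: now + ttl, used: false }
    }

    /// Redeems the lease once, for the task it was issued for, before expiry.
    fn redeem(&mut self, task_id: &str, now: Instant) -> Result<(), LeaseError> {
        if self.task_id.as_str() != task_id {
            return Err(LeaseError::WrongTask);
        }
        if now >= self.expires_at {
            return Err(LeaseError::Expired);
        }
        if self.used {
            return Err(LeaseError::AlreadyUsed);
        }
        self.used = true;
        Ok(())
    }
}
```

Passing `now` in explicitly keeps the expiry logic testable without sleeping.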
Implementation Plan
Phase 1: Split Runner Boundary
- Create apps/light-workflow-runner.
- Reuse workflow-core models for run.* task payloads.
- Define runner registration, heartbeat, claim, and result APIs.
- Add server-side runner pools and execution profiles.
- Keep the existing light-workflow event consumer as the only workflow start consumer.
Phase 2: Leased Run Task Execution
- Implement run.shell execution in the runner.
- Add command template allowlists.
- Add normalized command result output.
- Add log streaming and artifact references.
- Route eligible run.shell tasks from light-workflow to registered runners through controller-rs.
Phase 3: Sandbox and Workspace Policy
- Add workflow-session and per-task sandbox modes.
- Support release VM or Cube Sandbox runner profiles.
- Add workspace mount and checkout policies.
- Add network and secret policy enforcement.
- Add runtime config hash reporting.
Phase 4: Audit and Quarantine
- Add periodic effective runtime snapshots.
- Compare runner-reported config with server-approved policy.
- Quarantine drifted runners.
- Revoke active claim credentials.
- Emit audit events and operator tasks.
Phase 5: Release and AI Workflows
- Add release-runner profile.
- Execute Java and Rust release build/test tasks through the runner.
- Add ConfigProfile manifest and event-importer dry-run tasks.
- Add AI failure analysis and bounded repair loops.
- Gate publish and signing tasks behind human approval and per-task isolation.
Open Questions
- Should runner registration and task claim be direct HTTP APIs, WebSocket messages through controller-rs, or both?
- Where should long-running task logs and artifacts be stored for SaaS deployments?
- How should the control plane attest VM-based runners that do not have a container image digest?
- Should command templates be stored in workflow definitions, tenant policy, or a separate runner policy registry?
- How much of the existing TaskExecutor should move into shared crates so light-workflow and light-workflow-runner can share evaluation and result handling without sharing orchestration responsibilities?
Recommendation
Create light-workflow-runner as a separate executable and keep
light-workflow as the single SaaS-owned orchestrator. The runner should be a
leased execution agent, not a workflow starter or workflow definition loader.
This gives tenants a practical way to run workflow tasks near their own APIs, gateways, repositories, clusters, and sandboxes while keeping workflow start events, policy decisions, task visibility, and audit under the SaaS control plane.
Asymmetric Decryptor
asymmetric-decryptor decrypts RSA encrypted configuration values.
It is used by config-loader when a service loads encrypted values that use
the CRYPT:RSA: prefix. The crate supports RSA private keys in PKCS#8 and
PKCS#1 PEM formats and decrypts payloads with RSA-OAEP using SHA-256.
Main Types
- AsymmetricDecryptor: owns the RSA private key and decrypts supported payloads.
- AsymmetricError: error type for prefix, base64, key, and decrypt failures.
- CRYPT_RSA_PREFIX: the required CRYPT:RSA: payload prefix.
Usage
use asymmetric_decryptor::AsymmetricDecryptor;

let decryptor = AsymmetricDecryptor::from_pem(private_key_pem)?;
let plaintext = decryptor.decrypt("CRYPT:RSA:...")?;
Notes
This crate is intentionally small. It does not fetch keys, rotate keys, or
perform configuration merging. Those concerns belong to config-loader and the
runtime layer.
Config Loader
config-loader loads, merges, resolves, and decrypts service configuration.
It provides the common configuration behavior used by fabric services and runtime modules. Configuration can be loaded from YAML, JSON, or TOML files, merged across layers, expanded from values maps, and decrypted when encrypted values are present.
Main Types
- ConfigLoader: loads files and resolves ${key:default} style values.
- ConfigManager<T>: stores hot-swappable typed configuration behind an atomic reference.
- ConfigError: shared error type for IO, parse, decrypt, and conversion failures.
Resolution Model
The loader supports:
- merging multiple config files in order
- external overlays through LIGHT_RS_CONFIG_DIR
- whole-value variable replacement
- embedded variable expansion inside strings
- typed deserialization through Serde
- symmetric encrypted values through symmetric-decryptor
- asymmetric encrypted values through asymmetric-decryptor
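Embedded variable expansion with ${key:default} values can be sketched with the standard library alone. This is an illustration of the idea, not the config-loader implementation: the `expand` function is hypothetical, and leaving an unresolved placeholder untouched when no default is given is an assumption.

```rust
use std::collections::HashMap;

/// Expands every `${key}` or `${key:default}` placeholder found in `input`.
fn expand(input: &str, values: &HashMap<String, String>) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(start) = rest.find("${") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        match after.find('}') {
            Some(end) => {
                let body = &after[..end];
                // Split at the first colon so defaults may contain colons.
                let (key, default) = match body.split_once(':') {
                    Some((k, d)) => (k, Some(d)),
                    None => (body, None),
                };
                match values.get(key).map(String::as_str).or(default) {
                    Some(value) => out.push_str(value),
                    None => {
                        // Assumption: keep unresolved placeholders as written.
                        out.push_str("${");
                        out.push_str(body);
                        out.push('}');
                    }
                }
                rest = &after[end + 1..];
            }
            None => {
                // Unterminated placeholder: copy the remainder verbatim.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

Whole-value replacement is the special case where the placeholder spans the entire string; a typed loader could then convert the resolved text to a number or boolean.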
Usage
use config_loader::ConfigLoader;
use std::collections::HashMap;

let loader = ConfigLoader::from_values(HashMap::new(), None, None)?;
let config: MyConfig = loader.load_typed(["config/my-service.yml"])?;
Consumers
light-runtime uses this crate for service bootstrap and runtime config.
Application crates can also use it for app-specific policy or domain config.
Hindsight Client
hindsight-client provides a small client abstraction for persistent agent
memory.
It stores and recalls memory units from PostgreSQL. The current implementation
uses sqlx and pgvector for vector similarity search.
Main Types
- HindsightMemory: trait used by applications that need memory retention and recall without coupling to a specific database implementation.
- PgHindsightClient: PostgreSQL-backed implementation of HindsightMemory.
- MemoryUnit: returned memory record with content, type, metadata, and bank identity.
Usage
use hindsight_client::{HindsightMemory, PgHindsightClient};

let memory = PgHindsightClient::new(pool);
let unit_id = memory
    .retain(host_id, bank_id, "User prefers concise answers", "fact", None, metadata)
    .await?;
Data Model
The PostgreSQL implementation writes to agent_memory_unit_t and uses
host_id plus bank_id to isolate memory between tenants, users, or sessions.
Consumers
light-agent uses this crate to persist and recall agent conversation memory.
Light Rule
light-rule is the Rust rule engine for evaluating rule definitions and
executing registered actions.
It is designed to align with the rule.yaml specification while remaining
runtime-neutral. Java services can use yaml-rule; Rust services use this
crate.
Main Types
- RuleEngine: evaluates rule conditions and determines action execution.
- MultiThreadRuleExecutor: executes rules with runtime state.
- RuntimeState: input/output state passed through rule evaluation.
- ActionRegistry: registry for action plugins.
- RuleActionPlugin: trait implemented by Rust action handlers.
- Rule, RuleCondition, RuleAction, RuleConfig, EndpointConfig: rule model types.
Action Model
Rules reference actions by actionRef. In Rust, actionRef resolves to a
registered RuleActionPlugin; it is not a Java class name. This keeps the rule
format portable across Java and Rust executors.
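The idea of resolving actionRef through a registry instead of a class name can be sketched in a few lines. The `ActionHandler` trait, `HandlerRegistry`, and `Uppercase` handler below are hypothetical stand-ins written for this example; they are not the light-rule `RuleActionPlugin` or `ActionRegistry` signatures.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an action handler.
trait ActionHandler {
    fn run(&self, input: &str) -> String;
}

/// Example handler that upper-cases its input.
struct Uppercase;

impl ActionHandler for Uppercase {
    fn run(&self, input: &str) -> String {
        input.to_uppercase()
    }
}

/// Maps an actionRef string to a registered handler.
#[derive(Default)]
struct HandlerRegistry {
    handlers: HashMap<String, Box<dyn ActionHandler>>,
}

impl HandlerRegistry {
    fn register(&mut self, action_ref: &str, handler: Box<dyn ActionHandler>) {
        self.handlers.insert(action_ref.to_string(), handler);
    }

    /// Returns `None` when no handler is registered under `action_ref`.
    fn execute(&self, action_ref: &str, input: &str) -> Option<String> {
        self.handlers.get(action_ref).map(|handler| handler.run(input))
    }
}
```

Because the rule only carries a string key, the same rule file can be executed by a Java or a Rust executor, each with its own registered handlers.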
Usage
use light_rule::{ActionRegistry, RuleEngine};

let registry = ActionRegistry::default();
let engine = RuleEngine::new(registry);
Related Design
See Light-Rule for the rule format and its relationship to workflow assertions and portal rule management.
Light Runtime
light-runtime is the shared service runtime for Light Fabric applications.
It owns bootstrap, configuration loading, transport startup, graceful shutdown,
and optional portal registry registration. Apps such as light-agent and
light-deployer should start through this crate instead of binding sockets
directly.
Main Types
- LightRuntimeBuilder: builds a runtime from a transport.
- LightRuntime: configured runtime before start.
- RunningRuntime: running service handle with shutdown support.
- Module: lifecycle hook abstraction.
- RuntimeConfig: resolved runtime configuration.
- ServerConfig: HTTP/HTTPS bind and service identity settings.
- BootstrapConfig: remote config bootstrap settings.
- PortalRegistryConfig: portal registry connection settings.
Startup Pattern
use light_axum::AxumTransport;
use light_runtime::LightRuntimeBuilder;

let runtime = LightRuntimeBuilder::new(AxumTransport::new(app))
    .with_config_dir("config")
    .build();

let running = runtime.start().await?;
running.shutdown().await?;
Configuration
At minimum, runtime services need server.yml. Optional files include
startup.yml, client.yml, and portal-registry.yml.
Related Frameworks
light-runtime is transport-neutral. light-axum supplies the Axum transport
implementation.
MCP Client
mcp-client is a client for calling MCP-compatible gateway endpoints.
It provides a small API for listing and invoking tools through a configured MCP gateway path. It is intentionally focused on the client side; MCP server implementations live in applications or framework layers.
Main Types
- McpGatewayClient: gateway client used by applications.
- McpTool: tool metadata returned by the gateway.
- McpContent: content item returned by MCP tool calls.
- McpToolCallResult: structured result for a tool invocation.
Usage
use mcp_client::McpGatewayClient;

let client = McpGatewayClient::new(gateway_url, path, timeout_ms);
let result = client.call_tool("tool.name", arguments).await?;
Consumers
light-agent uses this crate when an agent session needs to discover or invoke
tools exposed through an MCP gateway.
Model Provider
model-provider defines a common abstraction over LLM providers and implements
multiple provider adapters.
The goal is to let agent and workflow code depend on one Provider trait while
supporting local models, hosted APIs, and provider-specific features.
Main Types
- Provider: async trait implemented by model providers.
- ChatRequest, ChatResponse, ChatMessage: common chat data model.
- ToolSpec, ToolCall: tool-calling model.
- ProviderCapabilities: capability metadata.
- TokenUsage: usage accounting.
- ReliableProvider: reliability wrapper.
- RouterProvider: route requests across multiple providers.
Provider Implementations
Current modules include:
- Anthropic
- Azure OpenAI
- Bedrock
- Claude Code
- Codex
- OpenAI-compatible providers
- Copilot
- Gemini
- Gemini CLI
- GLM
- Kilo Code CLI
- Ollama
- OpenAI
- OpenRouter
- Telnyx
Consumers
light-agent uses this crate to send chat requests and tool specs without
hard-coding a single LLM provider.
Portal Registry
portal-registry provides client support for registering services with Light
Portal or Light Controller.
It uses a JSON-RPC style WebSocket protocol for service registration, metadata
updates, discovery, and cache-management control. Runtime services normally use
this through light-runtime, but applications can also use the client directly
when they need custom registry behavior.
Main Types
- PortalRegistryClient: WebSocket client for registry communication.
- RegistryHandler: trait for handling registry callbacks and messages.
- RegistrationState: client registration state.
- RegistrationBuilder: helper for constructing registration parameters.
- ServiceRegistrationParams: service identity and advertised endpoint.
- ServiceMetadataUpdate: metadata update payload.
Usage
use portal_registry::RegistrationBuilder;

let registration = RegistrationBuilder::new(
    "com.networknt.service-1.0.0",
    "1.0.0",
    "http",
    "127.0.0.1",
    8080,
)
.with_env("dev")
.with_jwt(token)
.build();
Runtime Integration
light-runtime can register a service automatically when server.yml enables
registry support and portal-registry.yml supplies the portal connection.
Symmetric Decryptor
symmetric-decryptor decrypts legacy symmetric encrypted configuration values.
It supports payloads with the CRYPT prefix and decrypts AES-256-CBC data with
a key derived from the configured password using PBKDF2-HMAC-SHA256.
Main Types
- Decryptor: trait implemented by decryptors.
- SymmetricDecryptor: password-based decryptor.
- DecryptError: error type for prefix, format, hex, and cipher failures.
- CRYPT_PREFIX: required CRYPT payload prefix.
Usage
use symmetric_decryptor::{Decryptor, SymmetricDecryptor};

let decryptor = SymmetricDecryptor::new("password");
let plaintext = decryptor.decrypt("CRYPT:...")?;
Consumers
config-loader uses this crate when it encounters symmetric encrypted values
and a config password is available.
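Because the asymmetric prefix CRYPT:RSA: also begins with CRYPT, a caller that dispatches on prefixes has to test the longer prefix first. The sketch below shows that ordering. The constants are local stand-ins whose values are taken from the text above, and `ValueKind` and `classify` are hypothetical names, not the config-loader API.

```rust
/// Local stand-ins for the documented prefixes (values taken from the text).
const CRYPT_RSA_PREFIX: &str = "CRYPT:RSA:";
const CRYPT_PREFIX: &str = "CRYPT";

#[derive(Debug, PartialEq, Eq)]
enum ValueKind<'a> {
    /// Not encrypted; use as-is.
    Plain(&'a str),
    /// RSA payload with the `CRYPT:RSA:` prefix removed.
    Asymmetric(&'a str),
    /// Legacy symmetric value, passed through with its prefix intact.
    Symmetric(&'a str),
}

/// Classifies a raw config value, checking the more specific prefix first.
fn classify(value: &str) -> ValueKind<'_> {
    if let Some(payload) = value.strip_prefix(CRYPT_RSA_PREFIX) {
        ValueKind::Asymmetric(payload)
    } else if value.starts_with(CRYPT_PREFIX) {
        ValueKind::Symmetric(value)
    } else {
        ValueKind::Plain(value)
    }
}
```

Reversing the two checks would route every RSA value to the symmetric path, which is the mistake this ordering avoids.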
Workflow Builder
workflow-builder provides fluent builders for creating Agentic Workflow
definitions programmatically.
It depends on workflow-core for the actual model types and layers a builder
API on top so applications and tests can construct valid workflows without
manually assembling nested maps.
Main Areas
- workflow metadata construction
- authentication definitions
- task definitions
- nested do, for, fork, try, and other task structures
- YAML/JSON serialization through workflow-core model types
Usage
use workflow_builder::services::workflow::WorkflowBuilder;

let workflow = WorkflowBuilder::new()
    .use_dsl("1.0.0")
    .with_namespace("lightapi")
    .with_name("example")
    .with_version("1.0.0")
    .build();
Relationship To Workflow Core
Use workflow-core when you need direct access to the schema model. Use
workflow-builder when you want an ergonomic construction API.
Workflow Core
workflow-core contains the Rust model for the Agentic Workflow DSL.
The crate is schema-oriented: its structs and enums represent workflow documents, tasks, authentication blocks, durations, timeouts, errors, and supporting map types.
Main Areas
- workflow document metadata
- task definitions
- call task protocol definitions
- ask and assert task definitions
- duration and timeout models
- error definitions
- ordered map support for workflow task lists
Usage
use workflow_core::models::workflow::{
    WorkflowDefinition, WorkflowDefinitionMetadata,
};

let document = WorkflowDefinitionMetadata::new(
    "lightapi",
    "example",
    "1.0.0",
    Some("Example".to_string()),
    None,
    None,
    None,
);
let workflow = WorkflowDefinition::new(document);
Consumers
workflow-builder builds on this crate. light-workflow and workflow-related
services use the model for loading, validating, and executing workflow
documents.
Light-Axum
light-axum adapts Axum applications to light-runtime.
Applications implement AxumApp and return an axum::Router. The framework
owns binding, optional TLS, runtime metadata resolution, and graceful shutdown
through the runtime transport contract.
Main Types
- AxumApp: trait implemented by an application.
- AxumTransport: transport passed to LightRuntimeBuilder.
- ServerContext: runtime context passed into the app when building routes.
- AxumBoundHandle: running Axum server handle.
Pattern
use light_axum::{AxumApp, AxumTransport, ServerContext};
use light_runtime::LightRuntimeBuilder;

#[derive(Clone)]
struct App;

impl AxumApp for App {
    fn router(&self, _context: ServerContext) -> axum::Router {
        axum::Router::new()
    }
}

let runtime = LightRuntimeBuilder::new(AxumTransport::new(App))
    .with_config_dir("config")
    .build();
Consumers
light-agent and light-deployer use this framework.
Light-Pingora
light-pingora adapts Pingora proxy services to light-runtime.
It is the framework layer for high-performance gateway and proxy products. The crate keeps runtime concerns such as configuration and service lifecycle separate from Pingora-specific proxy behavior.
Role
- bridge Pingora services into the common runtime lifecycle
- expose transport metadata to light-runtime
- support gateway products without duplicating bootstrap code
Consumers
light-gateway uses this framework.
MSAL Exchange
The msal-exchange handler is a BFF security handler for SPA applications
that authenticate with Microsoft Authentication Library, MSAL, and need an
internal light-oauth security profile for gateway authorization.
The SPA obtains Azure MSAL tokens in the browser. It sends the MSAL ID token to the gateway for light-oauth token exchange. In the Azure authorization placement pattern, it also sends the MSAL access token during the exchange so the gateway can store it in a secure BFF cookie. The internal light-oauth token set is stored in secure BFF cookies and is used on later requests together with CSRF protection.
This page documents the current behavior and the token placement extension for
deployments that must keep the Azure MSAL access token in the downstream
Authorization header while forwarding the light-oauth token in a separate
header.
Use Cases
Use msal-exchange when:
- The UI is a browser SPA using MSAL.js.
- Azure Entra ID is the identity provider for the browser login.
- The gateway must exchange the Azure token for a light-oauth token containing the enterprise security profile and custom claims.
- The gateway must protect browser requests with HttpOnly cookies and CSRF.
- Downstream routing needs either the light-oauth token or the Azure MSAL token in the Authorization header.
Handler Placement
Enable the handler in the gateway handler chain before downstream routing and before handlers that depend on the authenticated principal.
Example:
handlers:
- exception
- cors
- msal-exchange
- header
- prefix
- router
chains:
bff:
- exception
- cors
- msal-exchange
- header
- prefix
- router
paths:
- path: /auth/ms/exchange
method: POST
exec:
- bff
- path: /auth/ms/logout
method: GET
exec:
- bff
When the handler is active, the gateway needs these resolved config files:
- msal-exchange.yml
- security-msal.yml
- security.yml
- client.yml
security-msal.yml validates Azure MSAL tokens. security.yml validates the
light-oauth tokens stored in BFF cookies. client.yml provides the
light-oauth token-exchange client configuration.
Exchange Flow
The exchange endpoint receives the Azure MSAL ID token from the SPA and creates the BFF session.
POST /auth/ms/exchange
Authorization: Bearer <azure-msal-id-token>
-> read the Azure MSAL ID token
-> verify the ID token with security-msal.yml
-> generate a CSRF value
-> call light-oauth with the token-exchange grant
-> verify the returned light-oauth access token with security.yml
-> set BFF cookies
-> return { "scopes": [...] }
The token-exchange request uses client.yml oauth.token.token_exchange.
The outgoing form body contains:
grant_type=urn:ietf:params:oauth:grant-type:token-exchange
subject_token=<azure-msal-id-token>
subject_token_type=urn:ietf:params:oauth:token-type:jwt
csrf=<generated-csrf>
subjectTokenType can be set in msal-exchange.yml. When it is blank, the
shared token client default from client.yml is used.
On success, the response body contains the scopes from the light-oauth token:
{
"scopes": ["scope1", "scope2"]
}
Session Cookies
The handler uses the same cookie contract as the stateless SPA auth handler.
| Cookie | HttpOnly | Description |
|---|---|---|
| accessToken | true | light-oauth access token |
| refreshToken | true | light-oauth refresh token, when returned |
| msalAccessToken | true | Azure MSAL access token when authorizationToken is azure-msal |
| csrf | false | Generated CSRF value |
| userId | false | User id from uid, user_id, or sub |
| userType | false | User type from userType |
| roles | false | Base64 encoded role value, default user |
| host | false | Host claim |
| email | false | Email claim from eml |
| eid | false | Enterprise id claim |
accessToken and refreshToken are HttpOnly so browser JavaScript cannot read
the light-oauth tokens. The SPA reads the non-HttpOnly csrf cookie and sends
it back with protected requests.
CSRF Validation
For normal protected requests, the handler validates the request CSRF value
against the csrf claim in the light-oauth access token.
CSRF source order:
- X-CSRF-TOKEN request header.
- Sec-WebSocket-Protocol value starting with "csrf." for WebSocket requests.
- csrf query parameter.
If the CSRF value is missing or does not match the JWT claim, the request is rejected.
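The source order and the match against the JWT claim can be sketched as two small functions. This is an illustration under assumptions: `CsrfSources`, `request_csrf`, and `csrf_matches` are hypothetical names, and stripping the csrf. prefix from the WebSocket subprotocol entry before comparison is an assumption about how the value is encoded.

```rust
/// Raw request inputs that may carry the CSRF value (hypothetical shape).
struct CsrfSources<'a> {
    /// Value of the X-CSRF-TOKEN header.
    header: Option<&'a str>,
    /// Value of the Sec-WebSocket-Protocol header (comma-separated list).
    websocket_protocol: Option<&'a str>,
    /// Value of the csrf query parameter.
    query: Option<&'a str>,
}

/// Picks the CSRF value using the documented source order.
fn request_csrf<'a>(sources: &CsrfSources<'a>) -> Option<&'a str> {
    sources
        .header
        .or_else(|| {
            sources.websocket_protocol.and_then(|protocols| {
                protocols
                    .split(',')
                    .map(str::trim)
                    // Assumption: the token follows the "csrf." prefix.
                    .find_map(|entry| entry.strip_prefix("csrf."))
            })
        })
        .or(sources.query)
}

/// A missing value or a mismatch with the JWT claim rejects the request.
fn csrf_matches(request_value: Option<&str>, jwt_claim: &str) -> bool {
    matches!(request_value, Some(value) if value == jwt_claim)
}
```

The WebSocket fallback exists because browsers cannot set custom headers on a WebSocket handshake, while the subprotocol list is under script control.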
Token Placement
authorizationToken selects which token owns the downstream Authorization
header after the BFF session has been established.
Supported values:
| Value | Authorization header | Light-oauth token location | Use case |
|---|---|---|---|
| light-oauth | Bearer <light-oauth-token> | Authorization | Existing enterprise BFF pattern |
| azure-msal | Bearer <azure-msal-access-token> | lightTokenHeader, default X-Light-Token | Azure-whitelisted downstream systems, such as AWS Agent Core |
authorizationToken: light-oauth
This is the current default behavior.
After the exchange, the SPA calls the gateway with cookies and CSRF:
GET /api/orders
Cookie: accessToken=...; csrf=...
X-CSRF-TOKEN: <csrf>
The handler:
-> reads the light-oauth accessToken cookie
-> verifies it with security.yml
-> validates CSRF
-> refreshes the token if it is close to expiry
-> injects Authorization: Bearer <light-oauth-token>
-> continues the handler chain
Downstream services receive:
Authorization: Bearer <light-oauth-token>
This mode is appropriate when downstream services and MCP tools trust
light-oauth directly and expect fine-grained security claims in the normal
Authorization header.
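The "close to expiry" step in the flow above is governed by renewBeforeSeconds (default 90). A minimal sketch of that decision, assuming exp is the token's expiry as a Unix timestamp and that the boundary is inclusive (both are assumptions, and should_refresh is a hypothetical name):

```python
import time


def should_refresh(exp: float, renew_before_seconds: int = 90, now=None) -> bool:
    """True when the token expires within the renewal window or has already expired."""
    current = time.time() if now is None else now
    return exp - current <= renew_before_seconds
```

With the default window, a token that expires in 100 seconds is left alone, while one that expires in 90 seconds or less triggers a refresh.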
authorizationToken: azure-msal
This token placement pattern uses both Azure and light-oauth tokens downstream.
At exchange time, the SPA sends the MSAL ID token in Authorization and the
MSAL access token in msalAccessTokenHeader, which defaults to
X-MSAL-Access-Token:
POST /auth/ms/exchange
Authorization: Bearer <azure-msal-id-token>
X-MSAL-Access-Token: Bearer <azure-msal-access-token>
-> verify the MSAL ID token with security-msal.yml
-> verify the MSAL access token with security-msal.yml
-> exchange the ID token for a light-oauth token
-> store the light-oauth token in accessToken
-> store the MSAL access token in msalAccessToken
For later protected requests, the SPA sends cookies and CSRF. The SPA does not
need to put the Azure access token in the browser request Authorization
header because the gateway reads it from the HttpOnly msalAccessToken cookie:
GET /agent/chat
Cookie: accessToken=...; msalAccessToken=...; csrf=...
X-CSRF-TOKEN: <csrf>
The handler:
-> reads the MSAL access token from the msalAccessToken cookie
-> verifies the MSAL access token with security-msal.yml
-> reads the light-oauth accessToken cookie
-> verifies the light-oauth token with security.yml
-> validates CSRF
-> refreshes the light-oauth token if it is close to expiry
-> injects Authorization: Bearer <azure-msal-access-token>
-> injects X-Light-Token: Bearer <light-oauth-token>
-> continues the handler chain
Downstream systems receive both tokens:
Authorization: Bearer <azure-msal-access-token>
X-Light-Token: Bearer <light-oauth-token>
This mode is intended for systems that only allow Azure as the OAuth provider
for the normal Authorization header, while still needing the light-oauth
security profile for API and MCP authorization decisions.
The SPA should not read or send X-Light-Token itself. The gateway should
derive that header from the HttpOnly light-oauth cookie after CSRF validation.
That keeps the light-oauth token out of browser JavaScript.
If a downstream light-gateway is responsible for fine-grained authorization,
it must be configured to verify X-Light-Token as the light-oauth token or to
promote X-Light-Token to Authorization at a trusted boundary before the
normal security/access-control handlers run.
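The two placement modes reduce to a small mapping from the configured authorizationToken value to the headers sent downstream. The sketch below illustrates that mapping only; downstream_headers is a hypothetical helper, and both tokens are assumed to have been verified already.

```python
def downstream_headers(mode: str, light_token: str, msal_token=None,
                       light_token_header: str = "X-Light-Token") -> dict:
    """Build the headers injected for downstream services in each placement mode."""
    if mode == "light-oauth":
        # Existing behavior: the light-oauth token owns Authorization.
        return {"Authorization": f"Bearer {light_token}"}
    if mode == "azure-msal":
        if not msal_token:
            # Mirrors ERR11000: the MSAL access token is required in this mode.
            raise ValueError("ERR11000: Azure MSAL access token is missing")
        return {
            "Authorization": f"Bearer {msal_token}",
            light_token_header: f"Bearer {light_token}",
        }
    raise ValueError(f"unsupported authorizationToken: {mode!r}")
```

For example, downstream_headers("azure-msal", "L", "M") yields Authorization: Bearer M together with X-Light-Token: Bearer L, matching the headers shown for the azure-msal mode.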
Configuration
Example default configuration:
enabled: ${msal-exchange.enabled:true}
exchangePath: ${msal-exchange.exchangePath:/auth/ms/exchange}
logoutPath: ${msal-exchange.logoutPath:/auth/ms/logout}
cookieDomain: ${msal-exchange.cookieDomain:localhost}
cookiePath: ${msal-exchange.cookiePath:/}
cookieSecure: ${msal-exchange.cookieSecure:false}
sessionTimeout: ${msal-exchange.sessionTimeout:3600}
rememberMeTimeout: ${msal-exchange.rememberMeTimeout:604800}
renewBeforeSeconds: ${msal-exchange.renewBeforeSeconds:90}
refreshSingleFlightWaitMs: ${msal-exchange.refreshSingleFlightWaitMs:5000}
refreshSingleFlightCacheMs: ${msal-exchange.refreshSingleFlightCacheMs:3000}
refreshSingleFlightMaxEntries: ${msal-exchange.refreshSingleFlightMaxEntries:10000}
cookieSameSite: ${msal-exchange.cookieSameSite:None}
cookieTimeoutUri: ${msal-exchange.cookieTimeoutUri:/}
subjectTokenType: ${msal-exchange.subjectTokenType:}
authorizationToken: ${msal-exchange.authorizationToken:light-oauth}
lightTokenHeader: ${msal-exchange.lightTokenHeader:X-Light-Token}
msalAccessTokenHeader: ${msal-exchange.msalAccessTokenHeader:X-MSAL-Access-Token}
msalAccessTokenCookie: ${msal-exchange.msalAccessTokenCookie:msalAccessToken}
Fields:
| Field | Default | Description |
|---|---|---|
| enabled | true | Enables or disables the handler once it is active in the chain. |
| exchangePath | /auth/ms/exchange | Endpoint that receives the Azure MSAL ID token and creates the BFF session. |
| logoutPath | /auth/ms/logout | Endpoint that clears BFF cookies. |
| cookieDomain | localhost | Cookie domain for session cookies. |
| cookiePath | / | Cookie path for session cookies. |
| cookieSecure | false | Adds the Secure cookie attribute. Use true for HTTPS deployments. |
| sessionTimeout | 3600 | Default max age in seconds for session cookies. |
| rememberMeTimeout | 604800 | Max age in seconds for long-lived refresh-token cookies when light-oauth returns remember-me behavior. |
| renewBeforeSeconds | 90 | Refresh the light-oauth access token when it expires within this window. |
| refreshSingleFlightWaitMs | 5000 | Maximum wait time for concurrent refresh requests sharing the same refresh token. |
| refreshSingleFlightCacheMs | 3000 | Short cache window for a successful refresh result. |
| refreshSingleFlightMaxEntries | 10000 | Maximum refresh single-flight cache entries. |
| cookieSameSite | None | Cookie SameSite attribute. Supported values are None, Lax, and Strict. |
| cookieTimeoutUri | / | URI returned when the session expires and cannot be refreshed. |
| subjectTokenType | blank | Optional token-exchange subject token type override. |
| authorizationToken | light-oauth | Token to place in downstream Authorization: light-oauth or azure-msal. |
| lightTokenHeader | X-Light-Token | Header used for the light-oauth token when authorizationToken is azure-msal. |
| msalAccessTokenHeader | X-MSAL-Access-Token | Header that carries the Azure MSAL access token on the exchange request when authorizationToken is azure-msal. |
| msalAccessTokenCookie | msalAccessToken | HttpOnly cookie used to store the Azure MSAL access token after exchange when authorizationToken is azure-msal. |
Invalid authorizationToken values should fail startup. lightTokenHeader
should not be Authorization; use authorizationToken: light-oauth for that
case. In azure-msal mode, msalAccessTokenHeader must not be
Authorization because Authorization carries the MSAL ID token on the
exchange endpoint. msalAccessTokenHeader must also be different from
lightTokenHeader.
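These startup rules can be expressed as a single validation pass. The following is a sketch under stated assumptions: placement_errors is a hypothetical name, header names are compared case-insensitively, and the header checks are applied only in azure-msal mode, which is one reading of the rules above.

```python
SUPPORTED_MODES = {"light-oauth", "azure-msal"}


def placement_errors(cfg: dict) -> list:
    """Return a list of configuration problems; an empty list means startup may proceed."""
    errors = []
    mode = cfg.get("authorizationToken", "light-oauth")
    # HTTP header names are case-insensitive, so compare them lowercased.
    light_header = cfg.get("lightTokenHeader", "X-Light-Token").lower()
    msal_header = cfg.get("msalAccessTokenHeader", "X-MSAL-Access-Token").lower()

    if mode not in SUPPORTED_MODES:
        errors.append(f"authorizationToken must be one of {sorted(SUPPORTED_MODES)}")
    if mode == "azure-msal":
        if light_header == "authorization":
            errors.append("lightTokenHeader must not be Authorization")
        if msal_header == "authorization":
            errors.append("msalAccessTokenHeader must not be Authorization")
        if msal_header == light_header:
            errors.append("msalAccessTokenHeader must differ from lightTokenHeader")
    return errors
```

A caller would treat any non-empty result as a fatal startup error.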
Security Configuration
security-msal.yml validates Azure MSAL tokens. It is required when the handler
is active.
Example:
enableVerifyJwt: ${security-msal.enableVerifyJwt:true}
ignoreJwtExpiry: ${security-msal.ignoreJwtExpiry:false}
enableRelaxedKeyValidation: ${security-msal.enableRelaxedKeyValidation:false}
issuer: ${security-msal.issuer:}
audience: ${security-msal.audience:}
jwt:
  certificate: ${security-msal.jwt.certificate:}
  clockSkewInSeconds: ${security-msal.jwt.clockSkewInSeconds:60}
  keyResolver: ${security-msal.jwt.keyResolver:}
Recommended settings:
- Set issuer to the Azure tenant issuer when the tenant is known.
- Set audience to the SPA client id or the expected Azure access-token audience.
- Keep ignoreJwtExpiry: false in production.
- Use the configured Microsoft JWK or certificate resolver supported by the gateway security runtime.
security.yml remains the normal light-oauth verifier. It validates the
light-oauth access token stored in the accessToken cookie and provides the
principal used by gateway authorization logic.
SPA Integration
Initial exchange:
await fetch("/auth/ms/exchange", {
  method: "POST",
  credentials: "include",
  headers: {
    Authorization: `Bearer ${azureMsalIdToken}`
  }
});
Initial exchange with authorizationToken: azure-msal:
await fetch("/auth/ms/exchange", {
  method: "POST",
  credentials: "include",
  headers: {
    Authorization: `Bearer ${azureMsalIdToken}`,
    "X-MSAL-Access-Token": `Bearer ${azureMsalAccessToken}`
  }
});
Subsequent requests with the existing light-oauth authorization pattern:
await fetch("/api/orders", {
  credentials: "include",
  headers: {
    "X-CSRF-TOKEN": csrf
  }
});
Subsequent requests with the Azure MSAL authorization pattern:
await fetch("/agent/chat", {
  credentials: "include",
  headers: {
    "X-CSRF-TOKEN": csrf
  }
});
In both patterns, the SPA must send cookies with credentials: "include".
In the Azure MSAL authorization pattern, MSAL.js is responsible for obtaining
the Azure access token before calling /auth/ms/exchange. The gateway stores
that access token in the HttpOnly msalAccessToken cookie, validates it on
later BFF requests, injects it into Authorization, and injects the
light-oauth token into lightTokenHeader.
Logout
Logout clears all BFF cookies managed by the handler:
GET /auth/ms/logout
The handler returns an empty 200 response with deletion cookies for the known
session cookie names.
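A deletion cookie is an empty cookie with Max-Age=0 that repeats the Domain and Path used when the cookie was set; browsers only remove a cookie when those attributes match. The sketch below builds the Set-Cookie values for the cookie names from the Session Cookies table. It is illustrative only: deletion_cookies is a hypothetical helper, and the attribute order and defaults are assumptions taken from the example configuration.

```python
SESSION_COOKIES = [
    "accessToken", "refreshToken", "msalAccessToken", "csrf", "userId",
    "userType", "roles", "host", "email", "eid",
]
HTTP_ONLY = {"accessToken", "refreshToken", "msalAccessToken"}


def deletion_cookies(domain: str = "localhost", path: str = "/",
                     secure: bool = False, same_site: str = "None") -> list:
    """Return one Set-Cookie header value per session cookie, each expiring immediately."""
    headers = []
    for name in SESSION_COOKIES:
        parts = [f"{name}=", f"Domain={domain}", f"Path={path}",
                 "Max-Age=0", f"SameSite={same_site}"]
        if secure:
            parts.append("Secure")
        if name in HTTP_ONLY:
            parts.append("HttpOnly")
        headers.append("; ".join(parts))
    return headers
```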
Error Handling
Important error codes:
| Code | Meaning |
|---|---|
| ERR11000 | Required Azure MSAL bearer token is missing on the exchange endpoint or in the MSAL access-token cookie. |
| ERR11001 | light-oauth token exchange failed. |
| ERR10000 | Azure MSAL token or light-oauth token verification failed. |
| ERR10036 | CSRF token is missing from the request. |
| ERR10038 | CSRF claim is missing from the light-oauth token. |
| ERR10039 | Request CSRF and token CSRF do not match. |
| ERR10052 | Token response does not contain expires_in and the JWT has no usable exp. |
Implementation Notes
Rust light-pingora and Java light-spa-4j use the same token placement
contract:
- authorizationToken: light-oauth preserves the existing behavior and injects the light-oauth token into Authorization.
- authorizationToken: azure-msal verifies the exchange request's MSAL ID token and MSAL access token with security-msal.yml, stores the MSAL access token in msalAccessToken, injects it into downstream Authorization, and injects the light-oauth token into lightTokenHeader.
- lightTokenHeader defaults to X-Light-Token and must not be Authorization when authorizationToken is azure-msal.
- msalAccessTokenHeader defaults to X-MSAL-Access-Token and is used only on the exchange endpoint.
- msalAccessTokenCookie defaults to msalAccessToken and is HttpOnly.
In azure-msal placement, the gateway requires the MSAL access-token cookie
only when a BFF session cookie is present. Requests without accessToken or
refreshToken cookies keep the existing pass-through behavior so public
endpoints are not forced to authenticate at this handler.
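The pass-through rule can be read as a three-way decision on the incoming cookies. A minimal sketch, assuming cookies is a dict of already parsed cookie values; classify_request and its return labels are hypothetical:

```python
def classify_request(mode: str, cookies: dict,
                     msal_cookie: str = "msalAccessToken") -> str:
    """Decide how the handler treats a request based on which cookies are present."""
    has_session = bool(cookies.get("accessToken") or cookies.get("refreshToken"))
    if not has_session:
        # No BFF session: leave the request alone so public endpoints keep working.
        return "pass-through"
    if mode == "azure-msal" and not cookies.get(msal_cookie):
        # A session exists but the MSAL access-token cookie is missing.
        return "ERR11000"
    return "protected"
```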
Light-Agent
light-agent is the interactive agent service in Light-Fabric.
It provides a WebSocket chat interface, integrates with model providers,
invokes MCP tools through mcp-client, and stores conversation memory through
hindsight-client.
Key Dependencies
- light-runtime
- light-axum
- model-provider
- mcp-client
- hindsight-client
- portal-registry
Runtime
The app follows the standard runtime pattern:
- load config from config/
- implement an Axum app
- start through LightRuntimeBuilder
- optionally register through portal registry
Deploy Native
This page describes the recommended VM deployment model for the Rust
light-agent native binary.
Use this model when a customer wants to run an agent service on a VM and expose
the chat UI/WebSocket endpoint outside Kubernetes. The agent serves the local
chat UI, connects to an LLM provider, calls MCP tools through light-gateway,
stores conversation memory in Postgres, and registers with controller.
Recommended Model
Deliver a versioned install bundle, not an ad hoc runtime script.
The bundle should contain:
- light-agent native binary.
- public/ static assets for the chat UI.
- Minimal bootstrap config files.
- A systemd unit.
- An install script for filesystem setup.
- A root-owned environment file for secrets.
Use systemd to run the service:
- It restarts the process on failure.
- It keeps logs in the host journal.
- It avoids shell-history and process-list leakage from command-line secrets.
- It gives the customer a standard operational surface: start, stop, restart, status, and journalctl.
Do not use a long-running shell wrapper to pass the bootstrap token, database URL, or model configuration. Use config files and an environment file instead.
Runtime Layout
light-agent uses relative runtime paths:
- config
- public
The systemd service should therefore set WorkingDirectory to the installed
application directory.
Recommended VM layout:
/opt/light-agent/
  light-agent -> releases/2.2.1/light-agent
  releases/
    2.2.1/
      light-agent
  config -> /etc/light-agent
  public/
    index.html
/etc/light-agent/
  startup.yml
  server.yml
  portal-registry.yml
  client.yml
  mcp-client.yml
  ollama.yml
  values.yml
  ca.pem
  light-agent.env
/var/lib/light-agent/
  config-cache/
The local config directory contains bootstrap and agent-specific config.
Runtime config downloaded from config-server should be written to
/var/lib/light-agent/config-cache by setting externalConfigDir in
startup.yml.
Keep /etc/light-agent readable by the service user. Keep
/var/lib/light-agent/config-cache writable by the service user.
Build Artifact
Build a release binary from light-fabric:
cargo build --release -p light-agent
The artifact is:
target/release/light-agent
For a static Linux build that matches the Docker build target:
rustup target add x86_64-unknown-linux-musl
cargo build --release -p light-agent --target x86_64-unknown-linux-musl
The static artifact is:
target/x86_64-unknown-linux-musl/release/light-agent
Build on a compatible Linux distribution for the customer VM. If the customer
fleet has mixed Linux versions, prefer a static or target-compatible build so
the binary does not fail on an older glibc.
Package with a versioned filename:
light-agent-<version>-linux-amd64.tar.gz
Include the static assets from:
apps/light-agent/public/
Runtime Dependencies
The VM must be able to reach:
- Controller, through portalRegistry.portalUrl.
- Config-server, through startup.configServerUri.
- light-gateway, through mcp-client.gatewayUrl and mcp-client.path.
- The model provider, currently Ollama by default.
- Postgres, through DATABASE_URL.
The Postgres database must contain the Hindsight memory tables used by
light-agent, including:
- agent_memory_bank_t
- agent_memory_unit_t
- agent_session_history_t
LIGHT_AGENT_HOST_ID must be a valid host UUID for the target tenant/host. The
agent stores memory and session history under this host id.
Agent Roles
The same binary can run different logical agents. Use a different service id,
port, install directory, and systemd unit for each concurrently running role.
Common service ids are:
com.networknt.agent.account-1.0.0
com.networknt.agent.advisor-1.0.0
com.networknt.agent.tech-support-1.0.0
For a single account agent, keep the service name light-agent. For multiple
agents on the same VM, use names such as:
light-agent-account
light-agent-advisor
light-agent-tech-support
Each role needs a unique listener port if they run on the same VM.
Bootstrap Config
The local bootstrap config needs enough information to reach config-server,
controller, light-gateway, Ollama, and Postgres.
Example values.yml for an account agent:
startup.host: customer.example.com
startup.timeout: 3000
startup.connectTimeout: 3000
startup.bootstrapCaCertPath: config/ca.pem
startup.externalConfigDir: /var/lib/light-agent/config-cache
light-config-server-uri: https://config-server.customer.example.com:8435
server.serviceId: com.networknt.agent.account-1.0.0
server.environment: prod
server.ip: 0.0.0.0
server.advertisedAddress: agent-account-01.customer.example.com
server.httpPort: 8083
server.enableHttp: true
server.httpsPort: 8443
server.enableHttps: false
server.enableRegistry: true
server.startOnRegistryFailure: true
portalRegistry.portalUrl: https://controller.customer.example.com:8438
client.verifyHostname: true
mcp-client.gatewayUrl: https://mcp-gateway.customer.example.com
mcp-client.path: /mcp
mcp-client.timeoutMs: 5000
ollama.ollamaUrl: http://ollama.customer.example.com:11434
ollama.model: llama3.1:8b
server.advertisedAddress must be a stable address that controller and clients
can use to reach the VM agent. Do not advertise 127.0.0.1 or 0.0.0.0.
Example startup.yml:
host: ${startup.host:dev.lightapi.net}
serviceId: ${server.serviceId:com.networknt.agent.account-1.0.0}
envTag: ${server.environment:dev}
acceptHeader: application/yaml
timeout: ${startup.timeout:3000}
connectTimeout: ${startup.connectTimeout:3000}
configServerUri: ${light-config-server-uri:https://local.localhost}
authorization: ${light_portal_authorization:}
bootstrapCaCertPath: ${startup.bootstrapCaCertPath:config/ca.pem}
externalConfigDir: ${startup.externalConfigDir:/var/lib/light-agent/config-cache}
Example server.yml:
ip: ${server.ip:0.0.0.0}
advertisedAddress: ${server.advertisedAddress:127.0.0.1}
httpPort: ${server.httpPort:8083}
enableHttp: ${server.enableHttp:true}
httpsPort: ${server.httpsPort:8443}
enableHttps: ${server.enableHttps:false}
serviceId: ${server.serviceId:com.networknt.agent.account-1.0.0}
enableRegistry: ${server.enableRegistry:true}
startOnRegistryFailure: ${server.startOnRegistryFailure:true}
dynamicPort: ${server.dynamicPort:false}
environment: ${server.environment:dev}
shutdownGracefulPeriod: ${server.shutdownGracefulPeriod:2000}
Example portal-registry.yml:
portalUrl: ${portalRegistry.portalUrl:https://localhost:8438}
portalToken: ${light_portal_authorization:}
controllerDiscoveryToken: ${portalRegistry.controllerDiscoveryToken:}
Example client.yml:
tls:
  verifyHostname: ${client.verifyHostname:true}
Example mcp-client.yml:
gatewayUrl: ${mcp-client.gatewayUrl:https://mcp-gateway.customer.example.com}
path: ${mcp-client.path:/mcp}
timeoutMs: ${mcp-client.timeoutMs:5000}
Example ollama.yml:
ollamaUrl: ${ollama.ollamaUrl:http://localhost:11434}
model: ${ollama.model:llama3.1:8b}
For the current light-agent implementation, keep ollama.yml and
mcp-client.yml in the local bootstrap config. They are read during process
initialization before the runtime completes remote config bootstrap.
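The config files above use ${key:default} placeholders that are filled from values.yml. The sketch below shows one way such a placeholder could be resolved. It is not the config-loader implementation: resolve is a hypothetical helper, it handles only string substitution from a flat dict, and it assumes the key ends at the first colon so that defaults such as llama3.1:8b or a URL keep their own colons.

```python
import re

# Key is everything up to the first ':'; the default is the rest up to '}'.
_PLACEHOLDER = re.compile(r"\$\{([^:}]+):([^}]*)\}")


def resolve(template: str, values: dict) -> str:
    """Replace each ${key:default} with values[key], or with the default when absent."""
    def substitute(match):
        key, default = match.group(1), match.group(2)
        return str(values.get(key, default))
    return _PLACEHOLDER.sub(substitute, template)
```

For instance, resolve("httpPort: ${server.httpPort:8083}", {"server.httpPort": 9000}) gives "httpPort: 9000", and the same template with an empty dict falls back to 8083.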
Secrets
Keep secrets in a root-owned environment file or in the customer's secret manager. Do not pass secrets in command-line arguments.
Example /etc/light-agent/light-agent.env:
LIGHT_PORTAL_AUTHORIZATION=Bearer <token>
light_4j_config_password=<config-password-if-needed>
LIGHT_AGENT_HOST_ID=<host-uuid>
DATABASE_URL=postgres://agent_user:<password>@postgres.customer.example.com:5432/configserver
RUST_LOG=info
AGENT_LOG_ANSI=false
Permissions:
chown root:light-agent /etc/light-agent/light-agent.env
chmod 0640 /etc/light-agent/light-agent.env
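An install or pre-start check can confirm that the environment file has not drifted from mode 0640. The following sketch checks only the permission bits; it does not verify ownership, and env_file_is_private is a hypothetical helper name.

```python
import os
import stat


def env_file_is_private(path: str) -> bool:
    """True when the file grants nothing to 'other' and no write access to the group."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # 0o027 covers group-write plus all 'other' bits; 0640 leaves every one of them clear.
    return mode & 0o027 == 0
```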
LIGHT_PORTAL_AUTHORIZATION is used for config-server bootstrap and controller
registration. It is not the end-user chat token. If downstream MCP tools require
caller identity, the browser or BFF should send the user's Authorization
header to the agent WebSocket endpoint so the agent can forward it to
light-gateway.
Systemd Unit
Example /etc/systemd/system/light-agent.service:
[Unit]
Description=Light Agent
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=light-agent
Group=light-agent
WorkingDirectory=/opt/light-agent
EnvironmentFile=/etc/light-agent/light-agent.env
ExecStart=/opt/light-agent/light-agent
Restart=on-failure
RestartSec=5
LimitNOFILE=65535
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=full
ProtectHome=true
ReadWritePaths=/var/lib/light-agent/config-cache
[Install]
WantedBy=multi-user.target
Install and start:
systemctl daemon-reload
systemctl enable light-agent
systemctl start light-agent
systemctl status light-agent
View logs:
journalctl -u light-agent -f
Install Script Scope
An install script is useful, but keep it deterministic and small.
It should:
- Create the light-agent user and group.
- Create /opt/light-agent, /etc/light-agent, and /var/lib/light-agent/config-cache.
- Install the binary with executable permissions.
- Install the public/ static assets.
- Install bootstrap config files.
- Install or update the systemd unit.
- Set file ownership and permissions.
- Print the next operator steps for adding secrets and starting the service.
It should not:
- Embed bearer tokens.
- Pass tokens to ExecStart.
- Rewrite customer config-server state.
- Start the process before secrets, CA files, and database access are ready.
Startup Flow
The expected runtime flow is:
systemd
-> /opt/light-agent/light-agent
-> read local config/values.yml, ollama.yml, and mcp-client.yml
-> connect to Postgres with DATABASE_URL
-> build the MCP client for light-gateway
-> call config-server with LIGHT_PORTAL_AUTHORIZATION
-> write downloaded runtime config into /var/lib/light-agent/config-cache
-> start the Axum HTTP/WebSocket server
-> register the agent with controller using portalRegistry.portalUrl
-> serve the chat UI from public/
-> forward tool discovery and tool calls to light-gateway
If config-server is temporarily unavailable and cached config exists, the
runtime can continue from config-cache. If this is not acceptable for a
customer environment, make it an operational policy to clear config-cache
before restart or add a pre-start health check for config-server availability.
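The fallback behavior, and the stricter policy some customers may want, can be sketched as a fetch-then-cache function. This is an illustration under assumptions, not the runtime's code: load_runtime_config, the fetch callable, and the allow_stale flag are hypothetical, and a failed config-server call is assumed to surface as an OSError.

```python
import os
from typing import Callable


def load_runtime_config(fetch: Callable[[], str], cache_file: str,
                        allow_stale: bool = True) -> str:
    """Fetch config from config-server, falling back to the cached copy on failure."""
    try:
        text = fetch()
    except OSError:
        # Config-server is unreachable: use the cache only if policy allows it.
        if allow_stale and os.path.exists(cache_file):
            with open(cache_file, encoding="utf-8") as handle:
                return handle.read()
        raise
    # Successful fetch: refresh the cache for the next restart.
    os.makedirs(os.path.dirname(cache_file) or ".", exist_ok=True)
    with open(cache_file, "w", encoding="utf-8") as handle:
        handle.write(text)
    return text
```

Setting allow_stale to False corresponds to the stricter policy: startup fails when config-server is down, even if a cached copy exists.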
Endpoints
The native service exposes:
GET /health
GET /
GET /chat
/chat upgrades to WebSocket. The static chat UI is served from public/.
For local testing on the VM:
curl -i http://127.0.0.1:8083/health
Upgrade And Rollback
Use versioned binary releases:
/opt/light-agent/releases/2.2.1/light-agent
/opt/light-agent/releases/2.2.2/light-agent
/opt/light-agent/light-agent -> releases/2.2.2/light-agent
Upgrade:
systemctl stop light-agent
ln -sfn /opt/light-agent/releases/2.2.2/light-agent /opt/light-agent/light-agent
systemctl start light-agent
Rollback:
systemctl stop light-agent
ln -sfn /opt/light-agent/releases/2.2.1/light-agent /opt/light-agent/light-agent
systemctl start light-agent
Do not delete config-cache during a normal binary rollback. It is the local
cache of the config-server-delivered runtime state.
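The symlink switch used for upgrade and rollback can also be scripted. The sketch below creates the new link under a temporary name and renames it over the old one, so the light-agent path always points at some release; this rename approach is a substitution for the ln -sfn command above, and switch_release is a hypothetical helper. Stopping and starting the service remains a separate step.

```python
import os


def switch_release(app_dir: str, version: str, name: str = "light-agent") -> str:
    """Point <app_dir>/<name> at releases/<version>/<name> and return the link target."""
    target = os.path.join("releases", version, name)
    if not os.path.isfile(os.path.join(app_dir, target)):
        raise FileNotFoundError(f"release binary not found: {target}")
    link = os.path.join(app_dir, name)
    tmp = link + ".new"
    if os.path.lexists(tmp):
        os.remove(tmp)
    # Relative target, matching the layout shown earlier.
    os.symlink(target, tmp)
    # Renaming over the old link replaces it in a single step.
    os.replace(tmp, link)
    return os.readlink(link)
```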
Validation Checklist
Before handing the VM to the customer:
- systemctl status light-agent is active.
- journalctl -u light-agent shows successful config-server bootstrap.
- journalctl -u light-agent shows successful controller registration.
- The controller shows the agent registered with the expected service id, environment, address, and port.
- curl http://127.0.0.1:8083/health returns 200 OK.
- The chat UI loads from the VM address.
- The chat WebSocket connects to /chat.
- Logs show that the agent can connect to Postgres.
- Logs do not show MCP tools/list failures from light-gateway.
- A chat request can discover and call a tool through light-gateway.
- Restarting the VM starts the agent automatically.
Security Checklist
- Store bearer tokens, config passwords, and database passwords outside the install bundle.
- Use a customer CA file instead of disabling TLS verification in production.
- Use a stable DNS name for server.advertisedAddress.
- Restrict inbound VM firewall rules to the required agent port.
- Restrict outbound VM firewall rules to config-server, controller, light-gateway, Ollama, and Postgres.
- Run as the dedicated light-agent user.
- Keep /etc/light-agent/light-agent.env readable only by root and the service group.
- Keep /etc/light-agent writable only by administrators.
- Keep only /var/lib/light-agent/config-cache writable by the service.
- Rotate LIGHT_PORTAL_AUTHORIZATION through the customer secret process.
Deploy Kubernetes
This page describes the recommended Kubernetes deployment model for the Rust
light-agent image from light-fabric/apps/light-agent.
Use this model when an agent service runs in a cluster and exposes the chat
UI/WebSocket endpoint through a Kubernetes Service, Ingress, or Gateway API. The
agent serves the local chat UI, connects to an LLM provider, calls MCP tools
through light-gateway, stores conversation memory in Postgres, and registers
with controller.
Recommended Model
Deploy the agent as a normal single-container Kubernetes workload:
- Deployment for the agent pod.
- Service for stable in-cluster access.
- ConfigMap for bootstrap config and non-secret values.
- Secret for bearer tokens, config passwords, host id, and database URL.
- emptyDir or PersistentVolumeClaim for config-cache.
- ConfigMap or custom image layer for public/ chat UI assets.
- Optional Ingress, Gateway API, NodePort, or LoadBalancer for external browser access.
Keep runtime policy and shared platform configuration in config-server. The
Kubernetes bootstrap config should only contain enough information for startup,
trust, model/provider selection, light-gateway access, database access, and
controller registration.
Image
Build the image from the workspace root:
./apps/light-agent/build.sh 2.2.1
For local testing without pushing:
./apps/light-agent/build.sh 2.2.1 --local
Use immutable tags in Kubernetes. Avoid latest for customer deployments.
The current runtime image uses:
/app/light-agent
/app/config -> /config
The process runs as the image user agent. Mount /config for bootstrap
config and make /app/config-cache writable.
The current Dockerfile does not copy apps/light-agent/public/ into the runtime
image. For Kubernetes, either mount the public/ files from a ConfigMap or
build a custom image that includes them under /app/public.
Runtime Paths
Recommended container layout:
/config/
  startup.yml
  server.yml
  portal-registry.yml
  client.yml
  mcp-client.yml
  ollama.yml
  values.yml
  ca.pem
/app/config-cache/
  values.yml
  downloaded certs and files
/app/public/
  index.html
Use a read-only projected volume for /config. Use a writable volume for
/app/config-cache.
For most deployments, use emptyDir for config-cache. This gives each pod a
fresh cache and avoids accidentally keeping stale config across pod replacement.
Use a PersistentVolumeClaim only when the customer explicitly wants the agent
to restart from the last downloaded config during a config-server outage. A
persistent cache improves outage tolerance but can also preserve stale runtime
state.
Runtime Dependencies
The pod must be able to reach:
- Controller, through portalRegistry.portalUrl.
- Config-server, through startup.configServerUri.
- light-gateway, through mcp-client.gatewayUrl and mcp-client.path.
- The model provider, currently Ollama by default.
- Postgres, through DATABASE_URL.
The Postgres database must contain the Hindsight memory tables used by
light-agent, including:
- agent_memory_bank_t
- agent_memory_unit_t
- agent_session_history_t
LIGHT_AGENT_HOST_ID must be a valid host UUID for the target tenant/host. The
agent stores memory and session history under this host id.
Agent Roles
The same image can run different logical agents. Use a different service id, deployment name, Service name, and port for each concurrently running role.
Common service ids are:
com.networknt.agent.account-1.0.0
com.networknt.agent.advisor-1.0.0
com.networknt.agent.tech-support-1.0.0
For a single account agent, a conventional Kubernetes name is
light-agent-account. For multiple agents in the same namespace, use names such
as:
light-agent-account
light-agent-advisor
light-agent-tech-support
Each role needs a unique Service name. If they share one namespace and expose through one Ingress host, route each role by host or path.
Registration Address
In Kubernetes, do not register the pod IP. Pod IPs are ephemeral.
If controller and callers are inside the same cluster, advertise the Service DNS name:
server.advertisedAddress: light-agent-account.light-agent
The pattern is:
<service-name>.<namespace>
The port is still registered separately from the host/address.
If controller or callers are outside the cluster, advertise the externally reachable DNS name instead, such as the Ingress or LoadBalancer hostname:
server.advertisedAddress: account-agent.customer.example.com
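A deployment script can derive the in-cluster address from the Service name and namespace and reject the obviously wrong values before registration. The sketch below is illustrative: both helper names are hypothetical, and it can only catch loopback and unspecified addresses, since a pod IP cannot be told apart from a stable IP by inspection.

```python
import ipaddress
from typing import Optional


def service_dns_name(service: str, namespace: str) -> str:
    """Build the <service-name>.<namespace> address for in-cluster callers."""
    return f"{service}.{namespace}"


def address_problem(address: str) -> Optional[str]:
    """Return a reason when the address must not be advertised, else None."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return None  # Not an IP literal, so treat it as a DNS name.
    if ip.is_loopback or ip.is_unspecified:
        return f"{address} is not reachable by the controller or clients"
    return None
```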
Bootstrap Config
Example values.yml for an in-cluster controller, config-server, gateway,
Ollama, and Postgres:
startup.host: customer.example.com
startup.timeout: 3000
startup.connectTimeout: 3000
startup.bootstrapCaCertPath: config/ca.pem
startup.externalConfigDir: /app/config-cache
light-config-server-uri: https://config-server.lightapi.svc.cluster.local:8435
server.serviceId: com.networknt.agent.account-1.0.0
server.environment: prod
server.ip: 0.0.0.0
server.advertisedAddress: light-agent-account.light-agent
server.httpPort: 8083
server.enableHttp: true
server.httpsPort: 8443
server.enableHttps: false
server.enableRegistry: true
server.startOnRegistryFailure: true
portalRegistry.portalUrl: https://controller.lightapi.svc.cluster.local:8438
client.verifyHostname: true
mcp-client.gatewayUrl: https://ai-microgateway.light-gateway:8443
mcp-client.path: /mcp
mcp-client.timeoutMs: 5000
ollama.ollamaUrl: http://ollama.ai.svc.cluster.local:11434
ollama.model: llama3.1:8b
Example startup.yml:
host: ${startup.host:dev.lightapi.net}
serviceId: ${server.serviceId:com.networknt.agent.account-1.0.0}
envTag: ${server.environment:dev}
acceptHeader: application/yaml
timeout: ${startup.timeout:3000}
connectTimeout: ${startup.connectTimeout:3000}
configServerUri: ${light-config-server-uri:https://local.localhost}
authorization: ${light_portal_authorization:}
bootstrapCaCertPath: ${startup.bootstrapCaCertPath:config/ca.pem}
externalConfigDir: ${startup.externalConfigDir:/app/config-cache}
Example server.yml:
ip: ${server.ip:0.0.0.0}
advertisedAddress: ${server.advertisedAddress:127.0.0.1}
httpPort: ${server.httpPort:8083}
enableHttp: ${server.enableHttp:true}
httpsPort: ${server.httpsPort:8443}
enableHttps: ${server.enableHttps:false}
serviceId: ${server.serviceId:com.networknt.agent.account-1.0.0}
enableRegistry: ${server.enableRegistry:true}
startOnRegistryFailure: ${server.startOnRegistryFailure:true}
dynamicPort: ${server.dynamicPort:false}
environment: ${server.environment:dev}
shutdownGracefulPeriod: ${server.shutdownGracefulPeriod:2000}
Example portal-registry.yml:
portalUrl: ${portalRegistry.portalUrl:https://localhost:8438}
portalToken: ${light_portal_authorization:}
controllerDiscoveryToken: ${portalRegistry.controllerDiscoveryToken:}
Example client.yml:
tls:
  verifyHostname: ${client.verifyHostname:true}
Example mcp-client.yml:
gatewayUrl: ${mcp-client.gatewayUrl:https://ai-microgateway.light-gateway:8443}
path: ${mcp-client.path:/mcp}
timeoutMs: ${mcp-client.timeoutMs:5000}
Example ollama.yml:
ollamaUrl: ${ollama.ollamaUrl:http://ollama.ai.svc.cluster.local:11434}
model: ${ollama.model:llama3.1:8b}
For the current light-agent implementation, keep ollama.yml and
mcp-client.yml in the local bootstrap config. They are read during process
initialization before the runtime completes remote config bootstrap.
Use the customer CA in ca.pem. Do not disable hostname verification in
production to work around certificate SAN problems.
Secrets
Store the portal bearer token, optional config password, host id, and database
URL in a Kubernetes Secret.
Example:
apiVersion: v1
kind: Secret
metadata:
  name: light-agent-account-secret
  namespace: light-agent
type: Opaque
stringData:
  LIGHT_PORTAL_AUTHORIZATION: "Bearer <token>"
  light_4j_config_password: "<config-password-if-needed>"
  LIGHT_AGENT_HOST_ID: "<host-uuid>"
  DATABASE_URL: "postgres://agent_user:<password>@postgres.lightapi.svc.cluster.local:5432/configserver"
data:
  ca.pem: <base64-ca-pem>
LIGHT_PORTAL_AUTHORIZATION is used for config-server bootstrap and controller
registration. It is not the end-user chat token. If downstream MCP tools require
caller identity, the browser or BFF should send the user's Authorization
header to the agent WebSocket endpoint so the agent can forward it to
light-gateway.
Do not store real bearer tokens, database passwords, or customer CA material in Git, ConfigMaps, Helm values committed to the repo, or rendered deployment examples.
Example Manifests
Example ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: light-agent-account-config
  namespace: light-agent
  labels:
    app.kubernetes.io/name: light-agent-account
    app.kubernetes.io/component: agent
data:
  values.yml: |
    startup.host: customer.example.com
    startup.timeout: 3000
    startup.connectTimeout: 3000
    startup.bootstrapCaCertPath: config/ca.pem
    startup.externalConfigDir: /app/config-cache
    light-config-server-uri: https://config-server.lightapi.svc.cluster.local:8435
    server.serviceId: com.networknt.agent.account-1.0.0
    server.environment: prod
    server.ip: 0.0.0.0
    server.advertisedAddress: light-agent-account.light-agent
    server.httpPort: 8083
    server.enableHttp: true
    server.httpsPort: 8443
    server.enableHttps: false
    server.enableRegistry: true
    server.startOnRegistryFailure: true
    portalRegistry.portalUrl: https://controller.lightapi.svc.cluster.local:8438
    client.verifyHostname: true
    mcp-client.gatewayUrl: https://ai-microgateway.light-gateway:8443
    mcp-client.path: /mcp
    mcp-client.timeoutMs: 5000
    ollama.ollamaUrl: http://ollama.ai.svc.cluster.local:11434
    ollama.model: llama3.1:8b
  startup.yml: |
    host: ${startup.host:dev.lightapi.net}
    serviceId: ${server.serviceId:com.networknt.agent.account-1.0.0}
    envTag: ${server.environment:dev}
    acceptHeader: application/yaml
    timeout: ${startup.timeout:3000}
    connectTimeout: ${startup.connectTimeout:3000}
    configServerUri: ${light-config-server-uri:https://local.localhost}
    authorization: ${light_portal_authorization:}
    bootstrapCaCertPath: ${startup.bootstrapCaCertPath:config/ca.pem}
    externalConfigDir: ${startup.externalConfigDir:/app/config-cache}
  server.yml: |
    ip: ${server.ip:0.0.0.0}
    advertisedAddress: ${server.advertisedAddress:127.0.0.1}
    httpPort: ${server.httpPort:8083}
    enableHttp: ${server.enableHttp:true}
    httpsPort: ${server.httpsPort:8443}
    enableHttps: ${server.enableHttps:false}
    serviceId: ${server.serviceId:com.networknt.agent.account-1.0.0}
    enableRegistry: ${server.enableRegistry:true}
    startOnRegistryFailure: ${server.startOnRegistryFailure:true}
    dynamicPort: ${server.dynamicPort:false}
    environment: ${server.environment:dev}
    shutdownGracefulPeriod: ${server.shutdownGracefulPeriod:2000}
  portal-registry.yml: |
    portalUrl: ${portalRegistry.portalUrl:https://localhost:8438}
    portalToken: ${light_portal_authorization:}
    controllerDiscoveryToken: ${portalRegistry.controllerDiscoveryToken:}
  client.yml: |
    tls:
      verifyHostname: ${client.verifyHostname:true}
  mcp-client.yml: |
    gatewayUrl: ${mcp-client.gatewayUrl:https://ai-microgateway.light-gateway:8443}
    path: ${mcp-client.path:/mcp}
    timeoutMs: ${mcp-client.timeoutMs:5000}
  ollama.yml: |
    ollamaUrl: ${ollama.ollamaUrl:http://ollama.ai.svc.cluster.local:11434}
    model: ${ollama.model:llama3.1:8b}
Create the public/ ConfigMap from the repo asset:
kubectl -n light-agent create configmap light-agent-account-public \
--from-file=index.html=apps/light-agent/public/index.html \
--dry-run=client -o yaml
Example Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: light-agent-account
  namespace: light-agent
  labels:
    app.kubernetes.io/name: light-agent-account
    app.kubernetes.io/component: agent
    app.kubernetes.io/part-of: lightapi
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: light-agent-account
  template:
    metadata:
      labels:
        app.kubernetes.io/name: light-agent-account
        app.kubernetes.io/component: agent
        app.kubernetes.io/part-of: lightapi
    spec:
      securityContext:
        fsGroup: 999
        fsGroupChangePolicy: OnRootMismatch
      containers:
        - name: light-agent
          image: networknt/light-agent:2.2.1
          imagePullPolicy: IfNotPresent
          env:
            - name: LIGHT_PORTAL_AUTHORIZATION
              valueFrom:
                secretKeyRef:
                  name: light-agent-account-secret
                  key: LIGHT_PORTAL_AUTHORIZATION
            - name: light_4j_config_password
              valueFrom:
                secretKeyRef:
                  name: light-agent-account-secret
                  key: light_4j_config_password
                  optional: true
            - name: LIGHT_AGENT_HOST_ID
              valueFrom:
                secretKeyRef:
                  name: light-agent-account-secret
                  key: LIGHT_AGENT_HOST_ID
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: light-agent-account-secret
                  key: DATABASE_URL
            - name: RUST_LOG
              value: info
            - name: AGENT_LOG_ANSI
              value: "false"
          ports:
            - name: http
              containerPort: 8083
              protocol: TCP
            - name: https
              containerPort: 8443
              protocol: TCP
          readinessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 30
            periodSeconds: 30
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 1000m
              memory: 768Mi
          volumeMounts:
            - name: bootstrap-config
              mountPath: /config
              readOnly: true
            - name: config-cache
              mountPath: /app/config-cache
            - name: public
              mountPath: /app/public
              readOnly: true
      volumes:
        - name: bootstrap-config
          projected:
            sources:
              - configMap:
                  name: light-agent-account-config
              - secret:
                  name: light-agent-account-secret
                  items:
                    - key: ca.pem
                      path: ca.pem
        - name: config-cache
          emptyDir: {}
        - name: public
          configMap:
            name: light-agent-account-public
Example Service:
apiVersion: v1
kind: Service
metadata:
  name: light-agent-account
  namespace: light-agent
  labels:
    app.kubernetes.io/name: light-agent-account
    app.kubernetes.io/component: agent
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: light-agent-account
  ports:
    - name: http
      port: 8083
      targetPort: http
      protocol: TCP
    - name: https
      port: 8443
      targetPort: https
      protocol: TCP
External Access
For local testing with a ClusterIP Service:
kubectl -n light-agent port-forward svc/light-agent-account 8083:8083
Health check:
curl -i http://127.0.0.1:8083/health
If exposing through Ingress, make sure WebSocket upgrade is supported and idle timeouts are long enough for chat sessions.
Example NGINX Ingress annotations:
nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
If downstream MCP tools require caller identity, put the agent behind a BFF or
authenticated reverse proxy that forwards the user's Authorization header to
the WebSocket request. A browser-created WebSocket from the embedded static UI
does not directly set arbitrary authorization headers.
Deploy Through Light-Deployer
The repo template lives at:
apps/light-agent/k8s/light-agent
Use the same template rules as light-gateway.
When light-deployer runs outside the cluster and has
LIGHT_DEPLOYER_TEMPLATE_BASE_DIR set, repoUrl: "local" can point to local
templates.
When light-deployer runs inside Kubernetes, use a real Git URL:
{
  "template": {
    "repoUrl": "https://github.com/networknt/light-fabric.git",
    "ref": "main",
    "path": "apps/light-agent/k8s/light-agent"
  }
}
Do not use repoUrl: "local" for an in-cluster deployer unless the template
repo is mounted into the deployer container and
LIGHT_DEPLOYER_TEMPLATE_BASE_DIR points to it.
Keep Namespace out of templates rendered by light-deployer if the deployer
policy blocks cluster-scoped resources. Create the namespace separately:
kubectl create namespace light-agent
Config-Server Requirements
Before deploying the agent pod, config-server should already have config for the tuple used by startup:
host = startup.host
serviceId = server.serviceId
envTag = server.environment
At minimum, config-server should return runtime config for:
- values.yml
- server.yml when listener or registration settings are centrally managed.
- portal-registry.yml when controller URLs or registry settings are centrally managed.
- client.yml when TLS verification behavior is centrally managed.
For the current light-agent, keep mcp-client.yml and ollama.yml in the
local bootstrap ConfigMap even if other runtime config comes from config-server.
They are loaded before remote bootstrap completes.
Startup Flow
Expected runtime flow:
Kubernetes starts pod
-> /app/light-agent
-> read local /config/values.yml, ollama.yml, and mcp-client.yml
-> connect to Postgres with DATABASE_URL
-> build the MCP client for light-gateway
-> call config-server with LIGHT_PORTAL_AUTHORIZATION
-> write downloaded runtime config into /app/config-cache
-> start the Axum HTTP/WebSocket server
-> register the agent with controller using portalRegistry.portalUrl
-> serve the chat UI from /app/public
-> forward tool discovery and tool calls to light-gateway
If config-server is unavailable and /app/config-cache/values.yml exists, the
runtime can continue from cached config. With emptyDir, that cache disappears
when the pod is recreated. With a PVC, it can survive pod replacement.
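The fallback described above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the light-runtime bootstrap code: the names ConfigSource and choose_config_source are hypothetical, and the only rule taken from the text is that a cached values.yml allows startup when config-server cannot be reached.

```rust
use std::path::Path;

/// Where the runtime config would come from in this sketch.
#[derive(Debug, PartialEq)]
enum ConfigSource {
    /// Config-server answered; use the downloaded config.
    Remote,
    /// Config-server is unavailable, but a cached values.yml exists.
    Cache,
    /// Neither source is usable; startup cannot continue from config.
    Unavailable,
}

/// Hypothetical helper: pick the config source for one startup attempt.
/// `cache_dir` corresponds to startup.externalConfigDir, for example /app/config-cache.
fn choose_config_source(server_reachable: bool, cache_dir: &Path) -> ConfigSource {
    if server_reachable {
        ConfigSource::Remote
    } else if cache_dir.join("values.yml").is_file() {
        ConfigSource::Cache
    } else {
        ConfigSource::Unavailable
    }
}
```

With an emptyDir volume the cache directory starts empty for every new pod, so the Cache branch can only be reached by a container restart inside the same pod. A PVC keeps values.yml across pod replacement.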
Upgrade And Rollback
Use Kubernetes rolling updates with immutable image tags:
kubectl -n light-agent set image deploy/light-agent-account \
light-agent=networknt/light-agent:2.2.2
kubectl -n light-agent rollout status deploy/light-agent-account
Rollback:
kubectl -n light-agent rollout undo deploy/light-agent-account
For production, prefer changing only one variable at a time: either image tag or config-server runtime config, not both in the same rollout.
Validation Checklist
After deployment:
- kubectl -n light-agent rollout status deploy/light-agent-account succeeds.
- Pods are ready and restart count is stable.
- Logs show successful Postgres connection.
- Logs show successful config-server bootstrap.
- Logs show successful controller registration.
- Controller shows the agent registered with the expected service id, environment, host, and port.
- curl http://127.0.0.1:8083/health succeeds through port-forward or Ingress.
- The chat UI loads.
- The chat WebSocket connects to /chat.
- MCP tools/list reaches light-gateway.
- MCP tools/call reaches the backend MCP server through light-gateway.
- A pod restart still starts cleanly with the selected cache policy.
Security Checklist
- Keep bearer tokens, config passwords, database passwords, and host ids in Kubernetes Secret, not ConfigMap.
- Use customer CA trust and keep client.verifyHostname: true in production.
- Use immutable image tags and image pull credentials from Kubernetes secrets when the registry is private.
- Run as the non-root image user.
- Make /config read-only.
- Make only /app/config-cache writable.
- Restrict ingress traffic to required agent ports.
- Restrict egress traffic to config-server, controller, light-gateway, Ollama, and Postgres.
- Rotate LIGHT_PORTAL_AUTHORIZATION through the customer secret process.
Light-Deployer
light-deployer is the cluster-local Kubernetes deployment executor for Light
Portal.
It renders Kubernetes templates, validates manifests, applies resources through
kube-rs, reports rollout status, and exposes deployment tools through an MCP
JSON-RPC endpoint for local and MicroK8s testing.
Key Capabilities
- MCP JSON-RPC endpoint at POST /mcp
- AST-based YAML template rendering
- Git template fetching with gix
- Kubernetes dry-run, apply, delete, status, and prune
- Redacted manifest summaries and diffs
- SSE deployment events
Runtime
light-deployer uses light-runtime, light-axum, config-loader, and
portal-registry so it follows the same service boot model as light-agent.
Testing Path
Use these pages in order when testing locally:
- Build Local
- Prepare Config
- Run Standalone
- Run Kubernetes
Start with standalone noop mode to validate template rendering. Then move to
MicroK8s real mode once the render request and target templates are correct.
For MCP clients, Light Portal, and AI agents, use POST /mcp with JSON-RPC
methods such as tools/list and tools/call. The /mcp/tools/* routes are
kept only as local debugging conveniences.
Build Local
This page builds the light-deployer binary and container image from the
Light Fabric workspace.
Run all commands from the repository root:
cd ~/workspace/light-fabric
Rust Build
Use cargo check first for a quick compile validation:
cargo check -p light-deployer
Run the deployer tests:
cargo test -p light-deployer
Build a debug binary:
cargo build -p light-deployer
Build a release binary:
cargo build --release -p light-deployer
The release binary is written to:
target/release/light-deployer
Docker Image
Build the local image:
./apps/light-deployer/build.sh latest
The default image name is:
networknt/light-deployer:latest
To override the image name:
IMAGE=localhost:32000/light-deployer:latest ./apps/light-deployer/build.sh latest
Verify the image exists:
docker image inspect networknt/light-deployer:latest
What The Image Contains
The Dockerfile copies:
- /usr/local/bin/light-deployer
- /app/config
The container runs from /app, so the default runtime config directory is:
/app/config
The default HTTP port is 7088, configured in:
apps/light-deployer/config/server.yml
Expected Result
Before moving on, these commands should pass:
cargo check -p light-deployer
cargo test -p light-deployer
./apps/light-deployer/build.sh latest
docker image inspect networknt/light-deployer:latest
Prepare Config
light-deployer uses two kinds of configuration:
- runtime config loaded by light-runtime
- deployment request data sent through MCP tools/call at POST /mcp
Runtime Config Files
Default config lives in:
apps/light-deployer/config
Files:
- server.yml: HTTP/HTTPS bind settings and service identity
- deployer.yml: local deployer policy
- portal-registry.yml: future portal/controller registry settings
When running from the workspace root, the deployer automatically uses:
apps/light-deployer/config
When running inside the Docker image, it uses:
/app/config
Override the config directory with:
LIGHT_DEPLOYER_CONFIG_DIR=/path/to/config
Server Config
The default server config listens on HTTP port 7088:
ip: ${server.ip:0.0.0.0}
httpPort: ${server.httpPort:7088}
enableHttp: ${server.enableHttp:true}
enableHttps: ${server.enableHttps:false}
serviceId: ${server.serviceId:com.networknt.light-deployer-0.1.0}
enableRegistry: ${server.enableRegistry:false}
To change the port without editing the file, provide values through the normal runtime values mechanism, or use a copied config directory for local testing.
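The ${key:default} placeholders in these config files read as: use the value supplied for key, otherwise fall back to the text after the first colon. The following is a minimal sketch of that substitution for a single placeholder. It is not the config-loader implementation; the function name resolve_placeholder and the in-memory values map are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical helper: resolve one `${key:default}` placeholder against a
/// values map. Text that is not a whole placeholder is returned unchanged.
fn resolve_placeholder(raw: &str, values: &HashMap<String, String>) -> String {
    // Strip the `${` prefix and the `}` suffix; bail out if either is missing.
    let inner = match raw.strip_prefix("${").and_then(|s| s.strip_suffix('}')) {
        Some(inner) => inner,
        None => return raw.to_string(),
    };
    // Split at the first ':' only, so defaults such as URLs keep their colons.
    let (key, default) = match inner.split_once(':') {
        Some((key, default)) => (key, default),
        None => (inner, ""),
    };
    values
        .get(key)
        .cloned()
        .unwrap_or_else(|| default.to_string())
}
```

With server.httpPort set to 9090 in the values map, ${server.httpPort:7088} resolves to 9090. Without an entry it resolves to the default, 7088.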
Deployer Policy
The default policy is permissive enough for local testing:
deployerId: ${deployer.deployerId:local-light-deployer}
clusterId: ${deployer.clusterId:local}
allowedNamespaces: []
allowedRepoHosts: []
allowedRepoPrefixes: []
allowedImageRegistries: []
devInsecure: ${deployer.devInsecure:false}
Empty allow lists mean the policy does not restrict that dimension. For production, configure explicit values.
Example tighter policy:
deployerId: petstore-microk8s
clusterId: microk8s-local
allowedNamespaces:
  - petstore-dev
allowedRepoHosts:
  - github.com
allowedRepoPrefixes:
  - https://github.com/networknt/
allowedImageRegistries:
  - networknt
devInsecure: false
prune:
  enabled: true
  maxDeletePercent: 30
  sensitiveKinds:
    - PersistentVolumeClaim
  overrideRequired: true
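The allow-list rule stated above (an empty list does not restrict that dimension) can be sketched as two small checks. This is an illustration only, not the deployer's policy code; the function names passes_allow_list and passes_prefix_list are hypothetical, and exact-match versus prefix-match semantics are assumed from the field names.

```rust
/// Hypothetical helper: exact-match allow list, as one might apply to
/// allowedNamespaces or allowedRepoHosts. An empty list means "no restriction".
fn passes_allow_list(allowed: &[&str], value: &str) -> bool {
    allowed.is_empty() || allowed.iter().any(|a| *a == value)
}

/// Hypothetical helper: prefix-match variant, sketched for allowedRepoPrefixes.
fn passes_prefix_list(allowed_prefixes: &[&str], url: &str) -> bool {
    allowed_prefixes.is_empty() || allowed_prefixes.iter().any(|p| url.starts_with(*p))
}
```

Under the tighter policy, a request for namespace petstore-dev passes while kube-system is rejected. Under the default policy every namespace passes because the list is empty.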
Git Access
Public repositories do not need credentials.
For private HTTPS repositories, set:
LIGHT_DEPLOYER_GIT_TOKEN=...
Defaults:
- GitHub username: x-access-token
- Bitbucket Cloud username: x-token-auth
For Bitbucket app passwords or other Git servers:
LIGHT_DEPLOYER_GIT_USERNAME=my-user
LIGHT_DEPLOYER_GIT_TOKEN=my-token-or-app-password
Only HTTPS token auth is supported in Phase 1. SSH auth is deferred.
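The username defaults above can be sketched as a lookup with an explicit override. This is a sketch under assumptions, not the deployer's Git code: the function name git_username is hypothetical, and matching on the host names github.com and bitbucket.org is an assumption about how GitHub and Bitbucket Cloud are detected.

```rust
/// Hypothetical helper: choose the HTTPS username to pair with
/// LIGHT_DEPLOYER_GIT_TOKEN. `username_override` stands for the value of
/// LIGHT_DEPLOYER_GIT_USERNAME when it is set.
fn git_username(repo_host: &str, username_override: Option<&str>) -> Option<String> {
    // An explicit username always wins over the per-host default.
    if let Some(user) = username_override {
        return Some(user.to_string());
    }
    match repo_host {
        "github.com" => Some("x-access-token".to_string()),
        "bitbucket.org" => Some("x-token-auth".to_string()),
        // Other Git servers have no documented default in this sketch.
        _ => None,
    }
}
```

For a Bitbucket app password, passing Some("my-user") as the override returns my-user instead of x-token-auth.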
Template Repository Requirements
The target application repository should contain a k8s/ directory with YAML
templates. The deployer reads all .yaml and .yml files under the requested
template path.
Example template reference:
{
  "template": {
    "repoUrl": "https://github.com/networknt/openapi-petstore.git",
    "ref": "master",
    "path": "k8s"
  }
}
For local testing without Git clone, set:
LIGHT_DEPLOYER_TEMPLATE_BASE_DIR=/home/steve/workspace/openapi-petstore
Then use:
{
  "template": {
    "repoUrl": "local",
    "ref": "master",
    "path": "k8s"
  }
}
Request Values
The request values object supplies placeholder values for templates.
Example for openapi-petstore:
{
  "name": "openapi-petstore",
  "image": {
    "repository": "networknt/openapi-petstore",
    "tag": "latest",
    "pullPolicy": "IfNotPresent"
  },
  "service": {
    "name": "openapi-petstore",
    "type": "ClusterIP"
  },
  "resources": {
    "requests": {
      "memory": "64Mi",
      "cpu": "250m"
    },
    "limits": {
      "memory": "256Mi",
      "cpu": "500m"
    }
  }
}
The current renderer replaces placeholders inside YAML string scalar values. Avoid placeholders in Kubernetes fields that must be numeric unless the template keeps those fields as fixed numbers.
Run Standalone
Standalone mode is the fastest way to test light-deployer before using a
real Kubernetes cluster.
Use noop mode first. It validates config, HTTP endpoints, template loading,
rendering, resource summaries, and response shape without mutating Kubernetes.
Run all commands from:
cd ~/workspace/light-fabric
Start With Built-In Sample
Start the deployer with the sample template directory:
LIGHT_DEPLOYER_TEMPLATE_BASE_DIR=apps/light-deployer/examples/petstore \
LIGHT_DEPLOYER_KUBE_MODE=noop \
cargo run -p light-deployer
The service listens on:
http://127.0.0.1:7088
Check health from another terminal:
curl -fsSL http://127.0.0.1:7088/health
Expected output:
ok
List Tools With MCP JSON-RPC
The MCP endpoint is JSON-RPC 2.0 over HTTP at:
POST /mcp
List all deployment tools:
curl -fsSL http://127.0.0.1:7088/mcp \
-H 'content-type: application/json' \
-d '{
"jsonrpc": "2.0",
"id": "tools-list-1",
"method": "tools/list",
"params": {}
}'
Call a tool through MCP:
curl -fsSL http://127.0.0.1:7088/mcp \
-H 'content-type: application/json' \
-d '{
"jsonrpc": "2.0",
"id": "render-1",
"method": "tools/call",
"params": {
"name": "deployment.render",
"arguments": {
"hostId": "local-host",
"instanceId": "petstore-dev",
"environment": "dev",
"clusterId": "local",
"namespace": "light-deployer",
"values": {
"name": "petstore",
"image": {
"repository": "nginx",
"tag": "1.27"
},
"containerPort": 80
},
"template": {
"repoUrl": "local",
"ref": "main",
"path": "k8s"
}
}
}
}'
For local debugging, the deployer also exposes REST-style convenience endpoints:
curl -fsSL http://127.0.0.1:7088/mcp/tools/list
curl -fsSL http://127.0.0.1:7088/mcp/tools
curl -fsSL http://127.0.0.1:7088/mcp/tools/deployment.render
Use POST /mcp for MCP clients and AI agents.
Render The Built-In Sample
curl -fsSL http://127.0.0.1:7088/mcp \
-H 'content-type: application/json' \
-d '{
"jsonrpc": "2.0",
"id": "render-sample-1",
"method": "tools/call",
"params": {
"name": "deployment.render",
"arguments": {
"hostId": "local-host",
"instanceId": "petstore-dev",
"environment": "dev",
"clusterId": "microk8s-local",
"namespace": "light-deployer",
"values": {
"name": "petstore",
"replicas": 1,
"image": {
"repository": "nginx",
"tag": "1.27"
},
"containerPort": 80,
"service": {
"port": 80
}
},
"template": {
"repoUrl": "local",
"ref": "main",
"path": "k8s"
}
}
}
}'
Expected response shape:
{
  "jsonrpc": "2.0",
  "result": {
    "isError": false,
    "structuredContent": {
      "action": "render",
      "status": "rendered",
      "deployerId": "local-light-deployer",
      "clusterId": "local",
      "resources": [
        {
          "kind": "Deployment",
          "name": "petstore"
        },
        {
          "kind": "Service",
          "name": "petstore"
        }
      ]
    }
  }
}
The actual response also carries a requestId and a manifestHash, which are omitted from the shape above. Their values differ from run to run.
Render openapi-petstore Locally
If /home/steve/workspace/openapi-petstore is available and has a k8s/
folder, run:
LIGHT_DEPLOYER_TEMPLATE_BASE_DIR=/home/steve/workspace/openapi-petstore \
LIGHT_DEPLOYER_KUBE_MODE=noop \
cargo run -p light-deployer
Render request:
curl -fsSL http://127.0.0.1:7088/mcp \
-H 'content-type: application/json' \
-d '{
"jsonrpc": "2.0",
"id": "render-openapi-petstore-1",
"method": "tools/call",
"params": {
"name": "deployment.render",
"arguments": {
"hostId": "local-host",
"instanceId": "openapi-petstore-dev",
"environment": "dev",
"clusterId": "microk8s-local",
"namespace": "petstore-dev",
"values": {
"name": "openapi-petstore",
"image": {
"repository": "networknt/openapi-petstore",
"tag": "latest",
"pullPolicy": "IfNotPresent"
},
"service": {
"name": "openapi-petstore",
"type": "ClusterIP"
},
"resources": {
"requests": {
"memory": "64Mi",
"cpu": "250m"
},
"limits": {
"memory": "256Mi",
"cpu": "500m"
}
}
},
"template": {
"repoUrl": "local",
"ref": "master",
"path": "k8s"
}
}
}
}'
Expected resources:
- Deployment/openapi-petstore
- Service/openapi-petstore
Test Git Fetch
Stop the local-template run and restart without LIGHT_DEPLOYER_TEMPLATE_BASE_DIR:
LIGHT_DEPLOYER_KUBE_MODE=noop \
cargo run -p light-deployer
Render from GitHub:
curl -fsSL http://127.0.0.1:7088/mcp \
-H 'content-type: application/json' \
-d '{
"jsonrpc": "2.0",
"id": "render-git-1",
"method": "tools/call",
"params": {
"name": "deployment.render",
"arguments": {
"hostId": "local-host",
"instanceId": "openapi-petstore-dev",
"environment": "dev",
"clusterId": "microk8s-local",
"namespace": "petstore-dev",
"values": {
"name": "openapi-petstore",
"image": {
"repository": "networknt/openapi-petstore",
"tag": "latest"
}
},
"template": {
"repoUrl": "https://github.com/networknt/openapi-petstore.git",
"ref": "master",
"path": "k8s"
}
}
}
}'
For a private repository:
LIGHT_DEPLOYER_GIT_TOKEN=... \
LIGHT_DEPLOYER_KUBE_MODE=noop \
cargo run -p light-deployer
For Bitbucket app-password style auth:
LIGHT_DEPLOYER_GIT_USERNAME=my-user \
LIGHT_DEPLOYER_GIT_TOKEN=my-app-password \
LIGHT_DEPLOYER_KUBE_MODE=noop \
cargo run -p light-deployer
Dry Run And Diff In Noop Mode
Noop mode can also exercise the request path for these tools:
curl -fsSL http://127.0.0.1:7088/mcp \
-H 'content-type: application/json' \
-d '{
"jsonrpc": "2.0",
"id": "dry-run-sample-1",
"method": "tools/call",
"params": {
"name": "deployment.dryRun",
"arguments": {
"hostId": "local-host",
"instanceId": "petstore-dev",
"environment": "dev",
"clusterId": "microk8s-local",
"namespace": "light-deployer",
"values": {
"name": "petstore",
"replicas": 1,
"image": {
"repository": "nginx",
"tag": "1.27"
},
"containerPort": 80,
"service": {
"port": 80
}
},
"template": {
"repoUrl": "local",
"ref": "main",
"path": "k8s"
}
}
}
}'
curl -fsSL http://127.0.0.1:7088/mcp \
-H 'content-type: application/json' \
-d '{
"jsonrpc": "2.0",
"id": "diff-sample-1",
"method": "tools/call",
"params": {
"name": "deployment.diff",
"arguments": {
"hostId": "local-host",
"instanceId": "petstore-dev",
"environment": "dev",
"clusterId": "microk8s-local",
"namespace": "light-deployer",
"values": {
"name": "petstore",
"replicas": 1,
"image": {
"repository": "nginx",
"tag": "1.27"
},
"containerPort": 80,
"service": {
"port": 80
}
},
"template": {
"repoUrl": "local",
"ref": "main",
"path": "k8s"
}
}
}
}'
These calls do not validate against Kubernetes unless real mode is enabled.
Stop The Service
Press Ctrl-C in the terminal running cargo run.
Run Kubernetes
This page runs light-deployer inside MicroK8s and uses the in-cluster
ServiceAccount with kube-rs.
Prerequisites
MicroK8s should be running and microk8s kubectl should work:
microk8s status --wait-ready
microk8s kubectl get nodes
Build the image first:
cd ~/workspace/light-fabric
./apps/light-deployer/build.sh latest
Import Image Into MicroK8s
docker save networknt/light-deployer:latest | microk8s ctr image import -
If your MicroK8s install requires elevated permissions:
docker save networknt/light-deployer:latest | sudo microk8s ctr image import -
Verify the image is available:
microk8s ctr images ls | grep light-deployer
Install Deployer
Apply the included manifests:
microk8s kubectl apply -f apps/light-deployer/k8s/namespace.yaml
microk8s kubectl apply -f apps/light-deployer/k8s/rbac.yaml
microk8s kubectl apply -f apps/light-deployer/k8s/deployment.yaml
microk8s kubectl apply -f apps/light-deployer/k8s/service.yaml
Wait for the pod:
microk8s kubectl -n light-deployer rollout status deploy/light-deployer
microk8s kubectl -n light-deployer get pods
Check logs:
microk8s kubectl -n light-deployer logs deploy/light-deployer
The deployment sets:
LIGHT_DEPLOYER_KUBE_MODE=real
So the service uses real Kubernetes API calls from inside the cluster.
Port Forward
microk8s kubectl -n light-deployer port-forward svc/light-deployer 7088:7088
In another terminal:
curl -fsSL http://127.0.0.1:7088/health
Expected:
ok
List Tools
curl -fsSL http://127.0.0.1:7088/mcp \
-H 'content-type: application/json' \
-d '{
"jsonrpc": "2.0",
"id": "tools-list-1",
"method": "tools/list",
"params": {}
}'
The response contains the deployer's tool names, descriptions, input schemas, and invocation metadata. Light Portal can use this JSON-RPC response to populate MCP tools for the API details view.
Render In Kubernetes
Rendering does not mutate the cluster:
curl -fsSL http://127.0.0.1:7088/mcp \
-H 'content-type: application/json' \
-d '{
"jsonrpc": "2.0",
"id": "render-sample-1",
"method": "tools/call",
"params": {
"name": "deployment.render",
"arguments": {
"hostId": "local-host",
"instanceId": "petstore-dev",
"environment": "dev",
"clusterId": "microk8s-local",
"namespace": "light-deployer",
"values": {
"name": "petstore",
"replicas": 1,
"image": {
"repository": "nginx",
"tag": "1.27"
},
"containerPort": 80,
"service": {
"port": 80
}
},
"template": {
"repoUrl": "local",
"ref": "main",
"path": "k8s"
}
}
}
}'
Dry Run In Kubernetes
Dry-run renders the manifest and asks the Kubernetes API to validate it without persisting resources:
curl -fsSL http://127.0.0.1:7088/mcp \
-H 'content-type: application/json' \
-d '{
"jsonrpc": "2.0",
"id": "dry-run-sample-1",
"method": "tools/call",
"params": {
"name": "deployment.dryRun",
"arguments": {
"hostId": "local-host",
"instanceId": "petstore-dev",
"environment": "dev",
"clusterId": "microk8s-local",
"namespace": "light-deployer",
"values": {
"name": "petstore",
"replicas": 1,
"image": {
"repository": "nginx",
"tag": "1.27"
},
"containerPort": 80,
"service": {
"port": 80
}
},
"template": {
"repoUrl": "local",
"ref": "main",
"path": "k8s"
}
}
}
}'
Expected status:
{
  "jsonrpc": "2.0",
  "result": {
    "isError": false,
    "structuredContent": {
      "status": "validated"
    }
  }
}
Deploy Sample
The sample request deploys into the light-deployer namespace so it matches
the included namespace-scoped RBAC.
curl -fsSL http://127.0.0.1:7088/mcp \
-H 'content-type: application/json' \
-d '{
"jsonrpc": "2.0",
"id": "apply-sample-1",
"method": "tools/call",
"params": {
"name": "deployment.apply",
"arguments": {
"hostId": "local-host",
"instanceId": "petstore-dev",
"environment": "dev",
"clusterId": "microk8s-local",
"namespace": "light-deployer",
"values": {
"name": "petstore",
"replicas": 1,
"image": {
"repository": "nginx",
"tag": "1.27"
},
"containerPort": 80,
"service": {
"port": 80
}
},
"template": {
"repoUrl": "local",
"ref": "main",
"path": "k8s"
}
}
}
}'
The response should return quickly with an accepted/applying-style status. The operation continues in the deployer.
Watch Kubernetes resources:
microk8s kubectl -n light-deployer get deploy,svc,pods
Stream Events
Use the requestId from the deployment response:
curl -N "http://127.0.0.1:7088/events?request_id=<requestId>"
The event stream reports deployment progress and failures for that request.
Check Status
curl -fsSL http://127.0.0.1:7088/mcp \
-H 'content-type: application/json' \
-d '{
"jsonrpc": "2.0",
"id": "status-sample-1",
"method": "tools/call",
"params": {
"name": "deployment.status",
"arguments": {
"hostId": "local-host",
"instanceId": "petstore-dev",
"environment": "dev",
"clusterId": "microk8s-local",
"namespace": "light-deployer",
"template": {
"repoUrl": "local",
"ref": "main",
"path": "k8s"
}
}
}
}'
Undeploy Sample
curl -fsSL http://127.0.0.1:7088/mcp \
-H 'content-type: application/json' \
-d '{
"jsonrpc": "2.0",
"id": "delete-sample-1",
"method": "tools/call",
"params": {
"name": "deployment.delete",
"arguments": {
"hostId": "local-host",
"instanceId": "petstore-dev",
"environment": "dev",
"clusterId": "microk8s-local",
"namespace": "light-deployer",
"template": {
"repoUrl": "local",
"ref": "main",
"path": "k8s"
}
}
}
}'
Then verify resources:
microk8s kubectl -n light-deployer get deploy,svc,pods
Deploy openapi-petstore From Git
After the openapi-petstore repository has a k8s/ folder committed, use a
request like this:
curl -fsSL http://127.0.0.1:7088/mcp \
-H 'content-type: application/json' \
-d '{
"jsonrpc": "2.0",
"id": "apply-openapi-petstore-1",
"method": "tools/call",
"params": {
"name": "deployment.apply",
"arguments": {
"hostId": "local-host",
"instanceId": "openapi-petstore-dev",
"environment": "dev",
"clusterId": "microk8s-local",
"namespace": "light-deployer",
"values": {
"name": "openapi-petstore",
"image": {
"repository": "networknt/openapi-petstore",
"tag": "latest",
"pullPolicy": "IfNotPresent"
},
"service": {
"name": "openapi-petstore",
"type": "ClusterIP"
}
},
"template": {
"repoUrl": "https://github.com/networknt/openapi-petstore.git",
"ref": "master",
"path": "k8s"
}
}
}
}'
For private Git access, set LIGHT_DEPLOYER_GIT_TOKEN on the deployer pod.
In Kubernetes this should be injected from a Secret, not written directly into
the deployment manifest.
Update The Deployer Image
After rebuilding locally:
./apps/light-deployer/build.sh latest
docker save networknt/light-deployer:latest | microk8s ctr image import -
microk8s kubectl -n light-deployer rollout restart deploy/light-deployer
microk8s kubectl -n light-deployer rollout status deploy/light-deployer
Remove The Deployer
microk8s kubectl delete -f apps/light-deployer/k8s/service.yaml
microk8s kubectl delete -f apps/light-deployer/k8s/deployment.yaml
microk8s kubectl delete -f apps/light-deployer/k8s/rbac.yaml
microk8s kubectl delete -f apps/light-deployer/k8s/namespace.yaml
Light-Gateway
light-gateway is the Pingora-based gateway product in Light Fabric.
It is intended to host gateway behavior such as routing, proxying, and eventually AI/MCP gateway integrations while using the shared runtime and config model.
Key Dependencies
- light-runtime
- light-pingora
- config-loader
Runtime
The gateway uses light-pingora as its transport framework and
light-runtime for lifecycle, bootstrap, and service configuration.
Light Rule In Light-Gateway
light-gateway uses Light-Rule to enforce deterministic policy decisions in
the Pingora request path. The first production use is MCP tool authorization and
response filtering for the mcp handler.
This lets a gateway route agent MCP traffic to downstream MCP servers or API servers while enforcing fine-grained authorization locally from configuration delivered by config-server.
When It Runs
Light-Rule is invoked by light-gateway when all of these are true:
- handler.yml includes the mcp handler in the matched chain.
- mcp-router.yml enables the MCP router and defines tools.
- access-control.yml and/or rule.yml are available from local config or config-server.
- A client sends tools/call to the configured MCP endpoint, normally /mcp.
The dependency path is:
light-gateway
-> light-pingora
-> light-rule
light-gateway links light-pingora, and light-pingora links
light-rule. The rule engine is therefore part of the gateway binary; there is
no dynamic plugin loading step.
Request Flow
For MCP traffic, the runtime flow is:
POST /mcp
-> handler.yml selects mcp
-> mcp-router parses JSON-RPC tools/call
-> access-control runtime builds rule context
-> light-rule evaluates req-acc rules
-> denied: return JSON-RPC error -32001
-> allowed: call downstream HTTP or MCP tool
-> light-rule evaluates optional res-fil rules
-> return JSON-RPC result
Authorization happens before the downstream call. Response filtering happens after the downstream response and before the MCP JSON-RPC response is returned to the agent.
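The ordering in this flow can be sketched with three closures standing in for the real stages. This is an illustration of the sequence only, not the light-pingora handler: the function name handle_tool_call and the closure-based shape are assumptions. Only the order of the steps and the -32001 error code come from the flow above.

```rust
/// JSON-RPC error code returned when a req-acc rule denies the call.
const ACCESS_DENIED: i32 = -32001;

/// Hypothetical sketch of one tools/call:
/// 1. `authorize` stands for the req-acc evaluation,
/// 2. `downstream` stands for the HTTP or MCP tool call,
/// 3. `filter` stands for the optional res-fil evaluation.
fn handle_tool_call<A, D, F>(authorize: A, downstream: D, filter: F) -> Result<String, i32>
where
    A: Fn() -> bool,
    D: Fn() -> String,
    F: Fn(String) -> String,
{
    // Denied calls return before the downstream service is contacted.
    if !authorize() {
        return Err(ACCESS_DENIED);
    }
    // The filter runs on the downstream response before it is returned.
    Ok(filter(downstream()))
}
```

Because the denial returns early, a rejected caller never causes traffic to the downstream API or MCP server.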
Required Files
handler.yml
The mcp handler must be in the execution chain for the MCP path:
handlers:
  - correlation
  - security
  - mcp
paths:
  - path: /mcp
    method: POST
    exec:
      - correlation
      - security
      - mcp
defaultHandlers: []
Security should run before mcp when rules depend on JWT claims such as
role, grp, pos, att, uid, or sub.
mcp-router.yml
mcp-router.yml exposes the MCP endpoint and maps tools to downstream APIs or
downstream MCP servers:
enabled: true
path: /mcp
maxSessions: 10000
maxSessionsPerClient: 100
tools:
  - name: weather
    description: Get weather.
    targetHost: http://weather-api:8080
    path: /weather
    method: GET
    endpoint: /weather@get
    apiType: http
    inputSchema:
      type: object
      properties:
        city:
          type: string
The endpoint field is the stable policy key used by rule.yml. If it is
omitted, the gateway derives one from the tool path and method, such as
/weather@get.
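The fallback for a missing endpoint field can be sketched as follows. The function name policy_endpoint is hypothetical, and treating an empty endpoint string the same as an omitted one is an assumption. The path@method shape with a lowercased method follows the /weather@get example.

```rust
/// Hypothetical helper: return the policy key for a tool.
/// Uses the explicit `endpoint` when present, otherwise derives
/// `<path>@<lowercased method>` from the tool definition.
fn policy_endpoint(explicit: Option<&str>, path: &str, method: &str) -> String {
    match explicit {
        Some(endpoint) if !endpoint.is_empty() => endpoint.to_string(),
        _ => format!("{}@{}", path, method.to_lowercase()),
    }
}
```

For the weather tool, omitting endpoint gives /weather@get, which is the key that rule.yml uses under endpointRules.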
maxSessions caps the total in-memory MCP frontend sessions for this gateway
process. maxSessionsPerClient caps sessions for one authenticated client or,
when no principal is available, one MCP clientInfo.name and
clientInfo.version pair.
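The two caps can be sketched with a counter per client key. This is a simplified model, not the gateway's session store: SessionLimiter, try_open, and client_key are hypothetical names, the key format is invented for the example, and session expiry and close handling are left out.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory limiter for MCP frontend sessions.
struct SessionLimiter {
    max_sessions: usize,
    max_per_client: usize,
    total: usize,
    per_client: HashMap<String, usize>,
}

impl SessionLimiter {
    fn new(max_sessions: usize, max_per_client: usize) -> Self {
        SessionLimiter { max_sessions, max_per_client, total: 0, per_client: HashMap::new() }
    }

    /// Returns true and records the session when both caps allow it.
    fn try_open(&mut self, client_key: &str) -> bool {
        if self.total >= self.max_sessions {
            return false;
        }
        let count = self.per_client.entry(client_key.to_string()).or_insert(0);
        if *count >= self.max_per_client {
            return false;
        }
        *count += 1;
        self.total += 1;
        true
    }
}

/// Keys a session by the authenticated principal when there is one,
/// otherwise by the clientInfo name and version pair.
fn client_key(principal: Option<&str>, client_name: &str, client_version: &str) -> String {
    match principal {
        Some(p) => format!("principal:{}", p),
        None => format!("client:{}/{}", client_name, client_version),
    }
}
```

With maxSessions: 10000 and maxSessionsPerClient: 100, one client can hold at most 100 sessions even when the gateway as a whole is far below its total cap.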
For downstream MCP servers, set apiType: mcp. For downstream API servers, use
apiType: http or omit it when the default is acceptable.
access-control.yml
access-control.yml controls whether policy is active and how rules combine:
enabled: true
accessRuleLogic: any
defaultDeny: true
skipPathPrefixes: []
Fields:
- enabled: turns access-control evaluation on or off.
- accessRuleLogic: any or all for req-acc rule ids on an endpoint.
- defaultDeny: when true, deny calls with no matching endpoint rule.
- skipPathPrefixes: endpoint prefixes that bypass access control.
The file name is access-control.yml. The loader also accepts
access-control.yaml.
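One way to read how these four fields interact is the sketch below. It is an interpretation of the field descriptions, not the gateway's code: the `AccessControl` struct, the `allows` method, and the convention that an empty result slice means "no endpoint rule matched" are all assumptions made for illustration.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum RuleLogic {
    Any,
    All,
}

/// Hypothetical in-memory form of access-control.yml.
struct AccessControl {
    enabled: bool,
    access_rule_logic: RuleLogic,
    default_deny: bool,
    skip_path_prefixes: Vec<String>,
}

impl AccessControl {
    /// `rule_results` holds one boolean per req-acc rule mapped to the
    /// endpoint. An empty slice means no endpoint rule matched.
    fn allows(&self, endpoint: &str, rule_results: &[bool]) -> bool {
        // Disabled policy never blocks a call.
        if !self.enabled {
            return true;
        }
        // Skipped prefixes bypass access control entirely.
        if self.skip_path_prefixes.iter().any(|p| endpoint.starts_with(p.as_str())) {
            return true;
        }
        // No matching endpoint rule: fall back to defaultDeny.
        if rule_results.is_empty() {
            return !self.default_deny;
        }
        match self.access_rule_logic {
            RuleLogic::Any => rule_results.iter().any(|passed| *passed),
            RuleLogic::All => rule_results.iter().all(|passed| *passed),
        }
    }
}
```

With the example file above (any, defaultDeny: true), one passing rule out of two is enough to allow a call, while a tool with no mapped rule is denied.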
rule.yml
rule.yml provides the rules and endpoint mappings:
ruleBodies:
  allowMcpReader:
    common: Y
    ruleId: allowMcpReader
    ruleName: Allow MCP reader
    ruleType: req-acc
    conditions:
      - operatorCode: isNotNull
        propertyPath: auditInfo.subject_claims.ClaimsMap.role
    actions:
      - actionClassName: com.networknt.rule.RoleBasedAccessControlAction
endpointRules:
  /weather@get:
    req-acc:
      - allowMcpReader
    permission:
      roles: mcp-reader
In this example, a caller is allowed only when the authenticated principal has a
role matching mcp-reader.
The file name is rule.yml. The loader also accepts rule.yaml.
Rule Context
For MCP tool calls, the gateway builds a rule context with:
- auditInfo.subject_claims.ClaimsMap: normalized JWT claims from the security handler.
- headers: incoming agent request headers, lowercased.
- endpoint: the tool policy endpoint, for example /weather@get.
- toolName: the MCP tool name.
- toolArguments: the JSON arguments from tools/call.
- correlationId: the correlation id when one is available.
- permission: endpoint permission values merged into the root context.
The current built-in access-control action checks the caller role against
permission.roles.
Response filter actions can also use these claim dimensions:
role
group or grp
position or pos
attribute or att
user, user_id, uid, or sub
Built-In Actions
The gateway registers Rust actions under Java-compatible class names:
com.networknt.rule.RoleBasedAccessControlAction
RoleBasedAccessControlAction
com.networknt.rule.ResponseColumnFilterAction
ResponseColumnFilterAction
com.networknt.rule.ResponseRowFilterAction
ResponseRowFilterAction
RoleBasedAccessControlAction
Used with req-acc. It compares the caller role claim to permission.roles.
If there is no role claim or no configured roles, the action returns denied.
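The check can be sketched as follows. The deny-on-missing behavior comes from the description above; the assumption that both the role claim and permission.roles may hold several roles separated by spaces or commas is mine, as are the function names.

```rust
/// Splits a role string on commas and whitespace (an assumed format).
fn split_roles(value: &str) -> Vec<&str> {
    value
        .split(|c: char| c == ',' || c.is_whitespace())
        .filter(|part| !part.is_empty())
        .collect()
}

/// Allows the call only when the caller holds at least one permitted role.
/// A missing claim or missing configured roles results in a denial.
fn role_allowed(role_claim: Option<&str>, permitted_roles: Option<&str>) -> bool {
    let (claimed, permitted) = match (role_claim, permitted_roles) {
        (Some(claim), Some(roles)) => (split_roles(claim), split_roles(roles)),
        _ => return false,
    };
    if claimed.is_empty() || permitted.is_empty() {
        return false;
    }
    claimed.iter().any(|role| permitted.contains(role))
}
```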
ResponseColumnFilterAction
Used with res-fil. It filters fields from array-like JSON responses according
to endpoint permission configuration.
Example:
ruleBodies:
  filterColumns:
    common: Y
    ruleId: filterColumns
    ruleName: Filter account columns
    ruleType: res-fil
    conditions:
      - operatorCode: isNotNull
        propertyPath: col
    actions:
      - actionClassName: com.networknt.rule.ResponseColumnFilterAction
endpointRules:
  /accounts@get:
    res-fil:
      - filterColumns
    permission:
      col:
        role:
          mcp-reader: '["id","name"]'
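To make the effect concrete, here is a minimal sketch of column filtering over an array-like response. It models each row as a string map instead of real JSON so it needs only the standard library, and it assumes the column list for the caller's role has already been parsed; `Row` and `filter_columns` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for one JSON object in an array response.
type Row = BTreeMap<String, String>;

/// Keeps only the allowed columns in every row.
fn filter_columns(rows: &[Row], allowed: &[&str]) -> Vec<Row> {
    rows.iter()
        .map(|row| {
            row.iter()
                .filter(|(name, _)| allowed.contains(&name.as_str()))
                .map(|(name, value)| (name.clone(), value.clone()))
                .collect::<Row>()
        })
        .collect()
}
```

With the allowed list `["id", "name"]` from the example, a row that also carries a `balance` field would be returned with only `id` and `name`.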
ResponseRowFilterAction
Used with res-fil. It filters rows from array-like JSON responses according to
configured row predicates.
Example:
ruleBodies:
  filterRows:
    common: Y
    ruleId: filterRows
    ruleName: Filter account rows
    ruleType: res-fil
    conditions:
      - operatorCode: isNotNull
        propertyPath: row
    actions:
      - actionClassName: com.networknt.rule.ResponseRowFilterAction
endpointRules:
  /accounts@get:
    res-fil:
      - filterRows
    permission:
      row:
        role:
          mcp-reader:
            - colName: status
              operator: "="
              colValue: "OPEN"
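A rough sketch of what such a predicate does to a response is shown below. Rows are modeled as string maps to avoid a JSON dependency. Only the "=" operator appears in the example, so the sketch treats any other operator as non-matching, and it assumes multiple predicates are combined with AND; both choices are assumptions, as are the names `RowPredicate` and `filter_rows`.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for one JSON object in an array response.
type Row = BTreeMap<String, String>;

/// Mirrors one colName / operator / colValue entry from the example.
struct RowPredicate<'a> {
    col_name: &'a str,
    operator: &'a str,
    col_value: &'a str,
}

/// Evaluates a single predicate against a row.
fn row_matches(row: &Row, predicate: &RowPredicate<'_>) -> bool {
    match (row.get(predicate.col_name), predicate.operator) {
        (Some(actual), "=") => actual.as_str() == predicate.col_value,
        // Unknown operators and missing columns never match in this sketch.
        _ => false,
    }
}

/// Keeps only the rows that satisfy every predicate.
fn filter_rows(rows: Vec<Row>, predicates: &[RowPredicate<'_>]) -> Vec<Row> {
    rows.into_iter()
        .filter(|row| predicates.iter().all(|p| row_matches(row, p)))
        .collect()
}
```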
Matching Rules
Endpoint matching checks:
- exact endpoint key first
- Java-style path templates such as /accounts/{id}@get
- parent path entries, for example /accounts@get for /accounts/123@get
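The lookup order can be sketched as below. This is an illustrative reading of the three checks, not the gateway's matcher: `find_endpoint_key` and `template_matches` are hypothetical names, and the details (a template segment matches any one non-empty path segment, parents are tried from the nearest upward) are assumptions.

```rust
/// True when every template segment equals the path segment or is a
/// `{name}` placeholder matching one non-empty segment.
fn template_matches(template: &str, path: &str) -> bool {
    let t: Vec<&str> = template.split('/').collect();
    let p: Vec<&str> = path.split('/').collect();
    t.len() == p.len()
        && t.iter().zip(p.iter()).all(|(ts, ps)| {
            (ts.starts_with('{') && ts.ends_with('}') && !ps.is_empty()) || ts == ps
        })
}

/// Resolves a request endpoint such as `/accounts/123@get` to a configured key.
fn find_endpoint_key<'a>(keys: &[&'a str], endpoint: &str) -> Option<&'a str> {
    // 1. Exact endpoint key.
    if let Some(key) = keys.iter().copied().find(|k| *k == endpoint) {
        return Some(key);
    }
    let (path, method) = endpoint.rsplit_once('@')?;
    // 2. Path templates with the same method.
    for key in keys.iter().copied() {
        if let Some((key_path, key_method)) = key.rsplit_once('@') {
            if key_method == method && key_path.contains('{') && template_matches(key_path, path) {
                return Some(key);
            }
        }
    }
    // 3. Parent path entries, nearest parent first.
    let mut parent = path;
    while let Some(idx) = parent.rfind('/') {
        parent = &parent[..idx];
        if parent.is_empty() {
            break;
        }
        let candidate = format!("{}@{}", parent, method);
        if let Some(key) = keys.iter().copied().find(|k| *k == candidate.as_str()) {
            return Some(key);
        }
    }
    None
}
```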
For MCP tools, prefer explicitly setting endpoint in mcp-router.yml so the
policy key remains stable even if the downstream path changes.
Reload Behavior
light-gateway has reload support for MCP and access-control config:
- reloading mcp-router.yml rebuilds the MCP router runtime
- reloading access-control.yml or rule.yml rebuilds MCP and WebSocket policy runtimes
This matches the product model where light-portal manages configuration and
config-server delivers the resolved files.
Operational Notes
- If access-control.yml is missing, MCP tools are allowed unless another handler blocks the request.
- If access-control.yml is enabled and defaultDeny is true, a tool call with no matching req-acc endpoint rule is denied.
- If the security handler does not run before mcp, role-based rules will not have caller claims and will deny.
- Rule execution is local to the gateway. It does not call the database on each request.
- x-mask and x-mask-pattern in MCP tool inputSchema are handled before downstream execution. x-tokenize is reserved for the tokenization service integration.
Verification
Useful checks:
cargo tree -p light-gateway -i light-rule
cargo test -p light-pingora access_control
cargo test -p light-gateway gateway_loads_mcp_router_when_mcp_handler_is_active
The first command verifies the binary linkage. The test commands verify the MCP access-control path, default deny behavior, role-based allow behavior, response filtering, and gateway MCP runtime loading.
Deploy Native
This page describes the recommended VM deployment model for the Rust
light-gateway native binary.
Use this model when a customer wants to run light-gateway as a microgateway on
a VM to protect backend MCP servers. The gateway starts from a small local
bootstrap config, downloads runtime config from config-server, then registers
itself with controller.
Recommended Model
Deliver a versioned install bundle, not an ad hoc runtime script.
The bundle should contain:
- light-gateway native binary.
- Minimal bootstrap config files.
- A systemd unit.
- An install script for filesystem setup.
- A root-owned environment file for secrets.
The install script can create users, directories, symlinks, permissions, and the
systemd unit. It should not be the long-running process wrapper, and it should
not pass secrets as command-line arguments.
Use systemd to run the service:
- It restarts the process on failure.
- It keeps logs in the host journal.
- It avoids shell-history and process-list leakage from command-line secrets.
- It gives the customer a standard operational surface: start, stop, restart, status, and journalctl.
Runtime Layout
light-gateway uses relative runtime paths:
- config
- config-cache
The systemd service should therefore set WorkingDirectory to the installed
application directory.
Recommended VM layout:
/opt/light-gateway/
  light-gateway
  config -> /etc/light-gateway
  config-cache -> /var/lib/light-gateway/config-cache
/etc/light-gateway/
  startup.yml
  server.yml
  portal-registry.yml
  client.yml
  values.yml
  ca.pem
  light-gateway.env
/var/lib/light-gateway/
  config-cache/
The local config directory contains only bootstrap-time files. Runtime config
downloaded from config-server is written to config-cache before Pingora starts.
Keep config-cache writable by the light-gateway service user.
Build Artifact
Build a release binary from light-fabric:
cargo build --release -p light-gateway
The artifact is:
target/release/light-gateway
Build on a compatible Linux distribution for the customer VM. If the customer
fleet has mixed Linux versions, prefer a static or target-compatible build so
the binary does not fail on an older glibc.
Package with a versioned filename:
light-gateway-<version>-linux-amd64.tar.gz
For customers with package-management standards, wrap the same layout in a
.deb or .rpm later. Start with tar.gz until the runtime contract is stable.
Bootstrap Config
The local bootstrap config only needs enough information to reach config-server, identify the gateway instance, and trust TLS.
Example values.yml:
startup.host: customer.example.com
startup.timeout: 3000
startup.connectTimeout: 3000
startup.bootstrapCaCertPath: config/ca.pem
light-config-server-uri: https://config-server.customer.example.com:8435
server.serviceId: com.customer.mcp-gateway-1.0.0
server.environment: prod
server.ip: 0.0.0.0
server.advertisedAddress: mcp-gateway-01.customer.example.com
server.httpPort: 8080
server.enableHttp: true
server.httpsPort: 8443
server.enableHttps: false
server.enableRegistry: true
server.startOnRegistryFailure: true
portalRegistry.portalUrl: https://controller.customer.example.com:8438
server.advertisedAddress must be a stable address that controller and clients
can use to reach the VM gateway. Do not advertise 127.0.0.1 or 0.0.0.0.
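A pre-flight check for this rule could look like the sketch below. It is a suggestion for install or CI tooling, not something the gateway is documented to do; `is_advertisable` is a hypothetical name, and rejecting the literal name localhost is an extra assumption beyond the two addresses named above.

```rust
use std::net::IpAddr;

/// Returns false for addresses that other hosts cannot use to reach the VM:
/// empty values, loopback addresses, and unspecified (wildcard) addresses.
fn is_advertisable(address: &str) -> bool {
    let host = address.trim();
    if host.is_empty() || host.eq_ignore_ascii_case("localhost") {
        return false;
    }
    match host.parse::<IpAddr>() {
        // 127.0.0.1 and ::1 are loopback; 0.0.0.0 and :: are unspecified.
        Ok(ip) => !(ip.is_loopback() || ip.is_unspecified()),
        // Anything that is not an IP literal is treated as a DNS name.
        Err(_) => true,
    }
}
```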
Example startup.yml:
host: ${startup.host:dev.lightapi.net}
serviceId: ${server.serviceId:com.networknt.light-gateway-1.0.0}
envTag: ${server.environment:dev}
acceptHeader: application/yaml
timeout: ${startup.timeout:3000}
connectTimeout: ${startup.connectTimeout:3000}
configServerUri: ${light-config-server-uri:https://local.localhost}
authorization: ${light_portal_authorization:}
bootstrapCaCertPath: ${startup.bootstrapCaCertPath:config/ca.pem}
Example server.yml:
ip: ${server.ip:0.0.0.0}
advertisedAddress: ${server.advertisedAddress:127.0.0.1}
httpPort: ${server.httpPort:8080}
enableHttp: ${server.enableHttp:true}
httpsPort: ${server.httpsPort:8443}
enableHttps: ${server.enableHttps:false}
tlsCertPath: ${server.tlsCertPath:}
tlsKeyPath: ${server.tlsKeyPath:}
serviceId: ${server.serviceId:com.networknt.light-gateway-1.0.0}
enableRegistry: ${server.enableRegistry:true}
startOnRegistryFailure: ${server.startOnRegistryFailure:true}
dynamicPort: ${server.dynamicPort:false}
environment: ${server.environment:dev}
shutdownGracefulPeriod: ${server.shutdownGracefulPeriod:2000}
Example portal-registry.yml:
portalUrl: ${portalRegistry.portalUrl:https://localhost:8438}
portalToken: ${light_portal_authorization:}
controllerDiscoveryToken: ${portalRegistry.controllerDiscoveryToken:}
Example client.yml should include the customer CA path and hostname
verification policy for outbound HTTPS calls:
tls:
  caCertPath: ${client.caCertPath:config/ca.pem}
  verifyHostname: ${client.verifyHostname:true}
Keep the full gateway behavior, including MCP routing, authentication, rule configuration, and downstream MCP targets, in config-server. The VM should not need local edits for normal policy or route changes.
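The bootstrap files above use ${key:default} placeholders. The sketch below shows one plausible way such a placeholder resolves against the flat keys in values.yml, assuming the value wins when the key is present and the text after the first colon is the default otherwise. It handles only a value that is a single whole placeholder and does not model environment-variable lookup; `resolve` is a hypothetical name and this is not the config-loader crate's API.

```rust
use std::collections::HashMap;

/// Resolves a `${key:default}` placeholder against a flat key/value map.
/// Strings that are not a whole placeholder are returned unchanged.
fn resolve(raw: &str, values: &HashMap<String, String>) -> String {
    let inner = match raw.strip_prefix("${").and_then(|rest| rest.strip_suffix('}')) {
        Some(inner) => inner,
        None => return raw.to_string(),
    };
    // Split on the first colon only, so defaults such as URLs stay intact.
    let (key, default) = inner.split_once(':').unwrap_or((inner, ""));
    values
        .get(key)
        .cloned()
        .unwrap_or_else(|| default.to_string())
}
```

Under these assumptions, `${startup.host:dev.lightapi.net}` resolves to customer.example.com with the example values.yml, and `${light-config-server-uri:https://local.localhost}` falls back to https://local.localhost when the key is absent.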
Secrets
Keep secrets in a root-owned environment file or in the customer's secret manager. Do not pass secrets in command-line arguments.
Example /etc/light-gateway/light-gateway.env:
LIGHT_PORTAL_AUTHORIZATION=Bearer <token>
light_4j_config_password=<config-password-if-needed>
RUST_LOG=info
Permissions:
chown root:light-gateway /etc/light-gateway/light-gateway.env
chmod 0640 /etc/light-gateway/light-gateway.env
LIGHT_PORTAL_AUTHORIZATION is used for config-server bootstrap. The same token
is also used by portal registry startup when portal-registry.yml resolves
portalToken from light_portal_authorization.
Systemd Unit
Example /etc/systemd/system/light-gateway.service:
[Unit]
Description=Light Gateway
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=light-gateway
Group=light-gateway
WorkingDirectory=/opt/light-gateway
EnvironmentFile=/etc/light-gateway/light-gateway.env
ExecStart=/opt/light-gateway/light-gateway
Restart=on-failure
RestartSec=5
LimitNOFILE=65535
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=full
ProtectHome=true
ReadWritePaths=/var/lib/light-gateway/config-cache
[Install]
WantedBy=multi-user.target
Install and start:
systemctl daemon-reload
systemctl enable light-gateway
systemctl start light-gateway
systemctl status light-gateway
View logs:
journalctl -u light-gateway -f
Install Script Scope
An install script is useful, but keep it deterministic and small.
It should:
- Create the light-gateway user and group.
- Create /opt/light-gateway, /etc/light-gateway, and /var/lib/light-gateway/config-cache.
- Install the binary with executable permissions.
- Install bootstrap config files.
- Install or update the systemd unit.
- Set file ownership and permissions.
- Print the next operator steps for adding secrets and starting the service.
It should not:
- Embed bearer tokens.
- Pass tokens to ExecStart.
- Rewrite customer config-server state.
- Start the process before secrets and CA files are installed.
Startup Flow
The expected runtime flow is:
systemd
-> /opt/light-gateway/light-gateway
-> read local config/values.yml and startup.yml
-> call config-server with LIGHT_PORTAL_AUTHORIZATION
-> write downloaded config and files into config-cache
-> start Pingora with resolved runtime config
-> register gateway to controller using portalRegistry.portalUrl
-> route protected MCP traffic to downstream MCP servers
If config-server is temporarily unavailable and cached config exists, the
runtime can continue from config-cache. If this is not acceptable for a
customer environment, make it an operational policy to clear config-cache
before restart or add a pre-start health check for config-server availability.
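The fallback decision can be summarized in a small sketch. The `allow_cache_fallback` flag is hypothetical and stands in for the operational policy mentioned above; the gateway is not documented to expose such a setting, and what happens when neither source is available is also an assumption here.

```rust
#[derive(Debug, PartialEq)]
enum ConfigSource {
    /// Fresh config downloaded from config-server.
    ConfigServer,
    /// Previously downloaded config read from config-cache.
    Cache,
    /// No usable config; this sketch assumes startup stops here.
    Abort,
}

/// Picks where runtime config comes from at startup.
fn choose_config_source(
    server_reachable: bool,
    cache_exists: bool,
    allow_cache_fallback: bool,
) -> ConfigSource {
    if server_reachable {
        ConfigSource::ConfigServer
    } else if cache_exists && allow_cache_fallback {
        ConfigSource::Cache
    } else {
        ConfigSource::Abort
    }
}
```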
Upgrade And Rollback
Use versioned binary releases:
/opt/light-gateway/releases/2.2.1/light-gateway
/opt/light-gateway/releases/2.2.2/light-gateway
/opt/light-gateway/light-gateway -> releases/2.2.2/light-gateway
Upgrade:
systemctl stop light-gateway
ln -sfn /opt/light-gateway/releases/2.2.2/light-gateway /opt/light-gateway/light-gateway
systemctl start light-gateway
Rollback:
systemctl stop light-gateway
ln -sfn /opt/light-gateway/releases/2.2.1/light-gateway /opt/light-gateway/light-gateway
systemctl start light-gateway
Do not delete config-cache during a normal binary rollback. It is the local
cache of the config-server-delivered runtime state.
Validation Checklist
Before handing the VM to the customer:
- systemctl status light-gateway is active.
- journalctl -u light-gateway shows successful config-server bootstrap.
- journalctl -u light-gateway shows successful controller registration.
- The controller shows the gateway registered with the expected service id, environment, address, and port.
- The gateway health endpoint responds from the VM network.
- An MCP tools/list call reaches the gateway.
- An MCP tools/call call reaches the configured backend MCP server.
- Restarting the VM starts the gateway automatically.
Security Checklist
- Store bearer tokens and config passwords outside the install bundle.
- Use a customer CA file instead of disabling TLS verification in production.
- Use a stable DNS name for server.advertisedAddress.
- Restrict inbound VM firewall rules to required gateway ports.
- Restrict outbound VM firewall rules to config-server, controller, and backend MCP server addresses.
- Run as the dedicated light-gateway user.
- Keep /etc/light-gateway/light-gateway.env readable only by root and the service group.
- Rotate LIGHT_PORTAL_AUTHORIZATION through the customer secret process.
Deploy Kubernetes
This page describes the recommended Kubernetes deployment model for the Rust
light-gateway image from light-fabric/apps/light-gateway.
Use this model when light-gateway runs as a microgateway in front of backend
MCP servers. The pod starts from local bootstrap config, downloads runtime
config from config-server into config-cache, starts Pingora, and registers the
gateway with controller.
Recommended Model
Deploy the gateway as a normal single-container Kubernetes workload:
- Deployment for the gateway pod.
- Service for stable in-cluster access.
- ConfigMap for bootstrap config and non-secret values.
- Secret for bearer tokens and config passwords.
- emptyDir or PersistentVolumeClaim for config-cache.
- Optional Ingress, Gateway API, NodePort, or LoadBalancer for external client access.
Keep gateway behavior such as MCP route definitions, access-control rules, backend MCP targets, and runtime TLS files in config-server. The Kubernetes bootstrap config should only contain enough information for startup, trust, and registration.
Image
Build the image from the workspace root:
./apps/light-gateway/build.sh 2.2.1
For local testing without pushing:
./apps/light-gateway/build.sh 2.2.1 --local
Use immutable tags in Kubernetes. Avoid latest for customer deployments.
The runtime image uses:
/app/light-gateway
/app/config -> /config
/app/config-cache
The process runs as the image user gateway. Mount /config for bootstrap
config and make /app/config-cache writable.
Runtime Paths
Recommended container layout:
/config/
  startup.yml
  server.yml
  portal-registry.yml
  client.yml
  values.yml
  ca.pem
/app/config-cache/
  values.yml
  downloaded certs and files
Use a read-only ConfigMap for /config. Use a writable volume for
/app/config-cache.
For most deployments, use emptyDir for config-cache. This gives each pod a
fresh cache and avoids accidentally keeping stale config across pod replacement.
Use a PersistentVolumeClaim only when the customer explicitly wants the
gateway to restart from the last downloaded config during a config-server
outage. A persistent cache improves outage tolerance but can also preserve stale
runtime state.
Registration Address
In Kubernetes, do not register the pod IP. Pod IPs are ephemeral.
If controller and callers are inside the same cluster, advertise the Service DNS name:
server.advertisedAddress: ai-microgateway.light-gateway
The pattern is:
<service-name>.<namespace>
The port is still registered separately from the host/address.
If controller or callers are outside the cluster, advertise the externally reachable DNS name instead, such as the Ingress or LoadBalancer hostname:
server.advertisedAddress: mcp-gateway.customer.example.com
For the Rust gateway, this is configured with server.advertisedAddress. The
Java gateway template uses STATUS_HOST_IP; that is a light-4j-specific hook
and is not the Rust gateway contract.
Bootstrap Config
Example values.yml for an in-cluster controller and config-server:
startup.host: customer.example.com
startup.timeout: 3000
startup.connectTimeout: 3000
startup.bootstrapCaCertPath: config/ca.pem
light-config-server-uri: https://config-server.lightapi.svc.cluster.local:8435
server.serviceId: com.customer.mcp-gateway-1.0.0
server.environment: prod
server.ip: 0.0.0.0
server.advertisedAddress: ai-microgateway.light-gateway
server.httpPort: 8080
server.enableHttp: true
server.httpsPort: 8443
server.enableHttps: false
server.enableRegistry: true
server.startOnRegistryFailure: true
portalRegistry.portalUrl: https://controller.lightapi.svc.cluster.local:8438
client.caCertPath: config/ca.pem
client.verifyHostname: true
Example startup.yml:
host: ${startup.host:dev.lightapi.net}
serviceId: ${server.serviceId:com.networknt.light-gateway-1.0.0}
envTag: ${server.environment:dev}
acceptHeader: application/yaml
timeout: ${startup.timeout:3000}
connectTimeout: ${startup.connectTimeout:3000}
configServerUri: ${light-config-server-uri:https://local.localhost}
authorization: ${light_portal_authorization:}
bootstrapCaCertPath: ${startup.bootstrapCaCertPath:config/ca.pem}
Example server.yml:
ip: ${server.ip:0.0.0.0}
advertisedAddress: ${server.advertisedAddress:127.0.0.1}
httpPort: ${server.httpPort:8080}
enableHttp: ${server.enableHttp:true}
httpsPort: ${server.httpsPort:8443}
enableHttps: ${server.enableHttps:false}
tlsCertPath: ${server.tlsCertPath:}
tlsKeyPath: ${server.tlsKeyPath:}
serviceId: ${server.serviceId:com.networknt.light-gateway-1.0.0}
enableRegistry: ${server.enableRegistry:true}
startOnRegistryFailure: ${server.startOnRegistryFailure:true}
dynamicPort: ${server.dynamicPort:false}
environment: ${server.environment:dev}
shutdownGracefulPeriod: ${server.shutdownGracefulPeriod:2000}
Example portal-registry.yml:
portalUrl: ${portalRegistry.portalUrl:https://localhost:8438}
portalToken: ${light_portal_authorization:}
controllerDiscoveryToken: ${portalRegistry.controllerDiscoveryToken:}
Example client.yml:
tls:
  caCertPath: ${client.caCertPath:config/ca.pem}
  verifyHostname: ${client.verifyHostname:true}
Use the customer CA in ca.pem. Do not disable hostname verification in
production to work around certificate SAN problems.
Secrets
Store the portal bearer token and optional config password in a Kubernetes
Secret.
Example:
apiVersion: v1
kind: Secret
metadata:
  name: light-gateway-secret
  namespace: light-gateway
type: Opaque
stringData:
  LIGHT_PORTAL_AUTHORIZATION: "Bearer <token>"
  light_4j_config_password: "<config-password-if-needed>"
LIGHT_PORTAL_AUTHORIZATION is used for config-server bootstrap. It is also
used by portal registry startup when portal-registry.yml resolves
portalToken from light_portal_authorization.
Do not store real bearer tokens in Git, ConfigMaps, Helm values committed to the repo, or rendered deployment examples.
Example Manifests
Create the namespace separately:
kubectl create namespace light-gateway
If deploying through light-deployer, keep Namespace out of the rendered
bundle because deployer policy may block cluster-scoped resources.
Example bootstrap ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: light-gateway-bootstrap
  namespace: light-gateway
data:
  values.yml: |
    startup.host: customer.example.com
    startup.timeout: 3000
    startup.connectTimeout: 3000
    startup.bootstrapCaCertPath: config/ca.pem
    light-config-server-uri: https://config-server.lightapi.svc.cluster.local:8435
    server.serviceId: com.customer.mcp-gateway-1.0.0
    server.environment: prod
    server.ip: 0.0.0.0
    server.advertisedAddress: ai-microgateway.light-gateway
    server.httpPort: 8080
    server.enableHttp: true
    server.httpsPort: 8443
    server.enableHttps: false
    server.enableRegistry: true
    server.startOnRegistryFailure: true
    portalRegistry.portalUrl: https://controller.lightapi.svc.cluster.local:8438
    client.caCertPath: config/ca.pem
    client.verifyHostname: true
  startup.yml: |
    host: ${startup.host:dev.lightapi.net}
    serviceId: ${server.serviceId:com.networknt.light-gateway-1.0.0}
    envTag: ${server.environment:dev}
    acceptHeader: application/yaml
    timeout: ${startup.timeout:3000}
    connectTimeout: ${startup.connectTimeout:3000}
    configServerUri: ${light-config-server-uri:https://local.localhost}
    authorization: ${light_portal_authorization:}
    bootstrapCaCertPath: ${startup.bootstrapCaCertPath:config/ca.pem}
  server.yml: |
    ip: ${server.ip:0.0.0.0}
    advertisedAddress: ${server.advertisedAddress:127.0.0.1}
    httpPort: ${server.httpPort:8080}
    enableHttp: ${server.enableHttp:true}
    httpsPort: ${server.httpsPort:8443}
    enableHttps: ${server.enableHttps:false}
    tlsCertPath: ${server.tlsCertPath:}
    tlsKeyPath: ${server.tlsKeyPath:}
    serviceId: ${server.serviceId:com.networknt.light-gateway-1.0.0}
    enableRegistry: ${server.enableRegistry:true}
    startOnRegistryFailure: ${server.startOnRegistryFailure:true}
    dynamicPort: ${server.dynamicPort:false}
    environment: ${server.environment:dev}
    shutdownGracefulPeriod: ${server.shutdownGracefulPeriod:2000}
  portal-registry.yml: |
    portalUrl: ${portalRegistry.portalUrl:https://localhost:8438}
    portalToken: ${light_portal_authorization:}
    controllerDiscoveryToken: ${portalRegistry.controllerDiscoveryToken:}
  client.yml: |
    tls:
      caCertPath: ${client.caCertPath:config/ca.pem}
      verifyHostname: ${client.verifyHostname:true}
  ca.pem: |
    -----BEGIN CERTIFICATE-----
    <customer-ca-certificate>
    -----END CERTIFICATE-----
Example Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-microgateway
  namespace: light-gateway
  labels:
    app: ai-microgateway
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ai-microgateway
  template:
    metadata:
      labels:
        app: ai-microgateway
    spec:
      securityContext:
        fsGroup: 999
        fsGroupChangePolicy: OnRootMismatch
      containers:
        - name: light-gateway
          image: networknt/light-gateway:2.2.1
          imagePullPolicy: IfNotPresent
          env:
            - name: LIGHT_PORTAL_AUTHORIZATION
              valueFrom:
                secretKeyRef:
                  name: light-gateway-secret
                  key: LIGHT_PORTAL_AUTHORIZATION
            - name: light_4j_config_password
              valueFrom:
                secretKeyRef:
                  name: light-gateway-secret
                  key: light_4j_config_password
                  optional: true
            - name: RUST_LOG
              value: info
          ports:
            - name: http
              containerPort: 8080
            - name: https
              containerPort: 8443
          readinessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 30
            periodSeconds: 30
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: "1"
              memory: 512Mi
          volumeMounts:
            - name: bootstrap-config
              mountPath: /config
              readOnly: true
            - name: config-cache
              mountPath: /app/config-cache
      volumes:
        - name: bootstrap-config
          configMap:
            name: light-gateway-bootstrap
        - name: config-cache
          emptyDir: {}
The example uses fsGroup: 999, which matches the default gateway group in the
current image. Adjust it if the image user or group changes.
If HTTP is disabled and only HTTPS is enabled, change the probes to an HTTPS probe or a TCP probe.
Example Service:
apiVersion: v1
kind: Service
metadata:
  name: ai-microgateway
  namespace: light-gateway
spec:
  type: ClusterIP
  selector:
    app: ai-microgateway
  ports:
    - name: http
      port: 8080
      targetPort: http
    - name: https
      port: 8443
      targetPort: https
For external access, add an Ingress, Gateway API route, NodePort, or
LoadBalancer according to the customer cluster standard. If external clients
or controller use that external path, set server.advertisedAddress to the same
externally reachable DNS name.
Apply With Kubectl
Apply manifests in this order:
kubectl apply -f namespace.yml
kubectl apply -f secret.yml
kubectl apply -f configmap.yml
kubectl apply -f deployment.yml
kubectl apply -f service.yml
Check rollout:
kubectl -n light-gateway rollout status deploy/ai-microgateway
kubectl -n light-gateway get pods -l app=ai-microgateway
kubectl -n light-gateway logs deploy/ai-microgateway
For local testing with a ClusterIP Service:
kubectl -n light-gateway port-forward svc/ai-microgateway 8080:8080 8443:8443
Deploy Through Light-Deployer
When light-deployer runs outside the cluster and has
LIGHT_DEPLOYER_TEMPLATE_BASE_DIR set, repoUrl: "local" can point to local
templates.
When light-deployer runs inside Kubernetes, use a real Git URL:
{
  "template": {
    "repoUrl": "https://github.com/networknt/light-fabric.git",
    "ref": "main",
    "path": "apps/light-gateway/k8s/light-gateway"
  }
}
Do not use repoUrl: "local" for an in-cluster deployer unless the template
repo is mounted into the deployer container and
LIGHT_DEPLOYER_TEMPLATE_BASE_DIR points to it.
The in-cluster deployer checks out repoUrl at ref and reads manifests from
template.path.
Keep Namespace out of templates rendered by light-deployer if the deployer
policy blocks cluster-scoped resources. Create the namespace separately:
kubectl create namespace light-gateway
Config-Server Requirements
Before deploying the gateway pod, config-server should already have config for the tuple used by startup:
host = startup.host
serviceId = server.serviceId
envTag = server.environment
At minimum, config-server should return runtime config for:
- handler.yml
- mcp-router.yml
- access-control.yml and rule.yml when MCP authorization is enabled.
- security.yml, unified-security.yml, or other active auth config.
- websocket-router.yml when WebSocket MCP/BFF routing is enabled.
- Any downstream client, token, or registry config required by the selected handlers.
The pod bootstrap files should stay small and stable. Normal route, policy, and backend changes should go through config-server and controller reload flows.
Startup Flow
Expected runtime flow:
Kubernetes starts pod
-> /app/light-gateway
-> read /app/config -> /config bootstrap files
-> call config-server with LIGHT_PORTAL_AUTHORIZATION
-> write downloaded config and files into /app/config-cache
-> start Pingora with resolved runtime config
-> register gateway to controller using portalRegistry.portalUrl
-> advertise server.advertisedAddress and configured port
-> route protected MCP traffic to backend MCP servers
If config-server is unavailable and /app/config-cache/values.yml exists, the
runtime can continue from cached config. With emptyDir, that cache disappears
when the pod is recreated. With a PVC, it can survive pod replacement.
Upgrade And Rollback
Use Kubernetes rolling updates with immutable image tags:
kubectl -n light-gateway set image deploy/ai-microgateway \
light-gateway=networknt/light-gateway:2.2.2
kubectl -n light-gateway rollout status deploy/ai-microgateway
Rollback:
kubectl -n light-gateway rollout undo deploy/ai-microgateway
For production, prefer changing only one variable at a time: either image tag or config-server runtime config, not both in the same rollout.
Validation Checklist
After deployment:
- kubectl -n light-gateway rollout status deploy/ai-microgateway succeeds.
- Pods are ready and restart count is stable.
- Logs show successful config-server bootstrap.
- Logs show successful controller registration.
- Controller shows the gateway registered with the expected service id, environment, host, and port.
- server.advertisedAddress is reachable from the controller.
- The Service responds on /health.
- MCP tools/list reaches the gateway.
- MCP tools/call reaches the backend MCP server.
- A pod restart still starts cleanly with the selected cache policy.
Security Checklist
- Keep bearer tokens in Kubernetes Secret, not ConfigMap.
- Use customer CA trust and keep client.verifyHostname: true in production.
- Use immutable image tags and image pull credentials from Kubernetes secrets when the registry is private.
- Run as the non-root image user.
- Make /config read-only.
- Make only /app/config-cache writable.
- Restrict ingress traffic to required gateway ports.
- Restrict egress traffic to config-server, controller, token/key services, and backend MCP servers.
- Rotate LIGHT_PORTAL_AUTHORIZATION through the customer secret process.
Kubernetes Gateway API Design
Status
Proposal.
This page captures how the current light-gateway work can be reused for
Kubernetes Gateway API without turning the microgateway product into a
catch-all Kubernetes control plane. The recommended direction is a separate
light-k8s-gateway product built on light-pingora for north/south ingress,
with a later sidecar or mesh product for transparent east/west traffic.
Context
The current Kubernetes deployment model runs light-gateway as a normal
Deployment with a ClusterIP Service. Runtime behavior comes from local
bootstrap config, config-server downloaded files in config-cache, and the
Pingora data plane built by light-pingora.
The current gateway already has useful data-plane pieces:
- HTTP and HTTPS proxying through Pingora.
- Static upstreams from proxy.yml.
- Service-aware routing from router.yml.
- Direct registry, controller-backed discovery, and static service targets.
- Handler chains for security, header mutation, CORS, rate limits, token handling, MCP, WebSocket, static resources, and config reload.
- Live config managers and reloaders for route and handler modules.
Gateway API adds a Kubernetes-native control plane. For ingress, users create
GatewayClass, Gateway, and route resources such as HTTPRoute. For service
mesh, the GAMMA model attaches route resources directly to Kubernetes
Service objects instead of using Gateway and GatewayClass.
Product Boundary
Keep the product line split by operational role:
- light-pingora is the shared data-plane framework.
- light-gateway remains the microgateway, sidecar, BFF, API, agent, MCP, and LLM gateway product configured through Light runtime, config-server, controller-rs, and local config.
- light-k8s-gateway is the proposed Kubernetes Gateway API product for north/south ingress. It should reuse light-pingora and lift reusable light-gateway modules where appropriate, but it should own Kubernetes watches, Gateway API status, RBAC, listener translation, TLS Secret handling, and EndpointSlice routing.
- light-k8s-gateway-controller and light-k8s-gateway-proxy should be separate deployments from the first implementation. The controller owns Kubernetes RBAC and status writes. The proxy owns untrusted client traffic and should not need Kubernetes API permissions.
- A future light-mesh or light-sidecar product should own transparent east/west Service Mesh behavior if we pursue GAMMA conformance. It should share the Gateway API route compiler and light-pingora data-plane modules, but its deployment model is sidecar or node-local interception, not ingress.
This avoids giving ordinary microgateway deployments broad Kubernetes RBAC and keeps config-server/controller-rs routing separate from portable Gateway API routing intent.
Goals
- Let operators install
light-k8s-gatewayas a Gateway API implementation with a controller name such asnetworknt.com/light-k8s-gateway. - Support north/south ingress with
GatewayClass,Gateway,HTTPRoute, KubernetesService,EndpointSlice,Secret, andReferenceGrant. - Separate Kubernetes reconciliation from request proxying so control-plane RBAC is never granted to the public traffic data plane.
- Provide a migration path from NGINX or Traefik by running side by side with a distinct GatewayClass, then moving routes class by class or host by host.
- Reuse the existing Pingora proxy, handler chain, service discovery, metrics, and config reload model instead of creating a separate proxy stack.
- Use Gateway API policy attachment for Light-specific Kubernetes policy CRDs instead of annotations or out-of-band route policy.
- Support east/west traffic using Gateway API mesh semantics where HTTPRoute.parentRefs can point at a Service.
- Keep Light-specific policies available without forcing them into portable Gateway API fields. Gateway API should configure routing; Light config and future policy CRDs should configure Light-specific behavior.
- Build toward Gateway API conformance tests for both Gateway and Mesh feature sets.
Non-Goals
- Do not remove existing config-server, direct registry, portal registry, or static route support.
- Do not require every light-gateway deployment to watch Kubernetes. Gateway API support should be disabled unless explicitly configured.
- Do not run the Kubernetes controller reconciler inside public data-plane pods with broad Kubernetes RBAC.
- Do not claim immediate support for every Gateway API route type. Start with HTTPRoute; add GRPCRoute, TLSRoute, TCPRoute, and UDPRoute in later milestones.
- Do not make transparent east/west interception a hidden side effect of the ingress deployment. Mesh mode needs an explicit data-plane deployment model.
- Do not treat a non-transparent egress gateway as fully GAMMA-compliant mesh support.
Target API Versions
The north/south MVP targets the Gateway API v1 Standard Channel resources:
- GatewayClass
- Gateway
- HTTPRoute
- ReferenceGrant
Experimental or later milestones must be labeled explicitly in docs, manifests,
and conformance reports. This includes GAMMA mesh behavior and route kinds such
as GRPCRoute, TLSRoute, TCPRoute, and UDPRoute when those features rely
on non-Standard channels in the installed Gateway API version.
North/South Ingress Model
For ingress replacement, light-k8s-gateway should run as two cooperating
pieces:
- light-k8s-gateway-controller: watches Kubernetes resources, validates attachment and policy, updates status, performs leader election, and produces a compiled routing snapshot.
- light-k8s-gateway-proxy: consumes signed or mTLS-protected snapshots and serves client traffic through Pingora. It has no Kubernetes watch or status permissions and can scale independently with an HPA.
The split is mandatory from day 1. It prevents a proxy vulnerability in the
public data plane from becoming a Kubernetes control-plane compromise. The
controller can run as an HA deployment with Kubernetes Lease leader election;
only the leader reconciles resources and writes status. Non-leader controller
replicas stay warm and can take over quickly.
Snapshot delivery can start as a lightweight internal gRPC stream and evolve
toward an xDS-like API if we need richer incremental updates. The proxy should
apply the received GatewayApiSnapshot through the same kind of ConfigManager
swap used by the current Pingora modules.
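The snapshot swap described above can be sketched as follows. This is a minimal illustration using only the standard library, not the actual ConfigManager API: `SnapshotStore`, its methods, and the fields of `GatewayApiSnapshot` are hypothetical names chosen for the example, and the rule that stale versions are ignored is an assumption.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for the compiled routing snapshot.
#[derive(Debug, Clone, PartialEq)]
pub struct GatewayApiSnapshot {
    pub version: u64,
    pub routes: Vec<String>,
}

/// Hypothetical holder; the real ConfigManager may look different.
pub struct SnapshotStore {
    current: RwLock<Arc<GatewayApiSnapshot>>,
}

impl SnapshotStore {
    pub fn new(initial: GatewayApiSnapshot) -> Self {
        Self { current: RwLock::new(Arc::new(initial)) }
    }

    /// Request path: clone the Arc so the lock is held only briefly.
    /// In-flight requests keep using the snapshot they loaded.
    pub fn load(&self) -> Arc<GatewayApiSnapshot> {
        let guard = self.current.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Control path: swap in a newer snapshot. Stale or duplicate
    /// versions are ignored (assumed policy). Returns true on swap.
    pub fn apply(&self, next: GatewayApiSnapshot) -> bool {
        let mut guard = self.current.write().unwrap();
        if next.version <= guard.version {
            return false;
        }
        *guard = Arc::new(next);
        true
    }
}
```

A request handler would call `load()` once per request and route against that snapshot, so a concurrent `apply()` never changes the table in the middle of a request.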
Typical installation:
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: light-k8s-gateway
spec:
  controllerName: networknt.com/light-k8s-gateway
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: public
  namespace: gateway-system
spec:
  gatewayClassName: light-k8s-gateway
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All
    - name: https
      protocol: HTTPS
      port: 443
      hostname: api.example.com
      tls:
        mode: Terminate
        certificateRefs:
          - kind: Secret
            name: api-example-com
      allowedRoutes:
        namespaces:
          from: All
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: petstore
  namespace: apps
spec:
  parentRefs:
    - name: public
      namespace: gateway-system
      sectionName: https
  hostnames:
    - api.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /pets
      backendRefs:
        - name: petstore
          port: 8080
The controller resolves this into a runtime route table:
Gateway listener
-> accepted HTTPRoutes
-> host/path/header/method/query matches
-> filters supported by light-k8s-gateway
-> backend Service
-> EndpointSlice addresses
-> Pingora ProxyTarget set
The existing proxy.yml and router.yml paths remain useful for legacy and
non-Kubernetes deployments. Kubernetes Gateway API routes should not depend on
service_id headers or pathPrefixService.yml; they should route from the
compiled Gateway API table directly to Kubernetes endpoints.
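The host and path part of that lookup can be sketched as follows. This is a simplified illustration, not the light-pingora module: `CompiledRoute` and `select_route` are hypothetical names, only hostname and PathPrefix matching are covered, and header, method, and query matches are left out. It follows the Gateway API rules that PathPrefix matches whole path elements and that the longest matching prefix takes precedence.

```rust
/// Hypothetical compiled route entry; field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct CompiledRoute {
    /// None means the route accepts any hostname.
    pub hostname: Option<String>,
    pub path_prefix: String,
    pub backend: String,
}

/// PathPrefix matches on path-element boundaries:
/// "/pets" matches "/pets" and "/pets/1" but not "/petstore".
fn prefix_matches(prefix: &str, path: &str) -> bool {
    let prefix = prefix.trim_end_matches('/');
    if prefix.is_empty() {
        return true; // "/" matches every path
    }
    match path.strip_prefix(prefix) {
        Some(rest) => rest.is_empty() || rest.starts_with('/'),
        None => false,
    }
}

/// Pick the matching route with the longest path prefix.
pub fn select_route<'a>(
    routes: &'a [CompiledRoute],
    host: &str,
    path: &str,
) -> Option<&'a CompiledRoute> {
    routes
        .iter()
        .filter(|r| {
            r.hostname
                .as_deref()
                .map_or(true, |h| h.eq_ignore_ascii_case(host))
        })
        .filter(|r| prefix_matches(&r.path_prefix, path))
        .max_by_key(|r| r.path_prefix.len())
}
```

With the petstore route above, a request for api.example.com/pets/1 selects the petstore backend, while /petstore does not match the /pets prefix.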
Required Ingress Patches
Add a Kubernetes Gateway API module:
k8sGatewayApi:
  enabled: ${k8sGatewayApi.enabled:false}
  mode: ${k8sGatewayApi.mode:ingress}
  controllerName: ${k8sGatewayApi.controllerName:networknt.com/light-k8s-gateway}
  gatewayClassName: ${k8sGatewayApi.gatewayClassName:light-k8s-gateway}
  watchNamespaces: ${k8sGatewayApi.watchNamespaces:[]}
  statusAddress: ${k8sGatewayApi.statusAddress:}
Implementation changes:
- Create apps/light-k8s-gateway-controller and apps/light-k8s-gateway-proxy.
- Add Gateway API and Kubernetes clients, likely behind a Cargo feature such as k8s-gateway-api, using kube, kube-runtime, k8s-openapi, and generated Gateway API resource types.
- Watch GatewayClass, Gateway, HTTPRoute, ReferenceGrant, Service, EndpointSlice, Secret, and Namespace.
- Compile watched objects into a deterministic GatewayApiSnapshot.
- Push the compiled snapshot to proxy pods over an authenticated internal channel.
- Store the received snapshot in a proxy-side ConfigManager, similar to the current proxy and router reload model.
- Add a light-pingora Gateway API route-table module that can select a backend before falling back to existing proxy/router behavior.
- Update Kubernetes status conditions for GatewayClass, Gateway, listeners, and routes. Status must clearly report unsupported route types, listener conflicts, missing TLS secrets, rejected cross-namespace references, empty backends, and unsupported filters.
- Add Kubernetes Lease leader election so only one controller replica writes status and publishes snapshots.
- Add controller RBAC for read watches, Secret reads where allowed, Lease writes, and status updates. Secret read permissions should be namespace-scoped where possible.
- Give proxy pods no Kubernetes RBAC by default.
- Add install manifests for separate controller and proxy ServiceAccount, ClusterRole, ClusterRoleBinding, Deployment, Service, and a sample GatewayClass.
The transport also needs a listener model. Today PingoraTransport binds the
single server.httpPort and single server.httpsPort from server.yml. That
is enough for the first 80/443 ingress path, but full Gateway API support
needs multiple listeners with independent protocol, port, hostname, and TLS
settings.
Suggested runtime patch:
server:
  listeners:
    - name: http
      protocol: HTTP
      ip: 0.0.0.0
      port: 80
    - name: https-api
      protocol: HTTPS
      ip: 0.0.0.0
      port: 443
      hostname: api.example.com
      tlsCertPath: /var/run/light-k8s-gateway/tls/api/tls.crt
      tlsKeyPath: /var/run/light-k8s-gateway/tls/api/tls.key
Keep server.httpPort, server.enableHttp, server.httpsPort, and
server.enableHttps as backward-compatible shorthand.
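One way to keep the shorthand working is to expand it into the listener model at load time. The sketch below is an assumption about how that could look, not existing light-runtime code: `Listener`, `LegacyServer`, and `effective_listeners` are hypothetical names, the default bind address is assumed to be 0.0.0.0, and the rule that an explicit listeners list overrides the shorthand is also an assumption.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Protocol {
    Http,
    Https,
}

/// Hypothetical runtime listener, mirroring the YAML fields above.
#[derive(Debug, Clone, PartialEq)]
pub struct Listener {
    pub name: String,
    pub protocol: Protocol,
    pub ip: String,
    pub port: u16,
    pub hostname: Option<String>,
}

/// The existing single-port shorthand from server.yml.
#[derive(Debug, Clone, Default)]
pub struct LegacyServer {
    pub enable_http: bool,
    pub http_port: u16,
    pub enable_https: bool,
    pub https_port: u16,
}

/// Explicit listeners win (assumed precedence); otherwise the
/// shorthand expands to at most one HTTP and one HTTPS listener.
pub fn effective_listeners(explicit: Vec<Listener>, legacy: &LegacyServer) -> Vec<Listener> {
    if !explicit.is_empty() {
        return explicit;
    }
    let mut out = Vec::new();
    if legacy.enable_http {
        out.push(Listener {
            name: "http".to_string(),
            protocol: Protocol::Http,
            ip: "0.0.0.0".to_string(),
            port: legacy.http_port,
            hostname: None,
        });
    }
    if legacy.enable_https {
        out.push(Listener {
            name: "https".to_string(),
            protocol: Protocol::Https,
            ip: "0.0.0.0".to_string(),
            port: legacy.https_port,
            hostname: None,
        });
    }
    out
}
```

Under these assumptions, an existing server.yml with only httpPort and httpsPort produces the same two listeners as before, and the transport only ever deals with the list form.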
HTTPRoute Support Plan
Start with the common ingress subset:
- GatewayClass acceptance for networknt.com/light-k8s-gateway.
- Gateway listeners for HTTP and terminated HTTPS.
- HTTPRoute attachment by parentRefs, sectionName, listener hostname, listener namespace policy, and route hostname.
- HTTPRoute matches for path prefix, exact path, method, headers, and query parameters.
- backendRefs to Kubernetes Service backends, including weights.
- ReferenceGrant for cross-namespace backend references.
- Endpoint resolution from EndpointSlice, with Service DNS as a fallback only when endpoint watching is unavailable.
- TLS Secret loading for terminated HTTPS.
- Request header modification and URL rewrite where existing Pingora handlers already provide equivalent behavior.
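The ReferenceGrant item in the list above amounts to a permission check in the controller. The sketch below is a simplified illustration: the real ReferenceGrant carries lists of from and to entries with API groups, while this hypothetical `ReferenceGrant` struct flattens one grant to a single from/to pair, and `backend_ref_allowed` is an invented helper name.

```rust
/// Simplified, hypothetical view of one ReferenceGrant entry.
pub struct ReferenceGrant {
    /// Namespace that owns the grant and the referenced target.
    pub namespace: String,
    pub from_kind: String,
    pub from_namespace: String,
    pub to_kind: String,
    /// None allows every object of `to_kind` in the namespace.
    pub to_name: Option<String>,
}

/// Same-namespace backendRefs are always allowed. Cross-namespace
/// refs need a grant that lives in the backend's namespace.
pub fn backend_ref_allowed(
    grants: &[ReferenceGrant],
    route_ns: &str,
    backend_ns: &str,
    backend_name: &str,
) -> bool {
    if route_ns == backend_ns {
        return true;
    }
    grants.iter().any(|g| {
        g.namespace == backend_ns
            && g.from_kind == "HTTPRoute"
            && g.from_namespace == route_ns
            && g.to_kind == "Service"
            && g.to_name.as_deref().map_or(true, |n| n == backend_name)
    })
}
```

When the check fails, the controller would leave the backend out of the snapshot and report the rejected reference in route status rather than routing to it.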
Later milestones:
- Request redirect, response header modification, request mirroring, retries, and timeouts.
- GRPCRoute over HTTP/2.
- TLSRoute for SNI routing and passthrough.
- TCPRoute and UDPRoute for L4 ingress if Pingora transport support is added.
- Backend TLS policy and mTLS to upstream services.
Light Policy Attachment
Kubernetes-native deployments should use the Gateway API Policy Attachment pattern from GEP-713 for Light-specific behavior. Do not use annotations for core behavior, and do not require config-server-owned route policy for the Kubernetes Gateway API path.
Add Light policy CRDs with targetRefs that point at Gateway API resources:
apiVersion: gateway.lightapi.net/v1alpha1
kind: LightAuthPolicy
metadata:
  name: petstore-auth
  namespace: apps
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: petstore
  jwt:
    issuer: https://issuer.example.com
    audience:
      - petstore
---
apiVersion: gateway.lightapi.net/v1alpha1
kind: LightRateLimitPolicy
metadata:
  name: petstore-ratelimit
  namespace: apps
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: petstore
  limits:
    - name: default
      requests: 1000
      window: 60s
The controller should resolve effective policy for supported target kinds such
as Gateway, listener section, HTTPRoute, route rule, and eventually
Service for mesh. Policy status should report Accepted, Programmed, and
conflict conditions so resource owners can tell whether a policy is active.
Config-server remains valid for non-Kubernetes light-gateway deployments and
for migration bridges. For light-k8s-gateway, Kubernetes resources should be
the source of routing and policy intent.
TLS Secret Handling
TLS Secret material must not be written to persistent disk or normal
config-cache.
Preferred handling:
- The controller reads referenced TLS Secret objects, validates references and ReferenceGrant requirements, and distributes certificate material to proxies through the authenticated snapshot channel.
- Proxies hold certificate material in memory and update Pingora TLS state without persisting private keys.
- If Pingora integration requires file paths for an early milestone, write temporary files only to an emptyDir mounted with medium: Memory, under a path such as /var/run/light-k8s-gateway/tls.
Never copy TLS private keys into config-server, config-cache, persistent
volumes, image layers, or logs.
Endpoint Abstraction
light-pingora should not need to know whether endpoints came from Kubernetes,
direct-registry.yml, controller-rs discovery, or a static config file. Add a
shared endpoint abstraction such as:
UpstreamCluster
name
protocol
tls settings
load-balancing policy
EndpointSet
endpoint address
port
health/ready state
metadata
light-k8s-gateway-controller translates Service and EndpointSlice objects
into this shape. Existing Light discovery paths can translate direct registry
and portal-registry results into the same shape. The Pingora route-table module
then selects an UpstreamCluster without carrying Kubernetes-specific logic.
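A minimal sketch of that shared shape, assuming the endpoint set is carried inside the cluster and that selection is plain round-robin over ready endpoints. The type layout, the `ready` flag, and the `pick` helper are illustrative assumptions, not the final light-pingora types.

```rust
/// One resolved backend address, regardless of where it came from.
#[derive(Debug, Clone, PartialEq)]
pub struct Endpoint {
    pub address: String,
    pub port: u16,
    /// Health/ready state as reported by the discovery source.
    pub ready: bool,
}

/// Source-agnostic upstream: Kubernetes, direct registry, or static.
#[derive(Debug, Clone, PartialEq)]
pub struct UpstreamCluster {
    pub name: String,
    pub protocol: String,
    pub endpoints: Vec<Endpoint>,
}

impl UpstreamCluster {
    /// Endpoints that are currently eligible for traffic.
    pub fn ready_endpoints(&self) -> Vec<&Endpoint> {
        self.endpoints.iter().filter(|e| e.ready).collect()
    }

    /// Round-robin over ready endpoints; `counter` is any
    /// monotonically increasing request counter.
    pub fn pick(&self, counter: usize) -> Option<&Endpoint> {
        let ready = self.ready_endpoints();
        if ready.is_empty() {
            None
        } else {
            Some(ready[counter % ready.len()])
        }
    }
}
```

The route-table module only sees `UpstreamCluster`, so an EndpointSlice translator and a direct-registry translator can be swapped without touching request handling.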
East/West Mesh Model
Gateway API mesh support uses a different binding model. Routes attach directly
to Service resources:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: petstore-policy
  namespace: apps
spec:
  parentRefs:
    - group: core
      kind: Service
      name: petstore
      port: 8080
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1
      backendRefs:
        - name: petstore-v2
          port: 8080
          weight: 10
        - name: petstore-v1
          port: 8080
          weight: 90
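The 10/90 split in this route can be implemented as a weighted pick over the backendRefs. The sketch below is one possible approach, not the light-pingora implementation: `BackendRef` and `pick_weighted` are hypothetical names, and the caller is assumed to supply `roll` from a random source or a request counter.

```rust
/// Hypothetical compiled backendRef with its Gateway API weight.
pub struct BackendRef {
    pub name: String,
    pub port: u16,
    pub weight: u32,
}

/// Map `roll` onto the cumulative weight ranges. Returns None when
/// there are no backends or every weight is zero.
pub fn pick_weighted(backends: &[BackendRef], roll: u64) -> Option<&BackendRef> {
    let total: u64 = backends.iter().map(|b| u64::from(b.weight)).sum();
    if total == 0 {
        return None;
    }
    let mut point = roll % total;
    for b in backends {
        let w = u64::from(b.weight);
        if point < w {
            return Some(b);
        }
        point -= w;
    }
    None
}
```

For the route above, rolls 0 through 9 out of every 100 land on petstore-v2 and the remaining 90 land on petstore-v1.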
The runtime semantics are:
- If no route attaches to a Service, default mesh behavior forwards to the Service backend.
- If routes attach and the request matches at least one route, the selected route backendRefs determine the destination.
- If routes attach and no route matches, reject the request.
- Same-namespace routes are producer routes and affect all clients.
- Different-namespace routes are consumer routes and affect clients in the route namespace.
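The first three rules above reduce to a small decision function. This sketch assumes a simplified route with only a path prefix, matched with a plain `starts_with`; `AttachedRoute`, `MeshDecision`, and `decide` are hypothetical names, and producer/consumer scoping is left out.

```rust
/// Hypothetical Service-attached route, reduced to a path prefix.
pub struct AttachedRoute {
    pub name: String,
    pub path_prefix: String,
}

#[derive(Debug, PartialEq)]
pub enum MeshDecision<'a> {
    /// No route attaches to the Service: default forwarding.
    ForwardToService,
    /// A route matched: its backendRefs choose the destination.
    UseRoute(&'a str),
    /// Routes attach but none matched: the request is rejected.
    Reject,
}

pub fn decide<'a>(attached: &'a [AttachedRoute], path: &str) -> MeshDecision<'a> {
    if attached.is_empty() {
        return MeshDecision::ForwardToService;
    }
    match attached
        .iter()
        .find(|r| path.starts_with(r.path_prefix.as_str()))
    {
        Some(r) => MeshDecision::UseRoute(r.name.as_str()),
        None => MeshDecision::Reject,
    }
}
```

The important case is the third one: once any route attaches to a Service, unmatched requests no longer fall through to the default Service backend.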
The current light-gateway can proxy service-to-service calls explicitly, but
it does not transparently intercept traffic to Kubernetes Service frontends.
That means a real mesh implementation needs a data-plane attachment model, not
only a route compiler.
Recommended mesh milestones:
- Mesh M0: compile Service-attached HTTPRoute resources and expose the effective route table through logs, module registry, and status. This proves the control-plane model without traffic interception.
- Mesh M1: support an explicit in-cluster egress gateway mode. Workloads call light-gateway directly or through a configured HTTP proxy. This is useful operationally, but not advertised as transparent GAMMA conformance.
- Mesh M2: add sidecar mode. Inject a lightweight light-gateway sidecar, or preferably a smaller light-sidecar or light-mesh binary using the same light-pingora route-table module. Redirect outbound HTTP traffic to the sidecar, identify the original Service destination, apply Service-attached routes, then proxy to selected endpoints.
- Mesh M3: add node-local or ambient mode. Use a DaemonSet plus CNI or eBPF redirection to intercept Service traffic without per-pod sidecars. This has a larger operational surface and should follow sidecar validation.
Sidecar mode is the shortest path because the current light-gateway already
has sidecar concepts such as sidecar.egressIngressIndicator, token handling
for outbound calls, and service discovery. The production packaging should
still be a dedicated sidecar or mesh product if the target is transparent
east/west traffic. The missing pieces are transparent redirect, original
destination detection, and a Service-oriented route table.
Mesh Data-Plane Requirements
To proxy east/west traffic with GAMMA semantics, add:
- A mesh route compiler that watches HTTPRoute, Service, EndpointSlice, ReferenceGrant, and namespaces.
- A Service frontend index keyed by namespace, Service name, port, DNS name, ClusterIP, and possibly original destination socket address.
- Producer and consumer route merge logic that follows Gateway API mesh rules.
- Request matching and rejection behavior for Services with attached routes.
- Backend endpoint selection from the selected route's backendRefs.
- A sidecar or node-local interception mechanism that can recover the original destination Service before the request is proxied.
- Policy hooks for Light security, token, and observability handlers.
- Mesh conformance test wiring with --supported-features=Mesh.
Do not map GAMMA Service routes to Gateway listeners. In mesh mode, the
Service is the parent object, and GatewayClass/Gateway are intentionally not
part of the route binding.
Coexistence With Existing Light Runtime
Keep these layers distinct:
- Gateway API resources express portable Kubernetes routing intent.
- light-pingora route tables execute the selected routing intent.
- handler.yml and Light module config apply Light-specific behavior.
- light-gateway continues to serve the current microgateway, sidecar, BFF, API, agent, MCP, and LLM provider use cases.
- light-k8s-gateway owns Kubernetes Gateway API ingress behavior.
- portal-registry and direct-registry.yml remain available for non-Kubernetes targets and existing Light service discovery.
- Config-server remains the source for non-Kubernetes light-gateway policy and migration bridges. Kubernetes-native light-k8s-gateway routing and policy intent should come from Gateway API resources and Light policy CRDs.
For ingress, Kubernetes Service and EndpointSlice should be the primary
backend source. For non-Kubernetes or hybrid targets, add an explicit
implementation-specific backend policy instead of overloading portable
backendRefs.
Status And Conformance
Gateway API users rely on status. The controller must update:
- GatewayClass.status.conditions.
- Gateway.status.addresses, listener conditions, and supported features.
- HTTPRoute.status.parents for every parentRef.
- Light policy CRD status, including Accepted, Programmed, and conflict conditions.
Only the active leader should update Kubernetes status. Controller replicas use
Kubernetes Lease leader election to avoid API-server write races and status
flapping.
Minimum conformance gates:
go test ./conformance -run TestConformance -args \
--gateway-class=light-k8s-gateway \
--supported-features=Gateway,HTTPRoute
Mesh conformance gate:
go test ./conformance -run TestConformance -args \
--supported-features=Mesh
When ingress and mesh are both enabled:
go test ./conformance -run TestConformance -args \
--gateway-class=light-k8s-gateway \
--supported-features=Mesh,Gateway,HTTPRoute
Observability And Telemetry
light-k8s-gateway must be operable as a primary ingress controller. Provide
Prometheus metrics, OpenTelemetry traces, and structured logs from day 1.
Proxy metrics:
- Request count tagged by Gateway, listener, route namespace, HTTPRoute, backend Service, status code, and status class.
- Request duration and upstream duration histograms.
- Active connections and in-flight requests.
- Upstream connection errors, retries, timeouts, and circuit-breaker opens.
- Snapshot version, snapshot age, and snapshot apply errors.
Controller metrics:
- Reconcile count, duration, and error count by resource kind.
- Kubernetes watch reconnect count and API-server request errors.
- Status update count and conflict count.
- Leader-election state.
- Snapshot generation count, size, and publish errors.
Tracing:
- Propagate W3C traceparent and existing Light correlation IDs.
- Create ingress spans tagged with Gateway API resource identity: gateway.namespace, gateway.name, listener.name, route.namespace, route.name, route.rule, backend.service.namespace, and backend.service.name.
- Record upstream selection, retries, and policy decisions as span events without logging tokens, private keys, or sensitive headers.
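Propagating traceparent starts with validating the incoming header. The sketch below parses the version 00 format from the W3C Trace Context specification (version, 32-hex trace-id, 16-hex parent-id, 2-hex flags). `TraceParent` and `parse_traceparent` are hypothetical names, and rejecting every version other than 00 is a simplification made for this example.

```rust
/// Parsed fields of a W3C `traceparent` header (version 00).
#[derive(Debug, PartialEq)]
pub struct TraceParent {
    pub trace_id: String,
    pub parent_id: String,
    pub sampled: bool,
}

/// True when `s` is exactly `len` lowercase hex characters.
fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

pub fn parse_traceparent(header: &str) -> Option<TraceParent> {
    let parts: Vec<&str> = header.trim().split('-').collect();
    if parts.len() != 4 {
        return None;
    }
    let (version, trace_id, parent_id, flags) = (parts[0], parts[1], parts[2], parts[3]);
    // Simplification: only version 00 is accepted here.
    if version != "00" {
        return None;
    }
    if !is_lower_hex(trace_id, 32) || !is_lower_hex(parent_id, 16) || !is_lower_hex(flags, 2) {
        return None;
    }
    // All-zero trace-id and parent-id are invalid per the spec.
    if trace_id.bytes().all(|b| b == b'0') || parent_id.bytes().all(|b| b == b'0') {
        return None;
    }
    let flag_bits = u8::from_str_radix(flags, 16).ok()?;
    Some(TraceParent {
        trace_id: trace_id.to_string(),
        parent_id: parent_id.to_string(),
        sampled: (flag_bits & 0x01) == 0x01,
    })
}
```

A proxy that receives an invalid header would typically start a new trace instead of failing the request.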
Migration From NGINX Or Traefik
Recommended customer migration:
- Install light-k8s-gateway with a new GatewayClass named light-k8s-gateway.
- Keep NGINX or Traefik running for existing Ingress or Gateway API classes.
- Create equivalent Gateway and HTTPRoute resources for one host.
- Validate status, route behavior, TLS, logs, metrics, and backend health.
- Move DNS or load balancer traffic for that host to light-k8s-gateway.
- Repeat host by host.
- Remove the old ingress controller only after route parity and operational dashboards are in place.
An optional Ingress-to-HTTPRoute converter can help customers migrate, but it should be a tool, not part of the runtime request path.
Open Questions
- What is the first supported east/west deployment model: current light-gateway as explicit egress gateway, a dedicated sidecar, or ambient?
- How much of the current server.yml listener contract should remain in light-runtime versus moving Gateway API listener binding into light-pingora?
- Should the controller-to-proxy snapshot protocol stay as a small internal gRPC API, or should it adopt an xDS-compatible model early?
- Which Light policy CRDs are required for the MVP: auth, rate limit, header policy, request size, token, or a generic handler-chain policy?
- What is the exact UpstreamCluster health model shared by Kubernetes EndpointSlice, controller-rs discovery, and direct registry sources?
Suggested Implementation Order
- Create apps/light-k8s-gateway-controller and apps/light-k8s-gateway-proxy with separate ServiceAccounts and RBAC.
- Add controller leader election with Kubernetes Lease objects.
- Define GatewayApiSnapshot, UpstreamCluster, EndpointSet, and the authenticated controller-to-proxy snapshot stream.
- Implement proxy-side snapshot loading through ConfigManager.
- Implement GatewayClass, Gateway, HTTPRoute, ReferenceGrant, Service, EndpointSlice, Secret, and Namespace watches.
- Implement attachment validation, policy validation, status updates, and snapshot publishing.
- Add a light-pingora Gateway API route table and route HTTP traffic to Kubernetes Service endpoints.
- Add memory-only TLS Secret handling and terminated HTTPS listener support for the common 80/443 ingress case.
- Add initial Light policy CRDs using Gateway API policy attachment.
- Add Prometheus metrics, OpenTelemetry tracing, and structured logs for the controller and proxy.
- Run HTTPRoute Gateway conformance and close gaps.
- Add multi-listener runtime support.
- Add mesh route compilation for Service-attached HTTPRoute resources.
- Add explicit egress gateway mode for early east/west use.
- Add sidecar interception and run mesh conformance.
- Evaluate ambient/node-local mode after sidecar behavior is proven.
Light-Workflow
light-workflow is the workflow execution service for Agentic Workflow
documents.
It loads workflow definitions, executes workflow tasks, integrates with
light-rule for rule-backed checks, and exposes workflow execution APIs.
Key Dependencies
- workflow-core
- light-rule
- axum
- sqlx
- reqwest
Role
light-workflow is the runtime service that turns workflow specifications into
long-running execution state. It is used by agentic flows, human-in-the-loop
orchestration, and integration-test style automation.
Start Workflow
This page describes the local workflow start path used to test light-workflow
from light-portal.
light-workflow does not create workflow definitions and it is not the public
entry point for starting a workflow. For local testing, create the workflow
definition through the portal workflow service, then start it through the
startWorkflow command. The running light-workflow process consumes the
workflow start event from the portal database and executes the workflow tasks.
Runtime Path
The local start flow is:
- Create or update a workflow definition in light-portal.
- Start the workflow with the workflow service startWorkflow command.
- workflow-command writes a workflow started event into the event store and outbox tables.
- light-workflow polls the same database, loads the definition by wfDefId, creates the process and task records, and executes the workflow.
For this reason, the DATABASE_URL used by light-workflow must point to the
same database used by the local portal stack.
Prerequisites
Start the local portal stack first. For the Rust local stack, use the normal
portal-config-local deployment command from the portal-config-loc checkout:
./scripts/deploy-local.sh pg rust
Make sure the workflow command and query services are available in that stack.
The workflow definition pages in portal-view depend on those services.
Then build light-workflow:
cd /home/steve/workspace/light-fabric/apps/light-workflow
cargo build -p light-workflow --locked
Start light-workflow Locally
Create light-workflow.env in
/home/steve/workspace/light-fabric/apps/light-workflow:
DATABASE_URL=postgres://postgres:secret@localhost:5432/configserver
LIGHT_WORKFLOW_HTTP_ADDR=0.0.0.0:8080
RUST_LOG=light_workflow=debug,info
WORKFLOW_LOG_ANSI=false
Start the service with the debug binary:
./run.sh --debug-binary
The script loads light-workflow.env automatically. If you do not use the env
file, export the values before running the script:
export DATABASE_URL=postgres://postgres:secret@localhost:5432/configserver
export LIGHT_WORKFLOW_HTTP_ADDR=0.0.0.0:8080
export RUST_LOG=light_workflow=debug,info
export WORKFLOW_LOG_ANSI=false
./run.sh --debug-binary
Do not set the variables on separate shell lines without export. That creates
shell variables only for the current shell and run.sh will not receive them.
Recommended UI Test
The easiest local test is to create the definition in the portal UI and start it from the workflow editor test action.
- Open light-portal.
- Go to the workflow definition page.
- Create a workflow definition.
- Paste one of the example workflow YAML files from: /home/steve/workspace/light-fabric/apps/light-workflow/examples
- Save the definition.
- Open the definition in the workflow editor.
- Use the editor test run action with a JSON input object.
For the basic example, use
apps/light-workflow/examples/simple-set-assert.yaml and this input:
{
"applicantId": "APP-001"
}
The editor test action is preferred for local testing because it parses the
input text as JSON and sends input as an object.
The table run button opens the generic startWorkflow form. If using that path,
make sure the request sends input as a JSON object, not as a string. If the
input is submitted as a string, the workflow command may accept the request but
the runtime context will not have the expected object fields.
Start with Postman or curl
You can also start the workflow directly through the portal command endpoint.
Send the request to the same light-gateway or light-portal host used by the UI.
Do not send this request to light-workflow; light-workflow is the executor,
not the command API.
The command envelope is:
{
"host": "lightapi.net",
"service": "workflow",
"action": "startWorkflow",
"version": "0.1.0",
"data": {
"hostId": "<host-id>",
"wfDefId": "<workflow-definition-id>",
"input": {
"applicantId": "APP-001"
}
}
}
Example curl shape:
curl -k -X POST "https://localhost:8443/portal/command" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <access-token>" \
-d '{
"host": "lightapi.net",
"service": "workflow",
"action": "startWorkflow",
"version": "0.1.0",
"data": {
"hostId": "<host-id>",
"wfDefId": "<workflow-definition-id>",
"input": {
"applicantId": "APP-001"
}
}
}'
If your local UI uses a session cookie instead of a bearer token, use Postman with the same authenticated session or copy the current local authorization header from the browser request.
Creating the Definition by API
For most local tests, create the definition in the UI. It is easier because the YAML can be pasted directly.
If you create the definition through the command API, send a workflow definition
command first and use the returned definition id as wfDefId in the
startWorkflow command.
The command shape is:
{
"host": "lightapi.net",
"service": "workflow",
"action": "createWfDefinition",
"version": "0.1.0",
"data": {
"hostId": "<host-id>",
"namespace": "light-portal",
"name": "simple-set-assert",
"version": "1.0.0",
"definition": "<workflow-yaml-as-json-string>"
}
}
When calling this from Postman, remember that the YAML definition is a JSON string field. Newlines must be escaped correctly by the JSON editor or sent by a tool that can build the JSON body safely.
Example Workflows
The current examples are in
/home/steve/workspace/light-fabric/apps/light-workflow/examples:
| File | Purpose | Input |
|---|---|---|
| simple-set-assert.yaml | Basic local smoke test with no external dependency. | { "applicantId": "APP-001" } |
| http-risk-decision.yaml | Calls a risk evaluation HTTP endpoint and branches on the result. | { "applicantId": "APP-001", "loanAmount": 25000, "creditScore": 720 } |
| human-approval.yaml | Creates a human approval style workflow and waits for a later decision. | { "requestId": "REQ-001", "summary": "Approve test request" } |
| insurance-claim-rest-v1.yaml | Complete product demo with direct HTTP API orchestration, native agent tasks, and human tasks. | See examples/README.md. |
| insurance-claim-mcp-v1.yaml | Complete product demo with gateway MCP tool orchestration, native agent tasks, and human tasks. | See examples/README.md. |
| insurance-claim-headless-v1.yaml | Headless insurance-claim regression workflow with deterministic agent outputs and no human-task pauses. | See examples/README.md. |
Start with simple-set-assert.yaml. It is the best smoke test because it does
not require another service.
For a complete multi-agent product demo, use the insurance claim suite in
apps/light-workflow/examples. The product walkthrough is
Insurance Claim Agentic Workflow, and
the operational runbook is in apps/light-workflow/examples/README.md.
For http-risk-decision.yaml, start a local mock service for the URL used by
the definition. When light-workflow runs natively with run.sh,
127.0.0.1 means the host machine. When light-workflow runs in Docker,
127.0.0.1 means the container itself, so change the workflow endpoint to a
Compose service name or host.docker.internal.
For human-approval.yaml, the first run should create a waiting task. Completing
that flow requires the worklist or task-completion API path.
Verify Execution
Watch the light-workflow log after sending startWorkflow. A successful run
should show that the start event was received, the first task was initialized,
and the executor picked up task work.
Useful database checks:
select wf_def_id, namespace, name, version
from wf_definition_t
order by update_ts desc
limit 5;
select process_id, wf_instance_id, status_code, context_data
from process_info_t
order by started_ts desc
limit 5;
select wf_task_id, task_type, status_code, task_output
from task_info_t
order by started_ts desc
limit 10;
select c_offset, event_type, aggregate_id, payload
from outbox_message_t
order by c_offset desc
limit 10;
If outbox_message_t has the workflow started event but no process or task
records appear, check that light-workflow is running against the same
DATABASE_URL as the portal stack.
Troubleshooting
- DATABASE_URL is required: Put DATABASE_URL in light-workflow.env, export it before running run.sh, or put the assignment on the same command line as ./run.sh.
- function make_interval(mins => bigint) does not exist: Rebuild and restart light-workflow. The runtime query must cast the retry value to int before passing it to make_interval.
- Workflow definition list is empty in the UI: Confirm the workflow query service is running and the local stack is using the jar or binary that contains the workflow definition owner-scope fix. Some local stacks run copied service artifacts, so rebuilding a source checkout is not enough unless the deployed artifact is refreshed.
- No tasks are created after starting the workflow: Confirm the startWorkflow command wrote a workflow started event to the outbox table, and confirm light-workflow points to that same database.
- The workflow input is missing fields: Confirm input was submitted as a JSON object. A string that contains JSON text is not the same as a JSON object in the workflow context.
Native Agent Call
Status
Recommended platform boundary.
call: agent is currently a native light-workflow task. It does not invoke a
running light-agent container. The workflow engine loads the portal agent
definition, selected skills, and skill tools from the database, builds a bounded
model prompt, calls the configured model provider directly, validates the JSON
output, and continues the workflow.
Containerized light-agent remains the interactive agent runtime. It serves
chat clients, keeps session memory, loads its effective catalog, and calls MCP
tools through light-gateway.
This page defines how both models should coexist in an enterprise platform.
Problem
The platform has two useful agent execution models:
- native agent tasks inside light-workflow
- containerized light-agent services
Both can use the same portal-authored concepts: agent definitions, skills, tools, workflow mappings, and gateway-routed API capabilities. They should not be treated as interchangeable runtime paths.
The main design question is whether a workflow should keep executing
call: agent natively or call a containerized light-agent service for every
agent step.
Current Behavior
When a workflow contains:
do:
  - review-offer:
      call: agent
      with:
        agent: com.networknt.agent.offer-1.0.0
        skill: offer-decision
        input:
          customerId: "${ .customerId }"
          profile: "${ .profile }"
        outputSchemaRef: offerDecision
light-workflow handles the task itself:
- Resolve the agent by agent_def_id or agent API name.
- Load active skills assigned to the agent from agent_skill_t.
- If a skill is specified, narrow the prompt to that skill.
- Load skill tool metadata from skill_tool_t, tool_t, and tool_param_t.
- Build a bounded prompt from workflow context, skill instructions, optional task instructions, and the expected output schema.
- Call the model provider configured on the portal agent definition.
- Parse and validate the model response as JSON.
- Return the structured output to the workflow context.
The native task does not:
- call the light-agent HTTP or WebSocket endpoint,
- use light-agent session memory,
- let the model run a dynamic gateway tool loop,
- execute tool calls from the model response.
Skill tools are included as guidance and future-routing context. In the current
runtime phase, API orchestration remains explicit workflow tasks such as
call: http, call: mcp, assert, switch, and ask.
Native Agent Tasks
Native agent tasks are best for bounded reasoning where the workflow remains the system of record.
Good examples:
- classify a request,
- normalize user-provided input,
- summarize API results,
- choose between workflow branches,
- draft a customer-facing explanation,
- assess whether human approval is required,
- produce structured output that must match a schema.
Benefits:
- Strong auditability: workflow records input, output, status, retry, and failure state.
- Deterministic orchestration: API calls, approvals, assertions, and retries stay in the workflow definition.
- Easier governance: output schemas and workflow-owned context constrain the model.
- Lower operational coupling: the task does not depend on a separate agent service instance being healthy.
- Better replay and diagnostics: the workflow engine owns the execution state.
Tradeoffs:
- It is not the full light-agent runtime.
- It does not use chat session history or Hindsight memory.
- It can duplicate some prompt/catalog handling from light-agent.
- Model provider scaling is tied to light-workflow.
- Dynamic tool selection is intentionally limited.
Containerized Agents
Containerized agents are independently deployed light-agent services.
They are best for interactive or autonomous agent behavior where the agent runtime itself is the product surface.
Good examples:
- user-facing chat agents,
- long-lived specialist agents,
- agents that need session memory,
- agents that should cache and refresh their effective catalog locally,
- agents that need a dynamic tools/list and tools/call loop through light-gateway,
- agents that must scale independently from workflow execution.
Benefits:
- Real agent runtime behavior: memory, chat sessions, local catalog cache, and gateway tool execution.
- Independent deployment, scaling, health checks, and versioning.
- Clear service identity through controller registration.
- Better fit for interactive clients and long-running conversational work.
Tradeoffs:
- Harder workflow audit if the agent internally decides which APIs to call.
- More distributed failure modes: network errors, timeouts, retries, and partial progress.
- Requires strict request and response contracts.
- Requires idempotency, correlation IDs, auth scopes, and timeout policy.
- Can make the workflow less deterministic if the agent is allowed to run an open-ended tool loop.
Recommendation
Keep the mixed approach, but make the boundary explicit.
Use native call: agent for bounded reasoning inside workflow-controlled
processes. Use workflow tasks and subworkflows for API orchestration. Use
containerized light-agent for interactive chat and specialist runtime agents.
The recommended enterprise pattern is:
main workflow
  -> call: mcp or call: http for deterministic API access
  -> run/start subworkflow for reusable skill-backed API orchestration
  -> call: agent for bounded reasoning over workflow-owned context
  -> ask/assert/switch/retry/audit in workflow

chat client
  -> containerized light-agent
      -> effective catalog from portal-query
      -> tools/list and tools/call through light-gateway
      -> session memory and chat history
Do not route every workflow agent step through a containerized agent by default. That would move too much process control into agent services and make enterprise audit, replay, and approval harder.
Do not remove native call: agent. It is the right primitive for workflow-owned
reasoning steps.
Skill To Workflow Pattern
For skills that require API orchestration, prefer mapping the skill to a workflow or subworkflow.
Example:
skill_t: customer-profile-review
-> skill_workflow_t: customer-profile-enrichment-v1
-> wf_definition_t: workflow that calls gateway MCP tools
In that pattern:
- the skill describes when and why to use the capability,
- the workflow owns the API call sequence,
- light-gateway executes MCP tool calls,
- native call: agent can summarize or classify the results,
- the workflow remains the audit boundary.
This is the preferred model for enterprise API access because it prevents an agent from inventing an unreviewed process path.
Demo Guidance
The current demos should be described precisely:
- insurance-claim-rest-v1.yaml shows workflow-owned API orchestration with direct HTTP calls plus native agent tasks for bounded reasoning.
- insurance-claim-mcp-v1.yaml shows the same business flow through light-gateway MCP tools plus native agent tasks for bounded reasoning.
- insurance-claim-headless-v1.yaml shows the deterministic regression path without human-task pauses.
The demos do not currently prove that light-workflow invokes the
containerized light-agent services. That can be added later as an explicit
runtime integration if the platform needs it.
Future Containerized-Agent Invocation
If a workflow needs to call containerized light-agent services in the future,
do not silently change the meaning of native call: agent. Add an explicit
mode or task contract so operators can see which runtime path is used.
Possible options:
call: agent
with:
  mode: native
  agent: com.networknt.agent.offer-1.0.0
  skill: offer-decision

call: agent
with:
  mode: service
  agent: com.networknt.agent.offer-1.0.0
  skill: offer-decision
  timeout: PT30S
or a separate task type:
call: agent-service
with:
  serviceId: com.networknt.agent.offer-1.0.0
  envTag: dev
  skill: offer-decision
The service-call contract must require:
- explicit timeout and retry policy,
- idempotency key for side-effecting work,
- correlation and workflow instance headers,
- output schema validation,
- clear failure mapping to workflow status,
- portal/gateway authorization policy,
- observability across workflow, gateway, controller, and agent logs.
Decision Matrix
| Need | Preferred runtime |
|---|---|
| Deterministic API sequence | Workflow task or subworkflow |
| Gateway-routed API access | call: mcp through light-gateway |
| Bounded model reasoning | Native call: agent |
| Human approval or form input | ask task |
| Policy assertion | assert, switch, or rule task |
| Interactive chat | Containerized light-agent |
| Session memory | Containerized light-agent |
| Dynamic tool loop | Containerized light-agent |
| Enterprise audit and replay | Workflow-owned task |
Long-Term Direction
The platform should keep both execution models:
- Native agent tasks for workflow-owned reasoning.
- Containerized agents for interactive, memory-backed, independently scaled agent services.
The enterprise control rule is simple: workflows own durable process state and auditable API orchestration; agents provide bounded reasoning or interactive specialist behavior within contracts defined by the platform.
Sandbox Execution
Status
Proposed product design.
light-workflow should support sandbox-backed execution for tenant-authored and
automation-heavy workflows. The workflow engine remains the durable
orchestrator on the host. Sandboxes execute selected effectful tasks, or a
bounded group of related tasks, according to an approved execution security
profile.
Cube Sandbox is a good candidate provider for this boundary because it is designed for fast VM-backed sandbox creation, hardware isolation, and network policy enforcement. The design below treats Cube Sandbox as a pluggable provider, not as a hard dependency in the workflow DSL.
Problem
Workflows can be created by tenants and can eventually include tasks that run commands, scripts, containers, model calls, MCP tools, browser automation, or release automation. Those capabilities are useful, but they are also the highest-risk part of the workflow runtime.
The platform needs a way to say:
- whether a workflow is allowed to use sandbox execution,
- which tasks must be sent to a sandbox,
- whether tasks should share a sandbox session,
- which network, filesystem, image, command, and secret policies apply,
- how release-style workflows can keep a workspace and cache across steps without moving the workflow orchestrator itself into the sandbox.
Recommendation
Add an execution security profile to the workflow definition. The profile is a
request, not a final authority. At runtime, light-workflow computes an
effective profile from:
- workflow definition metadata,
- task metadata,
- tenant policy,
- service policy,
- operator-approved profile definitions,
- deployment defaults.
The workflow engine should stay on the host and continue to own task claiming, context loading, branching, retries, persistence, and audit. Sandbox execution should be delegated to a sandbox runner for the tasks that need isolation.
For release workflows, use one sandbox session per workflow instance by default. That allows checkout state, build caches, generated artifacts, and intermediate files to survive across related build and test tasks. Use a fresh task sandbox for high-privilege publish or signing tasks if they require release tokens or signing material.
First Schema Surface
Use existing metadata fields first so the design can be implemented without an
immediate workflow-core schema break. WorkflowDefinitionMetadata already has
document.metadata, and every task has metadata through
TaskDefinitionFields.
Workflow-level example:
document:
  dsl: "1.0.3"
  namespace: release
  name: light-fabric-polyrepo-release
  version: "1.0.0"
  metadata:
    lightWorkflow:
      security:
        executionProfile: release-sandbox
        sandbox:
          mode: workflow-session
          provider: cubesandbox
          template: light-fabric-release
          reuse: same-workflow-instance
          ttl: PT2H
          idleTimeout: PT10M
Task-level example:
do:
  - publish-github-release:
      run:
        shell:
          command: ./release.sh
          arguments:
            - "${ .version }"
      metadata:
        lightWorkflow:
          security:
            sandbox:
              mode: per-task
              reason: release-token-isolation
            secrets:
              - github-release-token
Later, the runtime can normalize a first-class security field into the same
internal policy object:
security:
  executionProfile: release-sandbox
  sandbox:
    mode: workflow-session
    provider: cubesandbox
    template: light-fabric-release
Execution Modes
none
Trusted workflows run in the host executor. This mode should be limited to internal workflows or workflows that contain no effectful, untrusted tasks.
effectful-tasks
The default tenant mode. Pure orchestration tasks stay on the host, while effectful tasks are delegated to sandbox execution. Examples include shell, script, container, browser automation, external MCP servers, and filesystem work.
workflow-session
One sandbox session is created for a workflow instance and reused by approved tasks in that same instance. This is the right default for build, test, and release workflows because the sandbox can keep cloned repositories, dependency caches, build output, and temporary files across steps.
per-task
Each sandboxed task gets a fresh sandbox. This is the strongest isolation mode and should be used for untrusted commands, tasks with separate privilege levels, and tasks that receive high-value secrets.
Task Routing
Host execution should remain the default for control-plane tasks:
ask
assert
set
switch
workflow context merge
task claiming and completion
process state persistence
Sandbox execution should be required for high-risk task families:
run.shell
run.script
run.container
browser automation
tenant-provided code
filesystem mutation outside workflow context
external MCP server processes
command-line tools
release build and package commands
Provider calls need policy-based routing:
call.http     host or sandbox, depending on egress policy
call.jsonrpc  host or sandbox, depending on egress policy
call.mcp      host for approved gateway endpoints, sandbox for external servers
call.agent    host for bounded native model calls, sandbox when tools or code execution are enabled
call.rule     host unless a rule profile explicitly requires isolation
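The three routing lists above could be captured in a small classifier. This is a minimal sketch, not the light-workflow implementation: the names Route and route_task are hypothetical, the dotted task-type strings are taken from the lists above, and sending unknown task types to the sandbox is an assumption of this sketch.

```rust
/// Where a task runs. `PolicyDependent` means the effective policy decides.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Route {
    Host,
    Sandbox,
    PolicyDependent,
}

/// Classify a task type string such as "run.shell" (hypothetical naming).
pub fn route_task(task_type: &str) -> Route {
    match task_type {
        // Control-plane tasks stay on the host.
        "ask" | "assert" | "set" | "switch" => Route::Host,
        // High-risk task families must run in a sandbox.
        "run.shell" | "run.script" | "run.container" => Route::Sandbox,
        // Provider calls are routed by egress, gateway, or rule policy.
        "call.http" | "call.jsonrpc" | "call.mcp" | "call.agent" | "call.rule" => {
            Route::PolicyDependent
        }
        // Assumption: anything unrecognized fails closed into the sandbox.
        _ => Route::Sandbox,
    }
}
```

A caller would resolve PolicyDependent against the effective policy before dispatching the task.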
A task may request stricter isolation than the workflow profile, but it must
not weaken the effective profile. For example, a workflow can run in
workflow-session mode while a publish task requests per-task mode. A task
inside a tenant workflow cannot request none if the tenant ceiling requires
sandbox execution.
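The "stricter only" rule can be sketched as an ordered enum plus a comparison. The names below are hypothetical, and the relative order of effectful-tasks and workflow-session is an assumption; the text above only states that none is the weakest mode and per-task the strongest.

```rust
/// Sandbox modes ordered from weakest to strongest isolation.
/// Derived ordering follows declaration order; the order of the two
/// middle variants is an assumption of this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum SandboxMode {
    NoSandbox, // the "none" mode
    EffectfulTasks,
    WorkflowSession,
    PerTask,
}

/// Resolve the mode for one task. A task may tighten the effective
/// workflow mode but may never loosen it.
pub fn resolve_task_mode(
    workflow_mode: SandboxMode,
    task_request: Option<SandboxMode>,
) -> Result<SandboxMode, String> {
    match task_request {
        // No task-level request: inherit the workflow mode.
        None => Ok(workflow_mode),
        // Equal or stricter requests are accepted.
        Some(requested) if requested >= workflow_mode => Ok(requested),
        // Weaker requests are rejected rather than silently clamped.
        Some(requested) => Err(format!(
            "task requested {:?}, which is weaker than the effective mode {:?}",
            requested, workflow_mode
        )),
    }
}
```

Rejecting a weaker request, instead of clamping it, keeps the mismatch visible to the workflow author; that choice is also an assumption.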
Effective Policy
The runtime should compute and persist an effective policy for each workflow instance:
{
  "requestedProfile": "release-sandbox",
  "effectiveProfile": "release-sandbox",
  "sandboxMode": "workflow-session",
  "provider": "cubesandbox",
  "template": "light-fabric-release",
  "networkPolicy": "release-egress",
  "secretPolicy": "task-scoped",
  "approvedTaskTypes": ["run.shell", "call.http", "call.mcp"],
  "policyVersion": 7
}
This policy should be written into process audit metadata so replay and incident review can prove which policy was active when the workflow ran.
Policy resolution rules:
- Tenant policy sets the maximum privilege a tenant can request.
- Service policy sets the maximum privilege light-workflow may grant in the current deployment.
- Workflow metadata requests a profile.
- Task metadata can request stricter handling.
- Runtime validation rejects unsupported or unapproved task/provider combinations before the task executes.
- Approval-required profile changes emit pending workflow-definition events and must not immediately publish an active workflow definition.
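One way to read the ceiling rules above is as a set intersection over approved task types. The sketch below assumes that privilege can be modeled as a set of task-type strings; the function names are hypothetical, and only the sandbox_policy_denied code comes from this page.

```rust
use std::collections::BTreeSet;

/// Keep only the requested task types that both ceilings allow.
pub fn effective_task_types<'a>(
    requested: &BTreeSet<&'a str>,
    tenant_ceiling: &BTreeSet<&'a str>,
    service_ceiling: &BTreeSet<&'a str>,
) -> BTreeSet<&'a str> {
    requested
        .iter()
        .copied()
        .filter(|t| tenant_ceiling.contains(t) && service_ceiling.contains(t))
        .collect()
}

/// Reject a task before it executes when its type is not approved.
pub fn validate_task(effective: &BTreeSet<&str>, task_type: &str) -> Result<(), String> {
    if effective.contains(task_type) {
        Ok(())
    } else {
        Err(format!("sandbox_policy_denied: {}", task_type))
    }
}
```

Because the result is an intersection, neither workflow nor task metadata can grant a task type that the tenant or service ceiling withholds.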
Sandbox Session Lifecycle
For workflow-session mode:
- Claim a task on the host.
- Resolve the effective workflow security profile.
- Create or resume the sandbox session for this workflow instance.
- Mount or create an isolated workspace for the workflow instance.
- Send the task input, command specification, environment allowlist, and permitted secret handles to the sandbox runner.
- Stream logs and collect bounded output.
- Copy declared artifacts to a controlled artifact store.
- Return structured task output to light-workflow.
- Update task and process state on the host.
- Destroy the sandbox when the workflow completes, fails permanently, times out, or is cancelled.
The sandbox session id should be scoped to:
tenant id
workflow definition id and version
workflow instance id
effective profile version
requesting principal
Do not reuse one sandbox across tenants, workflow definitions, unrelated workflow instances, or different requesting principals.
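The scoping rule can be sketched as a composite key: a session is resumed only when every scoped field matches. The type names, field types, and in-memory registry below are hypothetical; a real runtime would persist this state.

```rust
use std::collections::HashMap;

/// Every field listed in the session scope above is part of the key.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct SandboxSessionKey {
    pub tenant_id: String,
    pub workflow_definition_id: String,
    pub workflow_version: String,
    pub workflow_instance_id: String,
    pub effective_profile_version: u32,
    pub requesting_principal: String,
}

/// In-memory registry used only for illustration.
#[derive(Default)]
pub struct SessionRegistry {
    sessions: HashMap<SandboxSessionKey, String>,
    next_id: u64,
}

impl SessionRegistry {
    /// Return the session id for this exact scope, creating one if needed.
    pub fn create_or_resume(&mut self, key: SandboxSessionKey) -> String {
        if let Some(existing) = self.sessions.get(&key) {
            return existing.clone();
        }
        self.next_id += 1;
        let id = format!("sbx-{}", self.next_id);
        self.sessions.insert(key, id.clone());
        id
    }
}
```

Changing any single field, for example the requesting principal, produces a different key and therefore a different sandbox session.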
Release Workflow Example
A Light-Fabric release workflow can use one sandbox session to release these repositories:
light-fabric
portal-service
controller-rs
light-example-rs
The host light-workflow process should still own the workflow instance. The
sandbox holds the release workspace:
light-workflow host
  - claims tasks
  - loads workflow context from Postgres
  - resolves policy
  - starts or resumes sandbox session
  - dispatches build/test/release commands
  - records task output, status, and audit

sandbox session
  - checks out repositories
  - runs tests and build scripts
  - stores dependency caches
  - produces release artifacts
  - exposes logs and declared artifacts
Recommended task grouping:
prepare workspace          workflow-session sandbox
checkout repositories      workflow-session sandbox
run unit tests             workflow-session sandbox
build docker images        workflow-session sandbox, if Docker or BuildKit is available in policy
package release artifacts  workflow-session sandbox
generate release notes     workflow-session sandbox
publish release            per-task sandbox or isolated publish worker
sign artifacts             per-task sandbox or external signing service
The normal build/test/package tasks can share the same sandbox because they belong to one workflow instance and benefit from shared workspace state. Publish and signing tasks should be isolated because they require stronger secrets and have irreversible external effects.
Secret Handling
The sandbox should never receive broad platform credentials. Secrets must be:
- referenced by logical name in workflow or task metadata,
- approved by the effective profile,
- injected only for the task that needs them,
- short-lived where the provider supports it,
- redacted from logs and task output,
- excluded from workflow context exports.
Release tokens should be task-scoped. For example, tests and builds do not need GitHub release credentials. The publish task can receive a short-lived release token in a fresh sandbox or through a separate publish worker.
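Two of the rules above, profile approval and log redaction, can be sketched with plain string handling. The function names and the [REDACTED] marker are hypothetical, and a production redactor would also need to handle encoded or partially printed secret values.

```rust
/// Secret names a task may receive: requested by the task and approved
/// by the effective profile. The order of `requested` is preserved.
pub fn permitted_secrets<'a>(requested: &[&'a str], approved: &[&str]) -> Vec<&'a str> {
    requested
        .iter()
        .copied()
        .filter(|name| approved.iter().any(|a| *a == *name))
        .collect()
}

/// Replace every occurrence of each secret value in a log line.
pub fn redact(line: &str, secret_values: &[&str]) -> String {
    let mut out = line.to_string();
    for value in secret_values {
        // Skip empty values: replacing "" would insert the marker everywhere.
        if !value.is_empty() {
            out = out.replace(*value, "[REDACTED]");
        }
    }
    out
}
```

Only secret names, never values, would then appear in audit records, while values are scrubbed from any log line before it leaves the runner.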
Network Policy
Each profile should define egress explicitly. A release profile might allow:
github.com
api.github.com
ghcr.io
crates.io
index.crates.io
registry.npmjs.org
docker.io
Tenant workflows should not get unrestricted network access. The sandbox
provider must enforce the egress policy, and light-workflow should still keep
its existing destination validation for host-executed HTTP, JSON-RPC, and MCP
calls.
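Host-side destination validation against a profile allowlist might look like the sketch below. The function name is hypothetical, and exact-match semantics (no implicit subdomain wildcard) is an assumption; the page does not say how subdomains are treated.

```rust
/// Return true when `host` exactly matches an allowlist entry.
/// Comparison ignores ASCII case and a trailing dot on the host.
pub fn egress_allowed(host: &str, allowlist: &[&str]) -> bool {
    let host = host.trim_end_matches('.').to_ascii_lowercase();
    allowlist
        .iter()
        .any(|allowed| allowed.to_ascii_lowercase() == host)
}
```

With the release allowlist above, api.github.com would pass while an unlisted subdomain of github.com would be rejected.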
Artifact Boundary
The sandbox filesystem is not the workflow state store. Tasks must declare which outputs are copied out:
metadata:
  lightWorkflow:
    artifacts:
      - dist/*.tar.gz
      - dist/*.sha256
      - target/release/light-workflow
The runtime should copy artifacts into a controlled store and record artifact metadata in task output:
{
  "artifacts": [
    {
      "name": "light-fabric-0.3.0-x86_64-unknown-linux-gnu.tar.gz",
      "sha256": "...",
      "size": 12450000,
      "storeUri": "artifact://..."
    }
  ]
}
Audit
Every sandboxed task should record:
- workflow definition id and version,
- workflow instance id,
- task id and task name,
- requested and effective profile,
- sandbox provider, template, session id, and sandbox id,
- command, argv, working directory, and environment allowlist,
- injected secret names, not values,
- network policy id,
- artifact metadata,
- exit status,
- duration,
- output size,
- log reference,
- policy version.
For call: agent, also record model provider, model name, prompt profile,
token budget, output schema id, validation result, and whether tool execution
was allowed.
Failure Handling
Sandbox failures should map to normal workflow task failure semantics:
- startup failure: task fails with sandbox_start_failed,
- policy rejection: task fails with sandbox_policy_denied,
- timeout: task fails with sandbox_timeout,
- command non-zero exit: task fails with command_failed,
- oversized output: task fails with sandbox_output_too_large,
- sandbox lost: task fails or retries according to task retry policy.
The host must not mark a task complete until the sandbox result has been validated and persisted. If a sandbox dies after a command has external side effects, retries must respect the task idempotency key and release workflow guardrails.
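The failure mapping above can be sketched as a single match. The enum and function names are hypothetical. The error codes are the ones listed above, except sandbox_lost, which is an assumed code for the case where a lost sandbox has no retries left.

```rust
/// Sandbox-level failures described in the list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SandboxFailure {
    StartFailed,
    PolicyDenied,
    Timeout,
    CommandFailed { exit_code: i32 },
    OutputTooLarge,
    SandboxLost,
}

/// What the workflow engine does with the task.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskOutcome {
    Fail(&'static str),
    Retry,
}

/// Map a sandbox failure to normal workflow task semantics.
pub fn map_failure(failure: SandboxFailure, retries_left: u32) -> TaskOutcome {
    match failure {
        SandboxFailure::StartFailed => TaskOutcome::Fail("sandbox_start_failed"),
        SandboxFailure::PolicyDenied => TaskOutcome::Fail("sandbox_policy_denied"),
        SandboxFailure::Timeout => TaskOutcome::Fail("sandbox_timeout"),
        SandboxFailure::CommandFailed { .. } => TaskOutcome::Fail("command_failed"),
        SandboxFailure::OutputTooLarge => TaskOutcome::Fail("sandbox_output_too_large"),
        // A lost sandbox follows the task retry policy.
        SandboxFailure::SandboxLost if retries_left > 0 => TaskOutcome::Retry,
        // Assumed code; the list above does not name one for this case.
        SandboxFailure::SandboxLost => TaskOutcome::Fail("sandbox_lost"),
    }
}
```

A retry decision made here would still need to honor the task idempotency key before the command is dispatched again.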
Implementation Plan
- Define ExecutionSecurityProfile and sandbox policy structs in light-workflow.
- Parse workflow and task metadata under lightWorkflow.security.
- Add effective-profile resolution using tenant and service ceilings.
- Add a SandboxRunner trait with provider-neutral operations: create session, execute task, copy artifacts, checkpoint, destroy.
- Add a Cube Sandbox provider implementation behind configuration.
- Keep unsupported run.* task types disabled until they route through SandboxRunner.
- Persist sandbox session metadata in process context or a dedicated runtime table.
- Add audit output to every sandboxed task.
- Add approval gates for profiles that enable command execution, external MCP servers, broad egress, or task-scoped secrets.
- Add release workflow examples that use workflow-session mode for build/test/package and per-task mode for publish/sign.
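The SandboxRunner trait named in the plan could take roughly the following shape. This is a sketch under stated assumptions: the five operations come from the plan, but every signature, the String error type, and the RecordingRunner test double are hypothetical and synchronous for brevity.

```rust
/// Provider-neutral sandbox operations (signatures are hypothetical).
pub trait SandboxRunner {
    fn create_session(&mut self, template: &str) -> Result<String, String>;
    fn execute_task(&mut self, session_id: &str, command: &[&str]) -> Result<i32, String>;
    fn copy_artifacts(&mut self, session_id: &str, patterns: &[&str])
        -> Result<Vec<String>, String>;
    fn checkpoint(&mut self, session_id: &str) -> Result<(), String>;
    fn destroy(&mut self, session_id: &str) -> Result<(), String>;
}

/// Test double that records calls instead of starting a real sandbox.
#[derive(Default)]
pub struct RecordingRunner {
    pub live_sessions: Vec<String>,
    pub calls: Vec<String>,
}

impl RecordingRunner {
    fn require_live(&self, session_id: &str) -> Result<(), String> {
        if self.live_sessions.iter().any(|s| s.as_str() == session_id) {
            Ok(())
        } else {
            Err(format!("unknown session {}", session_id))
        }
    }
}

impl SandboxRunner for RecordingRunner {
    fn create_session(&mut self, template: &str) -> Result<String, String> {
        let id = format!("session-{}", self.calls.len());
        self.calls.push(format!("create {}", template));
        self.live_sessions.push(id.clone());
        Ok(id)
    }

    fn execute_task(&mut self, session_id: &str, command: &[&str]) -> Result<i32, String> {
        self.require_live(session_id)?;
        self.calls.push(format!("exec {}", command.join(" ")));
        Ok(0) // the double always reports a zero exit status
    }

    fn copy_artifacts(
        &mut self,
        session_id: &str,
        patterns: &[&str],
    ) -> Result<Vec<String>, String> {
        self.require_live(session_id)?;
        self.calls.push(format!("copy {}", patterns.join(",")));
        // The double echoes the patterns; a provider would resolve real files.
        Ok(patterns.iter().map(|p| p.to_string()).collect())
    }

    fn checkpoint(&mut self, session_id: &str) -> Result<(), String> {
        self.require_live(session_id)?;
        self.calls.push(format!("checkpoint {}", session_id));
        Ok(())
    }

    fn destroy(&mut self, session_id: &str) -> Result<(), String> {
        self.require_live(session_id)?;
        self.live_sessions.retain(|s| s.as_str() != session_id);
        self.calls.push(format!("destroy {}", session_id));
        Ok(())
    }
}
```

A Cube Sandbox provider would implement the same trait behind configuration, so the workflow DSL never names the provider directly.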
Open Decisions
- Whether sandbox profiles live only in service configuration or are also portal-managed records.
- Whether artifact storage should use portal tables, object storage, or both.
- Whether publish/sign should run in a sandbox or call a separate release service.
- How much of the Cube Sandbox API should be exposed directly versus hidden behind a provider-neutral interface.
- Whether a first-class security field should be added to workflow-core after the metadata-based design proves stable.
Insurance Claim Agentic Workflow
This page describes a product workflow demo for orchestrating multiple agents,
skills, APIs, and human tasks with light-workflow.
The scenario is an auto insurance claim from first notice of loss to a settlement recommendation. It is a useful demo because it is familiar, has clear business states, needs several API calls, and includes human decisions that should not be delegated fully to an agent.
Demo Goal
The workflow should show how a deterministic process can coordinate:
- two or three agents
- multiple skills per agent
- REST API calls
- MCP tool calls through light-gateway
- human input and approval tasks
- branching based on policy, risk, and claim severity
The same business flow should be executable in two variants:
- REST workflow: calls the demo APIs directly with HTTP/OpenAPI tasks.
- MCP workflow: calls the same capabilities through MCP tools exposed by light-gateway.
The workflow owns the process. Agents work inside bounded tasks and should not invent new process paths outside the workflow definition.
For the agent execution boundary, see
Native Agent Call. In the current implementation,
call: agent is a native light-workflow task. It does not invoke a
containerized light-agent service. API access in this demo is owned by the
workflow through direct HTTP tasks or MCP tool calls routed through
light-gateway.
Execution Model
This demo uses the enterprise workflow-first model:
- light-workflow owns the claim process, task state, retries, branching, human tasks, and audit trail.
- API access is explicit in the workflow as call: http or call: mcp.
- Native call: agent tasks perform bounded reasoning over workflow-owned context and must return structured output.
- Skills provide instructions, tool context, and workflow mappings, but they do not give an agent permission to invent unreviewed process paths.
- Containerized light-agent services are not invoked by this demo workflow. They remain the runtime for chat clients and future service-agent integration.
Demo APIs
The existing demo APIs can be used as stand-ins for insurance services.
| API | Role in the claim workflow |
|---|---|
| demo-customer-profile-api | Policyholder profile, vehicle list, policy status, contact preference, prior claims. |
| demo-offer-decision-api | Claim triage, risk decision, settlement or repair recommendation. |
If more realism is needed later, the same workflow can add simulated services for document storage, repair estimates, fraud review, or payment authorization.
Agents
Claim Intake Agent
The Claim Intake Agent owns first notice of loss collection and basic validation.
Skills:
- collect accident facts
- validate required claim fields
- look up customer, policy, and vehicle data
- identify missing information
- summarize the claim for the next agent
Typical tools or API calls:
- get customer profile
- get customer policies
- get covered vehicles
- get prior claims
Human tasks:
- claimant confirms accident details
- claimant answers missing information questions
- claimant uploads or confirms photos, police report, and tow status
Coverage And Liability Agent
The Coverage and Liability Agent checks whether the claim can continue and whether a human adjuster must review it.
Skills:
- coverage eligibility check
- incident date versus policy period check
- vehicle coverage check
- liability and severity classification
- fraud or special investigation flagging
Typical tools or API calls:
- get policy status
- get prior claim history
- run triage decision
- run risk decision
Human tasks:
- adjuster reviews unclear liability
- adjuster confirms coverage exception handling
- special investigation team reviews high-risk claims
Settlement Agent
The Settlement Agent prepares the next action and customer-facing explanation.
Skills:
- repair versus total-loss recommendation
- deductible explanation
- settlement recommendation
- customer message draft
- next-document request
Typical tools or API calls:
- get offer decision
- get customer contact preference
- create settlement recommendation
Human tasks:
- adjuster approves high-value payment
- claimant accepts repair or settlement path
- claimant requests callback or more review
Claim Context And Handoffs
The workflow engine owns the claim state. Agents should be treated as stateless workers that read the current claim context, perform a bounded task, and return structured output.
Each major step enriches a shared claim context:
- intake adds normalized accident facts and missing information status
- customer lookup adds profile, policy, vehicle, and prior-claim data
- coverage review adds eligibility, deductible, liability, and risk signals
- triage adds severity, recommended path, and human-review requirements
- settlement adds the recommendation, explanation, and next actions
Handoffs between agents should happen through this workflow-owned context, not through private agent memory. This keeps the process deterministic, replayable, and auditable.
Workflow Outline
1. Start Claim
Input:
{
  "customerId": "CUST-001",
  "vehicleId": "VEH-001",
  "incidentDate": "2026-05-30",
  "accidentDescription": "Rear-ended at an intersection.",
  "location": "Ottawa, ON",
  "injuryReported": false,
  "vehicleDrivable": false
}
The workflow validates that customerId, vehicleId, incidentDate, and
accidentDescription are present.
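The presence check could be sketched as follows. The ClaimInput struct and missing_fields function are hypothetical, and treating whitespace-only values as missing is an assumption of this sketch.

```rust
/// The four required inputs, each optional until validated.
#[derive(Default)]
pub struct ClaimInput {
    pub customer_id: Option<String>,
    pub vehicle_id: Option<String>,
    pub incident_date: Option<String>,
    pub accident_description: Option<String>,
}

/// Return the JSON names of required fields that are absent or blank,
/// in the order the fields are listed above.
pub fn missing_fields(input: &ClaimInput) -> Vec<&'static str> {
    let checks = [
        ("customerId", &input.customer_id),
        ("vehicleId", &input.vehicle_id),
        ("incidentDate", &input.incident_date),
        ("accidentDescription", &input.accident_description),
    ];
    checks
        .iter()
        .filter(|(_, value)| value.as_deref().map_or(true, |s| s.trim().is_empty()))
        .map(|(name, _)| *name)
        .collect()
}
```

A non-empty result would route the workflow to the missing-information human task rather than failing the claim.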
2. Fetch Customer Context
The workflow calls the profile and policy capabilities to retrieve:
- customer identity
- policy list
- covered vehicles
- contact preference
- prior claim count
Assertions:
- customer exists
- vehicle belongs to customer
- at least one active policy exists
3. Ask For Missing Information
If the input is incomplete, the workflow creates a human task for the claimant.
Example questions:
- Was anyone injured?
- Was another vehicle involved?
- Is the vehicle drivable?
- Was a police report filed?
- Are photos available?
The workflow should be resumable after the claimant answers.
4. Coverage Check
The workflow passes the gathered claim context to a native Coverage and Liability agent task. That task checks:
- policy active on incident date
- covered vehicle
- applicable coverage type
- deductible
- excluded conditions
Branches:
- no matching policy: route to adjuster review
- policy inactive: prepare denial draft for human review
- coverage found: continue to triage
5. Triage Decision
The workflow calls the decision API, either directly with HTTP or through
light-gateway MCP, with normalized claim context.
Expected decision output:
{
  "severity": "medium",
  "riskLevel": "low",
  "recommendedPath": "repair",
  "requiresAdjusterReview": false,
  "estimatedLoss": 3200
}
Branches:
- low risk and low value: continue automatically
- unclear liability: create adjuster review task
- high risk: create special investigation task
- high value: create approval task
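The branches above could be expressed as a pure function over the decision output. Everything here is a sketch: the type names and the approval threshold parameter are hypothetical, the precedence between branches is assumed, requiresAdjusterReview stands in for "unclear liability", and medium risk is routed to adjuster review because the list above does not cover it.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RiskLevel {
    Low,
    Medium,
    High,
}

/// The subset of the decision output that drives branching.
pub struct TriageDecision {
    pub risk_level: RiskLevel,
    pub requires_adjuster_review: bool,
    pub estimated_loss: u32,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NextStep {
    Continue,
    AdjusterReview,
    SpecialInvestigation,
    Approval,
}

/// Pick the next workflow branch (assumed precedence, top to bottom).
pub fn next_step(decision: &TriageDecision, approval_threshold: u32) -> NextStep {
    if decision.risk_level == RiskLevel::High {
        NextStep::SpecialInvestigation
    } else if decision.requires_adjuster_review {
        NextStep::AdjusterReview
    } else if decision.estimated_loss > approval_threshold {
        NextStep::Approval
    } else if decision.risk_level == RiskLevel::Low {
        NextStep::Continue
    } else {
        // Assumption: medium risk is not auto-approved.
        NextStep::AdjusterReview
    }
}
```

In the workflow itself this logic would live in a switch task, so the branch taken is recorded in the audit trail.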
6. Settlement Recommendation
The workflow passes the approved claim context to a native Settlement agent task. That task prepares:
- recommended path: repair, estimate, total-loss review, denial draft, or more information
- deductible explanation
- next documents required
- customer-facing summary
The result should be structured so the UI can render it and the agent can explain it.
7. Human Approval
Approval is required for:
- high estimated loss
- denial recommendation
- special investigation referral
- liability uncertainty
- customer dispute
The task should record:
- approver role
- approval decision
- comment
- timestamp
- whether the workflow should proceed, revise, or stop
8. Customer Response
The claimant chooses one of:
- accept repair path
- request adjuster callback
- upload more documents
- dispute the recommendation
This should be modeled as a human ask task rather than an agent-only step.
9. End State
Possible workflow outcomes:
| State | Meaning |
|---|---|
| claim-approved | Claim can proceed to repair or settlement. |
| needs-adjuster-review | Human adjuster must review before next action. |
| needs-customer-info | Claimant must provide missing information. |
| referred-to-siu | Claim is referred to special investigation. |
| claim-denied-draft | Denial is drafted but still needs human approval. |
Failure Handling And Fallbacks
The demo should show graceful degradation when an API call or agent task cannot finish automatically.
Recommended fallback behavior:
| Failure | Workflow response |
|---|---|
| Customer profile returns 404 | Create a manual customer verification task. |
| Policy or vehicle lookup is unavailable | Retry, then route to adjuster review with the partial claim context. |
| Decision API is unavailable | Create a manual triage task and include the last successful context. |
| Agent output fails validation | Re-run once with validation feedback, then create a human review task. |
| Human task times out | Escalate to the configured role or mark the claim as waiting for follow-up. |
The failure branch should preserve the accumulated claim context and the failed request or response metadata so the human reviewer can continue from the same state instead of restarting the claim.
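The "re-run once with validation feedback" fallback can be sketched as a small retry helper. The names are hypothetical, the agent call and validator are passed in as closures, and the validator stands in for whatever output-schema validation the workflow uses.

```rust
/// Result of a bounded agent step after at most one retry.
#[derive(Debug, PartialEq, Eq)]
pub enum AgentStepResult {
    Accepted(String),
    HumanReview { last_error: String },
}

/// Call the agent, validate the output, and retry exactly once with the
/// validation error as feedback before escalating to a human task.
pub fn run_with_one_retry<F, V>(mut call_agent: F, validate: V) -> AgentStepResult
where
    F: FnMut(Option<&str>) -> String,
    V: Fn(&str) -> Result<(), String>,
{
    let first = call_agent(None);
    let feedback = match validate(&first) {
        Ok(()) => return AgentStepResult::Accepted(first),
        Err(error) => error,
    };
    // Second and final attempt, with the validation error as feedback.
    let second = call_agent(Some(feedback.as_str()));
    match validate(&second) {
        Ok(()) => AgentStepResult::Accepted(second),
        Err(error) => AgentStepResult::HumanReview { last_error: error },
    }
}
```

The HumanReview variant carries the last validation error so the reviewer sees why the automated step was rejected.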
REST Variant
The REST workflow calls the demo APIs directly.
Use this variant to show:
- deterministic API orchestration
- direct HTTP/OpenAPI task execution
- workflow assertions
- human waiting tasks
- repeatable headless tests with fixed inputs
Example task sequence:
start-claim
get-customer-profile
assert-active-policy
ask-missing-info
run-claim-triage
switch-risk-path
ask-adjuster-approval
prepare-settlement-summary
ask-customer-response
complete-claim
MCP Variant
The MCP workflow invokes the same capabilities through MCP tools exposed by
light-gateway.
Use this variant to show:
- tool discovery with tools/list
- tool execution with tools/call
- agent skill guidance over the selected tool set
- gateway as the runtime MCP data plane
Skills should be treated as guidance and curation for the agent, not as the
runtime transport. The workflow still calls MCP tools through light-gateway.
A skill describes when and how to use tools. For example, the
coverage-review skill can instruct the agent to call evaluate_coverage
before score_claim_risk, explain which fields must be present, and define
what output shape the workflow expects.
Example tool groups:
| Skill | Tools |
|---|---|
| claim-intake | get_customer_profile, get_policy, get_vehicle, list_prior_claims |
| coverage-review | evaluate_coverage, score_claim_risk, classify_liability |
| settlement | recommend_offer, generate_customer_summary, list_required_documents |
Human Task Model
Human work should be explicit and durable.
Recommended task types:
- claimant information request
- adjuster approval
- liability review
- special investigation review
- customer settlement response
Recommended fields:
- prompt
- mode: choice, text, object, file, approval
- assignee or candidate role
- due time
- validation rules
- sensitive flag
- comments
- decision result
The workflow should pause at the human task and resume after a valid response is recorded.
The pause is durable. light-workflow persists the process and task state while
waiting, so the workflow can remain idle for hours or days without consuming
active execution resources. When the claimant, adjuster, or investigator
completes the task, the workflow resumes from the persisted state and continues
with the same claim context.
Minimal First Implementation
Start with a narrow happy path:
- Start with customerId, vehicleId, and accident details.
- Workflow fetches customer profile through HTTP or MCP.
- Workflow asserts active policy and covered vehicle.
- Workflow calls the decision API for triage.
- Workflow asks an adjuster to approve if estimatedLoss exceeds a threshold.
- Native Settlement agent task prepares the recommendation.
- Workflow completes with claim-approved or needs-adjuster-review.
This first version is enough to demonstrate multi-agent orchestration without needing every insurance edge case.
Later Enhancements
Add complexity incrementally:
- document upload and OCR simulation
- repair shop estimate comparison
- fraud and special investigation path
- payment authorization
- subrogation when another driver is liable
- scheduled headless regression runs
- customer notification drafting
- analytics for cycle time and approval bottlenecks
Demo Success Criteria
The demo is successful if it shows:
- the same business process running through REST and MCP variants
- agents using skills to perform bounded work
- APIs called through both direct HTTP and MCP tool paths
- at least one human input task
- at least one human approval task
- auditable workflow state transitions
- clear final outcome and explanation
Light Portal Setup
This page describes the portal-side setup required to run the
light-workflow product demos from a local light-portal stack.
For the execution model behind native agent tasks, see Native Agent Call. For the insurance product scenario, see Insurance Claim Agentic Workflow.
Prerequisites
Start the local portal stack with the workflow services, gateway, controller, and Postgres available.
For the Rust local stack:
cd /home/steve/workspace/portal-config-loc
./scripts/deploy-local.sh pg rust
The local stack should include:
- Postgres,
- workflow-command,
- workflow-query,
- light-gateway,
- controller,
- config-server,
- demo-customer-profile-api,
- demo-offer-decision-api.
light-workflow must use the same database as workflow-command:
DATABASE_URL=postgres://postgres:secret@localhost:5432/configserver
Start Light-Workflow
Build and run light-workflow from the light-fabric checkout:
cd /home/steve/workspace/light-fabric
cargo build -p light-workflow --locked
cd apps/light-workflow
DATABASE_URL=postgres://postgres:secret@localhost:5432/configserver \
LIGHT_WORKFLOW_HTTP_ADDR=0.0.0.0:8080 \
RUST_LOG=light_workflow=debug,info \
WORKFLOW_LOG_ANSI=false \
./run.sh --debug-binary
For repeated runs, put those values in
apps/light-workflow/light-workflow.env and run:
./run.sh --debug-binary
Import Agent Catalog Data
Native call: agent tasks load portal agent, skill, and tool metadata from the
portal database. Import the demo catalog events before running workflows that
contain agent tasks.
cd /home/steve/workspace/event-importer
./importer.sh \
--filename /home/steve/workspace/light-fabric/apps/light-workflow/examples/agent-catalog-events.json
For a different host or user, pass replacement rules:
./importer.sh \
--filename /home/steve/workspace/light-fabric/apps/light-workflow/examples/agent-catalog-events.json \
--replacement '[
{"field":"hostId","from":"01964b05-552a-7c4b-9184-6857e7f3dc5f","to":"<host-id>"},
{"field":"user","from":"01964b05-5532-7c79-8cde-191dcbd421b8","to":"<user-id>"},
{"field":"operationOwner","from":"01964b05-5532-7c79-8cde-191dcbd421b8","to":"<user-id>"},
{"field":"deliveryOwner","from":"01964b05-5532-7c79-8cde-191dcbd421b8","to":"<user-id>"}
]'
The demo catalog uses modelProvider: mock for deterministic local runs. For
real model execution, update the portal agent definitions to use the desired
provider and apiKeyRef.
Upload API Metadata
For the insurance claim demos, upload or refresh the OpenAPI specs for:
- demo-customer-profile-api,
- demo-offer-decision-api.
The portal catalog should contain endpoint and tool projections for the demo
APIs before the MCP workflow is run. The MCP workflow expects light-gateway
tools/list to expose these tools:
getCustomerProfile
getCustomerPreferences
getCustomerPolicies
getCoveredVehicle
listPriorClaims
triageClaim
recommendSettlement
Verify the tool surface through the gateway:
curl -k -sS -X POST "https://localhost:8443/mcp" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <access-token>" \
-d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}'
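Once the tool names have been extracted from the tools/list response, a quick comparison shows which expected tools are missing. The sketch below assumes the names are already available as strings; `EXPECTED_TOOLS` and `missing_tools` are hypothetical helpers, not part of light-gateway or the demo scripts.

```rust
/// Tool names the MCP workflow expects the gateway to expose.
const EXPECTED_TOOLS: [&str; 7] = [
    "getCustomerProfile",
    "getCustomerPreferences",
    "getCustomerPolicies",
    "getCoveredVehicle",
    "listPriorClaims",
    "triageClaim",
    "recommendSettlement",
];

/// Return the expected tools that are absent from the names the gateway exposed.
/// The result keeps the order of EXPECTED_TOOLS.
fn missing_tools(exposed: &[&str]) -> Vec<&'static str> {
    EXPECTED_TOOLS
        .iter()
        .copied()
        .filter(|name| !exposed.contains(name))
        .collect()
}
```

An empty result means the tool surface matches what the MCP workflow expects; any name returned points at a missing or stale OpenAPI upload.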
Create Workflow Definitions
Create workflow definitions in the portal UI or through the workflow command API. For the insurance claim demo, create these definitions:
insurance-claim-rest-v1.yaml
insurance-claim-mcp-v1.yaml
insurance-claim-headless-v1.yaml
The files live in:
/home/steve/workspace/light-fabric/apps/light-workflow/examples
After creation, capture their ids:
psql "postgresql://postgres:secret@localhost:5432/configserver" \
-c "select host_id, wf_def_id, name from wf_definition_t where active and name in ('insurance-claim-rest-v1', 'insurance-claim-mcp-v1', 'insurance-claim-headless-v1') order by name;"
Roles And Human Tasks
The insurance claim workflow creates durable human tasks. Confirm that the demo host has the roles used by those assignments:
claimant
claims-adjuster
siu-investigator
customer-service
Human tasks remain in the portal database while waiting. The workflow resumes after the task-completion command records a valid response.
Start And Verify
Use the portal UI start action, Postman collection, or curl helper from the examples directory.
cd /home/steve/workspace/light-fabric/apps/light-workflow/examples
ACCESS_TOKEN=<token> \
HOST_ID=<host-id> \
HEADLESS_WF_DEF_ID=<headless-wf-def-id> \
./insurance-claim-demo-curl.sh start-headless
Run the SQL verification helper after each start or task completion:
psql "postgresql://postgres:secret@localhost:5432/configserver" \
-v host_id=<host-id> \
-f /home/steve/workspace/light-fabric/apps/light-workflow/examples/insurance-claim-demo-queries.sql
For the full runbook, see:
/home/steve/workspace/light-fabric/apps/light-workflow/examples/README.md
Troubleshooting
| Symptom | Check |
|---|---|
| Workflow starts but no process appears | Confirm light-workflow uses the same DATABASE_URL as workflow-command. |
| Agent task fails before a human task | Confirm agent-catalog-events.json was imported for the same hostId. |
| MCP tool is not found | Call gateway tools/list and confirm the tool names match the workflow YAML. |
| Human task is not visible | Check task_asst_t, role membership, and task status. |
| Input fields resolve as ${ .customerId } | Confirm startWorkflow sends input as a JSON object, not a JSON string. |
Comparison: Light-Fabric vs. AgentGateway
This document provides a high-level comparison between Light-Fabric and AgentGateway to help architects and engineering leaders choose the right foundation for their agentic workflows.
Overview
While both systems aim to facilitate interaction with Large Language Models (LLMs), they operate at different layers of the AI stack and prioritize different architectural outcomes.
| Feature | Light-Fabric | AgentGateway |
|---|---|---|
| Primary Philosophy | Agentic Fabric: Unified Governance & Lifecycle | Agentic Gateway: High-performance Proxy |
| Core Architecture | Integrated Platform (Layer) | Standalone Gateway (Service) |
| Target User | Central IT / Platform Engineering | Application Developers / DevOps |
| Lifecycle Management | APIs, Agents, MCPs, and Gateways | Primarily LLM Request Routing |
| Language | Native Rust (Extreme Performance) | Rust / Go (Variable) |
1. Governance vs. Connectivity
Light-Fabric (Governance)
Light-Fabric is designed as a Single Control Plane. It assumes that in an enterprise environment, "freedom without governance is chaos." It provides:
- Centralized Registry: Every agent, skill, and tool is registered and governed via the light-portal.
- Fine-Grained Authorization: Deep policy enforcement at the endpoint level, including row- and column-level data masking.
- Auditability: A unified audit trail for all agentic interactions across the entire organization.
AgentGateway (Connectivity)
AgentGateway typically focuses on the North-South traffic between an application and multiple LLM providers. Its primary strength is:
- Simplified Routing: Getting a request from Point A to Point B with retries and failover.
- Provider Abstraction: Normalizing different LLM APIs into a single interface.
2. Integrated Intelligence: Hindsight
One of the defining differences of the Light-Fabric is the deep integration of Hindsight Memory.
- Light-Fabric: Memory is not an "add-on." The platform provides native biomimetic memory banks (World Facts, Experiences, Mental Models) that are automatically managed and scoped (Global, Shared, Private) as part of the fabric.
- AgentGateway: Typically treats memory as external state. The application or a separate vector database must manage context before sending the request through the gateway.
3. Skill & Tool Management
Centralized Skills (Fabric)
In Light-Fabric, skills (tools) are first-class citizens. They are registered, versioned, and governed centrally. An agent doesn't just "have" a tool; the Fabric grants the agent access to a skill based on its role and the current context.
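The idea of the Fabric granting a skill, rather than the agent simply holding it, can be sketched as a registry lookup. This is an illustrative sketch only: `SkillRegistry` and its methods are hypothetical names, only the role dimension is modelled (the context check is omitted), and the role names in the usage are invented for the example.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical registry mapping a skill name to the roles allowed to use it.
struct SkillRegistry {
    allowed_roles: HashMap<String, HashSet<String>>,
}

impl SkillRegistry {
    fn new() -> Self {
        Self {
            allowed_roles: HashMap::new(),
        }
    }

    /// Register a skill together with the roles that may be granted access.
    fn register(&mut self, skill: &str, roles: &[&str]) {
        let set: HashSet<String> = roles.iter().map(|r| r.to_string()).collect();
        self.allowed_roles.insert(skill.to_string(), set);
    }

    /// Grant access only when the skill is registered and the role is allowed.
    /// Unregistered skills are denied by default.
    fn is_granted(&self, agent_role: &str, skill: &str) -> bool {
        self.allowed_roles
            .get(skill)
            .map_or(false, |roles| roles.contains(agent_role))
    }
}
```

The deny-by-default branch is the important design choice here: a skill that was never registered cannot be used by any agent.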
Standard Tooling (Gateway)
AgentGateway generally passes tool definitions through to the provider. The management of who can use which tool and how those tools are secured is usually left to the application logic.
4. Orchestration: Hybrid Agentic Workflows
Light-Fabric (Integrated Orchestrator)
Light-Fabric treats orchestration as a foundational service. It implements a Hybrid Model:
- Deterministic Process: The overall business logic (e.g., insurance claim steps) is fixed and compliant.
- Autonomous Tasks: Individual steps within the process are delegated to agents.
- Statefulness: The Fabric manages long-running state across days or weeks, ensuring durability.
AgentGateway (Stateless Proxy)
AgentGateway is primarily a stateless component.
- External Orchestration: The workflow logic must reside in your application code or an external engine (like Temporal).
- Proxy Only: It handles the communication but does not "understand" or manage the multi-step business process itself.
5. Security: The Rule Engine
Light-Fabric (Integrated Governance)
Light-Fabric includes an integrated YAML-based Rule Engine (light-rule) designed for fine-grained authorization:
- Data Filtering: Automatically masks or filters response data (column/row level) based on policies.
- Policy Enforcement: Checks permissions before an agent executes a tool or accesses a memory unit.
- Hot-Reloading: Security rules can be updated in real-time without redeploying the platform.
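Row and column filtering can be pictured with a small in-memory example. The sketch below is not the light-rule engine or its YAML rule format: `FilterPolicy` and `apply_policy` are hypothetical, rows are simplified to string maps, and the `****` mask value is an arbitrary choice.

```rust
use std::collections::BTreeMap;

/// A response row as column name -> value (all strings for simplicity).
type Row = BTreeMap<String, String>;

/// Hypothetical policy: which rows a caller may see and which columns are masked.
struct FilterPolicy {
    /// Column whose value must equal `row_value` for the row to be returned.
    row_column: String,
    row_value: String,
    /// Columns whose values are replaced with a mask.
    masked_columns: Vec<String>,
}

/// Apply row filtering first, then mask protected columns in the remaining rows.
fn apply_policy(rows: Vec<Row>, policy: &FilterPolicy) -> Vec<Row> {
    rows.into_iter()
        .filter(|row| row.get(&policy.row_column) == Some(&policy.row_value))
        .map(|mut row| {
            for column in &policy.masked_columns {
                if let Some(value) = row.get_mut(column) {
                    *value = "****".to_string();
                }
            }
            row
        })
        .collect()
}
```

Rows lacking the filter column are dropped, so missing data never widens what a caller can see.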
AgentGateway (Basic Middleware)
AgentGateway typically provides basic security features like API key validation or rate limiting.
- Limited Filtering: While it can intercept traffic, implementing complex, context-aware data masking usually requires writing custom middleware or handling it at the application level.
6. MCP Support: Gateway vs. Ecosystem
Light-Fabric (Integrated Tooling)
Light-Fabric treats Model Context Protocol (MCP) as a primary source for agent tools.
- Direct Integration: Agents use the mcp-client to directly consume tools from MCP servers.
- Registry Management: MCP servers are registered in the light-portal, allowing for centralized discovery and governance.
- Unified Security: The same Fine-Grained Authorization rules apply to MCP tools as they do to native Rust tools.
AgentGateway (Specialized MCP Proxy)
AgentGateway provides a highly specialized MCP Gateway layer.
- Protocol Translation: It excels at translating between different MCP transports (SSE, Streamable HTTP, etc.).
- Exposing Servers: Its primary role is to make MCP servers accessible to external applications through a normalized gateway interface.
- Advanced Networking: Includes features like stream merging and specialized MCP routing.
For a deep dive into the technical differences, see our Detailed MCP Feature Comparison.
Summary: Which to Choose?
Choose Light-Fabric if:
- You are building an Enterprise AI Strategy that requires unified governance, stateful workflows, and integrated security.
- You need to manage the entire lifecycle of agents and the business processes they participate in.
- You require advanced data privacy (masking) and long-term memory (Hindsight) as native platform features.
Choose AgentGateway if:
- You need a lightweight proxy to handle LLM provider failover and basic request normalization.
- You prefer to manage agent logic, workflows, memory, and security entirely within your external application stack.
- You are looking for a simple tool to solve immediate connectivity needs without implementing a comprehensive platform layer.
Detailed Comparison: MCP Gateway Features
This document provides a technical deep dive into the Model Context Protocol (MCP) implementations in Light-Fabric and AgentGateway.
Feature Matrix
| Feature | Light-Fabric | AgentGateway |
|---|---|---|
| Primary Role | Provider/Gateway/Portal: Exposes MCP/API Servers. | Provider/Gateway: Exposes MCP servers. |
| Onboarding | Auto-Discovery: Automatic tools/list sync. | Manual: K8s CRD/Manifest configuration. |
| Data Privacy | Deep: Row/Column level masking. | Basic: Allow/Deny access control. |
| Transports | SSE, Streamable HTTP, WebSocket | SSE, Streamable HTTP, WebSocket |
| Legacy Integration | Native: REST/RPC to MCP transformation. | External: Manual wrappers required. |
| Authorization | Managed: Roles, Groups, Positions, Attributes. | Infrastructure: CEL-based policies. |
| Hot-Reloading | Native: Integrated Control Plane & Registry. | Infrastructure: Istio/xDS sync. |
| Authentication | JWT (End-to-End Propagation) | JWT, Keycloak, OIDC, Passthrough |
| Observability | Distributed Tracing (OTEL) and Integrated Hindsight Memory | Distributed Tracing (OTEL) |
1. Architectural Intent
AgentGateway: The Network Proxy Layer
AgentGateway is designed as a high-availability proxy for MCP servers. Its primary focus is the North-South traffic between an application and multiple MCP backends.
- Multiplexing: Optimized for merging multiple MCP backends into a single upstream connection (mergestream.rs).
- Protocol Translation: Excels at translating between SSE, Streamable HTTP, and WebSocket transports.
- Infrastructure Focus: Operates as a Kubernetes-native component managed via manifests and standard networking policies.
Light-Fabric: The Managed Enterprise Platform
Light-Fabric provides a Unified Governance Fabric that treats AI agents and MCP tools as part of the broader enterprise API ecosystem.
- Unified Gateway: The AI Gateway (Rust/Pingora-based) serves as a single entry point for UI, Agents, and Tools, supporting both MCP and traditional REST/RPC APIs.
- Centralized Portal: Uses the Light-Portal as a control plane for onboarding (auto-discovery), configuration (hot-reloading), and security management.
- Governed Intelligence: Integrates the gateway directly with Hindsight Memory and the Fine-Grained Rule Engine, ensuring that every tool call is governed by corporate compliance rules (e.g., row/column masking).
- End-to-End Security: Maintains a single JWT-based identity from the user's chat interface all the way to the underlying MCP or API endpoint.
2. Security & Authorization
AgentGateway: Infrastructure-Aware RBAC
AgentGateway uses Common Expression Language (CEL) for its authorization policies.
- Capabilities: High-speed, network-level blocking based on JWT claims and request headers.
- Limitation: Lacks native support for content-aware data masking or organizational hierarchy logic.
Light-Fabric: Content-Aware Managed Auth
Light-Fabric provides a mature Fine-Grained Authorization layer:
- Managed ABAC/PBAC: Supports Role, Group, Corporate Position (Hierarchy), and Attribute-based protection.
- Data Privacy: Supports native Row and Column filtering (data masking), ensuring agents only see data they are authorized to process.
- End-to-End JWT: The same JWT token is propagated from the UI through the Agent to the AI Gateway and MCP tool.
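A combined role, position, and attribute check might look like the sketch below. All names are hypothetical (`Position`, `Caller`, `Requirement`, `is_authorized`), the position ladder is invented for illustration, and the caller fields are assumed to have been extracted from verified JWT claims; the actual Fine-Grained Authorization rules are defined in the portal, not in code like this.

```rust
/// Hypothetical corporate positions, ordered from junior to senior.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Position {
    Associate,
    Adjuster,
    Manager,
    Director,
}

/// Caller identity, assumed to come from already-verified JWT claims.
struct Caller<'a> {
    roles: &'a [&'a str],
    position: Position,
    region: &'a str,
}

/// What a protected endpoint or tool requires.
struct Requirement<'a> {
    role: &'a str,
    min_position: Position,
    /// Optional attribute constraint; None means any region is accepted.
    region: Option<&'a str>,
}

/// All three checks must pass: role, position in the hierarchy, and attribute.
fn is_authorized(caller: &Caller, req: &Requirement) -> bool {
    let has_role = caller.roles.contains(&req.role);
    let senior_enough = caller.position >= req.min_position;
    let attribute_ok = req.region.map_or(true, |r| r == caller.region);
    has_role && senior_enough && attribute_ok
}
```

The derived ordering on `Position` follows declaration order, so a Manager satisfies a requirement of Adjuster but not the other way round.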
3. Lifecycle & Tool Onboarding
AgentGateway: Configuration-Driven
Onboarding tools in AgentGateway is an infrastructure task:
- Manual Mapping: Requires defining Kubernetes Custom Resources (HTTPRoute, Backend) to map MCP servers to the gateway.
- Scope: Primarily focused on exposing existing MCP servers.
Light-Fabric: Registry-Driven
Light-Fabric provides a "Zero-Effort" onboarding experience via Light-Portal:
- Auto-Discovery: Registering an MCP API triggers an automatic tools/list call to populate the registry.
- Protocol Transformation: Automatically transforms existing OpenAPI/REST and RPC services into MCP tools without requiring wrappers.
- Centralized Governance: All tools (Native, REST, MCP) are managed in a single unified registry.
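The OpenAPI-to-tool projection can be sketched as a simple mapping from an operation to a tool descriptor. This is a rough illustration under stated assumptions: `OpenApiOperation`, `ToolProjection`, `project_tool`, and `path_parameters` are hypothetical, the rule that operationId becomes the tool name is an assumption, and the path used in the usage is invented rather than taken from the demo specs.

```rust
/// Minimal view of one OpenAPI operation (a real spec carries far more detail).
struct OpenApiOperation {
    operation_id: String,
    method: String,
    path: String,
    summary: String,
}

/// Minimal tool descriptor plus the HTTP route a gateway would call.
#[derive(Debug, PartialEq)]
struct ToolProjection {
    name: String,
    description: String,
    http_method: String,
    http_path: String,
}

/// Project an operation onto a tool.
/// Assumption: the operationId is used as the tool name.
fn project_tool(op: &OpenApiOperation) -> ToolProjection {
    ToolProjection {
        name: op.operation_id.clone(),
        description: op.summary.clone(),
        http_method: op.method.to_uppercase(),
        http_path: op.path.clone(),
    }
}

/// Collect `{name}` path segments; these would become required tool arguments.
fn path_parameters(path: &str) -> Vec<String> {
    path.split('/')
        .filter_map(|segment| segment.strip_prefix('{')?.strip_suffix('}'))
        .map(str::to_string)
        .collect()
}
```

Because the projection is derived from the spec, refreshing the uploaded OpenAPI document is enough to refresh the tool, with no hand-written wrapper in between.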
4. Control Plane & Configuration
AgentGateway: Kubernetes-Native
- Orchestration: Managed via the Istio/xDS control plane.
- Updates: Configuration changes are applied via Kubernetes manifests (YAML).
Light-Fabric: Portal-Managed
- Hot-Reloading: Uses a dedicated Config Server and Control Plane to update gateway and agent configurations in real-time without restarts.
- Enterprise Management: Business-centric UI for managing tool visibility, agent permissions, and security policies.
5. Conclusion
- Use AgentGateway if you are an infrastructure provider who needs to expose MCP-based tools to multiple external applications securely and reliably.
- Use Light-Fabric if you are building intelligent agents that need to use those tools to solve complex business problems within a governed framework.
Why Light-Fabric Already Covers the MCP Gateway — No Second Gateway Required
This document addresses a recommendation (produced by Grok AI) suggesting that an enterprise should deploy the open-source AgentGateway as a dedicated MCP layer alongside an existing API platform. After performing a side-by-side source code analysis of both projects (see vs-agentgateway.md and vs-agent-gateway-mcp.md), we present the findings below.
1. The Recommendation Was Generated Without Knowledge of Light-Fabric
The Grok-produced analysis operates under a critical blind spot: it has no knowledge of Light-Fabric (Rust-based, open-sourced to customers) or its capabilities. The recommendation frames the choice as "keep your existing REST platform + add AgentGateway for MCP," because Grok only knows about publicly documented open-source projects. It does not account for the fact that:
- Light-Fabric is already in production and serving agentic workloads today.
- Every feature listed in the recommendation — MCP federation, tool discovery, protocol translation, security, and observability — has already been built, demonstrated, and validated with the project team.
- The comparison is therefore not between "a REST framework" and "an MCP gateway." It is between two systems that both provide MCP gateway capabilities, where one (Light-Fabric/Light-Gateway) is already deployed and battle-tested in our environment.
2. Source Code Analysis: Light-Fabric Already Does What AgentGateway Does
We conducted a detailed, code-level comparison of both projects. The full results are documented in our High-Level Comparison and Detailed MCP Feature Comparison. The key findings are summarized below.
2.1 MCP Protocol Support
| Capability | Light-Fabric | AgentGateway |
|---|---|---|
| Transports | SSE, Streamable HTTP, WebSocket | SSE, Streamable HTTP, WebSocket |
| Tool Discovery | Auto-discovery via tools/list sync | Manual K8s CRD configuration |
| Protocol Translation | Native REST/RPC → MCP transformation | Manual wrappers required |
| Stream Handling | Supported | Supported (mergestream) |
Both projects support the same MCP transports. Light-Fabric goes further with automatic tool discovery and native protocol transformation from existing REST/RPC APIs — exactly the "OpenAPI-to-MCP mapping" that the Grok recommendation credits to AgentGateway, except Light-Fabric does it without requiring a separate component.
2.2 Security & Authorization
| Capability | Light-Fabric | AgentGateway |
|---|---|---|
| Authentication | JWT (end-to-end propagation) | JWT, Keycloak, OIDC, Passthrough |
| Authorization | Role, Group, Position, Attribute-based (ABAC/PBAC) | CEL-based policies |
| Data Privacy | Row/Column-level masking | Allow/Deny access control |
| Rule Engine | Integrated YAML-based, hot-reloadable | Basic middleware |
The Grok recommendation highlights "tool-level RBAC" and "MCP-compliant OAuth 2.1" as AgentGateway strengths. Our code analysis shows that Light-Fabric's authorization model is significantly deeper — it supports corporate-hierarchy-aware policies and content-level data masking that AgentGateway simply does not implement.
2.3 Lifecycle & Operations
| Capability | Light-Fabric | AgentGateway |
|---|---|---|
| Onboarding | Portal-driven, auto-discovery | K8s manifest-driven, manual |
| Hot-Reloading | Native (Config Server + Control Plane) | Infrastructure-dependent (Istio/xDS) |
| Observability | OTEL + integrated Hindsight Memory | OTEL + OpenInference |
| Orchestration | Integrated hybrid workflows (deterministic + autonomous) | None (stateless proxy) |
Light-Fabric manages the entire lifecycle — from tool registration through governance to runtime orchestration — while AgentGateway only handles the proxy layer.
3. Two Gateways Is Overkill
The Grok recommendation frames the architecture as a "clean separation of concerns." In practice, deploying both Light-Fabric and AgentGateway creates redundant infrastructure with real costs:
Duplicated Capabilities
Both systems would be performing the same core functions:
- Receiving MCP requests from agents
- Translating tool calls to backend HTTP requests
- Enforcing security policies on tool access
- Providing observability for agentic traffic
Running two gateways that do the same thing is not "separation of concerns" — it is duplication of concerns. Every MCP request would traverse two proxy layers instead of one, adding latency and operational complexity for zero additional capability.
Operational Burden
- Two deployment pipelines to maintain on EKS
- Two sets of security policies to keep in sync
- Two configuration surfaces (K8s CRDs for AgentGateway vs. Portal for Light-Fabric)
- Two failure domains to monitor and troubleshoot
- Two upgrade cycles to coordinate
The "No Code Changes" Claim Is Misleading
The Grok recommendation states AgentGateway requires "no code changes." This is true only if you ignore the work required to:
- Write and maintain Kubernetes Custom Resources for every MCP backend
- Build manual wrappers for non-MCP services (Light-Fabric does this natively)
- Implement application-level logic for everything AgentGateway doesn't cover (stateful workflows, data masking, memory management)
Light-Fabric also requires no code changes to existing backend services — and it provides the governance layer out of the box.
4. Addressing the "Rust Performance" Argument
The recommendation claims AgentGateway has a "performance edge" due to its Rust data plane. This argument does not hold:
- Light-Fabric's AI Gateway currently runs on the high-performance Java-based light-gateway, and a new Rust-based AI Gateway is also under way, built on the Pingora framework (Cloudflare's production proxy engine). Even the existing Java gateway delivers exceptional throughput, and the Rust gateway will remove the JVM from the critical path entirely.
- Once the Rust-based AI Gateway is in place, both systems benefit from Rust's zero-cost abstractions, memory safety, and lack of garbage collection pauses.
- The performance comparison between the two Rust implementations would be marginal and workload-dependent — not a differentiator.
5. Addressing the "Custom Development" Concern
The recommendation warns against "implementing MCP directly" because it "involves significant custom development." This concern does not apply:
- Light-Fabric's MCP support is not custom development — it is a fully implemented, production-ready feature of the platform.
- The MCP client, gateway routing, tool registry, and security integration are all existing, tested components, not a backlog of work to be done.
- The project team has already seen these features demonstrated end-to-end.
6. Summary
| Concern from Grok Recommendation | Reality |
|---|---|
| "Light4j is a REST framework, not an AI proxy" | Light-Fabric is a full agentic platform with an AI Gateway already in production |
| "AgentGateway provides MCP federation and tool discovery" | Light-Fabric provides the same capabilities with deeper governance |
| "Rust performance advantage over JVM" | Light-Fabric's Java gateway is already very fast, and a Rust (Pingora-based) gateway is coming |
| "Clean separation of concerns" | Two gateways doing the same thing is duplication, not separation |
| "No code changes required" | True for both — but AgentGateway requires extensive K8s manifest management |
| "Custom MCP implementation is risky" | Light-Fabric's MCP support is already built, tested, and in production |
Conclusion
The Grok-generated recommendation is well-structured but fundamentally flawed because it was produced without knowledge of Light-Fabric's capabilities. When evaluated against the actual source code and production state of both systems, the case for adding AgentGateway collapses:
- Light-Fabric already provides every MCP gateway capability that AgentGateway offers.
- Light-Fabric goes significantly further with integrated governance, data privacy, memory, and orchestration.
- Adding a second gateway introduces operational complexity and latency with no net-new capability.
The pragmatic, low-risk path is to continue with the platform that is already built, already in production, and already proven to the team.