Vibemind: Event-Driven Multi-Agent Operating System With Context-Engineering, Screen-State Perception, and Self-Improving Control
Technical Field
This disclosure relates to artificial intelligence operating systems and, more specifically, to event-driven orchestration of heterogeneous software agents and Model Context Protocol (MCP) tools using (i) context-engineering over dynamic knowledge spaces, (ii) visual screen-state perception with moiré-assisted OCR for robust UI automation, and (iii) self-improving control via puzzle-encoded interaction traces and continuous-thought training.
Background
Conventional agent frameworks typically bind a single planner model to tool calls, lack persistent, structured memory, and rely on brittle coordinate-based UI automation. They struggle with: (a) selecting the right agent chain for long tasks, (b) recovering from partial failures, (c) safely generalizing across apps and machines, and (d) learning from execution traces to improve future runs.
Vibemind addresses these gaps by (1) routing intents and events through an event loop that assigns work to specialized agent teams and MCP adapters; (2) actively curating context with query generation and Monte-Carlo exploration of an embedding space; (3) perceiving on-screen state using a de-moiré analysis module with OCR and cursor-to-target geometry; and (4) encoding interaction traces into a structured “puzzle” representation used to train continuous-thought machines (CTMs) that refine routing and recovery over time.
Summary
Vibemind is an AI-native, event-driven multi-agent OS that:
- Exposes a voice/chat front end, accepts intents, and emits artifacts per step while coordinating autonomous “hands-off” runs with retry and policy control.
- Performs context engineering by exploring an embedding space, generating and pruning queries/documents/code fragments using an LLM-in-the-loop Monte-Carlo method.
- Perceives UI state with a De Moiré module that uses interference patterns plus OCR to infer element locations, predict movement, and execute adaptive, position-independent clicks, including a stealth mode that blacks out the visible screen while analysis continues.
- Automates interfaces via time-based live screen capture, OCR zones, and a state engine (e.g., n8n) that recognizes states such as “Excel open,” “shell finished,” or “process stuck.”
- Builds a knowledge graph with agent teams that extract requirements and memories, then routes sub-tasks and acceptance criteria to coding agents.
- Encodes multi-agent conversations as Kotlin puzzles; CTMs learn to solve these puzzles and thereby learn when to pause/terminate/switch strategies, creating a continuous-processing architecture.
- Integrates multiple MCP servers (e.g., Playwright, Docker, Git, filesystem) in an event-based architecture with user clarification hooks.
Brief Description of the Drawings
- Fig. 1 System topology and event loop with agent teams and MCP adapters.
- Fig. 2 Context-engineering embedding space with Monte-Carlo exploration and cluster pruning.
- Fig. 3 Artifact pipeline per step (plans, traces, diffs, test logs, patches).
- Fig. 4 De Moiré module: moiré field, OCR regions, cursor-target vector, and stealth mode.
- Fig. 5 Time-based OCR zones with external state engine for UI automation.
- Fig. 6 Knowledge-graph generator with requirement extraction agents and routing to coders.
- Fig. 7 Kotlin puzzle/CTM training loop and continuous-processing pipeline.
Detailed Description
1. Event-Driven Multi-Agent OS
The system exposes an input interface (voice and chat) and an event loop. Intents and environment events enter the loop, are prioritized, and dispatched to specialized agents and MCP tools including, for example, Docker, Playwright, and GitHub. For complex workflows, a “hands-off pattern” encapsulates long-running task logic and recovery behavior. Each project is organized into buckets (Coding/Debugging/Testing), with explicit interfaces and acceptance criteria. Step artifacts (plans, diffs, logs) are emitted at each transition and fed back as context for subsequent steps.
Routing: The loop uses signals from perception and state engines to select agent teams. Policies determine retry, backoff, and escalation. A compact “context snippet” object summarizes relevant history for tool calls, minimizing prompt bloat while preserving grounding.
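For concreteness, the sketch below illustrates the dispatch pattern described above: a priority queue of intents and environment events, policy-driven retry with backoff and escalation, and a compact context snippet attached to each call. All names (`EventLoop`, `ContextSnippet`, the team-handler signature) are hypothetical stand-ins, not the production interface.

```python
# Minimal event-loop sketch (illustrative; all names are hypothetical).
import heapq
import itertools
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ContextSnippet:
    """Compact, grounded summary attached to every tool/agent call."""
    task_id: str
    goal: str
    recent_artifacts: list[str] = field(default_factory=list)  # artifact ids, not full payloads

    def render(self) -> str:
        return f"[{self.task_id}] goal={self.goal} artifacts={','.join(self.recent_artifacts[-5:])}"


@dataclass(order=True)
class Event:
    priority: int
    seq: int
    kind: str = field(compare=False)        # e.g. "intent", "screen_state", "test_failed"
    payload: dict = field(compare=False, default_factory=dict)


class EventLoop:
    def __init__(self, teams: dict[str, Callable], max_retries: int = 3):
        self._queue: list[Event] = []
        self._seq = itertools.count()
        self._teams = teams                  # event kind -> handler(payload, snippet)
        self._max_retries = max_retries

    def submit(self, kind: str, payload: dict, priority: int = 10) -> None:
        heapq.heappush(self._queue, Event(priority, next(self._seq), kind, payload))

    def run_once(self, snippet: ContextSnippet) -> None:
        if not self._queue:
            return
        event = heapq.heappop(self._queue)
        handler = self._teams[event.kind]
        for attempt in range(self._max_retries):
            try:
                handler(event.payload, snippet)
                return
            except Exception:
                time.sleep(2 ** attempt)     # policy-driven backoff before escalation
        self.submit("escalate", {"failed": event.kind}, priority=1)
```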
2. Context Engineering and Learning
Vibemind maintains an embedding space over documents, code fragments, and prior traces. A Monte-Carlo algorithm iteratively (i) proposes query mutations, (ii) scores retrieval coherence vs task goals, and (iii) prunes or promotes clusters into stable knowledge units. LLMs participate by proposing candidate queries and judging relevance; the result is a self-optimizing context that “breathes” with the task.
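A minimal sketch of the exploration loop follows, assuming stand-in interfaces `embed`, `propose_mutation`, and `llm_relevance` for the embedding model, the LLM query proposer, and the LLM judge; the coherence score is deliberately simplified to cosine similarity and is not the production scoring function.

```python
# Illustrative Monte-Carlo context exploration over an embedding space.
import random
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


def explore_context(goal: str, seed_queries: list[str], corpus: dict[str, np.ndarray],
                    embed, propose_mutation, llm_relevance,
                    iterations: int = 50, keep: int = 8) -> list[str]:
    """Return document ids promoted into a stable knowledge unit for `goal`."""
    goal_vec = embed(goal)
    pool = list(seed_queries)
    scores: dict[str, float] = {}
    for _ in range(iterations):
        # (i) propose a mutated query from a randomly chosen parent
        query = propose_mutation(random.choice(pool), goal)
        q_vec = embed(query)
        # (ii) score retrieval coherence against the task goal, with an LLM judge
        hits = sorted(corpus, key=lambda d: cosine(q_vec, corpus[d]), reverse=True)[:5]
        coherence = float(np.mean([cosine(goal_vec, corpus[d]) for d in hits]))
        score = 0.5 * coherence + 0.5 * llm_relevance(goal, query, hits)  # judge in [0, 1]
        # (iii) promote useful queries; prune the weakest when over budget
        if score > min(scores.values(), default=0.0) or len(scores) < keep:
            scores[query] = score
            pool.append(query)
        if len(scores) > keep:
            scores.pop(min(scores, key=scores.get))
    # union of documents retrieved by the surviving queries forms the knowledge unit
    promoted: set[str] = set()
    for q in scores:
        q_vec = embed(q)
        promoted.update(sorted(corpus, key=lambda d: cosine(q_vec, corpus[d]), reverse=True)[:3])
    return sorted(promoted)
```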
3. Artifact Production
At each state transition, the system emits machine-readable artifacts: local plans, selected tools, inputs, outputs, error traces, test results, knowledge-graph deltas, and acceptance judgments. Artifacts are addressable and can be replayed for post-mortem or reinforcement of future plans.
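One plausible, content-addressable artifact record is sketched below; the field names mirror the enumeration above, but the schema itself is an assumption rather than a fixed Vibemind format.

```python
# Illustrative content-addressable step artifact (hypothetical schema).
import hashlib
import json
from dataclasses import dataclass, field, asdict


@dataclass
class StepArtifact:
    step: int
    plan: str
    tool: str
    inputs: dict
    outputs: dict
    error_trace: str | None = None
    test_results: dict = field(default_factory=dict)
    graph_delta: dict = field(default_factory=dict)
    accepted: bool | None = None

    def artifact_id(self) -> str:
        """Stable hash so artifacts are addressable and replayable."""
        blob = json.dumps(asdict(self), sort_keys=True, default=str).encode()
        return hashlib.sha256(blob).hexdigest()[:16]
```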
4. De Moiré: Screen-State Perception and Stealth
A high-performance Moiré Module (e.g., implemented in C++) imposes a controlled moiré interference field on the desktop capture and uses OCR/pattern recognition to estimate target element poses. The module computes the cursor-to-target vector and executes clicks adaptively even when absolute coordinates drift, providing resilience across window layouts, DPI changes, and application skins. A “De-Moiré blackout” mode blanks the visible screen while perception continues internally, enabling stealth operation without losing state understanding. Continuous motion prediction and OCR heatmaps unify perception, control, and concealment in one sensory-cognitive interface.
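The adaptive-click behavior can be sketched independently of the moiré field itself: the target pose is re-estimated before every cursor step, so absolute-coordinate drift is absorbed. In the sketch, `locate_target`, `cursor_pos`, `move`, and `click` are injected stand-ins (e.g., bound to the Moiré Module's pose estimator and to PyAutoGUI or an OS-level input injector).

```python
# Sketch of adaptive, position-independent clicking along the
# cursor-to-target vector, re-estimated each step.
import math
from typing import Callable, Optional, Tuple

Point = Tuple[float, float]


def adaptive_click(locate_target: Callable[[], Optional[Point]],
                   cursor_pos: Callable[[], Point],
                   move: Callable[[float, float], None],
                   click: Callable[[], None],
                   step_px: float = 40.0, tolerance_px: float = 3.0,
                   max_steps: int = 200) -> bool:
    """Walk the cursor toward the (re-estimated) target and click on arrival."""
    for _ in range(max_steps):
        target = locate_target()           # fresh pose estimate each iteration
        if target is None:
            return False                   # element vanished; let the event loop recover
        cx, cy = cursor_pos()
        dx, dy = target[0] - cx, target[1] - cy
        dist = math.hypot(dx, dy)
        if dist <= tolerance_px:
            click()
            return True
        scale = min(1.0, step_px / dist)   # bounded step along the cursor-to-target vector
        move(cx + dx * scale, cy + dy * scale)
    return False
```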
5. Time-Based Live Screen + OCR Zones + State Engine
A secondary OCR layer samples the screen at 1 Hz (or on demand) and pushes observations to a state engine (e.g., n8n) that recognizes macro-states such as “Excel open,” “shell finished,” or “process stuck.” Position control can be implemented with PyAutoGUI or equivalent. This layer supports multi-PC extension with edge functions and synchronized browser views, enabling task distribution and federated learning of UI dynamics across machines.
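A minimal sampler is sketched below; the webhook URL, payload shape, and the `ocr_zone` callable are assumptions standing in for the actual capture/OCR backend and the state-engine contract (n8n or equivalent).

```python
# Illustrative 1 Hz OCR-zone sampler pushing observations to an external
# state engine. Zone coordinates and the webhook URL are hypothetical.
import json
import time
import urllib.request
from typing import Callable

STATE_ENGINE_URL = "http://localhost:5678/webhook/vibemind-state"  # assumed n8n webhook

ZONES = {
    "titlebar": (0, 0, 1200, 40),       # (left, top, width, height)
    "shell":    (0, 600, 1200, 300),
}


def push_observation(obs: dict) -> None:
    req = urllib.request.Request(
        STATE_ENGINE_URL,
        data=json.dumps(obs).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)


def sample_loop(ocr_zone: Callable[[tuple], str], period_s: float = 1.0) -> None:
    """`ocr_zone((l, t, w, h)) -> str` must be bound to a screen-capture + OCR backend."""
    while True:
        started = time.monotonic()
        obs = {"ts": time.time(),
               "zones": {name: ocr_zone(bbox) for name, bbox in ZONES.items()}}
        push_observation(obs)            # state engine maps zone text to macro-states
        time.sleep(max(0.0, period_s - (time.monotonic() - started)))
```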
6. Knowledge-Graph Production by Agent Teams
A team of specialized agents transforms chunked inputs into requirements, memories, and prompt adjustments. The knowledge graph is a JSON object enriched with contextual labels, linking requirements to source evidence, code locations, and tests. Graph deltas inform downstream coding agents (e.g., multiple code copilots) and constrain acceptance checks to traceable evidence, one trace per responsible agent.
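An illustrative graph delta is shown below; node and edge labels (`Requirement`, `Evidence`, `implemented_by`, and so on) and all identifiers are hypothetical, chosen only to show how a requirement links to source evidence, a code span, a test, and the responsible agents.

```python
# Illustrative typed-JSON knowledge-graph delta (hypothetical labels/ids).
import json

delta = {
    "nodes": [
        {"id": "req-014", "type": "Requirement",
         "text": "Export report as XLSX with one sheet per project bucket"},
        {"id": "ev-203",  "type": "Evidence", "source": "chunk://spec/07#p3"},
        {"id": "code-88", "type": "CodeSpan", "path": "exporter/xlsx.py", "lines": "40-92"},
        {"id": "test-31", "type": "Test", "name": "test_xlsx_bucket_sheets"},
    ],
    "edges": [
        {"from": "req-014", "to": "ev-203",  "rel": "supported_by"},
        {"from": "req-014", "to": "code-88", "rel": "implemented_by", "agent": "coder-2"},
        {"from": "req-014", "to": "test-31", "rel": "accepted_by",    "agent": "tester-1"},
    ],
}

print(json.dumps(delta, indent=2))       # delta is routed to coding agents as-is
```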
7. Kotlin Puzzle + Continuous-Thought Machines (CTMs)
Multi-agent conversations and tool sequences are serialized into Kotlin “puzzles” that encode valid paths to a goal (e.g., 120-step solvable sequences with breadth-first search metrics). CTMs are trained to solve these puzzles while learning control actions such as pause/stop/switch-strategy. A “Kurograph” visualization divides the process into buckets and displays simultaneous processes and dependencies. The continuous-processing architecture feeds each puzzle’s outcome back as new training data, enabling the CTM to orchestrate future routing decisions and to terminate unproductive loops.
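The structural idea can be sketched as follows. The design specifies Kotlin for the serializer; for consistency with the other sketches in this document the same shape is rendered in Python, with hypothetical field names, a BFS-derived optimality metric, and the control-action vocabulary the CTM learns to emit.

```python
# Structural sketch of a puzzle encoding for interaction traces.
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Puzzle:
    start: str
    goal: str
    moves: dict[str, list[str]]                 # state -> reachable next states
    recorded_trace: list[str] = field(default_factory=list)  # agent/tool-labeled steps

    def bfs_optimal_length(self) -> int | None:
        """Shortest path length from start to goal, or None if unsolvable."""
        seen, frontier = {self.start}, deque([(self.start, 0)])
        while frontier:
            state, depth = frontier.popleft()
            if state == self.goal:
                return depth
            for nxt in self.moves.get(state, []):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
        return None

    def optimality_gap(self) -> int | None:
        """Training label: extra steps the agents took beyond the BFS optimum."""
        optimal = self.bfs_optimal_length()
        return None if optimal is None else len(self.recorded_trace) - optimal


# Control actions the CTM learns to emit while solving puzzles.
CONTROL_ACTIONS = ("continue", "pause", "terminate", "switch_strategy")
```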
8. Multi-Agent MCP Integration
The OS integrates MCP servers for Playwright, Docker, Git, filesystem, and others. All invocations are event-based with an explicit user-clarification path when uncertain. A DevOps subsystem accelerates registering new MCP servers and provides standardized logs and metrics for the learning modules.
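A generic registry sketch follows; it is not the MCP SDK surface, `MCPAdapter.invoke` abstracts over whatever transport a given server actually exposes, and the confidence threshold that triggers the user-clarification path is an assumed policy parameter.

```python
# Adapter-registry sketch for event-based MCP integration (illustrative only).
from typing import Callable, Protocol


class MCPAdapter(Protocol):
    name: str
    def invoke(self, tool: str, args: dict) -> dict: ...


class MCPRouter:
    def __init__(self, ask_user: Callable[[str], str], min_confidence: float = 0.7):
        self._adapters: dict[str, MCPAdapter] = {}
        self._ask_user = ask_user
        self._min_confidence = min_confidence

    def register(self, adapter: MCPAdapter) -> None:
        """DevOps subsystem calls this when onboarding a new MCP server."""
        self._adapters[adapter.name] = adapter

    def handle(self, event: dict) -> dict:
        server, tool = event["server"], event["tool"]
        confidence = event.get("confidence", 1.0)
        if confidence < self._min_confidence:
            # Explicit user-clarification path before any side-effecting call.
            answer = self._ask_user(f"Proceed with {server}.{tool}({event['args']})?")
            if answer.strip().lower() != "yes":
                return {"status": "declined", "server": server, "tool": tool}
        result = self._adapters[server].invoke(tool, event["args"])
        return {"status": "ok", "server": server, "tool": tool, "result": result}
```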
Representative Implementations
- Autonomous build-and-test: The loop selects a “Coding” team, runs Playwright MCP to exercise flows, captures OCR/state signals, and retries with altered context when tests fail.
- Cross-app UI automation: De-Moiré yields cursor-target vectors robust to layout shifts; the engine recognizes “stuck” states and triggers agent fallback.
- Requirements mining: Graph agents derive requirements from chunked sources and route them to appropriate code agents with acceptance criteria.
Advantages
- Robust UI control independent of absolute coordinates through moiré-assisted perception and OCR.
- Continual self-improvement by encoding runs as puzzles and training CTMs on execution traces.
- Efficient context curation via Monte-Carlo exploration and LLM judging, reducing hallucinations and tool thrash.
- Traceable, graph-driven requirements-to-code routing with per-agent accountability.
Claims
1. System claim
- A computer-implemented system comprising:
a. an event loop configured to receive intents and environment events and to dispatch tasks to a plurality of specialized software agents and Model Context Protocol (MCP) tools;
b. a context-engineering module configured to maintain an embedding space of documents, code fragments, and traces, to propose query mutations using a Monte-Carlo exploration process, and to prune and promote clusters based on task coherence;
c. a screen-state perception module comprising a de-moiré analyzer that imposes or detects a moiré interference field, performs optical character recognition over regions of interest, computes cursor-to-target vectors, and executes adaptive, position-independent UI actions, the module further comprising a stealth mode in which visible output is blanked while analysis continues;
d. a time-based OCR subsystem configured to sample the screen at a fixed or on-demand cadence and to publish observations to a state engine that recognizes application and workflow states;
e. a knowledge-graph generator comprising agent teams that extract requirements and memories as a typed JSON graph and route sub-tasks with associated acceptance criteria to coding agents; and
f. a learning subsystem that encodes multi-agent conversations and tool sequences as puzzle representations and trains a continuous-thought model to solve said puzzles and to emit control actions including pause, terminate, and strategy switch,
wherein the event loop updates routing policies based on outputs from the context-engineering module, the screen-state perception module, the state engine, the knowledge-graph generator, and the learning subsystem.
2. Method claim
- A method for autonomous software task execution, comprising:
receiving an intent; selecting an initial agent team and MCP toolset; generating candidate context queries via Monte-Carlo exploration; retrieving and pruning context; executing UI actions using de-moiré-assisted OCR with adaptive cursor-to-target control; sampling screen state to update a state engine; emitting artifacts per step; constructing knowledge-graph deltas and routing sub-tasks; serializing interaction traces as puzzles; training or updating a continuous-thought model on said puzzles; and adjusting future routing and termination policies according to the model’s control outputs.
3. Computer-readable medium claim
- A non-transitory computer-readable medium storing instructions which, when executed by one or more processors, cause the processors to perform the method of claim 2.
4. Dependent claims
- The system of claim 1, wherein the de-moiré analyzer computes a spatial field over the entire desktop and infers element pose by correlating interference gradients with OCR-detected glyph clusters.
- The system of claim 1, wherein the state engine identifies at least one of: “application open,” “shell finished,” and “process stuck,” and triggers policy-driven retries.
- The system of claim 1, wherein the knowledge graph links requirements to source evidence, code spans, test assets, and acceptance results, and constrains agent outputs to evidence-backed changes.
- The system of claim 1, wherein the learning subsystem measures puzzle optimality by breadth-first search path length and labels interaction steps with agent-tool semantics.
- The system of claim 1, wherein the time-based OCR subsystem distributes observations via edge functions across multiple client machines to form a federated state model.
- The method of claim 2, further comprising emitting a compact “context snippet” for each tool call to minimize prompt size while preserving provenance.
- The method of claim 2, wherein termination is selected when CTM-predicted utility falls below a threshold conditioned on recognized “stuck” states.
- The system of claim 1, further comprising a DevOps subsystem to register new MCP servers and normalize logs/metrics for the learning subsystem.
- The system of claim 1, wherein stealth mode blanks user-visible output while continuing internal capture and OCR to maintain closed-loop control.
Enablement Notes
- Agents/MCPs: Playwright for browser control; Docker for environment lifecycle; Git/GitHub for VCS; filesystem and shell MCPs for local ops.
- Perception: C++ or high-performance module for moiré field generation; OCR via standard libraries; geometric cursor vector solving in the presence of DPI scaling and window transforms.
- State engine: n8n or equivalent converts OCR events into state machines; triggers retries, fallbacks, or human-in-the-loop clarification.
- Learning: Kotlin puzzle serializer; CTM training pipeline consuming session logs and graph labels; Kurograph visualization for concurrent processes and dependencies.
Industrial Applicability
The system applies to autonomous testing, app prototyping, software maintenance, data entry, RPA, and resilient cross-app automation where layouts change, context is large, and continuous improvement is required.
Disclosure Integrity
All technical features above are supported by the provided Vibemind design materials describing the event-driven multi-agent loop, context engineering, moiré/OCR perception, time-based OCR zones with state engines, knowledge-graph production, Kotlin puzzle/CTM learning, and MCP integrations.