Changelog
All notable changes to WASP are documented here. Versions follow a semantic versioning scheme after the initial phase-based development (Phases 1–18).
v2.6 — April 30, 2026
Focus: Edge Fix Pass (6 surgical fixes) + Final Pre-Production Hardening Pass (10 fixes) + Panic Reset dashboard page
Highlights
- 6 edge-case fixes that close specific hallucination and correctness failure modes without changing golden-path behavior.
- 10 pre-production hardening fixes covering security, reliability, observability, and learning quality.
- Panic Reset page — single-operation hard reset with confirmation gate.
Edge Fix Pass (April 30, 2026)
Six surgical fixes applied across events/handlers.py and policy/response_guard.py. Each fix targets a specific hallucination or correctness failure mode.
1. Low-Intent Cold-Start Guard (events/handlers.py)
Short messages on a fresh chat without prior context are one of the most reliable hallucination triggers. Without an anchor in chat memory, a single token like "ok" or a context-required phrase like "do the same" gives the LLM nothing to ground on, and it reaches for training-data noise.
New deterministic guard:
- _LOW_INTENT_TOKENS frozenset — single-token confirmations and acknowledgements (multilingual).
- _LOW_INTENT_EMOJI_RE — emoji / digit / punctuation-only messages.
- _CONTEXT_REQUIRED_PHRASES_RE — phrases like "do the same", "again", "same as before" that require prior context to be meaningful.
_is_low_intent() returns True when: single ambiguous token, emoji-only, context-required phrase without anchor, or ≤2 tokens all in the ambiguous set. When low-intent + no scheduled-language match + no last_exchange anchor, the handler returns a clarification fast-path in the user's detected language and never invokes the LLM.
Greetings ("hello" / "hi" / "hey" / "ping") are intentionally excluded because they have a dedicated friendly-response path. The guard is also bypassed for [RETRY OF PREVIOUS: messages, which carry an explicit anchor.
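The guard described above can be sketched roughly as follows. This is a minimal illustration, not the code in events/handlers.py: the token set, the three regex bodies, the `_GREETINGS` set, and the `needs_clarification()` wrapper are all assumptions; only the constant names come from this changelog.

```python
import re

# Hypothetical contents: the changelog names these constants but not their values.
_LOW_INTENT_TOKENS = frozenset({"ok", "k", "yes", "no", "sure", "si", "vale"})
_LOW_INTENT_EMOJI_RE = re.compile(r"^[\W\d_]+$")  # no letters anywhere
_CONTEXT_REQUIRED_PHRASES_RE = re.compile(
    r"^\s*(?:do the same|again|same as before)\s*[.!?]*\s*$", re.IGNORECASE
)
_GREETINGS = frozenset({"hello", "hi", "hey", "ping"})  # handled by a separate path


def is_low_intent(text: str, has_anchor: bool) -> bool:
    """Return True when the message alone gives the LLM nothing to ground on."""
    stripped = text.strip()
    if not stripped or stripped.startswith("[RETRY OF PREVIOUS:"):
        return False  # retries carry an explicit anchor
    if _LOW_INTENT_EMOJI_RE.match(stripped):
        return True  # emoji / digit / punctuation only
    if _CONTEXT_REQUIRED_PHRASES_RE.match(stripped):
        return not has_anchor  # only meaningless without prior context
    tokens = [t.strip(".,!?").lower() for t in stripped.split()]
    if len(tokens) == 1 and tokens[0] in _GREETINGS:
        return False
    return len(tokens) <= 2 and all(t in _LOW_INTENT_TOKENS for t in tokens)


def needs_clarification(text: str, has_anchor: bool, scheduled_match: bool) -> bool:
    """Fast-path decision: ask for clarification without calling the LLM."""
    return is_low_intent(text, has_anchor) and not scheduled_match and not has_anchor
```

Because every branch is a set lookup or a regex match, the decision is deterministic and cheap enough to run before any model call.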
2. Multi-URL Aggregator: Error: Prefix Detection (events/handlers.py ~line 5385)
When auto-detect resolves multiple URLs in a single user message, the multi-URL aggregator builds a deterministic per-URL outcome list. The browser skill returns success=True even when its output begins with Error: URL blocked... (SSRF blocklist hit, file:// block, RFC-1918 block) — success only means "the skill itself didn't crash", not "the URL was reachable".
Before: SSRF-blocked URLs labeled ✅ navigated.
After: Error: prefix detected first; URL labeled ❌ with the first-line error message.
Outcome icons:
| Icon | Meaning |
|---|---|
| ✅ navegado / navigated | Browser reached the URL successfully without screenshot |
| ✅ captura enviada / screenshot sent | Screenshot captured and attached |
| 🚫 bloqueado / blocked | Login wall or captcha ([CAPTURE_VALID: false]) |
| ❌ <error line> | Error: prefix detected (SSRF, file://, RFC-1918, etc.) — first 120 chars |
Test case (passes): captura https://192.168.0.1 y https://10.0.0.1 → both URLs correctly labeled ❌.
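A rough sketch of the labeling order described above, assuming the skill output arrives as a plain string. The function name `label_url_outcome` and the `screenshot_sent` flag are hypothetical; only the `Error:` prefix, the `[CAPTURE_VALID: false]` marker, and the 120-character cap come from this changelog.

```python
def label_url_outcome(output: str, screenshot_sent: bool = False) -> str:
    """Map one browser-skill output to an outcome label.

    The Error: prefix is checked first because success=True only means the
    skill did not crash, not that the URL was reachable.
    """
    lines = output.strip().splitlines()
    first_line = lines[0] if lines else ""
    if first_line.startswith("Error:"):
        return "❌ " + first_line[:120]
    if "[CAPTURE_VALID: false]" in output:
        return "🚫 blocked"
    if screenshot_sent:
        return "✅ screenshot sent"
    return "✅ navigated"
```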
3. Agent Name Preservation: Non-Greedy Regex (events/handlers.py)
_AGENT_NAME_PATTERNS (multilingual alternations matching named X and equivalent constructions) was greedy — for an input like create an agent named Crypto Watcher that monitors prices every 3 hours, the regex captured the entire phrase as the name.
Fix: non-greedy multi-token capture [\w-]+(?:\s+[\w-]+){0,4}? with a lookahead stop-set on clause connectors (that/to/with/for/on/in and equivalents) and punctuation. Quoted form (named "Foo Bar") takes priority via the first alternation group.
| Input | Extracted Name |
|---|---|
| create an agent named Bob to track news | Bob |
| create an agent named "Crypto Watcher" for BTC alerts | Crypto Watcher |
| create an agent named News Watcher that monitors RSS feeds every hour | News Watcher |
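The extraction behavior in the table can be reproduced with a simplified, English-only pattern. This is an assumption-laden sketch: `_AGENT_NAME_RE` and `extract_agent_name()` are hypothetical names, and the real `_AGENT_NAME_PATTERNS` is multilingual.

```python
import re

_AGENT_NAME_RE = re.compile(
    r'named\s+(?:"([^"]+)"'                  # quoted form takes priority
    r"|([\w-]+(?:\s+[\w-]+){0,4}?)"          # lazy capture, at most five tokens
    r"(?=\s+(?:that|to|with|for|on|in)\b|\s*[.,;:!?]|\s*$))",  # stop-set lookahead
    re.IGNORECASE,
)


def extract_agent_name(text: str):
    """Return the agent name, or None when no 'named X' construction is found."""
    match = _AGENT_NAME_RE.search(text)
    if match is None:
        return None
    return match.group(1) or match.group(2)
```

The lazy quantifier makes the engine try the shortest name first and extend it one token at a time until the lookahead finds a clause connector, punctuation, or the end of the input.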
4. Schedule Honesty Bidirectional (policy/response_guard.py)
task_manager only supports interval scheduling. It does NOT support fixed clock times (at 9am) or daypart phrases (in the morning). Two directions of dishonesty were possible:
- Agent-side lie (existing): response asserts a clock time → guard strips it.
- NEW user-side silent misinterpretation: user requests a clock time or daypart, agent creates an interval task without disclosing the gap. Operator believes their schedule was honored.
New behavior:
- User-side clock-time branch: when the user requested a specific clock time (matches AM/PM, :, or o'clock) AND has_real_task_create(skill_results) returns True → response is appended with a disclaimer that the task does not run at the requested clock time and that task_manager only supports interval scheduling (every N hours from creation time).
- NEW daypart branch (DAYPART_CLAIM_RE): matches phrases like in the morning, every evening, at dawn, at noon, at midnight, and equivalents in supported languages → identical disclaimer family.
Trace fields: had_real_create=True, claimed_time=<requested>, origin in {"user_text", "user_text_daypart"}. Guard skipped when numeric match without AM/PM/colon (likely an interval expression).
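The two user-side branches might be classified along these lines. The regex bodies, `CLOCK_TIME_RE`, and `schedule_claim_origin()` are assumptions for illustration; only `DAYPART_CLAIM_RE` and the two origin values are named in this changelog.

```python
import re

# Hypothetical English-only patterns; the real ones cover more languages.
CLOCK_TIME_RE = re.compile(
    r"\b\d{1,2}:\d{2}\b|\b\d{1,2}\s*(?:am|pm)\b|\b\d{1,2}\s*o'clock\b",
    re.IGNORECASE,
)
DAYPART_CLAIM_RE = re.compile(
    r"\b(?:in the morning|every evening|at dawn|at noon|at midnight)\b",
    re.IGNORECASE,
)


def schedule_claim_origin(user_text: str, had_real_create: bool):
    """Return the trace origin when a disclaimer is due, else None."""
    if not had_real_create:
        return None  # no task was actually created, nothing to disclose
    if CLOCK_TIME_RE.search(user_text):
        return "user_text"
    if DAYPART_CLAIM_RE.search(user_text):
        return "user_text_daypart"
    return None  # a bare number such as "every 3 hours" is an interval
```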
5. Markdown Link Sanitizer (policy/response_guard.py)
The sanitizer previously handled image syntax ![alt](url) only. Plain [text](url) markdown links leaked through and rendered literally in Telegram.
Fix: new _MARKDOWN_LINK_RE = re.compile(r"(?<!!)\[([^\]]+)\]\((https?://[^\s)]+)\)"). Matches collapse to text (url) form so the URL stays accessible without raw markdown chars rendering literally. Negative lookbehind (?<!!) prevents double-handling of image syntax.
Example:
- Before: See more details at [this page](https://example.com/foo) → renders with literal []() brackets
- After: See more details at this page (https://example.com/foo) → readable
Trace records stripped: ["link"] for post-hoc analysis.
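A small sketch of the collapse step using the pattern quoted above. The wrapper `collapse_markdown_links()` and its return shape are assumptions; the regex is the one given in this entry.

```python
import re

_MARKDOWN_LINK_RE = re.compile(r"(?<!!)\[([^\]]+)\]\((https?://[^\s)]+)\)")


def collapse_markdown_links(text: str):
    """Rewrite [text](url) as 'text (url)' and report what was stripped."""
    cleaned, count = _MARKDOWN_LINK_RE.subn(r"\1 (\2)", text)
    return cleaned, (["link"] if count else [])
```

Image syntax is left untouched here because the negative lookbehind refuses any opening bracket preceded by `!`.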
6. Entity-Proximity Verdict Check (policy/response_guard.py)
The verdict-evidence check guards against tracking-code hallucinations: when a user pastes a postal/courier tracking code, the browser may navigate to a tracking-site home page that contains words like "delivered" as static UI labels for unrelated shipments. Without entity proximity, the LLM could stitch these unrelated labels with the user's specific tracking code into a fabricated claim.
The previous check was too lenient — verdict word anywhere in body counted as evidence.
Fix: 200-character proximity window between any user-named entity and any verdict occurrence:
- _user_named_entities() extracts user-specified codes via _USER_ENTITY_RE:
  - [A-Z]{2}\d{9}[A-Z]{2} — China Post / EMS format
  - 1Z[A-Z0-9]{16} — UPS format
  - \d{12,22} — generic long numeric (FedEx etc.)
  - [A-Z]{4,6}\d{6,10} — alphanumeric mix
- _skill_output_supports_verdict() extended with user_entities parameter.
- When entities present and verdict never co-occurs within 200 chars of any entity in any successful skill body → not evidence → honest fallback returned.
_has_useful_skill_data() also extended: outputs that don't contain user-specified entities don't count as useful data for factual grounding.
Verdict keywords cover delivery states in supported languages (delivered, in transit, out for delivery, pending, received, in customs, and equivalents).
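The proximity rule can be illustrated as below. This is a sketch under stated assumptions: the four entity patterns are taken from this entry, but the verdict list is English-only, the function names are simplified, and distance is measured between match start offsets, which the changelog does not specify.

```python
import re

_USER_ENTITY_RE = re.compile(
    r"\b(?:[A-Z]{2}\d{9}[A-Z]{2}|1Z[A-Z0-9]{16}|\d{12,22}|[A-Z]{4,6}\d{6,10})\b"
)
# Hypothetical English-only verdict list.
_VERDICT_RE = re.compile(
    r"delivered|in transit|out for delivery|pending|received|in customs",
    re.IGNORECASE,
)
PROXIMITY_WINDOW = 200


def user_named_entities(user_text: str) -> set:
    return set(_USER_ENTITY_RE.findall(user_text))


def supports_verdict(body: str, entities: set) -> bool:
    """True when a verdict word sits within the window of a user entity."""
    verdict_starts = [m.start() for m in _VERDICT_RE.finditer(body)]
    if not entities:
        return bool(verdict_starts)  # no entity named: verdict anywhere counts
    for entity in entities:
        pos = body.find(entity)
        while pos != -1:
            if any(abs(v - pos) <= PROXIMITY_WINDOW for v in verdict_starts):
                return True
            pos = body.find(entity, pos + 1)
    return False
```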
Final Pre-Production Hardening Pass (April 15, 2026 — 10 Fixes)
New
- Panic Reset page (/reset): hard-confirmation UI (must type "RESET WASP", paste blocked) that wipes all 17 DB tables, 12+ Redis key patterns, agent identity XP/birth date, self-model, and runs VACUUM FULL. Progress streams live to the page; result shows in a green-bordered card with WaspToast notification. AuditLog entry written for every reset.
- Weekly VACUUM ANALYZE (scheduler/db_maintenance.py): DbMaintenanceJob runs every 604,800 s (weekly) outside a transaction (AUTOCOMMIT) — reclaims dead-row space and updates PostgreSQL planner statistics without table locking.
- Shell invocation audit logging (skills/builtin/shell.py): every shell skill call writes one AuditLog row with action="skill.shell", redacted command (strips sk-, AIza, xai-, hf_, key=value passwords), exit code, and optional goal_id context. Fire-and-forget via asyncio.ensure_future().
- Health dashboard: behavioral queue depth: new "Learning Queue" panel in the Safety & Execution Control grid shows behavioral:pending queue depth / 50 cap with progress bar — green (0–19), yellow (≥20), red (≥40, LLM storm risk).
Security
- SSRF blocklist on fetch_url (skills/builtin/fetch_url.py): imported _is_ssrf_target() from http_request.py; blocks RFC-1918, loopback, and cloud metadata endpoints before any HTTP connection. fetch_url now matches http_request SSRF protection.
- Self-improve syntax validation + backup (dashboard/routes/self_improve.py): ast.parse() validates Python content before write (rejects with HTTP 400 on SyntaxError); shutil.copy2() creates timestamped backup at /data/src_patches/backup_{ts}_{filename} before overwrite; backup_path returned in success JSON.
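The validate-then-backup sequence for self-improve writes could look roughly like this. The function name, the returned dict shape, and the timestamp format are assumptions; `ast.parse()`, `shutil.copy2()`, the HTTP 400 rejection, and the `backup_{ts}_{filename}` naming come from this entry.

```python
import ast
import shutil
import time
from pathlib import Path


def validate_and_backup(target: Path, new_content: str, backup_dir: Path) -> dict:
    """Reject invalid Python, back up the old file, then overwrite it."""
    try:
        ast.parse(new_content)
    except SyntaxError as exc:
        # Nothing is written when the content does not parse.
        return {"status": 400, "error": f"SyntaxError: {exc.msg} (line {exc.lineno})"}

    backup_path = None
    if target.exists():
        backup_dir.mkdir(parents=True, exist_ok=True)
        ts = time.strftime("%Y%m%d_%H%M%S")  # assumed timestamp format
        backup_path = backup_dir / f"backup_{ts}_{target.name}"
        shutil.copy2(target, backup_path)  # copies contents and metadata

    target.write_text(new_content, encoding="utf-8")
    return {"status": 200, "backup_path": str(backup_path) if backup_path else None}
```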
Reliability
- Boot model liveness ping (events/handlers.py): _run_boot_sequence() performs an 8 s timeout, max_tokens=1 LLM ping; boot message shows "live ✓" or "unreachable ✗"; explicit warning when unreachable directs operator to /models.
- Boot cognitive-state warning (events/handlers.py): fresh/post-reset boot message warns that all memory has been cleared and sets expectations for the rebuild period.
- Behavioral queue drop-count logging (memory/behavioral.py): existing 50-item cap enhanced with precise before/after llen tracking; logs behavioral.queue_cap_trimmed with dropped count when items are evicted.
Learning Quality
- Behavioral rule conflict detection (memory/behavioral.py): _NEGATION_WORDS frozenset + _has_conflict() function; detects contradictions (35% core-word overlap + negation asymmetry); logs behavioral.rule_conflict_detected — rule is saved but conflict is surfaced for operator review.
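One plausible reading of "35% core-word overlap + negation asymmetry" is sketched here. Both word sets, the tokenization, and the choice to normalize overlap by the smaller rule are assumptions; the changelog gives only the threshold and the two criteria.

```python
# Hypothetical word lists; the real _NEGATION_WORDS contents are not published here.
_NEGATION_WORDS = frozenset({"not", "never", "no", "don't", "avoid", "stop"})
_FILLER_WORDS = frozenset({"the", "a", "an", "to", "of", "and", "for", "in", "always"})


def _words(rule: str) -> set:
    return {w.strip(".,!?").lower() for w in rule.split()} - {""}


def has_conflict(rule_a: str, rule_b: str, threshold: float = 0.35) -> bool:
    """Flag two rules that share a topic but differ in whether they negate it."""
    words_a, words_b = _words(rule_a), _words(rule_b)
    core_a = words_a - _NEGATION_WORDS - _FILLER_WORDS
    core_b = words_b - _NEGATION_WORDS - _FILLER_WORDS
    if not core_a or not core_b:
        return False
    overlap = len(core_a & core_b) / min(len(core_a), len(core_b))
    negated_a = bool(words_a & _NEGATION_WORDS)
    negated_b = bool(words_b & _NEGATION_WORDS)
    return overlap >= threshold and negated_a != negated_b
```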
Scale update
| Metric | v2.5 | v2.6 |
|---|---|---|
| Scheduler jobs | 40 | 41 |
| AuditLog action types | skill.self_improve | + skill.shell, agent.reset |
| Health panel count | 3 | 4 |
| Dashboard pages | 23 | 24 (+ /reset) |
| Response-guard checks | 9 | 11 (entity-proximity, schedule daypart) |
v2.5 — April 7, 2026
Focus: Dashboard restructure, production audit fixes, 2026 model catalog update, new cognitive subsystems
New
- Dashboard full restructure: sidebar reorganized into 5 sections (Overview, Cognition, Memory, Tools, Operations); 5 new pages added:
  - /self-improve — read/propose/apply/reject code patches with diff view and syntax validation
  - /behavioral — view, filter, and delete behavioral rules learned from corrections
  - /knowledge — browse knowledge graph nodes and relations with entity type filters
  - /subscriptions — manage RSS feeds and price alert subscriptions
  - /config — runtime config overrides (config:overrides Redis hash) without container restart
- HealthState Adaptive Execution: HealthMonitor evaluates composite health score every 60s; when score drops below 70, health_state=DEGRADED flag set in Redis; scheduler jobs and anticipatory simulation check flag and downgrade to lightweight mode
- SaccadicVision Change Detection Daemon (scheduler/saccadic.py): SaccadicVisionJob runs every 600s; takes periodic browser screenshots of monitored URLs; pixel-diff comparison via Pillow; sends Telegram alert when visual change exceeds 8% threshold; state stored in saccadic_vision DB table
- Dream Failure Pattern Analysis: DreamJob now includes a failure-pattern phase — queries last 48h of audit_log for error spikes, identifies top-3 failing skills, injects findings into dream reflection prompt for proactive repair suggestions
- Self-Improve Soft Safety Gate (dashboard/routes/self_improve.py): _self_improve_soft_gate() deterministic pattern check runs before any write; BLOCK if content targets critical paths (sandbox.py, control_layer.py, behavioral.py, response_grounder.py) AND contains safety-weakening patterns (disable sandbox, bypass guard, _HIGH_RISK_ACTIONS=frozenset()); WARN on large patches to critical paths; all decisions logged to audit_log
- Self-Improve diff-awareness precision patch: patch apply logic uses old_text/new_text pairs with exact context matching; rejects ambiguous patches where old_text appears more than once in the file
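The ambiguity rule for old_text/new_text patches amounts to requiring exactly one occurrence before replacing. A minimal sketch, assuming the file is already loaded as a string (`apply_patch` is a hypothetical name):

```python
def apply_patch(source: str, old_text: str, new_text: str) -> str:
    """Replace old_text with new_text only when the match is unambiguous."""
    if not old_text:
        raise ValueError("old_text must not be empty")
    occurrences = source.count(old_text)
    if occurrences == 0:
        raise ValueError("old_text not found in file")
    if occurrences > 1:
        raise ValueError(f"ambiguous patch: old_text appears {occurrences} times")
    return source.replace(old_text, new_text, 1)
```

Rejecting multi-match patches trades convenience for safety: the caller has to widen old_text with surrounding context until it identifies a single location.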
Security
- SHA-256 sidecar verification (utils/patch_integrity.py): every file written by self_improve patch action generates a .sha256 sidecar at /data/src_patches/{filename}.sha256; apply_persisted_patches() at startup verifies hash before applying — tampered patches are rejected with CRITICAL log
- CSP unsafe-eval documented: Content-Security-Policy header audit confirmed unsafe-eval required by Jinja2 template rendering; documented in security page with mitigation notes
- config:overrides runtime: sensitive config keys (ANTHROPIC_API_KEY, TELEGRAM_BOT_TOKEN, DB_URL) blocked from override via _BLOCKED_OVERRIDE_KEYS set; all override writes logged to audit_log
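Sidecar verification of persisted patches can be sketched with the standard library alone. The helper names below are hypothetical and the sidecar is assumed to hold a bare hex digest; only the `.sha256` suffix convention comes from this entry.

```python
import hashlib
import hmac
from pathlib import Path


def _sidecar_for(patch_path: Path) -> Path:
    return patch_path.with_name(patch_path.name + ".sha256")


def write_sidecar(patch_path: Path) -> Path:
    """Record the SHA-256 of a freshly written patch next to it."""
    digest = hashlib.sha256(patch_path.read_bytes()).hexdigest()
    sidecar = _sidecar_for(patch_path)
    sidecar.write_text(digest, encoding="ascii")
    return sidecar


def verify_patch(patch_path: Path) -> bool:
    """Return False when the sidecar is missing or the hash does not match."""
    sidecar = _sidecar_for(patch_path)
    if not sidecar.exists():
        return False
    expected = sidecar.read_text(encoding="ascii").strip()
    actual = hashlib.sha256(patch_path.read_bytes()).hexdigest()
    return hmac.compare_digest(expected, actual)
```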
Reliability
- Audit log retention job (scheduler/audit_retention.py): AuditRetentionJob runs every 86,400s (daily); hard-deletes rows older than 90 days from audit_log; logs audit_retention.deleted count=N
- Bounded Redis Streams: events:incoming and events:outgoing streams capped at 10,000 entries via MAXLEN ~10000 on every XADD; prevents unbounded memory growth during Telegram polling storms
- PEL (Pending Entry List) recovery: StreamConsumerJob checks PEL length every tick; entries idle >30min are claimed and re-delivered; prevents message loss on container crash mid-processing
- Keyset pagination for memory queries (memory/learning.py): get_recent_examples() and related queries migrated from OFFSET to id > last_seen_id keyset pagination; eliminates O(n) table scans on large learning_examples tables
- Composite DB index (db/models.py): Index("ix_audit_log_chat_created", AuditLog.chat_id, AuditLog.timestamp) added; reduces audit-log query latency from ~400ms to ~12ms on 100k-row tables
- SaccadicVision lifecycle: SaccadicVisionJob cleans up screenshot files older than 7 days from /data/screenshots/; prevents disk exhaustion on long-running instances
- _vault/_policy Redis key fix (memory/behavioral.py): behavioral rule keys namespaced under behavioral:rules:{id} (was _vault:{id}) — eliminates collision with integration vault keys
Model Catalog Update
- All 11 providers updated to 2026 model catalogs: Anthropic (Claude 4.6 family), OpenAI (GPT-5 family), Google (Gemini 2.5 family), xAI (Grok-3 family), Mistral (Mistral Large 3), Cohere (Command R+ 2025), Fireworks, Together AI, Perplexity, Groq, Ollama
- Model router updated with new capability classifications for vision, code, reasoning, and quick-response task types
Scale update
| Metric | v2.4 | v2.5 |
|---|---|---|
| Scheduler jobs | 35 | 40 |
| Dashboard pages | 18 | 23 |
| DB tables | 17 | 21 |
| Supported model providers | 8 | 11 |
v2.4 — April 3, 2026
Focus: Response grounding hardening, domain lock precision, evidence-state typed flags
New
- Response Grounding Engine Checks 5–9 (skills/response_validator.py): five new validation gates added to ResponseGrounder.validate():
  - Check 5 — Weak-response rejection: detects responses consisting primarily of apologies or capability disclaimers (I'm sorry, I can't, as an AI) when the agent has the required skill; forces skill invocation instead
  - Check 6 — Generic phrase filter: blocks filler responses (Let me check that for you, Great question!, Certainly!) when no substantive answer follows; triggers should_retry=True
  - Check 7 — Status-marker validation: if response contains ✅/❌ status markers, validates that a corresponding skill result exists in the conversation round; prevents fabricated status reports
  - Check 8 — Intent evidence gate: for information-seeking queries, requires at least one MONITORED or higher skill call in the response round; blocks pure hallucinated answers
  - Check 9 — Anti-hallucination guard for numeric claims: detects responses with specific numbers (prices, dates, counts) that lack a verifiable skill-result source; injects [REQUIRES_GROUNDING] flag and forces re-run with web search
- DomainLock Hardening (skills/response_validator.py): four precision fixes to the domain lock subsystem:
  - Root normalization: example.co.uk and www.example.co.uk now treated as same domain (strips www., handles multi-part TLDs)
  - Semantic category guards: domains in the same semantic category (e.g., two crypto exchanges, two news sites) no longer cross-lock; prevents false positives where agent searched two legitimate sources
  - Anchor domain field: DomainLock object now stores anchor_domain (the first-seen domain that triggered the lock); logged with every lock decision for debugging
  - Cross-turn stale lock clearance: domain locks older than 3 conversation turns are automatically cleared; prevents stale locks from blocking legitimate follow-up queries
- EvidenceState typed flags (agent/evidence.py): new EvidenceState dataclass replaces ad-hoc boolean flags; fields: has_skill_result, has_grounded_number, has_status_marker, has_intent_evidence, grounding_source; passed through ResponseGrounder and stored in round context for post-hoc audit
Fixed
- response_validator.py: domain lock was triggering on fetch_url calls to the same base domain as a prior web_search result — fixed by adding fetch_url to the _SAME_DOMAIN_EXEMPT_SKILLS set
- orchestrator.py (goal): chain-break condition checked wrong action strings ("blocked" instead of "autonomy_blocked", "sandbox_denied", "budget_exceeded") — goals with blocked tasks were retrying indefinitely instead of failing
- handlers.py: results.index(result) for browser URL tracking raised ValueError on duplicate results — fixed with enumerate() + index tracking
Scale update
| Metric | v2.3 | v2.4 |
|---|---|---|
| Response grounding checks | 4 | 9 |
| Domain lock precision fixes | 0 | 4 |
| EvidenceState flags | 0 | 5 |
v2.3 — March 28, 2026
Focus: Universal Interaction Validation Layer, SPA support, browser reliability
New
- Universal Interaction Validation Layer (skills/builtin/browser.py): four-phase validation wrapper around every browser click and form interaction:
  - Phase 1 — Pre-click validation: verifies element is visible, enabled, and not obscured before dispatching click; raises InteractionError with screenshot evidence if any check fails
  - Phase 2 — Post-click interference detection: after click, scans for modal overlays, cookie banners, CAPTCHA iframes, and anti-bot challenges; automatically dismisses cookie banners; pauses and logs on CAPTCHA detection
  - Phase 3 — Result-state confirmation: waits for DOM stabilization (network idle + no pending XHR); compares URL and key DOM checksum before/after click to confirm navigation or state change occurred
  - Phase 4 — Validated screenshot capture: final screenshot taken only after all three prior phases pass; screenshot path and interaction outcome logged to audit_log with action="browser.interaction"
- div-button SPA support: JavaScript-rendered <div role="button"> and <span> click targets now handled by _click_spa_element() helper; dispatches mousedown + mouseup + click synthetic events; supports React/Vue SPAs that use non-form click targets for parcel-tracking and similar flows
- Browser handler timeout increase: handle_browser_action() timeout raised from 90s to 150s for navigation-heavy operations; handle_page_load() raised from 150s to 180s for JavaScript-heavy SPAs
Fixed
- Enforcement loop fix (skills/builtin/browser.py): _action_terminal_detected() guard was combined with and not logic that caused the enforcement loop to exit prematurely on the first non-terminal action; fixed to check terminal state independently per loop iteration
- handlers.py: generator nesting inverted in all_skill_calls_raw comprehension — learning loop was always receiving empty results; fixed by swapping for clause order
- executor.py: result.output[:500] and result.error[:300] crashed with TypeError when output or error was None; fixed with (result.output or "")[:500] guards
Scale update
| Metric | v2.2 | v2.3 |
|---|---|---|
| Browser interaction phases | 1 | 4 |
| Handler timeout (navigation) | 90s | 150s |
| Handler timeout (page load) | 150s | 180s |
| SPA click targets supported | div[onclick] | div[role=button], span, custom |
v2.2 — March 24, 2026
Focus: deep_scraper hardening, security fixes, capability map completeness
New
- deep_scraper promoted from custom OpenClaw skill → permanent built-in skill (src/skills/builtin/deep_scraper.py)
- deep_scraper SSRF protection: _is_safe_url() resolves all A/AAAA records via getaddrinfo(), blocks loopback/private/link-local/reserved IPs, fails closed on DNS failure; runs via asyncio.to_thread() (non-blocking)
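A resolve-then-check SSRF guard in the spirit of the `_is_safe_url()` description might look like this. It is a simplified synchronous sketch (the entry says the real check runs via asyncio.to_thread()), and the function body is an assumption built on the standard `socket` and `ipaddress` modules.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_safe_url(url: str) -> bool:
    """Resolve every address for the host and reject internal targets."""
    host = urlparse(url).hostname
    if not host:
        return False  # e.g. file:// URLs have no host
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False  # fail closed when DNS resolution fails
    for info in infos:
        # Drop any IPv6 scope id ("fe80::1%eth0") before parsing.
        ip = ipaddress.ip_address(info[4][0].split("%")[0])
        if ip.is_loopback or ip.is_private or ip.is_link_local or ip.is_reserved:
            return False
    return bool(infos)
```

Checking every resolved record, rather than only the first, matters because a hostname can legitimately resolve to several addresses and a single internal one is enough for an attack.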
Fixed
- auto_detect.py: YouTube URL detection was routing to shell skill with raw docker command (security bypass) — now correctly routes to deep_scraper(url=...) with full capability enforcement
- skills/builtin/__init__.py: delete_reminder and meta_orchestrate added to _CAPABILITY_MAP (were relying on default fallback — now explicitly declared)
- response_validator.py: deep_scraper added to _PRICE_GROUNDING_SKILLS (consistent with browser_deep_scrape)
Cleanup
- /data/skills/deep-scraper/ custom skill directory removed — eliminates phantom custom skill entry in the Skills dashboard page
v2.1 — March 23, 2026
Focus: production audit, browser CPU fix, security hardening, multilingual support
New
- Browser Session Idle Reaper: daemon thread closes Chromium sessions idle >300s; fixes chronic CPU exhaustion from stale sessions (CPU usage 80% → 0.25%)
- Browser URL blocklist: blocks file://, javascript:, data:, vbscript: schemes and RFC-1918/loopback/cloud metadata IMDS addresses
- Multilingual Auto-Detect: lang_detect.py — browser/screenshot/navigation patterns in EN, ES, PT, FR, DE, ZH, JA, KO, AR, RU; localized fallback responses in 10 languages
- Domain Drift Protection: validator catches browser→crypto/email substitution attempts; should_retry=False on confirmed substitution; Capability Engine skips when auto-detect already handled the request
Fixed (6 bugs from production audit)
- autonomous.py: autonomy_mode was set to "auto" (invalid enum) — goal creation was completely broken; fixed to autonomy_mode=None
- handlers.py: recovery round used wrong generate() signature — recovery never executed; fixed to ModelRequest(...)
- handlers.py: _can_recover was overriding validator's should_retry=False via or reason=="drift" — now respects validator decision
- handlers.py: screenshot path collection used search (first match only); fixed to finditer (all paths); browser_screenshot_full_page added to filter
- handlers.py: Capability Engine was running even when auto-detect already handled the request — potential double-execution; now gated by not auto_calls
- behavioral_learner.py: Telegram notifications were published to "agent:outgoing" (dead stream) — silently lost; fixed to "events:outgoing"
Security
- self_improve.py: _list_files() path containment now uses realpath() (matches existing check in _read_file()) — prevents symlink traversal
- redaction.py: AIza pattern broadened to {25,} (was {35}); AKIA pattern to {12,} (was {16})
- capability_engine.py: blocks raw skill output in email body (Screenshot saved to, /data/screenshots/ paths, ⚠️ Verify the title)
v2.0
Focus: Active Flow Context Lock, Planning Mode override, Response Contract, Intent Completeness
New
- Active Flow Context Lock: per-chat Redis state (TTL 15 min) survives LLM failures; follow-up messages anchored to the same domain; [ACTIVE FLOW — CONTEXT LOCK] block injected into system prompt — eliminates cross-domain hallucination (e.g., crypto question answered with weather data)
- Planning Mode Hard Override: 5-layer execution block (auto-detect → Decision Layer → Capability Engine → LLM loop → Validation safeguard); when user says "no ejecutes / solo analiza / antes de ejecutar", zero skills run regardless of LLM output
- Universal Response Contract: _detect_response_type() classifies each request (comparison / multipart / list / explanation / action / chat); type-specific structure rule injected into every system prompt via _build_cognitive_control_block()
- Intent Completeness Engine: intent_engine.py — deterministic multi-part intent extraction (4 strategies: colon list, numbered list, conjunctive chain, multi-question); one completeness-retry per turn with exact missing-section correction prompt
- flow_state.py (new): save_active_flow(), load_active_flow(), clear_active_flow(), is_explicit_domain_switch(), is_crypto_recovery_followup(), detect_flow_assets()
Improved
- ResponseValidator.validate() now accepts planning_mode=True — new _check_planning_mode_violation() fires first when active
- ResponseValidator._check_completeness_multipart() — blocks structurally incomplete multi-part answers (≥2 ?, enumeration starts, conjunctive markers)
- render_report.py (crypto): premium terminal-grade format with aligned columns, volume in B/M notation, price arrows inline, separate email/Telegram renderers
Tests
- 34 new tests: tests/test_flow_state.py (all passing)
v1.9
Focus: Response Validation, voice input, audio pipeline, production fixes
New
- Response Validation & Recovery Engine: deterministic post-LLM validator — grounding_fail / incomplete / drift / screenshot_incomplete checks; 2-retry auto-recovery; no LLM calls in validation path
- RecoveryMemory: Redis FIFO store (50 entries, 7-day TTL) — only validated successful recoveries stored; no noise from failed attempts
- Voice/Audio Input: Telegram voice messages fully operational — handle_voice() in bridge downloads to /data/shared/uploads/voice_{uuid}.ogg; transcribe_audio_sync() calls OpenAI Whisper API via asyncio.to_thread() with 12s hard timeout; transcription fed into full LLM+skill pipeline
- extract_fields.py skill: extract named fields from previous skill output by path (e.g., field_name:var_name)
- Telegram typing indicator with 95s response timeout guard — _pending dict + _response_timeout_guard() — prevents stuck typing indicator on long responses
Fixed
- Critical metadata decode bug: bus.py auto-decodes all Redis stream fields as JSON — handlers now accept both dict and str for metadata
- _skill_round_count UnboundLocalError (was used before assignment in some code paths)
- check_screenshot_completeness() now trace-based only (execution skills set) — no brittle response string matching
Scale
- Scheduler: 25 → 27 background jobs
- Skills: 26 → 27 built-in skills (extract_fields.py added)
- DB: 18 → 21 tables (AgentRecord, AgentMessage, BehavioralRule added)
v1.8
Focus: Capability Engine production hardening, quality scoring, degradation detection
New
- Capability Engine v2: strict template validation — _HARD_ARGS abort, _OPTIONAL_ARGS → empty string, all others required
- Weighted scoring: (kw_hits×2) + (success_rate×5) + (avg_completeness×3) + recency_bonus - latency_penalty
- Pre-execution static validation: _pre_validate() checks all template vars before any step runs
- Output completeness guarantee: blocks incomplete renders; validates email body ≥50 chars
- Improvement loop: completeness_history (last 10 runs), EMA latency tracking, auto-degradation detection
- AgentManagerSkill: LLM can create/list/pause/resume/archive agents via natural language (agent_manager skill); late-wired to AgentOrchestrator in main.py
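The weighted scoring formula in the list above can be written as one function. This sketch assumes all five inputs are precomputed numbers; how recency_bonus and latency_penalty are derived is not stated in this changelog, so they are plain parameters here.

```python
def capability_score(
    kw_hits: int,
    success_rate: float,
    avg_completeness: float,
    recency_bonus: float = 0.0,
    latency_penalty: float = 0.0,
) -> float:
    """(kw_hits×2) + (success_rate×5) + (avg_completeness×3) + bonus - penalty."""
    return (
        kw_hits * 2
        + success_rate * 5
        + avg_completeness * 3
        + recency_bonus
        - latency_penalty
    )
```

With success_rate and avg_completeness both in the 0 to 1 range, a perfect track record contributes 8 points, the same as four keyword hits.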
Improved
- Goal Priority Axis: Goal.priority (1-10) + Goal.source fields; user goals=8, agent goals=6, autonomous=3; tick() sorts by priority descending
- Self-Integrity Monitor: SelfIntegrityMonitorJob every 6h — cross-checks self-model strengths vs actual skill success rates, detects epistemic drift, checks audit_log error spikes
v1.7
Focus: Memory & Resource Governance, Opportunity Engine, Self-Reflection Engine
New
- Opportunity Engine (scheduler/opportunity_engine.py): scans episodic memory for automation patterns (crypto, news, website, reports, API); max 2 suggestions/day, 48h dedup
- Self-Reflection Engine: LLM post-mortem insights after goal completion/failure; max 3/goal; Redis TTL 7d; injected into future context
- Resource Governor: Redis-backed rate limiting — goal slots (10), LLM/min (30), API/min (60), tasks/hour (50)
- Goal-scoped Memory (goal_memory table): episodic memory scoped to a specific goal's execution context — prevents cross-goal pollution
- Memory Ranking System: score = 0.5×similarity + 0.3×recency + 0.2×importance; applied before context injection
- build_context() accepts goal_id parameter; injects goal-scoped memory + reflection insights when provided
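The memory ranking formula from the list above, sketched over plain dicts. The field names and the assumption that each component is already normalized to the 0 to 1 range are illustrative, as is the `top_k` parameter.

```python
def memory_score(memory: dict) -> float:
    """score = 0.5×similarity + 0.3×recency + 0.2×importance."""
    return (
        0.5 * memory["similarity"]
        + 0.3 * memory["recency"]
        + 0.2 * memory["importance"]
    )


def rank_memories(candidates: list, top_k: int = 5) -> list:
    """Return the highest-scoring memories, best first."""
    return sorted(candidates, key=memory_score, reverse=True)[:top_k]
```

Note that a recent, important memory can outrank a more similar but stale one, since recency and importance together carry as much weight as similarity.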
Scale
- Scheduler: 23 → 25 background jobs (opportunity_engine added; vector_index already counted)
- Memory: 9 → 11 primary layers (goal_memory + self-model/epistemic)
- DB: new goal_memory table (auto-created via SQLAlchemy create_all)
Observability
- New structured logs: memory_retrieved, memory_ranked, goal_memory_added, goal_memory_used, opportunity_detected, opportunity_suggested, reflection_triggered, reflection_saved
v1.6
Focus: Decision Layer, Production Hardening v2, Goal Engine improvements
New
- Decision Layer (src/decision_layer.py): pure heuristic pre-LLM classifier with 5 strategies — DIRECT_RESPONSE / GOAL / SCHEDULED_TASK / SUB_AGENT / SCRIPT; SCHEDULED_TASK and SUB_AGENT call skills directly without LLM (zero hallucination risk); GOAL routes directly to GoalOrchestrator
- Behavioral Learning Loop: _detect_correction() in handlers.py; correction queued to Redis behavioral:pending; BehavioralLearnerJob (every 120s) → LLM analysis → rule saved to behavioral_rules DB table → injected in every system prompt; rule types: refusal / hallucination / wrong_skill / missing_context
- Cognitive Pressure Index (CPI): 0–100 composite metric (active goals 20%, error rate 25%, latency 20%, memory growth 15%, CPU 20%); background jobs skip when >80
- Self-Integrity Monitor: SelfIntegrityMonitorJob every 6h; cross-validates self-model against actual performance; JSON report at agent:integrity_report
- Circuit Breaker Redis persistence: circuit breaker state saved to cb:state:{integration_id} on every transition; TTL = max(86400, recovery_timeout×10)
- Sovereign Mode: SOVEREIGN_MODE=true (default); raises MAX_SKILL_ROUNDS to 12; injects ⚡ SOVEREIGN MODE ACTIVE block; doubles cognitive budgets
- delete_reminder skill: deletes by keyword match or keyword="all"
- Self-repair: patch(file, old_text, new_text) surgical edits + install(package) runtime pip installs; all patches auto-persisted to /data/src_patches/ and re-applied on rebuild
Goal Engine
- Plan Lock: goal.plan_locked = True after first task succeeds — blocks spurious replanning while the plan is working
- 8-step cap: plans exceeding 8 steps automatically truncated in topological order
- Duplicate Goal Detection: Jaccard word overlap ≥60% → return existing goal instead of creating duplicate
- Structured observability events: plan_created / plan_locked / plan_replan / plan_completed
- Replan storm: threshold 3 replans / 5 min (was 6 / 10 min); now marks goal FAILED with partial outputs collected (was PAUSED silently)
- Planner step preference: deterministic tool first → existing skill → LLM as last resort
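Duplicate goal detection by Jaccard word overlap can be illustrated as follows. The lowercase whitespace tokenization and the function names are assumptions; the 60% threshold comes from the list above.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def find_duplicate(new_goal: str, existing_goals: list, threshold: float = 0.6):
    """Return the first existing goal that overlaps enough, else None."""
    for goal in existing_goals:
        if jaccard(new_goal, goal) >= threshold:
            return goal
    return None
```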
Fixed
- Duplicate task execution (removed immediate execution override — tasks now scheduled at now + interval)
- Month-boundary date parsing bug (parsed.replace(day=parsed.day+1) → parsed + timedelta(days=1))
- PAUSED goals blocking agent forever — runtime.tick() auto-resumes after backoff, fails after 10min paused
- _clean_telegram_output(): strips markdown, prompt leakage ([TAREA PROGRAMADA:], EJECUTA AHORA), execution summaries from all outgoing messages
- Auto-detect "nuevo agente" false positive — _AGENT_CREATE_VETO_PATTERNS blocks complaint text from triggering agent creation
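The month-boundary fix in the list above is easy to reproduce in isolation. The two function names are hypothetical; the two expressions are the ones quoted in the fix.

```python
from datetime import datetime, timedelta


def naive_next_day(parsed: datetime) -> datetime:
    # Old approach: raises ValueError on the last day of a month,
    # because day 32 (or 31, 30, 29) does not exist in that month.
    return parsed.replace(day=parsed.day + 1)


def next_day(parsed: datetime) -> datetime:
    # Fixed approach: timedelta arithmetic rolls over month and year ends.
    return parsed + timedelta(days=1)
```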
Scale
- Scheduler: 22 → 23 background jobs (world_model added)
v1.5
Focus: Next-Gen Cognitive Systems, Vector Memory, Security Hardening
New Cognitive Systems (6)
- Vector Semantic Memory: PostgreSQL memory_embeddings table; Ollama nomic-embed-text embeddings or deterministic SHA-512 fallback; cosine similarity search (top-K); injected into the system prompt as a labeled semantic-memory block; feature-flagged VECTOR_MEMORY_ENABLED
- Plan Critic: LLM validates TaskGraph before execution; enabled via PLAN_CRITIC_ENABLED
- Meta-Agent Supervisor: meta_orchestrate skill decomposes goal into coordinated agent team; META_AGENT_ENABLED
- World Model: EntityState table tracks real-world entity states (BTC price, trend, change %); WorldModelJob every 15min; entity cards on dashboard
- Skill Evolution Engine: skill_patterns table; detects recurring multi-skill sequences (min 5 occurrences); LLM synthesizes composite Python skill; AST validation before write; SKILL_EVOLUTION_ENABLED
- Temporal Reasoner: trend summaries injected as [TEMPORAL INSIGHTS]; TEMPORAL_REASONING_ENABLED
Security Hardening (7 fixes)
- self_improve: all operations (read, write, patch) now use realpath() — closes symlink-based path traversal
- LLM-generated skill code: AST validation before write (blocks subprocess, os.system, eval, exec, etc.)
- CSRF: token now session-bound (rejects unauthenticated "anon" sessions before Redis lookup)
- /data/memory removed from /chat/media search dirs — prevents internal snapshots from being served publicly
- skills.py: slug re-validated on toggle/edit/delete via _safe_skill_dir() — prevents directory traversal
- http_request: _is_ssrf_target() blocks RFC-1918 + cloud metadata endpoints
- Error responses: str(e) replaced with first line only, capped at 120 chars — prevents internal info leakage
Dashboard
- 3 new pages: Vector Memory (/vector-memory), World Model (/world-model), Skill Evolution (/skill-evolution)
Scale
- Scheduler: 20 → 22 background jobs
- Memory: 8 → 9 persistent layers
- DB: 14 → 18 tables (memory_embeddings, skill_patterns, entity_states, state_predictions)
- 12 new configuration feature-flag variables
Phases 1–18 (Core Development)
The initial 18 development phases built the foundational architecture of WASP:
| Phase | Key Systems |
|---|---|
| 1–3 | Event-driven architecture (Redis Streams), core agent loop, episodic memory (PostgreSQL) |
| 4–6 | Skill system (SkillBase, SkillExecutor, PolicyEngine), custom skills, task scheduler |
| 7 | Health monitor, SelfHealer, Introspector |
| 8 | Dashboard (Quart), session auth, CSRF protection, audit logging |
| 9 | Agent autonomy — shell, Python execution, browser (Selenium + Chromium), named sessions |
| 10 | Knowledge Graph (PostgreSQL + Redis cache, rule-based NLP extraction) |
| 11 | Self-Model (Redis agent:self_model), Epistemic State, domain confidence |
| 12 | Procedural Memory (abstract_procedure(), keyword retrieval, few-shot injection) |
| 13 | Temporal World Model (world_timeline table, price/state extraction, trend detection) |
| 14 | Anticipatory Simulation (pre-execution consequence analysis for privileged skills) |
| 15 | Multi-agent orchestration v1 (AgentOrchestrator, AgentRuntime, CapabilitySandbox, inter-agent PostgreSQL bus) |
| 16 | Dream Mode (DreamJob: memory consolidation, KG enrichment, LLM reflection, world pre-fetch) |
| 17 | Autonomous Goal Generator (proactive LLM-evaluated goal creation, rate limiting, CPI guard) |
| 18 | QA/SRE audit — 208 tests (unit/integration/e2e/chaos/security), 9 connector ID fixes, Makefile |
Statistics at v2.6
| Metric | Count |
|---|---|
| Built-in skills | 37 |
| Background scheduler jobs | 41 |
| Memory layers | 18 (11 primary + 7 auxiliary) |
| PostgreSQL tables | 20 |
| Integration connectors | 40+ |
| LLM providers | 11 |
| Max LLM rounds (Sovereign) | 12 |
| Max concurrent goals | 3 |
| Max concurrent agents | 10 |
| Test suite | 208 tests |
| AuditLog action types | skill.shell, skill.self_improve, agent.reset, + all goal/task actions |
| Dashboard pages | 24 |