
Changelog

All notable changes to WASP are documented here. Versions follow a semantic versioning scheme after the initial phase-based development (Phases 1–18).


v2.6 — April 30, 2026

Focus: Edge Fix Pass (6 surgical fixes) + Final Pre-Production Hardening Pass (10 fixes) + Panic Reset dashboard page

Highlights

  • 6 edge-case fixes that close specific hallucination and correctness failure modes without changing golden-path behavior.
  • 10 pre-production hardening fixes covering security, reliability, observability, and learning quality.
  • Panic Reset page — single-operation hard reset with confirmation gate.

Edge Fix Pass (April 30, 2026)

Six surgical fixes applied across events/handlers.py and policy/response_guard.py. Each fix targets a specific hallucination or correctness failure mode.

1. Low-Intent Cold-Start Guard (events/handlers.py)

Short messages on a fresh chat without prior context are one of the most reliable hallucination triggers. Without an anchor in chat memory, a single token like "ok" or a context-required phrase like "do the same" gives the LLM nothing to ground on, and it reaches for training-data noise.

New deterministic guard:

  • _LOW_INTENT_TOKENS frozenset — single-token confirmations and acknowledgements (multilingual).
  • _LOW_INTENT_EMOJI_RE — emoji / digit / punctuation-only messages.
  • _CONTEXT_REQUIRED_PHRASES_RE — phrases like "do the same", "again", "same as before" that require prior context to be meaningful.

_is_low_intent() returns True for any of the following: a single ambiguous token, an emoji-only message, a context-required phrase without an anchor, or a message of ≤2 tokens that are all in the ambiguous set. When a message is low-intent, has no scheduled-language match, and has no last_exchange anchor, the handler returns a clarification fast-path in the user's detected language and never invokes the LLM.

Greetings ("hello" / "hi" / "hey" / "ping") are intentionally excluded because they have a dedicated friendly-response path. The guard is bypassed for [RETRY OF PREVIOUS: messages, which carry an explicit anchor.
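The guard described above can be approximated with the minimal sketch below. The token set, the exact regexes, and the has_anchor parameter are illustrative assumptions; the real _LOW_INTENT_TOKENS set is multilingual and the real patterns are not shown in this changelog.

```python
import re

# Hypothetical, abbreviated token set (the real one is multilingual and larger).
_LOW_INTENT_TOKENS = frozenset({"ok", "yes", "no", "si", "vale", "sure", "k"})
# Messages made only of emoji, digits, punctuation, or whitespace (no letters).
_LOW_INTENT_EMOJI_RE = re.compile(r"^[\W\d_]+$")
# Phrases that only make sense with prior context.
_CONTEXT_REQUIRED_PHRASES_RE = re.compile(
    r"\b(?:do the same|again|same as before)\b", re.IGNORECASE
)
_GREETINGS = frozenset({"hello", "hi", "hey", "ping"})


def is_low_intent(text: str, has_anchor: bool) -> bool:
    """Return True when the message gives the LLM nothing to ground on."""
    stripped = text.strip()
    if stripped.startswith("[RETRY OF PREVIOUS:"):
        return False  # retries carry an explicit anchor
    if _LOW_INTENT_EMOJI_RE.match(stripped):
        return True
    tokens = re.findall(r"\w+", stripped.lower())
    if len(tokens) == 1 and tokens[0] in _GREETINGS:
        return False  # greetings have their own response path
    if _CONTEXT_REQUIRED_PHRASES_RE.search(stripped) and not has_anchor:
        return True
    # One or two tokens, all of them ambiguous confirmations.
    return 0 < len(tokens) <= 2 and all(t in _LOW_INTENT_TOKENS for t in tokens)
```

Because every branch is a set lookup or a regex match, the decision stays deterministic and costs no LLM call.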

2. Multi-URL Aggregator: Error: Prefix Detection (events/handlers.py ~line 5385)

When auto-detect resolves multiple URLs in a single user message, the multi-URL aggregator builds a deterministic per-URL outcome list. The browser skill returns success=True even when its output begins with Error: URL blocked... (SSRF blocklist hit, file:// block, RFC-1918 block) — success only means "the skill itself didn't crash", not "the URL was reachable".

Before: SSRF-blocked URLs labeled ✅ navigated. After: Error: prefix detected first; URL labeled ❌ with the first-line error message.

Outcome icons:

Icon | Meaning
✅ navegado / navigated | Browser reached the URL successfully without screenshot
✅ captura enviada / screenshot sent | Screenshot captured and attached
🚫 bloqueado / blocked | Login wall or captcha ([CAPTURE_VALID: false])
❌ <error line> | Error: prefix detected (SSRF, file://, RFC-1918, etc.), first 120 chars

Test case (passes): captura https://192.168.0.1 y https://10.0.0.1 → both URLs correctly labeled ❌.
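A minimal sketch of the per-URL labeling order, assuming a hypothetical SkillResult shape with success, output, and screenshot_sent fields (the real result object is not shown in this changelog). The point it illustrates is that the Error: prefix is checked before anything else, because success=True alone says nothing about reachability.

```python
from dataclasses import dataclass


@dataclass
class SkillResult:
    """Hypothetical stand-in for the browser skill's result object."""
    success: bool
    output: str
    screenshot_sent: bool = False


def label_outcome(result: SkillResult) -> str:
    text = (result.output or "").lstrip()
    # Checked first: success=True only means the skill itself did not crash.
    if text.startswith("Error:"):
        first_line = text.splitlines()[0]
        return f"❌ {first_line[:120]}"
    if "[CAPTURE_VALID: false]" in text:
        return "🚫 blocked"
    if result.screenshot_sent:
        return "✅ screenshot sent"
    return "✅ navigated"
```

The sketch covers only the four outcomes in the table; how a result with success=False is labeled is not described here and is left out.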

3. Agent Name Preservation: Non-Greedy Regex (events/handlers.py)

_AGENT_NAME_PATTERNS (multilingual alternations matching named X and equivalent constructions) was greedy — for an input like create an agent named Crypto Watcher that monitors prices every 3 hours, the regex captured the entire phrase as the name.

Fix: non-greedy multi-token capture [\w-]+(?:\s+[\w-]+){0,4}? with a lookahead stop-set on clause connectors (that/to/with/for/on/in and equivalents) and punctuation. Quoted form (named "Foo Bar") takes priority via the first alternation group.

Input | Extracted Name
create an agent named Bob to track news | Bob
create an agent named "Crypto Watcher" for BTC alerts | Crypto Watcher
create an agent named News Watcher that monitors RSS feeds every hour | News Watcher
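The extraction behavior in the table can be reproduced with a small English-only sketch. The pattern below is an assumption built from the description (quoted form first, then a non-greedy multi-token capture with a lookahead stop-set); the real _AGENT_NAME_PATTERNS are multilingual and may differ.

```python
import re
from typing import Optional

# Assumed single-pattern version of the fix described above.
_AGENT_NAME_RE = re.compile(
    r'\bnamed\s+(?:"([^"]+)"'                      # quoted form takes priority
    r"|([\w-]+(?:\s+[\w-]+){0,4}?)"                # up to 5 tokens, non-greedy
    r"(?=\s+(?:that|to|with|for|on|in)\b|\s*[.,;:!?]|\s*$))",
    re.IGNORECASE,
)


def extract_agent_name(text: str) -> Optional[str]:
    match = _AGENT_NAME_RE.search(text)
    if match is None:
        return None
    return match.group(1) or match.group(2)
```

The non-greedy quantifier makes the engine try the shortest name first and extend it one token at a time until the lookahead sees a clause connector, punctuation, or the end of the input.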

4. Schedule Honesty Bidirectional (policy/response_guard.py)

task_manager only supports interval scheduling. It does NOT support fixed clock times (at 9am) or daypart phrases (in the morning). Two directions of dishonesty were possible:

  1. Agent-side lie (existing): response asserts a clock time → guard strips it.
  2. NEW user-side silent misinterpretation: user requests a clock time or daypart, agent creates an interval task without disclosing the gap. Operator believes their schedule was honored.

New behavior:

  • User-side clock-time branch: when the user requested a specific clock time (matches AM/PM, :, or o'clock) AND has_real_task_create(skill_results) returns True → response is appended with a disclaimer that the task does not run at the requested clock time and that task_manager only supports interval scheduling (every N hours from creation time).
  • NEW daypart branch (DAYPART_CLAIM_RE): matches phrases like in the morning, every evening, at dawn, at noon, at midnight, and equivalents in supported languages → identical disclaimer family.

Trace fields: had_real_create=True, claimed_time=<requested>, origin in {"user_text", "user_text_daypart"}. Guard skipped when numeric match without AM/PM/colon (likely an interval expression).
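A rough sketch of the user-side branches, assuming English-only patterns and a hypothetical disclaimer string. CLOCK_TIME_RE, the DISCLAIMER wording, and the function names are illustrative; only DAYPART_CLAIM_RE and the trace field names come from the description above.

```python
import re

# Assumed English-only patterns; the real guard covers more languages.
CLOCK_TIME_RE = re.compile(
    r"\b\d{1,2}(?::\d{2})?\s*(?:am|pm)\b"   # 9am, 9:30 pm
    r"|\b\d{1,2}:\d{2}\b"                    # 14:30
    r"|\b\d{1,2}\s*o'clock\b",               # 9 o'clock
    re.IGNORECASE,
)
DAYPART_CLAIM_RE = re.compile(
    r"\b(?:in the (?:morning|afternoon|evening)"
    r"|every (?:morning|evening|night)"
    r"|at (?:dawn|noon|midnight))\b",
    re.IGNORECASE,
)
DISCLAIMER = (
    "Note: this task does not run at the requested clock time. "
    "task_manager only supports interval scheduling "
    "(every N hours from creation time)."
)


def apply_schedule_honesty(user_text, response, had_real_create):
    """Append the disclaimer when a real task was created for a clock-time request."""
    if not had_real_create:
        return response, None
    match, origin = CLOCK_TIME_RE.search(user_text), "user_text"
    if match is None:
        match, origin = DAYPART_CLAIM_RE.search(user_text), "user_text_daypart"
    if match is None:
        return response, None  # bare numbers are treated as interval expressions
    trace = {"had_real_create": True, "claimed_time": match.group(0), "origin": origin}
    return f"{response}\n\n{DISCLAIMER}", trace
```

A request such as "every 3 hours" contains a number but no AM/PM marker, colon, or o'clock, so it passes through unchanged.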

5. Markdown Link Sanitization

The sanitizer previously handled image syntax ![alt](url) only. Plain [text](url) markdown links leaked through and rendered literally in Telegram.

Fix: new _MARKDOWN_LINK_RE = re.compile(r"(?<!!)\[([^\]]+)\]\((https?://[^\s)]+)\)"). Matches collapse to text (url) form so the URL stays accessible without raw markdown chars rendering literally. Negative lookbehind (?<!!) prevents double-handling of image syntax.

Example:

  • Before: See more details at [this page](https://example.com/foo) → renders with literal []() brackets
  • After: See more details at this page (https://example.com/foo) → readable

Trace records stripped: ["link"] for post-hoc analysis.
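The regex quoted above can be exercised directly. In this sketch the wrapper function name and its return shape are assumptions; only _MARKDOWN_LINK_RE and the ["link"] trace value come from the changelog.

```python
import re

_MARKDOWN_LINK_RE = re.compile(r"(?<!!)\[([^\]]+)\]\((https?://[^\s)]+)\)")


def collapse_markdown_links(text: str):
    """Rewrite [text](url) as 'text (url)' and report what was stripped."""
    cleaned, count = _MARKDOWN_LINK_RE.subn(r"\1 (\2)", text)
    return cleaned, (["link"] if count else [])
```

Image syntax such as ![alt](https://example.com/a.png) is left alone here because the negative lookbehind rejects a bracket that directly follows an exclamation mark.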

6. Entity-Proximity Verdict Check (policy/response_guard.py)

The verdict-evidence check guards against tracking-code hallucinations: when a user pastes a postal/courier tracking code, the browser may navigate to a tracking-site home page that contains words like "delivered" as static UI labels for unrelated shipments. Without entity proximity, the LLM could stitch these unrelated labels with the user's specific tracking code into a fabricated claim.

The previous check was too lenient — verdict word anywhere in body counted as evidence.

Fix: 200-character proximity window between any user-named entity and any verdict occurrence:

  • _user_named_entities() extracts user-specified codes via _USER_ENTITY_RE:
    • [A-Z]{2}\d{9}[A-Z]{2} — China Post / EMS format
    • 1Z[A-Z0-9]{16} — UPS format
    • \d{12,22} — generic long numeric (FedEx etc.)
    • [A-Z]{4,6}\d{6,10} — alphanumeric mix
  • _skill_output_supports_verdict() extended with user_entities parameter.
  • When entities present and verdict never co-occurs within 200 chars of any entity in any successful skill body → not evidence → honest fallback returned.
  • _has_useful_skill_data() also extended: outputs that don't contain user-specified entities don't count as useful data for factual grounding.

Verdict keywords cover delivery states in supported languages (delivered, in transit, out for delivery, pending, received, in customs, and equivalents).
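The proximity rule can be sketched as follows. The entity alternation is taken from the four formats listed above; the English-only verdict list, the function names, and the choice to measure the gap between the nearest edges of the two matches are assumptions.

```python
import re

_USER_ENTITY_RE = re.compile(
    r"\b(?:[A-Z]{2}\d{9}[A-Z]{2}|1Z[A-Z0-9]{16}|\d{12,22}|[A-Z]{4,6}\d{6,10})\b"
)
# Assumed English-only subset of the verdict keywords.
_VERDICT_RE = re.compile(
    r"\b(?:delivered|in transit|out for delivery|pending|received|in customs)\b",
    re.IGNORECASE,
)
PROXIMITY_WINDOW = 200


def user_named_entities(user_text: str) -> list:
    return _USER_ENTITY_RE.findall(user_text)


def verdict_near_entity(body: str, entities: list, window: int = PROXIMITY_WINDOW) -> bool:
    """True only if some verdict word sits within `window` chars of some entity."""
    verdict_spans = [m.span() for m in _VERDICT_RE.finditer(body)]
    for entity in entities:
        for em in re.finditer(re.escape(entity), body):
            for v_start, v_end in verdict_spans:
                # Distance between nearest edges; 0 when the spans touch or overlap.
                gap = max(v_start - em.end(), em.start() - v_end, 0)
                if gap <= window:
                    return True
    return False
```

A tracking-site home page that shows "Delivered" as a static label far from the user's code therefore yields False, which is the case that should route to the honest fallback.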


Final Pre-Production Hardening Pass (April 15, 2026 — 10 Fixes)

New

  • Panic Reset page (/reset): hard-confirmation UI (must type "RESET WASP", paste blocked) that wipes all 17 DB tables, 12+ Redis key patterns, agent identity XP/birth date, self-model, and runs VACUUM FULL. Progress streams live to the page; result shows in a green-bordered card with WaspToast notification. AuditLog entry written for every reset.
  • Weekly VACUUM ANALYZE (scheduler/db_maintenance.py): DbMaintenanceJob runs every 604,800 s (weekly) outside a transaction (AUTOCOMMIT) — reclaims dead-row space and updates PostgreSQL planner statistics without table locking.
  • Shell invocation audit logging (skills/builtin/shell.py): every shell skill call writes one AuditLog row with action="skill.shell", redacted command (strips sk-, AIza, xai-, hf_, key=value passwords), exit code, and optional goal_id context. Fire-and-forget via asyncio.ensure_future().
  • Health dashboard: behavioral queue depth: new "Learning Queue" panel in the Safety & Execution Control grid shows behavioral:pending queue depth / 50 cap with progress bar — green (0–19), yellow (≥20), red (≥40, LLM storm risk).

Security

  • SSRF blocklist on fetch_url (skills/builtin/fetch_url.py): imported _is_ssrf_target() from http_request.py; blocks RFC-1918, loopback, and cloud metadata endpoints before any HTTP connection. fetch_url now matches http_request SSRF protection.
  • Self-improve syntax validation + backup (dashboard/routes/self_improve.py): ast.parse() validates Python content before write (rejects with HTTP 400 on SyntaxError); shutil.copy2() creates timestamped backup at /data/src_patches/backup_{ts}_{filename} before overwrite; backup_path returned in success JSON.
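The validate-then-backup order from the self-improve item above can be sketched like this. The function name, the dict return shape, and the timestamp format are assumptions; the sketch uses a temporary directory rather than the real /data/src_patches/ path.

```python
import ast
import shutil
import tempfile
import time
from pathlib import Path


def validate_and_backup(target: Path, new_content: str, backup_dir: Path) -> dict:
    """Reject invalid Python before touching the file; back up before overwrite."""
    try:
        ast.parse(new_content)
    except SyntaxError as exc:
        return {"status": 400, "error": f"SyntaxError: {exc.msg} (line {exc.lineno})"}
    backup_path = None
    if target.exists():
        backup_dir.mkdir(parents=True, exist_ok=True)
        ts = time.strftime("%Y%m%d_%H%M%S")  # assumed timestamp format
        backup_path = backup_dir / f"backup_{ts}_{target.name}"
        shutil.copy2(target, backup_path)  # copy2 also preserves file metadata
    target.write_text(new_content, encoding="utf-8")
    return {"status": 200, "backup_path": str(backup_path) if backup_path else None}


# Usage against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    target = root / "example_skill.py"
    target.write_text("x = 1\n", encoding="utf-8")
    rejected = validate_and_backup(target, "def broken(:\n", root / "backups")
    after_reject = target.read_text(encoding="utf-8")
    accepted = validate_and_backup(target, "x = 2\n", root / "backups")
    backup_text = Path(accepted["backup_path"]).read_text(encoding="utf-8")
    final_text = target.read_text(encoding="utf-8")
```

Parsing before any filesystem write means a rejected patch leaves both the target and the backup directory untouched.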

Reliability

  • Boot model liveness ping (events/handlers.py): _run_boot_sequence() performs an 8 s timeout, max_tokens=1 LLM ping; boot message shows "live ✓" or "unreachable ✗"; explicit warning when unreachable directs operator to /models.
  • Boot cognitive-state warning (events/handlers.py): fresh/post-reset boot message warns that all memory has been cleared and sets expectations for the rebuild period.
  • Behavioral queue drop-count logging (memory/behavioral.py): existing 50-item cap enhanced with precise before/after llen tracking; logs behavioral.queue_cap_trimmed with dropped count when items are evicted.

Learning Quality

  • Behavioral rule conflict detection (memory/behavioral.py): _NEGATION_WORDS frozenset + _has_conflict() function; detects contradictions (35% core-word overlap + negation asymmetry); logs behavioral.rule_conflict_detected — rule is saved but conflict is surfaced for operator review.
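One way to read "35% core-word overlap + negation asymmetry" is sketched below. The word lists, the tokenizer, and the choice of the smaller rule as the overlap denominator are all assumptions; only the 35% threshold and the idea of comparing negation across two rules come from the item above.

```python
import re

# Hypothetical word lists; the real _NEGATION_WORDS set is not shown here.
_NEGATION_WORDS = frozenset({"not", "never", "no", "don't", "dont", "avoid", "without"})
_STOPWORDS = frozenset({"the", "a", "an", "to", "of", "in", "for", "and", "when", "is"})


def _tokens(rule: str) -> list:
    return re.findall(r"[a-z']+", rule.lower())


def _core_words(rule: str) -> set:
    return {t for t in _tokens(rule) if t not in _NEGATION_WORDS and t not in _STOPWORDS}


def _has_negation(rule: str) -> bool:
    return any(t in _NEGATION_WORDS for t in _tokens(rule))


def has_conflict(rule_a: str, rule_b: str, min_overlap: float = 0.35) -> bool:
    """Flag two rules that talk about the same thing but disagree on negation."""
    a, b = _core_words(rule_a), _core_words(rule_b)
    if not a or not b:
        return False
    overlap = len(a & b) / min(len(a), len(b))
    return overlap >= min_overlap and _has_negation(rule_a) != _has_negation(rule_b)
```

Two rules that are both negated, or both affirmative, are not flagged even when they overlap heavily, since there is no asymmetry to surface.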

Scale update

Metric | v2.5 | v2.6
Scheduler jobs | 40 | 41
AuditLog action types | skill.self_improve | + skill.shell, agent.reset
Health panel count | 3 | 4
Dashboard pages | 23 | 24 (+ /reset)
Response-guard checks | 9 | 11 (entity-proximity, schedule daypart)

v2.5 — April 7, 2026

Focus: Dashboard restructure, production audit fixes, 2026 model catalog update, new cognitive subsystems

New

  • Dashboard full restructure: sidebar reorganized into 5 sections (Overview, Cognition, Memory, Tools, Operations); 5 new pages added:
    • /self-improve — read/propose/apply/reject code patches with diff view and syntax validation
    • /behavioral — view, filter, and delete behavioral rules learned from corrections
    • /knowledge — browse knowledge graph nodes and relations with entity type filters
    • /subscriptions — manage RSS feeds and price alert subscriptions
    • /config — runtime config overrides (config:overrides Redis hash) without container restart
  • HealthState Adaptive Execution: HealthMonitor evaluates composite health score every 60s; when score drops below 70, health_state=DEGRADED flag set in Redis; scheduler jobs and anticipatory simulation check flag and downgrade to lightweight mode
  • SaccadicVision Change Detection Daemon (scheduler/saccadic.py): SaccadicVisionJob runs every 600s; takes periodic browser screenshots of monitored URLs; pixel-diff comparison via Pillow; sends Telegram alert when visual change exceeds 8% threshold; state stored in saccadic_vision DB table
  • Dream Failure Pattern Analysis: DreamJob now includes a failure-pattern phase — queries last 48h of audit_log for error spikes, identifies top-3 failing skills, injects findings into dream reflection prompt for proactive repair suggestions
  • Self-Improve Soft Safety Gate (dashboard/routes/self_improve.py): _self_improve_soft_gate() deterministic pattern check runs before any write; BLOCK if content targets critical paths (sandbox.py, control_layer.py, behavioral.py, response_grounder.py) AND contains safety-weakening patterns (disable sandbox, bypass guard, _HIGH_RISK_ACTIONS=frozenset()); WARN on large patches to critical paths; all decisions logged to audit_log
  • Self-Improve diff-awareness precision patch: patch apply logic uses old_text/new_text pairs with exact context matching; rejects ambiguous patches where old_text appears more than once in the file

Security

  • SHA-256 sidecar verification (utils/patch_integrity.py): every file written by self_improve patch action generates a .sha256 sidecar at /data/src_patches/{filename}.sha256; apply_persisted_patches() at startup verifies hash before applying — tampered patches are rejected with CRITICAL log
  • CSP unsafe-eval documented: Content-Security-Policy header audit confirmed unsafe-eval required by Jinja2 template rendering; documented in security page with mitigation notes
  • config:overrides runtime: sensitive config keys (ANTHROPIC_API_KEY, TELEGRAM_BOT_TOKEN, DB_URL) blocked from override via _BLOCKED_OVERRIDE_KEYS set; all override writes logged to audit_log
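The sidecar scheme from the SHA-256 item above amounts to writing a digest next to each patch and recomputing it before use. The helper names below are hypothetical and the sketch runs in a temporary directory; it does not reproduce apply_persisted_patches() itself.

```python
import hashlib
import tempfile
from pathlib import Path


def _sidecar_for(patch_path: Path) -> Path:
    return patch_path.with_name(patch_path.name + ".sha256")


def write_sidecar(patch_path: Path) -> Path:
    """Store the hex SHA-256 of the patch in {filename}.sha256."""
    digest = hashlib.sha256(patch_path.read_bytes()).hexdigest()
    sidecar = _sidecar_for(patch_path)
    sidecar.write_text(digest, encoding="utf-8")
    return sidecar


def verify_patch(patch_path: Path) -> bool:
    """Fail closed: a missing sidecar or a digest mismatch rejects the patch."""
    sidecar = _sidecar_for(patch_path)
    if not sidecar.exists():
        return False
    expected = sidecar.read_text(encoding="utf-8").strip()
    actual = hashlib.sha256(patch_path.read_bytes()).hexdigest()
    return expected == actual


# Usage: the patch verifies until its content changes.
with tempfile.TemporaryDirectory() as tmp:
    patch = Path(tmp) / "handlers.py"
    patch.write_text("print('patched')\n", encoding="utf-8")
    sidecar_name = write_sidecar(patch).name
    ok_before = verify_patch(patch)
    patch.write_text("print('tampered')\n", encoding="utf-8")
    ok_after = verify_patch(patch)
```

A sidecar stored beside the patch detects accidental or unsophisticated modification; an attacker who can rewrite both files would defeat it, so it complements rather than replaces filesystem permissions.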

Reliability

  • Audit log retention job (scheduler/audit_retention.py): AuditRetentionJob runs every 86,400s (daily); hard-deletes rows older than 90 days from audit_log; logs audit_retention.deleted count=N
  • Bounded Redis Streams: events:incoming and events:outgoing streams capped at 10,000 entries via MAXLEN ~10000 on every XADD; prevents unbounded memory growth during Telegram polling storms
  • PEL (Pending Entry List) recovery: StreamConsumerJob checks PEL length every tick; entries idle >30min are claimed and re-delivered; prevents message loss on container crash mid-processing
  • Keyset pagination for memory queries (memory/learning.py): get_recent_examples() and related queries migrated from OFFSET to id > last_seen_id keyset pagination; eliminates O(n) table scans on large learning_examples tables
  • Composite DB index (db/models.py): Index("ix_audit_log_chat_created", AuditLog.chat_id, AuditLog.timestamp) added; reduces audit-log query latency from ~400ms to ~12ms on 100k-row tables
  • SaccadicVision lifecycle: SaccadicVisionJob cleans up screenshot files older than 7 days from /data/screenshots/; prevents disk exhaustion on long-running instances
  • _vault/_policy Redis key fix (memory/behavioral.py): behavioral rule keys namespaced under behavioral:rules:{id} (was _vault:{id}) — eliminates collision with integration vault keys
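The keyset pagination change listed above (id > last_seen_id instead of OFFSET) can be illustrated with an in-memory SQLite table. The column names and the use of SQLite rather than PostgreSQL are assumptions made to keep the example self-contained.

```python
import sqlite3


def fetch_page(conn, last_seen_id: int, page_size: int):
    # Keyset pagination: seek past the last seen id via the primary-key index
    # instead of scanning and discarding OFFSET rows.
    return conn.execute(
        "SELECT id, text FROM learning_examples WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, page_size),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE learning_examples (id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO learning_examples (text) VALUES (?)",
    [(f"example {i}",) for i in range(1, 8)],
)

# Walk the table three rows at a time, carrying the last id forward.
pages, last_id = [], 0
while True:
    rows = fetch_page(conn, last_id, 3)
    if not rows:
        break
    pages.append([row[0] for row in rows])
    last_id = rows[-1][0]
```

With OFFSET the database must step over every skipped row on each request, so deep pages get slower; the keyset form costs roughly the same for the first page and the last.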

Model Catalog Update

  • All 11 providers updated to 2026 model catalogs: Anthropic (Claude 4.6 family), OpenAI (GPT-5 family), Google (Gemini 2.5 family), xAI (Grok-3 family), Mistral (Mistral Large 3), Cohere (Command R+ 2025), Fireworks, Together AI, Perplexity, Groq, Ollama
  • Model router updated with new capability classifications for vision, code, reasoning, and quick-response task types

Scale update

Metric | v2.4 | v2.5
Scheduler jobs | 35 | 40
Dashboard pages | 18 | 23
DB tables | 17 | 21
Supported model providers | 8 | 11

v2.4 — April 3, 2026

Focus: Response grounding hardening, domain lock precision, evidence-state typed flags

New

  • Response Grounding Engine Checks 5–9 (skills/response_validator.py): five new validation gates added to ResponseGrounder.validate():
    • Check 5 — Weak-response rejection: detects responses consisting primarily of apologies or capability disclaimers (I'm sorry, I can't, as an AI) when the agent has the required skill; forces skill invocation instead
    • Check 6 — Generic phrase filter: blocks filler responses (Let me check that for you, Great question!, Certainly!) when no substantive answer follows; triggers should_retry=True
    • Check 7 — Status-marker validation: if the response contains status markers, validates that a corresponding skill result exists in the conversation round; prevents fabricated status reports
    • Check 8 — Intent evidence gate: for information-seeking queries, requires at least one MONITORED or higher skill call in the response round; blocks pure hallucinated answers
    • Check 9 — Anti-hallucination guard for numeric claims: detects responses with specific numbers (prices, dates, counts) that lack a verifiable skill-result source; injects [REQUIRES_GROUNDING] flag and forces re-run with web search
  • DomainLock Hardening (skills/response_validator.py): four precision fixes to the domain lock subsystem:
    • Root normalization: example.co.uk and www.example.co.uk now treated as same domain (strips www., handles multi-part TLDs)
    • Semantic category guards: domains in the same semantic category (e.g., two crypto exchanges, two news sites) no longer cross-lock; prevents false positives where agent searched two legitimate sources
    • Anchor domain field: DomainLock object now stores anchor_domain (the first-seen domain that triggered the lock); logged with every lock decision for debugging
    • Cross-turn stale lock clearance: domain locks older than 3 conversation turns are automatically cleared; prevents stale locks from blocking legitimate follow-up queries
  • EvidenceState typed flags (agent/evidence.py): new EvidenceState dataclass replaces ad-hoc boolean flags; fields: has_skill_result, has_grounded_number, has_status_marker, has_intent_evidence, grounding_source; passed through ResponseGrounder and stored in round context for post-hoc audit

Fixed

  • response_validator.py: domain lock was triggering on fetch_url calls to the same base domain as a prior web_search result — fixed by adding fetch_url to the _SAME_DOMAIN_EXEMPT_SKILLS set
  • orchestrator.py (goal): chain-break condition checked wrong action strings ("blocked" instead of "autonomy_blocked", "sandbox_denied", "budget_exceeded") — goals with blocked tasks were retrying indefinitely instead of failing
  • handlers.py: results.index(result) for browser URL tracking raised ValueError on duplicate results — fixed with enumerate() + index tracking

Scale update

Metric | v2.3 | v2.4
Response grounding checks | 4 | 9
Domain lock precision fixes | 0 | 4
EvidenceState flags | 0 | 5

v2.3 — March 28, 2026

Focus: Universal Interaction Validation Layer, SPA support, browser reliability

New

  • Universal Interaction Validation Layer (skills/builtin/browser.py): four-phase validation wrapper around every browser click and form interaction:
    • Phase 1 — Pre-click validation: verifies element is visible, enabled, and not obscured before dispatching click; raises InteractionError with screenshot evidence if any check fails
    • Phase 2 — Post-click interference detection: after click, scans for modal overlays, cookie banners, CAPTCHA iframes, and anti-bot challenges; automatically dismisses cookie banners; pauses and logs on CAPTCHA detection
    • Phase 3 — Result-state confirmation: waits for DOM stabilization (network idle + no pending XHR); compares URL and key DOM checksum before/after click to confirm navigation or state change occurred
    • Phase 4 — Validated screenshot capture: final screenshot taken only after all three prior phases pass; screenshot path and interaction outcome logged to audit_log with action="browser.interaction"
  • div-button SPA support: JavaScript-rendered <div role="button"> and <span> click targets now handled by _click_spa_element() helper; dispatches mousedown, mouseup, and click synthetic events in sequence; supports React/Vue SPAs that use non-form click targets for parcel-tracking and similar flows
  • Browser handler timeout increase: handle_browser_action() timeout raised from 90s to 150s for navigation-heavy operations; handle_page_load() raised from 150s to 180s for JavaScript-heavy SPAs

Fixed

  • Enforcement loop fix (skills/builtin/browser.py): _action_terminal_detected() guard was combined with and not logic that caused the enforcement loop to exit prematurely on the first non-terminal action; fixed to check terminal state independently per loop iteration
  • handlers.py: generator nesting inverted in all_skill_calls_raw comprehension — learning loop was always receiving empty results; fixed by swapping for clause order
  • executor.py: result.output[:500] and result.error[:300] crashed with TypeError when output or error was None; fixed with (result.output or "")[:500] guards

Scale update

Metric | v2.2 | v2.3
Browser interaction phases | 1 | 4
Handler timeout (navigation) | 90s | 150s
Handler timeout (page load) | 150s | 180s
SPA click targets supported | div[onclick] | div[role=button], span, custom

v2.2 — March 24, 2026

Focus: deep_scraper hardening, security fixes, capability map completeness

New

  • deep_scraper promoted from custom OpenClaw skill → permanent built-in skill (src/skills/builtin/deep_scraper.py)
  • deep_scraper SSRF protection: _is_safe_url() resolves all A/AAAA records via getaddrinfo(), blocks loopback/private/link-local/reserved IPs, fails closed on DNS failure; runs via asyncio.to_thread() (non-blocking)
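The SSRF check described for deep_scraper can be sketched with the standard library alone. The function bodies below are an assumption about how such a check might look, not the actual _is_safe_url() implementation.

```python
import asyncio
import ipaddress
import socket
from urllib.parse import urlsplit


def _is_safe_url_sync(url: str) -> bool:
    host = urlsplit(url).hostname
    if not host:
        return False
    try:
        infos = socket.getaddrinfo(host, None)  # returns A and AAAA results
    except (socket.gaierror, UnicodeError):
        return False  # fail closed when resolution fails
    if not infos:
        return False
    for info in infos:
        # Drop any IPv6 scope id ("fe80::1%eth0") before parsing.
        ip = ipaddress.ip_address(info[4][0].split("%", 1)[0])
        if ip.is_loopback or ip.is_private or ip.is_link_local or ip.is_reserved:
            return False  # one bad record is enough to block the URL
    return True


async def is_safe_url(url: str) -> bool:
    # getaddrinfo blocks, so run it off the event loop (Python 3.9+).
    return await asyncio.to_thread(_is_safe_url_sync, url)
```

Checking every resolved address, rather than only the first, matters because a hostname can return a public and a private record together. A check of this kind does not by itself prevent DNS rebinding between the check and the actual request.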

Fixed

  • auto_detect.py: YouTube URL detection was routing to shell skill with raw docker command (security bypass) — now correctly routes to deep_scraper(url=...) with full capability enforcement
  • skills/builtin/__init__.py: delete_reminder and meta_orchestrate added to _CAPABILITY_MAP (were relying on default fallback — now explicitly declared)
  • response_validator.py: deep_scraper added to _PRICE_GROUNDING_SKILLS (consistent with browser_deep_scrape)

Cleanup

  • /data/skills/deep-scraper/ custom skill directory removed — eliminates phantom custom skill entry in the Skills dashboard page

v2.1 — March 23, 2026

Focus: production audit, browser CPU fix, security hardening, multilingual support

New

  • Browser Session Idle Reaper: daemon thread closes Chromium sessions idle >300s; fixes chronic CPU exhaustion caused by stale sessions (CPU usage 80% → 0.25%)
  • Browser URL blocklist: blocks file://, javascript:, data:, vbscript: schemes and RFC-1918/loopback/cloud metadata IMDS addresses
  • Multilingual Auto-Detect: lang_detect.py — browser/screenshot/navigation patterns in EN, ES, PT, FR, DE, ZH, JA, KO, AR, RU; localized fallback responses in 10 languages
  • Domain Drift Protection: validator catches browser→crypto/email substitution attempts; should_retry=False on confirmed substitution; Capability Engine skips when auto-detect already handled the request

Fixed (6 bugs from production audit)

  • autonomous.py: autonomy_mode was set to "auto" (invalid enum) — goal creation was completely broken; fixed to autonomy_mode=None
  • handlers.py: recovery round used wrong generate() signature — recovery never executed; fixed to ModelRequest(...)
  • handlers.py: _can_recover was overriding validator's should_retry=False via or reason=="drift" — now respects validator decision
  • handlers.py: screenshot path collection used search (first match only); fixed to finditer (all paths); browser_screenshot_full_page added to filter
  • handlers.py: Capability Engine was running even when auto-detect already handled the request — potential double-execution; now gated by not auto_calls
  • behavioral_learner.py: Telegram notifications were published to "agent:outgoing" (dead stream) — silently lost; fixed to "events:outgoing"

Security

  • self_improve.py: _list_files() path containment now uses realpath() (matches existing check in _read_file()) — prevents symlink traversal
  • redaction.py: AIza pattern broadened to {25,} (was {35}); AKIA pattern to {12,} (was {16})
  • capability_engine.py: blocks raw skill output in email body (Screenshot saved to, /data/screenshots/ paths, ⚠️ Verify the title)

v2.0

Focus: Active Flow Context Lock, Planning Mode override, Response Contract, Intent Completeness

New

  • Active Flow Context Lock: per-chat Redis state (TTL 15 min) survives LLM failures; follow-up messages anchored to the same domain; [ACTIVE FLOW — CONTEXT LOCK] block injected into system prompt — eliminates cross-domain hallucination (e.g., crypto question answered with weather data)
  • Planning Mode Hard Override: 5-layer execution block (auto-detect → Decision Layer → Capability Engine → LLM loop → Validation safeguard); when user says "no ejecutes / solo analiza / antes de ejecutar", zero skills run regardless of LLM output
  • Universal Response Contract: _detect_response_type() classifies each request (comparison / multipart / list / explanation / action / chat); type-specific structure rule injected into every system prompt via _build_cognitive_control_block()
  • Intent Completeness Engine: intent_engine.py — deterministic multi-part intent extraction (4 strategies: colon list, numbered list, conjunctive chain, multi-question); one completeness-retry per turn with exact missing-section correction prompt
  • flow_state.py (new): save_active_flow(), load_active_flow(), clear_active_flow(), is_explicit_domain_switch(), is_crypto_recovery_followup(), detect_flow_assets()

Improved

  • ResponseValidator.validate() now accepts planning_mode=True — new _check_planning_mode_violation() fires first when active
  • ResponseValidator._check_completeness_multipart() — blocks structurally incomplete multi-part answers (≥2 ?, enumeration starts, conjunctive markers)
  • render_report.py (crypto): premium terminal-grade format with aligned columns, volume in B/M notation, price arrows inline, separate email/Telegram renderers

Tests

  • 34 new tests: tests/test_flow_state.py (all passing)

v1.9

Focus: Response Validation, voice input, audio pipeline, production fixes

New

  • Response Validation & Recovery Engine: deterministic post-LLM validator — grounding_fail / incomplete / drift / screenshot_incomplete checks; 2-retry auto-recovery; no LLM calls in validation path
  • RecoveryMemory: Redis FIFO store (50 entries, 7-day TTL) — only validated successful recoveries stored; no noise from failed attempts
  • Voice/Audio Input: Telegram voice messages fully operational — handle_voice() in bridge downloads to /data/shared/uploads/voice_{uuid}.ogg; transcribe_audio_sync() calls OpenAI Whisper API via asyncio.to_thread() with 12s hard timeout; transcription fed into full LLM+skill pipeline
  • extract_fields.py skill: extract named fields from previous skill output by path (e.g., field_name:var_name)
  • Telegram typing indicator with 95s response timeout guard — _pending dict + _response_timeout_guard() — prevents stuck typing indicator on long responses

Fixed

  • Critical metadata decode bug: bus.py auto-decodes all Redis stream fields as JSON — handlers now accept both dict and str for metadata
  • _skill_round_count UnboundLocalError (was used before assignment in some code paths)
  • check_screenshot_completeness() now trace-based only (execution skills set) — no brittle response string matching

Scale

  • Scheduler: 25 → 27 background jobs
  • Skills: 26 → 27 built-in skills (extract_fields.py added)
  • DB: 18 → 21 tables (AgentRecord, AgentMessage, BehavioralRule added)

v1.8

Focus: Capability Engine production hardening, quality scoring, degradation detection

New

  • Capability Engine v2: strict template validation — _HARD_ARGS abort, _OPTIONAL_ARGS → empty string, all others required
  • Weighted scoring: (kw_hits×2) + (success_rate×5) + (avg_completeness×3) + recency_bonus - latency_penalty
  • Pre-execution static validation: _pre_validate() checks all template vars before any step runs
  • Output completeness guarantee: blocks incomplete renders; validates email body ≥50 chars
  • Improvement loop: completeness_history (last 10 runs), EMA latency tracking, auto-degradation detection
  • AgentManagerSkill: LLM can create/list/pause/resume/archive agents via natural language (agent_manager skill); late-wired to AgentOrchestrator in main.py
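The weighted scoring formula listed above translates directly into code. The dataclass and its field defaults are hypothetical containers; only the weights and the terms come from the changelog.

```python
from dataclasses import dataclass


@dataclass
class CapabilityStats:
    """Hypothetical container for the inputs named in the formula."""
    kw_hits: int
    success_rate: float        # 0.0 to 1.0
    avg_completeness: float    # 0.0 to 1.0
    recency_bonus: float = 0.0
    latency_penalty: float = 0.0


def capability_score(s: CapabilityStats) -> float:
    # (kw_hits×2) + (success_rate×5) + (avg_completeness×3) + recency_bonus - latency_penalty
    return (
        s.kw_hits * 2
        + s.success_rate * 5
        + s.avg_completeness * 3
        + s.recency_bonus
        - s.latency_penalty
    )
```

For example, 3 keyword hits, an 80% success rate, 0.5 average completeness, a recency bonus of 1.0, and a latency penalty of 0.5 give 6 + 4 + 1.5 + 1 - 0.5 = 12.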

Improved

  • Goal Priority Axis: Goal.priority (1-10) + Goal.source fields; user goals=8, agent goals=6, autonomous=3; tick() sorts by priority descending
  • Self-Integrity Monitor: SelfIntegrityMonitorJob every 6h — cross-checks self-model strengths vs actual skill success rates, detects epistemic drift, checks audit_log error spikes

v1.7

Focus: Memory & Resource Governance, Opportunity Engine, Self-Reflection Engine

New

  • Opportunity Engine (scheduler/opportunity_engine.py): scans episodic memory for automation patterns (crypto, news, website, reports, API); max 2 suggestions/day, 48h dedup
  • Self-Reflection Engine: LLM post-mortem insights after goal completion/failure; max 3/goal; Redis TTL 7d; injected into future context
  • Resource Governor: Redis-backed rate limiting — goal slots (10), LLM/min (30), API/min (60), tasks/hour (50)
  • Goal-scoped Memory (goal_memory table): episodic memory scoped to a specific goal's execution context — prevents cross-goal pollution
  • Memory Ranking System: score = 0.5×similarity + 0.3×recency + 0.2×importance; applied before context injection
  • build_context() accepts goal_id parameter; injects goal-scoped memory + reflection insights when provided
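The Memory Ranking System formula above can be applied as a sort key. The dict-based memory records and the top_k parameter are assumptions for illustration; inputs are taken to be normalized to the 0 to 1 range.

```python
def memory_score(similarity: float, recency: float, importance: float) -> float:
    # score = 0.5×similarity + 0.3×recency + 0.2×importance
    return 0.5 * similarity + 0.3 * recency + 0.2 * importance


def rank_memories(memories: list, top_k: int) -> list:
    """Return the top_k memories by score, highest first."""
    return sorted(
        memories,
        key=lambda m: memory_score(m["similarity"], m["recency"], m["importance"]),
        reverse=True,
    )[:top_k]
```

Because similarity carries half the weight, a highly similar but old memory (0.9, 0.1, 0.2 scores 0.52) can still lose to a moderately similar one that is both recent and important (0.4, 1.0, 1.0 scores 0.70).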

Scale

  • Scheduler: 23 → 25 background jobs (opportunity_engine added; vector_index already counted)
  • Memory: 9 → 11 primary layers (goal_memory + self-model/epistemic)
  • DB: new goal_memory table (auto-created via SQLAlchemy create_all)

Observability

  • New structured logs: memory_retrieved, memory_ranked, goal_memory_added, goal_memory_used, opportunity_detected, opportunity_suggested, reflection_triggered, reflection_saved

v1.6

Focus: Decision Layer, Production Hardening v2, Goal Engine improvements

New

  • Decision Layer (src/decision_layer.py): pure heuristic pre-LLM classifier with 5 strategies — DIRECT_RESPONSE / GOAL / SCHEDULED_TASK / SUB_AGENT / SCRIPT; SCHEDULED_TASK and SUB_AGENT call skills directly without LLM (zero hallucination risk); GOAL routes directly to GoalOrchestrator
  • Behavioral Learning Loop: _detect_correction() in handlers.py; correction queued to Redis behavioral:pending; BehavioralLearnerJob (every 120s) → LLM analysis → rule saved to behavioral_rules DB table → injected in every system prompt; rule types: refusal / hallucination / wrong_skill / missing_context
  • Cognitive Pressure Index (CPI): 0–100 composite metric (active goals 20%, error rate 25%, latency 20%, memory growth 15%, CPU 20%); background jobs skip when >80
  • Self-Integrity Monitor: SelfIntegrityMonitorJob every 6h; cross-validates self-model against actual performance; JSON report at agent:integrity_report
  • Circuit Breaker Redis persistence: circuit breaker state saved to cb:state:{integration_id} on every transition; TTL = max(86400, recovery_timeout×10)
  • Sovereign Mode: SOVEREIGN_MODE=true (default); raises MAX_SKILL_ROUNDS to 12; injects ⚡ SOVEREIGN MODE ACTIVE block; doubles cognitive budgets
  • delete_reminder skill: deletes by keyword match or keyword="all"
  • Self-repair: patch(file, old_text, new_text) surgical edits + install(package) runtime pip installs; all patches auto-persisted to /data/src_patches/ and re-applied on rebuild

Goal Engine

  • Plan Lock: goal.plan_locked = True after first task succeeds — blocks spurious replanning while the plan is working
  • 8-step cap: plans exceeding 8 steps automatically truncated in topological order
  • Duplicate Goal Detection: Jaccard word overlap ≥60% → return existing goal instead of creating duplicate
  • Structured observability events: plan_created / plan_locked / plan_replan / plan_completed
  • Replan storm: threshold 3 replans / 5 min (was 6 / 10 min); now marks goal FAILED with partial outputs collected (was PAUSED silently)
  • Planner step preference: deterministic tool first → existing skill → LLM as last resort
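The duplicate-goal rule listed above (Jaccard word overlap ≥60%) can be sketched as follows. The tokenizer and the function names are assumptions; goals are represented as plain strings for brevity.

```python
import re
from typing import Optional


def jaccard_overlap(a: str, b: str) -> float:
    """Size of the shared word set divided by the size of the combined word set."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def find_duplicate_goal(new_goal: str, existing: list, threshold: float = 0.60) -> Optional[str]:
    """Return the first existing goal that overlaps enough, else None."""
    for goal in existing:
        if jaccard_overlap(new_goal, goal) >= threshold:
            return goal
    return None
```

"monitor BTC price every hour" and "monitor BTC price each hour" share 4 of 6 distinct words (about 0.67), so the second request would return the existing goal.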

Fixed

  • Duplicate task execution (removed immediate execution override — tasks now scheduled at now + interval)
  • Month-boundary date parsing bug (parsed.replace(day=parsed.day+1) → parsed + timedelta(days=1))
  • PAUSED goals blocking agent forever — runtime.tick() auto-resumes after backoff, fails after 10min paused
  • _clean_telegram_output(): strips markdown, prompt leakage ([TAREA PROGRAMADA:], EJECUTA AHORA), execution summaries from all outgoing messages
  • Auto-detect "nuevo agente" false positive — _AGENT_CREATE_VETO_PATTERNS blocks complaint text from triggering agent creation
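The month-boundary fix in the list above is easy to reproduce. The function names are illustrative; the two expressions are the ones quoted in the changelog entry.

```python
from datetime import datetime, timedelta


def next_day_buggy(parsed: datetime) -> datetime:
    # Raises ValueError on the last day of a month (e.g. day=32 for January).
    return parsed.replace(day=parsed.day + 1)


def next_day_fixed(parsed: datetime) -> datetime:
    # timedelta arithmetic rolls over month and year boundaries correctly.
    return parsed + timedelta(days=1)
```

The replace() form works for most of the month, which is why a bug like this tends to surface only on the last day of a month.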

Scale

  • Scheduler: 22 → 23 background jobs (world_model added)

v1.5

Focus: Next-Gen Cognitive Systems, Vector Memory, Security Hardening

New Cognitive Systems (6)

  • Vector Semantic Memory: PostgreSQL memory_embeddings table; Ollama nomic-embed-text embeddings or deterministic SHA-512 fallback; cosine similarity search (top-K); injected into the system prompt as a labeled semantic-memory block; feature-flagged VECTOR_MEMORY_ENABLED
  • Plan Critic: LLM validates TaskGraph before execution; enabled via PLAN_CRITIC_ENABLED
  • Meta-Agent Supervisor: meta_orchestrate skill decomposes goal into coordinated agent team; META_AGENT_ENABLED
  • World Model: EntityState table tracks real-world entity states (BTC price, trend, change %); WorldModelJob every 15min; entity cards on dashboard
  • Skill Evolution Engine: skill_patterns table; detects recurring multi-skill sequences (min 5 occurrences); LLM synthesizes composite Python skill; AST validation before write; SKILL_EVOLUTION_ENABLED
  • Temporal Reasoner: trend summaries injected as [TEMPORAL INSIGHTS]; TEMPORAL_REASONING_ENABLED

Security Hardening (7 fixes)

  • self_improve: all operations (read, write, patch) now use realpath() — closes symlink-based path traversal
  • LLM-generated skill code: AST validation before write (blocks subprocess, os.system, eval, exec, etc.)
  • CSRF: token now session-bound (rejects unauthenticated "anon" sessions before Redis lookup)
  • /data/memory removed from /chat/media search dirs — prevents internal snapshots from being served publicly
  • skills.py: slug re-validated on toggle/edit/delete via _safe_skill_dir() — prevents directory traversal
  • http_request: _is_ssrf_target() blocks RFC-1918 + cloud metadata endpoints
  • Error responses: str(e) replaced with first line only, capped at 120 chars — prevents internal info leakage

Dashboard

  • 3 new pages: Vector Memory (/vector-memory), World Model (/world-model), Skill Evolution (/skill-evolution)

Scale

  • Scheduler: 20 → 22 background jobs
  • Memory: 8 → 9 persistent layers
  • DB: 14 → 18 tables (memory_embeddings, skill_patterns, entity_states, state_predictions)
  • 12 new configuration feature-flag variables

Phases 1–18 (Core Development)

The initial 18 development phases built the foundational architecture of WASP:

Phase | Key Systems
1–3 | Event-driven architecture (Redis Streams), core agent loop, episodic memory (PostgreSQL)
4–6 | Skill system (SkillBase, SkillExecutor, PolicyEngine), custom skills, task scheduler
7 | Health monitor, SelfHealer, Introspector
8 | Dashboard (Quart), session auth, CSRF protection, audit logging
9 | Agent autonomy: shell, Python execution, browser (Selenium + Chromium), named sessions
10 | Knowledge Graph (PostgreSQL + Redis cache, rule-based NLP extraction)
11 | Self-Model (Redis agent:self_model), Epistemic State, domain confidence
12 | Procedural Memory (abstract_procedure(), keyword retrieval, few-shot injection)
13 | Temporal World Model (world_timeline table, price/state extraction, trend detection)
14 | Anticipatory Simulation (pre-execution consequence analysis for privileged skills)
15 | Multi-agent orchestration v1 (AgentOrchestrator, AgentRuntime, CapabilitySandbox, inter-agent PostgreSQL bus)
16 | Dream Mode (DreamJob: memory consolidation, KG enrichment, LLM reflection, world pre-fetch)
17 | Autonomous Goal Generator (proactive LLM-evaluated goal creation, rate limiting, CPI guard)
18 | QA/SRE audit: 208 tests (unit/integration/e2e/chaos/security), 9 connector ID fixes, Makefile

Statistics at v2.6

Metric | Count
Built-in skills | 37
Background scheduler jobs | 41
Memory layers | 18 (11 primary + 7 auxiliary)
PostgreSQL tables | 20
Integration connectors | 40+
LLM providers | 11
Max LLM rounds (Sovereign) | 12
Max concurrent goals | 3
Max concurrent agents | 10
Test suite | 208 tests
AuditLog action types | skill.shell, skill.self_improve, agent.reset, + all goal/task actions
Dashboard pages | 24