Operations
Channel: stable
Source repo: JaddaHelpifyr/helpifyr-fabric
Tool / Contract Summary
This page documents how the repository is deployed, verified, restarted, and observed. It is the runtime companion to the contract and API documentation.
Business Value
- gives operators a deterministic run path for local, CI, and live-host verification
- keeps runtime evidence aligned with Fabric contract truth
- prevents host or container mutation from happening without an auditable verify path
Current Verified State
- reference live host: <internal-runtime-redacted>
- host verification posture: read-first over SSH before assuming runtime failure
- main stack and platform-plane compose files live in deploy/compose/
- runtime evidence can be cross-checked through /api/v1/tools/host-snapshot, /api/v1/tools/runtime-status, /api/v1/tools/runtime-evidence, /api/v1/tools/runtime-contracts, and /api/v1/tools/runtime-observations
Workspace Git/Scan Guardrails (Mandatory)
- Gitea is Source of Truth; local Windows workspaces are disposable working copies.
- Never run Codex sessions on the <workspace-root> root; always use a concrete repo path.
- Limit active repo sessions to 2-3 in parallel.
- Before each run in a repo: git fetch --prune, git checkout <branch>, git pull --ff-only.
- No background git discovery loops (git status, git ls-files, worktree scans) without explicit scoped need.
- Automation scripts must run repo-scoped only, never global over <workspace-root>.
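The pre-run sequence above can be sketched as a small helper. This is a minimal illustration and not part of the repository: the function names pre_run_commands and run_pre_run are hypothetical, and the only git invocations used are the three named in the guardrail.

```python
import subprocess
from pathlib import Path


def pre_run_commands(branch: str) -> list[list[str]]:
    """Return the three guardrail commands, in order, for one branch."""
    if not branch or branch.startswith("-"):
        # Avoid passing something that git would parse as an option.
        raise ValueError(f"refusing suspicious branch name: {branch!r}")
    return [
        ["git", "fetch", "--prune"],
        ["git", "checkout", branch],
        ["git", "pull", "--ff-only"],
    ]


def run_pre_run(repo: Path, branch: str) -> None:
    """Run the sequence inside one concrete repo path, stopping on first failure."""
    if not (repo / ".git").exists():
        # Guardrail: only operate on a concrete repo, never a bare workspace root.
        raise ValueError(f"{repo} is not a git repository")
    for cmd in pre_run_commands(branch):
        # check=True raises CalledProcessError so later steps do not run.
        subprocess.run(cmd, cwd=repo, check=True)
```

Because the commands run with cwd set to a single repo path, the helper cannot fan out across the workspace root.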
scan_and_fix Standard
- scripts/scan_and_fix.sh must enforce runner timeout + single-run lock + .env fallback to <workspace-root>/.env and <workspace-root>/.env.
- scripts/scan_open_issues_repo_only.sh must exist and query only current repo open issues via Gitea API.
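The three requirements on the scan_and_fix wrapper (runner timeout, single-run lock, .env fallback) can be illustrated with a short Python sketch. It does not reproduce scripts/scan_and_fix.sh: every name below is hypothetical, the lock is a plain exclusive-create file, and the lookup order (repo-local .env first, then the workspace-level one) is an assumption.

```python
import os
import subprocess
from contextlib import contextmanager
from pathlib import Path
from typing import Optional


@contextmanager
def single_run_lock(lock_path: Path):
    """Hold an exclusive lock file; raise if another run already holds it."""
    try:
        # O_EXCL makes creation atomic: it fails if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError(f"another run holds {lock_path}") from None
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))
        yield
    finally:
        lock_path.unlink(missing_ok=True)


def run_with_timeout(cmd: list[str], timeout_s: float) -> int:
    """Run the runner command; return its exit code, or 124 on timeout."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return 124  # same exit-code convention as coreutils `timeout`


def resolve_env_file(repo: Path, workspace_root: Path) -> Optional[Path]:
    """Assumed order: the repo's own .env, then the workspace-level fallback."""
    for candidate in (repo / ".env", workspace_root / ".env"):
        if candidate.is_file():
            return candidate
    return None
```

A stale lock left by a crashed run would block later runs in this sketch; a real wrapper would also need a stale-lock policy.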
Workspace Hygiene
- Daily cleanup: stale _worktrees/*, _tmp/*, test-results/*, large temporary artifacts.
- Weekly cleanup: stale local branches/worktrees.
- Never leave valuable artifacts as untracked files in workspace root.
Dirty-State Policy
- Dirty state is allowed while actively implementing.
- Before new scan/automation runs: commit/stash, or use a dedicated worktree.
- Never propagate dirty_unknown states.
Incident Playbook (git.exe storm)
- Identify parent of git.exe (usually one Codex.exe).
- Stop only the offending process tree.
- Restart session on concrete repo path.
- Reduce parallel sessions.
- Verify git.exe count drops within 30-60s.
Available Now
Runtime paths
- main stack: deploy/compose/jhf-fabric.stack.yml
- low-CPU main stack: deploy/compose/jhf-fabric.stack.low-cpu.yml
- platform plane: deploy/compose/jhf-fabric.platform-plane.yml
- low-CPU platform plane: deploy/compose/jhf-fabric.platform-plane.low-cpu.yml
- ephemeral suite: deploy/compose/docker-compose.test.yml
- low-CPU ephemeral suite: deploy/compose/docker-compose.test.low-cpu.yml
Operational scripts
- scripts/resolve-runtime-env.sh
- scripts/build-runtime-tool-env.sh
- scripts/ensure-fabric-docker-resources.sh
- scripts/redeploy-host-stack.sh
- scripts/redeploy-platform-plane.sh
- scripts/prepare-platform-plane-assets.sh
- scripts/test-up.sh
- scripts/test-run.sh
- scripts/test-down.sh
- scripts/bootstrap_wikijs_docs.py
- scripts/safe_docker_logs.sh
- scripts/post-deploy-guardrails.sh
- scripts/verify-runtime-guardrails.sh
- scripts/verify_runtime_materialization.py
Wiki.js and platform-plane assets
- deploy/compose/platform-plane/wiki/HELPIFYR_WIKI_HOME.md
- deploy/compose/platform-plane/wiki/favicon.svg
- deploy/compose/platform-plane/wiki/jadda_helpifyr_logo.svg
- deploy/compose/platform-plane/wiki/helpifyr-wiki-theme.css
- docs/operations/WIKIJS_PLATFORM_PLANE.md
Optional / Extended
- low-CPU deployment variants
- platform-plane services such as Wiki.js, Prometheus, Grafana, and OpenTelemetry Collector
- optional consumers such as internal docs portals and downstream runtime dashboards
Planned / Not In Current Scope
- any host mutation path that is not represented by a real script or guarded preview surface
- undocumented write flows against providers or downstream tools
Public Surfaces
Operator runtime and evidence routes:
- GET /health
- GET /api/v1/platform/services
- GET /api/v1/tools/host-snapshot
- GET /api/v1/tools/runtime-status
- GET /api/v1/tools/runtime-evidence
- GET /api/v1/tools/runtime-contracts
- GET /api/v1/tools/runtime-observations
- GET /api/v1/observability/readiness
- GET /api/v1/security/readiness
- GET /api/v1/recovery/readiness
- GET /api/v1/signoff/readiness
Contract Families
Operations interact directly with:
- runtime port contracts
- provider instance registry
- drift reports
- docs and wiki governance contracts
- shared topology and shared service baseline contracts
Producer / Consumer Mapping
- producer: Fabric publishes runtime observations and contract-shaped operational evidence
- consumer: operators, CI, Wiki.js, and downstream repos
- boundary rule: host observations are consumed into Fabric, but they do not override contract truth
Compatibility Window
- live-host posture is Linux and POSIX/bash first
- compose and redeploy scripts are the canonical operational interface
- direct ad-hoc mutation is not considered a compatible operator path
Lifecycle Status
- active deployment and verification posture
- host and platform-plane scripts are maintained alongside the API and contract layers
Readiness / Drift / Monitoring
Recommended health and readiness order:
- GET /health
- GET /api/v1/platform/services
- GET /api/v1/observability/readiness
- GET /api/v1/security/readiness
- GET /api/v1/recovery/readiness
- GET /api/v1/signoff/readiness
- subsystem-specific readiness for persistence, Dapr, events, tooling, identity, or providers as needed
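The recommended order above lends itself to a small probe that stops at the first failing surface. The sketch below is illustrative only: it assumes that HTTP 200 means "ready" (the response bodies are not documented here), and the names http_status and first_failing are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional

# Paths copied, in order, from the recommended list above.
READINESS_ORDER = [
    "/health",
    "/api/v1/platform/services",
    "/api/v1/observability/readiness",
    "/api/v1/security/readiness",
    "/api/v1/recovery/readiness",
    "/api/v1/signoff/readiness",
]


def http_status(url: str, timeout_s: float = 5.0) -> int:
    """Return the HTTP status for a GET, or 0 when the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # 4xx/5xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        return 0


def first_failing(
    base_url: str, fetch: Callable[[str], int] = http_status
) -> Optional[str]:
    """Probe in order; return the first path that is not 200, else None."""
    base = base_url.rstrip("/")
    for path in READINESS_ORDER:
        if fetch(base + path) != 200:
            return path
    return None
```

Stopping at the first failure keeps the probe lightweight and points the operator at the earliest broken layer rather than a cascade of downstream symptoms.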
Monitoring stack:
- Prometheus for metrics collection
- Grafana for dashboards
- OpenTelemetry Collector for telemetry export
- /api/v1/monitoring/metrics for tool and policy metrics
Deployment / Verify
Validate compose
- bash ./scripts/resolve-runtime-env.sh /tmp/jhf-fabric-resolved.env
- docker compose --env-file /tmp/jhf-fabric-resolved.env -f deploy/compose/jhf-fabric.stack.yml config
- docker compose -f deploy/compose/jhf-fabric.platform-plane.yml config
- docker compose --env-file deploy/compose/platform-plane/wiki/.env -f deploy/compose/jhf-fabric.platform-plane.yml config
- docker compose -f deploy/compose/docker-compose.test.yml config
Deploy or redeploy
- bash ./scripts/redeploy-host-stack.sh
- bash ./scripts/redeploy-platform-plane.sh
- bash ./scripts/prepare-platform-plane-assets.sh
- bash ./scripts/ensure-fabric-docker-resources.sh
- bash ./scripts/verify-runtime-guardrails.sh
- python ./scripts/verify_runtime_materialization.py --check
- python ./scripts/verify_runtime_materialization.py --check --live-via-ssh <internal-runtime-redacted><internal-runtime-redacted>
Local API start
uvicorn helpifyr_fabric.api.app:app --reload
Ephemeral verification
- bash ./scripts/test-up.sh
- bash ./scripts/test-run.sh
- bash ./scripts/test-down.sh
Docs bootstrap
- python scripts/bootstrap_wikijs_docs.py --wiki-url http://<internal-runtime-redacted>:33001 --site-host https://helpifyr.com/docs/
- add --dry-run to preview output without writing pages
- python scripts/docs/materialize_public_docs_site.py
- python scripts/docs/export_public_docs_bundle.py
- npm run build --prefix docs-site
- npm run deploy:cloudflare --prefix docs-site
Wiki.js bootstrap remains internal/operator-only. The live public docs entrypoint is https://helpifyr.com/docs/, served through the Astro homepage runtime with a Docusaurus-owned docs route family materialized from Fabric docs-platform truth.
The tracked bundle contracts/docs/public_docs_site_bundle.json remains the canonical Fabric-generated handoff artifact; it must not reintroduce jhf-docs as the current public publisher.
Scan&Fix automation
Quickstart:
- bash scripts/scan_and_fix.sh --dry-run
- bash scripts/scan_and_fix.sh
- bash scripts/scan_and_fix.sh --issue 344 --dry-run
- bash scripts/scan_and_fix.sh --max-issues 1
- bash scripts/scan_and_fix.sh --executor-cmd 'python -c "import sys; print(sys.stdin.read())"' --issue 344
Dry-run:
- prints the reconciled open issue set in ranked order
- classifies the queue into really-open-repo-owned, already-covered-by-pr, and blocked_external
- excludes open pull requests from the executable issue queue even when the backing Gitea issues API returns them in the same feed
- falls back to the visible repo-scoped Gitea /pulls page plus per-PR detail reads when the open-pulls list API returns an empty set but open PRs still exist
- prints the generated execution prompt with repo, issue, branch, worktree, and live-verify host/user context
- does not invoke the runner
- when no --issue or --max-issues is provided, the dry-run shows the full currently open ranked queue plus the actionable executable subset
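The queue classification described above can be sketched as follows. This is a hedged illustration, not the script's logic: the Issue shape, the use of a blocked_external label as the "explicitly marked" signal, the covered_by_pr set, and the precedence of blocked_external over PR coverage are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    number: int
    labels: set = field(default_factory=set)
    is_pull_request: bool = False  # the issues feed can include PRs


def classify(issues, covered_by_pr):
    """Split a ranked feed into the three buckets, preserving rank order."""
    buckets = {
        "really-open-repo-owned": [],
        "already-covered-by-pr": [],
        "blocked_external": [],
    }
    for issue in issues:
        if issue.is_pull_request:
            continue  # PRs never enter the executable issue queue
        if "blocked_external" in issue.labels:
            # Assumed precedence: an external block wins over PR coverage.
            buckets["blocked_external"].append(issue.number)
        elif issue.number in covered_by_pr:
            buckets["already-covered-by-pr"].append(issue.number)
        else:
            buckets["really-open-repo-owned"].append(issue.number)
    return buckets
```

Only the really-open-repo-owned bucket would be dispatched; the other two are reported so the operator can see why an issue was skipped.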
Live-run:
- prefer --executor-cmd when you want an explicit runner path, for example a local Codex wrapper
- if --executor-cmd is omitted, the script falls back to SCAN_AND_FIX_RUNNER_CMD, then to the first verified local Codex CLI candidate it can execute, including a resolved per-user Windows/Git-Bash path such as C:/Users/<user>/.codex/bin/codex.exe before any inaccessible WindowsApps alias
- the auto-discovered local Codex fallback uses the local danger-full-access execution mode by default because nested Windows --full-auto runs can fail before repo work starts with sandbox spawn errors; use --executor-cmd when you need a different runner posture explicitly
- example: bash scripts/scan_and_fix.sh --executor-cmd 'python -c "import sys; print(sys.stdin.read())"' --issue 344
- default behavior:
  - bash scripts/scan_and_fix.sh processes the full currently open ranked issue set, then dispatches only the really-open-repo-owned subset
  - bash scripts/scan_and_fix.sh --max-issues 1 limits the run to the highest-ranked issue
  - open pull requests are skipped so the queue stays issue-driven instead of PR-driven, and issues already covered by an open PR or explicitly marked blocked_external are reported but not dispatched by default
- shared-host CI executes only the scanfix_fastpath pytest subset by default; the heavier scanfix_extended harness/fallback coverage is reserved for explicit local verification or workflow_dispatch so the shared runner lane stays low-pressure
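The runner fallback chain (explicit --executor-cmd, then SCAN_AND_FIX_RUNNER_CMD, then the first usable local candidate) can be sketched as below. The function name resolve_runner and the candidates argument are hypothetical, and the sketch only checks that a candidate is executable; whatever extra verification the real script performs is not modeled.

```python
import shutil
from typing import Mapping, Optional, Sequence


def resolve_runner(
    executor_cmd: Optional[str],
    env: Mapping[str, str],
    candidates: Sequence[str],
) -> str:
    """Pick the runner: explicit flag, then env var, then first usable candidate."""
    if executor_cmd:
        return executor_cmd
    from_env = env.get("SCAN_AND_FIX_RUNNER_CMD", "").strip()
    if from_env:
        return from_env
    for candidate in candidates:
        # shutil.which returns None when the name is missing or not executable,
        # which is how an inaccessible alias would be skipped in this sketch.
        found = shutil.which(candidate)
        if found:
            return found
    raise RuntimeError(
        "no runner available: pass --executor-cmd or set SCAN_AND_FIX_RUNNER_CMD"
    )
```

Raising when nothing resolves mirrors the failure mode in which the flag is omitted, the variable is unset, and no local candidate is usable.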
Supported parameters:
- --issue <id>
- --dry-run
- --max-issues <n>
- --since <duration>
- --labels <csv>
- --severity-order "critical,high,medium,low"
- --host <host>
- --user <user>
- --executor-cmd <cmd>
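As an illustration of how a value such as --severity-order "critical,high,medium,low" might drive ranking, the sketch below sorts issues by their best severity label. The (number, labels) input shape, the use of labels to carry severity, and the tie-break on issue number are assumptions for the example, not documented behavior of the script.

```python
def severity_rank(order_csv: str) -> dict:
    """Map each severity name to its position; lower means more urgent."""
    names = [name.strip().lower() for name in order_csv.split(",") if name.strip()]
    return {name: index for index, name in enumerate(names)}


def rank_issues(issues, order_csv="critical,high,medium,low"):
    """Sort (number, labels) pairs by best severity label, then issue number."""
    rank = severity_rank(order_csv)
    unranked = len(rank)  # issues without a known severity sort last

    def key(issue):
        number, labels = issue
        known = [rank[label.lower()] for label in labels if label.lower() in rank]
        return (min(known, default=unranked), number)

    return sorted(issues, key=key)
```

Combined with --max-issues 1, a ranking like this determines which single issue a limited run would pick up.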
Typical failure modes:
- missing required credentials in <workspace-root>/.env or environment, including GITEA_TOKEN for API access
- no matching open issue after filters are applied
- --executor-cmd omitted, SCAN_AND_FIX_RUNNER_CMD unset, and no usable local Codex CLI candidate available
- runner command returns non-zero
- long all-open runs started without a confirming dry-run after issue churn, causing an unintended queue shape
- operator expects mutation on main; the wrapper only prepares issue-driven branch/PR work
Known Limits
- direct host/port verification remains canonical when a service has not yet published its own hostname contract
- platform-plane components are optional and may not be present on every installation
- runtime observations can be delayed or degraded when provider dependencies are unavailable
Exceptions / Waivers
- legacy *.<internal-runtime-redacted> hostnames are redirect surfaces only and must not be treated as primary Fabric-owned truth
- some runtime probes are intentionally lightweight and policy-limited to avoid pressure on host services
Logs
For live-host log inspection, use the bounded snapshot policy in operations/HOST_DOCKER_LOG_GUARDRAILS.md (docs/operations/HOST_DOCKER_LOG_GUARDRAILS.md).
Main stack
- timeout 15s docker compose -f deploy/compose/jhf-fabric.stack.yml logs --since 10m --tail 80 api
- timeout 15s docker compose -f deploy/compose/jhf-fabric.stack.yml logs --since 10m --tail 80 daprd
- timeout 15s docker compose -f deploy/compose/jhf-fabric.stack.yml logs --since 10m --tail 80 postgres
- timeout 15s docker compose -f deploy/compose/jhf-fabric.stack.yml logs --since 10m --tail 80 nats
- bash ./scripts/post-deploy-guardrails.sh jhf-fabric
Platform plane
- timeout 15s docker compose -f deploy/compose/jhf-fabric.platform-plane.yml logs --since 10m --tail 80 api
- timeout 15s docker compose -f deploy/compose/jhf-fabric.platform-plane.yml logs --since 10m --tail 80 prometheus
- timeout 15s docker compose -f deploy/compose/jhf-fabric.platform-plane.yml logs --since 10m --tail 80 grafana
- timeout 15s docker compose --env-file deploy/compose/platform-plane/wiki/.env -f deploy/compose/jhf-fabric.platform-plane.yml logs --since 10m --tail 80 wikijs
- timeout 15s docker compose -f deploy/compose/jhf-fabric.platform-plane.yml logs --since 10m --tail 80 otel-collector
- bash ./scripts/post-deploy-guardrails.sh jhf-fabric
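Every log command in both the main-stack and platform-plane lists shares the same bounded shape: a hard timeout, a --since window, and a --tail cap. A hypothetical helper (bounded_logs_cmd is not a repository function) that builds that argv makes the pattern explicit; the defaults mirror the values used in the commands above.

```python
def bounded_logs_cmd(
    compose_file,
    service,
    since="10m",
    tail=80,
    timeout_s=15,
    env_file=None,
):
    """Build the argv for one bounded log snapshot of a single service."""
    if tail <= 0 or timeout_s <= 0:
        # An unbounded tail or timeout would defeat the guardrail.
        raise ValueError("tail and timeout_s must be positive")
    cmd = ["timeout", f"{timeout_s}s", "docker", "compose"]
    if env_file:
        cmd += ["--env-file", env_file]
    cmd += ["-f", compose_file, "logs", "--since", since, "--tail", str(tail), service]
    return cmd
```

Building the command in one place makes it harder to issue an unbounded docker compose logs call by accident; the resulting list can be passed directly to subprocess.run.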
CPU-Safe Runtime Guardrails
- Standard verify and redeploy paths must stay bounded and low-pressure.
- Use bash ./scripts/verify-runtime-guardrails.sh before release-oriented changes to confirm repo-owned stack truth, bounded diagnostics, and low-pressure defaults.
- Use python ./scripts/verify_runtime_materialization.py --check --live-via-ssh <internal-runtime-redacted><internal-runtime-redacted> when runtime/config changes are involved to prove repo truth, active compose labels, container env/mounts/networks, and app readback stayed aligned.
- Post-deploy cleanup is mandatory through bash ./scripts/post-deploy-guardrails.sh jhf-fabric; the script fails closed if stale repo-owned docker logs, docker compose ... logs, or docker exec diagnostics remain beyond the configured minimum age.
Typical Failure Modes
- Dapr sidecar unavailable
- PostgreSQL bootstrap or migration incomplete
- NATS or event publication readiness drift
- stale repository manifests or tool profiles
- Grafana active-dashboard provisioning drift
- provider runtime evidence delay or DNS/network mismatch on the host
Diagnosis Order
- validate compose config
- inspect docker compose ... ps
- inspect bounded Fabric API log snapshots
- inspect bounded dependency service log snapshots
- run readiness endpoints in the order above
- use subsystem runbooks before mutating state
Restart And Recovery
- restart only the affected Fabric-owned service when possible
- prefer additive rebuild or restart over manual state edits
- use STACK_RECOVERY_RUNBOOK (docs/operations/STACK_RECOVERY_RUNBOOK.md) before direct persistence mutation
- use recovery and signoff readiness surfaces to confirm post-restart state
Runtime Dependencies
- PostgreSQL
- NATS JetStream
- Dapr sidecar
- Gitea for repository contract intake
- optional platform-plane observability services
- optional internal docs consumer through Wiki.js
Related Issues
- operational history and remaining backlog items live under docs/issues/ and docs/AUTONOMOUS_BACKLOG.md
License
- License: AGPLv3
- Project: https://helpifyr.com