Skip to main content

Monitoring

Health checks, log monitoring, and alerting for Konsolidat.

Health Endpoints

Frappe / Konsol API

curl http://localhost:8069/api/method/konsol.api.health
# {"message": {"status": "ok", "app": "konsol",
# "clickhouse": {"status": "healthy", "clickhouse_reachable": true,
# "recent_sync_failures": [], "message": "All systems operational"}}}
  • Guest-accessible — no auth needed
  • Confirms the Frappe web server and Konsol app are running and checks ClickHouse (SELECT 1) plus recent ingestion sync failures
  • Top-level status is ok (ClickHouse healthy), degraded (reachable but tables have recent sync failures), or down (ClickHouse unreachable); the nested clickhouse object has the detail
  • A single guest-accessible call therefore covers the Frappe → ClickHouse path; see Full Stack Check below only if you want to exercise the authenticated query path end-to-end

ClickHouse

curl "http://localhost:8123/?query=SELECT+1"
# 1

# Or via Docker health check:
docker inspect --format='{{.State.Health.Status}}' konsolidat_clickhouse
# healthy

Full Stack Check

Test the complete path (Frappe → ClickHouse):

# Login first
curl -c cookies.txt -X POST http://localhost:8069/api/method/login \
-H "Content-Type: application/json" \
-d '{"usr": "admin", "pwd": "password"}'

# Query via API
curl -b cookies.txt "http://localhost:8069/api/method/konsol.api.epm_value?\
entity=USMF&year=2024&period=1&account=401100"

A successful response with a numeric value confirms the entire pipeline is working.

ClickHouse System Tables

Query Performance

-- Slowest queries in the last hour
SELECT
query_start_time,
query_duration_ms,
read_rows,
formatReadableSize(read_bytes) as data_read,
query
FROM system.query_log
WHERE type = 'QueryFinish'
AND query_start_time > now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 10

Database Sizes

SELECT
database,
formatReadableSize(sum(bytes_on_disk)) as disk_size,
formatReadableQuantity(sum(rows)) as total_rows,
count() as parts
FROM system.parts
WHERE active AND database LIKE 'epm_%'
GROUP BY database
ORDER BY sum(bytes_on_disk) DESC

Table-Level Detail

SELECT
database,
table,
formatReadableSize(sum(bytes_on_disk)) as size,
formatReadableQuantity(sum(rows)) as rows
FROM system.parts
WHERE active AND database = 'epm_gold'
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC

Active Connections

SELECT * FROM system.processes

Error Log

SELECT
event_time,
message
FROM system.text_log
WHERE level = 'Error'
AND event_time > now() - INTERVAL 1 DAY
ORDER BY event_time DESC
LIMIT 20

Frappe Logs

Log Files

LogPathContent
Web~/frappe-bench/logs/web.logHTTP requests to Frappe
Worker~/frappe-bench/logs/worker.logBackground job execution
Frappe~/frappe-bench/logs/frappe.logApplication errors
Scheduler~/frappe-bench/logs/scheduler.logScheduled tasks

Tail API Errors

grep "konsol.api" ~/frappe-bench/logs/frappe.log | tail -20

Common Error Patterns

Log PatternMeaning
ClickHouse connection failedClickHouse unreachable from Frappe
ClickHouse query timeoutQuery exceeded 30s
ValidationError: Invalid scenarioClient sent bad scenario value
403 ForbiddenUser not logged in or lacks permission

Monitoring Setup

UptimeRobot / Pingdom

Point at: https://epm.yourcompany.com/api/method/konsol.api.health

  • Expected: HTTP 200 with "status": "ok"
  • Alert on: Non-200 or timeout

Docker Health Checks

Already configured in docker-compose.yml:

healthcheck:
test: ["CMD", "clickhouse-client", "--query", "SELECT 1"]
interval: 10s
timeout: 5s
retries: 5

dbt Test Monitoring

After each dbt build, check exit code:

dbt build && echo "SUCCESS" || echo "FAILED — check test output"

For scheduled runs, pipe test results to a log or alert system:

dbt test --output json > /var/log/dbt-test-results.json 2>&1

Alert Thresholds

MetricWarningCritical
API response time> 2s> 10s
ClickHouse query time> 5s> 30s (timeout)
ClickHouse disk usage> 70%> 90%
dbt test failuresAny warn severityAny error severity
Airbyte sync failureAny failure

VBA Debug Logging

The Excel VBA module includes a built-in logging system:

  1. Run EPM_ToggleLog to enable logging
  2. A hidden _EPM_Log sheet captures:
    • Timestamp
    • Level (INFO, ERROR, DEBUG)
    • Message (HTTP requests, responses, parse results)
  3. Run EPM_Debug for a full diagnostic test

Next Steps