
Refactor: standard install/start/check/stop/load/query interface per system#860

Open
alexey-milovidov wants to merge 97 commits into main from refactor/per-system-script-interface

Conversation

@alexey-milovidov
Member

Summary

  • Split each local system's monolithic benchmark.sh into 7 single-purpose scripts (install, start, check, stop, load, query, data-size) with a stable contract, driven by a new shared lib/benchmark-common.sh.
  • Wrap dataframe / in-process systems (pandas, polars-dataframe, chdb-dataframe, daft-parquet*, duckdb-dataframe, sirius) in small FastAPI servers so they fit the same start/stop/query lifecycle.
  • 88 local systems refactored; cloud/managed systems and a handful of non-functional ones are intentionally untouched.

Why

Previously, every system's benchmark.sh bundled installation, server lifecycle, dataset download, data loading, and query dispatch into one script — and run.sh hard-coded the per-query orchestration. There was no programmatic per-query entry point, so:

  1. Tweaking the dataset, query set, or per-query behavior (e.g. restarting the system between queries to neutralize warm-process effects) required editing every system's scripts individually.
  2. Building an online "run query X against system Y" service was impossible.
  3. Most run.sh scripts ran all 3 tries inside a single CLI invocation, so OS-cache warmth from try 1 leaked into tries 2 and 3.

The new per-system interface

| Script | Stdin | Stdout | Stderr | Notes |
| --- | --- | --- | --- | --- |
| install | - | progress | progress | Idempotent. Env prep + system install. |
| start | - | - | progress | Start daemon. Idempotent. Empty/exit-0 for stateless tools. |
| check | - | - | progress | Trivial query (e.g. SELECT 1). Exit 0 iff responsive. |
| stop | - | - | progress | Stop daemon. Idempotent. |
| load | - | progress | progress | Runs create.sql + loads data; deletes source files, then sync. |
| query | one query | query result, any format | last line: fractional seconds (0.123) | Non-zero exit on failure. |
| data-size | - | bytes (one integer) | - | Reports the data footprint. |
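
For concreteness, here is a minimal sketch of a server-backed query script that satisfies this contract. The psql invocation, connection flags, and timing approach are illustrative only, not the exact script shipped for any system in this PR:

```bash
#!/bin/bash
# Hypothetical query script for a PostgreSQL-like server:
# one query on stdin, result on stdout, fractional seconds as the
# last line of stderr, non-zero exit on failure.
set -e
QUERY=$(cat)

START=$(date +%s.%N)
echo "$QUERY" | psql -h localhost -U test -d test -t -v ON_ERROR_STOP=1
END=$(date +%s.%N)

# The driver only parses this last stderr line.
awk "BEGIN { printf \"%.3f\n\", $END - $START }" >&2
```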

Each system's benchmark.sh becomes a 4-line shim that sets a couple of env vars and exec's the shared driver:

```bash
#!/bin/bash
export BENCH_DOWNLOAD_SCRIPT="download-hits-parquet-partitioned"
export BENCH_RESTARTABLE=yes
exec ../lib/benchmark-common.sh
```

The shared driver runs install → start+check → download → load (timed) → for each query: flush caches; if BENCH_RESTARTABLE=yes, stop+start; run query 3× → data-size → stop. The output log shape (Load time:, one [t1,t2,t3] line per query, Data size:) is identical to the old benchmark.sh, so cloud-init.sh.in's POST to play.clickhouse.com keeps working unchanged.

BENCH_RESTARTABLE=no is used for embedded CLIs (duckdb, sqlite, datafusion, …) and dataframe wrappers, since restarting a single CLI/Python process between queries would dominate query time. For these, OS caches are still flushed between queries.
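
Sketched out, the per-query part of that flow looks roughly like the following. This is a simplified illustration of the behavior described above, not the literal code in lib/benchmark-common.sh:

```bash
#!/bin/bash
# Simplified illustration of the per-query loop; helper details and file
# names are not the exact code in lib/benchmark-common.sh.
TRIES=3

while read -r query; do
    times=()
    for _ in $(seq 1 "$TRIES"); do
        sync
        echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null     # flush OS caches

        if [ "$BENCH_RESTARTABLE" = "yes" ]; then
            ./stop  > /dev/null 2>&1 || true
            ./start > /dev/null 2>&1 || true
            until ./check > /dev/null 2>&1; do sleep 1; done       # readiness signal
        fi

        # ./query: result on stdout (discarded here), runtime as the last stderr line
        if echo "$query" | ./query > /dev/null 2> .query_stderr; then
            times+=("$(tail -n 1 .query_stderr)")
        else
            times+=(null)                                          # failed query -> null
        fi
    done
    # One "[t1,t2,t3]" line per query, matching the old benchmark.sh output
    printf '[%s,%s,%s]\n' "${times[@]}"
done < queries.sql                                                 # one query per line
```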

Scope

Refactored (88 systems):

  • Server, restartable: clickhouse, postgresql, mysql, mariadb, monetdb, druid, pinot, vertica, exasol, kinetica, heavyai, questdb, cockroachdb, elasticsearch, ydb, … and the postgres/clickhouse/mysql variants (timescaledb, citus, paradedb, postgresql-indexed, clickhouse-parquet*, clickhouse-datalake*, mysql-myisam, tidb, infobright, …)
  • Embedded CLI, not restartable: duckdb (and variants), sqlite, datafusion (and partitioned), glaredb (and partitioned), hyper, hyper-parquet, octosql, opteryx, sail (and partitioned), drill, turso, chdb, chdb-parquet-partitioned
  • Dataframe with FastAPI wrapper, not restartable: pandas, polars-dataframe, chdb-dataframe, daft-parquet, daft-parquet-partitioned, duckdb-dataframe, sirius
  • Spark family: spark, spark-auron, spark-comet, spark-gluten

Not refactored (intentionally out of scope):

  • Cloud / managed: alloydb, athena, aurora-{mysql,postgresql}, bigquery, clickhouse-cloud, databricks, motherduck, redshift, redshift-serverless, snowflake, hydrolix, firebolt, hologres, tinybird, hydra, mariadb-columnstore, pg_duckdb, singlestore, supabase, tablespace, tembo-olap, timescale-cloud, crunchy-bridge-for-analytics, s3select, …
  • Non-functional: csvq, dsq, locustdb (panic on first query); exasol, spark-velox (empty dirs)
  • Non-SQL or no SQL CLI: mongodb (JS aggregation pipelines), polars (no SQL CLI; the dataframe variant is wrapped instead)

Validated end-to-end on a 96-core / 185 GB ARM machine

| System | Data | Outcome |
| --- | --- | --- |
| clickhouse | 14.2 GB / 100M rows | Full 43 queries × 3 tries with stop/start between queries; load 124s |
| duckdb | 20.6 GB / 100M rows | Full 43 queries × 3 tries (no restart); load 69s |
| pandas | 4.2 GB in-mem (5M-row subset) | 42/43 queries; Q43 hit a pandas lambda bug → recorded as null (framework's error path works) |
| sqlite | 3.9 GB (5M-row subset) | First 5 queries × 3 tries; load 68s |
| postgresql | 100M rows / 75 GB TSV | First 3 queries × 3 tries with restart; load 829s. Cold-cache spike clearly visible (135s → 7s after warmup), confirming that the per-query restart actually flushes the page cache |

All 88 refactored systems pass bash -n and have executable bits set on the 7 scripts + benchmark.sh.

Bug fixes surfaced during validation

  • lib/benchmark-common.sh: data-size now runs before stop (clickhouse and pandas need the server up to report size).
  • clickhouse/start: idempotent (was erroring when already running).
  • duckdb/load, sqlite/load: rm -f hits.db/mydb for idempotent reruns.
  • postgresql/load: -v ON_ERROR_STOP=1 so COPY data errors actually fail the script instead of silently rolling back.
  • BENCH_DOWNLOAD_SCRIPT may now be empty for systems that read directly from S3 datalakes / remote services (clickhouse-datalake*, duckdb-datalake*, chyt, …).

Flagged for follow-up review

  • duckdb-memory: the :memory: semantics force a per-query reload; this will inflate timings vs. the original single-process flow.
  • cloudberry, greenplum — multi-phase install (reboot between phases); the shim only runs phase 1.
  • sirius — GPU-dependent; long-lived duckdb CLI subprocess proxy; review the stdin/sentinel protocol.
  • paradedb*, pg_ducklake, pg_mooncake — Docker container created in install then docker cp in load (small divergence from the original docker run -v ... due to the lifecycle order: start runs before download).

Test plan

  • bash -n on all 88 systems' scripts
  • clickhouse: full 43-query benchmark.sh on 100M-row real data
  • duckdb: full 43-query benchmark.sh on 100M-row real data
  • pandas: 43-query benchmark.sh on a 5M-row subset
  • sqlite: abbreviated benchmark.sh on a 5M-row subset
  • postgresql: abbreviated benchmark.sh on full 100M-row data
  • Smoke-run on a fresh c6a.metal/equivalent VM via cloud-init for a representative system from each family before merging
  • Verify play.clickhouse.com log-ingestion sink continues to parse the output for at least one production benchmark run

🤖 Generated with Claude Code

alexey-milovidov and others added 3 commits May 7, 2026 12:14
…/data-size

Each local system now exposes a small set of single-purpose scripts with a
stable contract, so they can be driven by a shared lib/benchmark-common.sh
and reused by external tooling (e.g. an online "run query against system X"
service):

  install     env prep + system install (idempotent)
  start       start daemon (idempotent; empty for stateless tools)
  check       trivial query, exit 0 iff responsive
  stop        stop daemon (idempotent)
  load        runs create.sql + loads data, deletes source files, sync
  query       SQL on stdin; result on stdout; runtime in fractional seconds
              on the last line of stderr; non-zero exit on error
  data-size   prints data footprint in bytes (one integer to stdout)

Each system's old monolithic benchmark.sh is replaced by a 4-line shim that
sets a couple of env vars (BENCH_DOWNLOAD_SCRIPT, BENCH_RESTARTABLE) and
exec's lib/benchmark-common.sh. The shared driver runs the unified flow:
install -> start+check -> download -> load (timed) -> for each query
{flush caches; optionally stop+start to neutralize warm-process effects;
run query 3x} -> data-size -> stop. Output format ([t1,t2,t3], Load time,
Data size) matches the previous benchmark.sh exactly so cloud-init.sh.in's
log POST to play.clickhouse.com keeps working unchanged.

For dataframe/in-process systems (pandas, polars-dataframe, chdb-dataframe,
daft-parquet*, duckdb-dataframe, sirius), the engine is wrapped in a small
FastAPI server (server.py) so the start/stop/query interface still applies.
BENCH_RESTARTABLE=no for these (and for embedded CLIs like duckdb, sqlite,
datafusion, etc.) since restarting a single Python/CLI process between
queries would dominate query time.
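
From the shell side, the wrapper keeps the same contract. A hedged sketch of how the four lifecycle scripts can drive such a server; the port and the /health and /query endpoint paths are assumptions here, not necessarily what server.py exposes:

```bash
# start: launch the wrapper once; it holds the in-memory data between queries.
nohup python3 server.py > server.log 2>&1 &
echo $! > server.pid

# check: exit 0 iff the wrapper answers.
curl -sf http://localhost:8000/health > /dev/null

# query: post the SQL read from stdin; the wrapper returns the result plus the
# server-side elapsed seconds, which the script echoes as the last stderr line.
curl -sf --data-binary @- http://localhost:8000/query || exit 1

# stop: kill the tracked PID.
kill "$(cat server.pid)" 2> /dev/null || true
```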

Scope: 88 local systems refactored. Cloud/managed systems and a handful of
non-functional ones (csvq, dsq, locustdb, mongodb, polars CLI, exasol,
spark-velox) are intentionally left untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves conflict in clickhouse-datalake{,-partitioned}: upstream switched
the datalake variants from filesystem-cache to userspace page-cache (PR #818).
The refactored install/query scripts now adopt the page-cache approach.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mongodb: query takes a MongoDB aggregation pipeline (Extended JSON, one
line) on stdin instead of SQL — these are the same canonical 43 ClickBench
queries, just expressed as mongo pipelines. queries.txt is generated from
queries.js (the source of truth) by replacing JS-only constructors
(NumberLong, ISODate, NumberDecimal) with their EJSON canonical form. The
shim sets BENCH_QUERIES_FILE=queries.txt to point the driver at it.
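
The constructor substitution is roughly this kind of transformation (a hypothetical one-off; the actual generator also has to keep each pipeline on a single line, and the exact EJSON shapes shown are approximate):

```bash
# Hypothetical conversion; the real generator also keeps each pipeline on one
# line, and the exact EJSON target shapes are approximate.
sed -E \
    -e 's/NumberLong\("?([0-9]+)"?\)/{"$numberLong": "\1"}/g' \
    -e 's/NumberDecimal\("([^"]+)"\)/{"$numberDecimal": "\1"}/g' \
    -e 's/ISODate\("([^"]+)"\)/{"$date": "\1"}/g' \
    queries.js > queries.txt
```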

polars: wrapped in a FastAPI server analogous to polars-dataframe, but the
load step uses pl.scan_parquet (LazyFrame) so the parquet file remains
needed at query time — the load script does NOT delete hits.parquet.
data-size returns the on-disk parquet size since a LazyFrame has no
materialized in-memory size.

Both systems now expose the standard install/start/check/stop/load/query/
data-size scripts and a 4-line benchmark.sh shim, removing the old
benchmark.sh / run.js / query.py / formatResult.js paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…use in query

Per review: clickhouse-local persists table metadata in its --path dir, so
the CREATE TABLE only needs to run once during ./load. ./query just runs
the query against the persisted table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alexey-milovidov and others added 3 commits May 7, 2026 12:29
…atively

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… readiness

Per review (alexey-milovidov): clickhouse start leaves the system in the
desired state (server running) even when it returns non-zero with "already
running". Make the shared driver tolerate non-zero from ./start and rely on
bench_check_loop as the authoritative readiness signal. This lets per-system
start scripts stay simple — they just need to make a best-effort attempt to
launch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
prmoore77 added a commit to gizmodata/ClickBench that referenced this pull request May 7, 2026
…ouse#860)

Adopts the per-system 7-script interface from ClickHouse#860 for gizmosql/, and
replaces the Java sqlline-based gizmosqlline client with the C++
gizmosql_client shell that ships with gizmosql_server.

Scripts (matching the contract from lib/benchmark-common.sh):
  benchmark.sh - 4-line shim that exec's ../lib/benchmark-common.sh
  install      - apt + curl gizmosql_cli_linux_$ARCH.zip; no openjdk, no
                 separate gizmosqlline download
  start        - idempotent server bring-up (skips if port 31337 is open)
  check        - cheap TCP probe (auth-gated SQL would need credentials)
  stop         - kills tracked PID; pkill belt-and-braces fallback
  load         - rm -f clickbench.db, then create.sql + load.sql via
                 gizmosql_client; deletes hits.parquet and sync's
  query        - reads one query from stdin, runs via gizmosql_client with
                 .timer on + .mode trash; emits fractional seconds as the
                 last stderr line (parsed from "Run Time: X.XXs")
  data-size    - wc -c clickbench.db

Notes:
- BENCH_DOWNLOAD_SCRIPT=download-hits-parquet-single, BENCH_RESTARTABLE=yes
  (gizmosql is a server, so per-query restart neutralizes warm-process
  effects, matching the clickhouse/postgres pattern in ClickHouse#860).
- util.sh now exports GIZMOSQL_HOST/PORT/USER/PASSWORD - the env vars
  gizmosql_client reads natively, so query/load can call gizmosql_client
  with no flags. The server still receives the username via --username.
- PID_FILE moved to a stable /tmp path (was /tmp/gizmosql_server_$$.pid,
  which broke across the start/stop process boundary in the new layout).

This PR depends on ClickHouse#860 (which introduces lib/benchmark-common.sh and the
contract). Once ClickHouse#860 lands, this PR's diff against main will be only
the gizmosql/ files. Validated locally on macOS with gizmosql v1.22.4:
the query script produces the expected fractional-seconds last line on
stdout/stderr separation, and exits non-zero on error paths.

See https://docs.gizmosql.com/#/client for gizmosql_client docs.
alexey-milovidov and others added 18 commits May 9, 2026 01:22
Resolves merge conflicts:

- Removed cedardb/run.sh, gizmosql/run.sh — superseded by the standard
  query interface; the refactor branch already replaced them.
- Restored datafusion{,-partitioned}/make-json.sh, doris{,-parquet}/get-result-json.sh
  with main's dated-results version. These are independent post-run JSON
  builders, still referenced from the per-system READMEs.
- Kept the thin benchmark.sh shim in gizmosql/, spark-{auron,comet,gluten}/,
  trino/. Per-system result-JSON auto-save (added on main while this branch
  was in flight) is intentionally not carried over: under the new interface,
  result.csv is the single timing artifact and JSON construction belongs in
  separate tooling.
- gizmosql/{install,load,query,util.sh}: merge auto-took main's switch from
  gizmosqlline (Java) to gizmosql_client (CLI shipped with the server),
  but the refactor branch's load/query still referenced GIZMOSQL_SERVER_URI
  and GIZMOSQL_USERNAME. Updated install to drop openjdk + gizmosqlline,
  load to use gizmosql_client (and stop the server first to release the
  database file), and query to drive gizmosql_client with .timer/.mode trash
  and parse "Run Time:" instead of "rows selected (... seconds)".
…-system layout

These four entries were added on main while this branch was in flight (the
existing trino/ scripts here were a memory-connector stub that never worked
end-to-end). Rebuild each one against the new install/start/check/stop/load/
query/data-size contract so they share lib/benchmark-common.sh:

- trino, trino-partitioned: Hive connector + file metastore + local Parquet
  hardlinked into data/hits/ (matches main's working impl from PR #856).
- trino-datalake{,-partitioned}: same, plus the AnonymousAWSCredentials shim
  to read clickhouse-public-datasets/hits_compatible/athena from anonymous
  S3 (the published bucket size is reported by data-size since the data is
  read on demand). BENCH_DOWNLOAD_SCRIPT="" — no local dataset to fetch.
- benchmark.sh in all four becomes a 4-line shim. Old run.sh deleted.
…r-system layout

These four entries were added on main while this branch was in flight.
Adapt them to the install/start/check/stop/load/query/data-size contract:

- presto, presto-partitioned: Hive connector + file metastore + local Parquet
  hardlinked into data/hits/.
- presto-datalake{,-partitioned}: same plus the AnonymousAWSCredentials shim
  (compiled in a throwaway trinodb/trino container, since the prestodb image
  ships only a JRE) so the hive-hadoop2 plugin can read the public bucket
  anonymously. BENCH_DOWNLOAD_SCRIPT="" — schema-only load against S3.

Each benchmark.sh becomes a 4-line shim. Old run.sh deleted.
These two entries were added on main while this branch was in flight.
Adapt to the install/start/check/stop/load/query/data-size contract:

- BENCH_DOWNLOAD_SCRIPT="" — the vortex bench binary fetches Parquet and
  converts to .vortex on first invocation.
- BENCH_RESTARTABLE=no — embedded Rust CLI; per-query restart would
  dominate query time.
- query: stages stdin into a temp queries-file and passes -q 0, since the
  bench binary addresses queries by index rather than reading SQL on stdin.
- The single variant uses the `clickbench` binary (vortex 0.34.0); the
  partitioned variant uses `query_bench clickbench` (vortex 0.44.0). Old
  run.sh deleted.
Quickwit was added on main while this branch was in flight. Adapt to the
install/start/check/stop/load/query/data-size contract:

- BENCH_QUERIES_FILE="queries.json" — Quickwit accepts Elasticsearch-format
  JSON queries via the /_elastic compat API, not SQL. queries.json holds one
  ES query per line; queries not expressible in Quickwit are encoded as the
  literal "null".
- BENCH_DOWNLOAD_SCRIPT="" — the load script fetches hits.json.gz directly
  (there is no shared download-hits-json helper) and pipes it through
  `quickwit tool local-ingest`, since v0.9's sharded ingest-v2 endpoint caps
  single-node throughput at a few MB/s.
- BENCH_RESTARTABLE=yes — relies on the common driver's per-query restart
  to flush Quickwit's fast_field_cache and split_footer_cache (the result
  caches are already disabled in node-config.yaml).
- query: returns non-zero for "null" queries so the framework records null
  in the per-query timing array; otherwise reports .took (ms → seconds).

Old run.sh deleted.
The original used /tmp/gizmosql_server_$$.pid where $$ is the calling
process's PID. That worked when benchmark.sh sourced util.sh and called
start/stop in the same shell, but under the new per-system layout each of
start, stop, load, and query sources util.sh in its own subshell — so
stop_gizmosql couldn't find the PID file written by start_gizmosql. Use a
fixed path under the system directory instead. Also expose wait_for_gizmosql
so callers (like load) can wait for readiness without restarting.
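
In shell terms, the fix amounts to something like the following (the path and the readiness probe are illustrative, based on the description above rather than the actual util.sh):

```bash
# Before: $$ is the PID of whichever script sources util.sh, so start and
# stop computed different file names.
# PID_FILE="/tmp/gizmosql_server_$$.pid"

# After: one fixed path that start, stop, load, and query all resolve identically.
PID_FILE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/gizmosql_server.pid"

wait_for_gizmosql() {
    # Readiness probe: poll until the server port accepts connections.
    until (exec 3<>/dev/tcp/localhost/31337) 2>/dev/null; do sleep 1; done
}
```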
Conflict only in gizmosql/benchmark.sh — kept the thin shim. Main switched
gizmosql to the official one-line installer (PR #879); fold that into
gizmosql/install so we stop hand-detecting arch and downloading the zip.

Other changes auto-merged: quickwit/index_config.yaml gained tag_fields on
CounterID + record:basic on text fields (PR #886), and assorted result
JSONs for ClickHouse Cloud / Citus / Cratedb / etc.
start/stop scripts may emit progress lines (clickhouse-server prints PID
table tracking, sudo's chown invocation, postgres's startup messages,
etc.). With BENCH_RESTARTABLE=yes those scripts run before every query,
so their output interleaves with the parseable [t1,t2,t3] / Load time /
Data size lines and breaks the cloud-init log POST to play.clickhouse.com.

Redirect both stdout and stderr from ./start and ./stop to /dev/null at
the three call sites in lib/benchmark-common.sh. The check loop is the
authoritative readiness signal, so losing start's output costs nothing
in steady state; for debugging, run ./start manually outside the driver.
The DuckDB installer at install.duckdb.org drops the binary into
~/.duckdb/cli/latest/duckdb and only suggests adding that directory to
PATH. Previously each install attempted a per-user symlink into
~/.local/bin, which silently no-ops when that directory isn't on PATH
(default for root in cloud-init). The result was ./check failing for
300s with no useful error.

Symlink to /usr/local/bin/duckdb via sudo right after install instead;
that's on PATH for every user, and the symlink is itself idempotent.
Ubuntu's docker.io ships the docker CLI without the v2 compose plugin, so
the existing `command -v docker` short-circuit skipped installation on
boxes that already had docker but no `docker compose`. ./start then ran
`docker compose up -d`, which silently failed, and ./check timed out at
300s. Fall back to docker-compose-v2 for the Ubuntu package name.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Throughput variant of ClickBench. N connections (default 10) hold open
sessions and each picks a uniformly random query from the standard
43-query set; the run goes for a fixed wall-clock window (default 600s)
after a warmup. Reports completed queries, QPS, latency p50/p95/p99,
and per-query mean.

Backends: ClickHouse over HTTP (stdlib http.client), StarRocks over the
MySQL wire protocol (pymysql). Each system's recommended path so neither
is paying a wire-format penalty the other isn't.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ned}/query: pass query via temp file

`python3 - <<'PY' ... PY` directs the heredoc into python3's stdin so the
interpreter can read its program from there. Once the heredoc is fully
consumed, sys.stdin (the same FD) is at EOF — so sys.stdin.read() inside
the heredoc returned an empty string, and chdb / hyper / sail dutifully
ran the empty query and reported ~0.000s for every try.

Stage stdin into a temp file in bash before invoking the heredoc and pass
the path as argv[1]; the python script reads the query from that file.

Also include result materialization in the timing window for chdb/query
and chdb-parquet-partitioned/query (move `end = ...` past fetchall /
str(res)) — the timer was previously stopped before the result was
realized, which would have under-counted query time even when the stdin
bug wasn't masking it entirely.
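
In simplified form, the broken and fixed patterns look like this (not the literal scripts for chdb, hyper, or sail):

```bash
#!/bin/bash
# Broken: the heredoc occupies python3's stdin, so inside the program
# sys.stdin is already at EOF and sys.stdin.read() returns "".
#
#   python3 - <<'PY'
#   import sys
#   query = sys.stdin.read()    # always empty here
#   PY

# Fixed: stage the script's own stdin into a temp file and pass the path as argv[1].
QUERY_FILE=$(mktemp)
cat > "$QUERY_FILE"

python3 - "$QUERY_FILE" <<'PY'
import sys
query = open(sys.argv[1]).read()
print(len(query))   # placeholder for actually running and timing the query
PY

rm -f "$QUERY_FILE"
```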
Right now ./check stderr is silently dropped while the loop retries for
300s, then we report "did not succeed within 300s" with no clue why.
For deterministic failures (missing env var like YT_PROXY for chyt, an
install step that didn't run, etc.) the user wastes 5 minutes and still
has to dig through the per-system check script to find out what
happened. Capture the last attempt's stderr and print it on timeout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
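
A reduced version of what the check loop now does on timeout (variable and file names are illustrative):

```bash
# Reduced version of the readiness loop with the new diagnostic;
# names and the 300s constant are illustrative.
TIMEOUT=300
elapsed=0
until ./check > /dev/null 2> .check_stderr; do
    sleep 1
    elapsed=$((elapsed + 1))
    if [ "$elapsed" -ge "$TIMEOUT" ]; then
        echo "./check did not succeed within ${TIMEOUT}s; last attempt's stderr:" >&2
        cat .check_stderr >&2
        exit 1
    fi
done
```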
The upstream install path assumes RHEL/Rocky/Alma — yum, grubby, SELinux,
the wheel group, /data0. On Ubuntu/Debian the prereqs phase silently
half-completes (several |||| true skips), the gpadmin user is sometimes
not created, and db-install would later die at `yum install -y go`.
Either way ./check times out at 300s with no diagnostic. Bail with a
clear "needs yum" message before doing anything destructive, and call
out the requirement in the README.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cloud-init runs scripts as root with HOME unset. Tools that follow
XDG-ish conventions then fall over: the GizmoSQL one-line installer
exits at line 32 with "HOME: parameter not set" (it runs under `sh -u`),
duckdb-vortex's `INSTALL vortex` writes to /.duckdb/extensions/... and
later fails to find it ("Extension /.duckdb/extensions/v1.5.2/..."),
and duckdb-datalake{,-partitioned} queries crash 43 times each with
"Can't find the home directory at ''" while autoloading httpfs.

Each affected install script tried to paper over this locally with
`export HOME=${HOME:=~}`, but the export only lives for that script —
the sibling load/query scripts the lib runs in fresh subprocesses still
see HOME unset. Set it once here so every per-system step inherits it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
apt's monetdb5-sql post-install creates /var/lib/monetdb as the monetdb
user's home dir, so the existing `if [ ! -d /var/lib/monetdb ]` guard
skipped `monetdbd create` and left the dbfarm uninitialized. ./check
then looped 300s on `mclient: cannot connect: control socket does not
exist` and the run died.

Probe the dbfarm marker file (.merovingian_properties) instead of the
directory, and explicitly `monetdbd start` after create — both are
idempotent, and a daemon that's already up just no-ops.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
paradedb/paradedb:0.10.0 (the prior pin) was rotated out of Docker Hub —
docker pull returned "manifest not found" and ./check timed out. The
oldest tags still hosted are 0.15.x, so move both directories onto a
real Postgres-version-specific tag (latest-pg17) that paradedb still
maintains.

This unblocks the image pull. NOTE: paradedb dropped its pg_lakehouse /
parquet_fdw extension after 0.10.x (the parquet_fdw_handler() function
no longer exists), so create.sql still needs to be reworked away from
the foreign-table approach for queries to succeed end-to-end. That's a
separate change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The prior URL (qa-build.oss-cn-beijing.aliyuncs.com selectdb-doris-2.1.7-rc01)
returned 404 — SelectDB stopped publishing free standalone tarballs once
the product moved fully to a managed-cloud offering. VeloDB (the company
that now stewards SelectDB) hosts the official Apache Doris release
binaries instead, which are functionally what SelectDB ships today.

Pin to the current stable (4.0.5) and use the symmetric $dir_name path
layout that doris/install already uses, instead of the hardcoded
selectdb-doris-2.1.7 segment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alexey-milovidov and others added 19 commits May 9, 2026 19:17
These results stay valid under the new brand — the engine is the same
Apache Doris distribution, only the brand changed. Strip the historical
tag and the auto-stamped comment from all 11 result JSONs and reword
the README's History section to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 36000s (10h) cap on ./benchmark.sh was hard-coded in cloud-init.sh.
Lift it behind BENCHMARK_TIMEOUT, defaulting to 36000s so existing
runs are unchanged, and forward the var from run-benchmark.sh on the
same path as YT_PROXY/YT_TOKEN/CHYT_ALIAS — operators can now bump
(or shrink) the cap without editing the script.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switch from runtime override (BENCHMARK_TIMEOUT in the cloud-init env)
to render-time substitution: cloud-init.sh.in now has `timeout @timeout@`
and run-benchmark.sh substitutes it from $timeout (default 36000),
matching how @System@, @repo@, @Branch@, and @runtime_env@ already
work. End state for operators is the same — `timeout=NNN run-benchmark.sh
...` overrides the cap — but the rendered script reads naturally
(`timeout 36000 ./benchmark.sh`) instead of dragging an env var
through the cloud-init scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove the YT_PROXY / YT_TOKEN / CHYT_ALIAS forwarding loop in
run-benchmark.sh and the @runtime_env@ injection block in cloud-init.sh.in
that fed it. The chyt/ entry and hardware/benchmark-{chyt,yql}.sh remain
intact for anyone running them locally with the env vars set; they're
just no longer auto-forwarded by the cloud-init render path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A new result landed under the renamed velodb/ tree but still carried
"system": "SelectDB" from a stale template — every other result was
already updated. Bring this one in line so the dashboard shows it
under the new brand.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 2026-05-09 17:35 SelectDB run loaded 74 GB cleanly and the cold
(first try) timings looked sane (0.46 / 1.23 / 1.57 s for Q1-Q3,
matching Apache Doris on the same hardware), but every warm run
collapsed to 0.00-0.01 s — clearly the SQL/result cache returning
the prior result instead of re-running the query. Apache Doris 4.0.5
ships with `cache_enable_sql_mode = true` / `enable_sql_cache = true`
defaults, so the per-query `clear_cache/all` to BE wasn't enough on
its own.

Mirror what doris/install does and add SelectDB-specific FE flips:
  - be.conf:  disable_storage_page_cache = true
              segment_cache_capacity = 0
  - fe.conf:  cache_enable_sql_mode = false
              cache_enable_partition_mode = false

Belt-and-suspenders, also `SET GLOBAL` the corresponding session
variables once the FE is up, in case a future build re-enables the
defaults under different config keys.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Several databases (umbra, mysql, postgres, mongodb, cratedb) survive
an earlyoom / kernel-OOM kill mid-COPY by restarting and exposing a
half-empty table. The 43 query iterations then run in microseconds
against a near-empty result set, the materialized view's load_time +
length(runtimes)==43 + arrayExists(>0.1) checks pass, and the run
lands in sink.results looking like a clean win. The 2026-05-09 13:29
umbra run was exactly that — `psql:create.sql:109: ERROR: canceled,
NOTICE: server side shutdown` mid-COPY, then queries returning in
0.001 s against a near-empty table.

After ./load, bench_load now:

  1. Re-runs ./check — confirms the server is still up. Fails the run
     if the server died.
  2. Runs ./data-size and verifies the result is at least 5 GB. The
     hits dataset doesn't fit in <5 GB on any system in the catalog,
     so anything smaller is a partial load (the same threshold the
     downstream sink.parser materialized view uses).

The verified data-size is cached in .bench_data_size so bench_main
emits the same number without walking the data dir twice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit cached the post-load data-size in .bench_data_size
to avoid walking the data dir twice, but the value can drift between
load and end-of-run on systems that compact / merge in the background
(clickhouse-style merges, doris segment compaction, postgres autovacuum
visibility map, etc.). Drop the cache: keep the validation call in
bench_load, and keep the second ./data-size invocation in bench_main
so the logged value reflects what's actually on disk when queries are
done.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
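
Put together, the post-load validation amounts to roughly this (a sketch under the 5 GB threshold described above, not the exact lib code):

```bash
# Sketch of the post-load validation; the 5 GB threshold follows the commit
# text, the exact shell is illustrative.
bench_load() {
    ./load

    # 1. The server must have survived the load (OOM kills show up here).
    ./check > /dev/null 2>&1 || { echo "server died during load" >&2; exit 1; }

    # 2. Anything below 5 GB cannot be a full hits load on any catalog system.
    size=$(./data-size)
    if [ "$size" -lt $((5 * 1000 * 1000 * 1000)) ]; then
        echo "data-size reported $size bytes; looks like a partial load" >&2
        exit 1
    fi
}
```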
bench_run_query was flushing OS caches BEFORE ./stop, which is
ineffective for any engine that mmaps its data files — the running
process pins those pages and drop_caches can't evict them. The new
instance then re-mmaps the same files and the "cold" run reads from a
warm page cache. Symptom: Umbra's "cold" Q21 (LIKE on URL) was 25 ms
on a 100M-row table, with cold/warm ratios around 5–7× across the
suite where ClickHouse and CedarDB sit at 11–22×.

Reorder to stop -> wait until ./check actually fails -> flush -> start.
Wait-for-stopped is needed because ./stop can return before the
process is fully gone (docker stop sends SIGTERM with a grace period;
some daemons close file descriptors lazily). For non-restartable
systems (BENCH_RESTARTABLE=no) we still flush, with the caveat that
mmap-backed in-process engines won't see a real cold cache there
either — that's a "cold-warm-warm" methodology question, not this fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
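
The corrected sequence, reduced to its essentials (the real driver also bounds the waits and silences the start/stop output):

```bash
# The corrected per-query cold-cache sequence for BENCH_RESTARTABLE=yes systems.
./stop > /dev/null 2>&1 || true

# Wait until the process is really gone; otherwise mmapped pages stay pinned.
while ./check > /dev/null 2>&1; do sleep 1; done

sync
echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null

./start > /dev/null 2>&1 || true
until ./check > /dev/null 2>&1; do sleep 1; done
```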
Two cases the lib's per-query timing path was treating as success:

1. umbra/query's error grep was '^ERROR|psql: error'. Umbra also emits
   FATAL: and PANIC: prefixes (the latter observed on
   `unable to allocate buffer pool`); broaden the grep and add
   -v ON_ERROR_STOP=1 to psql for explicitness.

2. umbra/load called psql without ON_ERROR_STOP. When the COPY hits
   memory pressure on a small box (the 16 GB c6a.4xlarge can't hold
   Umbra's mmap working set for the 75 GB hits.tsv), the transaction
   errored mid-COPY but the script continued, leaving a partial table.
   The 43 queries then ran fast over the surviving subset and the
   result file showed implausible 1–25 ms warm/cold pairs for queries
   that are disk-bound on a full table. Add ON_ERROR_STOP=1 plus a
   row-count assertion (expected ~99,997,497) so the load fails loudly
   instead of producing fake-fast timings.

Caveat documented in umbra/query: Umbra silently returns a NULL row for
unimplemented functions (e.g. regexp_substr) without emitting any error
or warning. None of the 43 ClickBench queries hit that path today, but
if a future query does, the timing pipeline will report a microsecond
"successful" run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 2026-05-09 c6a.4xlarge run reports data_size=16.18 GB (the full
load is ~37 GB) and Q21 cold/warm of 17 ms / 1 ms (a real 100M-row
URL-LIKE scan is ~38 s cold / ~80 ms warm). Same shape across Q22,
Q23, Q24, Q26, Q27 — a partial COPY left a tiny surviving subset
that all queries ran fast over. The umbra/load row-count assertion
that just landed will fail this kind of run loudly going forward;
this file predates that check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two unrelated all-43-nulls failures observed on c6a.4xlarge runs in the
last 14 hours.

kinetica/query:
  kisql 7.2+ reports timings as "Timing (seconds): Connection=X,
  Query=Y" but the parser still grepped for the legacy "Query
  Execution Time: <s> sec" footer, so every query came back with
  "no Query Execution Time in kisql output" → null. Accept the new
  format (preferred) and keep the old fallback. Also tighten the
  error sniff to anchor "^(error|exception)" so the load step's
  "WARNING: Skipped: 1, inserted ..." doesn't get treated as fatal.

presto/install (and presto-partitioned, presto-datalake,
presto-datalake-partitioned):
  Hardcoded -Xmx48G with query.max-memory=24GB exceeds physical RAM
  on c6a.4xlarge (32 GiB) — JVM tries to grow into swap, earlyoom
  kills it, and queries fail with "java.io.IOException: unexpected
  end of stream on http://localhost:8081/...". Compute the heap
  (~70% of /proc/meminfo MemTotal) and downstream query-memory caps
  from host RAM at install time so the configuration scales to both
  4xlarge and metal/48xl-class machines. Trino is unaffected because
  it never overrode the container default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
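
The RAM-proportional sizing is conceptually along these lines; the exact fractions and file paths used by the install scripts may differ:

```bash
# Roughly how a RAM-proportional heap can be derived at install time;
# the exact fractions and config file paths are illustrative.
mem_total_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
heap_gb=$(( mem_total_kb * 70 / 100 / 1024 / 1024 ))        # ~70% of RAM
query_max_gb=$(( heap_gb / 2 ))                             # downstream cap

echo "-Xmx${heap_gb}G"                    >> etc/jvm.config
echo "query.max-memory=${query_max_gb}GB" >> etc/config.properties
```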
The shared spark*/query.py uses a few SparkSession.builder config keys
that were dropped/renamed in pyspark 4.0 (notably .config('spark.driver',
'local[*]')). On 4.0 the SparkSession startup fails silently before any
query runs, the script exits without ever printing a numeric timing
line, and the lib records null for every query — observed as 43-null
result rows on the latest c6a.4xlarge run. Match the version used by
the other refactored Spark variants (spark-auron 3.5.5,
spark-comet 3.5.6, spark-gluten 3.5.2, spark-velox 3.5.2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The script piped each query to drill-embedded with `printf '%s\r'`,
relying on \r as the line terminator. sqlline (drill-embedded's REPL)
on a non-TTY pipe doesn't treat \r as Enter on Linux — the query sat
buffered, EOF arrived, and sqlline exited without firing the SQL.
Every benchmark run produced 43 null timings.

Switch to writing the query to a tempfile, mount it into the container,
and use `drill-embedded --run=/q.sql` (sqlline's script mode). Also:
- Tolerate drill exiting 0 on a failed query: sniff for Error /
  Aborting command set / No current connection / Java stack traces.
- Tighten the timing regex to "(N rows? in X.YYY seconds?)" so we
  don't misparse other parenthesised numbers in the output.
- Correctly recognise "1 row" (singular) as well as "N rows".

Caveat: apache/drill's only published image is linux/amd64. On arm64
hosts the JVM hits a NoClassDefFoundError on RootAllocator init under
QEMU emulation; that's an upstream packaging issue, not a script issue.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Six systems were still on the old monolithic benchmark.sh and were
therefore excluded from the c6a.4xlarge per-system-script-interface
batch. Split each into install / start / stop / check / load / query /
data-size + a thin lib-driven benchmark.sh shim.

  pg_duckdb              TSV ingest + COPY FREEZE; force_execution=true
  pg_duckdb-indexed      parallel COPY shards + indexes via index.sql
  pg_duckdb-parquet      bind-mounts hits.parquet, view-only (no ingest)
  pg_duckdb-motherduck   no local data, CTAS into MotherDuck, REQUIRES
                         MOTHERDUCK_TOKEN; data-size returns the source
                         parquet size so the post-load >5GB sanity
                         check doesn't false-positive on cloud-stored
                         data
  ursa                   ClickHouse-derivative, mirror of clickhouse/
  yugabytedb             yugabyted standalone

Notes:
- pg_duckdb / pg_duckdb-indexed / pg_duckdb-parquet pass postgres
  tuning (shared_buffers, max_*_workers, duckdb.max_memory, etc.) via
  `postgres -c k=v` at docker-run time so the cluster picks them up
  without a second restart, replacing the old "append to
  postgresql.conf then docker restart" dance.
- pg_duckdb-parquet downloads hits.parquet in install (not via lib's
  bench_download) because the container needs the file bind-mounted at
  start time, before lib's download phase runs.
- All six set BENCH_RESTARTABLE=yes so the cold/warm/warm methodology
  applies (the lib's stop -> wait -> drop_caches -> start sequence is
  what the postgres-style cache invalidation needs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ILE to cpimport

Two coupled fixes:

1. Refactor to per-system-script-interface (install/start/stop/check/load
   /query/data-size + lib-driven benchmark.sh). The entry was on the old
   monolithic format, so it was excluded from the c6a.4xlarge batch.

2. Switch the data load from `LOAD DATA LOCAL INFILE` to `cpimport`.
   ColumnStore's recommended bulk path is cpimport — the SQL-layer
   LOAD DATA INFILE the entry used could not handle the 75 GB hits.tsv
   and died after ~5 min with the cryptic
       ERROR 1030 (HY000): Got error -1 "Internal error < 0
       (Not system error)" from storage engine ColumnStore
   that the README documented as "we couldn't reproduce, MariaDB has
   no public issue tracker." cpimport reads STDIN, so we can pipe the
   host-side file straight in without docker cp.
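
The stdin-based load is conceptually a one-liner along these lines (the container name and delimiter flag shown are assumptions, not the exact load script):

```bash
# Pipe the host-side TSV straight into cpimport inside the container;
# the container name and delimiter flag here are assumptions.
docker exec -i mariadb-columnstore cpimport -s $'\t' test hits < hits.tsv
```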

Also:
- New mariadb client requires SSL by default; the columnstore image's
  server doesn't support SSL. Pass --skip-ssl everywhere.
- Container-side server is provisioned + a per-user GRANT issued in
  ./start; idempotent on subsequent restarts.
- ./query parses both "(X.YYY sec)" and "(M min S sec)" forms and
  correctly converts to fractional seconds.

Tested locally on Ubuntu 26.04 / arm64 (mariadb/columnstore image is
multi-arch):
  - install + start + provision + GRANT all idempotent.
  - cpimport of 100k rows: 1000 rows/s sustained, no errors.
  - Q1/Q2/Q5/Q21 against the 100k subset return correct counts with
    sane timings (28 ms / 1.027 s / 31 ms / 96 ms).
  - stop -> start -> check round-trip works; data persists.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alexey-milovidov and others added 10 commits May 10, 2026 00:10
The previous embedded-Python-per-query design re-loaded the entire
hits.parquet into a fresh DuckDB :memory: connection on every ./query
invocation. That made every "query" measurement actually be a
full ingest (~minutes on the full dataset), dwarfing the actual SQL
execution time and producing wildly inflated numbers — which is also
why the entry was producing 0 rows in recent c6a.4xlarge runs (each
query took longer than the lib's per-query slot).

Mirror the duckdb-dataframe / pandas / polars-dataframe layout instead:
a uvicorn/FastAPI server (server.py) holds one DuckDB connection with
the compressed_mem :memory: schema loaded once via /load. start/stop
manage the python pid; check is GET /health; query is POST /query
with the SQL on the request body. data-size returns the server
process RSS as a proxy for the in-memory compressed footprint.

BENCH_RESTARTABLE was already "no" (the lib doesn't restart the
server between queries), which is exactly what we want — restarting
would dump the in-memory compressed state and force a full re-ingest
for every query, which is the bug we're fixing.

Tested locally on a 1M-row hits_0.parquet sample:
  load    1.547 s
  Q1      0.002 s   (count = 1000000)
  Q21     0.011 s   (URL LIKE)
  RSS     1.17 GB

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 8 entries that load the dataset into memory and serve queries from
a long-lived Python process — pandas, polars-dataframe, duckdb-dataframe,
duckdb-memory, chdb-dataframe, daft-parquet, daft-parquet-partitioned,
sirius — already carried the "in-memory" tag in 5 of the 8 templates.
Backfill the other 3 (daft-parquet, daft-parquet-partitioned, sirius)
and add the tag to every historical result that doesn't already have it
so the dashboard's tag-based filtering stays consistent.

67 files updated. Diff is minimal (one trailing-comma change + one new
tag per file) — patched in place rather than re-pretty-printing JSON,
so the existing single-line / multi-line / indented styles in the
result files are preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ckHouse/ClickBench into refactor/per-system-script-interface
…alse

Investigating ClickHouse's ~2 s cold-Q40 floor on c6a.4xlarge: the
default async_load_databases=1 makes the server bind its listen port
and answer SELECT 1 before user-database parts have finished loading.
The lib's bench_check_loop then sees ./check pass, drop_caches+restart
looks "ready", and the first real query — Q40, the heaviest in the
suite — stalls 2-3 s waiting for the part loader to finish.

Measured locally on this 96-core arm box (NVMe):

  async_load_databases=1 (default):
    SELECT 1 ready at:                       0.20 s
    First SELECT count() FROM hits:          2.89 s   <-- waiting on parts
    Q40 (parts now loaded):                  0.33 s

  async_load_databases=0 (this commit):
    SELECT 1 ready at:                       0.12 s
    First Q40 cold (parts already loaded):   0.25 s
    Q40 warm:                                0.085 s

A ~12x cold-run improvement on Q40 (and similar on every other "cold"
measurement, since the wait is pre-paid into the bench_start step
where it belongs instead of the first query's timer). Drop a
config.d/async_load_databases.xml override in both clickhouse/install
and clickhouse-tencent/install (the two refactored entries that
install via `clickhouse install --noninteractive`).

Other CH-family entries don't need this: clickhouse-{datalake,parquet}*
use `clickhouse local` (embedded, no daemon), clickhouse-{cloud,web}
are managed services we can't reconfigure, byconity uses its own
docker-compose stack, ursa is a separate binary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same setting (async_load_databases=false), just stored as
config.d/async_load_databases.yaml instead of an XML snippet — matches
the YAML config style preferred elsewhere in the project. Verified the
clickhouse-server picks it up (system.server_settings shows the value
applied) and the ~12x cold-Q40 improvement is intact (0.25 s here).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
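
The override itself is tiny; roughly the following, with the config path taken from the standard clickhouse-server layout rather than copied from the commit:

```bash
# Roughly what the override amounts to; the exact file in the commit may differ slightly.
sudo tee /etc/clickhouse-server/config.d/async_load_databases.yaml > /dev/null <<'EOF'
async_load_databases: false
EOF
```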
…ewarm

Follow-up to {clickhouse,clickhouse-tencent}/install: forcing
async_load_databases=false ensures parts are loaded by the time the
server reports ready, but marks and primary indices were still loaded
lazily on first column access. So the FIRST cold query after a fresh
restart paid the mark/PK-load cost; subsequent queries against the
same columns were fast.

Adding the per-table prewarm settings — prewarm_mark_cache,
prewarm_primary_key_cache, min_bytes_to_prewarm_caches — instructs
the engine to populate those caches during startup, again moving the
work out of the cold-query timer into bench_start where it belongs.

Local measurements stacked over the async_load_databases fix:

  default (async_load=1, prewarm=0):    cold=2.89s   warm=0.085s  (34x)
  async_load=0, prewarm=0:              cold=0.25s   warm=0.085s  (3x)
  async_load=0, prewarm=1 (this commit): cold=0.19s  warm=0.085s  (2.2x)

The remaining ~0.10 s cold/warm gap is OS pagecache misses for the
actual column data, which can't be eliminated without keeping data
resident across the restart (which would defeat the cold-restart
methodology).

Skipped ursa (older fork; settings may not exist) and the managed
clickhouse-{cloud,web} entries (can't change settings server-side).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cache prewarm"

Prewarming the mark and primary-key caches at startup is a CH-specific
optimization that other systems in the suite don't get an equivalent
of, so applying it here gives ClickHouse an unfair advantage on the
"cold" measurement vs. systems that genuinely do load metadata on
first query. Keep async_load_databases=false (correctness — that just
ensures ./check doesn't pass before parts exist) but drop the prewarm.

Cold Q40 goes from 0.19 s back to ~0.25 s — still a ~12x improvement
over the original 2.89 s, all from the async_load_databases fix alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "serverless" tag should only mark actual cloud services (where the
user pays per-query and the vendor manages the runtime). It was
mistakenly applied to embedded/in-memory engines that run entirely on
the benchmark machine: chdb, chdb-dataframe, chdb-parquet-partitioned,
clickhouse-web (browser WASM), daft-parquet, daft-parquet-partitioned,
glaredb, glaredb-partitioned, opteryx, pandas.

Cloud services keep the tag: bigquery, motherduck, pg_duckdb-motherduck.
Applied across templates and historical result files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>