Scheduling pipelines with faucet schedule
faucet schedule runs a pipeline on a cron schedule in a long-running
foreground process. It is designed for server-side deployment: drop it into
systemd, Kubernetes, or any supervisor that can restart it on failure, and the
pipeline fires on time every time.
faucet schedule pipeline.yaml # foreground; Ctrl-C or SIGTERM to stop
faucet schedule pipeline.yaml --once # run exactly once now, then exit
The config must include a schedule: block alongside the usual pipeline:.
Configs without one are rejected with a hint to use faucet run instead.
A runnable example
The following config runs a CSV→JSONL pipeline every night at 02:00
America/Los_Angeles. Save it as nightly.yaml and start it with
faucet schedule nightly.yaml:
# nightly.yaml — run at 02:00 Pacific every night
version: 1
name: nightly-rollup
schedule:
cron: "0 2 * * *"
timezone: "America/Los_Angeles"
overlap_policy: skip # don't pile up if a run runs long
max_consecutive_failures: 5 # exit non-zero after 5 straight failures (supervisor restarts)
on_failure: continue
shutdown_grace_secs: 30
pipeline:
source:
type: csv
config:
path: ./events.csv
sink:
type: jsonl
config:
path: ./events.jsonl
See cli/examples/scheduled_nightly.yaml
for the canonical copy.
Cron syntax
faucet uses a standard Unix cron expression, validated at config-load time. A bad expression or an expression that can never fire produces a clear error before the process starts.
5-field form (MIN HOUR DOM MON DOW):
| Expression | Meaning |
|---|---|
0 2 * * * | Every night at 02:00 |
*/15 * * * * | Every 15 minutes |
0 9 * * 1-5 | Weekdays at 09:00 |
0 0 1 * * | First of every month at midnight |
0 */6 * * * | Every 6 hours |
6-field form (SEC MIN HOUR DOM MON DOW) — add a leading seconds field
for sub-minute intervals:
| Expression | Meaning |
|---|---|
*/30 * * * * * | Every 30 seconds |
0 */5 * * * * | Every 5 minutes (explicit seconds=0) |
Field ranges follow standard cron semantics: * (every), */N (every N),
a-b (range), a,b,c (list). Month and day-of-week names (JAN, MON,
etc.) are accepted. Special strings like @daily and @hourly are not
supported — use the numeric form.
Timezone and DST
Set timezone to any IANA timezone name
(e.g. America/Los_Angeles, Europe/Berlin, Asia/Tokyo). The default is
UTC.
All tick times are computed on UTC monotonic instants with timezone-correct wall-clock interpretation, so DST transitions behave correctly:
- Fall-back (clocks go back): a repeated wall-clock hour fires once.
- Spring-forward (clocks skip ahead): a wall-clock time in the skipped hour is treated as if it were in the hour immediately after the gap — the next valid tick.
The scheduler loop re-checks the wall clock at least every 30 seconds, so NTP steps, VM freeze/thaw, and DST shifts can never drift a scheduled fire by more than ~30 seconds.
Missed-tick behavior
The scheduler advances from the scheduled tick, not the wall clock, so a single occurrence is not skipped just because dispatch latency pushed the clock a little past it — it fires promptly (slightly late) and the schedule resumes. But if many ticks elapsed (the process was down, or a run took longer than several cron periods), the backlog is collapsed to a single catch-up: the scheduler fires once at the next due time and moves on. There is no catch-up storm and no flood of backfilled runs.
To find out how late a run fired, scrape
faucet_schedule_run_lateness_seconds (histogram: actual_start − scheduled_for).
Overlap policy
The overlap policy controls what happens when a tick fires while a run is already executing.
| Policy | When to use |
|---|---|
skip (default) | The tick is dropped and a faucet_schedule_overlaps_total{policy=skip} counter is incremented. Use when it is acceptable to miss a cycle if the previous one ran long. Most pipelines. |
queue | One missed tick is buffered and fires immediately when the current run finishes. Further misses during that same run collapse into the single queued tick (in-memory only — lost on restart). Use when missing a cycle is unacceptable but strict concurrency still must be preserved. |
forbid | The process exits non-zero the moment an overlap would occur. Use when overlapping runs would produce corrupt output or you want a hard guarantee that no two instances run simultaneously — pair with a supervisor that alerts or pages on non-zero exit. |
Choosing between skip and queue: if your pipeline is idempotent and
catching up after a long run matters (e.g. incremental replication with
state), use queue. If occasional missed cycles are harmless and you prefer
simplicity, use skip.
Failure model and supervisor integration
Two independent knobs govern what happens when a run fails:
on_failure | max_consecutive_failures | Behaviour |
|---|---|---|
continue (default) | null | Tolerates all failures indefinitely. Alert via faucet_schedule_consecutive_failures gauge. |
continue | N | Tolerates up to N−1 straight failures; exits non-zero when the Nth consecutive failure occurs. A successful run resets the counter to 0. |
stop | any | Exits non-zero immediately on the first failure. |
The recommended production pattern is on_failure: continue with
max_consecutive_failures: N (5–10 depending on how quickly you want a
supervisor restart):
schedule:
cron: "*/5 * * * *"
on_failure: continue
max_consecutive_failures: 5 # restart after 5 straight failures
Systemd unit example
# /etc/systemd/system/nightly-rollup.service
[Unit]
Description=faucet nightly rollup
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/bin/faucet schedule /opt/pipelines/nightly.yaml
Restart=on-failure
RestartSec=30s
# Env vars for the pipeline
EnvironmentFile=/opt/pipelines/nightly.env
[Install]
WantedBy=multi-user.target
Restart=on-failure means systemd restarts the process whenever it exits
with a non-zero code, which is exactly the condition max_consecutive_failures
produces. RestartSec=30s adds a brief cooldown between restarts to avoid
hammering a broken upstream.
Kubernetes CronJob vs long-running Deployment
faucet schedule is designed for a Deployment (or long-running Pod):
one process, always running, fires on cron. This keeps token caches warm
and avoids cold-start latency on every tick.
If you need Kubernetes to manage the schedule itself, use a Kubernetes
CronJob with faucet run instead — each invocation is ephemeral and
the scheduler handles missed/overlapping pods at the platform level.
Graceful shutdown and SIGTERM
On SIGTERM or Ctrl-C:
- faucet stops accepting new ticks.
- If a run is in flight, it waits up to
shutdown_grace_secs(default 30) for it to finish. - If the run finishes within the grace period, the process exits 0.
- If the run is still running after the grace period, it is aborted. The
per-page
StateStorebookmark means the next start resumes from the last confirmed write — no data is lost, but the partial page since the last bookmark is re-fetched on the next run. Whether that causes duplicates depends on your sink’s idempotency.
Increase shutdown_grace_secs for long-running pages (e.g. a BigQuery batch
that takes several minutes to flush):
schedule:
cron: "0 * * * *"
shutdown_grace_secs: 120
Dated outputs with ${now.*}
${now.*} tokens let you inject the run’s wall time into source and sink config
values — so a scheduled pipeline can write to a different file or object-storage
prefix on every tick without any manual bookkeeping.
The headline use case is a dated partition path:
# nightly_partitioned.yaml — write to a new dated partition every night
version: 1
name: nightly-events
schedule:
cron: "0 2 * * *"
timezone: "America/Los_Angeles"
overlap_policy: skip
max_consecutive_failures: 5
pipeline:
source:
type: rest
config:
base_url: https://api.example.com
path: /v1/events
sink:
type: jsonl
config:
# ${now.date} reflects the schedule's timezone (America/Los_Angeles),
# so the partition label matches the business date of the run.
path: "./warehouse/dt=${now.date}/events.jsonl"
When the cron fires at 02:00 on 2026-03-09 Pacific time, ${now.date} resolves
to 2026-03-09 and faucet writes to ./warehouse/dt=2026-03-09/events.jsonl.
The parent directory is created automatically — local file sinks (JSONL, CSV)
create missing parent directories so dated subdirectory paths work without
pre-creating the tree.
The full token set:
| Token | Example | Use case |
|---|---|---|
${now.date} | 2026-03-08 | Daily partition key |
${now.year} / ${now.month} / ${now.day} | 2026 / 03 / 08 | Hive-style year=…/month=…/day=… paths |
${now.hour} | 14 | Hourly partitions |
${now.unix} | 1741442709 | Unique epoch-based filenames |
${now.strftime.<fmt>} | 2026/03/08/14 | Arbitrary layout — e.g. ${now.strftime.%Y/%m/%d/%H} |
${now.datetime} / ${now.iso} | 2026-03-08T14:05:09+00:00 | RFC 3339 timestamp in a filename or object key |
Clock semantics
faucet schedule uses the tick’s scheduled time rendered in the schedule’s
timezone — not the actual wall clock when the run started. This means
${now.date} is deterministic: re-running the same tick (e.g. after a restart)
produces the same path.
faucet schedule --once uses the current wall clock in the schedule’s timezone.
Backfilling with faucet run --clock
To backfill a range of dates, use faucet run with the --clock flag instead
of faucet schedule. --clock overrides the process start time used by
${now.*}:
# Backfill three nightly partitions
faucet run --clock 2026-03-01 nightly_partitioned.yaml
faucet run --clock 2026-03-02 nightly_partitioned.yaml
faucet run --clock 2026-03-03 nightly_partitioned.yaml
A bare date (2026-03-01) is treated as midnight UTC. An RFC 3339 timestamp
(2026-03-01T02:00:00-08:00) sets the clock precisely. Unknown ${now.*}
tokens are config errors; the token set is validated at run start before any
I/O begins.
Health metrics to scrape
Register a Prometheus listener via the observability: block:
observability:
prometheus:
listen: "127.0.0.1:9464"
Key metrics for a scheduling health dashboard:
| Metric | What to alert on |
|---|---|
faucet_schedule_heartbeat_unix_seconds | time() - value > 90 → scheduler loop is stuck or process crashed |
faucet_schedule_consecutive_failures | > 0 → at least one recent failure; >= max_consecutive_failures → imminent exit |
faucet_schedule_next_tick_unix_seconds | value - time() > 2 * expected_interval → scheduler is not advancing |
faucet_schedule_runs_total{outcome="err"} | Increasing counter → runs are failing |
faucet_schedule_overlaps_total | Repeated increments → runs are taking longer than the cron period |
faucet_schedule_run_lateness_seconds | p99 > threshold → runs are starting significantly late |
Full metric reference:
| Metric | Type | Description |
|---|---|---|
faucet_schedule_runs_total{pipeline,outcome} | Counter | outcome ∈ {ok, err, skipped} |
faucet_schedule_overlaps_total{pipeline,policy} | Counter | Overlap events by policy |
faucet_schedule_next_tick_unix_seconds{pipeline} | Gauge | Unix timestamp of the next scheduled tick |
faucet_schedule_runs_in_flight{pipeline} | Gauge | 0 or 1 |
faucet_schedule_consecutive_failures{pipeline} | Gauge | Resets to 0 on success |
faucet_schedule_heartbeat_unix_seconds{pipeline} | Gauge | Updated every loop wake (≤30 s) |
faucet_schedule_last_run_started_unix_seconds{pipeline} | Gauge | |
faucet_schedule_last_run_completed_unix_seconds{pipeline} | Gauge | |
faucet_schedule_last_run_duration_seconds{pipeline} | Gauge | |
faucet_schedule_run_lateness_seconds{pipeline} | Histogram | actual_start − scheduled_for |
Each run also emits a faucet.schedule.run tracing span (attributes:
run_ordinal, scheduled_for_unix_seconds, tick_unix_seconds) that wraps
the inner pipeline spans, so distributed tracing carries the scheduling
context through the full pipeline.