Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Scheduling pipelines with faucet schedule

faucet schedule runs a pipeline on a cron schedule in a long-running foreground process. It is designed for server-side deployment: drop it into systemd, Kubernetes, or any supervisor that can restart it on failure, and the pipeline fires on time every time.

faucet schedule pipeline.yaml           # foreground; Ctrl-C or SIGTERM to stop
faucet schedule pipeline.yaml --once    # run exactly once now, then exit

The config must include a schedule: block alongside the usual pipeline:. Configs without one are rejected with a hint to use faucet run instead.

A runnable example

The following config runs a CSV→JSONL pipeline every night at 02:00 America/Los_Angeles. Save it as nightly.yaml and start it with faucet schedule nightly.yaml:

# nightly.yaml — run at 02:00 Pacific every night
version: 1
name: nightly-rollup

schedule:
  cron: "0 2 * * *"
  timezone: "America/Los_Angeles"
  overlap_policy: skip            # don't pile up if a run runs long
  max_consecutive_failures: 5     # exit non-zero after 5 straight failures (supervisor restarts)
  on_failure: continue
  shutdown_grace_secs: 30

pipeline:
  source:
    type: csv
    config:
      path: ./events.csv
  sink:
    type: jsonl
    config:
      path: ./events.jsonl

See cli/examples/scheduled_nightly.yaml for the canonical copy.

Cron syntax

faucet uses a standard Unix cron expression, validated at config-load time. A bad expression or an expression that can never fire produces a clear error before the process starts.

5-field form (MIN HOUR DOM MON DOW):

ExpressionMeaning
0 2 * * *Every night at 02:00
*/15 * * * *Every 15 minutes
0 9 * * 1-5Weekdays at 09:00
0 0 1 * *First of every month at midnight
0 */6 * * *Every 6 hours

6-field form (SEC MIN HOUR DOM MON DOW) — add a leading seconds field for sub-minute intervals:

ExpressionMeaning
*/30 * * * * *Every 30 seconds
0 */5 * * * *Every 5 minutes (explicit seconds=0)

Field ranges follow standard cron semantics: * (every), */N (every N), a-b (range), a,b,c (list). Month and day-of-week names (JAN, MON, etc.) are accepted. Special strings like @daily and @hourly are not supported — use the numeric form.

Timezone and DST

Set timezone to any IANA timezone name (e.g. America/Los_Angeles, Europe/Berlin, Asia/Tokyo). The default is UTC.

All tick times are computed on UTC monotonic instants with timezone-correct wall-clock interpretation, so DST transitions behave correctly:

  • Fall-back (clocks go back): a repeated wall-clock hour fires once.
  • Spring-forward (clocks skip ahead): a wall-clock time in the skipped hour is treated as if it were in the hour immediately after the gap — the next valid tick.

The scheduler loop re-checks the wall clock at least every 30 seconds, so NTP steps, VM freeze/thaw, and DST shifts can never drift a scheduled fire by more than ~30 seconds.

Missed-tick behavior

The scheduler advances from the scheduled tick, not the wall clock, so a single occurrence is not skipped just because dispatch latency pushed the clock a little past it — it fires promptly (slightly late) and the schedule resumes. But if many ticks elapsed (the process was down, or a run took longer than several cron periods), the backlog is collapsed to a single catch-up: the scheduler fires once at the next due time and moves on. There is no catch-up storm and no flood of backfilled runs.

To find out how late a run fired, scrape faucet_schedule_run_lateness_seconds (histogram: actual_start − scheduled_for).

Overlap policy

The overlap policy controls what happens when a tick fires while a run is already executing.

PolicyWhen to use
skip (default)The tick is dropped and a faucet_schedule_overlaps_total{policy=skip} counter is incremented. Use when it is acceptable to miss a cycle if the previous one ran long. Most pipelines.
queueOne missed tick is buffered and fires immediately when the current run finishes. Further misses during that same run collapse into the single queued tick (in-memory only — lost on restart). Use when missing a cycle is unacceptable but strict concurrency still must be preserved.
forbidThe process exits non-zero the moment an overlap would occur. Use when overlapping runs would produce corrupt output or you want a hard guarantee that no two instances run simultaneously — pair with a supervisor that alerts or pages on non-zero exit.

Choosing between skip and queue: if your pipeline is idempotent and catching up after a long run matters (e.g. incremental replication with state), use queue. If occasional missed cycles are harmless and you prefer simplicity, use skip.

Failure model and supervisor integration

Two independent knobs govern what happens when a run fails:

on_failuremax_consecutive_failuresBehaviour
continue (default)nullTolerates all failures indefinitely. Alert via faucet_schedule_consecutive_failures gauge.
continueNTolerates up to N−1 straight failures; exits non-zero when the Nth consecutive failure occurs. A successful run resets the counter to 0.
stopanyExits non-zero immediately on the first failure.

The recommended production pattern is on_failure: continue with max_consecutive_failures: N (5–10 depending on how quickly you want a supervisor restart):

schedule:
  cron: "*/5 * * * *"
  on_failure: continue
  max_consecutive_failures: 5   # restart after 5 straight failures

Systemd unit example

# /etc/systemd/system/nightly-rollup.service
[Unit]
Description=faucet nightly rollup
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/faucet schedule /opt/pipelines/nightly.yaml
Restart=on-failure
RestartSec=30s
# Env vars for the pipeline
EnvironmentFile=/opt/pipelines/nightly.env

[Install]
WantedBy=multi-user.target

Restart=on-failure means systemd restarts the process whenever it exits with a non-zero code, which is exactly the condition max_consecutive_failures produces. RestartSec=30s adds a brief cooldown between restarts to avoid hammering a broken upstream.

Kubernetes CronJob vs long-running Deployment

faucet schedule is designed for a Deployment (or long-running Pod): one process, always running, fires on cron. This keeps token caches warm and avoids cold-start latency on every tick.

If you need Kubernetes to manage the schedule itself, use a Kubernetes CronJob with faucet run instead — each invocation is ephemeral and the scheduler handles missed/overlapping pods at the platform level.

Graceful shutdown and SIGTERM

On SIGTERM or Ctrl-C:

  1. faucet stops accepting new ticks.
  2. If a run is in flight, it waits up to shutdown_grace_secs (default 30) for it to finish.
  3. If the run finishes within the grace period, the process exits 0.
  4. If the run is still running after the grace period, it is aborted. The per-page StateStore bookmark means the next start resumes from the last confirmed write — no data is lost, but the partial page since the last bookmark is re-fetched on the next run. Whether that causes duplicates depends on your sink’s idempotency.

Increase shutdown_grace_secs for long-running pages (e.g. a BigQuery batch that takes several minutes to flush):

schedule:
  cron: "0 * * * *"
  shutdown_grace_secs: 120

Dated outputs with ${now.*}

${now.*} tokens let you inject the run’s wall time into source and sink config values — so a scheduled pipeline can write to a different file or object-storage prefix on every tick without any manual bookkeeping.

The headline use case is a dated partition path:

# nightly_partitioned.yaml — write to a new dated partition every night
version: 1
name: nightly-events

schedule:
  cron: "0 2 * * *"
  timezone: "America/Los_Angeles"
  overlap_policy: skip
  max_consecutive_failures: 5

pipeline:
  source:
    type: rest
    config:
      base_url: https://api.example.com
      path: /v1/events
  sink:
    type: jsonl
    config:
      # ${now.date} reflects the schedule's timezone (America/Los_Angeles),
      # so the partition label matches the business date of the run.
      path: "./warehouse/dt=${now.date}/events.jsonl"

When the cron fires at 02:00 on 2026-03-09 Pacific time, ${now.date} resolves to 2026-03-09 and faucet writes to ./warehouse/dt=2026-03-09/events.jsonl. The parent directory is created automatically — local file sinks (JSONL, CSV) create missing parent directories so dated subdirectory paths work without pre-creating the tree.

The full token set:

TokenExampleUse case
${now.date}2026-03-08Daily partition key
${now.year} / ${now.month} / ${now.day}2026 / 03 / 08Hive-style year=…/month=…/day=… paths
${now.hour}14Hourly partitions
${now.unix}1741442709Unique epoch-based filenames
${now.strftime.<fmt>}2026/03/08/14Arbitrary layout — e.g. ${now.strftime.%Y/%m/%d/%H}
${now.datetime} / ${now.iso}2026-03-08T14:05:09+00:00RFC 3339 timestamp in a filename or object key

Clock semantics

faucet schedule uses the tick’s scheduled time rendered in the schedule’s timezone — not the actual wall clock when the run started. This means ${now.date} is deterministic: re-running the same tick (e.g. after a restart) produces the same path.

faucet schedule --once uses the current wall clock in the schedule’s timezone.

Backfilling with faucet run --clock

To backfill a range of dates, use faucet run with the --clock flag instead of faucet schedule. --clock overrides the process start time used by ${now.*}:

# Backfill three nightly partitions
faucet run --clock 2026-03-01 nightly_partitioned.yaml
faucet run --clock 2026-03-02 nightly_partitioned.yaml
faucet run --clock 2026-03-03 nightly_partitioned.yaml

A bare date (2026-03-01) is treated as midnight UTC. An RFC 3339 timestamp (2026-03-01T02:00:00-08:00) sets the clock precisely. Unknown ${now.*} tokens are config errors; the token set is validated at run start before any I/O begins.

Health metrics to scrape

Register a Prometheus listener via the observability: block:

observability:
  prometheus:
    listen: "127.0.0.1:9464"

Key metrics for a scheduling health dashboard:

MetricWhat to alert on
faucet_schedule_heartbeat_unix_secondstime() - value > 90 → scheduler loop is stuck or process crashed
faucet_schedule_consecutive_failures> 0 → at least one recent failure; >= max_consecutive_failures → imminent exit
faucet_schedule_next_tick_unix_secondsvalue - time() > 2 * expected_interval → scheduler is not advancing
faucet_schedule_runs_total{outcome="err"}Increasing counter → runs are failing
faucet_schedule_overlaps_totalRepeated increments → runs are taking longer than the cron period
faucet_schedule_run_lateness_secondsp99 > threshold → runs are starting significantly late

Full metric reference:

MetricTypeDescription
faucet_schedule_runs_total{pipeline,outcome}Counteroutcome{ok, err, skipped}
faucet_schedule_overlaps_total{pipeline,policy}CounterOverlap events by policy
faucet_schedule_next_tick_unix_seconds{pipeline}GaugeUnix timestamp of the next scheduled tick
faucet_schedule_runs_in_flight{pipeline}Gauge0 or 1
faucet_schedule_consecutive_failures{pipeline}GaugeResets to 0 on success
faucet_schedule_heartbeat_unix_seconds{pipeline}GaugeUpdated every loop wake (≤30 s)
faucet_schedule_last_run_started_unix_seconds{pipeline}Gauge
faucet_schedule_last_run_completed_unix_seconds{pipeline}Gauge
faucet_schedule_last_run_duration_seconds{pipeline}Gauge
faucet_schedule_run_lateness_seconds{pipeline}Histogramactual_start − scheduled_for

Each run also emits a faucet.schedule.run tracing span (attributes: run_ordinal, scheduled_for_unix_seconds, tick_unix_seconds) that wraps the inner pipeline spans, so distributed tracing carries the scheduling context through the full pipeline.