Choosing a connector

Several connectors overlap. This page resolves the common “which one?” questions. For the full feature grid see the connector catalog.

PostgreSQL: query source vs. CDC

  • source-postgres runs a SQL query and returns the rows. Use it for one-shot extracts, snapshots, or when you control an updated_at column and parameterize the query yourself. Simple, no special Postgres config.
  • source-postgres-cdc streams every INSERT/UPDATE/DELETE from the write-ahead log via logical replication. Use it when you need every change (including deletes), low-latency capture, or resumability without a cursor column. Requires wal_level = logical and a publication, and retains WAL between runs. See the CDC tutorial.

Rule of thumb: periodic snapshot → query source; continuous change feed → CDC.
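To make the trade-off concrete, here is a minimal sketch of why a cursor-based query cannot see deletes. It uses Python's built-in sqlite3 as a stand-in for Postgres. The orders table, the updated_at values, and the poll_since helper are hypothetical names chosen for illustration; none of them are part of faucet.

```python
import sqlite3

# In-memory SQLite stands in for Postgres; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "new", 100), (2, "new", 101), (3, "new", 102)],
)


def poll_since(cursor_value):
    """Cursor-style extract: rows changed after cursor_value, plus the new cursor."""
    rows = conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (cursor_value,),
    ).fetchall()
    new_cursor = rows[-1][2] if rows else cursor_value
    return rows, new_cursor


# First run: behaves like a snapshot and returns all three rows.
first_batch, cursor = poll_since(0)

# Between runs: one update (which bumps updated_at) and one delete.
conn.execute("UPDATE orders SET status = 'paid', updated_at = 200 WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 2")

# Second run: the update is picked up, the delete leaves no row to select.
second_batch, cursor = poll_since(cursor)
print(second_batch)  # [(1, 'paid', 200)]
```

The deleted row simply stops appearing, so a downstream copy would keep it forever. A log-based change feed records the DELETE as its own event, which is the case the CDC connector is described as covering.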

Object storage: S3/GCS source vs. Parquet source

  • source-s3 / source-gcs read objects as JSONL, a JSON array, or raw text. Use them for line-delimited JSON, logs, or text dumps.
  • source-parquet reads columnar Parquet (local, glob, or S3) with a vectorized Arrow reader and column projection. Use it for analytical datasets: it can skip columns you don’t need, which typically makes it much faster than parsing the same data as JSON.

Rule of thumb: the file is .parquet → Parquet source; it’s JSON/text → S3/GCS source. (The Parquet source reads from S3 directly, so you don’t need the S3 source in front of it.)
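The two JSON layouts named above are consumed differently, which matters for large objects. The sketch below uses only the Python standard library. read_jsonl and read_json_array are illustrative helpers and say nothing about how faucet parses objects internally.

```python
import io
import json


def read_jsonl(stream):
    """Yield one record per non-empty line, so the object can be streamed."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def read_json_array(stream):
    """Parse the whole document, then yield each element of the top-level array."""
    records = json.load(stream)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    yield from records


# Two in-memory "objects" holding the same records in different layouts.
jsonl_object = io.StringIO('{"id": 1}\n{"id": 2}\n')
array_object = io.StringIO('[{"id": 1}, {"id": 2}]')

jsonl_records = list(read_jsonl(jsonl_object))
array_records = list(read_json_array(array_object))
```

Both calls produce the same records. The difference is that the line-delimited form can be processed record by record, while this simple array reader has to load the full document first.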

Live feeds: WebSocket vs. Webhook vs. Kafka/Redis

  • source-websocket connects out to a live push endpoint (ws:// or wss://), optionally sends subscription frames, and streams each incoming message as a record. Use it for market data, chat feeds, telemetry, or any server that pushes over WebSocket. It is live-only: no replay, no durable offset.
  • source-webhook opens a temporary HTTP server and receives inbound HTTP POSTs from external systems over a time window. Use it when the remote system pushes to you over HTTP rather than WebSocket.
  • source-kafka / source-redis provide broker-backed streaming with durable, replayable offsets and resumable bookmarks. Use these when you need guaranteed delivery and the ability to continue from where a previous run left off.

Rule of thumb: connecting out to a live WebSocket feed → source-websocket; receiving inbound HTTP POST payloads → source-webhook; durable, replayable event stream → source-kafka or source-redis.
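The difference between a live-only feed and a durable, replayable one can be sketched with two toy classes. LiveFeed and DurableLog are hypothetical in-memory models written for this page. They are not faucet, Kafka, or Redis APIs, and they ignore networking entirely.

```python
class LiveFeed:
    """Live-only model: a message reaches whoever is connected, then is gone."""

    def __init__(self):
        self.subscribers = []

    def connect(self):
        inbox = []
        self.subscribers.append(inbox)
        return inbox

    def publish(self, message):
        for inbox in self.subscribers:
            inbox.append(message)


class DurableLog:
    """Broker-style model: messages are retained and addressed by offset."""

    def __init__(self):
        self.entries = []

    def publish(self, message):
        self.entries.append(message)

    def read_from(self, offset):
        """Return messages at or after offset, plus the next offset to bookmark."""
        return self.entries[offset:], len(self.entries)


live, log = LiveFeed(), DurableLog()

# Two messages are published before any consumer is running.
for message in ("a", "b"):
    live.publish(message)
    log.publish(message)

inbox = live.connect()               # connects late, so "a" and "b" are lost
batch, bookmark = log.read_from(0)   # the log still holds "a" and "b"

# A third message arrives; a later run resumes from the saved bookmark.
live.publish("c")
log.publish("c")
resumed, bookmark = log.read_from(bookmark)

print(inbox)    # ['c']
print(batch)    # ['a', 'b']
print(resumed)  # ['c']
```

The live consumer only ever sees what arrives while it is connected. The log consumer loses nothing across restarts as long as it stores the bookmark, which is the property the broker-backed sources are described as offering.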

Streaming: Redis vs. Kafka

  • source-redis reads streams, lists, or key patterns. Great when Redis is already in your stack and volumes are modest.
  • source-kafka is a real consumer with consumer-group offsets and resumable bookmarks. Use it for high-throughput event pipelines and durable, replayable streams.

Rule of thumb: durable, high-volume event stream → Kafka; lightweight queue/cache already on hand → Redis.

HTTP APIs: REST vs. GraphQL vs. XML vs. gRPC

  • source-rest — JSON REST APIs. The most full-featured source: six pagination styles, seven auth strategies, incremental replication, partitions.
  • source-graphql — GraphQL endpoints with cursor pagination and variable injection.
  • source-xml — XML/SOAP APIs; converts XML to JSON with dot-path extraction.
  • source-grpc — gRPC services via dynamic protobuf (prost-reflect), unary or server-streaming.

Rule of thumb: match the protocol the API speaks. For incremental/resumable ingestion, REST has the richest support.

Warehouses: when to read with BigQuery / Snowflake sources

Use source-bigquery / source-snowflake to read out of a warehouse (e.g. to move a query result elsewhere). To load into one, use the matching sink. To transform data already inside the warehouse, reach for dbt — that’s not faucet’s job.

Sinks: column-mapped vs. JSON blob (SQL databases)

The Postgres/MySQL/SQLite/SQL Server sinks can write either:

  • a single JSON/JSONB column (column_mapping: { type: jsonb, column: data }) — schemaless, no DDL coupling, easiest to start with; or
  • auto-mapped columns — one column per top-level field, for queryable relational tables.

Rule of thumb: exploratory / evolving schema → JSON column; stable schema you query with SQL → mapped columns.
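As a rough illustration of the two table shapes, the sketch below writes the same records both ways. It uses Python's built-in sqlite3 directly rather than a faucet sink, and the table names events_json and events_cols and the sample fields are assumptions made for the example.

```python
import json
import sqlite3

records = [
    {"id": 1, "email": "a@example.com", "plan": "free"},
    {"id": 2, "email": "b@example.com", "plan": "pro"},
]

conn = sqlite3.connect(":memory:")

# Shape 1: a single JSON column. The table definition never changes
# when records gain or lose fields.
conn.execute("CREATE TABLE events_json (data TEXT)")
conn.executemany(
    "INSERT INTO events_json (data) VALUES (?)",
    [(json.dumps(record),) for record in records],
)

# Shape 2: one column per top-level field. Easy to query with SQL,
# but the table is now coupled to the record schema.
conn.execute("CREATE TABLE events_cols (id INTEGER, email TEXT, plan TEXT)")
conn.executemany(
    "INSERT INTO events_cols (id, email, plan) VALUES (:id, :email, :plan)",
    records,
)

# Mapped columns: filtering is ordinary SQL.
pro_from_cols = conn.execute(
    "SELECT email FROM events_cols WHERE plan = 'pro'"
).fetchall()

# JSON column: in this sketch each row is decoded client-side before filtering.
decoded = [json.loads(data) for (data,) in conn.execute("SELECT data FROM events_json")]
pro_from_json = [row["email"] for row in decoded if row["plan"] == "pro"]
```

Databases with native JSON operators, such as PostgreSQL's ->> on jsonb, can filter the JSON shape server-side as well, but the queries are more verbose than on mapped columns.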

File sinks: JSONL vs. CSV vs. Parquet vs. stdout

  • sink-stdout — debugging and pipelines (faucet preview uses it).
  • sink-jsonl — line-delimited JSON; lossless, streaming-friendly, gzip/zstd-capable.
  • sink-csv — flat tabular output for spreadsheets/BI; nested fields flatten.
  • sink-parquet — columnar analytical output with built-in compression and schema inference; best for large datasets consumed by analytics engines.

Rule of thumb: machine-to-machine JSON → JSONL; tabular for humans → CSV; analytics at scale → Parquet.
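The sketch below shows what "lossless" and "nested fields flatten" can mean in practice, using only the Python standard library. The dot-separated column names and the choice to store lists as JSON text are assumptions for illustration. The source text does not specify how sink-csv names flattened columns.

```python
import csv
import io
import json


def flatten(record, prefix=""):
    """Flatten nested dicts into dot-separated keys; lists are kept as JSON text."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        elif isinstance(value, list):
            flat[path] = json.dumps(value)
        else:
            flat[path] = value
    return flat


records = [
    {"id": 1, "user": {"name": "Ada", "geo": {"country": "GB"}}, "tags": ["a", "b"]},
    {"id": 2, "user": {"name": "Lin", "geo": {"country": "SG"}}, "tags": []},
]

# JSONL: one JSON document per line, nesting preserved exactly.
jsonl_text = "".join(json.dumps(record) + "\n" for record in records)

# CSV: nested fields become flat columns, so the structure is lost.
flat_rows = [flatten(record) for record in records]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(flat_rows[0]), lineterminator="\n")
writer.writeheader()
writer.writerows(flat_rows)
csv_text = buffer.getvalue()

print(csv_text.splitlines()[0])  # id,user.name,user.geo.country,tags
```

Reading jsonl_text back with json.loads line by line returns the original records unchanged, while the CSV form needs a convention like the one above to be reversed.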

Still unsure?

Run faucet list to see what’s installed, faucet schema source <name> to inspect a connector’s config, and faucet preview <config> --limit 10 to try a source without writing anywhere.