
PSA Intelligence Canary Scan



Scan document data-sources for canaries, trackers, web beacons, and per-recipient fingerprints before interacting with supplied datasets.

When you receive a large document dump from an external party (a leak, legal disclosure, or investigation), the files can contain deliberate or indirect canaries: tracking pixels, embedded JavaScript, and remote template links that phone home the moment a file is opened, as well as steganographic watermarks and per-recipient metadata fingerprints.

canary-scan inspects files without opening them in their native viewers. It extracts and analyses raw structure, metadata, embedded objects, and near-duplicate fingerprints to surface anything that could reveal to an external party that the data-source is being examined.
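
To illustrate what inspecting raw structure can mean in practice, here is a minimal sketch that lists external relationship targets (for example a remote template link) in an OOXML file such as a .docx by reading it as a ZIP archive, so the document is never rendered. It is not canary-scan's implementation; the function name `external_targets` is hypothetical, and a production scanner would likely want a hardened XML parser such as defusedxml.

```python
import zipfile
import xml.etree.ElementTree as ET

# Relationships namespace defined by the Open Packaging Conventions.
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def external_targets(path):
    """Return (part, type, target) for each external relationship in an OOXML file.

    Hypothetical helper for illustration only, not part of canary-scan.
    """
    found = []
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            # Relationship parts live in *_rels/*.rels inside the archive.
            if not name.endswith(".rels"):
                continue
            root = ET.fromstring(zf.read(name))
            for rel in root.iter(f"{REL_NS}Relationship"):
                # TargetMode="External" marks a target outside the package,
                # which is how remote templates and linked images appear.
                if rel.get("TargetMode") == "External":
                    found.append((name, rel.get("Type", ""), rel.get("Target", "")))
    return found
```

Calling `external_targets("suspect.docx")` only reads bytes from the archive; no URL is fetched and no macro or field code is evaluated.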


Quick Start (Docker)

The recommended way to run canary-scan is the Docker image, which bundles all system dependencies and utilities:

# Run the scan using the GitHub Container Registry image
docker run --rm \
  -v /mnt/datasource:/data:ro \
  -v "$(pwd)/canary-scan-out":/output \
  ghcr.io/psaintelligence/canary-scan:latest scan /data -o /output

# Review findings
jq '.[] | select(.severity=="critical")' canary-scan-out/canary-scan-report.json
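
If jq is not available, the same filter can be sketched in Python. This assumes only what the jq expression above implies: that the report is a JSON array of objects with a `severity` field. The helper names `load_findings` and `by_severity` are hypothetical, and any other fields in a finding are not assumed.

```python
import json
from pathlib import Path


def load_findings(report_path):
    """Load the report, assumed to be a JSON array of finding objects."""
    return json.loads(Path(report_path).read_text(encoding="utf-8"))


def by_severity(findings, severity):
    """Keep only findings whose "severity" field equals the given level."""
    return [f for f in findings if f.get("severity") == severity]


# Example usage (path taken from the Docker command above):
#   findings = load_findings("canary-scan-out/canary-scan-report.json")
#   for finding in by_severity(findings, "critical"):
#       print(json.dumps(finding, indent=2))
```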

Run canary-scan --guide inside the container or a local shell for a concise cheat sheet, or see the Workflow guide for a full walkthrough.


Quick Start (pipx)

If you prefer to run the tool natively, you can install the Python package:

pipx install canary-scan

See the Install guide for required system dependencies, optional packages, and air-gapped environment setup.


Canary Scan Report Summary

[Screenshot: canary-scan CLI scan output]


Detection Pipeline

Seven sequential stages, each writing a JSONL artefact:

graph LR
    A[inventory] --> B[metadata] --> C[remote-refs] --> D[embedded] --> E[stego] --> F[uniqueness] --> G[report]

| Stage | What it checks |
| --- | --- |
| inventory | File walk, SHA-256 hashes, MIME types, bucket classification |
| metadata | exiftool extraction, tracking URLs, GPS/serial/PII indicators |
| remote-refs | XXE, tracking pixels, remote template links, formula injections, OLE hyperlinks |
| embedded | Nested binaries, OLE/ActiveX objects, raster image extraction |
| stego | Steghide/stegseek carrier checks, QR code URL detection, EXIF thumbnail mismatch |
| uniqueness | Near-duplicate clustering to find per-recipient canary values |
| report | Merge, deduplicate, filter by severity, emit JSON/CSV/SARIF |
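
As a rough picture of what an inventory-style stage does, the sketch below walks a directory, hashes each file with SHA-256, guesses a MIME type, and writes one JSON object per line (JSONL). It is illustrative only and not canary-scan's code: the function names and the record fields (`path`, `sha256`, `mime`, `size`) are assumptions, and the MIME guess here is extension-based, whereas a real scanner would more likely inspect file content.

```python
import hashlib
import json
import mimetypes
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def inventory(root, out_path):
    """Walk root and write one JSON record per file to out_path.

    Returns the number of files recorded. Keep out_path outside root,
    otherwise the output file itself would be inventoried.
    """
    root = Path(root)
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(p for p in root.rglob("*") if p.is_file()):
            # Extension-based guess only (assumption for this sketch).
            mime, _ = mimetypes.guess_type(path.name)
            record = {
                "path": str(path.relative_to(root)),
                "sha256": sha256_of(path),
                "mime": mime or "application/octet-stream",
                "size": path.stat().st_size,
            }
            out.write(json.dumps(record) + "\n")
            count += 1
    return count
```

Writing one record per line means a later stage can stream the artefact without loading the whole inventory into memory.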

See File Types for the full matrix of supported formats and canary vectors.


Severity Levels

| Severity | Meaning | Recommended Action |
| --- | --- | --- |
| critical | Active phone-home URL or JS that fires on open | Do NOT open in native viewer; quarantine |
| high | Embedded OLE/JS objects, steganographic payload | Investigate in an isolated sandbox |
| medium | Unique fingerprint, GPS/PII metadata | Strip metadata before further handling |
| low | Metadata oddity, non-standard producer string | Note for chain of custody |
| info | Annotated, no canary confirmed | Informational only |
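
When triaging a report, it can help to encode the ordering of these levels explicitly. The following is a small sketch, assuming each finding is a dict with a `severity` field holding one of the five level names above; the helper names are hypothetical and not part of canary-scan.

```python
from collections import Counter

# Most to least urgent, following the severity levels listed above.
SEVERITY_ORDER = ["critical", "high", "medium", "low", "info"]
RANK = {name: i for i, name in enumerate(SEVERITY_ORDER)}


def triage(findings):
    """Sort findings most severe first; unknown levels sort last."""
    return sorted(findings, key=lambda f: RANK.get(f.get("severity"), len(RANK)))


def worst_severity(findings):
    """Return the most severe level present, or None for an empty list."""
    ordered = triage(findings)
    return ordered[0].get("severity") if ordered else None


def summary(findings):
    """Count findings per known level, including levels with zero hits."""
    counts = Counter(f.get("severity") for f in findings)
    return {name: counts.get(name, 0) for name in SEVERITY_ORDER}
```

A wrapper script could, for instance, refuse to release a dataset to analysts while `worst_severity` returns "critical".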