
Collectors and Extractions

Collectors are the entry points for data into the Sterndesk extraction pipeline. They define how and from where data is gathered before being processed by an Extraction Schema. Each collector type represents a different method of data ingestion, allowing you to bring documents and content from various origins into your project for structured data extraction.

Architecture Overview

All collectors feed into the same extraction pipeline, creating a unified data flow regardless of the origin.

[Diagram: collectors fan into the extraction pipeline]

Collector Types

Sterndesk supports multiple collector types, each optimized for different data origins:
Collector Type     | Origin       | Use Case
Upload             | File uploads | PDFs, images, documents
Crawl              | Web URLs     | Web pages, online content
More coming soon   |              | Additional integrations planned

Organizing Collectors

You can create multiple collectors within a single project to organize different data streams. Each collector can have its own extraction schema, allowing you to process different document types with specialized schemas. Common patterns include:
  • By document type — Separate collectors for invoices, receipts, and contracts
  • By business unit — One collector for finance documents, another for HR paperwork
  • By UI context — Different parts of your application use different collectors (e.g., an invoice upload widget vs. a research paper ingestion form)
  • By data origin — Distinguish between user-uploaded content and crawled web data
This separation provides clear organization, distinct extraction schemas per use case, and easier tracking of where data originated.
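For illustration, separate collectors for different document types could be created as in the sketch below. The /r/collectors endpoint and the request fields (name, type) are assumptions used only for the example, not confirmed API details; see the API Reference for the exact request shape.

# Illustrative only: create a collector for invoice uploads
curl -X POST "https://api.eu.sterndesk.com/r/collectors" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"project_id": "proj_xyz789", "name": "invoice-uploads", "type": "upload"}'

# Illustrative only: create a second collector for crawled research pages
curl -X POST "https://api.eu.sterndesk.com/r/collectors" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"project_id": "proj_xyz789", "name": "research-crawl", "type": "crawl"}'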

Extraction Modes

Each collector supports two extraction modes that determine when and how data is processed:

Direct Extraction

In Direct Extraction mode, documents are processed immediately upon collection. The extraction workflow runs automatically:
  1. Collect — Data is ingested via the collector
  2. Convert — Documents are converted to a processable format
  3. Structure — An LLM extracts structured data according to your schema
  4. Deliver — Results are available immediately via the API
Direct Extraction is ideal for:
  • Real-time processing requirements
  • Single document workflows
  • Immediate data availability needs
To enable Direct Extraction, attach an Extraction Schema to your collector when creating it.
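As a rough sketch, attaching a schema at creation time could look like the request below; the /r/collectors endpoint and the extraction_schema_id field are assumptions used only to illustrate the idea, so check the API Reference for the actual shape.

# Illustrative only: create a collector with an Extraction Schema attached,
# enabling Direct Extraction for everything it collects
curl -X POST "https://api.eu.sterndesk.com/r/collectors" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"project_id": "proj_xyz789", "name": "invoice-uploads", "extraction_schema_id": "sch_abc123"}'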

Bundled Extraction

In Bundled Extraction mode, collected data is persisted to long-term S3 storage before extraction runs as a separate step. Enable this mode by setting bundle_enabled: true when creating a collector. When bundling is enabled:
  1. Collect — Data is ingested via the collector
  2. Bundle — Data is stored in S3 as a bundle, with status moving from CREATED to STORED
  3. Extract — Extraction can run on the bundled data (when configured)
This allows for:
  • Batch processing of multiple documents
  • Re-extraction with updated schemas
  • Data archival and audit trails
  • Deferred processing workflows
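A collector with bundling enabled might be created as follows. Only the bundle_enabled: true setting is documented above; the /r/collectors endpoint and the name field are illustrative assumptions.

# Illustrative only: create a collector that bundles collected data to S3
# instead of extracting immediately (bundle_enabled is the documented flag)
curl -X POST "https://api.eu.sterndesk.com/r/collectors" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"project_id": "proj_xyz789", "name": "archive-uploads", "bundle_enabled": true}'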

Working with Bundles

List bundles for a project using the ListBundles RPC:
curl "https://api.eu.sterndesk.com/r/bundles?project_id=proj_xyz789" \
  -H "Authorization: Bearer YOUR_API_KEY"
Download bundled data using the DownloadBundle RPC, which returns pre-signed GET URLs:
curl "https://api.eu.sterndesk.com/r/bundles/bnd_abc123/download" \
  -H "Authorization: Bearer YOUR_API_KEY"
Each bundle tracks:
  • Source type — Whether it originated from an upload (BUNDLE_SOURCE_TYPE_UPLOAD) or crawl (BUNDLE_SOURCE_TYPE_CRAWL)
  • Object count — Number of objects stored in the bundle
  • Total size — Total size in bytes of all bundled objects

Upload Collectors

Upload Collectors enable you to ingest documents directly into Sterndesk via file uploads. They are ideal for processing PDFs, images, and other document formats.

How Upload Collectors Work

  1. Create an Upload Collector — Define a named collector within your project
  2. Request Pre-signed URLs — Get secure upload URLs for your files
  3. Upload Files — PUT your files directly to the pre-signed URLs
  4. Automatic Processing — If a schema is attached, extraction runs automatically
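Steps 2 and 3 might look like the sketch below. The /r/uploads endpoint, the request fields, and the placeholder URL are assumptions for illustration; only the pre-signed PUT pattern itself is described above.

# Illustrative only: request a pre-signed upload URL for one file
curl -X POST "https://api.eu.sterndesk.com/r/uploads" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"collector_id": "col_abc123", "filename": "invoice.pdf"}'

# PUT the file directly to the pre-signed URL returned above
curl -X PUT -T invoice.pdf "PRE_SIGNED_PUT_URL_FROM_RESPONSE"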

Upload Lifecycle

Each upload goes through the following states:
Status             | Description
CREATED            | Upload initiated, awaiting file transfer
TRANSFERRED        | Files successfully uploaded
DIRECTLY_EXTRACTED | Extraction complete (if schema attached)
EXPIRED            | Upload window expired before transfer completed
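If your workflow needs to wait for TRANSFERRED or DIRECTLY_EXTRACTED before reading results, you could poll the upload's status; the endpoint below is purely illustrative and may not match the real API.

# Illustrative only: check the current status of an upload
curl "https://api.eu.sterndesk.com/r/uploads/upl_abc123" \
  -H "Authorization: Bearer YOUR_API_KEY"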

Crawl Collectors

Crawl Collectors enable you to extract data from web pages by providing URLs. They automatically fetch and process web content.

How Crawl Collectors Work

  1. Create a Crawl Collector — Define a named collector within your project
  2. Submit URLs — Provide URLs to crawl
  3. Automatic Fetching — The crawler retrieves and processes the web content
  4. Extraction — If a schema is attached, structured data is extracted
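Submitting URLs to a Crawl Collector might look like the sketch below; the /r/crawls endpoint and the urls field are assumptions for illustration only.

# Illustrative only: submit URLs for a crawl collector to fetch
curl -X POST "https://api.eu.sterndesk.com/r/crawls" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"collector_id": "col_web456", "urls": ["https://example.com/pricing", "https://example.com/docs"]}'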

Crawl Lifecycle

Each crawl goes through the following states:
Status  | Description
CREATED | Crawl initiated, pending execution
CRAWLED | Web content successfully retrieved

API Reference

For detailed information on managing collectors, see the API Reference.

Next Steps