Collectors and Extractions
Collectors are the entry points for data into the Sterndesk extraction pipeline. They define how and from where data is gathered before being processed by an Extraction Schema. Each collector type represents a different method of data ingestion, allowing you to bring documents and content from various origins into your project for structured data extraction.

Architecture Overview
All collectors feed into the same extraction pipeline, creating a unified data flow regardless of origin.

Collector Types
Sterndesk supports multiple collector types, each optimized for different data origins:

| Collector Type | Origin | Use Case |
|---|---|---|
| Upload | File uploads | PDFs, images, documents |
| Crawl | Web URLs | Web pages, online content |
| More coming soon | — | Additional integrations planned |
Organizing Collectors
You can create multiple collectors within a single project to organize different data streams. Each collector can have its own extraction schema, allowing you to process different document types with specialized schemas. Common patterns include:

- By document type — Separate collectors for invoices, receipts, and contracts
- By business unit — One collector for finance documents, another for HR paperwork
- By UI context — Different parts of your application use different collectors (e.g., an invoice upload widget vs. a research paper ingestion form)
- By data origin — Distinguish between user-uploaded content and crawled web data
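For instance, a minimal sketch of the by-document-type pattern, assuming a gRPC-style client with generated stubs; the module, service, endpoint, message, and field names below are assumptions for illustration, not the confirmed Sterndesk API:

```python
# Hedged sketch: every identifier here, apart from the idea of a named collector
# per document type, is an assumption about the Sterndesk API.
import grpc
from sterndesk_pb2 import CreateCollectorRequest        # hypothetical module
from sterndesk_pb2_grpc import CollectorServiceStub     # hypothetical stub

channel = grpc.secure_channel("api.sterndesk.example:443",
                              grpc.ssl_channel_credentials())
client = CollectorServiceStub(channel)

# One collector per document type, so each can be paired with its own schema.
for name in ("invoices", "contracts"):
    client.CreateCollector(CreateCollectorRequest(project_id="proj_123", name=name))
```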
Extraction Modes
Each collector supports two extraction modes that determine when and how data is processed.

Direct Extraction
In Direct Extraction mode, documents are processed immediately upon collection. The extraction workflow runs automatically:

- Collect — Data is ingested via the collector
- Convert — Documents are converted to a processable format
- Structure — An LLM extracts structured data according to your schema
- Deliver — Results are available immediately via the API
Direct Extraction is best suited for:

- Real-time processing requirements
- Single document workflows
- Immediate data availability needs
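A hedged sketch of the Deliver step: once the collect, convert, and structure steps finish, the structured result can be read back via the API. The RPC, message, field names, and identifiers below are assumptions, reusing the hypothetical client from the earlier sketch:

```python
from sterndesk_pb2 import GetExtractionRequest   # hypothetical module/message

# In Direct Extraction mode the result is ready as soon as processing completes,
# so a single read-back call is enough; no separate extract step is triggered.
result = client.GetExtraction(GetExtractionRequest(upload_id="upl_123"))  # hypothetical RPC
print(result.structured_data)   # shaped by the collector's attached Extraction Schema
```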
Bundled Extraction
In Bundled Extraction mode, collected data is persisted to long-term S3 storage before extraction runs as a separate step. Enable this mode by setting `bundle_enabled: true` when creating a collector.
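A hedged sketch of enabling bundling at creation time, reusing the hypothetical client from the collector sketch above; only the `bundle_enabled` setting itself comes from this page, the rest is assumed:

```python
from sterndesk_pb2 import CreateCollectorRequest   # hypothetical module/message

client.CreateCollector(CreateCollectorRequest(
    project_id="proj_123",
    name="archive-intake",
    bundle_enabled=True,   # persist collected data to S3 before extraction runs
))
```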
When bundling is enabled:
- Collect — Data is ingested via the collector
- Bundle — Data is stored in S3 as a bundle with status `CREATED` → `STORED`
- Extract — Extraction can run on the bundled data (when configured)
Bundled Extraction is best suited for:

- Batch processing of multiple documents
- Re-extraction with updated schemas
- Data archival and audit trails
- Deferred processing workflows
Working with Bundles
List bundles for a project using the `ListBundles` RPC.
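A hedged sketch of the call, reusing the hypothetical gRPC client from the earlier sketches; the request and response field names are assumptions:

```python
from sterndesk_pb2 import ListBundlesRequest   # hypothetical module/message

response = client.ListBundles(ListBundlesRequest(project_id="proj_123"))
for bundle in response.bundles:                # field names are assumptions
    print(bundle.bundle_id, bundle.status, bundle.source_type)
```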
Download a bundle's contents with the `DownloadBundle` RPC, which returns pre-signed GET URLs.
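A hedged sketch of downloading the bundled objects; the RPC name and the pre-signed GET URLs come from the description above, while the field names and the use of `requests` for the HTTP GET are assumptions:

```python
import requests
from sterndesk_pb2 import DownloadBundleRequest   # hypothetical module/message

response = client.DownloadBundle(DownloadBundleRequest(bundle_id="bnd_123"))
for url in response.presigned_urls:               # hypothetical field name
    # Pre-signed GET URLs embed their own authorization, so a plain GET suffices.
    payload = requests.get(url, timeout=30)
    payload.raise_for_status()
```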
Each bundle records the following metadata:

- Source type — Whether it originated from an upload (`BUNDLE_SOURCE_TYPE_UPLOAD`) or a crawl (`BUNDLE_SOURCE_TYPE_CRAWL`)
- Object count — Number of objects stored in the bundle
- Total size — Total size in bytes of all bundled objects
Upload Collectors
Upload Collectors enable you to ingest documents directly into Sterndesk via file uploads. They are ideal for processing PDFs, images, and other document formats.

How Upload Collectors Work
- Create an Upload Collector — Define a named collector within your project
- Request Pre-signed URLs — Get secure upload URLs for your files
- Upload Files — PUT your files directly to the pre-signed URLs
- Automatic Processing — If a schema is attached, extraction runs automatically
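The following sketch covers the pre-signed URL request and the file upload itself. The HTTP PUT against a pre-signed URL is standard S3 behaviour, but the RPC, message, and field names are assumptions, reusing the hypothetical client from the earlier sketches:

```python
import requests
from sterndesk_pb2 import CreateUploadRequest   # hypothetical module/message

upload = client.CreateUpload(CreateUploadRequest(   # hypothetical RPC
    collector_id="col_123",
    filenames=["invoice-2024-03.pdf"],
))

# PUT the file bytes straight to the pre-signed URL; no Sterndesk credentials
# are needed because authorization is embedded in the URL itself.
with open("invoice-2024-03.pdf", "rb") as f:
    resp = requests.put(upload.presigned_urls[0], data=f,
                        headers={"Content-Type": "application/pdf"})
resp.raise_for_status()   # the upload should then move from CREATED to TRANSFERRED
```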
Upload Lifecycle
Each upload goes through the following states:

| Status | Description |
|---|---|
| `CREATED` | Upload initiated, awaiting file transfer |
| `TRANSFERRED` | Files successfully uploaded |
| `DIRECTLY_EXTRACTED` | Extraction complete (if schema attached) |
| `EXPIRED` | Upload window expired before transfer completed |
Crawl Collectors
Crawl Collectors enable you to extract data from web pages by providing URLs. They automatically fetch and process web content.

How Crawl Collectors Work
- Create a Crawl Collector — Define a named collector within your project
- Submit URLs — Provide URLs to crawl
- Automatic Fetching — The crawler retrieves and processes the web content
- Extraction — If a schema is attached, structured data is extracted
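A hedged sketch of submitting a URL to a Crawl Collector, again reusing the hypothetical client; the RPC, message, and field names are assumptions:

```python
from sterndesk_pb2 import CreateCrawlRequest   # hypothetical module/message

crawl = client.CreateCrawl(CreateCrawlRequest(   # hypothetical RPC
    collector_id="col_456",
    urls=["https://example.com/pricing"],
))
print(crawl.status)   # expected to begin as CREATED and move to CRAWLED
```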
Crawl Lifecycle
Each crawl goes through the following states:

| Status | Description |
|---|---|
| `CREATED` | Crawl initiated, pending execution |
| `CRAWLED` | Web content successfully retrieved |