Collectors and Extractions
Collectors are the entry points for data into the Sterndesk extraction pipeline. They define how and from where data is gathered before being processed by an Extraction Schema. Each collector type represents a different method of data ingestion, allowing you to bring documents and content from various origins into your project for structured data extraction.

Architecture Overview
All collectors feed into the same extraction pipeline, creating a unified data flow regardless of origin.

Collector Types
Sterndesk supports multiple collector types, each optimized for different data origins:

| Collector Type | Origin | Use Case |
|---|---|---|
| Upload | File uploads | PDFs, images, documents |
| Crawl | Web URLs | Web pages, online content |
| More coming soon | — | Additional integrations planned |
Organizing Collectors
You can create multiple collectors within a single project to organize different data streams. Each collector can have its own extraction schema, allowing you to process different document types with specialized schemas. Common patterns include:

- By document type — Separate collectors for invoices, receipts, and contracts
- By business unit — One collector for finance documents, another for HR paperwork
- By UI context — Different parts of your application use different collectors (e.g., an invoice upload widget vs. a research paper ingestion form)
- By data origin — Distinguish between user-uploaded content and crawled web data
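For instance, a minimal sketch of the by-document-type pattern, assuming a gRPC-style client with generated stubs; the module, service, endpoint, message, and field names below are assumptions for illustration, not the confirmed Sterndesk API:

```python
# Hedged sketch: every identifier here, apart from the idea of a named collector
# per document type, is an assumption about the Sterndesk API.
import grpc
from sterndesk_pb2 import CreateCollectorRequest        # hypothetical module
from sterndesk_pb2_grpc import CollectorServiceStub     # hypothetical stub

channel = grpc.secure_channel("api.sterndesk.example:443",
                              grpc.ssl_channel_credentials())
client = CollectorServiceStub(channel)

# One collector per document type, so each can be paired with its own schema.
for name in ("invoices", "contracts"):
    client.CreateCollector(CreateCollectorRequest(project_id="proj_123", name=name))
```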
Extraction Modes
Each collector supports two extraction modes that determine when and how data is processed.

Direct Extraction
In Direct Extraction mode, documents are processed immediately upon collection. The extraction workflow runs automatically:

- Collect — Data is ingested via the collector
- Convert — Documents are converted to a processable format
- Structure — An LLM extracts structured data according to your schema
- Deliver — Results are available immediately via the API
Direct Extraction is best suited for:

- Real-time processing requirements
- Single document workflows
- Immediate data availability needs
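A hedged sketch of the Deliver step: once the collect, convert, and structure steps finish, the structured result can be read back via the API. The RPC, message, field names, and identifiers below are assumptions, reusing the hypothetical client from the earlier sketch:

```python
from sterndesk_pb2 import GetExtractionRequest   # hypothetical module/message

# In Direct Extraction mode the result is ready as soon as processing completes,
# so a single read-back call is enough; no separate extract step is triggered.
result = client.GetExtraction(GetExtractionRequest(upload_id="upl_123"))  # hypothetical RPC
print(result.structured_data)   # shaped by the collector's attached Extraction Schema
```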
Bundled Extraction
In Bundled Extraction mode, collected data is persisted to long-term S3 storage before extraction runs as a separate step. Enable this mode by setting `bundle_enabled: true` when creating a collector.
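A hedged sketch of enabling bundling at creation time, reusing the hypothetical client from the collector sketch above; only the `bundle_enabled` setting itself comes from this page, the rest is assumed:

```python
from sterndesk_pb2 import CreateCollectorRequest   # hypothetical module/message

client.CreateCollector(CreateCollectorRequest(
    project_id="proj_123",
    name="archive-intake",
    bundle_enabled=True,   # persist collected data to S3 before extraction runs
))
```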
When bundling is enabled:
- Collect — Data is ingested via the collector
- Bundle — Data is stored in S3 as a bundle with status `CREATED` → `STORED`
- Extract — Extraction can run on the bundled data (when configured)
Bundled Extraction is best suited for:

- Batch processing of multiple documents
- Re-extraction with updated schemas
- Data archival and audit trails
- Deferred processing workflows
Working with Bundles
List bundles for a project using the `ListBundles` RPC.
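A hedged sketch of the call, reusing the hypothetical gRPC client from the earlier sketches; the request and response field names are assumptions:

```python
from sterndesk_pb2 import ListBundlesRequest   # hypothetical module/message

response = client.ListBundles(ListBundlesRequest(project_id="proj_123"))
for bundle in response.bundles:                # field names are assumptions
    print(bundle.bundle_id, bundle.status, bundle.source_type)
```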
Download a bundle's contents with the `DownloadBundle` RPC, which returns pre-signed GET URLs.
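A hedged sketch of downloading the bundled objects; the RPC name and the pre-signed GET URLs come from the description above, while the field names and the use of `requests` for the HTTP GET are assumptions:

```python
import requests
from sterndesk_pb2 import DownloadBundleRequest   # hypothetical module/message

response = client.DownloadBundle(DownloadBundleRequest(bundle_id="bnd_123"))
for url in response.presigned_urls:               # hypothetical field name
    # Pre-signed GET URLs embed their own authorization, so a plain GET suffices.
    payload = requests.get(url, timeout=30)
    payload.raise_for_status()
```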
Each bundle records the following metadata:

- Source type — Whether it originated from an upload (`BUNDLE_SOURCE_TYPE_UPLOAD`) or a crawl (`BUNDLE_SOURCE_TYPE_CRAWL`)
- Object count — Number of objects stored in the bundle
- Total size — Total size in bytes of all bundled objects
Upload Collectors
Upload Collectors enable you to ingest documents directly into Sterndesk via file uploads. They are ideal for processing PDFs, images, and other document formats.

How Upload Collectors Work
- Create an Upload Collector — Define a named collector within your project
- Request Pre-signed URLs — Get secure upload URLs for your files
- Upload Files — PUT your files directly to the pre-signed URLs
- Automatic Processing — If a schema is attached, extraction runs automatically
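The following sketch covers the pre-signed URL request and the file upload itself. The HTTP PUT against a pre-signed URL is standard S3 behaviour, but the RPC, message, and field names are assumptions, reusing the hypothetical client from the earlier sketches:

```python
import requests
from sterndesk_pb2 import CreateUploadRequest   # hypothetical module/message

upload = client.CreateUpload(CreateUploadRequest(   # hypothetical RPC
    collector_id="col_123",
    filenames=["invoice-2024-03.pdf"],
))

# PUT the file bytes straight to the pre-signed URL; no Sterndesk credentials
# are needed because authorization is embedded in the URL itself.
with open("invoice-2024-03.pdf", "rb") as f:
    resp = requests.put(upload.presigned_urls[0], data=f,
                        headers={"Content-Type": "application/pdf"})
resp.raise_for_status()   # the upload should then move from CREATED to TRANSFERRED
```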
Upload Lifecycle
Each upload goes through the following states:

| Status | Description |
|---|---|
| `CREATED` | Upload initiated, awaiting file transfer |
| `TRANSFERRED` | Files successfully uploaded |
| `DIRECTLY_EXTRACTED` | Extraction complete (if schema attached) |
| `EXPIRED` | Upload window expired before transfer completed |
Crawl Collectors
Crawl Collectors enable you to extract data from web pages by providing URLs. They automatically fetch and process web content.

How Crawl Collectors Work
- Create a Crawl Collector — Define a named collector within your project
- Submit URLs — Provide URLs to crawl
- Automatic Fetching — The crawler retrieves and processes the web content
- Extraction — If a schema is attached, structured data is extracted
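A hedged sketch of submitting a URL to a Crawl Collector, again reusing the hypothetical client; the RPC, message, and field names are assumptions:

```python
from sterndesk_pb2 import CreateCrawlRequest   # hypothetical module/message

crawl = client.CreateCrawl(CreateCrawlRequest(   # hypothetical RPC
    collector_id="col_456",
    urls=["https://example.com/pricing"],
))
print(crawl.status)   # expected to begin as CREATED and move to CRAWLED
```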
Crawl Lifecycle
Each crawl goes through the following states:

| Status | Description |
|---|---|
| `CREATED` | Crawl initiated, pending execution |
| `CRAWLED` | Web content successfully retrieved |