
Upload and Extract Documents

This guide walks you through uploading documents and extracting structured data from them using Sterndesk’s Upload Collector.

Prerequisites

Before you begin, ensure you have an API key (shown as YOUR_API_KEY in the examples) and a project ID. This guide uses proj_xyz789 as the example project ID.

Step 1: Create an Extraction Schema

An extraction schema defines the structure of data you want to extract from your documents. Create a schema that matches the information you’re looking for. For this example, we’ll create a simple schema to extract contact information:
curl -X POST https://api.eu.sterndesk.com/r/extraction-schemas \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "project_id": "proj_xyz789",
    "name": "Contact Extraction",
    "json_schema": "{\"type\":\"object\",\"properties\":{\"full_name\":{\"type\":\"string\",\"description\":\"The full name of the person\"},\"email\":{\"type\":\"string\",\"description\":\"Email address\"},\"phone\":{\"type\":\"string\",\"description\":\"Phone number\"},\"company\":{\"type\":\"string\",\"description\":\"Company or organization name\"}},\"required\":[\"full_name\"]}"
  }'
The json_schema field must be a JSON-encoded string, not a nested object. See Extraction Schemas for details on schema encoding.
Response:
{
  "id": "exsc_abc123",
  "name": "Contact Extraction"
}
Save the schema ID (exsc_abc123) for the next step.
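Because json_schema must be a JSON-encoded string rather than a nested object, the request body is effectively encoded twice. The following Python sketch shows one way to produce such a body without hand-escaping quotes. The helper name build_schema_request is hypothetical and not part of any Sterndesk SDK; only the field names come from the request above.

```python
import json

# The same contact schema as in the curl example, written as a Python dict.
contact_schema = {
    "type": "object",
    "properties": {
        "full_name": {"type": "string", "description": "The full name of the person"},
        "email": {"type": "string", "description": "Email address"},
        "phone": {"type": "string", "description": "Phone number"},
        "company": {"type": "string", "description": "Company or organization name"},
    },
    "required": ["full_name"],
}


def build_schema_request(project_id: str, name: str, schema: dict) -> str:
    """Return the JSON request body (hypothetical helper, for illustration)."""
    payload = {
        "project_id": project_id,
        "name": name,
        # First encoding: the schema becomes a string value, not a nested object.
        "json_schema": json.dumps(schema, separators=(",", ":")),
    }
    # Second encoding: the whole payload becomes the HTTP request body.
    return json.dumps(payload)


body = build_schema_request("proj_xyz789", "Contact Extraction", contact_schema)
print(body)
```

Decoding the body once yields a json_schema value that is still a string; decoding that string again yields the original schema.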

Step 2: Create an Upload Collector

An Upload Collector is a collector that accepts file uploads. When you attach an extraction schema to it, documents are automatically extracted upon upload (Direct Extraction mode).
curl -X POST https://api.eu.sterndesk.com/r/upload-collectors \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "project_id": "proj_xyz789",
    "name": "Contact Documents",
    "strategy": "UPLOAD_STRATEGY_PUT",
    "direct_extraction_schema_id": "exsc_abc123"
  }'
To enable long-term storage of uploaded files, add "bundle_enabled": true to your request. This stores the uploaded data in S3 as a bundle that can be downloaded later. See Bundled Extraction for details.
Response:
{
  "id": "upl_coll_def456",
  "name": "Contact Documents"
}
Save the collector ID (upl_coll_def456) for creating uploads.

Step 3: Create an Upload

To upload files, first create an upload request specifying the files you want to upload. You must declare the exact size of each file in bytes.
curl -X POST https://api.eu.sterndesk.com/r/uploads \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "upload_collector_id": "upl_coll_def456",
    "files": [
      {"size_bytes": 1048576}
    ],
    "upload_expiration": "600s"
  }'
The upload_expiration field specifies how long the upload URLs remain valid (minimum 1ms, maximum 1 hour).
Response:
{
  "pre_signs": [
    {
      "url": "https://storage.example.com/...",
      "strategy": "UPLOAD_STRATEGY_PUT"
    }
  ]
}
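Since each file's size must be declared exactly, it can help to read the sizes from disk rather than typing them by hand. The sketch below builds the upload request body that way. The helper build_upload_request is hypothetical, and the positive-value check on the expiration is a simplification of the documented 1ms to 1 hour range.

```python
import os

MAX_EXPIRATION_SECONDS = 3600  # the guide states a maximum of 1 hour


def build_upload_request(collector_id, paths, expiration_seconds=600):
    """Build the body for creating an upload (hypothetical helper)."""
    if not 0 < expiration_seconds <= MAX_EXPIRATION_SECONDS:
        raise ValueError("upload_expiration must be positive and at most 1 hour")
    return {
        "upload_collector_id": collector_id,
        # os.path.getsize returns the file size in bytes; order follows `paths`.
        "files": [{"size_bytes": os.path.getsize(path)} for path in paths],
        # Same "<seconds>s" format as the curl example above.
        "upload_expiration": f"{expiration_seconds}s",
    }
```

Keeping the list of paths in the same order as the files array makes it easier to match each returned pre-signed URL to its file later.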

Step 4: Upload Files Using Pre-signed URLs

Use the returned pre-signed URL to upload your file directly to storage. For details on how pre-signed URLs work and why Sterndesk uses direct-to-storage uploads, see Upload URLs.
Upload your file using an HTTP PUT request:
curl -X PUT "PRE_SIGNED_URL" \
  --data-binary @/path/to/your/document.pdf
The file size must exactly match the size_bytes value you declared when creating the upload. A size mismatch will cause the upload to fail with a 403 error.
If you’re uploading multiple files, upload them in the same order as they were specified in the files array—the first pre-signed URL corresponds to the first file specification.
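The two rules above (sizes must match exactly, and URLs follow the order of the files array) can be checked locally before any bytes are sent. This is a minimal sketch using Python's standard urllib; build_put_requests and upload_all are hypothetical helper names, and the sketch reads each file fully into memory, which may not suit very large documents.

```python
import os
import urllib.request


def build_put_requests(paths, file_specs, pre_signs):
    """Pair files with pre-signed URLs by position (hypothetical helper).

    file_specs is the "files" array sent when creating the upload and
    pre_signs is the "pre_signs" array from the response.
    """
    if not (len(paths) == len(file_specs) == len(pre_signs)):
        raise ValueError("expected one file spec and one URL per file")
    requests = []
    for path, spec, pre_sign in zip(paths, file_specs, pre_signs):
        actual = os.path.getsize(path)
        # Catch a size mismatch locally instead of waiting for a 403.
        if actual != spec["size_bytes"]:
            raise ValueError(f"{path}: declared {spec['size_bytes']} bytes, found {actual}")
        with open(path, "rb") as handle:
            data = handle.read()  # reads the whole file into memory
        requests.append(urllib.request.Request(pre_sign["url"], data=data, method="PUT"))
    return requests


def upload_all(paths, file_specs, pre_signs):
    """Send the PUT requests in order; urlopen raises HTTPError on a 403."""
    for request in build_put_requests(paths, file_specs, pre_signs):
        with urllib.request.urlopen(request) as response:
            response.read()
```

Separating request construction from sending keeps the ordering and size checks testable without network access.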

Step 5: Poll for Extraction Results

Once files are uploaded, Sterndesk automatically processes them if an extraction schema is attached to the collector. Poll the extractions endpoint to check the status and retrieve results.
First, list uploads to get the upload ID:
curl "https://api.eu.sterndesk.com/r/uploads?upload_collector_id=upl_coll_def456" \
  -H "Authorization: Bearer YOUR_API_KEY"
Response:
{
  "items": [
    {
      "id": "upld_ghi789",
      "status": "UPLOAD_STATUS_DIRECTLY_EXTRACTED"
    }
  ]
}
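Polling here means repeating the list-uploads request until the status changes. A minimal polling loop might look like the sketch below. The function wait_for_upload, its interval and timeout defaults, and the fetch_uploads callable are all assumptions for illustration; fetch_uploads stands in for whatever code performs the GET request above and returns the parsed JSON response.

```python
import time

EXTRACTED = "UPLOAD_STATUS_DIRECTLY_EXTRACTED"


def wait_for_upload(fetch_uploads, upload_id=None, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Poll until an upload reports UPLOAD_STATUS_DIRECTLY_EXTRACTED.

    fetch_uploads: caller-supplied function returning the parsed JSON of the
    list-uploads response. If upload_id is None, the first extracted upload
    is returned. Raises TimeoutError when the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while True:
        for item in fetch_uploads().get("items", []):
            matches_id = upload_id is None or item.get("id") == upload_id
            if matches_id and item.get("status") == EXTRACTED:
                return item
        if time.monotonic() >= deadline:
            raise TimeoutError("upload was not extracted before the timeout")
        sleep(interval)  # wait before asking again
```

Passing the fetch function in as a parameter keeps the loop independent of any particular HTTP client.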
Once the status is UPLOAD_STATUS_DIRECTLY_EXTRACTED, retrieve the extraction results:
curl "https://api.eu.sterndesk.com/r/direct-upload-extractions?upload_id=upld_ghi789" \
  -H "Authorization: Bearer YOUR_API_KEY"
Response:
{
  "items": [
    {
      "id": "duex_jkl012",
      "status": "DIRECT_UPLOAD_EXTRACTION_STATUS_STRUCTURED",
      "extraction_output": {
        "full_name": "Jane Smith",
        "email": "jane.smith@example.com",
        "phone": "+1 555-123-4567",
        "company": "Acme Corporation"
      }
    }
  ]
}
Once the extraction status is DIRECT_UPLOAD_EXTRACTION_STATUS_STRUCTURED, the extraction_output field contains your structured data.
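To pick the structured data out of a response like the one above, a small filter over items is enough. The helper structured_outputs below is a hypothetical name; it assumes only the response shape shown in this step.

```python
STRUCTURED = "DIRECT_UPLOAD_EXTRACTION_STATUS_STRUCTURED"


def structured_outputs(response: dict) -> list:
    """Return extraction_output for each item whose status is STRUCTURED."""
    return [
        item["extraction_output"]
        for item in response.get("items", [])
        # Items in any other status are skipped rather than treated as errors.
        if item.get("status") == STRUCTURED and "extraction_output" in item
    ]
```

Items that have not reached the structured status are simply left out, so the caller can poll again and re-run the filter.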

Deleting an Upload Collector

When you no longer need an upload collector, you can delete it.
Deleting an upload collector permanently removes the collector and all associated uploads and extractions. This action cannot be undone.
curl -X DELETE https://api.eu.sterndesk.com/r/upload-collectors/upl_coll_def456 \
  -H "Authorization: Bearer YOUR_API_KEY"

Next Steps