Extraction Schemas

Extraction schemas define the structure and format of data you want to extract from documents. They tell Sterndesk exactly what information to look for and how to organize it in the output. Schemas are typically designed for a specific type of document—a schema for extracting data from vessel reports will look very different from one for research papers or product pages.

What Is an Extraction Schema?

An extraction schema is a blueprint that describes the shape of the data you want to extract. When you process a document, Sterndesk uses the schema to:
  1. Guide extraction — The AI understands what fields to look for based on the schema
  2. Validate output — Ensures the extracted data conforms to your expected structure
  3. Standardize results — All extractions for a given schema produce consistent, predictable JSON
Without a schema, extraction results would be unstructured and inconsistent. Schemas give you control over the exact output format, making it easy to integrate extracted data into your applications and databases.
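The validation step can be pictured with a small Python sketch. It checks only the top-level `required` list and basic `type` keywords of a schema against an extraction result. The helper `check_top_level` is hypothetical and written for illustration; it is not part of Sterndesk, and a real validator would cover far more of JSON Schema.

```python
# Hypothetical helper for illustration only; not a Sterndesk API.
# Maps JSON Schema type names to the Python types json.loads produces.
TYPE_MAP = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "array": list,
    "object": dict,
}


def check_top_level(schema, data):
    """Return a list of problems in the top-level fields; empty means OK."""
    problems = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in data:
            problems.append(f"missing required field: {name}")
    for name, value in data.items():
        expected = TYPE_MAP.get(properties.get(name, {}).get("type"))
        if expected is None:
            continue  # unknown field or untyped property: skipped in this sketch
        # bool is a subclass of int in Python, so reject it explicitly
        # wherever the schema asks for anything other than a boolean.
        bool_mismatch = isinstance(value, bool) and expected is not bool
        if bool_mismatch or not isinstance(value, expected):
            problems.append(f"wrong type for field: {name}")
    return problems


schema = {
    "type": "object",
    "properties": {
        "full_name": {"type": "string"},
        "email": {"type": "string"},
    },
    "required": ["full_name"],
}

ok = check_top_level(schema, {"full_name": "Ada Lovelace", "email": "ada@example.com"})
bad = check_top_level(schema, {"email": 42})
print(ok)
print(bad)
```

The first call returns an empty list; the second reports the missing `full_name` and the non-string `email`.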

JSON Schema Standard

Sterndesk uses JSON Schema as the format for defining extraction schemas. JSON Schema is an open standard for describing and validating JSON data structures. The following features make it well suited to extraction:
Feature             Description
Type definitions    Specify whether a field is a string, number, boolean, array, or object
Required fields     Mark which fields must be present in the output
Enumerations        Restrict values to a predefined set of options
Nested structures   Define complex hierarchies with objects containing other objects
Descriptions        Add human-readable descriptions that help the AI understand each field
Field descriptions are particularly important—they provide context that helps the AI accurately identify and extract the right information from documents.
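As a rough illustration, the Python sketch below builds one small schema that uses all five features at once. The field names are borrowed from the vessel examples on this page and are illustrative only; `vessel_report_schema` is a name made up for this sketch.

```python
import json

# Illustrative schema only; the field selection is an assumption.
vessel_report_schema = {
    "type": "object",  # type definition for the root
    "properties": {
        "vessel_name": {
            "type": "string",
            "description": "Name of the vessel",  # description guiding extraction
        },
        "vessel_type": {
            "type": "string",
            "enum": ["bulk_carrier", "tanker", "other"],  # enumeration
            "description": "Type of vessel",
        },
        "start_date": {  # nested structure: an object inside an object
            "type": "object",
            "properties": {
                "year": {"type": "integer", "description": "Year (e.g., 2023)"},
                "month": {"type": "integer", "description": "Month (1-12)"},
            },
        },
    },
    "required": ["vessel_name"],  # required fields
}

print(json.dumps(vessel_report_schema, indent=2))
```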

Schemas and Collectors

Extraction schemas are linked to collectors. When you configure a collector, you associate it with an extraction schema. Every document processed by that collector will have its data extracted according to the schema.
Collector → uses → Extraction Schema → produces → Extractions
This design allows you to:
  • Use the same schema across multiple collectors
  • Process different document types with different schemas
  • Update a schema once and have it apply to all associated collectors

Schema Encoding

When sending an extraction schema to the Sterndesk API, the JSON Schema must be encoded as a JSON string. This means the schema object needs to be serialized (stringified) before being included in your API request.
A common mistake is sending the schema as a nested JSON object. The API expects the schema as a string value, not an embedded object.
Incorrect — schema as nested object:
{
  "name": "contact-extraction",
  "json_schema": {
    "type": "object",
    "properties": { ... }
  }
}
Correct — schema as JSON string:
{
  "name": "contact-extraction",
  "json_schema": "{\"type\":\"object\",\"properties\":{...}}"
}
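In Python, the encoding can be sketched with the standard `json` module: serialize the schema on its own first, then place the resulting string in the request payload. Only the `name` and `json_schema` fields shown above are used here; the endpoint, authentication, and any other request fields are left out because they are not described in this section.

```python
import json

schema = {
    "type": "object",
    "properties": {
        "full_name": {
            "type": "string",
            "description": "The full name of the person",
        },
    },
    "required": ["full_name"],
}

# Serialize the schema first so "json_schema" holds a string, not an object.
payload = {
    "name": "contact-extraction",
    "json_schema": json.dumps(schema),
}

# Serializing the whole payload escapes the inner quotes automatically.
request_body = json.dumps(payload)
print(request_body)

# Round trip: the field decodes to a string, which in turn decodes to the schema.
decoded = json.loads(request_body)
assert isinstance(decoded["json_schema"], str)
assert json.loads(decoded["json_schema"]) == schema
```

Building the string with `json.dumps` rather than by hand avoids quoting mistakes in the escaped form.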

Summarization as Extraction

Summarization is a special form of extraction where instead of pulling out discrete fields, you ask the AI to generate a condensed representation of the document’s content. You can implement summarization by defining schema fields that expect synthesized content rather than verbatim extraction:
{
  "type": "object",
  "properties": {
    "executive_summary": {
      "type": "string",
      "description": "A 2-3 sentence summary of the document's main points and conclusions"
    },
    "key_findings": {
      "type": "array",
      "items": { "type": "string" },
      "description": "List of the 3-5 most important findings or takeaways"
    },
    "action_items": {
      "type": "array",
      "items": { "type": "string" },
      "description": "Any recommended actions or next steps mentioned in the document"
    }
  }
}
This approach lets you combine summarization with structured extraction in a single schema—extract specific data points alongside generated summaries.
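One way to assemble such a combined schema is sketched below in Python. The helper `build_schema` is hypothetical and simply merges groups of field definitions into a single object schema, rejecting duplicate names; the structured field shown is an assumption chosen for illustration.

```python
# Hypothetical helper for illustration; not part of any Sterndesk SDK.
structured_fields = {
    "vessel_name": {"type": "string", "description": "Name of the vessel"},
}

summary_fields = {
    "executive_summary": {
        "type": "string",
        "description": "A 2-3 sentence summary of the document's main points and conclusions",
    },
}


def build_schema(*field_groups, required=()):
    """Merge groups of property definitions into one object schema."""
    properties = {}
    for group in field_groups:
        overlap = properties.keys() & group.keys()
        if overlap:
            raise ValueError(f"duplicate field names: {sorted(overlap)}")
        properties.update(group)
    schema = {"type": "object", "properties": properties}
    if required:
        schema["required"] = list(required)
    return schema


combined = build_schema(structured_fields, summary_fields, required=["vessel_name"])
print(sorted(combined["properties"]))
```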

When to Use Summarization Fields

  • Long documents — Condense lengthy reports into digestible summaries
  • Research papers — Extract methodology summaries and key conclusions
  • Meeting notes — Generate action item lists from unstructured notes
  • Compliance documents — Summarize findings and recommendations

Schema Examples

Simple Schema

A basic schema for extracting contact information from a document:
{
  "type": "object",
  "properties": {
    "full_name": {
      "type": "string",
      "description": "The full name of the person"
    },
    "email": {
      "type": "string",
      "description": "Email address"
    },
    "phone": {
      "type": "string",
      "description": "Phone number in any format"
    },
    "company": {
      "type": "string",
      "description": "Company or organization name"
    }
  },
  "required": ["full_name"]
}

Complex Schema

A more comprehensive schema for extracting maritime work experience from a resume, with nested objects, arrays, and enumerations:
{
  "type": "object",
  "properties": {
    "personal_info": {
      "type": "object",
      "properties": {
        "first_name": {
          "type": "string",
          "description": "First name of the candidate"
        },
        "last_name": {
          "type": "string",
          "description": "Last name of the candidate"
        },
        "nationality": {
          "type": "string",
          "description": "ISO 3166-1 alpha-2 country code (e.g., 'US', 'NL', 'PH')"
        },
        "email_addresses": {
          "type": "array",
          "items": { "type": "string" },
          "description": "Contact email addresses"
        }
      }
    },
    "sea_experience": {
      "type": "array",
      "description": "Work experience on vessels or offshore",
      "items": {
        "type": "object",
        "properties": {
          "vessel_name": {
            "type": "string",
            "description": "Name of the vessel"
          },
          "vessel_type": {
            "type": "string",
            "enum": ["bulk_carrier", "tanker", "container_ship", "offshore_supply", "dredger", "tug", "other"],
            "description": "Type of vessel"
          },
          "rank": {
            "type": "string",
            "enum": ["master", "chief_officer", "second_officer", "third_officer", "chief_engineer", "second_engineer", "bosun", "ab", "os"],
            "description": "Position held on the vessel"
          },
          "start_date": {
            "type": "object",
            "properties": {
              "year": { "type": "integer", "description": "Year (e.g., 2023)" },
              "month": { "type": "integer", "description": "Month (1-12)" }
            }
          },
          "end_date": {
            "type": "object",
            "properties": {
              "year": { "type": "integer", "description": "Year (e.g., 2024)" },
              "month": { "type": "integer", "description": "Month (1-12)" }
            }
          },
          "to_present": {
            "type": "boolean",
            "description": "True if this is the current position"
          },
          "responsibilities_summary": {
            "type": "string",
            "description": "Brief summary of duties and responsibilities"
          }
        }
      }
    },
    "certifications": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string", "description": "Certificate name" },
          "issuing_authority": { "type": "string", "description": "Organization that issued the certificate" },
          "expiry_date": { "type": "string", "description": "Expiration date if applicable" }
        }
      }
    }
  }
}
This schema demonstrates:
  • Nested objects: personal_info groups related fields
  • Arrays of objects: sea_experience captures multiple work entries
  • Enumerations: vessel_type and rank restrict values to valid options
  • Optional fields: Not all fields are required, allowing partial extraction
  • Descriptive fields: Every leaf field includes a description to guide extraction
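Descriptions are easy to overlook in a deeply nested schema. In the example above, container properties such as personal_info and certifications carry no description of their own, while their leaf fields do. The Python sketch below walks a schema and lists every property without a description; `missing_descriptions` is a hypothetical helper, and whether containers need descriptions is a judgment call rather than a stated Sterndesk requirement.

```python
# Hypothetical lint helper for illustration only.
def missing_descriptions(schema, path=""):
    """Return dotted paths of properties that lack a "description"."""
    missing = []
    for name, spec in schema.get("properties", {}).items():
        field_path = f"{path}.{name}" if path else name
        if "description" not in spec:
            missing.append(field_path)
        if spec.get("type") == "object":
            # Recurse into nested objects.
            missing.extend(missing_descriptions(spec, field_path))
        elif spec.get("type") == "array":
            # Recurse into arrays whose items are objects.
            items = spec.get("items", {})
            if items.get("type") == "object":
                missing.extend(missing_descriptions(items, field_path + "[]"))
    return missing


example = {
    "type": "object",
    "properties": {
        "personal_info": {
            "type": "object",
            "properties": {
                "first_name": {
                    "type": "string",
                    "description": "First name of the candidate",
                },
            },
        },
        "certifications": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"name": {"type": "string"}},
            },
        },
    },
}

report = missing_descriptions(example)
print(report)
```

For this trimmed-down example the report lists personal_info, certifications, and certifications[].name.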

API Reference

For detailed information on creating and managing extraction schemas, see the API Reference.

Next Steps