Generated Metadata
Mixedbread Stores automatically generate metadata for each file ingested. This generated metadata provides structured information about the content of the file, including language, size, headings, number of pages, and more.
The generated metadata can be retrieved using the generated_metadata chunk field.
Supported File Types
The generated_metadata object is a typed structure discriminated by the type field.
If type is not present, it is inferred from file_type (MIME); if file_type is missing or not recognized either, type defaults to text.
Supported type values and typical inputs:
- markdown (e.g., .md, .markdown, .mdx, text/markdown)
- text (e.g., .txt, text/plain)
- pdf (e.g., .pdf, application/pdf)
- code (e.g., Python, TypeScript, Java, C# source files)
- audio (e.g., .mp3, .wav, .m4a, audio files)
Every metadata object also includes a file_type (MIME) reflecting the source content.
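As a rough illustration of treating generated_metadata as a structure discriminated by type, the sketch below dispatches on the type field and reads only the fields documented for that type. The summarize helper and the sample dict are hypothetical; the sketch assumes the metadata has already been retrieved and is available as a plain Python dict.

```python
from typing import Any, Mapping


def summarize(meta: Mapping[str, Any]) -> str:
    """Return a one-line summary of a generated_metadata dict (hypothetical helper)."""
    # Simplification: a missing type is treated as "text" here.
    kind = meta.get("type", "text")
    if kind == "markdown":
        headings = [h["text"] for h in meta.get("chunk_headings", [])]
        return f"markdown, {meta.get('word_count')} words, headings: {headings}"
    if kind == "pdf":
        return f"pdf, {meta.get('total_pages')} pages, {meta.get('total_size')} bytes"
    if kind == "audio":
        return f"audio, {meta.get('total_duration_seconds')} s, {meta.get('channels')} channel(s)"
    if kind == "code":
        return f"code ({meta.get('language')}), {meta.get('file_size')} bytes"
    # "text" and any unrecognized type fall back to the common text fields.
    return f"text ({meta.get('language')}), {meta.get('word_count')} words"


# Hypothetical sample shaped like the documented PDF fields, not real service output.
sample_pdf = {
    "type": "pdf",
    "file_type": "application/pdf",
    "total_pages": 12,
    "total_size": 48213,
}
print(summarize(sample_pdf))  # pdf, 12 pages, 48213 bytes
```

Reading fields with .get() keeps the helper tolerant of metadata objects that omit a field or carry extra ones.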
Markdown - Heading Extraction
When processing markdown files, the system automatically extracts and preserves
heading structure to enhance search relevance and provide context. This feature
works for all markdown formats (.md, .markdown, .mdx).
What You Get
Each markdown chunk’s generated_metadata includes:
- type: Always "markdown".
- file_type: Always "text/markdown".
- language: Detected language of the text.
- word_count: Word count for the chunk.
- file_size: File size in bytes.
- chunk_headings: Headings found within the current chunk ([{ level: number, text: string }]).
- heading_context: The document structure context leading up to this chunk ([{ level: number, text: string }]).
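One way a client might use the two heading fields is to build a breadcrumb that shows where a chunk sits in the document. The following is a minimal sketch: the breadcrumb helper and the sample values are hypothetical, and because it is not specified here whether heading_context and chunk_headings can overlap, the sketch skips repeated entries.

```python
from typing import Any, Dict, List, Mapping


def breadcrumb(meta: Mapping[str, Any], sep: str = " > ") -> str:
    """Join heading_context, then chunk_headings, into a single trail."""
    trail: List[Dict[str, Any]] = []
    for heading in [*meta.get("heading_context", []), *meta.get("chunk_headings", [])]:
        # Assumption: the two lists may overlap, so skip exact repeats.
        if heading not in trail:
            trail.append(heading)
    return sep.join(h["text"] for h in trail)


# Hypothetical sample shaped like the documented fields, not real service output.
sample = {
    "type": "markdown",
    "file_type": "text/markdown",
    "heading_context": [{"level": 1, "text": "Project Guide"}],
    "chunk_headings": [{"level": 2, "text": "Setup"}],
}
print(breadcrumb(sample))  # Project Guide > Setup
```

The level values are kept in each entry, so a client could also indent or filter by heading depth instead of flattening the trail.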
Example Output
Consider this markdown document:
Chunk 2 (Setup section):
Chunk 3 (Configuration section):
Text - Common Fields
Plain text chunks include a simpler generated_metadata shape:
- type: "text"
- file_type: "text/plain"
- language: Detected language of the text
- word_count: Word count for the chunk
- file_size: File size in bytes
Example
Code - Language and Size
For supported source files (e.g., Python, TypeScript, Java, C#), generated_metadata includes:
- type: "code"
- file_type: One of text/x-python, text/x-script.python, application/typescript, text/typescript, text/x-java-source, text/x-csharp, or application/javascript
- language: Detected programming language
- word_count: Tokenized word count approximation for code
- file_size: File size in bytes
Example
PDF - Document Stats
PDF chunks have specialized document-level stats:
- type: "pdf"
- file_type: "application/pdf"
- total_pages: Total number of pages in the document
- total_size: Total size of the original file in bytes
Example
Audio - Media Information
Audio chunks include specialized media metadata:
- type: "audio"
- file_type: String (MIME type, e.g., audio/mpeg, audio/wav)
- file_size: File size in bytes
- total_duration_seconds: Total duration of the audio in seconds
- sample_rate: Audio sample rate in Hz
- channels: Number of audio channels
- audio_format: Audio format code
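Since total_duration_seconds is a raw number of seconds, a client will often want to present it as a clock-style string. A small sketch follows; format_duration and the sample values are hypothetical and only illustrate reading the documented fields.

```python
def format_duration(total_seconds: float) -> str:
    """Format a duration in seconds as H:MM:SS, rounded to the nearest second."""
    seconds = int(round(total_seconds))
    minutes, secs = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:d}:{minutes:02d}:{secs:02d}"


# Hypothetical sample shaped like the documented audio fields, not real service output.
sample_audio = {
    "type": "audio",
    "file_type": "audio/mpeg",
    "total_duration_seconds": 125.4,
    "sample_rate": 44100,
    "channels": 2,
}
label = (
    f"{format_duration(sample_audio['total_duration_seconds'])}, "
    f"{sample_audio['sample_rate']} Hz, {sample_audio['channels']} channel(s)"
)
print(label)  # 0:02:05, 44100 Hz, 2 channel(s)
```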
Example
Type Inference and Flexibility
- If type is not present in generated_metadata, it is inferred from file_type when possible.
- If neither type nor a recognized file_type is present, the type defaults to "text".
- The system may include additional fields as needed for future enhancements; clients should read the documented fields and ignore unknown ones.
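The rules above can be mirrored on the client side when a metadata object arrives without a type. The sketch below is an assumption-laden illustration, not the service's implementation: resolve_type is a hypothetical helper, the MIME table contains only the values listed on this page, and treating any audio/ MIME prefix as audio is an assumption.

```python
from typing import Any, Mapping

# MIME types taken from the lists on this page; the service's real table may differ.
MIME_TO_TYPE = {
    "text/markdown": "markdown",
    "text/plain": "text",
    "application/pdf": "pdf",
    "text/x-python": "code",
    "text/x-script.python": "code",
    "application/typescript": "code",
    "text/typescript": "code",
    "text/x-java-source": "code",
    "text/x-csharp": "code",
    "application/javascript": "code",
}


def resolve_type(meta: Mapping[str, Any]) -> str:
    """Return the explicit type, else infer it from file_type, else "text"."""
    explicit = meta.get("type")
    if explicit:
        return explicit
    mime = (meta.get("file_type") or "").lower()
    if mime in MIME_TO_TYPE:
        return MIME_TO_TYPE[mime]
    # Assumption: any audio/* MIME type maps to the audio metadata type.
    if mime.startswith("audio/"):
        return "audio"
    # Neither a type nor a recognized file_type: default to text.
    return "text"


print(resolve_type({"file_type": "audio/mpeg"}))             # audio
print(resolve_type({"file_type": "application/x-unknown"}))  # text
```

Because the helper only looks up the keys it needs, any additional fields the system adds later are ignored without extra handling.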