Generated Metadata

Mixedbread Stores automatically generate metadata for every ingested file. This metadata describes the file's content in a structured form: language, size, headings, page count, and similar properties, depending on the file type.

Each chunk exposes this metadata in its generated_metadata field.

Supported File Types

The generated_metadata object is a typed structure discriminated by the type field. If type is absent, it is inferred from file_type (the MIME type); if file_type is also absent or not recognized, type defaults to text.

Supported type values and typical inputs:

markdown (e.g., .md, .markdown, .mdx, text/markdown)
text (e.g., .txt, text/plain)
pdf (e.g., .pdf, application/pdf)
code (e.g., Python, TypeScript, Java, C# source files)
audio (e.g., .mp3, .wav, .m4a, audio files)

Every metadata object also includes a file_type (MIME) reflecting the source content.
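Because type acts as a discriminator, client code can branch on it and read only the fields documented for that shape. The following is a minimal Python sketch of that pattern, not part of any Mixedbread SDK: the function name summarize is hypothetical, and treating a missing type as "text" is a simplification of the inference rules described under Type Inference and Flexibility.

```python
from typing import Any, Mapping


def summarize(meta: Mapping[str, Any]) -> str:
    """Return a one-line summary of a generated_metadata object (hypothetical helper)."""
    # Simplification: a missing type is treated as "text" without consulting file_type.
    kind = meta.get("type", "text")

    if kind == "pdf":
        return f"pdf: {meta['total_pages']} pages, {meta['total_size']} bytes"
    if kind == "audio":
        return (
            f"audio: {meta['total_duration_seconds']} s, "
            f"{meta['sample_rate']} Hz, {meta['channels']} channel(s)"
        )
    if kind == "markdown":
        headings = [h["text"] for h in meta.get("chunk_headings", [])]
        return f"markdown ({meta['language']}): {meta['word_count']} words, headings={headings}"
    if kind in ("code", "text"):
        return f"{kind} ({meta['language']}): {meta['word_count']} words"
    # Unrecognized type: fall back to the MIME type, which every object carries.
    return f"{kind}: {meta.get('file_type', 'unknown MIME type')}"


print(summarize({
    "type": "pdf",
    "file_type": "application/pdf",
    "total_pages": 42,
    "total_size": 1048576,
}))
# pdf: 42 pages, 1048576 bytes
```

Handling each type in its own branch keeps field access explicit, so a missing documented field surfaces as a KeyError instead of a silently wrong summary.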

Markdown - Heading Extraction

When processing markdown files, the system automatically extracts and preserves heading structure to enhance search relevance and provide context. This feature works for all markdown formats (.md, .markdown, .mdx).

What You Get

Each markdown chunk’s generated_metadata includes:

type: Always "markdown".
file_type: Always "text/markdown".
language: Detected language of the text.
word_count: Word count for the chunk.
file_size: File size in bytes.
chunk_headings: Headings found within the current chunk ([{ level: number, text: string }]).
heading_context: Headings in effect immediately before the chunk starts, outermost first ([{ level: number, text: string }]). Empty for a chunk at the start of the document.

Example Output

Consider this markdown document:

````markdown
# Getting Started
## Installation
### Prerequisites
You need Python 3.8+ installed.
...

### Setup
Run the following command:

```bash
pip install package
```
...

## Configuration
### Environment Variables
Set these variables in your `.env` file.

### Database Setup
Configure your database connection.
````

When processed, the chunks would have `generated_metadata` like this:

**Chunk 1 (Prerequisites section):**
```json
{
  "type": "markdown",
  "file_type": "text/markdown",
  "language": "en",
  "word_count": 6,
  "file_size": 1234,
  "chunk_headings": [
    {"level": 1, "text": "Getting Started"},
    {"level": 2, "text": "Installation"},
    {"level": 3, "text": "Prerequisites"}
  ],
  "heading_context": []
}
```

**Chunk 2 (Setup section):**
```json
{
  "type": "markdown",
  "file_type": "text/markdown",
  "language": "en",
  "word_count": 12,
  "file_size": 1234,
  "chunk_headings": [
    {"level": 3, "text": "Setup"}
  ],
  "heading_context": [
    {"level": 1, "text": "Getting Started"},
    {"level": 2, "text": "Installation"},
    {"level": 3, "text": "Prerequisites"}
  ]
}
```

**Chunk 3 (Configuration section):**
```json
{
  "type": "markdown",
  "file_type": "text/markdown",
  "language": "en",
  "word_count": 20,
  "file_size": 1234,
  "chunk_headings": [
    {"level": 2, "text": "Configuration"},
    {"level": 3, "text": "Environment Variables"},
    {"level": 3, "text": "Database Setup"}
  ],
  "heading_context": [
    {"level": 1, "text": "Getting Started"},
    {"level": 2, "text": "Installation"},
    {"level": 3, "text": "Setup"}
  ]
}
```
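
The two heading fields can be combined on the client to build a breadcrumb for a search result. The sketch below assumes the behavior seen in the examples above, namely that both lists are in document order and that a heading closes any open heading at the same or a deeper level. The helper name heading_path is hypothetical and not part of any Mixedbread SDK.

```python
from typing import Any, Dict, List


def heading_path(meta: Dict[str, Any]) -> List[str]:
    """Return the outline path of the last heading in a markdown chunk."""
    stack: List[Dict[str, Any]] = []
    # Replay the context first, then the chunk's own headings, in document order.
    for heading in meta.get("heading_context", []) + meta.get("chunk_headings", []):
        # A new heading closes every open heading at the same or a deeper level.
        while stack and stack[-1]["level"] >= heading["level"]:
            stack.pop()
        stack.append(heading)
    return [h["text"] for h in stack]


chunk_3 = {
    "type": "markdown",
    "chunk_headings": [
        {"level": 2, "text": "Configuration"},
        {"level": 3, "text": "Environment Variables"},
        {"level": 3, "text": "Database Setup"},
    ],
    "heading_context": [
        {"level": 1, "text": "Getting Started"},
        {"level": 2, "text": "Installation"},
        {"level": 3, "text": "Setup"},
    ],
}

print(" > ".join(heading_path(chunk_3)))
# Getting Started > Configuration > Database Setup
```

For a chunk with no headings of its own, the same function returns the path described by heading_context alone.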

Text - Common Fields

Plain text chunks include a simpler generated_metadata shape:

type: "text"
file_type: "text/plain"
language: Detected language of the text
word_count: Word count for the chunk
file_size: File size in bytes

Example

```json
{
  "type": "text",
  "file_type": "text/plain",
  "language": "en",
  "word_count": 57,
  "file_size": 2048
}
```

Code - Language and Size

For supported source files (e.g., Python, TypeScript, Java, C#), generated_metadata includes:

type: "code"
file_type: One of text/x-python, text/x-script.python, application/typescript, text/typescript, text/x-java-source, text/x-csharp, or application/javascript
language: Detected programming language
word_count: Approximate word count, based on tokenizing the code
file_size: File size in bytes

Example

```json
{
  "type": "code",
  "file_type": "text/x-python",
  "language": "python",
  "word_count": 120,
  "file_size": 8192
}
```

PDF - Document Stats

PDF chunks have specialized document-level stats:

type: "pdf"
file_type: "application/pdf"
total_pages: Total number of pages in the document
total_size: Total size of the original file in bytes

Example

```json
{
  "type": "pdf",
  "file_type": "application/pdf",
  "total_pages": 42,
  "total_size": 1048576
}
```

Audio - Media Information

Audio chunks include specialized media metadata:

type: "audio"
file_type: String (MIME type, e.g., audio/mpeg, audio/wav)
file_size: File size in bytes
total_duration_seconds: Total duration of the audio in seconds
sample_rate: Audio sample rate in Hz
channels: Number of audio channels
audio_format: Audio format code

Example

```json
{
  "type": "audio",
  "file_type": "audio/mpeg",
  "file_size": 5242880,
  "total_duration_seconds": 180.5,
  "sample_rate": 44100,
  "channels": 2,
  "audio_format": 1
}
```
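
These fields are enough to derive a few display values on the client. The sketch below formats the duration and estimates an average bitrate from file_size and total_duration_seconds. Both helper names are hypothetical. The bitrate is only a rough estimate, since file_size also counts container overhead and embedded tags.

```python
from typing import Any, Mapping


def format_duration(seconds: float) -> str:
    """Format a duration in seconds as M:SS.s."""
    # Round to tenths first so values like 59.96 become 1:00.0, not 0:60.0.
    total_tenths = round(seconds * 10)
    minutes, tenths = divmod(total_tenths, 600)
    return f"{minutes}:{tenths / 10:04.1f}"


def average_bitrate_kbps(meta: Mapping[str, Any]) -> float:
    """Estimate the average bitrate in kbit/s from file size and duration."""
    bits = meta["file_size"] * 8
    return bits / meta["total_duration_seconds"] / 1000


audio_meta = {
    "type": "audio",
    "file_type": "audio/mpeg",
    "file_size": 5242880,
    "total_duration_seconds": 180.5,
    "sample_rate": 44100,
    "channels": 2,
    "audio_format": 1,
}

print(format_duration(audio_meta["total_duration_seconds"]))  # 3:00.5
print(round(average_bitrate_kbps(audio_meta)))                # 232
```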

Type Inference and Flexibility

If type is not present in generated_metadata, it is inferred from file_type when possible.
If neither type nor a recognized file_type is present, the type defaults to "text".
The system may include additional fields as needed for future enhancements; clients should read the documented fields and ignore unknown ones.
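
A client can mirror these rules defensively. The sketch below is illustrative only: the MIME-to-type table is assembled from the values documented on this page, and the actual server-side mapping may recognize more. The names resolve_type and known_fields are hypothetical helpers, not SDK functions.

```python
from typing import Any, Dict, Mapping

# Assumption: built only from the MIME types listed on this page.
_EXACT_MIME = {
    "text/markdown": "markdown",
    "text/plain": "text",
    "application/pdf": "pdf",
}
_CODE_MIME = {
    "text/x-python", "text/x-script.python", "application/typescript",
    "text/typescript", "text/x-java-source", "text/x-csharp",
    "application/javascript",
}
_COMMON = {"type", "file_type", "language", "word_count", "file_size"}
_DOCUMENTED_FIELDS = {
    "markdown": _COMMON | {"chunk_headings", "heading_context"},
    "text": _COMMON,
    "code": _COMMON,
    "pdf": {"type", "file_type", "total_pages", "total_size"},
    "audio": {"type", "file_type", "file_size", "total_duration_seconds",
              "sample_rate", "channels", "audio_format"},
}


def resolve_type(meta: Mapping[str, Any]) -> str:
    """Use an explicit type if present, else infer from file_type, else 'text'."""
    explicit = meta.get("type")
    if explicit:
        return explicit
    mime = (meta.get("file_type") or "").lower()
    if mime in _EXACT_MIME:
        return _EXACT_MIME[mime]
    if mime in _CODE_MIME:
        return "code"
    if mime.startswith("audio/"):
        return "audio"
    return "text"


def known_fields(meta: Mapping[str, Any]) -> Dict[str, Any]:
    """Keep only the documented fields for the resolved type; drop unknown ones."""
    kind = resolve_type(meta)
    allowed = _DOCUMENTED_FIELDS.get(kind, {"type", "file_type"})
    cleaned = {key: value for key, value in meta.items() if key in allowed}
    cleaned["type"] = kind
    return cleaned


print(resolve_type({"file_type": "audio/wav"}))  # audio
print(resolve_type({}))                          # text
```

Filtering through an allow-list means a field added in a future release is ignored instead of breaking strict parsing code.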
