cf-file-parser 0.1.21

File Parser module
Documentation
# DESIGN

## 1. Architecture Overview

File Parser is implemented as a modkit module that exposes REST endpoints for document parsing. It integrates with external parsing libraries and provides a unified interface for multiple document formats.

**System Context**: File Parser operates as a standalone service module within the HyperSpot platform, accepting HTTP requests and returning parsed content.

## 2. Design Principles

### Stateless Operation

**ID**: [ ] `p2` `fdd-file-parser-principle-stateless-v1`

<!-- fdd-id-content -->
Parser does not maintain session state. Each request is independent. Temporary files cleaned up after processing.
<!-- fdd-id-content -->

### Format Agnostic

**ID**: [ ] `p2` `fdd-file-parser-principle-format-agnostic-v1`

<!-- fdd-id-content -->
Unified API regardless of input format. Format detection automatic where possible. Consistent error handling across formats.
<!-- fdd-id-content -->

## 3. Constraints

### File Size Limits

**ID**: [ ] `p2` `fdd-file-parser-constraint-file-size-v1`

<!-- fdd-id-content -->
Maximum 50MB per document. Enforced at API layer via body size limits.
<!-- fdd-id-content -->

### Local Path Security

**ID**: [ ] `p1` `fdd-file-parser-constraint-local-path-security-v1`

<!-- fdd-id-content -->
Local file parsing (`parse-local`) validates paths before any filesystem access:
(a) paths containing `..` components are rejected outright;
(b) the requested path is canonicalized (resolving symlinks);
(c) the canonical path must be a descendant of the mandatory `allowed_local_base_dir`.
The module fails to start if `allowed_local_base_dir` is missing or unresolvable.
Violations return HTTP 403 Forbidden. Rejected attempts are logged at `warn` level.
<!-- fdd-id-content -->

### Supported Formats

**ID**: [ ] `p2` `fdd-file-parser-constraint-formats-v1`

<!-- fdd-id-content -->
PDF, DOCX, XLSX, PPTX, PNG, JPG, TIFF supported. Other formats rejected with clear error message.
<!-- fdd-id-content -->

## 4. Components

### API Layer

**ID**: [ ] `p1` `fdd-file-parser-component-rest-v1`

<!-- fdd-id-content -->
REST endpoints: `/file-parser/v1/info`, `/file-parser/v1/upload`, `/file-parser/v1/upload/markdown`, `/file-parser/v1/parse-local*`
<!-- fdd-id-content -->

### Parser Service

**ID**: [ ] `p1` `fdd-file-parser-component-parser-v1`

<!-- fdd-id-content -->
Coordinates parsing operations, handles format detection, manages temporary file lifecycle.
<!-- fdd-id-content -->

### Format Handlers

**ID**: [ ] `p1` `fdd-file-parser-component-handlers-v1`

<!-- fdd-id-content -->
PDF handler, Office handler (DOCX, XLSX, PPTX), Image handler (PNG, JPG, TIFF).
<!-- fdd-id-content -->

### Markdown Renderer

**ID**: [ ] `p1` `fdd-file-parser-component-markdown-v1`

<!-- fdd-id-content -->
Converts parsed content to Markdown, preserves document structure, handles tables and formatting.
<!-- fdd-id-content -->

## 5. Sequences

### Document Upload and Parse

**ID**: [ ] `p1` `fdd-file-parser-seq-upload-parse-v1`

<!-- fdd-id-content -->
1. Client uploads document via REST API
2. API layer validates file size and format
3. Parser service detects document type
4. Appropriate format handler processes document
5. Content extracted (text, tables, images)
6. Optional: Markdown renderer converts to Markdown
7. Response returned to client
8. Temporary files cleaned up

**Components**: `fdd-file-parser-component-rest-v1`, `fdd-file-parser-component-parser-v1`, `fdd-file-parser-component-handlers-v1`, `fdd-file-parser-component-markdown-v1`
<!-- fdd-id-content -->

## 6. Error Handling

- Unsupported format → 400 Bad Request
- File too large → 413 Payload Too Large
- Parsing failure → 500 Internal Server Error with details

## 7. Dependencies

- External parsing libraries for format support
- Temporary file system for processing

## Appendix

### Change Log

| Date | Version | Author | Changes |
|------|---------|--------|---------|
| 2026-02-09 | 0.1.0 | System | Initial DESIGN for cypilot validation |
| 2026-02-17 | 0.2.0 | Security | Removed `/file-parser/v1/parse-url*` endpoints, HTTP client dependency, URL error handling, and `download_timeout_secs` config. Rationale: SSRF risk (issue #525) — parsing documents from caller-supplied URLs exposed an uncontrolled outbound HTTP path. Decision: remove the capability entirely rather than attempt mitigation. |
| 2026-02-17 | 0.3.0 | Security | Added path-traversal protections for `parse-local` endpoints: `..` component rejection, path canonicalization, optional `allowed_local_base_dir` enforcement, symlink-escape prevention, and `PathTraversalBlocked` error (HTTP 403). Added constraint `fdd-file-parser-constraint-local-path-security-v1`. Rationale: prevent arbitrary file read via path traversal (issue #525). |