cf-file-parser 0.1.20

File Parser module
Documentation
# PRD

## 1. Overview

**Purpose**: File Parser provides document parsing and conversion capabilities. It extracts structured content from various file formats (PDF, DOCX, XLSX, PPTX, images) and renders them as Markdown or JSON.

**Target Users**:
- **Platform Developers** — integrate document parsing into AI workflows
- **External API Consumers** — parse documents via REST API

**Key Problems Solved**:
- **Format fragmentation**: unified API for parsing multiple document formats
- **Content extraction**: reliable text, table, and image extraction from documents
- **Markdown conversion**: consistent rendering of documents as Markdown

**Success Criteria**:
- Parse common formats (PDF, DOCX, XLSX, PPTX) with >95% accuracy
- Response time < 5s for documents < 10MB

**Capabilities**:
- Upload and parse documents
- Extract text, tables, and images
- Render documents as Markdown
- Support for both binary upload and multipart form data

## 2. Actors

### 2.1 Human Actors

#### API User

**ID**: `fdd-file-parser-actor-api-user`

<!-- fdd-id-content -->
**Role**: End user who interacts with File Parser API to upload and parse documents.
<!-- fdd-id-content -->

### 2.2 System Actors

#### Consumer Module

**ID**: `fdd-file-parser-actor-consumer`

<!-- fdd-id-content -->
**Role**: Internal module that uses File Parser to process documents as part of larger workflows.
<!-- fdd-id-content -->

## 3. Use Cases

### Parse Uploaded Document

**ID**: [ ] `p1` `fdd-file-parser-usecase-upload-parse-v1`

<!-- fdd-id-content -->
User uploads a document (PDF, DOCX, XLSX, PPTX, or image) and receives parsed content with text, tables, and optional Markdown rendering.

**Actors**: `fdd-file-parser-actor-api-user`
<!-- fdd-id-content -->

## 4. Functional Requirements

### Document Upload

**ID**: [ ] `p1` `fdd-file-parser-fr-upload-v1`

<!-- fdd-id-content -->
System SHALL support binary file upload and multipart form upload with optional Markdown rendering. System SHALL handle documents up to 50MB.

**Actors**: `fdd-file-parser-actor-api-user`
<!-- fdd-id-content -->

### Format Support

**ID**: [ ] `p1` `fdd-file-parser-fr-formats-v1`

<!-- fdd-id-content -->
System SHALL support parsing PDF, DOCX, XLSX, PPTX, PNG, JPG, and TIFF formats.

**Actors**: `fdd-file-parser-actor-api-user`
<!-- fdd-id-content -->

### Content Extraction

**ID**: [ ] `p1` `fdd-file-parser-fr-extraction-v1`

<!-- fdd-id-content -->
System SHALL extract text content, preserve document structure, extract tables, and extract embedded images.

**Actors**: `fdd-file-parser-actor-api-user`
<!-- fdd-id-content -->

### Markdown Rendering  

**ID**: [ ] `p1` `fdd-file-parser-fr-markdown-v1`

<!-- fdd-id-content -->
System SHALL convert documents to Markdown format, preserving headings, lists, formatting, tables, and code blocks.

**Actors**: `fdd-file-parser-actor-api-user`
<!-- fdd-id-content -->

### Local Path Security

**ID**: [ ] `p1` `fdd-file-parser-fr-local-path-security-v1`

<!-- fdd-id-content -->
System SHALL reject local file paths containing `..` traversal components. System SHALL require a mandatory `allowed_local_base_dir` configuration; the module SHALL fail to start if this field is missing or the path cannot be resolved. System SHALL canonicalize the requested path (resolving symlinks) and reject paths that do not fall under the base directory. Rejected requests SHALL return HTTP 403 and be logged at `warn` level.

**Actors**: `fdd-file-parser-actor-api-user`, `fdd-file-parser-actor-consumer`
<!-- fdd-id-content -->

## 5. Non-Functional Requirements

### Performance

**ID**: [ ] `p1` `fdd-file-parser-nfr-response-time-v1`

<!-- fdd-id-content -->
System SHALL respond in < 5s for documents < 10MB and < 30s for documents < 50MB.
<!-- fdd-id-content -->

### Scalability

**ID**: [ ] `p1` `fdd-file-parser-nfr-concurrency-v1`

<!-- fdd-id-content -->
System SHALL support 100 concurrent parsing requests.
<!-- fdd-id-content -->

### Reliability

**ID**: [ ] `p1` `fdd-file-parser-nfr-availability-v1`  

<!-- fdd-id-content -->
System SHALL maintain 99.9% uptime SLA.
<!-- fdd-id-content -->

## 6. Out of Scope

- OCR for scanned documents (future enhancement)
- Document editing or modification
- Format conversion beyond Markdown
- Long-term document storage

## Appendix

### Change Log

| Date | Version | Author | Changes |
|------|---------|--------|---------|
| 2026-02-09 | 0.1.0 | System | Initial PRD for cypilot validation |
| 2026-02-17 | 0.2.0 | Security | Removed URL parsing capability (use case `fdd-file-parser-usecase-url-parse-v1`, FR `fdd-file-parser-fr-url-v1`). Rationale: SSRF vulnerability (issue #525) — URL parsing allowed server-side requests to arbitrary endpoints, posing an unacceptable security risk. Decision: remove rather than harden. |
| 2026-02-17 | 0.3.0 | Security | Added FR `fdd-file-parser-fr-local-path-security-v1` — path-traversal protections for `parse-local` (reject `..`, canonicalize, enforce `allowed_local_base_dir`, block symlink escapes). Rationale: prevent arbitrary file read via path traversal (issue #525). |