screenplay-doc-parser-rs 0.1.8

Tools to parse Screenplay-formatted documents into semantically-typed structs.

Screenplay Doc Parser

Parses a PDF document file into a structured, semantically typed ScreenplayDocument object.

This parser currently supports parsing from PDF, but may include support for other formats such as FDX or Fountain in the future.

How

The PDF parser uses the x,y positions of the TextElements on a page to deduce their type. The result is usually correct, BUT some edge cases may require manual intervention after parsing. Screenwriters love to play with formatting and indentation...

In general, screenplay elements like Action, Character, Dialogue, Parentheticals, even the Page Number, Scene Numbers and revision markers, all have a set indentation point, and/or specific justification.

Also, screenplays generally have consistent margins, or at least margins consistent within the same document (hopefully...)

If we know the indentations and margins of a document, we can deduce that any line of text which begins 1.5 inches from the left edge of the page, and which sits below the top margin and above the bottom margin, is probably an Action line.

Lines that adhere to the above, but also start with something like INT. or EXT. are very likely SceneHeadings.

Character names, dialogue and parentheticals each have their own indentation. So this scheme should yield correct parsing for the majority of a properly formatted screenplay PDF.
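
The reasoning above can be sketched in a few lines of Rust. This is only an illustration of the idea, not the crate's implementation: `ElementKind`, `Layout`, `example_layout` and `classify` are hypothetical names, and every number except the 1.5 inch Action indentation is an assumed value for a US-Letter page.

```rust
/// Hypothetical element kinds for this sketch; the crate's real types may differ.
#[derive(Debug, PartialEq)]
enum ElementKind {
    Action,
    SceneHeading,
    Character,
    Parenthetical,
    Dialogue,
    Unknown,
}

/// Hypothetical layout, in inches. x is measured from the left page edge,
/// y from the bottom page edge.
struct Layout {
    action_left: f64,
    dialogue_left: f64,
    parenthetical_left: f64,
    character_left: f64,
    top_margin_y: f64,
    bottom_margin_y: f64,
    tolerance: f64,
}

/// Illustrative values only: 1.5 in for Action comes from the text above,
/// the rest are assumptions for a US-Letter page (11 in tall).
fn example_layout() -> Layout {
    Layout {
        action_left: 1.5,
        dialogue_left: 2.5,
        parenthetical_left: 3.0,
        character_left: 3.5,
        top_margin_y: 10.0,
        bottom_margin_y: 1.0,
        tolerance: 0.05,
    }
}

/// Guesses the element kind of one line from its position and leading text.
fn classify(layout: &Layout, x: f64, y: f64, text: &str, environments: &[&str]) -> ElementKind {
    // Anything outside the vertical margins (page numbers, headers) is not body text.
    if y > layout.top_margin_y || y < layout.bottom_margin_y {
        return ElementKind::Unknown;
    }
    // Positions extracted from a PDF are rarely exact, so compare with a tolerance.
    let near = |target: f64| (x - target).abs() <= layout.tolerance;
    if near(layout.action_left) {
        // Same indentation as Action, but starting with INT./EXT. means a heading.
        let trimmed = text.trim_start();
        if environments.iter().any(|env| trimmed.starts_with(*env)) {
            ElementKind::SceneHeading
        } else {
            ElementKind::Action
        }
    } else if near(layout.dialogue_left) {
        ElementKind::Dialogue
    } else if near(layout.parenthetical_left) {
        ElementKind::Parenthetical
    } else if near(layout.character_left) {
        ElementKind::Character
    } else {
        ElementKind::Unknown
    }
}
```

A line that matches none of the known indentations is returned as Unknown, which is where the manual intervention mentioned earlier would come in.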

The user of this crate can also pass in their own indentation values and strings to match against for Scene Environments or Time of Day (INT./EXT., DAY, NIGHT...), so we can even support screenplays that are A4, or have deviated somewhat from "standard" US-Letter formatting.

The default margins and indentations for this crate are taken from the default settings found in Final Draft 11, for a simple US-Letter screenplay.

What

This categorizes the following Screenplay Element Types:

  • Action
  • Character Cue
    • Also Character Extensions (i.e. the (V.O.) in CHARACTER (V.O.))
  • Dialogue
  • Parenthetical
  • Scene Headings (including Heading Elements)
    • Scene Environment (INT. or EXT.)
    • Scene Location
      • Scene Sublocation (any element that follows a Location which ISN'T another valid element...)
    • Time of Day (DAY, NIGHT, MORNING, EVENING...)
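
As a rough illustration of how a Scene Heading breaks down into the Heading Elements listed above, here is a minimal sketch. It assumes the parts are separated by " - " and that the Environment comes first; `SceneHeadingParts` and `split_scene_heading` are hypothetical names, not part of this crate's API.

```rust
/// Hypothetical container for the pieces of a scene heading.
#[derive(Debug, PartialEq)]
struct SceneHeadingParts {
    environment: Option<String>,
    location: Option<String>,
    sublocations: Vec<String>,
    time_of_day: Option<String>,
}

/// Splits a heading such as "INT. HOUSE - KITCHEN - NIGHT" into its parts.
/// Assumption: parts are separated by " - ". List longer environment
/// prefixes first (e.g. "INT./EXT." before "INT.") so they win the match.
fn split_scene_heading(
    line: &str,
    environments: &[&str],
    times_of_day: &[&str],
) -> SceneHeadingParts {
    let mut rest = line.trim();

    // 1. Environment: the first known prefix that the line starts with.
    let mut environment = None;
    for env in environments {
        if let Some(stripped) = rest.strip_prefix(*env) {
            environment = Some((*env).to_string());
            rest = stripped.trim_start();
            break;
        }
    }

    // 2. Split what is left on the assumed separator.
    let mut parts: Vec<&str> = rest
        .split(" - ")
        .map(str::trim)
        .filter(|p| !p.is_empty())
        .collect();

    // 3. Time of Day: only if the final part is a known string.
    let mut time_of_day = None;
    if let Some(&last) = parts.last() {
        if times_of_day.contains(&last) {
            time_of_day = Some(last.to_string());
            parts.pop();
        }
    }

    // 4. The first remaining part is the Location; anything after it that
    //    is not another valid element is treated as a Sublocation.
    let mut remaining = parts.into_iter();
    let location = remaining.next().map(String::from);
    let sublocations = remaining.map(String::from).collect();

    SceneHeadingParts { environment, location, sublocations, time_of_day }
}
```

With the environments "INT./EXT.", "INT.", "EXT." and the times "DAY", "NIGHT", the heading "INT. HOUSE - KITCHEN - NIGHT" would split into Environment "INT.", Location "HOUSE", Sublocation "KITCHEN" and Time of Day "NIGHT".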

This parser also captures the following screenplay elements as metadata:

  • Scene Number (alphanumeric)
  • Page Number (alphanumeric)

Types that rely on matching arbitrary strings

Some types, such as TimeOfDay, Revision Markers, and Environment, rely on arbitrary string values. You can pass in your own collection of these strings to parse a screenplay written in a different language, or to support additional or more specific elements.

For example, you can add "DUSK" or "HIGH NOON" as TimeOfDay strings, so that they are correctly identified as TimeOfDay elements.
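
The effect of extending the string collection can be sketched as follows. `default_times_of_day` and `is_time_of_day` are hypothetical helpers written for this example, the default list shown is not the crate's actual one, and the case-insensitive comparison is an assumption of the sketch.

```rust
/// Hypothetical default list for this sketch; not the crate's actual defaults.
fn default_times_of_day() -> Vec<String> {
    ["DAY", "NIGHT", "MORNING", "EVENING"]
        .iter()
        .map(|s| s.to_string())
        .collect()
}

/// Returns true when `candidate` matches one of the known strings,
/// ignoring surrounding whitespace and ASCII case.
fn is_time_of_day(candidate: &str, known: &[String]) -> bool {
    let candidate = candidate.trim();
    known.iter().any(|k| k.eq_ignore_ascii_case(candidate))
}
```

Starting from the default list, "HIGH NOON" would not be recognized; after pushing "DUSK" and "HIGH NOON" onto the list, both would be.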

Indentations

Additionally, the ElementIndentations struct can be passed in to the PDF parser to provide custom indentations. This supports parsing a screenplay formatted in A4, or a screenplay formatted with "centered" (as in placement, not justification) scripts, like those from Fade In or other programs.
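
One way to think about custom indentations for a "centered" script is as the usual indentations moved sideways by a fixed amount. The sketch below works under that assumption only. `IndentSketch` is a hypothetical stand-in and does not reflect the fields of the crate's ElementIndentations struct; the values are illustrative, in inches.

```rust
/// Hypothetical stand-in for a set of left indentations, in inches.
/// This is not the crate's `ElementIndentations` definition.
#[derive(Debug, Clone, Copy, PartialEq)]
struct IndentSketch {
    action: f64,
    character: f64,
    parenthetical: f64,
    dialogue: f64,
}

impl IndentSketch {
    /// Moves every indentation by the same horizontal offset.
    fn shifted_by(self, offset: f64) -> Self {
        IndentSketch {
            action: self.action + offset,
            character: self.character + offset,
            parenthetical: self.parenthetical + offset,
            dialogue: self.dialogue + offset,
        }
    }
}

/// Offset for a text block that stays centered when moved from a page of
/// `reference_width` to a page of `page_width` (both in inches).
fn centering_offset(reference_width: f64, page_width: f64) -> f64 {
    (page_width - reference_width) / 2.0
}
```

For example, going from US-Letter (8.5 in wide) to A4 (210 mm, about 8.27 in wide) gives a small negative offset, so every indentation moves slightly to the left while the distances between element types stay the same.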

TODO

These are not yet parsed or handled properly:

  • Title Page elements
  • Dual Dialogue blocks
  • Transitions ("CUT TO:", "FADE IN:", etc.; any right-aligned element of that kind.)
  • A4 detection (no default ElementIndentation values for A4 yet)

DEPENDENCIES

This parser has an optional feature, which uses the mupdf-basic-text-extractor crate to allow PDF file reading. You may choose to exclude this feature and roll your own PDF file-parsing, and then handle the conversion to the generic `pdf_document::PDFDocument` object, which gets passed into the PDF parser.

LICENSE

This code is licensed under AGPL-3.0.