tlparse 0.4.4

Parse TORCH_LOGS logs produced by PyTorch torch.compile
discrete features:
- parse the compilation start [0/0] annotations and create a visualization of
  the stacks and how they relate to each other
    - each compilation corresponds to a stack trace
    - a compilation is an ancestor of another compilation if its stack trace is
      a prefix of the other's stack trace
    - this forms a tree, so we can use dot to visualize it (see the sketch
      after this list)
      - sort of like a flame graph, but there is no natural length to plot
        (maybe compilation time could be used!)
- correlating generated fx graph / dynamo trace / sizes
  - just get all the graphs, display them in an indexed fashion
  - use case: did the sizes change?
    https://fb.workplace.com/groups/1075192433118967/posts/1377556259549248/
- why are our logs so big / who is generating all the logs
- rank comparison / diff (for rank desynchronization problems due to
  nondeterminism)
- rank tracing (compare how compilation is proceeding on different ranks: is
  it imbalanced? debug compile time performance problems)
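A minimal sketch of the tree idea referenced above, assuming an earlier parsing stage has already reduced each compilation to an id plus a stack of frame strings (the `Compile` struct and frame names here are assumptions, not tlparse's actual data model): a compilation's parent is the one whose stack is the longest proper prefix of its own, and the result is emitted as Graphviz DOT.

```rust
// Sketch only, not the tlparse implementation. `Compile` and the frame
// strings are assumed inputs from an earlier parsing stage.
use std::fmt::Write;

struct Compile {
    id: String,          // e.g. "0/0"
    stack: Vec<String>,  // frame identifiers, outermost first
}

// True if `a` is a proper prefix of `b`.
fn is_prefix(a: &[String], b: &[String]) -> bool {
    a.len() < b.len() && a.iter().zip(b).all(|(x, y)| x == y)
}

fn to_dot(compiles: &[Compile]) -> String {
    let mut out = String::from("digraph compiles {\n");
    for (i, child) in compiles.iter().enumerate() {
        // Parent = compilation whose stack is the longest proper prefix of ours.
        let parent = compiles
            .iter()
            .enumerate()
            .filter(|(j, p)| *j != i && is_prefix(&p.stack, &child.stack))
            .max_by_key(|(_, p)| p.stack.len());
        match parent {
            Some((_, p)) => writeln!(out, "  \"{}\" -> \"{}\";", p.id, child.id).unwrap(),
            None => writeln!(out, "  \"{}\";", child.id).unwrap(),
        }
    }
    out.push_str("}\n");
    out
}

fn main() {
    let compiles = vec![
        Compile { id: "0/0".into(), stack: vec!["train.py:main".into()] },
        Compile { id: "1/0".into(), stack: vec!["train.py:main".into(), "model.py:forward".into()] },
    ];
    print!("{}", to_dot(&compiles));
}
```

Piping the output through `dot -Tsvg` would then give the flame-graph-like picture described above.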


infrastructure:
- parse out a single compilation [0/0]
  - start/end time for compilation
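As a starting point for this, a sketch of extracting start/end times per [frame_id/frame_compile_id] using the regex crate; both regexes encode assumptions about the log line format (timestamp shape, bracketed compile id), not a spec.

```rust
// Sketch: record the first and last timestamp seen for each compile id.
// The regexes are assumed formats and would need to match real TORCH_LOGS output.
use regex::Regex;
use std::collections::HashMap;

/// Returns compile id -> (first timestamp seen, last timestamp seen).
fn compile_spans(log: &str) -> HashMap<String, (String, String)> {
    let ts = Regex::new(r"(\d{2}:\d{2}:\d{2}[.,]\d+)").unwrap(); // assumed timestamp format
    let marker = Regex::new(r"\[(\d+/\d+)\]").unwrap();          // e.g. [0/0]
    let mut spans: HashMap<String, (String, String)> = HashMap::new();
    for line in log.lines() {
        let (Some(t), Some(m)) = (ts.captures(line), marker.captures(line)) else {
            continue;
        };
        let t = t[1].to_string();
        let entry = spans
            .entry(m[1].to_string())
            .or_insert((t.clone(), t.clone()));
        entry.1 = t; // keep extending the end time as later lines show up
    }
    spans
}
```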

some problems:
- downloading all logs from logarithm takes surprisingly long (using lg
  command)
- there may be a lot of logs for a given MAST job, as there may be multiple
  flows pasted together, and it's sometimes not obvious which one to look at


sample logs:
- https://fb.workplace.com/groups/1075192433118967/posts/1377556259549248 why
  did cuda graphs cause a regression
  - ~/local/eval-log.txt - the eval log
  - ~/local/rank0-cudagraph-train.txt - the train log, rank0 only
    (cudagraph-log.txt)
      - interestingly, sometimes the ranks are interleaved in a naughty way

- ~/log2.txt, filtered to torch-only lines as ~/f2.txt
  - from lg tw:tsp_zch/mast_hpc/f524854032-TrainingApplication.trainers.mqdbca/0 --start-time=1706158048 --end-time=1706165989 --stream=stderr
  - this is the Jon Chuang config PR that caused the Shampoo dynamic compile
    disaster

- ~/log.txt (flavio-log.txt)
  - this is Flavio Truzzi's recent APS log
  - xref https://fb.workplace.com/groups/6829516587176185/posts/6829560007171843/
  - nb: this doesn't have all debug info



what do I want to change about the logs
- split into a separate log file per rank to prevent splicing (see the demux
  sketch after this list)
  - dedicated_log file is OK
  - need some sort of hook for this
  - need some way to test this
- stack frames stored on a single line and parseable
  - this is actively harmful unless muxing is prevented first (a larger write
    is less likely to be atomic)
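If the logs stay muxed, a tlparse-side fallback is to demux them. A sketch, assuming lines carry a "[rankN]:" prefix; that prefix is an assumption about how the job tags ranks, and lines without it fall into a catch-all file.

```rust
// Sketch: route each line of a muxed log to a per-rank file. The "[rankN]:"
// prefix is an assumed convention; untagged lines land in unknown.log.
use std::collections::HashMap;
use std::fs::File;
use std::io::{self, BufRead, BufReader, Write};

fn demux(path: &str) -> io::Result<()> {
    let mut outs: HashMap<String, File> = HashMap::new();
    for line in BufReader::new(File::open(path)?).lines() {
        let line = line?;
        let rank = line
            .strip_prefix('[')
            .and_then(|rest| rest.split_once("]:"))
            .filter(|(tag, _)| tag.starts_with("rank"))
            .map(|(tag, _)| tag.to_string())
            .unwrap_or_else(|| "unknown".to_string());
        if !outs.contains_key(&rank) {
            outs.insert(rank.clone(), File::create(format!("{rank}.log"))?);
        }
        writeln!(outs.get_mut(&rank).unwrap(), "{line}")?;
    }
    Ok(())
}
```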


uploading functionality
- motivation
  - if you run tlparse on a server and it generates html, you want to be able
    to conveniently view it / share it with someone else without having to
    download it
  - otherwise, you can only produce a plain text report and share it via
    pastebin
- alternate models
  - perfetto/chrome trace viewer: generate a trace json and upload the file to
    a separate viewer (see the trace-event sketch after this list)
    - but note that internally we built a built-in viewer that you can link to
      with data directly. Convenient!
  - generate an html file and pop open a browser to view it
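For the trace-json route, a sketch of the export shape: Chrome trace viewer / Perfetto "complete" events (ph = "X", timestamps and durations in microseconds), one pid per rank. The `Span` fields are assumed inputs, and the string building skips JSON escaping for brevity.

```rust
// Sketch: serialize compilation spans as Chrome trace "complete" events.
// Assumes names don't need JSON escaping (fine for ids like "compile [0/0]").
struct Span {
    name: String, // e.g. "compile [0/0]"
    rank: u32,
    start_us: u64,
    dur_us: u64,
}

fn to_chrome_trace(spans: &[Span]) -> String {
    let events: Vec<String> = spans
        .iter()
        .map(|s| {
            format!(
                r#"{{"name":"{}","ph":"X","ts":{},"dur":{},"pid":{},"tid":0}}"#,
                s.name, s.start_us, s.dur_us, s.rank
            )
        })
        .collect();
    format!("[{}]", events.join(","))
}
```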



what does the one-size-fits-all command do (drive structured logging)
- extract all IR representations into separate files (preferably machine
  readable, but that's other people's problem)
  - rendering these in a human-readable way is potentially a *downstream*
    tool's problem
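A sketch of what "extract everything into separate files" could look like on disk: one directory per compile id, one file per artifact kind. The path scheme and artifact names are illustrative, not a committed layout.

```rust
// Sketch: write each extracted IR payload to out/<compile_id>/<kind>.txt so
// downstream tools can render it however they like. Names are illustrative.
use std::fs;
use std::path::Path;

fn dump_artifact(out_dir: &Path, compile_id: &str, kind: &str, payload: &str) -> std::io::Result<()> {
    // e.g. out/0_0/dynamo_output_graph.txt, out/0_0/output_code.txt
    let dir = out_dir.join(compile_id.replace('/', "_"));
    fs::create_dir_all(&dir)?;
    fs::write(dir.join(format!("{kind}.txt")), payload)
}
```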

use cases for the log parser
- there is some problem, and you are trying to diagnose it from the logs
  - but the logs are too big
    - because all the ranks are muxed together
    - because the dynamo debug logs are too spammy
      - because I can't actually tell what I'm running over from the Dynamo
        logs
        - because I don't actually know what the model is doing (pdb style
          view?)
      - because there are so many values on the stack
      - because this is a cursed model with lots of tiny tensors and lots of
        bytecodes and therefore traversal is terrible!!!
    - because the graph outputs are too big
      - because no one asked for tabular output
      - because the graph sizes are too far away from where you need them
      - the graph is so long that you can't easily jump between def/use
    - because the guard output is too big
      - because you can't easily find the recompiles logs
        - because the recompiles log doesn't say what exactly changed the next
          time
    - because the tracebacks are too big
    - finding the graph break information is like finding a needle in a haystack
    - because the restart analysis logs are annoying
    - because the inductor logs are too long
      - because I can't easily correlate inductor output with the aten being
        processed (godbolt style, but godbolt itself isn't useful because it's
        too difficult to present the full information)
    - because I don't know how to jump to the end of a section
      - dynamo -> aot -> inductor -> guard
  - but you can't get runnable artifacts from the logs
  - you want to display some information; if you print everything in full
    detail it's too much, so you want a fold/expand html UI (the dumped
    representation then carries the full information)
  - what's same/different between ranks
  - what's same/different between recompiles
  - two users: PyTorch developers, mass market general developers
- you are working on a new model and you want to know how far along you are
- trace recording and visualization (but maybe just defer to zoomer)
- logarithm actually sort of sucks?
- it's too hard to figure out from the logs how to modify source code to make
  some s0 dynamic
  - because I can't tell what the source of a size guard is
  - because automatic dynamic is printed by default

- that's weird, why is the same frame having very different behavior each
  time?
  - are we allocating separate numbers for the separate object instances?


value added
- download (all) the logs in the first place
- put the result somewhere shareable
- automatically run tlparse when someone posts a log for help


meta plugin architecture
- want the plugin to automatically run
- choice: fbpkg distribution vs oss plus internal plugin
  - choice: pyo3/maturin python plugin vs shelling out to executables (see
    the sketch after this list)
- plugin goals:
  - telemetry
  - log downloading
    - lg or tw command line tool
  - uploading
    - manifold cli into https://www.internalfb.com/intern/wiki/Development_Environment/Persistent_Storage/#raw-manifold-path-for-a https://www.internalfb.com/intern/wiki/Manifold/Getting_Started/Manifold_CLI/
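For the "shell out to executables" option, a sketch of how the OSS binary could hand off to an internal companion after the report is written; the plugin name and flag here are entirely made up for illustration.

```rust
// Sketch: after the core run, invoke a (hypothetical) internal plugin binary
// if it exists on PATH; otherwise silently stay OSS-only.
use std::process::Command;

fn run_internal_plugin(out_dir: &str) {
    match Command::new("tlparse-internal") // hypothetical binary name
        .arg("--upload")                   // hypothetical flag
        .arg(out_dir)
        .status()
    {
        Ok(status) if status.success() => {}
        Ok(status) => eprintln!("internal plugin exited with {status}"),
        Err(_) => {} // not installed; that's fine
    }
}
```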



feb 23 ideas
- the DDPOptimizer split needs a context
- post_grad_graph and output_code occasionally have no context; how to orient
  in this situation :think:
- would like to know the code hash, so we can generate links to files
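A sketch of the code-hash idea: fingerprint the source files the stack frames point at so the report can link to (or at least detect drift from) the code that actually ran. std's DefaultHasher is only a placeholder here; a real version would use the repo revision or a cryptographic content hash.

```rust
// Sketch: placeholder content fingerprint for a source file. DefaultHasher is
// not stable across Rust releases; illustrative only.
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};

fn file_fingerprint(path: &str) -> std::io::Result<u64> {
    let contents = fs::read(path)?;
    let mut h = DefaultHasher::new();
    contents.hash(&mut h);
    Ok(h.finish())
}
```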


- recompile
  - dynamic shape dimension changed
  - just collect them all in one place
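A sketch of "collect them all in one place": group recompile events by frame so the report can show every reason (e.g. which dynamic shape dimension changed) side by side. The `RecompileEvent` fields are assumptions about what the recompiles log yields.

```rust
// Sketch: aggregate recompile reasons per frame. Fields are assumed.
use std::collections::BTreeMap;

struct RecompileEvent {
    frame: String,  // e.g. "forward (model.py:42)"
    reason: String, // e.g. "s0 changed from 8 to 16"
}

fn group_recompiles(events: Vec<RecompileEvent>) -> BTreeMap<String, Vec<String>> {
    let mut by_frame: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for e in events {
        by_frame.entry(e.frame).or_default().push(e.reason);
    }
    by_frame
}
```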


mar 19 ideas
- print the nn module structure whenever an nn module is compiled