cairo-native 0.9.0-rc.6

# Debugging

## Useful environment variables

The two environment variables below dump the generated MLIR code from any compilation to the current working directory, producing the following files:

- `dump.mlir`: The MLIR code after passes without locations.
- `dump-debug.mlir`: The MLIR code after passes with locations.
- `dump-prepass.mlir`: The MLIR code before passes without locations.
- `dump-prepass-debug.mlir`: The MLIR code before passes with locations.

Do note that the MLIR with locations is in pretty form and thus not suitable to pass to `mlir-opt`.

```bash
export NATIVE_DEBUG_DUMP_PREPASS=1
export NATIVE_DEBUG_DUMP=1
```

### Debugging with LLDB

To debug with LLDB (or another debugger), we must compile the binary with the `with-debug-utils` feature.
```bash
cargo build --package cairo-native-run --features with-debug-utils
```

Then, we can add a debugger breakpoint trap. To add it at a given Sierra statement, we can set the following env var:
```bash
export NATIVE_DEBUG_TRAP_AT_STMT=10
```

The trap instruction may not end up exactly where the statement is.

If we want to manually set the breakpoint (for example, when executing a particular libfunc), then we can use the `DebugUtils` metadata in the code.
```rust,ignore
#[cfg(feature = "with-debug-utils")]
{
    metadata.get_mut::<DebugUtils>()
        .unwrap()
        .debug_breakpoint_trap(block, location)?;
}
```

Now, we need to execute `cairo-native-run` from our debugger (LLDB). If we want to see the source locations, we also need to set the `NATIVE_DEBUG_DUMP` env var and execute the program with AOT.

```bash
lldb -- target/debug/cairo-native-run -s programs/recursion.cairo --available-gas 99999999 --run-mode aot
```

Some useful LLDB commands:
- `process launch`: starts the program
- `frame select`: shows the current line information
- `thread step-in`: makes a source level single step
- `thread continue`: continues execution of the current process
- `disassemble --frame --mixed`: shows assembly instructions mixed with source level code

## Logging
Enable logging to see the compilation process:

```bash
export RUST_LOG="cairo_native=trace"
```

## Other tips:

- Try to find the minimal program that reproduces an issue; the more isolated it is, the easier it is to test.
- Use the `debug_utils` print utilities, more info [here](https://starkware-libs.github.io/cairo_native/cairo_native/metadata/debug_utils/struct.DebugUtils.html):

```rust,ignore
#[cfg(feature = "with-debug-utils")]
{
    metadata.get_mut::<DebugUtils>()
        .unwrap()
        .print_pointer(context, helper, entry, ptr, location)?;
}
```

## Trace Dump Feature

The `with-trace-dump` feature is used to generate the execution trace of a sierra program.

First, make sure to compile with the feature enabled:
```bash
cargo build --release --features with-trace-dump
```

Then, use the `--trace-output` flag to save the trace dump to disk:

```bash
target/release/cairo-native-run -s programs/recursion.cairo --trace-output programs/recursion.trace --available-gas 10000000
```

The generated file will contain the state of all variables in the current scope, for every statement executed:

```json
{
  "states": [
    {
      "statementIdx": 25,
      "preStateDump": {
        "0": "Unit",
        "1": { "U64": 9993660 }
      }
    },
    {
      "statementIdx": 26,
      "preStateDump": {
        "0": "Unit",
        "1": { "U64": 9993660 }
      }
    },
    {
      "statementIdx": 27,
      "preStateDump": {
        "0": "Unit",
        "1": { "U64": 9993660 },
        "2": { "Felt": "0x3e8" }
      }
    },
    ...
  ]
}
```

It is sometimes useful to take a look at the sierra program. You can use the `--sierra-output` flag to save the sierra program to disk.

```txt
disable_ap_tracking() -> (); // 25
const_as_immediate<Const<felt252, 1000>>() -> ([2]); // 26
store_temp<RangeCheck>([0]) -> ([0]); // 27
```
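When cross-referencing a trace entry with the Sierra dump, it can help to look up a statement by its index. The sketch below assumes that every statement sits on its own line and ends with a `// <index>` comment, as in the excerpt above. `find_statement` is a hypothetical helper for illustration, not part of Cairo Native.

```rust
/// Returns the Sierra statement whose trailing `// <index>` comment matches
/// `statement_idx`, with the comment stripped.
///
/// Assumption: each statement is on its own line and ends with
/// `// <index>`, as in the excerpt above. Lines without such a comment
/// are skipped.
fn find_statement(sierra: &str, statement_idx: usize) -> Option<&str> {
    sierra.lines().find_map(|line| {
        // Split at the last `//` so the statement text itself is untouched.
        let (statement, comment) = line.rsplit_once("//")?;
        let idx: usize = comment.trim().parse().ok()?;
        (idx == statement_idx).then(|| statement.trim())
    })
}
```

For the excerpt above, looking up index `26` would return the `const_as_immediate` statement that produces variable `[2]`, which is the value that first appears in the `preStateDump` of statement `27`.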

## Debugging Contracts

Contracts are difficult to debug for various reasons, including:
- They are external to the project.
- We don’t have their source code.
- They run autogenerated code (the wrapper).
- They have a limited number of allowed libfuncs (ex. cannot use the print libfunc).
- Usually it’s not a single contract but multiple contracts that call each other.

Some of them have workarounds:

### Obtaining the contract
There are various options for obtaining the contract, which include:

- Manually invoking a Starknet API using `curl` with the contract class.

Example:

```bash
curl --location --request POST 'https://mainnet.juno.internal.lambdaclass.com' \
--header 'Content-Type: application/json' \
--data-raw '{
  "jsonrpc": "2.0",
  "method": "starknet_getClass",
  "id": 0,
  "params": {
    "class_hash": "0x036078334509b514626504edc9fb252328d1a240e4e948bef8d0c08dff45927f",
    "block_id": 657887
  }
}'
```

- Running the replay with some code to write all the executed contracts on disk.

Both should provide us with the contract, but if we’re manually invoking the API we’ll need to process the JSON a bit to:

- Remove the JSON-RPC overhead, and
- Convert the ABI from a string of JSON into a JSON object.

### Interpreting the contract
The contract JSON contains the Sierra program in a useless form (in the sense
that we cannot understand anything), as well as some information about the
entry points and some ABI types. We’ll need the Sierra program (in Sierra
format, not the JSON) to be able to understand what should be happening.

We can use the `starknet-sierra-extract-code` binary, which can be found in
the cairo project when compiled from source (not in the binary distribution).
That binary will extract the Sierra program without any debug information,
which is still not very useful.

Once we have the Sierra we can run the
[Sierra mapper](https://github.com/azteca1998/sierra-mapper) to autogenerate
some type, libfunc and function names so that we know what we’re looking at
without losing our mind. The Sierra mapper can be run multiple times, adding
more names manually as the user sees fit.

### How to actually debug

First of all we need to **know which contract is actually failing**. Most
of the time the contract that crashes isn’t the one identified by the
transaction’s class hash, but one reached through a chain of contract/library
calls.

To know which contract is being called we can add some debugging prints in
the replay that log contract executions. For example:

```rust,ignore
impl StarknetSyscallHandler for ReplaySyscallHandler {
    // ...

    fn library_call(
        &mut self,
        class_hash: Felt,
        function_selector: Felt,
        calldata: &[Felt],
        remaining_gas: &mut u128,
    ) -> SyscallResult<Vec<Felt>> {
        // ...

        println!("Starting execution of contract {class_hash} on selector {function_selector} with calldata {calldata:?}.");
        let result = executor.invoke_contract_dynamic(...);
        println!("Finished execution of contract {class_hash}.");
        if result.failure_flag {
            println!("Execution of contract {class_hash} failed.");
        }

        // ...
    }

    fn call_contract(
        &mut self,
        address: Felt,
        entry_point_selector: Felt,
        calldata: &[Felt],
        remaining_gas: &mut u128,
    ) -> SyscallResult<Vec<Felt>> {
        // ...

        println!("Starting execution of contract {address} on selector {entry_point_selector} with calldata {calldata:?}.");
        let result = executor.invoke_contract_dynamic(...);
        println!("Finished execution of contract {address}.");
        if result.failure_flag {
            println!("Execution of contract {address} failed.");
        }

        // ...
    }
}
```

If we run something like the above then the
[replay](https://github.com/starkware-libs/starknet-replay) should start
printing the log of what’s actually being executed and where it crashes.
It may print the error message multiple times, but **only the first one is
the relevant one** (the others should be the contract call chain in reverse
order). Once we know which contract is being called and its calldata we can
download and extract its Sierra as detailed above.
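
Picking out the relevant failure from a long log can be automated. The following sketch assumes the exact message format of the `println!` calls in the example handler above. `first_failed_contract` is a hypothetical helper name. It returns the contract identifier (class hash or address) from the first failure line, which corresponds to the innermost failing call.

```rust
/// Scans a replay log and returns the contract named in the first
/// "Execution of contract ... failed." line.
///
/// Assumption: the log uses exactly the messages printed by the example
/// syscall handler above. Later failure lines belong to the callers, in
/// reverse call order, so only the first match is returned.
fn first_failed_contract(log: &str) -> Option<&str> {
    log.lines().find_map(|line| {
        line.trim()
            .strip_prefix("Execution of contract ")?
            .strip_suffix(" failed.")
    })
}
```

Feeding the captured standard output of the replay to this function gives the identifier of the contract whose Sierra is worth extracting first.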

We then need to know **where it fails within the contract**. To do that we
can look at the error message and deduce where it’s used based on the Sierra
program. For example, the error message `u256_mul overflow` is felt-encoded
as `0x753235365f6d756c206f766572666c6f77`, or
`39879774624083218221774975706286902767479` in decimal. If we look for
usages of that specific value we’ll most likely find all the **places where
that error can be thrown**. Now we just need to narrow them down to a single
one and we’ll be able to actually start debugging.
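
The felt encoding of an error message is simply its ASCII bytes read as one big-endian integer. A minimal sketch of the conversion in both directions follows. Both function names are hypothetical and only the standard library is used, so the value is kept as a hex string rather than as a felt type.

```rust
/// Encodes a Cairo short string as the hex form of its felt value: the
/// ASCII bytes of the string read as one big-endian integer.
///
/// Returns `None` if the input is not ASCII or is longer than 31 bytes,
/// the maximum length of a short string.
fn short_string_to_felt_hex(s: &str) -> Option<String> {
    if !s.is_ascii() || s.len() > 31 {
        return None;
    }
    if s.is_empty() {
        return Some("0x0".to_string());
    }
    let digits: String = s.bytes().map(|b| format!("{b:02x}")).collect();
    Some(format!("0x{digits}"))
}

/// Decodes a hex felt (with or without the `0x` prefix) back into a short
/// string, for example to read an error value found in a trace.
///
/// Returns `None` if the input is not valid hex or does not decode to ASCII.
fn felt_hex_to_short_string(hex: &str) -> Option<String> {
    let digits = hex.strip_prefix("0x").unwrap_or(hex);
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    // Pad to an even number of digits so that every byte has two digits.
    let padded = if digits.len() % 2 == 1 {
        format!("0{digits}")
    } else {
        digits.to_string()
    };
    let bytes = (0..padded.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&padded[i..i + 2], 16).ok())
        .collect::<Option<Vec<u8>>>()?;
    String::from_utf8(bytes).ok().filter(|s| s.is_ascii())
}
```

Calling `short_string_to_felt_hex("u256_mul overflow")` yields the hex value quoted above. Note that the encoding is case-sensitive, so the exact spelling of the message matters when searching for it.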

One way to do that is to modify Cairo Native so that it adds a
breakpoint every time a constant with that error message is generated.
For example:

```rust,ignore
/// Generate MLIR operations for the `felt252_const` libfunc.
pub fn build_const<'ctx, 'this>(
    context: &'ctx Context,
    registry: &ProgramRegistry<CoreType, CoreLibfunc>,
    entry: &'this Block<'ctx>,
    location: Location<'ctx>,
    helper: &LibfuncHelper<'ctx, 'this>,
    metadata: &mut MetadataStorage,
    info: &Felt252ConstConcreteLibfunc,
) -> Result<()> {
    let value = match info.c.sign() {
        Sign::Minus => {
            let prime = metadata
                .get::<PrimeModuloMeta<Felt>>()
                .ok_or(Error::MissingMetadata)?
                .prime();
            (&info.c + prime.to_bigint().expect("always is Some"))
                .to_biguint()
                .expect("always is positive")
        }
        _ => info.c.to_biguint().expect("sign already checked"),
    };
    let felt252_ty = registry.build_type(
        context,
        helper,
        registry,
        metadata,
        &info.branch_signatures()[0].vars[0].ty,
    )?;
    if value == "39879774624083218221774975706286902767479".parse().unwrap() {
        // If using the debugger:
        metadata
            .get_mut::<crate::metadata::debug_utils::DebugUtils>()
            .unwrap()
            .debug_breakpoint_trap(entry, location)
            .unwrap();
        // If not using the debugger (not tested, may not provide useful information).
        metadata
            .get_mut::<crate::metadata::debug_utils::DebugUtils>()
            .unwrap()
            .debug_print(
                context,
                helper,
                entry,
                &format!("Invoked felt252_const<error_msg> at {location}."),
                location,
            )
            .unwrap();
    }
    let value = entry.const_int_from_type(context, location, value, felt252_ty)?;
    entry.append_operation(helper.br(0, &[value], location));
    Ok(())
}
```

Using the debugger will also provide the internal call backtrace (of the
contract) and register values, so it’s the recommended way, but depending on
the contract it may not be feasible (ex. the contract is too big and running
the debugger is not practical due to the amount of time it takes to get to
the crash).

Once we know exactly where it crashes we can follow the control flow of the
Sierra program backwards and discover how it reached that point.

In some cases the **problem may be somewhere completely different from where
the error is thrown**. In other words, the error we’re seeing may be a side
effect of a completely different bug. For example, in a `u256_mul overflow`,
the bug may be found in the mul operation implementation, or alternatively it
may just be that the values passed to it are not what they should be. That’s
why it’s important to check for those cases and keep following the control
flow backwards as required.

### Fixing the bug
Before fixing the bug it’s really important to know:

- **Where** it happens (in our compiler, not so much in the contract at this point)
- **Why** it happens (as in, what caused this bug to be in our codebase in the first place)
- **How** to fix it properly (not the actual code but to know what steps to take to fix it).
- Could the **same bug** happen in **different places**? (for example, if it was the implementation of `u64_sqrt`, could the same bug happen in `u32_sqrt` and others?)
- What **side-effects** will the bug fix trigger? (for example, if the fix implies changing the layout of some type, will the new layout make something completely unrelated fail later on?)

The last one is really important since we don’t want to cause more bugs
fixing the ones we already have. To understand the side effects we need to
have a full understanding of the bug, which implies having an answer to (at
least) all the other things to know before fixing it.

Once we know all that we can:

1. Add tests that reproduce the bug (including all the variants that we may discover).
2. Implement the fix in code.

> Note: Those steps must be done in that order. Otherwise we risk
> unconsciously avoiding bugs in our tests for our bug fix implementation by
> building our tests from our implementation instead of the correct
> behaviour.

### Comparing with Sierra Emulator

To aid in the debugging process, we developed [sierra-emu](https://github.com/lambdaclass/sierra-emu/). It’s an external tool that executes raw sierra code and outputs an execution trace, containing each statement executed and the associated state.

In addition to this, we developed the `with-trace-dump` feature for Cairo Native, which generates an execution trace that records every statement executed. It has the same shape as the one generated by the Sierra emulator. Supporting transaction execution with the Cairo Native trace dump required quite a few hacks, which is why we haven’t merged it to main and why we need to use a specific Cairo Native branch.

By combining both tools, we can hopefully pinpoint exactly which *libfunc* implementation is buggy.

Before starting, make sure to clone [starknet-replay](https://github.com/starkware-libs/starknet-replay).

#### Obtaining Sierra Emulator Trace in Starknet Replay

1. Check out the starknet-replay `trace-dump` branch.
2. Execute a single transaction with the `use-sierra-emu` feature:
    ```bash
    cargo run --features use-sierra-emu tx <HASH> <CHAIN> <BLOCK>
    ```
3. Once finished, it will have written the traces of each inner contract inside of `traces/emu`, relative to the current working directory.

As a single transaction can invoke multiple contracts (by contract and library calls), this generates a trace file for each contract executed, numbered in ascending order: `trace_0.json`, `trace_1.json`, etc.

#### Obtaining Cairo Native Trace in Starknet Replay

1. Check out the starknet-replay `trace-dump` branch.
2. Execute a single transaction with the `with-trace-dump` feature:
    ```bash
    cargo run --features with-trace-dump tx <HASH> <CHAIN> <BLOCK>
    ```
3. Once finished, it will have written the traces of each inner contract inside of `traces/native`, relative to the current working directory.

#### Patching Dependencies

If the execution panics, it may indicate that not all the required libfuncs or types have been implemented (for either the Sierra emulator or the Cairo Native trace dump feature). It is a good idea to patch the dependencies to a local path and implement the missing features. You can add this to `Cargo.toml`:

```toml
[patch.'https://github.com/starkware-libs/cairo_native']
cairo-native = { path = "../cairo_native" }
[patch.'https://github.com/lambdaclass/sierra-emu']
sierra-emu = { path = "../sierra-emu" }
```

#### Comparing Traces

Once you have generated the traces for both the Sierra emulator and Cairo Native, you can begin debugging.

1. Compare the traces of the same contract with your favorite tool:
    ```bash
    diff traces/{emu,native}/trace_0.json # or
    delta traces/{emu,native}/trace_0.json --side-by-side
    ```
2. Look for the first significant difference between the traces. Not all the differences are significant, for example:
    1. Sometimes the emulator and Cairo Native differ in the Gas builtin. It usually doesn’t affect the outcome of the contract.
    2. The `ec_state_init` libfunc randomizes an elliptic curve point, which is why they always differ.
3. Find the index of the statement executed immediately before the first difference.
4. Open `traces/prog_0.sierra` and look for that statement.
    1. If it’s a return, then you are dealing with a control flow bug. These are difficult to debug.
    2. If it’s a libfunc invocation, then that libfunc is probably the one that is buggy.
    3. If it’s a library or contract call, then the bug is probably in another contract, and you should move onto the next trace.
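
The search for the first difference can be sketched in code. The example below is an illustration, not how the replay or its scripts work: `TraceEntry`, `Divergence` and `first_divergence` are hypothetical names, variable values are kept as opaque strings instead of parsed JSON, and no difference is filtered out, so gas or `ec_state_init` mismatches will be reported like any other.

```rust
use std::collections::BTreeMap;

/// One trace entry: the statement about to run and the state before it.
#[derive(Debug, Clone, PartialEq)]
struct TraceEntry {
    statement_idx: usize,
    pre_state: BTreeMap<u64, String>,
}

impl TraceEntry {
    fn new(statement_idx: usize, vars: &[(u64, &str)]) -> Self {
        let pre_state = vars.iter().map(|(k, v)| (*k, v.to_string())).collect();
        Self { statement_idx, pre_state }
    }
}

/// Where two traces first disagree.
#[derive(Debug, PartialEq)]
struct Divergence {
    /// Position in the trace of the first differing entry.
    position: usize,
    /// Statement executed just before it (taken from the emulator trace),
    /// or `None` if the very first entries already differ.
    previous_statement: Option<usize>,
}

/// Returns the first divergence between both traces, or `None` if they are
/// identical. A trace that ends early counts as diverging at its end.
fn first_divergence(emu: &[TraceEntry], native: &[TraceEntry]) -> Option<Divergence> {
    let common = emu.len().min(native.len());
    let position = (0..common)
        .find(|&i| emu[i] != native[i])
        .or_else(|| (emu.len() != native.len()).then_some(common))?;
    // The pre-state of entry `position` was produced by the previous entry.
    let previous_statement = position.checked_sub(1).map(|i| emu[i].statement_idx);
    Some(Divergence { position, previous_statement })
}
```

Because each entry stores the state before its statement runs, a mismatch in entry `position` points at the statement of the previous entry, which is the one to look up in the Sierra program.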

#### Useful Scripts

In the `scripts` folder of starknet-replay, you can find useful scripts for debugging. Make sure to execute them in the root directory. Some scripts require `delta` to be installed.

- `compare-traces`: Compares every trace and outputs which are different. This can help find the buggy contract when there are a lot of traces.
    ```bash
    > ./scripts/compare-traces.sh
    difference: ./traces/emu/trace_0.json ./traces/native/trace_0.json
    difference: ./traces/emu/trace_1.json ./traces/native/trace_1.json
    difference: ./traces/emu/trace_3.json ./traces/native/trace_3.json
    missing file: ./traces/native/trace_4.json
    ```
- `diff-trace`: Receives a trace number, and executes `delta` to compare that trace.
    ```bash
    ./scripts/diff-trace.sh 1
    ```
- `diff-trace-flow`: Like `diff-trace`, but only diffs (with `delta`) the statement indexes. It can be used to visualize the control flow difference.
    ```bash
    ./scripts/diff-trace-flow.sh 1
    ```
- `string-to-felt`: Converts the given string to a felt. Can be used to search in the code where a specific error message was generated.
    ```bash
    > ./scripts/string-to-felt.sh "u256_mul Overflow"
    753235365f6d756c204f766572666c6f77
    ```

## Debugging Compilation

If we encounter contracts/programs that take too long to compile, the first step is to pinpoint what is causing the long compilation times.

If we find that a particular libfunc is taking too much time to compile/optimize, we should consider moving that libfunc to the runtime. First, we need to check if it would give any improvements at all. To do this, we can "fake" a runtime call to trick the compiler into thinking that a particular libfunc is implemented externally. If we just "delete" the libfunc implementation, we may allow the compiler to optimize a lot of instructions away. This would hide the actual problem.

For details on how to do this, see the debugging functions `build_mock_runtime_call` and `build_mock_libfunc`. The latter is fully generic, and can be used as a replacement for any libfunc implementation.

For example, to check if the `eval_circuit` libfunc is taking too much time to compile, just replace this:
```rust,ignore
// at src/libfuncs/circuit.rs
CircuitConcreteLibfunc::Eval(info) => {
    build_eval(context, registry, entry, location, helper, metadata, info)
}
```
With this:
```rust,ignore
CircuitConcreteLibfunc::Eval(info) => {
    build_mock_libfunc(context, registry, entry, location, helper, metadata, info.signature())
}
```

Note that sometimes the problem is not a libfunc, but the actual types involved. In these cases mocking a libfunc may not help, as the mock would have to operate with those complex types anyway (particularly, loading them from pointers).