cairo-native 0.8.0

A compiler to convert Cairo's IR Sierra code to MLIR and execute it.
# Debugging

## Useful environment variables

These two environment variables dump the generated MLIR code of every compilation into the current working directory as:

- `dump.mlir`: The MLIR code after passes without locations.
- `dump-debug.mlir`: The MLIR code after passes with locations.
- `dump-prepass.mlir`: The MLIR code before passes without locations.
- `dump-prepass-debug.mlir`: The MLIR code before passes with locations.

Note that the MLIR with locations is printed in pretty form and is therefore not suitable as input to `mlir-opt`.

```bash
export NATIVE_DEBUG_DUMP_PREPASS=1
export NATIVE_DEBUG_DUMP=1
```

### Debugging with LLDB

To debug with LLDB (or another debugger), we must compile the binary with the `with-debug-utils` feature.
```bash
cargo build --package cairo-native-run --features with-debug-utils
```

Then, we can add a debugger breakpoint trap. To add it at a given Sierra statement, we can set the following env var:
```bash
export NATIVE_DEBUG_TRAP_AT_STMT=10
```

The trap instruction may not end up exactly where the statement is.

If we want to manually set the breakpoint (for example, when executing a particular libfunc), then we can use the `DebugUtils` metadata in the code.
```rust,ignore
#[cfg(feature = "with-debug-utils")]
{
    metadata.get_mut::<DebugUtils>()
        .unwrap()
        .debug_breakpoint_trap(block, location)?;
}
```

Now, we need to execute `cairo-native-run` from our debugger (LLDB). If we want to see the source locations, we also need to set the `NATIVE_DEBUG_DUMP` env var and execute the program with AOT.

```bash
lldb -- target/debug/cairo-native-run -s programs/recursion.cairo --available-gas 99999999 --run-mode aot
```

Some useful lldb commands:
- `process launch`: starts the program
- `frame select`: shows the current line information
- `thread step-in`: makes a source level single step
- `thread continue`: continues execution of the current process
- `disassemble --frame --mixed`: shows assembly instructions mixed with source level code

## Logging
Enable logging to see the compilation process:

```bash
export RUST_LOG="cairo_native=trace"
```

## Other tips:

- Try to find the minimal program that reproduces an issue; the more isolated it is, the easier it is to test.
- Use the `debug_utils` print utilities, more info [here](https://lambdaclass.github.io/cairo_native/cairo_native/metadata/debug_utils/struct.DebugUtils.html):

```rust,ignore
#[cfg(feature = "with-debug-utils")]
{
    metadata.get_mut::<DebugUtils>()
        .unwrap()
        .print_pointer(context, helper, entry, ptr, location)?;
}
```

## Trace Dump Feature

The `with-trace-dump` feature is used to generate the execution trace of a sierra program.

First, make sure to compile with the feature enabled:
```bash
cargo build --release --features with-trace-dump
```

Then, use the `--trace-output` flag to save the trace dump to disk:

```bash
target/release/cairo-native-run -s programs/recursion.cairo --trace-output programs/recursion.trace --available-gas 10000000
```

The generated file will contain the state of all variables in the current scope, for every statement executed:

```json
{
  "states": [
    {
      "statementIdx": 25,
      "preStateDump": {
        "0": "Unit",
        "1": { "U64": 9993660 }
      }
    },
    {
      "statementIdx": 26,
      "preStateDump": {
        "0": "Unit",
        "1": { "U64": 9993660 }
      }
    },
    {
      "statementIdx": 27,
      "preStateDump": {
        "0": "Unit",
        "1": { "U64": 9993660 },
        "2": { "Felt": "0x3e8" }
      }
    },
    ...
  ]
}
```

It is sometimes useful to take a look at the sierra program. You can use the `--sierra-output` flag to save the sierra program to disk.

```txt
disable_ap_tracking() -> (); // 25
const_as_immediate<Const<felt252, 1000>>() -> ([2]); // 26
store_temp<RangeCheck>([0]) -> ([0]); // 27
```

## Debugging Contracts

Contracts are difficult to debug for various reasons, including:
- They are external to the project.
- We don’t have their source code.
- They run autogenerated code (the wrapper).
- They have a limited number of allowed libfuncs (e.g. they cannot use the print libfunc).
- Usually it’s not a single contract but multiple contracts that call each other.

Some of them have workarounds:

### Obtaining the contract
There are various options for obtaining the contract, which include:

- Manually invoking a Starknet API using `curl` with the contract class.

Example:

```bash
curl --location --request POST 'https://mainnet.juno.internal.lambdaclass.com' \
--header 'Content-Type: application/json' \
--data-raw '{
  "jsonrpc": "2.0",
  "method": "starknet_getClass",
  "id": 0,
  "params": {
    "class_hash": "0x036078334509b514626504edc9fb252328d1a240e4e948bef8d0c08dff45927f",
    "block_id": 657887
  }
}'
```

- Running the replay with some code to write all the executed contracts on disk.

Both should provide us with the contract, but if we’re manually invoking the API we’ll need to process the JSON a bit to:

- Remove the JsonRPC overhead, and
- Convert the ABI from a string of JSON into a JSON object.

### Interpreting the contract
The contract JSON contains the Sierra program in a useless form (in the sense
that we cannot understand anything), as well as some information about the
entry points and some ABI types. We’ll need the Sierra program (in Sierra
format, not the JSON) to be able to understand what should be happening.

We can use the `starknet-sierra-extract-code` binary, which can be found in
the cairo project when compiled from source (not in the binary distribution).
That binary will extract the Sierra program without any debug information,
which is still not very useful.

Once we have the Sierra we can run the
[Sierra mapper](https://github.com/azteca1998/sierra-mapper) to autogenerate
some type, libfunc and function names so that we know what we’re looking at
without losing our mind. The Sierra mapper can be run multiple times, adding
more names manually as the user sees fit.

### How to actually debug

First of all we need to **know which contract is actually failing**. Most
of the time the contract that crashes isn’t the one identified by the
transaction’s class hash, but one reached through a chain of contract/library calls.

To know which contract is being called we can add some debugging prints in
the replay that log contract executions. For example:

```rust,ignore
impl StarknetSyscallHandler for ReplaySyscallHandler {
    // ...

    fn library_call(
        &mut self,
        class_hash: Felt,
        function_selector: Felt,
        calldata: &[Felt],
        remaining_gas: &mut u128,
    ) -> SyscallResult<Vec<Felt>> {
        // ...

        println!("Starting execution of contract {class_hash} on selector {function_selector} with calldata {calldata:?}.");
        let result = executor.invoke_contract_dynamic(...);
        println!("Finished execution of contract {class_hash}.");
        if result.failure_flag {
            println!("Execution of contract {class_hash} failed.");
        }

        // ...
    }

    fn call_contract(
        &mut self,
        address: Felt,
        entry_point_selector: Felt,
        calldata: &[Felt],
        remaining_gas: &mut u128,
    ) -> SyscallResult<Vec<Felt>> {
        // ... (here `class_hash` is the class deployed at `address`)

        println!("Starting execution of contract {class_hash} on selector {entry_point_selector} with calldata {calldata:?}.");
        let result = executor.invoke_contract_dynamic(...);
        println!("Finished execution of contract {class_hash}.");
        if result.failure_flag {
            println!("Execution of contract {class_hash} failed.");
        }

        // ...
    }
}
```

If we run something like the above then the
[replay](https://github.com/lambdaclass/starknet-replay) should start
printing the log of what’s actually being executed and where it crashes.
It may print the error message multiple times, but **only the first one is
the relevant one** (the others should be the contract call chain in reverse
order). Once we know which contract is being called and its calldata we can
download and extract its Sierra as detailed above.

We then need to know **where it fails within the contract**. To do that we
can look at the error message and deduce where it’s used based on the Sierra
program. For example, the error message `u256_mul overflow` is felt-encoded
as `0x753235365f6d756c206f766572666c6f77`, or
`39879774624083218221774975706286902767479` in decimal. If we look for
usages of that specific value we’ll most likely find all the **places where
that error can be thrown**. Now we just need to narrow them down to a single
one and we’ll be able to actually start debugging.
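
The felt encoding of a short string is simply its ASCII bytes read as one big-endian integer, so the value to search for can be computed locally. The following is a minimal sketch using only the standard library; the function names are hypothetical helpers and not part of Cairo Native.

```rust
/// Encode a short ASCII string as a hexadecimal felt literal.
/// Hypothetical helper: assumes the message fits in a short string (at most 31 ASCII bytes).
fn short_string_to_felt_hex(s: &str) -> String {
    assert!(s.is_ascii() && s.len() <= 31, "expected at most 31 ASCII bytes");
    // Each byte becomes two hex digits, most significant byte first.
    let hex: String = s.bytes().map(|b| format!("{b:02x}")).collect();
    format!("0x{hex}")
}

/// Same value in decimal, using schoolbook base-256 to base-10 conversion
/// so that no big-integer crate is required.
fn short_string_to_felt_decimal(s: &str) -> String {
    assert!(s.is_ascii() && s.len() <= 31, "expected at most 31 ASCII bytes");
    // Decimal digits of the running value, least significant digit first.
    let mut digits: Vec<u8> = vec![0];
    for byte in s.bytes() {
        // digits = digits * 256 + byte
        let mut carry = byte as u32;
        for d in digits.iter_mut() {
            let v = *d as u32 * 256 + carry;
            *d = (v % 10) as u8;
            carry = v / 10;
        }
        while carry > 0 {
            digits.push((carry % 10) as u8);
            carry /= 10;
        }
    }
    digits.iter().rev().map(|d| char::from(b'0' + d)).collect()
}

fn main() {
    let msg = "u256_mul overflow";
    println!("{}", short_string_to_felt_hex(msg));
    println!("{}", short_string_to_felt_decimal(msg));
}
```

Both forms are worth searching for, since the Sierra text and the compiler code may not use the same radix.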

One way to do that is to modify Cairo Native so that it adds a
breakpoint every time a constant with that error message is generated.
For example:

```rust,ignore
/// Generate MLIR operations for the `felt252_const` libfunc.
pub fn build_const<'ctx, 'this>(
    context: &'ctx Context,
    registry: &ProgramRegistry<CoreType, CoreLibfunc>,
    entry: &'this Block<'ctx>,
    location: Location<'ctx>,
    helper: &LibfuncHelper<'ctx, 'this>,
    metadata: &mut MetadataStorage,
    info: &Felt252ConstConcreteLibfunc,
) -> Result<()> {
    let value = match info.c.sign() {
        Sign::Minus => {
            let prime = metadata
                .get::<PrimeModuloMeta<Felt>>()
                .ok_or(Error::MissingMetadata)?
                .prime();
            (&info.c + prime.to_bigint().expect("always is Some"))
                .to_biguint()
                .expect("always is positive")
        }
        _ => info.c.to_biguint().expect("sign already checked"),
    };
    let felt252_ty = registry.build_type(
        context,
        helper,
        registry,
        metadata,
        &info.branch_signatures()[0].vars[0].ty,
    )?;
    if value == "39879774624083218221774975706286902767479".parse().unwrap() {
        // If using the debugger:
        metadata
            .get_mut::<crate::metadata::debug_utils::DebugUtils>()
            .unwrap()
            .debug_breakpoint_trap(entry, location)
            .unwrap();
        // If not using the debugger (not tested, may not provide useful information).
        metadata
            .get_mut::<crate::metadata::debug_utils::DebugUtils>()
            .unwrap()
            .debug_print(
                context,
                helper,
                entry,
                &format!("Invoked felt252_const<error_msg> at {location}."),
                location,
            )
            .unwrap();
    }
    let value = entry.const_int_from_type(context, location, value, felt252_ty)?;
    entry.append_operation(helper.br(0, &[value], location));
    Ok(())
}
```

Using the debugger will also provide the internal call backtrace (of the
contract) and register values, so it’s the recommended way, but depending on
the contract it may not be feasible (ex. the contract is too big and running
the debugger is not practical due to the amount of time it takes to get to
the crash).

Once we know exactly where it crashes we can follow the control flow of the
Sierra program backwards and discover how it reached that point.

In some cases the **problem may be somewhere completely different from where
the error is thrown**. In other words, the error we’re seeing may be a side
effect of a completely different bug. For example, in a `u256_mul overflow`,
the bug may be found in the mul operation implementation, or alternatively it
may just be that the values passed to it are not what they should be. That’s
why it’s important to check for those cases and keep following the control
flow backwards as required.

### Fixing the bug
Before fixing the bug it’s really important to know:

- **Where** it happens (in our compiler, not so much in the contract at this point)
- **Why** it happens (as in, what caused this bug to be in our codebase in the first place)
- **How** to fix it properly (not the actual code but to know what steps to take to fix it).
- Could the **same bug** happen in **different places**? (for example, if it was the implementation of `u64_sqrt`, could the same bug happen in `u32_sqrt` and others?)
- What **side-effects** will the bug fix trigger? (for example, if the fix implies changing the layout of some type, will the new layout make something completely unrelated fail later on?)

The last one is really important since we don’t want to cause more bugs
fixing the ones we already have. To understand the side effects we need to
have a full understanding of the bug, which implies having an answer to (at
least) all the other things to know before fixing it.

Once we know all that we can:

1. Add tests that reproduce the bug (including all the variants that we may discover).
2. Implement the fix in code.

> Note: Those steps must be done in that order. Otherwise we risk
> unconsciously avoiding bugs in our tests for our bug fix implementation by
> building our tests from our implementation instead of the correct
> behaviour.

### Comparing with Sierra Emulator

To aid in the debugging process, we developed [sierra-emu](https://github.com/lambdaclass/sierra-emu/). It’s an external tool that executes raw sierra code and outputs an execution trace, containing each statement executed and the associated state.

In addition to this, we developed the `with-trace-dump` feature for Cairo Native, which generates an execution trace that records every statement executed. It has the same shape as the one generated by the Sierra emulator. Supporting transaction execution with the Cairo Native trace dump required quite a few hacks, which is why we haven’t merged it to main and why we need to use a specific Cairo Native branch.

By combining both tools, we can hopefully pinpoint exactly which *libfunc* implementation is buggy.

Before starting, make sure to clone [starknet-replay](https://github.com/lambdaclass/starknet-replay).

#### Obtaining Sierra Emulator Trace in Starknet Replay

1. Check out the starknet-replay `trace-dump` branch.
2. Execute a single transaction with the `use-sierra-emu` feature:
    ```bash
    cargo run --features use-sierra-emu tx <HASH> <CHAIN> <BLOCK>
    ```
3. Once finished, it will have written the traces of each inner contract inside of `traces/emu`, relative to the current working directory.

As a single transaction can invoke multiple contracts (by contract and library calls), this generates a trace file for each contract executed, numbered in ascending order: `trace_0.json`, `trace_1.json`, etc.

#### Obtaining Cairo Native Trace in Starknet Replay

1. Check out the starknet-replay `trace-dump` branch.
2. Execute a single transaction with the `with-trace-dump` feature:
    ```bash
    cargo run --features with-trace-dump tx <HASH> <CHAIN> <BLOCK>
    ```
3. Once finished, it will have written the traces of each inner contract inside of `traces/native`, relative to the current working directory.

#### Patching Dependencies

If the execution panics, it may indicate that not all the required libfuncs or types have been implemented (for either the Sierra emulator or the Cairo Native trace dump feature). It is a good idea to patch the dependencies to a local path and implement the missing features. You can add this to `Cargo.toml`:

```toml
[patch.'https://github.com/lambdaclass/cairo_native']
cairo-native = { path = "../cairo_native" }
[patch.'https://github.com/lambdaclass/sierra-emu']
sierra-emu = { path = "../sierra-emu" }
```

#### Comparing Traces

Once you have generated the traces for both the Sierra emulator and Cairo Native, you can begin debugging.

1. Compare the traces of the same contract with your favorite tool (the braces are left unquoted so that the shell expands them into the two paths):
    ```bash
    diff traces/{emu,native}/trace_0.json # or
    delta traces/{emu,native}/trace_0.json --side-by-side
    ```
2. Look for the first significant difference between the traces. Not all differences are significant. For example:
    1. Sometimes the emulator and Cairo Native differ in the Gas builtin. It usually doesn’t affect the outcome of the contract.
    2. The `ec_state_init` libfunc randomizes an elliptic curve point, which is why its results always differ.
3. Find the index of the statement executed immediately before the first difference.
4. Open `traces/prog_0.sierra` and look for that statement.
    1. If it’s a return, then you are dealing with a control flow bug. These are difficult to debug.
    2. If it’s a libfunc invocation, then that libfunc is probably the one that is buggy.
    3. If it’s a library or contract call, then the bug is probably in another contract, and you should move onto the next trace.
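
Steps 2 and 3 can be partly automated for control-flow differences. The sketch below is an assumption-laden illustration, not part of starknet-replay: it scans two trace dumps for the `statementIdx` key shown in the sample trace above and reports the first position where the executed statements diverge. It uses a naive text scan instead of a JSON parser to stay dependency-free, and it ignores the variable states, so state-only differences still need a regular diff. The function names are hypothetical.

```rust
/// Collect every `statementIdx` value from a trace dump, in execution order.
/// Naive text scan; assumes the key is spelled exactly as in the sample trace.
fn statement_indexes(trace: &str) -> Vec<u64> {
    const KEY: &str = "\"statementIdx\"";
    let mut out: Vec<u64> = Vec::new();
    let mut rest = trace;
    while let Some(pos) = rest.find(KEY) {
        rest = &rest[pos + KEY.len()..];
        // Skip the colon and whitespace, then read the decimal digits.
        let digits: String = rest
            .chars()
            .skip_while(|c| *c == ':' || c.is_whitespace())
            .take_while(|c| c.is_ascii_digit())
            .collect();
        if let Ok(idx) = digits.parse::<u64>() {
            out.push(idx);
        }
    }
    out
}

/// Position of the first step at which the two executions diverge, if any.
fn first_divergence(a: &[u64], b: &[u64]) -> Option<usize> {
    let common = a.iter().zip(b).take_while(|(x, y)| x == y).count();
    if common == a.len() && common == b.len() {
        None
    } else {
        Some(common)
    }
}

fn main() {
    // Inline stand-ins for the contents of traces/emu/trace_0.json and traces/native/trace_0.json.
    let emu = r#"{"states":[{"statementIdx": 25},{"statementIdx": 26},{"statementIdx": 27}]}"#;
    let native = r#"{"states":[{"statementIdx": 25},{"statementIdx": 26},{"statementIdx": 30}]}"#;
    let (a, b) = (statement_indexes(emu), statement_indexes(native));
    match first_divergence(&a, &b) {
        // The statement executed just before the divergence is the one to inspect.
        Some(step) if step > 0 => println!("diverges at step {step}, after statement {}", a[step - 1]),
        Some(_) => println!("diverges at the very first statement"),
        None => println!("same control flow"),
    }
}
```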

#### Useful Scripts

In the `scripts` folder of starknet-replay, you can find useful scripts for debugging. Make sure to execute them in the root directory. Some scripts require `delta` to be installed.

- `compare-traces`: Compares every trace and outputs which ones differ. This can help find the buggy contract when there are a lot of traces.
    ```bash
    > ./scripts/compare-traces.sh
    difference: ./traces/emu/trace_0.json ./traces/native/trace_0.json
    difference: ./traces/emu/trace_1.json ./traces/native/trace_1.json
    difference: ./traces/emu/trace_3.json ./traces/native/trace_3.json
    missing file: ./traces/native/trace_4.json
    ```
- `diff-trace`: Receives a trace number, and executes `delta` to compare that trace.
    ```bash
    ./scripts/diff-trace.sh 1
    ```
- `diff-trace-flow`: Like `diff-trace`, but only diffs (with `delta`) the statement indexes. It can be used to visualize the control flow difference.
    ```bash
    ./scripts/diff-trace-flow.sh 1
    ```
- `string-to-felt`: Converts the given string to a felt. Can be used to search in the code where a specific error message was generated.
    ```bash
    > ./scripts/string-to-felt.sh "u256_mul Overflow"
    753235365f6d756c204f766572666c6f77
    ```
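
The reverse direction is handy when reading traces: a felt that looks like an error code can be decoded back into text. This is a minimal sketch of a hypothetical helper (not one of the starknet-replay scripts), assuming the felt holds a short string made of printable ASCII bytes.

```rust
/// Decode a hexadecimal felt (with or without the `0x` prefix) into the short
/// string it encodes. Returns `None` if the input is not valid hex or if any
/// byte is not printable ASCII, which suggests the felt is not a message.
fn felt_hex_to_string(hex: &str) -> Option<String> {
    let hex = hex.strip_prefix("0x").unwrap_or(hex);
    // Left-pad to an even number of digits so the input splits into whole bytes.
    let padded = if hex.len() % 2 == 1 {
        format!("0{hex}")
    } else {
        hex.to_string()
    };
    let mut out = String::new();
    for i in (0..padded.len()).step_by(2) {
        let byte = u8::from_str_radix(padded.get(i..i + 2)?, 16).ok()?;
        if !(byte.is_ascii_graphic() || byte == b' ') {
            return None;
        }
        out.push(byte as char);
    }
    Some(out)
}

fn main() {
    // The value printed by the `string-to-felt` example above.
    println!("{:?}", felt_hex_to_string("753235365f6d756c204f766572666c6f77"));
    // A plain number such as 0x3e8 does not decode to text.
    println!("{:?}", felt_hex_to_string("0x3e8"));
}
```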

## Debugging Compilation

If we encounter contracts/programs that take too long to compile, the first step is to pinpoint what is causing the long compilation times.
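
A simple way to start narrowing things down is to time each stage separately. The sketch below is a generic helper built on the standard library only; `timed` is a hypothetical name, and the closure stands in for whatever stage is being measured (it does not call any Cairo Native API).

```rust
use std::time::{Duration, Instant};

/// Run `f`, report how long it took on stderr, and return both the result and the elapsed time.
fn timed<T>(label: &str, f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let value = f();
    let elapsed = start.elapsed();
    eprintln!("{label}: {elapsed:?}");
    (value, elapsed)
}

fn main() {
    // Stand-in workload; in practice the closure would wrap a single compilation stage.
    let (sum, _elapsed) = timed("stand-in stage", || (0..1_000_000u64).sum::<u64>());
    println!("{sum}");
}
```

Wrapping stages one at a time makes it easier to tell whether the time is spent in one place or spread across the pipeline.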

If we find that a particular libfunc is taking too much time to compile/optimize, we should consider moving that libfunc to the runtime. First, we need to check if it would give any improvements at all. To do this, we can "fake" a runtime call to trick the compiler into thinking that a particular libfunc is implemented externally. If we just "delete" the libfunc implementation, we may allow the compiler to optimize a lot of instructions away. This would hide the actual problem.

For details on how to do this, see the debugging functions `build_mock_runtime_call` and `build_mock_libfunc`. The latter is fully generic, and can be used as a replacement for any libfunc implementation.

For example, to check if the `eval_circuit` libfunc is taking too much time to compile, just replace this:
```rust,ignore
// at src/libfuncs/circuit.rs
CircuitConcreteLibfunc::Eval(info) => {
    build_eval(context, registry, entry, location, helper, metadata, info)
}
```
With this:
```rust,ignore
CircuitConcreteLibfunc::Eval(info) => {
    build_mock_libfunc(context, registry, entry, location, helper, metadata, info.signature())
}
```

Note that sometimes the problem is not a libfunc, but the actual types involved. In these cases mocking a libfunc may not help, as the mock would still have to operate on those complex types (in particular, loading them from pointers).