Function detect_chips 

pub fn detect_chips<E>(
    root_chips: Vec<Chip>,
    init_callback: &mut impl FnMut(ChipDetectState<'_>) -> Result<(), E>,
    options: ChipDetectOptions,
) -> Result<Vec<UninitChip>, InitError<E>>

Find all chips accessible from the given set of root chips. For the most part this should be a set of chips found via a PCI scan, but it doesn’t have to be.

The most important part of this algorithm is determining which chips are duplicates of other chips. In general, two boards can be differentiated by their board id, but this is not always the case. For the gs or wh X2, for example, we must fall back on the interface id for grayskull or the ethernet address for wh. Even this does not cover all cases: if there is a wh X2 that is not in the root_chips list (which could be because it is in a neighbouring host), and both chips are in two separate meshes with the same ethernet address, we will incorrectly detect them as being one chip.
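
The fallback order described above can be sketched as a small identity function. This is only an illustration, not the crate's implementation: `ToyArch`, `ToyIdentity`, `ToyChipInfo`, the `board_id_ambiguous` flag and the field widths are all hypothetical names and assumptions.

```rust
// Hypothetical types for illustration; none of these names come from the crate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ToyArch {
    Grayskull,
    Wormhole,
}

/// The value two chips are compared on when looking for duplicates.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum ToyIdentity {
    BoardId(u64),
    InterfaceId(u32),
    EthAddr(u64),
}

struct ToyChipInfo {
    arch: ToyArch,
    board_id: u64,
    /// Assumed flag: true for boards such as the X2, where the board id
    /// alone cannot tell chips apart.
    board_id_ambiguous: bool,
    interface_id: u32,
    eth_addr: u64,
}

/// Prefer the board id; fall back to the interface id (grayskull) or the
/// ethernet address (wormhole) when the board id is ambiguous.
fn identity(info: &ToyChipInfo) -> ToyIdentity {
    if !info.board_id_ambiguous {
        return ToyIdentity::BoardId(info.board_id);
    }
    match info.arch {
        ToyArch::Grayskull => ToyIdentity::InterfaceId(info.interface_id),
        ToyArch::Wormhole => ToyIdentity::EthAddr(info.eth_addr),
    }
}
```

Under these assumptions, two wormhole chips in separate meshes that happen to share an ethernet address produce equal identities, which is the false duplicate the paragraph above warns about.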

Search steps:

  1. Add all given chips to the output list, removing duplicates. This ensures that if list indexes are used to assign chip ids, PCI chips will always be output instead of their remote equivalents.
  2. Do a depth-first search from each root chip, adding all new chips found to the output list.
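
The two search steps can be sketched on a toy graph. This is a minimal illustration under stated assumptions, not the function's actual code: `ToyChip`, its `id` field (standing in for whatever identity is used for duplicate detection) and `discover` are hypothetical.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a chip handle. `id` is the identity used for
/// duplicate detection; `neighbours` holds indexes of directly linked chips.
struct ToyChip {
    id: u64,
    neighbours: Vec<usize>,
}

/// Returns chip ids in output order: deduplicated roots first, then every
/// new chip found by a depth-first search from each root.
fn discover(all: &[ToyChip], roots: &[usize]) -> Vec<u64> {
    let mut seen = HashSet::new();
    let mut output = Vec::new();

    // Step 1: add the root chips, dropping duplicates, so that roots keep
    // the lowest output indexes.
    let mut unique_roots = Vec::new();
    for &root in roots {
        if seen.insert(all[root].id) {
            output.push(all[root].id);
            unique_roots.push(root);
        }
    }

    // Step 2: iterative depth-first search from each root.
    for root in unique_roots {
        // Neighbours are pushed in reverse so they are popped in listed order.
        let mut stack: Vec<usize> = all[root].neighbours.iter().rev().copied().collect();
        while let Some(current) = stack.pop() {
            if !seen.insert(all[current].id) {
                continue; // already in the output list (possibly as a root)
            }
            output.push(all[current].id);
            stack.extend(all[current].neighbours.iter().rev().copied());
        }
    }

    output
}
```

Because the roots are recorded before any traversal starts, a chip that is both a root and reachable through another chip always appears at its root position in the output.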

When continue on failure is true, we report errors but continue searching for chips. All chips that did not complete initialization are passed back as UninitChip; the user can see the status and decide for themselves whether to upgrade the chip to a full Chip.

Error cases:

  1. ARC fw is hung; this usually means that there is a noc hang as well.
     a. Not catastrophic: we can recover from the hang by resetting the chip.
  2. DRAM is not trained.
     a. Not catastrophic, but we should not pass this chip off as good, because we may get a noc hang when accessing DRAM.
  3. ARC did not complete initialization.
     a. Not catastrophic, but for gs we will have no thermal control.
  4. Ethernet fw is corrupted; we check this by looking for a known fw version.
     a. Not catastrophic: we need to report this, but can continue exploring other chips in the mesh.
  5. Ethernet fw is hung; this usually means that the ethernet is in a bad state.
     a. Not catastrophic: we need to report this, but can continue exploring other chips in the mesh.
  6. 0xffffffff error; this means that the underlying transport is hung.
     a. This is catastrophic: we cannot continue searching for chips, because some of the chips in the mesh may no longer be accessible.
     b. We could recover from this by rerunning the search, but this is not implemented.
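
The error cases above split into recoverable failures and one catastrophic failure. The sketch below models that split. It is illustrative only: `ToyFailure` and `scan` are hypothetical names, and the rule that any failure stops the scan when continue on failure is false is an assumption rather than documented behaviour.

```rust
/// Hypothetical mirror of the six error cases listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ToyFailure {
    ArcFwHung,
    DramNotTrained,
    ArcInitIncomplete,
    EthFwCorrupted,
    EthFwHung,
    /// The 0xffffffff case: the underlying transport is hung.
    TransportHung,
}

impl ToyFailure {
    /// Only a hung transport prevents the search from continuing.
    fn is_catastrophic(self) -> bool {
        matches!(self, ToyFailure::TransportHung)
    }
}

/// Walks per-chip init results (`None` means the chip initialized cleanly).
/// Returns every chip index with its recorded status, or the failure that
/// stopped the scan.
fn scan(
    results: &[Option<ToyFailure>],
    continue_on_failure: bool,
) -> Result<Vec<(usize, Option<ToyFailure>)>, ToyFailure> {
    let mut report = Vec::new();
    for (index, status) in results.iter().copied().enumerate() {
        if let Some(failure) = status {
            // Assumption: without continue-on-failure, any failure aborts.
            if failure.is_catastrophic() || !continue_on_failure {
                return Err(failure);
            }
        }
        // Failed chips are still reported, with their status attached, so
        // the caller can decide what to do with them.
        report.push((index, status));
    }
    Ok(report)
}
```

In this model a chip with untrained DRAM still appears in the report when continue on failure is true, while a hung transport aborts the scan regardless of the flag.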