Crate holochain_metrics


Initialize Holochain metrics. This crate should only be used in binaries, to initialize the actual metrics collection. Libraries should only use the opentelemetry_api crate to report metrics; those reports take effect if a collector has been initialized.

§Environment Variables

When you call HolochainMetricsConfig::new(&path).init(), the metrics collector that is created is largely determined by which environment variables exist.

Currently, by default, the Null metrics collector is used: metrics are not collected, and all metrics operations are no-ops.

If you wish to enable metrics, the current options are:

  • InfluxDB as a zero-config child process.
    • Enable via environment variable: HOLOCHAIN_INFLUXIVE_CHILD_SVC=1
    • The influxd and influx binaries will be downloaded and verified, then run automatically as a child process that is configured to receive the reported metrics. The InfluxDB UI will be available on a randomly assigned port (currently reported only in the trace logging).
  • InfluxDB as a pre-existing system process.
    • Enable via environment variable: HOLOCHAIN_INFLUXIVE_EXTERNAL=1
    • Configure via environment variables:
      • HOLOCHAIN_INFLUXIVE_EXTERNAL_HOST=[my influxdb url]: for a default InfluxDB install this is http://localhost:8086; otherwise, it can be found by running influx config in a terminal.
      • HOLOCHAIN_INFLUXIVE_EXTERNAL_BUCKET=[my influxdb bucket name]: naming the bucket influxive is simplest if you plan to import the provided dashboards.
      • HOLOCHAIN_INFLUXIVE_EXTERNAL_TOKEN=[my influxdb auth token]
    • The InfluxDB auth token must have permission to write to all buckets.
    • Metrics will be set up to report to this already running InfluxDB.
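The selection rules above can be summarized in a few lines of code. This is a minimal sketch, not the crate's implementation: CollectorChoice and choose_collector are hypothetical names, a variable counts as set whenever it exists (following the "existence of environment variables" wording), and both the precedence of the child-process option over the external one and treating a missing host, bucket, or token as an error are assumptions. The lookup is passed in as a closure so the logic can be exercised without touching the real process environment.

```rust
/// Hypothetical summary of which collector the environment selects.
#[derive(Debug, PartialEq)]
enum CollectorChoice {
    /// Default: metrics are not collected and all operations are no-ops.
    Null,
    /// HOLOCHAIN_INFLUXIVE_CHILD_SVC: run InfluxDB as a child process.
    InfluxiveChildSvc,
    /// HOLOCHAIN_INFLUXIVE_EXTERNAL: report to an already running InfluxDB.
    InfluxiveExternal { host: String, bucket: String, token: String },
}

/// Decides which collector to use. `var` looks up an environment variable
/// and returns None when it does not exist.
/// Assumption: the child-process option takes precedence if both are set.
fn choose_collector(var: impl Fn(&str) -> Option<String>) -> Result<CollectorChoice, String> {
    if var("HOLOCHAIN_INFLUXIVE_CHILD_SVC").is_some() {
        return Ok(CollectorChoice::InfluxiveChildSvc);
    }
    if var("HOLOCHAIN_INFLUXIVE_EXTERNAL").is_some() {
        // Assumption: every external setting is required.
        let get = |k: &str| var(k).ok_or_else(|| format!("missing {k}"));
        return Ok(CollectorChoice::InfluxiveExternal {
            host: get("HOLOCHAIN_INFLUXIVE_EXTERNAL_HOST")?,
            bucket: get("HOLOCHAIN_INFLUXIVE_EXTERNAL_BUCKET")?,
            token: get("HOLOCHAIN_INFLUXIVE_EXTERNAL_TOKEN")?,
        });
    }
    Ok(CollectorChoice::Null)
}

fn main() {
    // Read the real process environment; non-unicode values count as unset.
    match choose_collector(|k| std::env::var(k).ok()) {
        Ok(CollectorChoice::Null) => println!("metrics disabled (Null collector)"),
        Ok(CollectorChoice::InfluxiveChildSvc) => println!("InfluxDB child process"),
        // The token is deliberately not printed.
        Ok(CollectorChoice::InfluxiveExternal { host, bucket, .. }) => {
            println!("external InfluxDB at {host}, bucket {bucket}")
        }
        Err(e) => eprintln!("invalid metrics configuration: {e}"),
    }
}
```

Injecting the lookup keeps the decision logic pure, so it can be checked against a plain map instead of mutating global environment state.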

§Metric Naming Conventions

We will largely attempt to follow the metric naming guidelines at https://opentelemetry.io/docs/specs/otel/metrics/semantic_conventions/, with additional rules to fit our particular project. We will also attempt to keep this documentation up to date, on a best-effort basis, so that it serves as an example and a registry of the metrics available in Holochain and in related supporting crates managed by the organization.

Generic naming convention rules:

  • Use a dot-notation logical module hierarchy. This need not, and perhaps should not, match the Rust crate/module hierarchy: crates and modules may be rearranged, but the metric names themselves should remain stable.
    • Examples:
      • hc.db
      • hc.workflow.integration
      • kitsune.gossip
      • tx5.signal
  • The dot-notation metric name or context should follow the logical module name. The quantity that can be charted should be the metric itself; related context that a chart may need to filter on should be recorded as attributes. For example, a “request” may have two separate metrics, “duration” and “byte.count”, both of which may carry the filtering attribute “remote_id”.
    • Examples:
      •   use opentelemetry_api::{Context, KeyValue, metrics::Unit};
          let req_dur = opentelemetry_api::global::meter("tx5")
              .f64_histogram("tx5.signal.request.duration")
              .with_description("tx5 signal server request duration")
              .with_unit(Unit::new("s"))
              .init();
          req_dur.record(&Context::new(), 0.42, &[
              KeyValue::new("remote_id", "abcd"),
          ]);
      •   use opentelemetry_api::{Context, KeyValue, metrics::Unit};
          let req_size = opentelemetry_api::global::meter("tx5")
              .u64_histogram("tx5.signal.request.byte.count")
              .with_description("tx5 signal server request byte count")
              .with_unit(Unit::new("By"))
              .init();
          req_size.record(&Context::new(), 42, &[
              KeyValue::new("remote_id", "abcd"),
          ]);
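The dot-notation rule can be checked mechanically. This is a minimal sketch of a hypothetical lint, not part of holochain_metrics or opentelemetry_api; it assumes each segment uses only lowercase ASCII letters, digits, and underscores (true of every name in the registry below) and that a name needs at least a module segment and a metric segment.

```rust
/// Hypothetical lint: checks that a metric name is a dot-separated
/// hierarchy of non-empty segments made of lowercase ASCII letters,
/// digits, and underscores.
fn is_valid_metric_name(name: &str) -> bool {
    let segments: Vec<&str> = name.split('.').collect();
    // Require at least a logical module segment plus a metric segment.
    segments.len() >= 2
        && segments.iter().all(|seg| {
            !seg.is_empty()
                && seg
                    .chars()
                    .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_')
        })
}

fn main() {
    for name in [
        "tx5.signal.request.duration",
        "tx5.signal.request.byte.count",
        "Tx5.Signal",      // rejected: uppercase letters
        "kitsune..gossip", // rejected: empty segment
    ] {
        println!("{name}: {}", is_valid_metric_name(name));
    }
}
```

Such a check only enforces the shape of a name; whether a piece of context belongs in the name or in an attribute still has to be decided by the rule above.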

§Metric Name Registry

Each entry gives the full metric name, its type and unit (if any), a description, and its attributes.

  • kitsune.peer.send.duration (f64_histogram, unit: s): Time taken when kitsune sends data to a remote peer.
    • remote_id: the base64 remote peer id.
    • is_error: whether the send failed.
  • kitsune.peer.send.byte.count (u64_histogram, unit: By): Bytes sent when kitsune sends data to a remote peer.
    • remote_id: the base64 remote peer id.
    • is_error: whether the send failed.
  • kitsune.gossip.generate_op_blooms.duration (f64_histogram, unit: s): The time taken to generate op blooms for gossip.
    • space: the space (dna_hash representation) that gossip is being performed for.
    • batch_size: the number of ops included in the bloom batch for this observation.
  • kitsune.gossip.generate_op_region_set.duration (f64_histogram, unit: s): The time taken to generate op region sets for gossip.
    • space: the space (dna_hash representation) that gossip is being performed for.
  • tx5.conn.ice.send (u64_observable_counter, unit: By): Bytes sent on the ICE channel.
    • remote_id: the base64 remote peer id.
    • state_uniq: endpoint identifier.
    • conn_uniq: connection identifier.
  • tx5.conn.ice.recv (u64_observable_counter, unit: By): Bytes received on the ICE channel.
    • remote_id: the base64 remote peer id.
    • state_uniq: endpoint identifier.
    • conn_uniq: connection identifier.
  • tx5.conn.data.send (u64_observable_counter, unit: By): Bytes sent on the data channel.
    • remote_id: the base64 remote peer id.
    • state_uniq: endpoint identifier.
    • conn_uniq: connection identifier.
  • tx5.conn.data.recv (u64_observable_counter, unit: By): Bytes received on the data channel.
    • remote_id: the base64 remote peer id.
    • state_uniq: endpoint identifier.
    • conn_uniq: connection identifier.
  • tx5.conn.data.send.message.count (u64_observable_counter, no unit): Number of messages sent on the data channel.
    • remote_id: the base64 remote peer id.
    • state_uniq: endpoint identifier.
    • conn_uniq: connection identifier.
  • tx5.conn.data.recv.message.count (u64_observable_counter, no unit): Number of messages received on the data channel.
    • remote_id: the base64 remote peer id.
    • state_uniq: endpoint identifier.
    • conn_uniq: connection identifier.
  • hc.conductor.p2p_event.duration (f64_histogram, unit: s): The time spent processing a p2p event.
    • dna_hash: the DNA hash that this event is being sent on behalf of.
  • hc.conductor.post_commit.duration (f64_histogram, unit: s): The time spent executing a post commit.
    • dna_hash: the DNA hash that this post commit is running for.
    • agent: the agent running the post commit.
  • hc.conductor.workflow.duration (f64_histogram, unit: s): The time spent running a workflow.
    • workflow: the name of the workflow.
    • dna_hash: the DNA hash that this workflow is running for.
    • agent (optional): the agent that this workflow is running for, if the workflow is cell bound.
  • hc.cascade.duration (f64_histogram, unit: s): The time taken to execute a cascade query.
  • hc.db.pool.utilization (f64_gauge, no unit): The utilization of connections in the pool.
    • kind: the kind of database, such as Conductor, Wasm, or Dht.
    • id: the unique identifier for this database, if multiple instances can exist (such as a Dht database).
  • hc.db.connections.use_time (f64_histogram, unit: s): The time between borrowing a connection and returning it to the pool.
    • kind: the kind of database, such as Conductor, Wasm, or Dht.
    • id: the unique identifier for this database, if multiple instances can exist (such as a Dht database).
  • hc.ribosome.wasm.usage (u64_counter, no unit): The metered usage of a wasm ribosome.
    • dna: the DNA hash that this wasm is metered for.
    • zome: the zome that this wasm is metered for.
    • fn: the function that this wasm is metered for.
    • agent: the agent that this wasm is metered for (if there is one).
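Every f64_histogram in the registry is recorded in seconds (unit s). A minimal std-only sketch of producing such a value, using a hypothetical time_secs helper that is not part of holochain_metrics. In real code the returned seconds would be passed to the histogram's record call, as in the tx5 example above; that call is omitted here to keep the sketch dependency-free.

```rust
use std::time::Instant;

/// Hypothetical helper: runs `f` and returns its result together with the
/// elapsed wall-clock time in seconds as an f64, matching the value type
/// and unit ("s") of the duration histograms in the registry.
fn time_secs<T>(f: impl FnOnce() -> T) -> (T, f64) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed().as_secs_f64())
}

fn main() {
    // Stand-in workload for something like a cascade query.
    let (sum, secs) = time_secs(|| (1..=1_000u64).sum::<u64>());
    // Real code would record `secs` on a histogram such as
    // hc.cascade.duration; here it is only printed.
    println!("sum = {sum}, took {secs:.6} s");
}
```

Using Instant rather than SystemTime keeps the measurement monotonic, so a wall-clock adjustment cannot produce a negative duration.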

Enums§