Crate holochain_metrics

Initialize holochain metrics. This crate should only be used in binaries to initialize the actual metrics collection. Libraries should report metrics through the opentelemetry crate alone; those metrics are recorded only if a collector has been initialized.

§Environment Variables

When calling HolochainMetricsConfig::new(&path).init(), the actual metrics instance that will be created is largely controlled by the existence of environment variables.

Currently, by default, the Null metrics collector will be used, meaning metrics will not be collected, and all metrics operations will be no-ops.

If you wish to enable metrics, the current options are:

A file, containing InfluxDB line protocol metrics. These can be pushed to InfluxDB later with Telegraf.
- Enable and configure via environment variable: HOLOCHAIN_INFLUXIVE_FILE="path/to/influx/file"
InfluxDB as a zero-config child process.
- Enable via environment variable: HOLOCHAIN_INFLUXIVE_CHILD_SVC=1
- The binaries influxd and influx will be downloaded and verified, then run automatically as a child process that metrics are reported to. The InfluxDB UI will be available on a randomly assigned port (currently reported only in the trace logging).
InfluxDB as a pre-existing system process.
- Enable via environment variable: HOLOCHAIN_INFLUXIVE_EXTERNAL=1
- Configure via environment variables:
  - HOLOCHAIN_INFLUXIVE_EXTERNAL_HOST=[my influxdb url]. A default InfluxDB install uses http://localhost:8086; otherwise, the URL can be found by running influx config in a terminal.
  - HOLOCHAIN_INFLUXIVE_EXTERNAL_BUCKET=[my influxdb bucket name]. It is simplest to use influxive as the bucket name if you plan to import the provided dashboards.
  - HOLOCHAIN_INFLUXIVE_EXTERNAL_TOKEN=[my influxdb auth token]
- The InfluxDB auth token must have permission to write to all buckets.
- Metrics will be set up to report to this already running InfluxDB.
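
As a rough illustration of how the variables above map to a collector, here is a minimal sketch. The `MetricsMode` enum and `select_mode` function are hypothetical and are not part of this crate's API. The precedence used when several variables are set (file, then external, then child service) and the fallback to the Null collector when the external configuration is incomplete are assumptions made for the example, not documented behavior.

```rust
/// Hypothetical model of the collector modes described above.
/// This is NOT the crate's real type; it only illustrates the selection.
#[derive(Debug, PartialEq)]
pub enum MetricsMode {
    /// No relevant environment variable set: metrics calls are no-ops.
    Null,
    /// HOLOCHAIN_INFLUXIVE_FILE: write line protocol metrics to this path.
    File(String),
    /// HOLOCHAIN_INFLUXIVE_CHILD_SVC: run InfluxDB as a child process.
    ChildSvc,
    /// HOLOCHAIN_INFLUXIVE_EXTERNAL: report to an already running InfluxDB.
    External {
        host: String,
        bucket: String,
        token: String,
    },
}

/// Pick a mode using an environment lookup function.
/// Assumption: FILE is checked first, then EXTERNAL, then CHILD_SVC,
/// and an incomplete EXTERNAL configuration falls back to Null.
pub fn select_mode(get: impl Fn(&str) -> Option<String>) -> MetricsMode {
    if let Some(path) = get("HOLOCHAIN_INFLUXIVE_FILE") {
        return MetricsMode::File(path);
    }
    if get("HOLOCHAIN_INFLUXIVE_EXTERNAL").is_some() {
        let host = get("HOLOCHAIN_INFLUXIVE_EXTERNAL_HOST");
        let bucket = get("HOLOCHAIN_INFLUXIVE_EXTERNAL_BUCKET");
        let token = get("HOLOCHAIN_INFLUXIVE_EXTERNAL_TOKEN");
        return match (host, bucket, token) {
            (Some(host), Some(bucket), Some(token)) => {
                MetricsMode::External { host, bucket, token }
            }
            // One of the three settings is missing.
            _ => MetricsMode::Null,
        };
    }
    if get("HOLOCHAIN_INFLUXIVE_CHILD_SVC").is_some() {
        return MetricsMode::ChildSvc;
    }
    MetricsMode::Null
}
```

Against the real process environment this could be called as `select_mode(|key| std::env::var(key).ok())`. Passing the lookup as a closure keeps the sketch testable without mutating the environment.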

All metrics modes automatically stamp a host tag on every emitted metric so that metrics from different nodes can be distinguished when multiple Holochain instances write to a shared InfluxDB. The value defaults to the OS hostname. Override it with:

HOLOCHAIN_INFLUXIVE_HOST_TAG=<my-custom-node-name>
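
The override rule for the host tag can be sketched as below. `resolve_host_tag` is a hypothetical helper, not a function exported by this crate. The OS hostname is passed in as a plain argument so the example needs no extra dependencies, and treating an empty override as unset is an assumption.

```rust
/// Resolve the value of the `host` tag.
/// `override_tag` stands in for the value of HOLOCHAIN_INFLUXIVE_HOST_TAG,
/// and `os_hostname` for whatever the operating system reports.
/// Assumption: an empty override is treated as if it were unset.
pub fn resolve_host_tag(override_tag: Option<&str>, os_hostname: &str) -> String {
    match override_tag {
        Some(tag) if !tag.is_empty() => tag.to_string(),
        _ => os_hostname.to_string(),
    }
}
```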

To set the interval at which recorded metrics are written to Influx, use OTEL_METRIC_EXPORT_INTERVAL. The value is specified in milliseconds, and the default is 10 s. A report interval configured in code overrides this environment variable.
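
The precedence described above (code configuration, then environment variable, then default) can be sketched as follows. `resolve_export_interval` is a hypothetical helper written for illustration only. Falling back to the default when the variable cannot be parsed is an assumption.

```rust
use std::time::Duration;

/// Default export interval stated above: 10 s.
pub const DEFAULT_EXPORT_INTERVAL: Duration = Duration::from_secs(10);

/// `code_interval`: an interval configured in code, if any.
/// `env_value`: the raw value of OTEL_METRIC_EXPORT_INTERVAL, in milliseconds.
/// Assumption: an unparsable value falls back to the default.
pub fn resolve_export_interval(
    code_interval: Option<Duration>,
    env_value: Option<&str>,
) -> Duration {
    // A value configured in code always wins over the environment.
    if let Some(interval) = code_interval {
        return interval;
    }
    env_value
        .and_then(|raw| raw.trim().parse::<u64>().ok())
        .map(Duration::from_millis)
        .unwrap_or(DEFAULT_EXPORT_INTERVAL)
}
```

For example, with `OTEL_METRIC_EXPORT_INTERVAL=2500` and nothing configured in code, this sketch yields an interval of 2.5 s.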

§Metric Naming Conventions

We will largely attempt to follow the guidelines for metric naming enumerated at https://opentelemetry.io/docs/specs/otel/metrics/semantic_conventions/, with additional rules made to fit with our particular project. We will also attempt to keep this documentation up to date on a best-effort basis to act as an example and registry of metrics available in Holochain, and related support dependency crates managed by the organization.

Generic naming convention rules:

Dot notation logical module hierarchy. This need not, and perhaps should not, match the Rust crate/module hierarchy: we may rearrange crates and modules, but the metric names themselves should remain consistent.
- Examples:
  - hc.db
  - hc.workflow.integration
  - hc.ribosome.wasm

A dot notation metric name or context should follow the logical module name. The thing that can be charted should be the actual metric. Related context that may want to be filtered for the chart should be attributes. For example, a “request” may have two separate metrics, “duration”, and “byte.count”, which both may have the filtering attribute “remote_id”.

Examples

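  use opentelemetry::KeyValue;
  // The name carries the logical module ("hc.holochain_p2p.request")
  // followed by the charted quantity ("duration"), measured in seconds.
  let req_dur = opentelemetry::global::meter("hc")
      .f64_histogram("hc.holochain_p2p.request.duration")
      .with_description("holochain p2p request duration")
      .with_unit("s")
      .build();
  // "remote_id" is an attribute that a chart can filter on.
  req_dur.record(0.42, &[KeyValue::new("remote_id", "abcd")]);

  use opentelemetry::KeyValue;
  // Same logical module, different charted quantity ("byte.count"),
  // measured in bytes.
  let req_size = opentelemetry::global::meter("hc")
      .u64_histogram("hc.holochain_p2p.request.byte.count")
      .with_description("holochain p2p request byte count")
      .with_unit("B")
      .build();
  // The same filtering attribute is shared by both metrics.
  req_size.record(42, &[
      KeyValue::new("remote_id", "abcd"),
  ]);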
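
The naming rules above could also be checked mechanically, for instance in a test. The sketch below is hypothetical and not part of this crate. It assumes that every name starts with the `hc` root segment, has at least one further segment, and uses only lowercase ASCII letters, digits, and underscores within a segment. Those constraints are inferred from the examples in this documentation rather than stated as rules.

```rust
/// Hypothetical check, not part of holochain_metrics.
/// Assumptions: names start with the `hc` root segment, contain at least one
/// more segment, and each segment is non-empty and made of lowercase ASCII
/// letters, digits, or underscores.
pub fn is_conventional_metric_name(name: &str) -> bool {
    let mut segments = name.split('.');
    if segments.next() != Some("hc") {
        return false;
    }
    let mut remaining = 0;
    for segment in segments {
        let valid = !segment.is_empty()
            && segment
                .chars()
                .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_');
        if !valid {
            return false;
        }
        remaining += 1;
    }
    // The root segment alone is a namespace, not a metric name.
    remaining > 0
}
```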

§Metric Name Registry

The following metrics are defined and recorded in their respective crates. To find where a metric is defined, along with its type, description, and unit, do a text search for its name.

Full Metric Name	Type	Unit (optional)	Description	Attributes
`hc.db.connections.use_time`	f64 histogram	s	The time between borrowing a connection and returning it to the pool	`kind`: DB type (authored/dht/cache/…), `id`: DB instance identifier
`hc.db.write_txn.duration`	f64 histogram	s	The time spent executing an exclusive write transaction	`kind`: DB type (authored/dht/cache/…), `id`: DB instance identifier
`hc.keystore.lair_request.duration`	f64 histogram	s	Duration of signing and encryption requests to Lair	`operation`: cryptographic operation (sign/encrypt/…)
`hc.conductor.workflow.duration`	f64 histogram	s	The time spent running a workflow	`workflow`: workflow process name, `dna_hash`: DNA identifier, `agent`: agent public key
`hc.conductor.workflow.integrated_ops`	u64 counter		The number of integrated operations
`hc.conductor.workflow.integration_delay`	f64 histogram	s	Time between an op being stored and it being integrated
`hc.conductor.workflow.validation_attempts`	u64 histogram		Number of validation attempts required to integrate an op
`hc.conductor.post_commit.duration`	f64 histogram	s	The time spent executing a post commit	`dna_hash`: DNA identifier, `agent`: agent public key
`hc.conductor.uptime`	f64 observable gauge	s	The number of seconds the conductor has been running
`hc.conductor.app_ws.dropped_signal`	u64 counter		The number of signals dropped from app ws due to channel overload
`hc.ribosome.wasm.usage`	u64 counter		The metered usage of a wasm ribosome	`dna_hash`: DNA identifier, `zome`: zome module name, `fn`: function name, `agent`: agent public key
`hc.ribosome.zome_call.duration`	f64 histogram	s	The time spent running a zome call	`dna_hash`: DNA identifier, `zome`: zome module name, `fn`: function name
`hc.ribosome.wasm_call.duration`	f64 histogram	s	The time spent running a wasm call	`dna_hash`: DNA identifier, `zome`: zome module name, `fn`: function name, `agent`: agent public key
`hc.ribosome.host_fn_call.duration`	f64 histogram	s	The time spent executing a host function call	`dna_hash`: DNA identifier, `zome`: zome module name, `fn`: function name, `host_fn`: host function name
`hc.ribosome.host_fn.emit_signal`	u64 counter		The number of local signals emitted	`cell_id`: cell identifier, `zome`: zome module name
`hc.ribosome.host_fn.send_remote_signal`	u64 counter		The number of remote signals sent	`dna_hash`: DNA identifier, `zome`: zome module name
`hc.cascade.duration`	f64 histogram	s	The time taken to execute a cascade query	`zome`: originating zome name, `fn`: originating function name
`hc.cascade.fetch_error`	u64 counter		Number of errors encountered while fetching data from the network	`fetch_type`: type of data fetched, `zome`: originating zome name, `fn`: originating function name
`hc.holochain_p2p.request.duration`	f64 histogram	s	The time spent sending an outgoing p2p request awaiting the response	`dna_hash`: DNA identifier, `tag`: request category tag, `error`: request failed, `zome`: originating zome name, `fn`: originating function name
`hc.holochain_p2p.handle_request.duration`	f64 histogram	s	The time spent handling an incoming p2p request	`message_type`: p2p message type, `dna_hash`: DNA identifier
`hc.holochain_p2p.recv_remote_signal`	u64 counter		The number of remote signals received	`dna_hash`: DNA identifier

§Enums

HolochainMetricsConfig: Configuration for holochain metrics.