Entity Equivalence Use Cases

Background

There are several use cases that require support for either alarm equivalence or resource equivalence. The design of these features is in progress, and is not trivial. The purpose of this document is to define the basic requirements and use cases that should be supported, regardless of the implementation that will be selected later on.

The term “equivalence” is used to denote resources or alarms that are “equal” even though they are reported by different datasources and some of their properties might conflict. Alternative terms could be equality, merge, overlapping, etc.

Basic Equivalence Requirements

Resource Equivalence

We currently have two use cases for resource equivalence.

  1. K8s datasource reports VMs that are also reported by Nova

  2. Vitrage discovery agent (TBD) reports hosts that are also reported by Nova

Both cases might be solvable with hard-coded logic in the datasources themselves. This option should be checked against the use cases.

Alarm Equivalence

We should support the following use cases:

  1. Equivalent alarms from different monitors, e.g. Zabbix and Nagios

  2. Non-equivalent alarms from different monitors, e.g. Zabbix and Nagios (meaning the alarms are similar but not the same)

  3. Equivalence between a monitored alarm and a Vitrage deduced alarm

Equivalence Definition

In order to support these use cases, we must define a way for the user to determine which entities are equivalent.

For resources we should define:

  • Which properties determine the equivalence. E.g. Nova instance UUID equals k8s vm externalID

  • Optional: what property should be used in case of conflict (could it be done arbitrarily or hard-coded?)
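
As a minimal sketch of the resource rule above, equivalence can be expressed as a per-datasource key property. Only the pairing of Nova instance UUID with k8s externalID comes from the example; the names `EQUIVALENCE_KEYS`, `equivalence_key` and `group_equivalent`, the `k8s` datasource name and the `uuid` property name are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical mapping: datasource name -> property holding the shared identity.
# Only "Nova instance UUID equals k8s externalID" comes from the text; the
# datasource and property names used here are assumptions.
EQUIVALENCE_KEYS = {
    "nova.instance": "uuid",
    "k8s": "externalID",
}


def equivalence_key(datasource, properties):
    """Return the value that identifies the resource across datasources."""
    key_property = EQUIVALENCE_KEYS.get(datasource)
    if key_property is None:
        return None  # this datasource takes no part in equivalence
    return properties.get(key_property)


def group_equivalent(reports):
    """Group (datasource, properties) reports that describe the same resource."""
    groups = defaultdict(list)
    for datasource, properties in reports:
        key = equivalence_key(datasource, properties)
        if key is not None:
            groups[key].append(datasource)
    return dict(groups)


reports = [
    ("nova.instance", {"uuid": "abc-123", "name": "vm1"}),
    ("k8s", {"externalID": "abc-123", "name": "VM_1"}),
    ("nova.instance", {"uuid": "def-456", "name": "vm2"}),
]
groups = group_equivalent(reports)
```

Here the first two reports land in the same group because they share the key `abc-123`, while the third stays on its own.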

For alarms we should define:

  • Which properties determine the equivalence. E.g. Zabbix alarm name “HIGH CPU” equals Prometheus alarm name “high cpu”.

  • Hidden assumption: equivalent alarms are always “on” the same resource.

Equivalence should be transitive. If the user defines two equivalences with a common entity, then all entities should be equivalent to one another.

For example:

  • Zabbix high_cpu ~ Nagios HIGH_CPU

  • Nagios HIGH_CPU ~ Prometheus High CPU

Vitrage will handle Zabbix, Nagios and Prometheus CPU alarms as all equivalent to one another.
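
One possible way to obtain transitivity is a union-find (disjoint-set) structure over alarm identifiers, sketched below. This is an illustration only, not a selected design; the class name `EquivalenceClasses` and the `(monitor, alarm name)` tuple representation are assumptions.

```python
class EquivalenceClasses:
    """Union-find over alarm identifiers; merging pairs yields transitive classes."""

    def __init__(self):
        self._parent = {}

    def _find(self, item):
        # Unknown items start as their own single-member class.
        self._parent.setdefault(item, item)
        root = item
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node directly at the root.
        while self._parent[item] != root:
            self._parent[item], item = root, self._parent[item]
        return root

    def add_equivalence(self, a, b):
        """Record a user-defined equivalence between two alarms."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def are_equivalent(self, a, b):
        return self._find(a) == self._find(b)


classes = EquivalenceClasses()
classes.add_equivalence(("zabbix", "high_cpu"), ("nagios", "HIGH_CPU"))
classes.add_equivalence(("nagios", "HIGH_CPU"), ("prometheus", "High CPU"))
```

With only the two definitions from the example, the Zabbix and Prometheus alarms end up in the same class even though no definition links them directly.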

Note: We must support both hard-coded and user-defined equivalence definitions.

  • Hard-coded equivalence: k8s vms always map to Nova vms by the same strategy. We can’t let the user change it.

  • User-defined equivalence: the end user may decide that two alarms are, or are not, equivalent. The user should be able to change this definition at any time. The equivalence definition should be tenant-specific (see the section about multi tenancy).

Merge Strategy

There are different approaches to what information the user should see when two datasources conflict. The user should be able to choose the desired “merge strategy” from the following options:

  1. last_update: Use the properties from the last update.

  2. most_credible: Use the properties from the most credible datasource. A ‘credibility’ property should be added to each datasource. By default, most datasources will have ‘medium’ credibility, except for Vitrage, which will have ‘low’ credibility. The user will be able to change it through options in vitrage.conf. If the equivalent datasources have the same credibility, the last_update merge strategy will be used.

  3. worst_state: In case of state/severity calculation: Use the worst state of all.

The default, which is the current behavior, will be worst_state.
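
The three strategies can be sketched as a single function over the reports of equivalent alarms. This is a minimal illustration under stated assumptions: each datasource contributes one current report, reports are ordered oldest to newest, the function name `merge_severity` is hypothetical, and the numeric ranks (including a ‘high’ credibility level, which this document does not mention) are invented for the example.

```python
# Assumed orderings; the document names 'low' and 'medium' credibility only,
# and does not define numeric ranks for either scale.
CREDIBILITY_RANK = {"low": 0, "medium": 1, "high": 2}
SEVERITY_RANK = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


def merge_severity(reports, strategy="worst_state"):
    """Aggregate severity from reports ordered oldest to newest.

    Each report is a (severity, credibility) tuple.
    """
    if not reports:
        raise ValueError("at least one report is required")
    if strategy == "last_update":
        return reports[-1][0]
    if strategy == "most_credible":
        top = max(CREDIBILITY_RANK[cred] for _, cred in reports)
        # Ties fall back to last_update among the most credible reports.
        return [sev for sev, cred in reports if CREDIBILITY_RANK[cred] == top][-1]
    if strategy == "worst_state":
        return max((sev for sev, _ in reports), key=SEVERITY_RANK.__getitem__)
    raise ValueError(f"unknown merge strategy: {strategy}")


# A monitor reports WARNING, Vitrage deduces CRITICAL, a second monitor reports WARNING.
reports = [("WARNING", "medium"), ("CRITICAL", "low"), ("WARNING", "medium")]
by_strategy = {
    name: merge_severity(reports, name)
    for name in ("last_update", "most_credible", "worst_state")
}
```

For this input, `last_update` and `most_credible` both yield WARNING, while `worst_state` yields CRITICAL.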

Equivalence Use Cases

1. Two datasources report the same resource

1.1. Nova reports first, then Vitrage discovery agent

  1. Nova host datasource asks to create nova.host entity

  2. Vitrage discovery agent datasource asks to create host (nova.host?) entity

Expected behavior: Vitrage API returns a single host

1.2. Vitrage discovery agent reports first

Similar to 1.1, but the discovery agent reports first

1.3. Nova reports again on the next get_all

  1. An entity in the graph already exists for the host, with properties from both datasources

  2. Nova host datasource reports the same host again

Expected behavior: There should be no change in what the API returns

1.4. Conflict in the host state

  1. Nova host datasource asks to create nova.host entity with state ERROR

  2. Vitrage discovery agent datasource asks to create host entity with state ACTIVE

Expected behavior: Vitrage API returns a single host with a state that depends on the merge strategy.

  Merge Strategy    Aggregated state
  last_update       ACTIVE
  most_credible     ERROR
  worst_state       ERROR

1.5. Nova and K8s have different vm names

  1. Nova instance datasource asks to create nova.instance entity named ‘vm1’

  2. K8s datasource asks to create instance entity named ‘VM_1’

Both vms are equivalent by the Nova UUID.

Expected behavior: Vitrage API will return a single instance. Its name will be determined by one of the datasources in a consistent way (meaning it will be either always the K8s name or always the Nova name).
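
One way to make the name choice consistent is a fixed datasource priority, sketched below. The priority order, the `k8s` datasource name and the function name `merged_name` are assumptions; this document does not say which datasource should win.

```python
# Hypothetical fixed priority; the document only requires that the choice
# is consistent, not that Nova is preferred.
NAME_PRIORITY = ["nova.instance", "k8s"]


def merged_name(names_by_source):
    """Pick the name of the merged instance independently of report order."""
    for source in NAME_PRIORITY:
        if source in names_by_source:
            return names_by_source[source]
    if not names_by_source:
        return None
    # Unknown datasources: fall back to alphabetical order, still deterministic.
    return names_by_source[min(names_by_source)]


nova_first = merged_name({"nova.instance": "vm1", "k8s": "VM_1"})
k8s_first = merged_name({"k8s": "VM_1", "nova.instance": "vm1"})
```

Both calls return the same name, regardless of which datasource reported first.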

1.6. One datasource stops reporting

  1. Nova host datasource asks to create nova.host entity

  2. Vitrage discovery agent datasource asks to create host (nova.host?) entity

  3. Vitrage discovery agent crashes and stops reporting

  4. In the next get_all, Vitrage discovery agent reports nothing

Expected behavior:

  • The host is not deleted

  • The data that was provided by Nova is returned
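
The expected behavior above suggests keeping the reported properties per datasource, so that one datasource can disappear without deleting the merged entity. The sketch below illustrates this idea; the class `MergedEntity`, the `discovery_agent` datasource name and the `cpu_model` property are hypothetical.

```python
class MergedEntity:
    """Keeps the latest properties per datasource for one equivalent entity."""

    def __init__(self):
        # datasource name -> properties, in first-report order
        self._by_source = {}

    def report(self, datasource, properties):
        self._by_source[datasource] = dict(properties)

    def withdraw(self, datasource):
        """Called when a datasource no longer reports the entity."""
        self._by_source.pop(datasource, None)

    @property
    def exists(self):
        # The entity survives as long as at least one datasource reports it.
        return bool(self._by_source)

    def view(self):
        """Properties returned to the API.

        Conflicts are resolved here by first-report order only; a real
        implementation would apply the configured merge strategy instead.
        """
        merged = {}
        for properties in self._by_source.values():
            merged.update(properties)
        return merged


host = MergedEntity()
host.report("nova.host", {"name": "host-1", "state": "ACTIVE"})
host.report("discovery_agent", {"name": "host-1", "cpu_model": "x86_64"})
host.withdraw("discovery_agent")  # the agent reported nothing in get_all
```

After the withdrawal the host still exists and its view contains only the properties provided by Nova.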

2. Two monitors report the same alarm (e.g. Zabbix and Prometheus)

2.1. Zabbix reports CRITICAL, Nagios reports WARNING

  1. Zabbix datasource asks to create a Zabbix alarm with severity CRITICAL

  2. Nagios datasource asks to create a Nagios alarm with severity WARNING

Expected behavior: Vitrage API returns a single alarm with a severity that depends on the merge strategy.

  Merge Strategy    Aggregated severity
  last_update       WARNING
  most_credible     CRITICAL
  worst_state       CRITICAL

2.2. Nagios reports WARNING, Zabbix reports CRITICAL, then Zabbix reports OK

  1. Nagios datasource asks to create a Nagios alarm with severity WARNING

  2. Zabbix datasource asks to create a Zabbix alarm with severity CRITICAL

  3. Zabbix datasource changes the severity to OK

Expected behavior: depends on the merge strategy.

  Merge Strategy    Aggregated severity
  last_update       OK (the alarm is deleted)
  most_credible     WARNING
  worst_state       WARNING

2.3. Zabbix, Nagios and Prometheus report the same alarm

Assume that the merge strategy is worst_state.

  1. Prometheus datasource asks to create Prometheus alarm with severity WARNING

  2. Zabbix datasource asks to create a Zabbix alarm with severity CRITICAL

  3. Nagios datasource asks to create a Nagios alarm with severity CRITICAL

Expected behavior: Vitrage API returns a single alarm with severity CRITICAL

3. Two monitors report similar yet different alarms

  1. Nagios datasource asks to create a Nagios “high CPU” alarm

  2. Zabbix datasource asks to create a Zabbix “extremely high CPU” alarm

Expected behavior: Vitrage API returns two alarms

4. A monitor reports the same alarm as a Vitrage deduced alarm

This use case is also detailed in https://review.openstack.org/#/c/547931/

4.1. Nagios reports first

  1. Nagios datasource asks to create a Nagios alarm with severity WARNING

  2. Vitrage evaluator asks to create a deduced alarm with severity CRITICAL

Expected behavior: Vitrage API returns a single alarm with severity that depends on the merge strategy.

  Merge Strategy    Aggregated severity
  last_update       CRITICAL
  most_credible     WARNING
  worst_state       CRITICAL

4.2. Nagios reports alarm, Vitrage deduced alarm, Nagios reports OK

  1. Nagios datasource asks to create a Nagios alarm

  2. Vitrage evaluator asks to create a deduced alarm with severity WARNING

  3. Nagios datasource asks to delete the Nagios alarm

Expected behavior: depends on the merge strategy.

  Merge Strategy    Aggregated severity
  last_update       OK (the alarm is deleted)
  most_credible     OK (the alarm is deleted)
  worst_state       WARNING

The behavior for worst_state strategy:

  • The alarm is not deleted (Vitrage still identifies a problem, let’s not ignore it)

  • The alarm contains all Vitrage properties

  • A diagnose action is executed, if such an action is defined

4.3. Nagios, Zabbix and Vitrage report an alarm

  1. Nagios datasource asks to create a Nagios alarm with severity WARNING

  2. Vitrage evaluator asks to create a deduced alarm with severity CRITICAL

  3. Zabbix datasource asks to create a Zabbix alarm with severity WARNING

Expected behavior: Vitrage API returns a single alarm with properties from Nagios, Zabbix and Vitrage and severity that depends on the merge strategy.

  Merge Strategy    Aggregated severity
  last_update       WARNING
  most_credible     WARNING
  worst_state       CRITICAL

5. The user changes the alarm equivalence definition

5.1. Nagios, Zabbix and Vitrage are equivalent, then the user changes it

Assume that the merge strategy is last_update.

  1. Vitrage evaluator asks to create a deduced alarm with severity WARNING

  2. Zabbix datasource asks to create a Zabbix alarm with severity WARNING

  3. Nagios datasource asks to create a Nagios alarm with severity CRITICAL

  4. Vitrage API returns a single alarm with severity CRITICAL

  5. The user changes the equivalence definition so Vitrage and Zabbix are equivalent to each other but Nagios is not equivalent to them

Expected behavior: Vitrage API returns two alarms:

  • Zabbix+Vitrage alarm with severity WARNING

  • Nagios alarm with severity CRITICAL

Note: Since in Rocky we are going to implement vitrage-graph start-up from the database, it makes no real difference whether or not the user restarts the graph after changing the equivalence definition.

5.2. Zabbix and Vitrage are equivalent, then the user makes Nagios equivalent too

Assume that the merge strategy is last_update.

  1. Vitrage evaluator asks to create a deduced alarm with severity WARNING

  2. Zabbix datasource asks to create a Zabbix alarm with severity WARNING

  3. Nagios datasource asks to create a Nagios alarm with severity CRITICAL

  4. Vitrage API returns two alarms:

    • Zabbix+Vitrage alarm with severity WARNING

    • Nagios alarm with severity CRITICAL

  5. The user changes the equivalence definition so Vitrage, Zabbix and Nagios are equivalent to each other

Expected behavior: Vitrage API returns a single alarm with severity CRITICAL
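
Use cases 5.1 and 5.2 both follow if the aggregated alarms are recomputed from the per-datasource reports whenever the definition changes. The sketch below illustrates this with the last_update strategy; the function `aggregate` and the lowercase source names are hypothetical.

```python
def aggregate(alarms, equivalence_groups):
    """Group alarms by equivalence and apply last_update within each group.

    alarms: list of (source, severity), ordered oldest to newest.
    equivalence_groups: list of sets of source names declared equivalent.
    Returns {frozenset of sources: severity}.
    """
    group_of = {}
    for group in equivalence_groups:
        for source in group:
            group_of[source] = frozenset(group)

    result = {}
    for source, severity in alarms:
        # Sources outside any group form a group of their own.
        key = group_of.get(source, frozenset([source]))
        result[key] = severity  # newer reports overwrite older ones
    return result


alarms = [("vitrage", "WARNING"), ("zabbix", "WARNING"), ("nagios", "CRITICAL")]

# Zabbix and Vitrage are equivalent; Nagios is separate.
separate = aggregate(alarms, [{"vitrage", "zabbix"}])
# All three are equivalent.
combined = aggregate(alarms, [{"vitrage", "zabbix", "nagios"}])
```

`separate` contains a Zabbix+Vitrage alarm with WARNING and a Nagios alarm with CRITICAL, while `combined` contains a single CRITICAL alarm. Moving between the two definitions in either direction only requires calling `aggregate` again on the same reports.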

6. Template on one datasource should apply to another datasource

6.1. Simple alarm equivalence

Assume that Zabbix high_cpu alarm is equivalent to Nagios HIGH_CPU alarm.

Template example:

definitions:
 entities:
  - entity:
     category: ALARM
     rawtext: high_cpu
     type: zabbix
     template_id: zabbix_alarm

scenarios:
 - scenario:
    condition: zabbix_alarm_on_host
    actions:
     - ...

  1. Nagios datasource asks to create a Nagios HIGH_CPU alarm

  2. Zabbix datasource DOES NOT ask to create a Zabbix high_cpu alarm (yet)

Expected behavior: the actions in the scenario are executed as a result of the Nagios alarm.
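
The matching implied by this use case can be sketched as follows: a template entity matches an incoming alarm if the alarm is the one named in the template or one declared equivalent to it. This is an illustration only and not the Vitrage evaluator's actual matching code; `EQUIVALENT_ALARMS`, `matches_template_entity` and the event dictionary layout are assumptions.

```python
# Hypothetical equivalence definition: each set lists (type, name) pairs
# that the user has declared equivalent.
EQUIVALENT_ALARMS = [
    {("zabbix", "high_cpu"), ("nagios", "HIGH_CPU")},
]


def matches_template_entity(template_entity, alarm):
    """Return True if the alarm is the template's alarm or an equivalent one."""
    wanted = (template_entity["type"], template_entity["rawtext"])
    actual = (alarm["type"], alarm["name"])
    if wanted == actual:
        return True
    return any(wanted in group and actual in group for group in EQUIVALENT_ALARMS)


# Mirrors the zabbix_alarm entity from the template example.
zabbix_alarm = {"category": "ALARM", "type": "zabbix", "rawtext": "high_cpu"}
nagios_event = {"type": "nagios", "name": "HIGH_CPU"}
unrelated_event = {"type": "nagios", "name": "DISK_FULL"}

nagios_matches = matches_template_entity(zabbix_alarm, nagios_event)
unrelated_matches = matches_template_entity(zabbix_alarm, unrelated_event)
```

The Nagios HIGH_CPU alarm matches the Zabbix template entity, so the scenario would be triggered; an unrelated Nagios alarm does not match.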

6.2. Simple resource equivalence

Assume that Nova host is equivalent to Vitrage discovery agent host.

Template example:

definitions:
 entities:
  - entity:
     category: RESOURCE
     type: nova.host
     template_id: nova_host
  - entity:
     category: RESOURCE
     type: discovery_host (???)
     template_id: discovery_host

scenarios:
 - scenario:
    condition: discovery_host and discovery_host_contains_instance
    actions:
     - ...

Expected behavior: the scenario will work if the host contains an instance, no matter if the host is defined by Nova or by Vitrage discovery agent.

6.3. Alarm equivalence + resource equivalence

Assume that Zabbix high_cpu alarm is equivalent to Nagios HIGH_CPU alarm and Nova host is equivalent to Vitrage discovery agent host.

Template example:

scenarios:
 - scenario:
    condition: discovery_host and discovery_host_contains_instance and
               zabbix_alarm_on_discovery_host
    actions:
     - ...

Expected behavior: the scenario will work if the host contains an instance, no matter if the host is defined by Nova or by Vitrage discovery agent; and if either a Zabbix alarm or a Nagios alarm was raised on the host.

7. Template on one datasource should not apply to another datasource

Assume that Zabbix high_cpu alarm is equivalent to Nagios HIGH_CPU alarm.

Template example:

definitions:
 entities:
  - entity:
     category: ALARM
     rawtext: high_cpu
     type: zabbix
     severity: warning
     template_id: zabbix_alarm
  - entity:
     category: ALARM
     name: HIGH_CPU
     type: nagios
     template_id: nagios_alarm

scenarios:
 - scenario:
    condition: zabbix_alarm_on_host
    actions:
     - ...

This use case is the same as 6.1, with one exception: the template entity zabbix_alarm is defined only for the case that the severity is warning. What will happen if a Nagios alarm is raised with severity warning? And what if it is raised with a different severity?

8. Overlapping templates

Is the overlapping templates mechanism somehow related to the equivalence use cases?

9. Multi Tenancy

Per-tenant equivalence

Entity equivalence should be defined for a specific tenant. One tenant may want to see Nagios and Zabbix alarms as one alarm, while the other tenant may want to see them separated.
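
A per-tenant definition could be as simple as a lookup keyed by tenant, as in the sketch below. The storage layout `TENANT_EQUIVALENCE`, the function `visible_alarm_groups` and the tenant ids are hypothetical; how the definitions are actually stored is not decided in this document.

```python
# Hypothetical per-tenant store: tenant id -> list of sets of equivalent monitors.
TENANT_EQUIVALENCE = {
    "tenant_1": [{"nagios", "zabbix"}],  # wants one merged alarm
    "tenant_2": [],                      # wants the alarms kept separate
}


def visible_alarm_groups(tenant_id, reporting_monitors):
    """Return the alarm groups that one tenant sees, as a list of sorted lists."""
    groups = []
    remaining = set(reporting_monitors)
    for equivalent in TENANT_EQUIVALENCE.get(tenant_id, []):
        merged = remaining & equivalent
        if merged:
            groups.append(sorted(merged))
            remaining -= merged
    # Monitors not covered by any definition are shown as separate alarms.
    groups.extend([monitor] for monitor in sorted(remaining))
    return groups


tenant_1_view = visible_alarm_groups("tenant_1", ["nagios", "zabbix"])
tenant_2_view = visible_alarm_groups("tenant_2", ["nagios", "zabbix"])
```

The same pair of reports yields one merged alarm for the first tenant and two separate alarms for the second.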

Cross-tenant equivalence

Is it possible that equivalent resources will be reported on different tenants?

  1. Nova instance datasource asks to create nova.instance for tenant_1

  2. k8s datasource asks to create instance (nova.instance?) with the same UUID for tenant_2

What do we do in such a case?