5.30.1. OpenStack reliability testing
- status
ready
- version
1.0
- Abstract
This document describes an abstract methodology for high-availability testing and analysis of an OpenStack cluster. Testing of the OpenStack data plane is currently out of scope and will be described in the future.
- Conventions
OpenStack cluster: a set of server nodes running a deployed and fully operational OpenStack environment in a high-availability configuration.
Fault-injection operation: one of the common types of failure that can occur in a production environment: service-hang, service-crash, network-partition, network-flapping, and node-crash.
Service-hang: faults are injected into a specified OpenStack service by sending it the POSIX signals SIGSTOP and SIGCONT.
Service-crash: faults are injected by sending the SIGKILL signal to a specified OpenStack service.
Node-crash: faults are injected into an OpenStack cluster by rebooting or shutting down a server node.
Network-partition: faults are injected by inserting iptables rules on the OpenStack cluster nodes that host the service to be network-partitioned.
Network-flapping: faults are injected by inserting and deleting iptables rules on the fly on the OpenStack cluster nodes that host the service under test.
Factor: a set of atomic fault-injection operations. For example: reboot-random-controller, reboot-random-rabbitmq.
Test plan: consists of two elements: a test scenario execution graph and fault-injection factors.
SLA: service-level agreement.
Testing-cycles: the number of test cycles run for each factor.
Inf: denotes an infinite time to auto-healing, that is, the cluster does not recover on its own after fault-factor injection.
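The signal-based and iptables-based operations defined above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used for this test plan: the function names are hypothetical, the process ID of the target service is assumed to be known already, and the iptables commands are only built as argument lists, never executed.

```python
import os
import signal
import time
from typing import List, Tuple


def inject_service_hang(pid: int, hang_seconds: float) -> None:
    """Service-hang: freeze the process with SIGSTOP, then resume it with SIGCONT."""
    os.kill(pid, signal.SIGSTOP)
    try:
        time.sleep(hang_seconds)
    finally:
        # Always resume the process, even if the wait is interrupted.
        os.kill(pid, signal.SIGCONT)


def inject_service_crash(pid: int) -> None:
    """Service-crash: terminate the process immediately with SIGKILL."""
    os.kill(pid, signal.SIGKILL)


def partition_commands(port: int) -> Tuple[List[str], List[str]]:
    """Network-partition: build the (insert, delete) iptables commands that
    drop inbound TCP traffic to `port`. The commands are returned, not run."""
    rule = ["INPUT", "-p", "tcp", "--dport", str(port), "-j", "DROP"]
    return ["iptables", "-I", *rule], ["iptables", "-D", *rule]
```

Under the same assumptions, network-flapping would amount to running the insert and delete commands alternately on the target node for the duration of the fault.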
5.30.1.1. Test Plan
5.30.1.1.1. Test Environment
This section should contain all information about the deployed OpenStack environment, including an archive of the contents of the /etc folder from all nodes.
5.30.1.1.1.1. Preparation
This section should contain all steps needed to reproduce the deployment of the OpenStack environment and the client node. For example, if the testing environment is deployed with DevStack, this section should contain all DevStack configuration files, the DevStack version, and all deployment steps.
5.30.1.1.1.2. Environment description
This section should contain all cluster hardware information, including processor model and frequency, memory size, storage type and capacity, network interfaces, and so on. A separate client node must be used to drive the tests.
5.30.1.1.1.2.1. Hardware
This section should contain a full specification of the hardware nodes.

| Component | Parameter | Value |
|---|---|---|
| SERVER | name | |
| | role | |
| | vendor,model | |
| | operating_system | |
| CPU | vendor,model | |
| | processor_count | |
| | core_count | |
| | frequency_MHz | |
| RAM | vendor,model | |
| | amount_MB | |
| NETWORK | interface_name | |
| | vendor,model | |
| | bandwidth | |
| STORAGE | dev_name | |
| | vendor,model | |
| | SSD/HDD | |
| | size | |
5.30.1.1.1.2.2. Networking
This section should contain a full description of the network equipment used in the OpenStack cluster. A network topology diagram and the network hardware configuration files should be included in this section.
5.30.1.1.2. Factors description
Describe here the factors used during the test runs. Examples are:
reboot-random-controller: a node-crash fault injection on a random OpenStack controller node.
reboot-random-rabbitmq: a node-crash fault injection on a random RabbitMQ messaging node.
sigstop-random-nova-api: a service-hang fault injection on a random nova-api service.
sigkill-random-mysql: a service-crash fault injection on a random MySQL node.
network-partition-random-mysql: a network-partition fault injection on a random MySQL node.
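One way to make the factor descriptions above machine-readable is a small registry that maps each factor name to its fault type and to a rule for picking the target node. The sketch below is an illustration only: the `Factor` class, the `role` labels, and the inventory layout are hypothetical and are not defined by this methodology.

```python
import random
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class Factor:
    name: str   # factor name as used in the test plan
    fault: str  # fault-injection operation type
    role: str   # hypothetical role label that selects candidate nodes


FACTORS = {
    f.name: f
    for f in (
        Factor("reboot-random-controller", "node-crash", "controller"),
        Factor("reboot-random-rabbitmq", "node-crash", "rabbitmq"),
        Factor("sigstop-random-nova-api", "service-hang", "nova-api"),
        Factor("sigkill-random-mysql", "service-crash", "mysql"),
        Factor("network-partition-random-mysql", "network-partition", "mysql"),
    )
}


def pick_target(factor: Factor, inventory: Dict[str, Set[str]],
                rng: random.Random) -> str:
    """Pick one random node among those that carry the factor's role."""
    candidates = sorted(host for host, roles in inventory.items()
                        if factor.role in roles)
    if not candidates:
        raise LookupError(f"no node with role {factor.role!r}")
    return rng.choice(candidates)
```

Passing an explicit `random.Random` instance keeps the node selection reproducible when a seed is recorded alongside the test results.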
5.30.1.1.3. Test Case 1: NovaServers.boot_and_delete_server
5.30.1.1.3.1. Description
This Rally scenario boots and deletes virtual instances through the OpenStack Nova API while fault factors are injected.
5.30.1.1.3.2. Service-level agreement
In this section, specify the SLA values. For example:

| Parameter | Value |
|---|---|
| MTTR (sec) | <=240 |
| Failure rate (%) | <=95 |
| Auto-healing | Yes |
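An SLA of this shape can be evaluated mechanically for each test cycle. The following sketch assumes that a cycle passes only if every listed threshold holds; the class and field names are hypothetical, and an MTTR of Inf is represented as `math.inf`.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Sla:
    max_mttr_sec: float
    max_failure_rate_pct: float
    require_auto_healing: bool


@dataclass(frozen=True)
class CycleResult:
    mttr_sec: float          # math.inf if the cluster never auto-healed
    failure_rate_pct: float
    auto_healed: bool


def sla_met(sla: Sla, result: CycleResult) -> bool:
    """Return True only if every SLA threshold holds for this cycle."""
    if sla.require_auto_healing and not result.auto_healed:
        return False
    return (result.mttr_sec <= sla.max_mttr_sec
            and result.failure_rate_pct <= sla.max_failure_rate_pct)
```

With the example values from the table, `Sla(240, 95, True)` accepts a cycle that healed in 120 seconds with a 10% failure rate and rejects any cycle whose MTTR is Inf.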
5.30.1.1.3.3. Parameters
In this section, specify the load parameters used during the test. For example:

| Parameter | Value |
|---|---|
| Runner | constant |
| Concurrency | X |
| Times | Y |
| Injection-iteration | Z |
| Testing-cycles | N |
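As an illustration of how these parameters might map onto a Rally task, the sketch below builds a task definition as a Python dictionary that can be dumped to JSON. Several parts are assumptions rather than facts from this test plan: the flavor and image names are placeholders, the `hooks` layout and the `fault_injection` hook name are assumed to match the Rally version in use and should be checked against its documentation, and the action string is left as an opaque, tool-specific value.

```python
import json
from typing import Any, Dict


def build_task(concurrency: int, times: int, injection_iteration: int,
               action: str) -> Dict[str, Any]:
    """Build a task for NovaServers.boot_and_delete_server with a constant
    runner and one fault injected at the given iteration."""
    if concurrency < 1 or times < 1:
        raise ValueError("concurrency and times must be positive")
    if not 1 <= injection_iteration <= times:
        raise ValueError("injection_iteration must lie within 1..times")
    return {
        "NovaServers.boot_and_delete_server": [{
            # Placeholder flavor and image names (assumptions).
            "args": {"flavor": {"name": "m1.tiny"}, "image": {"name": "cirros"}},
            "runner": {"type": "constant",
                       "concurrency": concurrency,
                       "times": times},
            # Assumed hook layout; verify against the installed Rally version.
            "hooks": [{
                "name": "fault_injection",
                "args": {"action": action},
                "trigger": {"name": "event",
                            "args": {"unit": "iteration",
                                     "at": [injection_iteration]}},
            }],
        }]
    }


if __name__ == "__main__":
    print(json.dumps(build_task(4, 100, 50, "ACTION"), indent=2))
```

Testing-cycles (N) is not part of the task itself in this sketch; it would correspond to running the same task N times per factor.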
5.30.1.1.3.4. List of reliability metrics

| Priority | Metric | Measurement units | Description |
|---|---|---|---|
| 1 | SLA | Boolean | Service-level agreement result |
| 2 | Auto-healing | Boolean | Whether the cluster auto-healed after fault injection |
| 3 | Failure rate | Percent | Ratio of failed test iterations |
| 4 | MTTR (auto) | Seconds | Automatic mean time to repair |
| 5 | MTTR (manual) | Seconds | Manual mean time to repair, if MTTR (auto) is Inf |
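The failure rate and the automatic recovery time can be derived from per-iteration results. This document does not fix an exact formula, so the sketch below rests on stated assumptions: each iteration is a hypothetical `(finished_at, succeeded)` pair, the failure rate is the share of failed iterations, and the recovery time of one cycle is the time from fault injection to the end of the last failed iteration, or Inf if the final iteration still failed.

```python
import math
from typing import List, Tuple

Iteration = Tuple[float, bool]  # (finish time in seconds, succeeded)


def failure_rate_pct(iterations: List[Iteration]) -> float:
    """Percentage of iterations that failed."""
    if not iterations:
        raise ValueError("no iterations recorded")
    failed = sum(1 for _, ok in iterations if not ok)
    return 100.0 * failed / len(iterations)


def recovery_time_sec(iterations: List[Iteration], injected_at: float) -> float:
    """Seconds from injection until the last failed iteration finished.

    Returns 0.0 if nothing failed after the injection and math.inf if the
    last recorded iteration still failed (no auto-healing observed).
    """
    after = sorted(it for it in iterations if it[0] >= injected_at)
    if not after:
        return 0.0
    if not after[-1][1]:
        return math.inf
    failed_ends = [end for end, ok in after if not ok]
    return max(failed_ends) - injected_at if failed_ends else 0.0
```

Averaging the finite recovery times over all testing cycles of a factor would then give a mean time to repair in the usual sense.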
5.30.1.1.3.5. Results
5.30.1.1.3.5.1. reboot-random-controller

| Cycles | MTTR(sec) | Failure rate(%) | Auto-healing | Performance degradation |
|---|---|---|---|---|
| 1 | X | Y | Yes | Yes |
| 2 | X | Y | Yes | Yes |
| 3 | X | Y | No | Yes |
| 4 | X | Y | Yes | Yes |
| 5 | X | Y | Yes | Yes |
Place a link here to the Rally report file with the test results for this factor.

| Value | MTTR | Failure rate |
|---|---|---|
| Min | X | Y |
| Max | X | Y |
| SLA | X | Y |
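The summary rows can be computed from the per-cycle results. The sketch below assumes that the SLA row simply repeats the SLA thresholds next to the observed minimum and maximum, which is one plausible reading of the table; the function and type names are hypothetical.

```python
from typing import Dict, List, Tuple

Cycle = Tuple[float, float]  # (MTTR in seconds, failure rate in percent)


def summarize(cycles: List[Cycle], sla: Cycle) -> Dict[str, Cycle]:
    """Return the Min, Max, and SLA rows for one factor."""
    if not cycles:
        raise ValueError("no cycles recorded")
    mttrs = [mttr for mttr, _ in cycles]
    rates = [rate for _, rate in cycles]
    return {
        "Min": (min(mttrs), min(rates)),
        "Max": (max(mttrs), max(rates)),
        "SLA": sla,  # thresholds, shown for comparison (assumption)
    }
```

If a cycle did not auto-heal and its MTTR is recorded as `math.inf`, the Max row reports Inf, which makes a missed SLA visible at a glance.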
5.30.1.1.3.5.2. Detailed results description
In this section, give a detailed description of the test results, including the impact of the factor.
5.30.1.1.3.5.3. reboot-random-rabbitmq

| Cycles | MTTR(sec) | Failure rate(%) | Auto-healing | Performance degradation |
|---|---|---|---|---|
| 1 | X | Y | Yes | Yes |
| 2 | X | Y | Yes | Yes |
| 3 | X | Y | No | Yes |
| 4 | X | Y | Yes | Yes |
| 5 | X | Y | Yes | Yes |
Place a link here to the Rally report file with the test results for this factor.

| Value | MTTR | Failure rate |
|---|---|---|
| Min | X | Y |
| Max | X | Y |
| SLA | X | Y |
5.30.1.1.3.5.4. Detailed results description
In this section, give a detailed description of the test results, including the impact of the factor.
5.30.1.1.4. Test Case 2: GlanceImages.create_and_delete_image
5.30.1.1.4.1. Description
This Rally scenario creates and deletes images through the OpenStack Glance API while fault factors are injected.
5.30.1.1.4.2. Service-level agreement
In this section, specify the SLA values. For example:

| Parameter | Value |
|---|---|
| MTTR (sec) | <=120 |
| Failure rate (%) | <=95 |
| Auto-healing | Yes |
5.30.1.1.4.3. Parameters
In this section, specify the load parameters used during the test. For example:

| Parameter | Value |
|---|---|
| Runner | constant |
| Concurrency | X |
| Times | Y |
| Injection-iteration | Z |
| Testing-cycles | N |
5.30.1.1.4.4. List of reliability metrics

| Priority | Metric | Measurement units | Description |
|---|---|---|---|
| 1 | SLA | Boolean | Service-level agreement result |
| 2 | Auto-healing | Boolean | Whether the cluster auto-healed after fault injection |
| 3 | Failure rate | Percent | Ratio of failed test iterations |
| 4 | MTTR (auto) | Seconds | Automatic mean time to repair |
| 5 | MTTR (manual) | Seconds | Manual mean time to repair, if MTTR (auto) is Inf |
5.30.1.1.4.5. Results
5.30.1.1.4.5.1. reboot-random-controller

| Cycles | MTTR(sec) | Failure rate(%) | Auto-healing | Performance degradation |
|---|---|---|---|---|
| 1 | X | Y | Yes | Yes |
| 2 | X | Y | Yes | Yes |
| 3 | X | Y | No | Yes |
| 4 | X | Y | Yes | Yes |
| 5 | X | Y | Yes | Yes |
Place a link here to the Rally report file with the test results for this factor.

| Value | MTTR | Failure rate |
|---|---|---|
| Min | X | Y |
| Max | X | Y |
| SLA | X | Y |
5.30.1.1.4.5.2. Detailed results description
In this section, give a detailed description of the test results, including the impact of the factor.
5.30.1.1.4.5.3. reboot-random-rabbitmq

| Cycles | MTTR(sec) | Failure rate(%) | Auto-healing | Performance degradation |
|---|---|---|---|---|
| 1 | X | Y | Yes | Yes |
| 2 | X | Y | Yes | Yes |
| 3 | X | Y | No | Yes |
| 4 | X | Y | Yes | Yes |
| 5 | X | Y | Yes | Yes |
Place a link here to the Rally report file with the test results for this factor.

| Value | MTTR | Failure rate |
|---|---|---|
| Min | X | Y |
| Max | X | Y |
| SLA | X | Y |
5.30.1.1.4.5.4. Detailed results description
In this section, give a detailed description of the test results, including the impact of the factor.
5.30.1.2. Reports
- Test plan execution reports: