5.14. Massively Distributed RPCs
- status: draft
- version: 1.0
- Abstract
This document describes a test plan for evaluating the OpenStack message bus in the context of a massively distributed cloud architecture. For instance, a massively distributed cloud addresses the use cases of Fog or Edge computing, where services are geographically distributed across a large area. For the time being, OpenStack inter-service communications rely on oslo_messaging, which defines a common abstraction over different instantiations of the message bus. Historically, broker-based implementations (e.g. RabbitMQ, Qpid) competed with brokerless implementations (e.g. ZeroMQ), but with the advent of AMQP 1.0 support in oslo_messaging, alternative messaging systems can now be envisioned, in which messages traverse a set of interconnected agents (brokers or routers) before reaching their destination.
The test plan takes place in the context of a prospective effort to evaluate the distribution of the messaging bus using emerging solutions (e.g. qpid-dispatch-router) or established ones (e.g. ZeroMQ), compared to traditional centralized solutions (e.g. RabbitMQ). Finally, the scope of the test plan is RPC communication between OpenStack services; notifications are out of scope.
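For illustration, below is a minimal sketch (not part of the original plan) of how oslo_messaging abstracts the bus: the same RPC code can target a broker-based or a routed backend by changing only the transport URL. Hosts and credentials are placeholders.

```python
from oslo_config import cfg
import oslo_messaging

# Broker-based backend (RabbitMQ) vs. routed AMQP 1.0 backend
# (qpid-dispatch-router): only the transport URL differs.
broker_url = "rabbit://guest:guest@broker-host:5672/"
router_url = "amqp://router-host:5672/"

transport = oslo_messaging.get_transport(cfg.CONF, url=broker_url)
target = oslo_messaging.Target(topic="perf_test")
client = oslo_messaging.RPCClient(transport, target)
```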
5.14.1. Test Plan
5.14.1.1. Test Environment
Most of the following test cases are synthetic tests, performed on top of oslo_messaging in isolation from any OpenStack component. The test plan is completed by operational testing, which aims to evaluate the overall behaviour of OpenStack using a similar deployment of the messaging middleware.
5.14.1.1.1. Preparation
For the synthetic tests, tools like ombt2 or simulator can be used. In the former case, the tool must be configured to use a separate control bus (e.g. RabbitMQ), distinct from the message bus under test, to avoid any unwanted perturbation of the measurements. Failure injection can leverage os_faults. Both synthetic and operational experiments can be scripted using enoslib. Finally, operational testing can leverage rally.
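As an illustration, below is a minimal sketch of a synthetic RPC server built directly on oslo_messaging, assuming a hypothetical echo endpoint that load-generating clients call. The URL, topic, and names are placeholders, not prescribed by this plan.

```python
from oslo_config import cfg
import oslo_messaging

class EchoEndpoint(object):
    """Hypothetical endpoint: returns the payload unchanged."""
    def echo(self, ctxt, payload):
        return payload

transport = oslo_messaging.get_transport(
    cfg.CONF, url="rabbit://guest:guest@bus-under-test:5672/")
target = oslo_messaging.Target(topic="perf_test", server="server-1")
server = oslo_messaging.get_rpc_server(transport, target, [EchoEndpoint()])
server.start()
server.wait()  # serve requests until stopped
```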
5.14.1.1.2. Environment description
The environment description includes hardware specification of servers, network parameters and operating system configuration.
5.14.1.1.2.1. Hardware
This section contains a list of all types of hardware nodes.
Parameter | Value | Comments
---|---|---
model | e.g. Supermicro X9SRD-F |
CPU | e.g. 6 x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz |
5.14.1.1.2.2. Network
This section contains a list of interfaces and network parameters. In the context of a massively distributed cloud (e.g. across a WAN), the network links may present different characteristics in terms of latency, bandwidth, and packet loss. These characteristics can be emulated (e.g. with tc, as in the sketch below) or be the result of a real deployment over a large geographical area. In any case, link characteristics must be described.
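For example, WAN links can be emulated with tc/netem; the sketch below drives it from Python. The interface name and link parameters are arbitrary assumptions, and root privileges are required.

```python
import subprocess

def emulate_wan(iface="eth0", delay_ms=50, jitter_ms=10, loss_pct=0.1):
    """Add delay, jitter and packet loss on an interface via netem."""
    subprocess.check_call([
        "tc", "qdisc", "add", "dev", iface, "root", "netem",
        "delay", "%dms" % delay_ms, "%dms" % jitter_ms,
        "loss", "%s%%" % loss_pct,
    ])

def reset_link(iface="eth0"):
    """Remove the emulated impairments."""
    subprocess.check_call(["tc", "qdisc", "del", "dev", iface, "root"])
```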
Parameter | Value | Comments
---|---|---
card model | e.g. Intel |
driver | e.g. ixgbe |
speed | e.g. 10G or 1G |
5.14.1.1.2.3. Software
This section describes the installed operating system and other relevant parameters (e.g. system-level settings) and software.
Parameter | Value | Comments
---|---|---
OS | e.g. Ubuntu 14.04.3 |
oslo.messaging | e.g. 5.30.0 |
Backend-specific versions must be gathered, as well as the versions of third-party tools.
RabbitMQ backend
Parameter | Value | Comments
---|---|---
RMQ server | e.g. 3.6.11 |
kombu client | e.g. 4.10 |
AMQP client | e.g. 2.2.1 |
AMQP backend
Parameter | Value | Comments
---|---|---
Qpid dispatch router | e.g. 0.8.0 |
python-qpid-proton | e.g. 0.17.0 |
pyngus | e.g. 2.2.1 |
ZeroMQ backend
Parameter | Value | Comments
---|---|---
pyzmq | e.g. 17.0.0 |
redis-server | e.g. 4.0 |
Kafka backend
Parameter | Value | Comments
---|---|---
Kafka server | e.g. 0.10.2 |
Kafka python | e.g. 1.3.4 |
Java RE | e.g. 8u144 |
5.14.1.1.2.4. Messaging middleware topology
This section describes the actual deployment of the messaging middleware. A graph may be used to illustrate the topology of the messaging entities (e.g. federated RabbitMQ clusters, a set of qdrouterd daemons).
5.14.1.1.2.5. OpenStack version
For operational testing, the OpenStack version must be specified.
5.14.1.2. Test Case 1: One single large distributed target
5.14.1.2.1. Description
In this test case, all clients send requests to the same Target and servers serve those requests. The goal of this test case is to evaluate how large a single distributed queue can be in terms of the number of clients/servers. Moreover, RPC clients and servers must be distributed evenly across the messaging components.
5.14.1.2.2. Methodology
Start:
* Provision a single RPC server on Target T
* Provision a single RPC client
* RPC client issues calls to T using a fixed delay between two messages (see the sketch below).
Repeat:
* Add additional clients until the RPC server CPU utilization exceeds 70%
* Provision another RPC server on T
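A minimal sketch of the client side of this loop, reusing the hypothetical echo endpoint from the Preparation section. The URL, delay, and message count are assumed test parameters.

```python
import time
from oslo_config import cfg
import oslo_messaging

transport = oslo_messaging.get_transport(
    cfg.CONF, url="rabbit://guest:guest@bus-under-test:5672/")
client = oslo_messaging.RPCClient(
    transport, oslo_messaging.Target(topic="perf_test"), timeout=30)

n_calls, delay = 1000, 0.01  # assumed test parameters
latencies = []
begin = time.monotonic()
for _ in range(n_calls):
    start = time.monotonic()
    client.call({}, "echo", payload="x" * 1024)  # hypothetical endpoint
    latencies.append(time.monotonic() - start)
    time.sleep(delay)  # fixed delay between two messages
elapsed = time.monotonic() - begin

print("rate: %.1f msg/sec" % (n_calls / elapsed))
print("avg latency: %.2f ms" % (1000 * sum(latencies) / len(latencies)))
```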
5.14.1.2.3. List of performance metrics
The following metrics are recorded for each repetition.
Priority | Value | Measurement Units | Description
---|---|---|---
1 | Messages rate | msg/sec | Number of calls made by the callers per second (overall and by client)
2 | Latency | ms | The round-trip latency in message processing
2 | Latency stddev | ms | Standard deviation of the latency
3 | Sent | | The number of messages sent (overall and by client)
3 | Processed | | The number of messages processed (overall and by server)
4 | Throughput | bytes/sec | Volume of raw data flowing through the bus per unit of time
Note
This test case can be run for RPC call and RPC cast.
In the case of RPC call tests, throughput and latency should be measured from the RPC client (the caller). For cast tests, latency and throughput should be measured from the RPC server, since the client does not block waiting for an ack. More specifically, the latency is the time taken by a message to reach the server. The throughput is calculated by dividing the total number of messages by the time interval between the first message sent by a client and the last message received by a server (see the sketch below).
Throughput is correlated with the message rate but depends on the actual encoding of the message payload. It can be obtained by different means, e.g. monitoring statistics from the bus itself or an estimation based on the wire protocol used. The method used must be specified clearly to allow fair comparisons.
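A sketch of that computation, given per-message timestamps collected on both sides (clocks assumed synchronized, e.g. via NTP):

```python
def cast_throughput(client_send_times, server_recv_times):
    """msg/sec between the first client send and the last server receive."""
    span = max(server_recv_times) - min(client_send_times)
    return len(server_recv_times) / span
```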
5.14.1.3. Test Case 2: Multiple distributed targets
5.14.1.3.1. Description
The objective of the test case is to evaluate how many queues can be simultaneously active and managed by the messaging middleware.
5.14.1.3.2. Methodology
Start:
* Provision a single RPC server on Target T
* Provision a single RPC client
* RPC client issues calls to T using a fixed delay between two messages.
Repeat:
* Add an additional (client, server) couple on another Target (see the sketch below).
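A sketch of the provisioning step, assuming one topic per Target and the hypothetical echo endpoint introduced earlier:

```python
from oslo_config import cfg
import oslo_messaging

class EchoEndpoint(object):
    """Hypothetical endpoint: returns the payload unchanged."""
    def echo(self, ctxt, payload):
        return payload

def make_pair(index, url="rabbit://guest:guest@bus-under-test:5672/"):
    """Provision one (client, server) couple on its own Target."""
    transport = oslo_messaging.get_transport(cfg.CONF, url=url)
    target = oslo_messaging.Target(topic="perf_test_%d" % index,
                                   server="server_%d" % index)
    # 'threading' executor so the server runs in background threads
    server = oslo_messaging.get_rpc_server(transport, target,
                                           [EchoEndpoint()],
                                           executor="threading")
    server.start()
    client = oslo_messaging.RPCClient(transport, target)
    return client, server
```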
5.14.1.3.3. List of performance metrics
The following metrics are recorded for each repetition.
Priority | Value | Measurement Units | Description
---|---|---|---
1 | Messages rate | msg/sec | Number of calls made by the callers per second (overall and by client)
2 | Latency | ms | The round-trip latency in message processing
2 | Latency stddev | ms | Standard deviation of the latency
3 | Sent | | The number of messages sent (overall and by client)
3 | Processed | | The number of messages processed (overall and by server)
4 | Throughput | bytes/sec | Volume of raw data flowing through the bus per unit of time
Note
This test case can be run for RPC call and RPC cast.
Note that throughput is less interesting in the case of cast messages, since it can be artificially high due to the lack of acks.
5.14.1.4. Test Case 3: One single large distributed fanout
5.14.1.4.1. Description
The goal of this test case is to evaluate the ability of the message bus to handle a large fanout.
5.14.1.4.2. Methodology
Start:
* Provision a single RPC server on Target T
* Provision a single RPC client
* RPC client issues fanout casts to T (see the sketch below):
- 1 cast every second
- n messages
Repeat:
* Add an additional RPC server on T
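A sketch of the fanout client, where the URL, topic, endpoint name, and message count are assumptions rather than prescribed values:

```python
import time
from oslo_config import cfg
import oslo_messaging

transport = oslo_messaging.get_transport(
    cfg.CONF, url="rabbit://guest:guest@bus-under-test:5672/")
client = oslo_messaging.RPCClient(
    transport, oslo_messaging.Target(topic="perf_test"))

n_messages = 100  # assumed test parameter
fanout = client.prepare(fanout=True)
for seq in range(n_messages):
    # Every RPC server listening on the topic receives each cast;
    # 'consume' is a hypothetical endpoint method and no reply is sent.
    fanout.cast({}, "consume", seq=seq)
    time.sleep(1.0)  # 1 cast every second
```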
5.14.1.4.3. List of performance metrics
The following metrics are recorded for each repetition.
Priority | Value | Measurement Units | Description
---|---|---|---
1 | Latency | ms | Latency
2 | Sent | | The number of messages sent (overall and by client)
2 | Processed | | The number of messages processed (overall and by server)
Note
In the case of a fanout cast, no acks are sent back to the sender. The latency is the time interval between the moment the message is sent and the moment it has been received by all the servers (see the sketch below).
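A sketch of that definition, given the client send timestamp and one receive timestamp per server (clocks assumed synchronized):

```python
def fanout_latency(send_time, server_recv_times):
    """Elapsed time until the last server has received the cast."""
    return max(server_recv_times) - send_time
```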
5.14.1.5. Test Case 4: Multiple distributed fanouts
5.14.1.5.1. Description
The goal of this test case is to scale up the number of fanouts handled by the message bus.
5.14.1.5.2. Methodology
Start:
* Provision n RPC servers on Target T
* Provision a single RPC client
* RPC client issues fanout casts to T:
- 1 cast every second
- m messages
Repeat:
* Add (n RPC servers, 1 RPC client) on another Target
5.14.1.5.3. List of performance metrics
The following metrics are recorded for each repetition.
Priority | Value | Measurement Units | Description
---|---|---|---
1 | Latency | ms | Latency
2 | Sent | | The number of messages sent (overall and by client)
2 | Processed | | The number of messages processed (overall and by server)
Note
In the case of a fanout cast, no acks are sent back to the sender. The latency is the time interval between the moment the message is sent and the moment it has been received by all the servers.
5.14.1.6. Test Case 5: Resilience
5.14.1.6.1. Description
Usual centralized solutions offer mechanisms to increase their scalability while providing high availability (e.g. RabbitMQ clustering and mirroring). These solutions fit the single-datacenter case well but do not cope with the distributed case, where high latency between communicating entities can be observed. In a massively distributed case, communicating entities may also fail more often (link down, hardware failure). The goal of this test case is to evaluate the resiliency of the messaging layer in the face of failures.
5.14.1.6.2. Methodology
The messaging infrastructure must be configured in such a way as to ensure that functionality is preserved upon the loss of any one messaging component (e.g. three rabbit brokers in a cluster, two alternate paths across a router mesh, etc.). Each messaging client must be configured with a fail-over address for re-connecting to the message bus should its primary connection fail (see the oslo.messaging documentation for TransportURL addresses), as in the sketch below.
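For instance, a sketch of a fail-over transport URL listing two alternate brokers (hosts and credentials are placeholders):

```python
from oslo_config import cfg
import oslo_messaging

# If the connection to broker-1 fails, the client reconnects to broker-2.
url = "rabbit://guest:guest@broker-1:5672,guest:guest@broker-2:5672/"
transport = oslo_messaging.get_transport(cfg.CONF, url=url)
```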
The test environment is the same as that for Test Case 1: One single large distributed target, with the caveat that each process comprising the message bus maintains a steady-state CPU load of approximately 50%. In other words, the test traffic should maintain a reasonable and consistent load on the message bus without overloading it. Additionally, the test will be based on RPC call traffic. RPC cast traffic is sent “least effort”: cast messages are more likely to be dropped than calls since there are no return ACKs in the case of cast.
Start:
* Provision the test environment as described above
Phase 1:
* reboot one component of the message bus (e.g. a single rabbit broker in the cluster)
* wait until the component recovers and the message bus returns to steady state
Repeat:
* Phase 1 for each component of the message bus
Phase 2:
* force the failure of one of the TCP connections linking two components of the message bus (e.g. the connection between two rabbit brokers in a cluster).
* wait 60 seconds
* restore the connection
* wait until the message bus stabilizes
Repeat:
* Phase 2 for each TCP connection connecting any two message bus components.
Phase 3:
* force the failure of one of the TCP connections linking one client and its connected endpoint
* wait until the client reconnects using a fail-over address
* restore the connection
Repeat:
* Phase 3 for each TCP connection connecting one client and the bus (see the sketch below)
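Phases 2 and 3 can be implemented, for example, by dropping TCP traffic towards a peer with iptables, as in the sketch below (peer address and port are assumptions; requires root). Failure-injection frameworks such as os_faults provide higher-level alternatives.

```python
import subprocess
import time

def cut_connection(peer_ip, port=5672):
    """Simulate a link failure by dropping TCP traffic towards a peer."""
    subprocess.check_call(["iptables", "-A", "OUTPUT", "-d", peer_ip,
                           "-p", "tcp", "--dport", str(port), "-j", "DROP"])

def restore_connection(peer_ip, port=5672):
    """Remove the DROP rule, restoring the connection."""
    subprocess.check_call(["iptables", "-D", "OUTPUT", "-d", peer_ip,
                           "-p", "tcp", "--dport", str(port), "-j", "DROP"])

cut_connection("10.0.0.2")    # e.g. the peer broker/router (placeholder IP)
time.sleep(60)                # Phase 2: wait 60 seconds
restore_connection("10.0.0.2")
```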
Note
Message bus backends are likely to offer specific ways to know when a steady state is reached after the recovery of an agent (e.g. polling the RabbitMQ API for the cluster status). This must be clearly stated in the test results.
5.14.1.6.3. List of performance metrics
Priority | Value | Measurement Units | Description
---|---|---|---
1 | Call failures | | Total number of RPC call operations that failed, grouped by exception type
2 | Reboot recovery | seconds | The average time between a component reboot and recovery of the message bus
2 | Reconnect recovery | seconds | The average time between the restoration of an internal TCP connection and the recovery of the message bus
3 | Message duplication | | Total number of duplicated messages received by the servers
5.14.1.7. Common metrics to all test cases
For each agent involved in the communication middleware, metrics about its resource consumption under load must be gathered.
Priority | Value | Measurement Units | Description
---|---|---|---
1 | CPU load | MHz | CPU load
2 | RAM consumption | GB | RAM consumption
3 | Opened connections | | Number of TCP sockets opened
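These metrics can be sampled, for example, with psutil on each host running a messaging agent; the sketch below is one assumed way to do it (per-process sampling, with units converted as needed):

```python
import psutil

def sample_agent(pid):
    """One sample of CPU, RAM and TCP sockets for a messaging agent."""
    proc = psutil.Process(pid)
    return {
        "cpu_percent": proc.cpu_percent(interval=1.0),
        "rss_bytes": proc.memory_info().rss,
        "tcp_connections": len(proc.connections(kind="tcp")),
    }
```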
5.14.1.8. Test Case 6: Operational testing
5.14.1.8.1. Description
Operational testing intends to evaluate the correct behaviour of a running OpenStack on top of a specific deployment of the messaging middleware. This test case aims to assess the behaviour of OpenStack under WAN conditions at the messaging-plane level. It relies on rally, which runs workloads on the deployed OpenStack. Rally reports can then be used to get the execution time of operations and the failure rate in order to evaluate OpenStack. The chosen Rally scenarios are those known to be intensive on the messaging layer (e.g. Neutron and Nova scenarios).
5.14.1.8.2. List of performance metrics
Since rally is used, the performance metrics are those reported by the framework.
5.14.2. Reports
Results of Test Case 1 are available here.