Finding a Keystone bug while testing 20 node HA cloud performance at creating 400 VMs

(Contributed by Alexander Maretskiy, Mirantis)

Below we describe how we found a bug in Keystone and, after fixing it, roughly doubled the average performance of booting Nova servers. Our initial goal was to measure the performance of booting a large number of servers on a cluster (running on a custom build of Mirantis OpenStack v5.1) and to ensure that this operation performs reasonably and completes without errors.

Goal

  • Get data on how a cluster behaves when a large number of servers is started
  • Get data on how well the Neutron component performs in this case

Summary

  • Creating 400 servers with configured networking
  • Servers are created concurrently, 5 at a time

Hardware

The tests were run on a real hardware lab with 20 nodes:

Vendor  SUPERMICRO SUPERSERVER
CPU     12 cores, Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
RAM     32GB (4 x Samsung DDRIII 8GB)
HDD     1TB

Cluster

This cluster was created via the Fuel Dashboard interface.

Deployment         Custom build of Mirantis OpenStack v5.1
OpenStack release  Icehouse
Operating System   Ubuntu 12.04.4
Mode               High availability
Hypervisor         KVM
Networking         Neutron with GRE segmentation
Controller nodes   3
Compute nodes      17

Rally

Version

For this test case, we used a custom build of Rally with the following patch applied:

https://review.openstack.org/#/c/96300/

Deployment

Rally was registered against the cluster using the ExistingCloud deployment type.

Server flavor

$ nova flavor-show ram64
+----------------------------+--------------------------------------+
| Property                   | Value                                |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                |
| OS-FLV-EXT-DATA:ephemeral  | 0                                    |
| disk                       | 0                                    |
| extra_specs                | {}                                   |
| id                         | 2e46aba0-9e7f-4572-8b0a-b12cfe7e06a1 |
| name                       | ram64                                |
| os-flavor-access:is_public | True                                 |
| ram                        | 64                                   |
| rxtx_factor                | 1.0                                  |
| swap                       |                                      |
| vcpus                      | 1                                    |
+----------------------------+--------------------------------------+

Server image

$ glance image-show d1c116f4-3c38-4aa6-8fa1-f7a28c4e72a6
+------------------+--------------------------------------+
| Property         | Value                                |
+------------------+--------------------------------------+
| checksum         | 053ad369d58aa98afb1d355aa16b0663     |
| container_format | bare                                 |
| created_at       | 2018-01-09T06:23:18Z                 |
| disk_format      | qcow2                                |
| id               | d1c116f4-3c38-4aa6-8fa1-f7a28c4e72a6 |
| min_disk         | 0                                    |
| min_ram          | 0                                    |
| name             | TestVM                               |
| owner            | 01cb845eee6449cea4381865a1270736     |
| protected        | False                                |
| size             | 5254208                              |
| status           | active                               |
| tags             | []                                   |
| updated_at       | 2018-01-09T06:23:18Z                 |
| virtual_size     | None                                 |
| visibility       | public                               |
+------------------+--------------------------------------+

Task configuration file (in JSON format):

{
   "NovaServers.boot_server": [
       {
           "args": {
               "flavor": {
                   "name": "ram64"
               },
               "image": {
                   "name": "TestVM"
               }
           },
           "runner": {
               "type": "constant",
               "concurrency": 5,
               "times": 400
           },
           "context": {
               "neutron_network": {
                   "network_ip_version": 4
               },
               "users": {
                   "concurrent": 30,
                   "users_per_tenant": 5,
                   "tenants": 5
               },
               "quotas": {
                   "neutron": {
                       "subnet": -1,
                       "port": -1,
                       "network": -1,
                       "router": -1
                   }
               }
           }
       }
   ]
}

The only difference between the first and second runs is that runner.times was set to 500 for the first run.
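To make the workload implied by the task configuration above explicit, here is a minimal Python sketch that derives a few figures from it. The embedded JSON is a trimmed copy of the configuration (second run), and the `summarize` helper and its output keys are hypothetical names introduced only for this illustration; they are not part of Rally.

```python
import json

# Trimmed copy of the task configuration shown above (second run).
TASK_JSON = """
{
  "NovaServers.boot_server": [
    {
      "runner": {"type": "constant", "concurrency": 5, "times": 400},
      "context": {
        "users": {"concurrent": 30, "users_per_tenant": 5, "tenants": 5}
      }
    }
  ]
}
"""


def summarize(task_json, scenario="NovaServers.boot_server"):
    """Return basic workload figures derived from a task config (illustrative helper)."""
    entry = json.loads(task_json)[scenario][0]
    runner = entry["runner"]
    users = entry["context"]["users"]
    return {
        "iterations": runner["times"],
        "concurrency": runner["concurrency"],
        # Average number of boots each of the concurrent workers handles.
        "iterations_per_worker": runner["times"] / runner["concurrency"],
        "tenants": users["tenants"],
        # Temporary users created by the "users" context across all tenants.
        "total_users": users["tenants"] * users["users_per_tenant"],
    }


summary = summarize(TASK_JSON)
print(summary)
```

Under these assumptions, the 400 boot requests are spread across 25 temporary users in 5 tenants, with 5 requests in flight at any moment.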

Results

First run - a bug was found:

Starting from the 142nd server, novaclient returned an error: Error <class 'novaclient.exceptions.Unauthorized'>: Unauthorized (HTTP 401).

That is how the bug in Keystone was found.

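+------------------+-----------+-----------+-----------+---------------+---------------+---------+-------+
| action           | min (sec) | avg (sec) | max (sec) | 90 percentile | 95 percentile | success | count |
+------------------+-----------+-----------+-----------+---------------+---------------+---------+-------+
| nova.boot_server | 6.507     | 17.402    | 100.303   | 39.222        | 50.134        | 26.8%   | 500   |
| total            | 6.507     | 17.402    | 100.303   | 39.222        | 50.134        | 26.8%   | 500   |
+------------------+-----------+-----------+-----------+---------------+---------------+---------+-------+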

Second run, with bugfix:

After the patch was applied (using RPC instead of the Neutron client in the metadata agent), we got a 100% success rate and a roughly 2x improvement in average boot time:

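+------------------+-----------+-----------+-----------+---------------+---------------+---------+-------+
| action           | min (sec) | avg (sec) | max (sec) | 90 percentile | 95 percentile | success | count |
+------------------+-----------+-----------+-----------+---------------+---------------+---------+-------+
| nova.boot_server | 5.031     | 8.008     | 14.093    | 9.616         | 9.716         | 100.0%  | 400   |
| total            | 5.031     | 8.008     | 14.093    | 9.616         | 9.716         | 100.0%  | 400   |
+------------------+-----------+-----------+-----------+---------------+---------------+---------+-------+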
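The "2x" figure can be checked directly from the two result tables. The short Python sketch below does that arithmetic, using only the numbers reported above; the variable and function names are illustrative. Note that the two runs used different iteration counts (500 vs 400), and the success percentage is rounded, so this is a rough comparison rather than a controlled measurement.

```python
# Figures copied from the two result tables (times in seconds).
first_run = {"avg": 17.402, "max": 100.303, "success": 0.268, "count": 500}
second_run = {"avg": 8.008, "max": 14.093, "success": 1.0, "count": 400}


def speedup(before, after):
    """Ratio of the old duration to the new one (greater than 1 means faster)."""
    return before / after


avg_speedup = speedup(first_run["avg"], second_run["avg"])
max_speedup = speedup(first_run["max"], second_run["max"])

# Approximate number of servers that booted successfully in each run.
booted_first = round(first_run["success"] * first_run["count"])
booted_second = round(second_run["success"] * second_run["count"])

print(f"average boot time speedup: {avg_speedup:.2f}x")
print(f"worst-case boot time speedup: {max_speedup:.2f}x")
print(f"servers booted: {booted_first} of {first_run['count']} (first run), "
      f"{booted_second} of {second_run['count']} (second run)")
```

The average boot time improves by a little more than 2x, while the worst-case boot time improves by a much larger factor, which suggests the fix mainly removed the slow outliers.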