Cyborg is a service for managing accelerators, such as FPGAs and GPUs. To schedule an instance that needs accelerators, Cyborg needs to work with Nova at three levels:
The first two aspects are addressed in [1]. This spec addresses the attachment of accelerators to instances, via os-acc. For FPGAs, Cyborg also needs to interact with Glance for fetching bitstreams. Some aspects of that are covered in [2]. This spec will address the interaction of Cyborg and Glance in the compute node.
This spec is common to all accelerators, including GPUs, High Precision Time Synchronization (HPTS) cards, etc. Since FPGAs involve more considerations than other devices, some sections focus on FPGA-specific factors; the spec calls out those FPGA-specific aspects explicitly.
Smart NICs based on FPGAs fall into two categories: those which expose the FPGA explicitly to the host, and those that do not. Cyborg’s current scope includes the former. This spec includes such devices, though the Cyborg-Neutron interaction is out of scope.
The scope of this spec is the Rocky release.
Here is an example diagram for an FPGA with multiple regions, and multiple functions in a region:
      PCI A    PCI B
        |        |
+-------|--------|-------------------+
|       |        |                   |
|  +----|--------|---+   +--------+  |
|  | +--|--+ +---|-+ |   |        |  |
|  | | Fn A| | Fn B| |   |        |  |
|  | +-----+ +-----+ |   |        |  |
|  +-----------------+   +--------+  |
|       Region 1          Region 2   |
|                                    |
+------------------------------------+
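The hierarchy in the diagram (a device exposing one or more regions, each carrying zero or more functions with their own PCI endpoints) can be modeled with a small sketch. The class names and BDFs below are illustrative assumptions, not Cyborg's actual data model:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Function:
    name: str
    pci_bdf: str  # each function surfaces as its own PCI endpoint

@dataclass
class Region:
    name: str
    functions: List[Function] = field(default_factory=list)

@dataclass
class FPGA:
    regions: List[Region] = field(default_factory=list)

# The device pictured above: Region 1 carries Fn A and Fn B,
# Region 2 is an empty (unprogrammed) region.
fpga = FPGA(regions=[
    Region("Region 1", [Function("Fn A", "0000:5e:00.0"),
                        Function("Fn B", "0000:5e:00.1")]),
    Region("Region 2"),
])
```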
Once Nova has picked a compute node for placement of an instance that needs accelerators, the following steps need to happen:
The behavior of each of these steps needs to be specified.
In addition, the OpenStack Compute API [3] specifies the operations that can be done on an instance. The behavior with respect to accelerators must be defined for each of these operations. That, in turn, determines when Nova Compute calls os-acc.
Please see [1]. We intend to support FPGAaaS with request time programming, and AFaaS (both pre-programmed and orchestrator-programmed scenarios).
Cyborg will discover accelerator resources whenever the Cyborg agent starts up. PCI hot plug may be supported after the Rocky release.
Cyborg must support all instance operations mentioned in OpenStack Compute API [3] in Rocky, except booting off a snapshot and live migration.
The OpenStack Compute API [3] mentions the list of operations that can be performed on an instance. Of these, some will not be supported by Cyborg in Rocky. The list of supported operations (with the intended behaviors) are as follows:
The following instance operations are not supported in this release:
Cyborg will develop a new library named os-acc. That library will offer the APIs listed later in this section. Nova Compute calls these APIs if it sees that the requested flavor refers to the CUSTOM_ACCELERATOR resource class, except for the initialize() call, which is made unconditionally. Nova Compute calls these APIs asynchronously, as suggested below:
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(os_acc.<api>, *args)
    # do other stuff
    try:
        data = future.result()
    except Exception:
        pass  # handle exceptions
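A concrete, runnable version of the pattern above, with a stub standing in for the os-acc call (the stub, its arguments, and its return value are assumptions for illustration only):

```python
from concurrent.futures import ThreadPoolExecutor

def plug_stub(instance_info, rp, extra_specs):
    # Stand-in for os_acc.plug(); returns the attachment info dict.
    return {"pci_id": "0000:5e:00.0"}

with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(plug_stub, {"uuid": "inst-1"}, "rp-1", {})
    # Nova Compute can do other work here while the call proceeds.
    try:
        data = future.result(timeout=30)
    except Exception:
        data = None  # handle/log the failure

print(data)  # {'pci_id': '0000:5e:00.0'}
```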
The APIs of os-acc are as below:
initialize()
- Raises CyborgAgentUnavailable exception if the Cyborg Agent cannot be contacted.

plug(instance_info, selected_rp, flavor_extra_specs)
- Returns { "pci_id": <pci bdf> }

unplug(instance_info)
- Called when an instance is stopped, shelved, or deleted and before a resize or cold migration.
- As part of this call, Cyborg Agent will clean up internal resources, call the appropriate Cyborg driver to clean up the device resources and update its data structures persistently.
- Returns the number of accelerators that were released. Errors may cause exceptions to be thrown.
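Nova Compute's decision to invoke plug() and unplug() hinges on whether the flavor requests the CUSTOM_ACCELERATOR resource class. A minimal sketch of that check, assuming Placement-style ``resources:*`` extra-spec keys (the exact key format Nova uses is an assumption here):

```python
def needs_accelerator(flavor_extra_specs):
    # A flavor asks for an accelerator via a Placement-style resource
    # entry such as {"resources:CUSTOM_ACCELERATOR": "1"}.
    for key, value in flavor_extra_specs.items():
        if key.startswith("resources") and "CUSTOM_ACCELERATOR" in key:
            if int(value) > 0:
                return True
    return False
```

initialize() is called unconditionally; the remaining APIs are called only when such a check passes.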
The pseudocode for each os-acc API can be expressed as below:
def initialize():
    # Check that all devices are discovered and their traits published;
    # wait if any discovery operation is ongoing.
    return None

def plug(instance_info, rp, extra_specs):
    validate_params(....)
    glance = glanceclient.Client(...)
    driver = ...         # select Cyborg driver for chosen rp
    rp_deployable = ...  # get deployable for RP
    if extra_specs refers to ``CUSTOM_FPGA_<vendor>_REGION_<uuid>`` and
            extra_specs refers to ``bitstream:<uuid>``:
        bitstream = glance.images.data(image_uuid)
        driver.program(bitstream, rp_deployable, ...)
    if extra_specs refers to ``CUSTOM_FPGA_<vendor>_FUNCTION_<uuid>`` and
            extra_specs refers to function UUID/name:
        region_type_uuid = ...  # fetch from selected RP
        bitstreams = glance.images.list(...)
        # Query Glance by function UUID/name property and region type
        # UUID to get matching bitstreams.
        if len(bitstreams) > 1:
            error(...)  # bitstream choice policy is outside Cyborg
        driver.program(bitstreams[0], rp_deployable, ...)
    pci_bdf = driver.allocate_handle(...)
    # Update Cyborg DB with instance_info and BDF usage.
    return {"pci_id": pci_bdf}

def unplug(instance_info):
    bdf_list = ...  # fetch BDF usage from Cyborg DB for instance
    # Update Cyborg DB to mark those BDFs as free.
    return len(bdf_list)
For each vendor driver supported in this release, we need to integrate the corresponding FPGA type(s) in the CI infrastructure.
The behavior with respect to accelerators during various instance operations (reboot, pause, etc.) must be documented. The procedure to upload a bitstream, including applying Glance properties, must also be documented.
[1] Cyborg Nova Scheduling Specification
[2] Cyborg bitstream metadata standardization spec
[3] OpenStack Server API Concepts
Except where otherwise noted, this document is licensed under Creative Commons Attribution 3.0 License. See all OpenStack Legal Documents.