This started as an attempt to update ivshmem_device_spec.txt for clarity, accuracy and completeness while working on its code, and quickly became a full rewrite. Since the diff would be useless anyway, I'm using the opportunity to rename the file to ivshmem-spec.txt.

I tried hard to ensure the new text contradicts neither the old text nor the code. If the new text contradicts the old text but not the code, it's probably a bug in the old text. If the new text contradicts both, it's probably a bug in the new text.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-Id: <1458066895-20632-11-git-send-email-armbru@redhat.com>
2 changed files with 243 additions and 161 deletions
@ -0,0 +1,243 @@ |
|||
= Device Specification for Inter-VM shared memory device = |
|||
|
|||
The Inter-VM shared memory device (ivshmem) is designed to share a |
|||
memory region between multiple QEMU processes running different guests |
|||
and the host. In order for all guests to be able to pick up the |
|||
shared memory area, it is modeled by QEMU as a PCI device exposing |
|||
said memory to the guest as a PCI BAR. |
|||
|
|||
The device can use a shared memory object on the host directly, or it |
|||
can obtain one from an ivshmem server. |
|||
|
|||
In the latter case, the device can additionally interrupt its peers, and |
|||
get interrupted by its peers. |
|||
|
|||
|
|||
== Configuring the ivshmem PCI device == |
|||
|
|||
There are two basic configurations: |
|||
|
|||
- Just shared memory: -device ivshmem,shm=NAME,... |
|||
|
|||
This uses shared memory object NAME. |
|||
|
|||
- Shared memory plus interrupts: -device ivshmem,chardev=CHR,vectors=N,... |
|||
|
|||
An ivshmem server must already be running on the host. The device |
|||
connects to the server's UNIX domain socket via character device |
|||
CHR. |
|||
|
|||
Each peer gets assigned a unique ID by the server. IDs must be |
|||
between 0 and 65535. |
|||
|
|||
Interrupts are message-signaled by default (MSI-X). With msi=off |
|||
the device has no MSI-X capability, and uses legacy INTx instead. |
|||
vectors=N configures the number of vectors to use. |
|||
|
|||
For more details on ivshmem device properties, see The QEMU Emulator |
|||
User Documentation (qemu-doc.*). |
|||
|
|||
|
|||
== The ivshmem PCI device's guest interface == |
|||
|
|||
The device has vendor ID 1af4, device ID 1110, revision 0. |
|||
|
|||
=== PCI BARs === |
|||
|
|||
The ivshmem PCI device has two or three BARs: |
|||
|
|||
- BAR0 holds device registers (256 Byte MMIO) |
|||
- BAR1 holds MSI-X table and PBA (only when using MSI-X) |
|||
- BAR2 maps the shared memory object |
|||
|
|||
There are two ways to use this device: |
|||
|
|||
- If you only need the shared memory part, BAR2 suffices. This way, |
|||
you have access to the shared memory in the guest and can use it as |
|||
you see fit. Memnic, for example, uses ivshmem this way from guest |
|||
user space (see http://dpdk.org/browse/memnic). |
|||
|
|||
- If you additionally need the capability for peers to interrupt each |
|||
other, you need BAR0 and, if using MSI-X, BAR1. You will most |
|||
likely want to write a kernel driver to handle interrupts. Requires |
|||
the device to be configured for interrupts, obviously. |
|||
|
|||
If the device is configured for interrupts, BAR2 is initially invalid.
It becomes safely accessible only after the ivshmem server has provided
the shared memory. Guest software should wait for the IVPosition
register (described below) to become non-negative before accessing
BAR2.
|||
|
|||
The device cannot tell guest software whether it is configured for
interrupts.
|||
|
|||
=== PCI device registers === |
|||
|
|||
BAR 0 contains the following registers: |
|||
|
|||
    Offset  Size  Access      On reset  Function
        0     4   read/write        0   Interrupt Mask
                                        bit 0: peer interrupt
                                        bit 1..31: reserved
        4     4   read/write        0   Interrupt Status
                                        bit 0: peer interrupt
                                        bit 1..31: reserved
        8     4   read-only   0 or -1   IVPosition
       12     4   write-only      N/A   Doorbell
                                        bit 0..15: vector
                                        bit 16..31: peer ID
       16   240   none            N/A   reserved
|||
|
|||
Software should only access the registers as specified in column |
|||
"Access". Reserved bits should be ignored on read, and preserved on |
|||
write. |
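As a rough illustration of the register layout and of the rule that
reserved bits are preserved on write, the offsets and a mask update
could be written in C as below. All identifiers (ivshmem_reg,
IVSHMEM_INTR_PEER, ivshmem_set_peer_irq_mask) are made up for this
sketch; they are not taken from QEMU or from any existing driver.

```c
#include <stdint.h>

/* Byte offsets of the registers in BAR0, as listed in the table. */
enum ivshmem_reg {
    IVSHMEM_INTR_MASK   = 0,
    IVSHMEM_INTR_STATUS = 4,
    IVSHMEM_IVPOSITION  = 8,
    IVSHMEM_DOORBELL    = 12
};

/* Bit 0 of Interrupt Mask and Interrupt Status: peer interrupt. */
#define IVSHMEM_INTR_PEER UINT32_C(1)

/*
 * Compute the value to write back to Interrupt Mask so that only
 * bit 0 changes and the reserved bits 1..31 keep the value that was
 * read (hypothetical helper, for illustration only).
 */
static inline uint32_t ivshmem_set_peer_irq_mask(uint32_t old_mask, int enable)
{
    return enable ? (old_mask | IVSHMEM_INTR_PEER)
                  : (old_mask & ~IVSHMEM_INTR_PEER);
}
```

A driver would read Interrupt Mask, pass the value through the helper,
and write the result back to the same offset.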
|||
|
|||
The Interrupt Status and Interrupt Mask registers together control the
legacy INTx interrupt. They have an effect only when the device has no
MSI-X capability: INTx is then asserted while the bit-wise AND of
Status and Mask is non-zero. Interrupt Status Register bit 0 becomes 1
when an interrupt request from a peer is received. Reading the
register clears it.
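The INTx condition can be stated as a one-line predicate. This is only
a model of the rule described in the text, not the device's code; the
function name and the has_msix parameter are assumptions made for the
example.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the INTx level: asserted only when the device has no MSI-X
 * capability and the bit-wise AND of Status and Mask is non-zero.
 */
static inline bool ivshmem_intx_asserted(uint32_t status, uint32_t mask,
                                         bool has_msix)
{
    return !has_msix && (status & mask) != 0;
}
```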
|||
|
|||
IVPosition Register: if the device is not configured for interrupts, |
|||
this is zero. Else, it's -1 for a short while after reset, then |
|||
changes to the device's ID (between 0 and 65535). |
|||
|
|||
There is no good way for software to find out whether the device is |
|||
configured for interrupts. A positive IVPosition means interrupts, |
|||
but zero could be either. The initial -1 cannot be reliably observed. |
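What guest software can and cannot conclude from a raw IVPosition read
might be sketched as follows. The enum and function names are
hypothetical. The raw 32-bit value is treated as negative when its top
bit is set, which covers the initial -1.

```c
#include <stdint.h>

enum ivshmem_ivpos_state {
    IVSHMEM_IVPOS_NOT_READY,  /* negative: configured for interrupts, setup pending */
    IVSHMEM_IVPOS_AMBIGUOUS,  /* zero: no interrupts, or interrupts with ID 0 */
    IVSHMEM_IVPOS_INTERRUPTS  /* positive: configured for interrupts, value is the ID */
};

/* Classify a raw 32-bit read of the IVPosition register. */
static inline enum ivshmem_ivpos_state ivshmem_classify_ivposition(uint32_t raw)
{
    if (raw > (uint32_t)INT32_MAX) {
        return IVSHMEM_IVPOS_NOT_READY;   /* negative when read as signed */
    }
    if (raw == 0) {
        return IVSHMEM_IVPOS_AMBIGUOUS;
    }
    return IVSHMEM_IVPOS_INTERRUPTS;
}
```

Software waiting for BAR2 to become valid would poll until the result
is no longer IVSHMEM_IVPOS_NOT_READY.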
|||
|
|||
Doorbell Register: writing this register requests to interrupt a peer. |
|||
The written value's high 16 bits are the ID of the peer to interrupt, |
|||
and its low 16 bits select an interrupt vector. |
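For example, the value to write could be composed like this (the
helper name is invented for the sketch):

```c
#include <stdint.h>

/*
 * Build a Doorbell value: peer ID in bits 16..31, interrupt vector in
 * bits 0..15.
 */
static inline uint32_t ivshmem_doorbell_value(uint16_t peer_id, uint16_t vector)
{
    return ((uint32_t)peer_id << 16) | vector;
}
```

Requesting vector 1 of peer 3 thus means writing 0x00030001 to offset
12 of BAR0.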
|||
|
|||
If the device is not configured for interrupts, the write is ignored. |
|||
|
|||
If interrupt setup hasn't completed, the write is ignored. The device
cannot tell guest software whether setup is complete. Interrupts can
regress to this state on migration.
|||
|
|||
If the peer with the requested ID isn't connected, or it has too few
interrupt vectors connected to cover the selected vector, the write is
ignored. The device cannot tell guest software what peers are
connected, or how many interrupt vectors are connected.
|||
|
|||
If the peer doesn't use MSI-X, its Interrupt Status register is set to
1. This asserts INTx unless masked by the Interrupt Mask register. In
this case, the device cannot communicate the interrupt vector to guest
software.
|||
|
|||
If the peer uses MSI-X, the interrupt for this vector becomes pending. |
|||
There is no way for software to clear the pending bit, and a polling |
|||
mode of operation is therefore impossible with MSI-X. |
|||
|
|||
With multiple MSI-X vectors, different vectors can be used to indicate |
|||
different events have occurred. The semantics of interrupt vectors |
|||
are left to the application. |
|||
|
|||
|
|||
== Interrupt infrastructure == |
|||
|
|||
When configured for interrupts, the peers share eventfd objects in |
|||
addition to shared memory. The shared resources are managed by an |
|||
ivshmem server. |
|||
|
|||
=== The ivshmem server === |
|||
|
|||
The server listens on a UNIX domain socket. |
|||
|
|||
For each new client that connects to the server, the server |
|||
- picks an ID, |
|||
- creates eventfd file descriptors for the interrupt vectors, |
|||
- sends the ID and the file descriptor for the shared memory to the |
|||
new client, |
|||
- sends connect notifications for the new client to the other clients |
|||
(these contain file descriptors for sending interrupts), |
|||
- sends connect notifications for the other clients to the new client, |
|||
and |
|||
- sends interrupt setup messages to the new client (these contain file |
|||
descriptors for receiving interrupts). |
|||
|
|||
When a client disconnects from the server, the server sends disconnect |
|||
notifications to the other clients. |
|||
|
|||
The next section describes the protocol in detail. |
|||
|
|||
If the server terminates without sending disconnect notifications for
its connected clients, the clients can elect to continue. They can
communicate with each other normally, but won't receive disconnect
notifications when a peer disconnects, and no new clients can connect.
There is no way for the clients to connect to a restarted server. The
device cannot tell guest software whether the server is still up.
|||
|
|||
Example server code is in contrib/ivshmem-server/. Not to be used in |
|||
production. It assumes all clients use the same number of interrupt |
|||
vectors. |
|||
|
|||
A standalone client is in contrib/ivshmem-client/. It can be useful |
|||
for debugging. |
|||
|
|||
=== The ivshmem Client-Server Protocol === |
|||
|
|||
An ivshmem device configured for interrupts connects to an ivshmem |
|||
server. This section details the protocol between the two. |
|||
|
|||
The connection is one-way: the server sends messages to the client. |
|||
Each message consists of a single 8 byte little-endian signed number, |
|||
and may be accompanied by a file descriptor via SCM_RIGHTS. Both |
|||
client and server close the connection on error. |
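Decoding the 8-byte payload of one message could look like the
following sketch. It handles only the number. Receiving the
accompanying file descriptor via SCM_RIGHTS is left out. The function
name is an assumption made for the example.

```c
#include <stdint.h>
#include <string.h>

/*
 * Decode one message payload: 8 bytes, little-endian, signed.
 * The bytes are assembled into an unsigned value first and then
 * copied into an int64_t to avoid relying on signed conversion rules.
 */
static inline int64_t ivshmem_msg_decode(const unsigned char buf[8])
{
    uint64_t bits = 0;
    int64_t value;

    for (int i = 7; i >= 0; i--) {
        bits = (bits << 8) | buf[i];   /* buf[0] is the least significant byte */
    }
    memcpy(&value, &bits, sizeof(value));
    return value;
}
```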
|||
|
|||
On connect, the server sends the following messages in order: |
|||
|
|||
1. The protocol version number, currently zero. The client should |
|||
close the connection on receipt of versions it can't handle. |
|||
|
|||
2. The client's ID. This is unique among all clients of this server. |
|||
IDs must be between 0 and 65535, because the Doorbell register |
|||
provides only 16 bits for them. |
|||
|
|||
3. The number -1, accompanied by the file descriptor for the shared |
|||
memory. |
|||
|
|||
4. Connect notifications for existing other clients, if any. This is |
|||
a peer ID (number between 0 and 65535 other than the client's ID), |
|||
repeated N times. Each repetition is accompanied by one file |
|||
descriptor. These are for interrupting the peer with that ID using |
|||
vector 0,..,N-1, in order. If the client is configured for fewer |
|||
vectors, it closes the extra file descriptors. If it is configured |
|||
for more, the extra vectors remain unconnected. |
|||
|
|||
5. Interrupt setup. This is the client's own ID, repeated N times. |
|||
Each repetition is accompanied by one file descriptor. These are |
|||
for receiving interrupts from peers using vector 0,..,N-1, in |
|||
order. If the client is configured for fewer vectors, it closes |
|||
the extra file descriptors. If it is configured for more, the |
|||
extra vectors remain unconnected. |
|||
|
|||
From then on, the server sends these kinds of messages: |
|||
|
|||
6. Connection / disconnection notification. This is a peer ID. |
|||
|
|||
- If the number comes with a file descriptor, it's a connection |
|||
notification, exactly like in step 4. |
|||
|
|||
- Else, it's a disconnection notification for the peer with that ID. |
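A client that has already received the protocol version and its own ID
could sort the remaining messages roughly as sketched below. The names
are hypothetical. Treating the client's own ID without a file
descriptor as invalid is an assumption of this sketch, not something
the text above specifies.

```c
#include <stdbool.h>
#include <stdint.h>

enum ivshmem_msg_kind {
    IVSHMEM_MSG_SHARED_MEMORY,    /* -1 with fd: shared memory (step 3) */
    IVSHMEM_MSG_INTERRUPT_SETUP,  /* own ID with fd (step 5) */
    IVSHMEM_MSG_PEER_CONNECT,     /* other ID with fd (steps 4 and 6) */
    IVSHMEM_MSG_PEER_DISCONNECT,  /* other ID without fd (step 6) */
    IVSHMEM_MSG_INVALID           /* anything else: close the connection */
};

/* Classify a message that follows the version and the client's ID. */
static inline enum ivshmem_msg_kind
ivshmem_classify_msg(int64_t value, bool has_fd, int64_t own_id)
{
    if (value == -1) {
        return has_fd ? IVSHMEM_MSG_SHARED_MEMORY : IVSHMEM_MSG_INVALID;
    }
    if (value < 0 || value > 65535) {
        return IVSHMEM_MSG_INVALID;       /* IDs must fit in 16 bits */
    }
    if (value == own_id) {
        /* Assumption: own ID without an fd is treated as an error. */
        return has_fd ? IVSHMEM_MSG_INTERRUPT_SETUP : IVSHMEM_MSG_INVALID;
    }
    return has_fd ? IVSHMEM_MSG_PEER_CONNECT : IVSHMEM_MSG_PEER_DISCONNECT;
}
```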
|||
|
|||
Known bugs: |
|||
|
|||
* The protocol changed incompatibly in QEMU 2.5. Before, messages |
|||
were native endian long, and there was no version number. |
|||
|
|||
* The protocol is poorly designed. |
|||
|
|||
=== The ivshmem Client-Client Protocol === |
|||
|
|||
An ivshmem device configured for interrupts receives eventfd file |
|||
descriptors for interrupting peers and getting interrupted by peers |
|||
from the server, as explained in the previous section. |
|||
|
|||
To interrupt a peer, the device writes the 8-byte integer 1 in native |
|||
byte order to the respective file descriptor. |
|||
|
|||
To receive an interrupt, the device reads and discards as many 8-byte |
|||
integers as it can. |
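On Linux, the two operations might be sketched as follows. The helper
names are invented. The sketch assumes the descriptors are
non-blocking, so that draining stops once nothing is left to read, and
it omits EINTR handling. ivshmem_demo_eventfd() merely stands in for a
descriptor that would normally arrive from the server.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Stand-in for a descriptor normally received from the ivshmem server. */
static inline int ivshmem_demo_eventfd(void)
{
    return eventfd(0, EFD_NONBLOCK);
}

/* Interrupt a peer: write the 8-byte integer 1 in native byte order. */
static inline int ivshmem_notify(int fd)
{
    uint64_t one = 1;

    return write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Receive an interrupt: read and discard as many 8-byte integers as
 * possible.  Returns 1 if at least one was read, 0 otherwise.
 */
static inline int ivshmem_drain(int fd)
{
    uint64_t count;
    int received = 0;

    while (read(fd, &count, sizeof(count)) == (ssize_t)sizeof(count)) {
        received = 1;
    }
    return received;
}
```

Because an eventfd accumulates writes into a counter, several
notifications sent before the receiver runs are seen as a single
interrupt.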
|||
@ -1,161 +0,0 @@ |
|||
|
|||
Device Specification for Inter-VM shared memory device |
|||
------------------------------------------------------ |
|||
|
|||
The Inter-VM shared memory device is designed to share a memory region (created |
|||
on the host via the POSIX shared memory API) between multiple QEMU processes |
|||
running different guests. In order for all guests to be able to pick up the |
|||
shared memory area, it is modeled by QEMU as a PCI device exposing said memory |
|||
to the guest as a PCI BAR. |
|||
The memory region does not belong to any guest, but is a POSIX memory object on |
|||
the host. The host can access this shared memory if needed. |
|||
|
|||
The device also provides an optional communication mechanism between guests |
|||
sharing the same memory object. More details about that in the section 'Guest to |
|||
guest communication' section. |
|||
|
|||
|
|||
The Inter-VM PCI device |
|||
----------------------- |
|||
|
|||
From the VM point of view, the ivshmem PCI device supports three BARs. |
|||
|
|||
- BAR0 is a 1 Kbyte MMIO region to support registers and interrupts when MSI is |
|||
not used. |
|||
- BAR1 is used for MSI-X when it is enabled in the device. |
|||
- BAR2 is used to access the shared memory object. |
|||
|
|||
It is your choice how to use the device but you must choose between two |
|||
behaviors : |
|||
|
|||
- basically, if you only need the shared memory part, you will map BAR2. |
|||
This way, you have access to the shared memory in guest and can use it as you |
|||
see fit (memnic, for example, uses it in userland |
|||
http://dpdk.org/browse/memnic). |
|||
|
|||
- BAR0 and BAR1 are used to implement an optional communication mechanism |
|||
through interrupts in the guests. If you need an event mechanism between the |
|||
guests accessing the shared memory, you will most likely want to write a |
|||
kernel driver that will handle interrupts. See details in the section 'Guest |
|||
to guest communication' section. |
|||
|
|||
The behavior is chosen when starting your QEMU processes: |
|||
- no communication mechanism needed, the first QEMU to start creates the shared |
|||
memory on the host, subsequent QEMU processes will use it. |
|||
|
|||
- communication mechanism needed, an ivshmem server must be started before any |
|||
QEMU processes, then each QEMU process connects to the server unix socket. |
|||
|
|||
For more details on the QEMU ivshmem parameters, see qemu-doc documentation. |
|||
|
|||
|
|||
Guest to guest communication |
|||
---------------------------- |
|||
|
|||
This section details the communication mechanism between the guests accessing |
|||
the ivhsmem shared memory. |
|||
|
|||
*ivshmem server* |
|||
|
|||
This server code is available in qemu.git/contrib/ivshmem-server. |
|||
|
|||
The server must be started on the host before any guest. |
|||
It creates a shared memory object then waits for clients to connect on a unix |
|||
socket. All the messages are little-endian int64_t integer. |
|||
|
|||
For each client (QEMU process) that connects to the server: |
|||
- the server sends a protocol version, if client does not support it, the client |
|||
closes the communication, |
|||
- the server assigns an ID for this client and sends this ID to him as the first |
|||
message, |
|||
- the server sends a fd to the shared memory object to this client, |
|||
- the server creates a new set of host eventfds associated to the new client and |
|||
sends this set to all already connected clients, |
|||
- finally, the server sends all the eventfds sets for all clients to the new |
|||
client. |
|||
|
|||
The server signals all clients when one of them disconnects. |
|||
|
|||
The client IDs are limited to 16 bits because of the current implementation (see |
|||
Doorbell register in 'PCI device registers' subsection). Hence only 65536 |
|||
clients are supported. |
|||
|
|||
All the file descriptors (fd to the shared memory, eventfds for each client) |
|||
are passed to clients using SCM_RIGHTS over the server unix socket. |
|||
|
|||
Apart from the current ivshmem implementation in QEMU, an ivshmem client has |
|||
been provided in qemu.git/contrib/ivshmem-client for debug. |
|||
|
|||
*QEMU as an ivshmem client* |
|||
|
|||
At initialisation, when creating the ivshmem device, QEMU first receives a |
|||
protocol version and closes communication with server if it does not match. |
|||
Then, QEMU gets its ID from the server then makes it available through BAR0 |
|||
IVPosition register for the VM to use (see 'PCI device registers' subsection). |
|||
QEMU then uses the fd to the shared memory to map it to BAR2. |
|||
eventfds for all other clients received from the server are stored to implement |
|||
BAR0 Doorbell register (see 'PCI device registers' subsection). |
|||
Finally, eventfds assigned to this QEMU process are used to send interrupts in |
|||
this VM. |
|||
|
|||
*PCI device registers* |
|||
|
|||
From the VM point of view, the ivshmem PCI device supports 4 registers of |
|||
32-bits each. |
|||
|
|||
enum ivshmem_registers { |
|||
IntrMask = 0, |
|||
IntrStatus = 4, |
|||
IVPosition = 8, |
|||
Doorbell = 12 |
|||
}; |
|||
|
|||
The first two registers are the interrupt mask and status registers. Mask and |
|||
status are only used with pin-based interrupts. They are unused with MSI |
|||
interrupts. |
|||
|
|||
Status Register: The status register is set to 1 when an interrupt occurs. |
|||
|
|||
Mask Register: The mask register is bitwise ANDed with the interrupt status |
|||
and the result will raise an interrupt if it is non-zero. However, since 1 is |
|||
the only value the status will be set to, it is only the first bit of the mask |
|||
that has any effect. Therefore interrupts can be masked by setting the first |
|||
bit to 0 and unmasked by setting the first bit to 1. |
|||
|
|||
IVPosition Register: The IVPosition register is read-only and reports the |
|||
guest's ID number. The guest IDs are non-negative integers. When using the |
|||
server, since the server is a separate process, the VM ID will only be set when |
|||
the device is ready (shared memory is received from the server and accessible |
|||
via the device). If the device is not ready, the IVPosition will return -1. |
|||
Applications should ensure that they have a valid VM ID before accessing the |
|||
shared memory. |
|||
|
|||
Doorbell Register: To interrupt another guest, a guest must write to the |
|||
Doorbell register. The doorbell register is 32-bits, logically divided into |
|||
two 16-bit fields. The high 16-bits are the guest ID to interrupt and the low |
|||
16-bits are the interrupt vector to trigger. The semantics of the value |
|||
written to the doorbell depends on whether the device is using MSI or a regular |
|||
pin-based interrupt. In short, MSI uses vectors while regular interrupts set |
|||
the status register. |
|||
|
|||
Regular Interrupts |
|||
|
|||
If regular interrupts are used (due to either a guest not supporting MSI or the |
|||
user specifying not to use them on startup) then the value written to the lower |
|||
16-bits of the Doorbell register results is arbitrary and will trigger an |
|||
interrupt in the destination guest. |
|||
|
|||
Message Signalled Interrupts |
|||
|
|||
An ivshmem device may support multiple MSI vectors. If so, the lower 16-bits |
|||
written to the Doorbell register must be between 0 and the maximum number of |
|||
vectors the guest supports. The lower 16 bits written to the doorbell is the |
|||
MSI vector that will be raised in the destination guest. The number of MSI |
|||
vectors is configurable but it is set when the VM is started. |
|||
|
|||
The important thing to remember with MSI is that it is only a signal, no status |
|||
is set (since MSI interrupts are not shared). All information other than the |
|||
interrupt itself should be communicated via the shared memory region. Devices |
|||
supporting multiple MSI vectors can use different vectors to indicate different |
|||
events have occurred. The semantics of interrupt vectors are left to the |
|||
user's discretion. |
|||