GPU offloading - PRIME - proof of concept

Mar 12, 2010 16:16

THIS IS A PROOF OF CONCEPT - it's not going to go upstream unless someone else dedicates their life to it (btw, anyone know anyone at ASUS?)

NVIDIA unveiled their Optimus GPU selection solution for Windows 7, so I decided to see what it would take to implement something similar under DRI. I've named it PRIME for obvious reasons.

Goals:
1. Allow a second GPU to render 3D apps onto the screen of the first, selectable from the client side.
2. Target just the rendering side; I'm assuming the GPU power up/down is similar to what was done for the older switching method.

Restrictions + limitations:
1. Must have a compositing manager running
2. Must have a second screen configured for the slave card (it doesn't need to be used)

Test system:
Intel 945 IGP + Radeon r200 PCI card - yes, this won't be a speed demon.

Terms:
Master: the IGP displaying the output - intel
Slave: the GPU rendering the app - radeon r200 in this case.

Step 1: kernel support

http://git.kernel.org/?p=linux/kernel/git/airlied/drm-testing.git;a=shortlog;h=refs/heads/drm-prime-test
http://cgit.freedesktop.org/~airlied/drm/log/?h=prime-test

The kernel requirements were simple: we needed a way to share a memory-managed object between two kernel device drivers.
The kernel has a GEM namespace per device, but this isn't good enough for sharing with other devices, so I introduced a new PRIME namespace with two ioctls. One ioctl allows the master device to associate a device buffer handle with a name in the PRIME namespace, and the other allows the slave device to associate a PRIME namespace handle with a buffer. When the master creates a prime buffer, the kernel associates the list of pages with the handle; when the slave looks up the same handle, it retrieves the list of pages and fakes up a TTM buffer populated with those pages as backing store. I've added the concept of a slave object to TTM to allow for this.

The drm repo contains the API wrappers + intel + radeon pieces to call the association functions for buffer objects.
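To make the flow concrete, here's a rough userspace sketch of the two ioctls. The struct layouts, names and ioctl numbers below are purely illustrative - the real definitions live in the branches linked above.

```c
/* Illustrative only: a hypothetical PRIME ioctl interface, not the
 * actual definitions from the drm-prime-test branch. */
#include <stdint.h>
#include <sys/ioctl.h>

struct prime_name_buffer {      /* master: GEM handle -> PRIME name */
    uint32_t handle;            /* in:  handle in the master's GEM namespace */
    uint32_t name;              /* out: name in the shared PRIME namespace */
};

struct prime_open_buffer {      /* slave: PRIME name -> local handle */
    uint32_t name;              /* in:  PRIME name obtained from the master */
    uint32_t handle;            /* out: handle to a TTM buffer faked up
                                 *      around the master's pages */
};

#define PRIME_IOCTL_NAME _IOWR('d', 0x40, struct prime_name_buffer)
#define PRIME_IOCTL_OPEN _IOWR('d', 0x41, struct prime_open_buffer)

/* master side: export a buffer into the PRIME namespace */
static int prime_name(int master_fd, uint32_t handle, uint32_t *name)
{
    struct prime_name_buffer arg = { .handle = handle };

    if (ioctl(master_fd, PRIME_IOCTL_NAME, &arg) < 0)
        return -1;
    *name = arg.name;           /* hand this name to the slave side */
    return 0;
}

/* slave side: import the shared buffer by its PRIME name */
static int prime_open(int slave_fd, uint32_t name, uint32_t *handle)
{
    struct prime_open_buffer arg = { .name = name };

    if (ioctl(slave_fd, PRIME_IOCTL_OPEN, &arg) < 0)
        return -1;
    *handle = arg.handle;       /* backed by the same pages as the master's */
    return 0;
}
```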

Step 2: DRI2 protocol

http://people.freedesktop.org/~airlied/prime/0001-dri2proto-add-prime-token.patch
http://people.freedesktop.org/~airlied/prime/0001-prime-support-for-mesa.patch

From the X server's point of view, a recent change to the DRI2 layer allowed multiple device driver names to be associated with a DRI2 endpoint; currently the client can request either a DRI or a VDPAU device name. First I extended the DRI2 protocol to add a new buffer type, called PRIME, and added a hack to mesa's GLX loader to request the prime driver if an environment variable was specified.
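The loader hack amounts to something like the sketch below; the token value and the environment variable name are my assumptions for illustration, not necessarily what the patch uses.

```c
/* Illustrative sketch of the mesa GLX loader hack: ask DRI2 for the
 * PRIME driver name when an environment variable is set. */
#include <stdlib.h>

enum dri2_driver_type {
    DRI2_DRIVER_DRI   = 0,      /* existing */
    DRI2_DRIVER_VDPAU = 1,      /* existing */
    DRI2_DRIVER_PRIME = 2,      /* new token added for this hack */
};

static enum dri2_driver_type dri2_pick_driver_type(void)
{
    if (getenv("DRI_PRIME"))    /* e.g. DRI_PRIME=1 glxgears */
        return DRI2_DRIVER_PRIME;
    return DRI2_DRIVER_DRI;
}
```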

Step 3: X server DRI2 module + drivers

http://people.freedesktop.org/~airlied/prime/0001-intel-add-prime-master-support.patch
http://cgit.freedesktop.org/~airlied/xf86-video-ati/log/?h=prime-test
http://people.freedesktop.org/~airlied/prime/0001-dri2-prime-hackfest.patch

This was the messiest bit and still requires a lot of change. First up, I added an interface for drivers to register as PRIME masters and slaves; for my demo the intel driver registers as master and radeon as slave, and we store these in an array. When a client connects and requests the prime driver, we mark the drawable and redirect the DRI2 buffer creation requests to the slave screen's driver. The drm authentication is also sent to both kernel drms. DRI2 then hooks the swapbuffers command, where it does a region copy: it redirects this to the slave driver and damages the pixmap in the master driver.

Now the "interesting" part: my original implementation simply grabbed the window pixmap at DRI2 create-buffers time, but there is an ordering issue with compositing - that pixmap is the pre-composite-redirection one, so it isn't actually the pixmap you want to tell the kernel to bind to both GPUs. This functioned badly; I could see gears stretched all over the front buffer.
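The buffer-creation redirect mentioned above is conceptually simple; in pseudo-driver terms it looks something like this (all names hypothetical, written against the DRI2 screen hooks):

```c
/* Illustrative only: route DRI2 buffer creation for a prime-marked
 * drawable to the slave screen's driver instead of the master's. */
static DRI2BufferPtr dri2_create_buffer(DrawablePtr draw,
                                        unsigned int attachment,
                                        unsigned int format)
{
    ScreenPtr screen = draw->pScreen;

    if (drawable_is_prime(draw))        /* marked at connect time */
        screen = prime_slave_screen(screen);

    /* hand off to whichever driver owns that screen (radeon here) */
    return dri2_get_screen(screen)->CreateBuffer(draw, attachment, format);
}
```

The ordering problem above is about which pixmap the buffers created this way ultimately point at.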

So, a quick coke + chocolate break later, I had enough sugar to bash out the hack that now exists. DRI2 calls the slave driver's copy region callback, which checks whether the drawable's pixmap is on the same screen. If it's not, it checks whether we've marked the pixmap as a prime pixmap (i.e. one that belongs to the master). If it is, it swaps in the slave's copy; otherwise it calls back into DRI2. This callback calls into the intel driver to make the buffer object backing the pixmap shareable and return its handle, then calls into radeon with the handle to create a new pixmap pointing at the shared buffer object. Once all that is done, radeon copies the back buffer to the shared front pixmap, we return, damage is posted, and the compositor grabs the window pixmap and displays it.
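In the same pseudo-driver style, the slave's copy region hook ends up looking roughly like this (again, every helper name here is made up; the real code is in the patches above):

```c
/* Illustrative only: the slave (radeon) copy region callback with the
 * prime pixmap swap described above. */
static ScreenPtr slave_screen;  /* the radeon screen, set at screen init */

static void radeon_dri2_copy_region(DrawablePtr draw, RegionPtr region,
                                    DRI2BufferPtr dst, DRI2BufferPtr src)
{
    PixmapPtr pixmap = get_drawable_pixmap(draw);

    if (pixmap->drawable.pScreen != slave_screen) {
        if (is_prime_pixmap(pixmap)) {
            /* the master already shared it: use the slave's copy */
            pixmap = lookup_slave_pixmap(pixmap);
        } else {
            /* call back into DRI2 -> intel to make the backing bo
             * shareable, then wrap the shared bo in a radeon pixmap */
            uint32_t name = dri2_prime_share_pixmap(pixmap);
            pixmap = radeon_pixmap_from_prime_name(slave_screen, name);
        }
    }
    /* blit the slave back buffer (src) into the shared front pixmap;
     * the damage posted afterwards lets the compositor pick it up */
    radeon_copy_region(src, pixmap, region);
}
```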

So does it work?
On my blisteringly fast test system with X + xcompmgr running, glxgears was going at 150fps from the r200 PCI card. Hopefully I can get some time on a faster system or one of the dual-GPU laptops.

Caveats:
- When a window manager is running, the gears get all corrupted; this looks like the clipping and/or stride matching between the drivers isn't correct. I suspect something with reparenting and decorations - I'm not enough of an X guru to understand this yet, hopefully one of the other hackers can fill me in. Also, before the window gets reparented and redirected, a frame can land on the real front buffer; again, clipping should take care of this, but it isn't working yet. I need to work out how clipping and that stuff works in X/DRI2. - solution: talk to ppl about clipping, then JDI (just do it).
- Once a client has connected as prime, we don't tear it down properly, so later clients can end up marked as prime. - solution: work out some sort of resource tracking to turn stuff off.
- Reference counting on the pages in the kernel is iffy; currently i915 ups the page list refcount but never drops it. - solution: JDI
- hardcoded /dev/dri paths in DRI2 for the slave device. - solution: JDI
- the radeon driver could in theory be a prime master too. - solution: JDI
- nouveau could also support prime master/slave. - solution: nouveau guys JDI
- requires an ugly second screen in xorg.conf to load the slave driver (roughly like the snippet after this list). Can we have a 0-sized screen, or maybe a rootless second screen? - solution: rearchitect the X server to allow drivers without screens (6 months - 1 year of work)
- pageflipping needs to be hacked off in the intel driver. - solution: work it out and then JDI
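
For reference, the dummy slave screen setup looks roughly like this in xorg.conf (identifiers and the BusID are for my test box and purely illustrative):

```
# Illustrative xorg.conf fragment: a second screen that exists only to
# get the slave driver loaded.
Section "Device"
    Identifier "master"
    Driver     "intel"
EndSection

Section "Device"
    Identifier "slave"
    Driver     "radeon"
    BusID      "PCI:1:0:0"
EndSection

Section "Screen"
    Identifier "Screen0"
    Device     "master"
EndSection

Section "Screen"
    Identifier "Screen1"        # unused, just loads the slave driver
    Device     "slave"
EndSection

Section "ServerLayout"
    Identifier "Layout0"
    Screen  0  "Screen0"
    Screen  1  "Screen1" RightOf "Screen0"
EndSection
```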

Where is the video?
Once I get it working with a window manager on a useful machine I might do a video of two gears going.

Where now?
Well, this is a purely academic exercise so far; after a week of kernel fighting I decided to do something new and cool. To make this as good as Windows we'd need to seriously re-architect the X server + drivers. At the moment you can't load an X driver without having a screen to attach it to; I don't really want a screen for the slave driver, but I still have to have one all set up, doing nothing and hopefully not getting in the way. We'd need to separate screens and drivers a lot better. Some sort of dynamic screens would probably fall out of this work if someone decides to actually do it.

The kernel bits aren't as ugly as I thought, but I'm not sure upstreaming them is a good idea without the other bits. The refcounting definitely needs work, as does the cleanup when clients exit.

DRI2 needs some more changes; I might try to flesh it out a bit more and then talk to krh about a sane interface.

I'm probably going to get force-task-switched quite soon, so I might just get this running on a W500 or T500 before dropping it for 6 months - so if anyone wants a neat project to play with and has the hw, feel free to try and take this on.

ASUS, feel free to send me one of the real Optimus laptops and I'll get the nouveau guys hooked up and try to RE the nvidia DMA engine.