Aros/Developer/Docs/HIDD/ATI
Introduction
- 2006: ATI driver covering Radeon 7000 through X600
- 2011: ATI R400 and R500 support added
- 20??: ATI HD (R600 and up)
- 20??: ATI Southern Islands (R800 and up) - could look at the Vulkan API for this
The initial 2D ATI driver was written in the mid-2000s.
Some work was done to implement 2D acceleration for R4xx and R5xx cards; they use nearly the same functions as the older cards, but the code has not been tested. There were some problems switching resolution with the ATI X1300.
IGPs have been hit or miss with this 2011 update so far; some work better than others.
Gallium 3D
- Before the HD chipsets: no Gallium
- HD chipsets and later: Gallium support
- Southern Islands (SI): Vulkan API with an OpenGL wrapper
https://docs.mesa3d.org/gallium/index.html
Trying to port DRM to get 3D acceleration to work... 3D acceleration must work before 2D acceleration can work on R600/Evergreen cards, because 2D output goes through the 3D engine: dedicated 2D acceleration was removed from R600/Evergreen and later cards.
For the moment there is only software rendering for R600/R700 cards, and some of the 2D functions do not work properly either.
DRM
Next step: port most of the code of drm/radeon/ so it can be compiled with the radeon driver. See http://www.kernel.org/ under drivers/gpu/drm/ and drivers/gpu/drm/radeon/.
Now trying to set up the GART table to get the ring buffer working. See the function r600_do_init_cp(...) in drivers/gpu/drm/radeon/r600_cp.c.
The code in workbench/hidd/hidd.nouveau contains BOTH nouveau-specific and generic DRM code, so a large portion of the generic code should be reusable in the radeon driver. As for GART, what you basically need to do is provide the driver with memory allocated on 4 KiB boundaries, simulating pages. The nouveau driver already has the necessary code to map this into the card's address space; presumably the same holds for the radeon driver. A sketch of the idea follows.
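A minimal sketch of that GART idea, with hypothetical helpers: memory handed to the driver is aligned on 4 KiB boundaries to simulate pages, and each page's bus address is written into the GART table so the card can map it into its own address space. cpu_to_bus() is a placeholder for whatever the OS uses to translate a CPU address to a bus/physical address, and the entry format (address plus a valid bit) is illustrative only; real chips differ.

#include <stddef.h>
#include <stdint.h>

#define GART_PAGE_SIZE 4096

extern uint64_t cpu_to_bus(void *cpu_addr);   /* hypothetical OS helper */

static int gart_bind_pages(uint32_t *gart_table, size_t first_slot,
                           void *mem, size_t num_pages)
{
    uintptr_t addr = (uintptr_t)mem;
    size_t i;

    if (addr & (GART_PAGE_SIZE - 1))
        return -1;                            /* must be 4 KiB aligned */

    for (i = 0; i < num_pages; i++) {
        uint64_t bus = cpu_to_bus((void *)(addr + i * GART_PAGE_SIZE));
        /* Illustrative entry format: bus address with a "valid" bit */
        gart_table[first_slot + i] = (uint32_t)bus | 0x1;
    }
    return 0;
}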
Dri2 - locking
Dri1 - locking
[edit | edit source]Hardware Locking for the Direct Rendering Infrastructure Rickard E. Faith, Jens Owen, Kevin E. Martin Precision Insight, Inc. $Date: 1999/05/11 22:45:24 $, $Revision: 1.6 $
This paper examines the locking requirements of typical DMA-based hardware currently available for the PC platform with respect to the Direct Rendering Infrastructure (DRI). A locking algorithm is described. This algorithm enhances a typical kernel-call-based blocking lock with the single optimization that the entity that held the lock most recently can reacquire the lock without using the kernel call. The algorithm relies on atomic instructions and is constant in time and space. Familiarity with the DRI design documents is assumed [OM98, MFOA99].
1. Preamble
1.1 Copyright
Copyright © 1999 by Precision Insight, Inc., Cedar Park, Texas. All Rights Reserved.
Permission is granted to make and distribute verbatim copies of this document provided the copyright notice and this permission notice are preserved on all copies.
1.2 Trademarks
OpenGL is a registered trademark of Silicon Graphics, Inc. Unix is a registered trademark of The Open Group. The `X' device and X Window System are trademarks of The Open Group. XFree86 is a trademark of The XFree86 Project. Linux is a registered trademark of Linus Torvalds. Intel is a registered trademark of Intel Corporation. All other trademarks mentioned are the property of their respective owners.
2. Locking Requirements
Although some cards may support some subset of simultaneous accesses, typical Intel-based PC graphics hardware (costing less than US$2000 in 1998) does not allow simultaneous, interleaved access to the frame buffer memory, the MMIO-based command FIFO, and the DMA-based command FIFO. For example, typical cards will allow frame buffer and command FIFO activity while processing updates to the MMIO registers that control the hardware cursor position. Some cards will atomically add commands sent via DMA to the command FIFO, permitting some kinds of simultaneous MMIO and DMA. However, other cards will intermingle MMIO and DMA commands in the command FIFO on a word-by-word basis, thereby completely corrupting the command FIFO. Other cards are not robust enough to permit frame buffer access while other operations are occurring.
Because of these limitations, typical hardware will require a single (per-device) lock that cooperatively restricts access to the hardware by the X server, the kernel, and 3D direct-rendering clients. This section briefly outlines how the hardware lock would typically be used by each of the three types of DRI entity. Some hardware may not require locking in all of the instances listed in this section.
2.1 Kernel
DMA Buffer Dispatch
When the hardware is ready for another DMA dispatch from the kernel's DMA buffer queue, the kernel will obtain the hardware lock. If the notification that another DMA dispatch is possible came via a hardware interrupt, then the kernel may have to reset the interrupt on the hardware. If the current GLXContext is different from the GLXContext required by the next DMA buffer to be dispatched, then the kernel will have to update the hardware graphics context, possibly via a callback to the X server (see below).
Graphics Context Save and Restore
If the X server is performing hardware graphics context switches on behalf of the kernel, then the kernel will hold the hardware lock on behalf of the X server while the context is switched. The X server may perform the context switch by doing MMIO or by issuing a request to dispatch DMA immediately (without obtaining another lock), and will then issue an ioctl to tell the kernel that the context has been switched and it is now safe to proceed with DMA dispatches.
Vertical Retrace
When an operation (e.g., a DMA buffer dispatch) must be synchronized with the next vertical retrace, the kernel will obtain the lock and poll or wait for an interrupt before performing the operation.
2.2 3D Client Library: Software Fallbacks
The 3D client library will obtain the hardware lock when performing software fallbacks that require direct hardware access (e.g., access to the frame buffer). During this time, the client may issue high-priority blocking DMA requests that bypass the normal kernel DMA buffer queues (and that do not require the kernel to obtain the lock).
The 3D client library assumes that the kernel DMA buffer queue for the current GLXContext has been flushed, that other kernel DMA buffer queues are halted (implied by taking the lock), that the hardware is quiescent, and that the current full graphics context is available (including textures and display lists). The initial DRI implementation will assume that this context is stored on the hardware, not in the SAREA.
2.3 X Server
Hardware Cursor
Depending on the requirements of the graphics hardware, the lock may be needed to move the hardware cursor. (At the time of this writing, we have not identified any hardware that requires locking for hardware cursor movement, but we note the possibility so that future driver implementors can check for it.)
2D Rendering Without Changing Window Offsets
2D rendering by the X server will require the lock, and may require quiescent hardware (depending on the operation being performed and the hardware being used).
Region Modifications
Region modifications (e.g., moving windows) require the lock and quiescent hardware. Further, all of the kernel DMA buffer queues must be flushed (it may be possible to compute an optimized set of DMA queues that must be flushed, based on the window to be moved and the GLXContexts touched by that window). After the region modification, the window-changed ID will be updated (this update may not have to be locked).
DGA
When the application program makes use of the XFree86-DGA protocol extension, the X server will hold the lock on behalf of the DGA client. XFree86-DGA must be supported for clients that access the hardware directly but that do not have knowledge of the DRI.
OpenGL/Fullscreen
A new protocol extension that allows fullscreen OpenGL access within the framework of the DRI will be provided. The server will issue an ioctl to halt all kernel DMA buffer queues for existing GLXContexts. From that point onward, all newly created GLXContexts will have operational (i.e., unhalted) kernel DMA buffer queues. This implies that the client must issue the OpenGL/Fullscreen request before creating any GLXContexts.
3. Optimization Opportunities
3.1 Analysis
The X server, 3D clients, and the kernel will share the lock cooperatively. Since client processes can die while holding the lock, some process must detect client death and reclaim the lock. The X server can detect client death when the connection used for the X protocol fails. However, in the case of a UNIX Domain Socket (i.e., a named pipe), timely notification requires active I/O on the connection. The kernel-level device driver, however, knows about client death as soon as the client closes the connection to the kernel device driver. Timely notification of client death, together with other kernel-level issues (e.g., handling of SIGSTOP [FM99]), make the use of the kernel as the ultimate lock arbiter compelling.
However, obtaining the hardware lock via an ioctl is a heavyweight operation. If an entity is performing several short operations, the lock will have to be taken and released for each operation in order to provide user-perceived interactivity. The next section explores methods for avoiding an ioctl for each locking operation by introducing a two-tiered lock. This section outlines the requirements for the two-tiered lock design by exploring the transitions between common entity states and identifying the transitions that are performance-critical according to the following general criteria:
- Interaction is optimized for a single GLXContext.
- User-perceived responsiveness of the X server is maintained.
For purposes of analysis, the system has the following states:
- DA: The kernel dispatching DMA buffers for GLXContext A.
- DB: The kernel dispatching DMA buffers for GLXContext B.
- SA: A 3D client doing software fallbacks for GLXContext A. Software fallbacks are assumed to be slow operations, such that the overhead of obtaining the hardware lock, regardless of implementation, is negligible.
- SB: A 3D client doing software fallbacks for GLXContext B.
- 2D: The X server performing 2D rendering.
- HC: The X server moving the hardware cursor. This operation is lightweight, but a human is involved, so responsiveness is more important than overall speed.
- RM: The X server performing region modifications.
The importance of optimization will be ranked on the following scale:
1. Optimization is critical to meet performance goals.
2. Optimization will help meet performance goals, but is not critical.
3. Optimization may help meet performance goals, but may not have a large performance impact.
The rankings are shown below for transitions between any two states:
     DA  DB  SA  SB  2D  HC  RM
DA    1   2   3   3   2   2   3
DB    2   1   3   3   2   2   3
SA    3   3   3   3   2   2   3
SB    3   3   3   3   2   2   3
2D    2   2   2   2   1   2   3
HC    2   2   2   2   2   2   3
RM    3   3   3   3   3   3   3
3.2 Discussion
Since our goals are to optimize single-GLXContext throughput as well as X server responsiveness, it is most important that locking for DMA dispatch and for 2D rendering be optimized. If the kernel holds the lock for long periods of time, responsiveness is compromised. If the X server holds the lock for long periods of time, rendering throughput is compromised. Hence, the kernel and the X server must not hold the lock for more than the minimum necessary amount of time. However, because both DMA dispatch and 2D rendering operations are relatively short, the lock will be taken and released with great frequency: heavyweight ioctl-based locking may account for a significant percentage of the time used to perform the operations. In the case of advanced MMIO-based graphics hardware (i.e., vs. available DMA-based hardware), the overhead of the ioctl-based lock will be prohibitive.
Because of these requirements and observations, the design of a two-tiered lock was undertaken. We adopt operating system terminology and call the two methods of obtaining the lock "heavyweight" and "lightweight". The key observation from the above discussion is that locking must be optimized for a single GLXContext taking and releasing the lock many consecutive times (for this discussion, the X server will be considered as a special GLXContext). In particular, the lightweight lock will not be designed to be shared between GLXContexts, because such a design could complicate the algorithm at the expense of the more critical case (e.g., by requiring additional work in the lightweight routines to set flags or check for additional state that is important, but seldom or never used in the single-GLXContext case).
4. Lock Design
4.1 Introduction
The following assumptions will simplify lock design:
- Locking will be performed using the GLXContext ID to determine which entity last had the lock (or a hash of this ID that makes at least two bits (e.g., the high bits) and one ID (e.g., zero) available as flags).
- The heavyweight lock will use the ioctl interface to the kernel (an ioctl has the approximate overhead of a system call). The heavyweight lock will be used to resolve all cases of contention.
- The lightweight lock will be stored in a user-space shared memory segment that is available to all locking entities.
- A pointer-sized compare-and-set (CAS) atomic primitive is available. This is true for most modern processors, including Intel processors starting with the Intel486 (a double-wide CAS instruction is available starting with the Pentium). Similar lightweight algorithms can be designed using other atomic primitives. (For older hardware, such as the Intel386, which will have extremely poor 3D graphics throughput, the lightweight lock may simply fall back to the heavyweight lock.)
4.2 Previous Work
[MCS91] discusses synchronization algorithms that make use of busy waiting and atomic operations, as abstracted by the fetch-and-phi primitive, but does not discuss a two-tiered locking mechanism.
[BKMS98] describes a "thin lock" that can be "inflated" to a "fat lock". However, a fat lock cannot be "deflated" to a thin lock, so these methods do not apply to our work.
[L87] describes a method that, in the absence of contention, uses 7 memory accesses to provide mutual exclusion. If Lamport's algorithm is modified to provide for a kernel-level fallback in the case of contention, an algorithm with fewer reads and writes may be possible. However, since all modern architectures provide atomic fetch-and-phi primitives, there is limited value in exploring an algorithm that depends only on atomic reads and writes.
[LA93] explores "two-phase" algorithms that combine busy-waiting with blocking. The two-tiered lock described in this paper is a Lim and Agarwal two-phase lock with the polling value set to 1. At this time, however, we impose the restriction that only the last process to have held the lock can obtain the lock by checking a synchronization variable -- all other lock acquisitions must block. In the future, after we obtain more experience with DRI contention, we may extend our locking algorithm to be a completely general two-phase lock.
Lim and Agarwal note that the idea of combining the advantages of polling and signalling was first suggested by Ousterhout in [O82]. Unfortunately, we were not able to obtain a copy of this paper.
4.3 Locking Algorithm
Algorithms and structures are presented in a C-like programming language.
Lock Structure
The lock structure is a simple cache-line aligned integer. To avoid processor bus contention on a multiprocessor system, there should not be any other data stored in the same cache line.
typedef struct lock_s {
    unsigned int context;
    unsigned int padding[3];   /* pad so nothing else shares the cache line */
} lock_t;
Flags
Bits in the lock word will be used to claim the lock, and to notify the client that the kernel must be involved in contention resolution.
#define LOCK_HELD 0x80000000   /* lock is held */
#define LOCK_CONT 0x40000000   /* lock is contended; kernel must arbitrate */
Compare-and-Swap (CAS)
This is a standard 32-bit compare-and-swap (CAS) routine. It can be implemented atomically with a single Intel486 instruction. On a RISC processor, CAS is usually implemented with the instruction pair load-linked/store-conditional.
int CAS(int *address, int compare_value, int update_value)
{
    /* Performed atomically by the hardware: */
    if (*address == compare_value) {
        *address = update_value;
        return SUCCEED;
    }
    return FAIL;
}
Get Lightweight Lock
void get_lightweight_lock(lock_t *L, int my_context)
{
    if (CAS(&L->context, my_context, LOCK_HELD | my_context) == FAIL) {
        /* Contention, so we use the kernel to arbitrate */
        get_heavyweight_lock(L, my_context);
    }
}
Release Lightweight Lock
void release_lightweight_lock(lock_t *L, int my_context)
{
    if (CAS(&L->context, LOCK_HELD | my_context, my_context) == FAIL) {
        /* Kernel is requesting the lock */
        release_heavyweight_lock(L, my_context);
    }
}
Get Heavyweight Lock
void get_heavyweight_lock(lock_t *L, int my_context)
{
    unsigned int current_context, new;

    for (;;) {
        do {
            /* If the lock is held, mark it as contended;
               otherwise, try to take it. */
            current_context = L->context;
            if (current_context & LOCK_HELD)
                new = LOCK_CONT | current_context;
            else
                new = LOCK_HELD | my_context;
        } while (CAS(&L->context, current_context, new) == FAIL);

        if (new == (LOCK_HELD | my_context))
            break;                       /* Have lock */

        /* Didn't get the lock: suspend the process until the lock
           becomes available, then loop and try to obtain it again. */
        place_current_process_on_queue();
        schedule();   /* blocks until wake_up_queued_processes() is called */
    }
}
Release Heavyweight Lock
void release_heavyweight_lock(lock_t *L, int my_context)
{
    L->context = 0;   /* CAS can be used here to detect multiple unlocks */
    wake_up_queued_processes();
}
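As a usage illustration (not part of the original paper): a 3D client performing a software fallback, as in section 2.2, brackets its direct hardware access with the lightweight lock. The lock_t lives in the shared memory segment of section 4.1, and my_context is the (hashed) GLXContext ID.

void do_software_fallback(lock_t *lock, int my_context)
{
    get_lightweight_lock(lock, my_context);
    /* ... direct frame buffer access for the software fallback ... */
    release_lightweight_lock(lock, my_context);
}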
Discussion
Both getting and releasing a lightweight lock requires a CAS instruction, which may cause a processor bus LOCK cycle. The bus LOCK is avoided on Pentium Pro and later processors if the cache line is resident in the cache of the current processor. Since this is typically the case for the use of this lock, scalability should be good in the number of processors. On RISC processors, the load-linked/store-conditional instruction pair is used to implement CAS. This instruction pair does not cause bus locks, and scales well in the number of processors.
If the lock release uses a simple write operation instead of a CAS, then there will be a race condition between the time the process determines that the kernel wants the lock, and the time the lock is released. During this time, the kernel could set the bit making the request. If the process does not realize the kernel is waiting for the lock, then the kernel will not obtain the lock until the same process or another process requests the lock again. Since this may never happen, deadlock could result.
5. Acknowledgements
Thanks to James H. Anderson for discussing lock implementation issues.
6. References
[AM97] James H. Anderson and Mark Moir. Universal constructions for large objects. Submitted to IEEE Transactions on Parallel and Distributed Systems, June 1997. Available from http://www.cs.pitt.edu/~moir/Papers/anderson-moir-tpds97.ps. (An earlier version of this paper was published in Proceedings of the Ninth International Workshop on Distributed Algorithms, Lecture Notes in Computer Science 972, Springer-Verlag, pp. 168-182, September 1995.)
[BKMS98] David F. Bacon, Ravi Konuru, Chet Murthy, and Mauricio Serrano. Thin locks: featherweight synchronization for Java. Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation (Montreal, Canada, 17-19 June 1998). Published as SIGPLAN Notices 33, 5 (May 1998), 258-268.
[FM99] Rickard E. Faith and Kevin E. Martin. A Security Analysis of the Direct Rendering Infrastructure. Cedar Park, Texas: Precision Insight, Inc., 1999.
[L87] Leslie Lamport. A fast mutual exclusion algorithm. ACM Transactions on Computer Systems 5, 1 (Feb. 1987), 1-11.
[LA93] Beng-Hong Lim and Anat Agarwal. Waiting algorithms for synchronization in large-scale multiprocessors. ACM Transactions on Computer Systems 11, 3 (Aug. 1993), 253-294.
[M92] Henry Massalin. Synthesis: An Efficient Implementation of Fundamental Operating System Services. Ph.D. dissertation, published as Technical Report CUCS-039-92. Graduate School of Arts and Sciences, Columbia University, 1992, 71-91. Available from ftp://ftp.cs.columbia.edu/reports/reports-1992/cucs-039-92.ps.gz.
[MCS91] John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems 9, 1 (Feb. 1991), 21-65.
[MFOA99] Kevin E. Martin, Rickard E. Faith, Jens Owen, Allen Akin. Direct Rendering Infrastructure, Low-Level Design Document. Cedar Park, Texas: Precision Insight, Inc., 1999.
[O82] J. K. Ousterhout. Scheduling techniques for concurrent systems. In Proceedings of the 3rd International Conference on Distributed Computing Systems. IEEE, New York, 1982, pp. 22-30.
[OM98] Jens Owen and Kevin E. Martin. A Multipipe Direct Rendering Architecture for 3D. Cedar Park, Texas: Precision Insight, Inc., 15 September 1998. Available from http://www.precisioninsight.com/dr/dr.html.
Vulkan
The Graphics Core Next (GCN) architecture from Southern Islands (GCN 1.0) and up supports Vulkan (the Radeon HD 7000 series is Vulkan 1.0, while the 8000 series Sea Islands is GCN 2 and supports Vulkan 1.1). Earlier, non-GCN architectures should not be expected to support Vulkan natively.
Vulkan could be the only API that is necessary at the driver level; OpenGL and Direct3D can be used through wrappers.
Radeon video memory structure: the GPU has its own virtual memory and page tables, like a CPU.
A render loop and the other major parts needed (see the libdrm-based sketch after this list):
- Initialization and acquiring information.
- Graphics memory allocation (AMDGPU_GEM_CREATE).
- GPU virtual memory mapping (AMDGPU_GEM_VA).
- Sending command buffers to GPU ring buffers (AMDGPU_CS).
- Handling interrupts to know when command execution has completed (AMDGPU_WAIT_CS).
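For orientation, here is a hedged sketch of the same steps as they look on Linux through libdrm's amdgpu wrapper, which issues the AMDGPU_GEM_CREATE / AMDGPU_GEM_VA / AMDGPU_CS ioctls listed above. The device node path, buffer size, and the fixed GPU virtual address are illustrative (real code reserves a VA with amdgpu_va_range_alloc()); error handling is elided, and an AROS port would have to supply equivalent plumbing itself.

#include <fcntl.h>
#include <stdint.h>
#include <amdgpu.h>
#include <amdgpu_drm.h>

int main(void)
{
    int fd = open("/dev/dri/renderD128", O_RDWR);   /* illustrative node */
    uint32_t major, minor;
    amdgpu_device_handle dev;

    /* 1. Initialization and acquiring information */
    amdgpu_device_initialize(fd, &major, &minor, &dev);

    /* 2. Graphics memory allocation (AMDGPU_GEM_CREATE) */
    struct amdgpu_bo_alloc_request req = {
        .alloc_size     = 4096,
        .phys_alignment = 4096,
        .preferred_heap = AMDGPU_GEM_DOMAIN_GTT,
    };
    amdgpu_bo_handle bo;
    amdgpu_bo_alloc(dev, &req, &bo);

    /* 3. GPU virtual memory mapping (AMDGPU_GEM_VA) */
    amdgpu_bo_va_op(bo, 0, 4096, 0x400000 /* illustrative VA */, 0,
                    AMDGPU_VA_OP_MAP);

    /* 4./5. Command submission (AMDGPU_CS) and completion wait
       (AMDGPU_WAIT_CS) would go through amdgpu_cs_submit() and
       amdgpu_cs_query_fence_status() on a context created with
       amdgpu_cs_ctx_create(); omitted here for brevity. */

    amdgpu_bo_free(bo);
    amdgpu_device_deinitialize(dev);
    return 0;
}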
The Radeon GPU has its own MMU, so each process using 3D acceleration has its own GPU virtual address space. The GPU uses a two-level page translation table to map GPU virtual addresses to GPU physical addresses.
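A minimal sketch of such a two-level walk: a page directory entry selects a page table, whose entry yields the GPU physical page. The 12-bit page offset and the 10/10 split of the upper bits are assumptions for illustration (real Radeon VM page table formats vary per ASIC), and read_gpu_mem() is a hypothetical helper returning a CPU-visible view of GPU physical memory.

#include <stdint.h>

extern uint64_t *read_gpu_mem(uint64_t phys);   /* hypothetical helper */

static uint64_t gpu_va_to_pa(uint64_t pd_base, uint64_t va)
{
    uint64_t pde = read_gpu_mem(pd_base)[(va >> 22) & 0x3FF];         /* level 1: page directory */
    uint64_t pte = read_gpu_mem(pde & ~0xFFFULL)[(va >> 12) & 0x3FF]; /* level 2: page table */
    return (pte & ~0xFFFULL) | (va & 0xFFF);                          /* page frame + offset */
}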
On Southern Islands, the Radeon GPU physical memory layout is:
- 0 ... VRAM size: VRAM (video RAM)
- VRAM size ... GTT end: GTT (CPU memory mapped to GPU)
A ring buffer is needed to manage reads and writes between system RAM (CPU MMU) and GPU memory (GPU MMU), carrying the compiled shaders and drawing commands that are fed down the GPU pipelines.
A reference-counted GPU buffer class and a GPU memory manager are needed (a sketch follows this list). The memory manager can allocate memory of three types:
- VRAM mappable to CPU,
- VRAM not mappable to CPU,
- CPU memory mapped to GPU (GTT).
For GTT buffers, a GART page table maps CPU memory into the GPU's GTT memory range, so CPU memory mapping can be done in userland without a special kernel driver. Care must be taken that the GPU DMA engine can write to GTT memory.
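A sketch of the reference-counted buffer object just described; all names are hypothetical, and the free path is a placeholder for returning the range to the memory manager. The three memory types match the list above.

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef enum {
    GPU_MEM_VRAM_CPU,   /* VRAM mappable to CPU */
    GPU_MEM_VRAM,       /* VRAM not mappable to CPU */
    GPU_MEM_GTT         /* CPU memory mapped to GPU via GART */
} gpu_mem_type;

typedef struct gpu_buffer {
    gpu_mem_type type;
    uint64_t     gpu_va;    /* address in the GPU virtual address space */
    void        *cpu_ptr;   /* NULL for GPU_MEM_VRAM */
    size_t       size;
    unsigned     refcount;
} gpu_buffer;

static gpu_buffer *gpu_buffer_ref(gpu_buffer *buf)
{
    buf->refcount++;
    return buf;
}

static void gpu_buffer_unref(gpu_buffer *buf)
{
    if (--buf->refcount == 0) {
        /* return the VRAM/GTT range to the memory manager here */
        free(buf);
    }
}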
Indirect buffers allow commands to be executed without copying them into the ring buffer. Instead, an "execute indirect buffer" command is written to the ring buffer, with the indirect buffer's address and size as parameters. The Vulkan driver prepares and sends its commands in indirect buffers. A sketch of such a packet follows.
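A rough sketch of emitting that "execute indirect buffer" command as a PM4 type-3 packet, following the layout used in the Linux radeon driver headers (PACKET3_INDIRECT_BUFFER is opcode 0x3F; the count field N means N+1 payload dwords follow). Ring wrap-around and the VMID bits in the final dword are omitted; treat the field details as illustrative.

#include <stdint.h>

#define PACKET3(op, n)  ((3u << 30) | (((n) & 0x3FFFu) << 16) | ((uint32_t)(op) << 8))
#define PACKET3_INDIRECT_BUFFER 0x3F

static void ring_emit_ib(uint32_t *ring, uint32_t *wptr,
                         uint64_t ib_gpu_va, uint32_t ib_size_dw)
{
    ring[(*wptr)++] = PACKET3(PACKET3_INDIRECT_BUFFER, 2); /* 3 payload dwords */
    ring[(*wptr)++] = (uint32_t)(ib_gpu_va & 0xFFFFFFFCu); /* base, dword aligned */
    ring[(*wptr)++] = (uint32_t)(ib_gpu_va >> 32);         /* upper address bits */
    ring[(*wptr)++] = ib_size_dw;                          /* IB length in dwords */
}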
GTT buffer test: bufAdr is an OS memory address mapped to the GPU by the GART and written by the GPU DMA engine.
References
What I understand as a "driver" in your terms is something that sits on top of this module and allows interactions with only a single output?
A driver is a driver. It's a software module (like the current nvidia.hidd, radeon.hidd, etc.). These will hardly change: they are HIDD classes and will stay so. There is also an instance of the driver, an object of the driver class. These instances are what really control the different displays. So if you, for example, have one ATI video card with two displays and one NVidia card with one more display, you'll have two drivers and three instances. Right now we always have only one instance of only one driver.
So an instance of the driver will represent a controllable output. How will the client (graphics.library?) create such instances? Via OOP_NewObject(), as usual.
Currently, creating an instance of the driver via OOP_NewObject() is easy because you only do it once, and when it is created you know you control the correct output. With the new approach, creating an instance of the driver will return an object that controls an output - but how will a client know which output it is? Also, how will a client know how many outputs there can be - or, put differently, how will the client know how to create driver objects for all outputs? (Unless I'm mistaken, the only thing a client can use at this stage is the OOP_NewObject call.)
This model doesn't make much sense to me. An instance of a "driver" should represent a single display adaptor, and expose separate objects to represent the ldp's - IMHO
Different outputs (or different cards - this doesn't matter) are different OOP objects. Each driver object (instance) is registered in the display mode DB with AddDisplayDriverA(). These drivers are assigned different display mode IDs:
0x0010xxyy - First display
0x0011xxyy - Second display
0x0012xxyy - Third display
etc.
xxyy is the driver-dependent part of the ID; it encodes sync and pixelformat indexes, the same as before (one possible encoding is sketched below). Why start at 0x0010? Because the Amiga(tm) chipset uses fixed definitions that occupy the range 0x0000xxyy to 0x000Axxyy, and 0x000B - 0x000F were left reserved, just in case. Anyway, a maximum of 65519 displays in the system is enough. :)
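An illustrative composition of such a mode ID: the top 16 bits select the driver instance (starting at 0x0010), and the driver-dependent low 16 bits encode the sync and pixelformat indexes. The 8/8 split of xxyy here is an assumption; the actual split is up to each driver.

#include <stdint.h>

#define MONITOR_ID_BASE 0x0010   /* 0x0000-0x000F reserved for Amiga(tm) chipset */

static uint32_t make_mode_id(uint16_t monitor, uint8_t sync, uint8_t pixfmt)
{
    return ((uint32_t)(MONITOR_ID_BASE + monitor) << 16)
         | ((uint32_t)sync << 8)    /* "xx": sync index, assumed split */
         | pixfmt;                  /* "yy": pixelformat index */
}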
Every display driver already provides names for its display modes. Just the driver will have to pass different names for the user ("Radeon analog", "Radeon DVI", "Radeon N2 analog", "Radeon N2 DVI", etc.).
Also, how will a client know how many outputs there can be - or, put differently, how will the client know how to create driver objects for all outputs? (Unless I'm mistaken, the only thing a client can use at this stage is the OOP_NewObject call.)
The client won't make the OOP_NewObject() call; the driver module will have startup code that does it. The new model is very close to what is done at the moment. Only one thing is different:
Current model: when OOP_NewObject() is performed, the first capable PCI device is found and an object is created for it.
New model: the driver startup code enumerates all PCI devices and calls OOP_NewObject() for each of them, using private attributes to pass the device base address etc. The driver classes do not even need public names. And the driver does not have to be a library: it can be a plain executable lying in DEVS:Monitors, as on other systems. This way it can be launched at any time, and its display modes will be instantly added to the system.
See arch/all-mingw32/hidd/wingdi/startup.c as an example. Currently it creates only one object; in the future it can be modified to create several objects to simulate several displays (there will be a separate host window per display). This approach even allows old drivers to be used until they are rewritten; only a very small loader program is needed for them. By this time I'm going to move on with the conversion, create DEVS:Monitors, and write such a loader. A hypothetical startup skeleton is sketched below.
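A hypothetical skeleton of such a DEVS:Monitors startup, following the new model: enumerate all matching PCI devices, call OOP_NewObject() once per device (passing the device through a private attribute), then register each object with AddDisplayDriverA() as described above. enumerate_pci_devices(), the class ID string, and the aHidd_DriverPCIDevice attribute are illustrative placeholders, not real AROS names.

#include <exec/types.h>
#include <utility/tagitem.h>
#include <oop/oop.h>
#include <proto/oop.h>
#include <proto/graphics.h>

#define aHidd_DriverPCIDevice 0x8000A001        /* hypothetical private attr */

extern APTR enumerate_pci_devices(UWORD vendor, int n);   /* placeholder */

int main(void)
{
    APTR dev;
    int n = 0;

    while ((dev = enumerate_pci_devices(0x1002 /* ATI */, n++)) != NULL)
    {
        struct TagItem tags[] =
        {
            { aHidd_DriverPCIDevice, (IPTR)dev },
            { TAG_DONE,              0         }
        };
        OOP_Object *driver = OOP_NewObject(NULL, "hidd.gfx.exampledriver", tags);

        if (driver != NULL)
            AddDisplayDriverA(driver, NULL);    /* adds its modes to the display DB */
    }
    return 0;
}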
There'll be no LoadMonDrvs in Startup-sequence. IMHO a better place for loading display drivers is the dosboot resident, in inithidds.c.
- R100 Radeon 7000
- R200 Radeon 8000
- RS400/RS480 Radeon XPRESS 200(M)/1100 IGP
- R300 Radeon 9700PRO/9700/9500PRO/9500/9600TX, FireGL X1/Z1
- R350 Radeon 9800PRO/9800SE/9800, FireGL X2
- R360 Radeon 9800XT
- RV350 Radeon 9600PRO/9600SE/9600/9550, M10/M11, FireGL T2
- RV360 Radeon 9600XT
- RV370 Radeon X300, M22
- RV380 Radeon X600, M24
- RV410 Radeon X700, M26 PCIE
- R420 Radeon X800 AGP
- R423/R430 Radeon X800, M28 PCIE
- R480/R481 Radeon X850 PCIE/AGP
- RV505/RV515/RV516/RV550 Radeon X1300/X1400/X1500/X1550/X2300
- R520 Radeon X1800
- RV530/RV560 Radeon X1600/X1650/X1700
- RV570/R580 Radeon X1900/X1950
- RS600/RS690/RS740 Radeon X1200/X1250/X2100
- R600 Radeon HD 2900
- RV610/RV630 Radeon HD 2400/2600/2700/4200/4225/4250
- RV620/RV635 Radeon HD 3410/3430/3450/3470/3650/3670
- RV670 Radeon HD 3690/3850/3870
- RS780/RS880 Radeon HD 3100/3200/3300/4100/4200/4250/4290
- RV710/RV730 Radeon HD 4330/4350/4550/4650/4670/5145/5165/530v/545v/560v/565v
- RV740/RV770/RV790 Radeon HD 4770/4730/4830/4850/4860/4870/4890
- CEDAR Radeon HD 5430/5450/6330/6350/6370
- REDWOOD Radeon HD 5550/5570/5650/5670/5730/5750/5770/6530/6550/6570
- JUNIPER Radeon HD 5750/5770/5830/5850/5870/6750/6770/6830/6850/6870
- CYPRESS Radeon HD 5830/5850/5870
- HEMLOCK Radeon HD 5970
- PALM Radeon HD 6310/6250
- SUMO/SUMO2 Radeon HD 6370/6380/6410/6480/6520/6530/6550/6620
- BARTS Radeon HD 6790/6850/6870/6950/6970/6990
- TURKS Radeon HD 6570/6630/6650/6670/6730/6750/6770
- CAICOS Radeon HD 6430/6450/6470/6490
- CAYMAN Radeon HD 6950/6970/6990
- ARUBA Radeon HD 7000 series
- TAHITI Radeon HD 7900 series, Radeon R9 280/280X
- PITCAIRN Radeon HD 7800 series, Radeon R7 265/370, Radeon R9 270/270X/M290X
- VERDE Radeon HD 7700 series, Radeon R7 250X/350, Radeon R9 M265X/M270X/M275X
- OLAND Radeon HD 8000 series, Radeon R7 240/250/350
- HAINAN Radeon HD 8800 series
- BONAIRE Radeon HD 7790 series, Radeon R7 260/260X/360
- KAVERI KAVERI APUs
- KABINI KABINI APUs
- HAWAII Radeon R9 290/290X/390/390X
- MULLINS (Puma/Puma+ cores, GCN GPU) MULLINS/BEEMA/CARRIZO-L APUs
Firmware files per family, as named in the Linux radeon driver:
- radeon/R600_rlc.bin radeon/R600_uvd.bin
- radeon/R600_rlc.bin radeon/RS780_uvd.bin radeon/RS780_pfp.bin radeon/RS780_me.bin
- radeon/PALM_me.bin radeon/PALM_pfp.bin
- radeon/SUMO_rlc.bin radeon/SUMO_uvd.bin radeon/SUMO_me.bin radeon/SUMO_pfp.bin
- radeon/BTC_rlc.bin radeon/CAICOS_mc.bin radeon/CAICOS_me.bin radeon/CAICOS_pfp.bin radeon/CAICOS_smc.bin radeon/SUMO_uvd.bin
- radeon/BTC_rlc.bin radeon/TURKS_mc.bin radeon/TURKS_me.bin radeon/TURKS_pfp.bin radeon/TURKS_smc.bin radeon/SUMO_uvd.bin
- radeon/BONAIRE_ce.bin radeon/BONAIRE_mc.bin radeon/BONAIRE_mc2.bin radeon/BONAIRE_me.bin radeon/BONAIRE_mec.bin radeon/BONAIRE_pfp.bin radeon/BONAIRE_rlc.bin radeon/BONAIRE_sdma.bin radeon/BONAIRE_smc.bin radeon/BONAIRE_uvd.bin radeon/BONAIRE_vce.bin