Chris Jang
GATLAS
(GPU
Automatically
Tuned
Linear
Algebra
Software)
embodies a view of OpenCL™ kernel design with the central concept
of statistical auto-tuning. The aim is to automate performance tuning
when practical. The human developer designs model families that
represent a space of possible GPGPU kernel implementations. The
automated system then uses statistical optimization techniques to
search this space to find the best model solutions. These solutions
correspond to high performance kernels.
Orthodox performance tuning relies on deep white box knowledge of both instruction set and hardware architecture. There is no automation. The human developer programs GPGPU kernel implementations and explores the design space manually. This was not an option for me as I had no prior experience with GPU technology before GATLAS. To me, the GPU was a black box programmed in OpenCL™. Automation had to help explore the kernel design space and find good solutions. This forced a different approach and viewpoint relying on statistics and brute force.
These are not new ideas.
The ATLAS project (Automatically Tuned Linear Algebra Software) auto-tunes math kernels for the CPU. Computers use statistics and brute force to find good strategies for games like chess, demand forecasting or product and portfolio pricing. When problems are too complex to fully understand, approximate solutions calculated from empirical data become attractive. In the case of GATLAS, the empirical data is timings of kernel task execution and the solutions are high performance math kernels.
If computation were free, then exhaustive search would be all we ever use. GATLAS could benchmark every possible kernel implementation from a model family and pick the fastest one. In reality, this becomes intractably expensive as computation is not free and the kernel design space tends to grow in a combinatorial explosion of possible implementations. More intelligent search is required. GATLAS uses expectation-maximization (EM) with dynamic programming (kernel timings are memoized). Like other practical methods of optimization, it is local. There is no guarantee any solution found is the best possible (unless the problem is convex, which does not hold for practical problems like this one). However, empirical experience with EM optimization in GATLAS is good. EM generally converges to the same solutions as exhaustive search.
GATLAS supports three math
kernels: GEMM (generalized matrix multiply), GEMV (generalized matrix
vector multiply) and SAXPY (scalar alpha X plus Y). Each math kernel
has two OpenCL™ variants for memory buffers and images. These
kernel variants correspond to the model families GATLAS searches.
They are generic types implemented as C++ template classes
parameterized by precision (float
or double) and vector length
(scalars are length 1 while vector types are length 2 or 4).
The kernel variants generate OpenCL™ source code for specific kernels (models) in the design space (the model family). Each is a mapping from a small set of input parameters to an OpenCL™ kernel. The input parameters are represented as mixin base classes with state maintained externally in a vector of integral numbers.
|
|
GATLAS uses the EM algorithm to search for values of the input parameters that correspond to fast kernels. It is optimizing a function from a small set of numbers as input and the kernel timing and number of FLOPS as output. GATLAS optimizes for the expected value of that function. The time a kernel takes to execute may not be constant. Each time a kernel runs, the execution time may be different. Therefore this must be a statistical process.
|
|
It may seem strange to parameterize kernels outside of the meta-programming facilities provided by C++, that is, template classes. The small set of input parameters are represented outside of the template mechanism with the mixin base classes and vector of integral numbers. This is necessary as parameterized types must be specialized at runtime during the auto-tuning search of the optimization space. C++ templates are specialized at compile time only. Compile time generation of all kernel variations is impractical, if not impossible due to a combinatorial explosion of kernel variations with the number of parameters. Dynamic generation is better and usually more efficient in this case.
Another important reason motivates parameterizing the kernels with a vector of integral numbers. It insulates the main auto-tuning loop from any knowledge of kernel variants. The statistical optimization process does not need to know about GPGPU kernels. From the viewpoint of the EM based auto-tuning, it is optimizing an abstract function with a few integral input arguments and a single number, FLOPS, as the output. As it is unaware of what it is optimizing, the auto-tuning is more generic. This also avoids some of the problems that can arise when using C++ templates which are specialized statically at compile time. The kernel variants live in their own universe of types conceptually separate from the statistical optimization system.
For the purposes of this article, I will avoid delving into the low level details of the design in order to focus on the lessons learned from auto-tuning GPGPU kernels. GATLAS is an experimental test bed and written in a somewhat conceptually dense style (although simple and direct) with an objective of functionality first and form second.
Auto-tuning can be instructive
even for manual performance optimization.
Programming the GPU can be complicated. Many effects interact to determine kernel performance. It is not always easy to understand what design features are responsible for a kernel's performance. This is true whether performance is good or bad. Automation makes running large numbers of experiments practical and can help clarify the underlying reasons for empirically observed performance. It becomes practical to more deeply search the kernel design space and evaluate how well ideas work with greater confidence.
While developing GATLAS, some general guidelines for high performance OpenCL™ kernel design emerged, at least on AMD GPUs. GATLAS was developed almost entirely on the ATI Stream computing technology stack across a range of GPU models. Preliminary porting work and testing has been done on NVIDIA GPUs. However, the initial experience indicates each vendor's technology requires different kernel designs to achieve high performance, high utilization results.
Here are ten design guidelines I learned while developing GATLAS for AMD GPUs.
Empirical experience with
benchmarking the memory buffer kernel variants shows that vectorized
memory accesses are faster than scalar memory accesses. However, the
vector length for the fastest kernel may change depending on problem
size. For instance, most single precision kernel designs in OpenCL™
use float4. For GATLAS kernel
designs using memory buffers, sometimes float2
gives faster kernel execution timings.
|
|
It is clear in the figure (single precision
matrix multiply using memory buffers on a ATI Radeon™ HD 5870) that
vector length is a significant parameter in performance tuning. The
best vector length to use (float2
or float4)
varies depending on the matrix size.
For images, the underlying textures are naturally addressed by quads, either float4 or double2 (packed into 128 bits). There is no option to sample textures by scalars. Reading is always by vector types.
The benefit of single cycle multiply-add is obvious. What may not be obvious is that the scalar mad() calls should be used. Empirically observed performance shows peak device utilization for GEMM at well over 50% in both the single and double precision implementation when following this rule.
|
|
For instance, double precision GEMM with
separate multiplication and addition on a ATI Radeon™ HD 5870 GPU
has a peak performance of roughly 250 GFLOPS. Using mad()
results in an increase of 100 GFLOPS to more than 350 GFLOPS. The ATI
Radeon™ HD 5870 GPU has a theoretical peak throughput of 544 GFLOPS
in double precision. So device utilization in this case is actually
around 65%.
The effect of mad() is especially significant when using double precision as the demands on memory bandwidth are comparatively lower than with single precision. ALU throughput is a larger factor in total performance. As practical numerical and quantitative calculation is generally done in double precision, mad() usage becomes increasingly necessary for high performance.
Here is a summary of this design rule. Gather input data with memory reads by vector types. Scatter output data with memory writes by vector types. The calculations and any reduction between gather and scatter are best done using scalar arithmetic.
GATLAS supports row and column
major data layouts of input matrices for GEMM and GEMV kernels.
Kernels that read down the columns of images and across the rows of
memory buffers generally have resulted in significantly faster kernel
timings. This is intuitively explained by the texture cache and
coalesced memory access.
|
|
The official documentation says only that the
texture sampling favors spatial locality. It does not specify a
direction. Empirical experience with auto-tuning indicates that the
texture cache does have a preferred direction, at least when
considered together with the full technology stack. This
non-isotropic bias in texture sampling performance may be due to a
combination of many aspects and not solely a quality of the texture
cache itself.
|
|
It is more natural that striding across
contiguous rows of an array exhibits improved performance compared
with other memory access patterns. This corresponds to coalesced
memory access. The GATLAS implements GEMM as a block inner product.
So the product of two matrices A and B requires reading across rows
of A and down columns of B when a row major data layout is used. When
the matrix B is in column major data layout (equivalent to
transposed), peak performance is higher.
These guidelines for reading images and memory buffers are general rules and should not be taken as constraining the design of kernels. Kernels that access memory in other ways may also achieve high performance. There are many effects that determine total performance. It is likely that while the memory access may be more expensive, another factor involved is cheaper. So the total cost of kernel execution ends up being less.
Early in the development of
GATLAS, I experimented with SGEMM kernel designs that used tiled data
layouts. The tiling corresponded exactly with the data blocking of
the algorithm. This allowed fully coalesced memory access and did
noticeably increase performance. The price for this performance was
much higher design complexity and additional costs in reordering data
between different data layouts.
More extensive empirical benchmarks revealed that while tiled data layouts did often increase performance, the asymptotic peak across a range of problem sizes did not change. Tiled and linear data layouts had the same peak performance. For a given problem size, a tiled data layout may generally be faster than a simple linear row or column major data layout. However, there were problem sizes for which simple linear data layouts won. Tiled data layout did not have always perform better.
A way of summarizing this is as another example of Occam's Razor. Complex solutions are not always better than simple ones. In this case, complex tiled data layouts did not always beat simple linear ones in the larger context of the optimization problem.
Data blocking has the largest
effect on kernel performance. The difference in performance when
using data blocking can sometimes be two orders of magnitude (100
times). Most of what auto-tuning does is find the blocking of the
problem data that exhibits the highest kernel performance.
This is not unique to the GPU. It is analogous to tuning an algorithm so data fits in the cache on a CPU. Data movement is expensive. Optimization is about maximizing effective memory bandwidth by minimizing this cost. The GPU also has a memory hierarchy, except that management of it is more explicit than with the CPU. There are different memory spaces on the GPU device corresponding to the memory hierarchy. The GPGPU kernel programmatically controls data movement between these levels of memory.
|
|
Outer and inner data blocking is naturally implemented using the OpenCL™ kernel execution index spaces. Work groups correspond to the outer blocking. Note that the problem size (matrix dimensions), outer and inner blocking are all coupled together. If any one of them changes, the OpenCL™ kernel execution index space must also change.
|
|
Inner blocking is also reflected in the kernel itself and determines how many registers (private memory in OpenCL™) are used. This affects the source code of the kernel. It also affects the scheduling of threads on the GPU. With more registers, the total number of memory reads is reduced. At the same time, thread scheduling is also affected. There is an optimal tradeoff between gain from reducing memory reads and loss from negatively impacting thread scheduling. Finding this optimal tradeoff through analysis is impractical. There are too many factors involved with the combination of GPU hardware, driver and SDK. Change any one of these and the optimal inner blocking may change too.
This type of performance tuning is inherently combinatorial and so can be automated in a straightforward way. My experience is that automated tuning generally outperforms manual effort.
As stated earlier, a benchmark
is a statistical measurement. Kernel times are not constant. They do
vary, sometimes significantly. So it is necessary to estimate
performance using averages over many samples.
Less obvious is that the variation in kernel timings often reflects a bi-modal distribution. Over many trials, kernel timings can cluster around two widely separated values with most samples around the shorter time. My theory is this reflects the intermittent cost of resource management on the GPU. As kernels execute, there is accumulation of some deferred cost from memory fragmentation or garbage. When this reaches a limit, it must be released, cleaned and collected and the cost incurred in the executing kernel task.
|
|
GATLAS does not attempt to account for this
effect. It makes the pessimistic assumption that kernel timings have
a uni-modal statistical distribution. If it did detect bi-modal
distributions of benchmark times and exclude the effects of resource
management on the average FLOPS, measured performance would
noticeably increase. However, this would not be a meaningful
reflection of sustained throughput from the repeated execution of a
kernel over a large data set most likely to occur in practice.
It
is important to have end-to-end checking of the GPU calculation with
at least one reference calculation which must be correct. There may
be multiple reference calculations. Each has a different trade-off
between speed and confidence of correctness based on the input and
output data. An inexpensive check using test/flag/sentinel values is
best for the auto-tuning search. An expensive check using random data
or stress testing is useful for verification.
My experience implementing numerical algorithms, on both the CPU and GPU, is that the code is extremely dense and requires multiple iterations of testing and careful debugging before it is correct. For most developers, writing GPU kernels is more difficult as procedural and functional programming idioms are mixed together in an unfamiliar style. This is the primary reason for end-to-end checking of kernel data output.
A secondary motivation for end-to-end checking are GPGPU platform issues. Empirical experience with GATLAS auto-tuning across different GPU devices (from both AMD and NVIDIA), drivers and SDKs indicates that a specific kernel producing correct output in one case may generate bad output in another. The kernel executes without incident. However, the data output seen on the host is incorrect. The cause may be in the shader compiler, runtime libraries or driver. Determining the root cause can be difficult.
Mainstream software development practice has little involvement with the algorithmically dense code of math kernels. Most engineers are focused on higher-level application code and (sensibly) re-use library implementations whenever possible.
Mainstream software development practice also has a culture of trust in the compiler and platform. Optimization is generally not as important as it is for GPGPU which is defined by high performance. My experience is high optimization levels with a C/C++ compiler on the CPU are avoided for production applications as bugs may be introduced in generated code. With GPGPU, where high performance is central, the compiler is always expected to generate highly optimized code. There is greater inherent risk.
GATLAS has been tested on five
GPUs at this time, four from AMD and one from NVIDIA. The four AMD
GPUs are from the AMD "Evergreen" architecture: ATI Radeon™
HD 5440 GPU; ATI Radeon™ HD 5670 GPU; ATI Radeon™ HD 5770 GPU;
ATI Radeon™ HD 5870 GPU. The one NVIDIA GPU is from the Fermi
architecture: GTX 480. Others have tried GATLAS on NVIDIA Tesla and
GTX 2xx GPUs. I will limit these observations to my personal
experience on the five GPUs I have access to.
|
|
The good news is there are absolutely no
surprises observed across the AMD "Evergreen" GPU models.
Kernel performance scales as intuitively expected based on device
hardware specifications. Even better, functionality is stable between
models. If a kernel design works on a lower specification model, it
will work on a GPU model of higher specification. At least for
GATLAS, kernel designs generalize within the AMD "Evergreen"
device architecture.
The bad news is that the GATLAS kernel designs perform and function completely differently on the NVIDIA Fermi architecture of the GTX 480. This is seen with both kernel performance and correctness of output data. I have not investigated deeply enough to understand and resolve these issues. It is only obvious that to GATLAS, AMD "Evergreen" and NVIDIA Fermi architectures are different enough that the kernel meta-programming and auto-tuning in GATLAS is not yet able to fully adapt and overcome the differences between vendors.
Texture sampling is very fast on
AMD GPUs. It skips the need for the address arithmetic required by
memory buffers. It also takes advantage of the texture cache. A
consequence is that kernel designs that access memory through images
are generally significantly faster than designs that rely on memory
buffers accesses. However, there are two important cases for which
memory buffers can outperform images.
|
|
For small problem sizes, kernels using memory
buffers are often faster. While memory buffers have lower asymptotic
peak performance relative to image based kernel designs, they may
have qualitatively similar performance at small problem sizes. The
advantage of image based kernels occurs at larger problem sizes after
memory buffer kernel performance reaches an upper limit. The image
based kernel continues onwards to larger problem sizes until it
reaches a higher limit. If a computation naturally involves smaller
problems, then memory buffers may be faster than images.
|
|
The lower throughput of image based DGEMM with
PCIe bus data transfer from host to GPU device is caused by texture
creation overhead. Texture creation is not free and is included in
the kernel benchmark along with the actual movement of data across
the bus.
|
|
The image based DGEMM kernel itself is
significantly faster than the memory buffer version.
|
|
Including PCIe bus data transfer from GPU
device to the host does not change the relative performance of image
and memory buffer based DGEMM. The image based kernel is faster.
A common computational pipeline is mapping data to kernels followed by iteration and reduction. After initial data transfers, efficient scheduling of computation should maximize use of that data already on the GPU. This increases the effective memory bandwidth by managing data locality. It is exactly analogous to the memory hierarchy of cache in a CPU. Memory bandwidth is high because most reads hit the cache. In the DGEMM example above, the choice of images or memory buffers is a trade-off between the cost of texture creation and the higher performance textures offer. It will depend on the computational pipeline and how effectively it uses GPU device memory.
This is why GATLAS supports both memory buffer and image based kernel variants. Depending on the context and problem, different kernels are required. A general guideline is that images are faster for larger problems with data that is on the GPU device. For other cases when problems are smaller, or a computational pipeline is very short, memory buffers may be faster.
GATLAS supports GEMV and SAXPY
kernels that have constant arithmetic intensity. The ratio of
arithmetic operations to memory accesses remains the same
irrespective of problem size. Yet, it is empirically observed that
increasing the problem size results in higher performance.
The reason is that total cost of kernel execution is not determined solely by arithmetic operations and memory access. There are also fixed overheads. Larger problem sizes increase the relative cost of arithmetic and memory access, amortizing the kernel overhead and reducing its effect on total performance.
GATLAS also supports GEMM kernels with arithmetic intensity that increases with problem size. The big-O order of arithmetic operations is cubic while the order of memory accesses is larger than quadratic but less than cubic. As a result, the ratio of arithmetic to memory access increases with problem size. This is the primary reason why GEMM kernels exhibit much higher performance than GEMV and SAXPY.
However, all kernels have some overhead including GEMM. Some of the performance increase of GEMM with larger problem sizes is also due to amortizing out the cost of this overhead by increasing the relative proportion of arithmetic and memory access.
|
|
There is another way of amortizing out the cost
of kernel execution overhead. If multiple computational kernels are
synthesized together into a single coalesced GPGPU kernel, the
overhead is incurred only once for those multiple kernel
calculations. In practice, this is highly effective. Coalescing
allows GEMM to run at hundreds of GFLOPS even when multiplying very
small matrices.
Many practical quantitative problems fall into this coalesced kernel use case. Statistical learning and optimization problems are quite unlike traditional numerical linear algebra motivated by solving large systems of PDEs. These data driven problems from economics and data mining motivate very large numbers of small problems. There may be millions of elements in a data set, each requiring a small calculation. If implemented as a single data element for every kernel execution, performance will be very low due to the accumulated kernel overhead. A coalesced kernel that performs the same calculation for many data elements with each execution proportionally reduces the cost of this overhead.
Design of a computational
pipeline should make best use of what the GPU does well and avoid
what it does poorly. Calculations should be planned to maximize the
gains of fast high arithmetic intensity kernels. They should also
minimize the costs of slow low arithmetic intensity kernels.
High arithmetic intensity kernels like GEMM have large payoffs once performance tuned. There is clear benefit from extensive auto-tuning ahead-of-time even if it is very expensive. Application computations can then be planned around these tuned kernels to take advantage of their high performance.
|
|
In contrast, GEMV and SAXPY are low arithmetic
intensity kernels. Auto-tuning has some benefit and is much cheaper.
However, no amount of performance tuning can make these kernels run
very fast. They are inherently slow. A computational pipeline with an
inner loop of low arithmetic intensity can be optimized but is
incapable of very high performance.
Assuming a computational pipeline is transformed so that inner loops contain high arithmetic intensity kernels like GEMM, the remaining computations of either lower arithmetic intensity or small amounts of data will have a much smaller payoff from performance optimization. When possible, it makes sense to combine these computations into single synthesized GPU kernels and perform a more cursory level of auto-tuning.
GATLAS does not support this kind of kernel synthesis. It currently supports homogeneous coalescing only. Multiple identical compute kernels may be combined into a single GPU kernel. However, there is no technical reason preventing the coalescing of different compute kernels. So long as the kernel execution index spaces are compatible, different compute kernels can also be combined into a single GPU kernel. A concrete example of this would be combining GEMV with SAXPY in an application's computational pipeline.
This design guideline is really a pointer to future development.
OpenCL™ is a language that
looks like C implemented dynamically at runtime using LLVM (Low
Level Virtual Machine). Programming idioms are
those of a statically typed language without meta-programming
facilities (as macros are not yet supported). Yet the underlying
compiler and runtime technology is very dynamic as all compilation
and shader code generation occurs at runtime. OpenCL™ is
implemented in a way that supports a managed platform with an
optimizing JIT/AOT compiler and virtual machine.
Several years ago, PeakStream had this technology on AMD GPU devices. Compute kernels were synthesized at runtime from application code. This was really a meta-programming and optimization system for GPGPU as a managed platform. I see this as a glimpse of the future for GPGPU. It follows the same technological progression seen in other programming languages and platforms. First, there is assembly language. Next comes a high level language like C or C++. OpenCL™ corresponds to a high level language for GPGPU and was even modeled to follow the conventions and syntax of the C language. What always comes next is the managed platform with a virtual machine in which application kernels are synthesized by a JIT compiler and auto-tuned.
The potential improvement in software development productivity is perhaps greater than for the managed platforms used in conventional CPU computing (e.g. JVM and scripting languages). Writing GPGPU kernels is intricate work while subsequent performance tuning is time consuming and sensitive to many variables (i.e. the optimal kernel changes depending on device, driver and SDK). The gain in developer productivity from subsuming kernel synthesis and auto-tuning in a managed platform runtime should offset any potential cost of lower performance compared with hand crafted and manually tuned kernels.
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.