Benchmarks for SGEMM with PCIe bus data transfer on HD 5870

Several interesting observations:
  1. Memory buffer kernels are faster than image kernels when the timing includes writing A and B from host to device.
  2. In all cases, performance approaches the same asymptotic maximum as matrix size increases. The usual trade of space (memory) for time (gigaFLOPS) applies: larger matrices use more memory but sustain higher throughput.
  3. Accounting for PCIe bus data transfer reduces the performance gap between IL/ISA and OpenCL. The host-to-device I/O bottleneck makes OpenCL more competitive than synthetic kernel-only benchmarks suggest.
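
A back-of-the-envelope model (not the benchmark code behind these results) may help explain observations 2 and 3. It assumes an n x n single-precision multiply costs 2n^3 floating-point operations, each matrix occupies 4n^2 bytes, and transfers do not overlap with the kernel. The function name effective_gflops, the 5 GB/s PCIe bandwidth, and the 1000 and 1400 GFLOPS kernel rates are hypothetical placeholders, not measurements.

```python
def effective_gflops(n, kernel_gflops, pcie_gb_per_s, write_ab=True, read_c=True):
    """Estimate SGEMM throughput for n x n matrices including PCIe transfers.

    All parameters are illustrative assumptions, not measured values.
    """
    flops = 2.0 * n ** 3                       # multiply-adds in C = A * B
    kernel_s = flops / (kernel_gflops * 1e9)   # time spent in the kernel alone

    bytes_per_matrix = 4.0 * n * n             # float32 elements
    moved = 0.0
    if write_ab:
        moved += 2 * bytes_per_matrix          # A and B, host to device
    if read_c:
        moved += bytes_per_matrix              # C, device to host
    transfer_s = moved / (pcie_gb_per_s * 1e9)

    return flops / (kernel_s + transfer_s) / 1e9


if __name__ == "__main__":
    slow, fast, bus = 1000.0, 1400.0, 5.0      # hypothetical rates
    for n in (512, 1024, 2048, 4096, 8192):
        a = effective_gflops(n, slow, bus)
        b = effective_gflops(n, fast, bus)
        print(f"n={n:5d}  slow={a:7.1f}  fast={b:7.1f}  ratio={b / a:.2f}")
```

In this model the transfer time grows as n^2 while the arithmetic grows as n^3, so the effective rate climbs toward the kernel-only rate as n increases. At smaller sizes the fixed bus cost dominates both kernels, so the ratio between a faster and a slower kernel shrinks below its kernel-only value.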

Timing configurations benchmarked:
  kernel only
  kernel + write A and B from host to device
  kernel + read C from device to host
  kernel + write A and B from host to device + read C from device to host

Bellevue, WA, May 19, 2010