BLAS: Basic Linear Algebra Subprograms

Introduction

The BLAS interface supports portable high-performance implementation of applications that are matrix and vector computation intensive. Because the BLAS are efficient, portable, and widely available, they are commonly used in the development of high-quality linear algebra software such as LAPACK.

Listing 4. SYRK: Symmetric rank-k matrix-matrix product. The scalars alpha[g] and beta[g] are of the same precision as the arrays Ai and Ci. The size of the matrix Ai depends on trans[g]; the corresponding dimensions are given in the equation above.
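Since the referenced C declarations are not reproduced in this excerpt, the following is a minimal sketch of what a group-based batched SYRK declaration could look like. The function name blas_dsyrk_batch, the enum types, and the exact argument order are illustrative assumptions, not the interface fixed by the standard.

    /* Illustrative types for the sketches below. */
    enum blas_layout { BlasColMajor, BlasRowMajor };
    enum blas_uplo   { BlasUpper, BlasLower };
    enum blas_trans  { BlasNoTrans, BlasTrans, BlasConjTrans };

    /* For each group g (g = 0, ..., group_count-1) and each of its
     * group_size[g] problems:
     *   C_i := alpha[g] * A_i * A_i^T + beta[g] * C_i   if trans[g] == BlasNoTrans
     *   C_i := alpha[g] * A_i^T * A_i + beta[g] * C_i   otherwise,
     * where C_i is n[g]-by-n[g] and A_i is n[g]-by-k[g] (NoTrans) or
     * k[g]-by-n[g] (Trans). alpha[g] and beta[g] have the same precision
     * as the arrays A_i and C_i. */
    void blas_dsyrk_batch(enum blas_layout layout,
                          const enum blas_uplo *uplo,
                          const enum blas_trans *trans,
                          const int *n, const int *k,
                          const double *alpha,
                          const double * const *A, const int *lda,
                          const double *beta,
                          double * const *C, const int *ldc,
                          int group_count, const int *group_size);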

C language declarations of HERK functions in multiple precisions and argument domains are shown in Listing 5. Listing 5. HERK: Hermitian rank-k matrix-matrix product. This routine is available only for the complex precisions.
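Similarly, a hypothetical batched HERK declaration for the double-complex case might look as follows, reusing the illustrative enum types from the sketch above. As in the classic BLAS zherk, the alpha and beta scalars are real-valued even though the matrices are complex.

    #include <complex.h>

    /* For each group g and each of its problems:
     *   C_i := alpha[g] * A_i * A_i^H + beta[g] * C_i   if trans[g] == BlasNoTrans
     *   C_i := alpha[g] * A_i^H * A_i + beta[g] * C_i   otherwise. */
    void blas_zherk_batch(enum blas_layout layout,
                          const enum blas_uplo *uplo,
                          const enum blas_trans *trans,
                          const int *n, const int *k,
                          const double *alpha,               /* real scalars */
                          const double complex * const *A, const int *lda,
                          const double *beta,                /* real scalars */
                          double complex * const *C, const int *ldc,
                          int group_count, const int *group_size);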

Listing 6. SYR2K: Symmetric rank-2k matrix-matrix product. The sizes of the matrices Ai and Bi depend on trans[g]; the corresponding dimensions are given in the equation above. C language declarations of HER2K functions in multiple precisions and argument domains are shown in Listing 7.

Listing 7. HER2K: Hermitian rank-2k matrix-matrix product.

Listing 8. TRMM: Triangular matrix-matrix product. The scalar alpha[g] is of the same precision as the arrays A and B. The size of the matrix Ai depends on side[g]; the corresponding dimensions are given in the equation above.

Listing 9. TRSM: Triangular matrix-matrix solve. The size of the matrix Ai depends on side[g]; the corresponding dimensions are given in the equation above.

The layout argument specifies whether the leading dimension runs across rows or across columns, i.e., whether matrices are stored in row-major or column-major order.
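To make the layout argument concrete, the following standalone sketch (not part of the standard interface) shows how the same matrix element is addressed under column-major and row-major storage with their respective leading dimensions.

    #include <stdio.h>

    /* Element (i, j) of an m-by-n matrix stored with leading dimension ld:
     * column-major: A[i + j*ld] (ld >= m); row-major: A[i*ld + j] (ld >= n). */
    static double get(const double *A, int i, int j, int ld, int row_major)
    {
        return row_major ? A[i * ld + j] : A[i + j * ld];
    }

    int main(void)
    {
        /* The same 2-by-3 matrix [[1,2,3],[4,5,6]] in both storage orders. */
        double col_major[] = { 1, 4,   2, 5,   3, 6 };  /* ld = 2 (#rows)    */
        double row_major[] = { 1, 2, 3,   4, 5, 6 };    /* ld = 3 (#columns) */

        printf("%g %g\n", get(col_major, 1, 2, 2, 0),   /* prints 6 */
                          get(row_major, 1, 2, 3, 1));  /* prints 6 */
        return 0;
    }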

The number of routines in the Level 1 and Level 2 BLAS is large; consequently, the details are left out here.

Listing. AXPY: Scaling a vector and adding another vector.

Listing. GEMV: General matrix-vector product.

C language declarations of GESV (LU factorization and solve) functions in multiple precisions and argument domains are shown in the corresponding listing, as are the declarations of POSV (Cholesky matrix factorization and solve) functions.
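As with the Level 3 routines, the GESV declaration is only referenced above, so the following is a hypothetical sketch of what a batched GESV prototype could look like. The name lapack_dgesv_batch, the per-problem ipiv and info arrays, and the argument order are assumptions for illustration, reusing the enum types from the earlier sketches.

    /* For each problem i, factor A_i with partial pivoting and solve
     * A_i * X_i = B_i. B_i is overwritten with the solution, ipiv[i] receives
     * the pivot indices, and info[i] reports success or failure per problem. */
    void lapack_dgesv_batch(enum blas_layout layout,
                            const int *n, const int *nrhs,
                            double * const *A, const int *lda,
                            int * const *ipiv,
                            double * const *B, const int *ldb,
                            int *info,
                            int group_count, const int *group_size);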

The goal is to reduce expensive data movement by loading the data required for a task into fast memory and reusing it in computations from there as many times as possible. Hierarchical blocking and communication are needed for optimal performance even for memory-bound computations such as the Level 2 BLAS. Thus, splitting an algorithm into hierarchical tasks that block the computation over the available memory hierarchies to reduce data movement is essential for implementing high-performance BLAS. Details on how these techniques can be extended to develop a high-performance Batched BLAS, and in particular the extensively used batched GEMM, can be found elsewhere [2].
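As a minimal illustration of the blocking idea (not the library's actual kernel), the sketch below computes C += A * B for column-major n-by-n matrices one NB-by-NB tile at a time; the block size NB is a placeholder that a real implementation would tune for each memory level and combine with register blocking and vectorization.

    #define NB 64

    /* Blocked C += A * B, column-major, lda = ldb = ldc = n. Each tile of B and
     * C is reused from fast memory while the corresponding tiles are processed. */
    static void gemm_blocked(int n, const double *A, const double *B, double *C)
    {
        for (int jj = 0; jj < n; jj += NB)
            for (int kk = 0; kk < n; kk += NB)
                for (int ii = 0; ii < n; ii += NB)
                    for (int j = jj; j < jj + NB && j < n; ++j)
                        for (int k = kk; k < kk + NB && k < n; ++k) {
                            double b = B[k + j * n];
                            for (int i = ii; i < ii + NB && i < n; ++i)
                                C[i + j * n] += A[i + k * n] * b;
                        }
    }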

Besides hierarchical blocking, specialized kernels are designed for various sizes, and a comprehensive autotuning process is applied to all kernels. Different kernels are used for very small matrix sizes and for larger sizes on GPUs. For larger matrix sizes, streaming is applied to GEMMs tuned for those sizes. For these sizes, similar to the CPU case, multilevel blocking algorithms on GPUs must be coded in native machine language to overcome limitations of the CUDA compiler, the warp scheduler, or both [44].
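Schematically, the size-based kernel selection described above might be organized as in the following sketch; the variant names and thresholds are purely illustrative assumptions and would in practice be chosen by the autotuning process.

    /* Schematic dispatch between batched-GEMM kernel variants by problem size. */
    enum gemm_batch_variant {
        KERNEL_TINY,     /* register/shared-memory kernel for very small sizes */
        KERNEL_BLOCKED,  /* hierarchically blocked, autotuned kernel           */
        KERNEL_STREAMED  /* stream individually tuned large GEMMs              */
    };

    static enum gemm_batch_variant select_variant(int m, int n, int k)
    {
        int max_dim = m > n ? m : n;
        if (k > max_dim) max_dim = k;
        if (max_dim <= 32)  return KERNEL_TINY;      /* placeholder threshold */
        if (max_dim <= 256) return KERNEL_BLOCKED;   /* placeholder threshold */
        return KERNEL_STREAMED;
    }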

Running these specialized implementations through different streams gives the currently best-performing batch implementations for large matrices. See Figures 3 and 4 for more details. Applications that suffer from this problem include those that require tensor contractions, as in the quantum Hall effect, astrophysics [40], metabolic networks [36], CFD and the resulting PDEs through direct and multifrontal solvers [48], high-order FEM schemes for hydrodynamics [11], direct-iterative preconditioned solvers [33], quantum chemistry [7], and image [41] and signal [6] processing.

Batch LU factorization was used in subsurface transport simulation [46], whereby many chemical and microbiological reactions in a flow path are simulated in parallel [47]. Finally, small independent problems also occur as a very important aspect of computations on hierarchical matrices (H-matrices) [25]. One might expect that such applications would be well suited to accelerators or coprocessors, like GPUs. Due to the high levels of parallelism that these devices support, they can efficiently achieve very high performance for large data-parallel computations when they are used in combination with a CPU that handles the part of the computation that is difficult to parallelize [3, 26, 45].

But for several reasons, this turns out not to be the case for applications involving large amounts of data that come in small units. By using batch operations to overcome the bottleneck, small problems can be solved two to three times faster on GPUs, and with four to five times better energy efficiency, than on multicore CPUs alone (subject to the same power draw).
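For illustration, the sketch below shows the kind of call that realizes these gains: a loop over many small matrix products is replaced by a single batched invocation. The blas_dgemm_batch prototype and the enum values repeat the hypothetical naming of the earlier sketches and are not the exact API fixed by the standard.

    /* Illustrative types repeated from the earlier sketches. */
    enum blas_layout { BlasColMajor, BlasRowMajor };
    enum blas_trans  { BlasNoTrans, BlasTrans, BlasConjTrans };

    /* Hypothetical group-based batched GEMM prototype (illustrative only). */
    void blas_dgemm_batch(enum blas_layout layout,
                          const enum blas_trans *transA,
                          const enum blas_trans *transB,
                          const int *m, const int *n, const int *k,
                          const double *alpha,
                          const double * const *A, const int *lda,
                          const double * const *B, const int *ldb,
                          const double *beta,
                          double * const *C, const int *ldc,
                          int group_count, const int *group_size);

    /* One group of 'batch' identically sized n-by-n problems: C_i := A_i * B_i,
     * issued as a single batched call instead of 'batch' separate DGEMM calls. */
    void multiply_many(int batch, int n,
                       const double * const *A, const double * const *B,
                       double * const *C)
    {
        enum blas_trans trans[1] = { BlasNoTrans };
        int    dims[1]  = { n }, ld[1] = { n }, group_size[1] = { batch };
        double alpha[1] = { 1.0 }, beta[1] = { 0.0 };

        blas_dgemm_batch(BlasColMajor, trans, trans, dims, dims, dims,
                         alpha, A, ld, B, ld, beta, C, ld, 1, group_size);
    }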

For example, Figure 1(a) illustrates these gains for the case of many small LU factorizations: even in a multicore setting, the batch processing approach outperforms its non-batched counterpart. As another example, Figure 5(a) shows how the batched LU results were used to speed up a nuclear network simulation; the XNet benchmark shows a speedup of up to 3x.

Expressing the computations in applications through matrix algebra is often natural. Similarly to the use of BLAS, however, there are optimization opportunities for batch computing problems that cannot be folded into the Batched BLAS and therefore must be addressed separately. For instances where the operands originate from multi-dimensional data, which is a common case, future work will look at new interfaces and data abstractions.

With such interfaces and abstractions: (1) explicit preparation of operands can be replaced by an index operation; (2) operands do not need to be stored in matrix form, but can instead be loaded directly in matrix form into fast memory, with the computation proceeding from there; and (3) expressing computations through BLAS will not lead to loss of information.
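As a small illustration of point (1), the following sketch forms the operand pointer array for a batched call by index arithmetic over a contiguous tensor, so no explicit copies or repacking are needed; the helper name is hypothetical.

    #include <stdlib.h>

    /* Expose each n-by-n slice of a contiguous n x n x batch tensor as a
     * column-major matrix (lda = n) by computing a pointer into the tensor.
     * No data is moved; the "preparation" of operands is pure indexing. */
    static double **slice_pointers(double *tensor, int n, int batch)
    {
        double **A = malloc((size_t)batch * sizeof *A);
        if (!A) return NULL;
        for (int i = 0; i < batch; ++i)
            A[i] = tensor + (size_t)i * n * n;   /* pointer to the i-th slice */
        return A;
    }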

Finally, we reiterate that the goal is to provide the developers of applications, compilers, and runtime systems with the option of expressing many small BLAS operations as a single call to a routine from the new batch operation standard.

Thus, we hope that this standard will help and encourage community efforts to build higher-level algorithms.
