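This is where Grouped GEMM differs significantly from standard GEMM. In recent CUDA versions (e.g., CUDA 11.4+), you still call cublasLtMatmul, but instead of single values you pass arrays of pointers for alpha, beta, and the matrix data, typically through a dedicated API signature or an additional parameter. The exact mechanism depends on the toolkit release, so confirm it against the cuBLASLt documentation for the version you are using.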
Prepare arrays on the device that hold the pointers to each individual matrix in the group (e.g., one array of pointers per operand, with one entry for every matrix in the group).
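To make the pointer-array layout concrete, here is a minimal host-side sketch in plain C++. It is not cuBLASLt code: GemmProblem, const_pointer_array, mutable_pointer_array, and grouped_gemm_reference are hypothetical names introduced only for illustration. It assumes column-major storage with tight leading dimensions and per-problem alpha and beta values. On a GPU, the same pointer arrays would hold device addresses and would themselves be copied to device memory (for example with cudaMemcpy) before the library call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-problem shape descriptor (not a cuBLAS/cuBLASLt type).
struct GemmProblem {
    int m;  // rows of A and C
    int n;  // columns of B and C
    int k;  // columns of A, rows of B
};

// Collect one pointer per matrix, mirroring the "array of pointers" layout.
std::vector<const float*> const_pointer_array(
        const std::vector<std::vector<float>>& mats) {
    std::vector<const float*> ptrs;
    for (const auto& mat : mats) ptrs.push_back(mat.data());
    return ptrs;
}

std::vector<float*> mutable_pointer_array(
        std::vector<std::vector<float>>& mats) {
    std::vector<float*> ptrs;
    for (auto& mat : mats) ptrs.push_back(mat.data());
    return ptrs;
}

// CPU reference: C[g] = alpha[g] * A[g] * B[g] + beta[g] * C[g] for every
// problem g. Assumes column-major data with lda = m, ldb = k, ldc = m.
void grouped_gemm_reference(const std::vector<GemmProblem>& problems,
                            const std::vector<const float*>& a_ptrs,
                            const std::vector<const float*>& b_ptrs,
                            const std::vector<float*>& c_ptrs,
                            const std::vector<float>& alphas,
                            const std::vector<float>& betas) {
    const std::size_t count = problems.size();
    assert(a_ptrs.size() == count && b_ptrs.size() == count);
    assert(c_ptrs.size() == count);
    assert(alphas.size() == count && betas.size() == count);

    for (std::size_t g = 0; g < count; ++g) {
        const GemmProblem& p = problems[g];
        const float* A = a_ptrs[g];
        const float* B = b_ptrs[g];
        float* C = c_ptrs[g];
        for (int j = 0; j < p.n; ++j) {
            for (int i = 0; i < p.m; ++i) {
                float acc = 0.0f;
                for (int l = 0; l < p.k; ++l) {
                    acc += A[i + l * p.m] * B[l + j * p.k];
                }
                C[i + j * p.m] = alphas[g] * acc + betas[g] * C[i + j * p.m];
            }
        }
    }
}
```

Each problem in the group can have its own shape, which is the property that separates a grouped GEMM from a uniformly sized batched GEMM. A reference like this is also handy for checking GPU results on small inputs.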
cublasLtHandle_t handle;
cublasLtCreate(&handle);