ggml library documentation (docs)

纯真CZ

Last modified 2024-09-14 15:00:30

Licensed under CC 4.0 BY-SA

Tags: llama, c++

First published 2024-09-14 10:55:14

Original link: https://github.com/ggerganov/ggml/blob/master/include/ggml.h

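Background

I have recently been studying the PowerInfer code. It is derived from llama.cpp, so in practice this means reading the llama.cpp code; llama.cpp in turn relies on the ggml library for graph construction and computation.

For both llama.cpp and ggml, there is essentially no official reference beyond the docs directory in each project and the comments in the code, which is frustrating.

I happened to open ggml.h and found that the comments at the top of the file are a clear, concise, and useful reference. I am posting the translated documentation together with the English original for anyone else learning llama.cpp and ggml.

Origin: https://github.com/ggerganov/ggml/blob/master/include/ggml.h

Translated documentation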

GGML Tensor Library

This documentation is still a work in progress. If you would like specific topics to be covered, feel free to leave a comment:

https://github.com/ggerganov/whisper.cpp/issues/40

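Overview

This library implements:

a set of tensor operations
automatic differentiation
basic optimization algorithms

The library aims to provide a minimalistic approach to various machine learning tasks, including but not limited to:

linear regression
support vector machines
neural networks

The library lets the user define a function using the available tensor operations. The definition is represented internally as a computation graph, in which each tensor operation corresponds to one node. Once the graph is defined, the user can compute the function's value and/or its gradient with respect to the input variables. Optionally, the function can be optimized with one of the available optimization algorithms.

For example, here we define the function f(x) = a*x^2 + b: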

{
struct ggml_init_params params = {
    .mem_size = 16*1024*1024,
    .mem_buffer = NULL,
};

// memory allocation happens here
struct ggml_context * ctx = ggml_init(params);

struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);

ggml_set_param(ctx, x); // x is an input variable

struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
struct ggml_tensor * x2 = ggml_mul(ctx, x, x);
struct ggml_tensor * f = ggml_add(ctx, ggml_mul(ctx, a, x2), b);

...
}

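Note that the function definition above does not perform any actual computation. Computation happens only when the user explicitly requests it. For example, to compute the function's value at x = 2.0: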

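{
...

struct ggml_cgraph * gf = ggml_new_graph(ctx);
ggml_build_forward_expand(gf, f);
// Note: the graph is built by topological sort, walking backward from the
// final node, which is why the final result tensor is passed in here. This
// relies on the assumption that an operator may take several inputs but
// produces exactly one output.

// set the input variable and parameter values
ggml_set_f32(x, 2.0f);
ggml_set_f32(a, 3.0f);
ggml_set_f32(b, 4.0f);

// gf is already a pointer, so it is passed directly
ggml_graph_compute_with_ctx(ctx, gf, n_threads);

printf("f = %f\n", ggml_get_f32_1d(f, 0));

...
}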

The actual computation is performed in the ggml_graph_compute() function.
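To make the "define first, compute later" idea concrete, here is a minimal sketch in plain C. It does not use ggml, and every name in it (struct node, graph_build, graph_compute, eval_f) is hypothetical. It only illustrates the general technique under the assumption that each operation has at most two inputs and one output: defining an operation records a node, a post-order walk from the result node yields an evaluation order, and arithmetic happens only in the compute step.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mini-graph for illustration; NOT ggml's real data structures. */
enum op { OP_LEAF, OP_ADD, OP_MUL };

struct node {
    enum op       op;
    struct node * src[2];  /* input nodes; NULL for leaves */
    float         value;   /* filled in by graph_compute() */
    int           visited; /* set once the node has been queued */
};

#define MAX_NODES 16

struct graph {
    struct node * order[MAX_NODES]; /* nodes in evaluation order */
    int           n;
};

/* Defining an operation only records it; nothing is computed here. */
static struct node make_op(enum op op, struct node * a, struct node * b) {
    struct node n = { op, { a, b }, 0.0f, 0 };
    return n;
}

/* Post-order walk from the result node: inputs are queued before their users. */
static void graph_build(struct graph * g, struct node * n) {
    if (n == NULL || n->visited) return;
    n->visited = 1;
    graph_build(g, n->src[0]);
    graph_build(g, n->src[1]);
    assert(g->n < MAX_NODES);
    g->order[g->n++] = n;
}

/* The actual arithmetic happens only here; leaves keep the values set by the caller. */
static void graph_compute(struct graph * g) {
    for (int i = 0; i < g->n; i++) {
        struct node * n = g->order[i];
        if (n->op == OP_ADD) n->value = n->src[0]->value + n->src[1]->value;
        if (n->op == OP_MUL) n->value = n->src[0]->value * n->src[1]->value;
    }
}

/* Builds f = a*x*x + b, then evaluates it for the given leaf values. */
static float eval_f(float xv, float av, float bv) {
    struct node x  = make_op(OP_LEAF, NULL, NULL);
    struct node a  = make_op(OP_LEAF, NULL, NULL);
    struct node b  = make_op(OP_LEAF, NULL, NULL);
    struct node x2 = make_op(OP_MUL, &x, &x);
    struct node ax = make_op(OP_MUL, &a, &x2);
    struct node f  = make_op(OP_ADD, &ax, &b);

    struct graph g = { {0}, 0 };
    graph_build(&g, &f);

    /* Inputs are set after the graph is built, as in the example above. */
    x.value = xv;
    a.value = av;
    b.value = bv;
    graph_compute(&g);
    return f.value;
}
```

With x = 2, a = 3, b = 4 the sketch evaluates 3*2*2 + 4 = 16, the same value the ggml example is expected to print. The graph could be recomputed with new leaf values without rebuilding it, which is the point of separating the two phases.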

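The ggml_new_tensor_…() functions create new tensors. They are allocated in the memory buffer provided to ggml_init(). You must take care not to exceed the buffer size, so you need to know in advance how much memory your computation requires. Alternatively, you can allocate a sufficiently large buffer and, after defining the computation graph, call ggml_used_mem() to find out how much memory was actually needed.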
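The single-buffer scheme described above resembles a bump (arena) allocator. The following plain C sketch is an assumption about the general technique, not ggml's actual context implementation; the names arena_init, arena_alloc, arena_used, and arena_free, and the 16-byte alignment, are all hypothetical. It shows one up-front allocation, sub-allocations carved out of it, and a query for how much was actually consumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical arena for illustration; NOT ggml's real context. */
struct arena {
    unsigned char * base; /* start of the single backing buffer */
    size_t          size; /* total capacity in bytes */
    size_t          used; /* bytes handed out so far */
};

/* Performs the only heap allocation. Returns 1 on success, 0 on failure. */
static int arena_init(struct arena * a, size_t size) {
    assert(a != NULL);
    a->base = malloc(size);
    a->size = a->base ? size : 0;
    a->used = 0;
    return a->base != NULL;
}

/* Carves n bytes (rounded up to 16) out of the buffer.
 * Returns NULL instead of overrunning when the buffer is too small. */
static void * arena_alloc(struct arena * a, size_t n) {
    const size_t align = 16; /* assumed alignment, chosen for illustration */
    size_t padded = (n + align - 1) / align * align;
    if (padded < n || padded > a->size - a->used) return NULL;
    void * p = a->base + a->used;
    a->used += padded;
    return p;
}

/* Analogous in spirit to asking how much memory was actually needed. */
static size_t arena_used(const struct arena * a) {
    return a->used;
}

/* Releases the backing buffer; all sub-allocations become invalid together. */
static void arena_free(struct arena * a) {
    free(a->base);
    a->base = NULL;
    a->size = 0;
    a->used = 0;
}
```

A caller that does not know the required size in advance could initialize a generous arena, build everything once, read arena_used(), and size the buffer accordingly on later runs.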

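The ggml_set_param() function marks a tensor as an input variable. This marking is used by the automatic differentiation and optimization algorithms.

This approach lets you define the function graph once and then compute its forward or backward graph multiple times. All computations use the same memory buffer allocated in ggml_init(), so the user avoids memory allocation overhead at runtime.

The library supports multi-dimensional tensors with up to 4 dimensions. FP16 and FP32 are first-class data types, but in theory the library can be extended to support FP8 and integer data types.

Each tensor operation produces a new tensor. Initially, the library was envisioned to support only unary and binary operations, and most of the available operations fall into one of these two categories. Over time, it became clear that the library needs to support more complex operations. How best to support them is not yet clear, but the following operations demonstrate a few examples: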

ggml_permute()
ggml_conv_1d_1s()
ggml_conv_1d_2s()

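For each tensor operator, the library implements a forward and a backward computation function. The forward function computes the output tensor's value from the input tensors' values. The backward function computes the adjoints of the input tensors from the adjoint of the output tensor. For a detailed explanation of what this means, take a calculus class or watch the following video:

What is Automatic Differentiation?
https://www.youtube.com/watch?v=wG_nF1awSSY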
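As a rough illustration of what "adjoint" means here, the sketch below works through the forward and backward passes for f = a*x^2 + b by hand in plain C. It is not how ggml implements its backward functions; struct grads and backward_f are hypothetical names. Each intermediate value gets an adjoint (the derivative of f with respect to that value), and adjoints flow from the output back to the inputs, one operation at a time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: derivatives of f with respect to x, a, and b. */
struct grads {
    float dx;
    float da;
    float db;
};

/* Computes f = a*x*x + b and its gradients by reverse-mode differentiation. */
static struct grads backward_f(float x, float a, float b, float * f_out) {
    assert(f_out != NULL);

    /* Forward pass: one line per operation in the graph. */
    float x2  = x * x;
    float ax2 = a * x2;
    float f   = ax2 + b;

    /* Backward pass: start from the output, whose adjoint is df/df = 1. */
    float f_adj = 1.0f;

    /* f = ax2 + b: addition passes the adjoint through to both inputs. */
    float ax2_adj = f_adj;
    float b_adj   = f_adj;

    /* ax2 = a * x2: each input receives the adjoint times the other input. */
    float a_adj  = ax2_adj * x2;
    float x2_adj = ax2_adj * a;

    /* x2 = x * x: x is used twice, so its two contributions are summed. */
    float x_adj = x2_adj * x + x2_adj * x;

    *f_out = f;
    struct grads g = { x_adj, a_adj, b_adj };
    return g;
}
```

At x = 2, a = 3, b = 4 this yields f = 16 with df/dx = 2*a*x = 12, df/da = x^2 = 4, and df/db = 1, which matches differentiating the formula directly.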

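Tensor data (struct ggml_tensor)

Tensors are stored in memory via the ggml_tensor struct. The struct holds information about the tensor's size, its data type, and the memory buffer where the tensor data is stored. It also contains pointers to the "source" tensors, i.e. the tensors that were used to compute the current tensor. For example: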

{
struct ggml_tensor * c = ggml_add(ctx, a, b);
// c is the result of a + b, so its "source" tensors are a and b
assert(c->src[0] == a);
assert(c->src[1] == b);
}

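Multi-dimensional tensors are stored in row-major order. The ggml_tensor struct contains fields for the number of elements in each dimension ("ne") and the number of bytes ("nb", a.k.a. stride). This makes it possible to store tensors that are not contiguous in memory, which is useful for operations such as transposition and permutation. All tensor operations must take the stride into account and must not assume that the tensor is contiguous in memory.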
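The sketch below shows why per-dimension byte strides make transposition cheap. It is plain C with a hypothetical struct view2d that only loosely mirrors the ne/nb fields described above; it is not the real ggml_tensor. Transposing swaps the sizes and strides and leaves the data where it is, which is exactly why the resulting view is no longer contiguous.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 2-D view for illustration; NOT the real ggml_tensor struct. */
struct view2d {
    void * data;
    int    ne[2]; /* number of elements in each dimension */
    size_t nb[2]; /* stride in bytes for each dimension */
};

/* Wraps a densely packed float buffer: dimension 0 is the fastest-varying one. */
static struct view2d view_contiguous(float * data, int ne0, int ne1) {
    struct view2d v = {
        data,
        { ne0, ne1 },
        { sizeof(float), sizeof(float) * (size_t) ne0 }
    };
    return v;
}

/* Transposes by swapping sizes and strides; no element is copied or moved. */
static struct view2d view_transpose(struct view2d v) {
    struct view2d t = { v.data, { v.ne[1], v.ne[0] }, { v.nb[1], v.nb[0] } };
    return t;
}

/* Element access always goes through the strides, never through assumed packing. */
static float view_get(struct view2d v, int i0, int i1) {
    assert(i0 >= 0 && i0 < v.ne[0] && i1 >= 0 && i1 < v.ne[1]);
    return *(float *) ((char *) v.data + (size_t) i1 * v.nb[1] + (size_t) i0 * v.nb[0]);
}

/* A view is contiguous when its strides match a densely packed layout. */
static int view_is_contiguous(struct view2d v) {
    return v.nb[0] == sizeof(float) && v.nb[1] == v.nb[0] * (size_t) v.ne[0];
}
```

Code that indexed the transposed view as if it were packed would read the wrong elements, which is the reason every operation has to honor the strides.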

Tensor data is accessed via the "data" pointer. For example:

{
const int nx = 2;
const int ny = 3;

struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, nx, ny);

for (int y = 0; y < ny; y++) {
    for (int x = 0; x < nx; x++) {
       *(float *) ((char *) a->data + y*a->nb[1] + x*a->nb[0]) = x + y;
    }
}

...
}

Alternatively, helper functions such as ggml_get_f32_1d() and ggml_set_f32_1d() can be used.

English original

GGML Tensor Library

This documentation is still a work in progress. If you wish some specific topics to be covered, feel free to drop a comment:

https://github.com/ggerganov/whisper.cpp/issues/40

Overview

This library implements:

a set of tensor operations
automatic differentiation
basic optimization algorithms

The aim of this library is to provide a minimalistic approach for various machine learning tasks. This includes, but is not limited to, the following:

linear regression
support vector machines
neural networks

The library allows the user to define a certain function using the available tensor operations. This function definition is represented internally via a computation graph. Each tensor operation in the function definition corresponds to a node in the graph. Having the computation graph defined, the user can choose to compute the function’s value and/or its gradient with respect to the input variables. Optionally, the function can be optimized using one of the available optimization algorithms.

For example, here we define the function: f(x) = a*x^2 + b

{
struct ggml_init_params params = {
    .mem_size = 16*1024*1024,
    .mem_buffer = NULL,
};

// memory allocation happens here
struct ggml_context * ctx = ggml_init(params);

struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);

ggml_set_param(ctx, x); // x is an input variable

struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
struct ggml_tensor * x2 = ggml_mul(ctx, x, x);
struct ggml_tensor * f = ggml_add(ctx, ggml_mul(ctx, a, x2), b);

...
}

Notice that the function definition above does not involve any actual computation. The computation is performed only when the user explicitly requests it. For example, to compute the function’s value at x = 2.0:

{
...

struct ggml_cgraph * gf = ggml_new_graph(ctx);
ggml_build_forward_expand(gf, f);

// set the input variable and parameter values
ggml_set_f32(x, 2.0f);
ggml_set_f32(a, 3.0f);
ggml_set_f32(b, 4.0f);

ggml_graph_compute_with_ctx(ctx, gf, n_threads); // gf is already a pointer

printf("f = %f\n", ggml_get_f32_1d(f, 0));

...
}

The actual computation is performed in the ggml_graph_compute() function.

The ggml_new_tensor_…() functions create new tensors. They are allocated in the memory buffer provided to the ggml_init() function. You have to be careful not to exceed the memory buffer size. Therefore, you have to know in advance how much memory you need for your computation. Alternatively, you can allocate a large enough memory and after defining the computation graph, call the ggml_used_mem() function to find out how much memory was actually needed.

The ggml_set_param() function marks a tensor as an input variable. This is used by the automatic differentiation and optimization algorithms.

The described approach allows the user to define the function graph once and then compute its forward or backward graphs multiple times. All computations will use the same memory buffer allocated in the ggml_init() function. This way the user can avoid the memory allocation overhead at runtime.

The library supports multi-dimensional tensors - up to 4 dimensions. The FP16 and FP32 data types are first class citizens, but in theory the library can be extended to support FP8 and integer data types.

Each tensor operation produces a new tensor. Initially the library was envisioned to support only the use of unary and binary operations. Most of the available operations fall into one of these two categories. With time, it became clear that the library needs to support more complex operations. The way to support these operations is not clear yet, but a few examples are demonstrated in the following operations:

ggml_permute()
ggml_conv_1d_1s()
ggml_conv_1d_2s()

For each tensor operator, the library implements a forward and backward computation function. The forward function computes the output tensor value given the input tensor values. The backward function computes the adjoint of the input tensors given the adjoint of the output tensor. For a detailed explanation of what this means, take a calculus class, or watch the following video:

What is Automatic Differentiation?
https://www.youtube.com/watch?v=wG_nF1awSSY

Tensor data (struct ggml_tensor)

The tensors are stored in memory via the ggml_tensor struct. The structure provides information about the size of the tensor, the data type, and the memory buffer where the tensor data is stored. Additionally, it contains pointers to the “source” tensors - i.e. the tensors that were used to compute the current tensor. For example:

{
struct ggml_tensor * c = ggml_add(ctx, a, b);

assert(c->src[0] == a);
assert(c->src[1] == b);
}

The multi-dimensional tensors are stored in row-major order. The ggml_tensor struct contains fields for the number of elements in each dimension (“ne”) as well as the number of bytes (“nb”, a.k.a. stride). This makes it possible to store tensors that are not contiguous in memory, which is useful for operations such as transposition and permutation. All tensor operations have to take the stride into account and not assume that the tensor is contiguous in memory.

The data of the tensor is accessed via the “data” pointer. For example:

{
const int nx = 2;
const int ny = 3;

struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, nx, ny);

for (int y = 0; y < ny; y++) {
    for (int x = 0; x < nx; x++) {
        *(float *) ((char *) a->data + y*a->nb[1] + x*a->nb[0]) = x + y;
    }
}

...
}

Alternatively, there are helper functions, such as ggml_get_f32_1d() and ggml_set_f32_1d() that can be used.