
It appears context memory usage can be trivially halved by using fp16? #146

Closed
@jarcen

Description

I'm not fully familiar with this codebase, so pardon me if I'm wrong. My first attempt at modifying the code was to expand the hardcoded context window from 512 to 4096, but the additional memory usage was not pleasant.

LLaMA 7B quantized to 4 bits reports ggml ctx size = 8113.34 MB.

I went to the code and changed the data type of memory_k and memory_v from GGML_TYPE_F32 to GGML_TYPE_F16.

These are the changed lines:

        ctx_size += n_ctx*n_layer*n_embd*ggml_type_sizef(GGML_TYPE_F16); // memory_k
        ctx_size += n_ctx*n_layer*n_embd*ggml_type_sizef(GGML_TYPE_F16); // memory_v

And these:

        model.memory_k = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n_elements);
        model.memory_v = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n_elements);

The new memory usage is reportedly ggml ctx size = 6065.34 MB, and Task Manager agrees. That's 2 GB down.
So far everything is working: no crashes and no degradation in quality. Is there any reason not to do this?
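
As a sanity check on that 2 GB figure: assuming the standard LLaMA 7B hyperparameters n_layer = 32 and n_embd = 4096 (not shown in this issue) and n_ctx = 4096, each of memory_k and memory_v holds n_ctx*n_layer*n_embd elements, so switching them from 4-byte F32 to 2-byte F16 should save exactly 2048 MB, which matches 8113.34 MB - 6065.34 MB. A minimal standalone sketch of that arithmetic:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        // Assumed LLaMA 7B hyperparameters (not taken from this issue).
        const uint64_t n_layer = 32;
        const uint64_t n_embd  = 4096;
        const uint64_t n_ctx   = 4096; // the expanded context window

        // memory_k and memory_v each store n_ctx*n_layer*n_embd elements.
        const uint64_t n_elements = n_ctx*n_layer*n_embd;

        const double mb = 1024.0*1024.0;
        const double kv_f32 = 2.0*n_elements*4/mb; // K + V at 4 bytes per element
        const double kv_f16 = 2.0*n_elements*2/mb; // K + V at 2 bytes per element

        printf("KV cache F32: %.2f MB\n", kv_f32);          // 4096.00 MB
        printf("KV cache F16: %.2f MB\n", kv_f16);          // 2048.00 MB
        printf("saved:        %.2f MB\n", kv_f32 - kv_f16); // 2048.00 MB
        return 0;
    }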

Metadata

    Labels

    enhancement (New feature or request)
