
It appears context memory usage can be trivially halved by using fp16? #146

Closed
@jarcen

Description

I'm not fully familiar with this codebase, so pardon me if I'm wrong. My first attempt at modifying the code was to expand the hardcoded context window from 512 to 4096, but the additional memory usage was not pleasant.

LLaMA 7B quantized to 4 bits reports ggml ctx size = 8113.34 MB.

I went to the code and changed the data type of memory_k and memory_v from GGML_TYPE_F32 to GGML_TYPE_F16.

These are the changed lines:

        ctx_size += n_ctx*n_layer*n_embd*ggml_type_sizef(GGML_TYPE_F16); // memory_k
        ctx_size += n_ctx*n_layer*n_embd*ggml_type_sizef(GGML_TYPE_F16); // memory_v

And these:

        model.memory_k = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n_elements);
        model.memory_v = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n_elements);

The new memory usage is reportedly ggml ctx size = 6065.34 MB, and Task Manager agrees. That's 2 GB down.
So far everything is working: no crashes and no degradation in quality. Is there any reason not to do this?
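
As a sanity check on that 2 GB figure: assuming the standard LLaMA 7B hyperparameters n_layer = 32 and n_embd = 4096 (not shown in this issue) and n_ctx = 4096, each of memory_k and memory_v holds n_ctx*n_layer*n_embd elements, so switching them from 4-byte F32 to 2-byte F16 should save exactly 2048 MB, which matches 8113.34 MB - 6065.34 MB. A minimal standalone sketch of that arithmetic:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        // Assumed LLaMA 7B hyperparameters (not taken from this issue).
        const uint64_t n_layer = 32;
        const uint64_t n_embd  = 4096;
        const uint64_t n_ctx   = 4096; // the expanded context window

        // memory_k and memory_v each store n_ctx*n_layer*n_embd elements.
        const uint64_t n_elements = n_ctx*n_layer*n_embd;

        const double mb = 1024.0*1024.0;
        const double kv_f32 = 2.0*n_elements*4/mb; // K + V at 4 bytes per element
        const double kv_f16 = 2.0*n_elements*2/mb; // K + V at 2 bytes per element

        printf("KV cache F32: %.2f MB\n", kv_f32);          // 4096.00 MB
        printf("KV cache F16: %.2f MB\n", kv_f16);          // 2048.00 MB
        printf("saved:        %.2f MB\n", kv_f32 - kv_f16); // 2048.00 MB
        return 0;
    }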

Metadata

    Labels

    enhancement (New feature or request)
