Tags: duaneking/llama.cpp
Quantized dot products for CUDA mul mat vec (ggml-org#2067)
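To make the commit above concrete, here is a hypothetical pure-Python sketch of a block-quantized dot product in the spirit of ggml's Q8_0 format (blocks of 32 int8 quants, each carrying one float scale). The CUDA kernels in the commit do the analogous math on the quants directly inside the matrix-vector product, rather than dequantizing the whole matrix first; the function names and block layout below are illustrative, not the actual ggml code.

```python
BLOCK = 32  # quants per block, as in ggml's Q8_0 layout

def quantize_q8(xs):
    """Quantize a float vector into (scale, int8-quant) blocks."""
    blocks = []
    for i in range(0, len(xs), BLOCK):
        chunk = xs[i:i + BLOCK]
        amax = max(abs(v) for v in chunk) or 1.0
        scale = amax / 127.0          # map the largest magnitude to 127
        quants = [round(v / scale) for v in chunk]
        blocks.append((scale, quants))
    return blocks

def dot_q8(a_blocks, b_blocks):
    """Dot product on quantized blocks: one integer multiply-accumulate
    loop per block, then a single multiply by the two block scales."""
    total = 0.0
    for (sa, qa), (sb, qb) in zip(a_blocks, b_blocks):
        acc = sum(x * y for x, y in zip(qa, qb))  # integer accumulation
        total += sa * sb * acc
    return total
```

Keeping the accumulation in integers per block is what makes this shape attractive on GPUs with fast integer dot-product instructions.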
llama: Don't double count the sampling time (ggml-org#2107)
Add an API example using server.cpp similar to OAI. (ggml-org#2009)
* add api_like_OAI.py
* add evaluated token count to server
* add /v1/ endpoints binding
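The adapter in this commit sits between an OpenAI-style client and server.cpp. A minimal sketch of the kind of translation it performs is below; the llama.cpp-side field names (`prompt`, `n_predict`, `temperature`, `stop`) follow the server example, but the chat template and function name here are made-up illustrations, not what api_like_OAI.py actually uses.

```python
def oai_to_completion(oai_request):
    """Flatten OpenAI-style chat messages into a single prompt and
    rename sampling fields to server.cpp's /completion parameters."""
    prompt = ""
    for msg in oai_request["messages"]:
        prompt += f'{msg["role"]}: {msg["content"]}\n'
    prompt += "assistant:"  # hypothetical template; not the real one
    return {
        "prompt": prompt,
        "n_predict": oai_request.get("max_tokens", 128),
        "temperature": oai_request.get("temperature", 0.8),
        "stop": ["\nuser:"],  # stop before the next user turn
    }
```

The response would then be rewrapped in the OpenAI JSON shape on the way back out.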
ggml : sync latest (new ops, macros, refactoring) (ggml-org#2106)
- add ggml_argmax()
- add ggml_tanh()
- add ggml_elu()
- refactor ggml_conv_1d() and variants
- refactor ggml_conv_2d() and variants
- add helper macros to reduce code duplication in ggml.c
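For reference, the semantics of the new ops are simple: tanh and ELU apply element-wise, argmax is a reduction. A pure-Python sketch (assuming `alpha = 1.0` for ELU, the common definition `f(x) = x if x > 0 else alpha * (e^x - 1)`; ggml's exact conventions may differ):

```python
import math

def argmax(xs):
    """Index of the largest element (first one wins on ties)."""
    best = 0
    for i, v in enumerate(xs):
        if v > xs[best]:
            best = i
    return best

def elu(x, alpha=1.0):
    """Exponential Linear Unit: identity for x > 0, saturating below."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def tanh_(x):
    """Hyperbolic tangent, via the stdlib."""
    return math.tanh(x)
```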
Allow old Make to build server. (ggml-org#2098)
Also make server build by default. Tested with Make 3.82.
embd-input: Fix input embedding example unsigned int seed (ggml-org#2105)
Simple webchat for server (ggml-org#1998)
* expose simple web interface on root domain
* embed index and add --path for choosing static dir
* allow server to multithread: web browsers send a lot of garbage requests, so we want the server to multithread when serving 404s for favicons etc. To avoid blowing up llama we just take a mutex when it's invoked.
* let's try this with the xxd tool instead and see if msvc is happier with that
* enable server in Makefiles
* add /completion.js file to make it easy to use the server from js
* slightly nicer css
* rework state management into session, expose historyTemplate to settings
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
CI: make the brew update temporarily optional. (ggml-org#2092)
Until they decide to fix the brew installation in the macOS runners; see the open issues, e.g. actions/runner-images#7710.