Insights: ggerganov/llama.cpp
Overview
28 Releases published by 1 person
-
b4304
published
Dec 11, 2024 -
b4311
published
Dec 12, 2024 -
b4312
published
Dec 12, 2024 -
b4314
published
Dec 12, 2024 -
b4315
published
Dec 12, 2024 -
b4317
published
Dec 12, 2024 -
b4318
published
Dec 13, 2024 -
b4319
published
Dec 13, 2024 -
b4320
published
Dec 13, 2024 -
b4321
published
Dec 13, 2024 -
b4324
published
Dec 13, 2024 -
b4325
published
Dec 13, 2024 -
b4326
published
Dec 13, 2024 -
b4327
published
Dec 14, 2024 -
b4329
published
Dec 14, 2024 -
b4331
published
Dec 15, 2024 -
b4333
published
Dec 15, 2024 -
b4337
published
Dec 16, 2024 -
b4338
published
Dec 17, 2024 -
b4341
published
Dec 17, 2024 -
b4342
published
Dec 17, 2024 -
b4343
published
Dec 17, 2024 -
b4348
published
Dec 17, 2024 -
b4349
published
Dec 17, 2024 -
b4350
published
Dec 17, 2024 -
b4351
published
Dec 18, 2024 -
b4353
published
Dec 18, 2024 -
b4354
published
Dec 18, 2024
45 Pull requests merged by 30 people
-
server : output embeddings for all tokens when pooling = none (a usage sketch follows this list)
#10861 merged
Dec 18, 2024 -
server : add "tokens" output
#10853 merged
Dec 18, 2024 -
server : (embeddings) using same format for "input" and "content"
#10872 merged
Dec 18, 2024 -
docs: Fix HIP (née hipBLAS) in README
#10880 merged
Dec 18, 2024 -
Revert "Add Falcon3 model support"
#10876 merged
Dec 18, 2024 -
Use model->gguf_kv for loading the template instead of using the C API.
#10868 merged
Dec 17, 2024 -
tests: add tests for GGUF
#10830 merged
Dec 17, 2024 -
ggml : update ggml_backend_cpu_device_supports_op
#10867 merged
Dec 17, 2024 -
server : fill usage info in embeddings and rerank responses
#10852 merged
Dec 17, 2024 -
Add Falcon3 model support
#10864 merged
Dec 17, 2024 -
readme : update typos
#10863 merged
Dec 17, 2024 -
server : (UI) fix missing async generator on safari
#10857 merged
Dec 17, 2024 -
vulkan: bugfixes for small subgroup size systems + llvmpipe test
#10809 merged
Dec 17, 2024 -
rwkv6: add wkv6 support for Vulkan backend
#10829 merged
Dec 16, 2024 -
unicode : improve naming style
#10838 merged
Dec 16, 2024 -
sampling : refactor + optimize penalties sampler
#10803 merged
Dec 16, 2024 -
Allow locally downloaded models for QwenVL
#10833 merged
Dec 15, 2024 -
Add Deepseek MoE v1 & GigaChat models
#10827 merged
Dec 15, 2024 -
scripts : change build path to "build-bench" for compare-commits.sh
#10836 merged
Dec 15, 2024 -
server: (UI) add syntax highlighting and latex math rendering
#10808 merged
Dec 15, 2024 -
server: Fix `has_next_line` in JSON response
#10818 merged
Dec 14, 2024 -
nix: allow to override rocm gpu targets
#10794 merged
Dec 14, 2024 -
Add support for Qwen2VL
#10361 merged
Dec 14, 2024 -
Removes spurious \r in output that causes logging in journalctl to tr…
#10771 merged
Dec 13, 2024 -
Introducing experimental OpenCL backend with support for Qualcomm Adreno GPUs
#10693 merged
Dec 13, 2024 -
Opt class for positional argument handling
#10508 merged
Dec 13, 2024 -
fix: graceful shutdown for Docker images
#10815 merged
Dec 13, 2024 -
[gguf-py] gguf_reader: numpy 2 newbyteorder fix
#9772 merged
Dec 13, 2024 -
Fix crash caused by ggml_backend_load_all when launching on Android Activity
#10812 merged
Dec 13, 2024 -
vulkan: small mul_mat_vec optimizations
#10665 merged
Dec 13, 2024 -
SYCL: Reduce most of the compiler warnings
#10748 merged
Dec 13, 2024 -
ggml: Fix compilation issues on ARM platform when building without fp16
#10811 merged
Dec 13, 2024 -
common : improve -ctv -ctk CLI arguments
#10806 merged
Dec 12, 2024 -
contrib : add ngxson as codeowner for server, devops
#10804 merged
Dec 12, 2024 -
[backend](cuda): faster uncontiguous concat
#10760 merged
Dec 12, 2024 -
remove CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS
#10797 merged
Dec 12, 2024 -
Vulkan: Use improved q4_k and q5_k dequant code in dequant shaders
#10798 merged
Dec 12, 2024 -
Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats
#10721 merged
Dec 12, 2024 -
common : add missing env var for speculative
#10801 merged
Dec 12, 2024 -
docs: update server streaming mode documentation
#9519 merged
Dec 11, 2024 -
gguf-py : bump version to 0.11.0
#10788 merged
Dec 11, 2024 -
server : (UI) add tok/s, get rid of completion.js
#10786 merged
Dec 11, 2024 -
Fix a small typo in the Quantization Docs
#10772 merged
Dec 11, 2024 -
ci : pin nodejs to 22.11.0
#10779 merged
Dec 11, 2024 -
bug-fix: snprintf prints NULL in place of the last character
#10419 merged
Dec 11, 2024
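For context on the embeddings-related server changes above (#10861, #10853, #10872), here is a minimal client-side sketch. It assumes a server started roughly as `llama-server -m model.gguf --embeddings --pooling none` and an OpenAI-style `/v1/embeddings` endpoint accepting an `"input"` field; the flags, endpoint path, and response shape are assumptions inferred from the PR titles rather than taken from the PRs themselves, so check the server README for the authoritative API.

```python
# Hedged sketch: query a llama-server embeddings endpoint and inspect the
# per-token vectors expected when the server runs with `--pooling none`.
# Endpoint path, request/response field names, and the nested list shape
# are assumptions; verify them against the server documentation.
import json
import urllib.request

def get_embeddings(text: str, host: str = "http://localhost:8080"):
    payload = json.dumps({"input": text}).encode("utf-8")
    req = urllib.request.Request(
        f"{host}/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # With pooling disabled, each entry's "embedding" is expected to be a
    # list of per-token vectors rather than a single pooled vector.
    return [item["embedding"] for item in body["data"]]

if __name__ == "__main__":
    per_token = get_embeddings("Hello, llama.cpp!")[0]
    print(f"{len(per_token)} token vectors of dimension {len(per_token[0])}")
```

With a pooled mode (mean, cls, last) the same call should instead return one flat vector per input, so a client that supports both configurations needs to handle both shapes.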
18 Pull requests opened by 18 people
-
server : fix logprobs, make it OAI-compatible
#10783 opened
Dec 11, 2024 -
tts : add OuteTTS support
#10784 opened
Dec 11, 2024 -
Bamba architecture
#10810 opened
Dec 12, 2024 -
Add support for Microsoft Phi-4 model
#10817 opened
Dec 13, 2024 -
Improve progress bar
#10821 opened
Dec 13, 2024 -
add `ggml_backend_sched_dump_dot`
#10825 opened
Dec 14, 2024 -
added docker-multi-stage builds
#10832 opened
Dec 14, 2024 -
Fix compilation on Pop!_OS 22.04 LTS CUDA
#10835 opened
Dec 15, 2024 -
SYCL: Migrate away from deprecated ggml_tensor->backend
#10840 opened
Dec 15, 2024 -
vulkan: multi-row k quants
#10846 opened
Dec 16, 2024 -
SYCL: Fixes for building SYCL backend for AMD GPUs
#10851 opened
Dec 16, 2024 -
vulkan: optimize coopmat2 dequant functions
#10855 opened
Dec 16, 2024 -
Roberta embeddings fixes
#10856 opened
Dec 16, 2024 -
llama: Ensure KV cache is fully defragmented.
#10873 opened
Dec 17, 2024 -
ggml-cpu: replace NEON asm with intrinsics in ggml_gemv_q4_0_4x8_q8_0()
#10874 opened
Dec 17, 2024 -
server: avoid overwriting Authorization header
#10878 opened
Dec 18, 2024 -
Add Falcon3 support and Fix issue #10875
#10883 opened
Dec 18, 2024 -
tests: disable GGUF test for bad value size
#10886 opened
Dec 18, 2024
38 Issues closed by 12 people
-
Misc. bug: JS error when using the web ui on iPhone
#10842 closed
Dec 18, 2024 -
Eval bug: PR#10864 tokenization regression
#10875 closed
Dec 18, 2024 -
Misc. bug: Server Demo on Mac, safari return error
#10841 closed
Dec 17, 2024 -
Bug: Intel Arc - not working at all
#9106 closed
Dec 17, 2024 -
Eval bug: EXAONE-3.5-2.4B-Instruct has relatively low context limit (50% the limit of Qwen 2.5 3B)
#10823 closed
Dec 16, 2024 -
Support Mistral-Nemo-Instruct-2407 128K
#8577 closed
Dec 16, 2024 -
Bug: Model Output Repeats and Shows Errors when Running GGUF File with llama.cpp
#9788 closed
Dec 16, 2024 -
Bug: Server /v1/chat/completions API response's model info is wrong
#10056 closed
Dec 16, 2024 -
Bug: [SYCL] SYCL + Docker
#10113 closed
Dec 16, 2024 -
Feature Request: count tokens before calling '/v1/chat/completions'
#10115 closed
Dec 16, 2024 -
Build docker image llama.cpp:server-cuda: CMakeLists.txt missing
#10844 closed
Dec 15, 2024 -
Bug: Build failure with GGML_VULKAN=1 GGML_HIPBLAS=1
#10284 closed
Dec 15, 2024 -
web UI : support syntax highlighting
#10246 closed
Dec 15, 2024 -
Feature Request: RDMA support for rpc back ends
#9493 closed
Dec 15, 2024 -
Bug: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED
#10080 closed
Dec 15, 2024 -
Feature Request: Precompiled llamacpp builds of cuda 12.2
#10093 closed
Dec 15, 2024 -
Misc. bug: Some server response JSON still not restored
#10728 closed
Dec 14, 2024 -
Documentation Inconsistency: llama-server endpoint
#10715 closed
Dec 14, 2024 -
Support QuaRot quantization scheme
#6444 closed
Dec 14, 2024 -
Bug: llama-server not logging to file
#10078 closed
Dec 14, 2024 -
Bug: Floating Point Exceptions turned off by default, hiding fpExceptions
#10083 closed
Dec 14, 2024 -
Feature Request: Meta releases Layer Skip, an end-to-end solution for accelerating LLMs
#10090 closed
Dec 14, 2024 -
webUI local storage can become corrupted
#10348 closed
Dec 13, 2024 -
Bug: server (New UI) ChatML templates are wrong
#9640 closed
Dec 13, 2024 -
Clean up server code
#5762 closed
Dec 13, 2024 -
llama : save downloaded models to local cache
#7252 closed
Dec 13, 2024 -
Bug: No docs explain the value for cache-type-k/v
#10373 closed
Dec 13, 2024 -
Feature Request: Tensor Parallelism support
#9086 closed
Dec 13, 2024 -
Compile bug: Vulkan build fails on GL_KHR_cooperative_matrix
#10785 closed
Dec 12, 2024 -
Feature Request: llama.cpp server - generated syntax code coloring
#10800 closed
Dec 12, 2024 -
Compile bug: /ggml/src/libggml.so: undefined reference to `std::filesystem::__cxx11
#10778 closed
Dec 12, 2024 -
Eval bug: ValueError: Duplicated tensor name 'token_embd.weight'
#10756 closed
Dec 12, 2024 -
Feature Request: Convert .devops container images to be RHEL-based UBI images rather than Ubuntu based
#9961 closed
Dec 12, 2024 -
Bug: Setting the `np` configs leads to garbled generated tokens.
#10070 closed
Dec 12, 2024 -
Bug: Wrong slots management when receiving multiple concurrent requests.
#10072 closed
Dec 12, 2024 -
Feature Request: Implement « Why Does the Effective Context Length of LLMs Fall Short? »
#10075 closed
Dec 12, 2024 -
Feature Request: Add "tokens per second" information in the Web UI
#10502 closed
Dec 11, 2024
19 Issues opened by 19 people
-
Feature Request: support `"encoding_format": "base64"` in the `*/embeddings` endpoints
#10887 opened
Dec 18, 2024 -
Compile bug: bad interpreter: No such file or directory
#10881 opened
Dec 18, 2024 -
Feature Request: Add support for SmolVLM
#10877 opened
Dec 17, 2024 -
Misc. bug: [SERVER] Multiple slots, generation speed is degraded after each generation/slot used
#10860 opened
Dec 17, 2024 -
Misc. bug: llama-bench SEGFAULTS w/ SYCL/HIP backend, however llama-cli seems to work
#10850 opened
Dec 16, 2024 -
Compile bug: Compiling on Maxwell architecture 52 cuda12.7
#10849 opened
Dec 16, 2024 -
Feature Request: Q6_0 quant
#10848 opened
Dec 16, 2024 -
Eval bug: ggml_metal_encode_node: error: unsupported op 'IM2COL'
#10845 opened
Dec 16, 2024 -
Eval bug: Qwen2-VL Hallucinates image content on Vulkan backend
#10843 opened
Dec 15, 2024 -
Feature Request: Add support for the WePOINTS/POINTS1.5 model
#10834 opened
Dec 15, 2024 -
Feature Request: Allow Filtering LLama Server Response Fields
#10819 opened
Dec 13, 2024 -
Feature Request: Support for C4AI Command R7B / Cohere2ForCausalLM
#10816 opened
Dec 13, 2024 -
Feature Request: Add support for Phi-4 model
#10814 opened
Dec 13, 2024 -
Error while using llama-quantize with Meta-Llama-3.1-8B-Instruct
#10793 opened
Dec 12, 2024
53 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
more performance with llamafile tinyblas on x86_64.
#10714 commented on
Dec 16, 2024 • 22 new comments -
ggml: GGML_NATIVE uses -mcpu=native on ARM
#10752 commented on
Dec 18, 2024 • 8 new comments -
musa: fix aarch64 build
#10781 commented on
Dec 16, 2024 • 3 new comments -
Support for Llama-3_1-Nemotron-51B
#10669 commented on
Dec 18, 2024 • 2 new comments -
Feature Request: server default system prompt support, like -spf in older versions (e.g. for gemma2)
#10520 commented on
Dec 11, 2024 • 0 new comments -
Bug: I am unable to use llama_cli interactively
#10297 commented on
Dec 17, 2024 • 0 new comments -
Bug: llama-gbnf-validator parses grammar but gets a seg fault when validating an input string against the grammar
#10321 commented on
Dec 17, 2024 • 0 new comments -
Misc. bug: Q4_0 with runtime repacking not working as expected (TYPE_Q4_0_4_4 REMOVED)
#10757 commented on
Dec 17, 2024 • 0 new comments -
Bug: llama.cpp with Vulkan not running on Snapdragon X + Windows (Copilot+PCs)
#8455 commented on
Dec 17, 2024 • 0 new comments -
Misc. bug: Virus detected
#10768 commented on
Dec 17, 2024 • 0 new comments -
Bug: convert_hf_to_gguf bluescreens windows with very large models
#10365 commented on
Dec 18, 2024 • 0 new comments -
ggml : reintegrate the AMX backend into the CPU backend
#10359 commented on
Dec 18, 2024 • 0 new comments -
Bug: rope-scale and rope-scaling parameters not being parsed in llama.cpp server
#10355 commented on
Dec 18, 2024 • 0 new comments -
Feature Request: Tencent-Hunyuan-Large (Text Generation)
#10263 commented on
Dec 18, 2024 • 0 new comments -
Bug: `llama-server` web UI resets the text selection during inference on every token update
#9608 commented on
Dec 18, 2024 • 0 new comments -
changelog : `llama-server` REST API
#9291 commented on
Dec 18, 2024 • 0 new comments -
Tool call support (Llama 3.x, Functionary v3, Hermes 2 Pro, Mistral Nemo, generic) w/ lazy grammars & minimalist Jinja engine
#9639 commented on
Dec 15, 2024 • 0 new comments -
llama : adds llama-grammar memoization stacks (#4218)
#9833 commented on
Dec 16, 2024 • 0 new comments -
metal : GPU "idle-throttling" analysis
#10119 commented on
Dec 17, 2024 • 0 new comments -
Introduce New Lookup-Table(LUT)-Based Matrix Multiplication Method
#10181 commented on
Dec 17, 2024 • 0 new comments -
metal : use F16 math in mul_mat kernels
#10220 commented on
Dec 12, 2024 • 0 new comments -
fix: ggml: fix vulkan-shaders-gen build
#10448 commented on
Dec 15, 2024 • 0 new comments -
Add support for GLM-Edge and GLM-Edge-V series models
#10573 commented on
Dec 11, 2024 • 0 new comments -
server: add OpenAI compatible response format for legacy /completions with b…
#10645 commented on
Dec 12, 2024 • 0 new comments -
Make->CMake
#10663 commented on
Dec 15, 2024 • 0 new comments -
server: Add timeout to stop the server automatically when idling for too long.
#10742 commented on
Dec 11, 2024 • 0 new comments -
Cuda build doc
#10743 commented on
Dec 12, 2024 • 0 new comments -
Bug: Failing to build using cmake on tag b3912
#9913 commented on
Dec 11, 2024 • 0 new comments -
Eval bug: llama-imatrix.exe - loads on to CPU instead of GPU ... sometimes. (?)
#10687 commented on
Dec 11, 2024 • 0 new comments -
Bug: No text response when "--log-disable" is set
#10002 commented on
Dec 12, 2024 • 0 new comments -
Bug: CANN E89999
#10161 commented on
Dec 12, 2024 • 0 new comments -
Misc. bug: --cfg-negative-prompt is gone
#10774 commented on
Dec 12, 2024 • 0 new comments -
Bug: docker sample usage will always trigger unhealthy container status
#10262 commented on
Dec 13, 2024 • 0 new comments -
Feature Request: adderALL
#10265 commented on
Dec 13, 2024 • 0 new comments -
Feature Request: Add split model support in gguf-py
#9023 commented on
Dec 13, 2024 • 0 new comments -
Compile bug: iOS Swift Xcode build error when upgrading to "llama : use cmake for swift build"
#10747 commented on
Dec 13, 2024 • 0 new comments -
Feature Request: A method to load all model layers into VRAM, then load the active context into the remaining VRAM and overflow into system RAM
#10283 commented on
Dec 14, 2024 • 0 new comments -
Bug: error running llama.cpp convert_hf_to_gguf.py on qwen2_7b_instruct
#10273 commented on
Dec 14, 2024 • 0 new comments -
Bug: Nondeterministic results on AMD RDNA3 (ROCm) despite zero temperature and fixed seed
#10197 commented on
Dec 14, 2024 • 0 new comments -
Bug: IQ3_M is significantly slower than IQ4_XS on AMD, is it expected?
#9644 commented on
Dec 14, 2024 • 0 new comments -
[Feature request] Any plans for AMD XDNA AI Engine support on Ryzen 7x40 processors?
#1499 commented on
Dec 14, 2024 • 0 new comments -
Feature Request: Source code highlight and math formula rendering
#10758 commented on
Dec 14, 2024 • 0 new comments -
ggml : refactor ggml-cpu.c into multiple C++ source files
#10180 commented on
Dec 14, 2024 • 0 new comments -
Bug: Flash Attention performs worse under ROCM
#10439 commented on
Dec 14, 2024 • 0 new comments -
Bug: Cannot run larger than VRAM models with `GGML_CUDA_ENABLE_UNIFIED_MEMORY`
#10091 commented on
Dec 14, 2024 • 0 new comments -
changelog : `libllama` API
#9289 commented on
Dec 15, 2024 • 0 new comments -
Feature Request: Support for Qwen2-VL
#9246 commented on
Dec 15, 2024 • 0 new comments -
Misc. bug: n_probs is not working with llama.cpp server
#10733 commented on
Dec 15, 2024 • 0 new comments -
Bug: Unable to load GGUF models after update
#9852 commented on
Dec 16, 2024 • 0 new comments -
How to utilize GPU on Android to accelerate inference?
#8705 commented on
Dec 16, 2024 • 0 new comments -
Bug: "GPU + CUDA + VRAM + Shared Memory (UMA)" slower then "CPU + RAM"?
#10330 commented on
Dec 17, 2024 • 0 new comments -
Bug: Using llama_batch_init+add+free instead of llama_batch_get_one() permanently slows down llama_decode significantly
#10322 commented on
Dec 17, 2024 • 0 new comments