Insights: ggerganov/llama.cpp
Overview
28 Releases published by 1 person
-
b4304
published
Dec 11, 2024 -
b4311
published
Dec 12, 2024 -
b4312
published
Dec 12, 2024 -
b4314
published
Dec 12, 2024 -
b4315
published
Dec 12, 2024 -
b4317
published
Dec 12, 2024 -
b4318
published
Dec 13, 2024 -
b4319
published
Dec 13, 2024 -
b4320
published
Dec 13, 2024 -
b4321
published
Dec 13, 2024 -
b4324
published
Dec 13, 2024 -
b4325
published
Dec 13, 2024 -
b4326
published
Dec 13, 2024 -
b4327
published
Dec 14, 2024 -
b4329
published
Dec 14, 2024 -
b4331
published
Dec 15, 2024 -
b4333
published
Dec 15, 2024 -
b4337
published
Dec 16, 2024 -
b4338
published
Dec 17, 2024 -
b4341
published
Dec 17, 2024 -
b4342
published
Dec 17, 2024 -
b4343
published
Dec 17, 2024 -
b4348
published
Dec 17, 2024 -
b4349
published
Dec 17, 2024 -
b4350
published
Dec 17, 2024 -
b4351
published
Dec 18, 2024 -
b4353
published
Dec 18, 2024 -
b4354
published
Dec 18, 2024
45 Pull requests merged by 30 people
-
server : output embeddings for all tokens when pooling = none (a usage sketch follows this list)
#10861 merged
Dec 18, 2024 -
server : add "tokens" output
#10853 merged
Dec 18, 2024 -
server : (embeddings) using same format for "input" and "content"
#10872 merged
Dec 18, 2024 -
docs: Fix HIP (née hipBLAS) in README
#10880 merged
Dec 18, 2024 -
Revert "Add Falcon3 model support"
#10876 merged
Dec 18, 2024 -
Use model->gguf_kv for loading the template instead of using the C API.
#10868 merged
Dec 17, 2024 -
tests: add tests for GGUF
#10830 merged
Dec 17, 2024 -
ggml : update ggml_backend_cpu_device_supports_op
#10867 merged
Dec 17, 2024 -
server : fill usage info in embeddings and rerank responses
#10852 merged
Dec 17, 2024 -
Add Falcon3 model support
#10864 merged
Dec 17, 2024 -
readme : update typos
#10863 merged
Dec 17, 2024 -
server : (UI) fix missing async generator on safari
#10857 merged
Dec 17, 2024 -
vulkan: bugfixes for small subgroup size systems + llvmpipe test
#10809 merged
Dec 17, 2024 -
rwkv6: add wkv6 support for Vulkan backend
#10829 merged
Dec 16, 2024 -
unicode : improve naming style
#10838 merged
Dec 16, 2024 -
sampling : refactor + optimize penalties sampler
#10803 merged
Dec 16, 2024 -
Allow locally downloaded models for QwenVL
#10833 merged
Dec 15, 2024 -
Add Deepseek MoE v1 & GigaChat models
#10827 merged
Dec 15, 2024 -
scripts : change build path to "build-bench" for compare-commits.sh
#10836 merged
Dec 15, 2024 -
server: (UI) add syntax highlighting and latex math rendering
#10808 merged
Dec 15, 2024 -
server: Fix `has_next_line` in JSON response
#10818 merged
Dec 14, 2024 -
nix: allow to override rocm gpu targets
#10794 merged
Dec 14, 2024 -
Add support for Qwen2VL
#10361 merged
Dec 14, 2024 -
Removes spurious \r in output that causes logging in journalctl to tr…
#10771 merged
Dec 13, 2024 -
Introducing experimental OpenCL backend with support for Qualcomm Adreno GPUs
#10693 merged
Dec 13, 2024 -
Opt class for positional argument handling
#10508 merged
Dec 13, 2024 -
fix: graceful shutdown for Docker images
#10815 merged
Dec 13, 2024 -
[gguf-py] gguf_reader: numpy 2 newbyteorder fix
#9772 merged
Dec 13, 2024 -
Fix crash caused by ggml_backend_load_all when launching on Android Activity
#10812 merged
Dec 13, 2024 -
vulkan: small mul_mat_vec optimizations
#10665 merged
Dec 13, 2024 -
SYCL: Reduce most of the compiler warnings
#10748 merged
Dec 13, 2024 -
ggml: Fix compilation issues on ARM platform when building without fp16
#10811 merged
Dec 13, 2024 -
common : improve -ctv -ctk CLI arguments
#10806 merged
Dec 12, 2024 -
contrib : add ngxson as codeowner for server, devops
#10804 merged
Dec 12, 2024 -
[backend](cuda): faster uncontiguous concat
#10760 merged
Dec 12, 2024 -
remove CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS
#10797 merged
Dec 12, 2024 -
Vulkan: Use improved q4_k and q5_k dequant code in dequant shaders
#10798 merged
Dec 12, 2024 -
Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats
#10721 merged
Dec 12, 2024 -
common : add missing env var for speculative
#10801 merged
Dec 12, 2024 -
docs: update server streaming mode documentation
#9519 merged
Dec 11, 2024 -
gguf-py : bump version to 0.11.0
#10788 merged
Dec 11, 2024 -
server : (UI) add tok/s, get rid of completion.js
#10786 merged
Dec 11, 2024 -
Fix a small typo in the Quantization Docs
#10772 merged
Dec 11, 2024 -
ci : pin nodejs to 22.11.0
#10779 merged
Dec 11, 2024 -
bug-fix: snprintf prints NULL in place of the last character
#10419 merged
Dec 11, 2024
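For context on the embeddings-related server changes above (#10861, #10853, #10872), here is a minimal client-side sketch. It assumes a server started roughly as `llama-server -m model.gguf --embeddings --pooling none` and an OpenAI-style `/v1/embeddings` endpoint accepting an `"input"` field; the flags, endpoint path, and response shape are assumptions inferred from the PR titles rather than taken from the PRs themselves, so check the server README for the authoritative API.

```python
# Hedged sketch: query a llama-server embeddings endpoint and inspect the
# per-token vectors expected when the server runs with `--pooling none`.
# Endpoint path, request/response field names, and the nested list shape
# are assumptions; verify them against the server documentation.
import json
import urllib.request

def get_embeddings(text: str, host: str = "http://localhost:8080"):
    payload = json.dumps({"input": text}).encode("utf-8")
    req = urllib.request.Request(
        f"{host}/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # With pooling disabled, each entry's "embedding" is expected to be a
    # list of per-token vectors rather than a single pooled vector.
    return [item["embedding"] for item in body["data"]]

if __name__ == "__main__":
    per_token = get_embeddings("Hello, llama.cpp!")[0]
    print(f"{len(per_token)} token vectors of dimension {len(per_token[0])}")
```

With a pooled mode (mean, cls, last) the same call should instead return one flat vector per input, so a client that supports both configurations needs to handle both shapes.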
18 Pull requests opened by 18 people
-
server : fix logprobs, make it OAI-compatible
#10783 opened
Dec 11, 2024 -
tts : add OuteTTS support
#10784 opened
Dec 11, 2024 -
Bamba architecture
#10810 opened
Dec 12, 2024 -
Add support for Microsoft Phi-4 model
#10817 opened
Dec 13, 2024 -
Improve progress bar
#10821 opened
Dec 13, 2024 -
add `ggml_backend_sched_dump_dot`
#10825 opened
Dec 14, 2024 -
added docker-multi-stage builds
#10832 opened
Dec 14, 2024 -
Fix compilation on Pop!_OS 22.04 LTS CUDA
#10835 opened
Dec 15, 2024 -
SYCL: Migrate away from deprecated ggml_tensor->backend
#10840 opened
Dec 15, 2024 -
vulkan: multi-row k quants
#10846 opened
Dec 16, 2024 -
SYCL: Fixes for building SYCL backend for AMD GPUs
#10851 opened
Dec 16, 2024 -
vulkan: optimize coopmat2 dequant functions
#10855 opened
Dec 16, 2024 -
Roberta embeddings fixes
#10856 opened
Dec 16, 2024 -
llama: Ensure KV cache is fully defragmented.
#10873 opened
Dec 17, 2024 -
ggml-cpu: replace NEON asm with intrinsics in ggml_gemv_q4_0_4x8_q8_0()
#10874 opened
Dec 17, 2024 -
server: avoid overwriting Authorization header
#10878 opened
Dec 18, 2024 -
Add Falcon3 support and Fix issue #10875
#10883 opened
Dec 18, 2024 -
tests: disable GGUF test for bad value size
#10886 opened
Dec 18, 2024
38 Issues closed by 12 people
-
Misc. bug: JS error when using the web ui on iPhone
#10842 closed
Dec 18, 2024 -
Eval bug: PR#10864 tokenization regression
#10875 closed
Dec 18, 2024 -
Misc. bug: Server Demo on Mac, safari return error
#10841 closed
Dec 17, 2024 -
Bug: Intel Arc - not working at all
#9106 closed
Dec 17, 2024 -
Eval bug: EXAONE-3.5-2.4B-Instruct has relatively low context limit (50% the limit of Qwen 2.5 3B)
#10823 closed
Dec 16, 2024 -
Support Mistral-Nemo-Instruct-2407 128K
#8577 closed
Dec 16, 2024 -
Bug: Model Output Repeats and Shows Errors when Running GGUF File with llama.cpp
#9788 closed
Dec 16, 2024 -
Bug: Server /v1/chat/completions API response's model info is wrong
#10056 closed
Dec 16, 2024 -
Bug: [SYCL] SYCL + Docker
#10113 closed
Dec 16, 2024 -
Feature Request: count tokens before calling '/v1/chat/completions'
#10115 closed
Dec 16, 2024 -
Build docker image llama.cpp:server-cuda: CMakeLists.txt missing
#10844 closed
Dec 15, 2024 -
Bug: Build failure with GGML_VULKAN=1 GGML_HIPBLAS=1
#10284 closed
Dec 15, 2024 -
web UI : support syntax highlighting
#10246 closed
Dec 15, 2024 -
Feature Request: RDMA support for rpc back ends
#9493 closed
Dec 15, 2024 -
Bug: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED
#10080 closed
Dec 15, 2024 -
Feature Request: Precompiled llamacpp builds of cuda 12.2
#10093 closed
Dec 15, 2024 -
Misc. bug: Some server response JSON still not restored
#10728 closed
Dec 14, 2024 -
Documentation Inconsistency: llama-server endpoint
#10715 closed
Dec 14, 2024 -
Support QuaRot quantization scheme
#6444 closed
Dec 14, 2024 -
Bug: llama-server not logging to file
#10078 closed
Dec 14, 2024 -
Bug: Floating Point Exceptions turned off by default, hiding fpExceptions
#10083 closed
Dec 14, 2024 -
Feature Request: Meta releases Layer Skip, an end-to-end solution for accelerating LLMs
#10090 closed
Dec 14, 2024 -
webUI local storage can become corrupted
#10348 closed
Dec 13, 2024 -
Bug: server (New UI) ChatML templates are wrong
#9640 closed
Dec 13, 2024 -
Clean up server code
#5762 closed
Dec 13, 2024 -
llama : save downloaded models to local cache
#7252 closed
Dec 13, 2024 -
Bug: No docs explain the value for cache-type-k/v
#10373 closed
Dec 13, 2024 -
Feature Request: Tensor Parallelism support
#9086 closed
Dec 13, 2024 -
Compile bug: Vulkan build fails on GL_KHR_cooperative_matrix
#10785 closed
Dec 12, 2024 -
Feature Request: llama.cpp server - generated syntax code coloring
#10800 closed
Dec 12, 2024 -
Compile bug: /ggml/src/libggml.so: undefined reference to `std::filesystem::__cxx11
#10778 closed
Dec 12, 2024 -
Eval bug: ValueError: Duplicated tensor name 'token_embd.weight'
#10756 closed
Dec 12, 2024 -
Feature Request: Convert .devops container images to be RHEL-based UBI images rather than Ubuntu based
#9961 closed
Dec 12, 2024 -
Bug: Setting the `np` configs leads to garbled generated tokens.
#10070 closed
Dec 12, 2024 -
Bug: Wrong slots management when receiving multiple concurrent requests.
#10072 closed
Dec 12, 2024 -
Feature Request: Implement « Why Does the Effective Context Length of LLMs Fall Short? »
#10075 closed
Dec 12, 2024 -
Feature Request: Add "tokens per second" information in the Web UI
#10502 closed
Dec 11, 2024
19 Issues opened by 19 people
-
Feature Request: support `"encoding_format": "base64"` in the `*/embeddings` endpoints
#10887 opened
Dec 18, 2024 -
Compile bug: bad interpreter: No such file or directory
#10881 opened
Dec 18, 2024 -
Feature Request: Add support for SmolVLM
#10877 opened
Dec 17, 2024 -
Misc. bug: [SERVER] Multiple slots, generation speed is degraded after each generation/slot used
#10860 opened
Dec 17, 2024 -
Misc. bug: llama-bench SEGFAULTS w/ SYCL/HIP backend, however llama-cli seems to work
#10850 opened
Dec 16, 2024 -
Compile bug: Compiling on Maxwell architecture 52 cuda12.7
#10849 opened
Dec 16, 2024 -
Feature Request: Q6_0 quant
#10848 opened
Dec 16, 2024 -
Eval bug: ggml_metal_encode_node: error: unsupported op 'IM2COL'
#10845 opened
Dec 16, 2024 -
Eval bug: Qwen2-VL Hallucinates image content on Vulkan backend
#10843 opened
Dec 15, 2024 -
Feature Request: Add support for the WePOINTS/POINTS1.5 model
#10834 opened
Dec 15, 2024 -
Feature Request: Allow Filtering LLama Server Response Fields
#10819 opened
Dec 13, 2024 -
Feature Request: Support for C4AI Command R7B / Cohere2ForCausalLM
#10816 opened
Dec 13, 2024 -
Feature Request: Add support for Phi-4 model
#10814 opened
Dec 13, 2024 -
Error while using llama-quantize with Meta-Llama-3.1-8B-Instruct
#10793 opened
Dec 12, 2024
53 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
more performance with llamafile tinyblas on x86_64.
#10714 commented on
Dec 16, 2024 • 22 new comments -
ggml: GGML_NATIVE uses -mcpu=native on ARM
#10752 commented on
Dec 18, 2024 • 8 new comments -
musa: fix aarch64 build
#10781 commented on
Dec 16, 2024 • 3 new comments -
Support for Llama-3_1-Nemotron-51B
#10669 commented on
Dec 18, 2024 • 2 new comments -
Feature Request: server default system prompt support, like -spf in older versions (e.g. for gemma2)
#10520 commented on
Dec 11, 2024 • 0 new comments -
Bug: I am unable to use llama_cli interactively
#10297 commented on
Dec 17, 2024 • 0 new comments -
Bug: llama-gbnf-validator parses grammar but gets a seg fault when validating an input string against the grammar
#10321 commented on
Dec 17, 2024 • 0 new comments -
Misc. bug: Q4_0 with runtime repacking not working as expected (TYPE_Q4_0_4_4 REMOVED)
#10757 commented on
Dec 17, 2024 • 0 new comments -
Bug: llama.cpp with Vulkan not running on Snapdragon X + Windows (Copilot+PCs)
#8455 commented on
Dec 17, 2024 • 0 new comments -
Misc. bug: Virus detected
#10768 commented on
Dec 17, 2024 • 0 new comments -
Bug: convert_hf_to_gguf bluescreens windows with very large models
#10365 commented on
Dec 18, 2024 • 0 new comments -
ggml : reintegrate the AMX backend into the CPU backend
#10359 commented on
Dec 18, 2024 • 0 new comments -
Bug: rope-scale and rope-scaling parameters not being parsed in llama.cpp server
#10355 commented on
Dec 18, 2024 • 0 new comments -
Feature Request: Tencent-Hunyuan-Large (Text Generation)
#10263 commented on
Dec 18, 2024 • 0 new comments -
Bug: `llama-server` web UI resets the text selection during inference on every token update
#9608 commented on
Dec 18, 2024 • 0 new comments -
changelog : `llama-server` REST API
#9291 commented on
Dec 18, 2024 • 0 new comments -
Tool call support (Llama 3.x, Functionary v3, Hermes 2 Pro, Mistral Nemo, generic) w/ lazy grammars & minimalist Jinja engine
#9639 commented on
Dec 15, 2024 • 0 new comments -
llama : adds llama-grammar memoization stacks (#4218)
#9833 commented on
Dec 16, 2024 • 0 new comments -
metal : GPU "idle-throttling" analysis
#10119 commented on
Dec 17, 2024 • 0 new comments -
Introduce New Lookup-Table(LUT)-Based Matrix Multiplication Method
#10181 commented on
Dec 17, 2024 • 0 new comments -
metal : use F16 math in mul_mat kernels
#10220 commented on
Dec 12, 2024 • 0 new comments -
fix: ggml: fix vulkan-shaders-gen build
#10448 commented on
Dec 15, 2024 • 0 new comments -
Add support for GLM-Edge and GLM-Edge-V series models
#10573 commented on
Dec 11, 2024 • 0 new comments -
server: add OpenAI compatible response format for legacy /completions with b…
#10645 commented on
Dec 12, 2024 • 0 new comments -
Make->CMake
#10663 commented on
Dec 15, 2024 • 0 new comments -
server: Add timeout to stop the server automatically when idling for too long.
#10742 commented on
Dec 11, 2024 • 0 new comments -
Cuda build doc
#10743 commented on
Dec 12, 2024 • 0 new comments -
Bug: Failing to build using cmake on tag b3912
#9913 commented on
Dec 11, 2024 • 0 new comments -
Eval bug: llama-imatrix.exe - loads on to CPU instead of GPU ... sometimes. (?)
#10687 commented on
Dec 11, 2024 • 0 new comments -
Bug: No text response when "--log-disable" is set
#10002 commented on
Dec 12, 2024 • 0 new comments -
Bug: CANN E89999
#10161 commented on
Dec 12, 2024 • 0 new comments -
Misc. bug: --cfg-negative-prompt is gone
#10774 commented on
Dec 12, 2024 • 0 new comments -
Bug: docker sample usage will always trigger unhealthy container status
#10262 commented on
Dec 13, 2024 • 0 new comments -
Feature Request: adderALL
#10265 commented on
Dec 13, 2024 • 0 new comments -
Feature Request: Add split model support in gguf-py
#9023 commented on
Dec 13, 2024 • 0 new comments -
Compile bug: iOS Swift Xcode build error when upgrading to "llama : use cmake for swift build"
#10747 commented on
Dec 13, 2024 • 0 new comments -
Feature Request: A method to load all model layers into VRAM, then load the active context into the remaining VRAM and overflow into system RAM
#10283 commented on
Dec 14, 2024 • 0 new comments -
Bug: error running llama.cpp convert_hf_to_gguf.py on qwen2_7b_instruct
#10273 commented on
Dec 14, 2024 • 0 new comments -
Bug: Nondeterministic results on AMD RDNA3 (ROCm) despite zero temperature and fixed seed
#10197 commented on
Dec 14, 2024 • 0 new comments -
Bug: IQ3_M is significantly slower than IQ4_XS on AMD, is it expected?
#9644 commented on
Dec 14, 2024 • 0 new comments -
[Feature request] Any plans for AMD XDNA AI Engine support on Ryzen 7x40 processors?
#1499 commented on
Dec 14, 2024 • 0 new comments -
Feature Request: Source code highlight and math formula rendering
#10758 commented on
Dec 14, 2024 • 0 new comments -
ggml : refactor ggml-cpu.c into multiple C++ source files
#10180 commented on
Dec 14, 2024 • 0 new comments -
Bug: Flash Attention performs worse under ROCM
#10439 commented on
Dec 14, 2024 • 0 new comments -
Bug: Cannot run larger than VRAM models with `GGML_CUDA_ENABLE_UNIFIED_MEMORY`
#10091 commented on
Dec 14, 2024 • 0 new comments -
changelog : `libllama` API
#9289 commented on
Dec 15, 2024 • 0 new comments -
Feature Request: Support for Qwen2-VL
#9246 commented on
Dec 15, 2024 • 0 new comments -
Misc. bug: n_probs is not working with llama.cpp server
#10733 commented on
Dec 15, 2024 • 0 new comments -
Bug: Unable to load GGUF models after update
#9852 commented on
Dec 16, 2024 • 0 new comments -
How to utilize GPU on Android to accelerate inference?
#8705 commented on
Dec 16, 2024 • 0 new comments -
Bug: "GPU + CUDA + VRAM + Shared Memory (UMA)" slower then "CPU + RAM"?
#10330 commented on
Dec 17, 2024 • 0 new comments -
Bug: Using llama_batch_init+add+free instead of llama_batch_get_one() permanently slows down llama_decode significantly
#10322 commented on
Dec 17, 2024 • 0 new comments