
metal : use F16 math in mul_mat kernels #10220

Draft · wants to merge 1 commit into master from gg/metal-mul-mat-f16
Conversation

@ggerganov (Owner) commented Nov 8, 2024

Using F16 math and accumulators in matrix multiplication kernels.

The prompt processing performance is improved. The perplexity is affected as follows:

./llama-perplexity -m models/llama-3.1-8b/ggml-model-f16.gguf -ngl 99 -f build/wikitext-2-raw/wiki.test.raw
master: PPL = 6.4007 +/- 0.03938
PR:     PPL = 6.4385 +/- 0.03969

Not sure if it's worth it yet. Might do this behind a compile-time flag GGML_METAL_F16. #10665 (comment)
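
For illustration, here is a minimal Metal sketch of the idea (not the PR's actual kernel, which is more involved): one simdgroup computes an 8x8 tile of the output via the simdgroup matrix API, using an F16 accumulator (`simdgroup_half8x8`) where master accumulates in F32. The kernel name, buffer layout, and the assumption that K and N are multiples of 8 are illustrative only.

```metal
#include <metal_stdlib>

using namespace metal;

// Sketch only, assuming row-major F16 operands and K, N divisible by 8.
// One simdgroup (threadgroup of 32 threads) computes one 8x8 tile of dst.
kernel void mul_mat_f16_tile(
        device const half * src0 [[buffer(0)]],  // M x K
        device const half * src1 [[buffer(1)]],  // K x N
        device       half * dst  [[buffer(2)]],  // M x N
        constant    uint  & K    [[buffer(3)]],
        constant    uint  & N    [[buffer(4)]],
        uint2               tgid [[threadgroup_position_in_grid]]) {
    // the core of the change: F16 accumulator instead of simdgroup_float8x8
    simdgroup_half8x8 acc = make_filled_simdgroup_matrix<half, 8, 8>(0.0h);

    simdgroup_half8x8 a;
    simdgroup_half8x8 b;

    const uint row = 8*tgid.y;
    const uint col = 8*tgid.x;

    for (uint k = 0; k < K; k += 8) {
        simdgroup_load(a, src0 + row*K + k,   K); // 8x8 tile of src0
        simdgroup_load(b, src1 + k*N   + col, N); // 8x8 tile of src1

        // both the multiply and the accumulate happen in F16
        simdgroup_multiply_accumulate(acc, a, b, acc);
    }

    simdgroup_store(acc, dst + row*N + col, N);
}
```

The small perplexity regression above is the expected cost of this approach: partial sums are rounded to F16 on every accumulation step instead of being kept in F32.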

./scripts/compare-commits.sh master gg/metal-mul-mat-f16 -m ./models/llama-3.2-3b-instruct/ggml-model-f16.gguf -m ./models/llama-3.2-3b-instruct/ggml-model-q8_0.gguf -m ./models/llama-3.2-3b-instruct/ggml-model-q4_0.gguf -m ./models/llama-3.1-8b/ggml-model-q8_0.gguf -m ./models/llama-3.1-8b/ggml-model-q4_k.gguf -m ./models/llama-3.2-1b-instruct/ggml-model-f16.gguf -m ./models/llama-3.2-1b-instruct/ggml-model-q8_0.gguf -m models/qwen2.5-7b-coder/ggml-model-q8_0.gguf -m models/qwen2.5-7b-coder/ggml-model-q4_k.gguf -m models/qwen2.5-1.5b-coder/ggml-model-f16.gguf -m models/qwen2.5-1.5b-coder/ggml-model-q8_0.gguf -m models/gemma-2-2b/ggml-model-q8_0.gguf -m models/gemma-2-2b/ggml-model-f16.gguf -m models/gemma-2-9b/ggml-model-q5_k.gguf -m models/gemma-2-9b/ggml-model-q8_0.gguf -fa 1 -p 512,4096 -b 4096 -ub 4096 -n 0
| CPU | Model | Test | t/s master | t/s gg/metal-mul-mat-f16 | Speedup |
| --- | --- | --- | ---: | ---: | ---: |
| M2 Ultra | gemma2 2B F16 | pp512 | 3858.01 | 3958.99 | 1.03 |
| M2 Ultra | gemma2 2B F16 | pp4096 | 3307.38 | 3350.91 | 1.01 |
| M2 Ultra | gemma2 2B Q8_0 | pp512 | 3493.71 | 3761.59 | 1.08 |
| M2 Ultra | gemma2 2B Q8_0 | pp4096 | 3037.94 | 3192.87 | 1.05 |
| M2 Ultra | gemma2 9B Q5_K_M | pp512 | 804.61 | 827.90 | 1.03 |
| M2 Ultra | gemma2 9B Q5_K_M | pp4096 | 731.31 | 746.59 | 1.02 |
| M2 Ultra | gemma2 9B Q8_0 | pp512 | 958.09 | 1045.54 | 1.09 |
| M2 Ultra | gemma2 9B Q8_0 | pp4096 | 841.43 | 893.08 | 1.06 |
| M2 Ultra | llama 1B F16 | pp512 | 8632.68 | 8820.95 | 1.02 |
| M2 Ultra | llama 1B F16 | pp4096 | 7787.07 | 7854.06 | 1.01 |
| M2 Ultra | llama 1B Q8_0 | pp512 | 7784.41 | 8213.40 | 1.06 |
| M2 Ultra | llama 1B Q8_0 | pp4096 | 7198.94 | 7441.86 | 1.03 |
| M2 Ultra | llama 3B F16 | pp512 | 3239.58 | 3300.75 | 1.02 |
| M2 Ultra | llama 3B F16 | pp4096 | 3078.37 | 3124.01 | 1.01 |
| M2 Ultra | llama 3B Q4_0 | pp512 | 2986.85 | 3081.41 | 1.03 |
| M2 Ultra | llama 3B Q4_0 | pp4096 | 2863.05 | 2916.12 | 1.02 |
| M2 Ultra | llama 3B Q8_0 | pp512 | 2924.48 | 3075.55 | 1.05 |
| M2 Ultra | llama 3B Q8_0 | pp4096 | 2820.40 | 2924.54 | 1.04 |
| M2 Ultra | llama 8B Q4_K_M | pp512 | 1123.40 | 1189.15 | 1.06 |
| M2 Ultra | llama 8B Q4_K_M | pp4096 | 1122.38 | 1178.45 | 1.05 |
| M2 Ultra | qwen2 1.5B Q8_0 | pp512 | 5626.92 | 5947.59 | 1.06 |
| M2 Ultra | qwen2 1.5B Q8_0 | pp4096 | 5465.39 | 5772.59 | 1.06 |
| M2 Ultra | qwen2 7B Q8_0 | pp512 | 1361.00 | 1484.00 | 1.09 |
| M2 Ultra | qwen2 7B Q8_0 | pp4096 | 1354.30 | 1456.51 | 1.08 |

@ggerganov force-pushed the gg/metal-mul-mat-f16 branch from fc1a76a to 748833a on November 9, 2024
@joseph777111 commented Dec 9, 2024

A compile-time flag seems like a great idea! With that, I think it would be nice to have two flags:

  1. GGML_METAL_F16
  2. GGML_METAL_F32

That way the user can choose the precision. 😋

@ggerganov (Owner, Author) commented
It will be checked at run-time, based on the precision configured for each matrix multiplication (see #10665 (comment)).
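
For context, a rough host-side sketch of such a run-time check, building on the existing `ggml_mul_mat_set_prec()` / `GGML_PREC_F32` mechanism in ggml.h; the helper name and the exact op-params slot are assumptions, not the PR's actual code:

```c
#include "ggml.h"

// Hypothetical helper (name is mine, not from the PR): decide per matmul
// whether the F16-math kernel variant may be used. ggml_mul_mat_set_prec()
// records the requested precision on the destination tensor's op params
// (assumed to be op_params[0] here).
static bool ggml_metal_mul_mat_use_f16(const struct ggml_tensor * dst) {
    const enum ggml_prec prec = (enum ggml_prec) dst->op_params[0];

    return prec != GGML_PREC_F32; // GGML_PREC_F32 forces F32 accumulation
}
```

A graph builder that needs full F32 accuracy for a sensitive matmul would then opt out with `ggml_mul_mat_set_prec(cur, GGML_PREC_F32);`.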
