Could you provide GGUF files please?
As the title says.
I second @rivvada's request. GGUF is the de facto standard for desktop and mobile AI applications. Gerganov's standard converter can't handle this model:
Error quantizing: main: build = 7113 (845f200b2)
main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: quantizing 'outputs/tmpvu720z06/GigaChat3-10B-A1.8B-bf16.fp16.gguf' to 'outputs/tmpvu720z06/gigachat3-10b-a1.8b-bf16-q4_k_m.gguf' as Q4_K_M
llama_model_loader: loaded meta data with 47 key-value pairs and 414 tensors from outputs/tmpvu720z06/GigaChat3-10B-A1.8B-bf16.fp16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = GigaChat3 10B A1.8B Bf16
llama_model_loader: - kv 3: general.basename str = GigaChat3
llama_model_loader: - kv 4: general.size_label str = 10B-A1.8B
llama_model_loader: - kv 5: general.license str = mit
llama_model_loader: - kv 6: general.tags arr[str,2] = ["moe", "text-generation"]
llama_model_loader: - kv 7: general.languages arr[str,2] = ["ru", "en"]
llama_model_loader: - kv 8: deepseek2.block_count u32 = 26
llama_model_loader: - kv 9: deepseek2.context_length u32 = 262144
llama_model_loader: - kv 10: deepseek2.embedding_length u32 = 1536
llama_model_loader: - kv 11: deepseek2.feed_forward_length u32 = 8960
llama_model_loader: - kv 12: deepseek2.attention.head_count u32 = 32
llama_model_loader: - kv 13: deepseek2.attention.head_count_kv u32 = 1
llama_model_loader: - kv 14: deepseek2.rope.freq_base f32 = 100000.000000
llama_model_loader: - kv 15: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 16: deepseek2.expert_used_count u32 = 4
llama_model_loader: - kv 17: deepseek2.expert_group_count u32 = 1
llama_model_loader: - kv 18: deepseek2.expert_group_used_count u32 = 1
llama_model_loader: - kv 19: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 20: deepseek2.attention.key_length u32 = 576
llama_model_loader: - kv 21: deepseek2.attention.value_length u32 = 512
llama_model_loader: - kv 22: general.file_type u32 = 1
llama_model_loader: - kv 23: deepseek2.leading_dense_block_count u32 = 1
llama_model_loader: - kv 24: deepseek2.vocab_size u32 = 128256
llama_model_loader: - kv 25: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 26: deepseek2.attention.key_length_mla u32 = 192
llama_model_loader: - kv 27: deepseek2.attention.value_length_mla u32 = 192
llama_model_loader: - kv 28: deepseek2.expert_feed_forward_length u32 = 1280
llama_model_loader: - kv 29: deepseek2.expert_count u32 = 64
llama_model_loader: - kv 30: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 31: deepseek2.expert_weights_scale f32 = 1.000000
llama_model_loader: - kv 32: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 33: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 34: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 35: deepseek2.rope.scaling.factor f32 = 64.000000
llama_model_loader: - kv 36: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 37: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
llama_model_loader: - kv 38: general.quantization_version u32 = 2
llama_model_loader: - kv 39: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 40: tokenizer.ggml.pre str = gigachat
llama_model_loader: - kv 41: tokenizer.ggml.tokens arr[str,128256] = ["<unk>", "<s>", "</s>", "!", "\"", "...
llama_model_loader: - kv 42: tokenizer.ggml.token_type arr[i32,128256] = [1, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 43: tokenizer.ggml.merges arr[str,127744] = ["Ð ¾", "Ð °", "Ð µ", "Ð ¸", ...
llama_model_loader: - kv 44: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 45: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 46: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - type f32: 129 tensors
llama_model_loader: - type f16: 285 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA A10G, compute capability 8.6, VMM: yes
llama_model_quantize: failed to quantize: key not found in model: deepseek2.attention.q_lora_rank
main: failed to quantize model from 'outputs/tmpvu720z06/GigaChat3-10B-A1.8B-bf16.fp16.gguf'
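To double-check what the converter actually wrote, here is a minimal sketch using the gguf-py package that ships with llama.cpp (the path is the temp file from the log above); it dumps the metadata keys, and on this file the deepseek2.attention.q_lora_rank key the quantizer asks for should indeed be absent:
# Minimal sketch with llama.cpp's gguf-py package: list the metadata keys of
# the converted file and check for the key the quantizer complains about.
from gguf import GGUFReader

reader = GGUFReader("outputs/tmpvu720z06/GigaChat3-10B-A1.8B-bf16.fp16.gguf")
for key in reader.fields:          # metadata key/value pairs
    print(key)
print("q_lora_rank present:",
      "deepseek2.attention.q_lora_rank" in reader.fields)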
An MLX version at 6 bits would be great too. The local MLX converter couldn't handle this model either, same as the GGUF converter.
Still not working...
UPDATE: Maybe this is a 'lite' version of the architecture?
OLD:
Does this model have a slightly different architecture that is missing `self_attn.q_a_layernorm` in the safetensors, which is normally mapped to the GGUF `attn_q_a_norm` tensor present in the deepseek-v2 architecture? If so, it will likely need a couple of patches in llama.cpp, e.g.:
- Update the convert script as needed: https://github.com/ggml-org/llama.cpp/blob/master/convert_hf_to_gguf.py#L7069
- Might need a slightly different llama-graph than the original `deepseek2`
~Otherwise it might just be that some naming convention is different and that tensor is present under a different name?~ A quick way to check is sketched below.
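To check whether that tensor exists in the checkpoint at all, a rough sketch that scans the sharded-safetensors weight index (the checkpoint directory is illustrative):
# List any q_a_proj / q_a_layernorm tensors in the HF checkpoint by reading
# the standard sharded-safetensors index; no hits would mean the model really
# does skip the query LoRA projection.
import json

with open("ai-sage/GigaChat3-10B-A1.8B-bf16/model.safetensors.index.json") as f:
    names = json.load(f)["weight_map"].keys()
hits = sorted(n for n in names if ".q_a" in n)
print(hits or "no q_a_proj / q_a_layernorm tensors found")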
👈 Details
patch config.json
First I removed the config.json line with q_lora_rank:
$ ai-sage/GigaChat3-10B-A1.8B-bf16$ diff config.json config.json.bak
14a15
> "q_lora_rank": null,
52c53
< }
---
> }
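The same edit as a small Python sketch, for anyone who prefers not to hand-edit the file (the checkpoint path is illustrative):
# Drop the "q_lora_rank": null entry from config.json, keeping a backup
# just like the config.json.bak diffed above.
import json, shutil

path = "ai-sage/GigaChat3-10B-A1.8B-bf16/config.json"
shutil.copy(path, path + ".bak")
with open(path) as f:
    cfg = json.load(f)
cfg.pop("q_lora_rank", None)
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)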
convert it
$ cd llama.cpp
$ source venv/bin/activate
$ python \
convert_hf_to_gguf.py \
--outtype bf16 \
--split-max-size 50G \
--outfile /mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF \
/mnt/data/models/ai-sage/GigaChat3-10B-A1.8B-bf16/
INFO:hf-to-gguf:gguf: indexing model part 'model-00003-of-00010.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00004-of-00010.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00005-of-00010.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00006-of-00010.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00007-of-00010.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00008-of-00010.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00009-of-00010.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00010-of-00010.safetensors'
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:output.weight, torch.bfloat16 --> BF16, shape = {1536, 128256}
INFO:hf-to-gguf:token_embd.weight, torch.bfloat16 --> BF16, shape = {1536, 128256}
INFO:hf-to-gguf:blk.0.attn_norm.weight, torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.0.ffn_down.weight, torch.bfloat16 --> BF16, shape = {8960, 1536}
INFO:hf-to-gguf:blk.0.ffn_gate.weight, torch.bfloat16 --> BF16, shape = {1536, 8960}
INFO:hf-to-gguf:blk.0.ffn_up.weight, torch.bfloat16 --> BF16, shape = {1536, 8960}
INFO:hf-to-gguf:blk.0.ffn_norm.weight, torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.0.attn_kv_a_norm.weight, torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.0.attn_kv_a_mqa.weight, torch.bfloat16 --> BF16, shape = {1536, 576}
INFO:hf-to-gguf:blk.0.attn_k_b.weight, torch.bfloat16 --> BF16, shape = {128, 512, 32}
INFO:hf-to-gguf:blk.0.attn_v_b.weight, torch.bfloat16 --> BF16, shape = {512, 192, 32}
INFO:hf-to-gguf:blk.0.attn_output.weight, torch.bfloat16 --> BF16, shape = {6144, 1536}
INFO:hf-to-gguf:blk.0.attn_q.weight, torch.bfloat16 --> BF16, shape = {1536, 6144}
INFO:hf-to-gguf:blk.1.attn_norm.weight, torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.1.ffn_down_exps.weight, torch.bfloat16 --> BF16, shape = {1280, 1536, 64}
INFO:hf-to-gguf:blk.1.ffn_gate_exps.weight, torch.bfloat16 --> BF16, shape = {1536, 1280, 64}
INFO:hf-to-gguf:blk.1.ffn_up_exps.weight, torch.bfloat16 --> BF16, shape = {1536, 1280, 64}
INFO:hf-to-gguf:blk.1.exp_probs_b.bias, torch.bfloat16 --> F32, shape = {64}
INFO:hf-to-gguf:blk.1.ffn_gate_inp.weight, torch.bfloat16 --> F32, shape = {1536, 64}
INFO:hf-to-gguf:blk.1.ffn_down_shexp.weight, torch.bfloat16 --> BF16, shape = {1280, 1536}
INFO:hf-to-gguf:blk.1.ffn_gate_shexp.weight, torch.bfloat16 --> BF16, shape = {1536, 1280}
INFO:hf-to-gguf:blk.1.ffn_up_shexp.weight, torch.bfloat16 --> BF16, shape = {1536, 1280}
INFO:hf-to-gguf:blk.1.ffn_norm.weight, torch.bfloat16 --> F32, shape = {1536}
.
.
.
INFO:hf-to-gguf:blk.25.attn_q.weight, torch.bfloat16 --> BF16, shape = {1536, 6144}
INFO:hf-to-gguf:Set meta model
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 262144
INFO:hf-to-gguf:gguf: embedding length = 1536
INFO:hf-to-gguf:gguf: feed forward length = 8960
INFO:hf-to-gguf:gguf: head count = 32
INFO:hf-to-gguf:gguf: key-value head count = 1
INFO:hf-to-gguf:gguf: rope theta = 100000
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-06
INFO:hf-to-gguf:gguf: experts used count = 4
INFO:hf-to-gguf:gguf: expert groups count = 1
INFO:hf-to-gguf:gguf: expert groups used count = 1
INFO:hf-to-gguf:gguf: file type = 32
WARNING:gguf.gguf_writer:Duplicated key name 'deepseek2.attention.key_length', overwriting it with new value 576 of type UINT32
WARNING:gguf.gguf_writer:Duplicated key name 'deepseek2.attention.value_length', overwriting it with new value 512 of type UINT32
INFO:hf-to-gguf:Set model quantization version
INFO:hf-to-gguf:Set model tokenizer
WARNING:gguf.vocab:TemplateProcessing<single> leading/trailing special tokens do not match TemplateProcessing<pair>
INFO:gguf.vocab:Adding 127744 merge(s).
INFO:gguf.vocab:Setting special token type bos to 1
INFO:gguf.vocab:Setting special token type eos to 2
INFO:gguf.vocab:Setting add_bos_token to True
INFO:gguf.vocab:Setting chat_template to {#--------TOOL RENDERING FUNCTIONS---------#}
.
.
.
run it
$ cd ik_llama.cpp
$ export model=/mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-BF16.gguf
$ numactl -N "$SOCKET" -m "$SOCKET" \
./build/bin/llama-server \
--model "$model"\
--alias ubergarm/GigaChat3-10B-A1.8B-GGUF \
--ctx-size 32768 \
-ctk q8_0 \
-ub 4096 -b 4096 \
--parallel 1 \
--threads 96 \
--threads-batch 128 \
--numa numactl \
--host 127.0.0.1 \
--port 8080 \
--no-mmap \
--no-display-prompt \
--validate-quants
llama_model_loader: loaded meta data with 49 key-value pairs and 414 tensors from /mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-BF16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = GigaChat3 10B A1.8B Bf16
llama_model_loader: - kv 3: general.basename str = GigaChat3
llama_model_loader: - kv 4: general.size_label str = 10B-A1.8B
llama_model_loader: - kv 5: general.license str = mit
llama_model_loader: - kv 6: general.tags arr[str,2] = ["moe", "text-generation"]
llama_model_loader: - kv 7: general.languages arr[str,2] = ["ru", "en"]
llama_model_loader: - kv 8: deepseek2.block_count u32 = 26
llama_model_loader: - kv 9: deepseek2.context_length u32 = 262144
llama_model_loader: - kv 10: deepseek2.embedding_length u32 = 1536
llama_model_loader: - kv 11: deepseek2.feed_forward_length u32 = 8960
llama_model_loader: - kv 12: deepseek2.attention.head_count u32 = 32
llama_model_loader: - kv 13: deepseek2.attention.head_count_kv u32 = 1
llama_model_loader: - kv 14: deepseek2.rope.freq_base f32 = 100000.000000
llama_model_loader: - kv 15: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 16: deepseek2.expert_used_count u32 = 4
llama_model_loader: - kv 17: deepseek2.expert_group_count u32 = 1
llama_model_loader: - kv 18: deepseek2.expert_group_used_count u32 = 1
llama_model_loader: - kv 19: deepseek2.attention.key_length u32 = 576
llama_model_loader: - kv 20: deepseek2.attention.value_length u32 = 512
llama_model_loader: - kv 21: general.file_type u32 = 32
llama_model_loader: - kv 22: deepseek2.leading_dense_block_count u32 = 1
llama_model_loader: - kv 23: deepseek2.vocab_size u32 = 128256
llama_model_loader: - kv 24: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 25: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 26: deepseek2.attention.key_length_mla u32 = 192
llama_model_loader: - kv 27: deepseek2.attention.value_length_mla u32 = 192
llama_model_loader: - kv 28: deepseek2.expert_feed_forward_length u32 = 1280
llama_model_loader: - kv 29: deepseek2.expert_count u32 = 64
llama_model_loader: - kv 30: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 31: deepseek2.expert_weights_scale f32 = 1.000000
llama_model_loader: - kv 32: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 33: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 34: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 36: deepseek2.rope.scaling.factor f32 = 64.000000
llama_model_loader: - kv 37: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 38: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
llama_model_loader: - kv 39: general.quantization_version u32 = 2
llama_model_loader: - kv 40: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 41: tokenizer.ggml.pre str = gigachat
llama_model_loader: - kv 42: tokenizer.ggml.tokens arr[str,128256] = ["<unk>", "<s>", "</s>", "!", "\"", "...
llama_model_loader: - kv 43: tokenizer.ggml.token_type arr[i32,128256] = [1, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 44: tokenizer.ggml.merges arr[str,127744] = ["Ð ¾", "Ð °", "Ð µ", "Ð ¸", ...
llama_model_loader: - kv 45: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 46: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 47: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 48: tokenizer.chat_template str = {#--------TOOL RENDERING FUNCTIONS---...
llama_model_loader: - type f32: 129 tensors
llama_model_loader: - type bf16: 285 tensors
================= Adjusted mainline llama.cpp MLA tensors to ik_llama.cpp
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('</s>')
load: special tokens cache size = 14
load: token to piece cache size = 1.0295 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = deepseek2
llm_load_print_meta: n_ctx_train = 262144
llm_load_print_meta: n_embd = 1536
llm_load_print_meta: n_layer = 26
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_swa_pattern = 1
llm_load_print_meta: n_embd_head_k = 192
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 6144
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 8960
llm_load_print_meta: n_expert = 64
llm_load_print_meta: n_expert_used = 4
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = yarn
llm_load_print_meta: freq_base_train = 100000.0
llm_load_print_meta: freq_scale_train = 0.015625
llm_load_print_meta: n_ctx_orig_yarn = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = BF16
llm_load_print_meta: model params = 10.673 B
llm_load_print_meta: model size = 19.884 GiB (16.004 BPW)
llm_load_print_meta: repeating layers = 19.150 GiB (16.004 BPW, 10.279 B parameters)
llm_load_print_meta: general.name = GigaChat3 10B A1.8B Bf16
llm_load_print_meta: n_layer_dense_lead = 1
llm_load_print_meta: n_lora_q = 1536
llm_load_print_meta: n_lora_kv = 512
llm_load_print_meta: n_ff_exp = 1280
llm_load_print_meta: n_expert_shared = 1
llm_load_print_meta: expert_weights_scale = 1.0
llm_load_print_meta: expert_weights_norm = 1
llm_load_print_meta: expert_gating_func = sigmoid
llm_load_print_meta: rope_yarn_log_mul = 0.1000
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 127744
print_info: BOS token = 1 '<s>'
print_info: EOS token = 2 '</s>'
print_info: LF token = 201 'Ċ'
print_info: EOG token = 2 '</s>'
print_info: max token length = 226
llm_load_tensors: ggml ctx size = 0.17 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q_a_norm.weight' not found
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-BF16.gguf'
ERR [ load_model] unable to load model | tid="131753511823616" timestamp=1763653420 model="/mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-BF16.gguf"
So maybe GigaChat3-10B-A1.8B is a lite version of the architecture? This patch helps it get a little bit further, but then still crashes:
👈 Details
$ cd llama.cpp
$ git diff src/llama-model.cpp
diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index e703181a1..f3783f26c 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -1593,7 +1593,7 @@ void llama_model::load_hparams(llama_model_loader & ml) {
} break;
case LLM_ARCH_DEEPSEEK2:
{
- bool is_lite = (hparams.n_layer == 27);
+ bool is_lite = (hparams.n_layer == 27 || hparams.n_layer == 26); // 26 for GigaChat3-10B-A1.8B
ml.get_key(LLM_KV_ATTENTION_LAYERNORM_RMS_EPS, hparams.f_norm_rms_eps);
ml.get_key(LLM_KV_LEADING_DENSE_BLOCK_COUNT, hparams.n_layer_dense_lead);
if (!is_lite) {
@@ -4581,7 +4581,7 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
} break;
case LLM_ARCH_DEEPSEEK2:
{
- const bool is_lite = (hparams.n_layer == 27);
+ const bool is_lite = (hparams.n_layer == 27 || hparams.n_layer == 26); // 26 for GigaChat3-10B-A1.8B
const bool is_mla = (hparams.n_embd_head_k_mla != 0 && hparams.n_embd_head_v_mla != 0);
This is the output from the debugger:
$ cd llama.cpp
$ cmake -B build -DCMAKE_BUILD_TYPE=Debug -DGGML_CUDA=0
$ cmake --build build --config Debug -j $(nproc)
$ gdb -q --args \
./build/bin/llama-server \
--model "$model"\
--alias ubergarm/GigaChat3-10B-A1.8B-GGUF \
--ctx-size 32768 \
--parallel 1 \
--threads 96 \
--threads-batch 128 \
--numa numactl \
--host 127.0.0.1 \
--port 8080 \
--jinja
llama_context: n_ctx_seq (32768) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 1.96 MiB
llama_kv_cache: CPU KV buffer size = 1768.00 MiB
llama_kv_cache: size = 1768.00 MiB ( 32768 cells, 26 layers, 4/1 seqs), K (f16): 936.00 MiB, V (f16): 832.00 MiB
Thread 1 "llama-server" received signal SIGSEGV, Segmentation fault.
0x00007ffff754b6c9 in ggml_can_mul_mat (t0=0x0, t1=0x5555585c7700) at /home/w/projects/llama.cpp/ggml/src/ggml.c:3146
3146 return (t0->ne[0] == t1->ne[0]) &&
(gdb) bt
#0 0x00007ffff754b6c9 in ggml_can_mul_mat (t0=0x0, t1=0x5555585c7700) at /home/w/projects/llama.cpp/ggml/src/ggml.c:3146
#1 0x00007ffff754b765 in ggml_mul_mat (ctx=0x555555edd630, a=0x0, b=0x5555585c7700) at /home/w/projects/llama.cpp/ggml/src/ggml.c:3155
#2 0x00007ffff7c39327 in llm_build_deepseek2::llm_build_deepseek2 (this=0x555555edd660, model=..., params=...)
at /home/w/projects/llama.cpp/src/models/deepseek2.cpp:50
#3 0x00007ffff7b688f8 in std::make_unique<llm_build_deepseek2, llama_model const&, llm_graph_params const&> ()
at /usr/include/c++/13/bits/unique_ptr.h:1070
#4 0x00007ffff7b59a6c in llama_model::build_graph (this=0x555555ca2f40, params=...) at /home/w/projects/llama.cpp/src/llama-model.cpp:7224
#5 0x00007ffff7a3942e in llama_context::graph_reserve (this=0x555555ccee70, n_tokens=1, n_seqs=1, n_outputs=1, mctx=0x555555ed74a0, split_only=true)
at /home/w/projects/llama.cpp/src/llama-context.cpp:1427
#6 0x00007ffff7a33e42 in llama_context::llama_context (this=0x555555ccee70, model=..., params=...)
at /home/w/projects/llama.cpp/src/llama-context.cpp:312
#7 0x00007ffff7a3daff in llama_init_from_model (model=0x555555ca2f40, params=...) at /home/w/projects/llama.cpp/src/llama-context.cpp:2381
#8 0x000055555588051a in common_init_from_params (params=...) at /home/w/projects/llama.cpp/common/common.cpp:967
#9 0x0000555555642b7f in server_context::load_model (this=0x7fffffffc5e0, params=...) at /home/w/projects/llama.cpp/tools/server/server.cpp:2392
#10 0x000055555560b3f5 in main (argc=20, argv=0x7fffffffe048) at /home/w/projects/llama.cpp/tools/server/server.cpp:5608
Okay, got it working. It is a lite version, and ik_llama.cpp/llama.cpp will need a patch: the lite check is currently hardcoded to exactly 27 layers, but this model has only 26...
diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index e703181a1..30902a59d 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -1593,7 +1593,7 @@ void llama_model::load_hparams(llama_model_loader & ml) {
} break;
case LLM_ARCH_DEEPSEEK2:
{
- bool is_lite = (hparams.n_layer == 27);
+ bool is_lite = (hparams.n_layer == 27 || hparams.n_layer == 26);
ml.get_key(LLM_KV_ATTENTION_LAYERNORM_RMS_EPS, hparams.f_norm_rms_eps);
ml.get_key(LLM_KV_LEADING_DENSE_BLOCK_COUNT, hparams.n_layer_dense_lead);
if (!is_lite) {
@@ -4581,7 +4581,7 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
} break;
case LLM_ARCH_DEEPSEEK2:
{
- const bool is_lite = (hparams.n_layer == 27);
+ const bool is_lite = (hparams.n_layer == 27 || hparams.n_layer == 26);
const bool is_mla = (hparams.n_embd_head_k_mla != 0 && hparams.n_embd_head_v_mla != 0);
diff --git a/src/models/deepseek2.cpp b/src/models/deepseek2.cpp
index 68f72f72b..507926af5 100644
--- a/src/models/deepseek2.cpp
+++ b/src/models/deepseek2.cpp
@@ -4,7 +4,7 @@
llm_build_deepseek2::llm_build_deepseek2(const llama_model & model, const llm_graph_params & params) :
llm_graph_context(params) {
- bool is_lite = (hparams.n_layer == 27);
+ bool is_lite = (hparams.n_layer == 27 || hparams.n_layer == 26);
const bool is_mla = (hparams.n_embd_head_k_mla != 0 && hparams.n_embd_head_v_mla != 0);
Opened a PR here in case anyone else can test it: https://github.com/ggml-org/llama.cpp/pull/17420
Also uploaded a GGUF here: https://huggingface.co/ubergarm/GigaChat3-10B-A1.8B-GGUF/tree/main
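If it helps anyone, here is a hedged sketch for fetching one of the uploaded files with huggingface_hub; the exact filenames aren't reproduced here, so it just lists the repo and grabs whatever matches Q8_0 (adjust the filter for other quants):
# List the GGUF files in the repo and download the Q8_0 one (assumes a file
# with "Q8_0" in its name exists in the repo).
from huggingface_hub import HfApi, hf_hub_download

repo = "ubergarm/GigaChat3-10B-A1.8B-GGUF"
files = [f for f in HfApi().list_repo_files(repo) if f.endswith(".gguf")]
print(files)
q8 = next(f for f in files if "Q8_0" in f)
local_path = hf_hub_download(repo_id=repo, filename=q8)
print("downloaded to", local_path)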
Fixed chat template is here: https://huggingface.co/evilfreelancer/GigaChat3-10B-A1.8B-GGUF/blob/main/chat_template.jinja
ik_llama.cpp specific quants now available:
The Q6_K version still doesn't work in LMStudio, even though the Metal llama.cpp runtime was updated today.
I didn't release a Q6_K quant, so I'm not sure what you're asking about. Only the Q8_0 that I released will run on LMStudio/mainline llama.cpp, as the model card says right up front.
If you want to use these high-quality quantizations, you need to use something based on ik_llama.cpp, e.g.:
- https://github.com/ikawrakow/ik_llama.cpp/ - compile it yourself on linux or windows
- https://github.com/Thireus/ik_llama.cpp/releases - pre-compiled windows binaries
- https://github.com/janhq/jan/pull/6897 - Jan might support the pre-compiled Thireus package
- https://github.com/Nexesenex/croco.cpp - should work with most ik quants
Cheers!
An MLX version at 6 bits would be great too. The local MLX converter couldn't handle this model either, same as the GGUF converter.
I'm trying to run it with mlx_lm:
from mlx_lm.utils import load, load_model, get_model_path, load_tokenizer

giga = "ai-sage/GigaChat3-10B-A1.8B-bf16"  # assuming this points at the checkpoint
model, tokenizer = load(giga)
and I get this error:
ValueError Traceback (most recent call last)
Cell In[7], line 5
4 try:
----> 5 model, tokenizer = load(giga)
6 except ValueError as e:
File ~/Documents/ML/gpt-oss/.venv-1/lib/python3.13/site-packages/mlx_lm/utils.py:265, in load(path_or_hf_repo, tokenizer_config, model_config, adapter_path, lazy)
263 model_path, _ = get_model_path(path_or_hf_repo)
--> 265 model, config = load_model(model_path, lazy, model_config=model_config)
266 if adapter_path is not None:
File ~/Documents/ML/gpt-oss/.venv-1/lib/python3.13/site-packages/mlx_lm/utils.py:226, in load_model(model_path, lazy, strict, model_config, get_model_classes)
224 _quantize(quantization)
--> 226 model.load_weights(list(weights.items()), strict=strict)
228 if not lazy:
File ~/Documents/ML/gpt-oss/.venv-1/lib/python3.13/site-packages/mlx/nn/layers/base.py:185, in Module.load_weights(self, file_or_weights, strict)
184 extras = ",\n".join(sorted(extras))
--> 185 raise ValueError(
186 f"Received {num_extra} parameters not in model: \n{extras}."
187 )
188 if missing := (curr_weights.keys() - new_weights.keys()):
ValueError: Received 210 parameters not in model:
model.layers.26.eh_proj.weight,
model.layers.26.embed_tokens.weight,
model.layers.26.enorm.weight,
model.layers.26.hnorm.weight,
model.layers.26.input_layernorm.weight,
model.layers.26.mlp.experts.0.down_proj.weight,
model.layers.26.mlp.experts.0.gate_proj.weight,
model.layers.26.mlp.experts.0.up_proj.weight,
model.layers.26.mlp.experts.1.down_proj.weight,
model.layers.26.mlp.experts.1.gate_proj.weight,
model.layers.26.mlp.experts.1.up_proj.weight,
model.layers.26.mlp.experts.10.down_proj.weight,
model.layers.26.mlp.experts.10.gate_proj.weight,
model.layers.26.mlp.experts.10.up_proj.weight,
model.layers.26.mlp.experts.11.down_proj.weight,
model.layers.26.mlp.experts.11.gate_proj.weight,
model.layers.26.mlp.experts.11.up_proj.weight,
model.layers.26.mlp.experts.12.down_proj.weight,
model.layers.26.mlp.experts.12.gate_proj.weight,
model.layers.26.mlp.experts.12.up_proj.weight,
model.layers.26.mlp.experts.13.down_proj.weight,
model.layers.26.mlp.experts.13.gate_proj.weight,
model.layers.26.mlp.experts.13.up_proj.weight,
model.layers.26.mlp.experts.14.down_proj.weight,
model.layers.26.mlp.experts.14.gate_proj.weight,
model.layers.26.mlp.experts.14.up_proj.weight,
model.layers.26.mlp.experts.15.down_proj.weight,
model.layers.26.mlp.experts.15.gate_proj.weight,
model.layers.26.mlp.experts.15.up_proj.weight,
model.layers.26.mlp.experts.16.down_proj.weight,
model.layers.26.mlp.experts.16.gate_proj.weight,
model.layers.26.mlp.experts.16.up_proj.weight,
model.layers.26.mlp.experts.17.down_proj.weight,
model.layers.26.mlp.experts.17.gate_proj.weight,
model.layers.26.mlp.experts.17.up_proj.weight,
model.layers.26.mlp.experts.18.down_proj.weight,
model.layers.26.mlp.experts.18.gate_proj.weight,
model.layers.26.mlp.experts.18.up_proj.weight,
model.layers.26.mlp.experts.19.down_proj.weight,
model.layers.26.mlp.experts.19.gate_proj.weight,
model.layers.26.mlp.experts.19.up_proj.weight,
model.layers.26.mlp.experts.2.down_proj.weight,
model.layers.26.mlp.experts.2.gate_proj.weight,
model.layers.26.mlp.experts.2.up_proj.weight,
model.layers.26.mlp.experts.20.down_proj.weight,
model.layers.26.mlp.experts.20.gate_proj.weight,
model.layers.26.mlp.experts.20.up_proj.weight,
model.layers.26.mlp.experts.21.down_proj.weight,
model.layers.26.mlp.experts.21.gate_proj.weight,
model.layers.26.mlp.experts.21.up_proj.weight,
model.layers.26.mlp.experts.22.down_proj.weight,
model.layers.26.mlp.experts.22.gate_proj.weight,
model.layers.26.mlp.experts.22.up_proj.weight,
model.layers.26.mlp.experts.23.down_proj.weight,
model.layers.26.mlp.experts.23.gate_proj.weight,
model.layers.26.mlp.experts.23.up_proj.weight,
model.layers.26.mlp.experts.24.down_proj.weight,
model.layers.26.mlp.experts.24.gate_proj.weight,
model.layers.26.mlp.experts.24.up_proj.weight,
model.layers.26.mlp.experts.25.down_proj.weight,
model.layers.26.mlp.experts.25.gate_proj.weight,
model.layers.26.mlp.experts.25.up_proj.weight,
model.layers.26.mlp.experts.26.down_proj.weight,
model.layers.26.mlp.experts.26.gate_proj.weight,
model.layers.26.mlp.experts.26.up_proj.weight,
model.layers.26.mlp.experts.27.down_proj.weight,
model.layers.26.mlp.experts.27.gate_proj.weight,
model.layers.26.mlp.experts.27.up_proj.weight,
model.layers.26.mlp.experts.28.down_proj.weight,
model.layers.26.mlp.experts.28.gate_proj.weight,
model.layers.26.mlp.experts.28.up_proj.weight,
model.layers.26.mlp.experts.29.down_proj.weight,
model.layers.26.mlp.experts.29.gate_proj.weight,
model.layers.26.mlp.experts.29.up_proj.weight,
model.layers.26.mlp.experts.3.down_proj.weight,
model.layers.26.mlp.experts.3.gate_proj.weight,
model.layers.26.mlp.experts.3.up_proj.weight,
model.layers.26.mlp.experts.30.down_proj.weight,
model.layers.26.mlp.experts.30.gate_proj.weight,
model.layers.26.mlp.experts.30.up_proj.weight,
model.layers.26.mlp.experts.31.down_proj.weight,
model.layers.26.mlp.experts.31.gate_proj.weight,
model.layers.26.mlp.experts.31.up_proj.weight,
model.layers.26.mlp.experts.32.down_proj.weight,
model.layers.26.mlp.experts.32.gate_proj.weight,
model.layers.26.mlp.experts.32.up_proj.weight,
model.layers.26.mlp.experts.33.down_proj.weight,
model.layers.26.mlp.experts.33.gate_proj.weight,
model.layers.26.mlp.experts.33.up_proj.weight,
model.layers.26.mlp.experts.34.down_proj.weight,
model.layers.26.mlp.experts.34.gate_proj.weight,
model.layers.26.mlp.experts.34.up_proj.weight,
model.layers.26.mlp.experts.35.down_proj.weight,
model.layers.26.mlp.experts.35.gate_proj.weight,
model.layers.26.mlp.experts.35.up_proj.weight,
model.layers.26.mlp.experts.36.down_proj.weight,
model.layers.26.mlp.experts.36.gate_proj.weight,
model.layers.26.mlp.experts.36.up_proj.weight,
model.layers.26.mlp.experts.37.down_proj.weight,
model.layers.26.mlp.experts.37.gate_proj.weight,
model.layers.26.mlp.experts.37.up_proj.weight,
model.layers.26.mlp.experts.38.down_proj.weight,
model.layers.26.mlp.experts.38.gate_proj.weight,
model.layers.26.mlp.experts.38.up_proj.weight,
model.layers.26.mlp.experts.39.down_proj.weight,
model.layers.26.mlp.experts.39.gate_proj.weight,
model.layers.26.mlp.experts.39.up_proj.weight,
model.layers.26.mlp.experts.4.down_proj.weight,
model.layers.26.mlp.experts.4.gate_proj.weight,
model.layers.26.mlp.experts.4.up_proj.weight,
model.layers.26.mlp.experts.40.down_proj.weight,
model.layers.26.mlp.experts.40.gate_proj.weight,
model.layers.26.mlp.experts.40.up_proj.weight,
model.layers.26.mlp.experts.41.down_proj.weight,
model.layers.26.mlp.experts.41.gate_proj.weight,
model.layers.26.mlp.experts.41.up_proj.weight,
model.layers.26.mlp.experts.42.down_proj.weight,
model.layers.26.mlp.experts.42.gate_proj.weight,
model.layers.26.mlp.experts.42.up_proj.weight,
model.layers.26.mlp.experts.43.down_proj.weight,
model.layers.26.mlp.experts.43.gate_proj.weight,
model.layers.26.mlp.experts.43.up_proj.weight,
model.layers.26.mlp.experts.44.down_proj.weight,
model.layers.26.mlp.experts.44.gate_proj.weight,
model.layers.26.mlp.experts.44.up_proj.weight,
model.layers.26.mlp.experts.45.down_proj.weight,
model.layers.26.mlp.experts.45.gate_proj.weight,
model.layers.26.mlp.experts.45.up_proj.weight,
model.layers.26.mlp.experts.46.down_proj.weight,
model.layers.26.mlp.experts.46.gate_proj.weight,
model.layers.26.mlp.experts.46.up_proj.weight,
model.layers.26.mlp.experts.47.down_proj.weight,
model.layers.26.mlp.experts.47.gate_proj.weight,
model.layers.26.mlp.experts.47.up_proj.weight,
model.layers.26.mlp.experts.48.down_proj.weight,
model.layers.26.mlp.experts.48.gate_proj.weight,
model.layers.26.mlp.experts.48.up_proj.weight,
model.layers.26.mlp.experts.49.down_proj.weight,
model.layers.26.mlp.experts.49.gate_proj.weight,
model.layers.26.mlp.experts.49.up_proj.weight,
model.layers.26.mlp.experts.5.down_proj.weight,
model.layers.26.mlp.experts.5.gate_proj.weight,
model.layers.26.mlp.experts.5.up_proj.weight,
model.layers.26.mlp.experts.50.down_proj.weight,
model.layers.26.mlp.experts.50.gate_proj.weight,
model.layers.26.mlp.experts.50.up_proj.weight,
model.layers.26.mlp.experts.51.down_proj.weight,
model.layers.26.mlp.experts.51.gate_proj.weight,
model.layers.26.mlp.experts.51.up_proj.weight,
model.layers.26.mlp.experts.52.down_proj.weight,
model.layers.26.mlp.experts.52.gate_proj.weight,
model.layers.26.mlp.experts.52.up_proj.weight,
model.layers.26.mlp.experts.53.down_proj.weight,
model.layers.26.mlp.experts.53.gate_proj.weight,
model.layers.26.mlp.experts.53.up_proj.weight,
model.layers.26.mlp.experts.54.down_proj.weight,
model.layers.26.mlp.experts.54.gate_proj.weight,
model.layers.26.mlp.experts.54.up_proj.weight,
model.layers.26.mlp.experts.55.down_proj.weight,
model.layers.26.mlp.experts.55.gate_proj.weight,
model.layers.26.mlp.experts.55.up_proj.weight,
model.layers.26.mlp.experts.56.down_proj.weight,
model.layers.26.mlp.experts.56.gate_proj.weight,
model.layers.26.mlp.experts.56.up_proj.weight,
model.layers.26.mlp.experts.57.down_proj.weight,
model.layers.26.mlp.experts.57.gate_proj.weight,
model.layers.26.mlp.experts.57.up_proj.weight,
model.layers.26.mlp.experts.58.down_proj.weight,
model.layers.26.mlp.experts.58.gate_proj.weight,
model.layers.26.mlp.experts.58.up_proj.weight,
model.layers.26.mlp.experts.59.down_proj.weight,
model.layers.26.mlp.experts.59.gate_proj.weight,
model.layers.26.mlp.experts.59.up_proj.weight,
model.layers.26.mlp.experts.6.down_proj.weight,
model.layers.26.mlp.experts.6.gate_proj.weight,
model.layers.26.mlp.experts.6.up_proj.weight,
model.layers.26.mlp.experts.60.down_proj.weight,
model.layers.26.mlp.experts.60.gate_proj.weight,
model.layers.26.mlp.experts.60.up_proj.weight,
model.layers.26.mlp.experts.61.down_proj.weight,
model.layers.26.mlp.experts.61.gate_proj.weight,
model.layers.26.mlp.experts.61.up_proj.weight,
model.layers.26.mlp.experts.62.down_proj.weight,
model.layers.26.mlp.experts.62.gate_proj.weight,
model.layers.26.mlp.experts.62.up_proj.weight,
model.layers.26.mlp.experts.63.down_proj.weight,
model.layers.26.mlp.experts.63.gate_proj.weight,
model.layers.26.mlp.experts.63.up_proj.weight,
model.layers.26.mlp.experts.7.down_proj.weight,
model.layers.26.mlp.experts.7.gate_proj.weight,
model.layers.26.mlp.experts.7.up_proj.weight,
model.layers.26.mlp.experts.8.down_proj.weight,
model.layers.26.mlp.experts.8.gate_proj.weight,
model.layers.26.mlp.experts.8.up_proj.weight,
model.layers.26.mlp.experts.9.down_proj.weight,
model.layers.26.mlp.experts.9.gate_proj.weight,
model.layers.26.mlp.experts.9.up_proj.weight,
model.layers.26.mlp.gate.e_score_correction_bias,
model.layers.26.mlp.gate.weight,
model.layers.26.mlp.shared_experts.down_proj.weight,
model.layers.26.mlp.shared_experts.gate_proj.weight,
model.layers.26.mlp.shared_experts.up_proj.weight,
model.layers.26.post_attention_layernorm.weight,
model.layers.26.self_attn.kv_a_layernorm.weight,
model.layers.26.self_attn.kv_a_proj_with_mqa.weight,
model.layers.26.self_attn.kv_b_proj.weight,
model.layers.26.self_attn.o_proj.weight,
model.layers.26.self_attn.q_proj.weight,
model.layers.26.shared_head.head.weight,
model.layers.26.shared_head.norm.weight.
During handling of the above exception, another exception occurred:
IndexError Traceback (most recent call last)
Cell In[7], line 10
8 # Try with custom model loading
9 model_path, _ = get_model_path(giga)
---> 10 model, config = load_model(model_path, lazy=True, strict=False)
11 tokenizer = load_tokenizer(model_path)
12 print('Model loaded successfully with strict=False')
File ~/Documents/ML/gpt-oss/.venv-1/lib/python3.13/site-packages/mlx_lm/utils.py:226, in load_model(model_path, lazy, strict, model_config, get_model_classes)
223 config["quantization_config"] = quantization
224 _quantize(quantization)
--> 226 model.load_weights(list(weights.items()), strict=strict)
228 if not lazy:
229 mx.eval(model.parameters())
File ~/Documents/ML/gpt-oss/.venv-1/lib/python3.13/site-packages/mlx/nn/layers/base.py:206, in Module.load_weights(self, file_or_weights, strict)
200 raise ValueError(
201 f"Expected shape {v.shape} but received "
202 f"shape {v_new.shape} for parameter {k}"
203 )
205 if len(weights) != 0:
--> 206 self.update(tree_unflatten(weights), strict=False)
207 return self
File ~/Documents/ML/gpt-oss/.venv-1/lib/python3.13/site-packages/mlx/nn/layers/base.py:356, in Module.update(self, parameters, strict)
353 elif strict:
354 raise ValueError(f"Received invalid type: {type(parameters).__name__}.")
--> 356 apply(self, parameters)
357 return self
File ~/Documents/ML/gpt-oss/.venv-1/lib/python3.13/site-packages/mlx/nn/layers/base.py:338, in Module.update.<locals>.apply(dst, parameters)
336 dst[k] = new_value
337 else:
--> 338 apply(current_value, new_value)
339 elif strict:
340 raise ValueError(f'Module does not have parameter named "{k}".')
File ~/Documents/ML/gpt-oss/.venv-1/lib/python3.13/site-packages/mlx/nn/layers/base.py:338, in Module.update.<locals>.apply(dst, parameters)
336 dst[k] = new_value
337 else:
--> 338 apply(current_value, new_value)
339 elif strict:
340 raise ValueError(f'Module does not have parameter named "{k}".')
File ~/Documents/ML/gpt-oss/.venv-1/lib/python3.13/site-packages/mlx/nn/layers/base.py:343, in Module.update.<locals>.apply(dst, parameters)
341 elif isinstance(parameters, list):
342 for i in range(len(parameters)):
--> 343 current_value = dst[i]
344 new_value = parameters[i]
345 if isinstance(current_value, mx.array):
IndexError: list index out of range
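Judging by the tensor names (eh_proj, enorm, hnorm, shared_head), layer 26 looks like an auxiliary multi-token-prediction head in the style of DeepSeek-V3, which the mlx_lm model class apparently doesn't expect. Purely as a guess and not a verified fix, one could strip those tensors into a filtered copy of the checkpoint before loading; whether the result behaves correctly is untested:
# Untested workaround sketch: copy the checkpoint while dropping every
# model.layers.26.* tensor (the apparent MTP head), then point mlx_lm at the
# filtered directory. Paths are illustrative.
import json, shutil
from pathlib import Path
import mlx.core as mx

src = Path("GigaChat3-10B-A1.8B-bf16")
dst = Path("GigaChat3-10B-A1.8B-bf16-no-mtp")
dst.mkdir(exist_ok=True)

index = json.loads((src / "model.safetensors.index.json").read_text())
keep = {k: v for k, v in index["weight_map"].items()
        if not k.startswith("model.layers.26.")}

for shard in sorted(set(keep.values())):
    tensors = mx.load(str(src / shard))   # dict of mx.array per safetensors shard
    mx.save_safetensors(str(dst / shard),
                        {k: v for k, v in tensors.items() if k in keep})

index["weight_map"] = keep
(dst / "model.safetensors.index.json").write_text(json.dumps(index, indent=2))
for extra in src.iterdir():               # copy config, tokenizer files, etc.
    if extra.suffix != ".safetensors" and extra.name != "model.safetensors.index.json":
        shutil.copy(extra, dst / extra.name)
After that one would point load() at the filtered directory; I make no claim that generation quality survives removing that head.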
