Could you provide GGUF files please?

#1
by rivvada - opened

As the title says.

I second @rivvada. GGUF is the de facto standard for desktop and mobile AI applications. The standard converter from Gerganov (llama.cpp) can't handle this model:

Error quantizing: main: build = 7113 (845f200b2)
main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: quantizing 'outputs/tmpvu720z06/GigaChat3-10B-A1.8B-bf16.fp16.gguf' to 'outputs/tmpvu720z06/gigachat3-10b-a1.8b-bf16-q4_k_m.gguf' as Q4_K_M
llama_model_loader: loaded meta data with 47 key-value pairs and 414 tensors from outputs/tmpvu720z06/GigaChat3-10B-A1.8B-bf16.fp16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = GigaChat3 10B A1.8B Bf16
llama_model_loader: - kv   3:                           general.basename str              = GigaChat3
llama_model_loader: - kv   4:                         general.size_label str              = 10B-A1.8B
llama_model_loader: - kv   5:                            general.license str              = mit
llama_model_loader: - kv   6:                               general.tags arr[str,2]       = ["moe", "text-generation"]
llama_model_loader: - kv   7:                          general.languages arr[str,2]       = ["ru", "en"]
llama_model_loader: - kv   8:                      deepseek2.block_count u32              = 26
llama_model_loader: - kv   9:                   deepseek2.context_length u32              = 262144
llama_model_loader: - kv  10:                 deepseek2.embedding_length u32              = 1536
llama_model_loader: - kv  11:              deepseek2.feed_forward_length u32              = 8960
llama_model_loader: - kv  12:             deepseek2.attention.head_count u32              = 32
llama_model_loader: - kv  13:          deepseek2.attention.head_count_kv u32              = 1
llama_model_loader: - kv  14:                   deepseek2.rope.freq_base f32              = 100000.000000
llama_model_loader: - kv  15: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  16:                deepseek2.expert_used_count u32              = 4
llama_model_loader: - kv  17:               deepseek2.expert_group_count u32              = 1
llama_model_loader: - kv  18:          deepseek2.expert_group_used_count u32              = 1
llama_model_loader: - kv  19:               deepseek2.expert_gating_func u32              = 2
llama_model_loader: - kv  20:             deepseek2.attention.key_length u32              = 576
llama_model_loader: - kv  21:           deepseek2.attention.value_length u32              = 512
llama_model_loader: - kv  22:                          general.file_type u32              = 1
llama_model_loader: - kv  23:        deepseek2.leading_dense_block_count u32              = 1
llama_model_loader: - kv  24:                       deepseek2.vocab_size u32              = 128256
llama_model_loader: - kv  25:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  26:         deepseek2.attention.key_length_mla u32              = 192
llama_model_loader: - kv  27:       deepseek2.attention.value_length_mla u32              = 192
llama_model_loader: - kv  28:       deepseek2.expert_feed_forward_length u32              = 1280
llama_model_loader: - kv  29:                     deepseek2.expert_count u32              = 64
llama_model_loader: - kv  30:              deepseek2.expert_shared_count u32              = 1
llama_model_loader: - kv  31:             deepseek2.expert_weights_scale f32              = 1.000000
llama_model_loader: - kv  32:              deepseek2.expert_weights_norm bool             = true
llama_model_loader: - kv  33:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  34:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  35:              deepseek2.rope.scaling.factor f32              = 64.000000
llama_model_loader: - kv  36: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  37: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
llama_model_loader: - kv  38:               general.quantization_version u32              = 2
llama_model_loader: - kv  39:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  40:                         tokenizer.ggml.pre str              = gigachat
llama_model_loader: - kv  41:                      tokenizer.ggml.tokens arr[str,128256]  = ["<unk>", "<s>", "</s>", "!", "\"", "...
llama_model_loader: - kv  42:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  43:                      tokenizer.ggml.merges arr[str,127744]  = ["Ð ¾", "Ð °", "Ð µ", "Ð ¸", ...
llama_model_loader: - kv  44:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  45:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  46:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - type  f32:  129 tensors
llama_model_loader: - type  f16:  285 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA A10G, compute capability 8.6, VMM: yes
llama_model_quantize: failed to quantize: key not found in model: deepseek2.attention.q_lora_rank
main: failed to quantize model from 'outputs/tmpvu720z06/GigaChat3-10B-A1.8B-bf16.fp16.gguf'
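
For reference, a quick way to see which deepseek2.* attention keys the converter actually wrote into the F16 GGUF is the gguf-py package that ships with llama.cpp; a minimal sketch (the file path is a placeholder):

from gguf import GGUFReader  # gguf-py, bundled with llama.cpp

# Dump the deepseek2 attention keys present in the converted file; the
# quantizer above aborts because deepseek2.attention.q_lora_rank is not among them.
reader = GGUFReader("GigaChat3-10B-A1.8B-bf16.fp16.gguf")
for key in reader.fields:
    if key.startswith("deepseek2.attention."):
        print(key)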

An MLX version at 6 bits would be nice too. The local MLX converter also couldn't handle this model, just like the GGUF converter.

Still not working...

UPDATE: Maybe this is a 'lite' version of the architecture?

OLD:

Does this model have a slightly different architecture that is missing self_attn.q_a_layernorm in the safetensors? That tensor is normally mapped to the GGUF attn_q_a_norm tensor, which is present in the deepseek-v2 architecture.

If so, it will likely need a couple of patches in llama.cpp, e.g.:

  1. Update the convert script as needed: https://github.com/ggml-org/llama.cpp/blob/master/convert_hf_to_gguf.py#L7069
  2. It might need a slightly different llama-graph than the original deepseek2

Otherwise it might just be that the naming convention is different and the tensor is present under a different name?
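
One quick way to check the naming question is to list the attention tensor names straight from the sharded checkpoint's index; a minimal sketch, assuming the usual model.safetensors.index.json is present in the model directory:

import json

# The sharded checkpoint maps every tensor name to its shard in the index file,
# so we can grep the names without loading any weights.
with open("model.safetensors.index.json") as f:
    names = sorted(json.load(f)["weight_map"])

print([n for n in names if "q_a" in n or "q_b" in n])               # empty if truly absent
print([n for n in names if n.startswith("model.layers.0.self_attn.")])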

👈 Details

patch config.json

First I removed the config.json line with q_lora_rank:

ai-sage/GigaChat3-10B-A1.8B-bf16$ diff config.json config.json.bak
14a15
>   "q_lora_rank": null,
52c53
< }
---
> }
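
The same edit can be scripted; a minimal Python sketch of the config.json change shown in the diff above, run from the model directory:

import json

# Drop the "q_lora_rank": null entry, matching the manual edit above.
with open("config.json") as f:
    cfg = json.load(f)
cfg.pop("q_lora_rank", None)
with open("config.json", "w") as f:
    json.dump(cfg, f, indent=2)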

convert it

$ cd llama.cpp
$ source venv/bin/activate
$ python \
    convert_hf_to_gguf.py \
    --outtype bf16 \
    --split-max-size 50G \
    --outfile /mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF \
    /mnt/data/models/ai-sage/GigaChat3-10B-A1.8B-bf16/

INFO:hf-to-gguf:gguf: indexing model part 'model-00003-of-00010.safetensors'                                                      
INFO:hf-to-gguf:gguf: indexing model part 'model-00004-of-00010.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00005-of-00010.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00006-of-00010.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00007-of-00010.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00008-of-00010.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00009-of-00010.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00010-of-00010.safetensors'
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:output.weight,                torch.bfloat16 --> BF16, shape = {1536, 128256}
INFO:hf-to-gguf:token_embd.weight,            torch.bfloat16 --> BF16, shape = {1536, 128256}
INFO:hf-to-gguf:blk.0.attn_norm.weight,       torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.0.ffn_down.weight,        torch.bfloat16 --> BF16, shape = {8960, 1536}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,        torch.bfloat16 --> BF16, shape = {1536, 8960}
INFO:hf-to-gguf:blk.0.ffn_up.weight,          torch.bfloat16 --> BF16, shape = {1536, 8960}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,        torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.0.attn_kv_a_norm.weight,  torch.bfloat16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.0.attn_kv_a_mqa.weight,   torch.bfloat16 --> BF16, shape = {1536, 576}
INFO:hf-to-gguf:blk.0.attn_k_b.weight,        torch.bfloat16 --> BF16, shape = {128, 512, 32}
INFO:hf-to-gguf:blk.0.attn_v_b.weight,        torch.bfloat16 --> BF16, shape = {512, 192, 32}
INFO:hf-to-gguf:blk.0.attn_output.weight,     torch.bfloat16 --> BF16, shape = {6144, 1536}
INFO:hf-to-gguf:blk.0.attn_q.weight,          torch.bfloat16 --> BF16, shape = {1536, 6144}
INFO:hf-to-gguf:blk.1.attn_norm.weight,       torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.1.ffn_down_exps.weight,   torch.bfloat16 --> BF16, shape = {1280, 1536, 64}
INFO:hf-to-gguf:blk.1.ffn_gate_exps.weight,   torch.bfloat16 --> BF16, shape = {1536, 1280, 64}
INFO:hf-to-gguf:blk.1.ffn_up_exps.weight,     torch.bfloat16 --> BF16, shape = {1536, 1280, 64}
INFO:hf-to-gguf:blk.1.exp_probs_b.bias,       torch.bfloat16 --> F32, shape = {64}
INFO:hf-to-gguf:blk.1.ffn_gate_inp.weight,    torch.bfloat16 --> F32, shape = {1536, 64}
INFO:hf-to-gguf:blk.1.ffn_down_shexp.weight,  torch.bfloat16 --> BF16, shape = {1280, 1536}
INFO:hf-to-gguf:blk.1.ffn_gate_shexp.weight,  torch.bfloat16 --> BF16, shape = {1536, 1280}
INFO:hf-to-gguf:blk.1.ffn_up_shexp.weight,    torch.bfloat16 --> BF16, shape = {1536, 1280}
INFO:hf-to-gguf:blk.1.ffn_norm.weight,        torch.bfloat16 --> F32, shape = {1536}
.
.
.
INFO:hf-to-gguf:blk.25.attn_q.weight,         torch.bfloat16 --> BF16, shape = {1536, 6144}
INFO:hf-to-gguf:Set meta model
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 262144
INFO:hf-to-gguf:gguf: embedding length = 1536
INFO:hf-to-gguf:gguf: feed forward length = 8960
INFO:hf-to-gguf:gguf: head count = 32
INFO:hf-to-gguf:gguf: key-value head count = 1
INFO:hf-to-gguf:gguf: rope theta = 100000
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-06
INFO:hf-to-gguf:gguf: experts used count = 4
INFO:hf-to-gguf:gguf: expert groups count = 1
INFO:hf-to-gguf:gguf: expert groups used count = 1
INFO:hf-to-gguf:gguf: file type = 32
WARNING:gguf.gguf_writer:Duplicated key name 'deepseek2.attention.key_length', overwriting it with new value 576 of type UINT32
WARNING:gguf.gguf_writer:Duplicated key name 'deepseek2.attention.value_length', overwriting it with new value 512 of type UINT32
INFO:hf-to-gguf:Set model quantization version
INFO:hf-to-gguf:Set model tokenizer
WARNING:gguf.vocab:TemplateProcessing<single> leading/trailing special tokens do not match TemplateProcessing<pair>
INFO:gguf.vocab:Adding 127744 merge(s).
INFO:gguf.vocab:Setting special token type bos to 1
INFO:gguf.vocab:Setting special token type eos to 2
INFO:gguf.vocab:Setting add_bos_token to True
INFO:gguf.vocab:Setting chat_template to {#--------TOOL RENDERING FUNCTIONS---------#}
.
.
.

run it

$ cd ik_llama.cpp
$ export model=/mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-BF16.gguf
$ numactl -N "$SOCKET" -m "$SOCKET" \
./build/bin/llama-server \
    --model "$model"\
    --alias ubergarm/GigaChat3-10B-A1.8B-GGUF \
    --ctx-size 32768 \
    -ctk q8_0 \
    -ub 4096 -b 4096 \
    --parallel 1 \
    --threads 96 \
    --threads-batch 128 \
    --numa numactl \
    --host 127.0.0.1 \
    --port 8080 \
    --no-mmap \
    --no-display-prompt \
    --validate-quants


llama_model_loader: loaded meta data with 49 key-value pairs and 414 tensors from /mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-BF16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = GigaChat3 10B A1.8B Bf16
llama_model_loader: - kv   3:                           general.basename str              = GigaChat3
llama_model_loader: - kv   4:                         general.size_label str              = 10B-A1.8B
llama_model_loader: - kv   5:                            general.license str              = mit
llama_model_loader: - kv   6:                               general.tags arr[str,2]       = ["moe", "text-generation"]
llama_model_loader: - kv   7:                          general.languages arr[str,2]       = ["ru", "en"]
llama_model_loader: - kv   8:                      deepseek2.block_count u32              = 26
llama_model_loader: - kv   9:                   deepseek2.context_length u32              = 262144
llama_model_loader: - kv  10:                 deepseek2.embedding_length u32              = 1536
llama_model_loader: - kv  11:              deepseek2.feed_forward_length u32              = 8960
llama_model_loader: - kv  12:             deepseek2.attention.head_count u32              = 32
llama_model_loader: - kv  13:          deepseek2.attention.head_count_kv u32              = 1
llama_model_loader: - kv  14:                   deepseek2.rope.freq_base f32              = 100000.000000
llama_model_loader: - kv  15: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  16:                deepseek2.expert_used_count u32              = 4
llama_model_loader: - kv  17:               deepseek2.expert_group_count u32              = 1
llama_model_loader: - kv  18:          deepseek2.expert_group_used_count u32              = 1
llama_model_loader: - kv  19:             deepseek2.attention.key_length u32              = 576
llama_model_loader: - kv  20:           deepseek2.attention.value_length u32              = 512
llama_model_loader: - kv  21:                          general.file_type u32              = 32
llama_model_loader: - kv  22:        deepseek2.leading_dense_block_count u32              = 1
llama_model_loader: - kv  23:                       deepseek2.vocab_size u32              = 128256
llama_model_loader: - kv  24:            deepseek2.attention.q_lora_rank u32              = 1536
llama_model_loader: - kv  25:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  26:         deepseek2.attention.key_length_mla u32              = 192
llama_model_loader: - kv  27:       deepseek2.attention.value_length_mla u32              = 192
llama_model_loader: - kv  28:       deepseek2.expert_feed_forward_length u32              = 1280
llama_model_loader: - kv  29:                     deepseek2.expert_count u32              = 64
llama_model_loader: - kv  30:              deepseek2.expert_shared_count u32              = 1
llama_model_loader: - kv  31:             deepseek2.expert_weights_scale f32              = 1.000000
llama_model_loader: - kv  32:              deepseek2.expert_weights_norm bool             = true
llama_model_loader: - kv  33:               deepseek2.expert_gating_func u32              = 2
llama_model_loader: - kv  34:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  36:              deepseek2.rope.scaling.factor f32              = 64.000000 
llama_model_loader: - kv  37: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  38: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
llama_model_loader: - kv  39:               general.quantization_version u32              = 2
llama_model_loader: - kv  40:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  41:                         tokenizer.ggml.pre str              = gigachat
llama_model_loader: - kv  42:                      tokenizer.ggml.tokens arr[str,128256]  = ["<unk>", "<s>", "</s>", "!", "\"", "...
llama_model_loader: - kv  43:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  44:                      tokenizer.ggml.merges arr[str,127744]  = ["Ð ¾", "Ð °", "Ð µ", "Ð ¸", ...
llama_model_loader: - kv  45:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  46:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  47:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  48:                    tokenizer.chat_template str              = {#--------TOOL RENDERING FUNCTIONS---...
llama_model_loader: - type  f32:  129 tensors
llama_model_loader: - type bf16:  285 tensors
================= Adjusted mainline llama.cpp MLA tensors to ik_llama.cpp
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 2 ('</s>')
load: special tokens cache size = 14
load: token to piece cache size = 1.0295 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = deepseek2
llm_load_print_meta: n_ctx_train      = 262144
llm_load_print_meta: n_embd           = 1536
llm_load_print_meta: n_layer          = 26
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 192
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 6144
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 8960
llm_load_print_meta: n_expert         = 64
llm_load_print_meta: n_expert_used    = 4
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = yarn
llm_load_print_meta: freq_base_train  = 100000.0
llm_load_print_meta: freq_scale_train = 0.015625
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = BF16
llm_load_print_meta: model params     = 10.673 B
llm_load_print_meta: model size       = 19.884 GiB (16.004 BPW)
llm_load_print_meta: repeating layers = 19.150 GiB (16.004 BPW, 10.279 B parameters)
llm_load_print_meta: general.name     = GigaChat3 10B A1.8B Bf16
llm_load_print_meta: n_layer_dense_lead   = 1
llm_load_print_meta: n_lora_q             = 1536
llm_load_print_meta: n_lora_kv            = 512
llm_load_print_meta: n_ff_exp             = 1280
llm_load_print_meta: n_expert_shared      = 1
llm_load_print_meta: expert_weights_scale = 1.0
llm_load_print_meta: expert_weights_norm  = 1
llm_load_print_meta: expert_gating_func   = sigmoid
llm_load_print_meta: rope_yarn_log_mul    = 0.1000
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 127744
print_info: BOS token        = 1 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: LF token         = 201 'Ċ'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 226
llm_load_tensors: ggml ctx size =    0.17 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q_a_norm.weight' not found
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-BF16.gguf'
 ERR [              load_model] unable to load model | tid="131753511823616" timestamp=1763653420 model="/mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-BF16.gguf"

So maybe GigaChat3-10B-A1.8B is a lite version of the architecture? This patch gets it a little further, but it still crashes:

👈 Details
$ cd llama.cpp
$ git diff src/llama-model.cpp

diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index e703181a1..f3783f26c 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -1593,7 +1593,7 @@ void llama_model::load_hparams(llama_model_loader & ml) {
             } break;
         case LLM_ARCH_DEEPSEEK2:
             {
-                bool is_lite = (hparams.n_layer == 27);
+                bool is_lite = (hparams.n_layer == 27 || hparams.n_layer == 26); // 26 for GigaChat3-10B-A1.8B
                 ml.get_key(LLM_KV_ATTENTION_LAYERNORM_RMS_EPS, hparams.f_norm_rms_eps);
                 ml.get_key(LLM_KV_LEADING_DENSE_BLOCK_COUNT,   hparams.n_layer_dense_lead);
                 if (!is_lite) {
@@ -4581,7 +4581,7 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
                 } break;
             case LLM_ARCH_DEEPSEEK2:
                 {
-                    const bool is_lite = (hparams.n_layer == 27);
+                    const bool is_lite = (hparams.n_layer == 27 || hparams.n_layer == 26); // 26 for GigaChat3-10B-A1.8B

                     const bool is_mla = (hparams.n_embd_head_k_mla != 0 && hparams.n_embd_head_v_mla != 0);

This is output from debugger:

$ cd llama.cpp
$ cmake -B build -DCMAKE_BUILD_TYPE=Debug -DGGML_CUDA=0
$ cmake --build build --config Debug -j $(nproc)

$ gdb -q --args \
./build/bin/llama-server \
    --model "$model"\
    --alias ubergarm/GigaChat3-10B-A1.8B-GGUF \
    --ctx-size 32768 \
    --parallel 1 \
    --threads 96 \
    --threads-batch 128 \
    --numa numactl \
    --host 127.0.0.1 \
    --port 8080 \
    --jinja

llama_context: n_ctx_seq (32768) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     1.96 MiB
llama_kv_cache:        CPU KV buffer size =  1768.00 MiB
llama_kv_cache: size = 1768.00 MiB ( 32768 cells,  26 layers,  4/1 seqs), K (f16):  936.00 MiB, V (f16):  832.00 MiB

Thread 1 "llama-server" received signal SIGSEGV, Segmentation fault.
0x00007ffff754b6c9 in ggml_can_mul_mat (t0=0x0, t1=0x5555585c7700) at /home/w/projects/llama.cpp/ggml/src/ggml.c:3146
3146        return (t0->ne[0]           == t1->ne[0])  &&
(gdb) bt
#0  0x00007ffff754b6c9 in ggml_can_mul_mat (t0=0x0, t1=0x5555585c7700) at /home/w/projects/llama.cpp/ggml/src/ggml.c:3146
#1  0x00007ffff754b765 in ggml_mul_mat (ctx=0x555555edd630, a=0x0, b=0x5555585c7700) at /home/w/projects/llama.cpp/ggml/src/ggml.c:3155
#2  0x00007ffff7c39327 in llm_build_deepseek2::llm_build_deepseek2 (this=0x555555edd660, model=..., params=...)
    at /home/w/projects/llama.cpp/src/models/deepseek2.cpp:50
#3  0x00007ffff7b688f8 in std::make_unique<llm_build_deepseek2, llama_model const&, llm_graph_params const&> ()
    at /usr/include/c++/13/bits/unique_ptr.h:1070
#4  0x00007ffff7b59a6c in llama_model::build_graph (this=0x555555ca2f40, params=...) at /home/w/projects/llama.cpp/src/llama-model.cpp:7224
#5  0x00007ffff7a3942e in llama_context::graph_reserve (this=0x555555ccee70, n_tokens=1, n_seqs=1, n_outputs=1, mctx=0x555555ed74a0, split_only=true)
    at /home/w/projects/llama.cpp/src/llama-context.cpp:1427
#6  0x00007ffff7a33e42 in llama_context::llama_context (this=0x555555ccee70, model=..., params=...)
    at /home/w/projects/llama.cpp/src/llama-context.cpp:312
#7  0x00007ffff7a3daff in llama_init_from_model (model=0x555555ca2f40, params=...) at /home/w/projects/llama.cpp/src/llama-context.cpp:2381
#8  0x000055555588051a in common_init_from_params (params=...) at /home/w/projects/llama.cpp/common/common.cpp:967
#9  0x0000555555642b7f in server_context::load_model (this=0x7fffffffc5e0, params=...) at /home/w/projects/llama.cpp/tools/server/server.cpp:2392
#10 0x000055555560b3f5 in main (argc=20, argv=0x7fffffffe048) at /home/w/projects/llama.cpp/tools/server/server.cpp:5608

Okay, got it working. It is a lite version, and ik_llama.cpp / llama.cpp will need a patch, as they currently hardcode lite detection to exactly 27 layers while this model has only 26 layers...

diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index e703181a1..30902a59d 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -1593,7 +1593,7 @@ void llama_model::load_hparams(llama_model_loader & ml) {
             } break;
         case LLM_ARCH_DEEPSEEK2:
             {
-                bool is_lite = (hparams.n_layer == 27);
+                bool is_lite = (hparams.n_layer == 27 || hparams.n_layer == 26);
                 ml.get_key(LLM_KV_ATTENTION_LAYERNORM_RMS_EPS, hparams.f_norm_rms_eps);
                 ml.get_key(LLM_KV_LEADING_DENSE_BLOCK_COUNT,   hparams.n_layer_dense_lead);
                 if (!is_lite) {
@@ -4581,7 +4581,7 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
                 } break;
             case LLM_ARCH_DEEPSEEK2:
                 {
-                    const bool is_lite = (hparams.n_layer == 27);
+                    const bool is_lite = (hparams.n_layer == 27 || hparams.n_layer == 26);

                     const bool is_mla = (hparams.n_embd_head_k_mla != 0 && hparams.n_embd_head_v_mla != 0);

diff --git a/src/models/deepseek2.cpp b/src/models/deepseek2.cpp
index 68f72f72b..507926af5 100644
--- a/src/models/deepseek2.cpp
+++ b/src/models/deepseek2.cpp
@@ -4,7 +4,7 @@

 llm_build_deepseek2::llm_build_deepseek2(const llama_model & model, const llm_graph_params & params) :
     llm_graph_context(params) {
-    bool is_lite = (hparams.n_layer == 27);
+    bool is_lite = (hparams.n_layer == 27 || hparams.n_layer == 26);

     const bool is_mla = (hparams.n_embd_head_k_mla != 0 && hparams.n_embd_head_v_mla != 0);

Opened a PR here in case anyone else can test it: https://github.com/ggml-org/llama.cpp/pull/17420

ik_llama.cpp-specific quants are now available.


PR here: https://github.com/ikawrakow/ik_llama.cpp/pull/995

The Q6_K version still doesn't work in LMStudio, even though the Metal llama.cpp runtime was updated today.

> The Q6_K version still doesn't work in LMStudio, even though the Metal llama.cpp runtime was updated today.

I didn't release a Q6_K quant, so I'm not sure what you're asking about. Only the Q8_0 that I released will run on LMStudio/mainline llama.cpp, as the model card says right up front.

If you want to use these high-quality quantizations, you need to use something based on ik_llama.cpp, e.g.:

Cheers!

> An MLX version at 6 bits would be nice too. The local MLX converter also couldn't handle this model, just like the GGUF converter.

I'm trying to run it with mlx_lm:

from mlx_lm.utils import load, load_model, get_model_path, load_tokenizer

model, tokenizer = load(giga)  # giga: local path or repo id of the GigaChat3 model

and I get this error:

ValueError Traceback (most recent call last)
Cell In[7], line 5
4 try:
----> 5 model, tokenizer = load(giga)
6 except ValueError as e:

File ~/Documents/ML/gpt-oss/.venv-1/lib/python3.13/site-packages/mlx_lm/utils.py:265, in load(path_or_hf_repo, tokenizer_config, model_config, adapter_path, lazy)
263 model_path, _ = get_model_path(path_or_hf_repo)
--> 265 model, config = load_model(model_path, lazy, model_config=model_config)
266 if adapter_path is not None:

File ~/Documents/ML/gpt-oss/.venv-1/lib/python3.13/site-packages/mlx_lm/utils.py:226, in load_model(model_path, lazy, strict, model_config, get_model_classes)
224 _quantize(quantization)
--> 226 model.load_weights(list(weights.items()), strict=strict)
228 if not lazy:

File ~/Documents/ML/gpt-oss/.venv-1/lib/python3.13/site-packages/mlx/nn/layers/base.py:185, in Module.load_weights(self, file_or_weights, strict)
184 extras = ",\n".join(sorted(extras))
--> 185 raise ValueError(
186 f"Received {num_extra} parameters not in model: \n{extras}."
187 )
188 if missing := (curr_weights.keys() - new_weights.keys()):

ValueError: Received 210 parameters not in model:
model.layers.26.eh_proj.weight,
model.layers.26.embed_tokens.weight,
model.layers.26.enorm.weight,
model.layers.26.hnorm.weight,
model.layers.26.input_layernorm.weight,
model.layers.26.mlp.experts.0.down_proj.weight,
model.layers.26.mlp.experts.0.gate_proj.weight,
model.layers.26.mlp.experts.0.up_proj.weight,
model.layers.26.mlp.experts.1.down_proj.weight,
model.layers.26.mlp.experts.1.gate_proj.weight,
model.layers.26.mlp.experts.1.up_proj.weight,
model.layers.26.mlp.experts.10.down_proj.weight,
model.layers.26.mlp.experts.10.gate_proj.weight,
model.layers.26.mlp.experts.10.up_proj.weight,
model.layers.26.mlp.experts.11.down_proj.weight,
model.layers.26.mlp.experts.11.gate_proj.weight,
model.layers.26.mlp.experts.11.up_proj.weight,
model.layers.26.mlp.experts.12.down_proj.weight,
model.layers.26.mlp.experts.12.gate_proj.weight,
model.layers.26.mlp.experts.12.up_proj.weight,
model.layers.26.mlp.experts.13.down_proj.weight,
model.layers.26.mlp.experts.13.gate_proj.weight,
model.layers.26.mlp.experts.13.up_proj.weight,
model.layers.26.mlp.experts.14.down_proj.weight,
model.layers.26.mlp.experts.14.gate_proj.weight,
model.layers.26.mlp.experts.14.up_proj.weight,
model.layers.26.mlp.experts.15.down_proj.weight,
model.layers.26.mlp.experts.15.gate_proj.weight,
model.layers.26.mlp.experts.15.up_proj.weight,
model.layers.26.mlp.experts.16.down_proj.weight,
model.layers.26.mlp.experts.16.gate_proj.weight,
model.layers.26.mlp.experts.16.up_proj.weight,
model.layers.26.mlp.experts.17.down_proj.weight,
model.layers.26.mlp.experts.17.gate_proj.weight,
model.layers.26.mlp.experts.17.up_proj.weight,
model.layers.26.mlp.experts.18.down_proj.weight,
model.layers.26.mlp.experts.18.gate_proj.weight,
model.layers.26.mlp.experts.18.up_proj.weight,
model.layers.26.mlp.experts.19.down_proj.weight,
model.layers.26.mlp.experts.19.gate_proj.weight,
model.layers.26.mlp.experts.19.up_proj.weight,
model.layers.26.mlp.experts.2.down_proj.weight,
model.layers.26.mlp.experts.2.gate_proj.weight,
model.layers.26.mlp.experts.2.up_proj.weight,
model.layers.26.mlp.experts.20.down_proj.weight,
model.layers.26.mlp.experts.20.gate_proj.weight,
model.layers.26.mlp.experts.20.up_proj.weight,
model.layers.26.mlp.experts.21.down_proj.weight,
model.layers.26.mlp.experts.21.gate_proj.weight,
model.layers.26.mlp.experts.21.up_proj.weight,
model.layers.26.mlp.experts.22.down_proj.weight,
model.layers.26.mlp.experts.22.gate_proj.weight,
model.layers.26.mlp.experts.22.up_proj.weight,
model.layers.26.mlp.experts.23.down_proj.weight,
model.layers.26.mlp.experts.23.gate_proj.weight,
model.layers.26.mlp.experts.23.up_proj.weight,
model.layers.26.mlp.experts.24.down_proj.weight,
model.layers.26.mlp.experts.24.gate_proj.weight,
model.layers.26.mlp.experts.24.up_proj.weight,
model.layers.26.mlp.experts.25.down_proj.weight,
model.layers.26.mlp.experts.25.gate_proj.weight,
model.layers.26.mlp.experts.25.up_proj.weight,
model.layers.26.mlp.experts.26.down_proj.weight,
model.layers.26.mlp.experts.26.gate_proj.weight,
model.layers.26.mlp.experts.26.up_proj.weight,
model.layers.26.mlp.experts.27.down_proj.weight,
model.layers.26.mlp.experts.27.gate_proj.weight,
model.layers.26.mlp.experts.27.up_proj.weight,
model.layers.26.mlp.experts.28.down_proj.weight,
model.layers.26.mlp.experts.28.gate_proj.weight,
model.layers.26.mlp.experts.28.up_proj.weight,
model.layers.26.mlp.experts.29.down_proj.weight,
model.layers.26.mlp.experts.29.gate_proj.weight,
model.layers.26.mlp.experts.29.up_proj.weight,
model.layers.26.mlp.experts.3.down_proj.weight,
model.layers.26.mlp.experts.3.gate_proj.weight,
model.layers.26.mlp.experts.3.up_proj.weight,
model.layers.26.mlp.experts.30.down_proj.weight,
model.layers.26.mlp.experts.30.gate_proj.weight,
model.layers.26.mlp.experts.30.up_proj.weight,
model.layers.26.mlp.experts.31.down_proj.weight,
model.layers.26.mlp.experts.31.gate_proj.weight,
model.layers.26.mlp.experts.31.up_proj.weight,
model.layers.26.mlp.experts.32.down_proj.weight,
model.layers.26.mlp.experts.32.gate_proj.weight,
model.layers.26.mlp.experts.32.up_proj.weight,
model.layers.26.mlp.experts.33.down_proj.weight,
model.layers.26.mlp.experts.33.gate_proj.weight,
model.layers.26.mlp.experts.33.up_proj.weight,
model.layers.26.mlp.experts.34.down_proj.weight,
model.layers.26.mlp.experts.34.gate_proj.weight,
model.layers.26.mlp.experts.34.up_proj.weight,
model.layers.26.mlp.experts.35.down_proj.weight,
model.layers.26.mlp.experts.35.gate_proj.weight,
model.layers.26.mlp.experts.35.up_proj.weight,
model.layers.26.mlp.experts.36.down_proj.weight,
model.layers.26.mlp.experts.36.gate_proj.weight,
model.layers.26.mlp.experts.36.up_proj.weight,
model.layers.26.mlp.experts.37.down_proj.weight,
model.layers.26.mlp.experts.37.gate_proj.weight,
model.layers.26.mlp.experts.37.up_proj.weight,
model.layers.26.mlp.experts.38.down_proj.weight,
model.layers.26.mlp.experts.38.gate_proj.weight,
model.layers.26.mlp.experts.38.up_proj.weight,
model.layers.26.mlp.experts.39.down_proj.weight,
model.layers.26.mlp.experts.39.gate_proj.weight,
model.layers.26.mlp.experts.39.up_proj.weight,
model.layers.26.mlp.experts.4.down_proj.weight,
model.layers.26.mlp.experts.4.gate_proj.weight,
model.layers.26.mlp.experts.4.up_proj.weight,
model.layers.26.mlp.experts.40.down_proj.weight,
model.layers.26.mlp.experts.40.gate_proj.weight,
model.layers.26.mlp.experts.40.up_proj.weight,
model.layers.26.mlp.experts.41.down_proj.weight,
model.layers.26.mlp.experts.41.gate_proj.weight,
model.layers.26.mlp.experts.41.up_proj.weight,
model.layers.26.mlp.experts.42.down_proj.weight,
model.layers.26.mlp.experts.42.gate_proj.weight,
model.layers.26.mlp.experts.42.up_proj.weight,
model.layers.26.mlp.experts.43.down_proj.weight,
model.layers.26.mlp.experts.43.gate_proj.weight,
model.layers.26.mlp.experts.43.up_proj.weight,
model.layers.26.mlp.experts.44.down_proj.weight,
model.layers.26.mlp.experts.44.gate_proj.weight,
model.layers.26.mlp.experts.44.up_proj.weight,
model.layers.26.mlp.experts.45.down_proj.weight,
model.layers.26.mlp.experts.45.gate_proj.weight,
model.layers.26.mlp.experts.45.up_proj.weight,
model.layers.26.mlp.experts.46.down_proj.weight,
model.layers.26.mlp.experts.46.gate_proj.weight,
model.layers.26.mlp.experts.46.up_proj.weight,
model.layers.26.mlp.experts.47.down_proj.weight,
model.layers.26.mlp.experts.47.gate_proj.weight,
model.layers.26.mlp.experts.47.up_proj.weight,
model.layers.26.mlp.experts.48.down_proj.weight,
model.layers.26.mlp.experts.48.gate_proj.weight,
model.layers.26.mlp.experts.48.up_proj.weight,
model.layers.26.mlp.experts.49.down_proj.weight,
model.layers.26.mlp.experts.49.gate_proj.weight,
model.layers.26.mlp.experts.49.up_proj.weight,
model.layers.26.mlp.experts.5.down_proj.weight,
model.layers.26.mlp.experts.5.gate_proj.weight,
model.layers.26.mlp.experts.5.up_proj.weight,
model.layers.26.mlp.experts.50.down_proj.weight,
model.layers.26.mlp.experts.50.gate_proj.weight,
model.layers.26.mlp.experts.50.up_proj.weight,
model.layers.26.mlp.experts.51.down_proj.weight,
model.layers.26.mlp.experts.51.gate_proj.weight,
model.layers.26.mlp.experts.51.up_proj.weight,
model.layers.26.mlp.experts.52.down_proj.weight,
model.layers.26.mlp.experts.52.gate_proj.weight,
model.layers.26.mlp.experts.52.up_proj.weight,
model.layers.26.mlp.experts.53.down_proj.weight,
model.layers.26.mlp.experts.53.gate_proj.weight,
model.layers.26.mlp.experts.53.up_proj.weight,
model.layers.26.mlp.experts.54.down_proj.weight,
model.layers.26.mlp.experts.54.gate_proj.weight,
model.layers.26.mlp.experts.54.up_proj.weight,
model.layers.26.mlp.experts.55.down_proj.weight,
model.layers.26.mlp.experts.55.gate_proj.weight,
model.layers.26.mlp.experts.55.up_proj.weight,
model.layers.26.mlp.experts.56.down_proj.weight,
model.layers.26.mlp.experts.56.gate_proj.weight,
model.layers.26.mlp.experts.56.up_proj.weight,
model.layers.26.mlp.experts.57.down_proj.weight,
model.layers.26.mlp.experts.57.gate_proj.weight,
model.layers.26.mlp.experts.57.up_proj.weight,
model.layers.26.mlp.experts.58.down_proj.weight,
model.layers.26.mlp.experts.58.gate_proj.weight,
model.layers.26.mlp.experts.58.up_proj.weight,
model.layers.26.mlp.experts.59.down_proj.weight,
model.layers.26.mlp.experts.59.gate_proj.weight,
model.layers.26.mlp.experts.59.up_proj.weight,
model.layers.26.mlp.experts.6.down_proj.weight,
model.layers.26.mlp.experts.6.gate_proj.weight,
model.layers.26.mlp.experts.6.up_proj.weight,
model.layers.26.mlp.experts.60.down_proj.weight,
model.layers.26.mlp.experts.60.gate_proj.weight,
model.layers.26.mlp.experts.60.up_proj.weight,
model.layers.26.mlp.experts.61.down_proj.weight,
model.layers.26.mlp.experts.61.gate_proj.weight,
model.layers.26.mlp.experts.61.up_proj.weight,
model.layers.26.mlp.experts.62.down_proj.weight,
model.layers.26.mlp.experts.62.gate_proj.weight,
model.layers.26.mlp.experts.62.up_proj.weight,
model.layers.26.mlp.experts.63.down_proj.weight,
model.layers.26.mlp.experts.63.gate_proj.weight,
model.layers.26.mlp.experts.63.up_proj.weight,
model.layers.26.mlp.experts.7.down_proj.weight,
model.layers.26.mlp.experts.7.gate_proj.weight,
model.layers.26.mlp.experts.7.up_proj.weight,
model.layers.26.mlp.experts.8.down_proj.weight,
model.layers.26.mlp.experts.8.gate_proj.weight,
model.layers.26.mlp.experts.8.up_proj.weight,
model.layers.26.mlp.experts.9.down_proj.weight,
model.layers.26.mlp.experts.9.gate_proj.weight,
model.layers.26.mlp.experts.9.up_proj.weight,
model.layers.26.mlp.gate.e_score_correction_bias,
model.layers.26.mlp.gate.weight,
model.layers.26.mlp.shared_experts.down_proj.weight,
model.layers.26.mlp.shared_experts.gate_proj.weight,
model.layers.26.mlp.shared_experts.up_proj.weight,
model.layers.26.post_attention_layernorm.weight,
model.layers.26.self_attn.kv_a_layernorm.weight,
model.layers.26.self_attn.kv_a_proj_with_mqa.weight,
model.layers.26.self_attn.kv_b_proj.weight,
model.layers.26.self_attn.o_proj.weight,
model.layers.26.self_attn.q_proj.weight,
model.layers.26.shared_head.head.weight,
model.layers.26.shared_head.norm.weight.

During handling of the above exception, another exception occurred:

IndexError Traceback (most recent call last)
Cell In[7], line 10
8 # Try with custom model loading
9 model_path, _ = get_model_path(giga)
---> 10 model, config = load_model(model_path, lazy=True, strict=False)
11 tokenizer = load_tokenizer(model_path)
12 print('Model loaded successfully with strict=False')

File ~/Documents/ML/gpt-oss/.venv-1/lib/python3.13/site-packages/mlx_lm/utils.py:226, in load_model(model_path, lazy, strict, model_config, get_model_classes)
223 config["quantization_config"] = quantization
224 _quantize(quantization)
--> 226 model.load_weights(list(weights.items()), strict=strict)
228 if not lazy:
229 mx.eval(model.parameters())

File ~/Documents/ML/gpt-oss/.venv-1/lib/python3.13/site-packages/mlx/nn/layers/base.py:206, in Module.load_weights(self, file_or_weights, strict)
200 raise ValueError(
201 f"Expected shape {v.shape} but received "
202 f"shape {v_new.shape} for parameter {k}"
203 )
205 if len(weights) != 0:
--> 206 self.update(tree_unflatten(weights), strict=False)
207 return self

File ~/Documents/ML/gpt-oss/.venv-1/lib/python3.13/site-packages/mlx/nn/layers/base.py:356, in Module.update(self, parameters, strict)
353 elif strict:
354 raise ValueError(f"Received invalid type: {type(parameters).__name__}.")
--> 356 apply(self, parameters)
357 return self

File ~/Documents/ML/gpt-oss/.venv-1/lib/python3.13/site-packages/mlx/nn/layers/base.py:338, in Module.update.<locals>.apply(dst, parameters)
336 dst[k] = new_value
337 else:
--> 338 apply(current_value, new_value)
339 elif strict:
340 raise ValueError(f'Module does not have parameter named "{k}".')

File ~/Documents/ML/gpt-oss/.venv-1/lib/python3.13/site-packages/mlx/nn/layers/base.py:338, in Module.update.<locals>.apply(dst, parameters)
336 dst[k] = new_value
337 else:
--> 338 apply(current_value, new_value)
339 elif strict:
340 raise ValueError(f'Module does not have parameter named "{k}".')

File ~/Documents/ML/gpt-oss/.venv-1/lib/python3.13/site-packages/mlx/nn/layers/base.py:343, in Module.update.<locals>.apply(dst, parameters)
341 elif isinstance(parameters, list):
342 for i in range(len(parameters)):
--> 343 current_value = dst[i]
344 new_value = parameters[i]
345 if isinstance(current_value, mx.array):

IndexError: list index out of range
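
All 210 unexpected tensors live under model.layers.26.* (eh_proj, enorm, hnorm, shared_head, plus a full expert block), i.e. an extra block on top of the 26 regular layers that the current mlx_lm implementation has no module for; it looks like a multi-token-prediction head. A minimal diagnostic sketch to confirm that, assuming the safetensors shards are local (the path is a placeholder):

import glob
import mlx.core as mx

# Collect every tensor name that belongs to the extra block 26 rather than
# the 26 regular transformer blocks (layers 0..25).
model_dir = "/path/to/GigaChat3-10B-A1.8B-bf16"
extra = []
for shard in sorted(glob.glob(f"{model_dir}/*.safetensors")):
    for name in mx.load(shard):
        if name.startswith("model.layers.26."):
            extra.append(name)

print(len(extra), "tensors in the extra block 26")  # 210, matching the error above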
