Image-Text-to-Text
Transformers
TensorBoard
Safetensors
llavaonevision1_5
text-generation
conversational
Jinghao-Guo committed on
Commit
cb70e64
·
verified ·
1 Parent(s): 258862c

Transfer model via script

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,290 @@
1
+ ---
2
+ base_model:
3
+ - DeepGlint-AI/rice-vit-large-patch14-560
4
+ - Qwen/Qwen3-4B-Instruct-2507
5
+ datasets:
6
+ - lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M
7
+ - lmms-lab/LLaVA-OneVision-1.5-Insturct-Data
8
+ - HuggingFaceM4/FineVision
9
+ library_name: transformers
10
+ license: apache-2.0
11
+ pipeline_tag: image-text-to-text
12
+ ---
13
+
14
+ <div align="center">
15
+
16
+ <h1>LLaVA-OneVision-1.5: Fully Open-Source State-of-the-Art Vision-Language Model</h1>
17
+
18
+
19
+ <p>
20
+ <a href="https://huggingface.co/papers/2509.23661">
21
+ <img alt="Paper" src="https://img.shields.io/badge/Paper-b31b1b?style=for-the-badge&logo=arXiv&logoColor=white">
22
+ </a>
23
+ <a href="https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5">
24
+ <img alt="Code" src="https://img.shields.io/badge/Code-181717?style=for-the-badge&logo=github&logoColor=white">
25
+ </a>
26
+ <a href="https://huggingface.co/datasets/mvp-lab/LLaVA-OneVision-1.5-Mid-Training-85M">
27
+ <img alt="Mid-Training Dataset" src="https://img.shields.io/badge/Mid--Training%20Dataset-f59e0b?style=for-the-badge&logo=huggingface&logoColor=white">
28
+ </a>
29
+ <a href="https://huggingface.co/datasets/mvp-lab/LLaVA-OneVision-1.5-Instruct-Data">
30
+ <img alt="Instruct Dataset" src="https://img.shields.io/badge/Instruct%20Dataset-3fb950?style=for-the-badge&logo=huggingface&logoColor=white">
31
+ </a>
32
+ <a href="https://huggingface.co/spaces/lmms-lab/LLaVA-OneVision-1.5">
33
+ <img alt="Demo" src="https://img.shields.io/badge/Demo-1f6feb?style=for-the-badge&logo=huggingface&logoColor=white">
34
+ </a>
35
+ <a href="https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-4B-Instruct/tensorboard">
36
+ <img alt="TensorBoard" src="https://img.shields.io/badge/TensorBoard-FF6F00?style=for-the-badge&logo=tensorflow&logoColor=white">
37
+ </a>
38
+ </p>
39
+
40
+ </div>
41
+
42
+
43
+
44
+ ## Introduction
45
+
46
+ LLaVA-OneVision-1.5 is a fully open-source family of large multimodal models (LMMs) built to democratize multimodal training. Trained on native‑resolution images, it delivers state‑of‑the‑art performance at substantially lower cost. The project also releases high‑quality pretraining and SFT data, a complete and efficient training framework with recipes and configs, and comprehensive logs to support transparent, reproducible research.
47
+ #### **Superior Performance**
48
+ - The model leads on multiple multimodal benchmarks and generally surpasses Qwen2.5-VL.
49
+ - Training on native-resolution images significantly improves its visual understanding.
50
+
51
+ #### **High-Quality Data at Scale**
52
+ - The pretraining corpus comprises large-scale, concept-balanced, diverse, and high-quality captions curated with strict filtering and quality control.
53
+ - The instruction-tuning dataset is comprehensive and covers a wide range of tasks.
54
+
55
+ #### **Ultra-Efficient Training Framework**
56
+ - The end-to-end training cost is about $16,000 on A100 GPUs at roughly $0.60 per GPU-hour.
57
+ - The system is built on Megatron-LM with support for MoE, FP8, and long-sequence parallelism, and the codebase is optimized for cost-effective scaling.
58
+
59
+ #### **Fully Open Framework**
60
+ - The project releases high-quality pretraining and SFT datasets along with the complete training framework, configurations, and recipes.
61
+ - It also provides detailed training logs and metrics to enable reproducibility and community adoption.
62
+
63
+
64
+ ## Models
65
+
66
+ | Model | HF Link | Training Log |
67
+ |---|---|---|
68
+ | LLaVA-OV-1.5-4B-Instruct | [🤗 HF / 4B-Instruct](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-4B-Instruct) | [📈 Tensorboard](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-4B-Instruct/tensorboard) |
69
+ | LLaVA-OV-1.5-8B-Instruct | [🤗 HF / 8B-Instruct](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-8B-Instruct) | [📈 Tensorboard](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-8B-Instruct/tensorboard) |
70
+
71
+ ## Dataset
72
+
73
+ | Description | Link | Status |
74
+ |--------------------|--------------------------------------------------------------------------------------------------------|-------------|
75
+ | LLaVA-OneVision-1.5-Mid-Training-85M | [🤗HF / Mid-Training 85M](https://huggingface.co/datasets/mvp-lab/LLaVA-OneVision-1.5-Mid-Training-85M) | Uploading… |
76
+ | LLaVA-OneVision-1.5-Instruct | [🤗HF / Instruct-Data](https://huggingface.co/datasets/mvp-lab/LLaVA-OneVision-1.5-Instruct-Data) | Available |
77
+
78
+ ## Evaluation Results
79
+ All evaluations were conducted using [lmms_eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).
80
+
81
+ ![image](https://cdn-uploads.huggingface.co/production/uploads/655c70d331c4978366d4b2e6/J8oBYmQkTOC6pBNLgJn9d.png)
82
+
83
+ ## Quick Start with HuggingFace
84
+
85
+ The following code snippet shows how to use the chat model with `transformers` and `qwen_vl_utils`:
86
+
87
+ ```python
88
+ from transformers import AutoTokenizer, AutoProcessor, AutoModelForCausalLM
89
+ from qwen_vl_utils import process_vision_info
90
+ model_path = "lmms-lab/LLaVA-OneVision-1.5-8B-Instruct"
91
+
92
+ # default: Load the model on the available device(s)
93
+ model = AutoModelForCausalLM.from_pretrained(
94
+ model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True
95
+ )
96
+
97
+ # default processor
98
+ processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
99
+
100
+ messages = [
101
+ {
102
+ "role": "user",
103
+ "content": [
104
+ {
105
+ "type": "image",
106
+ "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
107
+ },
108
+ {"type": "text", "text": "Describe this image."},
109
+ ],
110
+ }
111
+ ]
112
+
113
+ # Preparation for inference
114
+ text = processor.apply_chat_template(
115
+ messages, tokenize=False, add_generation_prompt=True
116
+ )
117
+ image_inputs, video_inputs = process_vision_info(messages)
118
+ inputs = processor(
119
+ text=[text],
120
+ images=image_inputs,
121
+ videos=video_inputs,
122
+ padding=True,
123
+ return_tensors="pt",
124
+ )
125
+ inputs = inputs.to("cuda")
126
+
127
+ # Inference: Generation of the output
128
+ generated_ids = model.generate(**inputs, max_new_tokens=1024)
129
+ generated_ids_trimmed = [
130
+ out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
131
+ ]
132
+ output_text = processor.batch_decode(
133
+ generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
134
+ )
135
+ print(output_text)
136
+ ```
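The same interface extends to multi-image conversations. The sketch below is a minimal illustration that reuses `model`, `processor`, and `process_vision_info` from the snippet above; the duplicated image URL and the prompt are placeholders, not an official example.

```python
# Minimal multi-image sketch (reuses model/processor from the snippet above).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "Describe each image in one sentence."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True))
```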
137
+
138
+ ## Evaluation
139
+ ```
140
+ # pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
141
+
142
+ accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
143
+ --model=llava_onevision1_5 \
144
+ --model_args=pretrained=lmms-lab/LLaVA-OneVision-1.5-8B-Instruct,attn_implementation=flash_attention_2,max_pixels=3240000 \
145
+ --tasks=mmmu_val,mmmu_pro_standard,mmbench_en_test,mmerealworld,mmerealworld_cn,ai2d,ai2d_no_mask,vstar_bench,chartqa,charxiv,docvqa_test,mathvista_testmini,mmstar,scienceqa \
146
+ --batch_size=1
147
+ ```
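For a quick sanity check before launching the full benchmark suite, a single-task run with the same flags should work; the process count and task choice below are illustrative, so adjust them to your setup.

```
accelerate launch --num_processes=1 --main_process_port 12399 -m lmms_eval \
    --model=llava_onevision1_5 \
    --model_args=pretrained=lmms-lab/LLaVA-OneVision-1.5-8B-Instruct,attn_implementation=flash_attention_2,max_pixels=3240000 \
    --tasks=mmmu_val \
    --batch_size=1
```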
148
+
149
+
150
+
151
+ ### Mid-Training
152
+
153
+ To improve model training efficiency, we implement offline sample packing:
154
+
155
+ 1. Download the [**Mid-Training-85M Dataset**](https://huggingface.co/datasets/lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M)
156
+ 2. Pack the data into the webdataset format; refer to [**Examples: offline packing**](examples_offline_packing) and [**Offline Padding-Free Data Packing**](examples/llava_ov_1_5/sample_packing/README.md). A minimal sharding sketch follows below.
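The sketch below shows only the webdataset sharding step, assuming `pip install webdataset` and a hypothetical `iter_samples()` generator; the repository's actual padding-free packing (grouping samples up to a fixed sequence length) is implemented in the linked scripts.

```python
# Minimal webdataset sharding sketch; iter_samples() is a placeholder that
# should yield (key, jpeg_bytes, caption) tuples from the downloaded data.
import webdataset as wds

with wds.ShardWriter("packed/shard-%06d.tar", maxcount=10000) as sink:
    for key, jpeg_bytes, caption in iter_samples():
        sink.write({
            "__key__": key,       # unique sample id
            "jpg": jpeg_bytes,    # raw image bytes
            "txt": caption,       # paired caption text
        })
```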
157
+
158
+
159
+ ### Instruct
160
+ 1. Download the [**LLaVA-OneVision-1.5-Instruct-Data**](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-1.5-Insturct-Data)
161
+ 2. Convert the data into the webdataset format; refer to [**Conversion for Mixed Instruction Data**](docs/sft_data_preprocessing.md). A sketch of reading the resulting shards follows below.
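Once converted, the shards can be read back with the standard `webdataset` loading pattern; the sketch below uses placeholder shard paths and the field names from the packing sketch above, not the repository's exact schema.

```python
# Minimal webdataset reading sketch (paths and field names are assumptions).
import webdataset as wds

dataset = (
    wds.WebDataset("packed/shard-{000000..000009}.tar")
    .decode("pil")            # decode "jpg" entries into PIL images
    .to_tuple("jpg", "txt")   # yield (image, caption) pairs
)

for image, caption in dataset:
    print(image.size, caption[:80])
    break
```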
162
+
163
+ ## Roadmaps
164
+
165
+ Q4 2025 Key Deliverables:
166
+
167
+ 1. **Ultra-efficient MoE Training**
168
+ 2. **Full Video Input LLM**
169
+
170
+
171
+ ## Contributors
172
+ Thanks so much to all of our amazing contributors!
173
+
174
+ <!-- readme: collaborators,contributors,jiankangdeng/- -start -->
175
+ <table>
176
+ <tbody>
177
+ <tr>
178
+ <td align="center">
179
+ <a href="https://github.com/fdcp">
180
+ <img src="https://avatars.githubusercontent.com/u/15667917?v=4" width="80;" alt="fdcp"/>
181
+ <br />
182
+ <sub><b>fdcp</b></sub>
183
+ </a>
184
+ </td>
185
+ <td align="center">
186
+ <a href="https://github.com/anxiangsir">
187
+ <img src="https://avatars.githubusercontent.com/u/31175974?v=4" width="80;" alt="anxiangsir"/>
188
+ <br />
189
+ <sub><b>anxiangsir</b></sub>
190
+ </a>
191
+ </td>
192
+ <td align="center">
193
+ <a href="https://github.com/yiyexy">
194
+ <img src="https://avatars.githubusercontent.com/u/35927125?v=4" width="80;" alt="yiyexy"/>
195
+ <br />
196
+ <sub><b>yiyexy</b></sub>
197
+ </a>
198
+ </td>
199
+ <td align="center">
200
+ <a href="https://github.com/wideyard">
201
+ <img src="https://avatars.githubusercontent.com/u/101321826?v=4" width="80;" alt="wideyard"/>
202
+ <br />
203
+ <sub><b>wideyard</b></sub>
204
+ </a>
205
+ </td>
206
+ <td align="center">
207
+ <a href="https://github.com/chengzheng345">
208
+ <img src="https://avatars.githubusercontent.com/u/209475443?v=4" width="80;" alt="chengzheng345"/>
209
+ <br />
210
+ <sub><b>chengzheng345</b></sub>
211
+ </a>
212
+ </td>
213
+ <td align="center">
214
+ <a href="https://github.com/killTheHostage">
215
+ <img src="https://avatars.githubusercontent.com/u/16442720?v=4" width="80;" alt="killTheHostage"/>
216
+ <br />
217
+ <sub><b>killTheHostage</b></sub>
218
+ </a>
219
+ </td>
220
+ <td align="center">
221
+ <a href="https://github.com/mathCrazyy">
222
+ <img src="https://avatars.githubusercontent.com/u/20607153?v=4" width="80;" alt="mathCrazyy"/>
223
+ <br />
224
+ <sub><b>mathCrazyy</b></sub>
225
+ </a>
226
+ </td>
227
+ <td align="center">
228
+ <a href="https://github.com/yunglechao">
229
+ <img src="https://avatars.githubusercontent.com/u/7631185?v=4" width="80;" alt="yunglechao"/>
230
+ <br />
231
+ <sub><b>yunglechao</b></sub>
232
+ </a>
233
+ </td>
234
+ </tr>
235
+ <tr>
236
+ <td align="center">
237
+ <a href="https://github.com/RobitYadda">
238
+ <img src="https://avatars.githubusercontent.com/u/6811311?v=4" width="80;" alt="RobitYadda"/>
239
+ <br />
240
+ <sub><b>RobitYadda</b></sub>
241
+ </a>
242
+ </td>
243
+ </tr>
244
+ </tbody>
245
+ </table>
246
+ <!-- readme: collaborators,contributors,jiankangdeng/- -end -->
247
+
248
+ ## Citation
249
+
250
+ If you find *LLaVA-OneVision-1.5* useful in your research, please consider citing the following related papers:
251
+
252
+ ```
253
+ @inproceedings{LLaVA-OneVision-1.5,
254
+ title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training},
255
+ author={An, Xiang and Xie, Yin and Yang, Kaicheng and Zhang, Wenkang and Zhao, Xiuwei and Cheng, Zheng and Wang, Yirui and Xu, Songcen and Chen, Changrui and Wu, Chunsheng and Tan, Huajie and Li, Chunyuan and Yang, Jing and Yu, Jie and Wang, Xiyao and Qin, Bin and Wang, Yumeng and Yan, Zizhen and Feng, Ziyong and Liu, Ziwei and Li, Bo and Deng, Jiankang},
256
+ booktitle={arxiv},
257
+ year={2025}
258
+ }
259
+
260
+ @inproceedings{xie2025region,
261
+ title={Region-based Cluster Discrimination for Visual Representation Learning},
262
+ author={Xie, Yin and Yang, Kaicheng and An, Xiang and Wu, Kun and Zhao, Yongle and Deng, Weimo and Ran, Zimin and Wang, Yumeng and Feng, Ziyong and Miles, Roy and Elezi, Ismail and Deng, Jiankang},
263
+ booktitle={ICCV},
264
+ year={2025}
265
+ }
266
+
267
+ @article{lillava,
268
+ title={LLaVA-OneVision: Easy Visual Task Transfer},
269
+ author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Zhang, Peiyuan and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
270
+ journal={Transactions on Machine Learning Research},
271
+ year={2024}
272
+ }
273
+ ```
274
+
275
+ ## Acknowledgement
276
+
277
+ We extend our sincere gratitude to the **AIAK team** of the [**Baige AI computing platform**](https://cloud.baidu.com/product/aihc.html) from **Baidu AI Cloud** for providing an exceptional training framework. The outstanding capabilities of AIAK-Training-LLM and AIAK-Megatron significantly accelerated our training with remarkable efficiency. For full AIAK support, please contact Baidu Cloud.
278
+
279
+
280
+ We also thank the maintainers and contributors of the following open-source projects, whose work greatly inspired and supported our research:
281
+
282
+ - LLaVA: Large Language-and-Vision Assistant — [LLaVA](https://github.com/haotian-liu/LLaVA)
283
+ - LLaVA-NeXT: Next-generation multi-modal assistant — [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT)
284
+ - lmms-eval: A standardized evaluation framework for Large Multimodal Models — [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval)
285
+ - Megatron-LM: Efficient, scalable training for large language models — [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
286
+ - Qwen2.5-VL: Strong vision-language foundation model — [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL)
287
+ - InternVL: Open-source large-scale vision-language foundation model — [InternVL](https://github.com/OpenGVLab/InternVL)
288
+ - Qwen3: Next-generation Qwen LLM — [Qwen](https://github.com/QwenLM/Qwen)
289
+ - MetaCLIP: Scalable contrastive pretraining — [MetaCLIP](https://github.com/facebookresearch/MetaCLIP)
290
+ - FineVision: Open Data Is All You Need — [FineVision](https://huggingface.co/spaces/HuggingFaceM4/FineVision)
added_tokens.json ADDED
@@ -0,0 +1,24 @@
1
+ {
2
+ "</tool_call>": 151658,
3
+ "<tool_call>": 151657,
4
+ "<|box_end|>": 151649,
5
+ "<|box_start|>": 151648,
6
+ "<|endoftext|>": 151643,
7
+ "<|file_sep|>": 151664,
8
+ "<|fim_middle|>": 151660,
9
+ "<|fim_pad|>": 151662,
10
+ "<|fim_prefix|>": 151659,
11
+ "<|fim_suffix|>": 151661,
12
+ "<|im_end|>": 151645,
13
+ "<|im_start|>": 151644,
14
+ "<|image_pad|>": 151655,
15
+ "<|object_ref_end|>": 151647,
16
+ "<|object_ref_start|>": 151646,
17
+ "<|quad_end|>": 151651,
18
+ "<|quad_start|>": 151650,
19
+ "<|repo_name|>": 151663,
20
+ "<|video_pad|>": 151656,
21
+ "<|vision_end|>": 151653,
22
+ "<|vision_pad|>": 151654,
23
+ "<|vision_start|>": 151652
24
+ }
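These entries extend the base vocabulary with vision and tool-use markers; for example, `<|image_pad|>` maps to 151655, matching `image_token_id` in `config.json`. A minimal sketch for inspecting them (the repo id below is an assumption; substitute the id of this repository):

```python
# Sketch: look up the added special-token ids with the tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "lmms-lab/LLaVA-OneVision-1.5-4B-Instruct", trust_remote_code=True
)
for token in ["<|image_pad|>", "<|video_pad|>", "<|vision_start|>", "<|vision_end|>", "<|im_end|>"]:
    print(token, tokenizer.convert_tokens_to_ids(token))
```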
chat_template.jinja ADDED
@@ -0,0 +1,7 @@
1
+ {% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system
2
+ You are a helpful assistant.<|im_end|>
3
+ {% endif %}<|im_start|>{{ message['role'] }}
4
+ {% if message['content'] is string %}{{ message['content'] }}<|im_end|>
5
+ {% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>
6
+ {% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant
7
+ {% endif %}
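The template wraps each turn in `<|im_start|> … <|im_end|>`, inserts a default system prompt when none is provided, and replaces image/video entries with `<|vision_start|><|image_pad|><|vision_end|>` or `<|vision_start|><|video_pad|><|vision_end|>` markers. A minimal rendering sketch, reusing the `processor` from the Quick Start (the commented output is indicative, not verbatim):

```python
# Sketch: render the chat template to text without tokenizing.
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": "demo.jpeg"},
        {"type": "text", "text": "Describe this image."},
    ]}
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# Roughly:
# <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# <|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>
# <|im_start|>assistant
```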
config.json ADDED
@@ -0,0 +1,89 @@
1
+ {
2
+ "architectures": [
3
+ "LLaVAOneVision1_5_ForConditionalGeneration"
4
+ ],
5
+ "image_token_id": 151655,
6
+ "model_type": "llavaonevision1_5",
7
+ "text_config": {
8
+ "attention_bias": false,
9
+ "attention_dropout": 0.0,
10
+ "head_dim": 128,
11
+ "hidden_act": "silu",
12
+ "hidden_size": 2560,
13
+ "image_token_id": null,
14
+ "initializer_range": 0.02,
15
+ "intermediate_size": 9728,
16
+ "layer_types": [
17
+ "full_attention",
18
+ "full_attention",
19
+ "full_attention",
20
+ "full_attention",
21
+ "full_attention",
22
+ "full_attention",
23
+ "full_attention",
24
+ "full_attention",
25
+ "full_attention",
26
+ "full_attention",
27
+ "full_attention",
28
+ "full_attention",
29
+ "full_attention",
30
+ "full_attention",
31
+ "full_attention",
32
+ "full_attention",
33
+ "full_attention",
34
+ "full_attention",
35
+ "full_attention",
36
+ "full_attention",
37
+ "full_attention",
38
+ "full_attention",
39
+ "full_attention",
40
+ "full_attention",
41
+ "full_attention",
42
+ "full_attention",
43
+ "full_attention",
44
+ "full_attention",
45
+ "full_attention",
46
+ "full_attention",
47
+ "full_attention",
48
+ "full_attention",
49
+ "full_attention",
50
+ "full_attention",
51
+ "full_attention",
52
+ "full_attention"
53
+ ],
54
+ "max_position_embeddings": 262144,
55
+ "max_window_layers": 36,
56
+ "model_type": "LLaVAOneVision1_5_text",
57
+ "num_attention_heads": 32,
58
+ "num_hidden_layers": 36,
59
+ "num_key_value_heads": 8,
60
+ "rms_norm_eps": 1e-06,
61
+ "rope_scaling": null,
62
+ "rope_theta": 5000000.0,
63
+ "sliding_window": null,
64
+ "use_cache": true,
65
+ "use_sliding_window": false,
66
+ "video_token_id": null,
67
+ "vocab_size": 151936
68
+ },
69
+ "torch_dtype": "bfloat16",
70
+ "transformers_version": "4.53.0",
71
+ "video_token_id": 151656,
72
+ "vision_config": {
73
+ "depth": 24,
74
+ "embed_dim": 1024,
75
+ "hidden_act": "gelu",
76
+ "hidden_size": 1024,
77
+ "in_channels": 3,
78
+ "initializer_range": 0.02,
79
+ "intermediate_size": 4096,
80
+ "layer_norm_eps": 1e-05,
81
+ "model_type": "rice_vit",
82
+ "num_heads": 16,
83
+ "patch_size": 14,
84
+ "spatial_merge_size": 2,
85
+ "temporal_patch_size": 1,
86
+ "text_hidden_size": 2560
87
+ },
88
+ "vocab_size": 151936
89
+ }
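In short, the checkpoint pairs a 36-layer, hidden-size-2560 Qwen3-style decoder with a 24-layer RICE ViT (patch size 14, 2x2 spatial merge) projected into the text hidden size. A sketch for inspecting the nested configs, mirroring the `trust_remote_code` loading used in the Quick Start (the repo id is an assumption):

```python
# Sketch: load and inspect the nested configuration.
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "lmms-lab/LLaVA-OneVision-1.5-4B-Instruct", trust_remote_code=True
)
print(config.model_type)                                                       # llavaonevision1_5
print(config.text_config.hidden_size, config.text_config.num_hidden_layers)    # 2560 36
print(config.vision_config.patch_size, config.vision_config.spatial_merge_size)  # 14 2
```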
configuration_llavaonevision1_5.py ADDED
@@ -0,0 +1,288 @@
1
+ # coding=utf-8
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ from transformers.configuration_utils import PretrainedConfig, layer_type_validation
16
+ from transformers.modeling_rope_utils import rope_config_validation
17
+ from transformers.utils import logging
18
+
19
+ logger = logging.get_logger(__name__)
20
+
21
+ class RiceConfig(PretrainedConfig):
22
+ model_type = "rice_vit"
23
+ base_config_key = "vision_config"
24
+
25
+ def __init__(
26
+ self,
27
+ depth=24,
28
+ embed_dim=1024,
29
+ hidden_size=1024,
30
+ hidden_act="gelu",
31
+ intermediate_size=4096,
32
+ num_heads=16,
33
+ in_channels=3,
34
+ patch_size=14,
35
+ spatial_merge_size=2,
36
+ temporal_patch_size=1,
37
+ initializer_range=0.02,
38
+ layer_norm_eps=1e-05,
39
+ text_hidden_size=2560,
40
+ **kwargs,
41
+ ):
42
+ super().__init__(**kwargs)
43
+
44
+ self.depth = depth
45
+ self.embed_dim = embed_dim
46
+ self.hidden_size = hidden_size
47
+ self.hidden_act = hidden_act
48
+ self.intermediate_size = intermediate_size
49
+ self.num_heads = num_heads
50
+ self.in_channels = in_channels
51
+ self.patch_size = patch_size
52
+ self.spatial_merge_size = spatial_merge_size
53
+ self.temporal_patch_size = temporal_patch_size
54
+ self.initializer_range = initializer_range
55
+ self.layer_norm_eps = layer_norm_eps
56
+ self.text_hidden_size = text_hidden_size
57
+
58
+
59
+ class LLaVAOneVision1_5_TextConfig(PretrainedConfig):
60
+ r"""
61
+ Args:
62
+ vocab_size (`int`, *optional*, defaults to 152064):
63
+ Vocabulary size of the Qwen2VL model. Defines the number of different tokens that can be represented by the
64
+ `inputs_ids` passed when calling [`Qwen2VLModel`]
65
+ hidden_size (`int`, *optional*, defaults to 8192):
66
+ Dimension of the hidden representations.
67
+ intermediate_size (`int`, *optional*, defaults to 29568):
68
+ Dimension of the MLP representations.
69
+ num_hidden_layers (`int`, *optional*, defaults to 80):
70
+ Number of hidden layers in the Transformer encoder.
71
+ num_attention_heads (`int`, *optional*, defaults to 64):
72
+ Number of attention heads for each attention layer in the Transformer encoder.
73
+ num_key_value_heads (`int`, *optional*, defaults to 8):
74
+ This is the number of key_value heads that should be used to implement Grouped Query Attention. If
75
+ `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
76
+ `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
77
+ converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
78
+ by meanpooling all the original heads within that group. For more details checkout [this
79
+ paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to `32`.
80
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
81
+ The non-linear activation function (function or string) in the decoder.
82
+ max_position_embeddings (`int`, *optional*, defaults to 32768):
83
+ The maximum sequence length that this model might ever be used with.
84
+ initializer_range (`float`, *optional*, defaults to 0.02):
85
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
86
+ rms_norm_eps (`float`, *optional*, defaults to 1e-05):
87
+ The epsilon used by the rms normalization layers.
88
+ use_cache (`bool`, *optional*, defaults to `True`):
89
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
90
+ relevant if `config.is_decoder=True`.
91
+ tie_word_embeddings (`bool`, *optional*, defaults to `False`):
92
+ Whether the model's input and output word embeddings should be tied.
93
+ rope_theta (`float`, *optional*, defaults to 1000000.0):
94
+ The base period of the RoPE embeddings.
95
+ use_sliding_window (`bool`, *optional*, defaults to `False`):
96
+ Whether to use sliding window attention.
97
+ sliding_window (`int`, *optional*, defaults to 4096):
98
+ Sliding window attention (SWA) window size. If not specified, will default to `4096`.
99
+ max_window_layers (`int`, *optional*, defaults to 80):
100
+ The number of layers that use SWA (Sliding Window Attention). The bottom layers use SWA while the top use full attention.
101
+ attention_dropout (`float`, *optional*, defaults to 0.0):
102
+ The dropout ratio for the attention probabilities.
103
+ rope_scaling (`Dict`, *optional*):
104
+ Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type
105
+ and you expect the model to work on longer `max_position_embeddings`, we recommend you to update this value
106
+ accordingly.
107
+ Expected contents:
108
+ `rope_type` (`str`):
109
+ The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope',
110
+ 'llama3'], with 'default' being the original RoPE implementation.
111
+ `factor` (`float`, *optional*):
112
+ Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In
113
+ most scaling types, a `factor` of x will enable the model to handle sequences of length x *
114
+ original maximum pre-trained length.
115
+ `original_max_position_embeddings` (`int`, *optional*):
116
+ Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during
117
+ pretraining.
118
+ `attention_factor` (`float`, *optional*):
119
+ Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention
120
+ computation. If unspecified, it defaults to value recommended by the implementation, using the
121
+ `factor` field to infer the suggested value.
122
+ `beta_fast` (`float`, *optional*):
123
+ Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear
124
+ ramp function. If unspecified, it defaults to 32.
125
+ `beta_slow` (`float`, *optional*):
126
+ Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear
127
+ ramp function. If unspecified, it defaults to 1.
128
+ `short_factor` (`List[float]`, *optional*):
129
+ Only used with 'longrope'. The scaling factor to be applied to short contexts (<
130
+ `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
131
+ size divided by the number of attention heads divided by 2
132
+ `long_factor` (`List[float]`, *optional*):
133
+ Only used with 'longrope'. The scaling factor to be applied to long contexts (<
134
+ `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
135
+ size divided by the number of attention heads divided by 2
136
+ `low_freq_factor` (`float`, *optional*):
137
+ Only used with 'llama3'. Scaling factor applied to low frequency components of the RoPE
138
+ `high_freq_factor` (`float`, *optional*):
139
+ Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE
140
+ image_token_id (`int`, *optional*):
141
+ Token index used as placeholder for image embeddings.
142
+ video_token_id (`int`, *optional*):
143
+ Token index used as placeholder for video embeddings.
144
+
145
+ """
146
+
147
+ model_type = "LLaVAOneVision1_5_text"
148
+ base_config_key = "text_config"
149
+ keys_to_ignore_at_inference = ["past_key_values"]
150
+ # Default tensor parallel plan for base model `Qwen2VL`
151
+ base_model_tp_plan = {
152
+ "layers.*.self_attn.q_proj": "colwise",
153
+ "layers.*.self_attn.k_proj": "colwise",
154
+ "layers.*.self_attn.v_proj": "colwise",
155
+ "layers.*.self_attn.o_proj": "rowwise",
156
+ "layers.*.mlp.gate_proj": "colwise",
157
+ "layers.*.mlp.up_proj": "colwise",
158
+ "layers.*.mlp.down_proj": "rowwise",
159
+ }
160
+ base_model_pp_plan = {
161
+ "embed_tokens": (["input_ids"], ["inputs_embeds"]),
162
+ "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
163
+ "norm": (["hidden_states"], ["hidden_states"]),
164
+ }
165
+
166
+ def __init__(
167
+ self,
168
+ vocab_size=151936,
169
+ hidden_size=4096,
170
+ intermediate_size=12288,
171
+ num_hidden_layers=36,
172
+ num_attention_heads=32,
173
+ num_key_value_heads=8,
174
+ head_dim=128,
175
+ hidden_act="silu",
176
+ max_position_embeddings=32768,
177
+ initializer_range=0.02,
178
+ rms_norm_eps=1e-06,
179
+ use_cache=True,
180
+ tie_word_embeddings=False,
181
+ rope_theta=1000000.0,
182
+ attention_bias=False,
183
+ use_sliding_window=False,
184
+ sliding_window=None,
185
+ max_window_layers=36,
186
+ attention_dropout=0.0,
187
+ rope_scaling=None,
188
+ layer_types=None,
189
+ image_token_id=None,
190
+ video_token_id=None,
191
+ **kwargs,
192
+ ):
193
+ self.vocab_size = vocab_size
194
+ self.max_position_embeddings = max_position_embeddings
195
+ self.hidden_size = hidden_size
196
+ self.intermediate_size = intermediate_size
197
+ self.num_hidden_layers = num_hidden_layers
198
+ self.num_attention_heads = num_attention_heads
199
+ self.use_sliding_window = use_sliding_window
200
+ self.sliding_window = sliding_window
201
+ self.max_window_layers = max_window_layers
202
+
203
+ # for backward compatibility
204
+ if num_key_value_heads is None:
205
+ num_key_value_heads = num_attention_heads
206
+
207
+ self.num_key_value_heads = num_key_value_heads
208
+ self.head_dim = head_dim
209
+ self.hidden_act = hidden_act
210
+ self.initializer_range = initializer_range
211
+ self.rms_norm_eps = rms_norm_eps
212
+ self.use_cache = use_cache
213
+ self.rope_theta = rope_theta
214
+ self.attention_dropout = attention_dropout
215
+ self.rope_scaling = rope_scaling
216
+ self.attention_bias = attention_bias
217
+ self.tie_word_embeddings = tie_word_embeddings
218
+
219
+ # Validate the correctness of rotary position embeddings parameters
220
+ # BC: if there is a 'type' field, move it to 'rope_type'.
221
+ # and change type from 'mrope' to 'default' because `mrope` does default RoPE calculations
222
+ # one can set it to "linear"/"dynamic" etc. to have scaled RoPE
223
+ # TODO: @raushan update config in the hub
224
+ if self.rope_scaling is not None and "type" in self.rope_scaling:
225
+ if self.rope_scaling["type"] == "mrope":
226
+ self.rope_scaling["type"] = "default"
227
+ self.rope_scaling["rope_type"] = self.rope_scaling["type"]
228
+ rope_config_validation(self, ignore_keys={"mrope_section"})
229
+ self.image_token_id = image_token_id
230
+ self.video_token_id = video_token_id
231
+
232
+ self.layer_types = layer_types
233
+ if self.layer_types is None:
234
+ self.layer_types = [
235
+ "sliding_attention"
236
+ if self.sliding_window is not None and i >= self.max_window_layers
237
+ else "full_attention"
238
+ for i in range(self.num_hidden_layers)
239
+ ]
240
+ layer_type_validation(self.layer_types)
241
+
242
+ super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
243
+
244
+
245
+ class Llavaonevision1_5Config(PretrainedConfig):
246
+ r"""
247
+ Args:
248
+ text_config (`Union[PreTrainedConfig, dict]`, *optional*, defaults to `LLaVAOneVision1_5_TextConfig`):
249
+ The config object or dictionary of the text backbone.
250
+ vision_config (`Union[PreTrainedConfig, dict]`, *optional*, defaults to `LLaVAOneVision1_5_VisionConfig`):
251
+ The config object or dictionary of the vision backbone.
252
+ image_token_id (`int`, *optional*, defaults to 151655):
253
+ The image token index to encode the image prompt.
254
+ video_token_id (`int`, *optional*, defaults to 151656):
255
+ The video token index to encode the image prompt.
256
+ """
257
+
258
+ model_type = "llavaonevision1_5"
259
+ sub_configs = {"vision_config": RiceConfig, "text_config": LLaVAOneVision1_5_TextConfig}
260
+ keys_to_ignore_at_inference = ["past_key_values"]
261
+
262
+ def __init__(
263
+ self,
264
+ text_config=None,
265
+ vision_config=None,
266
+ image_token_id=151655,
267
+ video_token_id=151656,
268
+ vocab_size=152064,
269
+ **kwargs,
270
+ ):
271
+ if isinstance(vision_config, dict):
272
+ self.vision_config = self.sub_configs["vision_config"](**vision_config)
273
+ elif vision_config is None:
274
+ self.vision_config = self.sub_configs["vision_config"]()
275
+
276
+ if isinstance(text_config, dict):
277
+ self.text_config = self.sub_configs["text_config"](**text_config)
278
+ elif text_config is None:
279
+ # For BC use all kwargs to init `TextConfig`
280
+ self.text_config = self.sub_configs["text_config"](**kwargs)
281
+
282
+ self.image_token_id = image_token_id
283
+ self.video_token_id = video_token_id
284
+ self.vocab_size = vocab_size
285
+
286
+ super().__init__(**kwargs)
287
+
288
+ __all__ = ["Llavaonevision1_5Config", "LLaVAOneVision1_5_TextConfig"]
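A minimal sketch of instantiating the classes above directly (assuming this file is importable from the working directory); note that the class defaults differ from the released checkpoint, which overrides them through `config.json`:

```python
# Sketch: build the config with class defaults only.
from configuration_llavaonevision1_5 import Llavaonevision1_5Config

cfg = Llavaonevision1_5Config()
print(cfg.text_config.num_hidden_layers)       # 36
print(cfg.text_config.hidden_size)             # 4096 (checkpoint uses 2560 via config.json)
print(cfg.vision_config.patch_size)            # 14
print(cfg.image_token_id, cfg.video_token_id)  # 151655 151656
```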
generation_config.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "bos_token_id": 151643,
3
+ "pad_token_id": 151643,
4
+ "do_sample": true,
5
+ "eos_token_id": 151645,
6
+ "repetition_penalty": 1.05,
7
+ "temperature": 0.000001,
8
+ "_from_model_config": true,
9
+ "transformers_version": "4.53.0"
10
+ }
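Note that `do_sample` is enabled but the temperature is effectively zero, so the default decoding is near-greedy with a mild repetition penalty. These defaults can be overridden per call, as in the sketch below (reusing `model` and `inputs` from the Quick Start; the sampling values are illustrative):

```python
# Sketch: override the repository's generation defaults at call time.
generated_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,          # sample instead of the near-greedy default
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.05,
)
```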
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00001-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f82cd85dce1f7a8438ff40d84a269a0fa99a0e34a848c7353e18aabd462fcdec
3
+ size 4563695792
model-00002-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d61103678768e54825eb35862e8b98e25db536bb1b4528e1822a50e5491cd286
3
+ size 4919602640
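The two `*.safetensors` entries are Git LFS pointers; the shards (roughly 9.5 GB combined) are fetched automatically by `from_pretrained`, or can be downloaded explicitly as sketched below (the repo id is an assumption; substitute the id of this repository):

```python
# Sketch: prefetch the sharded weights and config files with huggingface_hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="lmms-lab/LLaVA-OneVision-1.5-4B-Instruct",
    allow_patterns=["*.safetensors", "*.json", "*.txt", "*.jinja"],
)
print(local_dir)
```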
model.safetensors.index.json ADDED
@@ -0,0 +1,705 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 9483221056
4
+ },
5
+ "weight_map": {
6
+ "lm_head.weight": "model-00002-of-00002.safetensors",
7
+ "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
8
+ "model.layers.0.input_layernorm.weight": "model-00001-of-00002.safetensors",
9
+ "model.layers.0.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
10
+ "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
11
+ "model.layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
12
+ "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
13
+ "model.layers.0.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
14
+ "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
15
+ "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
16
+ "model.layers.0.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
17
+ "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
18
+ "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
19
+ "model.layers.1.input_layernorm.weight": "model-00001-of-00002.safetensors",
20
+ "model.layers.1.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
21
+ "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
22
+ "model.layers.1.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
23
+ "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
24
+ "model.layers.1.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
25
+ "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
26
+ "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
27
+ "model.layers.1.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
28
+ "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
29
+ "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
30
+ "model.layers.10.input_layernorm.weight": "model-00001-of-00002.safetensors",
31
+ "model.layers.10.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
32
+ "model.layers.10.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
33
+ "model.layers.10.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
34
+ "model.layers.10.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
35
+ "model.layers.10.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
36
+ "model.layers.10.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
37
+ "model.layers.10.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
38
+ "model.layers.10.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
39
+ "model.layers.10.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
40
+ "model.layers.10.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
41
+ "model.layers.11.input_layernorm.weight": "model-00001-of-00002.safetensors",
42
+ "model.layers.11.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
43
+ "model.layers.11.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
44
+ "model.layers.11.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
45
+ "model.layers.11.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
46
+ "model.layers.11.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
47
+ "model.layers.11.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
48
+ "model.layers.11.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
49
+ "model.layers.11.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
50
+ "model.layers.11.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
51
+ "model.layers.11.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
52
+ "model.layers.12.input_layernorm.weight": "model-00001-of-00002.safetensors",
53
+ "model.layers.12.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
54
+ "model.layers.12.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
55
+ "model.layers.12.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
56
+ "model.layers.12.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
57
+ "model.layers.12.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
58
+ "model.layers.12.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
59
+ "model.layers.12.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
60
+ "model.layers.12.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
61
+ "model.layers.12.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
62
+ "model.layers.12.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
63
+ "model.layers.13.input_layernorm.weight": "model-00001-of-00002.safetensors",
64
+ "model.layers.13.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
65
+ "model.layers.13.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
66
+ "model.layers.13.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
67
+ "model.layers.13.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
68
+ "model.layers.13.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
69
+ "model.layers.13.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
70
+ "model.layers.13.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
71
+ "model.layers.13.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
72
+ "model.layers.13.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
73
+ "model.layers.13.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
74
+ "model.layers.14.input_layernorm.weight": "model-00001-of-00002.safetensors",
75
+ "model.layers.14.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
76
+ "model.layers.14.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
77
+ "model.layers.14.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
78
+ "model.layers.14.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
79
+ "model.layers.14.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
80
+ "model.layers.14.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
81
+ "model.layers.14.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
82
+ "model.layers.14.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
83
+ "model.layers.14.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
84
+ "model.layers.14.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
85
+ "model.layers.15.input_layernorm.weight": "model-00001-of-00002.safetensors",
86
+ "model.layers.15.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
87
+ "model.layers.15.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
88
+ "model.layers.15.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
89
+ "model.layers.15.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
90
+ "model.layers.15.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
91
+ "model.layers.15.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
92
+ "model.layers.15.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
93
+ "model.layers.15.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
94
+ "model.layers.15.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
95
+ "model.layers.15.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
96
+ "model.layers.16.input_layernorm.weight": "model-00001-of-00002.safetensors",
97
+ "model.layers.16.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
98
+ "model.layers.16.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
99
+ "model.layers.16.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
100
+ "model.layers.16.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
101
+ "model.layers.16.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
102
+ "model.layers.16.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
103
+ "model.layers.16.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
104
+ "model.layers.16.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
105
+ "model.layers.16.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
106
+ "model.layers.16.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
107
+ "model.layers.17.input_layernorm.weight": "model-00001-of-00002.safetensors",
108
+ "model.layers.17.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
109
+ "model.layers.17.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
110
+ "model.layers.17.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
111
+ "model.layers.17.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
112
+ "model.layers.17.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
113
+ "model.layers.17.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
114
+ "model.layers.17.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
115
+ "model.layers.17.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
116
+ "model.layers.17.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
117
+ "model.layers.17.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
118
+ "model.layers.18.input_layernorm.weight": "model-00001-of-00002.safetensors",
119
+ "model.layers.18.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
120
+ "model.layers.18.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
121
+ "model.layers.18.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
122
+ "model.layers.18.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
123
+ "model.layers.18.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
124
+ "model.layers.18.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
125
+ "model.layers.18.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
126
+ "model.layers.18.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
127
+ "model.layers.18.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
128
+ "model.layers.18.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
129
+ "model.layers.19.input_layernorm.weight": "model-00001-of-00002.safetensors",
130
+ "model.layers.19.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
131
+ "model.layers.19.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
132
+ "model.layers.19.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
133
+ "model.layers.19.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
134
+ "model.layers.19.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
135
+ "model.layers.19.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
136
+ "model.layers.19.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
137
+ "model.layers.19.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
138
+ "model.layers.19.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
139
+ "model.layers.19.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
140
+ "model.layers.2.input_layernorm.weight": "model-00001-of-00002.safetensors",
141
+ "model.layers.2.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
142
+ "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
143
+ "model.layers.2.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
144
+ "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
145
+ "model.layers.2.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
146
+ "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
147
+ "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
148
+ "model.layers.2.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
149
+ "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
150
+ "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
151
+ "model.layers.20.input_layernorm.weight": "model-00001-of-00002.safetensors",
152
+ "model.layers.20.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
153
+ "model.layers.20.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
154
+ "model.layers.20.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
155
+ "model.layers.20.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
156
+ "model.layers.20.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
157
+ "model.layers.20.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
158
+ "model.layers.20.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
159
+ "model.layers.20.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
160
+ "model.layers.20.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
161
+ "model.layers.20.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
162
+ "model.layers.21.input_layernorm.weight": "model-00001-of-00002.safetensors",
163
+ "model.layers.21.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
164
+ "model.layers.21.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
165
+ "model.layers.21.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
166
+ "model.layers.21.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
167
+ "model.layers.21.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
168
+ "model.layers.21.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
169
+ "model.layers.21.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
170
+ "model.layers.21.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
171
+ "model.layers.21.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
172
+ "model.layers.21.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
173
+ "model.layers.22.input_layernorm.weight": "model-00001-of-00002.safetensors",
174
+ "model.layers.22.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
175
+ "model.layers.22.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
176
+ "model.layers.22.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
177
+ "model.layers.22.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
178
+ "model.layers.22.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
179
+ "model.layers.22.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
180
+ "model.layers.22.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
181
+ "model.layers.22.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
182
+ "model.layers.22.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
183
+ "model.layers.22.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
184
+ "model.layers.23.input_layernorm.weight": "model-00001-of-00002.safetensors",
185
+ "model.layers.23.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
186
+ "model.layers.23.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
187
+ "model.layers.23.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
188
+ "model.layers.23.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
189
+ "model.layers.23.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
190
+ "model.layers.23.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
191
+ "model.layers.23.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
192
+ "model.layers.23.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
193
+ "model.layers.23.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
194
+ "model.layers.23.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
195
+ "model.layers.24.input_layernorm.weight": "model-00001-of-00002.safetensors",
196
+ "model.layers.24.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
197
+ "model.layers.24.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
198
+ "model.layers.24.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
199
+ "model.layers.24.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
200
+ "model.layers.24.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
201
+ "model.layers.24.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
202
+ "model.layers.24.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
203
+ "model.layers.24.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
204
+ "model.layers.24.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
205
+ "model.layers.24.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
206
+ "model.layers.25.input_layernorm.weight": "model-00001-of-00002.safetensors",
207
+ "model.layers.25.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
208
+ "model.layers.25.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
209
+ "model.layers.25.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
210
+ "model.layers.25.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
211
+ "model.layers.25.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
212
+ "model.layers.25.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
213
+ "model.layers.25.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
214
+ "model.layers.25.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
215
+ "model.layers.25.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
216
+ "model.layers.25.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
217
+ "model.layers.26.input_layernorm.weight": "model-00001-of-00002.safetensors",
218
+ "model.layers.26.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
219
+ "model.layers.26.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
220
+ "model.layers.26.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
221
+ "model.layers.26.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
222
+ "model.layers.26.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
223
+ "model.layers.26.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
224
+ "model.layers.26.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
225
+ "model.layers.26.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
226
+ "model.layers.26.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
227
+ "model.layers.26.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
228
+ "model.layers.27.input_layernorm.weight": "model-00001-of-00002.safetensors",
229
+ "model.layers.27.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
230
+ "model.layers.27.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
231
+ "model.layers.27.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
232
+ "model.layers.27.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
233
+ "model.layers.27.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
234
+ "model.layers.27.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
235
+ "model.layers.27.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
236
+ "model.layers.27.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
237
+ "model.layers.27.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
238
+ "model.layers.27.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
239
+ "model.layers.28.input_layernorm.weight": "model-00001-of-00002.safetensors",
240
+ "model.layers.28.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
241
+ "model.layers.28.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
242
+ "model.layers.28.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
243
+ "model.layers.28.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
244
+ "model.layers.28.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
245
+ "model.layers.28.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
246
+ "model.layers.28.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
247
+ "model.layers.28.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
248
+ "model.layers.28.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
249
+ "model.layers.28.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
250
+ "model.layers.29.input_layernorm.weight": "model-00001-of-00002.safetensors",
251
+ "model.layers.29.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
252
+ "model.layers.29.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
253
+ "model.layers.29.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
254
+ "model.layers.29.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
255
+ "model.layers.29.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
256
+ "model.layers.29.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
257
+ "model.layers.29.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
258
+ "model.layers.29.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
259
+ "model.layers.29.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
260
+ "model.layers.29.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
261
+ "model.layers.3.input_layernorm.weight": "model-00001-of-00002.safetensors",
262
+ "model.layers.3.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
263
+ "model.layers.3.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
264
+ "model.layers.3.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
265
+ "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
266
+ "model.layers.3.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
267
+ "model.layers.3.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
268
+ "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
269
+ "model.layers.3.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
270
+ "model.layers.3.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
271
+ "model.layers.3.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
272
+ "model.layers.30.input_layernorm.weight": "model-00001-of-00002.safetensors",
273
+ "model.layers.30.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
274
+ "model.layers.30.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
275
+ "model.layers.30.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
276
+ "model.layers.30.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
277
+ "model.layers.30.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
278
+ "model.layers.30.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
279
+ "model.layers.30.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
280
+ "model.layers.30.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
281
+ "model.layers.30.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
282
+ "model.layers.30.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
283
+ "model.layers.31.input_layernorm.weight": "model-00001-of-00002.safetensors",
284
+ "model.layers.31.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
285
+ "model.layers.31.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
286
+ "model.layers.31.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
287
+ "model.layers.31.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
288
+ "model.layers.31.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
289
+ "model.layers.31.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
290
+ "model.layers.31.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
291
+ "model.layers.31.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
292
+ "model.layers.31.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
293
+ "model.layers.31.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
294
+ "model.layers.32.input_layernorm.weight": "model-00001-of-00002.safetensors",
295
+ "model.layers.32.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
296
+ "model.layers.32.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
297
+ "model.layers.32.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
298
+ "model.layers.32.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
299
+ "model.layers.32.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
300
+ "model.layers.32.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
301
+ "model.layers.32.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
302
+ "model.layers.32.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
303
+ "model.layers.32.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
304
+ "model.layers.32.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
305
+ "model.layers.33.input_layernorm.weight": "model-00001-of-00002.safetensors",
306
+ "model.layers.33.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
307
+ "model.layers.33.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
308
+ "model.layers.33.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
309
+ "model.layers.33.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
310
+ "model.layers.33.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
311
+ "model.layers.33.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
312
+ "model.layers.33.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
313
+ "model.layers.33.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
314
+ "model.layers.33.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
315
+ "model.layers.33.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
316
+ "model.layers.34.input_layernorm.weight": "model-00001-of-00002.safetensors",
317
+ "model.layers.34.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
318
+ "model.layers.34.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
319
+ "model.layers.34.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
320
+ "model.layers.34.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
321
+ "model.layers.34.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
322
+ "model.layers.34.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
323
+ "model.layers.34.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
324
+ "model.layers.34.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
325
+ "model.layers.34.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
326
+ "model.layers.34.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
327
+ "model.layers.35.input_layernorm.weight": "model-00001-of-00002.safetensors",
328
+ "model.layers.35.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
329
+ "model.layers.35.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
330
+ "model.layers.35.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
331
+ "model.layers.35.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
332
+ "model.layers.35.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
333
+ "model.layers.35.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
334
+ "model.layers.35.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
335
+ "model.layers.35.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
336
+ "model.layers.35.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
337
+ "model.layers.35.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
338
+ "model.layers.4.input_layernorm.weight": "model-00001-of-00002.safetensors",
339
+ "model.layers.4.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
340
+ "model.layers.4.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
341
+ "model.layers.4.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
342
+ "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
343
+ "model.layers.4.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
344
+ "model.layers.4.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
345
+ "model.layers.4.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
346
+ "model.layers.4.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
347
+ "model.layers.4.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
348
+ "model.layers.4.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
349
+ "model.layers.5.input_layernorm.weight": "model-00001-of-00002.safetensors",
350
+ "model.layers.5.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
351
+ "model.layers.5.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
352
+ "model.layers.5.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
353
+ "model.layers.5.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
354
+ "model.layers.5.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
355
+ "model.layers.5.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
356
+ "model.layers.5.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
357
+ "model.layers.5.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
358
+ "model.layers.5.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
359
+ "model.layers.5.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
360
+ "model.layers.6.input_layernorm.weight": "model-00001-of-00002.safetensors",
361
+ "model.layers.6.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
362
+ "model.layers.6.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
363
+ "model.layers.6.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
364
+ "model.layers.6.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
365
+ "model.layers.6.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
366
+ "model.layers.6.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
367
+ "model.layers.6.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
368
+ "model.layers.6.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
369
+ "model.layers.6.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
370
+ "model.layers.6.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
371
+ "model.layers.7.input_layernorm.weight": "model-00001-of-00002.safetensors",
372
+ "model.layers.7.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
373
+ "model.layers.7.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
374
+ "model.layers.7.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
375
+ "model.layers.7.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
376
+ "model.layers.7.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
377
+ "model.layers.7.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
378
+ "model.layers.7.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
379
+ "model.layers.7.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
380
+ "model.layers.7.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
381
+ "model.layers.7.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
382
+ "model.layers.8.input_layernorm.weight": "model-00001-of-00002.safetensors",
383
+ "model.layers.8.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
384
+ "model.layers.8.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
385
+ "model.layers.8.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
386
+ "model.layers.8.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
387
+ "model.layers.8.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
388
+ "model.layers.8.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
389
+ "model.layers.8.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
390
+ "model.layers.8.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
391
+ "model.layers.8.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
392
+ "model.layers.8.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
393
+ "model.layers.9.input_layernorm.weight": "model-00001-of-00002.safetensors",
394
+ "model.layers.9.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
395
+ "model.layers.9.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
396
+ "model.layers.9.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
397
+ "model.layers.9.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
398
+ "model.layers.9.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
399
+ "model.layers.9.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
400
+ "model.layers.9.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
401
+ "model.layers.9.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
402
+ "model.layers.9.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
403
+ "model.layers.9.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
404
+ "model.norm.weight": "model-00001-of-00002.safetensors",
405
+ "visual.blocks.0.attn.proj.bias": "model-00002-of-00002.safetensors",
406
+ "visual.blocks.0.attn.proj.weight": "model-00002-of-00002.safetensors",
407
+ "visual.blocks.0.attn.qkv.bias": "model-00002-of-00002.safetensors",
408
+ "visual.blocks.0.attn.qkv.weight": "model-00002-of-00002.safetensors",
409
+ "visual.blocks.0.mlp.fc1.bias": "model-00002-of-00002.safetensors",
410
+ "visual.blocks.0.mlp.fc1.weight": "model-00002-of-00002.safetensors",
411
+ "visual.blocks.0.mlp.fc2.bias": "model-00002-of-00002.safetensors",
412
+ "visual.blocks.0.mlp.fc2.weight": "model-00002-of-00002.safetensors",
413
+ "visual.blocks.0.norm1.bias": "model-00002-of-00002.safetensors",
414
+ "visual.blocks.0.norm1.weight": "model-00002-of-00002.safetensors",
415
+ "visual.blocks.0.norm2.bias": "model-00002-of-00002.safetensors",
416
+ "visual.blocks.0.norm2.weight": "model-00002-of-00002.safetensors",
417
+ "visual.blocks.1.attn.proj.bias": "model-00002-of-00002.safetensors",
418
+ "visual.blocks.1.attn.proj.weight": "model-00002-of-00002.safetensors",
419
+ "visual.blocks.1.attn.qkv.bias": "model-00002-of-00002.safetensors",
420
+ "visual.blocks.1.attn.qkv.weight": "model-00002-of-00002.safetensors",
421
+ "visual.blocks.1.mlp.fc1.bias": "model-00002-of-00002.safetensors",
422
+ "visual.blocks.1.mlp.fc1.weight": "model-00002-of-00002.safetensors",
423
+ "visual.blocks.1.mlp.fc2.bias": "model-00002-of-00002.safetensors",
424
+ "visual.blocks.1.mlp.fc2.weight": "model-00002-of-00002.safetensors",
425
+ "visual.blocks.1.norm1.bias": "model-00002-of-00002.safetensors",
426
+ "visual.blocks.1.norm1.weight": "model-00002-of-00002.safetensors",
427
+ "visual.blocks.1.norm2.bias": "model-00002-of-00002.safetensors",
428
+ "visual.blocks.1.norm2.weight": "model-00002-of-00002.safetensors",
429
+ "visual.blocks.10.attn.proj.bias": "model-00002-of-00002.safetensors",
430
+ "visual.blocks.10.attn.proj.weight": "model-00002-of-00002.safetensors",
431
+ "visual.blocks.10.attn.qkv.bias": "model-00002-of-00002.safetensors",
432
+ "visual.blocks.10.attn.qkv.weight": "model-00002-of-00002.safetensors",
433
+ "visual.blocks.10.mlp.fc1.bias": "model-00002-of-00002.safetensors",
434
+ "visual.blocks.10.mlp.fc1.weight": "model-00002-of-00002.safetensors",
435
+ "visual.blocks.10.mlp.fc2.bias": "model-00002-of-00002.safetensors",
436
+ "visual.blocks.10.mlp.fc2.weight": "model-00002-of-00002.safetensors",
437
+ "visual.blocks.10.norm1.bias": "model-00002-of-00002.safetensors",
438
+ "visual.blocks.10.norm1.weight": "model-00002-of-00002.safetensors",
439
+ "visual.blocks.10.norm2.bias": "model-00002-of-00002.safetensors",
440
+ "visual.blocks.10.norm2.weight": "model-00002-of-00002.safetensors",
441
+ "visual.blocks.11.attn.proj.bias": "model-00002-of-00002.safetensors",
442
+ "visual.blocks.11.attn.proj.weight": "model-00002-of-00002.safetensors",
443
+ "visual.blocks.11.attn.qkv.bias": "model-00002-of-00002.safetensors",
444
+ "visual.blocks.11.attn.qkv.weight": "model-00002-of-00002.safetensors",
445
+ "visual.blocks.11.mlp.fc1.bias": "model-00002-of-00002.safetensors",
446
+ "visual.blocks.11.mlp.fc1.weight": "model-00002-of-00002.safetensors",
447
+ "visual.blocks.11.mlp.fc2.bias": "model-00002-of-00002.safetensors",
448
+ "visual.blocks.11.mlp.fc2.weight": "model-00002-of-00002.safetensors",
449
+ "visual.blocks.11.norm1.bias": "model-00002-of-00002.safetensors",
450
+ "visual.blocks.11.norm1.weight": "model-00002-of-00002.safetensors",
451
+ "visual.blocks.11.norm2.bias": "model-00002-of-00002.safetensors",
452
+ "visual.blocks.11.norm2.weight": "model-00002-of-00002.safetensors",
453
+ "visual.blocks.12.attn.proj.bias": "model-00002-of-00002.safetensors",
454
+ "visual.blocks.12.attn.proj.weight": "model-00002-of-00002.safetensors",
455
+ "visual.blocks.12.attn.qkv.bias": "model-00002-of-00002.safetensors",
456
+ "visual.blocks.12.attn.qkv.weight": "model-00002-of-00002.safetensors",
457
+ "visual.blocks.12.mlp.fc1.bias": "model-00002-of-00002.safetensors",
458
+ "visual.blocks.12.mlp.fc1.weight": "model-00002-of-00002.safetensors",
459
+ "visual.blocks.12.mlp.fc2.bias": "model-00002-of-00002.safetensors",
460
+ "visual.blocks.12.mlp.fc2.weight": "model-00002-of-00002.safetensors",
461
+ "visual.blocks.12.norm1.bias": "model-00002-of-00002.safetensors",
462
+ "visual.blocks.12.norm1.weight": "model-00002-of-00002.safetensors",
463
+ "visual.blocks.12.norm2.bias": "model-00002-of-00002.safetensors",
464
+ "visual.blocks.12.norm2.weight": "model-00002-of-00002.safetensors",
465
+ "visual.blocks.13.attn.proj.bias": "model-00002-of-00002.safetensors",
466
+ "visual.blocks.13.attn.proj.weight": "model-00002-of-00002.safetensors",
467
+ "visual.blocks.13.attn.qkv.bias": "model-00002-of-00002.safetensors",
468
+ "visual.blocks.13.attn.qkv.weight": "model-00002-of-00002.safetensors",
469
+ "visual.blocks.13.mlp.fc1.bias": "model-00002-of-00002.safetensors",
470
+ "visual.blocks.13.mlp.fc1.weight": "model-00002-of-00002.safetensors",
471
+ "visual.blocks.13.mlp.fc2.bias": "model-00002-of-00002.safetensors",
472
+ "visual.blocks.13.mlp.fc2.weight": "model-00002-of-00002.safetensors",
473
+ "visual.blocks.13.norm1.bias": "model-00002-of-00002.safetensors",
474
+ "visual.blocks.13.norm1.weight": "model-00002-of-00002.safetensors",
475
+ "visual.blocks.13.norm2.bias": "model-00002-of-00002.safetensors",
476
+ "visual.blocks.13.norm2.weight": "model-00002-of-00002.safetensors",
477
+ "visual.blocks.14.attn.proj.bias": "model-00002-of-00002.safetensors",
478
+ "visual.blocks.14.attn.proj.weight": "model-00002-of-00002.safetensors",
479
+ "visual.blocks.14.attn.qkv.bias": "model-00002-of-00002.safetensors",
480
+ "visual.blocks.14.attn.qkv.weight": "model-00002-of-00002.safetensors",
481
+ "visual.blocks.14.mlp.fc1.bias": "model-00002-of-00002.safetensors",
482
+ "visual.blocks.14.mlp.fc1.weight": "model-00002-of-00002.safetensors",
483
+ "visual.blocks.14.mlp.fc2.bias": "model-00002-of-00002.safetensors",
484
+ "visual.blocks.14.mlp.fc2.weight": "model-00002-of-00002.safetensors",
485
+ "visual.blocks.14.norm1.bias": "model-00002-of-00002.safetensors",
486
+ "visual.blocks.14.norm1.weight": "model-00002-of-00002.safetensors",
487
+ "visual.blocks.14.norm2.bias": "model-00002-of-00002.safetensors",
488
+ "visual.blocks.14.norm2.weight": "model-00002-of-00002.safetensors",
489
+ "visual.blocks.15.attn.proj.bias": "model-00002-of-00002.safetensors",
490
+ "visual.blocks.15.attn.proj.weight": "model-00002-of-00002.safetensors",
491
+ "visual.blocks.15.attn.qkv.bias": "model-00002-of-00002.safetensors",
492
+ "visual.blocks.15.attn.qkv.weight": "model-00002-of-00002.safetensors",
493
+ "visual.blocks.15.mlp.fc1.bias": "model-00002-of-00002.safetensors",
494
+ "visual.blocks.15.mlp.fc1.weight": "model-00002-of-00002.safetensors",
495
+ "visual.blocks.15.mlp.fc2.bias": "model-00002-of-00002.safetensors",
496
+ "visual.blocks.15.mlp.fc2.weight": "model-00002-of-00002.safetensors",
497
+ "visual.blocks.15.norm1.bias": "model-00002-of-00002.safetensors",
498
+ "visual.blocks.15.norm1.weight": "model-00002-of-00002.safetensors",
499
+ "visual.blocks.15.norm2.bias": "model-00002-of-00002.safetensors",
500
+ "visual.blocks.15.norm2.weight": "model-00002-of-00002.safetensors",
501
+ "visual.blocks.16.attn.proj.bias": "model-00002-of-00002.safetensors",
502
+ "visual.blocks.16.attn.proj.weight": "model-00002-of-00002.safetensors",
503
+ "visual.blocks.16.attn.qkv.bias": "model-00002-of-00002.safetensors",
504
+ "visual.blocks.16.attn.qkv.weight": "model-00002-of-00002.safetensors",
505
+ "visual.blocks.16.mlp.fc1.bias": "model-00002-of-00002.safetensors",
506
+ "visual.blocks.16.mlp.fc1.weight": "model-00002-of-00002.safetensors",
507
+ "visual.blocks.16.mlp.fc2.bias": "model-00002-of-00002.safetensors",
508
+ "visual.blocks.16.mlp.fc2.weight": "model-00002-of-00002.safetensors",
509
+ "visual.blocks.16.norm1.bias": "model-00002-of-00002.safetensors",
510
+ "visual.blocks.16.norm1.weight": "model-00002-of-00002.safetensors",
511
+ "visual.blocks.16.norm2.bias": "model-00002-of-00002.safetensors",
512
+ "visual.blocks.16.norm2.weight": "model-00002-of-00002.safetensors",
513
+ "visual.blocks.17.attn.proj.bias": "model-00002-of-00002.safetensors",
514
+ "visual.blocks.17.attn.proj.weight": "model-00002-of-00002.safetensors",
515
+ "visual.blocks.17.attn.qkv.bias": "model-00002-of-00002.safetensors",
516
+ "visual.blocks.17.attn.qkv.weight": "model-00002-of-00002.safetensors",
517
+ "visual.blocks.17.mlp.fc1.bias": "model-00002-of-00002.safetensors",
518
+ "visual.blocks.17.mlp.fc1.weight": "model-00002-of-00002.safetensors",
519
+ "visual.blocks.17.mlp.fc2.bias": "model-00002-of-00002.safetensors",
520
+ "visual.blocks.17.mlp.fc2.weight": "model-00002-of-00002.safetensors",
521
+ "visual.blocks.17.norm1.bias": "model-00002-of-00002.safetensors",
522
+ "visual.blocks.17.norm1.weight": "model-00002-of-00002.safetensors",
523
+ "visual.blocks.17.norm2.bias": "model-00002-of-00002.safetensors",
524
+ "visual.blocks.17.norm2.weight": "model-00002-of-00002.safetensors",
525
+ "visual.blocks.18.attn.proj.bias": "model-00002-of-00002.safetensors",
526
+ "visual.blocks.18.attn.proj.weight": "model-00002-of-00002.safetensors",
527
+ "visual.blocks.18.attn.qkv.bias": "model-00002-of-00002.safetensors",
528
+ "visual.blocks.18.attn.qkv.weight": "model-00002-of-00002.safetensors",
529
+ "visual.blocks.18.mlp.fc1.bias": "model-00002-of-00002.safetensors",
530
+ "visual.blocks.18.mlp.fc1.weight": "model-00002-of-00002.safetensors",
531
+ "visual.blocks.18.mlp.fc2.bias": "model-00002-of-00002.safetensors",
532
+ "visual.blocks.18.mlp.fc2.weight": "model-00002-of-00002.safetensors",
533
+ "visual.blocks.18.norm1.bias": "model-00002-of-00002.safetensors",
534
+ "visual.blocks.18.norm1.weight": "model-00002-of-00002.safetensors",
535
+ "visual.blocks.18.norm2.bias": "model-00002-of-00002.safetensors",
536
+ "visual.blocks.18.norm2.weight": "model-00002-of-00002.safetensors",
537
+ "visual.blocks.19.attn.proj.bias": "model-00002-of-00002.safetensors",
538
+ "visual.blocks.19.attn.proj.weight": "model-00002-of-00002.safetensors",
539
+ "visual.blocks.19.attn.qkv.bias": "model-00002-of-00002.safetensors",
540
+ "visual.blocks.19.attn.qkv.weight": "model-00002-of-00002.safetensors",
541
+ "visual.blocks.19.mlp.fc1.bias": "model-00002-of-00002.safetensors",
542
+ "visual.blocks.19.mlp.fc1.weight": "model-00002-of-00002.safetensors",
543
+ "visual.blocks.19.mlp.fc2.bias": "model-00002-of-00002.safetensors",
544
+ "visual.blocks.19.mlp.fc2.weight": "model-00002-of-00002.safetensors",
545
+ "visual.blocks.19.norm1.bias": "model-00002-of-00002.safetensors",
546
+ "visual.blocks.19.norm1.weight": "model-00002-of-00002.safetensors",
547
+ "visual.blocks.19.norm2.bias": "model-00002-of-00002.safetensors",
548
+ "visual.blocks.19.norm2.weight": "model-00002-of-00002.safetensors",
549
+ "visual.blocks.2.attn.proj.bias": "model-00002-of-00002.safetensors",
550
+ "visual.blocks.2.attn.proj.weight": "model-00002-of-00002.safetensors",
551
+ "visual.blocks.2.attn.qkv.bias": "model-00002-of-00002.safetensors",
552
+ "visual.blocks.2.attn.qkv.weight": "model-00002-of-00002.safetensors",
553
+ "visual.blocks.2.mlp.fc1.bias": "model-00002-of-00002.safetensors",
554
+ "visual.blocks.2.mlp.fc1.weight": "model-00002-of-00002.safetensors",
555
+ "visual.blocks.2.mlp.fc2.bias": "model-00002-of-00002.safetensors",
556
+ "visual.blocks.2.mlp.fc2.weight": "model-00002-of-00002.safetensors",
557
+ "visual.blocks.2.norm1.bias": "model-00002-of-00002.safetensors",
558
+ "visual.blocks.2.norm1.weight": "model-00002-of-00002.safetensors",
559
+ "visual.blocks.2.norm2.bias": "model-00002-of-00002.safetensors",
560
+ "visual.blocks.2.norm2.weight": "model-00002-of-00002.safetensors",
561
+ "visual.blocks.20.attn.proj.bias": "model-00002-of-00002.safetensors",
562
+ "visual.blocks.20.attn.proj.weight": "model-00002-of-00002.safetensors",
563
+ "visual.blocks.20.attn.qkv.bias": "model-00002-of-00002.safetensors",
564
+ "visual.blocks.20.attn.qkv.weight": "model-00002-of-00002.safetensors",
565
+ "visual.blocks.20.mlp.fc1.bias": "model-00002-of-00002.safetensors",
566
+ "visual.blocks.20.mlp.fc1.weight": "model-00002-of-00002.safetensors",
567
+ "visual.blocks.20.mlp.fc2.bias": "model-00002-of-00002.safetensors",
568
+ "visual.blocks.20.mlp.fc2.weight": "model-00002-of-00002.safetensors",
569
+ "visual.blocks.20.norm1.bias": "model-00002-of-00002.safetensors",
570
+ "visual.blocks.20.norm1.weight": "model-00002-of-00002.safetensors",
571
+ "visual.blocks.20.norm2.bias": "model-00002-of-00002.safetensors",
572
+ "visual.blocks.20.norm2.weight": "model-00002-of-00002.safetensors",
573
+ "visual.blocks.21.attn.proj.bias": "model-00002-of-00002.safetensors",
574
+ "visual.blocks.21.attn.proj.weight": "model-00002-of-00002.safetensors",
575
+ "visual.blocks.21.attn.qkv.bias": "model-00002-of-00002.safetensors",
576
+ "visual.blocks.21.attn.qkv.weight": "model-00002-of-00002.safetensors",
577
+ "visual.blocks.21.mlp.fc1.bias": "model-00002-of-00002.safetensors",
578
+ "visual.blocks.21.mlp.fc1.weight": "model-00002-of-00002.safetensors",
579
+ "visual.blocks.21.mlp.fc2.bias": "model-00002-of-00002.safetensors",
580
+ "visual.blocks.21.mlp.fc2.weight": "model-00002-of-00002.safetensors",
581
+ "visual.blocks.21.norm1.bias": "model-00002-of-00002.safetensors",
582
+ "visual.blocks.21.norm1.weight": "model-00002-of-00002.safetensors",
583
+ "visual.blocks.21.norm2.bias": "model-00002-of-00002.safetensors",
584
+ "visual.blocks.21.norm2.weight": "model-00002-of-00002.safetensors",
585
+ "visual.blocks.22.attn.proj.bias": "model-00002-of-00002.safetensors",
586
+ "visual.blocks.22.attn.proj.weight": "model-00002-of-00002.safetensors",
587
+ "visual.blocks.22.attn.qkv.bias": "model-00002-of-00002.safetensors",
588
+ "visual.blocks.22.attn.qkv.weight": "model-00002-of-00002.safetensors",
589
+ "visual.blocks.22.mlp.fc1.bias": "model-00002-of-00002.safetensors",
590
+ "visual.blocks.22.mlp.fc1.weight": "model-00002-of-00002.safetensors",
591
+ "visual.blocks.22.mlp.fc2.bias": "model-00002-of-00002.safetensors",
592
+ "visual.blocks.22.mlp.fc2.weight": "model-00002-of-00002.safetensors",
593
+ "visual.blocks.22.norm1.bias": "model-00002-of-00002.safetensors",
594
+ "visual.blocks.22.norm1.weight": "model-00002-of-00002.safetensors",
595
+ "visual.blocks.22.norm2.bias": "model-00002-of-00002.safetensors",
596
+ "visual.blocks.22.norm2.weight": "model-00002-of-00002.safetensors",
597
+ "visual.blocks.23.attn.proj.bias": "model-00002-of-00002.safetensors",
598
+ "visual.blocks.23.attn.proj.weight": "model-00002-of-00002.safetensors",
599
+ "visual.blocks.23.attn.qkv.bias": "model-00002-of-00002.safetensors",
600
+ "visual.blocks.23.attn.qkv.weight": "model-00002-of-00002.safetensors",
601
+ "visual.blocks.23.mlp.fc1.bias": "model-00002-of-00002.safetensors",
602
+ "visual.blocks.23.mlp.fc1.weight": "model-00002-of-00002.safetensors",
603
+ "visual.blocks.23.mlp.fc2.bias": "model-00002-of-00002.safetensors",
604
+ "visual.blocks.23.mlp.fc2.weight": "model-00002-of-00002.safetensors",
605
+ "visual.blocks.23.norm1.bias": "model-00002-of-00002.safetensors",
606
+ "visual.blocks.23.norm1.weight": "model-00002-of-00002.safetensors",
607
+ "visual.blocks.23.norm2.bias": "model-00002-of-00002.safetensors",
608
+ "visual.blocks.23.norm2.weight": "model-00002-of-00002.safetensors",
609
+ "visual.blocks.3.attn.proj.bias": "model-00002-of-00002.safetensors",
610
+ "visual.blocks.3.attn.proj.weight": "model-00002-of-00002.safetensors",
611
+ "visual.blocks.3.attn.qkv.bias": "model-00002-of-00002.safetensors",
612
+ "visual.blocks.3.attn.qkv.weight": "model-00002-of-00002.safetensors",
613
+ "visual.blocks.3.mlp.fc1.bias": "model-00002-of-00002.safetensors",
614
+ "visual.blocks.3.mlp.fc1.weight": "model-00002-of-00002.safetensors",
615
+ "visual.blocks.3.mlp.fc2.bias": "model-00002-of-00002.safetensors",
616
+ "visual.blocks.3.mlp.fc2.weight": "model-00002-of-00002.safetensors",
617
+ "visual.blocks.3.norm1.bias": "model-00002-of-00002.safetensors",
618
+ "visual.blocks.3.norm1.weight": "model-00002-of-00002.safetensors",
619
+ "visual.blocks.3.norm2.bias": "model-00002-of-00002.safetensors",
620
+ "visual.blocks.3.norm2.weight": "model-00002-of-00002.safetensors",
621
+ "visual.blocks.4.attn.proj.bias": "model-00002-of-00002.safetensors",
622
+ "visual.blocks.4.attn.proj.weight": "model-00002-of-00002.safetensors",
623
+ "visual.blocks.4.attn.qkv.bias": "model-00002-of-00002.safetensors",
624
+ "visual.blocks.4.attn.qkv.weight": "model-00002-of-00002.safetensors",
625
+ "visual.blocks.4.mlp.fc1.bias": "model-00002-of-00002.safetensors",
626
+ "visual.blocks.4.mlp.fc1.weight": "model-00002-of-00002.safetensors",
627
+ "visual.blocks.4.mlp.fc2.bias": "model-00002-of-00002.safetensors",
628
+ "visual.blocks.4.mlp.fc2.weight": "model-00002-of-00002.safetensors",
629
+ "visual.blocks.4.norm1.bias": "model-00002-of-00002.safetensors",
630
+ "visual.blocks.4.norm1.weight": "model-00002-of-00002.safetensors",
631
+ "visual.blocks.4.norm2.bias": "model-00002-of-00002.safetensors",
632
+ "visual.blocks.4.norm2.weight": "model-00002-of-00002.safetensors",
633
+ "visual.blocks.5.attn.proj.bias": "model-00002-of-00002.safetensors",
634
+ "visual.blocks.5.attn.proj.weight": "model-00002-of-00002.safetensors",
635
+ "visual.blocks.5.attn.qkv.bias": "model-00002-of-00002.safetensors",
636
+ "visual.blocks.5.attn.qkv.weight": "model-00002-of-00002.safetensors",
637
+ "visual.blocks.5.mlp.fc1.bias": "model-00002-of-00002.safetensors",
638
+ "visual.blocks.5.mlp.fc1.weight": "model-00002-of-00002.safetensors",
639
+ "visual.blocks.5.mlp.fc2.bias": "model-00002-of-00002.safetensors",
640
+ "visual.blocks.5.mlp.fc2.weight": "model-00002-of-00002.safetensors",
641
+ "visual.blocks.5.norm1.bias": "model-00002-of-00002.safetensors",
642
+ "visual.blocks.5.norm1.weight": "model-00002-of-00002.safetensors",
643
+ "visual.blocks.5.norm2.bias": "model-00002-of-00002.safetensors",
644
+ "visual.blocks.5.norm2.weight": "model-00002-of-00002.safetensors",
645
+ "visual.blocks.6.attn.proj.bias": "model-00002-of-00002.safetensors",
646
+ "visual.blocks.6.attn.proj.weight": "model-00002-of-00002.safetensors",
647
+ "visual.blocks.6.attn.qkv.bias": "model-00002-of-00002.safetensors",
648
+ "visual.blocks.6.attn.qkv.weight": "model-00002-of-00002.safetensors",
649
+ "visual.blocks.6.mlp.fc1.bias": "model-00002-of-00002.safetensors",
650
+ "visual.blocks.6.mlp.fc1.weight": "model-00002-of-00002.safetensors",
651
+ "visual.blocks.6.mlp.fc2.bias": "model-00002-of-00002.safetensors",
652
+ "visual.blocks.6.mlp.fc2.weight": "model-00002-of-00002.safetensors",
653
+ "visual.blocks.6.norm1.bias": "model-00002-of-00002.safetensors",
654
+ "visual.blocks.6.norm1.weight": "model-00002-of-00002.safetensors",
655
+ "visual.blocks.6.norm2.bias": "model-00002-of-00002.safetensors",
656
+ "visual.blocks.6.norm2.weight": "model-00002-of-00002.safetensors",
657
+ "visual.blocks.7.attn.proj.bias": "model-00002-of-00002.safetensors",
658
+ "visual.blocks.7.attn.proj.weight": "model-00002-of-00002.safetensors",
659
+ "visual.blocks.7.attn.qkv.bias": "model-00002-of-00002.safetensors",
660
+ "visual.blocks.7.attn.qkv.weight": "model-00002-of-00002.safetensors",
661
+ "visual.blocks.7.mlp.fc1.bias": "model-00002-of-00002.safetensors",
662
+ "visual.blocks.7.mlp.fc1.weight": "model-00002-of-00002.safetensors",
663
+ "visual.blocks.7.mlp.fc2.bias": "model-00002-of-00002.safetensors",
664
+ "visual.blocks.7.mlp.fc2.weight": "model-00002-of-00002.safetensors",
665
+ "visual.blocks.7.norm1.bias": "model-00002-of-00002.safetensors",
666
+ "visual.blocks.7.norm1.weight": "model-00002-of-00002.safetensors",
667
+ "visual.blocks.7.norm2.bias": "model-00002-of-00002.safetensors",
668
+ "visual.blocks.7.norm2.weight": "model-00002-of-00002.safetensors",
669
+ "visual.blocks.8.attn.proj.bias": "model-00002-of-00002.safetensors",
670
+ "visual.blocks.8.attn.proj.weight": "model-00002-of-00002.safetensors",
671
+ "visual.blocks.8.attn.qkv.bias": "model-00002-of-00002.safetensors",
672
+ "visual.blocks.8.attn.qkv.weight": "model-00002-of-00002.safetensors",
673
+ "visual.blocks.8.mlp.fc1.bias": "model-00002-of-00002.safetensors",
674
+ "visual.blocks.8.mlp.fc1.weight": "model-00002-of-00002.safetensors",
675
+ "visual.blocks.8.mlp.fc2.bias": "model-00002-of-00002.safetensors",
676
+ "visual.blocks.8.mlp.fc2.weight": "model-00002-of-00002.safetensors",
677
+ "visual.blocks.8.norm1.bias": "model-00002-of-00002.safetensors",
678
+ "visual.blocks.8.norm1.weight": "model-00002-of-00002.safetensors",
679
+ "visual.blocks.8.norm2.bias": "model-00002-of-00002.safetensors",
680
+ "visual.blocks.8.norm2.weight": "model-00002-of-00002.safetensors",
681
+ "visual.blocks.9.attn.proj.bias": "model-00002-of-00002.safetensors",
682
+ "visual.blocks.9.attn.proj.weight": "model-00002-of-00002.safetensors",
683
+ "visual.blocks.9.attn.qkv.bias": "model-00002-of-00002.safetensors",
684
+ "visual.blocks.9.attn.qkv.weight": "model-00002-of-00002.safetensors",
685
+ "visual.blocks.9.mlp.fc1.bias": "model-00002-of-00002.safetensors",
686
+ "visual.blocks.9.mlp.fc1.weight": "model-00002-of-00002.safetensors",
687
+ "visual.blocks.9.mlp.fc2.bias": "model-00002-of-00002.safetensors",
688
+ "visual.blocks.9.mlp.fc2.weight": "model-00002-of-00002.safetensors",
689
+ "visual.blocks.9.norm1.bias": "model-00002-of-00002.safetensors",
690
+ "visual.blocks.9.norm1.weight": "model-00002-of-00002.safetensors",
691
+ "visual.blocks.9.norm2.bias": "model-00002-of-00002.safetensors",
692
+ "visual.blocks.9.norm2.weight": "model-00002-of-00002.safetensors",
693
+ "visual.class_embedding": "model-00002-of-00002.safetensors",
694
+ "visual.class_pos_emb": "model-00002-of-00002.safetensors",
695
+ "visual.merger.ln_q.bias": "model-00002-of-00002.safetensors",
696
+ "visual.merger.ln_q.weight": "model-00002-of-00002.safetensors",
697
+ "visual.merger.mlp.0.bias": "model-00002-of-00002.safetensors",
698
+ "visual.merger.mlp.0.weight": "model-00002-of-00002.safetensors",
699
+ "visual.merger.mlp.2.bias": "model-00002-of-00002.safetensors",
700
+ "visual.merger.mlp.2.weight": "model-00002-of-00002.safetensors",
701
+ "visual.patch_embed.proj.weight": "model-00002-of-00002.safetensors",
702
+ "visual.pre_layernorm.bias": "model-00002-of-00002.safetensors",
703
+ "visual.pre_layernorm.weight": "model-00002-of-00002.safetensors"
704
+ }
705
+ }
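The map above is the tail of `model.safetensors.index.json`: each parameter name points to the shard that stores it, so the language-model layers are split between `model-00001-of-00002.safetensors` and `model-00002-of-00002.safetensors`, while all `visual.*` weights sit in the second shard. The sketch below shows one way to inspect such an index with the `safetensors` library; the local file paths are assumptions for illustration, not part of this commit.

```python
import json
from collections import Counter

from safetensors import safe_open  # pip install safetensors

# Assumed local path: point this at a downloaded snapshot of the repository.
INDEX_PATH = "model.safetensors.index.json"

with open(INDEX_PATH) as f:
    index = json.load(f)

weight_map = index["weight_map"]          # parameter name -> shard file
print(Counter(weight_map.values()))       # how many tensors live in each shard

# Lazily load a single tensor from the shard the index points to.
name = "visual.patch_embed.proj.weight"
with safe_open(weight_map[name], framework="pt") as shard:
    tensor = shard.get_tensor(name)
    print(name, tuple(tensor.shape), tensor.dtype)
```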
modeling_llavaonevision1_5.py ADDED
The diff for this file is too large to render. See raw diff
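Because the model implementation ships as `modeling_llavaonevision1_5.py` inside the repository, loading the checkpoint through `transformers` goes through the remote-code path. A minimal, hedged sketch follows; the repo id, the `AutoModel`/`AutoProcessor` entry points, and `trust_remote_code=True` are assumptions about how the custom classes are registered, not documented usage.

```python
from transformers import AutoModel, AutoProcessor

REPO_ID = "lmms-lab/LLaVA-OneVision-1.5-4B-Instruct"  # assumed repo id

# trust_remote_code lets transformers import the custom classes defined in
# the modeling_llavaonevision1_5.py file this commit adds (assumption).
processor = AutoProcessor.from_pretrained(REPO_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(REPO_ID, trust_remote_code=True, torch_dtype="auto")
print(type(model).__name__)
```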
 
preprocessor_config.json ADDED
@@ -0,0 +1,29 @@
1
+ {
2
+ "do_convert_rgb": true,
3
+ "do_normalize": true,
4
+ "do_rescale": true,
5
+ "do_resize": true,
6
+ "image_mean": [
7
+ 0.48145466,
8
+ 0.4578275,
9
+ 0.40821073
10
+ ],
11
+ "image_processor_type": "Qwen2VLImageProcessor",
12
+ "image_std": [
13
+ 0.26862954,
14
+ 0.26130258,
15
+ 0.27577711
16
+ ],
17
+ "max_pixels": 2560000,
18
+ "merge_size": 2,
19
+ "min_pixels": 3136,
20
+ "patch_size": 14,
21
+ "processor_class": "Qwen2_5_VLProcessor",
22
+ "resample": 3,
23
+ "rescale_factor": 0.00392156862745098,
24
+ "size": {
25
+ "longest_edge": 12845056,
26
+ "shortest_edge": 3136
27
+ },
28
+ "temporal_patch_size": 1
29
+ }
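`preprocessor_config.json` sets up a Qwen2-VL-style image processor: images stay close to native resolution, every 14×14 patch becomes a visual feature, `merge_size: 2` folds each 2×2 group of patches into a single token, and `min_pixels`/`max_pixels` bound the resized area. The sketch below estimates the resulting token count per image; it is an illustrative approximation of that resize rule, not the processor's exact code.

```python
import math

# Values copied from preprocessor_config.json above.
PATCH_SIZE = 14
MERGE_SIZE = 2
MIN_PIXELS = 3136        # 56 * 56
MAX_PIXELS = 2560000

def estimate_image_tokens(height: int, width: int) -> int:
    """Approximate the Qwen2-VL-style resize: snap both sides to multiples of
    patch_size * merge_size, then rescale so the area stays within
    [MIN_PIXELS, MAX_PIXELS]."""
    factor = PATCH_SIZE * MERGE_SIZE                   # 28 px per merged-token side
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > MAX_PIXELS:                             # shrink oversized images
        scale = math.sqrt(MAX_PIXELS / (h * w))
        h = max(factor, math.floor(h * scale / factor) * factor)
        w = max(factor, math.floor(w * scale / factor) * factor)
    elif h * w < MIN_PIXELS:                           # grow tiny images
        scale = math.sqrt(MIN_PIXELS / (h * w))
        h = math.ceil(h * scale / factor) * factor
        w = math.ceil(w * scale / factor) * factor
    return (h // factor) * (w // factor)               # tokens after the 2x2 merge

print(estimate_image_tokens(1080, 1920))               # e.g. a Full-HD frame
```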
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|im_start|>",
4
+ "<|im_end|>",
5
+ "<|object_ref_start|>",
6
+ "<|object_ref_end|>",
7
+ "<|box_start|>",
8
+ "<|box_end|>",
9
+ "<|quad_start|>",
10
+ "<|quad_end|>",
11
+ "<|vision_start|>",
12
+ "<|vision_end|>",
13
+ "<|vision_pad|>",
14
+ "<|image_pad|>",
15
+ "<|video_pad|>"
16
+ ],
17
+ "eos_token": {
18
+ "content": "<|im_end|>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ },
24
+ "pad_token": {
25
+ "content": "<|endoftext|>",
26
+ "lstrip": false,
27
+ "normalized": false,
28
+ "rstrip": false,
29
+ "single_word": false
30
+ }
31
+ }
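`special_tokens_map.json` follows the Qwen chat conventions: `<|im_end|>` closes a turn and doubles as the EOS token, `<|endoftext|>` is used for padding, and the vision markers (`<|vision_start|>`, `<|image_pad|>`, and friends) are registered as additional special tokens so the BPE never splits them. A small hedged check, assuming the repo id below:

```python
from transformers import AutoTokenizer

REPO_ID = "lmms-lab/LLaVA-OneVision-1.5-4B-Instruct"      # assumed repo id
tok = AutoTokenizer.from_pretrained(REPO_ID)

print(tok.eos_token, tok.pad_token)                       # <|im_end|> <|endoftext|>
print("<|image_pad|>" in tok.additional_special_tokens)   # True

# Special tokens should survive tokenization as single ids.
ids = tok("<|vision_start|><|image_pad|><|vision_end|>")["input_ids"]
print(ids, tok.convert_ids_to_tokens(ids))
```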
tensorboard/instruct/README.md ADDED
File without changes
tensorboard/instruct/events.out.tfevents.1758101239.109436.0 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:543ddda0e04e35a6bb2e488bfdd641b2f84cbb814bacc29e0936ae780d0b0ab1
3
+ size 92983372
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9c5ae00e602b8860cbd784ba82a8aa14e8feecec692e7076590d014d7b7fdafa
3
+ size 11421896
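Both the TensorBoard event file and `tokenizer.json` are tracked with Git LFS, so the diff only shows the three-line pointer (spec version, SHA-256 `oid`, payload `size` in bytes) instead of the payload. After `git lfs pull` (or a download through `huggingface_hub`) the real file replaces the pointer; until then, the pointer can be parsed as plain text. A minimal sketch, with the local path as an assumption:

```python
def parse_lfs_pointer(path: str) -> dict:
    """Read a Git LFS pointer file into its version / oid / size fields."""
    fields = {}
    with open(path) as f:
        for line in f:
            key, _, value = line.strip().partition(" ")
            fields[key] = value
    return fields

# Assumed: a local clone where tokenizer.json is still an un-pulled LFS pointer.
ptr = parse_lfs_pointer("tokenizer.json")
print(ptr["oid"], int(ptr["size"]))   # sha256:9c5a...  11421896
```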
tokenizer_config.json ADDED
@@ -0,0 +1,208 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "151643": {
6
+ "content": "<|endoftext|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "151644": {
14
+ "content": "<|im_start|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "151645": {
22
+ "content": "<|im_end|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "151646": {
30
+ "content": "<|object_ref_start|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "151647": {
38
+ "content": "<|object_ref_end|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "151648": {
46
+ "content": "<|box_start|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "151649": {
54
+ "content": "<|box_end|>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "151650": {
62
+ "content": "<|quad_start|>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "151651": {
70
+ "content": "<|quad_end|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "151652": {
78
+ "content": "<|vision_start|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "151653": {
86
+ "content": "<|vision_end|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "151654": {
94
+ "content": "<|vision_pad|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "151655": {
102
+ "content": "<|image_pad|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "151656": {
110
+ "content": "<|video_pad|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "151657": {
118
+ "content": "<tool_call>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": false
124
+ },
125
+ "151658": {
126
+ "content": "</tool_call>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": false
132
+ },
133
+ "151659": {
134
+ "content": "<|fim_prefix|>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": false
140
+ },
141
+ "151660": {
142
+ "content": "<|fim_middle|>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": false
148
+ },
149
+ "151661": {
150
+ "content": "<|fim_suffix|>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": false
156
+ },
157
+ "151662": {
158
+ "content": "<|fim_pad|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": false
164
+ },
165
+ "151663": {
166
+ "content": "<|repo_name|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": false
172
+ },
173
+ "151664": {
174
+ "content": "<|file_sep|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": false
180
+ }
181
+ },
182
+ "additional_special_tokens": [
183
+ "<|im_start|>",
184
+ "<|im_end|>",
185
+ "<|object_ref_start|>",
186
+ "<|object_ref_end|>",
187
+ "<|box_start|>",
188
+ "<|box_end|>",
189
+ "<|quad_start|>",
190
+ "<|quad_end|>",
191
+ "<|vision_start|>",
192
+ "<|vision_end|>",
193
+ "<|vision_pad|>",
194
+ "<|image_pad|>",
195
+ "<|video_pad|>"
196
+ ],
197
+ "bos_token": null,
198
+ "clean_up_tokenization_spaces": false,
199
+ "eos_token": "<|im_end|>",
200
+ "errors": "replace",
201
+ "extra_special_tokens": {},
202
+ "model_max_length": 131072,
203
+ "pad_token": "<|endoftext|>",
204
+ "processor_class": "Qwen2_5_VLProcessor",
205
+ "split_special_tokens": false,
206
+ "tokenizer_class": "Qwen2Tokenizer",
207
+ "unk_token": null
208
+ }
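`tokenizer_config.json` pins the tokenizer to the Qwen2 BPE (`Qwen2Tokenizer`) with a 131,072-token `model_max_length` and registers the added tokens from id 151643 upward, including the vision placeholders that the processor expands at inference time. A hedged sanity check against the ids listed above, with the repo id again assumed:

```python
from transformers import AutoTokenizer

REPO_ID = "lmms-lab/LLaVA-OneVision-1.5-4B-Instruct"   # assumed repo id
tok = AutoTokenizer.from_pretrained(REPO_ID)

print(tok.model_max_length)                            # 131072
for token in ("<|endoftext|>", "<|im_start|>", "<|im_end|>",
              "<|image_pad|>", "<|video_pad|>"):
    # Expected ids from added_tokens_decoder: 151643, 151644, 151645, 151655, 151656
    print(token, tok.convert_tokens_to_ids(token))
```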
vocab.json ADDED
The diff for this file is too large to render. See raw diff