Update README.md
README.md CHANGED
@@ -26,7 +26,7 @@ This is an 8-bit quantized MLX version of [zai-org/GLM-4.5](https://huggingface.co/zai-org/GLM-4.5)
 - **8-bit quantization** (8.502 bits per weight) for memory efficiency
 - **MLX optimized** for Apple Silicon unified memory architecture
 - **High-memory optimized**: Designed for systems with 512GB+ unified memory
-- **Long context capable**: Tested with 6,500+ word documents
+- **Long context capable**: Tested with multiple 6,500+ word documents, 30K token chunks
 - **Performance**: ~11.75 tokens/second on Mac Studio with 512GB RAM

 ## Model Details
@@ -36,24 +36,24 @@ This is an 8-bit quantized MLX version of [zai-org/GLM-4.5](https://huggingface.co/zai-org/GLM-4.5)
 - **Quantization**: 8-bit MLX with group size 64
 - **MLX-LM Version**: 0.26.3
 - **Model Size**: ~375GB
-- **Context Length**: 131,072 tokens (tested stable up to
+- **Context Length**: 131,072 tokens (tested stable up to 132K+ tokens)

 ## System Requirements

-- **Hardware**: Mac Studio or Mac Pro with Apple Silicon (
+- **Hardware**: Mac Studio or Mac Pro with Apple Silicon (M3 Ultra)
 - **Memory**: 512GB+ unified memory strongly recommended
 - **Storage**: ~400GB free space
 - **Software**: macOS with MLX framework

 ## Performance Benchmarks

-**Test Configuration**: Mac Studio with 512GB unified memory
+**Test Configuration**: 2025 Mac Studio M3 Ultra with 512GB unified memory

 ### Context Length Performance
 - **Short Context (6.5K tokens)**: 11.75 tokens/second
 - **Long Context (72K tokens)**: 5.0 tokens/second, 86% memory usage
-- **Extended Context (121K tokens)**: 2.53 tokens/second, 92% memory usage
-- **Beyond Theoretical Limit (132K tokens)**: 5.74 tokens/second, 85% peak memory
+- **Extended Context (121K tokens)**: 30K token input prompt, 2.53 tokens/second, 92% memory usage
+- **Beyond Theoretical Limit (132K tokens)**: 11K token input prompt, 5.74 tokens/second, 85% peak memory
 - **Proven Capability**: Successfully exceeds stated 131K context window (102.2% capacity)
 - **Quality**: Full comprehension and analysis of complex, sprawling content at maximum context
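For reference, a quantization in the configuration listed above (8-bit, group size 64) can be produced with mlx-lm's `convert` API. The sketch below is an assumption based on mlx-lm's standard interface, not the exact command used for this repo; the output path is a placeholder:

```python
# Minimal sketch (assumed workflow): quantize GLM-4.5 to 8-bit MLX
# with group size 64 using mlx-lm's convert API. The output path
# "GLM-4.5-8bit-mlx" is a placeholder, not this repo's actual path.
from mlx_lm import convert

convert(
    "zai-org/GLM-4.5",            # source weights on the Hugging Face Hub
    mlx_path="GLM-4.5-8bit-mlx",  # destination for the quantized weights
    quantize=True,
    q_bits=8,                     # 8-bit weights (~8.5 bits/weight effective)
    q_group_size=64,              # group size 64, per Model Details above
)
```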
@@ -66,7 +66,7 @@ This is an 8-bit quantized MLX version of [zai-org/GLM-4.5](https://huggingface.co/zai-org/GLM-4.5)

 ### Comparison with GGUF
 - **MLX Version**: System remains responsive during inference, stable performance
-- **GGUF Version**: System becomes unusable, frequent crashes around 30-40K tokens
+- **GGUF Version**: System becomes unusable, frequent crashes around 30-40K tokens in context window

 ## Usage
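The Usage section itself is unchanged by this commit, but for orientation, loading the quantized weights follows mlx-lm's standard load/generate API (a minimal sketch; the model path and generation parameters are illustrative):

```python
# Minimal sketch (assumed usage): load the 8-bit MLX weights and generate.
# Holding the ~375GB of weights requires a high-memory Apple Silicon machine.
from mlx_lm import load, generate

model, tokenizer = load("GLM-4.5-8bit-mlx")  # placeholder local path or repo id

prompt = "Summarize this document."
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# verbose=True prints prompt and generation tokens-per-second, which is how
# throughput numbers like those in the benchmarks above can be observed.
text = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
print(text)
```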