Update README.md
README.md CHANGED
@@ -26,7 +26,7 @@ This is an 8-bit quantized MLX version of [zai-org/GLM-4.5](https://huggingface.co/zai-org/GLM-4.5)
 - **8-bit quantization** (8.502 bits per weight) for memory efficiency
 - **MLX optimized** for Apple Silicon unified memory architecture
 - **High-memory optimized**: Designed for systems with 512GB+ unified memory
-- **Long context capable**: Tested with 6,500+ word documents
+- **Long context capable**: Tested with multiple 6,500+ word documents, 30K token chunks
 - **Performance**: ~11.75 tokens/second on Mac Studio with 512GB RAM

 ## Model Details
@@ -36,24 +36,24 @@ This is an 8-bit quantized MLX version of [zai-org/GLM-4.5](https://huggingface.co/zai-org/GLM-4.5)
 - **Quantization**: 8-bit MLX with group size 64
 - **MLX-LM Version**: 0.26.3
 - **Model Size**: ~375GB
-- **Context Length**: 131,072 tokens (tested stable up to
+- **Context Length**: 131,072 tokens (tested stable up to 132K+ tokens)

 ## System Requirements

-- **Hardware**: Mac Studio or Mac Pro with Apple Silicon (
+- **Hardware**: Mac Studio or Mac Pro with Apple Silicon (M3 Ultra)
 - **Memory**: 512GB+ unified memory strongly recommended
 - **Storage**: ~400GB free space
 - **Software**: macOS with MLX framework

 ## Performance Benchmarks

-**Test Configuration**: Mac Studio with 512GB unified memory
+**Test Configuration**: 2025 Mac Studio M3 Ultra with 512GB unified memory

 ### Context Length Performance
 - **Short Context (6.5K tokens)**: 11.75 tokens/second
 - **Long Context (72K tokens)**: 5.0 tokens/second, 86% memory usage
-- **Extended Context (121K tokens)**: 2.53 tokens/second, 92% memory usage
-- **Beyond Theoretical Limit (132K tokens)**: 5.74 tokens/second, 85% peak memory
+- **Extended Context (121K tokens)**: 30K token input prompt, 2.53 tokens/second, 92% memory usage
+- **Beyond Theoretical Limit (132K tokens)**: 11K token input prompt, 5.74 tokens/second, 85% peak memory
 - **Proven Capability**: Successfully exceeds stated 131K context window (102.2% capacity)
 - **Quality**: Full comprehension and analysis of complex, sprawling content at maximum context
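For reference, a quantization in the configuration listed above (8-bit, group size 64) can be produced with mlx-lm's `convert` API. The sketch below is an assumption based on mlx-lm's standard interface, not the exact command used for this repo; the output path is a placeholder:

```python
# Minimal sketch (assumed workflow): quantize GLM-4.5 to 8-bit MLX
# with group size 64 using mlx-lm's convert API. The output path
# "GLM-4.5-8bit-mlx" is a placeholder, not this repo's actual path.
from mlx_lm import convert

convert(
    "zai-org/GLM-4.5",            # source weights on the Hugging Face Hub
    mlx_path="GLM-4.5-8bit-mlx",  # destination for the quantized weights
    quantize=True,
    q_bits=8,                     # 8-bit weights (~8.5 bits/weight effective)
    q_group_size=64,              # group size 64, per Model Details above
)
```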
@@ -66,7 +66,7 @@ This is an 8-bit quantized MLX version of [zai-org/GLM-4.5](https://huggingface.co/zai-org/GLM-4.5)

 ### Comparison with GGUF
 - **MLX Version**: System remains responsive during inference, stable performance
-- **GGUF Version**: System becomes unusable, frequent crashes around 30-40K tokens
+- **GGUF Version**: System becomes unusable, frequent crashes around 30-40K tokens in context window

 ## Usage
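The Usage section itself is unchanged by this commit, but for orientation, loading the quantized weights follows mlx-lm's standard load/generate API (a minimal sketch; the model path and generation parameters are illustrative):

```python
# Minimal sketch (assumed usage): load the 8-bit MLX weights and generate.
# Holding the ~375GB of weights requires a high-memory Apple Silicon machine.
from mlx_lm import load, generate

model, tokenizer = load("GLM-4.5-8bit-mlx")  # placeholder local path or repo id

prompt = "Summarize this document."
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# verbose=True prints prompt and generation tokens-per-second, which is how
# throughput numbers like those in the benchmarks above can be observed.
text = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
print(text)
```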