Study Overview:
- Model: RexBERT (ModernBERT for E-commerce)
- Focus: Real-world deployment viability and performance analysis
Key Performance Metrics:
Latency Results:
- NPU (best): 4.74ms average
- GPU: 12.56ms average
- CPU: 35.16ms average
NPU Advantage: 16.98x speedup over CPU
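Average latencies like those above typically come from a warmup-then-measure timing loop, with the speedup reported as a ratio of means. A minimal sketch of that harness (the `run_inference` callable and the loop counts are placeholders, not the actual benchmark code):

```python
import time
import statistics

def benchmark(run_inference, warmup=10, iters=100):
    """Time a single-inference callable and return the mean latency in ms."""
    for _ in range(warmup):              # warm caches/compilers before measuring
        run_inference()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        run_inference()
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.mean(samples)

def speedup(baseline_ms, accelerated_ms):
    """Speedup is simply the ratio of mean latencies."""
    return baseline_ms / accelerated_ms
```

In practice the warmup phase matters on mobile: the first few NPU invocations often include graph compilation, so skipping them keeps the average honest.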
Memory Efficiency:
- Model size: 568.96 MB (compressed for mobile)
- Runtime memory: 299.01 MB peak consumption
- Load memory range: 285-1,072 MB across devices
Accuracy Preservation:
- FP16 precision: 63.72 dB
- Quantized mode: available with minimal accuracy loss
- Inference quality: production-grade maintained
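A dB figure for precision usually reads as a signal-to-noise ratio between a full-precision reference output and the reduced-precision output. A minimal sketch of that measurement (the SNR interpretation, the function name, and the rounding stand-in for FP16 are our assumptions, not the study's actual procedure):

```python
import math

def output_snr_db(reference, candidate):
    """SNR in dB between a reference output and a lower-precision copy of it."""
    signal = sum(r * r for r in reference)
    noise = sum((r - c) ** 2 for r, c in zip(reference, candidate))
    return 10.0 * math.log10(signal / noise)

# Simulate precision loss by rounding to 3 decimals (a crude stand-in for FP16).
reference = [math.sin(i * 0.1) for i in range(1000)]
candidate = [round(x, 3) for x in reference]
print(round(output_snr_db(reference, candidate), 1))
```

Higher is better: 60+ dB means the quantization noise is about a millionth of the signal power, which is why downstream accuracy is essentially unchanged.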
Key takeaways:
- Transformer models are viable for real-time mobile applications
- NPU acceleration provides the breakthrough needed for practical deployment
- Mobile-first AI architecture is now technically feasible
- The performance gap between cloud and edge inference is rapidly closing
Real-World Applications Enabled:
E-commerce Intelligence:
- Instant product search and discovery
- Real-time semantic matching
- Context-aware recommendations
- Natural language query processing
# YOLOv11 Complete On-device Study: NPU vs GPU vs CPU Across All Model Variants
We've just completed comprehensive benchmarking of the entire YOLOv11 family on ZETIC.MLange. Here's what every ML engineer needs to know.
Key Findings Across 5 Model Variants (XL to Nano):
1. NPU Dominance in Efficiency:
- YOLOv11n: 1.72ms on NPU vs 53.60ms on CPU (31x faster)
- Memory footprint: 0-65MB across all variants
- Consistent sub-10ms inference even on XL models
2. The Sweet Spot - YOLOv11s:
- NPU: 3.23ms @ 95.57% mAP
- Perfect balance: 36MB model, production-ready speed
- 10x faster than GPU, 30x faster than CPU
3. Surprising Discovery: Medium models (YOLOv11m) show unusual GPU performance patterns - NPU outperforms GPU by 4x (9.55ms vs 35.82ms), suggesting current GPU kernels aren't optimized for mid-size architectures.
4. Production Insights:
- XL/Large: GPU still competitive for batch processing
- Small/Nano: NPU absolutely crushes everything else
- Memory scaling: linear from 10MB (Nano) to 217MB (XL)
- Accuracy plateau: 95.5-95.7% mAP across S/M/L variants
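The per-variant speedups quoted above can be recomputed directly from the reported latencies. A small sketch using only the numbers given in this post (entries not quoted here are simply omitted):

```python
# Latencies quoted in the post (ms); GPU/CPU figures are only known for some variants.
latency_ms = {
    "YOLOv11n": {"npu": 1.72, "cpu": 53.60},
    "YOLOv11s": {"npu": 3.23},
    "YOLOv11m": {"npu": 9.55, "gpu": 35.82},
}

for name, timings in latency_ms.items():
    npu = timings["npu"]
    for backend, ms in timings.items():
        if backend != "npu":
            # Speedup is the ratio of the slower backend's latency to the NPU's.
            print(f"{name}: NPU is {ms / npu:.1f}x faster than {backend.upper()}")
```

Running this reproduces the rounded claims in the post: roughly 31x over CPU for the Nano model and roughly 4x over GPU for the Medium model.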
Real-world Impact: For edge deployment, YOLOv11s on NPU delivers server-level accuracy at embedded speeds. This changes everything for real-time applications.
The data speaks for itself. NPUs aren't the future - they're the present for efficient inference. Which variant fits your use case? Let's discuss in the comments.
Technical Implementation: (Runnable with Copy & Paste at the MLange link!)
Device Compatibility Matrix: Tested on 50+ devices, including the Samsung Galaxy series, Google Pixel lineup, Xiaomi devices, iPhones, and iPads. Consistent sub-5ms performance across the board!
Applications Unlocked:
- Real-time AR/VR face tracking
- Privacy-preserving edge authentication
- Live video processing pipelines
- Mobile security applications
- Interactive camera filters
The democratization of high-performance computer vision on mobile devices is happening NOW! This study proves that complex CV models can run efficiently on consumer hardware without compromising accuracy. Want to reproduce these results? Check out the benchmark methodology and implementation guide!
This was the "funniest" joke out of the 10,000 jokes we generated with LLMs: 68% of respondents rated it as "funny".
Original jokes are particularly hard for LLMs: humor is nuanced, and a lot of context is needed to judge whether something is "funny" - something that can only be measured reliably with human raters.
LLMs are not equally good at generating jokes in every language: the generated English jokes turned out to be far funnier than the Japanese ones. On average, 46% of English-speaking voters found a generated joke funny. The same statistic for other languages:
There is not much variance in generation quality among models for any fixed language, but Claude Sonnet 4 slightly outperforms the others in Vietnamese, Arabic, and Japanese, and Gemini 2.5 Flash leads in Portuguese and English.
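The per-language "funny" rate behind these comparisons is just the fraction of votes rating a joke funny, grouped by language. A minimal sketch of that aggregation (the vote records below are made-up placeholders, not the study's data):

```python
from collections import defaultdict

# Hypothetical vote records: (language, rated_funny) pairs.
votes = [
    ("en", True), ("en", False), ("en", True),
    ("ja", False), ("ja", False), ("ja", True),
]

def funny_rate(votes):
    """Fraction of votes rating a joke 'funny', per language."""
    counts = defaultdict(lambda: [0, 0])   # language -> [funny votes, total votes]
    for lang, funny in votes:
        counts[lang][0] += int(funny)
        counts[lang][1] += 1
    return {lang: f / n for lang, (f, n) in counts.items()}

print({lang: round(r, 2) for lang, r in funny_rate(votes).items()})
# -> {'en': 0.67, 'ja': 0.33}
```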
# Video Tokenization for efficient AI video processing
Meet **VidTok**, a new open-source video tokenization technique developed by Microsoft Research to address the computational challenges of processing large volumes of video data. The core problem VidTok tackles is the inefficiency caused by redundant information in raw video pixels.
VidTok converts complex video footage into compact, structured units called tokens, making it easier and more efficient for AI systems to analyze, understand, and generate video content.
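To see why tokenization helps, consider the simplest possible scheme: grouping raw pixels into non-overlapping spatio-temporal patches. This is *not* VidTok's actual method (VidTok uses a learned encoder, and all names below are illustrative), but it shows how a video collapses into a short sequence of compact units:

```python
import numpy as np

def patch_tokenize(video, patch=(2, 8, 8)):
    """Split a (T, H, W, C) video into flat, non-overlapping spatio-temporal patches.

    Illustrative only: real tokenizers like VidTok replace this fixed grouping
    with a trained encoder that also compresses away redundant information.
    """
    t, h, w, c = video.shape
    pt, ph, pw = patch
    assert t % pt == 0 and h % ph == 0 and w % pw == 0
    tokens = (video
              .reshape(t // pt, pt, h // ph, ph, w // pw, pw, c)
              .transpose(0, 2, 4, 1, 3, 5, 6)   # group grid dims, then patch dims
              .reshape(-1, pt * ph * pw * c))
    return tokens

video = np.zeros((16, 64, 64, 3), dtype=np.float32)   # 16 frames of 64x64 RGB
tokens = patch_tokenize(video)
print(tokens.shape)   # -> (512, 384): 8*8*8 patches of 2*8*8*3 values each
```

Even this naive grouping turns 196,608 pixels into 512 units for a model to attend over; a learned tokenizer goes further by making each unit far smaller than the raw patch it summarizes.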