All HF Hub posts

ajibawa-2023 posted an update 2 days ago
Python-Code-Large
Dataset: ajibawa-2023/Python-Code-Large

Python-Code-Large is a large-scale corpus of more than 2 million rows of Python source code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software-engineering automation, and program analysis for the Python ecosystem.

By providing a high-volume, language-specific corpus, Python-Code-Large enables systematic experimentation in Python-focused model training, domain adaptation, and downstream code understanding tasks.

Python-Code-Large addresses the need for a dedicated Python-only dataset at substantial scale, enabling focused research across data science, backend systems, automation, scientific computing, and AI-driven Python environments.
etemiz posted an update 1 day ago
AHA 2026 scores of Qwen3.5

27B Huihui abliteration 65%
27B Heretic abliteration 55%
27B Normal 50%

35B Huihui abliteration 64%
35B @jiaojjjjje abliteration 57%
35B @LeadFootThrottleCock abliteration 56%
DavidAU posted an update 1 day ago
Gemma 3 27B, the record breaker (Heretic'ed, i.e. uncensored, then trained with Unsloth):

arc_challenge, arc_easy, boolq, hellaswag, openbookqa, piqa, winogrande
0.661, 0.816, 0.878, 0.763, 0.464, 0.808, 0.762

For comparison:
Qwen3.5-27B-Text (qx86-hi): 0.443, 0.498, 0.857, 0.701, 0.372, 0.770, 0.752

Trained on a Heretic-uncensored base too:

DavidAU/Gemma3-27B-it-vl-Polaris-HI16-Heretic-Uncensored-INSTRUCT
SeaWolf-AI posted an update 3 days ago
AI Is Training on Your Content Without Permission: Fight Back with Invisible Watermarks

FINAL-Bench/security-scan

Most generative AI training data is crawled without consent. Your text gets summarized, images reprocessed, videos clipped, with no way to prove you're the original creator. Existing watermarks are either visible or wiped out by a single AI preprocessing pass.

Detect Before, Track After

Pre-embed: Detect theft without any watermark. Text plagiarism check, image similarity analysis (perceptual hash, SSIM, color histogram, feature matching), and video temporal matching catch copies, edits, and excerpts.
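To make the perceptual-hash idea above concrete, here is a minimal average-hash sketch in pure Python. This is a generic illustration of the technique, not the project's actual code, and it assumes images have already been downscaled to 8x8 grayscale grids (real pipelines would do that with an image library):

```python
def average_hash(pixels):
    """Compute a 64-bit average hash from an 8x8 grayscale grid.

    pixels: list of 8 rows, each a list of 8 intensity values (0-255).
    Each bit is 1 if the pixel is brighter than the image mean, else 0.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return sum(1 << i for i, p in enumerate(flat) if p > mean)

def hamming_distance(h1, h2):
    """Number of differing bits; small distances indicate near-duplicates."""
    return bin(h1 ^ h2).count("1")

# A toy gradient image and a lightly edited copy (one pixel blown out).
original = [[(r * 8 + c) * 4 for c in range(8)] for r in range(8)]
edited = [row[:] for row in original]
edited[0][0] = 255

d = hamming_distance(average_hash(original), average_hash(edited))
print(d)  # prints 2: a small distance flags the edited image as a likely copy
```

Because the hash depends only on each pixel's relation to the mean, it survives edits that a bit-exact checksum would not, which is what makes it useful for detecting copies without any embedded watermark.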

Post-embed: Embed invisible multi-layer watermarks. If one layer is destroyed, others survive independently. Even full removal leaves forensic traces as evidence.

Text: 4 Independent Layers

Four mechanisms work simultaneously: zero-width Unicode characters at morpheme/word boundaries (Korean Kiwi + English NLP), style fingerprinting via synonym-ending-connective substitution, SHA-256 timestamped evidence packages, and punctuation-anchored micro-marks. Each layer uses a different Unicode category, so attacks on one cannot eliminate the others. Full bilingual support, zero readability impact.
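The zero-width layer can be illustrated with a short sketch. This is a generic toy scheme, not the project's actual encoding: it assumes payload bits are embedded as U+200B (0) and U+200C (1) after word boundaries, leaving the visible text unchanged:

```python
# Toy zero-width watermark: payload bits become invisible characters
# appended to each word. U+200B encodes 0, U+200C encodes 1.
ZW0, ZW1 = "\u200b", "\u200c"

def embed(text, payload_bits):
    """Append one invisible payload bit after every word."""
    words = text.split(" ")
    out = []
    for i, word in enumerate(words):
        mark = ZW1 if payload_bits[i % len(payload_bits)] else ZW0
        out.append(word + mark)
    return " ".join(out)

def extract(text, n_bits):
    """Read back the first n_bits invisible characters."""
    bits = []
    for ch in text:
        if ch == ZW0:
            bits.append(0)
        elif ch == ZW1:
            bits.append(1)
    return bits[:n_bits]

marked = embed("the quick brown fox jumps over", [1, 0, 1, 1])
# Stripping the invisible characters recovers the original text exactly,
# which is what "zero readability impact" means in practice.
assert marked.replace(ZW0, "").replace(ZW1, "") == "the quick brown fox jumps over"
print(extract(marked, 4))  # prints [1, 0, 1, 1]
```

A real multi-layer design would, as the post describes, spread further payloads across other Unicode categories so that stripping one character class leaves the remaining layers intact.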

34-Attack Defense

7 categories, 34 attacks simulated: Unicode normalization, invisible character removal, homoglyph substitution (9,619 confusables), and AI rewriting. Each is scored on Signal (watermark survival) + Trace (forensic evidence of attack), proving deliberate removal even when watermarks are destroyed.

Image & Video

Images: DCT frequency-domain watermarks surviving JPEG compression and resize. Videos: keyframe watermarking with temporal propagation and majority-vote extraction. Both support pre-embed similarity detection.
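The majority-vote extraction mentioned for video can be sketched generically: each keyframe yields a (possibly corrupted) reading of the payload bits, and taking the per-position majority recovers the payload as long as most keyframes agree. A toy illustration, not the project's code:

```python
def majority_vote(bitstrings):
    """Recover a payload from per-keyframe bit readings.

    Each reading is a list of 0/1 bits of equal length; individual
    readings may be corrupted (e.g. by compression), but the payload
    survives while a majority of keyframes agree at each position.
    """
    n = len(bitstrings[0])
    ones = [sum(bits[i] for bits in bitstrings) for i in range(n)]
    return [1 if c * 2 > len(bitstrings) else 0 for c in ones]

payload = [1, 0, 1, 1, 0, 0, 1, 0]
readings = [
    payload,                   # clean keyframe
    [1, 0, 1, 0, 0, 0, 1, 0],  # one bit flipped
    [1, 1, 1, 1, 0, 0, 1, 0],  # another bit flipped
    payload,                   # clean keyframe
    [0, 0, 1, 1, 0, 1, 1, 0],  # two bits flipped
]
print(majority_vote(readings))  # prints the original payload
```

This redundancy across keyframes is the temporal analogue of the multi-layer text design: no single damaged frame can destroy the mark.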

Who Is This For

Creators, rights holders needing legal evidence, media companies, and organizations tracking document leaks. Korean/English bilingual, open source, Gradio-based.
NJX-njx posted an update about 9 hours ago
Recently I open-sourced an AI emotional-companion product based on openclaw, called opensoul.

On this platform, you can create a "soulmate" that matches your personality, and configure it with the skills, tools you want it to have, as well as the platforms it can integrate with (such as Telegram, Discord, etc.).
You can even create group chats, invite multiple agents and your friends to chat about recent events, discuss projects together, and so on.

On the one hand, I hope its unique memory mechanism, self-feedback and iteration mechanism, and modeling of user emotions let it accompany you better in daily life. On the other hand, I hope its skills, tools, and ability to handle complex task scenarios help you with your work.

Although the product has taken shape, many areas still need adjustment and optimization, and I hope to draw on the strength of the community to advance AI emotional companionship.

This is the project introduction URL: https://opensoul-web.vercel.app
This is the GitHub project URL: https://github.com/NJX-njx/opensoul
@AdinaY @lilianweng @burtenshaw @clem
let's just do it

imnotkitty posted an update 3 days ago
In the Text-to-Video arena, Seedance 2.0 has secured a spot in the LMArena Top 10 for the first time, while Kling 3.0 has topped the Artificial Analysis leaderboard, with the Kling family claiming 7 of the top 15 spots.

Which one do you prefer?
YatharthS posted an update 3 days ago
Just open-sourced LavaSR v2: a model that can enhance 5,000 seconds of audio in 1 second while delivering higher quality than giant, slow 6 GB diffusion models!

It works with any sampling rate from 8 to 48 kHz and is nearly 5,000x faster than the competition while being superior on objective benchmarks.

LavaSR v2 is perfect for:
- Enhancing TTS models.
- Fixing old audio datasets.
- Restoring low quality recordings.

You can check out the examples and run it locally or online:

Repo: https://github.com/ysharma3501/LavaSR.git
Demo: YatharthS/LavaSR
Model: YatharthS/LavaSR
BibbyResearch posted an update 2 days ago
Announcement:

BibbyResearch/China-Egocentric-Dataset-Robotics

Bibby AI (an AI LaTeX editor for research writing) has launched the above Chinese egocentric dataset for robotics research!
nyuuzyou posted an update 3 days ago
๐ŸŒ Street-Level Imagery Dataset nyuuzyou/streetview

934,191 image records index Eastern Europe and Northern Asia. Temporal links connect historical views at identical coordinates across nine years.

Key Stats:

- 905,940 unique images
- Coverage spanning 2016 to 2025
- Average 14.3 historical links per location

Geographic bounds span 20.49° E to 152.32° E. Urban centers show higher data density.
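The temporal links described above amount to grouping records by coordinates and ordering each group by year. A minimal sketch of that idea, with hypothetical field names rather than the dataset's actual schema:

```python
from collections import defaultdict

# Hypothetical records: (latitude, longitude, year, image_id).
records = [
    (55.75, 37.62, 2016, "img_a"),
    (55.75, 37.62, 2019, "img_b"),
    (55.75, 37.62, 2024, "img_c"),
    (59.94, 30.31, 2021, "img_d"),
]

# Group images captured at identical coordinates, then order by year
# so each location exposes its chain of historical views.
by_location = defaultdict(list)
for lat, lon, year, image_id in records:
    by_location[(lat, lon)].append((year, image_id))

temporal_links = {
    loc: [img for _, img in sorted(views)]
    for loc, views in by_location.items()
}
print(temporal_links[(55.75, 37.62)])  # prints ['img_a', 'img_b', 'img_c']
```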
sergiopaniego posted an update 3 days ago
What happens when you make an LLM drive a car where the physics is real and actions can't be undone?

I ported CARLA, the autonomous driving simulator, to OpenEnv and added training support via TRL + Hugging Face Spaces.

The model interacts with the simulator through tool calls (observe, brake, change lane) and learns from a reward signal.

In 50 training steps, Qwen 0.6B learns to swerve and brake to avoid pedestrians in emergency situations.

The project supports text and vision (VLMs can see through a camera sensor), open-world driving with traffic, and multiple driving scenarios.
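The observe/brake/change-lane interaction is a plain tool-dispatch loop at heart. Everything below (the stub environment, its numbers, the rubric) is illustrative only, not the actual carla-env or OpenEnv API:

```python
# Illustrative tool-call loop: the model emits tool names, the
# environment executes them and returns observations. The stub
# environment below is hypothetical, not the carla-env API.
class StubDrivingEnv:
    def __init__(self):
        self.speed = 30.0
        self.lane = 0

    def observe(self):
        return {"speed": self.speed, "lane": self.lane,
                "pedestrian_ahead": self.speed > 0}

    def brake(self):
        self.speed = max(0.0, self.speed - 15.0)
        return self.observe()

    def change_lane(self):
        self.lane += 1
        return self.observe()

def run_episode(env, tool_calls):
    """Dispatch a sequence of model-emitted tool calls to the env."""
    tools = {"observe": env.observe, "brake": env.brake,
             "change_lane": env.change_lane}
    trace = [tools[name]() for name in tool_calls]
    # A simple rubric-style reward: the car stopped for the pedestrian.
    reward = 1.0 if trace[-1]["speed"] == 0.0 else 0.0
    return trace, reward

env = StubDrivingEnv()
trace, reward = run_episode(env, ["observe", "brake", "brake"])
print(reward)  # prints 1.0: the car came to a stop
```

In training, the reward from each episode would feed an RL loop (here, TRL) that updates the policy emitting the tool calls.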

This builds on the carla-env project by sinatras, which originally placed LLMs inside CARLA for evaluation. We extended it with vision, new scenarios, rubric-based rewards, and made it trainable end-to-end.

Blog: https://huggingface.co/blog/sergiopaniego/bringing-carla-to-openenv-trl/
CARLA env in OpenEnv: https://github.com/meta-pytorch/OpenEnv/tree/main/envs/carla_env
Training script: https://github.com/huggingface/trl/blob/main/examples/scripts/openenv/carla.py