MikeDoes (Michael Anthony)

reacted to their post with 🚀 about 3 hours ago

Post

277

How do you prove your new, specialized AI model is a better solution? You test it against the best.

That's why we were excited to see the new AdminBERT paper from researchers at Nantes Université and others. To show the strength of their new model for French administrative texts, they compared it to the state-of-the-art generalist model, NERmemBERT.

The direct connection to our work is clear: NERmemBERT was trained on a combination of datasets, including the Pii-masking-200k dataset by Ai4Privacy.

This is a perfect win-win for the open-source community. Our foundational dataset helps create a strong, general-purpose benchmark, which in turn helps researchers prove the value of their specialized work. This is how we all get better.

🔗 Great work by Thomas Sebbag, Solen Quiniou, Nicolas Stucky, and Emmanuel Morin on tackling a challenging domain! Check out their paper: https://aclanthology.org/2025.coling-main.27.pdf

🚀 Stay updated on the latest in privacy-preserving AI—follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/

#OpenSource
#DataPrivacy
#LLM
#Anonymization
#AIsecurity
#HuggingFace
#Ai4Privacy
#Worldslargestopensourceprivacymaskingdataset

posted an update about 3 hours ago

Post

277

How do you prove your new, specialized AI model is a better solution? You test it against the best.

That's why we were excited to see the new AdminBERT paper from researchers at Nantes Université and others. To show the strength of their new model for French administrative texts, they compared it to the state-of-the-art generalist model, NERmemBERT.

The direct connection to our work is clear: NERmemBERT was trained on a combination of datasets, including the Pii-masking-200k dataset by Ai4Privacy.

This is a perfect win-win for the open-source community. Our foundational dataset helps create a strong, general-purpose benchmark, which in turn helps researchers prove the value of their specialized work. This is how we all get better.

🔗 Great work by Thomas Sebbag, Solen Quiniou, Nicolas Stucky, and Emmanuel Morin on tackling a challenging domain! Check out their paper: https://aclanthology.org/2025.coling-main.27.pdf

🚀 Stay updated on the latest in privacy-preserving AI—follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/

#OpenSource
#DataPrivacy
#LLM
#Anonymization
#AIsecurity
#HuggingFace
#Ai4Privacy
#Worldslargestopensourceprivacymaskingdataset

posted an update 5 days ago

Post

188

The future of AI privacy isn't just in the cloud; it's on your device. But how do we build and validate these tools?

A new paper on "Rescriber" explores this with a tool that uses smaller LLMs for on-device anonymization. Building and validating such tools requires a strong data foundation. We're excited to see that the researchers used the Ai4Privacy open dataset to create their performance benchmarks.

This is our mission in action: providing the open-source data that helps innovators build and test better solutions that will give users more control over their privacy. It's a win for the community when our data helps prove the feasibility of on-device AI for data minimization, with reported user perceptions on par with state-of-the-art cloud models.

Shoutout to Jijie Zhou, Eryue Xu, Yaoyao Wu, and Tianshi Li on this one!

🔗 Check out the research to see how on-device AI, powered by solid data, is changing the game: https://dl.acm.org/doi/pdf/10.1145/3706598.3713701

🚀 Stay updated on the latest in privacy-preserving AI—follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/

#OpenSource
#DataPrivacy
#LLM
#Anonymization
#AIsecurity
#HuggingFace
#Ai4Privacy
#Worldslargestopensourceprivacymaskingdataset

reacted to their post with ❤️👀🚀 7 days ago

Post

2295

Building powerful multilingual AI shouldn't mean sacrificing user privacy.

We're highlighting a solution-oriented report from researchers Sahana Naganandh, Vaibhav V, and Thenmozhi M at Vellore Institute of Technology that investigates this exact challenge. The direct connection to our mission is clear: the paper showcases the PII43K dataset as a privacy-preserving alternative to high-risk, raw multilingual data

The report notes that our dataset, with its structured anonymization, is a "useful option for privacy-centric AI applications." It's always a delight when academic research independently validates our data-first approach to solving real-world privacy problems.

This is how we build a safer AI future together.

🔗 Read the full report here to learn more: https://assets.cureusjournals.com/artifacts/upload/technical_report/pdf/3689/20250724-59151-93w9ar.pdf

🚀 Stay updated on the latest in privacy-preserving AI—follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/

#OpenSource
#DataPrivacy
#LLM
#Anonymization
#AIsecurity
#HuggingFace
#Ai4Privacy
#Worldslargestopensourceprivacymaskingdataset

1 reply

·

posted an update 7 days ago

Post

2295

Building powerful multilingual AI shouldn't mean sacrificing user privacy.

We're highlighting a solution-oriented report from researchers Sahana Naganandh, Vaibhav V, and Thenmozhi M at Vellore Institute of Technology that investigates this exact challenge. The direct connection to our mission is clear: the paper showcases the PII43K dataset as a privacy-preserving alternative to high-risk, raw multilingual data

The report notes that our dataset, with its structured anonymization, is a "useful option for privacy-centric AI applications." It's always a delight when academic research independently validates our data-first approach to solving real-world privacy problems.

This is how we build a safer AI future together.

🔗 Read the full report here to learn more: https://assets.cureusjournals.com/artifacts/upload/technical_report/pdf/3689/20250724-59151-93w9ar.pdf

🚀 Stay updated on the latest in privacy-preserving AI—follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/

#OpenSource
#DataPrivacy
#LLM
#Anonymization
#AIsecurity
#HuggingFace
#Ai4Privacy
#Worldslargestopensourceprivacymaskingdataset

1 reply

·

reacted to their post with 🔥 12 days ago

Post

1315

We can't build more private AI if we can't measure privacy intelligence.

That's why we're highlighting the Priv-IQ benchmark, a new, solution-oriented framework for evaluating LLMs on eight key privacy competencies, from visual privacy to knowledge of privacy law. The direct connection to our work is clear: the researchers relied on samples from the Ai4Privacy dataset to build out questions for Privacy Risk Assessment and Multilingual Entity Recognition.

This is the power of open-source collaboration. We provide the data building blocks, and researchers construct powerful new evaluation tools on top of them. It's a win-win for the entire ecosystem when we can all benefit from transparent, data-driven benchmarks that help push for better, safer AI.

Kudos to Sakib Shahriar and Rozita A. Dara for this important contribution. Read the paper to see the results: https://www.proquest.com/docview/3170854914?pq-origsite=gscholar&fromopenview=true&sourcetype=Scholarly%20Journals

#OpenSource
#DataPrivacy
#LLM
#Anonymization
#AIsecurity
#HuggingFace
#Ai4Privacy
#Worldslargestopensourceprivacymaskingdataset

reacted to their post with 👀 12 days ago

Post

192

Traditional data leak prevention is failing. A new paper has a solution-oriented approach inspired by evolution.

The paper introduces a genetic-algorithm-driven method for detecting data leaks. To prove its effectiveness, the researchers Anatoliy Sachenko, Petro V., Oleg Savenko, Viktor Ostroverkhov, Bogdan Maslyyak from Casimir Pulaski Radom University and others needed a real-world, complex PII dataset. We're proud that the AI4Privacy PII 300k dataset was used as a key benchmark for their experiments.

This is the power of open-source collaboration. We provide complex, real-world data challenges, and brilliant researchers develop and share better solutions to solve them. It's a win for every organization when this research helps pave the way for more adaptive and intelligent Data Loss Prevention systems.

🔗 Read the full paper to see the data and learn how genetic algorithms are making a difference in cybersecurity: https://ceur-ws.org/Vol-4005/paper19.pdf

#OpenSource
#DataPrivacy
#LLM
#Anonymization
#AIsecurity
#HuggingFace
#Ai4Privacy
#Worldslargestopensourceprivacymaskingdataset

posted an update 13 days ago

Post

1315

We can't build more private AI if we can't measure privacy intelligence.

That's why we're highlighting the Priv-IQ benchmark, a new, solution-oriented framework for evaluating LLMs on eight key privacy competencies, from visual privacy to knowledge of privacy law. The direct connection to our work is clear: the researchers relied on samples from the Ai4Privacy dataset to build out questions for Privacy Risk Assessment and Multilingual Entity Recognition.

This is the power of open-source collaboration. We provide the data building blocks, and researchers construct powerful new evaluation tools on top of them. It's a win-win for the entire ecosystem when we can all benefit from transparent, data-driven benchmarks that help push for better, safer AI.

Kudos to Sakib Shahriar and Rozita A. Dara for this important contribution. Read the paper to see the results: https://www.proquest.com/docview/3170854914?pq-origsite=gscholar&fromopenview=true&sourcetype=Scholarly%20Journals

#OpenSource
#DataPrivacy
#LLM
#Anonymization
#AIsecurity
#HuggingFace
#Ai4Privacy
#Worldslargestopensourceprivacymaskingdataset

posted an update 14 days ago

Post

192

Traditional data leak prevention is failing. A new paper has a solution-oriented approach inspired by evolution.

The paper introduces a genetic-algorithm-driven method for detecting data leaks. To prove its effectiveness, the researchers Anatoliy Sachenko, Petro V., Oleg Savenko, Viktor Ostroverkhov, Bogdan Maslyyak from Casimir Pulaski Radom University and others needed a real-world, complex PII dataset. We're proud that the AI4Privacy PII 300k dataset was used as a key benchmark for their experiments.

This is the power of open-source collaboration. We provide complex, real-world data challenges, and brilliant researchers develop and share better solutions to solve them. It's a win for every organization when this research helps pave the way for more adaptive and intelligent Data Loss Prevention systems.

🔗 Read the full paper to see the data and learn how genetic algorithms are making a difference in cybersecurity: https://ceur-ws.org/Vol-4005/paper19.pdf

#OpenSource
#DataPrivacy
#LLM
#Anonymization
#AIsecurity
#HuggingFace
#Ai4Privacy
#Worldslargestopensourceprivacymaskingdataset

reacted to their post with ❤️ 19 days ago

Post

1977

Anonymizing a prompt is half the battle. Reliably de-anonymizing the response is the other.

To build a truly reliable privacy pipeline, you have to test it. A new Master's thesis does just that, and our data was there for every step.

We're excited to showcase this work on handling confidential data in LLM prompts from Nedim Karavdic at Mälardalen University. To build their PII anonymization pipeline, they first trained a custom NER model. We're proud that the Ai4Privacy pii-masking-200k dataset was used as the foundational training data for this critical first step.

But it didn't stop there. The research also used our dataset to create the parallel data needed to train and test the generative "Seek" models for de-anonymization. It's a win-win when our open-source data not only helps build the proposed "better solution" but also helps prove why it's better by enabling a rigorous, data-driven comparison.

🔗 Check out the full thesis for a great deep-dive into building a practical, end-to-end privacy solution: https://www.diva-portal.org/smash/get/diva2:1980696/FULLTEXT01.pdf

#OpenSource
#DataPrivacy
#LLM
#Anonymization
#AIsecurity
#HuggingFace
#Ai4Privacy
#Worldslargestopensourceprivacymaskingdataset

posted an update 20 days ago

Post

1977

Anonymizing a prompt is half the battle. Reliably de-anonymizing the response is the other.

To build a truly reliable privacy pipeline, you have to test it. A new Master's thesis does just that, and our data was there for every step.

We're excited to showcase this work on handling confidential data in LLM prompts from Nedim Karavdic at Mälardalen University. To build their PII anonymization pipeline, they first trained a custom NER model. We're proud that the Ai4Privacy pii-masking-200k dataset was used as the foundational training data for this critical first step.

But it didn't stop there. The research also used our dataset to create the parallel data needed to train and test the generative "Seek" models for de-anonymization. It's a win-win when our open-source data not only helps build the proposed "better solution" but also helps prove why it's better by enabling a rigorous, data-driven comparison.

🔗 Check out the full thesis for a great deep-dive into building a practical, end-to-end privacy solution: https://www.diva-portal.org/smash/get/diva2:1980696/FULLTEXT01.pdf

#OpenSource
#DataPrivacy
#LLM
#Anonymization
#AIsecurity
#HuggingFace
#Ai4Privacy
#Worldslargestopensourceprivacymaskingdataset

reacted to their post with 👀👀 20 days ago

Post

4203

What if an AI agent could be tricked into stealing your data, just by reading a tool's description? A new paper reports it's possible.

The "Attractive Metadata Attack" paper details this stealthy new threat. To measure the real-world impact of their attack, the researchers needed a source of sensitive data for the agent to leak. We're proud that the AI4Privacy corpus was used to create the synthetic user profiles containing standardized PII for their experiments.

This is a perfect win-win. Our open-source data helped researchers Kanghua Mo, 龙昱丞, Zhihao Li from Guangzhou University and The Hong Kong Polytechnic University to not just demonstrate a new attack, but also quantify its potential for harm. This data-driven evidence is what pushes the community to build better, execution-level defenses for AI agents.

🔗 Check out their paper to see how easily an agent's trust in tool metadata could be exploited: https://arxiv.org/pdf/2508.02110

#OpenSource
#DataPrivacy
#LLM
#Anonymization
#AIsecurity
#HuggingFace
#Ai4Privacy
#Worldslargestopensourceprivacymaskingdataset

reacted to their post with 🔥 21 days ago

Post

1305

The tools we use to audit AI for privacy might be easier to fool than we think.

We're highlighting a critical paper that introduces "PoisonM," a novel attack that could make Membership Inference tests unreliable. The direct connection to our work is explicit: the researchers, Neal M., Atul Prakash, Amrita Roy Chowdhury, Ashish Hooda, Kassem Fawaz, Somesh Jha, Zhuohang Li, and Brad Malin used the AI4Privacy dataset as the "canary" dataset in their experiments to test the effectiveness of their attack on realistic, sensitive information.

This is the power of a healthy open-source ecosystem. We provide the foundational data that helps researchers pressure-test our collective assumptions about AI safety. It's a win for everyone when this leads to a more honest conversation about what our tools can and can't do, pushing us all to create better solutions.

🔗 Read the full paper to understand the fundamental flaws in current MI testing: https://arxiv.org/pdf/2506.06003

#OpenSource
#DataPrivacy
#LLM
#Anonymization
#AIsecurity
#HuggingFace
#Ai4Privacy
#Worldslargestopensourceprivacymaskingdataset

reacted to their post with 🔥 21 days ago

Post

4203

What if an AI agent could be tricked into stealing your data, just by reading a tool's description? A new paper reports it's possible.

The "Attractive Metadata Attack" paper details this stealthy new threat. To measure the real-world impact of their attack, the researchers needed a source of sensitive data for the agent to leak. We're proud that the AI4Privacy corpus was used to create the synthetic user profiles containing standardized PII for their experiments.

This is a perfect win-win. Our open-source data helped researchers Kanghua Mo, 龙昱丞, Zhihao Li from Guangzhou University and The Hong Kong Polytechnic University to not just demonstrate a new attack, but also quantify its potential for harm. This data-driven evidence is what pushes the community to build better, execution-level defenses for AI agents.

🔗 Check out their paper to see how easily an agent's trust in tool metadata could be exploited: https://arxiv.org/pdf/2508.02110

#OpenSource
#DataPrivacy
#LLM
#Anonymization
#AIsecurity
#HuggingFace
#Ai4Privacy
#Worldslargestopensourceprivacymaskingdataset

posted an update 21 days ago

Post

1305

The tools we use to audit AI for privacy might be easier to fool than we think.

We're highlighting a critical paper that introduces "PoisonM," a novel attack that could make Membership Inference tests unreliable. The direct connection to our work is explicit: the researchers, Neal M., Atul Prakash, Amrita Roy Chowdhury, Ashish Hooda, Kassem Fawaz, Somesh Jha, Zhuohang Li, and Brad Malin used the AI4Privacy dataset as the "canary" dataset in their experiments to test the effectiveness of their attack on realistic, sensitive information.

This is the power of a healthy open-source ecosystem. We provide the foundational data that helps researchers pressure-test our collective assumptions about AI safety. It's a win for everyone when this leads to a more honest conversation about what our tools can and can't do, pushing us all to create better solutions.

🔗 Read the full paper to understand the fundamental flaws in current MI testing: https://arxiv.org/pdf/2506.06003

#OpenSource
#DataPrivacy
#LLM
#Anonymization
#AIsecurity
#HuggingFace
#Ai4Privacy
#Worldslargestopensourceprivacymaskingdataset

posted an update 24 days ago

Post

4203

What if an AI agent could be tricked into stealing your data, just by reading a tool's description? A new paper reports it's possible.

The "Attractive Metadata Attack" paper details this stealthy new threat. To measure the real-world impact of their attack, the researchers needed a source of sensitive data for the agent to leak. We're proud that the AI4Privacy corpus was used to create the synthetic user profiles containing standardized PII for their experiments.

This is a perfect win-win. Our open-source data helped researchers Kanghua Mo, 龙昱丞, Zhihao Li from Guangzhou University and The Hong Kong Polytechnic University to not just demonstrate a new attack, but also quantify its potential for harm. This data-driven evidence is what pushes the community to build better, execution-level defenses for AI agents.

🔗 Check out their paper to see how easily an agent's trust in tool metadata could be exploited: https://arxiv.org/pdf/2508.02110

#OpenSource
#DataPrivacy
#LLM
#Anonymization
#AIsecurity
#HuggingFace
#Ai4Privacy
#Worldslargestopensourceprivacymaskingdataset

reacted to their post with 👀 27 days ago

Post

1817

How do you protect your prompts without breaking them? You need a smart sanitizer. A new system called Prϵϵmpt shows how.

The first, critical step in their solution is a high-performance Named Entity Recognition (NER) model to find the sensitive data. We're proud to see that these researchers, Amrita Roy Chowdhury, David Glukhov, Divyam Anshumaan, Prasad Chalasani, Nicolas Papernot, Somesh Jha, and Mihir Bellare from the University of Michigan, University of Toronto, University of Wisconsin-Madison, University of California, San Diego - Rady School of Management and Langroid Incorporated fine-tuned their NER model on 10 high-risk categories from the AI4Privacy dataset.

This is a perfect win-win. Our open-source data helps provide the foundation for the critical detection engine, which in turn enables the community to build and test better solutions like Prϵϵmpt's innovative use of encryption and Differential Privacy.

🔗 Check out their paper for a deep dive into a formally private, high-utility prompt sanitizer: https://arxiv.org/pdf/2504.05147

#OpenSource
#DataPrivacy
#LLM
#Anonymization
#AIsecurity
#HuggingFace
#Ai4Privacy
#Worldslargestopensourceprivacymaskingdataset

Michael Anthony PRO

AI & ML interests

Recent Activity

Organizations

Michael Anthony PRO

AI & ML interests

Recent Activity

Organizations

MikeDoes's activity