AI vulnerability Database (AVID)

community

https://avidml.org

AvidMldb

avidml

Activity Feed Request to join this org

AI & ML interests

AI Risk, Responsible AI, LLM, Generative Models

ajibawa-2023

posted an update 1 day ago

Post

1263

C-Code-Large
Dataset: ajibawa-2023/C-Code-Large

C-Code-Large is a large-scale corpus of C programming language source code comprising more than 4 million code samples stored in .jsonl format. The dataset is designed to support research and development in large language model (LLM) pretraining, static analysis, and software engineering automation for the C ecosystem.

By offering a high-volume, language-focused dataset, C-Code-Large enables targeted experimentation in low-level programming, memory-constrained environments, and performance-critical systems, where C continues to be a dominant language.

C-Code-Large addresses the lack of large, curated, C-specific datasets, making it possible to conduct focused research on procedural programming paradigms, manual memory management, and system-level abstractions.

ajibawa-2023

posted an update 14 days ago

Post

3741

Cpp-Code-Large
Dataset: ajibawa-2023/Cpp-Code-Large

Cpp-Code-Large is a large-scale corpus of C++ source code comprising more than 5 million lines of C++ code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and static program analysis for the C++ ecosystem.

By providing a high-volume, language-specific corpus, Cpp-Code-Large enables systematic experimentation in C++-focused model training, domain adaptation, and downstream code understanding tasks.

Cpp-Code-Large addresses the need for a dedicated C++-only dataset at substantial scale, enabling focused research across systems programming, performance-critical applications, embedded systems, game engines, and large-scale native software projects.

3 replies

ajibawa-2023

posted an update 19 days ago

Post

3485

Python-Code-Large
Dataset: ajibawa-2023/Python-Code-Large

Python-Code-Large is a large-scale corpus of Python source code comprising more than 2 million rows of Python code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the Python ecosystem.

By providing a high-volume, language-specific corpus, Python-Code-Large enables systematic experimentation in Python-focused model training, domain adaptation, and downstream code understanding tasks.

Python-Code-Large addresses the need for a dedicated Python-only dataset at substantial scale, enabling focused research across data science, backend systems, automation, scientific computing, and AI-driven Python environments.

1 reply

ajibawa-2023

posted an update 23 days ago

Post

2570

PHP-Code-Large

Dataset: ajibawa-2023/PHP-Code-Large

PHP-Code-Large is a large-scale corpus of PHP source code comprising more than 12 million lines of PHP code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and static program analysis for the PHP ecosystem.

By providing a high-volume, language-specific corpus, PHP-Code-Large enables systematic experimentation in PHP-focused model training, domain adaptation, and downstream code understanding tasks.

PHP-Code-Large addresses the need for a dedicated PHP-only dataset at substantial scale, enabling focused research across backend systems, CMS platforms, APIs, and full-stack PHP environments.

ajibawa-2023

posted an update 28 days ago

Post

3253

JavaScript-Code-Large
ajibawa-2023/JavaScript-Code-Large

JavaScript-Code-Large is a large-scale corpus of JavaScript source code comprising around 5 million JavaScript files. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the JavaScript ecosystem.

By providing a high-volume, language-specific corpus, JavaScript-Code-Large enables systematic experimentation in JavaScript-focused model training, domain adaptation, and downstream code understanding tasks.

JavaScript-Code-Large addresses the need for a dedicated JavaScript-only dataset at substantial scale, enabling focused research across frontend, backend, and full-stack JavaScript environments. .

ajibawa-2023

posted an update 30 days ago

Post

3134

Java-Code-Large ( ajibawa-2023/Java-Code-Large)

Java-Code-Large is a large-scale corpus of publicly available Java source code comprising more than 15 million java codes. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis.

By providing a high-volume, language-specific corpus, Java-Code-Large enables systematic experimentation in Java-focused model training, domain adaptation, and downstream code understanding tasks.

shubhobm

in avid-ml/bert_regard_v2_large 7 months ago

Adding `safetensors` variant of this model

#1 opened 8 months ago by

SFconvertbot

Harsh1729

updated a model 8 months ago

avid-ml/bert_regard_v2_large

0.1B • Updated Sep 5, 2025 • 4.43k

Harsh1729

published a model 8 months ago

avid-ml/bert_regard_v2_large

0.1B • Updated Sep 5, 2025 • 4.43k

carolanderson

updated a Space 11 months ago

Indielabel

🏢

Display a Svelte app with Material Design

ajibawa-2023

posted an update 11 months ago

Post

4644

Hi All, I recently released two Audio datasets which are generated using my earlier released dataset: ajibawa-2023/Children-Stories-Collection

First Audio Dataset:https://huggingface.co/datasets/ajibawa-2023/Audio-Children-Stories-Collection-Large has 5600++ stories in .mp3 format.

Second Audio Dataset:https://huggingface.co/datasets/ajibawa-2023/Audio-Children-Stories-Collection has 600 stories in .mp3 format.

3 replies

borhane

authored 2 papers about 1 year ago

Stop treating `AGI' as the north-star goal of AI research

Paper • 2502.03689 • Published Feb 6, 2025 • 5

Unsocial Intelligence: an Investigation of the Assumptions of AGI Discourse

Paper • 2401.13142 • Published Jan 23, 2024

Jekaterina

authored 2 papers about 1 year ago

INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge

Paper • 2411.19799 • Published Nov 29, 2024 • 16

DEPAC: a Corpus for Depression and Anxiety Detection from Speech

Paper • 2306.12443 • Published Jun 20, 2023

ajibawa-2023

posted an update over 1 year ago

Post

3954

New Dataset: Software-Architecture
Link: ajibawa-2023/Software-Architecture

I am releasing a Large Dataset covering topics related to Software-Architecture. This dataset consists of around 450,000 lines of data in jsonl.

I have included following topics:

Architectural Frameworks

Architectural Patterns for Reliability

Architectural Patterns for Scalability

Architectural Patterns

Architectural Quality Attributes

Architectural Testing

Architectural Views

Architectural Decision-Making

Advanced Research

Cloud-Based Architectures

Component-Based Architecture

Data Architecture

Emerging Trends

Event-Driven Architecture

Evolvability and Maintainability

Microservices and Monolithic

Microservices Architecture

Security Architecture

Service-Oriented Architecture

Software Design Principles

and Many More!

This dataset is useful in LLM development. Also those who are working on developing Software development related LLMs then this dataset can be useful.

This dataset is very useful to Researchers as well.

4 replies

leondz

authored 4 papers over 1 year ago

SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)

Paper • 2006.07235 • Published Jun 12, 2020

AI & ML interests

Team members 13

avid-ml's activity

Adding `safetensors` variant of this model

Indielabel