---
title: WiktionaryDE
emoji: 🐠
colorFrom: indigo
colorTo: pink
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: cc-by-sa-3.0
---

# 🇩🇪 WiktionaryDE - German Linguistics Hub

[![License: CC-BY-SA-3.0](https://img.shields.io/badge/License-CC--BY--SA%203.0-lightgrey.svg)](https://creativecommons.org/licenses/by-sa/3.0/)
[![Gradio](https://img.shields.io/badge/Gradio-5.49.1-orange)](https://gradio.app/)

A multi-tool for German linguistic analysis that combines German Wiktionary database queries with several morphological engines and semantic knowledge bases in a single interface.

## 🎯 Overview

This Space aggregates multiple German NLP tools and databases to provide:

- Deep morphological analysis of German words
- Contextual sentence analysis with semantic ranking
- Full inflection tables (declensions and conjugations)
- Thesaurus and semantic relation discovery
- Grammar and spelling checking

## 🛠️ Tools & Data Sources

### Core Databases

- **Wiktionary Database**: The 3.7GB `cstr/de-wiktionary-sqlite-normalized` database, used as ground truth for lemmas, inflected forms, definitions, examples, and pronunciation
- **OdeNet (WordNet)**: German thesaurus for synonyms, antonyms, hypernyms, etc.
- **ConceptNet**: Multilingual knowledge graph for semantic relations

### Morphological Engines

- **DWDSmor**: High-precision FST-based analyzer from `zentrum-lexikographie/dwdsmor-open`
- **HanTa**: Hanover Tagger for robust morphological analysis and lemmatization
- **spaCy-IWNLP**: `de_core_news_md` combined with IWNLP for spaCy-based analysis
- **Pattern.de**: Full inflection table generation

### Additional Tools

- **LanguageTool**: German grammar and spelling checks

## 📖 Main Features

### 1. Word Encyclopedia (DE)

The primary non-contextual tool for analyzing single words.

**What it does:**

- Finds all possible analyses (e.g., "Lauf" as a noun vs. "lauf" as a verb)
- Aggregates data from all engines and databases
- Cross-validates results to filter out artifacts
- Provides complete morphological, semantic, and inflectional information

**Engine Options:**

- **Wiktionary** (Default): Most accurate, database-driven
- **DWDSmor**: High-precision formal grammar
- **HanTa**: Robust tagger-based
- **IWNLP**: spaCy-based analysis

If the selected engine returns no result, the tool automatically falls back to the other engines.

### 2. Comprehensive Analyzer (DE)

Full sentence analysis with contextual disambiguation.

**Features:**

- Uses spaCy to parse sentences and extract lemmas
- Runs the full Word Encyclopedia analysis on each lemma
- **Contextual Ranking**: Uses sentence similarity to rank semantic senses by their relevance to the full sentence
- Provides an integrated analysis of all words in context

### 3. Individual Engine Tabs

Direct access to raw outputs from:

- Wiktionary
- DWDSmor
- HanTa
- IWNLP

Useful for comparing the output of individual engines.

### 4. Component Tools

Raw access to specialized tools:

- **spaCy**: Dependency parsing and NER
- **Grammar**: LanguageTool checking
- **Inflections**: Pattern.de inflection tables
- **Thesaurus**: OdeNet relations
- **ConceptNet**: Semantic knowledge graph

## ⚙️ Technical Details

- **SDK**: Gradio 5.49.1
- **Database Size**: 3.7GB (Wiktionary SQLite)
- **Processing**: Multi-engine pipeline with automatic fallback
- **Quality Control**: Basic cross-validation between engines to filter artifacts

## 📝 License

The code for this Gradio interface is licensed under [CC-BY-SA-3.0](https://creativecommons.org/licenses/by-sa/3.0/).

The underlying models and data sources retain their original licenses:

- Wiktionary: CC-BY-SA
- DWDSmor: Open license (zentrum-lexikographie)
- HanTa: Various open licenses
- spaCy models: MIT License
- OdeNet: CC-BY-SA
- ConceptNet: CC-BY-SA

**Note**: This is a simple educational tool and a work in progress. Many results may be inconsistent or faulty.