Aggregation & Statistics

The Pāṇini aggregation system allows you to transform atomic extractions (e.g., analyzed sentences) into global insights about a text corpus or a learner profile.

1. Role and Concept

The goal of aggregation is to answer questions like:

"What are the most frequent grammatical cases in this text?"
"What is the coverage rate of Polish morphological features in this corpus?"
"What are the most used verb lemmas?"

The system relies on two types of data:

Distributions (Closed Sets): Used for enums (e.g., Case, Gender). They allow calculating Coverage (e.g., "I've seen 4 out of 7 possible cases").
Inventories (Open Sets): Used for free strings (e.g., Lemmas, Roots). They allow counting unique occurrences.

2. Usage: The Aggregable Pipeline

To be aggregated, an object must implement the Aggregable trait. This is automatic for morphology enums derived with #[derive(MorphologyInfo)].

graph LR
    Feature1["Word: 'kot' Case: Nom"]
    Feature2["Word: 'psa' Case: Acc"]
    Feature3["Word: 'dom' Case: Nom"]

    Aggregator[BasicAggregator]

    Feature1 & Feature2 & Feature3 --> Aggregator

    Aggregator --> Result[AggregationResult]

    subgraph Statistics ["Statistics"]
        Result --> C1["Nom: 2, Acc: 1"]
        Result --> C2["Case Coverage: 2/7"]
    end

3. The AggregationResult Object

The AggregationResult is the final product of an aggregation. It contains data grouped by categories (e.g., "Noun", "Verb", "morpheme").

Example Console Output:

[NOUN] total: 15
  |- case [3/7]: nominative(8), accusative(5), genitive(2)
  |- gender [2/3]: feminine(10), masculine(5)
  |- lemma [8 unique]: studentka(3), biblioteka(2)...

Programmatic Access (Rust)

let result: AggregationResult = features.into_iter().collect();

println!("Total items: {}", result.total_count());
println!("Group count: {}", result.group_count());

// Access a specific group
if let Some(noun_stats) = result.by_group.get("Noun") {
    // Analyze dimensions...
}

4. Technical Importance: Closed Sets

Why use ClosedValues?

For a dimension (e.g., Case) to calculate its coverage rate, it must know the exhaustive list of possibilities. This is the role of #[derive(ClosedValues)] on your linguistic enums. Without it, the system treats data as a simple inventory without any notion of completion percentage.