Official Guide to Data Layers and Nomenclature in Arpia

🎯 Purpose

To establish a common language and a clear structure for classifying, governing, and consuming data in Arpia.
This terminology will be used across the organization, both for technical documentation and business communication, ensuring consistency and shared understanding among teams.


🔹 Data Layers in Arpia

1. Raw (Crude)

  • Definition: Data as it arrives from sources (ERP, POS, IoT, external integrations, files, APIs).
  • Characteristics:
    • No transformation or cleaning applied.
    • May contain duplicates, nulls, or format errors.
    • Represents the exact “snapshot” of the source.
  • Examples:
    • ventas_raw with dates as strings and inconsistent fields.
    • maestro_sociedades_raw exported directly from SAP.
  • Color/Tag: Grey ⚪ with tag RAW.

2. Clean

  • Definition: Data processed by workshops or cleaning/normalization objects, validated and standardized.
  • Characteristics:
    • Correct type casting.
    • Null values handled, duplicates removed.
    • Includes clean master data and normalized tables.
    • May or may not be included in the Knowledge Grid, but must always be tagged and color-coded.
  • Examples:
    • ventas_clean with typed dates and normalized costs.
    • dim_articulo_clean with unified hierarchies.
  • Color/Tag: Blue 🔵 with tag CLEAN.
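
A Raw to Clean step of the kind described above could look roughly like the following Python sketch. It is illustrative only: the column names (fecha, tienda, costo), the ISO date format, the function names, and the policies for rejecting bad dates and defaulting null costs are assumptions, not Arpia's actual cleaning objects.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical raw rows: every value arrives as a string, as in ventas_raw.
RAW_ROWS = [
    {"fecha": "2024-01-05", "tienda": "T01", "costo": "12.50"},
    {"fecha": "2024-01-05", "tienda": "T01", "costo": "12.50"},  # duplicate
    {"fecha": "2024-01-06", "tienda": "T02", "costo": ""},       # null cost
    {"fecha": "not-a-date", "tienda": "T03", "costo": "7.00"},   # format error
]


def parse_date(value: str) -> Optional[date]:
    """Return a typed date, or None when the string is not YYYY-MM-DD."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        return None


def clean_ventas(rows: list) -> list:
    """Cast types, handle nulls, and drop exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        fecha = parse_date(row["fecha"])
        if fecha is None:
            continue  # assumption: rows with unparseable dates are rejected
        # Assumption: a missing cost is stored as 0.0 rather than dropped.
        costo = float(row["costo"]) if row["costo"].strip() else 0.0
        key = (fecha, row["tienda"], costo)
        if key in seen:
            continue  # exact duplicate of a row already kept
        seen.add(key)
        cleaned.append({"fecha": fecha, "tienda": row["tienda"], "costo": costo})
    return cleaned
```

Calling clean_ventas(RAW_ROWS) keeps two of the four sample rows: the duplicate and the row with the malformed date are dropped, and the empty cost is replaced by the assumed default.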

3. Gold (Refined)

  • Definition: Data ready for business consumption and for inclusion in the Knowledge Grid.
  • Characteristics:
    • Aggregated or transformed into KPIs and business metrics.
    • Ambiguity-free: defines the “single source of truth” for sales, budgets, profits, etc.
    • Serves as the foundation for both traditional dashboards and AI assistants.
  • Examples:
    • agg_mes_tienda_categoria with Net Sales, Budget, Compliance, and Profitability.
    • agg_dia_regional_categoria for operational analysis.
  • Color/Tag: Gold 🟡 with tag GOLD.
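
To illustrate how a Gold table such as agg_mes_tienda_categoria might be derived from Clean data, here is a minimal Python sketch that totals net sales by month, store, and category. The field names (fecha, tienda, categoria, venta_neta) and the sample values are hypothetical, and real Gold tables would also carry budget and profitability metrics defined by business rules.

```python
from collections import defaultdict
from datetime import date

# Hypothetical Clean rows: typed dates and numeric amounts.
CLEAN_ROWS = [
    {"fecha": date(2024, 1, 5), "tienda": "T01", "categoria": "A", "venta_neta": 100.0},
    {"fecha": date(2024, 1, 20), "tienda": "T01", "categoria": "A", "venta_neta": 50.0},
    {"fecha": date(2024, 2, 1), "tienda": "T01", "categoria": "A", "venta_neta": 30.0},
    {"fecha": date(2024, 1, 7), "tienda": "T02", "categoria": "B", "venta_neta": 80.0},
]


def aggregate_month_store_category(rows):
    """Sum net sales per (month, store, category) key."""
    totals = defaultdict(float)
    for row in rows:
        month = row["fecha"].strftime("%Y-%m")  # e.g. "2024-01"
        totals[(month, row["tienda"], row["categoria"])] += row["venta_neta"]
    return dict(totals)
```

Each key in the returned dictionary corresponds to one row of the aggregated table, so every dashboard or assistant reading it sees the same figure for a given month, store, and category.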

4. Optimized

  • Definition: Subset of Clean or Gold data, tailored and optimized for a specific process, application, or consumption pattern.
  • Characteristics:
    • Always documented with its associated Knowledge Node.
    • May include indexes, special partitions, or wide structures to accelerate a use case.
    • Maintains traceability back to its Clean/Gold origin.
  • Examples:
    • agg_mes_forecast_ventas optimized for AI predictive models.
    • agg_tienda_dia_dashboard optimized for fast executive dashboards.
  • Color/Tag: Emerald Green 🟢 with tag OPTIMIZED.
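
The two documentation requirements for Optimized datasets (an associated Knowledge Node and traceability to a Clean or Gold origin) lend themselves to an automated check. The sketch below is one possible shape for it; the DatasetRecord class, its fields, the catalog dictionary, and the node name kn_forecast_ventas are all hypothetical and do not describe an existing Arpia API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical catalog entry for one dataset."""
    name: str
    layer: str                             # RAW, CLEAN, GOLD, or OPTIMIZED
    source_dataset: Optional[str] = None   # name of the dataset it derives from
    knowledge_node: Optional[str] = None   # associated Knowledge Node, if any


def validate_optimized(record, catalog):
    """Return a list of rule violations for an OPTIMIZED dataset."""
    problems = []
    if record.layer != "OPTIMIZED":
        return problems  # the checks below only apply to Optimized datasets
    if not record.knowledge_node:
        problems.append("missing Knowledge Node")
    parent = catalog.get(record.source_dataset) if record.source_dataset else None
    if parent is None:
        problems.append("no traceable origin")
    elif parent.layer not in {"CLEAN", "GOLD"}:
        problems.append("origin must be CLEAN or GOLD")
    return problems
```

An empty list means the record satisfies both requirements; any returned message can be surfaced to the process or application owner.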

🔹 General Rules

  1. All datasets must be tagged with their layer (RAW, CLEAN, GOLD, OPTIMIZED).
  2. Official colors must be used in documentation, diagrams, and dashboards to visually reinforce the layer.
  3. Responsibilities:
    • Raw → Clean: Data Engineering.
    • Clean → Gold: Shared between Data Engineering and Data Analysts, who define the business rules.
    • Optimized: Process/application owner, documented in Knowledge Grid.
  4. Knowledge Grid:
    • Only Gold and Optimized enter the Knowledge Grid.
    • Each node must define metrics, hierarchies, joins, and usage policies.
  5. Governance:
    • No dashboard or AI assistant may consume Raw data directly.
    • Clean is for exploratory analysis and validation.
    • Gold/Optimized are the only reference datasets in production.
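
The tagging, Knowledge Grid, and governance rules above can be encoded so that tooling enforces them rather than relying on convention. The following Python sketch shows one way to do so; the Layer enum and the helper function names are hypothetical, and the exploration rule (anything except Raw) is an assumption inferred from rule 5 rather than stated policy.

```python
from enum import Enum


class Layer(Enum):
    """The four layers with their official tag and color."""
    RAW = ("RAW", "Grey")
    CLEAN = ("CLEAN", "Blue")
    GOLD = ("GOLD", "Gold")
    OPTIMIZED = ("OPTIMIZED", "Emerald Green")

    def __init__(self, tag, color):
        self.tag = tag
        self.color = color


# Rule 4: only Gold and Optimized enter the Knowledge Grid.
KNOWLEDGE_GRID_LAYERS = frozenset({Layer.GOLD, Layer.OPTIMIZED})
# Rule 5: Gold and Optimized are the only production reference datasets.
PRODUCTION_LAYERS = frozenset({Layer.GOLD, Layer.OPTIMIZED})


def enters_knowledge_grid(layer):
    return layer in KNOWLEDGE_GRID_LAYERS


def allowed_in_production(layer):
    """True when a production dashboard or AI assistant may read the layer."""
    return layer in PRODUCTION_LAYERS


def allowed_for_exploration(layer):
    # Assumption: exploratory analysis may use any layer except Raw.
    return layer is not Layer.RAW
```

For example, allowed_in_production(Layer.CLEAN) is False while allowed_for_exploration(Layer.CLEAN) is True, which mirrors the distinction between validation work and production consumption.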

🔹 Visual Flow Example