System Architecture

High-Level Overview

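Astrophage is organized as a linear pipeline: the raw KOI CSV passes through feature engineering, then a two-stage Random Forest classifier, and ends in three-class predictions plus a JSON report: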

graph LR
    subgraph "Input"
        A[Raw KOI CSV<br/>28 columns]
    end

    subgraph "Feature Engineering"
        B[Base Features<br/>28 columns]
        C[Derived Features<br/>8 interactions]
        D[Preprocessing<br/>impute + standardize]
    end

    subgraph "Two-Stage Model"
        E[Stage 1 RF<br/>CONFIRMED vs NOT]
        F[Stage 2 RF<br/>CANDIDATE vs FALSE_POSITIVE]
    end

    subgraph "Output"
        G[Predictions<br/>3 classes]
        H[JSON Report<br/>metrics + insights]
    end

    A --> B
    B --> C
    C --> D
    D --> E
    E -->|CONFIRMED| G
    E -->|NOT| F
    F -->|CANDIDATE| G
    F -->|FALSE_POSITIVE| G
    G --> H

    style A fill:#e74c3c,stroke:#c0392c,color:#fff
    style D fill:#f39c12,stroke:#e67e22,color:#fff
    style E fill:#2ecc71,stroke:#27ae60,color:#fff
    style F fill:#3498db,stroke:#2980b9,color:#fff
    style H fill:#9b59b6,stroke:#8e44ad,color:#fff
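
The routing in the diagram above can be sketched in a few lines of Rust. This is a minimal illustration, not the project's actual code: the `Disposition` enum, the `BinaryClassifier` trait, the `classify` function, and the toy `Threshold` classifier are all hypothetical names introduced here. It shows only the control flow: Stage 2 is consulted solely for samples that Stage 1 labels NOT.

```rust
/// The three final classes produced by the pipeline.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Disposition {
    Confirmed,
    Candidate,
    FalsePositive,
}

/// Hypothetical stand-in for a trained binary model (an assumption,
/// not the project's real interface).
pub trait BinaryClassifier {
    /// Returns `true` for the positive class.
    fn predict(&self, features: &[f64]) -> bool;
}

/// Stage 1 separates CONFIRMED from NOT; only NOT samples reach Stage 2,
/// which separates CANDIDATE from FALSE_POSITIVE.
pub fn classify<S1, S2>(stage1: &S1, stage2: &S2, features: &[f64]) -> Disposition
where
    S1: BinaryClassifier,
    S2: BinaryClassifier,
{
    if stage1.predict(features) {
        Disposition::Confirmed
    } else if stage2.predict(features) {
        Disposition::Candidate
    } else {
        Disposition::FalsePositive
    }
}

/// Toy classifier for illustration: positive when one feature exceeds a threshold.
pub struct Threshold {
    pub index: usize,
    pub threshold: f64,
}

impl BinaryClassifier for Threshold {
    fn predict(&self, features: &[f64]) -> bool {
        features[self.index] > self.threshold
    }
}
```

With two `Threshold` stand-ins in place of the trained forests, `classify(&s1, &s2, &sample)` returns exactly one of the three dispositions for every sample, which matches the single "Predictions" node in the diagram.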

Data Flow

sequenceDiagram
    participant User as User
    participant Main as main.rs
    participant Data as data.rs
    participant Features as features.rs
    participant Stage1 as Stage 1 RF
    participant Stage2 as Stage 2 RF
    participant Eval as evaluation.rs
    participant Report as report.rs

    User->>Main: cargo run --release
    Main->>Data: load("data/koi_dataset.csv")
    Data-->>Main: KoiDataset (9,564 samples)

    Main->>Features: process(&dataset)
    Features->>Features: impute_missing()
    Features->>Features: standardize()
    Features->>Features: compute_derived()
    Features-->>Main: ProcessedDataset (36 features)

    Main->>Main: split(0.2, seed=42)
    Note over Main: 80/20 stratified split

    Main->>Stage1: train(&train_data)
    Note over Stage1: Binary: CONFIRMED=1, NOT=0
    Stage1-->>Main: Stage 1 trained

    Main->>Stage2: train(&train_stage2)
    Note over Stage2: Binary: CANDIDATE=1, FALSE_POSITIVE=0
    Stage2-->>Main: Stage 2 trained

    Main->>Eval: evaluate(&classifier, &test)
    Eval-->>Main: Metrics (accuracy, F1, etc.)

    Main->>Report: generate_report(&metrics, &classifier)
    Report-->>Main: output/report.json
    Main-->>User: Done!
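
The `impute_missing()` and `standardize()` steps in the sequence above can be illustrated on a single feature column. The sketch below is an assumption-laden example rather than the code in `features.rs`: it assumes missing values are stored as `NaN`, that imputation fills with the column mean, and that standardization uses the population standard deviation. The function name `impute_mean` is hypothetical.

```rust
/// Replace NaN entries with the mean of the observed values.
/// Mean imputation is an assumed strategy; the real pipeline may differ.
pub fn impute_mean(column: &mut [f64]) {
    let observed: Vec<f64> = column.iter().copied().filter(|v| !v.is_nan()).collect();
    if observed.is_empty() {
        return; // no observed values to derive a fill value from
    }
    let mean = observed.iter().sum::<f64>() / observed.len() as f64;
    for v in column.iter_mut() {
        if v.is_nan() {
            *v = mean;
        }
    }
}

/// Rescale a column to zero mean and unit variance
/// (population standard deviation, i.e. dividing by n).
pub fn standardize(column: &mut [f64]) {
    if column.is_empty() {
        return;
    }
    let n = column.len() as f64;
    let mean = column.iter().sum::<f64>() / n;
    let variance = column.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    let std_dev = variance.sqrt();
    for v in column.iter_mut() {
        // A constant column has zero spread, so it is only centered.
        *v = if std_dev == 0.0 { *v - mean } else { (*v - mean) / std_dev };
    }
}
```

Calling `impute_mean` before `standardize` matters: a single `NaN` left in the column would propagate into the mean and turn every standardized value into `NaN`.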

Random Forest Internals

Single Decision Tree

graph TD
    A[Root Node<br/>Gini = 0.65] -->|fpflag_sum < 0.5| B[Left: Gini = 0.15]
    A -->|fpflag_sum >= 0.5| C[Right: Gini = 0.05]

    B -->|koi_model_snr < 2.0| D[Leaf: CANDIDATE]
    B -->|koi_model_snr >= 2.0| E[Leaf: CONFIRMED]

    C -->|koi_prad < 15.0| F[Leaf: FALSE_POSITIVE]
    C -->|koi_prad >= 15.0| G[Leaf: FALSE_POSITIVE]

    style D fill:#3498db,stroke:#2980b9,color:#fff
    style E fill:#2ecc71,stroke:#27ae60,color:#fff
    style F fill:#e74c3c,stroke:#c0392c,color:#fff
    style G fill:#e74c3c,stroke:#c0392c,color:#fff
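
The Gini values on the nodes above follow the standard definition, one minus the sum of squared class proportions. The sketch below computes it from per-class sample counts and also scores a candidate split by the sample-weighted impurity of its children. The function names `gini` and `split_impurity` are hypothetical and not taken from `decision_tree.rs`.

```rust
/// Gini impurity of a node from per-class sample counts: 1 - sum(p_k^2).
/// An empty node is treated as pure (impurity 0).
pub fn gini(counts: &[usize]) -> f64 {
    let total: usize = counts.iter().sum();
    if total == 0 {
        return 0.0;
    }
    let sum_of_squares: f64 = counts
        .iter()
        .map(|&c| {
            let p = c as f64 / total as f64;
            p * p
        })
        .sum();
    1.0 - sum_of_squares
}

/// Sample-weighted Gini impurity of a split into left and right children.
/// A tree builder would typically pick the split that minimizes this value.
pub fn split_impurity(left: &[usize], right: &[usize]) -> f64 {
    let n_left: usize = left.iter().sum();
    let n_right: usize = right.iter().sum();
    let n = (n_left + n_right) as f64;
    if n == 0.0 {
        return 0.0;
    }
    (n_left as f64 / n) * gini(left) + (n_right as f64 / n) * gini(right)
}
```

For three equally frequent classes `gini` returns 2/3, the upper bound for a three-class node, so a root value of 0.65 corresponds to a nearly balanced mix of classes.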

Ensemble Voting

graph TD
    A[Sample Input] --> B[Tree 1]
    A --> C[Tree 2]
    A --> D[Tree 3]
    A --> E[...]
    A --> F[Tree N]

    B -->|CONFIRMED| G[Voting Box]
    C -->|CONFIRMED| G
    D -->|CANDIDATE| G
    E -->|CONFIRMED| G
    F -->|CONFIRMED| G

    G -->|Majority Vote| H[Final: CONFIRMED]

    style G fill:#f39c12,stroke:#e67e22,color:#fff
    style H fill:#2ecc71,stroke:#27ae60,color:#fff
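
The voting step above amounts to counting the label each tree emits and returning the most frequent one. A minimal sketch follows, assuming labels are encoded as `u8` values and that ties are broken in favor of the smallest label; both the `majority_vote` name and the tie-breaking rule are assumptions made for this example, not documented behavior of `model.rs`.

```rust
use std::collections::BTreeMap;

/// Return the label that received the most votes, or `None` when there
/// are no votes. Ties go to the smallest label (an assumed rule).
pub fn majority_vote(votes: &[u8]) -> Option<u8> {
    let mut counts: BTreeMap<u8, usize> = BTreeMap::new();
    for &label in votes {
        *counts.entry(label).or_insert(0) += 1;
    }

    let mut best: Option<(u8, usize)> = None;
    // BTreeMap iterates in ascending key order, so requiring a strictly
    // larger count keeps the smallest label when counts are tied.
    for (label, count) in counts {
        if best.map_or(true, |(_, best_count)| count > best_count) {
            best = Some((label, count));
        }
    }
    best.map(|(label, _)| label)
}
```

Encoding CONFIRMED as `1` and CANDIDATE as `0`, the five votes in the diagram become `[1, 1, 0, 1, 1]`, and `majority_vote` returns `Some(1)`.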

Technology Layers

graph TB
    subgraph "Application Layer"
        A1[main.rs - CLI & Orchestration]
        A2[report.rs - JSON Generation]
    end

    subgraph "ML Layer"
        M1[two_stage_model.rs - Pipeline]
        M2[model.rs - Random Forest]
        M3[decision_tree.rs - Trees]
    end

    subgraph "Data Layer"
        D1[features.rs - Engineering]
        D2[data.rs - Loading]
    end

    subgraph "Infrastructure Layer"
        I1[Polars - DataFrame I/O]
        I2[NDArray - Vectorized Math]
        I3[Tokio - Async Runtime]
        I4[Serde - Serialization]
    end

    A1 --> M1
    A1 --> A2
    M1 --> M2
    M2 --> M3
    M1 --> D1
    D1 --> D2
    D2 --> I1
    D1 --> I2
    A1 --> I3
    A2 --> I4

Performance Comparison

graph LR
    subgraph "Training Time"
        A[Astrophage<br/>Rust: ~30s]
        B[sklearn RF<br/>Python: ~120s]
    end

    subgraph "Inference Time"
        C[Astrophage<br/>~1ms/sample]
        D[sklearn RF<br/>~10ms/sample]
    end

    subgraph "Binary Size"
        E[Astrophage<br/>~2MB]
        F[sklearn env<br/>~500MB+]
    end

    style A fill:#2ecc71,stroke:#27ae60,color:#fff
    style C fill:#2ecc71,stroke:#27ae60,color:#fff
    style E fill:#2ecc71,stroke:#27ae60,color:#fff

Memory Layout

graph TD
    subgraph "Training Data"
        A["Features Array2<br/>f64 x (n_samples x 36)"]
        B[Labels Array1<br/>u8 x n_samples]
    end

    subgraph "Stage 1 Model"
        C[100 Decision Trees]
        C1[Tree 1: ~50 nodes]
        C2[Tree 2: ~50 nodes]
        C3[Tree N: ~50 nodes]
    end

    subgraph "Stage 2 Model"
        D[100 Decision Trees]
        D1[Tree 1: ~50 nodes]
        D2[Tree 2: ~50 nodes]
        D3[Tree N: ~50 nodes]
    end

    A --> C
    A --> D
    B --> C
    B --> D
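
The element types and shapes in the diagram above are enough for a rough size estimate of the training data. The sketch below counts raw element bytes only and ignores container overhead; `N_FEATURES` and `training_data_bytes` are hypothetical names introduced for this example.

```rust
use std::mem::size_of;

/// Feature count after engineering (28 base + 8 derived).
pub const N_FEATURES: usize = 36;

/// Raw bytes for an f64 feature matrix of shape (n_samples, 36)
/// plus a u8 label vector of length n_samples.
pub fn training_data_bytes(n_samples: usize) -> usize {
    let features = n_samples * N_FEATURES * size_of::<f64>();
    let labels = n_samples * size_of::<u8>();
    features + labels
}
```

Each sample costs 289 bytes (36 × 8 for the features plus 1 for the label), so all 9,564 samples occupy 2,763,996 bytes, roughly 2.8 MB.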
