🪐 Astrophage

Two-Stage Random Forest Classifier Model for NASA Kepler Object of Interest (KOI) Exoplanet Validation



What is Astrophage?

Astrophage is a high-performance exoplanet classification system built in Rust using Polars and a custom Two-Stage Random Forest implementation. It classifies Kepler Objects of Interest (KOIs) into three categories:

| Class | Description | Count |
| --- | --- | --- |
| CONFIRMED ✅ | Validated exoplanets with high confidence | 2,747 |
| CANDIDATE 🔍 | Promising signals awaiting follow-up confirmation | 1,978 |
| FALSE POSITIVE ❌ | Non-planetary signals (stellar binaries, instrumental noise, etc.) | 4,839 |

pie title Class Distribution in KOI Dataset
    "FALSE POSITIVE" : 4839
    "CONFIRMED" : 2747
    "CANDIDATE" : 1978

Total Samples: 9,564 | Features: 36 (28 base + 8 derived) | Accuracy: 94.81%


Why Two-Stage?

Our architecture mirrors NASA's actual vetting workflow. Instead of forcing a single model to learn three classes simultaneously, we decompose the problem into two simpler binary decisions:

graph TD
    A[Raw KOI Data<br/>36 Features] --> B[Stage 1: CONFIRMED vs NOT CONFIRMED]
    B -->|CONFIRMED| C[Output: CONFIRMED ✅]
    B -->|NOT CONFIRMED| D[Stage 2: CANDIDATE vs FALSE POSITIVE]
    D -->|CANDIDATE| E[Output: CANDIDATE 🔍]
    D -->|FALSE POSITIVE| F[Output: FALSE POSITIVE ❌]

    style C fill:#2ecc71,stroke:#27ae60,color:#fff
    style E fill:#3498db,stroke:#2980b9,color:#fff
    style F fill:#e74c3c,stroke:#c0392c,color:#fff

This decomposition improves accuracy by roughly 3-4 percentage points over a single-stage classifier (about 91% versus 94.81%) because each stage learns a simpler, cleaner decision boundary.


Key Results

| Metric | Score |
| --- | --- |
| Accuracy | 94.81% |
| Macro F1 | 92.64% |
| Weighted F1 | 94.51% |

graph LR
    subgraph "Overall Metrics"
        A[Accuracy<br/>94.81%]
        B[Macro F1<br/>92.64%]
        C[Weighted F1<br/>94.51%]
    end

    style A fill:#2ecc71,stroke:#27ae60,color:#fff
    style B fill:#3498db,stroke:#2980b9,color:#fff
    style C fill:#9b59b6,stroke:#8e44ad,color:#fff

Quick Start

# Clone
git clone https://github.com/harihar-nautiyal/astrophage.git
cd astrophage

# Build
cargo build --release

# Run
./target/release/astrophage

Or try it in your browser with Google Colab. No installation needed!


Project Structure

graph TD
    A[astrophage/] --> B[Cargo.toml]
    A --> C[data/]
    A --> D[src/]
    A --> E[output/]

    C --> C1[koi_dataset.csv]

    D --> D1[main.rs]
    D --> D2[data.rs]
    D --> D3[features.rs]
    D --> D4[decision_tree.rs]
    D --> D5[model.rs]
    D --> D6[two_stage_model.rs]
    D --> D7[evaluation.rs]
    D --> D8[report.rs]

    E --> E1[report.json]

    style D1 fill:#f39c12,stroke:#e67e22,color:#fff
    style D6 fill:#2ecc71,stroke:#27ae60,color:#fff

Technology Stack

graph LR
    A[Astrophage] --> B[Rust]
    A --> C[Polars]
    A --> D[NDArray]
    A --> E[Tokio]
    A --> F[Serde]

    B --> B1[Memory Safety]
    B --> B2[Zero-Cost Abstractions]
    B --> B3[SIMD-Friendly]

    C --> C1[Fast CSV I/O]
    C --> C2[Columnar Operations]

    D --> D1[Vectorized Math]
    D --> D2[N-Dimensional Arrays]

    E --> E1[Async Runtime]

    F --> F1[JSON Serialization]

"Somewhere, something incredible is waiting to be known."
- Carl Sagan

System Architecture

High-Level Overview

Astrophage follows a clean pipeline architecture from raw data to final predictions:

graph LR
    subgraph "Input"
        A[Raw KOI CSV<br/>28 columns]
    end

    subgraph "Feature Engineering"
        B[Base Features<br/>28 columns]
        C[Derived Features<br/>8 interactions]
        D[Preprocessing<br/>impute + standardize]
    end

    subgraph "Two-Stage Model"
        E[Stage 1 RF<br/>CONFIRMED vs NOT]
        F[Stage 2 RF<br/>CANDIDATE vs FALSE+]
    end

    subgraph "Output"
        G[Predictions<br/>3 classes]
        H[JSON Report<br/>metrics + insights]
    end

    A --> B
    B --> C
    C --> D
    D --> E
    E -->|CONFIRMED| G
    E -->|NOT| F
    F -->|CANDIDATE| G
    F -->|FALSE_POSITIVE| G
    G --> H

    style A fill:#e74c3c,stroke:#c0392c,color:#fff
    style D fill:#f39c12,stroke:#e67e22,color:#fff
    style E fill:#2ecc71,stroke:#27ae60,color:#fff
    style F fill:#3498db,stroke:#2980b9,color:#fff
    style H fill:#9b59b6,stroke:#8e44ad,color:#fff

Data Flow

sequenceDiagram
    participant User as User
    participant Main as main.rs
    participant Data as data.rs
    participant Features as features.rs
    participant Stage1 as Stage 1 RF
    participant Stage2 as Stage 2 RF
    participant Eval as evaluation.rs
    participant Report as report.rs

    User->>Main: cargo run --release
    Main->>Data: load("data/koi_dataset.csv")
    Data-->>Main: KoiDataset (9,564 samples)

    Main->>Features: process(&dataset)
    Features->>Features: impute_missing()
    Features->>Features: standardize()
    Features->>Features: compute_derived()
    Features-->>Main: ProcessedDataset (36 features)

    Main->>Main: split(0.2, seed=42)
    Note over Main: 80/20 stratified split

    Main->>Stage1: train(&train_data)
    Note over Stage1: Binary: CONFIRMED=1, NOT=0
    Stage1-->>Main: Stage 1 trained

    Main->>Stage2: train(&train_stage2)
    Note over Stage2: Binary: CANDIDATE=1, FALSE_POSITIVE=0
    Stage2-->>Main: Stage 2 trained

    Main->>Eval: evaluate(&classifier, &test)
    Eval-->>Main: Metrics (accuracy, F1, etc.)

    Main->>Report: generate_report(&metrics, &classifier)
    Report-->>Main: output/report.json
    Main-->>User: Done!
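
The 80/20 stratified split in the sequence diagram can be sketched as follows. This is a minimal illustration using only the standard library, not the crate's actual `split` implementation: the `stratified_split` function, the small `Lcg` generator, and the rounding rule for the per-class test size are all assumptions made for the example.

```rust
use std::collections::BTreeMap;

/// Tiny deterministic PRNG (64-bit LCG) so the sketch needs no external crates.
/// Hypothetical helper; a real project would more likely use the `rand` crate.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Returns (train_indices, test_indices). Each class contributes
/// `test_ratio` of its rows (rounded) to the test set.
pub fn stratified_split(labels: &[u8], test_ratio: f64, seed: u64) -> (Vec<usize>, Vec<usize>) {
    // Group row indices by class label.
    let mut by_class: BTreeMap<u8, Vec<usize>> = BTreeMap::new();
    for (i, &y) in labels.iter().enumerate() {
        by_class.entry(y).or_default().push(i);
    }

    let mut rng = Lcg(seed);
    let (mut train, mut test) = (Vec::new(), Vec::new());
    for (_, mut idx) in by_class {
        // Fisher-Yates shuffle within the class.
        for i in (1..idx.len()).rev() {
            let j = (rng.next_u64() % (i as u64 + 1)) as usize;
            idx.swap(i, j);
        }
        // Take the first `n_test` shuffled rows of this class as test rows.
        let n_test = ((idx.len() as f64 * test_ratio).round() as usize).min(idx.len());
        test.extend_from_slice(&idx[..n_test]);
        train.extend_from_slice(&idx[n_test..]);
    }
    (train, test)
}
```

Calling `stratified_split(&labels, 0.2, 42)` with the same labels and seed always returns the same partition, which is the property the fixed `seed=42` in the diagram is there to provide.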

Random Forest Internals

Single Decision Tree

graph TD
    A[Root Node<br/>Gini = 0.65] -->|fpflag_sum < 0.5| B[Left: Gini = 0.15]
    A -->|fpflag_sum >= 0.5| C[Right: Gini = 0.05]

    B -->|koi_model_snr < 2.0| D[Leaf: CANDIDATE]
    B -->|koi_model_snr >= 2.0| E[Leaf: CONFIRMED]

    C -->|koi_prad < 15.0| F[Leaf: FALSE_POSITIVE]
    C -->|koi_prad >= 15.0| G[Leaf: FALSE_POSITIVE]

    style D fill:#3498db,stroke:#2980b9,color:#fff
    style E fill:#2ecc71,stroke:#27ae60,color:#fff
    style F fill:#e74c3c,stroke:#c0392c,color:#fff
    style G fill:#e74c3c,stroke:#c0392c,color:#fff
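
The Gini values on the tree nodes above come from the standard impurity formula, 1 minus the sum of squared class proportions. The sketch below shows that calculation and the size-weighted impurity used to compare candidate splits. The function names `gini` and `split_impurity` are illustrative and are not taken from `decision_tree.rs`.

```rust
/// Gini impurity of a node given per-class sample counts: 1 - sum(p_k^2).
pub fn gini(counts: &[usize]) -> f64 {
    let total: usize = counts.iter().sum();
    if total == 0 {
        return 0.0; // An empty node is treated as pure.
    }
    let sum_sq: f64 = counts
        .iter()
        .map(|&c| {
            let p = c as f64 / total as f64;
            p * p
        })
        .sum();
    1.0 - sum_sq
}

/// Size-weighted impurity of a candidate split; the best split minimizes it.
pub fn split_impurity(left: &[usize], right: &[usize]) -> f64 {
    let n_left: usize = left.iter().sum();
    let n_right: usize = right.iter().sum();
    let n = (n_left + n_right) as f64;
    if n == 0.0 {
        return 0.0;
    }
    (n_left as f64 / n) * gini(left) + (n_right as f64 / n) * gini(right)
}
```

For the full KOI class counts, `gini(&[2747, 1978, 4839])` evaluates to roughly 0.62. A split that separates the classes perfectly has a weighted impurity of 0.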

Ensemble Voting

graph TD
    A[Sample Input] --> B[Tree 1]
    A --> C[Tree 2]
    A --> D[Tree 3]
    A --> E[...]
    A --> F[Tree N]

    B -->|CONFIRMED| G[Voting Box]
    C -->|CONFIRMED| G
    D -->|CANDIDATE| G
    E -->|CONFIRMED| G
    F -->|CONFIRMED| G

    G -->|Majority Vote| H[Final: CONFIRMED]

    style G fill:#f39c12,stroke:#e67e22,color:#fff
    style H fill:#2ecc71,stroke:#27ae60,color:#fff
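
The voting step can be sketched in a few lines. This is an illustrative version that assumes class labels are small `u8` codes and that ties go to the smallest label. The real `model.rs` may break ties differently or average per-tree probabilities instead of counting hard votes.

```rust
/// Returns the label with the most votes, or None if there are no votes.
/// Ties are resolved in favour of the smallest label (an assumption of this sketch).
pub fn majority_vote(votes: &[u8]) -> Option<u8> {
    let mut counts = [0usize; 256];
    for &v in votes {
        counts[v as usize] += 1;
    }
    let mut best: Option<(u8, usize)> = None;
    for (label, &count) in counts.iter().enumerate() {
        // Strict `>` keeps the earlier (smaller) label on ties.
        if count > 0 && best.map_or(true, |(_, c)| count > c) {
            best = Some((label as u8, count));
        }
    }
    best.map(|(label, _)| label)
}

/// Fraction of trees voting for `class`; a simple stand-in for a forest probability.
pub fn vote_fraction(votes: &[u8], class: u8) -> f64 {
    if votes.is_empty() {
        return 0.0;
    }
    votes.iter().filter(|&&v| v == class).count() as f64 / votes.len() as f64
}
```

With four trees voting for one class and one tree voting for another, as in the diagram, the majority label wins and its vote fraction is 0.8.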

Technology Layers

graph TB
    subgraph "Application Layer"
        A1[main.rs - CLI & Orchestration]
        A2[report.rs - JSON Generation]
    end

    subgraph "ML Layer"
        M1[two_stage_model.rs - Pipeline]
        M2[model.rs - Random Forest]
        M3[decision_tree.rs - Trees]
    end

    subgraph "Data Layer"
        D1[features.rs - Engineering]
        D2[data.rs - Loading]
    end

    subgraph "Infrastructure Layer"
        I1[Polars - DataFrame I/O]
        I2[NDArray - Vectorized Math]
        I3[Tokio - Async Runtime]
        I4[Serde - Serialization]
    end

    A1 --> M1
    A1 --> A2
    M1 --> M2
    M2 --> M3
    M1 --> D1
    D1 --> D2
    D2 --> I1
    D1 --> I2
    A1 --> I3
    A2 --> I4

Performance Comparison

graph LR
    subgraph "Training Time"
        A[Astrophage<br/>Rust: ~30s]
        B[sklearn RF<br/>Python: ~120s]
    end

    subgraph "Inference Time"
        C[Astrophage<br/>~1ms/sample]
        D[sklearn RF<br/>~10ms/sample]
    end

    subgraph "Binary Size"
        E[Astrophage<br/>~2MB]
        F[sklearn env<br/>~500MB+]
    end

    style A fill:#2ecc71,stroke:#27ae60,color:#fff
    style C fill:#2ecc71,stroke:#27ae60,color:#fff
    style E fill:#2ecc71,stroke:#27ae60,color:#fff

Memory Layout

graph TD
    subgraph "Training Data"
        A[Features Array2<br/>f64 x (n_samples x 36)]
        B[Labels Array1<br/>u8 x n_samples]
    end

    subgraph "Stage 1 Model"
        C[100 Decision Trees]
        C1[Tree 1: ~50 nodes]
        C2[Tree 2: ~50 nodes]
        C3[Tree N: ~50 nodes]
    end

    subgraph "Stage 2 Model"
        D[100 Decision Trees]
        D1[Tree 1: ~50 nodes]
        D2[Tree 2: ~50 nodes]
        D3[Tree N: ~50 nodes]
    end

    A --> C
    A --> D
    B --> C
    B --> D
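
One plausible way to store the roughly 50 nodes of each tree is a flat vector of nodes that refer to their children by index. The `Node` and `Tree` types below are hypothetical and only illustrate the idea. The actual layout in `decision_tree.rs` is not shown in this documentation.

```rust
/// One node of a binary decision tree, stored in a flat Vec (hypothetical layout).
pub enum Node {
    /// Go to `left` if sample[feature] < threshold, otherwise to `right`.
    Split {
        feature: usize,
        threshold: f64,
        left: usize,
        right: usize,
    },
    /// Terminal node holding the predicted class code.
    Leaf { class: u8 },
}

pub struct Tree {
    /// Node 0 is the root; children are indices into this vector.
    pub nodes: Vec<Node>,
}

impl Tree {
    /// Walks from the root to a leaf and returns that leaf's class.
    /// Panics if an index or feature position is out of range.
    pub fn predict(&self, sample: &[f64]) -> u8 {
        let mut i = 0;
        loop {
            match &self.nodes[i] {
                Node::Leaf { class } => return *class,
                Node::Split {
                    feature,
                    threshold,
                    left,
                    right,
                } => {
                    i = if sample[*feature] < *threshold {
                        *left
                    } else {
                        *right
                    };
                }
            }
        }
    }
}
```

Index-based children keep every tree in one contiguous allocation, which tends to be friendlier to the cache than boxed child pointers when 100 trees are evaluated per sample.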

Two-Stage Model

The Core Innovation

Astrophage's defining feature is its Two-Stage Random Forest architecture. Instead of a single three-class classifier, we decompose the problem into two sequential binary decisions, mirroring the way KOIs are vetted in practice.

graph TB
    subgraph "Problem Decomposition"
        A[3-Class Problem<br/>CONFIRMED vs CANDIDATE vs FALSE_POSITIVE]
        B[Stage 1<br/>CONFIRMED vs NOT_CONFIRMED]
        C[Stage 2<br/>CANDIDATE vs FALSE_POSITIVE]
    end

    A -->|Decompose| B
    A -->|Decompose| C

    style A fill:#e74c3c,stroke:#c0392c,color:#fff
    style B fill:#2ecc71,stroke:#27ae60,color:#fff
    style C fill:#3498db,stroke:#2980b9,color:#fff

Why This Works

The Astronomy Perspective

When NASA discovers a KOI, the vetting process is sequential:

  1. First question: "Do we have overwhelming evidence this is a planet?" → If yes, CONFIRMED
  2. Second question: "If not confirmed, is it worth follow-up?" → CANDIDATE or FALSE_POSITIVE

graph LR
    A[Discovery] --> B{Overwhelming<br/>Evidence?}
    B -->|Yes| C[CONFIRMED<br/>Follow-up complete]
    B -->|No| D{Promising<br/>Signal?}
    D -->|Yes| E[CANDIDATE<br/>Needs more data]
    D -->|No| F[FALSE_POSITIVE<br/>Discard]

    style C fill:#2ecc71,stroke:#27ae60,color:#fff
    style E fill:#3498db,stroke:#2980b9,color:#fff
    style F fill:#e74c3c,stroke:#c0392c,color:#fff

The ML Perspective

graph TB
    subgraph "Single-Stage Classifier"
        A1[Decision Boundary<br/>Complex 3-way split]
        A2[Accuracy: ~91%]
        A3[Confusion between<br/>CANDIDATE & FALSE_POSITIVE]
    end

    subgraph "Two-Stage Classifier"
        B1[Stage 1: Simpler binary<br/>separation for CONFIRMED]
        B2[Stage 2: Focused boundary<br/>between CANDIDATE & FALSE+]
        B3[Accuracy: ~94.8%]
        B4[Each stage learns<br/>a cleaner boundary]
    end

    style A2 fill:#e74c3c,stroke:#c0392c,color:#fff
    style B3 fill:#2ecc71,stroke:#27ae60,color:#fff

Stage 1: CONFIRMED vs NOT CONFIRMED

Decision Boundary

Stage 1 separates the "easy" class (CONFIRMED) from everything else. Confirmed planets have very strong, consistent signals:

graph LR
    subgraph "Feature Space"
        A[High SNR]
        B[Zero FP Flags]
        C[Consistent Radius]
        D[Regular Period]
    end

    A --> E[CONFIRMED Zone]
    B --> E
    C --> E
    D --> E

    F[Low SNR] --> G[NOT CONFIRMED Zone]
    H[Non-zero FP Flags] --> G
    I[Inconsistent Radius] --> G
    J[Irregular Period] --> G

    style E fill:#2ecc71,stroke:#27ae60,color:#fff
    style G fill:#e74c3c,stroke:#c0392c,color:#fff

Performance

Stage 1 is nearly perfect because confirmed planets are genuinely distinct:

graph TD
    A[Stage 1 Performance] --> B[Precision: ~99%]
    A --> C[Recall: ~98%]
    A --> D[F1: ~99%]

    style B fill:#2ecc71,stroke:#27ae60,color:#fff
    style C fill:#2ecc71,stroke:#27ae60,color:#fff
    style D fill:#2ecc71,stroke:#27ae60,color:#fff

Key Features

The most important features for Stage 1:

graph LR
    A[fpflag_sum] -->|Weight: 0.29| B[Stage 1]
    C[koi_model_snr] -->|Weight: 0.06| B
    D[koi_fpflag_nt] -->|Weight: 0.06| B
    E[koi_fpflag_ss] -->|Weight: 0.05| B
    F[koi_prad] -->|Weight: 0.04| B

    style B fill:#f39c12,stroke:#e67e22,color:#fff

Stage 2: CANDIDATE vs FALSE POSITIVE

The Hard Problem

This is where the science gets interesting. Candidates and false positives can look very similar:

graph TB
    subgraph "CANDIDATE Characteristics"
        A1[Moderate SNR]
        A2[Some transit-like shape]
        A3[Plausible radius]
        A4[No strong FP flags]
    end

    subgraph "FALSE POSITIVE Characteristics"
        B1[Variable SNR]
        B2[Non-transit shape possible]
        B3[Radius may be too large]
        B4[Subtle FP indicators]
    end

    A1 --- C[The Boundary]
    A2 --- C
    A3 --- C
    A4 --- C
    B1 --- C
    B2 --- C
    B3 --- C
    B4 --- C

    style C fill:#f39c12,stroke:#e67e22,color:#fff

Performance

graph TD
    A[Stage 2 Performance] --> B[Precision: ~88%]
    A --> C[Recall: ~85%]
    A --> D[F1: ~87%]

    style B fill:#3498db,stroke:#2980b9,color:#fff
    style C fill:#3498db,stroke:#2980b9,color:#fff
    style D fill:#3498db,stroke:#2980b9,color:#fff

Stage 2 is harder but also more scientifically valuable: these are the edge cases astronomers care about most.


Combined Inference Pipeline

graph TD
    A[Input Sample<br/>36 Features] --> B{Stage 1:<br/>CONFIRMED?}

    B -->|Probability > 0.5| C[Output:<br/>CONFIRMED ✅]
    B -->|Probability <= 0.5| D{Stage 2:<br/>CANDIDATE?}

    D -->|Probability > 0.5| E[Output:<br/>CANDIDATE 🔍]
    D -->|Probability <= 0.5| F[Output:<br/>FALSE POSITIVE ❌]

    style C fill:#2ecc71,stroke:#27ae60,color:#fff
    style E fill:#3498db,stroke:#2980b9,color:#fff
    style F fill:#e74c3c,stroke:#c0392c,color:#fff

Probability Flow

graph LR
    A[Input] --> B[Stage 1 RF]
    B -->|P(CONFIRMED) = 0.85| C[→ CONFIRMED]
    B -->|P(CONFIRMED) = 0.30| D[→ Stage 2]
    D -->|P(CANDIDATE) = 0.70| E[→ CANDIDATE]
    D -->|P(CANDIDATE) = 0.20| F[→ FALSE_POSITIVE]

    style C fill:#2ecc71,stroke:#27ae60,color:#fff
    style E fill:#3498db,stroke:#2980b9,color:#fff
    style F fill:#e74c3c,stroke:#c0392c,color:#fff
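
The routing in the two diagrams above reduces to two threshold checks. The sketch below assumes both stages expose a probability and uses the 0.5 cut-offs shown in the diagrams. The `KoiClass` enum and `classify` function are illustrative names, not the crate's API.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KoiClass {
    Confirmed,
    Candidate,
    FalsePositive,
}

/// Routes one sample through both stages.
/// `p_candidate` is a closure so Stage 2 is evaluated only when Stage 1 says NOT CONFIRMED.
pub fn classify(p_confirmed: f64, p_candidate: impl FnOnce() -> f64) -> KoiClass {
    if p_confirmed > 0.5 {
        KoiClass::Confirmed
    } else if p_candidate() > 0.5 {
        KoiClass::Candidate
    } else {
        KoiClass::FalsePositive
    }
}
```

Using the example probabilities from the Probability Flow diagram, `classify(0.85, || 0.0)` yields `Confirmed`, `classify(0.30, || 0.70)` yields `Candidate`, and `classify(0.30, || 0.20)` yields `FalsePositive`.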

Training Data Flow

graph TD
    A[Full Dataset<br/>9,564 samples] --> B[Stratified Split<br/>80/20]

    B --> C[Train Set<br/>~7,650 samples]
    B --> D[Test Set<br/>~1,910 samples]

    C --> E[Stage 1 Labels<br/>CONFIRMED=1, NOT=0]
    C --> F[Stage 2 Labels<br/>CANDIDATE=1, FALSE=0]

    E --> G[Train Stage 1 RF<br/>100 trees]
    F --> H[Train Stage 2 RF<br/>100 trees]

    G --> I[Stage 1 Model]
    H --> J[Stage 2 Model]

    D --> K[Evaluate Both<br/>on Test Set]
    I --> K
    J --> K

    style G fill:#2ecc71,stroke:#27ae60,color:#fff
    style H fill:#3498db,stroke:#2980b9,color:#fff
    style K fill:#f39c12,stroke:#e67e22,color:#fff
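
The two binary label sets in this flow can be derived from the three-class labels as sketched below. The numeric class codes are hypothetical. The sketch also assumes Stage 2 is trained only on rows that are not CONFIRMED, which matches the separate `train_stage2` set in the data flow diagram but is not spelled out in the documentation.

```rust
// Hypothetical label encoding; the real crate may use different codes.
pub const FALSE_POSITIVE: u8 = 0;
pub const CANDIDATE: u8 = 1;
pub const CONFIRMED: u8 = 2;

/// Stage 1 targets: CONFIRMED = 1, everything else = 0 (one target per row).
pub fn stage1_targets(labels: &[u8]) -> Vec<u8> {
    labels.iter().map(|&y| u8::from(y == CONFIRMED)).collect()
}

/// Stage 2 training rows: keeps only non-CONFIRMED rows and returns
/// (row_index, target) pairs with CANDIDATE = 1 and FALSE_POSITIVE = 0.
pub fn stage2_targets(labels: &[u8]) -> Vec<(usize, u8)> {
    labels
        .iter()
        .enumerate()
        .filter(|&(_, &y)| y != CONFIRMED)
        .map(|(i, &y)| (i, u8::from(y == CANDIDATE)))
        .collect()
}
```

Returning row indices for Stage 2 lets the caller select the matching feature rows without copying the full matrix first.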

Error Analysis

graph TD
    A[Total Test Samples<br/>1,911] --> B[Correct<br/>1,806]
    A --> C[Errors<br/>105]

    C --> D[Stage 1 Errors<br/>~15]
    C --> E[Stage 2 Errors<br/>~90]

    D --> D1[CONFIRMED misclassified<br/>as NOT CONFIRMED]
    D --> D2[NOT CONFIRMED misclassified<br/>as CONFIRMED]

    E --> E1[CANDIDATE misclassified<br/>as FALSE_POSITIVE]
    E --> E2[FALSE_POSITIVE misclassified<br/>as CANDIDATE]

    style B fill:#2ecc71,stroke:#27ae60,color:#fff
    style D fill:#e74c3c,stroke:#c0392c,color:#fff
    style E fill:#e74c3c,stroke:#c0392c,color:#fff

Most errors occur in Stage 2, which is expected: the boundary between candidates and false positives is inherently ambiguous. These are the most scientifically interesting samples.

Feature Engineering

Philosophy

Feature engineering is where astrophysics meets machine learning. We don't just throw raw data at a model; we encode domain knowledge about how planets, stars, and false positives behave.

graph LR
    A[Raw Data] --> B[Domain Knowledge]
    B --> C[Engineered Features]
    C --> D[Better Model]

    style B fill:#f39c12,stroke:#e67e22,color:#fff
    style D fill:#2ecc71,stroke:#27ae60,color:#fff

Base Features (28)

These are the raw astrophysical measurements from the Kepler pipeline:

graph TB
    subgraph "Orbital Parameters"
        A1[koi_period]
        A2[koi_duration]
        A3[koi_impact]
        A4[koi_ingress]
        A5[koi_incl]
        A6[koi_eccen]
        A7[koi_sma]
    end

    subgraph "Physical Parameters"
        B1[koi_ror]
        B2[koi_prad]
        B3[koi_teq]
        B4[koi_insol]
    end

    subgraph "Signal Quality"
        C1[koi_model_snr]
        C2[koi_count]
        C3[koi_num_transits]
        C4[koi_max_sngle_ev]
        C5[koi_max_mult_ev]
    end

    subgraph "False Positive Flags"
        D1[koi_fpflag_nt]
        D2[koi_fpflag_ss]
        D3[koi_fpflag_co]
        D4[koi_fpflag_ec]
    end

    subgraph "Stellar Parameters"
        E1[koi_kepmag]
        E2[koi_dor]
        E3[koi_srho]
        E4[koi_steff]
        E5[koi_slogg]
        E6[koi_smet]
        E7[koi_srad]
        E8[koi_smass]
    end

Derived Features (8)

These are where the magic happens. Each derived feature encodes a specific astrophysical insight:

1. fpflag_sum: Total Suspicion Score

graph LR
    A[koi_fpflag_nt] -->|+| B[fpflag_sum]
    C[koi_fpflag_ss] -->|+| B
    D[koi_fpflag_co] -->|+| B
    E[koi_fpflag_ec] -->|+| B

    B -->|Value > 0| F[Almost certainly<br/>FALSE POSITIVE]
    B -->|Value = 0| G[Needs further<br/>analysis]

    style F fill:#e74c3c,stroke:#c0392c,color:#fff
    style G fill:#3498db,stroke:#2980b9,color:#fff

Importance: 0.2918. The single most important feature. NASA already did the hard work of flagging suspicious signals; we just aggregate those flags.


2. snr_x_prad: Signal Consistency

graph LR
    A[Real Planet] -->|Jupiter-sized| B[High SNR expected]
    A -->|Earth-sized| C[Low SNR expected]

    D[Inconsistent Signal] -->|Large planet<br/>Low SNR| E[FALSE POSITIVE<br/>suspicion]
    D -->|Small planet<br/>Very high SNR| E

    B --> F[snr_x_prad<br/>consistent]
    C --> F
    E --> G[snr_x_prad<br/>inconsistent]

    style F fill:#2ecc71,stroke:#27ae60,color:#fff
    style G fill:#e74c3c,stroke:#c0392c,color:#fff

Importance: 0.0390. For a real planet, SNR should scale with size, because transit depth grows with the square of the planet-to-star radius ratio. A Jupiter-sized object with weak SNR is suspicious.


3. depth_duration_ratio: Transit Shape

graph LR
    subgraph "Planet Transit"
        A1[U-shaped curve]
        A2[Specific depth/duration<br/>ratio]
    end

    subgraph "Stellar Eclipse"
        B1[V-shaped curve]
        B2[Different depth/duration<br/>ratio]
    end

    A1 --> C[depth_duration_ratio<br/>~ planet signature]
    B1 --> D[depth_duration_ratio<br/>~ binary signature]

    style C fill:#2ecc71,stroke:#27ae60,color:#fff
    style D fill:#e74c3c,stroke:#c0392c,color:#fff

Importance: 0.0239. Planets produce U-shaped transits; stellar binaries produce V-shaped eclipses. The ratio captures this difference.


4. koi_prad_squared: Non-Linear Radius Effect

graph LR
    A[Planetary Radius] --> B[Linear: prad]
    A --> C[Non-linear: prad²]

    B --> D[Gradual increase]
    C --> E[Sharp threshold<br/>at ~15 R⊕]

    E -->|> 15 R⊕| F[Stellar companion<br/>not a planet]
    E -->|< 15 R⊕| G[Could be a planet]

    style F fill:#e74c3c,stroke:#c0392c,color:#fff
    style G fill:#2ecc71,stroke:#27ae60,color:#fff

Importance: 0.0275. Objects larger than ~15 Earth radii are far more likely to be stellar companions than planets. The squared term accentuates this large-radius regime.


5. impact_penalty: Physical Impossibility

graph LR
    A[Impact Parameter] -->|b < 1.0| B[Physical transit<br/>possible]
    A -->|b > 1.0| C[No transit possible<br/>by geometry]

    B --> D[impact_penalty = 0]
    C --> E[impact_penalty = 10]

    E --> F[Strong FALSE<br/>POSITIVE signal]

    style D fill:#2ecc71,stroke:#27ae60,color:#fff
    style E fill:#e74c3c,stroke:#c0392c,color:#fff
    style F fill:#e74c3c,stroke:#c0392c,color:#fff

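An impact parameter well above 1.0 means the planet would miss the stellar disk entirely. Strictly, the geometric limit is b > 1 + Rp/R★, so grazing transits remain possible for b slightly above 1. The feature uses the simpler b > 1.0 cut-off and treats any such signal as highly suspect.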


6. log_period: Orbital Distribution

graph LR
    A[Orbital Period] --> B[Linear scale: skewed]
    A --> C[Log scale: normal]

    B --> D[Hard to model]
    C --> E[Log-normal<br/>distribution]

    E --> F[Better model<br/>fit]

    style F fill:#2ecc71,stroke:#27ae60,color:#fff

Planetary orbital periods span several orders of magnitude and are approximately log-normally distributed. Taking the log compresses the long tail, making the feature closer to Gaussian and easier for the model to learn.


7. teq_over_steff: Temperature Sanity Check

graph LR
    A[Equilibrium Temp] -->|/| B[Stellar Temp]
    B --> C[teq_over_steff]

    C -->|~ 0.1-0.5| D[Plausible]
    C -->|> 1.0| E[Implausible<br/>teq > steff]

    style D fill:#2ecc71,stroke:#27ae60,color:#fff
    style E fill:#e74c3c,stroke:#c0392c,color:#fff

A planet's equilibrium temperature should never exceed its host star's temperature. This ratio is a simple sanity check.


8. prad_teq_interaction: Size-Temperature Relationship

graph LR
    A[Hot Jupiters] -->|Large + Hot| B[High prad_teq]
    C[Rocky Planets] -->|Small + Cool| D[Low prad_teq]

    B --> E[Distinct population]
    D --> E

    style E fill:#3498db,stroke:#2980b9,color:#fff

This interaction helps distinguish between giant planets (large + hot) and rocky planets (small + cool).
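
Seven of the eight derived features can be sketched directly from the descriptions above. The formulas here are assumptions read off the feature names and diagrams: plain products for the two interaction terms, the natural log for `log_period`, and the 0/10 penalty for `impact_penalty`. The `KoiRow` and `Derived` types are hypothetical. `depth_duration_ratio` is left out because the transit depth column it needs is not among the base features listed earlier.

```rust
/// Hypothetical container for the raw KOI columns this sketch needs.
pub struct KoiRow {
    pub fpflags: [u8; 4], // koi_fpflag_nt, _ss, _co, _ec
    pub model_snr: f64,   // koi_model_snr
    pub prad: f64,        // koi_prad, planet radius in Earth radii
    pub impact: f64,      // koi_impact, impact parameter b
    pub period: f64,      // koi_period, days (must be > 0 for the log)
    pub teq: f64,         // koi_teq, K
    pub steff: f64,       // koi_steff, K (must be non-zero for the ratio)
}

#[derive(Debug)]
pub struct Derived {
    pub fpflag_sum: f64,
    pub snr_x_prad: f64,
    pub koi_prad_squared: f64,
    pub impact_penalty: f64,
    pub log_period: f64,
    pub teq_over_steff: f64,
    pub prad_teq_interaction: f64,
}

/// Computes the derived features for one row (assumed formulas, see lead-in).
pub fn derive(r: &KoiRow) -> Derived {
    Derived {
        // Number of false-positive flags that are set.
        fpflag_sum: r.fpflags.iter().map(|&f| f64::from(f)).sum::<f64>(),
        snr_x_prad: r.model_snr * r.prad,
        koi_prad_squared: r.prad * r.prad,
        // Fixed penalty of 10 when b > 1.0, as in the impact_penalty diagram.
        impact_penalty: if r.impact > 1.0 { 10.0 } else { 0.0 },
        log_period: r.period.ln(),
        teq_over_steff: r.teq / r.steff,
        prad_teq_interaction: r.prad * r.teq,
    }
}
```

A row with two flags set, `model_snr = 20`, `prad = 2`, and `impact = 1.2` gives `fpflag_sum = 2`, `snr_x_prad = 40`, `koi_prad_squared = 4`, and `impact_penalty = 10`.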


Feature Importance Ranking

graph LR
    subgraph "Top 5 Features"
        A1[fpflag_sum<br/>0.2918]
        A2[koi_fpflag_co<br/>0.0683]
        A3[koi_max_mult_ev<br/>0.0630]
        A4[koi_fpflag_nt<br/>0.0624]
        A5[koi_model_snr<br/>0.0596]
    end

    subgraph "Next 5"
        B1[koi_fpflag_ss<br/>0.0450]
        B2[koi_prad<br/>0.0437]
        B3[snr_x_prad<br/>0.0390]
        B4[koi_count<br/>0.0324]
        B5[koi_ror<br/>0.0300]
    end

Preprocessing Pipeline

graph LR
    A[Raw Features<br/>28 columns] --> B[Imputation]
    B --> C[Standardization]
    C --> D[Derived Features<br/>8 columns]
    D --> E[Final Feature Matrix<br/>36 columns]

    B -->|Missing values| B1[Column median]
    C -->|Z-score| C1[Mean=0, Std=1]

    style E fill:#2ecc71,stroke:#27ae60,color:#fff

Missing Value Imputation

graph TD
    A[Feature Column] --> B{Contains<br/>NaN?}
    B -->|Yes| C[Collect valid values]
    C --> D[Sort values]
    D --> E[Take median]
    E --> F[Replace NaN with median]
    B -->|No| G[Keep as-is]

    style F fill:#2ecc71,stroke:#27ae60,color:#fff
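
The imputation flow above (collect valid values, sort, take the median, fill) can be sketched for a single column as follows. `impute_median` is an illustrative name. Averaging the two middle values for even-length columns is an assumption, since the diagram does not say how that case is handled.

```rust
/// Replaces NaN entries with the median of the non-NaN entries, in place.
/// Returns the median used, or None if the column has no valid values.
pub fn impute_median(column: &mut [f64]) -> Option<f64> {
    // Collect and sort the valid (non-NaN) values.
    let mut valid: Vec<f64> = column.iter().copied().filter(|v| !v.is_nan()).collect();
    if valid.is_empty() {
        return None;
    }
    valid.sort_by(|a, b| a.total_cmp(b));

    // Middle value, or the mean of the two middle values for even lengths.
    let mid = valid.len() / 2;
    let median = if valid.len() % 2 == 0 {
        (valid[mid - 1] + valid[mid]) / 2.0
    } else {
        valid[mid]
    };

    // Fill the gaps.
    for v in column.iter_mut() {
        if v.is_nan() {
            *v = median;
        }
    }
    Some(median)
}
```

To avoid leaking test-set information, the median would normally be computed on the training rows and then reused to fill the test rows.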

Z-Score Standardization

graph LR
    A[Raw Value x] --> B[Subtract Mean]
    B --> C[Divide by Std]
    C --> D[Standardized Value<br/>(x - μ) / σ]

    style D fill:#2ecc71,stroke:#27ae60,color:#fff

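Standardization puts all features on a common scale (mean 0, standard deviation 1). This matters most for distance-based or gradient-based methods, where features with large scales (like period in days) would otherwise dominate small-scale features (like impact parameter). Tree splits are unaffected by monotonic rescaling of a single feature, so the Random Forest stages themselves are largely insensitive to this step.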
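
A minimal z-score sketch for one column is shown below. It uses the population standard deviation (dividing by n) and maps a constant column to zeros. Both choices are assumptions, since the documentation does not say which variant `features.rs` uses.

```rust
/// Standardizes a column in place to mean 0 and standard deviation 1.
/// Returns the (mean, std) that were used so they can be reapplied to other data.
pub fn standardize(column: &mut [f64]) -> (f64, f64) {
    if column.is_empty() {
        return (0.0, 0.0);
    }
    let n = column.len() as f64;
    let mean = column.iter().sum::<f64>() / n;
    // Population variance: average squared deviation from the mean.
    let var = column.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    let std = var.sqrt();

    for x in column.iter_mut() {
        // A constant column has std == 0; map it to zeros instead of dividing by zero.
        *x = if std > 0.0 { (*x - mean) / std } else { 0.0 };
    }
    (mean, std)
}
```

For the column `[2, 4, 4, 4, 5, 5, 7, 9]` the mean is 5 and the standard deviation is 2, so the first value becomes -1.5 and the last becomes 2.0.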


Feature Correlation Insight

graph TB
    subgraph "Feature Groups"
        A1[FP Flags] --- A2[fpflag_sum]
        A1 --- A3[koi_fpflag_nt]
        A1 --- A4[koi_fpflag_ss]

        B1[Size] --- B2[koi_prad]
        B1 --- B3[koi_ror]
        B1 --- B4[snr_x_prad]

        C1[Signal] --- C2[koi_model_snr]
        C1 --- C3[koi_max_mult_ev]
        C1 --- C4[koi_max_sngle_ev]

        D1[Orbit] --- D2[koi_period]
        D1 --- D3[log_period]
        D1 --- D4[koi_duration]
    end

    A2 --> E[High Importance]
    B2 --> E
    C2 --> E
    D3 --> E

Results & Metrics

Overall Performance

Astrophage achieves strong results on the KOI classification task:

graph LR
    subgraph "Metrics"
        A[Accuracy<br/>94.81%]
        B[Macro F1<br/>92.64%]
        C[Weighted F1<br/>94.51%]
    end

    style A fill:#2ecc71,stroke:#27ae60,color:#fff
    style B fill:#3498db,stroke:#2980b9,color:#fff
    style C fill:#9b59b6,stroke:#8e44ad,color:#fff

Per-Class Performance

graph TD
    subgraph "CONFIRMED"
        C1[Precision: 89.95%]
        C2[Recall: 94.54%]
        C3[F1-Score: 92.18%]
    end

    subgraph "CANDIDATE"
        A1[Precision: 88.42%]
        A2[Recall: 85.06%]
        A3[F1-Score: 86.71%]
    end

    subgraph "FALSE POSITIVE"
        F1[Precision: 99.69%]
        F2[Recall: 98.35%]
        F3[F1-Score: 99.01%]
    end

    style C1 fill:#2ecc71,stroke:#27ae60,color:#fff
    style C2 fill:#2ecc71,stroke:#27ae60,color:#fff
    style C3 fill:#2ecc71,stroke:#27ae60,color:#fff
    style F1 fill:#2ecc71,stroke:#27ae60,color:#fff
    style F2 fill:#2ecc71,stroke:#27ae60,color:#fff
    style F3 fill:#2ecc71,stroke:#27ae60,color:#fff

Detailed Breakdown

| Class | Precision | Recall | F1-Score | Support | Notes |
| --- | --- | --- | --- | --- | --- |
| CANDIDATE | 88.42% | 85.06% | 86.71% | 1,978 | Hardest class; ambiguous by definition |
| FALSE POSITIVE | 99.69% | 98.35% | 99.01% | 4,839 | Nearly perfect; FP flags are very strong |
| CONFIRMED | 89.95% | 94.54% | 92.18% | 2,747 | Strong; clear signals are easy to identify |
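
The macro and weighted F1 scores reported earlier are averages of the per-class F1 values in this table. The sketch below shows one way to compute per-class metrics from a confusion matrix and then combine them. The `ClassMetrics` struct and function names are illustrative and are not the `evaluation.rs` API.

```rust
pub struct ClassMetrics {
    pub precision: f64,
    pub recall: f64,
    pub f1: f64,
    pub support: usize,
}

/// Per-class metrics from a square confusion matrix,
/// where cm[t][p] counts samples of true class t predicted as class p.
pub fn per_class(cm: &[Vec<usize>]) -> Vec<ClassMetrics> {
    (0..cm.len())
        .map(|k| {
            let tp = cm[k][k] as f64;
            let support: usize = cm[k].iter().sum(); // row sum: true class k
            let predicted: usize = cm.iter().map(|row| row[k]).sum(); // column sum
            let precision = if predicted > 0 { tp / predicted as f64 } else { 0.0 };
            let recall = if support > 0 { tp / support as f64 } else { 0.0 };
            let f1 = if precision + recall > 0.0 {
                2.0 * precision * recall / (precision + recall)
            } else {
                0.0
            };
            ClassMetrics { precision, recall, f1, support }
        })
        .collect()
}

/// Unweighted mean of the per-class F1 scores.
pub fn macro_f1(m: &[ClassMetrics]) -> f64 {
    if m.is_empty() {
        return 0.0;
    }
    m.iter().map(|c| c.f1).sum::<f64>() / m.len() as f64
}

/// Mean of the per-class F1 scores weighted by class support.
pub fn weighted_f1(m: &[ClassMetrics]) -> f64 {
    let total: usize = m.iter().map(|c| c.support).sum();
    if total == 0 {
        return 0.0;
    }
    m.iter().map(|c| c.f1 * c.support as f64).sum::<f64>() / total as f64
}
```

Plugging the table's F1 and support values into `macro_f1` and `weighted_f1` reproduces the reported 92.64% and 94.51% to within rounding.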

Confusion Matrix

graph TD
    subgraph "Confusion Matrix (approximate counts scaled to full-dataset class totals; test set: 1,911 samples)"
        A1[True CANDIDATE<br/>1,978] --> B1[Predicted CANDIDATE<br/>~1,682]
        A1 --> B2[Predicted FALSE+<br/>~296]

        A2[True FALSE+<br/>4,839] --> B3[Predicted FALSE+<br/>~4,759]
        A2 --> B4[Predicted CANDIDATE<br/>~80]

        A3[True CONFIRMED<br/>2,747] --> B5[Predicted CONFIRMED<br/>~2,597]
        A3 --> B6[Predicted NOT<br/>~150]
    end

    style B1 fill:#3498db,stroke:#2980b9,color:#fff
    style B3 fill:#e74c3c,stroke:#c0392c,color:#fff
    style B5 fill:#2ecc71,stroke:#27ae60,color:#fff
    style B2 fill:#e74c3c,stroke:#c0392c,color:#fff
    style B4 fill:#3498db,stroke:#2980b9,color:#fff
    style B6 fill:#e74c3c,stroke:#c0392c,color:#fff

Most confusion occurs between CANDIDATE and FALSE_POSITIVE, exactly where we expect it. Stage 1's CONFIRMED separation is nearly clean.


Feature Importance

graph LR
    subgraph "Top 10 Features"
        A1[1. fpflag_sum<br/>0.2918]
        A2[2. koi_fpflag_co<br/>0.0683]
        A3[3. koi_max_mult_ev<br/>0.0630]
        A4[4. koi_fpflag_nt<br/>0.0624]
        A5[5. koi_model_snr<br/>0.0596]
        A6[6. koi_fpflag_ss<br/>0.0450]
        A7[7. koi_prad<br/>0.0437]
        A8[8. snr_x_prad<br/>0.0390]
        A9[9. koi_count<br/>0.0324]
        A10[10. koi_ror<br/>0.0300]
    end

Feature Importance by Category

graph TD
    subgraph "Importance Distribution"
        A[FP Flags<br/>~47%]
        B[Signal Quality<br/>~20%]
        C[Physical Params<br/>~18%]
        D[Derived Features<br/>~15%]
    end

    A --> E[fpflag_sum dominates]
    B --> F[SNR, max events]
    C --> G[Radius, temperature]
    D --> H[Interactions, ratios]

    style A fill:#e74c3c,stroke:#c0392c,color:#fff
    style E fill:#e74c3c,stroke:#c0392c,color:#fff

Astrophysical Insights

Insight 1: False Positive Flags (Very High Confidence)

graph LR
    A[NASA FP Flags] --> B[fpflag_sum]
    B -->|Value > 0| C[99%+ chance<br/>FALSE POSITIVE]
    B -->|Value = 0| D[Needs further<br/>analysis]

    style C fill:#e74c3c,stroke:#c0392c,color:#fff
    style D fill:#3498db,stroke:#2980b9,color:#fff

Supporting features: fpflag_sum, koi_fpflag_nt, koi_fpflag_ss

NASA's pre-vetting flags directly encode expert knowledge. When these are non-zero, the signal is almost certainly not a planet. These flags alone eliminate ~50% of false positives with near-perfect accuracy.


Insight 2: SNR-Radius Consistency (High Confidence)

graph LR
    A[Real Planet] -->|Jupiter| B[High SNR]
    A -->|Earth| C[Low SNR]

    D[Inconsistent] -->|Large + Low SNR| E[Suspicious]

    B --> F[snr_x_prad<br/>consistent]
    C --> F
    E --> G[snr_x_prad<br/>inconsistent]

    style F fill:#2ecc71,stroke:#27ae60,color:#fff
    style G fill:#e74c3c,stroke:#c0392c,color:#fff

Supporting features: koi_model_snr, snr_x_prad, koi_prad

Real planets have signal-to-noise ratios consistent with their size. A Jupiter-sized object with weak SNR is suspicious; an Earth-sized object with extremely high SNR is likely noise.


Insight 3: Transit Geometry (High Confidence)

graph LR
    subgraph "Planet Transit"
        A1[U-shaped curve]
        A2[Specific depth/duration]
    end

    subgraph "Stellar Binary"
        B1[V-shaped curve]
        B2[Different depth/duration]
    end

    A1 --> C[depth_duration_ratio<br/>~ planet]
    B1 --> D[depth_duration_ratio<br/>~ binary]

    style C fill:#2ecc71,stroke:#27ae60,color:#fff
    style D fill:#e74c3c,stroke:#c0392c,color:#fff

Supporting features: depth_duration_ratio, log_period, koi_duration

Planetary transits produce characteristic U-shaped light curves with specific depth-to-duration ratios. Stellar binaries produce V-shaped eclipses with different geometry. Our derived depth_duration_ratio captures this distinction.


Recommendations

graph TD
    subgraph "Actionable Recommendations"
        A1[Use Stage 1 as rapid filter]
        A2[Investigate uncertain Stage 1 samples]
        A3[Use Stage 2 for prioritization]
        A4[Use fpflag_sum as pre-filter]
    end

    A1 --> B1[Quickly identify CONFIRMED]
    A2 --> B2[Scientifically interesting edge cases]
    A3 --> B3[Prioritize CANDIDATE follow-up]
    A4 --> B4[Eliminate 50% of false positives instantly]

    style A1 fill:#2ecc71,stroke:#27ae60,color:#fff
    style A2 fill:#3498db,stroke:#2980b9,color:#fff
    style A3 fill:#9b59b6,stroke:#8e44ad,color:#fff
    style A4 fill:#f39c12,stroke:#e67e22,color:#fff

| # | Recommendation | Impact |
| --- | --- | --- |
| 1 | Use Stage 1 as a rapid filter for follow-up observations | Saves telescope time |
| 2 | Investigate samples where Stage 1 is uncertain (probability ~0.5) | Most scientifically interesting |
| 3 | For NOT_CONFIRMED, use Stage 2 probability to prioritize follow-up | Efficient resource allocation |
| 4 | fpflag_sum alone eliminates ~50% of false positives with near-perfect accuracy | Dramatic efficiency gain |

Comparison with Baselines

graph LR
    subgraph "Accuracy Comparison"
        A[Single-Stage RF<br/>~91%]
        B[Logistic Regression<br/>~87%]
        C[SVM<br/>~89%]
        D[Astrophage<br/>94.81%]
    end

    style D fill:#2ecc71,stroke:#27ae60,color:#fff
    style A fill:#95a5a6,stroke:#7f8c8d,color:#fff
    style B fill:#95a5a6,stroke:#7f8c8d,color:#fff
    style C fill:#95a5a6,stroke:#7f8c8d,color:#fff

Astrophage's two-stage architecture provides a 3-4 percentage point accuracy improvement over single-stage approaches. On the 9,564-sample KOI dataset, each percentage point corresponds to roughly 96 objects classified differently, which is significant in the context of exoplanet discovery.

API Reference

Module Overview

graph TB
    subgraph "Public API"
        A[lib.rs] --> B[data::KoiDataset]
        A --> C[features::FeatureEngineer]
        A --> D[two_stage_model::TwoStageClassifier]
        A --> E[evaluation::ModelEvaluator]
        A --> F[report::generate_report]
        A --> G[logger::Logger]
    end

    subgraph "Internal Modules"
        H[decision_tree.rs]
        I[model.rs]
    end

    D --> H
    D --> I

data::KoiDataset

Loads and manages the KOI dataset from CSV using Polars.

graph LR
    A[CSV File] -->|Polars| B[KoiDataset]
    B --> C[n_samples]
    B --> D[n_features]
    B --> E[features Array2]
    B --> F[labels Array1]
    B --> G[class_distribution]

Methods

| Method | Signature | Description |
| --- | --- | --- |
| load | `load(path: &str) -> Result<KoiDataset>` | Load dataset from CSV |
| n_samples | `n_samples() -> usize` | Number of rows |
| n_features | `n_features() -> usize` | Number of feature columns |
| features | `features() -> &Array2<f64>` | Feature matrix |
| labels | `labels() -> &Array1<u8>` | Label vector |
| class_distribution | `class_distribution() -> HashMap<String, usize>` | Count per class |
| feature_index | `feature_index(name: &str) -> Option<usize>` | Column index by name |

Usage

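use astrophage::data::KoiDataset;

// Returning a Result lets `?` propagate load errors out of main.
// Assumes the crate's error type converts into Box<dyn Error>.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let dataset = KoiDataset::load("data/koi_dataset.csv")?;
    println!(
        "Loaded {} samples with {} features",
        dataset.n_samples(),
        dataset.n_features()
    );

    // Print the number of samples in each class.
    let dist = dataset.class_distribution();
    for (class, count) in &dist {
        println!("{}: {}", class, count);
    }
    Ok(())
}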

features::FeatureEngineer

Transforms raw data into model-ready features through imputation, standardization, and derived feature computation.

graph LR
    A[KoiDataset] -->|process| B[FeatureEngineer]
    B --> C[Imputation]
    C --> D[Standardization]
    D --> E[Derived Features]
    E --> F[ProcessedDataset]
    F --> G[split]
    G --> H[Train Set]
    G --> I[Test Set]

Methods

| Method | Signature | Description |
| --- | --- | --- |
| new | `new() -> FeatureEngineer` | Create new engineer |
| process | `process(&mut self, dataset: &KoiDataset) -> Result<ProcessedDataset>` | Full pipeline |

ProcessedDataset Methods

| Method | Signature | Description |
| --- | --- | --- |
| n_samples | `n_samples() -> usize` | Number of rows |
| n_features | `n_features() -> usize` | Number of columns |
| features | `features() -> &Array2<f64>` | Feature matrix |
| labels | `labels() -> &Array1<u8>` | Label vector |
| feature_names | `feature_names() -> &[String]` | Column names |
| split | `split(test_ratio: f64, seed: u64) -> (ProcessedDataset, ProcessedDataset)` | Stratified split |

Usage

use astrophage::features::FeatureEngineer;

let mut engineer = FeatureEngineer::new();
let processed = engineer.process(&dataset)?;

// 80/20 stratified split with a fixed seed for reproducibility.
let (train, test) = processed.split(0.2, 42);
println!("Train: {}, Test: {}", train.n_samples(), test.n_samples());
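
To make "stratified split" concrete, here is a minimal, dependency-free sketch of how such a split could work on a label vector. It is not the crate's implementation: the function name `stratified_split`, the `SplitMix64` helper, and the choice to return index lists are all assumptions made for illustration.

```rust
use std::collections::BTreeMap;

/// Tiny deterministic PRNG (SplitMix64) so the sketch needs no external crates.
/// Hypothetical helper; the real crate may use a different generator.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Returns (train_indices, test_indices) so that every class contributes
/// roughly `test_ratio` of its rows to the test part.
/// Assumes 0.0 <= test_ratio <= 1.0.
fn stratified_split(labels: &[u8], test_ratio: f64, seed: u64) -> (Vec<usize>, Vec<usize>) {
    // Group row indices by class label; BTreeMap keeps the class order stable.
    let mut by_class: BTreeMap<u8, Vec<usize>> = BTreeMap::new();
    for (i, &y) in labels.iter().enumerate() {
        by_class.entry(y).or_default().push(i);
    }

    let mut rng = SplitMix64(seed);
    let (mut train, mut test) = (Vec::new(), Vec::new());
    for rows in by_class.values_mut() {
        // Fisher-Yates shuffle within the class (modulo bias is ignored here).
        for i in (1..rows.len()).rev() {
            let j = (rng.next_u64() % (i as u64 + 1)) as usize;
            rows.swap(i, j);
        }
        // The first n_test shuffled rows of this class go to the test part.
        let n_test = (rows.len() as f64 * test_ratio).round() as usize;
        test.extend_from_slice(&rows[..n_test]);
        train.extend_from_slice(&rows[n_test..]);
    }
    (train, test)
}
```

Because the shuffle is seeded, calling the function twice with the same seed yields the same partition, which is the property the `seed` argument of `split` is meant to provide.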

two_stage_model::TwoStageClassifier

The main two-stage random forest classifier.

graph TB
    A[TwoStageClassifier] --> B[Stage 1 RF]
    A --> C[Stage 2 RF]
    B --> D[train Stage 1]
    C --> E[train Stage 2]
    D --> F[predict]
    E --> F
    F --> G[feature_importance]

Methods

| Method | Signature | Description |
| --- | --- | --- |
| `new` | `new() -> TwoStageClassifier` | Create new classifier |
| `train` | `train(&mut self, train: &ProcessedDataset) -> Result<()>` | Train both stages |
| `predict` | `predict(&self, features: &Array2<f64>) -> Vec<u8>` | Predict labels |
| `predict_proba` | `predict_proba(&self, features: &Array2<f64>) -> Vec<Vec<f64>>` | Predict probabilities |
| `feature_importance` | `feature_importance() -> Vec<(String, f64)>` | Feature importance scores |

Usage

use astrophage::two_stage_model::TwoStageClassifier;

let mut classifier = TwoStageClassifier::new();
classifier.train(&train)?;

// Predictions
let predictions = classifier.predict(test.features());

// Feature importance (top 10)
for (name, score) in classifier.feature_importance().iter().take(10) {
    println!("{}: {:.4}", name, score);
}
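
The way two binary stages can be combined into a three-class answer can be sketched as follows. This is an illustration under stated assumptions, not the crate's code: the `KoiClass` enum, the function names, and the 0.5 decision thresholds are hypothetical, and the class order in the returned array is chosen here for readability.

```rust
/// Three output classes; the crate itself returns `u8` labels, so this enum
/// is only an illustrative stand-in.
#[derive(Debug, Clone, Copy, PartialEq)]
enum KoiClass {
    Confirmed,
    Candidate,
    FalsePositive,
}

/// Chain the two binary probabilities into three class probabilities:
/// [P(CONFIRMED), P(CANDIDATE), P(FALSE POSITIVE)].
/// `p_confirmed` is Stage 1's P(CONFIRMED); `p_candidate` is Stage 2's
/// P(CANDIDATE | NOT CONFIRMED).
fn combine_proba(p_confirmed: f64, p_candidate: f64) -> [f64; 3] {
    let p_not = 1.0 - p_confirmed;
    [p_confirmed, p_not * p_candidate, p_not * (1.0 - p_candidate)]
}

/// Hard routing as in the two-stage flowchart: Stage 2 is consulted only
/// when Stage 1 says NOT CONFIRMED. The 0.5 thresholds are an assumption.
fn route(p_confirmed: f64, p_candidate: f64) -> KoiClass {
    if p_confirmed >= 0.5 {
        KoiClass::Confirmed
    } else if p_candidate >= 0.5 {
        KoiClass::Candidate
    } else {
        KoiClass::FalsePositive
    }
}
```

The three combined probabilities always sum to 1, so a chained output of this kind can be read like the output of a single three-class model.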

evaluation::ModelEvaluator

Computes comprehensive classification metrics.

graph LR
    A[TwoStageClassifier] -->|+ Test Data| B[ModelEvaluator]
    B --> C[Accuracy]
    B --> D[Precision]
    B --> E[Recall]
    B --> F[F1-Score]
    B --> G[Per-Class Metrics]

Methods

| Method | Signature | Description |
| --- | --- | --- |
| `new` | `new(classifier: &TwoStageClassifier, test: &ProcessedDataset) -> ModelEvaluator` | Create evaluator |
| `evaluate` | `evaluate(&self) -> Result<Metrics>` | Compute all metrics |

Metrics Structure

use std::collections::HashMap;

pub struct Metrics {
    pub accuracy: f64,
    pub macro_f1: f64,
    pub weighted_f1: f64,
    /// Per-class metrics keyed by class name.
    pub per_class: HashMap<String, ClassMetrics>,
}

pub struct ClassMetrics {
    pub precision: f64,
    pub recall: f64,
    pub f1_score: f64,
}

Usage

use astrophage::evaluation::ModelEvaluator;

let evaluator = ModelEvaluator::new(&classifier, &test);
let metrics = evaluator.evaluate()?;

// Rust format specs use `{:.4}` for four decimal places (no trailing `f`).
println!("Accuracy: {:.4}", metrics.accuracy);
println!("Macro F1: {:.4}", metrics.macro_f1);

for (class, m) in &metrics.per_class {
    println!(
        "{}: P={:.4} R={:.4} F1={:.4}",
        class, m.precision, m.recall, m.f1_score
    );
}
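
For readers who want to see how the reported numbers relate to each other, the following sketch computes accuracy, macro F1, and weighted F1 from true and predicted labels. The function names `prf` and `summarize` are hypothetical and the code is not taken from `evaluation.rs`; it assumes labels are encoded as `0..n_classes` and that the input is non-empty.

```rust
/// Precision, recall and F1 for one class from raw counts.
/// A zero denominator yields 0.0 (a common convention, assumed here).
fn prf(tp: usize, fp: usize, fn_: usize) -> (f64, f64, f64) {
    let ratio = |num: usize, den: usize| if den == 0 { 0.0 } else { num as f64 / den as f64 };
    let p = ratio(tp, tp + fp);
    let r = ratio(tp, tp + fn_);
    let f1 = if p + r == 0.0 { 0.0 } else { 2.0 * p * r / (p + r) };
    (p, r, f1)
}

/// Returns (accuracy, macro_f1, weighted_f1).
fn summarize(y_true: &[u8], y_pred: &[u8], n_classes: usize) -> (f64, f64, f64) {
    // confusion[t][p] counts samples of true class t predicted as class p.
    let mut confusion = vec![vec![0usize; n_classes]; n_classes];
    for (&t, &p) in y_true.iter().zip(y_pred) {
        confusion[t as usize][p as usize] += 1;
    }

    let total = y_true.len() as f64;
    let (mut correct, mut macro_sum, mut weighted_sum) = (0usize, 0.0, 0.0);
    for c in 0..n_classes {
        let tp = confusion[c][c];
        let support: usize = confusion[c].iter().sum();
        let predicted: usize = (0..n_classes).map(|t| confusion[t][c]).sum();
        let (_, _, f1) = prf(tp, predicted - tp, support - tp);
        correct += tp;
        macro_sum += f1; // every class counts equally
        weighted_sum += f1 * support as f64; // classes weighted by support
    }
    (correct as f64 / total, macro_sum / n_classes as f64, weighted_sum / total)
}
```

Macro F1 averages the per-class F1 scores with equal weight, while weighted F1 weights each class by its support, which is why the two can differ on an imbalanced dataset like this one.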

report::generate_report

Generates the comprehensive JSON report.

graph LR
    A[Metrics] -->|+ Classifier| B[generate_report]
    B --> C[report.json]
    C --> D[Feature Importance]
    C --> E[Per-Class Metrics]
    C --> F[Astrophysical Insights]
    C --> G[Recommendations]

Function

pub fn generate_report(
    metrics: &Metrics,
    classifier: &TwoStageClassifier,
) -> Result<()>

Output: output/report.json

Report Structure

{
  "project_name": "Astrophage",
  "version": "0.2.0",
  "summary": { ... },
  "metrics": { ... },
  "feature_importance": [ ... ],
  "astrophysical_insights": [ ... ],
  "recommendations": [ ... ]
}

logger::Logger

Structured logging with tracing.

graph LR
    A[Logger] --> B[Console Output]
    A --> C[File Output]
    B --> D[Colored Logs]
    C --> E[app.log]

Methods

| Method | Signature | Description |
| --- | --- | --- |
| `init` | `init(console: bool) -> Result<()>` | Initialize logger |

Usage

#![allow(unused)]
fn main() {
use astrophage::logger::Logger;

Logger::init(true).await?;
tracing::info!("Training started...");
}

Internal: decision_tree::DecisionTree

Custom decision tree implementation using Gini impurity.

graph TD
    A[DecisionTree] --> B[fit]
    B --> C[find_best_split]
    C --> D[compute_gini]
    D --> E[split_node]
    E --> F[recurse_left]
    E --> G[recurse_right]
    F --> H[Leaf or Split]
    G --> H

Key Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| `max_depth` | 10 | Maximum tree depth |
| `min_samples_leaf` | 5 | Minimum samples per leaf |
| `max_features` | sqrt(n) | Features considered per split |
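
As a rough illustration of what `compute_gini` and `find_best_split` in the diagram do, the sketch below evaluates Gini impurity and searches one feature for its best threshold. The names `gini` and `best_split` and their signatures are assumptions, not the crate's API, and the search is written for clarity (quadratic in the number of rows) rather than speed.

```rust
/// Gini impurity of a set of labels: 1 - sum over classes of p_k^2.
fn gini(labels: &[u8], n_classes: usize) -> f64 {
    if labels.is_empty() {
        return 0.0;
    }
    let mut counts = vec![0usize; n_classes];
    for &y in labels {
        counts[y as usize] += 1;
    }
    let n = labels.len() as f64;
    1.0 - counts.iter().map(|&c| (c as f64 / n).powi(2)).sum::<f64>()
}

/// Best threshold on a single feature: the midpoint between two adjacent
/// sorted values that minimizes the size-weighted Gini of the two children.
/// Returns (threshold, weighted_gini), or None if no valid split exists.
fn best_split(
    values: &[f64],
    labels: &[u8],
    n_classes: usize,
    min_samples_leaf: usize,
) -> Option<(f64, f64)> {
    // Sort row indices by feature value.
    let mut order: Vec<usize> = (0..values.len()).collect();
    order.sort_by(|&a, &b| values[a].total_cmp(&values[b]));
    let sorted_labels: Vec<u8> = order.iter().map(|&i| labels[i]).collect();

    let n = values.len();
    let mut best: Option<(f64, f64)> = None;
    for cut in 1..n {
        let (lo, hi) = (values[order[cut - 1]], values[order[cut]]);
        // Skip ties and splits that would leave a leaf too small.
        if lo == hi || cut < min_samples_leaf || n - cut < min_samples_leaf {
            continue;
        }
        let (left, right) = sorted_labels.split_at(cut);
        let score = (left.len() as f64 * gini(left, n_classes)
            + right.len() as f64 * gini(right, n_classes))
            / n as f64;
        if best.map_or(true, |(_, s)| score < s) {
            best = Some(((lo + hi) / 2.0, score));
        }
    }
    best
}
```

A full tree would repeat this search over a random subset of `max_features` columns at each node and recurse until `max_depth` or `min_samples_leaf` stops it.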

Internal: model::RandomForest

Ensemble of decision trees with bootstrap sampling.

graph TD
    A[RandomForest] --> B[n_estimators: 100]
    B --> C[Tree 1]
    B --> D[Tree 2]
    B --> E[...]
    B --> F[Tree N]
    C --> G[Majority Vote]
    D --> G
    F --> G
    G --> H[Final Prediction]

Key Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| `n_estimators` | 100 | Number of trees |
| `max_depth` | 10 | Max depth per tree |
| `max_features` | sqrt(n) | Number of features sampled per split |
| `bootstrap` | true | Use bootstrap sampling |
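
The two ensemble ingredients named above, bootstrap sampling and majority voting, can be sketched without any dependencies. Both function names are hypothetical, the linear congruential generator stands in for whatever random source the crate actually uses, and breaking ties toward the lowest label is an assumption.

```rust
/// Draw `n` row indices with replacement (a bootstrap sample).
/// Uses a small LCG so the sketch stays dependency-free.
fn bootstrap_indices(n: usize, seed: u64) -> Vec<usize> {
    let mut state = seed;
    (0..n)
        .map(|_| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            // Use the high bits, which are better distributed in an LCG.
            ((state >> 33) % n as u64) as usize
        })
        .collect()
}

/// Majority vote over the labels predicted by the individual trees.
/// Ties go to the lowest label; returns None when there are no votes.
fn majority_vote(votes: &[u8], n_classes: usize) -> Option<u8> {
    if votes.is_empty() {
        return None;
    }
    let mut counts = vec![0usize; n_classes];
    for &v in votes {
        counts[v as usize] += 1;
    }
    let mut best = 0usize;
    for (label, &c) in counts.iter().enumerate() {
        // Strict comparison keeps the earlier (lower) label on ties.
        if c > counts[best] {
            best = label;
        }
    }
    Some(best as u8)
}
```

Each tree would be fitted on its own bootstrap sample, and the forest's prediction for a row is the majority vote over the trees' outputs.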

Data Schema

Expected columns in koi_dataset.csv:

Orbital Parameters

| Column | Unit | Description |
| --- | --- | --- |
| `koi_period` | days | Orbital period |
| `koi_duration` | hours | Transit duration |
| `koi_depth` | ppm | Transit depth |
| `koi_impact` | - | Impact parameter |
| `koi_ingress` | hours | Ingress duration |
| `koi_incl` | deg | Orbital inclination |
| `koi_eccen` | - | Eccentricity |
| `koi_sma` | AU | Semi-major axis |

Physical Parameters

| Column | Unit | Description |
| --- | --- | --- |
| `koi_ror` | - | Radius ratio (planet/star) |
| `koi_prad` | R⊕ | Planetary radius |
| `koi_teq` | K | Equilibrium temperature |
| `koi_insol` | Earth flux | Insolation flux |

Signal Quality

| Column | Description |
| --- | --- |
| `koi_model_snr` | Signal-to-noise ratio |
| `koi_count` | Number of KOIs in system |
| `koi_num_transits` | Number of detected transits |
| `koi_max_sngle_ev` | Max single event statistic |
| `koi_max_mult_ev` | Max multiple event statistic |

False Positive Flags

| Column | Description |
| --- | --- |
| `koi_fpflag_nt` | Not Transit-like |
| `koi_fpflag_ss` | Stellar Eclipse |
| `koi_fpflag_co` | Centroid Offset |
| `koi_fpflag_ec` | Ephemeris Match |

Stellar Parameters

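| Column | Unit | Description |
| --- | --- | --- |
| `koi_kepmag` | mag | Kepler magnitude |
| `koi_dor` | - | Planet-star distance over stellar radius (a/R★) |
| `koi_srho` | g/cm³ | Stellar density |
| `koi_steff` | K | Stellar effective temperature |
| `koi_slogg` | log10(cm/s²) | Surface gravity (log) |
| `koi_smet` | dex | Metallicity |
| `koi_srad` | R☉ | Stellar radius |
| `koi_smass` | M☉ | Stellar mass |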

Contributing to Astrophage

Thank you for your interest in contributing! Astrophage is a Rust-based exoplanet classification project, and we welcome contributions of all kinds.

graph LR
    A[Fork] --> B[Branch]
    B --> C[Code]
    C --> D[Test]
    D --> E[PR]
    E --> F[Merge]

    style F fill:#2ecc71,stroke:#27ae60,color:#fff

Getting Started

Prerequisites

  • Rust 1.85+ (install via rustup)
  • Git

Setup

# Clone your fork
git clone https://github.com/YOUR_USERNAME/astrophage.git
cd astrophage

# Build
cargo build --release

# Run tests
cargo test

Development Workflow

graph TD
    A[Issue/Feature Request] --> B[Create Branch]
    B --> C[Implement]
    C --> D[Format & Lint]
    D --> E[Test]
    E --> F[Commit]
    F --> G[Push]
    G --> H[Pull Request]
    H --> I[Review]
    I -->|Approved| J[Merge]
    I -->|Changes| C

    style J fill:#2ecc71,stroke:#27ae60,color:#fff

Code Style

We follow standard Rust conventions:

# Format code
cargo fmt

# Run linter
cargo clippy

# Generate docs
cargo doc --open

Areas for Contribution

graph TB
    subgraph "High Priority"
        A1[Hyperparameter Tuning]
        A2[Cross-Validation]
        A3[Feature Selection]
    end

    subgraph "Medium Priority"
        B1[More Derived Features]
        B2[Model Serialization]
        B3[Prediction API]
    end

    subgraph "Low Priority"
        C1[Web Dashboard]
        C2[NASA Archive Integration]
        C3[GPU Acceleration]
    end

High Priority

  1. Hyperparameter tuning: grid search over tree depth, n_estimators, max_features
  2. Cross-validation: stratified k-fold CV implementation
  3. Feature selection: recursive feature elimination to find an optimal subset

Medium Priority

  1. Additional derived features: more astrophysical interactions
  2. Model serialization: save/load trained models to avoid retraining
  3. Prediction API: REST API for real-time classification

Low Priority

  1. Web dashboard: visualize predictions and feature importance
  2. NASA Archive integration: direct API connection for live data
  3. GPU acceleration: CUDA kernels for tree training

Submitting Changes

sequenceDiagram
    participant C as Contributor
    participant R as Repo
    participant M as Maintainer

    C->>R: Fork repository
    C->>C: git checkout -b feature/amazing
    C->>C: git commit -m "Add amazing feature"
    C->>R: git push origin feature/amazing
    C->>R: Open Pull Request
    M->>R: Review code
    M->>C: Request changes (if needed)
    C->>R: Push updates
    M->>R: Approve & Merge

Pull Request Guidelines

  • Describe what changed and why
  • Reference any related issues
  • Include test results
  • Keep changes focused and atomic

Code of Conduct

  • Be respectful and inclusive
  • Focus on constructive feedback
  • Help others learn
  • Credit original authors

Questions?

Open an issue or reach out to @harihar-nautiyal.

Changelog

All notable changes to Astrophage will be documented in this file.

graph LR
    A[v0.1.0] -->|+Two-Stage| B[v0.2.0]
    B -->|+Hyperparam| C[v0.3.0]
    C -->|+Serialization| D[v0.4.0]

    style B fill:#2ecc71,stroke:#27ae60,color:#fff

[0.2.0] - 2026-06-28

Added

  • Two-Stage Random Forest architecture (CONFIRMED vs NOT → CANDIDATE vs FALSE_POSITIVE)
  • 8 derived astrophysical features (fpflag_sum, snr_x_prad, depth_duration_ratio, etc.)
  • Comprehensive JSON report generation with astrophysical insights
  • Feature importance ranking with scientific explanations
  • Google Colab notebook for cloud execution
  • Full mdBook documentation suite with Mermaid diagrams

Performance

  • 94.81% accuracy on test set
  • 92.64% macro F1 score
  • Stage 1 achieves near-perfect separation for CONFIRMED planets

Technical

  • Custom Random Forest implementation in pure Rust
  • Polars DataFrame integration for fast I/O
  • Stratified train/test split with class balance
  • Z-score standardization and median imputation
  • Tracing-based structured logging

graph LR
    A[v0.2.0] --> B[Accuracy: 94.81%]
    A --> C[Macro F1: 92.64%]
    A --> D[Features: 36]

    style B fill:#2ecc71,stroke:#27ae60,color:#fff
    style C fill:#3498db,stroke:#2980b9,color:#fff

[0.1.0] - 2026-06-23

Added

  • Initial single-stage Random Forest classifier
  • Basic feature engineering with 28 base features
  • KOI dataset loading and preprocessing
  • Evaluation metrics (accuracy, precision, recall, F1)
  • Command-line interface with tracing logs

Notes

  • Baseline accuracy: ~91%
  • No derived features
  • Single-stage architecture

graph LR
    A[v0.1.0] --> B[Accuracy: ~91%]
    A --> C[Features: 28]
    A --> D[Single Stage]

    style B fill:#e74c3c,stroke:#c0392c,color:#fff

Roadmap

graph LR
    A[v0.2.0<br/>Current] --> B[v0.3.0<br/>Hyperparameter Tuning]
    B --> C[v0.4.0<br/>Model Serialization]
    C --> D[v0.5.0<br/>REST API]
    D --> E[v1.0.0<br/>Production Ready]

    style A fill:#2ecc71,stroke:#27ae60,color:#fff
    style E fill:#f39c12,stroke:#e67e22,color:#fff

v0.3.0 (Planned)

  • Grid search for optimal hyperparameters
  • K-fold cross-validation
  • Feature importance visualization

v0.4.0 (Planned)

  • Model serialization (save/load trained models)
  • Incremental training support
  • Configuration file support (TOML)

v0.5.0 (Planned)

  • REST API with Axum
  • Batch prediction endpoint
  • Real-time classification

v1.0.0 (Future)

  • Production-ready deployment
  • Docker containerization
  • Comprehensive benchmarking suite

🪐 Astrophage

Two-Stage Random Forest Classifier for NASA Kepler Object of Interest (KOI) Exoplanet Validation


📖 Full Documentation: https://astrophage.hariharnautiyal.com


What is Astrophage?

Astrophage is a high-performance exoplanet classification system built in Rust using Polars and a custom Two-Stage Random Forest implementation. It classifies Kepler Objects of Interest (KOIs) into three categories:

pie title Class Distribution in KOI Dataset
    "FALSE POSITIVE" : 4839
    "CONFIRMED" : 2747
    "CANDIDATE" : 1978

Total Samples: 9,564 | Features: 36 (28 base + 8 derived) | Accuracy: 94.81%


Why Two-Stage?

Our architecture mirrors NASA’s actual vetting workflow. Instead of forcing a single model to learn three classes simultaneously, we decompose the problem into two simpler binary decisions:

graph TD
    A[Raw KOI Data<br/>36 Features] --> B[Stage 1: CONFIRMED vs NOT CONFIRMED]
    B -->|CONFIRMED| C[Output: CONFIRMED ✅]
    B -->|NOT CONFIRMED| D[Stage 2: CANDIDATE vs FALSE POSITIVE]
    D -->|CANDIDATE| E[Output: CANDIDATE 🔍]
    D -->|FALSE POSITIVE| F[Output: FALSE POSITIVE ❌]

    style C fill:#2ecc71,stroke:#27ae60,color:#fff
    style E fill:#3498db,stroke:#2980b9,color:#fff
    style F fill:#e74c3c,stroke:#c0392c,color:#fff

This decomposition improves accuracy by ~3-4% over a single-stage classifier because each stage learns a simpler, cleaner decision boundary.


Key Results

graph LR
    subgraph "Overall Metrics"
        A[Accuracy<br/>94.81%]
        B[Macro F1<br/>92.64%]
        C[Weighted F1<br/>94.51%]
    end

    style A fill:#2ecc71,stroke:#27ae60,color:#fff
    style B fill:#3498db,stroke:#2980b9,color:#fff
    style C fill:#9b59b6,stroke:#8e44ad,color:#fff

Per-Class Breakdown

| Class | Precision | Recall | F1-Score | Support |
| --- | --- | --- | --- | --- |
| CANDIDATE | 88.42% | 85.06% | 86.71% | 1,978 |
| FALSE POSITIVE | 99.69% | 98.35% | 99.01% | 4,839 |
| CONFIRMED | 89.95% | 94.54% | 92.18% | 2,747 |

Architecture

graph LR
    subgraph "Input"
        A[Raw KOI CSV<br/>28 columns]
    end

    subgraph "Feature Engineering"
        B[Base Features<br/>28 columns]
        C[Derived Features<br/>8 interactions]
        D[Preprocessing<br/>impute + standardize]
    end

    subgraph "Two-Stage Model"
        E[Stage 1 RF<br/>CONFIRMED vs NOT]
        F[Stage 2 RF<br/>CANDIDATE vs FALSE+]
    end

    subgraph "Output"
        G[Predictions<br/>3 classes]
        H[JSON Report<br/>metrics + insights]
    end

    A --> B
    B --> C
    C --> D
    D --> E
    E -->|CONFIRMED| G
    E -->|NOT| F
    F -->|CANDIDATE| G
    F -->|FALSE_POSITIVE| G
    G --> H

    style A fill:#e74c3c,stroke:#c0392c,color:#fff
    style D fill:#f39c12,stroke:#e67e22,color:#fff
    style E fill:#2ecc71,stroke:#27ae60,color:#fff
    style F fill:#3498db,stroke:#2980b9,color:#fff
    style H fill:#9b59b6,stroke:#8e44ad,color:#fff

Custom Implementation Details

  • Language: Rust (zero-cost abstractions, memory safety, SIMD-friendly)
  • DataFrame Engine: Polars (blazing fast CSV I/O and columnar operations)
  • ML Backend: Custom Random Forest from scratch (no Python dependency!)
    • Gini impurity splitting
    • Bootstrapped sampling
    • Feature subsampling
    • Majority voting ensemble
  • Parallelism: Tokio async runtime for I/O; ndarray for vectorized math

Installation

Prerequisites

  • Rust (1.85+ recommended)
  • Git

Clone & Build

# Clone the repository
git clone https://github.com/harihar-nautiyal/astrophage.git
cd astrophage

# Build in release mode (optimized)
cargo build --release

# The binary will be at:
# ./target/release/astrophage

Dataset

The repository includes a pre-processed KOI dataset at:

data/koi_dataset.csv

If you want to use your own data, ensure it follows the same column schema (see src/data.rs for expected fields).


Usage

Quick Start

# Run the full pipeline
cargo run --release

Expected Output

╔══════════════════════════════════════════════════════════════╗
║ 🪐 ASTROPHAGE v0.2.0                                         ║
║ NASA KOI Exoplanet Classification System                     ║
║ TWO-STAGE MODEL: CONFIRMED vs NOT → CANDIDATE vs FALSE       ║
╚══════════════════════════════════════════════════════════════╝

Step 1: Loading KOI dataset...
Step 2: Engineering features...
Step 3: Splitting data (80/20 stratified)...
Step 4: Training TWO-STAGE classifier...
Step 5: Evaluating model performance...
Step 6: Top astrophysical predictors:
  1. fpflag_sum                0.2918
  2. koi_fpflag_co             0.0683
  3. koi_max_mult_ev           0.0630
  4. koi_fpflag_nt             0.0624
  5. koi_model_snr             0.0596
  ...
Step 7: Generating final report...

ASTROPHAGE two-stage classification complete!
Check output/report.json for full results.

Output Files

| File | Description |
| --- | --- |
| `output/report.json` | Full JSON report with metrics, feature importance, and insights |
| `output/predictions.csv` | (Optional) Per-sample predictions and probabilities |

Feature Engineering

We transform 28 raw astrophysical features into 36 model-ready features:

Base Features (28)

Orbital, physical, and stellar parameters from the Kepler pipeline:

graph TB
    subgraph "Base Feature Groups"
        A1[Orbital: period, duration, impact, ingress, incl, eccen, sma]
        A2[Physical: ror, prad, teq, insol]
        A3[Signal: snr, count, transits, max events]
        A4[FP Flags: nt, ss, co, ec]
        A5[Stellar: kepmag, dor, srho, steff, slogg, smet, srad, smass]
    end

Derived Features (8)

| Feature | Formula | Astrophysical Rationale |
| --- | --- | --- |
| `koi_prad_squared` | prad² | Non-linear radius effect; objects >15 R⊕ are likely stellar companions |
| `depth_duration_ratio` | depth / duration | Transit steepness; planets have characteristic U-shaped curves |
| `snr_x_prad` | snr × prad | Real planets have SNR consistent with their size |
| `impact_penalty` | 10 if impact > 1.0 else 0 | An impact parameter above 1 means the planet's center misses the stellar disk, so at most a grazing transit is possible |
| `log_period` | ln(period) | Orbital periods are roughly log-normally distributed |
| `teq_over_steff` | teq / steff | Sanity check on equilibrium temperature vs stellar temperature |
| `fpflag_sum` | Σ fpflags | NASA’s pre-vetting suspicion score; higher = more likely false positive |
| `prad_teq_interaction` | prad × teq | Size-temperature interaction for giant planets vs rocky planets |
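
The formulas in the table translate almost directly into code. The sketch below is one possible reading, not the contents of `features.rs`: the `RawKoi` and `Derived` structs, the `derive` function, and the `safe_div` guard (which returns 0.0 for a zero denominator) are all assumptions, and the formulas are applied here to raw, unstandardized values.

```rust
/// Raw inputs needed by the eight derived features (illustrative field names).
struct RawKoi {
    prad: f64,
    depth: f64,
    duration: f64,
    snr: f64,
    impact: f64,
    period: f64,
    teq: f64,
    steff: f64,
    /// koi_fpflag_nt, koi_fpflag_ss, koi_fpflag_co, koi_fpflag_ec
    fpflags: [u8; 4],
}

struct Derived {
    prad_squared: f64,
    depth_duration_ratio: f64,
    snr_x_prad: f64,
    impact_penalty: f64,
    log_period: f64,
    teq_over_steff: f64,
    fpflag_sum: f64,
    prad_teq_interaction: f64,
}

/// Division that yields 0.0 instead of infinity or NaN for a zero
/// denominator. This guard is an assumption of the sketch.
fn safe_div(num: f64, den: f64) -> f64 {
    if den == 0.0 { 0.0 } else { num / den }
}

fn derive(k: &RawKoi) -> Derived {
    Derived {
        prad_squared: k.prad * k.prad,
        depth_duration_ratio: safe_div(k.depth, k.duration),
        snr_x_prad: k.snr * k.prad,
        impact_penalty: if k.impact > 1.0 { 10.0 } else { 0.0 },
        log_period: k.period.ln(), // natural log; period must be > 0
        teq_over_steff: safe_div(k.teq, k.steff),
        fpflag_sum: k.fpflags.iter().map(|&f| f as f64).sum(),
        prad_teq_interaction: k.prad * k.teq,
    }
}
```

Whether the real pipeline computes these from raw or standardized columns matters in practice, since `ln(period)` is only defined for positive inputs.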

Preprocessing

graph LR
    A[Raw Features<br/>28 columns] --> B[Imputation]
    B --> C[Standardization]
    C --> D[Derived Features<br/>8 columns]
    D --> E[Final Feature Matrix<br/>36 columns]

    B -->|Missing values| B1[Column median]
    C -->|Z-score| C1[Mean=0, Std=1]

    style E fill:#2ecc71,stroke:#27ae60,color:#fff
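
The two preprocessing steps in the diagram, median imputation followed by z-score standardization, can be sketched for a single column as below. The function names are hypothetical, missing values are assumed to be encoded as NaN, and the population standard deviation (dividing by n) is an assumption; the crate may make different choices.

```rust
/// Median of the finite values in a column, or None if there are none.
fn median(col: &[f64]) -> Option<f64> {
    let mut v: Vec<f64> = col.iter().copied().filter(|x| x.is_finite()).collect();
    if v.is_empty() {
        return None;
    }
    v.sort_by(|a, b| a.total_cmp(b));
    let mid = v.len() / 2;
    Some(if v.len() % 2 == 0 { (v[mid - 1] + v[mid]) / 2.0 } else { v[mid] })
}

/// Replace non-finite entries with the column median, then rescale the
/// column in place to mean 0 and standard deviation 1.
/// Returns the (mean, std) used, or None if the column has no finite values.
fn impute_and_standardize(col: &mut [f64]) -> Option<(f64, f64)> {
    let med = median(col)?;
    for x in col.iter_mut() {
        if !x.is_finite() {
            *x = med;
        }
    }

    let n = col.len() as f64;
    let mean = col.iter().sum::<f64>() / n;
    let var = col.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    let std = var.sqrt();

    for x in col.iter_mut() {
        // A constant column has std == 0; map it to all zeros.
        *x = if std > 0.0 { (*x - mean) / std } else { 0.0 };
    }
    Some((mean, std))
}
```

To avoid leaking test-set statistics, the returned `(mean, std)` from the training columns would normally be reused to transform the test columns rather than recomputed.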

Astrophysical Insights

Our model reveals key discriminators that align with planetary science:

🔴 Very High Confidence

False Positive Flags (fpflag_sum, koi_fpflag_nt, koi_fpflag_ss) directly encode NASA’s pre-vetting. When non-zero, the signal is almost certainly not a planet. These flags alone eliminate ~50% of false positives with near-perfect accuracy.

graph LR
    A[NASA FP Flags] --> B[fpflag_sum]
    B -->|Value > 0| C[99%+ chance<br/>FALSE POSITIVE]
    B -->|Value = 0| D[Needs further<br/>analysis]

    style C fill:#e74c3c,stroke:#c0392c,color:#fff
    style D fill:#3498db,stroke:#2980b9,color:#fff

🟡 High Confidence

Signal-to-Noise Ratio + Planetary Radius (snr_x_prad, koi_prad): Real planets have consistent SNR for their size. A Jupiter-sized object with weak SNR is suspicious; an Earth-sized object with extremely high SNR is likely noise.

graph LR
    A[Real Planet] -->|Jupiter| B[High SNR]
    A -->|Earth| C[Low SNR]

    D[Inconsistent] -->|Large + Low SNR| E[Suspicious]

    B --> F[snr_x_prad<br/>consistent]
    C --> F
    E --> G[snr_x_prad<br/>inconsistent]

    style F fill:#2ecc71,stroke:#27ae60,color:#fff
    style G fill:#e74c3c,stroke:#c0392c,color:#fff

🟢 Workflow Insight

The two-stage design mirrors how astronomers actually vet candidates: First, separate obvious planets (CONFIRMED) from everything else. Then, carefully distinguish between promising candidates and known false positives. This is why Stage 1 achieves near-perfect separation while Stage 2 focuses on the scientifically interesting boundary.


Project Structure

graph TD
    A[astrophage/] --> B[Cargo.toml]
    A --> C[data/]
    A --> D[src/]
    A --> E[output/]
    A --> F[docs/]

    C --> C1[koi_dataset.csv]

    D --> D1[main.rs]
    D --> D2[data.rs]
    D --> D3[features.rs]
    D --> D4[decision_tree.rs]
    D --> D5[model.rs]
    D --> D6[two_stage_model.rs]
    D --> D7[evaluation.rs]
    D --> D8[report.rs]

    E --> E1[report.json]

    F --> F1[mdbook/]
    F1 --> F2[book.toml]
    F1 --> F3[src/]
    F3 --> F4[README.md]
    F3 --> F5[architecture.md]
    F3 --> F6[two-stage-model.md]
    F3 --> F7[feature-engineering.md]
    F3 --> F8[results.md]
    F3 --> F9[api-reference.md]

    style D1 fill:#f39c12,stroke:#e67e22,color:#fff
    style D6 fill:#2ecc71,stroke:#27ae60,color:#fff
    style F1 fill:#9b59b6,stroke:#8e44ad,color:#fff

Documentation

📖 Full Documentation: https://astrophage.hariharnautiyal.com

The documentation site includes:

  • System architecture with Mermaid diagrams
  • Deep dive into the two-stage model
  • Feature engineering explanations with astrophysical rationale
  • API reference for all modules
  • Contributing guidelines
  • Changelog and roadmap

To build the docs locally:

cd docs
mdbook build
mdbook serve --open

Google Colab

Want to try Astrophage without installing Rust locally?

👉 Open in Google Colab

The notebook will:

  1. Install Rust in the Colab environment
  2. Clone this repository
  3. Build the project with Cargo
  4. Run the full pipeline
  5. Display the report.json with interactive visualizations

Note: The first run takes ~5-7 minutes because the project has to be compiled. Subsequent runs in the same session reuse the build and start almost immediately.


Recommendations for Follow-Up

Based on our model’s behavior, we suggest:

graph TD
    subgraph "Actionable Recommendations"
        A1[Use Stage 1 as rapid filter]
        A2[Investigate uncertain Stage 1 samples]
        A3[Use Stage 2 for prioritization]
        A4[Use fpflag_sum as pre-filter]
    end

    A1 --> B1[Quickly identify CONFIRMED]
    A2 --> B2[Scientifically interesting edge cases]
    A3 --> B3[Prioritize CANDIDATE follow-up]
    A4 --> B4[Eliminate 50% of false positives instantly]

    style A1 fill:#2ecc71,stroke:#27ae60,color:#fff
    style A2 fill:#3498db,stroke:#2980b9,color:#fff
    style A3 fill:#9b59b6,stroke:#8e44ad,color:#fff
    style A4 fill:#f39c12,stroke:#e67e22,color:#fff

| # | Recommendation | Impact |
| --- | --- | --- |
| 1 | Use Stage 1 as a rapid filter for follow-up observations | Saves telescope time |
| 2 | Investigate samples where Stage 1 is uncertain (probability ~0.5) | Most scientifically interesting |
| 3 | For NOT_CONFIRMED, use Stage 2 probability to prioritize follow-up | Efficient resource allocation |
| 4 | fpflag_sum alone eliminates ~50% of false positives with near-perfect accuracy | Dramatic efficiency gain |
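
Recommendations 2 and 3 can be turned into a small triage helper. The sketch below is hypothetical: the function names, the `margin` parameter, and the 0.5 cutoff for NOT_CONFIRMED are assumptions, and the inputs are taken to be per-sample Stage 1 and Stage 2 probabilities of equal length.

```rust
/// Indices of samples whose Stage 1 probability lies within `margin` of 0.5,
/// i.e. the cases where Stage 1 is least certain.
fn uncertain_stage1(p_confirmed: &[f64], margin: f64) -> Vec<usize> {
    (0..p_confirmed.len())
        .filter(|&i| (p_confirmed[i] - 0.5).abs() <= margin)
        .collect()
}

/// Indices of NOT_CONFIRMED samples (Stage 1 probability below 0.5),
/// ordered by descending Stage 2 P(CANDIDATE) so that the most promising
/// candidates come first.
fn follow_up_order(p_confirmed: &[f64], p_candidate: &[f64]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..p_confirmed.len())
        .filter(|&i| p_confirmed[i] < 0.5)
        .collect();
    idx.sort_by(|&a, &b| p_candidate[b].total_cmp(&p_candidate[a]));
    idx
}
```

A reasonable `margin` depends on how well calibrated the forest's probabilities are, which the report does not state.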

Team & Acknowledgments


License

MIT License: feel free to use, modify, and distribute with attribution.


"Somewhere, something incredible is waiting to be known."
- Carl Sagan