Astrophage
Two-Stage Random Forest Classifier Model for NASA Kepler Object of Interest (KOI) Exoplanet Validation
What is Astrophage?
Astrophage is a high-performance exoplanet classification system built in Rust using Polars and a custom Two-Stage Random Forest implementation. It classifies Kepler Objects of Interest (KOIs) into three categories:
| Class | Description | Count |
|---|---|---|
| CONFIRMED | Validated exoplanets with high confidence | 2,747 |
| CANDIDATE | Promising signals awaiting follow-up confirmation | 1,978 |
| FALSE POSITIVE | Non-planetary signals (stellar binaries, instrumental noise, etc.) | 4,839 |
pie title Class Distribution in KOI Dataset
"FALSE POSITIVE" : 4839
"CONFIRMED" : 2747
"CANDIDATE" : 1978
Total Samples: 9,564 | Features: 36 (28 base + 8 derived) | Accuracy: 94.81%
Why Two-Stage?
Our architecture mirrors NASA's actual vetting workflow. Instead of forcing a single model to learn three classes simultaneously, we decompose the problem into two simpler binary decisions:
graph TD
A[Raw KOI Data<br/>36 Features] --> B[Stage 1: CONFIRMED vs NOT CONFIRMED]
B -->|CONFIRMED| C[Output: CONFIRMED]
B -->|NOT CONFIRMED| D[Stage 2: CANDIDATE vs FALSE POSITIVE]
D -->|CANDIDATE| E[Output: CANDIDATE]
D -->|FALSE POSITIVE| F[Output: FALSE POSITIVE]
style C fill:#2ecc71,stroke:#27ae60,color:#fff
style E fill:#3498db,stroke:#2980b9,color:#fff
style F fill:#e74c3c,stroke:#c0392c,color:#fff
This decomposition improves accuracy by roughly 3-4 percentage points over a single-stage classifier (about 91% versus 94.81%) because each stage learns a simpler, cleaner decision boundary.
Key Results
| Metric | Score |
|---|---|
| Accuracy | 94.81% |
| Macro F1 | 92.64% |
| Weighted F1 | 94.51% |
graph LR
subgraph "Overall Metrics"
A[Accuracy<br/>94.81%]
B[Macro F1<br/>92.64%]
C[Weighted F1<br/>94.51%]
end
style A fill:#2ecc71,stroke:#27ae60,color:#fff
style B fill:#3498db,stroke:#2980b9,color:#fff
style C fill:#9b59b6,stroke:#8e44ad,color:#fff
Quick Start
# Clone
git clone https://github.com/harihar-nautiyal/astrophage.git
cd astrophage
# Build
cargo build --release
# Run
./target/release/astrophage
Or try it in your browser with Google Colab, no installation needed!
Project Structure
graph TD
A[astrophage/] --> B[Cargo.toml]
A --> C[data/]
A --> D[src/]
A --> E[output/]
C --> C1[koi_dataset.csv]
D --> D1[main.rs]
D --> D2[data.rs]
D --> D3[features.rs]
D --> D4[decision_tree.rs]
D --> D5[model.rs]
D --> D6[two_stage_model.rs]
D --> D7[evaluation.rs]
D --> D8[report.rs]
E --> E1[report.json]
style D1 fill:#f39c12,stroke:#e67e22,color:#fff
style D6 fill:#2ecc71,stroke:#27ae60,color:#fff
Technology Stack
graph LR
A[Astrophage] --> B[Rust]
A --> C[Polars]
A --> D[NDArray]
A --> E[Tokio]
A --> F[Serde]
B --> B1[Memory Safety]
B --> B2[Zero-Cost Abstractions]
B --> B3[SIMD-Friendly]
C --> C1[Fast CSV I/O]
C --> C2[Columnar Operations]
D --> D1[Vectorized Math]
D --> D2[N-Dimensional Arrays]
E --> E1[Async Runtime]
F --> F1[JSON Serialization]
"Somewhere, something incredible is waiting to be known."
- Carl Sagan
System Architecture
High-Level Overview
Astrophage follows a clean pipeline architecture from raw data to final predictions:
graph LR
subgraph "Input"
A[Raw KOI CSV<br/>28 columns]
end
subgraph "Feature Engineering"
B[Base Features<br/>28 columns]
C[Derived Features<br/>8 interactions]
D[Preprocessing<br/>impute + standardize]
end
subgraph "Two-Stage Model"
E[Stage 1 RF<br/>CONFIRMED vs NOT]
F[Stage 2 RF<br/>CANDIDATE vs FALSE+]
end
subgraph "Output"
G[Predictions<br/>3 classes]
H[JSON Report<br/>metrics + insights]
end
A --> B
B --> C
C --> D
D --> E
E -->|CONFIRMED| G
E -->|NOT| F
F -->|CANDIDATE| G
F -->|FALSE_POSITIVE| G
G --> H
style A fill:#e74c3c,stroke:#c0392c,color:#fff
style D fill:#f39c12,stroke:#e67e22,color:#fff
style E fill:#2ecc71,stroke:#27ae60,color:#fff
style F fill:#3498db,stroke:#2980b9,color:#fff
style H fill:#9b59b6,stroke:#8e44ad,color:#fff
Data Flow
sequenceDiagram
participant User as User
participant Main as main.rs
participant Data as data.rs
participant Features as features.rs
participant Stage1 as Stage 1 RF
participant Stage2 as Stage 2 RF
participant Eval as evaluation.rs
participant Report as report.rs
User->>Main: cargo run --release
Main->>Data: load("data/koi_dataset.csv")
Data-->>Main: KoiDataset (9,564 samples)
Main->>Features: process(&dataset)
Features->>Features: impute_missing()
Features->>Features: standardize()
Features->>Features: compute_derived()
Features-->>Main: ProcessedDataset (36 features)
Main->>Main: split(0.2, seed=42)
Note over Main: 80/20 stratified split
Main->>Stage1: train(&train_data)
Note over Stage1: Binary: CONFIRMED=1, NOT=0
Stage1-->>Main: Stage 1 trained
Main->>Stage2: train(&train_stage2)
Note over Stage2: Binary: CANDIDATE=1, FALSE_POSITIVE=0
Stage2-->>Main: Stage 2 trained
Main->>Eval: evaluate(&classifier, &test)
Eval-->>Main: Metrics (accuracy, F1, etc.)
Main->>Report: generate_report(&metrics, &classifier)
Report-->>Main: output/report.json
Main-->>User: Done!
Random Forest Internals
Single Decision Tree
graph TD
A[Root Node<br/>Gini = 0.65] -->|fpflag_sum < 0.5| B[Left: Gini = 0.15]
A -->|fpflag_sum >= 0.5| C[Right: Gini = 0.05]
B -->|koi_model_snr < 2.0| D[Leaf: CANDIDATE]
B -->|koi_model_snr >= 2.0| E[Leaf: CONFIRMED]
C -->|koi_prad < 15.0| F[Leaf: FALSE_POSITIVE]
C -->|koi_prad >= 15.0| G[Leaf: FALSE_POSITIVE]
style D fill:#3498db,stroke:#2980b9,color:#fff
style E fill:#2ecc71,stroke:#27ae60,color:#fff
style F fill:#e74c3c,stroke:#c0392c,color:#fff
style G fill:#e74c3c,stroke:#c0392c,color:#fff
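The diagram above can be read as a small recursive data structure. The following is a minimal sketch of how such a tree could be walked at prediction time; the `Node` enum, the `predict` function, and the feature ordering (0 = fpflag_sum, 1 = koi_model_snr, 2 = koi_prad) are illustrative assumptions, not the layout used in decision_tree.rs.

```rust
/// A node in a binary decision tree (hypothetical layout, for illustration).
#[derive(Debug)]
pub enum Node {
    /// Terminal node carrying a class label.
    Leaf(&'static str),
    /// Internal node: go left when `sample[feature] < threshold`, else right.
    Split {
        feature: usize,
        threshold: f64,
        left: Box<Node>,
        right: Box<Node>,
    },
}

/// Walk from the given node down to a leaf and return that leaf's label.
pub fn predict(node: &Node, sample: &[f64]) -> &'static str {
    match node {
        Node::Leaf(label) => *label,
        Node::Split { feature, threshold, left, right } => {
            if sample[*feature] < *threshold {
                predict(left, sample)
            } else {
                predict(right, sample)
            }
        }
    }
}

/// Build the example tree from the diagram.
/// Assumed feature order: 0 = fpflag_sum, 1 = koi_model_snr, 2 = koi_prad.
pub fn example_tree() -> Node {
    Node::Split {
        feature: 0,
        threshold: 0.5,
        left: Box::new(Node::Split {
            feature: 1,
            threshold: 2.0,
            left: Box::new(Node::Leaf("CANDIDATE")),
            right: Box::new(Node::Leaf("CONFIRMED")),
        }),
        right: Box::new(Node::Split {
            feature: 2,
            threshold: 15.0,
            left: Box::new(Node::Leaf("FALSE_POSITIVE")),
            right: Box::new(Node::Leaf("FALSE_POSITIVE")),
        }),
    }
}
```

A sample with no false positive flags and a model SNR of 5.0 follows the left branch twice and lands on the CONFIRMED leaf.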
Ensemble Voting
graph TD
A[Sample Input] --> B[Tree 1]
A --> C[Tree 2]
A --> D[Tree 3]
A --> E[...]
A --> F[Tree N]
B -->|CONFIRMED| G[Voting Box]
C -->|CONFIRMED| G
D -->|CANDIDATE| G
E -->|CONFIRMED| G
F -->|CONFIRMED| G
G -->|Majority Vote| H[Final: CONFIRMED]
style G fill:#f39c12,stroke:#e67e22,color:#fff
style H fill:#2ecc71,stroke:#27ae60,color:#fff
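As a rough sketch of the voting step, each tree contributes one label and the most common label wins. The function names and the tie-breaking rule (smallest label wins) are assumptions made for illustration; model.rs may aggregate votes differently, for example by averaging leaf probabilities.

```rust
use std::collections::HashMap;

/// Return the label with the most votes. Ties resolve to the smallest label
/// so the result is deterministic. Returns None for an empty slice.
pub fn majority_vote(votes: &[u8]) -> Option<u8> {
    let mut counts: HashMap<u8, usize> = HashMap::new();
    for &v in votes {
        *counts.entry(v).or_insert(0) += 1;
    }
    counts
        .into_iter()
        // Highest count wins; on equal counts prefer the smaller label.
        .max_by(|a, b| a.1.cmp(&b.1).then(b.0.cmp(&a.0)))
        .map(|(label, _)| label)
}

/// Fraction of trees voting for `class`, usable as a vote-share probability.
pub fn vote_share(votes: &[u8], class: u8) -> f64 {
    if votes.is_empty() {
        return 0.0;
    }
    votes.iter().filter(|&&v| v == class).count() as f64 / votes.len() as f64
}
```

With the five votes shown in the diagram (four for one class, one for another), `majority_vote` returns the four-vote class and `vote_share` for it is 0.8.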
Technology Layers
graph TB
subgraph "Application Layer"
A1[main.rs - CLI & Orchestration]
A2[report.rs - JSON Generation]
end
subgraph "ML Layer"
M1[two_stage_model.rs - Pipeline]
M2[model.rs - Random Forest]
M3[decision_tree.rs - Trees]
end
subgraph "Data Layer"
D1[features.rs - Engineering]
D2[data.rs - Loading]
end
subgraph "Infrastructure Layer"
I1[Polars - DataFrame I/O]
I2[NDArray - Vectorized Math]
I3[Tokio - Async Runtime]
I4[Serde - Serialization]
end
A1 --> M1
A1 --> A2
M1 --> M2
M2 --> M3
M1 --> D1
D1 --> D2
D2 --> I1
D1 --> I2
A1 --> I3
A2 --> I4
Performance Comparison
graph LR
subgraph "Training Time"
A[Astrophage<br/>Rust: ~30s]
B[sklearn RF<br/>Python: ~120s]
end
subgraph "Inference Time"
C[Astrophage<br/>~1ms/sample]
D[sklearn RF<br/>~10ms/sample]
end
subgraph "Binary Size"
E[Astrophage<br/>~2MB]
F[sklearn env<br/>~500MB+]
end
style A fill:#2ecc71,stroke:#27ae60,color:#fff
style C fill:#2ecc71,stroke:#27ae60,color:#fff
style E fill:#2ecc71,stroke:#27ae60,color:#fff
Memory Layout
graph TD
subgraph "Training Data"
A[Features Array2<br/>f64 x (n_samples x 36)]
B[Labels Array1<br/>u8 x n_samples]
end
subgraph "Stage 1 Model"
C[100 Decision Trees]
C1[Tree 1: ~50 nodes]
C2[Tree 2: ~50 nodes]
C3[Tree N: ~50 nodes]
end
subgraph "Stage 2 Model"
D[100 Decision Trees]
D1[Tree 1: ~50 nodes]
D2[Tree 2: ~50 nodes]
D3[Tree N: ~50 nodes]
end
A --> C
A --> D
B --> C
B --> D
Two-Stage Model
The Core Innovation
Astrophage's defining feature is its Two-Stage Random Forest architecture. Instead of a single three-class classifier, we decompose the problem into two sequential binary decisions, mirroring how NASA astronomers actually vet candidates.
graph TB
subgraph "Problem Decomposition"
A[3-Class Problem<br/>CONFIRMED vs CANDIDATE vs FALSE_POSITIVE]
B[Stage 1<br/>CONFIRMED vs NOT_CONFIRMED]
C[Stage 2<br/>CANDIDATE vs FALSE_POSITIVE]
end
A -->|Decompose| B
A -->|Decompose| C
style A fill:#e74c3c,stroke:#c0392c,color:#fff
style B fill:#2ecc71,stroke:#27ae60,color:#fff
style C fill:#3498db,stroke:#2980b9,color:#fff
Why This Works
The Astronomy Perspective
When NASA discovers a KOI, the vetting process is sequential:
- First question: "Do we have overwhelming evidence this is a planet?" → If yes, CONFIRMED
- Second question: "If not confirmed, is it worth follow-up?" → CANDIDATE or FALSE_POSITIVE
graph LR
A[Discovery] --> B{Overwhelming<br/>Evidence?}
B -->|Yes| C[CONFIRMED<br/>Follow-up complete]
B -->|No| D{Promising<br/>Signal?}
D -->|Yes| E[CANDIDATE<br/>Needs more data]
D -->|No| F[FALSE_POSITIVE<br/>Discard]
style C fill:#2ecc71,stroke:#27ae60,color:#fff
style E fill:#3498db,stroke:#2980b9,color:#fff
style F fill:#e74c3c,stroke:#c0392c,color:#fff
The ML Perspective
graph TB
subgraph "Single-Stage Classifier"
A1[Decision Boundary<br/>Complex 3-way split]
A2[Accuracy: ~91%]
A3[Confusion between<br/>CANDIDATE & FALSE_POSITIVE]
end
subgraph "Two-Stage Classifier"
B1[Stage 1: Simple<br/>separation for CONFIRMED]
B2[Stage 2: Focused boundary<br/>between CANDIDATE & FALSE+]
B3[Accuracy: ~94.8%]
B4[Each stage learns<br/>a cleaner boundary]
end
style A2 fill:#e74c3c,stroke:#c0392c,color:#fff
style B3 fill:#2ecc71,stroke:#27ae60,color:#fff
Stage 1: CONFIRMED vs NOT CONFIRMED
Decision Boundary
Stage 1 separates the "easy" class (CONFIRMED) from everything else. Confirmed planets have very strong, consistent signals:
graph LR
subgraph "Feature Space"
A[High SNR]
B[Zero FP Flags]
C[Consistent Radius]
D[Regular Period]
end
A --> E[CONFIRMED Zone]
B --> E
C --> E
D --> E
F[Low SNR] --> G[NOT CONFIRMED Zone]
H[Non-zero FP Flags] --> G
I[Inconsistent Radius] --> G
J[Irregular Period] --> G
style E fill:#2ecc71,stroke:#27ae60,color:#fff
style G fill:#e74c3c,stroke:#c0392c,color:#fff
Performance
Stage 1 is nearly perfect because confirmed planets are genuinely distinct:
graph TD
A[Stage 1 Performance] --> B[Precision: ~99%]
A --> C[Recall: ~98%]
A --> D[F1: ~99%]
style B fill:#2ecc71,stroke:#27ae60,color:#fff
style C fill:#2ecc71,stroke:#27ae60,color:#fff
style D fill:#2ecc71,stroke:#27ae60,color:#fff
Key Features
The most important features for Stage 1:
graph LR
A[fpflag_sum] -->|Weight: 0.29| B[Stage 1]
C[koi_model_snr] -->|Weight: 0.06| B
D[koi_fpflag_nt] -->|Weight: 0.06| B
E[koi_fpflag_ss] -->|Weight: 0.05| B
F[koi_prad] -->|Weight: 0.04| B
style B fill:#f39c12,stroke:#e67e22,color:#fff
Stage 2: CANDIDATE vs FALSE POSITIVE
The Hard Problem
This is where the science gets interesting. Candidates and false positives can look very similar:
graph TB
subgraph "CANDIDATE Characteristics"
A1[Moderate SNR]
A2[Some transit-like shape]
A3[Plausible radius]
A4[No strong FP flags]
end
subgraph "FALSE POSITIVE Characteristics"
B1[Variable SNR]
B2[Non-transit shape possible]
B3[Radius may be too large]
B4[Subtle FP indicators]
end
A1 --- C[The Boundary]
A2 --- C
A3 --- C
A4 --- C
B1 --- C
B2 --- C
B3 --- C
B4 --- C
style C fill:#f39c12,stroke:#e67e22,color:#fff
Performance
graph TD
A[Stage 2 Performance] --> B[Precision: ~88%]
A --> C[Recall: ~85%]
A --> D[F1: ~87%]
style B fill:#3498db,stroke:#2980b9,color:#fff
style C fill:#3498db,stroke:#2980b9,color:#fff
style D fill:#3498db,stroke:#2980b9,color:#fff
Stage 2 is harder but also more scientifically valuable: these are the edge cases astronomers care about most.
Combined Inference Pipeline
graph TD
A[Input Sample<br/>36 Features] --> B{Stage 1:<br/>CONFIRMED?}
B -->|Probability > 0.5| C[Output:<br/>CONFIRMED]
B -->|Probability <= 0.5| D{Stage 2:<br/>CANDIDATE?}
D -->|Probability > 0.5| E[Output:<br/>CANDIDATE]
D -->|Probability <= 0.5| F[Output:<br/>FALSE POSITIVE]
style C fill:#2ecc71,stroke:#27ae60,color:#fff
style E fill:#3498db,stroke:#2980b9,color:#fff
style F fill:#e74c3c,stroke:#c0392c,color:#fff
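The routing in this diagram reduces to two threshold checks. Here is a minimal sketch, assuming each stage exposes a single probability for its positive class; `KoiClass`, `THRESHOLD`, and `route` are hypothetical names, not part of the crate's public API.

```rust
/// The three output classes of the pipeline.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KoiClass {
    Confirmed,
    Candidate,
    FalsePositive,
}

/// Decision threshold used by both stages in the diagram.
pub const THRESHOLD: f64 = 0.5;

/// Combine the two stage probabilities into a final class.
/// `p_confirmed` comes from Stage 1, `p_candidate` from Stage 2.
/// Stage 2 is only consulted when Stage 1 does not exceed the threshold.
pub fn route(p_confirmed: f64, p_candidate: f64) -> KoiClass {
    if p_confirmed > THRESHOLD {
        KoiClass::Confirmed
    } else if p_candidate > THRESHOLD {
        KoiClass::Candidate
    } else {
        KoiClass::FalsePositive
    }
}
```

A probability of exactly 0.5 falls on the "not positive" side at both stages, matching the `<=` edges in the diagram.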
Probability Flow
graph LR
A[Input] --> B[Stage 1 RF]
B -->|P(CONFIRMED) = 0.85| C[CONFIRMED]
B -->|P(CONFIRMED) = 0.30| D[Stage 2]
D -->|P(CANDIDATE) = 0.70| E[CANDIDATE]
D -->|P(CANDIDATE) = 0.20| F[FALSE_POSITIVE]
style C fill:#2ecc71,stroke:#27ae60,color:#fff
style E fill:#3498db,stroke:#2980b9,color:#fff
style F fill:#e74c3c,stroke:#c0392c,color:#fff
Training Data Flow
graph TD
A[Full Dataset<br/>9,564 samples] --> B[Stratified Split<br/>80/20]
B --> C[Train Set<br/>~7,650 samples]
B --> D[Test Set<br/>~1,910 samples]
C --> E[Stage 1 Labels<br/>CONFIRMED=1, NOT=0]
C --> F[Stage 2 Labels<br/>CANDIDATE=1, FALSE=0]
E --> G[Train Stage 1 RF<br/>100 trees]
F --> H[Train Stage 2 RF<br/>100 trees]
G --> I[Stage 1 Model]
H --> J[Stage 2 Model]
D --> K[Evaluate Both<br/>on Test Set]
I --> K
J --> K
style G fill:#2ecc71,stroke:#27ae60,color:#fff
style H fill:#3498db,stroke:#2980b9,color:#fff
style K fill:#f39c12,stroke:#e67e22,color:#fff
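To make the label construction concrete, the sketch below derives both binary targets from the three-class labels. It assumes Stage 2 is trained only on rows that are not CONFIRMED, which the diagram does not state explicitly; the type and function names are illustrative.

```rust
/// Three-class ground truth (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KoiClass {
    Confirmed,
    Candidate,
    FalsePositive,
}

/// Stage 1 targets: CONFIRMED = 1, everything else = 0 (one label per row).
pub fn stage1_labels(classes: &[KoiClass]) -> Vec<u8> {
    classes
        .iter()
        .map(|&c| u8::from(c == KoiClass::Confirmed))
        .collect()
}

/// Stage 2 training subset: row indices of the NOT CONFIRMED samples and
/// their binary targets (CANDIDATE = 1, FALSE_POSITIVE = 0).
pub fn stage2_subset(classes: &[KoiClass]) -> (Vec<usize>, Vec<u8>) {
    let mut rows = Vec::new();
    let mut labels = Vec::new();
    for (i, &c) in classes.iter().enumerate() {
        match c {
            // Assumed to be handled entirely by Stage 1, so excluded here.
            KoiClass::Confirmed => {}
            KoiClass::Candidate => {
                rows.push(i);
                labels.push(1);
            }
            KoiClass::FalsePositive => {
                rows.push(i);
                labels.push(0);
            }
        }
    }
    (rows, labels)
}
```

The returned row indices can be used to select the matching feature rows before training the second forest.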
Error Analysis
graph TD
A[Total Test Samples<br/>1,911] --> B[Correct<br/>1,806]
A --> C[Errors<br/>105]
C --> D[Stage 1 Errors<br/>~15]
C --> E[Stage 2 Errors<br/>~90]
D --> D1[CONFIRMED misclassified<br/>as NOT CONFIRMED]
D --> D2[NOT CONFIRMED misclassified<br/>as CONFIRMED]
E --> E1[CANDIDATE misclassified<br/>as FALSE_POSITIVE]
E --> E2[FALSE_POSITIVE misclassified<br/>as CANDIDATE]
style B fill:#2ecc71,stroke:#27ae60,color:#fff
style D fill:#e74c3c,stroke:#c0392c,color:#fff
style E fill:#e74c3c,stroke:#c0392c,color:#fff
Most errors occur in Stage 2, which is expected: the boundary between candidates and false positives is inherently ambiguous. These are the most scientifically interesting samples.
Feature Engineering
Philosophy
Feature engineering is where astrophysics meets machine learning. We don't just throw raw data at a model; we encode domain knowledge about how planets, stars, and false positives behave.
graph LR
A[Raw Data] --> B[Domain Knowledge]
B --> C[Engineered Features]
C --> D[Better Model]
style B fill:#f39c12,stroke:#e67e22,color:#fff
style D fill:#2ecc71,stroke:#27ae60,color:#fff
Base Features (28)
These are the raw astrophysical measurements from the Kepler pipeline:
graph TB
subgraph "Orbital Parameters"
A1[koi_period]
A2[koi_duration]
A3[koi_impact]
A4[koi_ingress]
A5[koi_incl]
A6[koi_eccen]
A7[koi_sma]
end
subgraph "Physical Parameters"
B1[koi_ror]
B2[koi_prad]
B3[koi_teq]
B4[koi_insol]
end
subgraph "Signal Quality"
C1[koi_model_snr]
C2[koi_count]
C3[koi_num_transits]
C4[koi_max_sngle_ev]
C5[koi_max_mult_ev]
end
subgraph "False Positive Flags"
D1[koi_fpflag_nt]
D2[koi_fpflag_ss]
D3[koi_fpflag_co]
D4[koi_fpflag_ec]
end
subgraph "Stellar Parameters"
E1[koi_kepmag]
E2[koi_dor]
E3[koi_srho]
E4[koi_steff]
E5[koi_slogg]
E6[koi_smet]
E7[koi_srad]
E8[koi_smass]
end
Derived Features (8)
These are where the magic happens. Each derived feature encodes a specific astrophysical insight:
1. fpflag_sum: Total Suspicion Score
graph LR
A[koi_fpflag_nt] -->|+| B[fpflag_sum]
C[koi_fpflag_ss] -->|+| B
D[koi_fpflag_co] -->|+| B
E[koi_fpflag_ec] -->|+| B
B -->|Value > 0| F[Almost certainly<br/>FALSE POSITIVE]
B -->|Value = 0| G[Needs further<br/>analysis]
style F fill:#e74c3c,stroke:#c0392c,color:#fff
style G fill:#3498db,stroke:#2980b9,color:#fff
Importance: 0.2918. The single most important feature. NASA already did the hard work of flagging suspicious signals; we just aggregate those flags.
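A minimal sketch of the aggregation, assuming the four flags are stored as raw 0.0/1.0 values in the order nt, ss, co, ec (the function names are illustrative, and the real features.rs may operate on different column types):

```rust
/// Sum of the four Kepler false positive flags.
/// Assumed order: [koi_fpflag_nt, koi_fpflag_ss, koi_fpflag_co, koi_fpflag_ec].
pub fn fpflag_sum(flags: [f64; 4]) -> f64 {
    flags.iter().sum::<f64>()
}

/// True when no flag is set, i.e. the signal still needs further analysis.
pub fn needs_further_analysis(flags: [f64; 4]) -> bool {
    fpflag_sum(flags) == 0.0
}
```

Because the inputs are exact 0.0/1.0 values, comparing the sum against 0.0 is safe here; with standardized inputs a different test would be needed.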
2. snr_x_prad: Signal Consistency
graph LR
A[Real Planet] -->|Jupiter-sized| B[High SNR expected]
A -->|Earth-sized| C[Low SNR expected]
D[Inconsistent Signal] -->|Large planet<br/>Low SNR| E[FALSE POSITIVE<br/>suspicion]
D -->|Small planet<br/>Very high SNR| E
B --> F[snr_x_prad<br/>consistent]
C --> F
E --> G[snr_x_prad<br/>inconsistent]
style F fill:#2ecc71,stroke:#27ae60,color:#fff
style G fill:#e74c3c,stroke:#c0392c,color:#fff
Importance: 0.0390. Real planets have SNR proportional to their size. A Jupiter-sized object with weak SNR is suspicious.
3. depth_duration_ratio: Transit Shape
graph LR
subgraph "Planet Transit"
A1[U-shaped curve]
A2[Specific depth/duration<br/>ratio]
end
subgraph "Stellar Eclipse"
B1[V-shaped curve]
B2[Different depth/duration<br/>ratio]
end
A1 --> C[depth_duration_ratio<br/>~ planet signature]
B1 --> D[depth_duration_ratio<br/>~ binary signature]
style C fill:#2ecc71,stroke:#27ae60,color:#fff
style D fill:#e74c3c,stroke:#c0392c,color:#fff
Importance: 0.0239. Planets produce U-shaped transits; stellar binaries produce V-shaped eclipses. The ratio captures this difference.
4. koi_prad_squared: Non-Linear Radius Effect
graph LR
A[Planetary Radius] --> B[Linear: prad]
A --> C[Non-linear: prad²]
B --> D[Gradual increase]
C --> E[Sharp threshold<br/>at ~15 R⊕]
E -->|> 15 R⊕| F[Stellar companion<br/>not a planet]
E -->|< 15 R⊕| G[Could be a planet]
style F fill:#e74c3c,stroke:#c0392c,color:#fff
style G fill:#2ecc71,stroke:#27ae60,color:#fff
Importance: 0.0275. Objects larger than ~15 Earth radii are far more likely to be stellar companions than planets. The squared term accentuates this large-radius regime.
5. impact_penalty: Physical Impossibility
graph LR
A[Impact Parameter] -->|b < 1.0| B[Physical transit<br/>possible]
A -->|b > 1.0| C[No transit possible<br/>by geometry]
B --> D[impact_penalty = 0]
C --> E[impact_penalty = 10]
E --> F[Strong FALSE<br/>POSITIVE signal]
style D fill:#2ecc71,stroke:#27ae60,color:#fff
style E fill:#e74c3c,stroke:#c0392c,color:#fff
style F fill:#e74c3c,stroke:#c0392c,color:#fff
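An impact parameter b > 1.0 means the planet's center passes outside the stellar disk, so at most a grazing transit is possible, and none at all once b exceeds 1 + Rp/R* (one plus the planet-to-star radius ratio). The feature treats any b > 1.0 as a strong warning sign.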
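The penalty in the diagram is a simple step function. The sketch below implements it as drawn (0 or 10), and adds a second helper for the exact geometric limit b < 1 + Rp/R*, which is standard transit geometry but is not part of the source pipeline. Both function names are illustrative, and the inputs are assumed to be raw (unstandardized) values.

```rust
/// Step penalty from the diagram: 10.0 when b > 1.0, otherwise 0.0.
/// A NaN impact parameter yields 0.0 because the comparison is false;
/// b exactly equal to 1.0 is treated as not penalized (an assumption).
pub fn impact_penalty(b: f64) -> f64 {
    if b > 1.0 {
        10.0
    } else {
        0.0
    }
}

/// Geometric check: some part of the planet's disk crosses the star only
/// when b < 1 + ror, where `ror` is the planet-to-star radius ratio
/// (koi_ror). Supplementary helper, not taken from the source.
pub fn transit_geometrically_possible(b: f64, ror: f64) -> bool {
    b < 1.0 + ror
}
```

For example, b = 1.05 with a radius ratio of 0.1 is penalized by the step function but still allows a grazing transit.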
6. log_period: Orbital Distribution
graph LR
A[Orbital Period] --> B[Linear scale: skewed]
A --> C[Log scale: normal]
B --> D[Hard to model]
C --> E[Log-normal<br/>distribution]
E --> F[Better model<br/>fit]
style F fill:#2ecc71,stroke:#27ae60,color:#fff
Planetary orbital periods are roughly log-normally distributed and span several orders of magnitude. Taking the log compresses the long tail, making the feature more Gaussian and easier for the model to learn.
7. teq_over_steff: Temperature Sanity Check
graph LR
A[Equilibrium Temp] -->|/| B[Stellar Temp]
B --> C[teq_over_steff]
C -->|~ 0.1-0.5| D[Plausible]
C -->|> 1.0| E[Implausible<br/>teq > steff]
style D fill:#2ecc71,stroke:#27ae60,color:#fff
style E fill:#e74c3c,stroke:#c0392c,color:#fff
A planet's equilibrium temperature should never exceed its host star's temperature. This ratio is a simple sanity check.
8. prad_teq_interaction: Size-Temperature Relationship
graph LR
A[Hot Jupiters] -->|Large + Hot| B[High prad_teq]
C[Rocky Planets] -->|Small + Cool| D[Low prad_teq]
B --> E[Distinct population]
D --> E
style E fill:#3498db,stroke:#2980b9,color:#fff
This interaction helps distinguish between giant planets (large + hot) and rocky planets (small + cool).
Feature Importance Ranking
graph LR
subgraph "Top 5 Features"
A1[fpflag_sum<br/>0.2918]
A2[koi_fpflag_co<br/>0.0683]
A3[koi_max_mult_ev<br/>0.0630]
A4[koi_fpflag_nt<br/>0.0624]
A5[koi_model_snr<br/>0.0596]
end
subgraph "Next 5"
B1[koi_fpflag_ss<br/>0.0450]
B2[koi_prad<br/>0.0437]
B3[snr_x_prad<br/>0.0390]
B4[koi_count<br/>0.0324]
B5[koi_ror<br/>0.0300]
end
Preprocessing Pipeline
graph LR
A[Raw Features<br/>28 columns] --> B[Imputation]
B --> C[Standardization]
C --> D[Derived Features<br/>8 columns]
D --> E[Final Feature Matrix<br/>36 columns]
B -->|Missing values| B1[Column median]
C -->|Z-score| C1[Mean=0, Std=1]
style E fill:#2ecc71,stroke:#27ae60,color:#fff
Missing Value Imputation
graph TD
A[Feature Column] --> B{Contains<br/>NaN?}
B -->|Yes| C[Collect valid values]
C --> D[Sort values]
D --> E[Take median]
E --> F[Replace NaN with median]
B -->|No| G[Keep as-is]
style F fill:#2ecc71,stroke:#27ae60,color:#fff
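The flow above (collect valid values, sort, take the median, replace NaN) might look like the following on a single column. This is a sketch with assumed function names; averaging the two middle values for even-length columns is also an assumption, since the diagram only says "take median".

```rust
/// Median of the non-NaN values, or None if the column has no valid value.
pub fn median(values: &[f64]) -> Option<f64> {
    let mut valid: Vec<f64> = values.iter().copied().filter(|v| !v.is_nan()).collect();
    if valid.is_empty() {
        return None;
    }
    // NaN values were removed above, so total_cmp gives the usual ordering.
    valid.sort_by(|a, b| a.total_cmp(b));
    let mid = valid.len() / 2;
    Some(if valid.len() % 2 == 1 {
        valid[mid]
    } else {
        (valid[mid - 1] + valid[mid]) / 2.0
    })
}

/// Replace every NaN in the column with the column median.
/// A column with no valid values is left unchanged.
pub fn impute_median(column: &mut [f64]) {
    if let Some(m) = median(column) {
        for v in column.iter_mut() {
            if v.is_nan() {
                *v = m;
            }
        }
    }
}
```

In a train/test setting the median would normally be computed on the training split and reused for the test split, so that no test information leaks into preprocessing.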
Z-Score Standardization
graph LR
A[Raw Value x] --> B[Subtract Mean]
B --> C[Divide by Std]
C --> D[Standardized Value<br/>(x - μ) / σ]
style D fill:#2ecc71,stroke:#27ae60,color:#fff
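Standardization puts every feature on a common scale (mean 0, standard deviation 1). This matters most for distance-based calculations, where features with large scales (like period in days) would otherwise dominate small-scale features (like impact parameter); threshold splits in decision trees are themselves unaffected by rescaling.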
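A minimal sketch of the z-score transform on one column follows. It uses the population standard deviation (dividing by n), which is an assumption; the function name and the handling of constant columns are illustrative as well.

```rust
/// Standardize a column in place to mean 0 and standard deviation 1.
/// Returns the (mean, standard deviation) that were used, so the same
/// transform can be reapplied to other data. Constant columns become 0.0.
pub fn standardize(column: &mut [f64]) -> (f64, f64) {
    if column.is_empty() {
        return (0.0, 0.0);
    }
    let n = column.len() as f64;
    let mean = column.iter().sum::<f64>() / n;
    // Population variance: average squared deviation from the mean.
    let variance = column.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    let std_dev = variance.sqrt();
    for x in column.iter_mut() {
        *x = if std_dev > 0.0 { (*x - mean) / std_dev } else { 0.0 };
    }
    (mean, std_dev)
}
```

For the column [2, 4, 4, 4, 5, 5, 7, 9] the mean is 5 and the population standard deviation is 2, so the first value maps to -1.5 and the last to 2.0.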
Feature Correlation Insight
graph TB
subgraph "Feature Groups"
A1[FP Flags] --- A2[fpflag_sum]
A1 --- A3[koi_fpflag_nt]
A1 --- A4[koi_fpflag_ss]
B1[Size] --- B2[koi_prad]
B1 --- B3[koi_ror]
B1 --- B4[snr_x_prad]
C1[Signal] --- C2[koi_model_snr]
C1 --- C3[koi_max_mult_ev]
C1 --- C4[koi_max_sngle_ev]
D1[Orbit] --- D2[koi_period]
D1 --- D3[log_period]
D1 --- D4[koi_duration]
end
A2 --> E[High Importance]
B2 --> E
C2 --> E
D3 --> E
Results & Metrics
Overall Performance
Astrophage achieves the following results on the KOI classification task:
graph LR
subgraph "Metrics"
A[Accuracy<br/>94.81%]
B[Macro F1<br/>92.64%]
C[Weighted F1<br/>94.51%]
end
style A fill:#2ecc71,stroke:#27ae60,color:#fff
style B fill:#3498db,stroke:#2980b9,color:#fff
style C fill:#9b59b6,stroke:#8e44ad,color:#fff
Per-Class Performance
graph TD
subgraph "CONFIRMED"
C1[Precision: 89.95%]
C2[Recall: 94.54%]
C3[F1-Score: 92.18%]
end
subgraph "CANDIDATE"
A1[Precision: 88.42%]
A2[Recall: 85.06%]
A3[F1-Score: 86.71%]
end
subgraph "FALSE POSITIVE"
F1[Precision: 99.69%]
F2[Recall: 98.35%]
F3[F1-Score: 99.01%]
end
style C1 fill:#2ecc71,stroke:#27ae60,color:#fff
style C2 fill:#2ecc71,stroke:#27ae60,color:#fff
style C3 fill:#2ecc71,stroke:#27ae60,color:#fff
style F1 fill:#2ecc71,stroke:#27ae60,color:#fff
style F2 fill:#2ecc71,stroke:#27ae60,color:#fff
style F3 fill:#2ecc71,stroke:#27ae60,color:#fff
Detailed Breakdown
| Class | Precision | Recall | F1-Score | Support | Notes |
|---|---|---|---|---|---|
| CANDIDATE | 88.42% | 85.06% | 86.71% | 1,978 | Hardest class: ambiguous by definition |
| FALSE POSITIVE | 99.69% | 98.35% | 99.01% | 4,839 | Nearly perfect: FP flags are very strong |
| CONFIRMED | 89.95% | 94.54% | 92.18% | 2,747 | Strong: clear signals are easy to identify |
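The F1 columns and the macro and weighted averages reported earlier follow directly from the precision, recall, and support values in this table. The sketch below shows the arithmetic with illustrative function names; it is not the code in evaluation.rs.

```rust
/// F1 is the harmonic mean of precision and recall (0.0 if both are zero).
pub fn f1_score(precision: f64, recall: f64) -> f64 {
    if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    }
}

/// Macro F1: unweighted mean of the per-class F1 scores.
pub fn macro_f1(f1s: &[f64]) -> f64 {
    if f1s.is_empty() {
        return 0.0;
    }
    f1s.iter().sum::<f64>() / f1s.len() as f64
}

/// Weighted F1: per-class F1 scores averaged with class support as weights.
pub fn weighted_f1(f1s: &[f64], support: &[usize]) -> f64 {
    let total: usize = support.iter().sum();
    if total == 0 {
        return 0.0;
    }
    f1s.iter()
        .zip(support)
        .map(|(f, &n)| f * n as f64)
        .sum::<f64>()
        / total as f64
}
```

Feeding in the table's precision and recall pairs reproduces the listed F1 scores to within rounding, and the macro and weighted averages come out at about 92.6% and 94.5%.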
Confusion Matrix
graph TD
subgraph "Confusion Matrix (Test Set: 1,911 samples)"
A1[True CANDIDATE<br/>1,978] --> B1[Predicted CANDIDATE<br/>~1,682]
A1 --> B2[Predicted FALSE+<br/>~296]
A2[True FALSE+<br/>4,839] --> B3[Predicted FALSE+<br/>~4,759]
A2 --> B4[Predicted CANDIDATE<br/>~80]
A3[True CONFIRMED<br/>2,747] --> B5[Predicted CONFIRMED<br/>~2,597]
A3 --> B6[Predicted NOT<br/>~150]
end
style B1 fill:#3498db,stroke:#2980b9,color:#fff
style B3 fill:#e74c3c,stroke:#c0392c,color:#fff
style B5 fill:#2ecc71,stroke:#27ae60,color:#fff
style B2 fill:#e74c3c,stroke:#c0392c,color:#fff
style B4 fill:#3498db,stroke:#2980b9,color:#fff
style B6 fill:#e74c3c,stroke:#c0392c,color:#fff
Most confusion occurs between CANDIDATE and FALSE_POSITIVE, exactly where we expect it. Stage 1's CONFIRMED separation is nearly clean.
Feature Importance
graph LR
subgraph "Top 10 Features"
A1[1. fpflag_sum<br/>0.2918]
A2[2. koi_fpflag_co<br/>0.0683]
A3[3. koi_max_mult_ev<br/>0.0630]
A4[4. koi_fpflag_nt<br/>0.0624]
A5[5. koi_model_snr<br/>0.0596]
A6[6. koi_fpflag_ss<br/>0.0450]
A7[7. koi_prad<br/>0.0437]
A8[8. snr_x_prad<br/>0.0390]
A9[9. koi_count<br/>0.0324]
A10[10. koi_ror<br/>0.0300]
end
Feature Importance by Category
graph TD
subgraph "Importance Distribution"
A[FP Flags<br/>~47%]
B[Signal Quality<br/>~20%]
C[Physical Params<br/>~18%]
D[Derived Features<br/>~15%]
end
A --> E[fpflag_sum dominates]
B --> F[SNR, max events]
C --> G[Radius, temperature]
D --> H[Interactions, ratios]
style A fill:#e74c3c,stroke:#c0392c,color:#fff
style E fill:#e74c3c,stroke:#c0392c,color:#fff
Astrophysical Insights
Insight 1: False Positive Flags (Very High Confidence)
graph LR
A[NASA FP Flags] --> B[fpflag_sum]
B -->|Value > 0| C[99%+ chance<br/>FALSE POSITIVE]
B -->|Value = 0| D[Needs further<br/>analysis]
style C fill:#e74c3c,stroke:#c0392c,color:#fff
style D fill:#3498db,stroke:#2980b9,color:#fff
Supporting features: fpflag_sum, koi_fpflag_nt, koi_fpflag_ss
NASA's pre-vetting flags directly encode expert knowledge. When these are non-zero, the signal is almost certainly not a planet. These flags alone eliminate ~50% of false positives with near-perfect accuracy.
Insight 2: SNR-Radius Consistency (High Confidence)
graph LR
A[Real Planet] -->|Jupiter| B[High SNR]
A -->|Earth| C[Low SNR]
D[Inconsistent] -->|Large + Low SNR| E[Suspicious]
B --> F[snr_x_prad<br/>consistent]
C --> F
E --> G[snr_x_prad<br/>inconsistent]
style F fill:#2ecc71,stroke:#27ae60,color:#fff
style G fill:#e74c3c,stroke:#c0392c,color:#fff
Supporting features: koi_model_snr, snr_x_prad, koi_prad
Real planets have signal-to-noise ratios consistent with their size. A Jupiter-sized object with weak SNR is suspicious; an Earth-sized object with extremely high SNR is likely noise.
Insight 3: Transit Geometry (High Confidence)
graph LR
subgraph "Planet Transit"
A1[U-shaped curve]
A2[Specific depth/duration]
end
subgraph "Stellar Binary"
B1[V-shaped curve]
B2[Different depth/duration]
end
A1 --> C[depth_duration_ratio<br/>~ planet]
B1 --> D[depth_duration_ratio<br/>~ binary]
style C fill:#2ecc71,stroke:#27ae60,color:#fff
style D fill:#e74c3c,stroke:#c0392c,color:#fff
Supporting features: depth_duration_ratio, log_period, koi_duration
Planetary transits produce characteristic U-shaped light curves with specific depth-to-duration ratios. Stellar binaries produce V-shaped eclipses with different geometry. Our derived depth_duration_ratio captures this distinction.
Recommendations
graph TD
subgraph "Actionable Recommendations"
A1[Use Stage 1 as rapid filter]
A2[Investigate uncertain Stage 1 samples]
A3[Use Stage 2 for prioritization]
A4[Use fpflag_sum as pre-filter]
end
A1 --> B1[Quickly identify CONFIRMED]
A2 --> B2[Scientifically interesting edge cases]
A3 --> B3[Prioritize CANDIDATE follow-up]
A4 --> B4[Eliminate 50% of false positives instantly]
style A1 fill:#2ecc71,stroke:#27ae60,color:#fff
style A2 fill:#3498db,stroke:#2980b9,color:#fff
style A3 fill:#9b59b6,stroke:#8e44ad,color:#fff
style A4 fill:#f39c12,stroke:#e67e22,color:#fff
| # | Recommendation | Impact |
|---|---|---|
| 1 | Use Stage 1 as a rapid filter for follow-up observations | Saves telescope time |
| 2 | Investigate samples where Stage 1 is uncertain (probability ~0.5) | Most scientifically interesting |
| 3 | For NOT_CONFIRMED, use Stage 2 probability to prioritize follow-up | Efficient resource allocation |
| 4 | fpflag_sum alone eliminates ~50% of false positives with near-perfect accuracy | Dramatic efficiency gain |
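Recommendations 2 and 3 can be expressed as two small selection helpers over the stage probabilities. This is a sketch under the assumption that per-sample probabilities are available as plain slices; the function names and the `margin` parameter are hypothetical.

```rust
/// Indices of samples whose Stage 1 probability lies within `margin` of 0.5,
/// i.e. the cases where Stage 1 is least certain.
pub fn uncertain_indices(p_confirmed: &[f64], margin: f64) -> Vec<usize> {
    p_confirmed
        .iter()
        .enumerate()
        .filter(|&(_, &p)| (p - 0.5).abs() <= margin)
        .map(|(i, _)| i)
        .collect()
}

/// Indices sorted by descending Stage 2 probability, so the most promising
/// candidates come first. Equal probabilities keep their original order.
pub fn rank_candidates(p_candidate: &[f64]) -> Vec<usize> {
    let mut order: Vec<usize> = (0..p_candidate.len()).collect();
    order.sort_by(|&a, &b| p_candidate[b].total_cmp(&p_candidate[a]));
    order
}
```

With Stage 1 probabilities [0.85, 0.52, 0.30, 0.47] and a margin of 0.05, the second and fourth samples are flagged for closer inspection.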
Comparison with Baselines
graph LR
subgraph "Accuracy Comparison"
A[Single-Stage RF<br/>~91%]
B[Logistic Regression<br/>~87%]
C[SVM<br/>~89%]
D[Astrophage<br/>94.81%]
end
style D fill:#2ecc71,stroke:#27ae60,color:#fff
style A fill:#95a5a6,stroke:#7f8c8d,color:#fff
style B fill:#95a5a6,stroke:#7f8c8d,color:#fff
style C fill:#95a5a6,stroke:#7f8c8d,color:#fff
Astrophage's two-stage architecture provides a 3-4 percentage point accuracy improvement over the single-stage random forest. That is significant in the context of exoplanet discovery: on the 9,564-KOI dataset, each percentage point corresponds to roughly 96 objects.
API Reference
Module Overview
graph TB
subgraph "Public API"
A[lib.rs] --> B[data::KoiDataset]
A --> C[features::FeatureEngineer]
A --> D[two_stage_model::TwoStageClassifier]
A --> E[evaluation::ModelEvaluator]
A --> F[report::generate_report]
A --> G[logger::Logger]
end
subgraph "Internal Modules"
H[decision_tree.rs]
I[model.rs]
end
D --> H
D --> I
data::KoiDataset
Loads and manages the KOI dataset from CSV using Polars.
graph LR
A[CSV File] -->|Polars| B[KoiDataset]
B --> C[n_samples]
B --> D[n_features]
B --> E[features Array2]
B --> F[labels Array1]
B --> G[class_distribution]
Methods
| Method | Signature | Description |
|---|---|---|
| `load` | `load(path: &str) -> Result<KoiDataset>` | Load dataset from CSV |
| `n_samples` | `n_samples() -> usize` | Number of rows |
| `n_features` | `n_features() -> usize` | Number of feature columns |
| `features` | `features() -> &Array2<f64>` | Feature matrix |
| `labels` | `labels() -> &Array1<u8>` | Label vector |
| `class_distribution` | `class_distribution() -> HashMap<String, usize>` | Count per class |
| `feature_index` | `feature_index(name: &str) -> Option<usize>` | Column index by name |
Usage
use astrophage::data::KoiDataset;

// Load the KOI table. The `?` assumes the enclosing function returns a Result.
let dataset = KoiDataset::load("data/koi_dataset.csv")?;
println!(
    "Loaded {} samples with {} features",
    dataset.n_samples(),
    dataset.n_features()
);

// Print the number of samples in each class.
let dist = dataset.class_distribution();
for (class, count) in &dist {
    println!("{}: {}", class, count);
}
features::FeatureEngineer
Transforms raw data into model-ready features through imputation, standardization, and derived feature computation.
graph LR
A[KoiDataset] -->|process| B[FeatureEngineer]
B --> C[Imputation]
C --> D[Standardization]
D --> E[Derived Features]
E --> F[ProcessedDataset]
F --> G[split]
G --> H[Train Set]
G --> I[Test Set]
Methods
| Method | Signature | Description |
|---|---|---|
| `new` | `new() -> FeatureEngineer` | Create new engineer |
| `process` | `process(&mut self, dataset: &KoiDataset) -> Result<ProcessedDataset>` | Full pipeline |
ProcessedDataset Methods
| Method | Signature | Description |
|---|---|---|
| `n_samples` | `n_samples() -> usize` | Number of rows |
| `n_features` | `n_features() -> usize` | Number of columns |
| `features` | `features() -> &Array2<f64>` | Feature matrix |
| `labels` | `labels() -> &Array1<u8>` | Label vector |
| `feature_names` | `feature_names() -> &[String]` | Column names |
| `split` | `split(test_ratio: f64, seed: u64) -> (ProcessedDataset, ProcessedDataset)` | Stratified split |
Usage
use astrophage::features::FeatureEngineer;

// Impute, standardize, and add derived features in one pass.
let mut engineer = FeatureEngineer::new();
let processed = engineer.process(&dataset)?;

// 80/20 stratified split with a fixed seed for reproducibility.
let (train, test) = processed.split(0.2, 42);
println!("Train: {}, Test: {}", train.n_samples(), test.n_samples());
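For readers curious what a seeded stratified split involves, here is a self-contained sketch that works on label indices only. It is not the crate's `split` implementation: the tiny xorshift generator stands in for a real RNG, and the per-class rounding rule is an assumption, so the resulting set sizes may differ slightly from the ones reported in this document.

```rust
use std::collections::BTreeMap;

/// Minimal xorshift64 generator (illustrative stand-in for an RNG crate).
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        XorShift64(seed.max(1)) // the state must never be zero
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Split row indices into (train, test) so that each class contributes
/// about `test_ratio` of its rows to the test set.
pub fn stratified_split(labels: &[u8], test_ratio: f64, seed: u64) -> (Vec<usize>, Vec<usize>) {
    let ratio = test_ratio.clamp(0.0, 1.0);

    // Group row indices by class; BTreeMap keeps class order deterministic.
    let mut by_class: BTreeMap<u8, Vec<usize>> = BTreeMap::new();
    for (i, &y) in labels.iter().enumerate() {
        by_class.entry(y).or_default().push(i);
    }

    let mut rng = XorShift64::new(seed);
    let (mut train, mut test) = (Vec::new(), Vec::new());
    for (_, mut idx) in by_class {
        // Fisher-Yates shuffle (the small modulo bias is ignored here).
        for i in (1..idx.len()).rev() {
            let j = (rng.next_u64() % (i as u64 + 1)) as usize;
            idx.swap(i, j);
        }
        let n_test = (idx.len() as f64 * ratio).round() as usize;
        test.extend_from_slice(&idx[..n_test]);
        train.extend_from_slice(&idx[n_test..]);
    }
    (train, test)
}
```

Because the shuffle is driven entirely by the seed, calling the function twice with the same labels and seed returns identical index sets.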
two_stage_model::TwoStageClassifier
The main two-stage random forest classifier.
graph TB
A[TwoStageClassifier] --> B[Stage 1 RF]
A --> C[Stage 2 RF]
B --> D[train Stage 1]
C --> E[train Stage 2]
D --> F[predict]
E --> F
F --> G[feature_importance]
Methods
| Method | Signature | Description |
|---|---|---|
| `new` | `new() -> TwoStageClassifier` | Create new classifier |
| `train` | `train(&mut self, train: &ProcessedDataset) -> Result<()>` | Train both stages |
| `predict` | `predict(&self, features: &Array2<f64>) -> Vec<u8>` | Predict labels |
| `predict_proba` | `predict_proba(&self, features: &Array2<f64>) -> Vec<Vec<f64>>` | Predict probabilities |
| `feature_importance` | `feature_importance() -> Vec<(String, f64)>` | Feature importance scores |
Usage
use astrophage::two_stage_model::TwoStageClassifier;

// Train both stages on the training split.
let mut classifier = TwoStageClassifier::new();
classifier.train(&train)?;

// Predict class labels for the held-out test features.
let predictions = classifier.predict(test.features());

// Print the first ten feature importance entries.
for (name, score) in classifier.feature_importance().iter().take(10) {
    println!("{}: {:.4}", name, score);
}
evaluation::ModelEvaluator
Computes comprehensive classification metrics.
graph LR
A[TwoStageClassifier] -->|+ Test Data| B[ModelEvaluator]
B --> C[Accuracy]
B --> D[Precision]
B --> E[Recall]
B --> F[F1-Score]
B --> G[Per-Class Metrics]
Methods
| Method | Signature | Description |
|---|---|---|
| `new` | `new(classifier: &TwoStageClassifier, test: &ProcessedDataset) -> ModelEvaluator` | Create evaluator |
| `evaluate` | `evaluate(&self) -> Result<Metrics>` | Compute all metrics |
Metrics Structure
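use std::collections::HashMap;

/// Aggregate metrics for the full three-class problem.
pub struct Metrics {
    pub accuracy: f64,
    pub macro_f1: f64,
    pub weighted_f1: f64,
    /// Per-class metrics keyed by class name.
    pub per_class: HashMap<String, ClassMetrics>,
}

/// Precision, recall, and F1 for a single class.
pub struct ClassMetrics {
    pub precision: f64,
    pub recall: f64,
    pub f1_score: f64,
}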
Usage
use astrophage::evaluation::ModelEvaluator;

let evaluator = ModelEvaluator::new(&classifier, &test);
let metrics = evaluator.evaluate()?;

// Rust precision syntax is `{:.4}`; there is no trailing `f` as in C or Python.
println!("Accuracy: {:.4}", metrics.accuracy);
println!("Macro F1: {:.4}", metrics.macro_f1);
for (class, m) in &metrics.per_class {
    println!(
        "{}: P={:.4} R={:.4} F1={:.4}",
        class, m.precision, m.recall, m.f1_score
    );
}
report::generate_report
Generates the comprehensive JSON report.
graph LR
A[Metrics] -->|+ Classifier| B[generate_report]
B --> C[report.json]
C --> D[Feature Importance]
C --> E[Per-Class Metrics]
C --> F[Astrophysical Insights]
C --> G[Recommendations]
Function
pub fn generate_report(
    metrics: &Metrics,
    classifier: &TwoStageClassifier,
) -> Result<()>
Output: output/report.json
Report Structure
{
"project_name": "Astrophage",
"version": "0.2.0",
"summary": { ... },
"metrics": { ... },
"feature_importance": [ ... ],
"astrophysical_insights": [ ... ],
"recommendations": [ ... ]
}
logger::Logger
Structured logging with tracing.
graph LR
A[Logger] --> B[Console Output]
A --> C[File Output]
B --> D[Colored Logs]
C --> E[app.log]
Methods
| Method | Signature | Description |
|---|---|---|
| `init` | `init(console: bool) -> Result<()>` | Initialize logger |
Usage
use astrophage::logger::Logger;

Logger::init(true).await?;
tracing::info!("Training started...");
Internal: decision_tree::DecisionTree
Custom decision tree implementation using Gini impurity.
graph TD
A[DecisionTree] --> B[fit]
B --> C[find_best_split]
C --> D[compute_gini]
D --> E[split_node]
E --> F[recurse_left]
E --> G[recurse_right]
F --> H[Leaf or Split]
G --> H
Key Parameters
| Parameter | Default | Description |
|---|---|---|
| `max_depth` | 10 | Maximum tree depth |
| `min_samples_leaf` | 5 | Minimum samples per leaf |
| `max_features` | sqrt(n) | Features considered per split |
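The `compute_gini` and `find_best_split` steps can be illustrated with a minimal sketch for a single feature and binary labels. The function names below are hypothetical and simplified: this version ignores `max_depth`, `min_samples_leaf`, and `max_features`, and it is not the crate's `DecisionTree` code.

```rust
/// Gini impurity of binary labels: 1 - p^2 - (1 - p)^2.
fn gini(labels: &[bool]) -> f64 {
    if labels.is_empty() {
        return 0.0;
    }
    let p = labels.iter().filter(|&&l| l).count() as f64 / labels.len() as f64;
    1.0 - p * p - (1.0 - p) * (1.0 - p)
}

/// Size-weighted impurity of the two children after splitting on `value <= threshold`.
fn split_impurity(values: &[f64], labels: &[bool], threshold: f64) -> f64 {
    let (mut left, mut right) = (Vec::new(), Vec::new());
    for (&v, &l) in values.iter().zip(labels) {
        if v <= threshold {
            left.push(l);
        } else {
            right.push(l);
        }
    }
    let n = labels.len() as f64;
    (left.len() as f64 / n) * gini(&left) + (right.len() as f64 / n) * gini(&right)
}

/// Try the midpoint between each pair of consecutive distinct values.
/// Returns `(threshold, weighted_impurity)` for the best candidate, if any exists.
fn best_split(values: &[f64], labels: &[bool]) -> Option<(f64, f64)> {
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    sorted.dedup();
    sorted
        .windows(2)
        .map(|w| (w[0] + w[1]) / 2.0)
        .map(|t| (t, split_impurity(values, labels, t)))
        .min_by(|a, b| a.1.total_cmp(&b.1))
}

fn main() {
    // Toy data: the label flips between 2.0 and 3.0
    let values = [1.0, 2.0, 3.0, 4.0];
    let labels = [false, false, true, true];
    if let Some((threshold, impurity)) = best_split(&values, &labels) {
        println!("best threshold = {threshold}, weighted Gini = {impurity}");
    }
}
```

A full tree would apply this search to a random subset of features at each node and recurse on the two children until a depth or leaf-size limit is reached.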
Internal: model::RandomForest
Ensemble of decision trees with bootstrap sampling.
graph TD
A[RandomForest] --> B[n_estimators: 100]
B --> C[Tree 1]
B --> D[Tree 2]
B --> E[...]
B --> F[Tree N]
C --> G[Majority Vote]
D --> G
F --> G
G --> H[Final Prediction]
Key Parameters
| Parameter | Default | Description |
|---|---|---|
| `n_estimators` | 100 | Number of trees |
| `max_depth` | 10 | Max depth per tree |
| `max_features` | sqrt(n) | Number of features considered per split |
| `bootstrap` | true | Use bootstrap sampling |
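The two ensemble ingredients, bootstrap sampling and majority voting, can be sketched without external crates. Everything here is an assumption for illustration: `Lcg` is a toy random number generator standing in for whatever the crate uses, and `bootstrap_indices` and `majority_vote` are hypothetical names.

```rust
/// Toy linear congruential generator so the sketch needs no external crates.
struct Lcg(u64);

impl Lcg {
    /// Pseudo-random index in `0..n` (n must be non-zero).
    fn next_index(&mut self, n: usize) -> usize {
        // Constants from Knuth's MMIX generator; the high bits are better distributed
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        ((self.0 >> 33) as usize) % n
    }
}

/// One bootstrap sample: `n` row indices drawn with replacement.
fn bootstrap_indices(n: usize, rng: &mut Lcg) -> Vec<usize> {
    (0..n).map(|_| rng.next_index(n)).collect()
}

/// Majority vote over per-tree class ids; ties go to the smallest class id.
fn majority_vote(votes: &[usize], n_classes: usize) -> usize {
    let mut counts = vec![0usize; n_classes];
    for &v in votes {
        counts[v] += 1;
    }
    let mut best = 0;
    for (class, &count) in counts.iter().enumerate() {
        // Strict comparison keeps the earlier (smaller) class id on ties
        if count > counts[best] {
            best = class;
        }
    }
    best
}

fn main() {
    let mut rng = Lcg(42);
    let sample = bootstrap_indices(10, &mut rng);
    println!("bootstrap rows: {:?}", sample);

    // Five trees vote on one sample
    let votes = [0, 1, 1, 2, 1];
    println!("ensemble prediction: class {}", majority_vote(&votes, 3));
}
```

Each tree would be trained on the rows selected by its own bootstrap sample, and the forest's prediction is the vote over all trees.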
Data Schema
Expected columns in koi_dataset.csv:
Orbital Parameters
| Column | Unit | Description |
|---|---|---|
| `koi_period` | days | Orbital period |
| `koi_duration` | hours | Transit duration |
| `koi_depth` | ppm | Transit depth |
| `koi_impact` | dimensionless | Impact parameter |
| `koi_ingress` | hours | Ingress duration |
| `koi_incl` | deg | Orbital inclination |
| `koi_eccen` | dimensionless | Eccentricity |
| `koi_sma` | AU | Semi-major axis |
Physical Parameters
| Column | Unit | Description |
|---|---|---|
| `koi_ror` | dimensionless | Radius ratio (planet/star) |
| `koi_prad` | R⊕ | Planetary radius |
| `koi_teq` | K | Equilibrium temperature |
| `koi_insol` | Earth flux | Insolation flux |
Signal Quality
| Column | Description |
|---|---|
| `koi_model_snr` | Signal-to-noise ratio |
| `koi_count` | Number of KOIs in system |
| `koi_num_transits` | Number of detected transits |
| `koi_max_sngle_ev` | Max single event statistic |
| `koi_max_mult_ev` | Max multiple event statistic |
False Positive Flags
| Column | Description |
|---|---|
| `koi_fpflag_nt` | Not Transit-like |
| `koi_fpflag_ss` | Stellar Eclipse |
| `koi_fpflag_co` | Centroid Offset |
| `koi_fpflag_ec` | Ephemeris Match |
Stellar Parameters
| Column | Unit | Description |
|---|---|---|
| `koi_kepmag` | mag | Kepler magnitude |
| `koi_dor` | dimensionless | Planet-star distance over stellar radius |
| `koi_srho` | g/cm³ | Stellar density |
| `koi_steff` | K | Stellar effective temperature |
| `koi_slogg` | log10(cm/s²) | Surface gravity (log) |
| `koi_smet` | dex | Metallicity |
| `koi_srad` | R☉ | Stellar radius |
| `koi_smass` | M☉ | Stellar mass |
Contributing to Astrophage
Thank you for your interest in contributing! Astrophage is a Rust-based exoplanet classification project, and we welcome contributions of all kinds.
graph LR
A[Fork] --> B[Branch]
B --> C[Code]
C --> D[Test]
D --> E[PR]
E --> F[Merge]
style F fill:#2ecc71,stroke:#27ae60,color:#fff
Getting Started
Prerequisites
- Rust 1.85+ (install via rustup)
- Git
Setup
# Clone your fork
git clone https://github.com/YOUR_USERNAME/astrophage.git
cd astrophage
# Build
cargo build --release
# Run tests
cargo test
Development Workflow
graph TD
A[Issue/Feature Request] --> B[Create Branch]
B --> C[Implement]
C --> D[Format & Lint]
D --> E[Test]
E --> F[Commit]
F --> G[Push]
G --> H[Pull Request]
H --> I[Review]
I -->|Approved| J[Merge]
I -->|Changes| C
style J fill:#2ecc71,stroke:#27ae60,color:#fff
Code Style
We follow standard Rust conventions:
# Format code
cargo fmt
# Run linter
cargo clippy
# Generate docs
cargo doc --open
Areas for Contribution
graph TB
subgraph "High Priority"
A1[Hyperparameter Tuning]
A2[Cross-Validation]
A3[Feature Selection]
end
subgraph "Medium Priority"
B1[More Derived Features]
B2[Model Serialization]
B3[Prediction API]
end
subgraph "Low Priority"
C1[Web Dashboard]
C2[NASA Archive Integration]
C3[GPU Acceleration]
end
High Priority
- Hyperparameter tuning: Grid search over tree depth, n_estimators, max_features
- Cross-validation: K-fold stratified CV implementation
- Feature selection: Recursive feature elimination to find optimal subset
Medium Priority
- Additional derived features: More astrophysical interactions
- Model serialization: Save/load trained models to avoid retraining
- Prediction API: REST API for real-time classification
Low Priority
- Web dashboard: Visualize predictions and feature importance
- NASA Archive integration: Direct API connection for live data
- GPU acceleration: CUDA kernels for tree training
Submitting Changes
sequenceDiagram
participant C as Contributor
participant R as Repo
participant M as Maintainer
C->>R: Fork repository
C->>C: git checkout -b feature/amazing
C->>C: git commit -m "Add amazing feature"
C->>R: git push origin feature/amazing
C->>R: Open Pull Request
M->>R: Review code
M->>C: Request changes (if needed)
C->>R: Push updates
M->>R: Approve & Merge
Pull Request Guidelines
- Describe what changed and why
- Reference any related issues
- Include test results
- Keep changes focused and atomic
Code of Conduct
- Be respectful and inclusive
- Focus on constructive feedback
- Help others learn
- Credit original authors
Questions?
Open an issue or reach out to @harihar-nautiyal.
Changelog
All notable changes to Astrophage will be documented in this file.
graph LR
A[v0.1.0] -->|+Two-Stage| B[v0.2.0]
B -->|+Hyperparam| C[v0.3.0]
C -->|+API| D[v0.4.0]
style B fill:#2ecc71,stroke:#27ae60,color:#fff
[0.2.0] - 2026-06-28
Added
- Two-Stage Random Forest architecture (CONFIRMED vs NOT → CANDIDATE vs FALSE_POSITIVE)
- 8 derived astrophysical features (fpflag_sum, snr_x_prad, depth_duration_ratio, etc.)
- Comprehensive JSON report generation with astrophysical insights
- Feature importance ranking with scientific explanations
- Google Colab notebook for cloud execution
- Full mdBook documentation suite with Mermaid diagrams
Performance
- 94.81% accuracy on test set
- 92.64% macro F1 score
- Stage 1 achieves near-perfect separation for CONFIRMED planets
Technical
- Custom Random Forest implementation in pure Rust
- Polars DataFrame integration for fast I/O
- Stratified train/test split with class balance
- Z-score standardization and median imputation
- Tracing-based structured logging
graph LR
A[v0.2.0] --> B[Accuracy: 94.81%]
A --> C[Macro F1: 92.64%]
A --> D[Features: 36]
style B fill:#2ecc71,stroke:#27ae60,color:#fff
style C fill:#3498db,stroke:#2980b9,color:#fff
[0.1.0] - 2026-06-23
Added
- Initial single-stage Random Forest classifier
- Basic feature engineering with 28 base features
- KOI dataset loading and preprocessing
- Evaluation metrics (accuracy, precision, recall, F1)
- Command-line interface with tracing logs
Notes
- Baseline accuracy: ~91%
- No derived features
- Single-stage architecture
graph LR
A[v0.1.0] --> B[Accuracy: ~91%]
A --> C[Features: 28]
A --> D[Single Stage]
style B fill:#e74c3c,stroke:#c0392c,color:#fff
Roadmap
graph LR
A[v0.2.0<br/>Current] --> B[v0.3.0<br/>Hyperparameter Tuning]
B --> C[v0.4.0<br/>Model Serialization]
C --> D[v0.5.0<br/>REST API]
D --> E[v1.0.0<br/>Production Ready]
style A fill:#2ecc71,stroke:#27ae60,color:#fff
style E fill:#f39c12,stroke:#e67e22,color:#fff
v0.3.0 (Planned)
- Grid search for optimal hyperparameters
- K-fold cross-validation
- Feature importance visualization
v0.4.0 (Planned)
- Model serialization (save/load trained models)
- Incremental training support
- Configuration file support (TOML)
v0.5.0 (Planned)
- REST API with Axum
- Batch prediction endpoint
- Real-time classification
v1.0.0 (Future)
- Production-ready deployment
- Docker containerization
- Comprehensive benchmarking suite
🪐 Astrophage
Two-Stage Random Forest Classifier for NASA Kepler Object of Interest (KOI) Exoplanet Validation
Full Documentation: https://astrophage.hariharnautiyal.com
What is Astrophage?
Astrophage is a high-performance exoplanet classification system built in Rust using Polars and a custom Two-Stage Random Forest implementation. It classifies Kepler Objects of Interest (KOIs) into three categories:
pie title Class Distribution in KOI Dataset
"FALSE POSITIVE" : 4839
"CONFIRMED" : 2747
"CANDIDATE" : 1978
Total Samples: 9,564 | Features: 36 (28 base + 8 derived) | Accuracy: 94.81%
Why Two-Stage?
Our architecture mirrors NASA's actual vetting workflow. Instead of forcing a single model to learn three classes simultaneously, we decompose the problem into two simpler binary decisions:
graph TD
A[Raw KOI Data<br/>36 Features] --> B[Stage 1: CONFIRMED vs NOT CONFIRMED]
B -->|CONFIRMED| C[Output: CONFIRMED ✅]
B -->|NOT CONFIRMED| D[Stage 2: CANDIDATE vs FALSE POSITIVE]
D -->|CANDIDATE| E[Output: CANDIDATE]
D -->|FALSE POSITIVE| F[Output: FALSE POSITIVE ❌]
style C fill:#2ecc71,stroke:#27ae60,color:#fff
style E fill:#3498db,stroke:#2980b9,color:#fff
style F fill:#e74c3c,stroke:#c0392c,color:#fff
This decomposition improves accuracy by roughly 3-4 percentage points over a single-stage classifier because each stage learns a simpler, cleaner decision boundary.
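The routing in the diagram above can be sketched in a few lines. This is only an illustration of the two binary decisions, not the crate's `TwoStageClassifier`: `Disposition`, `two_stage_predict`, and the toy stage functions are hypothetical stand-ins for the two trained Random Forests.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Disposition {
    Confirmed,
    Candidate,
    FalsePositive,
}

/// Route one sample through two binary decisions.
/// Stage 2 is only consulted when Stage 1 answers "not confirmed".
fn two_stage_predict<S1, S2>(features: &[f64], stage1_is_confirmed: S1, stage2_is_candidate: S2) -> Disposition
where
    S1: Fn(&[f64]) -> bool,
    S2: Fn(&[f64]) -> bool,
{
    if stage1_is_confirmed(features) {
        Disposition::Confirmed
    } else if stage2_is_candidate(features) {
        Disposition::Candidate
    } else {
        Disposition::FalsePositive
    }
}

// Toy stand-ins for the trained models (assumptions for illustration only)
fn toy_stage1(x: &[f64]) -> bool {
    x[0] > 0.5
}

fn toy_stage2(x: &[f64]) -> bool {
    x[1] > 0.5
}

fn main() {
    for sample in [[0.9, 0.1], [0.2, 0.8], [0.2, 0.1]] {
        let label = two_stage_predict(&sample, toy_stage1, toy_stage2);
        println!("{:?} -> {:?}", sample, label);
    }
}
```

Because Stage 2 never sees samples that Stage 1 accepts, each model only has to learn one binary boundary.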
Key Results
graph LR
subgraph "Overall Metrics"
A[Accuracy<br/>94.81%]
B[Macro F1<br/>92.64%]
C[Weighted F1<br/>94.51%]
end
style A fill:#2ecc71,stroke:#27ae60,color:#fff
style B fill:#3498db,stroke:#2980b9,color:#fff
style C fill:#9b59b6,stroke:#8e44ad,color:#fff
Per-Class Breakdown
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| CANDIDATE | 88.42% | 85.06% | 86.71% | 1,978 |
| FALSE POSITIVE | 99.69% | 98.35% | 99.01% | 4,839 |
| CONFIRMED | 89.95% | 94.54% | 92.18% | 2,747 |
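The relationship between this table and the headline Macro F1 and Weighted F1 can be checked with a short sketch. The helper name `macro_and_weighted_f1` is hypothetical; the inputs are the rounded F1 scores and supports from the table above.

```rust
/// Takes (per-class F1, support) pairs and returns (macro F1, weighted F1).
/// Macro F1 is the plain mean; weighted F1 weights each class by its support.
fn macro_and_weighted_f1(classes: &[(f64, usize)]) -> (f64, f64) {
    let macro_f1 = classes.iter().map(|&(f1, _)| f1).sum::<f64>() / classes.len() as f64;
    let total: usize = classes.iter().map(|&(_, n)| n).sum();
    let weighted_f1 = classes.iter().map(|&(f1, n)| f1 * n as f64).sum::<f64>() / total as f64;
    (macro_f1, weighted_f1)
}

fn main() {
    // (F1 in percent, support) for CANDIDATE, FALSE POSITIVE, CONFIRMED
    let table = [(86.71, 1978), (99.01, 4839), (92.18, 2747)];
    let (macro_f1, weighted_f1) = macro_and_weighted_f1(&table);
    println!("Macro F1: {:.2}%  Weighted F1: {:.2}%", macro_f1, weighted_f1);
}
```

With the rounded table values this gives roughly 92.63% and 94.50%, which agrees with the reported 92.64% and 94.51% up to rounding of the per-class inputs. Weighted F1 is higher because the largest class, FALSE POSITIVE, is also the easiest one.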
Architecture
graph LR
subgraph "Input"
A[Raw KOI CSV<br/>28 columns]
end
subgraph "Feature Engineering"
B[Base Features<br/>28 columns]
C[Derived Features<br/>8 interactions]
D[Preprocessing<br/>impute + standardize]
end
subgraph "Two-Stage Model"
E[Stage 1 RF<br/>CONFIRMED vs NOT]
F[Stage 2 RF<br/>CANDIDATE vs FALSE+]
end
subgraph "Output"
G[Predictions<br/>3 classes]
H[JSON Report<br/>metrics + insights]
end
A --> B
B --> C
C --> D
D --> E
E -->|CONFIRMED| G
E -->|NOT| F
F -->|CANDIDATE| G
F -->|FALSE_POSITIVE| G
G --> H
style A fill:#e74c3c,stroke:#c0392c,color:#fff
style D fill:#f39c12,stroke:#e67e22,color:#fff
style E fill:#2ecc71,stroke:#27ae60,color:#fff
style F fill:#3498db,stroke:#2980b9,color:#fff
style H fill:#9b59b6,stroke:#8e44ad,color:#fff
Custom Implementation Details
- Language: Rust (zero-cost abstractions, memory safety, SIMD-friendly)
- DataFrame Engine: Polars (blazing fast CSV I/O and columnar operations)
- ML Backend: Custom Random Forest from scratch (no Python dependency!)
  - Gini impurity splitting
  - Bootstrapped sampling
  - Feature subsampling
  - Majority voting ensemble
- Parallelism: Tokio async runtime for I/O; ndarray for vectorized math
Installation
Prerequisites
- Rust (1.85+ recommended)
- Git
Clone & Build
# Clone the repository
git clone https://github.com/harihar-nautiyal/astrophage.git
cd astrophage
# Build in release mode (optimized)
cargo build --release
# The binary will be at:
# ./target/release/astrophage
Dataset
The repository includes a pre-processed KOI dataset at:
data/koi_dataset.csv
If you want to use your own data, ensure it follows the same column schema (see src/data.rs for expected fields).
Usage
Quick Start
# Run the full pipeline
cargo run --release
Expected Output
╔════════════════════════════════════════════════════════════╗
║  🪐 ASTROPHAGE v0.2.0                                       ║
║  NASA KOI Exoplanet Classification System                  ║
║  TWO-STAGE MODEL: CONFIRMED vs NOT → CANDIDATE vs FALSE    ║
╚════════════════════════════════════════════════════════════╝
Step 1: Loading KOI dataset...
Step 2: Engineering features...
Step 3: Splitting data (80/20 stratified)...
Step 4: Training TWO-STAGE classifier...
Step 5: Evaluating model performance...
Step 6: Top astrophysical predictors:
1. fpflag_sum 0.2918
2. koi_fpflag_co 0.0683
3. koi_max_mult_ev 0.0630
4. koi_fpflag_nt 0.0624
5. koi_model_snr 0.0596
...
Step 7: Generating final report...
ASTROPHAGE two-stage classification complete!
Check output/report.json for full results.
Output Files
| File | Description |
|---|---|
| `output/report.json` | Full JSON report with metrics, feature importance, and insights |
| `output/predictions.csv` | (Optional) Per-sample predictions and probabilities |
Feature Engineering
We transform 28 raw astrophysical features into 36 model-ready features:
Base Features (28)
Orbital, physical, and stellar parameters from the Kepler pipeline:
graph TB
subgraph "Base Feature Groups"
A1[Orbital: period, duration, impact, ingress, incl, eccen, sma]
A2[Physical: ror, prad, teq, insol]
A3[Signal: snr, count, transits, max events]
A4[FP Flags: nt, ss, co, ec]
A5[Stellar: kepmag, dor, srho, steff, slogg, smet, srad, smass]
end
Derived Features (8)
| Feature | Formula | Astrophysical Rationale |
|---|---|---|
| `koi_prad_squared` | prad² | Non-linear radius effect; objects >15 R⊕ are likely stellar companions |
| `depth_duration_ratio` | depth / duration | Transit steepness; planets have characteristic U-shaped curves |
| `snr_x_prad` | snr × prad | Real planets have SNR consistent with their size |
| `impact_penalty` | 10 if impact > 1.0 else 0 | Impact parameter >1 allows at most a grazing transit, which is uncommon for genuine planets |
| `log_period` | ln(period) | Orbital periods span several orders of magnitude and are roughly log-normally distributed |
| `teq_over_steff` | teq / steff | Sanity check on equilibrium temperature vs stellar temperature |
| `fpflag_sum` | Σ fpflags | NASA's pre-vetting suspicion score; higher = more likely false positive |
| `prad_teq_interaction` | prad × teq | Size-temperature interaction for giant planets vs rocky planets |
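The formulas in the table translate directly into code. The sketch below is an assumption-laden illustration rather than the contents of `features.rs`: the `KoiRow` and `DerivedFeatures` structs and the `derive_features` function are hypothetical, and the zero-denominator guard is a choice made here because the source does not say how division by zero is handled.

```rust
/// Raw KOI columns needed by the derived features (hypothetical struct).
#[derive(Debug, Clone)]
pub struct KoiRow {
    pub period: f64,      // koi_period, days
    pub duration: f64,    // koi_duration, hours
    pub depth: f64,       // koi_depth, ppm
    pub impact: f64,      // koi_impact
    pub prad: f64,        // koi_prad, Earth radii
    pub teq: f64,         // koi_teq, K
    pub steff: f64,       // koi_steff, K
    pub snr: f64,         // koi_model_snr
    pub fpflags: [u8; 4], // koi_fpflag_nt, _ss, _co, _ec
}

/// The eight derived features from the table above.
#[derive(Debug, Clone, PartialEq)]
pub struct DerivedFeatures {
    pub koi_prad_squared: f64,
    pub depth_duration_ratio: f64,
    pub snr_x_prad: f64,
    pub impact_penalty: f64,
    pub log_period: f64,
    pub teq_over_steff: f64,
    pub fpflag_sum: f64,
    pub prad_teq_interaction: f64,
}

pub fn derive_features(k: &KoiRow) -> DerivedFeatures {
    // Assumption: a zero denominator yields 0.0 instead of infinity or NaN
    let safe_div = |num: f64, den: f64| if den != 0.0 { num / den } else { 0.0 };
    DerivedFeatures {
        koi_prad_squared: k.prad * k.prad,
        depth_duration_ratio: safe_div(k.depth, k.duration),
        snr_x_prad: k.snr * k.prad,
        impact_penalty: if k.impact > 1.0 { 10.0 } else { 0.0 },
        log_period: k.period.ln(),
        teq_over_steff: safe_div(k.teq, k.steff),
        fpflag_sum: k.fpflags.iter().map(|&f| f64::from(f)).sum::<f64>(),
        prad_teq_interaction: k.prad * k.teq,
    }
}

fn main() {
    let row = KoiRow {
        period: 10.0, duration: 4.0, depth: 100.0, impact: 1.2,
        prad: 2.0, teq: 500.0, steff: 5000.0, snr: 10.0,
        fpflags: [1, 0, 1, 0],
    };
    println!("{:#?}", derive_features(&row));
}
```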
Preprocessing
graph LR
A[Raw Features<br/>28 columns] --> B[Imputation]
B --> C[Standardization]
C --> D[Derived Features<br/>8 columns]
D --> E[Final Feature Matrix<br/>36 columns]
B -->|Missing values| B1[Column median]
C -->|Z-score| C1[Mean=0, Std=1]
style E fill:#2ecc71,stroke:#27ae60,color:#fff
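The two preprocessing steps in the diagram, median imputation followed by z-score standardization, can be sketched per column as below. The function names are hypothetical, and two details are assumptions rather than documented behavior: missing values are represented as NaN, and the standard deviation is the population form (dividing by n).

```rust
/// Median of the finite values in a column; `None` if every value is missing.
fn median(col: &[f64]) -> Option<f64> {
    let mut v: Vec<f64> = col.iter().copied().filter(|x| x.is_finite()).collect();
    if v.is_empty() {
        return None;
    }
    v.sort_by(|a, b| a.total_cmp(b));
    let mid = v.len() / 2;
    Some(if v.len() % 2 == 0 { (v[mid - 1] + v[mid]) / 2.0 } else { v[mid] })
}

/// Replace missing (non-finite) entries with the column median.
fn impute_median(col: &mut [f64]) {
    if let Some(m) = median(col) {
        for x in col.iter_mut() {
            if !x.is_finite() {
                *x = m;
            }
        }
    }
}

/// Z-score standardization: subtract the mean, divide by the standard deviation.
fn standardize(col: &mut [f64]) {
    if col.is_empty() {
        return;
    }
    let n = col.len() as f64;
    let mean = col.iter().sum::<f64>() / n;
    let std = (col.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n).sqrt();
    // A constant column has std = 0; dividing by 1.0 leaves it centered at zero
    let scale = if std > 0.0 { std } else { 1.0 };
    for x in col.iter_mut() {
        *x = (*x - mean) / scale;
    }
}

fn main() {
    let mut col = vec![1.0, f64::NAN, 3.0];
    impute_median(&mut col);
    println!("after imputation: {:?}", col);
    standardize(&mut col);
    println!("after standardization: {:?}", col);
}
```

In a real pipeline the median, mean, and standard deviation would be estimated on the training split and reused on the test split to avoid leakage; this sketch processes a single column in place for brevity.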
Astrophysical Insights
Our model reveals key discriminators that align with planetary science:
🔴 Very High Confidence
False Positive Flags (`fpflag_sum`, `koi_fpflag_nt`, `koi_fpflag_ss`) directly encode NASA's pre-vetting. When non-zero, the signal is almost certainly not a planet. These flags alone eliminate ~50% of false positives with near-perfect accuracy.
graph LR
A[NASA FP Flags] --> B[fpflag_sum]
B -->|Value > 0| C[99%+ chance<br/>FALSE POSITIVE]
B -->|Value = 0| D[Needs further<br/>analysis]
style C fill:#e74c3c,stroke:#c0392c,color:#fff
style D fill:#3498db,stroke:#2980b9,color:#fff
🟡 High Confidence
Signal-to-Noise Ratio + Planetary Radius (`snr_x_prad`, `koi_prad`): Real planets have consistent SNR for their size. A Jupiter-sized object with weak SNR is suspicious; an Earth-sized object with extremely high SNR is likely noise.
graph LR
A[Real Planet] -->|Jupiter| B[High SNR]
A -->|Earth| C[Low SNR]
D[Inconsistent] -->|Large + Low SNR| E[Suspicious]
B --> F[snr_x_prad<br/>consistent]
C --> F
E --> G[snr_x_prad<br/>inconsistent]
style F fill:#2ecc71,stroke:#27ae60,color:#fff
style G fill:#e74c3c,stroke:#c0392c,color:#fff
🟢 Workflow Insight
The two-stage design mirrors how astronomers actually vet candidates: First, separate obvious planets (CONFIRMED) from everything else. Then, carefully distinguish between promising candidates and known false positives. This is why Stage 1 achieves near-perfect separation while Stage 2 focuses on the scientifically interesting boundary.
Project Structure
graph TD
A[astrophage/] --> B[Cargo.toml]
A --> C[data/]
A --> D[src/]
A --> E[output/]
A --> F[docs/]
C --> C1[koi_dataset.csv]
D --> D1[main.rs]
D --> D2[data.rs]
D --> D3[features.rs]
D --> D4[decision_tree.rs]
D --> D5[model.rs]
D --> D6[two_stage_model.rs]
D --> D7[evaluation.rs]
D --> D8[report.rs]
E --> E1[report.json]
F --> F1[mdbook/]
F1 --> F2[book.toml]
F1 --> F3[src/]
F3 --> F4[README.md]
F3 --> F5[architecture.md]
F3 --> F6[two-stage-model.md]
F3 --> F7[feature-engineering.md]
F3 --> F8[results.md]
F3 --> F9[api-reference.md]
style D1 fill:#f39c12,stroke:#e67e22,color:#fff
style D6 fill:#2ecc71,stroke:#27ae60,color:#fff
style F1 fill:#9b59b6,stroke:#8e44ad,color:#fff
Documentation
Full Documentation: https://astrophage.hariharnautiyal.com
The documentation site includes:
- System architecture with Mermaid diagrams
- Deep dive into the two-stage model
- Feature engineering explanations with astrophysical rationale
- API reference for all modules
- Contributing guidelines
- Changelog and roadmap
To build the docs locally:
cd docs
mdbook build
mdbook serve --open
Google Colab
Want to try Astrophage without installing Rust locally?
Open in Google Colab
The notebook will:
- Install Rust in the Colab environment
- Clone this repository
- Build the project with Cargo
- Run the full pipeline
- Display the `report.json` with interactive visualizations
Note: First run takes ~5-7 minutes due to Rust compilation. Subsequent runs are instant.
Recommendations for Follow-Up
Based on our modelβs behavior, we suggest:
graph TD
subgraph "Actionable Recommendations"
A1[Use Stage 1 as rapid filter]
A2[Investigate uncertain Stage 1 samples]
A3[Use Stage 2 for prioritization]
A4[Use fpflag_sum as pre-filter]
end
A1 --> B1[Quickly identify CONFIRMED]
A2 --> B2[Scientifically interesting edge cases]
A3 --> B3[Prioritize CANDIDATE follow-up]
A4 --> B4[Eliminate 50% of false positives instantly]
style A1 fill:#2ecc71,stroke:#27ae60,color:#fff
style A2 fill:#3498db,stroke:#2980b9,color:#fff
style A3 fill:#9b59b6,stroke:#8e44ad,color:#fff
style A4 fill:#f39c12,stroke:#e67e22,color:#fff
| # | Recommendation | Impact |
|---|---|---|
| 1 | Use Stage 1 as a rapid filter for follow-up observations | Saves telescope time |
| 2 | Investigate samples where Stage 1 is uncertain (probability ~0.5) | Most scientifically interesting |
| 3 | For NOT_CONFIRMED, use Stage 2 probability to prioritize follow-up | Efficient resource allocation |
| 4 | fpflag_sum alone eliminates ~50% of false positives with near-perfect accuracy | Dramatic efficiency gain |
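The four recommendations can be combined into a simple triage rule, sketched below. This is an illustration under stated assumptions, not part of the Astrophage crate: the `Action` enum, the `triage` function, and the `margin` parameter that defines "uncertain" around 0.5 are all hypothetical, and the Stage 1 and Stage 2 probabilities are assumed to be available per sample.

```rust
#[derive(Debug, PartialEq)]
enum Action {
    /// Recommendation 4: a non-zero fpflag_sum is used as a pre-filter
    LikelyFalsePositive,
    /// Recommendation 2: Stage 1 probability close to 0.5
    ManualReview,
    /// Recommendation 1: Stage 1 is confident the object is a planet
    ConfirmedLike,
    /// Recommendation 3: rank NOT_CONFIRMED objects by Stage 2 probability
    FollowUp { priority: f64 },
}

/// `p_confirmed` is the Stage 1 probability, `p_candidate` the Stage 2 probability.
/// `margin` is a hypothetical half-width of the "uncertain" band around 0.5.
fn triage(fpflag_sum: u8, p_confirmed: f64, p_candidate: f64, margin: f64) -> Action {
    if fpflag_sum > 0 {
        return Action::LikelyFalsePositive;
    }
    if (p_confirmed - 0.5).abs() < margin {
        return Action::ManualReview;
    }
    if p_confirmed >= 0.5 {
        Action::ConfirmedLike
    } else {
        Action::FollowUp { priority: p_candidate }
    }
}

fn main() {
    let samples = [(1u8, 0.90, 0.10), (0, 0.55, 0.40), (0, 0.95, 0.02), (0, 0.20, 0.70)];
    for (flags, p1, p2) in samples {
        println!("flags={flags} p1={p1} p2={p2} -> {:?}", triage(flags, p1, p2, 0.1));
    }
}
```

The order of the checks is a design choice: the flag pre-filter runs first because it is the cheapest test, and the uncertainty band is checked before the confident cases so that borderline samples are never silently accepted.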
Team & Acknowledgments
- Author: Harihar Nautiyal
- Hackathon: Celesta - India High School Exoplanet Data Challenge 2026
- Data Source: NASA Exoplanet Archive / Kepler Mission
- Built with: Rust, Polars, NDArray, Tokio, Serde
License
MIT License: feel free to use, modify, and distribute with attribution.
"Somewhere, something incredible is waiting to be known."
- Carl Sagan