Changelog
All notable changes to Astrophage will be documented in this file.
graph LR
A[v0.1.0] -->|+Two-Stage| B[v0.2.0]
B -->|+Hyperparam| C[v0.3.0]
C -->|+Serialization| D[v0.4.0]
style B fill:#2ecc71,stroke:#27ae60,color:#fff
[0.2.0] - 2026-06-28
Added
- Two-Stage Random Forest architecture (CONFIRMED vs NOT → CANDIDATE vs FALSE_POSITIVE)
- 8 derived astrophysical features (fpflag_sum, snr_x_prad, depth_duration_ratio, etc.)
- Comprehensive JSON report generation with astrophysical insights
- Feature importance ranking with scientific explanations
- Google Colab notebook for cloud execution
- Full mdBook documentation suite with Mermaid diagrams
Performance
- 94.81% accuracy on test set
- 92.64% macro F1 score
- Stage 1 achieves near-perfect separation for CONFIRMED planets
Technical
- Custom Random Forest implementation in pure Rust
- Polars DataFrame integration for fast I/O
- Stratified train/test split with class balance
- Z-score standardization and median imputation
- Tracing-based structured logging
graph LR
A[v0.2.0] --> B[Accuracy: 94.81%]
A --> C[Macro F1: 92.64%]
A --> D[Features: 36]
style B fill:#2ecc71,stroke:#27ae60,color:#fff
style C fill:#3498db,stroke:#2980b9,color:#fff
[0.1.0] - 2026-06-23
Added
- Initial single-stage Random Forest classifier
- Basic feature engineering with 28 base features
- KOI dataset loading and preprocessing
- Evaluation metrics (accuracy, precision, recall, F1)
- Command-line interface with tracing logs
Notes
- Baseline accuracy: ~91%
- No derived features
- Single-stage architecture
graph LR
A[v0.1.0] --> B[Accuracy: ~91%]
A --> C[Features: 28]
A --> D[Single Stage]
style B fill:#e74c3c,stroke:#c0392c,color:#fff
Roadmap
graph LR
A[v0.2.0<br/>Current] --> B[v0.3.0<br/>Hyperparameter Tuning]
B --> C[v0.4.0<br/>Model Serialization]
C --> D[v0.5.0<br/>REST API]
D --> E[v1.0.0<br/>Production Ready]
style A fill:#2ecc71,stroke:#27ae60,color:#fff
style E fill:#f39c12,stroke:#e67e22,color:#fff
v0.3.0 (Planned)
- Grid search for optimal hyperparameters
- K-fold cross-validation
- Feature importance visualization
v0.4.0 (Planned)
- Model serialization (save/load trained models)
- Incremental training support
- Configuration file support (TOML)
v0.5.0 (Planned)
- REST API with Axum
- Batch prediction endpoint
- Real-time classification
v1.0.0 (Future)
- Production-ready deployment
- Docker containerization
- Comprehensive benchmarking suite
🪐 Astrophage
Two-Stage Random Forest Classifier for NASA Kepler Object of Interest (KOI) Exoplanet Validation
Full Documentation: https://astrophage.hariharnautiyal.com
What is Astrophage?
Astrophage is a high-performance exoplanet classification system built in Rust using Polars and a custom Two-Stage Random Forest implementation. It classifies Kepler Objects of Interest (KOIs) into three categories:
pie title Class Distribution in KOI Dataset
"FALSE POSITIVE" : 4839
"CONFIRMED" : 2747
"CANDIDATE" : 1978
Total Samples: 9,564 | Features: 36 (28 base + 8 derived) | Accuracy: 94.81%
Why Two-Stage?
Our architecture mirrors NASA's actual vetting workflow. Instead of forcing a single model to learn three classes simultaneously, we decompose the problem into two simpler binary decisions:
graph TD
A[KOI Data<br/>36 Features] --> B[Stage 1: CONFIRMED vs NOT CONFIRMED]
B -->|CONFIRMED| C[Output: CONFIRMED ✅]
B -->|NOT CONFIRMED| D[Stage 2: CANDIDATE vs FALSE POSITIVE]
D -->|CANDIDATE| E[Output: CANDIDATE]
D -->|FALSE POSITIVE| F[Output: FALSE POSITIVE ❌]
style C fill:#2ecc71,stroke:#27ae60,color:#fff
style E fill:#3498db,stroke:#2980b9,color:#fff
style F fill:#e74c3c,stroke:#c0392c,color:#fff
This decomposition improves accuracy by about 3-4 percentage points over a single-stage classifier (the v0.1.0 baseline reached ~91%), because each stage learns a simpler, cleaner decision boundary.
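The routing logic described above can be sketched in a few lines. This is a minimal illustration, not the code in src/two_stage_model.rs: the `BinaryClassifier` trait, the `Threshold` stand-in model, and the 0.5 decision cutoff are all assumptions made for the example.

```rust
/// Final three-way label produced by the two-stage pipeline.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Label {
    Confirmed,
    Candidate,
    FalsePositive,
}

/// Hypothetical interface: returns P(positive class) for one sample.
trait BinaryClassifier {
    fn predict_proba(&self, features: &[f64]) -> f64;
}

/// Stage 1 decides CONFIRMED vs NOT CONFIRMED; Stage 2 only sees the rest.
/// The 0.5 cutoff is an assumption for this sketch.
fn predict_two_stage<S1: BinaryClassifier, S2: BinaryClassifier>(
    stage1: &S1,
    stage2: &S2,
    features: &[f64],
) -> Label {
    if stage1.predict_proba(features) >= 0.5 {
        return Label::Confirmed;
    }
    if stage2.predict_proba(features) >= 0.5 {
        Label::Candidate
    } else {
        Label::FalsePositive
    }
}

/// Toy stand-in for a trained forest: thresholds a single feature.
struct Threshold {
    index: usize,
    cutoff: f64,
}

impl BinaryClassifier for Threshold {
    fn predict_proba(&self, features: &[f64]) -> f64 {
        if features[self.index] > self.cutoff { 1.0 } else { 0.0 }
    }
}

fn main() {
    let stage1 = Threshold { index: 0, cutoff: 0.5 };
    let stage2 = Threshold { index: 1, cutoff: 0.5 };
    for sample in [[0.9, 0.1], [0.1, 0.9], [0.1, 0.1]] {
        println!("{:?} -> {:?}", sample, predict_two_stage(&stage1, &stage2, &sample));
    }
}
```

Because Stage 2 is only consulted when Stage 1 says NOT CONFIRMED, each model needs to learn just one binary boundary.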
Key Results
graph LR
subgraph "Overall Metrics"
A[Accuracy<br/>94.81%]
B[Macro F1<br/>92.64%]
C[Weighted F1<br/>94.51%]
end
style A fill:#2ecc71,stroke:#27ae60,color:#fff
style B fill:#3498db,stroke:#2980b9,color:#fff
style C fill:#9b59b6,stroke:#8e44ad,color:#fff
Per-Class Breakdown
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| CANDIDATE | 88.42% | 85.06% | 86.71% | 1,978 |
| FALSE POSITIVE | 99.69% | 98.35% | 99.01% | 4,839 |
| CONFIRMED | 89.95% | 94.54% | 92.18% | 2,747 |
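As a sanity check, the overall Macro F1 and Weighted F1 can be recomputed from the per-class rows. The sketch below uses only the rounded F1 and Support values from the table, so its results agree with the reported 92.64% and 94.51% only to within rounding. The function names are illustrative and not taken from src/evaluation.rs.

```rust
/// Unweighted mean of the per-class F1 scores.
fn macro_f1(f1: &[f64]) -> f64 {
    f1.iter().sum::<f64>() / f1.len() as f64
}

/// Mean of the per-class F1 scores, weighted by class support.
fn weighted_f1(f1: &[f64], support: &[u32]) -> f64 {
    let total: f64 = support.iter().map(|&s| s as f64).sum();
    f1.iter()
        .zip(support)
        .map(|(f, &s)| f * s as f64)
        .sum::<f64>()
        / total
}

fn main() {
    // Rows in table order: CANDIDATE, FALSE POSITIVE, CONFIRMED.
    let f1 = [86.71, 99.01, 92.18];
    let support = [1978, 4839, 2747];
    println!("macro F1    = {:.2}%", macro_f1(&f1));
    println!("weighted F1 = {:.2}%", weighted_f1(&f1, &support));
}
```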
Architecture
graph LR
subgraph "Input"
A[Raw KOI CSV<br/>28 columns]
end
subgraph "Feature Engineering"
B[Base Features<br/>28 columns]
C[Derived Features<br/>8 interactions]
D[Preprocessing<br/>impute + standardize]
end
subgraph "Two-Stage Model"
E[Stage 1 RF<br/>CONFIRMED vs NOT]
F[Stage 2 RF<br/>CANDIDATE vs FALSE+]
end
subgraph "Output"
G[Predictions<br/>3 classes]
H[JSON Report<br/>metrics + insights]
end
A --> B
B --> C
C --> D
D --> E
E -->|CONFIRMED| G
E -->|NOT| F
F -->|CANDIDATE| G
F -->|FALSE_POSITIVE| G
G --> H
style A fill:#e74c3c,stroke:#c0392c,color:#fff
style D fill:#f39c12,stroke:#e67e22,color:#fff
style E fill:#2ecc71,stroke:#27ae60,color:#fff
style F fill:#3498db,stroke:#2980b9,color:#fff
style H fill:#9b59b6,stroke:#8e44ad,color:#fff
Custom Implementation Details
- Language: Rust (zero-cost abstractions, memory safety, SIMD-friendly)
- DataFrame Engine: Polars (blazing fast CSV I/O and columnar operations)
- ML Backend: Custom Random Forest from scratch (no Python dependency!)
- Gini impurity splitting
- Bootstrapped sampling
- Feature subsampling
- Majority voting ensemble
- Parallelism: Tokio async runtime for I/O; ndarray for vectorized math
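Two of the building blocks listed above, Gini impurity splitting and majority voting, are small enough to sketch with the standard library alone. This is an illustration under assumptions (class labels encoded as `u8`, ties in the vote resolved toward the smaller label); the actual code in src/decision_tree.rs and src/model.rs may differ.

```rust
use std::collections::HashMap;

/// Gini impurity of a set of class labels: 1 - sum over classes of p_k^2.
fn gini_impurity(labels: &[u8]) -> f64 {
    if labels.is_empty() {
        return 0.0;
    }
    let mut counts: HashMap<u8, usize> = HashMap::new();
    for &label in labels {
        *counts.entry(label).or_insert(0) += 1;
    }
    let n = labels.len() as f64;
    1.0 - counts.values().map(|&c| (c as f64 / n).powi(2)).sum::<f64>()
}

/// Impurity of a candidate split, weighted by the size of each child node.
/// A tree picks the split that minimizes this value.
fn split_impurity(left: &[u8], right: &[u8]) -> f64 {
    let n = (left.len() + right.len()) as f64;
    if n == 0.0 {
        return 0.0;
    }
    (left.len() as f64 / n) * gini_impurity(left)
        + (right.len() as f64 / n) * gini_impurity(right)
}

/// Majority vote over per-tree predictions; ties go to the smaller label.
fn majority_vote(votes: &[u8]) -> Option<u8> {
    let mut counts: HashMap<u8, usize> = HashMap::new();
    for &v in votes {
        *counts.entry(v).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .max_by(|a, b| a.1.cmp(&b.1).then(b.0.cmp(&a.0)))
        .map(|(label, _)| label)
}

fn main() {
    println!("mixed node:  {}", gini_impurity(&[0, 0, 1, 1]));
    println!("clean split: {}", split_impurity(&[0, 0], &[1, 1]));
    println!("vote:        {:?}", majority_vote(&[1, 0, 1]));
}
```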
Installation
Prerequisites
- Rust (1.85+ recommended)
- Git
Clone & Build
# Clone the repository
git clone https://github.com/harihar-nautiyal/astrophage.git
cd astrophage
# Build in release mode (optimized)
cargo build --release
# The binary will be at:
# ./target/release/astrophage
Dataset
The repository includes a pre-processed KOI dataset at:
data/koi_dataset.csv
If you want to use your own data, ensure it follows the same column schema (see src/data.rs for expected fields).
Usage
Quick Start
# Run the full pipeline
cargo run --release
Expected Output
╔══════════════════════════════════════════════════════════════╗
║                     🪐 ASTROPHAGE v0.2.0                     ║
║           NASA KOI Exoplanet Classification System           ║
║    TWO-STAGE MODEL: CONFIRMED vs NOT → CANDIDATE vs FALSE    ║
╚══════════════════════════════════════════════════════════════╝
Step 1: Loading KOI dataset...
Step 2: Engineering features...
Step 3: Splitting data (80/20 stratified)...
Step 4: Training TWO-STAGE classifier...
Step 5: Evaluating model performance...
Step 6: Top astrophysical predictors:
1. fpflag_sum 0.2918
2. koi_fpflag_co 0.0683
3. koi_max_mult_ev 0.0630
4. koi_fpflag_nt 0.0624
5. koi_model_snr 0.0596
...
Step 7: Generating final report...
ASTROPHAGE two-stage classification complete!
Check output/report.json for full results.
Output Files
| File | Description |
|---|---|
| output/report.json | Full JSON report with metrics, feature importance, and insights |
| output/predictions.csv | (Optional) Per-sample predictions and probabilities |
Feature Engineering
We transform 28 raw astrophysical features into 36 model-ready features:
Base Features (28)
Orbital, physical, and stellar parameters from the Kepler pipeline:
graph TB
subgraph "Base Feature Groups"
A1[Orbital: period, duration, impact, ingress, incl, eccen, sma]
A2[Physical: ror, prad, teq, insol]
A3[Signal: snr, count, transits, max events]
A4[FP Flags: nt, ss, co, ec]
A5[Stellar: kepmag, dor, srho, steff, slogg, smet, srad, smass]
end
Derived Features (8)
| Feature | Formula | Astrophysical Rationale |
|---|---|---|
| koi_prad_squared | prad² | Non-linear radius effect; objects >15 R⊕ are likely stellar companions |
| depth_duration_ratio | depth / duration | Transit steepness; planets have characteristic U-shaped curves |
| snr_x_prad | snr × prad | Real planets have SNR consistent with their size |
| impact_penalty | 10 if impact > 1.0 else 0 | An impact parameter >1 allows at most a grazing transit, so such signals are penalized |
| log_period | ln(period) | Orbital periods follow a roughly log-normal distribution |
| teq_over_steff | teq / steff | Sanity check on equilibrium temperature vs stellar temperature |
| fpflag_sum | Σ fpflags | NASA's pre-vetting suspicion score; higher = more likely false positive |
| prad_teq_interaction | prad × teq | Size-temperature interaction for giant planets vs rocky planets |
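The eight formulas in the table can be written out directly. The sketch below is illustrative only: the `KoiRow` and `DerivedFeatures` structs and their field names are stand-ins, not the schema from src/data.rs or the implementation in src/features.rs, and the sample values in `main` are made up.

```rust
/// Base columns used by the derived features (illustrative field names).
#[derive(Debug, Clone)]
struct KoiRow {
    period: f64,
    duration: f64,
    depth: f64,
    impact: f64,
    prad: f64,
    teq: f64,
    steff: f64,
    snr: f64,
    fpflags: [u8; 4], // nt, ss, co, ec
}

#[derive(Debug, Clone, PartialEq)]
struct DerivedFeatures {
    koi_prad_squared: f64,
    depth_duration_ratio: f64,
    snr_x_prad: f64,
    impact_penalty: f64,
    log_period: f64,
    teq_over_steff: f64,
    fpflag_sum: f64,
    prad_teq_interaction: f64,
}

/// Apply the table's formulas to one row.
fn derive_features(row: &KoiRow) -> DerivedFeatures {
    DerivedFeatures {
        koi_prad_squared: row.prad * row.prad,
        // No guard against duration == 0 or steff == 0 in this sketch.
        depth_duration_ratio: row.depth / row.duration,
        snr_x_prad: row.snr * row.prad,
        impact_penalty: if row.impact > 1.0 { 10.0 } else { 0.0 },
        log_period: row.period.ln(),
        teq_over_steff: row.teq / row.steff,
        fpflag_sum: row.fpflags.iter().map(|&f| f as f64).sum::<f64>(),
        prad_teq_interaction: row.prad * row.teq,
    }
}

fn main() {
    // Made-up sample values, for illustration only.
    let row = KoiRow {
        period: 10.0,
        duration: 2.0,
        depth: 500.0,
        impact: 0.3,
        prad: 2.0,
        teq: 600.0,
        steff: 6000.0,
        snr: 25.0,
        fpflags: [0, 0, 0, 0],
    };
    println!("{:#?}", derive_features(&row));
}
```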
Preprocessing
graph LR
A[Raw Features<br/>28 columns] --> B[Imputation]
B --> C[Standardization]
C --> D[Derived Features<br/>8 columns]
D --> E[Final Feature Matrix<br/>36 columns]
B -->|Missing values| B1[Column median]
C -->|Z-score| C1[Mean=0, Std=1]
style E fill:#2ecc71,stroke:#27ae60,color:#fff
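The imputation and standardization steps in the diagram can be sketched on a single column as follows. The example assumes missing values are encoded as NaN and uses the population standard deviation; the project itself works on Polars DataFrames, so treat this as a plain-`std` illustration rather than the actual code. In a train/test setting, the median, mean, and standard deviation would normally be computed on the training split and reused for the test split.

```rust
/// Median of the finite values in a column, or None if all are missing.
fn median(column: &[f64]) -> Option<f64> {
    let mut present: Vec<f64> = column.iter().copied().filter(|v| v.is_finite()).collect();
    if present.is_empty() {
        return None;
    }
    present.sort_by(|a, b| a.total_cmp(b));
    let mid = present.len() / 2;
    Some(if present.len() % 2 == 0 {
        (present[mid - 1] + present[mid]) / 2.0
    } else {
        present[mid]
    })
}

/// Replace missing (non-finite) entries with the column median.
fn impute_median(column: &mut [f64]) {
    if let Some(m) = median(column) {
        for v in column.iter_mut().filter(|v| !v.is_finite()) {
            *v = m;
        }
    }
}

/// Z-score standardization: subtract the mean, divide by the population std.
/// A constant column is mapped to all zeros to avoid dividing by zero.
fn standardize(column: &mut [f64]) {
    let n = column.len() as f64;
    if n == 0.0 {
        return;
    }
    let mean = column.iter().sum::<f64>() / n;
    let std = (column.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n).sqrt();
    for v in column.iter_mut() {
        *v = if std == 0.0 { 0.0 } else { (*v - mean) / std };
    }
}

fn main() {
    let mut column = vec![1.0, f64::NAN, 3.0];
    impute_median(&mut column);
    println!("imputed:      {:?}", column);
    standardize(&mut column);
    println!("standardized: {:?}", column);
}
```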
Astrophysical Insights
Our model reveals key discriminators that align with planetary science:
🔴 Very High Confidence
False Positive Flags (fpflag_sum, koi_fpflag_nt, koi_fpflag_ss) directly encode NASA's pre-vetting. When non-zero, the signal is almost certainly not a planet. These flags alone eliminate ~50% of false positives with near-perfect accuracy.
graph LR
A[NASA FP Flags] --> B[fpflag_sum]
B -->|Value > 0| C[99%+ chance<br/>FALSE POSITIVE]
B -->|Value = 0| D[Needs further<br/>analysis]
style C fill:#e74c3c,stroke:#c0392c,color:#fff
style D fill:#3498db,stroke:#2980b9,color:#fff
🟡 High Confidence
Signal-to-Noise Ratio + Planetary Radius (snr_x_prad, koi_prad): Real planets have consistent SNR for their size. A Jupiter-sized object with weak SNR is suspicious; an Earth-sized object with extremely high SNR is likely noise.
graph LR
A[Real Planet] -->|Jupiter| B[High SNR]
A -->|Earth| C[Low SNR]
D[Inconsistent] -->|Large + Low SNR| E[Suspicious]
B --> F[snr_x_prad<br/>consistent]
C --> F
E --> G[snr_x_prad<br/>inconsistent]
style F fill:#2ecc71,stroke:#27ae60,color:#fff
style G fill:#e74c3c,stroke:#c0392c,color:#fff
🟢 Workflow Insight
The two-stage design mirrors how astronomers actually vet candidates: First, separate obvious planets (CONFIRMED) from everything else. Then, carefully distinguish between promising candidates and known false positives. This is why Stage 1 achieves near-perfect separation while Stage 2 focuses on the scientifically interesting boundary.
Project Structure
graph TD
A[astrophage/] --> B[Cargo.toml]
A --> C[data/]
A --> D[src/]
A --> E[output/]
A --> F[docs/]
C --> C1[koi_dataset.csv]
D --> D1[main.rs]
D --> D2[data.rs]
D --> D3[features.rs]
D --> D4[decision_tree.rs]
D --> D5[model.rs]
D --> D6[two_stage_model.rs]
D --> D7[evaluation.rs]
D --> D8[report.rs]
E --> E1[report.json]
F --> F1[mdbook/]
F1 --> F2[book.toml]
F1 --> F3[src/]
F3 --> F4[README.md]
F3 --> F5[architecture.md]
F3 --> F6[two-stage-model.md]
F3 --> F7[feature-engineering.md]
F3 --> F8[results.md]
F3 --> F9[api-reference.md]
style D1 fill:#f39c12,stroke:#e67e22,color:#fff
style D6 fill:#2ecc71,stroke:#27ae60,color:#fff
style F1 fill:#9b59b6,stroke:#8e44ad,color:#fff
Documentation
Full Documentation: https://astrophage.hariharnautiyal.com
The documentation site includes:
- System architecture with Mermaid diagrams
- Deep dive into the two-stage model
- Feature engineering explanations with astrophysical rationale
- API reference for all modules
- Contributing guidelines
- Changelog and roadmap
To build the docs locally:
cd docs
mdbook build
mdbook serve --open
Google Colab
Want to try Astrophage without installing Rust locally?
Open in Google Colab
The notebook will:
- Install Rust in the Colab environment
- Clone this repository
- Build the project with Cargo
- Run the full pipeline
- Display the report.json with interactive visualizations
Note: First run takes ~5-7 minutes due to Rust compilation. Subsequent runs are instant.
Recommendations for Follow-Up
Based on our model's behavior, we suggest:
graph TD
subgraph "Actionable Recommendations"
A1[Use Stage 1 as rapid filter]
A2[Investigate uncertain Stage 1 samples]
A3[Use Stage 2 for prioritization]
A4[Use fpflag_sum as pre-filter]
end
A1 --> B1[Quickly identify CONFIRMED]
A2 --> B2[Scientifically interesting edge cases]
A3 --> B3[Prioritize CANDIDATE follow-up]
A4 --> B4[Eliminate 50% of false positives instantly]
style A1 fill:#2ecc71,stroke:#27ae60,color:#fff
style A2 fill:#3498db,stroke:#2980b9,color:#fff
style A3 fill:#9b59b6,stroke:#8e44ad,color:#fff
style A4 fill:#f39c12,stroke:#e67e22,color:#fff
| # | Recommendation | Impact |
|---|---|---|
| 1 | Use Stage 1 as a rapid filter for follow-up observations | Saves telescope time |
| 2 | Investigate samples where Stage 1 is uncertain (probability ~0.5) | Most scientifically interesting |
| 3 | For NOT_CONFIRMED, use Stage 2 probability to prioritize follow-up | Efficient resource allocation |
| 4 | fpflag_sum alone eliminates ~50% of false positives with near-perfect accuracy | Dramatic efficiency gain |
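Recommendation 2 amounts to ranking samples by how close their Stage 1 probability is to 0.5. A minimal sketch, assuming the Stage 1 probabilities are available as a slice of `f64` (the function name and the sample values are hypothetical):

```rust
/// Sample indices ordered from most to least uncertain, where uncertainty
/// is measured as the distance of the Stage 1 probability from 0.5.
fn rank_by_uncertainty(stage1_proba: &[f64]) -> Vec<usize> {
    let mut indices: Vec<usize> = (0..stage1_proba.len()).collect();
    indices.sort_by(|&a, &b| {
        let da = (stage1_proba[a] - 0.5).abs();
        let db = (stage1_proba[b] - 0.5).abs();
        da.total_cmp(&db)
    });
    indices
}

fn main() {
    // Made-up probabilities for four samples.
    let proba = [0.95, 0.52, 0.10, 0.45];
    for i in rank_by_uncertainty(&proba) {
        println!("sample {} (p = {:.2})", i, proba[i]);
    }
}
```

The same ranking idea applies to recommendation 3, with Stage 2 probabilities sorted in descending order to put the most promising CANDIDATE samples first.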
Team & Acknowledgments
- Author: Harihar Nautiyal
- Hackathon: Celesta β India High School Exoplanet Data Challenge 2026
- Data Source: NASA Exoplanet Archive / Kepler Mission
- Built with: Rust, Polars, NDArray, Tokio, Serde
License
MIT License. Feel free to use, modify, and distribute with attribution.
"Somewhere, something incredible is waiting to be known."
- Carl Sagan