Academic · 2026

Stellar Classification

CS372 Machine Learning: An astronomical object classification system

Role: AI/ML Engineer
Year: 2026
Team: Solo Project

This project applies machine learning to stellar classification using 100,000 records from the Sloan Digital Sky Survey (SDSS) Data Release 17 (DR17). The system classifies celestial objects into three categories: galaxies, stars, and quasars. It uses spectral features, photometric measurements in the u, g, r, i, and z filters, and redshift values, and it compares several models to find the one that is most accurate and best suited to large-scale astronomical data.

01

The Problem

  • Modern astronomy has entered the era of big data: sky surveys generate massive volumes of data, often on the order of terabytes per night, so classifying these objects manually is impractical.
  • Some objects, such as quasars, are high-energy and extremely distant, yet they appear as point-like sources in conventional telescope images and so look much like stars, which makes the two difficult to tell apart visually.
  • An automatic classification system is needed to process large volumes of data quickly and accurately, reducing the workload of astronomers and helping identify objects for further in-depth study.

02

The Solution

  • Cleaned and preprocessed the data systematically: physically invalid sentinel values, such as -9999, were converted to NaN and dropped, and IQR-based clipping (a form of Winsorization) was applied to limit the influence of outliers.
  • Used tree-based methods and permutation importance for feature selection, and removed identifier variables (such as IDs) that describe the survey system rather than the object and could lead to data leakage.
  • Developed and compared three different algorithms: K-Nearest Neighbors (KNN), XGBoost (tree-based ensemble), and a Neural Network (Multilayer Perceptron, or MLP).
  • Tuned hyperparameters using grid search combined with cross-validation to find the most effective settings while reducing the risk of overfitting and underfitting.
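
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it runs on synthetic data rather than the SDSS catalogue, uses KNN as a stand-in for any of the three models, and the helper names (drop_sentinel_rows, iqr_fences), the 1.5 x IQR fence width, and the grid values are all assumptions made for the example. Computing the clipping fences on the training split only is likewise a choice of this sketch.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SENTINEL = -9999.0  # placeholder value named in the text


def drop_sentinel_rows(X, y, sentinel=SENTINEL):
    """Turn sentinel entries into NaN, then drop every row containing one."""
    X = np.where(X == sentinel, np.nan, X.astype(float))
    keep = ~np.isnan(X).any(axis=1)
    return X[keep], y[keep]


def iqr_fences(X_train, k=1.5):
    """Per-feature clipping bounds [Q1 - k*IQR, Q3 + k*IQR] (k is assumed)."""
    q1, q3 = np.percentile(X_train, [25, 75], axis=0)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Synthetic stand-in for the catalogue: 3 classes, 6 numeric features.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=600)
X = rng.normal(loc=y[:, None] * 2.0, scale=1.0, size=(600, 6))
X[5, 0] = SENTINEL  # one physically invalid measurement
X[7, 2] = 500.0     # one extreme outlier

X, y = drop_sentinel_rows(X, y)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the fences on the training split, then clip both splits with them.
lo, hi = iqr_fences(X_train)
X_train = np.clip(X_train, lo, hi)
X_test = np.clip(X_test, lo, hi)

# Grid search with 5-fold cross-validation, scored by macro-averaged F1.
search = GridSearchCV(
    make_pipeline(StandardScaler(), KNeighborsClassifier()),
    param_grid={
        "kneighborsclassifier__n_neighbors": [3, 5, 11],
        "kneighborsclassifier__weights": ["uniform", "distance"],
    },
    cv=5,
    scoring="f1_macro",
)
search.fit(X_train, y_train)
test_f1 = search.score(X_test, y_test)  # macro F1 on the held-out split

# Permutation importance: score drop when one feature is shuffled.
imp = permutation_importance(
    search.best_estimator_, X_test, y_test, n_repeats=5, random_state=0
)
ranking = np.argsort(imp.importances_mean)[::-1]  # most important first
print(search.best_params_, round(test_f1, 3), ranking)
```

Scaling sits inside the pipeline so that each cross-validation fold fits its own scaler, which keeps validation folds from influencing the preprocessing.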

03

The Result

  • Produced models that classify galaxies, stars, and quasars with high accuracy. XGBoost performed best, with an evaluation accuracy of 0.9785 and a macro-averaged F1 score of 0.9752.
  • All developed models showed no significant signs of overfitting and demonstrated strong generalization to unseen data.
  • Built a complete data pipeline covering preprocessing, anomaly handling, and prediction, which could realistically be applied as an initial screening system within an observatory data pipeline.
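
For readers unfamiliar with the two metrics reported above, the sketch below shows how accuracy and macro-averaged F1 are defined. The ten labels are made up for illustration and are not project results; the function names and the class strings GALAXY, STAR, and QSO are assumptions of this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (illustrative only): one galaxy and one quasar are confused.
y_true = ["GALAXY"] * 4 + ["STAR"] * 3 + ["QSO"] * 3
y_pred = ["GALAXY", "GALAXY", "GALAXY", "QSO",
          "STAR", "STAR", "STAR",
          "QSO", "QSO", "GALAXY"]

print(accuracy(y_true, y_pred))  # 8 of 10 correct
print(macro_f1(y_true, y_pred))  # mean of the per-class F1 values 0.75, 1.0, 2/3
```

Because macro averaging weights each class equally regardless of how many objects it contains, it penalizes a model that does well on the most common class but poorly on a rarer one, which accuracy alone can hide.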