        Random Forest Classifier Based on 100K Spectral Data Points of Stars, Quasars, and Galaxies from the Sloan Digital Sky Survey Data Release 17

Celestial Classifier Trained on SDSS Data using Random Forest

SDSS?

The Sloan Digital Sky Survey (SDSS) is a series of multi-spectral surveys of the sky conducted from Apache Point Observatory in New Mexico (and from Chile for the southern hemisphere). Put simply, it’s a big camera that captures photos of the sky and stitches them together into a 3D map of the cosmos. In doing so, it also captures the spectra (the fingerprints) of the objects it observes in the night sky. The SDSS has mapped a third of the night sky to date, and luckily for us, all of its data are publicly available. I’m using the SDSS-17 dataset from Kaggle, which is based on Data Release 17 from December 2021.

The SDSS telescope at Apache Point Observatory taking calibration observations. For these particular calibration observations, the telescope’s cover is closed and the inside of the telescope is illuminated by a special lamp.


Spectral?

Stars, galaxies, and quasars emit electromagnetic radiation throughout their existence, and that radiation differs depending on the object’s type, intrinsic brightness, and distance from us. The SDSS-17 dataset consists of photometric magnitudes, which measure how “bright” each object appears when viewed through different filters. It also includes features such as the right ascension (α) and declination (δ) of the celestial objects. However, like the object IDs and the other bookkeeping columns, I dropped these since they aren’t of interest and risk poisoning the results. The features of interest in the dataset are:

  • u (ultraviolet)
  • g (green)
  • r (red)
  • i (near-infrared)
  • z (infrared)
  • redshift

Redshift will be by far the dominant feature in the classifier. This is expected: stars are generally the closest to us (and relatively stationary), so they have the lowest redshift; galaxies come next; and quasars, the farthest and fastest-receding objects, have the highest redshift values.
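To make the quantity concrete, here is a minimal sketch of how a redshift value relates to a measured spectral line: z = (observed wavelength − rest wavelength) / rest wavelength. The function name and the observed wavelength are hypothetical and chosen only for illustration; the dataset already ships a precomputed redshift column, so nothing like this is needed in the actual pipeline.

```python
# Rest wavelength of the hydrogen-alpha line in nanometres (approximate).
H_ALPHA_REST_NM = 656.28


def redshift(observed_nm: float, rest_nm: float) -> float:
    """Return z = (observed - rest) / rest for a single spectral line."""
    return (observed_nm - rest_nm) / rest_nm


# Hypothetical observation: the line appears 10% longer than at rest.
observed_nm = 721.908
z = redshift(observed_nm, H_ALPHA_REST_NM)
print(f"z = {z:.3f}")
```

A value near 0 means the line is almost exactly where it would be in a laboratory, which is what one expects for stars in our own galaxy; larger values mean the light has been stretched more on its way to us.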


Implementation

I used a Random Forest classifier from the sklearn library to classify the celestial objects. All of the code below can be found in a Jupyter notebook here.

Preprocessing

First, I set X to the features of interest (i.e., u, g, r, i, z, and redshift) and Y to the class column (i.e., star, galaxy, or quasar). I then applied a label encoder to transform the classes into 0 (GALAXY), 1 (QSO), and 2 (STAR), and standardized the features in X. Finally, I split the data with a test size of 20%, making sure to stratify by class because the data is imbalanced, with galaxies making up ~60% of the dataset.

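from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# df is the SDSS-17 table loaded as a pandas DataFrame.
# Drop identifiers, sky coordinates, and the target, keeping u, g, r, i, z, and redshift.
X = df.drop(columns=["obj_ID", "run_ID", "rerun_ID", "cam_col", "field_ID", "spec_obj_ID", "class", "plate", "MJD", "fiber_ID", "alpha", "delta"])
Y = df["class"]

# Encode the class names as integers: GALAXY -> 0, QSO -> 1, STAR -> 2.
le = LabelEncoder()
Y = le.fit_transform(Y)

# Standardize each feature to zero mean and unit variance.
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Hold out 20% for testing, preserving the class proportions in both splits.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=3)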
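Two details of this step are easy to check in isolation: the label encoder assigns integers in alphabetical order of the class names, and a stratified split keeps the class shares the same in the training and test sets. The sketch below uses synthetic labels with roughly the proportions described above (the exact counts are made up for illustration) and a placeholder feature matrix, so it says nothing about the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Synthetic labels: roughly 60% galaxies, 19% quasars, 21% stars (illustrative only).
labels = np.array(["GALAXY"] * 600 + ["QSO"] * 190 + ["STAR"] * 210)

le = LabelEncoder()
y = le.fit_transform(labels)

# LabelEncoder sorts the class names, so the codes follow alphabetical order.
mapping = {name: int(code) for name, code in zip(le.classes_, le.transform(le.classes_))}
print(mapping)

# Placeholder features; only the labels matter for this demonstration.
X = np.zeros((len(y), 1))
_, _, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=3
)

# Class shares in each split should match the shares in the full label array.
train_share = np.bincount(y_train) / len(y_train)
test_share = np.bincount(y_test) / len(y_test)
print("train:", train_share.round(2))
print("test: ", test_share.round(2))
```

Without `stratify`, the split is purely random, so the smaller classes could end up slightly over- or under-represented in the test set.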

Training

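I then created a Random Forest object with 100 trees, which seems reasonable for this application, and fit the model to the training data. Setting class_weight to "balanced" weights each class inversely to its frequency, which helps compensate for the class imbalance.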

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=3)
rf.fit(X_train, Y_train)
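For a sense of what `class_weight="balanced"` does, scikit-learn computes each class weight as n_samples / (n_classes * count_of_that_class). The sketch below reproduces that calculation with `compute_class_weight`. As an assumption, it borrows the class counts from the test-set support column of the report in the Testing section; the training set is four times larger, but because the split is stratified the proportions, and therefore the weights, should be nearly the same.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts taken from the test-set "support" column (assumed to mirror
# the training proportions because the split was stratified).
counts = {"GALAXY": 11889, "QSO": 3792, "STAR": 4319}

classes = np.array([0, 1, 2])  # encoded as GALAXY=0, QSO=1, STAR=2
y = np.repeat(classes, list(counts.values()))

# Weights as scikit-learn computes them for class_weight="balanced".
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same formula written out by hand.
manual = len(y) / (len(classes) * np.bincount(y))

for name, w in zip(counts, weights):
    print(f"{name:>6}: {w:.3f}")
```

Galaxies, the majority class, get a weight below 1, while quasars and stars get weights above 1, so mistakes on the rarer classes cost more during training.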

Testing

Finally, I ran the model on the held-out test set, which produced an overall accuracy of 98%.

from sklearn.metrics import classification_report
predictions = rf.predict(X_test)
report = classification_report(Y_test, predictions, target_names=le.classes_)
print(report)

              precision    recall  f1-score   support

      GALAXY       0.98      0.99      0.98     11889
         QSO       0.97      0.93      0.95      3792
        STAR       1.00      1.00      1.00      4319

    accuracy                           0.98     20000
   macro avg       0.98      0.97      0.98     20000
weighted avg       0.98      0.98      0.98     20000
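The two average rows in the report can be reproduced by hand from the per-class rows. The sketch below does this for recall, using only the numbers printed in the report above; since those are rounded to two decimals, the derived figures are approximate.

```python
# Per-class recall and support, copied from the classification report above.
recall = {"GALAXY": 0.99, "QSO": 0.93, "STAR": 1.00}
support = {"GALAXY": 11889, "QSO": 3792, "STAR": 4319}

total = sum(support.values())

# Macro average: every class counts equally, regardless of its size.
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: each class counts in proportion to its support.
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

# Rough number of quasars missed; approximate because the recall is rounded.
missed_qso = round((1 - recall["QSO"]) * support["QSO"])

print(f"macro recall:           {macro_recall:.2f}")
print(f"weighted recall:        {weighted_recall:.2f}")
print(f"approx. quasars missed: {missed_qso}")
```

The support-weighted recall is the same quantity as overall accuracy (correct predictions divided by all samples), which is why both rows read 0.98. The macro average is lower because the weaker quasar recall counts as much as the other two classes.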

Analysis

The thing to note here is the 93% recall for quasars: the quasars that are missed get misidentified as galaxies (see confusion matrix below). This is understandable since redshift is the dominant feature, and light from distant quasars is redshifted much like light from distant galaxies, especially when it passes through cosmic dust that makes quasars look fuzzy like galaxies.

Confusion Matrix of the Random Forest Classifier on the SDSS-17 Dataset.
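For readers unfamiliar with how a confusion matrix relates to recall, here is a toy sketch. The ten labels below are invented for illustration and are not the model's predictions; in the real workflow the inputs would be the test labels and the model's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

class_names = ["GALAXY", "QSO", "STAR"]

# Toy data (hypothetical): one quasar is predicted as a galaxy.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 0, 0, 1, 1, 0, 2, 2, 2])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)

# Recall per class: correct predictions (diagonal) over all true members (row sum).
recall = cm.diagonal() / cm.sum(axis=1)
for name, r in zip(class_names, recall):
    print(f"{name:>6}: recall = {r:.2f}")
```

An off-diagonal entry in the QSO row and GALAXY column is exactly the kind of error discussed above: a true quasar that the classifier labelled as a galaxy.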

Additionally, a feature analysis (see bar plot below) confirms what was expected: redshift is the dominant feature, since each class generally clusters around its own range of redshift values.

Feature Analysis of the Random Forest Classifier on the SDSS-17 Dataset.
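A bar plot like this is typically built from the fitted model's `feature_importances_` attribute. The sketch below shows the mechanics on synthetic data, not the SDSS table: the five magnitude columns are pure noise shared by all classes, and only the redshift column differs by class, with made-up cluster centres. It illustrates why a feature that separates the classes cleanly ends up dominating, and it is not a reproduction of the plot above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
feature_names = ["u", "g", "r", "i", "z", "redshift"]

# Synthetic labels: 500 samples each of GALAXY (0), QSO (1), STAR (2).
y = np.repeat([0, 1, 2], 500)

# Hypothetical redshift centres per class; stars sit near zero.
centres = np.array([0.4, 1.7, 0.0])

# Magnitudes are identical noise for every class; only redshift is informative.
magnitudes = rng.normal(20.0, 1.5, size=(y.size, 5))
redshift = rng.normal(centres[y], 0.1)
X = np.column_stack([magnitudes, redshift])

rf = RandomForestClassifier(n_estimators=100, random_state=3).fit(X, y)

# Impurity-based importances sum to 1; sort from most to least important.
ranked = sorted(
    zip(feature_names, rf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:>8}: {importance:.3f}")
```

In the real dataset the magnitudes do carry some class information, so their importances would not be as close to zero as they are in this toy setup.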