SDSS?
The Sloan Digital Sky Survey (SDSS) is a series of multi-spectral surveys of the sky conducted at Apache Point Observatory in New Mexico (and from Chile for the southern hemisphere). Put simply, it’s a big camera that captures photos of the sky and puts them together to make a 3D map of the cosmos. In doing so, it also captures the spectra, the fingerprints, of the different objects in the night sky. The SDSS has mapped a third of the night sky to date, and lucky for us, all of its data are publicly available. I’m using the SDSS-17 dataset from Kaggle, which is based on Data Release 17 from December 2021.
Spectral?
Stars, galaxies, and quasars emit electromagnetic radiation throughout their existence, and that radiation differs depending on the object’s type, magnitude, and distance from us. The SDSS-17 dataset consists of photometric magnitudes, meaning it records how “bright” the radiation is when viewed through different filters. It also includes features such as the right ascension (α) and declination (δ) of the celestial objects. However, as with the object IDs and the other bookkeeping columns, I dropped these since they aren’t of interest and risk poisoning the results. The features of interest in the dataset are:
- u (ultraviolet)
- g (green)
- r (red)
- i (near-infrared)
- z (infrared)
- redshift
By far, redshift will be the dominant feature in the classifier. This is expected: stars are generally the closest to us (and relatively stationary), so they have the lowest redshift; galaxies come next; and quasars, the farthest and fastest-receding objects, have the highest redshift values.
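One way to check this ordering is to summarize redshift per class. The sketch below assumes a pandas DataFrame with the dataset’s `class` and `redshift` columns. The helper name `redshift_by_class` and the toy rows are made up for illustration and are not SDSS measurements.

```python
import pandas as pd


def redshift_by_class(df: pd.DataFrame) -> pd.DataFrame:
    """Return median, min, and max redshift for each object class."""
    return (
        df.groupby("class")["redshift"]
        .agg(["median", "min", "max"])
        .sort_values("median")  # lowest-redshift class first
    )


# Toy rows for illustration only; these are NOT real SDSS measurements.
toy = pd.DataFrame({
    "class": ["STAR", "STAR", "GALAXY", "GALAXY", "QSO", "QSO"],
    "redshift": [0.0001, -0.0002, 0.4, 0.6, 1.5, 2.1],
})

summary = redshift_by_class(toy)
print(summary)
```

On the real data, calling `redshift_by_class(df)` before dropping any columns should show whether the three classes really separate along redshift.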
Implementation
I used a Random Forest classifier from the sklearn library to classify the celestial objects. All of the code below can be found in a Jupyter notebook here.
Preprocessing
First, I set X to the features of interest (i.e., u, g, r, i, z, and redshift) and Y to the class column (i.e., star, galaxy, or quasar). I then applied a label encoder to transform the classes into 0 (GALAXY), 1 (QSO), and 2 (STAR), and standardized the X data. Finally, I split the data with a test size of 20%, making sure to stratify by class because the data is imbalanced, with galaxies making up ~60% of the dataset.
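from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
# Drop IDs, sky coordinates, and the target; u, g, r, i, z, and redshift remain
X = df.drop(columns=["obj_ID", "run_ID", "rerun_ID", "cam_col", "field_ID", "spec_obj_ID", "class", "plate", "MJD", "fiber_ID", "alpha", "delta"])
Y = df["class"]
# Encode class names as integers: GALAXY=0, QSO=1, STAR=2
le = LabelEncoder()
Y = le.fit_transform(Y)
# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
X = scaler.fit_transform(X)
# Stratified 80/20 split so both sets keep the original class proportions
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=3)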
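To see what stratifying buys, the class shares can be compared between the two splits. This is a minimal sketch on synthetic labels that only mimic the roughly 60% majority class described above. The helper `class_shares` and the placeholder feature column are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def class_shares(labels: np.ndarray) -> np.ndarray:
    """Fraction of samples belonging to each integer-encoded class."""
    counts = np.bincount(labels)
    return counts / counts.sum()


# Synthetic labels for illustration: 60% class 0, 20% class 1, 20% class 2.
y = np.repeat([0, 1, 2], [600, 200, 200])
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

_, _, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=3
)

print("train shares:", class_shares(y_train))
print("test shares: ", class_shares(y_test))
```

With `stratify=y`, both splits should show approximately the same shares as the full label array. Without it, the shares in a small test set can drift.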
Training
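I then created a Random Forest object with 100 trees, which seems reasonable for this application, and fit the model to the training data. Setting class_weight="balanced" weights each class inversely to its frequency to compensate for the imbalance.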
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=3)
rf.fit(X_train, Y_train)
Testing
Finally, I evaluated the model on the test set, which produced an overall accuracy of 98%.
from sklearn.metrics import classification_report
predictions = rf.predict(X_test)
report = classification_report(Y_test, predictions, target_names=le.classes_)
print(report)
              precision    recall  f1-score   support

      GALAXY       0.98      0.99      0.98     11889
         QSO       0.97      0.93      0.95      3792
        STAR       1.00      1.00      1.00      4319

    accuracy                           0.98     20000
   macro avg       0.98      0.97      0.98     20000
weighted avg       0.98      0.98      0.98     20000
Analysis
The thing to note here is the 93% recall for quasars, some of which are misidentified as galaxies (see confusion matrix below). This is understandable since redshift is the dominant factor: light from distant quasars is heavily redshifted, making them look similar to distant galaxies, especially when the light passes through cosmic dust that makes them look fuzzy like galaxies.
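A confusion matrix of this kind can be computed with sklearn’s `confusion_matrix`. The sketch below uses a handful of toy labels rather than the post’s actual predictions, so the counts are illustrative only. On the real data, `y_true` and `y_pred` would presumably be `Y_test` and `predictions`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Alphabetical order, matching the 0/1/2 encoding described in the post.
class_names = ["GALAXY", "QSO", "STAR"]

# Toy labels for illustration only; not the post's actual predictions.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 0, 1, 2, 2])  # one quasar predicted as a galaxy

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

# Rows are true classes, columns are predicted classes,
# so per-class recall is the diagonal divided by the row sums.
recall = cm.diagonal() / cm.sum(axis=1)

for name, row, r in zip(class_names, cm, recall):
    print(f"{name:>6}: {row}  recall={r:.2f}")
```

Reading across the QSO row shows where the missed quasars end up. In the toy example the single miss lands in the GALAXY column.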
Additionally, a feature analysis (see bar plot below) reveals what was expected: redshift is the dominant feature, since each class generally clusters around its own range of redshift values.
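The post does not say how the feature analysis was produced. One common option, assumed here, is the fitted forest’s impurity-based `feature_importances_` attribute. The sketch below trains on synthetic data in which the label depends only on the last column, standing in for redshift, so the ranking is known in advance. None of the numbers are SDSS results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = ["u", "g", "r", "i", "z", "redshift"]

# Synthetic data for illustration only: the label is derived solely
# from the last column, so that column should dominate the ranking.
rng = np.random.default_rng(3)
X = rng.normal(size=(600, len(feature_names)))
y = np.digitize(X[:, -1], bins=[-0.5, 0.5])  # three classes: 0, 1, 2

rf = RandomForestClassifier(n_estimators=100, random_state=3)
rf.fit(X, y)

# feature_importances_ holds impurity-based importances that sum to 1.
ranked = sorted(
    zip(feature_names, rf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name:>8}: {score:.3f}")
```

The `ranked` pairs can be passed straight to a bar plot. Permutation importance is an alternative if the impurity-based scores are a concern.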