SRC18. Analysis of Variable Selection Methods on Scientific Cluster Measurement Data
Student: Jonathan Wang (University of California, Berkeley)
Supervisor: Wucherl Yoo (Lawrence Berkeley National Laboratory)
Abstract: The goal of this project was to use parallelized variable selection methods to improve the performance of machine learning models on the PTF astrophysics dataset by reducing model training time and removing disruptive variables. We implemented several methods in Spark to take advantage of high-performance computing and tested them on the PTF data. The tests showed that Sequential Backward Selection approximated the optimal subset relatively quickly. Models trained on this subset took significantly less time to train and achieved higher accuracy than models trained on the full feature set. We also experimented with correlation-based grouping, which exploits feature correlations in the PTF data and allows large correlated datasets to be handled more efficiently. With this grouping, we further improved the performance of Sequential Backward Selection on this dataset without significant loss in accuracy.
Two-page extended abstract: pdf
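For readers unfamiliar with Sequential Backward Selection, the general technique can be sketched as follows. This is a minimal single-machine illustration in plain Python, not the Spark implementation described in the abstract. The function name, the `score` callback, the toy feature names, and their weights are all hypothetical stand-ins; in practice `score` would train and evaluate a model on the given feature subset.

```python
from typing import Callable, FrozenSet, Sequence, Tuple


def sequential_backward_selection(
    features: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    min_features: int = 1,
) -> Tuple[FrozenSet[str], float]:
    """Greedy backward elimination (illustrative sketch).

    Starts from the full feature set and repeatedly drops the single
    feature whose removal yields the highest score, remembering the
    best-scoring subset seen along the way.
    """
    current = frozenset(features)
    best_subset, best_score = current, score(current)

    while len(current) > min_features:
        # Score every subset obtained by removing exactly one feature.
        candidates = [(score(current - {f}), f) for f in sorted(current)]
        step_score, dropped = max(candidates)
        current = current - {dropped}
        # Prefer the smaller subset when it scores at least as well.
        if step_score >= best_score:
            best_subset, best_score = current, step_score

    return best_subset, best_score


# Hypothetical toy scoring function: two helpful features and two
# disruptive ones. A real score would come from model validation.
WEIGHTS = {"a": 4.0, "b": 3.0, "noise1": -2.0, "noise2": -1.0}


def toy_score(subset: FrozenSet[str]) -> float:
    return sum(WEIGHTS[f] for f in subset)


best, best_score = sequential_backward_selection(list(WEIGHTS), toy_score)
print(sorted(best), best_score)
```

On this toy input the two negatively weighted features are removed first, and the search keeps the subset containing only `a` and `b`. Each step's candidate evaluations are independent of one another, which is the property that makes this kind of search a natural fit for parallel execution.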