Modern Variable Selection

By Meadhbh O’Neill

We live in an era in which valuable data are collected continuously all around us, and the quantity available keeps growing. High-dimensional data is consequently a “hot” research area. This growth is what motivates the study of modern variable selection methods, which are being developed specifically to tackle this abundance of data.

James et al. (2013) describe how innovative technologies have changed the way data is collected. These advances have been particularly important in sectors such as finance, medicine and manufacturing. It is now common to record an almost unlimited number of predictor measurements for a process. However, because of cost or other constraints, the number of observations recorded is often limited.

In this setting, dimensionality refers to the number of predictors, p, in a dataset. A dataset is high-dimensional when the number of predictors exceeds the number of observations, n. When p > n, classical calculations break down (for example, the least squares fit is no longer unique) and traditional variable selection methods are not appropriate.

In the age of smart manufacturing, consider high-dimensional industrial data. The goal is to identify the variables responsible for success or failure in a manufacturing process. There can be thousands of sensors embedded throughout such a system, recording measurements such as temperature, pressure and weight. Although an extensive amount of data is generated, we probably won’t benefit from using all of it. Efron and Hastie (2016) explain that a subset of the measured variables will be sufficient. Our model will fit less well if we include all the redundant variables.

For large p, traditional variable selection methods such as best-subset selection and stepwise selection are computationally expensive. The bias-variance trade-off should be considered when choosing among candidate methods. Over-fitting is also a danger when fitting linear models to high-dimensional data. A simple least squares fit tends to overfit because there are many parameters and far fewer observations with which to estimate them.
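The over-fitting described above can be seen in a small simulation. The sketch below is illustrative only: the sample sizes, the single true predictor and the helper names are all assumptions made for the example, and plain gradient descent is used simply to avoid a linear-algebra dependency (it is not how least squares is normally computed). Started from zero, it converges to an interpolating least squares solution when p > n.

```python
import random

def predict(X, beta):
    """Linear predictions X @ beta for a list-of-rows design matrix."""
    return [sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) for row in X]

def mse(X, y, beta):
    """Mean squared error of the linear fit beta on (X, y)."""
    return sum((y_hat - y_i) ** 2 for y_hat, y_i in zip(predict(X, beta), y)) / len(y)

def least_squares_by_gradient_descent(X, y, n_steps=3000):
    """Minimise 0.5 * ||X b - y||^2 starting from b = 0 with a fixed step size."""
    n, p = len(X), len(X[0])
    # The sum of squared entries bounds the largest eigenvalue of X^T X,
    # so its reciprocal is a conservative (always stable) step size.
    step = 1.0 / sum(v * v for row in X for v in row)
    beta = [0.0] * p
    for _ in range(n_steps):
        residual = [y_hat - y_i for y_hat, y_i in zip(predict(X, beta), y)]
        for j in range(p):
            gradient_j = sum(X[i][j] * residual[i] for i in range(n))
            beta[j] -= step * gradient_j
    return beta

def simulate(rng, n, p):
    """Hypothetical data: only the first of p predictors affects the response."""
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [3.0 * row[0] + rng.gauss(0, 1) for row in X]
    return X, y

rng = random.Random(1)
n, p = 5, 20  # far more predictors than observations
X_train, y_train = simulate(rng, n, p)
X_test, y_test = simulate(rng, 200, p)

beta_hat = least_squares_by_gradient_descent(X_train, y_train)
train_mse = mse(X_train, y_train, beta_hat)
test_mse = mse(X_test, y_test, beta_hat)
print(f"training MSE: {train_mse:.2e}")
print(f"test MSE:     {test_mse:.2f}")
```

The training error is driven essentially to zero because 20 coefficients can reproduce 5 observations exactly, while the error on fresh data stays large.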

Modern, less flexible methods for model selection are especially useful in the high-dimensional setting. Methods that perform dimension reduction or regularization are particularly effective. These techniques include ridge regression, the lasso, the elastic net and principal components regression (James et al., 2013). Sparse modelling and shrinkage methods are essential for identifying significant predictors (Efron and Hastie, 2016).
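As one illustration of how a shrinkage method produces a sparse model, here is a minimal sketch of the lasso fitted by cyclic coordinate descent with soft-thresholding. It is written in plain Python for transparency; the simulated data, the penalty value lam=0.5 and the function names are assumptions chosen for the example, and in practice a tested library implementation would be the sensible choice.

```python
import random

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exactly 0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 one coordinate at a time."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta, since beta starts at zero
    col_scale = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of column j with the partial residual
            # (the residual with column j's current contribution added back).
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, lam) / col_scale[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta

# Hypothetical high-dimensional data: 60 predictors, 40 observations,
# and only the first two predictors influence the response.
rng = random.Random(0)
n, p = 40, 60
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [3.0 * row[0] - 3.0 * row[1] + rng.gauss(0, 0.5) for row in X]

beta = lasso_coordinate_descent(X, y, lam=0.5)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("selected predictors:", selected)
print("their coefficients: ", [round(beta[j], 2) for j in selected])
```

The soft-thresholding step is what sets coefficients exactly to zero, so variable selection happens as part of the fit. The surviving coefficients are shrunk toward zero, which is the price paid in bias for the reduction in variance.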

Multicollinearity is a prevalent issue in the high-dimensional world, so model fits and results must be interpreted carefully. James et al. (2013) discuss how we can never know exactly which variables (if any) are truly predictive of the outcome. At best, we can hope to select variables that are correlated with the true predictors and assign large regression coefficients to them. As a result, we must be careful not to overstate the results achieved.

Although new technologies allow us to collect complex data, simple, highly regularized methods often perform best. Having millions of predictors can lead to greater predictive performance, but only if they are relevant to the problem. Including many irrelevant predictors can lead to poor results. Often it is not even worthwhile to include every relevant variable: the variance incurred in fitting their coefficients may be greater than the reduction in bias they provide. The primary issues are variance and over-fitting. This problem is known as the “curse of dimensionality”.
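One way to make this trade-off concrete is the decomposition of expected test error given by James et al. (2013). The notation is introduced here for illustration and does not appear in the text above: x_0 is a test point with response y_0, \hat{f} is the fitted model and \varepsilon is the irreducible noise.

```latex
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \operatorname{Var}\left(\hat{f}(x_0)\right)
  + \left[\operatorname{Bias}\left(\hat{f}(x_0)\right)\right]^2
  + \operatorname{Var}(\varepsilon)
```

Adding a predictor can lower the bias term, but each extra coefficient has to be estimated from the same limited data, which tends to raise the variance term.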

High-dimensional analysis is popular in domains such as medicine, finance and robotics.

Biomedical applications include DNA expression microarrays. Microarrays assist in the quantitative study of thousands of genes simultaneously from a single sample of cells. A gene expression dataset can have several thousand rows representing individual genes and only tens of columns representing samples. High-dimensional methods can be used to explore whether certain genes show very high (or low) expression for certain cancer samples.

In the financial world, estimation error is common when working with large panels of economic data, for example in modern portfolio theory. Traditional estimation methods cannot ensure accuracy because this is a high-dimensional problem. When examining a large pool of assets, the number of unknown parameters can grow quadratically with the size of the portfolio.

Important developments in robotics have been facilitated by advances in machine learning. For example, consider an assistive robot that can process sensory information and perform actions that benefit people with disabilities. These robots must be able to execute complex actions. Dimension reduction methods can identify a low-dimensional representation of the high-dimensional state space developed by the robot. This makes the state space more useful for understanding and for machine learning.
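To put a number on the quadratic growth mentioned above for portfolios: a covariance matrix for N assets has N variances and N(N - 1)/2 distinct covariances, giving N(N + 1)/2 parameters in total. The short sketch below tabulates this. It assumes the unknown parameters in question are the entries of the asset covariance matrix, which the text does not state explicitly, and the function name is made up for illustration.

```python
def covariance_parameter_count(n_assets):
    """Number of distinct entries in a symmetric n_assets x n_assets covariance matrix."""
    variances = n_assets                          # diagonal entries
    covariances = n_assets * (n_assets - 1) // 2  # entries above the diagonal
    return variances + covariances

counts = {n_assets: covariance_parameter_count(n_assets) for n_assets in (10, 100, 1000)}
for n_assets, count in counts.items():
    print(f"{n_assets:>5} assets -> {count:>7} parameters")
```

A tenfold increase in the number of assets gives roughly a hundredfold increase in the number of parameters to estimate, so the number of parameters quickly outstrips the number of observations available.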

Traditional variable selection methods are not suitable in the high-dimensional setting. Modern, simple, regularized methods can identify a subset of important variables. Care is needed both when selecting a method and when interpreting the results.

 

References:

Efron, Bradley, and Trevor Hastie. Computer Age Statistical Inference. Vol. 5. Cambridge University Press, 2016.

James, Gareth, et al. An Introduction to Statistical Learning. Vol. 112. New York: Springer, 2013.

 

Meadhbh O’Neill is a PhD student working under the supervision of Dr Kevin Burke and  in MACSI and the SFI Centre for Smart Manufacturing, CONFIRM.