Feature selection [TODO]
I. Introduction
Feature selection is the process of selecting feature or sub set of features for build model in machine learning. A good feature selection method leads to build a better performing model, and hence helps to improve our results. In particular, there are two reasons explaining why feature selection is important:
Feature selection is the process of selecting feature or sub set of features for build model in machine learning. A good feature selection method leads to build a better performing model, and hence helps to improve our results. In particular, there are two reasons explaining why feature selection is important:
- If we are able to reduce number of features, it means that we reduce 'over-fitting' and improve the generalization of models. Moreover, it also helps to reduce the measurement and storage requirements, reducing training and utilization times.
- We gain a better understanding of the underlying structure and characteristics of the data and leads to better intuition of the algorithm.
Removing features with low variance
VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance does not meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.
VAR(X) = p (1 - p)
Example code:
In this example, we applied VarianceThreshold with threshold = 0.8. Only two features are removed among all features list.
In this example, we applied VarianceThreshold with threshold = 0.8. Only two features are removed among all features list.
from sklearn.feature_selection import VarianceThreshold from sklearn.datasets import load_boston # Load boston housing dataset as an exampleboston = load_boston() X = boston["data"] Y = boston["target"] names = boston["feature_names"] sel = VarianceThreshold(threshold=(.8 * (1 - .8))) X_transform = sel.fit_transform(X) print len(names), len(X_transform[0])
VarianceThreshold | Score |
TAX | 1 |
RAD | 1 |
INDUS | 1 |
NOX | 0 |
CRIM | 1 |
AGE | 1 |
LSTAT | 1 |
PTRATIO | 1 |
CHAS | 0 |
RM | 1 |
ZN | 1 |
B | 1 |
DIS | 1 |
One drawback of VarianceThreshold is that we need to set up a threshold to remove all non-valid features. If we do not have an distribution overview of feature values, hence it is hard to choose the correct threshold.
Univariate feature selection
Univariate feature selection selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator. In particular, this method focuses on the relationship of the feature in order to gain a better understanding of data (but not necessarily for optimizing the feature set for better generalization). There are a lot of different options for univariate selection,i.e., Pearson correlation, Euclidean distance, distance correlation, etc.
Pearson correlation
It is a measure of the linear correlation between two variables X and Y, giving a value between +1 and −1 inclusive, where 1 is total positive correlation, 0 is no correlation, and −1 is total negative correlation. Below is the Pearson correlation between X & Y:
Example code:
from scipy.stats import pearsonr from sklearn.datasets import load_boston import numpy as npdef pearson_selection(command): # Load boston housing dataset as an example boston = load_boston() X = boston["data"] Y = boston["target"] names = boston["feature_names"] print names list_features = list() for i in range(X.shape[1]): list_features.append(X[:, i:i+1]) # print len(list_features) list_score = list() for i in range(0, len(list_features)): feature_first = list_features[i] score = list() for j in range(0, len(list_features)): feature_second = list_features[j] value_pearson = pearsonr(feature_first, feature_second) score.append(value_pearson[0]) if command == 'std': list_score.append(np.std(score)) elif command == 'mean': list_score.append(np.mean(score)) else: print 'Wrong command' quit() list_all = list() for i in range(0, len(list_score)): each = list() each.append(names[i]) each.append(float(list_score[i])) list_all.append(each) list_return = list() sortlist = sorted(list_all, key=lambda x: (x[1]), reverse=True) # reverse=True means that we order list from maximum to minimum for value in sortlist: print value[0] + '\t' + str(value[1]) list_return.append(value[0]) return list_return
Result:
|
However, Pearson correlation is only good when we measure the degree of linear dependence between two variables. If the relation is non-linear, Pearson correlation can be close to zero even if there is a 1-1 correspondence between the two variables.
Model based ranking
We also can build an arbitrary machine learning method to build a predictive model for the response variable using each individual feature, and measure the performance of each model based on its feature. Hence we know which features are the best compared to other features.
Example code:
def model_ranking(): # Load boston housing dataset as an example boston = load_boston() X = boston["data"] Y = boston["target"] names = boston["feature_names"] rf = LinearRegression() scores = [] for i in range(X.shape[1]): score = cross_val_score(rf, X[:, i:i+1], Y, scoring="r2", cv=ShuffleSplit(len(X), 4, .25)) scores.append((round(np.mean(score), 3), names[i])) list_scores = sorted(scores, reverse=True) list_return = list() for value in list_scores: print value[1] + '\t' + str(value[0]) list_return.append(value[1]) return list_return
Model_based_ranking | Score |
LSTAT | 0.532 |
RM | 0.507 |
INDUS | 0.227 |
TAX | 0.222 |
PTRATIO | 0.205 |
NOX | 0.16 |
ZN | 0.148 |
AGE | 0.106 |
RAD | 0.096 |
B | 0.086 |
DIS | 0.08 |
CRIM | 0.08 |
CHAS | 0.031 |
Wrapper methods
Wrapper methods consider the selection of a set of features as a search problem, where different combinations are prepared, evaluated and compared to other combinations. A predictive model us used to evaluate a combination of features and assign a score based on model accuracy.
There are two popular methods used in Wrapper:
A common example of heuristic search is hill climbing: keep adding
features one at a time until no further improvement can be achieved
(“forward greedy wrapping”).
Alternatively we can start with the full set of predictors and keep
removing features one at a time until no further improvement can be
achieved (“backward greedy wrapping”).
Example code:
We used a "backward greedy wrapping" to select a set of features. However the result showed that we should keep all the features.
Result:
Embedded methods
Embedded methods choose features contributing to the accuracy of the model when we create model. The most common type of embedded feature selection methods are regularization method. However, we may apply other methods like Lasso, Ridge, etc. to select the features.
Example code:
In this example, I used linear regression to create model and regularization method for choosing the good features.
Result:
Example code:
We used a "backward greedy wrapping" to select a set of features. However the result showed that we should keep all the features.
def recursive_elimination(): boston = load_boston() X = boston["data"] Y = boston["target"] names = boston["feature_names"] estimator = SVR(kernel="linear") selector = RFECV(estimator, step=1, cv=5) selector = selector.fit(X, Y) print selector.ranking_ print names
Result:
Recursive feature elimination | Score |
TAX | 1 |
RAD | 1 |
INDUS | 1 |
NOX | 1 |
CRIM | 1 |
AGE | 1 |
LSTAT | 1 |
PTRATIO | 1 |
CHAS | 1 |
RM | 1 |
ZN | 1 |
B | 1 |
DIS | 1 |
Embedded methods
Embedded methods choose features contributing to the accuracy of the model when we create model. The most common type of embedded feature selection methods are regularization method. However, we may apply other methods like Lasso, Ridge, etc. to select the features.
Example code:
In this example, I used linear regression to create model and regularization method for choosing the good features.
def regular_ftr_selection(): boston = load_boston() X = boston["data"] Y = boston["target"] names = boston["feature_names"] lr = LinearRegression() lr.fit(X, Y) print 'Linear_models: ', print_coefficient(lr.coef_, names) list_all = list() for i in range(0, len(lr.coef_)): each = list() each.append(names[i]) each.append(float(lr.coef_[i])) list_all.append(each) list_return = list() sortlist = sorted(list_all, key=lambda x: (x[1]), reverse=True) # reverse=True mean that we order list from maximum to minimum for value in sortlist: print value[0] + '\t' + str(value[1]) list_return.append(value[0]) return list_return
Result:
III. Feature selection checklist
In the paper “An Introduction to Variable and Feature Selection” (PDF), Isabelle Guyon and Andre Elisseeff provide an excellent checklist that you can use the next time you need to select data features for you predictive modeling problem.
- Do you have domain knowledge? If yes, construct a better set of ad hoc”” features
- Are your features commensurate? If no, consider normalizing them.
- Do you suspect interdependence of features? If yes, expand your feature set by constructing conjunctive features or products of features, as much as your computer resources allow you.
- Do you need to prune the input variables (e.g. for cost, speed or data understanding reasons)? If no, construct disjunctive features or weighted sums of feature
- Do you need to assess features individually (e.g. to understand their influence on the system or because their number is so large that you need to do a first filtering)? If yes, use a variable ranking method; else, do it anyway to get baseline results.
- Do you need a predictor? If no, stop
- Do you suspect your data is “dirty” (has a few meaningless input patterns and/or noisy outputs or wrong class labels)? If yes, detect the outlier examples using the top ranking variables obtained in step 5 as representation; check and/or discard them.
- Do you know what to try first? If no, use a linear predictor. Use a forward selection method with the “probe” method as a stopping criterion or use the 0-norm embedded method for comparison, following the ranking of step 5, construct a sequence of predictors of same nature using increasing subsets of features. Can you match or improve performance with a smaller subset? If yes, try a non-linear predictor with that subset.
- Do you have new ideas, time, computational resources, and enough examples? If yes, compare several feature selection methods, including your new idea, correlation coefficients, backward selection and embedded methods. Use linear and non-linear predictors. Select the best approach with model selection
- Do you want a stable solution (to improve performance and/or understanding)? If yes, sub sample your data and redo your analysis for several “bootstrap”.
IV. Further reading
To have a deeper understanding on feature selecetion, you also want to take a look at these books:
https://en.wikipedia.org/wiki/Feature_selection
http://www.jmlr.org/papersvolume3/guyon03a/guyon03a.pdf
http://scikit-learn.org/stable/modules/feature_selection.html
http://scikit-learn.org/stable/modules/feature_selection.html
Comments
Post a Comment