Feature selection [TODO]

I. Introduction

Feature selection is the process of selecting a feature or a subset of features for building a model in machine learning. A good feature selection method leads to a better-performing model and therefore helps to improve our results. In particular, there are two reasons why feature selection is important:

  1. If we can reduce the number of features, we reduce over-fitting and improve the generalization of models. It also lowers measurement and storage requirements and shortens training and prediction times.
  2. We gain a better understanding of the underlying structure and characteristics of the data, which leads to better intuition about the algorithm.

In this post, we briefly present some popular feature selection methods. We use Python and scikit-learn to run all the experiments. The data, the Boston housing dataset with 506 samples and 13 features, is available at http://scikit-learn.org/stable/datasets/.
II. Methods

Removing features with low variance
VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance does not meet some threshold. By default, it removes all zero-variance features, i.e., features that have the same value in all samples. For boolean (Bernoulli) features that take the value 1 with probability p, the variance is
VAR(X) = p (1 - p)
so a threshold of p(1 - p) removes boolean features in which one of the two values occurs in more than a fraction p of the samples.
Example code:
In this example, we applied VarianceThreshold with threshold = 0.8 * (1 - 0.8) = 0.16. Only two features (NOX and CHAS) are removed from the feature list.

from sklearn.feature_selection import VarianceThreshold
from sklearn.datasets import load_boston

# Load boston housing dataset as an example
boston = load_boston()
X = boston["data"]
Y = boston["target"]
names = boston["feature_names"]
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
X_transform = sel.fit_transform(X)
print len(names), len(X_transform[0])
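
To see which features survive the threshold (matching the table below), the selector's boolean support mask can be mapped back onto the feature names, e.g. by appending the following to the example above:

# 1 = kept, 0 = removed, as in the result table below
for name, kept in zip(names, sel.get_support()):
    print(name + '\t' + str(int(kept)))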

Result:

VarianceThreshold   Selected (1 = kept, 0 = removed)
TAX 1
RAD 1
INDUS 1
NOX 0
CRIM 1
AGE 1
LSTAT 1
PTRATIO 1
CHAS 0
RM 1
ZN 1
B 1
DIS 1

One drawback of VarianceThreshold is that we must choose the threshold by hand. Without an overview of how each feature's values are distributed, it is hard to pick an appropriate threshold.

Univariate feature selection
Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step before fitting an estimator. This approach looks at each feature individually in order to gain a better understanding of the data (but not necessarily to optimize the feature set for better generalization). There are many options for the univariate statistic, e.g., Pearson correlation, Euclidean distance, distance correlation, etc.
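
For example, scikit-learn provides SelectKBest, which keeps the k highest-scoring features under a chosen univariate test. Below is a minimal sketch on the same Boston data; the F-test score function f_regression and k = 5 are arbitrary choices.

from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectKBest, f_regression

boston = load_boston()
X, Y, names = boston["data"], boston["target"], boston["feature_names"]

# Keep the 5 features with the highest univariate F-scores against the target
selector = SelectKBest(f_regression, k=5)
X_new = selector.fit_transform(X, Y)
print(names[selector.get_support()])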

Pearson correlation
It is a measure of the linear correlation between two variables X and Y, giving a value between +1 and −1 inclusive, where +1 is total positive correlation, 0 is no correlation, and −1 is total negative correlation. The Pearson correlation between X and Y is defined as

ρ(X, Y) = cov(X, Y) / (σ_X σ_Y)

where cov(X, Y) is the covariance and σ_X, σ_Y are the standard deviations of X and Y.

Example code:
For each feature in the Boston dataset, we calculate the Pearson correlation between that feature and every other feature. The feature's ranking is then estimated from the mean or the standard deviation of these correlations. In this example, we used both.

from scipy.stats import pearsonr
from sklearn.datasets import load_boston
import numpy as np
def pearson_selection(command):
    # Load boston housing dataset as an example
    boston = load_boston()
    X = boston["data"]
    Y = boston["target"]
    names = boston["feature_names"]
    print names

    list_features = list()
    for i in range(X.shape[1]):
        list_features.append(X[:, i])  # 1-D column so pearsonr receives 1-D arrays
    # print len(list_features)
    list_score = list()
    for i in range(0, len(list_features)):
        feature_first = list_features[i]
        score = list()
        for j in range(0, len(list_features)):
            feature_second = list_features[j]
            value_pearson = pearsonr(feature_first, feature_second)
            score.append(value_pearson[0])

        if command == 'std':
            list_score.append(np.std(score))
        elif command == 'mean':
            list_score.append(np.mean(score))
        else:
            print 'Wrong command'
            quit()

    list_all = list()
    for i in range(0, len(list_score)):
        each = list()
        each.append(names[i])
        each.append(float(list_score[i]))
        list_all.append(each)

    list_return = list()
    sortlist = sorted(list_all, key=lambda x: (x[1]), reverse=True)  # reverse=True orders the list from maximum to minimum
    for value in sortlist:
        print value[0] + '\t' + str(value[1])
        list_return.append(value[0])
    return list_return
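
A minimal usage sketch, ranking the features by each criterion in turn:

# Rank features by the standard deviation, then by the mean,
# of their Pearson correlations with the other features
ranking_by_std = pearson_selection('std')
ranking_by_mean = pearson_selection('mean')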

Result:

Pearson correlation Std Mean
NOX 0.55 0.29
INDUS 0.54 0.28
DIS 0.54 0.25
TAX 0.52 0.24
AGE 0.51 0.22
LSTAT 0.51 0.22
RAD 0.49 0.21
ZN 0.47 0.16
CRIM 0.41 0.07
RM 0.40 -0.07
B 0.40 -0.09
PTRATIO 0.39 -0.09
CHAS 0.28 -0.18



However, the Pearson correlation only measures the degree of linear dependence between two variables. If the relationship is non-linear, the Pearson correlation can be close to zero even when there is a one-to-one correspondence between the two variables.
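
For instance, a perfect quadratic relationship centered around zero yields a Pearson correlation close to zero (a small illustration, not part of the original experiments):

import numpy as np
from scipy.stats import pearsonr

x = np.random.uniform(-1, 1, 10000)
# y = x^2 is completely determined by x, yet the linear correlation is near zero
print(pearsonr(x, x ** 2)[0])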

Model based ranking
We can also pick an arbitrary machine learning method, build a predictive model for the response variable from each individual feature, and measure the performance of each single-feature model. The better a feature's model performs, the more useful that feature is relative to the others.

Example code:


import numpy as np
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
# Older scikit-learn cross-validation API, matching ShuffleSplit(n, n_iter, test_size) below
from sklearn.cross_validation import cross_val_score, ShuffleSplit

def model_ranking():
    # Load boston housing dataset as an example
    boston = load_boston()
    X = boston["data"]
    Y = boston["target"]
    names = boston["feature_names"]

    rf = LinearRegression()
    scores = []
    for i in range(X.shape[1]):
         score = cross_val_score(rf, X[:, i:i+1], Y, scoring="r2", cv=ShuffleSplit(len(X), 4, .25))
         scores.append((round(np.mean(score), 3), names[i]))
    list_scores = sorted(scores, reverse=True)

    list_return = list()
    for value in list_scores:
        print value[1] + '\t' + str(value[0])
        list_return.append(value[1])
    return list_return

Result:


Model_based_ranking Cross-validated R² score
LSTAT 0.532
RM 0.507
INDUS 0.227
TAX 0.222
PTRATIO 0.205
NOX 0.16
ZN 0.148
AGE 0.106
RAD 0.096
B 0.086
DIS 0.08
CRIM 0.08
CHAS 0.031
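
Because each single-feature model above is linear, this ranking shares Pearson correlation's blindness to non-linear relationships. Any other estimator can be swapped into the same loop, for example a random forest; in the sketch below, n_estimators=20 and max_depth=4 are arbitrary choices.

import numpy as np
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.cross_validation import cross_val_score, ShuffleSplit  # older scikit-learn API, as above

def model_ranking_rf():
    boston = load_boston()
    X = boston["data"]
    Y = boston["target"]
    names = boston["feature_names"]

    # Score each feature by the cross-validated R^2 of a single-feature random forest
    model = RandomForestRegressor(n_estimators=20, max_depth=4)
    scores = []
    for i in range(X.shape[1]):
        score = cross_val_score(model, X[:, i:i+1], Y, scoring="r2",
                                cv=ShuffleSplit(len(X), 4, .25))
        scores.append((round(np.mean(score), 3), names[i]))
    for value in sorted(scores, reverse=True):
        print(value[1] + '\t' + str(value[0]))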

Wrapper methods
Wrapper methods treat the selection of a set of features as a search problem, where different combinations are prepared, evaluated, and compared with each other. A predictive model is used to evaluate each combination of features and assign a score based on model accuracy.
There are two popular wrapper strategies:
A common example of heuristic search is hill climbing: keep adding features one at a time until no further improvement can be achieved (“forward greedy wrapping”); a small sketch of this idea is given below.
Alternatively, we can start with the full set of predictors and keep removing features one at a time until no further improvement can be achieved (“backward greedy wrapping”).
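
As an illustration of forward greedy wrapping (not part of the original experiments), the sketch below scores each candidate subset with the cross-validated R² of a linear regression; that scoring choice is an assumption, not something the method prescribes.

import numpy as np
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import cross_val_score  # older scikit-learn API, as in the rest of the post

def forward_greedy_selection(X, Y):
    remaining = list(range(X.shape[1]))
    selected = []
    best_score = -np.inf
    while remaining:
        # Try adding each remaining feature and keep the one that helps most
        candidates = [(np.mean(cross_val_score(LinearRegression(),
                                               X[:, selected + [i]], Y,
                                               scoring="r2", cv=5)), i)
                      for i in remaining]
        score, best = max(candidates)
        if score <= best_score:
            break  # no further improvement: stop
        best_score = score
        selected.append(best)
        remaining.remove(best)
    return selected

boston = load_boston()
print(forward_greedy_selection(boston["data"], boston["target"]))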

Example code:
We used a "backward greedy wrapping" to select a set of features. However the result showed that we should keep all the features. 
from sklearn.datasets import load_boston
from sklearn.svm import SVR
from sklearn.feature_selection import RFECV

def recursive_elimination():
    boston = load_boston()
    X = boston["data"]
    Y = boston["target"]
    names = boston["feature_names"]


    estimator = SVR(kernel="linear")
    selector = RFECV(estimator, step=1, cv=5)
    selector = selector.fit(X, Y)

    print selector.ranking_
    print names
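
To reproduce the table below, the ranking can be paired with the feature names, e.g. by appending the following lines to recursive_elimination():

    # Pair each feature name with its RFECV ranking (1 = selected)
    for name, rank in zip(names, selector.ranking_):
        print(name + '\t' + str(rank))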

Result:

Recursive feature elimination Ranking (1 = selected)
TAX 1
RAD 1
INDUS 1
NOX 1
CRIM 1
AGE 1
LSTAT 1
PTRATIO 1
CHAS 1
RM 1
ZN 1
B 1
DIS 1

Embedded methods
Embedded methods select features that contribute to the accuracy of the model while the model is being built. The most common embedded feature selection methods are regularization methods, such as Lasso and Ridge, which penalize the coefficients of less useful features.

Example code:

In this example, we fit a plain linear regression model and rank the features by their coefficients; a regularized (Lasso) variant is sketched after the code.


from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression

def regular_ftr_selection():
    boston = load_boston()
    X = boston["data"]
    Y = boston["target"]
    names = boston["feature_names"]

    lr = LinearRegression()
    lr.fit(X, Y)
    print 'Linear model coefficients: ', lr.coef_

    list_all = list()
    for i in range(0, len(lr.coef_)):
        each = list()
        each.append(names[i])
        each.append(float(lr.coef_[i]))
        list_all.append(each)

    list_return = list()
    sortlist = sorted(list_all, key=lambda x: (x[1]), reverse=True)  # reverse=True orders the list from maximum to minimum
    for value in sortlist:
        print value[0] + '\t' + str(value[1])
        list_return.append(value[0])
    return list_return
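
For a genuinely regularized embedded method, the same ranking can be done with Lasso, whose L1 penalty drives the coefficients of uninformative features to exactly zero. Below is a minimal sketch; the regularization strength alpha=0.3 is an arbitrary choice.

from sklearn.datasets import load_boston
from sklearn.linear_model import Lasso

boston = load_boston()
X, Y, names = boston["data"], boston["target"], boston["feature_names"]

# The L1 penalty shrinks the coefficients of unhelpful features to exactly zero
lasso = Lasso(alpha=0.3)
lasso.fit(X, Y)
for name, coef in sorted(zip(names, lasso.coef_), key=lambda x: -abs(x[1])):
    print(name + '\t' + str(round(coef, 3)))

Because the features are on different scales, the coefficients are only roughly comparable; standardizing the features first (see item 2 of the checklist below) makes the comparison fairer.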

Result:



III. Feature selection checklist

In the paper “An Introduction to Variable and Feature Selection” (PDF), Isabelle Guyon and Andre Elisseeff provide an excellent checklist that you can use the next time you need to select data features for your predictive modeling problem.
  1. Do you have domain knowledge? If yes, construct a better set of “ad hoc” features.
  2. Are your features commensurate? If no, consider normalizing them.
  3. Do you suspect interdependence of features? If yes, expand your feature set by constructing conjunctive features or products of features, as much as your computer resources allow you.
  4. Do you need to prune the input variables (e.g. for cost, speed or data understanding reasons)? If no, construct disjunctive features or weighted sums of features.
  5. Do you need to assess features individually (e.g. to understand their influence on the system or because their number is so large that you need to do a first filtering)? If yes, use a variable ranking method; else, do it anyway to get baseline results.
  6. Do you need a predictor? If no, stop
  7. Do you suspect your data is “dirty” (has a few meaningless input patterns and/or noisy outputs or wrong class labels)? If yes, detect the outlier examples using the top ranking variables obtained in step 5 as representation; check and/or discard them.
  8. Do you know what to try first? If no, use a linear predictor. Use a forward selection method with the “probe” method as a stopping criterion, or use the 0-norm embedded method for comparison. Following the ranking of step 5, construct a sequence of predictors of the same nature using increasing subsets of features. Can you match or improve performance with a smaller subset? If yes, try a non-linear predictor with that subset.
  9. Do you have new ideas, time, computational resources, and enough examples? If yes, compare several feature selection methods, including your new idea, correlation coefficients, backward selection and embedded methods. Use linear and non-linear predictors. Select the best approach with model selection.
  10. Do you want a stable solution (to improve performance and/or understanding)? If yes, subsample your data and redo your analysis for several “bootstraps”.
IV. Further reading

To gain a deeper understanding of feature selection, you may also want to take a look at the following sources:
https://en.wikipedia.org/wiki/Feature_selection
http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf
http://scikit-learn.org/stable/modules/feature_selection.html

