When we talk about feature selection, we often refer to methods such as MI (mutual information), IG (information gain), and Pearson correlation. Generally, these methods fall into two classes: measures of linear dependence and measures of statistical dependence. (By dependence here we mean the dependence between a feature and a response, or between two features.)
MI and IG measure the statistical dependence between random variables X and Y. In the specific application of feature selection:
- IG measures the KL-divergence between the two probability densities P(X) and P(Y). If the two densities have similar shapes, the divergence is small. This implicitly requires that X and Y have the same support (that is, the same domain).
- MI measures the KL-divergence between P(X,Y) and P(X)P(Y). If X and Y are statistically independent, then P(X,Y) = P(X)P(Y) and the divergence is zero. MI does not require X and Y to have the same support, but it does require an estimate of P(X,Y), which is hard to obtain from data when X and/or Y are continuous variables.
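The two KL-divergence views above can be sketched for discrete variables, where the distributions are easy to estimate by counting. This is a minimal illustration, not a production estimator; the toy data, the 3-value alphabet, and the histogram estimates are all assumptions made for the example.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL-divergence D(p || q) between two discrete distributions.

    eps guards against log(0) for empty histogram bins.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Toy discrete data: feature x and response y, each over {0, 1, 2},
# with y constructed to depend on x.
rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=1000)
y = (x + rng.integers(0, 2, size=1000)) % 3

# Estimate P(X), P(Y), and P(X, Y) by counting.
joint = np.zeros((3, 3))
for xi, yi in zip(x, y):
    joint[xi, yi] += 1
joint /= joint.sum()
px = joint.sum(axis=1)
py = joint.sum(axis=0)

# IG-style comparison: D( P(X) || P(Y) ), small when the two
# marginals have similar shapes over the same support.
ig_like = kl_divergence(px, py)

# MI: D( P(X,Y) || P(X)P(Y) ), zero iff X and Y are independent.
mi = kl_divergence(joint.ravel(), np.outer(px, py).ravel())
print(ig_like, mi)
```

Here `mi` comes out clearly positive because y was built from x; replacing y with an independent sample would drive it toward zero. Note that everything is computed from the estimated tables `px`, `py`, and `joint`, not from the raw sample pairs.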
In both MI and IG, the KL-divergence is computed on the distributions P(X), P(Y), and P(X,Y), not on the data points directly. Statistical dependence therefore gives a general view of the relationship between distributions.
Unlike statistical dependence methods, the covariance matrix and Pearson correlation compute the relationship between X and Y from their samples directly. More details can be found in this previous post. This gives linear dependence methods the ability to check whether the sample X-Y pairs lie on (or close to) a line.
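To contrast with the distribution-based computation above, Pearson correlation can be sketched straight from sample pairs. The synthetic linear relationship and noise level below are assumptions chosen only to make the example concrete.

```python
import numpy as np

# Synthetic samples with a roughly linear relationship: y ≈ 2x.
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.5, size=500)

# Pearson correlation computed directly on the data points:
# r = cov(X, Y) / (std(X) * std(Y))
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
r = cov_xy / (x.std() * y.std())
print(r)
```

Because the (x, y) pairs cluster near a line, `r` comes out close to 1; no density estimate of P(X,Y) is ever needed, which is exactly the practical advantage of the linear dependence methods.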