Feature and Sample Selection

scikit-matter contains multiple data sub-selection modules, primarily corresponding to methods derived from CUR matrix decomposition and Farthest Point Sampling. In their classical form, CUR and FPS determine a data subset that maximizes the variance (CUR) or distribution (FPS) of the features or samples. These methods can be modified to combine supervised and unsupervised learning, in a formulation denoted PCov-CUR and PCov-FPS. For further reading, refer to [Imbalzano2018] and [Cersonsky2021].

These selectors can be used for both feature and sample selection, with similar instantiations. Currently, all sub-selection methods extend GreedySelector, where at each iteration the model scores each feature or sample (without an estimator) and chooses that with the maximum score. This can be executed using:

selector = Selector(
                    # the number of selections to make
                    # if None, set to half the samples or features
                    # if float, fraction of the total dataset to select
                    # if int, absolute number of selections to make
                    n_to_select=4,

                    # option to use `tqdm <https://tqdm.github.io/>`_ progress bar
                    progress_bar=True,

                    # float, cutoff score to stop selecting
                    score_threshold=1E-12

                    # boolean, whether to select randomly after non-redundant selections are exhausted
                    full=False,
                    )
selector.fit(X, y)

Xr = selector.transform(X)

where Selector is one of the classes below that overwrites the method score().

From GreedySelector, selectors inherit these public methods:

class skmatter._selection.GreedySelector
fit(X, y=None, warm_start=False)

Learn the features to select.

Parameters:
  • X (ndarray of shape (n_samples, n_features)) – Training vectors.

  • y (ndarray of shape (n_samples,), default=None) – Target values.

  • warm_start (bool) – Whether the fit should continue after having already run, after increasing n_to_select. Assumes it is called with the same X and y

Returns:

self (object)

transform(X, y=None)

Reduce X to the selected features.

Parameters:
  • X (ndarray of shape [n_samples, n_features]) – The input samples.

  • y (ignored) –

Returns:

X_r (ndarray) – The selected subset of the input.

get_support(indices=False, ordered=False)

Get a mask, or integer index, of the subset

Parameters:
  • indices (bool, default=False) – If True, the return value will be an array of integers, rather than a bool mask.

  • ordered (bool, default=False) – With indices, if True, the return value will be an array of integers, rather than a bool mask, in the order in which they were selected.

Returns:

support (An index that selects the retained subset from a original vectors.) – If indices is False, this is a bool array of shape [# input], in which an element is True iff its corresponding feature or sample is selected for retention. If indices is True, this is an integer array of shape [# n_to_select] whose values are indices into the input vectors.

CUR

CUR decomposition begins by approximating a matrix \({\mathbf{X}}\) using a subset of columns and rows

\[\mathbf{\hat{X}} \approx \mathbf{X}_\mathbf{c} \left(\mathbf{X}_\mathbf{c}^- \mathbf{X} \mathbf{X}_\mathbf{r}^-\right) \mathbf{X}_\mathbf{r}.\]

These subsets of rows and columns, denoted \(\mathbf{X}_\mathbf{r}\) and \(\mathbf{X}_\mathbf{c}\), respectively, can be determined by iterative maximization of a leverage score \(\pi\), representative of the relative importance of each column or row. From hereon, we will call selection methods which are derived off of the CUR decomposition “CUR” as a shorthand for “CUR-derived selection”. In each iteration of CUR, we select the column or row that maximizes \(\pi\) and orthogonalize the remaining columns or rows. These steps are iterated until a sufficient number of features has been selected. This iterative approach, albeit comparatively time consuming, is the most deterministic and efficient route in reducing the number of features needed to approximate \(\mathbf{X}\) when compared to selecting all features in a single iteration based upon the relative \(\pi\) importance.

The feature and sample selection versions of CUR differ only in the computation of \(\pi\). In sample selection \(\pi\) is computed using the left singular vectors, versus in feature selection, \(\pi\) is computed using the right singular vectors. In addition to GreedySelector, both instances of CUR selection build off of skmatter._selection._cur._CUR, and inherit

_CUR.score(X, y=None)

Returns the importance score of the given samples or features.

NOTE: This function does not compute the importance score each time it is called, in order to avoid unnecessary computations. This is done by self._compute_pi().

Parameters:
  • X (ndarray of shape [n_samples, n_features]) – The input samples.

  • y (ignored) –

Returns:

score (ndarray of (n_to_select_from_)) – \(\pi\) importance for the given samples or features

_CUR._compute_pi(X, y=None)

For feature selection, the importance score \(\pi\) is the sum over the squares of the first \(k\) components of the right singular vectors

\[\pi_j = \sum_i^k \left(\mathbf{U}_\mathbf{C}\right)_{ij}^2.\]

where \(\mathbf{C} = \mathbf{X}^T\mathbf{X}\).

For sample selection, the importance score \(\pi\) is the sum over the squares of the first \(k\) components of the right singular vectors

\[\pi_j = \sum_i^k \left(\mathbf{U}_\mathbf{K}\right)_{ij}^2.\]

where \(\mathbf{K} = \mathbf{X}\mathbf{X}^T\).

Parameters:
  • X (ndarray of shape [n_samples, n_features]) – The input samples.

  • y (ignored) –

Returns:

pi (ndarray of (n_to_select_from_)) – \(\pi\) importance for the given samples or features

They are instantiated using skmatter.feature_selection.CUR and skmatter.sample_selection.CUR, e.g.

from skmatter.feature_selection import CUR
selector = CUR(
                    n_to_select=4,
                    progress_bar=True,
                    score_threshold=1E-12
                    full=False,

                    # int, number of eigenvectors to use in computing pi
                    k = 1,

                    # int, number of steps after which to recompute pi
                    recompute_every = 1,

                    # float, threshold below which scores will be considered 0, defaults to 1E-12
                    tolerance=1E-12,
                    )
selector.fit(X)

Xr = selector.transform(X)

PCov-CUR

PCov-CUR extends upon CUR by using augmented right or left singular vectors inspired by Principal Covariates Regression, as demonstrated in [Cersonsky2021]. These methods employ the modified kernel and covariance matrices introduced in PCovR and available via the Utility Classes.

Again, the feature and sample selection versions of PCov-CUR differ only in the computation of \(\pi\). So, in addition to GreedySelector, both instances of PCov-CUR selection build off of skmatter._selection._cur._PCovCUR, inheriting

_PCovCUR.score(X, y=None)

Returns the importance score of the given samples or features.

NOTE: This function does not compute the importance score each time it is called, in order to avoid unnecessary computations. This is done by self._compute_pi().

Parameters:
  • X (ignored) –

  • y (ignored) –

Returns:

score (ndarray of (n_to_select_from_)) – \(\pi\) importance for the given samples or features

_PCovCUR._compute_pi(X, y=None)

For feature selection, the importance score \(\pi\) is the sum over the squares of the first \(k\) components of the right singular vectors

\[\pi_j = \sum_i^k \left(\mathbf{U}_\mathbf{\tilde{C}}\right)_{ij}^2.\]

where \({\mathbf{\tilde{C}} = \alpha \mathbf{X}^T\mathbf{X} + (1 - \alpha)(\mathbf{X}^T\mathbf{X})^{-1/2}\mathbf{X}^T \mathbf{\hat{Y}\hat{Y}}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1/2}}\) for some mixing parameter \({\alpha}\). When \({\alpha = 1}\), this defaults to the covariance matrix \({\mathbf{C} = \mathbf{X}^T\mathbf{X}}\) used in CUR.

For sample selection, the importance score \(\pi\) is the sum over the squares of the first \(k\) components of the right singular vectors

\[\pi_j = \sum_i^k \left(\mathbf{U}_\mathbf{\tilde{K}}\right)_{ij}^2.\]

where \({\mathbf{\tilde{K}} = \alpha \mathbf{XX}^T + (1 - \alpha)\mathbf{\hat{Y}\hat{Y}}^T}\) for some mixing parameter \({\alpha}\). When \({\alpha = 1}\), this defaults to the Gram matrix \({\mathbf{K} = \mathbf{X}\mathbf{X}^T}\).

Parameters:
  • X (ndarray of shape [n_samples, n_features]) – The input samples.

  • y (ignored) –

Returns:

pi (ndarray of (n_to_select_from_)) – \(\pi\) importance for the given samples or features

and are instantiated using skmatter.feature_selection.PCovCUR and skmatter.sample_selection.PCovCUR.

from skmatter.feature_selection import PCovCUR
selector = PCovCUR(
                    n_to_select=4,
                    progress_bar=True,
                    score_threshold=1E-12
                    full=False,

                    # float, default=0.5
                    # The PCovR mixing parameter, as described in PCovR as alpha
                    mixing = 0.5,

                    # int, number of eigenvectors to use in computing pi
                    k = 1,

                    # int, number of steps after which to recompute pi
                    recompute_every = 1,

                    # float, threshold below which scores will be considered 0, defaults to 1E-12
                    tolerance=1E-12,
                    )
selector.fit(X, y)

Xr = selector.transform(X)

Farthest Point-Sampling (FPS)

Farthest Point Sampling is a common selection technique intended to exploit the diversity of the input space.

In FPS, the selection of the first point is made at random or by a separate metric. Each subsequent selection is made to maximize the Haussdorf distance, i.e. the minimum distance between a point and all previous selections. It is common to use the Euclidean distance, however other distance metrics may be employed.

Similar to CUR, the feature and selection versions of FPS differ only in the way distance is computed (feature selection does so column-wise, sample selection does so row-wise), and are built off of the same base class, skmatter._selection._fps._FPS, in addition to GreedySelector, and inherit

_FPS.score(X, y=None)

Returns the Haussdorf distances of all samples to previous selections

NOTE: This function does not compute the importance score each time it is called, in order to avoid unnecessary computations. The haussdorf distance is updated in self._update_haussdorf()

Parameters:
  • X (ignored) –

  • y (ignored) –

Returns:

haussdorf (Haussdorf distances)

_FPS.get_distance()

Traditional FPS employs a column-wise Euclidean distance for feature selection, which can be expressed using the covariance matrix \(\mathbf{C} = \mathbf{X} ^ T \mathbf{X}\)

\[\operatorname{d}_c(i, j) = C_{ii} - 2 C_{ij} + C_{jj}.\]

For sample selection, this is a row-wise Euclidean distance, which can be expressed in terms of the Gram matrix \(\mathbf{K} = \mathbf{X} \mathbf{X} ^ T\)

\[\operatorname{d}_r(i, j) = K_{ii} - 2 K_{ij} + K_{jj}.\]
Returns:

haussdorf (ndarray of shape (n_to_select_from_)) – the minimum distance from each point to the set of selected points. once a point is selected, the distance is not updated; the final list will reflect the distances when selected.

_FPS.get_select_distance()
Returns:

haussdorf_at_select (ndarray of shape (n_to_select)) – at the time of selection, the minimum distance from each selected point to the set of previously selected points.

These selectors can be instantiated using skmatter.feature_selection.FPS and skmatter.sample_selection.FPS.

from skmatter.feature_selection import FPS
selector = FPS(
                    n_to_select=4,
                    progress_bar=True,
                    score_threshold=1E-12
                    full=False,

                    # int or 'random', default=0
                    # Index of the first selection.
                    # If ‘random’, picks a random value when fit starts.
                    initialize = 0,
                    )
selector.fit(X)

Xr = selector.transform(X)

PCov-FPS

PCov-FPS extends upon FPS much like PCov-CUR does to CUR. Instead of using the Euclidean distance solely in the space of \(\mathbf{X}\), we use a combined distance in terms of \(\mathbf{X}\) and \(\mathbf{y}\).

Again, the feature and sample selection versions of PCov-FPS differ only in computing the distances. So, in addition to GreedySelector, both instances of PCov-FPS selection build off of skmatter._selection._fps._PCovFPS, and inherit

_PCovFPS.score(X, y=None)

Returns the Haussdorf distances of all samples to previous selections

NOTE: This function does not compute the importance score each time it is called, in order to avoid unnecessary computations. The haussdorf distance is updated in self._update_haussdorf()

Parameters:
  • X (ignored) –

  • y (ignored) –

Returns:

haussdorf (Haussdorf distances)

_PCovFPS.get_distance()
Returns:

haussdorf (ndarray of shape (n_to_select_from_)) – the minimum distance from each point to the set of selected points. once a point is selected, the distance is not updated; the final list will reflect the distances when selected.

_PCovFPS.get_select_distance()
Returns:

haussdorf_at_select (ndarray of shape (n_to_select)) – at the time of selection, the minimum distance from each selected point to the set of previously selected points.

and can be instantiated using skmatter.feature_selection.PCovFPS and skmatter.sample_selection.PCovFPS.

from skmatter.feature_selection import PCovFPS
selector = PCovFPS(
                    n_to_select=4,
                    progress_bar=True,
                    score_threshold=1E-12
                    full=False,

                    # float, default=0.5
                    # The PCovR mixing parameter, as described in PCovR as alpha
                    mixing = 0.5,

                    # int or 'random', default=0
                    # Index of the first selection.
                    # If ‘random’, picks a random value when fit starts.
                    initialize = 0,
                    )
selector.fit(X, y)

Xr = selector.transform(X)

Voronoi FPS

class skmatter.sample_selection._voronoi_fps.VoronoiFPS(n_trial_calculation=4, full_fraction=None, initialize=0, **kwargs)

In FPS, points are selected based upon their Hausdorff distance to previous selections, i.e. the minimum distance between a given point and any previously selected points. This implicitly constructs a Voronoi tessellation which is updated with each new selection, as each unselected point “belongs” to the Voronoi polyhedron of the nearest previous selection.

This implicit tessellation enabled a more efficient evaluation of the FPS – at each iteration, we need only consider for selection those points at the boundaries of the Voronoi polyhedra, and when updating the tessellation we need only consider moving those points whose Hausdorff distance is greater than half of the distance between the corresponding Voronoi center and the newly selected point, per the triangle equality.

_images/VoronoiFPS-Schematic.pdf

To demonstrate the algorithm behind Voronoi FPS, let \(*_{m+1}\) be a new chosen point, \(v(j)\) was chosen earlier, \(j\) is a point in the polyhedron with center \(v(j)\). From the inequalities of the triangle one can easily see that if \(d(v(j),j)<d(*_{m+1}, j)/2\), then point \(j\) is guaranteed not to fall into the formed polyhedron and the distance to it can be uncalculated.

This algorithm is particularly appealing when using a non-Euclidean or computationally-intensive distance metric, for which the decrease in computational time due to the reduction in distance calculations outweighs the increase from book-keeping. For simple metrics (such as Euclidean distance), VoronoiFPS will likely not accelerate, and may decelerate, computations when compared to FPS.

Parameters:
  • n_trial_calculation (integer, default=4) – Number of calculations used for the switching point between Voronoi FPS and traditional FPS (for detail look at full_fraction).

  • full_fraction (float, default=None) – Proportion of calculated distances from the total number of features at which the switch from Voronoi FPS to FPS occurs. At a certain number of distances to be calculated, the use of Voronoi FPS becomes unreasonably expensive due to the associated costs connected with reading data from the memory. The switching point depends on many conditions, and it is determined “in situ” for optimal use of the algorithm. Determination is done with a few test calculations and memory operations.

These selectors can be instantiated using skmatter.sample_selection.VoronoiFPS.

from skmatter.feature_selection import VoronoiFPS
selector = VoronoiFPS(
                    n_to_select=4,
                    progress_bar=True,
                    score_threshold=1E-12
                    full=False,
                    #n_trial_calculation used for calculation of full_fraction,
                    #so you need to determine only one parameter
                    n_trial_calculation = 4,
                    full_fraction = None,
                    # int or 'random', default=0
                    # Index of the first selection.
                    # If ‘random’, picks a random value when fit starts.
                    initialize = 0,
                    )
selector.fit(X)

Xr = selector.transform(X)

When Not to Use Voronoi FPS

In many cases, this algorithm may not increase upon the efficiency. For example, for simple metrics (such as Euclidean distance), Voronoi FPS will likely not accelerate, and may decelerate, computations when compared to FPS. The sweet spot for Voronoi FPS is when the number of selectable samples is already enough to divide the space with Voronoi polyhedrons, but not yet comparable to the total number of samples, when the cost of bookkeeping significantly degrades the speed of work compared to FPS.

Directional Convex Hull (DCH)

class skmatter.sample_selection._base.DirectionalConvexHull(low_dim_idx=None, tolerance=1e-12)

Performs Sample Selection by constructing a Directional Convex Hull and determining the distance to the hull as outlined in the reference

Parameters:
  • low_dim_idx (list of ints, default None) – Indices of columns of X containing features to be used for the directional convex hull construction (also known as the low- dimensional (LD) hull). By default [0] is used.

  • tolerance (float, default=1.0E-12) – Tolerance for the negative distances to the directional convex hull to consider a point below the convex hull. Depending if a point is below or above the convex hull the distance is differently computed. A very low value can result in a completely wrong distances. Distances cannot be distinguished from zero up to tolerance. It is recommended to leave the default setting.

high_dim_idx_

Indices of columns in data containing high-dimensional features (i.e. those not used for the convex hull construction)

Type:

list of ints

selected_idx_

Indices of datapoints that form the vertices of the convex hull

Type:

numpy.ndarray

interpolator_high_dim_

Interpolater for the features in the high- dimensional space

Type:

scipy.interpolate.interpnd.LinearNDInterpolator

References

[dch]

A. Anelli, E. A. Engel, C. J. Pickard and M. Ceriotti, Physical Review Materials, 2018.

This selector can be instantiated using skmatter.sample_selection.DirectionalConvexHull.

from skmatter.sample_selection import DirectionalConvexHull
selector = DirectionalConvexHull(
                    # Indices of columns of X to use for fitting
                    # the convex hull
                    low_dim_idx=[0,1],
                    )
selector.fit(X,y)

# Get the distance to the convex hull for samples used to fit the
# convex hull. This can also be called using other samples (X_new)
# and corresponding properties (y_new) that were not used to fit
# the hull.
Xr = selector.score_samples(X,y)