Preprocessing

The skmatter.preprocessing module includes scaling, centering and normalization methods.

class skmatter.preprocessing.KernelNormalizer(with_center=True, with_trace=True)

Bases: KernelCenterer

Kernel centering method, similar to KernelCenterer, but with additional scaling and the ability to pass a set of sample weights.

Let K(x, z) be a kernel defined by phi(x)^T phi(z), where phi is a function mapping x to a Hilbert space. KernelNormalizer centers the data (i.e., normalizes it to have zero mean) without explicitly computing phi(x). The centering step is equivalent to centering phi(x) with sklearn.preprocessing.StandardScaler(with_std=False).
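The equivalence between centering the kernel and centering the features can be checked numerically. The following is a minimal NumPy sketch, not the library's implementation; it assumes a linear kernel (so that phi(x) = x) and uniform sample weights, and reuses the matrix X from the Examples below.

```python
import numpy as np

X = np.array([[1.0, -2.0, 2.0],
              [-2.0, 1.0, 3.0],
              [4.0, 1.0, -2.0]])
K = X @ X.T  # linear kernel, so phi(x) = x

# Double centering with the centering matrix H = I - (1/n) 1 1^T
n = K.shape[0]
H = np.eye(n) - np.ones((n, n)) / n
K_centered = H @ K @ H

# Centering the features explicitly and rebuilding the kernel
X_centered = X - X.mean(axis=0)
K_from_features = X_centered @ X_centered.T

print(np.allclose(K_centered, K_from_features))
```

Both routes give the same matrix, which is why the kernel can be centered without ever forming phi(x).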

Parameters:
  • with_center (bool, default=True) – If True, center the kernel matrix before scaling. If False, do not center the kernel.

  • with_trace (bool, default=True) – If True, scale the kernel so that the trace is equal to the number of samples. If False, do not scale the kernel.

K_fit_rows_

Average of each column of kernel matrix.

Type:

ndarray of shape (n_samples,)

K_fit_all_

Average of kernel matrix.

Type:

float

sample_weight_

Sample weights (if provided during the fit).

Type:

ndarray of shape (n_samples,)

scale_

Scaling parameter used when with_trace=True. Calculated as np.trace(K) / K.shape[0].

Type:

float
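As a rough illustration of the trace scaling, the sketch below starts from the centered kernel that appears in the Examples of this section. It assumes that, when with_center=True, the K in the formula above is the centered kernel; this is consistent with the example output but is an assumption about the internals.

```python
import numpy as np

# Centered kernel taken from the Examples in this section
K_centered = np.array([[5.0, 0.0, -5.0],
                       [0.0, 14.0, -14.0],
                       [-5.0, -14.0, 19.0]])

# Scale so that the trace equals the number of samples
scale = np.trace(K_centered) / K_centered.shape[0]
K_scaled = K_centered / scale

print(scale)
print(np.trace(K_scaled))
```

After the division, the trace of the kernel equals the number of samples, here 3.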

Examples

>>> from skmatter.preprocessing import KernelNormalizer
>>> from sklearn.metrics.pairwise import pairwise_kernels
>>> X = [[ 1., -2.,  2.],
...      [ -2.,  1.,  3.],
...      [ 4.,  1., -2.]]
>>> K = pairwise_kernels(X, metric='linear')
>>> K
array([[  9.,   2.,  -2.],
       [  2.,  14., -13.],
       [ -2., -13.,  21.]])
>>> transformer = KernelNormalizer().fit(K)
>>> transformer
KernelNormalizer()
>>> transformer.transform(K)
array([[ 0.39473684,  0.        , -0.39473684],
       [ 0.        ,  1.10526316, -1.10526316],
       [-0.39473684, -1.10526316,  1.5       ]])
>>> transformer.scale_ * transformer.transform(K)
array([[  5.,   0.,  -5.],
       [  0.,  14., -14.],
       [ -5., -14.,  19.]])

fit(K=None, y=None, sample_weight=None)

Fit KernelNormalizer.

Parameters:
  • K (ndarray of shape (n_samples, n_samples)) – Kernel matrix.

  • y (None) – Ignored.

  • sample_weight (ndarray of shape (n_samples,), default=None) – Weights for each sample. Sample weighting can be used to center (and scale) data using a weighted mean. Weights are internally normalized before preprocessing.

Returns:

self (object) – Fitted transformer.
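Centering with a weighted mean can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the library's code: the weights are normalized to sum to one, the kernel is linear so that the result can be checked against explicitly centered features, and the weight values are made up for the example.

```python
import numpy as np

X = np.array([[1.0, -2.0, 2.0],
              [-2.0, 1.0, 3.0],
              [4.0, 1.0, -2.0]])
K = X @ X.T  # linear kernel, so phi(x) = x

# Hypothetical sample weights, normalized to sum to one
w = np.array([1.0, 2.0, 1.0])
w = w / w.sum()

# Weighted averages of the kernel
row_means = K @ w      # weighted mean of each row (K is symmetric)
all_mean = w @ K @ w   # weighted mean of the whole matrix

K_centered = K - row_means[:, None] - row_means[None, :] + all_mean

# Reference: subtract the weighted feature mean, then rebuild the kernel
X_centered = X - w @ X
print(np.allclose(K_centered, X_centered @ X_centered.T))
```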

fit_transform(K, y=None, sample_weight=None, copy=True, **fit_params)

Fit to data, then transform it.

Parameters:
  • K (ndarray of shape (n_samples, n_samples)) – Kernel matrix.

  • y (None) – Ignored.

  • sample_weight (ndarray of shape (n_samples,), default=None) – Weights for each sample. Sample weighting can be used to center (and scale) data using a weighted mean. Weights are internally normalized before preprocessing.

  • **fit_params – Necessary for compatibility with the functions of the TransformerMixin class.

Returns:

K_new (ndarray of shape (n_samples1, n_samples2)) – Transformed array

transform(K, copy=True)

Center kernel matrix.

Parameters:
  • K (ndarray of shape (n_samples1, n_samples2)) – Kernel matrix.

  • copy (bool, default=True) – Set to False to perform inplace computation.

Returns:

K_new (ndarray of shape (n_samples1, n_samples2)) – Transformed array
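A kernel between new samples and the fitted samples, of shape (n_samples1, n_samples2), is centered with the statistics stored during the fit. The sketch below illustrates this for a linear kernel with uniform weights and without trace scaling; the variable names mirror the fitted attributes, and X_new is made-up data for the example.

```python
import numpy as np

X_train = np.array([[1.0, -2.0, 2.0],
                    [-2.0, 1.0, 3.0],
                    [4.0, 1.0, -2.0]])
X_new = np.array([[0.0, 1.0, 1.0],
                  [2.0, -1.0, 0.0]])  # hypothetical new samples

K_train = X_train @ X_train.T
K_new = X_new @ X_train.T  # shape (2, 3)

# Statistics computed from the training kernel
K_fit_rows = K_train.mean(axis=0)  # average of each column
K_fit_all = K_train.mean()         # average of the whole matrix

# Average of each new sample's kernel values over the training samples
K_pred_cols = K_new.mean(axis=1, keepdims=True)

K_new_centered = K_new - K_fit_rows - K_pred_cols + K_fit_all

# Reference: center both sets of features with the training mean
mean_train = X_train.mean(axis=0)
expected = (X_new - mean_train) @ (X_train - mean_train).T
print(np.allclose(K_new_centered, expected))
```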

class skmatter.preprocessing.SparseKernelCenterer(with_center=True, with_trace=True, rcond=1e-12)

Bases: TransformerMixin, BaseEstimator

Kernel centering method for sparse kernels, similar to KernelNormalizer.

The main disadvantage of kernel methods, which are widely used in machine learning, is that their time and space complexity grows quickly with the number of samples. With a large dataset, a huge amount of information must not only be stored but also used constantly in calculations. To avoid this, so-called sparse kernel methods are used, which are formulated from the low-rank (Nyström) approximation:

\[\mathbf{K} \approx \hat{\mathbf{K}}_{N N}=\mathbf{K}_{N M} \mathbf{K}_{M M}^{-1} \mathbf{K}_{N M}^{T}\]

where the subscripts of $\mathbf{K}$ denote the sizes of the sets of samples compared in each kernel, with $N$ being the size of the full data set and $M$ the size of a small active set. With this method, only the matrix $\mathbf{K}_{NM}$ needs to be stored and used, i.e. the asymptotic memory requirement improves by a factor of $N/M$.
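The Nyström approximation can be sketched with NumPy as follows. The data, the choice of the first five samples as the active set, and the use of np.linalg.pinv with a small rcond are assumptions made for illustration. With a linear kernel on 3 features, an active set that spans the feature space reproduces the full kernel exactly, which makes the result easy to check.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # N = 20 samples, 3 features
active = X[:5]                 # M = 5 active samples (arbitrary choice)

K_nm = X @ active.T            # shape (N, M)
K_mm = active @ active.T       # shape (M, M), rank 3 here

# Pseudo-inverse; rcond discards the numerically zero singular values
K_mm_inv = np.linalg.pinv(K_mm, rcond=1e-10)

# Nystrom approximation of the full (N, N) kernel
K_hat = K_nm @ K_mm_inv @ K_nm.T

K_full = X @ X.T
print(np.allclose(K_hat, K_full))
```

Only K_nm (N x M entries) and K_mm (M x M entries) have to be kept, instead of the N x N entries of the full kernel.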

Parameters:
  • with_center (bool, default=True) – If True, center the kernel matrix before scaling. If False, do not center the kernel.

  • with_trace (bool, default=True) – If True, scale the kernel so that the trace is equal to the number of samples. If False, do not scale the kernel.

  • rcond (float, default=1e-12) – Conditioning parameter to use when computing the Nyström-approximated kernel for scaling.

K_fit_rows_

Average of each column of kernel matrix.

Type:

ndarray of shape (n_samples,)

K_fit_all_

Average of kernel matrix.

Type:

float

sample_weight_

Sample weights (if provided during the fit).

Type:

ndarray of shape (n_samples,)

scale_

Scaling parameter used when with_trace=True. Calculated as np.trace(K) / K.shape[0].

Type:

float

n_active_

Size of the active set.

Type:

int

fit(Knm, Kmm, y=None, sample_weight=None)

Fit SparseKernelCenterer.

Parameters:
  • Knm (ndarray of shape (n_samples, n_active)) – Kernel matrix between the reference data set and the active set

  • Kmm (ndarray of shape (n_active, n_active)) – Kernel matrix between the active set and itself

  • y (None) – Ignored.

  • sample_weight (ndarray of shape (n_samples,), default=None) – Weights for each sample. Sample weighting can be used to center (and scale) data using a weighted mean. Weights are internally normalized before preprocessing.

Returns:

self (object) – Fitted transformer.

fit_transform(Knm, Kmm, y=None, sample_weight=None, **fit_params)

Fit to data, then transform it.

Parameters:
  • Knm (ndarray of shape (n_samples, n_active)) – Kernel matrix between the reference data set and the active set

  • Kmm (ndarray of shape (n_active, n_active)) – Kernel matrix between the active set and itself

  • y (None) – Ignored.

  • sample_weight (ndarray of shape (n_samples,), default=None) – Weights for each sample. Sample weighting can be used to center (and scale) data using a weighted mean. Weights are internally normalized before preprocessing.

  • **fit_params – Necessary for compatibility with the functions of the TransformerMixin class.

Returns:

K_new (ndarray of shape (n_samples, n_active)) – Transformed array

transform(Knm, y=None)

Center the kernel. The transformer must be fitted before calling this method.

Parameters:
  • Knm (ndarray of shape (n_samples, n_active)) – Kernel matrix between the reference data set and the active set

  • y (None) – Ignored.

Returns:

K_new (ndarray of shape (n_samples, n_active)) – Transformed array

class skmatter.preprocessing.StandardFlexibleScaler(with_mean=True, with_std=True, column_wise=False, rtol=0, atol=1e-12, copy=False)

Bases: TransformerMixin, BaseEstimator

Standardize features by removing the mean and scaling to unit variance. The mean of each column is reduced to zero and, in the case of column_wise=True, the variance of each column is set equal to one / number of columns. The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the samples if with_mean is True, otherwise zero, and s is the standard deviation of the samples if with_std is True, otherwise one.

Centering and scaling can occur independently for each feature (column_wise=True), by calculating the appropriate statistics for each column of the input, or for the whole matrix (column_wise=False). The mean and standard deviation are then stored for use on later data using transform().

Standardization of a dataset is a common requirement for many machine learning estimators: an improperly scaled / centered dataset may result in anomalous behavior.

At the same time, depending on the conditions of the task, it may be necessary to preserve the ratio in the scale between the features (for example, in the case where the feature matrix is something like a covariance matrix), so the standardization should be carried out for the whole matrix, as opposed to the individual columns, as is done in sklearn.preprocessing.StandardScaler.
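The difference between whole-matrix and per-column scaling can be sketched in NumPy. This is an illustration, not the library's implementation: it assumes uniform sample weights and the population variance (ddof=0). The whole-matrix scale used here, the square root of the summed column variances, reproduces the numbers shown in the Examples below; the per-column variant is the StandardScaler-style scaling mentioned above.

```python
import numpy as np

X = np.array([[1.0, -2.0, 2.0],
              [-2.0, 1.0, 3.0],
              [4.0, 1.0, -2.0]])

mean = X.mean(axis=0)
X_centered = X - mean
col_var = X_centered.var(axis=0)  # population variance of each column

# Whole-matrix scaling: one factor, ratios between columns are preserved
global_scale = np.sqrt(col_var.sum())
X_global = X_centered / global_scale

# Per-column scaling in the style of sklearn's StandardScaler
X_columnwise = X_centered / np.sqrt(col_var)

print(X_global)
print(X_global.var(axis=0).sum())  # total variance after scaling
print(X_columnwise.var(axis=0))    # variance of each column after scaling
```

With a single global factor the total variance becomes one while the relative spread of the columns is unchanged; per-column scaling forces every column to unit variance.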

Parameters:
  • with_mean (bool, default=True) – If True, center the data before scaling. If False, keep the mean intact.

  • with_std (bool, default=True) – If True, scale the data to unit variance. If False, keep the variance intact.

  • column_wise (bool, default=False) – If True, normalize each column separately. If False, normalize the whole matrix with respect to its total variance.

  • rtol (float, default=0) – The relative tolerance for the zero-variance check: variance is considered zero when it is less than abs(mean) * rtol + atol.

  • atol (float, default=1e-12) – The absolute tolerance for the zero-variance check: variance is considered zero when it is less than abs(mean) * rtol + atol.

  • copy (bool, default=False) – Copy the input X or not.
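The tolerance criterion stated for rtol and atol can be written out directly. The helper name variance_is_negligible is hypothetical and only restates the inequality from the parameter descriptions; how the scaler reacts when the check triggers is not covered here.

```python
import numpy as np


def variance_is_negligible(var, mean, rtol=0.0, atol=1e-12):
    """Return True where var < abs(mean) * rtol + atol (hypothetical helper)."""
    return var < np.abs(mean) * rtol + atol


# Second column is constant, so its variance is zero
X = np.array([[1.0, 5.0],
              [2.0, 5.0],
              [3.0, 5.0]])

flags = variance_is_negligible(X.var(axis=0), X.mean(axis=0))
print(flags)
```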

n_samples_seen_

Number of samples in the reference ndarray

Type:

int

n_features_

Number of features in the reference ndarray

Type:

int

mean_

The mean value for each feature in the training set. Equal to an ndarray of zeros of shape (n_features,) when with_mean=False.

Type:

ndarray of shape (n_features,)

scale_

The scaling factor: an ndarray of shape (n_features,) when column_wise=True, or a float when column_wise=False.

Type:

ndarray of shape (n_features,), float or None

copy

Copy the input X or not.

Type:

bool, default=False

Examples

>>> import numpy as np
>>> from skmatter.preprocessing import StandardFlexibleScaler
>>> X = np.array([[ 1., -2.,  2.],
...               [-2.,  1.,  3.],
...               [ 4.,  1., -2.]])
>>> transformer = StandardFlexibleScaler().fit(X)
>>> transformer
StandardFlexibleScaler()
>>> transformer.transform(X)
array([[ 0.        , -0.56195149,  0.28097574],
       [-0.84292723,  0.28097574,  0.56195149],
       [ 0.84292723,  0.28097574, -0.84292723]])
>>> transformer.scale_ * transformer.transform(X)
array([[ 0., -2.,  1.],
       [-3.,  1.,  2.],
       [ 3.,  1., -3.]])
>>> transformer.scale_ * transformer.transform(X) + transformer.mean_
array([[ 1., -2.,  2.],
       [-2.,  1.,  3.],
       [ 4.,  1., -2.]])

fit(X, y=None, sample_weight=None)

Compute mean and scaling to be applied for subsequent normalization.

Parameters:
  • X (ndarray of shape (n_samples, n_features)) – The data used to compute the mean and standard deviation used for later scaling along the features axis.

  • y (None) – Ignored.

  • sample_weight (ndarray of shape (n_samples,)) – Weights for each sample. Sample weighting can be used to center (and scale) data using a weighted mean. Weights are internally normalized before preprocessing.

Returns:

self (object) – Fitted scaler.
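A weighted mean and variance of the kind described for sample_weight can be sketched with np.average. The weight values are made up for the example, and normalizing them to sum to one is an assumption about what "internally normalized" means here.

```python
import numpy as np

X = np.array([[1.0, -2.0, 2.0],
              [-2.0, 1.0, 3.0],
              [4.0, 1.0, -2.0]])

# Hypothetical sample weights, normalized to sum to one
w = np.array([1.0, 2.0, 1.0])
w = w / w.sum()

# Weighted mean and weighted (population) variance of each feature
mean_w = np.average(X, weights=w, axis=0)
var_w = np.average((X - mean_w) ** 2, weights=w, axis=0)

# With uniform weights the usual mean is recovered
mean_uniform = np.average(X, weights=np.ones(3) / 3, axis=0)

print(mean_w)
print(var_w)
```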

inverse_transform(X_tr)

Scale back the data to the original representation.

Parameters:

X_tr (ndarray of shape (n_samples, n_features)) – Transformed matrix

Returns:

X (ndarray of shape (n_samples, n_features)) – Original matrix.

transform(X, y=None, copy=None)

Normalize a vector based on previously computed mean and scaling.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the features axis.

  • y (None) – Ignored.

  • copy (bool, default=None) – Copy the input X or not.

Returns:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Transformed array.