Datasets

CSD-1000R

This dataset, intended for model testing, contains the SOAP power spectrum features and local NMR chemical shieldings for 100 environments selected from CSD-1000r, originally published in [Ceriotti2019].

Function Call

skmatter.datasets.load_csd_1000r()

Data Set Characteristics

Number of Instances: Each representation 100

Number of Features: Each representation 100

The representations were computed with [C1] using the hyperparameters:

rascal hyperparameters (key: value):

interaction_cutoff: 3.5
max_radial: 6
max_angular: 6
gaussian_sigma_constant: 0.4
gaussian_sigma_type: "Constant"
cutoff_smooth_width: 0.5
normalize: True

Of the 2,520 resulting features, 100 were selected via farthest point sampling (FPS) using [C2].
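As a rough illustration of what an FPS selection does, the following is a minimal pure-Python sketch, not the skmatter implementation referenced in [C2]. The helper name `fps_select` and the toy points are hypothetical; for the dataset itself the selection ran over feature columns rather than 2D points.

```python
import math


def fps_select(points, n_to_select, initialize=0):
    """Greedy farthest point sampling over equal-length vectors (sketch)."""
    selected = [initialize]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[initialize]) for p in points]
    while len(selected) < n_to_select:
        # Pick the point that is farthest from the current selection.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        # Update the nearest-selected distances with the new point.
        min_dist = [
            min(d, math.dist(p, points[nxt])) for d, p in zip(min_dist, points)
        ]
    return selected


# Toy data (hypothetical): five points in the plane.
points = [(0.0, 0.0), (0.1, 0.0), (1.5, 0.0), (0.0, 2.0), (1.0, 2.0)]
picked = fps_select(points, n_to_select=3)
print(picked)
```

Each iteration adds the candidate that is least well covered by what has already been chosen, which is why FPS tends to yield a diverse subset.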

References

[C1] https://github.com/lab-cosmo/librascal commit ade202a6

[C2] https://github.com/lab-cosmo/scikit-matter commit 4ed1d92

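Reference Code

import numpy as np
from ase.io import read  # reads the .xyz file into a list of frames

# librascal's SOAP calculator, see [C1]
from rascal.representations import SphericalInvariants as SOAP

from skmatter.feature_selection import CUR
from skmatter.preprocessing import StandardFlexibleScaler
from skmatter.sample_selection import FPS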

# read all of the frames and book-keep the centers and species
filename = "/path/to/CSD-1000R.xyz"
frames = np.asarray(
    read(filename, ":"),
    dtype=object,
)

n_centers = np.array([len(frame) for frame in frames])
center_idx = np.array([i for i, f in enumerate(frames) for p in f])
n_env_accum = np.zeros(len(frames) + 1, dtype=int)
n_env_accum[1:] = np.cumsum(n_centers)

numbers = np.concatenate([frame.numbers for frame in frames])

# compute radial soap vectors as first pass
hypers = dict(
    soap_type="PowerSpectrum",
    interaction_cutoff=2.5,
    max_radial=6,
    max_angular=0,
    gaussian_sigma_type="Constant",
    gaussian_sigma_constant=0.4,
    cutoff_smooth_width=0.5,
    normalize=False,
    global_species=[1, 6, 7, 8],
    expansion_by_species_method="user defined",
)
soap = SOAP(**hypers)

X_raw = StandardFlexibleScaler(column_wise=False).fit_transform(
    soap.transform(frames).get_features(soap)
)

# rank the environments in terms of diversity
n_samples = 500
i_selected = FPS(n_to_select=n_samples, initialize=0).fit(X_raw).selected_idx_

# book-keep which frames these samples belong in
f_selected = center_idx[i_selected]
reduced_f_selected = list(sorted(set(f_selected)))
frames_selected = frames[f_selected].copy()
ci_selected = i_selected - n_env_accum[f_selected]

properties_select = [
    frames[fi].arrays["CS_local"][ci] for fi, ci in zip(f_selected, ci_selected)
]
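The index book-keeping in the reference code above can be traced on a toy example. This is a sketch with made-up frame sizes and selected indices (all values are hypothetical), written in plain Python rather than NumPy so that it runs without the dataset.

```python
from itertools import accumulate

# Hypothetical number of atomic centers in each of three frames.
n_centers = [3, 2, 4]

# For every environment, the index of the frame it belongs to.
center_idx = [i for i, n in enumerate(n_centers) for _ in range(n)]

# Offset of the first environment of each frame in the flat list.
n_env_accum = [0, *accumulate(n_centers)]

# Hypothetical flat indices of selected environments.
i_selected = [4, 0, 7]

# Frame of each selected environment and its index within that frame.
f_selected = [center_idx[i] for i in i_selected]
ci_selected = [i - n_env_accum[f] for i, f in zip(i_selected, f_selected)]

print(f_selected, ci_selected)
```

Subtracting the frame offset converts a flat environment index into a per-frame atom index, which is what the lookup into `frames[fi].arrays["CS_local"][ci]` needs.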

Degenerate CH4 manifold

The dataset contains two representations (SOAP power spectrum and bispectrum) of the two manifolds spanned by the carbon atoms of two sets of 81 methane structures. In the SOAP power spectrum representation, the two manifolds intersect, creating a degenerate manifold/line along which the representation remains the same. In contrast, for higher body-order representations such as the (SOAP) bispectrum, the carbon atoms can be uniquely represented and do not create a degenerate manifold. Following the naming convention of [Pozdnyakov2020], for each representation the first 81 samples correspond to the X minus manifold and the second 81 samples to the X plus manifold.

Function Call

skmatter.datasets.load_degenerate_CH4_manifold()

Data Set Characteristics

Number of Instances: Each representation 162

Number of Features: Each representation 12

The representations were computed with [D1] using the hyperparameters:

rascal hyperparameters (key: value):

radial_basis: "GTO"
interaction_cutoff: 4
max_radial: 2
max_angular: 2
gaussian_sigma_constant: 0.5
gaussian_sigma_type: "Constant"
cutoff_smooth_width: 0.5
normalize: False

The SOAP bispectrum features were additionally reduced to 12 features with principal component analysis (PCA) [D2].

References

[D1] https://github.com/lab-cosmo/librascal commit 8d9ad7a

[D2] https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

NICE dataset

This is a toy dataset containing NICE [1, 4] (N-body Iterative Contraction of Equivariants) features for the first 500 configurations of the dataset [2, 3] of randomly displaced methane configurations.

Function Call

skmatter.datasets.load_nice_dataset()

Data Set Characteristics

Number of Instances: 500

Number of Features: 160

The representations were computed with the NICE package [4], using the following definition of the NICE calculator:

StandardSequence(
    [
        StandardBlock(
            ThresholdExpansioner(num_expand=150),
            CovariantsPurifierBoth(max_take=10),
            IndividualLambdaPCAsBoth(n_components=50),
            ThresholdExpansioner(num_expand=300, mode="invariants"),
            InvariantsPurifier(max_take=50),
            InvariantsPCA(n_components=30),
        ),
        StandardBlock(
            ThresholdExpansioner(num_expand=150),
            CovariantsPurifierBoth(max_take=10),
            IndividualLambdaPCAsBoth(n_components=50),
            ThresholdExpansioner(num_expand=300, mode="invariants"),
            InvariantsPurifier(max_take=50),
            InvariantsPCA(n_components=20),
        ),
        StandardBlock(
            None,
            None,
            None,
            ThresholdExpansioner(num_expand=300, mode="invariants"),
            InvariantsPurifier(max_take=50),
            InvariantsPCA(n_components=20),
        ),
    ],
    initial_scaler=InitialScaler(mode="signal integral", individually=True),
)

References

[1] Jigyasa Nigam, Sergey Pozdnyakov, and Michele Ceriotti. “Recursive evaluation and iterative contraction of N-body equivariant features.” The Journal of Chemical Physics 153.12 (2020): 121101.

[2] Sergey N. Pozdnyakov, Michael J. Willatt, Albert P. Bartók, Christoph Ortner, Gábor Csányi, and Michele Ceriotti. “Incompleteness of Atomic Structure Representations.”

[3] https://archive.materialscloud.org/record/2020.110

Reference Code

[4] https://github.com/lab-cosmo/nice

WHO dataset

who_dataset.csv is a compilation of multiple publicly available datasets obtained through data.worldbank.org. Specifically, the following versioned datasets are used:

NY.GDP.PCAP.CD (v2_4770383) [1]
SE.XPD.TOTL.GD.ZS (v2_4773094) [2]
SH.DYN.AIDS.ZS (v2_4770518) [3]
SH.IMM.IDPT (v2_4770682) [4]
SH.IMM.MEAS (v2_4774112) [5]
SH.TBS.INCD (v2_4770775) [6]
SH.XPD.CHEX.GD.ZS (v2_4771258) [7]
SN.ITK.DEFC.ZS (v2_4771336) [8]
SP.DYN.LE00.IN (v2_4770556) [9]
SP.POP.TOTL (v2_4770385) [10]

where the corresponding file names are API_{dataset}_DS2_excel_en_{version}.xls.
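To make the naming pattern concrete, here is a small sketch that builds two file names from the indicator codes and versions listed above and then recovers the indicator code with the same slicing the compilation script uses. The `versions` dictionary is only an illustrative subset of the list.

```python
# Illustrative subset of the indicator/version pairs listed above.
versions = {
    "NY.GDP.PCAP.CD": "v2_4770383",
    "SP.POP.TOTL": "v2_4770385",
}

# Build file names following API_{dataset}_DS2_excel_en_{version}.xls.
files = [f"API_{ind}_DS2_excel_en_{ver}.xls" for ind, ver in versions.items()]

# Recover the indicator code: skip the "API_" prefix (4 characters)
# and cut at the next underscore.
indicators = [f[4 : f[4:].index("_") + 4] for f in files]

print(files)
print(indicators)
```

The slicing relies on indicator codes containing dots but no underscores, which holds for every code in the list.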

This dataset, intended only for demonstration, contains 2,020 country-year pairings and the corresponding values of the indicators above.

Data Set Characteristics

Number of Instances: 2020

Number of Features: 10

Reference Code

The dataset was compiled through the following script, where the source datasets have been placed in a folder named who_data:

import os
import pandas as pd
import numpy as np

files = os.listdir('who_data/')
indicators = [f[4:f[4:].index('_')+4] for f in files]
indicator_codes = {}
data_dict = {}
entries = []

for file in files:
    data = pd.read_excel(
        "who_data/" + file,
        header=3,
        sheet_name="Data",
        index_col=0,
    )

    indicator = data["Indicator Code"].values[0]
    indicator_codes[indicator] = data["Indicator Name"].values[0]

    for index in data.index:
        for year in range(1900, 2022):
            if str(year) in data.loc[index] and not np.isnan(
                data.loc[index].loc[str(year)]
            ):
                if (index, year) not in data_dict:
                    data_dict[(index, year)] = np.nan * np.ones(len(indicators))
                data_dict[(index, year)][indicators.index(indicator)] = data.loc[index].loc[str(year)]

with open('who_data.csv','w') as outf:
    outf.write('Country,Year,'+','.join(indicators)+'\n')
    for key, data in data_dict.items():
        if np.count_nonzero(~np.isnan(np.array(data, dtype=float))) == len(indicators):
            outf.write('{},{},{}\n'.format(key[0].replace(',',' '), key[1], ','.join([str(d) for d in data])))
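The final step of the script keeps only country-year pairs that have a value for every indicator and replaces commas in country names so they do not break the CSV columns. A minimal sketch of that logic on made-up entries follows; the country names and numbers are hypothetical placeholders, not World Bank data, and the standard-library `csv` module stands in for the manual string formatting.

```python
import csv
import io
import math

indicators = ["SP.POP.TOTL", "SP.DYN.LE00.IN"]

# Hypothetical entries keyed by (country, year); NaN marks a missing value.
data_dict = {
    ("Exampleland, The", 2000): [1000.0, 72.0],
    ("Samplestan", 2001): [float("nan"), 47.5],
}

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Country", "Year", *indicators])
for (country, year), values in data_dict.items():
    # Keep the row only if no indicator value is missing.
    if not any(math.isnan(v) for v in values):
        # Replace commas in the name, as the script above does.
        writer.writerow([country.replace(",", " "), year, *values])

rows = buffer.getvalue().splitlines()
print(rows)
```

Only the first entry survives, because the second one lacks a value for one indicator.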