Practical Machine Learning Techniques to Accelerate Materials Science Research
Predicting the Critical Temperature of Superconductors using Regression Techniques, Feature Selection, and Selection Criteria
The U.S. energy grid loses about 5% of its power due to resistive losses in its transmission lines, according to an estimate from the EIA. What if we could find a way to eliminate all of that? As it turns out, there’s a really cool class of materials called superconductors – materials that conduct electricity with 0 resistance. If there’s no resistance, there’s no resistive loss in transmission lines. I’ll admit, I’m no expert on how exactly the superconducting phenomenon happens. What I do know is that it only happens when the given material gets really cold – we’re talking down to single digits of Kelvin. At room temperature, these materials act like your typical conductors, and only after falling below this "critical temperature" do they exhibit this superconducting property. In recent years, there have been advances and new materials discovered that operate in much more reasonable conditions. However, "high temperature" superconductors are usually thought of as materials with a critical temperature above 77K, or the temperature of liquid nitrogen. With a whole periodic table in play, is there a way that we can use machine learning techniques to accelerate new material discovery? While we may still be a far way off from discovery of materials that operate under the conditions that transmission lines do (not to mention finding a cost-effective material for such long stretches of line), there are some interesting ML techniques that can be applied to all types of material science research.
Before I get too far, you can just as easily follow along in my code, posted on my Github repo. There are more plots and code shown there; I’ll only show the highlights in this article. The data is available to the public from the UCI Machine Learning Repository [1]. I’ll be tackling a multivariate regression problem on the superconductivity dataset (available under the CC0: Public Domain license). There is some fantastic work on the dataset from the article published by Kam Hamidieh [2], and it’s worth a read for even more insights into superconductivity and the features used than what I go over here or in the accompanying notebook. So, with all that out of the way, let’s get started!
Feature Exploration
We start with 81 features, and we’re wanting to predict the critical temperature of the material. In addition, we have the chemical formula, and are able to extract the counts of each element using a handy package called chemparse. If you want to take a look at the individual distribution plots, go ahead and open up the notebook…with so many features, there’s quite a lot of information to unravel. Another thing that we can look at to start to get an idea of how the different parameters interact is looking at the correlations and some clustering. These steps are particularly important to get a good idea of the data that we’re working with, especially if it’s in an unfamiliar topic.
From this plot, we can see highly correlated and highly uncorrelated features (a value of 1 indicates exact correlation, -1 indicates a complete inverse correlation, and 0 indicates no correlation). Perhaps a first approach to scaling down the number of features that we have would be to remove highly correlated features…but we’ll keep all of them for now and see what initial results we get.
The clustering shown here is a useful way of visualizing high-dimensional data…and 81 features is certainly high-dimensional! Unfortunately, there’s no clear correlation between clusters and ranges of critical temperature, at least upon initial inspection.
Model preparation
We can always add more features that are extracted from the original features (for example, by squaring or log scaling features, or finding ratios between features). After extracting the atomic counts, though, we already have 150+ features to train a model on. There is, however, one last transformation that we want to make. Rather than setting our target as the critical temperature, we want to predict the log-transformed value. While this isn’t strictly necessary in most cases, it makes sense for our target value of a critical temperature, since we cannot physically go below 0 K. The log scaling ensures that predicted values will always be positive. Let’s take a quick look at how the distribution of our target changes with this log transformation:
With all of that preliminary work, we can quickly go through the typical stages of model preparation next, such as defining our features and target, performing a train/test split, and scaling.
# Define target and potential features
target = 'critical_temp'
features = list(df.columns)
features.remove(target)
features.remove('material')
# Input and output values
X = df[features]
y = np.log10(df[target]) # log scale forces values to always be positive and non-zero
# Train/test split
X_train_unscaled, X_test_unscaled, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
# Scale
s = MinMaxScaler()
s.fit(X_train_unscaled)
X_train = s.transform(X_train_unscaled)
X_test = s.transform(X_test_unscaled)
Regression models
We’ll look at 3 different variations on the linear regression model, then see how the performance of a gradient boosting model compares. Starting with a simple regression model, we initiate, train, and predict, then store our predictions for later comparison and plot the results.
# Initiate and fit linear regression model
lr = LinearRegression()
lr.fit(X_train, y_train)
# Predict on test data
y_pred = lr.predict(X_test)
As we can see, there is certainly a general correlation between the predicted and real values, but the R² value is quite poor. We can try a Lasso or Ridge model to help with any overfitting that might be happening, and also gain some insight into some feature importance. We’ll show the code for the Ridge regression, and check out the notebook for the Lasso regression.
# Range of alphas (regularization strength)
alphas = np.logspace(-4,3,20)
# dataframe to store coefficients
ridge_coefs = pd.DataFrame(index=alphas,columns=features)
# list to store mse
ridge_mse = []
for alpha in tqdm(alphas):
# Initiate and fit model with given alpha
ridge = Ridge(alpha=alpha)
ridge.fit(X_train, y_train)
# Save coefficients
ridge_coefs.loc[alpha] = ridge.coef_
# Predict on test data
y_pred = ridge.predict(X_test)
With the regression, we want to steadily increase the regularization term. As we increase the regularization term, the weights of each input feature decrease towards 0. This concept is primarily used to prevent overfitting, but, as long as we’ve scaled our input data, we’re also able to see the relative feature importance based on the absolute value of the weight. I have a nice interactive plot in the Jupyter notebook where you can hover over the lines to see the feature shown. Below, we can see the performance based on the regularization strength, as well as the actual performance.
Unfortunately, the results don’t look much better, and we don’t see much improvement over a simple linear regression. Before we move on, however, be sure to check out how the weights change with a Lasso regression in the notebook – you’ll see some interesting trends! (Thought question as you look at it: What’s the most important feature when you’re using all 150+ features? What’s the most important if you only had 5 features?)
While we gained some interesting insights into feature importance by looking at the feature weights, let’s turn to a more sophisticated model: gradient boosting (our particular flavor is XGBoost). I’m particularly fond of this model, not only because of its stellar performance in applications, but it was also proposed and developed at my alma mater, University of Washington (see the publication here).
# Initialize and fit model
xgb = XGBRegressor()
xgb.fit(X_train, y_train)
# Make and store predictions
y_pred = xgb.predict(X_test)
predictions['xgb_raw'] = 10**y_pred
Here are the results from the XGBoost model:
There’s always room for improvement, but given this is an out-of-box model, it looks much more promising. We’ll go ahead and take this and iterate on it.
Feature selection
The Lasso and Ridge models gave a good first indication of the importance of features, based on the weights at various regularization strengths. However, the models still performed poorly. XGBRegressor was far superior to any other regression algorithm. There are many options to look at feature importance, but the one we’ll use is recursive feature elimination (RFE). This trains the specified model (for this case, XGBRegressor) on all input features and shows the resulting prediction performance, then removes the least important feature iteratively and retrains the model on the subset of features. Ideally we’d do RFE with cross validation, but this method is already computationally expensive, so for now we’ll just use regular RFE.
# Define step size and search space
step = 5
max_features = X_train.shape[1]
min_features = 5
search_space = range(min_features,max_features,step)
# List to store MSE
rmse = []
for n_features in tqdm(search_space):
# RFE to select n_features
selector = RFE(estimator = XGBRegressor(),
n_features_to_select = n_features,
step = step)
selector.fit(X_train, y_train)
# Predict on feature subset and store prediction
y_pred = selector.predict(X_test)
rmse.append(mean_squared_error(10**y_pred,10**y_test,
squared=False))
If we wanted to get more detailed info, we could change the step so that only 1 feature is eliminated at each iteration, and we could get statistical data if we did cross validation…but each of those adds computational time. For our purposes, it’s enough to see the general trend:
As we can see, we don’t get much improvement in performance after about 20 features. We can readily see the feature rankings and print them out (take a look at the code for the full printout). What are some interesting observations that you notice?
# Display the rankings
rankings = selector.ranking_
for rank in range(1,max(rankings)):
print(rank)
features_index = [i for i, x in enumerate(rankings) if x == rank]
features_list = [features[i] for i in features_index]
print(features_list)
>>> 1
['gmean_Density', 'wtd_gmean_ElectronAffinity', 'wtd_gmean_ThermalConductivity', 'range_ThermalConductivity', 'wtd_range_ThermalConductivity', 'wtd_mean_Valence', 'wtd_gmean_Valence', 'entropy_Valence', 'wtd_std_Valence', 'Ba', 'Cu', 'Ca', 'Ti', 'Fe', 'Zn', 'Pr', 'Nb', 'Ce', 'Mg', 'Pd']
2
['mean_Density', 'Gd', 'Ni']
3
['range_ElectronAffinity', 'Bi', 'Sn', 'Si', 'Tc']
...
...
...
27
['mean_atomic_mass', 'std_Valence', 'Er', 'H', 'Tb']
28
['number_of_elements', 'Ag', 'I', 'B', 'Be']
Hyperparameter Optimization
Using our subset of the 20 most important features, we can now perform hyperparameter optimization for the best model. Recall that we used the out-of-box model for our initial results. As a side note, we could potentially include feature selection in tandem with hyperparameter optimization – in fact, this would be desirable for a final model – but it would be a lot more computationally expensive.
We’ll use the hyperopt package to perform a Bayesian optimization over the search space. I’ve written an article on it before, which you can find here, so I won’t go into too many details. Here’s a few important code snippets to define our hyperparameter search space, set up our model, and run the tests:
# Define search space for the hyperparameters of interest
space = {'eta' : hp.uniform('eta',0.01,0.3),
'gamma' : scope.int(hp.quniform('gamma',0,100,1)),
'max_depth' : scope.int(hp.quniform('max_depth',2,15,1)),
'subsample' : hp.uniform('subsample',0.5,1.0),
'lambda' : hp.loguniform('lambda', -3, 3),
'colsample_bytree' : hp.uniform('colsample_bytree',0.5,1.0)
}
def xgboost(params):
# Initialize model with hyperparameters
xgb = XGBRegressor(eta = params['eta'],
gamma = params['gamma'],
max_depth = params['max_depth'],
subsample = params['subsample'],
reg_lambda = params['lambda'],
colsample_bytree = params['colsample_bytree']
)
# Split into validation set
X_train_hyper, X_val, y_train_hyper, y_val = train_test_split(
X_train_sub, y_train, test_size=0.2, shuffle=True)
# Fit
xgb.fit(X_train_hyper, y_train_hyper)
# Predict on validation set
y_pred = xgb.predict(X_val)
# calculate validation loss
validation_loss = mean_squared_error(10**y_pred, 10**y_val,
squared=False)
return {'loss': validation_loss, 'status': STATUS_OK,
'model': xgb, 'params': params}
####################################################################
# Initiate trials and start hyperparameter search space for 250
iterations
trials = Trials()
best = fmin(xgboost, space, algo=tpe.suggest, max_evals=250, trials=trials)
# Find the best model and set of hyperparameters
best_model = trials.results[np.argmin([r['loss'] for r in trials.results])]['model']
best_params = trials.results[np.argmin([r['loss'] for r in trials.results])]['params']
The really interesting part of this is when we can plot the results based on a given hyperparameter. In some cases, we can see clearly a correlation between the performance and the selected hyperparameter value (for example, a low gamma and higher eta is ideal). For an in-depth description of these hyperparameters, see this source.
So at this point, we’ve tried out 6 different types of models: a linear regression, along with the Ridge and Lasso variations, as well as an out-of-box XGBoost using all of the 150+ features, an out-of-box XGBoost using the optimal subset of 20 features, and an XGBoost model using the optimal subset of 20 features and hyperparameter optimization (for what it’s worth, the plot also includes an XGBoost model with the worst configuration of hyperparameters, so that you can see the importance of simply changing the hyperparameters).
Identify high-performing materials
At this point, we have a model that we can use to predict critical temperature. Importantly, we determined the 20 most important features out of the 150+ possibilities, so when we get new materials, we don’t need to take endless measurements before we have a good idea of how well it might perform. That alone is a great time- and money-saver. But could we also use our knowledge to inform new materials that we might want to try? There are endless new materials to explore, but most of them won’t help us in working towards our goal of higher critical temperature superconductors.
The scenario I’m constructing is that we have about 200 potential materials, and we want to quickly find materials that have a high critical temperature, above a certain threshold (say we’re only interested in materials with critical temperature > 100K). We can either randomly start trying materials, or create some kind of selection function to find samples that have the greatest likelihood of high performance.
I started looking at some clustering algorithms, but none of them did what I wanted to do efficiently. So I came up with a ‘pseudo k-means’ selection criteria. k-means is a clustering algorithm that finds similar samples based on their features and a similarity metric. Unfortunately, most k-means implementations use Euclidean distance, which suffers from the curse of dimensionality. Even though we got our input features down from 158 to 20, it is still fairly high-dimensional. While cosine similarity also suffers in high dimensionality (especially with sparse data), it tends to work better. So the approach was to select the highest performing materials based on cosine similarity to other known high-performing materials. According to the workflow in this paper, we then evaluate the sample, then if it meets the criteria, we are finished; otherwise, we add the sample to the dataset and retrain the model. We can then evaluate how long it takes to find a sample with the desired critical temperature.
I’ll admit, the method here can use a lot of work. I mostly wanted to try it out, and perhaps at some point I could refine it. Now we have a selection function, though, so it’s time to see how it performs compared to random sampling. We try this scenario 50 times on different subsets so that we have statistical data. Ideally, we’d hope that our selection criteria would help us find high-performing materials in fewer tries.
We do see that the cosine similarity metric helps us find high-performing samples faster, given the higher peak at lower values. There is, however, clear room for improvement. Perhaps if we set our threshold higher, we would see a more dramatic improvement. This is a very nascent idea that I’m exploring, however, and I’d be really interested in hearing if you’ve seen or done anything along these lines, and how it worked for you!
Conclusion
There was a lot in this, and my hope is that it gives you some ideas in your own projects or work. While machine learning is often used as a "black box" solution to make predictions, we can utilize the principles and techniques to make informed decisions in a research lab or in materials science development. There are several aspects to this, including feature selection and importance ranking to eliminate unnecessary measurements, as well as clustering strategies to help determine potential high-performing candidates. These approaches can help save time and money as we continue to explore the vast world around us for continuing possibilities. Thanks for reading this – I’d love to hear your thoughts and ideas on the topic! Feel free to connect with me on LinkedIn.
References
[1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[2] Hamidieh, Kam, A data-driven statistical model for predicting the critical temperature of a superconductor, Computational Materials Science, Volume 154, November 2018, Pages 346–354
Share This Article
Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.
Write for TDS