VOOZH about

URL: https://towardsdatascience.com/practical-machine-learning-techniques-to-accelerate-materials-science-research-9dc9f62f33e8/

⇱ Practical Machine Learning Techniques to Accelerate Materials Science Research | Towards Data Science


Practical Machine Learning Techniques to Accelerate Materials Science Research

Predicting the Critical Temperature of Superconductors using Regression Techniques, Feature Selection, and Selection Criteria

15 min read
👁 Photo by American Public Power Association on Unsplash
Photo by American Public Power Association on Unsplash

The U.S. energy grid loses about 5% of its power due to resistive losses in its transmission lines, according to an estimate from the EIA. What if we could find a way to eliminate all of that? As it turns out, there’s a really cool class of materials called superconductors – materials that conduct electricity with 0 resistance. If there’s no resistance, there’s no resistive loss in transmission lines. I’ll admit, I’m no expert on how exactly the superconducting phenomenon happens. What I do know is that it only happens when the given material gets really cold – we’re talking down to single digits of Kelvin. At room temperature, these materials act like your typical conductors, and only after falling below this "critical temperature" do they exhibit this superconducting property. In recent years, there have been advances and new materials discovered that operate in much more reasonable conditions. However, "high temperature" superconductors are usually thought of as materials with a critical temperature above 77K, or the temperature of liquid nitrogen. With a whole periodic table in play, is there a way that we can use machine learning techniques to accelerate new material discovery? While we may still be a far way off from discovery of materials that operate under the conditions that transmission lines do (not to mention finding a cost-effective material for such long stretches of line), there are some interesting ML techniques that can be applied to all types of material science research.

Before I get too far, you can just as easily follow along in my code, posted on my Github repo. There are more plots and code shown there; I’ll only show the highlights in this article. The data is available to the public from the UCI Machine Learning Repository [1]. I’ll be tackling a multivariate regression problem on the superconductivity dataset (available under the CC0: Public Domain license). There is some fantastic work on the dataset from the article published by Kam Hamidieh [2], and it’s worth a read for even more insights into superconductivity and the features used than what I go over here or in the accompanying notebook. So, with all that out of the way, let’s get started!


Feature Exploration

We start with 81 features, and we’re wanting to predict the critical temperature of the material. In addition, we have the chemical formula, and are able to extract the counts of each element using a handy package called chemparse. If you want to take a look at the individual distribution plots, go ahead and open up the notebook…with so many features, there’s quite a lot of information to unravel. Another thing that we can look at to start to get an idea of how the different parameters interact is looking at the correlations and some clustering. These steps are particularly important to get a good idea of the data that we’re working with, especially if it’s in an unfamiliar topic.

👁 Correlations between features in the superconductor dataset. Plot by author
Correlations between features in the superconductor dataset. Plot by author

From this plot, we can see highly correlated and highly uncorrelated features (a value of 1 indicates exact correlation, -1 indicates a complete inverse correlation, and 0 indicates no correlation). Perhaps a first approach to scaling down the number of features that we have would be to remove highly correlated features…but we’ll keep all of them for now and see what initial results we get.

👁 t-SNE clustering of features, with colors indicating a range of critical temperatures. Plot by author.
t-SNE clustering of features, with colors indicating a range of critical temperatures. Plot by author.

The clustering shown here is a useful way of visualizing high-dimensional data…and 81 features is certainly high-dimensional! Unfortunately, there’s no clear correlation between clusters and ranges of critical temperature, at least upon initial inspection.

Model preparation

We can always add more features that are extracted from the original features (for example, by squaring or log scaling features, or finding ratios between features). After extracting the atomic counts, though, we already have 150+ features to train a model on. There is, however, one last transformation that we want to make. Rather than setting our target as the critical temperature, we want to predict the log-transformed value. While this isn’t strictly necessary in most cases, it makes sense for our target value of a critical temperature, since we cannot physically go below 0 K. The log scaling ensures that predicted values will always be positive. Let’s take a quick look at how the distribution of our target changes with this log transformation:

👁 Distribution of original critical temperature and log-scaled critical temperature. Plot by author.
Distribution of original critical temperature and log-scaled critical temperature. Plot by author.

With all of that preliminary work, we can quickly go through the typical stages of model preparation next, such as defining our features and target, performing a train/test split, and scaling.

# Define target and potential features
target = 'critical_temp'
features = list(df.columns)
features.remove(target)
features.remove('material')

# Input and output values
X = df[features]
y = np.log10(df[target]) # log scale forces values to always be positive and non-zero

# Train/test split
X_train_unscaled, X_test_unscaled, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)

# Scale
s = MinMaxScaler()
s.fit(X_train_unscaled)
X_train = s.transform(X_train_unscaled)
X_test = s.transform(X_test_unscaled)

Regression models

We’ll look at 3 different variations on the linear regression model, then see how the performance of a gradient boosting model compares. Starting with a simple regression model, we initiate, train, and predict, then store our predictions for later comparison and plot the results.

# Initiate and fit linear regression model
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predict on test data
y_pred = lr.predict(X_test)
👁 Linear regression results. Plot by author.
Linear regression results. Plot by author.

As we can see, there is certainly a general correlation between the predicted and real values, but the R² value is quite poor. We can try a Lasso or Ridge model to help with any overfitting that might be happening, and also gain some insight into some feature importance. We’ll show the code for the Ridge regression, and check out the notebook for the Lasso regression.

# Range of alphas (regularization strength)
alphas = np.logspace(-4,3,20)
# dataframe to store coefficients
ridge_coefs = pd.DataFrame(index=alphas,columns=features)
# list to store mse
ridge_mse = []

for alpha in tqdm(alphas):
 # Initiate and fit model with given alpha
 ridge = Ridge(alpha=alpha)
 ridge.fit(X_train, y_train)

 # Save coefficients
 ridge_coefs.loc[alpha] = ridge.coef_

 # Predict on test data
 y_pred = ridge.predict(X_test)

With the regression, we want to steadily increase the regularization term. As we increase the regularization term, the weights of each input feature decrease towards 0. This concept is primarily used to prevent overfitting, but, as long as we’ve scaled our input data, we’re also able to see the relative feature importance based on the absolute value of the weight. I have a nice interactive plot in the Jupyter notebook where you can hover over the lines to see the feature shown. Below, we can see the performance based on the regularization strength, as well as the actual performance.

👁 Accuracy of predictions based on regularization strength for Ridge regression; note the optimal performance at approximately 0.1. Plot by author.
Accuracy of predictions based on regularization strength for Ridge regression; note the optimal performance at approximately 0.1. Plot by author.
👁 Linear regression results with optimal Ridge regularization. Plot by author.
Linear regression results with optimal Ridge regularization. Plot by author.

Unfortunately, the results don’t look much better, and we don’t see much improvement over a simple linear regression. Before we move on, however, be sure to check out how the weights change with a Lasso regression in the notebook – you’ll see some interesting trends! (Thought question as you look at it: What’s the most important feature when you’re using all 150+ features? What’s the most important if you only had 5 features?)

👁 Screenshot of interactive plot of feature weights based on regularization strength for Lasso regression. Note the interesting behavior of changing weights based on regularization strength. Plot by author.
Screenshot of interactive plot of feature weights based on regularization strength for Lasso regression. Note the interesting behavior of changing weights based on regularization strength. Plot by author.

While we gained some interesting insights into feature importance by looking at the feature weights, let’s turn to a more sophisticated model: gradient boosting (our particular flavor is XGBoost). I’m particularly fond of this model, not only because of its stellar performance in applications, but it was also proposed and developed at my alma mater, University of Washington (see the publication here).

# Initialize and fit model
xgb = XGBRegressor()
xgb.fit(X_train, y_train)

# Make and store predictions
y_pred = xgb.predict(X_test)
predictions['xgb_raw'] = 10**y_pred

Here are the results from the XGBoost model:

👁 XGBoost model results. Plot by author.
XGBoost model results. Plot by author.

There’s always room for improvement, but given this is an out-of-box model, it looks much more promising. We’ll go ahead and take this and iterate on it.

Feature selection

The Lasso and Ridge models gave a good first indication of the importance of features, based on the weights at various regularization strengths. However, the models still performed poorly. XGBRegressor was far superior to any other regression algorithm. There are many options to look at feature importance, but the one we’ll use is recursive feature elimination (RFE). This trains the specified model (for this case, XGBRegressor) on all input features and shows the resulting prediction performance, then removes the least important feature iteratively and retrains the model on the subset of features. Ideally we’d do RFE with cross validation, but this method is already computationally expensive, so for now we’ll just use regular RFE.

# Define step size and search space
step = 5
max_features = X_train.shape[1]
min_features = 5
search_space = range(min_features,max_features,step)
# List to store MSE
rmse = []

for n_features in tqdm(search_space):
 # RFE to select n_features
 selector = RFE(estimator = XGBRegressor(), 
 n_features_to_select = n_features,
 step = step)
 selector.fit(X_train, y_train)
 # Predict on feature subset and store prediction
 y_pred = selector.predict(X_test)
 rmse.append(mean_squared_error(10**y_pred,10**y_test,
 squared=False))

If we wanted to get more detailed info, we could change the step so that only 1 feature is eliminated at each iteration, and we could get statistical data if we did cross validation…but each of those adds computational time. For our purposes, it’s enough to see the general trend:

👁 RMSE of predictions based on number of features, determined by recursive feature elimination. Plot by author
RMSE of predictions based on number of features, determined by recursive feature elimination. Plot by author

As we can see, we don’t get much improvement in performance after about 20 features. We can readily see the feature rankings and print them out (take a look at the code for the full printout). What are some interesting observations that you notice?

# Display the rankings
rankings = selector.ranking_
for rank in range(1,max(rankings)):
 print(rank)
 features_index = [i for i, x in enumerate(rankings) if x == rank]
 features_list = [features[i] for i in features_index]
 print(features_list)
>>> 1
['gmean_Density', 'wtd_gmean_ElectronAffinity', 'wtd_gmean_ThermalConductivity', 'range_ThermalConductivity', 'wtd_range_ThermalConductivity', 'wtd_mean_Valence', 'wtd_gmean_Valence', 'entropy_Valence', 'wtd_std_Valence', 'Ba', 'Cu', 'Ca', 'Ti', 'Fe', 'Zn', 'Pr', 'Nb', 'Ce', 'Mg', 'Pd']
2
['mean_Density', 'Gd', 'Ni']
3
['range_ElectronAffinity', 'Bi', 'Sn', 'Si', 'Tc']
...
...
...
27
['mean_atomic_mass', 'std_Valence', 'Er', 'H', 'Tb']
28
['number_of_elements', 'Ag', 'I', 'B', 'Be']

Hyperparameter Optimization

Using our subset of the 20 most important features, we can now perform hyperparameter optimization for the best model. Recall that we used the out-of-box model for our initial results. As a side note, we could potentially include feature selection in tandem with hyperparameter optimization – in fact, this would be desirable for a final model – but it would be a lot more computationally expensive.

We’ll use the hyperopt package to perform a Bayesian optimization over the search space. I’ve written an article on it before, which you can find here, so I won’t go into too many details. Here’s a few important code snippets to define our hyperparameter search space, set up our model, and run the tests:

# Define search space for the hyperparameters of interest
space = {'eta' : hp.uniform('eta',0.01,0.3),
 'gamma' : scope.int(hp.quniform('gamma',0,100,1)),
 'max_depth' : scope.int(hp.quniform('max_depth',2,15,1)),
 'subsample' : hp.uniform('subsample',0.5,1.0),
 'lambda' : hp.loguniform('lambda', -3, 3),
 'colsample_bytree' : hp.uniform('colsample_bytree',0.5,1.0)
 }
def xgboost(params):
 # Initialize model with hyperparameters
 xgb = XGBRegressor(eta = params['eta'],
 gamma = params['gamma'],
 max_depth = params['max_depth'],
 subsample = params['subsample'],
 reg_lambda = params['lambda'],
 colsample_bytree = params['colsample_bytree']
 )

 # Split into validation set
 X_train_hyper, X_val, y_train_hyper, y_val = train_test_split(
 X_train_sub, y_train, test_size=0.2, shuffle=True)

 # Fit
 xgb.fit(X_train_hyper, y_train_hyper)

 # Predict on validation set
 y_pred = xgb.predict(X_val)

 # calculate validation loss
 validation_loss = mean_squared_error(10**y_pred, 10**y_val, 
 squared=False)

 return {'loss': validation_loss, 'status': STATUS_OK, 
 'model': xgb, 'params': params}
####################################################################
# Initiate trials and start hyperparameter search space for 250 
 iterations
trials = Trials()
best = fmin(xgboost, space, algo=tpe.suggest, max_evals=250, trials=trials)
# Find the best model and set of hyperparameters
best_model = trials.results[np.argmin([r['loss'] for r in trials.results])]['model']
best_params = trials.results[np.argmin([r['loss'] for r in trials.results])]['params']

The really interesting part of this is when we can plot the results based on a given hyperparameter. In some cases, we can see clearly a correlation between the performance and the selected hyperparameter value (for example, a low gamma and higher eta is ideal). For an in-depth description of these hyperparameters, see this source.

👁 MSE values for predicted critical temperature, based on hyperparameter value. The orange value is the optimal one. Plot by author.
MSE values for predicted critical temperature, based on hyperparameter value. The orange value is the optimal one. Plot by author.

So at this point, we’ve tried out 6 different types of models: a linear regression, along with the Ridge and Lasso variations, as well as an out-of-box XGBoost using all of the 150+ features, an out-of-box XGBoost using the optimal subset of 20 features, and an XGBoost model using the optimal subset of 20 features and hyperparameter optimization (for what it’s worth, the plot also includes an XGBoost model with the worst configuration of hyperparameters, so that you can see the importance of simply changing the hyperparameters).

👁 RMSE performance of the various types of models trained. Plot by author
RMSE performance of the various types of models trained. Plot by author

Identify high-performing materials

At this point, we have a model that we can use to predict critical temperature. Importantly, we determined the 20 most important features out of the 150+ possibilities, so when we get new materials, we don’t need to take endless measurements before we have a good idea of how well it might perform. That alone is a great time- and money-saver. But could we also use our knowledge to inform new materials that we might want to try? There are endless new materials to explore, but most of them won’t help us in working towards our goal of higher critical temperature superconductors.

The scenario I’m constructing is that we have about 200 potential materials, and we want to quickly find materials that have a high critical temperature, above a certain threshold (say we’re only interested in materials with critical temperature > 100K). We can either randomly start trying materials, or create some kind of selection function to find samples that have the greatest likelihood of high performance.

I started looking at some clustering algorithms, but none of them did what I wanted to do efficiently. So I came up with a ‘pseudo k-means’ selection criteria. k-means is a clustering algorithm that finds similar samples based on their features and a similarity metric. Unfortunately, most k-means implementations use Euclidean distance, which suffers from the curse of dimensionality. Even though we got our input features down from 158 to 20, it is still fairly high-dimensional. While cosine similarity also suffers in high dimensionality (especially with sparse data), it tends to work better. So the approach was to select the highest performing materials based on cosine similarity to other known high-performing materials. According to the workflow in this paper, we then evaluate the sample, then if it meets the criteria, we are finished; otherwise, we add the sample to the dataset and retrain the model. We can then evaluate how long it takes to find a sample with the desired critical temperature.

I’ll admit, the method here can use a lot of work. I mostly wanted to try it out, and perhaps at some point I could refine it. Now we have a selection function, though, so it’s time to see how it performs compared to random sampling. We try this scenario 50 times on different subsets so that we have statistical data. Ideally, we’d hope that our selection criteria would help us find high-performing materials in fewer tries.

👁 Number of samples tried before finding a material with a critical temperature above a given threshold (100 K), using both the cosine similarity search criteria and a randomized search. Plot by author.
Number of samples tried before finding a material with a critical temperature above a given threshold (100 K), using both the cosine similarity search criteria and a randomized search. Plot by author.

We do see that the cosine similarity metric helps us find high-performing samples faster, given the higher peak at lower values. There is, however, clear room for improvement. Perhaps if we set our threshold higher, we would see a more dramatic improvement. This is a very nascent idea that I’m exploring, however, and I’d be really interested in hearing if you’ve seen or done anything along these lines, and how it worked for you!

Conclusion

There was a lot in this, and my hope is that it gives you some ideas in your own projects or work. While machine learning is often used as a "black box" solution to make predictions, we can utilize the principles and techniques to make informed decisions in a research lab or in materials science development. There are several aspects to this, including feature selection and importance ranking to eliminate unnecessary measurements, as well as clustering strategies to help determine potential high-performing candidates. These approaches can help save time and money as we continue to explore the vast world around us for continuing possibilities. Thanks for reading this – I’d love to hear your thoughts and ideas on the topic! Feel free to connect with me on LinkedIn.

References

[1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

[2] Hamidieh, Kam, A data-driven statistical model for predicting the critical temperature of a superconductor, Computational Materials Science, Volume 154, November 2018, Pages 346–354


Written By

Nicholas Lewis

Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.

Write for TDS

Related Articles