Powerful Data Analysis and Plotting via Natural Language Requests by Giving LLMs Access to…
Getting LLMs to analyze and plot data for you, right in your web browser
Introduction: Automating Data Analysis With Large Language Models
Recently I started exploring how to use large language models (LLMs) to automate data analysis, such that you can ask them questions about a dataset in natural language and they answer by generating and running code. I implemented all this as a web app so that I (and you!) could try out the power and limitations of this approach, which at that point relied entirely on the model writing vanilla JavaScript:
Exploring Data Analysis Via Natural Language, Using LLMs—Approach 1
As I explain in that article, my main interest is addressing this question:
Can I ask an LLM questions about a dataset with my own words and have it interpret these questions with the maths or scripting required to answer them?
After the several tests reported there, I pinpointed several limitations that, honestly, preclude the most interesting applications. Namely, while both GPT-3.5-turbo and GPT-4 demonstrate an understanding of user queries and can generate proper code for various data analysis tasks, challenges arise when dealing with complex mathematical operations and requests of certain complexity. And I’m not talking about very high complexity; for example, both LLMs could produce correct code to run linear regression but failed at quadratic fits, and were just totally lost when trying to implement procedures such as principal components analysis (PCA). What’s worse is that they often wouldn’t even "realize" that the task was too much for them, hallucinating code that looked OK on a quick pass (for example, the PCA procedure tried to invoke singular value decomposition, SVD) and sometimes didn’t even crash, yet was plain wrong.
Notably, and relevant to this new article, sometimes the LLMs created code that tried to invoke functions that didn’t exist at all. For the challenging tasks above, for example, some runs attempted to call functions such as polyfit(x,y,2) when asking for quadratic fits or SVD(m) or PCA(m) when asking for PCA. Since such functions didn’t exist in the environment, these programs would just crash.
This observation sparked the core idea presented here, that is to actually provide the LLMs with:
- Access to libraries that it can exploit, explaining which specific library is available and providing example uses in the prompt.
- Customized functions provided in the environment of the program itself, explaining and exemplifying in the prompt how to use them.
Now in this new article, then, I refine my main question to this:
Can I ask an LLM questions about a dataset with my own words and have it interpret these questions with the maths or scripting required to answer them, assisted by libraries and customized functions?
(Notice the "assisted by libraries and customized functions" that I now added).
Short answer: Yes, and you get an extremely powerful tool! (available on the web for you to try).
Long answer: Yes, and you get an extremely powerful tool that can for example run advanced fitting and optimization problems, plot graphics, and much more (quadratic fits and PCA included :)). Faithful to my style, with code and web apps for you to try online.
Read on to learn exactly how this all works, in a project that entails a lot of programming, especially JavaScript, and API calls, LLMs, LLM prompt optimization, and more. Or simply to go use this web app online.
Choosing a JavaScript Library for Mathematics Compatible With OpenAI’s Language Models
My first step was to explore, non-programmatically, what JavaScript libraries I could tell the program to use.
And good that I did this! Why? Because I really like a library called LALOLIB (some info about it here), but it turned out that OpenAI’s models don’t even know about it. On the contrary, over a conversation with ChatGPT using GPT-3.5-turbo I could find out that this LLM does know about another good library called numeric.js, and it even knows how to use it to solve problems:
Me: Hello, do you know lalolib, the JS library for maths?
ChatGPT (GPT-3.5-turbo): As of my last knowledge update in January 2022, I don’t have specific information about a JavaScript library named "lalolib" for mathematics. It’s possible that this library was created or gained popularity after that date, or it may not be widely known.
[…]
Me: And do you know numeric.js?
ChatGPT (GPT-3.5-turbo): As of my last knowledge update in January 2022, Numeric JavaScript (numeric.js) is a library for numerical computing in JavaScript. It provides functionality for working with matrices, linear algebra operations, and other mathematical computations. The library is designed to bring numerical computing capabilities to web applications and environments where JavaScript is used.
Here are some key points about numeric.js:
Matrix Operations: numeric.js allows you to perform various matrix operations, such as matrix multiplication, inversion, eigenvalue decomposition, and more.
Linear Algebra: The library provides functions for solving linear systems of equations, least squares problems, and singular value decomposition.
Integration and Differentiation: Numeric.js supports numerical integration and differentiation, which are common tasks in scientific computing.
Documentation: The library is typically accompanied by documentation that outlines the available functions and how to use them. This documentation can often be found on the library’s official website or on its GitHub repository.
[…]
Me: Let’s see, how do you do a fit to say a quadratic function using numeric.js?
ChatGPT (GPT-3.5-turbo): To perform a quadratic fit using numeric.js in JavaScript, you can use the numeric.uncmin function along with the appropriate quadratic model. Here’s a basic example:
Assuming you have an array of x-values (xData) and corresponding y-values (yData), you can define a quadratic model function, set up the optimization problem, and use numeric.uncmin to find the coefficients that minimize the sum of squared differences between the predicted and actual y-values.
Here’s a sample code snippet:
// Sample data const xData = [1, 2, 3, 4, 5]; const yData = [3.1, 7.8, 12.4, 21.5, 30.2];
[…correct piece of code follows…]
In this example, […explanation of the code follows…]
Wow, that’s very promising!
If you recall from my previous article, both GPT-3.5-turbo and GPT-4 struggled with fitting x to y to a second-order polynomial. Here, it seems like GPT-3.5-turbo knows how to use numeric.js’ uncmin, a function for unconstrained minimization, to solve this problem. So voilà as the French speakers say, I decided to incorporate numeric.js into my web app for automated data analysis.
As a reminder, both models were excellent at grasping the essence of the user’s question or request and could write code for easy questions, but they struggled with somewhat complex procedures: linear fits worked, quadratic fits didn’t. The test above shows that GPT-3.5-turbo (hence probably also GPT-4) can tackle the quadratic fit by using numeric.js’ uncmin. Similarly, I expect the LLMs can use numeric.js’ built-in functions for other data analysis tasks. Let’s now see how to put the approach to work inside a web app, and then let’s put it all to the test.
Putting Together the Approach Using numeric.js
The modifications I had to make to the prototype from my first (simpler) approach, described in the previous article, were actually minimal:
- Source numeric.js in the page’s head.
- Tell the LLM in the prompt that numeric.js is at its disposal in the environment.
To source numeric.js I simply added this line:
<script src="https://cdnjs.cloudflare.com/ajax/libs/numeric/1.2.6/numeric.min.js"></script>
To tell the LLM that numeric.js is available for use, I inserted this sentence in the System part of the prompt:
"You have numeric.js at hand as a library to simplify mathematical operations."
I didn’t provide any examples of question-script pairs illustrating the use of numeric.js in the prompt, yet as you will see the approach worked, and quite well. There is probably room for improvement there, and adding examples of how to use the library might help the system perform even better.
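The article does not show how the web app executes the scripts returned by the LLM, so the following is only a minimal sketch under assumptions: the generated script reads a variable called datatable and ends with "return answer;" (as in the transcripts in this article), and a hypothetical helper named runGeneratedCode wraps it with new Function. The try/catch turns crashes, such as calls to functions that don't exist, into a readable message.

```javascript
// Hypothetical helper (not the app's actual code): runs an LLM-generated
// script against the data table and never lets an exception escape.
function runGeneratedCode(code, datatable) {
  try {
    // new Function wraps the script in a function body, so its final
    // `return answer;` is legal and `datatable` is visible as a parameter.
    const fn = new Function('datatable', code);
    return { ok: true, answer: String(fn(datatable)) };
  } catch (err) {
    // Calls to functions that don't exist (e.g. polyfit) end up here as a
    // ReferenceError; syntax errors in the script are caught here as well.
    return { ok: false, answer: err.name + ': ' + err.message };
  }
}

// Usage with a tiny made-up table (2 rows x 2 columns).
const demoTable = [[1, 2], [3, 4]];
const good = runGeneratedCode(
  'var answer = "Rows: " + datatable.length; return answer;', demoTable);
const bad = runGeneratedCode(
  'var c = polyfit([1, 2], [3, 4], 2); return c;', demoTable);
console.log(good.answer);
console.log(bad.answer);
```

Keep in mind that new Function executes whatever the model wrote with the full privileges of the page, so this pattern only suits prototypes where you trust the input.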
Let’s see some examples of requests that OpenAI’s LLMs solve by applying numeric.js, and let’s then see how to build from these results. I will show you here first the LLMs’ answers to questions that didn’t work with the first approach, where the app/LLM lacked any access to libraries for "advanced" maths: quadratic fits and PCA.
Quadratic fits
Starting with the same request that failed both on GPT-3.5-turbo and GPT-4 when no library was made available:
> User: run a quadratic fit of sample A in the X axis vs sample J in the y axis
> StatBot: – – – [fun fact: I gave this web app a tentative name: "StatBot"!]
var nColumns = datatable[0].length;
var nRows = datatable.length;
var columnNames = 'sample A, sample B, sample X, sample Y, sample J';
var rowNames = 'Variable 1, Variable 2, Variable 3, Variable 5, Variable 7, Variable 8, Variable 9, Variable 10';
// Find the indices of sample A and sample J
var columnNamesArr = columnNames.split(', ');
var sampleAIndex = columnNamesArr.indexOf('sample A');
var sampleJIndex = columnNamesArr.indexOf('sample J');
// Create arrays to store the x and y values
var xValues = [];
var yValues = [];
// Populate the x and y arrays with the corresponding values from the datatable
for (var r = 0; r < nRows; r++) {
xValues.push(datatable[r][sampleAIndex]);
yValues.push(datatable[r][sampleJIndex]);
}
// Run a quadratic fit using numeric.js
var coefficients = numeric.uncmin(function(coef) {
var sum = 0;
for (var i = 0; i < xValues.length; i++) {
var x = xValues[i];
var y = yValues[i];
var predictedY = coef[0] + coef[1] * x + coef[2] * x * x;
sum += Math.pow(predictedY - y, 2);
}
return sum;
}, [0, 0, 0]).solution;
// Format the coefficients as a string
var answer = "The quadratic fit for sample A in the X axis vs sample J in the Y axis is: y = " + coefficients[0].toFixed(2) + " + " + coefficients[1].toFixed(2) + "x + " + coefficients[2].toFixed(2) + "x^2";
return answer;
> StatBot: Running code…
> StatBot: The quadratic fit for sample A in the X axis vs sample J in the Y axis is: y = 5.54 + 1.40x + -0.09x²
Now let’s compare this with the output of running the same polynomial regression in a data analysis program:
Got it!
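If you don't have a data analysis program at hand, an independent check can also be written in a few lines. The sketch below is not what the LLM generated: instead of numeric.uncmin it solves the least-squares normal equations directly in plain JavaScript. The function names (solveLinear, quadraticFit) and the test data are made up for illustration.

```javascript
// Solve the linear system A * x = b by Gaussian elimination with partial
// pivoting. Assumes A is square and non-singular (no check is made).
function solveLinear(A, b) {
  const n = b.length;
  const M = A.map((row, i) => row.concat(b[i])); // augmented copy
  for (let col = 0; col < n; col++) {
    // Pick the row with the largest pivot to limit rounding error.
    let pivot = col;
    for (let r = col + 1; r < n; r++) {
      if (Math.abs(M[r][col]) > Math.abs(M[pivot][col])) pivot = r;
    }
    const swapped = M[col];
    M[col] = M[pivot];
    M[pivot] = swapped;
    // Eliminate the entries below the pivot.
    for (let r = col + 1; r < n; r++) {
      const factor = M[r][col] / M[col][col];
      for (let k = col; k <= n; k++) M[r][k] -= factor * M[col][k];
    }
  }
  // Back substitution.
  const x = new Array(n).fill(0);
  for (let r = n - 1; r >= 0; r--) {
    let s = M[r][n];
    for (let k = r + 1; k < n; k++) s -= M[r][k] * x[k];
    x[r] = s / M[r][r];
  }
  return x;
}

// Least-squares fit of y = c0 + c1*x + c2*x^2 via the normal equations.
// Needs at least three distinct x values.
function quadraticFit(xs, ys) {
  const A = [[0, 0, 0], [0, 0, 0], [0, 0, 0]];
  const b = [0, 0, 0];
  for (let i = 0; i < xs.length; i++) {
    const powers = [1, xs[i], xs[i] * xs[i]];
    for (let p = 0; p < 3; p++) {
      b[p] += powers[p] * ys[i];
      for (let q = 0; q < 3; q++) A[p][q] += powers[p] * powers[q];
    }
  }
  return solveLinear(A, b);
}

// Made-up data generated from y = 1 + 2x + 3x^2, so the fit should
// recover coefficients close to [1, 2, 3].
const fitCoefficients = quadraticFit([0, 1, 2, 3, 4], [1, 6, 17, 34, 57]);
console.log(fitCoefficients);
```

Feeding the same two columns to this function and to the LLM-generated script should give matching coefficients up to the optimizer's tolerance.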
Now let’s see if the LLM coupled with numeric.js can handle the fact that a quadratic function might intercept a y value either twice or not at all. Let’s see the first option:
> User: run a quadratic fit of sample A in the X axis vs sample J in the y axis and then use this equation to tell me the value of x for which y=2
> StatBot:
var nColumns = datatable[0].length;
var nRows = datatable.length;
var columnNames = 'sample A, sample B, sample X, sample Y, sample J';
var rowNames = 'Variable 1, Variable 2, Variable 3, Variable 5, Variable 7, Variable 8, Variable 9, Variable 10';
// Find the indices of sample A and sample J
var sampleAIndex = columnNames.split(', ').indexOf('sample A');
var sampleJIndex = columnNames.split(', ').indexOf('sample J');
// Create arrays to store the x and y values
var xValues = [];
var yValues = [];
// Populate the x and y arrays with the corresponding values from the datatable
for (var r = 0; r < nRows; r++) {
xValues.push(datatable[r][sampleAIndex]);
yValues.push(datatable[r][sampleJIndex]);
}
// Run a quadratic fit using numeric.js
var coefficients = numeric.uncmin(function(coef) {
var a = coef[0];
var b = coef[1];
var c = coef[2];
var residuals = yValues.map(function(y, i) {
var x = xValues[i];
return y - (a * x * x + b * x + c);
});
return numeric.norm2(residuals);
}, [0, 0, 0]);
// Extract the coefficients from the result
var a = coefficients.solution[0];
var b = coefficients.solution[1];
var c = coefficients.solution[2];
// Solve the quadratic equation for y = 2
var discriminant = b * b - 4 * a * (c - 2);
var x1 = (-b + Math.sqrt(discriminant)) / (2 * a);
var x2 = (-b - Math.sqrt(discriminant)) / (2 * a);
// Return the answer as a string
var answer = "The value of x for which y = 2 is " + x1.toString() + " or " + x2.toString();
return answer;
> StatBot: Running code…
> StatBot: The value of x for which y = 2 is -2.2237250258055816 or 18.39907145383322
Now let’s see what happens if we ask for the value of x for which y is say 20 (as you see in the plot, no real value makes the function take that value):
The LLM writes exactly the same code, but when it runs it returns NaN (coming from the square root of a negative number):
> StatBot: The value of x for which y = 20 is NaN
Now, if I warn the LLM that there might be no solution, the program checks this and reacts accordingly:
> User: run a quadratic fit of sample A in the X axis vs sample J in the y axis and then use this equation to tell me the value of x for which y=20 (consider the possibility that no real number satisfies this!)
> StatBot:
var nColumns = datatable[0].length;
var nRows = datatable.length;
var columnNames = 'sample A, sample B, sample X, sample Y, sample J';
var rowNames = 'Variable 1, Variable 2, Variable 3, Variable 5, Variable 7, Variable 8, Variable 9, Variable 10';
// Extract the data for sample A and sample J
var sampleAData = [];
var sampleJData = [];
for (var r = 0; r < nRows; r++) {
for (var c = 0; c < nColumns; c++) {
if (columnNames.split(', ')[c] === 'sample A') {
sampleAData.push(datatable[r][c]);
}
if (columnNames.split(', ')[c] === 'sample J') {
sampleJData.push(datatable[r][c]);
}
}
}
// Define the quadratic function
function quadraticFunc(coefficients, x) {
return coefficients[0] * x * x + coefficients[1] * x + coefficients[2];
}
// Define the objective function for uncmin
function objectiveFunc(coefficients) {
var sum = 0;
for (var i = 0; i < sampleAData.length; i++) {
var predictedY = quadraticFunc(coefficients, sampleAData[i]);
sum += Math.pow(predictedY - sampleJData[i], 2);
}
return sum;
}
// Initial guess for the coefficients
var initialCoefficients = [1, 1, 1];
// Run the quadratic fit using uncmin
var result = numeric.uncmin(objectiveFunc, initialCoefficients);
// Check if a real number satisfies y = 20
var xForY20 = null;
if (result.found) {
var coefficients = result.solution;
var discriminant = coefficients[1] * coefficients[1] - 4 * coefficients[0] * (coefficients[2] - 20);
if (discriminant >= 0) {
var x1 = (-coefficients[1] + Math.sqrt(discriminant)) / (2 * coefficients[0]);
var x2 = (-coefficients[1] - Math.sqrt(discriminant)) / (2 * coefficients[0]);
if (x1 >= Math.min.apply(null, sampleAData) && x1 <= Math.max.apply(null, sampleAData)) {
xForY20 = x1;
}
if (x2 >= Math.min.apply(null, sampleAData) && x2 <= Math.max.apply(null, sampleAData)) {
xForY20 = x2;
}
}
}
// Construct the answer string
var answer = "The value of x for which y = 20 is ";
if (xForY20 !== null) {
answer += xForY20.toString();
} else {
answer += "no real number satisfies this condition";
}
return answer;
> StatBot: Running code…
> StatBot: The value of x for which y = 20 is no real number satisfies this condition
Well, not the best answer, but at least it did write code to handle the possibility that the argument of the square root is negative!
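The discriminant check at the heart of this can be isolated in a small helper, which is also the kind of function one could hand to the LLM. This is a sketch with a made-up name (solveQuadraticForY); the coefficients in the usage lines are the rounded values from the fit reported above, so the roots differ slightly from the app's output.

```javascript
// Real solutions of a*x^2 + b*x + c = yTarget.
// Returns an array with 0, 1 or 2 values in ascending order, never NaN.
function solveQuadraticForY(a, b, c, yTarget) {
  if (a === 0) {
    // Degenerate case: the "quadratic" is actually a line (or a constant).
    return b === 0 ? [] : [(yTarget - c) / b];
  }
  const discriminant = b * b - 4 * a * (c - yTarget);
  if (discriminant < 0) return []; // the curve never reaches yTarget
  if (discriminant === 0) return [-b / (2 * a)];
  const root = Math.sqrt(discriminant);
  return [(-b - root) / (2 * a), (-b + root) / (2 * a)].sort((p, q) => p - q);
}

// Rounded coefficients of the fit above: y = 5.54 + 1.40x - 0.09x^2
const fitA = -0.09, fitB = 1.40, fitC = 5.54;
const rootsForY2 = solveQuadraticForY(fitA, fitB, fitC, 2);
const rootsForY20 = solveQuadraticForY(fitA, fitB, fitC, 20);
console.log(rootsForY2.length, rootsForY20.length); // 2 0
```

Returning an empty array instead of NaN lets the calling code phrase the "no real solution" case explicitly.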
PCA and a new avenue
Principal components analysis was the other big thing with which the first approach failed, not even coming close.
In my previous article I reported that the LLMs totally struggled at carrying out PCA, trying things that sounded right but were wrong and/or involved calling functions that didn’t exist in the environment. In the best case, GPT-4 replied that computing PCA was too complex for its scope.
Now, I tried to have the LLMs understand my requests for PC analysis and use numeric.js at will to sort it out. Both GPT-3.5-turbo and GPT-4 understood that they could normalize data, calculate the covariance matrix, and then run singular value decomposition (SVD) via numeric.js. However, all the code they wrote ended up failing, for different reasons.
So, no, the LLMs don’t seem to be capable of computing PCA by themselves, not even when complex procedures such as SVD are sorted out via an external library. Of course, it might be that it’s all a matter of prompting, but if so, I couldn’t figure it out.
Adding customized functions
Despite a negative result, this observation prompted me to work on another important addition, one that I’ve been considering already from the early ideas of building LLM-based data analysis web apps. I’m talking about building into the web app customized functions and telling the LLM about them in the prompt.
So hands-on to apply this idea to performing PCA, I took a previous article of mine about PCA in JavaScript:
And I wrote a function based on it, which I incorporated into the web app. Here’s the function, which goes right into the body of my web app (important so that it can be accessed by other pieces of code in the environment, including the code generated by the LLM):
function pca(matrixin) {
//Step 1: setup matrices and variables
var matrix = matrixin
var nvariables = matrix.length
var nobjects = matrix[0].length
var i, j, k //loop counters declared locally so they don't leak as globals
//Mean-center the matrix
var averages=[]
for (i=0;i<nvariables;i++) {
var tmp=0
for (j=0;j<nobjects;j++) {
tmp=tmp+matrix[i][j]
}
averages.push(tmp/nobjects)
}
var X = [];
for (i=0;i<nvariables;i++) {
var tmp=[]
for (j=0;j<nobjects;j++) {
tmp.push(matrix[i][j] - averages[i])
}
X.push(tmp)
}
//Step 2: get covariance matrix
var covar = []
for (i=0;i<nvariables;i++) {
var tmp = []
for (j=0;j<nvariables;j++) {
tmp.push(0)
}
covar.push(tmp)
}
for (i=0;i<nvariables;i++) {
for (j=0;j<nvariables;j++) {
for (k=0;k<nobjects;k++) {
covar[i][j] = covar[i][j] + X[i][k]*X[j][k]
}
covar[i][j] = covar[i][j] / nobjects
}
}
// Step 3: Perform SVD
const svdResult = numeric.svd(covar);
console.log(svdResult); //to know if SVD worked
//Step 4: Project data
var projection = []
var pc1 = 0, pc2 = 0
var UU = svdResult.U
var latent = svdResult.S
for (i=0; i<nobjects; i++) {
pc1=0
pc2=0
for (j=0;j<nvariables;j++) {
pc1 = pc1 + UU[j][0] * X[j][i]
pc2 = pc2 + UU[j][1] * X[j][i]
}
var tmp = []
tmp.push(pc1)
tmp.push(pc2)
projection.push(tmp)
}
console.log(projection) //to check it quickly in the console
return {
UU, //loadings for all variables
latent, //latent
projection, //projections of orig data on PC1 and PC2
};
}
Of course since the LLM doesn’t know about the existence of this function, we must pass on this information in the prompt. In doing so, I take the chance to also explain how to use the function.
Here’s the sentence I added to the prompt:
"You have direct access to some functions such as pca() which takes a matrix of nRows variables x nColumns objects like datatable and returns the loadings for all variables, the % contribution to variation for each principal component, and projections of the original data onto PC1 and PC2, as result.UU, result.latent and result.projection, respectively."
Note how the added sentence explains not only that the function (pca()) is available for use, but also that it takes a matrix of a given shape, with variables in the rows and objects in the columns, and that it returns its outputs inside an object called result: the loadings for all variables in result.UU, the % contribution to variation for each principal component in result.latent, and the projections of the original data onto PC1 and PC2 in result.projection. These explanations are critical, because otherwise the LLM has no way of knowing how you’ve implemented the function.
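One detail worth keeping in mind: in the pca() function listed above, latent is simply svdResult.S, that is, the singular values of the covariance matrix (its eigenvalues, i.e. the variance along each principal component), not percentages. If you want the values to match the "% contribution" wording of the prompt, a small conversion is enough. The helper name percentVariance and the numbers below are made up for illustration.

```javascript
// Convert the variance along each principal component into a percentage
// of the total variance.
function percentVariance(latent) {
  const total = latent.reduce((sum, v) => sum + v, 0);
  return latent.map(v => (100 * v) / total);
}

// Made-up variances for three components.
const demoLatent = [6, 3, 1];
console.log(percentVariance(demoLatent)); // [ 60, 30, 10 ]
```

The conversion could be applied inside pca() before returning, or the prompt sentence could be reworded to say that result.latent holds raw variances.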
Here you can see the program running. If I simply ask it to "Run PCA" the procedure works, but the output is a bit messy as it contains all the outputs and lacks any formatting. So, for the example shown below I asked "Run PCA and display an HTML table of PC1 and PC2 for all 5 samples.":
Amazing, isn’t it!
Note how it smoothly calls the custom pca() function and also how smoothly it extracts the principal components from the array result.projection[].
This specific example proves that it is possible to provide your LLM access to any function or procedure you may need. This has two advantages over relying only on the LLM or even over relying on the LLM using a library:
- You can overcome the LLM’s intrinsic limitations, going as complex as you want (as long as you explain to the LLM how to use your function in the prompt).
- You have full control over the code that runs, only using the LLM as a high-level organizer of the information flow and processing/data analysis steps.
More broadly, the possibility of providing personalized functions opens a door to chatbot systems that can perform highly domain-specific and complex analyses.
Trying a library for data plotting
Data analysis as we’ve seen it so far is cool, but it is quite limited and narrow in scope if we can’t display graphics.
Except that, well… we can!
How? Simply by sourcing a JS library for graphics and telling the LLM that it is available in the environment, just as we did with numeric.js.
Here I tested the use of Google Charts, a library for interactive data plotting that might not be the most powerful one but is certainly flexible, free, simple to use and to write code for, and (of utmost importance) well known by OpenAI’s LLMs.
Again, the additions to the web app’s source code are quite small. First, to source the library I just added this to the page’s head:
<script type="text/javascript" src="https://www.gstatic.com/charts/loader.js"></script>
Second, the first line of my JavaScript code, which runs when the webpage loads, loads the Google Charts library so that it is up and running by the time the code generated by the LLM needs it:
google.charts.load('current', {'packages':['corechart']});
Then, in the System part of the prompt I added this sentence to tell the LLM that it can use the library, also taking the chance to tell it which HTML div element to use for plot display:
"For graphics, you have Google charts at hand and you can put plots in a div called chart-div."
Testing automated data plotting
This is getting exciting!
See how easily I can tell the program to do a scatter plot comparing two columns of my table against a third column, indicating which one I want on each axis and what colors I want to see:
Here’s the conversation in detail:
> User: do a scatter plot ok sample A on the x axis vs samples B and J on the y axis with B in red and J in green
> StatBot:
var nColumns = datatable[0].length;
var nRows = datatable.length;
var columnNames = 'sample A, sample B, sample X, sample Y, sample J, ';
var rowNames = 'Variable 1, Variable 2, Variable 3, Variable 5, Variable 7, Variable 8, Variable 9, Variable 10';
// Create an array to store the data points for sample B
var dataB = [];
// Create an array to store the data points for sample J
var dataJ = [];
// Loop through the rows and columns to populate the data arrays
for (var r = 0; r < nRows; r++) {
for (var c = 0; c < nColumns; c++) {
// Check if the column name is 'sample A', 'sample B', or 'sample J'
if (columnNames.split(', ')[c] === 'sample A') {
// Get the x value from sample A column
var x = datatable[r][c];
// Get the y value from sample B column
var yB = datatable[r][columnNames.split(', ').indexOf('sample B')];
// Get the y value from sample J column
var yJ = datatable[r][columnNames.split(', ').indexOf('sample J')];
// Create an object with x and y values for sample B
var pointB = {
x: x,
y: yB
};
// Create an object with x and y values for sample J
var pointJ = {
x: x,
y: yJ
};
// Add the point to the data arrays
dataB.push(pointB);
dataJ.push(pointJ);
}
}
}
// Create a scatter plot using Google Charts
google.charts.load('current', {'packages':['corechart']});
google.charts.setOnLoadCallback(drawChart);
function drawChart() {
var chartData = new google.visualization.DataTable();
chartData.addColumn('number', 'Sample A');
chartData.addColumn('number', 'Sample B');
chartData.addColumn('number', 'Sample J');
chartData.addRows(dataB.length);
// Populate the chart data with the data arrays for sample B and sample J
for (var i = 0; i < dataB.length; i++) {
chartData.setCell(i, 0, dataB[i].x);
chartData.setCell(i, 1, dataB[i].y);
chartData.setCell(i, 2, dataJ[i].y);
}
var options = {
title: 'Scatter Plot of Sample A vs Sample B and J',
hAxis: {title: 'Sample A'},
vAxis: {title: 'Sample B and J'},
legend: 'none',
colors: ['red', 'green']
};
var chart = new google.visualization.ScatterChart(document.getElementById('chart-div'));
chart.draw(chartData, options);
}
var answer = "Scatter plot of Sample A on the x-axis vs Sample B and J on the y-axis has been created.";
return answer;
> StatBot: Running code…
> StatBot: Scatter plot of Sample A on the x-axis vs Sample B and J on the y-axis has been created. (and on the side you see the scatter plots shown above)
Moving on to a slightly more difficult example, let’s now make a mixed request asking for some calculations and then plotting of the results:
Here’s the full conversation:
> User: for each sample, get the average and standard deviation through all variables, and do a vertical bar plot showing averages and standard deviations as error bars
> StatBot:
var nColumns = datatable[0].length;
var nRows = datatable.length;
var columnNames = 'sample A, sample B, sample X, sample Y, sample J, ';
var rowNames = 'Variable 1, Variable 2, Variable 3, Variable 5, Variable 7, Variable 8, Variable 9, Variable 10';
// Create an array to store the averages and standard deviations
var averages = [];
var standardDeviations = [];
// Loop through the columns to calculate the averages and standard deviations
for (var c = 0; c < nColumns; c++) {
var sum = 0;
var sumSquared = 0;
for (var r = 0; r < nRows; r++) {
sum += datatable[r][c];
sumSquared += Math.pow(datatable[r][c], 2);
}
var average = sum / nRows;
var variance = (sumSquared / nRows) - Math.pow(average, 2);
var standardDeviation = Math.sqrt(variance);
averages.push(average);
standardDeviations.push(standardDeviation);
}
// Create a data table for the bar plot
var data = new google.visualization.DataTable();
data.addColumn('string', 'Sample');
data.addColumn('number', 'Average');
data.addColumn({type: 'number', role: 'interval'});
data.addColumn({type: 'number', role: 'interval'});
data.addRows(nColumns);
// Populate the data table with the averages and standard deviations
for (var c = 0; c < nColumns; c++) {
data.setCell(c, 0, columnNames.split(', ')[c]);
data.setCell(c, 1, averages[c]);
data.setCell(c, 2, averages[c] - standardDeviations[c]);
data.setCell(c, 3, averages[c] + standardDeviations[c]);
}
// Create options for the bar plot
var options = {
title: 'Averages and Standard Deviations for Each Sample',
hAxis: {title: 'Sample'},
vAxis: {title: 'Value'},
legend: 'none',
bars: 'vertical',
errorBars: 'both'
};
// Create a bar plot using Google Charts
var chart = new google.visualization.ColumnChart(document.getElementById('chart-div'));
chart.draw(data, options);
// Create the answer string
var answer = "A vertical bar plot showing the averages and standard deviations for each sample has been created.";
return answer;
> StatBot: Running code…
> StatBot: A vertical bar plot showing the averages and standard deviations for each sample has been created.
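Note that the generated script divides by nRows, so its error bars show the population standard deviation, and it uses the one-pass formula (mean of squares minus squared mean), which can lose precision when values are large relative to their spread. If you want to check the bars independently, or prefer the sample standard deviation, a two-pass version is straightforward. This is a sketch with a made-up helper name (describe) and made-up numbers.

```javascript
// Mean plus both flavors of standard deviation, computed in two passes:
// first the mean, then the squared deviations from it.
function describe(values) {
  const n = values.length;
  const mean = values.reduce((s, v) => s + v, 0) / n;
  const sumSq = values.reduce((s, v) => s + (v - mean) * (v - mean), 0);
  return {
    mean: mean,
    populationSD: Math.sqrt(sumSq / n),               // divides by n
    sampleSD: n > 1 ? Math.sqrt(sumSq / (n - 1)) : NaN // divides by n - 1
  };
}

// Made-up column of eight values.
const stats = describe([2, 4, 4, 4, 5, 5, 7, 9]);
console.log(stats.mean, stats.populationSD); // 5 2
console.log(stats.sampleSD);
```

With only eight variables per sample, as in the example table, the two definitions differ noticeably, so it is worth stating in the request which one you want.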
More graphics
Just for fun, let’s try a few more things. First a pie chart with the request "draw a pie chart describing the % contribution of all variables to sample Y", and then the same pie chart but using only shades of red for the pieces of the pie, with the request "draw a pie chart describing the % contribution of all variables to sample Y. I want all pieces of the pie to be colored in a different shade of red":
Then asking for shades of red only:
Cool, very versatile and responsive!
Want to try this new web app prototype?
Just go here:
Example LLM for data analysis via code creation – by Luciano Abriata
Remember you will need an API key for OpenAI’s GPT models.
And I recommend you keep the console visible, to spot any errors and also just to see how the LLM/web app pair works.
Outlook and Ideas for Further Developments
From what we’ve seen here, endless combinations open up.
Combining libraries
We can force the LLM to combine the use of different resources, for example here calling the provided function pca() (which, by the way, itself uses numeric.js) and using Google charts to plot the results:
> User: run pca and do a scatter plot of PC1 vs PC2
> StatBot: PCA has been run and a scatter plot of PC1 vs PC2 has been created.
General improvements
- Prompting can be cleaner and more complete. For example, while testing this web app I saw multiple times that the LLM tries to call functions that don’t exist when existing ones could be used. In particular, numeric.js’ uncmin (for unconstrained minimization) was extremely useful for most questions I asked (I reported just a few here), but sometimes the LLM skips it, even though it clearly knows about it, as we saw for quadratic fits.
- The prompt could also include some examples in the form of User-Assistant pairs.
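As an illustration of the second point, example pairs can be placed between the system message and the actual request, using the role/content message format of OpenAI's chat models. The sketch below is hypothetical: buildMessages is a made-up helper, and the example request and script are invented rather than taken from the app's prompt.

```javascript
// Build a chat message list: system prompt, then each worked example as a
// user/assistant pair, then the user's actual request.
function buildMessages(systemPrompt, examples, userRequest) {
  const messages = [{ role: 'system', content: systemPrompt }];
  for (const example of examples) {
    messages.push({ role: 'user', content: example.request });
    messages.push({ role: 'assistant', content: example.script });
  }
  messages.push({ role: 'user', content: userRequest });
  return messages;
}

// Invented example pair showing the expected shape of an answer script.
const fewShotExamples = [
  {
    request: 'how many variables does the table have?',
    script: 'var answer = "The table has " + datatable.length + " variables."; return answer;'
  }
];

const chatMessages = buildMessages(
  'You have numeric.js at hand as a library to simplify mathematical operations.',
  fewShotExamples,
  'run a quadratic fit of sample A in the X axis vs sample J in the y axis'
);
console.log(chatMessages.map(m => m.role).join(' > '));
```

Pairs that use uncmin for a fit would be natural candidates, since that is the call the LLM sometimes forgets.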
Adding goal- and field-specific functions
I think this is where the approach has the most potential. Just as I provided a predefined function to carry out PCA here, one could define more functions tailored to very specialized kinds of requests pertinent to a particular field or application. For example, you could introduce functions that pull weather data and run some analysis on it if you are writing a tool for LLM-based weather description, or you could provide functions for analyzing RNAseq data if you are creating an LLM-based tool to analyze transcriptomics data, etc.
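As the number of custom functions grows, keeping the prompt in sync with the code becomes the main chore. One possible way to organize this, sketched below under assumptions, is a registry where each function is stored next to the sentence that describes it to the LLM. Everything here is hypothetical (customFunctions, describeFunctions, runWithFunctions and the toy columnMean function); in the app described in this article the pca() function simply lives in the page's global scope instead.

```javascript
// Hypothetical registry: each custom function travels with the sentence
// that describes it to the LLM, so code and prompt cannot drift apart.
const customFunctions = {
  columnMean: {
    doc: 'columnMean(table, c) returns the average of column c of table.',
    fn: function (table, c) {
      let sum = 0;
      for (const row of table) sum += row[c];
      return sum / table.length;
    }
  }
};

// Text to append to the System part of the prompt.
function describeFunctions(registry) {
  const docs = Object.keys(registry).map(name => registry[name].doc);
  return 'You have direct access to these functions: ' + docs.join(' ');
}

// Run generated code with every registered function passed in by name,
// so the script can call them as if they were globals.
function runWithFunctions(code, datatable, registry) {
  const names = Object.keys(registry);
  const fns = names.map(name => registry[name].fn);
  const runner = new Function('datatable', ...names, code);
  return runner(datatable, ...fns);
}

// Usage with a made-up 2 x 2 table and a script the LLM might return.
const registryDemo = runWithFunctions(
  'var answer = "Mean of column 1: " + columnMean(datatable, 1); return answer;',
  [[1, 2], [3, 4]],
  customFunctions
);
console.log(describeFunctions(customFunctions));
console.log(registryDemo); // Mean of column 1: 3
```

Adding a domain-specific function then takes a single registry entry, and the prompt text updates itself.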
Other libraries
Here I showed how to empower the LLM called by your web app with mathematical and statistical capabilities from numeric.js, and with plotting capabilities from Google Charts; however, you can in principle pass on any other library provided you teach the LLM how to use it.
For example, if you are creating an LLM-based tool to work with molecular graphics you could insert JSmol, similar to what I did in this previous post:
Control web apps via natural language by casting speech to commands with GPT-3
Another example, you could source libraries that help the tool to better display tables, mathematical formulas, etc., to have nicer and more natural outputs.
Or you could source libraries that power 3D graphics, such as Three.js, which OpenAI’s LLMs probably know well enough to create some simple apps.
Further reads
Exploring Data Analysis Via Natural Language, Using LLMs—Approach 1
Control web apps via natural language by casting speech to commands with GPT-3
Core Code to Build ChatGPT-like Bots in < 20 Lines of JavaScript!
Coupling Four AI Models to Deliver the Ultimate Experience in Immersive Visualization and Modeling
Building Customized Chatbots for the Web Using gpt-3.5-turbo,
Crafting Effective Prompts for Summarization Using Large Language Models
New DeepMind Work Unveils Supreme Prompt Seeds for Language Models
www.lucianoabriata.com I write about everything that lies in my broad sphere of interests: nature, science, technology, programming, etc. Subscribe to get my new stories by email. To consult about small jobs check my services page here. You can contact me here. You can tip me here.