URL: https://thenewstack.io/a-playground-for-llm-apps-how-ai-engineers-use-humanloop/

A Playground for LLM Apps: How AI Engineers Use Humanloop

In the LLM app stack, a playground is where developers can test out (and deploy) prompts. We discussed this new concept with Humanloop's CEO.
Aug 22nd, 2023 8:52am by Richard MacManus

In the evolving LLM app stack, a British company called Humanloop has — perhaps accidentally — defined a new category of product: an LLM “playground.” It’s a platform where developers can test various LLM prompts, and then deploy the best ones into an application with full DevOps-style measurement and monitoring.

To understand exactly what Humanloop offers developers, and how it became one of the leading “playground” toolsets, I spoke to Humanloop co-founder and CEO, Raza Habib.

I first learned of the term “playground” in the LLM app stack diagram created by Andreessen Horowitz (a16z). But what does it mean and where did the term originate?

[Figure: a16z's emerging LLM app stack diagram. Via a16z.]

Habib, who holds a Ph.D. in machine learning from University College London, explains that it derives from OpenAI.

“When OpenAI first released GPT-3, they just wanted to have an environment where people could go try the model — and they called it the playground. And so […] the name has stuck around. But I think the point is that it’s an environment to interactively try things with different models.”

Habib also noted that a16z didn’t initially know where to place Humanloop in its stack.

“I think we could have belonged in a couple of different places on that diagram,” he said. “But at its core, we help developers evaluate and take steps to improve their prompts and AI applications.”

Let’s take a step back. As Habib explained it, LLM applications start with a base model — such as GPT-4 or Claude — or maybe your own large language model. To begin creating an application you need a “prompt template,” which Habib described as “a set of instructions to the model, with maybe gaps for input.” You then “chain together” all of this with other models or with information retrieval systems to build out a whole application.
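To make the idea of a prompt template concrete, here is a minimal Python sketch of instructions with gaps for input, chained with a toy retrieval step. Every name in it (`EXERCISE_TEMPLATE`, `retrieve`, `build_prompt`, `fake_model`) is hypothetical and chosen for illustration; none of it is Humanloop's API, and the model call is a placeholder rather than a real LLM.

```python
from string import Template

# Hypothetical prompt template: fixed instructions with gaps for input.
EXERCISE_TEMPLATE = Template(
    "You are writing a language exercise.\n"
    "Target language: $language\n"
    "Learner level: $level\n"
    "Reference notes:\n$context\n"
    "Write one short exercise."
)

# Toy stand-in for an information retrieval system (an assumption, not a real index).
NOTES = {
    "spanish": "Use present-tense verbs; avoid idioms.",
    "french": "Prefer everyday vocabulary; keep sentences short.",
}


def retrieve(language: str) -> str:
    """Return reference notes for a language, or a default when none exist."""
    return NOTES.get(language.lower(), "No notes available.")


def build_prompt(language: str, level: str) -> str:
    """Chain retrieval and templating: fetch context, then fill the gaps."""
    return EXERCISE_TEMPLATE.substitute(
        language=language, level=level, context=retrieve(language)
    )


def fake_model(prompt: str) -> str:
    """Placeholder for a call to a base model; it only reports the prompt size."""
    return f"[model output for a {len(prompt)}-character prompt]"


if __name__ == "__main__":
    prompt = build_prompt("Spanish", "beginner")
    print(prompt)
    print(fake_model(prompt))
```

In a real application, `fake_model` would be replaced by a call to whichever base model the team has chosen, and `retrieve` by a query against an actual retrieval system.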

Habib and his co-founders spotted an opportunity in the collaborative part of this process: helping technical users work with non-technical users to try different prompts and evaluate them.

“What we’ve found, speaking to people working on this early on, is [that] it’s very iterative,” he said, regarding the process of building an LLM application. “It requires collaboration between technical and nontechnical people to find good problems and get these systems working well. And evaluation is really hard, because it’s very subjective.”

[Figure: diagram via Humanloop.]

Use Cases and How It Works

One of Humanloop’s customers is Duolingo, a popular language education application. As with many other tech companies over the past year or two, Duolingo has been busy adding AI to its core product. A recent blog post explained that it uses AI in a variety of ways, including helping its staff create lessons and “build courses faster.” Writing prompts is at the core of this:

“Here’s how our AI system works: We write a ‘prompt,’ or a set of detailed commands, that ‘explains’ to the AI model how to write a given Duolingo exercise.”

Duolingo is careful to emphasize that the ultimate responsibility for its lessons and courses falls on its human instructors. Nevertheless, it’s clear that AI is helping a lot — both with template design and to “fill in the blanks.”

Humanloop’s role is to help Duolingo get the right type of content out of the LLMs.

“It [the content] obviously needs to be appropriate to the learner, the right tone, the right language, vocabulary that’s appropriate, etc.,” Habib explained. “So, it’s not trivial to take the base models and actually get them to do what you want. And so what we provide is a set of tools for iterating on, collaboratively, your prompts and your workflows; measuring performance in production; and then also being able to monitor and evaluate things over time.”

Typically Humanloop is used at the prototyping stage. A team of people will open Humanloop (which is browser-based) and they will see “a playground-type environment.”

“They can try out different models, they can try out different prompts,” Habib continued. “They can include that in a sequence or workflow. They work on that till they get to a point where it seems to be working reasonably well [and] now it’s time to go in and try it out more seriously, beyond just eyeballing things. They’ll then typically run more quantitative evaluation, and so we have the ability to set up evaluation functions. They’ll deploy that to production, and they’ll monitor how well it’s working — including being able to gather end user feedback.”

A similar workflow happens when doing tweaks or testing out new prompts, so it’s an iterative process that doesn’t stop after the application has been deployed.
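The step from "eyeballing things" to quantitative evaluation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Humanloop's evaluation feature: the prompt variants, the evaluator functions (`is_short`, `has_greeting`) and the deterministic `fake_model` are all invented here so the example runs offline.

```python
from statistics import mean
from typing import Callable

# Hypothetical prompt variants under comparison.
PROMPTS = {
    "terse": "Translate to French: {text}",
    "guided": "Translate to French for a beginner, in one short sentence: {text}",
}


def fake_model(prompt: str) -> str:
    """Deterministic stand-in for an LLM call so the sketch runs offline."""
    if "beginner" in prompt:
        return "Bonjour, je m'appelle Ana."
    return "Bonjour, mon nom est Ana et je suis ravie de vous rencontrer aujourd'hui."


# Evaluation functions map a model output to a score between 0 and 1.
def is_short(output: str) -> float:
    return 1.0 if len(output.split()) <= 8 else 0.0


def has_greeting(output: str) -> float:
    return 1.0 if "bonjour" in output.lower() else 0.0


EVALUATORS: list[Callable[[str], float]] = [is_short, has_greeting]


def score_prompt(template: str, inputs: list[str]) -> float:
    """Average every evaluator's score over every test input."""
    scores = []
    for text in inputs:
        output = fake_model(template.format(text=text))
        scores.extend(evaluate(output) for evaluate in EVALUATORS)
    return mean(scores)


def rank_prompts(inputs: list[str]) -> list[tuple[str, float]]:
    """Return (name, score) pairs, best first."""
    results = [(name, score_prompt(t, inputs)) for name, t in PROMPTS.items()]
    return sorted(results, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_prompts(["Hello, my name is Ana."]):
        print(f"{name}: {score:.2f}")
```

The same scoring loop can be rerun whenever a prompt is tweaked, which mirrors the iterative process described above. End-user feedback gathered in production could feed in as one more evaluator.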

Playing with Others

I asked whether Humanloop can be used in tandem with other products in the LLM app stack, such as the orchestration framework LangChain and vector databases like Pinecone.

“It integrates natively with LangChain [and] a couple of others,” he confirmed. “So you can switch on an environment variable in LangChain, and then you’ll automatically start getting logging and monitoring of your applications in Humanloop. So it’s really like a one-line code change and then suddenly you can see what data is flowing through, and start gathering feedback and take actions to improve and debug.”

Habib noted that Humanloop has a feature similar to OpenAI’s functions, which it calls “Tools.” This allows users to “connect an LLM to any API to give it extra capabilities and access to private data” — for example, to connect to a vector database. But Habib cautioned that Humanloop isn’t an orchestration framework like LangChain.

“We believe that that’s best done in code,” he said, regarding orchestration. “We’re primarily there to manage the prompt engineering and then evaluate and take steps to improve those models.”
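The general pattern behind connecting an LLM to an API can be sketched as follows: the model replies with a structured call, and application code validates it and runs the matching function. This is a generic illustration, not Humanloop's "Tools" feature or OpenAI's function-calling API. The registry layout, the `get_order_status` tool and the `fake_model` stand-in are assumptions made for the example.

```python
import json
from typing import Any

# Hypothetical private data the model cannot see on its own.
ORDERS = {"A100": "shipped", "A101": "processing"}


def get_order_status(order_id: str) -> str:
    """Example tool: look up an order in a private store."""
    return ORDERS.get(order_id, "unknown")


# Registry pairing each tool name with a description and a callable.
TOOLS: dict[str, dict[str, Any]] = {
    "get_order_status": {
        "description": "Return the status of an order by ID.",
        "parameters": {"order_id": "string"},
        "function": get_order_status,
    }
}


def fake_model(user_message: str) -> str:
    """Stand-in for an LLM that replies with a JSON tool call."""
    order_id = user_message.split()[-1].strip("?.")
    return json.dumps({"tool": "get_order_status", "arguments": {"order_id": order_id}})


def run_tool_call(raw_call: str) -> str:
    """Parse the model's tool call, validate it, and execute the tool."""
    call = json.loads(raw_call)
    tool = TOOLS.get(call.get("tool"))
    if tool is None:
        raise ValueError(f"Unknown tool: {call.get('tool')!r}")
    expected = set(tool["parameters"])
    received = set(call.get("arguments", {}))
    if expected != received:
        raise ValueError(f"Expected arguments {sorted(expected)}, got {sorted(received)}")
    return tool["function"](**call["arguments"])


if __name__ == "__main__":
    print(run_tool_call(fake_model("Where is order A100?")))
```

Keeping the dispatch logic in ordinary code like this is consistent with Habib's point that orchestration is best done in code.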

Advice for AI Engineers

The primary users of Humanloop are developers. With the current popularity of LLM applications, I asked Habib what advice he’d give to developers who want to do more work in this area.

“In terms of new skills you want to learn, I think having an awareness for how the models work and an appreciation that this is now stochastic. So if you haven’t had any experience with machine learning before, and you’re coming into it, you’re probably coming from a world in which software is deterministic — [where] you can write unit tests and it always does exactly the same thing.”

With LLMs, though, software isn’t necessarily deterministic. So learning to deal with that randomness and developing an intuition about the limits of LLMs is important, in Habib’s view. Which, of course, is where an LLM playground comes into play.
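One way to picture the shift from deterministic to stochastic software is to test a pass rate over many samples instead of asserting one exact output. The Python sketch below assumes a made-up `fake_llm` that answers concisely about 90% of the time. The 90% figure, the checks and the thresholds are illustrative choices, not measurements of any real model.

```python
import random


def fake_llm(prompt: str, rng: random.Random) -> str:
    """Stand-in for a sampled LLM call: usually concise, occasionally rambling."""
    if rng.random() < 0.9:
        return "Paris"
    return "The capital of France is, I believe, the city of Paris."


def pass_rate(prompt: str, check, samples: int, seed: int) -> float:
    """Sample the model repeatedly and return the fraction of outputs passing check."""
    rng = random.Random(seed)
    passed = sum(1 for _ in range(samples) if check(fake_llm(prompt, rng)))
    return passed / samples


def mentions_paris(output: str) -> bool:
    """Property that should hold for every sample."""
    return "paris" in output.lower()


def is_one_word(output: str) -> bool:
    """Property that only holds for most samples."""
    return len(output.split()) == 1


if __name__ == "__main__":
    question = "What is the capital of France?"
    # A property every output should satisfy can still be asserted strictly.
    assert pass_rate(question, mentions_paris, samples=1000, seed=0) == 1.0
    # A property that holds only most of the time is checked against a threshold.
    rate = pass_rate(question, is_one_word, samples=1000, seed=0)
    assert rate > 0.8
    print(f"one-word answers: {rate:.1%}")
```

A conventional unit test would assert `fake_llm(...) == "Paris"` and fail intermittently. Checking a property over many samples, with a threshold, is one way to make such a test stable.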

Richard MacManus is a Senior Editor at The New Stack and writes about web and application development trends. Previously he founded ReadWriteWeb in 2003 and built it into one of the world’s most influential technology news sites. From the early...
TNS owner Insight Partners is an investor in: OpenAI.