URL: https://thenewstack.io/sql-vector-databases-are-shaping-the-new-llm-and-big-data-paradigm/

SQL Vector Databases Are Shaping the New LLM and Big Data Paradigm - The New Stack


AI / Data / Databases / Large Language Models


Combining vector databases with SQL can provide the accuracy and performance required to build modern production-level GenAI applications.
Apr 29th, 2024 12:29pm by Linpeng Tang
Featured image by Tim Johnson on Unsplash.
MyScale sponsored this post.

The rise of powerful large language models (LLMs) like GPT-4, Gemini 1.5 and Claude 3 has been a game-changer in AI and technology. With some models capable of processing over 1 million tokens, their ability to handle long contexts is truly impressive. However:

  1. Many data structures are too complex and constantly evolving for LLMs to handle effectively on their own.
  2. Managing massive, heterogeneous enterprise data within a context window is simply impractical.

Retrieval-augmented generation (RAG) helps address these issues, but retrieval accuracy is a major bottleneck for end-to-end performance. One solution is integrating LLMs with big data through advanced SQL vector databases. This synergy between LLMs and big data not only makes LLMs more effective but also enables people to gain better intelligence from big data. It also reduces model hallucination while improving data transparency and reliability.

Current State of Vector Databases

As the cornerstone of RAG systems, vector databases have developed rapidly in the past year. They can generally be categorized into three types: specialized vector databases, keyword and vector retrieval systems, and SQL vector databases. Each has advantages and limitations.

[Figure: Types of vector databases]

Specialized Vector Databases

Some vector databases (like Pinecone, Weaviate and Milvus) are designed specifically for vector search from the outset. They exhibit good performance in this area but have somewhat limited general data management capabilities.

Keyword and Vector Retrieval Systems

Represented by Elasticsearch and OpenSearch, these systems are widely used in production due to their comprehensive keyword-based retrieval capabilities. However, they consume substantial system resources, and the accuracy and performance of keyword and vector hybrid queries are often unsatisfactory.

SQL Vector Databases

A SQL vector database is a specialized type of database that combines the capabilities of traditional SQL databases with the abilities of a vector database. It provides the ability to efficiently store and query high-dimensional vectors with the help of SQL.
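As a rough illustration of what "querying vectors with SQL" can look like, here is a minimal sketch that uses Python's built-in sqlite3 module with a user-defined distance function. It is not MyScaleDB or pgvector syntax: the `docs` table, the `l2_distance` function name and the three toy vectors are assumptions made for this example, and the query is a brute-force scan rather than an indexed search.

```python
import json
import math
import sqlite3


def l2_distance(a_json: str, b_json: str) -> float:
    """Euclidean distance between two vectors stored as JSON arrays."""
    a, b = json.loads(a_json), json.loads(b_json)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under a hypothetical name.
conn.create_function("l2_distance", 2, l2_distance)

# Toy schema: embeddings are stored as JSON text next to ordinary columns.
conn.execute(
    "CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, embedding TEXT)"
)
conn.executemany(
    "INSERT INTO docs (id, title, embedding) VALUES (?, ?, ?)",
    [
        (1, "intro to SQL", json.dumps([1.0, 0.0, 0.0])),
        (2, "vector search basics", json.dumps([0.0, 1.0, 0.0])),
        (3, "hybrid SQL and vector queries", json.dumps([0.7, 0.7, 0.0])),
    ],
)

# Nearest-neighbor search expressed as ORDER BY distance ... LIMIT k.
query_vector = json.dumps([0.6, 0.8, 0.0])
nearest = conn.execute(
    "SELECT id, title FROM docs ORDER BY l2_distance(embedding, ?) LIMIT 2",
    (query_vector,),
).fetchall()
print(nearest)
```

A production SQL vector database would typically replace the full scan with a vector index and a native distance function, but the shape of the query, ordinary SQL with a distance expression in ORDER BY, stays recognizable.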

Two major SQL vector databases are illustrated in the figure above: pgvector and MyScaleDB. Pgvector is a vector search extension for PostgreSQL. It is easy to get started with and useful for managing small data sets. However, due to Postgres’ row storage disadvantages and vector algorithm limitations, pgvector tends to have lower accuracy and performance for large-scale, complex vector queries.

MyScaleDB is an open source SQL vector database built on ClickHouse (a columnar storage SQL database). It is designed to provide a high-performance and cost-effective data foundation for GenAI applications. MyScaleDB is also the first SQL vector database to outperform specialized vector databases in overall performance and cost-effectiveness.

The Power of SQL and Vector Joint Data Modeling

Despite the emergence of NoSQL and big data technologies, SQL databases continue to dominate the data management market half a century after SQL’s inception. Even systems like Elasticsearch and Spark have added SQL interfaces. With SQL support, MyScaleDB enables high performance in vector search and analytics.

In real-world AI applications, integrating SQL and vectors enhances data modeling flexibility and simplifies development. For instance, a large-scale academic product uses MyScaleDB for intelligent Q&A over massive scientific literature data. The main SQL schema includes over 10 tables, several with vector and keyword-based inverted index structures, connected via primary and foreign keys. The system handles complex queries involving structured, vector and keyword data and joined queries across multiple tables. This is a challenging task for specialized vector databases, which often leads to slow iteration, inefficient querying and high maintenance costs.

[Figure: SQL vector database schema]

The main SQL vector database schema of a large-scale academic product supported by MyScale (columns in bold have associated vector indexes or inverted indexes).

Improving RAG Accuracy and Cost-Efficiency

In real-world RAG systems, overcoming retrieval accuracy (and the associated performance bottlenecks) requires an efficient way to combine querying of structured, vector and keyword data.

For instance, in a financial application, when users query a document database asking, “What was the revenue of <company_name> in 2023 globally?” structured metadata like “<company_name>” and “2023” may not be captured by semantic vectors or present in consecutive text. Vector retrieval across the entire database can yield noisy results, reducing final accuracy.

However, information such as company names and years can often be obtained as document metadata. Using WHERE year=2023 AND company LIKE '%<company_name>%' as filtering conditions for vector queries can precisely pinpoint relevant information, significantly increasing system reliability. In finance, manufacturing and research, we have observed SQL vector data modeling and joint querying to improve precision from 60% to 90%.
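The effect of such a metadata filter can be sketched in plain Python. The example below is only an illustration under stated assumptions: the `Chunk` class, the `search` helper, the company names and the two-dimensional embeddings are all hypothetical, and the substring check stands in for the LIKE condition. It shows how an unfiltered similarity search can rank a chunk from the wrong company first, while restricting candidates by company and year returns the intended one.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    company: str
    year: int
    text: str
    embedding: list


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def search(chunks, query_embedding, k, company=None, year=None):
    """Rank chunks by cosine similarity, optionally restricted by metadata."""
    candidates = [
        c for c in chunks
        # Substring match plays the role of: company LIKE '%<company_name>%'
        if (company is None or company in c.company)
        # Equality check plays the role of: year = 2023
        and (year is None or c.year == year)
    ]
    candidates.sort(
        key=lambda c: cosine_similarity(c.embedding, query_embedding),
        reverse=True,
    )
    return candidates[:k]


# Toy data: three semantically similar "revenue" chunks (made-up values).
chunks = [
    Chunk("Acme Corp", 2023, "Acme global revenue, fiscal 2023", [0.90, 0.10]),
    Chunk("Acme Corp", 2022, "Acme global revenue, fiscal 2022", [0.92, 0.08]),
    Chunk("Globex Inc", 2023, "Globex global revenue, fiscal 2023", [0.95, 0.05]),
]
query_embedding = [1.0, 0.0]

unfiltered = search(chunks, query_embedding, k=1)
filtered = search(chunks, query_embedding, k=1, company="Acme", year=2023)
print(unfiltered[0].company, unfiltered[0].year)
print(filtered[0].company, filtered[0].year)
```

In a SQL vector database the same restriction is expressed declaratively in the WHERE clause, so the engine decides how to combine the filter with the vector index.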

While traditional database products have recognized the importance of vector queries in the LLM era and started adding vector capabilities, there are still significant issues with the accuracy of their combined queries. For example, in filter-search scenarios, Elasticsearch’s queries per second (QPS) rate drops to about five when the filtering ratio is 0.1, and PostgreSQL with the pgvector plugin has an accuracy of only about 50% when the filtering ratio is 0.01. This demonstrates unstable query accuracy and performance that greatly limit their usage. In contrast, SQL vector database MyScale achieves over 100 QPS and 98% accuracy in various filtering ratio scenarios, at 36% of the cost of pgvector and 12% of the cost of Elasticsearch.
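The article does not say how the accuracy figures above were measured. One common way to score filtered vector search, assumed here purely for illustration, is recall@k: the share of the exact top-k neighbors (from a brute-force search over the rows that pass the filter) that the database actually returns. The function name and the sample IDs below are hypothetical.

```python
def recall_at_k(retrieved_ids, exact_ids, k):
    """Fraction of the exact top-k neighbors found in the retrieved top-k."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    found = set(retrieved_ids[:k]) & exact_top
    return len(found) / len(exact_top)


# Hypothetical IDs: the exact search returns 4, 9, 1, 7; the approximate
# filtered search returns 4, 1, 3, 8, so two of the four true neighbors match.
exact_ids = [4, 9, 1, 7]
retrieved_ids = [4, 1, 3, 8]
print(recall_at_k(retrieved_ids, exact_ids, k=4))
```

Averaging this score over many queries at a fixed filter condition gives a single number that can be compared across systems.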

[Figure: MyScale, pgvector, Elasticsearch precision]

LLM + Big Data: Building a Next-Generation Agent Platform

Machine learning and big data have fueled the success of web and mobile apps. But with the rise of LLMs, we’re shifting gears to build a new breed of solutions that combine LLMs with big data. These solutions unlock key capabilities for large-scale data processing, knowledge retrieval, observability, data analysis, few-shot learning and more. They create a closed loop between data and AI, forming the foundation for a next-gen LLM + big data agent platform. This paradigm shift is already underway in sectors like scientific research, finance, industry and healthcare.

[Figure: MyScale architecture]

With the rapid development of technology, some form of artificial general intelligence (AGI) is expected to emerge within the next five to 10 years. This raises a question: Do we need a static, virtual model, or a more comprehensive solution? Data is undoubtedly an important link connecting LLMs, users and the world. Our vision is to organically integrate LLMs and big data to create a more professional, real-time and collaborative AI system that is also full of human warmth and value.

You are welcome to explore the MyScaleDB repository on GitHub and leverage SQL and vectors to build innovative, production-level AI applications.

MyScale is an open source SQL vector database that lets developers effectively manage massive volumes of both structured and vector data for developing robust AI applications. It enables every developer to build production-grade GenAI applications with powerful and familiar SQL.
Linpeng Tang, Co-founder and CTO of MyScale. Currently, he leads a talented team focused on developing MyScale’s flagship product, a SQL-based vector database tailored to empower enterprises in managing unstructured data and building AI applications at scale. Before his role...
Pinecone and Zilliz are sponsors of The New Stack.
TNS owner Insight Partners is an investor in: ClickHouse.