VOOZH about

URL: https://thenewstack.io/run-optimized-geospatial-queries-with-trino/

⇱ Run Optimized Geospatial Queries with Trino - The New Stack


TNS
SUBSCRIBE
Join our community of software engineering leaders and aspirational developers. Always stay in-the-know by getting the most important news and exclusive content delivered fresh to your inbox to learn more about at-scale software development.
REQUIRED
It seems that you've previously unsubscribed from our newsletter in the past. Click the button below to open the re-subscribe form in a new tab. When you're done, simply close that tab and continue with this form to complete your subscription.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.
Welcome and thank you for joining The New Stack community!
Please answer a few simple questions to help us deliver the news and resources you are interested in.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Great to meet you!
Tell us a bit about your job so we can cover the topics you find most relevant.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Welcome!

We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.

What’s next?

Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.

Follow TNS on your favorite social media networks.

Become a TNS follower on LinkedIn.

Check out the latest featured and trending stories while you wait for your first TNS newsletter.

PREV
1 of 2
NEXT
VOXPOP
As a JavaScript developer, what non-React tools do you use most often?
Angular
0%
Astro
0%
Svelte
0%
Vue.js
0%
Other
0%
I only use React
0%
I don't use JavaScript
0%
Thanks for your opinion! Subscribe below to get the final results, published exclusively in our TNS Update newsletter:
NEW! Try Stackie AI
From clobbered drafts to real-time sync
Apr 14th 2026 10:00am, by David Moore
TypeScript 6.0 RC arrives as a bridge to a faster future
Mar 14th 2026 9:00am, by Darryl K. Taft
Mastra empowers web devs to build AI agents in TypeScript
Jan 28th 2026 11:00am, by Loraine Lawson
2022-04-08 10:00:44
Run Optimized Geospatial Queries with Trino
contributed,
Data / Observability

Run Optimized Geospatial Queries with Trino

How to use the Trino open source distributed query engine for running geospatial workloads atop different data sources.
Apr 8th, 2022 10:00am by Guy Mast
👁 Featued image for: Run Optimized Geospatial Queries with Trino
Feature image via Pixabay.
Guy Mast
Guy Mast is a Product Manager at Varada, an autonomous data lake acceleration platform, where he works on combining Varada's indexing technology with the Trino OSS project.

The Trino open source distributed query engine is known as a choice for running ad-hoc analysis where there’s no need to model the data and it can be easily sliced and diced. As such, Trino can be also leveraged for running geospatial workloads atop different data sources.

When it comes to running geospatial queries, Trino is OpenGIS compliant and supports a wealth of geospatial-specific functions using WKT/WKB vector data formats. Its diverse array of functions gives you the ability to unify and join geospatial data from multiple sources. For example, you can join points of interest by using Trino’s Postgres connector with events tables stored on S3 by using the Hive connector.

In this post, we’ll walk you through two methods for running geospatial queries on the data lake using Trino’s Hive connector, explore some optimizations to help you accelerate and improve the interactivity for your geospatial queries.

Deep-Dive: Geospatial Analysis Methods with Trino

Leveraging Trino’s native Geospatial support and Bing tiles system are two methods for running interactive geospatial queries for common use cases. In this section, we’ll demonstrate these two methods by querying the trips_data table (4.4 billion rows) and the places table (10 rows) from the public ride-sharing dataset by using Trino’s Hive connector.

Places table locations:

👁 Image

Here is a list with some common geospatial analysis use-cases we came across:

  1. Getting the details of events near points of interest, for example, trips in a ride-sharing app passing by various distances from a billboard or a store — by using Trino’s great_circle_distance function
  2. Intersecting events (such as taxi trips) with a well-defined polygon (such as neighborhoods) — by using Trino’s ST_Contains function
  3. Finding out if a polygon contains a specificpoint, for example, if a trip started or ended in a specific area of the city — by using Trino’s ST_Contains function

Method 1 – Leveraging Trino’s Geospatial Support

Trino offers a wide range of geospatial functions that can be used out of the box.

The first use case might require running queries that get all the riders that were up to some distance from the points of interest (restaurants, billboard, etc.).

Below you will find such a query that joins the trips_data table and the places table, and counts the number of riders in trips that passed up to 500 million from the points of interest stored in the places table.

Finding Adjacent Geographical Data Points

Query #1

Finding Data Points in Defined Geographical Locations

Another example for using the Geospatial functions in Trino is running queries related to use-case 2 in which we look up for events that took place in a well-defined area. Below you kind find such a query that counts the number of drivers that drove in a specific area:

Query #2 

In the above query the Geospatial functions used in the predicate were not pushed down to the Hive connector, resulting in a massive data read. In order to mitigate this, it’s possible to build an additional bounding box by using the lon and lat columns as suggested in query #2.1 or to use the Bing tiles system as suggested in query #4.

Building Additional Bounding Box

Query #2.1 

Here we built a bounding box using the lat, lon columns which include the polygon of interest. The filters on these columns are being pushed down to the Hive connector, which results in reading less data from S3, compared to query #2.

Although we saw a performance improvement by adding the bounding box predicate both on the Hive and Varada connector, it’s important to note that dynamically building a bounding box is not always straightforward.

Method 2 – Using Bing Tiles

Bing tiles are a way to define map areas at different resolutions:

👁 Image

They are stored as “quadkeys”:

  • Quadkey = “03” defines a tile at zoom level 2, tile side length is ~7,000 kilometers
  • Quadkey = “03023323” is zoom level 8, which side length is ~60 kilometers at the equator
  • We used level 19, which side length is ~75 meters

Note that a prefix of a tile quadkey is always a larger tile that includes that tile at a lower resolution.

Building Bing Tile Using Trino’s CTAS Command

In order to utilize the Bing tile segmentation, we can either create the Bing tile during the query runtime (as we did in query #4) or by using an ETL/ELT procedure that will add a new quadkey column to both the trips_data and the places table. In this article, we used Trino’s CTAS command for the ELT procedure described here.

Once the quadkey column is in place we can now JOIN the places table and the table trips_data table by the quakey column.

CTAS for creating the trips_data_bing table

The instructions for creating the trips_data and places table can be found here.

Finding adjacent geographical data points using Bing tiles

Below in Query #3 we implement the same logic as query #1 and utilize the Bing tiles for doing the join between the places and trips table.

Query #3

This query joins the table by the quadkey column, which significantly reduces the number of rows read and boosts the query performance thanks to the dynamic filtering optimization.

Finding Data Points in Defined Geographical Locations Using Bing Tiles

Bing tiles can also be leveraged as can be seen below in query #4, which implements the same login we saw on query #2:

Query #4

When creating the Bing tile during query execution, execution time was slower for this query compared to the equivalent query #2, which might suggest that for this type of queries creating the Bing tile during query execution might not always be beneficial. That being said, when we ran the same experiment using the Varada Trino Connector we did see a significant improvement.

The Results

After running the above queries we can then compile the data into a mini benchmark that details the queries’ runtime and showing the improvement factor after implementing the suggested optimization. Here are the results of a mini benchmark data, using 4Xi3.4xlarage machines on Amazon Web Services.

👁 Image

Optimization improves performance by up to four times.

Conclusion

Running these queries underlines the fact that you can use Trino to run geospatial queries on your data lake thanks to its out-of-the-box support for geospatial functions.

However, by simply applying the suggested methods above you can improve your geospatial queries performance by up to X4.

You can leverage your investment in Trino to run geospatial analysis today on the Hive connector by applying any of the methods and optimizations described in this article. And check out this post for results of running the same queries on the Community Edition of Varada’s Trino connector.

TRENDING STORIES
SHARE THIS STORY
TRENDING STORIES
SHARE THIS STORY
TRENDING STORIES
TNS DAILY NEWSLETTER Receive a free roundup of the most recent TNS articles in your inbox each day.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.