Elasticsearch is a powerful search and analytics engine that allows for efficient data analysis through its rich aggregation framework. Among the various aggregation types, histogram aggregation is particularly useful for grouping data into intervals, which is essential for understanding the distribution and trends within your data.
In this article, we will look at histogram and date histogram aggregation in Elasticsearch, explain their use cases, and work through detailed examples.
Histogram aggregation in Elasticsearch groups numeric values into fixed-width buckets. You specify an interval, and each document is counted in the bucket whose range contains its value. The result is the data behind a histogram chart: a count per range, which makes the distribution of a field easy to read.
Histogram aggregation is particularly useful when you need to see how values are spread across equal-sized ranges, such as price bands, response-time ranges, or (with the date variant) days and months.
Let's consider an Elasticsearch index called sales with documents representing individual sales transactions. Each document might look like this:
{
"sale_id": 1,
"product": "Laptop",
"category": "electronics",
"price": 1000,
"quantity": 2,
"timestamp": "2023-01-01T12:00:00Z"
},
{
"sale_id": 2,
"product": "T-shirt",
"category": "clothing",
"price": 20,
"quantity": 5,
"timestamp": "2023-01-02T14:00:00Z"
},
{
"sale_id": 3,
"product": "Book",
"category": "books",
"price": 15,
"quantity": 10,
"timestamp": "2023-01-03T16:00:00Z"
}
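To follow along, the sample documents need to be indexed first. The sketch below builds a request body for the Elasticsearch `_bulk` API from the three documents above. It only produces the NDJSON text and does not contact a cluster. Using `sale_id` as the document `_id` is an assumption made for this example, not something the article specifies.

```python
import json

# The three sample documents shown above.
SALES = [
    {"sale_id": 1, "product": "Laptop", "category": "electronics",
     "price": 1000, "quantity": 2, "timestamp": "2023-01-01T12:00:00Z"},
    {"sale_id": 2, "product": "T-shirt", "category": "clothing",
     "price": 20, "quantity": 5, "timestamp": "2023-01-02T14:00:00Z"},
    {"sale_id": 3, "product": "Book", "category": "books",
     "price": 15, "quantity": 10, "timestamp": "2023-01-03T16:00:00Z"},
]


def build_bulk_body(index, docs, id_field="sale_id"):
    """Return NDJSON for the _bulk API: an action line, then the document source."""
    lines = []
    for doc in docs:
        # Assumption: reuse sale_id as the document _id.
        action = {"index": {"_index": index, "_id": str(doc[id_field])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


bulk_body = build_bulk_body("sales", SALES)
print(bulk_body)
```

The printed text can be sent as the body of `POST /_bulk` with the `Content-Type: application/x-ndjson` header.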
To start with histogram aggregation, let's use the price field to group sales into price ranges. We'll use an interval of 100.
Query:
GET /sales/_search
{
"size": 0,
"aggs": {
"price_histogram": {
"histogram": {
"field": "price",
"interval": 100
}
}
}
}
Output:
{
"took": 15,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"price_histogram": {
"buckets": [
{
"key": 0,
"doc_count": 2
},
{
"key": 1000,
"doc_count": 1
}
]
}
}
}
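In this example, the aggregation named price_histogram returns two non-empty buckets. Each key is the inclusive lower bound of its range, so key 0 covers prices from 0 up to but not including 100, and key 1000 covers prices from 1000 up to but not including 1100. The doc_count field gives the number of sales in each range. Note that histogram defaults min_doc_count to 0, so a real response also contains the empty buckets with keys 100 through 900; they are omitted from the output above for brevity.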
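To make the bucketing rule concrete, here is a minimal Python sketch of how a value maps to a bucket key. It rounds the value down to the nearest multiple of the interval, which is the rule the Elasticsearch documentation describes for this aggregation. The function names are made up for this illustration, and only non-empty buckets are counted.

```python
import math
from collections import Counter


def bucket_key(value, interval, offset=0):
    """Return the inclusive lower bound of the bucket that contains value."""
    return math.floor((value - offset) / interval) * interval + offset


def histogram(values, interval):
    """Count values per bucket key, keeping only non-empty buckets."""
    counts = Counter(bucket_key(v, interval) for v in values)
    return dict(sorted(counts.items()))


prices = [1000, 20, 15]  # prices of the three sample sales
print(histogram(prices, 100))  # prints {0: 2, 1000: 1}
```

A price of exactly 100 would land in the bucket with key 100, not key 0, because the upper bound of each bucket is exclusive.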
You can use the min_doc_count parameter to exclude buckets with fewer than a specified number of documents. For example, to exclude buckets with fewer than 2 sales:
Query:
GET /sales/_search
{
"size": 0,
"aggs": {
"price_histogram": {
"histogram": {
"field": "price",
"interval": 100,
"min_doc_count": 2
}
}
}
}
Output:
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"price_histogram": {
"buckets": [
{
"key": 0,
"doc_count": 2
}
]
}
}
}
In this case, only the bucket with key 0 (prices from 0 up to 100) is returned, because it holds 2 documents. The bucket with key 1000 holds a single sale and is filtered out.
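The effect of min_doc_count can be pictured as a filter applied to the bucket list. The sketch below is a client-side analogue, not how Elasticsearch implements it internally; `apply_min_doc_count` is a hypothetical helper name.

```python
def apply_min_doc_count(buckets, min_doc_count):
    """Keep only buckets that contain at least min_doc_count documents."""
    return [b for b in buckets if b["doc_count"] >= min_doc_count]


# Non-empty buckets from the first example.
buckets = [
    {"key": 0, "doc_count": 2},
    {"key": 1000, "doc_count": 1},
]

kept = apply_min_doc_count(buckets, 2)
print(kept)  # prints [{'key': 0, 'doc_count': 2}]
```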
You can use the extended_bounds parameter to ensure that specific buckets are included in the response, even if they have no documents. This is useful for maintaining a consistent range in your histogram.
Query:
GET /sales/_search
{
"size": 0,
"aggs": {
"price_histogram": {
"histogram": {
"field": "price",
"interval": 100,
"extended_bounds": {
"min": 0,
"max": 1200
}
}
}
}
}
Output:
{
"took": 12,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"price_histogram": {
"buckets": [
{
"key": 0,
"doc_count": 2
},
{
"key": 100,
"doc_count": 0
},
{
"key": 200,
"doc_count": 0
},
{
"key": 300,
"doc_count": 0
},
{
"key": 400,
"doc_count": 0
},
{
"key": 500,
"doc_count": 0
},
{
"key": 600,
"doc_count": 0
},
{
"key": 700,
"doc_count": 0
},
{
"key": 800,
"doc_count": 0
},
{
"key": 900,
"doc_count": 0
},
{
"key": 1000,
"doc_count": 1
},
{
"key": 1100,
"doc_count": 0
},
{
"key": 1200,
"doc_count": 0
}
]
}
}
}
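In this example, every bucket implied by the requested bounds is included in the response, whether or not it contains documents. Note that extended_bounds only widens the range of buckets and only has an effect when min_doc_count is 0; it does not filter out documents that fall outside the bounds. To restrict the range of buckets instead, use the hard_bounds parameter.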
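The gap-filling idea can be sketched on the client side as follows. This is an illustration under simplifying assumptions (integer interval and bounds), not Elasticsearch's own implementation. The function name and the input counts are hypothetical, and the bounds are deliberately not aligned to the interval so that the result does not depend on edge-case handling at the upper bound.

```python
import math


def fill_buckets(counts, interval, bounds_min, bounds_max):
    """Return a dense bucket list covering both the data and the bounds.

    counts maps bucket keys to doc counts for the non-empty buckets.
    Assumes integer interval and bounds.
    """
    first = math.floor(bounds_min / interval) * interval
    last = math.floor(bounds_max / interval) * interval
    if counts:
        # Bounds extend the range; they never cut off existing data.
        first = min(first, min(counts))
        last = max(last, max(counts))
    return [
        {"key": key, "doc_count": counts.get(key, 0)}
        for key in range(first, last + interval, interval)
    ]


# Hypothetical non-empty buckets: two sales in [100, 200), one in [300, 400).
counts = {100: 2, 300: 1}
filled = fill_buckets(counts, 100, 0, 550)
for bucket in filled:
    print(bucket)
```

With these inputs the keys run from 0 to 500 in steps of 100, and only keys 100 and 300 have a non-zero doc_count.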
While the basic histogram aggregation works with numeric data, the date histogram aggregation is used for time-based data. This allows you to group documents by date intervals, such as days, weeks, or months.
Let's add some time-based sales data to our sales index:
{
"sale_id": 4,
"product": "Smartphone",
"category": "electronics",
"price": 500,
"quantity": 3,
"timestamp": "2023-01-01T10:00:00Z"
},
{
"sale_id": 5,
"product": "Headphones",
"category": "electronics",
"price": 50,
"quantity": 10,
"timestamp": "2023-01-02T12:00:00Z"
},
{
"sale_id": 6,
"product": "Shoes",
"category": "clothing",
"price": 70,
"quantity": 4,
"timestamp": "2023-01-03T14:00:00Z"
}
Let's group sales by day using the timestamp field:
Query:
GET /sales/_search
{
"size": 0,
"aggs": {
"sales_over_time": {
"date_histogram": {
"field": "timestamp",
"calendar_interval": "day"
}
}
}
}
Output:
{
"took": 15,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 6,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"sales_over_time": {
"buckets": [
{
"key_as_string": "2023-01-01T00:00:00.000Z",
"key": 1672531200000,
"doc_count": 2
},
{
"key_as_string": "2023-01-02T00:00:00.000Z",
"key": 1672617600000,
"doc_count": 2
},
{
"key_as_string": "2023-01-03T00:00:00.000Z",
"key": 1672704000000,
"doc_count": 2
}
]
}
}
}
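In this example, the aggregation named sales_over_time groups sales into daily intervals. Each bucket represents one day: key is the start of that day in milliseconds since the epoch (UTC by default), key_as_string is the same instant as a formatted date, and doc_count is the number of sales on that day.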
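The daily keys in the output can be reproduced with a few lines of Python. This sketch truncates each timestamp to midnight UTC and converts it to epoch milliseconds; it assumes all timestamps carry the `Z` (UTC) suffix, as the sample data does. `day_bucket_key` is a name chosen for this example.

```python
from collections import Counter
from datetime import datetime


def day_bucket_key(timestamp):
    """Return midnight (UTC) of the timestamp's day as epoch milliseconds."""
    # Replace the trailing "Z" so older Python versions can parse it.
    dt = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    midnight = dt.replace(hour=0, minute=0, second=0, microsecond=0)
    return int(midnight.timestamp() * 1000)


# Timestamps of the six sample sales.
timestamps = [
    "2023-01-01T12:00:00Z",
    "2023-01-02T14:00:00Z",
    "2023-01-03T16:00:00Z",
    "2023-01-01T10:00:00Z",
    "2023-01-02T12:00:00Z",
    "2023-01-03T14:00:00Z",
]

daily = dict(sorted(Counter(day_bucket_key(t) for t in timestamps).items()))
print(daily)
# prints {1672531200000: 2, 1672617600000: 2, 1672704000000: 2}
```

The keys match the key values in the response above, with two sales per day.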
For e-commerce platforms, histogram aggregations can be used to analyze sales data. By grouping sales by price ranges or time intervals, businesses can identify trends, peak sales periods, and popular price points.
In IT and security, histogram aggregations are useful for log analysis. By grouping log entries by time, administrators can detect unusual patterns, such as spikes in error rates or security breaches.
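As a rough illustration of spike detection, the sketch below scans date histogram buckets and flags those whose count is well above the median. The bucket data, the `find_spikes` name, and the factor of 3 are all assumptions chosen for this example; a real alerting rule would need tuning against actual traffic.

```python
from statistics import median


def find_spikes(buckets, factor=3.0):
    """Return key_as_string of buckets whose doc_count exceeds factor * median."""
    counts = [b["doc_count"] for b in buckets]
    if not counts:
        return []
    baseline = median(counts)
    return [b["key_as_string"] for b in buckets if b["doc_count"] > factor * baseline]


# Hypothetical hourly error counts, shaped like date_histogram buckets.
error_buckets = [
    {"key_as_string": "2023-01-01T00:00:00.000Z", "doc_count": 4},
    {"key_as_string": "2023-01-01T01:00:00.000Z", "doc_count": 5},
    {"key_as_string": "2023-01-01T02:00:00.000Z", "doc_count": 40},
    {"key_as_string": "2023-01-01T03:00:00.000Z", "doc_count": 6},
]

spikes = find_spikes(error_buckets)
print(spikes)  # prints ['2023-01-01T02:00:00.000Z']
```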
In performance monitoring, histogram aggregations can be used to analyze response times, CPU usage, and other metrics. Grouping data into intervals helps in understanding the distribution and identifying bottlenecks.
Histogram aggregation in Elasticsearch is a versatile tool for grouping numeric data into intervals, allowing for effective data analysis and visualization. Whether you're analyzing sales data, logs, or performance metrics, histogram aggregation helps you understand the distribution and trends within your data. By mastering this feature, you can leverage Elasticsearch to gain valuable insights and make informed decisions based on your data.