Elasticsearch is renowned for its powerful search capabilities, but its functionality extends beyond just text and structured data. Often, we need to index and search binary data such as PDFs, images, and other attachments. Elasticsearch supports this through plugins, making it easy to handle and index various binary formats.
This article will guide you through indexing attachments and binary data using Elasticsearch plugins, with detailed examples and outputs.
Indexing binary data such as documents, images, and multimedia files lets you search their extracted text and metadata alongside the rest of your data.
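To handle attachments and binary data, Elasticsearch offers the ingest attachment processor, traditionally distributed as the Ingest Attachment Processor Plugin. It uses Apache Tika to extract content and metadata from various file types. From Elasticsearch 8.4 onward the processor is bundled with the default distribution, so the installation step below is needed only on earlier versions.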
To install the Ingest Attachment Processor Plugin, run the following command in your Elasticsearch directory:
bin/elasticsearch-plugin install ingest-attachment
Restart Elasticsearch after the plugin installation to activate it.
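One way to confirm that the processor is available after the restart is to ask the nodes info API which ingest processors the nodes have loaded. The sketch below assumes a node at localhost:9200 without authentication. The helper name has_attachment_processor is hypothetical, and the check is a rough text match rather than a real JSON parse, so it only shows that at least one node reports the processor.

```shell
# Succeeds if the JSON on stdin lists an ingest processor of type "attachment".
# Rough text match, not a JSON parser (hypothetical helper, not part of Elasticsearch).
has_attachment_processor() {
  grep -Eq '"type"[[:space:]]*:[[:space:]]*"attachment"'
}

# Usage against a running node (assumed to be at localhost:9200, no auth):
#   curl -s "localhost:9200/_nodes/ingest" | has_attachment_processor \
#     && echo "attachment processor available" \
#     || echo "attachment processor missing"
```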
An ingest pipeline allows you to preprocess documents before indexing them. For attachments, the pipeline will use the attachment processor to extract and index the content and metadata.
Create an ingest pipeline named attachment_pipeline:
curl -X PUT "localhost:9200/_ingest/pipeline/attachment_pipeline" -H 'Content-Type: application/json' -d'
{
"description": "Extract attachment information",
"processors": [
{
"attachment": {
"field": "data"
}
},
{
"remove": {
"field": "data"
}
}
]
}'
This pipeline extracts attachment information from the data field and removes the original base64-encoded data to save space.
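Before indexing real documents, the pipeline can be tried out with the simulate pipeline API, which runs the processors against sample documents and returns the result without indexing anything. The following is a minimal sketch, assuming an unsecured node at localhost:9200. The helper names simulate_body and simulate_attachment and the ES_URL variable are illustrative, not part of Elasticsearch.

```shell
ES_URL="${ES_URL:-localhost:9200}"   # assumed local, unsecured node

# Print a _simulate request body wrapping one base64 string in the "data" field.
# Base64 text contains no characters that need JSON escaping.
simulate_body() {
  printf '{"docs":[{"_source":{"data":"%s"}}]}' "$1"
}

# Run attachment_pipeline against the given base64 string without indexing it.
simulate_attachment() {
  curl -s -X POST "$ES_URL/_ingest/pipeline/attachment_pipeline/_simulate" \
    -H 'Content-Type: application/json' \
    -d "$(simulate_body "$1")"
}

# Usage: simulate_attachment "<base64 string>"
```

The simulated result should show the extracted fields under attachment and no data field, which makes it a quick way to check the pipeline definition before any documents are written.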
Prepare a sample document with a base64-encoded PDF file:
{
"data": "JVBERi0xLjQKJaqrrK0KNCAwIG9iago8PC9MaW5lYXJpemVkIDIgL0wgMTExMTENyCjIwL1UgMzY0MjMvTiAxL1RQIDEwMjcKPj4KZW5kb2JqCjw8L0VuY3J5cHR..."
}
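The data value is simply the file's bytes encoded as base64. As a rough sketch of how such a document could be produced from a local file, the helpers below encode a file and wrap it in the expected JSON. The function names encode_file and build_doc and the file name sample.pdf are hypothetical.

```shell
# Base64-encode a file on one line. Piping through tr strips the line wraps
# that some base64 implementations insert by default.
encode_file() {
  base64 < "$1" | tr -d '\n'
}

# Print the JSON document body for a file, with the payload in the "data" field.
build_doc() {
  printf '{"data":"%s"}' "$(encode_file "$1")"
}

# Usage (sample.pdf is a hypothetical local file):
#   build_doc sample.pdf > doc.json
#   curl -X PUT "localhost:9200/myindex/_doc/1?pipeline=attachment_pipeline" \
#     -H 'Content-Type: application/json' --data-binary @doc.json
```

Writing the body to a file and sending it with --data-binary avoids pasting long base64 strings onto the command line.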
Index this document using the attachment_pipeline:
curl -X PUT "localhost:9200/myindex/_doc/1?pipeline=attachment_pipeline" -H 'Content-Type: application/json' -d'
{
"data": "JVBERi0xLjQKJaqrrK0KNCAwIG9iago8PC9MaW5lYXJpemVkIDIgL0wgMTExMTENyCjIwL1UgMzY0MjMvTiAxL1RQIDEwMjcKPj4KZW5kb2JqCjw8L0VuY3J5cHR..."
}'
Output:
The response confirms that the document was indexed. The extracted text content and metadata are stored in the document under the attachment field:
{
"_index": "myindex",
"_id": "1",
"_version": 1,
"result": "created",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
}
}
Once the attachments are indexed, you can query the text content and metadata like any other fields in Elasticsearch.
To search for documents containing a specific keyword in the attachment content, use a simple search query:
curl -X GET "localhost:9200/myindex/_search" -H 'Content-Type: application/json' -d'
{
"query": {
"match": {
"attachment.content": "keyword"
}
}
}'
Output:
The response will include documents where the keyword is found in the extracted content:
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "myindex",
"_id": "1",
"_score": 1.0,
"_source": {
"attachment": {
"content": "This is the content of the attachment...",
"content_type": "application/pdf",
"language": "en",
"title": "Sample PDF"
}
}
}
]
}
}
You can index multiple attachments in a single document by giving each attachment its own field and processing each field in the pipeline. Because the attachment processor writes its output to the attachment field by default, each processor needs its own target_field; otherwise the second result overwrites the first.
Modify the ingest pipeline to handle multiple attachment fields:
curl -X PUT "localhost:9200/_ingest/pipeline/attachment_pipeline" -H 'Content-Type: application/json' -d'
{
"description": "Extract multiple attachment information",
"processors": [
{
"attachment": {
"field": "data1",
"target_field": "attachment1"
}
},
{
"attachment": {
"field": "data2",
"target_field": "attachment2"
}
},
{
"remove": {
"field": ["data1", "data2"]
}
}
]
}'
Prepare a sample document with two base64-encoded attachments:
{
"data1": "JVBERi0xLjQKJaqrrK0KNCAwIG9iago8PC9MaW5lYXJpemVkIDIgL0wgMTExMTENyCjIwL1UgMzY0MjMvTiAxL1RQIDEwMjcKPj4KZW5kb2JqCjw8L0VuY3J5cHR...",
"data2": "JVBERi0xLjQKJaqrrK0KNCAwIG9iago8PC9MaW5lYXJpemVkIDIgL0wgMTExMTENyCjIwL1UgMzY0MjMvTiAxL1RQIDEwMjcKPj4KZW5kb2JqCjw8L0VuY3J5cHR..."
}
Index this document using the attachment_pipeline:
curl -X PUT "localhost:9200/myindex/_doc/2?pipeline=attachment_pipeline" -H 'Content-Type: application/json' -d'
{
"data1": "JVBERi0xLjQKJaqrrK0KNCAwIG9iago8PC9MaW5lYXJpemVkIDIgL0wgMTExMTENyCjIwL1UgMzY0MjMvTiAxL1RQIDEwMjcKPj4KZW5kb2JqCjw8L0VuY3J5cHR...",
"data2": "JVBERi0xLjQKJaqrrK0KNCAwIG9iago8PC9MaW5lYXJpemVkIDIgL0wgMTExMTENyCjIwL1UgMzY0MjMvTiAxL1RQIDEwMjcKPj4KZW5kb2JqCjw8L0VuY3J5cHR..."
}'
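To search across both attachments at once, one option is a multi_match query over each attachment's content field. The sketch below assumes the two processors write to separate target fields named attachment1 and attachment2, set through the processor's target_field option; without distinct target fields, both processors write to the default attachment field. The helper name multi_attachment_query is hypothetical, and the keyword is inserted without JSON escaping for brevity.

```shell
# Print a multi_match query body searching both attachments' extracted text.
# Assumes target fields attachment1 and attachment2, and a keyword without
# quotes or backslashes (no JSON escaping is done here).
multi_attachment_query() {
  printf '{"query":{"multi_match":{"query":"%s","fields":["attachment1.content","attachment2.content"]}}}' "$1"
}

# Usage against a local node:
#   curl -X GET "localhost:9200/myindex/_search" \
#     -H 'Content-Type: application/json' \
#     -d "$(multi_attachment_query keyword)"
```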
You can also query based on extracted metadata fields such as content type, title, or author.
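Search for documents where the content type is PDF. A match_phrase query is used here because, with the default dynamic mapping, a plain match query splits application/pdf into the terms application and pdf and returns documents containing either one:
curl -X GET "localhost:9200/myindex/_search" -H 'Content-Type: application/json' -d'
{
"query": {
"match_phrase": {
"attachment.content_type": "application/pdf"
}
}
}'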
When dealing with large attachments, it is important to consider the resource usage and performance implications. Elasticsearch provides options to manage these efficiently.
You can limit how much text the attachment processor extracts from each attachment to prevent resource exhaustion. Note that this caps the number of extracted characters, not the size of the uploaded file.
Modify the ingest pipeline to set the extraction limit:
curl -X PUT "localhost:9200/_ingest/pipeline/attachment_pipeline" -H 'Content-Type: application/json' -d'
{
"description": "Extract attachment information with size limit",
"processors": [
{
"attachment": {
"field": "data",
"indexed_chars": 100000
}
},
{
"remove": {
"field": "data"
}
}
]
}'
In this example, indexed_chars is set to 100,000 characters, which matches the processor's default. Lower the value to extract less text from each attachment, or set it to -1 to extract everything.
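Since indexed_chars only affects how much text is extracted, oversized files can also be filtered out on the client before they are encoded and sent. The following is a small sketch of such a guard; the helper name within_size_limit, the file name report.pdf, and the 50 MB cap are all assumptions chosen for illustration.

```shell
# Succeed only if the file is no larger than the given number of bytes.
within_size_limit() {
  size=$(( $(wc -c < "$1") ))   # arithmetic expansion strips wc's padding
  [ "$size" -le "$2" ]
}

# Usage (hypothetical 50 MB cap, below the default 100mb HTTP request limit
# once base64's roughly 4/3 size overhead is added):
#   within_size_limit report.pdf 52428800 && echo "ok to index" || echo "too large"
```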
Index a document with a large attachment:
curl -X PUT "localhost:9200/myindex/_doc/3?pipeline=attachment_pipeline" -H 'Content-Type: application/json' -d'
{
"data": "JVBERi0xLjQKJaqrrK0KNCAwIG9iago8PC9MaW5lYXJpemVkIDIgL0wgMTExMTENyCjIwL1UgMzY0MjMvTiAxL1RQIDEwMjcKPj4KZW5kb2JqCjw8L0VuY3J5cHR..."
}'
Indexing attachments and binary data in Elasticsearch extends its powerful search capabilities to include a wide range of document types and file formats. By leveraging the Ingest Attachment Processor Plugin, you can efficiently extract and index content and metadata from attachments, enhancing the search experience for your users.
This article covered installing and configuring the plugin, setting up ingest pipelines, indexing documents with one or more attachments, querying the extracted content and metadata, and limiting how much text is extracted. With these tools, you can manage and search binary data in your Elasticsearch indices alongside the rest of your documents.