
Amazon OpenSearch

OpenSearch

https://github.com/opensearch-project/OpenSearch

Elasticsearch vs Amazon OpenSearch

Getting Started

To get started using OpenSearch Service, you create an OpenSearch Service domain, which is equivalent to an OpenSearch cluster. Each EC2 instance in the cluster acts as one OpenSearch Service node.
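
A domain can also be provisioned programmatically. A minimal sketch using the boto3 opensearch client; the domain name, engine version, and instance sizing below are placeholder assumptions, not recommendations:

import boto3

# Hypothetical example: provision a small OpenSearch Service domain.
# The domain name, engine version, and instance sizing are placeholders.
client = boto3.client("opensearch", region_name="us-east-1")

response = client.create_domain(
    DomainName="my-logs-domain",
    EngineVersion="OpenSearch_2.11",
    ClusterConfig={
        "InstanceType": "r6g.large.search",
        "InstanceCount": 3,  # each EC2 instance acts as one node
        "ZoneAwarenessEnabled": False,
    },
    EBSOptions={
        "EBSEnabled": True,
        "VolumeType": "gp3",
        "VolumeSize": 100,  # GiB per node
    },
)
print(response["DomainStatus"]["ARN"])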

What is Amazon OpenSearch Service? - Amazon OpenSearch Service

OpenSearch Ingestion (OSI)

Amazon OpenSearch Ingestion is a fully managed, serverless data collector that streams real-time logs, metrics, and trace data to Amazon OpenSearch Service domains and OpenSearch Serverless collections.

With OpenSearch Ingestion, you no longer need third-party tools like Logstash or Jaeger to ingest data. You configure your data producers to send data to OpenSearch Ingestion, and it automatically delivers the data to the domain or collection that you specify. You can also transform data before delivery.

Because OpenSearch Ingestion is serverless, you don’t have to manage infrastructure, patch software, or scale clusters manually. You can provision ingestion pipelines directly in the AWS Management Console, and OpenSearch Ingestion handles the rest.
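
Pipelines can also be provisioned with the SDK instead of the console. A minimal sketch using the boto3 osis (OpenSearch Ingestion) client with a bare HTTP-source-to-domain-sink configuration; the pipeline name, domain endpoint, and role ARN are placeholders:

import boto3

# Hypothetical pipeline: HTTP source -> OpenSearch Service domain sink.
# The pipeline name, domain endpoint, and role ARN are placeholders.
pipeline_body = """
version: "2"
log-pipeline:
  source:
    http:
      path: "/logs"
  sink:
    - opensearch:
        hosts: ["https://search-my-logs-domain.us-east-1.es.amazonaws.com"]
        index: "application-logs"
        aws:
          region: "us-east-1"
          sts_role_arn: "arn:aws:iam::123456789012:role/OSIPipelineRole"
"""

osis = boto3.client("osis", region_name="us-east-1")
osis.create_pipeline(
    PipelineName="log-pipeline",
    MinUnits=1,  # Ingestion OCUs scale between MinUnits and MaxUnits
    MaxUnits=4,
    PipelineConfigurationBody=pipeline_body,
)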

As a component of Amazon OpenSearch Service, OpenSearch Ingestion is powered by Data Prepper—an open-source data collector that filters, enriches, transforms, normalizes, and aggregates data for downstream analysis and visualization.

Overview of Amazon OpenSearch Ingestion - Amazon OpenSearch Service

With OpenSearch Ingestion, you can use Amazon S3 as a source or as a destination. When you use Amazon S3 as a source, you send data to an OpenSearch Ingestion pipeline. When you use Amazon S3 as a destination, you write data from an OpenSearch Ingestion pipeline to one or more S3 buckets.
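
A hedged sketch of what the two roles can look like in a pipeline definition, following the Data Prepper s3 source and s3 sink plugins (the YAML is held in a Python string; the queue URL, bucket, and role ARN are placeholders):

# Hedged sketch: S3 in both roles. As a source, the pipeline reads objects
# announced through S3 event notifications on an SQS queue; as a sink, it
# writes processed events back to a bucket. All identifiers are placeholders.
s3_pipeline_body = """
version: "2"
s3-pipeline:
  source:
    s3:
      notification_type: "sqs"
      codec:
        newline:
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/raw-events"
      aws:
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::123456789012:role/OSIPipelineRole"
  sink:
    - s3:
        bucket: "processed-events-bucket"
        object_key:
          path_prefix: "enriched/"
        codec:
          ndjson:
        threshold:
          event_count: 2000
          event_collect_timeout: "60s"
        aws:
          region: "us-east-1"
          sts_role_arn: "arn:aws:iam::123456789012:role/OSIPipelineRole"
"""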

An OSI (OpenSearch Ingestion) processor is a component within an OpenSearch Ingestion pipeline that filters, transforms, enriches, or aggregates data before it is sent to its destination.

Using an OpenSearch Ingestion pipeline with Amazon S3 - Amazon OpenSearch Service

Pipeline

A pipeline is the mechanism that Amazon OpenSearch Ingestion uses to move data from its source (where the data comes from) to its sink (where the data goes). In OpenSearch Ingestion, the sink is typically an Amazon OpenSearch Service domain or an OpenSearch Serverless collection (Amazon S3 can also serve as a sink, as noted above), while the source of your data could be clients like Amazon S3, Fluent Bit, or the OpenTelemetry Collector.

Creating Amazon OpenSearch Ingestion pipelines - Amazon OpenSearch Service

How OSI processors work

  • Data flow: A pipeline starts with a source (like S3 or Kafka), then passes data through one or more processors, and finally sends it to a sink (like OpenSearch Serverless or another S3 bucket).
  • Processor types: Processors perform specific tasks on the data as it moves through the pipeline (a configuration sketch follows this list). Examples include:
    • Transformation: Changing the structure or format of data, using processors such as grok, parse_json, or parse_xml.
    • Enrichment: Adding new information to the data, such as using a machine learning model to generate vector embeddings from text.
    • Aggregation: Grouping data points together, like counting events or calculating statistics.
  • Offline ML: OSI can be used to process large datasets with ML models asynchronously, which is efficient for use cases like creating vector embeddings for search.
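
As an illustration of the transformation case, a processor section might look like the following hedged sketch, using Data Prepper's grok and delete_entries processors; the field name and pattern are placeholders:

# Hedged sketch of a processor section: parse Apache-style access logs with
# grok, then remove the raw field. The field name and pattern are placeholders.
processor_section = """
  processor:
    - grok:
        match:
          message: ['%{COMMONAPACHELOG}']
    - delete_entries:
        with_keys: ["message"]
"""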

AWS Lambda as a processor in OSI Pipelines - YouTube

Features

Amazon OpenSearch Ingestion provisions pipelines, which consist of a source, a buffer, zero or more processors, and one or more sinks. Ingestion pipelines are powered by Data Prepper as the data engine.

Overview of pipeline features in Amazon OpenSearch Ingestion - Amazon OpenSearch Service

Analysis on S3 Files

You can analyze Apache Parquet files stored in Amazon S3 with OpenSearch Service in several ways, depending largely on whether you want to query the data directly in S3 or ingest it into OpenSearch for indexing and analysis.

1. OpenSearch Service Zero-ETL Integration with Amazon S3

This is the most direct and modern approach for analyzing Parquet data in S3 without requiring a separate ETL process to ingest data into OpenSearch.

  • Direct Querying: OpenSearch Service can be configured to directly query data stored in S3 buckets, including Parquet files. This allows you to run analytical queries and visualize insights on your S3 data directly within OpenSearch Dashboards without having to index the entire dataset.
  • Configuration: You configure an S3 data source within OpenSearch Service, defining tables and optionally setting up query acceleration. You then use OpenSearch Dashboards to query and visualize the data (a scripted sketch follows this list).
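
If you prefer to script that configuration, the OpenSearch Service API exposes direct-query data source operations. The sketch below assumes the boto3 add_data_source call with an S3GlueDataCatalog type; the domain name, data source name, and role ARN are placeholders, and the exact parameter shape should be checked against the current SDK:

import boto3

# Hypothetical: register an S3/Glue Data Catalog direct-query data source on
# an existing domain. The domain name, data source name, and role ARN are
# placeholders; the role needs access to AWS Glue, the S3 data, and the
# query result location.
client = boto3.client("opensearch", region_name="us-east-1")

client.add_data_source(
    DomainName="my-logs-domain",
    Name="parquet_lake",
    DataSourceType={
        "S3GlueDataCatalog": {
            "RoleArn": "arn:aws:iam::123456789012:role/DirectQueryRole"
        }
    },
    Description="Zero-ETL direct query over Parquet files in S3",
)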

GitHub - aws-samples/aws-s3-to-opensearch-pipeline

Configuring and querying an S3 data source in OpenSearch Dashboards - Amazon OpenSearch Service

2. Ingesting Parquet Data into OpenSearch Service

If you require the full indexing capabilities of OpenSearch for faster search and complex aggregations, you can ingest the Parquet data from S3 into an OpenSearch domain.

  • AWS Glue: AWS Glue can be used to extract data from Parquet files in S3, transform it (e.g., into JSON), and then load it into your OpenSearch Service domain using a Glue job with an OpenSearch Service connection.
  • OpenSearch Ingestion: OpenSearch Ingestion is a managed service that can read Parquet data from S3, perform transformations, and then ingest it into OpenSearch Service. This is particularly useful for streaming data or for handling large volumes of data from S3.
  • AWS Lambda: For more custom or event-driven ingestion, you can use AWS Lambda functions. A Lambda function can be triggered by S3 events (e.g., new Parquet file uploads), read the Parquet data, transform it into a suitable format (like JSON), and then send it to your OpenSearch Service domain for indexing (a minimal sketch follows this list).
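
A minimal sketch of the Lambda approach, assuming pyarrow and opensearch-py are packaged with the function; the domain endpoint, region, and index name are placeholders:

import io

import boto3
import pyarrow.parquet as pq
from opensearchpy import AWSV4SignerAuth, OpenSearch, RequestsHttpConnection, helpers

# Placeholders: replace with your domain endpoint, region, and index name.
OS_HOST = "search-my-logs-domain.us-east-1.es.amazonaws.com"
REGION = "us-east-1"
INDEX = "parquet-records"

s3 = boto3.client("s3")
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, REGION, "es")

client = OpenSearch(
    hosts=[{"host": OS_HOST, "port": 443}],
    http_auth=auth,
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)

def handler(event, context):
    """Triggered by S3 ObjectCreated events for new Parquet files."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Read the Parquet object into memory and convert its rows to dicts.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        table = pq.read_table(io.BytesIO(body))
        rows = table.to_pylist()

        # Bulk-index the rows into the OpenSearch Service domain.
        actions = ({"_index": INDEX, "_source": row} for row in rows)
        helpers.bulk(client, actions)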

Bedrock OpenSearch

  • AWS OpenSearch SearchOCU keeps hitting the max limit | AWS re:Post
    • My OpenSearch domain exhibited unexpected Search Capacity Unit (SearchOCU) scaling behavior correlated with the number of Collections, even with minimal query activity. After deleting a large number of Collections, retaining only critical Collections totaling less than 5GB, the SearchOCU count decreased to 2. Previously, with a significantly higher number of Collections, the SearchOCU count was substantially inflated, despite low query volume.
    • This observation suggests that the sheer presence of a large number of OpenSearch Collections, independent of active search queries, influences SearchOCU consumption. While I understand the impact of query load on scaling, the mechanism by which the number of Collections drives SearchOCU inflation remains unclear.

Pricing

Optimizations

OpenSearch Data Prepper