Analytics

Amazon Athena - Query Data in S3 using SQL
Amazon Kinesis
amazon-msk
Amazon DevOps Guru
aws-sns
amazon-opensearch
- centralized-logging-with-opensearch
amazon-sqs
differences-comparisons

Amazon EMR

Hosted Hadoop Framework

Easily Run and Scale Apache Spark, Hadoop, HBase, Presto, Hive, and other Big Data Frameworks

Amazon EMR is the industry leading cloud-native big data platform for processing vast amounts of data quickly and cost-effectively at scale. Using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi (Incubating), and Presto, coupled with the dynamic scalability of Amazon EC2 and scalable storage of Amazon S3, EMR gives analytical teams the engines and elasticity to run Petabyte-scale analysis for a fraction of the cost of traditional on-premises clusters. EMR gives teams the flexibility to run use cases on single-purpose, short lived clusters that automatically scale to meet demand, or on long running highly available clusters using the new multi-master deployment mode. If you have existing on-premises deployments of open source tools such as Apache Spark and Apache Hive, you can also run EMR clusters on AWS Outposts, giving you both the ability to scale out on-premises via Outposts or in the cloud.

Amazon EMR makes it simple and cost effective to run highly distributed processing frameworks such as Hadoop, Spark, and Presto when compared to on-premises deployments. Amazon EMR is flexible – you can run custom applications and code, and define specific compute, memory, storage, and application parameters to optimize your analytic requirements.

In addition to running SQL queries, Amazon EMR can run a wide variety of scale-out data processing tasks for applications such as machine learning, graph analytics, data transformation, streaming data, and virtually anything you can code. You should use Amazon EMR if you use custom code to process and analyze extremely large datasets with the latest big data processing frameworks such as Spark, Hadoop, Presto, or Hbase. Amazon EMR gives you full control over the configuration of your clusters and the software installed on them.

You can use Amazon Athena to query data that you process using Amazon EMR. Amazon Athena supports many of the same data formats as Amazon EMR. Athena's data catalog is Hive metastore compatible. If you use EMR and already have a Hive metastore, you can run your DDL statements on Amazon Athena and query your data immediately without affecting your Amazon EMR jobs.

When should I use Athena? - Amazon Athena

EMR Serverless - Apache Spark 4.0

EMR Spark 4.0 is 5.4x faster than OSS Spark 4.0

Amazon EMR WAL

Apache HBase Write Ahead Log allows recording all changes to data to file-based storage. With Amazon EMR on EC2, you can write your Apache HBase write-ahead logs to the Amazon EMR WAL, a durable managed storage layer that outlives your cluster. In the event that your cluster, or in the rare cases that the Availability Zone becomes unhealthy or unavailable, you can create a new cluster, point it to the same Amazon S3 root directory and Amazon EMR WAL workspace, and automatically recover the data in WAL within a few minutes.

Write-ahead logs (WAL) for Amazon EMR - Amazon EMR

HBase on Amazon S3 (Amazon S3 storage mode) - Amazon EMR

Node Types

Primary node

The primary node manages the cluster and typically runs primary components of distributed applications. For example, the primary node runs the YARN ResourceManager service to manage resources for applications. It also runs the HDFS NameNode service, tracks the status of jobs submitted to the cluster, and monitors the health of the instance groups.

Core nodes

Core nodes are managed by the primary node. Core nodes run the Data Node daemon to coordinate data storage as part of the Hadoop Distributed File System (HDFS). They also run the Task Tracker daemon and perform other parallel computation tasks on data that installed applications require. For example, a core node runs YARN NodeManager daemons, Hadoop MapReduce tasks, and Spark executors.

Task nodes

You can use task nodes to add power to perform parallel computation tasks on data, such as Hadoop MapReduce tasks and Spark executors. Task nodes don't run the Data Node daemon, nor do they store data in HDFS. As with core nodes, you can add task nodes to a cluster by adding Amazon EC2 instances to an existing uniform instance group or by modifying target capacities for a task instance fleet.

Understand node types: primary, core, and task nodes - Amazon EMR

Amazon CloudSearch

Managed Search Service

Amazon ElasticSearch Service

Run and Scale Elasticsearch Clusters

Amazon Redshift

Fast, Simple, Cost-effective Data Warehousing

Amazon Quicksight

Fast Business Analytics Service

AWS Data Pipeline

Orchestration Service for Periodic, Data-driven Workflows

Not available in Mumbai Region

AWS Glue

Perpare and Load Data

AWS Glue DataBrew

AWS Glue DataBrew is a fully managed, serverless data preparation tool that enables users to visually clean, transform, and normalize raw datasets—all without writing any code. It is specifically designed for data analysts, business users, and data engineers who need to prepare data for analytics or machine learning workflows using an intuitive point-and-click interface.

In the scenario, the company ingests Apache Parquet files into Amazon S3 and needs to apply multiple transformation steps such as anomaly filtering, standardizing date formats, and computing aggregates. With DataBrew, users can perform these tasks through a drag-and-drop Ul and configure complex workflows using reusable "recipes"-collections of ordered transformation steps that can be saved, versioned, and shared with others in the organization.

Furthermore, DataBrew provides built-in data profiling capabilities. It automatically generates column-level statistics such as distributions, cardinality, null value counts, standard deviations, and outlier detection. This visibility enables better data quality assessment before the data is used for reporting or machine learning.

DataBrew's recipe model supports auditability and traceability, satisfying the requirement for data lineage. All transformation steps are recorded and can be exported or replicated across projects.

The output of a DataBrew job is written back to Amazon S3, keeping the transformed data accessible for downstream tools like Amazon Athena, Redshift, or QuickSight.

Because it's serverless, there's no infrastructure to manage, and it integrates seamlessly with other AWS services. This makes it ideal for teams that want a no-code, collaborative, auditable, and profile-rich data transformation tool for Parquet-based data lakes in S3.

What is AWS Glue DataBrew? - AWS Glue DataBrew

Enrich datasets for descriptive analytics with AWS Glue DataBrew | AWS Big Data Blog

AWS Managed Streaming for Apache Kafka

Fully managed Apache Kafka service

AWS Data Exchange

Easily find and subscribe to third-party data in the cloud

Patterns for enterprise data sharing at scale | AWS Big Data Blog

AWS Lake Formation - Build a secure data lake in days

AWS Lake Formation is a service that makes it easy to set up a secure data lake in days. A data lake is a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis. A data lake enables you to break down data silos and combine different types of analytics to gain insights and guide better business decisions.

However, setting up and managing data lakes today involves a lot of manual, complicated, and time-consuming tasks. This work includes loading data from diverse sources, monitoring those data flows, setting up partitions, turning on encryption and managing keys, defining transformation jobs and monitoring their operation, re-organizing data into a columnar format, configuring access control settings, deduplicating redundant data, matching linked records, granting access to data sets, and auditing access over time.

Creating a data lake with Lake Formation is as simple as defining data sources and what data access and security policies you want to apply. Lake Formation then helps you collect and catalog data from databases and object storage, move the data into your new Amazon S3 data lake, clean and classify your data using machine learning algorithms, and secure access to your sensitive data. Your users can access a centralized data catalog which describes available data sets and their appropriate usage. Your users then leverage these data sets with their choice of analytics and machine learning services, like Amazon Redshift, Amazon Athena, and (in beta)Amazon EMR for Apache Spark. Lake Formation builds on the capabilities available in AWS Glue.

https://aws.amazon.com/lake-formation

Amazon EMR​

Amazon EMR WAL​

Node Types​

Primary node​

Core nodes​

Task nodes​

Amazon CloudSearch​

Amazon ElasticSearch Service​

Amazon Redshift​

Amazon Quicksight​

AWS Data Pipeline​

AWS Glue​

AWS Glue DataBrew​

AWS Managed Streaming for Apache Kafka​

AWS Data Exchange​

AWS Lake Formation - Build a secure data lake in days​