Skip to main content

Data Engineering Case Studies

Internet of Things (IoT)

Challenges

The IoT industry faces several challenges, including managing the vast amount of data generated by connected devices, real-time analysis and alerting, and efficient storage and archival of this data. In addition to the challenges related to big data and real-time analysis, the IoT industry often grapples with the complexity of maintaining cron jobs over multiple microservices, managing alerting, handling retries, and managing dependencies efficiently.

Solution

After careful consideration of various solutions such as Vernemq, RabbitMQ, Cassandra, Influx, BigQuery etc the team opted for a combination of Kafka and Druid. Kafka efficiently handles the high volume of data, providing real-time streaming capabilities. Druid, on the other hand, was chosen for its powerful analytics and storage capabilities. The integration of Kafka and Druid enabled the organization to overcome challenges related to big data, real-time analysis, and data storage, providing a robust and scalable solution for their IoT data needs.

To address the cron challenges, we implemented Apache Airflow. Airflow provided a centralized platform for managing, scheduling, and monitoring cron jobs across various microservices. Its DAG (Directed Acyclic Graph) structure allowed for defining workflows, handling dependencies, and automating retries in case of failures. This streamlined approach significantly improved the management of cronjobs, alerting mechanisms, and the overall reliability of the IoT system.

Outcomes

  • Setup Kafka, Druid, Airflow in HA mode, increasing the resiliency and uptime of the system
  • Migrated 10 TB of data from old databases to Druid
  • Reduced query time from hours to seconds and minutes
  • Real time processing of 1 gbps of streaming data
  • Migrated around 100 crons from disparate sources to Airflow

Tools Used

Kafka + Kafka Connectors + Kafka Manager, Druid + S3, Airflow, Kubernetes + Terraform

Client Name - Zenatix Solutions

A leading IOT firm focussed on powered energy and asset management solutions for small and mid-sized buildings.

Financial Technology (Fintech)

Challenges

In the Fintech sector, two critical challenges are fraud detection and credit risk analysis. Real-time monitoring and analysis of transaction data are essential for identifying potential fraud, and effective credit risk analysis requires a comprehensive data warehouse solution.

Solution

To address these challenges, the organization implemented Redis Streams for real-time data processing, S3 + Athena for efficient data modeling, and AWS DMS for seamless data migration from OLTP RDS to OLAP Redshift. The use of PowerBI for data analysis further enhanced the organization's capabilities in making informed decisions. This comprehensive solution enabled the Fintech company to detect and prevent fraud in real-time and conduct thorough credit risk analysis, all while leveraging the power of AWS services.

Outcomes

  • Setup of end to end data streaming and analytics pipeline
  • Setup Kafka, Redis, AWS RDS, AWS DMS, AWS Redshift, S3 + Athena
  • Reduced query time from hours to seconds and minutes
  • Reduced ETL and Reporting time from days to real time
  • Scaled the stack and team to increase loan disbursals from 100K USD to 60M USD per month
  • Increased the resiliency and stability of the system many fold. Achieved 99.99% of infrastructure uptime
  • Reduced NPAs from 9% to 6%, using multiple fraud policies and algorithms throughout the process
  • Migrated around 300 crons from disparate sources to Airflow

Tools Used

Redis Streams, S3 + Athena, AWS RDS, AWS DMS, AWS Redshift + AWS Redshift Spectrum, PowerBI, Airflow, Kubernetes + Terraform

Client Name - Stashfin

Cryptocurrency and Web3

Challenges

The cryptocurrency and web3 industry face unique challenges, including the need to analyze both internal and on-chain data in a cost-effective manner. With the decentralized nature of blockchain technology, efficient data modeling becomes crucial for actionable insights.

Solution

To overcome these challenges, the organization implemented Databricks and Tableau. This combination provided a robust platform for analyzing internal and on-chain data efficiently. The implementation included a cost-effective data modeling strategy using bronze, silver, and gold layers, ensuring that the organization could derive valuable insights from their data without compromising on cost efficiency. This solution empowered the cryptocurrency and web3 company to navigate the complexities of blockchain data analysis effectively.

Outcomes

  • Reduced ETL and Reporting time from days to real time
  • Reduced data engineering and data analytics cost by 50% by optimizing time of running ETLs
  • Simplified all reporting and ETLs by using single Databricks account for all processing and analytics

Tools Used

Databricks + Pyspark, Tableau

Client - Bake.io

Ads Platform

Challenges

The ads platform industry contends with the immense volume of data generated from ad impressions, clicks, and user interactions. Challenges include managing ETL (Extract, Transform, Load) processes, creating data warehouses, defining schemas, and overall administration of data pipelines.

Solution

To overcome these challenges, the organization implemented Snowflake and Snowpipe. Snowflake's cloud-based data warehousing and Snowpipe's automatic data ingestion facilitated multiple ETL and ELT processes. This solution allowed the ads platform to efficiently create and manage data warehouses, define schemas dynamically, and automate the flow of data into the system, reducing the administrative burden on the organization.

Client Name - Apple

Tools Used

Snowflake + Terraform

Cloud + Data Migration for Multiple Clients

Challenges

Companies dealing with multiple clients often encounter challenges when migrating data between on-premise and cloud environments. Managing data migration online, minimizing downtime, and supporting various databases (MySQL, MariaDB, PostgreSQL, Oracle, MongoDB, Elasticsearch, Cassandra) pose significant challenges.

Solution

The organization adopted Debezium to address these migration challenges. Debezium, with its CDC (Change Data Capture) capabilities, enabled online data migration with minimal downtime. It supported various databases, ensuring a smooth transition from on-premise to cloud environments for multiple clients, irrespective of the database technology they used. This streamlined migration approach enhanced the flexibility and efficiency of the migration process.

Outcomes

  • Break multiple monolith databases into specialized databases, reducing the cost of storage and processing and increasing query time
  • Real time migration from OLTP databases to OLAP databases and warehouses
  • Moved from on-prem infrastructure to cloud infrastructure and vice-versa based on client requirements

Client Name - Coto World, Akamai, Times

Tools Used

  • Debezium, Kafka, Spark, Terraform, Jenkins, etc
  • DBs - RDMS, MongoDB, Redis, Druid, Elasticsearch, Cassandra, etc
  • Data Warehouses - Clickhouse, Redshift, Snowflake, Databricks, etc