
Intro

Spark is a general-purpose distributed data processing engine. In other words: load big data, run computations on it in a distributed way, and store the results.

Spark supports both Scala and Python. Even though Spark itself is written in Scala, I recommend using Python for your Spark jobs, as it is far more popular among Spark users (many data scientists use Python). It will be much easier for you to find documentation and get your questions answered online.

Apache Spark is a cluster computing technology designed for fast, large-scale computation.

The basic structure of a Spark-cluster:

(image: basic structure of a Spark cluster, with a driver program, a cluster manager, and worker nodes running executors)

The cluster manager is not part of the Spark framework itself: even though Spark ships with its own standalone manager, it should not be used in production. Supported cluster managers are Mesos, YARN, and Kubernetes.

The driver program is a Java, Scala, or Python application that is executed on the Spark master.

As part of the program, Spark framework methods are called, and these are executed on the worker nodes.

Each worker node may run multiple executors (as configured; normally one per available CPU core). Each executor receives tasks from the scheduler to execute.
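To make these roles concrete, here is a minimal PySpark sketch of a driver program; the local[2] master URL, the app name, and the executor memory value are illustrative assumptions, not fixed recommendations.

from pyspark.sql import SparkSession

# Driver program: connects to a cluster manager via the master URL.
# "local[2]" runs everything in one process with two worker threads (illustrative).
spark = (
    SparkSession.builder
    .appName("architecture-demo")            # hypothetical app name
    .master("local[2]")                      # or: yarn, k8s://..., spark://...
    .config("spark.executor.memory", "1g")   # memory per executor (illustrative)
    .getOrCreate()
)

sc = spark.sparkContext        # the driver's handle to the cluster
print(sc.defaultParallelism)   # number of task slots the scheduler can use

spark.stop()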

The modules of Apache Spark run directly on top of its core:

(image: the Spark modules, such as Spark SQL, Spark Streaming, MLlib, and GraphX, layered on top of Spark Core)

Spark Abstractions & Concepts

  • RDD: Resilient Distributed Dataset (Immutable)
  • DAG: Directed Acyclic Graph
  • SparkContext
  • Transformations
  • Actions
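As a minimal sketch (assuming a local PySpark installation), the snippet below touches each of these concepts: a SparkContext, an immutable RDD, lazy transformations that extend the DAG, and an action that triggers execution.

from pyspark import SparkContext

sc = SparkContext("local[*]", "concepts-demo")   # SparkContext: entry point to the cluster

rdd = sc.parallelize(range(10))                  # RDD: immutable, partitioned collection
squares = rdd.map(lambda x: x * x)               # transformation: lazy, extends the DAG
even = squares.filter(lambda x: x % 2 == 0)      # another lazy transformation
print(even.collect())                            # action: triggers execution -> [0, 4, 16, 36, 64]

sc.stop()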

Spark Memory Management

Two types of memory

  • Execution memory - used for shuffles, joins, sorts, and aggregations
  • Storage memory - used to cache data that will be reused later (see the sketch below)

Deep Dive: Apache Spark Memory Management - YouTube
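As a rough sketch of how these two regions are configured, the snippet below uses the Spark properties spark.memory.fraction and spark.memory.storageFraction; the values shown are illustrative assumptions, not tuning advice.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("memory-demo")
    .config("spark.executor.memory", "4g")          # total heap per executor (illustrative)
    .config("spark.memory.fraction", "0.6")         # share of the heap for execution + storage
    .config("spark.memory.storageFraction", "0.5")  # part of that share reserved for cached data
    .getOrCreate()
)

rdd = spark.sparkContext.parallelize(range(1_000_000))
rdd.cache()          # occupies storage memory so later actions can reuse the data
print(rdd.count())   # first action materializes and caches the RDD

spark.stop()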

Need for Spark

  • Apache Spark is a big data analytics framework that was originally developed at the University of California, Berkeley's AMPLab, in 2009.
  • It is another system for big data analytics
  • Isn't MapReduce good enough?
    • MapReduce simplifies batch processing on large commodity clusters

Spark Applications

  • Twitter spam classification
  • EM algorithm for traffic prediction
  • K-means clustering
  • Alternating Least Squares matrix factorization
  • In-memory OLAP aggregation on Hive data
  • SQL on Spark

Spark Implementation

Spark ideas

  • Expressive computing system, not limited to the map-reduce model
  • Make use of system memory
    • avoid saving intermediate results to disk
    • cache data for repeated queries (e.g. for machine learning)
  • Compatible with Hadoop

RDD abstraction

  • Partitioned collection of records
  • Spread across the cluster
  • Read-only
  • Caching datasets in memory
    • different storage levels available (see the sketch below)
    • fallback to disk possible
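A minimal caching sketch, assuming a running SparkContext sc and an input file data.txt (both illustrative):

from pyspark import StorageLevel

lines = sc.textFile("data.txt")               # illustrative input path
words = lines.flatMap(lambda line: line.split())

words.persist(StorageLevel.MEMORY_AND_DISK)   # one of the available storage levels: memory first, spill to disk
print(words.count())                          # first pass computes and caches the partitions
print(words.distinct().count())               # second pass reads from the cache
words.unpersist()                             # release the cached partitions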

RDD operations

  • Transformations build RDDs through deterministic operations on other RDDs
    • transformations include map, filter, join
    • lazy operations
  • Actions return a value or export data
    • actions include count, collect, save
    • trigger execution (see the sketch below)
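The sketch below (again assuming a running SparkContext sc) shows the split: join and filter only build up the lineage, while count, collect, and saveAsTextFile actually run it. The data and output path are made up for illustration.

users  = sc.parallelize([(1, "ann"), (2, "bob")])      # illustrative key-value data
scores = sc.parallelize([(1, 10), (2, 20), (2, 30)])

joined = users.join(scores)                            # transformation: nothing runs yet
high = joined.filter(lambda kv: kv[1][1] >= 20)        # still lazy

print(high.count())                                    # action: executes the lineage -> 2
print(high.collect())                                  # action: e.g. [(2, ('bob', 20)), (2, ('bob', 30))]
high.saveAsTextFile("/tmp/high_scores")                # action: exports data (path is illustrative)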

Available APIs

  • You can write in Java, Scala or Python
  • Interactive interpreter: Scala & Python only
  • Standalone applications: any
  • Performance: Java & Scala are faster thanks to static typing

Commands

spark-shell # run the Scala Spark interpreter
pyspark # run the Python Spark interpreter
spark-submit --master local --class GvaWeather target/scala-2.10/gva-weather_2.10-1.0.jar # job submission
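For a PySpark job, the same spark-submit entry point takes the path to the script instead of a jar; the file name below is just an example.

spark-submit --master local my_spark_job.py # submit a Python job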

Submitting Applications - Spark 3.5.1 Documentation

Summary

  • Concepts not limited to single-pass map-reduce
  • Avoid storing intermediate results on disk or HDFS
  • Speed up computations when reusing datasets

Conclusion

  • RDDs provide a simple and efficient programming model
  • Generalized to a broad set of applications
  • Leverages coarse-grained nature of parallel algorithms for failure recovery

References