Others
Data Types
DecimalType()
Represents arbitrary-precision signed decimal numbers. Backed internally by java.math.BigDecimal. A BigDecimal consists of an arbitrary-precision integer unscaled value and a 32-bit integer scale.
But what does that really mean? Let's break it down:
DecimalType() stores two parameters (precision and scale), which avoids storing trailing zeros.
- Precision - Number of digits in the unscaled value
- Unscaled value - Value without the decimal point (e.g. for 4.33 the unscaled value is 433)
- Scale - Number of digits to the right of the decimal point (e.g. for 4.33 the scale is 2)
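The three terms above can be checked with Python's standard `decimal` module, which uses the same unscaled-value-plus-scale idea as `java.math.BigDecimal`. This is only a sketch for finite values: the helper name `decimal_parts` is made up for illustration and is not part of Spark or PySpark.

```python
from decimal import Decimal


def decimal_parts(text):
    """Return (unscaled value, precision, scale) for a finite decimal literal."""
    sign, digits, exponent = Decimal(text).as_tuple()
    # Unscaled value: the digits read as one integer, ignoring the decimal point.
    unscaled = int("".join(str(d) for d in digits))
    if sign:
        unscaled = -unscaled
    precision = len(digits)  # number of digits in the unscaled value
    scale = -exponent        # number of digits to the right of the decimal point
    return unscaled, precision, scale


print(decimal_parts("4.33"))  # (433, 3, 2)
```

So 4.33 would fit in a column declared with precision 3 and scale 2.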
Optimization
https://towardsdatascience.com/apache-spark-optimization-toolkit-17cf3e491992
https://github.com/aws-samples/aws-glue-samples/blob/master/examples/join_and_relationalize
https://thedataguy.in/aws-glue-custom-output-file-size-and-fixed-number-of-files
- Option 1: groupFiles
- Option 2: groupFiles while reading from S3
- Option 3: Repartition
Performance Tuning - Spark 3.3.2 Documentation
Bucketing
Bucketing is an optimization technique in Apache Spark SQL. Data is allocated among a specified number of buckets, according to values derived from one or more bucketing columns. Bucketing improves performance by shuffling and sorting data prior to downstream operations such as table joins. The tradeoff is the initial overhead due to shuffling and sorting, but for certain data transformations, this technique can improve performance by avoiding later shuffling and sorting.
This technique is useful for dimension tables, which are frequently used tables containing primary keys. It is also useful when there are frequent join operations involving large and small tables.
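The idea can be sketched in plain Python without a cluster. This is a toy model under stated assumptions, not Spark's implementation: `zlib.crc32` stands in for Spark's internal hash function, and `bucket_of`, `bucketize`, `bucketed_join` and the sample tables are hypothetical names made up for this note. In Spark itself, bucketing is applied at write time, for example with `DataFrameWriter.bucketBy` followed by `saveAsTable`.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, the same for both tables


def bucket_of(key, num_buckets=NUM_BUCKETS):
    # Deterministic stand-in hash; Spark uses its own hash function.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def bucketize(rows, num_buckets=NUM_BUCKETS):
    """Pay the cost once: group rows by bucket and sort each bucket by key."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[0], num_buckets)].append(row)
    for bucket_rows in buckets.values():
        bucket_rows.sort(key=lambda r: r[0])
    return buckets


def bucketed_join(left, right):
    """Join bucket i of `left` only with bucket i of `right`."""
    out = []
    for bucket_id, left_rows in left.items():
        right_by_key = defaultdict(list)
        for r in right.get(bucket_id, []):
            right_by_key[r[0]].append(r)
        for l in left_rows:
            for r in right_by_key.get(l[0], []):
                out.append((l[0], l[1], r[1]))
    return out


customers = [(1, "Ana"), (2, "Bo"), (3, "Cy")]                   # small dimension table
orders = [(1, "book"), (3, "pen"), (3, "ink"), (4, "cup")]       # larger table
joined = bucketed_join(bucketize(customers), bucketize(orders))
print(sorted(joined))
```

Because equal keys always land in the same bucket number on both sides, the join never has to look outside the matching bucket, which is the shuffle that bucketing avoids later.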
Best Practices for Bucketing in Spark SQL | by David Vrba | Towards Data Science
Shuffling
Apache Spark processes queries by distributing data over multiple nodes and calculating the values separately on every node. However, occasionally, the nodes need to exchange the data. After all, that’s the purpose of Spark - processing data that doesn’t fit on a single machine.
Shuffling is the process of exchanging data between partitions. As a result, data rows can move between worker nodes when their source partition and target partition reside on different machines.
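A minimal sketch of what a shuffle does, again in plain Python rather than Spark: each row is sent to the partition chosen by hashing its key, so all rows with the same key end up together. The function names, the `zlib.crc32` hash and the sample data are assumptions for illustration only.

```python
import zlib


def target_partition(key, num_partitions):
    # Deterministic stand-in for the hash partitioner.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def shuffle(partitions):
    """Redistribute rows by key; return the new partitions and how many rows moved."""
    n = len(partitions)
    result = [[] for _ in range(n)]
    moved = 0
    for source, rows in enumerate(partitions):
        for row in rows:
            target = target_partition(row[0], n)
            if target != source:
                moved += 1  # on a cluster this row may cross the network
            result[target].append(row)
    return result, moved


# Each key starts out spread over two different partitions.
before = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)], [("b", 5), ("c", 6)]]
after, moved = shuffle(before)
print(after, moved)
```

After the shuffle every key lives in exactly one partition, which is what operations such as a group-by or a join on that key need. The `moved` counter is a rough proxy for the cost of the exchange.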
What is shuffling in Apache Spark, and when does it happen? | Bartosz Mikulski
Spark Basics | Shuffling - YouTube
Spark SQL Shuffle Partitions - Spark By Examples
35. Databricks & Spark: Interview Question - Shuffle Partition - YouTube
SparkML
https://spark.apache.org/docs/latest/ml-pipeline.html
https://towardsdatascience.com/a-neanderthals-guide-to-apache-spark-in-python-9ef1f156d427