Aggregation Queries
AVG
AVG() is an aggregate function that returns the average value for a numeric column.
SELECT AVG(column_name)
FROM table_name;
COUNT
COUNT() is a function that takes the name of a column as an argument and counts the number of rows where the column is not NULL.
SELECT COUNT(column_name)
FROM table_name;
-- To count all rows
SELECT COUNT(*) FROM table_name;
DIFFERENCE
SELECT MAX(POPULATION) - MIN(POPULATION)
FROM CITY;
MAX
MAX() is a function that takes the name of a column as an argument and returns the largest value in that column.
SELECT MAX(column_name)
FROM table_name;
MIN
MIN() is a function that takes the name of a column as an argument and returns the smallest value in that column.
SELECT MIN(column_name)
FROM table_name;
ROUND
ROUND() is a function that takes a column name and an integer as an argument. It rounds the values in the column to the number of decimal places specified by the integer.
SELECT ROUND(column_name, integer)
FROM table_name;
SUM
SUM() is a function that takes the name of a column as an argument and returns the sum of all the values in that column.
SELECT SUM(column_name)
FROM table_name;
GROUP BY
GROUP BY is a clause in SQL that is only used with aggregate functions. It is used in collaboration with the SELECT statement to arrange identical data into groups.
The GROUP BY statement comes after any WHERE statements, but before ORDER BY or LIMIT
SELECT COUNT(*)
FROM table_name
GROUP BY column_name;
SELECT ROUND(imdb_rating), COUNT(name) FROM movies GROUP BY 1 ORDER BY 1;
-- Here, the 1 refers to the first column in our SELECT statement, ROUND(imdb_rating)
-- group by all
SELECT car_model, count(DISTINCT city) AS count FROM dealer GROUP BY ALL;
GROUP BY ALL
A shorthand notation to add all SELECT
-list expressions not containing aggregate functions as group_expression
s. If no such expression exist GROUP BY ALL
is equivalent to omitting the GROUP BY
clause which results in a global aggregation.
GROUP BY clause | Databricks on AWS
Guidelines
- All the dependent columns or columns used in GROUP BY function must form the basis of grouping, hence must be included in GROUP BY clause also
- GROUP BY clause does not support the use of column alias, but the actual names
- GROUP BY clause can only be used with aggregate functions like SUM, AVG, COUNT, MAX, and MIN
- Aggregate functions cannot be used in a GROUP BY clause
- Use Inner query for solving this
- Example -
SELECT
cp.CoinId,
cp.createdAt AS max_createdAt,
COUNT(*) AS count_duplicates
FROM
coin cp
INNER JOIN (
SELECT
CoinId,
MAX(createdAt) AS max_createdAt
FROM
coin
GROUP BY
CoinId
) cp_max
ON cp.CoinId = cp_max.CoinId AND cp.createdAt = cp_max.max_createdAt
GROUP BY
cp.CoinId,
cp.createdAt
HAVING
COUNT(*) > 1;
Is this a valid query?
SELECT city, COUNT(*) FROM census WHERE COUNT(*) > 5
GROUP BY city ORDER BY COUNT(*) DESC
no, GROUP BY happens after WHERE, so you can't use aggregations in a WHERE
Use HAVING COUNT(*) > 5
instead instead of WHERE
Examples
SELECT
CONCAT(HOUR(create_date),
':00-',
HOUR(create_date) + 1,
':00') AS Hours,
COUNT(*) AS `usage`
FROM
communication_log
WHERE
channel = 'sms'
AND create_date BETWEEN '2021-08-25 00:00:00' AND NOW()
GROUP BY HOUR(create_date);
SELECT create_date, count(*)
FROM communication_log
WHERE
channel = 'sms'
AND create_date BETWEEN '2021-08-25 00:00:00' AND NOW()
GROUP BY hour( create_date ) , day( create_date );
SELECT YEAR(create_date),
MONTH(create_date),
count(id)
FROM st_customer_detail
GROUP BY YEAR(create_date),
MONTH(create_date)
ORDER BY YEAR(create_date) DESC, MONTH(create_date) DESC;
SELECT
template_id, COUNT(*) AS template_count
FROM
(SELECT
customer_id, template_id, COUNT(*) AS count
FROM
communication_log
WHERE
channel = 'sms' AND is_success = 1
AND create_date BETWEEN NOW() - INTERVAL 16 HOUR AND NOW()
GROUP BY customer_id , template_id) inner_query
WHERE
count > 2
GROUP BY template_id
ORDER BY template_count DESC;
-- Department wise bifurcation
select a.channel, a.department, a.name, count(*) from (
SELECT ct.department, ct.channel, ct.name
FROM communication_log cl
LEFT JOIN communication_templates ct ON cl.template_id = ct.name
WHERE cl.id BETWEEN 422020686 AND 435852382
AND is_success = '1'
GROUP BY cl.id
) a group by a.channel, a.department, a.name;
-- average temperature by hour and month
-- output hours in rows and months in columns
-- transform row as column using CASE
SELECT
DAY(exam_date),
AVG(marks),
round(AVG(CASE WHEN MONTH(exam_date) = 1 THEN marks ELSE 0 END),2) jan,
round(AVG(CASE WHEN MONTH(exam_date) = 2 THEN marks ELSE 0 END),2) feb,
round(AVG(CASE WHEN MONTH(exam_date) = 3 THEN marks ELSE 0 END),2) mar,
round(AVG(CASE WHEN MONTH(exam_date) = 4 THEN marks ELSE 0 END),2) apr,
round(AVG(CASE WHEN MONTH(exam_date) = 5 THEN marks ELSE 0 END),2) may,
round(AVG(CASE WHEN MONTH(exam_date) = 6 THEN marks ELSE 0 END),2) june,
round(AVG(CASE WHEN MONTH(exam_date) = 7 THEN marks ELSE 0 END),2) july,
round(AVG(CASE WHEN MONTH(exam_date) = 8 THEN marks ELSE 0 END),2) aug,
round(AVG(CASE WHEN MONTH(exam_date) = 9 THEN marks ELSE 0 END),2) sep,
round(AVG(CASE WHEN MONTH(exam_date) = 10 THEN marks ELSE 0 END),2) oct,
round(AVG(CASE WHEN MONTH(exam_date) = 11 THEN marks ELSE 0 END),2) nov,
round(AVG(CASE WHEN MONTH(exam_date) = 12 THEN marks ELSE 0 END),2) december
FROM
demo_table
GROUP BY MONTH(exam_date) , DAY(exam_date);
-- count vowels in names in a table
select d_name, sum(
if (d_name like '%a%', 1, 0)
+ if( d_name like '%e%', 1, 0)
+ if( d_name like '%i%', 1, 0)
+ if( d_name like '%o%', 1, 0)
+ if( d_name like '%u%', 1, 0)
) as vowel_count from (select distinct name as d_name
from demo_table) a group by a.d_name;
max_by, min_by
Returns the value of an expr1
associated with the maximum value of expr2
in a group.
This function can also be invoked as a window function using the OVER
clause.
This function is non-deterministic if expr2
is not unique within the group.
-- max_by
SELECT max_by(x, y) FROM VALUES (('a', 10)), (('b', 50)), (('c', 20)) AS tab(x, y);
>>> b
-- min_by
SELECT min_by(x, y) FROM VALUES (('a', 10)), (('b', 50)), (('c', 20)) AS tab(x, y);
>>> a
max_by aggregate function | Databricks on AWS
min_by aggregate function | Databricks on AWS
Databricks aggregate functions
Function | Description |
---|---|
any(expr) | Returns true if at least one value of expr in the group is true. |
any_value(expr[,ignoreNull]) | Returns any random value of expr for a group of rows. |
approx_count_distinct(expr[,relativeSD]) | Returns the estimated number of distinct values in expr within the group. |
approx_percentile(expr,percentage[,accuracy]) | Returns the approximate percentile of the expr within the group. |
approx_top_k(expr[,k[,maxItemsTracked]]) | Returns the top k most frequently occurring item values in an expr along with their approximate counts. |
array_agg(expr) | Returns an array consisting of all values in expr within the group. |
avg(expr) | Returns the mean calculated from values of a group. |
bit_and(expr) | Returns the bitwise AND of all input values in the group. |
bit_or(expr) | Returns the bitwise OR of all input values in the group. |
bit_xor(expr) | Returns the bitwise XOR of all input values in the group. |
bool_and(expr) | Returns true if all values in expr are true within the group. |
bool_or(expr) | Returns true if at least one value in expr is true within the group. |
collect_list(expr) | Returns an array consisting of all values in expr within the group. |
collect_set(expr) | Returns an array consisting of all unique values in expr within the group. |
corr(expr1,expr2) | Returns Pearson coefficient of correlation between a group of number pairs. |
count(*) | Returns the total number of retrieved rows in a group, including rows containing null. |
count(expr[, …] ) | Returns the number of rows in a group for which the supplied expressions are all non-null. |
count_if(expr) | Returns the number of true values for the group in expr . |
count_min_sketch(expr, epsilon, confidence, seed) | Returns a count-min sketch of all values in the group in expr with the epsilon , confidence and seed . |
covar_pop(expr1,expr2) | Returns the population covariance of number pairs in a group. |
covar_samp(expr1,expr2) | Returns the sample covariance of number pairs in a group. |
every(expr) | Returns true if all values of expr in the group are true. |
first(expr[,ignoreNull]) | Returns the first value of expr for a group of rows. |
first_value(expr[,ignoreNull]) | Returns the first value of expr for a group of rows. |
hll_sketch_agg(expr[,lgConfigK]) | Returns a HyperLogLog sketch used to approximate a distinct values count. |
hll_union_agg(expr[,allowDifferentLgConfigK]) | Aggregates HyperLogLog sketches for a group of rows. |
kurtosis(expr) | Returns the kurtosis value calculated from values of a group. |
last(expr[,ignoreNull]) | Returns the last value of expr for the group of rows. |
last_value(expr[,ignoreNull]) | Returns the last value of expr for the group of rows. |
max(expr) | Returns the maximum value of expr in a group. |
max_by(expr1,expr2) | Returns the value of an expr1 associated with the maximum value of expr2 in a group. |
mean(expr) | Returns the mean calculated from values of a group. |
median(expr) | Returns the median calculated from values of a group. |
min(expr) | Returns the minimum value of expr in a group. |
min_by(expr1, expr2) | Returns the value of an expr1 associated with the minimum value of expr2 in a group. |
mode(expr) | Returns the most frequent, not NULL , value of expr in a group. |
percentile(expr, percentage [,frequency]) | Returns the exact percentile value of expr at the specified percentage . |
percentile_approx(expr,percentage[,accuracy]) | Returns the approximate percentile of the expr within the group. |
percentile_cont(pct) WITHIN GROUP (ORDER BY key) | Returns the interpolated percentile of the key within the group. |
percentile_disc(pct) WITHIN GROUP (ORDER BY key) | Returns the discrete percentile of the key within the group. |
regr_avgx(yExpr, xExpr) | Returns the mean of xExpr calculated from values of a group where xExpr and yExpr are NOT NULL. |
regr_avgy(yExpr, xExpr) | Returns the mean of yExpr calculated from values of a group where xExpr and yExpr are NOT NULL. |
regr_count(yExpr, xExpr) | Returns the number of non-null value pairs yExpr , xExpr in the group. |
regr_intercept(yExpr, xExpr) | Returns the intercept of the uni-variate linear regression line in a group where xExpr and yExpr are NOT NULL. |
regr_r2(yExpr, xExpr) | Returns the coefficient of determination from values of a group where xExpr and yExpr are NOT NULL. |
regr_slope(yExpr, xExpr) | Returns the slope of the linear regression line of non-null value pairs yExpr , xExpr in the group. |
regr_sxx(yExpr, xExpr) | Returns the sum of squares of the xExpr values of a group where xExpr and yExpr are NOT NULL. |
regr_sxy(yExpr, xExpr) | Returns the sum of products of yExpr and xExpr calculated from values of a group where xExpr and yExpr are NOT NULL. |
regr_syy(yExpr, xExpr) | Returns the sum of squares of the yExpr values of a group where xExpr and yExpr are NOT NULL. |
skewness(expr) | Returns the skewness value calculated from values of a group. |
some(expr) | Returns true if at least one value of expr in a group is true . |
std(expr) | Returns the sample standard deviation calculated from the values within the group. |
stddev(expr) | Returns the sample standard deviation calculated from the values within the group. |
stddev_pop(expr) | Returns the population standard deviation calculated from values of a group. |
stddev_samp(expr) | Returns the sample standard deviation calculated from values of a group. |
sum(expr) | Returns the sum calculated from values of a group. |
try_avg(expr) | Returns the mean calculated from values of a group, NULL if there is an overflow. |
try_sum(expr) | Returns the sum calculated from values of a group, NULL if there is an overflow. |
var_pop(expr) | Returns the population variance calculated from values of a group. |
var_samp(expr) | Returns the sample variance calculated from values of a group. |
variance(expr) | Returns the sample variance calculated from values of a group. |
COUNT_IF
SELECT
COUNT_IF(col1 = 'abc' AND col2 IN ('abc', 'xyz')) AS cUser
FROM table_name
WHERE status = 'ACTIVE'