Spark Built-in Libraries

Intro

Apache Spark is a fast, general-purpose cluster computing system for large-scale data processing
It provides high-level APIs in Java, Scala, Python and R

Big data apps lack libraries of common algorithms
Spark's generality and support for multiple languages make it well suited to provide such libraries
Much of Spark's future development activity is expected to be in these libraries

Classification: logistic regression, linear SVM, naive Bayes, classification tree
Regression: generalized linear models (GLMs), regression tree
Collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
Clustering: K-means
Decomposition: SVD, PCA
Optimization: stochastic gradient descent, L-BFGS
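
To make the "Clustering: K-means" entry concrete, here is a minimal single-machine sketch of Lloyd's algorithm in plain Python. It is not Spark's implementation: the function name kmeans, the choice of the first k points as starting centroids, and the sample points are all assumptions made for illustration. A distributed library would spread the assignment step across the cluster and use a more careful initialization.

```python
def kmeans(points, k, iterations=10):
    """Cluster 2-D points with Lloyd's algorithm and return the k centroids.

    Assumption for illustration: the first k points seed the centroids.
    """
    centroids = list(points[:k])
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            nearest = min(
                range(k),
                key=lambda i: (x - centroids[i][0]) ** 2 + (y - centroids[i][1]) ** 2,
            )
            clusters[nearest].append((x, y))
        # Update step: move each centroid to the mean of its members.
        updated = []
        for i, members in enumerate(clusters):
            if members:
                mean_x = sum(px for px, _ in members) / len(members)
                mean_y = sum(py for _, py in members) / len(members)
                updated.append((mean_x, mean_y))
            else:
                # Keep the old centroid if no point was assigned to it.
                updated.append(centroids[i])
        centroids = updated
    return centroids


# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers = sorted(kmeans(points, k=2))
```

With this data the two centroids settle on the mean of each group, even though both starting centroids come from the same group.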

Collaborative Filtering
- Alternating Least Squares
- Stochastic Gradient Descent
- Tensor Factorization
Structured Prediction
- Loopy Belief Propagation
- Max-product linear programs
- Gibbs sampling
Semi-supervised ML
- Graph SSL
- CoEM
Community Detection
- Triangle-Counting
- K-core decomposition
- K-Truss
Graph Analytics
- PageRank
- Personalized PageRank
- Shortest Path
- Graph Coloring
Classification
- Neural Networks
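
As an illustration of the "PageRank" entry under Graph Analytics, the following is a small single-machine sketch of the power-iteration method in plain Python. It is not taken from any Spark or graph library: the function name pagerank, the damping factor of 0.85, the iteration count, and the four-node graph are assumptions chosen for the example. The sketch also assumes every link target appears as a key in the input dictionary.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return a rank per node; links maps each node to the nodes it points to."""
    nodes = list(links)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_ranks = {node: (1.0 - damping) / n for node in nodes}
        for node, outgoing in links.items():
            if outgoing:
                # Split this node's damped rank evenly among its out-links.
                share = damping * ranks[node] / len(outgoing)
                for target in outgoing:
                    new_ranks[target] += share
            else:
                # Dangling node: spread its damped rank over all nodes.
                share = damping * ranks[node] / n
                for target in nodes:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical graph: "c" is linked to by three nodes, "d" by none.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
top = max(ranks, key=ranks.get)
```

Here "c" ends up with the highest rank because three nodes link to it, while "d" keeps only the random-jump share because nothing links to it.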

Spark SQL enables loading and querying structured data in Spark

From Hive:

from pyspark.sql import HiveContext

# sc is an existing SparkContext
c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()

From JSON:

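# Load the JSON records and register them as a table named "tweets"
c.jsonFile("tweets.json").registerAsTable("tweets")
# Nested fields are addressed with dot notation (user.name)
c.sql("select text, user.name from tweets")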
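
The JSON query above relies on a nested field, user.name. The sketch below shows, in plain Python without Spark, the record shape such a query assumes and what the selection returns conceptually. The two sample tweets are hypothetical, as is the variable naming; the only assumption about the file itself is that it holds one JSON object per line.

```python
import json

# Hypothetical contents of tweets.json: one JSON object per line.
raw_lines = [
    '{"text": "Spark is fast", "user": {"name": "alice"}}',
    '{"text": "Trying Spark SQL", "user": {"name": "bob"}}',
]

# Parse each line into a dictionary with a nested "user" dictionary.
records = [json.loads(line) for line in raw_lines]

# Plain-Python counterpart of: select text, user.name from tweets
selected = [(record["text"], record["user"]["name"]) for record in records]
```

Each result row pairs the tweet text with the name taken from the nested user object, which is what the dot notation in the SQL query expresses.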