Data Science in Spark with sparklyr : : CHEAT SHEET
Intro

sparklyr is an R interface for Apache Spark™. sparklyr enables us to write all of our analysis code in R, but have the actual processing happen inside Spark clusters. Easily manipulate and model large-scale data using R and Spark via sparklyr.

A typical workflow (see R for Data Science, Grolemund & Wickham):

• Import: from R (copy_to()), read a file (spark_read_*()), or read a Hive table (tbl())
• Wrangle: dplyr verbs, direct Spark SQL (DBI), or feature transformers (ft_*())
• Visualize: collect results and plot in R, or use dbplot
• Model: Spark MLlib (ml_*()) or the H2O extension
• Communicate: collect results into R, share using rmarkdown
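A minimal end-to-end sketch of this workflow, assuming a local Spark installation (spark_install() can set one up):

library(sparklyr)
library(dplyr)

# Connect to a local Spark cluster
sc <- spark_connect(master = "local")

# Import: copy an R data frame into Spark
mtcars_tbl <- copy_to(sc, mtcars)

# Wrangle in Spark, then collect the small result into R
result <- mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

# When finished: spark_disconnect(sc)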
Import

Import data into Spark, not R.

READ A FILE INTO SPARK
Arguments that apply to all spark_read_*() functions:
sc, name, path, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE

CSV         spark_read_csv(header = TRUE, columns = NULL, infer_schema = TRUE, delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL)
JSON        spark_read_json()
PARQUET     spark_read_parquet()
TEXT        spark_read_text()
HIVE TABLE  spark_read_table()
ORC         spark_read_orc()
LIBSVM      spark_read_libsvm()
JDBC        spark_read_jdbc()
DELTA       spark_read_delta()
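A minimal sketch of reading a CSV file into Spark (sc is an open connection from spark_connect(); the path is illustrative):

# Register data/flights.csv as the Spark table "flights"
flights_tbl <- spark_read_csv(
  sc,
  name = "flights",
  path = "data/flights.csv",   # hypothetical path
  header = TRUE,
  infer_schema = TRUE,
  memory = TRUE                # cache the table in Spark memory
)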
R DATA FRAME INTO SPARK
dplyr::copy_to(dest, df, name)

FROM A TABLE IN HIVE
dplyr::tbl(sc, …) - Creates a reference to the table without loading it into memory

Wrangle

DPLYR VERBS
dplyr verbs translate into Spark SQL statements: the compute is pushed to Spark, and only the results are collected back into R.

copy_to(sc, mtcars) %>%
  mutate(trm = ifelse(am == 0, "auto", "man")) %>%
  group_by(trm) %>%
  summarise_all(mean)
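Direct Spark SQL (DBI), the second wrangling option from the workflow above, runs a query against the same registered table; a minimal sketch:

library(DBI)
# Returns the query result as an R data frame
dbGetQuery(sc, "SELECT cyl, AVG(mpg) AS mpg_m FROM mtcars GROUP BY cyl")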
FEATURE TRANSFORMERS
ft_binarizer() - Assigns values based on a threshold
ft_bucketizer() - Numeric column to discretized column
ft_count_vectorizer() - Extracts a vocabulary from document collections
ft_discrete_cosine_transform() - 1D discrete cosine transform of a real vector
ft_elementwise_product() - Element-wise product between 2 columns
ft_hashing_tf() - Maps a sequence of terms to their term frequencies using the hashing trick
ft_idf() - Computes the Inverse Document Frequency (IDF) given a collection of documents
ft_imputer() - Imputation estimator for completing missing values, using the mean or the median of the columns
ft_index_to_string() - Index labels back to labels as strings
ft_interaction() - Takes in Double and Vector type columns and outputs a flattened vector of their feature interactions
ft_max_abs_scaler() - Rescales each feature individually to the range [-1, 1]
ft_min_max_scaler() - Rescales each feature individually to a common range [min, max] linearly
ft_ngram() - Converts the input array of strings into an array of n-grams
ft_bucketed_random_projection_lsh() | ft_minhash_lsh() - Locality Sensitive Hashing functions for Euclidean distance and Jaccard distance (MinHash)
ft_normalizer() - Normalizes a vector to have unit norm using the given p-norm
ft_one_hot_encoder() - Continuous to binary vectors
ft_pca() - Projects vectors to a lower-dimensional space of the top k principal components
ft_quantile_discretizer() - Continuous to binned categorical values
ft_regex_tokenizer() - Extracts tokens by using the provided regex pattern to split the text
ft_standard_scaler() - Removes the mean and scales to unit variance using column summary statistics
ft_stop_words_remover() - Filters out stop words from input
ft_string_indexer() - Column of labels into a column of label indices
ft_tokenizer() - Converts to lowercase and then splits by white spaces
ft_vector_assembler() - Combines vectors into a single row-vector
ft_vector_indexer() - Indexes categorical feature columns in a dataset of Vector
ft_vector_slicer() - Takes a feature vector and outputs a new feature vector with a subarray of the original features
ft_word2vec() - Word2Vec transforms a word into a code
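Feature transformers work directly inside a dplyr pipeline. A minimal sketch using ft_binarizer() (column names from mtcars; the threshold is illustrative):

copy_to(sc, mtcars) %>%
  ft_binarizer(input_col = "mpg",
               output_col = "over_20",
               threshold = 20) %>%   # over_20 is 1 when mpg > 20, else 0
  select(mpg, over_20)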
Visualize

Summarize in Spark, collect the results into R, then create the plot in R.

DPLYR + GGPLOT2

copy_to(sc, mtcars) %>%
  group_by(cyl) %>%                    # summarize in Spark
  summarise(mpg_m = mean(mpg)) %>%
  collect() %>%                        # collect results into R
  ggplot() +
  geom_col(aes(cyl, mpg_m))            # create plot in R

DBPLOT

copy_to(sc, mtcars) %>%
  dbplot_histogram(mpg) +
  labs(title = "Histogram of MPG")

dbplot_histogram(data, x, bins = 30, binwidth = NULL) - Calculates the histogram bins in Spark and plots in ggplot2
dbplot_raster(data, x, y, fill = n(), resolution = 100, complete = FALSE) - Visualizes 2 continuous variables; use instead of geom_point()
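A minimal dbplot_raster() sketch over the same table (wt and mpg are mtcars columns):

library(dbplot)
copy_to(sc, mtcars) %>%
  dbplot_raster(wt, mpg)   # grid computed in Spark, plotted in ggplot2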
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more at spark.rstudio.com • sparklyr 1.0.4.9002 • Updated: 2019-10
Modeling

REGRESSION
ml_linear_regression() - Linear regression
ml_aft_survival_regression() - Parametric survival regression model named accelerated failure time (AFT) model

CLUSTERING
ml_bisecting_kmeans() - A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar
ml_lda() | ml_describe_topics() | ml_log_likelihood() | ml_log_perplexity() | ml_topics_matrix() - LDA topic model designed for text documents

UTILITIES
ml_call_constructor() - Identifies the associated sparklyr ML constructor for the JVM
ml_standardize_formula() - Generates a formula string from user inputs, to be used in the `ml_model` constructor
ml_model_data() - Extracts data associated with a Spark ML model
ml_uid() - Extracts the UID of an ML object
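A minimal modeling sketch with the formula interface (mtcars columns; fit and predictions stay in Spark):

mtcars_tbl <- copy_to(sc, mtcars)
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
summary(fit)                  # coefficients and fit statistics
ml_predict(fit, mtcars_tbl)   # adds a prediction column in Spark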