
Data Science in Spark with sparklyr : : CHEAT SHEET

Intro

sparklyr is an R interface for Apache Spark™. sparklyr enables us to write all of our analysis code in R, but have the actual processing happen inside Spark clusters. Easily manipulate and model large-scale data using R and Spark via sparklyr.

A typical sparklyr workflow:

• Import - Push data into Spark, not R: from R (copy_to()), read a file (spark_read_*()), or read a Hive table (tbl())
• Wrangle - dplyr verbs, direct Spark SQL (DBI), or feature transformers (ft_*())
• Visualize - Collect results and plot in R, or use dbplot
• Model - Spark MLlib (ml_*()) or the H2O extension
• Communicate - Collect results into R and share using rmarkdown

See also: R for Data Science, Grolemund & Wickham.
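
This workflow can be sketched end to end in a few lines. A minimal sketch, assuming a local Spark installation and that the sparklyr and dplyr packages are installed:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")   # open a connection

mtcars_tbl <- copy_to(sc, mtcars)       # import: push data into Spark

mtcars_tbl %>%
  group_by(cyl) %>%                     # wrangle: runs inside Spark
  summarise(mpg_m = mean(mpg)) %>%
  collect()                             # bring the small result back into R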
Import

Import data into Spark, not R.

READ A FILE INTO SPARK
Arguments that apply to all spark_read_*() functions:
sc, name, path, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE

CSV         spark_read_csv(header = TRUE, columns = NULL, infer_schema = TRUE, delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL)
JSON        spark_read_json()
PARQUET     spark_read_parquet()
TEXT        spark_read_text()
HIVE TABLE  spark_read_table()
ORC         spark_read_orc()
LIBSVM      spark_read_libsvm()
JDBC        spark_read_jdbc()
DELTA       spark_read_delta()

R DATA FRAME INTO SPARK
dplyr::copy_to(dest, df, name)

FROM A TABLE IN HIVE
dplyr::tbl(src, ...) - Creates a reference to the table without loading it into memory
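
As a usage sketch, reading a CSV file straight into Spark memory; the path "data/flights.csv" is a hypothetical example file:

flights_tbl <- spark_read_csv(
  sc, name = "flights", path = "data/flights.csv",  # hypothetical path
  header = TRUE, infer_schema = TRUE,
  memory = TRUE                                     # cache the table in Spark memory
)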
Wrangle

DPLYR VERBS
dplyr verbs translate into Spark SQL statements:

copy_to(sc, mtcars) %>%
  mutate(trm = ifelse(am == 0, "auto", "man")) %>%
  group_by(trm) %>%
  summarise_all(mean)

FEATURE TRANSFORMERS
ft_binarizer() - Assigns values based on a threshold
ft_bucketizer() - Numeric column to discretized column
ft_bucketed_random_projection_lsh() | ft_minhash_lsh() - Locality Sensitive Hashing functions for Euclidean distance and Jaccard distance (MinHash)
ft_count_vectorizer() - Extracts a vocabulary from document
ft_discrete_cosine_transform() - 1D discrete cosine transform of a real vector
ft_elementwise_product() - Element-wise product between 2 cols
ft_hashing_tf() - Maps a sequence of terms to their term frequencies using the hashing trick
ft_idf() - Compute the Inverse Document Frequency (IDF) given a collection of documents
ft_imputer() - Imputation estimator for completing missing values, uses the mean or the median of the columns
ft_index_to_string() - Index labels back to labels as strings
ft_interaction() - Takes in Double and Vector type columns and outputs a flattened vector of their feature interactions
ft_max_abs_scaler() - Rescale each feature individually to the range [-1, 1]
ft_min_max_scaler() - Rescale each feature individually to a common range [min, max] linearly
ft_ngram() - Converts the input array of strings into an array of n-grams
ft_normalizer() - Normalize a vector to have unit norm using the given p-norm
ft_one_hot_encoder() - Continuous to binary vectors
ft_pca() - Project vectors to a lower dimensional space of top k principal components
ft_quantile_discretizer() - Continuous to binned categorical values
ft_regex_tokenizer() - Extracts tokens by using the provided regex pattern to split the text
ft_standard_scaler() - Removes the mean and scales to unit variance using column summary statistics
ft_stop_words_remover() - Filters out stop words from input
ft_string_indexer() - Column of labels into a column of label indices
ft_tokenizer() - Converts to lowercase and then splits by white spaces
ft_vector_assembler() - Combine vectors into a single row-vector
ft_vector_indexer() - Indexing categorical feature columns in a dataset of Vector
ft_vector_slicer() - Takes a feature vector and outputs a new feature vector with a subarray of the original features
ft_word2vec() - Word2Vec transforms a word into a code
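
A sketch of chaining feature transformers in a pipeline; the mtcars columns are real, but the threshold and bucket splits are illustrative assumptions:

copy_to(sc, mtcars) %>%
  ft_binarizer("hp", "big_hp", threshold = 100) %>%   # 1 when hp > 100, else 0
  ft_bucketizer("mpg", "mpg_bin",
                splits = c(0, 20, 30, 50)) %>%        # discretize mpg into 3 buckets
  select(hp, big_hp, mpg, mpg_bin)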
Visualize

Summarize in Spark, then collect the results and plot in R.

DPLYR + GGPLOT2

copy_to(sc, mtcars) %>%
  group_by(cyl) %>%
  summarise(mpg_m = mean(mpg)) %>%   # summarize in Spark
  collect() %>%                      # collect results in R
  ggplot() +
  geom_col(aes(cyl, mpg_m))          # create plot

DBPLOT

copy_to(sc, mtcars) %>%
  dbplot_histogram(mpg) +
  labs(title = "Histogram of MPG")

dbplot_histogram(data, x, bins = 30, binwidth = NULL) - Calculates the histogram bins in Spark and plots in ggplot2
dbplot_raster(data, x, y, fill = n(), resolution = 100, complete = FALSE) - Visualize 2 continuous variables. Use instead of geom_point()
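
As a usage sketch of dbplot_raster(), assuming the dbplot package is loaded; the aggregation happens in Spark and only the plot data comes back to R:

library(dbplot)

copy_to(sc, mtcars) %>%
  dbplot_raster(wt, mpg)   # use instead of geom_point() for large data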
Modeling

REGRESSION
ml_linear_regression() - Regression using linear regression
ml_aft_survival_regression() - Parametric survival regression model named accelerated failure time (AFT) model
ml_generalized_linear_regression() - Generalized linear regression model
ml_isotonic_regression() - Currently implemented using the parallelized pool adjacent violators algorithm. Only the univariate (single feature) algorithm is supported
ml_random_forest_regressor() - Regression using random forests

CLASSIFICATION
ml_linear_svc() - Classification using linear support vector machines
ml_logistic_regression() - Logistic regression
ml_multilayer_perceptron_classifier() - Classification model based on the Multilayer Perceptron
ml_naive_bayes() - Naive Bayes classifiers. Supports Multinomial NB, which can handle finitely supported discrete data
ml_one_vs_rest() - Reduction of multiclass classification to binary classification, using the one-against-all strategy

TREE
ml_decision_tree_classifier() | ml_decision_tree() | ml_decision_tree_regressor() - Classification and regression using decision trees
ml_gbt_classifier() | ml_gradient_boosted_trees() | ml_gbt_regressor() - Binary classification and regression using gradient boosted trees
ml_random_forest_classifier() - Classification and regression using random forests
ml_feature_importances(model, ...) | ml_tree_feature_importance(model) - Feature importance for tree models

CLUSTERING
ml_bisecting_kmeans() - A bisecting k-means algorithm
ml_lda() | ml_describe_topics() | ml_log_likelihood() | ml_log_perplexity() | ml_topics_matrix() - LDA topic model designed for text documents
ml_gaussian_mixture() - Expectation maximization for multivariate Gaussian Mixture Models (GMMs)
ml_kmeans() | ml_compute_cost() - K-means clustering with support for k-means|| initialization

FP GROWTH
ml_fpgrowth() | ml_association_rules() | ml_freq_itemsets() - A parallel FP-growth algorithm to mine frequent itemsets

RECOMMENDATION
ml_als() | ml_recommend() - Recommendation using Alternating Least Squares matrix factorization

EVALUATION
ml_clustering_evaluator() - Evaluator for clustering
ml_evaluate() - Compute performance metrics
ml_binary_classification_evaluator() | ml_binary_classification_eval() | ml_classification_eval() - A set of functions to calculate performance metrics for prediction models

FEATURE
ml_chisquare_test(x, features, label) - Pearson's independence test for every feature against the label
ml_default_stop_words() - Loads the default stop words for the given language

STATS
ml_summary() - Extracts a metric from the summary object of a Spark ML model
ml_corr() - Compute correlation matrix. The corrr package integrates with sparklyr:

copy_to(sc, mtcars) %>%
  correlate() %>%
  rplot()

UTILITIES
ml_call_constructor() - Identifies the associated sparklyr ML constructor for the JVM
ml_model_data() - Extracts data associated with a Spark ML model
ml_standardize_formula() - Generates a formula string from user inputs, to be used in the `ml_model` constructor
ml_uid() - Extracts the UID of an ML object
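
A hedged sketch of a typical modeling workflow (split, fit, evaluate, predict); the formula, split ratios, and seed are illustrative choices, not part of the cheat sheet:

partitions <- copy_to(sc, mtcars) %>%
  sdf_random_split(training = 0.7, test = 0.3, seed = 1099)

fit <- partitions$training %>%
  ml_logistic_regression(am ~ mpg + wt)   # am is a binary column in mtcars

ml_evaluate(fit, partitions$test)         # performance metrics on held-out data

ml_predict(fit, partitions$test) %>%      # predictions stay in Spark as a tbl
  select(am, prediction)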
Start a Spark session

LOCAL MODE
No cluster required. Use for learning purposes only.
1. Install a local version of Spark: spark_install("2.3")
2. Open a connection: sc <- spark_connect(master = "local")

STANDALONE CLUSTER
1. Install RStudio Server on one of the existing nodes or a server in the same LAN
2. Install a local version of Spark: spark_install(version = "2.0.1")
3. Open a connection:
sc <- spark_connect(master = "spark://host:port",
                    version = "2.0.1",
                    spark_home = spark_home_dir())

YARN CLIENT
1. Install RStudio Server on one of the existing nodes, preferably an edge node
2. Locate the path to the cluster's Spark Home Directory; it normally is "/usr/lib/spark"
3. Basic configuration example:
conf <- spark_config()
conf$spark.executor.memory <- "300M"
conf$spark.executor.cores <- 2
conf$spark.executor.instances <- 3
conf$spark.dynamicAllocation.enabled <- "false"
4. Open a connection (the base configuration above is included in the example):
sc <- spark_connect(master = "yarn",
                    spark_home = "/usr/lib/spark/",
                    version = "2.1.0",
                    config = conf)

YARN CLUSTER
1. Make sure to have copies of the yarn-site.xml and hive-site.xml files in the RStudio Server
2. Point environment variables to the correct paths:
Sys.setenv(JAVA_HOME = "[Path]")
Sys.setenv(SPARK_HOME = "[Path]")
Sys.setenv(YARN_CONF_DIR = "[Path]")
3. Open a connection: sc <- spark_connect(master = "yarn-cluster")

KUBERNETES
1. Use the following to obtain the Host and Port: system2("kubectl", "cluster-info")
2. Open a connection:
sc <- spark_connect(config = spark_config_kubernetes(
  "k8s://https://[HOST]:[PORT]",
  account = "default",
  image = "docker.io/owner/repo:version",
  version = "2.3.1"))

MESOS
1. Install RStudio Server on one of the nodes
2. Open a connection: sc <- spark_connect(master = "[Mesos URL]")

CLOUD
Databricks - spark_connect(method = "databricks")
Qubole - spark_connect(method = "qubole")
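
Whichever connection method is used, a quick sanity check can confirm the session; spark_version(), spark_web(), and spark_disconnect() are standard sparklyr helpers:

spark_version(sc)      # confirm the Spark version in use
spark_web(sc)          # open the Spark UI in a browser
spark_disconnect(sc)   # close the session when finished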
More Information

spark.rstudio.com • therinspark.com
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more at spark.rstudio.com • sparklyr 1.0.4.9002 • Updated: 2019-10
