Data Science in Spark with sparklyr : : CHEAT SHEET
Intro

sparklyr is an R interface for Apache Spark™. sparklyr enables us to write all of our analysis code in R, but have the actual processing happen inside Spark clusters. Easily manipulate and model large-scale data using R and Spark via sparklyr.

A typical workflow (see R for Data Science, Grolemund & Wickham):

• Import: from R (copy_to()), read a file (spark_read_*()), or read a Hive table (tbl())
• Wrangle: dplyr verbs, direct Spark SQL (DBI), or feature transformers (ft_*())
• Visualize: collect results and plot in R, or use dbplot
• Model: Spark MLlib (ml_*()) or the H2O extension
• Communicate: collect results into R, share using rmarkdown
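A minimal end-to-end sketch of this workflow, assuming a local Spark installation (spark_install() can set one up):

library(sparklyr)
library(dplyr)

# Connect to a local Spark cluster
sc <- spark_connect(master = "local")

# Import: copy an R data frame into Spark
mtcars_tbl <- copy_to(sc, mtcars)

# Wrangle in Spark, then collect the small result into R
result <- mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

# When finished: spark_disconnect(sc)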
Import

Import data into Spark, not R.

READ A FILE INTO SPARK
Arguments that apply to all spark_read_*() functions:
sc, name, path, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE

CSV         spark_read_csv(header = TRUE, columns = NULL, infer_schema = TRUE, delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL)
JSON        spark_read_json()
PARQUET     spark_read_parquet()
TEXT        spark_read_text()
HIVE TABLE  spark_read_table()
ORC         spark_read_orc()
LIBSVM      spark_read_libsvm()
JDBC        spark_read_jdbc()
DELTA       spark_read_delta()
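A minimal sketch of reading a CSV file into Spark (sc is an open connection from spark_connect(); the path is illustrative):

# Register data/flights.csv as the Spark table "flights"
flights_tbl <- spark_read_csv(
  sc,
  name = "flights",
  path = "data/flights.csv",   # hypothetical path
  header = TRUE,
  infer_schema = TRUE,
  memory = TRUE                # cache the table in Spark memory
)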
R DATA FRAME INTO SPARK
dplyr::copy_to(dest, df, name)

FROM A TABLE IN HIVE
dplyr::tbl(sc, …) - Creates a reference to the table without loading it into memory

Wrangle

DPLYR VERBS
dplyr verbs translate into Spark SQL statements: the compute is pushed to Spark, and only the results are collected back into R.

copy_to(sc, mtcars) %>%
  mutate(trm = ifelse(am == 0, "auto", "man")) %>%
  group_by(trm) %>%
  summarise_all(mean)
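Direct Spark SQL (DBI), the second wrangling option from the workflow above, runs a query against the same registered table; a minimal sketch:

library(DBI)
# Returns the query result as an R data frame
dbGetQuery(sc, "SELECT cyl, AVG(mpg) AS mpg_m FROM mtcars GROUP BY cyl")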
FEATURE TRANSFORMERS
ft_binarizer() - Assigns values based on a threshold
ft_bucketizer() - Numeric column to discretized column
ft_count_vectorizer() - Extracts a vocabulary from document collections
ft_discrete_cosine_transform() - 1D discrete cosine transform of a real vector
ft_elementwise_product() - Element-wise product between 2 columns
ft_hashing_tf() - Maps a sequence of terms to their term frequencies using the hashing trick
ft_idf() - Computes the Inverse Document Frequency (IDF) given a collection of documents
ft_imputer() - Imputation estimator for completing missing values, using the mean or the median of the columns
ft_index_to_string() - Index labels back to labels as strings
ft_interaction() - Takes in Double and Vector type columns and outputs a flattened vector of their feature interactions
ft_max_abs_scaler() - Rescales each feature individually to the range [-1, 1]
ft_min_max_scaler() - Rescales each feature individually to a common range [min, max] linearly
ft_ngram() - Converts the input array of strings into an array of n-grams
ft_bucketed_random_projection_lsh() | ft_minhash_lsh() - Locality Sensitive Hashing functions for Euclidean distance and Jaccard distance (MinHash)
ft_normalizer() - Normalizes a vector to have unit norm using the given p-norm
ft_one_hot_encoder() - Continuous to binary vectors
ft_pca() - Projects vectors to a lower-dimensional space of the top k principal components
ft_quantile_discretizer() - Continuous to binned categorical values
ft_regex_tokenizer() - Extracts tokens by using the provided regex pattern to split the text
ft_standard_scaler() - Removes the mean and scales to unit variance using column summary statistics
ft_stop_words_remover() - Filters out stop words from input
ft_string_indexer() - Column of labels into a column of label indices
ft_tokenizer() - Converts to lowercase and then splits by white spaces
ft_vector_assembler() - Combines vectors into a single row-vector
ft_vector_indexer() - Indexes categorical feature columns in a dataset of Vector
ft_vector_slicer() - Takes a feature vector and outputs a new feature vector with a subarray of the original features
ft_word2vec() - Word2Vec transforms a word into a code
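Feature transformers work directly inside a dplyr pipeline. A minimal sketch using ft_binarizer() (column names from mtcars; the threshold is illustrative):

copy_to(sc, mtcars) %>%
  ft_binarizer(input_col = "mpg",
               output_col = "over_20",
               threshold = 20) %>%   # over_20 is 1 when mpg > 20, else 0
  select(mpg, over_20)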
Visualize

Summarize in Spark, collect the results into R, then create the plot in R.

DPLYR + GGPLOT2

copy_to(sc, mtcars) %>%
  group_by(cyl) %>%                    # summarize in Spark
  summarise(mpg_m = mean(mpg)) %>%
  collect() %>%                        # collect results into R
  ggplot() +
  geom_col(aes(cyl, mpg_m))            # create plot in R

DBPLOT

copy_to(sc, mtcars) %>%
  dbplot_histogram(mpg) +
  labs(title = "Histogram of MPG")

dbplot_histogram(data, x, bins = 30, binwidth = NULL) - Calculates the histogram bins in Spark and plots in ggplot2
dbplot_raster(data, x, y, fill = n(), resolution = 100, complete = FALSE) - Visualizes 2 continuous variables; use instead of geom_point()
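A minimal dbplot_raster() sketch over the same table (wt and mpg are mtcars columns):

library(dbplot)
copy_to(sc, mtcars) %>%
  dbplot_raster(wt, mpg)   # grid computed in Spark, plotted in ggplot2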
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more at spark.rstudio.com • sparklyr 1.0.4.9002 • Updated: 2019-10
Modeling

REGRESSION
ml_linear_regression() - Linear regression
ml_aft_survival_regression() - Parametric survival regression model named accelerated failure time (AFT) model

CLUSTERING
ml_bisecting_kmeans() - A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar
ml_lda() | ml_describe_topics() | ml_log_likelihood() | ml_log_perplexity() | ml_topics_matrix() - LDA topic model designed for text documents

UTILITIES
ml_call_constructor() - Identifies the associated sparklyr ML constructor for the JVM
ml_standardize_formula() - Generates a formula string from user inputs, to be used in the `ml_model` constructor
ml_model_data() - Extracts data associated with a Spark ML model
ml_uid() - Extracts the UID of an ML object
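A minimal modeling sketch with the formula interface (mtcars columns; fit and predictions stay in Spark):

mtcars_tbl <- copy_to(sc, mtcars)
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
summary(fit)                  # coefficients and fit statistics
ml_predict(fit, mtcars_tbl)   # adds a prediction column in Spark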