1
Clojure for Data Science
Mike Anderson
26 January 2016

2
Contents
• Why Clojure for Data Science
• Array Programming Essentials
• core.matrix
• Library Ecosystem Overview
• Examples and discussion

3
Why Clojure for Data Science
Attribute | Clojure Python R Julia Scala Haskell JavaScript
Strong general purpose language | ✓ ✓ ✓ ✓ ✓
Functional language | ✓ ✓ ✓
JVM Ecosystem (Hadoop, Spark etc.) | ✓ ✓
Near-native runtime performance | ✓ ✓ ✓ ✓
Dynamic language | ✓ ✓ ✓ ✓ ✓
Client side execution | ✓ ✓
“Code is Data” | ✓

5
Plug-in paradigms
Paradigm | Exemplar language | Clojure implementation
Functional programming | Haskell | clojure.core
Meta-programming | Lisp | clojure.core
Logic programming | Prolog | core.logic
Process algebras / CSP | Go | core.async
Array programming | APL | core.matrix

6
APL
Venerable history
Has its own keyboard
Interesting perspective on code readability
• Notation invented in 1957 by Ken Iverson
• Implemented at IBM around 1960-64
life←{↑1 ⍵∨.∧3 4=+/,¯1 0 1∘.⊖¯1 0 1∘.⌽⊂⍵}

7
Modern array programming
R: standalone environment for statistical programming / graphics
NumPy: Python library for array programming
Julia: a new language (2012) based on array programming principles
.... and many others

8
"It is better to have 100 functions
operate on one data structure than
10 functions on 10 data structures."
—Alan Perlis
abstraction
Design wisdom

9
What is an array?
Dimensions | Terminology | Example
1 | Vector | [0 1 2]
2 | Matrix | [[0 1 2] [3 4 5] [6 7 8]]
3 | 3D Array (3rd order Tensor) | a stack of 3 x 3 matrices like the one above
... | ... | ...
N | ND Array | ...

10
Multi-dimensional array properties
[[0 1 2]
 [3 4 5]
 [6 7 8]]
Figure: rows are indexed 0, 1, 2 along Dimension 0; columns are indexed 0, 1, 2 along Dimension 1.
• Dimensions (ordered and indexed)
• Each of the array elements is a regular value
• Dimension sizes together define the shape of the array (e.g. 3 x 3)

11
Arrays = data about relationships
Set X (rows) | Set Y (columns): :R :S :T :U
:A | 0 1 2 3
:B | 4 5 6 7
:C | 8 9 10 11
(foo :A :T) => 2
Each element is a fact about a relationship between a value in Set X and a value in Set Y
ND array lookup is analogous to arity-N functions!
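The analogy can be sketched in plain Clojure. This is only an illustration: the names relation, index-of-x and index-of-y are made up here, and foo is written as an ordinary function over a nested vector rather than with any core.matrix call.

```clojure
;; The 3 x 4 array of facts from the slide, as nested vectors.
(def relation
  [[0 1 2 3]
   [4 5 6 7]
   [8 9 10 11]])

;; Hypothetical label-to-index maps for Set X (rows) and Set Y (columns).
(def index-of-x {:A 0 :B 1 :C 2})
(def index-of-y {:R 0 :S 1 :T 2 :U 3})

(defn foo
  "Arity-2 function backed by a 2D array lookup."
  [x y]
  (get-in relation [(index-of-x x) (index-of-y y)]))

(foo :A :T) ;; => 2
```

A 3D array would back an arity-3 function in the same way, with one more index map and one more level of nesting.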

12
Why arrays instead of functions?
[[0 1 2]
 [3 4 5]
 [6 7 8]]
vs.
(fn [i j]
  (+ j (* 3 i)))
1. Precomputed values with O(1) access
2. Efficient computation with optimised bulk operations
3. Data driven representation
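A small plain-Clojure sketch of the comparison, assuming nested vectors as the array representation (the names f and table are illustrative only):

```clojure
;; The function from the slide.
(defn f [i j]
  (+ j (* 3 i)))

;; Precompute f over a 3 x 3 index grid, giving a data-driven representation.
(def table
  (mapv (fn [i] (mapv (fn [j] (f i j)) (range 3)))
        (range 3)))

table                 ;; => [[0 1 2] [3 4 5] [6 7 8]]
(get-in table [1 2])  ;; => 5, the same value as (f 1 2)

;; Bulk operation over the data: the sum of each row.
(mapv #(reduce + %) table) ;; => [3 12 21]
```

Once the values are data, they can be inspected, stored and passed to bulk operations without calling f again.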

13
Principle of array programming:
generalise operations on regular (scalar) values to multi-dimensional data
(+ 1 2) => 3
(+ [1 2 3] [4 5 6]) => [5 7 9]

15
core.matrix
Array programming
as a language extension
for Clojure
(with a Data Science focus)

16
Expressivity
Java:
for (int i=0; i<n; i++) {
  for (int j=0; j<m; j++) {
    for (int k=0; k<p; k++) {
      result[i][j][k] = a[i][j][k] + b[i][j][k];
    }
  }
}
Clojure:
(mapv
  (fn [a b]
    (mapv
      (fn [a b]
        (mapv + a b))
      a b))
  a b)
Clojure + core.matrix:
(+ a b)

17
Elements of core.matrix
Abstraction: Coding with N-dimensional arrays
Implementation: How is everything implemented?
API: What can you do with arrays?

19
Equivalence to Clojure vectors
Nested Clojure vectors of regular shape are arrays!
0 1 2
3 4 5
6 7 8
↔
[[0 1 2]
[3 4 5]
[6 7 8]]
0 1 2 ↔ [0 1 2]

20
Array creation
;; Build an array from a sequence
(array (range 5))
=> [0 1 2 3 4]
;; ... or from nested arrays/sequences
(array
  (for [i (range 3)]
    (for [j (range 3)]
      (str i j))))
=> [["00" "01" "02"]
    ["10" "11" "12"]
    ["20" "21" "22"]]

21
Shape
;; Shape of a 3 x 2 matrix
(shape [[1 2]
        [3 4]
        [5 6]])
=> [3 2]
;; Regular values have no shape
(shape 10.0)
=> nil

22
Dimensionality
;; Dimensionality = number of dimensions
;; = length of shape vector
;; = nesting level
(dimensionality [[1 2]
                 [3 4]
                 [5 6]])
=> 2
(dimensionality [1 2 3 4 5])
=> 1
;; Regular values have zero dimensionality
(dimensionality "Foo")
=> 0

23
Scalars vs. arrays
(array? [[1 2] [3 4]])
=> true
(array? 12.3)
=> false
(scalar? [1 2 3])
=> false
(scalar? "foo")
=> true
Everything is either an array or a scalar
A scalar works like a 0-dimensional array

24
Indexed element access
(def M [[0 1 2]
        [3 4 5]
        [6 7 8]])
;; index 1 along Dimension 0 (row), index 2 along Dimension 1 (column)
(mget M 1 2)
=> 5

25
Slicing access
(def M [[0 1 2]
        [3 4 5]
        [6 7 8]])
;; the slice at index 1 along Dimension 0
(slice M 1)
=> [3 4 5]
A slice of an array is itself an array!

26
Arrays as a composition of slices
(def M [[0 1 2]
        [3 4 5]
        [6 7 8]])
(slices M)
=> ([0 1 2] [3 4 5] [6 7 8])
(apply + (slices M))
=> [9 12 15]

27
Operators
(use 'clojure.core.matrix.operators)
(+ [1 2 3] [4 5 6])
=> [5 7 9]
(* [1 2 3] [0 2 -1])
=> [0 4 -3]
(- [1 2] [3 4 5 6])
=> RuntimeException Incompatible shapes
(/ [1 2 3] 10.0)
=> [0.1 0.2 0.3]

28
Broadcasting scalars
(+ [[0 1 2]
    [3 4 5]
    [6 7 8]] 1) = ?
“Broadcasting” expands the scalar 1 to the shape of the array:
(+ [[0 1 2]
    [3 4 5]
    [6 7 8]]
   [[1 1 1]
    [1 1 1]
    [1 1 1]])
= [[1 2 3]
   [4 5 6]
   [7 8 9]]

29
Broadcasting arrays
(+ [[0 1 2]
    [3 4 5]
    [6 7 8]] [2 1 0]) = ?
“Broadcasting” expands the vector [2 1 0] to the shape of the array:
(+ [[0 1 2]
    [3 4 5]
    [6 7 8]]
   [[2 1 0]
    [2 1 0]
    [2 1 0]])
= [[2 2 2]
   [5 5 5]
   [8 8 8]]

30
Broadcasting Rules
1. Designed for elementwise operations
- other uses must be explicit
2. Extends shape vector by adding new leading dimensions
• original shape [4 5]
• can broadcast to any shape [x y ... z 4 5]
• scalars can broadcast to any shape
3. Fills the new array space by duplication of the original array over the new dimensions
4. Smart implementations can avoid making full copies by structural sharing or clever indexing tricks
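Rules 2 and 3 can be sketched in plain Clojure over nested vectors. This is an illustrative sketch, not how core.matrix implements broadcasting; shape-of, broadcast-to and add-elementwise are made-up names, and the sketch makes full copies rather than using the sharing tricks of rule 4.

```clojure
(defn shape-of
  "Shape of a regular nested vector; a scalar has the empty shape []."
  [a]
  (if (vector? a)
    (into [(count a)] (shape-of (first a)))
    []))

(defn broadcast-to
  "Broadcasts a to target-shape by adding leading dimensions (rule 2)
   and duplicating a over them (rule 3)."
  [a target-shape]
  (let [s     (shape-of a)
        extra (- (count target-shape) (count s))]
    ;; The original shape must match the trailing dimensions of the target.
    (when (or (neg? extra)
              (not= s (vec (drop extra target-shape))))
      (throw (ex-info "Incompatible shapes" {:from s :to target-shape})))
    ;; Wrap innermost-first: repeat over the last new dimension, then outwards.
    (reduce (fn [acc n] (vec (repeat n acc)))
            a
            (reverse (take extra target-shape)))))

(defn add-elementwise
  "Elementwise addition of two equally shaped nested vectors (or two scalars)."
  [a b]
  (if (vector? a)
    (mapv add-elementwise a b)
    (+ a b)))

(def M [[0 1 2] [3 4 5] [6 7 8]])

(add-elementwise M (broadcast-to 1 [3 3]))
;; => [[1 2 3] [4 5 6] [7 8 9]]
(add-elementwise M (broadcast-to [2 1 0] [3 3]))
;; => [[2 2 2] [5 5 5] [8 8 8]]
```

Because only leading dimensions are added, a [3] vector can broadcast to [3 3] but a [2] vector cannot; the sketch throws in that case.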

31
Functional operations on sequences
;; map
(map inc [1 2 3 4])
=> (2 3 4 5)
;; reduce
(reduce * [1 2 3 4])
=> 24
;; seq
(seq [1 2 3 4])
=> (1 2 3 4)

32
Functional operations on arrays
;; map ↔ emap (“element map”)
(emap inc [[1 2]
           [3 4]])
=> [[2 3]
    [4 5]]
;; reduce ↔ ereduce (“element reduce”)
(ereduce * [[1 2]
            [3 4]])
=> 24
;; seq ↔ eseq (“element seq”)
(eseq [[1 2]
       [3 4]])
=> (1 2 3 4)

33
Specialised matrix constructors
(zero-matrix 4 3)
0 0 0
0 0 0
0 0 0
0 0 0
(identity-matrix 4)
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
(permutation-matrix [3 1 0 2])
0 0 0 1
0 1 0 0
1 0 0 0
0 0 1 0

34
Array transformations
(transpose [[0 1 2]
            [3 4 5]])
=> [[0 3]
    [1 4]
    [2 5]]
Transpose reverses the order of all dimensions and indexes

35
Matrix multiplication
[[9 2 7] [6 4 8]] . [[2 8] [3 4] [5 9]] = [[a b] [c d]]
a = (9 * 2) + (2 * 3) + (7 * 5)
a = 59
(mmul [[9 2 7] [6 4 8]]
      [[2 8] [3 4] [5 9]])
=> [[59 143] [64 136]]
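The arithmetic behind each output element can be checked with a plain-Clojure sketch. The helpers dot and mmul-2d are illustrative names, not the core.matrix implementation; they handle 2D nested vectors only.

```clojure
(defn dot
  "Sum of pairwise products of two equal-length vectors."
  [u v]
  (reduce + (map * u v)))

(defn mmul-2d
  "Matrix product of two 2D nested vectors: each output element is the
   dot product of a row of a with a column of b."
  [a b]
  (let [cols (apply mapv vector b)] ; columns of b, via transposition
    (mapv (fn [row] (mapv (fn [col] (dot row col)) cols))
          a)))

(dot [9 2 7] [2 3 5])
;; => 59
(mmul-2d [[9 2 7] [6 4 8]]
         [[2 8] [3 4] [5 9]])
;; => [[59 143] [64 136]]
```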

36
Geometry
(def π 3.141592653589793)
(def τ (* 2.0 π))
(defn rot [turns]
  (let [a (* τ turns)]
    [[(cos a)     (sin a)]
     [(-(sin a)) (cos a)]]))
;; 45° = 1/8 turn
(mmul (rot 1/8) [3 4])
=> [4.9497 0.7071]
NB: See Tau Manifesto (http://tauday.com/) regarding the use of Tau (τ)

38
Mutability – the tradeoffs
Avoid mutability. But it’s an option if you really need it.
Pros
• Faster
• Reduces GC pressure
• Standard in many existing matrix libraries
Cons
✘ Mutability is evil
✘ Harder to maintain / debug
✘ Hard to write concurrent code
✘ Not idiomatic in Clojure
✘ Not supported by all core.matrix implementations
✘ “Place Oriented Programming”

39
Mutability – performance benefit
Time for addition of vectors* (ns)
Mutable add!: 28
Immutable add: 120
4x performance benefit
* Length 10 double vectors, using :vectorz implementation

40
Mutability – syntax
A core.matrix function name ending with “!” performs mutation
(usually on the first argument only)
(add [1 2] 1)
=> [2 3]
(add! [1 2] 1)
=> RuntimeException ...... not mutable!
(def a (mutable [1 2])) ;; coerce to a mutable format
=> #<Vector2 [1.0,2.0]>
(add! a 1)
=> #<Vector2 [2.0,3.0]>

42
Many Matrix libraries…
UJMP
ojAlgo
MTJ
javax.vecmath

44
Lots of trade-offs
Native Libraries vs. Pure JVM
Mutability vs. Immutability
Specialized elements (e.g. doubles) vs. Generalised elements (Object, Complex)
Multi-dimensional vs. 2D matrices only
Memory efficiency vs. Runtime efficiency
Concrete types vs. Abstraction (interfaces / wrappers)
Specified storage format vs. Multiple / arbitrary storage formats
License A vs. License B
Lightweight (zero-copy) views vs. Heavyweight copying / cloning

45
What’s the best data structure?
Length 50 “range” vector: 0 1 2 3 .. 49
1. Clojure Vector
[0 1 2 …. 49]
2. Java double[] array
new double[] {0, 1, 2, …. 49};
3. Custom deftype
(deftype RangeVector
  [^long start
   ^long end])
4. Native vector format
(org.jblas.DoubleMatrix. params)

48
Clojure Protocols
;; from namespace clojure.core.matrix.protocols
(defprotocol PSummable
  "Protocol to support the summing of all elements in
   an array. The array must hold numeric values only,
   or an exception will be thrown."
  (element-sum [m]))
1. Abstract Interface
2. Open Extension
3. Fast Dispatch

49
Protocols are fast and open
Call type | Function call cost (ns) | Open extension
Multimethod* | 89 | ✓
Protocol call | 13.8 | ✓
Boxed function call | 7.9 | ✘
Primitive function call | 1.9 | ✘
Static / inlined code | 1.2 | ✘
* Using class of first argument as dispatch function

50
Typical core.matrix call path
User code:
(esum [1 2 3 4])
core.matrix API (matrix.clj):
(defn esum
  "Calculates the sum of all the elements in a
   numerical array."
  [m]
  (mp/element-sum m))
Implementation code:
(extend-protocol mp/PSummable
  SomeImplementationClass
  (element-sum [a]
    ………))

51
Most protocols are optional
MANDATORY
Required for a working core.matrix implementation
PImplementation
PDimensionInfo
PIndexedAccess
PIndexedSetting
OPTIONAL
• Everything in the API will work without these
• core.matrix provides a “default implementation”
• Implement for improved performance
PMatrixEquality
PSummable
PRowOperations
PVectorCross
PCoercion
PTranspose
PVectorDistance
PMatrixMultiply
PAddProductMutable
PReshaping
PMathsFunctionsMutable
PMatrixRank
PArrayMetrics
PAddProduct
PVectorOps
PMatrixScaling
PMatrixOps
PMatrixPredicates
PSparseArray
…..

52
Default implementations
;; in clojure.core.matrix.impl.default
;; protocol name from namespace clojure.core.matrix.protocols
(extend-protocol mp/PSummable
  ;; Implementation for any Number
  Number
  (element-sum [a] a)
  ;; Implementation for an arbitrary Object (assumed to be an array)
  Object
  (element-sum [a]
    (mp/element-reduce a +)))

53
Extending a protocol
(extend-protocol mp/PSummable
  ;; Class to implement protocol for, in this case a Java array: double[]
  (Class/forName "[D")
  (element-sum [m]
    ;; Add type hint to avoid reflection
    (let [^doubles m m]
      ;; Optimised code to add up all the elements of a double[] array
      (areduce m i res 0.0 (+ res (aget m i))))))
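The pattern of a default implementation plus a specialised extension can be tried without core.matrix on the classpath. The sketch below uses stand-in names (PSum, sum-elements) rather than the real mp/PSummable, and its Object fallback is a simplification that assumes any other value is a sequence of summable things.

```clojure
;; PSum and sum-elements are hypothetical stand-ins, not core.matrix names.
(defprotocol PSum
  (sum-elements [m] "Returns the sum of all elements of m."))

(extend-protocol PSum
  ;; Specialised implementation for Java double[] arrays.
  (Class/forName "[D")
  (sum-elements [m]
    (let [^doubles m m] ; type hint avoids reflection on aget / alength
      (areduce m i res 0.0 (+ res (aget m i)))))

  ;; A scalar number is its own sum.
  Number
  (sum-elements [m] m)

  ;; Default: treat any other object as a sequence of summable things.
  Object
  (sum-elements [m]
    (reduce + (map sum-elements m))))

(sum-elements (double-array [1 2 3 4])) ;; => 10.0
(sum-elements [[1 2] [3 4]])            ;; => 10
(sum-elements 7)                        ;; => 7
```

Protocol dispatch picks the most specific match for the argument's class, so double[] gets the optimised loop while nested vectors fall through to the Object default.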

54
Speedup vs. default implementation
Timing for element sum of length 100 double array (ns)
(esum v) "Specialised": 201
(reduce + v): 2859
(esum v) "Default": 3690
15-20x benefit

55
Internal Implementations
Implementation | Key Features
:persistent-vector | Support for Clojure vectors; immutable; not so fast, but great for quick testing
:double-array | Treats Java double[] objects as 1D arrays; mutable – useful for accumulating results etc.
:sequence | Treats Clojure sequences as arrays; mostly useful for interop / data loading
:ndarray, :ndarray-double, :ndarray-long, ..... | Google Summer of Code project by Dmitry Groshev; pure Clojure; N-dimensional arrays similar to NumPy; support arbitrary dimensions and data types
:scalar-wrapper, :slice-wrapper, :nd-wrapper | Internal wrapper formats; used to provide efficient default implementations for various protocols

56
NDArray
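(deftype NDArrayDouble
  [^doubles data
   ^int ndims
   ^ints shape
   ^ints strides
   ^int offset])
Figure: the 2 x 3 array
0 1 2
3 4 5
with ndims = 2 and shape = [2 3], stored in data (a Java array) as
? ? ? 0 1 2 ? ? 3 4 5 ?
offset marks the position of the first element, strides[1] is the step between neighbouring elements in a row, and strides[0] is the step from one row to the next.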
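How offset and strides locate an element can be sketched in plain Clojure. The map-based representation and the names flat-index and nd-get are illustrative only; the values offset = 3 and strides = [5 1] are read off the data layout in the figure and are an assumption of this sketch.

```clojure
(defn flat-index
  "Position in the flat data array of the element at the given indexes:
   offset + sum of (stride * index) over all dimensions."
  [offset strides indexes]
  (reduce + offset (map * strides indexes)))

(defn nd-get
  "Looks up an element of a strided array given as a map."
  [{:keys [data offset strides]} & indexes]
  (nth data (flat-index offset strides indexes)))

;; The 2 x 3 array [[0 1 2] [3 4 5]] embedded in a larger buffer;
;; nil marks the unused (?) positions.
(def arr
  {:data    [nil nil nil 0 1 2 nil nil 3 4 5 nil]
   :shape   [2 3]
   :strides [5 1]
   :offset  3})

(nd-get arr 0 0) ;; => 0
(nd-get arr 1 2) ;; => 5

;; A transposed view shares the same data: only shape and strides are swapped.
(def arr-t (assoc arr :shape [3 2] :strides [1 5]))

(nd-get arr-t 2 1) ;; => 5
```

Views such as transposes and slices need no copying under this scheme, which is the kind of indexing trick mentioned in the broadcasting rules.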

57
External Implementations
Implementation | Key Features
vectorz-clj | Pure JVM (wraps Java library Vectorz); very fast, especially for vectors and small-medium matrices; most mature core.matrix implementation at present
Clatrix | Uses native BLAS libraries by wrapping the jblas library; very fast, especially for large 2D matrices; used by Incanter
parallel-colt-matrix | Wraps Parallel Colt library from Java; support for multithreaded matrix computations
arrayspace | Experimental; ideas around distributed matrix computation; builds on ideas from Blaze, Chapel, ZPL
image-matrix | Treats a Java BufferedImage as a core.matrix array; because you can?

58
Switching implementations
(array (range 5))
=> [0 1 2 3 4]
;; switch implementations
(set-current-implementation :vectorz)
;; create array with current implementation
(array (range 5))
=> #<Vector [0.0,1.0,2.0,3.0,4.0]>
;; explicit implementation usage
(array :persistent-vector (range 5))
=> [0 1 2 3 4]

59
Mixing implementations
(def A (array :persistent-vector (range 5)))
=> [0 1 2 3 4]
(def B (array :vectorz (range 5)))
=> #<Vector [0.0,1.0,2.0,3.0,4.0]>
(* A B)
=> [0.0 1.0 4.0 9.0 16.0]
(* B A)
=> #<Vector [0.0,1.0,4.0,9.0,16.0]>
core.matrix implementations can be mixed
(but: behaviour depends on the first argument)

61
Data Science Libraries for Clojure
• Still not as mature as R or Python, but developing rapidly
• Clojure philosophy of small libraries rather than all-encompassing frameworks
• Key areas:
  • Interactive environments
  • Visualisation
  • Databases / data access
  • Realtime data processing
  • Machine Learning

62
Interactive environments
Library | Description
Incanter | Fully featured analytical environment (“R-like platform”)
gorilla-repl | Notebook-style web-based environment

63
Visualisation
Library | Description
quil | Clojure interface to the Processing library/environment for dynamic visualisations
gyptis | Clojure + ClojureScript library for producing Vega.js graphs
imagez | Library for generating and manipulating bitmap images

64
Databases / data access
Library | Description
Datomic | Awesome database supporting immutable “time travel” over database history. Great scalability for reads / analytics
java.jdbc | Clojure library for access to SQL databases. Mature workhorse
Yesql | Arguably better way to do SQL in Clojure
Sparkling | Clojure library for Apache Spark
flambo | Clojure library for Apache Spark
Cascalog | Clojure library for querying and data processing with Apache Hadoop
many, many, more.....

65
Realtime Data Processing
Library | Description
Storm | Mature stream processing library for highly scalable realtime computation over large distributed clusters of compute nodes
Onyx | More modern / better designed alternative to Storm with growing traction
core.async | “Roll your own” concurrent data processing pipelines

66
Machine Learning
Library | Description
clj-ml | Wrapper for the popular and venerable “Weka” machine learning library for Java
enclog | Wrapper for the “Encog” machine learning library
Clortex / Comportex | Libraries implementing Numenta’s Hierarchical Temporal Memory model
synaptic | Basic neural networks in Clojure
(logo only) | State of the art “Deep Learning” library

68
Thank you
For more information about Datacraft, visit: www.datacraft.sg
