Report Neumann FA
Facharbeit
Markus Neumann
Matrikelnummer: s08-706-442
Email: [email protected]
Abstract
Matrices are ubiquitous in many fields of research and industry, where huge amounts of data held in relational database systems are commonplace. Computing matrix operations inside a database can be very compute-intensive, which is why efficient algorithms are in high demand. The MAD project offers a collection of such algorithms, of which I implemented the 'method of least squares', which approximates the linear regression coefficients, as a Facharbeit in the minor subject of my studies at UZH. This thesis gives an introduction to linear regression, its integration into database systems, as well as the details of the implementation of the algorithm in PostgreSQL.
Contents
1. Introduction
2. Linear regression
5. Implementation
   5.1. Functions
        5.1.1. Aggregate function
        5.1.2. Transition function
        5.1.3. Final function
   5.2. Discussion
A. SQL code
B. Tests
C. Results
List of Figures
1. Graphical example of simple linear regression. Source: Wikipedia [2014c]
2. Total time for d = 8
3. Total time for d = 32
4. Total time for n = 10000
5. Total time for n = 100000
6. Total time for n = 1000000
List of Listings
1. Table sparse matrix
2. Table row-by-row matrix
3. Hardware specifications
4. Results of correctness test
5. File MyType.sql
6. File LinReg.sql
7. File LinRegStep.sql
8. File LinRegFinal.sql
9. File LUinvert.sql
10. File buildTests.sql
11. File runTests.sql
12. File buildTests.sql
13. File runBetaTests.sql
14. File allresults.txt
15. File producePlots.py
Figure 1: Graphical example of simple linear regression.
Source: Wikipedia [2014c]
1. Introduction
In the past few years the amount of data collected and stored in databases has increased significantly, as has the available and affordable storage and computational power. At the same time, the requests made to Database Management Systems (DBMS) have changed in their nature. Originally, databases were designed to store data and serve it back to the user on demand. The typical queries answer questions like "how many employees earn more than $Z per month?" or "what items of type X need a packing box of type Y?". Whenever there were questions about mathematical or statistical properties of the data, one exported the data to a file, imported it into an appropriate tool, like Matlab or R, computed the results there and served them back to the database. With increasing size of the datasets, the transport into and out of the database became very time-intensive. Thus, there have been more and more requests to perform statistical and advanced mathematical tasks inside the DBMS itself. The MAD project focuses on exactly that: it provides a set of tools to perform these tasks in an efficient way directly inside the DBMS.
I used Hellerstein et al. [2012] as the motivation and basis for this work. The focus is placed
on the linear regression method they present in section 4.1.
2. Linear regression
Linear regression is an approach in statistics for modelling a linear relation between n pairs of values (xi, yi), i = 1, …, n, such that yi = β0 + β1 xi. In the case of simple linear regression, where the xi are scalars, this can be interpreted graphically by drawing the values (xi, yi) as points in the Euclidean plane and then fitting a line into that set of points in such a way that the line is as close as possible to all points (see Figure 1).
In case of multiple linear regression the values xi are vectors of dimension d, denoted by Xi = (xij), j = 1, …, d. In order to model the linear relation, the equation changes to

    yi = β0 + β1·xi1 + … + βd·xid.

After subtracting β0 from both sides, this can be rewritten as

    X · β = Y,    (1)

where X denotes the n × d matrix that has Xi as its i-th row, β = (βi), i = 1, …, d, a vector of length d and Y = (Yi) = (yi − β0), i = 1, …, n, a vector of length n.
β can be approximated in a way, called the Ordinary Least Squares (OLS) approximation, such that the sum of squared residuals

    S(β) := Σ_{i=1}^{n} (Yi − Xi · β)²

is minimized.
Algorithm 1 Compute β̂
1: M ← XᵀX
2: N ← XᵀY
3: M′ ← M⁻¹
4: β̂ ← M′ · N
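The four steps above can be sketched in a few lines of NumPy. This is only an illustration of algorithm 1; the report's actual implementation is the PL/pgSQL code in appendix A.

```python
import numpy as np

def ols_beta(X, Y):
    """Approximate beta via the normal equations, mirroring algorithm 1."""
    M = X.T @ X                # step 1: M <- X^T X
    N = X.T @ Y                # step 2: N <- X^T Y
    M_inv = np.linalg.inv(M)   # step 3: M' <- M^-1
    return M_inv @ N           # step 4: beta_hat <- M' N

# quick check against a known exact relation Y = X . beta
rng = np.random.default_rng(0)
X = rng.random((100, 4))
beta = np.array([1.0, 2.0, 3.0, 4.0])
Y = X @ beta
print(np.allclose(ols_beta(X, Y), beta))  # True
```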
So we are actually computing d² inner products of column vectors to get the resulting matrix (XᵀX) ∈ K^{d×d}. That is perfectly fine if you are multiplying matrices by hand, or if you are using a system where you are able to load complete matrices, or at least two column vectors at a time, into memory.
There is a different approach to multiplying matrices that is easy to implement and efficient to execute in a DBMS: instead of multiplying column vectors, one can multiply row vectors (outer product) and then sum up all the resulting matrices. Let us look at an example to illustrate this:
Example: Matrix multiplication. Let K = R and

    M := [1 2; 3 4],  so  Mᵀ = [1 3; 2 4].

I will call the resulting matrix P = Mᵀ · M.

1. inner product

    P11 = (1 3) · (1, 3)ᵀ = 1 + 9 = 10
    P12 = (1 3) · (2, 4)ᵀ = 2 + 12 = 14
    P21 = (2 4) · (1, 3)ᵀ = 2 + 12 = 14
    P22 = (2 4) · (2, 4)ᵀ = 4 + 16 = 20

which gives

    P = [10 14; 14 20]

2. outer product

    P1 = (1, 2)ᵀ · (1 2) = [1·1 1·2; 2·1 2·2] = [1 2; 2 4]
    P2 = (3, 4)ᵀ · (3 4) = [3·3 3·4; 4·3 4·4] = [9 12; 12 16]

    P = P1 + P2 = [10 14; 14 20]
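The equivalence of the two views can be checked quickly in NumPy; this snippet is illustrative only and not part of the report's code.

```python
import numpy as np

M = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# inner-product view: the ordinary matrix product M^T M
P_inner = M.T @ M

# outer-product view: sum over the outer products of the rows of M
P_outer = sum(np.outer(row, row) for row in M)

print(np.array_equal(P_inner, P_outer))  # True
```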
Definition
"An aggregate function computes a single result from multiple input rows." (PostgreSQL [2014a])
Aggregate functions
• can (depending on the DBMS) be parallelized and tuned to high performance and
• are memory efficient, since they do not need to keep track of all entries but only store the information that needs to be propagated further on.
1. A transition function that takes the current transition state and a new input row and computes a new transition state.
2. An optional merge function that takes two transition states and computes a new combined transition state. This function is only needed for parallel execution.
3. A final function that takes a transition state and transforms it into the output value."
In order for the aggregate function to be parallelizable, its transition function has to be commutative and associative (Hellerstein et al. [2012]). Although matrix multiplication in general is not commutative, the process itself clearly is, since the order of the summands does not matter for the result. In the case of matrix multiplication as in example 2, the corresponding transition function would look like algorithm 2.
As the transition function works only on one vector (one record of the database) at a time, it does not matter in which order we process the rows. Let us look at example 2 again to illustrate this:
    P1 + P2 = [1 2; 2 4] + [9 12; 12 16]
            = [9 12; 12 16] + [1 2; 2 4]
            = P2 + P1
Algorithm details We can combine steps 1 and 2 from algorithm 1 into a single transition
function (algorithm 4) and do the inversion in the final function (algorithm 5). The aggregate
function that wraps those two functions is shown in algorithm 3.
As of now, PostgreSQL does not support parallel execution of aggregates (Berkus [2014]). Hence I did not write a merge function, but I show its outline in algorithm 6 for the sake of completeness.
5. Implementation
Database management system I worked with PostgreSQL as the DBMS.
Algorithm 3 Aggregate function “matrix transposed times matrix”
Input: X matrix, received row-by-row; Y column vector, received entry-by-entry
1: initialize transition state M = ARRAY[][]
2: for all row in X do
3: update M by transition function (algorithm 4)
4: end for
5: if parallel environment then
6: combine all Mi by merge function (algorithm 6)
7: end if
8: calculate β̂ by final function (algorithm 5)
9: return β̂
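The aggregate's control flow can be simulated outside the database. The following Python sketch is purely illustrative and assumes a serial environment (no merge step); it folds one (x, y) record at a time into the running sums, exactly as the transition function does.

```python
import numpy as np

def transition(state, x, y):
    """One aggregate step: fold a single row of X and entry of Y into
    the running sums X^T X and X^T Y (cf. algorithm 4)."""
    if state is None:
        d = len(x)
        state = (np.zeros((d, d)), np.zeros(d))
    XtX, Xty = state
    XtX += np.outer(x, x)   # outer product of the current row with itself
    Xty += x * y
    return (XtX, Xty)

def final(state):
    """Final function: invert X^T X and return beta_hat (cf. algorithm 5)."""
    XtX, Xty = state
    return np.linalg.inv(XtX) @ Xty

rng = np.random.default_rng(1)
X = rng.random((50, 3))
Y = X @ np.array([2.0, -1.0, 0.5])

state = None
for x, y in zip(X, Y):            # rows arrive one at a time
    state = transition(state, x, y)
print(final(state))               # approximately [2. -1. 0.5]
```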
Procedural language I had to choose one out of the four procedural languages that are supported by PostgreSQL (PostgreSQL [2014b]). I picked PL/pgSQL for the following reasons:
• Personal interest.
  I learned some basics of the language during the database systems course in the spring semester 2014 (Böhlen and Gamper [2014]), on which I wanted to build and get some practice in that language.
• Natural choice.
  Since PostgreSQL was fixed as the DBMS, it was the natural choice to stick to its own procedural language.
• I know neither C nor Perl, and learning one of those for this project would have taken too much time.
Table structure There are several different ways to store matrices in a database table, of which the following two are probably the most popular:
1. The values can be stored in sparse representation, where every entry consists of a (column number, row number, value) triple, as shown in listing 1.
Listing 2 Table row-by-row matrix
values | row_id
--------+--------
[1,2] | 1
[3,4] | 2
The sparse representation is more generic and has a wider field of use cases than the array representation, but for the computation of Xᵀ · X the array representation is very efficient. One can compute every entry of the result by reading at most two entries from the table into memory, as long as two arrays fit into memory, and hence the number of reads performed on the database is smaller than with the sparse representation. I chose a variation of the array representation in order to have the same structure as Hellerstein et al. [2012]. The difference to listing 2 is that the y values get stored in the same table as the X values. Therefore, I do not need to store the row_id, because the computation of β̂ does not depend on the order of X and y, as long as corresponding points are kept together.
5.1. Functions
Since the aggregate described in algorithm 3 needs to keep track of the processed values of XᵀX and XᵀY, and since PL/pgSQL supports only a single variable as output of a step function, I had to define a custom datatype in order to store the complete information. It consists of a two-dimensional array for XᵀX and a one-dimensional array for XᵀY. I defined another custom data type, LinRegOut, used as output of the aggregate function. It comprises the one-dimensional array beta and the time interval inversiontime, which records the amount of time used to invert XᵀX and is stored for performance analysis purposes, as shown in listing 5.
5.1.3. Final function
The final function in listing 8 performs all the steps of algorithm 5. The inversion of XᵀX is done by a separate function, in order to easily replace it in case of performance issues (listing 9). I decided to compute the LU-decomposition of XᵀX first and invert the resulting lower-left and upper-right triangular matrices separately. In mathematical terms, this means that we find a lower-left triangular matrix L and an upper-right triangular matrix U such that XᵀX = L · U. Hence the inverse of XᵀX can be computed as (LU)⁻¹ = U⁻¹ · L⁻¹. This is because inverting a triangular matrix with the Gauss algorithm is much easier to implement than inverting a full matrix (Sauter [2014]).
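The same strategy can be prototyped in Python before porting it to PL/pgSQL. The sketch below is an illustration with a hypothetical helper name, not the report's LUinvert.sql: it computes the Doolittle LU factorization without pivoting and then combines the inverses of the two triangular factors.

```python
import numpy as np

def lu_invert(M):
    """Invert M via M = L U (no pivoting), so M^-1 = U^-1 L^-1.
    Assumes M admits an LU factorization without pivoting, which holds
    for the symmetric positive definite X^T X of a full-rank X."""
    M = M.astype(float).copy()
    d = M.shape[0]
    L = np.eye(d)
    U = np.zeros((d, d))
    for k in range(d):
        U[k, k:] = M[k, k:]                 # current pivot row of U
        L[k+1:, k] = M[k+1:, k] / U[k, k]   # multipliers below the pivot
        M[k+1:, k+1:] -= np.outer(L[k+1:, k], U[k, k+1:])  # eliminate
    # invert the triangular factors (here via solves; a hand-written
    # forward/backward sweep would work equally well)
    Linv = np.linalg.solve(L, np.eye(d))
    Uinv = np.linalg.solve(U, np.eye(d))
    return Uinv @ Linv

A = np.array([[4.0, 2.0], [2.0, 3.0]])      # a small SPD example
print(np.allclose(lu_invert(A) @ A, np.eye(2)))  # True
```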
5.2. Discussion
I think that this algorithm is a really good way to solve the given task. Since it takes full advantage of the aggregate feature, it is only usable in DBMS that support aggregates. The best place to use it would hence be a DBMS that supports parallelisation and aggregates at the same time. In the case of PostgreSQL it is still a very efficient way to compute linear regression. The downside is that the algorithm is not applicable to other data structures or matrix multiplications in general, as it depends on the array representation of the data.
Figure 2: Total time for d = 8
6.3. Complexity
Before running the tests, let us look at the complexity of the code to get an idea of what to expect from the tests. Suppose we have a table with n records, where each x entry has length d. The transition function (subsubsection 5.1.2) gets executed n times. Inside the transition function we have two nested loops that both iterate from 1 to d. The final function (subsubsection 5.1.3) then performs the LU-decomposition, which needs d(2d − 1)(d − 1)/6 ≈ d³ iterations to compute the decomposition and an additional d³ operations to invert the two matrices. After that, the final function needs d² iterations to compute β̂.
This makes a total of approximately n · d² + d³ + d³ + d² iterations on the data. Now, to classify this in O notation, we have to determine whether n · d² or d³ is the term of higher computational significance (Hromkovic [2008]). As soon as n ≥ d, the first term is bigger, which is given for linear regression. Hence the code has a complexity of O(n · d²).
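The iteration count of the elimination phase can be verified by brute force. The snippet below is purely illustrative: it counts the innermost-loop executions of the triple LU-elimination loop (as in listing 9) and compares them to the closed form d(2d − 1)(d − 1)/6.

```python
def lu_iterations(d):
    """Count innermost-loop executions of the LU elimination sweep."""
    count = 0
    for k in range(1, d + 1):
        for i in range(k + 1, d + 1):
            for j in range(k + 1, d + 1):
                count += 1
    return count

for d in (8, 16, 32):
    formula = d * (2 * d - 1) * (d - 1) // 6
    print(d, lu_iterations(d), formula)   # the two counts agree
```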
6.4. Tests
For every table that I created with the buildTests() function, I started a timer, called the
LinReg() function on that table and stored the runtime and the inversion time in the results
table. (See listing 11)
6.5. Results
A complete listing of the test times is shown in listing 14. To compare the results against the theory built up in 6.3, I first looked at the results of all tests where d = 8 and d = 32. As you can see in both Figure 2 and Figure 3, the time increase is linear in the increase of n, which fits
Figure 3: Total time for d = 32
exactly the expectations from subsection 6.3.¹ Now I fix n at 10000, 100000 and 1000000 and look at the change in time for the different values of d. The blue lines in Figure 4, Figure 5 and Figure 6 show the runtime, whereas the green lines show the slope of y = (x/2)² in Figure 4, y = x² in Figure 5 and y = (3x)² in Figure 6. The time increase for constant n is always in O(d²), only the factor changes with the size of n, meaning my results are consistent.
6.6. Correctness
In order to check if my computation of β̂ is correct, I performed another test. I created 20 tables with n = 1000 and d = 16 where Equation 1 has β = 1 as exact solution. This is achieved by setting the entries of X again as random values in the interval [0, 1000) and then computing Y as Y[i] = Σ_{k=1}^{d} Xi[k] (listing 12). I computed β̂ with the LinReg function and stored max(|β̂ − β|) in a table (listing 13). The results are shown in listing 4. All differences are smaller than 10⁻¹², which means that the error of the computation of β̂ is in a range that can be ignored in most applications and is probably due to the arithmetic precision of the computer.
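The same correctness check is easy to reproduce outside the database. This is an illustrative Python analogue of listings 12 and 13, not the report's SQL.

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 1000, 16
X = rng.random((n, d)) * 1000    # entries in [0, 1000)
Y = X.sum(axis=1)                # Y[i] = sum_k X[i][k], so the exact beta = 1

# beta_hat via the normal equations, as in the final function
beta_hat = np.linalg.inv(X.T @ X) @ (X.T @ Y)
max_error = np.abs(beta_hat - 1).max()
print(max_error < 1e-8)  # True: only floating-point round-off remains
```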
¹ All plots were produced by the python script listed in listing 15.
Figure 4: Total time for n = 10000
Figure 6: Total time for n = 1000000
Listing 4 Results of correctness test
run_id | error
--------+----------------------
1 | 1.3988810110277e-13
2 | 1.11910480882216e-13
3 | 1.27897692436818e-13
4 | 9.05941988094128e-14
5 | 1.18349774425042e-13
6 | 8.34887714518118e-14
7 | 1.20792265079217e-13
8 | 9.72555369571637e-14
9 | 7.74935671188359e-14
10 | 7.90478793533111e-14
11 | 6.52811138479592e-14
12 | 1.17239551400417e-13
13 | 7.19424519957101e-14
14 | 1.09690034832965e-13
15 | 1.7608137170555e-13
16 | 1.32338584535319e-13
17 | 9.45910016980633e-14
18 | 1.04805053524615e-13
19 | 1.4122036873232e-13
20 | 1.13686837721616e-13
(20 rows)

References

R. A. Beezer. A First Course in Linear Algebra. 2006. URL http://linear.ups.edu.

M. Böhlen and J. Gamper. Slides for the "database systems" course at ifi@uzh, spring 2014, 2014. URL http://www.ifi.uzh.ch/dbtg/teaching/courses/DBS.html.

S. Sauter. Numerische lineare Algebra. Technical report, Universität Zürich, Institut für Mathematik, 2014. URL http://www.math.uzh.ch/index.php?ve_vo_det&key2=2158. [Online; accessed 11-July-2014].

Wikipedia. Ordinary least squares — Wikipedia, the free encyclopedia, 2014c. URL http://en.wikipedia.org/w/index.php?title=Ordinary_least_squares&oldid=615091560. [Online; accessed 11-July-2014].

D. Zwillinger. CRC Standard Probability and Statistics Tables and Formulae. 2000. ISBN 1584880597.
A. SQL code
CREATE TYPE MyType AS (
    XtX DOUBLE PRECISION[][],
    Xty DOUBLE PRECISION[]
);
Listing 5: File MyType.sql
CREATE OR REPLACE FUNCTION LinRegStep
    (aggr_state MyType, x DOUBLE PRECISION[], y DOUBLE PRECISION)
RETURNS MyType AS
$$
/*--------
 * step function of aggregate "LinReg"
 *
 * computes xt*x and xt*y from the
 * current row of matrix X and current
 * value of vector y.
 *--------
 */
DECLARE
    -- length of vector x = dimension of Xt*X
    dim INTEGER;
    -- temporary arrays to compute current values
    XtX_tmp DOUBLE PRECISION[][];
    Xty_tmp DOUBLE PRECISION[];
BEGIN
    -- read 'dim'
    dim := array_length(x,1);
    -- initialize output according to current state
    IF aggr_state IS NULL THEN
        -- on empty state, initialize output with 0-matrix and 0-vector
        XtX_tmp := array_fill(0,ARRAY[dim,dim]);
        Xty_tmp := array_fill(0,ARRAY[dim]);
    ELSE
        -- initialize output with current state
        XtX_tmp := (aggr_state).XtX;
        Xty_tmp := (aggr_state).Xty;
    END IF;
    -- compute the current XtX addition
    FOR i IN 1..dim LOOP
        FOR j IN 1..dim LOOP
            XtX_tmp[i][j] := XtX_tmp[i][j] + x[i]*x[j];
        END LOOP;
        -- compute the current Xty addition
        Xty_tmp[i] := Xty_tmp[i] + x[i]*y;
    END LOOP;
    -- cast result to MyType, so the aggregate can handle it
    RETURN (XtX_tmp, Xty_tmp)::MyType;
END;
$$ LANGUAGE plpgsql;
Listing 7: File LinRegStep.sql
time1 TIMESTAMP;
time2 TIMESTAMP;
dim INTEGER;
    FOR i IN 1..dim LOOP
        FOR j IN 1..dim LOOP
            beta[i] := beta[i] + XtX_fin[i][j]*Xty_fin[j];
        END LOOP;
    END LOOP;
    -- all done
    RETURN (beta, inversiontime)::LinRegOut;
END;
$$ LANGUAGE plpgsql;
    --initialize all matrices
    Minv := array_fill(0,ARRAY[dim,dim]);
    L := array_fill(0,ARRAY[dim,dim]);
    Linv := array_fill(0,ARRAY[dim,dim]);
    U := array_fill(0,ARRAY[dim,dim]);
    Uinv := array_fill(0,ARRAY[dim,dim]);
    --compute the LU factorization of M
    FOR k IN 1..dim LOOP
        L[k][k] := 1;
        Linv[k][k] := 1;
        U[k][k] := M[k][k];
        Uinv[k][k] := 1;
        FOR i IN k+1..dim LOOP
            L[i][k] := M[i][k]/U[k][k];
            U[k][i] := M[k][i];
        END LOOP;
        FOR i IN k+1..dim LOOP
            FOR j IN k+1..dim LOOP
                M[i][j] := M[i][j] - L[i][k]*U[k][j];
            END LOOP;
        END LOOP;
    END LOOP;
    --compute Linv and Uinv
    FOR k IN 1..dim LOOP
        FOR i IN 1..k LOOP
            Linv[k][i] := Linv[k][i]/L[k][k];
            Uinv[i][k] := Uinv[i][k]/U[k][k];
        END LOOP;
        FOR i IN k+1..dim LOOP
            FOR j IN 1..k LOOP
                Linv[i][j] := Linv[i][j] - L[i][k]*Linv[k][j];
                Uinv[j][i] := Uinv[j][i] - U[k][i]*Uinv[j][k];
            END LOOP;
        END LOOP;
    END LOOP;
    --compute Minv as Uinv*Linv
    FOR i IN 1..dim LOOP
        FOR j IN 1..dim LOOP
            tmp_val := 0;
            FOR k IN 1..dim LOOP
                tmp_val := tmp_val + Uinv[i][k]*Linv[k][j];
            END LOOP;
            Minv[i][j] := tmp_val;
        END LOOP;
    END LOOP;
    -- all done
    RETURN Minv;
END;
$$ LANGUAGE plpgsql;
B. Tests
CREATE OR REPLACE FUNCTION buildTests()
RETURNS VOID AS
$$
DECLARE
    ndim INTEGER;
    ddim INTEGER;
    x DOUBLE PRECISION[];
    y DOUBLE PRECISION;
BEGIN
    FOR n IN 2..6 LOOP
        FOR d IN 3..6 LOOP
            ndim := 10^n;
            ddim := 2^d;
            FOR id IN 1..5 LOOP
                EXECUTE 'CREATE TABLE test'
                    || format('%s%s%s',ddim,ndim,id)
                    || '(x DOUBLE PRECISION[],y DOUBLE PRECISION)';
                FOR r IN 1..ndim LOOP
                    y := random()*1000;
                    x := array_fill(1,ARRAY[ddim]);
                    FOR i IN 1..ddim LOOP
                        x[i] := random()*1000;
                    END LOOP;
                    EXECUTE 'INSERT INTO test'
                        || format('%s%s%s',ddim,ndim,id)
                        || ' VALUES ('
                        || format('%L,%L',x,y)
                        || ')';
                END LOOP;
            END LOOP;
        END LOOP;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
    ddim INTEGER;
    starttime TIMESTAMP;
    endtime TIMESTAMP;
    totaltime INTERVAL;
    currentout INTERVAL;
BEGIN
    CREATE TABLE IF NOT EXISTS
        results(ddim INTEGER, ndim INTEGER, inversiontime INTERVAL,
                totaltime INTERVAL);
    FOR n IN 1..6 LOOP
        FOR d IN 3..6 LOOP
            ndim := 10^n;
            ddim := 2^d;
            FOR id IN 1..5 LOOP
                starttime := clock_timestamp();
                EXECUTE 'SELECT (LinReg(x,y)).inversiontime FROM test'
                    || format('%sd%sn%si',ddim,ndim,id) INTO currentout;
                endtime := clock_timestamp();
                totaltime := endtime - starttime;
                INSERT INTO results
                    VALUES (ddim,ndim,currentout,totaltime);
            END LOOP;
        END LOOP;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
        EXECUTE 'CREATE TABLE IF NOT EXISTS beta_test'
            || format('%sd%sn%sid',ddim,ndim,i)
            || '(x DOUBLE PRECISION[],y DOUBLE PRECISION)';
        FOR r IN 1..ndim LOOP
            ytmp := 0;
            x := array_fill(1,ARRAY[ddim]);
            FOR j IN 1..ddim LOOP
                x[j] := random()*1000;
                ytmp := ytmp + x[j];
            END LOOP;
            y := ytmp;
            EXECUTE 'INSERT INTO beta_test'
                || format('%sd%sn%sid',ddim,ndim,i)
                || ' VALUES ('
                || format('%L,%L',x,y)
                || ')';
        END LOOP;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
    starttime TIMESTAMP;
    endtime TIMESTAMP;
    totaltime INTERVAL;
    currentout INTERVAL;
    currenterror DOUBLE PRECISION[];
    error DOUBLE PRECISION;
    maxerror DOUBLE PRECISION;
BEGIN
    CREATE TABLE IF NOT EXISTS
        beta_results(run_id INT, error DOUBLE PRECISION);
    FOR i IN 1..20 LOOP
        ndim := 10^3;
        ddim := 16;
        EXECUTE 'SELECT (LinReg(x,y)).beta FROM beta_test'
            || format('%sd%sn%sid',ddim,ndim,i) INTO currenterror;
        maxerror := 0;
        -- use j here so the outer run counter i is not shadowed
        FOR j IN 1..ddim LOOP
            error := ABS(1-currenterror[j]);
            IF error > maxerror THEN
                maxerror := error;
            END IF;
        END LOOP;
        INSERT INTO beta_results VALUES (i,maxerror);
    END LOOP;
END;
$$ LANGUAGE plpgsql;
C. Results
ndim | ddim | totaltime | inversiontime
---------+------+-----------------+-----------------
100 | 8 | 00:00:00.023746 | 00:00:00.002564
100 | 8 | 00:00:00.024939 | 00:00:00.003362
100 | 8 | 00:00:00.023171 | 00:00:00.002412
100 | 8 | 00:00:00.0254 | 00:00:00.002379
100 | 8 | 00:00:00.0232 | 00:00:00.002436
1000 | 8 | 00:00:00.26267 | 00:00:00.002564
1000 | 8 | 00:00:00.285489 | 00:00:00.0029
1000 | 8 | 00:00:00.280187 | 00:00:00.003273
1000 | 8 | 00:00:00.279535 | 00:00:00.004159
1000 | 8 | 00:00:00.269056 | 00:00:00.003036
10000 | 8 | 00:00:02.694995 | 00:00:00.002369
10000 | 8 | 00:00:02.543247 | 00:00:00.002332
10000 | 8 | 00:00:02.564963 | 00:00:00.004127
10000 | 8 | 00:00:02.693838 | 00:00:00.002653
10000 | 8 | 00:00:03.082352 | 00:00:00.004899
100000 | 8 | 00:00:23.726925 | 00:00:00.002501
100000 | 8 | 00:00:25.201353 | 00:00:00.002994
100000 | 8 | 00:00:24.343541 | 00:00:00.002566
100000 | 8 | 00:00:24.142769 | 00:00:00.002506
100000 | 8 | 00:00:24.603467 | 00:00:00.002504
1000000 | 8 | 00:04:05.098265 | 00:00:00.004447
1000000 | 8 | 00:04:06.844248 | 00:00:00.005147
1000000 | 8 | 00:04:20.993924 | 00:00:00.002401
1000000 | 8 | 00:04:21.864473 | 00:00:00.002414
1000000 | 8 | 00:04:06.783134 | 00:00:00.002382
100 | 16 | 00:00:00.112679 | 00:00:00.017259
100 | 16 | 00:00:00.091801 | 00:00:00.015146
100 | 16 | 00:00:00.112025 | 00:00:00.018014
100 | 16 | 00:00:00.112796 | 00:00:00.022704
100 | 16 | 00:00:00.135746 | 00:00:00.025885
1000 | 16 | 00:00:00.815234 | 00:00:00.015561
1000 | 16 | 00:00:00.901771 | 00:00:00.015476
1000 | 16 | 00:00:00.89798 | 00:00:00.024165
1000 | 16 | 00:00:00.933688 | 00:00:00.015925
1000 | 16 | 00:00:00.865702 | 00:00:00.016614
10000 | 16 | 00:00:09.162817 | 00:00:00.017146
10000 | 16 | 00:00:10.426016 | 00:00:00.015481
10000 | 16 | 00:00:10.376657 | 00:00:00.016865
10000 | 16 | 00:00:09.892446 | 00:00:00.028012
10000 | 16 | 00:00:09.445893 | 00:00:00.016338
100000 | 16 | 00:01:32.174325 | 00:00:00.015269
100000 | 16 | 00:01:23.07713 | 00:00:00.015008
100000 | 16 | 00:01:33.950572 | 00:00:00.014945
100000 | 16 | 00:01:27.25961 | 00:00:00.016372
100000 | 16 | 00:01:23.720913 | 00:00:00.014934
1000000 | 16 | 00:15:03.185172 | 00:00:00.015556
1000000 | 16 | 00:14:50.271601 | 00:00:00.01532
1000000 | 16 | 00:15:15.317476 | 00:00:00.020588
1000000 | 16 | 00:15:29.345151 | 00:00:00.020867
1000000 | 16 | 00:15:59.534883 | 00:00:00.015298
100 | 32 | 00:00:00.706183 | 00:00:00.138679
100 | 32 | 00:00:00.7699 | 00:00:00.17718
100 | 32 | 00:00:00.65598 | 00:00:00.141333
100 | 32 | 00:00:00.679164 | 00:00:00.158792
100 | 32 | 00:00:00.764569 | 00:00:00.212274
1000 | 32 | 00:00:04.89616 | 00:00:00.164341
1000 | 32 | 00:00:05.607052 | 00:00:00.188645
1000 | 32 | 00:00:05.15957 | 00:00:00.150948
1000 | 32 | 00:00:05.119471 | 00:00:00.146039
1000 | 32 | 00:00:05.313862 | 00:00:00.207736
10000 | 32 | 00:00:59.118931 | 00:00:00.206354
10000 | 32 | 00:01:02.757765 | 00:00:00.1889
10000 | 32 | 00:00:59.517943 | 00:00:00.16702
10000 | 32 | 00:00:56.774501 | 00:00:00.146103
10000 | 32 | 00:00:58.940245 | 00:00:00.161413
100000 | 32 | 00:08:06.410799 | 00:00:00.15987
100000 | 32 | 00:08:07.161503 | 00:00:00.164449
100000 | 32 | 00:08:43.129977 | 00:00:00.174698
100000 | 32 | 00:08:36.184827 | 00:00:00.164435
100000 | 32 | 00:08:20.626028 | 00:00:00.14783
1000000 | 32 | 01:25:43.143062 | 00:00:00.151011
1000000 | 32 | 01:26:27.980122 | 00:00:00.166694
1000000 | 32 | 01:26:45.59492 | 00:00:00.169309
1000000 | 32 | 01:28:51.414537 | 00:00:00.147906
1000000 | 32 | 01:29:59.636742 | 00:00:00.158089
100 | 64 | 00:00:06.976099 | 00:00:02.632695
100 | 64 | 00:00:06.354614 | 00:00:02.288769
100 | 64 | 00:00:06.496339 | 00:00:02.232434
100 | 64 | 00:00:06.61293 | 00:00:02.376079
100 | 64 | 00:00:06.342603 | 00:00:02.25155
1000 | 64 | 00:00:49.607221 | 00:00:02.533729
1000 | 64 | 00:00:51.139592 | 00:00:02.867665
1000 | 64 | 00:00:50.99478 | 00:00:02.492939
1000 | 64 | 00:00:47.315666 | 00:00:02.448716
1000 | 64 | 00:00:45.623201 | 00:00:02.121799
10000 | 64 | 00:07:07.092826 | 00:00:02.316163
10000 | 64 | 00:07:22.747058 | 00:00:02.263725
10000 | 64 | 00:07:14.827523 | 00:00:02.428302
10000 | 64 | 00:07:51.585268 | 00:00:02.244976
10000 | 64 | 00:07:58.985783 | 00:00:03.273587
100000 | 64 | 01:13:54.132076 | 00:00:02.44789
100000 | 64 | 01:11:44.955574 | 00:00:02.544321
100000 | 64 | 01:14:59.77437 | 00:00:02.500936
100000 | 64 | 01:10:08.247361 | 00:00:02.33984
100000 | 64 | 01:08:59.886234 | 00:00:02.207661
1000000 | 64 | 11:50:16.324562 | 00:00:02.342342
1000000 | 64 | 11:23:20.379733 | 00:00:02.241259
1000000 | 64 | 11:01:12.279202 | 00:00:02.06376
1000000 | 64 | 10:54:03.991939 | 00:00:01.983473
1000000 | 64 | 12:02:32.056664 | 00:00:02.30654
(100 rows)
D. Appendix: other code
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D as ax3d
import numpy as np

def main():
    #read file
    with open('results.txt','r') as infile:
        rawdata = infile.readlines()
    #read data
    ndim, ddim, invtime, tottime = zip(
        *[read_data_line(line.split('|'))
          for line in rawdata[2:-2]])

    fig = plt.figure()
    dplt8 = fig.add_subplot(111)
    xs = [ndim[i] for i in range(len(ndim)) if ddim[i]==8]
    ys = [tottime[i] for i in range(len(ndim)) if ddim[i]==8]
    dplt8.plot(xs,ys)
    dplt8.set_xlabel(r'number of records $n$')
    dplt8.set_ylabel('seconds')
    plt.savefig('d=8_plot.png',transparent=True)

    fig = plt.figure()
    dplt32 = fig.add_subplot(111)
    xs = [ndim[i] for i in range(len(ndim)) if ddim[i]==32]
    ys = [tottime[i] for i in range(len(ndim)) if ddim[i]==32]
    dplt32.plot(xs,ys)
    dplt32.set_xlabel(r'number of records $n$')
    dplt32.set_ylabel('seconds')
    plt.savefig('d=32_plot.png',transparent=True)

    fig = plt.figure()
    nplt4 = fig.add_subplot(111)
    xs = [ddim[i] for i in range(len(ndim)) if ndim[i]==10000]
    ys = [tottime[i] for i in range(len(ndim)) if ndim[i]==10000]
    nplt4.plot(xs,ys)
    nplt4.plot([x for x in range(8,64)],
               [(0.5*y)**2 for y in range(8,64)])
    nplt4.set_xlabel(r'dimension $d$ of $x$ entries')
    nplt4.set_ylabel('seconds')
    plt.savefig('n=10000_plot.png',transparent=True)

    fig = plt.figure()
    nplt5 = fig.add_subplot(111)
    xs = [ddim[i] for i in range(len(ndim)) if ndim[i]==100000]
    ys = [tottime[i] for i in range(len(ndim)) if ndim[i]==100000]
    nplt5.plot(xs,ys)
    nplt5.plot([x for x in range(8,64)],
               [y**2 for y in range(8,64)])
    nplt5.set_xlabel(r'dimension $d$ of $x$ entries')
    nplt5.set_ylabel('seconds')
    plt.savefig('n=100000_plot.png',transparent=True)

    fig = plt.figure()
    nplt6 = fig.add_subplot(111)
    xs = [ddim[i] for i in range(len(ndim)) if ndim[i]==1000000]
    ys = [tottime[i] for i in range(len(ndim)) if ndim[i]==1000000]
    nplt6.plot(xs,ys)
    nplt6.plot([x for x in range(8,64)],
               [(3*y)**2 for y in range(8,64)])
    nplt6.set_xlabel(r'dimension $d$ of $x$ entries')
    nplt6.set_ylabel('seconds')
    plt.savefig('n=1000000_plot.png',transparent=True)

def read_data_line(line):
    return float(line[0]), float(line[1]), \
           translate_time(line[2]), translate_time(line[3])

def translate_time(string):
    '''takes a timestring of the form
    'hh:mm:ss.msmsmsms' and returns seconds as float'''
    #strip to remove additional whitespace
    string = string.strip()
    doubles = [float(field) for field in string.split(':')]
    factors = [3600,60,1]
    return sum([t*f for t,f in zip(doubles,factors)])

if __name__=='__main__':
    main()