
Data smashing: uncovering lurking order in data

Ishanu Chattopadhyay(1,2) and Hod Lipson(3,4)

(1) Computation Institute, University of Chicago, Chicago, IL, USA
(2) Department of Computer Science, School of Mechanical and Aerospace Engineering, Cornell University, Ithaca, NY, USA
(3) School of Mechanical and Aerospace Engineering and (4) Computing and Information Science, Cornell University, Ithaca, NY, USA

Research

IC, 0000-0001-8339-8162

Cite this article: Chattopadhyay I, Lipson H. 2014 Data smashing: uncovering lurking order in data. J. R. Soc. Interface 11: 20140826. http://dx.doi.org/10.1098/rsif.2014.0826

Received: 24 July 2014
Accepted: 10 September 2014

Subject Areas: biocomplexity, mathematical physics, biomathematics

Keywords: feature-free classification, universal metric, probabilistic automata

Author for correspondence: Ishanu Chattopadhyay, e-mail: [email protected]

Electronic supplementary material is available at http://dx.doi.org/10.1098/rsif.2014.0826 or via http://rsif.royalsocietypublishing.org.

© 2014 The Author(s) Published by the Royal Society. All rights reserved.

From automatic speech recognition to discovering unusual stars, underlying almost all automated discovery tasks is the ability to compare and contrast data streams with each other, to identify connections and spot outliers. Despite the prevalence of data, however, automated methods are not keeping pace. A key bottleneck is that most data comparison algorithms today rely on a human expert to specify what 'features' of the data are relevant for comparison. Here, we propose a new principle for estimating the similarity between the sources of arbitrary data streams, using neither domain knowledge nor learning. We demonstrate the application of this principle to the analysis of data from a number of challenging real-world problems, including the disambiguation of electro-encephalograph patterns pertaining to epileptic seizures, detection of anomalous cardiac activity from heart sound recordings and classification of astronomical objects from raw photometry. In all these cases, and without access to any domain knowledge, we demonstrate performance on a par with the accuracy achieved by specialized algorithms and heuristics devised by domain experts. We suggest that data smashing principles may open the door to understanding increasingly complex observations, especially when experts do not know what to look for.

1. Introduction

Any experienced data analyst knows that simply feeding raw data directly into a data analysis algorithm is unlikely to produce meaningful results. Most data analysis today involves a substantial and often laborious preprocessing stage before standard algorithms can work effectively. In this preprocessing stage, data are filtered and reduced into 'features' that are defined and selected by experts who know what aspects of the data are important, based on extensive domain knowledge.

Relying on experts, however, is slow, expensive, error prone and unlikely to keep pace with the growing amounts and complexity of data. Here, we propose a general way to circumvent the reliance on human experts, with relatively little compromise to the quality of results. We discovered that all ordered datasets, regardless of their origin and meaning, share a fundamental universal structure that can be exploited to compare and contrast them without a human-dependent preprocessing step. We suggest that this process, which we call data smashing, may open the door to understanding increasingly complex data in the future, especially when experts cannot keep pace.

Our key observation, presented here, is that all quantitative data streams have corresponding anti-streams, which, in spite of being non-unique, are tied to the stream's unique statistical structure. We then describe the data smashing process by which streams and anti-streams can be algorithmically collided to reveal differences that are difficult to detect using conventional techniques. We establish this principle formally, describe how we implemented it in practice and report its performance on a number of real-world cases from varied disciplines. The results show that without access to any domain knowledge, the unmodified data smashing process performs on a par with specialized algorithms devised by domain experts for each problem independently. For example, we analyse raw electro-encephalographic (EEG) data, and without any domain knowledge find that the measurements from different patients fall into a single curve, with similar pathologies clustered alongside each other. Making such a discovery using conventional techniques would require substantial domain knowledge and data preprocessing (see figure 3a(ii)).
Figure 1. Data smashing: (a) determining the similarity between two data streams is key to any data mining process, but relies heavily on human-prescribed
criteria. (b) Data smashing first encodes each data stream, then collides one with the inverse of the other. The randomness of the resulting stream reflects
the similarity of the original streams, leading to a cascade of downstream applications involving classification, decision and optimization. (Online version in colour.)

2. Anti-streams

The notion of data smashing applies only to data in the form of an ordered series of digits or symbols, such as acoustic waves from a microphone, light intensity over time from a telescope, traffic density along a road or network activity from a router. An anti-stream contains the 'opposite' information from the original data stream and is produced by algorithmically inverting the statistical distribution of symbol sequences appearing in the original stream. For example, sequences of digits that were common in the original stream will be rare in the anti-stream, and vice versa. Streams and anti-streams can then be algorithmically 'collided' in a way that systematically cancels any common statistical structure in the original streams, leaving only information relating to their statistically significant differences. We call this the principle of information annihilation.

Data smashing involves two data streams and proceeds in three steps (see figure 1): raw data streams are first quantized, by converting continuous values to a string of characters or symbols. The simplest example of such quantization is where all positive values are mapped to the symbol '1' and all negative values to '0', thus generating a string of bits. Next, we select one of the quantized input streams and generate its anti-stream. Finally, we smash this anti-stream against the remaining quantized input stream and measure what information remains. The remaining information is estimated from the deviation of the resultant stream from flat white noise (FWN).

As information in a data stream is perfectly annihilated by a correct realization of its anti-stream, any deviation of the collision product from noise quantifies statistical dissimilarity. Using this causal similarity metric, we can cluster streams, classify them or identify stream segments that are unusual or different. The algorithms are linear in input data, implying they can be applied efficiently to streams in near-real time. Importantly, data smashing can be applied without understanding where the streams were generated, how they are encoded and what they represent.

Ultimately, from a collection of data streams and their pairwise similarities, it is possible to automatically 'back out' the underlying metric embedding of the data, revealing their hidden structure for use with traditional machine learning methods.

Dependence across data streams is often quantified using mutual information [1]. However, mutual information and data smashing are distinct concepts. The former measures dependence between streams; the latter computes a distance between the generative processes themselves.

For example, two sequences of independent coin flips necessarily have zero mutual information, but data smashing will identify the streams as similar because they originate in the same underlying process: a coin flip. Moreover, smashing only works correctly if the streams are generated independently (see the electronic supplementary material, Section S-G).

Similarity computed via data smashing is clearly a function of the statistical information buried in the input streams. However, it might not be easy to find the right statistical tool that reveals this hidden information, particularly without domain knowledge, or without first constructing a good system model (see the electronic supplementary material, Section S-H, for an example where smashing reveals non-trivial categories missed by simple statistical measures). We describe in detail the process of computing anti-streams and the process of comparing information. In the electronic supplementary material, we provide theoretical bounds on the confidence levels, minimal data lengths required for reliable analysis and scalability of the process as a function of the signal encodings.

We do not claim strictly superior quantitative performance to the state of the art in all applications; carefully chosen approaches tuned to specific problems can certainly do as well, or better. Our claim is not that we uniformly outperform existing methods, but that we are on a par, yet do so without requiring either expert knowledge or training examples.

3. The hidden models

The notion of a universal metric of similarity makes sense only in the context of an approach that does not rely on arbitrarily defined feature vectors, in particular where one considers pairwise similarity (or dissimilarity) directly between individual measurement sets. However, while the advantage of considering the notion of similarity between datasets instead of between feature vectors has been recognized [2-4], attempts at formulating such measures have been mostly application dependent, often relying heavily on heuristics. A notable exception is a proposed universal normalized compression metric (NCM) based on Kolmogorov's notion of algorithmic complexity [5]. Despite being quite useful in various learning tasks [6-8], NCM is somewhat intuitively problematic as a similarity measure; since even simple stochastic processes may generate highly complex sequences in the Kolmogorov sense [1], data streams from identical sources do not necessarily compute to be similar under NCM (see the electronic supplementary material, Section S-I). We ask whether a more intuitive notion of universal similarity is possible: one that guarantees that identical generators, albeit hidden, produce similar data streams. We show that universal comparison that adheres to this intuitive requirement is indeed realizable, and provably so, at least under some general assumptions on the nature of the generating processes.

The first step in data smashing is to map the possibly continuous-valued sensory observations to discrete symbols via some quantization of the data range (see the electronic supplementary material, Section S-C and figure S3). Each symbol represents a slice of the data range, and the total number of slices defines the symbol alphabet Σ (where |Σ| denotes the alphabet size). The coarsest quantization has a binary alphabet (often referred to as clipping [6,9]) consisting of, say, 0 and 1 (it is not important what symbols we use; we could equally represent the letters of the alphabet with a and b), but finer quantizations with larger alphabets are also possible. A possibly continuous-valued data stream is thus mapped to a symbol sequence over this pre-specified alphabet.

If two data streams are to be smashed, we need the symbols to have the same meaning, i.e. represent the same slice of the data range, in both streams. In other words, the quantization scheme must not vary from one stream to the next. This may be problematic if the data streams have significantly different average values as a result of a wandering baseline or a definite positive or negative trend. One simple de-trending approach is to consider the signed changes in adjacent values of the data series instead of the series itself, i.e. to use the differenced or numerically differentiated series. Differentiating once may not remove the trend in all cases; more sophisticated de-trending may need to be applied. Notably, the exact de-trending approach is not crucially important; what is important is that we use an invariant scheme and that such a scheme is a 'good quantization scheme' in the sense of the detailed criteria set forth in the electronic supplementary material, Section S-C.

The idea of representing continuous-valued time series as symbol streams via application of some form of quantization to the data range is not new, e.g. the widely used symbolic aggregate approximation (SAX) [10]. Quantization involves some information loss, which can be reduced with finer alphabets at the expense of increased computational complexity (see the electronic supplementary material, Section S-C, for details on the quantization scheme, its comparison with reported techniques and on mitigating issues such as wandering baselines, brittleness, etc.). Importantly, our quantization schemes (see electronic supplementary material, figure S3) require no prior domain knowledge.
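To make the de-trending and quantization steps just described concrete, the following minimal Python sketch differences a series and then clips it to a small alphabet. The threshold placement and alphabet size are illustrative choices only, not the tuned schemes of the electronic supplementary material, Section S-C.

import numpy as np

def symbolize(series, thresholds=(0.0,)):
    # Difference once to suppress a wandering baseline or a definite trend.
    dx = np.diff(np.asarray(series, dtype=float))
    # Each symbol is the index of the slice of the (differenced) data range
    # into which a value falls; alphabet size = len(thresholds) + 1.
    return np.searchsorted(thresholds, dx).tolist()

# binary 'clipping' alphabet: changes <= 0 map to symbol 0, changes > 0 to 1
print(symbolize([1.0, 1.4, 1.1, 1.5, 2.0]))   # -> [1, 0, 1, 1]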
3. The hidden models ing baselines, brittleness, etc.). Importantly, our quantization
The notion of a universal metric of similarity makes sense only schemes (see electronic supplementary material, figure S3)
in the context of an approach that does not rely on arbitrarily require no prior domain knowledge.
defined feature vectors, in particular where one considers pair-
wise similarity (or dissimilarity) directly between individual
measurement sets. However, while the advantage of consider- 3.1. Inverting and combining hidden models
ing the notion of similarity between datasets instead of Quantized stochastic processes which capture the statistical
between feature vectors has been recognized [2– 4], attempts structure of symbolic streams can be modelled using prob-
at formulating such measures have been mostly application abilistic automata, provided the processes are ergodic and
dependent, often relying heavily on heuristics. A notable stationary [11 –13]. For the purpose of computing our simi-
exception is a proposed universal normalized compression larity metric, we require that the number of states in the
metric (NCM) based on Kolmogorov’s notion of algorithmic automata be finite (i.e. we only assume the existence of a gen-
complexity [5]. Despite being quite useful in various learning erative probabilistic finite state automata (PFSA)); we do not
tasks [6– 8], NCM is somewhat intuitively problematic as a attempt to construct explicit models or require knowledge of
similarity measure; since even simple stochastic processes either the exact number of states or any explicit bound thereof
may generate highly complex sequences in the Kolmogorov (see figure 2).
sense [1], data streams from identical sources do not necess- A slightly restricted subset of the space of all PFSA over a
arily compute to be similar under NCM (see the electronic fixed alphabet admits an Abelian group structure (see the
supplementary material, Section S-I). We ask whether a more electronic supplementary material, Section S-E), wherein the
intuitive notion of universal similarity is possible; one that operations of commutative addition and inversion are well
guarantees that identical generators, albeit hidden, produce defined. A trivial example of an Abelian group is the set of
similar data streams. We show that universal comparison reals with the usual addition operation; addition of real num-
that adheres to this intuitive requirement is indeed realizable, bers is commutative and each real number a has a unique
and provably so, at least under some general assumptions on inverse 2a, which when summed produce the unique iden-
the nature of the generating processes. tity 0. We have previously discussed the Abelian group
The first step in data smashing is to map the possibly structure on PFSAs in the context of model selection [14].
continuous-valued sensory observations to discrete symbols Here, we show that key group operations, necessary for
via some quantization of the data range (see the electronic sup- classification, can be carried out on the observed sequences
plementary material, Section S-C and figure S3). Each symbol alone, without any state synchronization or reference to the
represents a slice of the data range, and the total number of hidden generators of the sequences.
slices define the symbol alphabet S (where jSj denotes the Existence of a group structure implies that given PFSAs
alphabet size). The coarsest quantization has a binary alphabet G and H, sums G þ H, G 2 H, and unique inverses 2G and
(often referred to as clipping [6,9]) consisting of say 0 and 1 (it 2H are well defined. Individual symbols have no notion of
is not important what symbols we use, we can as well rep- a ‘sign’, and hence the models G and 2G are not generators
resent the letters of the alphabet with a and b), but finer of sign-inverted sequences which would not make sense as
Figure 2. Calculation of causal similarity using data smashing. (a) We quantize raw signals to symbolic sequences over the chosen alphabet and compute a causal similarity between such sequences. The underlying theory is established assuming the existence of generative probabilistic automata for these sequences, but our algorithms do not require explicit model construction, or a priori knowledge of their structures. (b) Concept of stream inversion; while we can find the group inverse of a given PFSA algebraically, we can also transform a generated sequence directly to one that represents the inverse model, without constructing the model itself. (c) Summing PFSA G and its inverse −G yields the zero PFSA W. We can carry out this smashing purely at the sequence level to get FWN. (d) Circuit that allows us to measure similarity distance between streams s1, s2 via computation of ε11, ε22 and ε12 (see table 1). Given a threshold εw > 0, if εkk < εw, then we have sufficient data for stream sk (k = 1, 2). Additionally, if ε12 ≤ εw, then we conclude that s1, s2 have the same stochastic source with high probability (which converges exponentially fast to 1 with the length of input). (Online version in colour.)

For any fixed alphabet size, the zero model is the unique single-state PFSA (up to minimal description [15]) that generates symbols as consecutive realizations of independent random variables with uniform distribution over the symbol alphabet. Thus, W generates FWN, and the entropy rate of FWN achieves the theoretical upper bound among the sequences generated by arbitrary PFSA in the model space. Two PFSAs G, H are identical if and only if G + (−H) = W.

3.2. Metric structure on model space

In addition to the Abelian group, the PFSA space admits a metric structure (see the electronic supplementary material, Section S-D). The distance between two models thus can be interpreted as the deviation of their group-theoretic difference from a FWN process. Data smashing exploits the possibility of estimating causal similarity between observed data streams by estimating this distance from the observed sequences alone, without requiring the models themselves.

We can easily estimate the distance of the hidden model from FWN given only an observed stream s. This is achieved by the function ζ̂ (see table 1, row 4). Intuitively, given an observed sequence fragment x, we first compute the deviation of the distribution of the next symbol from the uniform distribution over the alphabet. ζ̂(s, ℓ) is the sum of these deviations for all historical fragments x with length up to ℓ, weighted by 1/|Σ|^(2|x|). The weighted sum ensures that deviations of the distributions for longer x have a smaller contribution to ζ̂(s, ℓ), which addresses the issue that the occurrence frequencies of longer sequences are more variable.
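The estimator ζ̂ can be transcribed directly from its definition (table 1, row 4). The Python sketch below enumerates every history x of length up to ℓ, so it illustrates the definition rather than reproducing the authors' optimized implementation; symbols are assumed to be coded as integers 0, ..., |Σ| − 1.

import itertools
import numpy as np

def zeta_hat(s, alphabet_size, ell):
    # Deviation of the symbol stream s from flat white noise.
    s = list(s)
    u = 1.0 / alphabet_size                 # uniform next-symbol probability
    total = 0.0
    for k in range(ell + 1):                # histories x with |x| <= ell
        for x in itertools.product(range(alphabet_size), repeat=k):
            counts = np.zeros(alphabet_size)
            for i in range(len(s) - k):     # next-symbol counts after x
                if tuple(s[i:i + k]) == x:
                    counts[s[i + k]] += 1
            if counts.sum() == 0:
                continue                    # history x never occurs in s
            phi = counts / counts.sum()     # empirical phi_s(x)
            total += np.abs(phi - u).max() / alphabet_size ** (2 * k)
    return (alphabet_size - 1) / alphabet_size * total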

4. Key insight: annihilation of information

Our key insight is the following: two sets of sequential observations have the same generative process if the inverted copy of one can annihilate the statistical information contained in the other. We claim that, given two symbol streams s1 and s2, we can check whether the underlying PFSAs (say G1, G2) satisfy the annihilation equality G1 + (−G2) = W, without explicitly knowing or constructing the models themselves.
Table 1. Algorithms for stream operations: procedures to assemble the annihilation circuit in figure 2d, which carries out data smashing. Symbolic derivatives underlie the proofs outlined in the electronic supplementary material; however, for the actual implementation, they are only needed in the final step to compute deviation from FWN.

Independent stream copy (a) (generate an independent sample path from the same hidden stochastic source; this operation is required internally in stream inversion):
(1) generate stream σ0 from FWN;
(2) read current symbol σ1 from s1, and σ2 from σ0;
(3) if σ1 = σ2, then write σ1 to output s′;
(4) read next symbol and go to step 1.

Stream inversion (a) (generate a sample path from the inverse model of the hidden source):
(1) generate |Σ| − 1 independent copies of s1: s^1, ..., s^(|Σ|−1);
(2) read current symbols σi from s^i (i = 1, ..., |Σ| − 1);
(3) if σi ≠ σj for all distinct i, j, then write the single remaining symbol in Σ \ {σ1, ..., σ(|Σ|−1)} to output s′;
(4) read next symbol and go to step 1.

Stream summation (a) (generate a sample path from the sum of the hidden sources):
(1) read current symbols σi from si (i = 1, 2);
(2) if σ1 = σ2, then write σ1 to output s′;
(3) read next symbol and go to step 1.

Deviation from FWN (b) (estimate the deviation of a symbolic stream s from FWN, as a real number in [0, 1]):

ζ̂(s, ℓ) = ((|Σ| − 1)/|Σ|) · Σ_{x : |x| ≤ ℓ} ||φ_s(x) − U_Σ||_∞ / |Σ|^(2|x|),

where
- |Σ| is the alphabet size and |x| is the length of string x;
- ℓ is the maximum length of strings up to which the sum is evaluated; for a given εw, we choose ℓ = ln(1/εw)/ln(|Σ|) (see the electronic supplementary material, Proposition SI-15);
- U_Σ is the uniform probability vector of length |Σ|;
- for σi ∈ Σ, φ_s(x)|i = (number of occurrences of xσi in string s)/(number of occurrences of x in string s).
Symbolic derivatives (electronic supplementary material, Definition S-9 and Section S-B) formalize φ_s(·). If s is generated by a FWN process, then φ_s(x) → U_Σ for any x ∈ Σ*, and hence ζ̂(s, ℓ) → 0.

(a) See the electronic supplementary material, Section S-F, for proof of correctness.
(b) See the electronic supplementary material, Definition S-22, Propositions S-14 and S-15 and Section S-F.
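The three erasure procedures in table 1 admit a near-literal transcription; the following Python sketch (symbols coded as integers 0, ..., |Σ| − 1) is written for clarity rather than efficiency, and is not the authors' released implementation.

import random

def independent_copy(s, alphabet_size):
    # Table 1, row 1: keep a symbol only where it agrees with a FWN stream;
    # the survivors form an independent sample path from the same source.
    return [a for a in s if a == random.randrange(alphabet_size)]

def invert_stream(s, alphabet_size):
    # Table 1, row 2: read one symbol from each of |Sigma| - 1 independent
    # copies of s; when all of them are distinct, emit the one symbol of
    # the alphabet that is left over.
    copies = [independent_copy(s, alphabet_size)
              for _ in range(alphabet_size - 1)]
    inverse = []
    for symbols in zip(*copies):
        if len(set(symbols)) == alphabet_size - 1:
            inverse.append((set(range(alphabet_size)) - set(symbols)).pop())
    return inverse

def sum_streams(s1, s2):
    # Table 1, row 3: selective erasure keeps only the positions where the
    # two streams agree, realizing a sample path from the summed sources.
    return [a for a, b in zip(s1, s2) if a == b]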

Data smashing is predicated on being able to invert and sum streams, and to compare streams to noise. Inversion generates a stream s′ given a stream s, such that if PFSA G is the source for s, then −G is the source for s′. Summation collides two streams: given streams s1 and s2, generate a new stream s′ which is a realization of FWN if and only if the hidden models G1, G2 satisfy G1 + G2 = W. Finally, deviation of a stream s from that generated by a FWN process can be calculated directly.

Importantly, for a stream s (with generator G), the inverted stream s′ is not unique. Any symbol stream generated from the inverse model −G qualifies as an inverse for s; thus anti-streams are non-unique. What is indeed unique is the generating inverse PFSA model. As our technique compares the hidden stochastic processes and not their possibly non-unique realizations, the non-uniqueness of anti-streams is not problematic.

However, carrying out these operations in the absence of the model is problematic. In particular, we have no means to correct for any mis-synchronization of the states of the hidden models. Additionally, we want a linear-time algorithm, implying that it is desirable to carry out these operations in a memoryless symbol-by-symbol fashion. Thus, we use the notion of pseudo-copies of probabilistic automata: given a PFSA G with a transition probability matrix P, a pseudo-copy 𝒫(G) is any PFSA which has the same structure as G, but with the transition matrix

𝒫(P) = γ[I − (1 − γ)P]^(−1) P,    (4.1)

for some scalar γ ∈ (0, 1). We show that the operations described above can be carried out efficiently, once we are willing to settle for stream realizations from pseudo-copies instead of the exact models. This does not cause a problem in the disambiguation of hidden dynamics, because the invertibility of the map in equation (4.1) guarantees that pseudo-copies of distinct models remain distinct, and nearly identical hidden models produce nearly identical pseudo-copies. Thus, despite the possibility of mis-synchronization between hidden model states, the applicability of the algorithms shown in table 1 for disambiguation of hidden dynamics is valid. We show in the electronic supplementary material, Section S-F, that the algorithms evaluate distinct models to be distinct, and nearly identical hidden models to be nearly identical.
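Equation (4.1) is a simple matrix map, sketched below with numpy; the value γ = 0.5 and the 2 × 2 matrix are illustrative.

import numpy as np

def pseudo_copy_matrix(P, gamma=0.5):
    # Equation (4.1): P_gamma = gamma * (I - (1 - gamma) P)^(-1) P.
    P = np.asarray(P, dtype=float)
    I = np.eye(P.shape[0])
    return gamma * np.linalg.solve(I - (1.0 - gamma) * P, P)

P = np.array([[0.7, 0.3],
              [0.9, 0.1]])
print(pseudo_copy_matrix(P).sum(axis=1))   # rows remain stochastic: [1. 1.]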
Estimating the deviation of a stream from FWN is straightforward (as specified by ζ̂(s, ℓ) in table 1, row 4). All subsequences of a given length must necessarily occur with the same frequency for a FWN process, and we simply estimate the deviation from this behaviour in the observed sequence. The other two tasks are carried out via selective erasure of symbols from the input stream(s) (see table 1, rows 1-3). For example, summation of streams is realized as follows: given two streams s1, s2, we read a symbol from each stream; if they match, then we copy it to our output, and we ignore the symbols read when they do not match.

Thus, data smashing allows us to manipulate streams via selective erasure, to estimate a distance between the hidden stochastic sources. Specifically, we estimate the degree to which the sum of a stream and its anti-stream brings the entropy rate of the resultant stream close to its theoretical upper bound.

4.1. Contrast with feature-based state of art

Contemporary research in machine learning is dominated by the search for good 'features' [16], which are typically understood to be heuristically chosen discriminative attributes characterizing objects or phenomena of interest. Finding such attributes is not easy [17,18]. Moreover, the number of characterizing features, i.e. the size of the feature set, needs to be relatively small to avoid intractability of the subsequent learning algorithms. Additionally, their heuristic definition precludes any notion of optimality; it is impossible to quantify the quality of a given feature set in any absolute terms; we can only compare how it performs in the context of a specific task against a few selected variations.

In addition to the heuristic nature of feature selection, machine learning algorithms typically necessitate the choice of a distance metric in the feature space. For example, the classic 'nearest neighbour' k-NN classifier [19] requires a definition of proximity, and the k-means algorithm [20] depends on pairwise distances in the feature space for clustering. To side-step the heuristic metric problem, recent approaches often learn appropriate metrics directly from data, attempting to 'back out' a metric from side information or labelled constraints [21]. Unsupervised approaches use dimensionality reduction and embedding strategies to uncover the geometric structure of geodesics in the feature space (e.g. see manifold learning [22-24]). However, automatically inferred data geometry in the feature space is, again, strongly dependent on the initial choice of features. As Euclidean distances between feature vectors are often misleading [22], heuristic features make it impossible to conceive of a task-independent universal metric.

By contrast, smashing is based on an application-independent notion of similarity between quantized sample paths observed from hidden stochastic processes. Our universal metric quantifies the degree to which the summation of the inverted copy of any one stream to the other annihilates the existing statistical dependencies, leaving behind FWN. We circumvent the need for features altogether (see figure 1b) and do not require training.

Despite the fact that the estimation of similarities between two data streams is performed in the absence of knowledge of the underlying source structure or its parameters, we establish that this universal metric is causal, i.e. with sufficient data it converges to a well-defined distance between the hidden stochastic sources themselves, without ever knowing them explicitly.

4.2. Contrast with existing model-free approaches to time-series analysis

Assumption-free time-series analysis to identify localized discords or anomalies has been studied extensively [6,7,25,26]. A significant majority of these reported approaches use SAX [10] for representing the possibly continuous-valued raw data streams. In contrast to our more naive quantization approach, where we map individual raw data streams to individual symbol sequences, SAX typically outputs a set of short symbol sequences (referred to as SAX-words) obtained via quantization over a sliding window on a smoothed version of the raw data stream (see the electronic supplementary material, Section S-C, for a more detailed discussion). While the quantization details are somewhat different, both approaches essentially attempt to use information from the occurrence frequency of symbols or symbol-histories. However, choosing the length of the SAX-words beforehand amounts to knowing a priori the memory in the underlying process. By contrast, data smashing does not pre-assume any finite bound on the memory, and the self-annihilation error (see §5) provides us with a tool to check if the amount of available data is sufficient for carrying out the operations described in table 1. The underlying processes need to be at least approximately ergodic and stationary for both approaches. Nevertheless, data smashing is more advantageous for slow-mixing conditions and for processes which, for a fixed chosen word-length, induce similar frequencies of observed sequence fragments (see the electronic supplementary material, Section S-J). Importantly, no reported technique, to the best of our knowledge, has a built-in automatic check for data sufficiency of the kind that the self-annihilation error provides for data smashing.

SAX by itself does not lead to any notion of universal similarity. However, the NCM based on Kolmogorov complexity has been successfully used on symbolized data streams for parameter-free data mining and clustering [8]. While NCM is an elegant universal metric, the distance computed via smashing reflects similarity in a more intuitive manner; data from identical generators always mutually annihilate to FWN, implying that identical generators generate similar data streams (see the electronic supplementary material, Section S-I). Additionally, NCM needs to approximately 'calculate' incomputable quantities, and in theory needs to allow for unspecified additive constants. By contrast, we can compute the asymptotic convergence rate of the self-annihilation error.

5. Algorithmic steps

5.1. Self-annihilation test for data-sufficiency check

The statistical characteristics of the underlying processes, e.g. the correlation lengths, dictate the amount of data required for estimation of the proposed distance. With no access to the hidden models, we cannot estimate the required data length a priori; however, it is possible to check for data sufficiency for a specified error threshold via self-annihilation. As the proposed metric is causal, the distance between two independent samples from the same source always converges to zero. We estimate the degree of self-annihilation achieved in order to determine data sufficiency; that is, a stream is sufficiently long if it can sufficiently annihilate an inverted self-copy to FWN.
The self-annihilation-based data-sufficiency test consists of two steps: given an observed symbolic sequence s, we first generate an independent copy (say s′). This is the independent stream copy operation (see table 1, row 1), which can be carried out via selective symbol erasure without any knowledge of the source itself. Once we have s and s′, we check if the inverted version of one annihilates the other to a pre-specified degree. If a stream is not able to annihilate its inverted copy, it is too short for data smashing (see the electronic supplementary material, Section S-F).
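Assembling the sketches given after table 1, the self-annihilation check (and, with two distinct streams, the εij entries of the circuit in figure 2d) can be phrased as follows; the threshold value 0.05 is illustrative.

def smash(s1, s2, alphabet_size, ell):
    # Deviation from FWN of s1 collided with the anti-stream of s2.
    collision = sum_streams(s1, invert_stream(s2, alphabet_size))
    return zeta_hat(collision, alphabet_size, ell)

def is_sufficient(s, alphabet_size, ell, eps_w=0.05):
    # A stream is long enough for smashing if it annihilates an
    # independent self-copy down to (near) flat white noise.
    return smash(s, independent_copy(s, alphabet_size),
                 alphabet_size, ell) < eps_w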
Selective erasure in annihilation (see table 1) implies that the output tested for being FWN is shorter compared with the input stream, and the expected shortening ratio β can be explicitly computed (see the electronic supplementary material, Section S-F). We refer to β as the annihilation efficiency, because the convergence rate of the self-annihilation error scales asymptotically as O(1/√(β|s|)) (see the electronic supplementary material, Proposition S-16). In other words, the required length |s| of the data stream to achieve a self-annihilation error of εw scales as 1/(β(εw)²). Importantly, the electronic supplementary material, Proposition S-13, shows that the annihilation efficiency is independent of the descriptional complexity, i.e. the number of causal states, of the underlying generating process. This, in combination with the electronic supplementary material, Proposition S-16, implies that the convergence of the self-annihilation error is asymptotically independent of the number of states in the process (see the electronic supplementary material, Section S-F, for a detailed discussion following Proposition S-16). As an illustration, note that in the electronic supplementary material, figure S9, the self-annihilation error for a simpler two-state process converges faster compared with a four-state process. Note that the convergence rate O(1/√(β|s|)) holds only in an asymptotic sense, and the mixing time of the underlying process does indeed affect how fast the error drops with input length.
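As a rough numerical illustration of the 1/(β(εw)²) scaling (the efficiency β = 0.25 below is an arbitrary illustrative value, not a derived one):

beta = 0.25                         # illustrative annihilation efficiency
for eps_w in (0.10, 0.05, 0.01):
    # halving the target error roughly quadruples the required length
    print(eps_w, round(1 / (beta * eps_w ** 2)))   # 400, 1600, 40000 symbols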
The self-annihilation error is also useful to rank the effectiveness of different quantization schemes. Better quantization schemes (e.g. ternary instead of binary) will be able to produce better self-annihilation while maintaining the ability to discriminate different streams (see the electronic supplementary material, Section S-C).

5.2. Feature-free classification and clustering

Given n data streams s1, ..., sn, we construct a matrix E such that Eij represents the estimated distance between the streams si, sj. Thus, the diagonal elements of E are the self-annihilation errors, while the off-diagonal elements represent inter-stream similarity estimates (see figure 2d for the basic annihilation circuit). This circuit yields three non-negative real numbers εii, εij, εjj, which define the corresponding ijth entries of E. Given a positive threshold εw > 0, the self-annihilation tests are passed if εkk ≤ εw (k = i, j), and for sufficient data the streams si, sj have identical sources with high probability if and only if εij ≤ εw. Once E is constructed, we can determine clusters by rearranging E into prominent diagonal blocks. Any standard technique [27] can be used for such clustering; data smashing is only used to find the causal distances between observed data streams, and the resultant distance matrix can then be used as input to state-of-the-art clustering methodologies, or to find geometric structures (such as lower dimensional embedding manifolds [22]) induced by the similarity metric on the data sources.

The matrix H, obtained from E by setting the diagonal entries to zero, estimates a distance matrix. A Euclidean embedding [28] of H then leads to deeper insight into the geometry of the space of the hidden generators. For example, in the case of the EEG data, the time series' embeddings describe a one-dimensional manifold (a curve), with data from similar phenomena clustered together along the curve (see figure 3a(ii)).
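A sketch of this step, assuming `streams` holds already-quantized sequences and the `smash` helper sketched in §5.1 is available; the hierarchical-clustering call, the cluster count of 4 and the three-dimensional embedding are illustrative choices, not the authors' pipeline.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.manifold import MDS

def smash_distance_matrix(streams, alphabet_size, ell):
    # E[i][j]: inter-stream distance; E[i][i]: self-annihilation error.
    n = len(streams)
    E = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            E[i, j] = E[j, i] = smash(streams[i], streams[j],
                                      alphabet_size, ell)
    return E

E = smash_distance_matrix(streams, alphabet_size=2, ell=3)
H = E - np.diag(np.diag(E))                  # zero the diagonal
labels = fcluster(linkage(squareform(H), method='average'),
                  t=4, criterion='maxclust')
coords = MDS(n_components=3, dissimilarity='precomputed').fit_transform(H)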
5.3. Computational complexity

The asymptotic time complexity of carrying out the stream operations scales linearly with the input length and the granularity of the alphabet (see the electronic supplementary material, Section S-F, and figure 4b for an illustration of the linear-time complexity of estimating inter-stream similarity).

6. Limitations and assumptions

Data smashing is not directly useful in problems which do not require a notion of similarity, e.g. predicting the future course of a time series, or for problems that do not involve the analysis of a stream, such as comparing images or unordered datasets.

For problems to which smashing is applicable, we implicitly assume the existence of PFSA generators, although we never find these models explicitly. It follows that what we actually assume is not any particular modelling framework, but that the systems of interest satisfy the properties of ergodicity and stationarity, and have a finite (but not a priori bounded) number of states (see the electronic supplementary material, Section S-D). In practice, our technique performs well even if these properties are only approximately satisfied (e.g. quasi-stationarity instead of stationarity; see the example in the electronic supplementary material, Section S-H). The algebraic structure of the space of PFSAs (in particular, the existence of unique group inverses) is key to data smashing; however, we argue that any quantized ergodic stationary stochastic process is indeed representable as a probabilistic automaton (see the electronic supplementary material, Section S-D).

Data smashing is not applicable to data from strictly deterministic systems. Such systems are representable by probabilistic automata; however, transitions occur with probabilities which are either 0 or 1. PFSAs with zero-probability transitions are non-invertible, which invalidates the underlying theoretical guarantees (see the electronic supplementary material, Section S-E). Similarly, data streams in which some alphabet symbol is exceedingly rare would be difficult to invert (see the electronic supplementary material, Section S-F, for the notion of annihilation efficiency).

Symbolization of a continuous data stream invariably introduces quantization error. This can be made small by using larger alphabets. However, larger alphabet sizes demand longer observed sequences (see the electronic supplementary material, Section S-F and figure S8), implying that the length of observation limits the quantization granularity, and in the process limits the degree to which the quantization error can be mitigated. Importantly, with coarse quantizations, distinct processes may evaluate to be similar. However, identical processes will still evaluate to be identical (or nearly so), provided the streams pass the self-annihilation test. The self-annihilation test thus offers an application-independent way to compare and rank quantization schemes (see the electronic supplementary material, Section S-C).
Figure 3. Data smashing applications. Pairwise distance matrices, identified clusters and three-dimensional projections of Euclidean embeddings for epileptic pathology identification (a), identification of heart murmur (b) and classification of variable stars from photometry (c). In these applications, the relevant clusters are found unsupervised. (Online version in colour.)


Figure 4. Computational complexity and convergence rates for data smashing. (a) Exponential convergence of the self-annihilation error for a small set of data series for different applications ((i) for EEG data, (ii) for heart sound recordings and (iii) for photometry). (b) Computation times for carrying out annihilation using the circuit shown in figure 2d as a function of the length of input streams, for different alphabet sizes (and for different numbers of states in the hidden models). Note that the asymptotic time complexity of obtaining the similarity distances scales as O(|Σ|n), where n is the length of the shorter of the two input streams. (Online version in colour.)

The algorithmic steps (see table 1) require no synchronization (we can start reading the streams anywhere), implying that non-equal lengths of time series and phase mismatches are of no consequence.

7. Application examples

Data smashing begins with quantizing streams to symbolic sequences, followed by the use of the annihilation circuit (figure 2d) to compute pairwise causal similarities. Details of the quantization schemes, computed distance matrices, identified clusters and Euclidean embeddings are summarized in table 2 and figure 3 (see also the electronic supplementary material, Sections S-A and S-C).

Our first application is classification of brain electrical activity from different physiological and pathological brain states [29]. We used sets of EEG data series consisting of surface EEG recordings from healthy volunteers with eyes closed and open, and intracranial recordings from epilepsy patients during seizure-free intervals from within and from outside the seizure generating area, as well as intracranial recordings of seizures.

Starting with the data series of electric potentials, we generated sequences of relative changes between consecutive values before quantization. This step allows a common alphabet for sequences with wide variability in the sequence mean values.
Table 2. Application problems and results (see the electronic supplementary material, table S1, for a more detailed version).

(1) Identify epileptic pathology [29].
Input description: 495 EEG excerpts, each 23.6 s sampled at 173.61 Hz; signal derivative as input; three-letter quantization (a).
Classification performance: no comparable result is available in the literature. However, IA (information annihilation) reveals a one-dimensional manifold structure in the dataset, while [29], with additional assumptions on the nature of hidden processes, fails to yield such insight.

(2) Identify heart murmur [30].
Input description: 65 .wav files sampled at 44.1 kHz (approx. 10 s each); two-letter quantization (a).
Classification performance: state of the art [30] achieved in supervised learning with task-specific features.

(3) Classify variable stars (Cepheid variable versus RR Lyrae) from photometry (OGLE II) [31].
Input description: 10 699 photometric series; differentiated folded/raw photometry used as input; three-letter quantization (a).
Classification performance: state of the art [31] achieved with task-specific features and multiple hand-optimized classification steps (classification from raw, unfolded photometry is beyond the state of art).

(4) EEG-based biometric authentication [32] with visually evoked potentials.
Input description: 122 subjects, multivariate data from 61 standard electrodes; 256 data points for each trial for each electrode; total number of data series: 5477 (each with 61 variables); two-letter quantization (a).
Classification performance: state of the art [33] achieved with task-specific features, and after eliminating two subjects from consideration.

(5) Text-independent speaker identification using the ELSDSR database [34].
Input description: 23 speakers (9 female, 14 male), 16 kHz recording; approximately 100 s of recording per speaker; 2 s snippets used as time-series excerpts; total number of time series: 1270; two-letter quantization (a).
Classification performance: state of the art [35] achieved with task-specific features and multiple hand-optimized classification steps.

(a) See the electronic supplementary material, Section S-C, for details on choosing quantization schemes.

The distance matrix from pairwise smashing yielded clear clusters corresponding to seizure, normal eyes open, normal eyes closed and epileptic pathology in non-seizure conditions (see figure 3a; seizures not shown due to large differences from the rest).

Embedding the distance matrix (see figure 3a(i)) yields a one-dimensional manifold (a curve), with contiguous segments corresponding to different brain states; e.g. the right-hand side of plane A corresponds to epileptic pathology. This provides a particularly insightful picture, which eludes complex nonlinear modelling [29].

Next, we classify cardiac rhythms from noisy heart-sound data recorded using a digital stethoscope [30]. We analysed 65 data series (ignoring the labels) corresponding to healthy rhythms and murmur, to verify if we could identify clusters without supervision that correspond to the expert-assigned labels.

We found 11 clusters in the distance matrix (see figure 3b), four of which consisted of mainly data with murmur (as determined by the expert labels), and the rest consisting of mainly healthy rhythms (see figure 3b(iv)). Classification precision for murmur is noted in table 2 (75.2%). Embedding of the distance matrix revealed a two-dimensional manifold (see figure 3b(iii)).

mainly healthy rhythms (see figure 3b(iv)). Classification pre- data series from the same subject will annihilate each other 10
cision for murmur is noted in table 2 (75.2%). Embedding of correctly, whereas those from different subjects will fail to

rsif.royalsocietypublishing.org
the distance matrix revealed a two-dimensional manifold (see do so to the same extent. We outperformed the state of art
figure 3b(iii)). for both kNN- and SVM-based approaches (see table 2, and
Our next problem is the classification of variable stars the electronic supplementary material, Section S-A).
using light intensity series ( photometry) from the optical Our fifth application is text-independent speaker identifi-
gravitational lensing experiment (OGLE) survey [31]. Super- cation using the ELSDSR database [34], which includes
vised classification of photometry proceeds by first ‘folding’ recording from 23 speakers (9 female and 14 male, with poss-
each light-curve to its known period to correct phase mis- ibly non-native accents). As before, training involved
matches. In our first analysis, we started with folded light- specifying the library series for each speaker. We computed
curves and generated data series of the relative changes the distance matrix by smashing the library data series
between consecutive brightness values in the curves before against each other and trained a simple kNN on the Eucli-

J. R. Soc. Interface 11: 20140826


quantization, which allows for the use of a common alphabet dean embedding of the distance matrix. The test data then
for light-curves with wide variability in the mean brightness yielded a classification accuracy of 80.2%, which beat the
values. Using data for Cepheids and RR Lyrae (3426 Cep- state of art figure of 73.73% for 2 s snippets of recording
heids, 7273 RR Lyrae), we obtained a classification accuracy data [35] (see table 2 and the electronic supplementary
of 99.8% which marginally outperforms the state of art (see material, figure S1b).
table 2). Clear clusters (obtained unsupervised) correspond-
ing to the two classes can be seen in the computed distance
matrix (see figure 3c(i)) and the three-dimensional projection
of its Euclidean embedding (see figure 3c(ii)). The three- 8. Conclusion
dimensional embedding was very nearly constrained within We introduced data smashing to measure causal similarity
a two-dimensional manifold (see figure 3c(ii)). between series of sequential observations. We demonstrated
Additionally, in our second analysis, we asked if data that our insight allows feature-less model-free classification in
smashing can work without knowledge of the period of the diverse applications, without the need for training, or expert
variable star; skipping the folding step. Smashing raw pho- tuned heuristics. Non-equal length of time series, missing
tometry data yielded a classification accuracy of 94.3% for data and possible phase mismatches are of little consequence.
the two classes (see table 2). This direct approach is beyond While better classification algorithms likely exist for
state of the art techniques. specific problem domains, such algorithms are difficult to
Our fourth application is biometric authentication using develop and tune. The strength of data smashing lies in its
visually evoked EEG potentials. The public database used [32] ability to circumvent both the need for expert-defined heuris-
considered 122 subjects, each of whom was exposed to pictures tic features and expensive training, thereby eliminating one of
of objects chosen from the standardized Snodgrass set [36]. the major bottlenecks in contemporary big data challenges.
Note that while this application is supervised (as we are
not attempting to find clusters unsupervised), no actual train- Data accessibility. Software implementations with multi-platform
ing is involved; we merely mark the randomly chosen compatibility may be obtained online at http://creativemachines.
cornell.edu or by contacting the corresponding author.
subject-specific set of data series as the library set represent-
Funding statement. This work has been supported in part by US Defense
ing each individual subject. If ‘unknown’ test data series is Advanced Research Projects Agency (DARPA) project W911NF-
smashed against each element of each of the libraries corre- 12-1-0449, and the US Army Research Office (ARO) project
sponding to the individual subjects, we expected that the W911NF-12-1-0499.

References
1. Cover TM, Thomas JA. 1991 Elements of information theory. New York, NY: John Wiley.
2. Duin RPW, de Ridder D, Tax DMJ. 1997 Experiments with a feature-less approach to pattern recognition. Pattern Recognit. Lett. 18, 1159-1166. (doi:10.1016/S0167-8655(97)00138-4)
3. Mottl V, Dvoenko S, Seredin O, Kulikowski C, Muchnik J. 2001 Featureless pattern recognition in an imaginary Hilbert space and its application to protein fold classification. In Machine learning and data mining in pattern recognition (ed. P Perner). Lecture Notes in Computer Science, vol. 2123, pp. 322-336. Berlin, Germany: Springer. (doi:10.1007/3-540-44596-X_26)
4. Pekalska E, Duin RPW. 2002 Dissimilarity representations allow for building good classifiers. Pattern Recognit. Lett. 23, 943-956. (doi:10.1016/S0167-8655(02)00024-7)
5. Li M, Chen X, Li X, Ma B, Vitanyi PMB. 2004 The similarity metric. IEEE Trans. Inform. Theory 50, 3250-3264. (doi:10.1109/TIT.2004.838101)
6. Ratanamahatana C, Keogh E, Bagnall AJ, Lonardi S. 2005 A novel bit level time series representation with implication of similarity search and clustering. In Advances in knowledge discovery and data mining (eds T Ho, D Cheung, H Liu). Lecture Notes in Computer Science, vol. 3518, pp. 771-777. Berlin, Germany: Springer. (doi:10.1007/11430919_90)
7. Wei L, Kumar N, Lolla V, Keogh EJ, Lonardi S, Ratanamahatana C. 2005 Assumption-free anomaly detection in time series. In Proc. 17th Int. Conf. on Scientific and Statistical Database Management, pp. 237-240. Berkeley, CA: Lawrence Berkeley Laboratory.
8. Keogh E, Lonardi S, Ratanamahatana CA. 2004 Towards parameter-free data mining. In Proc. 10th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 206-215. New York, NY: ACM. (doi:10.1145/1014052.1014077)
9. Anthony B, Ratanamahatana C, Keogh E, Lonardi S, Janacek G. 2006 A bit level representation for time series data mining with shape based similarity. Data Min. Knowl. Discov. 13, 11-40. (doi:10.1007/s10618-005-0028-0)
10. Keogh E, Lin J, Fu A. 2005 HOT SAX: efficiently finding the most unusual time series subsequence. In Proc. 5th IEEE Int. Conf. on Data Mining, pp. 226-233. Washington, DC: IEEE Computer Society. (doi:10.1109/ICDM.2005.79)
11. Chattopadhyay I, Lipson H. 2013 Abductive learning of quantized stochastic processes with probabilistic finite automata. Phil. Trans. R. Soc. A 371, 20110543. (doi:10.1098/rsta.2011.0543)
12. Crutchfield JP, McNamara BS. 1987 Equations of motion from a data series. Complex Syst. 1, 417-452.
13. Crutchfield JP. 1994 The calculi of emergence: computation, dynamics and induction. Phys. D Nonlinear Phenom. 75, 11-54. (doi:10.1016/0167-2789(94)90273-9)
14. Chattopadhyay I, Wen Y, Ray A. 2010 Pattern classification in symbolic streams via semantic annihilation of information. (http://arxiv.org/abs/1008.3667)
15. Chattopadhyay I, Ray A. 2008 Structural transformations of probabilistic finite state machines. Int. J. Control 81, 820-835. (doi:10.1080/00207170701704746)
16. Duda RO, Hart PE, Stork DG. 2000 Pattern classification, 2nd edn. New York, NY: Wiley-Interscience.
17. Brumfiel G. 2011 High-energy physics: down the petabyte highway. Nature 469, 282-283. (doi:10.1038/469282a)
18. Baraniuk RG. 2011 More is less: signal processing and the data deluge. Science 331, 717-719. (doi:10.1126/science.1197448)
19. Cover T, Hart P. 1967 Nearest neighbor pattern classification. IEEE Trans. Inform. Theory 13, 21-27. (doi:10.1109/TIT.1967.1053964)
20. MacQueen JB. 1967 Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp. on Mathematical Statistics and Probability, pp. 281-297. Berkeley, CA: University of California Press.
21. Yang L. 2007 An overview of distance metric learning. In Proc. Computer Vision and Pattern Recognition, 7 October.
22. Tenenbaum J, de Silva V, Langford J. 2000 A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319-2323. (doi:10.1126/science.290.5500.2319)
23. Roweis S, Saul L. 2000 Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323-2326. (doi:10.1126/science.290.5500.2323)
24. Seung H, Lee D. 2000 Cognition. The manifold ways of perception. Science 290, 2268-2269. (doi:10.1126/science.290.5500.2268)
25. Kumar N, Lolla N, Keogh E, Lonardi S, Ratanamahatana CA. 2005 Time-series bitmaps: a practical visualization tool for working with large time series databases. In Proc. SIAM 2005 Data Mining Conf. (ed. H Kargupta et al.), pp. 531-535. Philadelphia, PA: SIAM.
26. Wei L, Keogh E, Xi X, Lonardi S. 2005 Integrating lite-weight but ubiquitous data mining into GUI operating systems. J. Univers. Comput. Sci. 11, 1820-1834.
27. Ward JH. 1963 Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58, 236-244. (doi:10.1080/01621459.1963.10500845)
28. Sippl MJ, Scheraga HA. 1985 Solution of the embedding problem and decomposition of symmetric matrices. Proc. Natl Acad. Sci. USA 82, 2197-2201. (doi:10.1073/pnas.82.8.2197)
29. Andrzejak RG, Lehnertz K, Mormann F, Rieke C, David P, Elger CE. 2001 Indications of nonlinear deterministic and finite dimensional structures in time series of brain electrical activity: dependence on recording region and brain state. Phys. Rev. E 64, 061907. (doi:10.1103/PhysRevE.64.061907)
30. Bentley P et al. 2011 The PASCAL Classifying Heart Sounds Challenge 2011 (CHSC2011) results. See http://www.peterjbentley.com/heartchallenge/index.html.
31. Szymanski MK. 2005 The optical gravitational lensing experiment. Internet access to the OGLE photometry data set: OGLE-II BVI maps and I-band data. Acta Astron. 55, 43-57.
32. Begleiter H. 1995 EEG database data set. New York, NY: Neurodynamics Laboratory, State University of New York Health Center Brooklyn. See http://archive.ics.uci.edu/ml/datasets/EEG+Database.
33. Brigham K, Kumar BVKV. 2010 Subject identification from electroencephalogram (EEG) signals during imagined speech. In 4th IEEE Int. Conf. on Biometrics: Theory, Applications and Systems (BTAS), pp. 1-8. (doi:10.1109/BTAS.2010.5634515)
34. English Language Speech Database for Speaker Recognition (ELSDSR). 2005 Department of Informatics and Mathematical Modelling, Technical University of Denmark. See http://www2.imm.dtu.dk/~lfen/elsdsr/.
35. Feng L, Hansen LK. 2005 A new database for speaker recognition. IMM Technical Report 2005-05. Kgs. Lyngby, Denmark: Technical University of Denmark, DTU. See http://www2.imm.dtu.dk/pubdb/p.php?3662.
36. Snodgrass JG, Vanderwart M. 1980 A standardized set of 260 pictures: norms for name agreement, image agreement, familiarity, and visual complexity. J. Exp. Psychol. Hum. Learn. Mem. 6, 174-215. (doi:10.1037/0278-7393.6.2.174)
