Data smashing: uncovering lurking order in data

Ishanu Chattopadhyay and Hod Lipson

Cite this article: Chattopadhyay I, Lipson H. 2014 Data smashing: uncovering lurking order in data. J. R. Soc. Interface 11: 20140826. http://dx.doi.org/10.1098/rsif.2014.0826

Received: 24 July 2014. Accepted: 10 September 2014.

Subject Areas: biocomplexity, mathematical physics, biomathematics.
Keywords: feature-free classification, universal metric, probabilistic automata.
Author for correspondence: Ishanu Chattopadhyay, e-mail: [email protected].
Electronic supplementary material is available at http://dx.doi.org/10.1098/rsif.2014.0826 or via http://rsif.royalsocietypublishing.org.
© 2014 The Author(s). Published by the Royal Society. All rights reserved.

From automatic speech recognition to discovering unusual stars, underlying almost all automated discovery tasks is the ability to compare and contrast data streams with each other, to identify connections and spot outliers. Despite the prevalence of data, however, automated methods are not keeping pace. A key bottleneck is that most data comparison algorithms today rely on a human expert to specify what 'features' of the data are relevant for comparison. Here, we propose a new principle for estimating the similarity between the sources of arbitrary data streams, using neither domain knowledge nor learning. We demonstrate the application of this principle to the analysis of data from a number of real-world challenging problems, including the disambiguation of electro-encephalograph patterns pertaining to epileptic seizures, detection of anomalous cardiac activity from heart sound recordings and classification of astronomical objects from raw photometry. In all these cases and without access to any domain knowledge, we demonstrate performance on a par with the accuracy achieved by specialized algorithms and heuristics devised by domain experts. We suggest that data smashing principles may open the door to understanding increasingly complex observations, especially when experts do not know what to look for.
1. Introduction

Any experienced data analyst knows that simply feeding raw data directly into a data analysis algorithm is unlikely to produce meaningful results. Most data analysis today involves a substantial and often laborious preprocessing stage before standard algorithms can work effectively. In this preprocessing stage, data are filtered and reduced into 'features' that are defined and selected by experts who know what aspects of the data are important, based on extensive domain knowledge.

Relying on experts, however, is slow, expensive, error prone and unlikely to keep pace with the growing amounts and complexity of data. Here, we propose a general way to circumvent the reliance on human experts, with relatively little compromise to the quality of results. We discovered that all ordered datasets, regardless of their origin and meaning, share a fundamental universal structure that can be exploited to compare and contrast them without a human-dependent preprocessing step. We suggest that this process, which we call data smashing, may open the door to understanding increasingly complex data in the future, especially when experts cannot keep pace.

Our key observation, presented here, is that all quantitative data streams have corresponding anti-streams, which, in spite of being non-unique, are tied to the stream's unique statistical structure. We then describe the data smashing process by which streams and anti-streams can be algorithmically collided to reveal differences that are difficult to detect using conventional techniques. We establish this principle formally, describe how we implemented it in practice and report its performance on a number of real-world cases from varied disciplines. The results show that without access to any domain knowledge, the unmodified data smashing process performs on a par with specialized algorithms devised by domain experts for each problem independently. For example, we analyse raw electro-encephalographic (EEG) data, and without any domain knowledge find that the measurements from different patients fall into a single curve, with similar pathologies clustered alongside each other. Making such a discovery using conventional techniques would require substantial domain knowledge and data preprocessing (see figure 3a(ii)).
Figure 1. Data smashing: (a) determining the similarity between two data streams is key to any data mining process, but relies heavily on human-prescribed criteria. (b) Data smashing first encodes each data stream, then collides one with the inverse of the other. The randomness of the resulting stream reflects the similarity of the original streams, leading to a cascade of downstream applications involving classification, decision and optimization. (Online version in colour.)
2. Anti-streams

The notion of data smashing applies only to data in the form of an ordered series of digits or symbols, such as acoustic waves from a microphone, light intensity over time from a telescope, traffic density along a road or network activity from a router. An anti-stream contains the 'opposite' information from the original data stream and is produced by algorithmically inverting the statistical distribution of symbol sequences appearing in the original stream. For example, sequences of digits that were common in the original stream will be rare in the anti-stream, and vice versa. Streams and anti-streams can then be algorithmically 'collided' in a way that systematically cancels any common statistical structure in the original streams, leaving only information relating to their statistically significant differences. We call this the principle of information annihilation.

Data smashing involves two data streams and proceeds in three steps (see figure 1): raw data streams are first quantized, by converting continuous values to a string of characters or symbols. The simplest example of such quantization is where all positive values are mapped to the symbol '1' and all negative values to '0', thus generating a string of bits. Next, we select one of the quantized input streams and generate its anti-stream. Finally, we smash this anti-stream against the remaining quantized input stream and measure what information remains. The remaining information is estimated from the deviation of the resultant stream from flat white noise (FWN).

As information in a data stream is perfectly annihilated by a correct realization of its anti-stream, any deviation of the collision product from noise quantifies statistical dissimilarity. Using this causal similarity metric, we can cluster streams, classify them or identify stream segments that are unusual or different. The algorithms are linear in input data, implying they can be applied efficiently to streams in near-real time. Importantly, data smashing can be applied without understanding where the streams were generated, how they are encoded and what they represent.

Ultimately, from a collection of data streams and their pairwise similarities, it is possible to automatically 'back out' the underlying metric embedding of the data, revealing their hidden structure for use with traditional machine learning methods.

Dependence across data streams is often quantified using mutual information [1]. However, mutual information and data smashing are distinct concepts. The former measures dependence between streams; the latter computes a distance between the generative processes themselves.
For example, two sequences of independent coin flips necessarily have zero mutual information, but data smashing will identify the streams as similar because they originate in the same underlying process, a coin flip. Moreover, smashing only works correctly if the streams are generated independently (see the electronic supplementary material, Section S-G).

Similarity computed via data smashing is clearly a function of the statistical information buried in the input streams. However, it might not be easy to find the right statistical tool that reveals this hidden information, particularly without domain knowledge, or without first constructing a good system model (see the electronic supplementary material).

Binary quantization is the simplest scheme; quantizations with larger alphabets are also possible. A possibly continuous-valued data stream is thus mapped to a symbol sequence over this pre-specified alphabet.

If two data streams are to be smashed, we need the symbols to have the same meaning, i.e. represent the same slice of the data range, in both streams. In other words, the quantization scheme must not vary from one stream to the next. This may be problematic if the data streams have significantly different average values as a result of a wandering baseline or a definite positive or negative trend. One simple de-trending approach is to consider the signed changes in adjacent values of the data series instead of the series itself, i.e. use the differenced or 'derivative' series.
Figure 2. Calculation of causal similarity using data smashing. (a) We quantize raw signals to symbolic sequences over the chosen alphabet and compute a causal similarity between such sequences. The underlying theory is established assuming the existence of generative probabilistic automata for these sequences, but our algorithms do not require explicit model construction, or a priori knowledge of their structures. (b) Concept of stream inversion: while we can find the group inverse of a given PFSA algebraically, we can also transform a generated sequence directly to one that represents the inverse model, without constructing the model itself. (c) Summing a PFSA G and its inverse −G yields the zero PFSA W. We can carry out this smashing purely at the sequence level to get FWN. (d) Circuit that allows us to measure the similarity distance between streams s1, s2 via computation of ε11, ε22 and ε12 (see table 1). Given a threshold εw > 0, if εkk < εw, then we have sufficient data for stream sk (k = 1, 2). Additionally, if ε12 ≤ εw, then we conclude that s1, s2 have the same stochastic source with high probability (which converges exponentially fast to 1 with the length of input). (Online version in colour.)
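The group operations of figure 2b,c can be realized on the per-state symbol-emission probabilities. The sketch below is our own reading, chosen to be consistent with the numbers shown in the figure (inversion maps emission probabilities (0.7, 0.3) to (0.3, 0.7), and G + (−G) yields the uniform model W); the formal construction is given in [15].

```python
import numpy as np

def pfsa_inverse(P):
    """Emission probabilities of -G: normalized entrywise reciprocals."""
    Q = 1.0 / P
    return Q / Q.sum(axis=1, keepdims=True)

def pfsa_sum(P, Q):
    """Emission probabilities of G + H: normalized entrywise products."""
    R = P * Q
    return R / R.sum(axis=1, keepdims=True)

G = np.array([[0.7, 0.3],    # state q1: Pr(sigma0), Pr(sigma1)
              [0.9, 0.1]])   # state q2
W = pfsa_sum(G, pfsa_inverse(G))   # -> [[0.5, 0.5], [0.5, 0.5]]: flat white noise
```

The point of the paper, however, is that these operations never need to be performed on models at all; they are carried out directly on the observed symbol streams.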
Our generated sequences are symbol streams. For example, the anti-stream of the sequence 10111 is not −1 0 −1 −1 −1, but a fragment that has inverted statistical properties in terms of the occurrence patterns of the symbols 0 and 1 (see table 1). For a PFSA G, the unique inverse −G is the PFSA which, when added to G, yields the group identity W = G + (−G), i.e. the zero model. Note that the zero model W is the unique element in the group such that for any arbitrary PFSA H in the group, we have H + W = W + H = H.

For any fixed alphabet size, the zero model is the unique single-state PFSA (up to minimal description [15]) that generates symbols as consecutive realizations of independent random variables with uniform distribution over the symbol alphabet. Thus, W generates FWN, and the entropy rate of FWN achieves the theoretical upper bound among the sequences generated by arbitrary PFSA in the model space. Two PFSAs G, H are identical if and only if G + (−H) = W, which opens up the possibility of estimating causal similarity between observed data streams by estimating this distance from the observed sequences alone, without requiring the models themselves.

We can easily estimate the distance of the hidden model from FWN given only an observed stream s. This is achieved by the function ẑ (see table 1, row 4). Intuitively, given an observed sequence fragment x, we first compute the deviation of the distribution of the next symbol from the uniform distribution over the alphabet. ẑ(s, ℓ) is the sum of these deviations for all historical fragments x with length up to ℓ, weighted by 1/|Σ|^(2|x|). The weighted sum ensures that deviations of the distributions for longer x have a smaller contribution to ẑ(s, ℓ), which addresses the issue that the occurrence frequencies of longer sequences are more variable.
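As a rough illustration, ẑ can be estimated along the lines below. This is our paraphrase of the description above (table 1's exact specification is not reproduced here), with an L1 deviation from uniformity chosen for concreteness.

```python
from collections import Counter, defaultdict

def z_hat(s, ell, alphabet='01'):
    """Estimate the deviation of symbol stream s from flat white noise.

    For every context x of length 0..ell, compare the empirical
    next-symbol distribution against the uniform distribution,
    weighting contexts of length m by 1/|alphabet|**(2*m).
    """
    k = len(alphabet)
    total = 0.0
    for m in range(ell + 1):
        counts = defaultdict(Counter)          # context -> next-symbol counts
        for i in range(len(s) - m):
            counts[s[i:i + m]][s[i + m]] += 1
        for x, nxt in counts.items():
            n = sum(nxt.values())
            # L1 deviation of the next-symbol distribution from uniform
            dev = sum(abs(nxt[a] / n - 1.0 / k) for a in alphabet)
            total += dev / k ** (2 * m)
    return total
```

For an FWN source, every conditional distribution tends to uniform and ẑ tends to zero as the stream grows.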
Table 1. Algorithms for stream operations: procedures to assemble the annihilation circuit in figure 2d, which carries out data smashing. Symbolic derivatives underlie the proofs outlined in the electronic supplementary material; however, for the actual implementation, they are only needed in the final step to compute the deviation from FWN. (The rows referenced in the text are: rows 1–3, the selective-erasure operations of independent stream copy, stream inversion and stream summation; row 4, the deviation ẑ(s, ℓ) from FWN.)
a. See the electronic supplementary material, Section S-F, for proof of correctness.
b. See the electronic supplementary material, Definition S-22, Propositions S-14 and S-15 and Section S-F.
Data smashing must therefore determine whether two hidden models (G1, G2) satisfy the annihilation equality G1 + (−G2) = W without explicitly knowing or constructing the models themselves.

Data smashing is predicated on being able to invert and sum streams, and to compare streams to noise. Inversion generates a stream s′ given a stream s, such that if PFSA G is the source for s, then −G is the source for s′. Summation collides two streams: given streams s1 and s2, we generate a new stream s′ which is a realization of FWN if and only if the hidden models G1, G2 satisfy G1 + G2 = W. Finally, the deviation of a stream s from that generated by an FWN process can be calculated directly.

Importantly, for a stream s (with generator G), the inverted stream s′ is not unique. Any symbol stream generated from the inverse model −G qualifies as an inverse for s; thus anti-streams are non-unique. What is indeed unique is the generating inverse PFSA model. As our technique compares the hidden stochastic processes and not their possibly non-unique realizations, the non-uniqueness of anti-streams is not problematic.

However, carrying out these operations in the absence of the model is problematic. In particular, we have no means to correct for any mis-synchronization of the states of the hidden models. Additionally, we want a linear-time algorithm, implying that it is desirable to carry out these operations in a memoryless, symbol-by-symbol fashion. Thus, we use the notion of pseudo-copies of probabilistic automata: given a PFSA G with a transition probability matrix P, a pseudo-copy P(G) is any PFSA which has the same structure as G, but with a transition matrix

P(P) = γ [I − (1 − γ)P]⁻¹ P,    (4.1)

for some scalar γ ∈ (0, 1). We show that the operations described above can be carried out efficiently, once we are willing to settle for stream realizations from pseudo-copies instead of the exact models. This does not cause a problem in disambiguation of hidden dynamics, because the invertibility of the map in equation (4.1) guarantees that pseudo-copies of distinct models remain distinct, and nearly identical hidden models produce nearly identical pseudo-copies. Thus, despite the possibility of mis-synchronization between hidden model states, the applicability of the algorithms shown in table 1 for disambiguation of hidden dynamics is valid.
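Equation (4.1) is straightforward to compute; a minimal numpy sketch, assuming P is the row-stochastic transition matrix:

```python
import numpy as np

def pseudo_copy(P, gamma):
    """Transition matrix of a pseudo-copy via equation (4.1).

    If P is row-stochastic, the result is row-stochastic too, since
    (I - (1 - gamma) P) maps the all-ones vector to gamma * ones.
    The inverse exists because (1 - gamma) < 1 bounds the spectral radius.
    """
    n = P.shape[0]
    return gamma * np.linalg.inv(np.eye(n) - (1.0 - gamma) * P) @ P
```

The parameter γ trades fidelity for mixing: as γ → 1 the pseudo-copy approaches P itself.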
We show in the electronic supplementary material, Section S-F, that the algorithms evaluate distinct models to be distinct, and nearly identical hidden models to be nearly identical.

Estimating the deviation of a stream from FWN is straightforward (as specified by ẑ(s, ℓ) in table 1, row 4): all subsequences of a given length must necessarily occur with the same frequency for an FWN process, and we simply estimate the deviation from this behaviour in the observed sequence. The other two tasks are carried out via selective erasure of symbols from the input stream(s) (see table 1, rows 1–3). For example, summation of streams is realized as follows: given two streams s1, s2, we read a symbol from each stream and carry out selective erasure as specified in table 1. Given sufficient data, the resulting estimate converges to a well-defined distance between the hidden stochastic sources themselves, without ever knowing them explicitly.

4.2. Contrast with existing model-free approaches to time-series analysis

Assumption-free time-series analysis to identify localized discords or anomalies has been studied extensively [6,7,25,26]. A significant majority of these reported approaches use SAX [10] for representing the possibly continuous-valued raw data streams, in contrast to our more naive quantization methodologies, and focus on discord discovery rather than on clustering, classification or finding geometric structures (such as lower-dimensional embedding manifolds [22]) induced by the similarity metric on the data sources.

A stream must also be long enough to determine data sufficiency; that is, a stream is sufficiently long if it can sufficiently annihilate an inverted self-copy to FWN. The self-annihilation-based data-sufficiency test consists of two steps: given an observed symbolic sequence s, we first generate an independent copy (say s′). This is the independent stream copy operation (see table 1, row 1), which can be carried out via selective symbol erasure without any knowledge of the source itself. Once we have s and s′, we check if the inverted version of one annihilates the other to a pre-specified degree. If a stream is not able to annihilate its inverted copy, it is too short for data smashing (see the electronic supplementary material).
The matrix H, obtained from the matrix E of pairwise deviations by setting the diagonal entries to zero, estimates a distance matrix. A Euclidean embedding [28] of H then leads to deeper insight into the geometry of the space of the hidden generators. For example, in the case of the EEG data, the time series' embeddings describe a one-dimensional manifold (a curve), with data from similar phenomena clustered together along the curve (see figure 3a(ii)).
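A standard way to compute such a Euclidean embedding from the distance matrix H is classical multidimensional scaling; a minimal sketch, which is our choice of method rather than the paper's stated one:

```python
import numpy as np

def euclidean_embedding(H, dim=3):
    """Classical MDS: coordinates whose pairwise distances approximate H."""
    n = H.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centring matrix
    B = -0.5 * J @ (H ** 2) @ J                # double-centred squared distances
    w, V = np.linalg.eigh(B)                   # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:dim]            # keep the top eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```

The three-dimensional projections in figure 3 are of exactly this kind: points close in the embedding correspond to streams with nearly annihilating hidden models.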
Figure 3. Data smashing applications. Pairwise distance matrices, identified clusters and three-dimensional projections of Euclidean embeddings for epileptic pathology identification (a), identification of heart murmur (b) and classification of variable stars from photometry (c). In these applications, the relevant clusters are found unsupervised. (Online version in colour.)
Figure 4. Computational complexity and convergence rates for data smashing. (a) Exponential convergence of the self-annihilation error for a small set of data series for different applications ((i) EEG data, (ii) heart sound recordings and (iii) photometry). (b) Computation times for carrying out annihilation using the circuit shown in figure 2d as a function of the length of input streams, for different alphabet sizes (and for different numbers of states in the hidden models). Note that the asymptotic time complexity of obtaining the similarity distances scales as O(|Σ|n), where n is the length of the shorter of the two input streams. (Online version in colour.)
The self-annihilation test thus offers an application-independent way to compare and rank quantization schemes (see the electronic supplementary material, Section S-C).

The algorithmic steps (see table 1) require no synchronization (we can start reading the streams anywhere), implying that non-equal lengths of time series and phase mismatches are of no consequence.

7. Application examples

Data smashing begins with quantizing streams to symbolic sequences, followed by the use of the annihilation circuit (figure 2d) to compute pairwise causal similarities. Details of the quantization schemes, computed distance matrices, identified clusters and Euclidean embeddings are summarized in table 2 and figure 3 (see also the electronic supplementary material, Sections S-A and S-C).

Our first application is classification of brain electrical activity from different physiological and pathological brain states [29]. We used sets of EEG data series consisting of surface EEG recordings from healthy volunteers with eyes closed and open, and intracranial recordings from epilepsy patients during seizure-free intervals from within and from outside the seizure-generating area, as well as intracranial recordings of seizures.

Starting with the data series of electric potentials, we generated sequences of relative changes between consecutive values before quantization. This step allows a common alphabet for sequences with wide variability in the sequence mean values.
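From here, the pipeline is uniform across applications: smash every pair of quantized streams and assemble the deviations into E. A schematic sketch, where dist stands for the ε measurement produced by the circuit sketch above:

```python
import numpy as np

def distance_matrix(streams, dist):
    """Pairwise deviations E; H zeroes the diagonal (self-annihilation errors)."""
    n = len(streams)
    E = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            E[i, j] = E[j, i] = dist(streams[i], streams[j])
    H = E - np.diag(np.diag(E))
    return E, H
```

Clustering and embedding are then run on H, with no per-application tuning.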
Table 2. Application problems and results (see the electronic supplementary material, table S1, for a more detailed version). (Columns: system; input description; classification performance.)
The distance matrix from pairwise smashing yielded clear clusters corresponding to seizure, normal eyes open, normal eyes closed and epileptic pathology in non-seizure conditions (see figure 3a; seizures not shown due to large differences from the rest).

Embedding the distance matrix (see figure 3a(i)) yields a one-dimensional manifold (a curve), with contiguous segments corresponding to different brain states, e.g. the right-hand side of plane A corresponds to epileptic pathology. This provides a particularly insightful picture, which eludes complex nonlinear modelling [29].

Next, we classify cardiac rhythms from noisy heart-sound data recorded using a digital stethoscope [30]. We analysed 65 data series (ignoring the labels) corresponding to healthy rhythms and murmur, to verify if we could identify clusters without supervision that correspond to the expert-assigned labels. We found 11 clusters in the distance matrix (see figure 3b), four of which consisted mainly of data with murmur (as determined by the expert labels), with the rest consisting mainly of healthy rhythms (see figure 3b(iv)). Classification precision for murmur is noted in table 2 (75.2%). Embedding of the distance matrix revealed a two-dimensional manifold (see figure 3b(iii)).

Our next problem is the classification of variable stars using light intensity series (photometry) from the optical gravitational lensing experiment (OGLE) survey [31]. Supervised classification of photometry proceeds by first 'folding' each light-curve to its known period to correct phase mismatches. In our first analysis, we started with folded light-curves and generated data series of the relative changes between consecutive brightness values in the curves before quantization.

Data series from the same subject will annihilate each other correctly, whereas those from different subjects will fail to do so to the same extent. We outperformed the state of the art for both kNN- and SVM-based approaches (see table 2 and the electronic supplementary material, Section S-A).

Our fifth application is text-independent speaker identification using the ELSDSR database [34], which includes recordings from 23 speakers (9 female and 14 male, with possibly non-native accents). As before, training involved specifying the library series for each speaker. We computed the distance matrix by smashing the library data series against each other and trained a simple kNN on the Euclidean embedding.
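For the supervised settings such as speaker identification, a nearest-neighbour classifier over the embedding suffices; a sketch using scikit-learn, which is our tooling choice rather than the paper's:

```python
from sklearn.neighbors import KNeighborsClassifier

def train_knn(embedding, labels, k=1):
    """Fit a simple kNN on the Euclidean embedding of the library series."""
    return KNeighborsClassifier(n_neighbors=k).fit(embedding, labels)

# e.g. knn = train_knn(euclidean_embedding(H), speaker_labels)
#      predicted = knn.predict(new_embedded_points)
```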
References

1. Cover TM, Thomas JA. 1991 Elements of information theory. New York, NY: John Wiley.
2. Duin RPW, de Ridder D, Tax DMJ. 1997 Experiments with a feature-less approach to pattern recognition. Pattern Recognit. Lett. 18, 1159–1166. (doi:10.1016/S0167-8655(97)00138-4)
3. Mottl V, Dvoenko S, Seredin O, Kulikowski C, Muchnik J. 2001 Featureless pattern recognition in an imaginary Hilbert space and its application to protein fold classification. In Machine learning and data mining in pattern recognition (ed. P Perner). Lecture Notes in Computer Science, vol. 2123, pp. 322–336. Berlin, Germany: Springer. (doi:10.1007/3-540-44596-X_26)
4. Pekalska E, Duin RPW. 2002 Dissimilarity representations allow for building good classifiers. Pattern Recognit. Lett. 23, 943–956. (doi:10.1016/S0167-8655(02)00024-7)
5. Li M, Chen X, Li X, Ma B, Vitanyi PMB. 2004 The similarity metric. IEEE Trans. Inform. Theory 50, 3250–3264. (doi:10.1109/TIT.2004.838101)
6. Ratanamahatana C, Keogh E, Bagnall AJ, Lonardi S. 2005 A novel bit level time series representation with implication of similarity search and clustering. In Advances in knowledge discovery and data mining (eds T Ho, D Cheung, H Liu). Lecture Notes in Computer Science, vol. 3518, pp. 771–777. Berlin, Germany: Springer. (doi:10.1007/11430919_90)
7. Wei L, Kumar N, Lolla V, Keogh EJ, Lonardi S, Ratanamahatana C. 2005 Assumption-free anomaly detection in time series. In Proc. 17th Int. Conf. on Scientific and Statistical Database Management, pp. 237–240. Berkeley, CA: Lawrence Berkeley Laboratory.
8. Keogh E, Lonardi S, Ratanamahatana CA. 2004 Towards parameter-free data mining. In Proc. 10th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 206–215. New York, NY: ACM. (doi:10.1145/1014052.1014077)
9. Anthony B, Ratanamahatana C, Keogh E, Lonardi S, Janacek G. 2006 A bit level representation for time series data mining with shape based similarity. Data Min. Knowl. Discov. 13, 11–40. (doi:10.1007/s10618-005-0028-0)
10. Keogh E, Lin J, Fu A. 2005 HOT SAX: efficiently finding the most unusual time series subsequence. In Proc. 5th IEEE Int. Conf. on Data Mining, pp. 226–233. Washington, DC: IEEE Computer Society. (doi:10.1109/ICDM.2005.79)
11. Chattopadhyay I, Lipson H. 2013 Abductive learning of quantized stochastic processes with probabilistic finite automata. Phil. Trans. R. Soc. A 371, 20110543. (doi:10.1098/rsta.2011.0543)
12. Crutchfield JP, McNamara BS. 1987 Equations of motion from a data series. Complex Syst. 1, 417–452.
13. Crutchfield JP. 1994 The calculi of emergence: computation, dynamics and induction. Phys. D Nonlinear Phenom. 75, 11–54. (doi:10.1016/0167-2789(94)90273-9)
14. Chattopadhyay I, Wen Y, Ray A. 2010 Pattern classification in symbolic streams via semantic annihilation of information. (http://arxiv.org/abs/1008.3667)
15. Chattopadhyay I, Ray A. 2008 Structural transformations of probabilistic finite state machines. Int. J. Control 81, 820–835. (doi:10.1080/00207170701704746)
22. Tenenbaum J, de Silva V, Langford J. 2000 A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323. (doi:10.1126/science.290.5500.2319)
23. Roweis S, Saul L. 2000 Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323–2326. (doi:10.1126/science.290.5500.2323)
24. Seung H, Lee D. 2000 Cognition. The manifold ways of perception. Science 290, 2268–2269. (doi:10.1126/science.290.5500.2268)
25. Kumar N, Lolla N, Keogh E, Lonardi S, Ratanamahatana CA. 2005 Time-series bitmaps: a practical visualization tool for working with large time series databases. In Proc. 2005 SIAM Int. Conf. on Data Mining, pp. 531–535.
30. Bentley P et al. 2011 The PASCAL Classifying Heart Sounds Challenge 2011 (CHSC2011) results. See http://www.peterjbentley.com/heartchallenge/index.html.
31. Szymanski MK. 2005 The optical gravitational lensing experiment. Internet access to the OGLE photometry data set: OGLE-II BVI maps and I-band data. Acta Astron. 55, 43–57.
32. Begleiter H. 1995 EEG database data set. New York, NY: Neurodynamics Laboratory, State University of New York Health Center Brooklyn. See http://archive.ics.uci.edu/ml/