Big Data Analytics Unit 2 MINING DATA STREAMS
Streams may be archived in a large archival store, but we assume it is not possible to answer
queries from the archival store. It could be examined only under special circumstances using time-
consuming retrieval processes. There is also a working store, into which summaries or parts of
streams may be placed, and which can be used for answering queries. The working store might be
disk or main memory, depending on how fast we need to process queries. But either way, it is of
sufficiently limited capacity that it cannot store all the data from all the streams.
Image Data
Satellites often send down to earth streams consisting of many terabytes of images per day.
Surveillance cameras produce images with lower resolution than satellites, but there can be many of
them, each producing a stream of images at intervals like one second.
Stream Queries
Queries over stream data can be answered in many ways. Some require the average of a
specific number of elements, others the maximum value seen so far. We might have a standing
query that, each time a new reading arrives, produces the average of the 24 most recent readings.
That query can be answered easily if we store the 24 most recent stream elements. When a new
stream element arrives, we can drop the 25th most recent element from the working store, since it
will never again be needed.
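A minimal sketch of this standing query follows; the class name and the sample readings are illustrative, not part of the text. A bounded buffer keeps exactly the 24 most recent readings, discarding the 25th most recent whenever a new element arrives.

from collections import deque

# Standing query: maintain the average of the 24 most recent readings.
WINDOW = 24

class RecentAverage:
    def __init__(self, size=WINDOW):
        self.buffer = deque(maxlen=size)   # oldest element is dropped automatically
        self.total = 0.0

    def push(self, reading):
        if len(self.buffer) == self.buffer.maxlen:
            self.total -= self.buffer[0]   # the 25th most recent element is discarded
        self.buffer.append(reading)
        self.total += reading
        return self.total / len(self.buffer)

avg = RecentAverage()
for reading in [20.1, 19.8, 21.0, 20.5]:   # hypothetical sensor readings
    print(avg.push(reading))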
The other form of query is ad-hoc, a question asked once about the current state of a stream
or streams. If we do not store all streams in their entirety, as normally we cannot, then we cannot
expect to answer arbitrary queries about streams. If we have some idea what kind of queries will be
asked through the ad-hoc query interface, then we can prepare for them by storing appropriate parts
or summaries of streams.
A common approach is to store a sliding window of each stream in the working store. A
sliding window can be the most recent n elements of a stream, for some n, or it can be all the
elements that arrived within the last t time units. If we regard each stream element as a tuple, we
can treat the window as a relation and query it with any SQL query.
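The sketch below illustrates this idea under simple assumptions: a time-based window is kept in an in-memory SQLite table (the table name, columns and one-hour window length are hypothetical), expired tuples are deleted as new ones arrive, and ordinary SQL is then run against the window.

import sqlite3, time

# Keep a time-based sliding window of stream tuples in a relation and query it with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor_id TEXT, value REAL, ts REAL)")

def insert(sensor_id, value, window_seconds=3600):
    now = time.time()
    conn.execute("INSERT INTO readings VALUES (?, ?, ?)", (sensor_id, value, now))
    # expire tuples that fell out of the last-t-time-units window
    conn.execute("DELETE FROM readings WHERE ts < ?", (now - window_seconds,))

insert("s1", 20.4)
insert("s1", 21.1)
# Any SQL query can now be posed against the window, e.g. a per-sensor average.
for row in conn.execute("SELECT sensor_id, AVG(value) FROM readings GROUP BY sensor_id"):
    print(row)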
Query model
Aurora supports continuous queries (real-time processing), views, and ad hoc queries, all
using substantially the same mechanisms. All three modes of operation use the same conceptual
building blocks. Each mode processes flows based on QoS specifications: each output in Aurora is
associated with two-dimensional QoS graphs that specify the utility of the output in terms of several
performance-related and quality-related attributes (see Sect. 4.1). The diagram in Fig. 2 illustrates
the processing modes supported by Aurora. The topmost path represents a continuous query. In
isolation, data elements flow into boxes, are processed, and flow further downstream. In this
scenario, there is no need to store any data elements once they are processed. Once an input has
worked its way through all reachable paths, that data item is drained from the network. The QoS
specification at the end of the path controls how resources are allocated to the processing elements
along the path. One can also view an Aurora network (along with some of its applications) as a large
collection of triggers. Each path from a sensor input to an output can be viewed as computing the
condition part of a complex trigger. An output tuple is delivered to an application, which can take
the appropriate action. The dark circles on the input arcs to boxes b1 and b2 represent connection
points. A connection point is an arc that supports dynamic modification to the network. New boxes
can be added to or deleted from a connection point. When a new application connects to the
network, it will often require access to the recent past. As such, a connection point has the potential
for persistent storage. Persistent storage retains data items beyond their processing by a particular
box. In other words, as items flow past a connection point, they are cached in a persistent store for
some period of time. They are not drained from the network by applications. Instead, a persistence
specification indicates exactly how long the items are kept, so that a future ad hoc query can get
historical results. In the figure, the leftmost connection point is specified to be available for 2 hours.
This indicates that the beginning of time for newly connected applications will be 2 hours in the past.
Connection points can be generalized to allow an elegant way of including static data sets in
Aurora. Hence we allow a connection point to have no upstream node, i.e., a dangling connection
point. Without an upstream node, the connection point cannot correspond to an Aurora stream.
Instead, the connection point is decorated with the identity of a stored data set in a traditional
DBMS or other storage system. In this case, the connection point can be materialized and the stored
tuples passed as a stream to the downstream node; such tuples will be pushed through an Aurora
network. Alternately, query execution on the downstream node can pull tuples by running a query
against the store. If the downstream node is a filter or a join, pull processing has obvious
advantages. Moreover, if the node is a join between a stream and a stored data set, then an obvious
query execution strategy is to perform iterative substitution whenever a tuple from the stream
arrives and perform a lookup against the stored data. In this case, a window does not need to be
specified, as the entire join can be calculated.
The middle path in Fig. 2 represents a view. In this case, a path is defined with no connected
application. It is allowed to have a QoS specification as an indication of the importance of the
view. Applications can connect to the end of this path whenever there is a need. Before this happens,
the system can propagate some, all, or none of the values stored at the connection point in order to
reduce latency for applications that connect later. Moreover, it can store these partial results at any
point along a view path. This is analogous to a materialized or partially materialized view. View
materialization is under the control of the scheduler.
The bottom path represents an ad hoc query. An ad hoc query can be attached to a connection
point at any time. The semantics of an ad hoc query is that the system will process data items and
deliver answers from the earliest time T (the persistence specification) stored in the connection point
until the query branch is explicitly disconnected. Thus, the semantics of an Aurora ad hoc query is
the same as a continuous query that starts executing at tnow − T and continues until explicit
termination.
$1,000,000. The total income for this population is $1,060,000. If we take an SRS of size k = 2—
and hence estimate the income for the population as 1.5 times the income for the sampled
individuals—then the outcome of our sampling and estimation exercise would follow one of the
scenarios given in Table 1. Each of the scenarios is equally likely, and the expected value (also
called the “mean value”) of our estimate is computed as
expected value = (1/3) · (90,000)+(1/3) · (1,515,000)+(1/3) · (1,575,000)
= 1,060,000,
which is equal to the true answer. In general, it is important to evaluate the accuracy (degree
of systematic error) and precision (degree of variability) of a sampling and estimation scheme. The
bias, i.e., expected error, is a common measure of accuracy, and, for estimators with low bias, the
standard error is a common measure of precision. The bias of our income estimator is 0 and the
standard error is computed as the square root of the variance (expected squared deviation from the
mean) of our estimator:
SE = [(1/3) · (90,000 − 1,060,000)² + (1/3) · (1,515,000 − 1,060,000)²
+ (1/3) · (1,575,000 − 1,060,000)²]^(1/2) ≈ 687,000.
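A short sketch of this calculation, assuming the three population incomes ($10,000, $50,000 and $1,000,000) implied by the scenario estimates quoted above:

from itertools import combinations
from math import sqrt

# Expected value and standard error of the SRS income estimator (k = 2 of 3).
population = [10_000, 50_000, 1_000_000]
k = 2
scale = len(population) / k            # 1.5 when sampling 2 of the 3 individuals

estimates = [scale * sum(s) for s in combinations(population, k)]
# [90000.0, 1515000.0, 1575000.0], each scenario equally likely

expected = sum(estimates) / len(estimates)                        # 1,060,000
variance = sum((e - expected) ** 2 for e in estimates) / len(estimates)
standard_error = sqrt(variance)                                   # about 687,000

print(expected, round(standard_error))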
For more complicated population parameters and their estimators, there are often no simple
formulas for gauging accuracy and precision. In these cases, one can sometimes resort to techniques
based on subsampling, that is, taking one or more random samples from the initial population
sample. Well-known subsampling techniques for estimating bias and standard error include the
“jackknife” and “bootstrap” methods. In general, the accuracy and precision of a well designed
sampling-based estimator should increase as the sample size increases.
Database Sampling
Although database sampling overlaps heavily with classical finite-population sampling, the
former setting differs from the latter in a number of important respects.
• Scarce versus ubiquitous data. In the classical setting, samples are usually expensive to obtain and
data is hard to come by, and so sample sizes tend to be small. In database sampling, the population
size can be enormous (terabytes of data), and samples are relatively easy to collect, so that sample
sizes can be relatively large. The emphasis in the database setting is on the sample as a flexible,
lossy, compressed synopsis of the data that can be used to obtain quick approximate answers to user
queries.
• Different sampling schemes. As a consequence of the complex storage formats and retrieval
mechanisms that are characteristic of modern database systems, many sampling schemes that were
unknown or of marginal interest in the classical setting are central to database sampling. For
example, the classical literature pays relatively little attention to Bernoulli sampling schemes, but
such schemes are very important for database sampling because they can be easily parallelized
across data partitions. As another example, tuples in a relational database are typically retrieved
from disk in units of pages or extents. This fact strongly influences the choice of sampling and
estimation schemes, and indeed has led to the introduction of several novel methods. As a final
example, estimates of the answer to an aggregation query involving select–project–join operations
are often based on samples drawn individually from the input base relations, a situation that does
not arise in the classical setting.
• No domain expertise. In the classical setting, sampling and estimation are often carried out by an
expert statistician who has prior knowledge about the population being sampled. As a result, the
classical literature is rife with sampling schemes that explicitly incorporate auxiliary information
about the population, as well as “model-based” schemes in which the population is assumed to be a
sample from a hypothesized “super-population” distribution. In contrast, database systems typically
must view the population (i.e., the database) as a black box, and so cannot exploit these specialized
techniques.
• Auxiliary synopses. In contrast to a classical statistician, a database designer often has the
opportunity to scan each population element as it enters the system, and therefore has the
opportunity to maintain auxiliary data synopses, such as an index of “outlier” values or other data
summaries, which can be used to increase the precision of sampling and estimation algorithms. If
available, knowledge of the query workload can be used to guide synopsis creation.
Online-aggregation algorithms take, as input, streams of data generated by random scans of one or
more (finite) relations, and produce continually-refined estimates of answers to aggregation queries
over the relations, along with precision measures. The user aborts the query as soon as the running
estimates are sufficiently precise; although the data stream is finite, query processing usually
terminates long before the end of the stream is reached. Recent work on database sampling includes
extensions of online aggregation methodology, application of bootstrapping ideas to facilitate
approximate answering of very complex aggregation queries, and development of techniques for
sampling-based discovery of correlations, functional dependencies, and other data relationships for
purposes of query optimization and data integration.
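The online-aggregation loop can be sketched as follows. This is only an illustrative stand-in, not the published algorithms: the shuffle simulates a random scan of the relation, and a simple CLT-based half-width plays the role of the precision measure shown to the user.

import math, random

# Running AVG estimate with a rough 95% error bar; stop once it is precise enough.
def online_avg(values, target_halfwidth, z=1.96):
    random.shuffle(values)                 # stand-in for a random scan of the relation
    n, mean, m2 = 0, 0.0, 0.0              # Welford's running mean and variance
    for v in values:
        n += 1
        delta = v - mean
        mean += delta / n
        m2 += delta * (v - mean)
        if n >= 30:                        # need a few samples before trusting the bound
            halfwidth = z * math.sqrt((m2 / (n - 1)) / n)
            if halfwidth <= target_halfwidth:
                return mean, halfwidth, n  # the user would "abort" the query here
    return mean, z * math.sqrt((m2 / (n - 1)) / n) if n > 1 else float("inf"), n

data = [random.gauss(100, 15) for _ in range(100_000)]
print(online_avg(data, target_halfwidth=0.5))   # usually stops long before the end of the scan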
Collective experience has shown that sampling can be a very powerful tool, provided that it
is applied judiciously. In general, sampling is well suited to very quickly identifying pervasive
patterns and properties of the data when a rough approximation suffices; for example, industrial-
strength sampling-enhanced query engines can speed up some common decision-support queries by
orders of magnitude. On the other hand, sampling is poorly suited for finding “needles in
haystacks” or for producing highly precise estimates. The needle-in-haystack phenomenon appears
in numerous guises. For example, precisely estimating the selectivity of a join that returns very few
tuples is an extremely difficult task, since a random sample from the base relations will likely
contain almost no elements of the join result. As another example, sampling can perform poorly
when data values are highly skewed. For example, suppose we wish to estimate the average of the
values in a data set that consists of 10^6 values equal to 1 and five values equal to 10^8. The five
outlier values are the needles in the haystack: if, as is likely, these values are not included in the
sample, then the sampling-based estimate of the average value will be low by orders of magnitude.
Even when the data is relatively well behaved, some population parameters are inherently hard to
estimate from a sample. One notoriously difficult parameter is the number of distinct values in a
population. Problems arise both when there is skew in the data-value frequencies and when there
are many data values, each appearing a small number of times. In the former scenario, those values
that appear few times in the database are the needles in the haystack; in the latter scenario, the
sample is likely to contain no duplicate values, in which case accurate assessment of a scale-up
factor is impossible. Other challenging population parameters include the minimum or maximum
data value. Researchers continue to develop new methods to deal with these problems, typically by
exploiting auxiliary data synopses and workload information.
One of the main issues in stream data mining is to find a model that suits the process of
extracting frequent item sets from the streaming data. There are three stream data processing
models: the landmark window, the damped window and the sliding window model. A
transaction data stream is a sequence of incoming transactions, and an excerpt of the stream is called
a window. A window, W, can be either time-based or count-based, and either a landmark window or
a sliding window. W is time-based if W consists of a sequence of fixed-length time units, where a
variable number of transactions may arrive within each time unit. W is count-based if W is
composed of a sequence of batches, where each batch consists of an equal number of transactions.
W is a landmark window if W = (T1, T2, . . . , TT); W is a sliding window if W = (TT−w+1, . . . ,
TT), where each Ti is a time unit or a batch, T1 and TT are the oldest and the current time unit or
batch, and w is the number of time units or batches in the sliding window, depending on whether W
is time-based or count-based. Note that a count-based window can also be captured by a time-based
window by assuming that a uniform number of transactions arrive within each time unit.
The frequency of an item set, X, in W, denoted freq(X), is the number of transactions in W
that support X. The support of X in W, denoted sup(X), is defined as freq(X)/N, where N is the total
number of transactions received in W. X is a Frequent Item set (FI) in W if sup(X) ≥ σ, where σ (0
≤ σ ≤ 1) is a user-specified minimum support threshold. X is a Frequent Maximal Item set (FMI) in
W if X is an FI in W and there exists no FI Y in W such that X ⊂ Y. X is a Frequent Closed
Item set (FCI) in W if X is an FI in W and there exists no item set Y in W such that X ⊂ Y and
freq(X) = freq(Y).
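These definitions can be illustrated directly; the brute-force enumeration below is only a sketch for small transactions, not a stream-mining algorithm.

from itertools import combinations
from collections import Counter

# Compute freq(X) and sup(X) for all item sets in a count-based window W
# and keep those with sup(X) >= sigma (the FIs).
def frequent_itemsets(window, sigma):
    n = len(window)                                  # N, transactions received in W
    freq = Counter()
    for transaction in window:
        items = sorted(set(transaction))
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                freq[itemset] += 1                   # freq(X): transactions supporting X
    return {X: f / n for X, f in freq.items() if f / n >= sigma}

W = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
print(frequent_itemsets(W, sigma=0.5))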
Here X is the root of S and Y is an item set in S. Assume S is compressed into a node v in
the CP-tree. The node v consists of the following four fields: item-list, parent-list, freqTmax and
freqTmin, where v.item-list is a list of items which are the labels of the nodes in S, v.parent-list is a
list of locations (in the CP-tree) of the parents of each node in S, v.freqTmax is the frequency of the
root of S, and v.freqTmin is the frequency of the right-most leaf of S.
The use of the CP-tree results in the reduction of memory consumption, which is important
in mining data streams. The CP-tree can also be used to mine the FIs, however, the error rate of the
computed frequency of the FIs, which is estimated from freqTmin and freqTmax, will be further
increased. Thus, the CP-tree is more suitable for mining FMIs.
b) Sliding Window Concept
The sliding window model processes only the items in the window and maintains only the
frequent item sets. The size of the sliding window can be decided according to the application and
the available system resources. The recently generated transactions in the window influence the
mining result, so all the items in the window have to be maintained. The size of the sliding window
may vary depending on the application. In this section we discuss some of the important windowing
approaches for stream mining.
One algorithm follows the windowing approach with an in-memory prefix tree to
incrementally update the set of frequent closed item sets over the sliding window. The data
structure used by the algorithm is called the Closed Enumeration Tree (CET); it maintains a
dynamically selected set of item sets over the sliding window. The algorithm computes the exact
set of frequent closed item sets over the sliding window. However, because an update is performed
for each incoming transaction, it may not be able to handle high-speed streams.
Another notable algorithm based on the windowing concept is estWin [3]. This algorithm
maintains the frequent item sets over a sliding window, using a prefix tree, D, as its data structure.
The prefix tree holds three parameters for each item set X in the tree: freq(X), the frequency of X in
the current window since X was inserted into D; err(X), an upper bound for the frequency of X in
the current window before X was inserted into D; and tid(X), the ID of the transaction being
processed when X was inserted into D. An item set X in the tree is pruned, together with all of its
supersets, if (1) tid(X) ≤ tid1 and freq(X) < ⌈εN⌉, or (2) tid(X) > tid1 and freq(X) <
⌈ε(N − (tid(X) − tid1))⌉, where tid1 is the ID of the oldest transaction in the current window and N
is the number of transactions in the window. The expression tid(X) > tid1 means that X was inserted
into D at some transaction that arrived within the current sliding window, and hence the expression
(N − (tid(X) − tid1)) returns the number of transactions that arrived within the current window since
the arrival of the transaction having the ID tid(X). We note that X itself is not pruned if it is a
1-itemset, since estWin estimates the maximum frequency error of an item set based on the
computed frequency of its subsets [84] and thus the frequency of a 1-itemset cannot be estimated
again if it is deleted.
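The pruning test can be written down directly from the condition above. The packaging into a function and the example numbers are ours; tid1, N and ε have the meanings given in the text.

import math

# estWin pruning test for an item set X in the prefix tree.
def should_prune(freq_x, tid_x, itemset_size, tid1, N, epsilon):
    if itemset_size == 1:
        return False                     # 1-itemsets are kept so their error stays bounded
    if tid_x <= tid1:                    # X was inserted before the current window started
        return freq_x < math.ceil(epsilon * N)
    # X was inserted at some transaction inside the current window
    return freq_x < math.ceil(epsilon * (N - (tid_x - tid1)))

print(should_prune(freq_x=3, tid_x=120, itemset_size=2, tid1=100, N=1000, epsilon=0.01))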
c) Damped Window Concept
The estDec algorithm was proposed to reduce the effect of old transactions on the stream
mining result. It uses a decay rate to diminish the effect of old transactions, and the resulting
frequent item sets are called recent frequent item sets. The algorithm for maintaining recent FIs is
an approximate algorithm that adopts a mechanism for estimating the frequency of the item sets.
The use of a decay rate diminishes the effect of the old and obsolete information of a data
stream on the mining result. However, estimating the frequency of an item set from the frequency of
its subsets can produce a large error and the error may propagate all the way from the 2-subsets to
the n-supersets, while the upper bound is too loose. Thus, it is difficult to formulate an error bound
on the computed frequency of the resulting item sets and a large number of false-positive results
will be returned, since the computed frequency of an item set may be much larger than its actual
frequency. Moreover, the update for each incoming transaction (instead of a batch) may not be able
to handle high-speed streams.
Another approximation algorithm uses a tilted-time window model. In this model, the
frequencies of FIs are kept at different time granularities, such as the last hour, the last two hours,
the last four hours, and so on. The data structure used in this algorithm is called the FP-stream.
There are two components in the FP-stream: a pattern-tree-based prefix tree and the tilted-time
windows kept at the end node of each path. The pattern tree can be constructed using the FP-tree
algorithm. The tilted-time window guarantees that the granularity error is at most T/2, where T is
the time unit.
The frequency records are updated by shifting the recent records to merge
with the older records. To reduce the number of frequency records in the tilted-time windows, the
old frequency records of an item set, X, are pruned as follows. Let freqj(X) be the computed
frequency of X over a time unit Tj and Nj be the number of transactions received within Tj , where
1 ≤ j ≤ τ . For some m, where 1 ≤ m ≤ τ, the frequency records freq1(X), . . . ,freqm(X) are pruned if
the following condition holds:
∃ n ≤ τ, ∀ i, 1 ≤ i ≤ n, freqi(X) < σNi, and
∀ l, 1 ≤ l ≤ m ≤ n, Σj=l..n freqj(X) < ε Σj=l..n Nj.
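A small sketch of this tail-pruning test, following the reconstruction above (the summation bounds are our reading of the garbled original). Index 0 of the lists holds the oldest time unit; the function wrapper and example numbers are illustrative, not FP-stream code.

# freqs[j] and ns[j] hold freq_{j+1}(X) and N_{j+1}, with index 0 as the oldest unit.
def can_prune_oldest(freqs, ns, m, sigma, epsilon):
    tau = len(freqs)
    for n in range(m, tau + 1):                       # try every n with m <= n <= tau
        if any(freqs[i] >= sigma * ns[i] for i in range(n)):
            continue                                  # X was frequent in some unit T_1..T_n
        if all(sum(freqs[l:n]) < epsilon * sum(ns[l:n]) for l in range(m)):
            return True                               # safe to drop freq_1(X)..freq_m(X)
    return False

# Example: X is rare in the four oldest units, so the two oldest records can be dropped.
print(can_prune_oldest(freqs=[1, 2, 1, 3, 50], ns=[100, 100, 100, 100, 100],
                       m=2, sigma=0.1, epsilon=0.05))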
The FP-stream mining algorithm computes a set of sub-FIs at the relaxed minimum support
threshold, ε, over each batch of incoming transactions by using the FI mining algorithm, FP-growth
[25]. For each sub-FI X obtained, FP-streaming inserts X into the FP-stream if X is not in the FP-
stream. If X is already in the FP-stream, then the computed frequency of X over the current batch is
added to its tilted-time window. Next, pruning is performed on the tilted-time window of X and if
the window becomes empty, FP-growth stops mining supersets of X by the Apriori property. After
all sub-FIs mined by FP-growth are updated in the FP-stream, the FP-streaming scans the FP-stream
and, for each item set X visited, if X is not updated by the current batch of transactions, the most
recent frequency in X’s tilted-time window is recorded as 0. Pruning is then performed on X. If the
tilted-time window of some item set visited is empty (as a result of pruning), the item set is also
pruned from the FP-stream.
The tilted-time window model allows us to answer more expressive time-sensitive queries,
at the expense of keeping multiple frequency records for each item set. The tilted-time window also places
greater importance on recent data than on old data as does the sliding window model; however, it
does not lose the information in the historical data completely. A drawback of the approach is that
the FP-stream can become very large over time and updating and scanning such a large structure
may degrade the mining throughput.
Panel data
A time series is one type of panel data. Panel data is the general class, a multidimensional
data set, whereas a time series data set is a one-dimensional panel (as is a cross-sectional dataset). A
data set may exhibit characteristics of both panel data and time series data. One way to tell is to ask
what makes one data record unique from the other records. If the answer is the time data field, then
this is a time series data set candidate. If determining a unique record requires a time data field and
an additional identifier which is unrelated to time (student ID, stock symbol, country code), then it
is a panel data candidate. If the differentiation lies only in the non-time identifier, then the data set is a
cross-sectional data set candidate.
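This "what makes a record unique?" test can be run mechanically; the sketch below uses a hypothetical data set with a date column and a ticker identifier (names and values are invented for illustration).

import pandas as pd

# Is the time field alone a key (time series), or time plus an identifier (panel data)?
df = pd.DataFrame({
    "date":   ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "ticker": ["AAA", "BBB", "AAA", "BBB"],
    "price":  [10.0, 20.0, 10.5, 19.5],
})

if not df.duplicated(subset=["date"]).any():
    print("time series candidate: the time field alone identifies a record")
elif not df.duplicated(subset=["date", "ticker"]).any():
    print("panel data candidate: a time field plus a non-time identifier is needed")
else:
    print("neither combination is unique; inspect the keys further")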
Analysis
There are several kinds of motivation and data analysis available for time series, each
appropriate for a different purpose.
Motivation
In the context of statistics, econometrics, quantitative finance, seismology, meteorology, and
geophysics the primary goal of time series analysis is forecasting. In the context of signal
processing, control engineering and communication engineering it is used for signal detection and
estimation, while in the context of data mining, pattern recognition and machine learning time series
analysis can be used for clustering, classification, query by content, anomaly detection as well as
forecasting.
Exploratory analysis
The clearest way to examine a regular time series manually is with a line chart such as the
one shown for tuberculosis in the United States, made with a spreadsheet program. The number of
cases was standardized to a rate per 100,000 and the percent change per year in this rate was
calculated. The nearly steadily dropping line shows that the TB incidence was decreasing in most
years, but the percent change in this rate varied by as much as +/- 10%, with 'surges' in 1975 and
around the early 1990s. The use of both vertical axes allows the comparison of two time series in
one graphic.
Curve fitting
Curve fitting is the process of constructing a curve, or mathematical function, that has the
best fit to a series of data points, possibly subject to constraints. Curve fitting can involve either
interpolation, where an exact fit to the data is required, or smoothing, in which a "smooth" function
is constructed that approximately fits the data. A related topic is regression analysis, which focuses
more on questions of statistical inference such as how much uncertainty is present in a curve that is
fit to data observed with random errors. Fitted curves can be used as an aid for data visualization, to
infer values of a function where no data are available, and to summarize the relationships among
two or more variables. Extrapolation refers to the use of a fitted curve beyond the range of the
observed data, and is subject to a degree of uncertainty since it may reflect the method used to
construct the curve as much as it reflects the observed data.
The construction of economic time series involves the estimation of some components for
some dates by interpolation between values ("benchmarks") for earlier and later dates. Interpolation
is estimation of an unknown quantity between two known quantities (historical data), or drawing
conclusions about missing information from the available information ("reading between the
lines"). Interpolation is useful where the data surrounding the missing data is available and its trend,
seasonality, and longer-term cycles are known. This is often done by using a related series known
for all relevant dates. Alternatively polynomial interpolation or spline interpolation is used where
piecewise polynomial functions are fit into time intervals such that they fit smoothly together. A
different problem which is closely related to interpolation is the approximation of a complicated
function by a simple function (also called regression). The main difference between regression and
interpolation is that polynomial regression gives a single polynomial that models the entire data set.
Spline interpolation, however, yields a piecewise continuous function composed of many
polynomials to model the data set.
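The contrast can be seen in a short sketch on hypothetical data: a single global polynomial fitted by least squares (regression) versus a piecewise cubic spline that passes through every observation (interpolation).

import numpy as np
from scipy.interpolate import CubicSpline

t = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 2.7, 5.8, 6.1, 8.9, 12.2])

poly = np.polynomial.Polynomial.fit(t, y, deg=2)   # one polynomial for the whole series
spline = CubicSpline(t, y)                         # piecewise polynomials, smooth joins

t_new = np.linspace(0, 5, 11)
print(poly(t_new))     # smoothed values; need not pass through the observations
print(spline(t_new))   # exact at the observations, interpolated in between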
Extrapolation is the process of estimating, beyond the original observation range, the value
of a variable on the basis of its relationship with another variable. It is similar to interpolation,
which produces estimates between known observations, but extrapolation is subject to greater
uncertainty and a higher risk of producing meaningless results.
Function approximation
In general, a function approximation problem asks us to select a function among a well-
defined class that closely matches ("approximates") a target function in a task-specific way. One can
distinguish two major classes of function approximation problems: First, for known target functions
approximation theory is the branch of numerical analysis that investigates how certain known
functions (for example, special functions) can be approximated by a specific class of functions (for
example, polynomials or rational functions) that often have desirable properties (inexpensive
computation, continuity, integral and limit values, etc.).
Second, the target function, call it g, may be unknown; instead of an explicit formula, only a
set of points (a time series) of the form (x, g(x)) is provided. Depending on the structure of the
domain and codomain of g, several techniques for approximating g may be applicable. For example,
if g is an operation on the real numbers, techniques of interpolation, extrapolation, regression
analysis, and curve fitting can be used. If the codomain (range or target set) of g is a finite set, one
is dealing with a classification problem instead. A related problem of online time series
approximation is to summarize the data in one-pass and construct an approximate representation
that can support a variety of time series queries with bounds on worst-case error.
To some extent the different problems (regression, classification, fitness approximation)
have received a unified treatment in statistical learning theory, where they are viewed as supervised
learning problems.
Classification
Assigning a time series pattern to a specific category, for example identifying a word based on a
series of hand movements in sign language.
Signal estimation
This approach is based on harmonic analysis and filtering of signals in the frequency domain
using the Fourier transform, and spectral density estimation, the development of which was
significantly accelerated during World War II by mathematician Norbert Wiener, electrical
engineers Rudolf E. Kálmán, Dennis Gabor and others for filtering signals from noise and
predicting signal values at a certain point in time. See Kalman filter, Estimation theory, and Digital
signal processing.
Segmentation
Splitting a time-series into a sequence of segments. It is often the case that a time-series can
be represented as a sequence of individual segments, each with its own characteristic properties. For
example, the audio signal from a conference call can be partitioned into pieces corresponding to the
times during which each person was speaking. In time-series segmentation, the goal is to identify
the segment boundary points in the time-series, and to characterize the dynamical properties
associated with each segment. One can approach this problem using change-point detection, or by
modeling the time-series as a more sophisticated system, such as a Markov jump linear system.
Models
Models for time series data can have many forms and represent different stochastic
processes. When modeling variations in the level of a process, three broad classes of practical
importance are the autoregressive (AR) models, the integrated (I) models, and the moving average
(MA) models. These three classes depend linearly on previous data points. Combinations of these
ideas produce autoregressive moving average (ARMA) and autoregressive integrated moving
average (ARIMA) models. The autoregressive fractionally integrated moving average (ARFIMA)
model generalizes the former three. Extensions of these classes to deal with vector-valued data are
available under the heading of multivariate time-series models and sometimes the preceding
acronyms are extended by including an initial "V" for "vector", as in VAR for vector autoregression.
An additional set of extensions of these models is available for use where the observed time-series
is driven by some "forcing" time-series (which may not have a causal effect on the observed series):
the distinction from the multivariate case is that the forcing series may be deterministic or under the
experimenter's control. For these models, the acronyms are extended with a final "X" for
"exogenous".
Non-linear dependence of the level of a series on previous data points is of interest, partly
because of the possibility of producing a chaotic time series. However, more importantly, empirical
investigations can indicate the advantage of using predictions derived from non-linear models, over
those from linear models, as for example in nonlinear autoregressive exogenous models. Further
references on nonlinear time series analysis are available in the literature.
Among other types of non-linear time series models, there are models to represent the
changes of variance over time (heteroskedasticity). These models represent autoregressive
conditional heteroskedasticity (ARCH), and the collection comprises a wide variety of
representations (GARCH, TARCH, EGARCH, FIGARCH, CGARCH, etc.). Here changes in
variability are related to, or predicted by, recent past values of the observed series. This is in
contrast to other possible representations of locally varying variability, where the variability might
be modelled as being driven by a separate time-varying process, as in a doubly stochastic model.
In recent work on model-free analyses, wavelet transform based methods (for example
locally stationary wavelets and wavelet decomposed neural networks) have gained favor. Multiscale
(often referred to as multiresolution) techniques decompose a given time series, attempting to
illustrate time dependence at multiple scales. See also Markov switching multifractal (MSMF)
techniques for modeling volatility evolution.
A Hidden Markov model (HMM) is a statistical Markov model in which the system being
modeled is assumed to be a Markov process with unobserved (hidden) states. An HMM can be
considered as the simplest dynamic Bayesian network. HMM models are widely used in speech
recognition, for translating a time series of spoken words into text.
Notation
A number of different notations are in use for time-series analysis. A common notation
specifying a time series X that is indexed by the natural numbers is written
X = {X1, X2, ...}.
Another common notation is
Y = {Yt: t ∈ T},
where T is the index set.
Conditions
There are two sets of conditions under which much of the theory is built:
Stationary process
Ergodic process
However, ideas of stationarity must be expanded to consider two important ideas: strict
stationarity and second-order stationarity. Both models and applications can be developed under
each of these conditions, although the models in the latter case might be considered as only partly
specified.
In addition, time-series analysis can be applied where the series are seasonally stationary or
non-stationary. Situations where the amplitudes of frequency components change with time can be
dealt with in time-frequency analysis which makes use of a time–frequency representation of a
time-series or signal.
An analytic platform combines tools for creating analyses with an engine to execute them, a
DBMS to keep and manage them for ongoing use, and mechanisms for acquiring and preparing data
that are not already stored. The components of the platform are depicted in the figure.
The data can be collected from multiple data sources and fed through the data integration
process. The data are captured, transformed and loaded into an analytic database management
system (ADBMS). The ADBMS has a separate data store to manage data. It also has provision for
creating functions and procedures that operate on the data. Models can be created for analysis in the
ADBMS itself. Analytic applications can make use of the data in the ADBMS and apply
algorithms to it.
The application has the following facilities:
Ad-hoc reporting
Model building
Statistical Analysis
Predictive Analysis
Data visualization
Hardware sharing model for processing and data through MPP (Massively Parallel Processing)
Storage format (row and column manner) and smart data management
Programming extensibility and more cores, threads which yield more processing power
Deployment model
Applications
Social Media Analytics
Social Media is the modern way of communication and networking. It is a growing and
widely accepted way of interaction these days and connects billions of people on a real time basis.
Fan page analysis – Facebook and Twitter
Business Analytics
Business analytics focuses on developing new insights and understanding of business
performance based on data and statistical methods. It gives critical information about supply and
demand of business/product's viability in the marketplace.
Goal tracking and returning customers
trendlines
Customer analytics
Customer profiling
Web Analytics
It is the process of collecting, analyzing and reporting of web data for the purpose of
understanding and optimizing web usage.
On-site analytics (number of visitors, number of current users and their actions, user locations, etc.)
Logfile analysis
Click analytics
Customer life cycle analytics
Tracking web traffic
Sentiment Analysis
We begin our sentiment analysis by applying Alex Davies' word list in order to see if a
simple approach is sufficient to correlate with market movement. For this, we use a pre-generated
word list of roughly five thousand common words along with log probabilities of 'happy' or 'sad'
associated with the respective words. The process works as follows. First, each tweet is tokenized
into a word list. The parsing algorithm separates the tweets using whitespace and punctuation,
while accounting for common syntax found in tweets, such as URLs and emoticons. Next, we look
up each token's log-probability in the word list; as the word list is not comprehensive, we choose to
ignore words that do not appear in the list. The log probabilities of each token are simply added to
determine the probability of 'happy' and 'sad' for the entire tweet. These are then
averaged per day to obtain a daily sentiment value.
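A minimal sketch of this scoring procedure follows. The word list entries, their log probabilities and the tokenizer are placeholders (the real list has roughly five thousand words); only the overall flow, tokenize, look up, sum, average per day, is taken from the text.

import re

# Placeholder fragment of a sentiment word list with 'happy'/'sad' log probabilities.
LOG_PROBS = {
    "great":  {"happy": -0.2, "sad": -2.5},
    "down":   {"happy": -2.0, "sad": -0.4},
    "market": {"happy": -1.0, "sad": -1.1},
}

TOKEN_RE = re.compile(r"https?://\S+|[:;][-']?[)(DP]|\w+")   # rough URL/emoticon/word splitter

def tweet_sentiment(tweet):
    happy, sad = 0.0, 0.0
    for token in TOKEN_RE.findall(tweet.lower()):
        if token in LOG_PROBS:                    # unknown tokens are ignored, as in the text
            happy += LOG_PROBS[token]["happy"]
            sad += LOG_PROBS[token]["sad"]
    return happy, sad

daily = [tweet_sentiment(t) for t in ["Great day for the market :)", "Market is down"]]
avg_happy = sum(h for h, _ in daily) / len(daily)   # averaged per day -> daily sentiment value
print(avg_happy)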
As expected, this method resulted in highly uncorrelated data (with correlation coefficients of
almost zero). We tried to improve this by using a more comprehensive and accurate dictionary for
positive and negative sentiments. Specifically, we swapped our initial word list with a sentiment
score list we generated using SentiWordNet, which consisted of over 400 thousand words. Since
this list considers relationships between each word and includes multi-word expressions, it provided
better results. We also tried representing the daily sentiment value in a different way: instead of
averaging the probabilities of each tweet, we counted the frequency of 'happy' tweets (such as using
a threshold probability of above 0.5 for happy) and represented this as a percentage of all tweets for
that day. While this did not improve the output's correlation with stock market data, it did provide
us with more insight into our Twitter data. For example, we see a spike in the percentage of 'happy'
tweets toward the end of each month (Figure 1).
We did not find news events which could have caused these spikes; however, upon investigating
the source of the Twitter data, we found that it had been pre-filtered for a previous research project
(i.e. there may be some bias in what we assumed to be raw Twitter data). Due to a lack of access to
better Twitter data, we conclude that using the frequency of happy tweets is not a reliable indicator
of sentiment for our application and revert to our averaging method.
The Algorithm
We chose to model the data using a linear regression. This decision was motivated by
several factors:
Speed - A fast, efficient algorithm was one of our original specifications. This is a must when working
with massive amounts of data in real time, as is the case in the stock market.
Regression - We sought to make investment decisions based not only on the direction of market
movement, but also to quantify this movement. A simple classifier was insufficient for this; we required
a regressor.
Accuracy - Naturally, we needed an algorithm that would model the data as accurately as possible.
Since our data is, by its nature, very noisy, we chose a simple model to avoid high variance.
Features
The backbone of our algorithm was, of course, Twitter sentiment data. As such, we designed
several features that correspond to these sentiment values at various time-delays to the present.
Training in one-dimensional feature space using only this data, we found that the best results were
obtained when the Twitter data predated the market by 3 days. Using k-fold cross-validation to
quantify our accuracy, we observed that this model was able to make predictions with
approximately 60% accuracy, a modest improvement over no information (50% accuracy), but we
wanted to see if we could do better.
We designed 2 more classes of features to try: one modeling the change in price of the
market each day at various time-delays, the other modeling the total change in price of the market
over the past n days. To help us choose a good set of features, we applied a feature selection
algorithm using forward search to the problem. From this, we learned that the `change in price 3
days ago' feature improved our previous model to one with approximately 64% accuracy.
Further tests indicated that several of the other features are also relevant; however, due to the
relatively small amount of training data (72 days or fewer), training in higher-dimensional feature
spaces yielded worse results in practice. Nonetheless, with the availability of more training data, a
more complex and diverse set of features could further improve accuracy. We were able to achieve,
using nearly all of our available data to train (infeasible for portfolio simulation, see next section),
classification accuracy as high as 70%.
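The feature setup can be sketched as follows on synthetic stand-in data (the random series, the 3-day lag choice applied to both features, and the scoring calls are illustrative): regress tomorrow's market change on the Twitter sentiment and price change from three days earlier, scoring with k-fold cross-validation.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
days = 72
sentiment = rng.normal(size=days)          # stand-in for daily Twitter sentiment values
price_change = rng.normal(size=days)       # stand-in for daily market price changes

LAG = 3
X = np.column_stack([sentiment[:-LAG], price_change[:-LAG]])  # features from 3 days back
y = price_change[LAG:]                                        # target: today's change

model = LinearRegression()
print(cross_val_score(model, X, y, cv=5, scoring="r2"))            # k-fold estimate of fit quality
print(np.mean(np.sign(model.fit(X, y).predict(X)) == np.sign(y)))  # in-sample directional accuracy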
Here, invest is the percent of our funds we use to buy stock and predicted % change q is computed
by dividing the predicted change in the market tomorrow by the price today.
Maximal - This strategy assumes perfect knowledge about future stock prices. We will invest all
available funds when we know the market will go up the following day, and invest no funds when
we know the market will go down. This strategy is, of course, impossible to execute in reality, and
is only being used to quantify the profits from an ideal strategy.
Simulation
We start with exactly enough money to buy 50 shares of stock on the first day. Note that since
we output results as percentages of starting money, they do not depend on this value, and as such it
is chosen arbitrarily. At the start of each day, we make a prediction and invest according to some
strategy. At the end of the day, we sell all shares at the closing price and put the money in the bank.
This is done so that any gains or losses can compound into future gains or losses, by virtue of being
able to purchase more or less stock at every time step.
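A minimal sketch of this daily loop, with illustrative prices, predictions and strategy formulas (the exact investment formula of the original study is not reproduced here):

# Invest a fraction of funds each morning, sell at the close, bank the proceeds.
def simulate(prices, predicted_changes, strategy):
    funds = 50 * prices[0]                       # enough to buy 50 shares on the first day
    for today in range(len(prices) - 1):
        fraction = strategy(predicted_changes[today], prices[today])
        shares = (funds * fraction) / prices[today]
        funds += shares * (prices[today + 1] - prices[today])   # sell at the next close
    return funds / (50 * prices[0])              # final wealth as a multiple of starting money

# Regression-style strategy: invest in proportion to the predicted percent change q.
regression = lambda pred, price: max(0.0, min(1.0, (pred / price) * 100))
# Classification-style strategy: all in if a gain is predicted, stay out otherwise.
classification = lambda pred, price: 1.0 if pred > 0 else 0.0

prices = [100, 101, 100.5, 102, 101]
preds = [0.8, -0.4, 1.2, -0.6]                   # hypothetical predicted changes for tomorrow
print(simulate(prices, preds, regression), simulate(prices, preds, classification))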
Results
We ran the simulation for each investment strategy, as described above, on 2 different time
intervals. The results are shown below:
In the figure on the left, we trained on about 3/4 of the data (72 days) and simulated on about
1/4 of the data (25 days). In the figure on the right, we trained on about 2/3 of the data (64 days) and
simulated on about 1/3 of the data (33 days). We immediately see that both of our strategies fare
better than the default strategy in both simulations.
Note, however, that the regression strategy is more profitable in the first simulation while the
classification strategy is more profitable in the second simulation. We observe that on the simulation
in which the model was given less training data (figure on the right), on day 27, our regression
strategy opted to invest only 25% of funds that day because it perceived gains as being uncertain.
This did not happen on the corresponding day in the first simulation (with more training data).
Indeed, with less data to train on, imprecision in our model resulted in a poor investment decision
when using the more complex regression strategy. In general, the classification strategy tends to be
more consistent, while the regression strategy, though theoretically more profitable, is also more
sensitive to noise in the model.