Processing and Visualizing The Data in Tweets

The MIT Faculty has made this article openly available.
As Published: http://dx.doi.org/10.1145/2094114.2094120
Citable Link: http://hdl.handle.net/1721.1/79351
Terms of Use: http://creativecommons.org/licenses/by-nc-sa/3.0/
ABSTRACT
Microblogs such as Twitter provide a valuable corpus
of diverse user-generated content. While the data extracted from Twitter is generally timely and accurate,
the process by which developers currently extract structured data from the tweet stream is ad-hoc and requires
reimplementation of common data manipulation primitives. In this paper, we present two systems for extracting structure from and querying Twitter-embedded data.
The first, TweeQL, provides a streaming SQL-like interface to the Twitter API, making common tweet processing tasks simpler. The second, TwitInfo, shows how
end-users can interact with and understand aggregated
data from the tweet stream (as well as showcasing the
power of the TweeQL language). Together these systems show the richness of content that can be extracted
from Twitter.
1. INTRODUCTION

2. TWEEQL
2.1
http://blog.twitter.com/2011/06/200million-tweets-per-day.html
2.1.1 Streams
The primary stream that TweeQL provides is twitter_stream. TweeQL users define streams based on this base stream using the CREATE STREAM statement, which creates a named substream of the main twitter_stream that satisfies a particular set of filters. For example, the following statement creates a queryable stream of tweets containing the term 'obama', called obamatweets, generated from the twitter_stream streaming source:
CREATE STREAM obamatweets
FROM twitter_stream
WHERE text contains 'obama';
While twitter_stream offers several fields (e.g., text, username, userid, location, latitude, longitude), the Twitter API only allows certain filters to be used as access methods for defining a stream. Specifically, when defining twitter_stream, the developer must supply a combination of fields which can be filtered by key or range lookups. For example, the Twitter streaming API allows parameters for text, userid, and latitude/longitude ranges. If a user tries to create a stream from a streaming source using an illegal set of predicates, TweeQL will raise an error.
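This access-method check can be sketched in Python as follows. This is a hypothetical illustration, not TweeQL's actual internals: the allowed field combinations mirror the examples in the text, and names such as AccessMethodError are ours.

```python
# Hypothetical sketch of validating CREATE STREAM predicates against the
# access methods the Twitter streaming API supports: keyword filters on
# text, key lookups on userid, and latitude/longitude range lookups.
# (Illustrative names; not TweeQL's actual implementation.)

ALLOWED_ACCESS_METHODS = [
    {"text"},
    {"userid"},
    {"latitude", "longitude"},
]

class AccessMethodError(Exception):
    """Raised when no streaming access method covers the given predicates."""

def validate_stream_predicates(filtered_fields):
    """Accept the predicate set only if some access method is contained in it."""
    fields = set(filtered_fields)
    if not any(method <= fields for method in ALLOWED_ACCESS_METHODS):
        raise AccessMethodError(
            f"no streaming access method covers predicates on {sorted(fields)}")
    return True
```

A stream filtering only on, say, location text would be rejected, matching the error behavior described above.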
Users are not allowed to directly query the raw twitter_stream because Twitter only provides access to tweets that match a filter. If users wish to access an unrestricted stream, Twitter provides a sampled, unfiltered stream that TweeQL wraps as twitter_sample. An unsampled, unfiltered stream is not provided by Twitter for performance and financial reasons.
Streaming sources asynchronously generate tuples as they appear; the tuples are buffered by an access method that implements the iterator model. They appear as tuples with a set schema to the rest of the query tree. Any streaming source must include a created_at timestamp field. If one is not provided by the data source, tuples are timestamped with their creation time. This field is necessary for the windowed aggregates described in Section 2.1.5 to follow proper ordering semantics.
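The buffering and timestamping behavior can be illustrated with a small Python sketch. This is our own illustration under the semantics described above, not TweeQL's actual code.

```python
# Illustrative sketch: a streaming source pushes tuples asynchronously into
# a buffer; the access method exposes them through the iterator model, and
# tuples lacking a created_at field are stamped with their arrival time.
import queue
import time

class BufferedStream:
    def __init__(self):
        self.buffer = queue.Queue()  # thread-safe FIFO between source and query tree

    def push(self, tup):
        """Called asynchronously as the source produces tuples."""
        if "created_at" not in tup:
            tup["created_at"] = time.time()  # stamp with creation time
        self.buffer.put(tup)

    def __iter__(self):
        return self

    def __next__(self):
        return self.buffer.get()  # blocks until the next tuple arrives
```

The query tree above this access method sees only an iterator of schema-conforming tuples, every one carrying a created_at value that the windowed operators can order by.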
While our examples show users creating streams from the twitter_stream base stream, in principle one could also wrap other streaming sources, such as RSS feeds, a Facebook news feed, or a Google+ feed. Once wrapped, derived streams can be generated using techniques similar to the examples we provide.
2.1.2 UDFs
Complex Data Types. TweeQL UDFs can accept array- or table-valued attributes as arguments. This is required because APIs often allow a variable number of parameters. For example, a geocoding API might allow multiple text locations to be mapped to latitude/longitude pairs in a single web service request.
UDFs can also return several values at once. This behavior is needed both for batched APIs that submit multiple requests at once and for many text-processing tasks which are important in unstructured text processing. For example, to build an index of words that appear in tweets, one can issue the following query:
SELECT tweetid, tokenize(text)
FROM obamatweets;
The tokenize UDF returns an array of words that appear in the tweet text. For example, tokenize("Tweet number one") = ["Tweet", "number", "one"]. While
arrays can be stored or passed to array-valued functions,
users often wish to relationalize them. To maintain
the relational model, we provide a FLATTEN operator
(based on the operator of the same name from Olston
et al.'s Pig Latin [8]). Users can wrap an array-valued
function found in a SELECT clause with a FLATTEN to
produce a result without arrays. For example, instead of
the above query, the programmer could write:
SELECT tweetid, FLATTEN(tokenize(text))
FROM obamatweets;
The resulting tuples for a tweet with tweetid = 5 and text = "Tweet number one" would then be:
(5, "Tweet")
(5, "number")
(5, "one")
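The tokenize-plus-FLATTEN semantics can be sketched in a few lines of Python. TweeQL expresses this in its SQL dialect; the function names below are ours.

```python
# Minimal sketch of an array-valued UDF and FLATTEN: flatten unnests an
# array-valued column, emitting one output tuple per array element.

def tokenize(text):
    """Array-valued UDF: split tweet text into words."""
    return text.split()

def flatten(rows, array_col):
    """Emit one tuple per element of the array-valued column."""
    for row in rows:
        for value in row[array_col]:
            out = dict(row)
            out[array_col] = value  # replace the array with a single element
            yield out

tweets = [{"tweetid": 5, "words": tokenize("Tweet number one")}]
result = list(flatten(tweets, "words"))
# result pairs tweetid 5 with each of "Tweet", "number", "one"
```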
Web Services as UDFs. Much of TweeQL's structure-extraction functionality is provided by third parties as
web APIs. TweeQL allows UDF implementers to make
calls to such web services to access their functionality.
One such UDF is the geocode UDF that returns the latitude and longitude for user-reported textual locations
described in Section 2.1.4. The benefit of wrapping such functionality in third-party services is that the functionality often requires large datasets (good geocoding datasets can be upwards of several gigabytes) that an implementer cannot or does not wish to package with their UDF. Wrapping services comes at a cost, however, as service calls generally incur high latency, and service providers often limit how frequently a client can make requests to their service.
Because calls to other web services may be slow or rate-limited, a TweeQL UDF developer can specify several parameters in addition to the UDF implementation. For example, the developer can add a cache invalidation policy for cacheable UDF invocations, as well as any rate-limiting policies that the API they are wrapping allows. To ensure quality of service, the developer can also specify a timeout on wrapped APIs. When the timeout expires, the token TIMEOUT is returned, which acts like a NULL value but can be retrieved at a later time. Similarly, a RATELIMIT token can be returned for rate-limited UDFs.
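A wrapper with these sentinel-token semantics might look roughly like the following. This is a hedged sketch under our own assumptions (a simple fixed-window rate limiter, and an after-the-fact timeout check), not TweeQL's actual implementation.

```python
# Hypothetical sketch of a web-service UDF wrapper that returns sentinel
# tokens instead of blocking: TIMEOUT when the call exceeds its budget,
# RATELIMIT when the provider's request quota for the current window is
# exhausted. Policy details here are illustrative.
import time

TIMEOUT = object()    # NULL-like token; the call can be retried later
RATELIMIT = object()  # NULL-like token; wait for the quota window to reset

def make_service_udf(call, timeout_s, max_calls, per_seconds):
    window_start = time.monotonic()
    calls_in_window = 0

    def udf(*args):
        nonlocal window_start, calls_in_window
        now = time.monotonic()
        if now - window_start >= per_seconds:
            window_start, calls_in_window = now, 0  # new quota window
        if calls_in_window >= max_calls:
            return RATELIMIT
        calls_in_window += 1
        start = time.monotonic()
        result = call(*args)
        # A real wrapper would enforce the deadline while the call runs;
        # this sketch only checks elapsed time afterwards.
        if time.monotonic() - start > timeout_s:
            return TIMEOUT
        return result

    return udf
```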
2.1.3
2.1.4
SELECT text,
FLATTEN(namedEntities(text)) AS entity
FROM obamatweets
INTO STREAM obamaentities;
SELECT text
FROM obamaentities
WHERE entity = "Barack Obama"
INTO STREAM barackobamatweets;
The current implementation of namedEntities is an API wrapper around OpenCalais, a web service for performing named entity extraction and topic identification. OpenCalais was designed to handle longer text blobs (e.g., a newspaper article) for better contextual named entity extraction. One area of future work is to develop named entity extractors for tweets, which are significantly shorter.
2.1.5 Windowed Operators
Like other stream processing engines, TweeQL supports aggregates and joins on streams. Because streams
are infinite, we attach sliding window semantics to
them, as in other streaming systems [?, 1]. Windows
are defined by a WINDOW parameter specifying the
timeframe during which to calculate an aggregate or
join. On aggregates, an EVERY parameter specifies
how frequently to emit WINDOW-sized aggregates. The
created at field of a tuple emitted from an aggregate is
the time that the window begins.
For example, the query below converts the obamasentloc stream of sentiment, latitude, and longitude into an average sentiment expressed in a 1°x1° area. This average is computed over the course of three hours, and is calculated every hour.
SELECT AVG(sent) AS sent,
floor(lat) AS lat,
floor(long) AS long
FROM obamasentloc
GROUP BY lat, long
WINDOW 3 hours
EVERY 1 hour
INTO STREAM obamasentbyarea;
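The WINDOW/EVERY semantics of this query can be illustrated with a plain-Python sketch over a finished list of tuples; a streaming engine would compute this incrementally. Field names follow the obamasentloc example, and the implementation is ours, not TweeQL's.

```python
# Illustrative hopping-window aggregate: every `every_s` seconds, emit the
# average sentiment per 1-degree (floor(lat), floor(long)) cell over the
# trailing `window_s` seconds. The created_at of each emitted tuple is the
# start of its window, as described in the text.
import math
from collections import defaultdict

def windowed_avg(tuples, window_s=3 * 3600, every_s=3600):
    if not tuples:
        return []
    t0 = min(t["created_at"] for t in tuples)
    t_end = max(t["created_at"] for t in tuples)
    out = []
    start = t0
    while start <= t_end:
        groups = defaultdict(list)
        for t in tuples:
            if start <= t["created_at"] < start + window_s:
                cell = (math.floor(t["lat"]), math.floor(t["long"]))
                groups[cell].append(t["sent"])
        for (lat, lon), sents in sorted(groups.items()):
            out.append({"created_at": start, "lat": lat, "long": lon,
                        "sent": sum(sents) / len(sents)})
        start += every_s
    return out
```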
2.2 System Design
http://www.opencalais.com/

[Figure 1: TweeQL architectural components: a query parser and query plan optimizer feeding the executor; a stream manager (with a sampler and example streams such as sample and obama) over the streaming APIs; and a relational manager (with a cacher, rate limiter, and latency enforcer) over the pull APIs.]
2.3 Current Status

2.4 Challenges

https://github.com/marcua/tweeql
ative classifications. In this case, AVG(sent) will be biased toward the class with higher recall. The solution described in [7] is to return 1/recall_positive for positive text and 1/recall_negative for negative text, thus adjusting for this bias.
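The recall correction can be sketched numerically. The +1/-1 scoring convention before weighting is our assumption; the weighting itself follows the description above.

```python
# Sketch of recall-weighted sentiment averaging: each positive classification
# contributes +1/recall_positive and each negative one -1/recall_negative,
# so a classifier with asymmetric recall no longer biases the average.
# (The underlying +1/-1 scores are an assumed convention.)

def corrected_avg_sent(labels, recall_pos, recall_neg):
    """labels: iterable of 'pos'/'neg' classifier outputs."""
    scores = [(1.0 / recall_pos) if label == "pos" else (-1.0 / recall_neg)
              for label in labels]
    return sum(scores) / len(scores)
```

For instance, if the classifier recalls only half of positive tweets (recall 0.5) but all negative ones (recall 1.0), each detected positive is up-weighted to 2.0 to compensate.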
3. TWITINFO

3.1 Creating an Event
TwitInfo users define an event by specifying a Twitter keyword query. For example, for a soccer game, users might enter search keywords 'soccer', 'football', 'premierleague', and team names like 'manchester' and 'liverpool'. Users give the event a human-readable name like "Soccer: Manchester City vs. Liverpool" as well as an optional time window. When users are done entering the information, TwitInfo saves the event and begins logging any tweets containing the keywords using a TweeQL stream like the following:
CREATE STREAM twitinfo
FROM twitter_stream
WHERE text contains 'soccer'
OR text contains 'football'
OR text contains 'premierleague'
OR text contains 'manchester'
OR text contains 'liverpool';
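The disjunctive keyword filter this statement expresses amounts to a simple membership test; a minimal Python sketch (ours, not TwitInfo's code) follows.

```python
# Minimal sketch of the event's disjunctive keyword filter: a tweet belongs
# to the event stream if its text contains any of the event's keywords.
# Keyword list follows the soccer example above.

EVENT_KEYWORDS = ["soccer", "football", "premierleague",
                  "manchester", "liverpool"]

def matches_event(text, keywords=EVENT_KEYWORDS):
    lowered = text.lower()  # case-insensitive substring match
    return any(kw in lowered for kw in keywords)
```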
3.2
3.3
3.4
4. CONCLUSION

5. ACKNOWLEDGEMENTS

6. REFERENCES