
Subject: Big-Data Analytics (CSE-420) Class: B.Tech (CSE) Semester: 6 Lecture No. 13

The document discusses concepts related to big data analytics including Aerospike, a new generation key-value store, and AsterixDB, a database management system for semi-structured data. It provides details on the architecture of Aerospike, including its use of primary and secondary indexes, and transactions. It also shows how AsterixDB can model semi-structured data like JSON documents using types and handles nested elements.

Uploaded by

Mdim

Subject: Big-Data Analytics (CSE-420)

Class: B.Tech (CSE)


Semester: 6th
Lecture No. 13
At the end of this lecture, students will be able to understand the concepts of:

• Aerospike: (A New Generation KV Store)

• Architecture of Aerospike

• Transaction in Aerospike

• AsterixDB
• Aerospike: (A New Generation KV Store):--

• A large amount of data should be accessible at any point in time.

• Aerospike can interoperate with a Hadoop-based system, with Spark, or with a real-time data source.

• It can exchange large volumes of data with any such source and serves fast lookups and queries to the application server.
• Aerospike: (High Level Architecture Diagram):--
• The “FAST PATH” essentially refers to the left side of the architecture.

1. The client system processes transactions; the data is primarily managed in a primary index as a key-value store.

2. This index stays in memory for operational purposes. However, the system also interacts with the storage layer for persistence.

3. The storage layer uses three kinds of storage: in-memory DRAM, regular spinning disks, and Flash/SSD for fast loading of data when needed.
• Aerospike: (High Level Architecture Diagram):--

4. Aerospike builds secondary indexes on non-primary keys. (A non-primary key is a key attribute that makes a tuple unique but has not been chosen as the primary key.)

5. In Aerospike, secondary indexes are stored in main memory; they are built on every node in the cluster and co-located with the primary index.
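
The numbered points above can be pictured with a small Python sketch. It is a toy model, not Aerospike's API: the class `ToyKVStore`, its methods, and the sample records are all hypothetical. A plain list stands in for the persistent storage layer, and two in-memory dictionaries play the roles of the primary and secondary indexes.

```python
from collections import defaultdict


class ToyKVStore:
    """Toy model: in-memory indexes over a simulated persistent storage layer."""

    def __init__(self):
        self.storage = []                        # stands in for the SSD/disk layer
        self.primary_index = {}                  # key -> position in storage (in memory)
        self.secondary_index = defaultdict(set)  # (bin name, value) -> set of keys

    def get(self, key):
        # Primary-index lookup: one dictionary probe, then one storage read.
        pos = self.primary_index.get(key)
        return None if pos is None else self.storage[pos]

    def put(self, key, record):
        # Drop stale secondary-index entries if the key is being overwritten.
        old = self.get(key)
        if old is not None:
            for bin_name, value in old.items():
                self.secondary_index[(bin_name, value)].discard(key)
        # Append the record to storage and point the primary index at it.
        self.storage.append(dict(record))
        self.primary_index[key] = len(self.storage) - 1
        # Index every bin value (values must be hashable in this sketch).
        for bin_name, value in record.items():
            self.secondary_index[(bin_name, value)].add(key)

    def query(self, bin_name, value):
        # Secondary-index lookup: find keys by a non-primary attribute.
        return sorted(self.secondary_index.get((bin_name, value), set()))


store = ToyKVStore()
store.put("u1", {"name": "alice", "city": "Pune"})
store.put("u2", {"name": "bob", "city": "Delhi"})
store.put("u3", {"name": "carol", "city": "Pune"})
print(store.get("u2"))
print(store.query("city", "Pune"))
```

Keeping both indexes in memory means a lookup touches the storage layer only once, to fetch the record itself, which is the intuition behind the fast path.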
• Querying Aerospike:--
• Aerospike uses standard data types such as:

– Standard scalars, lists, maps, geospatial data, and large objects.

o The map type is similar to the hash type in Redis and contains attribute-value pairs.

o Because it focuses on real-time web applications, it supports geospatial data (such as latitude and longitude).

• Aerospike also provides a more declarative language, AQL, which looks very similar to SQL.
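
To make the data types above concrete, here is a minimal Python sketch of one record holding a scalar, a list, a map, and a geospatial point, followed by a radius check of the kind a geospatial query performs. It does not use Aerospike's client library; the field names, the `haversine_km` helper, and the `within_radius` helper are assumptions made for illustration.

```python
import math

# One record with the kinds of values listed above (all field names hypothetical).
record = {
    "user_id": 42,                                  # scalar
    "tags": ["sports", "news"],                     # list
    "profile": {"lang": "en", "plan": "free"},      # map of attribute-value pairs
    "location": {"lat": 28.6139, "lon": 77.2090},   # geospatial point
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points, in km."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def within_radius(rec, lat, lon, radius_km):
    """True if the record's location lies within radius_km of (lat, lon)."""
    loc = rec["location"]
    return haversine_km(loc["lat"], loc["lon"], lat, lon) <= radius_km


print(within_radius(record, 28.7041, 77.1025, 50))   # a nearby point
print(within_radius(record, 19.0760, 72.8777, 50))   # a far-away point
```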
• Transactions in Aerospike:--
• Aerospike ensures ACID properties.

o Consistency: -- means two different things. The first is ensuring that all constraints, such as domain constraints, are satisfied.

• The second applies to distributed systems: ensuring that all copies of a data item within a cluster are in sync. (Aerospike uses synchronous writes to replicas.)

o Durability: -- is achieved by storing data on flash/SSD on every node and performing direct reads from the flash memory.

• Durability is also maintained through replication, because multiple copies of the data exist.
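
The idea of a synchronous write to replicas can be sketched as follows. This is a simplified illustration, not Aerospike's replication protocol: the `Replica` class and the `synchronous_write` function are hypothetical, and a real system must also handle failures that occur in the middle of a write.

```python
class Replica:
    """A hypothetical node holding one copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True

    def apply(self, key, value):
        if not self.up:
            raise ConnectionError(f"{self.name} is unreachable")
        self.data[key] = value


def synchronous_write(replicas, key, value):
    """Acknowledge a write only after every replica has applied it."""
    # Check reachability first so a refused write leaves no partial copies.
    if any(not r.up for r in replicas):
        return False
    for r in replicas:
        r.apply(key, value)
    return True


nodes = [Replica("node-a"), Replica("node-b"), Replica("node-c")]
print(synchronous_write(nodes, "k1", "v1"))   # all replicas reachable
nodes[1].up = False
print(synchronous_write(nodes, "k1", "v2"))   # one replica down: write refused
```

Because the client sees success only when all copies agree, a later read from any replica returns the same value; the price is that one unreachable replica blocks the write in this sketch.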
• AsterixDB: (A DBMS for semi-structured data):--
• AsterixDB is a new approach currently being incubated by Apache.

• AsterixDB was originally conceived at the University of California. Since it is a full-fledged BDMS (Big Data Management System), it provides ACID guarantees.

• To understand the basic design of AsterixDB, let's consider the following incomplete JSON taken from an actual tweet.
An abbreviated Tweet:

{
  "created_at": "Thu Oct 21 16:02:46 +0000 2010",
  "entities": {
    "user_mentions": [
      {
        "name": "Gnip, Inc.",
        "screen_name": "gnip"
      }
    ]
  },
  "text": "what we've been up to at @gnip -- delivering data to happy customers http://gnip.com/success_stories",
  "id": 28039652140,
  "geo": null,
  "retweet_count": null,
  "in_reply_to_user_id": null,
  "user": {
    "name": "Gnip, Inc.",
    "lang": "en",
    "followers_count": 260,
    "friends_count": 71,
    "statuses_count": 302,
    "screen_name": "gnip"
  }
}
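
As a rough illustration of what nesting means for a program that consumes such a tweet, the Python sketch below parses a shortened copy of the JSON above with the standard `json` module and reaches into the embedded objects. The variable names are hypothetical, and the tweet is trimmed to a few fields to keep the example short.

```python
import json

# A shortened copy of the abbreviated tweet above.
tweet_json = """
{
  "created_at": "Thu Oct 21 16:02:46 +0000 2010",
  "entities": {"user_mentions": [{"name": "Gnip, Inc.", "screen_name": "gnip"}]},
  "id": 28039652140,
  "geo": null,
  "user": {"name": "Gnip, Inc.", "lang": "en", "followers_count": 260, "screen_name": "gnip"}
}
"""

tweet = json.loads(tweet_json)

# Nested elements are reached by walking through the embedded objects.
mentions = [m["screen_name"] for m in tweet["entities"]["user_mentions"]]
followers = tweet["user"]["followers_count"]

# JSON null becomes Python None, so a missing location is easy to detect.
has_geo = tweet["geo"] is not None

# The top-level attributes whose values are themselves objects.
nested_keys = sorted(k for k, v in tweet.items() if isinstance(v, dict))

print(mentions, followers, has_geo, nested_keys)
```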
• AsterixDB: (A DBMS for semi-structured data):--
• In the tweet on the previous slide, two parts, entities and user (shown in blue), are nested; that is, they are embedded within the structure of the tweet.

• If we represent part of this schema in AsterixDB, it would look like the following.

(See the next slide.)


create dataverse LittleTwitterDemo;

create type TwitterUserType as open {
    screen-name: string,
    lang: string,
    friends_count: int32,
    statuses_count: int32,
    id: int32,
    followers_count: int32
}
create type TweetMessageType as closed {
    tweetid: string,
    user: TwitterUserType,
    geo: point?,
    created_at: datetime,
    referred-topics: {{ string }},
    text: string
}
create dataset TweetMessages(TweetMessageType)
primary key tweetid;
• AsterixDB: (A DBMS for semi-structured data):--
• Top Part:

It looks like a standard database table definition and represents the user portion of the JSON object that we highlighted.

Open: -- means an instance may carry additional attributes beyond those declared in the type.

• Bottom Part:

It represents the message. Instead of nesting the user as JSON does, the user attribute (highlighted in blue) is declared with its type, TwitterUserType.

Closed: -- means a data instance must have exactly the attributes declared in the schema.

geo: point?: -- the ? marks the attribute as optional; not all instances need to have it.
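
The open, closed, and optional rules described above can be sketched in Python. This is a toy validator under simplifying assumptions, not AsterixDB's type checker: the `validate` function and the two field tables are hypothetical, and only a few attributes of each type are modelled.

```python
# Hypothetical field tables: attribute name -> (Python type, optional?)
TWITTER_USER_TYPE = {
    "screen-name": (str, False),
    "lang": (str, False),
    "followers_count": (int, False),
}
TWEET_MESSAGE_TYPE = {
    "tweetid": (str, False),
    "geo": (tuple, True),      # mirrors `geo: point?`
    "text": (str, False),
}


def validate(record, fields, is_open):
    """Return True if the record conforms to the declared fields."""
    for name, (ftype, optional) in fields.items():
        value = record.get(name)
        if value is None:
            if not optional:
                return False       # a required attribute is missing
            continue
        if not isinstance(value, ftype):
            return False           # wrong type for a declared attribute
    extra = set(record) - set(fields)
    return is_open or not extra    # only open types tolerate undeclared attributes


user = {"screen-name": "gnip", "lang": "en", "followers_count": 260, "verified": True}
msg = {"tweetid": "1", "text": "hello"}

print(validate(user, TWITTER_USER_TYPE, is_open=True))    # extra attribute accepted
print(validate(user, TWITTER_USER_TYPE, is_open=False))   # extra attribute rejected
print(validate(msg, TWEET_MESSAGE_TYPE, is_open=False))   # optional geo may be absent
```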

for $user in dataset TwitterUsers
order by $user.followers_count desc, $user.lang asc
return $user
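
For readers more familiar with general-purpose languages, the ordering requested by the query above (followers count descending, then language ascending) can be mimicked in Python. This only illustrates the sort order, not how AsterixDB evaluates the query; the sample users are made up.

```python
# Made-up sample rows standing in for the TwitterUsers dataset.
users = [
    {"screen-name": "gnip", "lang": "en", "followers_count": 260},
    {"screen-name": "anna", "lang": "de", "followers_count": 260},
    {"screen-name": "bo", "lang": "fr", "followers_count": 500},
]

# Negating the count gives descending order; lang stays ascending as the tie-breaker.
ordered = sorted(users, key=lambda u: (-u["followers_count"], u["lang"]))

for u in ordered:
    print(u["screen-name"], u["followers_count"], u["lang"])
```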


SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key = b.key)
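
The semantics of the left outer join above can be sketched in a few lines of Python: every row of a appears in the result, paired with each matching row of b, or with a null (None) when b has no match. The function `left_outer_join` and the sample rows are hypothetical and say nothing about how the join is actually executed.

```python
def left_outer_join(a, b):
    """Return (a.val, b.val) pairs; unmatched rows of a are paired with None."""
    # Group the right-hand rows by key for constant-time matching.
    b_by_key = {}
    for row in b:
        b_by_key.setdefault(row["key"], []).append(row["val"])

    result = []
    for row in a:
        matches = b_by_key.get(row["key"])
        if matches:
            result.extend((row["val"], b_val) for b_val in matches)
        else:
            result.append((row["val"], None))   # keep the left row anyway
    return result


a = [{"key": 1, "val": "a1"}, {"key": 2, "val": "a2"}, {"key": 3, "val": "a3"}]
b = [{"key": 1, "val": "b1"}, {"key": 1, "val": "b1x"}, {"key": 3, "val": "b3"}]
print(left_outer_join(a, b))
```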




Hyracks Job Management





• From files in a directory path
create dataset Tweets (Tweet)
primary key id;

create feed TestFileFeed using localfs
(("path"="127.0.0.1:///Users/adc/text/"),
("format"="adm"), ("type-name"="Tweet"),
("expression"=".*\\.adm"));

connect feed TestFileFeed to dataset Tweets;


• From an external API
use dataverse feeds;
create dataset Tweets (Tweet)
primary key id;
create feed TwitterFeed if not exists using "push_twitter"
(("type-name"="Tweet"),
("consumer.key"="some-key"),
("consumer.secret"="some-secret"),
("access.token"="some-token"),
("access.token.secret"="some-token-secret"));
connect feed TwitterFeed to dataset Tweets;
