
ITU Big Data Projects Summer15

This document includes Big Data projects for students of the Summer 2015 class. Topics include R, SAS, Hadoop, Spark, Storm, and Solr.

Uploaded by

Bhairav Mehta

Team A

Harinath Bokka, Sonali Wani, Zoheb Syed, Jyotsna Paintal, Radhika Aitha

Problem Statement
A. Find the list of people with a particular grade who have taken a loan.
B. Find the list of people whose interest is more than a certain value, such as 1000.
C. Find the list of people whose loan amount is more than a certain value.
D. Find which grade of users (A-G) was given the maximum number of loans.
E. Find the highest loan amount given in a given year, along with the employee ID and the employee's annual income.
F. Get the total number of loans with a loan status of Late, along with their loan IDs and loan amounts.
G. Find the average loan interest rate for the 60-month term and for the 36-month term.
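
As a rough illustration of questions D, F and G, the following Python sketch works on a few made-up records. The field names (`grade`, `loan_amount`, `term_months`, `interest_rate`, `status`) are assumptions for this example only; the real column names should be taken from the dataset description.

```python
from collections import Counter, defaultdict

# Hypothetical loan records; field names and values are invented for illustration.
loans = [
    {"loan_id": 1, "grade": "A", "loan_amount": 5000, "term_months": 36, "interest_rate": 10.0, "status": "Current"},
    {"loan_id": 2, "grade": "B", "loan_amount": 12000, "term_months": 60, "interest_rate": 14.0, "status": "Late"},
    {"loan_id": 3, "grade": "B", "loan_amount": 8000, "term_months": 36, "interest_rate": 12.0, "status": "Late"},
    {"loan_id": 4, "grade": "C", "loan_amount": 20000, "term_months": 60, "interest_rate": 16.0, "status": "Current"},
]

def grade_with_most_loans(records):
    """Question D: return (grade, count) for the grade with the most loans."""
    return Counter(r["grade"] for r in records).most_common(1)[0]

def late_loans(records):
    """Question F: return (loan_id, loan_amount) pairs for loans whose status is Late."""
    return [(r["loan_id"], r["loan_amount"]) for r in records if r["status"] == "Late"]

def average_rate_by_term(records):
    """Question G: return {term_in_months: average interest rate}."""
    totals = defaultdict(lambda: [0.0, 0])  # term -> [sum of rates, number of loans]
    for r in records:
        totals[r["term_months"]][0] += r["interest_rate"]
        totals[r["term_months"]][1] += 1
    return {term: total / count for term, (total, count) in totals.items()}
```

The same filters and aggregations map directly onto Hive, Pig or Spark queries once the data is loaded there.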

Dataset
https://edureka.wistia.com/medias/cpj3ljetym/download?media_file_id=64495520

Dataset Description
https://edureka.wistia.com/medias/410pi4dlfe/download?media_file_id=64495579

Team B
Varsha Tomar, Rupesh, Nallapaneni Swetha, Sheny, Sandeep

Problem Statement
1. Find the display name and number of posts created by the user with the maximum reputation.
2. Find the average age of users on the Stack Overflow site.
3. Find the display name of the user who posted the oldest post on Stack Overflow (in terms of date).
4. Find the display name and number of comments made by the user with the maximum reputation.
5. Find the display name of the user who has created the maximum number of posts on Stack Overflow.
6. Find the owner name and ID of the user whose post has the maximum view count so far.
7. Find the title and owner name of the post with the maximum comment count.
8. Find the location with the maximum number of Stack Overflow users.
9. Find the total number of answers, posts, and comments created by Indian users.
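
A minimal Python sketch of three of these queries is given below, using invented user and post records. The field names (`display_name`, `reputation`, `age`, `location`, `owner_id`) are assumptions; the actual names come from the dataset description.

```python
from collections import Counter

# Hypothetical records; names and values are invented for illustration.
users = [
    {"id": 1, "display_name": "alice", "reputation": 900, "age": 30, "location": "India"},
    {"id": 2, "display_name": "bob", "reputation": 1500, "age": 40, "location": "USA"},
    {"id": 3, "display_name": "carol", "reputation": 300, "age": None, "location": "India"},
]
posts = [
    {"id": 10, "owner_id": 2, "view_count": 50},
    {"id": 11, "owner_id": 2, "view_count": 700},
    {"id": 12, "owner_id": 1, "view_count": 90},
]

def top_user_post_count(users, posts):
    """Display name and number of posts of the user with the maximum reputation."""
    top = max(users, key=lambda u: u["reputation"])
    n_posts = sum(1 for p in posts if p["owner_id"] == top["id"])
    return top["display_name"], n_posts

def average_age(users):
    """Average age over users that have an age recorded (missing ages are skipped)."""
    ages = [u["age"] for u in users if u["age"] is not None]
    return sum(ages) / len(ages)

def most_common_location(users):
    """Location with the largest number of users."""
    return Counter(u["location"] for u in users).most_common(1)[0][0]
```

Skipping users without an age is a design choice for this sketch; treating missing ages differently would change the average.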

Dataset
https://edureka.wistia.com/medias/d06fdpiiec/download?media_file_id=64431552

Dataset Description
https://edureka.wistia.com/medias/btr6i3e0p5/download?media_file_id=81524799

Team C
Vidya Rani Gidiginjala Manne, Remya Nekkuth
Melath, Simmy Payyappilly, Varghese, Monali Modi,
Ajuba Benazir Riyaz

Problem Statement
1. Find the number of movies released between
1950 and 1960.
2. Find the number of movies having rating more
than 4.
3. Find the movies whose ratings are between 3 and
4.
4. Find the number of movies with duration more
than 2 hours (7200 seconds).
5. Find the list of years and number of movies
released each year.
6. Find the total number of movies in the dataset.

Dataset
https://edureka.wistia.com/medias/7qd5lgmko4

Dataset Description

Column1: Movie ID
Column2: Movie name
Column3: Year of release
Column4: Rating of the movie
Column5: Movie duration in seconds
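
Using the five columns above, several of the Team C questions can be sketched in Python as follows. The sample rows are made up, and the comma delimiter and the inclusive 1950-1960 range are assumptions that should be checked against the real file and the intended reading of the question.

```python
import csv
import io
from collections import Counter

# Made-up rows in the five-column layout: id, name, year, rating, duration (seconds).
SAMPLE = """1,Movie One,1955,3.5,6000
2,Movie Two,1958,4.2,7500
3,Movie Three,1972,2.9,8100
4,Movie Four,1958,4.0,5400
"""

def load_movies(text):
    """Parse comma-separated rows into dictionaries with typed fields."""
    rows = []
    for movie_id, name, year, rating, duration in csv.reader(io.StringIO(text)):
        rows.append({"id": int(movie_id), "name": name, "year": int(year),
                     "rating": float(rating), "duration": int(duration)})
    return rows

movies = load_movies(SAMPLE)
released_1950_1960 = sum(1 for m in movies if 1950 <= m["year"] <= 1960)
rated_above_4 = sum(1 for m in movies if m["rating"] > 4)
longer_than_2_hours = sum(1 for m in movies if m["duration"] > 7200)
movies_per_year = Counter(m["year"] for m in movies)
total_movies = len(movies)
```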

Team D
Xincheng Tang, Xiaoran An, Yelin Lu, Jingran Xu

Problem Statement
1. Find out the top 5 categories with maximum
number of videos uploaded.
2. Find out the top 10 rated videos.
3. Find out the most viewed videos.

Dataset
https://edureka.wistia.com/medias/6cchxi6to4

Dataset Description
Column1: Video ID of 11 characters.
Column2: Uploader of the video, of string data type.
Column3: Interval between the day of establishment of YouTube and the date of uploading of the video, of integer data type.
Column4: Category of the video, of string data type.
Column5: Length of the video, of integer data type.
Column6: Number of views for the video, of integer data type.
Column7: Rating on the video, of float data type.
Column8: Number of ratings given on the video.
Column9: Number of comments on the video, of integer data type.
Column10: Related video IDs for the uploaded video.
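
The three Team D questions are all "top N" queries. A possible Python sketch, on made-up records that keep only the columns needed here, is shown below; the dictionary keys are names chosen for this example, not names from the dataset.

```python
import heapq
from collections import Counter

# Made-up video records; only category (Column4), views (Column6) and rating (Column7) are kept.
videos = [
    {"video_id": "aaaaaaaaaa1", "category": "Music", "views": 1000, "rating": 4.5},
    {"video_id": "aaaaaaaaaa2", "category": "Comedy", "views": 5000, "rating": 3.9},
    {"video_id": "aaaaaaaaaa3", "category": "Music", "views": 300, "rating": 4.9},
    {"video_id": "aaaaaaaaaa4", "category": "Sports", "views": 800, "rating": 4.1},
]

def top_categories(records, n=5):
    """Categories with the most uploaded videos, as (category, count) pairs."""
    return Counter(r["category"] for r in records).most_common(n)

def top_rated(records, n=10):
    """The n videos with the highest rating."""
    return heapq.nlargest(n, records, key=lambda r: r["rating"])

def most_viewed(records, n=1):
    """The n videos with the highest view count."""
    return heapq.nlargest(n, records, key=lambda r: r["views"])
```

`heapq.nlargest` avoids sorting the full dataset when only a handful of top entries is needed.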

Team E
Minghao (Murphy) Zhai, Jaime Shien, Yuanqi (Linda) Zhou

Problem Statement
1. Count the number of countries in each landmass.
2. Find the top 5 countries by the sum of bars and stripes in the flag.
3. Count the countries with an icon.
4. Count the countries that have the same top-left and top-right colour in the flag.
5. Count the number of countries in each zone.
6. Find the largest country in terms of area in the NE zone.
7. Find the least populated country in the S. America landmass.
8. Find the most widely spoken language among all countries.
9. Find the most common colour among the flags of all countries.
10. Sum all circles present in all country flags.
11. Count the countries that have both an icon and text in the flag.

Dataset
http://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data

Dataset Description
http://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.names
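
A hedged Python sketch of a few of the Team E counts follows. It assumes each record has 30 comma-separated fields and that the zero-based column positions below match the attribute order in flag.names (name, landmass, zone, area, population, ..., mainhue, circles, ..., icon, animate, text, ...); these positions should be verified against flag.names before use. The rows are built by a hypothetical helper and contain invented values.

```python
from collections import Counter

# Zero-based column positions, ASSUMED from the attribute order in flag.names.
COLUMNS = {"name": 0, "landmass": 1, "zone": 2, "area": 3, "population": 4,
           "mainhue": 17, "circles": 18, "icon": 25, "text": 27}

def make_row(name, **values):
    """Hypothetical helper: build a 30-field row of "0" strings, then set named columns."""
    row = ["0"] * 30
    row[COLUMNS["name"]] = name
    for column, value in values.items():
        row[COLUMNS[column]] = str(value)
    return row

# Invented rows for illustration only.
rows = [
    make_row("Country A", landmass=1, circles=1, icon=1, text=1, mainhue="red"),
    make_row("Country B", landmass=1, circles=0, icon=1, text=0, mainhue="red"),
    make_row("Country C", landmass=3, circles=2, icon=0, text=0, mainhue="blue"),
]

def countries_per_landmass(rows):
    """Question 1: number of countries for each landmass code."""
    return Counter(r[COLUMNS["landmass"]] for r in rows)

def total_circles(rows):
    """Question 10: sum of circles over all flags."""
    return sum(int(r[COLUMNS["circles"]]) for r in rows)

def icon_and_text_count(rows):
    """Question 11: countries whose flag has both an icon and text."""
    return sum(1 for r in rows
               if r[COLUMNS["icon"]] == "1" and r[COLUMNS["text"]] == "1")

def most_common_mainhue(rows):
    """One reading of question 9: the most frequent predominant colour."""
    return Counter(r[COLUMNS["mainhue"]] for r in rows).most_common(1)[0][0]
```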

Team F
Abinas Roy, Anushree Randad, Lorena Arague, Vivek
Narang

Problem Statement
1. Find the list of airports operating in India.
2. Find the list of airlines having zero stops.
3. Find the list of airlines operating with code share.
4. Find which country or territory has the highest number of airports.
5. Find the list of active airlines in the United States.

Dataset
https://edureka.wistia.com/medias/67vuzsza8j/download?media_file_id=66596539

Dataset Description
In this use case there are 3 data sets: Final_airlines, routes.dat, airports_mod.dat

Airports dataset, i.e. airports_mod.dat
It contains the following fields:
Airport ID: Unique OpenFlights identifier for this airport.
Name: Name of airport. May or may not contain the City name.
City: Main city served by airport. May be spelled differently from Name.
Country: Country or territory where airport is located.
IATA/FAA: 3-letter FAA code, for airports located in Country "United States of America". 3-letter IATA code, for all other airports. Blank if not assigned.
ICAO: 4-letter ICAO code. Blank if not assigned.
Latitude: Decimal degrees, usually to six significant digits. Negative is South, positive is North.
Longitude: Decimal degrees, usually to six significant digits. Negative is West, positive is East.
Altitude: In feet.
Timezone: Hours offset from UTC. Fractional hours are expressed as decimals, eg. India is 5.5.
DST: Daylight savings time. One of E (Europe), A (US/Canada), S (South America), O (Australia), Z (New Zealand), N (None) or U (Unknown). See also: Help: Time
Tz database time zone: Timezone in "tz" (Olson) format, eg. "America/Los_Angeles".

Airlines dataset:
It contains the following fields:
Airline ID: Unique OpenFlights identifier for this airline.
Name: Name of the airline.
Alias: Alias of the airline. For example, All Nippon Airways is commonly known as "ANA".
IATA: 2-letter IATA code, if available.
ICAO: 3-letter ICAO code, if available.
Callsign: Airline callsign.
Country: Country or territory where airline is incorporated.
Active: "Y" if the airline is or has until recently been operational, "N" if it is defunct. This field is not reliable: in particular, major airlines that stopped flying long ago, but have not had their IATA code reassigned (eg. Ansett/AN), will incorrectly show as "Y".

Routes dataset, i.e. routes.dat
It contains the following fields:
Airline: 2-letter (IATA) or 3-letter (ICAO) code of the airline.
Airline ID: Unique OpenFlights identifier for airline (see Airline).
Source airport: 3-letter (IATA) or 4-letter (ICAO) code of the source airport.
Source airport ID: Unique OpenFlights identifier for source airport (see Airport).
Destination airport: 3-letter (IATA) or 4-letter (ICAO) code of the destination airport.
Destination airport ID: Unique OpenFlights identifier for destination airport (see Airport).
Codeshare: "Y" if this flight is a codeshare (that is, not operated by Airline, but another carrier), empty otherwise.
Stops: Number of stops on this flight ("0" for direct).
Equipment: 3-letter codes for plane type(s) generally used on this flight, separated by spaces.
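
Several Team F questions reduce to filtering and counting on the airport and route fields. The Python sketch below assumes the files are comma-separated with quoted strings and that fields appear in the order listed in the dataset description; both are assumptions to check against the actual files. The rows and codes are invented, and trailing airport fields are omitted for brevity.

```python
import csv
import io
from collections import Counter

# Made-up airport rows: Airport ID, Name, City, Country, IATA/FAA, ICAO (remaining fields omitted).
AIRPORTS = '''1,"Airport One","City X","India","AAA","VAAA"
2,"Airport Two","City Y","India","BBB","VBBB"
3,"Airport Three","City Z","Canada","CCC","CCCC"
'''

# Made-up route rows: Airline, Airline ID, Source, Source ID, Destination,
# Destination ID, Codeshare, Stops, Equipment.
ROUTES = '''XX,10,AAA,1,BBB,2,,0,320
YY,11,AAA,1,CCC,3,Y,1,738
'''

def airports_in(country, text):
    """Names of airports whose Country field equals the given country."""
    return [row[1] for row in csv.reader(io.StringIO(text)) if row[3] == country]

def country_with_most_airports(text):
    """(country, count) for the country or territory with the most airports."""
    counts = Counter(row[3] for row in csv.reader(io.StringIO(text)))
    return counts.most_common(1)[0]

def nonstop_airlines(text):
    """Airline codes that have at least one route with zero stops."""
    return sorted({row[0] for row in csv.reader(io.StringIO(text)) if row[7] == "0"})

def codeshare_airlines(text):
    """Airline codes that have at least one codeshare route."""
    return sorted({row[0] for row in csv.reader(io.StringIO(text)) if row[6] == "Y"})
```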

Team G
Vasanth Nair, Swarali Chaudhari, Shweta Tiwari, Shikha Saxena

Introduction
This document will tell you how to analyse the NFL
dataset and generate the optimised output for the
same.

Analysing the Dataset


The steps applied for analysing the dataset are listed below:
1. Point to the folder containing the input data.
The input file containing the dataset, NFL_SocialMedia_sample_data1.csv, is divided into the following columns:
ContentId, tstamp, profilelink, screenmane, timezone
The NFL_SocialMedia_sample_data1.csv file is present in the LMS.
2. After the input data has been fed, read the machine log and separate out the log and time columns.
3. Convert the machine log into a text corpus.
4. Convert to lower case.
5. Remove the stopwords.
6. Remove punctuation.
7. Remove numbers.
8. Eliminate the white spaces.
9. Create a DTM (Document Term Matrix).
10. Determine the term frequency and tf-idf.
11. Use the K-means package to do document clustering.
12. Normalize the vectors so that Euclidean distance makes sense.
13. Cluster the data into 10 clusters.
14. Point to the folder containing the interim data, i.e. Cluster_Out.csv.
The Cluster_Out.csv file will look like this:
Dataset for this project:
https://www.dropbox.com/s/ykgr2yh67b47rs0/edureka-nfl-dataset-.zip?dl=0

15. Find the top 5 words discussed in each of the 10 clusters. Write these cluster-wise top words into the TopWords.csv file and generate a word cloud for each cluster.
The output file containing the cluster-wise top words will look as follows:
The output file Topwords.csv is present in the LMS for reference.
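
The text-cleaning, document-term-matrix, tf-idf and normalisation steps listed above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the stopword list is a tiny invented one, the tf-idf weighting used is raw count times log(N / document frequency), and the k-means clustering step itself is not included.

```python
import math
import string
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real run would use a fuller list).
STOPWORDS = {"the", "a", "an", "is", "in", "at", "of", "and", "to"}

def clean(text):
    """Lower-case, strip punctuation and digits, drop stopwords, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation + string.digits))
    return [word for word in text.split() if word not in STOPWORDS]

def tfidf_matrix(docs):
    """Return (vocabulary, rows); each row is an L2-normalised tf-idf vector."""
    tokenised = [clean(doc) for doc in docs]
    vocab = sorted({word for tokens in tokenised for word in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for tokens in tokenised for word in set(tokens))
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        row = [counts[word] * math.log(n_docs / doc_freq[word]) for word in vocab]
        norm = math.sqrt(sum(x * x for x in row))
        # Unit-length rows make Euclidean distance comparable across documents.
        rows.append([x / norm for x in row] if norm else row)
    return vocab, rows

docs = ["The Broncos WIN at home!!", "Broncos lose 3 games", "Great win"]
vocab, rows = tfidf_matrix(docs)
```

The normalised rows can then be passed to any k-means implementation to form the 10 clusters.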

Team H
Prathyusha Kota Saritha Buchireddy Raghureddy
Laxmi Madhu Kumar Brahmandam Venkesh
Ethiraj

Call Details Record


You are given a CDR (Call Details Record) file and need to find the top 10 customers facing frequent call drops while roaming. This is a very important report which telecom companies use to prevent customer churn, by calling these customers back and at the same time contacting their roaming partners to fix the connectivity issues in specific areas. Use the link below to download the CDR.csv file.
https://edureka.wistia.com/medias/799xfn376r/download?media_file_id=73236251
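
A minimal Python sketch of this report is shown below. The layout of CDR.csv is not described here, so the field names (`customer_id`, `roaming`, `status`) and the "DROPPED" status value are assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical CDR rows; field names and values are invented for illustration.
records = [
    {"customer_id": "C1", "roaming": True, "status": "DROPPED"},
    {"customer_id": "C1", "roaming": True, "status": "DROPPED"},
    {"customer_id": "C2", "roaming": True, "status": "DROPPED"},
    {"customer_id": "C2", "roaming": False, "status": "DROPPED"},
    {"customer_id": "C3", "roaming": True, "status": "COMPLETED"},
]

def top_roaming_drops(records, n=10):
    """Return up to n (customer_id, drop_count) pairs, highest drop count first.

    Only calls that were both made while roaming and dropped are counted.
    """
    drops = Counter(r["customer_id"] for r in records
                    if r["roaming"] and r["status"] == "DROPPED")
    return drops.most_common(n)
```

Here `top_roaming_drops(records)` counts only C1 and C2, since C3 has no dropped roaming calls and C2's non-roaming drop is ignored.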
