ITU Big Data Projects Summer15

This document lists the Big Data projects for students of the Summer 2015 class. Topics include R, SAS, Hadoop, Spark, Storm, and Solr.

Uploaded by

Bhairav Mehta


Team A

Harinath Bokka, Sonali Wani, Zoheb Syed, Jyotsna Paintal, Radhika Aitha

Problem Statement
1. Find the list of people with a particular grade who have taken a loan.
2. Find the list of people whose interest is more than a certain value, such as 1000.
3. Find the list of people whose loan amount is more than a certain value.
4. Find which grade of users (A-G) was given the maximum number of loans.
5. Find the highest loan amount given in that year, along with the employee ID and the employee's annual income.
6. Get the total number of loans whose loan status is Late, with the loan ID and loan amount of each.
7. Find the average loan interest rate for the 60-month term and for the 36-month term.

Dataset
https://edureka.wistia.com/medias/cpj3ljetym/download?media_file_id=64495520

Dataset Description
https://edureka.wistia.com/medias/410pi4dlfe/download?media_file_id=64495579
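
A minimal Python sketch of a few of the Team A queries follows. The actual column names are only given in the linked dataset description, so every field name used here (loan_id, member_id, grade, loan_amnt, int_rate, term, loan_status, annual_inc) and every sample record is a hypothetical stand-in, not the real schema.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample records; field names are assumptions, not the real schema.
loans = [
    {"loan_id": 1, "member_id": "m1", "grade": "A", "loan_amnt": 5000,
     "int_rate": 7.5, "term": 36, "loan_status": "Current", "annual_inc": 60000},
    {"loan_id": 2, "member_id": "m2", "grade": "B", "loan_amnt": 12000,
     "int_rate": 11.0, "term": 60, "loan_status": "Late", "annual_inc": 45000},
    {"loan_id": 3, "member_id": "m3", "grade": "B", "loan_amnt": 20000,
     "int_rate": 13.0, "term": 60, "loan_status": "Current", "annual_inc": 90000},
    {"loan_id": 4, "member_id": "m4", "grade": "C", "loan_amnt": 8000,
     "int_rate": 9.5, "term": 36, "loan_status": "Late", "annual_inc": 30000},
]

# People with a particular grade who have taken a loan.
grade_b_members = [l["member_id"] for l in loans if l["grade"] == "B"]

# Grade that received the largest number of loans.
top_grade, top_grade_count = Counter(l["grade"] for l in loans).most_common(1)[0]

# Highest loan amount, with the borrower id and annual income.
largest = max(loans, key=lambda l: l["loan_amnt"])

# Loans whose status is Late, as (loan id, loan amount) pairs.
late_loans = [(l["loan_id"], l["loan_amnt"]) for l in loans if l["loan_status"] == "Late"]

# Average interest rate per term (36 and 60 months).
avg_rate_by_term = {
    term: mean(l["int_rate"] for l in loans if l["term"] == term)
    for term in (36, 60)
}

print(grade_b_members, top_grade, largest["member_id"], len(late_loans), avg_rate_by_term)
```

The same filters and aggregations map directly onto Hive, Pig, or Spark queries once the real column names are known.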

Team B
Varsha Tomar, Rupesh, Nallapaneni Swetha, Sheny, Sandeep

Problem Statement
1. Find the display name and no. of posts created by the user who has the maximum reputation.
2. Find the average age of users on the Stack Overflow site.
3. Find the display name of the user who posted the oldest post on Stack Overflow (in terms of date).
4. Find the display name and no. of comments made by the user who has the maximum reputation.
5. Find the display name of the user who has created the maximum no. of posts on Stack Overflow.
6. Find the owner name and ID of the user whose post has the maximum view count so far.
7. Find the title and owner name of the post which has the maximum comment count.
8. Find the location which has the maximum no. of Stack Overflow users.
9. Find the total no. of answers, posts, and comments created by Indian users.

Dataset
https://edureka.wistia.com/medias/d06fdpiiec/download?media_file_id=64431552

Dataset Description
https://edureka.wistia.com/medias/btr6i3e0p5/download?media_file_id=81524799
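
As a rough illustration, some of the Team B questions can be sketched in Python as joins between a users table and a posts table. The real layout is described only in the linked dataset description, so the field names (id, display_name, reputation, age, location, owner_id, view_count, created) and the sample rows below are hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample rows; field names are assumptions, not the real schema.
users = [
    {"id": 1, "display_name": "alice", "reputation": 1500, "age": 30, "location": "India"},
    {"id": 2, "display_name": "bob", "reputation": 900, "age": 40, "location": "India"},
    {"id": 3, "display_name": "carol", "reputation": 200, "age": 26, "location": "Canada"},
]
posts = [
    {"id": 10, "owner_id": 1, "title": "Q1", "view_count": 50, "created": "2009-01-05"},
    {"id": 11, "owner_id": 2, "title": "Q2", "view_count": 400, "created": "2008-08-01"},
    {"id": 12, "owner_id": 1, "title": "Q3", "view_count": 75, "created": "2010-03-09"},
]

name_by_id = {u["id"]: u["display_name"] for u in users}
posts_per_user = Counter(p["owner_id"] for p in posts)

# Display name and post count of the user with the maximum reputation.
top_user = max(users, key=lambda u: u["reputation"])
top_user_posts = posts_per_user[top_user["id"]]

# Average age of users.
average_age = mean(u["age"] for u in users)

# Owner of the oldest post (ISO dates sort chronologically as strings).
oldest_post_owner = name_by_id[min(posts, key=lambda p: p["created"])["owner_id"]]

# Owner of the most viewed post, and the location with the most users.
most_viewed_owner = name_by_id[max(posts, key=lambda p: p["view_count"])["owner_id"]]
top_location = Counter(u["location"] for u in users).most_common(1)[0][0]

print(top_user["display_name"], top_user_posts, average_age, oldest_post_owner, top_location)
```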

Team C
Vidya Rani Gidiginjala Manne, Remya Nekkuth
Melath, Simmy Payyappilly, Varghese, Monali Modi,
Ajuba Benazir Riyaz

Problem Statement
1. Find the number of movies released between
1950 and 1960.
2. Find the number of movies having rating more
than 4.
3. Find the movies whose ratings are between 3 and
4.
4. Find the number of movies with duration more
than 2 hours (7200 seconds).
5. Find the list of years and number of movies
released each year.
6. Find the total number of movies in the dataset.

Dataset
https://edureka.wistia.com/medias/7qd5lgmko4

Dataset Description

Column1: Movie ID
Column2: Movie name
Column3: Year of release
Column4: Rating of the movie
Column5: Movie duration in seconds
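
The six questions above can be sketched in Python against the five columns just described. The sample rows are hypothetical, and treating "between" as inclusive of both endpoints is an assumption.

```python
from collections import Counter, namedtuple

# Columns follow the dataset description: id, name, year, rating, duration (seconds).
Movie = namedtuple("Movie", "movie_id name year rating duration")

# Hypothetical sample rows for illustration only.
movies = [
    Movie(1, "Movie A", 1955, 4.2, 7500),
    Movie(2, "Movie B", 1958, 3.5, 6000),
    Movie(3, "Movie C", 1999, 4.6, 8100),
    Movie(4, "Movie D", 1999, 2.9, 5400),
]

released_1950_1960 = sum(1 for m in movies if 1950 <= m.year <= 1960)
rated_above_4 = sum(1 for m in movies if m.rating > 4)
rated_3_to_4 = [m.name for m in movies if 3 <= m.rating <= 4]
longer_than_2_hours = sum(1 for m in movies if m.duration > 7200)
movies_per_year = Counter(m.year for m in movies)
total_movies = len(movies)

print(released_1950_1960, rated_above_4, rated_3_to_4,
      longer_than_2_hours, sorted(movies_per_year.items()), total_movies)
```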

Team D
Xincheng Tang, Xiaoran An, Yelin Lu, Jingran Xu

Problem Statement
1. Find out the top 5 categories with maximum
number of videos uploaded.
2. Find out the top 10 rated videos.
3. Find out the most viewed videos.

Dataset
https://edureka.wistia.com/medias/6cchxi6to4

Dataset Description
Column1: Video id of 11 characters.
Column2: Uploader of the video, of string data type.
Column3: Interval between the day of establishment of YouTube and the date of uploading of the video, of integer data type.
Column4: Category of the video, of string data type.
Column5: Length of the video, of integer data type.
Column6: Number of views for the video, of integer data type.
Column7: Rating on the video, of float data type.
Column8: Number of ratings given on the video.
Column9: Number of comments on the video, of integer data type.
Column10: Related video ids for the uploaded video.
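
A short Python sketch of the three Team D questions, using only the category, views, and rating columns from the description above. The sample rows and the Video record type are hypothetical.

```python
from collections import Counter, namedtuple

# Only the columns needed for the three questions are modelled here.
Video = namedtuple("Video", "video_id category views rating")

# Hypothetical sample rows for illustration only.
videos = [
    Video("aaaaaaaaaa1", "Music", 1000, 4.5),
    Video("aaaaaaaaaa2", "Comedy", 5000, 4.9),
    Video("aaaaaaaaaa3", "Music", 300, 3.0),
    Video("aaaaaaaaaa4", "Sports", 800, 4.7),
]

# Top 5 categories by number of uploaded videos.
top_categories = Counter(v.category for v in videos).most_common(5)

# Top 10 videos by rating, highest first.
top_rated = sorted(videos, key=lambda v: v.rating, reverse=True)[:10]

# Most viewed video.
most_viewed = max(videos, key=lambda v: v.views)

print(top_categories, [v.video_id for v in top_rated], most_viewed.video_id)
```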

Team E
Minghao (Murphy) Zhai, Jaime Shien, Yuanqi (Linda) Zhou

Problem Statement
1. Count the number of countries in each landmass.
2. Find the top 5 countries by the sum of bars and stripes in the flag.
3. Count the countries with an icon in the flag.
4. Count the countries which have the same top left and top right colour in the flag.
5. Count the number of countries in each zone.
6. Find the largest country in terms of area in the NE zone.
7. Find the least populated country in the S.America landmass.
8. Find the most widely spoken language among all countries.
9. Find the most common colour among the flags of all countries.
10. Find the sum of all circles present in all country flags.
11. Count the countries which have both an icon and text in the flag.

Dataset
http://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data

Dataset Description
http://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.names
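
A hedged Python sketch of several of the Team E counts follows. The column order in COLUMNS is an assumption that should be checked against flag.names before use, and the three rows are invented placeholders rather than real records. The corner-colour check compares the topleft and botright columns, since those are the two corner columns assumed here.

```python
import csv
from collections import Counter

# Assumed column order; verify against flag.names before relying on it.
COLUMNS = [
    "name", "landmass", "zone", "area", "population", "language", "religion",
    "bars", "stripes", "colours", "red", "green", "blue", "gold", "white",
    "black", "orange", "mainhue", "circles", "crosses", "saltires", "quarters",
    "sunstars", "crescent", "triangle", "icon", "animate", "text",
    "topleft", "botright",
]

# Invented placeholder rows in the comma-separated layout assumed above.
SAMPLE = [
    "Alphaland,5,1,648,16,10,2,0,3,5,1,1,0,1,1,1,0,green,0,0,0,0,1,0,0,1,0,0,black,green",
    "Betaland,3,1,29,3,6,6,0,0,3,1,0,0,1,0,1,0,red,0,0,0,0,1,0,0,0,1,0,red,red",
    "Gammaland,5,1,100,2,8,2,2,2,3,1,1,0,0,1,0,0,green,1,0,0,0,0,0,0,1,0,1,green,white",
]

rows = list(csv.DictReader(SAMPLE, fieldnames=COLUMNS))

countries_per_landmass = Counter(r["landmass"] for r in rows)
top_bars_stripes = sorted(
    rows, key=lambda r: int(r["bars"]) + int(r["stripes"]), reverse=True
)[:5]
with_icon = sum(1 for r in rows if r["icon"] == "1")
same_corner_colour = sum(1 for r in rows if r["topleft"] == r["botright"])
most_common_hue = Counter(r["mainhue"] for r in rows).most_common(1)[0][0]
total_circles = sum(int(r["circles"]) for r in rows)
icon_and_text = sum(1 for r in rows if r["icon"] == "1" and r["text"] == "1")

print(dict(countries_per_landmass), top_bars_stripes[0]["name"], with_icon,
      same_corner_colour, most_common_hue, total_circles, icon_and_text)
```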

Team F
Abinas Roy, Anushree Randad, Lorena Arague, Vivek
Narang

Problem Statement
1. Find the list of airports operating in the country India.
2. Find the list of airlines having zero stops.
3. Find the list of airlines operating with code share.
4. Find which country (or territory) has the highest number of airports.
5. Find the list of active airlines in the United States.

Dataset
https://edureka.wistia.com/medias/67vuzsza8j/download?media_file_id=66596539

Dataset Description
In this use case there are 3 datasets: Final_airlines, routes.dat, and airports_mod.dat.

Airports dataset, i.e. airports_mod.dat
It contains the following fields:
Airport ID: Unique OpenFlights identifier for this airport.
Name: Name of airport. May or may not contain the City name.
City: Main city served by airport. May be spelled differently from Name.
Country: Country or territory where airport is located.
IATA/FAA: 3-letter FAA code, for airports located in Country "United States of America". 3-letter IATA code, for all other airports. Blank if not assigned.
ICAO: 4-letter ICAO code. Blank if not assigned.
Latitude: Decimal degrees, usually to six significant digits. Negative is South, positive is North.
Longitude: Decimal degrees, usually to six significant digits. Negative is West, positive is East.
Altitude: In feet.
Timezone: Hours offset from UTC. Fractional hours are expressed as decimals, eg. India is 5.5.
DST: Daylight savings time. One of E (Europe), A (US/Canada), S (South America), O (Australia), Z (New Zealand), N (None) or U (Unknown). See also: Help: Time
Tz database time zone: Timezone in "tz" (Olson) format, eg. "America/Los_Angeles".

Airlines dataset
It contains the following fields:
Airline ID: Unique OpenFlights identifier for this airline.
Name: Name of the airline.
Alias: Alias of the airline. For example, All Nippon Airways is commonly known as "ANA".
IATA: 2-letter IATA code, if available.
ICAO: 3-letter ICAO code, if available.
Callsign: Airline callsign.
Country: Country or territory where airline is incorporated.
Active: "Y" if the airline is or has until recently been operational, "N" if it is defunct. This field is not reliable: in particular, major airlines that stopped flying long ago, but have not had their IATA code reassigned (eg. Ansett/AN), will incorrectly show as "Y".

Routes dataset, i.e. routes.dat
It contains the following fields:
Airline: 2-letter (IATA) or 3-letter (ICAO) code of the airline.
Airline ID: Unique OpenFlights identifier for airline (see Airline).
Source airport: 3-letter (IATA) or 4-letter (ICAO) code of the source airport.
Source airport ID: Unique OpenFlights identifier for source airport (see Airport).
Destination airport: 3-letter (IATA) or 4-letter (ICAO) code of the destination airport.
Destination airport ID: Unique OpenFlights identifier for destination airport (see Airport).
Codeshare: "Y" if this flight is a codeshare (that is, not operated by Airline, but another carrier), empty otherwise.
Stops: Number of stops on this flight ("0" for direct).
Equipment: 3-letter codes for plane type(s) generally used on this flight, separated by spaces.
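
The five Team F questions can be sketched in Python using the Country, Active, Codeshare, and Stops fields described above. The rows below are hypothetical and keep only the fields each query needs; the dictionary keys are illustrative names, and the exact spelling of country values in the real files should be checked.

```python
from collections import Counter

# Hypothetical sample rows; dictionary keys are illustrative names.
airports = [
    {"name": "Airport One", "city": "Mumbai", "country": "India"},
    {"name": "Airport Two", "city": "Delhi", "country": "India"},
    {"name": "Airport Three", "city": "Boston", "country": "United States"},
]
airlines = [
    {"name": "Airline X", "iata": "XX", "country": "United States", "active": "Y"},
    {"name": "Airline Y", "iata": "YY", "country": "United States", "active": "N"},
    {"name": "Airline Z", "iata": "ZZ", "country": "India", "active": "Y"},
]
routes = [
    {"airline": "XX", "codeshare": "", "stops": "0"},
    {"airline": "YY", "codeshare": "Y", "stops": "1"},
    {"airline": "ZZ", "codeshare": "Y", "stops": "0"},
]

# 1. Airports operating in India.
india_airports = [a["name"] for a in airports if a["country"] == "India"]

# 2. Airlines with at least one zero-stop (direct) route.
zero_stop_airlines = sorted({r["airline"] for r in routes if r["stops"] == "0"})

# 3. Airlines operating with code share.
codeshare_airlines = sorted({r["airline"] for r in routes if r["codeshare"] == "Y"})

# 4. Country or territory with the most airports.
top_country, airport_count = Counter(a["country"] for a in airports).most_common(1)[0]

# 5. Active airlines in the United States.
active_us_airlines = [
    a["name"] for a in airlines
    if a["country"] == "United States" and a["active"] == "Y"
]

print(india_airports, zero_stop_airlines, codeshare_airlines, top_country, active_us_airlines)
```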

Team G
Vasanth Nair, Swarali Chaudhari, Shweta Tiwari, Shikha Saxena

Introduction
This document explains how to analyse the NFL
dataset and generate the optimised output for it.

Analysing the Dataset

The steps for analysing the dataset are listed below. First, point to the folder containing the input data.
The input file containing the dataset, NFL_SocialMedia_sample_data1.csv, is in the following format:

The data is divided into the following columns:

ContentId, tstamp, profilelink, screenmane, timezone

The NFL_SocialMedia_sample_data1.csv file is present in the LMS.
1. After the input data has been fed, read the machine log and separate out the log and time columns.
2. Convert the machine log into a text corpus.
3. Convert to lower case.
4. Remove the stopwords.
5. Remove punctuation.
6. Remove numbers.
7. Eliminate the white spaces.
8. Create a DTM (Document Term Matrix).
9. Determine the term frequency and TF-IDF.
10. Use the K-means package to do document clustering.
11. Normalize the vectors so that Euclidean distance makes sense.
12. Cluster the data into 10 clusters.
13. Point to the folder containing the interim data, i.e. Cluster_Out.csv.
The Cluster_Out.csv will look like this:

Dataset for this project:
https://www.dropbox.com/s/ykgr2yh67b47rs0/edureka-nfl-dataset-.zip?dl=0

15. Find the top 5 words discussed in each of the 10 clusters. Write these cluster-wise top words into the TopWords.csv file and generate the word cloud for each cluster.
The output file containing the cluster-wise top words will look as follows:

The output file Topwords.csv is present in the LMS for reference.
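
The pipeline described above (clean the text, build TF-IDF vectors, normalise them, cluster with k-means, then list the top words per cluster) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the K-means package the handout refers to: the stopword list, the four sample posts, the choice of 2 clusters instead of 10, and the deterministic "first k documents" initialisation are all assumptions made to keep the example small and repeatable.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "the", "by", "for", "with", "and", "of", "to", "in", "is"}

def preprocess(text):
    """Lower-case, strip punctuation and numbers, collapse whitespace, drop stopwords."""
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    return [w for w in cleaned.split() if w not in STOPWORDS]

def tfidf_vectors(docs):
    """Return one L2-normalised TF-IDF vector (a dict) per tokenised document."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        vec = {t: c * math.log(n / doc_freq[t]) for t, c in Counter(doc).items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        vectors.append({t: v / norm for t, v in vec.items()} if norm else vec)
    return vectors

def distance(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in set(a) | set(b)))

def kmeans(vectors, k, iterations=10):
    """Plain k-means; centroids start as the first k vectors (deterministic choice)."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: distance(v, centroids[c])) for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if not members:
                continue  # keep the old centroid if a cluster went empty
            total = {}
            for m in members:
                for term, weight in m.items():
                    total[term] = total.get(term, 0.0) + weight
            centroids[c] = {term: s / len(members) for term, s in total.items()}
    return labels

def top_words(docs, labels, k, n=5):
    """Most frequent n words in each cluster."""
    return {
        c: [w for w, _ in Counter(
            w for doc, lab in zip(docs, labels) if lab == c for w in doc
        ).most_common(n)]
        for c in range(k)
    }

# Hypothetical posts standing in for the text column of the input file.
posts = [
    "Touchdown! Great touchdown by the Patriots",
    "Tickets for the 2015 Super Bowl",
    "Patriots win with a late touchdown",
    "Super Bowl tickets sold out",
]
docs = [preprocess(p) for p in posts]
vectors = tfidf_vectors(docs)
labels = kmeans(vectors, k=2)
top = top_words(docs, labels, k=2)
print(labels, top)
```

Normalising each vector to unit length before clustering means Euclidean distance depends only on which terms two posts share, not on how long the posts are. On the real data, k would be 10 and the per-cluster word lists would be written to TopWords.csv.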

Team H
Prathyusha Kota Saritha Buchireddy Raghureddy
Laxmi Madhu Kumar Brahmandam Venkesh
Ethiraj

Call Details Record

You are given a CDR (Call Details Record) file and
need to find the top 10 customers facing
frequent call drops while roaming. This is a very
important report that telecom companies
use to prevent customer churn, by
calling those customers back and at the same time
contacting their roaming partners to fix
the connectivity issues in specific areas. Use
the link below to download the CDR.csv file.
https://edureka.wistia.com/medias/799xfn376r/download?media_file_id=73236251
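
A minimal Python sketch of the report: count dropped calls made while roaming per customer, then take the 10 highest counts. The layout of CDR.csv is not described in this handout, so the field names (customer, roaming, dropped) and the sample records are hypothetical.

```python
from collections import Counter

# Hypothetical call records; field names are assumptions about CDR.csv.
calls = [
    {"customer": "C1", "roaming": True, "dropped": True},
    {"customer": "C1", "roaming": True, "dropped": True},
    {"customer": "C2", "roaming": True, "dropped": True},
    {"customer": "C2", "roaming": False, "dropped": True},   # not roaming: ignored
    {"customer": "C3", "roaming": True, "dropped": False},   # not dropped: ignored
]

# Count only calls that were both made while roaming and dropped.
roaming_drops = Counter(
    c["customer"] for c in calls if c["roaming"] and c["dropped"]
)

# The 10 customers with the most roaming call drops, highest first.
top_10 = roaming_drops.most_common(10)
print(top_10)
```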
