
George Mason University

SYST 699: Masters Capstone Project


Spring 2014

Project Proposal:
SAP Big Data Analytics on Mobile Usage
Inferring age and gender of a person through his/her phone habits

February 11, 2014

Arturo Buzzalino
Justin Nguyen
Mitul Patel
Tanner Suttles
SAP BIG DATA ANALYTICS

TABLE OF CONTENTS
1 INTRODUCTION
1.1 Background
1.2 Problem Statement
1.3 Scope
1.4 Document Overview
2 PRELIMINARY REQUIREMENTS
3 TECHNICAL APPROACH
3.1 Analysis
3.2 Requirements Development
3.3 Model Development
3.4 Testing and Evaluation
3.5 Delivery
4 EXPECTED RESULTS
5 MANAGEMENT APPROACH
5.1 Project Plan
5.2 Project Risks


1 INTRODUCTION
1.1 Background

SAP Mobile Services is developing a new product, Consumer Insight 365 (CI365). The purpose of this
product is to enhance a business’s ability to expand its market and to provide a tool for meaningful
analysis of consumer patterns. CI365 will analyze large amounts of global mobile carrier data, covering
millions of people across a large number of countries.

The goal of this project is to provide businesses with an additional, powerful means to expand their markets
by focusing their growth efforts on specific regions and demographics. Data visualization and statistical
techniques will be used to determine patterns among socio-demographics, gender, age, URL click-stream
categories, geo-location, and texting / calling habits. Below is a sample of what a CI365 custom report might
look like (Figure 1):

Figure 1: Sample of the type of report a business might be using

The data from the carriers is received in anonymized form. This is important because SAP does not want to
breach the privacy of the consumers. SAP has no way to trace a person’s number to determine who he or
she is or what his or her home address is.

The data provided will include each user’s daily data generated through the handset. Age and gender are
provided when the user has a plan with the carrier. For users who are roaming on the network, or who have
pay-as-you-go or prepaid plans, this data is not available. This is where SAP wants to use Big Data
analytics to infer a user’s gender and age based on his or her phone habits.

1.2 Problem Statement

Focusing on a small carrier’s mobile user data, determine correlations between user gender / age and
texting / calling habits, URL categories, and geo-location. SAP is interested in having the ability to determine
the gender and general age of a mobile user based on his or her phone habits.


1.3 Scope

The overall effort is very large and could encompass many different aspects. For this semester, the project
team will focus on the small carrier’s data detailed in the project description.

Within Scope:
• Constructing a model capable of inferring age and gender
• Model should only consider text / call data, URL traffic categories, and geo-location
  o Text and call data will be broken down further once the project team has access to the data.
    ▪ For example: number of texts in 1 hour, length of call, number of calls in 1 day
  o Geo-location will be considered last, as it is the most difficult.
    ▪ Because a user’s location changes hourly throughout the day, this will require
      further investigation
• Only consider the small carrier’s data provided by SAP
• Test model and conduct sensitivity analysis on the data for which the age and gender is NOT provided

Outside of Scope:
• Data provided outside of the categories mentioned above (e.g. daily tweets, Facebook likes)
• Data not pertaining to the carrier above
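
The text and call features named within scope (number of texts in 1 hour, length of call, number of calls in 1 day) could be derived along the following lines. This is only a sketch: the record layout, the field names, and the sample rows are assumptions made for illustration, since the team does not yet have access to the carrier data.

```python
from collections import Counter
from datetime import datetime
from statistics import mean

# Hypothetical event records: (user id, kind, start time, call duration in seconds).
# The real carrier schema is not known at proposal time; these fields are assumptions.
EVENTS = [
    ("u1", "text", datetime(2014, 2, 3, 9, 5), 0),
    ("u1", "text", datetime(2014, 2, 3, 9, 40), 0),
    ("u1", "text", datetime(2014, 2, 3, 11, 15), 0),
    ("u1", "call", datetime(2014, 2, 3, 12, 0), 120),
    ("u1", "call", datetime(2014, 2, 4, 18, 30), 300),
    ("u1", "call", datetime(2014, 2, 4, 19, 0), 60),
]


def usage_features(events, user_id):
    """Summarise one user's text and call activity into numeric features."""
    texts_per_hour = Counter()
    calls_per_day = Counter()
    call_lengths = []
    for uid, kind, start, duration in events:
        if uid != user_id:
            continue
        if kind == "text":
            # Bucket texts by calendar hour (date plus hour of day).
            texts_per_hour[(start.date(), start.hour)] += 1
        elif kind == "call":
            calls_per_day[start.date()] += 1
            call_lengths.append(duration)
    return {
        "max_texts_in_1_hour": max(texts_per_hour.values(), default=0),
        "mean_call_length_s": mean(call_lengths) if call_lengths else 0.0,
        "max_calls_in_1_day": max(calls_per_day.values(), default=0),
    }


features = usage_features(EVENTS, "u1")
print(features)
```

Bucketing by calendar hour and calendar day is one simple choice; a sliding window would be an alternative once the actual data granularity is known.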

1.4 Document Overview

The remainder of this document details the requirements, technical approach, expected results, and project
management plan. These sections describe how the team plans to solve the problem stated above within
the scope designated for the semester.


2 PRELIMINARY REQUIREMENTS
Requirement 1: The team shall utilize data provided by the mobile carrier
The data sent by the mobile carrier includes metadata about users’ texting and calling habits, points
of interest frequented (geo-location), and URLs visited.

Requirement 2: The team shall develop methods to identify patterns in cell phone usage by gender
The methods used will be developed based on statistical and data-mining principles and techniques. These
methods shall consistently identify patterns in cell phone usage by gender.

Requirement 3: The team shall develop methods to identify patterns in cell phone usage by age
group
The methods used will be developed based on statistical and data-mining principles and techniques. These
methods shall consistently identify patterns in cell phone usage by age group.

Requirement 4: The team shall develop a model for classifying a subscriber’s gender
The team will develop a model within SAP HANA for classifying the gender of a subscriber.

Requirement 5: The model shall predict the gender of an anonymized user as male or female
The model will be developed based on patterns identified to predict the gender of a user.

Requirement 6: The model shall predict the age group of an anonymized user
The model will be developed based on the patterns identified to predict the age group of a user.

Requirement 7: The model shall provide accuracy for each classification
The model will report the accuracy of its classification result for each subscriber.
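
One way to read Requirement 7 is as a per-subscriber confidence score attached to each predicted label. The sketch below illustrates that idea with a small naive Bayes classifier over a single categorical feature. Naive Bayes is used purely for illustration, as the proposal does not commit to an algorithm, and the training rows, category names, and age-group labels are invented.

```python
from collections import Counter, defaultdict

# Hypothetical labelled rows: (most-visited URL category, age group). Invented data.
TRAIN = [
    ("games", "under_30"), ("games", "under_30"), ("news", "under_30"),
    ("travel", "30_plus"), ("travel", "30_plus"), ("news", "30_plus"),
]


def fit(rows):
    """Count class frequencies and per-class category frequencies."""
    class_counts = Counter(label for _, label in rows)
    feature_counts = defaultdict(Counter)
    for category, label in rows:
        feature_counts[label][category] += 1
    vocabulary = {category for category, _ in rows}
    return class_counts, feature_counts, vocabulary


def classify(model, category):
    """Return (predicted label, posterior probability of that label)."""
    class_counts, feature_counts, vocabulary = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        prior = n / total
        # Laplace (add-one) smoothing keeps unseen categories at non-zero probability.
        likelihood = (feature_counts[label][category] + 1) / (n + len(vocabulary))
        scores[label] = prior * likelihood
    evidence = sum(scores.values())
    best = max(scores, key=scores.get)
    return best, scores[best] / evidence


model = fit(TRAIN)
label, confidence = classify(model, "games")
print(label, round(confidence, 2))
```

A category seen equally often in both groups ("news" in this toy data) yields a posterior of 0.5, which signals that the prediction carries no real information for that subscriber.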


3 TECHNICAL APPROACH
3.1 Analysis

Several factors need to be considered before big data analytics can be applied to the project. The team will
research phone usage behaviors, gender and cultural patterns, and the socio-demographics of phone traffic
data. The pros and cons of the candidate software packages need to be weighed to decide which is most
appropriate for analyzing and interpreting the data. The team will also research data clustering and data
mining techniques in order to determine how the model should be developed. All of these factors have to be
analyzed before developing a method for inferring gender and age from the characteristics of the data given.

3.2 Requirements Development

The model requirements will be broken down into categories such as function, system, input/output,
operations, and interface. The preliminary requirements will be developed further once the team has access
to the phone data and can form a better understanding of the factors involved. The choice of the
metrics/rules and statistical methods by which the results will be inferred will also shape the model’s
requirements. The requirements will guide the model development phase and will be verified in the testing
and evaluation phase.

3.3 Model Development

The first step in model development will be the selection of features as inputs to the model. The data will be
analyzed to look for features that distinguish users by their gender. Distributions of website visits and of
numeric usage metrics will be plotted to reveal patterns. Further model development will follow an agile
approach in which the model is successively refined through short iterations of design and testing.
Development iterations are expected to be based on modification of the feature set, algorithm selection,
and algorithm parameters.
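
As a minimal sketch of the feature-selection step described above, the following compares the share of visits per URL category between two labelled groups and ranks the categories by the size of the gap. The rows and category names are invented for illustration and imply nothing about real usage patterns; the function names are hypothetical.

```python
from collections import Counter

# Hypothetical labelled click-stream rows: (gender, URL category). Invented data.
VISITS = [
    ("male", "games"), ("male", "games"), ("male", "news"), ("male", "travel"),
    ("female", "travel"), ("female", "travel"), ("female", "news"), ("female", "games"),
]


def category_shares(visits):
    """Return {gender: {category: share of that gender's visits}}."""
    totals = Counter(gender for gender, _ in visits)
    counts = Counter(visits)
    shares = {gender: {} for gender in totals}
    for (gender, category), n in counts.items():
        shares[gender][category] = n / totals[gender]
    return shares


def rank_categories(shares, group_a="male", group_b="female"):
    """Rank categories by the absolute gap in visit share between two groups."""
    categories = set(shares[group_a]) | set(shares[group_b])
    gaps = {
        c: abs(shares[group_a].get(c, 0.0) - shares[group_b].get(c, 0.0))
        for c in categories
    }
    # Largest gap first; ties are broken alphabetically for a stable order.
    return sorted(gaps.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_categories(category_shares(VISITS))
for category, gap in ranking:
    print(f"{category}: {gap:.2f}")
```

Categories with a gap near zero are weak candidates for the feature set, while those with large gaps are worth plotting and examining first.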

The main software that the team plans to use to analyze the data is SAP HANA’s Predictive Analysis Library
(PAL). This tool should be sufficient to build a model that infers the gender and age of the phone user. A
major obstacle that the rule set engine of the model must overcome is recognizing the fundamental
difference between pre-paid and full plan holders: the model will be built on full plan holders, whose age and
gender are known, but applied to pre-paid users. If the model does not adjust for differences between these
populations, the results may be skewed. Another limitation is that the model is constrained to one market
rather than a universal setting, due to the limited input data.
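
A simple first check for the population difference mentioned above is to compare the distribution of each feature between full plan holders and pre-paid users. The sketch below computes a standardized mean difference for one feature. The sample values and the 0.1 cut-off are illustrative assumptions, not figures from the carrier data or from SAP.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical daily text counts for the two populations. Invented data.
CONTRACT = [10, 12, 9, 11, 13]
PREPAID = [20, 22, 19, 21, 23]


def standardized_mean_difference(a, b):
    """Difference in means (b minus a) divided by the pooled standard deviation."""
    pooled_sd = sqrt((variance(a) + variance(b)) / 2)
    if pooled_sd == 0:
        return 0.0 if mean(a) == mean(b) else float("inf")
    return (mean(b) - mean(a)) / pooled_sd


smd = standardized_mean_difference(CONTRACT, PREPAID)
# The 0.1 cut-off is an illustrative assumption, not a value from the proposal.
verdict = "populations differ" if abs(smd) > 0.1 else "populations look similar"
print(f"SMD = {smd:.2f} -> {verdict}")
```

Features with a large difference would need extra care, for example reweighting or exclusion, before a model trained on full plan holders is applied to pre-paid users.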

3.4 Testing and Evaluation

The team will evaluate whether the model’s gender and age inferences are accurate. Two forms of testing
will be performed: accuracy testing and sensitivity analysis. Accuracy testing will be conducted on a holdout
sample of data for which the gender and age are already known. The team will input that set of data into the
model without exposing the age and gender, and verify that the model consistently outputs accurate results.
The sensitivity analysis will vary the features input to the model and observe how the output results change.
This will reduce the amount of uncertainty in the model and increase the understanding of the relationships
between the input variables and the output result. The testing stage will also ensure that all the requirements
are met by the data model.
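
The two forms of testing can be sketched as follows. This is an illustration under stated assumptions, not the team's model: the labelled records are invented, "group_a" and "group_b" stand in for any label (a gender or an age group), the classifier is a toy one-feature threshold rule, and the split is a simple deterministic one where a real evaluation would randomise or stratify.

```python
from statistics import mean

# Hypothetical labelled records: (features, known label). Invented data.
LABELLED = [
    ({"texts_per_day": 40}, "group_a"),
    ({"texts_per_day": 10}, "group_b"),
    ({"texts_per_day": 15}, "group_b"),
    ({"texts_per_day": 35}, "group_a"),
    ({"texts_per_day": 50}, "group_a"),
    ({"texts_per_day": 5}, "group_b"),
    ({"texts_per_day": 12}, "group_b"),
    ({"texts_per_day": 18}, "group_a"),
]


def fit_threshold(train, feature="texts_per_day"):
    """Fit a toy one-feature rule: split at the midpoint between the class means."""
    mean_a = mean(f[feature] for f, label in train if label == "group_a")
    mean_b = mean(f[feature] for f, label in train if label == "group_b")
    threshold = (mean_a + mean_b) / 2
    high, low = ("group_a", "group_b") if mean_a >= mean_b else ("group_b", "group_a")

    def predict(features):
        return high if features[feature] >= threshold else low

    return predict


def accuracy(records, predictor):
    """Fraction of records whose predicted label matches the known label."""
    correct = sum(1 for features, label in records if predictor(features) == label)
    return correct / len(records)


def sensitivity(records, predictor, feature, delta):
    """Fraction of predictions that change when one feature is shifted by delta."""
    changed = 0
    for features, _ in records:
        perturbed = {**features, feature: features[feature] + delta}
        if predictor(perturbed) != predictor(features):
            changed += 1
    return changed / len(records)


# Even-indexed records train the rule; odd-indexed records are held out.
train, holdout = LABELLED[::2], LABELLED[1::2]
model = fit_threshold(train)
holdout_accuracy = accuracy(holdout, model)
shift_effect = sensitivity(holdout, model, "texts_per_day", 15)
print(f"holdout accuracy: {holdout_accuracy:.2f}")
print(f"predictions changed by a +15 shift: {shift_effect:.2f}")
```

The holdout labels are used only for scoring, never for fitting, which mirrors the requirement that the model be run without looking at the known age and gender.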

3.5 Delivery

The final results will be presented in summary form rather than as individual data results. The summary will
contain information on age ranges and gender with regard to multiple factors such as location, popular
interests, or URL categories. The results will be organized in a manner that best satisfies SAP’s requirements
and expected deliverables. The results will be given in both presentation and paper format to the customers,
SAP and George Mason University.


4 EXPECTED RESULTS
The project will yield three major deliverables:

• Methods developed to identify patterns in mobile usage
  o The team will submit a detailed narrative on what methods were used, how they were
    chosen, and how they were applied to identify the mobile usage patterns.
• Model to predict age group and gender of user
  o The team will submit a model that will infer the age group and gender of a mobile user with
    proven confidence. A detailed narrative about how the analytic methods were used to
    develop the model, and the model’s inputs and outputs, will be provided.
• Sensitivity analysis of model
  o The team will perform a sensitivity analysis of the model to aid in its validation. A narrative of
    how the analysis was performed and its results will be submitted.

5 MANAGEMENT APPROACH

5.1 Project Plan

The project has been divided into four task areas: project management, research, model development, and
final deliverables. Figure 2 shows the timeline, with the research phase in blue, model development in
green, and final deliverables in yellow. The project management task covers initial project activities such as
kickoff and problem definition, along with progress presentations throughout the project. Because big data
tools and concepts are new to the team, a period of research is needed in which the team will learn about
big data analytics and current cell phone usage research. The majority of the project will be spent on data
analysis and model development. Near the end of the semester the team will turn toward the final
deliverables, and time spent on model development will be shared with drafting the presentation. Most of
the time for the final deliverables will be spent on the final presentation. Work on the final presentation
starts earlier in the project with professor reviews, and will help organize the information for the final report.
At the end of the project the final report and presentation will be delivered and the project closed out.

Figure 2: Project timeline


An abbreviated work breakdown structure (WBS) has been included below in Table 1. The abbreviated
version shows major tasks and milestones in the project schedule. The full WBS is included in Appendix A.

Table 1: Project work breakdown structure

Outline Number  Task Name                        Duration  Start  Finish
1               Project Management               44 days   1/23   3/25
2               Research                         14 days   1/23   2/11
2.1             Mobile Phone Use Demographics    14 days   1/23   2/11
2.2             Big Data Tools                   14 days   1/23   2/11
3               Model Development                64 days   1/23   4/22
3.1             Get Access To SAP                14 days   1/23   2/11
3.2             Get Data                         14 days   1/23   2/11
3.3             Determine Approach               10 days   2/12   2/25
3.4             Analyze Data                     40 days   2/12   4/8
3.5             Develop Model                    40 days   2/26   4/22
3.6             Sensitivity Analysis             10 days   3/21   4/3
4               Website                          6 days    4/28   5/5
5               Final Report                     14 days   4/16   5/5
6               Final Presentation               29 days   4/1    5/9
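
The durations in Table 1 are consistent with counting working days (Monday to Friday) from the start date to the finish date, both inclusive, without excluding holidays. That reading is an inference from the numbers rather than something the schedule states. A short sketch to check it against a few rows, assuming all dates fall in 2014:

```python
from datetime import date, timedelta


def working_days(start, finish):
    """Count Monday-to-Friday days from start to finish, both inclusive."""
    days = 0
    current = start
    while current <= finish:
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            days += 1
        current += timedelta(days=1)
    return days


# (task, start, finish, duration stated in Table 1); the year 2014 is assumed.
TASKS = [
    ("Project Management", date(2014, 1, 23), date(2014, 3, 25), 44),
    ("Research", date(2014, 1, 23), date(2014, 2, 11), 14),
    ("Model Development", date(2014, 1, 23), date(2014, 4, 22), 64),
    ("Final Presentation", date(2014, 4, 1), date(2014, 5, 9), 29),
]

for name, start, finish, stated in TASKS:
    print(f"{name}: computed {working_days(start, finish)}, stated {stated}")
```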

5.2 Project Risks

Risk 1: Data Delivery

Because the data is being shipped, there is a risk that it will arrive too late in the semester for the team to
develop a fully functional model. If the team does not have access to the data by February 14th, the scope of
the model may have to be adjusted.

Risk 2: Data Access

Depending on how the data is installed, there is a chance that it will only be accessible at the SAP
Reston office. The team would need to be escorted while in the office, and because of the project team’s
schedule limitations there would be limited availability to work on the data analysis and the model.

Risk 3: Big Data Expertise

The team does not have experience with SAP HANA or its PAL analytics capabilities, and has had little
exposure to data mining techniques. To mitigate this risk, the team is consulting with professors at George
Mason University and subject matter experts at SAP.

Appendix A: WBS


Outline Number  Task Name                                     Duration  Start  Finish
1               Project Management                            44 days   1/23   3/25
1.1             Project Kickoff                               1 day     1/23   1/23
1.2             Create Presentation                           3 days    1/23   1/27
1.3             Preliminary Project Description Presentation  0 days    1/28   1/28
1.4             Create Presentation                           3 days    1/29   1/31
1.5             Problem Definition and Scope Presentation     0 days    2/4    2/4
1.6             Draft Project Proposal                        3 days    2/4    2/6
1.7             Project Proposal                              1 day     2/10   2/10
1.8             Create Progress Presentation 1                5 days    2/25   3/3
1.9             Progress Presentation 1                       1 day     3/4    3/4
1.10            Create Progress Presentation 2                5 days    3/19   3/25
1.11            Progress Presentation 2                       1 day     3/25   3/25
1.12            Spring Break                                  5 days    3/10   3/14
2               Research                                      14 days   1/23   2/11
2.1             Mobile Phone Use Demographics                 14 days   1/23   2/11
2.2             Big Data Tools                                14 days   1/23   2/11
3               Model Development                             64 days   1/23   4/22
3.1             Get Access To SAP                             14 days   1/23   2/11
3.2             Get Data                                      14 days   1/23   2/11
3.3             Determine Approach                            10 days   2/12   2/25
3.4             Analyze Data                                  40 days   2/12   4/8
3.5             Develop Model                                 40 days   2/26   4/22
3.6             Sensitivity Analysis                          10 days   3/21   4/3
4               Website                                       6 days    4/28   5/5
4.1             Create Website                                5 days    4/28   5/2
4.2             Website Due                                   1 day     5/5    5/5
5               Final Report                                  14 days   4/16   5/5
5.1             Draft                                         10 days   4/16   4/29
5.2             Review                                        2 days    4/30   5/1
5.3             Tech Edit                                     1 day     5/2    5/2
5.4             Final Report Due                              1 day     5/5    5/5
6               Final Presentation                            29 days   4/1    5/9
6.1             Draft 1                                       5 days    4/1    4/7
6.2             Meet with Professor                           1 day     4/8    4/8
6.3             Meet with Professor                           1 day     4/15   4/15
6.4             Draft 2                                       2 days    4/16   4/17
6.5             In Class Dry Run                              1 day     4/22   4/22
6.6             In Class Dry Run                              1 day     4/29   4/29
6.7             Final Draft                                   2 days    4/30   5/1
6.8             Final Presentation                            1 day     5/9    5/9

www.sap.com

© 2014 SAP AG or an SAP affiliate company. All rights reserved.


No part of this publication may be reproduced or transmitted in any form or for
any purpose without the express permission of SAP AG or an SAP affiliate
company.
SAP and other SAP products and services mentioned herein as well as their
respective logos are trademarks or registered trademarks of SAP AG (or an
SAP affiliate company) in Germany and other countries. Please see
http://www.sap.com/corporate-en/legal/copyright/index.epx#trademark for
additional trademark information and notices. Some software products
marketed by SAP AG and its distributors contain proprietary software
components of other software vendors.
National product specifications may vary.
These materials are provided by SAP AG or an SAP affiliate company for
informational purposes only, without representation or warranty of any kind,
and SAP AG or its affiliated companies shall not be liable for errors or
omissions with respect to the materials. The only warranties for SAP AG or
SAP affiliate company products and services are those that are set forth in the
express warranty statements accompanying such products and services, if
any. Nothing herein should be construed as constituting an additional warranty.
In particular, SAP AG or its affiliated companies have no obligation to pursue
any course of business outlined in this document or any related presentation,
or to develop or release any functionality mentioned therein. This document, or
any related presentation, and SAP AG’s or its affiliated companies’ strategy
and possible future developments, products, and/or platform directions and
functionality are all subject to change and may be changed by SAP AG or its
affiliated companies at any time for any reason without notice. The information
in this document is not a commitment, promise, or legal obligation to deliver
any material, code, or functionality. All forward-looking statements are subject
to various risks and uncertainties that could cause actual results to differ
materially from expectations. Readers are cautioned not to place undue
reliance on these forward-looking statements, which speak only as of their
dates, and they should not be relied upon in making purchasing decisions.
