Big Data & DS Analytics
for PAARL
Albert Anthony D. Gavino, MBA
Data Scientist / DS Evangelist
About the speaker: Albert Anthony D. Gavino
Project profile
Program Objectives / Program Goals
Participants will be able to relate Big Data and Data Science
applications to library services.
1. What is Big Data?
Extremely large data sets that may be analyzed to reveal patterns,
trends and associations
The BIG 3 V’s
• Variety: the different types of data
(Facebook, Twitter, CCTV feeds)
• Velocity: the speed at which data arrives
(in batches, or streaming every second)
• Volume: the sheer size of the data
(1 GB, 1 TB, 1 PB, 1 ZB)
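To give a feel for the jump between these volume units, here is a minimal sketch. It assumes decimal (SI) units, where each step is a factor of 1000; the helper name to_bytes is hypothetical and not part of the talk.

```python
# Decimal (SI) byte units; each unit is 1000 times the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]


def to_bytes(value, unit):
    """Convert a size such as (1, "TB") into a number of bytes."""
    exponent = UNITS.index(unit)
    return int(value * 1000 ** exponent)


# Print the four sizes named on the slide.
for unit in ("GB", "TB", "PB", "ZB"):
    print(f"1 {unit} = {to_bytes(1, unit):,} bytes")
```

A petabyte is a thousand terabytes, and a zettabyte is a million petabytes, which is why the storage and processing approach has to change as volume grows.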
Library Data Resources
What resources does the library have (budget, staff, premises, media, opening
hours etc.) and how is the library performing against traditional parameters, like
lending figures, visitors and social media activity? This library data can also be
combined with environmental information like community education levels,
geographical distances, age and so on.
http://www.axiell.co.uk/gettingthemostfromyourlibrarydata/
DATA Analytics Challenges and Pitfalls
The challenges to creating a robust institutional data analytics program include
culture, talent, cost, and data. We have deliberately mentioned culture first
because it is very easy to jump to data challenges. In fact, most of the literature
surrounding data analytics starts with challenges surrounding the data itself.
However, we are convinced that institutional culture is the most important factor
in determining the success of any given data analytics program, including the
politics and process around questions of talent, cost, and data itself.
Reference: The Journal of Academic Librarianship, Libraries and Institutional Data Libraries:
Challenges and Opportunities
63% of researchers and administrators expressed unhappiness with the use
of metrics in higher education (Abbott et al., 2010)
What about New Tasks like streamlining for the Librarian?
If librarians take on new tasks, it is very important to track the amount of time and level of staff
required when undertaking analytics projects. For example, collecting citation data for a
researcher with a common name often requires manual and painstaking record-by-record
searching in order to disambiguate that individual's research from others that share his/her
name. This type of work requires a librarian with a deep and intimate knowledge of the
bibliometric databases that are being used to harvest the bibliometric data.
Reference: The Journal of Academic Librarianship, Libraries and Institutional Data Libraries:
Challenges and Opportunities
What is the Cost?
• Data analytics should be thought of as a strategic investment,
not a cost-saving technique.
• The real cost is the time spent on cultural change and on
developing and educating a staff with the analytical skills that we
need in our discipline.
• A visionary analytics plan invests in people, in hiring and training,
over data tools and platforms.
Pitfalls of Data Sharing:
Challenges on Institutional Data Analytics
Pitfalls and possible solutions:
• Ownership: who owns the data? It could be the registrar, the library, or IT services.
  Possible solution: an assigned office, e.g. the Office of the President or the Compliance Office, can release the official reports.
• Quality: deciding when the data is accurate, good, and reliable.
  Possible solution: a Data Governance Unit assures the quality of the data.
• Standards: what kinds of data variables are in use (string, numeric).
  Possible solution: this can be addressed by Data Management through data warehousing.
• Access: who has access to the data.
  Possible solution: user roles can be defined to control who has access.
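The idea of defining user roles for data access can be sketched as a simple lookup. The role names and data set names below are hypothetical examples, not an actual institutional policy.

```python
# Hypothetical roles mapped to the data sets each one may read.
ROLE_PERMISSIONS = {
    "registrar": {"enrollment"},
    "librarian": {"circulation", "gate_counts"},
    "data_governance": {"enrollment", "circulation", "gate_counts"},
}


def can_access(role, dataset):
    """Return True only if the role is known and is allowed to read the data set."""
    return dataset in ROLE_PERMISSIONS.get(role, set())


print(can_access("librarian", "circulation"))  # allowed by the table above
print(can_access("librarian", "enrollment"))   # not allowed by the table above
```

Unknown roles get no access by default, which keeps the decision about who may see what in one explicit table.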
Getting Started on Institutional Data
• Creating an inventory of institutional data
• Developing a data dictionary
• Designing an unambiguous process for cleaning up those data
• Creating an open data set that answers the most commonly asked data
questions across campus.
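As an illustration of the data dictionary step, the sketch below builds a very small dictionary (column name, guessed type, missing count) from a CSV sample. The sample records, the column names, and the string/numeric type rule are all assumptions made for this example.

```python
import csv
import io

# Hypothetical sample of circulation records; a real inventory would read files.
SAMPLE = """patron_id,branch,loans,last_visit
1001,Main,12,2016-11-03
1002,East,,2016-10-21
1003,Main,7,
"""


def infer_type(values):
    """Guess 'numeric' or 'string' from the non-empty values of a column."""
    non_empty = [v for v in values if v != ""]
    if non_empty and all(v.replace(".", "", 1).isdigit() for v in non_empty):
        return "numeric"
    return "string"


def build_data_dictionary(csv_text):
    """Return one entry per column: name, inferred type, and missing count."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    dictionary = []
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        dictionary.append({
            "column": column,
            "type": infer_type(values),
            "missing": sum(1 for v in values if v == ""),
        })
    return dictionary


for entry in build_data_dictionary(SAMPLE):
    print(entry)
```

The missing counts double as a first input to the clean-up process, since they show which columns need attention.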
Opportunities for Libraries on Big Data
• Libraries know metadata
• Libraries know strategy
• Libraries know assessment
• Libraries are neutral
• Libraries know the vendors
• Libraries are part of larger bodies like PAARL
• Libraries have influence over campuses
• Libraries know metrics
• Libraries have user-centered culture
• Libraries know the politics and policy issues with commercial parties
• Libraries collaborate with both academic and academic support units
2. Building a BIG DATA culture
• Openness and acceptance to technology: Upper Management
• Willingness to invest in the Big Data Platform: which entails cost
• Training Staff and making sure of job security: Skills upgrade
• Make data sharing acceptable: Trust in the data quality and people
• Create Data Quality Assurance Team/s
• Foster collaboration among departments
• Continuous improvement of models
DATA Governance and DATA Management are different roles
Data governance is the designation of decision-rights and policy-making surrounding institutional data,
while data management is the implementation of those decisions and policies. Institutions need both,
and both require investment, but the senior leadership of our institutions needs to design the former.
Diagram labels: Data Governance Council; Data Management; policies; metrics; Data Quality Dept; Data Warehouse / Data Lake
Machine Learning
Is a type of artificial intelligence that provides
computers with the ability to learn without being
explicitly programmed.
Market Basket Analysis on Book Recommendations (Association Rule Algorithm)
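A minimal sketch of the association-rule idea behind book recommendations: count which titles are borrowed together, then compute support, confidence, and lift for each pair. The loan baskets and titles are hypothetical, and this pairwise count is a simplification of full algorithms such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical loan "baskets": the titles each patron borrowed together.
BASKETS = [
    {"Python Basics", "Statistics 101", "Data Mining"},
    {"Python Basics", "Data Mining"},
    {"Statistics 101", "Data Mining"},
    {"Python Basics", "Statistics 101"},
    {"Python Basics", "Data Mining", "Cataloging"},
]


def pair_rules(baskets, min_support=0.5):
    """Return (antecedent, consequent, support, confidence, lift) for frequent pairs."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both titles
        if support < min_support:
            continue
        for antecedent, consequent in ((a, b), (b, a)):
            # confidence: of the baskets with the antecedent, how many have the consequent
            confidence = count / item_counts[antecedent]
            # lift: confidence relative to how common the consequent is overall
            lift = confidence / (item_counts[consequent] / n)
            rules.append((antecedent, consequent, support, confidence, lift))
    return rules


for antecedent, consequent, support, confidence, lift in pair_rules(BASKETS):
    print(f"{antecedent} -> {consequent}: "
          f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 suggests the two titles are borrowed together more often than chance would predict, which is the signal a recommendation would build on.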
Weather related information and reading a book (use of hash tags and location and weather data)
Pic from Marco Rasos
Social Listening – is the process of monitoring digital conversations to
understand what customers are saying about a brand or service.
Online Research Journals and Click through Rates
Click through Rates (CTR)
The ratio of users who click on a specific link (such as an ad or
button) to the total number of users who view the page that shows
it.
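As a worked example, CTR is commonly computed as clicks divided by impressions (views). The function name and the numbers below are hypothetical.

```python
def click_through_rate(clicks, impressions):
    """Clicks divided by impressions; returns 0.0 when nothing was shown."""
    if impressions == 0:
        return 0.0
    return clicks / impressions


# Hypothetical figures: a journal link shown 1,500 times and clicked 45 times.
ctr = click_through_rate(clicks=45, impressions=1500)
print(f"CTR = {ctr:.1%}")
```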
OpenCV (Open Source Computer Vision)
Modern Day Data Scientists
Dr. Reina Reyes, Astrophysicist
Andrew Ng of Baidu, Coursera
Amy Smith, Uber Singapore
Data Science Conference 2016
YOU as the next
Doctor Strange
(Entering the world of
Data Science)
Isaac Reyes, Data Scientist Talas Data Scientists
CRISP-DM (Cross-Industry Standard Process for Data Mining) Methodology
The project was led by
five companies: SPSS,
Teradata, Daimler AG,
NCR Corporation and
OHRA, an insurance
company
CRISP-DM Tasks
From regular data to BIG data, from stat to AI
Data: regular data → BIG data
Methods (traditional → modern): Statistical modeling → Machine Learning → Deep Learning / A.I.
Trends in Data Science Domains
Data science domain: current status
• Natural Language Processing (NLP): entered the market
• Predictive Analytics / Machine Learning: entered the market
• Visualization / Dashboards: entered the market
• Image Processing (OpenCV): exploration
• Internet of Things (IoT): exploration
• Artificial Intelligence: exploration
DS/Big Data Applications to the field of Study
Sample field of study: specific applications
• Agriculture: climate forecast modeling to help farmers manage plantations (e.g. corn yields)
• Medical field: image processing for chest X-rays and retina images for diabetic patients
• Linguistics: Natural Language Processing (NLP) for dialects and sentiment analysis applications
• Economics/Finance: predicting a stock price based on certain indicators (e.g. noise, competitor price)
• Engineering: Internet of Things (IoT) applications to Big Data
Building a Data Science Team
Data Science Team Composition (DS role: programming languages and tools; sample output)
• Data Scientist: R, Python, Spark ML; neural nets, random forest
• Data Engineer / DevOps: Hadoop, Spark Core, Spark Streaming; RDD, dataframes, SQLContext
• Statistician: SAS, SPSS, R, Matlab; linear regression, K-means clustering
• Viz Expert: Tableau, Cognos, D3, JavaScript; visualization, GIS maps
Trends on Programming Languages
(Chart labels: Scala, R, Python, Spark, RapidMiner, EMC, Java)
TOOLS: OPEN SOURCE vs PROPRIETARY SOFTWARE
Open source
• Pros: no software cost; packages are available faster
• Cons: takes some time to create and integrate with other software
• Tools: Python, R, Apache Spark
Proprietary software
• Pros: easy to deploy
• Cons: expensive software; you have to buy it in modules
• Tools: SAS, IBM SPSS, AWS, Google
Small Data vs Big Data (in comparison)
Small data
• Sampling is practical (e.g. a survey sample)
• No need for in-memory or distributed computing; can be run on a regular PC/Mac
• The usual statistical assumptions are checked and relied on: normality, homoskedasticity (no heteroskedasticity), independence
Big data
• Uses all of the data in storage
• Eats up memory and needs distributed computing
• Statistical measures such as p-values must be read with care: because the data is so large, a difference that is not significant in a small set can become statistically significant even when it is too small to matter in practice
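The caution about p-values on very large data sets can be illustrated with a short sketch. It assumes a one-sample z-test in which the observed shift equals a tiny true effect of 0.01 standard deviations; the function name and the numbers are illustrative only.

```python
import math


def two_sided_p_value(effect_in_sd, n):
    """Two-sided p-value of a z-test for a mean shifted by `effect_in_sd`
    standard deviations, observed on a sample of size n."""
    z = effect_in_sd * math.sqrt(n)
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# The same tiny effect, tested on a small and on a very large sample.
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}: p = {two_sided_p_value(0.01, n):.3g}")
```

The effect size never changes, yet the p-value falls from clearly non-significant to vanishingly small as n grows, so effect sizes deserve as much attention as significance on big data.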
Simple DS Cheat sheet
• Classifiers: Neural Nets, Random Forest, SVM (medical example: cancer cells)
• Clustering: K-means, Hierarchical Clustering
• Association: Association Rules
• Predicting: Linear Regression, Logistic Regression (binary), Cox Regression (survival)
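As one concrete entry from the cheat sheet, simple linear regression can be sketched with the ordinary least-squares formulas. The data (weekly opening hours against weekly visitors) is hypothetical and chosen to lie exactly on a line.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: weekly opening hours and weekly visitor counts.
hours = [40, 50, 60, 70]
visitors = [200, 250, 300, 350]

slope, intercept = fit_line(hours, visitors)
print(f"visitors = {slope:.1f} * hours + {intercept:.1f}")
print(f"predicted visitors at 80 hours: {slope * 80 + intercept:.0f}")
```

In practice a library such as scikit-learn or R would be used; the hand-written version only shows what "predicting" means in the cheat sheet.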
Visualization TOOLS
Color Hues and Functionality
Local Implications: Data Privacy Act of 2012 (Republic Act No. 10173)
Sensitive personal information refers to personal information:
1. About an individual’s race, ethnic origin, marital status, age, color, and religious, philosophical or
political affiliations;
2. About an individual’s health, education, genetic or sexual life of a person, or to any proceeding for
any offense committed or alleged to have been committed by such individual, the disposal of such
proceedings, or the sentence of any court in such proceedings;
3. Issued by government agencies peculiar to an individual which includes, but is not limited to, social
security numbers, previous or current health records, licenses or its denials, suspension or
revocation, and tax returns; and
4. Specifically established by an executive order or an act of Congress to be kept classified.
Solutions to the Data Privacy Act: Policies
Make sure you have the following in place:
• Opt-in for customers
• Opt-out for customers
• An updated customer policy
• A publicly available policy, e.g. on your website
References
• www.coursera.org/learn/machine-learning
• www.kaggle.com
• www.crowdanalytix.com
• www.talas.ph
• www.facebook.com/analytics4pinoys
• www.linkedin.com/albertgavino
