Wk1_Overview of Data Analytics and Big Data
Richard Lui
The Big Data Era
• Data: Any piece of information stored and/or processed by a computer or mobile device.
• Companies/Organizations are generating and keeping more and more data
• The term "Big Data" is commonly credited to John Mashey, who popularized it in the 1990s to describe data that is too
vast and complex for traditional tools to handle.
Explosion of data
• Exponential growth of the Internet and World Wide Web
• User transactions and interactions with e-commerce and
mobile applications
• Social network activities
• E.g. YouTube, Facebook, Instagram, Twitter
• Companies collect and store a large volume of data from
different types of users
• E.g. Google, Baidu, Netflix, Uber
• Internet of Things (IoT) and wireless sensors
• Smart watches, thermostats, water heaters, smoke detectors, …
Clickstreams in an e-commerce website:
{
  "timestamp":"2022-08-12 03:01:58.732726",
  "user_id":"35",
  "click_id":"15cf179b9c9d483a…",
  "event_name":"Search",
  "user_ip":"11.22.33.44",
  "additional_data":{
    "engagement_time":40,
    "product_id":12345
  }
}
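To show how clickstream records like the one above might be analyzed, here is a minimal Python sketch. The three sample events are hypothetical (modeled on the fields in the record above, with made-up values), and the two summaries computed, event counts and engagement time per user, are just illustrative choices.

```python
import json
from collections import Counter

# Hypothetical sample events; field names follow the record above, values are made up.
raw_events = [
    '{"timestamp": "2022-08-12 03:01:58.732726", "user_id": "35", '
    '"event_name": "Search", "additional_data": {"engagement_time": 40, "product_id": 12345}}',
    '{"timestamp": "2022-08-12 03:02:10.000000", "user_id": "35", '
    '"event_name": "View", "additional_data": {"engagement_time": 25, "product_id": 12345}}',
    '{"timestamp": "2022-08-12 03:02:30.000000", "user_id": "36", '
    '"event_name": "Search", "additional_data": {"engagement_time": 10, "product_id": 67890}}',
]

# Parse each JSON string into a Python dictionary.
events = [json.loads(line) for line in raw_events]

# Count how often each event type occurs.
event_counts = Counter(e["event_name"] for e in events)

# Sum engagement time per user, reading the nested "additional_data" object.
engagement_by_user = Counter()
for e in events:
    engagement_by_user[e["user_id"]] += e["additional_data"]["engagement_time"]

print(dict(event_counts))
print(dict(engagement_by_user))
```

In practice such events would arrive as a continuous stream or as log files rather than a short list, but the parsing and aggregation steps are the same in spirit.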
Structured vs. Unstructured data
• Structured data
• Data conforms to a data model or schema and is often stored in tabular form.
• Unstructured data
• Data that does not conform to a data model or data schema is known as unstructured data.
• Estimated to make up about 80% of the data within a typical enterprise.
• Semi-structured data
• Data that is not tabular but still conforms to some level of structure (e.g. JSON or XML documents)
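One way to see the relationship between semi-structured and structured data is to flatten a nested record into a single table row. The sketch below is an illustration only: the `flatten` helper and the dotted column names are assumptions made for this example, not a standard convention.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dictionaries into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects and merge their flattened keys.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical semi-structured record (nested, non-tabular).
semi_structured = {
    "user_id": "35",
    "event_name": "Search",
    "additional_data": {"engagement_time": 40, "product_id": 12345},
}

# After flattening, every key can serve as a column name in a table.
row = flatten(semi_structured)
print(row)
```

Each key of `row` can then act as a column header, which is what makes the record usable in a tabular (structured) store.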
Are the data structured/unstructured?
Diagnostic Analytics
• Determines the cause of a phenomenon that occurred in the past
• Example
• Why were Q2 sales less than Q1 sales?
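A diagnostic question like this is often approached by breaking the overall change down along some dimension and looking for the largest contributor. The sketch below assumes sales are recorded per region; the regions and figures are hypothetical and chosen only to illustrate the idea.

```python
# Hypothetical quarterly sales per region (units are arbitrary).
q1_sales = {"North": 120, "South": 95, "East": 80, "West": 105}
q2_sales = {"North": 118, "South": 60, "East": 82, "West": 100}

# Change from Q1 to Q2 for each region (negative means sales fell).
change_by_region = {region: q2_sales[region] - q1_sales[region] for region in q1_sales}

# Overall change across all regions.
total_change = sum(change_by_region.values())

# The region with the most negative change contributes most to the decline.
worst_region = min(change_by_region, key=change_by_region.get)

print(change_by_region)
print(total_change, worst_region)
```

Here the breakdown points to one region as the main source of the decline, which narrows down where to look for the underlying cause.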
Predictive Analytics
• Generate future predictions based upon past events.
• Example
• What are the predicted sales for next month?
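As a rough illustration of predicting from past events, the sketch below fits a straight-line trend to six months of hypothetical sales figures using ordinary least squares and extrapolates one month ahead. A linear trend is an assumption made for simplicity; real forecasting would usually account for seasonality and uncertainty.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical monthly sales for months 1 to 6.
months = [1, 2, 3, 4, 5, 6]
sales = [100, 108, 121, 129, 142, 150]

slope, intercept = fit_line(months, sales)

# Extrapolate the fitted trend to month 7.
forecast = intercept + slope * 7
print(round(forecast, 1))
```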
Visualization
• Creation and study of the visual representation of data
• One of the most important tools for data analytics
• Dashboard: A read-only snapshot of an analysis that you can share with other users for reporting
purposes.
How does Facebook track your data?
• Facebook had 2.89 billion monthly active users as of the second quarter of 2021 (Source: Statista)
• Collects, stores and analyzes users' data and behavior
• Suggests posts and advertisements that match users' preferences
• Collected data
• Age, gender, hobbies and recent experiences
• Posts and pages liked by user
• "People You May Know" feature
• phone contacts and shared locations
• Users' political activities, such as protests and marches attended
• Facebook partners with data brokers to gather information about users' purchases.
• Even offline transactions, like credit card payments, can be linked to user profiles, leading to targeted ads.
https://www.facebook.com/help/794535777607370?ref=learn_more_ipl
Artwork Personalization at Netflix
• Artwork selection is crucial to encourage members to engage with unfamiliar titles.
• Netflix personalizes the image used to depict the movie “Good Will Hunting”
• Someone who has watched many romantic movies => show the artwork containing Matt Damon and
Minnie Driver
• A member who has watched many comedies => use the artwork containing Robin Williams, a well-known
comedian.
https://netflixtechblog.com/artwork-personalization-c589f074ad76
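The two examples above can be mimicked with a toy rule: pick the artwork associated with the genre a member has watched most. This is only a simplified illustration and not Netflix's actual method; the file names, the genre labels and the `pick_artwork` function are all hypothetical.

```python
from collections import Counter

# Hypothetical mapping from a member's dominant genre to an artwork variant.
ARTWORK_BY_GENRE = {
    "romance": "good_will_hunting_damon_driver.jpg",
    "comedy": "good_will_hunting_robin_williams.jpg",
}
DEFAULT_ARTWORK = "good_will_hunting_default.jpg"


def pick_artwork(watch_history):
    """Choose an artwork variant based on the most-watched genre."""
    if not watch_history:
        return DEFAULT_ARTWORK
    # most_common(1) returns a list with one (genre, count) pair.
    top_genre, _count = Counter(watch_history).most_common(1)[0]
    # Fall back to the default image for genres without a dedicated variant.
    return ARTWORK_BY_GENRE.get(top_genre, DEFAULT_ARTWORK)


print(pick_artwork(["romance", "romance", "comedy"]))
print(pick_artwork(["comedy", "comedy", "action"]))
```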
Data Analytics in Healthcare
• Metrics: patient falls with injury, average length of stay, and patient recommendations, etc.
• Create interactive dashboards
• Allow clinicians to analyze their performance and outcomes.
• Highlight areas of improvement in patient care.
• Deliver better and safer patient care.
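A metric such as average length of stay can be computed directly from admission and discharge dates before it is shown on a dashboard. The sketch below uses three hypothetical stays; the dates are made up for illustration.

```python
from datetime import date

# Hypothetical (admission, discharge) date pairs for three patients.
stays = [
    (date(2023, 1, 2), date(2023, 1, 5)),
    (date(2023, 1, 10), date(2023, 1, 12)),
    (date(2023, 2, 1), date(2023, 2, 8)),
]

# Length of each stay in days.
lengths = [(discharge - admission).days for admission, discharge in stays]

# Average length of stay across all patients.
average_length_of_stay = sum(lengths) / len(lengths)

print(lengths, average_length_of_stay)
```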
Case Study: How Cops Are Using Algorithms to
Predict Crimes
• The Los Angeles Police Department (LAPD) is using data-driven algorithms to forecast future crimes.
• Predicts violent crime occurrences and potential perpetrators using historical crime, arrest, and field data.
• PredPol: A predictive policing tool utilized by over 60 departments
• Identifies areas or "hotspots" with a higher likelihood of criminal activity
• Officers are directed to the hotspots that PredPol's algorithm derives from historical crime data.
• Drone surveillance and facial recognition-equipped body cameras
• The Stop LAPD Spying Coalition argues that such strategies disproportionately target low-income communities and
communities of color.
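The general idea of deriving hotspots from historical data can be illustrated by dividing a map into grid cells and counting past incidents per cell. This is a simplified sketch and not PredPol's actual algorithm; the coordinates, the cell size and the `top_hotspots` function are hypothetical.

```python
from collections import Counter


def top_hotspots(incidents, cell_size, k):
    """Return the k grid cells with the most recorded incidents."""
    # Map each (x, y) location to the grid cell that contains it.
    counts = Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in incidents
    )
    return counts.most_common(k)


# Hypothetical incident locations on an arbitrary coordinate grid.
incidents = [
    (0.2, 0.3), (0.8, 0.1), (0.5, 0.9),  # three incidents in cell (0, 0)
    (3.1, 3.2), (3.7, 3.9),              # two incidents in cell (3, 3)
    (7.5, 1.2),                          # one incident in cell (7, 1)
]

hotspots = top_hotspots(incidents, cell_size=1.0, k=2)
print(hotspots)
```

Because the counts are built only from recorded incidents, the resulting hotspots reflect where incidents were recorded in the past.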