idBigData Meetup #17
SQL Big Data Analytics
Open Source Solution for Big Data Analyst Workflow
Institut Teknologi Bandung, 28 September 2017
Sigit Prasetyo
sigit.prasetyo@idbigdata.com
@sigitpras303
linkedin.com/in/sigitprasetyo303
flikr.com/photografer-kw3
Open Source Solution for Data Analyst Workflow
idBigData.com IDBigData idBigData @idBigData hub.idBigData.com
Data-Driven Company
A data-driven company is an organization where every person who can
use data to make better decisions, has access to the data they need
when they need it.
Being data-driven is not about seeing a few canned reports at the
beginning of every day or week; it's about giving the business
decision makers the power to explore data independently, even
if they're working with big or disparate data sources.
https://www.infoworld.com/article/3074322/big-data/what-is-a-data-driven-company.html
Moneyball
Data Journey
01 Data Collection
02 Data Preparation
03 Data Exploration
04 Data Formatting
05 Data Presentation
What Is a Data Analyst?
Data Analysts are experienced data professionals in their organization who
can query and process data, provide reports, summarize and
visualize data.
They have a strong understanding of how to leverage existing tools and
methods to solve a problem, and help people from across the company
understand specific queries with ad-hoc reports and charts.
Skills: Data Analysts need to have a baseline
understanding of some core skills: statistics,
data munging, data visualization, and exploratory
data analysis.
https://cognitiveclass.ai/blog/data-scientist-vs-data-engineer/
Tools: Microsoft Excel, SPSS, SPSS Modeler,
SAS, SAS Miner, SQL, Microsoft Access,
Tableau, SSAS
Big Data Data Analyst Certification
Required Skills
Prepare the Data
Use Extract, Transform, Load (ETL) processes to
prepare data for queries.
Provide Structure to the Data
Use Data Definition Language (DDL) statements
to create or alter structures in the metastore for
use by Hive and Impala.
Data Analysis
Use Query Language (QL) statements in Hive and
Impala to analyze data on the cluster.
Certification Exam Subject Areas
1. Extract, Transform, and Load Data with Apache Pig
2. Manipulate Data with Apache Pig
3. Create tables and load data in Apache Hive
4. Query data with Apache Hive
5. SQL Queries with Drill
6. Working with Self-Describing Data
7. Advanced Topics including Troubleshooting
Why SQL?
SQL: Structured Query Language
A very high-level language
(Almost) every application uses a database
SQL developers are easier to find
The easiest way to get started with Hadoop
SQL On Hadoop
Apache Drill: Schema-free SQL query engine for Hadoop, NoSQL and cloud storage
Apache Phoenix: OLTP and operational analytics for Apache Hadoop
Apache Hive: Data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL
Apache Impala: The open source, native analytic database for Apache Hadoop*
Apache Tajo: A big data warehouse system on Hadoop
Apache HAWQ: Apache Hadoop native SQL. Advanced, MPP, elastic query engine and analytic database for enterprises*
Presto: Distributed SQL query engine for big data
Why Not Excel?
Easy to use
Flat database
(Almost) complete tool for data analysts (formulas, statistics, charts)
What if ...
The data gets bigger
The data has complex relationships
Let’s Play Lego
Read simple to complex data
Data exploration + Ad Hoc Query
Data visualization
Machine Learning
HDFS + MAPREDUCE + HIVE + ZEPPELIN
SQL Data Analytics Sandbox
VirtualBox
Linux Mint OS 18.2
Apache Hadoop Vanilla, Single Node
YARN - Resource Management
HDFS - Hadoop Distributed File System
MapReduce - Execution Engine
Data Preparation
Data Exploration
Apache Zeppelin
https://github.com/project303/dasb
Apache Hive
Initially developed by Facebook
Included in most Hadoop distributions (Cloudera, Hortonworks, MapR, Yava)
Built-in functions and user-defined functions
Transactional (ACID)
Supports indexes
Supports a procedural language
Machine Learning - HiveMall*
Supported Execution Engine
- MapReduce
- Apache Tez
- Spark
JDBC connection support
Apache Zeppelin
Interactive Notebook
Web Front End
Multiple Interpreters
Built-in Visualization
Proof Of Concept
Perform Squid Access Log Data Analysis.
Squid is a caching proxy for the Web supporting HTTP, HTTPS, FTP, and
more. It reduces bandwidth and improves response times by caching and
reusing frequently-requested web pages.
Scenario:
Load the access.log data into HDFS
Use Hive to check whether the log contains anything unusual
Know Your Data
Data format: a text file in which each line contains 10 fields separated by spaces
remotehost rfc931 authuser [date] "request" status size referer agent tcp_code
Field Description :
1. Remotehost
Remote hostname (or IP number if the DNS hostname is not
available, or if DNSLookup is Off).
2. Rfc931
The remote logname of the user.
3. User ID
The username as which the user has authenticated himself.
Always NULL ("-") for Squid logs.
4. [date]
Date and time of the request.
5. "Request"
The request line exactly as it came from the client. GET,
HEAD, POST, etc. for HTTP requests. ICP_QUERY for ICP
requests.
6. Status
The HTTP status code returned to the client. See the HTTP
status codes for a complete list.
7. Size
The content length of the data transferred, in bytes.
8. Referer
9. Agent
The application that accessed the internet.
10. TCP Code
The "cache result" of the request. This describes whether the
request was a cache hit or miss, and whether the object was
refreshed.
Know Your Data
Sample Data :
192.168.6.129 - - [17/Sep/2017:00:00:21 +0700] "GET
http://api.account.xiaomi.com/pass/v2/safe/user/coreInfo? HTTP/1.1" 200 862 "-"
"Dalvik/2.1.0 (Linux; U; Android 5.1.1; 2014817 MIUI/V8.5.1.0.LHJMIED)" TCP_MISS:DIRECT
192.168.6.103 - - [17/Sep/2017:00:01:14 +0700] "POST http://netmarbleslog.netmarble.com/
HTTP/1.0" 200 299 "-" "okhttp/2.5.0" TCP_MISS:DIRECT
Remotehost : 192.168.6.129
[date] : [17/Sep/2017:00:00:21 +0700]
"Request" :
"GET http://api.account.xiaomi.com/pass/v2/safe/user/coreInfo? HTTP/1.1"
Status : 200
Size : 862
Agent :
"Dalvik/2.1.0 (Linux; U; Android 5.1.1; 2014817 MIUI/V8.5.1.0.LHJMIED)"
TCP Code : TCP_MISS:DIRECT
Starting Apache Zeppelin
Accessing Zeppelin
Preparation
Load Data To HDFS
Create External Table
RegexSerDe
Sample Data :
192.168.6.129 - - [17/Sep/2017:00:00:21 +0700] "GET
http://api.account.xiaomi.com/pass/v2/safe/user/coreInfo? HTTP/1.1" 200 862 "-"
"Dalvik/2.1.0 (Linux; U; Android 5.1.1; 2014817 MIUI/V8.5.1.0.LHJMIED)" TCP_MISS:DIRECT
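Hive's RegexSerDe maps each capture group of a regular expression to one table column, in order. The pattern used on this slide is not part of the transcript, so the following is only a sketch, in Python, of a pattern that could split the sample line into the ten fields listed under Know Your Data. The pattern itself, the names LOG_PATTERN and FIELDS, and the parse_line helper are assumptions made for illustration; Hive uses Java regex syntax and additional escaping inside a table definition, so the pattern may need adjusting there.

```python
import re

# Hypothetical pattern: one capture group per field, in the order
# remotehost rfc931 authuser [date] "request" status size referer agent tcp_code
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)" (\S+)$'
)

FIELDS = ["remotehost", "rfc931", "authuser", "date", "request",
          "status", "size", "referer", "agent", "tcp_code"]


def parse_line(line):
    """Return a dict of the ten fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))


# First sample line from the slide, joined back into a single line.
sample = (
    '192.168.6.129 - - [17/Sep/2017:00:00:21 +0700] '
    '"GET http://api.account.xiaomi.com/pass/v2/safe/user/coreInfo? HTTP/1.1" '
    '200 862 "-" '
    '"Dalvik/2.1.0 (Linux; U; Android 5.1.1; 2014817 MIUI/V8.5.1.0.LHJMIED)" '
    'TCP_MISS:DIRECT'
)

record = parse_line(sample)
print(record["remotehost"], record["status"], record["size"], record["tcp_code"])
```

Lines that do not match return None here; with RegexSerDe, a non-matching row typically shows up as NULL columns, so it is worth checking for such rows after creating the table.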
View Table Content
Create View
Let’s Tell The Story
Monday Traffic Behaviour
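The chart and query for this slide are not part of the transcript. Looking at one weekday's traffic generally starts by deriving the day of the week and the hour from the [date] field, so here is a minimal Python sketch of that step using the timestamp from the sample data. The helper name day_and_hour is hypothetical, and in Hive the same step would be done with its own date functions rather than Python.

```python
from datetime import datetime

# Layout of the [date] field in the sample data, without the brackets,
# e.g. 17/Sep/2017:00:00:21 +0700
DATE_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def day_and_hour(date_field):
    """Return (weekday name, hour of day) for one [date] value."""
    ts = datetime.strptime(date_field, DATE_FORMAT)
    return WEEKDAYS[ts.weekday()], ts.hour


day, hour = day_and_hour("17/Sep/2017:00:00:21 +0700")
print(day, hour)
```

The sample timestamp falls on a Sunday, so a Monday analysis would keep only rows whose derived weekday is Monday and then count requests per hour.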
IP Traffic Behaviour
Agent Name
Status → 403 Forbidden
The Most Used Agent
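The query result for this slide is not in the transcript. As a rough sketch of the aggregation behind an agent ranking (in SQL terms, a count grouped by the agent column and sorted in descending order), the Python below counts agents across the two sample records from the Know Your Data slide, with an optional filter on the HTTP status. The records list and the most_used_agents helper are illustrative assumptions, not the presenter's query.

```python
from collections import Counter

# (status, agent) pairs taken from the two sample log lines in the deck.
records = [
    ("200", "Dalvik/2.1.0 (Linux; U; Android 5.1.1; 2014817 MIUI/V8.5.1.0.LHJMIED)"),
    ("200", "okhttp/2.5.0"),
]


def most_used_agents(rows, top_n=10, status=None):
    """Count requests per agent, optionally keeping only one HTTP status."""
    counts = Counter(
        agent for row_status, agent in rows
        if status is None or row_status == status
    )
    return counts.most_common(top_n)


print(most_used_agents(records))
# Same ranking restricted to 403 Forbidden responses; the sample has none.
print(most_used_agents(records, status="403"))
```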
Thank You & Stay Connected
s.id/idbigdata
Credit for icon
Gregor Cresnar
www.flaticon.com/authors/gregor-cresnar
Prosymbols
www.flaticon.com/authors/prosymbols
Freepik
www.freepik.com
Pavel Kozlov
www.flaticon.com/authors/pavel-kozlov
Yannick
www.flaticon.com/authors/yannick
Dave Gandy
www.flaticon.com/authors/dave-gandy
SimpleIcon
www.flaticon.com/authors/simpleicon
Data Analytics Overview and its applicationsData Analytics Overview and its applications
Data Analytics Overview and its applications
JanmejayaMishra7
 
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
 
Flip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptxFlip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptx
mubashirkhan45461
 
VKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptxVKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptx
Vinod Srivastava
 
Ad

Open Source Solution for Data Analyst Workflow

  • 1. idBigData Meetup #17 SQL Big Data Analytics Open Source Solution for Big Data Analyst Workflow Institut Teknologi Bandung, 28 September 2017 Sigit Prasetyo
  • 4. Data-Driven Company: A data-driven company is an organization where every person who can use data to make better decisions has access to the data they need when they need it. Being data-driven is not about seeing a few canned reports at the beginning of every day or week; it's about giving the business decision makers the power to explore data independently, even if they're working with big or disparate data sources. Source: https://www.infoworld.com/article/3074322/big-data/what-is-a-data-driven-company.html
  • 5. Moneyball
  • 6. Data Journey: 01 Data Collection, 02 Data Preparation, 03 Data Exploration, 04 Data Formatting, 05 Data Presentation
  • 7. What Is a Data Analyst? Data analysts are experienced data professionals in their organization who can query and process data, provide reports, and summarize and visualize data. They have a strong understanding of how to leverage existing tools and methods to solve a problem, and they help people from across the company understand specific queries with ad-hoc reports and charts. Skills: data analysts need a baseline understanding of some core skills: statistics, data munging, data visualization, and exploratory data analysis. Tools: Microsoft Excel, SPSS, SPSS Modeler, SAS, SAS Miner, SQL, Microsoft Access, Tableau, SSAS. Source: https://cognitiveclass.ai/blog/data-scientist-vs-data-engineer/
  • 8. Big Data Data Analyst Certification
      Required skills:
      Prepare the Data: use Extract, Transform, Load (ETL) processes to prepare data for queries.
      Provide Structure to the Data: use Data Definition Language (DDL) statements to create or alter structures in the metastore for use by Hive and Impala.
      Data Analysis: use Query Language (QL) statements in Hive and Impala to analyze data on the cluster.
      Certification exam subject areas:
      1. Extract, Transform, and Load Data with Apache Pig
      2. Manipulate Data with Apache Pig
      3. Create tables and load data in Apache Hive
      4. Query data with Apache Hive
      5. SQL Queries with Drill
      6. Working with Self-Describing Data
      7. Advanced Topics including Troubleshooting
  • 9. Why SQL? SQL (Structured Query Language) is a very high-level language; (almost) every application uses a database; it is easier to find a SQL developer; it is the easiest first step into Hadoop.
  • 10. SQL on Hadoop: Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage; OLTP and operational analytics for Apache Hadoop; Data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL; The open source, native analytic database for Apache Hadoop*; A big data warehouse system on Hadoop; Apache Hadoop Native SQL: advanced, MPP, elastic query engine and analytic database for enterprises*; Distributed SQL Query Engine for Big Data
  • 11. Why Not Excel? Easy to use; flat database; (almost) complete tool for a data analyst (formulas, statistics, charts). What if the data is bigger, or the relations are complex?
  • 12. Let’s Play Lego: read simple to complex data; data exploration + ad hoc query; data visualization; machine learning. HDFS + MapReduce + Hive + Zeppelin
  • 13. SQL Data Analytics Sandbox: VirtualBox; Linux Mint OS 18.2; Apache Hadoop Vanilla Single Node (YARN for resource management, HDFS as the Hadoop Distributed File System, MapReduce as the execution engine); Data Preparation; Data Exploration; Apache Zeppelin. https://github.com/project303/dasb
  • 14. Apache Hive: initially developed by Facebook; included in most Hadoop distros (Cloudera, Hortonworks, MapR, Yava); built-in functions and user-defined functions; transactional (ACID); index support; procedural language support; machine learning with HiveMall*; supported execution engines: MapReduce, Apache Tez, Spark; JDBC connection support
  • 15. Apache Zeppelin: interactive notebook; web front end; multiple interpreters; built-in visualization
  • 16. Proof of Concept: Perform Squid access log data analysis. Squid is a caching proxy for the Web supporting HTTP, HTTPS, FTP, and more. It reduces bandwidth and improves response times by caching and reusing frequently requested web pages. Scenario: load the access.log data into HDFS, then use Hive to analyze whether it contains anything unusual.
  • 17. Know Your Data
      Data format: a text file in which each record contains 10 fields separated by spaces:
      remotehost rfc931 authuser [date] "request" status size referer agent tcp_code
      Field description:
      1. Remotehost: remote hostname (or IP number if the DNS hostname is not available, or if DNSLookup is Off).
      2. Rfc931: the remote logname of the user.
      3. User ID (authuser): the username with which the user has authenticated. Always NULL ("-") for Squid logs.
      4. [date]: date and time of the request.
      5. "Request": the request line exactly as it came from the client. GET, HEAD, POST, etc. for HTTP requests; ICP_QUERY for ICP requests.
      6. Status: the HTTP status code returned to the client. See the HTTP status codes for a complete list.
      7. Size: the content length of the data transferred, in bytes.
      8. Referer
      9. Agent: the application that accessed the Internet.
      10. TCP Code: the "cache result" of the request. This describes whether the request was a cache hit or miss, and whether the object was refreshed.
  • 18. Know Your Data
      Sample data:
      192.168.6.129 - - [17/Sep/2017:00:00:21 +0700] "GET http://api.account.xiaomi.com/pass/v2/safe/user/coreInfo? HTTP/1.1" 200 862 "-" "Dalvik/2.1.0 (Linux; U; Android 5.1.1; 2014817 MIUI/V8.5.1.0.LHJMIED)" TCP_MISS:DIRECT
      192.168.6.103 - - [17/Sep/2017:00:01:14 +0700] "POST http://netmarbleslog.netmarble.com/ HTTP/1.0" 200 299 "-" "okhttp/2.5.0" TCP_MISS:DIRECT
      Fields of the first record:
      Remotehost : 192.168.6.129
      [date] : [17/Sep/2017:00:00:21 +0700]
      "Request" : "GET http://api.account.xiaomi.com/pass/v2/safe/user/coreInfo? HTTP/1.1"
      Status : 200
      Size : 862
      Agent : "Dalvik/2.1.0 (Linux; U; Android 5.1.1; 2014817 MIUI/V8.5.1.0.LHJMIED)"
      TCP Code : TCP_MISS:DIRECT
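
The [date] field in the sample record has a fixed layout, so it can be turned into a real timestamp before any grouping by day or hour. The following is a minimal Python sketch of that step, not the deck's own Hive code (which is not in the transcript); the function name `parse_squid_date` and the format constant are assumptions made for illustration.

```python
from datetime import datetime

# Layout of the sample timestamp: day/month-abbreviation/year:HH:MM:SS offset.
# Note: %b matches English month abbreviations only under an English or C locale.
SQUID_DATE_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def parse_squid_date(raw):
    """Parse a [date] field such as '[17/Sep/2017:00:00:21 +0700]'."""
    # Remove the surrounding square brackets before parsing.
    return datetime.strptime(raw.strip("[]"), SQUID_DATE_FORMAT)


ts = parse_squid_date("[17/Sep/2017:00:00:21 +0700]")
print(ts.isoformat())          # 2017-09-17T00:00:21+07:00
print(ts.weekday(), ts.hour)   # weekday() counts Monday as 0
```

Once the timestamp is parsed, `ts.weekday()` and `ts.hour` can serve as grouping keys for traffic-by-day or traffic-by-hour summaries.
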
  • 19. Starting Apache Zeppelin
  • 20. Accessing Zeppelin
  • 21. Preparation
  • 22. Load Data To HDFS
  • 23. Create External Table
  • 24. RegexSerDe. Sample data: 192.168.6.129 - - [17/Sep/2017:00:00:21 +0700] "GET http://api.account.xiaomi.com/pass/v2/safe/user/coreInfo? HTTP/1.1" 200 862 "-" "Dalvik/2.1.0 (Linux; U; Android 5.1.1; 2014817 MIUI/V8.5.1.0.LHJMIED)" TCP_MISS:DIRECT
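
A regex-based SerDe maps each capture group of a pattern to one table column. The regular expression used on the slide is not part of this transcript, so the pattern below is an assumption: a Python sketch of how one capture group per field could split the sample record into the 10 fields named on the "Know Your Data" slide. The names `LOG_PATTERN` and `parse_line` are hypothetical.

```python
import re

# Sample record copied from the slide (a single line in access.log).
SAMPLE = (
    '192.168.6.129 - - [17/Sep/2017:00:00:21 +0700] '
    '"GET http://api.account.xiaomi.com/pass/v2/safe/user/coreInfo? HTTP/1.1" '
    '200 862 "-" '
    '"Dalvik/2.1.0 (Linux; U; Android 5.1.1; 2014817 MIUI/V8.5.1.0.LHJMIED)" '
    'TCP_MISS:DIRECT'
)

# One named group per field, in the order given by the data format line.
# Assumed pattern for illustration; it is not the regex from the deck.
LOG_PATTERN = re.compile(
    r'(?P<remotehost>\S+) '
    r'(?P<rfc931>\S+) '
    r'(?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) '
    r'(?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" '
    r'"(?P<agent>[^"]*)" '
    r'(?P<tcp_code>\S+)'
)


def parse_line(line):
    """Return a dict with the 10 fields, or None if the line does not match."""
    match = LOG_PATTERN.fullmatch(line.strip())
    return match.groupdict() if match else None


record = parse_line(SAMPLE)
print(record["remotehost"], record["status"], record["tcp_code"])
```

Quoted fields (request, referer, agent) are matched between their double quotes because they may contain spaces; splitting the line on spaces alone would break them apart.
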
  • 25. View Table Content
  • 26. Create View
  • 27. Let’s Tell The Story
  • 28. Monday Traffic Behaviour
  • 29. IP Traffic Behaviour
  • 30. Agent Name Status → 403 Forbidden
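
The query behind this slide is not in the transcript. As a rough illustration of the idea (filter on status 403, then count requests per agent, comparable to a GROUP BY on the agent column), here is a small Python sketch. The first two records reuse the agents and statuses from the deck's sample data; the 403 rows and their agent names are invented for the example, as is the function name `agents_by_status`.

```python
from collections import Counter

# (agent, status) pairs standing in for parsed log records.
# The two status-200 rows come from the slide's sample data;
# the 403 rows are hypothetical and exist only to exercise the filter.
records = [
    ("Dalvik/2.1.0 (Linux; U; Android 5.1.1; 2014817 MIUI/V8.5.1.0.LHJMIED)", "200"),
    ("okhttp/2.5.0", "200"),
    ("example-agent/1.0", "403"),
    ("example-agent/1.0", "403"),
    ("other-agent/2.0", "403"),
]


def agents_by_status(rows, status):
    """Count requests per agent for one HTTP status, most frequent first."""
    counts = Counter(agent for agent, code in rows if code == status)
    return counts.most_common()


print(agents_by_status(records, "403"))
# [('example-agent/1.0', 2), ('other-agent/2.0', 1)]
```

Counting every row instead of only those with one status would rank agents by overall usage.
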
  • 31. The Most Used Agent
  • 32. Thank You & Stay Connected: s.id/idbigdata. Icon credits: Gregor Cresnar (www.flaticon.com/authors/gregor-cresnar), Prosymbols (www.flaticon.com/authors/prosymbols), Freepik (www.freepik.com), Pavel Kozlov (www.flaticon.com/authors/pavel-kozlov), Yannick (www.flaticon.com/authors/yannick), Dave Gandy (www.flaticon.com/authors/dave-gandy), SimpleIcon (www.flaticon.com/authors/simpleicon)