Big Data Taiwan 2014 Track1-3: Big Data, Big Challenge — Splunk Helps You Solve Big Data... | Etu Solution
Speaker: Product Manager, Data Value-Added Application Development Department, SYSTEX | 陶靖霖
Session overview: Face reality: Big Data is a hot buzzword and a hot topic, but the core questions still revolve around data processing workflows, architectures, and technologies. What challenges do users run into when they step into the Big Data field? Splunk has been hailed as "the world's best Big Data company"; what unique technical advantages does it hold in the data processing pipeline that help users overcome these challenges, and in which cases has it successfully helped users extract value from their data? Come learn about Splunk and Big Data success stories from around the world.
Mesos-based Data Infrastructure @ Douban | Zhong Bo Tian
How to build an elastic and efficient platform to support various Big Data and Machine Learning tasks is a challenge for many corporations. In this presentation, Zhongbo Tian will give an overview of the Mesos-based core infrastructure of Douban, and demonstrate how to integrate the platform with state-of-the-art Big Data/ML technologies.
How to plan a Hadoop cluster for testing and production environments | Anna Yen
Athemaster shares its experience with new Hadoop users on planning hardware specifications, server initialization, and role deployment. Two testing environments and three production environments are covered as case studies.
If it's going to work, you need to involve people outside the marketing function. Actually, you have a change management project on your hands. See why.
A global qualitative study was conducted with people in 11 countries to find out what they thought about the January 2017 women's march.
The study was conducted by Think Global Qualitative, a global network of senior qualitative specialists.
This document discusses the importance of data science and building a data science team. It notes that data science provides new analytic insights and data products. Effective data science requires a team that includes data scientists, data engineers, and others. The document suggests data science can enable smart factories, supply chains, precision medicine, personalized shopping and learning. It promotes learning data science through the Data Science Thailand community.
This document discusses churn management in mobile communications. It defines churn as customer attrition or loss and churn rate as the number of customers who discontinue service divided by the total number of customers. It identifies reasons for churn such as easy switching between providers and inadequate services. It discusses types of churn, data transformation for modeling, identifying customers' propensity to churn, and calculating customer profitability. Finally, it outlines strategies for reducing churn such as identifying valuable customers and developing win-back policies.
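The churn-rate definition above is simple division; a minimal sketch with hypothetical monthly figures (the counts below are assumptions, not data from the deck):

# Churn rate = customers who discontinued service / total customers
customers_total = 120_000     # subscribers at the start of the month (assumed)
customers_churned = 3_600     # subscribers who discontinued service (assumed)

churn_rate = customers_churned / customers_total
print(f"Monthly churn rate: {churn_rate:.2%}")   # -> 3.00%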
Trinity significantly strengthens an enterprise's competitiveness in the face of large volumes of fast-changing information.
Today, enterprise BI is mostly built on RDBMSs, accompanied by extensive ETL and data exchange jobs. After adopting Hadoop Big Data applications, effectively interfacing with, and further integrating, the existing BI systems to achieve overall synergy becomes a challenge.
Thanks to its superior architecture, Trinity establishes seamless exchange jobs between traditional structured-data applications and Hadoop Big Data applications, letting analysts keep working the way they are used to, which greatly reduces both the learning curve of adopting Big Data applications and the manpower needed for subsequent system operations and maintenance.
To segment effectively, you need to understand what drives the segments, not just how to measure them. That's where qualitative insight comes in.
Welcome remarks: Big Data Is Everywhere, and Data Technology Is Nothing Without the "C" | Etu Solution
This document contains the opening remarks from Lin Longfen, the general manager of SYSTEX Group. It discusses how Gartner dropped "Big Data" from its hype cycle of emerging technologies in 2015 because it is now considered a mainstream part of many industries. Big data is still essential for major trends like the Internet of Things, Industry 4.0, and smart everything. The document emphasizes that understanding customers ("C") is key to a company's ("B") competitive advantage in the digital economy, and that leveraging industry data is a common development strategy across SYSTEX Group's business units.
This document discusses building a new generation of intelligent data platforms. It emphasizes that most big data projects spend 80% of time on data integration and quality. It also notes that Informatica developers are 5 times more productive than those coding by hand for Hadoop. The document promotes Informatica's tools for enabling existing developers to work with big data platforms like Hadoop through visual interfaces and pre-built connectors and transformations.
The document discusses consumer behavior and the buyer decision process. It outlines the gaps model of consumer expectations and marketer perceptions. It then describes the 5 stages of the buyer decision process - need recognition, information search, evaluation of alternatives, purchase decision, and post-purchase behavior. Finally, it discusses factors that influence consumer behavior such as cultural, social, personal and psychological characteristics.
This document summarizes the roles of servers in a Hadoop cluster, including manager, name nodes, edge nodes, and data nodes. It discusses hardware considerations for Hadoop cluster design like CPU to memory to disk ratios for different use cases. It also provides an overview of Dell's Hadoop solutions that integrate PowerEdge servers, Dell Networking switches, and support from Etu for analytic software and Dell Professional Services for implementation. It briefly discusses futures around in-memory processing and virtualized Hadoop deployments.
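To make the sizing discussion concrete, here is a hedged back-of-the-envelope sketch; the replication factor of 3 is the HDFS default, while the 25% headroom and the ingest figures are assumptions for illustration, not recommendations from the deck.

def estimate_raw_hdfs_capacity(daily_ingest_tb, retention_days,
                               replication=3, headroom=0.25):
    """Estimate raw disk capacity (TB) needed across data nodes.

    replication: HDFS block replication factor (default is 3).
    headroom: extra fraction reserved for temp and shuffle data (assumed 25%).
    """
    logical_tb = daily_ingest_tb * retention_days
    return logical_tb * replication * (1 + headroom)

# Example: ingesting 0.5 TB/day and keeping 365 days needs roughly 684 TB of raw disk.
print(estimate_raw_hdfs_capacity(0.5, 365))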
The document discusses customer churn risk and how to develop predictive churn models. It defines risk as having two components: uncertainty and exposure to that uncertainty. When building a churn model, the key steps are: defining active vs churned customers, selecting relevant customer data, analyzing characteristics to identify predictors, developing a predictive score using methods like logistic regression, and evaluating the model's ability to identify customers likely to churn. The goal of a churn model is to provide insights for preventing churn, not just statistical precision.
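A minimal sketch of the scoring step described above, using scikit-learn's logistic regression; the input file and feature names are hypothetical, not taken from the deck.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical customer feature table; the column names are illustrative only.
df = pd.read_csv("customer_features.csv")
features = ["tenure_months", "monthly_charges", "support_calls"]
X, y = df[features], df["churned"]            # churned: 1 = left, 0 = still active

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Propensity-to-churn score per customer, used to rank retention targets,
# plus AUC as a check that the model actually separates churners from non-churners.
churn_score = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, churn_score))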
Build 1 trillion warehouse based on CarbonData | boxu42
Apache CarbonData & Spark Meetup
Build 1 trillion warehouse based on CarbonData
Huawei
Apache Spark™ is a unified analytics engine for large-scale data processing.
CarbonData is a high-performance data solution that supports various data analytics scenarios, including BI analysis, ad-hoc SQL queries, fast filter lookups on detail records, streaming analytics, and more. CarbonData has been deployed in many enterprise production environments; in one of the largest deployments it supports queries on a single table holding 3 PB of data (more than 5 trillion records) with response times under 3 seconds.
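A hedged sketch of what working with a CarbonData table from Spark SQL can look like, assuming a Spark session that already has the CarbonData integration on its classpath; the table schema and query are illustrative, and session setup details differ across CarbonData and Spark versions.

from pyspark.sql import SparkSession

# Assumes the CarbonData jars/extensions are already configured for this session.
spark = SparkSession.builder.appName("carbondata-sketch").getOrCreate()

# Illustrative detail table stored in the CarbonData format.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_detail (
        order_id BIGINT, item_id STRING, city STRING,
        quantity INT, price DOUBLE, sale_date DATE
    ) STORED AS carbondata
""")

# A fast filter lookup on detail records, one of the scenarios mentioned above.
spark.sql("""
    SELECT city, SUM(quantity * price) AS revenue
    FROM sales_detail
    WHERE sale_date = DATE '2018-06-01'
    GROUP BY city
""").show()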
Spark is a general purpose computational framework that provides more flexibility than MapReduce. It leverages distributed memory and uses directed acyclic graphs for data parallel computations while retaining MapReduce properties like scalability, fault tolerance, and data locality. Cloudera has embraced Spark and is working to integrate it into their Hadoop ecosystem through projects like Hive on Spark and optimizations in Spark Core, MLlib, and Spark Streaming. Cloudera positions Spark as the future general purpose framework for Hadoop, while other specialized frameworks may still be needed for tasks like SQL, search, and graphs.
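To make the directed-acyclic-graph model concrete, here is a minimal PySpark sketch: the transformations only extend the DAG, and nothing executes until the final action runs (the input path is a placeholder).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-sketch").getOrCreate()
sc = spark.sparkContext

# Transformations (flatMap, map, reduceByKey) are lazy: they only build up the DAG.
lines = sc.textFile("hdfs:///tmp/input.txt")          # placeholder input path
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# Only the action below triggers execution of the whole DAG, with intermediate
# results kept in distributed memory rather than written back to disk per step.
for word, n in counts.take(10):
    print(word, n)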
This document discusses big data and Cloudera's Enterprise Data Hub solution. It begins by noting that big data is growing exponentially and now includes structured, complex, and diverse data types from various sources. Traditional data architectures using relational databases cannot effectively handle this scale and variety of big data. The document then introduces Cloudera's Hadoop-based Enterprise Data Hub as an open, scalable, and cost-effective platform that can ingest and process all data types and bring compute capabilities to the data. It provides an overview of Cloudera's history and product offerings that make up its full big data platform.
Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution | Etu Solution
Speaker: Senior Product Consultant, Informatica | 尹寒柏
Session overview: In the Big Data era, what counts is not the amount of data but the depth at which you understand it. Now that Big Data technology has matured, CXOs without IT backgrounds can turn CI (Customer Intelligence), once little more than a buzzword, into a verb: moving from BI to CI, staying connected to the pulse of the consumer economy, and discerning customer intent. One mindset to keep in the Big Data era, however, is that the competition ultimately is not just about growing data volumes but about who understands the data more deeply, and Informatica is the best answer to that. With Informatica, enterprises can relieve the enormous pressure of delivering trusted data on time; and as data volume and complexity keep rising, Informatica can also aggregate data faster, making it meaningful and usable for improving efficiency, raising quality, providing certainty, and playing to strengths. Informatica offers a faster and more effective way to achieve this goal and is SYSTEX Group's tool of choice in the Big Data era.
Big Data Taiwan 2014 Keynote 4: Monetize Enterprise Data – Classic Big Data Applications and Actions in Taiwan | Etu Solution
Speaker: Senior Director, Etu | 陳育杰
Overview: Over the past two years, Big Data application architectures in enterprises have gradually taken shape. We have seen different industries begin to use Hadoop to solve different problems, and the IT architectures behind them share a number of common traits. Through these common architectural patterns, we will explore the enterprise applications that Big Data / Hadoop concretely delivers.
Big Data Taiwan 2014 Keynote 2: Hadoop and the Future of Data Management | Etu Solution
Speaker:
1. Christopher Poulos | Vice President, Asia Pacific and Japan at Cloudera
2. Gab Gennai | Technical Services Director, Asia Pacific and Japan at Cloudera
Introduction: Without a doubt, Apache Hadoop is leading the way in enterprise architecture. Find out how easily it integrates with your existing hardware and software infrastructure.
26. TPC-DS defines the table schemas used for the benchmark
[master:21000] > show databases;
Query: show databases
+------------------+
| name |
+------------------+
| _impala_builtins |
| default |
| tpcds |
| tpcds_parquet |
| tpcds_rcfile |
+------------------+
Fetched 5 row(s) in 0.03s
[master:21000] > use tpcds;
Query: use tpcds
[master:21000] > show tables;
Query: show tables
+------------------------+
| name |
+------------------------+
| customer |
| customer_address |
| customer_demographics |
| date_dim |
| household_demographics |
| inventory |
| item |
| promotion |
| store |
| store_sales |
| time_dim |
+------------------------+
Fetched 11 row(s) in 0.01s
27. TPC-DS also defines the SQL queries used for the benchmark
-- start query 1 in stream 0 using template query27.tpl
select
i_item_id,
s_state,
-- grouping(s_state) g_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
from
store_sales,
customer_demographics,
date_dim,
store,
item
Where
ss_sold_date_sk = d_date_sk
and ss_item_sk = i_item_sk
and ss_store_sk = s_store_sk
and ss_cdemo_sk = cd_demo_sk
and cd_gender = 'F'
and cd_marital_status = 'W'
and cd_education_status = 'Primary'
and d_year = 1998
and s_state in
('WI', 'CA', 'TX', 'FL', 'WA', 'TN')
and ss_sold_date_sk between 2450815 and 2451179
-- partition key filter
group by
-- rollup (i_item_id, s_state)
i_item_id,
s_state
order by
i_item_id,
s_state
limit 100;
-- end query 1 in stream 0 using template query27.
28. TPC-DS also provides a tool for generating a specified number of rows of data (a sketch of invoking the generator follows the output below)
Sample table: 3.8 GB of data, text file format, uncompressed, 30 million rows
[master:21000] > show table stats customer;
Query: show table stats customer
+-------+--------+--------+--------------+--------+-------------------+
| #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+--------+--------+--------------+--------+-------------------+
| -1 | 1 | 3.81GB | NOT CACHED | TEXT | false |
+-------+--------+--------+--------------+--------+-------------------+
Fetched 1 row(s) in 0.00s
[master:21000] > select count(*) from customer;
Query: select count(*) from customer
+----------+
| count(*) |
+----------+
| 30000000 |
+----------+
Fetched 1 row(s) in 0.77s
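The generator mentioned at the top of this slide is dsdgen from the TPC-DS toolkit; a hedged sketch of driving it from a script is shown below. The install path, output directory, and scale factor are assumptions, and the exact flags may vary between toolkit versions.

import subprocess

DSDGEN_DIR = "/opt/tpcds-kit/tools"     # assumed location of the built TPC-DS toolkit
OUTPUT_DIR = "/data/tpcds/raw"          # assumed output directory for the .dat files

# -scale is the TPC-DS scale factor (roughly the dataset size in GB);
# dsdgen expects to find tpcds.idx in its working directory.
subprocess.run(["./dsdgen", "-scale", "100", "-dir", OUTPUT_DIR],
               cwd=DSDGEN_DIR, check=True)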
Using the same dataset and queries makes it easier to evaluate and compare different technologies.
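As a hedged illustration of such a comparison, the sketch below times the same query against the text-format and Parquet-format databases listed earlier (tpcds vs tpcds_parquet). It assumes the impyla Python client and Impala's default HiveServer2 port 21050; the host name and the query are placeholders.

import time
from impala.dbapi import connect   # impyla client

QUERY = "select count(*) from store_sales"   # placeholder; substitute TPC-DS query27, etc.

conn = connect(host="master", port=21050)    # 21050 is Impala's default HiveServer2 port
cur = conn.cursor()

for db in ("tpcds", "tpcds_parquet"):        # same data, text vs. Parquet storage
    cur.execute(f"use {db}")
    start = time.time()
    cur.execute(QUERY)
    cur.fetchall()
    print(f"{db}: {time.time() - start:.2f}s")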