Fast Data Analysis with Hive on Spark - Hadoop / Spark Conference Japan 2016 (Nagato Kasaki)
DMM.com currently collects more than 100 million behavioral-log records per day, along with content information from each of its services and open data such as regional information, and uses them for data-driven marketing and marketing automation. However, as the volume of data has grown and its uses have diversified, data-processing latency has become a problem. This talk presents a case study in which replacing the existing Hive-based processing with Hive on Spark cut daily batch-processing time to one third, and gives a concrete explanation of how to adopt Hive on Spark and what its benefits are.
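Switching an existing Hive workload over to Hive on Spark is largely a configuration change. A minimal sketch, assuming a YARN cluster, might look like the following; the table name and resource values are illustrative assumptions, not DMM.com's actual configuration:

```sql
-- Hypothetical minimal settings to run Hive queries on the Spark engine.
SET hive.execution.engine=spark;
SET spark.master=yarn;
SET spark.executor.memory=4g;
SET spark.executor.instances=20;

-- Existing HiveQL then runs unchanged on Spark
-- (behavior_log is a made-up example table):
SELECT dt, COUNT(*) AS events
FROM behavior_log
GROUP BY dt;
```

Because the query layer stays HiveQL, existing batch jobs need no rewriting; only the execution engine underneath changes.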
Hadoop / Spark Conference Japan 2016
http://www.eventbrite.com/e/hadoop-spark-conference-japan-2016-tickets-20809016328
Organizations looking to use a NoSQL data store based on Big Table face a challenge when deciding between alternatives. Often superficial differences are overblown or worse, subtle differences aren't discovered until it's too late. In this talk we compare and contrast Apache Accumulo against Apache Cassandra and Apache HBase, diving deep into design differences and subtleties that may hinder a project only after reaching a certain amount of usage or data storage.
– Speaker –
Aaron Cordova
Co-founder and CTO, Koverse
Aaron has built multiple large-scale big data systems used by the intelligence, defense, finance, and healthcare industries. Aaron co-founded Koverse Inc. Prior to that, Aaron spent five years as a researcher for the National Security Agency (NSA), where he developed and deployed into operations dozens of advanced analytical techniques. He is the founder of Apache Accumulo, a scalable and secure data store on top of Apache Hadoop, and the author of the recently released O’Reilly book, Accumulo: Application Development, Table Design, and Best Practices.
— More Information —
For more information see http://www.accumulosummit.com/
- Apache Spark is an open-source cluster computing framework for large-scale data processing. It was originally developed at the University of California, Berkeley in 2009 and is used for distributed tasks like data mining, streaming and machine learning.
- Spark utilizes in-memory computing to optimize performance. It keeps data in memory across tasks to allow for faster analytics compared to disk-based computing. Spark also supports caching data in memory to optimize repeated computations.
- Proper configuration of Spark's memory options is important to avoid out-of-memory errors. Options such as spark.memory.fraction, spark.memory.storageFraction, the executor's on-heap memory size, and off-heap memory size control how Spark allocates and uses memory across executors.
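As a sketch, the memory options mentioned above correspond to standard Spark configuration keys like these in spark-defaults.conf; the values shown are illustrative, not tuned recommendations:

```properties
# Illustrative spark-defaults.conf memory settings.
spark.executor.memory          4g     # on-heap memory per executor
spark.memory.fraction          0.6    # share of heap usable for execution + storage
spark.memory.storageFraction   0.5    # share of the above protected for storage (caching)
spark.memory.offHeap.enabled   true
spark.memory.offHeap.size      2g     # off-heap memory per executor
```

Raising spark.memory.storageFraction favors cached data at the expense of execution memory, so a job that shuffles heavily may need the opposite trade-off.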
Tech Deep Dive #2 in Osaka
https://techdeepdive.connpass.com/event/79096/
2018/03/17
Many people running applications are unsure what to do when the database slows down or breaks. To prepare for such situations, this session introduces database design approaches and implementation methods.
Presentation material on data security from Oracle Cloud Days Tokyo 2016 (held October 2016). * Some slides were revised (2016/12/27).
Looking back at trends in data-breach incidents over the past three years, this session uses case studies to introduce the key points of the defense-in-depth and data-security measures now required to protect personal and confidential information. It also covers the data-security requirements called for by various guidelines.
With the arrival of Oracle Zero Data Loss Recovery Appliance, Oracle Recovery Manager (RMAN), available in existing Oracle Database versions such as 10g and 11g, has become an increasingly important feature. The author of the popular OTN series "Shibacho-sensei's Try It and Be Convinced! The Road to DBA" discusses it, explaining everything from RMAN backup operation examples to the internal behavior and tuning of fast incremental backups, holding nothing back.
This document is a presentation introducing Cloudera and related technologies. After an agenda, it discusses Cloudera's growth from 2008 to 2018, the products and services offered, the company's organizational structure, and the components that make up its Hadoop platform, including HDFS, HBase, ZooKeeper, YARN, and more. It also covers Linux system administration and monitoring topics such as log analysis and storage.
How to go into production your machine learning models? #CWT2017 (Cloudera Japan)
This document discusses various patterns for deploying machine learning systems. It describes different approaches for model building, prediction, and serving including:
- Developing models in Cloudera Data Science Workbench and exporting them for prediction through APIs or databases.
- Using microservices architectures with web applications, APIs, and databases connecting to machine learning systems.
- Serving models through REST APIs or databases and updating models continuously through streaming data.
Apache Kudu - Updatable Analytical Storage #rakutentech (Cloudera Japan)
This document provides an overview of Apache Kudu, an open source columnar storage system that enables fast analytics on fast-changing data. It covers Kudu's architecture, including its use of tablets, replication via Raft consensus, and compressed columnar storage; its write path through in-memory row sets and delta memstores that are flushed to disk; its read path, which performs lookups without merging files; and its compaction processes. Overall, it is a high-level technical introduction to Kudu's capabilities and design.
Using Your Favorite Python Libraries Distributed with Cloudera Data Science Workbench and PySpark #cadeda (Cloudera Japan)
Presentation by Ariga (@chezou) at Data Engineering and Data Analysis Workshop #1.
https://cyberagent.connpass.com/event/58808/
It explains how to use Cloudera Data Science Workbench and PySpark to run your favorite Python libraries in a distributed fashion, executing the Japanese morphological-analysis library MeCab from PySpark.