The document summarizes upgrading an HDP cluster from version 2.1 to 2.5. The upgrade involved setting up new hardware with more resources, performing a blue-green deployment by migrating 500 TB of data with DistCp over three days, and migrating Hive schemas along with other configuration changes. The author recommends migrating jobs in batches of fewer than 100 and enlisting help to address errors, since performing a large upgrade alone can be overwhelming.
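The summary does not reproduce the exact DistCp invocation; as a rough sketch, the copy between the old and new clusters could look like the following, where the NameNode hosts and paths are placeholders (for copies between very different Hadoop versions, a webhdfs:// source URI is often used instead):

    hadoop distcp -update -p -m 100 \
        hdfs://old-cluster-nn:8020/user/warehouse \
        hdfs://new-cluster-nn:8020/user/warehouse

Here -update recopies only files that have changed, -p preserves file attributes, and -m caps the number of parallel map tasks, which is how a 500 TB copy can be throttled and resumed over several days.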
This document summarizes Azkaban, an open source workflow scheduler created at LinkedIn to manage Hadoop jobs and their dependencies. Key features include defining job dependencies in a simple interface, retry functionality, scheduling, and viewing logs and execution details in the web UI. The document also discusses how the author uses Azkaban to manage Python batch jobs at their company, including writing job files in YAML format and using the Azkaban API. In conclusion, the author finds Azkaban simple to use and sees no reason to replace it, though they hope for more active development.
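The post's actual job files are not shown in the summary; a minimal sketch in Azkaban's Flow 2.0 YAML format (the job names and commands below are hypothetical) might look like:

    nodes:
      - name: extract
        type: command
        config:
          command: python extract.py
      - name: load
        type: command
        dependsOn:
          - extract
        config:
          command: python load.py

Azkaban runs extract first and starts load only after it succeeds, which is the dependency behavior the summary describes.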
This document discusses the application of PostgreSQL in a large social infrastructure project involving smart meter management. It describes three main missions: (1) loading 10 million datasets within 10 minutes, (2) retaining data for 24 months, and (3) stabilizing performance for large-scale SELECT statements. The optimizations discussed to achieve these missions include data modeling, performance tuning, reducing data size, and controlling execution plans. All three missions were ultimately completed by applying PostgreSQL expertise and customizing it for the project's large-scale requirements.
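The talk's actual schema is not given in the summary; a sketch of the general pattern (monthly range partitioning plus COPY-based bulk loading) is shown below, using PostgreSQL 10+ declarative partitioning and hypothetical table and column names (the original project may have used inheritance-based partitioning instead):

    -- Hypothetical monthly-partitioned readings table.
    CREATE TABLE meter_readings (
        meter_id  bigint      NOT NULL,
        read_at   timestamptz NOT NULL,
        kwh       numeric(10,3)
    ) PARTITION BY RANGE (read_at);

    CREATE TABLE meter_readings_2017_01 PARTITION OF meter_readings
        FOR VALUES FROM ('2017-01-01') TO ('2017-02-01');

    -- Bulk load with COPY rather than row-by-row INSERTs.
    COPY meter_readings FROM '/data/readings_batch.csv' WITH (FORMAT csv);

    -- 24-month retention: drop the oldest partition instead of DELETE.
    DROP TABLE meter_readings_2015_01;

Dropping a whole partition makes the retention mission cheap, and per-month partitions keep large range SELECTs scanning only the relevant slices.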
Azkaban is a workflow scheduler created at LinkedIn to manage Hadoop jobs and their dependencies. It provides features like defining job dependencies, retries, scheduling, and viewing logs through a web UI. While useful, it has some limitations: the server is a single point of failure, executions cannot be triggered by file events, and development is not very active. The document discusses using Azkaban to manage Hadoop jobs, including writing jobs in Python and generating job files from YAML definitions (a sketch of such a generator follows below). It also outlines the author's usage of Azkaban in their environment to manage over 120 flows on daily, hourly, weekly, and monthly schedules.
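The generator itself is not shown in the summary; a minimal Python sketch, assuming a simple hypothetical YAML schema, could emit classic Azkaban .job property files like this:

    # Hypothetical: turn flow.yml into classic Azkaban .job files.
    # Expected YAML shape (an assumption, not the author's actual schema):
    #   jobs:
    #     - {name: extract, command: python extract.py}
    #     - {name: load, command: python load.py, depends: [extract]}
    import yaml  # PyYAML

    with open("flow.yml") as src:
        flow = yaml.safe_load(src)

    for job in flow["jobs"]:
        with open(f"{job['name']}.job", "w") as out:
            out.write("type=command\n")
            out.write(f"command={job['command']}\n")
            if job.get("depends"):
                out.write("dependencies=" + ",".join(job["depends"]) + "\n")

The emitted files use Azkaban's classic key=value job format, where dependencies lists the upstream job names.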
This document summarizes the process of upgrading an HDP cluster from version 2.1 to 2.4. It describes installing HDP 2.1.5 without Ambari, then upgrading to HDP 2.3.4 using Ambari. Several issues with Hive on Tez and MapReduce were worked around by changing configurations, increasing memory limits, and switching between the Tez and MR execution engines. HDP 2.4.0 was then installed, which resolved some issues, but a block corruption problem remained. The conclusion recommends using HDP 2.4.0 and monitoring upgrades closely due to ongoing query incompatibility issues.
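The summary does not list the exact settings that were changed; workarounds of this kind are typically applied per session in Hive, for example (illustrative values only, not the author's confirmed configuration):

    -- Fall back from Tez to MapReduce for a problematic query.
    SET hive.execution.engine=mr;

    -- Or raise Tez container memory and keep the JVM heap below it.
    SET hive.tez.container.size=4096;
    SET hive.tez.java.opts=-Xmx3276m;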
[db tech showcase Tokyo 2017] D33: Accelerating Deep Learning and Analytics workloads - Ten... (Insight Technology, Inc.)
Deep Learning is usually run on GPU-based computing environments, but the storage infrastructure that feeds and accelerates them has received little attention. Pipeline processing for analytics on Spark can now also be accelerated. Pure Storage presents how to support AI-era workloads, along with its latest use cases.
Silicon Valley x Japan / Tech x Business Meetup #12 (2015/04/17)
"An Introduction to Hadoop, a Platform for Parallel Distributed Processing, and Where Developers Find Hadoop Useful"
NTT DATA, Platform Systems Business Headquarters
System Engineering Division, OSS Professional Services
Akira Ajisaka
Version 3.1 of Oracle Big Data SQL, Oracle's distinctive Hadoop solution, has been released.
To generate real business value from big data and IoT, it is essential to combine not only data in Hadoop and NoSQL stores but also business data in RDBMSs.
This update further unifies all of that data, helping more customers derive genuine business value from big data and IoT and establish a competitive advantage.
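Big Data SQL achieves this integration by exposing Hadoop-side data to the Oracle database as external tables; a hedged sketch of the mechanism follows, with hypothetical table and column names (ORACLE_HIVE is the access driver Big Data SQL provides for Hive-backed tables):

    -- Illustrative: an Oracle external table over a Hive table.
    CREATE TABLE sensor_events (
        device_id  NUMBER,
        event_time TIMESTAMP,
        reading    NUMBER
    )
    ORGANIZATION EXTERNAL (
        TYPE ORACLE_HIVE
        DEFAULT DIRECTORY DEFAULT_DIR
        ACCESS PARAMETERS (com.oracle.bigdata.tablename=iot.sensor_events)
    )
    REJECT LIMIT UNLIMITED;

    -- Hadoop data can then be joined directly with RDBMS business data.
    SELECT c.customer_name, e.reading
    FROM   sensor_events e
    JOIN   customers c ON c.device_id = e.device_id;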