Fast Analytics aims to deliver analytics at decision-making speeds, using technologies like Apache Kudu and Apache Druid to process high volumes of data in real time. The document argues that Kudu integrates less smoothly with Hadoop, and presents Druid as the better option for combining low-latency queries with Hadoop compatibility. It then surveys Druid's capabilities and use cases, gives examples of companies using Druid, and walks through the Druid quickstart tutorial.
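Since the summary ends at the quickstart tutorial, a small sketch of querying Druid may help. Druid exposes a SQL endpoint over HTTP; the host, port (the quickstart router's default 8888), and the tutorial's `wikipedia` datasource below are assumptions taken from the quickstart defaults, not from the original document:

```python
# Query Apache Druid's SQL-over-HTTP endpoint (quickstart defaults assumed:
# router on localhost:8888, tutorial "wikipedia" datasource already ingested).
import requests

resp = requests.post(
    "http://localhost:8888/druid/v2/sql",
    json={"query": "SELECT channel, COUNT(*) AS edits "
                   "FROM wikipedia GROUP BY channel ORDER BY edits DESC LIMIT 5"},
)
resp.raise_for_status()
for row in resp.json():  # Druid returns one JSON object per result row
    print(row["channel"], row["edits"])
```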
This document provides an overview of Apache Kafka. It covers Kafka's key capabilities: publishing and subscribing to streams of records, storing streams of records durably, and processing streams as they occur. It describes Kafka's core components (producers, consumers, brokers, and clustering) and explains why Kafka is useful for messaging, data storage, and real-time stream processing, highlighting performance features such as support for multiple producers and consumers and disk-based retention.
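To make the producer and consumer roles concrete, here is a minimal sketch using the third-party kafka-python package; the broker address (`localhost:9092`) and topic name (`events`) are placeholder assumptions:

```python
# Publish one record to a topic, then read it back.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", key=b"user-42", value=b'{"action": "click"}')
producer.flush()  # block until the broker has acknowledged the record

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the beginning of the log
    consumer_timeout_ms=5000,      # stop iterating after 5s of silence
)
for record in consumer:
    print(record.key, record.value)
```

Because the broker retains records on disk, the consumer can replay them from the earliest offset, which is the disk-based retention point the summary mentions.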
This document discusses different methods for analyzing both quantitative and qualitative data to build business intuition. For quantitative data, it describes common descriptive and inferential statistical techniques, such as mean, median, mode, correlation, and regression. For qualitative data, it outlines techniques like word repetition analysis, comparing and contrasting themes, content analysis, narrative analysis, and grounded theory. The overall goal is to use exploratory data analysis and various tools to identify patterns and insights that can help build business intuition.
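As a toy illustration of the quantitative techniques named above, the sketch below uses Python's built-in statistics module; the data values are invented for illustration, and statistics.correlation and statistics.linear_regression require Python 3.10+:

```python
import statistics

# Invented example data: monthly ad spend vs. revenue (arbitrary units).
ad_spend = [10, 12, 15, 15, 20, 26]
revenue  = [120, 135, 150, 150, 170, 210]

print(statistics.mean(revenue))    # central tendency: 155.83...
print(statistics.median(revenue))  # 150.0
print(statistics.mode(revenue))    # 150, the most frequent value

# Relationship between the two variables (Python 3.10+):
print(statistics.correlation(ad_spend, revenue))  # Pearson's r
slope, intercept = statistics.linear_regression(ad_spend, revenue)
print(f"revenue ~= {slope:.1f} * ad_spend + {intercept:.1f}")
```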
This document defines 16 basic terms related to data engineering:
1. Apache Airflow is an open-source workflow management platform that uses directed acyclic graphs to manage workflow orchestration.
2. Batch processing involves processing large amounts of data at once, such as in ETL steps or bulk operations on digital images; a minimal sketch follows this list.
3. Cold data storage keeps old, rarely accessed data on low-power servers, which makes retrieval slower but cuts storage cost.
4. A cluster groups several computers together to perform a single task.
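Here is the batch-processing sketch promised in term 2: read a whole day's records at once, transform them, and write the result in a single pass. The file names and column names are invented for illustration:

```python
# A minimal batch-ETL sketch: one bulk read, one transform, one bulk write.
import csv

def run_batch(in_path: str, out_path: str) -> None:
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=["id", "amount_usd"])
        writer.writeheader()
        for row in reader:
            # Transform step: normalise cents to dollars.
            writer.writerow({"id": row["id"],
                             "amount_usd": int(row["amount_cents"]) / 100})

run_batch("sales_2021-03-27.csv", "sales_clean.csv")  # hypothetical files
```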
Kubernetes provides the software necessary to build and deploy reliable, scalable distributed systems by handling problems related to velocity, scaling, abstracting infrastructure, and efficiency. Specifically, it allows for fast updates and self-healing systems, scales well through decoupled architectures, separates developers from specific machines or cloud providers, and improves efficiency by automating application distribution and enabling cheap test environments using containers.
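As one concrete example of the scaling and abstraction points, this sketch uses the official Kubernetes Python client to rescale a Deployment; the deployment name `web` and the `default` namespace are assumptions for illustration:

```python
# Rescale a Deployment via the Kubernetes API (assumes a kubeconfig is
# present and a Deployment named "web" exists in the "default" namespace).
from kubernetes import client, config

config.load_kube_config()   # reads ~/.kube/config
apps = client.AppsV1Api()

# Declare the desired state; the control loop converges the cluster to it.
apps.patch_namespaced_deployment_scale(
    name="web",
    namespace="default",
    body={"spec": {"replicas": 5}},
)

for dep in apps.list_namespaced_deployment("default").items:
    print(dep.metadata.name, dep.spec.replicas, dep.status.ready_replicas)
```

This declarative pattern is what the self-healing claim refers to: you patch the desired state rather than issuing imperative commands to individual machines.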
Apache Airflow is an open-source workflow management platform that was created at Airbnb in 2014 to author, schedule, and monitor complex workflows. It allows users to define workflows as directed acyclic graphs (DAGs) of tasks; the Airflow scheduler then executes the tasks on workers according to their dependencies. Airflow is commonly used for ETL pipelines, data processing, machine learning workflows, and automating DevOps tasks like monitoring cron jobs. Companies like Robinhood use Airflow for complex data workflows, and Google offers it as a managed service (Cloud Composer) on Google Cloud.
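A minimal DAG sketch in the Airflow 2.x style; the DAG id, schedule, and task bodies are invented placeholders:

```python
# A two-task DAG: "extract" must finish before "transform" runs.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull records from the source system")  # placeholder task body

def transform():
    print("clean and reshape the records")        # placeholder task body

with DAG(
    dag_id="example_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_extract >> t_transform  # the >> operator encodes the DAG edge
```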
Fast Analytics (FA) uses an Enterprise Service Bus (ESB) to process high volumes of big data in real time, enabling decision makers to understand new trends and shifts as they occur. FA delivers analytics at decision-making speeds through technologies like Apache Kudu, which provides low-latency random access and efficient analytical queries on columnar data. Kudu uses a log-structured storage approach and the Raft consensus algorithm to replicate data across nodes for reliability and high availability.
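For a flavor of Kudu's low-latency random writes alongside scans, here is a sketch modeled on the example in Kudu's own Python client documentation; the master host, table name, and schema are illustrative assumptions:

```python
import kudu
from kudu.client import Partitioning

# Connect to a Kudu master (placeholder host/port).
client = kudu.connect(host="kudu-master.example.com", port=7051)

# A simple schema with a primary key, hash-partitioned across 3 buckets.
builder = kudu.schema_builder()
builder.add_column("key").type(kudu.int64).nullable(False).primary_key()
builder.add_column("metric").type(kudu.double)
schema = builder.build()
client.create_table(
    "example_metrics", schema,
    Partitioning().add_hash_partitions(column_names=["key"], num_buckets=3),
)

# Low-latency random writes go through a session.
table = client.table("example_metrics")
session = client.new_session()
session.apply(table.new_insert({"key": 1, "metric": 0.75}))
session.flush()

# Scans serve the analytical side.
print(table.scanner().open().read_all_tuples())
```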
The document discusses the concept of "dark data", which refers to data that is collected by organizations but not analyzed or used. Some key points:
- Up to 90% of collected data either loses its value almost immediately or is never analyzed. Common examples of dark data include customer location data and sensor data.
- Organizations retain dark data for compliance purposes, but storing it can cost more than the value it returns; only about 1% of organizational data is typically analyzed.
- Dark data poses risks like legal issues if it contains private information, but also opportunity costs if competitors analyze the data first. Methods to mitigate risks include ongoing data inventories, encryption, and retention policies.
- Many types of businesses could benefit from analyzing their dark data.
Apache Hadoop is a framework for distributed storage and processing of large datasets across clusters of commodity hardware. It provides HDFS for distributed file storage and MapReduce as a programming model for distributed computation. The wider ecosystem includes YARN for resource management, Spark for fast in-memory computation, HBase as a NoSQL database, and tools for data analysis, transfer, and security. Hadoop can run on-premise or in cloud environments and supports analytics workloads.
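To ground the MapReduce programming model, here is the classic word-count example written with PySpark, one of the ecosystem tools named above; the HDFS paths are placeholders:

```python
# Word count in the map/reduce style: map each word to (word, 1),
# then reduce by key to sum the counts.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

lines = spark.read.text("hdfs:///data/input.txt").rdd.map(lambda r: r[0])
counts = (
    lines.flatMap(lambda line: line.split())  # map: line -> words
         .map(lambda word: (word, 1))         # map: word -> (word, 1)
         .reduceByKey(lambda a, b: a + b)     # reduce: sum per word
)
counts.saveAsTextFile("hdfs:///data/wordcounts")
spark.stop()
```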
10. Chapter 5: Conclusions and Recommendations
[Roadmap diagram: the Ultimate Goal is to "launch our first rocket," reached through Milestones #1 to #3, whose labels cover studying blockchain technology, choosing the appropriate Blockchain + AI fit, business process & data, AI, and customer engagement. A contingency branch reads "run out of money to buy rocket parts," with the options "launch a plane instead," "do it another way," and "save up more money."]
11. March 27, 2021
"The mind is just like a muscle — the more you exercise it, the stronger it gets and the more it can expand." (Idowu Koyenikan)