Chapter 2 - Data Science
Big Data
• Big data is the term for a collection of data sets so large and complex
that it becomes difficult to process using on-hand database
management tools or traditional data processing applications.
Clustered Computing
• Cluster computing refers to many computers connected over a network that work together as a single entity.
• Because of these qualities of big data, individual computers are often inadequate for handling
the data at most stages.
• Computer clusters are a better fit for the high storage and computational needs of big data.
• Big data clustering software combines the resources of many smaller machines, seeking to
provide a number of benefits:
Suppose you have a big file containing more than 500 MB of data and you need to count the number of words in it, but your computer has only 100 MB of memory. How can you handle it? (A minimal single-machine sketch of this idea appears after the list of benefits below.)
• Resource Pooling: Combining the available storage space to hold data
is a clear benefit, but CPU and memory pooling are also extremely
important.
• Processing large datasets requires large amounts of all three of these
resources.
• High Availability: Clusters can provide varying levels of fault
tolerance and availability guarantees to prevent hardware or software
failures from affecting access to data and processing.
• Easy Scalability: Clusters make it easy to scale horizontally by
adding additional machines to the group.
• Cluster membership and resource allocation can be handled by software
like Hadoop’s YARN (which stands for Yet Another Resource Negotiator).
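To make the earlier word-count question concrete, here is a minimal single-machine sketch of the split-and-combine idea behind clustered processing: the file is read in small chunks so it never has to fit in memory at once, each chunk is counted separately (the "map" step), and the partial counts are merged (the "reduce" step). The file name and chunk size below are placeholders, not part of the original material.

```python
from collections import Counter

def count_words_in_chunks(path, chunk_size=1024 * 1024):
    """Count words in a file without loading it all into memory.

    Reads the file in fixed-size chunks (a stand-in for splitting work
    across cluster nodes), counts words per chunk ("map"), and merges
    the partial counts ("reduce").
    """
    total = Counter()
    leftover = ""  # carries a partial word across chunk boundaries
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            text = leftover + chunk
            words = text.split()
            # If the chunk does not end on whitespace, the last token may
            # be an incomplete word; keep it for the next chunk.
            if text and not text[-1].isspace():
                leftover = words.pop() if words else ""
            else:
                leftover = ""
            total.update(words)  # merge this chunk's counts into the total
    if leftover:
        total.update([leftover])
    return sum(total.values()), total

if __name__ == "__main__":
    # "big_file.txt" is a placeholder; any large text file works.
    word_count, counts = count_words_in_chunks("big_file.txt")
    print("total words:", word_count)
    print("most common:", counts.most_common(5))
```

On a real cluster, each chunk would be handled by a different machine; here the chunks are simply processed one after another on the same machine.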
• Economical: Hadoop is highly economical, as ordinary commodity computers can be used for data
processing.
• Reliable: Hadoop is reliable because it stores copies of the data on different machines and is resistant
to hardware failure.
• Scalable: Hadoop is easily scalable, both horizontally and vertically; a few extra nodes are enough to
scale up the framework.
• Flexible: Hadoop is flexible: you can store as much structured and unstructured data as you
need and decide how to use it later.
• Hadoop has an ecosystem that has evolved from its four core
components: data management, data access, data processing, and data
storage.
Hadoop Ecosystem
Big Data Life Cycle with Hadoop
1. Ingesting data into the system
• The first stage is Ingestion, in which data is transferred into Hadoop from its sources.
2. Processing the data
• The second stage is Processing. In this stage, the data is stored and
processed.
• The data is stored in the distributed file system, HDFS, and in the
NoSQL distributed database, HBase; Spark and MapReduce perform the data
processing.
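As a rough illustration of this stage, the sketch below uses PySpark to run a MapReduce-style word count over a text file. The input path and application name are assumptions; on a real cluster the path would typically point into HDFS (an hdfs:// location) rather than the local file system.

```python
from pyspark.sql import SparkSession

# A minimal sketch of MapReduce-style processing with Spark.
# "input.txt" is a placeholder path.
spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
lines = spark.sparkContext.textFile("input.txt")

counts = (
    lines.flatMap(lambda line: line.split())  # map: line -> words
         .map(lambda word: (word, 1))         # map: word -> (word, 1)
         .reduceByKey(lambda a, b: a + b)     # reduce: sum counts per word
)

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```

Unlike the single-machine sketch earlier, Spark splits the input into partitions and distributes the map and reduce work across the cluster's nodes.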
3. Computing and analyzing data
• Pig converts the data using map and reduce functions and then analyzes it.
• Hive is also based on map and reduce programming and is most
suitable for structured data.
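To show what this kind of SQL-on-Hadoop analysis looks like, here is a small sketch that uses Spark SQL rather than Hive itself; the table name, columns, and values are made up for illustration, but a Hive query over structured data takes the same general form.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-sketch").getOrCreate()

# A small made-up structured dataset standing in for a table.
sales = spark.createDataFrame(
    [("books", 12.5), ("books", 7.0), ("toys", 3.2)],
    ["category", "amount"],
)
sales.createOrReplaceTempView("sales")

# A Hive-style aggregation query over the structured data.
totals = spark.sql(
    "SELECT category, SUM(amount) AS total FROM sales GROUP BY category"
)
totals.show()

spark.stop()
```

The engine translates the declarative query into distributed processing stages, so the analyst does not have to write map and reduce code by hand.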
4. Visualizing the results
• The fourth stage is Access, which is performed by tools such as Hue
and Cloudera Search.
• In this stage, the analyzed data can be accessed by users.