UNIT 1  Understanding Big Data

Syllabus
Introduction to big data - convergence of key trends - unstructured data - industry examples of big data - web analytics - big data applications - big data technologies - introduction to Hadoop - open source technologies - cloud and big data - mobile business intelligence - crowd sourcing analytics - inter and trans firewall analytics.

Contents
1.1  Introduction to Big Data
1.2  Convergence of Key Trends
1.3  Unstructured Data
1.4  Industry Examples of Big Data
1.5  Web Analytics
1.6  Big Data Applications
1.7  Big Data Technologies
1.8  Introduction to Hadoop
1.9  Open Source Technologies
1.10 Cloud and Big Data
1.11 Mobile Business Intelligence
1.12 Crowd Sourcing Analytics
1.13 Inter and Trans Firewall Analytics
1.14 Two Marks Questions with Answers

1.1 Introduction to Big Data

• Big data can be defined as very large volumes of data, available from various sources, in varying degrees of complexity, generated at different speeds (i.e., velocities) and with varying degrees of ambiguity, which cannot be processed using traditional technologies, processing methods, algorithms or commercial off-the-shelf solutions.
• 'Big data' is a term used to describe a collection of data that is huge in size and yet growing exponentially with time. In short, such data is so large and complex that none of the traditional data management tools can store or process it efficiently.
• The processing of big data begins with raw data that is not aggregated or organized and is most often impossible to store in the memory of a single computer.
• Big data processing is a set of techniques or programming models for accessing large-scale data to extract useful information that supports and informs decisions. Hadoop is the open-source implementation of MapReduce and is widely used for big data processing.

Difference between Data Science and Big Data

1. Data science is a field of scientific analysis of data that solves analytically complex problems; it includes the significant and necessary activity of cleansing and preparing data. Big data is the storing and processing of large volumes of structured and unstructured data, which is not possible with traditional applications.
2. Data science is used in biotech, energy, gaming and insurance. Big data is used in retail, education, healthcare and social media.
3. Goals of data science: data classification, anomaly detection, prediction, scoring and ranking. Goals of big data: providing better customer service, identifying new revenue opportunities, effective marketing, etc.

Benefits of Big Data Processing

1. Improved customer service.
2. Businesses can utilize outside intelligence while taking decisions.
3. Reduced maintenance costs.
4. Re-developing your products: big data can also help you understand how others perceive your products so that you can adapt them, or your marketing, if need be.
5. Early identification of risk to the product or services, if any.
6. Better operational efficiency.

Big Data Challenges

Collecting, storing and processing big data comes with its own set of challenges:
1. Big data is growing exponentially, and existing data management solutions have to be constantly updated to cope with the three Vs.
2. Organizations do not have enough skilled data professionals who can understand and work with big data and big data tools.

TECHNICAL PUBLICATIONS® - an up-thrust for knowledge

1.2 Convergence of Key Trends

• The essence of computer applications is to store things from the real world in computer systems in the form of data; i.e., it is a process of producing data. Some data are records related to culture and society; others are descriptions of phenomena of the universe and life. Data is rapidly generated at large scale and stored in computer systems, which is called data explosion.
• Data is generated automatically by mobile devices and computers - think of Facebook, search queries, directions and GPS locations, and image capture. Sensors also generate volumes of data, including medical data and commerce location-based sensor data. Experts expect 55 billion IP-enabled sensors by 2021. Even storage of all this data is expensive, and analysis gets more important and more expensive every year.
• Fig. 1.2.1 shows the big data explosion caused by the current data boom and how critical it is for us to be able to extract meaning from all of this data.

Fig. 1.2.1 Data explosion

• The phenomenon of exponential multiplication of stored data is termed "data explosion". A continuous inflow of real-time data from various processes, machinery and manual inputs floods the storage servers every second. Sending emails, making phone calls, collecting information for campaigns - each day we create a massive amount of data just by going about our normal business, and this data explosion does not seem to be slowing down. In fact, 90 % of the data that currently exists was created in just the last two years.
• The reason for this data explosion is innovation. Innovation has changed the way in which we do business and provide services. The data world is governed by fundamental trends: business model transformation, globalization, personalization of services and new sources of data.
1. Business model transformation : Organizations have traditionally treated data as a legal or compliance requirement, supporting limited management reporting requirements. Consequently, organizations have treated data as a cost to be minimized. Businesses are now required to produce more data related to products and to provide services that cater to each sector and channel of customer.
2. Globalization : Globalization is an emerging trend in business where organizations start operating on an international scale.
From manufacturing to customer service, globalization has changed the commerce of the world. A variety of data in different formats is generated due to globalization.
3. Personalization of services : To enhance customer service, one-to-one marketing in the form of personalized services is opted for by customers. Customers expect communication through various channels, which increases the speed of data generation.
4. New sources of data : The shift to online advertising, supported by the likes of Google, Yahoo and others, is a key driver of the data boom. Social media, mobile devices, sensor networks and new media are at the fingertips of customers and users. The data generated through these channels is used by corporations for decision support systems such as business intelligence and analytics. The growth of technology has helped new business models emerge over the last decade or more. Integration of all the data across the enterprise is used to create a business decision support platform.

V's of Big Data

• We differentiate big data characteristics from traditional data by one or more of the five V's : volume, velocity, variety, veracity and value.
1. Volume : Volumes of data are larger than what conventional relational database infrastructure can cope with, consisting of terabytes or petabytes of data. Fig. 1.2.2 shows big data volume (sources include machine data and geographical information systems with geo-spatial data).

Fig. 1.2.2 Big data volume

2. Velocity : The term 'velocity' refers to the speed of generation of data. How fast data is generated and processed to meet demands determines the real potential in the data. It is being created in or near real-time.
3. Variety : It refers to heterogeneous sources and the nature of data, both structured and unstructured. Fig. 1.2.3 (a) and Fig. 1.2.3 (b) show big data velocity and data variety.
Fig. 1.2.3 (a) Data velocity (sources include sensor data, mobile networks, social media and web-based companies such as Amazon and Facebook)

4. Value : It represents the business value to be derived from big data.
• The ultimate objective of any big data project should be to generate some sort of value for the company doing the analysis. Otherwise, you are just performing a technological task for technology's sake.

Fig. 1.2.3 (b) Data variety (structured and unstructured data)

• For real-time spatial big data, decisions can be enhanced through visualization of dynamic change in such spatial phenomena as climate, traffic, social-media-based attitudes and massive inventory locations.
• Exploration of data trends can include spatial proximities and relationships. Once spatial big data are structured, formal spatial analytics can be applied, such as spatial autocorrelation, overlays, buffering, spatial cluster techniques and location quotients.
5. Veracity : Big data must be fed with relevant and true data. We will not be able to perform useful analytics if much of the incoming data comes from false sources or has errors. Veracity refers to the level of trustworthiness or messiness of data; the higher the trustworthiness of the data, the lower the messiness, and vice versa. It relates to the assurance of the data's quality, integrity, credibility and accuracy. We must evaluate the data for accuracy before using it for business insights because it is obtained from multiple sources.

Compare Cloud Computing and Big Data

1. Cloud computing provides resources on demand. Big data provides a way to handle huge volumes of data and generate insights.
2. Cloud computing refers to internet services, from SaaS and PaaS to IaaS. Big data refers to data, which can be structured, semi-structured or unstructured.
3. Cloud is used to store data and information on remote servers. Big data is used to describe a huge volume of data and information.
4. Cloud computing is economical, as it has low maintenance costs, a centralized platform, no upfront cost and disaster recovery. Big data is a highly scalable, robust ecosystem and is cost-effective.
5. The main focus of cloud computing is to provide computer resources and services with the help of a network connection. The main focus of big data is solving problems when huge amounts of data are generated and processed.
6. Vendors and solution providers of cloud computing include Google, Amazon Web Services, Dell, Microsoft, Apple and IBM. Vendors and solution providers of big data include Cloudera, Hortonworks, Apache and MapR.

1.3 Unstructured Data

• Unstructured data is data that does not follow a specified format. Rows and columns are not used for unstructured data; therefore it is difficult to retrieve required information. Unstructured data has no identifiable structure. Examples of unstructured data are e-mails, click streams, textual data, images, log data and videos.
• In the case of unstructured data, size is not the only problem; deriving value or getting results out of unstructured data is much more complex and challenging compared to structured data.
• Unstructured data can be in the form of text (documents, email messages, customer feedback), audio, video and images. Email is an example of unstructured data. Even today, in most organizations, more than 80 % of the data is in unstructured form. This carries lots of information, but extracting information from these various sources is a very big challenge.
• Characteristics of unstructured data :
1. There is no structural restriction or binding for the data.
2. Data can be of any type.
3. Unstructured data does not follow any structural rules.
4. There are no predefined formats, restrictions or sequences for unstructured data.
5. Since there is no structural binding for unstructured data, it is unpredictable in nature.
• Examples of machine-generated unstructured data :
1. Satellite images : This includes weather data or the data that the government captures in its satellite surveillance imagery.
2. Scientific data : This includes atmospheric data and high-energy physics data.
3. Photographs and video : This includes security, surveillance and traffic video.
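The gap between structured and unstructured data described above can be made concrete with a small sketch: the same facts are trivial to read from a structured record but must be pattern-matched out of free text. This is only an illustration; the sample email, the field names and the regular expressions below are invented for the example.

```python
import re

# A structured record: fixed fields, trivial to query.
structured = {"sender": "alice@example.com", "subject": "Invoice", "amount": 120.50}

# An unstructured record: the same facts buried in free text.
email_body = """Hi team,
Please find the attached invoice (alice@example.com).
The total due is $120.50, payable by Friday.
"""

# Extracting the same fields from free text requires pattern matching,
# and the patterns can easily miss variants or match the wrong thing.
address = re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", email_body)
amount = re.search(r"\$(\d+(?:\.\d{2})?)", email_body)

print(address.group(0))        # alice@example.com
print(float(amount.group(1)))  # 120.5
```

This is why the text above calls deriving value from unstructured data "complex and challenging": the structure has to be recovered before any analysis can begin.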
There are no predefined formats, restriction or sequence for unstructured data, gf Nop Since there is no structural binding for unstructured data it is unpredictable in nature. Examples of machine generated unstructured data : . 1. Satellite images : This includes weather data or. the data that the government captures in its satellite surveillance imagery. np Scientific data : This includes atmospheric data and-high energy physics, 3. Photographs and video : This include security, surveillance and traffic video. TECHNICAL PUBLICATIONS® - an up-thrust for knowledge 1 t i 1-8 Understanding Big Day, Big Data Analytics Structured Data : © Structured data is arranged in rows and colum retrieve and process data easily. Database mana structured data. ; ; * Any data that can be stroed in the form of a particular fixed is known ay structured data. For example, data stored in the colums and rows of tables jn relational database management systems is a form of structured data. n format. It helps for application j, gement system is used for storing Difference between Structured and Unstructured Data Unstructured data stored Unstructured: data is data that "eis in discrete form. i in row and column format. does. not follow a specified - format. Syntax Semantics Database management system. Unmanaged file structure ‘ SQL, ADOnet, ODBC ___—_—_|_Open XML, SMTO, SMS ETL Batch processing or manual data entry. characteristics | With a structured document, In unstructured. document | | certain information. always information can appear in iB | appears in the same location on unexpected places on the | | the page, __ document, i | Used by organizations) | Low volume operations High volume operations Ha Industry Examples of Big Data + Big data plays an important role in digital marketing, Each day information shared digitally increases significantly. With the help of big data, marketers can analyze every action of the consumer. 
• Big data provides better marketing insights and helps marketers make more accurate and advanced marketing strategies.
• Reasons why big data is important for digital marketers :
a) Real-time customer insights
b) Personalized targeting
c) Increasing sales
d) Improving the efficiency of a marketing campaign
e) Budget optimization
f) Measuring a campaign's results more accurately.
• Data constantly informs marketing teams of customer behaviors and industry trends and is used to optimize future efforts, create innovative campaigns and build lasting relationships with customers.
• Big data regarding customers provides marketers details about user demographics, locations and interests, which can be used to personalize the product experience and increase customer loyalty over time.
• Big data solutions can help organize data and pinpoint which marketing campaigns, strategies or social channels are getting the most traction. This lets marketers allocate marketing resources and reduce costs for projects that are not yielding as much revenue or meeting desired audience goals.
• Personalized targeting : Nowadays, personalization is a key strategy for every marketer. Engaging customers at the right moment with the right message is the biggest issue for marketers. Big data helps marketers create targeted and personalized campaigns.
• Personalized marketing is creating and delivering messages to individuals or groups of the audience through data analysis, using consumer data such as geolocation, browsing history, clickstream behavior and purchasing history. It is also known as one-to-one marketing.
• Consumer insights : In this day and age, marketing has become the ability of a company to interpret data and change its strategies accordingly. Big data allows for real-time consumer insights, which is crucial to understanding the habits of your customers.
• By interacting with your consumers through social media, you will know exactly what they want and expect from your product or service, which will be key to distinguishing your campaign from your competitors'.
• Help increase sales : Big data will help with demand predictions for a product or service. Information gathered on user behaviour allows marketers to answer what types of product their users are buying, how often they conduct purchases or search for a product or service and, lastly, what payment methods they prefer using.
• Analyse campaign results : Big data allows marketers to measure their campaign performance. This is the most important part of digital marketing. Marketers use reports to measure any negative changes to marketing KPIs. If they have not achieved the desired results, it is a signal that the strategy needs to be changed in order to maximize revenue and make the marketing more scalable in future.

1.5 Web Analytics

• Web analytics is the collection, reporting and analysis of website data. The focus is on identifying measures based on your organizational and user goals, and using the website data to determine the success or failure of those goals, to drive strategy and to improve the user's experience.
• The WWW is an evolving system for publishing and accessing resources and services across the Internet. The web is an open system; its operations are based on freely published communication standards and document standards.
• Web analytics is important to help us to :
1. Refine your marketing campaigns
2. Understand your website visitors
3. Analyze website conversions
4. Improve the website user experience
5. Boost your search engine ranking
6. Understand and optimize referral sources
7. Boost online sales.
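Metrics such as website conversions and popular pages, listed above, are derived from the raw clickstream a site records. The following minimal Python sketch shows the idea on an invented toy event log (the visitor IDs, pages and the "/checkout means converted" rule are all assumptions for illustration).

```python
from collections import Counter

# Toy clickstream: (visitor_id, page) events, invented for illustration.
events = [
    ("v1", "/home"), ("v1", "/product"), ("v1", "/checkout"),
    ("v2", "/home"), ("v2", "/product"),
    ("v3", "/home"),
    ("v4", "/home"), ("v4", "/product"), ("v4", "/checkout"),
]

# Most popular pages: count views per page.
page_views = Counter(page for _, page in events)

# Conversion rate: share of distinct visitors who reached checkout.
visitors = {v for v, _ in events}
converted = {v for v, page in events if page == "/checkout"}
conversion_rate = len(converted) / len(visitors)

print(page_views.most_common(1))  # [('/home', 4)]
print(conversion_rate)            # 0.5
```

Real analytics platforms compute the same kinds of aggregates, only over billions of events and with sessionization, bot filtering and attribution layered on top.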
• Businesses use web analytics platforms to measure and benchmark site performance and to look at key performance indicators that drive their business, such as purchase conversion rate.
• Website analytics provide insights and data that can be used to create a better user experience for website visitors. Understanding customer behavior is also key to optimizing a website for key conversion metrics. For example, web analytics will show us the most popular pages on a website and the most popular paths to purchase. With website analytics, we can also accurately track the effectiveness of online marketing campaigns to help inform future efforts.
• Web analytics can help a digital marketer understand their customers better by providing :
1. Insight into who the customers are and their interests
2. Conversion challenges
3. Enhanced appreciation of what consumers like or do not like
4. Understanding of how to improve the user experience for the consumer.

1.6 Big Data Applications

• Big data applications can help companies make better business decisions by analyzing large volumes of data and discovering hidden patterns. These data sets might come from social media, data captured by sensors, website logs, customer feedback, etc. Organizations are spending huge amounts on big data applications to discover hidden patterns, unknown associations, market trends, consumer preferences and other valuable business information.
• Domains where big data can be applied include health care, media and entertainment, IoT, manufacturing and government.
• Relation between IoT and big data : Big data production in the industrial Internet of Things (IIoT) is evident due to the massive deployment of sensors and IoT devices. However, big data processing is challenging due to limited computational, networking and storage resources at the IoT device-end.
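One common response to the limited compute, network and storage at the IoT device-end described above is to pre-aggregate raw sensor readings locally and ship only compact summaries upstream. The sketch below illustrates that idea in plain Python; the readings, window size and summary fields are invented for the example, not a specific IoT protocol.

```python
# Edge pre-aggregation sketch: instead of transmitting every raw sample,
# a device collapses each window of readings into one small summary record.
def summarize(readings):
    """Collapse a window of raw samples into a compact summary record."""
    return {
        "count": len(readings),
        "min": min(readings),
        "max": max(readings),
        "mean": sum(readings) / len(readings),
    }

raw_window = [21.0, 21.4, 20.9, 35.2, 21.1]  # one spike among normal values
summary = summarize(raw_window)
print(summary)
```

Five raw samples become one four-field record; over millions of devices, this kind of reduction is what keeps the data volume reaching the cluster manageable, at the cost of losing per-sample detail.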
• Big Data Analytics (BDA) is expected to provide operational and customer-level intelligence in IIoT systems.
• The extensive installation of sensors on machines causes a massive increase in the volume of data collected within industrial processes. The data consist of operating data, error lists, the history of maintenance activities and the like. In combination with the related business data, this overall plethora of data provides the raw material for process optimizations and other applications. To set this potential for optimization free, the raw data needs to be processed systematically, passing through various algorithms.
• The results are prepared information with specific application objectives. Pattern detection is especially worth mentioning in this context, since this method identifies and quantifies cause-and-effect correlations and allows predictions of state changes. The significance of the information given out by the analysis depends on the amount of data processed.

1. Healthcare :
• Big data analytics for healthcare uses the health-related information of an individual or community to understand a patient, organization or community. In the past, managing and analyzing healthcare data was tedious and expensive. More recently, technology has helped the healthcare sector make leaps and bounds to keep up with the flow of big data in healthcare.
• Diagnostic devices, medical machinery, instrumentation, online services - sources such as these are transferring data throughout a healthcare network. This is done with the help of big data tools such as Hadoop and Spark.
• One of the most current and relevant big data examples in healthcare is how it has impacted the global coronavirus crisis. Big data analytics for healthcare supported the rapid development of COVID-19 vaccines: researchers can share data with each other to develop advanced medications very quickly.
• Big data in healthcare also predicted the spread of disease by allowing healthcare information to be processed much more rapidly than in the past during other pandemics.
• Smoother hospital administration : Healthcare administration becomes much smoother with the help of big data. It helps to reduce the cost of care measurement, provide the best clinical support and manage the population of at-risk patients. It also helps medical experts analyze data from diverse sources, and helps healthcare providers identify deviations among patients and the effects treatments have on their health.
• Fraud prevention and detection : Big data helps to prevent a wide range of errors on the side of health administrators, in the form of wrong dosages, wrong medicines and other human errors. It is also particularly useful to insurance companies, which can prevent a wide range of fraudulent insurance claims.
• Challenges of big data in healthcare : As a relatively new field, big data in healthcare is still evolving to keep up with the fast pace and changing nature of technology. With such vast amounts of data available to work with, organizations and leaders can struggle with knowing where and how to start with data analytics in healthcare to find the information that is meaningful.
• Many healthcare organizations lack adequate systems and databases and the skilled professionals to handle them. As such, the demand for healthcare analysts with advanced education and training is very high worldwide.

2. Manufacturing :
• Improving efficiency across the business helps a manufacturing company control costs, increase productivity and boost margins. Automated production lines are already standard practice for many, but manufacturing big data can exponentially improve line speed and quality.
• Manufacturing big data also increases transparency into the entire supply chain - for example, by using sensor and RFID data to track the location of tools, parts and inventory in real time, reducing interruptions and delays. Companies can also increase supply chain transparency by analyzing individual processes and their interdependencies for optimization opportunities.
• Speeding up assembly : Part of the key to manufacturing more products is simply to make the whole process quicker. With big data, manufacturers have been able to segment their production to identify which parts of the process go the fastest. Knowing which products are faster and easier to produce can help companies decide where to focus their efforts, perhaps even concentrating solely on those products for maximum production. It helps companies know where they are most efficient, with the added possibility of working on those areas that need the most improvement.
• AI-driven analysis of manufacturing big data enables companies to aggregate and analyze both their own and competitors' pricing and cost data to produce continually optimized price variants. For manufacturers that focus on build-to-order products, ML can also ensure the accuracy of their customized configurations and streamline the Configure-Price-Quote (CPQ) workflow.

1.7 Big Data Technologies

• Big data technology is defined as the technology and software utility designed for the analysis, processing and extraction of information from large sets of extremely complex structures and large data sets, which are very difficult for traditional systems to deal with. Big data technology is used to handle both real-time and batch-related data.
• This technology is primarily designed to analyze, process and extract information from large data sets with extremely complex structures, which traditional data processing software finds very difficult to handle.
• Big data technologies include Apache Hadoop, Apache Spark, MongoDB, Apache Cassandra, Plotly, Pig, Tableau, etc.
• Cassandra : Cassandra is one of the leading big data technologies among the top NoSQL databases. It is open-source, distributed and has extensive column storage options. It is freely available and provides high availability without fail.
• Apache Pig is a high-level scripting language used to execute queries for larger datasets within Hadoop.
• Apache Spark is a fast, in-memory data processing engine suitable for use in a wide range of circumstances. Spark can be deployed in several ways; it features Java, Python, Scala and R programming languages and supports SQL, streaming data, machine learning and graph processing, which can be used together in an application.
• MongoDB : MongoDB is another important component of big data technologies in terms of storage. Relational and RDBMS properties do not apply to MongoDB because it is a NoSQL database. It is not the same as traditional RDBMS databases that use structured query languages; instead, MongoDB stores schema-free documents.

1.8 Introduction to Hadoop

• Apache Hadoop is an open source framework used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data. Hadoop is designed to scale up from a single computer to thousands of clustered computers, with each machine offering local computation and storage. Hadoop is sometimes expanded as an acronym for High Availability Distributed Object Oriented Platform.
• The Hadoop framework consists of a storage layer known as the Hadoop Distributed File System (HDFS) and a processing framework called the MapReduce programming model. Hadoop splits large amounts of data into chunks, distributes them within the network cluster and processes them in its MapReduce framework.
• Hadoop can also be installed on cloud servers to better manage the compute and storage resources required for big data. Leading cloud vendors such as Amazon Web Services (AWS) and Microsoft Azure offer solutions. Cloudera supports Hadoop workloads both on-premises and in the cloud, including options for one or more public cloud environments from multiple vendors.
• Hadoop provides a distributed file system and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm. An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, executing application computations in parallel close to their data. A Hadoop cluster scales computation capacity, storage capacity and I/O bandwidth by simply adding commodity servers.
• Key features of Hadoop :
1. Cost-effective system
2. Large cluster of nodes
3. Parallel processing
4. Distributed data
5. Automatic failover management
6. Data locality optimization
7. Heterogeneous cluster
8. Scalability.
• Hadoop allows for the distribution of datasets across a cluster of commodity hardware. Processing is performed in parallel on multiple servers simultaneously. Software clients input data into Hadoop: HDFS handles metadata and the distributed file system, MapReduce then processes and converts the data and, finally, YARN divides the jobs across the computing cluster.
• All Hadoop modules are designed with the fundamental assumption that hardware failures of individual machines, or racks of machines, are common and should be automatically handled in software by the framework.
• Challenges of Hadoop - MapReduce complexity : As a file-intensive system, MapReduce can be a difficult tool to utilize for complex jobs, such as interactive analytical tasks.
• There are four main libraries in Hadoop :
1. Hadoop Common : This provides utilities used by all other modules in Hadoop.
2. Hadoop MapReduce : This works as a parallel framework for scheduling and processing the data.
3. Hadoop YARN : This is an acronym for Yet Another Resource Negotiator. It is an improved version of MapReduce and is used for processes running over Hadoop.
4. Hadoop Distributed File System (HDFS) : This stores data and maintains records over various machines or clusters. It also allows the data to be stored in an accessible format.

Hadoop Ecosystem

• The Hadoop ecosystem is neither a programming language nor a service; it is a platform or framework which solves big data problems.
• The Hadoop ecosystem refers to the various components of the Apache Hadoop software library, as well as to the accessories and tools provided by the Apache Software Foundation for these types of software projects, and to the ways they work together.
• Hadoop is a Java-based framework that is extremely popular for handling and analysing large sets of data. The idea of a Hadoop ecosystem involves the use of different parts of the core Hadoop set, such as MapReduce, a framework for handling vast amounts of data, and the Hadoop Distributed File System (HDFS), a sophisticated file-handling system. There is also YARN, a Hadoop resource manager.
• In addition to these core elements of Hadoop, Apache has also delivered other kinds of accessories or complementary tools for developers.
• Some of the most well-known tools of the Hadoop ecosystem include HDFS, Hive, Pig, YARN, MapReduce, Spark, HBase, Oozie, Sqoop, Zookeeper, etc.
• Fig. 1.8.1 shows the Apache Hadoop ecosystem: management and monitoring (Ambari); data integration (Sqoop); NoSQL (HBase); machine learning (Mahout); coordination (Zookeeper); workflow and scheduling (Oozie); distributed processing (MapReduce); and distributed storage (HDFS).

Fig. 1.8.1 Apache Hadoop ecosystem

• The Hadoop Distributed File System (HDFS) is one of the largest Apache projects and the primary storage system of Hadoop. It employs a NameNode and DataNode architecture. It is a distributed file system able to store large files running over a cluster of commodity hardware.
• YARN stands for Yet Another Resource Negotiator. It is one of the core components of open source Apache Hadoop, suitable for resource management. It is responsible for managing workloads, monitoring and security controls implementation.
• Hive is an ETL and data warehousing tool used to query or analyze large datasets stored within the Hadoop ecosystem. Hive has three main functions : data summarization, query, and analysis of unstructured and semi-structured data in Hadoop.
• MapReduce : It is the core processing component of the Hadoop ecosystem, as it provides the logic of processing. In other words, MapReduce is a software framework which helps in writing applications that process large data sets using distributed and parallel algorithms inside the Hadoop environment.
• Apache Pig is a high-level scripting language used to execute queries for larger datasets within Hadoop.
* Apache Spark is a fast, in-memory data processing engine suitable for use in a wide range of circumstances. Spark can be deployed in several ways; it features Java, Python, Scala and R programming languages and supports SQL, streaming data, machine learning and graph processing, which can be used together in an application.
* Apache HBase is a Hadoop ecosystem component : a distributed database designed to store structured data in tables that could have billions of rows and millions of columns. HBase is a scalable, distributed, NoSQL database built on top of HDFS. HBase provides real-time access to read or write data in HDFS.

Hadoop Advantages
1. Scalable : A Hadoop cluster can be extended by just adding nodes to the cluster.
2. Cost effective : Hadoop is open source and uses commodity hardware to store data, so it is really cost effective as compared to traditional relational database management systems.
3. Resilient to failure : HDFS has the property with which it can replicate data over the network.
4. Hadoop can handle unstructured as well as semi-structured data.
5. The unique storage method of Hadoop is based on a distributed file system that effectively maps data wherever the cluster is located.

1.9 Open Source Technologies
* Open source software is like any other software (closed/proprietary software). This software is differentiated by its use and licenses. Open source software guarantees the right to access and modify the source code and to use, reuse and redistribute the software, all with no royalty or other costs.
* Standard software is sold and supported commercially. However, open source software can be sold and/or supported commercially, too. Open source is a disruptive technology.
* Open source is an approach to the design, development and distribution of software, offering practical accessibility to software's source code.
* Open source licenses must permit non-exclusive commercial exploitation of the licensed work, must make available the work's source code and must permit the creation of derivative works from the work itself. Netscape, for example, released its browser code under the Netscape Public License and subsequently under the Mozilla Public License.
* Proprietary software is computer software which is the legal property of one party. The terms of use for other parties are defined by contracts or licensing agreements. These terms may include various privileges to share, alter, disassemble and use the software and its code.
* Closed source is a term for software whose license does not allow for the release or distribution of the software's source code. Generally, it means only the binaries of a computer program are distributed and the license provides no access to the program's source code. The source code of such programs is usually regarded as a trade secret of the company. Access to source code by third parties commonly requires the party to sign a non-disclosure agreement.

Need of open source
* The demands of consumers as well as enterprises are ever increasing with the increase in information technology usage. Information technology solutions are required to satisfy their different needs. It is a fact that a single solution provider cannot produce all the needed solutions. Open source, freeware and free software are now available for anyone and for any use.
* In the 1970s and early 1980s, software organizations started using technical measures to prevent computer users from being able to study and modify software. Copyright law was extended to computer programs in 1980.
* The free software movement was conceived in 1983 by Richard Stallman to satisfy the need for, and to give the benefit of, "software freedom" to computer users. Richard Stallman declared the idea of the GNU operating system in September 1983.
* The GNU Manifesto was written by Richard Stallman and published in March 1985.
* The Free Software Foundation (FSF) is a non-profit corporation started by Richard Stallman on 4 October 1985 to support the free software movement, a copyleft-based movement which aims to promote the universal freedom to distribute and modify computer software without restriction.
* In February 1986, the first formal definition of free software was published. The term "free software" is associated with the FSF's definition, and the term "open source software" is associated with the OSI's definition. The FSF's and OSI's definitions are worded quite differently, but the set of software that they cover is almost identical.
* One of the primary goals of this foundation was the development of a free and open computer operating system and application software that can be used and shared among different users with complete freedom.
* Open source differs from the operation of traditional copyright licensing by permitting both open distribution and open modification. Before the term open source became widely adopted, developers and producers used a variety of phrases to describe the concept. The term open source gained popularity with the rise of the Internet, which provided access to diverse production models, communication paths and, last but not least, interactive communities.
* Netscape licensed and released its code as open source under what became the Open Source Definition.

Successes of open source
* Successful open source projects make up many of today's most widely used technologies :
Operating systems : Linux, Symbian, GNU Project, NetBSD.
Servers : Apache, Tomcat, MediaWiki, Drupal, WordPress, Eclipse, Moodle, Joomla.
Programming languages : Java, JavaScript, PHP, Python, Ruby.
Client software : Mozilla Firefox, Mozilla Thunderbird, OpenOffice, Songbird, Audacity, 7-Zip.
Digital content : Wikipedia, Wiktionary, Project Gutenberg.

* Examples of open source and proprietary software :

Classification of software | Open source software | Proprietary software
Operating systems | Linux | MS Windows XP, Vista; Sun Solaris
Word processing and office applications | OpenOffice | MS Office, Adobe FrameMaker
Software development | Eclipse, JDK | MS Visual Studio, .NET
Multimedia content creation | GIMP | Adobe Photoshop
Web page design | Typo3 | MS FrontPage, Adobe Flash

Difference between Open Source and Open Standards
* Open source software is a type of software where the user has access to the software's source code and can freely use, modify and distribute the software. Thus open source concerns the code the software is made of.
* Open standards denote that the code responsible for communication with other systems is open and has technical specifications which are accessible free of charge. Thus open standards concern the communication between software.

Advantages of Open Source
1. The right to use the software in any way.
2. There is usually no license cost; it is free of cost.
3. The source code is open and can be modified freely.
4. It is possible to reuse the software in another context or with another public authority.
5. Open standards.
6. It provides higher flexibility.

Disadvantages of Open Source
1. There is no guarantee that development will happen.
2. It is sometimes difficult to know that a project exists, and its current status.
3. No secured follow-up development strategy.

Application of Open Source Software
Following is the list of applications where open source software is used :
1. Social networking    2. Multimedia
3. Animation            4. Accounting
5. Instant messaging    6. ERP
7. Desktop publishing   8. Website development
9. Resource management  10. Video editing

Comparison of Open Source with Close Source

Open source software | Close source / proprietary software
Source code freely available. | Source code is kept secret.
Modifications are allowed. | Modifications are not allowed.
Licensees may do their own development. | All upgrades, support, maintenance and development are done by the licensor.
Example : Wikipedia | Example : Microsoft Windows
Sublicensing is allowed. | Sublicensing is not allowed.
No guarantee of further development. | Guarantee of further development.
Fees, if any, are for integration, packaging, support and consulting. | Fees are for license, maintenance and upgradation.
Android OS is open source software provided by Google. | iOS is proprietary software provided by Apple.

1.10 Cloud and Big Data
* The NIST defines cloud computing as : "Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, three service models and four deployment models."
* The cloud provider is responsible for the physical infrastructure and the cloud consumer is responsible for application configuration, personalization and data.
* Broad network access refers to resources hosted in a cloud network that are available for access from a wide range of devices.
* Rapid elasticity is used to describe the capability to provide scalable cloud computing services.
* For measured service, NIST describes a setup where cloud systems may control a user or tenant's use of resources by a metering capability somewhere in the system.
* On-demand self-service refers to the service provided by cloud computing vendors that enables the provision of cloud resources on demand, whenever they are required.
* The Cloud Cube Model has four dimensions to differentiate cloud formations : a) External/Internal, b) Proprietary/Open, c) De-perimeterized/Perimeterized, d) Outsourced/Insourced.
* External/Internal : The physical location of data is defined by the external/internal dimension; it defines the organization's boundary. Example : Information inside a datacenter using a private cloud deployment would be considered internal, while data that resided on Amazon EC2 would be considered external.
* Proprietary/Open : Ownership is proprietary or open; this is a measure not only of ownership of technology but also of its interoperability, use of data, ease of data transfer and degree of vendor application lock-in. Proprietary means that the organization providing the service is keeping the means of provision under its own ownership. Clouds that are open use technology that is not proprietary, meaning that there are likely to be more suppliers.
* De-perimeterized/Perimeterized : Security ranges : measures whether the operations are inside or outside the security boundary, firewall, etc. Encryption and key management will be the technology means for providing data confidentiality and integrity in a de-perimeterized model.
* Outsourced/Insourced : Defines whether the customer or the service provider provides the service.
* Outsourced means the service is provided by a third party. It refers to letting contractors or service providers handle all requests; most cloud business models fall into this.
* Insourced means the services are provided by your own staff under organization control; insourced means in-house development of clouds.
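To make the four cube dimensions concrete, a deployment can be modeled as a tiny data structure. The class and method names below are hypothetical, chosen only for illustration of the model.

```python
from dataclasses import dataclass

# An illustrative data structure for the Cloud Cube Model's four
# dimensions; each flag selects one end of a dimension.

@dataclass(frozen=True)
class CloudFormation:
    internal: bool        # External/Internal : where the data physically sits
    proprietary: bool     # Proprietary/Open : ownership of the technology
    perimeterized: bool   # Perimeterized/De-perimeterized : inside the boundary?
    insourced: bool       # Outsourced/Insourced : who operates the service

    def describe(self):
        return "/".join([
            "Internal" if self.internal else "External",
            "Proprietary" if self.proprietary else "Open",
            "Perimeterized" if self.perimeterized else "De-perimeterized",
            "Insourced" if self.insourced else "Outsourced",
        ])

# A private cloud inside the corporate datacenter, run by internal staff:
private_cloud = CloudFormation(True, True, True, True)
print(private_cloud.describe())   # Internal/Proprietary/Perimeterized/Insourced
```

Classifying a deployment along all four dimensions in this way gives one of sixteen possible cloud formations.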
* Cloud computing is often described as a stack, as a response to the broad range of services built on top of one another under the "cloud". A cloud computing stack is a cloud architecture built in layers of one or more cloud-managed services (SaaS, PaaS, IaaS, etc.).
* Cloud computing stacks are used for all sorts of applications and systems. They are especially good for microservices and scalable applications, as each tier is dynamically scaling and replaceable.
* The cloud computing stack makes up a threefold system that comprises its lower-level elements. These components function as formalized cloud computing delivery models : a) Software as a Service (SaaS), b) Platform as a Service (PaaS), c) Infrastructure as a Service (IaaS).
* SaaS applications are designed for end-users and delivered over the web.
* PaaS is the set of tools and services designed to make coding and deploying those applications quick and efficient.
* IaaS is the hardware and software that powers it all, including servers, storage, networks and operating systems.

Difference between Cloud Computing and Big Data

Sr. No. | Cloud computing | Big data
1. | It provides resources on demand. | It provides a way to handle huge volumes of data and generate insights.
2. | It refers to internet services, from SaaS and PaaS to IaaS. | It refers to data, which can be structured, semi-structured or unstructured.
3. | Cloud is used to store data and information on remote servers. | It is used to describe huge volumes of data and information.
4. | Cloud computing is economical as it has low maintenance costs, a centralized platform, no upfront cost and disaster-safe implementation. | Big data is a highly scalable, robust and cost-effective ecosystem.
5. | Vendors and solution providers of cloud computing are Google, Amazon Web Services, Dell, Microsoft, Apple and IBM. | Vendors and solution providers of big data are Cloudera, Hortonworks, Apache and MapR.
6. | The main focus of cloud computing is to provide computing resources and services with the help of a network connection. | The main focus of big data is solving problems when huge amounts of data are being generated and processed.

Difference between Cloud Computing and Internet

Cloud computing | Internet
Cloud computing is a technology that delivers many types of resources over the Internet. | The Internet is a network of networks, which provides software/hardware infrastructure to establish and maintain connectivity of the computers around the world.
Cloud computing allows individuals and businesses to access on-demand computing resources and applications. | The Internet is interconnected with unique identifiers and can exchange data over a network with little or no human interaction.
Cloud computing cannot operate globally without the Internet. | The Internet operates without cloud computing.
Cloud computing is owned by a person, company, institution or government. | No single person, company, institution or government agency controls or owns the Internet.
Cloud computing is an application-based software infrastructure that stores data on remote servers, which can be accessed through the internet. | The Internet provides the software/hardware infrastructure to establish and maintain connectivity of the computers.

* The Internet is the enabling infrastructure; cloud computing is the promise of the utilization of that infrastructure to provide continuous services.

1.11 Mobile Business Intelligence
* Mobile Business Intelligence (BI) or mobile analytics is the rising software technology that allows users to access information and analytics on their phones and tablets instead of desktop-based BI systems.
* Mobile analytics involves measuring and analyzing data generated by mobile platforms and properties, such as mobile sites and mobile applications.
* Analytics is the practice of measuring and analyzing data of users in order to create an understanding of user behavior as well as a website's or application's performance. If this practice is applied to mobile apps and app users, it is called "mobile analytics".
* Mobile analytics is the practice of collecting user behavior data, determining intent from those metrics and taking action to drive retention, engagement and conversion.
* Mobile analytics is similar to web analytics in identifying unique customers and recording their usage.
* With mobile analytics data, you can improve your cross-channel marketing initiatives, optimize the mobile experience for your customers and grow mobile user engagement and retention.
* Analytics usually comes in the form of software that integrates into companies' existing websites and apps to capture, store and analyze the data. It is always very important for businesses to measure their critical KPIs (Key Performance Indicators), as the old rule is always valid : "If you can't measure it, you can't improve it."
* To be more specific, if a business finds out that 75 % of its users exit on the shipment screen of its sales funnel, probably there is something wrong with that screen in terms of its design, user interface (UI) or user experience (UX), or there is a technical problem preventing users from completing the process.

Working of Mobile Analytics :
* Most analytics tools need a library (an SDK) to be embedded into the mobile app's project code, and at minimum an initialization code, in order to track users and screens. SDKs differ by platform, so a different SDK is required for each platform such as iOS, Android, Windows Phone, etc. On top of that, additional code is required for custom event tracking.
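As a toy illustration of the kind of record such SDK calls build up, here is a minimal, hypothetical event tracker with a persistent device identifier and a simple session-timeout rule. All class and field names are invented for illustration; they are not any vendor's real API.

```python
import time
import uuid

# A minimal, hypothetical sketch of what an analytics SDK records per event.

class AnalyticsTracker:
    SESSION_TIMEOUT = 30   # seconds; mobile session timeouts can be this short

    def __init__(self):
        # Persistent unique identifier per device (instead of a web cookie).
        self.device_id = str(uuid.uuid4())
        self.events = []
        self.sessions = 0
        self._last_event_at = None

    def track(self, name, timestamp=None, **properties):
        now = time.time() if timestamp is None else timestamp
        # A new session starts when the gap since the last event exceeds the timeout.
        if self._last_event_at is None or now - self._last_event_at > self.SESSION_TIMEOUT:
            self.sessions += 1
        self._last_event_at = now
        self.events.append({"device_id": self.device_id, "event": name,
                            "ts": now, **properties})

t = AnalyticsTracker()
t.track("app_launch", timestamp=0)
t.track("tap_buy", timestamp=10, screen="checkout")   # same session (10 s gap)
t.track("app_launch", timestamp=100)                  # > 30 s gap: new session
print(len(t.events), t.sessions)   # 3 2
```

A real SDK would also batch these records and upload them to the vendor's servers; that transport layer is omitted here.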
* With the help of this SDK code, analytics tools track and count each user, tap, event, app crash or any additional information that the user device has, such as operating system, version, IP address (and probable location), app launch, etc.
* Unlike web analytics, mobile analytics tools don't depend on cookies to identify unique users, since mobile analytics SDKs can generate a persistent and unique identifier for each device.
* The tracking technology varies between websites, which use either JavaScript or cookies, and apps, which use a Software Development Kit (SDK).
* Each time a website or app visitor takes an action, the application fires off data which is recorded in the mobile analytics platform.

Difference between Mobile Analytics and Web Analytics

Sr. No. | Mobile analytics | Web analytics
1. | When a web site is used on mobile, the user is called a USER. Interaction with the site is called a SESSION. | When a web site is used, the user is called a VISITOR. Interaction with the site is called VISITS.
2. | On mobile, users have less screen real estate (4 to 7 inches) and interact by touching, swiping and holding. | On a desktop, users have larger screens (10 to 17 inches) and interact by clicking, double-clicking and using key commands.
3. | Session timeout may be as short as 30 seconds. | A session ends after 30 minutes of inactivity for websites.
4. | Unique users are identified via user IDs. | Cookies are used to identify users.

1.12 Crowd Sourcing Analytics
* Crowdsourcing is the process of exploring customers' ideas, opinions and thoughts available on the internet from large groups of people, aimed at incorporating innovation, implementing new ideas and eliminating product issues.
* Crowdsourcing means the outsourcing of human-intelligence tasks to a large group of unspecified people via the Internet.
* Crowdsourcing is all about collecting data from users through some services, ideas or content; it then needs to be stored on a server so that the necessary data can be provided to users whenever necessary.
* Most users nowadays use Truecaller to find unknown numbers and Google Maps to find out places and the traffic in a region. All these services are based on crowdsourcing.
* Crowdsourced data is a form of secondary data. Secondary data refers to data that is collected by any party other than the researcher. Secondary data provides important context for any investigation into a policy intervention.
* When crowdsourcing data, researchers collect plentiful, valuable and disperse data at a cost typically lower than that of traditional data collection methods.
* Consider the trade-offs between sample size and sampling issues before deciding to crowdsource data. Ensuring data quality means making sure the platform on which you are collecting crowdsourced data is well-tested.
* Crowdsourcing experiments are normally set up by asking a set of users to perform a task for a very small remuneration on each unit of the task. Amazon Mechanical Turk (AMT) is a popular platform that has a large set of registered remote workers who are hired to perform tasks such as data labeling.
* In data labeling tasks, crowd workers are randomly assigned a single item in the dataset. A data object may receive multiple labels from different workers and these have to be aggregated to get the overall true label.
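The label aggregation step mentioned above can be done, in its simplest form, by majority vote over the workers' labels. A minimal sketch:

```python
from collections import Counter

# Aggregate crowd labels by majority vote: each item's final label is
# simply the label most workers gave it.

def majority_label(labels):
    """Return the most common label for one item (ties broken arbitrarily)."""
    return Counter(labels).most_common(1)[0][0]

def aggregate(worker_labels):
    """worker_labels maps an item id to the labels given by different workers."""
    return {item: majority_label(labels) for item, labels in worker_labels.items()}

votes = {
    "img_1": ["cat", "cat", "dog"],
    "img_2": ["dog", "dog", "dog"],
}
print(aggregate(votes))   # → {'img_1': 'cat', 'img_2': 'dog'}
```

Majority voting treats every worker as equally reliable; more sophisticated aggregation schemes weight each worker by an estimate of their accuracy.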
* Crowdsourcing allows for many contributors to be recruited in a short period of time, thereby eliminating traditional barriers to data collection. Furthermore, crowdsourcing platforms usually employ their own tools to optimize the annotation process, making it easier to conduct time-intensive labeling tasks. Crowdsourced data is especially effective in generating complex and free-form labels, as in the case of audio transcription, sentiment analysis, image annotation or translation.
* With crowdsourcing, companies can collect information from customers and use it to their advantage. Brands gather opinions, ask for help, receive feedback to improve their product or service, and drive sales. For instance, Lego conducted a campaign where customers had the chance to develop their own toy designs and submit them.
* To become the winner, the creator had to receive the biggest number of people's votes. The best design was moved to the production process. Moreover, the winner got a privilege that amounted to a 1 % royalty on the net revenue.
* Types of crowdsourcing : There are four main types of crowdsourcing.
1. Wisdom of the crowd : A collective opinion of different individuals gathered in a group. This type is used for decision-making since it allows one to find the best solution for problems.
2. Crowd creation : This type involves a company asking its customers to help with new products. This way, companies get brand new ideas and thoughts that help a business stand out.
3. Crowd voting : A type of crowdsourcing where customers are allowed to choose a winner. They can vote to decide which of the options is the best for them. This type can be applied to different situations.
Consumers can choose one of the options provided by experts or products created by consumers.
4. Crowdfunding : When people collect money and ask for investments for charities, projects and startups without planning to return the money to the owners. People do it voluntarily. Often, companies gather money to help individuals and families suffering from natural disasters, poverty, social problems, etc.

1.13 Inter and Trans Firewall Analytics
* A firewall is a device designed to control the flow of traffic into and out of a network. In general, firewalls are installed to prevent attacks. A firewall can be a software program or a hardware device.
* Fig. 1.13.1 shows a firewall placed between the internal network (computer and router) and the Internet, with external computers and servers on the other side.

Fig. 1.13.1 Firewall

* Firewalls are software programs or hardware devices that filter the traffic that flows into a user PC or user network through an internet connection. They sift through the data flow and block that which they deem harmful to the user network or computer system.
* Firewalls filter based on IP, UDP and TCP information. A firewall is placed on the link between a network router and the Internet, or between a user and a router. For large organizations with many small networks, the firewall is placed on every connection attached to the Internet.
* Large organizations may use multiple levels of firewalls or distributed firewalls, locating a firewall at a single access point to the network.
* Firewalls test all traffic against consistent rules and pass traffic that meets those rules. Many routers support basic firewall functionality. A firewall can also be used to control data traffic.
* Firewall-based security depends on the firewall being the only connectivity to the site from outside; there should be no way to bypass the firewall via other gateways, wireless connections, etc.
* A firewall filters out all incoming messages addressed to a particular IP address or a particular TCP port number. It divides a network into a more trusted zone internal to the firewall and a less trusted zone external to the firewall.
* Firewalls may also impose restrictions on outgoing traffic, to prevent certain attacks and to limit losses if an attacker succeeds in getting access inside the firewall.
* Functions of firewall :
1. Access control : The firewall filters incoming as well as outgoing packets.
2. Address/Port translation : Using network address translation, internal machines, though not visible on the Internet, can establish a connection with external machines on the Internet. NATing is often done by the firewall.
3. Logging : Security architecture ensures that each incoming or outgoing packet encounters at least one firewall. The firewall can log all anomalous packets.
* Firewalls can protect the computer and user personal information from :
1. Hackers who break your system security.
2. Malware and other Internet hacker attacks, by preventing them from reaching your computer in the first place.
3. Outgoing traffic from your computer created by a virus infection.
* Firewalls cannot provide protection :
1. Against phishing scams and other fraudulent activity
2. Against viruses spread through e-mail
3. From physical access to your computer or network
4. For an unprotected wireless network.

Firewall Characteristics
1. All traffic from inside to outside, and vice versa, must pass through the firewall.
2. The firewall itself is resistant to penetration.
3. Only authorized traffic, as defined by the local security policy, will be allowed to pass.

Firewall Rules
* The rules and regulations are set by the organization. Policy determines the internal and external information resources employees can access, the kinds of programs they may install on their own computers, as well as their authority for reserving network resources. Policy is typically general and set at a high level within the organization. Policies that contain details generally become too much of a "living document".
* Users can create or disable firewall filter rules based on the following conditions :
1. IP addresses : The system admin can block a certain range of IP addresses.
2. Domain names : The admin can allow only certain specific domain names to access the systems, or allow access only to some specific types of domain names or domain name extensions.
3. Protocol : A firewall can decide which of the systems can allow or have access to common protocols like IP, SMTP, FTP, UDP, ICMP, Telnet or SNMP.
4. Ports : Blocking or disabling ports of servers that are connected to the internet will help maintain the kind of data flow you want to see it used for, and also close down possible entry points for hackers or malignant software.
5. Keywords : Firewalls can also sift through the data flow for a match of keywords or phrases, to block out offensive or unwanted data from flowing in.
* When your computer makes a connection with another computer on the network, several things are exchanged, including the source and destination ports. In a standard firewall configuration, most inbound ports are blocked. This would normally cause a problem with return traffic, since the source port is randomly assigned.
* A state is a dynamic rule created by the firewall containing the source-destination port combination, allowing the desired return traffic to pass the firewall.

Types of Firewall
1. Packet filter
2. Application level firewall
3. Circuit level gateway.

* Fig. 1.13.2 shows the relation between OSI layers and firewall types.
* A packet filter firewall controls access to packets on the basis of packet source and destination address or specific transport protocol type. This is done at the OSI data link, network and transport layers. The packet filter firewall works on the network layer of the OSI model.
* Packet filters do not see inside a packet; they block or accept packets solely on the basis of the IP addresses and ports. All incoming SMTP and FTP packets are parsed to check whether they should be dropped or forwarded. Outgoing SMTP and FTP packets, however, have already been screened by the gateway and do not have to be checked by the packet filtering router. A packet filter firewall only checks the header information.
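The header-only matching that packet filters perform can be sketched as an ordered rule list: each rule inspects only the source address, protocol and destination port, and the first matching rule decides. The rule contents below are illustrative.

```python
from ipaddress import ip_address, ip_network

# A minimal packet-filter sketch: header fields only, first match wins.

RULES = [
    # (action, source network, protocol, destination port); None means "any"
    ("deny",  ip_network("10.0.0.0/8"), None,  None),   # block a source IP range
    ("allow", None,                     "tcp", 25),     # incoming SMTP
    ("allow", None,                     "tcp", 21),     # incoming FTP
]
DEFAULT_ACTION = "deny"   # inbound traffic is blocked unless a rule allows it

def filter_packet(src_ip, protocol, dst_port):
    for action, network, proto, port in RULES:
        if network is not None and ip_address(src_ip) not in network:
            continue
        if proto is not None and proto != protocol:
            continue
        if port is not None and port != dst_port:
            continue
        return action   # first matching rule wins
    return DEFAULT_ACTION

print(filter_packet("10.1.2.3", "tcp", 25))      # deny  (blocked source range)
print(filter_packet("198.51.100.7", "tcp", 25))  # allow (SMTP)
print(filter_packet("198.51.100.7", "udp", 53))  # deny  (default)
```

Note that nothing here looks at the packet payload; a stateful firewall would additionally record source-destination port combinations so that return traffic can be matched against an existing state.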
