
Big Data Features – Security

Dr. Anil Kumar Dubey
Associate Professor,
Computer Science & Engineering Department,
ABES EC, Ghaziabad
Affiliated to Dr. A.P.J. Abdul Kalam Technical University, Uttar Pradesh, Lucknow
Data Security
Big data security refers to the tools and measures used to guard both the data and the analytics processes that run on it.

Its main purpose is to protect valuable data against attacks, theft, and other malicious activity.
Conti…
Big data security challenges are multi-faceted for companies that operate in the cloud.

These threats include the theft of information stored online, ransomware, and DDoS attacks that could crash a server.

Such threats can cause serious financial repercussions for an organization, such as losses, litigation costs, and fines or sanctions.
Big Data Security Technologies
Encryption
Encryption is used to secure massive volumes of data of many different types.

The data may be user-generated or machine-generated.

Encryption is also applied to data from different sources, such as relational database management systems (RDBMS) and specialized file systems like the Hadoop Distributed File System (HDFS). A minimal sketch is shown below.
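A minimal sketch of the idea, assuming Python with the third-party cryptography package (pip install cryptography); the sample record and its fields are invented:

from cryptography.fernet import Fernet

# In production the key would come from a central key manager,
# not be generated inside the application itself.
key = Fernet.generate_key()
cipher = Fernet(key)

# Encrypt a record before it lands in HDFS or an RDBMS.
record = b'{"user_id": 42, "card": "4111-1111-1111-1111"}'
token = cipher.encrypt(record)      # ciphertext, safe to store at rest
print(cipher.decrypt(token))        # original bytes, recovered with the same key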
User Access Control
User access control is the most basic network security tool, yet few companies practice it well because it carries high management overhead; weak access control is dangerous at the network level and especially so for Big data platforms.

Automated, strong user access control is a must for organizations.

Automation manages complex, multi-level user permissions and protects the Big data platform against insider attacks. A hypothetical sketch follows.
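As an illustration only (the role names and policy here are hypothetical), a policy-driven role check in Python might look like this:

# Roles map to permitted actions, so access decisions are automated
# and policy-driven rather than hand-managed per user.
ROLE_PERMISSIONS = {
    "analyst":  {"read"},
    "engineer": {"read", "write"},
    "admin":    {"read", "write", "grant"},
}

def is_allowed(role: str, action: str) -> bool:
    """Return True only if the role's policy permits the action."""
    return action in ROLE_PERMISSIONS.get(role, set())

assert is_allowed("engineer", "write")
assert not is_allowed("analyst", "write")   # denied: shrinks the insider-attack surface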
Physical Security
Physical security is generally built in when you deploy the Big data platform in your own data center.

It can also rely on your cloud provider's data center security.

It matters because it can deny data center access to strangers or suspicious visitors. Video surveillance and security logs are used for the same purpose.
Centralized Key Management
Centralized key management is applied in Big data environments, especially those with wide geographical distribution.

Best practices include policy-driven automation, on-demand key delivery, logging, and abstracting key management from key usage, as in the sketch below.
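A hedged sketch of "abstracting key management from key usage": the application requests a named key on demand from a central service, and every request is logged. KeyServiceClient and its endpoint are hypothetical stand-ins, not any specific KMS product:

import logging

logging.basicConfig(level=logging.INFO)

class KeyServiceClient:
    """Hypothetical client for a centralized key-management service."""
    def __init__(self, endpoint: str):
        self.endpoint = endpoint              # e.g. a regional KMS near the data

    def get_key(self, key_id: str) -> bytes:
        # Policy-driven automation: the request is logged for audit,
        # and the caller never generates or stores keys itself.
        logging.info("key request: %s via %s", key_id, self.endpoint)
        return b"\x00" * 32                   # placeholder; a real KMS returns the key

kms = KeyServiceClient("https://kms.example.internal")
data_key = kms.get_key("orders-table-2024")   # on-demand delivery, centrally logged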
Conti…
Some of the companies offering Big data security products are:

Cloudwick
CDAP (Cloudwick Data Analytics Platform) is a managed security hub that integrates security features from multiple analytics toolsets and machine learning projects.

IBM
IBM Security Guardium is used to monitor Big data and NoSQL environments. It includes the discovery and classification of sensitive data.
Conti…
Logtrust
Logtrust has partnered with Panda Security to provide the ART (Advanced Reporting Tool) and Panda Adaptive Defense.

Gemalto
Gemalto SafeNet protects Big data platforms in the cloud, in the data center, and in virtual environments.
Big Data Security Use Cases
A. Cloud Security Monitoring
Cloud computing generally offers more efficient communication and increased profitability for businesses, and this communication needs to be secure.

Big data security offers cloud application monitoring: solutions host sensitive data, monitor cloud-hosted infrastructure, and offer support across several relevant cloud platforms.
B. Network Traffic Analysis
Due to the high volume of data crossing the network, it is difficult to maintain transactional visibility over network traffic.

Security analytics lets your enterprise watch over this traffic: it establishes baselines and detects anomalies, as in the sketch below.

It analyzes traffic into and out of cloud infrastructure, illuminates dark spaces hidden in the infrastructure, and inspects encrypted sensitive data, thus helping to ensure the network functions as intended.
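A minimal sketch of baselining and anomaly detection, with invented per-minute byte counts; real analytics would keep baselines per host and per protocol:

from statistics import mean, stdev

baseline = [1200, 1350, 1100, 1280, 1330, 1250, 1190]   # bytes/min, historical window
current = 5400                                          # latest observation

mu, sigma = mean(baseline), stdev(baseline)
z = (current - mu) / sigma                              # distance from the baseline
if z > 3:                                               # 3-sigma rule of thumb
    print(f"traffic anomaly: {current} bytes/min (z-score {z:.1f})")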
C. Insider Threat Detection
Insider threats are as much of a danger to your enterprise as external threats.

An active malicious user can do as much damage as any malware attack.

Only in rare cases, however, can an insider threat destroy a network.
D. Threat Hunting
Security analytics helps to automate threat hunting.

It acts as an extra set of eyes for your threat-hunting efforts.

Threat-hunting automation can detect malware beaconing activity and raise alerts so it can be stopped as soon as possible; a sketch of the idea follows.
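One hedged sketch of what such automation can look for: beaconing malware calls its server at near-constant intervals, so very low jitter in a host's outbound connection times is a classic hunting signal. The timestamps below are invented:

from statistics import mean, pstdev

timestamps = [0, 60, 120, 181, 240, 299, 360]    # seconds; one host, one destination

gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
jitter = pstdev(gaps) / mean(gaps)               # coefficient of variation of the gaps
if jitter < 0.1:                                 # near-clockwork callbacks = possible beacon
    print(f"possible beaconing: interval ~{mean(gaps):.0f}s, jitter {jitter:.2f}")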
E. Incident Investigation
The sheer number of security alerts from SIEM solutions would generally overwhelm an IT security team.

A constant stream of alerts fosters burnout and frustration.

To minimize this, security analytics automates incident investigation by adding context to alerts.

Your team then has more time to prioritize incidents and can deal with potential breaches first.
F. User Behaviour Analysis
An organization's users interact with its IT infrastructure all the time, and it is largely user behavior that decides the success or failure of your cybersecurity.

Security analytics monitors unusual behavior by employees.

One renowned example is UEBA (User and Entity Behavior Analytics), which provides visibility into the IT environment by compiling user activities from multiple datasets into complete profiles, as the sketch below illustrates.
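An illustrative sketch of the profile-building step only: events from several log sources are merged into one activity profile per user, against which unusual behavior can then be judged. The logs and the "copy:" rule are invented:

from collections import defaultdict

login_logs = [("alice", "login"), ("bob", "login")]
file_logs  = [("alice", "read:hr.csv"), ("bob", "read:payroll.db"),
              ("bob", "copy:payroll.db")]

profiles = defaultdict(list)
for source in (login_logs, file_logs):       # multiple datasets, one profile per user
    for user, event in source:
        profiles[user].append(event)

for user, events in profiles.items():
    if any(e.startswith("copy:") for e in events):   # deliberately simplistic rule
        print(f"review {user}: {events}")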
G. Data Exfiltration Detection
Data exfiltration is any unauthorized movement of data into or out of a network.

Unauthorized data movements can cause theft and leakage of data, so data must be protected from such unauthorized access.

Security analytics helps to detect data exfiltration over a network, including data leakage inside encrypted communications. One simple heuristic is sketched below.
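A minimal sketch of one such heuristic: flag hosts whose outbound volume in a window far exceeds a policy threshold. The addresses, volumes, and limit are invented; real detectors also weigh destinations, timing, and encrypted flows:

outbound_bytes = {"10.0.0.4": 2_000_000, "10.0.0.7": 950_000_000}  # bytes sent today
DAILY_LIMIT = 100_000_000                     # assumed per-host policy threshold

for host, sent in outbound_bytes.items():
    if sent > DAILY_LIMIT:
        print(f"possible exfiltration from {host}: {sent / 1e9:.1f} GB outbound")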
Big Data Security Issues
1. Access Controls
• It is critically important for an organization to have a fully secure system.
• Permission to exchange data should be granted to authenticated users only.
• Access control must be designed so that it cannot be defeated by attackers, hackers, or other malicious activity.

2. Non-relational data stores
• Non-relational databases, such as NoSQL stores, usually lack built-in security of their own.
Conti…
3. Storage
In Big data architectures, data is stored on multiple tiers, depending on business needs for performance and cost.
For example, high-priority data is generally stored on flash media, so locking down storage means creating a tier-conscious strategy.
Conti…
4. Endpoints
Security solutions that draw logs from endpoints need to validate the authenticity of those endpoints, or the analysis will not accomplish much.

5. Real-time security/compliance tools
Real-time tools generate large amounts of information.
The key is to find a way to ignore the false or noisy information, so that human talent can be focused on genuine threats.
Conti…
6. Data mining solutions
Data mining solutions find the patterns that suggest business strategies.
For this reason, they must be secured against both internal and external threats.
Big Data Privacy and Ethics
Private customer data and identity should remain
private
Shared private information should be treated
confidentially
Customers should have a transparent view
Big Data should not interfere with human will
Big data should not institutionalize unfair biases
Conti…
Private customer data and identity should remain private
Privacy does not mean secrecy; private data might need to be audited to meet legal requirements. But private data obtained from a person with their consent should not be exposed for use by other businesses or individuals in any way traceable to their identity.
Conti…
Shared private information should be treated confidentially
Third-party companies share sensitive data — medical, financial, or locational — and need restrictions on whether and how that information can be shared further.

Customers should have a transparent view
Customers should know how their data is being used or sold, and have the ability to manage the flow of their private information across massive, third-party analytical systems.
Conti…
Big Data should not interfere with human will
Big data analytics can moderate and even determine who we are before we make up our own minds. Companies need to begin thinking about which kinds of predictions and inferences should be allowed and which should not.

Big data should not institutionalize unfair biases
Machine learning algorithms can absorb unconscious biases present in a population and amplify them via their training samples.
Challenges of Conventional Systems
Three challenges that big data poses:
• Data volume
• Process
• Management
Conti…
The volume of data, especially machine-generated data, is exploding, and it grows faster every year as new data sources emerge.

For example, in the year 2000, 800,000 petabytes (PB) of data were stored in the world, and this was expected to reach 35 zettabytes (ZB) by 2020 (according to IBM).
Reporting v/s Analysis
• Reporting: The process of organizing data into informational summaries in order to monitor how different areas of a business are performing. Reporting translates raw data into information.

• Analysis: The process of exploring data and reports in order to extract meaningful insights, which can be used to better understand and improve business performance. Analysis transforms data and information into insights.
Conti…
Data reporting: gathering data into one place and presenting it in visual representations.

Data analysis: interpreting your data and giving it context.
Conti…
Analysis transforms data and information into insights.
Reporting helps companies monitor their online business and alerts them when data falls outside expected ranges.
Conti…

            Reporting                                Analysis
Purpose     Monitor and alert                        Interpret and recommend actions
Tasks       Build, configure, consolidate,           Question, examine, interpret,
            organize, format, summarize              compare, confirm
Outputs     Canned reports, dashboards, alerts       Ad hoc responses, analysis
                                                     presentations (findings +
                                                     recommendations)
Delivery    Accessed via tool, scheduled             Prepared and shared by analyst
            for delivery
Value       Distills data into information for       Provides deeper insights into the
            further analysis, alerts the company     business, offers recommendations
            to exceptions in the data                to drive action
Modern Data Analytic Tools
APACHE Hadoop
Hadoop is a Java-based open-source platform used to store and process big data.
It is built on a cluster system, which lets it process data efficiently and in parallel.
It can process both structured and unstructured data, spread from one server across multiple computers.
Hadoop also offers cross-platform support for its users.
Today it is among the best-known big data analytic tools and is used by many tech giants such as Amazon, Microsoft, and IBM. A sketch of its streaming model follows.
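With Hadoop Streaming, any executable can serve as the mapper or reducer, which is one common way to drive Hadoop from Python. A minimal word-count sketch under that assumption (Hadoop sorts the map output by key before it reaches the reducer):

import sys
from itertools import groupby

def mapper():
    for line in sys.stdin:                    # emit one (word, 1) pair per word
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so consecutive lines share a word.
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()

The cluster runs many copies of the mapper and reducer in parallel, one per input split, which is what "process data in parallel" means in practice.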
Conti…
Cassandra
• APACHE Cassandra is an open-source NoSQL distributed database used to fetch and manage large amounts of data.
• It is one of the most popular tools for data analytics and has been praised by many tech companies for its high scalability and availability without compromising speed and performance.
• It can deliver thousands of operations per second and can handle petabytes of data with almost zero downtime.
• It was created by Facebook in 2008 and later released publicly. A hedged usage sketch follows.
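A hedged sketch using the DataStax Python driver (pip install cassandra-driver), assuming a single Cassandra node on localhost; the keyspace, table, and values are invented:

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# CQL reads much like SQL; the partition key ("sensor") spreads rows
# across the cluster, which is where the scalability comes from.
session.execute("""CREATE KEYSPACE IF NOT EXISTS demo WITH replication =
    {'class': 'SimpleStrategy', 'replication_factor': 1}""")
session.execute("""CREATE TABLE IF NOT EXISTS demo.readings (
    sensor text, ts timestamp, value double, PRIMARY KEY (sensor, ts))""")

session.execute(
    "INSERT INTO demo.readings (sensor, ts, value) VALUES (%s, toTimestamp(now()), %s)",
    ("s1", 21.5))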
Conti…
Qubole
Qubole is an open-source big data tool that helps fetch data along the value chain using ad-hoc analysis and machine learning.
It is a data lake platform offering end-to-end services that reduce the time and effort required to move data pipelines.
It can be configured across multi-cloud services such as AWS, Azure, and Google Cloud.
Besides this, it also helps lower the cost of cloud computing.
Conti…
Xplenty
Xplenty is a data analytics tool for building data pipelines with minimal code.
With its interactive graphical interface, it provides solutions for ETL and similar workloads.
The best part of using Xplenty is its low investment in hardware and software, and it offers support via email, chat, telephone, and virtual meetings.
Xplenty is a platform for processing data for analytics in the cloud, bringing all the data together.
Conti…
Spark
APACHE Spark is another framework used to process data and perform numerous tasks at large scale.
It is widely used among data analysts because it offers easy-to-use APIs with simple data-pulling methods, and it can handle multiple petabytes of data.
Spark set a record by processing 100 terabytes of data in just 23 minutes, breaking Hadoop's previous world record of 71 minutes.
This is why big tech giants are moving toward Spark, and why it is highly suitable for ML and AI today. A minimal PySpark sketch follows.
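A minimal PySpark sketch (pip install pyspark); the tiny dataset is invented, but the same API scales from a laptop to a multi-petabyte cluster:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

df = spark.createDataFrame(
    [("web", 120), ("mobile", 340), ("web", 80)],
    ["channel", "sales"])

# The aggregation is planned by Spark and executed in parallel across workers.
df.groupBy("channel").sum("sales").show()

spark.stop()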
Conti…
MongoDB
MongoDB, which came into the limelight in 2010, is a free, open-source, document-oriented (NoSQL) database used to store high volumes of data.
It uses collections and documents for storage; a document consists of key-value pairs, which are the basic unit of MongoDB.
It is popular among developers thanks to its support for multiple programming languages such as Python, JavaScript, and Ruby. A minimal sketch follows.
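A minimal sketch of the document model with pymongo (pip install pymongo), assuming a MongoDB server on localhost; the database, collection, and fields are invented:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["demo"]["events"]            # database -> collection

# A document is a set of key-value pairs, MongoDB's basic unit of storage.
events.insert_one({"user": "alice", "action": "login", "ok": True})
print(events.find_one({"user": "alice"}))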
Conti…
Apache Storm
• Storm is a robust, user-friendly tool used for data analytics, especially in small companies.
• The best part about Storm is that it has no programming-language barrier and can support any of them.
• It was designed to handle pools of large data in a fault-tolerant, horizontally scalable way.
• For real-time data processing, Storm leads the chart thanks to its distributed real-time big data processing system, which is why many tech giants use APACHE Storm today. Some of the most notable names are Twitter, Zendesk, and NaviSite.
Conti…
SAS
SAS (Statistical Analysis System) is one of the best tools for statistical modeling used by data analysts.
Using SAS, a data scientist can mine, manage, extract, and update data in different variants from different sources.
SAS allows a user to access data in many formats (SAS tables or Excel worksheets).
Besides that, it also offers a cloud platform for business analytics called SAS Viya.
Conti…
Datapine
Datapine is an analytics tool used for BI, founded in 2012 in Berlin, Germany.
It is mainly used for data extraction, and small-to-medium companies use it to fetch data for close monitoring.
With its enhanced UI design, anyone can view and check the data as required. It is offered in four different price brackets, starting at $249 per month.
It also offers dashboards by function, industry, and platform.
Conti…
RapidMiner
RapidMiner is a fully automated visual-workflow design tool used for data analytics.
It is a no-code platform, so users are not required to write code to segregate data.
Although it is an open-source platform, it is limited to 10,000 data rows and a single logical processor.
With RapidMiner, one can easily deploy ML models to the web or mobile (once the user interface is ready to collect real-time figures).
Analytic Processes and Tools
• Big data analytics tools should be able to handle the volume, variety, and velocity of data.

• They should also be able to process data in real time or near-real time, so that decisions can be made on the most up-to-date information.

• Big Data analytics is a process used to extract meaningful insights, such as hidden patterns, unknown correlations, market trends, and customer preferences.

• Big Data analytics provides various advantages: it can be used for better decision making and for preventing fraudulent activities, among other things.
Conti…
Zoho Analytics is a powerful big data analytics tool that enables you to analyze massive data sets, whether in the cloud or on-premise.
It can connect to multiple data sources, including business applications, files and feeds, offline databases, cloud databases, and cloud drives.
It lets users create business dashboards and insightful reports using AI and ML technologies, and provides key business metrics on demand.
Lifecycle Phases of Big Data Analytics
Stage 1 - Business case evaluation: The Big Data analytics lifecycle begins with a business case, which defines the reason and goal behind the analysis.

Stage 2 - Identification of data: Here, a broad variety of data sources is identified.

Stage 3 - Data filtering: All of the data identified in the previous stage is filtered to remove corrupt data.
Conti…
Stage 4 - Data extraction: Data that is not compatible with the tool is extracted and transformed into a compatible form.

Stage 5 - Data aggregation: In this stage, data with the same fields across different datasets is integrated.

Stage 6 - Data analysis: Data is evaluated using analytical and statistical tools to discover useful information.
Conti…
Stage 7 - Visualization of data: With tools like Tableau, Power BI, and QlikView, Big Data analysts can produce graphic visualizations of the analysis.

Stage 8 - Final analysis result: This is the last step of the Big Data analytics lifecycle, where the final results of the analysis are made available to the business stakeholders who will act on them.
Different Types of Big Data Analytics
Descriptive Analytics
Descriptive analytics summarizes past data into a form that people can easily read. It helps in creating reports on a company's revenue, profit, sales, and so on, as well as in tabulating social media metrics.

Use Case: The Dow Chemical Company analyzed its past data to increase facility utilization across its office and lab space. Using descriptive analytics, Dow was able to identify underutilized space and consolidate it.
Conti…
Diagnostic Analytics
• Diagnostic analytics is done to understand what caused a problem in the first place. Techniques like drill-down, data mining, and data recovery are all examples. Organizations use diagnostic analytics because it provides in-depth insight into a particular problem.

• Use Case: An e-commerce company's report shows that sales have gone down even though customers are adding products to their carts. This could have various causes: the form didn't load correctly, the shipping fee is too high, or there are not enough payment options available. Diagnostic analytics can find the reason.
Conti…
Predictive Analytics
• This type of analytics looks at historical and present data to make predictions about the future. Predictive analytics uses data mining, AI, and machine learning to analyze current data and predict what comes next. It is used to predict customer trends, market trends, and so on.

• Use Case: PayPal determines what precautions it must take to protect clients against fraudulent transactions. Using predictive analytics, the company takes all the historical payment data and user behavior data and builds an algorithm that predicts fraudulent activity. An illustrative sketch of the idea follows.
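An illustrative sketch of the fraud-prediction idea with scikit-learn (pip install scikit-learn); the tiny dataset and its features are invented, not PayPal's:

from sklearn.linear_model import LogisticRegression

# Features per transaction: [amount_usd, transactions_in_last_hour]
X = [[20, 1], [35, 2], [5000, 9], [15, 1], [4200, 12], [60, 3]]
y = [0, 0, 1, 0, 1, 0]                       # 1 = labeled fraudulent in the past

model = LogisticRegression().fit(X, y)       # learn from historical payment data
print(model.predict([[4800, 10]]))           # flags the risky-looking transaction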
Conti…
Prescriptive Analytics
• This type of analytics prescribes the solution to a particular problem. Prescriptive analytics works with both descriptive and predictive analytics. Most of the time, it relies on AI and machine learning.

• Use Case: Prescriptive analytics can be used to maximize an airline's profit, for example by building an algorithm that automatically adjusts flight fares based on numerous factors, including customer demand, weather, destination, holiday seasons, and oil prices.
Big Data Analytics Tools
Here are some of the key big data analytics tools:
• Hadoop - helps in storing and analyzing data
• MongoDB - used on datasets that change frequently
• Talend - used for data integration and management
• Cassandra - a distributed database used to handle chunks of data
• Spark - used for real-time processing and analyzing large amounts of data
• Storm - an open-source real-time computational system
• Kafka - a distributed streaming platform used for fault-tolerant storage
Conti…
• R Programming - R is a free, open-source programming language and software environment for statistical computing and graphics.
• Datawrapper - an online data visualization tool for making interactive charts.
• Tableau Public - communicates the insights of the data through data visualization.
• Content Grabber - a data extraction tool, suitable for people with advanced programming skills.
THANK YOU
