Data Analytics Using R: Unit 3

UNIT-3

1. What is big data, and why is data analytics needed?


Big data refers to large and complex datasets that are difficult to process
using traditional data management tools. These datasets typically have one
or more of the following characteristics:

1. Volume: Big data involves a large volume of data that exceeds the
processing capacity of conventional database systems. This could be
terabytes, petabytes, or even larger datasets.
2. Variety: Big data comes in various formats, including structured data
(like databases), unstructured data (such as text documents and social
media posts), and semi-structured data (like XML and JSON files).
3. Velocity: Big data is often generated at high speed and needs to be
processed quickly to extract valuable insights in a timely manner.
Examples include data streaming from sensors or social media feeds.
4. Veracity: Big data can have quality and accuracy issues due to its
diverse sources and complex nature. Data analytics techniques are
needed to clean, validate, and preprocess the data for analysis.
5. Value: Despite the challenges, big data contains valuable information
that can lead to insights, improvements in decision-making, and
competitive advantages for businesses and organizations.

Now, let's talk about the need for data analytics in R programming,
especially concerning big data:

1. Handling Large Datasets: R provides various packages and tools
(e.g., dplyr, data.table, sqldf) that allow users to efficiently handle and
manipulate large datasets, making it suitable for big data analytics.
2. Statistical Analysis: R is widely used for statistical analysis, making
it a valuable tool for exploring and analyzing large datasets to uncover
patterns, trends, correlations, and anomalies.
3. Machine Learning: R offers numerous libraries (e.g., caret,
randomForest, xgboost) for machine learning tasks, enabling users to
build predictive models, clustering algorithms, and other advanced
analytics solutions on big data.
4. Visualization: R has powerful visualization libraries such as ggplot2
and plotly (whose ggplotly() function turns ggplot2 charts into
interactive ones) that help in creating insightful visualizations and
dashboards to communicate findings from big data analysis effectively.
5. Integration: R can be integrated with big data technologies such as
Apache Hadoop, Spark, and databases like MySQL, PostgreSQL, and
NoSQL databases, allowing seamless data access and analysis across
different platforms.

2. Explain the mean, median, standard deviation, variance, and correlation
functions in R programming.

Mean
Madhusudanacharyulu Padakandla

The mean is calculated by summing the values and dividing by the number of
values in a data series.

The function mean() is used to calculate this in R.

Syntax

mean(x, trim = 0, na.rm = FALSE, ...)

# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Find Mean.
result.mean <- mean(x)
print(result.mean)
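
The trim and na.rm arguments shown in the syntax change what mean() computes. The following is a minimal sketch using the same vector; the variable names (plain_mean, trimmed_mean, and so on) are chosen here purely for illustration.

```r
# Same vector as in the example above.
x <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5)

# Ordinary arithmetic mean: sum(x) / length(x).
plain_mean <- mean(x)

# trim = 0.1 drops the lowest 10% and highest 10% of the sorted values
# (here one value from each end: -21 and 54) before averaging.
trimmed_mean <- mean(x, trim = 0.1)

# A single missing value makes the result NA unless na.rm = TRUE.
x_with_na <- c(x, NA)
mean_with_na <- mean(x_with_na)
mean_na_removed <- mean(x_with_na, na.rm = TRUE)

print(plain_mean)       # 8.22
print(trimmed_mean)     # 6.15
print(mean_with_na)     # NA
print(mean_na_removed)  # 8.22
```

Trimming is useful when a few extreme values (such as 54 and -21 here) would otherwise pull the mean away from the bulk of the data.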

Median

The median is the middle value of a sorted data series. When the series has
an even number of values, it is the average of the two middle values.


The median() function is used in R to calculate this value.
Syntax
median(x, na.rm = FALSE)

# Create the vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Find the median.
median.result <- median(x)
print(median.result)
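
To see what median() does with this vector, the value can also be worked out by hand. This is a sketch for illustration only; sorted_x, n, and manual_median are names introduced here, not part of R.

```r
x <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5)

# Sort the values so that "middle" is well defined.
sorted_x <- sort(x)
n <- length(sorted_x)

if (n %% 2 == 1) {
  # Odd count: take the single middle value.
  manual_median <- sorted_x[(n + 1) / 2]
} else {
  # Even count: average the two middle values (here 4.2 and 7).
  manual_median <- mean(sorted_x[c(n / 2, n / 2 + 1)])
}

print(manual_median)  # 5.6
print(median(x))      # 5.6
```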

Standard Deviation:

Standard deviation measures the amount of variation or dispersion in a set of values.
In R, the sd() function computes the sample standard deviation (it divides by n - 1).

# Create a sample vector
data <- c(10, 20, 30, 40, 50)
sd_result <- sd(data)

sd_result
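
The result of sd() can be checked against the sample standard deviation formula. The sketch below uses the same five numbers; the names values, squared_deviations, and manual_sd are introduced here for illustration.

```r
values <- c(10, 20, 30, 40, 50)
n <- length(values)

# Squared distance of each value from the mean (the mean is 30).
squared_deviations <- (values - mean(values))^2

# Sample standard deviation: divide by n - 1, then take the square root.
manual_sd <- sqrt(sum(squared_deviations) / (n - 1))

print(manual_sd)   # about 15.81
print(sd(values))  # same value
```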

Variance:
Variance is a measure of how spread out the values in a dataset are; it is the
square of the standard deviation. The var() function computes the sample variance.
# Calculate variance
variance_result <- var(data)
variance_result
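
As a quick check of how var() relates to sd(), and of the difference between sample and population variance, consider the sketch below. R has no built-in population variance function, so the rescaling step is written out; the variable names are illustrative.

```r
values <- c(10, 20, 30, 40, 50)
n <- length(values)

# Sample variance (denominator n - 1), as returned by var().
sample_variance <- var(values)

# The variance is the square of the standard deviation.
sd_squared <- sd(values)^2

# Population variance (denominator n), obtained by rescaling.
population_variance <- sample_variance * (n - 1) / n

print(sample_variance)      # 250
print(sd_squared)           # 250
print(population_variance)  # 200
```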

Correlation:
Correlation measures the strength and direction of the linear relationship
between two variables. By default, cor() returns the Pearson correlation
coefficient, which ranges from -1 to 1.

# Create two sample vectors
x <- c(1, 2, 3, 4, 5)
y <- c(3, 5, 7, 9, 11)
correlation_result <- cor(x, y)
correlation_result
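
In the example above, y is exactly 2 * x + 1, so the correlation is 1. The sketch below shows how the coefficient is built from cov() and sd(), and how it changes for other data; the vector z is a hypothetical one added here for contrast.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(3, 5, 7, 9, 11)
z <- c(2, 1, 4, 3, 5)  # hypothetical data, not from the notes

# Pearson correlation is the covariance scaled by both standard deviations.
manual_r <- cov(x, y) / (sd(x) * sd(y))

r_xy <- cor(x, y)    # perfect positive linear relationship
r_neg <- cor(x, -y)  # perfect negative linear relationship
r_xz <- cor(x, z)    # positive but imperfect relationship

print(manual_r)  # 1
print(r_xy)      # 1
print(r_neg)     # -1
print(r_xz)      # 0.8
```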

3. Explain the basic analysis techniques: chi-square test and t-test.


The chi-square test is a statistical method to determine whether two categorical
variables have a significant association. Both variables should come from the
same population and should be categorical, such as Yes/No, Male/Female, or
Red/Green.

For example, we can build a data set with observations on people's ice-cream
buying patterns and test whether a person's gender is associated with the
ice-cream flavor they prefer. If an association is found, we can plan an
appropriate stock of flavors based on the number of visitors of each gender.

Syntax:

The function used to perform a chi-square test is chisq.test().


chisq.test(data)

Example:

observed <- matrix(c(10, 20, 30, 40), nrow = 2, byrow = TRUE)


colnames(observed) <- c("Group A", "Group B")

rownames(observed) <- c("Category 1", "Category 2")

chi_square_result <- chisq.test(observed)

print(chi_square_result)
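
To make the test less of a black box, the statistic can be reproduced by hand from the expected counts. Note that for 2 x 2 tables chisq.test() applies Yates' continuity correction by default, so this sketch passes correct = FALSE to match the plain formula. The names expected, manual_statistic, and manual_p_value are introduced here for illustration.

```r
observed <- matrix(c(10, 20, 30, 40), nrow = 2, byrow = TRUE)

# Expected count for each cell under independence:
# (row total * column total) / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected.
manual_statistic <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1).
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
manual_p_value <- pchisq(manual_statistic, df = df, lower.tail = FALSE)

# Same test without the continuity correction, for comparison.
uncorrected <- chisq.test(observed, correct = FALSE)

print(expected)
print(manual_statistic)
print(manual_p_value)
print(uncorrected)
```

A large p-value here would mean the data give no evidence of an association between the row and column categories.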

T-test:
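In R, the t.test() function performs a t-test, a statistical test used to
determine whether there is a significant difference between the means of two
groups. When given two samples, it runs Welch's t-test by default, which does
not assume that the groups have equal variances.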
# Generate example data
group1 <- c(25, 30, 35, 40, 45)
group2 <- c(20, 22, 25, 28, 30)
t_test_result <- t.test(group1, group2)
print(t_test_result)
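
The numbers reported by t.test() can be reproduced from the Welch formulas. This is a sketch for illustration, assuming the default two-sided, unequal-variance test; se1_sq, se2_sq, manual_t, manual_df, and manual_p are names introduced here.

```r
group1 <- c(25, 30, 35, 40, 45)
group2 <- c(20, 22, 25, 28, 30)

# Squared standard error of each group mean: variance / sample size.
se1_sq <- var(group1) / length(group1)
se2_sq <- var(group2) / length(group2)

# Welch t statistic: difference in means over the combined standard error.
manual_t <- (mean(group1) - mean(group2)) / sqrt(se1_sq + se2_sq)

# Welch-Satterthwaite approximation for the degrees of freedom.
manual_df <- (se1_sq + se2_sq)^2 /
  (se1_sq^2 / (length(group1) - 1) + se2_sq^2 / (length(group2) - 1))

# Two-sided p-value from the t distribution.
manual_p <- 2 * pt(abs(manual_t), df = manual_df, lower.tail = FALSE)

welch_result <- t.test(group1, group2)

print(manual_t)
print(manual_df)
print(manual_p)
print(welch_result)
```

If the two groups can be assumed to have equal variances, t.test(group1, group2, var.equal = TRUE) runs the classic pooled-variance test instead.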
