
UNIQ Technologies

Services | Development | Consultancy

Internship Report on Big Data

Submitted By: B. PAVITHRA

INTERNSHIP REPORT
Project Coordinator
UNIQ TECHNOLOGIES
CANDIDATE NAME : B. PAVITHRA

COLLEGE NAME : SRI VENKATESWARA COLLEGE OF ENGINEERING

DEPARTMENT : INFORMATION TECHNOLOGY

DOMAIN OF INTERNSHIP : BIG DATA

DURATION : 10 DAYS

PROJECT NAME : LIBRARY MANAGEMENT SYSTEM

STUDENT PROJECT GUIDE



TECHNOLOGY NAME: BIG DATA
Big data is a phrase used to describe a massive volume of both structured and
unstructured data that is so large that it is difficult to process using traditional
database and software techniques. In most enterprise scenarios the volume of data
is too big, it moves too fast, or it exceeds current processing capacity.

The 3Vs (volume, variety, velocity) are the three defining properties, or
dimensions, of big data. Volume refers to the amount of data, variety refers to the
number of types of data available, and velocity refers to the speed of data
processing. According to the 3Vs model, the challenges of big data management
result from the expansion of all three properties, rather than just volume alone,
i.e. the sheer amount of data to be managed.

DAY 1: ECLIPSE INSTALLATION AND JAVA BASIC PROGRAM USING SPLIT STRING
Trainer Name: VIGNESH
ECLIPSE INSTALLATION:
Eclipse is an integrated development environment (IDE) used in computer
programming, and it is the most widely used Java IDE. It contains a base
workspace and an extensible plug-in system for customising the environment.

Use of Eclipse:

Developed using Java, the Eclipse platform can be used to develop rich client
applications, integrated development environments and other tools. Eclipse can
be used as an IDE for any programming language for which a plug-in is available.

Java basic program using split string:

The method split() is used for splitting a string into substrings based on the given
delimiter/regular expression.

String[] split(String regex): Returns an array of strings after splitting the input
string based on the delimiting regular expression.

String[] split(String regex, int limit): The only difference between the above
variation and this one is that it limits the number of strings returned after the split.
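A minimal sketch of both variants, assuming a comma-delimited input string (the class and input names are illustrative, not taken from the training material):

public class SplitDemo {
    public static void main(String[] args) {
        String csv = "big,data,hadoop,java";

        // split(String regex): split on every comma
        String[] all = csv.split(",");         // ["big", "data", "hadoop", "java"]

        // split(String regex, int limit): at most 2 substrings are returned
        String[] limited = csv.split(",", 2);  // ["big", "data,hadoop,java"]

        for (String s : all) System.out.println(s);
        for (String s : limited) System.out.println(s);
    }
}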

DAY 2 & 3: LISTS, SETS AND KEYS PROGRAM
Trainer Name: VIGNESH
TREESET PROGRAM USING COMPARATOR():
TreeSet is similar to HashSet except that it sorts the elements in ascending
order, while HashSet does not maintain any order. HashSet allows a null element,
but TreeSet does not (with natural ordering it throws a NullPointerException when
it tries to compare null). Like most of the other collection classes, this class is
not synchronized; however, it can be wrapped using
Collections.synchronizedSortedSet(new TreeSet(..)).
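A short sketch of a TreeSet ordered by a Comparator, here sorting strings by descending length (the ordering rule is illustrative, not the report's actual program):

import java.util.Comparator;
import java.util.TreeSet;

public class TreeSetDemo {
    public static void main(String[] args) {
        // Order elements by descending length instead of natural (alphabetical) order.
        // Note: two strings of equal length compare as duplicates under this comparator.
        TreeSet<String> set = new TreeSet<>(Comparator.comparingInt(String::length).reversed());
        set.add("hadoop");
        set.add("big");
        set.add("data");
        System.out.println(set);   // [hadoop, data, big]
    }
}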

ARRAYLIST PROGRAM USING ITERATOR:

The ArrayList class implements the List interface and is based on an array data
structure. It is widely used because of the functionality and flexibility it offers.
Most developers choose ArrayList over an array as it is a very good alternative to
traditional Java arrays. ArrayList is a resizable-array implementation of the List
interface. It implements all optional list operations and permits all elements,
including null.

An Iterator enables you to cycle through a collection, obtaining or removing
elements. ListIterator extends Iterator to allow bidirectional traversal of the list,
and the modification of elements.

Before you can access a collection through an Iterator, you must obtain one. Each
of the collection classes provides an iterator() method that returns an Iterator
positioned at the start of the collection. Using this Iterator object you can access
each element in the collection, one element at a time.
THE METHODS DECLARED BY ITERATOR

boolean hasNext()

Returns true if there are more elements; otherwise, returns false.

Object next()

Returns the next element. Throws NoSuchElementException if there is no next
element.

void remove()

Removes the current element. Throws IllegalStateException if remove() is called
without a preceding call to next().
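A minimal sketch tying the two together: iterating an ArrayList and removing an element through the Iterator (the list contents are illustrative):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class IteratorDemo {
    public static void main(String[] args) {
        List<String> books = new ArrayList<>();
        books.add("Java");
        books.add("Hadoop");
        books.add("Pig");

        Iterator<String> it = books.iterator();
        while (it.hasNext()) {
            String title = it.next();
            if (title.equals("Pig")) {
                it.remove();   // safe removal during iteration; must follow a call to next()
            }
        }
        System.out.println(books);   // [Java, Hadoop]
    }
}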

DAY 4 & 5: HADOOP INSTALLATION AND UBUNTU INSTALLATION

Trainer Name: VIGNESH
Step 1:

Installing Oracle Java 8

Apache Hadoop is a Java framework, so we need Java installed on our machine to
run it over the operating system. Hadoop supports every Java version greater than
5 (i.e. Java 1.5), so here you can also try Java 6 or 7 instead of Java 8.

hadoop@hadoop:~$ sudo add-apt-repository ppa:webupd8team/java


hadoop@hadoop:~$ sudo apt-get update
hadoop@hadoop:~$ sudo apt-get install oracle-java8-installer
hadoop@hadoop:~$ sudo apt-get install oracle-java8-set-default

It will install Java on your machine at /usr/lib/jvm/java-8-oracle.

To verify your Java installation, run the following commands:

hadoop@hadoop:~$ java -version

java version "1.8.0_66"

hadoop@hadoop:~$ which javac

/usr/bin/javac

hadoop@hadoop:~$ readlink -f /usr/bin/javac

/usr/lib/jvm/java-8-oracle/bin/javac

Step 2:

Installing SSH

SSH (“Secure Shell”) is a protocol for securely accessing one machine from
another. Hadoop uses SSH to access slave nodes and to start and manage all
HDFS and MapReduce daemons.

hadoop@hadoop:~$ sudo apt-get install openssh-server openssh-client

Now that we have installed SSH on the Ubuntu machine, we will be able to
connect to this machine remotely as well as connect from it.

Configuring SSH
Once you have installed SSH on your machine, you can connect to other machines
or allow other machines to connect to this one. Since we have only this single
machine, we can try connecting to the same machine over SSH. To do this, we
need to append the generated RSA public key (i.e. id_rsa.pub) to the
authorized_keys file of this machine's SSH configuration with the following
commands:

# Generate ssh key for hduser account


hadoop@hadoop:~$ ssh-keygen -t rsa -P ""

## Copy id_rsa.pub to authorized keys from hduser


hadoop@hadoop:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

In case you are configuring SSH for another machine (i.e. from the master node to
a slave node), you have to update the above command with the hostname of the
slave machine.
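For example, a sketch of pushing the key to a slave host (the hostname slave1 is an assumption, not part of this training setup):

## Copy the public key into slave1's authorized_keys (slave1 is illustrative)

hadoop@hadoop:~$ ssh-copy-id hadoop@slave1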

Step 3:

Installation Steps

Download the latest Apache Hadoop source from the Apache mirrors

First you need to download Apache Hadoop 2.6.0 (i.e. hadoop-2.6.0.tar.gz) or a
later version from the Apache download mirrors. You can also try the stable
Hadoop release to get the latest features as well as recent bug fixes. Choose the
location where you want to place your Hadoop installation; I have chosen
/usr/local/hadoop. Assuming the downloaded archive has been extracted (e.g.
tar xzf hadoop-2.6.0.tar.gz), move it into place:

## Move hadoop-2.6.0 to hadoop folder

hadoop@hadoop:~$ sudo mkdir /usr/local/hadoop

hadoop@hadoop:~$ sudo mv hadoop-2.6.0/* /usr/local/hadoop

## Assign ownership of this folder to Hadoop user

hadoop@hadoop:~$ sudo chown $USER:$USER -R /usr/local/hadoop

## Create Hadoop temp directories for Namenode and Datanode

hadoop@hadoop:~$ sudo mkdir -p /usr/local/hadoop_tmp/hdfs/namenode


hadoop@hadoop:~$ sudo mkdir -p /usr/local/hadoop_tmp/hdfs/datanode

## Again assign ownership of this Hadoop temp folder to Hadoop user

hadoop@hadoop:~$ sudo chown $USER:$USER -R /usr/local/hadoop_tmp/

Step 4:



Update Hadoop configuration files

User profile : Update $HOME/.bashrc

## User profile : Update $HOME/.bashrc

hadoop@hadoop:~$ sudo gedit ~/.bashrc

## Update hduser configuration file by appending the


## following environment variables at the end of this file.

# -- HADOOP ENVIRONMENT VARIABLES START -- #


export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
# -- HADOOP ENVIRONMENT VARIABLES END -- #

## Reload the updated profile

hadoop@hadoop:~$ source ~/.bashrc
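Step 4's heading also covers Hadoop's own configuration files, which this copy of the report does not reproduce. A minimal single-node sketch, pointing at the temp directories created in Step 3 (the property names are standard Hadoop 2.x; the values are assumptions for this setup), edits core-site.xml and hdfs-site.xml:

## $HADOOP_HOME/etc/hadoop/core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

## $HADOOP_HOME/etc/hadoop/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop_tmp/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop_tmp/hdfs/datanode</value>
  </property>
</configuration>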

Step 5:

Format Namenode

hadoop@hadoop:~$ hdfs namenode -format

15/04/18 14:43:12 INFO util.ExitUtil: Exiting with status 0


15/04/18 14:43:12 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at laptop/192.168.1.1
************************************************************/

Start all Hadoop daemons

hadoop@hadoop:~$ start-dfs.sh

hadoop@hadoop:~$ start-yarn.sh

Instead of both of these commands you can also use start-all.sh, but it is now
deprecated, so it is not recommended for Hadoop operations.
Track/Monitor/Verify
Track/Monitor/Verify

Verify Hadoop daemons:

hadoop@hadoop:~$ jps

9026 NodeManager
7348 NameNode
9766 Jps
8887 ResourceManager
7507 DataNode
5647 SecondaryNameNode

Step 6:

Monitor the Hadoop ResourceManager and the Hadoop NameNode

If you wish to track Hadoop MapReduce as well as HDFS, you can try exploring
the Hadoop web views of the ResourceManager and the NameNode, which are
commonly used by Hadoop administrators. Open your default browser and visit
the following links.

hadoop@hadoop:~$ netstat -plten | grep java

For ResourceManager – http://localhost:8088

ResourceManager

For NameNode – http://localhost:50070

NameNode

Step 7:

Run Map-Reduce Jobs

Run word count example:

hadoop@hadoop:~$ cd /usr/local/hadoop

hadoop@hadoop:/usr/local/hadoop$ bin/hdfs dfs -mkdir /inputwords

hadoop@hadoop:/usr/local/hadoop$ bin/hdfs dfs -put /home/$USER/sample.txt /inputwords

hadoop@hadoop:/usr/local/hadoop$ bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /inputwords /outputwords

hadoop@hadoop:/usr/local/hadoop$ bin/hdfs dfs -cat /outputwords/*

hadoop@hadoop:/usr/local/hadoop$ cd

hadoop@hadoop:~$ stop-all.sh

DAY 6: WORD COUNT
Trainer Name: VIGNESH
To count the number of words present in the given file.
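The program itself survives only as a screenshot in the original report; for reference, a sketch of the canonical Hadoop 2.x WordCount (the standard Apache example, not necessarily the report's exact code):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);   // emit (word, 1) for every token
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();   // total the 1s per word
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}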

OUTPUT:


DAY 7: WORD COUNT OF A GIVEN CATEGORY OF YOUTUBE DATA
THE GIVEN YOUTUBE DATA:
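The Day 7 data and code likewise survive only as screenshots. A sketch of a category-count mapper, assuming a tab-separated YouTube dataset with the category in the fourth column (a common layout in training datasets, not confirmed by this report); the IntSumReducer from the Day 6 sketch can then total the per-category counts:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CategoryMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text category = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fields.length > 3) {              // skip malformed rows
            category.set(fields[3]);          // assumed: category is the 4th tab-separated column
            context.write(category, one);     // emit (category, 1)
        }
    }
}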

OUTPUT:


DAY 8: PARTITIONING

TRAINER NAME: VIGNESH

CODE:
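The partitioning code is again a screenshot; a minimal sketch of a custom Partitioner (a standard Hadoop 2.x pattern, with an illustrative rule rather than the report's actual one) routes keys starting with a-m to reducer 0 and the rest to reducer 1:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString();
        if (numPartitions < 2 || k.isEmpty()) return 0;  // single reducer or empty key: nothing to split
        char first = Character.toLowerCase(k.charAt(0));
        return (first >= 'a' && first <= 'm') ? 0 : 1;   // illustrative two-way split
    }
}

The partitioner would be wired into a job with job.setPartitionerClass(AlphabetPartitioner.class) and job.setNumReduceTasks(2).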

OUTPUT:


DAY 9 & 10: LIBRARY MANAGEMENT SYSTEM PROJECT
TRAINER NAME: VIGNESH

A library management system is a project that manages and stores book
information electronically according to the library's needs. The system helps both
the students and the library manager keep constant track of all the books available
in the library. It allows both the admin and the user to search for a desired book.
It also manages the information about the books the students have borrowed and
the books they have returned.

OUTPUT:

Verified.

Project Coordinator



UNIQ Technologies
Services | Development | Consultancy

Internship Report on Big Data

Submitted By: S. SURABI SRI DHANYA

INTERNSHIP REPORT
Project Coordinator
UNIQ TECHNOLOGIES
CANDIDATE NAME : S. SURABI SRI DHANYA
COLLEGE NAME : SRI VENKATESWARA COLLEGE OF ENGINEERING
DEPARTMENT : INFORMATION TECHNOLOGY
DOMAIN OF INTERNSHIP : BIG DATA
DURATION : 10 DAYS
PROJECT NAME : LIBRARY MANAGEMENT SYSTEM

STUDENT PROJECT GUIDE



UNIQ Technologies
Services | Development | Consultancy

Internship Report on Big Data

Submitted By: S. NIVETHA

INTERNSHIP REPORT
Project Coordinator
UNIQ TECHNOLOGIES
CANDIDATE NAME : S. NIVETHA
COLLEGE NAME : SRI VENKATESWARA COLLEGE OF ENGINEERING
DEPARTMENT : INFORMATION TECHNOLOGY
DOMAIN OF INTERNSHIP : BIG DATA
DURATION : 10 DAYS
PROJECT NAME : LIBRARY MANAGEMENT SYSTEM

STUDENT PROJECT GUIDE



UNIQ Technologies
Services | Development | Consultancy

Internship Report on Big Data

Submitted By: S. REKHA SRI

INTERNSHIP REPORT
Project Coordinator
UNIQ TECHNOLOGIES
CANDIDATE NAME : S. REKHA SRI
COLLEGE NAME : SRI VENKATESWARA COLLEGE OF ENGINEERING
DEPARTMENT : INFORMATION TECHNOLOGY
DOMAIN OF INTERNSHIP : BIG DATA
DURATION : 10 DAYS
PROJECT NAME : LIBRARY MANAGEMENT SYSTEM

STUDENT PROJECT GUIDE



UNIQ Technologies
Services | Development | Consultancy

Internship Report on Big Data

Submitted By: R. SAHITHI

INTERNSHIP REPORT
Project Coordinator
UNIQ TECHNOLOGIES
CANDIDATE NAME : R. SAHITHI
COLLEGE NAME : SRI VENKATESWARA COLLEGE OF ENGINEERING
DEPARTMENT : INFORMATION TECHNOLOGY
DOMAIN OF INTERNSHIP : BIG DATA
DURATION : 10 DAYS
PROJECT NAME : LIBRARY MANAGEMENT SYSTEM

STUDENT PROJECT GUIDE

