RedHat Enterprise Linux Essential
     Unit 7: Text Processing Tools
Objectives
Upon completion of this unit, you should be able to:

 Use tools for extracting, analyzing and manipulating
  text data
Tools for Extracting Text

 File Contents: less and cat

 File Excerpts: head and tail

 Extract by Column: cut

 Extract by Keyword: grep
Viewing File Contents
                             less and cat

 cat: dump one or more files to STDOUT

    Multiple files are concatenated together


 less: view file or STDIN one page at a time

    Useful commands while viewing:

       • /text searches for text

       • n/N jumps to the next/previous match

       • v opens the file in a text editor


 less is the pager used by man
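
As a small illustration of the concatenation behaviour described above, the sketch below builds two throwaway files (the names a.txt and b.txt are made up for this example) and captures what cat writes to STDOUT.

```shell
# Scratch directory with two small files (hypothetical sample data).
workdir=$(mktemp -d)
printf 'first file\n' > "$workdir/a.txt"
printf 'second file\n' > "$workdir/b.txt"

# cat writes the files to STDOUT one after another, in argument order.
combined=$(cat "$workdir/a.txt" "$workdir/b.txt")
printf '%s\n' "$combined"

# less is interactive, so it is only shown as a comment here:
#   cat "$workdir/a.txt" "$workdir/b.txt" | less
```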
Viewing File Excerpts
                               head and tail

 head: Display the first 10 lines of a file

    Use -n to change number of lines displayed


 tail: Display the last 10 lines of a file

    Use -n to change number of lines displayed


    Use -f to "follow" subsequent additions to the file
       • Very useful for monitoring log files!
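
A minimal sketch of the -n option for both commands, assuming a 20-line sample file generated with seq (the file name numbers.txt is invented for the example):

```shell
# Build a 20-line sample file in a scratch directory.
workdir=$(mktemp -d)
seq 1 20 > "$workdir/numbers.txt"

# First 3 lines instead of the default 10.
first_three=$(head -n 3 "$workdir/numbers.txt")

# Last 2 lines instead of the default 10.
last_two=$(tail -n 2 "$workdir/numbers.txt")

printf 'head -n 3:\n%s\n' "$first_three"
printf 'tail -n 2:\n%s\n' "$last_two"

# tail -f keeps running until interrupted (Ctrl-C), so it is shown
# here only as a comment:
#   tail -f /var/log/messages
```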
Extracting Text by Keyword
                             grep
 Prints lines of files or STDIN where a pattern is matched
       $ grep 'john' /etc/passwd

       $ date --help | grep year

 Use -i to search case-insensitively

 Use -n to print line numbers of matches

 Use -v to print lines not containing pattern

 Use -AX to include the X lines after each match

 Use -BX to include the X lines before each match
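
The options above can be tried on a small passwd-style file. The sketch below uses invented accounts in a scratch file rather than the real /etc/passwd, so the results are predictable.

```shell
# Sample data: four made-up passwd-style entries.
workdir=$(mktemp -d)
cat > "$workdir/users.txt" <<'EOF'
root:x:0:0:root:/root:/bin/bash
john:x:500:500:John Smith:/home/john:/bin/bash
mary:x:501:501:Mary Jones:/home/mary:/bin/bash
JOHNNY:x:502:502:Johnny Doe:/home/johnny:/bin/sh
EOF

# -i: match "john" regardless of case (john, John, JOHNNY).
ci_matches=$(grep -i 'john' "$workdir/users.txt")

# -n: prefix each matching line with its line number.
numbered=$(grep -n 'mary' "$workdir/users.txt")

# -v: print only the lines that do NOT contain the pattern.
no_bash=$(grep -v 'bash' "$workdir/users.txt")

# -A1: print each match plus the one line that follows it.
with_context=$(grep -A1 '^root' "$workdir/users.txt")

printf '%s\n' "$ci_matches" "$numbered" "$no_bash" "$with_context"
```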
Extracting Text by Column
                                 cut

 Display specific columns of file or STDIN data

  $ cut -d: -f1 /etc/passwd

  $ grep root /etc/passwd | cut -d: -f7


 Use -d to specify the column delimiter (default is TAB)

 Use -f to specify the column to print

 Use -c to cut by characters

  $ cut -c2-5 /usr/share/dict/words
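
To make the difference between field mode and character mode concrete, here is a sketch on a two-line sample file (passwd.sample is a made-up name). It also shows that -f accepts a comma-separated list of fields.

```shell
# Two made-up passwd-style lines.
workdir=$(mktemp -d)
cat > "$workdir/passwd.sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
john:x:500:500:John Smith:/home/john:/bin/sh
EOF

# Field 1 only, using ':' as the delimiter.
names=$(cut -d: -f1 "$workdir/passwd.sample")

# Fields 1 and 7; the delimiter is kept between the selected fields.
name_and_shell=$(cut -d: -f1,7 "$workdir/passwd.sample")

# Characters 2 through 5 of every line, ignoring delimiters.
chars=$(cut -c2-5 "$workdir/passwd.sample")

printf '%s\n' "$names" "$name_and_shell" "$chars"
```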
Tools for Analyzing Text

 Text Stats: wc

 Sorting Text: sort

 Comparing Files: diff and patch

 Spell Check: aspell
Gathering Text Statistics
                       wc (word count)
 Counts words, lines, bytes and characters

 Can act upon a file or STDIN

       $ wc story.txt

       39   237   1901 story.txt

 Use -l for only line count

 Use -w for only word count

 Use -c for only byte count

 Use -m for character count (not displayed by default)
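
A short sketch of the individual counters on a two-line sample file (its contents are invented here). Reading from STDIN keeps the file name out of the output, and tr strips the padding that some wc implementations add.

```shell
# Two lines, five words, 24 bytes.
workdir=$(mktemp -d)
printf 'one two three\nfour five\n' > "$workdir/story.txt"

# Each option restricts the output to a single counter.
lines=$(wc -l < "$workdir/story.txt" | tr -d ' ')
words=$(wc -w < "$workdir/story.txt" | tr -d ' ')
bytes=$(wc -c < "$workdir/story.txt" | tr -d ' ')

printf 'lines=%s words=%s bytes=%s\n' "$lines" "$words" "$bytes"
```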
Sorting Text sort

 Sorts text to STDOUT - original file unchanged

       $ sort [options] file(s)
 Common options
    -r performs a reverse (descending) sort

    -n performs a numeric sort

    -f ignores (folds) case of characters in strings

    -u (unique) removes duplicate lines in output

    -t c uses c as a field separator

    -k X sorts by c-delimited field X
       • Can be used multiple times
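
The sketch below combines -t, -k, -n and -r on a made-up name:score file. It also shows why -n matters: without it, the second field is compared as text, so 100 sorts before 25.

```shell
# Sample data: name and score separated by ':'.
workdir=$(mktemp -d)
cat > "$workdir/scores.txt" <<'EOF'
carol:9
alice:100
bob:25
EOF

# Text comparison of field 2: "100" < "25" < "9".
text_order=$(sort -t: -k2 "$workdir/scores.txt")

# Numeric comparison of field 2: 9 < 25 < 100.
by_score=$(sort -t: -k2 -n "$workdir/scores.txt")

# Same key, highest score first.
by_score_desc=$(sort -t: -k2 -n -r "$workdir/scores.txt")

printf '%s\n\n%s\n\n%s\n' "$text_order" "$by_score" "$by_score_desc"
```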
Eliminating Duplicate Lines
                        sort and uniq
 sort -u: removes duplicate lines from input

 uniq: removes duplicate adjacent lines from input
    Use -c to count number of occurrences

    Use with sort for best effect:

      $ sort userlist.txt | uniq -c
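
Because uniq only compares adjacent lines, it has no effect on unsorted input where the duplicates are interleaved. A minimal sketch with an invented userlist.txt:

```shell
# Duplicates are present but never adjacent.
workdir=$(mktemp -d)
printf 'anna\nbob\nanna\nbob\nanna\n' > "$workdir/userlist.txt"

# Without sorting, uniq removes nothing: still 5 lines.
unsorted_count=$(uniq "$workdir/userlist.txt" | grep -c '')

# After sorting, duplicates are adjacent: 2 distinct lines remain.
sorted_count=$(sort "$workdir/userlist.txt" | uniq | grep -c '')

# Count occurrences, then list the most frequent entry first.
top_user=$(sort "$workdir/userlist.txt" | uniq -c | sort -rn | head -n 1)

printf 'unsorted=%s sorted=%s top=%s\n' "$unsorted_count" "$sorted_count" "$top_user"
```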
Comparing Files
                              diff
 Compares two files for differences
      $ diff foo.conf-broken foo.conf-works
      5c5
      < use_widgets = no
      ---
      > use_widgets = yes
    Denotes a difference (change) on line 5

 Use gvimdiff for graphical diff
    Provided by vim-X11 package
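
To see how the change line is built, the sketch below recreates a two-line version of the two config files (their contents are assumptions for this example). Here the differing line is line 2, so diff reports 2c2 rather than 5c5.

```shell
# Two files that differ only on line 2.
workdir=$(mktemp -d)
printf 'name = demo\nuse_widgets = no\n' > "$workdir/foo.conf-broken"
printf 'name = demo\nuse_widgets = yes\n' > "$workdir/foo.conf-works"

# diff exits with status 1 when the files differ; that is expected here,
# so the status is ignored with "|| true".
diff_output=$(diff "$workdir/foo.conf-broken" "$workdir/foo.conf-works") || true

# "2c2" reads: line 2 of the first file was changed into line 2 of the second.
# Lines starting with "<" come from the first file, ">" from the second.
printf '%s\n' "$diff_output"
```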
Duplicating File Changes
                               patch
 diff output stored in a file is called a "patchfile"
    Use -u for "unified" diff, best in patchfiles

 patch duplicates changes in other files (use with care!)

 • Use -b to automatically back up changed files

  $ diff -u foo.conf-broken foo.conf-works > foo.patch

  $ patch -b foo.conf-broken foo.patch
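
The full round trip can be sketched as follows, again with invented file contents. It assumes the patch command is installed and that, as with GNU patch, the -b backup is written with an .orig suffix.

```shell
# A broken config and a working one.
workdir=$(mktemp -d)
printf 'name = demo\nuse_widgets = no\n' > "$workdir/foo.conf-broken"
printf 'name = demo\nuse_widgets = yes\n' > "$workdir/foo.conf-works"

# Record the differences as a unified diff. diff exits with status 1
# when the files differ, so the status is ignored.
diff -u "$workdir/foo.conf-broken" "$workdir/foo.conf-works" > "$workdir/foo.patch" || true

# Apply the recorded changes to the broken file, keeping a backup.
patch -b "$workdir/foo.conf-broken" "$workdir/foo.patch"

# The patched file should now be identical to the working one.
if cmp -s "$workdir/foo.conf-broken" "$workdir/foo.conf-works"; then
    patched=yes
else
    patched=no
fi
printf 'patched=%s\n' "$patched"
```

The backup copy (foo.conf-broken.orig under the assumption above) still holds the pre-patch content, which is what makes -b a sensible default when patching configuration files.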
Spell Checking with aspell

 Interactively spell-check files:
       $ aspell check letter.txt

 Non-interactively list misspelled words from STDIN

       $ aspell list < letter.txt

       $ aspell list < letter.txt | wc -l
Tools for Manipulating Text
                           tr and sed
 Alter (translate) Characters: tr
    Converts characters in one set to corresponding characters in another
     set
    Only reads data from STDIN

       $ tr 'a-z' 'A-Z' < lowercase.txt

 Alter Strings: sed
    stream editor

    Performs search/replace operations on a stream of text

    Normally does not alter source file

    Use -i.bak to back up and alter source file
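
A brief sketch of both tools on a one-line sample file (the contents are made up). It shows that sed leaves the source untouched unless -i is given, and that -i.bak keeps the original alongside the edited file.

```shell
workdir=$(mktemp -d)
printf 'hello world\n' > "$workdir/lowercase.txt"

# tr reads STDIN only, hence the input redirection.
upper=$(tr 'a-z' 'A-Z' < "$workdir/lowercase.txt")

# Plain sed prints the edited text; the file itself is not modified.
replaced=$(sed 's/world/there/' "$workdir/lowercase.txt")
after_plain_sed=$(cat "$workdir/lowercase.txt")

# -i.bak edits the file in place and saves the original as lowercase.txt.bak.
sed -i.bak 's/hello/goodbye/' "$workdir/lowercase.txt"
after_in_place=$(cat "$workdir/lowercase.txt")
backup=$(cat "$workdir/lowercase.txt.bak")

printf '%s\n' "$upper" "$replaced" "$after_plain_sed" "$after_in_place" "$backup"
```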
sed
                              Examples
 Quote search and replace instructions!

 sed addresses
    sed 's/dog/cat/g' pets

    sed '1,50s/dog/cat/g' pets

    sed '/digby/,/duncan/s/dog/cat/g' pets

 Multiple sed instructions
    sed -e 's/dog/cat/' -e 's/hi/lo/' pets

    sed -f myedits pets
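
The effect of each address form is easier to see on a small file. The sketch below assumes a five-line pets file and a two-line myedits script, both invented for this example.

```shell
workdir=$(mktemp -d)
cat > "$workdir/pets" <<'EOF'
abby has a dog
digby has a dog
fido has a dog
duncan has a dog
zoe has a dog
EOF

# Line-number address: only lines 1 and 2 are edited.
line_range=$(sed '1,2s/dog/cat/g' "$workdir/pets")

# Pattern address: from the first line matching digby
# through the next line matching duncan (lines 2 to 4).
pattern_range=$(sed '/digby/,/duncan/s/dog/cat/g' "$workdir/pets")

# Two instructions on the command line.
both=$(sed -e 's/dog/cat/' -e 's/has/owns/' "$workdir/pets")

# The same two instructions read from a script file.
printf 's/dog/cat/\ns/has/owns/\n' > "$workdir/myedits"
from_file=$(sed -f "$workdir/myedits" "$workdir/pets")

printf '%s\n\n%s\n\n%s\n' "$line_range" "$pattern_range" "$both"
```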
Introduction to awk

   Field/Column processor
   Supports egrep-compatible (POSIX) RegExes
   Can return full lines like grep
   Awk runs 3 steps:
     BEGIN - optional
     Body, where the main action(s) take place
     END - optional
 Multiple body actions can be executed by separating them using
  semicolons. e.g. '{ print $1; print $2 }'
 awk automatically loops through the input stream, regardless of its
  source (e.g. STDIN, pipe, file)
 Usage:
       awk '/optional_match/ { action }' file_name | Pipe
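
The three steps can be sketched in a single program. The sales.txt file and its region/amount layout are assumptions made for this example.

```shell
workdir=$(mktemp -d)
cat > "$workdir/sales.txt" <<'EOF'
north 10
south 20
north 5
EOF

# BEGIN runs once before any input is read.
# The body runs for every line that matches /north/.
# END runs once after the last line.
report=$(awk '
BEGIN   { print "region amount" }
/north/ { total += $2; print $1, $2 }
END     { print "north total:", total }
' "$workdir/sales.txt")

printf '%s\n' "$report"
```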
Example awk

 Print a text file
    awk '{print }' /etc/passwd

    awk '{print $0}' /etc/passwd

 Print specific field
    awk -F':' '{print $1}' /etc/passwd

 Pattern matching
    awk '$9 == 500 { print $0}' /var/log/httpd/access.log

 Print lines containing vmintam, student or khanh
    awk '/vmintam|student|khanh/' /etc/passwd
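
Field comparison and regex matching can be tried safely on a sample file instead of the real system files. The entries below are made up, and the UID threshold of 500 is only an illustration.

```shell
workdir=$(mktemp -d)
cat > "$workdir/passwd.sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
john:x:500:500:John Smith:/home/john:/bin/bash
mary:x:501:501:Mary Jones:/home/mary:/bin/bash
EOF

# Numeric comparison on field 3 (the UID), printing field 1 (the login).
regular_users=$(awk -F: '$3 >= 500 { print $1 }' "$workdir/passwd.sample")

# Regex alternation: lines containing root OR mary.
matched=$(awk -F: '/root|mary/ { print $1 }' "$workdir/passwd.sample")

printf '%s\n\n%s\n' "$regular_users" "$matched"
```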
Example awk (cont'd)

 Print the first line of a file
   awk 'NR==1{print;exit}' /etc/resolv.conf

 Simple arithmetic
   awk '{total += $1} END {print total}' earnings.txt

 The shell cannot do floating-point arithmetic, but awk can:
   awk 'BEGIN {printf "%.3f\n", 2005.50 / 3}'

 Count the most frequently used commands in the shell history:
   history | awk '{print $2}' | sort | uniq -c | sort -rn | head
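
The arithmetic examples can be checked with a small input file. The earnings.txt contents below are invented for the sketch.

```shell
workdir=$(mktemp -d)
printf '10.5\n20.25\n3\n' > "$workdir/earnings.txt"

# First line only; exit stops awk from reading the rest of the file.
first_line=$(awk 'NR==1 { print; exit }' "$workdir/earnings.txt")

# Sum of column 1, printed once all input has been read.
total=$(awk '{ total += $1 } END { print total }' "$workdir/earnings.txt")

# Floating-point division with three decimal places; no input file needed
# because everything happens in BEGIN.
third=$(awk 'BEGIN { printf "%.3f\n", 2005.50 / 3 }')

printf 'first=%s total=%s third=%s\n' "$first_line" "$total" "$third"
```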
Special Characters for Complex Searches
                 Regular Expressions
 ^ represents beginning of line

 $ represents end of line

 Character classes as in bash:
    [abc], [^abc]

    [[:upper:]], [^[:upper:]]

 Used by:
    grep, sed, less, others
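
A few grep calls illustrate the anchors and character classes listed above. The word list is made up for this sketch.

```shell
workdir=$(mktemp -d)
cat > "$workdir/words.txt" <<'EOF'
Apple
apple pie
pineapple
Banana
cab
EOF

# ^ anchors the match to the beginning of the line.
starts_with_apple=$(grep '^apple' "$workdir/words.txt")

# $ anchors the match to the end of the line.
ends_with_apple=$(grep 'apple$' "$workdir/words.txt")

# Lines whose first character is an uppercase letter.
upper_first=$(grep '^[[:upper:]]' "$workdir/words.txt")

# Lines whose first character is NOT an uppercase letter.
not_upper_first=$(grep '^[^[:upper:]]' "$workdir/words.txt")

# Lines made up only of the characters a, b and c.
abc_only=$(grep '^[abc]*$' "$workdir/words.txt")

printf '%s\n' "$starts_with_apple" "$ends_with_apple" "$upper_first" "$not_upper_first" "$abc_only"
```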
Unit 8 text processing tools
