URL: https://www.tecmint.com/play-with-word-and-character-counts-in-linux/

Fun in Linux Terminal - Play with Word and Character Counts

The Linux command line offers plenty of room for play, and many tedious tasks can be done on it quickly and accurately. In this article we play with words and characters in a text file and count how often each one occurs.

The first command that comes to mind for counting words and characters in a text file from the Linux command line is the wc command.

Fun with Word and Letter Counts in Shell

The ‘wc’ command, which stands for word count, prints the newline, word, and byte counts of a text file.
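
As a quick illustration of those three counts, here is a minimal sketch. The file name sample.txt and its contents are assumptions made up for this example; the -l, -w and -c options are the standard wc options for lines, words and bytes.

```shell
# sample.txt is a hypothetical file created only for this illustration.
printf 'hello world\nfoo bar baz\n' > sample.txt

wc -l < sample.txt   # number of newline characters
wc -w < sample.txt   # number of words
wc -c < sample.txt   # number of bytes
```

For this sample the three commands should report 2, 5 and 24. Reading the file through a redirect keeps wc from printing the file name next to each count.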

To try the small scripts below we need a text file to analyze. To keep the results uniform, we create one from the output of the man command, as shown below.

$ man man > man.txt

The above command creates a text file ‘man.txt’ containing the manual page of the ‘man’ command.

To find the most common words in the text file we just created, run the one-liner below.

$ cat man.txt | tr ' ' '\012' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | grep -v '[^a-z]' | sort | uniq -c | sort -rn | head
Sample Output
7557 
262 the 
163 to 
112 is 
112 a 
78 of 
78 manual 
76 and 
64 if 
63 be

The above one-liner shows the ten most frequent words in the text file along with how often each appears. The first entry, which has a count but no word, is the number of empty lines left over after splitting and filtering.
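
If the empty entry at the top of the list is unwanted, one possible variant is sketched below. It is not the same pipeline as above: it splits on every run of non-letters instead of deleting punctuation, so a word such as "don't" becomes "don" and "t". The function name top_words and its optional second argument are assumptions made for this example.

```shell
top_words() {
  # $1: file to analyze, $2: how many words to show (defaults to 10)
  tr -cs '[:alpha:]' '\n' < "$1" |   # turn each run of non-letters into one newline
    tr '[:upper:]' '[:lower:]' |     # count "The" and "the" together
    grep -v '^$' |                   # drop an empty first line, if any
    sort | uniq -c | sort -rn | head -n "${2:-10}"
}
```

Running top_words man.txt would then list the ten most frequent words, and top_words man.txt 20 the top twenty.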

How about breaking a word down into individual characters with the following command?

$ echo 'tecmint team' | fold -w1
Sample Output
t 
e 
c 
m 
i 
n 
t 
t 
e 
a 
m

Note: Here, ‘-w1’ sets the line width to one column, so fold prints one character per line.
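
To see what the width option does, here is a small sketch with a width of three instead of one. The input word is just an example.

```shell
# Wrap the input every three columns instead of every one.
printf 'tecmint\n' | fold -w3
```

This should print tec, min and t on three separate lines, since fold simply breaks the line each time the given width is reached.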

Now we break the whole text file down into single characters, sort the result, and count the occurrences to get the ten most frequent characters.

$ fold -w1 < man.txt | sort | uniq -c | sort -rn | head
Sample Output
8579 
2413 e
1987 a
1875 t
1644 i
1553 n
1522 o
1514 s
1224 r
1021 l
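
The entry with a count but no visible character at the top of this output is whitespace (spaces and empty lines), which fold passes through like any other character. A minimal sketch that leaves whitespace out of the count follows. The function name char_freq is an assumption made for this example, and the sketch assumes plain ASCII text, since fold may split multi-byte characters.

```shell
char_freq() {
  # $1: file to analyze (char_freq is a hypothetical helper name)
  tr -d '[:space:]' < "$1" |   # drop spaces, tabs and newlines
    fold -w1 |                 # one character per line
    sort | uniq -c | sort -rn | head
}
```

With this, char_freq man.txt lists the ten most frequent non-blank characters.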

How about getting the most frequent characters in the text file with uppercase and lowercase letters counted together, along with their frequency?

$ fold -w1 < man.txt | sort | tr '[:lower:]' '[:upper:]' | uniq -c | sort -rn | head -20
Sample Output
11636 
2504 E 
2079 A 
2005 T 
1729 I 
1645 N 
1632 S 
1580 O 
1269 R 
1055 L 
836 H 
791 P 
766 D 
753 C 
725 M 
690 U 
605 F 
504 G 
352 Y 
344 .

Check the above output, where a punctuation mark is included. Let's strip out the punctuation with the ‘tr’ command. Here we go:

$ fold -w1 < man.txt | tr '[:lower:]' '[:upper:]' | sort | tr -d '[:punct:]' | uniq -c | sort -rn | head -20
Sample Output
 11636 
 2504 E 
 2079 A 
 2005 T 
 1729 I 
 1645 N 
 1632 S 
 1580 O 
 1550 
 1269 R 
 1055 L 
 836 H 
 791 P 
 766 D 
 753 C 
 725 M 
 690 U 
 605 F 
 504 G 
 352 Y
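
One detail worth keeping in mind with pipelines like these is that uniq only merges identical lines that sit next to each other, so case folding and punctuation removal are safest done before sort. The sketch below follows that order and keeps letters only, which also removes the blank entries. The function name letter_freq is an assumption made for this example, and ASCII text is assumed.

```shell
letter_freq() {
  # Reads text on standard input (letter_freq is a hypothetical helper name).
  tr -cd '[:alpha:]' |              # keep letters only
    tr '[:lower:]' '[:upper:]' |    # fold case before sorting
    fold -w1 | sort | uniq -c | sort -rn
}
```

It can be used as letter_freq < man.txt | head -20 to get a list comparable to the one above.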

Now I have three text files; let's run the above one-liner on all of them to see the output.

$ cat *.txt | fold -w1 | tr '[:lower:]' '[:upper:]' | sort | tr -d '[:punct:]' | uniq -c | sort -rn | head -8
Sample Output
 11636 
 2504 E 
 2079 A 
 2005 T 
 1729 I 
 1645 N 
 1632 S 
 1580 O
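
The command above pools all the files into one count. If a separate result per file is more useful, a loop is one way to get it. The sketch below prints the single most frequent letter of each .txt file in the current directory; the function name top_letter is an assumption made for this example.

```shell
top_letter() {
  # Print the most frequent letter (case folded) in the file named by $1.
  tr -cd '[:alpha:]' < "$1" | tr '[:lower:]' '[:upper:]' |
    fold -w1 | sort | uniq -c | sort -rn | head -n 1 | awk '{print $2}'
}

for f in *.txt; do
  [ -f "$f" ] || continue          # skip if the glob matched nothing
  printf '%s: %s\n' "$f" "$(top_letter "$f")"
done
```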

Next we will list the infrequent words that are at least ten letters long. Here is the simple script.

$ cat man.txt | tr '' '\012' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | tr -d '[0-9]' | sort | uniq -c | sort -n | grep -E '..................' | head
Sample Output
1 ────────────────────────────────────────── 
1 a all 
1 abc any or all arguments within are optional 
1 able see setlocale for precise details 
1 ab options delimited by cannot be used together 
1 achieved by using the less environment variable 
1 a child process returned a nonzero exit status 
1 act as if this option was supplied using the name as a filename 
1 activate local mode format and display local manual files 
1 acute accent

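Note: Each dot in the grep pattern matches one character, so the pattern above requires one character per dot; add or remove dots until you get the results you want. We can write .{10} instead to match ten characters. Also note that the first tr in the script is given an empty set (''), so the text is evidently not split into words, which is why the sample output shows whole lines; use tr ' ' '\012', as in the first script, to split on spaces.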
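
Putting the .{10} suggestion to work, a sketch of a version that splits the text into words first might look like this. The function name rare_long_words is an assumption made for this example, and words are taken to be runs of letters, which differs slightly from the script above.

```shell
rare_long_words() {
  # $1: file to analyze; prints up to ten of the least frequent long words.
  tr -cs '[:alpha:]' '\n' < "$1" |   # one word per line
    tr '[:upper:]' '[:lower:]' |
    grep -E '.{10}' |                # keep lines holding ten or more characters
    sort | uniq -c | sort -n | head
}
```

Because the pattern is not anchored, .{10} matches any line that contains at least ten characters, so longer words are kept as well.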

These simple scripts also tell us which words and characters appear most frequently in English.

That’s all for now. I’ll be here again with another interesting, offbeat topic worth knowing, which you will love to read. Don’t forget to give us your valuable feedback in the comment section below.

Read Also: 20 Funny Commands of Linux

Avishek
A Passionate GNU/Linux Enthusiast and Software Developer with over a decade in the field of Linux and Open Source technologies.
