Friday, November 2, 2012

Regular Expressions in Grep Command with 10 Examples

http://www.thegeekstuff.com/2011/01/regular-expressions-in-grep-command/


Regular expressions are used to search and manipulate the text, based on the patterns. Most of the Linux commands and programming languages use regular expression.
Grep command is used to search for a specific string in a file. Please refer our earlier article for 15 practical grep command examples.
You can also use regular expressions with grep command when you want to search for a text containing a particular pattern. Regular expressions search for the patterns on each line of the file. It simplifies our search operation.

This articles is part of a 2 article series.
This part 1 article covers grep examples for simple regular expressions. The future part 2 article will cover advanced regular expression examples in grep.
Let us take the file /var/log/messages file which will be used in our examples.

Example 1. Beginning of line ( ^ )

In grep command, caret Symbol ^ matches the expression at the start of a line. In the following example, it displays all the line which starts with the Nov 10. i.e All the messages logged on November 10.
$ grep "^Nov 10" messages.1
Nov 10 01:12:55 gs123 ntpd[2241]: time reset +0.177479 s
Nov 10 01:17:17 gs123 ntpd[2241]: synchronized to LOCAL(0), stratum 10
Nov 10 01:18:49 gs123 ntpd[2241]: synchronized to 15.1.13.13, stratum 3
Nov 10 13:21:26 gs123 ntpd[2241]: time reset +0.146664 s
Nov 10 13:25:46 gs123 ntpd[2241]: synchronized to LOCAL(0), stratum 10
Nov 10 13:26:27 gs123 ntpd[2241]: synchronized to 15.1.13.13, stratum 3
The ^ matches the expression in the beginning of a line, only if it is the first character in a regular expression. ^N matches line beginning with N.


Example 2. End of the line ( $)

Character $ matches the expression at the end of a line. The following command will help you to get all the lines which ends with the word “terminating”.
$ grep "terminating.$" messages
Jul 12 17:01:09 cloneme kernel: Kernel log daemon terminating.
Oct 28 06:29:54 cloneme kernel: Kernel log daemon terminating.
From the above output you can come to know when all the kernel log has got terminated. Just like ^ matches the beginning of the line only if it is the first character, $ matches the end of the line only if it is the last character in a regular expression.

Example 3. Count of empty lines ( ^$ )

Using ^ and $ character you can find out the empty lines available in a file. “^$” specifies empty line.
$ grep -c  "^$" messages anaconda.log
messages:0
anaconda.log:3
The above commands displays the count of the empty lines available in the messages and anaconda.log files.

Example 4. Single Character (.)

The special meta-character “.” (dot) matches any character except the end of the line character. Let us take the input file which has the content as follows.
$ cat input
1. first line
2. hi hello
3. hi zello how are you
4. cello
5. aello
6. eello
7. last line
Now let us search for a word which has any single character followed by ello. i.e hello, cello etc.,
$ grep ".ello" input
2. hi hello
3. hi zello how are you
4. cello
5. aello
6. eello
In case if you want to search for a word which has only 4 character you can give grep -w “….” where single dot represents any single character.

Example 5. Zero or more occurrence (*)

The special character “*” matches zero or more occurrence of the previous character. For example, the pattern ’1*’ matches zero or more ’1′.
The following example searches for a pattern “kernel: *” i.e kernel: and zero or more occurrence of space character.
$ grep "kernel: *." *
messages.4:Jul 12 17:01:02 cloneme kernel: ACPI: PCI interrupt for device 0000:00:11.0 disabled
messages.4:Oct 28 06:29:49 cloneme kernel: ACPI: PM-Timer IO Port: 0x1008
messages.4:Oct 28 06:31:06 btovm871 kernel:  sda: sda1 sda2 sda3
messages.4:Oct 28 06:31:06 btovm871 kernel: sd 0:0:0:0: Attached scsi disk sda
.
.
In the above example it matches for kernel and colon symbol followed by any number of spaces/no space and “.” matches any single character.

Example 6. One or more occurrence (\+)

The special character “\+” matches one or more occurrence of the previous character. ” \+” matches at least one or more space character.
If there is no space then it will not match. The character “+” comes under extended regular expression. So you have to escape when you want to use it with the grep command.
$ cat input
hi hello
hi    hello how are you
hihello

$ grep "hi \+hello" input
hi hello
hi    hello how are you
In the above example, the grep pattern matches for the pattern ‘hi’, followed by one or more space character, followed by “hello”.
If there is no space between hi and hello it wont match that. However, * character matches zero or more occurrence.
“hihello” will be matched by * as shown below.
$ grep "hi *hello" input
hi hello
hi    hello how are you
hihello
$

Example 7. Zero or one occurrence (\?)

The special character “?” matches zero or one occurrence of the previous character. “0?” matches single zero or nothing.
$ grep "hi \?hello" input
hi hello
hihello
“hi \?hello” matches hi and hello with single space (hi hello) and no space (hihello).
The line which has more than one space between hi and hello did not get matched in the above command.

Example 8.Escaping the special character (\)

If you want to search for special characters (for example: * , dot) in the content you have to escape the special character in the regular expression.
$ grep "127\.0\.0\.1"  /var/log/messages.4
Oct 28 06:31:10 btovm871 ntpd[2241]: Listening on interface lo, 127.0.0.1#123 Enabled

Example 9. Character Class ([0-9])

The character class is nothing but list of characters mentioned with in the square bracket which is used to match only one out of several characters.
$ grep -B 1 "[0123456789]\+ times" /var/log/messages.4
Oct 28 06:38:35 btovm871 init: open(/dev/pts/0): No such file or directory
Oct 28 06:38:35 btovm871 last message repeated 2 times
Oct 28 06:38:38 btovm871 pcscd: winscard.c:304:SCardConnect() Reader E-Gate 0 0 Not Found
Oct 28 06:38:38 btovm871 last message repeated 3 times
Repeated messages will be logged in messages logfile as “last message repeated n times”. The above example searches for the line which has any number (0to9) followed by the word “times”. If it matches it displays the line before the matched line and matched line also.
With in the square bracket, using hyphen you can specify the range of characters. Like [0123456789] can be represented by [0-9]. Alphabets range also can be specified such as [a-z],[A-Z] etc. So the above command can also be written as
$ grep -B 1 "[0-9]\+ times" /var/log/messages.4

Example 10. Exception in the character class

If you want to search for all the characters except those in the square bracket, then use ^ (Caret) symbol as the first character after open square bracket. The following example searches for a line which does not start with the vowel letter from dictionary word file in linux.
$ grep -i  "^[^aeiou]" /usr/share/dict/linux.words
1080
10-point
10th
11-point
12-point
16-point
18-point
1st
2
First caret symbol in regular expression represents beginning of the line. However, caret symbol inside the square bracket represents “except” — i.e match except everything in the square bracket.

---------
In our previous regular expression part 1 article, we reviewed basic reg-ex with practical examples.
But we can do much more with the regular expressions. You can often accomplish complex tasks with a single regular expression instead of writing several lines of codes.

When applying a regex to a string, the regex engine will start at the first character of the string. It will try all possible permutations of the regular expression at the first character. Only if all possibilities have been tried and found to fail, will the regex engine continue with the second character in the text.
The regex will try all possible permutations of the regex, in exactly the same order. The result is that the regex-directed engine will return the leftmost match.
In this article, let us review some advanced regular expression with examples.

Example 1. OR Operation (|)

Pipe character (|) in grep is used to specify that either of two whole subexpressions occur in a position. “subexpression1|subexpression2″ matches either subexpression1 or subexpression2.
The following example will remove three various kind of comment lines in a file using OR in a grep command.
First, create a sample file called “comments”.
$ cat comments
This file shows the comment character in various programming/scripting languages
### Perl / shell scripting
If the Line starts with single hash symbol,
then its a comment in Perl and shell scripting.
' VB Scripting comment
The line should start with a single quote to comment in VB scripting.
// C programming single line comment.
Double slashes in the beginning of the line for single line comment in C.
The file called “comments” has perl,VB script and C programming comment lines. Now the following grep command searches for the line which does not start with # or single quote (‘) or double front slashes (//).
$ grep  -v "^#\|^'\|^\/\/" comments
This file shows the comment character in various programming/scripting languages
If the Line starts with single hash symbol,
then its a comment in Perl and shell scripting.
The line should start with a single quote to comment in VB scripting.
Double slashes in the beginning of the line for single line comment in C.

Example 2. Character class expression

As we have seen in our previous regex article example 9, list of characters can be mentioned with in the square brackets to match only one out of several characters. Grep command supports some special character classes that denote certain common ranges. Few of them are listed here. Refer man page of grep to know various character class expressions.
[:digit:]  Only the digits 0 to 9
[:alnum:]  Any alphanumeric character 0 to 9 OR A to Z or a to z.
[:alpha:]  Any alpha character A to Z or a to z.
[:blank:]  Space and TAB characters only.
These are always used inside square brackets in the form [[:digit:]]. Now let us grep all the process Ids of ntpd daemon process using appropriate character class expression.
$ grep -e "ntpd\[[[:digit:]]\+\]" /var/log/messages.4
Oct 28 11:42:20 gstuff1 ntpd[2241]: synchronized to LOCAL(0), stratum 10
Oct 28 11:42:20 gstuff1 ntpd[2241]: synchronized to 15.11.13.123, stratum 3
Oct 28 12:33:31 gstuff1 ntpd[2241]: synchronized to LOCAL(0), stratum 10
Oct 28 12:50:46 gstuff1 ntpd[2241]: synchronized to 15.11.13.123, stratum 3
Oct 29 07:55:29 gstuff1 ntpd[2241]: time reset -0.180737 s

Example 3. M to N occurences ({m,n})

A regular expression followed by {m,n} indicates that the preceding item is matched at least m times, but not more than n times. The values of m and n must be non-negative and smaller than 255.
The following example prints the line if its in the range of 0 to 99999.
$ cat  number
12
12345
123456
19816282

$ grep  "^[0-9]\{1,5\}$" number
12
12345
The file called “number” has the list of numbers, the above grep command matches only the number which 1 (minimum is 0) to 5 digits (maximum 99999).
Note: For basic grep command examples, read 15 Practical Grep Command Examples.

Example 4. Exact M occurence ({m})

A Regular expression followed by {m} matches exactly m occurences of the preceding expression. The following grep command will display only the number which has 5 digits.
$ grep  "^[0-9]\{5\}$" number
12345

Example 5. M or more occurences ({m,})

A Regular expression followed by {m,} matches m or more occurences of the preceding expression. The following grep command will display the number which has 5 or more digits.
$ grep "[0-9]\{5,\}" number
12345
123456
19816282
Note: Did you know that you can use bzgrep command to search for a string or a pattern (regular expression) on bzip2 compressed files.

Example 6. Word boundary (\b)

\b is to match for a word boundary. \b matches any character(s) at the beginning (\bxx) and/or end (xx\b) of a word, thus \bthe\b will find the but not thet, but \bthe will find they.
# grep -i "\bthe\b" comments
This file shows the comment character in various programming/scripting languages
If the Line starts with single hash symbol,
The line should start with a single quote to comment in VB scripting.
Double slashes in the beginning of the line for single line comment in C.

Example 7. Back references (\n)

Grouping the expressions for further use is available in grep through back-references. For ex, \([0-9]\)\1 matches two digit number in which both the digits are same number like 11,22,33 etc.,
# grep -e '^\(abc\)\1$'
abc
abcabc
abcabc
In the above grep command, it accepts the input the STDIN. when it reads the input “abc” it didnt match, The line “abcabc” matches with the given expression so it prints. If you want to use Extended regular expression its always preferred to use egrep command. grep with -e option also works like egrep, but you have to escape the special characters like paranthesis.
Note: You can also use zgrep command to to search inside a compressed gz file.

Example 8. Match the pattern “Object Oriented”

So far we have seen different tips in grep command, Now using those tips, let us match “object oriented” in various formats.
$ grep "OO\|\([oO]bject\( \|\-\)[oO]riented\)"
The above grep command matches the “OO”, “object oriented”, “Object-oriented” and etc.,

Example 9. Print the line “vowel singlecharacter samevowel”

The following grep command print all lines containing a vowel (a, e, i, o, or u) followed by a single character followed by the same vowel again. Thus, it will find eve or adam but not vera.
$ cat input
evening
adam
vera

$ grep "\([aeiou]\).\1" input
evening
adam

Example 10. Valid IP address

The following grep command matches only valid IP address.
$ cat input
15.12.141.121
255.255.255
255.255.255.255
256.125.124.124

$ egrep  '\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)' input
15.12.141.121
255.255.255.255
In the regular expression given above, there are different conditions. These conditioned matches should occur three times and one more class is mentioned separately.
  1. If it starts with 25, next number should be 0 to 5 (250 to 255)
  2. If it starts with 2, next number could be 0-4 followed by 0-9 (200 to 249)
  3. zero occurence of 0 or 1, 0-9, then zero occurence of any number between 0-9 (0 to 199)
  4. Then dot character


No comments:

Post a Comment