
                  REGULAR EXPRESSIONS - PART I

A Primer                                                          page 1
(c) C. Geigner, President, UHACC - October 2003
========================================================================
Regular Expressions may well be one of the most useful tools you can
master when writing code. Perhaps beginners sense the magnitude of the
power when glancing at some code - and then turn tail and bail. Yeah, I
know. It looks like hieroglyphics when you first look at it, but I 
promise that if you start with a few elements and work your way in 
gradually, soon enough you'll be whipping out some hieroglyphics of your
own. Now don't get me wrong, the point here is not to confuse your boss
(although an interesting side-effect) or obfuscate. We aim to write
concise, tight expressions. This is not as hard as you would think, since
the fun thing about regex is that no matter how I show it to you, there 
are always other ways to do "it". You will notice that I also word my 
explanations in a particular way, almost too literally: "Expression X 
matches occurrence of input characters in such-and-such a fashion." This 
is on purpose. I don't want you just to think about "matching," I want 
you to realize that using regexes isn't so much about knowing the symbols,
metasequences and all that jazz, it is about understanding the data being
input.  Regex doesn't know grammar or even words, just the given input 
stream. So instead of matching the concept of "dog", I want you to focus 
on the data that might represent "dog": one word containing, in order, a 
capital or small "d" preceded by a newline or space, followed by a 
capital or lowercase "o" followed by a capital or lowercase "g", 
followed by a newline, space or period. 
Easy as \<[Pp][Ii][Ee]\>, right? :)

Welcome to the world of regex. Let's get started.
Chuck Geigner, UHACC
========================================================================
^      Caret by itself matches occurrence of line beginning.
$      Dollar sign matches occurrence of line end.
  EXAMPLE:
  ^$          --> matches an empty line
  g/^$/d      --> Global delete-all-empty-lines (useful in ex/vi)
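
If you would rather poke at the anchors from a script than from ex/vi,
here is a minimal sketch in Python. Assumptions: Python's re module
stands in for the ex engine, and the helper name drop_empty_lines is
made up for this example. It tests each line against ^$ and keeps the
ones that do not match, which is roughly what g/^$/d does.

```python
import re

# ^$ is the empty-line expression from the text: line beginning
# immediately followed by line end.
EMPTY_LINE = re.compile(r"^$")

def drop_empty_lines(text):
    """Rough stand-in for ex's g/^$/d: keep every line that is not empty."""
    return [line for line in text.splitlines() if not EMPTY_LINE.search(line)]

print(drop_empty_lines("first\n\nsecond\n\n\nthird\n"))
```

Note that a line holding only a space is not empty as far as ^$ is
concerned, so it survives.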

.      Dot matches occurrence of any (and exactly) 1 character.
  EXAMPLE:
  /g..d/      --> matches 4 character sequence starting with "g" and
      ending with "d". Matches: gawd, gild, g45d, gekd, gRTd, g#%d, etc.
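
A quick way to convince yourself that each dot stands for exactly one
character is sketched below, again assuming Python's re module in place
of vi. The word list is invented for illustration. fullmatch insists
the whole word fits the expression, while search behaves like /g..d/
and is happy to find the four characters inside a longer word.

```python
import re

FOUR_CHARS = re.compile(r"g..d")

candidates = ["gawd", "gild", "g45d", "g#%d", "gd", "god", "goood"]

# Only words that are exactly g + any two characters + d survive.
exact = [word for word in candidates if FOUR_CHARS.fullmatch(word)]
print(exact)

# Unanchored search finds the sequence anywhere in the input.
inside = FOUR_CHARS.search("legend")
print(inside.group())
```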
?      Question mark matches occurrence of 0 or 1 of the preceding 
      character.
  EXAMPLE: 
  s/^#?[[:space:]]//     --> search for lines beginning with a 
      comment ("#"), plus a space or tab, or a line beginning with no
      comment, plus a space or tab. Replace the found matches with 
      naught (better to think this way than saying "delete").
      Notice also right here that the metacharacter "?" modifies the 
      character match defined immediately to its left. In this 
      instance, it modifies the match of "#" to denote that matching
      that particular character is optional. (In sed, "?" is only a
      metacharacter in extended mode, i.e. sed -E; basic mode treats
      it as a literal question mark.)
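
Here is a hedged sketch of the same substitution in Python. Two
assumptions to flag: Python's re module has no POSIX bracket
expressions, so [ \t] (space or tab) stands in for [[:space:]], and the
helper name strip_leading is made up for this example.

```python
import re

# "#" is optional thanks to "?", the space-or-tab after it is not.
LEADING = re.compile(r"^#?[ \t]")

def strip_leading(line):
    """Replace an optional '#' plus one space or tab at line start with naught."""
    return LEADING.sub("", line, count=1)

for line in ["# commented", " indented", "#nospace", "plain"]:
    print(repr(strip_leading(line)))
```

The last two inputs come back untouched: "?" makes the "#" optional,
but the expression still demands a space or tab right after it.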
[]     Character Class brackets enclose an expression meant to match 
       1 character. Character classes can be literal or ranged. They may 
       also contain a POSIX shortcut, or "bracket expression"(A1).
  EXAMPLE:
  [0123456789] --> matches occurrence of any one digit
  [0-9]        --> same, but ranged
  [[:digit:]]  --> same, in POSIX bracket expression(A1).
  [\d]         --> yet again, in escaped Character Class shorthand(A2)
                   (Perl-style engines only; POSIX tools read this as
                   a backslash or a "d").
  [-,&0-9a-z]  --> matches any one digit OR lowercase letter OR hyphen
                   OR comma OR ampersand
                   (example shows you can mix expressions within the 
                   brackets, including literal special characters, and 
                   also that you can match a hyphen: you must list it 
                   first or last, otherwise it might be interpreted as
                   a range)
  ^ ACHTUNG!!! The caret, used within the context of Character 
  Class brackets, carries special meaning: Negation. Place a caret
  first inside the brackets and it negates the expression therein.
  EXAMPLE:
  [^a-zA-Z]  --> matches occurrence of any character that is NOT alphabetic
  [^aeiou]   --> matches occurrence of any character that is not a
                 lowercase vowel.
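
To see the classes pick characters out of real input, here is a small
sketch, assuming Python's re module and an invented sample string.
findall returns every single character a class matches, which makes it
easy to check what is in and what is out.

```python
import re

DIGIT = re.compile(r"[0-9]")
MIXED = re.compile(r"[-,&0-9a-z]")      # hyphen listed first, so it is literal
NOT_ALPHA = re.compile(r"[^a-zA-Z]")    # leading caret negates the class

sample = "Tel: 555-0199, ext&7"

digits = "".join(DIGIT.findall(sample))
mixed = "".join(MIXED.findall(sample))
not_alpha = NOT_ALPHA.findall("a1 b!")

print(digits)
print(mixed)
print(not_alpha)
```

The capital "T", the colon and the spaces drop out of the second
result because none of them is listed in [-,&0-9a-z].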

()     Group Expression parens enclose an expression you want to be 
       lumped together as one expression. Very good for encapsulating 
       backreferences and alternations. Alternations are accomplished by
       placing a "|" (logical "or") between expressions.
  Alternation:  (|)
  (1[012]:[0-5][0-9]|[0-9]:[0-5][0-9]) --> matches any 12-hour clock
                   time between 0:00 and 12:59. Broken down, the group
                   expression matches literally 10:00-12:59 OR 0:00-9:59.
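
The clock expression can be tried out with the sketch below, assuming
Python's re module; the helper name is_clock_time is made up for this
example. fullmatch forces the whole input to be a time. The last line
shows what happens without that restriction: the expression finds a
perfectly good time hiding inside "13:45".

```python
import re

CLOCK = re.compile(r"(1[012]:[0-5][0-9]|[0-9]:[0-5][0-9])")

def is_clock_time(text):
    """True only if the entire input is a time from 0:00 to 12:59."""
    return CLOCK.fullmatch(text) is not None

for candidate in ["0:00", "9:05", "12:59", "13:00", "7:60"]:
    print(candidate, is_clock_time(candidate))

# Unanchored, the second alternative matches the tail of "13:45".
print(CLOCK.search("13:45").group())
```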
  Backreferencing: \1, \2, \3,...\n
  Every time you enclose an expression within the () Group Expression 
  construct, some regex engines will save the text that group matched
  in memory. A backreference then matches that same text again: \1 for
  the first group, \2 for the second, and so on, numbered by the order
  in which the opening parens occurred (from left to right). Note that
  a backreference repeats the matched text, not the expression itself.
  Again, check to see if what you're usin' supports this. Off the top
  of my head, I know that sed and grep support Backreferencing, but
  there are very probably others.
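
A small sketch of backreferencing in action, assuming Python's re
module (which supports \1) and two expressions invented for this
example. The point to watch is that \1 must match the very same
characters the first group matched, not just something of the same
shape.

```python
import re

# Group 1 captures a run of digits; \1 must then repeat those same digits.
SAME_TWICE = re.compile(r"([0-9]+)\.\1")

print(SAME_TWICE.fullmatch("10.10") is not None)   # same text on both sides
print(SAME_TWICE.fullmatch("10.20") is not None)   # same shape, different text

# Classic use: find a word typed twice in a row.
DOUBLED = re.compile(r"\b([A-Za-z]+) \1\b")
hit = DOUBLED.search("it was the the best of times")
print(hit.group(1))
```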
  EXAMPLE:
  (1?[0-9]?[0-9]|2([0-4][0-9]|5[0-5]))\.\1\.\1\.\1
  Yeah, this looks like it matches any valid IP address, but watch out:
  each \1 matches the exact text the first group matched, so it only
  matches addresses whose four octets are identical, such as
  10.10.10.10. To match any valid IP address, write the octet
  expression out four times instead of referencing it.
  Breakdown of the group:
  1?[0-9]?[0-9]  matches 0-199
  2[0-4][0-9]    matches 200-249
  25[0-5]        matches 250-255
  Try it out in bash:
  $ egrep "(1?[0-9]?[0-9]|2([0-4][0-9]|5[0-5]))\.\1\.\1\.\1" somefile
========================================================================
APPENDIX 1:
A1: POSIX Bracket Expressions
[:alnum:]  --> matches occurrence of any alphabetic or numeric
               character, either case.
[:alpha:]  --> matches occurrence of any alphabetic character, either
               case.
[:blank:]  --> matches occurrence of any space or tab
[:cntrl:]  --> matches occurrence of control characters
[:digit:]  --> matches occurrence of any digit
[:graph:]  --> matches occurrence of non-blank characters (does not
               match spaces, control chars, etc.)
[:lower:]  --> matches occurrence of lowercase alphabetical chars only
[:print:]  --> matches occurrence of non-blank, except includes space
               character as a positive match
[:punct:]  --> matches occurrence of punctuation characters
[:space:]  --> matches occurrence of all space characters, including
               newline, formfeed, and CR
[:upper:]  --> matches occurrence of uppercase alphabetical chars only
[:xdigit:] --> matches occurrence of hexadecimal digits 0-9, A-F or a-f

For more fun with POSIX bracket expressions, hit Google and read up on
collating sequences and character equivalence classes.
=================================================
>>>>>>>>>> On to page 2

Copyright © 2006: Unix Hobbyists' Administrators' & Coders' Club. All Rights Reserved.
UHACC, P.O. Box 6376 - Bloomington, Illinois 61702-6376