REGULAR EXPRESSIONS - PART I
A Primer page 1
(c) C. Geigner, President, UHACC - October 2003
========================================================================
Regular Expressions may well be one of the most useful tools you can
master when writing code. Perhaps beginners sense the magnitude of the
power when glancing at some code - and then turn tail and bail. Yeah, I
know. It looks like hieroglyphics when you first look at it, but I
promise that if you start with a few elements and work your way in
gradually, soon enough you'll be whipping out some hieroglyphics of your
own. Now don't get me wrong, the point here is not to confuse your boss
(although an interesting side-effect) or obfuscate. We aim to write
concise, tight expressions. This is not as hard as you would think, since
the fun thing about regex is that no matter how I show it to you, there
are always other ways to do "it". You will notice that I also word my
explanations in a particular way, almost too literally: "Expression X
matches occurrence of input characters in such-and-such a fashion." This
is on purpose. I don't want you just to think about "matching," I want
you to realize that using regexes isn't so much about knowing the symbols,
metasequences and all that jazz, it is about understanding the data being
input. Regex doesn't know grammar or even words, just the given input
stream. So instead of matching the concept of "dog", I want you to focus
on the data that might represent "dog": one word containing, in order, a
capital or small "d" preceded by a newline or space, followed by a
capital or lowercase "o" followed by a capital or lowercase "g",
followed by a newline, space or period.
Easy as \<[Pp][Ii][Ee]\>, right? :)
Welcome to the world of regex. Let's get started.
Chuck Geigner, UHACC
========================================================================
^ Caret by itself matches occurrence of line beginning.
$ Dollar sign matches occurrence of line end.
EXAMPLE:
^$ --> matches an empty line
:g/^$/d --> Global delete-all-empty-lines (useful in ex/vi)
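If you'd rather poke at this outside of ex/vi, here is a minimal sketch in Python (my choice of language for the runnable examples, not something this primer depends on). It assumes the standard `re` module, and the sample text is made up. It keeps every line that does NOT match ^$, which is the same idea as the delete-all-empty-lines command above.

```python
import re

# Made-up sample input: three real lines with empty lines in between.
text = "first line\n\nsecond line\n\n\nthird line\n"

# A line matches ^$ only when nothing sits between its beginning and its end.
# Keeping the lines that do not match drops the empty ones.
kept = [line for line in text.splitlines() if not re.search(r"^$", line)]

print("\n".join(kept))
```

Note that a line holding only spaces or tabs is not empty as far as ^$ is concerned, so it would survive.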
. Dot matches occurrence of any single character (exactly 1).
EXAMPLE:
/g..d/ --> matches 4 character sequence starting with "g" and
ending with "d". Matches: gawd, gild, g45d, gekd, gRTd, g#%d, etc.
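A quick way to convince yourself, sketched in Python with the standard `re` module. The candidate list is invented for illustration. One caveat that is specific to Python: by default its dot matches any character except a newline.

```python
import re

pattern = re.compile(r"g..d")

# Invented test words: some fit "g, any two characters, d" and some don't.
candidates = ["gawd", "gild", "g45d", "g#%d", "god", "gd", "gold!"]

# search() finds the pattern anywhere in the string, so "gold!" counts
# because it contains "gold".
matches = [word for word in candidates if pattern.search(word)]

print(matches)
```

"god" and "gd" fall out because the dot insists on exactly one character each time: two dots, two characters, no more and no less.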
? Question mark matches occurrence of 0 or 1 of the preceding
character.
EXAMPLE:
s/^#?[[:space:]]// --> search for lines beginning with a
comment ("#"), plus a space or tab, or a line beginning with no
comment, plus a space or tab. Replace the found matches with
naught (better to think this way than saying "delete").
Notice also right here that the metacharacter "?" modifies the
character match defined immediately to its left. In this
instance, it modifies the match of "#" to denote that matching
that particular character is optional. (Check your tool: in sed's
basic syntax a bare "?" is a literal question mark, so use sed -E
for the expression as written.)
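Here is the same substitution sketched in Python. Two assumptions to flag: Python's `re` module does not understand the POSIX [[:space:]] form, so I've swapped in the literal class [ \t] (space or tab), and the sample lines are made up.

```python
import re

# Made-up sample lines covering each case.
lines = ["# commented", " indented", "#no space", "plain"]

# ^#?[ \t] : line beginning, an optional "#", then one space or tab.
# Replace whatever matched with naught.
stripped = [re.sub(r"^#?[ \t]", "", line) for line in lines]

print(stripped)
```

The third line is left alone: the "#" is optional, but the space or tab after it is not.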
[] Character Class brackets enclose an expression meant to match
1 character. Character classes can be literal or ranged. They may
also contain a POSIX shortcut, or "bracket expression"(A1).
EXAMPLE:
[0123456789] --> matches occurrence of any one digit
[0-9] --> same, but ranged
[[:digit:]] --> same, in POSIX bracket expression(A1).
[\d] --> yet again, in escaped Character Class shorthand(A2).
[-,&0-9a-z] --> matches any one digit OR lowercase letter OR hyphen
OR comma OR ampersand
(example shows you can mix expressions within the
brackets, including literal special characters, and
also that you can match a hyphen: you must list it
first or last, otherwise it might be interpreted as a range)
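To see the classes above in action, here is a small Python sketch using the standard `re` module. The [[:digit:]] form is left out because Python's engine does not support POSIX names inside brackets; the sample strings are invented.

```python
import re

# Three spellings of "any one digit" that Python understands.
digit_forms = [r"[0123456789]", r"[0-9]", r"[\d]"]
sample = "room 7b"
found = [re.search(form, sample).group() for form in digit_forms]

# The mixed class: hyphen listed first so it is taken literally.
mixed = re.compile(r"[-,&0-9a-z]")
hits = mixed.findall("A-1, b&C!")

print(found)
print(hits)
```

Each hit is exactly one character long, no matter how much you cram between the brackets. The capitals, the space and the "!" are skipped because nothing in the class covers them.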
^ ACHTUNG!!! The caret, used within the context of Character
Class brackets, carries special meaning: Negation. Place a caret
first inside the brackets and it negates the expression therein.
EXAMPLE:
[^a-zA-Z] --> matches occurrence of any character that is NOT alphabetic
[^aeiou] --> matches occurrence of any character that is not a lowercase vowel.
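Negation trips people up because "not a vowel" covers a lot more than consonants. A minimal Python sketch (standard `re` module, invented sample strings) makes the point:

```python
import re

# Everything that is NOT a letter: punctuation, spaces and digits all count.
non_alpha = re.findall(r"[^a-zA-Z]", "Hi, R2-D2!")

# Everything that is NOT one of a, e, i, o, u: the capital "Q" and the
# space qualify just as much as the "p" does.
non_vowel = re.findall(r"[^aeiou]", "Queue up")

print(non_alpha)
print(non_vowel)
```

Think about the data, not the concept: a negated class still matches exactly one character, it just has to be one that is not on the list.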
() Group Expression parens enclose an expression you want to be
lumped together as one expression. Very good for encapsulating
backreferences and alternations. Alternations are accomplished by
placing a "|" (logical "or") between expressions.
Alternation: (|)
(1[012]:[0-5][0-9]|[0-9]:[0-5][0-9]) --> matches any 12-hour clock
time between 0:00 and 12:59. Broken down, the group
expression matches literally 10:00-12:59 OR 0:00-9:59.
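The clock expression can be tried out with a short Python sketch (standard `re` module; the test strings are my own). I use fullmatch() here, which is an assumption about how you'd want to apply it: the whole input has to be a time, not merely contain one.

```python
import re

clock = re.compile(r"(1[012]:[0-5][0-9]|[0-9]:[0-5][0-9])")

# Invented inputs: three good times, two bad ones.
samples = ["12:59", "0:00", "9:05", "13:00", "7:60"]
valid = [s for s in samples if clock.fullmatch(s)]

# Without anchoring, the second alternative happily finds a time
# hiding inside a bad one.
sneaky = clock.search("13:00").group()

print(valid)
print(sneaky)
```

That last line is worth a second look: an unanchored search of "13:00" succeeds on "3:00". The expression did exactly what it was told, so know your input.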
Backreferencing: \1, \2, \3,...\n
Every time you enclose an expression within the () Group Expression
construct, some regex engines will save the text it matched in memory.
So instead of reconstructing longer expressions, you may reference that
text by the order in which its group occurred (from left to right). Again,
check to see if what you're usin' supports this. Off the top of my head, I
know that sed and grep support Backreferencing (awk generally does not
inside a pattern), but there are very probably others.
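As a rough illustration of the idea (a sketch of my own in Python's standard `re` module, not tied to sed), here is a backreference used to find and collapse a doubled word. The sentence is made up, and the pattern assumes lowercase words separated by a single space.

```python
import re

# ([a-z]+) saves a run of lowercase letters as group 1;
# \1 then demands that exact same text again after a space.
doubled = re.compile(r"([a-z]+) \1")

sentence = "paris in the the spring"
match = doubled.search(sentence)

# In the replacement, \1 puts back a single copy of the saved text.
fixed = doubled.sub(r"\1", sentence)

print(match.group())
print(fixed)
```

Notice that \1 does not mean "another run of lowercase letters". It means the very characters group 1 matched this time around.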
EXAMPLE:
Copyright © 2006: Unix Hobbyists' Administrators' & Coders' Club. All Rights Reserved.