Oracle Regular Expressions Pocket Reference
By Jonathan Gennick, Peter Linsley
Publisher: O'Reilly
Pub Date: September 2003
ISBN: 0-596-00601-2
Pages: 64

Oracle Regular Expressions Pocket Reference is part tutorial and part quick-reference. It's suitable for those who have never used regular expressions before, as well as those who have experience with Perl and other languages supporting regular expressions. The book describes Oracle Database 10g's support for regular expressions, including globalization support and differences between Perl's syntax and the POSIX syntax supported by Oracle 10g. It also provides a comprehensive reference, including examples, to all supported regular expression operators, functions, and error messages.

Copyright
Chapter 1. Oracle Regular Expressions Pocket Reference
    Section 1.1. Introduction
    Section 1.2. Organization of This Book
    Section 1.3. Conventions
    Section 1.4. Acknowledgments
    Section 1.5. Example Data
    Section 1.6. Tutorial
    Section 1.7. Oracle's Regular Expression Support
    Section 1.8. Regular Expression Quick Reference
    Section 1.9. Oracle Regular Expression Functions
    Section 1.10. Oracle Regular Expression Error Messages

Copyright

Copyright © 2003 O'Reilly & Associates, Inc. Printed in the United States of America. Published by O'Reilly & Associates, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O'Reilly & Associates books may be purchased for educational, business, or sales promotional use.
Online editions are also available for most titles (http://safari.oreilly.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or corporate@oreilly.com.

Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered trademarks of O'Reilly & Associates, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly & Associates, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps. The association between the image of garden spiders and the topic of Oracle regular expressions is a trademark of O'Reilly & Associates, Inc.

Oracle® and all Oracle-based trademarks and logos are trademarks or registered trademarks of Oracle Corporation, Inc. in the United States and other countries. O'Reilly & Associates, Inc. is independent of Oracle Corporation.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

Chapter 1. Oracle Regular Expressions Pocket Reference

1.1 Introduction

With the release of Oracle Database 10g, Oracle has introduced regular expression support to the company's flagship product.
Regular expressions are used to describe patterns in text, and they are an invaluable aid when working with loosely formatted textual data. This little booklet describes Oracle's regular expression support in detail. Its goal is to enable you to take full advantage of the newly introduced regular expression features when querying and manipulating textual data.

1.2 Organization of This Book

This book is divided into the following six sections:

Introduction
    You're reading it now.

Tutorial
    Provides a short regular expression tutorial aimed at those who aren't already familiar with regular expressions.

Oracle's Regular Expression Support
    For readers familiar with regular expressions, describes how they are implemented and used within Oracle. Also includes a description of the key differences between the regular expression implementations of Perl and Oracle.

Regular Expression Quick Reference
    Describes the regular expression metacharacters supported by Oracle and provides examples of their usage.

Oracle Regular Expression Functions
    Details the new SQL and PL/SQL functions that make up Oracle's regular expression support.

Oracle Regular Expression Error Messages
    Lists all of Oracle's regular expression error messages and provides advice as to what to do when you encounter a given message.

1.3 Conventions

The following typographical conventions are used in this book:

UPPERCASE
    Indicates a SQL or PL/SQL keyword

lowercase
    Indicates a user-defined item, such as a table name or a column name, in a SQL or PL/SQL statement

Italic
    Indicates URLs, emphasis, or the introduction of new technical terms

Constant width
    Used for code examples and for in-text references to table names, column names, regular expressions, and so forth

Constant width bold
    Indicates user input in code examples showing both input and output

1.4 Acknowledgments

We thank Debby Russell and Todd Mezzulo of O'Reilly & Associates for believing in and supporting this book. We also thank Barry Trute, Michael Yau, Weiran Zhang, Keni Matsuda, Ken Jacobs, and the others at Oracle Corporation who spent valuable time reviewing this manuscript to ensure its accuracy.

Peter would like to acknowledge Weiran Zhang for his finesse and intellect as codeveloper of Oracle's regular expression features. Peter would also like to thank Ritsu for being an ever-supportive and encouraging wife.

Jonathan would like to thank Dale Bowen for providing the Spanish sentence used for the collation example; Andrew Sears for spending so much time with Jeff; Jeff for dragging his dad on so many bike rides to the Falling Rock Cafe for ice cream and coffee; and the Falling Rock Cafe for, well, just for being there.

1.5 Example Data

Many of the example SQL statements in this book execute against the following table:

    CREATE TABLE park (
       park_name NVARCHAR2 (40),
       park_phone NVARCHAR2 (15),
       country VARCHAR2 (2),
       description NCLOB
    );

This table contains information on a variety of state, provincial, and national parks from around the world.
Much of the information is in free-text form within the description column, making this table an ideal platform on which to demonstrate Oracle's regular expression capabilities. You can download a script to create the park table and populate it with data from http://oreilly.com/catalog/oracleregexpr.

1.6 Tutorial

A regular expression (often known as a regex) is a sequence of characters that describes a pattern in text. Regular expressions use a syntax that has evolved over a number of years, and that is now codified as part of the POSIX standard.

Regular expressions are extremely useful, because they allow you to work with text in terms of patterns. For example, you can use regular expressions to search the park table and identify any park with a description containing text that looks like a phone number. You can then use the same regular expression to extract that phone number from the description.

This tutorial will get you started using regular expressions, but we can only begin to cover the topic in this small book. If you want to learn about regular expressions in depth, see Jeffrey Friedl's excellent book Mastering Regular Expressions (O'Reilly).

1.6.1 Patterns

The simplest type of pattern is an exact string of characters that you are searching for, such as the string in the following WHERE clause:

    SELECT *
    FROM park
    WHERE park_name='Mackinac Island State Park';

However, the string 'Mackinac Island State Park' isn't what most people think of when you mention the word "pattern." The expectation is that a pattern will use so-called metacharacters that allow for matches when you know only the general pattern of text you are looking for.

Standard SQL has long had rather limited support for pattern matching in the form of the LIKE predicate.
For example, the following query attempts to return the names of all state parks:

    SELECT park_name
    FROM park
    WHERE park_name LIKE '%State Park%';

The percent (%) characters in this pattern specify that any number of characters are allowed on either side of the string 'State Park'. Any number of characters may be zero characters, so strings in the form 'xxx State Park' fit the pattern. There! I've just used a pattern to describe the operation of a pattern.

Humans have long used patterns as a way to organize and describe text. Look no further than your address and phone number for examples of commonly used patterns.

Handy as it is at times, LIKE is an amazingly weak predicate, supporting only two expression metacharacters that don't even begin to address the range of patterns you might need to describe in your day-to-day work. You need more. You need a richer and more expressive language for describing patterns. You need regular expressions.

1.6.2 Regular Expressions

Regular expressions are the answer to the question: "How do I describe a pattern of text?" Regular expressions first became widely used on the Unix platform, supported by such utilities as ed, grep, and (notably) Perl. Regular expressions have gone on to become formalized in the IEEE POSIX standard, and they are widely supported across an ever-growing range of editors, email clients, programming languages, scripting languages, and now Oracle SQL and PL/SQL.

Let's revisit the earlier problem of finding state parks in the park table. We performed that task using LIKE to search for the words 'State Park' in the park_name column.
Following is the regular expression solution to the problem:

    SELECT park_name
    FROM park
    WHERE REGEXP_LIKE(park_name, 'State Park');

REGEXP_LIKE is a new Oracle predicate that searches, in this case, the park_name column to see whether it contains a string matching the pattern 'State Park'. REGEXP_LIKE is similar to LIKE, but differs in one major respect: LIKE requires its pattern to match the entire column value, whereas REGEXP_LIKE looks for its pattern anywhere within the column value.

There are no metacharacters in the regular expression 'State Park', so it's not a terribly exciting pattern. Following is a more interesting example that attempts to identify parks with descriptions containing phone numbers:

    SELECT park_name
    FROM park
    WHERE REGEXP_LIKE(description, '...-....');

This query uses the regular expression metacharacter period (.), which matches any character. The expression does not look for three periods followed by a dash followed by four periods. It looks for any three characters, followed by a dash, followed by any four characters. Because it matches any character, the period is a very commonly used regular expression metacharacter.

You can create function-based indexes to support queries using REGEXP_LIKE and other REGEXP functions in the WHERE clause.

It can be inconvenient to specify repetition by repeating the metacharacter, so regular expression syntax allows you to follow a metacharacter with an indication of how many times you want that metacharacter to be repeated. For example:

    SELECT park_name
    FROM park
    WHERE REGEXP_LIKE(description, '.{3}-.{4}');

The interval quantifier {3} is used to specify that the first period should repeat three times. The {4} following the second period indicates a repeat count of four.
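
If you want to see the equivalence of the two forms without relying on the park table, you can try them against a string literal. The following is only a minimal sketch: the sample text and the column aliases are assumptions made for illustration, not data from the book's example table.

```sql
-- Sample text and aliases are illustrative assumptions.
-- The only dash in the text follows '492', so both patterns
-- match the same eight characters.
SELECT REGEXP_SUBSTR('Call 492-3415 for details', '...-....')   AS repeated_periods,
       REGEXP_SUBSTR('Call 492-3415 for details', '.{3}-.{4}') AS interval_form
FROM dual;
-- Both columns return: 492-3415
```

Querying dual with a literal in this way is a convenient habit when experimenting with a new expression, because nothing in the result depends on what happens to be stored in a table.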
This pattern of a regular expression syntax element followed by a quantifier is one you'll frequently encounter and use when working with regular expressions. Table 1-1 illustrates the types of quantifiers you can use to specify repetition in a regular expression. Section 1.8, later in this book, describes each of these quantifiers in detail.

Table 1-1. Quantifiers used to specify repetition in a pattern

    Pattern   Matches
    .*        Zero or more characters
    .+        One or more characters
    .?        Zero or one character
    .{3,4}    Three to four characters
    .{3,}     Three or more characters
    .{3}      Exactly three characters

1.6.3 Bracket Expressions

The examples in the previous section searched for phone numbers using the period (.). That's not a very precise search, because the period matches any character, not just digits. A phrase such as 'A 217-acre park...' will result in a false positive, because '217-acre' fits the '...-....' pattern. The following query uses REGEXP_SUBSTR to extract the text matching the pattern so that you can see the false positives that result:

    SELECT park_name, REGEXP_SUBSTR(description, '...-....')
    FROM park;

If you've downloaded the example data for this book and are following along, execute the above query and look at the results for Färnebofjärden and Muskallonge Lake State Park, among others.

If you're ever in doubt as to why REGEXP_LIKE reports the existence of a given pattern in a text value, use REGEXP_SUBSTR to extract that same pattern from the value in question, and you'll see the text REGEXP_LIKE considered a match for your pattern.

Regular expression syntax provides several ways for you to be more specific about the characters you are searching for.
One approach is to specify a list of characters in square brackets ([ ]):

    SELECT park_name
    FROM park
    WHERE REGEXP_LIKE(description,
       '[0123456789]{3}-[0123456789]{4}');

Square brackets and their contents are referred to as bracket expressions, and define a specific subset of characters, any one of which can provide a single-character match. In this example, the bracket expressions each define the set of characters comprising the digits 0-9. Following each list is a repeat count, either {3} or {4}.

It's painful to have to type out each of the digits 0-9, and it's error-prone too, as you might skip a digit. A better solution is to specify a range of characters:

    SELECT park_name
    FROM park
    WHERE REGEXP_LIKE(description, '[0-9]{3}-[0-9]{4}');

Even better, perhaps, in this case, is to use one of the named character classes:

    SELECT park_name
    FROM park
    WHERE REGEXP_LIKE(description,
       '[[:digit:]]{3}-[[:digit:]]{4}');

Named character classes such as [:digit:] are especially important in multilingual environments, because they enable you to write expressions that work across languages and character sets.

Character classes are supported only within bracket expressions. Thus you cannot write [:digit:]{3}, but instead must use [[:digit:]]{3}.

You can even define a set of characters in terms of what it is not. The following example uses [^[:digit:]] to allow for any character other than a digit to separate the groups of a phone number:

    SELECT park_name
    FROM park
    WHERE REGEXP_LIKE(description,
       '[[:digit:]]{3}[^[:digit:]][[:digit:]]{4}');

Any time you specify a caret (^) as the first character within a bracket expression, you are telling the regular expression engine to begin by including all possible characters in the set, and to then exclude only those that you list following the caret.

Most metacharacters lose their special meaning when used within a bracket expression.
See [ ] (Square Brackets) in Section 1.8 for details on this issue.

Bracket expressions can be much more complex and varied than those we've shown so far. For example, the following bracket expression:

    [[:digit:]A-Zxyza-c]

includes all digits, the uppercase letters A through Z, lowercase x, y, and z, and lowercase a through c.

1.6.4 The Escape Character

So far you've seen that periods (.), square brackets ([]), and braces ({}) all have special meaning in a regular expression. But these are common characters! What if you are searching specifically for one of these characters rather than what it represents?

If you look carefully at the various park descriptions, you'll see that some phone numbers are delimited using periods rather than hyphens. For example, the Tahquamenon Falls State Park phone number is given as 906.492.3415. With one exception, the pattern we've been using so far demands a hyphen between digit groups:

    [[:digit:]]{3}-[[:digit:]]{4}

To find phone numbers delimited by periods, you might think you could simply place a period between digit groups:

    [[:digit:]]{3}.[[:digit:]]{4}

However, the period in this case would be interpreted as a wildcard, and would match any character: a dash, a period, or something else. To specify that you really do mean a period, and not the wildcard match that a period represents, you need to escape the period by preceding it with a backslash (\):

    SELECT park_name
    FROM park
    WHERE REGEXP_LIKE(description,
       '[[:digit:]]{3}\.[[:digit:]]{4}');

This query will now find only parks with phone numbers delimited by periods in the description column. The \. tells the regular expression engine to look for a literal period.

Any time you want to use one of the regular expression metacharacters as a regular character, you must precede it with a backslash.
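
The difference between the unescaped and the escaped period is easy to observe on a literal that contains both delimiter styles. The following is a minimal sketch; the sample text and the aliases are assumptions chosen for illustration and do not come from the park table.

```sql
-- Sample text and aliases are illustrative assumptions.
-- Unescaped: the period is a wildcard, so the dash-delimited
-- number, which comes first in the text, is what gets matched.
-- Escaped: only a literal period is accepted between the groups.
SELECT REGEXP_SUBSTR('Call 217-5555 or 492.3415',
                     '[[:digit:]]{3}.[[:digit:]]{4}')  AS unescaped,
       REGEXP_SUBSTR('Call 217-5555 or 492.3415',
                     '[[:digit:]]{3}\.[[:digit:]]{4}') AS escaped
FROM dual;
-- unescaped returns: 217-5555
-- escaped   returns: 492.3415
```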
If you want to use a backslash as a regular character, precede it with another backslash, as in \\.

Be aware that the list of metacharacters changes inside square brackets ([]). Most lose their special meaning, and do not need to be escaped. See the quick reference subsection on square brackets for details on this issue.

1.6.5 Subexpressions

Earlier you saw quantifiers used to specify the number of repetitions for a metacharacter. For example, we first used .{3} to specify that phone numbers begin with three characters. Later we used [[:digit:]]{3} to specify that phone numbers begin with three digits. However, what if you want to specify a repeat count for not just a single metacharacter or bracket expression, but for any arbitrary part of an expression?

You can apply a quantifier to any arbitrary part of an expression simply by enclosing that part of the expression in parentheses. The result is called a subexpression, and you can follow a subexpression with a quantifier to specify repetition.

For example, the regular expression in the following query searches for non-U.S. phone numbers. It does this by looking for a plus sign (+) followed by one to four groups of 1-3 digits, each followed by a single space, and then by a final group of digits:

    SELECT park_name, REGEXP_SUBSTR(description,
       '\+([0-9]{1,3} ){1,4}([0-9]+)') intl_phone
    FROM park;

    Färnebofjärden              +46 8 698 10 00
    Mackinac Island State Park  ***Null***
    Fort Wilkins State Park     ***Null***
    . . .

The first interval expression, {1,3}, refers to [0-9], and matches 1-3 digits. The second interval expression, {1,4}, refers to the subexpression matching 1-3 digits plus one space. Finally, we use the subexpression ([0-9]+) to pick up a final digit group without also including a trailing space in our results.
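
The same expression can be exercised against a literal to see exactly which text the quantified subexpression consumes. This is a minimal sketch: the phone number is the one shown in the output above, while the surrounding text is an assumption added for illustration.

```sql
-- The text around the phone number is an illustrative assumption.
-- ([0-9]{1,3} ){1,4} consumes '46 ', '8 ', '698 ', and '10 ';
-- ([0-9]+) then picks up '00' without the space that follows it.
SELECT REGEXP_SUBSTR('Phone: +46 8 698 10 00 (office)',
                     '\+([0-9]{1,3} ){1,4}([0-9]+)') AS intl_phone
FROM dual;
-- Returns: +46 8 698 10 00
```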
In this example, we require at least two digit groups, one from the first subexpression and one from the second, in an attempt to reduce false positives. In our experience, non-U.S. phone numbers are most frequently written with at least two digit groups. In the end, you can't always be sure that text matching a pattern has the semantics, or meaning, that you want. See Section 1.6.9, later in this tutorial.

1.6.6 Alternation

Humans don't like to follow rules, and just when you think you've got a pattern nailed down, you'll discover that someone is using a variation on the pattern, or perhaps a completely different pattern. For example, earlier we began to deal with the fact that some of the phone numbers in the park descriptions use periods rather than dashes to separate the digit groups.

Regular expression syntax supports a concept called alternation, which simply means that you can specify alternative versions of a pattern. Alternation in a regular expression is like the OR operator in the SELECT statement's WHERE clause. You specify alternation using a vertical bar (|) to separate two alternatives, either of which is acceptable. For example:

    SELECT park_name
    FROM park
    WHERE REGEXP_LIKE(description,
       '[[:digit:]]{3}-[[:digit:]]{4}'
       || '|[[:digit:]]{3}\.[[:digit:]]{4}');

This expression is a bit difficult to read, because the narrow page width used in this book forces us to show it as the concatenation of two strings. The || operator is the SQL string-concatenation operator. The single | that you see beginning the second string is the regular expression alternation operator. The first half of the regular expression looks for phone numbers delimited by dashes; the second half looks for phone numbers delimited by periods.
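
To confirm that each half of the alternation is tried independently, you can apply the same unparenthesized expression to two literals, one per delimiter style. This is a minimal sketch; both sample strings and the aliases are assumptions made for illustration.

```sql
-- Sample strings and aliases are illustrative assumptions.
-- The first literal can only satisfy the second alternative
-- (period-delimited); the second literal can only satisfy the first.
SELECT REGEXP_SUBSTR('Call 492.3415',
                     '[[:digit:]]{3}-[[:digit:]]{4}'
                     || '|[[:digit:]]{3}\.[[:digit:]]{4}') AS period_form,
       REGEXP_SUBSTR('Call 492-3415',
                     '[[:digit:]]{3}-[[:digit:]]{4}'
                     || '|[[:digit:]]{3}\.[[:digit:]]{4}') AS dash_form
FROM dual;
-- period_form returns: 492.3415
-- dash_form   returns: 492-3415
```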
Whenever you use alternation, it's a good idea to enclose your two alternatives within parentheses. We didn't use parentheses in our previous example, so alternative #1 consists of everything to the left of the vertical bar, and alternative #2 consists of everything to the right. Parentheses constrain alternation to just the subexpression within the parentheses, enabling you to write more concise expressions:

    SELECT park_name
    FROM park
    WHERE REGEXP_LIKE(description,
       '[[:digit:]]{3}(-|\.)[[:digit:]]{4}');

This time the entire expression isn't repeated. The alternation is limited to specifying the alternate delimiters: (-|\.).

When an alternation involves only single characters, you really don't need to specify an alternation at all. A choice of one out of several single characters is better expressed as a bracket expression. The following example uses [-.] to allow for either a dash or a period in phone numbers:

    SELECT park_name
    FROM park
    WHERE REGEXP_LIKE(description,
       '[[:digit:]]{3}[-.][[:digit:]]{4}');

Earlier you saw a period used to match any character, and a dash used to define a range of characters in a bracket expression. However, when used as the first character in a bracket expression, the dash represents itself. Similarly, the period always represents itself in a bracket expression. See [ ] (Square Brackets) in Section 1.8 for more details on special rules in bracket expressions.

The following example makes good use of alternation to extract U.S. and Canadian phone numbers, including area codes, from park descriptions. We use alternation to deal with the two area code formats:

    SELECT park_name, REGEXP_SUBSTR(description,
       '([[:digit:]]{3}[-.]|\([[:digit:]]{3}\) )'
       ||'[[:digit:]]{3}[-.][[:digit:]]{4}') park_phone
    FROM park;

In this example, we used alternation to deal with the fact that some area codes are enclosed within parentheses.
We used a bracket expression to accommodate the varying use of periods and dashes as delimiters. Parentheses enclose the area code portion of the expression, limiting the scope of the alternation to just that part of the phone number.

1.6.7 Greediness

Regular expressions are greedy. Greedy in the context of regular expressions means that a quantifier will match as much of the source string as possible. For example, while writing this book, one of us wrote the following regular expression in an attempt to extract the first word from the source text by looking for a sequence of characters followed by a space. Sounds like a reasonable approach, right? Wrong!

    SELECT REGEXP_SUBSTR(
       'In the beginning','.+[[:space:]]')
    FROM dual;

    In the

What's happened here? The first sequence of characters followed by a space is 'In ', so why was 'In the ' returned? The answer boils down to what regular expression users refer to as greediness. The regular expression engine will indeed find that 'In ' matches the specified pattern, but that's not enough! The engine wants more. It's greedy. It looks to see whether it can match an even longer run of characters, and in this case it can. It can match characters up through 'In the ', and so it does. The last word, 'beginning', is not followed by a space, and so the regular expression cannot match it.

What if you do just want that first word? One solution is to use a negated bracket expression to specify that you want to find a pattern consisting of non-spaces:

    SELECT REGEXP_SUBSTR(
       'In the beginning','[^[:space:]]*')
    FROM dual;

    In

This time, the engine will find 'In' and stop at the space that follows, because that space is not part of the set denoted by the bracket expression.
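
A trailing space is hard to see in query output, so one way to compare the two results is to look at their lengths. The following is a minimal sketch using the same literal and the same two expressions; the use of LENGTH and the column aliases are our own additions for illustration.

```sql
-- LENGTH and the aliases are additions made for illustration.
-- The greedy pattern returns 'In the ' (seven characters, counting
-- the trailing space); the negated bracket expression returns 'In'.
SELECT LENGTH(REGEXP_SUBSTR('In the beginning', '.+[[:space:]]'))  AS greedy_len,
       LENGTH(REGEXP_SUBSTR('In the beginning', '[^[:space:]]*')) AS first_word_len
FROM dual;
-- greedy_len = 7, first_word_len = 2
```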
A possibly beneficial side effect of this approach is that the result does not include a trailing space, though it would include trailing punctuation if any punctuation immediately followed the first word in the sentence.

1.6.8 Backreferences

Oracle supports the notion of backreferences in regular expressions. A backreference is a numbered reference in the form \1, \2, and so forth, to the text matching a previous subexpression. The following query contains an example of a regular expression with a backreference, in this case \2:

    SELECT park_name, REGEXP_SUBSTR(description,
       '(^|[[:space:][:punct:]]+)([[:alpha:]]+)'
       || '[[:space:][:punct:]]+\2'
       || '($|[[:space:][:punct:]]+)') doubled_word
    FROM park
    WHERE REGEXP_LIKE(description,
       '(^|[[:space:][:punct:]]+)([[:alpha:]]+)'
       || '[[:space:][:punct:]]+\2'
       || '($|[[:space:][:punct:]]+)');

The same regular expression appears twice in this query, and is one solution to the classic problem of finding doubled words such as 'the the' in a block of text. To help you understand this expression, we'll walk you through it one piece at a time:

(^|[[:space:][:punct:]]+)
    The first word in a sequence of doubled words must be preceded by whitespace or punctuation, be the first word following a newline ([:space:] includes newline), or be the first word in the string. This subexpression is \1, but we do not reference it from later in the expression.

([[:alpha:]]+)
    A word is defined as a sequence of one or more alphabetic characters. This subexpression is \2.

[[:space:][:punct:]]+
    The two words in a sequence of doubled words must be separated by one or more space and punctuation characters.

\2
    The second word must be the same as the first. The subexpression defining the first word is the second subexpression, so we refer back to the value of that subexpression using \2.
($|[[:space:][:punct:]]+)
    The second word must also be followed by space and/or punctuation characters, be the last word in a line, or be the last word in the string.

Our use of \2 in this expression is key to our goal of finding doubled words. Consider the following sentence fragment:

    Fort Wilkins is is a ...

When the regular expression engine evaluates our expression in the context of this sentence, it follows these steps:

1. The engine begins by finding the beginning of the line as a match for the first subexpression.

2. Next, the engine tries 'Fort' as the match to the second subexpression. Thus \2 refers to 'Fort', and the engine looks for 'Fort Fort'. Instead, the engine finds 'Fort Wilkins'.

3. The engine moves on to try 'Wilkins' as the match for the second subexpression. \2 now refers to 'Wilkins' and the engine looks for 'Wilkins Wilkins'. Instead, the engine finds 'Wilkins is'.

4. The regular expression engine tries 'is' as the match for the second subexpression. \2 then refers to 'is', so the engine looks for 'is is' and is successful at finding a match.

Our example query uses two functions that you'll read more about in the upcoming Section 1.7. First, REGEXP_LIKE is used to identify rows that have doubled words in their description. Next, REGEXP_SUBSTR extracts those doubled words for you to review. Both functions leverage the same regular expression.

REGEXP_LIKE is optimized to do as little as possible to prove that an expression exists within a string.

You can reference the first nine subexpressions in a regular expression using \1, \2, etc., through \9. Subexpressions from the tenth onwards cannot be referenced at all.

1.6.9 Fuzziness

Pattern matching is not an exact science. Regular expressions let you search and manipulate text based on patterns.
People can get very creative when it comes to variations on a pattern or making new patterns, and sometimes people don't seem to follow any pattern at all.

Think of writing a regular expression as a learning process, a chance to get to know your data better. Begin by taking your best shot at writing an expression to match the text you are after. Test that expression. Review the results. Do you find false positives? Then refine your expression to eliminate those false positives, and iterate through the testing process again. You may never be able to truly eliminate all false positives; you may have to settle for tolerating some small percentage of them.

Don't forget to review your data for false negatives, which are items containing text that you want, but which your regular expression as currently developed will exclude. Remember the periods in the phone numbers discussed in previous sections. Our first attempt at a regular expression to identify phone numbers excluded all those with periods.

Finally, don't be intimidated by the inherent fuzziness, as we call it, in regular expressions, nor be put off by it. Just understand it. Regular expressions are incredibly useful. They have their place and, like any other feature, they also have their own strengths and weaknesses.

1.7 Oracle's Regular Expression Support

Oracle's regular expression support manifests itself in the form of three SQL functions and one predicate that you can use to search and manipulate text in any of Oracle's supported text datatypes: VARCHAR2, CHAR, NVARCHAR2, NCHAR, CLOB, and NCLOB. Regular expression support does not extend to LONG, because LONG is supported only for backward compatibility with existing code.
1.7.1 Regular Expression Functions Following are t he four funct ions you'll use t o work wit h regular expressions in Oracle: REGEXP_LI KE Det erm ines whet her a specific colum n, variable, or t ext lit eral cont ains t ext m at ching a regular expression. REGEXP_I NSTR Locat es, by charact er posit ion, an occurrence of t ext m at ching a regular expression. REGEXP_REPLACE Replaces t ext m at ching a regular expression wit h new t ext t hat you specify. Your replacem ent t ext can include backreferences t o values in t he regular expression. REGEXP_SUBSTR Ext ract s t ext m at ching a regular expression from a charact er colum n, variable, or t ext lit eral. Of t hese, you've already seen REGEXP_LI KE in quit e a few exam ples. REGEXP_LI KE is docum ent ed in t he " Condit ions" chapt er of t he Oracle Dat abase 10g SQL Reference because in SQL it can only be used as a predicat e in t he WHERE and HAVI NG clauses of a query or DML st at em ent . I n PL/ SQL, however, you can use REGEXP_LI KE as you would any ot her Boolean funct ion: DECLARE x Boolean; BEGIN x := REGEXP_LIKE( 'Does this string mention Oracle?', 'Oracle'); END; / The rem aining t hree funct ions work ident ically in SQL and PL/ SQL. All four funct ions are fully described in Sect ion 1.9 near t he end of t his book. 1.7.2 Regular Expression Locale Support Oracle is not able for it s Globalizat ion Support in t hat it support s an exceedingly wide variet y of charact er set s, languages, t errit ories, and linguist ic sort s. Regular expressions are no except ion. The com binat ion of charact er set , language, and t errit ory is known as a locale. Oracle's regular expression engine respect s locale, and is configurable via NLS ( Nat ional Language Support ) param et er set t ings. Following are som e not able exam ples of t he way in which regular expression locale support affect s you: The regular expression engine is charact er- based. The period ( .) 
  will always match a single character or, more strictly speaking, a single codepoint, regardless of how many bytes are used to represent that character in the underlying character set.

• Character classes are sensitive to the underlying character set. For example, if you're using one of the Unicode character sets, the class [:digit:] will include not only 0, 1, 2, through 9, but also the Arabic-Indic digits, the Bengali digits, and so forth.

• NLS_SORT can affect how comparisons are performed. If NLS_SORT considers two characters to be equivalent, then so does the regular expression engine. For example, using the default sort of BINARY, the expression 'resume' will not match the text 'Résumé'. Change NLS_SORT to GENERIC_BASELETTER, and the expression does match, because that sort treats 'e' and 'é' as the same letter and also ignores case.

• Bracket expressions such as [A-z] are affected by the underlying character set and the sort order. For example: [a-z] includes A when using the case-insensitive sort GERMAN_CI, but not when using GERMAN. Given an ASCII-based character set and the BINARY sort order, [A-z] encompasses all letters, upper- and lowercase. Given an EBCDIC character set and the BINARY sort order, [A-z] fails to be a valid expression, even failing to compile, because in EBCDIC the binary representation of the letter A comes after that of the letter z.

• If a regular expression is in one character set, and the text to be searched is in another, the regular expression will be converted to the character set of the text to be searched.

• Your NLS_SORT setting affects whether case-sensitive matching is done by default. A sort such as SPANISH yields case-sensitive sorting. You can add the suffix _CI, as in SPANISH_CI, to linguistic sorts to get a case-insensitive sort.
  Use the suffix _AI for an accent-insensitive sort.

• NLS_SORT also affects which accented and unaccented characters are considered to be of the same class. For example, the expression 'na[[=i=]]ve' will match both 'naive' and 'naïve' when NLS_SORT is set to BINARY (the default sort for the AMERICAN language), but not when NLS_SORT is set to GREEK.

• NLS_SORT affects which collation elements are considered valid. For example, [.ch.] is recognized by Spanish sorting rules (when NLS_SORT equals XSPANISH), but not by American sorting rules.

1.7.3 Regular Expression Matching Options

Each of Oracle's regular expression functions takes an optional match_parameter, which is a character string that you can fill with one-character flags. This string gives you control over the following aspects of regular expression behavior:

Whether matching is case-sensitive
    NLS_SORT controls whether matching is case-sensitive by default, which it usually will be. You can override the default on a per-call basis.

Whether the period (.) matches newline characters
    By default, periods do not match newline characters (occurrences of CHR(10) on Unix systems) in the source text. You can specify that periods match newlines.

The definition of "line"
    By default, the source string that you are searching is considered one long line, and the caret (^) and dollar sign ($) match only the beginning and ending of the entire string. You can specify that the source value is to be treated as many lines delimited by newline characters. If you do so, then the ^ and $ match the beginning and end of each line, respectively.

The following example demonstrates the use of the match_parameter by performing a case-insensitive search for doubled words. The match_parameter value in this case is 'i'.
The two 1 parameters preceding 'i' in REGEXP_SUBSTR supply the default values for starting position and occurrence. Those parameters need to be specified in order to reach the match_parameter.

    SELECT park_name, REGEXP_SUBSTR(
       description,
       '(^|[[:space:][:punct:]]+)([[:alpha:]]+)'
       || '([[:space:][:punct:]])+\2'
       || '([[:space:][:punct:]]+|$)',
       1,1,'i') duplicates
    FROM park
    WHERE REGEXP_LIKE(description,
       '(^|[[:space:][:punct:]]+)([[:alpha:]]+)'
       || '([[:space:][:punct:]])+\2'
       || '([[:space:][:punct:]]+|$)',
       'i');

To specify multiple parameters, simply list them in one string. For example, to request case-insensitive matching with periods matching newline characters, specify 'in' or 'ni' as your match_parameter.

If you specify contradictory parameters, Oracle uses the last value in the string. For example, 'ic' is contradictory because 'i' asks for case-insensitivity, while 'c' asks for the opposite. Oracle resolves this by taking the last value in the string, in this case the 'c'.

If you specify parameters that are undefined, Oracle will return an ORA-01760: illegal argument for function error.

1.7.4 Standards Compliance

Oracle's regular expression engine is of the traditional nondeterministic finite automata (traditional NFA) variety, the same type used in Perl, the .NET environment, and Java. With one exception, Oracle's engine implements the syntax and behavior for extended regular expressions (EREs) as described in the POSIX standard. In addition, Oracle adds support for backreferences.

The regular expression syntax and behavior documented in the Open Group Base Specifications Issue 6, IEEE Standard 1003.1, 2003 Edition is the same as that for POSIX.
You can view the Open Group specifications at http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html

The one exception that stands between Oracle and full POSIX compliance is that Oracle does not attempt to determine the longest possible match for a pattern containing variations, as the standard requires. The following example demonstrates this very well:

    SELECT REGEXP_SUBSTR('bbb','b|bb') FROM dual;

    b

    SELECT REGEXP_SUBSTR('bbb','bb|b') FROM dual;

    bb

These two statements differ only by the order in which the alternatives are specified in the regular expression: b|bb versus bb|b. The longest possible match in either case is 'bb', and that's the match POSIX requires for both cases. However, Oracle's regular expression engine takes the first match it finds, which can be either 'b' or 'bb', depending on the order in which the alternatives are specified.

Do not confuse finding the longest possible match out of several alternations with greediness.

Like many regular expression engines, Oracle ignores the "longest possible match" rule, because the overhead of computing all possible permutations and then determining which is the longest can be excessive.

1.7.5 Differences Between Perl and Oracle

Perl has done a lot to popularize the use of regular expressions, and many regular expression engines (e.g., Java and PHP) follow Perl's implementation closely. Many readers may have learned regular expressions using Perl or a Perl-like engine, so this brief section highlights the key differences between Perl's and Oracle's support for regular expressions.

This section is based on a comparison of Perl Version 5.8 with Oracle Database 10g.

1.7.5.1 String literal issues

Regular expressions are often written as string literals.
When you move string literals from one language to another, you may encounter issues with the way that each language handles such literals.

For example, Perl enables you to use \x followed by two hexadecimal digits to embed arbitrary byte codes within a string. Perl also supports character sequences such as \n for the newline (linefeed on Unix) character. Thus, in Perl, you can write the following regular expression to search for either a linefeed or a space:

    /[\n|\x20]/

The issue is that this isn't a regular expression per se; it's a Perl string. The backslash sequences \n and \x20 have no meaning to Perl's regular expression engine, which, in fact, never sees them. Those sequences are interpreted by Perl itself. By the time the string gets to Perl's regular expression engine, \n and \x20 have been replaced by the appropriate byte codes.

Another issue you may encounter is Perl's use of the dollar sign ($) to dereference a variable within a string. In Perl, the expression /a$b/ searches for the letter 'a' followed by the contents of the Perl variable named b. Perl's regular expression engine never sees the '$b', because Perl substitutes the value of the variable before it passes the string to the engine.

Neither SQL nor PL/SQL supports the use of \ and $ in the way that Perl does. Because Perl and Oracle differ in their handling of string literals, you may not be able to take a regular expression developed for Perl and simply drop it into Oracle. Before attempting to move an expression in the form of a string literal from Perl to Oracle, make sure that the "expression" doesn't contain any characters that Perl itself interprets.

1.7.5.2 NULL versus empty strings

Unlike Perl and many database products, Oracle treats an empty string as a NULL value.
Thus, the following query, which attempts to match an empty string, brings back no data:

    SELECT * FROM park
    WHERE REGEXP_LIKE(description,'');

In Oracle, the regular expression engine does not see an empty string; rather, it sees a NULL, or the complete absence of an expression with which to do any matching.

1.7.5.3 Perl-specific syntax

Oracle's regular expression syntax is POSIX-compliant. Perl's engine supports a number of operators, character classes, and so forth that are not defined as part of the POSIX standard. These are described in Table 1-2. Where possible, we also specify a POSIX equivalent that's usable in Oracle.

The POSIX equivalents shown in Table 1-2 should work for the default locale (American_America.US7ASCII, with a BINARY sort). However, we have not yet been able to run exhaustive tests.

Table 1-2. Perl's nonstandard regular expression operators

Perl operator       Description / Oracle equivalent
[[:ascii:]]         Matches any ASCII character. In Oracle, possibly use: '[' || CHR(00) || '-' || CHR(127) || ']'.
[[:word:]]          A word character, defined as any alphanumeric character, including underscore: [[:alnum:]_]
\C                  Embeds arbitrary bytes in a regular expression. In Oracle, use the CHR function, but be aware that Oracle requires an expression to be composed of valid characters as defined by the underlying character set.
\d                  Digits: [[:digit:]]
\D                  Non-digits: [^[:digit:]]
\pP                 Named properties, no POSIX equivalent
\PP                 Negated named properties, no POSIX equivalent
\s                  Whitespace: [[:space:]], except that [[:space:]] includes vertical tab (\x0B), and \s does not.
\S                  Non-whitespace: [^[:space:]]
\w                  Alphanumeric characters: [[:alnum:]_]
\W                  Non-alphanumeric characters: [^[:alnum:]_]
\X                  Followed by a code point value, \X embeds a Unicode combining character sequence into a regular expression.
                    In Oracle, use the COMPOSE function to generate Unicode combining characters from code points.
\b \B \A \Z \z \G   Perl supports a number of zero-width assertions. None are recognized by POSIX.

1.7.5.4 Syntax Perl does not support

Perl does not support the POSIX-standard [= =] notation for defining an equivalence class. In addition, Perl does not support the use of [. .] to specify a collation element.

1.7.5.5 Negating character classes

Both Perl and Oracle support the POSIX-compliant caret (^) as the first character within a bracket expression to mean all characters except those listed within the expression. For example, you can write:

    [^A-Z]

to match on any character but the uppercase letters.

Perl also supports the use of a caret in conjunction with a character class name. For example, Perl allows you to write [[:^digit:]] to match on any character except for one in the [:digit:] class. You can get the same effect in Oracle using the form: [^[:digit:]].

1.7.5.6 Lazy quantifiers (non-greediness)

As we described in Section 1.6.7, quantifiers in a regular expression will match as many characters as possible. For example, given a source string of '123456', the expression [0-9]+ will match the entire string of six digits.

Perl supports the addition of a question mark (?) to the end of a quantifier to make it non-greedy, or lazy, in which case the quantifier matches the minimum number of characters possible. For example, the expression [0-9]+? matches only the first digit of the string '123456'.

The complete list of lazy quantifiers supported by Perl is:

    *?, +?, ??, and {}?

POSIX, and by extension Oracle, does not support these quantifiers.

1.7.5.7 Experimental features

Perl supports a mechanism for adding experimental regular expression features.
Such features always take the form (?...), in which the ellipses represent the feature-specific syntax. Comments within expressions are one of the so-called experimental features, and you can embed a comment in a Perl regular expression as follows:

    (?#area code)([[:digit:]]{3}[-\.]|\([[:digit:]]{3}\))
    (?#local number)[[:digit:]]{3}[-\.][[:digit:]]{4}

Oracle does not support Perl's experimental feature syntax.

1.7.5.8 Backreferences

In a replacement string such as one you might use with REGEXP_REPLACE, Perl supports the use of a dollar sign ($) to indicate a backreference. For example, you can use $1 to refer to the first subexpression. Oracle supports only the backslash syntax \1, \2, and so forth.

1.7.5.9 Backslash differences

POSIX and Perl differ somewhat in how they handle backslash (\) characters:

\ in a bracket-list
    In Perl, a \ in a bracket-list is treated as a metacharacter. In Oracle, a \ in a bracket-list represents itself.

\ as the last character of an expression
    Use \ as the last character of a regular expression in Perl, and you get an error. Do the same thing in Oracle, and the trailing \ is silently ignored.

1.8 Regular Expression Quick Reference

This section provides a quick-reference summary of the behavior of all the regular expression metacharacters supported by Oracle.

Most metacharacters are treated as regular characters when used within square brackets ([]). See [ ] (Square Brackets) for more details on this issue.

\ (Backslash)
Escapes a metacharacter

Use the backslash (\) to treat as normal a character that would otherwise have a special meaning. For example, to extract a dollar amount from a sentence, you might escape the period (.)
and the dollar sign ($) as follows:

    SELECT REGEXP_SUBSTR(
       'This book costs $9.95 in the U.S.',
       '\$[[:digit:]]+\.[[:digit:]]+')
    FROM dual;

    $9.95

The \$ in this expression requires that the matching text begin with a dollar sign. The \. requires a period between the two digit groups.

To specify a backslash in an expression, escape it with another backslash. The following query retrieves all text up to and including the last backslash:

    SELECT REGEXP_SUBSTR(
       'I want this \ but not this',
       '.*\\')
    FROM dual;

    I want this \

If the character following a backslash in an expression is not a metacharacter, then the backslash is ignored:

    SELECT REGEXP_SUBSTR('\test','\test')
    FROM dual;

    test

The \ has no special meaning within square brackets ([]). When used within square brackets, \ represents itself.

\1 through \9 (Backslash)
References the value matched by a preceding subexpression

Use \1, \2, \3, etc. through \9 to create backreferences to values matched by preceding subexpressions. You can backreference up to nine subexpressions, the first nine, in any expression. Subexpressions are numbered in the order in which their opening parentheses are encountered when scanning from left to right.

For example, to flip a name from last, first format to first last:

    SELECT REGEXP_REPLACE(
       'Sears, Andrew',
       '(.+), (.+)','\2 \1')
    FROM dual;

    Andrew Sears

For more examples, see Section 1.6.8 and REGEXP_REPLACE under Section 1.9.

. (Period)
Matches any character

The period matches any character in the underlying character set of the string to be searched, except that by default it does not match the newline character.

The following example uses a regular expression consisting of eight periods to extract the first sequence of eight contiguous characters from a string:

    SELECT REGEXP_SUBSTR('Do not' || CHR(10)
                         || 'Brighten the corner!'
                         ,'........')
    FROM dual;

    Brighten

These results do not include the first characters in the string, because the seventh character is a newline (CHR(10)), and that newline breaks the pattern. The first eight contiguous characters, exclusive of newlines, form the word 'Brighten'.

You can specify 'n' in the optional match_parameter to cause the period to match the newline:

    SELECT REGEXP_SUBSTR('Do not' || CHR(10)
                         || 'Brighten the corner!'
                         ,'........',1,1,'n')
    FROM dual;

    Do not
    B

Periods do not match NULLs, and they lose their special meaning when used within square brackets ([]).

^ (Caret)
Matches the beginning-of-line

Use the caret (^) to anchor an expression to the beginning of the source text, or to the beginning of a line within the source text.

By default, Oracle treats the entire source value as one line, so ^ matches only the very beginning of the source value:

    SELECT REGEXP_SUBSTR(
       'one two three','^one ')
    FROM dual;

    one

If the text you're looking for isn't at the beginning of the source string, it won't be matched. The following query returns NULL:

    SELECT REGEXP_SUBSTR(
       'two one three','^one ')
    FROM dual;

The caret is valid anywhere within an expression. For example, the following expression matches either 'One' or 'one', but in either case the word matched must come at the beginning of the string:

    SELECT REGEXP_SUBSTR(
       'one two three','^One|^one')
    FROM dual;

    one

You can change Oracle's default behavior to treat the source text as a set of "lines" delimited by newline characters. You do this using the 'm' match_parameter, as follows:

    SELECT REGEXP_SUBSTR(
       'three two one'
       || CHR(10) || 'one two three',
       '^one',1,1,'m')
    FROM dual;

    one

Because 'm' is used, the ^ anchors to the beginning of any line in the text, and the pattern '^one' matches the word 'one' at the very beginning of the second line.
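To look at multi-line mode from a slightly different angle, here is a hedged sketch that combines the 'm' flag with the occurrence argument. The pattern '^[[:alpha:]]+' and the column alias are illustrative assumptions; the source text is the same two-line value used in the query above.

```sql
-- Sketch only: the pattern and alias are chosen for illustration.
-- With 'm', ^ can anchor at the start of every line, so the second
-- occurrence of "a word at the start of a line" falls on line two.
SELECT REGEXP_SUBSTR(
          'three two one' || CHR(10) || 'one two three',
          '^[[:alpha:]]+', 1, 2, 'm') AS second_line_word
FROM dual;
```

Asking for the second occurrence rather than the first is what moves the match from the first line to the second; the 'm' flag is what makes a second occurrence possible at all, since without it the caret would have only the very beginning of the value to anchor to.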
Be careful, though, that you don't write an impossible expression such as 'one^one', which attempts to anchor the beginning of the string, or the beginning of a line within the string, to the middle of a value matched by the expression. You can only anchor the beginning of a line/string to the beginning of a value. If you want to match a value across a newline, you can take one of at least two approaches:

• Use an expression such as 'two[[:space:]]three', which works because the definition of [:space:] includes newline.

• If you specifically must have the newline character in the value, then build an expression containing the newline character, as in: 'two' || CHR(10) || 'three'.

The ^ is not a metacharacter within square brackets ([]), except when it is the very first character within those brackets. In such cases, it negates the remaining characters within the brackets. See [ ] (Square Brackets) for details.

$ (Dollar Sign)
Matches the end-of-line

Use the dollar sign ($) to anchor a regular expression to the end of the source text, or to the end of a line within the source text.

For example, the $ in the following query's regular expression is the reason why 'three' is returned rather than 'one':

    SELECT REGEXP_SUBSTR(
       'one two three','(one|two|three)$')
    FROM dual;

    three

As with the caret (^), you can use 'm' to treat the source text as a series of "lines" delimited by newline (CHR(10) on Unix systems) characters.

The $ is not a metacharacter within square brackets ([]).

[ ] (Square Brackets)
Matches any of a set of characters

Use square brackets ([]) to create a matching list that will match on any one of the characters in the list.
The following example searches for a string of digits by applying the plus (+) quantifier to a matching list consisting of the set of digits 0-9:

    SELECT REGEXP_SUBSTR(
       'Andrew is 14 years old.',
       '[0123456789]+ years old')
    FROM dual;

    14 years old

A better solution to this problem is to define a range of digits using the dash (-):

    [0-9]+ years old

Even better is to specify a character class:

    [[:digit:]]+ years old

Begin a list with a caret (^) to create a non-matching list that specifies characters to which you do not want to match. The following extracts all of a sentence except the ending punctuation:

    SELECT REGEXP_SUBSTR(
       'This is a sentence.',
       '.*[^.!:]')
    FROM dual;

    This is a sentence

Virtually all regular expression metacharacters lose their special meaning and are treated as regular characters when used within square brackets. The period in the previous SELECT statement provides an example of this, and Table 1-3 describes some exceptions to this general rule.

Table 1-3. Characters that retain special meaning within square brackets

Character   Meaning
^           An initial ^ defines a non-matching list. Otherwise, the ^ has no special meaning.
-           Specifies a range, for example 0-9. When used as the very first or very last character between brackets, or as the first character following a leading ^ within brackets, the - holds no special meaning, and matches itself.
[           Represents itself, unless used as part of a character class, equivalence class, or collation. For example, use [[] to match just the left, square bracket ([).
]           Represents itself when it is the first character following the opening (left) bracket ([), or the first character following a leading caret (^). For example, use [][] to match opening and closing square brackets; use [^][] to match all but those brackets.
[: :]       Encloses a character class name, for example [:digit:].
[. .]       Encloses a collation element, for example [.ch.].
[= =]       Encloses an equivalence class, for example [=e=].

[. .] (Collation Element)
Specifies a collation element

Use [. and .] to enclose a collation element, usually a multicharacter collation element. Collation elements must be specified within bracket expressions.

The following example uses the collation element [.ch.] to find a word containing the Spanish letter 'ch' with a case-insensitive search. First, look at the results when we simply place the letters 'c' and 'h' in a bracket expression:

    ALTER SESSION SET NLS_LANGUAGE=SPANISH;

    SELECT REGEXP_SUBSTR(
       'El caballo, Chico come la tortilla.',
       '[[:alpha:]]*[ch][[:alpha:]]*',1,1,'i')
    FROM dual;

    caballo

These aren't the results we want! Even though 'ch' is two letters, Spanish, at least old Spanish, treats it as one. Collation elements let us deal with this situation:

    ALTER SESSION SET NLS_SORT=XSPANISH;

    SELECT REGEXP_SUBSTR(
       'El caballo, Chico come la tortilla.',
       '[[:alpha:]]*[[.ch.]][[:alpha:]]*',1,1,'i')
    FROM dual;

    Chico

By specifying the collation [.ch.] in the bracket expression, we tell the regular expression engine to look for the combination 'ch', not for a 'c' or an 'h'. We also had to change our NLS_SORT setting from SPANISH (the default for the Spanish language) to XSPANISH in order for the collation to be recognized. This is because SPANISH uses modern rules that treat 'ch' as two letters, but XSPANISH uses older rules that treat 'ch' as one letter.

You cannot arbitrarily put any two letters in a collation. See Table 1-4.

Technically, any single character is a collation element. Thus, [a] and [[.a.]] are equivalent.
In practice, you only need to use collation element syntax when a collation element consists of multiple characters that linguistically represent one character. Table 1-4 provides a list of such cases recognized by Oracle. The collation elements in the table are only valid for the specified NLS_SORT settings.

Table 1-4. Collation elements

NLS_SORT             Multicharacter collation elements
XDANISH              aa AA Aa
                     oe OE Oe
XSPANISH             ch CH Ch
                     ll LL Ll
XHUNGARIAN           cs CS Cs
                     gy GY Gy
                     ly LY Ly
                     ny NY Ny
                     sz SZ Sz
                     ty TY Ty
                     zs ZS Zs
XCZECH               ch CH Ch
XCZECH_PUNCTUATION   ch CH Ch
XSLOVAK              dz DZ Dz
                     dž DŽ Dž
                     ch CH Ch
XCROATIAN            dž DŽ Dž
                     lj LJ Lj
                     nj Nj NJ

[: :] (Character Class)
Specifies a character class

Use [: and :] to enclose a character class name, for example: [:alpha:]. Character classes must be specified within bracket expressions, as in [[:alpha:]].

The following example uses the character class [:digit:] to match the digits in a ZIP code:

    SELECT REGEXP_SUBSTR(
       'Munising MI 49862',
       '[[:digit:]]{5}') zip_code
    FROM dual;

    49862

In this example, we could just as well have used the pattern [0-9]{5}. However, in multilingual environments digits are not always the characters 0-9. The character class [:digit:] matches the English 0-9, the Arabic-Indic digits, the Tibetan digits, and so forth.

Table 1-5 describes the character class names recognized by Oracle. All names are case-sensitive.

Table 1-5. Supported character classes

Class        Description
[:alnum:]    Alphanumeric characters (same as [:alpha:] + [:digit:])
[:alpha:]    Alphabetic characters only
[:blank:]    Blank space characters, such as space and tab
[:cntrl:]    Nonprinting or control characters
[:digit:]    Numeric digits
[:graph:]    Graphical characters (same as [:punct:] + [:upper:] + [:lower:] + [:digit:])
[:lower:]    Lowercase letters
[:print:]    Printable characters
[:punct:]    Punctuation characters
[:space:]    Whitespace characters such as space, form-feed, newline, carriage return, horizontal tab, and vertical tab
[:upper:]    Uppercase letters
[:xdigit:]   Hexadecimal characters

[= =] (Equivalence Class)
Specifies an equivalence class

Use [= and =] to surround a letter when you want to match all accented and unaccented versions of that letter. The resulting equivalence class reference must always be within a bracket expression. For example:

    SELECT REGEXP_SUBSTR('eéëèÉËÈE', '[[=É=]]+')
    FROM dual;

    eéëèÉËÈE

    SELECT REGEXP_SUBSTR('eéëèÉËÈE', '[[=e=]]+')
    FROM dual;

    eéëèÉËÈE

It doesn't matter which version of a letter you specify between the [= and =]. All equivalent accented and unaccented letters, whether upper- or lowercase, will match.

NLS_SORT determines which characters are considered to be equivalent. Thus, equivalence can be determined appropriately for whatever language you are using.

* (Asterisk)
Matches zero or more

The asterisk (*) is a quantifier that applies to the preceding regular expression element. It specifies that the preceding element may occur zero or more times.

The following example uses ^.*$ to return the second line of a text value.

    SELECT REGEXP_SUBSTR('Do not' || CHR(10)
                         || 'Brighten the corner!'
                         ,'^.*$',1,2,'m')
    FROM dual;

    Brighten the corner!

The 'm' match_parameter is used to cause the ^ and $ characters to match the beginning and end of each line, respectively.
The .* matches any and all characters between the beginning and end of the line. The first match of this expression is the string "Do not". We passed a 2 as the fourth parameter to request the second occurrence of the regular expression.

If the previous element is a bracket expression, the asterisk matches a string of zero or more characters from the set defined by that expression:

    SELECT REGEXP_SUBSTR('123789',
                         '[[:digit:]]*')
    FROM dual;

    123789

Likewise, the preceding element might be a subexpression. In the following example, each fruit name may be followed by zero or more spaces, and we are looking for any number of such fruit names:

    SELECT REGEXP_SUBSTR('apple apple orange wheat',
       '((apple|orange|pear)[[:space:]]*)*')
    FROM dual;

    apple apple orange

Watch out! The asterisk can surprise you. Consider the following:

    SELECT REGEXP_SUBSTR('abc123789def',
                         '[[:digit:]]*')
    FROM dual;

The result of executing this query will be a NULL. Why? Because [[:digit:]] is optional. When the regular expression engine looks at the first character in the string (the letter 'a') it will decide that, sure enough, it has found zero or more digits, in this case zero digits. The regular expression will be satisfied, and REGEXP_SUBSTR will return a string of zero characters, which in Oracle is the same as a NULL.

+ (Plus Sign)
Matches one or more

The plus (+) is a quantifier that matches one or more occurrences of the preceding element. The plus is similar to the asterisk (*) in that many occurrences are acceptable, but unlike the asterisk in that at least one occurrence is required.

The following is a modification of the first example from the previous section on the asterisk. This example also returns the second line of a text value, but the difference is that this time .+ is used to return the second line containing characters.
    SELECT REGEXP_SUBSTR('Do not' || CHR(10)
                         || CHR(10)
                         || 'Brighten the corner!'
                         ,'^.+$',1,2,'m')
    FROM dual;

    Brighten the corner!

The first line is 'Do not', and is skipped because the fourth parameter requests line two. The second line is a NULL line, which is skipped because it contains no characters. The third line is returned from the function because it's the second occurrence of the pattern: a line containing characters.

Just as the asterisk can be applied to bracket expressions and subexpressions, so can the plus. Unlike the asterisk, the plus will not match on a NULL. Following is a modification of the query in the preceding section that returned a NULL, but this time the + quantifier is used:

    SELECT REGEXP_SUBSTR('abc123789def',
                         '[[:digit:]]+')
    FROM dual;

    123789

Because + is used, the expression will not match on the NULL string preceding the letter a. Instead, the regular expression engine will continue on through the source string looking for one or more digits.

? (Question Mark)
Matches zero or one

The question mark (?) is very similar to the asterisk (*), except that it matches at most one occurrence of the preceding element. For example, the following returns only the first fruit:

    SELECT REGEXP_SUBSTR('apple apple orange wheat',
       '((apple|orange|pear)[[:space:]]*)?')
    FROM dual;

    apple

Like the *, the ? can surprise you by matching where you don't expect. In this case, if the string doesn't begin with a fruit name, the ? will match on the empty string. See * (Asterisk) for an example of this kind of behavior.

{ } (Curly Braces)
Matches a specific number of times

Use curly braces ({}) when you want to be very specific about the number of occurrences an operator or subexpression must match in the source string. Curly braces and their contents are known as interval expressions.
You can specify an exact number or a range, using any of the forms shown in Table 1-6.

Table 1-6. Forms of the { } interval expression

    Form     Meaning
    -------  ---------------------------------------------------------
    {m}      The preceding element or subexpression must occur
             exactly m times.
    {m,n}    The preceding element or subexpression must occur
             between m and n times, inclusive.
    {m,}     The preceding element or subexpression must occur
             at least m times.

The following example, taken from Section 1.6, uses curly braces to specify the number of digits in the different phone number groupings:

    SELECT park_name
    FROM park
    WHERE REGEXP_LIKE(description,
          '[[:digit:]]{3}-[[:digit:]]{4}');

Using the {m,n} form, you can specify a range of occurrences you are willing to accept. The following query uses {3,5} to match from three to five digits:

    SELECT REGEXP_SUBSTR(
       '1234567890','[[:digit:]]{3,5}')
    FROM dual;

    12345

Using {m,}, you can leave the upper end of a range unbounded:

    SELECT REGEXP_SUBSTR(
       '1234567890','[[:digit:]]{3,}')
    FROM dual;

    1234567890

| (Vertical Bar)

Delimits alternative possibilities

The vertical bar (|) is known as the alternation operator. It delimits, or separates, alternative subexpressions that are equally acceptable. For example, the expression in the following query extracts the name of a fruit from a sentence. In this example the fruit is 'apple', but any of the three listed fruits ('apple', 'apricot', or 'orange') is equally acceptable as a match:

    SELECT REGEXP_SUBSTR(
       'An apple a day keeps the doctor away.',
       'apple|apricot|orange')
    FROM dual;

    apple

It's usually wise to constrain your alternations using parentheses. To see why, suppose you modify the previous example to return the entire sentence. Without parentheses, you would have to write out every alternative in full:

    SELECT REGEXP_SUBSTR(
       'An apple a day keeps the doctor away.',
       'An apple a day keeps the doctor away.'
       || '|An apricot a day keeps the doctor away.'
       || '|An orange a day keeps the doctor away.')
    FROM dual;

This solution works, but it's painfully repetitive and does not scale well. If there were two words that could change in each sentence, and if each word had three possibilities, you'd need to write 3 x 3 = 9 alternate versions of the sentence. The following approach is much better, and easier:

    SELECT REGEXP_SUBSTR(
       'An apple a day keeps the doctor away.',
       'An (apple|apricot|orange) a day '
       || 'keeps the doctor away.')
    FROM dual;

By constraining the alternation to just that part of the text that can vary, we eliminated the need to repeat the text that stays the same.

An expression such as (abc|) is valid, and will match either 'abc' or nothing at all. However, using (abc)? will look less like a mistake, and will make your intent clearer.

( ) (Parentheses)

Defines a subexpression

Place parentheses (()) around a portion of a regular expression to define a subexpression. Subexpressions are useful for the following purposes:

• To constrain an alternation to the subexpression.
• To provide for a backreference to the value matched by the subexpression.
• To allow a quantifier to be applied to the subexpression as a whole.

The regular expression in the following example uses parentheses twice. The innermost set constrains the alternation to the three fruit names. The outermost set defines a subexpression in the form of fruit name + space, which we require to appear from 1 to 3 times in the text.

    SELECT REGEXP_SUBSTR(
       'orange apple pear lemon lime',
       'orange ((apple|pear|lemon)[[:space:]]){1,3}')
    FROM dual;

    orange apple pear lemon

See Section 1.6, especially under Section 1.6.6 and Section 1.6.8, for more examples showing the use of parentheses in regular expressions.
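As a further illustration of the second and third purposes listed above, the following sketch applies a quantifier to a subexpression as a whole and then uses a backreference within a pattern. The source strings are invented for this illustration and are not part of the example data. The results in the comments are what the rules described in this section lead us to expect.

```sql
-- A quantifier applied to a subexpression as a whole:
-- (ab){2,} requires 'ab' to repeat at least twice.
SELECT REGEXP_SUBSTR('xxababab-yy',
       '(ab){2,}')
FROM dual;
-- expected: ababab

-- A backreference to the value matched by a subexpression:
-- \1 must repeat whatever ([[:alpha:]]+) matched.
SELECT REGEXP_SUBSTR('say hello hello again',
       '([[:alpha:]]+) \1')
FROM dual;
-- expected: hello hello
```

In the second query, no word before 'hello' is followed by a space and a copy of itself, so the engine moves on until the subexpression captures 'hello' and the backreference finds it repeated.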
1.9 Oracle Regular Expression Functions

Oracle's regular expression support, which we introduced earlier in the book, manifests itself in the form of four functions, which are described in this section. Each function is usable from both SQL and PL/SQL.

All the examples in this section search text literals. We do this to make it obvious how each function works, by showing you both input and output for each example. Typically, you do not use regular expressions to search string literals, but rather to search character columns in the database, or character variables in PL/SQL.

For the same reason, the regular expressions in this section are simple to the extreme. We don't want you puzzling over our expressions when what you really want is to understand the functions.

REGEXP_INSTR

Locates text matching a pattern

REGEXP_INSTR returns the beginning or ending character position of a regular expression within a string. You specify which position you want. The function returns zero if no match is found.

Syntax

    REGEXP_INSTR(source_string, pattern
       [, position [, occurrence
       [, return_option
       [, match_parameter]]]])

All parameters after the first two are optional. However, to specify any one optional parameter, you must specify all preceding parameters. Thus, if you want to specify match_parameter, you must specify all parameters.

Parameters

source_string
    The string you want to search.

pattern
    A regular expression describing the text pattern you are searching for. This expression may not exceed 512 bytes in length.

position
    The character position at which to begin the search. This defaults to 1, and must be positive.

occurrence
    The occurrence of pattern you are interested in finding. This defaults to 1. Specify 2 if you want to find the second occurrence of the pattern, 3 for the third occurrence, and so forth.

return_option
    Specify 0 (the default) to return the pattern's beginning character position. Specify 1 to return the ending character position.

match_parameter
    A set of options in the form of a character string that change the default manner in which regular expression pattern matching is performed. You may specify any, all, or none of the following options, in any order:

    'i'
        Specifies case-insensitive matching.
    'c'
        Specifies case-sensitive matching. The NLS_SORT parameter setting determines whether case-sensitive or case-insensitive matching is done by default.
    'n'
        Allows the period (.) to match the newline character. Normally, that is not the case.
    'm'
        Causes the caret (^) and dollar sign ($) to match the beginning and ending, respectively, of lines within the source string. Normally, the caret (^) and dollar sign ($) match only the very beginning and very ending of the source string, regardless of any newline characters within the string.
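To illustrate how return_option and match_parameter combine, here is a short sketch. The source string is invented for this illustration. Because match_parameter is the last argument, all six arguments are supplied, as the syntax above requires.

```sql
-- Case-insensitive search; return_option 0 gives the starting position.
SELECT REGEXP_INSTR('Fort MACKINAC', 'mackinac', 1, 1, 0, 'i')
FROM dual;
-- expected: 6

-- Same search; return_option 1 gives the position one past the match.
SELECT REGEXP_INSTR('Fort MACKINAC', 'mackinac', 1, 1, 1, 'i')
FROM dual;
-- expected: 14

-- Forcing case-sensitive matching with 'c' finds nothing, so the
-- function returns zero.
SELECT REGEXP_INSTR('Fort MACKINAC', 'mackinac', 1, 1, 0, 'c')
FROM dual;
-- expected: 0
```

Specifying 'i' or 'c' explicitly makes a query independent of the default case sensitivity implied by NLS_SORT.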
Examples

Following is an example of a simple case, in which the string 'Mackinac', commonly misspelled 'Mackinaw', is located within a larger string:

    SELECT REGEXP_INSTR(
       'Fort Mackinac was built in 1870',
       'Mackina.')
    FROM dual;

    6

If you're interested in the ending character position (actually, one past the ending position), you can specify a value of 1 for return_option, which forces you to also specify values for position and occurrence:

    SELECT REGEXP_INSTR(
       'Fort Mackinac was built in 1870',
       'Mackina.',1,1,1)
    FROM dual;

    14

The occurrence parameter enables you to locate an occurrence of a pattern other than the first:

    SELECT REGEXP_INSTR(
       'Fort Mackinac is near Mackinaw City',
       'Mackina.',1,2)
    FROM dual;

    23

The following example uses position to skip the first 14 characters of the search string, beginning the search at character position 15:

    SELECT REGEXP_INSTR(
       'Fort Mackinac is near Mackinaw City',
       'Mackina.',15)
    FROM dual;

    23

For an example involving match_parameter, see Section 1.7.3 in Section 1.7.

REGEXP_LIKE

Determines whether a given pattern exists

REGEXP_LIKE is a Boolean function, or predicate, which returns true if a string contains text matching a specified regular expression. Otherwise REGEXP_LIKE returns false.

Syntax

    REGEXP_LIKE (source_string, pattern
       [, match_parameter])

Parameters

source_string
    The string you want to search.

pattern
    A regular expression describing the text pattern you are searching for. This expression may not exceed 512 bytes in length.

match_parameter
    A set of options in the form of a character string that change the default manner in which regular expression pattern matching is performed. You may specify any, all, or none of the following options, in any order:

    'i'
        Specifies case-insensitive matching.
    'c'
        Specifies case-sensitive matching. The NLS_SORT parameter setting determines whether case-sensitive or case-insensitive matching is done by default.
    'n'
        Allows the period (.) to match the newline character. Normally, that is not the case.
    'm'
        Causes the caret (^) and dollar sign ($) to match the beginning and ending, respectively, of lines within the source string. Normally, the caret (^) and dollar sign ($) match only the very beginning and very ending of the source string, regardless of any newline characters within the string.

Examples

In a SQL statement, REGEXP_LIKE may be used only as a predicate in the WHERE and HAVING clauses. This is because SQL does not recognize the Boolean data type. For example:

    SELECT 'Phone number present'
    FROM DUAL
    WHERE REGEXP_LIKE(
       'Tahquamenon Falls: (906) 492-3415',
       '[0-9]{3}[-.][0-9]{4}');

In PL/SQL, REGEXP_LIKE may be used in the same manner as any other Boolean function:

    DECLARE
       has_phone BOOLEAN;
    BEGIN
       has_phone := REGEXP_LIKE(
          'Tahquamenon Falls: (906) 492-3415',
          '[0-9]{3}[-.][0-9]{4}');
    END;
    /

REGEXP_LIKE, and even the other regular expression functions, can also be used in CHECK constraints. The following constraint ensures that phone numbers are always stored in (xxx) xxx-xxxx format:

    ALTER TABLE park
    ADD (CONSTRAINT phone_number_format
         CHECK (REGEXP_LIKE(park_phone,
         '^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$')));

For an example involving match_parameter, see Section 1.7.3 in Section 1.7.

REGEXP_REPLACE

Replaces text matching a pattern

REGEXP_REPLACE searches a string for substrings matching a regular expression, and replaces each substring with text that you specify. Your replacement text may contain backreferences to subexpressions in the regular expression. The new string, with all replacements made, is returned as the function's result.

REGEXP_REPLACE returns either a VARCHAR2 or a CLOB, depending on the input type.
The return value's character set will match that of the source string.

Syntax

    REGEXP_REPLACE(source_string, pattern
       [, replace_string
       [, position [, occurrence
       [, match_parameter]]]])

All parameters after the first two are optional. However, to specify any one optional parameter, you must specify all preceding parameters. Thus, if you want to specify match_parameter, you must specify all parameters.

Parameters

source_string
    The string containing the substrings that you want to replace.

pattern
    A regular expression describing the text pattern of the substrings you want to replace. Maximum length is 512 bytes.

replace_string
    The replacement text. Each occurrence of pattern in source_string is replaced by replace_string. See "Backreferences" later in this section for important information on using regular expression backreferences in the replacement text.

    Maximum length is 32,767 bytes. Any replacement text value larger than 32,767 bytes will be truncated to that length. If you're using multibyte characters, truncation might result in fewer than 32,767 bytes, because Oracle will truncate to a character boundary, never leaving a partial character in a string.

    Up to 500 backreferences are supported in the replacement text. To place a backslash (\) into the replacement text, you must escape it, as in \\.

position
    The character position at which to begin the search-and-replace operation. This defaults to 1, and must be positive.

occurrence
    The occurrence of pattern you are interested in replacing. This defaults to 0, causing all occurrences to be replaced. Specify 1 if you want to replace only the first occurrence of the pattern, 2 for only the second occurrence, and so forth.

match_parameter
    A set of options in the form of a character string that change the default manner in which regular expression pattern matching is performed. You may specify any, all, or none of the following options, in any order:

    'i'
        Specifies case-insensitive matching.
    'c'
        Specifies case-sensitive matching. The NLS_SORT parameter setting determines whether case-sensitive or case-insensitive matching is done by default.
    'n'
        Allows the period (.) to match the newline character. Normally, that is not the case.
    'm'
        Causes the caret (^) and dollar sign ($) to match the beginning and ending, respectively, of lines within the source string. Normally, the caret (^) and dollar sign ($) match only the very beginning and very ending of the source string, regardless of any newline characters within the string.

Examples

Following is an example of the simplest type of search-and-replace operation, in this case correcting any misspellings of the name Mackinaw City:

    SELECT REGEXP_REPLACE(
       'It''s Mackinac Bridge, but Mackinac City.',
       'Mackina. City', 'Mackinaw City')
    FROM dual;

    It's Mackinac Bridge, but Mackinaw City.

By default, all occurrences of text matching the regular expression are replaced. The following example specifies 2 for the occurrence argument, so that only the second occurrence of the pattern 'Mackina.' is replaced:

    SELECT REGEXP_REPLACE(
       'It''s Mackinac Bridge, but Mackinac City.',
       'Mackina.', 'Mackinaw',1,2)
    FROM dual;

    It's Mackinac Bridge, but Mackinaw City.

For an example of the position argument's use, see REGEXP_INSTR. For an example involving match_parameter, see Section 1.7.3 in Section 1.7.

Backreferences

REGEXP_REPLACE allows the use of regular expression backreferences in the replacement text string. Such backreferences refer to values matching the corresponding subexpressions in the pattern argument.
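Before turning to a longer case, a minimal sketch may help show the mechanics. The input string here is invented for illustration. The pattern defines two subexpressions, and the replacement text refers to them in reverse order:

```sql
-- \1 is the text matched by the first subexpression (the surname),
-- \2 the text matched by the second (the given name).
-- The replacement text swaps them and drops the comma.
SELECT REGEXP_REPLACE(
   'Gennick, Jonathan',
   '([[:alpha:]]+), ([[:alpha:]]+)',
   '\2 \1')
FROM dual;
-- expected: Jonathan Gennick
```

Each \n in the replacement text stands for whatever the nth parenthesized subexpression matched in that particular occurrence of the pattern.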
The following example makes use of backreferences to fix doubled word problems:

    SELECT park_name, REGEXP_REPLACE(description,
       '([[:space:][:punct:]]+)([[:alpha:]]+)'
       || '([[:space:][:punct:]]+)\2'
       || '[[:space:][:punct:]]+',
       '\1\2\3') description
    FROM park
    WHERE REGEXP_LIKE(description,
       '([[:space:][:punct:]]+)([[:alpha:]]+)'
       || '([[:space:][:punct:]]+)\2'
       || '[[:space:][:punct:]]+');

Look carefully at the subexpressions in the pattern expression, and you'll see that the subexpressions have the following meanings:

\1
    The space and punctuation preceding the first occurrence of the word. This we keep.
\2
    The first occurrence of the doubled word, which we also keep.
\3
    The space and punctuation following the first occurrence, which we also keep.

The second occurrence of the doubled word, and whatever space and punctuation follows it, are arbitrarily discarded.

While the pattern shown in this section is an interesting way to rid yourself of doubled words, it may or may not yield correct sentences.

See Section 1.6.8 in Section 1.6 for a more comprehensive explanation of backreferences.

REGEXP_SUBSTR

Extracts text matching a pattern

REGEXP_SUBSTR scans a string for text matching a regular expression, and then returns that text as its result. If no text is found, NULL is returned.

Syntax

    REGEXP_SUBSTR(source_string, pattern
       [, position [, occurrence
       [, match_parameter]]])

All parameters but the first two are optional. However, to specify any optional parameter, you must specify all preceding parameters. Thus, when specifying match_parameter, all other parameters are also required.

Parameters

source_string
    The string you want to search.

pattern
    A regular expression describing the pattern of text you want to extract from the source string.

position
    The character position at which to begin searching. This defaults to 1.

occurrence
    The occurrence of pattern you want to extract. This defaults to 1.

match_parameter
    A set of options in the form of a character string that change the default manner in which regular expression pattern matching is performed. You may specify any, all, or none of the following options, in any order:

    'i'
        Specifies case-insensitive matching.
    'c'
        Specifies case-sensitive matching. The NLS_SORT parameter setting determines whether case-sensitive or case-insensitive matching is done by default.
    'n'
        Allows the period (.) to match the newline character. Normally, that is not the case.
    'm'
        Causes the caret (^) and dollar sign ($) to match the beginning and ending, respectively, of lines within the source string. Normally, the caret (^) and dollar sign ($) match only the very beginning and very ending of the source string, regardless of any newline characters within the string.

Examples

The following example extracts U.S. and Canadian phone numbers from park descriptions:

    SELECT park_name, REGEXP_SUBSTR(description,
       '([[:digit:]]{3}[-.]|\([[:digit:]]{3}\) )'
       ||'[[:digit:]]{3}[-.][[:digit:]]{4}') park_phone
    FROM park;

    PARK_NAME                   PARK_PHONE
    --------------------------  --------------
    Färnebofjärden              ***NULL***
    Mackinac Island State Park  517-373-1214
    Fort Wilkins State Park     (800) 447-2757
    ...

This PL/SQL-based example loops through the various phone numbers in a description:

    <<local>>
    DECLARE
       description park.description%TYPE;
       phone VARCHAR2(14);
       phone_index NUMBER;
    BEGIN
       SELECT description INTO local.description
       FROM park
       WHERE park_name = 'Fort Wilkins State Park';

       phone_index := 1;
       LOOP
          phone := REGEXP_SUBSTR(local.description,
             '([[:digit:]]{3}[-.]|\([[:digit:]]{3}\) )'
             ||'[[:digit:]]{3}[-.][[:digit:]]{4}',
             1,phone_index);
          EXIT WHEN phone IS NULL;
          DBMS_OUTPUT.PUT_LINE(phone);
          phone_index := phone_index + 1;
       END LOOP;
    END;
    /

    (800) 447-2757
    906.289.4215
    (906) 289-4210

The key to this example is that phone_index is incremented following each match, causing REGEXP_SUBSTR to iterate through the first, second, and third phone numbers. Iteration stops when a NULL return value indicates that there are no more phone numbers to display.

1.10 Oracle Regular Expression Error Messages

The following list details Oracle errors specific to regular expressions, and suggests how you might resolve them.

ORA-01760: illegal argument for function

    This is not strictly a regular expression error. However, you can get this error if you pass an invalid match_parameter to one of the REGEXP functions. See Section 1.7.3 in Section 1.7 for more details. You can also get this error by passing an invalid type for any parameter. For example, you'll get this error if you pass a number where a string is expected, or vice versa.

    If you do get this error as the result of a call to one of the REGEXP functions, check to be sure that all your argument types are valid, and that you are passing only valid matching options ('i', 'c', 'm', or 'n') in your match_parameter argument, which is always the last argument of a REGEXP function call.
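As a sketch of the most common cause described above, the first query below passes 'x', which is not one of the four valid matching options, and so is expected to fail with ORA-01760. The second query shows one possible correction. Both queries are invented for illustration.

```sql
-- 'x' is not a valid matching option ('i', 'c', 'n', or 'm'),
-- so this call is expected to raise ORA-01760.
SELECT 'match found'
FROM dual
WHERE REGEXP_LIKE('abc', 'B', 'x');

-- Corrected: 'i' requests case-insensitive matching,
-- so 'B' matches the 'b' in 'abc'.
SELECT 'match found'
FROM dual
WHERE REGEXP_LIKE('abc', 'B', 'i');
```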
ORA-12722: regular expression internal error

    Contact Oracle Support and open a Technical Assistance Request (TAR), because you've encountered a bug.

ORA-12725: unmatched parentheses in regular expression

    You have mismatched parentheses in your expression. For example, an expression like '(a' will cause this error. Carefully check each subexpression to be sure you include both opening and closing parentheses. Check to see whether you've correctly escaped parentheses that do not enclose subexpressions, and make sure you haven't inadvertently escaped a parenthesis that should open or close a subexpression.

ORA-12726: unmatched bracket in regular expression

    You have mismatched square brackets in your expression. Apply the advice we give for ORA-12725, but this time look at your use of square brackets. Also, while an expression such as '[a' will cause this error, an expression such as 'a]' will not, because a closing (right) bracket is treated as a regular character unless it is preceded by an opening (left) bracket.

ORA-12727: invalid backreference in regular expression

    You wrote a backreference to a subexpression that does not exist, or that does not yet exist. For example, '\1' is invalid because there is no subexpression to reference. On the other hand, '\1(abc)' is invalid because the backreference precedes the subexpression to which it refers. Verify that all your backreferences are valid, and that they always refer to preceding subexpressions.

ORA-12728: invalid range in regular expression

    You specified a range, such as '[z-a]', in which the starting character does not precede the ending character. Check each range in your expression to ensure that the beginning character precedes the ending character. Also check your NLS_SORT setting, as it is NLS_SORT that determines the ordering of characters used to define a range.

ORA-12729: invalid character class in regular expression

    You specified an invalid character class name within [: and :]. Check your regular expression to be sure you are using only those names valid for your release of Oracle. Table 1-4 in Section 1.8 lists names valid for the initial release of Oracle Database 10g.

ORA-12730: invalid equivalence class in regular expression

    You specified a sequence of characters within [= and =] that cannot be resolved to a single base letter. For example, [=ab=] is not a valid two-character equivalence.

ORA-12731: invalid collation class in regular expression

    You specified a collation element that does not exist in your current sort order. For example, specifying [.ch.] when NLS_SORT is other than XSPANISH or XCZECH will cause this error, because other languages never treat the combination 'ch' as a single character. Check your expression to be sure that each use of [. and .] is valid, and check your NLS_SORT setting.

ORA-12732: invalid interval value in regular expression

    Using curly braces, you specified a range of repeat counts in which the beginning of the range is greater than the end. For example, '{3,1}' is invalid because 3 is greater than 1. Within curly braces, the smallest value must come first (e.g., '{1,3}').
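One way to see which of these errors a given pattern triggers is to trap the exception in PL/SQL and print its message. The following is a sketch only. The list of malformed patterns is assembled from the examples in the entries above, the names pattern_list, bad_patterns, and is_match are hypothetical, and SET SERVEROUTPUT ON assumes you are running from SQL*Plus. The exact message reported for each pattern may depend on your release and NLS_SORT setting, so no output is shown.

```sql
SET SERVEROUTPUT ON

DECLARE
   -- Hypothetical list of malformed patterns, based on the
   -- examples given in the error descriptions above.
   TYPE pattern_list IS TABLE OF VARCHAR2(20);
   bad_patterns pattern_list :=
      pattern_list('(a', '[a', '\1', '[z-a]', 'a{3,1}');
   is_match BOOLEAN;
BEGIN
   FOR i IN 1 .. bad_patterns.COUNT LOOP
      BEGIN
         -- Evaluating a malformed pattern should raise an error
         -- rather than return TRUE or FALSE.
         is_match := REGEXP_LIKE('abc', bad_patterns(i));
         DBMS_OUTPUT.PUT_LINE(bad_patterns(i) || ' -> no error');
      EXCEPTION
         WHEN OTHERS THEN
            -- SQLERRM holds the ORA- error number and message text.
            DBMS_OUTPUT.PUT_LINE(bad_patterns(i) || ' -> ' || SQLERRM);
      END;
   END LOOP;
END;
/
```

Wrapping each call in its own BEGIN ... EXCEPTION ... END block lets the loop continue after a failure, so every pattern in the list is reported.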