SQLMDX

Eventually it's all about data

Evolution of regular expressions

Posted by sqlmdx on July 2, 2016

Frankly speaking “evoulution” is not quite proper word to title this post.
I’m going to show some extensions in regular expressions in different languages in compare to their implementation in Oracle.

Oracle supports regular expressions according to POSIX standard draft 1003.2/D11.2 and two extensions

  • matching capabilities for multilingual data
  • some commonly used PERL regular expression operators (for example, character class shortcuts and the nongreedy modifier “?”)

Let’s consider quizzes which cannot be easily resolved using built-in capabilities in Oracle.

Challenge 1
Find files which do not contain “read” before extension. File extension is a combination of any three letters.

SQL> select value(t) value
  2    from table(sys.odcivarchar2list('abread.txt',
  3                                    'ydear.txt',
  4                                    'zzdare.txt',
  5                                    'xread.com',
  6                                    'hello.exe')) t
  7   where regexp_like(value(t), '\w+[^read]\.\w{3}$');

VALUE
---------------
hello.exe

Non-matching character list does not help to achieve desired result since all combinations of letters r,e,a,d get rejected.
Possible workaround is to replace non-matching word with some character we sure is not a part of a string and apply regular expression after that.

SQL> select value(t) value
  2    from table(sys.odcivarchar2list('abread.txt',
  3                                    'ydear.txt',
  4                                    'zzdare.txt',
  5                                    'xread.com',
  6                                    'hello.exe')) t
  7   where regexp_like(replace(value(t), 'read', '#'), '\w+[^#]\.\w{3}$');

VALUE
---------------
ydear.txt
zzdare.txt
hello.exe

It returns expected result but this approach has a drawback.
For instance, consider following statement: find strings starting with “ABC”, ending with “XYZ” and not containing “COX” between them.
Approach with additional replace filters out “ABCOXYZ” which is incorrect.

SQL> select value(t) value
  2    from table(sys.odcivarchar2list('ABCXYZ',
  3                                    'ABC+XYZ',
  4                                    'ABCOXYZ',
  5                                    'ABCCOXXYZ')) t
  6   where regexp_like(replace(value(t), 'COX', '#'), 'ABC([^#])*XYZ');

VALUE
---------------
ABCXYZ
ABC+XYZ

Such type of challenges could be easily resolved with and engine for regular expressions which supports negative lookaheand and negative lookbehind.
For example, powershell solution is below

PS C:\> $arr = ("abread.txt", "ydear.txt", "zzdare.txt", "xread.com", "hello.exe")
PS C:\> $arr | where {$_.tostring() -match '\w+(?<!read)\.\w{3}$'}
ydear.txt
zzdare.txt
hello.exe
PS C:\> $arr = ("ABCXYZ", "ABC+XYZ", "ABCOXYZ", "ABCCOXXYZ")
PS C:\> $arr | where {$_.tostring() -match 'ABC(?!COX).*XYZ'}
ABCXYZ
ABC+XYZ
ABCOXYZ

Contruction “(?<!read)" is called negative lookbehind and (?!COX) is negative lookahead.
For futher information please refer to regexp lookarounds.

Challenge 2
Replace all natural numbers with their values raised to the squares.
Neither Oracle nor powershell regural expressions allow evaluate expressions based on matched values.
Perl-based solution is below

c:\>perl -e "$str = 'ya hello 5 30 2 world'; $str =~ s/([\d+]+)/$1**2/eg; print $str;"
ya hello 25 900 4 world

One of workaround in Oracle may be using recursive subquery factoring and replacing one number at a time.

SQL> with t(str) as (select 'ya hello 5 30 2 world' from dual)
  2  , rec(str, token, lvl) as
  3   (select str, regexp_substr(str, '\d+', 1, 1), 1
  4      from t
  5    union all
  6    select regexp_replace(str, '\d+', token * token, 1, lvl),
  7           regexp_substr(str, '\d+', 1, lvl + 1),
  8           lvl + 1
  9      from rec
 10     where regexp_substr(str, '\d+', 1, lvl) is not null)
 11  select str from rec where token is null;

STR
------------------------------
ya hello 25 900 4 world

Challenge 3
Replace odd length palindromes with “@”s.
This can be solved using regular expresions only if engine supports recursion.
Perl-based solution looks as following

c:\>perl -e "$str = 'lol qwerty radar lool kukukukuk'; $str =~ s/\b(?'word'(?'letter'[a-z])(?&word)\k'letter'|[a-z])\b/'@'/eg; print $str;"
@ qwerty @ lool @

It’s not very simple at first glance, explanation can be found here.
In order to get this result in Oracle we need to split input string into characters, apply the logic and concatenate rows into result.
It’s possible to get the result using iterative model though.

SQL> select regexp_replace(listagg(case when cnt1 = cnt2 and x <> ' ' then '@' else x end) within group(order by ord), '@+', '@') str
  2    from (select t2.*,
  3                 count(distinct to_char(least(rn1, rn2)) || x) over(partition by grp) * 2 - 1 cnt1,
  4                 count(*) over(partition by grp) cnt2
  5            from (select t1.*,
  6                         row_number() over(partition by grp order by ord) rn1,
  7                         row_number() over(partition by grp order by ord desc) rn2
  8                    from (select t.*,
  9                                 decode(x,
 10                                        ' ',
 11                                        null,
 12                                        count(decode(t.x, ' ', 1))
 13                                        over(order by ord)) grp
 14                            from (select substr(str, rownum, 1) x, rownum ord
 15                                    from (select 'lol qwerty radar lool kukukukuk' str
 16                                            from dual)
 17                                  connect by rownum <= length(str)) t) t1) t2) t3
 18   order by ord;

STR
------------------------------
@ qwerty @ lool @

It’s worth to mention that expression to match odd length palindromes of arbitrary length is not regular from Computer Science perspective.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: