How to define reg_exp and options in RE_REPLACE,RE_MATCH,RE_SEARCH in Statistica

How to define reg_exp and options in RE_REPLACE,RE_MATCH,RE_SEARCH in Statistica

book

Article ID: KB0081852

calendar_today

Updated On:

Products Versions
Spotfire Statistica 10 and later versions

Description

Statistica has 3 text functions in the Function browser named : RE_REPLACE,RE_MATCH and RE_SEARCH. This article details the input parameters format for reg_exp,replace string and options. In general, the format acceptable is PERL format for these parameters.




 

Issue/Introduction

RE_REPLACE,RE_MATCH,RE_SEARCH accept reg_exp and options as parameters. What format should be used for defining these parameters ?

Resolution

A few examples of the functions before discussing the parameters specifically :

A valid RE_REPLACE can be :
=RE_REPLACE(V1, "([0-9]*)", "", "g")
Replaces all numbers in Variable 1 with nothing. The "g" option means global replace so it replaces not just the first occurrence of a number but all of them. This one replaces numbers with nothing

=RE_REPLACE(V1, "([^0-9]*)", " ", "g")
reg_exp [^0-9] ^ inverts it and makes it match characters NOT in the range 0-9. So this one replaces all non numbers with a space, thus stripping out the non-numeric characters and leaving only the numbers

=RE_REPLACE("hello 123 there", ".*?([0-9]+).*", "$1", "")
extracts and returns 123
in this regex, the [.*?] at the beginning means match 0 or more characters, with the ? meaning minimal match; i.e., match the smallest possible thing before a number. ([0-9]+) matches one or more digits and puts it in match group $1. Then .* at the end matches everything after the number.  The replace string is $1, the number.

 =RE_REPLACE(V1, "([0-9]*)", "$1", "c")
Can be used for extracting the number out of a alphanumeric variable
() in the regex create groups you can refer to by number, $1. Here ([0-9]*) matches any number of consecutive digits and assigns it to match group 1, and we replace the whole string with $1, the desired output. The last parameter is for options - "c" means don't copy to the output portions of the string which did not match, so we get just the part that did match (namely the number)

=RE_REPLACE(V1, "([0-9]*)", "$`$'", "c")
To extract all special characters from a variable that has special characters and numbers.$` (backtick) is everything before the match. $'  is everything after the match.The number is the match so it effectively simply removes that part

reg_ex/replace string :
 Regular expressions are a wonderful kind of black magic, but no single example is going to explain them fully.You may define either a variable here or regular expression patterns formed with meta characters and repetition definitions used in PERL  :

 

Metacharacters

charmeaning
^beginning of string
$end of string
.any character except newline
*match 0 or more times
+match 1 or more times
?match 0 or 1 times; or: shortest match
|alternative
( )grouping; “storing”
[ ]set of characters
{ }repetition modifier
\quote or special

Repetition

a*zero or more a’s
a+one or more a’s
a?zero or one a’s (i.e., optional a)
a{m}exactly m a’s
a{m,}at least m a’s
a{m,n}at least m but at most n a’s
 Some examples of patterns that can be formed are :
 
expressionmatches...
abcabc (that exact character sequence, but anywhere in the string)
^abcabc at the beginning of the string
abc$abc at the end of the string
a|beither of a and b
^abc|abc$the string abc at the beginning or at the end of the string
ab{2,4}can a followed by two, three or four b’s followed by a c
ab{2,}can a followed by at least two b’s followed by a c
ab*can a followed by any number (zero or more) of b’s followed by a c
ab+can a followed by one or more b’s followed by a c
ab?can a followed by an optional b followed by a c; that is, either abc or ac
a.can a followed by any single character (not newline) followed by a c
a\.ca.c exactly
[abc]any one of a, b and c
[Aa]bceither of Abc and abc
[abc]+any (nonempty) string of a’s, b’s and c’s (such as a, abba, acbabcacaa)
[^abc]+any (nonempty) string which does not contain any of a, b and c (such as defg)
\d\dany two decimal digits, such as 42; same as \d{2}
 More examples can be online for Regular expressions in PERL or in Statistica help : Under Working with Spreadsheets >>Using Spreadsheets>>Using Formulas :
Spreadsheet Formulas - Regular Expression Functions
Regular Expression Overview
Spreadsheet Formulas - Regular Expression Function Examples

options:

The following options are available for RE_SEARCH, RE_MATCH, and RE_REPLACE:

b: Use POSIX basic grammar rather than PERL-like grammar. Similar to sed and grep.

x: Use POSIX extended grammar (similar to egrep).
Note : b and x are mutually exclusive; x overrides b.

i: Ignore case.

e: Allow the pattern to match an empty string.

s: Allow '.' to match newlines (similar to PERL's /s).

m: Allow the use of ^, $ as anchors to match start and end of a line, respectively (similar to PERL's /m).

g: Replace globally (similar to PERL's /g). In a RE_REPLACE, replaces all matches not just the first.

c: In a RE_REPLACE, do not copy to the output string portions of the input that did not match the regular expression.

Additional Information

https://www.boost.org/doc/libs/1_50_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html