I can't figure out how to construct a regex for the example values:
123,456,789
-12,34
1234
-8
Could you help me?
I have a simple question for your “simple” question: What precisely do you mean by “a number”?
Is −0 a number?
What about √−1?
Is ⅝ or ⅔ a number?
Is 186,282.42±0.02 miles/second one number, or is it two or three of them?
Is 6.02e23 a number?
Is 3.141_592_653_589 a number? How about π, or ℯ? And −2π⁻³ ͥ?
What about 0.083̄?
What about 128.0.0.1?
What number does ⚄ hold? How about ⚂⚃?
Does 10,5 mm have one number in it, or does it have two?
Is ∛8³ a number, or is it three of them?
What number does ↀↀⅮⅭⅭⅬⅫ AUC represent, 2762 or 2009?
Are ४५६७ and ৭৮৯৮ numbers?
What about 0377, 0xDEADBEEF, and 0b111101101?
Is Inf a number? Is NaN?
Is ④② a number? What about ⓰?
What about ㊅?
What do ℵ₀ and ℵ₁ have to do with numbers? Or ℝ, ℚ, and ℂ?
Also, are you familiar with these patterns? Can you explain the pros and cons of each?
1. /\D/
2. /^\d+$/
3. /^\p{Nd}+$/
4. /^\pN+$/
5. /^\p{Numeric_Value:10}$/
6. /^\P{Numeric_Value:NaN}+$/
7. /^-?\d+$/
8. /^[+-]?\d+$/
9. /^-?\d+\.?\d*$/
10. /^-?(?:\d+(?:\.\d*)?|\.\d+)$/
11. /^([+-]?)(?=\d|\.\d)\d*(\.\d*)?([Ee]([+-]?\d+))?$/
12. /^((\d)(?(?=(\d))|$)(?(?{ord$3==1+ord$2})(?1)|$))$/
13. /^(?:(?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2}))$/
14. /^(?:(?:[0-9a-fA-F]{1,2}):(?:[0-9a-fA-F]{1,2}):(?:[0-9a-fA-F]{1,2}):(?:[0-9a-fA-F]{1,2}):(?:[0-9a-fA-F]{1,2}):(?:[0-9a-fA-F]{1,2}))$/
15. /^(?:(?:[+-]?)(?:[0123456789]+))$/
16. /(([+-]?)([0123456789]{1,3}(?:,?[0123456789]{3})*))/
17. /^(?:(?:[+-]?)(?:[0123456789]{1,3}(?:,?[0123456789]{3})*))$/
18. /^(?:(?i)(?:[+-]?)(?:(?=[0123456789]|[.])(?:[0123456789]*)(?:(?:[.])(?:[0123456789]{0,}))?)(?:(?:[E])(?:(?:[+-]?)(?:[0123456789]+))|))$/
19. /^(?:(?i)(?:[+-]?)(?:(?=[01]|[.])(?:[01]{1,3}(?:(?:[,])[01]{3})*)(?:(?:[.])(?:[01]{0,}))?)(?:(?:[E])(?:(?:[+-]?)(?:[01]+))|))$/
20. /^(?:(?i)(?:[+-]?)(?:(?=[0123456789ABCDEF]|[.])(?:[0123456789ABCDEF]{1,3}(?:(?:[,])[0123456789ABCDEF]{3})*)(?:(?:[.])(?:[0123456789ABCDEF]{0,}))?)(?:(?:[G])(?:(?:[+-]?)(?:[0123456789ABCDEF]+))|))$/
21. /((?i)([+-]?)((?=[0123456789]|[.])([0123456789]{1,3}(?:(?:[_,]?)[0123456789]{3})*)(?:([.])([0123456789]{0,}))?)(?:([E])(([+-]?)([0123456789]+))|))/
I suspect that some of those patterns above may serve your needs. But I cannot tell you which one or ones — or, if none, supply you another — because you haven’t said what you mean by “number”.
As you see, there are a huge number of number possibilities: quite probably ℵ₁ worth of them, in fact. ☺
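To make the point concrete, here is a minimal sketch that runs the four example values from the question against just one of the patterns above, the anchored ASCII integer with optional comma grouping, /^(?:(?:[+-]?)(?:[0123456789]{1,3}(?:,?[0123456789]{3})*))$/. The names $grouped_int_rx and looks_like_grouped_int are made up for this illustration, and choosing this pattern is an assumption about what the asker wants, not a recommendation.

```perl
use strict;
use warnings;

# The anchored pattern from the list above: ASCII integer, optional sign,
# optional commas between groups of exactly three digits.
# ($grouped_int_rx and looks_like_grouped_int are hypothetical names.)
my $grouped_int_rx = qr/^(?:(?:[+-]?)(?:[0123456789]{1,3}(?:,?[0123456789]{3})*))$/;

# Returns 1 if the whole string matches the pattern, 0 otherwise.
sub looks_like_grouped_int {
    my ($string) = @_;
    return $string =~ $grouped_int_rx ? 1 : 0;
}

# The four example values from the question.
for my $value ('123,456,789', '-12,34', '1234', '-8') {
    printf "%-12s %s\n", $value,
        looks_like_grouped_int($value) ? 'matches' : 'does not match';
}
```

Under this pattern, -12,34 is the only one of the four that is rejected, because the comma is not followed by exactly three digits. Whether that is right depends on whether the comma was meant as a grouping separator or as a decimal comma, which is exactly the kind of thing the question leaves open.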
Each numbered explanation below describes the correspondingly numbered pattern above.
2. A digit here is a code point with the General Category Decimal Number property, which is available as \p{Nd}, \p{Decimal_Number}, or \p{General_Category=Decimal_Number}. This in turn is actually just a reflection of those code points whose Numeric Type category is Decimal, which is available as \p{Numeric_Type=Decimal}.
3. Java needs this explicit form, because by default it does not map the simple charclass escapes \w and \W, \d and \D, \s and \S, and \b or \B into the appropriate Unicode property. That means you must not use any of those eight one-character escapes for any Unicode data in Java, because they work only on ASCII even though Java always uses Unicode characters internally.
4. This matches any character with the \pN, \p{Number}, or \p{General_Category=Number} property, not just decimal digits. These include \p{Nl} or \p{Letter_Number} for things like Roman numerals and \p{No} or \p{Other_Number} for superscripted and subscripted numbers, fractions, and circled numbers, amongst others, like counting rods.
5. This matches a single character whose numeric value is 10, so things like Ⅹ, the Roman numeral ten, and ⑩, ⑽, ⒑, ⓾, ❿, ➉, and ➓.
21. This also stores the matched string into the \1 capture group, making it available as $1 after the match succeeds.
Patterns 1, 2, and 7–11 come from a previous incarnation of the Perl Frequently Asked Questions list in the question, “How do I validate input?”. That section has been replaced by a suggestion to use the Regexp::Common module, written by Abigail and Damian Conway. The original patterns can still be found in Recipe 2.1 of the Perl Cookbook, “Checking Whether a String Is a Valid Number”, solutions to which can be found for a dizzying number of diverse languages, including Ada, Common Lisp, Groovy, Guile, Haskell, Java, Merd, OCaml, PHP, Pike, Python, REXX, Ruby, and Tcl at the PLEAC project.
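Returning to the Unicode properties discussed in the explanations above, the following sketch shows how \d, \pN, and \p{Numeric_Value:10} treat a few non-ASCII characters in Perl. The %sample and %check tables and their keys are names invented for this illustration. The /a modifier, which restricts \d to ASCII, needs Perl 5.14 or later.

```perl
use strict;
use warnings;
use 5.014;    # the /a modifier needs Perl 5.14 or later

# Sample strings, written with \x{...} escapes so the file encoding
# of this script does not matter. (%sample is a hypothetical name.)
my %sample = (
    devanagari => "\x{096A}\x{096B}\x{096C}\x{096D}",  # DEVANAGARI DIGITS 4 5 6 7
    roman_xii  => "\x{216B}",                          # ROMAN NUMERAL TWELVE
    roman_x    => "\x{2169}",                          # ROMAN NUMERAL TEN
    circled_10 => "\x{2469}",                          # CIRCLED NUMBER TEN
);

# Each check returns 1 or 0 for the string passed in.
my %check = (
    'unicode \d'       => sub { $_[0] =~ /^\d+$/                  ? 1 : 0 },
    'ascii-only \d'    => sub { $_[0] =~ /^\d+$/a                 ? 1 : 0 },
    '\pN'              => sub { $_[0] =~ /^\pN+$/                 ? 1 : 0 },
    'numeric value 10' => sub { $_[0] =~ /^\p{Numeric_Value:10}$/ ? 1 : 0 },
);

# Print a small table: one row per (sample, check) pair.
for my $name (sort keys %sample) {
    for my $label (sort keys %check) {
        printf "%-10s %-18s %d\n",
            $name, $label, $check{$label}->($sample{$name});
    }
}
```

The Devanagari digits satisfy the Unicode-aware \d but not the ASCII-only one, the Roman numerals satisfy \pN but not \d, and only the two "ten" characters have a numeric value of 10.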
Pattern 12 could be more legibly rewritten as:
m{
    ^
    (
        ( \d )
        (?(?= ( \d ) ) | $ )
        (?(?{ ord $3 == 1 + ord $2 }) (?1) | $ )
    )
    $
}x
It uses regex recursion, which is found in many pattern engines, including Perl and all the PCRE-derived languages. But it also uses an embedded code callout as the test of its second conditional pattern; to my knowledge, code callouts are available only in Perl and PCRE.
Patterns 13–21 were derived from the aforementioned Regexp::Common module. Note that for brevity, these are all written without the whitespace and comments that you would definitely want in production code. Here is how that might look in /x mode:
$real_rx = qr{ (                       # start $1 to hold entire pattern
    ( [+-]? )                          # optional leading sign, captured into $2
    (                                  # start $3
        (?=                            # look ahead for what next char *will* be
            [0123456789]               #    EITHER:  an ASCII digit
          | [.]                        #    OR ELSE: a dot
        )                              # end look ahead
        (                              # start $4
            [0123456789]{1,3}          # 1-3 ASCII digits to start the number
            (?:                        # then optionally followed by
                (?: [_,]? )            # an optional grouping separator of comma or underscore
                [0123456789]{3}        # followed by exactly three ASCII digits
            ) *                        # repeated any number of times
        )                              # end $4
        (?:                            # begin optional cluster
            ( [.] )                    # required literal dot in $5
            ( [0123456789]{0,} )       # then optional ASCII digits in $6
        ) ?                            # end optional cluster
    )                                  # end $3
    (?:                                # begin cluster group
        ( [E] )                        #   base-10 exponent into $7
        (                              #   exponent number into $8
            ( [+-] ? )                 #     optional sign for exponent into $9
            ( [0123456789] + )         #     one or more ASCII digits into $10
        )                              #   end $8
      |                                #   or else nothing at all
    )                                  # end cluster group
) }xi;                                 # end $1 and whole pattern, enabling /x and /i modes
From a software engineering perspective, there are still several issues with the style used in the /x mode version immediately above. First, there is a great deal of code repetition, where you see the same [0123456789] over and over; what happens if one of those sequences accidentally leaves a digit out? Second, you are relying on positional parameters, which you must count. That means you might write something like:
(
    $real_number,          # $1
    $real_number_sign,     # $2
    $pre_exponent_part,    # $3
    $pre_decimal_point,    # $4
    $decimal_point,        # $5
    $post_decimal_point,   # $6
    $exponent_indicator,   # $7
    $exponent_number,      # $8
    $exponent_sign,        # $9
    $exponent_digits,      # $10
) = ($string =~ /$real_rx/);
which is frankly abominable! It is easy to get the numbering wrong, hard to remember what symbolic names go where, and tedious to write, especially if you don’t need all those pieces. It is better to rewrite that to use named groups instead of just numbered ones. Again, I’ll use Perl syntax for the variables, but the contents of the pattern should work anywhere that named groups are supported.
use 5.010; # Perl got named patterns in 5.10
$real_rx = qr{
    (?<real_number>
        # optional leading sign
        (?<real_number_sign> [+-]? )
        (?<pre_exponent_part>
            (?=                         # look ahead for what next char *will* be
                [0123456789]            #    EITHER:  an ASCII digit
              | [.]                     #    OR ELSE: a dot
            )                           # end look ahead
            (?<pre_decimal_point>
                [0123456789]{1,3}       # 1-3 ASCII digits to start the number
                (?:                     # then optionally followed by
                    (?: [_,]? )         # an optional grouping separator of comma or underscore
                    [0123456789]{3}     # followed by exactly three ASCII digits
                ) *                     # repeated any number of times
            )                           # end <pre_decimal_point>
            (?:                         # begin optional anon cluster
                (?<decimal_point> [.] ) # required literal dot
                (?<post_decimal_point>
                    [0123456789]{0,} )
            ) ?                         # end optional anon cluster
        )                               # end <pre_exponent_part>
        # begin anon cluster group:
        (?:
            (?<exponent_indicator> [E] )    # base-10 exponent
            (?<exponent_number>             # exponent number
                (?<exponent_sign>   [+-] ?         )
                (?<exponent_digits> [0123456789] + )
            )                               # end <exponent_number>
          |                                 # or else nothing at all
        )                                   # end anon cluster group
    )                                       # end <real_number>
}xi;
Now the abstractions are named, which helps. You can pull the groups out by name, and you only need the ones you care about. For example:
if ($string =~ /$real_rx/) {
    # hash slice of %+; the keys must match the group names in the pattern
    ($pre_exponent, $exponent_number) =
        @+{ qw< pre_exponent_part exponent_number > };
}
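As a runnable illustration of pulling groups out by name, here is a deliberately simplified sketch. It is not the full pattern above: $number_rx, parse_number, and the group names sign, int_part, frac_part, and exponent are all invented for this example, and the pattern is anchored at both ends, which the original is not.

```perl
use strict;
use warnings;
use 5.010;    # named capture groups and %+ need Perl 5.10 or later

# A cut-down, anchored variant of the named-group pattern.
# All names here are hypothetical and chosen only for this sketch.
my $number_rx = qr{
    ^
    (?<sign>     [+-]? )
    (?<int_part> [0-9]{1,3} (?: [_,]? [0-9]{3} )* )
    (?: [.]  (?<frac_part> [0-9]* ) )?
    (?: [Ee] (?<exponent>  [+-]? [0-9]+ ) )?
    $
}x;

# Returns a hash reference holding the named groups that took part
# in the match, or undef if the string does not match at all.
sub parse_number {
    my ($string) = @_;
    return unless $string =~ $number_rx;
    return +{ %+ };    # copy %+ before another match overwrites it
}

my $parts = parse_number('-1,234.5e-10');
for my $name (sort keys %$parts) {
    say "$name = '$parts->{$name}'";
}
```

Copying %+ into a fresh hash right after the match is a design choice in this sketch: %+ is tied to the last successful match, so holding on to it directly would be fragile.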
There’s one more thing to do to this pattern to make it still more maintainable. The problem is that there’s still too much repetition, which means it’s too easily changed in one place but not in another. If you were doing a McCabe analysis, you would say its complexity metric is too high. Most of us would just say it’s too indented. This makes it hard to follow. To fix all these things, what we need is a “grammatical pattern”, one with a definition block to create named abstractions, which we then treat somewhat like a subroutine call later on in the match.
use 5.010; # Perl first got regex subs in v5.10
$real_rx = qr{
    ^                       # anchor to front
    (?&real_number)         # call &real_number regex sub
    $                       # either at end or before final newline

    ##################################################
    # the rest is definition only; think of         ##
    # each named buffer as declaring a subroutine   ##
    # by that name                                  ##
    ##################################################
    (?(DEFINE)
        (?<real_number>
            (?&mantissa)
            (?&abscissa) ?
        )
        (?<abscissa>
            (?&exponent_indicator)
            (?&exponent)
        )
        (?<exponent>
            (?&sign)    ?
            (?&a_digit) +
        )
        (?<mantissa>
            # expecting either of these....
            (?= (?&a_digit)
              | (?&point)
            )
            (?&a_digit) {1,3}
            (?: (?&digit_separator) ?
                (?&a_digit) {3}
            ) *
            (?: (?&point)
                (?&a_digit) *
            ) ?
        )
        (?<point>               [.]     )
        (?<sign>                [+-]    )
        (?<digit_separator>     [_,]    )
        (?<exponent_indicator>  [Ee]    )
        (?<a_digit>             [0-9]   )
    ) # end DEFINE block
}x;
See how insanely better the grammatical pattern is than the original line-noisy pattern? It’s also far easier to get the syntax right: I typed that in without even one regex syntax error that needed correcting. (OK fine, I typed all the others in without any syntax errors either, but I've been doing this for a while. :)
Grammatical patterns look much more like a BNF than the ugly old regular expressions that people have come to hate. They are far easier to read, write, and maintain. So let’s have no more ugly patterns, OK?
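For anyone who wants to try a grammatical pattern, here is a small self-contained sketch in the same style. It is a condensed variant rather than a copy of the pattern above: $grammar_rx, is_real_number, and the exponent_part name are invented for this example, and the optional leading sign on real_number is an assumption added so that the negative values from the question are accepted.

```perl
use strict;
use warnings;
use 5.010;    # (?(DEFINE)...) and (?&name) need Perl 5.10 or later

# A condensed grammatical pattern. The names below are hypothetical,
# and the leading (?&sign)? is an assumption made for this sketch.
my $grammar_rx = qr{
    ^ (?&real_number) $

    (?(DEFINE)
        (?<real_number>
            (?&sign) ?
            (?&mantissa)
            (?&exponent_part) ?
        )
        (?<exponent_part>
            (?&exponent_indicator)
            (?&sign) ?
            (?&a_digit) +
        )
        (?<mantissa>
            (?&a_digit) {1,3}
            (?: (?&digit_separator) ? (?&a_digit) {3} ) *
            (?: (?&point) (?&a_digit) * ) ?
        )
        (?<point>              [.]   )
        (?<sign>               [+-]  )
        (?<digit_separator>    [_,]  )
        (?<exponent_indicator> [Ee]  )
        (?<a_digit>            [0-9] )
    )
}x;

# Returns 1 if the whole string is a number under this grammar, else 0.
sub is_real_number {
    my ($string) = @_;
    return $string =~ $grammar_rx ? 1 : 0;
}

for my $value ('123,456,789', '-12,34', '1234', '-8', '1,234,567.89e-3') {
    printf "%-16s %s\n", $value,
        is_real_number($value) ? 'matches' : 'does not match';
}
```

Because each rule is defined once, changing what counts as a digit or a separator means editing a single line in the DEFINE block instead of hunting down every repeated character class.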