Latin Characters check

CompanyDroneFromSector7G · Apr 3, 2013

There are some similar questions out there, but none that are quite the same or that have an answer that works for me.

I need a JavaScript function which validates whether a text field contains only valid Latin characters, so no Cyrillic or Chinese, just Latin; specifically:

Basic Latin (excluding the C0 control characters), Latin-1 (excluding the C1 control characters), Latin Extended A, Latin Extended B and Latin Extended Additional. This set corresponds to Unicode code points U+0020 to U+007E, U+00A0 to U+024F and U+1E00 to U+1EFF.

Some of the answers out there seem to check the first character in the text field but miss out others, so these are no good.

This is what I have tried so far (this doesn't work!):

var value = 'abcdef' // from text field
var re = '\u0000-\u007F|\u0100-\u017F|\u0180-\u024F|\u1E00-\u1EFF|\u0080-\u00FF'; // latin regexp string
// var re = '\\w+/'; // alternative
if (new RegExp(re).test(value)) {
    result = false;
}

The following sort of works but only for the first character:

//var re = '\u0000-\u007F|\u0100-\u017F|\u0180-\u024F|\u1E00-\u1EFF|\u0080-\u00FF'; // latin regexp string
// couldn't get the above to work so using the following:
var re = '\\w+';
if (!value.match(re)) {
    message = 'Please enter valid latin characters only';
    $focusField = $this;
}

What is the right way to do this?

I really need code, rather than an explanation, but both would be better.

Thanks

Answer

tchrist · Apr 3, 2013

EDIT: Note that the solution given in the accepted answer is incorrect. It is full of false positives and false negatives. The exact numeric code point numbers needed are given at the bottom of this post.

The example given by the question mistakenly attempts to use Block rather than Script properties!

You do not want to use Unicode block character properties here; you want to use Unicode script character properties. In other words, you really want Script=Latin and not to try to use Block=Basic_Latin plus Block=Latin_1 plus Block=Latin_1_Supplement plus Block=Latin_Extended_A plus Block=Latin_Extended_Additional.

Note also that the question neglected to mention two other Latin blocks: Block=Latin_Extended_C and Block=Latin_Extended_D.

Even if you used the correct blocks, you would get 145 false positives that were in those blocks but which were not Latin script characters:

$ unichars '\P{Script=Latin}' '[\p{Block=Basic_Latin}\p{Block=Latin_1}\p{Block=Latin_1_Supplement}\p{Block=Latin_Extended_A}\p{Block=Latin_Extended_B}\p{Block=Latin_Extended_Additional}\p{Block=Latin_Extended_C}\p{Block=Latin_Extended_D}]' | wc -l
145

Furthermore, you would miss 403 false negatives that are indeed Latin script characters but which are not in those blocks:

$ unichars '\p{Script=Latin}' '[^\p{Block=Basic_Latin}\p{Block=Latin_1}\p{Block=Latin_1_Supplement}\p{Block=Latin_Extended_A}\p{Block=Latin_Extended_B}\p{Block=Latin_Extended_Additional}\p{Block=Latin_Extended_C}\p{Block=Latin_Extended_D}]' | wc -l
403

You virtually never want to use Blocks; you want to use Scripts. That’s why Level 1 conformance of UTS#18 requires in Requirement 1.2 that the Script character property be supported, but says nothing of the Block property until Requirement 2.7: Full Properties.

See UTS#18 Annex A, Character Blocks, for more pitfalls that come of using Blocks instead of Scripts.
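
To make the mismatch concrete, here is a minimal sketch. The ranges are the ones listed in the question, and the two sample characters are taken from the Script lists at the bottom of this post (U+00D7 is listed under Script=Common, U+212A under Script=Latin). The name blockBased is purely illustrative.

```javascript
// Block-style ranges from the question: U+0020-007E, U+00A0-024F, U+1E00-1EFF.
// `blockBased` is a hypothetical name used only for this illustration.
var blockBased = /^[\u0020-\u007E\u00A0-\u024F\u1E00-\u1EFF]+$/;

// U+00D7 MULTIPLICATION SIGN sits in the Latin-1 Supplement block,
// but its script is Common, not Latin: the block ranges accept it.
var acceptsTimes = blockBased.test('\u00D7');

// U+212A KELVIN SIGN has Script=Latin, but it lives in the
// Letterlike Symbols block: the block ranges reject it.
var acceptsKelvin = blockBased.test('\u212A');

console.log(acceptsTimes, acceptsKelvin);   // true false
```

One character of each kind is enough to show that block membership and script membership are different questions.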

Removing the code points that lie outside the Basic Multilingual Plane due to the Javascript bug that makes it impossible to specify these by ranges, we are left with this set of insanely unmaintainable garbledy-gook needed to fish out all Unicode v6.2 code points having the Latin, Common, or Inherited script character property:

[\u0000-\u0040][\u0041-\u005A][\u005B-\u0060][\u0061-\u007A][\u007B-\u00A9]\u00AA[\u00AB-\u00B9]\u00BA[\u00BB-\u00BF][\u00C0-\u00D6]\u00D7[\u00D8-\u00F6]\u00F7[\u00F8-\u02B8][\u02B9-\u02DF][\u02E0-\u02E4][\u02E5-\u02E9][\u02EC-\u02FF][\u0300-\u036F]\u0374\u037E\u0385\u0387[\u0485-\u0486]\u0589\u060C\u061B\u061F\u0640[\u064B-\u0655][\u0660-\u0669]\u0670\u06DD[\u0951-\u0952][\u0964-\u0965]\u0E3F[\u0FD5-\u0FD8]\u10FB[\u16EB-\u16ED][\u1735-\u1736][\u1802-\u1803]\u1805[\u1CD0-\u1CD2]\u1CD3[\u1CD4-\u1CE0]\u1CE1[\u1CE2-\u1CE8][\u1CE9-\u1CEC]\u1CED[\u1CEE-\u1CF3]\u1CF4[\u1CF5-\u1CF6][\u1D00-\u1D25][\u1D2C-\u1D5C][\u1D62-\u1D65][\u1D6B-\u1D77][\u1D79-\u1DBE][\u1DC0-\u1DE6][\u1DFC-\u1DFF][\u1E00-\u1EFF][\u2000-\u200B][\u200C-\u200D][\u200E-\u2064][\u206A-\u2070]\u2071[\u2074-\u207E]\u207F[\u2080-\u208E][\u2090-\u209C][\u20A0-\u20BA][\u20D0-\u20F0][\u2100-\u2125][\u2127-\u2129][\u212A-\u212B][\u212C-\u2131]\u2132[\u2133-\u214D]\u214E[\u214F-\u215F][\u2160-\u2188]\u2189[\u2190-\u23F3][\u2400-\u2426][\u2440-\u244A][\u2460-\u26FF][\u2701-\u27FF][\u2900-\u2B4C][\u2B50-\u2B59][\u2C60-\u2C7F][\u2E00-\u2E3B][\u2FF0-\u2FFB][\u3000-\u3004]\u3006[\u3008-\u3020][\u302A-\u302D][\u3030-\u3037][\u303C-\u303F][\u3099-\u309A][\u309B-\u309C]\u30A0[\u30FB-\u30FC][\u3190-\u319F][\u31C0-\u31E3][\u3220-\u325F][\u327F-\u32CF][\u3358-\u33FF][\u4DC0-\u4DFF][\uA700-\uA721][\uA722-\uA787][\uA788-\uA78A][\uA78B-\uA78E][\uA790-\uA793][\uA7A0-\uA7AA][\uA7F8-\uA7FF][\uA830-\uA839][\uFB00-\uFB06][\uFD3E-\uFD3F]\uFDFD[\uFE00-\uFE0F][\uFE10-\uFE19][\uFE20-\uFE26][\uFE30-\uFE52][\uFE54-\uFE66][\uFE68-\uFE6B]\uFEFF[\uFF01-\uFF20][\uFF21-\uFF3A][\uFF3B-\uFF40][\uFF41-\uFF5A][\uFF5B-\uFF65]\uFF70[\uFF9E-\uFF9F][\uFFE0-\uFFE6][\uFFE8-\uFFEE][\uFFF9-\uFFFD]

Personally, I would fire anyone who attempted to use that sort of nonsense.

Furthermore, the 3,225 code points that you miss because of the Javascript bug in handling full Unicode are the following:

10100-10102 10107-10133 10137-1013F 10190-1019B 101D0-101FC 101FD
1D000-1D0F5 1D100-1D126 1D129-1D166 1D167-1D169 1D16A-1D17A 1D17B-1D182
1D183-1D184 1D185-1D18B 1D18C-1D1A9 1D1AA-1D1AD 1D1AE-1D1DD 1D300-1D356
1D360-1D371 1D400-1D454 1D456-1D49C 1D49E-1D49F 1D4A2 1D4A5-1D4A6
1D4A9-1D4AC 1D4AE-1D4B9 1D4BB 1D4BD-1D4C3 1D4C5-1D505 1D507-1D50A
1D50D-1D514 1D516-1D51C 1D51E-1D539 1D53B-1D53E 1D540-1D544 1D546
1D54A-1D550 1D552-1D6A5 1D6A8-1D7CB 1D7CE-1D7FF 1F000-1F02B 1F030-1F093
1F0A0-1F0AE 1F0B1-1F0BE 1F0C1-1F0CF 1F0D1-1F0DF 1F100-1F10A 1F110-1F12E
1F130-1F16B 1F170-1F19A 1F1E6-1F1FF 1F201-1F202 1F210-1F23A 1F240-1F248
1F250-1F251 1F300-1F320 1F330-1F335 1F337-1F37C 1F380-1F393 1F3A0-1F3C4
1F3C6-1F3CA 1F3E0-1F3F0 1F400-1F43E 1F440 1F442-1F4F7 1F4F9-1F4FC
1F500-1F53D 1F540-1F543 1F550-1F567 1F5FB-1F640 1F645-1F64F 1F680-1F6C5
1F700-1F773 E0001 E0020-E007F E0100-E01EF
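
These code points are out of reach because a Javascript string stores anything above U+FFFF as two UTF-16 code units, and a regex without the ES2015 u flag matches one code unit at a time. A minimal sketch of that arithmetic follows; toSurrogatePair is a hypothetical helper, not part of any library.

```javascript
// Hypothetical helper: split a supplementary code point (above U+FFFF)
// into the two UTF-16 code units that a Javascript string stores for it.
function toSurrogatePair(codePoint) {
    var offset = codePoint - 0x10000;
    var high = 0xD800 + (offset >> 10);    // top 10 bits
    var low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
    return [high, low];
}

// U+1D400 is the first code point of the 1D400-1D454 range listed above.
var pair = toSurrogatePair(0x1D400);
var asString = String.fromCharCode(pair[0], pair[1]);

// One code point, but two code units: a single character class
// cannot match the whole string.
var seenAsOne = /^[\u0000-\uFFFF]$/.test(asString);

console.log(pair[0].toString(16), pair[1].toString(16));   // d835 dc00
console.log(asString.length, seenAsOne);                   // 2 false
```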

The correct way to do all this is included below.

If you are going to be playing around with Unicode character properties, it is tantamount to hopeless to hardcode code-point numbers like this. What you really want is to be able to say something like:

[^\p{Script=Latin}\p{Script=Common}\p{Script=Inherited}]

However, Javascript regexes are still completely antemillennial in this regard, and are far from complying with Unicode Technical Standard #18: Unicode Regular Expressions, even at its most basic compliance level, Level 1:

Level 1: Basic Unicode Support. At this level, the regular expression engine provides support for Unicode characters as basic logical units. (This is independent of the actual serialization of Unicode as UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.) This is a minimal level for useful Unicode support. It does not account for end-user expectations for character support, but does satisfy most low-level programmer requirements. The results of regular expression matching at this level are independent of country or language. At this level, the user of the regular expression engine would need to write more complicated regular expressions to do full Unicode processing.

Because even the most rudimentary compliance level for Unicode regular expressions is still far beyond Javascript’s capabilities, I strongly recommend running whatever Unicode-aware regexes you need on the server in some language that actually supports them.

However, in the event that this is not practical, a sanity-saving workaround is the Javascript XRegExp plugin, which provides a saner regex library that also allows for access to certain essential character properties such as you are attempting to use.

As of v2.0, the “XRegExp All” add-on supports all these:

  • XRegExp 2.0.0
  • Unicode Base 1.0.0
  • Unicode Categories 1.2.0
  • Unicode Scripts 1.2.0
  • Unicode Blocks 1.2.0
  • Unicode Properties 1.0.0
  • XRegExp.matchRecursive 0.2.0
  • XRegExp.build 0.1.0
  • Prototypes 1.0.0

Which means that once you have it loaded, you will be able to get at the properties you need this way:

XRegExp("[^\\p{Latin}\\p{Common}\\p{Inherited}]");
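
Engines newer than this answer that implement ES2018 Unicode property escapes can express the same negated class without a plugin, provided the u flag is set. A minimal sketch, assuming such an engine; notLatin and isLatinText are hypothetical names, and the class mirrors the Latin, Common, and Inherited scripts used above.

```javascript
// Assumes an ES2018+ engine: property escapes only work with the `u` flag.
// Matches any single code point that is NOT Latin, Common, or Inherited.
var notLatin = /[^\p{Script=Latin}\p{Script=Common}\p{Script=Inherited}]/u;

// Hypothetical helper: true when no offending code point is found.
function isLatinText(value) {
    return !notLatin.test(value);
}

// "Creme brulee, 42!" with accented letters: Latin plus Common punctuation and digits.
console.log(isLatinText('Cr\u00E8me br\u00FBl\u00E9e, 42!'));               // true
// A Cyrillic word.
console.log(isLatinText('\u043F\u0440\u0438\u0432\u0435\u0442'));           // false
// Two Han characters.
console.log(isLatinText('\u6C49\u5B57'));                                   // false
```

Note that an empty string passes this check, so a required-field test would have to be done separately.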

Please note very carefully that as of Unicode v6.2, any and all of the following code points and code-point ranges are deemed to have the Script=Latin character property:

0041-005A 
0061-007A 
00AA 
00BA 
00C0-00D6 
00D8-00F6 
00F8-02B8 
02E0-02E4 
1D00-1D25 
1D2C-1D5C 
1D62-1D65 
1D6B-1D77 
1D79-1DBE 
1E00-1EFF 
2071 
207F 
2090-209C 
212A-212B 
2132 
214E 
2160-2188 
2C60-2C7F 
A722-A787 
A78B-A78E 
A790-A793 
A7A0-A7AA 
A7F8-A7FF 
FB00-FB06 
FF21-FF3A 
FF41-FF5A 
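
If neither a plugin nor a server-side check is available, the list above can at least be kept as data rather than pasted into a pattern. A minimal sketch, with the ranges copied from the Script=Latin list above; LATIN_RANGES and isLatinScript are hypothetical names, and the check covers Script=Latin only, not Common or Inherited.

```javascript
// Script=Latin ranges for Unicode 6.2, copied from the list above.
// Each entry is [first, last], inclusive; single code points repeat the value.
var LATIN_RANGES = [
    [0x0041, 0x005A], [0x0061, 0x007A], [0x00AA, 0x00AA], [0x00BA, 0x00BA],
    [0x00C0, 0x00D6], [0x00D8, 0x00F6], [0x00F8, 0x02B8], [0x02E0, 0x02E4],
    [0x1D00, 0x1D25], [0x1D2C, 0x1D5C], [0x1D62, 0x1D65], [0x1D6B, 0x1D77],
    [0x1D79, 0x1DBE], [0x1E00, 0x1EFF], [0x2071, 0x2071], [0x207F, 0x207F],
    [0x2090, 0x209C], [0x212A, 0x212B], [0x2132, 0x2132], [0x214E, 0x214E],
    [0x2160, 0x2188], [0x2C60, 0x2C7F], [0xA722, 0xA787], [0xA78B, 0xA78E],
    [0xA790, 0xA793], [0xA7A0, 0xA7AA], [0xA7F8, 0xA7FF], [0xFB00, 0xFB06],
    [0xFF21, 0xFF3A], [0xFF41, 0xFF5A]
];

// Hypothetical helper: is this code point inside one of the ranges?
function isLatinScript(codePoint) {
    for (var i = 0; i < LATIN_RANGES.length; i++) {
        if (codePoint >= LATIN_RANGES[i][0] && codePoint <= LATIN_RANGES[i][1]) {
            return true;
        }
    }
    return false;
}

console.log(isLatinScript(0x0041));   // true:  'A'
console.log(isLatinScript(0x00D7));   // false: multiplication sign, Script=Common
console.log(isLatinScript(0x212A));   // true:  Kelvin sign
```

The table still has to be regenerated for every new Unicode version, which is exactly the maintenance burden this post warns about.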

Whereas these are the code points that have the Script=Common character property:

0000-0040  
005B-0060  
007B-00A9  
00AB-00B9  
00BB-00BF  
00D7
00F7
02B9-02DF  
02E5-02E9  
02EC-02FF  
0374
037E
0385 
0387
0589
060C
061B
061F
0640
0660-0669  
06DD
0964-0965  
0E3F 
0FD5-0FD8  
10FB
16EB-16ED
1735-1736
1802-1803
1805
1CD3
1CE1
1CE9-1CEC
1CEE-1CF3
1CF5-1CF6
2000-200B
200E-2064
206A-2070  
2074-207E  
2080-208E  
20A0-20BA  
2100-2125
2127-2129
212C-2131  
2133-214D  
214F-215F  
2189
2190-23F3
2400-2426
2440-244A
2460-26FF
2701-27FF
2900-2B4C
2B50-2B59
2E00-2E3B
2FF0-2FFB  
3000-3004
3006
3008-3020
3030-3037  
303C-303F
309B-309C
30A0
30FB-30FC
3190-319F
31C0-31E3
3220-325F
327F-32CF
3358-33FF
4DC0-4DFF
A700-A721
A788-A78A
A830-A839
FD3E-FD3F  
FDFD
FE10-FE19  
FE30-FE52
FE54-FE66
FE68-FE6B  
FEFF
FF01-FF20  
FF3B-FF40
FF5B-FF65
FF70
FF9E-FF9F
FFE0-FFE6
FFE8-FFEE
FFF9-FFFD
10100-10102
10107-10133
10137-1013F
10190-1019B
101D0-101FC
1D000-1D0F5
1D100-1D126
1D129-1D166
1D16A-1D17A
1D183-1D184
1D18C-1D1A9
1D1AE-1D1DD
1D300-1D356
1D360-1D371
1D400-1D454
1D456-1D49C
1D49E-1D49F
1D4A2
1D4A5-1D4A6
1D4A9-1D4AC
1D4AE-1D4B9
1D4BB
1D4BD-1D4C3
1D4C5-1D505
1D507-1D50A
1D50D-1D514
1D516-1D51C
1D51E-1D539
1D53B-1D53E
1D540-1D544
1D546
1D54A-1D550
1D552-1D6A5
1D6A8-1D7CB
1D7CE-1D7FF
1F000-1F02B
1F030-1F093
1F0A0-1F0AE
1F0B1-1F0BE
1F0C1-1F0CF
1F0D1-1F0DF
1F100-1F10A
1F110-1F12E
1F130-1F16B
1F170-1F19A
1F1E6-1F1FF
1F201-1F202
1F210-1F23A
1F240-1F248
1F250-1F251
1F300-1F320
1F330-1F335
1F337-1F37C
1F380-1F393
1F3A0-1F3C4
1F3C6-1F3CA
1F3E0-1F3F0
1F400-1F43E
1F440
1F442-1F4F7
1F4F9-1F4FC
1F500-1F53D
1F540-1F543
1F550-1F567
1F5FB-1F640
1F645-1F64F
1F680-1F6C5
1F700-1F773
E0001
E0020-E007F

And these are the code points that have the Script=Inherited character property:

0300-036F
0485-0486
064B-0655
0670
0951-0952
1CD0-1CD2
1CD4-1CE0
1CE2-1CE8
1CED
1CF4
1DC0-1DE6
1DFC-1DFF
200C-200D
20D0-20F0
302A-302D
3099-309A
FE00-FE0F
FE20-FE26
101FD
1D167-1D169
1D17B-1D182
1D185-1D18B
1D1AA-1D1AD
E0100-E01EF

I hope the terrible maintenance, upkeep, legibility, and indeed writability problems that come of using literal code-point numbers like these make it clear that you want to at a bare minimum use the XRegExp add-ons.