regex in Vietnamese characters

lamtheo.com picture lamtheo.com · Sep 29, 2010 · Viewed 15.2k times · Source

I have one string and want remove any character not in any case below:

  • not in this list : ÀÁÂÃÈÉÊÌÍÒÓÔÕÙÚĂĐĨŨƠàáâãèéêìíòóôõùúăđĩũơƯĂẠẢẤẦẨẪẬẮẰẲẴẶẸẺẼỀỀỂ ưăạảấầẩẫậắằẳẵặẹẻẽềềểỄỆỈỊỌỎỐỒỔỖỘỚỜỞỠỢỤỦỨỪễệỉịọỏốồổỗộớờởỡợụủứừỬỮỰỲỴÝỶỸửữựỳỵỷỹ

  • not in [a-z 0-9 A-Z]

  • not is : _ and white space.

can anyone help me with this regex in php?

Answer

Gumbo picture Gumbo · Sep 29, 2010

Try this regular expression:

/[^a-z0-9A-Z_ÀÁÂÃÈÉÊÌÍÒÓÔÕÙÚĂĐĨŨƠàáâãèéêìíòóôõùúăđĩũơƯĂẠẢẤẦẨẪẬẮẰẲẴẶẸẺẼỀỀỂưăạảấầẩẫậắằẳẵặẹẻẽềềểỄỆỈỊỌỎỐỒỔỖỘỚỜỞỠỢỤỦỨỪễệỉịọỏốồổỗộớờởỡợụủứừỬỮỰỲỴÝỶỸửữựỳỵỷỹ]/u

The u modifier makes PHP to interpret the pattern string as UTF-8.

If that doesn’t work, try using Unicode character properties like \p{L} for letters or the escape sequence \x{1234} for describing single Unicode characters or custom character ranges:

/[^a-z0-9A-Z_\x{00C0}-\x{00FF}\x{1EA0}-\x{1EFF}]/u