How to mask personal identification information using any language, like Java?

Brijesh Patel · Mar 13, 2014 · Viewed 9.2k times

I want to mask PII (personal identification information) such as name, birth date, SSN, credit card number, phone number, etc. The masked data should keep the same format, meaning it should still look like real data, and it shouldn't be reversible. Masking should also be fast. Can anyone please help me?

Answer

Jason C · Mar 13, 2014

Replacing consonants with consonants, vowels with vowels, and digits with digits:

import java.util.Random;

public class Example {

    // Pick a random character from cs, upper-casing it if requested.
    static char randomChar (Random r, String cs, boolean uppercase) {
        char c = cs.charAt(r.nextInt(cs.length()));
        return uppercase ? Character.toUpperCase(c) : c;
    }

    // Replace each consonant, vowel and digit in str with a random character
    // of the same class. Everything else (spaces, punctuation) is kept as-is.
    static String mask (String str, int seed) {

        final String cons = "bcdfghjklmnpqrstvwxz";
        final String vowel = "aeiouy";   // 'y' is treated as a vowel here
        final String digit = "0123456789";

        Random r = new Random(seed);
        char[] data = str.toCharArray();

        for (int n = 0; n < data.length; ++ n) {
            char ln = Character.toLowerCase(data[n]);
            // ln != data[n] is true only when the original was upper case,
            // so the replacement keeps the case of the original character.
            if (cons.indexOf(ln) >= 0)
                data[n] = randomChar(r, cons, ln != data[n]);
            else if (vowel.indexOf(ln) >= 0)
                data[n] = randomChar(r, vowel, ln != data[n]);
            else if (digit.indexOf(ln) >= 0)
                data[n] = randomChar(r, digit, ln != data[n]);
        }

        return new String(data);

    }

    public static void main (String[] args) {

        System.out.println(mask("John Doe, 534 West Street, Wherever, XY. (888) 535-3593. 399-35-3535", 0));

    }
}

That produces the output:

    Bumk Tyy, 194 Wyrd Tggoyb, Flikibod, QY. (557) 722-5385. 055-08-1462

From the input:

    John Doe, 534 West Street, Wherever, XY. (888) 535-3593. 399-35-3535

It's up to you to generate the seed. Use a seed based on the input data (e.g. a checksum) as well as a consistent RNG if you want to guarantee that the same input always produces the same output.
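One way to derive such a seed is sketched below, assuming a CRC32 checksum of the input is an acceptable seed source. The class name `SeededMask` and the helpers `seedFor` and `maskDigits` are made up for this illustration, and `maskDigits` is a cut-down stand-in for `mask` that only replaces digits:

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.CRC32;

public class SeededMask {

    // Hypothetical helper: derive a seed from the input itself via CRC32.
    static long seedFor(String input) {
        CRC32 crc = new CRC32();
        crc.update(input.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // Simplified stand-in for mask(): only digits are replaced here.
    static String maskDigits(String input) {
        // Same input -> same checksum -> same seed -> same random sequence.
        Random r = new Random(seedFor(input));
        char[] data = input.toCharArray();
        for (int n = 0; n < data.length; ++n) {
            if (data[n] >= '0' && data[n] <= '9')
                data[n] = (char) ('0' + r.nextInt(10));
        }
        return new String(data);
    }

    public static void main(String[] args) {
        String ssn = "399-35-3535";
        // Masking the same value twice gives the same result.
        System.out.println(maskDigits(ssn).equals(maskDigits(ssn))); // prints "true"
    }
}
```

To use this with the `mask` method above, which takes an `int` seed, the checksum could be narrowed with a cast such as `(int) seedFor(str)`.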

A performance optimization could be made by using a character class table instead of e.g. vowel.indexOf(). Further micro-optimizations could be made (e.g. re-using Random, operating only on char[] and reducing new String allocations, etc.)
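A rough sketch of what the table-based variant might look like follows. It assumes ASCII-only input classes, and the names `TableMask`, `CLASS`, `SETS` and `maskInPlace` are invented for this example. It works directly on a `char[]` and takes the `Random` as a parameter so both can be re-used by the caller:

```java
import java.util.Random;

public class TableMask {

    static final byte OTHER = 0, CONS = 1, VOWEL = 2, DIGIT = 3;

    // Replacement alphabets, indexed by character class.
    static final String[] SETS = { "", "bcdfghjklmnpqrstvwxz", "aeiouy", "0123456789" };

    // Lookup table: ASCII code -> character class (OTHER by default).
    static final byte[] CLASS = new byte[128];

    static {
        for (byte k = CONS; k <= DIGIT; ++k) {
            for (char c : SETS[k].toCharArray()) {
                CLASS[c] = k;
                CLASS[Character.toUpperCase(c)] = k; // no-op for digits
            }
        }
    }

    // Masks data in place; one array lookup per character instead of indexOf().
    static void maskInPlace(char[] data, Random r) {
        for (int n = 0; n < data.length; ++n) {
            char c = data[n];
            byte k = c < 128 ? CLASS[c] : OTHER;
            if (k == OTHER)
                continue; // punctuation, spaces and non-ASCII are left alone
            String set = SETS[k];
            char repl = set.charAt(r.nextInt(set.length()));
            data[n] = Character.isUpperCase(c) ? Character.toUpperCase(repl) : repl;
        }
    }

    public static void main(String[] args) {
        char[] data = "John Doe, (888) 535-3593".toCharArray();
        maskInPlace(data, new Random(0));
        System.out.println(new String(data));
    }
}
```

Whether this is measurably faster than `indexOf()` on such short strings would need to be benchmarked; the sketch only shows the shape of the idea.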

Full Unicode support would be considerably harder: the character sets above only cover ASCII letters and digits, so anything else passes through unchanged. Masking also does not change the length of each component.
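To make the Unicode limitation concrete, here is a small sketch that applies the same three membership checks as `mask` to a few non-ASCII characters. The class name `UnicodeGap` and the method `isMaskable` are hypothetical and exist only for this illustration:

```java
public class UnicodeGap {

    // Same membership test that mask() performs: a character is only
    // replaced if its lower-case form is in one of the three ASCII sets.
    static boolean isMaskable(char c) {
        char ln = Character.toLowerCase(c);
        return "bcdfghjklmnpqrstvwxz".indexOf(ln) >= 0
            || "aeiouy".indexOf(ln) >= 0
            || "0123456789".indexOf(ln) >= 0;
    }

    public static void main(String[] args) {
        // e-acute, u-umlaut and an Arabic-Indic digit, written as escapes
        // so the source file stays plain ASCII.
        char[] samples = { 'J', '\u00e9', '\u00fc', '\u0663' };
        for (char c : samples) {
            System.out.println(String.format("U+%04X maskable: %b", (int) c, isMaskable(c)));
        }
    }
}
```

Only the plain `J` is reported as maskable. The accented letters and the non-ASCII digit would be copied to the output as-is, which leaks part of the original value.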

Overall, I would rate this a poor, but at least moderately interesting, algorithm.

I don't think you understand that what you are asking for (output that looks real) is outside the scope of normal encryption topics and doesn't lend itself well to "efficiency", as some amount of morphological analysis would be required to produce meaningful results (and again, internationalization complicates this significantly).