How to create a bigram from a text file with frequency count in Spark/Scala?

oscarm · Apr 18, 2015 · Viewed 7.1k times

I want to take a text file and build bigrams from every pair of adjacent words that are not separated by a period ("."), after removing any special characters. I'm trying to do this with Spark and Scala.

This text:

Hello my Friend. How are
you today? bye my friend.

Should produce the following:

hello my, 1
my friend, 2
how are, 1
you today, 1
today bye, 1
bye my, 1

Answer

ohruunuruus · Apr 18, 2015

For each line in the RDD, start by splitting on '.'. Then tokenize each resulting substring by splitting on ' '. Once tokenized, remove special characters with replaceAll and convert to lowercase. Each token array can then be turned by sliding(2) into an iterator of string arrays, each holding one bigram.
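As a rough illustration of these per-line steps, here is a minimal sketch using plain Scala collections (no Spark needed). The names `BigramSketch` and `bigrams` are made up for this example, and the two `filter` calls are an assumption about how empty tokens and one-word sentences should be handled.

```scala
object BigramSketch {
  // Hypothetical helper: bigrams for one line of text, never crossing a period
  def bigrams(line: String): List[String] =
    line.split('.').toList.flatMap { sentence =>
      val tokens = sentence.trim.split(' ')
        .map(_.replaceAll("""\W""", "").toLowerCase()) // strip non-word characters
        .filter(_.nonEmpty)                             // drop tokens that cleaned to ""
        .toList
      // sliding(2) yields one short window for a one-word sentence; skip it
      tokens.sliding(2).filter(_.length == 2).map(_.mkString(" ")).toList
    }
}
```

For instance, `BigramSketch.bigrams("Hello my Friend. How are")` gives `List("hello my", "my friend", "how are")`.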

Then, after flattening and converting the bigram arrays to strings with mkString as requested, count each bigram within the line with groupBy and mapValues.
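The counting step can be tried on its own with a small, made-up list of bigrams. This sketch uses `map` over the grouped entries rather than `mapValues`, on the assumption that a strict `Map` is wanted (on Scala 2.13, `mapValues` on a `Map` is deprecated and returns a lazy view). The name `BigramCountSketch` and the sample data are illustrative only.

```scala
object BigramCountSketch {
  // Made-up sample: bigrams as they might come out of one input line
  val lineBigrams: List[String] = List("my friend", "bye my", "my friend")

  // Group equal strings together, then replace each group by its size
  val counts: Map[String, Int] =
    lineBigrams.groupBy(identity).map { case (bigram, group) => bigram -> group.size }
}
```

Here `BigramCountSketch.counts` maps "my friend" to 2 and "bye my" to 1.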

Finally, flatten the per-line maps into (bigram, count) tuples, sum the counts for each bigram with reduceByKey, and collect the result from the RDD.
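To see what this last step does to the per-line counts, here is a plain-collections stand-in that runs without a cluster. It only mimics the effect of flattening followed by `reduceByKey(_ + _)`; the name `GlobalCountSketch` is hypothetical, and the per-line maps are written out by hand from the two example lines.

```scala
object GlobalCountSketch {
  // Per-line counts for the two example lines, written out by hand
  val perLine: List[Map[String, Int]] = List(
    Map("hello my" -> 1, "my friend" -> 1, "how are" -> 1),
    Map("you today" -> 1, "today bye" -> 1, "bye my" -> 1, "my friend" -> 1)
  )

  // Flatten to (bigram, count) pairs, then sum the counts per bigram.
  // Local analogue of flattening an RDD of maps and calling reduceByKey(_ + _).
  val global: Map[String, Int] =
    perLine.flatten.groupBy(_._1).map { case (bigram, pairs) =>
      bigram -> pairs.map(_._2).sum
    }
}
```

Only "my friend" appears on both lines, so it ends up with a count of 2 while the other five bigrams keep a count of 1.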

val rdd = sc.parallelize(Array("Hello my Friend. How are",
                               "you today? bye my friend."))

rdd.map{ 

    // Split each line into substrings by periods
    _.split('.').map{ substrings =>

        // Trim substrings and then tokenize on spaces
        substrings.trim.split(' ').

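        // Remove non-alphanumeric characters, using Shyamendra's
        // clean replacement technique, and convert to lowercase
        map{_.replaceAll("""\W""", "").toLowerCase()}.

        // Drop tokens left empty by cleaning or by repeated spaces
        filter{_.nonEmpty}.

        // Find bigrams; skip the single short window that sliding
        // emits for a one-word sentence
        sliding(2).filter{_.length == 2}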
    }.

    // Flatten, and map the bigrams to concatenated strings
    flatMap{identity}.map{_.mkString(" ")}.

    // Group the bigrams and count their frequency
    groupBy{identity}.mapValues{_.size}

}.

// Reduce to get a global count, then collect
flatMap{identity}.reduceByKey(_+_).collect.

// Format and print
foreach{x=> println(x._1 + ", " + x._2)}

you today, 1
hello my, 1
my friend, 2
how are, 1
bye my, 1
today bye, 1