I want to take a text file and create a bigram of all words not separated by a dot ".", removing any special characters. I'm trying to do this using Spark and Scala.
This text:
Hello my Friend. How are
you today? bye my friend.
Should produce the following:
hello my, 1
my friend, 2
how are, 1
you today, 1
today bye, 1
bye my, 1
For each of the lines in the RDD, start by splitting based on '.'
. Then tokenize each of the resulting substrings by splitting on ' '
. Once tokenized, remove special characters with replaceAll
and convert to lowercase. Each of these sublists can be converted with sliding
to an iterator of string arrays containing bigrams.
Then, after flattening and converting the bigram arrays to strings with mkString
as requested, get a count for each one with groupBy
and mapValues
.
Finally flatten, reduce, and collect the (bigram, count) tuples from the RDD.
val rdd = sc.parallelize(Array("Hello my Friend. How are",
"you today? bye my friend."))
rdd.map{
// Split each line into substrings by periods
_.split('.').map{ substrings =>
// Trim substrings and then tokenize on spaces
substrings.trim.split(' ').
// Remove non-alphanumeric characters, using Shyamendra's
// clean replacement technique, and convert to lowercase
map{_.replaceAll("""\W""", "").toLowerCase()}.
// Find bigrams
sliding(2)
}.
// Flatten, and map the bigrams to concatenated strings
flatMap{identity}.map{_.mkString(" ")}.
// Group the bigrams and count their frequency
groupBy{identity}.mapValues{_.size}
}.
// Reduce to get a global count, then collect
flatMap{identity}.reduceByKey(_+_).collect.
// Format and print
foreach{x=> println(x._1 + ", " + x._2)}
you today, 1
hello my, 1
my friend, 2
how are, 1
bye my, 1
today bye, 1