This is one sample of my data:
case time (especially it's purse), read manual care, follow care instructions make stays waterproof -- example, inspect rubber seals doors (especially battery/memory card door open time)
xm "life support" picture . flip part bit flimsy guessing won't long . sound great altec speaker dock it! chance back base (xm3020) . traveling bag connect laptop extra speaker . amount paid ($25).
I want to remove all punctuation marks except the dot (.) and also remove words of length <= 2.
For example, my expected output is:
case time especially its purse read manual care follow care instructions . make stays waterproof example inspect rubber seals doors especially batterymemory card door open time
life support picture . flip part bit flimsy guessing wont long . sound great altec speaker dock chance back base xm3020 . traveling bag connect laptop extra speaker . amount paid $25 .
This should be implemented in Scala. I've tried:
replaceAll( """\\W\s""", "")
replaceAll(""""[^a-zA-Z\.]""", "")
but it doesn't work well. Can anybody help me?
Looking at the regex Javadoc (http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html), we see that the character class for punctuation is \p{Punct} and that we can remove a character from a character class using something like [a-z&&[^def]]. From there it is easy to define a regex that will remove all punctuation except the dot:
s.replaceAll("""[\p{Punct}&&[^.]]""", "")
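To make the intersection trick concrete, here is a minimal sketch run on fragments of the question's sample. The object and method names are made up for illustration.

```scala
object StripPunctuation {
  // \p{Punct} matches ASCII punctuation; "&&[^.]" intersects it with
  // "anything but a dot", which leaves every punctuation character except '.'.
  private val PunctExceptDot = """[\p{Punct}&&[^.]]"""

  def stripPunctuation(s: String): String = s.replaceAll(PunctExceptDot, "")

  def main(args: Array[String]): Unit = {
    // The '!' and the parentheses are removed, the final dot is kept.
    println(stripPunctuation("dock it! chance back base (xm3020) ."))
  }
}
```

Only the characters themselves are removed here; the surrounding spaces are left alone.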
Removing words with size <= 2 could be done like so:
s.replaceAll("""\b\p{IsLetter}{1,2}\b""", "")
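The word boundaries matter here, so a small sketch of how \b behaves on tokens from the sample may help. The names are hypothetical. Digits count as word characters, so "xm3020" is not touched, while an apostrophe does not, so "it's" is seen as the two short words "it" and "s".

```scala
object ShortWords {
  // Removes runs of one or two letters that stand alone between word boundaries.
  def dropShortWords(s: String): String =
    s.replaceAll("""\b\p{IsLetter}{1,2}\b""", "")

  def main(args: Array[String]): Unit = {
    println(dropShortWords("xm bag"))  // "xm" stands alone, so it is removed
    println(dropShortWords("xm3020"))  // no boundary between "xm" and "3": kept
    println(dropShortWords("it's"))    // both "it" and "s" go, only the apostrophe stays
  }
}
```

This is why the order of the two steps matters: if the apostrophe is stripped before the short-word pass, "it's" becomes the three-letter word "its" and survives.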
Combining the two, this gives:
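s.replaceAll("""[\p{Punct}&&[^.]]""", "").replaceAll("""\b\p{IsLetter}{1,2}\b\s*""", "")
Note how I added \s* after the short-word pattern to remove redundant spaces, and how punctuation is stripped first, so that "won't" becomes "wont" instead of being split at the apostrophe into "won" and a removable "t".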
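As a quick check against the sample from the question, here is a minimal sketch that applies the two steps in sequence, punctuation first and short words second. The object and method names are made up for illustration, and the input is a shortened piece of the second sample line.

```scala
object CleanText {
  def clean(s: String): String =
    s.replaceAll("""[\p{Punct}&&[^.]]""", "")        // step 1: drop punctuation, keep dots
     .replaceAll("""\b\p{IsLetter}{1,2}\b\s*""", "")  // step 2: drop 1-2 letter words and the spaces after them

  def main(args: Array[String]): Unit = {
    val sample = """xm "life support" picture . flip part bit flimsy guessing won't long ."""
    println(clean(sample))
  }
}
```

On this input the quotes and the apostrophe disappear, "xm" is dropped together with its trailing space, and "wont" is kept because it has four letters once the apostrophe is gone.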
Also, you can see that the above regex entirely removes '$', because \p{Punct} covers all ASCII punctuation and symbol characters, '$' included.
If that is undesirable (as your expected output seems to indicate), please be more precise about what you consider punctuation.
For example, you might want to treat only the following characters as punctuation: ?.!:() and, since the dot must be kept, strip all of them but the dot:
s.replaceAll("""[?!:()]""", "").replaceAll("""\b\p{IsLetter}{1,2}\b\s*""", "")
Alternatively, you could just add '$' to your "not-punctuation" character-list, along with the dot:
s.replaceAll("""[\p{Punct}&&[^.$]]""", "").replaceAll("""\b\p{IsLetter}{1,2}\b\s*""", "")
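To see what adding '$' to the excluded set changes, here is a small sketch comparing the two negated sets on the price from the sample. The names are hypothetical, and only the punctuation step is shown.

```scala
object KeepDollar {
  // Strips all ASCII punctuation except the dot.
  def stripExceptDot(s: String): String =
    s.replaceAll("""[\p{Punct}&&[^.]]""", "")

  // Same, but '$' is also excluded from the removed set.
  // Inside a character class '$' is a literal, not an end-of-line anchor.
  def stripExceptDotAndDollar(s: String): String =
    s.replaceAll("""[\p{Punct}&&[^.$]]""", "")

  def main(args: Array[String]): Unit = {
    println(stripExceptDot("amount paid ($25)."))          // the '$' is lost
    println(stripExceptDotAndDollar("amount paid ($25).")) // the '$' is kept
  }
}
```

Any other character you want to keep can be added to the negated set in the same way.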