Removing stopwords from a String in Java

JavaLearner · Dec 29, 2014 · Viewed 28.9k times

I have a string with lots of words, and a text file containing stopwords that I need to remove from that string. Let's say I have this String:

s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."

After removing the stopwords, the string should look like this:

"love phone, super fast much cool jelly bean....but recently bugs."

I have mostly been able to achieve this, but the problem I am facing is that whenever there are adjacent stopwords in the String, only the first one is removed, and I get this result:

"love phone, super fast there's much and cool with jelly bean....but recently seen bugs"  

Here's my stopwordslist.txt file: Stopwords

How can I solve this problem? Here's what I have done so far:

int k=0,i,j;
ArrayList<String> wordsList = new ArrayList<String>();
String sCurrentLine;
String[] stopwords = new String[2000];
try{
    FileReader fr=new FileReader("F:\\stopwordslist.txt");
    BufferedReader br= new BufferedReader(fr);
    while ((sCurrentLine = br.readLine()) != null){
        stopwords[k]=sCurrentLine;
        k++;
    }
    String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
    StringBuilder builder = new StringBuilder(s);
    String[] words = builder.toString().split("\\s");
    for (String word : words){
        wordsList.add(word);
    }
    for(int ii = 0; ii < wordsList.size(); ii++){
        for(int jj = 0; jj < k; jj++){
            if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
                wordsList.remove(ii);
                break;
            }
        }
    }
    for (String str : wordsList){
        System.out.print(str+" ");
    }
}catch(Exception ex){
    System.out.println(ex);
}
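
The skipped words follow from removing elements while walking the list forward by index: after `wordsList.remove(ii)`, the next word slides into position `ii`, and the loop's `ii++` then steps over it. A minimal sketch of one way around this, assuming Java 8 or later and stopwords held in a `Set` (the class name `StopwordFilter`, the helper `removeStopwords`, and the short stopword list are illustrative, not the contents of stopwordslist.txt):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopwordFilter {

    // Hypothetical helper: returns the words of s that are not in stopwords.
    // The set is assumed to contain lower-case entries.
    static String removeStopwords(String s, Set<String> stopwords) {
        List<String> words = new ArrayList<>(Arrays.asList(s.split("\\s+")));
        // removeIf tests every element, so adjacent stopwords are not skipped.
        words.removeIf(w -> stopwords.contains(w.toLowerCase()));
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        // Illustrative list only; the real one would be read from the file.
        Set<String> stopwords = new HashSet<>(Arrays.asList("i", "this", "and", "so", "of"));
        System.out.println(removeStopwords("I love this phone and so much of it", stopwords));
    }
}
```

This version compares whole words with `Set.contains` instead of calling `String.contains` on each stopword, so a word is only dropped when it equals a stopword. A word with attached punctuation, such as "phone,", is compared as-is and will not match "phone".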

Answer

geert3 · Dec 29, 2014

This is a much more elegant solution (IMHO), using only regular expressions:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Replace the "....." with the rest of your stopwords, separated by "|".
    // "\\b" matches a word boundary, so "his" inside "this" is not replaced.
    // "\\s?" also removes one optional trailing whitespace character.
    Pattern p = Pattern.compile("\\b(I|this|its.....)\\b\\s?");
    Matcher m = p.matcher("I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.");
    String s = m.replaceAll("");
    System.out.println(s);
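
Rather than typing the alternation by hand, the pattern could be assembled from a list of stopwords. The following is only a sketch under a few assumptions: the class name `RegexStopwords`, the helper `removeStopwords`, and the three-word list are made up for illustration, and `Pattern.quote`, `Pattern.CASE_INSENSITIVE`, and the empty-list guard are additions that are not part of the snippet above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class RegexStopwords {

    // Hypothetical helper: removes every stopword (and one trailing
    // whitespace character, if present) from text.
    static String removeStopwords(String text, List<String> stopwords) {
        // An empty alternation would match at every word boundary, so bail out.
        if (stopwords.isEmpty()) {
            return text;
        }
        // Quote each word so regex metacharacters in the list are taken literally.
        String alternation = stopwords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Pattern p = Pattern.compile("\\b(" + alternation + ")\\b\\s?",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Illustrative list only; the real one would come from the stopword file.
        List<String> stopwords = Arrays.asList("I", "this", "its");
        System.out.println(removeStopwords("I love this phone, its super fast", stopwords));
    }
}
```

One caveat with either version of the pattern: `\\b` treats an apostrophe as a word boundary, so with "I" in the list, "I've" is reduced to "'ve" unless "I've" is itself listed ahead of "I" in the alternation.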