I have a table called 'ticket_diary_comment' with a column called 'comment_text'. This column is populated with text data, and I would like to get the frequency of all the words occurring in this entire column. For example:
Comment_Text
I am a good guy
I am a bad guy
I am not a guy
What I want:
Word   Frequency
I      3
good   1
bad    1
not    1
guy    3
Notice that I have also removed the stop words from the output. I know that calculating the frequency of a particular word is not difficult, but I am looking for something that counts all the words appearing in a column while removing the stop words.
I would appreciate any kind of help on this issue. I should also mention that I have to run this query against a big-ish dataset (about 1 TB), so performance is a concern.
I would use a table-valued function to split the strings, and then group them in a query. Something like this:
SELECT Item, COUNT(*) AS Frequency
FROM ticket_diary_comment
CROSS APPLY dbo.fn_SplitString(comment_text, ' ')
GROUP BY Item
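To drop the stop words as well, one approach (a minimal sketch; StopWords is a hypothetical table name that you would create and populate with your own stop-word list) is to filter the split items before grouping:

-- Hypothetical stop-word table; populate with whatever words you want excluded
CREATE TABLE StopWords (Word VARCHAR(255) PRIMARY KEY);
INSERT INTO StopWords (Word) VALUES ('am'), ('a');  -- multi-row VALUES needs SQL Server 2008+

SELECT Item, COUNT(*) AS Frequency
FROM ticket_diary_comment
CROSS APPLY dbo.fn_SplitString(comment_text, ' ')
WHERE Item NOT IN (SELECT Word FROM StopWords)  -- filter out stop words
GROUP BY Item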
and the definition for fn_SplitString:
CREATE FUNCTION [dbo].[fn_SplitString]
(
    @String VARCHAR(8000),
    @Delimiter VARCHAR(255)
)
RETURNS @Results TABLE
(
    ID INT IDENTITY(1, 1),
    Item VARCHAR(8000)
)
AS
BEGIN
    INSERT INTO @Results (Item)
    -- Each qualifying num marks the start of a token; grab everything
    -- from there up to the next delimiter.
    SELECT SUBSTRING(@String + @Delimiter, num,
           CHARINDEX(@Delimiter, @String + @Delimiter, num) - num)
    FROM Numbers
    -- REPLACE swaps spaces for '|' so LEN() does not trim trailing spaces
    WHERE num <= LEN(REPLACE(@String, ' ', '|'))
      -- A token starts wherever the delimiter-prefixed string holds a delimiter
      AND SUBSTRING(@Delimiter + @String, num,
                    LEN(REPLACE(@Delimiter, ' ', '|'))) = @Delimiter
    ORDER BY num

    RETURN
END
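A quick sanity check once the function and the Numbers table (described below) are in place, using one of the sample comments:

-- Should return one row per word: I, am, a, good, guy
SELECT ID, Item
FROM dbo.fn_SplitString('I am a good guy', ' ')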
This function requires a numbers table, which is basically just CREATE TABLE Numbers (Num INT), populated with all the integers from 1 to 10,000 (or more or fewer, depending on your needs; it must cover the longest string you want to split). If you already have a numbers table in your DB, you can substitute that table/column for what you already have.
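One common way to populate it (a sketch, assuming SQL Server 2005 or later; any row source with enough rows will do in place of sys.all_objects):

CREATE TABLE Numbers (Num INT PRIMARY KEY);

-- Cross join a system view with itself to get plenty of rows,
-- then number them and keep the first 10,000
INSERT INTO Numbers (Num)
SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b;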