I have this function in PostgreSQL, but I don't know how to return the result of the query:
CREATE OR REPLACE FUNCTION wordFrequency(maxTokens INTEGER)
RETURNS SETOF RECORD AS
$$
BEGIN
SELECT text, count(*), 100 / maxTokens * count(*)
FROM (
SELECT text
FROM token
WHERE chartype = 'ALPHABETIC'
LIMIT maxTokens
) as tokens
GROUP BY text
ORDER BY count DESC
END
$$
LANGUAGE plpgsql;
But I don't know how to return the result of the query inside the PostgreSQL function.
I found that the return type should be SETOF RECORD
, right? But the return command is not right.
What is the right way to do this?
Use RETURN QUERY
:
CREATE OR REPLACE FUNCTION word_frequency(_max_tokens int)
RETURNS TABLE (txt text -- also visible as OUT parameter inside function
, cnt bigint
, ratio bigint) AS
$func$
BEGIN
RETURN QUERY
SELECT t.txt
, count(*) AS cnt -- column alias only visible inside
, (count(*) * 100) / _max_tokens -- I added brackets
FROM (
SELECT t.txt
FROM token t
WHERE t.chartype = 'ALPHABETIC'
LIMIT _max_tokens
) t
GROUP BY t.txt
ORDER BY cnt DESC; -- potential ambiguity
END
$func$ LANGUAGE plpgsql;
Call:
SELECT * FROM word_frequency(123);
Explanation:
It is much more practical to explicitly define the return type than simply declaring it as record. This way you don't have to provide a column definition list with every function call. RETURNS TABLE
is one way to do that. There are others. Data types of OUT
parameters have to match exactly what is returned by the query.
Choose names for OUT
parameters carefully. They are visible in the function body almost anywhere. Table-qualify columns of the same name to avoid conflicts or unexpected results. I did that for all columns in my example.
But note the potential naming conflict between the OUT
parameter cnt
and the column alias of the same name. In this particular case (RETURN QUERY SELECT ...
) Postgres uses the column alias over the OUT
parameter either way. This can be ambiguous in other contexts, though. There are various ways to avoid any confusion:
ORDER BY 2 DESC
. Example:
ORDER BY count(*)
.plpgsql.variable_conflict
or use the special command #variable_conflict error | use_variable | use_column
in the function. See:
Don't use "text" or "count" as column names. Both are legal to use in Postgres, but "count" is a reserved word in standard SQL and a basic function name and "text" is a basic data type. Can lead to confusing errors. I use txt
and cnt
in my examples.
Added a missing ;
and corrected a syntax error in the header. (_max_tokens int)
, not (int maxTokens)
- type after name.
While working with integer division, it's better to multiply first and divide later, to minimize the rounding error. Even better: work with numeric
(or a floating point type). See below.
This is what I think your query should actually look like (calculating a relative share per token):
CREATE OR REPLACE FUNCTION word_frequency(_max_tokens int)
RETURNS TABLE (txt text
, abs_cnt bigint
, relative_share numeric) AS
$func$
BEGIN
RETURN QUERY
SELECT t.txt, t.cnt
, round((t.cnt * 100) / (sum(t.cnt) OVER ()), 2) -- AS relative_share
FROM (
SELECT t.txt, count(*) AS cnt
FROM token t
WHERE t.chartype = 'ALPHABETIC'
GROUP BY t.txt
ORDER BY cnt DESC
LIMIT _max_tokens
) t
ORDER BY t.cnt DESC;
END
$func$ LANGUAGE plpgsql;
The expression sum(t.cnt) OVER ()
is a window function. You could use a CTE instead of the subquery - pretty, but a subquery is typically cheaper in simple cases like this one.
A final explicit RETURN
statement is not required (but allowed) when working with OUT
parameters or RETURNS TABLE
(which makes implicit use of OUT
parameters).
round()
with two parameters only works for numeric
types. count()
in the subquery produces a bigint
result and a sum()
over this bigint
produces a numeric
result, thus we deal with a numeric
number automatically and everything just falls into place.