I am running a query like:
SELECT f.*, p.countryName, p.airportName, a.name AS agentName
FROM (
SELECT
f.outboundlegid,
f.inboundlegid,
f.querydatetime,
cast(f.agent as bigint) as agent,
cast(f.querydestinationplace as bigint) as querydestinationplace,
f.queryoutbounddate,
f.queryinbounddate,
f.quoteageinminutes,
f.price
FROM flights f
WHERE querydatetime >= '2018-01-02'
AND querydatetime <= '2019-01-10'
) f
INNER JOIN (
SELECT airportId, airportName, countryName
FROM airports
WHERE countryName IN ('Philippines', 'Indonesia', 'Malaysia', 'Hong Kong', 'Thailand', 'Vietnam')
) p
ON f.querydestinationplace = p.airportId
INNER JOIN agents a
ON f.agent = a.id
ORDER BY f.outboundlegid, f.inboundlegid, f.agent, querydatetime DESC
What's wrong with it? Or how can I optimize it? It gives me
Query exhausted resources at this scale factor
I have a flights table and I want to query for flights inside a specific country
I have been facing this problem since the begining of Athena, the problem is the ORDER BY
clause. Athena is just an EMR cluster with hive and prestodb installed. The problem you are facing is: Even if your query is distributed across X numbers of nodes, the ordering phase must be done by just a single node, the master node in this case. So at the end, you can order as much data as memory have the master node.
You can test it by reducing the amount of data the query returns maybe reducing the time range. I hope this helps :)