Athena: Query exhausted resources at scale factor

Jiew Meng picture Jiew Meng · Jan 26, 2019 · Viewed 11.8k times · Source

I am running a query like:

SELECT f.*, p.countryName, p.airportName, a.name AS agentName
FROM (
    SELECT 
        f.outboundlegid, 
        f.inboundlegid,
        f.querydatetime,
        cast(f.agent as bigint) as agent,
        cast(f.querydestinationplace as bigint) as querydestinationplace,
        f.queryoutbounddate,
        f.queryinbounddate,
        f.quoteageinminutes,
        f.price
    FROM flights f
    WHERE querydatetime >= '2018-01-02'
    AND querydatetime <= '2019-01-10'
) f
INNER JOIN (
  SELECT airportId, airportName, countryName
  FROM airports
  WHERE countryName IN ('Philippines', 'Indonesia', 'Malaysia', 'Hong Kong', 'Thailand', 'Vietnam')
) p
ON f.querydestinationplace = p.airportId
INNER JOIN agents a
ON f.agent = a.id
ORDER BY f.outboundlegid, f.inboundlegid, f.agent, querydatetime DESC

What's wrong with it? Or how can I optimize it? It gives me

Query exhausted resources at this scale factor

I have a flights table and I want to query for flights inside a specific country

Answer

Roberto picture Roberto · Jan 26, 2019

I have been facing this problem since the begining of Athena, the problem is the ORDER BY clause. Athena is just an EMR cluster with hive and prestodb installed. The problem you are facing is: Even if your query is distributed across X numbers of nodes, the ordering phase must be done by just a single node, the master node in this case. So at the end, you can order as much data as memory have the master node.

You can test it by reducing the amount of data the query returns maybe reducing the time range. I hope this helps :)