My Hadoop cluster's HA active namenode (host1) suddenly switched to the standby namenode (host2). I could not find any error in the Hadoop logs (on any server) to identify the root cause.
After the namenodes switched, the following error appeared frequently in the HDFS logs, and none of the applications could read HDFS files:
2014-07-17 01:58:53,381 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(6769)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
Once I restarted the new active node (host2), the active role switched back to the new standby node (host1). Then the cluster worked as normal and users could retrieve HDFS files again.
I'm using Hortonworks 2.1.2.0 and HDFS version 2.4.0.2.1.
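For context, here is roughly what the client applications were doing; a minimal sketch assuming they connect to a fixed namenode URI (hdfs://host1:8020) rather than a logical nameservice. The class name, the exact URI, and the error handling are illustrative, not taken from our actual code:

```java
// Minimal sketch (illustrative): a client bound to one fixed namenode host.
// Once host1 drops to standby, every metadata call against it fails.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.ipc.RemoteException;

public class DirectNamenodeClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hard-wired to host1; the client knows nothing about host2.
        FileSystem fs = FileSystem.get(URI.create("hdfs://host1:8020"), conf);
        try {
            // Same RPC as the getfileinfo calls in the audit log below.
            System.out.println(fs.getFileStatus(new Path("/user/tungsten/staging")));
        } catch (RemoteException e) {
            // After the switch this typically surfaces as a wrapped StandbyException:
            // "Operation category READ is not supported in state standby"
            System.err.println("Namenode rejected the call: " + e.getMessage());
        } finally {
            fs.close();
        }
    }
}
```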
Edit: 21st July 2014. The following entries were found in the active namenode's logs when the active-standby namenode switch happened:
NT_SETTINGS-1675610.csv dst=null perm=null
2014-07-20 09:06:44,746 INFO FSNamesystem.audit (FSNamesystem.java:logAuditMessage(7755)) - allowed=true ugi=storm (auth:SIMPLE) ip=/10.0.1.50 cmd=getfileinfo src=/user/tungsten/staging/LEAPSET/PRODUCTS/PRODUCTS-1380186.csv dst=null perm=null
2014-07-20 09:06:44,747 INFO FSNamesystem.audit (FSNamesystem.java:logAuditMessage(7755)) - allowed=true ugi=storm (auth:SIMPLE) ip=/10.0.1.50 cmd=getfileinfo src=/user/tungsten/staging/LEAPSET/MERCHANT_SETTINGS/MERCHANT_SETTINGS-1695794.csv dst=null perm=null
2014-07-20 09:06:44,747 INFO FSNamesystem.audit (FSNamesystem.java:logAuditMessage(7755)) - allowed=true ugi=storm (auth:SIMPLE) ip=/10.0.1.50 cmd=getfileinfo src=/user/tungsten/staging/LEAPSET/PRODUCTS/PRODUCTS-1399541.csv dst=null perm=null
2014-07-20 09:06:44,748 INFO namenode.FSNamesystem (FSNamesystem.java:stopActiveServices(1095)) - Stopping services started for active state
2014-07-20 09:06:44,750 INFO namenode.FSEditLog (FSEditLog.java:endCurrentLogSegment(1153)) - Ending log segment 842249
2014-07-20 09:06:44,752 INFO namenode.FSEditLog (FSEditLog.java:printStatistics(673)) - Number of transactions: 2 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 1 SyncTimes(ms): 4 35
2014-07-20 09:06:44,774 INFO namenode.FSEditLog (FSEditLog.java:printStatistics(673)) - Number of transactions: 2 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 2 SyncTimes(ms): 24 37
2014-07-20 09:06:44,805 INFO namenode.FSNamesystem (FSNamesystem.java:run(4362)) - NameNodeEditLogRoller was interrupted, exiting
2014-07-20 09:06:44,824 INFO namenode.FileJournalManager (FileJournalManager.java:finalizeLogSegment(130)) - Finalizing edits file /ebs/hadoop/hdfs/namenode/current/edits_inprogress_0000000000000842249 -> /ebs/hadoop/hdfs/namenode/current/edits_0000000000000842249-0000000000000842250
2014-07-20 09:06:44,874 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(168)) - Shutting down CacheReplicationMonitor
2014-07-20 09:06:44,876 INFO namenode.FSNamesystem (FSNamesystem.java:startStandbyServices(1136)) - Starting services required for standby state
2014-07-20 09:06:44,927 INFO ha.EditLogTailer (EditLogTailer.java:(117)) - Will roll logs on active node at hadoop-client-us-west-1b/10.0.254.10:8020 every 120 seconds.
2014-07-20 09:06:44,929 INFO ha.StandbyCheckpointer (StandbyCheckpointer.java:start(129)) - Starting standby checkpoint thread... Checkpointing active NN at http://hadoop-client-us-west-1b:50070 Serving checkpoints at http://hadoop-client-us-west-1a:50070
2014-07-20 09:06:44,930 INFO ipc.Server (Server.java:run(2027)) - IPC Server handler 3 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 10.0.1.50:57297 Call#8431877 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2014-07-20 09:06:44,930 INFO ipc.Server (Server.java:run(2027)) - IPC Server handler 16 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 10.0.1.50:57294 Call#130105071 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2014-07-20 09:06:44,940 INFO ipc.Server (Server.java:run(2027)) - IPC Server handler 14 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 10.0.1.50:57294 Call#130105072 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
Edit: 13th August 2014. We were able to find the root cause of the namenode switching: the namenode was receiving a large number of getfileinfo requests, and then the switch happened.
But we still could not resolve the "Operation category READ is not supported in state standby" error.
Edit: 7th December 2014. We found that, as the solution, the application needs to manually connect to the current active namenode once the previously active namenode fails. In our setup, traffic to the namenodes in HA mode was not automatically directed to the active node.
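For anyone else hitting this: the standard way to avoid that manual reconnection is to have clients address the logical HA nameservice instead of a physical host, so the HDFS client's failover proxy provider locates the active namenode by itself. Below is a minimal sketch, assuming a nameservice id of "mycluster" (hypothetical) and the two namenode hosts that appear in the logs above; in a real deployment these properties would normally live in the client's hdfs-site.xml rather than be set in code:

```java
// Minimal sketch of client-side HDFS HA configuration (Hadoop 2.x).
// The nameservice id "mycluster" is hypothetical; the hostnames are the
// two namenodes from the logs above.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1",
                 "hadoop-client-us-west-1a:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2",
                 "hadoop-client-us-west-1b:8020");
        // This provider makes the client try each configured namenode and
        // stick with whichever one answers as active.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // Talk to the logical nameservice, never to a specific host.
        FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
        for (FileStatus status : fs.listStatus(new Path("/user/tungsten/staging"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}
```

With this in place, a failover shows up to the client as a retried RPC rather than a StandbyException, because the proxy provider keeps probing the configured namenodes until one answers as active.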
I had the same issue. You need to update the client libraries. Use Ambari to set up Spark and have it install the client on the server. Then set your SPARK_HOME environment variable.