Search/Find a file and file content in Hadoop

leon · Jun 9, 2011 · Viewed 61.3k times

I am currently working on a project using Hadoop DFS.

  1. I notice there is no search or find command in the Hadoop shell. Is there a way to search for a file (e.g. testfile.doc) in HDFS?

  2. Does Hadoop support file content search? If so, how? For example, I have many Word documents stored in HDFS, and I want to list the files that contain the words "computer science".

What about other distributed file systems? Is file content search a weak spot of distributed file systems?

Answer

ajduff574 · Jun 9, 2011
  1. You can list every path recursively and filter the listing with grep: hdfs dfs -ls -R / | grep [search_term].
  2. It sounds like a MapReduce job might be suitable here. Here's something similar, but for text files. However, if these documents are small, you may run into inefficiencies: each file will be assigned to its own map task, so the overhead of setting up the task may be significant compared to the time needed to process the file.
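
Both suggestions can be sketched as small shell helpers. This is only a rough sketch, not a definitive implementation: match_name, grep_paths and the HDFS_CAT variable are hypothetical names made up for illustration, the listing is assumed to have the usual 8 columns with the path last, and paths are assumed to contain no spaces. The content search streams each file through the client with hdfs dfs -cat, so it is not distributed, and it only makes sense for plain-text files (Word documents would need their text extracted first).

```shell
#!/usr/bin/env bash

# Hypothetical helper: read `hdfs dfs -ls -R` output on stdin and print the
# paths of files whose name equals $1 exactly. Unlike a bare grep, this does
# not match owner or group columns and does not treat "." as a wildcard.
# Assumes 8 columns (perms, replication, owner, group, size, date, time, path)
# and paths without spaces.
match_name() {
  local name="$1"
  awk -v name="$name" '
    NF >= 8 && $1 !~ /^d/ {          # skip "Found N items" lines and directories
      n = split($8, parts, "/")       # last path component is the file name
      if (parts[n] == name) print $8
    }'
}

# Hypothetical helper: read one path per line on stdin and print the paths
# whose content contains the fixed string $1 (case-insensitive).
# HDFS_CAT is an assumed override so the function can be tried on local files.
grep_paths() {
  local pattern="$1" path
  local cat_cmd="${HDFS_CAT:-hdfs dfs -cat}"
  while IFS= read -r path; do
    # $cat_cmd is left unquoted on purpose so "hdfs dfs -cat" splits into words.
    if $cat_cmd "$path" < /dev/null 2>/dev/null | grep -qiF -- "$pattern"; then
      printf '%s\n' "$path"
    fi
  done
}

# Usage against a live cluster (not run here; /docs is a made-up directory):
#   hdfs dfs -ls -R / | match_name testfile.doc
#   hdfs dfs -ls -R /docs | awk 'NF >= 8 && $1 !~ /^d/ {print $8}' \
#     | grep_paths "computer science"
```

For a large number of files, the loop in grep_paths pays the start-up cost of one hdfs client process per file, which is the same kind of per-file overhead the answer warns about for map tasks.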