I'm thinking about using Hadoop to process large text files on my existing Windows Server 2003 machines (about 10 quad-core boxes with 16 GB of RAM each).
The questions are:
Is there a good tutorial on how to configure a Hadoop cluster on Windows?
What are the requirements? Java + Cygwin + sshd? Anything else?
Does HDFS play nicely on Windows?
I'd like to use Hadoop in streaming mode. Any advice, tools, or tricks for developing my own mappers/reducers in C#?
What do you use for submitting and monitoring the jobs?
Thanks
From the Hadoop documentation:
Win32 is supported as a development platform. Distributed operation has not been well tested on Win32, so it is not supported as a production platform.
Which I think translates to: "You're on your own."
That said, there might be hope if you're not queasy about installing Cygwin and a Java shim, according to the Getting Started page of the Hadoop wiki:
It is also possible to run the Hadoop daemons as Windows Services using the Java Service Wrapper (download this separately). This still requires Cygwin to be installed as Hadoop requires its df command.
I guess the bottom line is that it doesn't sound impossible, but you'd be swimming upstream all the way. I've done a few Hadoop installs now (Linux for production, Mac for development), and I wouldn't bother with Windows when it's so straightforward on other platforms.
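On the C# streaming question, at least, there's good news: Hadoop Streaming doesn't care what language your mapper and reducer are written in. It launches them as child processes and talks to them purely over stdin/stdout, one `key<TAB>value` pair per line, so any C# console app works. A minimal word-count mapper might look like this (a sketch; the class name is my own invention):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal word-count mapper for Hadoop Streaming. Streaming feeds input
// lines on stdin and expects tab-separated "key<TAB>value" pairs on
// stdout; the framework handles the sort/shuffle between phases.
class WordCountMapper
{
    // Turn one input line into zero or more "word<TAB>1" output lines.
    public static IEnumerable<string> MapLine(string line) =>
        line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(word => word + "\t1");

    static void Main()
    {
        string line;
        while ((line = Console.ReadLine()) != null)
            foreach (var pair in MapLine(line))
                Console.WriteLine(pair);
    }
}
```

The reducer is the same idea in reverse: it reads sorted `key<TAB>value` lines from stdin and sums the counts per key. You'd compile these to .exe files and pass them to the streaming jar via its `-mapper`/`-reducer` options (the jar's exact path varies by Hadoop version). Since you're on Windows the .exes run natively, which sidesteps the Mono question you'd face on Linux nodes.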