I have been trying to call a MapReduce job from a simple Java program in the same package. I tried to reference the MapReduce jar file in my Java program and call it using the runJar(String args[])
method, also passing the input and output paths for the MapReduce job, but the program didn't work.
How do I run such a program where I just pass the input, output and jar paths to its main method? Is it possible to run a MapReduce job (jar) through it? I want to do this because I want to run several MapReduce jobs one after another, where my Java program calls each such job by referencing its jar file. If this is possible, I might as well just use a simple servlet to do the calling and reference its output files for the graph purpose.
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.util.RunJar;

public class callOther {

    public static void main(String[] args) throws Throwable {
        List<String> arg = new ArrayList<String>();
        arg.add("/root/NetBeansProjects/wordTool/dist/wordTool.jar");
        arg.add("/root/Desktop/input");
        arg.add("/root/Desktop/output");
        RunJar.main(arg.toArray(new String[0]));
    }
}
Oh, please don't do it with RunJar; the Java API is very good.
Here is how you can start a job from normal code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// create a configuration
Configuration conf = new Configuration();
// create a new job based on the configuration
Job job = new Job(conf);
// here you have to put your mapper class
job.setMapperClass(Mapper.class);
// here you have to put your reducer class
job.setReducerClass(Reducer.class);
// here you have to set the jar which contains your
// map/reduce classes, so you can use the mapper class
job.setJarByClass(Mapper.class);
// key/value of your reducer output
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// this sets the format of your input, can also be TextInputFormat
job.setInputFormatClass(SequenceFileInputFormat.class);
// same with output
job.setOutputFormatClass(TextOutputFormat.class);
// here you can set the path of your input
SequenceFileInputFormat.addInputPath(job, new Path("files/toMap/"));
// this deletes a possibly existing output path to prevent job failures
FileSystem fs = FileSystem.get(conf);
Path out = new Path("files/out/processed/");
fs.delete(out, true);
// finally set the empty out path
TextOutputFormat.setOutputPath(job, out);
// this waits until the job completes and prints debug output to STDOUT or whatever
// has been configured in your log4j properties.
job.waitForCompletion(true);
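Since you want to run several jobs one after another: that is just sequential calls in the same driver. A minimal sketch, assuming hypothetical SecondMapper and SecondReducer classes for a follow-up job (TextInputFormat is from org.apache.hadoop.mapreduce.lib.input); check the return value of waitForCompletion instead of ignoring it like above, and feed the first job's output path in as the second job's input:
// run the first job and stop if it fails
if (!job.waitForCompletion(true)) {
    System.exit(1);
}

// the second job reads what the first one wrote
Job second = new Job(conf);
second.setJarByClass(SecondMapper.class);    // hypothetical classes
second.setMapperClass(SecondMapper.class);
second.setReducerClass(SecondReducer.class);
second.setOutputKeyClass(Text.class);
second.setOutputValueClass(Text.class);
// TextInputFormat matches the TextOutputFormat of the first job
second.setInputFormatClass(TextInputFormat.class);
second.setOutputFormatClass(TextOutputFormat.class);
TextInputFormat.addInputPath(second, out);   // "files/out/processed/" from above
TextOutputFormat.setOutputPath(second, new Path("files/out/final/"));
System.exit(second.waitForCompletion(true) ? 0 : 1);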
If you are using an external cluster, you have to put the following information into your configuration:
// this should be like defined in your mapred-site.xml
conf.set("mapred.job.tracker", "jobtracker.com:50001");
// like defined in hdfs-site.xml
conf.set("fs.default.name", "hdfs://namenode.com:9000");
This should be no problem as long as hadoop-core.jar
is on your application container's classpath.
But I think you should put some kind of progress indicator on your web page, because it may take minutes to hours for a Hadoop job to complete ;)
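If you go that route, here is a rough sketch of such an indicator, using only methods from the Job API (submit(), isComplete(), mapProgress() and reduceProgress()):
job.submit(); // returns immediately instead of blocking like waitForCompletion
while (!job.isComplete()) {
    // both progress values are floats between 0.0f and 1.0f,
    // expose them to your web page instead of printing them
    System.out.printf("map: %.0f%% reduce: %.0f%%%n",
            job.mapProgress() * 100, job.reduceProgress() * 100);
    Thread.sleep(5000);
}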
For YARN (Hadoop 2 and later)
For YARN, the following configuration needs to be set instead.
// this should be like defined in your yarn-site.xml
conf.set("yarn.resourcemanager.address", "yarn-manager.com:50001");
// framework is now "yarn", should be defined like this in mapred-site.xml
conf.set("mapreduce.framework.name", "yarn");
// like defined in hdfs-site.xml
conf.set("fs.default.name", "hdfs://namenode.com:9000");