I have a MapReduce job whose mapper takes a record and converts it into an instance of MyObject, which is marshalled to JSON using Jackson and used as the output key. The value is just another Text field in the record.
The relevant piece of the mapper is something like the following:
ObjectMapper mapper = new ObjectMapper();
MyObject val = new MyObject();
val.setA(stringA);
val.setB(stringB);
Writer strWriter = new StringWriter();
mapper.writeValue(strWriter, val);
key.set(strWriter.toString());
The outputs of the mapper are sent to a Combiner which unmarshalls the JSON object and aggregates key-value pairs. It is conceptually very simple and is something like:
public void reduce(Text key, Iterable<IntWritable> values, Context cxt)
throws IOException, InterruptedException {
int count = 0;
MyObject x = _mapper.readValue(key.toString(), MyObject.class);
for (IntWritable v : values) ++count;
...
emit (key, value)
}
The MyObject class consists of two fields (both strings), get/set methods and a default constructor. One of the fields stores snippets of text based on a web crawl, but is always a string.
public class MyObject {
private String A;
private String B;
public MyObject() {}
public String getA() {
return A;
}
public void setA(String A) {
this.A = A;
}
public String getB() {
return B;
}
public void setB(String B) {
this.B = B;
}
}
My MapReduce job appears to run fine until it reaches certain records, which I cannot easily inspect (because the mapper generates the records from a crawl), and then the following exception is thrown:
Error: com.fasterxml.jackson.core.JsonParseException:
Illegal character ((CTRL-CHAR, code 0)): only regular white space (\r, \n, \t) is allowed between tokens
at [Source: java.io.StringReader@5ae2bee7; line: 1, column: 3]
Would anyone have any suggestions about the cause of this?
You can use StringEscapeUtils from Apache Commons Lang to escape the string: https://commons.apache.org/proper/commons-lang/javadocs/api-2.6/src-html/org/apache/commons/lang/StringEscapeUtils.html#line.89
Alternatively, you can selectively replace the control characters in the string before JSON marshalling.
You can also refer to this post: Illegal character - CTRL-CHAR
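As a rough sketch of the second suggestion (removing control characters before marshalling), something like the following might work. The class name ControlCharCleaner and the method stripControlChars are hypothetical names made up for this illustration, and the choice to keep tab, newline and carriage return is an assumption based on the whitespace the error message lists as allowed. It only helps if the control characters really come from the crawled text.

```java
// Hypothetical helper, not part of Jackson, Hadoop, or Commons Lang.
final class ControlCharCleaner {

    private ControlCharCleaner() {}

    // Returns a copy of the input without ISO control characters
    // (\u0000-\u001F and \u007F-\u009F), except \t, \n and \r, which are kept.
    static String stripControlChars(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            boolean allowedWhitespace = (c == '\t' || c == '\n' || c == '\r');
            if (!Character.isISOControl(c) || allowedWhitespace) {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

In the mapper this could be applied to each field before it is set, for example val.setA(ControlCharCleaner.stripControlChars(stringA)). Dropping the characters silently changes the data, so replacing them with a space may be preferable depending on how the snippets are used later.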