real time log processing using apache spark streaming

Y0gesh Gupta picture Y0gesh Gupta · Feb 22, 2015 · Viewed 9.6k times · Source

I want to create a system where I can read logs in real time, and use apache spark to process it. I am confused if I should use something like kafka or flume to pass the logs to spark stream or should I pass the logs using sockets. I have gone through a sample program in the spark streaming documentation- Spark stream example. But I will be grateful if someone can guide me a better way to pass logs to spark stream. Its kind of a new turf to me.

Answer

Victor Bashurov picture Victor Bashurov · Feb 24, 2015

Apache Flume may help to read the logs in real time. Flume provides logs collection and transport to the application where Spark Streaming is used to analyze required information.

1. Download Apache Flume from official site or follow the instructions from here

2. Setup and run Flume modify flume-conf.properties.template from the directory where Flume is installed (FLUME_INSTALLATION_PATH\conf), here you need to provide logs source, channel and sinks (output). More details about setup here

There is an example of launching flume which collects log information from ping comand running on windows host and writes it to a file:

flume-conf.properties

agent.sources = seqGenSrc
agent.channels = memoryChannel
agent.sinks = loggerSink

agent.sources.seqGenSrc.type = exec
agent.sources.seqGenSrc.shell = powershell -Command

agent.sources.seqGenSrc.command = for() { ping google.com }

agent.sources.seqGenSrc.channels = memoryChannel

agent.sinks.loggerSink.type = file_roll

agent.sinks.loggerSink.channel = memoryChannel
agent.sinks.loggerSink.sink.directory = D:\\TMP\\flu\\
agent.sinks.loggerSink.serializer = text
agent.sinks.loggerSink.appendNewline = false
agent.sinks.loggerSink.rollInterval = 0

agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 100

To run the example go to FLUME_INSTALLATION_PATH and execute

java -Xmx20m -Dlog4j.configuration=file:///%CD%\conf\log4j.properties -cp .\lib\* org.apache.flume.node.Application -f conf\flume-conf.properties -n agent

OR you may create your java application that has flume libraries in a classpath and call org.apache.flume.node.Application instance from the application passing corresponding arguments.

How to setup Flume to collect and transport logs?

You can use some script for gathering logs from the specified location

agent.sources.seqGenSrc.shell = powershell -Command
agent.sources.seqGenSrc.command = your script here

instead of windows script you also can launch java application (put 'java path_to_main_class arguments' in field) which provides smart logs collection. For example, if the file is modified in real-time you can use Tailer from Apache Commons IO. To configure the Flume to transport the log infromation read this article

3. Get the Flume stream from your source code and analyze it with Spark. Take a look on a code sample from github https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaFlumeEventCount.java