Use flume to stream data to S3

user3277217 picture user3277217 · Sep 25, 2014 · Viewed 7.9k times · Source

I am trying flume for something very simple, where I would like to push content from my log files to S3. I was able to create a flume agent that would read the content from an apache access log file and use a logger sink. Now I am trying to find a solution where I can replace the logger sink with an "S3 sink". (I know this does not exist by default)

I was looking for some pointers to direct me in the correct path. Below is my test properties file that I am using currently.

a1.sources=src1
a1.sinks=sink1
a1.channels=ch1

#source configuration
a1.sources.src1.type=exec
a1.sources.src1.command=tail -f /var/log/apache2/access.log

#sink configuration
a1.sinks.sink1.type=logger

#channel configuration
a1.channels.ch1.type=memory
a1.channels.ch1.capacity=1000
a1.channels.ch1.transactionCapacity=100

#links
a1.sources.src1.channels=ch1
a1.sinks.sink1.channel=ch1

Answer

gasparms picture gasparms · Oct 2, 2014

S3 is built over HDFS so you can use HDFS sink, you must replace hdfs path to your bucket in this way. Don't forget replace AWS_ACCESS_KEY and AWS_SECRET_KEY.

agent.sinks.s3hdfs.type = hdfs
agent.sinks.s3hdfs.hdfs.path = s3n://<AWS.ACCESS.KEY>:<AWS.SECRET.KEY>@<bucket.name>/prefix/
agent.sinks.s3hdfs.hdfs.fileType = DataStream
agent.sinks.s3hdfs.hdfs.filePrefix = FilePrefix
agent.sinks.s3hdfs.hdfs.writeFormat = Text
agent.sinks.s3hdfs.hdfs.rollCount = 0
agent.sinks.s3hdfs.hdfs.rollSize = 67108864  #64Mb filesize
agent.sinks.s3hdfs.hdfs.batchSize = 10000
agent.sinks.s3hdfs.hdfs.rollInterval = 0