Friday, October 26, 2012

Nginx to hadoop hdfs with fluentd





Fluentd is a "JSON everywhere" log collector. It transmits logs as JSON streams so that log processing can be easily managed and automated.

Hadoop HDFS is a distributed filesystem that can store any volume of logs and lets you run MapReduce jobs over them for faster log processing.

We will be using the fluent-plugin-webhdfs plugin to send logs over the HttpFS interface.


1. Install hadoop-httpfs package

        yum install hadoop-httpfs

2. Enable access to HDFS for httpfs user

vi /etc/hadoop/conf/core-site.xml
  <property>  
   <name>hadoop.proxyuser.httpfs.hosts</name>  
   <value>localhost,httpfshost</value>  
  </property>  
  <property>  
   <name>hadoop.proxyuser.httpfs.groups</name>  
   <value>*</value>  
  </property>  

 

3. Restart the hadoop cluster.

4. Start the hadoop-httpfs service

/etc/init.d/hadoop-httpfs start

5. Check whether it is working
 curl -i "http://<namenode>:14000/webhdfs/v1?user.name=httpfs&op=GETHOMEDIRECTORY"  
 HTTP/1.1 200 OK  
 Server: Apache-Coyote/1.1  
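Once the check passes, it can help to pre-create the target log directory before fluentd starts writing. A minimal sketch using the standard WebHDFS MKDIRS operation; `namenode.example.com` is a hypothetical placeholder for your namenode/HttpFS host:

```shell
# Build the HttpFS URL for pre-creating the target log directory.
# MKDIRS is a standard WebHDFS operation; the host below is a
# hypothetical placeholder -- substitute your own namenode/httpfs host.
NAMENODE=namenode.example.com
URL="http://${NAMENODE}:14000/webhdfs/v1/user/hdfs/stats_logs?op=MKDIRS&user.name=httpfs"
echo "$URL"
# To actually create the directory:
#   curl -i -X PUT "$URL"
```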



6. Install the Treasure Data td-agent package on the nginx servers and the log-aggregator server

cat > /etc/yum.repos.d/treasuredata.repo <<'EOF'
[treasuredata]
name=TreasureData
baseurl=http://packages.treasure-data.com/redhat/$basearch
gpgcheck=0
EOF

 yum install td-agent

7. Install fluent-logger and fluent-plugin-webhdfs on the log-aggregator host

gem install fluent-logger --no-ri --no-rdoc
/usr/lib64/fluent/ruby/bin/fluent-gem install fluent-plugin-webhdfs

8. Edit td-agent configuration in nginx server

vi /etc/td-agent/td-agent.conf
 # Tail the nginx logs associated with stats.slideshare.net  
 <source>  
  type tail  
  path /var/log/nginx/stats_access.log  
  format apache  
  tag stats.access  
  pos_file /var/log/td-agent/stats_access.pos  
 </source>  
 <match stats.access>  
  type forward  
  <server>  
   host <LOG AGGREGATOR NODE>  
   port 24224  
  </server>  
  retry_limit 5  
  <secondary>  
   type file  
   path /var/log/td-agent/stats_access.log  
  </secondary>  
 </match>  

Edit the Nginx configuration to use the Apache log format:
   log_format main '$remote_addr - $remote_user [$time_local] "$request" '  
            '$status $body_bytes_sent "$http_referer" '  
            '"$http_user_agent"';  
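The `format apache` parser in the tail source expects lines in this combined layout. As a rough sketch (not fluentd's actual parser), the whitespace-delimited fields can be pulled apart with awk; the sample line here is illustrative, not real traffic:

```shell
# Illustrative apache-format access line (hypothetical sample data).
line='10.0.0.1 - frank [12/Oct/2012:07:01:22 +0000] "GET /stats HTTP/1.1" 200 512 "-" "curl/7.19.7"'
# Fields 1, 9 and 10 are the client address, status and bytes sent.
echo "$line" | awk '{print $1, $9, $10}'
# prints: 10.0.0.1 200 512
```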



9. Edit the td-agent configuration on the log-aggregator server

vi /etc/td-agent/td-agent.conf
 <source>  
  type forward  
  port 24224  
 </source>  
 <match stats.access>  
  type webhdfs  
  host <NAMENODE OR HTTPFS HOST>  
  port 14000  
  path /user/hdfs/stats_logs/stats_access.%Y%m%d_%H.log  
  httpfs true  
  username httpfsuser  
 </match>  
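The `%Y%m%d_%H` placeholders in `path` are expanded with strftime, so the plugin rolls logs into one HDFS file per hour. A quick sketch of the expansion for a fixed timestamp, using GNU date (assumes a Linux system):

```shell
# Expand the same strftime pattern used in the webhdfs path for a
# fixed UTC timestamp (GNU date syntax).
date -u -d "2012-10-12 07:00" "+/user/hdfs/stats_logs/stats_access.%Y%m%d_%H.log"
# -> /user/hdfs/stats_logs/stats_access.20121012_07.log
```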


10. Start td-agent in log aggregator host

/etc/init.d/td-agent start

 * ensure that there are no errors in /var/log/td-agent/td-agent.log

11. Start td-agent in nginx servers

/etc/init.d/td-agent start
/etc/init.d/nginx restart

12. Check whether you can see the logs in HDFS
 sudo -u hdfs hdfs dfs -ls /user/hdfs/stats_logs/  
 Found 1 items  
 -rw-r--r--  3 httpfsuser group   17441 2012-10-12 01:10 /user/hdfs/stats_logs/stats_access.20121012_07.log  


That is all. You now have log aggregation from Nginx into HDFS.
