I am working in a relatively large code base where I am seeing a file descriptor leak: after I run certain programs, processes start complaining that they cannot open files.
Although this normally takes about 6 days to happen, I can reproduce the problem in 3-4 hours by reducing the value in /proc/sys/fs/file-max to 9000.
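For reference, lowering the limit can be done roughly like this. It is only a sketch: the function name lower_file_max and its optional path argument are made up for illustration, writing to the real file needs root, and 9000 may be far too low for other systems.

```shell
#!/bin/bash
# Write a new system-wide handle limit and print the previous one so that
# it can be restored later. The second argument (path) is an assumption
# added so the function can be tried against an ordinary file.
lower_file_max() {
    local new_limit="$1" path="${2:-/proc/sys/fs/file-max}" old
    old=$(cat "$path") || return 1
    printf '%s\n' "$new_limit" > "$path" || return 1
    printf '%s\n' "$old"
}
```

Usage would be something like old=$(lower_file_max 9000) before the test run and echo "$old" > /proc/sys/fs/file-max afterwards.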
There are many processes running at any moment. I have been able to pinpoint a couple of processes that could be causing the leak. However, I don't see any file descriptor leak either through lsof or through /proc/<pid>/fd.
If I kill the processes that I suspect of leaking (they communicate with each other), the leak goes away and the FDs are released.
Running cat /proc/sys/fs/file-nr in a while(1) loop shows the leak at the system level. However, I don't see a leak in any individual process.
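The sampling loop looks roughly like the sketch below. The function names and the optional path argument are only for illustration; the first field of file-nr is the number of allocated file handles.

```shell
#!/bin/bash
# Print the first field of file-nr (allocated file handles).
# The path argument is an assumption added for testing; it defaults to
# the real kernel file.
allocated_handles() {
    local path="${1:-/proc/sys/fs/file-nr}"
    local allocated unused max
    read -r allocated unused max < "$path" || [ -n "$allocated" ] || return 1
    printf '%s\n' "$allocated"
}

# Take <count> samples, <interval> seconds apart, and print how far the
# allocated count has moved since the first sample.
watch_handles() {
    local count="$1" interval="$2" path="${3:-/proc/sys/fs/file-nr}"
    local first now i=0
    first=$(allocated_handles "$path") || return 1
    while [ "$i" -lt "$count" ]; do
        now=$(allocated_handles "$path") || return 1
        printf '%s allocated=%s delta=%s\n' "$(date +%T)" "$now" "$((now - first))"
        sleep "$interval"
        i=$((i + 1))
    done
}
```

For example, watch_handles 60 5 takes a sample every 5 seconds for 5 minutes; a delta that keeps growing while the per-process counts stay flat matches what I am seeing.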
Here is a script I wrote to detect that the leak is happening:
#!/bin/bash
# Log processes holding more than <fd_threshold> descriptors, plus system-wide totals.
if [ "$#" -ne 2 ]; then
    name=$(basename "$0")
    echo "Usage: $name <threshold for number of fds> <check_interval>"
    exit 1
fi
fd_threshold=$1
check_interval=$2
log=pid_monitor.txt
echo "=================================================================================================================================" >> "$log"
echo "****************************************MONITORING STARTS AT $(date)***************************************************" >> "$log"
while true
do
    total_num_desc=0
    for pid in $(ps -e -o pid=)
    do
        # Plain ls, not ls -l: the "total" line of ls -l would inflate the count by one.
        num_fd=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
        # cmdline is NUL-separated; turn the separators into spaces for logging.
        pname=$(tr '\0' ' ' 2>/dev/null < "/proc/$pid/cmdline")
        total_num_desc=$((total_num_desc + num_fd))
        if [ "$num_fd" -gt "$fd_threshold" ]; then
            echo "Process name $pname($pid) and number of open descriptors = $num_fd" >> "$log"
        fi
    done
    # file-nr holds three fields: allocated handles, unused handles, system-wide maximum.
    total_nr_desc=$(cat /proc/sys/fs/file-nr)
    lsof_desc=$(lsof | wc -l)
    echo "$(date) : Total number of open file descriptors = $total_num_desc lsof desc = $lsof_desc file-nr = $total_nr_desc" >> "$log"
    sleep "$check_interval"
done
./monitor.fd.sh 500 2 & tail -f pid_monitor.txt
As I mentioned earlier, I don't see any leak in /proc/<pid>/fd for any <pid>, but the leak is definitely happening and the system is running out of file descriptors.
I suspect something in the kernel is leaking. The Linux kernel version is 2.6.23.
My questions are as follows:
1. Will 'ls /proc/<pid>/fd' list the descriptors opened by any library linked into the process with that pid? If not, how do I determine whether there is a leak in a library I am linking to?
2. How do I confirm whether the leak is in userspace or in the kernel?
3. If the leak is in the kernel, what tools can I use to debug it?
4. Any other tips you can give me?
Thanks for going through the question patiently.
Would really appreciate any help.
Found the solution to the problem.
There was a shared memory attach happening in one function, and that function was being called every 30 seconds. The shared memory was never detached afterwards, hence the descriptor leak. I guess /proc/<pid>/fd doesn't show a shared memory attach as a descriptor, which is why my script was not able to catch the leak.
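In hindsight, something like the following would probably have caught it. This is only a sketch: it assumes the attaches are System V segments, which Linux normally lists in /proc/<pid>/maps with a name starting with /SYSV. The function names and the proc_root argument are made up for illustration.

```shell
#!/bin/bash
# Count the System V shared memory mappings listed in one maps file.
# Assumption: such mappings appear with a name like "/SYSV00000000 (deleted)".
count_sysv_attachments() {
    local n
    n=$(grep -c '/SYSV' "$1" 2>/dev/null) || true
    printf '%s\n' "${n:-0}"
}

# Print "<pid> <count>" for every process with more than <threshold>
# such mappings. proc_root is a hypothetical parameter added for testing.
report_sysv_attachments() {
    local threshold="$1" proc_root="${2:-/proc}" maps pid n
    for maps in "$proc_root"/[0-9]*/maps; do
        [ -r "$maps" ] || continue
        n=$(count_sysv_attachments "$maps")
        pid=${maps%/maps}
        pid=${pid##*/}
        [ "$n" -gt "$threshold" ] && printf '%s %s\n' "$pid" "$n"
    done
    return 0
}
```

Running report_sysv_attachments 100 every so often should show the count for the offending process climbing. The nattch column of ipcs -m gives a similar per-segment view.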