I'm trying to connect a Mesos slave to its master. Whenever the slave tries to connect to the master, I get the following messages:
I0806 16:39:59.090845 935 hierarchical.hpp:528] Added slave 20150806-163941-1027506442-5050-921-S3 (debian) with cpus(*):1; mem(*):1938; disk(*):3777; ports(*):[31000-32000] (allocated: )
E0806 16:39:59.091384 940 socket.hpp:107] Shutdown failed on fd=25: Transport endpoint is not connected [107]
I0806 16:39:59.091508 940 master.cpp:3395] Registered slave 20150806-163941-1027506442-5050-921-S3 at slave(1)@127.0.1.1:5051 (debian) with cpus(*):1; mem(*):1938; disk(*):3777; ports(*):[31000-32000]
I0806 16:39:59.091747 940 master.cpp:1006] Slave 20150806-163941-1027506442-5050-921-S3 at slave(1)@127.0.1.1:5051 (debian) disconnected
I0806 16:39:59.091868 940 master.cpp:2203] Disconnecting slave 20150806-163941-1027506442-5050-921-S3 at slave(1)@127.0.1.1:5051 (debian)
I0806 16:39:59.092031 940 master.cpp:2222] Deactivating slave 20150806-163941-1027506442-5050-921-S3 at slave(1)@127.0.1.1:5051 (debian)
I0806 16:39:59.092248 939 hierarchical.hpp:621] Slave 20150806-163941-1027506442-5050-921-S3 deactivated
The error seems to be:
E0806 16:39:59.091384 940 socket.hpp:107] Shutdown failed on fd=25: Transport endpoint is not connected [107]
The master was started using:
./mesos-master.sh --ip=10.129.62.61 --work_dir=~/Mesos/mesos-0.23.0/workdir/ --zk=zk://10.129.62.61:2181/mesos --quorum=1
And the slave using:
./mesos-slave.sh --master=zk://10.129.62.61:2181/mesos
If I run the slave on the same VM as the master, it works fine.
I couldn't find much information on the internet. I'm running two virtual machines (Debian 8.1) on VirtualBox 5. The host is Windows 7.
Edit 1:
The master and the slave both run on a dedicated VM.
Both VMs use bridged networking.
ifconfig from master:
eth0 Link encap:Ethernet HWaddr 08:00:27:cc:6c:6e
inet addr:10.129.62.61 Bcast:10.129.255.255 Mask:255.255.0.0
inet6 addr: fe80::a00:27ff:fecc:6c6e/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:5335953 errors:0 dropped:0 overruns:0 frame:0
TX packets:1422428 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:595886271 (568.2 MiB) TX bytes:362423868 (345.6 MiB)
ifconfig from slave:
eth0 Link encap:Ethernet HWaddr 08:00:27:56:83:20
inet addr:10.129.62.49 Bcast:10.129.255.255 Mask:255.255.0.0
inet6 addr: fe80::a00:27ff:fe56:8320/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:4358561 errors:0 dropped:0 overruns:0 frame:0
TX packets:3825 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:397126834 (378.7 MiB) TX bytes:354116 (345.8 KiB)
Edit 2:
The slave logs can be found at http://pastebin.com/CXZUBHKr
The master logs can be found at http://pastebin.com/thYR1par
I had a similar problem. My slave logs would be filled with
E0812 15:58:04.017990 2193 socket.hpp:107] Shutdown failed on fd=13: Transport endpoint is not connected [107]
My master would have
F0120 20:45:48.025610 12116 master.cpp:1083] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
The master would then die and a new election would occur. The killed master would be restarted by upstart (I am on a CentOS 6 box) and added back into the pool of potential masters, so my elected master would daisy-chain around my master nodes. Restarting the masters and slaves many times did nothing; the problem consistently returned within 1 minute of a master election.
The solution for me came from this Stack Overflow question (thanks) and a hint in a GitHub gist note.
The gist of it is that /etc/default/mesos-master must specify a quorum number (it needs to be correct for the number of Mesos masters, in my case 3):
MESOS_QUORUM=2
This seems odd to me, as I have the same information in the file /etc/mesos-master/quorum. But I added it to /etc/default/mesos-master, restarted the mesos-masters and slaves, and the problem has not returned.
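For reference, here is a rough sketch of how I arrive at the quorum value. It assumes the quorum should be a strict majority of the masters (more than half). The helper name quorum_for is made up for illustration and is not part of Mesos.

```shell
# Hypothetical helper (not part of Mesos): strict-majority quorum
# for a given number of masters, i.e. floor(n / 2) + 1.
quorum_for() {
  echo $(( $1 / 2 + 1 ))
}

masters=3   # number of Mesos masters in my cluster (assumption: adjust to yours)

# Print the line to put in /etc/default/mesos-master
echo "MESOS_QUORUM=$(quorum_for "$masters")"
```

With three masters this gives 2, which matches the value I set above.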
I hope this helps you.