-
Type: Task
-
Resolution: Fixed
-
Priority: Major
-
Affects Version/s: None
-
Component/s: None
The CMS FE and BE (storm-cms.cr.cnaf.infn.it) have been unstable since last Saturday, with similar behavior on Saturday, Monday morning, Monday afternoon and finally last night. I'll describe last night's episode in detail.
Filesystem ok, load ok, memory ok, no network problems.
At 4:33 am everything stops working: both the FE and the BE, which run on the same host, stop logging, monitoring.log shows 200 pending tasks (= max), and heartbeat.log shows 0s.
Sensu sends several alarms saying the FE cannot be contacted (over both IPv4 and IPv6). Finally, at 6:50 am, the frontend is restarted and is back to "normal".
CMS doesn't use WebDAV. The GridFTP servers (xs-402 and xs-403) stop working in the same time range.
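To compare the stall windows across the FE, BE and GridFTP logs, something like the following could list the periods in which a log went silent. This is only a minimal sketch: the timestamp layout (`TS_FORMAT`, `TS_LEN`), the `find_gaps` helper and the 5-minute threshold are assumptions for illustration, not taken from the actual StoRM log format, and the sample lines are synthetic, not real log entries.

```python
from datetime import datetime, timedelta

# Assumed layout of the timestamp at the start of each log line
# (hypothetical; adjust to whatever the real logs use).
TS_FORMAT = "%Y-%m-%d %H:%M:%S"
TS_LEN = 19


def find_gaps(lines, min_gap=timedelta(minutes=5)):
    """Return (start, end) pairs where consecutive timestamped
    lines are more than min_gap apart."""
    gaps = []
    previous = None
    for line in lines:
        try:
            current = datetime.strptime(line[:TS_LEN], TS_FORMAT)
        except ValueError:
            continue  # skip lines without a leading timestamp
        if previous is not None and current - previous > min_gap:
            gaps.append((previous, current))
        previous = current
    return gaps


if __name__ == "__main__":
    # Synthetic sample lines, for illustration only.
    sample = [
        "2000-01-01 04:32:58 example line",
        "2000-01-01 04:33:01 example line",
        "2000-01-01 06:50:12 example line",
    ]
    for start, end in find_gaps(sample):
        print(f"no log lines between {start} and {end} ({end - start})")
```

Running the same check on each log file and comparing the reported windows would show whether all services go silent at the same moment or one of them stalls first.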