[STOR-1195] Investigate CMS StoRM instabilities Created: 07/Apr/20 Updated: 27/May/21 Resolved: 04/May/20
Status: | Closed |
Project: | StoRM |
Component/s: | None |
Affects Version/s: | None |
Fix Version/s: | None |
Type: | Task | Priority: | Major |
Reporter: | Andrea Ceccanti | Assignee: | Unassigned |
Resolution: | Fixed | Votes: | 0 |
Labels: | None |
Remaining Estimate: | Not Specified |
Time Spent: | Not Specified |
Original Estimate: | Not Specified |
Attachments: |
Issue Links: |
Description
The CMS FE and BE (storm-cms.cr.cnaf.infn.it) have shown instabilities since last Saturday, with similar behavior on Saturday, Monday morning, Monday afternoon and, finally, last night. Details for last night follow. Filesystem OK, load OK, memory OK, no network problems. At 4:33 am everything stops working: both the FE and the BE, which run on the same host, stop logging; monitoring.log shows 200 pending tasks (= the maximum) and heartbeat.log shows 0s. Sensu sends several alarms saying the FE cannot be contacted (on both IPv4 and IPv6). Finally, at 6:50 am, the frontend is restarted and goes back to "normal". CMS doesn't use WebDAV. The GridFTP servers (xs-402 and xs-403) stopped working in the same time range.
Comments
Comment by Andrea Ceccanti [ 04/May/20 ] |
The frontend instabilities were caused by the BE being blocked. Looking at the logs, we haven't understood what caused the BE to block. One possibility is that the thread pools serving requests from the FE got saturated. We'll add thread-pool metrics to the StoRM metrics monitoring.
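A minimal sketch of what such monitoring could look like, assuming the backend's Dropwizard/Codahale `MetricRegistry` (the format of storm-backend-metrics.log suggests that library is in use); the pool reference and metric names below are illustrative, not actual StoRM identifiers:

```java
import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;

import java.util.concurrent.ThreadPoolExecutor;

public class ThreadPoolMetrics {

  /**
   * Registers saturation-related gauges for a thread pool, e.g. the pool that
   * serves XML-RPC requests coming from the FE. Names are illustrative only.
   */
  public static void register(MetricRegistry registry, String prefix, ThreadPoolExecutor pool) {
    // Threads currently busy executing tasks.
    registry.register(prefix + ".active", (Gauge<Integer>) pool::getActiveCount);
    // Tasks waiting in the queue: a value stuck at its bound signals saturation.
    registry.register(prefix + ".queued", (Gauge<Integer>) () -> pool.getQueue().size());
    // Configured upper bound, useful to compute a saturation ratio.
    registry.register(prefix + ".max-size", (Gauge<Integer>) pool::getMaximumPoolSize);
  }
}
```

A queue size pinned near its limit while the active count equals the maximum pool size would confirm the saturation hypothesis.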
Comment by Lucia Morganti [ 10/Apr/20 ] |
be_metrics.pdf |
Comment by Lucia Morganti [ 08/Apr/20 ] |
Comment by Lucia Morganti [ 08/Apr/20 ] |
Comment by Lucia Morganti [ 07/Apr/20 ] |
Comment by Andrea Ceccanti [ 07/Apr/20 ] |
java.lang.OutOfMemoryError doesn't look good at all. |
Comment by Lucia Morganti [ 07/Apr/20 ] |
Adding the last rows from storm-backend-metrics.log:

23:25:55.133 - synch.ls [(count=71952082, m1_rate=78.74230957953807, m5_rate=203.3783220703492, m15_rate=284.32058517739483) (max=18110.076702, min=9.054869, mean=1398.911431029295, p95=6628.170163, p99=18110.076702)] duration_units=milliseconds, rate_units=events/minute
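The field layout appears to match a Dropwizard/Codahale Metrics `Timer` snapshot: `count` is cumulative, `m1_rate`/`m5_rate`/`m15_rate` are exponentially weighted rates in events/minute, and `max`/`min`/`mean`/`p95`/`p99` are duration percentiles in milliseconds. A minimal sketch, assuming that library, of where those numbers come from (the actual StoRM reporter formats the line differently):

```java
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Slf4jReporter;
import com.codahale.metrics.Timer;
import org.slf4j.LoggerFactory;

import java.util.concurrent.TimeUnit;

public class SynchLsMetricExample {

  public static void main(String[] args) throws Exception {
    MetricRegistry registry = new MetricRegistry();
    // One Timer per synchronous SRM operation, e.g. srmLs.
    Timer lsTimer = registry.timer("synch.ls");

    // Same units as storm-backend-metrics.log: rates in events/minute,
    // durations in milliseconds.
    Slf4jReporter reporter = Slf4jReporter.forRegistry(registry)
        .outputTo(LoggerFactory.getLogger("storm-backend-metrics"))
        .convertRatesTo(TimeUnit.MINUTES)
        .convertDurationsTo(TimeUnit.MILLISECONDS)
        .build();

    // Each srmLs call is timed; the snapshot aggregates all timed calls.
    try (Timer.Context ignored = lsTimer.time()) {
      Thread.sleep(120); // placeholder for the real namespace lookup
    }

    reporter.report(); // emits count, m1/m5/m15 rates, min/mean/max, p95, p99
  }
}
```

In the line above the one-minute rate (≈79 events/min) is far below the fifteen-minute rate (≈284 events/min), consistent with fewer `ls` calls completing in the minutes before logging stopped.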
Comment by Lucia Morganti [ 07/Apr/20 ] |
Around 5k ERRORs, which however seems to be a usual number considering the previous days.

23:26:19.802 - ERROR [Timer-6] - REQUEST SUMMARY DAO - purgeExpiredRequests - Rolling back because of error: Can't
(the ones above are really close in time to when storm-backend-metrics.log stops logging)
23:56:45.134 - ERROR [xmlrpc-1557980] - srmRm: File does not exist
23:57:47.302 - ERROR [xmlrpc-1557533] - srmMkdir: Path specified exists as a file
23:58:28.396 - ERROR [Timer-6] - PICKER2: roll back failed! Can't call rollback when autocommit=true

I also see several errors related to the garbage collector. Actually, there are no garbage-collector errors in the previous days, 842 on Saturday, and very few on Sunday, so this could be related to the Saturday night fever.
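For context, the "Can't call rollback when autocommit=true" message is what the MySQL JDBC driver raises when `rollback()` is invoked on a connection still in auto-commit mode. A hedged sketch of the usual JDBC pattern, not the actual StoRM DAO code (table and column names are placeholders):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class PurgeExpiredRequests {

  // Illustrative statement: table and column names are placeholders,
  // not the actual StoRM database schema.
  private static final String PURGE_SQL =
      "DELETE FROM request_queue WHERE status = ? AND timeStamp < ?";

  public static void purge(Connection con, int expiredStatus, int retentionDays)
      throws SQLException {
    boolean previousAutoCommit = con.getAutoCommit();
    con.setAutoCommit(false);            // rollback() is only legal in this mode
    try (PreparedStatement ps = con.prepareStatement(PURGE_SQL)) {
      ps.setInt(1, expiredStatus);
      ps.setTimestamp(2, Timestamp.from(Instant.now().minus(retentionDays, ChronoUnit.DAYS)));
      ps.executeUpdate();
      con.commit();
    } catch (SQLException e) {
      con.rollback();                    // would throw if auto-commit were still on
      throw e;
    } finally {
      con.setAutoCommit(previousAutoCommit);
    }
  }
}
```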
Comment by Enrico Vianello [ 07/Apr/20 ] |
Restarting storm-backend-server seems to be the real solution to the problem. Are there any ERRORs in the StoRM backend's log from Saturday?
Comment by Andrea Ceccanti [ 07/Apr/20 ] |
The BE was stuck and didn't accept connections. I restarted it. |
Comment by Lucia Morganti [ 07/Apr/20 ] |
Enrico Fattibene cannot comment on this issue. He would like to say: GEMSS on the CMS HSM node logs, starting from 4th April 23:27:58, "yamssReorderRecall[8317]: Error: StoRM recall table service not responding" after executing a command like:

curl -s -S -H "Token:$STORM_BACKEND_TOKEN" -X GET http://$STORM_BACKEND_NODE:9998/recalltable/cardinality/tasks/readyTakeOver
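As a sketch, the same probe can be expressed programmatically with explicit timeouts, so a stuck backend surfaces as a fast "not responding" failure instead of a hang; the host, port, Token header and endpoint path are taken from the command above, everything else (timeout values, output handling) is illustrative:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RecallTableProbe {

  public static void main(String[] args) throws Exception {
    String node = System.getenv("STORM_BACKEND_NODE");
    String token = System.getenv("STORM_BACKEND_TOKEN");
    URL url = new URL("http://" + node + ":9998/recalltable/cardinality/tasks/readyTakeOver");

    HttpURLConnection con = (HttpURLConnection) url.openConnection();
    con.setRequestMethod("GET");
    con.setRequestProperty("Token", token);
    con.setConnectTimeout(5_000);   // fail fast if the backend does not accept connections
    con.setReadTimeout(10_000);     // fail fast if it accepts but never answers

    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
      System.out.println("readyTakeOver tasks: " + in.readLine());
    } finally {
      con.disconnect();
    }
  }
}
```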
Comment by Lucia Morganti [ 07/Apr/20 ] |
The HSM node logs several "Error: StoRM recall table service not responding" messages starting from Saturday.