[STOR-1195] Investigate CMS StoRM instabilities Created: 07/Apr/20  Updated: 27/May/21  Resolved: 04/May/20

Status: Closed
Project: StoRM
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major
Reporter: Andrea Ceccanti Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File be.PNG     PDF File be_metrics.pdf     PNG File requests.png     PNG File screenshot-1.png    
Issue Links:
Relates
relates to STOR-1174 Include thread pool metrics reporting... Closed
relates to STOR-1198 Add Date to Backend's metrics log Closed

 Description   

The CMS FE and BE (storm-cms.cr.cnaf.infn.it) have shown instabilities since last Saturday, with similar behavior on Saturday, Monday morning, Monday afternoon and, finally, last night. Details on last night follow.

Filesystem ok, load ok, memory ok, no network problems.

At 4:33 am everything stops working: both the FE and the BE, which run on the same host, stop logging; we see 200 tasks pending (the maximum) in monitoring.log and 0s in heartbeat.log.

Sensu sends several alarms reporting that the FE cannot be contacted (over both IPv4 and IPv6). Finally, at 6:50 am, the frontend is restarted and goes back to "normal".

CMS doesn't use WebDAV. The GridFTP servers (xs-402 and xs-403) stop working in the same time range.



 Comments   
Comment by Andrea Ceccanti [ 04/May/20 ]

The frontend instabilities were caused by the BE being blocked. From the logs we could not understand what caused the BE to block.

One possibility is that the thread pools serving requests from the FE got saturated. We'll add thread pool metrics to the StoRM metrics reporting.
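
For reference, a minimal sketch of the kind of thread pool instrumentation meant here, assuming Dropwizard Metrics (which the backend-metrics.log format resembles); class, metric and pool names are illustrative, not StoRM's actual code:

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;

public class XmlRpcPoolMetrics {

  // Expose the state of the pool serving FE requests as gauges, so that a
  // growing queue shows up in the metrics log before the pool saturates.
  public static void register(MetricRegistry registry, ThreadPoolExecutor pool) {
    registry.register("xmlrpc.pool.active", (Gauge<Integer>) pool::getActiveCount);
    registry.register("xmlrpc.pool.queued", (Gauge<Integer>) () -> pool.getQueue().size());
    registry.register("xmlrpc.pool.size", (Gauge<Integer>) pool::getPoolSize);
    registry.register("xmlrpc.pool.completed", (Gauge<Long>) pool::getCompletedTaskCount);
  }

  public static void main(String[] args) {
    MetricRegistry registry = new MetricRegistry();
    // Queue bound of 200 mirrors the "200 tasks pending (=max)" seen in
    // monitoring.log; the real pool configuration may differ.
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        10, 50, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>(200));
    register(registry, pool);
  }
}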

Comment by Lucia Morganti [ 10/Apr/20 ]

be_metrics.pdf is backend-metrics.log for last Saturday, parsed with ELK. The log stopped at 23:26. Let me know if you would prefer the "max" values removed from the duration plots.

Comment by Lucia Morganti [ 08/Apr/20 ]

The attached plot shows the requests from the BE log.

Comment by Lucia Morganti [ 08/Apr/20 ]

The attached plot shows the request pattern on April 4th.

Comment by Lucia Morganti [ 07/Apr/20 ]

The attached plots show the requests in the previous hours.
Unfortunately, ELK is not active for CMS.
Should I parse the log files, or is this enough?

Comment by Andrea Ceccanti [ 07/Apr/20 ]

java.lang.OutOfMemoryError doesn't look good at all.
"GC overhead limit exceeded" means that the BE was spending most of its time in garbage collection instead of doing real work, so it was unresponsive.
What was the request pattern before the OOM?
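
For the next occurrence it may help to have the JVM record evidence automatically. A minimal example of the standard HotSpot options for that (paths are illustrative, and where they go depends on how the backend JVM is launched):

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/storm/backend-heapdump.hprof
-Xloggc:/var/log/storm/backend-gc.log
-XX:+PrintGCDetails -XX:+PrintGCDateStamps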

Comment by Lucia Morganti [ 07/Apr/20 ]

Adding the last rows from storm-backend-metrics.log:

23:25:55.133 - synch.ls [(count=71952082, m1_rate=78.74230957953807, m5_rate=203.3783220703492, m15_rate=284.32058517739483) (max=18110.076702, min=9.054869, mean=1398.911431029295, p95=6628.170163, p99=18110.076702)] duration_units=milliseconds, rate_units=events/minute
23:25:55.133 - synch.mkdir [(count=651915, m1_rate=0.4489344991993962, m5_rate=0.9996813382802947, m15_rate=2.938306089529111) (max=2262.6094129999997, min=8.964582, mean=926.3620534192959, p95=1142.15647, p99=1142.15647)] duration_units=milliseconds, rate_units=events/minute
23:25:55.133 - synch.mv [(count=0, m1_rate=0.0, m5_rate=0.0, m15_rate=0.0) (max=0.0, min=0.0, mean=0.0, p95=0.0, p99=0.0)] duration_units=milliseconds, rate_units=events/minute
23:25:55.134 - synch.pd [(count=8128479, m1_rate=1.0014775881691327, m5_rate=17.513433207045587, m15_rate=32.15343161896282) (max=2382.4285, min=11.087868, mean=86.09382842079452, p95=704.231577, p99=704.231577)] duration_units=milliseconds, rate_units=events/minute
23:25:57.574 - synch.ping [(count=11263507, m1_rate=14.672591592351536, m5_rate=29.61748000620723, m15_rate=45.35940069146042) (max=727.672752, min=0.037839, mean=30.988702910969955, p95=0.38009899999999996, p99=727.672752)] duration_units=milliseconds, rate_units=events/minute
23:25:57.575 - synch.rf [(count=3108144, m1_rate=2.275478611197182, m5_rate=6.041267684828509, m15_rate=11.073244392699149) (max=13846.969083999998, min=6.743231, mean=5238.839677417935, p95=13846.969083999998, p99=13846.969083999998)] duration_units=milliseconds, rate_units=events/minute
23:25:57.575 - synch.rm [(count=7264060, m1_rate=2.428741662549194, m5_rate=21.789814532777854, m15_rate=17.79838731835939) (max=2384.177084, min=12.280045999999999, mean=442.59246254776775, p95=2175.954425, p99=2175.954425)] duration_units=milliseconds, rate_units=events/minute
23:25:57.575 - synch.rmDir [(count=3660, m1_rate=1.77863632503E-312, m5_rate=8.89318162514E-312, m15_rate=2.6679544875427E-311) (max=24.527711999999998, min=8.412077, mean=8.738221413680268, p95=8.738203, p99=8.738203)] duration_units=milliseconds, rate_units=events/minute
23:25:57.576 - fs.aclOp [(count=347290508, m1_rate=61.53443396193107, m5_rate=712.8503403158022, m15_rate=1340.8594276225072) (max=2140.193092, min=0.034564, mean=14.023937659411207, p95=1.769826, p99=831.6180529999999)] duration_units=milliseconds, rate_units=events/minute
23:26:02.005 - fs.fileAttributeOp [(count=176295161, m1_rate=74.14077847272269, m5_rate=514.7531431957868, m15_rate=816.7100446664926) (max=2.4630829999999997, min=0.014825999999999999, mean=0.057977962956402264, p95=0.166902, p99=0.237734)] duration_units=milliseconds, rate_units=events/minute
23:26:02.680 - ea [(count=458897915, m1_rate=581.2318458871677, m5_rate=2025.9262383826863, m15_rate=2524.3492604568687) (max=2197.537074, min=0.022029999999999998, mean=18.1119945157166, p95=1.3073629999999998, p99=714.6460069999999)] duration_units=milliseconds, rate_units=events/minute
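
For context, these fields look like Dropwizard Metrics timers (a count, per-minute m1/m5/m15 rates, and a duration histogram with min/max/mean/p95/p99 in milliseconds). A minimal sketch of how such a line can be produced, with illustrative names rather than StoRM's actual code:

import java.util.concurrent.TimeUnit;

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Snapshot;
import com.codahale.metrics.Timer;

public class SynchLsTiming {

  private final Timer lsTimer;

  public SynchLsTiming(MetricRegistry registry) {
    // One timer per synchronous operation; "synch.ls" matches the log lines above.
    lsTimer = registry.timer("synch.ls");
  }

  public void handleLs(Runnable lsCall) {
    // time() feeds both the rate meters and the duration histogram.
    try (Timer.Context ctx = lsTimer.time()) {
      lsCall.run();
    }
  }

  public String summaryLine() {
    Snapshot s = lsTimer.getSnapshot();
    // Same fields as backend-metrics.log: durations in ms, rates per minute.
    return String.format("synch.ls count=%d m1_rate=%.2f mean=%.2f p99=%.2f",
        lsTimer.getCount(),
        lsTimer.getOneMinuteRate() * 60,          // meter rates are per second
        toMillis(s.getMean()),
        toMillis(s.get99thPercentile()));
  }

  private static double toMillis(double nanos) {
    // Timer snapshots record durations in nanoseconds.
    return nanos / TimeUnit.MILLISECONDS.toNanos(1);
  }
}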

Comment by Lucia Morganti [ 07/Apr/20 ]

Around 5k ERRORs, which however seems to be a usual number considering the previous days.
The errors look like:

23:26:19.802 - ERROR [Timer-6] - REQUEST SUMMARY DAO - purgeExpiredRequests - Rolling back because of error: Can't call commit when autocommit=true
java.sql.SQLException: Can't call commit when autocommit=true
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:934)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:931)
at com.mysql.jdbc.ConnectionImpl.commit(ConnectionImpl.java:1647)
at it.grid.storm.catalogs.RequestSummaryDAO.purgeExpiredRequests(RequestSummaryDAO.java:1194)
at it.grid.storm.catalogs.timertasks.RequestsGarbageCollector.purgeExpiredRequests(RequestsGarbageCollector.java:118)
at it.grid.storm.catalogs.timertasks.RequestsGarbageCollector.purgeExpiredRequests(RequestsGarbageCollector.java:88)
at it.grid.storm.catalogs.timertasks.RequestsGarbageCollector.run(RequestsGarbageCollector.java:38)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)
23:26:22.940 - ERROR [Timer-4] - Error fetching new requests from database: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded

(The errors above are really close in time to when storm-backend-metrics.log stops logging; see the note on the autocommit error at the end of this comment.)

23:56:45.134 - ERROR [xmlrpc-1557980] - srmRm: File does not exist

23:57:47.302 - ERROR [xmlrpc-1557533] - srmMkdir: Path specified exists as a file

23:58:28.396 - ERROR [Timer-6] - PICKER2: roll back failed! Can't call rollback when autocommit=true

I also see several warnings like:
23:25:24.025 - WARN [xmlrpc-1557981] - Checksum value is not available for file : ...

Actually, there are no garbage-collector-related errors on the previous days, 842 such ERRORs on Saturday, and very few on Sunday, so this could be related to the Saturday night fever.
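
As a side note on the "Can't call commit when autocommit=true" error above: commit() and rollback() are only legal on a JDBC connection while autocommit is disabled, so the purge (or its rollback) was running on a connection still in autocommit mode. A minimal sketch of the pattern that avoids this (illustrative only, not the actual RequestSummaryDAO code; table and column names are made up):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

public class PurgeSketch {

  // Turn autocommit off for the duration of the purge, commit on success,
  // roll back on failure, and restore the previous autocommit setting.
  static int purgeBefore(Connection con, Timestamp cutoff) throws SQLException {
    boolean previousAutoCommit = con.getAutoCommit();
    con.setAutoCommit(false);
    try (PreparedStatement ps =
        con.prepareStatement("DELETE FROM request_queue WHERE timestamp < ?")) {
      ps.setTimestamp(1, cutoff);
      int deleted = ps.executeUpdate();
      con.commit();
      return deleted;
    } catch (SQLException e) {
      con.rollback();   // safe here: autocommit is off
      throw e;
    } finally {
      con.setAutoCommit(previousAutoCommit);
    }
  }
}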

Comment by Enrico Vianello [ 07/Apr/20 ]

Restarting storm-backend-server seems to be the real fix for the problem.
We then need to understand why the REST, XML-RPC and metrics endpoints were unresponsive.

Are there any ERRORs in the storm backend's log from Saturday?

Comment by Andrea Ceccanti [ 07/Apr/20 ]

The BE was stuck and didn't accept connections. I restarted it.

Comment by Lucia Morganti [ 07/Apr/20 ]

Enrico Fattibene cannot comment on this issue. He would like to say: GEMSS on the CMS HSM node logs, from 4th April 23:27:58, "yamssReorderRecall[8317]: Error: StoRM recall table service not responding" after executing a command like "curl -s -S -H "Token:$STORM_BACKEND_TOKEN" -X GET http://$STORM_BACKEND_NODE:9998/recalltable/cardinality/tasks/readyTakeOver"
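
For what it's worth, the same check can be done with explicit timeouts so that a stuck backend fails fast instead of hanging. A minimal sketch hitting the endpoint from the quoted command (host and token come from the same environment variables; this is not part of GEMSS itself):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RecallTableProbe {

  public static void main(String[] args) throws Exception {
    String node = System.getenv("STORM_BACKEND_NODE");
    String token = System.getenv("STORM_BACKEND_TOKEN");
    URL url = new URL("http://" + node + ":9998/recalltable/cardinality/tasks/readyTakeOver");

    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("Token", token);
    conn.setConnectTimeout(5000);   // fail fast if the REST endpoint is unresponsive
    conn.setReadTimeout(5000);

    try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
      System.out.println("readyTakeOver tasks: " + in.readLine());
    } finally {
      conn.disconnect();
    }
  }
}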

 

Comment by Lucia Morganti [ 07/Apr/20 ]

The HSM node logs several "Error: StoRM recall table service not responding" messages starting from Saturday.
