[STOR-1515] StoRM WebDAV metrics on TPC.pull/push.throughput Created: 27/Jan/22  Updated: 27/Jun/23  Resolved: 31/Mar/23

Status: Closed
Project: StoRM
Component/s: webdav
Affects Version/s: 1.11.21
Fix Version/s: 1.11.22
Security Level: Public (Visible by non-authn users.)

Type: Task Priority: Major
Reporter: Lucia Morganti Assignee: Enrico Vianello
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File mean_tpc_throughput.png    

 Description   

Hello,
the metrics published by StoRM WebDAV endpoints are read from the endpoint (http://localhost:8085/status/metrics?pretty=true) every minute and injected into our monitoring system at INFN-T1 (following the work described in https://issues.infn.it/jira/browse/STOR-1348).

We spotted a possible issue with mean values for TPC.pull/push.throughput (see histograms here: http://xfer-archive.cr.cnaf.infn.it:8085/status/metrics?pretty=true), namely the published mean value for throughput doesn't change from its last positive value for a time range of 20 minutes when no actual TPCs happen.

This is shown in the attached picture. To me, it doesn't seem to be a plotting artifact, given that there are 20 stored mean values equal to 63.6 MB/s between 11:05 and 11:23.

Could you check how such mean values are computed, e.g. over which time range?

Thank you very much,
lucia



 Comments   
Comment by Enrico Vianello [ 31/Mar/23 ]

I've just added the suggested fix: if the last transfer status is older than 10 seconds, the returned mean is "Empty".

https://github.com/italiangrid/storm-webdav/commit/519a745d15a14c261b79d1f465f8b0f4c7e7ada4

Comment by Lucia Morganti [ 30/Mar/22 ]

Thanks!

Comment by Enrico Vianello [ 29/Mar/22 ]

The metrics TPC.pull.throughput-bytes-per-sec and TPC.push.throughput-bytes-per-sec are based on "lastTransferStatus", which causes the mean to freeze. The problem is also due to the fact that the metric is a histogram, which may not be the proper type for a mean that is tied to a time scale.
https://github.com/italiangrid/storm-webdav/blob/develop/src/main/java/org/italiangrid/storm/webdav/tpc/transfer/impl/TransferRequestImpl.java#L119-L138
https://github.com/italiangrid/storm-webdav/blob/develop/src/main/java/org/italiangrid/storm/webdav/tpc/http/HttpTransferClientMetricsWrapper.java#L56-L57

Antonio Falabella will try to evaluate the best alternative, otherwise an idea for a fix could be to return an empty transfer throughput in case the last successful transfer is older than X seconds:

    // Age of the last transfer status: older instant first, so the duration is positive.
    Duration age = Duration.between(lastTransferStatus.get().getInstant(), Instant.now());
    if (age.getSeconds() > 10) {
      return Optional.empty();
    }

Here
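
To make the idea easier to test outside the code base, here is a minimal, self-contained sketch of the staleness check. It is not the StoRM WebDAV implementation: the class name, the method throughputIfFresh and the MAX_AGE constant are hypothetical, and the current time is passed in as a parameter only to keep the check deterministic. Note that Duration.between(start, end) is positive only when end is later than start, so the older instant has to come first.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

// Hypothetical helper, for illustration only (not part of StoRM WebDAV).
class StaleThroughputSketch {

  // Assumed cut-off; the comment above suggests "X seconds", with 10 as an example.
  static final Duration MAX_AGE = Duration.ofSeconds(10);

  /**
   * Returns the throughput only if the last transfer status is recent enough,
   * otherwise an empty Optional so that no stale value is published.
   */
  static Optional<Double> throughputIfFresh(
      Optional<Instant> lastStatusInstant, double bytesPerSec, Instant now) {
    if (lastStatusInstant.isEmpty()) {
      return Optional.empty(); // no transfer status recorded yet
    }
    // Older instant first, so the resulting age is a positive duration.
    Duration age = Duration.between(lastStatusInstant.get(), now);
    if (age.compareTo(MAX_AGE) > 0) {
      return Optional.empty(); // last status is too old: report nothing
    }
    return Optional.of(bytesPerSec);
  }

  public static void main(String[] args) {
    Instant now = Instant.now();
    // Status updated 3 seconds ago: the value is still reported.
    System.out.println(throughputIfFresh(Optional.of(now.minusSeconds(3)), 63.6e6, now));
    // Status updated 60 seconds ago: the value is dropped.
    System.out.println(throughputIfFresh(Optional.of(now.minusSeconds(60)), 63.6e6, now));
  }
}
```

With a check of this kind, the monitoring side would see a missing value rather than a repeated last mean when no TPC is running.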

Comment by Lucia Morganti [ 08/Feb/22 ]

Any news on this?

Generated at Wed Apr 16 00:43:24 CEST 2025 using Jira 10.3.4#10030004-sha1:d6812f2d35a143c1c5fc283d2f5a72582f40aaf1.