StoRM / STOR-1525

Load balancing strategy for StoRM WebDAV server pool


    • Type: Task
    • Resolution: Fixed
    • Priority: Major
    • 1.11.22
    • 1.11.21
    • backend, puppet-modules
    • Security Level: Public (Visible by non-authn users.)

      (I set Major priority because we believe this issue has a significant impact on the ongoing tape data challenge.)

      Hi,

      we noticed that it is not possible to configure a load balancing strategy for the WebDAV server pool within the StoRM Backend (https://italiangrid.github.io/storm-puppet-module/puppet_classes/storm_3A_3Abackend.html). This is possible for the GridFTP server pool, and we have been using 'smart-rr' in production for the pool of GridFTP servers (the default value is 'rr') for years.
      Would it be possible to implement it for the StoRM WebDAV pool as well?
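
      For clarity, below is a minimal sketch of what we would like to be able to write in the storm::backend configuration. The webdav_pool_balance_strategy parameter is hypothetical: it simply mirrors the existing gsiftp_pool_balance_strategy knob and is the parameter we are asking for, not something that exists today.

      gsiftp_pool_balance_strategy => 'smart-rr',  # supported today for the GridFTP pool
      webdav_pool_balance_strategy => 'smart-rr',  # hypothetical: the analogous knob requested for the WebDAV pool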

      A related question concerns the way in which the pool members are defined (and then used) within the StoRM Backend.
      Following Andrea's guidance, we have set the following for storm-lhcb:

      gsiftp_pool_balance_strategy => 'smart-rr',
      gsiftp_pool_members => [
        { 'hostname' => 'xs-101.cr.cnaf.infn.it' },
        { 'hostname' => 'xs-102.cr.cnaf.infn.it' },
        { 'hostname' => 'xs-103.cr.cnaf.infn.it' },
      ],
      webdav_pool_members => [
        { 'hostname' => 'xfer-lhcb.cr.cnaf.infn.it' },
      ],
      Hence: to define the pool of GridFTP servers we use the individual hosts, while for webdav_pool_members we use their alias (4 hosts are behind xfer-lhcb.cr.cnaf.infn.it).
      Is such an alias somehow necessary in the configuration for TPCs to work correctly, e.g. with storage-issued tokens? Or could we safely put the individual hosts in webdav_pool_members as well (see the sketch after this paragraph)?
      And if we did put the individual hosts in webdav_pool_members, would the 'rr' strategy then be used?
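
      To make that alternative concrete, it would look roughly like the sketch below; the hostnames are placeholders for the 4 real nodes behind the xfer-lhcb.cr.cnaf.infn.it alias and are used here for illustration only.

      webdav_pool_members => [
        # placeholder hostnames, not the real node names behind the alias
        { 'hostname' => 'webdav-1.cr.cnaf.infn.it' },
        { 'hostname' => 'webdav-2.cr.cnaf.infn.it' },
        { 'hostname' => 'webdav-3.cr.cnaf.infn.it' },
        { 'hostname' => 'webdav-4.cr.cnaf.infn.it' },
      ],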

      An important tape data challenge is currently ongoing, and LHCb, which is testing EOS->CNAF_disk->CNAF_tape over SRM+HTTP, reports many transfers failing with 'Timeout waiting for connection from pool'.
      These are mostly TPCs between the StoRM WebDAV endpoints themselves, i.e. xfer-lhcb (disk Storage Area) -> xfer-lhcb (tape Storage Area), and they are strongly unbalanced among the 4 available WebDAV servers.
      Looking at the failed PUT requests from EOS to CNAF_disk, we see the same (low) number of failed requests on each endpoint. Looking at the TPCs CNAF_disk -> CNAF_tape, we see 10k timeout errors on one endpoint and 40 on another.
      Any idea why the behaviour of TPCs should differ from that of plain transfers, given that both kinds of request are managed by the StoRM Backend (SRM+HTTP)?

      Thank you very much,
      lucia

            Assignee: Enrico Vianello (vianello)
            Reporter: Lucia Morganti (lmorganti)
