Load balancing strategy for StoRM WebDAV server pool


    • Type: Task
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: 1.11.22
    • Affects Version/s: 1.11.21
    • Component/s: backend, puppet-modules
    • Security Level: Public (Visible by non-authn users.)
    • Labels: None

      (I set the priority to Major because we believe this issue has a significant impact on the ongoing tape data challenge.)

      Hi,

      we noticed that it is not possible to configure a load balancing strategy for the WebDAV server pool within StoRM Backend (https://italiangrid.github.io/storm-puppet-module/puppet_classes/storm_3A_3Abackend.html). This is possible for the GridFTP server pool, where we have been using 'smart-rr' in production for years (the default value would be 'rr').
      Would it be possible to implement it for the StoRM WebDAV pool as well?
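To make the requested behaviour concrete, here is a minimal Python sketch of the difference between a plain round-robin and a round-robin that skips unavailable pool members. It is only an illustration and not StoRM Backend code: the class names, the `is_available` callback, and the simulated outage of one host are all assumptions, and the actual 'smart-rr' implementation may differ.

```python
class RoundRobin:
    """Plain round-robin: cycles through members regardless of their state."""

    def __init__(self, members):
        self.members = list(members)
        self._next = 0

    def pick(self):
        member = self.members[self._next % len(self.members)]
        self._next += 1
        return member


class HealthAwareRoundRobin(RoundRobin):
    """Round-robin that skips members reported as unavailable.

    is_available is a hypothetical callable(hostname) -> bool.
    """

    def __init__(self, members, is_available):
        super().__init__(members)
        self.is_available = is_available

    def pick(self):
        # Try each member at most once per call, in round-robin order.
        for _ in range(len(self.members)):
            member = super().pick()
            if self.is_available(member):
                return member
        raise RuntimeError("no pool member available")


pool = [
    "xs-101.cr.cnaf.infn.it",
    "xs-102.cr.cnaf.infn.it",
    "xs-103.cr.cnaf.infn.it",
]
down = {"xs-102.cr.cnaf.infn.it"}  # assumed outage, for illustration only

rr = RoundRobin(pool)
smart = HealthAwareRoundRobin(pool, lambda host: host not in down)

rr_picks = [rr.pick() for _ in range(6)]
smart_picks = [smart.pick() for _ in range(6)]

print(rr_picks)     # still hands out the host that is down
print(smart_picks)  # only hands out hosts reported as available
```

With a plain round-robin, one third of the selections in this example still go to the host that is down; the health-aware variant spreads them over the two remaining hosts.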

      A related question concerns the way in which the pool members are defined (and then used) within StoRM Backend.
      Following Andrea's guidance we have set for storm-lhcb:

      gsiftp_pool_balance_strategy => 'smart-rr',
      gsiftp_pool_members => [
        { 'hostname' => 'xs-101.cr.cnaf.infn.it', },
        { 'hostname' => 'xs-102.cr.cnaf.infn.it', },
        { 'hostname' => 'xs-103.cr.cnaf.infn.it', },
      ],
      webdav_pool_members => [
        { 'hostname' => 'xfer-lhcb.cr.cnaf.infn.it', },
      ],

      Hence, to define the pool of gridftp servers we use the individual hosts, while for webdav_pool_members we use their alias (xfer-lhcb.cr.cnaf.infn.it contains 4 hosts).
      Is such an alias somehow _necessary_ in the configuration for TPCs to work correctly, e.g. with storage-issued tokens? Or could we safely put individual hosts in webdav_pool_members as well?
      And if we put individual hosts in webdav_pool_members, would the 'rr' strategy then be used?
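The practical difference between the two ways of declaring the pool can be sketched as follows. This assumes a plain round-robin over the configured members and is not taken from StoRM Backend; `backend_distribution` is a hypothetical helper, and the `webdav-N.example.org` names are placeholders for the 4 hosts behind the alias, whose real names are not given here.

```python
from collections import Counter
from itertools import cycle, islice


def backend_distribution(pool_members, n_requests):
    """Count how often a plain round-robin returns each configured member."""
    return Counter(islice(cycle(pool_members), n_requests))


# Pool declared through the alias: there is a single member to choose from.
alias_only = backend_distribution(["xfer-lhcb.cr.cnaf.infn.it"], 1000)

# Pool declared through individual hosts (placeholder hostnames).
individual = backend_distribution(
    [f"webdav-{i}.example.org" for i in range(1, 5)], 1000
)

print(alias_only)
print(individual)
```

With only the alias configured, every selection returns the alias, so under this assumption the backend itself performs no balancing and the spread across the 4 hosts is presumably decided elsewhere, for example when the alias is resolved. With individual hosts, the selections are split evenly among them.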

      An important tape data challenge is currently ongoing, and LHCb, which is testing EOS->CNAF_disk->CNAF_tape in SRM+HTTP, reports many transfers failing with 'Timeout waiting for connection from pool'.
      These are mostly TPCs between the storm webdav endpoints themselves, i.e. xfer-lhcb (disk Storage Area) -> xfer-lhcb (tape Storage Area), and they are strongly unbalanced among the 4 available webdav servers.
      Looking at the failed PUT requests from EOS to CNAF_disk, we see the same (low) number of failed requests on each endpoint. Looking at the TPCs CNAF_disk -> CNAF_tape, we see 10k timeout errors on one endpoint and 40 on another.
      Any idea why the behaviour of TPCs should differ from that of the other transfers, given that both kinds of requests are managed by StoRM Backend (srm+http)?

      Thank you very much,
      lucia

            Assignee: Enrico Vianello
            Reporter: Lucia Morganti
            Votes: 0
            Watchers: 5
