Summary of Category 3: HENP Computing Systems and Infrastructure



Ian Fisk and Michael Ernst, CHEP 2003, March 28, 2003

Introduction
We tried to break the week into themes.

- Monday: fabrics and architectures
  - General talks about building and securing large multi-purpose facilities, as well as updates from a number of HENP computing efforts
- Tuesday: emerging hardware and software technology
  - Review of the most recent PASTA report and an update on the commodity disk storage work
  - Software for flexible clusters: MOSIX
  - Advanced storage and data serving: CASTOR, Enstore, dCache, Data Farm, and ROOT I/O
- Thursday: grid and other services
  - Grid interfaces and storage management over the grid
  - Monitoring services

It was a full week with a lot to discuss. Special thanks to all those who presented. There is no way to cover very much of what was presented in a thirty-minute talk.

General Observations
- Grid functionality is coming quickly.
  - The basic underlying concepts of distributed, parasitic, and multi-purpose computing are already being deployed in running experiments.
  - Early implementations of interfaces from grid services to fabrics are appearing.
  - I would expect that by the time the LHC experiments have real data, the tools and techniques will have been well broken in by the experiments running today.
- The shift to commodity equipment has accelerated since the last CHEP.
  - I would argue that the shift is nearly complete.
  - At least two large computing centers admitted to having nothing in their work rooms but Linux systems and a few Suns to debug software.
  - This has resulted in the development of tools to help handle this complicated component environment.
- With notable exceptions, high energy physics computing does not work well together.
  - The individual experiments often have subtly different requirements, which results in completely independent development efforts.



Distributed Computing
- Example from CDF: the Central Analysis Facility is very well used.
- The (very near) future plan is to deploy satellite analysis farms to increase the computing resources.


Distributed Computing
- Peter Elmer presented how the BaBar experiment has been able to take advantage of distributed computing resources for primary event reconstruction.
- By splitting their prompt calibration and event reconstruction, they now take advantage of 5 reconstruction farms at SLAC and 4 in Padova.


Parasitic Computing
- Bill Lee presented the CLuED0 work of the D0 experiment.
  - CLuED0 is a cluster of D0 desktop machines which, along with some custom management software, parasitically provides D0 with 50% of its analysis CPU cycles.
  - It is a heterogeneous system with distributed support.
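This is not the CLuED0 software itself, but a rough illustration of the parasitic idea: only dispatch batch work to a desktop whose owner is not using it. The hostnames, load threshold, and job script below are invented for this sketch.

```python
# Illustrative only: dispatch batch work to desktops that currently look idle.
# Hostnames, the load threshold, and the job script are invented, not CLuED0 values.
import subprocess

IDLE_LOAD = 0.5                                    # hypothetical 1-minute load threshold
DESKTOPS = ["d0desk01", "d0desk02", "d0desk03"]    # made-up hostnames

def is_idle(host):
    """Return True if the host's 1-minute load average is below the threshold."""
    out = subprocess.run(["ssh", host, "cat", "/proc/loadavg"],
                         capture_output=True, text=True, timeout=10)
    if out.returncode != 0:
        return False
    return float(out.stdout.split()[0]) < IDLE_LOAD

def dispatch(host, job_script):
    """Start an analysis job on an idle desktop, niced so the owner always wins."""
    subprocess.Popen(["ssh", host, "nice", "-n", "19", job_script])

for node in DESKTOPS:
    if is_idle(node):
        dispatch(node, "/usr/local/bin/run_analysis.sh")   # hypothetical job script
```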
- The US LHC experiments submitted a proposal on Monday which, among many other topics, discussed the use of economic theories to optimize resource allocations.
  - Techniques of this kind are already used in D0.
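The details of those schemes were not covered here, but the flavor of an economic allocation can be sketched: each group places a "bid" and receives CPU share in proportion to it. The group names and numbers below are invented.

```python
# Illustrative proportional-share allocation; groups and bids are invented.
def allocate(total_cpu_hours, bids):
    """Split a pool of CPU hours in proportion to each group's bid."""
    total_bid = sum(bids.values())
    return {group: total_cpu_hours * bid / total_bid
            for group, bid in bids.items()}

shares = allocate(10000, {"top_group": 5, "higgs_group": 3, "qcd_group": 2})
print(shares)   # {'top_group': 5000.0, 'higgs_group': 3000.0, 'qcd_group': 2000.0}
```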


Multipurpose Computing
- Fundamental to a grid-connected facility is the ability to support multiple experiments at a minimum, and ideally multiple disciplines.
- The people responsible for computing systems have been thinking about how to make this possible, because so many regional computing centers have to support multiple experiments and user communities.
- John Gordon gave an interesting talk on whether it is possible to build a multipurpose center.
  - He identified six categories of problems and discussed possible solutions: software levels, experts, local rules, security, firewalls, and the accelerator centres.

Early Interfacing of Grid Services to Fabrics


- Alex Sim gave a talk on the Storage Resource Manager (SRM).
- SRM functionality:
  - Manage space: negotiate and assign space to users; manage the lifetime of spaces.
  - Manage files on behalf of a user: pin files in storage until they are released; manage the lifetime of files; manage the action taken when pins expire (depends on file type).
  - Manage file sharing: policies on what should reside on a storage resource at any one time; policies on what to evict when space is needed; get files from remote locations when necessary.
  - Purpose: to simplify the client's task.
  - Manage multi-file requests: a brokering function that queues file requests and pre-stages them when possible.
  - Provide grid access to/from mass storage systems: HPSS (LBNL, ORNL, BNL), Enstore (Fermilab), JASMine (JLab), Castor (CERN), MSS (NCAR), ...
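To make the list concrete, a client's interaction with an SRM-like service might look roughly like the sketch below. The class and method names are invented for illustration and do not reflect the actual SRM interface.

```python
# Hypothetical SRM-style client session; names do not reflect the real SRM API.
class StorageResourceManager:
    """Toy stand-in for an SRM endpoint in front of a mass storage system."""
    def reserve_space(self, user, size_gb, lifetime_s): ...
    def bring_online(self, surl, pin_lifetime_s): ...     # stage from tape, pin on disk
    def release(self, surl): ...                          # let the pin expire / allow eviction
    def copy(self, source_surl, dest_surl): ...           # queued, pre-staged multi-file copy

srm = StorageResourceManager()
srm.reserve_space(user="analysis01", size_gb=500, lifetime_s=86400)

files = ["srm://host.example/run123/file%04d.root" % i for i in range(10)]
for surl in files:
    srm.bring_online(surl, pin_lifetime_s=3600)   # pinned until released or expired
# ... run jobs against the pinned replicas ...
for surl in files:
    srm.release(surl)
```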


Early Implementation
- The functionality of SRM is impressive and leads to interesting analysis scenarios.
- Equally interesting is the number of places that are prepared to interface their storage to the WAN using SRM.
  - For example, robust file replication between BNL and LBNL.


Shift to commodity equipment


Benefits and Complications


- The benefit is very substantial computing resources at a reasonable hardware cost.
- The complication is the scale and complexity of the commodity computing cluster.
  - A reasonably big computing cluster today might be 1000 systems, with all the possible hardware problems associated with 1000 systems bought from the lowest bidder.
  - A considerable amount of deployment, integration, and development effort goes into creating tools that allow a shelf or rack of Linux boxes to behave like a computing resource:
    - Configuration tools
    - Monitoring tools
    - Tools for system control
    - Scheduling tools
    - Security techniques
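As one small illustration of the kind of tooling this implies (purely a sketch, not any of the packages presented at the conference), a farm operator might sweep all nodes in parallel and flag the ones that fail a basic health check. The hostnames and Linux ping flags below are assumptions.

```python
# Illustrative parallel health sweep over a large farm; hostnames are invented.
import concurrent.futures
import subprocess

NODES = ["node%04d" % i for i in range(1, 1001)]   # a notional 1000-box cluster

def healthy(host):
    """Ping the node and run a trivial remote command; report its status."""
    ping = subprocess.run(["ping", "-c", "1", "-W", "2", host],
                          stdout=subprocess.DEVNULL)
    if ping.returncode != 0:
        return host, "unreachable"
    ssh = subprocess.run(["ssh", "-o", "ConnectTimeout=5", host, "uptime"],
                         stdout=subprocess.DEVNULL)
    return host, "ok" if ssh.returncode == 0 else "ssh failed"

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    for host, status in pool.map(healthy, NODES):
        if status != "ok":
            print(host, status)
```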


Configuration Tools
- We heard an interesting talk from Thorsten Kleinwort on installing and running systems at CERN.
  - Systems are installed with Kickstart and RPMs.
- CERN and several other centers are deploying the configuration tools from EDG WP4.
  - Pan & CDB (Configuration Data Base) for describing hosts.
  - Pan is a very flexible language for describing host configuration information:
    - Expressed in templates (ASCII)
    - Allows includes (inheritance)
    - Pan is compiled into XML inside CDB; the XML is downloaded, and the information is provided by CCConfig, the high-level API.
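The actual CDB schema and CCConfig API were not shown here; purely as an illustration of the "compile to XML, query by path" idea, a client might fetch a host's compiled profile and look up a configuration key. The element names, paths, and values in this sketch are invented.

```python
# Illustrative lookup in a compiled host profile; the XML layout is invented,
# not the actual CDB/CCConfig format.
import xml.etree.ElementTree as ET

PROFILE = """
<profile host="lxbatch001">
  <entry path="/system/kernel/version">2.4.18</entry>
  <entry path="/software/packages/openssh">3.4p1</entry>
</profile>
"""

def lookup(profile_xml, path):
    """Return the value stored at a configuration path in the compiled profile."""
    root = ET.fromstring(profile_xml)
    for entry in root.findall("entry"):
        if entry.get("path") == path:
            return entry.text
    return None

print(lookup(PROFILE, "/system/kernel/version"))   # -> 2.4.18
```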

- It is complicated even to track what it is you have.
  - We had an interesting presentation from Jens Kreutzkamp from DESY about how they track their IT assets.



Monitoring Tools
- Systems are complicated, consisting of many components; this has led to the development of lots of monitoring tools.
- At one end are very functional, complete, and scalable (though complicated to extend) tools like NGOP, which Tanya Levshina presented.
  - System status page


Monitoring Tools (cont.)


- At the opposite end were examples of extremely lightweight monitoring packages for BaBar, presented by Matthias Wittgen.
  - Monitors CPU and network usage, as well as packets sent to disk and the number of processes.
  - Writes the data to a central server, where it is kept in a flat file.
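Again, not the BaBar package itself, but a minimal sketch of the "sample a few numbers, append to a flat file" approach. The log path and record format are invented, and the /proc reads assume Linux.

```python
# Minimal flat-file monitor in the same spirit; paths and fields are invented,
# not those of the package presented. Assumes a Linux /proc filesystem.
import os
import time

LOGFILE = "/var/log/nodestats.txt"    # hypothetical central flat file (e.g. NFS-mounted)

def sample():
    """Collect timestamp, hostname, 1-minute load average, and process count."""
    load1 = open("/proc/loadavg").read().split()[0]
    nprocs = len([d for d in os.listdir("/proc") if d.isdigit()])
    return time.strftime("%Y-%m-%d %H:%M:%S"), os.uname()[1], load1, nprocs

while True:
    with open(LOGFILE, "a") as f:
        f.write("%s %s load=%s procs=%d\n" % sample())
    time.sleep(60)
```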



Tools for system control


- Andras Horvath presented a technique for secure system control and reset access at a reasonable cost.
  - This solution doesn't scale to 6000 boxes.
  - The system Andras is implementing consists of serial connections for console access and relays attached to the reset switch on the motherboard for resets.
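The actual relay protocol was not described here; as a rough sketch of what driving such a reset relay over a serial line might look like (using pyserial, with an entirely invented port, baud rate, and command bytes):

```python
# Sketch of pulsing a reset relay over a serial line; the port, baud rate, and
# command bytes are invented, not the protocol from the talk. Requires pyserial.
import time
import serial

def reset_node(port="/dev/ttyS0", relay_id=3):
    """Pulse the relay wired to a node's motherboard reset switch."""
    with serial.Serial(port, 9600, timeout=1) as link:
        link.write(bytes([0xA0, relay_id, 0x01]))   # hypothetical "close relay" command
        time.sleep(0.5)                             # hold reset briefly
        link.write(bytes([0xA0, relay_id, 0x00]))   # hypothetical "open relay" command

reset_node()
```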


Security Techniques
- The number of systems in these large commodity clusters makes for interesting security work, doubly so when worrying about grid interfaces.
- The work to secure the BNL facility was presented.
  - Work on prioritizing their assets and forming responses to security breaches.


Field doesn't cooperate well

- This is not necessarily a problem, nor is it a criticism; it is simply an observation.
- One doesn't see a lot of common detector-building projects, so maybe it isn't surprising that there aren't a lot of common computing development efforts.
  - I noticed during the week that there is a lot of duplication of effort, even between experiments that are geographically close.
  - We have forums for exchange like HEPiX and the Large Cluster Workshop meetings; even with these, we don't seem to do much development in common.
- There are notable exceptions.
  - Alan Silverman presented the work to write a guide to building and operating a large cluster.
  - Their noble, if somewhat ambitious, goal is to "produce the definitive guide to building and running a cluster - how to choose, acquire, test and operate the hardware; software installation and upgrade tools; performance mgmt, logging, accounting, alarms, security, etc, etc."

Grid Projects
- The grid projects are another area in which the field is working effectively together.
  - A number of sites indicated the desire to use common tools developed by EDG Work Package 4.
  - Good buy-in from fabric managers about the use of SRM.
  - Software deployment through the VDT.


Conclusions
- It was a long and interesting week; apologies for not being able to summarize everything.
- We had very interesting discussions and presentations yesterday about how to interface the fabrics and the grid services.
- I also didn't get a chance to cover some of the hardware and software R&D results.
- I encourage people to look at the web page; almost all of the talks were posted.

