
HEPIX Spring 2011 – Day 5

You can tell what day it is by all the suitcases around the room.

Version Control

An overview of the version control used at CERN. Quite cool; they're not using Git yet, but they are moving away from CVS, which is no longer maintained, to SVN (Subversion). Apparently the migration is hard.

They use DNS load balancing

  • Browse code / logging, revisions, branches: WEBSVN – on the fly tar creation.
  • TRAC – web SVN browsing tool plus: ticketing system, wiki, plug-ins.
  • SVNPlot – generates SVN stats. No need to check out the source code (svnstats does a ‘co’).
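As a tiny illustration of the no-checkout approach, here is a hedged Python sketch that computes per-author commit counts from `svn log --xml` output (the sample XML and repository are made up; SVNPlot itself does much more):

```python
from collections import Counter
import xml.etree.ElementTree as ET

def commit_counts(svn_log_xml: str) -> Counter:
    """Count commits per author from `svn log --xml` output."""
    root = ET.fromstring(svn_log_xml)
    return Counter(
        entry.findtext("author", default="(unknown)")
        for entry in root.iter("logentry")
    )

# In real use you would feed it live log output, e.g.:
#   svn log --xml https://svn.example.org/repo > log.xml
SAMPLE = """<log>
  <logentry revision="2"><author>alice</author></logentry>
  <logentry revision="1"><author>bob</author></logentry>
  <logentry revision="3"><author>alice</author></logentry>
</log>"""

print(commit_counts(SAMPLE))  # Counter({'alice': 2, 'bob': 1})
```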

Mercurial was also suggested alongside Git (which was created by Linus Torvalds).

CernVM-FS

CernVM-FS (CVMFS) looked very promising. At the moment it is not intended for distributing images but rather for sending applications around. It uses the Squid proxy server and looked really excellent. It gives you a mount point like /cvmfs/, and under there you have the software.

http://twitter.com/cvmfs

Requirements needed to set it up:

  • RPMs: cvmfs, -init-scripts, -keys, -auto-setup (for tier-3 sites; does some system configs), fuse, fuse-libs, autofs
  • squid cache – you need to have one, ideally two or more for resilience, configured (at least) to accept traffic from your site to one or more CVMFS repository servers. You could reuse existing frontier-squids.
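For illustration, a minimal client-side configuration sketch might look like this (the repository names, proxy hosts and cache limit are all made-up example values):

```shell
# /etc/cvmfs/default.local  (illustrative values only)
CVMFS_REPOSITORIES=atlas.cern.ch,cms.cern.ch
# "|" joins proxies into one load-balanced group
CVMFS_HTTP_PROXY="http://squid1.example.org:3128|http://squid2.example.org:3128"
# local disk cache limit in MB
CVMFS_QUOTA_LIMIT=10000
```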

 

National Grid Service Cloud

A British cloud.

Good for teaching with VMs – if a machine gets messed up it can simply be reinstalled.

Scalability – ‘cloudbursting‘ – users make use of their local systems/clusters – until they are full – and then if they need to they can do extra work in the cloud. Scalability/cloudbursting is the key feature that users are looking for.

Easy way to test an application on a number of operating systems/platforms.

Two cases were not suitable, one being intensive workloads with a lot of number crunching.

Good: you don't have to worry about physical assembly or housing. The servers and networking still have to be installed, but usually that is done by somebody else. Images are key to making this easier.

Bad: Eucalyptus stability was not so good. Bottlenecks: networking is important, and more is demanded of the whole physical server when it's running VMs.

To put a 5GB VM on a machine you would need 10GB: 5 for the image and 5 for the actual machine.
Some were intending to develop images locally on this cloud and then move them on to Amazon.
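That disk arithmetic can be written down as a trivial helper (the image-plus-one-copy-per-instance rule is just the rule of thumb quoted above, not a general law):

```python
def host_disk_needed_gb(image_gb: float, instances: int = 1) -> float:
    """Rule of thumb from the talk: the host stores the base image
    plus a full-size copy per running instance."""
    return image_gb + instances * image_gb

print(host_disk_needed_gb(5))  # 10 (5 for the image, 5 for the machine)
```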

Previous Days:
Day 4
Day 3
Day 2
Day 1

HEPIX Spring 2011 – Day 4

Dinner on the 3rd night was amazing. It was at the hotel Weisse Schwan in Arheilgen outside Darmstadt and it was a nice reception hall with big round tables, waiters with lots of wine and great buffet food. A+

Cloudy day!

Or – Infrastructure as a Service – IaaS

A few had the standpoint that the HEP community is not ready for cloud: it's not secure enough, and we already have something that works. But maybe a mixed period would work. At least for now it's quite awesome for non-I/O-intensive applications.

There were talks about virtual images and how to (securely) transfer them between sites. There are several options for this, such as StratusLab's cloud distribution of images and Cloud Scheduler.

One great use case for running compute nodes in the cloud is when the local cluster is maxed out – then you can spin up some more VMs in the cloud to help speed up the run. Cloud Scheduler, for example, keeps a VM running as long as jobs that require that kind of VM are in the queue. It's also great for testing – it's quite easy to set up several VMs with different operating systems/platforms and then run tests on them. See cloudscheduler.org
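The cloudbursting idea can be sketched in a few lines of hypothetical scheduler logic – fill local slots first, overflow to the cloud. (Purely illustrative: real systems like Cloud Scheduler also track images, VM lifetimes and job requirements.)

```python
def place_jobs(jobs, local_free_slots):
    """Naive cloudbursting: local slots first, overflow goes to cloud VMs."""
    local, cloud = [], []
    for job in jobs:
        if len(local) < local_free_slots:
            local.append(job)
        else:
            cloud.append(job)
    return local, cloud

local, cloud = place_jobs(["j1", "j2", "j3", "j4", "j5"], local_free_slots=3)
print(local)  # ['j1', 'j2', 'j3']
print(cloud)  # ['j4', 'j5']
```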

Infrastructure as Code – IaC – see Opscode and Chef, a pretty interesting-looking configuration management system.

Terms:
fairshare
json
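Chef expresses node state as JSON, which is presumably why the term came up; a hypothetical node-attribute document (recipe names and attributes invented for illustration) could look like:

```json
{
  "run_list": ["recipe[ntp]", "recipe[batch-node]"],
  "batch": { "fairshare_weight": 1.5, "max_jobs": 64 }
}
```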

Oracle

Maybe the most interesting presentations came at the end of the day – and the debate that followed was perhaps the highlight: the presentations on Oracle Linux and Oracle Open Source.

Before the presentation they had a nice slide stating that they don't make any promises based on the presentation. That presentation is not available, but the other one – the one about Oracle and Open Source – is.

Oracle Linux (OL) looks pretty good; it's free to download, but if you want any updates you need to pay them. They have an upgrade tool, so if you're on RHEL6 you can apparently switch easily (it changes some yum repos). A lot of advertisement – but it was a presentation about the distribution. It's based on RHEL: they take the updates from RHEL, then add their own magic. They have a boot setup so that, if you want to, you can boot OL in Red Hat compatibility mode. Apparently Oracle wants to put Red Hat out of business (after which they were asked: “Where will you get the kernel then?”). x86-64 only.

On the horizon:  

  • btrfs (a filesystem with error detection, CoW, snapshots and SSD optimization; small files are stored in the metadata)
  • vswitch (a full network switch: set up virtual networks in the OS, with ACLs, VLANs, QoS and flow monitoring via OpenFlow)
  • Zcache (keeps more pages of the filesystem page cache in main memory using LZO compression – more cache and thus fewer I/O operations, since it's a lot faster to compress/uncompress than to access disk)
  • storage connect
  • Linux containers (resource management, like jails on BSD or zones on Solaris; own apps/libs/root, runs on top of the kernel, not virtualization).

From the discussion:

Pidgin – some users wanted video support; the Pidgin developers said no way. This is how Oracle will run its open source projects, like MySQL and Lustre.

“If you don’t like how the project is going – fork.” – Gilles Gravier.

Two reasons to fork: proactively (because you're worried), or because you're unhappy with how the project is going (or not going).

People in the audience are afraid that a lot of times a company acquires an open source project and then closes it down.

“When you acquire a company, you acquire its projects. You have two options if you don't want a project: drop it or kill it. Killing it does not work for open source.” – Gilles Gravier.

OpenOffice is not dropped yet. There are lots of other options: fork and continue closed source (like Grid Engine); drop it and stop working on it; drop it and “talk to the community”.

No info about Lustre – when asked about it Oracle did not want to comment. Asked to e-mail gilles.gravier@oracle.com for more information.

Will Oracle port debconf to Oracle Linux? Oracle will take a look.

A lot of angst against Oracle surfaced, but Oracle handled it quite well and had good answers.

From one of the Oracle speakers: “Allow me to be a bit provocative: if Oracle’s prices were lower, would you consider buying an Oracle product?”

“It takes 25 years to make a good reputation, 5 minutes to lose it.” – CERN employee.
“SUN used to make hardware and give away software for free; Oracle is .. the other way around.” – Lenz Grimmer
“Laughter” – Audience.

European Open File System SCE

  • http://www.eofs.org
  • one repository of Lustre
  • hpcfs.org is another Lustre open source effort – it will merge with opensfs.org. Both are American.
  • They work closely with eofs.org – the two above have agreed on a set of improvements.
  • Lustre 2.1 will be released by Whamcloud in summer 2011.
  • LUG – Lustre User Group – reports and interviews at http://insidehpc.com

 

Next Day:
Day 5

Previous Days:
Day 3
Day 2
Day 1

HEPIX Spring 2011 – Day 3

Day 3 woop!

An evaluation of Gluster: it uses distributed metadata, so there's no bottleneck from a metadata server, and it can or will do some replication/snapshots.

Virtualization of mass storage (tapes), using IBM's TSM (Tivoli Storage Manager) and ERMM. ERMM manages the libraries, so that TSM only sees the link to ERMM – no need to set up specific paths from each agent to each tape drive in each library.
They were also using Oracle/Sun's T10000C tape drives, which go all the way up to 5TB – quite far ahead of the LTO consortium's LTO-5, which only goes to 1.5/3TB per tape. Some talk about buffered tape marks, which speed up tape operations significantly.

Lustre success story at GSI: they have 105 servers that provide 1.2PB of storage, and the maximum throughput seen is 160Gb/s. Some problems with the

Adaptec 5401 – it takes longer to boot than the entire Linux system, and it's not very nice to administrate. The controller complains about high temperatures and about missing fans of non-existent enclosures. Filter out e-mails with level “ERROR” and look at the ones with “WARNING” instead.

Benchmarking storage with trace/replay: use strace (which comes by default with most Unixes) to record some operations, then ioreplay to replay them. Proven to give very similar workloads. Especially great when you have special applications.
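As a small taste of the trace side of that workflow, here is a hedged Python sketch that summarizes strace output by syscall (the sample lines are invented; a real recording would be far longer and would then be fed to ioreplay):

```python
from collections import Counter
import re

# Matches the syscall name at the start of a line of strace output,
# e.g. 'open("/etc/hosts", O_RDONLY) = 3'
SYSCALL_RE = re.compile(r"^(\w+)\(")

def syscall_histogram(strace_lines):
    """Count how often each syscall appears in recorded strace output."""
    counts = Counter()
    for line in strace_lines:
        m = SYSCALL_RE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts

sample = [
    'open("/data/f1", O_RDONLY) = 3',
    'read(3, "...", 4096) = 4096',
    'read(3, "...", 4096) = 1024',
    'close(3) = 0',
]
print(syscall_histogram(sample))  # Counter({'read': 2, 'open': 1, 'close': 1})
```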

IPv6 – we're running out of IPv4 addresses. When, and will, there be sites that are IPv6? Maybe if a new one comes up? What to do? Maybe collect/share IPv4 addresses?

Presentations about the evolution needed at two data centers to accommodate requirements for more resources/computing power.

Implementing ITIL with Service-Now (SNOW) at CERN.

Scientific Linux presentation. A Live CD can be found at www.livecd.ethz.ch. They might port NFS 4.1, which comes with Linux kernel 2.6.38, to SL5. There aren't many differences between RHEL and SL, but SL includes a tool called Revisor, which can be used to create your own Linux distributions/CDs quite easily.

 

“Errata” is the term to know here – it means security fixes.

Dinner later today!

 

Next Days:
Day 5
Day 4

Previous Days:
Day 2
Day 1

HEPIX Spring 2011 – Day 2

Guten Abend!

Darmstadt is a very beautiful city. It’s quite old and there are lots of parks and eh, cool, houses.

A person from the UK said yesterday (in the pub Ratkeller) something like this: “A particle physicist’s raison d’être is to find complexities, they wouldn’t turn away from one if their life depended on it. These are the people we provide IT for.”

So no wonder that their IT systems/infrastructure is a little bit complex too!

Today’s topics are: Site Reports, IT Infrastructure (Drupal, Indico, Invenio, Fair 3D cube) and Computing (OpenMP, CMS and batch nodes).

Site reports

Some of these institutions have a synchrotron, a cyclic particle accelerator – it looks quite cool in the pictures. Some use cfengine for managing the clusters – they want to avoid logging on to each node to do configuration, and instead do it from a tool. One such tool that is quite common (Puppet) can also be used for desktops.

Not many use HP storage; DDN is quite common, and Nexsan and BlueArc also came up.

One site had big problems with their Dell servers – caused by misapplied cooling paste on the CPUs – Dell replaced 90% of the heatsinks and fixed this.


One site also had disk failures during high load. They ran the HS06 (HEP-SPEC06) benchmark, and while it was running, disks dropped off. The failures were traced to anomalously high cooling-fan vibration: after replacing all other components, moving the fans to another machine made the error follow them.

IT Infrastructure

CERN is working on moving its web sites to Drupal. They are investigating Varnish (good for DDoS protection, caching and load balancing). Drupal is hard to learn.

Then there were some sessions about programming – CMS 64-bit and OpenMP.
One thought here: is it possible to discern the properties of an Intel/AMD CPU from its name, like E5530? Maybe this link on intel.com can be of some assistance.
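As a rough illustration for the Intel side, here is a best-effort decoder for Nehalem-era Xeon model numbers. The letter/digit rules below are my own reading of Intel's naming scheme at the time (letter = power/performance band, first digit = series), not anything official:

```python
# Heuristic tables -- assumed, not an official Intel specification.
SEGMENT = {"X": "performance", "E": "mainstream", "L": "low power"}
SERIES = {"3": "uniprocessor (3000 series)",
          "5": "dual-socket (5000 series)",
          "7": "multi-socket (7000 series)"}

def decode_xeon(model: str) -> dict:
    """Best-effort decoding of a model number such as 'E5530'."""
    letter, digits = model[0].upper(), model[1:]
    return {
        "model": model,
        "segment": SEGMENT.get(letter, "unknown"),
        "series": SERIES.get(digits[:1], "unknown"),
    }

print(decode_xeon("E5530"))
# {'model': 'E5530', 'segment': 'mainstream', 'series': 'dual-socket (5000 series)'}
```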

Fair 3D Tier-0 Green-IT Cube

Quite a cool concept (patented) that they will very soon start building here in Germany.
It uses water vaporization with outside air (plus fans in summer) to cool the air, and water-based heat exchangers in each rack: pressure built up by fans (so the racks need to be quite airtight) pushes warm air from the back of the servers through the heat exchanger, which cools it and then pushes it across the aisle to the next row of racks. They managed to get down to a PUE of 1.062 at best.

Next Days:
Day 5
Day 4
Day 3

Previous Day:
Day 1

 

HEPIX Spring 2011 – Day 1

Morning.
Got in last night at around 2140 local time.
I should've done a little more exact research on how to find my hotel. Had to walk some 30 minutes (parts of it the wrong way) to get there. But at least I made it in time to see some ice hockey... too bad Detroit lost.

Today’s another day though!

First stop: breakfast.

Wow. What a day, and it’s not over yet! So much cool stuff talked about.

Site Reports

The first half of the day was site reports from various places.

GSI here in Darmstadt (which is where some of the heaviest elements have been discovered). They have started an initiative to keep Lustre alive – as apparently Oracle is only going to develop it for their own services and hardware. They are running some SuperMicro (SM) servers that have InfiniBand on board – unlike the HP ones I've seen, which take the Mellanox card as an additional mezzanine card. Nice. They were also running some really cool water-cooled racks that use pressure in some way to push the hot air out of the racks. They found that their SM file servers had much stronger fans at the back and non-optimized airflow inside, so they had to tape over some holes over the PCI slots on the back of the servers to make it work properly for them. They were also running the servers at around 30C – altogether they got a PUE of around 1.1, which is quite impressive.

Other reports: Fermilab (lots of storage – their Enstore, for example, has 26PB of data on tape), KIT, Nikhef (moved to ManageEngine for patch and OS deployment, and to Brocade for IP routers), CERN (lots of hard drives had to be replaced.. around 7000.. which vendor? HP, Dell, SM?), DESY (replaced Cisco routers with Juniper for better performance), RAL (problems with LSI controllers, replaced with Adaptec), SLAC (FUDforum for communication).

 

Rest of the day was about:

Messaging

Some talk about messaging – for signing and encrypting messages. It could be used for sending commands to servers, but also for other things. I've seen ActiveMQ in eyeOS, and it's used elsewhere as well. Sounds quite nice, but apparently not many use it; instead they use SSH scripts to run things like that.

Security

About various threats that have been in the news lately, plus a presentation of some rootkits and a nice demo of a TTY hack. Basically, the last one goes like this: a client/Linux computer has been hacked; from this computer, a person with access to a server SSHs there. The TTY hack then kicks in and gives the hacker access to the remote host. Not easy to defend against.

There was also a lengthier presentation (the longest of the day, 1–1.5h) from a French site that went through how they replaced their home-grown batch management system with SGE (now Oracle Grid Engine).

*** Updated the post with links to some of the things. Maybe the TTY hack has another name that’s more public.

Next Days:

Day 5
Day 4
Day 3
Day 2

HEPIX Spring 2011

I’m heading to Hepix this whole week!

Looks like there’s some really interesting topics like:

Lustre, Gluster, IPv6, stuff about the CERN IT facilities, a Scientific Linux report, cloud/grid virtualization, Oracle Linux.

I’ll sure be doing a bit of blogging about what’s going down.