
Lustre 2.5 + CentOS6 Test in OpenStack

Reason: testing Lustre 2.5 on a clean CentOS 6.5 install in OpenStack.

Three VMs: two servers (one MDS, one OSS) and one client, all running CentOS 6.5. An open internal Ethernet network for the Lustre traffic (don’t forget firewalls). Yum updated to the latest kernel. Two volumes presented to the lustreserver and lustreoss VMs for the MDT and OST; both show up as /dev/vdc. Hostnames set. /etc/hosts updated with three IPs: lustreserver, lustreoss and lustreclient.
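For example, /etc/hosts on all three nodes could look something like this (the addresses are made up; use your internal lustre network’s):

192.168.100.10  lustreserver
192.168.100.11  lustreoss
192.168.100.12  lustreclient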

With 2.6.32-431.17.1.el6.x86_64 there are currently some issues building the server components. One needs to use the latest branch for 2.5, so follow the instructions at https://wiki.hpdd.intel.com/pages/viewpage.action?pageId=8126821

Server side

MDT/OST: Install e2fsprogs, and reboot after yum update (to run the latest kernel).

yum localinstall all files from: http://downloads.whamcloud.com/public/e2fsprogs/1.42.9.wc1/el6/RPMS/x86_64/
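That can be done like this, for instance (a sketch assuming wget is available; it fetches every RPM in that directory and installs them in one go):

mkdir /root/e2fsprogs && cd /root/e2fsprogs
wget -r -np -nd -A '*.rpm' http://downloads.whamcloud.com/public/e2fsprogs/1.42.9.wc1/el6/RPMS/x86_64/
yum -y localinstall *.rpm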

Next is to rebuild the lustre kernel packages to work with the kernel you are running and the one you have installed for the next boot: https://wiki.hpdd.intel.com/display/PUB/Rebuilding+the+Lustre-client+rpms+for+a+new+kernel

SRPMS are here: http://downloads.whamcloud.com/public/lustre/latest-feature-release/el6/server/SRPMS/

For rebuilding these are also needed:

yum -y install kernel-devel* kernel-debug* rpm-build make libselinux-devel gcc

basically (a rough sketch of the commands follows the list):

  • git clone -b b2_5 git://git.whamcloud.com/fs/lustre-release.git
  • run autogen.sh
  • install the kernel src.rpm from Red Hat (puts the tar.gz in /root/rpmbuild/SOURCES/)
  • if rpmbuilding as user build, then copy the files from /root/rpmbuild into /home/build/rpmbuild
  • rebuilding the kernel requires quite a bit of hard disk space; as I only had 10G for /, I made $HOME/kernel and $HOME/lustre-release symlinks to a volume with more room
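Roughly like this (a sketch, not a verbatim transcript; the --with-linux path is an assumption and depends on where you unpacked and patched the Red Hat kernel source):

git clone -b b2_5 git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release
sh autogen.sh
# point configure at the prepared kernel source tree:
./configure --with-linux=$HOME/kernel/rpmbuild/BUILD/kernel-2.6.32-431.17.1.el6/linux-2.6.32-431.17.1.el6.x86_64
make rpms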

yum -y install expect, then install the new kernel with lustre patches plus the lustre and lustre-modules packages.

Not important?: WARNING: /lib/modules/2.6.32-431.17.1.el6.x86_64/weak-updates/kernel/fs/lustre/fsfilt_ldiskfs.ko needs unknown symbol ldiskfs_free_blocks

/sbin/new-kernel-pkg --package kernel --mkinitrd --dracut --depmod --install 2.6.32-431.17.1.el6_lustre

chkconfig lustre on

edit /etc/modprobe.d/lustre.conf and add the lnet parameters
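For example (tcp0 on the internal interface; eth1 here is an assumption, use whichever NIC carries your lustre network):

options lnet networks=tcp0(eth1)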

modprobe lnet
lctl network up
lctl list_nids

creating MDT: mkfs.lustre --mdt --mgs --index=0 --fsname=wrk /dev/vdc1
mounting MDT: mkdir /mnt/MDT; mount.lustre /dev/vdc1 /mnt/MDT

creating OST: mkfs.lustre --ost --index=0 --fsname=wrk --mgsnode=lustreserver /dev/vdc1
mounting OST: mkdir /mnt/OST1; mount -t lustre /dev/vdc1 /mnt/OST1

Client Side

rpmbuild --rebuild --without servers lustre-*.src.rpm

cd /root/rpmbuild/RPMS/x86_64
rpm -Uvh lustre-client*

add /etc/modprobe.d/lustre.conf (same lnet options as on the servers)
modprobe lnet
lctl network up
lctl list_nids

mkdir /wrk; mount.lustre lustreserver@tcp:/wrk /wrk

lfs df!

Setup a 3 Node Lustre Filesystem

Introduction

Lustre is a filesystem often used by clusters because many computers can mount the filesystem simultaneously.

This is a small log/instruction for how to setup Lustre in 3 virtualized machines (one metadata server, one object storage server and one client).

Basic components:

VMWare Workstation
3 x CentOS 6.3 VMs.
Latest Lustre from Whamcloud

To use Lustre your kernel needs to support it. There’s a special kernel for the server and one for the client. Some packages are needed on both.

Besides lustre you’ll need an updated version of e2fsprogs as well (the version that comes with RHEL6.3 does not support large partitions).

Starting with the MDS. When the basic OS setup is done we will make a copy of it to use for the OSS and the client.

Setup basic services.

Install an MDS

This will run the MDT – the metadata target.

2GB RAM, 10GB disk, bridged networking; 500MB for /boot, the rest for / (watch out, the installer may create a really large swap). Minimal install. Set up OS networking (static IPs for the servers, start on boot, open port 988 in the firewall, possibly some outgoing ports too if you decide to restrain those), run yum update and set up ntp. Download the latest lustre-client, lustre-server and e2fsprogs (x86_64) to /root/lustre-client, /root/lustre-server and /root/e2fsprogs respectively. Lustre also does not support SELinux, so disable that (it works fine in enforcing until it’s time to create the MDS/MDT, and in permissive until it’s time to mount).
Put all hostnames into /etc/hosts.
Power off and make two full clones.
Set the hostname on each clone.

Install an OSS

This will contain the OST (object storage target). This is where the data will be stored.

Networking may not work on the clone (the device name may have changed to eth1 or eth2).
You may want to change this afterwards to get the interface back to being called eth0 – there’s a blog post about doing that.
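On CentOS 6 the usual fix looks something like this (a sketch; it assumes udev pinned the old VM’s MAC address):

rm /etc/udev/rules.d/70-persistent-net.rules
# update HWADDR in /etc/sysconfig/network-scripts/ifcfg-eth0 to the clone's MAC
reboot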

Install a client

This will access and use the filesystem.

Clone the OSS before installing any lustre services or kernels.

Install Lustre

Before you do this it may be wise to take a snapshot of each server. In case you screw the VM up you can then go back :)

Starting with the MDS.

Installing e2fsprogs, kernel and lustre-modules.

Skipping debuginfo and devel packages, installing all the rest.

yum localinstall \
kernel-2.6.32-220.4.2.el6_lustre.x86_64.rpm \
kernel-firmware-2.6.32-220.4.2.el6_lustre.x86_64.rpm \
kernel-headers-2.6.32-220.4.2.el6_lustre.x86_64.rpm \
lustre-2.2.0-2.6.32_220.4.2.el6_lustre.x86_64.x86_64.rpm \
lustre-ldiskfs-3.3.0-2.6.32_220.4.2.el6_lustre.x86_64.x86_64.rpm \
lustre-modules-2.2.0-2.6.32_220.4.2.el6_lustre.x86_64.x86_64.rpm

The above is not the order they were actually installed in; yum changed the order so that, for example, kernel-headers came last.

yum localinstall e2fsprogs-1.42.3.wc3-7.el6.x86_64.rpm \
e2fsprogs-debuginfo-1.42.3.wc3-7.el6.x86_64.rpm \
e2fsprogs-devel-1.42.3.wc3-7.el6.x86_64.rpm \
e2fsprogs-libs-1.42.3.wc3-7.el6.x86_64.rpm \
e2fsprogs-static-1.42.3.wc3-7.el6.x86_64.rpm \
libcom_err-1.42.3.wc3-7.el6.x86_64.rpm \
libcom_err-devel-1.42.3.wc3-7.el6.x86_64.rpm \
libss-1.42.3.wc3-7.el6.x86_64.rpm \
libss-devel-1.42.3.wc3-7.el6.x86_64.rpm

After boot, confirm that you have lustre kernel installed by typing:

uname -av

and

mkfs.lustre --help

to see if you have that and

rpm -qa 'e2fs*'

to see if that was installed properly too.

By the way, you probably want to run this to exclude automatic yum kernel updates:

echo "exclude=kernel*" >> /etc/yum.conf

After the install and a reboot into the new kernel it’s time to modprobe lustre, start creating the MDT and OST, and then mount things!
But hold your horses, first we need to install the client :)

 

And then the Client

Install the e2fsprogs*

We cannot just install the lustre-client packages, because we run a different kernel than the one whamcloud compiled the lustre-client against.

We can either back-pedal and install an older kernel, or we can build (from source/SRPMS) a lustre-client that works on a kernel of our choosing. The latter option seems like a better way, because we can then upgrade the kernel if we want to.

 

Build custom lustre-client rpms

Because of a bug it appears that some ext4 source packages are needed – while in fact they are not; you just need to add some parameters to ./configure. This will be the topic of a future post.

The above rpmbuild should create rpms for the running kernel. There is also supposed to be a way to run it against a non-running kernel if you want rpms for that.
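The rebuild itself boils down to something like this (a sketch; the SRPM name is an assumption, use whatever version you downloaded from whamcloud):

yum -y install rpm-build gcc make kernel-devel
rpmbuild --rebuild --without servers lustre-2.2.0-*.src.rpm
cd ~/rpmbuild/RPMS/x86_64 && rpm -Uvh lustre-client*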

Configure Lustre

Whamcloud have good instructions. Don’t be afraid to check out their wiki or use google.

/var/log/messages is the place to look for more detailed errors.

On the MDS

Because we do not have infiniband, you want to change the lnet parameters slightly to include tcp(eth0). Just editing a file under /etc/modprobe.d/ called, for example, lustre.conf is not enough – the change is not picked up until the modules are reloaded (or, quite possibly, until reboot).
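Something like this (a sketch; lustre_rmmod assumes no lustre filesystem is mounted yet):

echo 'options lnet networks=tcp0(eth0)' > /etc/modprobe.d/lustre.conf
lustre_rmmod         # unload any already-loaded lustre/lnet modules
modprobe lnet
modprobe lustre
lctl network up
lctl list_nids       # should now print something like 192.168.0.10@tcp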

Added a 5GB disk to the mds.

fdisk -cu /dev/sdb; n, p, 1, (first-last)

modprobe lnet
modprobe lustre

mkfs.lustre --mdt --mgs

mount
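Concretely, something along these lines (a sketch; the fsname “testfs” is made up, and /dev/sdb1 is the partition created above):

mkfs.lustre --fsname=testfs --mgs --mdt /dev/sdb1
mkdir -p /mnt/MDS
mount -t lustre /dev/sdb1 /mnt/MDS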

On the OSS

Also add the parameters into modprobe.

mkfs.lustre --ost

mount
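For example (again a sketch; 192.168.0.10 stands in for the MDS/MGS IP, and “testfs” is the fsname assumed above):

mkfs.lustre --fsname=testfs --ost --mgsnode=192.168.0.10@tcp0 /dev/sdb1
mkdir -p /mnt/OST0
mount -t lustre /dev/sdb1 /mnt/OST0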

On the client

Add things into modprobe.

mount!

Write something.

Then hit: lfs df -h

To see usage!
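On the client that whole sequence boils down to something like (same assumed MGS IP and fsname as above):

mkdir -p /mnt/testfs
mount -t lustre 192.168.0.10@tcp0:/testfs /mnt/testfs
echo hello > /mnt/testfs/hello.txt
lfs df -h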

 

Get it all working on boot

You want to start the MDS, then the OSS and last the client.
But while it’s running you can restart any node and eventually it will start working again.

Fstab on the client:
ip@tcp:/fsname /mnt lustre defaults,_netdev 0 0

Fstab on the OSS and MDS:
/dev/sdb1 /mnt/MDS lustre defaults,_netdev 0 0


HEPIX Spring 2011 – Day 4

Dinner on the 3rd night was amazing. It was at the hotel Weisse Schwan in Arheilgen outside Darmstadt and it was a nice reception hall with big round tables, waiters with lots of wine and great buffet food. A+

Cloudy day!

Or – Infrastructure as a Service – IaaS

A few took the standpoint that the HEP community is not ready for cloud – it’s not secure enough, and we already have something that’s working. But maybe a mixed period would work. At least for now it’s quite awesome for non-I/O-intensive applications.

There were talks about virtual images and how to (securely) transfer them between sites. There are several options for this: StratusLab’s cloud distribution of images, and Cloud Scheduler.

One great use case for running computing nodes in the cloud at the moment is when the cluster is maxed out – then you can spin up some more VMs in the cloud to help speed up the run. Or, when running jobs, it keeps a VM alive as long as jobs that require that kind of VM are in the queue. Or for testing – it’s quite easy to set up several VMs with different operating systems/platforms and then run testing on them. See cloudscheduler.org

Infrastructure as Code – IaC – see Opscode and Chef. A pretty interesting-looking configuration management system.

Terms:
fairshare
json

Oracle

Maybe the most interesting presentations came at the end of the day – and the debate that followed was maybe even more so: the presentations about Oracle Linux and Oracle Open Source.

Before the presentation they had a nice slide stating that they don’t make any promises based on the presentation. That presentation is not available, but the other one is – the one about Oracle and Open Source.

Oracle Linux (OL) looks pretty good; it’s free to download, but if you want any updates you need to pay them. They have an upgrade path, so if you’re on RHEL6 you can apparently switch easily (it changes some yum repos). A lot of advertisement – but it was a presentation about the distribution. It’s based on RHEL: they take the updates from RHEL, then add their own magic to them. They have a boot setup so that if you want to you can boot OL in Red Hat Compatibility mode. Apparently Oracle wants to put Red Hat out of business (after which they were asked: “Where will you get the kernel then?”). x86-64 only.

On the horizon:  

  • btrfs (fs that supports error detection, CoW, snapshots, SSD optimization; small files are put in metadata)
  • vswitch (full network switch; set up virtual networks in the OS, ACL, VLAN, QoS, flow monitoring with OpenFlow)
  • Zcache (keeps more pages of the fs page cache longer in main memory; more cache using LZO compression and thus fewer I/O operations – it’s a lot faster to compress/uncompress than to access disk)
  • storage connect
  • linux containers (resource management; like jails on BSD or zones on Solaris; own apps/libs/root, runs on top of the kernel, not a virtualization)

From the discussion:


Pidgin – some wanted video support. Pidgin said: no way. This is how Oracle will run their open source projects like MySQL and Lustre.

“If you don’t like how the project is going – fork.” – Gilles Gravier.

Two reasons to fork: proactively (because you are worried), or because you are unhappy with how the project is going (or not going).

People in the audience are afraid that a lot of times a company acquires an open source project and then closes it down.

“When you acquire a company and its projects, you have two options if you don’t want a project: drop it or kill it. Killing it does not work for open source.” – Gilles Gravier.

OpenOffice is not dropped yet. There are lots of other options: fork it and work on it closed-source (like Grid Engine), drop it and stop working on it, or drop it and “talk to the community”.

No info about Lustre – when asked about it, Oracle did not want to comment; we were asked to e-mail gilles.gravier@oracle.com for more information.

Will Oracle port debconf to Oracle Linux? Oracle will take a look.

There was a lot of angst against Oracle that surfaced, but Oracle handled it quite well and had good answers.

From one of the Oracles: “Allow me to be a bit provocative: If Oracle’s prices were lower; would you consider buying an Oracle product?”

“It takes 25 years to make a good reputation, 5 minutes to lose it.” – CERN employee.
“SUN used to make hardware and give away software for free; Oracle is .. the other way around.” – Lenz Grimmer
“Laughter” – Audience.

European Open File System SCE

  • http://www.eofs.org
  • one repository of lustre
  • hpcfs.org is another lustre open source organization – it will merge with opensfs.org. Both are American.
  • They work closely with eofs.org – the two above have agreed on a set of improvements.
  • Lustre 2.1 will be released by Whamcloud in summer 2011.
  • LUG – Lustre User Group – reports and interviews at http://insidehpc.com

 

Next Day:
Day 5

Previous Days:
Day 3
Day 2
Day 1

HEPIX Spring 2011 – Day 3

Day 3 woop!

An evaluation of gluster: it uses distributed metadata, so there is no bottleneck like the one that comes with a metadata server; it can or will do some replication/snapshotting.

Virtualization of mass storage (tapes), using IBM’s TSM (Tivoli Storage Manager) and ERMM, where ERMM manages the libraries so that TSM only sees the link to ERMM. No need to set up specific paths from each agent to each tape drive in each library.
They were also using Oracle/SUN’s T10000C tape drives that go all the way up to 5TB – quite far ahead of the LTO consortium’s LTO-5, which only goes to 1.5/3TB per tape. There was some talk about buffered tape marks, which speed up tape operations significantly.

Lustre success story at GSI. They have 105 servers that provide 1.2PB of storage, and the max throughput seen is 160Gb/s. Some problems with the Adaptec 5401: it takes longer to boot than the entire Linux system, it’s not very nice to administrate, and the controller complains about high temps – and about missing fans of non-existing enclosures. Filter out the e-mails with level “ERROR” and look at the ones with “WARNING” instead.

Benchmarking storage with trace/replay: using strace (comes by default with most Unixes) to record some operations and then ioreplay to replay them. Proven to give very similar workloads. Especially great when you have special applications.

IPv6 – we are running out of IPv4 addresses; when (and will) there be sites that are IPv6? Maybe if a new one comes up? What to do? Maybe collect/share IPv4 addresses?

Presentations about the evolution needed at two data centers to accommodate the requirements of more resources/computing power.

Implementing ITIL with Service-Now (SNOW) at CERN.

Scientific Linux presentation. The Live CD can be found at www.livecd.ethz.ch. They might port NFS 4.1, which comes with Linux kernel 2.6.38, to work with SL5. There aren’t many differences between RHEL and SL, but SL has a tool called Revisor, which can be used to create your own linux distributions/CDs quite easily.

 

Errata is a term to know – it means security fixes.

Dinner later today!

 

Next Days:
Day 5
Day 4

Previous Days:
Day 2
Day 1

HEPIX Spring 2011 – Day 1

Morning.
Got in last night at around 21:40 local time.
I should’ve done a little more exact research on how to find my hotel – I had to walk some 30 minutes (parts of it the wrong way) to get to it. But at least I made it in time to see some ice hockey... too bad Detroit lost.

Today’s another day though!

First stop: breakfast.

Wow. What a day, and it’s not over yet! So much cool stuff talked about.

Site Reports

The first half of the day was site reports from various places.

GSI here in Darmstadt (which is where some of the heaviest elements have been discovered). They have started an initiative to keep Lustre alive – as apparently Oracle is only going to develop it for their own services and hardware. They are running some SM – SuperMicro – servers that have infiniband on board, unlike the HP ones I’ve seen, where the Mellanox card is an additional mezzanine card. Nice. They were also running some really cool water-cooling racks that use the pressure in some way to push the hot air out of the racks. They found that their SM file servers had much stronger fans at the back and non-optimized airflow inside, so they had to tape over some (holes?) over the PCI slots on the back of the server to make it work properly for them. They were also running the servers at around 30°C – altogether they got a PUE of around 1.1, which is quite impressive.

Other reports: Fermilab (lots of storage; their Enstore has for example 26PB of data on tape), KIT, Nikhef (moved to ManageEngine for patch and OS deployment, and Brocade for IP routers), CERN (lots of hard drives had to be replaced... around 7000... what vendor? HP, Dell, SM?), DESY (replaced Cisco routers with Juniper for better performance), RAL (problems with LSI controllers, replaced with Adaptec), SLAC (FUDForum for communication).

 

Rest of the day was about:

Messaging

Some talk about messaging – for signing and encrypting messages. It could be used for sending commands to servers but also for other stuff. I’ve seen ActiveMQ in EyeOS and it’s used elsewhere as well. Sounds quite nice, but apparently not many use it; instead they use ssh scripts to run things like that.

Security

About various threats that have been public in the news lately, plus a presentation of some rootkits and a nice demo of a TTY hack. Basically the last one consists of one client/linux computer that has been hacked; from this computer a person with access to a server sshes there, and then the TTY hack kicks in and gives the hacker access to the remote host. Not easy to defend against.

There was also a lengthier (the longest of the day, 1h–1.5h) presentation from a French site that went through how they went about replacing their home-grown batch management system with SGE (now Oracle Grid Engine).

*** Updated the post with links to some of the things. Maybe the TTY hack has another name that’s more public.

Next Days:

Day 5
Day 4
Day 3
Day 2

HEPIX Spring 2011

I’m heading to Hepix this whole week!

Looks like there are some really interesting topics, like:

Lustre, Gluster, IPv6, stuff about the CERN IT facilities, the Scientific Linux report, cloud/grid virtualization, Oracle Linux.

I’ll sure be doing a bit of blogging about what’s going down.