Getting Started with InfiniBand
The first step to using a new infiniband based network is to get the right packages installed. These are the infiniband related packages we ship and what each is for (note: the Fedora packages have not all been built or pushed to the repos yet, so their mention here is of the "coming soon" variety, not the "already done" variety):
- openib - The base package, which includes the openibd init script that loads all the necessary kernel modules at boot. The init script is not enabled by default; run "chkconfig --level 2345 openibd on" to enable the service. You can configure which upper layer modules to load by editing /etc/ofed/openib.conf.
- rdma - A package identical to openib that exists only in Fedora and will exist in RHEL6 and later. The openib package name is historical and problematic to change in the middle of a product's lifetime. Everything is the same as for openib except the service is named rdma and the config file is /etc/rdma/rdma.conf.
- libibverbs - The core user space library that implements the hardware-abstracted verbs protocol on both infiniband and iwarp hardware.
- libmthca, libmlx4, libipathverbs, libehca, libcxgb3, libnes - User space hardware driver modules that implement the hardware specific bit fiddling operations libibverbs uses to provide a hardware agnostic verbs API. Note: these libraries are very closely tied to the kernel hardware driver modules, and as such must be used with the kernel series they are intended for. This means that if you are running a RHEL5.1 system with a RHEL5.1 kernel, and the RHEL5.2 libcxgb3 has a bug fix you need, upgrading the libcxgb3 library to the RHEL5.2 version without also switching to the RHEL5.2 kernel is *not* supported and not guaranteed to work at all. The kernel portion of the infiniband stack and the entire user space portion of the stack are subject to change from point release to point release and are not guaranteed to work properly when components from different releases are mixed and matched.
- libibcm, librdmacm - Libraries that ease the process of initiating connections between infiniband hosts, primarily used in application development. The libibcm library uses infiniband native hardware addresses to specify which machine to open a connection to, while librdmacm lets you specify connections using tcp/ip addresses even though it opens rdma specific connections. The librdmacm-utils package includes some tools for testing your network connectivity.
- libibcommon, libibumad, libibmad - Libraries used for creating management messages. Unless you are writing a subnet manager or a fabric diagnostic package, you probably don't need these. They are used, however, by opensm, infiniband-diags, ibutils, and ibsim.
- opensm - An infiniband subnet manager. To enable it, enable the opensmd service script. Configuration can be done via the /etc/ofed/opensm.conf file (/etc/rdma/opensm.conf in Fedora).
- dapl - An application development environment that provides a transport neutral API for machine-to-machine operations. It's mainly used in new development, or by the closed source Intel MPI (which we don't ship, obviously, since it's closed source). The dapl-utils package provides utilities for testing your dapl setup.
- ibutils, infiniband-diags (formerly openib-diags) - Various utilities for assessing the health of your infiniband fabric and testing end to end connectivity.
- ibsim - An infiniband fabric simulator. By creating a topology file that mimics the physical network/switch layout you have in mind, as well as specifying other relevant factors, you can test the overall simulated performance of the network.
- libsdp (RHEL only, not in fedora, nor will it ever be) - This library lets you transparently cause tcp/ip applications to use a limited form of RDMA over RDMA specific network fabrics. You won't get performance as good as if you modified the application to know about and use RDMA, but it should still perform better than regular tcp/ip over infiniband (IPoIB) interfaces (more on IPoIB in the setup section).
- mstflint, tvflash - Tools for burning new firmware onto Mellanox based infiniband hardware.
- perftest, qperf - Performance testing tools specific to RDMA fabrics.
- ofed-docs - Documentation from the upstream OFED project.
- qlvnictools - Tools specifically for configuring participation in QLogic's proprietary InfiniPath switch vlan setups. They only work if you have the right InfiniPath switch hardware and IPath host adapters.
- srptools - A simple daemon for attaching to and maintaining connections to SRP protocol based disks.
- mpi-selector - A simple login script based application that allows either a system wide or a user specific default MPI implementation to be selected.
- openmpi, mvapich, mvapich2 (both mvapichs are rhel5.3/4.8 and later only) - Three competing MPI implementations. OpenMPI is an MPI2 compliant implementation that's generally the most modular and complete; it works on more architectures (all but s390/s390x) and is the most flexible. Mvapich is MPI1 compliant only and works on a more limited set of architectures (i386, ia64, x86_64, ppc64); it also runs only over infiniband/iwarp and will not run over plain tcp/ip networks. It should, however, be the fastest MPI we ship. The mvapich2 implementation is an MPI2 compliant version of mvapich and is identical in most respects, except that it does not support the ppc64 architecture at the moment.
- mpitests - A series of packages intended to provide test applications for each of the above MPI implementations (in rhel5.3/4.8).
Once you have the right packages installed, the next thing is to get a basic system setup in place. The infiniband stack has both kernel space and user space components. To use any given piece of hardware, you need both the kernel driver and the user space driver for that hardware. In the case of mlx4 hardware (which has a two part kernel driver), that means you need the core mlx4 kernel driver (mlx4_core) and also the infiniband mlx4 driver (mlx4_ib). When using infiniband, it is best to make sure you have the openib package installed. The necessary drivers will then be loaded into the kernel automatically if you enable the openibd service; this takes care of the kernel space component (you don't need to tell openibd what hardware you have, it figures that out for itself). You can tweak which kernel modules are loaded by editing /etc/ofed/openib.conf to suit your needs.
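As a concrete sketch of the kernel-side setup, enabling the openibd service and verifying that the modules loaded might look like this (the module names are hardware specific; mlx4 is used here purely as an example):

```
chkconfig --level 2345 openibd on   # enable the service at boot
service openibd start               # load the kernel modules now
lsmod | grep mlx4                   # verify, e.g., mlx4_core and mlx4_ib are present
```

On Fedora (and RHEL6 and later), substitute the rdma service for openibd.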
If you attempt to run an infiniband application and it says you have no hardware, then that means you either haven't enabled the openibd service or you really don't have any hardware ;-) However, if it says it can't find a driver for your hardware, then it's talking about the user space driver, not the kernel space driver.
For the user space component, you need the core infiniband hardware library, libibverbs, installed and you also need hardware specific library packages installed. The common hardware specific library packages are: libmthca, libmlx4, libipathverbs, libcxgb3, libehca, libnes. Once these packages are installed, the libibverbs library automatically loads whichever ones are necessary in order to support the hardware the kernel detected in your machine.
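libibverbs discovers these hardware plugins through small text files in /etc/libibverbs.d/, one installed by each driver package. If an application complains that no user space driver is found, listing that directory is a quick way to see which plugins are installed (the mlx4 entry below is an example; what you see depends on which driver packages you installed):

```
ls /etc/libibverbs.d/
cat /etc/libibverbs.d/mlx4.driver   # typically a single line such as: driver mlx4
```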
Once you have the proper kernel space and user space hardware drivers loaded, you are almost ready to start rudimentary testing. Next, select one machine to be the subnet manager for your infiniband network (you can use more than one, but that requires manual editing of settings I don't cover here). This machine needs the opensm package installed and its service script, opensmd, enabled. The default settings will work for most people. However, if you have multiple, redundant infiniband network fabrics, you will need to configure more than one machine to start opensm, as each running instance attaches to only one fabric, and you will need to configure the additional instances to bind to the proper port so that they manage the redundant network fabrics instead of the default fabric. You can edit /etc/ofed/opensm.conf to change which port opensm binds to. (NOTE: the opensm.conf file in rhel5.3/4.8 changes format compared to the one in rhel5.2/4.7, so hand modifications will need to be forward ported to the new config file when those updates are released.)
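Enabling the subnet manager on the chosen machine follows the same pattern as the other services (a sketch; run as root):

```
chkconfig --level 2345 opensmd on
service opensmd start
```

If you do need to bind an instance to a specific port for a redundant fabric, the port GUID values reported by ibstat (from the infiniband-diags package) are generally what the port binding setting in opensm.conf refers to.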
Once you have the opensm machine selected, and you've started the machine with both the openibd and opensmd services enabled, you should have a functional infiniband fabric. An easy way to test this is to make sure that the libibverbs-utils package is installed and run ibv_devinfo and ibv_devices to see what infiniband/iwarp devices the system thinks are present. Assuming that your devices are found and ibv_devinfo shows your port state to be active, then you are ready to run programs on the infiniband fabric.
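For example, on a working setup the check might look like this (exact output formatting varies by release):

```
ibv_devices               # lists the detected infiniband/iwarp adapters
ibv_devinfo | grep state  # an active port reports a state of PORT_ACTIVE
```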
In addition to this, you can create tcp/ip interfaces over the infiniband network (IPoIB). To do so, you will need to create the ifcfg-ib0 (and possibly ifcfg-ib1) file in /etc/sysconfig/network-scripts. IPoIB interface types have not been added to our system-config-network tool, hence the need to create the files manually. In addition, IPoIB interfaces cannot use dhcp, so they must be statically configured. A sample ifcfg-ib0 file looks like this:
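Something along these lines, with the addresses replaced by values appropriate for your network:

```
DEVICE=ib0
BOOTPROTO=static
IPADDR=192.168.0.10
NETMASK=255.255.255.0
ONBOOT=yes
```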
If you have two IB ports plugged into the same infiniband fabric (that is, both on the same subnet, not each port on its own subnet) and IPoIB is enabled on both ports, things can appear to work only intermittently when you use the IPoIB interface addresses to initiate connections between machines. To avoid that confusion, it is best to add the following lines to your /etc/sysctl.conf file:
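The lines in question control the kernel's ARP behavior when multiple interfaces share a subnet; a commonly recommended pair (double-check the exact values against your release's documentation) is:

```
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2
```

Running "sysctl -p" afterwards applies the settings without a reboot.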
If you intend to run infiniband-using applications as any user other than root, you will also need to adjust the system's maximum locked memory. This is done by modifying the /etc/security/limits.conf file. Depending on whether you want to raise the limit for a specific group that is allowed to run infiniband applications or for all logins, your change should look something like this:
@ib_user - memlock 8192
* - memlock 8192
The value used above is a sample value. You can set the limit to -1 to remove the limit entirely. The actual amount of locked memory your application will need depends on how many connections it is going to open and how large of a message queue it allocates for each connection plus memory for the actual read/write buffers it sends. All RDMA memory must be locked in to physical memory so that the infiniband/iwarp hardware can safely access the memory via DMA.
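After editing limits.conf, log out and back in, then confirm the new limit took effect; ulimit reports the value in kbytes, matching the units used in limits.conf:

```shell
# Show the max locked memory for the current login session (in kbytes)
ulimit -l
```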
Once you have reached this point, you should have at least a functional network of infiniband using machines. You should be able to use the various tools listed in the package list above to test both basic functionality and performance of your infiniband network. Now, putting that network to use is another matter ;-)