The LDMAP program has been modified for deployment on the Iridis computing cluster at the University of Southampton. Genome-wide Linkage Disequilibrium Unit (LDU) maps are constructed using a segmental approach: an entire chromosome is divided into segments, and the LDU map of each segment is constructed independently as an individual job. With a large computing resource available, the segments are computed in parallel. The diagram above illustrates the concept of parallel processing. Once the segmental maps are constructed, they are concatenated to assemble a whole chromosome map. The segment size (1000 - 2000 SNPs per segment) is chosen to minimize the computing load and deliver a short turnaround time while losing only a negligible amount of information.
The Linkage Disequilibrium unit map is constructed in segments, and these segments can be concatenated in 2 ways:
1) Contig Assembly
2) Contig-Free Assembly (more efficient, recommended approach)
The genome-wide LDU maps for the human genome of four populations have been constructed using the Phase 2 HapMap public release #20 (January 2006). The maps are summarised here.
The Iridis cluster is built from dual-processor AMD Opteron machines and consists of over 900 processors capable of both 32 bit and 64 bit computing under Linux. This configuration is commonly known as a Beowulf cluster. Jobs are scheduled by open-PBS (Portable Batch System). Iridis is a shared facility which offers high performance computation for researchers across the whole university.
Fine-scale LDU maps are constructed on the AMD Opteron machines of the Iridis cluster; click here for more details.
The main features / limitations of LDMAP_cluster are:
1) An entire chromosome, or a large region of one, is divided into segments of appropriate length.
2) The segments are submitted to the Iridis cluster and executed on multiple machines in parallel.
3) The job status is monitored (the "checkSeg" command gives a simple status summary; the "qstat -a" command gives a detailed summary).
4) The result files are retrieved and assembled into the complete LD map.
5) LDMAP_cluster is a 64 bit program. A 64 bit address space can in principle cover up to 2^64 bytes of memory, whereas a 32 bit program is limited to 2^32 bytes = 4 Gb.
6) The program assigns 2 sub-jobs (segments) per submission. This enables two jobs to run simultaneously on a dual-processor node (see program history).
7) The program can be deployed on any standard Beowulf cluster with openPBS installed. The submission script may need amending, and the program re-compiling, to adapt it to a different PBS or Operating System (OS) platform.
8) The program currently handles a maximum of ~70,000 SNPs. If the SNP dataset file contains >70k SNPs, the dataset is split into smaller dataset files with an overlap of 200 SNPs. Work is under way to exploit 64 bit computing to remove this limit and process larger segments without splitting the dataset.
9) LDMAP_cluster performs an automatic online update, which ensures that the latest version of the program is delivered to all users (currently disabled for security reasons).
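As a rough illustration of the splitting described in item 8, the sketch below cuts a dataset of more than 70,000 SNPs into consecutive index ranges that overlap by 200 SNPs. The function name `split_dataset`, its parameters, and the choice of cutting consecutive chunks of the maximum size are assumptions made for illustration only; the splitting rule actually used by LDMAP_cluster is not documented on this page.

```python
def split_dataset(n_snps, max_snps=70_000, overlap=200):
    """Return (start, end) index ranges (end exclusive) covering n_snps SNPs.

    Hypothetical helper: each range holds at most max_snps SNPs, and every
    range after the first starts `overlap` SNPs before the previous one ends.
    """
    if overlap >= max_snps:
        raise ValueError("overlap must be smaller than max_snps")
    if n_snps <= max_snps:
        return [(0, n_snps)]  # small enough, no split needed

    ranges = []
    start = 0
    while True:
        end = min(start + max_snps, n_snps)
        ranges.append((start, end))
        if end == n_snps:
            break
        start = end - overlap  # step back so neighbouring files share SNPs
    return ranges


if __name__ == "__main__":
    for start, end in split_dataset(100_000):
        print(f"SNPs {start}..{end - 1} ({end - start} SNPs)")
```

Keeping the overlap as a parameter makes it easy to check how many files a given chromosome would produce before any data is transferred.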
Key Notes:
1) The LDMAP_cluster program is currently in beta phase with a primitive UI (User Interface). An account on Iridis and experience with Linux commands are required to use the program. An Iridis account can be applied for here.
2) The SNP interval window and the physical distance (kb) for including informative data are user-definable (21 Sep 06).
3) FTP is the protocol for transferring source data (SNP dat file) and results to/from Iridis. Please make sure that "binary" mode is used for transferring LDMAP_cluster_<date>.zip, and "ascii" mode for transferring the dataset file. Alternatively, the Windows client "SSH Secure Shell" can be installed from the Southampton University software collection here. Not only can files be transferred by simple "drag & drop" actions, it is also a good terminal console (recommended by the ISS for accessing Iridis). The manual for SSH Secure Shell can be found here.
4) All programs are pre-compiled with the 64 bit GCC compiler on RHEL4. It is not necessary for the user to re-compile any of the source code.
5) Most commands on this page can be "cut / paste" into the shell (terminal console); the remaining screen shots are presented in gif file format (for viewing only).
6) coreFTP is another good FTP client on MS Windows for transferring files (e.g. SNP datasets) with "drag & drop" actions. The Lite version is free for non-commercial use. Please click here to download the client.
Work Flow Overview:
Constructing LD maps using LDMAP_cluster:
The work flow diagram can be downloaded in pdf format here.
Installation:
There are 2 components for the installation:
- download the LDMAP_cluster program (zip file)
- download the SNP dataset file
Log on and Download the LDMAP_cluster program:
The program can be executed from any sub-directory.
1) Log on to your Iridis account (via SSH Secure Shell), instructions here
host: iridis2.soton.ac.uk
Holders of a previous Iridis account should have access to the current Iridis cluster; please make sure the above host is used for logging in.
2) Download the program
There are 2 ways to download the source code onto the home directory (~) on Iridis:
i) FTP (for both source code + dataset files)
ii) HTTP (for source code only)
With FTP, "binary" mode should be used for downloading the zip file, and "ascii" mode for transferring the dataset files.
FTP has to be initiated from Iridis to connect to other servers; the reverse is not possible.
LDMAP_cluster is packaged in a zip file, LDMAP_cluster_<date>.zip. Please replace <date> with the release version date shown at the top of the page, e.g. "ver. 16Oct06" = "LDMAP_cluster_161006.zip".
Initiate FTP from Iridis to connect to Cedar:
[wwsl@blue15 ~]$ ftp cedar.genetics.soton.ac.uk
Change directory and download....
ftp> bin
ftp> cd /export/home/winston
ftp> get LDMAP_cluster_161006.zip
ftp> quit
3) Extract the program from the zip file:
[wwsl@blue15 ~]$ unzip LDMAP_cluster_161006.zip
There should now be an executable "LDMAP_cluster", and two folders: "LDMAPsourceCode" and "mergeDataset".
The installation is complete.
The program updates automatically if a newer version of LDMAP_cluster becomes available (this function is currently disabled for security reasons).
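The release naming convention used in the download step ("ver. 16Oct06" = "LDMAP_cluster_161006.zip") can be reproduced with a few lines of Python. This is only a convenience sketch: the helper name `release_zip_name` is hypothetical, and it assumes an English locale so that month abbreviations such as "Oct" are recognised.

```python
from datetime import datetime


def release_zip_name(version):
    """Turn a version label such as '16Oct06' into the zip file name.

    Hypothetical helper: parses day, abbreviated month name and two-digit
    year, then writes them back as DDMMYY inside the file name.
    """
    release_date = datetime.strptime(version, "%d%b%y")
    return release_date.strftime("LDMAP_cluster_%d%m%y.zip")


if __name__ == "__main__":
    print(release_zip_name("16Oct06"))
```

The same call works for any other release date listed in the program history, provided the label follows the DDMonYY pattern.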
Download the SNP dataset for Phase II HapMap
The SNP dataset files are located on Turaco in the directory: /export/home7/todd/phase2
ftp> cd /export/home7/todd/phase2
ftp> cd <chr#>
ftp> asc
ftp> mget *.dat
ftp> quit
Please replace <chr#> with the chromosome number (i.e. 1, 2, 3... 23). Chromosome X = 23.
Download other SNP datasets
For other SNP datasets, please use the SSH Secure Shell Client (from your local PC to Iridis) to transfer the dataset.
Running LDMAP_cluster:
General Overview: LDMAP_cluster
Instructions:
1) The LDMAP_cluster program is executed by:
[wwsl@blue15 ~]$ LDMAP_cluster
2) The program requires the user to define the following inputs:
- the SNP dataset file (e.g. ceu10.dat)
- Parallel Processing: construct the map from multiple segments (select "1")
- Non-Parallel Processing: construct the map from the entire dataset without segmentation (select "2")
- the # of SNPs per segment (2000 SNPs)
- the size of the "Interval Window" (100 SNPs)
- the size of the "Buffer Zone" (100 SNPs)
- the physical distance (500 kb)
- the hours of computing per segment (10 hrs)
e.g. Assigning 2000 SNPs per segment:
first segment = 2000 SNPs + 100 SNPs X 1 contig
middle segments = 2200 SNPs + 100 SNPs X 2 contigs
last segment = 1000 - 1999 SNPs + 100 SNPs X 1 contig
or
last segment = 2000 - 2999 SNPs + 100 SNPs X 1 contig
(see program history)
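The segmentation rule in the example above can be sketched in Python. The sketch assumes that each "contig" is a buffer zone of 100 SNPs added on every side of a segment that borders another segment, and that a final remainder of fewer than 1000 SNPs is merged into the preceding segment (the minimum given in the program history). The function name `plan_segments` and its return format are hypothetical, and the program's internal bookkeeping may differ.

```python
def plan_segments(n_snps, seg_size=2000, buffer=100, min_last=1000):
    """Return a list of (core_start, core_end, start, end) index tuples.

    core_* delimit the SNPs owned by the segment; start/end add the buffer
    zone shared with each neighbouring segment. All ends are exclusive.
    """
    if n_snps <= 0:
        raise ValueError("n_snps must be positive")

    n_full, remainder = divmod(n_snps, seg_size)
    bounds = [i * seg_size for i in range(n_full + 1)]
    if remainder:
        if n_full == 0 or remainder >= min_last:
            bounds.append(n_snps)   # remainder forms its own last segment
        else:
            bounds[-1] = n_snps     # remainder is merged into the previous one

    segments = []
    for core_start, core_end in zip(bounds, bounds[1:]):
        start = max(0, core_start - buffer)    # no buffer before the first SNP
        end = min(n_snps, core_end + buffer)   # no buffer after the last SNP
        segments.append((core_start, core_end, start, end))
    return segments


if __name__ == "__main__":
    for seg in plan_segments(5500):
        print(seg)
```

With these defaults a 4500-SNP dataset yields two segments, the second one absorbing the 500-SNP remainder, which corresponds to the "2000 - 2999 SNPs" case for the last segment.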
If option 2 is selected to construct the map from the entire dataset without segmentation, the user only needs to define the following:
- the size of the "Interval Window" (100 SNPs)
- the physical distance (500 kb)
- the hours of computing per segment (10 hrs)
Please note that the purpose of option 2 is to construct the LD map from a small dataset without losing any information, in the absence of segmentation. Ideally the dataset should not exceed 4000 SNPs, as a larger dataset prolongs the computing time and makes the process inefficient (long queuing time for a long computing job; possible time-out if insufficient computing time is assigned).
Each submission script consists of two sub-jobs. Each sub-job represents a split segment of the SNP dataset (2nd + 3rd lines). This is followed by the submission script filename (4th line) and the submission job number (last line).
Once the jobs have been submitted to Iridis, new files and folders are created by the LDMAP_cluster program. Each segment is submitted from an individual folder (seg_#) accompanied by a PBS submission script (submitfile_#.csf).
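The pairing of segments into submissions can be pictured with the short sketch below. It groups segment numbers two at a time, leaving a single segment on its own when the total is odd, and builds a job name following the pattern seen in the PBS log file names on this page (e.g. ceu15.dat_seg_1_2). The helper names are hypothetical, and the name used for a single (odd) segment is an assumption.

```python
def pair_segments(n_segments):
    """Group segment numbers 1..n_segments into submissions of two sub-jobs."""
    numbers = list(range(1, n_segments + 1))
    return [tuple(numbers[i:i + 2]) for i in range(0, n_segments, 2)]


def job_name(dataset, pair):
    """Build a job name such as 'ceu15.dat_seg_1_2' (pattern assumed)."""
    return f"{dataset}_seg_" + "_".join(str(n) for n in pair)


if __name__ == "__main__":
    for pair in pair_segments(5):
        print(job_name("ceu15.dat", pair))
```

Five segments therefore need three submissions rather than five, which is the saving the two-sub-job scheme is designed to give on dual-processor nodes.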
3) Monitoring the submitted jobs on Iridis:
The "checkSeg" command provides a simple status summary of the submitted jobs:
Alternatively, there are 2 commands that also display the status of the submitted job(s): "qstat" and "showq".
[wwsl@blue15 ~]$ qstat
"qstat" provides a more detailed summary of the submitted jobs. The long number represents the job id of the submitted job. Typically, a segment of 2000 SNPs takes 2-5 hrs to execute (some take as little as 1-2 hrs, others up to 10 hrs), depending on the number of iterations needed.
Quite often the job names are too long to be displayed in the "qstat" list: the left part of the job name, which identifies the dataset, is cut off, so it is hard to tell which dataset a submission belongs to. Use "qstat -a" to display the left part of the job name. "qstat -a" is preferable to "qstat" as it also gives more information (e.g. elapsed time).
Click here for more details.
[wwsl@blue15 ~]$ qstat -a
"showq" command displays the job queue
of all users on the Iridis cluster. It consists of 3
lists: Active Jobs / Idle Jobs / Blocked Jobs.
If a user has over 40 jobs running on the cluster, the
excess jobs are blocked from executing. This makes sure that a
fair share of the resource is available for all the users.
Should anything go wrong (e.g. incorrect parameters are input, or the wrong dataset is submitted), use "qdel <job id>" to remove the submitted jobs before they start running on the computing cluster. More than one job id can be given as arguments to "qdel".
Click here for more details.
[wwsl@blue15 ~]$ qdel 541385
Currently the user has to remove manually the folders and submission scripts of a finished or wrongly submitted dataset. The following command is recommended for removing the folders and submission scripts associated with a wrongly submitted dataset:
[wwsl@blue15 ~]$ rm -rf chb4_1.dat*
4) Retrieving Results:
Run "checkSeg" to confirm that all segments have completed.
In rare circumstances, job(s) or segment(s) may be shown as "Idle" among the completed segments in the "checkSeg" list. The "Idle" segment(s) will not be found by either "qstat -a" or "showq". Please click here for information.
Two files (e.g. ceu15.dat_seg_1_2.o541359 and ceu15.dat_seg_1_2.e541359) are generated by the PBS for every job submission. These files contain the job log and the error log of the submitted job.
The "retrieveResult" command retrieves the segment maps from each segment folder (seg_#) and copies them into a newly created folder in the home directory. Please note that the name of the dataset and a long number are appended to the folder name (e.g. result_ceu15.dat_1148226845). This number is the machine time expressed in seconds; it gives the result of each dataset a unique name and so avoids overwriting any previous results.
[wwsl@blue15 ~]$ retrieveResult
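Because the result folder name ends with the machine time in seconds, the dataset name and the creation time can be recovered from it. The sketch below assumes the number is a standard Unix timestamp (seconds since 1 January 1970, interpreted here as UTC); the helper name `parse_result_folder` is hypothetical.

```python
from datetime import datetime, timezone


def parse_result_folder(name):
    """Split 'result_<dataset>_<seconds>' into (dataset, creation time)."""
    prefix = "result_"
    name = name.rstrip("/")  # tolerate a trailing slash from tab completion
    if not name.startswith(prefix):
        raise ValueError(f"not a result folder: {name!r}")
    # The dataset name may itself contain '_', so split on the last one only.
    dataset, _, seconds = name[len(prefix):].rpartition("_")
    created = datetime.fromtimestamp(int(seconds), tz=timezone.utc)
    return dataset, created


if __name__ == "__main__":
    dataset, created = parse_result_folder("result_ceu15.dat_1148226845")
    print(dataset, created.isoformat())
```

This can help when several result folders exist for the same dataset and the most recent run needs to be identified.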
Two folders are created by "retrieveResult". The log folder contains all the job / error logs of the submitted jobs. The result folder contains all the segment maps of the dataset.
[wwsl@blue15 ~]$ cd result_chb4_2.dat_1143996184/
Inside the result_# folder, there are two types of file: ter_# and pro_#. They contain the LDU maps and the relevant information for each segment (job submission) respectively. The number (#) corresponds to the segment folder number (seg_# folder).
The pro_# file contains information such as M, E, L (parameters of the Malecot model), the number of Newton-Raphson iterations needed for the composite -2lnL to converge, and other statistical error measurements. A sample of the pro_# file can be viewed here.
The ter_# file is the actual LD map file. It consists of 3 columns: SNP name, physical distance (kb) and LD unit (LDU). A sample of the ter_# file can be viewed here.
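A ter_# file with the three columns described above can be read with a few lines of Python. The sketch assumes whitespace-separated columns and skips any line that does not contain a name followed by two numbers (for example a header line); the exact file layout is not shown on this page, and the SNP names in the usage example are made up.

```python
def read_ter(lines):
    """Parse lines of 'SNP name, physical distance (kb), LDU' into tuples."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # blank or unexpected line
        name, kb, ldu = fields
        try:
            records.append((name, float(kb), float(ldu)))
        except ValueError:
            continue  # non-numeric columns, e.g. a header line
    return records


def map_length_ldu(records):
    """LDU length spanned by the segment: last LDU minus first LDU."""
    return records[-1][2] - records[0][2] if records else 0.0


if __name__ == "__main__":
    sample = ["snpA 10.5 0.0", "snpB 12.0 0.35", "snpC 15.2 1.10"]
    records = read_ter(sample)
    print(len(records), map_length_ldu(records))
```

Passing an open file object instead of a list works as well, since the function only iterates over lines.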
5) Assembling the complete LDU map
The "CreateLDmap" command concatenates the segments via Contig Assembly. The complete LD map is saved as the "ldmapFinal.txt" file.
The "CreateLDmapFree" command concatenates the segments via Contig-Free Assembly. The complete LD map is saved as the "ldmapFinalFree.txt" file.
It is recommended to back up the two folders:
1) result_<dataset>_<time>
2) log_<dataset>_<time>
They contain valuable information about the LD maps as well as the computing log from the PBS.
The easiest way to transfer the results (the above 2 folders) from Iridis is via SSH Secure Shell, where the folders can be "dragged and dropped" anywhere on the Windows machine. Alternatively, FTP can be used to export the two folders, in which case the folders must be zipped before transferring. "ascii" mode is recommended for transferring flat files.
6) Cleaning Up
To free more space for a new dataset, remove the folders and submission scripts of the completed dataset:
Storage Space Limit
A space of 5Gb is given to every user on Iridis. To check how much of the 5Gb has been used, type "du -h -s <folder>" in your home directory. Replace "wwsl" with your <userID>.
[wwsl@blue15 ~]$ du -h -s ../wwsl
The available space = 5Gb - 2.2Gb = ~2.8Gb
Alternatively, "du -s <folder>" displays the used space in Kb (GNU du reports 1024-byte blocks by default), which gives a more precise figure.
Concatenating LD maps constructed from
different datasets. Coming Soon......
Effective Iridis Submission (execution time vs. number of jobs):
The PBS gives higher priority to jobs that request a shorter execution time and fewer processors than to jobs that request more processors and a longer execution time.
LDMAP_cluster adjusts the requested execution time according to the number of SNPs in the segment (job). It is important to request sufficient time for the job to complete: the job is removed from the computing node once the requested time is up, whether it has finished or not.
Typically a segment of 2000 SNPs (excluding contig(s): 100-200 SNPs) requires 2 - 5 hrs to complete its execution. The execution time varies between segments even when they contain the same number of SNPs. The computing time largely depends on the number of Newton-Raphson iterations required for the segment to converge. Usually 800 iterations are completed within the first hour; 1600 iterations in ~2 hrs; 3200 iterations in ~4 hrs. Some segments might require >3200 iterations, therefore 10 hrs is the default requested execution time for all segments containing 2000 SNPs.
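The figures quoted above correspond to roughly 800 Newton-Raphson iterations per hour for a 2000-SNP segment. Taking that rate as a working assumption (it will differ for other segment sizes and hardware), the requested execution time can be sanity-checked as follows. The function names are hypothetical.

```python
ITERATIONS_PER_HOUR = 800  # approximate rate taken from the figures above


def estimated_hours(iterations, rate=ITERATIONS_PER_HOUR):
    """Rough computing time needed for a given number of iterations."""
    return iterations / rate


def max_iterations(requested_hours, rate=ITERATIONS_PER_HOUR):
    """Iterations that fit in the requested time before the job is removed."""
    return int(requested_hours * rate)


if __name__ == "__main__":
    print(estimated_hours(3200))  # hours needed for 3200 iterations
    print(max_iterations(10))     # iterations that fit in the 10 hr default
```

Under this assumption the 10 hr default leaves room for about two and a half times the 3200 iterations mentioned above.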
Currently (09Apr06), LDMAP_cluster processes one sample (SNP dat file) at a time; the next sample is submitted manually after the completion of the previous run. This creates a time gap between runs. A way to
Program Outlook:
The construction of LD maps has benefited greatly from deploying the program on the large computing resource of Iridis at the University of Southampton.
The National e-Science Grid
The computational and data grid resources are provided by the National Grid Service (NGS), which consists of 7000 processors.
Source Code:
LDMAP_cluster and its source code are freely available to non-commercial users only.
Please click here to download the source code (ver. 210906).
Please click here to download the source code (ver. 161006).
Alternatively, under a Linux environment, please use the following command, replacing the version number if necessary.
Lau, W., Kuo, T., Tapper, W., Cox, S., Collins, A. Exploiting large scale computing to construct high resolution linkage disequilibrium maps of the human genome. Bioinformatics 2007: 23; 517-519. [ABSTRACT]
Collins, A., Lau, W., De La Vega, F. M. Mapping Genes for Common Diseases: The Case for Genetic (LD) Maps. Human Heredity 2004: 58; 2-9. [ABSTRACT]
Program History:
29Mar06
Transformed the LDMAP_Condor program into the LDMAP_cluster program for Iridis (e.g. job submission script).
Added a function to take care of the last segment. If it contains < 800 SNPs it is merged with the previous segment as one job; otherwise it is submitted as an individual job (this limit increased to 1000 SNPs on 12Apr06). see table
30Mar06
Refined the wording of the User Interface (UI).
Automated the program to create folders and copy the necessary programs from the LDMAP source folder (LDMAPsourceCode).
31Mar06
Successful execution of 8 segments from ceu19.p1.dat on Iridis. The first job was submitted at 18:36, and the last segment finished at 23:25. The execution times of the segments vary from ~30 - 64 mins, largely depending on the # of Newton-Raphson iterations required for the composite likelihood to converge.
Fixed an inconsistency in the contigInfo.txt file.
The program terminates if the dataset file exceeds 70k SNPs.
01Apr06
Added a function to assign the length of execution time according to the size of the segment (# of SNPs per segment). Segment <2000 = 5hrs; Segment >2000 = 6hrs; Segment >3000 = 10hrs (this function was disabled on 12Apr06, with 10 hrs as the default computing time).
07Apr06
Placed a limit on the program. LDMAP_cluster terminates if the sample exceeds 70k SNPs.
08Apr06
Added a function for displaying the # of individual samples in the dataset file.
Added the "checkSeg" command for monitoring the status of submitted job(s) in a simple summary.
10Apr06
The program is now able to handle more than one dataset file at a time. It groups the related folders and files by assigning a unique name to each submitted dataset. However, the number of datasets that can be submitted depends on the limited user storage of 5Gb.
12Apr06
The minimum number of SNPs in the last segment is now 1000 SNPs (up from 800 SNPs, 29Mar06). This increases the accuracy of the LD map constructed in the last segment. see table
The checkSeg command now deletes the dataset file and temp.dat (~30Mb) from the segment folder once the job has started processing, and deletes the intermediate file (~40Mb) once the segment map (ter file) is constructed.
Added a display of the estimated required space (Mb) to the UI of the LDMAP_cluster program before submission, based on the assumption of 2000 SNPs per segment and a SNP dataset file size of 15Mb.
Fixed bugs in mergeFreeContig3.c, which implements Contig-Free Assembly. Contig-Free Assembly is added as a better alternative for concatenating the segments.
14Apr06
Program "reformatMapFile" is now able to merge (Contig Assembly) the sub LD map files constructed from split datasets. Chr.18 CEU phase 2 is the first dataset merged back into its full length LD map from 2 split datasets.
Fixed a bug in removing the temp.dat file in checkSeg.c.
15Apr06
Program "reformatMapFileFree" is now able to merge (Contig-Free Assembly) the sub LD map files constructed from a split dataset. Phase 2 Chr.18 CEU is the first map successfully merged from 2 dataset files.
16Apr06
Fixed a bug in "reformatMapFileFree.c". The merged LD map no longer has duplicated SNPs. The duplication was caused by indexing differences between the "ter" file and the sub LD map. reformatMapFileFree.c is now able to use the same program (mergeContigFree3.c) to carry out the process.
17Apr06
Fixed bugs in "mergeMapFileFreeContig.c" regarding the re-indexing of SNPs. The Phase 2 LD map of chromosome 1 is now constructed with the correct number of SNPs.
19Apr06
Possible bug in mergeContigFree.c that leads to errors in the indexing of SNPs in the final LD maps concatenated from split segments. mergeContigFree.c uses the number of lines in the ter file to determine the extended regions between adjacent segments; if a SNP is dropped from one of the ter files, it disrupts the SNP indexing in the segment, causing a frame shift effect. Will work on this soon..
21Apr06
LDMAP_cluster now assigns 2 sub-jobs (segments) per submission. This enables the program to fully utilize both processors of the dual-processor machines on Iridis. The number of submissions is halved, and twice as many jobs run simultaneously, therefore the throughput is expected to increase two fold.
The previous version of LDMAP_cluster is renamed "LDMAP_cluster_singleSubmissionPerNode.c" and remains in the LDMAPsourceCode folder for deployment on clusters with single-processor machines only.
23Apr06
Added a function to retrieveResult.c for retrieving the job log files and submission scripts into the log folder.
26Apr06
LDMAP+ has been pre-compiled for RHEL (Red Hat Enterprise Linux) 4.0, in preparation for the roll-out of the upgrade from RHEL 3.0 to RHEL 4.0 on the execution nodes of the Iridis cluster.
29Apr06
From ver. 29Apr06 onward, LDMAP_cluster performs an automatic online update. This ensures the latest version of the program is delivered to all users.
30Apr06
Fixed minor bugs to avoid the duplication of <dataset>_contigInfo.txt, and bugs in the UI output.
01May06
Fixed a bug in the submission script, introduced in ver. 30Apr06, that left alternate jobs (segments with even numbers) idle in each submission.
04May06
The user is now able to assign the computing hours required by the segment (e.g. 10 hrs computing time is recommended for processing 2000 SNPs per segment).
08May06
Fixed a bug in the submission script in assigning the computing time of a single (odd) segment / job.
19May06
The window size now defaults to 75 SNPs as this produces better LD maps in a shorter time. Amended the wording of LDMAP_cluster and updated the supporting website address.
20May06
Amended the wording of checkSeg.
14Aug06
Only the ldmapper1+ executable and the dataset file are now placed in each segment folder, rather than all the source code.
19Aug06
Added a feature giving the user full control over the "Interval Window size" and the physical distance (kb) for the informative SNPs.
21Sep06 (Major updates to the LDMAP_cluster program)
Fixed a bug in constructing the LD map from the entire dataset in a single segment submission.
Added an option for the user to choose between the entire dataset and segments for constructing the LD map.
Amended the web link and UI wording.
Fixed a potential bug where the job was submitted before the folder had been created and the dataset file moved; this had been reported previously in chromScan_cluster, where the dataset is large and takes longer to duplicate.
16Oct06
Added a feature to allow the user to execute LDMAP_cluster from any sub-directory.