University of Southampton School of Medicine

Constructing LDU Maps on Iridis  ver. 16Oct06    |  map summary  |   project outline
 

Introduction


Segment Assembly

 

Program Features / limitations

 

Key notes

 

Installation

 

Work flow

 

Running LDMAP-cluster on Iridis

 

FAQ / Troubleshooting


Program history


Source code

 

Download LD maps

 

Related publications

 

CHROMSCAN-cluster

program breakdown
known problems
other tools...

SP = Sub Process; SO = Sub Output

 

 

 

 

Introduction:

The LDMAP program has been modified for deployment on the Iridis computing cluster at the University of Southampton. Genome-wide Linkage Disequilibrium Unit (LDU) maps are constructed using a segmental approach: an entire chromosome is divided into segments, and the LDU map for each segment is constructed independently as an individual job. With a large computing resource available, the segments can be computed in parallel. The diagram above illustrates the concept of parallel processing. Once the segmental maps are constructed, they are concatenated to assemble a whole-chromosome map. The segment size (1000 - 2000 SNPs per segment) is chosen to minimize the computing load while losing only a negligible amount of information, which keeps the turnaround time short.
 




 

The Linkage Disequilibrium unit map is constructed in segments, and these segments can be concatenated in 2 ways:
        1) Contig Assembly
        2) Contig-Free Assembly (more efficient, recommended approach)

 

The genome-wide LDU maps for the human genome of four populations have been constructed using the Phase 2 HapMap public release #20 (January 2006). The maps are summarised here.


The Iridis cluster is built from dual-processor AMD Opteron machines and consists of over 900 processors capable of both 32 bit and 64 bit computing under Linux. This configuration is commonly known as a Beowulf cluster. Job scheduling is managed by OpenPBS (Portable Batch System). Iridis is a shared facility that offers high performance computing to researchers across the whole university.

Fine-scale LDU maps are made on the AMD Opteron machines of the Iridis cluster; click here for more details.

 

 

Features / Limitations:

The modified LDMAP program is named LDMAP_cluster. Its predecessor, LDMAP_Condor, is deployed on a Linux Condor Pool of ~80 nodes at the Computational Engineering and Design Centre (CEDC), School of Engineering Sciences at the university.

The main features / limitations of LDMAP_cluster are:

1) An entire chromosome, or a large region of one, is divided into segments of appropriate length.
2) The segments are submitted to the Iridis cluster and executed on multiple machines in parallel.
3) The job status is monitored (a simple status summary via the "checkSeg" command; a detailed summary via the "qstat -a" command).
4) The result files are retrieved and assembled into the complete LD map.
5) LDMAP_cluster is a 64 bit program, so it can in principle address up to 2^64 bytes (16 exabytes) of memory, compared with 4 Gb for a 32 bit program.
6) The program assigns 2 sub-jobs (segments) per submission. This enables two jobs to run simultaneously on a dual-processor node (see program history).
7) The program can be deployed on any standard Beowulf cluster with OpenPBS installed. Amendments to the submission script and re-compilation might be required to adapt it to a different PBS or operating system (OS) platform.
8) The maximum limit of this program is ~70,000 SNPs. If the SNP dataset file contains >70k SNPs, the dataset is split into smaller dataset files with an overlap of 200 SNPs. Work is in progress to exploit 64 bit computing to remove this limit and process larger segments without splitting the dataset.
9) LDMAP_cluster performs automatic online updates. This ensures the latest version of the program is delivered to all users (currently disabled for security reasons).
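
The splitting of an oversized dataset (item 8) can be sketched as below. This is only an illustration, not the LDMAP_cluster source: the function name split_dataset, the greedy choice of boundaries and the 0-based SNP indices are assumptions. Only the ~70,000 SNP limit and the 200 SNP overlap come from the text above.

```python
def split_dataset(n_snps, max_snps=70_000, overlap=200):
    """Split n_snps into index ranges [start, end) of at most max_snps SNPs.

    Consecutive ranges share `overlap` SNPs. The greedy boundary choice is an
    assumption made for illustration only.
    """
    if not 0 <= overlap < max_snps:
        raise ValueError("overlap must be smaller than max_snps")
    ranges = []
    start = 0
    while True:
        end = min(start + max_snps, n_snps)
        ranges.append((start, end))
        if end >= n_snps:
            return ranges
        # Step back so the next file repeats the last `overlap` SNPs.
        start = end - overlap


# A hypothetical dataset of 100,000 SNPs becomes two overlapping files.
print(split_dataset(100_000))  # [(0, 70000), (69800, 100000)]
```

A dataset at or below the limit comes back as a single range, so the same call covers both cases.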



 
Key Notes:
   
1) The LDMAP_cluster program is currently in beta, with a primitive UI (User Interface). An account on Iridis and experience with Linux commands are required to use the program. An Iridis account can be applied for here.

2) The SNP interval window and physical distance (kb) for including the informative data are user-definable (21 Sep 06). 

3) FTP is the protocol for transferring source data (SNP dat file) and results to/from Iridis. Please make sure that "binary" mode is used for transferring LDMAP_cluster_<date>.zip, and "ascii" mode is used for transferring the dataset file. Alternatively, a Windows client, "SSH Secure Shell", can be installed from the Southampton University software collection here. It transfers files via simple "drag & drop" actions and is also a good terminal console (recommended by the ISS for accessing Iridis). The manual for SSH Secure Shell can be found here.

4) All programs are pre-compiled with the 64 bit GCC compiler on RHEL4. It is not necessary for the user to re-compile any of the source code.

5) Most commands on this page can be "cut / pasted" into the shell (terminal console); the remaining screen shots are presented in gif file format (for viewing only).

6) coreFTP is another good FTP client on MS Windows for transferring files (e.g. SNP datasets) with "drag & drop" actions. The Lite version is free for non-commercial use. Please click here to download the client.

 

 

Work Flow Overview: Constructing LD maps using LDMAP_cluster:

The work flow diagram can be downloaded in pdf format here.

 


 


 


 

 

Installation:

There are 2 components to the installation:
    -       download the LDMAP_cluster program (zip file)
    -       download the SNP dataset file


Log on and Download the LDMAP_cluster program:

The program can be executed from any sub-directory.


1) Log on to your Iridis account (via SSH Secure Shell), instructions here

   host: iridis2.soton.ac.uk

Holders of a previous Iridis account should have access to the current Iridis cluster. Please make sure the above host is used for logging in.


2) Download the program

There are 2 ways to download the source code onto the home directory (~) on Iridis:
       i)  FTP (for both source code + dataset files)
       ii) HTTP (for source code only)

When using FTP, "binary" mode is recommended for downloading the zip file, and "ascii" mode for transferring the dataset files.

FTP has to be initiated from Iridis to connect to other servers; connecting in the opposite direction is not possible.


LDMAP_cluster is packaged in a zip file, LDMAP_cluster_<date>.zip. Please replace <date> with the release version date shown at the top of the page, e.g. "ver. 16Oct06" = "LDMAP_cluster_161006.zip".
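
The mapping from the version label to the zip file name can be sketched as follows. The helper name zip_name and the month table are hypothetical; the label format (two-digit day, three-letter month, two-digit year) is inferred from the examples on this page.

```python
# Month abbreviations as they appear in the version labels on this page.
MONTHS = {
    "Jan": "01", "Feb": "02", "Mar": "03", "Apr": "04",
    "May": "05", "Jun": "06", "Jul": "07", "Aug": "08",
    "Sep": "09", "Oct": "10", "Nov": "11", "Dec": "12",
}


def zip_name(version):
    """Turn a label such as '16Oct06' into 'LDMAP_cluster_161006.zip'."""
    day, month, year = version[:2], version[2:5], version[5:7]
    return f"LDMAP_cluster_{day}{MONTHS[month]}{year}.zip"


print(zip_name("16Oct06"))  # LDMAP_cluster_161006.zip
```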

Initiate FTP from Iridis to connect to Cedar:

[wwsl@blue15 ~]$ ftp cedar.genetics.soton.ac.uk

 

Change directory and download....

ftp> bin

ftp> cd /export/home/winston

ftp> get LDMAP_cluster_161006.zip

ftp> quit




3) Extract the program from the zip file :

[wwsl@blue15 ~]$ unzip LDMAP_cluster_161006.zip

There should now be an executable "LDMAP_cluster",  and two folders: "LDMAPsourceCode" and "mergeDataset"

 

The installation is now complete.

 

The program updates automatically if a newer version of LDMAP_cluster becomes available

(this function is currently disabled for security reasons).



Download the SNP dataset for Phase II HapMap

The SNP dataset files are located on Turaco in the directory:
/export/home7/todd/phase2

ftp> cd /export/home7/todd/phase2

ftp> cd <chr#>

ftp> asc

ftp> mget *.dat

ftp> quit

Please replace <chr#> with the chromosome number (i.e. 1, 2, 3... 23). Chromosome X = 23.
 

Download other SNP dataset

Alternatively for other SNP datasets,  please use SSH Secure Shell Client (from your local PC to Iridis) to transfer the dataset.





 
Running LDMAP_cluster :

 

General Overview: LDMAP_cluster

 

 

Instructions:


1) The LDMAP_cluster program is executed by:

[wwsl@blue15 ~]$ LDMAP_cluster
 


2) The program requires the user to define the following inputs:

    -    the SNP dataset file (e.g. ceu10.dat)

    -    Parallel Processing: Construct the map from multiple segments (select "1")

    -    Non-Parallel Processing: Construct the map from entire dataset without segmentation (select "2")

 


 
                                                             
    -    the # of SNPs per segment (2000 SNPs)
    -    the size of "Interval Window"(100 SNPs)  
    -    the size of "Buffer Zone" (100 SNPs)

    -    the physical distance (500 kb)

    -    the hours of computing per segment (10 hrs)


e.g. Assigning 2000 SNPs per segment:

    first segment    =   2000 SNPs          +   100 SNPs X 1 contig

    middle segments  =   2000 SNPs          +   100 SNPs X 2 contigs

    last segment     =   1000 - 1999 SNPs   +   100 SNPs X 1 contig
         or
    last segment     =   2000 - 2999 SNPs   +   100 SNPs X 1 contig
(see program history)
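
The segment arithmetic above can be sketched as follows. This is a hedged illustration, not the program's own code: the function name plan_segments and the 0-based indices are assumptions, and each contig is assumed to be a 100 SNP extension into the neighbouring segment. The rule for the last segment follows the program history (a remainder below 1000 SNPs is merged into the previous segment).

```python
def plan_segments(n_snps, seg_size=2000, contig=100, min_last=1000):
    """Return (core_start, core_end, ext_start, ext_end) per segment.

    All ranges are half-open [start, end). The core is the segment itself;
    the extended range adds one contig on each side that has a neighbour.
    """
    n_full, remainder = divmod(n_snps, seg_size)
    cores = [(i * seg_size, (i + 1) * seg_size) for i in range(n_full)]
    if remainder:
        if remainder >= min_last or not cores:
            # Large enough to be submitted as its own segment.
            cores.append((n_full * seg_size, n_snps))
        else:
            # Too small: merge into the previous segment (2000 - 2999 SNPs).
            cores[-1] = (cores[-1][0], n_snps)
    return [
        (start, end, max(0, start - contig), min(n_snps, end + contig))
        for start, end in cores
    ]


for seg in plan_segments(9500):
    print(seg)
```

For a hypothetical dataset of 9500 SNPs this gives five segments: the first spans 2100 SNPs including its contig, the middle ones 2200, and the last 1600 (1500 SNPs plus one contig).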

 

If option 2 is selected to construct the map from the entire dataset without segmentation, the user only needs to define the following:

    -    the size of "Interval Window"(100 SNPs)  
    -    the physical distance (500 kb)

    -    the hours of computing per segment (10 hrs)

 

Please note that the purpose of option 2 is to construct the LD map from a small dataset without the loss of information that segmentation causes. Ideally the dataset should not exceed 4000 SNPs, as a larger dataset prolongs the computing time and makes the process inefficient (long queuing time for a long computing job; possible time out if insufficient computing time is assigned). 

 

 


Each submission script consists of two subJobs. Each subJob represents a split segment of the SNP dataset (2nd + 3rd lines). This is followed by the submission script filename (4th line) and the submission job number (last line).
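
The pairing of segments into submissions can be sketched as below. The function name pair_segments is hypothetical, segments are assumed to be numbered from 1, and the job name pattern is modelled on the PBS log file names shown later on this page (e.g. ceu15.dat_seg_1_2). The name used for a single leftover segment is a guess.

```python
def pair_segments(dataset, n_segments):
    """Group segments 1..n_segments into submissions of two sub-jobs each."""
    submissions = []
    for first in range(1, n_segments + 1, 2):
        # An odd number of segments leaves the last one on its own.
        seg_ids = [first] if first == n_segments else [first, first + 1]
        name = f"{dataset}_seg_" + "_".join(str(s) for s in seg_ids)
        submissions.append((name, seg_ids))
    return submissions


for name, seg_ids in pair_segments("ceu15.dat", 5):
    print(name, seg_ids)
```

With 8 segments this yields 4 submissions; with 5 segments it yields 3, the last holding a single sub-job.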





Once the jobs have been submitted to Iridis, new files and folders are created by the LDMAP_cluster program. Each segment is submitted from an individual folder (seg_#) accompanied by a PBS submission script (submitfile_#.csf).








3) Monitoring the submitted jobs on Iridis:


The "checkSeg" command provides a simple status summary of the submitted jobs:


For more details about checkSeg, please click here.



Alternatively, there are 2 commands that also display the status of the submitted job(s): "qstat" and "showq".

[wwsl@blue15 ~]$ qstat



"qstat" provides a more detailed summary of the submitted jobs. The long number is the job id of the submitted job. Typically, a segment of 2000 SNPs takes between 1 and 5 hrs to execute, and some take up to 10 hrs, depending on the number of iterations needed.

The job names are often too long to be displayed in full in the "qstat" list: the left part of the job name, which identifies the dataset, is cut off, so it is hard to tell which dataset a submission belongs to. Use "qstat -a" to display the left part of the job name. "qstat -a" is preferable because it also gives more information (e.g. elapsed time) than "qstat". Click here for more details.
 
[wwsl@blue15 ~]$ qstat -a



The "showq" command displays the job queue of all users on the Iridis cluster. It consists of 3 lists: Active Jobs / Idle Jobs / Blocked Jobs.


If a user has over 40 jobs running on the cluster, the excess jobs are blocked from executing. This ensures that a fair share of the resource is available to all users.

Should anything go wrong (e.g. incorrect parameters are entered, or the wrong dataset is submitted), use "qdel <job id>" to remove the submitted jobs before they start running on the computing cluster. More than one job id can be given as arguments to "qdel". Click here for more details.

[wwsl@blue15 ~]$ qdel 541385

Currently the user is required to remove manually the folders and submission scripts related to a finished or wrongly submitted dataset. For example, to remove the folders and submission scripts associated with a wrongly submitted dataset:

[wwsl@blue15 ~]$  rm -rf chb4_1.dat* 





4) Retrieving Results:

Run "checkSeg" to confirm that all segments have completed.




In rare circumstances, one or more jobs or segments may be shown as "Idle" among the completed segments in the "checkSeg" list, even though the "Idle" segment(s) appear in neither "qstat -a" nor "showq". Please click here for more information.


Two files (e.g. ceu15.dat_seg_1_2.o541359 and ceu15.dat_seg_1_2.e541359) are generated by the PBS for every job submission. They contain the job log and the error log respectively for each submitted job.
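
A log file name of this form can be taken apart as sketched below. The pattern is inferred from the two example names above only; the function name parse_log_name and the returned field names are hypothetical.

```python
import re

# <dataset>_seg_<segment numbers>.<o|e><job id>, inferred from the examples.
LOG_NAME = re.compile(
    r"^(?P<dataset>.+)_seg_(?P<segments>\d+(?:_\d+)*)"
    r"\.(?P<kind>[oe])(?P<job_id>\d+)$"
)


def parse_log_name(filename):
    """Return the parts of a PBS log file name, or None if it does not match."""
    match = LOG_NAME.match(filename)
    if match is None:
        return None
    return {
        "dataset": match["dataset"],
        "segments": [int(s) for s in match["segments"].split("_")],
        "stream": "output" if match["kind"] == "o" else "error",
        "job_id": int(match["job_id"]),
    }


print(parse_log_name("ceu15.dat_seg_1_2.e541359"))
```

This makes it easy to group the output and error logs of one submission by job id before inspecting them.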






The "retrieveResult" command retrieves the segment maps from each segment folder (seg_#) and copies them into a newly created folder in the home directory. The folder name combines the name of the dataset with a long number (e.g. result_ceu15.dat_1148226845). This number is the machine time in seconds (Unix time); it gives the result of each dataset a unique name and avoids overwriting any previous results.

[wwsl@blue15 ~]$ retrieveResult
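
The naming scheme can be sketched as follows, assuming the long number is the Unix time in seconds (the example 1148226845 decodes to 21 May 2006, which fits the dates on this page). The function names are hypothetical and this is not the retrieveResult source.

```python
import time
from datetime import datetime, timezone


def result_folder_name(dataset, now=None):
    """Build a unique folder name from the dataset name and the time in seconds."""
    stamp = int(time.time() if now is None else now)
    return f"result_{dataset}_{stamp}"


def folder_timestamp(folder_name):
    """Recover the creation time (UTC) from the number at the end of the name."""
    stamp = int(folder_name.rsplit("_", 1)[1])
    return datetime.fromtimestamp(stamp, tz=timezone.utc)


name = result_folder_name("ceu15.dat", now=1148226845)
print(name, folder_timestamp(name).date())
```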




Two folders are created by "retrieveResult". The log folder contains all the job / error logs for the submitted jobs. The result folder contains the maps for all the segments of the dataset.



 

[wwsl@blue15 ~]$ cd result_chb4_2.dat_1143996184/



Inside the result_# folder, there are two types of file: ter_# and pro_#. They contain the LDU maps and the relevant information for each segment (job submission) respectively. The number (#) corresponds to the segment folder number (seg_# folder).

The pro_# file contains information such as M, E, L (parameters of the Malecot model), the number of Newton-Raphson iterations needed for the composite -2lnL to converge, and other statistical error measurements. A sample of the pro_# file can be viewed here.

The ter_# file is the actual LD map file. It consists of 3 columns: SNP name, physical distance (kb) and LD unit (LDU). A sample of the ter_# file can be viewed here.
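
A ter_# file with these 3 columns could be read as sketched below. The whitespace-separated layout, the skipping of non-numeric lines and the function name read_ter are assumptions; the SNP names and values in the sample are made up for illustration.

```python
def read_ter(lines):
    """Parse (SNP name, physical distance in kb, LDU) records from text lines."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # blank or unexpected line
        name, kb, ldu = fields
        try:
            records.append((name, float(kb), float(ldu)))
        except ValueError:
            continue  # e.g. a header line
    return records


# Made-up sample content, not real map data.
sample = [
    "rs0000001 100.000 0.000",
    "rs0000002 101.250 0.013",
    "rs0000003 104.800 0.410",
]
ldmap = read_ter(sample)
print(len(ldmap), ldmap[-1][2] - ldmap[0][2])
```

To read a real file, pass an open file object instead of the sample list, e.g. read_ter(open("ter_1")).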






5) Assembling the complete LDU map

The "CreateLDmap" command concatenates the segments via Contig Assembly. The complete LD map is saved as the "ldmapFinal.txt" file.

The "CreateLDmapFree" command concatenates the segments via Contig-Free Assembly. The complete LD map is saved as the "ldmapFinalFree.txt" file.


[wwsl@blue15 ~/result_1143996184]$ CreateLDmapFree


 

6) Backing Up result and useful information:


It is recommended to back up the two folders:
    1) result_<dataset>_<time>
    2) log_<dataset>_<time>

They contain valuable information about the LD maps as well as the computing log from the PBS.


The easiest way to transfer the results (the above 2 folders) from Iridis is via SSH Secure Shell, where the folders can be dragged and dropped anywhere on the Windows machine. Alternatively, FTP can be used to export the two folders, in which case they need to be zipped before transferring. "ascii" mode is recommended for transferring flat files, and "binary" mode for zip files.


7) Cleaning Up


To free more space for a new dataset, remove the folders and submission scripts of the completed dataset:

[wwsl@blue15 ~]$ rm -rf chb4_1.dat* result_chb4_1.dat_* log_chb4_1.dat*
 


 

 

 

Storage Space Limit
A space of 5Gb is given to every user on Iridis.

To check how much of the given 5Gb is used, type "du -h -s <folder>" from your home directory. Replace "wwsl" with your <userID>.

[wwsl@blue15 ~]$ du -h -s ../wwsl

The available space = 5Gb - 2.2Gb = ~2.8Gb
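
The same check can be scripted, as in the sketch below. It sums file sizes rather than disk blocks, so the figure can differ slightly from what "du" reports. The function names are hypothetical, and the 5Gb default is the quota quoted above.

```python
import os


def used_bytes(folder):
    """Sum the sizes of all regular files under folder (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def remaining_gb(folder, quota_gb=5.0):
    """Space left out of the quota, in Gb (1 Gb = 1024**3 bytes)."""
    return quota_gb - used_bytes(folder) / 1024**3


home = os.path.expanduser("~")
print(f"{remaining_gb(home):.2f} Gb left of 5 Gb")
```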

Alternatively, "du -s <folder>" displays the used space in Kb (1024-byte blocks), which gives a more precise figure.
 




Concatenating LD maps constructed from different datasets.
Coming Soon......




Effective Iridis Submission (execution time v.s number of jobs):


The PBS gives higher priority to jobs that request a shorter execution time and fewer processors than to jobs that request more processors and a longer execution time.

LDMAP_cluster adjusts the requested execution time according to the number of SNPs in the segment (job). It is important to request sufficient time for the job to complete: the job is removed from the computing node once the requested time is up, whether it has finished or not.



Typically a segment of 2000 SNPs (excluding contig(s): 100-200 SNPs) requires 2 - 5 hrs to complete its execution. The execution time varies between segments even when they contain the same number of SNPs. The computing time depends largely on the number of Newton-Raphson iterations required for the segment to converge. Usually 800 iterations are achieved within the first hour; 1600 iterations in ~2hrs; 3200 iterations in ~4hrs. Some segments might require >3200 iterations, therefore 10 hrs is the default requested execution time for all segments containing 2000 SNPs.
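
The timing figures above imply a roughly constant rate of about 800 iterations per hour for a 2000 SNP segment. The sketch below turns that into a quick estimate. Treating the rate as constant beyond 3200 iterations is an assumption, and the function names are hypothetical.

```python
ITERATIONS_PER_HOUR = 800  # from the figures above: 800 / 1 hr, 3200 / ~4 hrs


def estimated_hours(iterations):
    """Approximate run time for a 2000 SNP segment, assuming a constant rate."""
    return iterations / ITERATIONS_PER_HOUR


def fits_in_walltime(iterations, requested_hours=10):
    """True if the estimate stays within the requested execution time."""
    return estimated_hours(iterations) <= requested_hours


print(estimated_hours(3200))   # 4.0
print(fits_in_walltime(9000))  # False: the job would be removed before finishing
```

Under this assumption the default request of 10 hrs leaves room for about 8000 iterations.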

Currently (09Apr06), the LDMAP_cluster processes one sample (SNP dat file) at a time, the next sample is manually submitted after the completion of the previous run. This creates a time gap between each run. A way to






Program Outlook:

The construction of LD maps has benefited greatly from deploying the program on the large computing resource of Iridis at the University of Southampton.


The National e-Science Grid
The National Grid Service (NGS) provides computational and data grid resources. It consists of 7000 processors.
 

Source Code:

 

The LDMAP_cluster and its source code are freely available to non-commercial users only.

Please click here to download the source code (ver. 210906).

Please click here to download the source code (ver. 161006).

Alternatively, in a Linux environment, please use the following command, replacing the version number if necessary.

wget -q http://www.som.soton.ac.uk/research/geneticsdiv/epidemiology/LDMAP/sourceCode/LDMAP_cluster_210906.zip

 

 

Related Publications:

Lau, W., Kuo, T., Tapper, W., Cox, S., Collins, A. Exploiting large scale computing to construct high resolution linkage disequilibrium maps of the human genome. Bioinformatics 2007: 23; 517-519.

Collins, A., Lau, W., De La Vega, F. M. Mapping Genes for Common Diseases: The Case for Genetic (LD) Maps. Human Heredity 2004: 58; 2-9.

 

 

 

Program History:

29Mar06   

  • Transformed LDMAP_Condor into the LDMAP_cluster program for Iridis (e.g. job submission script).
  • Added a function to handle the last segment. If it contains < 800 SNPs, it is merged into the previous segment as one job; otherwise it is submitted as an individual job (this limit increased to 1000 SNPs on 12Apr06). see table
30Mar06
  • Refined the wording of the User Interface (UI).
  • Automated the creation of folders and the copying of the necessary programs from the LDMAP source folder (LDMAPsourceCode).
31Mar06  
  • Successful execution of 8 segments from ceu19.p1.dat on Iridis. The first job was submitted at 18:36, and the last segment finished at 23:25. The execution times of the segments varied from ~30 to 64 mins, depending largely on the # of Newton Raphson iterations required for the composite likelihood to converge.
  • Fixed an inconsistency in the contigInfo.txt file.
  • The program terminates if the dataset file exceeds 70k SNPs.
01Apr06      
  • Added a function to assign the length of execution time according to the size of the segment (# of SNPs per segment). Segment <2000 = 5hrs; Segment >2000 = 6hrs; Segment >3000 = 10hrs (this function was disabled on 12Apr06, with 10 hrs as the default computing time).
07Apr06       
  • Placed a limit on the program. LDMAP_cluster terminates if the sample exceeds 70k SNPs.
08Apr06    
  • Added a function for displaying the # of individual samples in the dataset file.
  • Added the "checkSeg" command for monitoring the status of submitted job(s) in a simple summary.

10Apr06

  • The program is now able to handle more than one dataset file at a time. It groups the related folders and files by assigning a unique name to each submitted dataset. However, the number of datasets that can be submitted depends on the space left in the limited user storage of 5Gb.

12Apr06

  • The minimum number of SNPs in the last segment is now 1000 SNPs (up from 800 SNPs, 29Mar06). This increases the accuracy of the LD map constructed in the last segment. see table
  • The checkSeg command now deletes the dataset file and temp.dat (~30Mb) from the segment folder once the job has started processing, and deletes the intermediate file (~40Mb) once the segment map (ter file) is constructed.
  • Added a display of the estimated required space (Mb) to the UI of the LDMAP_cluster program before submission, based on the assumption of 2000 SNPs per segment and a SNP dataset file size of 15Mb.
  • Fixed bugs in mergeFreeContig3.c that implements Contig-Free Assembly.

13Apr06

14Apr06
  • Program "reformatMapFile" is now able to merge (Contig Assembly) the sub LD map files constructed from split datasets. Chr.18 CEU phase 2 is the first dataset merged back into its full-length LD map, constructed from 2 split datasets.
  • Fixed a bug in removing the temp.dat file in checkSeg.c.
15Apr06
  • Program "reformatMapFileFree" is now able to merge (Contig-Free Assembly) the sub LD map files constructed from a split dataset. Phase 2 Chr.18 CEU is the first map successfully merged from 2 dataset files.
16Apr06
  • Fixed a bug in "reformatMapFileFree.c". The merged LD map no longer has duplicated SNPs. These were caused by indexing differences between the "ter" file and the sub LD map. reformatMapFileFree.c is now able to use the same program (mergeContigFree3.c) to carry out the process.
17Apr06
  • Fixed bugs in "mergeMapFileFreeContig.c" regarding the re-indexing of SNPs. The Phase 2 LD map of chromosome 1 is now constructed with the correct number of SNPs.
19Apr06
  • Possible bug in mergeContigFree.c that leads to errors in the indexing of SNPs in the final LD maps concatenated from split segments. mergeContigFree.c uses the number of lines in the ter file to determine the extended regions between adjacent segments; if a SNP is dropped from one of the ter files, it disrupts the SNP indexing in the segment and causes a frame shift effect. Will work on this soon..
21Apr06
  • LDMAP_cluster now assigns 2 sub-jobs (segments) per submission. This enables the program to fully utilize the two processors in the dual-processor machines on Iridis. The number of submissions is halved and twice as many jobs run simultaneously, therefore the throughput is expected to increase two fold. 
  • The previous version of LDMAP_cluster is renamed "LDMAP_cluster_singleSubmissionPerNode.c" and remains in the LDMAPsourceCode folder for deployment on clusters with single-processor machines only.
23Apr06
  • Added a function to the program retrieveResult.c for retrieving the job log files and submission scripts into the log folder.
26Apr06
  • LDMAP+ has been pre-compiled for RHEL (Red Hat Enterprise Linux) 4.0. It is prepared for the roll out of the upgrade from RHEL 3.0 to RHEL 4.0 on the execution nodes of the Iridis cluster.
29Apr06
  • From ver. 29Apr06 onward, LDMAP_cluster performs automatic online updates. This ensures the latest version of the program is delivered to all users. 
30Apr06
  • Fixed minor bugs to avoid the duplication of <dataset>_contigInfo.txt, and bugs in the UI output.
01May06
  • Fixed a bug in the submission script, introduced in ver. 30Apr06, that left alternate jobs (even-numbered segments) idle in each submission.
04May06
  • The user is now able to assign the computing hours required per segment (e.g. 10 hrs computing time is recommended for processing 2000 SNPs per segment).
08May06
  • Fixed a bug in the submission script when assigning the computing time of a single (odd) segment / job.

19May06

  • The window size now defaults to 75 SNPs, as this produces better LD maps in a shorter time. Amended the wording in LDMAP_cluster and updated the supporting website address.

20May06

  • Amended the wording for checkSeg.

14Aug06

  • Only the ldmapper1+ executable and the dataset file are now placed in each segment folder, rather than all the source code.

19Aug06

  • Added a feature giving the user full control over the "Interval Window size" and physical distance (kb) for the informative SNPs. 

21Sep06 (Major updates to the LDMAP_cluster program)

  • Fixed a bug in constructing the LD map from the entire dataset in a single segment submission.
  • Added an option for the user to choose between the entire dataset and segments for constructing the LD map.
  • Amended the web link and UI wording.
  • Fixed a potential bug where the job was submitted before the folder was created and the dataset file moved. This was reported previously in chromScan_cluster, where the dataset is large and takes longer to duplicate.

16Oct06

  • Added feature to allow user to execute LDMAP_cluster from any sub-directory.

 

 


Under Construction......

email: wwsl@soton.ac.uk


Copyright © 2006/07

created 02Apr06
updated 23Feb07

 
 
 
 
 
 