‘galign’ SNP and deletion search tools (for MacOSX Intel)
The advent of high throughput sequencing has allowed researchers to generate copious amounts of sequence data, raising the possibility that whole genome sequencing may become a viable method for cloning genes. One major current sequencing technology, Solexa, can generate roughly forty million 32-36 bp sequence reads in a single run. For organisms such as C. elegans or Drosophila, this amounts to about 10x coverage of the genome, sufficient, in principle, to guarantee that nearly every nucleotide in the genome is sampled at least once. In reality, the vagaries of sequencing reactions and the inability to assign reads from repetitive sequences to specific genomic locations are likely to limit coverage to around 98-99% of the genome.
The suite of programs described here is intended to identify single nucleotide polymorphisms (SNPs) or other alterations with respect to a reference genome sequence using sequence reads of 27-39 bp obtained using Solexa sequencing.
The files described here can be downloaded at http://shahamlab.rockefeller.edu/galign/galign0.8.zip. A beta version allowing processing of reads upto 78 nucleotides long can be downloaded at http://shahamlab.rockefeller.edu/galign/galign0.9beta.zip. For this beta version, convert Fastq files to 'galign' reads using Format_convert2. For alignment use Alignment_tool2. Source codes are available at http://shahamlab.rockefeller.edu/galign/Source_codes. A paper describing 'galign' in more detail has been submitted and is available as a preprint upon request (firstname.lastname@example.org).
To install, download the galign.zip file, expand the file (this is often done automatically), and place the ‘galign’ folder on your Desktop. The ‘galign’ folder should contain the following directory tree (Figure 1):
Following are descriptions of each program, followed by a step-by-step protocol for performing a search for C. elegans SNPs. These programs have been extensively tested on a Mac OSX Version 10.6.4 2.66 Ghz Quad-Core Intel Xeon machine, with 8 GB 1066 Mhz DDR3 RAM. However, it should be straight forward to recompile them from the source code for other machines.
The alignment tool uses two text files provided by the user to count the number of wild-type and mutant reads at every base in the genome.
Genome sequence file: This is a text file containing the sequence of the genome to be evaluated in lower-case text. The file is obtained using the Genome_assemble application (see below) and is placed in the ‘Genome_sequences’ folder in the 'galign' folder on the Desktop. The file contains a concatenated version of the genome, starting with chromosome I with no other text besides the raw sequence. The genome sequence file should be given a unique name. A file containing the C. elegans genome sequence (Wormbase release 195) is already placed in the ‘Genome_sequences’ folder.
Sequence reads file: This is a text file containing the sequence reads obtained from the Solexa runs in lower-case text. The file should be placed in the ‘Sequence_reads’ folder in the 'galign' folder on the Desktop. Since the final several bases of Solexa sequence reads are often less reliable, reads of 32-nucleotides should be used. The file should contain only sequence information and no other text. The file should be given a unique name. Each 32-nucleotide sequence read should be placed on a separate line. A tool for converting raw output from Solexa reads to the format specified here is supplied (see ‘Format_convert’ below).
Parsing the genome: Alignment begins with the genome sequence being read in groups of 13 consecutive nucleotides starting at one end of the genome sequence, and advancing one nucleotide at a time. Each genome position is assigned to a number array (the genome position array) whose element number is the genomic position, and whose value is a unique numerical descriptor of the corresponding 13-nt sequence. The descriptor is a base 4 representation of the 13-mer generated by the formula , where a(i) takes on the values 0, 1, 2, or 3 if the nucleotide at position i reads G, A, T, or C, respectively. Next, a set of 413 number arrays (pointer arrays) are generated. Each pointer array corresponds to a possible base 4 value of a 13-nt sequence, and the number of elements in each array corresponds to the number of occurrences of a given 13-nt sequence in the genome. The value of each array element is the genomic position of each 13-nt sequence.
Aligning a wild-type sequence read: Once the genome sequence has been parsed, 'galign' loads the first sequence read from the sequence-reads file into memory and divides it into 3 segments, A, B, and C, of respective lengths 13, 13, and x, with 1 ≤ x ≤ 13. Generally, sequences obtained from Illumina/Solexa reads are not accurate beyond 32 nucleotides, and the program currently uses x=6 as a default setting. Format_convert automatically generates sequence reads of 32 nts. 'galign' then checks whether sequence A exists in the genome, by referring to the pointer array corresponding to sequence A. If the sequence is found, the first genomic position of A is noted, and the program asks whether sequences B and C exist at positions 13 and 26 nts downstream of the genomic position of A, respectively, by referring to the pointer arrays. If fragment C is shorter than 13 nts, as is generally the case, the appropriate length genomic sequence is extracted from the genome position array value. If matches are found, a value of 1 is added to a new number array (the wild-type array) whose element numbers correspond to positions in the genome, and whose values represent the number of times that position has been read as wild type.
Aligning a sequence read with one nucleotide substitution: Single substitutions could occur in either fragment A, B, or C. If a match to the genome with fragment A is not detected, 'galign' attempts to match sequences B and C next to each other in the genome. If this is successful, the program revisits sequence A and examines whether genomic sequences 13 nts upstream of sequence B can be matched to sequence A with one substitution. Single substitutions are detected by subtracting the base 4 representation of the genomic sequence from that of the sequence read and only allowing a change of a multiple of a single power of 4 (see above).
If sequence A is matched in the genome but B is not, 'galign' checks to see if fragment C can be exactly aligned. If C matches, sequence B is tested for having a single substitution as above. A similar procedure is followed for putative substitutions in fragment C.
Finally, if single substitution matches are not found, 'galign' queries deeper elements of the pointer arrays and repeats the process for other genomic sites containing the A or B fragments. If no matches are detected at these positions, 'galign' generates the reverse complement of the input fragment, and repeats the above algorithm. If an alignment of a single-mismatched read is made, a value of 1 is added to each element of the wild-type array in which a sequence match was detected. A value of 1 is also added to an element of a new number array (the mutant array) corresponding to the position of the mismatched residue in the genome. In addition, the nucleotide sequence of a mutant 13-nt fragment, starting at the mutation site, and of the corresponding wild-type sequence, deduced from the genomic sequence, are added to two new character arrays (the mutant and wild-type sequence arrays, respectively) whose elements correspond to the position of the mutation site in the genome.
Output: After scanning through all sequence reads, the alignment tool generates a file containing the collected information. The file begins at position 1 of the first chromosome, displaying the value of the wild-type number array for that position, followed by the value of the mutant number array corresponding to that position. If the mutant number array value is not 0, the wild-type and the last mutant sequence stored in the respective sequence arrays are displayed.
Memory: For a genome of 100 Mbps, 'galign' requires 4*100,000,000=400,000,000 bytes for the genome array, and about 400,000,000 bytes for the pointer arrays, wild-type, and mutant number arrays. Memory allocation for the sequence strings is usually relatively small, unless the number of mutations is of the order of the number of nucleotides in the genome. Thus, roughly 1.6 Gb of memory needs to be allocated.
Nucleotide substitution data can be extracted from the alignment output file using the SNP_search tool. The output of SNP_search is a text file containing an ordered list of predicted nucleotide substitutions characterized by their positions with respect to a reference exon and with respect to the chromosome sequence, the predicted sequence alteration, and, in the case of lesions in coding regions, the predicted amino-acid substitution. The numbers of wild-type and mutant reads giving rise to the prediction are also displayed (Figure 2).
To perform its tasks, SNP_search also requires five accessory files: a file containing a list of exon names, exon start and end positions, chromosomal locations and strand information; two similar files for introns and intergenic regions, referenced by the upstream exon in the case of the intron file, and by the downstream exon for the intergenic region file; a file containing a list of exon names, the positions of the start and end of coding regions in each exon, and the reading frame (1, 2, or 3) at the 5' most nucleotide of the exon (unless it is the first exon of a gene, in which case the reading frame is assigned the value 1); and a file containing a list of exon names and their sequences. Accessory files for the C. elegans genome are present in the 'Feature_locations' folder.
Before executing SNP_search, the user is instructed to input three parameters restricting SNP detection: the minimum ratio of the number of mutant to wild-type reads required for calling a SNP, the minimum number of mutant reads to call a SNP, and whether SNPs should be scanned in exons, introns, or intergenic regions. SNP_search then scans the alignment output file, noting all positions satisfying the criteria set by the user-provided parameters. Flagged positions are then queried as to whether they fall within exons, introns, or intergenic regions, as indicated by the user, by comparing SNP position in the genome with the list of exon/intron/intergenic region start and end positions. SNP_search broadens the definition of an exon to include 10 nucleotides upstream and downstream of an exon, to facilitate identification of splice-site mutants.
For exonic domains, SNP_search next assesses whether the SNP falls inside a coding region. If so, the amino-acid affected by the SNP is recovered by subtracting the SNP position from the exon start position, calculating modulo 3 of the resulting value, and assessing the frame with respect to the known frame at the exon start site. The appropriate codon is then retrieved from the exon sequence files either directly, or by stitching sequences from adjacent exons in the event that a SNP affects either the first two or the last two nucleotides of an exon. Both wild-type and mutant amino acids are then printed to the output file. For introns and intergenic regions, SNP_search then outputs a file similar to that presented in Figure 2, except that the column displaying codon changes is absent.
Output of SNP_search is a text file located in the ‘SNP_results’ folder, with the name ‘yourname_snp_cexons.txt’. For intron/intergenic searches, ‘cexons’ is replaced by ‘cintrons’/’cintergenic’. For exons with strands ‘1’ or’-1’, positions are relative to the 5’ or 3’ end of the exon.
Deletions greater than one nucleotide: The Deletion_search tool of ‘galign’ identifies all positions in the genome that have no sequence reads, wild-type or mutant, as defined by the alignment tool output file. In addition to the output file of the alignment program, Deletion_search also requires 3 accessory files containing the names, start and end positions, chromosome, and strand locations of exons, introns, and intergenic regions. For the C. elegans genome, these are located in the 'Feature_locations' folder. The program searches through the alignment output file for an occurrence of 0 wild-type and 0 mutant reads. Once such a position is encountered, a running counter measures the length of the unread region, until a position with non-zero reads is reached. The start and end point of each unread region is then compared to exon/intron/intergenic region start and end sites to determine whether the region spans the user-indicated feature set. The output file produces an ordered list, based on chromosome position, of all genomic locations that lack hits (Figure 3).
Long stretches of unread genomic regions arising from sampling error of the genome are predicted to be exceedingly rare for a standard sequencing experiment. Thus, large stretches of sequence are likely to arise from three possible situations: First, since the ‘galign’ alignment tool assigns all repetitive sequences to the first occurrence of the sequence in the genome, large unread stretches could correspond to repetitive domains of the genome. Second, DNA regions whose amplification in sequencing reactions is not favored may not be represented in the sequence reads. Third, true deletions in the sequenced genome, with respect to the reference genome, may account for large unread domains.
Distinguishing among these options is not always possible. However, if coverage of the genome is extensive, runs of unread regions corresponding to true deletions should be flanked by single or, more rarely, more than one position containing at least one mutant read. This occurs because 32-nt sequence fragments that span the deletion site will be read as a mismatch by the alignment tool, unless the nucleotides at the positions flanking the deletion site are identical to the deleted nucleotides abutting the deletion site. Deletion_search flags sequences containing such flanking regions in the output file under the "Comments" heading.
The Deletion_search tool will tend to overestimate deletion sizes. Each genomic position can, in principle, be covered by 32 different 32-nt sequence reads. Thus, the probability of reading a given position in a standard sequencing experiment is theoretically 99.9997%. However, using the ‘galign’ alignment tool, genomic positions directly abutting deletion sites can only be covered by two sequence reads (and much more rarely by a larger number), reducing the chance of detection. Overall, therefore, the vast majority of deletions greater than one nucleotide are predicted to be found by ‘galign’.
Single nucleotide deletions: Unlike large deletions, only 45% of single nucleotide deletions will be detected by the Deletion_search algorithm. However, of the remaining 55%, 53% will be detected as substitutions by SNP_search. Overall, therefore, it is estimated that 98% of single-nucleotide deletions will be detected by ‘galign’.
Insertions: Detection of insertions by ‘galign’ follows similar considerations to deletion detection, however, unlike deletions, whose size can be estimated, insertion size is either difficult or impossible to extract from the ‘galign’ alignment output. As with single nucleotide deletions, insertion sites can be detected either by producing nucleotide substitution calls by SNP_search, or by presenting as unread genome positions using Deletion_search. Some insertions will, assuming full coverage, have a unique signature of two adjacent positions containing mutant reads. Although such dyads can arise from independent events, they may be an indication of insertion sites, and are displayed in the "Comments" column of the Deletion_search algorithm output.
Multiple local substitutions: Although the presence of two or more nucleotide substitutions within a 32 nt region could be revealed by SNP_search, these regions may also be detected by Deletion_search, since most sequence reads spanning these areas will contain more than one mismatch, and will not be read by the alignment tool, giving rise to unread positions. Distinguishing such instances from true deletions may not be possible in ‘galign’.
The output of ‘Deletion_search’ is a text file named ‘yourfirstname_del_cexons(or cintergenic or cintrons).txt’ located in the ‘Deletion_results’ folder in the ‘galign’ folder. The output consists of 7 columns: running index, exon name, chromosome number, strand, position of 5’ end of deletion, position of 3’ end of deletion, deletion length. Negative numbers are upstream of the exon, positive numbers are downstream. For exons with strands ‘1’ or’-1’, positions are relative to the 5’ or 3’ end of the exon.
The ‘simple_SNP’ tool processes files from the Alignment_result folder to reveal positions in the genome that conform to a user-provided SNP criterion, or that have no reads. The tool does not use information from the annotated genome, and does not determine whether displayed positions map in exons, introns, or intergenic regions.
The output of ‘simple_SNP’ is a text file named ‘yourfirstname_simpleSNP.txt’ located in the ‘SNP_results’ folder within the ‘galign’ folder.
The ‘Format_convert’ tool converts three common Solexa output files (gseq, fastq, fasta) to a file compatible with the ‘galign’ tool. The file to be converted must be placed on the Desktop prior to using the tool.
The output of ‘Format_convert’ is a text file named ‘yourfirstname_reads.txt’ located in the ‘Sequence_reads’ folder within the ‘galign’ folder. If conversion was successful, the output file should contain a list of 32-nucleotide sequences in lower case letters.
The 'Genome_assemble' tool allows chromosome sequences from a genome to be uploaded separately or from a single file. These sequences are then converted to the 'galign' format. Chromsome files should be in Fasta format.
The output of 'Genome_assemble' is a genome file deposited in the 'Genome_sequences' folder with the name given by the user.
Step-by-step procedure for identifying SNPs and other alterations in C. elegans
For more in-depth information, read the sections above.
1. The first step in identifying SNPs and other alterations in a mutant C. elegans strain is to subject DNA from that strain to Solexa sequencing. 10x coverage is the minimum recommended coverage to theoretically guarantee that every nucleotide gets read at least once. This corresponds to about 34,000,000 32-nucleotide reads.
2. Next, the sequencing facility analyzes data collected by the machine, and determines the quality of each sequence read. Generally you will be provided with one or more of the following output files:
gseq: qseq files will have many lines, each of which contains (separated by spaces) the machine name, run number, lane number, tile number, x coordinate, y coordinate, indexing value, read number, sequence, quality score, and quality filter pass status.
Fastq: Fastq files will have the read ID on one line, the sequence on the next line, the ID again on the next line, and the quality score on the next line. This is repeated for all reads.
Fasta: Fasta files will have the read ID on one line, followed by the sequence on the following line. This is repeated for all reads.
If you have files in any of these formats you can use the ‘Format_convert’ tool in ‘galign’ to convert them to a file compatible with the ‘galign’ program. To do so, the original sequence file must be placed on the Desktop.
If you do not have a file in one of these formats, you will need to generate a text-only file containing 32-nucleotide reads in lower-case letters, arranged one per line, with no additional text.
To use ‘Format_convert’, double-click on the ‘Format_convert’ icon. This will launch a ‘Terminal’ window. Follow the instructions and enter all the information. The output file will be placed automatically in the ‘Sequence_reads’ folder of ‘galign’.
3. Double-click on the ‘Alignment_tool’ icon inside the ‘galign’ folder to run the alignment program. This will launch a ‘Terminal’ window. Enter all the requested information. The genome sequence file is ‘elegansgenome195.txt’. This program takes about 1-hr to run through 40,000,000 32-nucleotide Solexa reads.
4. To identify single nucleotide substitutions, double-click on the ‘SNP_search’ icon. This will launch a ‘Terminal’ window. Enter all the requested information- the Alignment_results file has the name ‘yourfirstname_galign.txt’ and is located in the ‘Alignment_result’ folder. Typically this program runs for about a minute. The output file can be found in the ‘SNP_results’ folder, with the name ‘yourname_snp_(cexons or cintrons or cintergenic).txt’. The details of the output are elaborated on in the description of the ‘SNP_search’ tool above. If you are interested in all substitutions in the genome, regardless of whether they are in exons, introns, or intergenic regions, you will need to run this program once for each region.
5. To identify putative deletions and other alterations, double-click on the ‘Deletion_search’ icon. This will launch a ‘Terminal’ window. Enter all the requested information- the Alignment_results file has the name ‘yourfirstname_galign.txt’ and is located in the ‘Alignment_result’ folder. Typically this program takes about a minute to run. The output file can be found in the ‘Deletion_results’ folder, with the name ‘yourname_del_(cexons or cintrons or cintergenic).txt’. The details of the output, and what lesions they might represent, if any, are elaborated on in the description of the ‘Deletion_search’ tool above. If you are interested in all putative deletions in the genome, regardless of whether they are in exons, introns, or intergenic regions, you will need to run this program once for each region.
FEEDBACK ON ANY OF THE PROGRAMS OR EXPLANATORY MATERIAL IS WELCOME.