Description of the GFF file (version 2)
(Adapted from GBrowse documentation written by Lincoln Stein)

A. The GFF 2 file format

SynBrowse, like GBrowse, is based around the GFF file format, which stands for "Gene Finding Format" and was invented at the Sanger Centre (http://www.sanger.ac.uk/Software/formats/GFF/). A GFF file is a flat, tab-delimited file in which each line corresponds to one annotation, or feature. Each line has nine columns and looks like this:

    1  curated  CDS  20715643  20715694  .  +  .  mRNA At1g55475.1 ; Note "expressed protein"

1. reference sequence
   This is the ID of the sequence that is used to establish the coordinate system of the annotation. In the example above, the reference sequence is 1 (chromosome 1).

2. source
   The source of the annotation. This field describes how the annotation was derived. In the example above, the source is "curated" to indicate that the feature is a curated feature. The names and versions of software programs are often used for the source field.

3. method
   The annotation method. This field describes the type of the annotation, such as "CDS". Together, the method and source describe the annotation type.

4. start position
   The start of the annotation relative to the reference sequence.

5. stop position
   The stop of the annotation relative to the reference sequence. Start is always less than or equal to stop.

6. score
   For annotations that are associated with a numeric score (for example, a sequence similarity), this field holds the score. The score units are completely unspecified, but for sequence similarities the score is typically percent identity. Annotations that don't have a score can use ".".

7. strand
   For annotations that are strand-specific, this field is the strand on which the annotation resides. It is "+" for the forward strand, "-" for the reverse strand, or "." for annotations that are not stranded.

8. phase
   For annotations that are linked to proteins, this field describes the phase of the annotation on the codons. It is a number from 0 to 2, or "." for features that have no phase.

9. group
   GFF provides a simple way of generating annotation hierarchies ("is composed of" relationships) by providing a group field. The group field contains the class and ID of an annotation which is the logical parent of the current one. The group field is also used to store information about the target of sequence similarity hits, and miscellaneous notes.

The sequences used to establish the coordinate system for annotations can correspond to sequenced clones, clone fragments, contigs or super-contigs.

In addition to a group ID, the GFF format allows annotations to have a group class. This makes sure that all groups are unique even if they happen to share the same name. For example, you can have a GenBank accession named AP001234 and a clone named AP001234 and distinguish between them by giving the first one a class of Accession and the second a class of Clone. You should use double quotes around the group name or class if it contains white space.

B. Sequence alignments

There are several cases in which an annotation indicates the relationship between two sequences. A common one is a similarity hit, where the annotation indicates an alignment. A second common case is a map assembly, in which the annotation indicates that a portion of a larger sequence is built up from one or more smaller ones. Both cases are indicated by using the Target tag in the group field.

For sequence alignments in SynBrowse, there are additional requirements for the source and method columns.
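
Before turning to those requirements, the nine-column layout from section A can be made concrete with a small parser. The following Python sketch is illustrative only and is not part of SynBrowse or GBrowse: the names Feature, parse_gff2_line and parse_group are hypothetical, and splitting the group field on ";" is a simplification that assumes no quoted value contains a semicolon.

```python
import shlex
from typing import List, NamedTuple, Optional, Tuple


class Feature(NamedTuple):
    """One GFF 2 line, with "." placeholders mapped to None."""
    ref: str
    source: str
    method: str
    start: int
    stop: int
    score: Optional[float]
    strand: Optional[str]
    phase: Optional[int]
    group: str


def parse_gff2_line(line: str) -> Feature:
    """Split one tab-delimited GFF 2 line into its nine columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    ref, source, method, start, stop, score, strand, phase, group = cols
    return Feature(
        ref, source, method, int(start), int(stop),
        None if score == "." else float(score),
        None if strand == "." else strand,
        None if phase == "." else int(phase),
        group,
    )


def parse_group(group: str) -> List[Tuple[str, List[str]]]:
    """Split the group field into (tag, values) pairs.

    Assumption: entries are separated by ";" and no quoted value
    contains a semicolon. shlex handles the double-quoted values.
    """
    pairs = []
    for part in group.split(";"):
        tokens = shlex.split(part)
        if tokens:
            pairs.append((tokens[0], tokens[1:]))
    return pairs


# The example line from section A, rebuilt with real tab characters.
example = "\t".join([
    "1", "curated", "CDS", "20715643", "20715694", ".", "+", ".",
    'mRNA At1g55475.1 ; Note "expressed protein"',
])
feature = parse_gff2_line(example)
group_pairs = parse_group(feature.group)
```

Here the first pair in group_pairs carries the group class and ID (mRNA, At1g55475.1) and the second carries the free-text note.
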
In each similarity hit record, the method column is set to "similarity". The entry in the source column depends on the kind of alignment. For protein alignments, it is "gene" (for entirely conserved genes) or "coding" (for single exons), followed by the similarity or identity percentage in deciles. For nucleotide alignments, it is "align" (for gapped alignments) or "conserved" (for gap-free alignments), followed by the similarity or identity percentage in deciles. The similarity or identity value itself is given in the score column. For example, a typical similarity hit that meets the SynBrowse requirements looks like this:

    AC123572  coding50  similarity  18950  19093  0.597  +  .  Target "Sequence:1" 25440931 25441074

Here, the source is "coding50", indicating that the feature is an alignment of a coding region with approximately 50% identity. This value is derived from the score 0.597 by multiplying by 100 and taking the decile (59.7 falls in the 50 decile). The method is "similarity". The group field contains the Target tag, followed by an identifier for the biological object. The GFF format uses the notation Class:Name for the biological object, and even though this is stylistically inconsistent, that's the way it's done. The object identifier is followed by two integers indicating the start and stop of the alignment on the target sequence. Note that the items in the group field must be separated by white space.

The previous example indicates that the section of AC123572 from 18,950 to 19,093 aligns to Arabidopsis chromosome 1, starting at position 25,440,931 and extending to position 25,441,074 on the forward strand. Unlike the main start and stop columns of the standard GFF described in section A, in a sequence alignment both the main start and the target start may be greater than the corresponding end.

C. Loading the GFF file into the database

Use one of the BioPerl script utilities bp_bulk_load_gff.pl, bp_load_gff.pl or bp_fast_load_gff.pl to load the GFF file into the database. For example, if your database is an empty MySQL database on the local host named "dicty", you can load the GFF file into it using bp_bulk_load_gff.pl like this:

    bp_bulk_load_gff.pl -c -d dicty my_data.gff

To update existing databases, use either bp_load_gff.pl or bp_fast_load_gff.pl. The latter is somewhat experimental, so use it with care.
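
Before loading, it can be useful to check that the similarity records follow the source-naming rule from section B. The Python sketch below is one possible check, not a SynBrowse tool: the function names are hypothetical, and it assumes (as in the example hit) that the score is a fraction between 0 and 1 and that the decile is obtained by rounding the percentage down to a multiple of ten. Decimal is used so that the arithmetic on the score text is exact.

```python
from decimal import Decimal
from typing import List

# Source prefixes named in section B.
PREFIXES = ("gene", "coding", "align", "conserved")


def expected_source(prefix: str, score_text: str) -> str:
    """Build the source label for a score, e.g. ("coding", "0.597") -> "coding50"."""
    percent = Decimal(score_text) * 100
    decile = int(percent // 10) * 10  # assumption: round down to a multiple of ten
    return f"{prefix}{decile}"


def check_similarity_line(line: str) -> List[str]:
    """Return a list of problems found in one GFF line (empty if none)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return [f"expected 9 columns, found {len(cols)}"]
    source, method, score = cols[1], cols[2], cols[5]
    if method != "similarity":
        return []  # only similarity hits carry the decile rule
    prefix = next((p for p in PREFIXES if source.startswith(p)), None)
    if prefix is None:
        return [f"unrecognised source {source!r}"]
    if score == ".":
        return ["similarity record has no score"]
    want = expected_source(prefix, score)
    if source != want:
        return [f"source {source!r} does not match score {score} (expected {want!r})"]
    return []


# The example hit from section B, rebuilt with real tab characters.
hit = "\t".join([
    "AC123572", "coding50", "similarity", "18950", "19093", "0.597", "+", ".",
    'Target "Sequence:1" 25440931 25441074',
])
problems = check_similarity_line(hit)
```

To scan a whole file, the same function could be applied to every line that does not start with "#", and the file loaded only if no problems are reported.
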