LncRNA Data Page - Gencode Version 7 Annotations

These data are described in the publication:

The GENCODE v7 catalogue of human long non-coding RNAs: Analysis of their gene structure, evolution and expression

Derrien et al

On this page you will find various data files relating to our work on lncRNA. 

Please note:

1) All coordinates refer to human genome version GRCh37 (hg19).

2) The genesets here are all from GENCODE version 7, unless stated otherwise.

3) Original GENCODE annotations can be found here: http://www.gencodegenes.org/releases/7.html

4) This page is maintained by Rory Johnson (rory.johnson@crg.eu; www.roryjohnson.org). Please contact me with any questions or requests regarding this data.

 


Compendium of GENCODE v7 lncRNA Annotation: Supplementary Table 1

This table summarises almost everything we know about the Gencode v7 lncRNA catalogue.

 

 

This table contains over 60 metrics (merged into 32 fields) computed for 14,880 Gencode lncRNA transcripts:

# - 1 LncRNA_GeneId : lncRNA gene ID (v7)

# - 2 LncRNA_Txid : lncRNA transcript ID (v7)

# - 3 hg19_coordinates : genome coordinates (chr:start-stop)

# - 4 No_Exons : number of exons of the transcript

# - 5 LncRNA_tx_size : total length of concatenated exons

# - 6 LncRNA_IntronS_size : total length of all introns

# - 7 category : intergenic|exonic|intronic|encompassing(also known as overlapping)

# - 8 GeneId_Cod.Pot : Coding potential score computed by the GeneId software

# - 9 Genc_polyA : 0|1 : does the transcript have a polyA signal annotated by GENCODE in 100bp around TTS (1=yes/0=no)

# - 10 DiTAG_support : 0|1 : does the transcript have diTAG support for both ends (1=yes/0=no)

# - 11 5p_completeness_by_cage : 0|1 : is the 5' end complete according to polyA+ whole-cell CAGE data from 12 cell lines (CAGE clusters filtered by IDR 0.01 and TSS prediction strength >= 0.5) (1=yes/0=no)

# - 12 overV7Coding,overV3cCoding,overRefSeqCoding : 0|1 => does the transcript have at least one exon overlapping (by at least 1 bp), in sense orientation, a protein-coding exon in the following 3 datasets: Gencode v7 protein-coding exons, Gencode v3c protein-coding exons, RefSeq protein-coding exons? (1=yes/0=no)

# - 13 overV7Coding,overV3cCoding,overRefSeqCoding : 0|1 => does the transcript overlap a protein-coding exon on the same strand? (1=yes/0=no)

# - 14 phastcons_primate_transcript_score : cumulative conservation score based on phastCons program at the transcript level in primate.

# - 15 phastcons_primate_intron_score : cumulative conservation score based on phastCons program at the intron level in primate.

# - 16 phastcons_primate_promoter_score : cumulative conservation score based on phastCons program at the promoter level in primate.

# - 17 phastcons_mammal_transcript_score : cumulative conservation score based on phastCons program at the transcript level in mammal.

# - 18 phastcons_mammal_intron_score : cumulative conservation score based on phastCons program at the intron level in mammal.

# - 19 phastcons_mammal_promoter_score : cumulative conservation score based on phastCons program at the promoter level in mammal.

# - 20 phastcons_vertebrate_transcript_score : cumulative conservation score based on phastCons program at the transcript level in vertebrate.

# - 21 phastcons_vertebrate_intron_score : cumulative conservation score based on phastCons program at the intron level in vertebrate.

# - 22 phastcons_vertebrate_promoter_score : cumulative conservation score based on phastCons program at the promoter level in vertebrate

# - 23 PrimatesSpe : 0|1 => is the transcript only found in primates? (1=yes/0=no)

# - 24 Human_1PrimateSpe : 0|1 => is the transcript only found in human + 1 primate species? (1=yes/0=no)

# - 25 Species_Homologs : list of species where putative lncRNA orthologue could be found (comma separated).

# - 26 gene_biotype : gene biotype as annotated in the source lncRNA gencode file.

# - 27 tx_biotype : transcript biotype as annotated in the source lncRNA gencode file.

# - 28 CSHL_WholeCell_LPA+-_15CellLines : format cellline1@RPKM1,cellline2@RPKM2... => RPKM at the gene level of the CELL compartment from the 115 CSHL RNASeq experiments (15 cell lines) where polyA+ and - together with bioreplicates were averaged.

# - 29 familyIndex : Name of the family.

# - 30 familySize : Size of the family (0 if no family).

# - 31 familyIndex_ARfiltered : Name of the family filtered for Repeats.

# - 32 familySize_ARfiltered : Size of the family filtered for Repeats (0 if no family).
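Two of the fields above pack several values into one string: field 3 (chr:start-stop) and field 28 (cellline1@RPKM1,cellline2@RPKM2...). The following is a minimal Python sketch of how they could be unpacked. The function names and the example values are hypothetical, and the sketch assumes that each field follows exactly the layout given in the field descriptions.

```python
import re
from typing import Dict, Tuple


def parse_coordinates(field: str) -> Tuple[str, int, int]:
    """Split a 'chr:start-stop' string (field 3) into chromosome, start, stop."""
    match = re.fullmatch(r"([^:]+):(\d+)-(\d+)", field.strip())
    if match is None:
        raise ValueError(f"unexpected coordinate format: {field!r}")
    chrom, start, stop = match.groups()
    return chrom, int(start), int(stop)


def parse_expression(field: str) -> Dict[str, float]:
    """Turn 'cellline1@RPKM1,cellline2@RPKM2' (field 28) into a dict."""
    values: Dict[str, float] = {}
    for item in field.strip().split(","):
        if not item:
            continue  # tolerate a trailing comma
        # Split on the last '@' so cell line names may contain '@' themselves.
        cell_line, rpkm = item.rsplit("@", 1)
        values[cell_line] = float(rpkm)
    return values


# Hypothetical example values, not taken from the table itself.
print(parse_coordinates("chr1:1000-2000"))
print(parse_expression("K562@1.5,HeLa@0"))
```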

 

Subclassifications of lncRNAs

The following files contain lists and classifications of lncRNAs as described in Derrien et al Figure 1B.

Files contain lncRNA coordinates in either BED or BEDPE format. For the latter, the coordinates and identity of the nearest protein-coding gene are also provided.

Numbers indicate transcript counts.

       
14880 All GENCODE version 7 lncRNAs
  9520 Intergenic lncRNAs
    4165 Intergenic Same-Strand lncRNA
    1938 Convergent lncRNA
    3417 Divergent lncRNA
  5360 Genic lncRNAs
    2409 Exonic Antisense lncRNA
    2784 Intronic lncRNA
      563 Intronic Sense lncRNA
      2221 Intronic Antisense lncRNA
    167 Overlapping lncRNA
      52 Overlapping Sense lncRNA
      115 Overlapping Antisense lncRNA
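
The counts above form a nested classification in which each class should equal the sum of its subclasses. As a quick sanity check, the sketch below encodes the hierarchy as a Python structure and reports any class whose subclasses do not add up. The tuple layout and the function name are illustrative choices, not part of the released files.

```python
# Each node is (name, transcript count, list of child nodes).
tree = ("All GENCODE version 7 lncRNAs", 14880, [
    ("Intergenic lncRNAs", 9520, [
        ("Intergenic Same-Strand lncRNA", 4165, []),
        ("Convergent lncRNA", 1938, []),
        ("Divergent lncRNA", 3417, []),
    ]),
    ("Genic lncRNAs", 5360, [
        ("Exonic Antisense lncRNA", 2409, []),
        ("Intronic lncRNA", 2784, [
            ("Intronic Sense lncRNA", 563, []),
            ("Intronic Antisense lncRNA", 2221, []),
        ]),
        ("Overlapping lncRNA", 167, [
            ("Overlapping Sense lncRNA", 52, []),
            ("Overlapping Antisense lncRNA", 115, []),
        ]),
    ]),
])


def find_mismatches(node):
    """Return (name, stated count, summed count) for every inconsistent class."""
    name, count, children = node
    mismatches = []
    if children:
        total = sum(child[1] for child in children)
        if total != count:
            mismatches.append((name, count, total))
        for child in children:
            mismatches.extend(find_mismatches(child))
    return mismatches


print(find_mismatches(tree))  # an empty list means every level adds up
```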
 

Microarray Annotation and Data

 

The GENCODE v3c custom microarray design is freely available and can be used to order arrays from Agilent, under Design ID AMADID 028073:

https://earray.chem.agilent.com/earray/PublishDesignLogin.do?eArrayActio...

These files relate to the custom lncRNA microarray platform. The array was printed by Agilent with custom 60mer probes targeting the GENCODE version 3c lncRNA catalogue. Altogether the array contains 58426 probes targeting 9747 lncRNA transcripts from 6314 gene loci.

This file summarises the probe sequences:

The custom Agilent GENCODE v3c lncRNA microarray probes were designed against this FASTA file of lncRNA sequences:

The following tab delimited files contain normalised, log2 transformed expression values at probe / transcript / gene level in 31 human tissues. NA denotes cases where the probe / transcript / gene is not detected. Probes / transcripts / genes that are not detected in any of the 31 samples are not in the list.
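
A minimal sketch of reading one of these tab-delimited expression files in Python follows. It assumes, without confirmation from the files themselves, that the first row is a header, that the first column holds the probe / transcript / gene identifier, and that each remaining column is one tissue. The function name and the example content are hypothetical. NA is mapped to None so that undetected values are not confused with a log2 value of 0.

```python
import csv
import io


def read_expression_table(handle):
    """Read a tab-delimited expression table into {id: {tissue: value or None}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    tissues = header[1:]  # assumption: first column is the identifier
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        values = {}
        for tissue, raw in zip(tissues, row[1:]):
            # "NA" marks a probe / transcript / gene not detected in that tissue.
            values[tissue] = None if raw == "NA" else float(raw)
        table[row[0]] = values
    return table


# Hypothetical two-tissue example; real files would have 31 tissue columns.
example = io.StringIO("id\tliver\tbrain\nprobe1\t7.5\tNA\n")
print(read_expression_table(example))
```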

 

This microarray data is also available at NCBI GEO under ID GSE34894.

 

LncRNA Subcellular Localisation

These files contain the ratio of RNA abundances between the indicated subcellular fractions, for each cell type. Each value is the ratio of the RPKM values measured by RNAseq in the two subcellular RNA fractions. Note that only lncRNAs reliably detected in both fractions are included for each cell type. Details can be found in the Methods section of Derrien et al.

File format: transcript ID / IDR1 / RPKM1 / IDR2 / RPKM2 / (RPKM1/RPKM2), where RPKM1/IDR1 refer to the first RNA fraction mentioned in the title, and RPKM2/IDR2 refer to the second.
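
The six-column layout above can be parsed as in the following Python sketch. It assumes the columns are separated by whitespace (tabs or spaces), which is not stated on this page. The record type, the function names and the example line, including its transcript ID, are hypothetical. A log2 transform is added only as a convenience so that enrichment in either fraction is symmetric around 0.

```python
import math
from typing import NamedTuple


class FractionRatio(NamedTuple):
    transcript_id: str
    idr1: float
    rpkm1: float
    idr2: float
    rpkm2: float
    ratio: float  # RPKM1 / RPKM2 as given in the file


def parse_ratio_line(line: str) -> FractionRatio:
    """Parse one 'transcript ID / IDR1 / RPKM1 / IDR2 / RPKM2 / ratio' record."""
    fields = line.split()  # assumption: whitespace-delimited columns
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
    idr1, rpkm1, idr2, rpkm2, ratio = (float(value) for value in fields[1:])
    return FractionRatio(fields[0], idr1, rpkm1, idr2, rpkm2, ratio)


def log2_ratio(record: FractionRatio) -> float:
    """Recompute log2(RPKM1 / RPKM2) from the two RPKM columns."""
    return math.log2(record.rpkm1 / record.rpkm2)


# Hypothetical record, not taken from the released files.
record = parse_ratio_line("ENST00000000001.1\t0.01\t8.0\t0.02\t2.0\t4.0")
print(record.transcript_id, record.ratio, log2_ratio(record))
```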

 

LncRNA PolyA Status

These files contain values for the nuclear PolyA+ / PolyA- ratio of lncRNAs. Each value is the ratio of the RPKM values measured by RNAseq in the PolyA+ and PolyA- preparations. Note that only lncRNAs reliably detected in both fractions are included for each cell type. Details can be found in the Methods section of Derrien et al.

File format: transcript ID / IDR1 / RPKM1 / IDR2 / RPKM2 / (RPKM1/RPKM2), where RPKM1/IDR1 refer to the first RNA fraction mentioned in the title, and RPKM2/IDR2 refer to the second.

 

Histone Modifications

This figure contains extended histone modification profiles. Each panel shows the mean density of ChIPseq reads for the indicated histone modification, across the aligned transcription start sites (TSS) of lncRNAs or protein-coding genes.

 

Overlap with lncRNAdb

We overlapped the Gencode v7 lncRNA annotations with representative transcripts from lncRNAdb. 47 out of 91 human lncRNAdb entries are represented by one or more Gencode v7 transcripts (total Gencode v7 lncRNAs: 289), shown here (Genome version hg19):

Repeating the above overlap, instead using all transcript isoforms from lncRNAdb, we retrieve 334 / 432 human lncRNAdb entries (representing 297 Gencode v7 lncRNA transcripts).
