Open Access Highly Accessed Research

DNA methylation age of human tissues and cell types

Steve Horvath

Author Affiliations

Human Genetics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA 90095, USA

Biostatistics, School of Public Health, University of California Los Angeles, Los Angeles, CA 90095, USA

Human Genetics, Gonda Research Center, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA 90095-7088, USA

Genome Biology 2013, 14:R115  doi:10.1186/gb-2013-14-10-r115

Published: 21 October 2013

Additional files

Additional file 1:

DNA methylation data involving healthy (non-cancer) tissue. The rows correspond to 82 publicly available Illumina data sets. Column 1 reports the data set number and corresponding color code. Other columns report the source of the DNA (for example, tissue), Illumina platform, sample size n, proportion of females, median age, age range (minimum and maximum age), relevant citation (first author and publication year), and public availability (for example, GEO identifier). The column 'Data Use' reports whether the data set was used as a training set, test set, or served another purpose. The table also reports the age correlation, Cor(Age, DNAmAge), median error, and median age acceleration for DNAm age. The last two columns of the table report the age correlation (Cor LOOCV) and median error (Error LOOCV) resulting from a leave-one-data-set-out cross-validation analysis.

Format: CSV Size: 9KB Download file
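The per-data-set summary statistics named above (age correlation, median error, median age acceleration) can be illustrated with a minimal sketch. It assumes that the error is the median absolute difference between DNAm age and chronological age, and that age acceleration is DNAm age minus chronological age; the function names and the example values are hypothetical and are not taken from the table.

```python
from math import sqrt
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def summarize(age, dnam_age):
    """Summary statistics for one data set (assumed definitions, see lead-in)."""
    diffs = [d - a for a, d in zip(age, dnam_age)]  # age acceleration per sample
    return {
        "cor": pearson(age, dnam_age),
        "median_error": median(abs(d) for d in diffs),
        "median_acceleration": median(diffs),
    }


# Hypothetical values for illustration only.
age = [20.0, 35.0, 50.0, 65.0]
dnam_age = [22.0, 34.0, 55.0, 63.0]
summary = summarize(age, dnam_age)
```

In a leave-one-data-set-out analysis, the same summary would be computed on a data set whose samples were excluded when the predictor was fitted.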

Open Data

Additional file 2:

Materials and methods supplement. This document has the following sections: Limitations; Description of the healthy tissue and cell line data sets; Criteria guiding the choice of the training sets; Description of the cancer data sets; DNAm profiling and pre-processing steps; Normalization methods for the DNA methylation data; Explicit details on the definition of DNAm age; Chromatin state data used for Additional file 9; Comparing the multi-tissue predictor with other age predictors; Meta analysis for finding age-related CpGs; Variation of age related CpGs across somatic tissues; Studying age effects using gene expression data; Meta-analysis applied to gene expression data; Names of the genes whose mutations are associated with age acceleration; Is DNAm age a biomarker of aging?

Format: DOCX Size: 159KB Download file


Additional file 3:

Coefficient values for the DNAm age predictor. This Excel file provides detailed information on the multi-tissue age predictor defined using the training set data. The multi-tissue age predictor uses 353 CpGs, of which 193 and 160 have positive and negative correlations with age, respectively. The table also reports the coefficient values for the shrunken age predictor, which is based on 110 of the 353 CpGs. Although this information is sufficient for predicting age, I recommend using the R software tutorial since it implements the normalization method. The table reports a host of additional information for each CpG, including its variance, minimum value, maximum value, and median value across all training and test data. Further, it reports the median beta value in subjects aged younger than 35 years and in subjects older than 55 years.

Format: CSV Size: 131KB Download file
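As a rough illustration of how a coefficient table of this kind is applied, the sketch below computes a weighted sum of CpG beta values plus an intercept. The CpG identifiers, coefficients, and intercept are made up, and the sketch deliberately omits the normalization step and any further transformation of the score that the published method may apply (see Additional file 2 and the R software tutorial), so its output should not be read as a DNAm age.

```python
def linear_score(betas, coefficients, intercept):
    """Weighted sum of beta values over the CpGs listed in `coefficients`.

    `betas` maps CpG identifier -> beta value for one sample;
    `coefficients` maps CpG identifier -> regression coefficient.
    """
    missing = [cpg for cpg in coefficients if cpg not in betas]
    if missing:
        raise KeyError(f"missing beta values for: {missing}")
    return intercept + sum(coefficients[cpg] * betas[cpg] for cpg in coefficients)


# Hypothetical two-CpG predictor; the real predictor uses 353 CpGs.
coefficients = {"cg_example_1": 1.5, "cg_example_2": -2.0}
intercept = 0.25
betas = {"cg_example_1": 0.5, "cg_example_2": 0.25, "cg_unused": 0.9}

score = linear_score(betas, coefficients, intercept)
```

CpGs that are not part of the predictor (here `cg_unused`) are simply ignored, whereas a missing predictor CpG raises an error instead of being silently treated as zero.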


Additional file 4:

Age predictions in blood data sets. (A) DNAm age has a high correlation with chronological age (y-axis) across all blood data sets. (B-S) Results for individual blood data sets. The negligible age correlation in panel (O) reflects very young subjects that were either zero or 0.75 years (9 months) old. (S) DNAm age in different cord blood data sets (x-axis). Bars report the mean DNAm age (±1 standard error). The mean DNAm age in data sets 6 and 50 is close to its expected value (zero) and it is not significantly different from zero in data set 48. (T) Mean DNAm age across whole blood, peripheral blood mononuclear cells, granulocytes as well as seven isolated cell populations (CD4+ T cells, CD8+ T cells, CD56+ natural killer cells, CD19+ B cells, CD14+ monocytes, neutrophils, and eosinophils) from healthy male subjects [82]. The red vertical line indicates the average age across subjects. No significant difference in DNAm age could be detected between these groups, but note the relatively small group sizes (indicated by the grey numbers on the y-axis).

Format: PDF Size: 52KB Download file

This file can be viewed with: Adobe Acrobat Reader


Additional file 5:

Age predictions in brain data sets. (A) Scatter plot showing that DNAm age (defined using the training set CpGs) has a high correlation (cor = 0.96, error = 3.2 years) with chronological age (y-axis) across all training and test data sets. (B-J) Results in individual brain data sets. (G) The brain samples of data set 12 are composed of 58 glial cell (labeled G, blue color), 58 neuron cell (labeled N, red color), 20 bulk (labeled B, turquoise), and 9 mixed samples (labeled M, brown). (K) Comparison of mean DNAm ages (horizontal bars) across different brain regions from the same subjects [48] reveals no significant difference between temporal cortex, pons, frontal cortex, and cerebellum. Differing group sizes (grey numbers on the y-axis) reflect that some suspicious samples were removed in an unbiased fashion (Additional file 2). (L) Using data sets 54 and 55, I found no significant difference in DNAm age (x-axis) between cerebellum and occipital cortex from the same subjects [70].

Format: PDF Size: 18KB Download file


Additional file 6:

Age predictions in breast data sets. (A) DNAm age is highly correlated with age across all breast data sets, but the high error of 12 years reflects accelerated aging in normal adjacent breast cancer tissue (data sets 56, 57). (B-D) Relationship between DNAm age and chronological age in individual data sets. As expected, the lowest error (8.9 years) is observed in normal breast tissue (training data set 14, panel (B)).

Format: PDF Size: 5KB Download file


Additional file 7:

Ingenuity Pathway Analysis. The document describes the results from applying Ingenuity Pathway Analysis to the 353 genes that are located near the 353 clock CpGs. Top biological function analysis implicated cell death/survival (74 genes, P = 1.1E-7) and cellular growth/development (71 genes, P = 3.7E-5). Significant overlap can be observed for the following disease-related gene sets: cancer (109 genes, P = 9.2E-5), endocrine system disorder (28 genes, P = 2.6E-4), hereditary disorders (50 genes, P = 2.6E-4), and reproductive system disease (37 genes, P = 2.6E-4). Significant Ingenuity networks include a) hematological system development, tissue morphology, cell death and survival (P = E-37), and b) cellular growth and proliferation, cell signaling, developmental disorder (P = E-37).

Format: PDF Size: 2.3MB Download file


Additional file 8:

Marginal analysis of CpGs. The figure shows how individual CpGs (corresponding to points) relate to age and tissue variation. Red and blue points correspond to the 193 positively and the 160 negatively related clock CpGs, respectively. (A) The variance across adult somatic tissues is highly correlated with variance across fetal somatic tissues, which illustrates that it is robustly defined. Note that data set 77 [78] was not used for defining DNAm age. (B,C) Average variance of DNAm levels across adult and fetal somatic tissues, respectively. The blue and red bars correspond to groups of positively and negatively related clock CpGs, respectively. (D) Tissue variance across the training data (F statistic from ANOVA) is highly correlated (cor = 0.73) with tissue variance across adult somatic tissues (data set 77), which illustrates that tissue variance is robustly defined. (E) Pure (unconfounded) age effects in the training data (x-axis) relate to those in all data sets (y-axis). To estimate pure age effects, I used a meta-analysis method that implicitly conditions on data set (Materials and methods; Additional file 2). The logarithm (base 10) of the meta-analysis P-value was multiplied by -1 or 1 so that high positive (negative) values indicate that the CpG is positively (negatively) correlated with age. The high correlation illustrates that little information is lost by focusing on the training data. Further, note that the most significantly positively (red dots) and negatively related CpGs (blue dots) are used in the epigenetic clock. (F) Tissue variance in the training data (y-axis) versus the signed logarithm of the meta-analysis P-value in the training data (x-axis).

Format: PDF Size: 499KB Download file
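The signed logarithm of the P-value described for panels (E) and (F) can be written down in a few lines. This is a minimal sketch: the function name and the encoding of the direction of the age effect as +1 or -1 are assumptions made for illustration.

```python
from math import log10


def signed_log10_p(p_value, direction):
    """Return -log10(p) with the sign of the age effect.

    `direction` is +1 for a positive and -1 for a negative correlation with age,
    so strongly positive (negative) values indicate a significant positive
    (negative) age relationship.
    """
    if not 0 < p_value <= 1:
        raise ValueError("p_value must lie in (0, 1]")
    if direction not in (-1, 1):
        raise ValueError("direction must be -1 or +1")
    return -direction * log10(p_value)


# Hypothetical CpGs: one gains and one loses methylation with age.
gain = signed_log10_p(1e-6, +1)
loss = signed_log10_p(1e-6, -1)
```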


Additional file 9:

Characterizing the clock CpGs using DNA sequence properties. Figure titles are preceded by '+' or '-' if they report properties of positively related or negatively related clock CpGs, respectively. Panels in the first row (A-E) relate the clock CpGs to chromatin state annotation provided in [29]. The y-axis reports the mean number of cell lines (out of 9 cell lines) for which the CpGs were in the chromatin state mentioned in the title. (A) The bar plot shows that the 193 positively related CpGs were significantly (P = 1.6E-6) less likely to be in chromatin state 1 (active promoters) than the other 21k CpGs, which is not the case for the 160 negatively related CpGs (C). (B) Positively related CpGs were more likely to be in chromatin state 3 regions (poised promoters). (D) Negatively related CpGs were more likely to be in chromatin state 2 (weak promoters). (E) Negatively related CpGs are often located in chromatin state 4 regions (strong enhancers). (F) No significant relationship with CpG island status can be observed for the positively related CpGs. (K) Negatively related CpGs are significantly over-represented in shores. (G) Positively related CpGs were outside of RNApol2 bound regions (annotation from [87]). This is not the case for negatively related CpGs (L). (H-J) Positively related CpGs are over-represented near Polycomb-group target genes, that is, in regions with high occupancy of Suz12 (P = 7.1E-6, H), EED (P = 0.0030, I), and H3K27me3 (P = 0.0048, J). This is not the case for the negatively related CpGs (M-O).

Format: PDF Size: 6KB Download file


Additional file 10:

Estimating the heritability of age acceleration. Two twin data sets (data sets 41 and 50) are used to estimate the broad sense heritability of age acceleration (defined as the difference between DNAm age and chronological age). (A,E) Age histograms for data set 41 (median age 63 years, all females) and data set 50 (composed of newborns), respectively. (B,F) All twins irrespective of zygosity: age acceleration of the first twin (randomly chosen) versus that of the second twin. Each point corresponds to a twin pair and is colored red if the twins are monozygotic. (C,G) Monozygotic twins only. (D,H) Dizygotic twins only. The high correlations in monozygotic twins (cor = 0.4 for data set 41 and cor = 0.77 for data set 50) contrast sharply with those observed for dizygotic twins (cor = 0.20 and cor = -0.21).

Format: PDF Size: 5KB Download file
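To make the twin comparison concrete, the sketch below computes age acceleration as defined above and then a heritability estimate from the monozygotic and dizygotic twin correlations. It assumes Falconer's formula, H² = 2(cor_MZ − cor_DZ), truncated to the interval [0, 1]; this description does not state which estimator was used, so the formula and the truncation are assumptions, as is the pairing of the two dizygotic correlations with data sets 41 and 50 in that order.

```python
def age_acceleration(dnam_age, chronological_age):
    """Age acceleration: DNAm age minus chronological age (in years)."""
    return dnam_age - chronological_age


def falconer_heritability(cor_mz, cor_dz):
    """Falconer's estimate 2 * (cor_MZ - cor_DZ), truncated to [0, 1]."""
    h2 = 2.0 * (cor_mz - cor_dz)
    return min(1.0, max(0.0, h2))


# Twin correlations quoted in the description above.
h2_data_set_41 = falconer_heritability(0.4, 0.20)    # older female twins
h2_data_set_50 = falconer_heritability(0.77, -0.21)  # newborn twins
```

Under these assumptions the untruncated estimate for the newborn data exceeds 1, which is why the sketch caps the result.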


Additional file 11:

Aging effects in gene expression (mRNA) and DNAm data. Due to space limitations, I can only report results for the direct approach of matching each individual CpG to its corresponding gene symbol. Using publicly available gene expression data (Additional file 2), I do not find a significant relationship between age effects on messenger RNA levels and age effects on DNAm levels in (A) blood, (C) brain, (E) kidney, (G) muscle, and (I) CD8 T cells. For each data modality, I estimated the 'pure' age effect using a meta-analysis method that conditioned on data set (as described in Additional file 2). The y-axis reports a signed logarithm (base 10) of the meta-analysis P-value, that is, a high positive (negative) value indicates that the gene expression level increases (decreases) with age. Gene expression data and CpG data were matched according to gene symbol as described in [88]. Each point in the scatter plots corresponds to a CpG (x-axis) and the corresponding gene symbol (y-axis). Genes corresponding to the positively related and negatively related clock CpGs are colored in red and blue, respectively. (B,D,F,H,J,L) Mean age effect (y-axis) across gene groups defined by their corresponding CpG. (K,L) Aging effects on DNAm levels (x-axis) do not affect genes known to be differentially expressed between naive CD8 T cells and CD8 memory cells. The y-axis reports the signed logarithm of the Student t-test P-value of differential expression.

Format: PDF Size: 390KB Download file


Additional file 12:

Description of cancer data sets. The file describes 32 publicly available cancer tissue data sets and 7 cancer cell line data sets. Column 1 reports the data set number and corresponding color code. Other columns report the affected tissue, Illumina platform, sample size n, proportion of females, median age, age range (minimum and maximum age), relevant citation (TCGA or first author with publication year), and public availability. None of these data sets were used in the construction of the estimator of DNAm age. The table also reports the age correlation, Cor(Age, DNAmAge), median error, and median age acceleration.

Format: XLSX Size: 15KB Download file


Additional file 13:

DNAm age versus chronological age in cancer. Each point corresponds to a DNA methylation sample (cancer sample from a human subject). Points are colored and labeled according to the underlying cancer data sets as described in Additional file 12. (A) Across all cancer data sets, there is only a weak correlation (cor = 0.15, P = 1.9E-29) between DNAm age (x-axis) and chronological patient age (y-axis). The high error (40 years) reflects high age accelerations. (B) Each cancer/affected tissue shows evidence of significant age acceleration (y-axis) with an average age acceleration of 36.2 years. (C-W) Results for individual cancers/affected tissues. Several cancer tissues maintain moderately large age correlations (larger than 0.3), including brain (cor = 0.61) (E), thyroid (cor = 0.6) (U), kidney (cor = 0.45) (K,L), liver (cor = 0.42) (M), colorectal (cor = 0.37) (I), and breast (cor = 0.31) (F).

Format: PDF Size: 76KB Download file


Additional file 14:

Age acceleration versus tumor grade and stage. Panels correspond to the cancer data sets described in Additional file 12. Nominally significant negative correlations between grade and age acceleration can be observed in ovarian serous cystadenocarcinoma (panel G; P = 0.032) and uterine corpus endometrioid carcinoma (panel J; P = 0.019). A nominally significant positive correlation between stage and age acceleration can be observed for colon adenocarcinoma (panel O; P = 0.021). Only the highly significant negative correlation between stage and age acceleration in thyroid cancer (panel Z; P = 8.7E-9) remains significant after adjusting for multiple comparisons. Since grade and stage are often considered as ordinal variables, correlation test P-values are reported in all panels except the last. (H) For prostate cancer, the x-axis reports the Gleason sum score. The last panel shows that mean age acceleration in acute myeloid leukemia is not significantly related to French-American-British (FAB) morphology, but some groups (notably M6 and M7) are very small (rotated grey numbers).

Format: PDF Size: 13KB Download file


Additional file 15:

Age acceleration versus mutation count status in breast cancer. Mutation count status (x-axis) was defined by assigning tumor samples to the high mutation count group if their number of somatic mutations was larger than 50. Other thresholds lead to similar results. (A-L) Findings for Illumina 27K (A-F) and 450K data (G-L). (A,G) The barplots show that mean age acceleration (y-axis) is lower in breast cancer samples with high mutation count (compared to those samples whose somatic mutation count is less than 50). This result can also be found in ER+ (B,H), ER- (C,I), PR+ (D,J), PR- (E,K), and triple negative (F,L) breast cancer samples.

Format: PDF Size: 4KB Download file
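The grouping rule described above (high mutation count if the number of somatic mutations is larger than 50) can be sketched as follows. The function names, the handling of a count of exactly 50 (assigned to the low group here), and the sample values are assumptions for illustration and do not come from the data.

```python
from statistics import mean


def mutation_count_status(n_mutations, threshold=50):
    """Label a tumor sample 'high' if it has more than `threshold` somatic mutations."""
    return "high" if n_mutations > threshold else "low"


def mean_acceleration_by_status(samples, threshold=50):
    """Mean age acceleration per mutation count group.

    `samples` is an iterable of (somatic mutation count, age acceleration) pairs.
    Groups without any sample are left out of the result.
    """
    groups = {"high": [], "low": []}
    for n_mutations, acceleration in samples:
        groups[mutation_count_status(n_mutations, threshold)].append(acceleration)
    return {status: mean(values) for status, values in groups.items() if values}


# Hypothetical samples: (somatic mutation count, age acceleration in years).
samples = [(12, 40.0), (30, 50.0), (75, 20.0), (120, 10.0)]
group_means = mean_acceleration_by_status(samples)
```

Passing a different `threshold` makes it easy to check whether the comparison is sensitive to the cutoff.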


Additional file 16:

Selected significant gene mutations versus age acceleration. The TCGA data sets were stratified by cancer type and Illumina platform. Mean age acceleration (y-axis) versus mutation status (x-axis) for up to two of the most significant genes per data set. Note that age acceleration in bone marrow (AML) was most highly related to mutations in the following two genes: U2AF1 and TP53. Age acceleration in the two breast cancer data sets was most highly related to mutations in GATA3, TP53, and TTN. For kidney renal cell carcinoma (KIRC), only AKAP9 was significant. Strikingly, TP53 was among the top two most significant mutated genes in 4 out of 13 cancer data sets. More information on these genes is presented in Additional file 2.

Format: PDF Size: 7KB Download file


Additional file 17:

Effect of TP53 mutation on age acceleration. Mutations in TP53 are associated with significantly lower age acceleration in five cancers, including AML (P = 0.0023), breast cancer (P = 1.4E-5 and P = 3.7E-8), ovarian serous cystadenocarcinoma (P = 0.03) (I), and uterine corpus endometrioid (P = 0.00093). Marginally significant results could be observed in lung squamous cell carcinoma (P = 0.047 for the 27K data, but no significant result for the 450K data).

Format: PDF Size: 5KB Download file


Additional file 18:

DNAm age of cancer cell lines. (A) A high variation of DNAm age (x-axis) can be observed across various cancer cell lines (y-axis). The DNAm age is reported in Additional file 19. (B) Across all cell lines, DNAm age (x-axis) does not have a significant correlation with the chronological age of the patient from whom the cancer cell line was derived. (C) Results for osteosarcoma cell lines.

Format: PDF Size: 4KB Download file


Additional file 19:

Cancer lines and DNAm age. This Excel file reports the DNAm age and age acceleration for 59 cancer cell lines.

Format: CSV Size: 6KB Download file


Additional file 20:

R software tutorial. This file contains an R software tutorial that describes how to estimate DNAm age for data set 55. Further, it shows how to relate two measures of age acceleration to autism disease status. The R tutorial requires Additional files 21, 22, 23, 24, 25, 26 and 27 as input.

Format: DOCX Size: 57KB Download file


Additional file 21:

Probe annotation file for the Illumina 27K array. This comma-delimited text file (.csv file) is needed for the R software tutorial.

Format: CSV Size: 1.2MB Download file


Additional file 22:

Additional probe annotation file for the R tutorial. This comma-delimited text file (.csv file) is needed for the R software tutorial.

Format: CSV Size: 1MB Download file


Additional file 23:

Coefficient values of the age predictor. This comma-delimited text file (.csv file) is needed for the R software tutorial. This file is very similar to Additional file 3 but rows appear in a different order.

Format: CSV Size: 131KB Download file


Additional file 24:

R code for normalizing the DNA methylation data. This text file is needed for the R software tutorial. It contains R code for normalizing the DNA methylation data and adapts R functions described in [89].

Format: TXT Size: 20KB Download file


Additional file 25:

This text file is needed for the R software tutorial. It contains R code implementing analysis steps.

Format: TXT Size: 6KB Download file


Additional file 26:

Methylation data from data set 55. This comma-delimited text file (.csv file) contains the DNA methylation data needed for the R software tutorial.

Format: CSV Size: 5.3MB Download file


Additional file 27:

This comma-delimited text file (.csv file) contains the sample annotation data needed for the R software tutorial.

Format: CSV Size: 2KB Download file
