Consensus biomarker score
The consensus biomarker score was generated by combining the results from three types of sample comparison: (1) tumor versus paired normal, (2) tumor versus healthy tissue of origin, and (3) tumor versus all healthy tissues.
Comparison 1: primary tumor versus paired-normal tissue
The first comparison leveraged the paired nature of TCGA samples, meaning that the tumor and normal tissue samples originated from the same patient. This enabled an estimation of gene expression changes that were specific to malignant transformation, rather than those arising from variation among patients or tissues of origin. TCGA data were filtered to keep only patients with paired samples, i.e., those with both a primary tumor and a normal tissue sample. Furthermore, only cancer types with at least three patients after filtering were included, resulting in a final count of 693 patients spanning 20 cancer types. For each cancer type, a differential expression analysis was performed, comparing primary tumor with paired normal tissue, using the patient ID as a blocking factor.
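The filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record layout, the `paired_patients_by_cancer` helper, and the sample data are all hypothetical, and only the two rules stated in the text (both sample types present, at least three patients per cancer type) are taken from the source.

```python
from collections import defaultdict

# Hypothetical sample records: (patient_id, cancer_type, sample_type).
samples = [
    ("P1", "BRCA", "tumor"), ("P1", "BRCA", "normal"),
    ("P2", "BRCA", "tumor"), ("P2", "BRCA", "normal"),
    ("P3", "BRCA", "tumor"), ("P3", "BRCA", "normal"),
    ("P4", "BRCA", "tumor"),                       # no paired normal: dropped
    ("P5", "LUAD", "tumor"), ("P5", "LUAD", "normal"),
    ("P6", "LUAD", "tumor"), ("P6", "LUAD", "normal"),
]


def paired_patients_by_cancer(samples, min_patients=3):
    """Return {cancer_type: sorted patient IDs} for patients with both sample types.

    Cancer types with fewer than `min_patients` paired patients are excluded.
    """
    seen = defaultdict(set)  # (cancer_type, patient_id) -> sample types observed
    for patient_id, cancer_type, sample_type in samples:
        seen[(cancer_type, patient_id)].add(sample_type)

    paired = defaultdict(list)
    for (cancer_type, patient_id), types in seen.items():
        if {"tumor", "normal"} <= types:  # both sample types present
            paired[cancer_type].append(patient_id)

    return {c: sorted(p) for c, p in paired.items() if len(p) >= min_patients}


paired = paired_patients_by_cancer(samples)
```

In this toy input, the second cancer type has only two paired patients and is therefore excluded, mirroring the three-patient minimum.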
Comparison 2: primary tumor versus healthy matched tissue
The second comparison was conducted in recognition of the fact that paired-normal samples are not always representative of normal healthy tissue, as nearby tumor cells are known to perturb cellular function (Aran et al., 2017; Huang et al., 2016). Therefore, primary tumor TCGA samples were compared to GTEx healthy tissue samples (of the same tissue of origin) from non-cancer patients. For this analysis, all 9,760 primary tumor samples were used, not just those with a corresponding paired-normal tissue sample. A differential expression analysis was performed for each cancer type, comparing primary tumor samples with those of the corresponding healthy tissue from GTEx.
Comparison 3: primary tumor versus all healthy tissues
The final comparison sought to identify genes with relatively low expression throughout all tissues in the body compared to their expression in a tumor. We hypothesized that tumor-derived expression changes in such genes would be more detectable in a biofluid than those in genes expressed at similar or higher levels in many healthy tissues, as the latter could impart a "dilution" effect on the tumor-associated signal of interest. For this analysis, we were more interested in transcript abundance than in fold-changes between two conditions. Therefore, normalized gene counts (FPKM) were retrieved from TCGA for all tumor and paired normal tissue samples and converted to transcripts per million (TPM). TPM gene counts were also retrieved from the GTEx database for all measured tissues. The complete set of healthy tissues was obtained by combining healthy tissue samples from GTEx with paired normal samples from TCGA (Table S1).
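The text does not spell out how FPKM values were converted to TPM. A minimal sketch, assuming the standard within-sample rescaling (each gene's FPKM divided by the sample's FPKM total, multiplied by one million), is shown below; the `fpkm_to_tpm` helper and the gene names are hypothetical.

```python
def fpkm_to_tpm(fpkm):
    """Convert one sample's FPKM values ({gene: FPKM}) to TPM.

    Assumes the standard rescaling: TPM_i = FPKM_i / sum_j(FPKM_j) * 1e6.
    """
    total = sum(fpkm.values())
    if total <= 0:
        raise ValueError("sample has no expressed genes")
    return {gene: value / total * 1e6 for gene, value in fpkm.items()}


# Hypothetical FPKM values for a single sample.
sample = {"GENE_A": 10.0, "GENE_B": 30.0, "GENE_C": 60.0}
tpm = fpkm_to_tpm(sample)
```

Because the rescaling is applied per sample, TPM values sum to one million in every sample, which makes abundances comparable between TCGA and GTEx samples in the way this comparison requires.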
For each gene in a given cancer type, the TPM values among all TCGA primary tumor samples for that cancer type were compared to the TPM values for that gene across all normal samples for a particular tissue type, using a right-tailed Wilcoxon rank-sum test (i.e., the null hypothesis being that the tumor counts are not sampled from a distribution with a higher median than that of the normal tissue counts). This yielded a significance (p value) for each gene for a tissue type, where a low p value corresponded to genes with higher TPM values in primary tumor tissues than in the normal tissue. The comparison was repeated for all of the healthy tissue types, to obtain a p value for each tissue. The test was performed with each healthy tissue individually rather than pooling all of the normal samples together, as the pooled test would be biased by variations in the number of samples for different tissues. The p values obtained from the different tissue types were then combined (geometric mean) into a single p-like score (ranging from 0 to 1). The entire process was repeated for each of the different cancer types, yielding a single score for each gene and each cancer type.
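The per-tissue testing and geometric-mean combination can be sketched for a single gene as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper names and TPM values are hypothetical, and the p value uses a normal approximation to the rank-sum (Mann-Whitney U) statistic with no tie or continuity correction. A real analysis would more likely rely on an established routine such as `scipy.stats.mannwhitneyu` with `alternative="greater"`.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def right_tailed_rank_sum_p(tumor, normal):
    """Approximate one-sided p value that tumor values tend to exceed normal values.

    Assumption: normal approximation, no tie or continuity correction.
    """
    n1, n2 = len(tumor), len(normal)
    ranks = average_ranks(list(tumor) + list(normal))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2          # U statistic of the tumor group
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return 1 - NormalDist().cdf((u - mean_u) / sd_u)


def combined_score(tumor, normals_by_tissue):
    """Geometric mean of the per-tissue p values (one test per healthy tissue)."""
    p_values = [right_tailed_rank_sum_p(tumor, n) for n in normals_by_tissue.values()]
    p_values = [max(p, 1e-300) for p in p_values]    # guard against log(0)
    return math.exp(sum(math.log(p) for p in p_values) / len(p_values))


# Hypothetical TPM values for one gene in one cancer type.
tumor = [50, 60, 55, 70, 65]
normals = {"liver": [5, 8, 6, 7, 9], "lung": [52, 58, 61, 66, 54]}

p_liver = right_tailed_rank_sum_p(tumor, normals["liver"])
p_lung = right_tailed_rank_sum_p(tumor, normals["lung"])
score = combined_score(tumor, normals)
```

Testing each tissue separately gives every tissue equal weight in the geometric mean regardless of how many samples it contributes, which is the motivation the text gives for not pooling the normal samples.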