The Broad GDAC mirrors data from the DCC on a daily basis. Although all data is mirrored, not every sample is ingested into Firehose. Three main mechanisms filter samples so that only the most scientifically relevant ones make it into our standard data and analysis runs: redactions, replicate filtering, and blacklisting. This report summarizes the data that is ingested into Firehose, describes the three filtering mechanisms, lists the samples that are removed, and gives all available annotations from the DCC's Annotation Manager.
There were 0 redactions, 1139 replicate aliquots, 0 blacklisted aliquots, and 0 FFPE aliquots. The table below shows the counts of samples that were ingested into Firehose after filtering out redactions, replicates, and blacklisted data, and segregating FFPEs.
Cohort | BCR | CN | Clinical | MAF | Methylation | mRNA | miR |
---|---|---|---|---|---|---|---|
ACC | 92 | 90 | 92 | 92 | 80 | 79 | 80 |
BLCA | 412 | 412 | 412 | 412 | 412 | 408 | 409 |
BRCA | 1098 | 1094 | 1097 | 1044 | 1095 | 1085 | 1078 |
CESC | 308 | 295 | 307 | 305 | 307 | 304 | 307 |
CHOL | 51 | 36 | 45 | 51 | 36 | 36 | 36 |
COAD | 463 | 450 | 459 | 432 | 459 | 456 | 444 |
COADREAD | 635 | 614 | 629 | 589 | 624 | 622 | 605 |
DLBC | 58 | 48 | 48 | 48 | 48 | 48 | 47 |
ESCA | 185 | 184 | 185 | 184 | 185 | 161 | 184 |
GBM | 617 | 590 | 596 | 396 | 422 | 154 | 0 |
GBMLGG | 1133 | 1104 | 1111 | 909 | 938 | 665 | 512 |
HNSC | 528 | 517 | 528 | 510 | 528 | 500 | 523 |
KICH | 113 | 66 | 113 | 66 | 66 | 65 | 66 |
KIPAN | 941 | 886 | 941 | 693 | 890 | 883 | 873 |
KIRC | 537 | 530 | 537 | 339 | 533 | 530 | 516 |
KIRP | 291 | 290 | 291 | 288 | 291 | 288 | 291 |
LAML | 200 | 143 | 200 | 149 | 140 | 151 | 103 |
LGG | 516 | 514 | 515 | 513 | 516 | 511 | 512 |
LIHC | 377 | 375 | 377 | 375 | 377 | 371 | 372 |
LUAD | 585 | 518 | 522 | 569 | 578 | 513 | 513 |
LUSC | 504 | 503 | 504 | 497 | 503 | 501 | 478 |
MESO | 87 | 87 | 87 | 83 | 87 | 86 | 87 |
OV | 608 | 568 | 587 | 441 | 592 | 374 | 489 |
PAAD | 185 | 184 | 185 | 183 | 184 | 177 | 178 |
PANGI | 1298 | 1240 | 1257 | 1214 | 1287 | 1158 | 1225 |
PCPG | 179 | 178 | 179 | 179 | 179 | 178 | 179 |
PRAD | 500 | 495 | 500 | 498 | 498 | 495 | 494 |
READ | 172 | 164 | 170 | 157 | 165 | 166 | 161 |
SARC | 261 | 260 | 261 | 255 | 261 | 259 | 259 |
SKCM | 470 | 368 | 470 | 368 | 368 | 367 | 352 |
STAD | 478 | 442 | 443 | 441 | 478 | 375 | 436 |
STES | 663 | 626 | 628 | 625 | 663 | 536 | 620 |
TGCT | 150 | 134 | 134 | 150 | 150 | 150 | 150 |
THCA | 507 | 505 | 507 | 496 | 507 | 502 | 506 |
THYM | 124 | 124 | 124 | 123 | 124 | 119 | 124 |
UCEC | 560 | 540 | 548 | 542 | 547 | 543 | 538 |
UCS | 57 | 56 | 57 | 57 | 57 | 56 | 57 |
UVM | 80 | 80 | 80 | 80 | 80 | 80 | 80 |
Totals | 11353 | 10840 | 11160 | 10323 | 10853 | 10088 | 10049 |
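The relationship between the aggregate cohorts and the Totals row can be checked with a little arithmetic. The sketch below uses the BCR column from the table. The composition of each aggregate cohort (for example, COADREAD as COAD plus READ) is an assumption inferred from the counts rather than something this report states, and the function names are hypothetical.

```python
# BCR column copied from the table above, one entry per cohort.
bcr = {
    "ACC": 92, "BLCA": 412, "BRCA": 1098, "CESC": 308, "CHOL": 51,
    "COAD": 463, "COADREAD": 635, "DLBC": 58, "ESCA": 185, "GBM": 617,
    "GBMLGG": 1133, "HNSC": 528, "KICH": 113, "KIPAN": 941, "KIRC": 537,
    "KIRP": 291, "LAML": 200, "LGG": 516, "LIHC": 377, "LUAD": 585,
    "LUSC": 504, "MESO": 87, "OV": 608, "PAAD": 185, "PANGI": 1298,
    "PCPG": 179, "PRAD": 500, "READ": 172, "SARC": 261, "SKCM": 470,
    "STAD": 478, "STES": 663, "TGCT": 150, "THCA": 507, "THYM": 124,
    "UCEC": 560, "UCS": 57, "UVM": 80,
}

# ASSUMPTION: composition of the aggregate cohorts, inferred from the counts.
aggregates = {
    "COADREAD": ["COAD", "READ"],
    "GBMLGG": ["GBM", "LGG"],
    "KIPAN": ["KICH", "KIRC", "KIRP"],
    "STES": ["STAD", "ESCA"],
    "PANGI": ["STES", "COADREAD"],
}


def check_aggregates(counts, composition):
    """Return, per aggregate cohort, whether its count equals the sum of its parts."""
    return {
        name: counts[name] == sum(counts[part] for part in parts)
        for name, parts in composition.items()
    }


def total_without_aggregates(counts, composition):
    """Sum the counts of all cohorts that are not aggregates."""
    return sum(n for cohort, n in counts.items() if cohort not in composition)


print(check_aggregates(bcr, aggregates))
print(total_without_aggregates(bcr, aggregates))
```

For the BCR column, every assumed aggregate matches the sum of its parts, and the sum over the remaining cohorts equals the Totals entry, which suggests that the Totals row does not double-count the aggregate cohorts.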
Annotation data was taken from the TCGA Data Portal using the query string:
https://tcga-data.nci.nih.gov/annotations/resources/searchannotations/json?item=TCGA
Redaction information was generated by filtering for the annotationClassificationName "Redaction".
FFPE information was generated by filtering for "FFPE" in the annotation note text.
Additional FFPEs were identified from clinical data.
Remaining annotations were sorted into sections by annotationClassificationName.
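The sorting rules above can be sketched as a small classifier. This is an illustration only, not the Firehose implementation: it assumes each annotation has already been flattened into a dictionary, the `note` key is a hypothetical name for the note text, and checking for redactions before FFPE notes is an assumed order of precedence. Only `annotationClassificationName` comes from this report.

```python
def classify_annotations(annotations):
    """Split annotation records into redactions, FFPE notes, and other classes.

    ASSUMPTION: each record is a flat dict with an
    "annotationClassificationName" key and a hypothetical "note" key.
    """
    redactions = []
    ffpe = []
    others = {}  # remaining records, grouped by annotationClassificationName
    for record in annotations:
        classification = record.get("annotationClassificationName", "")
        note = record.get("note", "")
        if classification == "Redaction":
            redactions.append(record)
        elif "FFPE" in note:
            ffpe.append(record)
        else:
            others.setdefault(classification, []).append(record)
    return redactions, ffpe, others


# Made-up example records, for illustration only.
example = [
    {"annotationClassificationName": "Redaction", "note": "Subject withdrew consent"},
    {"annotationClassificationName": "Notification", "note": "FFPE sample"},
    {"annotationClassificationName": "Observation", "note": "Prior malignancy"},
]
redactions, ffpe, others = classify_annotations(example)
print(len(redactions), len(ffpe), sorted(others))
```

FFPEs identified from clinical data are not covered by this sketch, since the report does not describe how they are detected.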
The mRNA preprocess median module chooses the matrix for the platform (Affymetrix HG U133, Affymetrix Exon Array, or Agilent Gene Expression) with the largest number of samples.
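As a minimal sketch of that selection rule, assume the sample identifiers available on each platform are held in a dictionary keyed by platform name. The function name and data layout are hypothetical, and the report does not say how ties are broken; here the first platform encountered wins.

```python
def pick_largest_platform(samples_by_platform):
    """Return the platform name with the most distinct samples.

    ASSUMPTION: ties go to the first platform in iteration order;
    the report does not specify tie-breaking.
    """
    return max(
        samples_by_platform,
        key=lambda platform: len(set(samples_by_platform[platform])),
    )


# Made-up sample identifiers, for illustration only.
platforms = {
    "Affymetrix HG U133": ["S1", "S2"],
    "Affymetrix Exon Array": ["S1"],
    "Agilent Gene Expression": ["S1", "S2", "S3"],
}
print(pick_largest_platform(platforms))
```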
The mRNAseq preprocessor picks the "scaled_estimate" (RSEM) value from the Illumina HiSeq/GA2 mRNAseq level_3 (v2) data set and builds a log2-transformed mRNAseq matrix for downstream analysis. If samples overlap between the two platforms, the Illumina HiSeq samples are selected. The pipeline also creates a log2-transformed RPKM matrix from the HiSeq/GA2 mRNAseq level 3 (v1) data set.
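The overlap rule and the log2 transform can be illustrated as follows. This is a sketch under stated assumptions, not the preprocessor itself: the nested-dictionary layout (sample, then gene, then value) and the function names are hypothetical, and because the report does not say how zero values are treated, they are mapped to `None` here.

```python
import math


def merge_platforms(hiseq, ga):
    """Combine per-sample values from two platforms, preferring HiSeq on overlap."""
    merged = dict(ga)
    merged.update(hiseq)  # HiSeq entries overwrite GA entries for shared samples
    return merged


def log2_matrix(samples):
    """Apply log2 to every value; non-positive values become None (ASSUMPTION)."""
    return {
        sample: {
            gene: (math.log2(value) if value > 0 else None)
            for gene, value in genes.items()
        }
        for sample, genes in samples.items()
    }


# Made-up values, for illustration only.
hiseq = {"S1": {"GENE_A": 4.0}}
ga = {"S1": {"GENE_A": 1.0}, "S2": {"GENE_A": 8.0}}
print(log2_matrix(merge_platforms(hiseq, ga)))
```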
The miRseq preprocessor picks the "RPM" (reads per million miRNA precursor reads) value from the Illumina HiSeq/GA miRseq Level_3 data set and builds a matrix of log2-transformed values.
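To make the RPM unit concrete, the sketch below derives it from raw read counts. The preprocessor itself reads the RPM value directly from the Level_3 files, so this only illustrates the definition; the function name and the example counts are hypothetical.

```python
def reads_per_million(counts):
    """Scale raw read counts so that they sum to one million per sample.

    ASSUMPTION: `counts` maps each miRNA precursor to its read count
    for a single sample, and the total is taken over those precursors.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no miRNA precursor reads")
    return {mirna: count * 1_000_000 / total for mirna, count in counts.items()}


# Made-up counts for one sample, for illustration only.
sample_counts = {"mir-a": 1, "mir-b": 15624}
rpm = reads_per_million(sample_counts)
print(rpm["mir-a"])
```

The resulting values would then be log2-transformed to build the matrix described above.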
The methylation preprocessor filters methylation data for use in downstream pipelines. To learn more about this preprocessor, please visit the documentation.