Hi Guys,

Below are the current thoughts/actions on my PANCAN8 plate; I just wanted to circulate them for comment. I've included you, Rehan, because I found the integrative subtypes feedback last week to be useful, but please feel free to stay silent if bandwidth is lacking.

Cheers,
Mike

---

Upshot of all the comments below is that we'll need a new freeze list, sooner rather than later. Here's why:

1) For starters, we generated new hg19 WIGs for MAF coverage and included them in the 2012_07_25 data run; IMO this makes the 7/25 dataset a better baseline freeze candidate than the 7/7 dataset. I hope automation saves us from the re-freeze being too much of a pain.

2) One Cohort or Many: I've been struggling with whether analyses should proceed from a single, massively bundled cohort or from the cleanly partitioned tumor sets. One reason to partition is that we're going to be dealing with hg18 vs. hg19 problems when combining some datatypes ... MAFs again come to mind as the poster child example. Another is that bigger data means more breakage from memory exhaustion, longer runtimes, etc. The slides you walked through today, Josh (and Rehan last week), also pointed this way, by showing the value of doing tissue-specific analyses FIRST and then integrating. But the hg18/hg19 concern may be irrelevant for other analyses, e.g. GISTIC (because the Broad GCC has already re-processed SNP6 data into hg18 and hg19 versions). Do we care?

3) Note that FH datasets and analyses combine the COAD and READ sets into a single COADREAD cohort; should we consider having the Synapse portal also show them as one?

4) Old data platforms and nomenclature: last week Sheila Reynolds and I exchanged a couple of long emails on scrubbing the whitelist, which had several parts:

   a) Removing old platforms: we need consensus on what can be safely removed.

   b) Correcting file URLs & archive nomenclature: a good chunk of data was recently moved from protected to public areas in the DCC ... in principle easy to reflect in our outputs, but it needs to be done.

   c) MAFs: similar aim to (b), but trickier, because each of the manuscript AWGs (in the PANCAN8 set) manually curated its MAFs (and originally posted them on the Wiki or Jamboree). These AWG MAFs are what FH uses, NOT any DCC MAFs that may have been submitted by sequencing centers since manuscript publication, even though the latter are likely to contain more mutation samples.

5) Clinical: just a shout-out that FH stddata includes clinical data, and that each analysis run includes survival curves plus other statistically significant clinical correlations. FOR FREE. :) This seems a great opportunity for symbiosis: we provide our clinical correlations as reasonably useful low-hanging fruit for PANCAN8, even if not 100% perfect, and then improve them over time with feedback (to benefit the entire TCGA and cancer communities).

6) Methylation: we recognize & concur with Peter Laird's reservations about meth27/meth450 data, and have already conducted an in-house survey of the clustering differences between the two platforms when treated as distinct cohorts. I'd be happy to share that with the PANCAN8 community on an upcoming call.

7) MutSig: I've impressed upon Gaddy the need for the latest version (with late replication correction, aka version "S2N") to be used for PANCAN8. He agrees and will likewise be prodding Mike Lawrence for it this PM.

8) To help coordinate MutSig and other FH PANCAN analyses, allow me to introduce Jaegil Kim (cc'd) as the newest member of our Broad PANCAN8 efforts.