Hi Guys,

Below are the current thoughts/actions on my PANCAN8 plate; just wanted
to circulate for comment.  I've included you, Rehan, because I found the
integrative subtypes feedback last week to be useful, but please feel
free to stay silent if bandwidth is lacking.

Cheers,
Mike

---

Upshot of all the comments below is that we'll need a new freeze list, sooner
rather than later.  Here's why:

1)	For starters, we generated new hg19 WIGs for MAF coverage, and included
	them in the 2012_07_25 data run; IMO this makes the 7/25 dataset a better
	baseline freeze candidate than the 7/7 dataset.  But I hope automation saves
	us from that being too much of a pain.

2)  One Cohort or Many: I've been struggling with whether analyses should
	proceed from a single, massively bundled cohort or from the cleanly
	partitioned tumor sets.

	One reason to favor partitioning is that we're going to be dealing with
	hg18 vs. hg19 problems when combining some datatypes ... MAFs again come
	to mind as the poster child example.  Another is that bigger data means
	more breakage from memory exhaustion, longer runtimes, etc.

	The slides you walked through today, Josh (and Rehan's last week), also
	pointed this way, by showing the value of doing tissue-specific analyses
	FIRST and then integrating.

	But the hg18/hg19 concern may be irrelevant for other analyses, e.g. GISTIC
	(because the Broad GCC has already re-processed SNP6 data into hg18
	and hg19 versions).

	Do we care?

3)	Note that FH datasets and analyses combine COAD and READ sets into a
	single COADREAD cohort; should we consider having the Synapse portal also
	show them as one?

4)  Old data platforms and nomenclature:  last week Sheila Reynolds and I
	exchanged a couple of long emails on scrubbing the whitelist, which had
	several parts:

	a) Removing old platforms: we need consensus on what can be safely removed.

	b) Correct file URLs & archive nomenclature: a good chunk of data was
	   recently moved from protected to public areas in the DCC ... in principle
	   easy to reflect in our outputs, but needs to be done.
	   
	c) MAFs: similar aim to (b), but trickier because each of the manuscript AWGs
	   (in the PANCAN8 set) manually curated its MAFs (originally posted on the
	   Wiki or Jamboree), and these AWG MAFs are what FH uses (NOT any DCC MAFs
	   that sequencing centers may have submitted since manuscript publication,
	   even though those are likely to contain more mutation samples).

5)	Clinical: just a shout-out that FH stddata includes clinical, and that each
	analysis run includes survival curves plus other statistically-significant
	clinical correlations.  FOR FREE.  :)

	Seems a great opportunity for symbiosis:  provide our clinical correlations as
	reasonably useful low-hanging fruit for PANCAN8, even if not 100% perfect,
	but then improve them over time with feedback (to benefit entire TCGA and
	cancer communities).

6)	Methylation: we recognize & concur with Peter Laird's reservations about
	meth27/meth450 data, and have already conducted an in-house survey of the
	clustering diffs between the two platforms when treated as distinct cohorts.
	I'd be happy to share that with the PANCAN8 community on an upcoming call.

7)  MutSig:  I've impressed upon Gaddy the need for the latest version (with
	late replication correction, aka version "S2N") to be used for PANCAN8. He
	agrees and will be likewise prodding Mike Lawrence for it this PM.

8)	To help coordinate MutSig and other FH PANCAN analyses, allow me to introduce
	Jaegil Kim (cc'd) as the newest member of our Broad PANCAN8 efforts.