Preprocessing of clinical data
Overview
Introduction

The clinical information for each TCGA tumor sample is stored in an XML file, whose entries include the patient ID and tumor and treatment information. These XML files have been preprocessed for further analysis.

Summary

Clinical data are generated for the tier 1 clinical variables listed in Table 1.

Table 1.  Tier 1 clinical variables

Tumor.Feature                  Date.Statistics
gender                         dccuploaddate
primarysiteofdesease           dateofbirth
histologicaltype               dateofdeath
tumorstage                     dateoflastfollowup
tumorgrade                     dateoftumorrecurrence
patienttumorrecurrencestatus   dateofinitialpathologicdiagnosis
radiationtherapy               datelastknownalive
neoadjuvanttherapy             vitalstatus
pathologicspread(pt)
pathologicspread(pn)
karnofskyperformancescore
Results
Tier 1 Data Statistics

Table 2.  Statistics of selected clinical variables.

Clinical.Variable Statistics
age mean: 62, std: 13
vitalstatus 6987 living, 3201 deceased
gender 3844 male, 6334 female
histologicaltype 107 muscle invasive urothelial carcinoma (pt2 or above), 10 mixed histology (please specify), 227 infiltrating ductal carcinoma, 20 other specify, 24 infiltrating lobular carcinoma, 774 colon adenocarcinoma, 119 colon mucinous adenocarcinoma, 46 untreated primary (de novo) gbm, 4 treated primary gbm, 712 head & neck squamous cell carcinoma, 1018 kidney clear cell renal carcinoma, 557 lung adenocarcinoma- not otherwise specified (nos), 163 lung adenocarcinoma mixed subtype, 36 lung papillary adenocarcinoma, 6 lung mucinous adenocarcinoma, 9 mucinous (colloid) adenocarcinoma, 4 lung clear cell adenocarcinoma, 22 lung acinar adenocarcinoma, 36 lung bronchioloalveolar carcinoma nonmucinous, 8 lung bronchioloalveolar carcinoma mucinous, 6 mucinous (colloid) carcinoma, 6 lung solid pattern predominant adenocarcinoma, 6 lung micropapillary adenocarcinoma, 672 lung squamous cell carcinoma- not otherwise specified (nos), 16 lung basaloid squamous cell carcinoma, 2 lung small cell squamous cell carcinoma, 2 lung papillary squamous cell caricnoma, 2 lung papillary squamous cell carcinoma, 1148 serous cystadenocarcinoma, 299 rectal adenocarcinoma, 27 rectal mucinous adenocarcinoma, 173 serous endometrial adenocarcinoma, 726 endometrioid endometrial adenocarcinoma, 36 mixed serous and endometrioid
patienttumorrecurrencestatus 10188 without recurrence
tumorstage 440 stage iia, 208 stage iiib, 241 stage iib, 95 stage iiic, 317 stage iv, 749 stage i, 393 stage iva, 9 stage iic, 323 stage ii, 410 stage iii, 237 stage iiia, 14 stage ivb, 269 stage ia, 421 stage ib, 813 iiic, 6 ib, 48 iiib, 173 iv, 40 iic, 6 ia, 16 iiia, 20 ic, 8 iib, 6 iia
pathologicspread(pt) 1125 t3, 20 t4b, 148 t4, 1091 t2, 281 t4a, 342 t1, 2 t0, 2 tis, 67 tx, 310 t1b, 104 t3b, 311 t1a, 163 t2a, 242 t3a, 4 t3c, 59 t2b
pathologicspread(pn) 2238 n0, 657 n1, 350 n2, 35 n1b, 160 n2b, 22 n2a, 36 n1a, 6 n1c, 672 nx, 73 n2c, 22 n3
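The entries in Table 2 are of two kinds: a mean and standard deviation for numeric variables such as age, and per-label counts for categorical variables such as vitalstatus. The following is a minimal sketch of how such summaries could be computed. The input values are toy data made up for illustration, not TCGA data, and the use of the sample standard deviation and the skipping of "NA" entries are assumptions, since the report does not state how its statistics were calculated.

```python
from collections import Counter
from statistics import mean, stdev

# Toy values, made up for illustration only; "NA" marks a missing entry.
ages = ["73", "57", "NA", "64"]
vitalstatus = ["living", "living", "deceased", "living"]

# Numeric variable: drop missing entries, then summarize.
# Assumption: sample standard deviation (n - 1 in the denominator).
numeric_ages = [float(a) for a in ages if a != "NA"]
age_summary = f"mean: {mean(numeric_ages):.0f}, std: {stdev(numeric_ages):.0f}"

# Categorical variable: count each label and format as "<count> <label>".
counts = Counter(v for v in vitalstatus if v != "NA")
vital_summary = ", ".join(f"{n} {label}" for label, n in counts.most_common())

print("age", age_summary)
print("vitalstatus", vital_summary)
```

Counting raw labels this way also reproduces label inconsistencies in the input: two spellings of the same category are counted as two separate categories.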
Tier 1 Data

Table 3.  Illustration of the tier 1 data for three patients

Clinical.Variable Sample_1 Sample_2 Sample_3
yearstobirth 73 73 57
daystodeath NA NA 223
daystolastfollowup 389 389 NA
vitalstatus 0 0 1
dccuploaddate 10-4-2013 10-4-2013 17-4-2013
Methods & Data
Work Flow

1. Each XML file is converted to a tab-delimited text file by our R package.

2. All text files are aggregated into one large table by the Clinical_Aggregate_Tier1 pipeline. The first column of the table holds the entry names from the XML files, and the remaining columns hold the associated data for each sample.

3. Data for the tier 1 clinical variables are extracted by the Clinical_Picker_Tier1 pipeline.
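The three steps above can be illustrated with a small Python sketch. It is not the R package or the Clinical_Aggregate_Tier1 and Clinical_Picker_Tier1 pipelines themselves, whose internals are not described here. The XML records, the entry name `othernote`, and the function names `xml_to_rows`, `aggregate`, `pick`, and `to_tsv` are all hypothetical, and real TCGA clinical XML is namespaced and far more deeply nested than this toy input.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified records (not real TCGA XML).
SAMPLE_XML = {
    "Sample_1": "<patient><gender>female</gender><vitalstatus>living</vitalstatus>"
                "<tumorstage>stage i</tumorstage><othernote>a</othernote></patient>",
    "Sample_2": "<patient><gender>male</gender><vitalstatus>deceased</vitalstatus>"
                "<othernote>b</othernote></patient>",
}

# A subset of the Table 1 variables, chosen for illustration.
TIER1 = ["gender", "vitalstatus", "tumorstage"]


def xml_to_rows(xml_text):
    """Step 1: flatten one XML record into (entry name, value) pairs."""
    root = ET.fromstring(xml_text)
    # Keep leaf elements only; container elements carry no value of their own.
    return [(el.tag, (el.text or "").strip()) for el in root.iter() if len(el) == 0]


def aggregate(records):
    """Step 2: merge per-sample rows into one table keyed by entry name."""
    samples = list(records)
    per_sample = {s: dict(xml_to_rows(records[s])) for s in samples}
    names = []
    for s in samples:
        for name in per_sample[s]:
            if name not in names:
                names.append(name)
    # One row per entry name, one value per sample; "NA" where an entry is absent.
    table = {n: [per_sample[s].get(n, "NA") for s in samples] for n in names}
    return samples, table


def pick(table, wanted):
    """Step 3: keep only the rows for the variables of interest."""
    return {n: table[n] for n in wanted if n in table}


def to_tsv(samples, table):
    """Render the table as tab-delimited text, first column = entry name."""
    lines = ["\t".join(["Clinical.Variable"] + samples)]
    lines += ["\t".join([name] + values) for name, values in table.items()]
    return "\n".join(lines)


samples, full_table = aggregate(SAMPLE_XML)
tier1_table = pick(full_table, TIER1)
print(to_tsv(samples, tier1_table))
```

The sketch keeps the layout described in step 2, with entry names in the first column and one column per sample, so the picking step reduces to selecting rows by name.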

Diagram of Clinical Data Dicer

Figure 1.  Diagram that displays the work flow of processing clinical data. Clinical variables of interest and their associated values are marked in red and blue, respectively.