Preprocessing of clinical data
Overview
Introduction

The clinical information for each TCGA tumor sample is stored in an XML file whose entries include the patient ID, tumor characteristics, and treatment information. These XML files have been preprocessed for further analysis.

Summary

Clinical data are generated for the tier 1 clinical variables listed in Table 1.

Table 1.  Tier 1 clinical variables

Tumor.Feature                   Date.Statistics
gender                          dccuploaddate
primarysiteofdesease            dateofbirth
histologicaltype                dateofdeath
tumorstage                      dateoflastfollowup
tumorgrade                      dateoftumorrecurrence
patienttumorrecurrencestatus    dateofinitialpathologicdiagnosis
radiationtherapy                datelastknownalive
neoadjuvanttherapy              vitalstatus
pathologicspread(pt)
pathologicspread(pn)
karnofskyperformancescore
Results
Tier 1 Data Statistics

Table 2.  Statistics of selected clinical variables.

Clinical.Variable Statistics
age mean: NaN, std: NA
vitalstatus 3384 living, 1505 deceased
gender 1828 male, 3051 female
histologicaltype 364 colon adenocarcinoma, 56 colon mucinous adenocarcinoma, 541 untreated primary (de novo) gbm, 20 treated primary gbm, 255 head & neck squamous cell carcinoma, 502 kidney clear cell renal carcinoma, 84 kidney papillary renal cell carcinoma, 49 astrocytoma, 58 oligodendroglioma, 33 oligoastrocytoma, 178 lung adenocarcinoma- not otherwise specified (nos), 61 lung adenocarcinoma mixed subtype, 10 lung papillary adenocarcinoma, 2 lung mucinous adenocarcinoma, 3 mucinous (colloid) adenocarcinoma, 2 lung clear cell adenocarcinoma, 3 lung acinar adenocarcinoma, 3 lung bronchioloalveolar carcinoma mucinous, 8 lung bronchioloalveolar carcinoma nonmucinous, 2 lung micropapillary adenocarcinoma, 1 lung solid pattern predominant adenocarcinoma, 264 lung squamous cell carcinoma- not otherwise specified (nos), 6 lung basaloid squamous cell carcinoma, 1 lung papillary squamous cell caricnoma, 1 lung papillary squamous cell carcinoma, 570 serous cystadenocarcinoma, 149 rectal adenocarcinoma, 13 rectal mucinous adenocarcinoma, 13 stomach adenocarcinoma - diffuse type, 100 stomach adenocarcinoma - not otherwise specified (nos), 7 stomach intestinal adenocarcinoma - mucinous type, 10 stomach intestinal adenocarcinoma - tubular type, 3 stomach intestinal adenocarcinoma - papillary type, 27 stomach intestinal adenocarcinoma - type not otherwise specified (nos), 69 uterine serous endometrial adenocarcinoma, 142 endometrioid endometrial adenocarcinoma (grade 3), 104 endometrioid endometrial adenocarcinoma (grade 1 or 2), 18 mixed serous and endometrioid, 57 endometrioid endometrial adenocarcinoma (grade 2), 34 endometrioid endometrial adenocarcinoma (grade 1)
tumorgrade 846 g3, 503 g2, 43 g1, 18 gx, 77 g4, 1 gb, 196 grade 3, 106 grade 2, 89 grade 1, 33 high grade
tumorstage 107 stage iiib, 198 stage iia, 88 stage iib, 46 stage iiic, 188 stage iv, 402 stage i, 147 stage iva, 4 stage iic, 191 stage ii, 220 stage iii, 135 stage iiia, 5 stage ivb, 106 stage ia, 3 stage ivc, 212 stage ib, 403 iiic, 3 ib, 24 iiib, 87 iv, 19 iic, 3 ia, 8 iiia, 10 ic, 4 iib, 3 iia
pathologicspread(pt) 590 t3, 12 t4b, 82 t4, 533 t2, 116 t4a, 155 t1, 1 t0, 1 tis, 32 tx, 142 t1b, 60 t3b, 168 t1a, 68 t2a, 137 t3a, 2 t3c, 33 t2b
pathologicspread(pn) 360 n1, 190 n2, 1067 n0, 17 n1b, 62 n2b, 13 n2a, 17 n1a, 3 n1c, 348 nx, 30 n2c, 22 n3, 1 n3a
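
The per-category counts in Table 2 can be thought of as a simple tally of the distinct labels found for each variable. The following is a minimal sketch of such a tally, not the pipeline's actual code; the function name `summarize`, the lowercasing step, and the treatment of "NA" as missing are assumptions made for illustration, and the input values are made up. It also shows why differently spelled labels (for example "g3" and "grade 3" under tumorgrade) appear as separate categories: the tally compares label text only.

```python
from collections import Counter


def summarize(values):
    """Count each distinct non-missing label, most frequent first.

    Hypothetical helper: labels are lowercased and "NA" is treated as missing.
    """
    counts = Counter(v.strip().lower() for v in values if v and v != "NA")
    return ", ".join(f"{n} {label}" for label, n in counts.most_common())


# Made-up tumor grade values for five samples.
grades = ["G3", "g3", "grade 3", "NA", "g2"]
summary = summarize(grades)
print(summary)
```

Merging equivalent spellings would need an explicit mapping between labels, which this sketch deliberately leaves out.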
Tier 1 Data

Table 3.  Illustration of the tier 1 data for three patients

Clinical.Variable Sample_ID Sample_ID Sample_ID
daystodeath sample_1 sample_2 sample_3
daystolastfollowup 389 223 81
karnofskyperformancescore NA NA NA
histologicaltype NA NA NA
vitalstatus 0 1 1
Methods & Data
Work Flow

1. Each XML file is converted to a tab-delimited text file by our R package.

2. All text files are aggregated into a single table by the Clinical_Aggregate_Tier1 pipeline. The first column of the table holds the entry names from the XML files, and each remaining column holds the associated data for one sample.

3. Data for the tier 1 clinical variables are extracted by the Clinical_Picker_Tier1 pipeline.
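
The three steps above can be sketched in a few lines of Python. This is only an illustration of the data flow under stated assumptions, not the R package or the Clinical_Aggregate_Tier1 and Clinical_Picker_Tier1 pipelines themselves: the function names, the XML layout, the use of "NA" for missing entries, and the sample values are all hypothetical, and step 1 is simplified to an in-memory mapping instead of one text file per sample.

```python
import xml.etree.ElementTree as ET


def xml_to_entries(xml_text):
    """Step 1 (simplified): flatten one XML document into {entry_name: value}."""
    root = ET.fromstring(xml_text)
    entries = {}
    for elem in root.iter():
        text = (elem.text or "").strip()
        if len(elem) == 0 and text:  # leaf element that carries a value
            name = elem.tag.split("}")[-1]  # drop any XML namespace prefix
            entries[name] = text
    return entries


def aggregate(per_sample):
    """Step 2: one row per entry name, one column per sample; missing -> "NA"."""
    names = sorted({n for e in per_sample.values() for n in e})
    samples = list(per_sample)
    header = ["entry"] + samples
    rows = [[n] + [per_sample[s].get(n, "NA") for s in samples] for n in names]
    return [header] + rows


def pick(table, wanted):
    """Step 3: keep the header plus the rows whose entry name is in `wanted`."""
    return [table[0]] + [row for row in table[1:] if row[0] in wanted]


def to_tsv(table):
    """Render a table as tab-delimited text."""
    return "\n".join("\t".join(row) for row in table)


# Made-up XML for two samples; real clinical files are far larger.
SAMPLES = {
    "sample_1": "<patient><gender>female</gender>"
                "<vitalstatus>living</vitalstatus><batch>1</batch></patient>",
    "sample_2": "<patient><gender>male</gender>"
                "<vitalstatus>deceased</vitalstatus></patient>",
}

table = aggregate({s: xml_to_entries(x) for s, x in SAMPLES.items()})
tier1 = pick(table, {"gender", "vitalstatus"})
print(to_tsv(tier1))
```

Keeping entries as rows and samples as columns mirrors the table layout described in step 2, so selecting the tier 1 variables in step 3 reduces to filtering rows by name.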

Diagram of Clinical Data Dicer

Figure 1.  Diagram that displays the work flow of processing clinical data. Clinical variables of interest and their associated values are marked in red and blue, respectively.