Analysis of subtelomere/telomere regions of cancer genomes using machine learning and high-performance computing

Poster #: 181
Session/Time: B
Author: Eleni Adam, MS, PhD
Mentor: Harold Riethman, PhD
Research Type: Basic Science

Abstract

INTRODUCTION:
Cancer continues to affect millions of people worldwide: nearly 40% of people are affected at some point in their lives, and around one-third of cases are potentially lethal. Many cancers impact specific populations more severely than others, and a systematic analysis of genetic features is needed to understand these cancer health disparities. Analyzing the patterns of altered genes and rearranged chromosomes characteristic of each cancer type can lead to more effective therapies. Our purpose is to better understand cancer mechanisms and to improve our ability to diagnose cancer at an earlier stage, so that it can be treated with more specific and effective therapies before it reaches the metastatic stage.

METHODS:
To study genome maintenance in cancer, we use The Cancer Genome Atlas (TCGA) dataset of 33 cancer types. The subtelomeric analysis of cancer genomes consists of three main parts. First, we use computational methods to extract the telomeric and subtelomeric information from the large genomic datasets. Second, based on the extracted data, we define telomere- and subtelomere-associated features. Third, we use machine learning methods to correlate these newly defined features with clinical data.
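
As an illustration of the first part, the sketch below shows one simple way a sequencing read could be flagged as telomeric: by looking for an uninterrupted run of telomere repeat units on either strand. It is a minimal sketch and not the pipeline described here. TTAGGG is the canonical human telomere repeat; the variant list, the function names, and the `min_units` threshold are assumptions chosen for illustration.

```python
import re

# Canonical human telomere repeat (TTAGGG) plus a few variant repeats.
# ASSUMPTION: this variant list is illustrative, not the pipeline's actual set.
TELOMERE_MOTIFS = ("TTAGGG", "TCAGGG", "TGAGGG", "TTGGGG")

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def longest_repeat_run(seq, motifs=TELOMERE_MOTIFS):
    """Longest uninterrupted run of repeat units, in units, on either strand."""
    pattern = re.compile("(?:" + "|".join(motifs) + ")+")
    unit_length = len(motifs[0])  # all motifs above are 6 bp long
    best = 0
    for strand in (seq.upper(), reverse_complement(seq)):
        for match in pattern.finditer(strand):
            best = max(best, len(match.group(0)) // unit_length)
    return best


def is_telomeric_read(seq, min_units=4):
    """ASSUMPTION: a read counts as telomeric if it holds >= min_units
    consecutive repeat units; the threshold is hypothetical."""
    return longest_repeat_run(seq) >= min_units
```

Checking both strands matters because a read sequenced from the C-rich strand carries CCCTAA rather than TTAGGG. In practice the threshold would need tuning against read length and sequencing error.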

RESULTS:
We developed the first pipeline, which extracts the reads that contain telomere repeat tract variations and subtelomeric (duplicon, TERRA promoter) patterns. We then determined where the mate-pairs of these reads map on a reference genome and classified each mate as telomere, subtelomere, or intrachromosomal. To optimize the pipeline for use with multiple cancer datasets, we are collaborating with Amazon cloud services to streamline it for the anticipated scale-up.
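
The mate classification step can be sketched as follows. This is a hedged illustration under stated assumptions, not the pipeline's actual logic: it assumes each mate has already been aligned to a reference chromosome of known length, that a separate check has determined whether the mate itself is telomeric, and that a subtelomere is a fixed-size window at each chromosome end. The 500 kb window, the function name `classify_mate`, and its parameters are hypothetical.

```python
from collections import Counter

# ASSUMPTION: the subtelomere is treated as the terminal 500 kb of each
# chromosome arm; this window size is an illustrative choice.
SUBTELOMERE_BP = 500_000


def classify_mate(mate_pos, chrom_length, mate_is_telomeric=False,
                  subtelomere_bp=SUBTELOMERE_BP):
    """Classify a mate as 'telomere', 'subtelomere', or 'intrachromosomal'.

    mate_pos          -- 0-based alignment position of the mate on the reference
    chrom_length      -- length of the reference chromosome in bp
    mate_is_telomeric -- True if the mate itself consists of telomere repeats
    """
    if mate_is_telomeric:
        return "telomere"
    distance_to_nearest_end = min(mate_pos, chrom_length - mate_pos)
    if distance_to_nearest_end < subtelomere_bp:
        return "subtelomere"
    return "intrachromosomal"


# Hypothetical mates: (position, chromosome length, mate is telomeric)
mates = [
    (10_000, 100_000_000, False),
    (50_000_000, 100_000_000, False),
    (99_900_000, 100_000_000, False),
    (0, 100_000_000, True),
]
category_counts = Counter(classify_mate(pos, length, telo)
                          for pos, length, telo in mates)
```

Per-sample counts such as `category_counts` are one possible starting point for the feature definition step, although the features actually used are not specified here.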

CONCLUSION:
We have completed the development of our subtelomeric/telomeric computational pipeline and successfully applied it to a metastatic prostate cancer dataset of 101 paired normal/tumor individual genomes. We are currently optimizing and refining it for the next step, feature identification.