Rebecca Asiimwe



I am a Bioinformatician and Data Scientist with over 10 years of experience in computing, including 7 years in the analytics, management and visualization of big and complex data. I am keen on developing computational tools, applying statistical methods and leveraging cutting-edge technologies (including machine learning and artificial intelligence) to integrate, manage and analyze big, complex and high-dimensional datasets, in support of discovering hidden patterns and unknown correlations in biological and clinical data.

Experienced in: the analysis of high-dimensional and high-throughput data, including next-generation sequencing (NGS) data; data processing, modeling, mining, analysis and visualization; building analytics pipelines and innovative data-driven tools and systems to solve complex problems in diverse domains; high-performance cluster computing; Python (pandas, NumPy, seaborn, Matplotlib, scikit-learn, Keras, TensorFlow); shell scripting and the command-line interface; Jupyter Notebook, Google Colab and Anaconda; R, RStudio, Shiny and Plotly; Git version control and GitHub; database development, management and optimization (MySQL and PostgreSQL); standard NGS bioinformatics toolsets; PLINK, GenomeStudio, Cell Ranger and Space Ranger.

GitHub Profile
Code Examples

Portfolio


Sample Research Projects:

1. Whole Genome Profiling for Stratifying Triple Negative Breast Cancers (TNBC): Identifying somatic alterations in tumor genomes of TNBC patients for subgroup discovery, and the identification of potentially actionable molecular events that could provide insights into treatment options for TNBC patients

The aim of this research project was to design, develop, optimize and use a relational database to structure, integrate, store, query, mine, statistically analyze and visualize large-scale whole genome profiling data from tumors of triple negative breast cancer patients, alongside orthogonally collected clinical data. Overall, the study was conducted to help elucidate the genomic events underpinning a patient’s disease and to explore in depth the landscape of characteristic mutations in patient genomes, which may reflect specific mutational processes that are targetable vulnerabilities in the treatment of TNBCs. We further stratified our cohort of TNBC patients and discovered, to the best of our knowledge for the first time, five genomically and clinically distinct subgroups, revealed by unsupervised hierarchical clustering of mutation signatures and somatic mutations in patient genomes. This work further supported the utility of the genome as a potential discriminant biomarker in subgroup discovery, from which we can draw valuable insights into options for novel therapeutic modalities and identify the patients most likely to respond to specific forms of treatment. Please see publication for details on this project including methods applied.
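To illustrate the idea of grouping tumors by their mutation signature profiles, here is a minimal, dependency-free sketch of agglomerative (bottom-up) hierarchical clustering. It is not the method used in the publication: the cosine distance, the average linkage, the sample IDs and the exposure values are all assumptions chosen for illustration, and the toy data yields three groups rather than the five reported above.

```python
from itertools import combinations
from math import sqrt


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def average_linkage_clusters(profiles, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    profiles: dict mapping sample ID -> list of signature exposures.
    Returns a sorted list of clusters, each a sorted list of sample IDs.
    """
    clusters = [[sample] for sample in profiles]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        dists = [cosine_distance(profiles[a], profiles[b]) for a in c1 for b in c2]
        return sum(dists) / len(dists)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest average distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)

    return sorted(sorted(c) for c in clusters)


# Hypothetical exposures to three mutation signatures per tumor (toy data).
toy_profiles = {
    "T1": [0.80, 0.15, 0.05],
    "T2": [0.75, 0.20, 0.05],
    "T3": [0.10, 0.85, 0.05],
    "T4": [0.05, 0.90, 0.05],
    "T5": [0.10, 0.10, 0.80],
}

subgroups = average_linkage_clusters(toy_profiles, n_clusters=3)
print(subgroups)
```

In practice a library routine (for example in SciPy or R) would replace this loop and also return the full dendrogram, which is what lets the number of subgroups be chosen after inspecting the tree rather than fixed in advance.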


2. Single-Cell RNA Sequencing (scRNA-Seq), CITE-Seq and Spatial Transcriptomics Projects

Transcriptomic profiling of complex tissues identifies known and novel features of the transcriptome and provides insights into tissue cell type diversity and dynamics that have a great impact on disease diagnostics, prevention, and drug discovery, all based on measures of gene and transcript abundance. I have had the privilege of conducting single-cell and single-nucleus RNA sequencing analyses of thousands of cells from over 100 libraries and experiments.

I have conducted analyses of CITE-seq datasets based on transcriptome and cell surface protein epitope measurements while leveraging available antibodies at a single cell level. Studying cells concurrently at transcriptomic and proteomic levels can offer unprecedented insights into new cell types, disease states, or other conditions.

I have also conducted single-cell ATAC-Seq analyses to study cell type-specific chromatin accessibility in tissue samples containing heterogeneous cellular populations, in order to identify transcription factors that are active in the phenotype or condition(s) being investigated. The transcription factor binding sites and nucleosome positions identified from ATAC-Seq data can help elucidate important genetic pathways in a sample.

Besides sc/sn-RNA-, CITE- and ATAC-seq, I have also worked with spatial transcriptomics to characterize transcriptional patterning and regulation in tissues. In addition to spatial transcriptomic analyses using Seurat, I have worked with cell2location for cell-type deconvolution of spatial transcriptomics data. I have also worked with tools such as SoupX to remove ambient RNA, Scrublet to remove doublets, Monocle for single-cell trajectory inference, and souporcell and demuxlet for sample identity deconvolution.

All analyses are conducted in Python and R, using the packages appropriate to each task. (Images shown below are from work I conducted that is published or in preparation for publication. Code examples are on GitHub.)


3. Improving Read Alignment

Allele-specific expression (ASE) refers to the preferential expression of one allele over the other in a diploid genome. In humans, the widespread allelic variation between individuals at both gene and single-nucleotide levels is commonly associated with complex traits. ASE analysis quantifies the relative expression of two alleles and, when integrated with expression quantitative trait locus (eQTL) analysis, is a powerful tool for identifying biologically meaningful regulatory signals, such as imprinting and cis-regulated gene expression variation, that underlie phenotypic differences among individuals. RNA-seq can be used to measure ASE by assigning sequence reads to individual alleles; however, this analysis remains limited by mapping bias, which arises because sequenced reads are aligned to a conventional linear reference genome in which each position is represented only by the reference (most abundant) allele. Together with Dr. Dobin, we are assessing reference bias with the goal of reducing it in STAR, a widely adopted read alignment tool.
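As a rough illustration of how allelic imbalance at a single heterozygous site can be quantified, the sketch below computes the reference-allele fraction and an exact two-sided binomial test against the 50:50 expectation. This is a generic textbook approach, not the method used in the STAR work; the function names and read counts are hypothetical.

```python
from math import comb


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: sum the probabilities of all
    outcomes that are no more likely than the observed count k.

    Intended for modest read depths; very large n would overflow floats.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))


def allelic_ratio(ref_reads, alt_reads):
    """Fraction of reads at a heterozygous site that carry the reference allele."""
    return ref_reads / (ref_reads + alt_reads)


# Hypothetical read counts at one heterozygous SNP (toy numbers).
ref_reads, alt_reads = 30, 10
ratio = allelic_ratio(ref_reads, alt_reads)
p_value = binomial_two_sided_p(ref_reads, ref_reads + alt_reads)
print(f"reference allele fraction = {ratio:.2f}, p = {p_value:.4f}")
```

A single skewed site may reflect genuine ASE. Mapping bias, by contrast, tends to show up as reference fractions that sit systematically above 0.5 across many sites, which is why the null proportion `p` is exposed as a parameter here: it can be set to an empirically estimated value instead of 0.5.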


4. DNA Methylation and Genotyping Data Analyses

Below is an overview of extracts from various projects, covering the data preprocessing pipelines I developed and upstream and downstream analyses such as those conducted in this manuscript. The projects I have conducted in this domain (DNA methylation and genotyping) follow similar pipelines and analyses.


5. Select Machine Learning and Deep Learning Projects

i) Malarial Detection Using Deep Learning: A Convolutional Neural Networks (CNN) Approach

Malaria is an infectious and sometimes fatal disease caused by Plasmodium parasites, which are transmitted through the bites of female Anopheles mosquitoes. Based on World Health Organization reports, there were 228 million cases and 405,000 deaths in 2018 alone; Africa accounted for 93% of these cases and 94% of these deaths. More recently (2020), it was estimated that there were 241 million cases of malaria worldwide and that 627,000 people died from the disease, most of whom were children from sub-Saharan African countries. Many deaths have been attributed to poor health care services and a lack of early and effective screening for malaria. Measures for early detection, diagnosis and treatment are therefore key to reducing these high mortality rates and the malaria burden worldwide. Current malaria detection often involves examining a drop of a patient’s blood under a microscope. The specimen, often spread out as a thin blood smear, is stained with Giemsa to give the parasites a distinctive appearance, which is used to distinguish healthy from infected cells. Although it aids malaria detection, this process is tedious because it requires manual counting of cells by a trained technician or pathologist, and its accuracy depends in part on the technicians or experts examining the slides. As a result, many cases may go undiagnosed or be diagnosed late, adding to the global malaria burden. Solving this problem is therefore imperative for early and accurate detection of malaria and, consequently, for reducing the severe illness and deaths it causes.

The overall objective of this project was to build an automated and efficient computer vision model that detects malaria by classifying red blood cells as parasitized (containing the Plasmodium parasite) or uninfected, in the presence of other cellular impurities. Key tools and packages used for this project include: Python (pandas, NumPy, seaborn, cv2, Matplotlib, scikit-learn, TensorFlow, Keras), Jupyter Notebook and Google Colab.
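The model architecture is not described here, so the following is only a dependency-free sketch of the core operations a convolutional layer applies to an image: a sliding-window convolution, a ReLU activation and 2x2 max pooling. The 6x6 "image" with a single bright pixel and the hand-written blob-detecting kernel are hypothetical; in a real CNN (for example one built with Keras) the kernel weights are learned from labeled cell images.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and return the
    weighted sum at each position (cross-correlation, as CNN layers compute)."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


def relu(feature_map):
    """Replace negative responses with zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 window."""
    return [
        [
            max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            for c in range(0, len(feature_map[0]) - 1, 2)
        ]
        for r in range(0, len(feature_map) - 1, 2)
    ]


# Toy 6x6 image: one bright pixel standing in for a stained parasite.
toy_image = [[0] * 6 for _ in range(6)]
toy_image[2][2] = 1

# Hand-written kernel that responds to an isolated bright spot (assumption).
blob_kernel = [
    [-1, -1, -1],
    [-1, 8, -1],
    [-1, -1, -1],
]

pooled = max_pool_2x2(relu(conv2d_valid(toy_image, blob_kernel)))
print(pooled)
```

The pooled map keeps a strong response only in the quadrant containing the bright spot, which is the kind of localized feature that later layers combine to separate parasitized from uninfected cells.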

ii) Image Processing Using Neural Networks and Convolutional Neural Networks

One of the most interesting tasks in deep learning is recognizing objects in natural scenes. The ability to process visual information using machine learning algorithms has proven useful in a range of applications. This project leveraged the Street View House Numbers (SVHN) dataset, a popular image recognition dataset that contains over 600,000 labeled digits cropped from street-level photos. It has been used in neural networks created by Google to improve map quality by automatically transcribing address numbers from a patch of pixels. A transcribed number, combined with a known street address, helps pinpoint the location of the building it represents. The objective of this project was to predict the number depicted in each image using Artificial (Fully Connected Feed Forward) Neural Networks and Convolutional Neural Networks. Key tools and packages used for this project include: Python (pandas, NumPy, seaborn, cv2, Matplotlib, scikit-learn, TensorFlow, Keras), Jupyter Notebook and Google Colab.

iii) Using Machine Learning to Predict Breast Cancer

This project used the Breast Cancer Wisconsin (Diagnostic) Data Set for predictive analysis in breast cancer. Key tools used include Jupyter Notebook and Python (NumPy, pandas, Matplotlib, Plotly, seaborn and scikit-learn) to run and compare Logistic Regression, Random Forest, KNeighbors, SVC, DecisionTree, GradientBoosting, AdaBoost and XGB classification models.


6. Understanding Tissue-Specific Temporal Changes after SIV Infection to Guide HIV Treatment

Human immunodeficiency virus (HIV) continues to be a global public health issue, and over 35 million people have died from acquired immunodeficiency syndrome (AIDS)-related diseases. Early diagnosis is crucial because it enables control of the infection via antiretroviral therapy and reduces the risk of transmission. However, it is estimated that half of the patients with HIV worldwide are “late-presenters” because early infection is often asymptomatic, which is why more knowledge is needed about the early stages of HIV infection and the immune response. Simian immunodeficiency virus (SIV), a retrovirus infecting non-human primates, has symptoms and a viral life cycle similar to those of HIV. Phylogenetic analysis also shows that SIV and HIV are ancestrally related, and the two viruses replicate and propagate in similar ways, making SIV a good model for understanding the early development of infection. This project aimed to evaluate tissue-specific transcriptomic changes in the first ten days after SIV infection in rhesus monkeys. We applied statistical methods including linear model fitting, cluster analysis, PCA, and gene set enrichment and pathway analysis, using various CRAN/Bioconductor R packages, and found that significant transcriptomic changes occur throughout host tissues as early as day 1 post infection. This knowledge is crucial for the development of treatments, vaccines and more effective post-exposure prophylaxis drugs. Please visit this GitHub repo for more details on this project.


7. Database Design, Development and Optimization

I have experience designing, developing, optimizing and managing large-scale databases, especially PostgreSQL and MySQL Database Management Systems (DBMSs). I have also worked with Oracle and Microsoft Access.


8. Shiny Apps and Dashboards

I have developed a number of Shiny-based tools, apps and dashboards, most of which are internal to the institutions I have worked with. Below are a few of the examples that are public. I also have some experience conducting data analysis and visualization using Dash (Python).

Not publicly accessible; however, this app’s functionality can be viewed in my thesis (pages 75-87).

Below is another tool (CoADD) that is under development. It compares DNA methylation patterns in immune cell types between cord and adult blood:


Publications

Please visit Google Scholar for a full list of my publications.

