ShinyLimma! Differential Expression Analysis in R
Over my last two summers, I worked at Lovelace Respiratory Research Institute in New Mexico. The particular lab I worked in had one bioinformatician: me. While everyone else in the lab had years (if not decades!) of experience in wet-lab biology, very few of them had a well-rounded understanding of bioinformatics.
Any biologist who has worked in the field as long as the people in that lab had must have a strong grounding in statistics; it’s a large part of the trade. T-tests and p-values keep the field grounded across a wide range of constantly changing experimental techniques. From real-time PCR to gel electrophoresis, statistics defines the significance of the results and is the language in which those results are reported to the wider scientific community.
At its core, that’s what bioinformatics is: statistics on high-throughput biological data to arrive at a result, and sharing the biological significance of those results with a broader community using summary statistics and data visualization techniques. With the added benefit of reproducible results! With your script and data shared, and all your seeds set correctly, your results can (in theory) be perfectly reproduced by anyone who would like to check them and has a working internet connection. Each choice made during an experiment can be scrutinized for validity, and the community can decide whether those choices made sense. One can even experiment with them, modifying normalization choices or classification techniques, either to test alternate hypotheses or to check the robustness of the results.
Despite the natural overlap between the fields, I found during lab meetings no one knew quite what to make of my results. It wasn’t like they didn’t understand statistics. Wet-lab experiments with complex experimental designs and equally involved statistics were regularly discussed in lab.
It was the code. All my work involved coding in either R or Python, and when I explained what I had done, everyone’s eyes would glaze over. This wasn’t biology anymore, grounded in the real world. This was done on a computer with scripts and models. They couldn’t follow along. Their training in biology didn’t involve a programming component. Without the ability to code, and to follow the abstractions and patterns that come with it, they considered bioinformatics something closer to black magic. The way they spoke to me often implied that the results of bioinformatics research could be anything I wanted them to be, and they frequently ignored any result that didn’t already have a grounding in the literature.
But my tools weren’t complicated. 90% of my day-to-day work in differential expression analysis followed a pretty simple, easy-to-understand computational flow. I wondered if I could create a tool that would make bioinformatics, at least at the level I was working at, more digestible, and remove the stigma that comes with the “coding” involved.
That tool became ShinyLimma!, a single-page application that guides you through the entire experiment, from data input to reporting. It builds on Bioconductor, an open-source project that provides many of the libraries a bioinformatician is likely to use when processing high-throughput genomic data.
Among its packages is Limma (Linear Models for Microarray Data), written by Gordon Smyth and based on his landmark paper: Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments.
Limma was the package I chose to focus on because it was the one I knew best, which is little surprise considering it’s in the top 5% of all packages downloaded on Bioconductor.
My goal was to see if I could create a graphical user interface that would let a user employ the vast majority of Limma’s rich feature set without ever having to write a line of code.
Choosing Shiny as my application framework had some definite advantages and disadvantages, but in my mind it came away as the best tool for the job.
Cons: I’d never used Shiny before. What’s more, Shiny is still in its infancy. Not many long-form bioinformatics apps have been written in it yet, and without years of use by a dedicated community, I would find myself without Stack Overflow to run to or a book to buy if I got stuck.
I was also much more familiar with writing GUI code in Java or Python. I could have used a wrapper library like rPython or rJava to make everything work, so writing it all in R seemed like more of a hindrance than a help.
Pros: The biggest pro, and ultimately the one that convinced me to work within Shiny, is that it was built with web deployment in mind. Deploying a website seemed like a must to keep users away from coding entirely if that was their wish. With the release of ShinyBootstrap, the framework can also produce sleek, attractive apps that launch either from a dedicated website or through localhost, which seemed the most flexible way to reach the users I wanted to provide ShinyLimma! to.
I won’t get into the nitty-gritty of my code here (you’ll find it available below!) but I did want to provide an overview of my workflow.
Users need only three things to get started:
- A probe file from an Illumina or Affymetrix machine for an experiment.
- A control probe file from an Illumina or Affymetrix machine for an experiment.
- A targets file, a user-defined text file that specifies the different experimental groups (Experiment/Control, Mutant/Wild, different samples, etc.).
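To make the targets file a little more concrete, here is a minimal sketch in base R. It is not ShinyLimma!’s actual code: the column names `SampleID` and `Group`, the sample names, and the `validate_targets` helper are all assumptions made up for illustration.

```r
# Hypothetical targets file contents (tab-delimited).
# The column names SampleID and Group are assumptions for this sketch.
targets_text <- paste(
  "SampleID\tGroup",
  "array_01\tControl",
  "array_02\tControl",
  "array_03\tMutant",
  "array_04\tMutant",
  sep = "\n"
)

# Read it the way any tab-delimited text file would be read.
targets <- read.delim(text = targets_text, stringsAsFactors = FALSE)

# Hypothetical validation helper: checks that the required columns exist,
# that sample IDs are unique, and that there are at least two groups.
validate_targets <- function(targets, required = c("SampleID", "Group")) {
  missing_cols <- setdiff(required, colnames(targets))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  if (anyDuplicated(targets$SampleID) > 0) {
    stop("Duplicate sample IDs")
  }
  if (length(unique(targets$Group)) < 2) {
    stop("Need at least two experimental groups")
  }
  invisible(TRUE)
}

validate_targets(targets)

# Number of arrays in each experimental group.
group_sizes <- table(targets$Group)
```

A real targets file would live on disk, in which case `read.delim("targets.txt")` (file name hypothetical) would replace the `text =` argument.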
After ShinyLimma! verifies the submitted files describe valid input, the user can then go through the following steps.
Note: Each step is a different tab of the user interface, which keeps things nicely modular from a programming perspective, but more importantly keeps different tasks separated in the mind of the user.
Exploratory Data Analysis: Used to try out a couple of basic clustering methods (heatmap) and a boxplot of arrays in order to find any problem outliers before preprocessing. Users can also generate an ArrayQC report that can be downloaded later. ArrayQC is another open-source package bioinformaticians often use as a sort of automated check that nothing went so wrong with the experiment that they can’t move forward.
Preprocessing: Allows the user to choose their desired normalization and to set a detection p-value threshold that decides which probes (mapped to different genes) will be accepted for further analysis. After this is run, a plot of the data is displayed so the user can compare preprocessing methods in a nicely visual way.
Contrast Matrix Specification: Lets the user define the experimental design that will be used to run the differential expression analysis. A syntax checker I’ve written provides basic error checking to test whether the comparisons typed out are valid for Limma. This tab also provides a help doc that explains the proper syntax for the program, as well as a brief overview of what a contrast matrix really is.
Results: Runs the linear model. This is where the user’s work pays off! A user can see a Venn diagram comparing which genes were significantly different between the user-defined groups at a user-defined p-value, as well as look up the values of particular genes they might be interested in. The “Top Table” feature allows a user to look at the genes that were most differentially expressed, as defined by log fold-change and p-value.
Export: When finished, this tab will allow a user to download a basic template report that lists the steps they used during their analysis, as well as an .R script that reproduces their results so they can be freely shared with other people.
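For readers who want to see what the preprocessing step boils down to, here is a rough base R sketch on simulated data. It is an illustration, not the app’s code: the cutoffs (`p_cutoff = 0.05`, `min_arrays = 3`), the helper names, and all the numbers are assumptions, and the hand-rolled quantile normalization is only a stand-in for the normalization routines Limma itself provides.

```r
set.seed(1)
n_arrays <- 6
array_names <- paste0("array_", seq_len(n_arrays))
probe_names <- paste0("probe_", 1:200)

# Simulated raw intensities (made-up numbers, not real array data).
raw <- matrix(rexp(200 * n_arrays, rate = 1 / 200) + 50,
              nrow = 200, dimnames = list(probe_names, array_names))

# Simulated detection p-values: the first 120 probes are clearly detected,
# the remaining 80 are not.
detection_p <- rbind(
  matrix(runif(120 * n_arrays, 0, 0.01), ncol = n_arrays),
  matrix(runif(80 * n_arrays, 0.2, 1), ncol = n_arrays)
)
dimnames(detection_p) <- dimnames(raw)

# Keep a probe if it is detected (p below the cutoff) in enough arrays.
filter_by_detection <- function(expr, detection_p,
                                p_cutoff = 0.05, min_arrays = 3) {
  stopifnot(identical(dim(expr), dim(detection_p)))
  keep <- rowSums(detection_p < p_cutoff) >= min_arrays
  expr[keep, , drop = FALSE]
}

# Simple quantile normalization: every array ends up with the same
# distribution, namely the mean of the sorted columns.
quantile_normalize <- function(x) {
  ranks <- apply(x, 2, rank, ties.method = "first")
  sorted_means <- rowMeans(apply(unname(x), 2, sort))
  normalized <- apply(ranks, 2, function(r) sorted_means[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

filtered <- filter_by_detection(raw, detection_p)
normalized <- quantile_normalize(log2(filtered))

# Uncomment to compare the arrays visually before and after:
# boxplot(log2(filtered)); boxplot(normalized)
```

Changing `p_cutoff` or `min_arrays` and redrawing the boxplots is the kind of side-by-side comparison the Preprocessing tab is meant to make easy.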
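Here is one way a contrast syntax check could work, sketched in base R. To be clear, this is not the checker from ShinyLimma!: the function names `check_contrast` and `contrast_vector`, the list of allowed operators, and the group names are all assumptions for illustration. The idea is that a contrast such as `Mutant - Control` is just an arithmetic expression over group names, so it can be parsed and inspected before it is ever handed to Limma.

```r
# Hypothetical checker: a contrast is valid if it parses as R code, uses
# only known group names, and uses only simple arithmetic.
check_contrast <- function(contrast, groups,
                           allowed_ops = c("+", "-", "*", "/", "(")) {
  expr <- tryCatch(parse(text = contrast)[[1]], error = function(e) NULL)
  if (is.null(expr)) {
    return(list(valid = FALSE, message = "Could not parse the expression"))
  }
  vars <- all.vars(expr)
  ops <- setdiff(all.names(expr), vars)
  unknown <- setdiff(vars, groups)
  if (length(vars) == 0) {
    return(list(valid = FALSE, message = "No group names found"))
  }
  if (length(unknown) > 0) {
    return(list(valid = FALSE,
                message = paste("Unknown group(s):",
                                paste(unknown, collapse = ", "))))
  }
  if (!all(ops %in% allowed_ops)) {
    return(list(valid = FALSE, message = "Only + - * / and () are allowed"))
  }
  list(valid = TRUE, message = "OK")
}

# Turn a valid contrast into its coefficient vector by evaluating the
# expression with each group replaced by a unit vector.
contrast_vector <- function(contrast, groups) {
  unit <- diag(length(groups))
  values <- lapply(seq_along(groups), function(i) unit[, i])
  names(values) <- groups
  out <- eval(parse(text = contrast)[[1]], envir = values, enclos = baseenv())
  names(out) <- groups
  out
}

groups <- c("Control", "Mutant")
check_contrast("Mutant - Control", groups)   # valid
check_contrast("Mutant - Contrl", groups)    # typo in a group name
contrast_vector("Mutant - Control", groups)  # Control = -1, Mutant = 1
```

The coefficient vector is one column of a contrast matrix: it says how the group means are combined to form the comparison being tested. This sketch assumes group names are valid R names (no spaces, not starting with a digit).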
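Under the hood, a Results step like this maps onto a handful of Limma calls. The sketch below runs them on simulated data, so the group names, effect sizes, and the 0.05 cutoff are all assumptions rather than anything taken from ShinyLimma! itself. It needs the limma package installed from Bioconductor.

```r
library(limma)

set.seed(42)
groups <- factor(rep(c("Control", "MutantA", "MutantB"), each = 3))

# Simulated log2 expression matrix: 500 genes x 9 arrays (made-up data).
expr <- matrix(rnorm(500 * 9, mean = 8, sd = 0.5), nrow = 500,
               dimnames = list(paste0("gene_", 1:500),
                               paste0("array_", 1:9)))
# Spike in true differences so there is something to find.
expr[1:20, groups == "MutantA"] <- expr[1:20, groups == "MutantA"] + 2
expr[11:30, groups == "MutantB"] <- expr[11:30, groups == "MutantB"] + 2

# Design matrix with one column per group.
design <- model.matrix(~ 0 + groups)
colnames(design) <- levels(groups)

# The comparisons of interest, written the way a user would type them.
contrast_matrix <- makeContrasts(
  AvsControl = MutantA - Control,
  BvsControl = MutantB - Control,
  levels = design
)

# Fit the linear model, apply the contrasts, and moderate the statistics.
fit <- lmFit(expr, design)
fit <- contrasts.fit(fit, contrast_matrix)
fit <- eBayes(fit)

# "Top Table": the most differentially expressed genes for one contrast.
top <- topTable(fit, coef = "AvsControl", number = 10, adjust.method = "BH")

# Significance calls per gene and contrast at a chosen p-value cutoff.
calls <- decideTests(fit, p.value = 0.05)
summary(calls)

# Uncomment to draw the Venn diagram of genes called in each contrast:
# vennDiagram(calls)
```

The `top` data frame includes the log fold-change (`logFC`) and adjusted p-value (`adj.P.Val`) columns that a “Top Table” view would show.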
Not quite finished yet…
- ShinyLimma! isn’t quite finished yet, but you can clone it from GitHub and give it a test drive!
- I’m planning to have it wrapped up and sent to Bioconductor (hopefully with an associated publication!) in early May 2018, in time for the end of my first year of graduate school. So check back often for progress updates if this is something you’re interested in.