Differential expression using DESeq2
Mon, 10/14/2019 - 09:45. Tags: bigdata

Often the goal of an RNA-seq experiment is to find differentially expressed genes. Below I give guidelines for calling differential expression.

Idea behind differential expression programs

 

Imagine you do RNA-seq on 6 samples that are all biological replicates of each other. When you analyze them, you split them into two groups. Then you draw each gene as a dot on a graph where the x axis is expression level and the y axis is log fold change. What do you expect? What do you get? (For example, you can use data from SRP221750.)
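One way to build intuition before looking at real data is to simulate the thought experiment. The sketch below is only an illustration under stated assumptions: counts are drawn from a Poisson distribution (real RNA-seq counts are noisier), the gene means of 5 and 500 are made up, and the function names are hypothetical. It splits six simulated replicates into two groups of three and compares the spread of the log fold change for low and high expression genes.

```python
import math
import random
import statistics


def poisson(rng, lam):
    """Draw one Poisson(lam) count with Knuth's method (fine for lam below ~700)."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def null_ma_points(true_means, n_per_group=3, seed=0):
    """Return (mean expression, log2 fold change) per gene when both
    groups are replicates drawn from the same distribution."""
    rng = random.Random(seed)
    points = []
    for lam in true_means:
        group1 = [poisson(rng, lam) for _ in range(n_per_group)]
        group2 = [poisson(rng, lam) for _ in range(n_per_group)]
        m1 = statistics.mean(group1) + 0.5  # pseudocount avoids log(0)
        m2 = statistics.mean(group2) + 0.5
        points.append(((m1 + m2) / 2, math.log2(m2 / m1)))
    return points


# Assumed toy values: 200 low-expression genes and 200 high-expression genes.
low = null_ma_points([5] * 200, seed=1)
high = null_ma_points([500] * 200, seed=2)
spread_low = statistics.stdev([lfc for _, lfc in low])
spread_high = statistics.stdev([lfc for _, lfc in high])
print("spread of log2 fold change, low expression: ", round(spread_low, 3))
print("spread of log2 fold change, high expression:", round(spread_high, 3))
```

Even though no gene is truly different, the fold changes are not zero, and they scatter much more widely at low expression. That funnel shape is what a differential expression program has to account for.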

 

Step 0) Prepare

  • If you have never used R before, you will need to learn about R. I like this set of .
  • To work today, you need to install .
    • Within RStudio you will need to install the following:
      • install.packages("ggplot2")
      • install.packages("tidyr")
      • install.packages("BiocManager")
      • BiocManager::install("DESeq2")
      • BiocManager::install("vsn")

 

 

PREPARE - get this data if you don't have your own

  1. Download the : (You must log in with a colorado.edu address).
  2. Unpack any gz  data:
    1. tar -zxvf archive.tar.gz

 

Step 1) Count the reads over each gene.

Tools for counting reads
  • I will show you: Rsubread
  • Other tools you could use: bedtools coverage or multicov (command line), HTSeq (Python)
  • Assumptions of counting programs to be aware of
    • Are low quality reads counted?
    • Are multi-mapped reads counted?
    • How are spliced reads handled?
    • How is paired end data handled?
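
To see why these assumptions matter, here is a toy sketch of a counter with two of those choices exposed as options. It is not how Rsubread or any real tool works internally: the read records, the field names (`mapq`, `n_hits`), and the function are hypothetical, and spliced and paired-end reads are not modeled at all.

```python
def count_reads(reads, genes, min_mapq=0, count_multimapped=True):
    """Count reads whose start falls inside each gene interval.

    reads: list of dicts with 'chrom', 'start', 'mapq', 'n_hits'
    genes: dict of gene name -> (chrom, start, end), end exclusive
    """
    counts = {name: 0 for name in genes}
    for read in reads:
        if read["mapq"] < min_mapq:
            continue  # low-quality alignment is dropped
        if read["n_hits"] > 1 and not count_multimapped:
            continue  # read aligns to more than one place
        for name, (chrom, start, end) in genes.items():
            if read["chrom"] == chrom and start <= read["start"] < end:
                counts[name] += 1
    return counts


# Made-up example data.
genes = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 500, 600)}
reads = [
    {"chrom": "chr1", "start": 120, "mapq": 60, "n_hits": 1},
    {"chrom": "chr1", "start": 150, "mapq": 3, "n_hits": 1},
    {"chrom": "chr1", "start": 510, "mapq": 60, "n_hits": 2},
    {"chrom": "chr1", "start": 550, "mapq": 60, "n_hits": 1},
]
permissive = count_reads(reads, genes)
strict = count_reads(reads, genes, min_mapq=10, count_multimapped=False)
print("permissive:", permissive)
print("strict:    ", strict)
```

The same four reads give different counts depending on the settings, so check the defaults of whatever counting program you use.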
Input:
  1. Mapped reads file (generally bam or cram, plus the index files for those files)
  2. Regions to count file (generally gtf or bed)
Output
  1. expression object (we will save as RData file)
Method

Create an R script that looks like this, or run each of these commands on your command line.

    Step 2) Calculate differential expression

    To get the data I use in this example download the files from link.

    The major steps for differential expression are to normalize the data, determine where the differential line will be, and call the differentially expressed genes. How each of these steps is done varies from program to program.
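
As an illustration of the normalization step, DESeq2 estimates one size factor per sample with a median-of-ratios approach. The sketch below shows that idea in plain Python on made-up counts. It is a simplified stand-in, not DESeq2's code, and the function name is hypothetical.

```python
import math
import statistics


def size_factors(counts):
    """counts: list of samples, each a list of per-gene counts (same gene order).

    For each gene, take the geometric mean across samples; a sample's size
    factor is the median of its ratios to those geometric means.
    Genes with a zero in any sample are skipped.
    """
    n_genes = len(counts[0])
    usable = [g for g in range(n_genes) if all(sample[g] > 0 for sample in counts)]
    log_geo_mean = {
        g: statistics.mean([math.log(sample[g]) for sample in counts]) for g in usable
    }
    return [
        math.exp(statistics.median([math.log(sample[g]) - log_geo_mean[g] for g in usable]))
        for sample in counts
    ]


# Made-up counts: sample_b was sequenced twice as deeply as sample_a.
sample_a = [10, 20, 30, 0]
sample_b = [20, 40, 60, 5]
factors = size_factors([sample_a, sample_b])
normalized = [
    [count / factor for count in sample]
    for sample, factor in zip([sample_a, sample_b], factors)
]
print("size factors:", [round(f, 3) for f in factors])
print("normalized:  ", [[round(c, 2) for c in sample] for sample in normalized])
```

After dividing by the size factors, the first three genes have the same normalized value in both samples, which is what you want when the only difference is sequencing depth.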

    Tools

    I will teach you DESeq2. However, I also recommend edgeR or baySeq. baySeq is great for complicated patterns of analysis, but not as good for cutoff analysis.

     

    Input:
    1. list of samples you want to keep (the ones that looked ok on quality control)
    2. Coverage$counts from the RData file in step1
    Output
    1. list of genes that are differentially expressed via adjusted p-value
    2. normalized count object inside "DESeqDataSet"
    3. estimates of dispersion of the data
    4. baseMean expression of each gene
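
The adjusted p-value in output 1 comes from a multiple-testing correction; DESeq2 uses the Benjamini-Hochberg procedure by default. Here is a minimal sketch of that procedure on made-up p-values. The function name is hypothetical, and DESeq2's independent filtering step is left out.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never decrease with rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for five genes.
pvals = [0.01, 0.04, 0.03, 0.005, 0.5]
padj = benjamini_hochberg(pvals)
significant = [i for i, q in enumerate(padj) if q < 0.1]
print("adjusted p-values:", [round(q, 4) for q in padj])
print("genes passing padj < 0.1:", significant)
```

Calling genes on the adjusted p-value rather than the raw p-value keeps the expected fraction of false positives among your called genes under the cutoff you choose.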
    Additional Methods

    This is the link for the DESeq2 script I am using.

    This is another DESeq2 script that shows:

    • how you can use alternative size factors if you know the size factors might be affected by the data in some way
    • how to compare multiple things at once with a function

    Design terms information:

    • Imagine you have 3 biological replicates (repA, repB, repC) of RNA-seq for each of two people (person1 and person2), and that the three replicates don't look very similar because of batch effects. Your metadata file should have one column for the replicate and one column for the person. Your designs could be the following:
      • ~rep + person # reports genes that go the same direction (up- or down-regulated) in all three replicates
      • ~rep + person + rep:person # person_1vs2 is the difference between person1 and person2 at the reference level of rep, in this case repA, so it is just the result for replicate A
        • It also gives you two interaction terms, repB.person2 and repC.person2, which tell you what is happening in repB and repC relative to person_1vs2
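
To make the design terms concrete, the sketch below writes out the 0/1 columns that the two designs above correspond to, with repA and person1 as the reference levels. This is a hand-built illustration in Python (the function and column names are hypothetical), not output from DESeq2.

```python
def design_matrix(samples, interaction=False):
    """samples: list of (rep, person) tuples; reference levels are repA and person1."""
    columns = ["Intercept", "repB", "repC", "person2"]
    if interaction:
        columns += ["repB.person2", "repC.person2"]
    rows = []
    for rep, person in samples:
        is_b = int(rep == "repB")
        is_c = int(rep == "repC")
        is_p2 = int(person == "person2")
        row = [1, is_b, is_c, is_p2]
        if interaction:
            # Interaction columns are 1 only for person2 samples in that replicate.
            row += [is_b * is_p2, is_c * is_p2]
        rows.append(row)
    return columns, rows


samples = [
    (rep, person)
    for person in ("person1", "person2")
    for rep in ("repA", "repB", "repC")
]
additive_cols, additive_rows = design_matrix(samples)
full_cols, full_rows = design_matrix(samples, interaction=True)
print(full_cols)
for sample, row in zip(samples, full_rows):
    print(sample, row)
```

In the interaction design, the person2 column alone describes the repA samples, and the repB.person2 and repC.person2 columns add a correction for the other replicates. Note that with only one sample per rep and person combination, the interaction design has as many columns as samples, so in practice it would need more samples per combination to leave room for estimating dispersion.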

     

    Other resources

    http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

    https://github.com/lpantano/DEGreport

    https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0190152

    https://rstudio-pubs-static.s3.amazonaws.com/329027_593046fb6d7a427da6b2c538caf601e1.html#the-condition-effect-for-genotype-i-the-main-effect

     

     

    Graphing tools
    Thu, 02/28/2019 - 12:20. Categories: data. Tags: bigdata

    When I first got into bioinformatics, one of the things I needed to learn quickly was how to graph very large tables of data. Below are some of my favorite websites for learning how to graph big data.

     

    A great website with instructions on using R

    Some great websites with instructions on using Python for big data and graphing.

    Use this to learn the Python package pandas (like Excel for big data).

    Fancier plotting can be achieved with Plotly or Bokeh


     

    Specific to those in BioFrontiers on the compute cluster Fiji:

    To use RStudio on Fiji, use a web browser to go to fiji-viz.colorado.edu and click on RStudio.

     

    Setting up a FANCY Jupyter notebook on Fiji.

    Step 0:

    Log into Fiji on the command line and do this:

    Step 1:

    Also run these lines on the command line:

    pip3 install hide_code
    pip3 install plotly
    pip3 install ipywidgets
    pip3 install jupyter_contrib_nbextensions
    pip3 install jupyter_nbextensions_configurator
    jupyter contrib nbextension install --user

    jupyter nbextension install --user --py widgetsnbextension

    jupyter nbextension enable --user --py widgetsnbextension
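
Before logging off, it can be worth checking that the installs worked. The short script below is a sketch that asks Python whether each package can be found. It assumes the import names match the pip package names above and that you run it with the same python3 that pip3 installed into.

```python
import importlib.util

# Import names assumed to match the pip package names installed above.
PACKAGES = [
    "hide_code",
    "plotly",
    "ipywidgets",
    "jupyter_contrib_nbextensions",
    "jupyter_nbextensions_configurator",
]


def check_installed(names):
    """Return {name: True/False} depending on whether Python can locate the package."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_installed(PACKAGES)
for name, ok in status.items():
    print(name, "OK" if ok else "MISSING")
```

Anything reported as MISSING is worth reinstalling before you start the server.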

    Then log off Fiji on the command line!

    Step 2:

    Start a new server on fiji-viz.colorado.edu

    Step 3:

    Start a new notebook

    Step 4: Use the control panel button to log off the server.

    Step 5: Restart the server. On the home page you will have a new tab called Nbextensions; click that and turn on the extensions you want.

    Step 6: Start a new notebook with the "New" button. Name it by clicking on the name.

    BTW, if you want to do the R lesson in the jupyter notebook instead the starting file is here.

    #df=pandas.read_csv("https://raw.githubusercontent.com/kbroman/kbroman.github.io/master/datacarp/portal_data_joined.csv")

     

     

    Finding the number of unique items in a column
    Wed, 03/14/2018 - 09:37. Tags: bigdata

    Often, to check the content of a tab-delimited file, we want to know how many unique things there are in a particular column. Below I give you instructions for checking this using the command line and the Python package pandas.

     

    If you want to check for the number of unique things on the command line or in a shell script

    How many unique things are in column 1 of a file named input_file?

    # outputs a count of the unique things in a column
    cut -f 1 input_file | sort | uniq | wc -l

    # outputs each of the unique things and how many of each there are
    cut -f 1 input_file | sort | uniq -c

    If you want to check for the number of unique things using the Python package pandas

    # My file's first line is chr, start, stop, name, score. I want to know how many unique chromosomes there are, or how many lines have each of the chromosomes.

    # First open Python by typing python (if you are on Fiji you must also module load python/2.7.3/pandas)

    # outputs a count of the unique things in a column

    import pandas
    df = pandas.read_csv("allmu.bed", sep="\t")
    print(df["chr"].nunique())

    # outputs each of the unique things and how many of each there are

    import pandas
    df = pandas.read_csv("allmu.bed", sep="\t")
    print(df["chr"].value_counts())

     

     

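    # If the file has no header line, supply the column names yourself
    import pandas

    df = pandas.read_csv("allmu.bed", names=["chr", "start", "stop", "name", "score"], sep="\t")
    # counts per chromosome, sorted from fewest to most lines
    print(df["chr"].value_counts().sort_values())

    # number of unique chromosomes
    print(df["chr"].nunique())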
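
If pandas is not available, the same two answers can be had with the Python 3 standard library. The sketch below is one way to do it: the function name is hypothetical, and the small demo file it writes is made up so the example runs on its own.

```python
import csv
import os
import tempfile
from collections import Counter


def column_counts(path, column=0, delimiter="\t", skip_header=False):
    """Count how many lines carry each distinct value in one column (0-based)."""
    counts = Counter()
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        if skip_header:
            next(reader, None)
        for row in reader:
            if row:  # ignore blank lines
                counts[row[column]] += 1
    return counts


# Made-up three-line bed file, written to a temporary location for the demo.
demo = "chr1\t1\t10\ta\t0\nchr1\t20\t30\tb\t0\nchr2\t5\t9\tc\t0\n"
with tempfile.NamedTemporaryFile("w", suffix=".bed", delete=False) as tmp:
    tmp.write(demo)
counts = column_counts(tmp.name)
os.remove(tmp.name)

print("number of unique values:", len(counts))
print("lines per value:", counts.most_common())
```

Here `len(counts)` plays the role of `nunique()` and `counts.most_common()` plays the role of `value_counts()`.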

     

     

     
