Get All Weeks Bioconductor for Genomic Data Science Quiz Answers
Learn to use tools from the Bioconductor project to perform analysis of genomic data. This is the fifth course in the Genomic Big Data Specialization at Johns Hopkins University.
Bioconductor for Genomic Data Science Quiz Answers
Week 1 Quiz Answers
Quiz 1 Answers
Q1. Use the AnnotationHub package to obtain data on “CpG Islands” in the human genome.
Question: How many islands exist on the autosomes?
Q2. Question: How many CpG Islands exist on chromosome 4?
Q3. Obtain the data for the H3K4me3 histone modification for the H1 cell line from Epigenomics Roadmap, using AnnotationHub. Subset these regions to only keep regions mapped to the autosomes (chromosomes 1 to 22).
Question: How many bases do these regions cover?
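The subsetting and counting steps in Q1 to Q3 can be sketched on toy data. This is only an illustration: the `peaks` object below is a hypothetical stand-in for the regions returned by AnnotationHub, and the sketch assumes the GenomicRanges package is installed.

```r
library(GenomicRanges)

# Toy regions standing in for ranges fetched from AnnotationHub (hypothetical).
peaks <- GRanges(
  seqnames = c("chr1", "chr1", "chr4", "chrX"),
  ranges = IRanges(start = c(1, 50, 200, 10), end = c(100, 150, 299, 19))
)

# Keep only ranges on the autosomes (chromosomes 1 to 22).
autosomes <- paste0("chr", 1:22)
peaks_auto <- peaks[as.character(seqnames(peaks)) %in% autosomes]

# Number of autosomal regions, and number of regions on chromosome 4.
n_regions <- length(peaks_auto)
n_chr4 <- sum(as.character(seqnames(peaks_auto)) == "chr4")

# Bases covered: merge overlapping ranges first so each base is counted once.
bases_covered <- sum(width(reduce(peaks_auto)))
```

Calling `reduce()` before summing widths matters whenever regions overlap; without it, shared bases would be counted more than once.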
Q4. Obtain the data for the H3K27me3 histone modification for the H1 cell line from Epigenomics Roadmap, using the AnnotationHub package. Subset these regions to only keep regions mapped to the autosomes. In the returned data, each region has an associated “signalValue”.
Question: What is the mean signalValue across all regions on the standard chromosomes?
Q5. Bivalent regions are bound by both H3K4me3 and H3K27me3.
Question: Using the regions we have obtained above, how many bases on the standard chromosomes are bivalently marked?
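One way to read "bivalently marked" is as the set of bases covered by both marks, which corresponds to an intersection of the two sets of regions. The sketch below uses two small hypothetical GRanges objects in place of the real H3K4me3 and H3K27me3 peaks and assumes GenomicRanges is installed.

```r
library(GenomicRanges)

# Hypothetical peak sets for the two histone marks.
h3k4me3 <- GRanges("chr1", IRanges(start = c(1, 200), end = c(100, 300)))
h3k27me3 <- GRanges("chr1", IRanges(start = c(51, 250), end = c(150, 400)))

# Bases covered by both marks.
bivalent <- intersect(h3k4me3, h3k27me3)
bivalent_bases <- sum(width(bivalent))
```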
Q6. We will examine the extent to which bivalent regions overlap CpG Islands.
Question: How big a fraction (expressed as a number between 0 and 1) of the bivalent regions overlap one or more CpG Islands?
Q7. Question: How big a fraction (expressed as a number between 0 and 1) of the bases which are part of CpG Islands are also bivalently marked?
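Q6 and Q7 ask for two different fractions: one counts regions, the other counts bases. A minimal sketch of the difference, using hypothetical toy ranges (`cpg` and `bivalent` are illustrative names, not the quiz data) and assuming GenomicRanges is installed:

```r
library(GenomicRanges)

# Hypothetical CpG islands and bivalent regions.
cpg <- GRanges("chr1", IRanges(start = c(60, 500), end = c(119, 599)))
bivalent <- GRanges("chr1", IRanges(start = c(51, 250), end = c(100, 300)))

# Region-level fraction: share of bivalent regions touching at least one island.
frac_regions <- mean(overlapsAny(bivalent, cpg))

# Base-level fraction: share of CpG island bases that are also bivalent.
shared_bases <- sum(width(intersect(cpg, bivalent)))
frac_bases <- shared_bases / sum(width(reduce(cpg)))
```

The two numbers generally differ, because a single overlapping base is enough for a region to count in the first fraction.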
Q8. Question: How many bases are bivalently marked within 10kb of CpG Islands?
Tip: consider using the “resize()” function.
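The tip about `resize()` can be illustrated as follows. This sketch assumes that "within 10kb" means extending each island by 10,000 bases on both sides; the ranges are hypothetical and GenomicRanges is assumed to be installed.

```r
library(GenomicRanges)

# Hypothetical CpG island and bivalent regions.
cpg <- GRanges("chr1", IRanges(start = 50001, end = 51000))
bivalent <- GRanges("chr1", IRanges(start = c(39001, 60001), end = c(41000, 70000)))

# Extend each island by `flank` bases on both sides, keeping its center fixed.
flank <- 10000
cpg_ext <- resize(cpg, width = width(cpg) + 2 * flank, fix = "center")

# Bivalent bases that fall inside the extended islands.
near_bases <- sum(width(intersect(reduce(cpg_ext), bivalent)))
```

On real data, extended ranges near chromosome ends may run past the chromosome boundaries, so trimming may be needed.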
Q9. Question: How big a fraction (expressed as a number between 0 and 1) of the human genome is contained in a CpG Island?
Tip 1: the object returned by AnnotationHub contains “seqlengths”.
Tip 2: you may encounter an integer overflow. As described in the session on R Basic Types, you can address this by converting integers to numeric before summing them, “as.numeric()”.
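The overflow mentioned in Tip 2 can be reproduced in base R. The lengths and island total below are made-up values chosen only to exceed the integer range; they are not real seqlengths.

```r
# Hypothetical chromosome lengths stored as integers, as seqlengths are.
lens <- c(chrA = 2000000000L, chrB = 2000000000L)

# Summing integers past the 32-bit limit yields NA (with a warning).
int_total <- suppressWarnings(sum(lens))

# Converting to numeric (double) before summing avoids the overflow.
num_total <- sum(as.numeric(lens))

# Fraction of the genome inside CpG islands, for a hypothetical island total.
island_bases <- 40000000
frac_in_islands <- island_bases / num_total
```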
Q10. Question: Compute an odds-ratio for the overlap of bivalent marks with CpG islands.
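An odds ratio for the overlap can be computed from a 2x2 table of base counts. The sketch below assumes the table is counted in bases (both marks, bivalent only, CpG only, neither); the counts and genome size are hypothetical.

```r
# Hypothetical base counts on a genome of 1000 bases.
genome_size <- 1000
both <- 40       # bivalent and in a CpG island
biv_only <- 60   # bivalent, not in a CpG island
cpg_only <- 10   # in a CpG island, not bivalent
neither <- genome_size - both - biv_only - cpg_only

# Rows: bivalent yes/no. Columns: CpG island yes/no (matrix fills by column).
tab <- matrix(
  c(both, cpg_only, biv_only, neither),
  nrow = 2,
  dimnames = list(bivalent = c("yes", "no"), cpg = c("yes", "no"))
)

# Odds ratio: product of the diagonal over product of the off-diagonal.
odds_ratio <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
```

An odds ratio above 1 would indicate that the two features co-occur more often than expected if they were independent.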
Week 2 Quiz Answers
Quiz 2 Answers
Q1. Question: What is the GC content of “chr22” in the “hg19” build of the human genome?
Tip: The reference genome includes “N” bases; you will need to exclude those.
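Excluding "N" bases amounts to dividing by the count of A, C, G and T only. A minimal sketch on a short made-up sequence, assuming the Biostrings package is installed (on real data the sequence would come from a BSgenome package):

```r
library(Biostrings)

# Hypothetical sequence with unknown bases (N) at both ends.
chr <- DNAString("NNNNGCGCATATGCNN")

# Count only the four standard bases, so N does not enter the denominator.
counts <- alphabetFrequency(chr)[c("A", "C", "G", "T")]
gc_content <- sum(counts[c("C", "G")]) / sum(counts)
```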
Q2. Background: In the previous assessment we studied H3K27me3 “narrowPeak” regions from the H1 cell line (recall that the Roadmap ID for this cell line is “E003”). We want to examine whether the GC content of the regions influences the signal; in other words, whether the reported results appear biased by GC content.
Question: What is the mean GC content of H3K27me3 “narrowPeak” regions from Epigenomics Roadmap from the H1 stem cell line on chr22?
Clarification: Compute the GC content for each peak region as a percentage and then average those percentages to compute a number between 0 and 1.
Q3. The “narrowPeak” regions include information on a value they call “signalValue”.
Question: What is the correlation between GC content and “signalValue” of these regions (on chr22)?
Q4. The “narrowPeak” regions are presumably reflective of a ChIP signal in these regions. To confirm this, we want to obtain the “fc.signal” data from the AnnotationHub package on the same cell line and histone modification. This data represents a vector of fold-change enrichment of ChIP signal over input.
Question: What is the correlation between the “signalValue” of the “narrowPeak” regions and the average “fc.signal” across the same regions?
Clarification: First compute the average “fc.signal” across each region, for example using “Views”; this yields a single number for each region. Next correlate these numbers with the “signalValue” of the “narrowPeaks”.
Q5. Referring to the objects made and defined in the previous question.
Question: How many bases on chr22 have an fc.signal greater than or equal to 1?
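The "Views" approach from Q4 and the threshold count from Q5 can be sketched on a toy signal vector. All values below are hypothetical, and the sketch assumes the IRanges package is installed.

```r
library(IRanges)

# Hypothetical per-base fold-change signal along one chromosome.
signal <- Rle(c(0, 0, 2, 2, 4, 4, 1, 1, 3, 3))

# Hypothetical peak regions and their reported signalValue.
peaks <- IRanges(start = c(1, 5, 9), end = c(4, 6, 10))
signal_value <- c(1.0, 4.5, 2.5)

# Average signal inside each peak: one number per region.
vi <- Views(signal, peaks)
avg_fc <- viewMeans(vi)

# Correlation between reported signalValue and average signal.
r <- cor(signal_value, avg_fc)

# Number of bases with signal greater than or equal to 1.
n_bases_ge1 <- sum(signal >= 1)
```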
Q6. The H1 stem cell line is an embryonic stem cell line, a so-called pluripotent cell. Many epigenetic marks change upon differentiation. We will examine this. We choose the cell type with Roadmap ID “E055” which is foreskin fibroblast primary cells.
We will use the “fc.signal” for this cell type for the H3K27me3 mark, on chr22. We now have a signal track for E003 and a signal track for E055. We want to identify regions of the genome which gain H3K27me3 upon differentiation. These are regions which have a higher signal in E055 than in E003. To do this properly, we would need to standardize (normalize) the signal across the two samples; we will ignore this for now.
Question: Identify the regions of the genome where the signal in E003 is 0.5 or lower and the signal in E055 is 2 or higher.
Tip: If you end up with having to intersect two different Views, note that you will need to convert the Views to IRanges or GRanges first with ir <- as(vi, "IRanges").
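A possible way to combine the two thresholds, following the tip above, is to slice each signal track, convert the resulting Views to IRanges, and intersect them. The two tracks below are hypothetical, and using `slice()` here is one option among several; IRanges is assumed to be installed.

```r
library(IRanges)

# Hypothetical signal tracks for the two cell types over the same 8 bases.
e003 <- Rle(c(0.2, 0.4, 0.3, 1.0, 0.1, 0.5, 0.5, 2.0))
e055 <- Rle(c(2.5, 3.0, 1.0, 4.0, 2.0, 2.2, 0.5, 3.0))

# Runs where E003 is 0.5 or lower, and runs where E055 is 2 or higher.
low_e003 <- as(slice(e003, upper = 0.5), "IRanges")
high_e055 <- as(slice(e055, lower = 2), "IRanges")

# Regions satisfying both conditions.
gained <- intersect(low_e003, high_e055)
```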
Q7. CpG Islands are dense clusters of CpGs. The classic definition of a CpG Island compares the observed to the expected frequencies of CpG dinucleotides as well as the GC content.
Specifically, the observed CpG frequency is just the number of “CG” dinucleotides in a region. The expected CpG frequency is defined as the frequency of C multiplied by the frequency of G divided by the length of the region.
Question: What is the average observed-to-expected ratio of CpG dinucleotides for CpG Islands on chromosome 22?
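The observed-to-expected definition above can be worked through on a single short sequence. The sequence is made up for illustration; on real data the same calculation would be repeated for each island and then averaged. Biostrings is assumed to be installed.

```r
library(Biostrings)

# Hypothetical island sequence of length 10.
island <- DNAString("CGCGATCGAA")

# Observed: number of "CG" dinucleotides in the region.
observed <- dinucleotideFrequency(island)[["CG"]]

# Expected: count of C times count of G, divided by the region length.
base_counts <- alphabetFrequency(island)
expected <- base_counts[["C"]] * base_counts[["G"]] / length(island)

obs_exp_ratio <- observed / expected
```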
Q8. A TATA box is a DNA element of the form “TATAAA”. Around 25% of genes should have a TATA box in their promoter. We will examine this statement.
Question: How many TATA boxes are there on chr 22 of build hg19 of the human genome?
Clarification: You need to remember to search both forward and reverse strands.
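Searching both strands can be done by also matching the reverse complement of the motif against the forward sequence. A small sketch on a made-up sequence, assuming Biostrings is installed:

```r
library(Biostrings)

# Hypothetical sequence with one TATA box on each strand.
chr <- DNAString("GGTATAAACCTTTATACC")
tata <- DNAString("TATAAA")

# Matches on the forward strand.
n_forward <- countPattern(tata, chr)

# Matches on the reverse strand appear as the reverse complement (TTTATA).
n_reverse <- countPattern(reverseComplement(tata), chr)

n_tata <- n_forward + n_reverse
```

Adding the two counts is safe here because "TATAAA" is not its own reverse complement, so no site is counted twice.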
Q9. Question: How many promoters of transcripts on chromosome 22 containing a coding sequence contain a TATA box on the same strand as the transcript?
Clarification: Use the TxDb.Hsapiens.UCSC.hg19.knownGene package to define transcripts and coding sequence. Here, we define a promoter to be 900bp upstream and 100bp downstream of the transcription start site.
Q10. It is possible for two promoters from different transcripts to overlap, in which case the regulatory features inside the overlap might affect both transcripts. This happens frequently in bacteria.
Question: How many bases on chr22 are part of more than one promoter of a coding sequence?
Clarification: Use the TxDb.Hsapiens.UCSC.hg19.knownGene package to define transcripts and coding sequence. Here, we define a promoter to be 900bp upstream and 100bp downstream of the transcription start site. In this case, ignore strand in the analysis.
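The promoter definition and the "more than one promoter" count can be sketched with `promoters()` and `coverage()`. The transcripts below are hypothetical stand-ins for those from the TxDb package, and GenomicRanges is assumed to be installed.

```r
library(GenomicRanges)

# Hypothetical transcripts on chr22.
tx <- GRanges(
  "chr22",
  IRanges(start = c(2000, 2500, 9000), end = c(5000, 3000, 9500)),
  strand = c("+", "+", "-")
)

# Promoter: 900bp upstream and 100bp downstream of the transcription start site.
prom <- promoters(tx, upstream = 900, downstream = 100)

# coverage() counts, per base, how many promoters cover it, ignoring strand.
cov <- coverage(prom)

# Bases that belong to two or more promoters.
multi_bases <- sum(sum(cov >= 2))
```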
Week 3 Quiz Answers
Quiz 3 Answers
Q1. Question: What is the mean expression across all features for sample 5 in the ALL dataset (from the ALL package)?
Q2. We will use the biomaRt package to annotate an Affymetrix microarray. We want our results in the hg19 build of the human genome and we therefore need to connect to Ensembl 75 which is the latest release on this genome version. How to connect to older versions of Ensembl is described in the biomaRt package vignette; it can be achieved with the command mart <- useMart(host = "feb2014.archive.ensembl.org", biomart = "ENSEMBL_MART_ENSEMBL").
Question: Using this version of Ensembl, annotate each feature of the ALL dataset with the Ensembl gene id. How many probesets (features) are annotated with more than one Ensembl gene id?
Q3. Question: How many probesets (Affymetrix IDs) are annotated with one or more genes on the autosomes (chromosomes 1 to 22)?
Q4. Use the MsetEx dataset from the minfiData package. Part of this question is to use the help system to figure out how to address the question.
Question: What is the mean value of the Methylation channel across the features for sample “5723646052_R04C01”?
Q5. Question: Access the processed data from NCBI GEO Accession number GSE788. What is the mean expression level of sample GSM9024?
Q6. We are using the airway dataset from the airway package.
Question: What is the average of the average length across the samples in the experiment?
Q7. We are using the airway dataset from the airway package. The features in this dataset are Ensembl genes.
Question: What is the number of Ensembl genes which have a count of 1 read or more in sample SRR1039512?
Q8. Question: The airway dataset contains more than 64k features. How many of these features overlap with transcripts on the autosomes (chromosomes 1-22) as represented by the TxDb.Hsapiens.UCSC.hg19.knownGene package?
Clarification: A feature has to overlap the actual transcript, not the intron of a transcript. So you will need to make sure that the transcript representation does not contain introns.
Q9. The expression measures of the airway dataset are the number of reads mapping to each feature. In the previous question we have established that many of these features do not overlap autosomal transcripts from the TxDb.Hsapiens.UCSC.hg19.knownGene. But how many reads map to features which overlap these transcripts?
Question: For sample SRR1039508, how big a percentage (expressed as a number between 0 and 1) of the total reads in the airway dataset for that sample, are part of a feature which overlaps an autosomal TxDb.Hsapiens.UCSC.hg19.knownGene transcript?
Q10. Consider sample SRR1039508 and only consider features which overlap autosomal transcripts from TxDb.Hsapiens.UCSC.hg19.knownGene. We should be able to very roughly divide these transcripts into expressed and non-expressed transcripts. Expressed transcripts should be marked by H3K4me3 at their promoter. The airway dataset has assayed “airway smooth muscle cells”. In the Roadmap Epigenomics data set, sample E096 is supposed to be “lung”. Obtain the H3K4me3 narrowPeaks from the E096 sample using the AnnotationHub package.
Question: What is the median number of counts per feature (for sample SRR1039508) containing a H3K4me3 narrowPeak in their promoter (only features which overlap autosomal transcripts from TxDb.Hsapiens.UCSC.hg19.knownGene are considered)?
Clarification: We are using the standard 2.2kb default Bioconductor promoter setting.
Conclusion: Compare this to the median number of counts for features without a H3K4me3 peak. Note that this short analysis has not taken transcript lengths into account and it compares different genomic regions to each other; this is highly susceptible to bias such as sequence bias.
Week 4 Quiz Answers
Quiz 4 Answers
Q1. The yeastRNASeq experiment data package contains FASTQ files from an RNA seq experiment in yeast. When the package is installed, you can access one of the FASTQ files by the path given by
fastqFilePath <- system.file("reads", "wt_1_f.fastq.gz", package = "yeastRNASeq")
Question: What fraction of reads in this file has an A nucleotide in the 5th base of the read?
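The per-position fraction in Q1 reduces to extracting one character from each read and taking a proportion. A base R sketch on made-up reads (on the real file, the reads would first be loaded, for example with the ShortRead package, which is not shown here):

```r
# Hypothetical read sequences.
reads <- c("ACGTAGG", "TTTTACC", "GGGGCAA", "CCCCATT")

# Extract the 5th base of every read.
fifth_base <- substr(reads, 5, 5)

# Fraction of reads with an A at that position.
frac_a <- mean(fifth_base == "A")
```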
Q2. This is a continuation of Question 1.
Question: What is the average numeric quality value of the 5th base of these reads?
Q3. The leeBamViews experiment data package contains aligned BAM files from an RNA seq experiment in yeast (the same experiment as in Questions 1 and 2, but that is not pertinent to the question). You can access one of the BAM files by the path given by
bamFilePath <- system.file("bam", "isowt5_13e.bam", package = "leeBamViews")
These reads are short reads (36bp) and have been aligned to the genome using a standard aligner, i.e. potential junctions have been ignored (this makes some sense as yeast has very few junctions and the reads are very short).
A read duplicated by position is a read where at least one more read shares the same position.
We will focus on the interval from 800,000 to 801,000 on yeast chromosome 13.
Question: In this interval, how many reads are duplicated by position?
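The definition of a read duplicated by position can be made concrete with a vector of start positions. The positions below are hypothetical, and this sketch assumes "position" means the alignment start only, without considering strand.

```r
# Hypothetical alignment start positions of reads in the interval.
pos <- c(100L, 100L, 150L, 200L, 200L, 200L, 300L)

# Positions that occur more than once.
shared_positions <- unique(pos[duplicated(pos)])

# A read is duplicated by position if at least one other read shares its position.
n_dup_reads <- sum(pos %in% shared_positions)
```

Note that this counts every read at a shared position, not just the extra copies, which is what the definition above asks for.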
Q4. This is a continuation of Question 3.
The package contains 8 BAM files in total, representing 8 different samples from 4 groups. A full list of file paths can be had as
bpaths <- list.files(system.file("bam", package = "leeBamViews"), pattern = "bam$", full.names = TRUE)
An objective of the original paper was the discovery of novel transcribed regions in yeast. One such region is Scchr13:807762-808068.
Question: What is the average number of reads across the 8 samples falling in this interval?
Q5. In the lecture on the oligo package an ExpressionSet with 18 samples is constructed, representing normalized data from an Affymetrix gene expression microarray. The samples are divided into two groups given by the group variable.
Question: What is the average expression across samples in the control group for the “8149273” probeset (this is a character identifier, not a row number)?
Q6. This is a continuation of Question 5.
Use the limma package to fit a two group comparison between the control group and the OSA group, and borrow strength across the genes using eBayes(). Include all 18 samples in the model fit.
Question: What is the absolute value of the log foldchange (logFC) of the gene with the lowest P.value?
Q7. This is a continuation of Question 6.
Question: How many genes are differentially expressed between the two groups at an adj.P.value cutoff of 0.05?
Q8. An example 450k dataset is contained in the minfiData package. This dataset contains 6 samples; 3 cancer and 3 normals. Cancer has been shown to be globally hypo-methylated (less methylated) compared to normal tissue of the same kind.
Take the RGsetEx dataset in this package and preprocess it with the preprocessFunnorm function. For each sample, compute the average Beta value (percent methylation) across so-called OpenSea loci.
Question: What is the mean difference in beta values between the 3 normal samples and the 3 cancer samples, across OpenSea CpGs?
Q9. This is a continuation of Question 8.
The Caco2 cell line is a colon cancer cell line profiled by ENCODE. Obtain the narrowPeak DNase hypersensitive sites computed by the analysis working group (AWG).
Question: How many of these DNase hypersensitive sites contain one or more CpGs on the 450k array?
Q10. The zebrafishRNASeq package contains summarized data from an RNA-seq experiment in zebrafish in the form of a data.frame called zfGenes. The experiment compared 3 control samples to 3 treatment samples.
Each row is a transcript; the data.frame contains 92 rows with spikein transcripts; these have a rowname starting with “ERCC”. Exclude these rows from the analysis.
Use DESeq2 to perform a differential expression analysis between control and treatment. Do not discard (filter) genes and use the padj results output as the p-value.
Question: How many features are differentially expressed between control and treatment (i.e. padj <= 0.05)?
Bioconductor for Genomic Data Science Course Review:
In our experience, the Bioconductor for Genomic Data Science course is worth enrolling in: you can gain new skills from professionals for free.
The Bioconductor for Genomic Data Science course is available on Coursera for free. If you are stuck on a quiz or a graded assessment, visit Networking Funda to get the Bioconductor for Genomic Data Science Quiz Answers.
I hope these Bioconductor for Genomic Data Science Quiz Answers help you learn something new from this course. If they helped you, don’t forget to bookmark our site for more quiz answers.