
Date: 2020-03-24 11:15

The aim of this assignment is to demonstrate your familiarity with data manipulation and analysis, using the software package R. Write commented R code to address each individual task or question below. Your code will be checked to confirm that it works - please ensure that all code functions as expected prior to submitting the assignment! Marks will be awarded for accurate code and a concise description of how it works (code with no explanation as to how it works will receive reduced marks). An example answer is provided at the bottom of the assignment.

The assignment must be submitted by 5pm on the due date via Canvas and further guidance will be provided when the assignment is set.


An association study of breast cancer has identified a region on chromosome 5, mapping to 5p12, that is associated with predisposition to the disease. A fine-mapping analysis of almost 3,500 SNPs has been performed to refine the association signals in the region. You have been provided with the association statistics from this fine-mapping study (breastFineMapping_5p12.csv).

1) Read the fine-mapping summary statistics into an R object called snp.data and determine precisely how many SNPs were genotyped. Note that the dataset is comma delimited.
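As a rough illustration of reading a comma-delimited file and counting its rows, the sketch below writes a tiny stand-in file and reads it back with read.csv. The three rows and their values are invented, and the column names rsid and position are assumptions (only ref_allele, maf, all_beta, all_se and all_pvalue are named in this brief), so the output says nothing about the real dataset.

```r
# Toy stand-in for breastFineMapping_5p12.csv. The rows below are invented
# for illustration and are NOT taken from the real dataset; the column names
# rsid and position are assumptions.
toy.csv <- tempfile(fileext = ".csv")
writeLines(c(
  "rsid,position,ref_allele,maf,all_beta,all_se,all_pvalue",
  "rsA,44000100,C,0.30,0.10,0.02,0.000001",
  "rsB,44100200,T,0.005,-0.05,0.10,0.62",
  "rsC,44150300,G,0.47,-0.08,0.03,0.008"
), toy.csv)

# read.csv expects a header row and comma separators by default
snp.data <- read.csv(toy.csv, stringsAsFactors = FALSE)

# The file holds one row per SNP, so the row count is the number of SNPs
n.snps <- nrow(snp.data)
n.snps

# Inspect column names and types before doing any analysis
str(snp.data)
```

With the real file, the path passed to read.csv would replace toy.csv; str is a quick way to check that numeric columns were not read as text.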

For each SNP in the dataset, SNP name (rsid), chromosome, position on human genome build 37, reference allele, effect allele and minor allele are provided, along with the minor allele and effect allele frequencies in control samples. The log odds ratio (OR), corresponding to the effect of each additional copy of the effect allele upon risk of breast cancer, is shown in the all_beta column and the standard error of the log OR is provided in the all_se column. Finally, the p-value for association with risk of breast cancer is given in the all_pvalue column.
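Because all_beta is on the log-odds scale, exponentiating it gives the odds ratio. The sketch below also builds a 95% confidence interval as exp(beta ± z × se); this Wald-type interval is an assumption, since the brief does not say which interval method is expected. The values in the toy data frame are invented.

```r
# Hypothetical values for illustration only (not from the real dataset)
toy <- data.frame(all_beta = c(0.10, -0.08, 0),
                  all_se   = c(0.02,  0.03, 0.05))

# Normal quantile for a two-sided 95% interval (approximately 1.96)
z <- qnorm(0.975)

# Exponentiate the log OR and the interval limits to move to the OR scale,
# rounding to two decimal places
toy$or       <- round(exp(toy$all_beta), 2)
toy$ci_lower <- round(exp(toy$all_beta - z * toy$all_se), 2)
toy$ci_upper <- round(exp(toy$all_beta + z * toy$all_se), 2)
toy
```

A log OR of 0 maps to an OR of 1 (no effect), positive values map to ORs above 1 and negative values to ORs below 1.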

2) Using your knowledge of relational operators and subsetting in R, find:

i) the genomic coordinate of SNP rs10941673.

ii) the two possible alleles for SNP rs114796267.

iii) the number of SNPs in the dataset that map within the interval 44,044,000 bp to 44,188,000 bp.

3) Remove all SNPs with MAF of less than 1% from the dataset. How many SNPs remain?

4) Create new columns corresponding to odds ratios and 95% confidence intervals, rounding to two decimal places, for each of the remaining SNPs in the dataset. For the subset of SNPs with p-values ≤ 0.05, for how many SNPs is the effect allele associated with

i) an increased risk of breast cancer?

ii) a decreased risk of breast cancer?

5) Using the which.min function, extract from chr5 the row of data that corresponds to the SNP with the smallest p-value. Does the minor allele of this SNP confer increased or decreased risk of breast cancer?
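The subsetting techniques named in the tasks above (relational operators, logical indexing and which.min) can be sketched on a small invented data frame. The column names rsid and position are assumptions, the interval is treated as inclusive at both ends (also an assumption), and none of the values come from the real dataset.

```r
# Invented toy data; rsid and position are assumed column names
toy <- data.frame(
  rsid       = c("rsA", "rsB", "rsC", "rsD"),
  position   = c(44000100, 44100200, 44150300, 44200400),
  maf        = c(0.30, 0.005, 0.47, 0.12),
  all_beta   = c(0.10, -0.05, -0.08, 0.02),
  all_pvalue = c(1e-6, 0.62, 0.008, 0.40),
  stringsAsFactors = FALSE
)

# Look up one value for a named SNP with the == operator
toy$position[toy$rsid == "rsC"]

# Count rows inside an interval by summing a logical vector
n.in.interval <- sum(toy$position >= 44044000 & toy$position <= 44188000)

# Keep SNPs with MAF of at least 1%
toy.common <- toy[toy$maf >= 0.01, ]

# Among SNPs with p <= 0.05, the sign of the log OR gives the direction
# of effect for the effect allele
sig <- toy.common[toy.common$all_pvalue <= 0.05, ]
n.increased <- sum(sig$all_beta > 0)
n.decreased <- sum(sig$all_beta < 0)

# which.min returns the index of the smallest p-value
top.snp <- toy.common[which.min(toy.common$all_pvalue), ]
top.snp
```

Note that the sign of all_beta refers to the effect allele, so interpreting it for the minor allele requires checking whether the two are the same allele for that SNP.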

You have been provided with log10 transformed gene expression data from breast tissue for three genes, FGF10, MRPS30 and HCN1, that map within the vicinity of the breast cancer predisposition SNP that you have identified (breastGeneExpressionData.txt). The genotypes of the predisposition SNP (called SNP_A in this data) for each individual in the gene expression dataset are also provided and are encoded such that 0 = common allele homozygote, 1 = heterozygote and 2 = minor allele homozygote.

6) Make a new dataframe in R called gen.exp that corresponds to the gene expression data. How many breast tissue samples are in the dataset? Rename the column called SNP_A to the name of the SNP that you identified in task 5.

7) Assess the relationship between log10 gene expression for each gene and SNP genotype using box plots. Label the axes of each plot and give each plot a title. For each gene, does expression increase or decrease with each additional copy of the risk allele?

8) Perform an eQTL analysis to test the association between log10 gene expression and SNP genotype for each gene using the lm function in R. When specifying the model formula for this linear regression analysis use log10 gene expression as the response variable and SNP as the predictor variable. The effect estimates from the linear regression analysis correspond to the expected change in gene expression for each additional minor allele of the SNP. For which of the three genes is SNP genotype associated with expression?

9) Based on your findings, write a short report (500 words max) discussing the breast cancer risk locus at 5p12. The report should include the summary statistics (SNP name, OR, 95% CIs, P-value and MAF) of the most significantly associated SNP from the fine-mapping data and the findings from your eQTL analysis, including your boxplots for each gene. Your report should reference recently published literature describing the characterisation of this risk locus.
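The mechanics of the expression tasks above (renaming a column, drawing a labelled box plot and fitting an additive model with lm) might look roughly like the sketch below. The data are simulated with a built-in effect for one gene and none for the other, so the results are illustrative only. The column layout of breastGeneExpressionData.txt is assumed, and "rsTOY" is a placeholder rather than a real SNP name.

```r
set.seed(1)  # simulated data, for illustration only
n <- 60
genotype <- sample(0:2, n, replace = TRUE)  # 0, 1 or 2 minor alleles
gen.exp <- data.frame(
  SNP_A  = genotype,
  FGF10  = 1.0 + 0.3 * genotype + rnorm(n, sd = 0.2),  # simulated effect
  MRPS30 = 2.0 + rnorm(n, sd = 0.2)                    # no simulated effect
)

# Rename the genotype column; "rsTOY" is a placeholder name
names(gen.exp)[names(gen.exp) == "SNP_A"] <- "rsTOY"

pdf(NULL)  # discard graphics output so the sketch runs non-interactively
boxplot(FGF10 ~ rsTOY, data = gen.exp,
        main = "FGF10 expression by genotype (simulated)",
        xlab = "Copies of minor allele",
        ylab = "log10 expression")
invisible(dev.off())

# Additive model: the slope is the expected change in log10 expression
# for each additional minor allele
fit   <- lm(FGF10 ~ rsTOY, data = gen.exp)
slope <- coef(summary(fit))["rsTOY", "Estimate"]

# Repeat the test for each gene and collect the genotype p-values
p.values <- sapply(c("FGF10", "MRPS30"), function(g) {
  gene.fit <- lm(gen.exp[[g]] ~ gen.exp$rsTOY)
  coef(summary(gene.fit))[2, "Pr(>|t|)"]
})
p.values
```

Treating genotype as a numeric 0/1/2 predictor gives a single per-allele slope, which matches the interpretation of the effect estimate given in task 8; the pdf(NULL) line would be dropped when plotting interactively.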


Date set: 27.02.20

Date due: 27.03.20


Example

How many SNPs in the "breastFineMapping_5p12.csv" dataset have either “C” or “T” reference alleles and a minor allele frequency of greater than 45%?

Answer: 6

Solution:

# Subset the data frame to include only rows that meet the criteria
# (ref_allele is "C" or "T", and maf > 0.45) and output the number of rows
dim(snp.data[(snp.data$ref_allele == "C" | snp.data$ref_allele == "T") &
             snp.data$maf > 0.45, ])[1]

Description: A subset of the snp.data data-frame is created by using square brackets. To do so, the name of the data-frame to be subset is specified, followed by square brackets. Since the object to be subset has two dimensions, rows and columns, these must be defined and a comma is used to delineate them, with rows being specified by arguments to the left of the comma and columns by arguments to the right of the comma. Since our data-frame comprises one row per SNP, we need only define arguments to subset rows of snp.data. The “==” relational operator is used to identify rows that have either “C” or “T” in the ref_allele column (defined using the $ operator), and the two comparisons are combined with the logical operator for OR, “|”. Brackets are placed around the OR argument for ref_allele so that the OR statement is evaluated first, before an AND statement (“&”) evaluates whether the rows for which the OR statement is true also have a minor allele frequency greater than 0.45. Finally, the dim function is wrapped around the subset expression so that the number of rows for which the subset statement is true is returned, rather than the actual subset data-frame. Since we are only interested in the number of rows, [1] is used outside the dim function to return the first element of the output from dim, which corresponds to the number of rows.
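For comparison, the same count can be reached with a few equivalent idioms. The sketch below uses an invented five-row data frame, so its result is unrelated to the answer of 6 quoted above; the %in% operator is offered as an alternative to chaining two == tests, not as the expected solution.

```r
# Invented toy data with the two columns used in the example
toy <- data.frame(ref_allele = c("C", "T", "G", "C", "A"),
                  maf        = c(0.46, 0.47, 0.48, 0.10, 0.49),
                  stringsAsFactors = FALSE)

# %in% tests membership in a set, replacing the two == comparisons
keep <- toy$ref_allele %in% c("C", "T") & toy$maf > 0.45

count.dim  <- dim(toy[keep, ])[1]  # approach used in the example
count.nrow <- nrow(toy[keep, ])    # same number via nrow
count.sum  <- sum(keep)            # TRUE counts as 1 when summed

c(count.dim, count.nrow, count.sum)
```

The three approaches agree here because the toy data have no missing values; if maf contained NA, the logical vector would contain NA and the counts could differ.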





Rubric

Practical Assignment Mark Scheme

Criteria   Ratings                                           Pts
Task 1     Full marks: 5.0 to >0.0 Pts; No marks: 0.0 Pts    5.0
Task 2     Full marks: 10.0 to >0.0 Pts; No marks: 0.0 Pts   10.0
Task 3     Full marks: 10.0 to >0.0 Pts; No marks: 0.0 Pts   10.0
Task 4     Full marks: 10.0 to >0.0 Pts; No marks: 0.0 Pts   10.0
Task 5     Full marks: 10.0 to >0.0 Pts; No marks: 0.0 Pts   10.0
Task 6     Full marks: 5.0 to >0.0 Pts; No marks: 0.0 Pts    5.0
Task 7     Full marks: 10.0 to >0.0 Pts; No marks: 0.0 Pts   10.0
Task 8     Full marks: 10.0 to >0.0 Pts; No marks: 0.0 Pts   10.0
Task 9     Full marks: 20.0 to >0.0 Pts; No marks: 0.0 Pts   20.0

Total points: 90.0

