STM4PSD ASSIGNMENT 4

1. Submit your assignment as a single scanned PDF file through the LMS before the due time. By submitting your
work electronically, you are affirming that it is all your own work, and you will be asked to confirm this as you
submit it.
2. You may use facts from the reading materials and from the lab classes to answer these questions. Use of facts
from external resources must be thoroughly explained and properly referenced.
3. Working must be shown to support your answers. It is not only the final answer that is important, but your
mastery of the required techniques, and the way you communicate your ideas and your approach to the problems.
You will be assessed on the way you communicate your answers. In accordance with department policy, students
may be asked by the subject coordinator to verbally explain or demonstrate their answers.
4. There are 20 marks available on this assignment. The marks allocated to each question are indicated next to
each question.

Question 1. (5 marks). Suppose you had two candidate algorithms, algorithm A and algorithm B, designed to
perform analysis on large data sets. You wish to determine which of the two algorithms, if any, has the faster
average running time on large data sets. Let μA and μB denote the true mean running time for algorithm A and B,
respectively, and then let μd = μA − μB. Assume that the differences are normally distributed.
To determine which is faster, you run each algorithm on the same collection of n = 22 different data sets. Some
results are summarised in the table below (where xA and sA denote the sample mean and sample standard deviation
for algorithm A, xB and sB denote the sample mean and sample standard deviation for algorithm B, and xd and sd
denote the sample mean and sample standard deviation for the differences).
xA       sA      xB       sB      xd      sd
130.63   47.60   149.82   45.58   19.18   5.18
The times are measured in minutes.
(a) Is this a one-sample or two-sample test? If it is a two-sample test, is it a paired or unpaired test?
(b) State appropriate null and alternative hypotheses.
(c) Calculate the appropriate test statistic by hand.
(d) Use the R function qt to calculate the 95% confidence interval for the difference in the two means.
(e) Use the R function pt to determine the p-value for the test statistic.
(f) State an appropriate conclusion on the basis of your previous answers.
Complete Question 1 by hand, except where otherwise stated, showing all working. Give answers correct to at least 3
decimal places. Assume a 5% level of significance.
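The quantities asked for in parts (c) to (e) can be checked in R after the hand calculation. The following is a minimal sketch, assuming a t-test carried out on the differences and using the summary values exactly as printed in the table. The variable names are illustrative only, and the sign of xd should be checked against the definition μd = μA − μB before the output is interpreted.

```r
# Summary statistics as printed in the table (times in minutes).
n   <- 22      # number of data sets
x_d <- 19.18   # sample mean of the differences, as printed
s_d <- 5.18    # sample standard deviation of the differences

# Standard error and t statistic for H0: mu_d = 0.
se     <- s_d / sqrt(n)
t_stat <- x_d / se
df     <- n - 1

# 95% confidence interval for mu_d, using the t quantile from qt.
t_crit <- qt(0.975, df)
ci     <- c(x_d - t_crit * se, x_d + t_crit * se)

# Two-sided p-value, using pt.
p_value <- 2 * pt(-abs(t_stat), df)

print(round(c(t = t_stat, lower = ci[1], upper = ci[2]), 3))
print(p_value)
```

A one-sided alternative would use a single tail of pt rather than doubling it.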

Question 2. (4 marks). On YouTube, videos are displayed to viewers with a preview image, which is known as a
thumbnail. Creators who post videos on YouTube are often told that the choice of thumbnail they attach to a video
is one of the most important factors for gaining increased viewership. One way to measure the success of a thumbnail
is by the click-through rate: the proportion of users who click on a video out of those who saw the video thumbnail.
YouTubers will test the performance of thumbnails by swapping thumbnails every 24 hours, and eventually staying
with the one that has the largest click-through rate. This is typically referred to as A/B testing.
An upcoming content creator has performed an experiment with their video thumbnails over a 48-hour time span. In
the first 24 hours, they used thumbnail A, and the video thumbnail was shown to 10,816 users, of whom 397 clicked
on the video. In the second 24 hours, they used thumbnail B, and the video thumbnail was shown to 12,392 users, of
whom 524 clicked on the video.

Use suitable R commands to investigate the results, and summarise your findings so that they could be understood
by a non-statistician and are relevant for the content creator. Assume a 5% level of significance.
Complete Question 2 using R. The only functions you may use are the built-in R functions that have been studied in
the labs. Submit the code/commands you used and the output you are basing your summary on. Give numbers to at
least 3 decimal places.

Question 3. (5 marks). In this question, you will be using the same data set as in Question 4 of Assignment 3.
The data is on the LMS with the filename bikes.csv. The CSV file contains 110 entries recording the number of
users of the Capital Bikeshare bicycle-sharing system based in Washington D.C., USA, with 60 samples from the year
2011 and 50 from the year 2012. Each row includes the year, month and date of the record, and various weather
features, as well as the number of casual users and the number of registered users on that day.
Let p denote the proportion of registered bike users amongst all bike users in the year 2012. Use suitable R commands
to find an estimate for p and a 95% confidence interval for p. Give a brief summary of your answer that could be
understood by someone with no background in statistics.
For Question 3, include in your submission the code you use and the estimates and intervals you obtained, to at least
3 decimal places. Your code must be neatly written (preferably typed). The only functions you may use are the built-in
R functions that have been studied in the labs.
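The general shape of the calculation is sketched below on a small made-up data frame, since bikes.csv itself is not reproduced here. The numbers are hypothetical and are not taken from the data set. The column name year and its coding as 2011/2012 are assumptions to be checked against the real file, as is the use of prop.test.

```r
# Hypothetical stand-in for bikes.csv; with the real file one would use
# bikes <- read.csv("bikes.csv") instead.
bikes <- data.frame(
  year       = c(2011, 2011, 2012, 2012, 2012),
  casual     = c(120, 300, 250, 410, 90),
  registered = c(1500, 2100, 3200, 4100, 2800)
)

# Keep the 2012 rows only (assumes the year column stores 2011/2012).
b2012 <- bikes[bikes$year == 2012, ]

# Registered users as a proportion of all users in 2012.
reg   <- sum(b2012$registered)
total <- sum(b2012$registered + b2012$casual)
p_hat <- reg / total

# 95% confidence interval for p.
result <- prop.test(x = reg, n = total, conf.level = 0.95)
print(round(p_hat, 3))
print(round(result$conf.int, 3))
```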

Question 4. (6 marks). In this question, you will once again use the bike-sharing data set from Question 4 of
Assignment 3. You will apply linear regression to predict the number of bike-sharing users based on features of the
day’s weather. The response variables you will consider are casual (the number of casual users) and registered
(the number of registered users). The explanatory variables you will consider are:
• temp, the average temperature on that day, in °C.
• humidity, the average humidity on that day, as a percentage.
• windspeed, the average wind speed on that day, in km/h.
(a) Construct a linear model in R using casual as the response variable and registered as the explanatory variable.
Does there appear to be a strong linear relationship between the number of casual users and the number of
registered users? Justify your answer with reference to properties from the linear model summary output.
(b) Using appropriate R commands, create a linear model to predict casual, with temp, humidity and windspeed
as explanatory variables. Then:
(i) Provide a copy of the Residuals versus Fits plot and the Q-Q plot of the residuals. Do either of these plots
suggest that there are any linear regression model violations to be concerned with? Justify your answer
clearly with references to both plots. Note: regardless of your answer here, for the remainder of
this question, assume that there are no linear regression model violations.
(ii) Does the summary output applied to your model suggest that the regression model fits the data well? Explain.
(iii) What is the estimate of the humidity coefficient of this model? Interpret this coefficient, including appropriate
units as necessary.
(iv) Using appropriate R commands as needed, construct a 95% confidence interval for the humidity coefficient
of this model.
(c) Using appropriate R commands, create a linear model to predict registered, with temp, humidity and
windspeed as explanatory variables. Then:
(i) Provide a copy of the Residuals versus Fits plot and the Q-Q plot of the residuals. Do either of these plots
suggest that there are any linear regression model violations to be concerned with? Justify your answer
clearly with references to both plots. Note: regardless of your answer here, for the remainder of
this question, assume that there are no linear regression model violations.
(ii) Does the summary output applied to your model suggest that the regression model fits the data well? Explain.
(iii) What is the estimate of the temp coefficient of this model? Interpret this coefficient, including units as
necessary.
(iv) Let β1 denote the true coefficient of this model for the temp variable and consider the hypothesis
H0 : β1 = 0 versus H1 : β1 ≠ 0.
Do you reject H0 at the 5% significance level? Explain.
For Question 4, include in your submission the code you use and any relevant results. You do not need to provide
all of the output shown by R; you need only describe and refer to those parts that are relevant to the questions being
asked and supply the specific features requested. Your code must be neatly written (preferably typed). Give numerical
answers to at least 3 decimal places.
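The R workflow behind parts (b) and (c) can be sketched as follows. The data frame here is synthetic and generated only so that the example runs on its own; its values and the coefficients used to simulate it are invented and say nothing about the real bikes.csv. The column names follow those given in the question.

```r
# Synthetic stand-in for bikes.csv, for illustration only; with the real
# file one would use bikes <- read.csv("bikes.csv") instead.
set.seed(1)
n <- 110
bikes <- data.frame(
  temp      = runif(n, 5, 35),
  humidity  = runif(n, 30, 95),
  windspeed = runif(n, 0, 30)
)
bikes$casual <- 200 + 30 * bikes$temp - 4 * bikes$humidity -
  5 * bikes$windspeed + rnorm(n, sd = 100)

# Multiple linear regression with three explanatory variables.
fit <- lm(casual ~ temp + humidity + windspeed, data = bikes)

# Coefficient estimates, t-tests, residual standard error and R-squared.
print(summary(fit))

# 95% confidence intervals for the coefficients.
ci <- confint(fit, level = 0.95)
print(round(ci["humidity", ], 3))

# p-value for the test of H0: coefficient of temp equals 0.
p_temp <- summary(fit)$coefficients["temp", "Pr(>|t|)"]
print(p_temp)

# Diagnostic plots: 1 = Residuals vs Fitted, 2 = Normal Q-Q.
plot(fit, which = 1)
plot(fit, which = 2)
```

The same pattern applies with registered as the response, and a single explanatory variable such as registered in part (a) only changes the model formula.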