Statistics for Data Analytics Assignment
Your task is to investigate socio-economic factors influencing the cancer mortality in the US.
For this purpose we have sourced a dataset with average data for most of the 3000+ US
counties. Note that depending on the state, the size of counties varies a lot. The dataset is
provided in the file ‘cancer.csv’.
This dataset includes, for each county, the name of the county and its population size, some
summary medical data and socio-economic data.
The medical data are:
• the incidence rate: the number of newly diagnosed cancer cases per 100,000 of population.
• the death rate: the number of deaths by cancer per 100,000 of population. Note: Some
of the deaths occur in the same year as the diagnosis, some may occur many years later.
Occasionally the death rate can therefore be higher than the incidence rate.
The available socio-economic data are:
• Income related: the median income, the percentage of unemployed, and the poverty
• Age related: Median Age across the population, and for male and female separately
• Household related: Average Household Size and percentage of Married Households
• Education related: Percentage of the highest educational level attained (No High
School / High School / Bachelor Degree) in the age groups 18-24 and over 25.
• Health Insurance related: Percentage of Private Insurance, Private Insurance paid by
Employer, Public Insurance and Public Insurance Only, and
• Race related: percentage of White/Black/Asian/Other.
Your task is to find (from your point of view) the best suited multiple linear regression model for
the expected cancer related death rate per county using the incidence rate of cancer and the
available socio-economic data for the county. Note that a model for the death rate cannot
use the mortality rate as a predictor, and vice versa. However, it is your choice whether you build a model for the
mortality rate (and derive the death rate by multiplying the mortality rate by the incidence rate)
or a model for the death rate (and derive the mortality rate by dividing the death rate
by the incidence rate).
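The conversion between the two candidate response variables can be sketched in a few lines of Python. This is only an illustration of the relationship stated above (mortality rate = death rate / incidence rate); the function names and the example figures are made up for the sketch and do not come from 'cancer.csv'.

```python
def mortality_rate(death_rate, incidence_rate):
    """Mortality rate as implied by the brief: death rate divided by incidence rate."""
    if incidence_rate <= 0:
        raise ValueError("incidence rate must be positive")
    return death_rate / incidence_rate


def death_rate_from_mortality(mortality, incidence_rate):
    """Inverse conversion: death rate = mortality rate * incidence rate."""
    return mortality * incidence_rate


# Hypothetical county: 450 new cases and 180 deaths per 100,000 of population.
example_incidence = 450.0
example_death_rate = 180.0

m = mortality_rate(example_death_rate, example_incidence)
print(m)  # 0.4

# Round trip back to the death rate (per 100,000 of population).
recovered = death_rate_from_mortality(m, example_incidence)
print(recovered)
```

Whichever quantity is modelled, predictions for the other follow from the same two functions, so the derived variable never needs to appear among the predictors.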
The submission consists of two parts: a report of up to 6 pages in .pdf format using the IEEE
conference template and a supporting code file.
In your report you should:
• Use descriptive statistics and appropriate visualisations to enhance understanding of
the variables in the dataset.
• Describe the model building steps you undertook in the process of arriving at your
final regression model. The rationale for rejecting intermediate models should be
explained clearly and details provided on the rationale for choosing predictors,
transformations undertaken, treatment of outliers, etc.
• Provide details on diagnostics undertaken to verify that the Gauss Markov and other
relevant assumptions of multiple regression have been satisfied.
• Provide a succinct summary of the parameters of your final model and details of
model performance and fit.
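As a rough illustration of the model-fit and diagnostic quantities mentioned in the list above, the following Python sketch fits an ordinary least squares model with NumPy and reports R², adjusted R² and variance inflation factors (VIF). Everything in it is an assumption made for the example: the data are synthetic, the variable names (incidence, poverty, median_age) are hypothetical stand-ins rather than the actual columns of 'cancer.csv', and the helper functions fit_ols and vif are not part of any library. It covers only model fit and multicollinearity; residual plots and checks for heteroscedasticity or normality would still be needed.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic stand-in for county data; these are NOT values from cancer.csv.
n = 500
incidence = rng.normal(450, 50, n)    # hypothetical incidence rate per 100,000
poverty = rng.normal(17, 6, n)        # hypothetical poverty percentage
median_age = rng.normal(40, 5, n)     # hypothetical median age
death_rate = (20 + 0.25 * incidence + 1.5 * poverty
              + 0.3 * median_age + rng.normal(0, 15, n))


def fit_ols(X, y):
    """Least-squares fit of y on X with an intercept.

    Returns coefficients (intercept first), residuals, R^2 and adjusted R^2.
    """
    A = np.column_stack([np.ones(len(y)), X])      # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1 - ss_res / ss_tot
    n_obs, p = X.shape
    adj_r2 = 1 - (1 - r2) * (n_obs - 1) / (n_obs - p - 1)
    return beta, resid, r2, adj_r2


def vif(X):
    """VIF of each column: 1 / (1 - R^2) from regressing it on the other columns."""
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        _, _, r2_j, _ = fit_ols(others, X[:, j])
        factors.append(1.0 / (1.0 - r2_j))
    return np.array(factors)


X = np.column_stack([incidence, poverty, median_age])
beta, resid, r2, adj_r2 = fit_ols(X, death_rate)
vifs = vif(X)

print("coefficients (intercept first):", np.round(beta, 3))
print("R^2:", round(r2, 3), "adjusted R^2:", round(adj_r2, 3))
print("VIF per predictor:", np.round(vifs, 2))
```

Comparing adjusted R² between candidate models and flagging predictors with a large VIF is one common way to justify dropping or keeping a variable; the threshold used is a judgment call that should be stated in the report.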
The supporting code file should contain the material required to reproduce the results of your
report.
• If you used Jupyter Notebook, submit the notebook file with all the output produced
included. Make sure that it works using the “Restart Kernel and run all” option. For any
computer generated graphics you used in the report, insert in the Jupyter notebook a
comment referring to the figure number or caption.
• If you used R Studio or similar, submit the source file and make sure that one can run
the code sequentially. For any computer generated graphics you used in the report,
insert in the source code a comment referring to the figure number or caption.
• If you used a software package like SPSS, provide a .pdf document with a detailed
description of the steps you have taken to obtain the results in your report.
• By submitting your work on Moodle you declare that this is your own work.
• Any material created by others must be properly referenced. Verbatim text copies
should be included in quotes.
• Figures not created by yourself should include an acknowledgement detailing the
name(s) of the creator(s) and proper references.
• Code and figures copied from class material or other sources should be clearly
marked as such and properly referenced. In particular it should not be (directly or
implicitly) claimed as your own. Instead a comment should be included in the source
code indicating where you obtained it from.
• Students are strongly advised to familiarise themselves with the Guide to Academic
Integrity. All submissions will be electronically screened for evidence of academic
misconduct, e.g. plagiarism, collusion and misrepresentation.