7144CEM Assignment Help
Principles of Data Science Assignment Help
In this portfolio, we will investigate the Wisconsin breast cancer diagnostic dataset. Each row corresponds to an image of a cell nucleus and is identified by a unique ID number (column 1). Column 2 holds the diagnosis variable, i.e. the result of the breast cancer test (M = malignant or B = benign). Ten real-valued cell nucleus measurements are provided:
- a) radius (mean of distances from centre to points on the perimeter): column 3
- b) texture (standard deviation of gray-scale values): column 4
- c) perimeter: column 5
- d) area: column 6
- e) smoothness (local variation in radius lengths): column 7
- f) compactness (perimeter^2 / area – 1.0): column 8
- g) concavity (severity of concave portions of the contour): column 9
- h) concave points (number of concave portions of the contour): column 10
- i) symmetry: column 11
- j) fractal dimension (“coastline approximation” – 1): column 12
The mean, standard error and “worst” (mean of the three largest values) of these measurements were computed for each image, resulting in 30 variables (columns 3-32). For instance, column 3 is Radius-mean, column 13 is Radius-SE, column 23 is Radius-worst.
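The column layout described above can be written down once and reused for selecting variables. The following is a minimal sketch in base R: the column names, the file name `wdbc.data`, and the object names `wdbc_names`, `mean_cols` and `all_cols` are illustrative assumptions, not part of the brief.

```r
# Assumption: 32 columns in the order described above
# (ID, diagnosis, then 10 means, 10 standard errors, 10 "worst" values).
measures <- c("radius", "texture", "perimeter", "area", "smoothness",
              "compactness", "concavity", "concave_points", "symmetry",
              "fractal_dimension")
stats <- c("mean", "se", "worst")

# Build the 32 column names: the statistic varies slowest, the measurement fastest,
# so radius_mean is column 3, radius_se column 13 and radius_worst column 23.
wdbc_names <- c("id", "diagnosis",
                paste(rep(measures, times = 3), rep(stats, each = 10), sep = "_"))

# Hypothetical loading step (the file name and the absence of a header row
# are assumptions about your copy of the data):
# wdbc <- read.csv("wdbc.data", header = FALSE, col.names = wdbc_names)
# wdbc$diagnosis <- factor(wdbc$diagnosis, levels = c("B", "M"))

mean_cols <- 3:12   # the ten "mean" variables
all_cols  <- 3:32   # all thirty measurement variables
```

Keeping the index vectors in one place makes it easy to rerun an analysis on all thirty variables and then on the ten mean variables only.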
Task 1 (Group Task) — Multivariate Statistical Analysis
This is a group task. Please be clear in your group report about how each group member has individually contributed to this group task. You may find the R package “factoextra” useful for this task. You must interpret and evaluate your results, not only write R code and build plots. Make sure it is clear which R code has produced which plots.
Throughout this task, please take every opportunity to investigate the effect of the categorical variable (diagnosis variable) in your plots.
- Use R to carry out Principal Component Analysis (PCA) on the dataset.
- By considering only the first five principal components of these cell nucleus measurements, produce, interpret and evaluate relevant plots such as scree plot, loadings plot, and biplot using PC1 and PC2. Also, produce and interpret a biplot using PC2 and PC3 as the axes.
- Repeat the two PCA steps above using only the first ten cell nucleus measurements (the mean variables, columns 3-12).
- Use R to carry out Cluster Analysis on the dataset. You may find the agnes() function in the R package “cluster” useful, especially the agglomerative coefficient used for comparing clusterings.
- Using all cell nucleus variables, cluster the cells (rows) and compare the results for different distance metrics (Manhattan, Euclidean, etc.) and hierarchical clustering methods (single linkage, Ward’s method, etc.) in a small table. Produce, interpret and evaluate at most two relevant dendrograms in your report. Repeat this procedure using only the first ten cell nucleus measurements (mean only).
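As a starting point, the PCA and clustering steps listed above might be organised as a few small functions. This is only a sketch under stated assumptions: the data frame is assumed to have the 32-column layout described earlier with a `diagnosis` factor, the variables are standardised before both analyses (a choice made here, not required by the brief), and all function names (`run_pca`, `variance_explained`, `plot_pca`, `compare_agnes`, `clusters_vs_diagnosis`) are hypothetical. Base R plotting is used to avoid extra dependencies; the plots could equally be drawn with factoextra.

```r
library(cluster)  # provides agnes()

# PCA on standardised variables; `cols` selects the measurement columns.
run_pca <- function(data, cols) {
  prcomp(data[, cols], center = TRUE, scale. = TRUE)
}

# Proportion of variance explained by the first `k` principal components.
variance_explained <- function(pca, k = 5) {
  (pca$sdev^2 / sum(pca$sdev^2))[seq_len(k)]
}

# Scree plot, PC1 loadings, a score plot coloured by diagnosis, and two biplots.
plot_pca <- function(pca, diagnosis) {
  screeplot(pca, npcs = 5, type = "lines", main = "Scree plot")
  barplot(pca$rotation[, 1], las = 2, main = "PC1 loadings")
  point_col <- c(B = "steelblue", M = "firebrick")[as.character(diagnosis)]
  plot(pca$x[, 1:2], col = point_col, pch = 19, main = "Scores: PC1 vs PC2")
  biplot(pca, choices = c(1, 2))
  biplot(pca, choices = c(2, 3))
}

# Agglomerative coefficient for every metric/linkage combination, as a small table.
compare_agnes <- function(data, cols,
                          metrics = c("euclidean", "manhattan"),
                          methods = c("single", "complete", "average", "ward")) {
  x <- scale(data[, cols])
  grid <- expand.grid(metric = metrics, method = methods,
                      stringsAsFactors = FALSE)
  grid$agglomerative_coef <- mapply(function(metric, method) {
    agnes(x, metric = metric, method = method)$ac
  }, grid$metric, grid$method, USE.NAMES = FALSE)
  grid
}

# Cut one clustering into two groups and cross-tabulate against diagnosis.
clusters_vs_diagnosis <- function(data, cols, diagnosis,
                                  metric = "euclidean", method = "ward") {
  fit <- agnes(scale(data[, cols]), metric = metric, method = method)
  table(cluster = cutree(as.hclust(fit), k = 2), diagnosis = diagnosis)
}

# Hypothetical usage, assuming a data frame `wdbc` with the layout above:
# pca_all <- run_pca(wdbc, 3:32); plot_pca(pca_all, wdbc$diagnosis)
# compare_agnes(wdbc, 3:32); compare_agnes(wdbc, 3:12)
# plot(agnes(scale(wdbc[, 3:12]), method = "ward"), which.plots = 2)  # dendrogram
```

Passing the column range as an argument lets the same code serve both the all-variables run and the mean-only run, so the two sets of results are directly comparable.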