CSE5BDC Assignment Help
Assignment 2023 Assignment help
Plagiarism is the submission of somebody else’s work in a manner that gives the impression that the work is your own. For individual assignments, plagiarism includes the case where two or more students work collaboratively on the assignment. The Department of Computer Science and Computer Engineering treats plagiarism very seriously. When it is detected, penalties are strictly imposed.
If you are working on your assignment on the lab computers, make sure you delete the virtual machine and empty the recycle bin before you leave. Otherwise, other students may be able to see your solutions.
ChatGPT and similar AI tools
A key purpose of this assessment task is to test your own ability to complete the assigned tasks. Therefore, the use of ChatGPT, AI tools or chatbots with similar functionality is prohibited for this assessment task. Students who are found to be in breach of this rule will be subject to normal academic misconduct measures. Additionally, students may be engaged to provide an oral validation of their understanding of their submitted work (e.g. coding).
Expected quality of solutions
a) In general, more efficient code (fewer reads from and writes to HDFS, and fewer data shuffles) will be rewarded with more marks.
b) This entire assignment can be done using the Docker containers supplied in the labs and the supplied data sets without running out of memory. It is time to show your skills!
c) I am not too fussed about the layout of the output. As long as it looks similar to the example outputs for each task, that will be good enough. The idea is not to spend too much time massaging the output into the right format, but instead to spend the time solving problems.
d) For Hive queries, we prefer answers that use fewer tables.
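To make point (a) concrete, here is a minimal sketch of why aggregating inside each partition before a shuffle moves fewer records. This is the idea behind Spark's reduceByKey, which combines values on the map side, as opposed to groupByKey, which does not. The sketch is a plain-Python simulation and does not use Spark. The function names, the toy partitioner, and the sample data are all made up for illustration and are not part of the assignment.

```python
from collections import defaultdict


def partition_of(key, num_partitions):
    # Deterministic stand-in for a hash partitioner (hypothetical helper).
    return sum(key.encode("utf-8")) % num_partitions


def shuffle_without_combine(partitions, num_reducers):
    """Send every (key, value) record across the shuffle, then sum per key."""
    shuffled = 0
    reducers = [defaultdict(int) for _ in range(num_reducers)]
    for part in partitions:
        for key, value in part:
            # Each input record is moved to a reducer individually.
            reducers[partition_of(key, num_reducers)][key] += value
            shuffled += 1
    totals = {k: v for r in reducers for k, v in r.items()}
    return totals, shuffled


def shuffle_with_combine(partitions, num_reducers):
    """Pre-aggregate inside each input partition, then shuffle one record per key."""
    shuffled = 0
    reducers = [defaultdict(int) for _ in range(num_reducers)]
    for part in partitions:
        # Map-side combine: sum values per key before anything is shuffled.
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        # Only one partial sum per distinct key leaves this partition.
        for key, subtotal in local.items():
            reducers[partition_of(key, num_reducers)][key] += subtotal
            shuffled += 1
    totals = {k: v for r in reducers for k, v in r.items()}
    return totals, shuffled


# Made-up sample data: two input partitions of (word, 1) pairs.
SAMPLE_PARTITIONS = [
    [("spark", 1), ("hive", 1), ("spark", 1), ("spark", 1)],
    [("hive", 1), ("spark", 1), ("hdfs", 1), ("hive", 1)],
]

if __name__ == "__main__":
    plain_totals, plain_count = shuffle_without_combine(SAMPLE_PARTITIONS, 2)
    combined_totals, combined_count = shuffle_with_combine(SAMPLE_PARTITIONS, 2)
    print("same totals:", plain_totals == combined_totals)
    print("records shuffled:", plain_count, "vs", combined_count)
```

Both functions produce the same per-key totals, but the second one shuffles one record per distinct key per partition instead of one record per input row. On real data with many repeated keys, that difference is usually where the savings come from.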
The questions in the assignment will be labelled using the following:
• [Hive]
o Means this question must be done using Hive.
• [Spark RDD]
o Means this question must be done using Spark RDDs. You are not allowed to use any Spark SQL features such as DataFrames or Datasets.
• [Spark SQL]
o Means this question must be done using Spark SQL, so you are not allowed to use RDDs. In addition, you must answer these questions using the Spark DataFrame or Dataset API; do not use SQL syntax.