ASSESSMENT 3 BRIEF
Subject Code and Title: BDA601—Big Data and Analytics
Assessment: Model Evaluation
Individual/Group: Individual
Length: Source Code and Presentation (7–10 minutes)
Learning Outcomes: The Subject Learning Outcomes demonstrated by the successful completion of the task below include:
c) Apply data science principles to the cleaning, manipulation and visualisation of data;
d) Design analytical models based on a given problem; and
e) Effectively report and communicate findings to an appropriate audience.
Submission: Due by 11.55 pm AEST on the Sunday at the end of Module 12.
Weighting: 40%
Total Marks: 100 marks
Task Summary
Any enterprise-level big data analytics project aimed at solving a real-world problem will generally comprise three phases:
1. Data preparation;
2. Data analysis and visualisation; and
3. Making decisions based on the analysis or insights.
In this Assessment, you will help the global community in its fight against COVID-19 by discovering meaningful insights in a dataset compiled by the Johns Hopkins University Center for Systems Science and Engineering.
Given the significance of the issue, you will slice and dice the data using different methods and drill down to gain insights that will help the individuals concerned make the right decisions.
Please refer to the Task Instructions (below) for details on how to complete this task.
Task Instructions
1. Dataset Preparation
The Johns Hopkins University COVID-19 dataset is a time-series dataset that officially began recording the global number of confirmed infections, deaths and recovered patients on 22 January 2020. The fields available in the dataset include the Province/State, Country/Region, the Latitude and Longitude of a country and the dates. The data period runs from 22 January 2020 to present.
In this Assessment, you are required to work with the latest version of this dataset (the version you use will depend on the day you download it). The dataset can be found at the URL provided below.
For this Assessment, you are only required to download the dataset related to confirmed infection numbers (i.e., only download the file named: time_series_covid19_confirmed_global.csv).
All of the analyses for this Assessment should be conducted on the confirmed infection numbers. You should use the dataset as it is without making any modifications to the downloaded file.
Humdata.org. (2020). Novel Coronavirus (Covid-19) cases data. Retrieved from https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases [Accessed 05 August 2020].
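As a rough illustration of the preparation step, the sketch below reads a file in the wide layout described above (one row per Province/State, one column per date) and sums the rows into one series per country. It uses only the Python standard library. The sample rows, the country names and the short column names `Lat` and `Long` are assumptions made for the example and are not taken from the real file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical two-date excerpt in the wide layout of
# time_series_covid19_confirmed_global.csv (names and numbers are made up).
SAMPLE_CSV = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
,Freedonia,10.0,20.0,1,3
North,Sylvania,30.0,40.0,0,2
South,Sylvania,31.0,41.0,5,6
"""

# Assumed names of the non-date columns.
META_COLUMNS = ("Province/State", "Country/Region", "Lat", "Long")


def load_country_series(file_obj):
    """Return (date_labels, {country: [count per date column]})."""
    reader = csv.DictReader(file_obj)
    date_labels = [c for c in reader.fieldnames if c not in META_COLUMNS]
    totals = defaultdict(lambda: [0] * len(date_labels))
    for row in reader:
        country = row["Country/Region"]
        for i, label in enumerate(date_labels):
            # Sum provinces/states so each country has a single series.
            totals[country][i] += int(float(row[label] or 0))
    return date_labels, dict(totals)


dates, series = load_country_series(io.StringIO(SAMPLE_CSV))
print(dates)
print(series)
```

For the downloaded file, passing `open(path, newline="")` instead of the `io.StringIO` object reads it in place, so the file itself stays unmodified.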
2. Data Analysis and Visualisation
Using the dataset downloaded in the previous step, undertake a data analysis and visualisation of the top three infected countries.
The top three infected countries should be selected based on the total count of infected people from 22 January 2020 to the latest date in your file.
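One possible way to pick the top three countries is sketched below. It assumes the file stores cumulative counts, so the value in the latest date column is already the total since 22 January 2020 (summing every column would count the same cases repeatedly). The country names and numbers are hypothetical.

```python
# Hypothetical cumulative series per country (made-up numbers); in practice
# this would come from the confirmed-cases file after summing provinces.
series = {
    "Freedonia": [1, 3, 9],
    "Sylvania": [5, 8, 20],
    "Osterlich": [0, 2, 4],
    "Tomainia": [2, 6, 15],
}


def top_countries(series, n=3):
    """Rank countries by the value in the latest date column."""
    latest = {country: counts[-1] for country, counts in series.items()}
    ranked = sorted(latest.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


top_three = top_countries(series)
print(top_three)
```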
The analysis and the visualisation can be completed using the Python libraries of your choice (e.g., PySpark MLlib). You can use any other platform if you find it more efficient. The analysis and the visualisation should address the following sections collectively:
a) Predictive Modelling
In this section, fit a linear regression model to the time-series data for each of the three countries, assuming that the infection rate has been increasing since the official record started. In each model, the dependent variable is the infection count and the independent variable is the week number.
Please note, you should convert the time-series data and represent the dates in the form of a week number. For example, 22 January 2020 to 28 January 2020 will be Week 1, 29 January 2020 to 4 February 2020 will be Week 2, etc.
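The week-number rule above can be sketched as follows. The header format `%m/%d/%y` (for labels such as `1/22/20`) is an assumption about the downloaded file. Taking the last cumulative value seen in each week as that week's count is one possible reading of "count of infection", not something the brief prescribes.

```python
from datetime import date, datetime, timedelta

FIRST_DAY = date(2020, 1, 22)  # start of Week 1


def parse_header(label):
    """Turn a date column label such as '1/22/20' into a date (assumed format)."""
    return datetime.strptime(label, "%m/%d/%y").date()


def week_number(day):
    """22-28 January 2020 is Week 1, 29 January-4 February is Week 2, etc."""
    return (day - FIRST_DAY).days // 7 + 1


def weekly_totals(days, counts):
    """Keep the last value seen in each week; expects days in ascending order."""
    weekly = {}
    for day, count in zip(days, counts):
        weekly[week_number(day)] = count  # later days overwrite earlier ones
    return weekly


# Hypothetical 14-day cumulative series: 1, 2, ..., 14.
days = [FIRST_DAY + timedelta(days=i) for i in range(14)]
counts = list(range(1, 15))
print(weekly_totals(days, counts))
```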
Once all three linear regression models are ready, analyse the models thoroughly and identify the model with the highest variance. Select that country and its linear regression model and move to the next step.
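A minimal sketch of the fitting and comparison step is shown below, written as plain least squares so that it needs no third-party library (PySpark MLlib or any other package would serve equally well). The brief does not define "highest variance"; the sketch assumes it means the variance of the residuals around the fitted line, and that reading should be checked before relying on it. The country names and weekly counts are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, residual_variance).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Divide by n - 2 because two parameters were estimated from the data.
    residual_variance = sum(r * r for r in residuals) / (n - 2)
    return slope, intercept, residual_variance


weeks = [1, 2, 3, 4, 5]
# Hypothetical weekly counts for two made-up countries.
weekly_counts = {
    "Freedonia": [10, 20, 30, 40, 50],  # perfectly linear growth
    "Sylvania": [1, 4, 9, 16, 25],      # accelerating growth
}

models = {name: fit_line(weeks, ys) for name, ys in weekly_counts.items()}
# Select the country whose model has the largest residual variance.
selected = max(models, key=lambda name: models[name][2])
print(models)
print(selected)
```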
b) Clustering
In this section, perform a K-Means clustering on the dataset used in the previous step for the country that had the highest amount of variance.
In the previous step, one of the assumptions was that the infection rate has been increasing since the official record started. Clustering should help you to validate that assumption and most importantly, should help you discover a trend of infection count over a period.
Determine the best value of K for K-Means clustering through iteration. Once the clusters stabilise, analyse the clusters thoroughly and observe the trend over time.
For example, consider whether you had cluster/s at the top of the graph in the first weeks of January, whether the cluster/s came back down in the graphs in the following weeks and whether the cluster/s went up again. You will use these observations in the next step.
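The iteration over K described above could look like the sketch below. It is a plain-Python K-Means with a deterministic starting point, run for several values of K so that the within-cluster sum of squares (inertia) can be compared. The "elbow" in that sequence is one common heuristic for choosing K and is an assumption here, as are the made-up weekly figures. A library implementation such as the one in PySpark MLlib would normally replace the hand-written loop.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def kmeans(points, k, max_iter=100):
    """Return (centroids, inertia) for k clusters."""
    # Deterministic start: k points spread evenly through the input order.
    step = len(points) / k
    centroids = [points[int(i * step)] for i in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        new_centroids = [
            mean_point(members) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # clusters have stabilised
            break
        centroids = new_centroids
    inertia = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, inertia


# Hypothetical weekly counts for weeks 1-9: low, then high, then low again.
weekly = [10, 12, 11, 100, 105, 98, 20, 22, 19]
points = [(float(v),) for v in weekly]

inertias = {k: kmeans(points, k)[1] for k in range(1, 5)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

In this toy series the inertia falls sharply from K = 1 to K = 2 and much less afterwards, which is the kind of pattern the elbow heuristic looks for.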
c) Graph Analytics
In this section, perform graph analytics and show the relationship between the country in question in the previous step and its neighbouring countries based on the weekly count of infection. Assume that the neighbouring countries do not share any borders with each other.
To determine the neighbouring countries, you can either use the latitude and longitude information from the dataset or your own knowledge of geography and present a graphical view.
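If the latitude and longitude columns are used, one rough approach is to rank other countries by great-circle distance from the selected country, as sketched below with the haversine formula. This is only a proxy: being close on the map does not prove that two countries share a border, so the result should be checked against geographic knowledge. The country names and coordinates are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest(target, coords, n=2):
    """Names of the n countries closest to the target country."""
    lat0, lon0 = coords[target]
    distances = [
        (haversine_km(lat0, lon0, lat, lon), name)
        for name, (lat, lon) in coords.items()
        if name != target
    ]
    return [name for _, name in sorted(distances)[:n]]


# Hypothetical (latitude, longitude) pairs, one per country.
coords = {
    "Freedonia": (0.0, 0.0),
    "Sylvania": (0.0, 1.0),
    "Osterlich": (0.0, 5.0),
    "Tomainia": (10.0, 10.0),
}
print(nearest("Freedonia", coords))
```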
As part of this analysis, assume that the neighbouring countries may also display similar cluster trends over a period (as seen in the previous step). In your video presentation, you will make recommendations to these neighbouring countries in relation to possible trends.
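A minimal sketch of the graph step follows. Because the neighbouring countries are assumed not to border each other, the graph is a star: the selected country is linked to each neighbour and the neighbours are not linked to one another. Using the Pearson correlation of weekly counts as the edge weight is an illustrative choice rather than something the brief specifies, and the names and numbers are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series with non-zero spread."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def build_star_graph(centre, neighbours, weekly):
    """Adjacency dict with one weighted edge from the centre to each neighbour."""
    graph = {name: {} for name in [centre, *neighbours]}
    for name in neighbours:
        weight = pearson(weekly[centre], weekly[name])
        graph[centre][name] = weight
        graph[name][centre] = weight  # undirected edge
    return graph


# Hypothetical weekly counts for the selected country and two neighbours.
weekly = {
    "Freedonia": [1, 2, 3, 4, 5],
    "Sylvania": [2, 4, 6, 8, 10],   # moves with Freedonia
    "Osterlich": [5, 4, 3, 2, 1],   # moves against Freedonia
}
graph = build_star_graph("Freedonia", ["Sylvania", "Osterlich"], weekly)
print(graph)
```

The resulting dictionary can be handed to a graph or plotting library of your choice to produce the graphical view.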
d) Visualisation
In this section, you are required to visualise your analytical findings (that you derived using the above steps).
In big data and analytics projects, visualisation is an integral part of any analysis and often brings the analysis to life. Thus, ensure that you produce a high-quality visualisation, which you can use to tell stories and drill down from the raw data to the decision-making process.
3. Video Presentation
After completing the whole data analysis and visualisation process, the outcomes need to be communicated to the neighbouring countries as identified in the previous step. Thus, you should prepare a video presentation summarising the insights discovered in the previous step. You should use 8–10 slides in your presentation and your presentation should be no longer than 10 minutes.
This video presentation is related to the big data and analytics project phase ‘making decisions based on the analysis and insights’ (as described above). Thus, the contents of this video should be extremely helpful to the neighbouring countries as they make decisions about their COVID-19 policies.
Consequently, as you communicate about possible trends of infection, ensure that you support your findings with any insights that you discovered through predictive modelling, clustering, graph analytics and visualisation. Tell a story to your listeners by presenting drilled-down views of your discoveries and by relating all the outcomes from the analysis that you completed in the previous steps.
Submission Instructions
• Zip the following files and submit the .zip file via the Assessment link in the main navigation menu in BDA601—Big Data and Analytics:
o Python source code (ensure that you include comments at the top of the main file on how to execute your code);
o Video presentation file; and
o PDF slides used in video presentation.
The Learning Facilitator will provide feedback via the Grade Centre in the LMS portal. Feedback can be viewed in My Grades.
Academic Integrity Declaration
I declare that except where referenced, the work I am submitting for this assessment task is my own work. I have read and am aware of the Academic Integrity Policy and Procedure of Torrens University, Australia, viewable online at http://www.torrens.edu.au/policies-and-forms.
I am also aware that I need to keep a copy of all submitted material and any drafts and I agree to do so.
Assessment Rubric
Assessment Attributes and grade bands:
Fail (Yet to Achieve Minimum Standard): 0–49%
Pass (Functional): 50–64%
Credit (Proficient): 65–74%
Distinction (Advanced): 75–84%
High Distinction (Exceptional): 85–100%
Completeness and efficiency (25%)
Fail: None of the requirements are implemented. The system does not function properly or is extremely buggy. Requires an extreme level of manual configuration to run the system. Additionally, the configuration does not work.
Pass: One or two major requirements are implemented. The system does not function properly. No exception handling implemented. Requires users to follow a lengthy configuration manual to run the system.
Credit: All but one or two major requirements are implemented. The system functions only if certain additional conditions are met. Basic exception handling implemented, but it is not thorough. Requires users to follow a short configuration manual to run the system.
Distinction: Most of the major requirements are implemented. The system functions without any additional conditions having to be met. Basic exception handling implemented, but it is not thorough. Only requires users to copy the necessary data to the right locations.
High Distinction: All of the major requirements are implemented. The system functions properly. Exceptions are handled very well. Users can run the system without any configuration.
Analysis and insights (30%)
Fail: The analysis of the data is not accurate, thorough or appropriate. None of the analytical tasks are correlated. Statistical evidence is not embedded. No meaningful insights were produced.
Pass: The analysis of the data includes at least one accurate, thorough and appropriate insight for each section. All of the analytical tasks are somewhat correlated. Statistical evidence is loosely embedded. A few good insights were produced.
Credit: The analysis of the data is mostly accurate, thorough and appropriate. All of the analytical tasks are strongly correlated. Statistical evidence is highly embedded. Strong insights were produced.
Distinction: The analysis of the data is completely accurate, thorough and appropriate. All of the analytical tasks are solidly correlated. Statistical evidence is heavily embedded. Solid insights were produced.
High Distinction: The analysis of the data is extraordinarily accurate, thorough and appropriate. All of the analytical tasks are remarkably correlated. Statistical evidence is acutely embedded. Thought-provoking insights were produced.
Visualisation and creativity (20%)
Fail: Poor synthesis of information from the data source resulting in incorrect points of view. Basic use of graphics that is not understandable. Poor use of colours or patterns.
Pass: Synthesises basic information from the data source resulting in a few correct points of view. Basic use of graphics that is somewhat understandable. General use of colours or patterns.
Credit: Synthesises adequate information from the data source resulting in mostly correct points of view. Somewhat inventive use of graphics that is mostly understandable. Good use of colours or patterns.
Distinction: Synthesises detailed information from the data source resulting in correct points of view. Inventive use of graphics that is easily understandable. Impressive use of colours or patterns.
High Distinction: Synthesises in-depth information from the data source resulting in correct points of view. Super inventive use of graphics that is easily understandable. Very impressive use of colours or patterns.
Clarity and completeness of video presentation (25%)
Fail: The skills demonstrated in understanding and communicating the key outcomes are unsatisfactory. The slides used in the presentation are of unsatisfactory quality. The overall presentation lacks organisation and is extremely difficult to follow. The number of slides and the length of the video are outside the limits.
Pass: Demonstrated satisfactory skills in understanding and communicating the key outcomes to the concerned global communities. The slides used in the presentation are of satisfactory quality. The overall presentation is not well organised and is difficult to follow. The number of slides and the length of the video are outside the limits.
Credit: Demonstrated good skills in understanding and communicating the key outcomes to the concerned global communities. The slides used in the presentation are of high quality. The overall presentation is mostly well organised but is somewhat difficult to follow. The number of slides and the length of the video are within the limits.
Distinction: Demonstrated advanced skills in understanding and communicating the key outcomes to the concerned global communities. The slides used in the presentation are of outstanding quality. The overall presentation is well organised, cohesive and easy to follow. The number of slides and the length of the video are within the limits.
High Distinction: Demonstrated exemplary skills in understanding and communicating the key outcomes to the concerned global communities. The slides used in the presentation are of exceptionally high quality. The overall presentation is exceptionally well organised, highly cohesive and easy to follow. The number of slides and the length of the video are within the limits.
The following Subject Learning Outcomes are addressed in this assessment
SLO c) Apply data science principles to the cleaning, manipulation and visualisation of data.
SLO d) Design analytical models based on a given problem.
SLO e) Effectively report and communicate findings to an appropriate audience.