Assessment 2: Visualization and data processing – Total marks 40
Assessed by:
Grade: /40
Outline
The following exercises are designed to assess your understanding of the concepts, implementation, and interpretation of topics in Visualization and Data Processing. Some questions may require you to search for and use R functions that we have not used so far. For every question, submit both your code and its output.
The questions in this assessment may have multiple correct solutions. Almost no statistical background is assumed. All methods required for the solutions are available on the content pages of Weeks 2-5 of this subject. Some of them have been covered in detail during Collaborate sessions.
Submissions
This assessment consists of 11 questions, some with several sub-questions. Insert code, plots and explanations/justifications in the provided text boxes where indicated. Do not remove the headings in the text boxes. Answers outside the boxes won’t be marked. You should not need more space than the text boxes provide.
Change the file name to your first and last name when submitting to Learn JCU.
Submit as a Word file or a pdf file.

Visualization:
Import the dataset oneworld.csv (available at https://drive.google.com/file/d/1dJnK9froCCxCn1PFEbv6svLdKnhRiFCL/view?usp=sharing) into R. The objective of this section is to explore the relationship between GDP categories, infant mortality and regions.
Q1. Insert your R code to:
Create a new ordinal variable called GDPcat with three categories, “Low” “Medium” and “High”, derived from the variable GDP with:
• The proportion of countries in each GDPcat category is approximately “Low” 40%, “Medium” 40% and “High” 20%.
• The “Low” category has countries with the lowest GDP values and the “High” category has countries with the highest GDP values.
• Remove any missing observations.
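One possible approach to the 40/40/20 split is to cut GDP at its 40th and 80th percentiles. The sketch below is a minimal illustration on simulated data, not on oneworld.csv: the data frame name `world`, the `Country` column and all values are made up for the example, and dropping every row that contains a missing value is only one reading of the last bullet point.

```r
# Simulated stand-in for oneworld.csv (hypothetical values, two missing GDPs)
set.seed(1)
world <- data.frame(
  Country = paste0("C", 1:50),
  GDP     = c(rlnorm(48, meanlog = 8, sdlog = 1), NA, NA)
)

# Drop rows with missing values before computing the cut points
world <- na.omit(world)

# Cut points at the 40th and 80th percentiles give roughly 40% / 40% / 20%
gdp_breaks <- quantile(world$GDP, probs = c(0, 0.4, 0.8, 1))

world$GDPcat <- cut(
  world$GDP,
  breaks         = gdp_breaks,
  labels         = c("Low", "Medium", "High"),
  include.lowest = TRUE,  # keep the smallest GDP value in "Low"
  ordered_result = TRUE   # makes GDPcat an ordered factor
)

# Check the achieved proportions
gdp_props <- prop.table(table(world$GDPcat))
print(round(gdp_props, 2))
```

With heavily tied GDP values the quantile breaks may coincide, in which case cut() stops with an error and the breaks need adjusting by hand.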
Q2. Insert your R code, Plot, and interpretation of the plot:
Using the ggplot2 library, visualise the relationship between GDPcat and Infant.mortality, stratified by Regions, on a single plot. Comment on your plot.
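As an illustration of showing three variables on a single plot, the sketch below draws box plots of infant mortality by GDP category with one panel per region. It assumes ggplot2 is installed and runs on simulated data; the data frame `toy`, the column name `Region` and every value are hypothetical, and faceted box plots are just one of several reasonable designs.

```r
library(ggplot2)

set.seed(2)
# Simulated stand-in for the cleaned data (hypothetical values)
toy <- data.frame(
  Region = rep(c("Africa", "Asia", "Europe"), each = 30),
  GDPcat = factor(
    sample(c("Low", "Medium", "High"), 90, replace = TRUE),
    levels = c("Low", "Medium", "High"), ordered = TRUE
  )
)
# Make infant mortality fall as the GDP category rises, plus noise
toy$Infant.mortality <- pmax(80 - 20 * as.integer(toy$GDPcat) + rnorm(90, sd = 8), 1)

# One panel per region, one box per GDP category
p <- ggplot(toy, aes(x = GDPcat, y = Infant.mortality, fill = GDPcat)) +
  geom_boxplot() +
  facet_wrap(~ Region) +
  labs(x = "GDP category", y = "Infant mortality",
       title = "Infant mortality by GDP category and region (simulated)") +
  theme(legend.position = "none")

print(p)
```

Mapping Region to the fill colour inside a single panel would be an alternative when there are only a few regions.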
Data Processing: Section Marks 15
Q3. Insert your R code to: Marks (4)
Write an R function to identify the proportion of missing observations in a variable or column of tabular data.
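A minimal sketch of such a function is given below, assuming that "missing" means values for which is.na() is TRUE (this includes NaN as well as NA). The name `prop_missing` and the example vector are hypothetical.

```r
# Proportion of missing values in a vector or data frame column
prop_missing <- function(x) {
  if (length(x) == 0) {
    return(NA_real_)  # the proportion is undefined for an empty input
  }
  mean(is.na(x))      # mean of TRUE/FALSE equals the share of TRUEs
}

example_column <- c(4.2, NA, 3.1, NA, 5.0)
print(prop_missing(example_column))
```

Taking the mean of a logical vector is equivalent to sum(is.na(x)) / length(x) for non-empty input.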
Q4. Insert the code to: Marks (2)
Apply the function from Q3 to all variables of the airquality dataset, which is available in R. Print a list of the variable names with the proportion of missing observations in each variable.
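Applying a column-level function across a data frame can be sketched with sapply(), which returns a named vector. The helper `prop_missing` below is a hypothetical one-line stand-in for whatever function was written in Q3.

```r
# Hypothetical stand-in for the Q3 function
prop_missing <- function(x) mean(is.na(x))

# A data frame is a list of columns, so sapply() visits each variable in turn
missing_by_variable <- sapply(airquality, prop_missing)

# Named numeric vector: variable name -> proportion missing
print(round(missing_by_variable, 3))
```

Using lapply() instead would return the same values as a list rather than a named vector.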
Q5. Insert your justification: Marks (2)
Use airquality dataset available in R. Specify a variable from the airquality dataset for univariate missing value imputation. Justify your variable choice based on the count or proportion of missing observations, noting that univariate imputation reduces the natural variation of a variable.
Q6. Insert the code and justification to: Marks (3)
Using base R or dplyr functions (no additional libraries) replace all missing observations in the chosen variable from Q5 with an imputation value. Justify the choice of replacement value. Hint: Read the appropriate section on your Weekly content page to perform this task.
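The mechanics of univariate imputation in base R can be sketched as follows. Both the variable (Solar.R) and the replacement value (the median of the observed values) are illustrative assumptions only; the choice and its justification remain part of the answer.

```r
aq <- airquality  # work on a copy so the built-in dataset stays untouched

n_missing_before <- sum(is.na(aq$Solar.R))

# Replacement value computed from the observed (non-missing) values only
impute_value <- median(aq$Solar.R, na.rm = TRUE)

# Logical indexing replaces just the missing entries
aq$Solar.R[is.na(aq$Solar.R)] <- impute_value

n_missing_after <- sum(is.na(aq$Solar.R))
cat("Missing before:", n_missing_before, " Missing after:", n_missing_after, "\n")
```

Swapping median() for mean() gives mean imputation; the median is less sensitive to skewed values.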
Q7. Insert the code, output and explanation: Marks (4)
Compare the mean and standard deviation of the chosen variable from Q5 before and after imputation. Provide an explanation of the comparison.
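A before-and-after comparison can be laid out as a small table. The sketch below again assumes Solar.R and median imputation purely for illustration; the object names are hypothetical.

```r
solar_before <- airquality$Solar.R

# Impute a copy so both versions are available for comparison
solar_after <- solar_before
solar_after[is.na(solar_after)] <- median(solar_before, na.rm = TRUE)

comparison <- data.frame(
  stage = c("before", "after"),
  mean  = c(mean(solar_before, na.rm = TRUE), mean(solar_after)),
  sd    = c(sd(solar_before, na.rm = TRUE), sd(solar_after))
)
print(comparison)
```

Because every imputed observation takes the same value, the standard deviation after imputation is expected to be smaller than before, which is the loss of natural variation mentioned in Q5.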

Text Analytics: Section Marks 15
Mysterydocs.RData is a collection of unstructured text documents (available at https://drive.google.com/file/d/1FU2bTUMtqrFizpEQwoz1MQ5Yw2AHRgwe/view?usp=sharing).
The response to the questions below must include comments, where indicated.
Q8. Insert the code and output to: Mark (1)
Import the Mysterydocs.RData file into R and identify the number of documents in the docs dataset.
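Importing an .RData file is done with load(), which restores the saved objects under their original names and returns those names. Since Mysterydocs.RData is not reproduced here, the sketch below first writes a toy file to a temporary folder; the three toy documents and the file name are made up, and the real `docs` object may be a corpus rather than a character vector (length() works for both).

```r
# Create a toy .RData file standing in for Mysterydocs.RData
docs <- c("First toy document.", "Second toy document.", "Third toy document.")
rdata_path <- file.path(tempdir(), "toy_docs.RData")
save(docs, file = rdata_path)
rm(docs)  # remove the object so that load() demonstrably restores it

# load() returns the names of the restored objects
loaded_objects <- load(rdata_path)
print(loaded_objects)

# Number of documents in the restored object
n_docs <- length(docs)
print(n_docs)
```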
Q9. Insert the code and output to: Marks (4)
Using methods of Week 5 Topic 2, clean the collection of texts and convert it into tabular data. Use at least 5 cleaning steps, including stemming. Display the last six rows and first five columns (only) of the cleaned tabular data that you created.
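A cleaning pipeline of this kind could be sketched with the tm and SnowballC packages. This is an assumption: the Week 5 Topic 2 methods are not reproduced here and may use different tools. The seven toy sentences are made up and stand in for the real documents.

```r
library(tm)         # corpus handling and cleaning transformations
library(SnowballC)  # stemmer used by stemDocument()

toy_texts <- c(
  "The cats are running quickly across 3 gardens!",
  "Dogs were barking loudly at the running cats.",
  "Gardens need watering, and gardeners water them daily.",
  "A gardener planted 12 roses in the garden.",
  "Barking dogs seldom bite, people say.",
  "Roses and tulips bloomed in the watered garden.",
  "Cats and dogs played in the gardens."
)

corpus <- VCorpus(VectorSource(toy_texts))

# Six cleaning steps, including stemming
corpus <- tm_map(corpus, content_transformer(tolower))       # 1. lower case
corpus <- tm_map(corpus, removePunctuation)                  # 2. punctuation
corpus <- tm_map(corpus, removeNumbers)                      # 3. digits
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # 4. stop words
corpus <- tm_map(corpus, stemDocument)                       # 5. stemming
corpus <- tm_map(corpus, stripWhitespace)                    # 6. extra spaces

# Tabular form: one row per document, one column per stemmed term
dtm <- DocumentTermMatrix(corpus)
dtm_mat <- as.matrix(dtm)

# Last six rows and first five columns only
print(tail(dtm_mat, 6)[, 1:5])
```

Lower-casing comes before stop-word removal because the stop-word list is in lower case, and stemming comes after it so that the stop words are matched in their original form.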
Q10. Insert your R code and plot: Marks (3)
Create a subset of the cleaned tabular data from Q9, retaining only those words that occur at least 200 times in the entire corpus. Use a visualization tool to show the frequency distribution of the 50 most frequent words in the subset. Hint: Select an appropriate visualization tool from what you learned in Week 3.
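The filtering and ranking steps can be sketched in base R as follows. The document-term matrix here is simulated with made-up term names, since the Q9 output is not available, and a bar chart is only one of several suitable displays.

```r
set.seed(3)
n_docs  <- 100
n_terms <- 80

# Simulated counts: later terms are more frequent on average (hypothetical data)
term_rates <- seq(0.5, 5, length.out = n_terms)
dtm_mat <- matrix(
  rpois(n_docs * n_terms, lambda = rep(term_rates, each = n_docs)),
  nrow = n_docs,
  dimnames = list(NULL, paste0("term", seq_len(n_terms)))
)

# Keep only the terms that occur at least 200 times across the whole corpus
term_totals <- colSums(dtm_mat)
frequent <- dtm_mat[, term_totals >= 200, drop = FALSE]

# Corpus-wide totals of the 50 most frequent retained terms, largest first
top_terms <- head(sort(colSums(frequent), decreasing = TRUE), 50)

barplot(top_terms, las = 2, cex.names = 0.6,
        ylab = "Total occurrences",
        main = "Most frequent terms (simulated)")
```

The drop = FALSE argument keeps the subset a matrix even if only one term passes the threshold.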
Q11. Insert your R code, plot, and interpretation of the plot: Marks (7)
Visualise a similarity matrix between documents derived from the cleaned data in Q9. Comment on the visualisation, noting any obvious structure in the similarity matrix as depicted in the plot. To visualise the similarity matrix, you may use R functions such as levelplot() or image(), or any other suitable plotting function. You will need to research how to use these functions.
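As a sketch of the idea, the code below computes cosine similarity between the rows of a document-term matrix and displays it with image(). Cosine similarity is an assumed choice, since the question does not name a measure, and the matrix is simulated so that documents 1-5 and 6-10 use different vocabularies.

```r
set.seed(42)

# Simulated counts (hypothetical data): documents 1-5 mostly use terms 1-10,
# documents 6-10 mostly use terms 11-20
dtm_mat <- rbind(
  matrix(rpois(5 * 20, lambda = rep(c(5, 0.2), each = 50)), nrow = 5),
  matrix(rpois(5 * 20, lambda = rep(c(0.2, 5), each = 50)), nrow = 5)
)

# Cosine similarity: dot product of two rows divided by the product of their lengths
row_norms <- sqrt(rowSums(dtm_mat^2))
sim <- tcrossprod(dtm_mat) / outer(row_norms, row_norms)

# Heat map of the document-by-document similarity matrix
image(1:10, 1:10, sim,
      xlab = "Document", ylab = "Document",
      main = "Cosine similarity between documents (simulated)")
```

In this simulated case the plot shows two blocks of high similarity along the diagonal, which is the kind of structure the question asks to be commented on.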