CSC8001 Data Science Project Report
(Due date: 6 Nov, 2015)
This assignment is marked out of 100 and is worth 50% of your overall course mark. Please submit your assignment online using the Data Science Project Report submission link on the course web page.
Submit the files listed below in a single ZIP file:
• Titanic.rmd – R Markdown document used to generate your Data Science Project Report. An initial Sample_RMD.rm template file has been provided for you.
• Titanic.html – standalone HTML document (embedded images and code) generated in R Studio using Knitr and your Titanic.rmd R Markdown file.
• /data directory with your dataset files
Reproducible Research
Your Data Science Project Report is expected to meet the criteria of a reproducible research project. The report will document your analysis of the Titanic dataset, including your initial data exploration, your model building and evaluation, and your final predicted outcomes for the test dataset. For your research to be considered reproducible you must provide:
• The data used for your analysis
• All final code files, with appropriate comments
• A report of your analysis which includes background information explaining the question you are trying to answer, a discussion of the analysis and conclusions reached for your project with appropriate supporting explanations and figures.
To comply with this final requirement, your final report will be a standalone HTML document created using R Studio with Knitr & R Markdown tools. Using Knitr with R Markdown allows you to create a report that interweaves your discussion with your code and figures. See R Markdown — Dynamic Documents for R in the list of online resources provided below for further information.
Data Analysis Project
This assessment, like Assignment 2, is based on a Kaggle competition. For this assignment you are asked to predict which of the Titanic’s passengers survived the disaster. More information is available at the Kaggle competition site, Titanic: Machine Learning from Disaster [https://www.kaggle.com/c/titanic].
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy. (Kaggle 2012)
Project Report Outline
Please use the project report outline provided below as a general guide to the specific sections and content that you should include in your project report.
1. Background
Introduce and discuss the background and purpose of your project. What information does the dataset provide? What question(s) are you trying to answer?
2. Exploratory analysis
Conduct exploratory analysis to discover which of the independent variables are most informative. You are required to explore and report on at least four variables. Three of the four must be Age, Sex and Class; the fourth may be any other independent variable in the dataset. For each variable you report on, include at least one table or figure illustrating the relationship between that variable and passenger survival.
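As a starting point, one way to tabulate survival against a variable is sketched below in base R. This is only an illustration under stated assumptions: the eight rows are made up (they are not taken from the Kaggle files), the column names Survived, Sex, Pclass and Age are assumed to follow the layout of the Kaggle train.csv file, and the age cut-off of 16 is an arbitrary choice for the example.

```r
# Toy data only: eight made-up passengers, NOT rows from the Kaggle files.
# Column names are assumed to match the Kaggle train.csv layout.
toy <- data.frame(
  Survived = c(1, 0, 1, 1, 0, 0, 1, 0),
  Sex      = c("female", "male", "female", "female",
               "male", "male", "male", "female"),
  Pclass   = c(1, 3, 2, 1, 3, 2, 1, 3),
  Age      = c(29, 40, 8, 35, 22, 54, 4, 61)
)

# Counts of survivors (1) and non-survivors (0) by sex
counts_by_sex <- table(toy$Sex, toy$Survived, dnn = c("Sex", "Survived"))

# Row-wise proportions: the share of each sex in each outcome
rate_by_sex <- prop.table(counts_by_sex, margin = 1)

# The mean of a 0/1 outcome is the survival rate within each class
rate_by_class <- tapply(toy$Survived, toy$Pclass, mean)

# Age is continuous, so bin it first (the cut-off of 16 is an assumption)
toy$AgeGroup <- cut(toy$Age, breaks = c(0, 16, Inf),
                    labels = c("child", "adult"))
rate_by_age <- tapply(toy$Survived, toy$AgeGroup, mean)

print(counts_by_sex)
print(round(rate_by_sex, 2))
print(rate_by_class)
print(round(rate_by_age, 2))
```

The same calls should work on the real training data once it has been read in, for example with read.csv; note that the real Age column contains missing values, which you will need to handle before binning or averaging.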
3. Building and evaluating the model
a. Discuss your choice of model. Explain why you’ve chosen this specific model. What are its strengths? What are its limitations?
b. Evaluate your model. The discussion for the evaluation section should include answers to the following questions: How well does your model predict? Is it overfitting to the training set? Do you trust this model?
c. This section should include at least two tables or figures to summarize or illustrate your discussion.
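One common way to check for overfitting is to hold back part of the training data and compare accuracy on the rows the model was fitted to with accuracy on the held-out rows. The sketch below illustrates the idea in base R. It rests on several assumptions: the data are simulated (not the Titanic dataset), the column names mimic the Kaggle train.csv file, logistic regression via glm is used purely as an example model and not as a recommendation, and the 70/30 split and 0.5 threshold are arbitrary.

```r
# Simulated data only: NOT the Titanic dataset. Column names mimic train.csv.
set.seed(42)
n <- 200
sim <- data.frame(
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Pclass = sample(1:3, n, replace = TRUE),
  Age    = round(runif(n, 1, 70))
)
# Made-up survival process so that the model has something to learn
p <- plogis(1.5 - 2 * (sim$Sex == "male") - 0.5 * (sim$Pclass - 1))
sim$Survived <- rbinom(n, 1, p)

# Hold out 30% of the rows for validation
val_idx <- sample(n, size = round(0.3 * n))
train   <- sim[-val_idx, ]
valid   <- sim[val_idx, ]

# Example model: logistic regression fitted on the training rows only
fit <- glm(Survived ~ Sex + Pclass + Age, data = train, family = binomial)

# Predicted class at a 0.5 probability threshold
predict_class <- function(model, data) {
  as.integer(predict(model, newdata = data, type = "response") > 0.5)
}

# Share of rows whose predicted class matches the recorded outcome
accuracy <- function(model, data) {
  mean(predict_class(model, data) == data$Survived)
}

train_acc <- accuracy(fit, train)
valid_acc <- accuracy(fit, valid)

# Confusion matrix on the held-out rows
conf_mat <- table(Predicted = predict_class(fit, valid),
                  Actual    = valid$Survived)

print(c(train = train_acc, validation = valid_acc))
print(conf_mat)
```

A training accuracy that is much higher than the validation accuracy is a sign of overfitting. A single split can be noisy, so repeating the split or using cross-validation gives a more stable picture.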
4. Predicting passenger survival
Finally, use the model you’ve built to predict the outcomes for the test set and compare these results to your training data. Optionally, I encourage you to submit your predictions to the Kaggle competition site and include your results in your report.
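The prediction step can be sketched as follows. The example is hedged in several ways: both data frames are made up (they are not the Kaggle files), the PassengerId values are hypothetical, the glm model merely stands in for whichever model you chose, and the file is written to a temporary directory. The two-column layout of PassengerId and Survived is the format the Kaggle Titanic competition uses for submissions.

```r
# Toy data only: made-up rows, NOT the Kaggle train.csv or test.csv files.
train <- data.frame(
  Survived = c(1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1),
  Sex      = factor(c("female", "male", "female", "male", "female", "male",
                      "male", "female", "male", "female", "female", "male")),
  Pclass   = c(1, 3, 2, 3, 1, 2, 3, 2, 1, 3, 1, 3)
)
# The test set has no Survived column; the PassengerId values are hypothetical
test <- data.frame(
  PassengerId = c(1001, 1002, 1003),
  Sex         = factor(c("female", "male", "male"), levels = levels(train$Sex)),
  Pclass      = c(1, 3, 2)
)

# Stand-in model: replace with the model built in the previous section
fit <- glm(Survived ~ Sex + Pclass, data = train, family = binomial)

# Predicted survival probabilities, converted to 0/1 at a 0.5 threshold
prob <- predict(fit, newdata = test, type = "response")
submission <- data.frame(
  PassengerId = test$PassengerId,
  Survived    = as.integer(prob > 0.5)
)

# Write the predictions without row names, one row per test passenger
out_path <- file.path(tempdir(), "submission.csv")
write.csv(submission, out_path, row.names = FALSE)

# Simple comparison: observed survival rate in training vs predicted in test
train_rate <- mean(train$Survived)
test_rate  <- mean(submission$Survived)
print(c(train = train_rate, predicted_test = test_rate))
```

Comparing the predicted survival rate for the test set with the observed rate in the training set is only a rough sanity check; a large gap suggests the model or the preprocessing deserves a second look.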
5. Conclusions
Discuss the conclusions you’ve drawn based on your analysis.
List of online resources
• Titanic: Machine Learning from Disaster
Kaggle competition site.
https://www.kaggle.com/c/titanic
• R Markdown — Dynamic Documents for R http://rmarkdown.rstudio.com/
• Getting Started with R: Kaggle's Titanic Competition
List of 4 excellent tutorials for using R to compete in the Titanic competition.
https://www.kaggle.com/c/titanic/details/new-getting-started-with-r
• Kaggle and DataCamp R Tutorial on Machine Learning
Interactive tutorial by Kaggle and DataCamp which provides coding exercises to help you predict the passenger survival rates for Kaggle’s Titanic competition.
https://www.datacamp.com/courses/kaggle-tutorial-on-machine-learing-the-sinking-of-thetitanic
References
Titanic: Machine Learning from Disaster 2012, Kaggle, viewed 8 Oct 2015, https://www.kaggle.com/c/titanic