The purpose of this project is to follow the process of going from data to knowledge using a data
set that applies to a real-world problem. For this project, you will form teams of 4 to 5 students.
Your team’s objective is to locate a data set that helps solve a specific problem. When choosing
a data set, think about an impactful problem it could address and how the data will let you
work towards a solution. You may use any data mining software packages or libraries you wish for
the data mining tasks, and any programming language for cleaning and pre-processing the data.
Project Report [8-10 pages, 2500-3500 words]
• Executive Summary – give a high-level description of the problem and what will be
included in the remainder of the report. Be sure to mention the overall results of the project
as part of this summary!
• Problem Description - give a 2-3 sentence description of the problem your team solved.
• Data Exploration
o Source(s) of the data – provide link(s) and any background on the data that might
be of interest. For example, some data is “simulated” and did not come from a
real-world source.
o Number of records
o Attribute description – begin with a high-level description of each attribute,
including what the attribute is and its type (continuous or discrete). A table is ideal
for this. Next, for each attribute, provide summary statistics (min, max, mean,
standard deviation, frequency, mode). Include figures/plots for at least some
of the attributes, especially ones that appear to be interesting.
o Missing values – did you find any missing values? If so, how do you plan to deal
with these?
o Outliers – did you find any outliers? If so, how do you plan to deal with these?
• Data Preprocessing – which preprocessing steps did you use? Below is some guidance on
what to write about if you performed any of the following steps. Only include
preprocessing steps that you did as part of your project.
o Discretization - show the discretization scheme you used and the new distribution
for any attributes you discretized.
o Sampling - describe the process you used and show the summary statistics of the
attributes for the sampled data set.
o Aggregation – describe the process used for aggregating data within an attribute
and show the new distribution. Give reasoning for why you did this.
o Dimensionality Reduction/Feature Selection – how and why did you decide to remove
features from the data set? What was the result of having done this?
• Data Mining Techniques/Algorithms Used - Describe the techniques and algorithms used
at a high level and why you decided to use them.
• Results - Perform an appropriate analysis of results. For example, discuss errors in
classification models using confusion matrices. The purpose here is to compare the results
obtained from various models or approaches that you tried.
• Conclusions and Lessons Learned – What are the major takeaways from this project in
terms of how well you were able to solve the problem you stated? What did you learn from
working on the project together as a team?
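As an illustration of the summary statistics, missing-value count, and outlier check asked for
under Data Exploration, a minimal sketch in Python using only the standard library might look
like the following. The attribute values, the use of None to mark missing values, and the
1.5 x IQR outlier rule are assumptions chosen for illustration, not project requirements.

```python
import statistics

# Hypothetical attribute values; None marks a missing value (an assumption).
ages = [23, 31, 27, None, 45, 29, 31, 120, 35, None]


def summarize(values):
    """Return summary statistics, a missing-value count, and IQR-rule outliers."""
    present = [v for v in values if v is not None]

    # Quartiles of the non-missing values (default "exclusive" method).
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    return {
        "records": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "stdev": statistics.stdev(present),  # sample standard deviation
        "mode": statistics.mode(present),
        # Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged, not removed.
        "outliers": [v for v in present if v < low or v > high],
    }


summary = summarize(ages)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The sketch only flags candidate outliers. Whether to keep, cap, or drop them is a decision to
justify in the report.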
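For the Results section, the confusion-matrix analysis mentioned above can be sketched as
follows for a two-class problem. The labels, predictions, and the helper names
confusion_counts and metrics are hypothetical. A data mining library would normally provide
equivalents, so treat this as a sketch of the idea rather than a required implementation.

```python
from collections import Counter

# Hypothetical true labels and model predictions for a binary problem.
actual = ["yes", "no", "yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "no", "yes", "yes", "no"]


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def metrics(counts, positive="yes", negative="no"):
    """Derive accuracy, precision, and recall from the confusion counts."""
    tp = counts[(positive, positive)]  # true positives
    fn = counts[(positive, negative)]  # false negatives
    fp = counts[(negative, positive)]  # false positives
    tn = counts[(negative, negative)]  # true negatives
    total = tp + fn + fp + tn
    return {
        "tp": tp,
        "fn": fn,
        "fp": fp,
        "tn": tn,
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


result = metrics(confusion_counts(actual, predicted))
for name, value in result.items():
    print(f"{name}: {value}")
```

Computing the same counts for each model you tried gives a like-for-like basis for comparing
where each one makes its errors.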
Data Mining Tools/Languages/Libraries
• R - https://www.r-project.org
• Weka - https://www.cs.waikato.ac.nz/ml/index.html
• PowerBI - https://powerbi.microsoft.com/en-us/
Examples of Past Projects
• IMDB movie earnings prediction - https://www.imdb.com/interfaces/
• Credit card customer default prediction -
• Predicting success of Kickstarter Campaigns - https://www.kaggle.com/kemical/kickstarter-
• Predicting video game sales - https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings