M1T1-M1T3 RM Assignment A1

Assignment A1 / Workshops M1T1-M1T3: RM

This assignment has two parts, which form the unit’s learning portfolio. Each part needs to be submitted by its deadline. The associated workshops comprise three sessions, M1T1-M1T3, and optionally M1T4. By completing the workshops and the assignment, students will learn how to explore data, gain insights into the problem domain, create classification models and make predictions based on such models. RapidMiner (RM) will be used as a tool to assist with all these data mining tasks. All workshops rely on the material introduced in a series of classes.

During the workshops (on-campus and on-cloud) students will work individually. They will be given tasks and will use RapidMiner Studio to complete them. Bringing your own laptop with the software pre-installed is encouraged.

Before attending RM workshops, students are required to become familiar with class notes and all textbook readings (see the topic schedule with chapter references).

Topic and activities – No late arrivals for the on-campus sessions!

1. Preparation
Install RapidMiner Studio, test that it works.

2. M1T1: Data Exploration, Analytic Process
The workshop facilitator will explain the mini case study. Work in teams but submit individual and unique reports. Improve your work iteratively. Work stage-by-stage. Start by formulating a business problem (it may change later).

3. Data Prep
Explore RM, find out where Help is, exercise the RM basics, clean a data file and save it, manipulate and plot sample data, read and write data files. Learn about the problem area and the assignment data. Download your data as a CSV (or JSON if brave) file, explore your data. Select attribute types, nominate them as labels and predictors. Do not modify these ‘raw’ data files outside of the RM environment. Plot the selected variables to investigate their properties. Create a sample analytic process.

4. M1T2: Relationships
Study relationships between the selected attributes using charts and correlation tables. Ensure that predictors are associated with labels, but that predictors are independent of each other; the same applies to the labels. Deal with incorrect or missing attribute values. Visualise and report the results.

5. M1T3: Modelling & Evaluation, Consider Deployment
Use the selected attributes to create a number of classification models in RM. Consider whether or not any of the previously selected attributes should be dropped or new attributes added to improve the model. Evaluate your intermediate and final models’ performance. Experiment with holdout and cross-validation. Carry out ‘honest’ testing. Prepare and apply the process to be deployed for classifying new data. Visualise and report all your results.

6. Report and Executive Summary
Prepare a report of your findings. Ensure that its executive summary is aimed at management, include interpretation of results and give well-justified recommendations.

7. Submission / Learning Portfolio
By the specified deadline, individually submit two components of your learning portfolio, i.e. LP1 and later LP2 parts of the assignment via CloudDeakin dropbox. With each submission, include your report in PDF, formatted using the provided template plus a ZIP archive of all models, i.e. your RapidMiner scripts (.RMP files) – do not use other file formats.
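
The M1T3 activity asks for experiments with holdout and cross-validation. The assignment itself must be completed in RapidMiner; purely to illustrate how the two schemes partition the rows of a data set, here is a minimal Python sketch. The function names, the 30% test fraction, k = 5 and the random seed are illustrative assumptions, not values taken from this brief.

```python
import random


def holdout_split(n_rows, test_fraction=0.3, seed=42):
    """Shuffle row indices once and split them into (train, test) lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_fraction))
    # The first n_test shuffled rows are held out; the rest are for training.
    return indices[n_test:], indices[:n_test]


def k_fold_splits(n_rows, k=5, seed=42):
    """Yield k (train, test) pairs; every row is tested exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled rows into k folds, like dealing cards.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    train, test = holdout_split(10)
    print("holdout:", len(train), "train rows,", len(test), "test rows")
    for fold_no, (train, test) in enumerate(k_fold_splits(10, k=5), start=1):
        print("fold", fold_no, "test rows:", sorted(test))
```

In a holdout split a model is scored once, on rows it never saw during training. In k-fold cross-validation each row serves as test data exactly once, so the k scores can be averaged for a steadier performance estimate.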


In technical terms:

Your project objectives form a learning portfolio. The first objective (LP1) is to acquire and explore the available data, visualise and report any significant characteristics of non-text data, and prepare the data for further processing. The second objective (LP2) is to create a classification system able to answer management questions using only non-text data. Text processing will be featured in assignment A2. Reports in PDF format and models developed in LP1 and LP2 in ZIP archives are to be submitted via CloudDeakin by their respective deadlines.

Data:

Data: http://www.deakin.edu.au/~jlcybuls/pred/data/Wine-Reviews.zip
Original data source: https://www.kaggle.com/zynicide/wine-reviews

Hints on the process:

Formulate a business problem using plain English statements, but cross-reference them with the technical aspects described in the subsequent sections. When describing the problem and its solution, keep in mind what can be achieved by using the available data.

Note that what you have been asked for and what can be delivered are two different things, e.g. if the model you are able to create does not provide sufficient quality of answers, then this is what you need to report. Or your model could provide accurate advice only within a specific range of characteristics, so be clear about this.

Explore the data, select one or more label attributes and identify candidate predictors of each label.

Explore your attributes in terms of their types, values, distribution and relationships. Use appropriate visualisations, analyse and interpret them. As the report template provides very limited space, be selective about what you include in the report – each visualisation must have a purpose and a description to advance your argument, use them as evidence! Depending on the model, some attributes may need to be transformed before using them in modelling tasks. You may also have to deal with incorrect or missing values.
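
As an illustration of checking the relationship between two numeric attributes while coping with missing values, the following Python sketch computes a Pearson correlation over complete pairs only. It is not part of the RapidMiner workflow required here. The attribute names `price` and `points` and all values are made-up toy data, and skipping incomplete pairs is only one of several possible treatments of missing values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists, skipping pairs with a missing value."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy values only (hypothetical); None marks a missing entry.
    price = [10, 20, None, 40, 50]
    points = [82, 85, 88, 91, 94]
    print("correlation(price, points) =", pearson(price, points))
```

A value near +1 or -1 between a predictor and the label suggests a strong linear association. The same value between two predictors suggests that they are not independent of each other.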

Check the assessment criteria on the next page to see how you are going to be assessed. Stick to the recommended process. Complete the basics first before moving to the more advanced tasks or any extensions and research tasks.

You will submit your work in two learning portfolio parts LP1 and LP2.
Each part needs to be lodged via CloudDeakin dropbox before the deadline.

You will be allowed to submit your work once only!

It is essential that your reports use LP1 and LP2 templates.
Follow instructions embedded in the templates!

Both reports must fit into a strict page limit imposed by the template.
Only pages within the template limit will be reviewed and assessed!

Make sure that the problem statement and the executive summary are aimed at non-technical readers, while the remaining parts of the reports are aimed at data or business analysts (not at highly technical programmers).

Your submission must include the report in PDF format and a ZIP archive of .RMP script files (these can be found in the RM project folder – simply ZIP these files).

Submissions not in PDF and ZIP format will not be opened or assessed!

There is a strict deadline for each submission. In cases of documented illness, special consideration may be granted, but it must be applied for well ahead of the deadline.

In general, requests for special consideration received less than three days before the deadline will not be considered!

An automatic late penalty of 5% of the available marks per day (up to 5 days) will be applied to all late assignment submissions.

Late penalties apply immediately past the deadline – even 1 second!
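
To make the late-penalty arithmetic concrete, here is a small Python sketch. It assumes that every started 24-hour period counts as a full day, which fits the "even 1 second" rule but is not spelled out in this brief. The brief also does not say what happens after the fifth day, so the hypothetical function `late_penalty_percent` returns `None` for that case rather than guessing.

```python
import math

PENALTY_PER_DAY = 5    # percent of the available marks per day (from the brief)
MAX_PENALTY_DAYS = 5   # "up to 5 days" (from the brief)
SECONDS_PER_DAY = 24 * 60 * 60


def late_penalty_percent(seconds_late):
    """Return the penalty as a percentage of the available marks.

    Returns None beyond the 5-day window, where the brief states no outcome.
    """
    if seconds_late <= 0:
        return 0
    # Assumption: each started day counts in full, so 1 second late is 1 day late.
    days_late = math.ceil(seconds_late / SECONDS_PER_DAY)
    if days_late > MAX_PENALTY_DAYS:
        return None
    return days_late * PENALTY_PER_DAY


if __name__ == "__main__":
    for seconds in (0, 1, SECONDS_PER_DAY + 1, 5 * SECONDS_PER_DAY):
        print(seconds, "seconds late ->", late_penalty_percent(seconds), "% penalty")
```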

Both parts LP1 and LP2 will be marked together after part LP2 is submitted.
Feedback will be provided on both parts together.

Teamwork and collaboration are encouraged, but plagiarism will be penalised.
The CloudDeakin groups should be self-selected (you can also be a 1-member group). Team members can share ideas and help each other in solving technical problems. Seek your team’s feedback on all aspects of your assignment, especially before its submission. However, your assignment needs to be completed individually.

Ensure that your assignment is unique, otherwise plagiarism will be assumed!
