RapidMiner Assignment

The key takeaways are getting familiar with RapidMiner, handling missing data, reducing data, and building models.

The main steps in preparing data in RapidMiner are importing data, handling missing data, reducing data through filtering, and attribute reduction.

The different ways to handle missing data are to leave it as is, replace it with another value, or take the mean/mode of existing values.

PREPARING RAPIDMINER, IMPORTING DATA, AND HANDLING MISSING DATA

1. Our first task in data preparation is to handle missing data; however, because this will be our
first time using RapidMiner, the first few steps will involve getting RapidMiner set up.
2. We'll then move straight into handling missing data. Missing data are data that do not exist in
a data set. As you can see in the figure below, missing data are not the same as zero or some
other value.

3. It is blank, and the value is unknown. Missing data are also sometimes known in the
database world as null.
4. Depending on your objective in data mining, you may choose to leave missing data as they
are, or you may wish to replace missing data with some other value.
5. Launch the RapidMiner application. This can be done by double clicking your desktop icon or
by finding it in your application menu.

6. Next we will need to start a new data mining project in RapidMiner.


7. Within RapidMiner there are two main areas that hold useful tools: Repositories and
Operators.
8. The Repositories area is the place where you will connect to each data set you wish to mine.
The Operators area is where all data mining tools are located.
9. These are used to build models and otherwise manipulate data sets. Click on Repositories.
You will find that the initial repository we created upon our first launch of the RapidMiner
software is present in the list.
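The missing-data options described in step 4 can be sketched in Python with pandas; the attribute names and values here are hypothetical stand-ins for a survey data set, not the actual chapter data.

```python
import numpy as np
import pandas as pd

# Hypothetical survey attributes; NaN marks a missing (unknown) value,
# which is not the same as zero.
df = pd.DataFrame({
    "Online_Gaming": [1.0, np.nan, 0.0, 1.0],
    "Twitter":       [0.0, 1.0, 1.0, np.nan],
})

# Option 1: leave the missing data as they are (df is unchanged).
# Option 2: replace missing values with some other fixed value.
filled_zero = df.fillna(0)
# Option 3: replace missing values with the mean of the existing values.
filled_mean = df.fillna(df.mean())
```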
DATA REDUCTION

1. Go ahead and switch back to design perspective. The next set of steps will teach you to reduce
the number of observations in your data set through the process of filtering.
2. In the search box within the Operators tab, type in the word ‘filter’. This will help you locate the
‘Filter Examples’ operator, which is what we will use in this example.
3. Drag the Filter Examples operator over and connect it into your stream, right after the Replace
Missing Values operator.
4. Taking a sample from the data is another way to reduce the number of observations.
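As a rough pandas analogue (the attribute names are made up), filtering keeps only the rows that match a condition, while sampling draws a random subset of rows:

```python
import pandas as pd

# Hypothetical data set with made-up attribute names.
df = pd.DataFrame({
    "Age":     [25, 34, 19, 48],
    "Twitter": [1, 0, 1, 1],
})

# Filtering (like the Filter Examples operator): keep only rows
# that satisfy a condition.
adults = df[df["Age"] >= 21]

# Sampling: draw a random subset of the observations.
sample = df.sample(n=2, random_state=42)
```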
HANDLING INCONSISTENT DATA

1. Inconsistent data is different from missing data. Inconsistent data occurs when a value does
exist, however that value is not valid or meaningful.

2. In the parameters pane, change the attribute filter type to single, then indicate Twitter as the
attribute to be modified. In truth, in this data set there is only one instance of the value 99 across
all attributes and observations, so this change to a single attribute is not actually necessary in this
example, but it is good to be thoughtful and intentional with every step in a data mining process.
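A hedged pandas sketch of the same idea: 99 is assumed to be an invalid placeholder in a hypothetical Twitter attribute, and it is replaced in that single column with the mode of the valid values.

```python
import pandas as pd

df = pd.DataFrame({
    "Twitter": [0, 1, 99, 1],   # 99 is an inconsistent, invalid value
    "Age":     [25, 34, 19, 48],
})

# Restrict the fix to the single affected attribute, mirroring the
# 'single' attribute filter type in RapidMiner.
valid = df.loc[df["Twitter"] != 99, "Twitter"]
df["Twitter"] = df["Twitter"].replace(99, valid.mode()[0])
```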
ATTRIBUTE REDUCTION

1. Return to design perspective. In the operator search field, type Select Attribute. The Select
Attributes operator will appear. Drag it onto the end of your stream so that it fits between the
Replace operator and the result set port.
2. In the Parameters pane, set the attribute filter type to 'subset', then click the Select Attributes
button.
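In pandas terms, selecting a subset of attributes is simply column selection (the attribute names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Age":           [25, 34],
    "Twitter":       [1, 0],
    "Online_Gaming": [0, 1],
})

# Keep only a named subset of attributes, like the Select Attributes
# operator with the 'subset' filter type.
subset = df[["Age", "Twitter"]]
```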
MODELING

1. Switch back to design perspective. On the Operators tab in the lower left hand corner, use the
search box and begin typing in the word correlation.
2. The tool we are looking for is called Correlation Matrix. You may be able to find it before you
even finish typing the full search term.
3. Once you’ve located it, drag it over into your process window and drop it into your stream.
4. Correlation coefficients are relatively easy to decipher. They are simply a measure of the strength
of the relationship between each possible set of attributes in the data set.
5. Because we have six attributes in this data set, our matrix is six columns wide by six rows tall.
6. In the location where an attribute intersects with itself, the correlation coefficient is ‘1’, because
everything compared to itself has a perfectly matched relationship.
7. All other pairs of attributes will have a correlation coefficient of less than one. To complicate
matters a bit, correlation coefficients can actually be negative as well, so all correlation
coefficients will fall somewhere between -1 and 1.
8. All correlation coefficients between 0 and 1 represent positive correlations, while all coefficients
between 0 and -1 are negative correlations. While this may seem straightforward, there is an
important distinction to be made when interpreting the matrix’s values.
9. This distinction has to do with the direction of movement between the two attributes being
analyzed. Let’s consider the relationship between the Heating_Oil consumption attribute, and
the Insulation rating level attribute. The coefficient there, as seen in our matrix in Figure above,
is 0.736.
10. This is a positive number, and therefore, a positive correlation. But what does that mean?
Correlations that are positive mean that as one attribute’s value rises, the other attribute’s value
also rises. But, a positive correlation also means that as one attribute’s value falls, the other’s
also falls.

11. RapidMiner attempts to help us recognize correlation strengths through color coding.
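A small pandas sketch of a correlation matrix; the data values below are invented, and only the attribute names follow the chapter's example:

```python
import pandas as pd

# Invented values for three of the attributes discussed above.
df = pd.DataFrame({
    "Heating_Oil": [200, 250, 310, 330, 400],
    "Insulation":  [2, 4, 6, 7, 9],
    "Avg_Age":     [40, 45, 55, 60, 70],
})

corr = df.corr()
# The diagonal is always 1: every attribute has a perfectly matched
# relationship with itself. All other coefficients fall between -1 and 1.
```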
DEPLOYMENT

- We learned through our investigation that the two most strongly correlated attributes in our
data set are Heating_Oil and Avg_Age, with a coefficient of 0.848.
- Thus, we know that in this data set, as the average age of the occupants in a home increases,
so too does the heating oil usage in that home.
- What we do not know is why that occurs. Data analysts often make the mistake of
interpreting correlation as causation.
- The assumption that correlation proves causation is dangerous and often false.
- Consider for a moment the correlation coefficient between Avg_Age and Temperature: -0.673.
- As the age of a home's residents increases, the average temperature outside decreases; and
as the temperature rises, the age of the folks inside goes down. But could the average age of
a home's occupants have any effect on that home's average yearly outdoor temperature?
Certainly not.
- Another false interpretation about correlations is that the coefficients are percentages, as if
to say that a correlation coefficient of 0.776 between two attributes is an indication that
there is 77.6% shared variability between those two attributes. In fact, shared variability is
measured by the square of the coefficient, not by the coefficient itself.
- Make views and charts.
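The shared variability between two attributes is given by the square of the correlation coefficient (the coefficient of determination), not by the coefficient itself:

```python
r = 0.776                      # correlation coefficient
shared_variability = r ** 2    # coefficient of determination (r-squared)
# 0.776 squared is roughly 0.602, i.e. about 60.2% shared variability,
# not 77.6%.
```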
ASSOCIATION RULES

Roger is a city manager for a medium-sized, but steadily growing, city. The city has limited resources,
and like most municipalities, there are more needs than there are resources. He feels like the citizens
in the community are fairly active in various community organizations, and believes that he may be
able to get a number of groups to work together to meet some of the needs in the community. He
knows there are churches, social clubs, hobby enthusiasts and other types of groups in the community.
What he doesn’t know is if there are connections between the groups that might enable natural
collaborations between two or more groups that could work together on projects around town. He
decides that before he can begin asking community organizations to begin working together and to
accept responsibility for projects, he needs to find out if there are any existing associations between
the different types of groups in the area.

- Association rules are a data mining methodology that seeks to find frequent connections
between attributes in a data set.
- Association rules are very common in shopping basket analysis. Marketers and vendors in
many sectors use this data mining approach to try to find which products are most
frequently purchased together.
- If you have ever purchased items on an e-commerce retail site like Amazon.com, you have
probably seen the fruits of association rule data mining.
- These are most commonly found in the recommendations sections of such web sites. You
might notice that when you search for a smartphone, recommendations for screen protectors,
protective cases, and other accessories such as charging cords or data cables are often
made to you.
- The items being recommended are identified by mining for items that previous customers
bought in conjunction with the item you searched for. In other words, those items are found to
be associated with the item you are looking for, and that association is so frequent in the web
site's data set that the association might be considered a rule.
- Thus is born the name of this data mining approach: "association rules". While association
rules are most common in shopping basket analysis, this modeling technique can be applied
to a broad range of questions. We will help Roger by creating an association rule model to try
to find linkages across types of community organizations.

Roger conducted a community survey, and the following data were collected:

- Elapsed_Time: This is the amount of time each respondent spent completing our survey. It is
expressed in decimal minutes (e.g. 4.5 in this attribute would be four minutes, thirty seconds).
- Time_in_Community: This question on the survey asked the person if they have lived in the
area for 0-2 years, 3-9 years, or 10+ years; it is recorded in the data set as Short, Medium,
or Long respectively.
- Gender: The survey respondent's gender.
- Working: A yes/no column indicating whether or not the respondent currently has a paid job.
- Age: The survey respondent's age in years.
- Family: A yes/no column indicating whether or not the respondent is currently a member of a
family-oriented community organization, such as Big Brothers/Big Sisters, children's
recreation or sports leagues, genealogy groups, etc.
- Hobbies: A yes/no column indicating whether or not the respondent is currently a member of
a hobby-oriented community organization, such as amateur radio, outdoor recreation,
motorcycle or bicycle riding, etc.
- Social_Club: A yes/no column indicating whether or not the respondent is currently a member
of a community social organization, such as Rotary International, Lion's Club, etc.
- Political: A yes/no column indicating whether or not the respondent is currently a member of
a political organization with regular meetings in the community, such as a political party, a
grass-roots action group, a lobbying effort, etc.
- Professional: A yes/no column indicating whether or not the respondent is currently a
member of a professional organization with local chapter meetings, such as a chapter of a
law or medical society, a small business owner's group, etc.
- Religious: A yes/no column indicating whether or not the respondent is currently a
member of a church in the community.
- Support_Group: A yes/no column indicating whether or not the respondent is currently a
member of a support-oriented community organization, such as Alcoholics Anonymous, an
anger management group, etc.

DATA PREPARATION

1. Import the Chapter 5 CSV data set into your RapidMiner data repository.
2. Drag your Chapter5 data set into a new process window in RapidMiner, and run the model in
order to inspect the data.
3. We have a fairly good understanding of our objectives and our data, but we know that some
additional preparation is needed.
4. First, we need to reduce the number of attributes in our data set.
5. The elapsed time each person took to complete the survey isn’t necessarily interesting in the
context of our current question, which is whether or not there are existing connections
between types of organizations in our community.
6. In order to reduce our data set to only those attributes related to our question, add a Select
Attributes operator to your stream, and select the following attributes for inclusion, as
illustrated in Figure: Family, Hobbies, Social_Club, Political, Professional, Religious,
Support_Group. Once you have these attributes selected, click OK to return to your main
process.
7. One other step is needed in our data preparation. This is to change the data types of our selected
attributes from integer to binominal.
8. The association rules operators need this data type in order to function properly.
9. In the search box on the Operators tab in design view, type ‘Numerical to’ (without the single
quotes) to locate the operators that will change attributes with a numeric data type to some
other data type. The one we will use is Numerical to Binominal.
10. You should also observe that within RapidMiner, the data type binominal is used instead of
binomial, a term many data analysts are more used to.
11. There is an important distinction. Binomial means one of two numbers (usually 0 and 1), so the
basic underlying data type is still numeric. Binominal on the other hand, means one of two
values which may be numeric or character based.
12. Click the play button to run your model and see how this conversion has taken place in our data
set. In results perspective, you should see the transformation, as depicted in Figure below
13. For each attribute in our data set, the values of 1 or 0 that existed in our source data set are
now reflected as either 'true' or 'false'. Our data preparation phase is now complete and we are
ready for modelling.
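The same conversion can be sketched in pandas by casting 0/1 integers to booleans (attribute names taken from the example):

```python
import pandas as pd

df = pd.DataFrame({
    "Family":  [1, 0, 1],
    "Hobbies": [0, 0, 1],
})

# Convert numeric 0/1 values to two-valued true/false, mirroring the
# Numerical to Binominal operator.
binom = df.astype(bool)
```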

Modelling
1. Use the search field in the operators tab to look for an operator called FP-Growth.
2. The FP in FP-Growth stands for Frequent Pattern.
3. Frequent pattern analysis is handy for many kinds of data mining, and is a necessary component
of association rule mining.
4. Without having frequencies of attribute combinations, we cannot determine whether any of the
patterns in the data occur often enough to be considered rules.
5. Take note of the min support parameter on the right hand side. We will come back to this
parameter during the evaluation portion of this chapter’s example.
6. Also, be sure that both your exa port and your fre port are connected to res ports. The exa port
will generate a tab of your examples (your data set’s observations and meta data), while the fre
port will generate a matrix of any frequent patterns the operator might find in your data set.
7. Run your model to switch to results perspective.
8. In results perspective, we see that some of our attributes appear to have some frequent patterns
in them, and in fact, we begin to see that three attributes look like they might have some
association with one another.
9. We can investigate this possible connection further by adding one final operator to our model.
Return to design perspective, and in the operators search box, look for ‘Create Association’
(again, without the single quotes).
10. Drag the Create Association Rules operator over and drop it into the spline that connects the fre
port to the res port. This operator takes in frequent pattern matrix data and seeks out any
patterns that occur so frequently that they could be considered rules.
11. There are two main factors that dictate whether or not frequency patterns get translated into
association rules: Confidence percent and Support percent.
12. Confidence percent is a measure of how confident we are that when one attribute is flagged as
true, the associated attribute will also be flagged as true.
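Support percent and confidence percent can be computed directly; here is a minimal pandas sketch with invented true/false membership data for two of the survey attributes:

```python
import pandas as pd

# Invented membership data for two of the survey attributes.
df = pd.DataFrame({
    "Family":    [True, True, False, True, False],
    "Religious": [True, True, False, False, False],
})

both = (df["Family"] & df["Religious"]).sum()

# Support percent: how often the two attributes are true together,
# out of all observations.
support = both / len(df)

# Confidence percent: of the observations where Family is true,
# how often Religious is also true.
confidence = both / df["Family"].sum()
```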
Clustering
Sonia is a program director for a major health insurance provider. Recently she has been reading
in medical journals and other articles, and found a strong emphasis on the influence of weight,
gender and cholesterol on the development of coronary heart disease. The research she’s read
confirms time after time that there is a connection between these three variables, and while
there is little that can be done about one’s gender, there are certainly life choices that can be
made to alter one’s cholesterol and weight. She begins brainstorming ideas for her company to
offer weight and cholesterol management programs to individuals who receive health insurance
through her employer. As she considers where her efforts might be most effective, she finds
herself wondering if there are natural groups of individuals who are most at risk for high weight
and high cholesterol, and if there are such groups, where the natural dividing lines between the
groups occur.

DATA UNDERSTANDING
- Using the insurance company's claims database, Sonia extracts three attributes for 547
randomly selected individuals.
- The three attributes are the insured's weight in pounds as recorded on the person's most
recent medical examination, their last cholesterol level determined by blood work in their
doctor's lab, and their gender.
- As is typical in many data sets, the gender attribute uses 0 to indicate Female and 1 to
indicate Male.
- We will use this sample data from Sonia's employer's database to build a cluster model
to help Sonia understand how her company's clients, the health insurance policy holders,
appear to group together on the basis of their weights, genders and cholesterol levels.
- We should remember as we do this that means are particularly susceptible to undue
influence by extreme outliers, so watching for inconsistent data when using the k-Means
clustering data mining methodology is very important.

DATA PREPARATION

1. Import Chapter06DataSet.csv into your RapidMiner repository.
2. Check the data for outliers; any outliers would be clearly visible in the graph output.
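The clustering step itself can be sketched with scikit-learn's KMeans; the weight and cholesterol values below are invented to show two clearly separated groups:

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented [weight, cholesterol] observations: two well-separated groups.
X = np.array([
    [140, 180], [150, 190], [145, 185],
    [230, 260], [240, 270], [235, 265],
])

# k-Means assigns each observation to the nearest of k cluster centroids.
# Because the centroids are means, extreme outliers can pull them
# off course, which is why inconsistent data matters here.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = model.labels_
```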
Decision Trees

- Richard wants to be able to predict the timing of buying behaviors, but he also wants to
understand how his customers' behaviors on his company's web site indicate the timing of their
purchase of the new eReader.

Data Understanding
1. Richard has engaged us to help him with his project. We have decided to use a decision tree
model in order to find good early predictors of buying behavior.
2. Because Richard’s company does all of its business through its web site, there is a rich data set of
information for each customer, including items they have just browsed for, and those they have
actually purchased.
3. He has prepared two data sets for us to use. The training data set contains the web site activities
of customers who bought the company’s previous generation reader, and the timing with which
they bought their reader.
4. The second comprises attributes of current customers who Richard hopes will buy the new
eReader. He hopes to figure out which category of adopter each person in the scoring data set
will fall into, based on the profiles and buying timing of those people in the training data set.
5. In analyzing his data set, Richard has found that customers’ activity in the areas of digital media
and books, and their general activity with electronics for sale on his company’s site, seem to have
a lot in common with when a person buys an eReader. With this in mind, we have worked with
Richard to compile data sets comprised of the following attributes:
- User_ID: A numeric, unique identifier assigned to each person who has an account on the
company's web site.
- Gender: The customer's gender, as identified in their customer account. In this data set,
it is recorded as 'M' for Male and 'F' for Female. The Decision Tree operator can handle
non-numeric data types.
- Age: The person's age at the time the data were extracted from the web site's database.
This is calculated to the nearest year by taking the difference between the system date
and the person's birthdate as recorded in their account.
- Marital_Status: The person's marital status as recorded in their account. People who
indicated on their account that they are married are entered in the data set as 'M'. Since
the web site does not distinguish single types of people, those who are divorced or
widowed are included with those who have never been married (indicated in the data set
as 'S').
- Website_Activity: This attribute is an indication of how active each customer is on the
company's web site. Working with Richard, we used the web site database's information,
which records the duration of each customer's visits, to calculate how frequently, and for
how long each time, the customers use the web site. This is then translated into one of
three categories: Seldom, Regular, or Frequent.
- Browsed_Electronics_12Mo: This is simply a Yes/No column indicating whether or not
the person browsed for electronic products on the company's web site in the past year.
- Bought_Electronics_12Mo: Another Yes/No column indicating whether or not they
purchased an electronic item through Richard's company's web site in the past year.
- Bought_Digital_Media_18Mo: This attribute is a Yes/No field indicating whether or not
the person has purchased some form of digital media (such as MP3 music) in the past
year and a half. This attribute does not include digital book purchases.
- Bought_Digital_Books: Richard believes that as an indicator of buying behavior relative
to the company's new eReader, this attribute will likely be the best indicator. Thus, this
attribute has been set apart from the purchase of other types of digital media. Further,
this attribute indicates whether or not the customer has ever bought a digital book, not
just in the past year or so.
- Payment_Method: This attribute indicates how the person pays for their purchases. In
cases where the person has paid in more than one way, the mode, or most frequent
method of payment, is used. There are four options:
  - Bank Transfer: payment via e-check or other form of wire transfer directly from the bank
to the company.
  - Website Account: the customer has set up a credit card or permanent electronic funds
transfer on their account so that purchases are directly charged through their account at
the time of purchase.
  - Credit Card: the person enters a credit card number and authorization each time they
purchase something through the site.
  - Monthly Billing: the person makes purchases periodically and receives a paper or
electronic bill which they pay later, either by mailing a check or through the company web
site's payment system.
- eReader_Adoption: This attribute exists only in the training data set. It consists of data
for customers who purchased the previous-gen eReader. Those who purchased within a
week of the product's release are recorded in this attribute as 'Innovator'. Those who
purchased after the first week but within the second or third weeks are entered as 'Early
Adopter'. Those who purchased after three weeks but within the first two months are
'Early Majority'. Those who purchased after the first two months are 'Late Majority'. This
attribute will serve as our label when we apply our training data to our scoring data.

DATA PREPARATION
1. Import both data sets into your RapidMiner repository. You do not need to worry about
attribute data types because the Decision Tree operator can handle all types of data. Be sure
that you do designate the first row of each of the data sets as the attribute names as you
import. Save them in the repository with descriptive names, so that you will be able to tell
what they are.
2. Import Chapter10DataSet_Training.csv and Chapter10DataSet_Scoring.csv. Drag and drop
both of the data sets into a new main process window. Rename the Retrieve objects as
Training and Scoring respectively. Run your model to examine the data and familiarize yourself
with the attributes.
3. Switch back to design perspective. While there are no missing or apparently inconsistent
values in the data set, there is still some data preparation yet to do.
4. First of all, the User_ID is an arbitrarily assigned value for each customer. The customer
doesn't use this value for anything; it is simply a way to uniquely identify each customer in the
data set.
5. It is not something that relates to each person in any way that would correlate to, or be
predictive of, their buying and technology adoption tendencies. As such, it should not be
included in the model as an independent variable.
6. We can handle this attribute in one of two ways. First, we can remove the attribute using a
Select Attributes operator, as was demonstrated previously. Alternatively, we can try a new
way of handling a non-predictive attribute.
7. This is accomplished using the Set Role operator. Using the search field in the Operators tab,
find and add Set Role operators to both your training and scoring streams.
8. In the Parameters area on the right hand side of the screen, set the role of the User_ID
attribute to ‘id’. This will leave the attribute in the data set throughout the model, but it won’t
consider the attribute as a predictor for the label attribute. Be sure to do this for both the
training and scoring data sets, since the User_ID attribute is found in both of them.
9. One of the nice side-effects of setting an attribute’s role to ‘id’ rather than removing it using
a Select Attributes operator is that it makes each record easier to match back to individual
people later, when viewing predictions in results perspective.
10. Before adding a Decision Tree operator, we still need to do another data preparation step.
The Decision Tree operator, as with other predictive model operators we’ve used to this point
in the text, expects the training stream to supply a ‘label’ attribute.

11. Next, search in the Operators tab for ‘Decision Tree’. Select the basic Decision Tree operator
and add it to your training stream.
12. Run the model and switch to the Tree (Decision Tree) tab in results perspective. You will see
our preliminary tree.
13. Return to design perspective. In the Operators tab search for and add an Apply Model
operator, bringing your training and scoring streams together. Ensure that both the lab and
mod ports are connected to res ports in order to generate our desired outputs.
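A hedged scikit-learn sketch of the same flow: train a decision tree on a labeled training set, keep User_ID out of the predictors (like setting its role to 'id'), and apply the model to scoring data. All of the data values here are invented; only the attribute names follow the example.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Invented training data. User_ID stays in the frame as an identifier
# but is excluded from the predictors.
train = pd.DataFrame({
    "User_ID": [1, 2, 3, 4],
    "Age": [25, 40, 33, 52],
    "Bought_Digital_Books": [1, 0, 1, 0],
    "eReader_Adoption": ["Innovator", "Late Majority",
                         "Early Adopter", "Late Majority"],
})
scoring = pd.DataFrame({
    "User_ID": [5],
    "Age": [30],
    "Bought_Digital_Books": [1],
})

features = ["Age", "Bought_Digital_Books"]  # User_ID is not predictive
tree = DecisionTreeClassifier(random_state=0)
tree.fit(train[features], train["eReader_Adoption"])

# Apply the model; User_ID lets us match each prediction back to an
# individual person afterwards.
predictions = tree.predict(scoring[features])
```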
