SWENG545 Final Term Paper

Michael T. Elliott
Data Analytics MPS Candidate
Pennsylvania State University

Abstract—The changing trends, as well as the new generations of students, have brought to light the need for the BFA program at the University of Diploma Printing to be revisited and restructured. The chief of the committee has presented two different datasets to aid in the analysis of the restructuring of the curriculum. The main objectives of this restructuring are to eliminate obsolete electives and keep those that are still relevant, as well as to determine specializations of three or four courses within the electives. These specializations will allow students to add concentration areas to their study. By using the KNIME Analytics Platform, we are able to easily apply a variety of data mining techniques to aid in this analysis.

Keywords—Data Analysis, KNIME Analytics Platform, Clustering Analysis, Association Rule Mining, Data Mining

I. BUSINESS UNDERSTANDING

A. Understanding Objective

In order to keep up with current styles and trends, the Department of Arts at the University of Diploma Printing is looking to revamp its BFA curriculum. The University of Diploma Printing is located in Romanigstan and hosts over 240 undergraduate majors within its BA and BFA programs. The professors within the arts department include world-renowned artists from many different artistic disciplines. The dean of the college of arts has approached the faculty about the issues that have been plaguing the department.

Over the past five years, enrollment has plateaued in the BFA program, disciplinary approaches have shifted, and key demographics within the university are changing. Despite the setbacks within the BFA program, the rest of the College of Arts and Sciences continues to be one of the fastest growing colleges within the university. These issues have led to the need for the BFA curriculum to be reviewed and restructured, in the hope of producing enrollment trends that match the rest of the college.

B. Assess the Situation

The current structure accounts for seven core courses and six electives. To complement the reduction in credit hours, the analysis will also focus on determining key concentrations within the offered electives, as well as retiring the electives that are not popular, to help lead the way for new course offerings. To aid in these requests, this analysis will focus on two main topics. It will first focus on optimizing the elective courses that are offered within the BFA program. In particular, the analysis will focus on determining the most popular electives to keep in the program. It will also look to group electives into key concentration areas so that students can choose special concentrations within the program.

The chief of the committee has provided a list of the currently offered courses as well as a historic record of student enrollment for the past five years to aid in this analysis.

C. Data Mining Goals

To accomplish these tasks, we will need to analyze the trends within the elective courses in order to determine which courses are most popular and which courses are obsolete. From here, we will want to determine a way to effectively show which courses should be grouped together. By grouping the elective courses together, we can determine specialties within the major that students can obtain.

D. Project Plan

Our first step will be to understand our data and figure out what it has to offer. From here, the two data sets will need to be joined together so that we will be able to compare student enrollment history to our current list of courses.

Once the data has been processed and cleaned, we will need to revisit the elective course offerings and determine which courses are obsolete. To do this, we will want to show a time series chart of the elective courses that were taken over time. By doing this we will be able to effectively determine the most relevant courses and which courses should be removed from the program.

Once we have a list of relevant electives that we want to keep in our course roster, we can begin to look for specialties. To accomplish this, we will try three different techniques: association rule mining, the DBSCAN clustering algorithm, and the hierarchical clustering algorithm. These three algorithms will be examined to determine which successfully produces a strong list of specialties to complement the major.

II. DATA UNDERSTANDING

A. Data Collection

To begin this analysis, we will first look at the two datasets that were given to us by the chief of the committee. These datasets contain raw data that offers insight for our analysis.
1) Current Courses Offered

The first dataset provides us with a list of 42 currently offered courses. This dataset appears to be more of a reference table, which we will cross reference against the student enrollment history.

In addition to 42 rows of data, this dataset also contains three columns: Course Name, Course Number, and Course Type. The course name represents the full name of the class. The course number represents the id of the course, such as SWENG 545. This number typically indicates the difficulty of the course; in most cases, the higher the number, the more difficult the course. The last column that we have is the course type. This column is a categorical variable that represents whether the course is a fundamental class, an elective, or a core course.

2) Student Enrollment Last Five Years

The second dataset provides us with 4900 rows of past student enrollment, as well as three columns: student name, semester new, and course name. This table provides us with all of the different students that have enrolled in the last five years, as well as the total number of classes they enrolled in.

From the course name column, we can cross reference these courses against the course name column from our other table. This will prove useful when determining which courses are electives, and ultimately whether the courses are even currently offered.

B. Data Description and Exploration

1) Current Courses Offered

The course list shows us that we are currently offering 42 courses. Of these courses, we have two courses that are fundamental courses, appropriately labeled 'F' in the course type column. In addition to the fundamental courses, we have seven core courses labeled 'C', and 33 courses that are electives, labeled 'E'.

2) Student Enrollment Last Five Years

As we have already discovered, in our second dataset we are given three main columns. Our first column, student name, shows us the students that have enrolled in courses over the last five years. Our data shows that within the last five years, there have been 448 students who enrolled in courses. Our second column shows us that these students have taken these courses over the span of 16 different semesters. It appears that we are missing data for Summer 2005, since our records jump from Spring 2005 to Fall 2005. Lastly, our third column lists the 169 courses that have been taken over this span. This dataset will later be joined together with our current course offerings. As we can see, there have been 169 different courses enrolled in, which differs greatly from the 42 courses that are currently offered in the program. Many of these rows of data will be eliminated when we combine our datasets.

C. Data Quality

These datasets provide large amounts of meaningful data; however, the data is full of misspellings, duplicates, and missing values.

1) Current Courses Offered

Looking at the data table for our first dataset, we can apply a sort function to look through the course names alphabetically. This technique is a great way to catch misspellings and duplicates. In this case, the data shows that we do in fact have two courses that are the same: ARTS 571 France and the Europ. Union and ARTS 577 France and the Europ. Union.

Along with this duplicate course name, the dataset also exhibits a large variance of case discrepancies in the course name column. Some names are listed in all uppercase, some in all lowercase, and some in a mix of both. These inconsistencies will need to be cleaned up so that we can maintain the integrity of our dataset and ensure that we are providing the most accurate analysis we can.

2) Student Enrollment Last Five Years

Due to the size of this dataset, we will need to filter the columns of this data separately and apply a duplicate row filter to each column filter so that we can look at the unique values of this dataset. As with the last dataset, we will sort each column so that the unique values are listed alphabetically. This will make it easy to find duplicates and misspellings.

When we look at the student names column, we can see that the names appear to be all different, except for two pairs of names: "Julie Christie 1995" and "Julie Christie 1999", and "Ken Follett 2001" and "Ken Follett 2004". These names will need to be combined and have the year removed. There is also one pair of names that sticks out: "Olympia Snow, Senator" and "Olympia Snow". When we look these names up in the original data table, we can see that these two individuals are taking different classes at the same time. They have also taken the same classes in differing semesters, which means that we will treat these two individuals as separate students. Also, while looking at Olympia Snow, Senator, we noticed that her courses are duplicated. Therefore, duplicates will also need to be removed.

Next, we will look at the semester new column. This column appears to display 16 different semesters, all of which are spelled correctly. There will be no need to clean this column, as it is already in good shape.

Lastly, we will analyze the course name column. Through the duplicate row filter node, we can see that when we sort the classes, we can easily pick out all of the misspelling discrepancies, the variances in case, and even a missing value. An example of a misspelling is "AMERICAN HEALT POLICY" vs. "AMERICAN HEALTH POLICY". There are a large number of courses that have spelling discrepancies and therefore will need their course names adjusted to match each other so that we can keep the integrity of the data in check.
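Outside of KNIME, the same sort-and-inspect checks can be sketched in a few lines of Python. The snippet below mirrors the approach described in this section; the file and column names are assumptions rather than part of the provided datasets' specification.

```python
# A sketch of the data-quality checks above, done in pandas instead of
# KNIME. File and column names are assumptions.
import pandas as pd

courses = pd.read_csv("current_courses.csv")      # Course Name / Number / Type
enroll = pd.read_csv("enrollment_history.csv")    # Student / Semester / Course

# Sorted unique values put case variants and misspellings side by side.
for col in ["Student Name", "Semester New", "Course Name"]:
    unique_vals = sorted(enroll[col].dropna().unique(), key=str.upper)
    print(f"{col}: {len(unique_vals)} unique values")

# Case-insensitive duplicates in the course reference table; this is how
# ARTS 571 and ARTS 577 sharing one course name would surface.
dupes = courses[courses["Course Name"].str.upper().duplicated(keep=False)]
print(dupes.sort_values("Course Name"))
```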
III. PREPARATION

A. Data Selection

The data contained in each of these datasets will prove useful to this analysis once cleaned. For the course listing dataset, we will want to look only at the data for elective courses. For the purpose of this analysis, we are looking at reducing the number of electives as well as defining specialties the electives can be grouped into. To accomplish this, we will need to make sure that we are only looking at courses of type elective, E.

The data from our second dataset, which provides historical student enrollment for the last five years, will also need to be filtered to show only the elective courses that are currently offered through the university. We will need to take into account each of the students, what semester they enrolled in their courses, and also the course they enrolled in.

B. Clean Data

One of the most important steps in this analysis is the data cleaning. To clean the data, we will need to look at not only removing the spelling discrepancies, but also removing duplicate rows and missing data, and unifying the case. To accomplish this, we will first start with the cell replacer node to help with spelling discrepancies.

1) Remove Duplicate Rows from Course Offerings

To begin the cleaning journey, we will first look at the course discrepancy in the current course offerings dataset. As mentioned earlier, the dataset shows the course "France and the Europ. Union" entered for both ARTS 571 and ARTS 577. The nominal row filter node was used to simply filter out the course number ARTS 577. The decision to remove ARTS 577 instead of ARTS 571 makes no difference in this analysis; either would provide the same results. Doing this removes the duplicate course name, which will be important when joining the datasets together.

2) Spelling Discrepancies

The cell replacer node works with two inputs and one output. This node allows us to bring our student enrollment history dataset into one input, and a lookup csv file into the other. The second csv file was created with a lookup column and a replace column. If a course name needs to be changed, it goes into the lookup column, and what it needs to be changed to is entered into the replace column. To determine what needs replacing, we can cross reference the duplicate row filter, which we used earlier to isolate all the unique course names taken by students, against the course names shown in the current course offerings dataset.

After cross referencing the current course offerings dataset, course names that needed to be changed were added to the lookup column. This resulted in 53 course names that needed to be changed. Our total number of course names in the student enrollment history dataset dropped from 169 courses to 140 courses.

The same operation will have to be used for student names as well, to have "Julie Christie 1995" and "Julie Christie 1999" changed to just "Julie Christie". We will also have to make the changes for "Ken Follett 2001" and "Ken Follett 2004" so that both become just "Ken Follett". By doing this we can ensure that the student names are all different. This will drop the total number of students down from 448 to 446.

3) Changing Everything to Uppercase

In order to make sure all the cases are the same, and that courses are not duplicated based on case discrepancies, we will need to alter the cases of both datasets. To do this we will use the case converter node, which easily allows us to select the columns we want altered and whether we want them in uppercase or lowercase. For the purpose of this analysis, we will change the case on both datasets to uppercase. This will help when joining both datasets together.

4) Duplicate Rows

The next step is to remove the duplicates. This was accomplished by using the duplicate row filter. After implementing this filter, our total number of rows in the student enrollment history dataset dropped from 4900 rows of data to 3651 rows of data.

5) Missing Values

The next step of cleaning our data will be removing the missing value from the student enrollment history dataset. There is only one missing value, and it is a course name. To eliminate this, we will use the missing value node. With this node we can instruct the node to remove any row that has a missing value. By performing this, we have now removed the one row with a missing value and dropped our row total to 3650 rows.

6) Lengthen Abbreviations

The final step of cleaning will involve looking at the names of the courses and replacing abbreviations with their full names. For example, we have a course offered called "Augustan Cultrl Revol". We will use a cell replacer node as before to change this name to Augustan Cultural Revolution. By making this change, we will be able to better analyze this in our model generation.
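As an illustration, the cleaning steps above (cell replacer, case converter, duplicate row filter, and missing value node) can be sketched together with pandas. The lookup file mirrors the two-column lookup/replace csv described earlier; the file names, and the assumption that the lookup csv has columns named lookup and replace, are hypothetical.

```python
# A sketch of the cleaning pipeline above using pandas instead of the
# KNIME nodes. File names and lookup-column names are assumptions.
import pandas as pd

enroll = pd.read_csv("enrollment_history.csv")

# 3) Case converter equivalent: unify case before matching anything.
for col in ["Student Name", "Course Name"]:
    enroll[col] = enroll[col].str.upper().str.strip()

# 2) + 6) Cell-replacer equivalent: map misspelled / abbreviated course
# names to their canonical spelling via the lookup table.
lookup = pd.read_csv("course_name_lookup.csv")    # columns: lookup, replace
mapping = dict(zip(lookup["lookup"].str.upper(), lookup["replace"].str.upper()))
enroll["Course Name"] = enroll["Course Name"].replace(mapping)

# Student-name fixes, collapsing the year-suffixed duplicates (448 -> 446).
name_fixes = {"JULIE CHRISTIE 1995": "JULIE CHRISTIE",
              "JULIE CHRISTIE 1999": "JULIE CHRISTIE",
              "KEN FOLLETT 2001": "KEN FOLLETT",
              "KEN FOLLETT 2004": "KEN FOLLETT"}
enroll["Student Name"] = enroll["Student Name"].replace(name_fixes)

# 4) Duplicate row filter: 4900 -> 3651 rows in the text.
enroll = enroll.drop_duplicates()

# 5) Missing value node: drop the one row with no course name (-> 3650).
enroll = enroll.dropna(subset=["Course Name"])
print(len(enroll), "rows after cleaning")
```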
C. Integrate Data

To make this data useful, we will need to join the two datasets together. This is easily done by using the joiner node. For this node we will want to select the course name columns from both of the datasets, as well as the inner join option. Now that our data has been cleaned and all of the spellings align, we will be able to easily perform an inner join, which joins on like values and removes values that are not alike. This will allow us to see the student enrollment history of only the courses that are currently offered. Our new dataset now contains 2421 rows of data.
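The joiner node's behavior corresponds directly to an inner join in pandas. A minimal sketch, assuming cleaned copies of both tables were written out (the file names are assumptions):

```python
# Inner join on course name keeps only enrollments in currently offered
# courses; in the text, the 3650 cleaned history rows reduce to 2421.
import pandas as pd

courses = pd.read_csv("current_courses_clean.csv")
enroll = pd.read_csv("enrollment_history_clean.csv")

joined = enroll.merge(courses, on="Course Name", how="inner")
print(len(joined))
```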
D. Describe Final Dataset

Our new dataset is made up of 2421 rows of data and contains five columns: student name, semester new, course number, course name, and course type. When we add in our filter for elective courses, we see the total row number drop from 2421 to 746. This means that 292 students have taken 746 elective courses over the past five years. Below is a histogram showing the distribution of enrolled electives per student.

Fig. 1. Histogram of enrolled electives per student.

Fig. 1 shows us the distribution of the number of electives taken per student. The mean number of electives taken per student is 2.55, given the current course listing, with most students taking one or two elective courses.

From here we can look at what courses the students enrolled in to see if there are courses that are more popular than others.

Fig. 2. Bar chart of elective course enrollment.

In Fig. 2, we can see that enrollment is on average below 50 students, with three classes standing out: American Health Policy, Cell Biology and Biochemistry, and Elementary Arabic 2. The mean elective course enrollment over the last five years has been 26.6 students, with the max being 134.

This graph goes to show that some courses are more relevant than others. Below, in Fig. 3, we can see the distribution of student elective enrollment per semester.

Fig. 3. Distribution of student enrollment per semester.

In Fig. 3, we can see how many students enrolled in electives over the last 16 semesters. On average, 46.6 students enrolled in elective courses each semester, with the min falling at 13 students and the max at 95.

E. Format Data

Now that the data is combined and correctly cleaned, we can continue our preprocessing by adding a count column. To create this column, we will use the math formula node. Under the expression section of this node, we will simply enter the number one. This will translate to a new column that we will call dummy. This new column will contain a one in every row and will come in handy later when we utilize certain algorithms that involve the creation of dummy variables and data pivoting.

The last preprocessing step will be changing the format of our semester new column to reflect an actual point in time. This will allow us to view course enrollment over time.

To do this, we will utilize another cell replacer node. The adjoining csv file will take the semester new description and replace it with a rough estimated date. For spring semesters, we will change spring 2000 to '2000-01-01'; for fall semesters, we will change fall 2000 to '2000-08-01'; and for summers, we will change summer 2000 to '2000-05-01'. Next, we will need to use the String to Date/Time node to properly convert this new string to a date & time. Within this node we can select for the format to be just a date, which we will do. By doing this we can successfully look at what classes students have taken over time. After properly filtering out only the elective courses, we can see in Fig. 4 what electives students have enrolled in over time.
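A rough equivalent of these two formatting steps can be sketched in pandas; the exact raw semester label format (e.g. "SPRING 2000") is an assumption.

```python
# The math formula node's constant "dummy" column, plus mapping semester
# labels to the rough dates used in the text (spring 01-01, summer 05-01,
# fall 08-01). File and column names are assumptions.
import pandas as pd

electives = pd.read_csv("elective_enrollment.csv")
electives["dummy"] = 1   # constant 1, used later for pivoting / counting

month_day = {"SPRING": "01-01", "SUMMER": "05-01", "FALL": "08-01"}

def semester_to_date(label: str) -> pd.Timestamp:
    season, year = label.strip().upper().split()
    return pd.to_datetime(f"{year}-{month_day[season]}")

electives["Semester Date"] = electives["Semester New"].map(semester_to_date)
```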
Fig. 4. Elective courses taken over time.

Now that our data is properly cleaned and preprocessed, we can begin deriving models for analysis.

IV. MODEL GENERATION

Before we begin, we should remind ourselves of what we are hoping to accomplish with model generation. For this, we have two goals in mind:

• Determine which electives we want to keep, and which electives we want to remove.
• Determine specialties within the electives by grouping similar courses together.

A. Selecting Modeling Techniques

1) Data Visualization

It is important to remember that there are many useful data mining techniques. One of the most important techniques is visualizing the data. By visualizing the data, we can make inferences on certain characteristics and trends that we may have missed while looking at the raw data in the dataset.

To tackle the first objective, we can simply look at Fig. 4. Figure 4 shows us exactly which electives students have enrolled in over time. By looking at this chart, we can determine whether courses are still relevant to students.

To answer the question of which courses we should keep, and which courses we should obsolete, we will need to make an assumption. For this analysis we can assume that elective courses that have not been enrolled in since the spring 2003 semester can be obsoleted. This is a fair and generous assumption, which allows us to properly complete this objective. Below is a list of courses that do not meet our criteria and will thus be removed from our current elective offerings:

• ARTS 533: French Thought since 1945
• ARTS 573: Analyzing the Political World
• ARTS 571: France & The European Union
• ARTS 551: Aesthetics
• ARTS 585: Contemporary Socio Thought
• ARTS 575: Early Mesopotamian History / Society
• ARTS 565: Environmental Studies Research Seminar Junior Level

Eliminating these seven courses will bring our total offered electives down to 22. The elective courses that we are going to keep are shown below:

• ARTS 549: Freshwater Ecology
• ARTS 583: British Poetry 1660-1914
• ARTS 543: Arts and Religion
• ARTS 569: Cell Biology & Biochemistry
• ARTS 545: 20th Century Russian Literature: Fiction and Reality
• ARTS 555: Becoming Human
• ARTS 557: 19th-Century British Literature
• ARTS 494: Business German: A Micro Perspective
• ARTS 485: Evidenced Based Crime and Justice Policy
• ARTS 492: African-American Literature
• ARTS 587: Elementary Arabic 2
• ARTS 493: American Health Policy
• ARTS 581: Comparative Politics
• ARTS 484: Europe in a Wider World
• ARTS 488: Devil's Pact Literature / Film
• ARTS 547: Communications Internship
• ARTS 559: American South 1861-present
• ARTS 497: Contemporary Art – 1945 to present
• ARTS 491: Contemporary Political Thought
• ARTS 561: Augustan Cultural Revolution
• ARTS 495: Communications and the Presidency
• ARTS 567: Elementary German 1

Now that we have completed our first objective, we can move on to our next. Having obsoleted elective courses will shorten our list of data so that we have fewer electives to group into specializations.

2) Association Rule Mining

The first model to try out will be association rule mining. Association rule mining is often used in retail. It helps to find frequent itemsets based on what customers frequently buy together. Association rule mining then breaks these frequent itemsets into association rules, which are based on the conditional logic that if a customer buys item A, then they will also buy items B, C, and D. Association rules rely heavily on statistics to determine whether such rules are actionable or not.

By using association rule mining, we can look at the courses that students are taking and determine whether or not we can group these courses into frequent itemsets and potentially association rules. By grouping these courses into frequent itemsets, we can hopefully draw conclusions for potential elective specializations based on what courses students tend to take together.

3) DBSCAN Clustering Algorithm

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. This algorithm creates clusters based around two parameters: EPS and MinPts. EPS determines the radius of the clusters, while MinPts determines the minimum number of points required in each cluster. This clustering algorithm relies heavily on these parameters and is thus sensitive to these numbers.
DBSCAN can be utilized in one of two ways:

• We can use it to analyze the data coming from what students have enrolled in.
• We can use it to analyze the string data from the names of the courses.

For this analysis, we will use the second option and try to group clusters based around similar course names. The goal is that this will provide insight into forming clusters around courses with similar names.

This algorithm will be the opposite of the association rule mining algorithm. Instead of basing our clusters on the frequent courses that students take together, we will base them on course names. The hope here is to form specialties around courses that belong together, not just the courses that students enjoy taking.

4) Hierarchical Clustering Algorithm

The hierarchical clustering algorithm creates clusters by using a dendrogram. This algorithm uses a distance type to group clusters based on their distances from one another, and it can use four different linkage methods: Minimum, Average, Complete, and Ward. Based on the points within each cluster, minimum uses the minimum distance between points in two clusters, complete uses the maximum distance between clusters based on the two points furthest from each other, average uses the average distance between the points in two clusters, while Ward uses the change in cluster variance as a way to measure the distance between two clusters.

This algorithm works well with categorical data and can be used with strings as well. This clustering will be based on the course names, much like the last. We will look at the differences in the strings to create a hierarchical clustering, with the hope of producing significant results for elective specializations.

5) Honorable Mentions

The above listed algorithms are believed to be the best techniques; however, we will also include a k-means clustering model as well as a k-medoids clustering model. Both of these models utilize a set number of clusters, or partitions, to generate a clustering model. The model is then iterated until the distances between the points and their centroids / medoids are minimized and cannot be reduced any further. K-means and K-medoids differ in two main areas: centroids versus medoids, and distance formulas.

K-means utilizes centroids, which represent the centers of the clusters. A centroid may or may not be an actual data point; it is simply the center of the cluster. K-medoids, however, uses a medoid, which is an actual point in the dataset — the most centrally located point of its cluster.

The distances that are used in these two algorithms are also specific to the algorithm. The K-means algorithm utilizes Euclidean distance and works best with continuous data. K-medoids, however, can utilize any of the distance formulas and works with continuous as well as categorical data.
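A toy numeric example makes the centroid / medoid distinction concrete; the following sketch computes both for four points.

```python
# The centroid is a mean and need not be a real data point; the medoid is
# the dataset point with the smallest total distance to all other points.
import numpy as np

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])

centroid = points.mean(axis=0)                 # [1.5, 1.25...] -- not in the data
dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
medoid = points[dist.sum(axis=1).argmin()]     # an actual row of `points`
print(centroid, medoid)
```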
B. Build Model

Now that we have introduced the different models and explained why they have been selected, we will begin to generate the models. Recall Fig. 4. This figure was a data visualization that helped to answer our first objective. The data visualization alone was enough evidence to complete the objective; therefore, there is no model to build for it. Instead, the first model we will look at will be the association rule mining model.

1) Association Rule Mining Model

To create this model, we will first need to format the data in a way that creates a column for student name, as well as a list of courses (items) taken per student, with a final column representing the count of courses in that list. The first step for this is to create a dummy column, which we already accomplished in preprocessing. Next, we will use a GroupBy node to group the data by student. The outcome of this will be a dataset that contains each student's name and a column that represents the list of courses taken. To accomplish this, we will group the data by student name. After that, we will use the manual aggregation settings to aggregate our data based on Course Name and the dummy variable column: we will select Course Name and aggregate it by List (sorted), and then select the dummy column and aggregate it by Count. This will produce a column for student name, a column that is the list of electives the student has enrolled in, and finally a count of how many courses are in that list. From here, we will connect this node to both the Association Rule Learner (Borgelt) node and the Item Set Finder (Borgelt) node.

The Association Rule Learner (Borgelt) node will provide us with our association rule mining algorithm. Once connected, we will select the sorted list as our item column. This will be the list of items that the algorithm uses to generate association rules. From here we will set our minimum set size to 3 and our minimum set support to 5.0, since we have a large data set. We will also set our minimum rule confidence to 20.0 and check the sorted antecedent list checkbox. From here we are able to run our algorithm successfully and see the association rules that we have produced.

Our other node can now be utilized as well: the Item Set Finder (Borgelt). Once this node is connected to the GroupBy node, we can begin to configure it. In our configuration, we will select list (sorted) as our item column. Next we are given the choice of algorithm: Apriori, FP Growth, RElim, SaM, JIM, DICE, and TANIMOTO. For this we will use the Apriori algorithm. Our target type will be frequent, our minimum set size will be 3, and our minimum support will be 1.0. We will select to run the minimum support as a percentage and not absolute. From here we can run the algorithm and view the results.
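For readers without KNIME, the same pipeline can be approximated in Python. The sketch below uses pandas for the GroupBy formatting and mlxtend's apriori / association_rules as a stand-in for the Borgelt nodes; mlxtend, the file name, and the exact threshold scaling are assumptions rather than the configuration above.

```python
# A sketch of the basket formatting and mining above. mlxtend stands in
# for the KNIME Borgelt nodes; file/column names are assumptions.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

electives = pd.read_csv("elective_enrollment.csv")

# GroupBy-node equivalent: one row per student with a sorted course list
# and a count of the courses in that list.
baskets = (electives.groupby("Student Name")["Course Name"]
           .agg(courses=lambda s: sorted(s.unique()), count="nunique"))
print(baskets.head())

# One-hot student x course matrix, the input format apriori expects.
onehot = pd.crosstab(electives["Student Name"],
                     electives["Course Name"]).astype(bool)

# Item Set Finder equivalent: frequent itemsets at 1% minimum support
# ("percentage, not absolute"), keeping sets of size >= 3 for inspection.
freq = apriori(onehot, min_support=0.01, use_colnames=True)
print(freq[freq["itemsets"].apply(len) >= 3].sort_values("support"))

# Rule learner equivalent: rules at >= 20% confidence; lift > 1 marks the
# actionable rules discussed in the evaluation section.
rules = association_rules(freq, metric="confidence", min_threshold=0.2)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```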
2) DBSCAN Clustering Algorithm

For this algorithm, we will begin by formatting the data so that we have only the current list of offered electives. This list will be post filtration, excluding the obsoleted course names that we eliminated for our first objective. As stated above, this algorithm will use the course names to cluster and find similarities so that we can find specializations within the elective courses.

To format this data, we will start off by using a column filter node. Within this node, we will filter down to course number, course name, and course type. This will eliminate the student names from the dataset as well as the semester taken, as these will not be needed for this algorithm. Next, we will use a duplicate row filter node to eliminate the duplicate rows that resulted from removing the student name column. The result will be a list of the 22 currently offered electives. Now that we have the data in the correct format, we will create connections to both the string distances node and the DBSCAN node.

Once connected to the string distances node, we can begin configuration. To configure this node, we will need to select course name as the column selection. This will be the column that we use to determine distances. For the distance selection, we can use the Levenshtein distance, the N-gram Tversky distance (Tanimoto / Dice), or the Jaro–Winkler distance. We will select the Levenshtein distance for this analysis. With this distance formula we will enter one for the deletion weight, one for the insertion weight, and one for the exchange weight. We will also check both the normalize distances and uppercase input checkboxes. This will help us to produce better distance measurements as well as suit the data, since it is in all uppercase letters.

The string distances node can then be connected to the distance port of the DBSCAN node. This will work in conjunction with the connection from the duplicate row filter. Now that we have the correct inputs, we can begin configuration. To configure this, we will need to look at two values: Epsilon and Minimum Points. As discussed earlier, Epsilon is also known as EPS and represents the minimum radius of each cluster. Minimum Points represents the minimum number of points we want in each cluster. For Epsilon, we will enter 0.625, and for Minimum Points we will put in three. We will now be able to successfully run our model.
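A minimal Python sketch of this pipeline, assuming the filtered elective list is available as a csv: a weighted, normalized Levenshtein distance stands in for the string distances node, and scikit-learn's DBSCAN consumes the precomputed distance matrix with the eps and MinPts values above. The normalization detail (dividing by the larger single-string edit cost) is an assumption about the node's internals.

```python
# Weighted, normalized Levenshtein distances plus DBSCAN over the
# precomputed matrix. File name and normalization are assumptions.
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

DEL_W, INS_W, SUB_W = 1.0, 1.0, 1.0   # deletion / insertion / exchange weights

def levenshtein(a: str, b: str) -> float:
    """Weighted edit distance via the standard dynamic program."""
    d = np.zeros((len(a) + 1, len(b) + 1))
    d[:, 0] = np.arange(len(a) + 1) * DEL_W
    d[0, :] = np.arange(len(b) + 1) * INS_W
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            d[i, j] = min(d[i - 1, j] + DEL_W,        # delete from a
                          d[i, j - 1] + INS_W,        # insert into a
                          d[i - 1, j - 1] + (0.0 if ca == cb else SUB_W))
    return d[len(a), len(b)]

names = pd.read_csv("current_electives.csv")["Course Name"].str.upper().tolist()
n = len(names)                        # 22 electives after filtering

dist = np.zeros((n, n))               # pairwise normalized distances in [0, 1]
for i in range(n):
    for j in range(i + 1, n):
        norm = max(len(names[i]) * DEL_W, len(names[j]) * INS_W) or 1.0
        dist[i, j] = dist[j, i] = levenshtein(names[i], names[j]) / norm

labels = DBSCAN(eps=0.625, min_samples=3, metric="precomputed").fit_predict(dist)
for cluster, name in sorted(zip(labels, names)):
    print(cluster, name)              # -1 marks noise points
```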
3) Hierarchical Clustering Model

Our data will need to be formatted to match the format we used in the previous DBSCAN model. After removing the obsolete electives, we will filter the data through the column filter to eliminate the semester new and student name columns. After this we will run the data through the duplicate row filter. This will leave us with 22 elective course rows and three columns: course number, course name, and course type. Once this data has been formatted, we will connect the duplicate row filter node to the string distances node, to the input table port of the Hierarchical Cluster Assigner (local) node, and to the data to cluster port on the Hierarchical Clustering (DistMatrix) node.

The first node we will want to configure is the string distances node. To configure this node, we will want our column selection to be the course name. This will be the column that we use to determine distances. As before, we will use the Levenshtein distance, but for this algorithm we will configure the deletion weight to be one, the insertion weight to be two, and the exchange weight to be four. We will then select the normalize distances checkbox, as well as the uppercase input checkbox. From here we will connect the output of this node into the distance measure input port of the Hierarchical Clustering (DistMatrix) node.

To configure the Hierarchical Clustering (DistMatrix) node, we will need to select the linkage type as well as a checkbox for whether we would like to ignore missing values. For this model we will select the complete linkage type. This means that we will utilize the maximum distance between each cluster as our distance measure. We will also check the ignore missing values checkbox, which will not affect our model since there are no missing values. From here we will connect the output of this node into the cluster tree port of the Hierarchical Cluster Assigner (local) node.

To complete this model, we will need to configure the final node: the Hierarchical Cluster Assigner (local). This node gives us the option to create clusters based on a given number of clusters or a given distance threshold. For this we will select that we want to assign clusters based on cluster count. From here we will select that we want this algorithm to generate eight different clusters. Since we have 22 different courses that need to be clustered, eight clusters will provide a sufficient number of clusters to give us specializations. We can now successfully run this model.
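The same model can be sketched with SciPy, reusing the distance computation from the DBSCAN sketch but rebuilt with the re-weighted edit costs; complete linkage and an eight-cluster cut mirror the configuration above (the reuse of the earlier sketch's variables is an assumption for brevity, not a shared KNIME node).

```python
# Complete-linkage hierarchical clustering over the string distances,
# cut into eight clusters as configured above.
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

DEL_W, INS_W, SUB_W = 1.0, 2.0, 4.0   # weights used for this model

# dist: the 22 x 22 normalized distance matrix rebuilt with these weights
# (as in the DBSCAN sketch); names: the matching course names.
condensed = squareform(dist, checks=False)       # condensed form for linkage()
tree = linkage(condensed, method="complete")     # max inter-cluster distance
labels = fcluster(tree, t=8, criterion="maxclust")  # assign by cluster count

for cluster, name in sorted(zip(labels, names)):
    print(cluster, name)
```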
4) Honorable Mentions

The last two algorithms that we will look at are the K-means and K-medoids algorithms. Both algorithms need the data to be formatted the same way. We will want to make sure that we have implemented our dummy variable column, which we created in preprocessing. From here we will need to pivot our data to create dummy variables that act as a one or zero. To accomplish this, we will use the pivoting node.

To configure the pivoting node, we will need to connect the output of the math formula node into the input of the pivoting node. We will first select for the node to group the course name column. We will then select for the node to pivot the student name column and manually aggregate the dummy column by summing. This will give us a column of course names, and columns representing each student, with a one representing whether they enrolled in that class. From here we will connect to the missing values node.

The missing values node will help us to deal with missing values. When we used the pivoting node, the node used the dummy column to give a value of one to a student that enrolled in a class. However, if the student did not enroll in a class, then there is a "?" representing a missing value. To resolve this, we will use the missing values node to change any value that is a question mark to a 0. To do this we will configure the missing values node, under the string setting, to replace any missing value with a 0. We will then select, under the double setting, to replace any missing value with a 0.0 value. This will fix the data so that we can successfully generate the K-means and K-medoids models.

For the K-means model, we will connect the missing values node to the K-means node. To configure this node, we will first set the number of clusters to six. Next, we will set our centroid initialization parameters to random initialization with a random seed of one. Finally, we will include all columns that we can in the column selection. Next, we can run this model and send the results to a column filter node so that we only see the course name as well as the assigned cluster. This will help to eliminate all of the student name columns so we can easily view the results.

For the K-medoids model, we will need to connect the missing values node to both the K-medoids input table port and the numeric distances node. For this model we will first begin by configuring the numeric distances node. To do this we will select all of the columns for the column input. We will also set the distance to be the Euclidean distance, with the average distance used for missing values. The output of this node will then link to the distance port of the K-medoids node.

The K-medoids node will be configured by first setting the partition count to seven. We will then use a chunk size of 1,000 and constrain the number of iterations to 10,000. Lastly, we will use a static seed of one for this model. This will allow the experiment to be replicated.
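A sketch of the pivot-and-cluster path in Python: pandas' pivot_table produces the course-by-student 0/1 matrix (filling the missing values with zero in one step), and scikit-learn's KMeans stands in for the K-means node. Core scikit-learn has no K-medoids implementation, so that model is omitted here; a drop-in such as scikit-learn-extra's KMedoids would be one option, which is an assumption rather than something used in this analysis.

```python
# Pivoting node + missing values node + K-means node, sketched in Python.
# File and column names are assumptions.
import pandas as pd
from sklearn.cluster import KMeans

electives = pd.read_csv("elective_enrollment.csv")
electives["dummy"] = 1

# Group by course, pivot by student, sum the dummy column; fill_value=0
# plays the role of the missing values node ("?" -> 0).
matrix = electives.pivot_table(index="Course Name",
                               columns="Student Name",
                               values="dummy",
                               aggfunc="sum",
                               fill_value=0).clip(upper=1)

# Six clusters, random initialization, fixed seed -- mirroring the text.
km = KMeans(n_clusters=6, init="random", random_state=1, n_init=10)
clusters = pd.Series(km.fit_predict(matrix), index=matrix.index, name="cluster")
print(clusters.sort_values())          # course name -> assigned cluster
```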
V. MODEL EVALUATION

With all of our models successfully created, we can now begin to interpret the results and find a solution to our objectives.

A. Evaluate Results

1) Data Visualization – Objective One

For our first objective, we were to determine a solution for updating our current elective courses by removing courses that are no longer relevant and keeping those that are. To accomplish this, we created a time series chart, Fig. 4, that shows which classes have been enrolled in for each semester. By using this chart, we were able to determine which classes are relevant and which are not. To determine this, we made a conservative assumption that we will only keep on our elective roster courses that have been enrolled in since the spring 2003 semester. This means that if a class has not been enrolled in since spring of 2003, it will be obsoleted and removed from our current course offerings. By doing this we have successfully eliminated seven courses, which is roughly 25% of the elective courses. We could instead keep only courses that have been taken within the last year, but by doing so we would remove even more classes.

It is important to understand that the goal of this objective is to remove courses that are no longer relevant. However, to successfully remove these courses, we must also take into account the repercussions.

By removing a class, we are potentially taking away a class that a student will pay money to take. If we remove too many classes, then we narrow the selection of courses that students will want to take and potentially deter students from applying for this degree program. Currently for this major, we require six electives to be taken. This is roughly 20% of the currently offered electives. By removing seven courses, we drop our currently offered electives down to 22 courses. This will mean that to complete the major requirements, students will have to complete roughly 25% of the electives. If we continue to remove courses, then we begin to significantly limit the selection of electives that students are willing to take. By removing seven courses, we are cutting the electives by roughly 25%, while still giving students a diverse enough selection of electives that they will be happy with.

By using this data visualization, we were able to successfully complete our first objective.

2) Association Rule Mining – Objective Two

To analyze these results, we need to look at both the frequent itemset node results and the association rule learner node output. Both nodes provide results that we will need to evaluate.

The frequent itemset node produces the following results.

Fig. 5. Frequent Itemsets Output

Fig. 5 shows us the frequent itemsets, the itemset size, the itemset support, and the relative itemset support. From here we can see that these results produce large amounts of noise. Many of these itemsets contain two or more of the same courses. This makes it challenging to distinguish which may be strong candidates for specializations. We can also see from the support column that there is very little support for any of these itemsets. The itemset with the strongest support contains the courses American South 1861 – Present, 20th Century Russian Literature: Fiction and Reality, and 19th-Century British Literature. This itemset, however, has a relative itemset support of only 2.098%, which is too small to draw inferences from.
We can continue our analysis of association rule mining by looking at the association rule learner output node. We will want to take into account the confidence of each rule as well as its lift. If a rule has a lift over one, then we can consider the rule to be actionable. We can see below in Fig. 6 that we have successfully generated eight association rules with varying confidences and lifts.

Fig. 6. Association Rule Learner Output

Out of the eight association rules, we can see that they all contain the course American Health Policy. We can also see that only five of the eight are actionable rules with a lift greater than one. These five association rules have confidences of 26.3% and lower, making them hard to trust when setting specializations throughout the electives.

3) DBSCAN Clustering Algorithm – Objective Two

DBSCAN presents a different approach to selecting specializations within the electives offered. This algorithm takes the course names and clusters them based on their string distances. This algorithm in particular is very sensitive to its parameters.

The parameters we utilize in DBSCAN are Epsilon, which is the minimum radius of the clusters, and Minimum Points, which is the minimum number of points per cluster. This algorithm was run numerous times before finding the right value for Epsilon. It was clear that Minimum Points needed to be three, since we are looking for clusters of three or four courses to create specializations from. However, epsilon was very sensitive, down to the thousandths decimal place. If we used too high a value for epsilon, we obtained only one cluster. If our value of epsilon was too low, then all of our rows were considered noise. The results below were produced using an epsilon of 0.625 and a minimum points setting of three.

Fig. 7. DBSCAN Output

In the output shown in Fig. 7, we can see that the algorithm created three different clusters. Cluster 0 contained:

• 20th Century Russian Literature: Fiction and Reality
• 19th-Century British Literature
• African-American Literature

Cluster 1 had the following courses:

• Comparative Politics
• Contemporary Art – 1945 to Present
• Contemporary Political Thought

Our final cluster is Cluster 2:

• American Health Policy
• Communications Internship
• American South 1861 – Present
• Communications and the Presidency

The rest of the data is shown as noise. This algorithm was able to successfully deliver three different clusters that meet the criteria for specializations.

The quality of these clusters appears to vary. Cluster 0 performed well, grouping three of the four literature courses together, thus making it a literature specialization. Cluster 1 grouped two political courses with an art class; however, the art class does contain the word contemporary, which matches the contemporary in Contemporary Political Thought. This cluster has not performed as well as the first. Cluster 2 provides a variety of courses that are subject to question. Cluster 0 seems to have performed better than Cluster 1 and Cluster 2.

By providing three clusters, we are offering three different specializations within the current electives. This leaves other classes for students to take that don't fall under a specialization umbrella.

4) Hierarchical Clustering Algorithm – Objective Two

The hierarchical clustering algorithm shares commonalities with the DBSCAN algorithm in the sense that it uses string distances to create a clustering. However, the hierarchical clustering algorithm does so in a different manner.

This algorithm has multiple parameters that affect the performance of the model. We have to take into account multiple string distance types, as well as our linkage style and the number of clusters. We produce the best results when the Levenshtein distance is chosen, and when we configure the deletion weight to be one, the insertion weight to be two, and the exchange weight to be four. For the linkage style we use complete, with the number of clusters being eight. Tuning the model to these settings produces results that we can draw inferences from. Below, in Fig. 8, we can see that there are eight different clusters, with five of the clusters containing three or four courses.
Fig. 8. Hierarchical Clustering Algorithm Output

We can see that cluster two provides the following courses:

• Evidenced Based Crime and Justice Policy
• African-American Literature
• American Health Policy
• American South 1861 – Present

Next, we have cluster three, which provides three courses:

• 20th Century Russian Literature: Fiction and Reality
• 19th-Century British Literature
• Devil's Pact Literature / Film

In cluster four, we have three courses:

• Comparative Politics
• Communications Internship
• Communications and the Presidency

In cluster five, we have four classes:

• Business German: A Micro Perspective
• Elementary Arabic 2
• Europe in a Wider World
• Elementary German 1

The last cluster to give us enough courses for a specialty was cluster six, with the following three courses:

• Freshwater Ecology
• Arts and Religion
• Augustan Cultural Revolution

These five clusters have successfully grouped the courses into groups of three and four electives, creating specializations within the elective courses. These five clusters make up 17 of the 22 currently offered electives.

When we look at the performance of the model, we can see that it successfully grouped the courses into five clusters. Cluster two provides a grouping of courses that we could consider all relevant to an American arts specialization. Cluster three provides three courses that are all based around literature; this cluster we could call an arts and literature specialization. Cluster four provides two politics classes and an internship class, which acts as a wildcard. This cluster could be labeled as arts and politics. Cluster five provides four classes that are all based around language and international affairs. This cluster can be called international arts. Cluster six is made up of three different courses. This cluster has performed at a much lower level than the other four clusters.

Despite the downfall of cluster six, our model has still provided us with four other strong specializations that make up 14 out of the 22 currently offered electives.

5) Honorable Mentions – Objective Two

The last two models that we will work with are the K-means model and the K-medoids model. Both models performed comparably.

We will first look at the K-means clustering algorithm. We can see from the output below that this algorithm only provided us with one cluster of three or four classes.

Fig. 9. K-means Clustering algorithm output

We can see in Fig. 9 that only cluster five provided us with a usable cluster. The cluster contains the following courses:

• Cell Biology & Biochemistry
• Communications Internship
• Elementary Arabic 2

The cluster in Fig. 9 is based on the courses that students have enrolled in. However, these courses do not appear to have much in common, except for the communications internship acting as the wildcard which can count for anything. From this cluster we are not able to draw any insight, thus making this algorithm unsuccessful.
The K-medoids algorithm follows suit with another poor performance. As we can see in the following figure, there are no successful groupings of three or four courses.

Fig. 10. K-medoids clustering algorithm output

Despite tuning the different parameters of this algorithm, there were no combinations that created a successful and meaningful result. This model, like K-means, was unsuccessful.

B. Analysis Limitations and Difficulties

Throughout this analysis, there have been many barriers that were hard to navigate around. The three big areas were clustering the courses, cleaning the data, and determining correct approaches to the objectives.

1) Course Clustering

In clustering the data, it was very difficult to find meaningful clusters. The initial instinct was to set up association rules so that we could see if there were concentrations that could be inferred from the frequent itemsets. However, this proved to provide little information, as our frequent itemsets were too similar and did not group together well.

From here we took these itemsets to be used in association rule mining and tried to create actionable rules. However, of the five actionable association rules we produced, they all contained the course American Health Policy. This made it very difficult to use this as a successful model.

Next, the idea of using course names was explored to see if a model could generate meaningful concentrations based solely on the course name strings. To do this we utilized two different algorithms: DBSCAN and the hierarchical clustering algorithm. Both of these algorithms produced successful clusters after a time-consuming process of tuning. These algorithms performed well at grouping pairs of very similar courses; however, producing groups of three and four courses proved challenging.

2) Data Cleaning

Garbage in, garbage out is the best way to look at data cleaning. This analysis involved a lot of cleaning that occupied much of the time allotted to it. Of the cleaning, the most challenging part was interpreting student names and course names. Many of the student names were very unusual and highly suspect for misspellings and duplicates. One pair of names even looked like an attempted duplicate entry: the name in question was Olympia Snow, while the other name was Olympia Snow, Senator. When these individuals were further analyzed, it was discovered that they were enrolled in multiple courses at different times and that they were in fact two unique individuals.

Another challenging part was deciding how to format the data. In order to use string clustering, assumptions had to be made on some of the course names so that they were not abbreviated. This was done so that the clustering algorithms could identify those similarities. Using cell replacers to change data was also a choice that was hard to make. However, to draw insight from this analysis some formats needed to be changed, such as changing the semester new column from a string to a date value. This proved to be one of the more meaningful changes of the experiment, as it provided a strong solution to the first objective.

3) Correct Approaches

In this analysis, an argument can be made that there is no one correct solution. Much like any project, data is interpreted, a plan is set in motion, and then iterations are made based on the continuation of data collection. For the purpose of this experiment, the conclusions that are presented are strong and evidence based. However, strong arguments could be made using other algorithms, which may produce similar results.

The ambiguity of the results can make an analysis like this challenging, as there is technically no correct answer. Instead, there is just a scientific approach, and a correct way of conducting an experiment, analyzing the information, and interpreting the results.

C. Conclusions

1) Objective One
By using data visualizations, we were successful in eliminating seven courses from our currently offered electives, giving us a new total of 22 currently offered electives. Recall Fig. 4. We can see that the following seven courses have not been enrolled in over the last two years:

• ARTS 533: French Thought since 1945
• ARTS 573: Analyzing the Political World
• ARTS 571: France & The European Union
• ARTS 551: Aesthetics
• ARTS 585: Contemporary Socio Thought
• ARTS 575: Early Mesopotamian History / Society
• ARTS 565: Environmental Studies Research Seminar Junior Level

These courses have in fact not been enrolled in during 2003, 2004, or 2005. By using this time series chart, we can easily isolate these seven courses and determine that they should be removed from our current course roster, thus leaving us with the following 22 courses for students to enroll in:

• ARTS 549: Freshwater Ecology
• ARTS 583: British Poetry 1660-1914
• ARTS 543: Arts and Religion
• ARTS 569: Cell Biology & Biochemistry
• ARTS 545: 20th Century Russian Literature: Fiction and Reality
• ARTS 555: Becoming Human
• ARTS 557: 19th-Century British Literature
• ARTS 494: Business German: A Micro Perspective
• ARTS 485: Evidenced Based Crime and Justice Policy
• ARTS 492: African-American Literature
• ARTS 587: Elementary Arabic 2
• ARTS 493: American Health Policy
• ARTS 581: Comparative Politics
• ARTS 484: Europe in a Wider World
• ARTS 488: Devil's Pact Literature / Film
• ARTS 547: Communications Internship
• ARTS 559: American South 1861-present
• ARTS 497: Contemporary Art – 1945 to present
• ARTS 491: Contemporary Political Thought
• ARTS 561: Augustan Cultural Revolution
• ARTS 495: Communications and the Presidency
• ARTS 567: Elementary German 1

2) Objective Two

For this objective we had three main algorithm models: association rule mining, the DBSCAN clustering algorithm, and the hierarchical clustering algorithm. These three algorithms can be separated into two different methodologies.

The first of these methodologies is what association rule mining was based around. This algorithm utilized the student course enrollment history to provide us with courses that students frequently take together and to generate actionable association rules from this data. It wasn't until running this algorithm that the idea for the second methodology came to be.

The second method of thinking was what DBSCAN and the hierarchical clustering algorithm were based around. These algorithms focused on using the unique elective course names (post filtration of the electives obsoleted in objective one) to cluster similar courses based on string distance. When choosing which methodology is better, we need to look at the objective in greater detail.

The second objective states that we need to determine specializations for the elective courses based on groupings of three to four electives. Looking at this from the college's point of view, these specializations should be based on courses falling into similar concentrations of study. We should not grant elective specializations based merely on what students tend to take together; if we continue down that path, then we will be giving out specializations in areas that are not necessarily similar. Thus, we will follow the logic of the second methodology.

From here we can look at the performances of both DBSCAN and the hierarchical clustering algorithm. In comparing the two, we should look at the number of successful clusters that were generated by each algorithm.

The DBSCAN algorithm produced three different clusters making up ten courses in total. Two of these clusters performed well, while the last cluster did not. In total this algorithm would give us two successful clusters, which would result in two different specialties within the electives.

The hierarchical clustering algorithm provided us with five clusters comprised of 17 different courses. However, only four of these clusters were strong; the last cluster provided us with three courses that did not have any similarities. Therefore, this algorithm successfully generated four different specializations making up 14 of the 22 elective courses.

The hierarchical clustering algorithm provided the best model for clustering our data into specializations. The following are the successful clusters generated by this algorithm, with their proposed specialization names.

American Arts and History:

• Evidenced Based Crime and Justice Policy
• African-American Literature
• American Health Policy
• American South 1861 – Present

Arts and Literature:

• 20th Century Russian Literature: Fiction and Reality
• 19th-Century British Literature
• Devil's Pact Literature / Film

Political Arts:

• Comparative Politics
• Communications Internship
• Communications and the Presidency

International Arts:

• Business German: A Micro Perspective
• Elementary Arabic 2
• Europe in a Wider World
• Elementary German 1

These four specializations make up just under two-thirds of the elective courses, giving students plenty of opportunity to complete a concentration of study.

We have provided enough evidence to support our findings. Our analysis has successfully used a time series chart to eliminate seven electives from the current elective roster, giving us a new total of 22 currently offered electives. We have also successfully grouped those 22 courses into four different groups to offer four different specializations. Moving forward, both objective proposals will be submitted to the chief of the committee for review.
