Final Term Paper
Michael T. Elliott
Data Analytics MPS Candidate
Pennsylvania State University
Abstract—The changing trends, as well as the new generations of students, have brought to light the need for the BFA program at the University of Diploma Printing to be revisited and restructured. The chief of the committee has presented two different datasets to aid in the analysis of the restructuring of the curriculum. The main objectives of this restructuring are to eliminate obsolete electives and keep those that are still relevant, as well as to determine specializations within the electives made up of three or four courses. These specializations will allow students to add concentration areas to their study. By using the KNIME Analytics Platform, we are able to easily utilize a variety of data mining techniques to aid in this analysis.

Keywords—Data Analysis, KNIME Analytics Platform, Clustering Analysis, Association Rule Mining, Data Mining

I. BUSINESS UNDERSTANDING

A. Understanding Objective
In order to keep up with current styles and trends, the Department of Arts at the University of Diploma Printing is looking to revamp its BFA curriculum. The University of Diploma Printing is located in Romanigstan and hosts over 240 undergraduate majors within its BA and BFA programs. The professors within the arts department feature world-renowned artists from many different artistic disciplines. The dean of the College of Arts has approached the faculty about the issues that have been plaguing the department.

Over the past five years, enrollment has plateaued in the BFA program, disciplinary approaches have shifted, and key demographics within the university are changing. Despite the setbacks within the BFA program, the rest of the College of Arts and Sciences continues to be one of the fastest growing colleges within the university. These issues have led to the need for the BFA curriculum to be reviewed and restructured, with the hope of producing enrollment trends that match the rest of the college.

B. Assess the Situation
The current structure accounts for seven core courses and six electives. To complement the reduction in credit hours, the analysis will also focus on determining key concentrations within the offered electives, as well as obsoleting the electives that are not popular, to help lead the way for new course offerings. To aid in these requests, this analysis will focus on two main topics. It will first focus on optimizing the elective courses that are offered within the BFA program. In particular, the analysis will focus on determining the most popular electives to keep in the program. It will also look to group electives into key concentration areas so that students can choose special concentrations within the program.

The chief of the committee has provided a list of the currently offered courses as well as a historic record of student enrollment for the past five years to aid in this analysis.

C. Data Mining Goals
To accomplish these tasks, we will need to analyze the trends within the elective courses in order to determine which courses are most popular and which courses are obsolete. From here, we will want to determine a way to effectively show which courses should be grouped together. By grouping the elective courses together, we can determine specialties within the major that students can obtain.

D. Project Plan
Our first step will be to understand our data and figure out what it has to offer. From here, the two datasets will need to be joined together so that we will be able to compare student enrollment history to our current list of courses.

Once the data has been processed and cleaned, we will need to visit the elective course offerings and determine which courses are obsolete. To do this, we will want to show a time series chart of the elective courses that were taken over time. By doing this, we will be able to effectively determine the most relevant courses and which courses should be removed from the program.

Once we have a list of relevant electives that we want to keep in our course roster, we can begin to look for specialties. To accomplish this, we will try three different techniques: association rule mining, the DBSCAN clustering algorithm, and the hierarchical clustering algorithm. These three algorithms will be examined to determine which successfully produces a strong list of specialties to complement the major.

II. DATA UNDERSTANDING

A. Data Collection
To begin this analysis, we will want to first look at the two datasets that were given to us by the chief of the committee. These datasets contain raw data that offer insight into our data analysis.
1) Current Courses Offered
The first dataset provides us with a list of 42 currently offered courses. This dataset appears to be more of a reference table, which we will cross-reference against the student enrollment history.

In addition to 42 rows of data, this dataset contains three columns: Course Name, Course Number, and Course Type. The course name represents the full name of the class. The course number represents the ID of the course, such as SWENG 545. This number typically indicates the difficulty of the course; in most cases, the higher the number, the more difficult the course. The last column is the course type. This column is a categorical variable that represents whether the course is a fundamental class, an elective, or a core course.

2) Student Enrollment Last Five Years
The second dataset provides us with 4900 rows of past student enrollment, as well as three columns: student name, semester new, and course name. This table provides us with all of the different students that have enrolled in the last five years, as well as the total number of classes they enrolled in.

The course name column can be cross-referenced against the course name column from our other table. This will prove useful when determining which courses are electives, and ultimately whether the courses are even currently offered.

B. Data Description and Exploration
1) Current Courses Offered
The course list shows us that we are currently offering 42 courses. Of these courses, two are fundamental courses, appropriately labeled ‘F’ in the course type column. In addition to the fundamental courses, we have seven core courses labeled ‘C’, and 33 elective courses labeled ‘E’.

2) Student Enrollment Last Five Years
As we have already discovered, in our second dataset we are given three main columns. Our first column, student name, shows us the students that have enrolled in courses over the last five years. Our data shows that within the last five years, there have been 448 students who enrolled in courses. Our second column shows us that these students have taken these courses over the span of 16 different semesters. It appears that we are missing data from Summer 2005, since our records run from Spring 2005 to Fall 2005. Lastly, our third column lists the 169 courses that have been taken over this span. This dataset will later be joined together with our current course offerings. As we can see, there have been 169 different courses enrolled in, which differs greatly from the 42 courses that are currently offered in the program. Many of these rows of data will be eliminated when we combine our datasets together.

C. Data Quality
These datasets provide large amounts of meaningful data; however, the data is full of misspellings, duplicates, and missing values.

1) Current Courses Offered
Looking at the data table for our first dataset, we can apply a sort function to look through the course names alphabetically. This technique is a great way to catch misspellings and duplicates. In this case, the data shows that we do in fact have two courses that are the same: ARTS 571 France and the Europ. Union and ARTS 577 France and the Europ. Union.

Along with this duplicate course name, the dataset also contains a large variance of case discrepancies in the course column. Some course names are listed in all uppercase, some in all lowercase, and some with a mix of both. These inconsistencies will need to be cleaned up so that we can maintain the integrity of our dataset and ensure that we are providing the most accurate analysis we can.

2) Student Enrollment Last Five Years
Due to the size of this dataset, we will need to filter the columns of this data separately and apply a duplicate row filter to each column filter so that we can look at the unique rows of this dataset. As with the last dataset, we will sort each column so that the unique values are listed alphabetically. This will make it easier to find duplicates and misspellings.

When we look at the student names column, we can see that the names appear to be all different, except for two. The names in question were “Julie Christie 1995”, “Julie Christie 1999”, “Ken Follett 2001”, and “Ken Follett 2004”. These names will need to be combined and have the year removed. There is also one name that sticks out: “Olympia Snow, Senator” versus “Olympia Snow”. When I look this name up in the original data table, I can see that these two individuals are taking different classes at the same time. They have also taken the same classes in differing semesters, which means that we will treat these two individuals as separate students. Also, while looking at Olympia Snow, Senator, I noticed that her courses are duplicated. Therefore, duplicates will also need to be removed.

Next, we will look at the semester new column. This column appears to display 16 different semesters, all of which are spelled correctly. There will be no need to clean this column, as it is already in good shape.

Lastly, we will analyze the course name column. Through the duplicate row filter node, we can see that when we sort the classes, we can easily pick out all of the misspelling discrepancies, the variances in case, and even a missing value. An example of a misspelling is “AMERICAN HEALT POLICY” vs “AMERICAN HEALTH POLICY”. There are a large number of courses with spelling discrepancies, and these course names will therefore need to be adjusted to match each other so that we can keep the integrity of the data in check.
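Outside of KNIME, the same profiling steps can be reproduced with a few lines of pandas. This is a minimal sketch, with file names and column labels assumed from the descriptions above rather than taken from the actual workflow.

```python
import pandas as pd

# Assumed file and column names, based on the dataset descriptions above.
courses = pd.read_csv("current_courses.csv")        # Course Name, Course Number, Course Type
enrollment = pd.read_csv("student_enrollment.csv")  # student name, semester new, course name

print(len(courses))                           # expect 42 courses
print(len(enrollment))                        # expect 4900 raw rows
print(enrollment["student name"].nunique())   # expect 448 students
print(enrollment["semester new"].nunique())   # expect 16 semesters
print(enrollment["course name"].nunique())    # expect 169 course names

# Sorting the unique course names alphabetically puts near-duplicates such
# as "AMERICAN HEALT POLICY" / "AMERICAN HEALTH POLICY" side by side.
for name in sorted(enrollment["course name"].dropna().unique()):
    print(name)
```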
III. PREPARATION

A. Data Selection
The data contained in each of these datasets will prove useful to this analysis once cleaned. For the course listing dataset, we will want to look only at the data for elective courses. For the purposes of this analysis, we are looking at reducing the number of electives as well as defining specialties the electives can be grouped into. To accomplish this, we will need to make sure that we are only looking at the course type elective, E.

The data from our second dataset, which provides historical student enrollment for the last five years, will also need to be filtered to show only the elective courses that are currently offered through the university. We will need to take into account each of the students, the semester they enrolled in their courses, and the course they enrolled in.

B. Clean Data
One of the most important steps in this analysis is the data cleaning. To clean the data, we will need to look at not only removing the spelling discrepancies, but also removing duplicate rows and missing data, and unifying the case. To accomplish this, we will first start with the cell replacer node to help with spelling discrepancies.

1) Remove Duplicate Rows from Course Offerings
To begin the cleaning journey, we will first look at the course discrepancy in the current course offerings dataset. As mentioned earlier, the dataset shows the course “France and the Europ. Union” entered for both ARTS 571 and ARTS 577. The nominal row filter node was used to simply filter out the course number ARTS 577. The decision to remove ARTS 577 instead of ARTS 571 makes no difference in this analysis; either would provide the same results. Doing this removes the duplicate course name, which will be important when joining the datasets together.

2) Spelling Discrepancies
The cell replacer node works with two inputs and one output. This node allows us to bring our student enrollment history dataset into one input, and a lookup CSV file into the other. The second CSV file was created with a lookup column and a replace column. If a course name needs to be changed, it goes into the lookup column, and what it needs to be changed to is entered into the replace column. To determine what needs to be replaced, we can cross-reference the duplicate row filter, which we used earlier to isolate all the unique course names taken by students, against the course names shown in the current course offerings dataset.

After cross-referencing the current course offerings dataset, course names that needed to be changed were added to the lookup column. This resulted in 53 course names that needed to be changed. Our total number of course names in the student enrollment history dataset dropped from 169 courses to 140 courses.

The same operation will have to be used for student names as well, to have “Julie Christie 1995” and “Julie Christie 1999” changed to just “Julie Christie”. We will also have to make the changes for “Ken Follett 2001” and “Ken Follett 2004” so that both become just “Ken Follett”. By doing this we can ensure that the student names are all different. This drops the total number of students from 448 to 446.
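In code, the cell replacer logic amounts to a dictionary lookup. The sketch below is illustrative, with a hypothetical course_name_lookup.csv standing in for the lookup file described above.

```python
import pandas as pd

enrollment = pd.read_csv("student_enrollment.csv")  # assumed file name

# Stand-in for the lookup CSV fed to the cell replacer node: one column
# with the value to find, one with its replacement (names illustrative).
lookup = pd.read_csv("course_name_lookup.csv")  # columns: lookup, replace
mapping = dict(zip(lookup["lookup"], lookup["replace"]))

# Values found in the mapping are replaced; all others pass through
# unchanged, which is how the cell replacer node behaves.
enrollment["course name"] = enrollment["course name"].replace(mapping)

# The student-name fixes described above work the same way.
name_fixes = {
    "Julie Christie 1995": "Julie Christie",
    "Julie Christie 1999": "Julie Christie",
    "Ken Follett 2001": "Ken Follett",
    "Ken Follett 2004": "Ken Follett",
}
enrollment["student name"] = enrollment["student name"].replace(name_fixes)
```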
3) Changing Everything to Uppercase
In order to make sure all the cases are the same, and that courses are not duplicated based on case discrepancies, we will need to alter the cases in both datasets. To do this we will use the case converter node, which easily allows us to select the columns we want altered and whether we want them in uppercase or lowercase. For the purposes of this analysis, we will change the case in both datasets to uppercase. This will help when joining the two datasets together.

4) Duplicate Rows
The next step is to remove the duplicates. This was accomplished by using the duplicate row filter. After implementing this filter, the total number of rows in the student enrollment history dataset dropped from 4900 rows of data to 3651 rows of data.

5) Missing Values
The next step of cleaning our data will be removing the missing value from the student enrollment history dataset. There is only one missing value, and it is a course name. To eliminate this, we will use the missing value node, which we can instruct to remove any row that has a missing value. By performing this, we have now removed the one row with a missing value and dropped our row total to 3650 rows.
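These three cleaning steps (case conversion, duplicate removal, and missing value handling) have direct pandas equivalents. File and column names are again assumptions; the expected row counts come from the KNIME run described above.

```python
import pandas as pd

# Assumed file names; each step mirrors a KNIME node described above.
courses = pd.read_csv("current_courses.csv")
enrollment = pd.read_csv("student_enrollment.csv")

# Case converter analog: force both join keys to uppercase.
courses["Course Name"] = courses["Course Name"].str.upper()
enrollment["course name"] = enrollment["course name"].str.upper()

# Duplicate row filter analog: 4900 rows should drop to 3651 here.
enrollment = enrollment.drop_duplicates()

# Missing value node analog: drop the single row with no course name,
# leaving 3650 rows.
enrollment = enrollment.dropna(subset=["course name"])
print(len(enrollment))
```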
6) Lengthen Abbreviations
The final step of cleaning will involve looking at the names of the courses and replacing abbreviations with their full words. For example, we have a course offered called “Augustan Cultrl Revol”. We will use a cell replacer node, as before, to change this name to “Augustan Cultural Revolution”. By making this change, we will be able to better analyze this in our model generation.

C. Integrate Data
To make this data useful, we will need to join the two datasets together. This is easily done by using the joiner node. For this node we will want to select the course name columns from both datasets, as well as the inner join option. Now that our data has been cleaned and all of the spellings align, we will be able to easily perform an inner join, which joins on like values and removes values that are not alike. This will allow us to see the student enrollment history of only the courses that are currently offered. Our new dataset now contains 2421 rows of data.

D. Describe Final Dataset
Our new dataset is made up of 2421 rows of data and contains five columns: student name, semester new, course number, course name, and course type. When we add in our filter for elective courses, we see the total row count drop from 2421 to 746. This means that 292 students have taken 746 elective courses over the past five years. Below is a histogram showing the distribution of enrolled electives per student.
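As a cross-check on the joiner node, the inner join, the elective filter, and the histogram can be sketched as follows, with placeholder file names for the cleaned tables.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed names for the cleaned outputs of the steps above.
courses = pd.read_csv("current_courses_clean.csv")
enrollment = pd.read_csv("student_enrollment_clean.csv")

# Joiner node analog: an inner join on course name keeps enrollment rows
# only for currently offered courses (2421 rows in the KNIME workflow).
joined = enrollment.merge(courses, left_on="course name",
                          right_on="Course Name", how="inner")

# Row filter analog: electives only ('E'), which drops 2421 rows to 746.
electives = joined[joined["Course Type"] == "E"]

# Histogram of enrolled electives per student, as described above.
electives.groupby("student name").size().hist()
plt.xlabel("Elective courses per student")
plt.ylabel("Number of students")
plt.show()
```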
A. Evaluate Results
1) Data Visualization – Objective One
For our first objective, we were to determine a solution for updating our current elective courses by removing courses that are no longer relevant and keeping those that are. To accomplish this, we created a time series chart, Fig. 4, that shows which classes have been enrolled in for each semester. By using this chart, we were able to determine which classes are relevant and which are not. To determine this, we made a conservative assumption that we will only keep on our elective roster courses that have been enrolled in since the spring semester of 2003. This means that if a class has not been enrolled in since spring of 2003, it will be obsoleted and removed from the roster.
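The decision rule behind the chart can also be expressed directly in code. In the sketch below, the file name and the exact spelling of the semester labels are guesses; only the Spring 2003 cutoff logic comes from the assumption stated above.

```python
import pandas as pd

electives = pd.read_csv("elective_enrollment.csv")  # assumed join output

# Assumed semester spellings; ordering them chronologically makes
# "last semester seen" meaningful.
order = [f"{term} {year}" for year in range(2001, 2006)
         for term in ("SPRING", "SUMMER", "FALL")]
rank = {s: i for i, s in enumerate(order)}
electives["sem_rank"] = electives["semester new"].map(rank)

# The time series chart is essentially this table: enrollments per course
# per semester (plot it with counts.plot() if desired).
counts = pd.crosstab(electives["sem_rank"], electives["course name"])

# Conservative cutoff from above: keep only courses enrolled in during or
# after Spring 2003; everything older is flagged obsolete.
cutoff = rank["SPRING 2003"]
last_seen = electives.groupby("course name")["sem_rank"].max()
obsolete = sorted(last_seen[last_seen < cutoff].index)
print(obsolete)
```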
2) Association Rule Mining – Objective Two
Fig. 5 shows us the frequent itemsets, the itemset size, the itemset support, and the relative itemset support. From here we can see that these results produce large amounts of noise. Many of these itemsets contain two or more of the same courses. This makes it challenging to distinguish which may be strong candidates for specializations. We can also see from the support column that there is very little support for any of these itemsets. The itemset with the strongest support contains the courses American South 1861 – Present, 20th Century Russian Literature: Fiction and Reality, and 19th-Century British Literature. However, this itemset has a relative itemset support of only 2.098%, which is too small to draw inferences from.

We can continue our analysis of association rule mining by looking at the association rule learner output node. We will want to take into account the confidence of each rule as well as its lift. If a rule has a lift over one, then we can consider the rule to be actionable. We can see below in Fig. 6 that we have successfully generated eight association rules with varying confidences and lifts.

Fig. 6. Association Rule Learner Output

Out of the eight association rules, we can see that they all contain the course American Health Policy. We can also see that only five of the eight are actionable rules with a lift greater than one. These five association rules have confidences of 26.3% and lower, making them hard to trust when setting specializations throughout the electives.
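For readers without KNIME, the support, confidence, and lift computations behind these rules can be reproduced with plain pandas, treating each student as one transaction. The sketch below only enumerates single-course antecedents and consequents; dedicated miners such as KNIME's item set finder generalize this to larger itemsets. File and column names are assumptions.

```python
import pandas as pd
from itertools import combinations

electives = pd.read_csv("elective_enrollment.csv")  # assumed cleaned elective rows

# One transaction per student: a boolean student x course matrix.
basket = pd.crosstab(electives["student name"], electives["course name"]).astype(bool)

# Pairwise rules A -> B with support, confidence, and lift, the three
# quantities discussed above.
rows = []
for a, b in combinations(basket.columns, 2):
    support_ab = (basket[a] & basket[b]).mean()
    if support_ab == 0:
        continue
    for ant, con in ((a, b), (b, a)):
        confidence = support_ab / basket[ant].mean()
        lift = confidence / basket[con].mean()
        rows.append((ant, con, support_ab, confidence, lift))

rules = pd.DataFrame(rows, columns=["antecedent", "consequent",
                                    "support", "confidence", "lift"])

# Actionable rules: lift greater than one, as discussed above.
print(rules[rules["lift"] > 1].sort_values("lift", ascending=False).head(8))
```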
3) DBSCAN Clustering Algorithm – Objective Two
DBSCAN presents a different approach to selecting specializations within the electives offered. This algorithm takes the course names and clusters them based on their string distances. This algorithm in particular is very sensitive to its parameters.

The parameters we utilize in DBSCAN are Epsilon, the neighborhood radius within which points count as neighbors, and Minimum Points, the minimum number of points per cluster. This algorithm was run numerous times before finding the right value for Epsilon. It was clear that Minimum Points needed to be three, since we are looking for clusters of three or four courses to create specializations from. However, Epsilon was very sensitive, down to the thousandths decimal place. If we set too high a value for Epsilon, we obtained only one cluster; if our value of Epsilon was too low, all of our rows were considered noise. The results below were produced using an Epsilon of 0.625 and a Minimum Points setting of three.
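Before looking at those results, here is a rough stand-in for this node configuration, using scikit-learn's DBSCAN over a precomputed, length-normalized Levenshtein matrix. The course list is a subset of the roster for illustration, and KNIME's string distance may be normalized differently, so the 0.625 Epsilon may not transfer exactly.

```python
import numpy as np
from sklearn.cluster import DBSCAN

courses = [
    "19TH-CENTURY BRITISH LITERATURE",
    "AFRICAN-AMERICAN LITERATURE",
    "20TH CENTURY RUSSIAN LITERATURE: FICTION AND REALITY",
    "COMPARATIVE POLITICS",
    "CONTEMPORARY ART - 1945 TO PRESENT",
    "CONTEMPORARY POLITICAL THOUGHT",
    "AMERICAN HEALTH POLICY",
    "COMMUNICATIONS INTERNSHIP",
    "AMERICAN SOUTH 1861-PRESENT",
    "COMMUNICATIONS AND THE PRESIDENCY",
    "FRESHWATER ECOLOGY",
    "ELEMENTARY GERMAN 1",
]  # subset of the 22 electives, for illustration

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Pairwise distances normalized to [0, 1] so an eps such as 0.625 is
# comparable across short and long names.
n = len(courses)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = levenshtein(courses[i], courses[j]) / max(len(courses[i]), len(courses[j]))
        dist[i, j] = dist[j, i] = d

labels = DBSCAN(eps=0.625, min_samples=3, metric="precomputed").fit_predict(dist)
for name, lab in sorted(zip(courses, labels), key=lambda t: t[1]):
    print(lab, name)  # label -1 marks noise
```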
Cluster 0 had the following courses:
• 19th-Century British Literature
• African-American Literature

Cluster 1 had the following courses:
• Comparative Politics
• Contemporary Art – 1945 to Present
• Contemporary Political Thought

Our final cluster is cluster 2:
• American Health Policy
• Communications Internship
• American South 1861 – Present
• Communications and the Presidency

The rest of the data is shown as noise. This algorithm was able to successfully deliver three different clusters that meet the criteria for specializations.

The quality of these clusters appears to vary. Cluster 0 appears to have performed well, grouping three of the four literature courses together, thus making this a literature specialization. Cluster 1 grouped two political courses with an art class; however, the art class does contain the word “contemporary”, which matches the “contemporary” in Contemporary Political Thought. This cluster has not performed as well as the first. Cluster 2 provides us with a variety of courses whose grouping is questionable. Cluster 0 seems to have performed better than cluster 1 and cluster 2.

By providing three clusters, we are offering three different specializations within the current electives. This still leaves other classes for students to take that do not fall under a specialization umbrella.

4) Hierarchical Clustering Algorithm – Objective Two
The hierarchical clustering algorithm shares commonalities with the DBSCAN algorithm in the sense that it uses string distances to cluster the courses. However, the hierarchical clustering algorithm does so in a different manner.

This algorithm has multiple parameters that affect the performance of the model. We have to take into account multiple string distance types, as well as our linkage style and the number of clusters. We produce the best results when the Levenshtein distance is chosen, and when we configure the deletion weight to be one, the insertion weight to be two, and the exchange weight to be four. For the linkage style we use complete linkage, with the number of clusters being eight. Tuning the model to these settings produces results that we can draw inferences from. Below we can see in Fig. 8 that there are eight different clusters, with five of the clusters containing three or four courses.
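A comparable setup can be sketched with SciPy, using a weighted Levenshtein distance (deletion 1, insertion 2, exchange 4) and complete linkage cut into eight clusters. One caveat: unequal insertion and deletion weights make the distance asymmetric, so this sketch symmetrizes by averaging the two directions, which may differ from how KNIME handles it.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# The 22 remaining electives from objective one, after uppercasing.
courses = [
    "FRESHWATER ECOLOGY", "BRITISH POETRY 1660-1914", "ARTS AND RELIGION",
    "CELL BIOLOGY & BIOCHEMISTRY",
    "20TH CENTURY RUSSIAN LITERATURE: FICTION AND REALITY",
    "BECOMING HUMAN", "19TH-CENTURY BRITISH LITERATURE",
    "BUSINESS GERMAN: A MICRO PERSPECTIVE",
    "EVIDENCED BASED CRIME AND JUSTICE POLICY",
    "AFRICAN-AMERICAN LITERATURE", "ELEMENTARY ARABIC 2",
    "AMERICAN HEALTH POLICY", "COMPARATIVE POLITICS",
    "EUROPE IN A WIDER WORLD", "DEVIL'S PACT LITERATURE / FILM",
    "COMMUNICATIONS INTERNSHIP", "AMERICAN SOUTH 1861-PRESENT",
    "CONTEMPORARY ART - 1945 TO PRESENT", "CONTEMPORARY POLITICAL THOUGHT",
    "AUGUSTAN CULTURAL REVOLUTION", "COMMUNICATIONS AND THE PRESIDENCY",
    "ELEMENTARY GERMAN 1",
]

def weighted_levenshtein(a, b, w_del=1, w_ins=2, w_sub=4):
    """Edit distance with the weights reported above."""
    prev = [j * w_ins for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * w_del]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + w_del,
                           cur[j - 1] + w_ins,
                           prev[j - 1] + (w_sub if ca != cb else 0)))
        prev = cur
    return prev[-1]

# Symmetrized pairwise distance matrix.
n = len(courses)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = 0.5 * (weighted_levenshtein(courses[i], courses[j])
                   + weighted_levenshtein(courses[j], courses[i]))
        dist[i, j] = dist[j, i] = d

# Complete linkage on the condensed distance matrix, cut into 8 clusters.
tree = linkage(squareform(dist), method="complete")
labels = fcluster(tree, t=8, criterion="maxclust")
for name, lab in sorted(zip(courses, labels), key=lambda t: t[1]):
    print(lab, name)
```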
In clustering the data, it was very difficult to find meaningful clusters. The initial instinct was to set up association rules so that we could see if there were concentrations that could be inferred from the frequent itemsets. However, this proved to provide little information, as our frequent itemsets were too similar and did not group together well. From here we took these itemsets to be used in association rule mining and tried to create actionable rules that could be used to define specializations.

The ambiguity of the results can make an analysis like this challenging, as there is technically no correct answer. Instead, there is just a scientific approach, and a correct way of conducting an experiment, analyzing the information, and interpreting the results.

C. Conclusions
1) Objective One
By using data visualizations, we were successful in eliminating seven courses from our currently offered electives, giving us a new total of 22 currently offered electives. Recall Fig. 4. We can see that the following seven courses have not been enrolled in since spring 2003.

• ARTS 533: French Thought since 1945
• ARTS 573: Analyzing the Political World
• ARTS 571: France & The European Union
• ARTS 551: Aesthetics
• ARTS 585: Contemporary Socio Thought
• ARTS 575: Early Mesopotamian History / Society
• ARTS 565: Environmental Studies Research Seminar Junior Level

These courses have in fact not been enrolled in during 2003, 2004, or 2005. By using this time series chart, we can easily isolate these seven courses and determine that they should be removed from our current course roster, thus leaving us with the following 22 courses for students to enroll in:

• ARTS 549: Freshwater Ecology
• ARTS 583: British Poetry 1660-1914
• ARTS 543: Arts and Religion
• ARTS 569: Cell Biology & Biochemistry
• ARTS 545: 20th Century Russian Literature: Fiction and Reality
• ARTS 555: Becoming Human
• ARTS 557: 19th-Century British Literature
• ARTS 494: Business German: A Micro Perspective
• ARTS 485: Evidenced Based Crime and Justice Policy
• ARTS 492: African-American Literature
• ARTS 587: Elementary Arabic 2
• ARTS 493: American Health Policy
• ARTS 581: Comparative Politics
• ARTS 484: Europe in a Wider World
• ARTS 488: Devil’s Pact Literature / Film
• ARTS 547: Communications Internship
• ARTS 559: American South 1861-Present
• ARTS 497: Contemporary Art – 1945 to Present
• ARTS 491: Contemporary Political Thought
• ARTS 561: Augustan Cultural Revolution
• ARTS 495: Communications and the Presidency
• ARTS 567: Elementary German 1

2) Objective Two
For this objective we had three main algorithm models: association rule mining, the DBSCAN clustering algorithm, and the hierarchical clustering algorithm. These three algorithms can be separated into two different methodologies. The first methodology, association rule mining, looks for specializations in the sets of courses that students tend to take together. The second methodology is what the DBSCAN and hierarchical clustering algorithms were based around. These algorithms focused on using the unique elective course names (after filtering out the electives obsoleted in objective one) to cluster similar courses based on string distance. When choosing which methodology is better, we will need to look at this objective in greater detail.

The second objective states that we need to determine specializations for the elective courses based on groupings of three to four electives. From the college’s point of view, these specializations should be based on courses falling into similar concentrations of study. We should not grant elective specializations based on what students tend to take together; if we continue down this path, then we will be giving out specializations in areas that are not necessarily similar. Thus, we will follow the logic of the second methodology. From here we can look at the performances of both DBSCAN and the hierarchical clustering algorithm. In comparing the two, we should look at the number of successful clusters that were generated by each algorithm.

The DBSCAN algorithm produced three different clusters making up ten courses in total. Two of these clusters performed well, while the third did not. In total, this algorithm would give us two successful clusters, which would result in two different specialties within the electives.

The hierarchical clustering algorithm provided us with five clusters comprised of 17 different courses. However, only four of these clusters were strong; the last cluster grouped three courses that did not have any similarities. Therefore, this algorithm successfully generated four different specializations making up 14 of the 22 elective courses.

The hierarchical clustering algorithm provided the best model for clustering our data into specializations. The following are the successful clusters generated from this algorithm, with their proposed specialization names.

American Arts and History:
• Evidenced Based Crime and Justice Policy
• African-American Literature
• American Health Policy
• American South 1861 – Present

Arts and Literature:
• 20th Century Russian Literature: Fiction and Reality
• 19th-Century British Literature
• Devil’s Pact Literature / Film

Political Arts: