OpenPDC DM Tools Examples
Summary
An example guide on how to use the openPDC data mining tools with Hadoop. Covers three sample cases scanning both openPDC archive data and CSV data.
Introduction
This is a quick guide on how to use the data mining tools included with the openPDC Hadoop utilities. Most of the tools in this package are still in a beta stage and should be treated accordingly. In this guide we'll take a look at how to classify time series data from both demo CSV data and the openPDC's time series archive format. These tools are meant to serve as a basic framework for scanning for specific classes of time series patterns and classifying them accordingly. In some cases they are useful for specific purposes, such as finding unbounded oscillations in phasor measurement unit (PMU) data. They are still in a developmental stage and have not yet been widely tested. However, they are quite suitable as a basis for development and refinement in specific time series domains. These tools will be incredibly useful for any engineer or computer scientist who wishes to work in the following domains:

- Phasor measurement unit (PMU) data
- Arbitrary sensor data collected with the openPDC
- Time series applications in other domains, as illustrated by Dr. Keogh:
  o Image recognition
  o Shape matching
  o Medical data
  o Log data
We hope to continue this work further, and as sensor data collection accelerates, we believe it will serve as a good building block to build upon. Please feel free to email the authors with questions or comments.
Example Case 1
Input Archive Type: openPDC historian format
Window Size: 10 seconds
Window Step Size: 5 seconds
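To make these window parameters concrete, here is a minimal sketch of how a stream of timestamped samples could be cut into 10 second windows that advance in 5 second steps. The class and method names (TimeSeriesWindower, slice) are hypothetical and are not the API of the openPDC Hadoop utilities; the actual tools handle this windowing internally.

    import java.util.ArrayList;
    import java.util.List;

    /** Minimal windowing sketch (hypothetical names, not the tool's API):
     *  cut a list of timestamped samples into overlapping windows,
     *  e.g. 10 second windows advancing in 5 second steps. */
    public class TimeSeriesWindower {

        public static class Sample {
            final long timestampMs;  // sample time in milliseconds
            final double value;      // e.g. a frequency measurement
            Sample(long timestampMs, double value) {
                this.timestampMs = timestampMs;
                this.value = value;
            }
        }

        /** Returns one window per step; trailing data that cannot fill a full window is dropped. */
        static List<List<Sample>> slice(List<Sample> samples, long windowMs, long stepMs) {
            List<List<Sample>> windows = new ArrayList<>();
            if (samples.isEmpty()) return windows;

            long first = samples.get(0).timestampMs;
            long last = samples.get(samples.size() - 1).timestampMs;

            for (long winStart = first; winStart + windowMs <= last; winStart += stepMs) {
                List<Sample> window = new ArrayList<>();
                for (Sample s : samples) {
                    if (s.timestampMs >= winStart && s.timestampMs < winStart + windowMs) {
                        window.add(s);
                    }
                }
                windows.add(window);
            }
            return windows;
        }

        public static void main(String[] args) {
            // 30 seconds of fake 1 Hz frequency data, then 10 s windows with a 5 s step.
            List<Sample> samples = new ArrayList<>();
            for (int i = 0; i < 30; i++) {
                samples.add(new Sample(i * 1000L, 60.0));
            }
            System.out.println("Windows produced: " + slice(samples, 10_000L, 5_000L).size());
        }
    }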
Figure 1. Graphed data from the openPDC archive

Excel is a good tool to use for this step, since we can easily graph selected ranges of instances and quickly tag instance rows with the correct classification. Unfortunately, at this point there is no easy way to classify your training instances other than by hand; in the future we'd like to see tools that make this process much easier and more efficient. Excel should open the CSV directly as a worksheet and allow you to work with the instances. The last column in each row is marked as UNCLASSIFIED; this is the column you should change to a numeric class ID. Typically we graph blocks of instances and then hand classify each instance visually, repeating this process until we've classified all of the instances. Once all instances have been classified, we export the newly classified training instances as a CSV file. We then move this set of classified training instances to HDFS so the MapReduce classifier can find it during the classification pass. This CSV file will be used by the 1NN classifier as the training set.
In this particular example we want to mark the two instances of the (same) unbounded oscillation with the class 1 and all other instances with the class 0. In a real scenario we would ideally like a proportionate number of instances from each class; this should be considered a toy example because there are only two instances of our anomaly.
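To illustrate what the 1NN classifier does with that training set, here is a minimal sketch of one-nearest-neighbor classification by Euclidean distance: each unclassified window receives the class (0 for normal, 1 for the oscillation in this example) of the closest training instance. The names and the assumption that each instance is a fixed-length vector of values are ours for illustration; this is not the tool's actual implementation.

    /** Illustrative 1-nearest-neighbor sketch (not the tool's actual implementation):
     *  each training instance is a fixed-length window of values plus a numeric class ID,
     *  and an unclassified window is given the class of the closest training instance. */
    public class NearestNeighbor {

        /** Euclidean distance between two equal-length windows of values. */
        static double distance(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                sum += d * d;
            }
            return Math.sqrt(sum);
        }

        /** Returns the class ID of the training instance nearest to the candidate window. */
        static int classify(double[] candidate, double[][] trainingWindows, int[] trainingClasses) {
            int bestClass = trainingClasses[0];
            double bestDistance = Double.MAX_VALUE;
            for (int i = 0; i < trainingWindows.length; i++) {
                double d = distance(candidate, trainingWindows[i]);
                if (d < bestDistance) {
                    bestDistance = d;
                    bestClass = trainingClasses[i];
                }
            }
            return bestClass;
        }

        public static void main(String[] args) {
            // Two toy training windows: a flat one (class 0) and an oscillating one (class 1).
            double[][] training = {
                {60.00, 60.00, 60.00, 60.00},
                {60.00, 59.80, 60.00, 59.80}
            };
            int[] classes = {0, 1};

            double[] candidate = {60.00, 59.85, 60.00, 59.85};
            System.out.println("Predicted class: " + classify(candidate, training, classes));
        }
    }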
Results
If you check the output file in HDFS, you should see a file with two lines in it, each containing a long number representing a time offset. These time offsets should mark the front end of where the pattern occurs. Depending on which file you were scanning, the output should look something like the following, with each line giving the matched class followed by the time offset:

    1   1212440905001
    1   1212440910001
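If these offsets are millisecond-precision Unix epoch timestamps (an assumption on our part; check the tool's documentation for its actual time base), a small helper like the one below can print them as readable UTC dates so you can locate the matching windows in the original archive. This is a convenience sketch of ours, not part of the toolset.

    import java.time.Instant;
    import java.time.ZoneOffset;
    import java.time.format.DateTimeFormatter;

    /** Convenience sketch (not part of the openPDC tools): print the classifier's
     *  time offsets as readable UTC timestamps, assuming they are milliseconds
     *  since the Unix epoch. If the tool uses a different time base, adjust accordingly. */
    public class OffsetPrinter {
        public static void main(String[] args) {
            long[] offsets = {1212440905001L, 1212440910001L}; // offsets from the example output
            DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS 'UTC'")
                                                     .withZone(ZoneOffset.UTC);
            for (long offset : offsets) {
                System.out.println(offset + " -> " + fmt.format(Instant.ofEpochMilli(offset)));
            }
        }
    }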