Introduction to Data Science
Computer Science
Pattern recognition, visualization, data warehousing, high-performance
computing, databases, AI
Mathematics
Mathematical Modeling
Statistics
Statistical and stochastic modeling, probability
Real Life Examples
Data Scientist
The Sexiest Job of the 21st Century
• Activities to consider
– Assess the structure of the data – this dictates the tools and analytic
techniques for the next phase
– Ensure the analytic techniques enable the team to meet the business
objectives and accept or reject the working hypotheses
– Determine if the situation warrants a single model or a series of
techniques as part of a larger analytic workflow
– Research and understand how other analysts have approached this or a
similar kind of problem
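
As a minimal sketch of the first activity, assessing the structure of the data, the Python snippet below profiles each column of a small record set by counting the value types it contains and how many values are missing. The function name `profile_columns` and the sample records are hypothetical, invented only for illustration; a real project would likely use a dedicated data-profiling tool.

```python
from collections import Counter


def profile_columns(records):
    """Summarize each column: observed value types and number of missing values."""
    profile = {}
    for row in records:
        for name, value in row.items():
            col = profile.setdefault(name, {"types": Counter(), "missing": 0})
            if value is None:
                col["missing"] += 1
            else:
                col["types"][type(value).__name__] += 1
    return profile


# Hypothetical sample records (not from the source material).
sample = [
    {"customer_id": 1, "region": "east", "spend": 120.5},
    {"customer_id": 2, "region": None, "spend": 80.0},
    {"customer_id": 3, "region": "west", "spend": None},
]

for name, col in profile_columns(sample).items():
    print(name, dict(col["types"]), "missing:", col["missing"])
```

A summary like this helps show whether columns are numeric, categorical, or incomplete, which in turn narrows the choice of tools and analytic techniques.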
2.4 Phase 3: Model Planning
Model Planning in Industry Verticals
• Commercial Tools
– SAS Enterprise Miner – built for enterprise-level computing and analytics
– SPSS Modeler (IBM) – provides enterprise-level computing and analytics
– Matlab – high-level language for data analytics, algorithms, data exploration
– Alpine Miner – provides GUI frontend for backend analytics tools
– STATISTICA and MATHEMATICA – popular data mining and analytics tools
• Free or Open Source Tools
– R and PL/R – PL/R is a procedural language for running R inside PostgreSQL
– Octave – language for computational modeling
– WEKA – data mining software package with analytic workbench
– Python – language providing toolkits for machine learning and analysis
– SQL – in-database implementations provide an alternative tool (see Chapter 11)
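
To illustrate the in-database idea from the last item, here is a small sketch in which the aggregation runs as SQL inside the database and only the summary rows come back to Python. It uses Python's built-in `sqlite3` module purely as a stand-in for an enterprise database; the `sales` table, its columns, and the function name `summarize_by_region` are assumptions made for this example.

```python
import sqlite3


def summarize_by_region(rows):
    """Load (region, amount) rows into an in-memory SQLite table and
    let the database compute the per-region count and average."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        # The GROUP BY runs inside the database engine; Python only
        # receives one summary row per region.
        cursor = conn.execute(
            "SELECT region, COUNT(*), AVG(amount) "
            "FROM sales GROUP BY region ORDER BY region"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Hypothetical data, for illustration only.
data = [("east", 100.0), ("east", 300.0), ("west", 50.0), ("west", 150.0)]
print(summarize_by_region(data))
```

With large tables, pushing the aggregation into the database in this way avoids loading every row into the analysis tool's memory.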
2.6 Phase 5: Communicate Results
• In this last phase, the team communicates the benefits of the project
more broadly and sets up a pilot project to deploy the work in a
controlled way
• Risk is managed effectively by undertaking a small-scope pilot
deployment before a wide-scale rollout
• During the pilot project, the team may need to execute the algorithm
more efficiently in the database rather than with in-memory tools
like R, especially with larger datasets
• To test the model in a live setting, consider running the model in a
production environment for a discrete set of products or a single line
of business
• Monitor model accuracy and retrain the model if necessary
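
The monitoring step in the last bullet can be sketched as a simple check: compare the model's predictions against the observed outcomes from the pilot and flag the model for retraining when accuracy drops below a chosen level. The function names and the 0.8 threshold are hypothetical choices for illustration, not values taken from the source; the right metric and cutoff depend on the business objectives.

```python
def accuracy(predictions, actuals):
    """Fraction of predictions that match the observed outcomes."""
    if not actuals or len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predictions, actuals))
    return correct / len(actuals)


def needs_retraining(predictions, actuals, threshold=0.8):
    """Return True when accuracy falls below the (assumed) threshold."""
    return accuracy(predictions, actuals) < threshold


# Hypothetical pilot results: 3 of 5 predictions match the outcomes.
pilot_predictions = [1, 0, 1, 1, 0]
pilot_actuals = [1, 0, 0, 1, 1]

print(accuracy(pilot_predictions, pilot_actuals))
print(needs_retraining(pilot_predictions, pilot_actuals))
```

Running such a check on a schedule during the pilot gives the team an early signal before a wide-scale rollout.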
2.7 Phase 6: Operationalize
Key outputs from a successful analytics project
• David Dietrich, Barry Heller, “Data Science & Big Data Analytics”,
EMC Education Services, Wiley Publications, 2012, ISBN 0-07-120413-X