Friday, September 10, 2010

Machine Learning with WEKA

Bernhard Pfahringer (based on material by Eibe Frank, Mark Hall, and Peter Reutemann)

Department of Computer Science, University of Waikato, New Zealand


WEKA: A Machine Learning Toolkit


The Explorer
- Classification and Regression
- Clustering
- Association Rules
- Attribute Selection
- Data Visualization


The Experimenter
The Knowledge Flow GUI
Other Utilities
Conclusions


WEKA: the software

- Machine learning/data mining software written in Java (distributed under the GNU General Public License)
- Used for research, education, and applications
- Complements “Data Mining” by Witten & Frank
- Main features:
  - Comprehensive set of data pre-processing tools, learning algorithms and evaluation methods
  - Graphical user interfaces (incl. data visualization)
  - Environment for comparing learning algorithms

WEKA: versions

- There are several versions of WEKA:
  - WEKA 3.4: “book version” compatible with description in data mining book
  - WEKA 3.5.5: “development version” with lots of improvements
- This talk is based on a nightly snapshot of WEKA 3.5.5 (12-Feb-2007)


WEKA only deals with “flat” files

Flat file in ARFF format:

@relation heart-disease-simplified

% numeric attribute
@attribute age numeric
% nominal attribute
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
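To make the structure of the flat file concrete, here is a minimal Python sketch that reads the subset of ARFF used in the example above (numeric and nominal attributes, `?` for missing values). It is not WEKA's own loader; the function name `parse_arff` and the `ARFF_TEXT` constant are hypothetical, and the closing brace on the `class` attribute is assumed.

```python
import re

# The example from the slide, with the closing brace on "class" assumed.
ARFF_TEXT = """\
@relation heart-disease-simplified

@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
"""


def parse_arff(text):
    """Parse a small ARFF subset: numeric/nominal attributes, '?' as missing.

    Returns (relation_name, attributes, rows) where attributes is a list of
    (name, "numeric") or (name, [nominal values]) and rows is a list of dicts.
    """
    relation = None
    attributes = []
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            cells = [c.strip() for c in line.split(",")]
            if len(cells) != len(attributes):
                raise ValueError(f"wrong number of values: {line!r}")
            row = {}
            for (name, kind), cell in zip(attributes, cells):
                if cell == "?":            # missing value
                    row[name] = None
                elif kind == "numeric":
                    row[name] = float(cell)
                elif cell in kind:         # nominal value must be declared
                    row[name] = cell
                else:
                    raise ValueError(f"undeclared value {cell!r} for {name}")
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            m = re.match(r"@attribute\s+(\S+)\s+(.*)", line, re.IGNORECASE)
            if m is None:
                raise ValueError(f"bad attribute line: {line!r}")
            name, spec = m.group(1), m.group(2).strip()
            if spec.startswith("{"):
                values = [v.strip() for v in spec.strip("{}").split(",")]
                attributes.append((name, values))
            elif spec.lower() in ("numeric", "real", "integer"):
                attributes.append((name, "numeric"))
            else:
                raise ValueError(f"unsupported attribute type: {spec!r}")
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
```

Each row comes back as a dictionary keyed by attribute name, so the missing cholesterol value in the last instance is stored as `None` rather than as the string `?`.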




java weka.gui.GUIChooser








Final Project - CS3708: Decision Support Systems

CS3708: Decision Support Systems
Semester 1/2010
Dr. Md Maruf Hasan

Course Project: 30%

  1. Create or choose a suitable dataset for classification. The UCI Machine Learning Repository is a good source of datasets.

  2. Choose a suitable tool and learn it thoroughly; WEKA, DTREG, and XLMiner are a few options. Go through the documentation (at least the tutorials and examples). (See the WEKA tutorial on S-Class.)

  3. Each tool helps you preprocess the dataset (e.g., discretization, handling missing values) and visualize it. It also helps you build, evaluate, and compare classifiers by performance (accuracy, error, etc.).

  4. On 24 September, each student will give a presentation. Introduce your dataset and the preprocessing and analysis you performed. Students are expected to use decision tree or regression tree algorithms to analyze the dataset comprehensively, make comparisons, and draw conclusions. (10-minute presentation + 5-minute Q&A.) 15%

  5. Prepare a report and submit it to me in hard copy by 1 October. (A sample report is available on the S-Class system for your reference.) 15%

  6. Feel free to discuss any project-related queries with me: talk to me or e-mail me.

  7. Also note that the Final Exam is scheduled for Friday, 1 October (open book).

CS3708 Decision Support Systems, Semester 1, Academic Year 2010-2011


Instructor: Dr. Md Maruf Hasan
Lecturer, School of Technology
Shinawatra University
E-mail: maruf@shinawatra.ac.th; Mobile: 085 163 1564
Lecture Hours: Fridays, 0900-1200, Main Building, 307
Office Hours: Thursdays & Fridays 1300-1600

Course Description: This course introduces the historical roots, theoretical foundations, contexts and applications of Decision Support Systems (DSS) in business computing. The focus is on how techniques for business intelligence can be applied, enhanced, extended, and integrated in the development of computer-based DSS that can support realistic, multi-criteria, multi-participant decision-making processes.

Course Delivery Strategy: Lectures are based on the main textbook (Turban's Decision Support and Business Intelligence Systems, DSBIS), supplemented with examples taken from the Web and other references. Selected chapters from Anderson’s Quantitative Methods for Business (QMB) will be covered as the basis for mathematical modeling.

In the first half of the semester, we focus on basic techniques with concrete examples. In the second half, we learn how to use available commercial and open-source tools. Throughout the course, several modeling techniques will be introduced along with the available software and tools. During the Project Meeting, datasets from business and scientific domains will be introduced, and students will be asked to choose suitable models and tools and to analyze the data with more than one tool and model in order to perform a comparative analysis. The project report and final presentation are due in the final weeks.

Advanced topics, such as Artificial Intelligence and Agent Technology, and other emerging technologies in the development and integration of DSS (e.g. SOA), will also be introduced as time permits.

Tentative Schedule:

Week 1-2: Chapters 1, 2, 3
Introduction to DSS and Business Intelligence
Computerized Decision Making: Phases and Sub Systems

Week 3-4: Chapter 4
Modeling and Analysis
Decision Analysis & Decision Tree, Bayesian/Probabilistic Model, Linear Programming, Monte Carlo Simulation, Queuing Theory, Regression Models and Forecasting, etc.

Week 5: Chapters 5, 6, 7
Introduction to Business Intelligence
Data Warehousing, OLAP, Data Mining and Data Visualization with Examples

Week 6: Chapters 10, 11
Groupware, Collaboration and Knowledge Management Tools and Techniques

Week 7: Chapters 12, 13
Basic Concepts: Case-based Reasoning, Genetic Algorithm, Fuzzy Logic
Applications: NLP, Speech Technology, Optical Character Recognition

Week 8: Mid Term Exam

Week 9: Chapter 14
Collaborative Filtering Algorithm, Intelligent Agent, SOA and Semantic Web

Week 10: Group Project Meeting
Introduction of Tools & Techniques (WEKA, STATISTICA, EXCEL)
Distribution of Experimental Datasets

Week 11: Chapter 9
Business Performance Management

Week 12: Chapter 15
DSS Development and Implementation Considerations

Week 13: Chapter 16
DSS Integration Issues

Week 14: Catch-up and Review

Week 15: Students’ Presentations

Week 16: Final Exam

Textbook and References:

Main Textbook:
Decision Support and Business Intelligence Systems (DSBIS), 8/E (9th Edition just published)
by Efraim Turban et al.
ISBN: 9780131986602; Prentice Hall, 2007
SIU Library Call No.: HD30.2 T87 2007 (4 copies on RESERVE)

Reference Books with SIU Library Call Number:
(1) Data Mining: Practical Machine Learning Tools and Techniques, 2nd Ed.
by Witten, I. H. (Ian H.), Frank, Eibe.
ISBN: 9780120884070; Morgan Kaufmann, c2005
SIU Library: QA76.9.D343 W829 2005

(2) Decision Modeling with Microsoft Excel, 6th Ed.
by Moore, Jeffrey H. (Jeffrey Hillsman), Weatherford, Lawrence R.
ISBN: 9780131218512; Prentice Hall, c2001
SIU Library: HD30.25 I63 2001

(3) Quantitative Methods for Business (QMB), 11th Ed. (Main Textbook for CS2004: Computer Models for Business Decisions)
by Anderson, David R. et al.
ISBN: 9780324653489; Thomson South Western, c2008
SIU Library Call No.: T56 A63 2008 (2 copies on RESERVE)

Assessment and Evaluations (Tentative)

Mid-term Exam: 30%
Final Exam: 30%
Assignments: 10%
Projects: 30%

Special Notes to Students Registered for CS3004: Computer Models for Business Decisions:
During the first half of the course (before the midterm exam), students registered for CS3004: Computer Models for Business Decisions will attend lectures together with the "CS3708: Decision Support Systems" students. For these students, the midterm (30%) and assignments (5%) will be assessed by Dr. Md Maruf Hasan.

After the midterm exam, Dr. Thiti Vacharasintopchai will take over the CS3004 students (for the Business Decision Modeling part). CS3708 students will continue with Dr. Maruf Hasan, learning more about DSS implementation, integration and related topics, as well as projects in Scientific Problem Solving and Business Decision Making.
