
Data Mining

Practical Machine Learning Tools and Techniques with Java Implementations

  • 1st Edition - October 11, 1999
  • Latest edition
  • Authors: Ian H. Witten, Eibe Frank
  • Language: English

Description

This book offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. Inside, you'll learn all you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining—including both tried-and-true techniques of the past and Java-based methods at the leading edge of contemporary research. If you're involved at any level in the work of extracting usable knowledge from large collections of data, this clearly written and effectively illustrated book will prove an invaluable resource.

Complementing the authors' instruction is a fully functional platform-independent Java software system for machine learning, available for download. Apply it to the sample data sets provided to refine your data mining skills, apply it to your own data to discern meaningful patterns and generate valuable insights, adapt it for your specialized data mining applications, or use it to develop your own machine learning schemes.
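As a taste of what working with the software looks like, the command below follows the invocation style documented in chapter 8 (the class name and dataset path are those of early Weka releases and may differ in the version you download):

```shell
# Train the J48 decision tree learner (a reimplementation of C4.5)
# on the sample weather dataset and print the tree with evaluation statistics.
java weka.classifiers.j48.J48 -t data/weather.arff
```

The same learner can also be instantiated directly from Java code via the weka.classifiers package, which is how the book's "Embedded Machine Learning" section approaches it.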

Key features

  • Helps you select appropriate approaches to particular problems and to compare and evaluate the results of different techniques.
  • Covers performance improvement techniques, including input preprocessing and combining output from different methods.
  • Comes with downloadable machine learning software: use it to master the techniques covered inside, apply it to your own projects, and/or customize it to meet special needs.

Table of contents

Preface
Acknowledgements
1. What's It All About?
  1.1 Data Mining and Machine Learning
    Describing Structural Patterns · Machine Learning · Data Mining
  1.2 Simple Examples: The Weather Problem and Others
    The Weather Problem · Contact Lenses: An Idealized Problem · Irises: A Classic Numeric Dataset · CPU Performance: Introducing Numeric Prediction · Labor Negotiations: A More Realistic Example · Soybean Classification: A Classic Machine Learning Success
  1.3 Fielded Applications
    Decisions Involving Judgement · Screening Images · Load Forecasting · Diagnosis · Marketing and Sales
  1.4 Machine Learning and Statistics
  1.5 Generalization as Search
    Enumerating the Concept Space · Bias
  1.6 Data Mining and Ethics
  1.7 Further Reading
2. Input: Concepts, Instances, Attributes
  2.1 What's a Concept?
  2.2 What's in an Example?
  2.3 What's in an Attribute?
  2.4 Preparing the Input
    Gathering the Data Together · ARFF Format · Attribute Types · Missing Values · Inaccurate Values · Getting to Know Your Data
  2.5 Further Reading
3. Output: Knowledge Representation
  3.1 Decision Tables
  3.2 Decision Trees
  3.3 Classification Rules
  3.4 Association Rules
  3.5 Rules with Exceptions
  3.6 Rules Involving Relations
  3.7 Trees for Numeric Prediction
  3.8 Instance-Based Representation
  3.9 Clusters
  3.10 Further Reading
4. Algorithms: The Basic Methods
  4.1 Inferring Rudimentary Rules
    Missing Values and Numeric Attributes · Discussion
  4.2 Statistical Modeling
    Missing Values and Numeric Attributes · Discussion
  4.3 Divide-and-Conquer: Constructing Decision Trees
    Calculating Information · Highly Branching Attributes · Discussion
  4.4 Covering Algorithms: Constructing Rules
    Rules vs. Trees · A Simple Covering Algorithm · Rules vs. Decision Lists
  4.5 Mining Association Rules
    Item Sets · Association Rules · Generating Rules Efficiently · Discussion
  4.6 Linear Models
    Numeric Prediction · Classification · Discussion
  4.7 Instance-Based Learning
    The Distance Function · Discussion
  4.8 Further Reading
5. Credibility: Evaluating What's Been Learned
  5.1 Training and Testing
  5.2 Predicting Performance
  5.3 Cross-Validation
  5.4 Other Estimates
    Leave-One-Out · The Bootstrap
  5.5 Comparing Data Mining Schemes
  5.6 Predicting Probabilities
    Quadratic Loss Function · Informational Loss Function · Discussion
  5.7 Counting the Cost
    Lift Charts · ROC Curves · Cost-Sensitive Learning · Discussion
  5.8 Evaluating Numeric Prediction
  5.9 The Minimum Description Length Principle
  5.10 Applying MDL to Clustering
  5.11 Further Reading
6. Implementations: Real Machine Learning Schemes
  6.1 Decision Trees
    Numeric Attributes · Missing Values · Pruning · Estimating Error Rates · Complexity of Decision-Tree Induction · From Trees to Rules · C4.5: Choices and Options · Discussion
  6.2 Classification Rules
    Criteria for Choosing Tests · Missing Values, Numeric Attributes · Good Rules and Bad Rules · Generating Good Rules · Generating Good Decision Lists · Probability Measure for Rule Evaluation · Evaluating Rules Using a Test Set · Obtaining Rules from Partial Decision Trees · Rules with Exceptions · Discussion
  6.3 Extending Linear Classification: Support Vector Machines
    The Maximum Margin Hyperplane · Non-Linear Class Boundaries · Discussion
  6.4 Instance-Based Learning
    Reducing the Number of Exemplars · Pruning Noisy Exemplars · Weighting Attributes · Generalizing Exemplars · Distance Functions for Generalized Exemplars · Generalized Distance Functions · Discussion
  6.5 Numeric Prediction
    Model Trees · Building the Tree · Pruning the Tree · Nominal Attributes · Missing Values · Pseudo-Code for Model Tree Induction · Locally Weighted Linear Regression · Discussion
  6.6 Clustering
    Iterative Distance-Based Clustering · Incremental Clustering · Category Utility · Probability-Based Clustering · The EM Algorithm · Extending the Mixture Model · Bayesian Clustering · Discussion
7. Moving On: Engineering the Input and Output
  7.1 Attribute Selection
    Scheme-Independent Selection · Searching the Attribute Space · Scheme-Specific Selection
  7.2 Discretizing Numeric Attributes
    Unsupervised Discretization · Entropy-Based Discretization · Other Discretization Methods · Entropy-Based versus Error-Based Discretization · Converting Discrete to Numeric Attributes
  7.3 Automatic Data Cleansing
    Improving Decision Trees · Robust Regression · Detecting Anomalies
  7.4 Combining Multiple Models
    Bagging · Boosting · Stacking · Error-Correcting Output Codes
  7.5 Further Reading
8. Nuts and Bolts: Machine Learning Algorithms in Java
  8.1 Getting Started
  8.2 Javadoc and the Class Library
    Classes, Instances, and Packages · The weka.core Package · The weka.classifiers Package · Other Packages · Indexes
  8.3 Processing Datasets Using the Machine Learning Programs
    Using M5Prime · Generic Options · Scheme-Specific Options · Classifiers · Meta-Learning Schemes · Filters · Association Rules · Clustering
  8.4 Embedded Machine Learning
  8.5 Writing New Learning Schemes
    An Example Classifier · Conventions for Implementing Classifiers · Writing Filters · An Example Filter · Conventions for Writing Filters
9. Looking Forward
  9.1 Learning from Massive Datasets
  9.2 Visualizing Machine Learning
    Visualizing the Input · Visualizing the Output
  9.3 Incorporating Domain Knowledge
  9.4 Text Mining
    Finding Keyphrases for Documents · Finding Information in Running Text · Soft Parsing
  9.5 Mining the World Wide Web
  9.6 Further Reading
References
Index
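
As a taste of the ARFF format covered in section 2.4, here is an abridged fragment of the weather dataset introduced in chapter 1 (the full file ships with the downloadable software; attribute names and values shown here follow the sample data distributed with early Weka releases):

```
% The weather problem: should we play outside today?
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
```

Each line after @data is one instance; the final attribute, play, is the class to be predicted.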

Review quotes

"This is a milestone in the synthesis of data mining, data analysis, information theory, and machine learning."—Jim Gray, Microsoft Research

Product details

  • Edition: 1
  • Latest edition
  • Published: October 20, 1999
  • Language: English

About the authors

Ian H. Witten

Ian H. Witten is a professor of computer science at the University of Waikato in New Zealand. He directs the New Zealand Digital Library research project. His research interests include information retrieval, machine learning, text compression, and programming by demonstration. He received an MA in Mathematics from Cambridge University, England; an MSc in Computer Science from the University of Calgary, Canada; and a PhD in Electrical Engineering from Essex University, England. He is a fellow of the ACM and of the Royal Society of New Zealand. He has published widely on digital libraries, machine learning, text compression, hypertext, speech synthesis and signal processing, and computer typography.
Affiliations and expertise
Computer Science Department, University of Waikato, New Zealand

Eibe Frank

Eibe Frank lives in New Zealand with his Samoan spouse and two lovely boys, but originally hails from Germany, where he received his first degree in computer science from the University of Karlsruhe. He moved to New Zealand to pursue his Ph.D. in machine learning under the supervision of Ian H. Witten and joined the Department of Computer Science at the University of Waikato as a lecturer on completion of his studies. He is now a professor at the same institution. As an early adopter of the Java programming language, he laid the groundwork for the Weka software described in this book. He has contributed a number of publications on machine learning and data mining to the literature and has refereed for many conferences and journals in these areas.
Affiliations and expertise
Computer Science Department, University of Waikato, New Zealand