Reuter, Timo: Event-based stream classification framework – a supervised clustering approach for social media applications. 2015
Contents
- Title Page
- Abstract
- Table of Contents
- List of Figures
- List of Tables
- I Introduction
- 1 Introduction
- 1.1 Motivating Use Cases
- 1.2 Goal and Challenges
- 1.2.1 Clustering of Large Datasets
- 1.2.2 Clustering of Continuous Data Streams
- 1.2.3 Classifying of Concept Drifting Time Series Data
- 1.2.4 Noisy Data
- 1.3 Research Contributions of this Dissertation
- 1.4 Structure and Outline of this Thesis
- 2 Fundamentals of This Work
- 2.1 From the Categorization Idea to Event Clustering: Definition and Development
- 2.1.1 Categorization in Philosophy — The Classical View
- 2.1.2 Categorization in Cognitive Psychology — The Prototype View
- 2.1.3 Event Clustering Characterization
- 2.2 Characterization of an Event
- 3 Foundations and Related Work
- 3.1 Classification
- 3.2 Clustering
- 3.3 Distance Functions
- 3.4 Knowledge-based Clustering
- 3.5 Large-scale Processing and Scalability
- 3.6 New Event Detection
- 3.7 Event Identification and Detection
- 4 Event Clustering Dataset
- 4.1 Creation and Collection of the Dataset
- 4.2 Labeling of the Data — Creation of the Gold Standard
- 4.2.1 Usage of Social Event Calendars for Data Labeling
- 4.2.2 Fetching of Event Information from Upcoming and Last.fm
- 4.2.3 Labeling Process
- 4.3 Dataset Statistics
- 4.3.1 Data Quality
- 4.3.2 License Constraints
- 4.3.3 Data Point Distribution
- 4.3.4 Dataset Representation Format and Schema
- 4.4 Applications of the Dataset
- II Supervised Single-Pass Clustering with the Event-based Stream Classification Framework
- 5 System Description of the Stream Classification Framework for a Single-Pass Setting
- 5.1 Problem Statement
- 5.2 Overview of the Clustering Framework
- 5.3 Candidate Retrieval Strategies
- 5.4 Pairwise Feature Extraction
- 5.4.1 Temporal Features
- 5.4.2 Geographical Features
- 5.4.3 Textual Features
- 5.4.4 Document-Event Similarity Vector
- 5.5 Scoring and Ranking — Learning Similarity Functions
- 5.5.1 Problem Formulation using a Support Vector Machine
- 5.5.2 Problem Formulation as a Decision Tree Classification Problem
- 5.6 New Event Detection
- 6 Experimental Setup and Results of the Supervised Single-Pass Classification
- III Multi-pass Stream Clustering
- 7 System Description of the Stream Classification Framework for a Multi-Pass Setting
- 7.1 Problem Statement
- 7.2 System Overview
- 7.3 Multi-pass Requirements and Challenges
- 7.4 Multi-pass Strategies
- 8 Experimental Setup and Results of Supervised Multi-Pass Clustering
- 8.1 Analysis of First-Pass Strategies
- 8.2 Gold Standard Preparation for the Second Pass
- 8.2.1 Quality Issues in the Preparation Process
- 8.2.2 Creation of the Gold Standard for the Second Pass
- 8.3 Optimization of the Classification Framework Steps for the Second Pass
- 8.4 Clustering Framework in Two-Pass Mode — Optimization
- 8.4.1 Exhaustive Search for Optimal Features in Scoring, Ranking, and New Event Detection
- 8.4.2 Results of the Exhaustive Search
- 8.4.3 Optimization of Candidate Retrieval Strategy
- 8.5 Results of the Clustering Framework used in Two-Pass Mode
- 8.6 Conclusions
- IV Concluding Remarks
- 9 Remarks and Comparison of Clustering Approaches
- 9.1 Prerequisites for Event Clustering
- 9.2 Reflection on Multi-Pass Clustering in a Stream-based Setting
- 9.3 Comparison with Other Approaches
- 10 Conclusion
- V Appendix
