Multidimensional Text Clustering for Hierarchical Topic Detection
Nevin L. Zhang† and Leonard K. M. Poon‡
(† The Hong Kong University of Science and Technology
‡ The Education University of Hong Kong)
Abstract
Text clustering is generally considered
unsuitable for topic detection because it associates each document with only
one "topic" (i.e., document cluster). Recent advances in model-based
multidimensional clustering have overcome the difficulty, and have given rise
to a novel approach to hierarchical topic detection that outperforms the LDA
approach in empirical studies.
The new approach is called hierarchical
latent tree analysis (HLTA). The idea is to model document collections using a
class of graphical models called hierarchical latent tree models (HLTMs). The
variables at the bottom level of an HLTM are observed binary variables that
represent the presence/absence of words in a document. The variables at other
levels are binary latent variables. The latent variables at the second level
model word co-occurrence patterns, and those at higher levels model
co-occurrences of patterns at the level below.
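As a minimal sketch of the bottom level described above, the snippet below turns a few documents into binary presence/absence vectors, one entry per word. The toy documents, the four-word vocabulary, the whitespace tokenization, and the helper name `binarize` are all illustrative assumptions; they are not part of the HLTA software.

```python
def binarize(docs, vocabulary):
    """Return one 0/1 row per document: 1 if the word occurs in it, else 0."""
    rows = []
    for text in docs:
        tokens = set(text.lower().split())  # naive whitespace tokenization (assumption)
        rows.append([1 if word in tokens else 0 for word in vocabulary])
    return rows


# Toy data, invented for illustration only.
docs = [
    "the team won the game",
    "the stock market fell",
    "the team sold stock",
]
vocabulary = ["team", "game", "stock", "market"]

matrix = binarize(docs, vocabulary)
for row in matrix:
    print(row)
```

Each column of `matrix` plays the role of one observed binary variable. Word counts are deliberately discarded, since only presence or absence is modeled.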
Each latent variable gives a partition
of the documents, and the document clusters in the partitions are interpreted
as topics. The topics at high levels of the hierarchy capture "long-range" word
co-occurrences and hence are thematically more general, while the topics at low
levels capture "short-range" word co-occurrences and hence are thematically
more specific.
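The following sketch illustrates why several latent variables let one document belong to more than one topic. It is a deliberate simplification: each binary latent variable is treated as an isolated latent class model over its own child words, whereas in an HLTM the latent variables are connected in a tree and inference uses the whole model. The variable names, the probabilities, and the 0.5 threshold are hand-picked assumptions rather than learned values.

```python
def posterior_on(doc_words, prior_on, p_word_on, p_word_off):
    """P(Z = on | words) for one binary latent variable Z, assuming its
    child words are conditionally independent given Z."""
    like_on, like_off = prior_on, 1.0 - prior_on
    for word, p_on in p_word_on.items():
        p_off = p_word_off[word]
        if word in doc_words:
            like_on *= p_on
            like_off *= p_off
        else:
            like_on *= 1.0 - p_on
            like_off *= 1.0 - p_off
    return like_on / (like_on + like_off)


# Hypothetical parameters: (P(Z=on), P(word | Z=on), P(word | Z=off)).
latent_vars = {
    "Z_sports": (0.3, {"team": 0.8, "game": 0.6}, {"team": 0.1, "game": 0.1}),
    "Z_finance": (0.3, {"stock": 0.8, "market": 0.6}, {"stock": 0.05, "market": 0.05}),
}


def topics_of(text, threshold=0.5):
    """Names of the latent variables whose 'on' cluster contains the document."""
    words = set(text.lower().split())
    return sorted(
        name
        for name, (prior, p_on, p_off) in latent_vars.items()
        if posterior_on(words, prior, p_on, p_off) > threshold
    )


for doc in ["the team won the game", "the stock market fell", "the team sold stock"]:
    print(doc, "->", topics_of(doc))
```

Each latent variable splits the documents into an "on" cluster and an "off" cluster, so the two variables give two different partitions of the same collection. Under these toy parameters the third document falls in the "on" cluster of both, which a single one-dimensional clustering could not express.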
Outline
· Introduction
· Multidimensional clustering and latent tree models
· Latent tree models for topic detection
· Results on the New York Times dataset
· The HLTA Algorithm
· Comparisons with the LDA approach
· Analysis of IJCAI/AAAI papers (2000-2015)
· Software
· Conclusions
Materials
· Examples
Dataset | HLTA Approach: Part of Model | HLTA Approach: Topic Tree | LDA Approach (nHDP): Topic Tree
300,000 articles from New York Times (1987-2007) | | |
AAAI/IJCAI papers (2000-2015) | | |
Note: The topic tree for AAAI/IJCAI papers will take 1-2 minutes to load. Click on a topic to show the documents belonging to that topic and the counts by year.
Key References
· P. Chen, N. L. Zhang, et al. Latent Tree Models for Hierarchical Topic Detection. Artificial Intelligence, 250:105-124, 2017.
· P. Chen, N. L. Zhang, et al. Progressive EM for Latent Tree Models and Hierarchical Topic Detection. AAAI 2016.
· T. Liu, N. L. Zhang, P. Chen. Hierarchical Latent Tree Analysis for Topic Detection. ECML/PKDD (2) 2014: 256-272.
· R. Mourad, C. Sinoquet, N. L. Zhang, T. F. Liu and P. Leray (2013). A survey on latent tree models and applications. Journal of Artificial Intelligence Research, 47, 157-203.
· T. Liu, N. L. Zhang, et al. Greedy learning of latent tree models for multidimensional clustering. Machine Learning 98(1-2): 301-330 (2015).
· T. Chen, N. L. Zhang, T. F. Liu, Y. Wang, L. K. M. Poon (2012). Model-based multidimensional clustering of categorical data. Artificial Intelligence, 176(1), 2246-2269.
· Paisley, J., Wang, C., Blei, D. M., and Jordan, M. I. 2012. Nested hierarchical Dirichlet processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37.
· Blei, D. M., Griffiths, T. L., and Jordan, M. I. 2010. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2):7:1–7:30.
Part of the Model for IJCAI/AAAI Papers