December 29, 2015, 11AM 302-309
A wide variety of text analysis applications are based on
statistical machine learning techniques. The success of those applications is
critically affected by how we represent a document. Learning an efficient
document representation poses two major challenges: sparsity and sequentiality.
Sparsity often causes high estimation error, and the sequential nature of
text, that is, the interdependency between words, complicates matters further. This thesis
presents novel document representations to overcome the two challenges.
First, I employ label characteristics to estimate a compact document
representation. Because label attributes implicitly describe the geometry of
a dense subspace that has substantial impact, I can effectively resolve the
sparsity issue by focusing only on that compact subspace. Second, by
modeling a document as a joint or conditional distribution between words and
their sequential information, I can efficiently reflect the sequential nature of
text in my document representations. Lastly, the thesis concludes with a
document representation that employs both labels and sequential information
in a unified formulation. The following
four criteria are used to evaluate the quality of representations: how
close a representation is to its original data, how strongly representations
can be distinguished from one another, how easily a human can interpret a
representation, and how much computational effort a representation requires.
While pursuing these criteria, I obtained
document representations that are closer to the original data, stronger in
discrimination, and easier to understand than traditional document
representations. Efficient computation algorithms make the proposed
approaches largely scalable. This thesis examines emotion prediction,
temporal emotion analysis, modeling documents with edit histories, locally
coherent topic modeling, and text categorization as possible
applications.