

The majority of state-of-the-art text categorization algorithms are supervised and therefore require prior training. Besides the rigor involved in developing training datasets and the requirement to repeat training for different texts, working with multilingual texts poses additional unique challenges. One of these challenges is that the developer is required to have knowledge of the many different languages involved. Term expansion, such as query expansion, has been applied in numerous applications; however, a major drawback of most of these applications is that the actual meaning of terms is not usually taken into consideration. Considering the semantics of terms is necessary because of the polysemous nature of most natural language words. In this paper, as a specific contribution to the document index approach for text categorization, we present a joint multilingual/cross-lingual text categorization algorithm (JointMC) based on semantic term expansion of class topic terms through an optimized knowledge-based word sense disambiguation. The lexical knowledge in BabelNet is used for the word sense disambiguation and expansion of the topics' terms. The categorization algorithm computes the distributed semantic similarity between the expanded class topics and the text documents in the test corpus. We evaluate our categorization algorithm on a multilabel text categorization problem using the JRC-Acquis dataset, which is based on the subject domain classification of the European Commission's EuroVoc microthesaurus. We compare the performance of the classifier with that of a variant using the original class topics. Furthermore, we compare the performance of our classifier with two state-of-the-art supervised algorithms (one each for the multilingual and cross-lingual tasks) on the same dataset. Empirical results obtained on five experimental languages show that categorization with expanded topics outperforms the use of the original topics by a very wide margin, and that our algorithm outperforms the existing supervised technique on the same dataset. Cross-language categorization surprisingly shows similar performance and is marginally better for some of the languages.
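
As a rough illustration of the categorization step described above, the sketch below scores a document against each expanded class topic by cosine similarity of centroid vectors and keeps every label above a threshold. It is a minimal sketch under stated assumptions, not the JointMC implementation: \texttt{expanded\_topics}, \texttt{emb}, and the threshold value are hypothetical stand-ins, and the BabelNet-based disambiguation and expansion of the topic terms is assumed to have been performed beforehand.

\begin{verbatim}
# Minimal sketch (not the JointMC implementation): similarity-based multilabel
# categorization with expanded class topics. The BabelNet-based disambiguation
# and expansion is assumed to have produced `expanded_topics` already, and
# `emb` is any distributional word-embedding lookup (hypothetical inputs).
from typing import Dict, List
import numpy as np

def centroid(terms: List[str], emb: Dict[str, np.ndarray]) -> np.ndarray:
    """Average the embeddings of the in-vocabulary terms."""
    vecs = [emb[t] for t in terms if t in emb]
    dim = len(next(iter(emb.values())))
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity, guarding against zero vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def categorize(doc_terms: List[str],
               expanded_topics: Dict[str, List[str]],
               emb: Dict[str, np.ndarray],
               threshold: float = 0.3) -> List[str]:
    """Assign every class whose expanded topic is sufficiently similar
    to the document (multilabel decision by thresholding)."""
    doc_vec = centroid(doc_terms, emb)
    return [label for label, terms in expanded_topics.items()
            if cosine(doc_vec, centroid(terms, emb)) >= threshold]
\end{verbatim}

Thresholding the similarity scores, rather than taking an argmax, is what allows a document to receive several labels at once in a multilabel setting; the threshold value here is purely illustrative.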

Semantic textual similarity is a measure of the degree of semantic equivalence between two pieces of text. We describe the SemSim system and its performance in the *SEM~2013~and SemEval-2014~tasks on semantic textual similarity. At the core of our system lies a robust distributional word similarity component that combines Latent Semantic Analysis and machine learning augmented with data from several linguistic resources. We used a simple term alignment algorithm to handle longer pieces of text. Additional wrappers and resources were used to handle task-specific challenges, including processing Spanish text, comparing text sequences of different lengths, handling informal words and phrases, and matching words with sense definitions. In the *SEM~2013~task on Semantic Textual Similarity, our best-performing system ranked first among the~89~submitted runs. In the SemEval-2014~task on Multilingual Semantic Textual Similarity, we ranked a close second in both the English and Spanish subtasks. In the SemEval-2014~task on Cross-Level Semantic Similarity, we ranked first in the Sentence-Phrase, Phrase-Word, and Word-Sense subtasks and second in the Paragraph-Sentence subtask.
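
To make the alignment idea concrete, the following sketch greedily aligns each term of the shorter text with its most similar term in the longer one and averages the scores. It is only an illustrative approximation of a simple term alignment scheme, not the SemSim code; \texttt{word\_sim} is a hypothetical stand-in for the LSA-based distributional word similarity component.

\begin{verbatim}
# Illustrative sketch of a greedy term-alignment similarity; `word_sim` is a
# hypothetical stand-in for the LSA-based distributional word similarity
# component, and this is not the actual SemSim alignment code.
from typing import Callable, List

def align_score(source: List[str], target: List[str],
                word_sim: Callable[[str, str], float]) -> float:
    """Align each source term with its best-matching target term and
    average the resulting word similarities."""
    if not source or not target:
        return 0.0
    return sum(max(word_sim(s, t) for t in target) for s in source) / len(source)

def text_similarity(a: List[str], b: List[str],
                    word_sim: Callable[[str, str], float]) -> float:
    """Symmetric score: average the two alignment directions so that
    texts of different lengths are treated evenly."""
    return 0.5 * (align_score(a, b, word_sim) + align_score(b, a, word_sim))
\end{verbatim}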

Owing to the need for a deep understanding of linguistic items, semantic representation is considered to be one of the fundamental components of several applications in Natural Language Processing and Artificial Intelligence. As a result, semantic representation has been one of the prominent research areas in lexical semantics over the past decades. However, due mainly to the lack of large sense-annotated corpora, most existing representation techniques are limited to the lexical level and thus cannot be effectively applied to individual word senses.
