Latent Semantic Analysis is a theory of how representations of meaning can be learned from exposure to large samples of language, without explicit instruction about how that language is structured. To extract and understand patterns from documents, Latent Semantic Analysis relies on certain assumptions:
The meaning of a sentence or document is the sum of the meanings of all the words that appear in it. Conversely, the meaning of a word is an average over all the documents in which it appears.
Latent Semantic Analysis also assumes that the semantic associations between words are not explicitly present but only latent in the large sample of language.
Latent Semantic Analysis consists of a series of mathematical operations that extract information about a document collection. The algorithm forms the basis for topic modeling. The central idea is to take a matrix of what we have (documents and terms) and decompose it into a separate matrix of documents and topics and a matrix of topics and terms.
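As a rough sketch of that decomposition, assuming it is carried out with a truncated singular value decomposition and writing m for the number of documents, n for the number of terms, and k for the number of topics (the symbols A, U, Σ and V are notation chosen here for illustration):

```latex
A_{m \times n} \;\approx\; U_{m \times k} \, \Sigma_{k \times k} \, V^{\top}_{k \times n}
```

Under these assumptions, U plays the role of the document-topic matrix, the transpose of V plays the role of the topic-term matrix, and Σ is a diagonal matrix holding the k largest singular values, which scale the strength of each topic.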
The first step is to generate the document-term matrix. Given m documents and n words in the vocabulary, we can construct an m × n matrix in which each row represents a document and each column represents a word. The matrix can be filled with raw counts from a bag-of-words model, but the result is sparse and raw counts say little about how important each term is. For that reason the counts are usually replaced by a weighted score such as tf-idf: intuitively, a term carries great weight when it appears frequently in a document but infrequently across the corpus.
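A minimal sketch of building such a matrix of raw counts is shown here. The toy corpus, the whitespace tokenization, and the function name `build_document_term_matrix` are all assumptions made for illustration; a real pipeline would use a proper tokenizer.

```python
from collections import Counter

# Toy corpus (hypothetical): each string is one document.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

def build_document_term_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts word j in document i."""
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    column = {word: j for j, word in enumerate(vocabulary)}
    # One row per document, one column per vocabulary word.
    matrix = [[0] * len(vocabulary) for _ in documents]
    for i, tokens in enumerate(tokenized):
        for word, count in Counter(tokens).items():
            matrix[i][column[word]] = count
    return vocabulary, matrix

vocabulary, counts = build_document_term_matrix(docs)
```

Most cells are zero even in this tiny example, which illustrates why raw count matrices are sparse.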
A collection of text that is statistically representative of human language experience is first divided into passages with coherent meanings, typically paragraphs or documents. The collection is then represented as a term-by-passage matrix. In this layout, rows represent individual terms and columns represent passages or documents (or other units of analysis of interest). Each cell contains the number of times the term appears in the passage.
Transformed term-by-passage matrix
The entries in the term-by-document matrix are often transformed to weight them according to their estimated importance, in order to better mimic the human comprehension process. For language simulation, the best performance is observed when frequencies are accumulated sublinearly within cells (typically log(freq_ij + 1), where freq_ij is the frequency of term i in document j) and weighted inversely with the overall occurrence of the term in the collection (typically using inverse document frequency or an entropy measure).
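One common way to combine the two ideas is log-entropy weighting, sketched here with NumPy. The exact formula for the global weight (one plus the normalized entropy term) is an assumption about which entropy measure is meant, and the function name and toy counts are hypothetical.

```python
import numpy as np

def log_entropy_weight(counts):
    """Weight a terms x passages count matrix (rows are terms).

    Local weight:  log(freq_ij + 1)
    Global weight: 1 + sum_j p_ij * log(p_ij) / log(n_passages),
                   where p_ij = freq_ij / total frequency of term i.
    Assumes at least two passages, so that log(n_passages) is nonzero.
    """
    counts = np.asarray(counts, dtype=float)
    n_passages = counts.shape[1]
    local = np.log(counts + 1.0)
    row_totals = counts.sum(axis=1, keepdims=True)
    # Avoid division by zero for terms that never occur.
    p = counts / np.where(row_totals > 0, row_totals, 1.0)
    # p * log(p), treating 0 * log(0) as 0.
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    global_weight = 1.0 + plogp.sum(axis=1) / np.log(n_passages)
    return local * global_weight[:, np.newaxis]

# Toy counts: 3 terms x 3 passages.
toy_counts = np.array([
    [1, 1, 1],   # spread evenly over all passages
    [3, 0, 0],   # concentrated in a single passage
    [2, 1, 0],
])
weighted = log_entropy_weight(toy_counts)
```

In this sketch a term spread evenly over every passage receives a global weight of zero, while a term concentrated in a single passage keeps its full local weight.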
Stop-listing and stemming
These are used very rarely. According to the underlying theory and model, neither stemming nor stop-listing is appropriate or generally effective. As in natural language, the meaning of a passage cannot be accurately reconstructed or understood without all of its words. However, when Latent Semantic Analysis is used to compare word strings shorter than normal text paragraphs, such as short sentences, giving zero weight to function words is often pragmatically useful.
A reduced-rank singular value decomposition is performed on the matrix, in which the k largest singular values are retained and the remainder are set to 0. The resulting representation is the best k-dimensional approximation to the original matrix in the least-squares sense. Each passage and each term is now represented as a k-dimensional vector in the derived space. In most applications the dimensionality k is much smaller than the number of terms in the term-by-passage matrix.
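The reduced-rank decomposition can be sketched with `numpy.linalg.svd` as follows. The toy matrix, the choice k = 2, and the helper name `truncated_svd` are assumptions for illustration; scaling the term and passage vectors by the singular values is one common convention, not the only one.

```python
import numpy as np

def truncated_svd(matrix, k):
    """Return the k leading singular triplets (U_k, s_k, Vt_k) of a matrix."""
    U, s, Vt = np.linalg.svd(np.asarray(matrix, dtype=float), full_matrices=False)
    # numpy returns singular values in descending order, so slicing keeps the largest k.
    return U[:, :k], s[:k], Vt[:k, :]

# Toy terms x passages matrix (hypothetical counts): 5 terms, 4 passages.
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

U_k, s_k, Vt_k = truncated_svd(X, k=2)

# Best rank-2 approximation of X in the least-squares (Frobenius) sense.
X_k = U_k @ np.diag(s_k) @ Vt_k

# k-dimensional vectors for terms (rows) and passages (rows).
term_vectors = U_k * s_k
passage_vectors = Vt_k.T * s_k
```

The approximation error of `X_k` equals the square root of the sum of the squared discarded singular values, which is what makes it the best rank-k approximation in the least-squares sense.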
Purpose and method of Latent Semantic Analysis
Latent Semantic Analysis models the contribution to natural language meaning that is attributable to the combination of words into coherent passages. It uses a long-known technique from matrix algebra, singular value decomposition, which became practical to apply to such complex phenomena only after the advent of powerful digital computers and algorithms in the late 1980s.
To construct a semantic space for a language, Latent Semantic Analysis first casts a large, representative body of text into a rectangular matrix of words by coherent passages. Each cell contains a transformation of the number of times a given word appears in a given passage. The matrix is then decomposed in such a way that each passage is represented as a vector whose value is the sum of the vectors representing its component words. Similarities between words and words, passages and words, and passages and passages are then computed as dot products, cosines, or other vector-algebraic metrics.
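The following sketch illustrates both points on a toy matrix: a passage vector can be recovered as a weighted sum of the vectors of its words, and similarities can be computed as cosines. The term list, the counts, the choice k = 2, and the scaling of vectors by singular values are assumptions made for illustration.

```python
import numpy as np

# Toy terms x passages matrix (hypothetical counts, for illustration only).
terms = ["cat", "dog", "pet", "stock", "market"]
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Passage vectors in the reduced space, scaled by the singular values.
passage_vectors = Vt_k.T * s_k

# The same vectors obtained as a sum of term directions (rows of U_k),
# each weighted by the corresponding cell entry of the passage's column.
summed_from_terms = X.T @ U_k

def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

term_vectors = U_k * s_k
cat, dog, stock = (term_vectors[terms.index(t)] for t in ("cat", "dog", "stock"))
sim_cat_dog = cosine(cat, dog)
sim_cat_stock = cosine(cat, stock)
```

In this toy space "cat" and "dog" come out as near neighbours, while "cat" and "stock" are close to orthogonal, because the two groups of words never share a passage.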
Latent Semantic Analysis as a theory and model of language
The theoretical interpretation of the result of the analysis is that the vectors approximate the meaning of a word as its average effect on the meaning of the passages in which it occurs and, reciprocally, approximate the meaning of a passage as the average of the meanings of its words. However, the derived relationship between individual words should not be confused with surface co-occurrence, that is, the frequency or probability with which the words appear in the same passages. It is correctly interpreted as the similarity of the effects that the words have on the passages in which they occur.
Typical language simulation applications
Latent Semantic Analysis has been used most widely in educational technology applications and for information retrieval over small databases. In test collections, when all other characteristics of the compared methods are held constant, Latent Semantic Analysis is reported to give combined precision and recall results about 30% better than other methods. Its strength lies in recall, because it does not depend on literal word overlap.
Its lack of wider use appears to be due to widely overestimated training requirements. Its best-known educational applications are as the main component of automated essay scoring systems that match human readers in accuracy, and of summary-writing and other computer tutors.
It has been the foundation for technologies that improve indexing, assess the coherence and content sequencing of books, diagnose psychological disorders, match jobs and applicants, and monitor and improve team communications, among other applications. It has also been used as the basis of a metric for the developmental status of words as a function of the amount of language encountered. In addition, it has been used as a tool for experiments and as a component of theories and applications in psychology, anthropology, sociology, psycholinguistics, data mining, and machine learning.
Non-English and multi-language applications
Latent Semantic Analysis has been used successfully in a wide variety of languages. These include all the languages of the United Nations and the European Union, Chinese and Japanese (in Chinese-character representations, where the sum-of-components assumption holds across components of different complexity), Swahili, Hindi, Arabic and Latvian. Highly inflected and word-compounding languages have proved surprisingly manageable as long as sufficiently large training corpora are used. A demonstration of the linguistic and anthropological or philosophical interest, as well as the practical value, of the multilingual ability of Latent Semantic Analysis comes from cross-language information retrieval.
In this method, independent spaces for two or more languages are first created from single-language corpora in which several hundred passages are direct translations of, or topically close to, corresponding texts in the other languages. The different language spaces are then rotated using the least-squares Procrustes method so that the common passages are aligned as closely as possible. When tested on the similarity of one randomly chosen member of a translated pair to the other, using pairs that were not used in the alignment, recall and precision are within the normal ranges for single-language information retrieval.
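The alignment step can be sketched as an orthogonal Procrustes problem, assuming both language spaces have the same dimensionality k and that row i of each matrix holds the vector of the same paired passage. The synthetic data and the function name `procrustes_rotation` are hypothetical, and the recovered matrix is orthogonal, so it may include a reflection as well as a rotation.

```python
import numpy as np

def procrustes_rotation(source, target):
    """Orthogonal matrix R minimizing ||source @ R - target|| in the Frobenius norm."""
    # Classic solution: take the SVD of source^T target and drop the singular values.
    U, _, Vt = np.linalg.svd(source.T @ target)
    return U @ Vt

# Synthetic stand-in for paired passage vectors in two language spaces.
rng = np.random.default_rng(0)
k = 3
space_a = rng.normal(size=(50, k))                        # 50 paired passages, language A
true_rotation, _ = np.linalg.qr(rng.normal(size=(k, k)))  # hidden orthogonal map
space_b = space_a @ true_rotation                         # same passages, language B

R = procrustes_rotation(space_a, space_b)
aligned = space_a @ R   # language-A vectors expressed in language B's space
```

After alignment, a query passage from one language can be mapped with `R` and compared by cosine against passages in the other language's space.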
Linguistic and philosophical implications
Plato, Chomsky, Pinker, and others have claimed that neither grammar nor semantics can be learned from exposure to language because there is too little information in experience, so that they must be primarily innate. Latent Semantic Analysis has shown that computational induction can extract much more information than previously assumed. The finding that words and passages of similar meaning, expressed in a wide variety of different languages, can be mapped onto one another by a simple linear transformation implies that the semantic structure of language may, in a sense, be universal, presumably because people everywhere must learn to talk about much the same things.
Deficiencies, objections, evidence and arguments
Latent Semantic Analysis does not take into account exposure to oral language, direct instruction from parents and teachers, or the association of language with perception and action. It is also blind to word order. Even so, some estimates suggest that Latent Semantic Analysis may perform only about 10% below humans. Some commentators have also objected that Latent Semantic Analysis is not grounded in perception and intention. The strength of this objection is considerably reduced by the fact that language itself includes words for abstractions of perception and intention, and by the varied successes of Latent Semantic Analysis.
The ability to derive meaning is key to any approach that needs to use or assess knowledge. With the advent of more powerful computing and the availability of online texts and machine-readable dictionaries, novel techniques have been developed that can derive semantic representations automatically. These techniques capture the effects of regularities inherent in language to learn about the semantic relationships between words.
These techniques operate on large corpora, allowing lexicons to be developed automatically from large samples of language. They can be incorporated into cognitive modeling methods for a wide range of psychological phenomena, such as language acquisition, speech processing, categorization, and memory. They can also be used in applied settings, where a computer derives representations of semantic knowledge from text.