To the noun phrase recognition problem in application to automatic information extraction from Russian texts

Free access

This paper considers the problem of extracting complex noun phrases from Russian-language journalistic texts, as applied to automatic information extraction. Complex noun phrases are understood here as long noun phrases that contain genitive and prepositional constructions as well as proper names. A scheme for finding noun phrase boundaries is proposed that starts from a text fragment known to contain a noun phrase. An algorithm for identifying such fragments has been developed. The fragments are classified by the frequency of occurrence of each fragment type, the number of words in the fragment, its part-of-speech composition, the presence of already identified named entities of different types, and whether parts of the fragment occur in a list of complex prepositions and fixed expressions. An original feature set is presented for building an algorithm that automatically extracts noun phrases within the boundaries of the fragments constructed at the first stage. In the experimental part of the study, 58032 fragments were extracted from a collection of 1000 documents on socio-political topics, and complicated cases were analyzed.
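The fragment features listed in the abstract could be computed roughly as in the sketch below. This is an illustration only, not the authors' implementation. The `Token` structure, the function names, the Universal Dependencies style tags, and the three sample complex prepositions are all assumptions, since the abstract does not give the actual tag set, lists, or data format.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Hypothetical sample list; the list actually used in the paper is not given.
COMPLEX_PREPOSITIONS = {
    ("в", "течение"),
    ("в", "связи", "с"),
    ("в", "соответствии", "с"),
}


@dataclass
class Token:
    text: str
    pos: str                      # coarse POS tag (assumed tag set)
    entity: Optional[str] = None  # type of an already recognized named entity


def contains_complex_preposition(words):
    """Return True if any listed complex preposition occurs as a contiguous run."""
    for prep in COMPLEX_PREPOSITIONS:
        n = len(prep)
        for i in range(len(words) - n + 1):
            if tuple(words[i:i + n]) == prep:
                return True
    return False


def fragment_features(tokens):
    """Compute the per-fragment features named in the abstract."""
    words = [t.text.lower() for t in tokens]
    return {
        "length": len(tokens),
        "pos_pattern": tuple(t.pos for t in tokens),
        "pos_counts": dict(Counter(t.pos for t in tokens)),
        "entity_types": sorted({t.entity for t in tokens if t.entity}),
        "has_complex_preposition": contains_complex_preposition(words),
    }


def type_frequencies(fragments):
    """Count how often each fragment type (here: its POS pattern) occurs."""
    return Counter(tuple(t.pos for t in frag) for frag in fragments)


# Invented example fragment: "в связи с визитом президента России"
example = [
    Token("в", "ADP"),
    Token("связи", "NOUN"),
    Token("с", "ADP"),
    Token("визитом", "NOUN"),
    Token("президента", "NOUN"),
    Token("России", "PROPN", entity="LOC"),
]
features = fragment_features(example)
```

Treating the part-of-speech pattern as the fragment type is one possible reading of "types of fragments"; the paper may define fragment types differently.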


Information extraction, named entity recognition, noun phrase chunking

Short URL: https://sciup.org/14336183

IDR: 14336183

Research article