Automatic text rubrication using machine learning algorithms

Free access

The article addresses the automatic rubrication (assignment to topical categories) of Russian-language texts with machine learning algorithms, using a corpus of news articles as an example. The task is treated as a classification problem over a fixed number of disjoint classes. An algorithm for preparing text data for classification is presented together with its practical implementation in Python. Existing methods of token normalization are analyzed. Results are reported for several classifiers built to classify Russian-language texts. The generalization ability of the classifiers is assessed with several metrics.


Classification, tokenization, normalization, stopword, metric
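
The steps named in the abstract and keywords (tokenization, stop word removal, normalization, classification, evaluation by a metric) can be illustrated with a minimal, standard-library-only sketch. It is not the article's implementation, which is not reproduced on this page. The stop word list, the suffix list used as a crude stand-in for stemming or lemmatization, the toy corpus, the class labels, and the choice of a multinomial Naive Bayes classifier with accuracy as the metric are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical, tiny stop word list; a real one would be far longer.
STOPWORDS = {"и", "в", "на", "с", "по", "не", "что", "это"}

# Hypothetical suffix list: a crude stand-in for stemming or lemmatization.
SUFFIXES = ("ами", "ов", "ах", "ам", "ом", "ы", "и", "а", "у", "е")


def tokenize(text):
    """Lowercase the text and extract runs of Cyrillic letters."""
    return re.findall(r"[а-яё]+", text.lower())


def normalize(token):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then normalize the remaining tokens."""
    return [normalize(t) for t in tokenize(text) if t not in STOPWORDS]


def train(docs, labels):
    """Collect per-class document and token counts for Naive Bayes."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(preprocess(doc))
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the class with the highest Laplace-smoothed log-probability."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    tokens = [t for t in preprocess(text) if t in vocab]
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for t in tokens:
            score += math.log((word_counts[label][t] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, docs, labels):
    """Share of documents whose predicted class matches the true class."""
    hits = sum(predict(model, d) == y for d, y in zip(docs, labels))
    return hits / len(labels)


# Toy corpus (invented for this sketch, not the article's news corpus).
train_docs = [
    "Сборная выиграла матч по футболу",
    "Хоккеисты забили три шайбы в матче",
    "Банк повысил ставки по кредитам",
    "Курс рубля и акции банков выросли",
]
train_labels = ["sport", "sport", "economy", "economy"]
model = train(train_docs, train_labels)

test_docs = ["Футболисты выиграли матч", "Акции банка выросли"]
test_labels = ["sport", "economy"]
print(preprocess(train_docs[3]))  # ['курс', 'рубля', 'акци', 'банк', 'выросл']
print(accuracy(model, test_docs, test_labels))  # 1.0
```

The suffix stripping is deliberately naive: it leaves "рубля" untouched and reduces "акции" to "акци", which shows why the choice of normalization method matters for an inflected language such as Russian. A morphological analyzer or a proper stemmer would replace `normalize` in practice.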

Short URL: https://sciup.org/148323530

IDR: 148323530   |   DOI: 10.18137/RNU.V9187.21.04.P.175

Research article