Information retrieval and analysis for a modern organization

Free access

With the growing volume of and demand for data, a major concern for an organization is to discover what data it actually holds, what that data contains, and how and by whom it is being used. The volume of data and the number and complexity of the disparate systems that handle it increase every year, and unifying these systems becomes ever harder. In this work we describe an intelligent search engine system designed specifically to tackle the problem of information retrieval and sharing in a large, multifaceted organization that already has many systems in place for each department. The search engine is an integral part of a joint Operational Data Platform (ODP) for data exploration and processing.

Data-driven projects, information retrieval, streaming processing, Mesos, Kafka

Short URL: https://sciup.org/14916377

IDR: 14916377   |   DOI: 10.15514/ISPRAS-2016-28(4)-1

Research article