Data Mining and Deployment of the Multilingual Iranian Cultural Thesaurus (ASFA) Dataset in the CRISP Framework

Akbari Daryan, Saeedeh

doi:10.30484/nastinfo.2023.3405.2209

Article type: Research article

Author

Saeedeh Akbari Daryan

Assistant Professor, National Library and Archives of Iran

Abstract

Purpose: The Simple Knowledge Organization System (SKOS) is a common data model for sharing and linking knowledge organization systems via the web. SKOS provides a standard, low-cost migration path for moving existing knowledge organization systems to the Semantic Web. Bringing ASFA into the Semantic Web requires converting and deploying the ASFA dataset as an RDF graph based on SKOS. To this end, the records based on Iran MARC must be re-engineered. The aim of the present study is to re-engineer the ASFA dataset by data mining it within the CRISP framework and deploying it on the Skosmos platform.
Method: This is a developmental-applied study that uses the CRISP-DM methodology, with unsupervised learning and hierarchical clustering, for data mining. In the first stage, business understanding, the main goal was set as converting the ASFA dataset to the SKOS data model in the form of an RDF graph. In the data understanding stage, the ASFA legacy data were found to comprise 11,006 records stored in Iran MARC format and covering 18 domains: education, literature, communications, economics, history, Sufism and mysticism, sociology, geography, law, psychology, linguistics, religion, political science, philosophy, technology and experimental sciences, librarianship and information science, management, and culture and art. In the third stage, data preparation, missing and outlier data were identified and corrected. For feature selection in the data-engineering preprocessing layer, the elements required for conversion to SKOS were identified and a table mapping them to Iran MARC fields was compiled. In the modeling stage, the values of the target feature were generated with hierarchical clustering using macro code in Excel. The model was evaluated and confirmed through visual inspection and random sampling. In the sixth stage, the Iran MARC data were converted to SKOS as an RDF graph using the SkosPlay tool and transferred to the VocBench platform. Using the Turtle format, the ASFA dataset was then deployed on the Skosmos platform.
Findings: The main finding of the study is the deployment and development of the ASFA SKOS dataset on the open-source Skosmos platform at skosmos.nlai.ir. After records for domains and collections were created for clustering, the total number of records rose to 11,880. An important finding of the data preparation stage was the compilation of the mapping table between SKOS core elements and Iran MARC fields.
Conclusion: Drawing on data science, this study applied an innovative method for data mining thesaurus datasets. The methodologies reported in the literature reviewed for this study fit into only two of the six stages used here: data preparation, and deployment and development.
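The mapping-table step described in the Method can be illustrated with a minimal Python sketch that turns field-tagged records into SKOS concepts serialized as Turtle. This is not the article's mapping table or the SkosPlay conversion: the field names (`preferred`, `non_preferred`, `scope_note`, `broader`, `narrower`, `related`), the record layout, and the `http://example.org/asfa/` namespace are hypothetical placeholders chosen for illustration. Only the SKOS property names and the SKOS namespace are taken from the standard.

```python
# Hypothetical field names: the article's real Iran MARC-to-SKOS mapping
# table is not reproduced here, so these keys are placeholders.
FIELD_TO_SKOS = {
    "preferred": "skos:prefLabel",
    "non_preferred": "skos:altLabel",
    "scope_note": "skos:scopeNote",
}
LINK_TO_SKOS = {
    "broader": "skos:broader",
    "narrower": "skos:narrower",
    "related": "skos:related",
}
BASE = "http://example.org/asfa/"  # placeholder namespace, not the real one


def escape_literal(text):
    """Escape characters that are special inside a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def record_to_turtle(record):
    """Serialize one thesaurus record (a dict) as a SKOS concept in Turtle."""
    pairs = [("a", "skos:Concept")]
    # Label-like fields hold (text, language tag) tuples.
    for field, prop in FIELD_TO_SKOS.items():
        for value, lang in record.get(field, []):
            pairs.append((prop, f'"{escape_literal(value)}"@{lang}'))
    # Link fields hold the ids of other records.
    for field, prop in LINK_TO_SKOS.items():
        for target_id in record.get(field, []):
            pairs.append((prop, f"<{BASE}{target_id}>"))
    body = " ;\n    ".join(f"{p} {o}" for p, o in pairs)
    return f"<{BASE}{record['id']}>\n    {body} ."


def records_to_turtle(records):
    """Join a prefix header and all serialized records into one Turtle document."""
    header = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."
    return "\n\n".join([header] + [record_to_turtle(r) for r in records]) + "\n"


# Two made-up records, used only to exercise the functions.
sample = [
    {"id": "c1", "preferred": [("Literature", "en"), ("ادبیات", "fa")], "narrower": ["c2"]},
    {"id": "c2", "preferred": [("Poetry", "en")], "broader": ["c1"]},
]
turtle_text = records_to_turtle(sample)
print(turtle_text)
```

Keeping the mapping in two small dictionaries mirrors the idea of a mapping table: changing how a source field is expressed in SKOS means editing one entry rather than the serialization logic.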

Article Title [English]

Data Mining and Deployment of Multilingual Iranian Cultural Thesaurus (ASFA) Dataset in the CRISP Framework

Author [English]

Saeedeh Akbari Daryan

Assistant Professor, National Library and Archives of Iran

Abstract [English]

Purpose: The Simple Knowledge Organization System (SKOS) is a widely used data model for sharing and linking knowledge organization systems on the web. It offers a cost-effective way to migrate existing knowledge organization systems to the Semantic Web. To integrate ASFA into the Semantic Web, the ASFA dataset needs to be converted and deployed as an RDF graph based on SKOS. To achieve this, the records in ASFA's Iran MARC format must be re-engineered. This study aims to re-engineer the ASFA dataset using data mining in the CRISP framework and deploy it on the open-source platform Skosmos.
Method: This developmental-applied study employed the CRISP-DM methodology with unsupervised hierarchical clustering for data mining. In the first stage, business understanding, the main goal was defined as converting the ASFA dataset into the SKOS data model as an RDF graph. In the data understanding stage, ASFA's legacy data were found to comprise 11,006 records stored in Iran MARC format across 18 domains: education, literature, communications, economics, history, Sufism and mysticism, sociology, geography, law, psychology, linguistics, religion, political science, philosophy, technology and experimental sciences, librarianship and information science, management, and culture and art. In the third stage, data preparation, missing and outlier data were identified and corrected, the elements required for conversion to SKOS were identified, and a table mapping them to Iran MARC fields was compiled. In the modeling stage, target feature values were generated with the hierarchical clustering technique using macro code in Excel. The model was evaluated by visual inspection and random sampling. In the sixth stage, the Iran MARC data were converted to SKOS as an RDF graph using the SkosPlay tool and transferred to the VocBench platform. The ASFA dataset was then deployed on the Skosmos platform using the Turtle format.
Findings: The main finding of this study is the deployment and development of the ASFA dataset, based on SKOS/RDF, on the open-source platform Skosmos at skosmos.nlai.ir. The total number of records increased to 11,880 after domain and collection records were created for clustering. One of the important findings of the data preparation stage was the compilation of the mapping table between SKOS core elements and Iran MARC fields.
Conclusion: By integrating stages of methodologies used in the literature review within the CRISP framework, an innovative method was developed for converting thesauri into a lightweight ontology based on SKOS/RDF graph format.
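The abstract does not describe the logic of the Excel macro used in the modeling stage, so the following Python sketch shows only one simple way a cluster label could be derived from thesaurus structure. Each record is grouped under the top term reached by following its broader-term links, and a reproducible random sample is then drawn for manual checking. This is hierarchy-based grouping, not agglomerative clustering in the statistical sense, and it is not the author's procedure. The single-parent `broader` dictionary, the sample terms, and all function names are assumptions made for illustration.

```python
import random


def top_ancestor(concept_id, broader):
    """Follow broader-term links upward until a record with no parent is reached.

    Assumes each record has at most one broader term (hypothetical simplification).
    The `seen` set stops the walk if the links accidentally form a cycle.
    """
    seen = set()
    current = concept_id
    while current in broader and current not in seen:
        seen.add(current)
        current = broader[current]
    return current


def assign_clusters(broader, all_ids):
    """Group concept ids by the top-level record their broader chain ends at."""
    clusters = {}
    for cid in all_ids:
        clusters.setdefault(top_ancestor(cid, broader), []).append(cid)
    return clusters


def sample_for_review(clusters, k, seed=0):
    """Draw a reproducible random sample of (concept, cluster) pairs for manual checking."""
    pairs = [(cid, top) for top, members in clusters.items() for cid in members]
    rng = random.Random(seed)
    return rng.sample(pairs, min(k, len(pairs)))


# Made-up terms, used only to exercise the functions.
broader = {"poetry": "literature", "ghazal": "poetry", "inflation": "economics"}
all_ids = ["literature", "poetry", "ghazal", "economics", "inflation"]

clusters = assign_clusters(broader, all_ids)
review = sample_for_review(clusters, 3)
print(clusters)
print(review)
```

A fixed seed keeps the review sample stable between runs, which makes a manual spot check repeatable by a second reviewer.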

Keywords [English]

Data Mining
SKOS
Iran MARC
RDF Graph
Reengineering
Skosmos
ASFA Thesaurus
