AgreementMaker: Efficient Matching for Large Real-World Schemas and Ontologies (Translation)


Before the Main Text

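This paper is a previous work cited by a previous work of the framework-based ontology matching paper I read a few days ago. It may be a bit dated, but it is still a useful reference, so I downloaded the source code and plan to read the corresponding paper alongside it. I prefer reading Chinese and English side by side: I skim the less important parts in Chinese and go back to the original at the key points. Hence all these translated versions.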

Citation: Cruz I F, Antonelli F P, Stroe C. AgreementMaker: efficient matching for large real-world schemas and ontologies[J]. Proceedings of the VLDB Endowment, 2009, 2(2): 1586-1589.


Main Text

Abstract

摘要

We present the AgreementMaker system for matching real world schemas and ontologies, which may consist of hundreds or even thousands of concepts. The end users of the system are sophisticated domain experts whose needs have driven the design and implementation of the system: they require a responsive, powerful, and extensible framework to perform, evaluate, and compare matching methods. The system comprises a wide range of matching methods addressing different levels of granularity of the components being matched (conceptual vs. structural), the amount of user intervention that they require (manual vs. automatic), their usage (stand-alone vs. composed), and the types of components to consider (schema only or schema and instances). Performance measurements (recall, precision, and runtime) are supported by the system, along with the weighted combination of the results provided by those methods. The AgreementMaker has been used and tested in practical applications and in the Ontology Alignment Evaluation Initiative (OAEI) competition. We report here on some of its most advanced features, including its extensible architecture that facilitates the integration and performance tuning of a variety of matching methods, its capability to evaluate, compare, and combine matching results, and its user interface with a control panel that drives all the matching methods and evaluation strategies.

我们提出了AgreementMaker系统,用于匹配真实世界的模式和本体,这些模式和本体可能包含数百甚至数千个概念。系统的最终用户是资深的领域专家,他们的需求推动了系统的设计和实现:他们需要一个响应迅速,功能强大且可扩展的框架来执行,评估和比较匹配方法。该系统包含多种匹配方法,涵盖被匹配组件的不同粒度级别(概念与结构),所需的用户干预量(手动与自动),使用方式(独立与组合),以及要考虑的组件类型(仅模式,或模式加实例)。系统支持性能测量(召回率,精确率和运行时间),以及这些方法所得结果的加权组合。AgreementMaker已在实际应用和Ontology Alignment Evaluation Initiative(OAEI)竞赛中使用和测试。我们在此报告其一些最先进的功能,包括便于集成各种匹配方法并对其进行性能调优的可扩展体系结构,评估,比较和组合匹配结果的能力,以及带有控制面板的用户界面,该控制面板驱动所有匹配方法和评估策略。

1. Introduction

1. 介绍

The issue of schema matching in databases [11], which has been investigated since the early 80’s, is fundamental to data integration, as is the closely-related issue of ontology alignment or matching [12]. The matching problem consists of defining mappings among schema or ontology elements that are semantically related. Such mappings are typically defined between two schemas or two ontologies at a time, one being called the source and the other being called the target.

自80年代早期以来一直在研究的数据库[11]中的模式匹配问题是数据集成的基础,与本体对齐或匹配密切相关的问题也是如此[12]。匹配问题包括定义在语义上相关的 模式或本体元素之间 的映射。这种映射通常在两个模式或两个本体之间定义,一个被称为源本体,另一个被称为目标本体。

We have been developing the AgreementMaker matching system, whose name takes after agreement, the encoding of a mapping. The capabilities of our system have been driven by the real-world problems of end users who are sophisticated domain experts. We have considered a variety of domains and applications, including: geospatial [2], environmental [4], and biomedical [13]. The conceptual information for these applications is stored in the form of ontologies. However, as demonstrated by others, the same approach can be used for schema matching [1, 10]. To validate our approach, we competed against seven other systems in the biomedical track of the 2007 Ontology Alignment Evaluation Initiative (OAEI), to match ontologies describing the mouse adult anatomy of the Mouse Gene Expression Database Project (2744 classes) and the human anatomy of the National Cancer Institute (3304 classes). We came in third in terms of accuracy (F-measure) [5].

我们一直在开发AgreementMaker匹配系统,其名称源自agreement(即映射的编码)。我们系统的功能由最终用户的现实问题驱动,这些最终用户是资深的领域专家。我们已经考虑了各种领域和应用,包括:地理空间[2],环境[4]和生物医学[13]。这些应用的概念信息以本体的形式存储。但是,正如其他人所证明的那样,相同的方法可以用于模式匹配[1,10]。为了验证我们的方法,我们在2007年本体对齐评估倡议(OAEI)的生物医学赛道中与其他七个系统竞争,匹配两个本体:描述小鼠基因表达数据库项目中成年小鼠解剖结构的本体(2744个类)和描述美国国家癌症研究所人体解剖结构的本体(3304个类)。我们在准确性(F-measure)方面排名第三[5]。

The AgreementMaker, which is currently in its third version, has been evolving to accommodate: (1) user requirements, as expressed by domain experts; (2) a wide range of input (ontology) and output (agreement file) formats; (3) a large choice of matching methods depending on the different granularity of the set of components being matched (local vs. global), on different features considered in the comparison (conceptual vs. structural), on the amount of intervention that they require from users (manual vs. automatic), on usage (stand-alone vs. composed), and on the types of components to consider (schema only or schema and instances); (4) improved performance, that is, accuracy (precision, recall, F-measure) and efficiency (execution time) for the automatic methods; (5) an extensible architecture to incorporate new methods easily and to tune their performance; (6) the capability to evaluate, compare, and combine different strategies and matching results; (7) a comprehensive user interface supporting both advanced visualization techniques and a control panel that drives all the matching methods and evaluation strategies.

目前处于第三版的AgreementMaker正在不断发展以适应:(1)领域专家表达的用户需求; (2)广泛的输入(本体)和输出(协议文件)格式; (3)根据不同粒度的组件集的匹配选项(本地与全局),在比较中考虑的不同特征(概念与结构),他们需要的来自用户的干预量(手动与自动),使用(独立与组合),以及要考虑的组件类型(仅架构或架构和实例); (4)改进性能,即自动方法的准确度(精确度,召回率,F测量值)和效率(执行时间); (5)可扩展的架构,可以轻松地整合新方法并调整其性能; (6)评估,比较和组合不同策略和匹配结果的能力; (7)全面的用户界面,支持高级可视化技术和控制面板,驱动所有匹配方法和评估策略。

In this demo paper, we focus on the most recent developments of the system, which has been almost completely redesigned in the last year. In particular, we describe: (1) the user interface with particular emphasis on the control panel and improved visualization and interaction capabilities; (2) the automatic matching methods and execution capabilities; and (3) the evaluation strategies for determining the efficiency of the matching methods and for performing the combination of results.

在本演示文章中,我们将重点介绍该系统的最新发展,该系统在去年几乎完全重新设计。特别是,我们描述:(1)用户界面,特别强调控制面板和改进的可视化和交互功能; (2)自动匹配方法和执行能力; (3)用于确定匹配方法的效率和执行结果组合的评估策略。

2. RELATED WORK

2.相关工作

There are several notable systems related to ours, including Clio [6], COMA++ [1], Falcon-AO [7], and RiMOM [14] (just to mention a few). Clio stands apart because of its single focus on database-specific constraints and operators (e.g., foreign keys, joins) to infer the mappings whereas constraints in ontologies (as implemented in the other three systems and in AgreementMaker) are of a different nature [12]. This different emphasis also permeates the remaining components of the various systems, as those that also support ontology matching implement a rich tool box of string-similarity and structural-based techniques and focus on performance. Consequently, some of these systems do not focus on user interaction: for example, Falcon-AO and RiMOM provide simple interfaces that offer limited user interaction (e.g., no manual manipulation of the ontologies). However, what separates AgreementMaker from these other systems (including from COMA++, which has a more sophisticated user interface than the other two) is the degree to which it integrates the evaluation of the quality of the obtained mappings with the graphical user interface and therefore with the iterative matching process. This tight integration emerged from our work with domain experts, who required that the evaluation be an integral part of the matching process, not an “add on” capability.

有几个与我们相关的著名系统,包括Clio [6],COMA++ [1],Falcon-AO [7]和RiMOM [14](仅举几例)。Clio之所以与众不同,是因为它只关注利用数据库特有的约束和运算符(例如,外键,连接)来推断映射,而本体中的约束(在其他三个系统和AgreementMaker中实现)具有不同的性质[12]。这种侧重点的差异也渗透到各系统的其余组件中,因为那些同时支持本体匹配的系统实现了丰富的基于字符串相似性和基于结构的技术工具箱,并专注于性能。因此,这些系统中的一些并不注重用户交互:例如,Falcon-AO和RiMOM只提供简单的界面,用户交互能力有限(例如,不能手动操作本体)。然而,将AgreementMaker与其他系统(包括COMA++,其用户界面比另外两个系统更复杂)区别开来的,是它把对所得映射的质量评估与图形用户界面,进而与迭代的匹配过程集成在一起的程度。这种紧密集成源于我们与领域专家的合作,他们要求评估是匹配过程中不可或缺的一部分,而不是“附加”功能。

3. ARCHITECTURE

3.架构

The AgreementMaker supports a wide variety of methods or matchers. Our architecture (see Figure 1) allows for serial and parallel composition where, respectively, the output of one or more methods can be used as input to another one, or several methods can be used on the same input and then combined. A set of mappings may therefore be the result of a sequence of steps, called layers.

AgreementMaker支持各种方法或匹配器。我们的体系结构(参见图1)允许串行和并行组合,其中一个或多个方法的输出可以分别用作另一个方法的输入,或者可以在同一输入上使用多个方法然后组合。因此,一组映射可能是一系列步骤的结果,称为层。

The matching process of a generic matcher (see Figure 2) can be divided into two main modules: (1) similarity computation in which each concept of the source ontology is compared with all the concepts of the target ontology, thus producing two similarity matrices (one for classes and the other one for properties), which contain a value for each pair of concepts; (2) mappings selection in which the matrix is scanned to select only the best mappings according to a given threshold and to the cardinality of the correspondences, for example, 1-1, 1-N, N-1, M-N.

通用匹配器的匹配过程(见图2)可以分为两个主要模块:(1)相似度计算,其中源本体的每个概念与目标本体的所有概念进行比较,从而产生两个相似性矩阵(一个用于类,另一个用于属性),其中包含每对概念的值; (2)映射选择,扫描矩阵以根据给定阈值和对应关系的基数仅选择最佳映射,例如1-1,1-N,N-1,M-N
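The two modules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not AgreementMaker's code: the function names `similarity_matrix` and `select_mappings` are hypothetical, `difflib.SequenceMatcher` stands in for whatever string comparison a real matcher would use, only one matrix is built (the paper keeps one for classes and one for properties), and the 1-1 case is handled greedily rather than optimally.

```python
from difflib import SequenceMatcher


def similarity_matrix(source, target):
    """Module 1: compare every source concept with every target concept."""
    return [[SequenceMatcher(None, s.lower(), t.lower()).ratio() for t in target]
            for s in source]


def select_mappings(matrix, threshold, one_to_one=True):
    """Module 2: keep pairs scoring at least `threshold`.

    With one_to_one=True, a greedy pass keeps the best-scoring pair first and
    then skips any pair that reuses a source or target concept (1-1 cardinality).
    With one_to_one=False, every pair above the threshold is kept (M-N).
    """
    candidates = [(score, i, j)
                  for i, row in enumerate(matrix)
                  for j, score in enumerate(row)
                  if score >= threshold]
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    if not one_to_one:
        return [(i, j, score) for score, i, j in candidates]
    used_src, used_tgt, selected = set(), set(), []
    for score, i, j in candidates:
        if i not in used_src and j not in used_tgt:
            selected.append((i, j, score))
            used_src.add(i)
            used_tgt.add(j)
    return selected


source = ["Heart", "Left Lung", "Brain"]
target = ["brain", "heart", "lung"]
matrix = similarity_matrix(source, target)
mappings = select_mappings(matrix, threshold=0.6)
for i, j, score in mappings:
    print(f"{source[i]} -> {target[j]} ({score:.2f})")
```

Raising the threshold trades recall for precision: at 0.7 the "Left Lung" to "lung" pair would be dropped, while the two exact label matches would remain.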

To enable extensibility, we adopted the object-oriented template pattern by defining the skeleton of the matching process in a generic matcher, which defers only a few operations to the concrete matcher extensions (see Figure 3). This abstraction minimizes development effort by completely decoupling the structure of a single method from the architecture of the whole system, thus allowing reuse or any possible composition of matching modules.

为了实现可扩展性,我们采用了面向对象的模板方法模式:在通用匹配器中定义匹配过程的骨架,只把少数操作推迟到具体的匹配器扩展中去实现(参见图3)。这种抽象将单个方法的结构与整个系统的体系结构完全解耦,从而将开发工作量降到最低,并允许重用匹配模块或对其进行任意组合。
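For readers unfamiliar with the template (method) pattern, here is a minimal Python sketch of the idea. The class and method names (`AbstractMatcher`, `match`, `similarity`, `select`) are hypothetical and chosen for illustration; the paper does not list AgreementMaker's actual class names, so this should not be read as its API.

```python
from abc import ABC, abstractmethod
from difflib import SequenceMatcher


class AbstractMatcher(ABC):
    """Generic matcher: fixes the skeleton of the matching process."""

    def __init__(self, threshold=0.6):
        self.threshold = threshold

    def match(self, source, target):
        # Template method: the steps and their order are fixed here.
        matrix = [[self.similarity(s, t) for t in target] for s in source]
        return self.select(matrix)

    @abstractmethod
    def similarity(self, s, t):
        """Hook deferred to concrete matchers: score one pair in [0, 1]."""

    def select(self, matrix):
        # Default selection: every pair at or above the threshold.
        return [(i, j) for i, row in enumerate(matrix)
                for j, score in enumerate(row) if score >= self.threshold]


class ExactLabelMatcher(AbstractMatcher):
    """Concrete matcher: labels must be equal, ignoring case."""

    def similarity(self, s, t):
        return 1.0 if s.lower() == t.lower() else 0.0


class FuzzyLabelMatcher(AbstractMatcher):
    """Concrete matcher: character-level similarity from difflib."""

    def similarity(self, s, t):
        return SequenceMatcher(None, s.lower(), t.lower()).ratio()


source, target = ["Heart", "Left Lung"], ["heart", "lung"]
print(ExactLabelMatcher().match(source, target))
print(FuzzyLabelMatcher().match(source, target))
```

Each new matcher only overrides `similarity` (and optionally `select`); the surrounding loop, thresholding, and result format are inherited, which is the decoupling the paragraph describes.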

A first layer matcher produces the similarity matrices, while the second and third layer matchers extend the first layer matchers. In particular, a second layer matcher improves on the results of a first layer matcher using conceptual or structural information, depending on whether it considers one concept alone or a concept and its neighbors. Finally, a third layer matcher combines the results of two or more matchers from the previous layers, in order to obtain a final matching or alignment, that is, a set of mappings.

第一层匹配器产生相似性矩阵,而第二和第三层匹配器扩展第一层匹配器。特别地,第二层匹配器使用概念或结构信息改进第一层匹配器的结果,这取决于它是单独考虑一个概念还是概念及其邻居。最后,第三层匹配器组合来自先前层的两个或更多个匹配器的结果,以便获得最终匹配或对齐,即一组映射。

4. USER INTERFACE

4.用户界面

The source and target ontologies (in XML, RDFS, OWL, or N3) are visualized side by side using the familiar outline tree paradigm (see Figure 4). Agreements can be exported in different formats (e.g., XML, Excel). Because all the matching operations and their results are managed by this interface, we gave special consideration to its design [4]. We describe next two new features of the interface: the control panel and the visualization of non-hierarchical ontologies (e.g., due to multiple inheritance in OWL). The latter feature allows for specific subtrees to be visually duplicated. Because we adopt the Model-View-Control pattern, this duplication does not affect the underlying data structures. The control panel (see Figure 5) allows users to run and manage matching methods and their results. Users can select parameters common to all methods (such as threshold and cardinality) and method-specific parameters. When a method has run, a new row is dynamically added to the table that is part of the control panel at the same time that lines depicting the mappings between the concepts are added (see Figure 4). Each row is color coded and allows for its selection so that the corresponding mappings (of the same color) can be compared visually. Each row also displays the performance values for the associated methods, thus allowing for the comparison with those of other rows. In addition, users can modify at runtime the method parameters by changing directly their values in the table or by selecting previously calculated matchings as input to the methods to be applied next. Multiple matchings can also be combined manually or with an automatic combination matcher.

源本体和目标本体(XML,RDFS,OWL或N3格式)使用常见的大纲树形式并排显示(参见图4)。匹配结果(agreement)可以以不同的格式导出(例如,XML,Excel)。由于所有匹配操作及其结果均由此界面管理,因此我们特别考虑了其设计[4]。接下来我们介绍界面的两个新功能:控制面板和非层次结构本体的可视化(例如,由OWL中的多重继承导致)。后一功能允许在视觉上复制特定的子树。因为我们采用模型-视图-控制模式,所以这种复制不会影响底层数据结构。控制面板(参见图5)允许用户运行和管理匹配方法及其结果。用户可以选择所有方法共有的参数(例如阈值和基数)和特定于方法的参数。当一个方法运行完成后,一个新行被动态地添加到控制面板中的表格里,同时界面上会添加表示概念之间映射的连线(参见图4)。每行都用颜色编码并且可以被选中,以便在视觉上比较相应的映射(相同颜色)。每行还显示相关方法的性能值,从而可以与其他行的性能值进行比较。此外,用户可以在运行时修改方法参数:直接更改表中的值,或者选择先前计算的匹配结果作为接下来要应用的方法的输入。多个匹配结果也可以手动组合,或者用自动组合匹配器组合。

5. MATCHING METHODS

5.匹配方法

First layer matchers compare concept features (e.g., label, comments, annotations, and instances) and use a variety of methods including syntactic and lexical comparison algorithms as well as the use of a lexicon like WordNet. Of those methods, some were proposed by others (e.g., edit distance, Jaro-Winkler) and some devised by us, including a substring-based comparison that favors the length of the common substrings and a concept document-based comparison containing a wide range of features. Those features are represented as TF-IDF vectors and use a cosine similarity metric (see Figure 6).

第一层匹配器比较概念特征(例如,标签,注释(comments),标注(annotations)和实例)并使用各种方法,包括句法和词汇比较算法以及WordNet等词典的使用。其中一些方法是由其他人提出的(例如,编辑距离,Jaro-Winkler),另一些是我们设计的,包括一种基于子串的比较(偏向于较长的公共子串)和一种基于概念文档的比较(概念文档包含多种特征)。这些特征表示为TF-IDF向量并使用余弦相似性度量(参见图6)。
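The concept-document comparison can be illustrated with a small pure-Python TF-IDF and cosine similarity sketch. The assumptions are mine, not the paper's: each "concept document" is simply a list of tokens gathered from a concept's label and comments, the IDF uses a common smoothed variant, and the names `tfidf_vectors` and `cosine` are hypothetical. The paper does not state which weighting variant AgreementMaker uses.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Turn tokenized documents into sparse TF-IDF vectors (term -> weight)."""
    n = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            # Relative term frequency times a smoothed inverse document frequency.
            term: (count / len(doc)) * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical concept documents built from labels and comments.
docs = [
    "heart muscular organ pumps blood".split(),
    "heart organ that pumps blood".split(),
    "lung organ of respiration".split(),
]
vecs = tfidf_vectors(docs)
print(round(cosine(vecs[0], vecs[1]), 3), round(cosine(vecs[0], vecs[2]), 3))
```

The two heart descriptions share several terms and so score much higher than the heart and lung pair, which only share the common word "organ".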

Second layer matchers use structural properties of the ontologies. Our own methods include the Descendant’s Similarity Inheritance (DSI) and the Sibling’s Similarity Contribution (SSC) matchers [3].

第二层匹配器使用本体的结构属性。我们自己的方法包括后代的相似性遗传(DSI)和兄弟姐妹的相似性贡献(SSC)匹配[3]。
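The paper does not spell out how DSI or SSC work (see [3] for that), so the following Python sketch is explicitly not either algorithm. It is a simplified stand-in showing the general idea of a second layer matcher: take the similarities from a first layer matcher and adjust each pair using structural context, here the similarity of the two concepts' parents. The function name `refine_with_parents` and the `weight` parameter are hypothetical.

```python
def refine_with_parents(base, source_parent, target_parent, weight=0.75):
    """Blend each pair's own similarity with the similarity of their parents.

    base: dict mapping (source_concept, target_concept) -> similarity in [0, 1]
    source_parent / target_parent: dict mapping concept -> parent (roots absent)
    weight: share kept by the pair's own similarity (hypothetical parameter)
    """
    refined = {}
    for (s, t), score in base.items():
        ps, pt = source_parent.get(s), target_parent.get(t)
        if ps is None or pt is None:
            refined[(s, t)] = score  # no structural evidence available
        else:
            parent_score = base.get((ps, pt), 0.0)
            refined[(s, t)] = weight * score + (1 - weight) * parent_score
    return refined


# First-layer scores: the label "Vein" matches two target concepts equally well.
base = {
    ("Circulatory System", "circulatory system"): 1.0,
    ("Circulatory System", "leaf"): 0.1,
    ("Vein", "anatomy:vein"): 0.8,
    ("Vein", "plant:vein"): 0.8,
}
source_parent = {"Vein": "Circulatory System"}
target_parent = {"anatomy:vein": "circulatory system", "plant:vein": "leaf"}

refined = refine_with_parents(base, source_parent, target_parent)
for pair, score in refined.items():
    print(pair, round(score, 3))
```

After refinement the anatomical "vein" outranks the botanical one because its parent also matches, which is the kind of disambiguation structural matchers are meant to provide.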

Finally, third layer matchers combine the results of two or more matchers so as to obtain a unique final matching in two steps. In the first step, a similarity matrix is built for each pair of concepts, using our Linear Weighted Combination (LWC) matcher, which processes the weighted average for the different similarity results (see Figure 7). Weights can be assigned manually or automatically, the latter assignment being determined using our evaluation methods. The second step uses that similarity matrix and takes into account a threshold value and the desired cardinality. When the cardinality is 1-1, we adopt the Shortest Augmenting Path algorithm [9] to find the optimal solution for this optimization problem (namely the assignment problem reduced to the maximum weight matching in a bipartite graph) in polynomial time.

最后,第三层匹配器组合两个或更多匹配器的结果,分两步获得唯一的最终匹配。在第一步中,使用我们的线性加权组合(LWC)匹配器为每对概念建立相似性矩阵,该匹配器计算不同相似性结果的加权平均值(参见图7)。可以手动或自动分配权重,自动分配的权重由我们的评估方法确定。第二步使用该相似性矩阵并考虑阈值和期望的基数。当基数为1-1时,我们采用最短增广路径算法[9],在多项式时间内找到该优化问题的最优解(即分配问题,可归约为二分图中的最大权重匹配)。
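The two steps of a third layer matcher can be sketched as follows. The weighted average matches the description of LWC, but the 1-1 selection here uses exhaustive search over permutations as a stand-in: it finds the same optimum as an assignment algorithm on tiny inputs but is factorial-time, unlike the polynomial Shortest Augmenting Path algorithm the paper uses. Applying the threshold after the assignment is also my assumption, and both function names are hypothetical.

```python
from itertools import permutations


def linear_weighted_combination(matrices, weights):
    """Step 1: weighted average of same-shaped similarity matrices."""
    total = sum(weights)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(w * m[i][j] for m, w in zip(matrices, weights)) / total
             for j in range(cols)] for i in range(rows)]


def best_one_to_one(matrix, threshold):
    """Step 2: 1-1 assignment with maximum total similarity.

    Brute force over all assignments; assumes len(matrix) <= len(matrix[0])
    and is only viable for very small matrices.
    """
    rows, cols = len(matrix), len(matrix[0])
    best = max(permutations(range(cols), rows),
               key=lambda perm: sum(matrix[i][j] for i, j in enumerate(perm)))
    # Drop assigned pairs that do not reach the threshold.
    return [(i, j) for i, j in enumerate(best) if matrix[i][j] >= threshold]


lexical = [[0.9, 0.6],
           [0.6, 0.1]]
structural = [[0.9, 0.9],
              [0.9, 0.4]]
combined = linear_weighted_combination([lexical, structural], weights=[2, 1])
print(best_one_to_one(combined, threshold=0.5))
```

In this example a greedy pick of the single best cell (0, 0) would leave source concept 1 with only a weak partner, whereas the optimal assignment pairs 0 with 1 and 1 with 0 for a higher total similarity.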

6. EVALUATION

6.评估

The design of optimal methods to find correct and complete mappings between real-world ontologies is a hard task for several reasons. First of all, an algorithm may be effective for a given scenario, but not for others. Even within the same scenario, the use of different parameters can change significantly the outcome. Moreover, in interviewing domain experts in the geospatial domain, we discovered that they do not trust automatic methods unless quality metrics are associated with the matching results. These observations have motivated a variety of evaluation techniques, that determine runtime and accuracy (precision, recall, and F-measure).

由于几个原因,设计在现实世界本体之间找到正确和完整映射的最佳方法是一项艰巨的任务。首先,算法可能对给定场景有效,但对其他场景则无效。即使在相同的场景下,使用不同的参数也可以显著改变结果。此外,在访谈地理空间领域的领域专家时,我们发现他们不信任自动方法,除非匹配结果附带质量度量。这些观察结果促使我们采用了各种评估技术,用于测定运行时间和准确性(精确率,召回率和F-measure)。

The most effective evaluation technique compares the mappings found by the system between the two ontologies with a reference matching or “gold standard,” which is a set of correct and complete mappings as built by domain experts. When a reference matching is available, the AgreementMaker can determine the quality of the found matching analytically or visually. A reference matching can also be used to tune algorithms by using a feedback mechanism provided by a succession of runs.

最有效的评估技术将系统在两个本体之间发现的映射与参考匹配或“黄金标准”进行比较,后者是由领域专家构建的一组正确和完整的映射。当参考匹配可用时,AgreementMaker可以分析或直观地确定找到的匹配的质量。参考匹配也可以用于通过使用由一系列运行提供的反馈机制来调整算法。
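Comparing a found matching against a reference matching comes down to the standard precision, recall, and F-measure definitions. A minimal Python sketch, assuming mappings are represented as (source, target) pairs and using a hypothetical function name `evaluate`:

```python
def evaluate(found, reference):
    """Return (precision, recall, f_measure) of `found` against `reference`."""
    found, reference = set(found), set(reference)
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical gold standard with five mappings; the matcher found four,
# three of which are correct.
reference = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4"), ("a5", "b5")]
found = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b9")]

precision, recall, f_measure = evaluate(found, reference)
print(f"precision={precision:.2f} recall={recall:.2f} F={f_measure:.2f}")
```

Running the same evaluation over a succession of runs with different thresholds or parameters is one simple way to realize the feedback-based tuning mentioned above.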

When a gold standard is not available, “inherent” quality measures need to be considered. Quality measures can be defined at two levels as associated with the two main modules of a matcher (see Figure 2): similarity or selection level. We can consider local quality as associated with a correspondence at the similarity level (or mapping at the selection level) or global quality as associated with all the correspondences at the similarity level (or with all possible mappings at the selection level). We have incorporated in our system a global-selection quality measure proposed by others [8] and a local-similarity quality measure that we have devised. Experiments have shown that our quality measure is usually effective in defining weights for the LWC matcher.

如果没有黄金标准,则需要考虑“固有的”质量度量。质量度量可以在两个级别上定义,与匹配器的两个主要模块相对应(参见图2):相似性级别或选择级别。我们可以考虑局部质量,它与相似性级别上的单个对应关系(或选择级别上的单个映射)相关联,也可以考虑全局质量,它与相似性级别上的所有对应关系(或选择级别上的所有可能映射)相关联。我们已经在系统中纳入了其他人提出的全局选择质量度量[8]以及我们自己设计的局部相似性质量度量。实验表明,我们的质量度量通常能有效地为LWC匹配器确定权重。

7. DEMONSTRATION

7.演示

Our demo focuses on the matching methods and evaluation strategies for determining the efficiency of ontology matching methods. Due to the tight integration of the evaluation strategies with the graphical user interface, a unique feature of our system, all the steps will be performed through the interface. Users will start by uploading their own ontologies, load our own, or download ontologies from the web, thus taking advantage of the several standard formats supported. Users can then explore the interface freely or follow a walk-through, consisting of browsing the ontologies, expanding and contracting nodes, and customizing the display. They have access to the information associated with each concept to be aligned, including descriptions, annotations, and (context) relations, and they can use them to visually detect mappings.

我们的演示侧重于确定本体匹配方法的效率的匹配方法和评估策略。由于评估策略与图形用户界面(我们系统的独特功能)的紧密集成,所有步骤都将通过界面执行。用户将首先上传他们自己的本体(加载我们提供的本体,或从网上下载的本体)从而利用支持的几种标准格式。然后,用户可以自由地浏览界面或按照演练进行浏览,包括浏览本体,扩展和收缩节点以及自定义显示。他们可以访问与要对齐的每个概念相关的信息,包括描述,注释和(上下文)关系,他们可以使用它们来直观地检测映射。

After the Main Text

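The first draft was produced by running CAJViewer's text recognition on the paper, cleaning the output with Python, translating it directly with Google's document translation, and then merging everything together. Readability is therefore probably low; I will correct it bit by bit once I have finished reading.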
