Schema Mapping Task
Assigned to: Julien Gaugaz
Due date: 2008/12/31

In Okkam, entities come from different sources and therefore have different schemas. That is, the same attribute can have different names, or one attribute of a schema may be represented using two attributes in another schema. We thus need some kind of schema mapping functionality.

This task aims at designing and implementing such a schema mapping functionality in Okkam.

Important

Main Resp.: Julien
Other People: Anyone?
DoW: task 3.5, Advanced Ontology-based methods for Entity matching
Deliverable: Design in D6.1. First minimal implementation maybe in D5.2, full implementation in D5.4
Milestone: MS5, MS11

Sub-tasks

Deadline Description Status
08.09.2008 Implement query expansion in query planning

Contributes to a Component

Contributes to the Demo

  • Dataset: IMDB-Amazon?
  • Scenario: Query IMDB data with Amazon entities.

Prerequisites Tasks

  • add

Requirements

  1. Given a structured matching query, we want the query attribute names to also match similar attributes in the repository.
  2. Querying should be fast: so, a priori, no lengthy computations at query time, and indexing if possible.
  3. A better schema match should yield a better entity match score.
  4. Similarly, the atf and adf functions used by the ranking function should consider the schema matches and their matching scores.
  5. Handle the case where a query attribute name is not present in the repository.

Granularity

Granularity refers to the level at which we define an attribute. E.g., is the attribute 'title' of a person the same as the attribute 'title' of a book? A priori we see four different granularity levels:

  1. Entity: attributes with identical name from two different entities are considered distinct, independently of the type/class/category of the entities considered.
  2. Category: attributes with identical name from entities of the same category are considered identical. Conversely, if the entities are of different categories, the attributes are considered distinct.
  3. Source: two attributes are identical only if
    • they have the same name AND
    • they have the same source.
    • The source of an attribute is the program or person which created (modified?) it.
  4. Universe: two attributes with identical name are always considered identical, independently of their entities or entity categories.
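The four levels above can be sketched as a function that computes the identity key under which two attributes are compared. This is a minimal illustration, not Okkam's actual data model; the parameter names (`entity_id`, `category`, `source`) are assumptions.

```python
# Hypothetical sketch: the key under which two attributes are considered
# identical, for each granularity level. Two attributes are "the same"
# exactly when their keys are equal.

def attribute_key(name, granularity, entity_id=None, category=None, source=None):
    if granularity == "entity":
        return (entity_id, name)   # same name, different entity -> distinct
    if granularity == "category":
        return (category, name)    # same name within a category -> identical
    if granularity == "source":
        return (source, name)      # same name AND same source -> identical
    if granularity == "universe":
        return (name,)             # same name -> always identical
    raise ValueError(f"unknown granularity: {granularity}")
```

For example, at the category level `attribute_key("title", "category", category="book")` and `attribute_key("title", "category", category="person")` differ, so the two 'title' attributes are kept distinct.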

Category Discovery

  • Automatic clustering of entities into categories
    • Soft clusters: some entities are better representative of the category than others. Also, an entity can belong to more than one cluster.
    • Hard clusters
  • Category schema matching

………………….

Query Expansion vs. Attribute Expansion

To actually perform the schema mappings, we consider two alternatives:

  1. Query Expansion: Each named attribute of a matching query is transformed into a disjunction of attribute name/value pairs, possibly with different boosting weights.
  2. Attribute Expansion: Each entity's attribute has a list of alternative names, with different 'representation' weights. Those 'representation' weights indicate how well the attribute (name AND value) represents the entity.
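The query expansion alternative can be sketched as follows. The mapping table and its boost weights are purely illustrative assumptions; in practice they would come from the schema matching step.

```python
# Minimal sketch of query expansion: one attribute/value pair is rewritten
# into a disjunction of alternative pairs, each with a boost weight.
# ALTERNATIVES is an assumed, hand-made mapping table for illustration.

ALTERNATIVES = {
    "last_name": [("last_name", 1.0), ("family_name", 0.9), ("name", 0.5)],
}

def expand_query(attr, value):
    """Return the boosted disjunction of (name, value, boost) triples."""
    alts = ALTERNATIVES.get(attr, [(attr, 1.0)])  # unknown attrs pass through
    return [(name, value, boost) for name, boost in alts]
```

A query `last_name:Smith` would thus be sent to the data store as the disjunction `last_name:Smith OR family_name:Smith^0.9 OR name:Smith^0.5`.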

Query expansion has the advantage that it doesn't require re-indexing when schema matching results change, at the cost of longer queries, and thus a longer response time from the data store. Conversely, attribute expansion allows a quicker response time to queries at the cost of a bigger data store, and also a more expensive process to integrate the results of changed schema mappings.

We have to mention here the possibility of taking a mediated schema approach. Its advantage over the approaches mentioned above is that it would limit the size of the storage and at the same time allow for a quick response time. It corresponds to the attribute expansion approach where alternative attributes are collapsed into one attribute with a unique name (the one of the mediated schema). There are several drawbacks to this approach, however:

  • Collapsing attributes using the mediated schema makes it difficult to revise schema matching decisions.
  • It doesn't allow taking the 'score' of the mapping into consideration, though we could possibly use the 'representation' weight of the attribute to express how well the original attribute matches the mediated one.
  • It adds one step of uncertainty, needing two mappings instead of one: query → mediated ← original instead of query → original.
  • Creation and management of the mediated schema: whereas it would be possible to do this in an ad-hoc and supervised manner for a relatively stable repository (as in the first phase of the repository population), it would necessitate automatic creation and management of the mediated schema in case the schemas evolve more quickly.

Ranking of Entities

We would like the ranking function used to retrieve entities to be aware of the schema matches. For example, we would like the inverse attribute frequency to be computed over all attributes that match. This is done transparently in the case of a mediated schema, but not for the other, more attractive alternatives.

Also, if we assort schema matchings with a confidence value, we want the ranking of entities to take those into account. For example, for a query 'last_name:Smith' we would expect an entity with 'family_name:Smith' to rank higher than another entity with 'name:Smith'.
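The effect of mapping confidence on ranking can be sketched as below. The confidence values are illustrative assumptions, not output of any actual matcher.

```python
# Sketch: scale an entity's base match score by the confidence of the
# schema mapping between the query attribute and the matched attribute.
# The confidence table is an assumption for illustration.

MATCH_CONFIDENCE = {
    ("last_name", "last_name"): 1.0,
    ("last_name", "family_name"): 0.9,
    ("last_name", "name"): 0.4,
}

def score_entity(query_attr, entity_attr, base_score):
    """Weight the base score by schema-match confidence (0 if no mapping)."""
    return base_score * MATCH_CONFIDENCE.get((query_attr, entity_attr), 0.0)
```

With these numbers, for the query 'last_name:Smith' an entity matched via 'family_name' scores 0.9 times its base score, and one matched via 'name' only 0.4 times, producing the expected ordering.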

Discovering Schema Matchings

………………….

Initial Approach

As a first step, we have to come up with a solution which is quickly implemented and yet performs reasonably well.

Granularity

Given the limited initial number of sources, we can reasonably assume that entity categories are coherent, i.e. that a category has the same name in all entity profiles. In other words, if a category “person” exists, then it will be referred to in the same manner in all entities of this category, and not, for example, as “people”. Given this, we define an attribute at the category level.

Query Expansion vs. Attribute Expansion

When the store is populated by entities coming from a few different sources, adding a source will more likely bring more schema alternatives, and thus new schema mappings will be required. However, after a while, when the number of sources is already large, we expect the number of new schema mappings necessary per unit of time to diminish. For this reason the initial approach to handle schema diversity in Okkam will be based on query expansion.

Ranking of Entities

Whereas query expansion will take care that the desired attribute names are matched, the ranking function should take care to compute the iaf (inverse attribute frequency) reflecting the detected attribute matches. Since the schema matching model in the early phase is hard clusters of attributes, it is sufficient to compute the iaf as if attributes of the same cluster were the same. This implies only a small change to the original ranking function, and should not prove prohibitively slower.
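The cluster-aware iaf described above can be sketched as follows, assuming a hard clustering given as a simple attribute-to-cluster map (the map below is a made-up example).

```python
import math

# Assumed hard-cluster model: each attribute name maps to a cluster label;
# attributes absent from the map form singleton clusters.
CLUSTER = {"last_name": "surname", "family_name": "surname"}

def iaf(attribute, entities):
    """iaf = log(N / df), where df counts entities carrying any attribute
    from the same cluster as `attribute` (entities = iterable of name sets)."""
    cluster = CLUSTER.get(attribute, attribute)
    df = sum(1 for attrs in entities
             if any(CLUSTER.get(a, a) == cluster for a in attrs))
    n = len(entities)
    return math.log(n / df) if df else float("inf")
```

For example, with three entities carrying 'last_name', 'family_name', and 'title' respectively, `iaf("last_name", ...)` uses a document frequency of 2, since 'last_name' and 'family_name' belong to the same cluster.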

Discovering Schema Matchings

Again, given the limited initial number of sources, and a relatively controlled population of the data store in an early phase, we perform ad-hoc schema matching in the initial approach. This will allow for a quick solution with reasonable performance. The result of schema matching in the early phase is a hard-cluster model of attributes.

okkam/task/schema_mapping.txt · Last modified: 2008/09/23 10:00 (external edit)